
Article

Feature engineering for detecting spammers on Twitter: Modelling and analysis

Journal of Information Science, 1–18
© The Author(s) 2017
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551516684296
jis.sagepub.com

Wafa Herzallah
Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Jordan

Hossam Faris
Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Jordan

Omar Adwan
Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Jordan

Abstract
Twitter is a social networking website that has gained a lot of popularity around the world in the last decade. This popularity made Twitter a common target for spammers and malicious users to spread unwanted advertisements, viruses and phishing attacks. In this article, we review the latest research works to determine the most effective features that were investigated for spam detection in the literature. These features are collected to build a comprehensive data set that can be used to develop more robust and accurate spammer detection models. The new data set is tested using popular classifiers (Naive Bayes, support vector machines, multilayer perceptron neural networks, Decision Trees, Random forests and k-Nearest Neighbour). The prediction performance of these classifiers is evaluated and compared based on different evaluation metrics. Moreover, a further analysis is carried out to identify the features that have a higher impact on the accuracy of spam detection. Three different techniques are used and compared for this analysis: change of mean square error (CoM), information gain (IG) and the Relief-F method. The top five features identified by each technique are used again to build the detection models. Experimental results show that most of the developed classifiers obtained high evaluation results based on the comprehensive data set constructed in this work. Experiments also reveal the important role of some features, such as the reputation of the account, average length of the tweet, average mentions per tweet, age of the account and the average time between posts, in the process of identifying spammers in the social network.

Keywords
Classifiers; detection; feature engineering; spam; spam features; spammers; Twitter

1. Introduction
Twitter is a social media platform that plays a dual role in social networking and microblogging. Users of this popular web application communicate with one another by posting short texts called tweets, and they acquire the latest information from other users' tweets by following them if they find their posts interesting. Twitter has become an important mechanism for users to keep up with friends as well as the latest popular topics, reaching more than 1 billion users, 255 million of whom are active monthly, sending out 500 million tweets per day and averaging 208 followers per user [1,2].
Mining Twitter has proved to be a very interesting area for research, as it is a rich source of up-to-date information, presenting current events and news actively occurring all over the world. Recent statistics have shown that mining Twitter involves dealing with a huge amount of data. Therefore, Twitter has become an exciting area for research projects related to machine learning and data mining.

Corresponding author:
Hossam Faris, Business Information Technology, King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan, 11942.
Email: hossam.faris@ju.edu.jo

Such huge amounts of generated data can be very helpful to companies, politicians and others when making important decisions [3]. One of the popular data mining applications is sentiment analysis. For example, Go et al. [4] proposed a method to automatically classify tweets as positive or negative according to a query entered by the user. Such methods are very important for consumers who want to research the sentiment of a product before making a purchase. De Choudhury et al. [5] took the initial steps towards building an automatic classifier for user types on Twitter, focusing on three user categories: organisations, journalists/media bloggers and ordinary individuals. Such methods are used in applications that discover expert users for a target topic or in user recommendations that suggest new users to follow if they are interested in the same topic.
Another application that is becoming more important in halting the use of spam accounts is the automatic classification of spammers on Twitter. Spammers on Twitter are driven by several goals, such as spreading advertisements, generating sales, distributing viruses and phishing attacks, or simply compromising the system's reputation. Spam tweets can interfere with statistics presented by Twitter mining tools and waste users' attention. As a result, identifying spammers on Twitter has become a very interesting area of research. Since Twitter is famous and popular throughout the world, it continues to be an attractive target for spammers and malicious users.
Spam can be defined as the unwanted content that appears on online social networking sites, while spammers are the
users who post this content on social networking sites [6]. Spammers belong to one of the following categories as
described in the work by Lee et al. [7]:

• Phishers: users who act as normal users and try to acquire personal data of other users.
• Fake users: users who pretend to be another user, following the user’s friends and sending them spam content.
• Promoters: abusive users who send malicious links of advertisements to steal personal information or harm oth-
ers’ devices or software.

Spam profiles of varying nature are added to the web on a daily basis, making it difficult for detection models to keep up with the fast-growing and dynamic nature of the web [8]. As per the Twitter policy, there are many indicators that would lead to an account being considered a spam profile, for example, if the profile sends large amounts of duplicate mentions, if its posts consist mainly of links rather than personal updates, if a large number of users are blocking it or if it posts duplicate content over multiple accounts [9]. Twitter has added a feature to its application that allows users to report such spammers by posting a tweet in the form '@spam [spam profile name]', so that the administrators can consider that account as a spam profile and suspend it according to the metrics in the Twitter policy [9].
In this study, we target the problem of identifying Twitter spammers by building data mining models based on a data
set consisting of a large number of spam features, which are categorised into three main types: user-based, content-based
and graph-based features. Most of these features were proposed and used by different researchers in the literature; however, in this study, we integrate the features that were reported as the most important into one comprehensive data set and examine their effect on the accuracy of common classification models. The classifiers experimented with and utilised in
this work are Naive Bayes (NB), support vector machines (SVMs), multilayer perceptron (MLP) neural networks, two
Decision Trees classifiers, Random forests (RFs) and k-Nearest Neighbour (k-NN). Finally, we analyse the importance
of each feature based on three approaches: information gain (IG), change of mean square error (CoM) and Relief-F
method.
The main contribution of this work focuses on the feature engineering task in the process of identifying spammers in
social networks such as Twitter. This contribution can be summarised in the following points:

• Unlike most previous works, which focused on a limited or specific type of features, this work aims at constructing a comprehensive data set by collecting most of the related features identified in the literature.
• Identifying the importance of each of these features using different techniques (i.e. IG, CoM and Relief-F).
• Quantifying the influence of these features on the performance of the developed detection models.
• Evaluating the conventional classifiers as detection models under different scenarios.

This article is organised as follows. Section ‘Related work’ describes the related work on spam and Twitter spammers
and discusses the way the researchers handled this problem with Twitter. Section ‘Proposed framework’ describes in
detail the proposed methodology that would be followed for data collection, pre-processing, feature extraction, data label-
ling, building the classifiers and the evaluation metrics. Section ‘Experiments and results’ discusses the experiments and
the obtained results. Finally, the findings of the work are concluded in section ‘Conclusion’.


2. Related work
In the literature, many articles have addressed the problem of spam and spammers on Twitter and other social networking
sites. The problem becomes more extreme and complex when the social network is famous and popular. Therefore, such
a social network would become a hot target for spammers to spread advertisements or steal users’ sensitive data [10].
To detect spam profiles and spam tweets, researchers use different features; some of them base their work on the user features, which are extracted from the user profile or the user's behaviour when tweeting. Gee and Teh [11] collected data from Twitter and labelled the data set manually by reviewing each profile. They built their model to detect spammers based on the user's attributes only. They first implemented an NB learning algorithm to classify the profiles and received an error rate of 27%, which is much higher than what is acceptable. This led them to implement a linear SVM classifier with fivefold cross-validation on the training sets. They obtained an accuracy rate of 89.6%, which is effective and more acceptable than the NB classifier.
Benevenuto et al. [12] compared two approaches for detecting spam profiles and detecting spam tweets on Twitter.
They initially used the user-based features to only detect spam profiles, using the SVM classifier, which obtained an
accuracy rate of 84.6%. Then, the researchers used both the user- and the content-based features to classify the tweets
into spam and non-spam categories, using the same classifier, obtaining an accuracy rate of 87.6%. By comparing the
two techniques, they found that detecting spam tweets using both content- and user-based features is more effective [12].
Lee et al. [7] also used user- and content-based features to distinguish spam users from other users on Twitter. They
utilised a variety of methods in classification: SVM, Decorate, Simple logistic and Decision Trees. The difference in
their approach was using two different data sets in classification, one with 10% spammers and 90% non-spammers and
the other with 90% spammers and 10% non-spammers. They found that the classification metrics are robust against this
change. In their research, the Decorate classifier achieved the highest accuracy rate, reaching 88.98%.
'Don't Follow Me' is an article [13] with a message to spammers not to follow us or risk being detected and blocked. Hai Wang, the author, utilised both graph-based and content-based features to detect spammers on Twitter. He presented a directed social graph model and, using the 'following' and 'followers' relationships, extracted 49 million such relationships from Twitter's application programming interface (API), collecting 500 users' accounts with 20 recent tweets from each user and using different classifiers to classify the users into spam and non-spam categories. The Bayesian classifier showed the best performance, with the highest F-measure (91.7%) and precision (91.7%). With the neural network and SVM classifiers, the precision was 100%, but the recall and F-measure were less than 50%. Wang's results showed that the reputation feature performs best when detecting spam users.
Amleshwaram et al. [14] did not rely only on the content and the user behaviours to detect spam accounts; they also used the bait technique, a set of features used by spammers to gain the victim's attention and lead them to click malicious links, such as the number of unique mentions, unsolicited mentions, hijacking of trends and others. In addition, they used profile vectors characterising spammers. The training data set had 7321 users, 2467 of them spam and the rest non-spam accounts. The features were used to build the classification model, which achieved a 96% detection rate with a 0.8% false-positive rate; with only five tweets from each user, the detection rate was 90%.
Unlike others, Chakraborty et al. [15] proposed a framework for privacy protection that detects spam users using automatic profile similarity indexing to deal with the problem of false positives. The user similarity indexing model was a four-class machine learning model based on user similarity features between two given users, such as messages, hashtags, retweets and common friends. The model was trained with a small data set, which may explain the low precision rate of 64.9%.
McCord and Chuah [16] used both user-based and content-based features to facilitate spammer detection on Twitter. They collected 1000 Twitter users using Twitter's API and labelled them manually as spam or non-spam by reading the 20, 50 and 100 most recent tweets posted by each user. They compared the performance of traditional classifiers in their ability to detect spammers; only the results for the latest 100 tweets were reported. In the evaluation of the models, they found that the RF classifier performed better than the SVM, k-NN and NB classifiers, with a precision rate of 95.7% and an F-measure of 95.7%.
Apart from the tweet and user content features, Lee and Kim [17] proposed a suspicious uniform resource locator (URL) detection system for Twitter called WARNINGBIRD. Instead of inspecting the content of the page that a URL in a tweet opens in order to decide whether it is a spam URL, they relied on the assumption that spammers have limited resources available to them and considered correlated redirect chains of URLs across a number of tweets, part of which will be shared. These features gave a high accuracy rate and a low false-positive rate for classification but worked only for spammers who posted tweets containing URLs.
Lin and Huang [10] studied how to detect spammers using only two features: URL rate and interaction rate. They studied these features when detecting long-surviving Twitter spammers and concluded that the two features are effective in detecting spammers. The J48 classifier was used in their study and obtained approximately an 86% precision rate.


Table 1. Summary of the reviewed Twitter spam detection papers

| Ref. ID | Features used | Classifier/s | Data set for training | Results |
|---------|---------------|--------------|-----------------------|---------|
| [12] | Tweet content attributes; user behaviour attributes | SVM | 1056 users (355 spammers, 710 non-spammers); only tweets about 2 trending topics | Accuracy with user att. only: 84.5%; with both user and content att.: 87.6% |
| [11] | User-based features only | Naive Bayes, SVM | 450 users (50% spammers) | SVM performs better, with accuracy: 89.6% |
| [15] | Profile similarity indexing between the account owner and his friends/followers; content-based + user behaviours | Naive Bayes, SVM | 5000 users with 200 recent tweets | Accuracy — Naive Bayes: 84%; SVM: 89% |
| [7] | Content-based; user-based | Decorate, Hyperpipes, Bagging, Logistic Boost, Random sub-space, BF Tree, FT, SVM, Simple logistic, Classification via regression | 500 users (168 spam, 332 normal); only 20 recent tweets collected per user | Decorate gives the best accuracy: 88.98% |
| [13] | Graph-based; content-based | Bayesian classifier, Neural network, SVM, Decision Tree | 500 Twitter users with 20 recent tweets each; 3% spam accounts | Bayesian performs better — precision: 91.7%; recall: 91.7%; F-measure: 91.7% |
| [14] | User and content attributes; bait technique features | Random forest, Decision Tree, Decorate, Naive Bayes | 7312 users (2467 spam, 4854 non-spam); only 5 tweets per user | Decorate is the best — detection rate: 96% with 0.8 false positive |
| [16] | User-based; content-based | Random forest, SVM, NB, k-NN | 1000 users; evaluated using the 20, 50 and 100 recent tweets/user | Random forest is the best, with 100 tweets/user and accuracy: 95.7% |
| [17] | Content-based: correlated URL redirect chains | L2-regularised logistic regression algorithm | 224,834 tweet samples (41,721 spam, 183,113 non-spam) | Accuracy: 87.67%; FP: 1.64; FN: 10.69 |
| [10] | Only two features: URL rate, interaction rate | J48 | 400 users (50% spammers) | Precision: 82.9%–88.5%; recall: 98.7%–99.9% |
| [6] | Four feature sets: user features, content features, n-gram features, sentiment features | Naive Bayes, k-NN, SVM, Decision Tree, Random forest | 2 different data sets: Social Honeypot data set; 1KS-10KN data set | Random forest is the most accurate classifier, with F1-measure: 0.94 |

BF Tree: best-first decision tree; FT: functional trees; NB: Naive Bayes; SVM: support vector machine; k-NN: k-Nearest Neighbour; FP: false positive; FN: false negative; URL: uniform resource locator.

Wang et al. [6] focused on detecting spam rather than spammers. They used two hand-labelled data sets of tweets containing spam and considered four feature sets: user, content, n-gram and sentiment features. While the n-gram features capture hashtags and mentions per word, the sentiment features include the number of spam words and part-of-speech tags in every tweet. They tested five classification algorithms: NB, k-NN, SVM, Decision Trees and RF. They found that RF is the most accurate classifier when using multiple feature sets, with an F1-measure of 94%.
As a summary, Table 1 lists the used features, classifiers, data sets and the main results of the reviewed works in this
article.


Figure 1. Spam detection process.

Figure 2. Data collection process.

The success of any data-driven model depends mainly on the features selected to represent the problem under investigation [18]. In data mining, feature engineering is a vital process for enriching the characterisation of the domain problem by effectively increasing the value of the data in terms of more meaningful features [19–21]. In this work, we focus on the feature engineering process by exploiting most of the reviewed features and collecting them in a single rich data set in order to improve the quality of the developed spammer detection models.

3. Proposed framework
In this section, we propose a framework for targeting the problem of identifying spammers on Twitter. The framework consists of five main processes: data collection, data pre-processing, feature extraction and labelling, building classification models, and model evaluation and assessment. These processes are carried out consecutively, as shown in Figure 1, and are described in detail in the following five subsections.

3.1. Data collection


In order to build our models to detect spam accounts, we need a labelled collection of users classified into spammer and non-spammer categories. Unfortunately, it is not allowed to circulate the data collected from Twitter's API [22] for any reason. Therefore, we built our own collection of data. The diagram in Figure 2 describes the data collection process adopted for our research.


Table 2. List of fields retrieved from the Twitter API for each tweet

| Field ID | Content |
|----------|---------|
| F1 | User Screen Name |
| F2 | Tweet ID |
| F3 | User ID |
| F4 | User Created Date |
| F5 | User Followers Count |
| F6 | User Friends Count |
| F7 | User Statuses Count |
| F8 | Tweet Text |
| F9 | Tweet Created Date |
| F10 | Tweet Favorited Count |
| F11 | Tweet Retweeted Count |

API: application programming interface.

Twitter's API is described by Twitter as a 'Twitter platform that connects our applications to the worldwide conversations happening in Twitter' [22]. We can use either the representational state transfer (REST) API (Note 1) or the Streaming API (Note 2) to access Twitter data, but first we need to register as Twitter developers and get our secret keys in order to use the OAuth endpoints and send secure and authorised requests. The responses of these APIs are available in JavaScript Object Notation (JSON). To collect data from Twitter, we have to use the authentication information supplied to us as Twitter developers in order to connect and extract the needed data from Twitter's API [23–25]. We used a Python-based interface to connect to Twitter through two different scripts: one to collect the spam profiles from the Twitter Streaming API and return a list of user names that were already reported as spam to Twitter, and the other to get the latest tweets of a selected list of users from the Twitter public timeline through the REST API. From the output of the second script, represented in the JSON format, we extracted the fields needed for the feature extraction process and saved them in a comma-separated values (CSV) file.
To collect spam profiles, we first searched for already known spam profiles on Twitter and then tried to get their tweets from the Twitter API, but an error was received as those accounts had already been suspended. Therefore, we followed the steps of article [11], which describes the way in which Twitter users report spammers: by sending a message to the user '@spam' in the format '@spam [spammer account name]'. Taking advantage of that format, we wrote a Python program that connects to the Streaming API, tracks any message that starts with '@spam', extracts the spammer account name and saves it in a list. Another script was then written to read the accounts from the list and retrieve the latest tweets of those users from the Twitter public timeline through the REST API before Twitter Support suspends and blocks the accounts.
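The tracking script can be put together in a few lines. Below is a minimal sketch, assuming the tweepy library and its v3-style streaming interface; the credential values, the parsing rule and the output file name are illustrative placeholders rather than the exact script used in this work.

```python
# Minimal sketch of the spam-report collector (assumes tweepy < 4).
# Credentials and file names are placeholders.
import tweepy

CONSUMER_KEY, CONSUMER_SECRET = 'KEY', 'SECRET'
ACCESS_TOKEN, ACCESS_SECRET = 'TOKEN', 'TOKEN_SECRET'

class SpamReportListener(tweepy.StreamListener):
    """Capture account names reported via '@spam [account name]' tweets."""
    def on_status(self, status):
        text = status.text
        if text.lower().startswith('@spam'):
            parts = text.split()
            # The reported account is the mention right after '@spam'
            if len(parts) > 1 and parts[1].startswith('@'):
                with open('spammer_candidates.txt', 'a') as out:
                    out.write(parts[1].lstrip('@') + '\n')

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth, SpamReportListener())
stream.filter(track=['@spam'])  # track messages that start with '@spam'
```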
Our training data should contain spam and non-spam profiles. Therefore, we searched Twitter for official and well-known accounts covering news, sports, deals and famous people and saved them in a list. The second script was then used to retrieve the latest tweets from these accounts. By the end of this process, we had five main lists:

1. Spammers list;
2. News list;
3. Sports list;
4. Deals list;
5. Famous persons list.

For each list, we saved a folder of files that contains the extracted data from Twitter’s API for each user in the list.
Each file contains the fields described in Table 2. These fields are used later to extract the features needed for
classification.
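Reducing the raw API responses to the fields of Table 2 needs only the Python standard library. The sketch below assumes one JSON-encoded tweet per line, with the standard field names of the Twitter REST API v1.1; the file names are illustrative.

```python
# Sketch: reduce raw JSON tweet objects to the Table 2 fields (F1-F11)
# and save them as CSV. Assumes one JSON-encoded tweet per line.
import csv
import json

FIELDS = ['user_screen_name', 'tweet_id', 'user_id', 'user_created_date',
          'followers_count', 'friends_count', 'statuses_count',
          'tweet_text', 'tweet_created_date', 'favorite_count',
          'retweet_count']

def tweet_to_row(tweet):
    user = tweet['user']
    return [user['screen_name'], tweet['id_str'], user['id_str'],
            user['created_at'], user['followers_count'],
            user['friends_count'], user['statuses_count'],
            tweet['text'], tweet['created_at'],
            tweet['favorite_count'], tweet['retweet_count']]

with open('raw_tweets.json') as src, open('tweets.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(FIELDS)
    for line in src:
        writer.writerow(tweet_to_row(json.loads(line)))
```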

3.2. Data pre-processing


After collecting the data, we analysed the files extracted from the spammers' list and found that 10% of the accounts were empty, as they had already been suspended and blocked by Twitter. It was also found that 20% of the accounts were reported as spammers after tweeting only three times. Therefore, accounts with very few tweets were

Figure 3. Data pre-processing.

excluded. The accounts that tweet in a non-English language were also excluded. Figure 3 describes the data preparation stage.

3.3. Spam features extraction and labelling


In order to extract the features, a Python script was developed to read the files inside the folders collected from Twitter and to build our labelled data set. Each file was used to form a row in the data set containing the user and the features, labelled as spammer if the file was inside the spam folder. The resulting data set was extracted from approximately 50,000 tweets from 210 users, 100 of them spammers and 110 non-spammers. Figure 4 describes the feature extraction and labelling process.
From the related work and the papers reviewed on Twitter spamming, the features used to discern spammers from non-spammers concern the user, their behaviour and the content of the tweets themselves. The following subsections describe all the used features in detail, according to the Twitter policy [9].

3.3.1. Graph-based features


• Ratio between followers and friends. The numbers of followers and friends are used to calculate this feature:

\[ \frac{\text{Number of followers}}{\text{Number of friends}} \tag{1} \]

If this ratio is too small, the probability that the account is a spam account increases [9].


Figure 4. Feature extraction and labelling process.

• Reputation of the account. As with the previous feature, the numbers of friends and followers are used to extract this feature:

\[ \frac{\text{Number of followers}}{\text{Number of friends} + \text{Number of followers}} \tag{2} \]

If this number is small and close to zero, the account is highly likely to be a spam account, as spammers typically follow many users while gaining few followers in return [9].

• Age of the account. Spammers' accounts usually have a small age, as spammers tend to create new accounts once they are blocked by most of the users they follow [12]:

\[ \frac{\sum \left(\text{Time}(\text{Tweet}) - \text{User creation date}\right)}{\text{Total number of tweets}} \tag{3} \]
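For illustration, the three graph-based features can be computed as follows. This is a sketch under the assumption that the counts and timestamps have already been parsed from the fields of Table 2; the function and variable names are ours, not part of the original scripts.

```python
# Sketch: graph-based features (#1-#3). tweet_times is a list of
# datetime objects and user_created is the account creation datetime.
def follower_friend_ratio(followers, friends):
    return followers / friends if friends else 0.0           # feature #1

def reputation(followers, friends):
    total = friends + followers
    return followers / total if total else 0.0               # feature #2

def account_age(tweet_times, user_created):
    # Average age of the account at posting time, in days (feature #3)
    days = [(t - user_created).total_seconds() / 86400 for t in tweet_times]
    return sum(days) / len(days) if days else 0.0
```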

3.3.2. User behaviours


• Time between posts. Spammers are known to post at a faster rate than normal users. This feature is important for distinguishing spam profiles from trusted profiles; we capture it by measuring the time between consecutive tweets and taking the average of the measured gaps [9,26]:

\[ \frac{\sum \left(\text{Time}(\text{Tweet}_i) - \text{Time}(\text{Tweet}_j)\right)}{\text{Total number of tweets}} \tag{4} \]

where i and j are the sequence numbers of consecutive tweets.


The result of this equation is expected to be low for spammers because they usually post more than average over the
same period of time.


• Idle hours. Spammers tend not to stay idle for long; we calculate the maximum time between consecutive tweets and divide it by the total number of tweets. If the result is small, the probability of the account being a spam account increases [11]:

\[ \frac{\max\left(\text{Time}(\text{Tweet}_i) - \text{Time}(\text{Tweet}_j)\right)}{\text{Total number of tweets}} \tag{5} \]

where i and j are the sequence numbers of consecutive tweets.

• Ratio between statuses count and the age of the account. This number should be larger for spam accounts, as they tend to accumulate large numbers of tweets over the account lifetime:

\[ \frac{\text{Statuses count}}{\text{Age of account}} \tag{6} \]
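The user behaviour features follow the same pattern. The sketch below mirrors equations (4)–(6); note that, as in the equations, the sums and maxima of the inter-tweet gaps are divided by the total number of tweets.

```python
# Sketch: user behaviour features (#4-#6), assuming parsed datetimes.
def avg_time_between_posts(tweet_times):
    times = sorted(tweet_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(times) if gaps else 0.0           # feature #4

def idle_hours(tweet_times):
    times = sorted(tweet_times)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    return max(gaps) / len(times) if gaps else 0.0           # feature #5

def statuses_per_age(statuses_count, account_age_days):
    # Ratio between statuses count and the age of the account (feature #6)
    return statuses_count / account_age_days if account_age_days else 0.0
```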

3.3.3. Content-based features. This category of features is based on the content of the tweet text posted by the user. As users usually post several tweets, we analyse the content of these tweets and extract the following features:

• Ratio of retweets to tweets. As noted in the work by Chakraborty et al. [15], normal users usually have more retweets among their posts than spam users do. Therefore, from the data collected for each user, the following criterion is used to extract this feature:

\[ \frac{\text{Number of tweets that start with ``RT''}}{\text{Total number of tweets}} \tag{7} \]

This number should be very small and near zero for spam profiles.

• Same URLs. Spammers usually post more duplicate URLs than normal users; this feature can be important for differentiating between spam and non-spam profiles, and we obtain it with the following equation [7,15]:

\[ \frac{\text{Number of duplicate URLs}}{\text{Total number of tweets} \times \text{Number of URLs}} \tag{8} \]

The result should be greater for spam profiles than for normal ones.

• Same hashtags. Spammers usually post more duplicate hashtags than normal users; this feature can be important for differentiating between spam and non-spam profiles, and it is measured with the following equation [15,27]:

\[ \frac{\text{Number of duplicate hashtags}}{\text{Total number of tweets} \times \text{Number of hashtags}} \tag{9} \]

The result should be greater for spam profiles than for normal ones.

• Average length of tweet. Spammers post shorter messages compared with others, according to Chakraborty et al. [15], which is why this feature helps in distinguishing between spammers and non-spammers. To obtain this feature, we use the following ratio over the tweets extracted from each profile:

\[ \frac{\sum \text{Tweet length}}{\text{Total number of tweets}} \tag{10} \]

This value should be larger for normal profiles than for spammers.


• Average URLs per tweet. This feature captures URL usage, which can help predict spam profiles, as spammers tend to send many messages containing URLs as an advertisement strategy [15]:

\[ \frac{\sum \text{Number of URLs in tweets}}{\text{Total number of tweets}} \tag{11} \]

This number should be greater for spammers than for normal users.

• Average hashtags per tweet. This feature captures hashtag usage, which can help predict spam profiles, as spammers tend to send many messages containing hashtags as an advertisement strategy [15]:

\[ \frac{\sum \text{Number of hashtags in tweets}}{\text{Total number of tweets}} \tag{12} \]

This value should be greater for spammers than for normal users.

• Average mentions per tweet. This feature captures mention usage. Spammers tend to send many messages containing @username mentions as an advertisement strategy [15]:

\[ \frac{\text{Number of (@username) in tweets}}{\text{Total number of tweets}} \tag{13} \]

This value should be greater for spammers than for normal users.

• Similarity of tweets (duplicate tweets). As noticed when inspecting spammers' tweets, they tend to post the same tweet many times to draw users' attention to its content. Therefore, we consider this feature very important for determining whether a profile is spam [9,13], and it can be captured by the following ratio:

\[ \frac{\text{Total number of tweets}}{\text{Tweet clusters}} \tag{14} \]

This figure is expected to be close to 1 for normal users and greater than 1 for spam profiles.

• Existence of spam words in tweets. The number of tweets containing spam words should be higher for spam profiles than for normal ones [6,12]. Therefore, this feature is significant in determining spam profiles, and it is calculated based on the spam words list described in HubSpot [28]:

\[ \frac{\text{Number of tweets containing spam words}}{\text{Total number of tweets}} \tag{15} \]
• Average of favourite tweets. Spam messages are usually not marked as favourite [10], so we use this feature to distinguish between spam and non-spam profiles using the following criterion:

\[ \frac{\text{Total number of favourites}}{\text{Total number of tweets}} \tag{16} \]

This feature value should be higher for normal users than for spammers.

• Average of retweets of the tweets. Spam messages are not usually retweeted [10], so we use this feature to distinguish between spam and non-spam profiles using the following ratio:


\[ \frac{\text{Total number of retweets}}{\text{Total number of tweets}} \tag{17} \]

This feature value should be higher for normal users than for spammers. Table 3 summarises all the features in the final constructed data set.
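Several of the content-based features can likewise be derived directly from the tweet texts. The following sketch covers features #7, #10, #13 and #14; for simplicity, the tweet clusters of equation (14) are approximated by grouping exactly identical tweets, whereas proper near-duplicate clustering would be a refinement.

```python
# Sketch: selected content-based features, given a non-empty list of
# tweet texts for one user.
import re

def retweet_ratio(tweets):                                   # feature #7
    return sum(t.startswith('RT ') for t in tweets) / len(tweets)

def avg_tweet_length(tweets):                                # feature #10
    return sum(len(t) for t in tweets) / len(tweets)

def avg_mentions_per_tweet(tweets):                          # feature #13
    return sum(len(re.findall(r'@\w+', t)) for t in tweets) / len(tweets)

def duplicate_tweet_ratio(tweets):                           # feature #14
    # Exact duplicates only; near-duplicate clustering would refine this
    return len(tweets) / len(set(tweets))
```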

3.4. Classification models


In order to evaluate the effectiveness of the collected features in identifying spam users, seven classification models are
trained based on these features. The classification models are as follows:

• NB: This classifier uses Bayes' theorem in its implementation:

\[ P(C_j \mid d) = \frac{P(d \mid C_j) \times P(C_j)}{P(d)} \tag{18} \]

where P(Cj|d) is the probability that instance d belongs to class Cj, P(d|Cj) is the probability of generating instance d given class Cj, P(Cj) is the probability of class Cj occurring and P(d) is the probability of instance d occurring.
Bayesian analysis is based on previous experience and the training data, which provide the prior probabilities. In other words, if 60% of the spam profiles in the training data have short account ages and the instance we are predicting has a short account age, then it is 60% probable to be a spam profile according to its account age. This classifier is particularly suitable for high-dimensional inputs and can often outperform other classification methods [29].

• MLP neural network: MLP is an artificial neural network model, where the nodes are neurons with logistic acti-
vation. MLP is a multilayer feed-forward network where neurons of the ith layer serve as input for neurons of
i + 1th layer, and it maps input data into appropriate outputs [29,30].
• k-NN: This classifier is the most basic instance-based method. It assumes that all instances correspond to points in the n-dimensional space, and the nearest neighbours of an instance are defined in terms of the standard Euclidean distance [29]. To classify an instance, the algorithm first computes the distance of the instance to the other training records, identifies the k nearest neighbours and then uses their class labels to determine the class label of the instance [29].
• Alternating Decision Tree (ADTree): This classifier is a generalisation of decision trees, voted decision trees
and voted decision stumps. The tree consists of decision nodes (prediction conditions) and prediction nodes. The
instance is classified by following the paths where all prediction conditions are true until it reaches the prediction
node which has a real value. Summation of the prediction nodes which match the instance is calculated and the
sign of the summation is used for classification [31].
• Decision Trees (J48): J48 is a Java implementation, in the Weka package, of the C4.5 algorithm, which generates a decision tree from the training data based on the normalised IG. The attribute with the highest IG is chosen to make a decision [29].
• RF: This classifier consists of a collection of tree-structured classifiers in which each tree casts a unit vote for the most popular class [32]. The RF algorithm was introduced by Breiman [32]; it combines the bagging idea and the random selection of features to construct a collection of decision trees with controlled variance.
• SVM: In its simplest linear form, this classifier can be described as follows: given a set of negative and positive examples, we seek the hyperplane that separates them with the maximum distance between the hyperplane and the nearest negative and positive examples; these nearest examples are the support vectors. Using this hyperplane equation, the class of a new instance can be predicted [29]. In our experiments, we use the sequential minimal optimisation (SMO) scheme of the Weka package, designed by Platt [33], which uses the SMO algorithm to train the SVM classifier with polynomial or radial basis function (RBF) kernels.

These models are selected because they are very popular and commonly applied in the literature for spam detection.
This makes it easier to compare the obtained results with the previous works.


Table 3. List of features

| Feature | Feature description | Feature criteria |
|---------|---------------------|------------------|
| #1 | Ratio between number of followers and number of friends | Number of followers / Number of friends |
| #2 | Reputation of the account | Number of followers / (Number of friends + Number of followers) |
| #3 | Age of the account | Σ (Time(Tweet) − User creation date) / Total number of tweets |
| #4 | Average time between posts | Σ (Time(Tweet_i) − Time(Tweet_j)) / Total number of tweets |
| #5 | Idle hours | Max(Time(Tweet_i) − Time(Tweet_j)) / Total number of tweets |
| #6 | Ratio between statuses count and age of account | Statuses count / Feature #3 |
| #7 | Ratio between retweets and tweets | Number of tweets starting with "RT" / Total number of tweets |
| #8 | Same URLs | Number of duplicate URLs / (Total number of tweets × Number of URLs) |
| #9 | Same hashtags | Number of duplicate hashtags / (Total number of tweets × Number of hashtags) |
| #10 | Average length of tweets | Σ Tweet length / Total number of tweets |
| #11 | Average URLs per tweet | Σ Number of URLs in tweets / Total number of tweets |
| #12 | Average hashtags per tweet | Σ Number of hashtags in tweets / Total number of tweets |
| #13 | Average mentions per tweet | Number of (@username) in tweets / Total number of tweets |
| #14 | Duplicate tweets | Total number of tweets / Tweet clusters |
| #15 | Existence of spam words | Number of tweets containing spam words / Total number of tweets |
| #16 | Average of favourite tweets | Total number of favourites / Total number of tweets |
| #17 | Average of retweeted tweets | Total number of retweets / Total number of tweets |
| #18 | User name | Screen name of the user |
| #19 | Class | Spam or non-spam |

URL: uniform resource locator.

Table 4. Confusion matrix

| Actual label | Predicted: spam | Predicted: non-spam |
|--------------|-----------------|---------------------|
| Spam | True positive (TP) | False negative (FN) |
| Non-spam | False positive (FP) | True negative (TN) |

3.5. Evaluation metrics


This section introduces the evaluation metrics for the classifiers used in spam detection. Four major metrics are applied: accuracy, precision, recall and F-measure. These metrics are based on the confusion matrix shown in Table 4, which is considered the primary reference of evaluation for any binary classifier. Before introducing the metrics in detail, we review the relationship between true positives, true negatives, false positives and false negatives in Table 4.
True positive is the number of instances that are spam and correctly predicted as spam. False negative is the number of instances that are spam and incorrectly predicted as non-spam. False positive is the number of instances that are predicted as spam while they are not, and true negative is the number of non-spam instances correctly predicted as non-spam. Based on these counts, the formulas and descriptions of the metrics are as follows [34,35]:

• Accuracy: the percentage of instances that are correctly predicted over the total number of instances

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{19} \]

• Precision: the percentage of instances predicted as spam that are actually spam

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{20} \]

• Recall (R): the percentage of actual spam instances that are correctly predicted as spam

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{21} \]

• F-measure: a combination of recall and precision into a single measure, which falls between these two metrics

\[ FM = \frac{2 \times (P \times R)}{P + R} \tag{22} \]

For a better classification, F-measure and accuracy rates should be high. So, in our experiments the goal is to obtain
high F-measure and accuracy.
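Computed from the confusion-matrix counts, the four metrics take only a few lines; the sketch below implements equations (19)–(22) directly.

```python
# Sketch: evaluation metrics from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)               # equation (19)
    precision = tp / (tp + fp) if tp + fp else 0.0           # equation (20)
    recall = tp / (tp + fn) if tp + fn else 0.0              # equation (21)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)             # equation (22)
    return accuracy, precision, recall, f_measure
```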

4. Experiments and results


In our experiments, the seven classifiers mentioned in the previous section are applied to the collected data set using 10-fold cross-validation. This means that the data set is divided into 10 almost equal parts, and the classifiers are trained on 9 parts each time and tested on the 10th part. To obtain statistically reliable results, the cross-validation process is repeated 10 independent times, and the final evaluation of each classifier is the average of the 10 runs. The best performance for k-NN was obtained with k = 1 and for RF with 50 trees. For SVM, the cost and gamma parameters were roughly tuned and set to 50 and 0.01, respectively. For MLP, the learning rate and momentum were set to 0.3 and 0.2, respectively.
The Weka software package [36] is used for applying the selected classifiers to the data set. Weka is an open-source project that provides a collection of machine learning algorithms and data pre-processing tools to researchers and allows users to compare different machine learning methods and data sets. After performing all the classification experiments on the data set, the results can be summarised as shown in Table 5.
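Although our experiments were run in Weka, the same protocol can be sketched with scikit-learn analogues of these classifiers and the hyperparameters listed above. ADTree has no direct scikit-learn counterpart and is omitted here, and the data set file and column names are placeholders.

```python
# Sketch: repeated 10-fold cross-validation with scikit-learn analogues
# of the Weka classifiers. File and column names are placeholders.
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('spam_features.csv')          # data set of section 3.3
X = data.drop(columns=['user_name', 'class']).values
y = (data['class'] == 'spam').astype(int).values

classifiers = {
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(C=50, gamma=0.01),                # cost and gamma as above
    'MLP': MLPClassifier(solver='sgd', learning_rate_init=0.3, momentum=0.2),
    'k-NN': KNeighborsClassifier(n_neighbors=1), # k = 1
    'Decision tree (CART analogue of J48)': DecisionTreeClassifier(),
    'Random forest': RandomForestClassifier(n_estimators=50),  # 50 trees
}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring='f1')
    print(f'{name}: F-measure {scores.mean():.4f} +/- {scores.std():.4f}')
```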
The results in Table 5 show that the k-NN classifier performs slightly better than the other algorithms, achieving 99.05% accuracy, 98.33% precision, 100% recall and a 99.13% F-measure. The MLP classifier comes next with 98.67% accuracy, followed by the RF, ADTree, SVM and J48 classifiers, respectively. In general, all classifiers achieved high classification ratios based on the full set of features. However, we still need to identify which features play the more important role in this performance.
Many factors affect the success of a classification model. The representation and quality of the data set highly affect the performance of the developed models [37]. In the field of data mining, there are many techniques to measure the impact of each feature in the data set. In our experiments, we apply three different methods. The first one is the CoM technique, which was applied by Sung [37] and Adwan et al. [38]. In this technique, the importance of each feature depends on the change of MSE before and after removing that feature, where the mean square error (MSE) is defined as

\[ \text{MSE} = \sum_{i=1}^{n} \frac{(T_i - O_i)^2}{n} \tag{23} \]

where Ti is the desired output and Oi is the calculated output for the ith pattern, and n is the total number of instances in the data set.
In this technique, the features whose deletion causes the largest change in MSE are ranked as the most important, since the error is most sensitive to these inputs. We use the MLP classifier, retraining it each time an attribute is deleted in order to measure the change in MSE while that attribute is omitted. Figure 5 shows the CoM for each feature when it is deleted.
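The ranking procedure itself is simple to sketch, assuming a feature matrix X, a binary label vector y and a list feature_names prepared from the data set; the MLP settings mirror those of section 4, and out-of-fold predicted probabilities stand in for the network outputs Oi of equation (23).

```python
# Sketch: change-of-MSE (CoM) feature ranking with an MLP, after [37].
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_predict

def mlp_mse(X, y):
    # Out-of-fold spam probabilities play the role of the outputs O_i
    mlp = MLPClassifier(solver='sgd', learning_rate_init=0.3, momentum=0.2,
                        max_iter=500, random_state=1)
    proba = cross_val_predict(mlp, X, y, cv=10, method='predict_proba')[:, 1]
    return np.mean((y - proba) ** 2)             # equation (23)

base_mse = mlp_mse(X, y)
ranking = []
for j, name in enumerate(feature_names):
    X_reduced = np.delete(X, j, axis=1)          # omit one feature at a time
    ranking.append((abs(mlp_mse(X_reduced, y) - base_mse), name))

# Features whose removal changes the MSE most are the most important
for change, name in sorted(ranking, reverse=True):
    print(f'{name}: CoM = {change:.4f}')
```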
The second approach used to identify the importance of each feature evaluates the worth of a feature by measuring the IG with respect to the class. The IG ranking filter of the Weka package is used; it depends on entropy, a measure commonly used in information theory [39]. The entropy of feature F is


Table 5. Classification evaluation

| Classifier name | Accuracy | Precision | Recall | F-measure |
|-----------------|----------|-----------|--------|-----------|
| Naive Bayes | 97.62 ± 2.90 | 98.32 ± 3.37 | 97.27 ± 4.90 | 97.68 ± 2.87 |
| SVM-SMO | 97.90 ± 2.55 | 98.01 ± 3.07 | 98.18 ± 3.86 | 98.00 ± 2.44 |
| MLP | 98.67 ± 2.24 | 98.66 ± 3.33 | 98.90 ± 2.95 | 98.73 ± 2.13 |
| k-NN | 99.05 ± 1.90 | 98.33 ± 3.33 | 100.0 ± 0.00 | 99.13 ± 1.74 |
| ADTree | 98.52 ± 2.49 | 98.32 ± 3.36 | 99.00 ± 3.12 | 98.65 ± 2.06 |
| J48 | 97.62 ± 2.38 | 97.30 ± 3.82 | 98.18 ± 3.64 | 97.74 ± 2.26 |
| Random forest | 98.57 ± 2.18 | 98.20 ± 3.33 | 99.09 ± 2.73 | 98.60 ± 2.39 |

SVM: support vector machine; SMO: sequential minimal optimisation; MLP: multilayer perceptron; k-NN: k-Nearest Neighbour; ADTree: alternating Decision Tree.

Figure 5. Change of mean square error chart using CoM technique.

\[ \text{Entropy}(F) = -\sum_{i=1}^{K} p_i \log_2 p_i \tag{24} \]

where K is the number of classes and pi is the probability of class i.


IG uses the ranking method, which ranks attributes by their individual evaluation. The result of the IG ranking filter is
shown in Figure 6.
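For a numeric feature, the IG of equation (24) can be sketched as follows: the feature is first discretised into bins (Weka's IG evaluator performs a comparable discretisation internally), and the class entropy is then compared before and after conditioning on the binned feature.

```python
# Sketch: information gain of a discretised numeric feature w.r.t. the class.
# feature_values and labels are assumed to be numpy arrays.
import numpy as np

def entropy(labels):
    # Equation (24): -sum over classes of p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels, bins=10):
    edges = np.histogram_bin_edges(feature_values, bins=bins)
    binned = np.digitize(feature_values, edges)
    gain = entropy(labels)
    for b in np.unique(binned):
        mask = binned == b
        gain -= mask.mean() * entropy(labels[mask])   # conditional entropy
    return gain
```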
The third approach to measuring the importance of features is the Relief-F attribute evaluation. Relief-F evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and of a different class [40]. The Relief-F ranking filter of the Weka package is used with 10 nearest neighbours, and the results are shown in Figure 7.
Novaković et al. [40] presented a comparison between different ranking methods and found that they give quite different results, which is what we found in our experiments too. The top five features according to the CoM technique are as follows: reputation of the account, average length of tweet, average mentions per tweet, age of the account and the average time between posts. The top five features in the IG ranking technique are as follows: reputation of the account, ratio between followers and friends, average hashtags per tweet, average of favourite tweets and user ID. The top five features in the Relief-F technique are as follows: reputation of the account, average hashtags per tweet, user ID, average length of tweets and average URLs per tweet.
The top five features of the three techniques are a mix of graph-based features, user behaviour features and content-based features. All approaches show that the reputation of the account (F2) is the most important feature in the data set, and the IG and Relief-F ranking techniques share three features in their top five: the reputation of the account, average hashtags per tweet and the user ID.


Figure 6. Ranking chart for features using IG ranking technique.

Figure 7. Ranking chart for features using Relief-F ranking technique.

Another series of experiments is carried out using the same classifiers with different feature sets. In these experiments, each classifier is trained using the top five features identified by each feature selection method. In addition, we use a special set of size 5 formed by collecting the features that appeared at least twice in the top five lists of the methods. The evaluation metrics for each experiment are recorded in Tables 6–9 for accuracy, precision, recall and F-measure.
According to the results, the top features identified by the CoM method slightly improved the evaluation ratios of all classifiers, whereas the other selection methods showed results very close to the baseline, with small variations. We can conclude that the features identified by CoM, namely the reputation of the account, average length of tweet, average mentions per tweet, age of the account and the average time between posts, are very important features in the process of identifying spammers on Twitter.

5. Conclusion
In this article, we collected our own data set for the purpose of identifying spammers on Twitter. The features in the data set are graph-based features, user behaviour features and content-based features, which previous research works reported as the most effective features for distinguishing spammers from non-spammers. Next, we used

Table 6. Accuracy results

| Classifier name | 18 features | Top 5 IG | Top 5 CoM | Top 5 Relief-F | Top 5 common |
|-----------------|-------------|----------|-----------|----------------|--------------|
| Naive Bayes | 97.62 ± 2.90 | 97.95 ± 2.96 | 98.10 ± 2.61 | 97.95 ± 2.96 | 97.95 ± 2.96 |
| SVM-SMO | 97.90 ± 2.55 | 98.76 ± 2.09 | 99.05 ± 1.90 | 98.57 ± 2.18 | 98.57 ± 2.18 |
| MLP | 98.67 ± 2.24 | 98.48 ± 2.60 | 99.00 ± 1.94 | 98.52 ± 2.40 | 98.67 ± 2.14 |
| k-NN | 99.05 ± 1.90 | 97.71 ± 3.05 | 99.05 ± 1.90 | 98.57 ± 2.38 | 98.10 ± 2.61 |
| ADTree | 98.52 ± 2.49 | 98.24 ± 2.66 | 98.90 ± 2.22 | 98.62 ± 2.46 | 98.52 ± 2.49 |
| J48 | 97.48 ± 3.12 | 98.38 ± 2.71 | 98.38 ± 2.71 | 98.38 ± 2.71 | 98.38 ± 2.71 |
| Random forest | 98.43 ± 2.52 | 98.10 ± 2.86 | 98.90 ± 2.11 | 98.81 ± 2.17 | 98.86 ± 2.14 |

SVM: support vector machine; SMO: sequential minimal optimisation; MLP: multilayer perceptron; k-NN: k-Nearest Neighbour; ADTree: alternating Decision Tree; IG: information gain; CoM: change of mean square error.

Table 7. Precision results

| Classifier name | 18 features | Top 5 IG | Top 5 CoM | Top 5 Relief-F | Top 5 common |
|-----------------|-------------|----------|-----------|----------------|--------------|
| Naive Bayes | 98.32 ± 3.37 | 98.32 ± 3.37 | 98.32 ± 3.37 | 98.32 ± 3.37 | 98.32 ± 3.37 |
| SVM-SMO | 98.01 ± 3.70 | 98.33 ± 3.33 | 98.33 ± 3.33 | 98.33 ± 3.33 | 98.33 ± 3.33 |
| MLP | 98.66 ± 3.07 | 98.32 ± 3.73 | 99.08 ± 2.61 | 98.50 ± 3.20 | 98.50 ± 3.20 |
| k-NN | 98.33 ± 3.33 | 98.31 ± 3.38 | 98.33 ± 3.33 | 98.32 ± 3.37 | 98.32 ± 3.37 |
| ADTree | 98.32 ± 3.36 | 98.32 ± 3.37 | 98.32 ± 3.35 | 98.32 ± 3.37 | 98.32 ± 3.37 |
| J48 | 97.50 ± 4.13 | 97.51 ± 4.11 | 97.51 ± 4.11 | 97.51 ± 4.11 | 97.51 ± 4.11 |
| Random forest | 98.32 ± 3.37 | 98.30 ± 3.40 | 98.33 ± 3.33 | 98.33 ± 3.33 | 98.33 ± 3.33 |

SVM: support vector machine; SMO: sequential minimal optimisation; MLP: multilayer perceptron; k-NN: k-Nearest Neighbour; ADTree: alternating Decision Tree; IG: information gain; CoM: change of mean square error.

Table 8. Recall results

| Classifier name | 18 features | Top 5 IG | Top 5 CoM | Top 5 Relief-F | Top 5 common |
|-----------------|-------------|----------|-----------|----------------|--------------|
| Naive Bayes | 97.27 ± 4.90 | 97.91 ± 4.79 | 98.18 ± 3.86 | 97.91 ± 4.79 | 97.91 ± 4.79 |
| SVM-SMO | 98.18 ± 3.86 | 99.45 ± 2.16 | 100.0 ± 0.00 | 99.09 ± 2.73 | 99.09 ± 2.73 |
| MLP | 98.91 ± 2.95 | 98.91 ± 3.47 | 99.09 ± 2.73 | 98.82 ± 3.65 | 99.09 ± 2.73 |
| k-NN | 100.0 ± 0.00 | 97.45 ± 4.99 | 100.0 ± 0.00 | 99.09 ± 2.73 | 98.18 ± 3.86 |
| ADTree | 99.00 ± 3.12 | 98.45 ± 3.87 | 99.73 ± 2.01 | 99.18 ± 2.90 | 99.00 ± 3.12 |
| J48 | 97.91 ± 4.43 | 99.64 ± 2.20 | 99.64 ± 2.20 | 99.64 ± 2.20 | 99.64 ± 2.20 |
| Random forest | 98.82 ± 2.73 | 98.18 ± 4.07 | 99.73 ± 2.01 | 99.55 ± 2.36 | 99.64 ± 2.20 |

SVM: support vector machine; SMO: sequential minimal optimisation; MLP: multilayer perceptron; k-NN: k-Nearest Neighbour; ADTree: alternating Decision Tree; IG: information gain; CoM: change of mean square error.

these features to help in detecting spammers and evaluated their usefulness in spammer detection using the NB, k-NN, ADTree, J48, MLP neural network, RF and SVM classifiers. In general, all classifiers showed relatively high prediction performance using this comprehensive set of features. We also measured the importance of the features using three techniques: CoM, IG ranking and the Relief-F ranking technique. Results show that the top five features are a mix of graph-based, user behaviour and content-based features, with the highest share for the graph-based features at 47% of the top five, followed by the content-based features with 40%, leaving 13% for the user behaviour features. Specifically, all approaches show that the reputation of the account is the most important feature in the data set, and the IG and Relief-F ranking techniques share three features in their top five: the reputation of the account, average hashtags per tweet and the user ID.
Moreover, the top five features according to each method were used to build the classification models. This was performed as a feature selection process to quantify the change in the classifiers' performance based on these features. The k-NN and SVM classifiers kept performing best in most cases. It can also be noticed that the top five features according to the CoM technique yield better classification across all classifiers. The best feature set we settled on to detect spam profiles on Twitter consists of the reputation of the account, age of the account, average time between tweets, average length of tweets and average mentions per tweet, which is a mix of graph-based, user behaviour and content-based features, and the best classifier to use at a larger scale is the SVM classifier.


Table 9. F-measure results

| Classifier name | 18 features | Top 5 IG | Top 5 CoM | Top 5 Relief-F | Top 5 common |
|-----------------|-------------|----------|-----------|----------------|--------------|
| Naive Bayes | 97.68 ± 2.87 | 98.01 ± 2.96 | 98.17 ± 2.52 | 98.01 ± 2.96 | 98.01 ± 2.96 |
| SVM-SMO | 98.00 ± 2.44 | 98.84 ± 1.95 | 99.13 ± 1.74 | 98.65 ± 2.05 | 98.65 ± 2.06 |
| MLP | 98.73 ± 2.13 | 98.55 ± 2.51 | 99.05 ± 1.85 | 98.59 ± 2.33 | 98.74 ± 2.02 |
| k-NN | 99.13 ± 1.74 | 97.77 ± 3.02 | 98.13 ± 1.74 | 98.65 ± 2.25 | 98.17 ± 2.52 |
| ADTree | 98.60 ± 2.36 | 98.31 ± 2.59 | 98.90 ± 2.09 | 98.70 ± 2.34 | 98.60 ± 2.39 |
| J48 | 97.60 ± 3.02 | 98.51 ± 2.52 | 98.51 ± 2.52 | 98.51 ± 2.52 | 98.51 ± 2.52 |
| Random forest | 98.50 ± 2.42 | 98.16 ± 2.77 | 98.98 ± 1.99 | 98.89 ± 2.06 | 98.94 ± 2.02 |

SVM: support vector machine; SMO: sequential minimal optimisation; MLP: multilayer perceptron; k-NN: k-Nearest Neighbour; ADTree: alternating Decision Tree; IG: information gain; CoM: change of mean square error.

Although the developed models in this work showed high accuracy rates in detecting spammers, there are some limitations that are important to point out. First is the relatively small number of collected profiles in the data set. Second, the class distribution of spammers and non-spammers is balanced, since we tried to collect the same number of profiles from each class to avoid imbalanced data and to focus more on the feature engineering part. Third is the scalability of the detection models, which is very important to consider when the detection models are applied on a large scale like Twitter, where there are billions of accounts. Therefore, our future work is to address the problem with much larger data sets and to investigate the effect of imbalanced data distribution on the performance of the detection models.

Declaration of conflicting interests


The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.

Notes
1. The REST application programming interfaces (APIs) provide programmatic access to read and write Twitter data: author a new tweet, read author profile and follower data, and more.
2. The Streaming APIs give developers low-latency access to Twitter's global stream of tweet data.

References
[1] Social Times. Facebook, Twitter, Instagram, Pinterest, Vine, Snapchat: social media stats 2014, http://www.adweek.com/social-
times/social-media-statistics-2014/499230 (2014, accessed January 2016).
[2] Aldayel HK and Azmi AM. Arabic tweets sentiment analysis – a hybrid scheme. J Inf Sci 2016; 42: 782–797.
[3] Tare M, Gohokar I, Sable J et al. Multi-class tweet categorization using map reduce paradigm. Int J Comput Trends Tech 2014;
9(2): 78–81.
[4] Go A, Bhayani R and Huang L. Twitter sentiment classification using distant supervision. CS224N project report, Stanford
University, Stanford, CA, December 2009, p. 12.
[5] De Choudhury M, Diakopoulos N and Naaman M. Unfolding the event landscape on Twitter: classification and exploration of
user categories. In: Proceedings of the ACM 2012 conference on computer supported cooperative work, Seattle, WA, 11
February 2012, pp. 241–244. New York: ACM.
[6] Wang B, Zubiaga A, Liakata M and Procter R. Making the most of tweet-inherent features for social spam detection on Twitter. arXiv preprint arXiv:1503.07405, 2015.
[7] Lee K, Caverlee J and Webb S. Uncovering social spammers: social honeypots + machine learning. In: Proceedings of the 33rd
international ACM SIGIR conference on research and development in information, Geneva, 19 July 2010, pp. 435–442. New
York: ACM.
[8] Al-Kabi M, Wahsheh H, Alsmadi I et al. Content-based analysis to detect Arabic web spam. J Inf Sci 2012; 38(3): 284–296.
[9] Twitter help Center. The Twitter rules, https://support.twitter.com/articles/18311 (accessed October 2015).
[10] Lin PC and Huang PM. A study of effective features for detecting long-surviving Twitter spam accounts. In: Proceedings of the
15th international conference on advanced communication technology (ICACT), PyeongChang, South Korea, 27 January 2013,
pp. 841–846. New York: IEEE.


[11] Gee G and Teh H. Twitter spammer profile detection. CS229 project report, Stanford University, Stanford, CA, December
2010.
[12] Benevenuto F, Magno G, Rodrigues T et al. Detecting spammers on Twitter. In: Proceedings of the collaboration, electronic
messaging, anti-abuse and spam conference (CEAS), Redmond, Washington, USA, 13 July 2010, vol. 6, p. 12. CEAS
Conference.
[13] Wang AH. Don’t follow me: spam detection in Twitter. In: Proceedings of the international conference on security and crypto-
graphy (SECRYPT), Athens, 26 July 2010, pp. 1–10. New York: IEEE.
[14] Amleshwaram AA, Reddy N, Yadav S et al. CATS: characterizing automation of Twitter spammers. In: Proceedings of the 5th
international conference on communication systems and networks (COMSNETS), Bangalore, India, 7–10 January 2013, pp. 1–
10. New York: IEEE.
[15] Chakraborty A, Sundi J, Satapathy S et al. SPAM: a framework for social profile abuse monitoring. CSE508 report, Stony
Brook University, Stony Brook, NY, 2012.
[16] McCord M and Chuah M. Spam detection on Twitter using traditional classifiers. In: Proceedings of the 8th international con-
ference on autonomic and trusted computing, Banff, AB, Canada, 2–4 September 2011, pp. 175–186. Berlin, Heidelberg:
Springer.
[17] Lee S and Kim J. WarningBird: a near real-time detection system for suspicious URLs in Twitter stream. IEEE T Depend
Secure 2013; 10(3): 183–195.
[18] Delany SJ, Buckley M and Greene D. SMS spam filtering: methods and data. Expert Syst Appl 2012; 39(10): 9899–9908.
[19] Moro S, Cortez P and Rita P. A data-driven approach to predict the success of bank telemarketing. Decis Support Syst 2014;
30(62): 22–31.
[20] Moro S, Cortez P and Rita P. A framework for increasing the value of predictive data-driven models by enriching problem
domain characterization with novel features. Neural Comput Appl 2016; 1–9.
[21] Moro S, Rita P and Vala B. Predicting social media performance metrics and evaluation of the impact on brand building: a data
mining approach. J Bus Res 2016; 69(9): 3341–3351.
[22] Twitter Developers. Documentation, https://dev.twitter.com/overview/documentation (accessed January 2016).
[23] Ji X, Chun SA and Geller J. Epidemic outbreak and spread detection system based on Twitter data. In: Yin X, Ho K, Zeng D
et al. (eds) Health information science. Berlin, Heidelberg: Springer, 2012, pp. 152–163.
[24] Grier C, Thomas K, Paxson V et al. @ spam: the underground on 140 characters or less. In: Proceedings of the 17th ACM con-
ference on computer and communications security, Chicago, IL, 4 October 2010, pp. 27–37. New York: ACM.
[25] Thomas K, Grier C, Song D et al. Suspended accounts in retrospect: an analysis of Twitter spam. In: Proceedings of the ACM
SIGCOMM conference on Internet measurement conference, Berlin, 2 November 2011, pp. 243–258. New York: ACM.
[26] Guo D and Chen C. Detecting non-personal and spam users on geo-tagged Twitter network. Trans GIS 2014; 18(3): 370–384.
[27] Yang C, Zhang J and Gu G. A taste of tweets: reverse engineering Twitter spammers. In: Proceedings of the 30th annual com-
puter security applications conference, New Orleans, LA, 8 December 2014, pp. 86–95. New York: ACM.
[28] HubSpot. The ultimate list of Email spam trigger words, http://blog.hubspot.com/blog/tabid/6307/bid/30684/The-Ultimate-List-
of-Email-SPAM-Trigger-Words.aspx (accessed January 2016).
[29] Mitchell TM. Machine learning. New York: McGraw-Hill, 1997.
[30] Garson DG. Interpreting neural network connection weights. AI Expert 1991; 6: 46–51.
[31] Freund Y and Mason L. The alternating decision tree learning algorithm. In: Proceedings of the international conference of
machine learning (ICML’99), Bled, 27–30 June 1999, pp. 124–133. New York: ACM.
[32] Breiman L. Random forests. Mach Learn 2001; 45(1): 5–32.
[33] Platt JC. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C and Smola
A (eds) Advances in kernel methods: support vector learning. Cambridge, MA: MIT Press, 1999, pp. 185–208.
[34] Chinchor N and Sundheim B. MUC-5 evaluation metrics. In: Proceedings of the 5th conference on message understanding,
Baltimore, MD, 25 August 1993, pp. 69–78. Stroudsburg, PA: Association for Computational Linguistics.
[35] Wang D, Navathe SB, Liu L et al. Click traffic analysis of short URL spam on Twitter. In: Proceedings of the 9th international
conference on collaborative computing: networking, applications and worksharing (collaboratecom), Austin, Texas, USA, 20
October 2013, pp. 250–259. New York: IEEE.
[36] Hall M, Frank E, Holmes G et al. The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 2009; 11(1): 10–
18.
[37] Sung AH. Ranking importance of input parameters of neural networks. Expert Syst Appl 1998; 15(3): 405–411.
[38] Adwan O, Faris H, Jaradat K et al. Predicting customer churn in telecom industry using multilayer perceptron neural networks:
modeling and analysis. Life Sci J 2014; 11(3): 75–81.
[39] Roobaert D, Karakoulas G and Chawla NV. Information gain, correlation and support vector machines. In: Guyon I, Nikravesh
M, Gunn S et al. (eds) Feature extraction. Berlin, Heidelberg: Springer, 2006, pp. 463–470.
[40] Novaković J, Štrbac P and Bulatović D. Toward optimal feature selection using ranking methods and classification algorithms.
Yugoslav J Op 2011; 21(1): 119–135.
