ABSTRACT
Social networking sites, in the present scenario, are an amalgam of knowledge and spam. As their
popularity surges among users day by day, so does their appeal to spammers looking for easy targets
for their campaigns. The threat posed by spam, which wastes bandwidth, overloads servers and spreads
malicious pages online, has increased manifold, making it necessary for researchers to foray into
the field of spam detection and reduce its effect on the various social networking sites.
In this paper, we propose a framework for spam detection on the two largest social networking sites,
namely Twitter and Facebook. We utilize the data publicly available on these two giants of the social
networking era. First, we cite the various approaches that have already been explored in this
field. After that, we briefly explain the two methods that we used to collect the datasets from these
websites.
KEYWORDS
APIs, Honeypots, Facebook, Social Networking websites, Spam, Twitter, SVM, Weka, Naïve Bayes
Algorithm, Simple K-means clustering.
1. INTRODUCTION
There are a large number of social networking sites booming on the internet these days. Some of
these platforms, such as Facebook and Twitter, are more popular than others. This increase in social
sites has made social media vulnerable to many kinds of online attacks. Among these, one of the
leading problems engulfing netizens is spam. Spam on online social sites or on any social
media may include bulk or repeated messages, malicious links, fake friend requests et cetera.
It not only consumes extra bandwidth; as the volume of spam increases, the internet also becomes
more polluted and less useful. Earlier, spamming was carried out through emails, but spammers have
now expanded their approach. Social websites, which users rely on for communicating with one
another, are being targeted. Spam is increasingly being used to distribute viruses, links to
phishing websites et cetera. This has now become a security threat.
According to CNN, there are 83 million fake profiles on Facebook. This indicates the scale of spam
on one of the most popular social websites. A study [1] shows that Facebook has 100 times
more spam than any other social network and 4 times more phishing attacks. According to the
Symantec Internet Security Threat Report 2014 [2], adult spam (70%) dominated in 2013.
Such spam usually takes the form of an email inviting the user to connect to the scammer, or of a
URL link. Such a scenario proves to be extremely harmful and dangerous for young minds who wish to
surf the internet predominantly for academic purposes.
DOI : 10.5121/avc.2016.3102
Advances in Vision Computing: An International Journal (AVC) Vol.3, No.2, June 2016
We present this research paper with the aim of detecting spam on the leading social
networking sites Twitter and Facebook using unsupervised as well as supervised methods. Our
objective in presenting the paper titled "Discerning Spam in Social Networking Sites" is
to diagnose the huge amount of spam on the social websites that nearly all of us use,
so that this worrisome issue can be tackled.
2. BACKGROUND
2.1. TWITTER
Twitter is a simple social website that lets its users send messages (called tweets) and follow
other users. It displays usernames on their profiles along with their recent tweets.
There were 307 million active Twitter users in the first quarter of 2015 [3], and there are at least
2.5 million spam tweets every day. According to a report [4], almost 10% of Twitter is spam. This
indicates the level of unwanted and harmful material present on one of the trending social
networking websites, which is used by millions of users worldwide. Twitter has taken a number of
steps against this problem in the past. It introduced a "Reporting spam" option, which users can
use if they find any doubtful material on the site. Users can also flag
any content that they find inappropriate. Some of this spam contains malicious links to
dubious websites, which trick users into thinking that the content is legitimate.
Unnecessary and unwanted tweets create a lot of clutter, and ultimately the user gets
confused about what is real and what is not on the internet. Spammers take advantage of
these occasions and illegally collect, without the user's knowledge, all the user data they can
lay their hands on. So, it is the need of the hour to detect the sources of this spam and take the
necessary measures so that users can have a hassle-free experience and their security remains
intact.
2.2. FACEBOOK
Facebook is an online social networking website which people use to keep in touch with their
friends, family and colleagues by posting statuses, uploading pictures, sharing links with one
another, liking pages of their interest, joining public groups et cetera. With such activity going
on at a large scale among 1.49 billion monthly active users [5], it is imperative to safeguard each
aspect of this networking site. Spam is the new harmful trend taking place on this platform.
Facebook has a colossal number of fake users, around 170 million [6]. Facebook has taken a number
of security measures to combat these problems. It removed the Likes that came from accounts which
had been inactive since a particular date, and such accounts were deleted.
3. LITERATURE REVIEW
A.H. Wang [7] uses Twitter data to build his own three graph-based and content-based features from
the 20 most recent tweets of an account for spam bot detection. He observed that if an account posts
duplicate messages, it could be termed a spam account. For classification, he used a
Bayesian classifier, since it is robust to noise and performs better on users'
specific patterns. He used Twitter's API methods and developed his own web crawler to
collect the datasets for his experiments. The results of A.H. Wang's paper showed that
approximately 1% of the accounts in the datasets he collected were spam accounts and that
approximately 3% of Twitter is spam.
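The duplicate-message observation can be illustrated with a minimal Python sketch. This is not Wang's implementation: the function names, the whitespace and case normalization, and the 0.5 threshold are all hypothetical choices made here for illustration, and the tweets are assumed to be ordered from oldest to newest.

```python
from collections import Counter


def normalize(tweet: str) -> str:
    """Lower-case a tweet and collapse whitespace so trivial variants match."""
    return " ".join(tweet.lower().split())


def duplicate_ratio(tweets: list[str], window: int = 20) -> float:
    """Fraction of the `window` most recent tweets that repeat another tweet."""
    recent = [normalize(t) for t in tweets[-window:]]
    if not recent:
        return 0.0
    counts = Counter(recent)
    # Every occurrence beyond the first copy of a message counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(recent)


def looks_like_spam(tweets: list[str], threshold: float = 0.5) -> bool:
    """Flag an account whose duplicate ratio reaches the (hypothetical) threshold."""
    return duplicate_ratio(tweets) >= threshold
```

In practice such a ratio would be only one content-based feature fed to a classifier, not a verdict on its own.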
Maarten Bosma et al. [8] used the HITS link analysis algorithm for their research. They also
used spam reports for the purpose of spam detection. They used three unsupervised models,
5. DATASET COLLECTION
Data collection is very important in such an experiment. Finding the spammers on social websites
requires a large amount of data so that correct analysis and inference can be made.
The information required for our analysis is publicly available on both of the social networking
sites that we analysed, i.e. Twitter and Facebook.
From Twitter, the data was accessed using its APIs. These APIs give access to almost
all of the data that a user requests. For this purpose, Twitter provides a Consumer Key (API
Key), Consumer Secret (API Secret), Access Token and Access Token Secret, which are
used to authenticate data access.
From Facebook, the data was collected using its Graph API (version 2.0). Our approach to data
collection on Facebook was highly indirect because Facebook deprecated capabilities of its
Graph API in an attempt to strengthen the security, integrity and privacy of its
users.
contain that attribute value. The totals depict the number of instances that belong to each
class, e.g. the number of accounts that are Spam and Not Spam.
Table 1. Classes Assignment
Class
0
total
Confusion Matrix of Naïve Bayes
a b <-- classified as
27 | a = 0
87 | b = 1
Confusion Matrix of SMO
a b <-- classified as
27 | a = 0
69 | b = 1
After taking the values of TP, FP, TN and FN from the above matrices, Accuracy, Recall and Precision
can be calculated as follows.
Accuracy is calculated by
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Recall is calculated by
$\mathrm{Recall} = \frac{TP}{TP + FN}$
Precision is calculated by
$\mathrm{Precision} = \frac{TP}{TP + FP}$
From the above calculated data one can see that SMO is more efficient as compared to Naïve
Bayes in these parameters.
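The three measures follow directly from the four confusion-matrix counts. The minimal Python sketch below uses the standard definitions. The matrix values in the example are hypothetical, not the counts from this paper. Treating class 1 as the positive class is also an assumption, since the text does not say which label denotes spam. The matrix layout follows the Weka convention of rows for the actual class and columns for the predicted class.

```python
def counts_from_matrix(matrix, positive=1):
    """Extract (TP, FP, TN, FN) from a 2x2 matrix indexed as matrix[actual][predicted]."""
    negative = 1 - positive
    tp = matrix[positive][positive]
    fn = matrix[positive][negative]
    fp = matrix[negative][positive]
    tn = matrix[negative][negative]
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, recall and precision, guarding against empty denominators."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
example = [[2, 4],
           [8, 6]]
metrics = classification_metrics(*counts_from_matrix(example))
```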
[Bar chart of Accuracy, Recall and Precision for Naïve Bayes and SMO; vertical axis from 0 to 0.6]
Figure 2. Comparison between the Values for both the classifiers (Twitter Dataset)
publicly available data. No personal information of the users can be gathered using the particular
API. In the versions earlier than the current version which is version 2.5 of the API, Facebook
allowed for publicly available user data to be accessed using the particular user name or the
unique user id that Facebook assigns to each node in its social graph. This capability has been
deprecated in the current version of the API due to security reasons. The API does not allow for
gathering personally identifiable information either.
Rate limiting for the calls made to the API is another factor that restricts the amount of data that
can be collected from the social giant. Unlike on Twitter, rate limiting on Facebook isn't done
just on a per-user basis. It is calculated by taking the number of users our app had the previous
day and adding today's new logins, which gives the base number of users for our app. Each app is
allocated 200 API calls per user in any 60-minute window. For instance, if our app had 10 users
yesterday and 5 new logins today, that would give us a base of 15 users. This means that our app
can make ((10+5)*200) = 3000 API calls in any 60-minute window.
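The budget arithmetic can be checked with a minimal Python sketch. The function and constant names are hypothetical; only the rule of 200 calls per base user per 60-minute window comes from the text above. With 10 users yesterday and 5 new logins today, (10 + 5) × 200 evaluates to 3000.

```python
CALLS_PER_USER = 200  # calls allowed per base user in any 60-minute window


def hourly_call_budget(users_yesterday: int, new_logins_today: int,
                       calls_per_user: int = CALLS_PER_USER) -> int:
    """Return the number of API calls an app may make in a 60-minute window."""
    base_users = users_yesterday + new_logins_today
    return base_users * calls_per_user


print(hourly_call_budget(10, 5))  # prints 3000
```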
8. RESULTS
Naïve Bayes and Support Vector Machines were used to train the classifier. Simple k-means
segregated the data set into spam and non-spam categories. Naïve Bayes and Support Vector
Machines were compared on the basis of precision, recall and accuracy. Support Vector Machines
showed better results with respect to all three parameters, both qualitatively and
quantitatively.
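As a rough illustration of the unsupervised step, the following minimal Python sketch implements plain k-means with k = 2. It is not the Weka implementation listed in the keywords. The two features per account (fraction of posts containing links, fraction of duplicate posts) and the sample values are hypothetical. The initial centroids are simply the first k points so that the run is deterministic.

```python
import math


def kmeans(points, k=2, iterations=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Deterministic start (an assumption of this sketch): first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical accounts: (fraction of posts with links, fraction of duplicates).
accounts = [(0.1, 0.0), (0.9, 0.8), (0.2, 0.1), (1.0, 0.9), (0.0, 0.1), (0.8, 1.0)]
labels, centers = kmeans(accounts, k=2)
```

k-means only groups the accounts. Deciding which cluster corresponds to spam still requires inspecting the centroids or a labelled sample.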
Parameters    Values
Accuracy      0.375
Precision     0.5
Recall        0.46
Table 3. SMO
Parameters    Values
Accuracy      0.45
Precision     0.5625
Recall        0.6
Parameters    Values
Accuracy      0.3
Precision     0.6
Recall        0.2
Table 5. SMO
Parameters    Values
Accuracy      0.8
Precision     0.6
Recall        1
ACKNOWLEDGEMENTS
We are indebted to Prof. (Dr.) Vanita Jain, Head of Department, Bharati Vidyapeeth College of
Engineering, and our mentor, Mrs Sarita Yadav, for their helpful guidance. We gratefully
acknowledge our friends for their continuous support and discussions.
REFERENCES
[1] http://www.adweek.com/socialtimes/nexgate-spam-study/428835
[2] http://www.symantec.com/content/en/us/enterprise/other_resources/bistr_main_report_v19_21291018.en-us.pdf
[3] http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
[4] http://www.fastcompany.com/3044485/almost-10-of-twitter-is-spam
[5] http://www.statista.com/statistics/264810/number-of-monthly-active-Facebook-users-worldwide/
[6] http://www.huffingtonpost.com/james-parsons/Facebooks-war-continues-against-fake-profiles-andbots_b_6914282.html?ir=India&adsSiteOverride=in
[7] Alex Hai Wang, "Detecting Spam Bots in Online Social Networking Sites: A Machine Learning
Approach", Proceedings of the 24th annual IFIP WG 11.3, Berlin, Germany, 2010, pp. 335-342.
[8] Maarten Bosma, Edgar Meij and Wouter Weerkamp, "A Framework for Unsupervised Spam
Detection in Social Networking Sites", Proceedings of the 34th European Conference on
Information Retrieval, Berlin, Germany, 2012, pp. 602-608.
[9] Ritesh Kumar, Shital Ghadge and G.S. Navale, "Spam Detection using Approach of Data Mining for
Social Networking Sites", International Journal of Computer Applications, 2014.
[10] Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna, "Detecting Spammers on Social
Networks", Proceedings of the 26th Annual Computer Security Applications Conference, New York,
USA, 2010, pp. 1-9.
[11] Enhua Tan, Lei Guo, Songqing Chen, Xiaodong Zhang and Yihong (Eric) Zhao, "UNIK:
Unsupervised Social Network Spam Detection", Proceedings of the 22nd ACM International
Conference on Information and Knowledge Management (CIKM 2013), San Francisco, CA, USA,
October 27-November 1, 2013.
[12] https://securelist.com/analysis/quarterly-spam-reports/69932/spam-and-phishing-in-the-first-quarterof-2015/
[13] http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/