Kinza Zahra, Farooque Azam, Wasi Haider Butt and Fauqia Ilyas
Abstract Social network analysis studies the social connections among a set of
people. People maintain numerous identities on various online social sites.
User-related network data carries distinctive information that reveals user
interests, behavioral patterns, and political views. Using these behaviors
individually and collectively is of great help in recognizing users across social
networks. A systematic literature review (SLR) has been performed, covering 31
papers published during 2010–2018. The aim is to determine the user identification
categories that are used to classify users and, furthermore, to identify the
algorithms, models, methods, and tools that have been suggested since 2010 for user
characterization. We have identified 10 algorithms, 19 models, 5 methods, and 8
tools that have been proposed for 5 user identification categories. Finally, our
evaluation indicates that text mining techniques are promising approaches for the
identification of users on online social networks.
1 Introduction
Online social networks (OSNs) such as Facebook, Twitter, and Reddit have become
extremely popular over the past decade and are now among the most common
communication tools [1]. To integrate these OSN sites, it is essential to discover
the identity of a user [2]. User identification based on text has attracted the
attention of many researchers. Identifying from text a user's behavioral patterns
[3] and demographic characteristics such as age and gender [4] is essential in
forensics, security, and advertising. For instance, one may want to learn about the
behavior of the author of an aggressive or criminal text message, or an
organization may want to learn the demographic attributes of individuals who like
or dislike its products, based on blogs and online product reviews.
The prevailing literature can be categorized into two groups: systematic and
traditional literature reviews. Traditional literature reviews mainly cover
state-of-the-art work and current research trends, while systematic literature
reviews focus on answering specific research questions, here those involving user
identification. Despite the large number of empirical studies on user
identification genres, models, and algorithms, there was a need to bring these
frameworks from the same domain together in one review so that they can be compared
with respect to performance, accuracy, and validation against state-of-the-art
machine learning algorithms. As far as we know, no systematic literature review
concentrates on online user identification through text mining techniques, which
motivates our efforts in this study. The purpose of this SLR is to answer the
following research questions:
RQ1: Which user identification categories are the main focus of research on
online social networks through text classification?
RQ2: Which models and algorithms are used to identify users through text
classification?
RQ3: What are the current tools for user identification in the text classification
research area?
The paper is structured as follows. Section 2 describes the methodology followed
in this paper. Section 3 presents the results. Section 4 presents the discussion
and limitations. Conclusions and future work are given in Sect. 5.
2 Methodology
This paper was carried out as a systematic literature review based on the original
guidelines presented by Kitchenham [5]. It is intended to identify, assess, and
understand the available material on research contributions that are relevant to
user identification and to the research questions stated above.
To lessen at least the possibility of results being writer’s
User Identification on Social Networks Through Text Mining … 487
The search process of our literature review commenced with the selection of
digital libraries and research questions. This selection plays a key part in
ensuring the comprehensiveness and completeness of the accumulated papers, as
illustrated in Table 1. The following steps are carried out in the search process,
as shown in Fig. 1.
• Digital libraries (IEEE, Springer, Elsevier, ACM, and Taylor & Francis) are
searched.
• Journal and conference papers are filtered by title, keywords, and abstract using
the inclusion and exclusion criteria.
• Relevant search terms were derived from the RQs.
• Boolean AND was applied to restrict the search.
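The search steps above can be sketched in a few lines of Python. This is only an illustration: the search terms, the sample records, and the helper names `build_query` and `matches_all` are hypothetical and are not taken from the review protocol.

```python
# Illustrative sketch of the search step; all terms and records are hypothetical.

def build_query(terms):
    """Join search terms with Boolean AND, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " AND ".join(quoted)


def matches_all(paper, terms):
    """Return True if every term occurs in the title, keywords, or abstract."""
    text = " ".join(
        [paper["title"], " ".join(paper["keywords"]), paper["abstract"]]
    ).lower()
    return all(term.lower() in text for term in terms)


terms = ["user identification", "text mining", "social network"]

papers = [
    {
        "title": "User identification on microblogs",
        "keywords": ["text mining", "social network"],
        "abstract": "We study authorship.",
    },
    {
        "title": "Spam detection in e-mail",
        "keywords": ["text mining"],
        "abstract": "A filtering model.",
    },
]

query = build_query(terms)
selected = [p["title"] for p in papers if matches_all(p, terms)]

print(query)
print(selected)
```

Because AND requires every term to be present, adding a term can only shrink the result set, which is why it is used here to restrict the search.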
488 K. Zahra et al.
3 Results
All papers have been analyzed to find out to which user identification category
the authors contribute text-based research results. The user identification
categories, i.e., behavioral, attribute, topic, spam, and crime, are discussed in
RQ1. Among these categories, behavioral and attribute are the two most frequently
used; together they were mentioned by 65% of the selected papers, as illustrated
in Table 5. Compared to the other identification categories, behavioral and
attribute seem to have received considerable research attention over the years.
Some studies in the literature cover two or more user identification categories in
one study. From the data extracted to answer this research question, it emerges
that behavioral, attribute, spam, and crime identification models and algorithms
are cited by both journal and conference papers, while topic identification
frameworks are mentioned only in conference papers.
Different models that aim to identify users, their behavioral patterns, and their
attributes are listed under this research question. The studies reporting the
identification of user behaviors through short text were [11, 21]. In publications
[7, 13, 28, 32], users' interest and influence regarding responsiveness, along with
communication and exploration, were identified. Demographic features of users such
as age, gender, education, date, and email address were identified by the models
cited in [4, 16, 33]. Spam emails in [14] were detected by using the anti-spam
model, as shown in Table 6.
The Gibbs sampling algorithm, which is used for feature selection in studies [21,
24, 30], is the only algorithm used in multiple papers. Algorithms can be applied
in supervised, unsupervised, and semi-supervised learning, depending on the
dataset. In [11], a semi-supervised learning algorithm for partially labeled data
was used to train on the data until it is labeled completely. Most of the
algorithms in this study were used in supervised learning. Random forest [12] and
machine learning and compression [19] algorithms were used for classification and
regression.
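The idea of training on partially labeled data until everything is labeled can be illustrated with a generic self-training loop. The sketch below is not the method of [11], which is SVM-based: to stay dependency-free it swaps in a simple nearest-centroid classifier over bag-of-words counts, and the function names and toy messages are hypothetical.

```python
# Generic self-training sketch. Assumption: a nearest-centroid classifier
# stands in for the SVM used in the cited work; the data is made up.
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


def self_train(labeled, unlabeled):
    """Repeatedly label the most confidently classified text and retrain."""
    if not labeled:
        raise ValueError("at least one labeled example is required")
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        # "Training": sum the word counts of each class into a centroid.
        centroids = {}
        for text, label in labeled:
            centroids.setdefault(label, Counter()).update(vectorize(text))
        # Pick the unlabeled text with the largest margin between the
        # best and second-best class similarity.
        best = None
        for text in pool:
            vec = vectorize(text)
            ranked = sorted(
                ((cosine(vec, c), lab) for lab, c in centroids.items()),
                reverse=True,
            )
            runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
            margin = ranked[0][0] - runner_up
            if best is None or margin > best[0]:
                best = (margin, text, ranked[0][1])
        _, text, label = best
        labeled.append((text, label))  # pseudo-label joins the training set
        pool.remove(text)
    return labeled


seed = [("win free prize now", "spam"), ("meeting agenda for monday", "ham")]
unlabeled = ["free prize inside", "monday meeting notes", "claim prize inside today"]
result = dict(self_train(seed, unlabeled))
print(result)
```

Labeling one text per round keeps the example easy to follow; a practical variant would usually add a confidence cutoff so that uncertain texts stay unlabeled rather than being forced into a class.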
This section of the study presents the tools that are used for user identification
in 9 research papers, as shown in Table 8. The basic purpose of these tools is to
reduce the ambiguity in the text present on different social network platforms.
The tools covered by this research question come from both the research community
and the public sector. Multiple third-party tools [1] such as Browsing API, SURBL,
and Spamhaus, from both the research and private sectors, and automated tools [12]
detect malicious URLs and spam tweets. A knowledge-based tool [3] was developed, in
contrast with statistical approaches, to analyze and extract knowledge from each
sentence in order to specify its sentiment status. Some tools evaluate documents
automatically: ROUGE [27] is used to compare generated summaries with summaries
created by experts. LIWC [9, 28] is the only tool in this review that is used by
two studies to recognize behavioral patterns. The University of Texas at Austin
built the IITP tool to describe the criminal process, vulnerabilities, and
resources that enable criminals to commit a crime. Two natural language processing
tools, a POS tagger [8] and a word segmentation tool [15], were used to extract
features from tweets and to recognize words, respectively.
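To make the comparison between a generated summary and an expert summary concrete, the following minimal sketch computes a ROUGE-1 style recall, i.e., the fraction of reference unigrams that also occur in the candidate, with clipped counts. It is not the official ROUGE toolkit and is not taken from [27]: tokenization is plain whitespace splitting, no stemming or stop-word removal is applied, and the function name and example sentences are hypothetical.

```python
# Simplified ROUGE-1 recall; an illustrative approximation, not the official tool.
from collections import Counter


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that are covered by the candidate summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clip each word's credit to the number of times the candidate contains it.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


generated = "the cat sat"
expert = "the cat sat on the mat"
print(rouge1_recall(generated, expert))
```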
In this study, we identified and evaluated text mining techniques that help to
distinguish users, their demographic features, and their behavioral patterns while
they communicate on different social networking sites. The data used in the
reviewed studies is mostly gathered from Twitter: out of 31 studies, 13 used
Twitter data for their experimentation. Other data sources include Facebook, blogs,
documents, reports, and instant messages.
In text classification, machine learning techniques vary across datasets.
Different text requires a different set of features and ML techniques.
Preprocessing, a data mining technique that transforms data into an understandable
format, was used in 14 papers, mostly where natural language processing is
performed on the text, to attain optimal results. Correct feature selection
increases the accuracy and performance of the classifier. The most frequently used
feature selection techniques were POS tagging and TF-IDF; the use of these
techniques improved the performance of the machine learning models. This finding
suggests that future studies adopt semantic-based features and demographic features
together to achieve higher performance.
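As a brief illustration of TF-IDF weighting, the sketch below uses one common textbook variant: term frequency is the count of a word divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the word. The reviewed studies may use other variants (for example with smoothing), and the function name and sample texts are hypothetical.

```python
# Textbook TF-IDF sketch; the exact variant is an assumption.
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {
                word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
                for word, count in counts.items()
            }
        )
    return weights


docs = ["great phone great battery", "battery died fast", "great service"]
weights = tf_idf(docs)

# Words that are rare across the collection get the highest weight.
top_word = max(weights[0], key=weights[0].get)
print(top_word, round(weights[0][top_word], 4))
```

A word that occurs in every document receives weight zero under this variant, which is what lets TF-IDF suppress uninformative terms before classification.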
Classifiers are applied in a supervised learning setting to validate the
experimental results. In this review, classifiers are used by 14 studies, and the
support vector machine (SVM) is used in 9 studies, either alone or combined with
other classifiers. The selected studies suggest that SVM performs much better in
text classification than other classifiers such as random forest and naïve Bayes.
For user categorization, we identified 19 models, 10 algorithms, 8 tools, and 5
methods. In some studies, such as [11, 13, 23, 26, 28, 30], they are used together
to improve performance. This review shows that frameworks perform differently on
each dataset, depending on the size and type of text in the dataset. Therefore,
before deciding on models, algorithms, and ML techniques, professionals should not
only know about their performance but also understand the characteristics of the
frameworks.
Table 9 shows the comparison of text mining techniques (models,
algorithms/methods, and tools) that have been proposed for user identification,
based on the type of datasets, pre-processing, feature selection, classifier, and
validation. It has been observed that algorithms/methods are validated against
state-of-the-art machine learning techniques more often than the identified models
and tools. The comparison based on pre-processing likewise shows that most
algorithms/methods used a pre-processing step, in contrast to tools and models.
In this review, only the accuracy metric is considered for assessing the
performance of text mining techniques. If a model or algorithm performs below the
minimum accuracy threshold, practitioners will reject it. However, other evaluation
metrics, such as propagation ability and accountability, are ignored in this review
and should also be considered. Table 9 shows the comparison of all 31 papers
selected in this review.
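The accuracy-based acceptance rule described above can be written down as a short sketch. The threshold value of 0.8, the function names, and the labels are hypothetical; the review does not state a specific minimum threshold.

```python
# Accuracy with a hypothetical acceptance threshold.

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def is_acceptable(y_true, y_pred, threshold=0.8):
    """Reject a model whose accuracy falls below the minimum threshold."""
    return accuracy(y_true, y_pred) >= threshold


true_labels = ["spam", "ham", "spam", "spam"]
predictions = ["spam", "ham", "ham", "spam"]

print(accuracy(true_labels, predictions))
print(is_acceptable(true_labels, predictions))
```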
Table 9 Comparison of text mining techniques
Research Datasets Pre-processing Feature selection Classifier used Tools Models Algorithms/methods Validation References
√ √
R1 Facebook [1]
√ √
R2 ISEAR BOW (Bag of words) Ensemble and Naïve Bayes [3]
√ √ √
R3 Twitter POS tagging SVM and random forest [4]
√ √ √ √
R4 Text document TF and term weight SVM and Naïve Bayes [26]
√ √ √
R5 Blogs and news reports [27]
√ √ √ √
R6 Reddit TF-IDF [28]
√ √
R7 Last.fm Ensemble and SVM [25]
√ √
R8 Twitter and blogs POS tagging SVM and Naïve Bayes [29]
√ √
R9 Twitter [30]
√
R10 Twitter [20]
√ √ √ √
R11 Twitter TF-IDF K-means [21]
√ √ √
R12 WITS N-grams [22]
√ √ √ √ √
R13 News and reports POS tagging and named entity recognition [23]
√ √ √ √
R14 Corpus LDA SVM [24]
√ √ √ √
√
R17 SinaWeibo [10]
√ √ √
R18 SinaWeibo SVM [11]
√ √
R19 Blog data SenticNet and POS tagging Ensemble [6]
√
R20 Yelp [7]
√ √ √
R21 Twitter ReLF Random forest [12]
√ √
R22 Instant messages POS tagging and feature based selection SVM and neural networks [15]
√ √ √
R23 SinaWeibo Density based selection [13]
√
R24 Email and blogs TF-IDF SVM [16]
√
R25 Twitter, email and IM Stylometry [17]
√
R26 Facebook and Twitter [18]
√ √ √
R27 Twitter BOW (Bag of words) SVM and Naïve Bayes [19]
R28 Facebook and Twitter Content based feature [31]
√ √
R29 Facebook and Twitter [32]
√
R30 Facebook and Twitter [33]
√
R31 Email [14]
References
1. Gao H, Hu J, Wilson C, Li Z, Chen Y, Zhao BY (2010) Detecting and characterizing social spam
campaigns. In: Proceedings of the 10th ACM SIGCOMM conference on Internet measurement,
Nov 2010. ACM, pp 35–47
2. Tuna T, Akbas E, Aksoy A, Canbaz MA, Karabiyik U, Gonen B, Aygun R (2016) User
characterization for online social networks. Soc Netw Anal Mining 6(1):104
3. Perikos I, Hatzilygeroudis I (2016) Recognizing emotions in text using ensemble of classifiers.
Eng Appl Artif Intell 51:191–201
4. Sboev A, Litvinova T, Gudovskikh D, Rybka R, Moloshnikov I (2016) Machine learning
models of text categorization by author gender using topic-independent features. Proc Comput
Sci 101:135–142
5. Kitchenham B (2004) Procedures for performing systematic reviews, Keele, UK, Keele
University, vol 33, no 2004, pp 1–26
6. Poria S, Cambria E, Gelbukh A, Bisio F, Hussain A (2015) Sentiment data flow analysis by
means of dynamic linguistic patterns. IEEE Comput Intell Mag 10(4):26–36
7. Qian X, Feng H, Zhao G, Mei T (2014) Personalized recommendation combining user interest
and social circle. IEEE Trans Knowl Data Eng 26(7):1763–1777
8. Murkute AM, Gadge J (2015) Framework for user identification using writeprint approach.
In: 2015 international conference on technologies for sustainable development (ICTSD), Feb.
IEEE, pp 1–5
9. Amuchi F, Al-Nemrat A, Alazab M, Layton R (2012) Identifying cyber predators through
forensic authorship analysis of chat logs. In: 2012 third cybercrime and trustworthy computing
workshop (CTC), Oct. IEEE, pp 28–37
10. Wang J, Liu Z, Zhao H (2014) Group recommendation using topic identification in social
networks. In: 2014 sixth international conference on intelligent human-machine systems and
cybernetics (IHMSC), vol 1, Aug. IEEE, pp 355–358
11. Yin C, Xiang J, Zhang H, Wang J, Yin Z, Kim JU (2015) A new SVM method for short
text classification based on semi-supervised learning. In: 2015 4th international conference on
advanced information technology and sensor application (AITS), Aug. IEEE, pp 100–103
12. Meda C, Ragusa E, Gianoglio C, Zunino R, Ottaviano A, Scillia E, Surlinelli R (2016) Spam
detection of Twitter traffic: a framework based on random forests and non-uniform feature
sampling. In: 2016 IEEE/ACM international conference on advances in social networks analysis
and mining (ASONAM), Aug. IEEE, pp 811–817
13. Guo H, Chen Y (2016) User interest detecting by text mining technology for microblog
platform. Arab J Sci Eng 41(8):3177–3186
14. Zhang Y, He J, Xu J (2018) A new anti-spam model based on e-mail address concealment
technique. Wuhan Univ J Nat Sci 23(1):79–83
15. Ding Y, Meng X, Chai G, Tang Y (2011) User identification for instant messages. In: Neural
information processing. Springer Berlin/Heidelberg, pp 113–120
16. Ma J, Teng G, Chang S, Zhang X, Xiao K (2011) Social network analysis based on authorship
identification for cybercrime investigation. Intell Secur Inf 27–35
17. Frommholz I, Al-Khateeb HM, Potthast M, Ghasem Z, Shukla M, Short E (2016) On
textual analysis and machine learning for cyberstalking detection. Datenbank-Spektrum
16(2):127–135
18. Chavoshi N, Hamooni H, Mueen A (2016) Identifying correlated bots in twitter. In: International
Conference on Social Informatics, Nov. Springer International Publishing, pp 14–21
19. Santos I, Minambres-Marcos I, Laorden C, Galán-García P, Santamaría-Ibirika A, Bringas
PG (2014) Twitter content-based spam filtering. In: International joint conference SOCO’13-
CISIS’13-ICEUTE’13. Springer, Cham, pp 449–458
20. Zhou X, Wu B, Jin Q (2017) User role identification based on social behavior and networking
analysis for information dissemination. Future Gener Comput Syst
21. Qiu Z, Shen H (2017) User clustering in a dynamic social network topic model for short text
streams. Inf Sci 414:102–116
22. Sharef NM, Martin T (2015) Evolving fuzzy grammar for crime texts categorization. Appl Soft
Comput 28:175–187
23. Zaeem RN, Manoharan M, Yang Y, Barber KS (2017) Modeling and analysis of identity threat
behaviors through text mining of identity theft stories. Comput Secur 65:50–63
24. Liang J, Liu P, Tan J, Bai S (2014) Sentiment classification based on AS-LDA model. Proc
Comput Sci 31:511–516
25. Chelmis C, Prasanna VK (2013) Social link prediction in online social tagging systems. ACM
Trans Inf Syst (TOIS) 31(4):20
26. Manne S, Fatima SS (2012) An extensive empirical study of feature terms selection for text
summarization and categorization. In: Proceedings of the second international conference on
computational science, engineering and information technology, Oct. ACM, pp 606–613
27. Chakraborti S (2015) Multi-document text summarization for competitor intelligence: a
methodology based on topic identification and artificial bee colony optimization. In:
Proceedings of the 30th annual ACM symposium on applied computing, Apr. ACM, pp 1110–1111
28. Choi D, Han J, Chung T, Ahn YY, Chun BG, Kwon TT (2015) Characterizing conversation
patterns in Reddit: from the perspectives of content properties and user participation behaviors.
In: Proceedings of the 2015 ACM on conference on online social networks, Nov. ACM, pp
233–243
29. Inches G, Crestani F (2011) Online conversation mining for author characterization and topic
identification. In: Proceedings of the 4th workshop on workshop for Ph.D. students in
information & knowledge management, Oct. ACM, pp 19–26
30. Zhao Y, Liang S, Ren Z, Ma J, Yilmaz E, de Rijke M (2016) Explainable user clustering in short
text streams. In: Proceedings of the 39th international ACM SIGIR conference on research and
development in information retrieval, July. ACM, pp 155–164
31. O’Riordan S, Feller J, Nagle T (2016) A categorisation framework for a feature-level analysis
of social network sites. J Decis Syst 25(3):244–262
32. Son JE, Lee SH, Cho EY, Kim HW (2016) Examining online citizenship behaviours in social
network sites: a social capital perspective. Behav Inf Technol 35(9):730–747
33. Riedl C, Köbler F, Goswami S, Krcmar H (2013) Tweeting to feel connected: a model for social
connectedness in online social networks. Int J Hum-Comput Interact 29(10):670–687