Classification of Hate Tweets and Their Reasons using SVM

Natalya Tarasova
Kazunori Matsumoto

18 January 2016
Sammanfattning

Over the past fifteen years, the Internet has become an arena where more and more of our daily activities take place: we retrieve and spread information in the form of text, sound and images, shop, book trips and experiences, take online courses, etc. Through the Internet we collaborate with others and create meeting places. Today we can reach out and be reachable through different platforms. Depending on which groups one wants to reach, different tools are available. For example, Facebook is most often used to keep in touch with family, friends and acquaintances, whereas LinkedIn is used as a platform for professional contacts.
The popularity of social media has created a rich source of searchable data for analyzing what people feel, think and do [9, 10]. Researchers are therefore highly interested in using these data to study everything from trends and opinion polls to the spread of influenza [3, 4]. Companies have also realized the value of using information from social media to understand what their customers think about the services and products the companies provide.
In this work we have focused on the social medium Twitter. The basic idea of Twitter is that anyone, at any time, should be able to reach out to others by publishing a message of at most 140 characters. Such a message is called a tweet. The work was carried out at Social Media Labs, KDDI R&D. Social Media Labs conducts research on, among other things, what users in different parts of the world think about a certain product or service. The result is visualized on a map and gives a quick overview of geographical trends.
We investigated whether it is possible to identify the reason why users express hate in tweets directed at the mobile carriers Verizon, AT&T and Sprint. After reading hate tweets from about 500 users, we could conclude that it was possible to find explanations for why Twitter users express hate towards their mobile carriers. The research question of this study was then made concrete: the goal became to develop a method that makes it possible to identify hate tweets and the reasons that caused them. The study resulted in two methods: a "naive" method (the Naive Method, NM) and a more "advanced" method (the Partial Timeline Method, PTM).
Tweets were collected through a Twitter search consisting of the word "hate" together with the company name, i.e. Verizon, AT&T or Sprint. If this search produced a hit, the entire timeline, a chronologically ordered stream of tweets, for that user was stored in a local database. The timelines were then studied and the tweets were manually classified into one of four categories: Hate, Reason, Explicit and Other. Tweets containing the word "hate" and the company name were classified as Hate. Tweets published on the same day as the hate tweet and describing the user's problem with the mobile carrier were classified as Reason. Messages that contained both an expression of hate and a reason were assigned to the category Explicit. Messages that did not belong to any of these three categories were classified as Other.
The model for PTM proved to be more accurate than that for NM. On the other hand, PTM does not include all messages from the user's timeline but analyzes only the tweets published within ±30 minutes of the publication of the hate tweet. The choice between PTM and NM is thus a trade-off. If accurate classification is important, for example if the result is to be used in a larger automated process, PTM is more suitable than NM. If, instead, a thorough processing of all tweets is desired, NM should be used.
Acknowledgements
I wish to thank my subject reader, Sofia Cassel, for support, productive discussions and
guidance. I also want to thank Social Media Lab at KDDI R&D for their warmth and
hospitality.
Finally, I wish to thank the Sweden Japan Foundation for travel funding.
Contents

Sammanfattning
Acknowledgements
1 Introduction
  1.1 Twitter
  1.2 Background
  1.3 Motivation
2 Theory
  2.1 Data Retrieval
  2.2 Text Mining
  2.3 Supervised Machine Learning
    2.3.1 Support Vector Machines (Geometrical interpretation)
  2.4 SVM Data Format
3 Related Work
  3.1 Internal Enhancement of Feature Space
  3.2 External Enhancement of Feature Space
  3.3 Marketing Potential of Twitter
  3.4 Antagonism in Tweets
4 Method
  4.1 Defining hate tweets and reasons
  4.2 Collection and Pre-processing
  4.3 Training and testing
  4.4 The Naive Method (NM)
  4.5 The Partial Timeline Method (PTM)
6 Future Work
A
  A.1 Collected data set
  A.2 AIC
Bibliography
List of Figures

2.1 Data from class A and class B separated by two different planes: one represented as a dashed line and another one represented as a solid line [1].
2.2 The maximal margin is represented by the line that goes through the points d and c and is orthogonal to the hyperplane [1].
2.3 Two distinct classes, represented by blue and red dots, cannot be separated by a maximal margin hyperplane [2].
List of Tables
Chapter 1
Introduction
The proliferation of micro-blogging platforms and networking services has led to new
and previously unexplored opportunities to disseminate information. Many users share
their opinions, thoughts and feelings and express their needs and desires using social
media on a daily basis. At first sight, the disseminated information seems to be of a
diverse and chaotic character. Nevertheless, if it is not viewed in isolation but rather
as a part of a larger context, it can form a coherent structure that reveals hidden
relationships. This fact together with the relative simplicity of information retrieval
has led to the appearance of new domains of research within data mining, text mining,
machine learning and natural language processing. For example, some studies have
focused on predicting trends [3, 4] and revenues [5], recommending products and
services [6–8], and analyzing sentiment towards various topics such as brands and
celebrities [9, 10].
Businesses and industries followed the example of the academic world and started to
explore marketing opportunities in social media. For customers, social media has become
an important arena for sharing experiences and finding information about companies,
their products, and services. Online networking offers a useful source of information
for both parties. Consumers can find up-to-date information, and companies can mine
the opinions of users on their products and services. Ideas for improvement and for the
development of new products can be found, and new opportunities for user-user,
company-user and user-company interaction are created. Due to the improved knowledge
about the customer and the ease of reaching out to him or her, advertising is becoming
more personalized. The behavioural patterns of users are collected and analyzed to create
an individual-oriented relationship to the company. The customers, for their part, can
interact with each other, gathering more knowledge about the diversity of products and
services on the market. Due to the size of the available information, both customers and
companies need reliable tools for getting a quick and simple overview of the available
information. Computational research based on the analysis of tweets therefore plays a
part in producing such tools [11].
Due to the size and the varying quality of the information available on the Internet,
the process of finding and extracting relevant information has become more difficult
and time consuming. In order to overcome this problem, a wide range of clustering
and summarization methodologies were developed. Due to the dynamic nature of Web
content, there is a constant need to improve and adjust already existing methods.
1.1 Twitter
Twitter is a microblogging website where users share their experiences, opinions and
feelings in 140-character messages called tweets. Tweets can be posted from
various mobile and desktop clients and even by other applications on behalf of the
user. A user is a person who posts a tweet. Every user has a unique user name and
id number associated with it. A user can have followers, people who get updated on
what the user has posted. Other users can comment on the content of posted tweets
or share them, in other words retweet them. It is also possible to like the content of a tweet
by clicking the Favorite button. The user can manually designate a topic for the tweet
by adding a hashtag.
Tweets are very noisy data due to their short length, informal language, and lack of
context or background knowledge. This makes the analysis of tweets a challenging
task.
1.2 Background
Prior to the project described in this report, another study on the classification of
tweets using Support Vector Machine (SVM) was conducted. The goal of the study was
to explore how well the model could be used for tweets. Therefore, the classification
categories were taken from CJM. During the labeling of the tweets, it was noticed
that the number of the hate tweets about the mobile carriers and their services was
surprisingly high. For example, a tweet ”I hate Verizon” conveys a strong feeling but
does not explain what the specific problem is. Hate tweets are the tweets that explicitly
express hate towards the company (in this case only mobile carriers but the concept
can be extended to other branches) or its products and/or services. The lack of an
explanation makes it impossible to classify a hate tweet according to CJM.
The negative sentiment of tweets damages the reputation of the company. Therefore,
an understanding of the underlying reasons would provide a basis for better decision
making.
1.3 Motivation
The question that prompted this study was whether it is possible to determine the
reason for hate tweets. From reading the timelines that included such tweets, we
concluded that in many cases it was possible to determine the source of dissatisfaction
with the services or products provided by the mobile operators. We also noticed that
the majority of hate tweets in the data collection had an explanation either before or
after the hate tweet. Therefore, the primary goal of the study became the development
of a method for identifying hate tweets and their reasons.
Chapter 2
Theory
2.1 Data Retrieval
Tweets and the information related to them, such as the name of the user, publication
date, location (if enabled by the user), profile information etc., can be downloaded from
Twitter using one of the numerous Twitter Application Programming Interfaces
(APIs). Twitter APIs have to be accessed using an authenticated request. The request
is conducted through the Open Authentication protocol (OAuth). The protocol
defines how an application requesting access to one of the APIs must submit the
credentials issued by Twitter. To receive these credentials, the application has to be
registered on Twitter [16].
Twitter provides the developer with several options when it comes to the choice of an
API. The most used ones are the Streaming API and the REST API. There are three major
differences between the Streaming and REST APIs:
• data access
Streaming APIs work in online mode providing continuous access to newly posted
tweets. Once a connection between the application and Twitter is established, the
tweets will start flowing into the system. REST APIs collect only data that has
already been posted on Twitter, pulling a certain number of tweets per request.
Only tweets published within the last week can be collected using REST APIs.
• rate limits
Streaming APIs are allowed to stream 5,000 user ids concurrently [17]. The APIs’
rate limit window duration is 15 minutes long [18]. Users represented by access
tokens can make 180 requests/queries per time window. Using application-only
authentication, an application can make 450 queries/requests per 15 minutes on
its own behalf.
• search function
Search queries can be built from the key words, phrases, names of the users, dates
etc. It is also possible to include additional parameters in the search query. For
example, the search can be restricted to a certain language or geolocation. The
format of the search query is the same for both REST and Streaming. However, the
search function is implemented differently in REST and Streaming. REST provides
relevance but not completeness. For this reason it might be more appropriate to
use Streaming APIs in some cases.
There are three types of streams: public streams, user streams and site streams. Public
streams contain public tweets published in chronological order; user streams are the
tweets from the timeline of a single user; site streams access the timelines of multiple users
[17]. In this study, user streams were used to collect data from the timelines
of the users tweeting about hate.
2.2 Text Mining
Text mining refers to the process of finding relationships and patterns in collections of
unstructured information. Text analytics can be broken down into three steps [19]:
pre-processing of the text, representation of the text as numerical features, and
knowledge discovery using machine learning techniques.
Stop-word removal eliminates words that do not convey any meaning and are equally
frequent in any text, for example articles and pronouns. Stemming contributes to a
denser feature space by reducing inflected word forms to a common stem.
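These two pre-processing steps can be sketched as follows. This is a minimal, stdlib-only illustration with a hand-rolled stop-word list and a naive suffix stripper, not a full stemmer such as Porter's; both word lists are assumptions made for the example:

```python
STOP_WORDS = {"i", "a", "an", "the", "it", "my", "is", "to", "and", "for"}
SUFFIXES = ("ing", "ed", "s")  # naive suffix stripping, not a real stemmer

def stem(word):
    """Strip one common suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = tweet.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("I hate waiting for my phone")  # -> ['hate', 'wait', 'phone']
```

A real pipeline would use a proper stemmer and a curated stop-word list, but the structure of the step is the same.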
The representation of classes does not have to be constrained to the most significant
words; on the contrary, research shows that adding extra features improves the
ensuing analysis [9, 20]. In section 3, various feature extension techniques are described.
In order to extract useful knowledge from the created representation, the features are
stored as numbers and analyzed with machine learning techniques. The conversion of words
to numbers can be performed in various ways. The technique for feature selection used
in this work is the Akaike Information Criterion (AIC). AIC is based on the
estimation of the information loss when modeling the underlying processes that create
the observations. Given empirical data, for example a word in a tweet, AIC estimates
how likely it is that the word appears in the target class compared to other classes
using the maximum log-likelihood. Furthermore, AIC incurs a penalty on the number of
free parameters: the higher the number of parameters, the higher the penalty. This
approach discourages overfitting. Overfitting is a common problem in machine learning
and refers to the misclassification of test data. It occurs when the number of features is
higher than the number of data points. AIC is calculated as:

AIC = 2k − 2 ln L̂

where k is the number of free parameters and L̂ is the maximized value of the likelihood function.
It is important to emphasize that AIC does not provide an absolute measure of the precision
of a model, but rather a tool for comparing different models. The model with the lowest
AIC value represents the most accurate model in the comparison.
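As a concrete sketch of this comparison, using the standard formula AIC = 2k − 2 ln L with made-up numbers (the two "models" and their log-likelihoods are illustrative only):

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: 2k - 2*ln(L), given ln(L) directly."""
    return 2 * k - 2 * log_likelihood

# Two hypothetical models: number of free parameters and maximized log-likelihood.
models = {
    "small model": aic(k=3, log_likelihood=-120.0),   # 2*3 + 240 = 246
    "large model": aic(k=40, log_likelihood=-110.0),  # 2*40 + 220 = 300
}

# The model with the lowest AIC is preferred: here the large model's better
# fit does not outweigh its penalty for the extra free parameters.
best = min(models, key=models.get)
```

This reflects the point made above: AIC values only rank models against each other; 246 on its own says nothing about absolute quality.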
After defining the feature space, knowledge discovery methods can be applied to the
numerical representation of the words. A common and robust approach is Support
Vector Machines (SVM) described in section 2.3.1.
2.3 Supervised Machine Learning
The main idea of machine learning is to teach a computer to associate new data with data
it has been exposed to earlier. The dynamic character of the data available on the
Internet is hard to deal with automatically, because pre-programmed algorithms might
not describe tomorrow's reality. For instance, without machine learning algorithms
the detection of spam e-mails and fraud becomes difficult. Therefore the goal is to
recognize old patterns in new data. For instance, Gmail has been taught to distinguish
spam from non-spam emails rather than being explicitly programmed for it.
2.3.1 Support Vector Machines (Geometrical interpretation)
The basic idea of SVMs includes three main points: maximizing margins, the dual formulation,
and kernels [1]. Maximizing margins means that the aim is to find a plane that
separates the nearest points of two classes, A and B, by as wide a minimum distance
as possible. This problem is illustrated in figure 2.1.
Figure 2.1: Data from class A and class B separated by two different planes: one
represented as a dashed line and another one represented as a solid line [1].
Figure 2.2: The maximal margin is represented by the line that goes through the
points d and c and is orthogonal to the hyperplane [1].
In figure 2.1, we see that the solid line is a better choice of demarcation than the dashed
one because the margin from the nearest point of each data set to the line is larger. The
classes A and B are represented by the matrices A (m×n) and B (m×n) respectively. Every row
represents the coordinates of a point in the data set, and the columns correspond to the
features x. The input of each class maps to an output y = {−1, +1}. Let us assume that the
output of class A satisfies x′w ≥ 1 and at least one point in A lies on the plane x′w = 1,
where w is the normal of this plane. Similarly, the output of class B satisfies x′w ≤ −1
and at least one point lies on the plane x′w = −1. The distance between these two
supporting hyperplanes is 2/‖w‖.
Consequently, the distance between the two planes can be maximized by minimizing
‖w‖. The space between the hyperplanes should not contain any data points, which
gives rise to two constraints that should be fulfilled at the same time, Aw − Ie ≥ 0 and
Figure 2.3: Two distinct classes, represented by blue and red dots, cannot be sepa-
rated by a maximal margin hyperplane [2].
Bw − Ie ≤ 0. The final separating plane lies between the supporting hyperplanes and
therefore the objective of the minimization is (1/2)‖w‖². The problem of finding the two
closest points can be stated in the following way:

minimize (1/2)‖w‖²
subject to Aw − Ie ≥ 0, Bw − Ie ≤ 0.  (2.2)
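The geometric margin 2/‖w‖ between the supporting hyperplanes is easy to verify numerically. This is a toy illustration of the formula only, with an arbitrary example vector w:

```python
import math

def margin_width(w):
    """Distance between the supporting hyperplanes x'w = 1 and x'w = -1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# For w = (3, 4), ||w|| = 5, so the gap between the planes is 2/5 = 0.4.
width = margin_width([3.0, 4.0])
```

Minimizing ‖w‖ in problem (2.2) is exactly what makes this width as large as possible.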
The line between the two closest points must be orthogonal to the supporting hyperplanes.
In order to construct such a plane, the algorithm finds the two convex hulls and constructs
a line between the two nearest points in these sets. This approach is illustrated in figure
2.2, where the line that goes through the points d and c, denoted by w, is the maximal
margin, and the solid line orthogonal to w is the maximal margin hyperplane. The point d
and the two circle points lying on the same dashed line as the point c are called support
vectors. They support the maximal margin in the sense that if these points were moved, the
maximal margin hyperplane would shift too. Notice that a change in the position of
any other point does not affect the maximal margin hyperplane unless it crosses the
boundary points. It is important that an orthogonal line can be drawn between
the two supporting planes; otherwise the two points might not be the closest
ones, or the supporting hyperplanes might not be as far apart as possible.
When the two classes overlap, as in figure 2.3, no separating hyperplane exists.
This also means that the maximal margin classifier cannot be used. A commonly used
approach to deal with this problem is to allow a certain degree of misclassification in the
interest of a better classification of most of the data. This approach is called the support
vector classifier. In order to allow a certain degree of misclassification, a non-negative
tuning parameter C is introduced. The parameter C determines how much
freedom we have to violate the margin. If C is high, the system is tolerant of
misclassifications; this also means that if the margin is large, several violations are
allowed. When C is low, the margin narrows, implying the choice of a highly fit classifier.
Some classes cannot be separated linearly because the relationship between the outcome
and the predictors is non-linear. In this case, the number of predictors
is expanded by using a non-linear function. More specifically, this is done by applying
so-called kernels.
minimize (1/2)‖w‖² + C Σᵢ₌₁ⁿ ξᵢ,  ξ ∈ Rⁿ  (2.3)

where ξᵢ is a slack variable that allows individual observations to be on the wrong side of
the margin, and x is mapped by a non-linear map Φ. If ξᵢ = 0, the i-th observation
was classified correctly; if ξᵢ > 0, it is on the wrong side of the margin; if ξᵢ > 1,
it is on the wrong side of the hyperplane.
In equation 2.3, the function Φ(xᵢ) does not need to be calculated explicitly. Instead,
another function that defines the inner product in the new space is used:
K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ). K(xᵢ, xⱼ) is called a kernel function. The computation of
kernels occurs in the feature space and not in the space of input data Rᵈ. Some of the
most common kernels are the linear, polynomial, radial basis function (RBF) and sigmoid kernels.
So far, the theory has been presented for the case when two classes are compared. However,
SVM can also be used for the classification of more than two classes; the two most common
approaches are one-against-one and one-against-all.
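For instance, the RBF kernel K(x, z) = exp(−γ‖x − z‖²) can be evaluated directly, without ever constructing Φ explicitly. A stdlib-only sketch; the value γ = 0.5 is an arbitrary choice for the example:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2): an inner product in feature space
    computed without materializing the feature map Phi."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> 1.0
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # ||x - z||^2 = 25, so exp(-12.5)
```

This is the "kernel trick" the section refers to: only pairwise kernel values are needed by the optimization, never the (possibly infinite-dimensional) Φ(x) itself.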
2.4 SVM Data Format
The LIBSVM tool (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) requires the input data, i.e. training and test data for the SVM algorithm,
to be of a certain form called SVM format. It is a numerical representation of textual
data as attribute:occurrence, where the attribute is the position of the word in the
AIC table for the target category and the occurrence is the number of times the word
appears in the current tweet. For example, ”hate hate hate verizon” would be encoded
as 1:3 4:1, where 1 is the position of the word ”hate” in the AIC table for the
category Hate, 3 is the number of times the word hate occurs in the
current tweet, and 4 is the position of the word ”verizon”; because it occurs only
once, its occurrence is set to 1. If a tweet contains one or more words that do
not appear in the table, those words are neglected. This is referred to as the sparsity
problem.
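The encoding described above can be sketched as follows. This is a minimal illustration, not the thesis code; the word-to-position table is a made-up stand-in for the AIC table:

```python
# Hypothetical AIC table for the category Hate: word -> 1-based attribute position.
AIC_TABLE = {"hate": 1, "verizon": 4}

def to_svm_format(tweet, table):
    """Encode a tweet as sparse 'attribute:occurrence' pairs, sorted by attribute.
    Words missing from the table are simply dropped (the sparsity problem)."""
    counts = {}
    for word in tweet.lower().split():
        if word in table:
            attr = table[word]
            counts[attr] = counts.get(attr, 0) + 1
    return " ".join(f"{attr}:{n}" for attr, n in sorted(counts.items()))

encoded = to_svm_format("hate hate hate verizon", AIC_TABLE)  # -> "1:3 4:1"
```

Running this on the example from the text reproduces the encoding 1:3 4:1; a word outside the table (e.g. "my") would contribute nothing to the vector.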
Chapter 3
Related Work
This study proposes a framework for identification of hate tweets related to mobile
operators and their triggers using classification techniques. Therefore the review of
relevant works focuses mainly on classification of tweets and a few studies related to
hate tweets. To broaden the understanding of Twitter’s importance for companies, a
brief review of the marketing studies related to Twitter is also covered in this section.
According to previous research, there are other decisions that might impact the outcome
of the classification. One of them is the choice of class labels. A suitable choice of labels
facilitates the division of data into well defined clusters, which benefits the representation
of classes. The labeling of data depends on the application. The choice of the labels
is often based on the observations of data or the adoption of pre-existing classification
schemes (e.g. the advertising model AIDA, where the acronym stands for Attention,
Interest, Desire and Action). ”Mining Consumer Attitude and Behavior” by Hwon
et al. [21] shows that an appropriate clustering, in this case AIDA, can support the
methodology and reveal hidden relationships. However, in some cases class-labels cannot
be determined beforehand due to their dynamic nature. News and trending topics are
two examples of constantly evolving and changing subjects [22], [3]. Nevertheless, even
this type of problem utilizes a framework of pre-determined generic categories, such as
technology, art etc, that supports the classification task.
As stated earlier, tweets are informal, short and sparse (i.e. a certain word might
occur seldom in the tweets), therefore the removal of noise and creation of a denser
feature space are important for future feature vector construction. The pre-processing
procedure is described in the section Text mining. Pre-processing steps were highlighted
in the studies by Lee et al., Wang et al. and Perez et al. [3], [9], [11]. In addition to pre-
processing, some studies [23], [22], [4] employed filtering techniques in order to improve
the relevance level of the collected tweets. The keywords for filtering are often derived
from the observed data and/or defined from dictionaries or other external resources.
For example, Yang et al. [23] in a study from 2013 suggested a method for ambiguity
filtering of company names. The classification was binary: either the tweet was related
to a company or not. Each category was represented by keywords that were
defined as the words most frequently searched by Internet users. For example, the
top three keywords for the company Apple were: apple, apple store, apple iphone 4.
After the keywords were determined, two different filters were created in order to ensure
high accuracy of keyword recognition. Filter 1 checked whether the keyword was present
as a whole; Filter 2 matched the relevant single tokens. For instance, if
the keyword is ”iphone 4” and the target tweet is ”I love iphone”, Filter 1 would reject
the tweet, whereas Filter 2 would recognize it as relevant.
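The two filters can be sketched as follows; this is a schematic reconstruction of the described behaviour, not the cited authors' implementation:

```python
def filter1_whole_keyword(tweet, keyword):
    """Accept only if the whole keyword phrase appears in the tweet."""
    return keyword.lower() in tweet.lower()

def filter2_single_tokens(tweet, keyword):
    """Accept if any single token of the keyword appears in the tweet."""
    tokens = set(tweet.lower().split())
    return any(part in tokens for part in keyword.lower().split())

# The example from the text: keyword "iphone 4", tweet "I love iphone".
f1 = filter1_whole_keyword("I love iphone", "iphone 4")  # rejects: no "iphone 4"
f2 = filter2_single_tokens("I love iphone", "iphone 4")  # accepts: token "iphone"
```

On the example above, Filter 1 rejects the tweet while Filter 2 accepts it, matching the behaviour described in the text.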
3.1 Internal Enhancement of Feature Space
Aside from tweets, Twitter provides the reader with additional information such as topic-
indicative hashtags, context-enhancing links, user profile information, lists of followers
etc. Previous studies have shown that a higher classification accuracy can be achieved
by expanding the feature space using this internal information [22], [24], [9]. For example,
feature-based enhancement was explored by Sriram et al. [20]. They improved the
classification of tweets by adding a nominal feature, i.e. the author's name, and seven
binary features, for example the absence of shortenings, emoticons, and slang
words. Benevenuto et al. [24] proposed a classification method for distinguishing spammers,
i.e. users who post spam, from non-spammers by integrating 23 different metrics
discovered on Twitter. Interestingly, the evaluation of the features showed that even
the least significant ones improved the classification compared to the baseline method.
It is therefore reasonable to assume that even low-ranked features have discriminatory
power. Furthermore, regarding users not as single elements but as part of a larger
network has also proven to be beneficial [3].
3.2 External Enhancement of Feature Space
One of the big challenges of tweet classification is sparseness, i.e. many words
appear only a few times in a corpus. For this reason, some tweets might be represented
by an empty vector. To alleviate this problem, attempts have been made to
create a denser feature space. A common technique that addresses this problem is to
incorporate external resources such as search engines, open knowledge bases, e.g.
Wikipedia, and online dictionaries. Perez et al. [11] proposed three ways of enriching
the original feature vectors: incorporating general information about the company
provided by Wikipedia; enriching the tweets that really refer to companies; and expanding
only the ambiguous words with external information. The results showed that the
third approach performed better than the other methods on specific company names (Armani,
Warner, Cadillac etc); the first approach performed better than the other methods
on generic company names (Parl, Sprint, Southwest etc). In a paper from 2008, Phan et
al. [25] explored the enrichment of tweets with extracted topics and found that it
improved the classification and outperformed the baseline.
Method                                                 Accuracy
BOW approach for classification with C 5.0 [3]         65 %
Network-based classification with C 5.0 [3]            70 %
DFICF with Naive Bayes [9]                             71 %
Adding user attributes; training with SVM [24]         87 %
Tf and clarity with Maximum Entropy classifier [26]    70 %
BOW and eight additional features [20]                 95 %
Self-Term expansion with K-means classifier [11]       60-95 %
TEM-Wiki with K-means classifier [11]                  60-97 %
TEM-Full with K-means classifier [11]                  54-73 %

Table 3.1: Accuracy of the proposed methods for some of the studies
3.3 Marketing Potential of Twitter
Information spreading over the Internet requires companies to revise their marketing
strategies [27]. According to an interview study, a presence on the Web allows
companies to understand consumers' consumption habits, detect and anticipate negative
reactions etc [28]. A better understanding of users' preferences, along with advances in
Twitter mining and classification techniques, has created opportunities for the development
of user models. Based on knowledge about the user's interests and tweeting
patterns, scientists have tried to understand how to better target users with product
recommender systems [8]. A paper from 2011 [29] investigated how tweeting activities
could support the modeling and personalization of user profiles. Based on the hashtags
and topics of the posted tweets, the study compared three profile models for their
applicability to a news recommendation system.
3.4 Antagonism in Tweets
The topic of hate and radicalization on Twitter has been addressed in a handful of studies.
Burnap and Williams [30] addressed the identification of hateful and/or antagonistic
statements against certain races and religions. In a paper from 2013, Kawase et al.
investigated what consequences hateful tweets about jobs might have [31]. To the best of
our knowledge, no previous study has addressed hate tweets related to mobile operators.
Pendar [32] studied the identification of sexual predators in online chats and showed
that the performance of SVM can be improved by adding n-grams to
create a more specific context for the use of the most relevant words. In a study from 2014,
Agarwal and Sureka [33] focused on the classification of hate and extremism promoting
tweets. They observed that exclamations such as ”send them home” and ”get them
out” were frequently appearing in the collected tweets. These types of phrases follow
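The n-gram extension mentioned above can be sketched as a generic word n-gram extractor; this is an illustration of the technique, not the cited authors' code:

```python
def word_ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The phrase "send them home" from the text yields two bigrams.
bigrams = word_ngrams("send them home", 2)  # -> ['send them', 'them home']
```

Adding such n-grams alongside single words gives the classifier phrase-level context, which is exactly why a unigram model alone would miss the antagonistic reading of "send them home".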
Chapter 4
Method
This chapter gives an overview of the methodology employed in this work. In particular,
we provide a definition of the classification categories and describe the process of
collection and pre-processing of the tweets. Finally, the proposed methods, the Naive
Method (NM) and the Partial Timeline Method (PTM), are introduced in sections
4.4 and 4.5.
4.1 Defining hate tweets and reasons
In this study, the definition of a hate tweet was constrained to the presence of two
component parts: the verb ”hate” in combination with the object to which it was
addressed. The object was represented by the name of the company or the pronoun
”it” when it pointed to the company. In addition, words and phrases that described the
company's services and/or products were also included as objects of hate, for example
”I hate Verizon” or ”I absolutely hate Sprint's service”. Hate tweets without any stated
explanation for the hate were labeled as Hate.
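This definition can be sketched as a simple rule-based matcher. It is an illustration of the definition above, not the thesis implementation; the pronoun-resolution case (”it” pointing to the company) is omitted for simplicity:

```python
COMPANIES = ("verizon", "at&t", "sprint")

def is_hate_tweet(tweet):
    """True if the tweet contains 'hate' together with a carrier name.
    The anaphoric 'it' case from the definition is not handled here."""
    text = tweet.lower()
    return "hate" in text and any(company in text for company in COMPANIES)

hit = is_hate_tweet("I absolutely hate Sprint's service")  # True
miss = is_hate_tweet("I hate Mondays")                     # False: no carrier
```

In practice such a matcher only flags candidates; the labeling in this study was done manually.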
When reading the timelines of the users who posted a hate tweet, we noticed that
the reasons, if stated, could appear before and/or after the hate tweet. The differences
between these ways of stating the reason were studied more closely in order to determine
if they ought to be treated as separate categories.
First, we looked at the cases when the reasons were stated before the hate tweet, serving
as the premise. These types of reasons were called triggering reasons. Second, we
looked at the cases when the reasons were stated after the hate tweet serving as an
explanation. These reasons were called justificatory reasons. Third, we looked at the
cases where the reasons were stated both before and after the hate and therefore were
called combined sequences. The observed types of reasons are summarized in table 4.1. For example, a triggering reason follows the pattern in which the user first describes an issue related to the mobile service and then makes a hate statement: "Pissed not understanding why my phone isn't ready to be picked up when it was suppose to be ready yesterday. I hate AT&T."
These three categories were compared to each other with respect to word frequency, whether they were addressed to someone (i.e. the presence of a username) and the time of posting. The results are presented in table 5.4 in the Results and Discussion section. However, the differences between these cases were insignificant and the three types were therefore treated collectively as a single category, Reason. A tweet was labeled as Reason if the following criterion was fulfilled:
• it contained the conjunction because (also spelled cuz, cus, coz, bs, b/c, cause) or the nouns reason or why.
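As an illustration, the criterion can be expressed as a simple marker-word check (a sketch; the actual labeling in this study was manual):

```python
# Marker words taken from the Reason criterion above.
REASON_MARKERS = {"because", "cuz", "cus", "coz", "bs", "b/c", "cause", "reason", "why"}

def is_reason_tweet(tweet):
    """Label a tweet as Reason if it contains one of the marker words."""
    tokens = [t.strip(".,!?") for t in tweet.lower().split()]
    return any(t in REASON_MARKERS for t in tokens)
```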
In certain cases the hate and the reason were expressed in the same tweet. This category
was called Explicit. Some users even posted one or two tweets, classified as Reason,
before or after the tweet classified as Explicit. The latter category was not further
broken down based on the order in which the hate and reason appeared because this
approach did not yield significant results for justificatory and triggering reasons.
The retweets of hate were not included in the training data. The reason was that these
tweets can be seen as an expression of sympathy and are not necessarily related to the
service experience of the user who retweeted.
In order to collect training data, a search query was created to identify hate tweets and thereby relevant timelines. The query had the same format as a hate tweet, i.e. it combined two words: "hate" and the name of the operator, e.g. Verizon. If both words were present in the same tweet, the entire timeline from that day was pulled and stored in an Excel table. The underlying assumption was that the reason for the hate tweet would also be posted during the same day.
The tweets were collected with Twitter's Streaming and REST APIs, stored in Excel, and manually searched for reasons.
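The collection logic can be sketched as follows. This is a simplified stand-in that operates on an in-memory timeline rather than on the Twitter APIs, and the field names are assumptions:

```python
from datetime import datetime

def matches_query(text, operator):
    """The search query: 'hate' and the operator name in the same tweet."""
    t = text.lower()
    return "hate" in t and operator.lower() in t

def same_day_tweets(timeline, hate_time):
    """Pull the tweets posted on the same calendar day as the hate tweet,
    assuming the reason is posted during that day."""
    return [tw for tw in timeline if tw["time"].date() == hate_time.date()]
```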
The labelled tweets were then pre-processed in order to remove noise and create a denser feature space. In this study, the pre-processing followed the classical scheme described in the text-mining literature. The first step was to split the tweets into separate tokens and remove punctuation and stop-words. The pre-processing did not use any relationships between the words, i.e. a bag-of-words (BOW) approach. The next step was to stem all the words; URLs were replaced by the word "url" and usernames by the word "username". Different spellings of the same word were replaced by a single alternative1 .
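The pipeline can be sketched as below; the stop-word list and the naive suffix-stripping stand in for the full stop-word list and stemmer actually used, so treat both as assumptions:

```python
import re

STOPWORDS = {"i", "a", "an", "the", "to", "is", "my", "and", "of"}  # illustrative subset

def preprocess(tweet):
    """Tokenize, strip punctuation and stop-words, replace URLs and
    usernames, and crudely stem the remaining tokens (BOW, no word order)."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "url", text)   # URLs -> "url"
    text = re.sub(r"@\w+", "username", text)      # usernames -> "username"
    text = re.sub(r"[^a-z0-9&/\s]", " ", text)    # drop punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Naive suffix stripping as a stand-in for a real stemmer.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
```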
Pre-processing resulted in four word lists, one for each category. These lists, i.e. feature sets, were compared to each other using AIC in order to determine the most representative words for each category. The threshold for model selection was set to 0.1 to avoid overlap between words from different classes. The application of AIC is explained in section A.2.
Training and testing were performed using LIBSVM. The calculation of AIC was performed in a self-developed program written in Java.
2. 20 % of the collected data were saved as test data (proportional to the number of
tweets in each category) and the remaining 80 % formed training data.
3. Training and test data were scaled in the range [-1, +1] using the command svm-
scale.
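The effect of svm-scale can be illustrated with a minimal per-feature rescaling (a sketch of the same linear mapping):

```python
def scale_feature(values, lo=-1.0, hi=1.0):
    """Linearly map one feature column into [lo, hi], as svm-scale does."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                      # constant feature: map to lower bound
        return [lo] * len(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]
```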
1
For example, att, @att and #att were replaced by att.
4. An RBF kernel was chosen. The relationship between the feature words and the categories is non-linear, so an RBF kernel is a reasonable choice. The expression for the RBF kernel was presented in section 2.3.1.
6. The SVM algorithm was trained on the training data set. For the classification task, a one-versus-all approach was used.
7. The predictive power of the model was estimated on the test set.
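A minimal sketch of the two ingredients, the RBF kernel and the one-versus-all decision rule (the value of gamma is hypothetical; LIBSVM computes the decision scores internally):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def one_versus_all(scores):
    """Each binary SVM scores 'its own' class; the highest score wins."""
    return max(scores, key=scores.get)
```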
The Naive Method is depicted in figure 4.1. In the first stage, a tweet was either categorized as Other or not. If a tweet was not classified as Other, it was checked against Hate. If it did not belong to Hate, the method checked whether it belonged to Explicit. If the tweet could not be classified as Explicit, it was labeled as Reason. The tweets were classified using a binary method, i.e. each tweet was tested to see if it belonged to: 1. Explicit or not, 2. Hate or not, 3. Reason or not. The same test set was used to estimate the accuracy of each model. Several sources of misclassification were observed:
• tweets of an antagonistic character that are not related to the mobile operators,
• tweets related to mobile issues or the carriers but not related to hate and its
reasons,
• overlapping feature vectors for the categories Hate, Reason and Explicit leading
to misclassification.
Due to these misclassifications, we studied alternative ways to classify the hate tweets, which resulted in PTM. The two methods are compared in chapter 5.
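The NM cascade described above can be sketched as a sequence of binary checks; the three classifiers are passed in as functions, and any tweet that fails all checks falls through to Reason:

```python
def naive_method(tweet, is_other, is_hate, is_explicit):
    """Apply the binary classifiers in the NM order."""
    if is_other(tweet):
        return "Other"
    if is_hate(tweet):
        return "Hate"
    if is_explicit(tweet):
        return "Explicit"
    return "Reason"
```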
In a study from 2012, Sun [26] mimicked human labelling to create a filter for classification. This idea was adopted in our study and adapted to the method proposed in this section. This method is not an extension of NM.
In this study we focused on improving the classification using only the information in the collected tweets. In order to improve the classification and solve the issues stated earlier, several of the internal features mentioned in section 3.1 were investigated: the use of usernames, URLs and the time of tweeting.
The schematic representation of PTM can be seen in figure 4.2. The first step of the method identifies a hate tweet or a self-explanatory tweet. It then searches the timeline for proximate tweets posted within a one-hour time window and classifies them as Reason or Other. PTM relies on the observation that the majority of the explanations are posted 30 minutes before or 30 minutes after the hate tweet, see table 5.3 in chapter 5.
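The time-window step of PTM can be sketched as a filter over the timeline (the field names are assumptions):

```python
from datetime import datetime, timedelta

def proximate_tweets(timeline, hate_time, window_minutes=30):
    """PTM candidate set: tweets within +/- window_minutes of the hate tweet."""
    window = timedelta(minutes=window_minutes)
    return [tw for tw in timeline if abs(tw["time"] - hate_time) <= window]
```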
Chapter 5
Results and Discussion
5.1 Evaluation
This study showed that the classification of hate tweets and their reasons is a feasible task. The two proposed methods (see chapter 4) were evaluated in terms of accuracy and F-score. The classification accuracy for NM and PTM, along with the name of each category, is presented in table 5.1. As can be seen from the table, PTM generally performs better than NM. For context, the accuracy of the proposed methods can be compared to some previous studies, see table 3.1; a direct comparison is difficult due to differences in the size and content of the data sets.
The accuracy may not be a true representation of how reliable the methods are, because of the unbalanced number of tweets in the different categories. The precision, recall and F-score are shown in table 5.2. One value stands out clearly: the precision for NM in the category Reason, which is the highest possible value, i.e. 100%. Looking at the precision in the context of recall and F-score, however, the performance of the method is not as good as the precision value alone would suggest. It is important to emphasize that the amount of training and test data used for the experiments was comparatively small. The main reason was the time-consuming process of manually labeling the data. Instead of collecting more data, this project focused on developing the classification scheme.
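The point about the 100% precision can be made concrete: with one true positive, no false positives and many false negatives, precision is perfect while recall and F-score are poor. A minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary-classification metrics from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```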
As table 5.1 shows, PTM is more accurate than NM. In addition, it saves time and memory by analysing fewer tweets. However, PTM neglects the tweets posted outside the one-hour time window. This means that choosing between the proposed methods implies a trade-off between relevance and completeness.
Table 5.3: The reasons posted before and after a hate tweet were investigated from
a temporal perspective, i.e. we calculated the percentage of the tweets belonging to
Reason posted 30 minutes, 5 minutes and 1 minute before or after the posting of a hate
tweet.
PTM is based on the observation that the majority of the users posted the reason within
the first five minutes: 62.5% of the users posted the reason before the hate tweet and in
51.6% of the cases the reason was posted after the hate. Within a time window of half
an hour 77.5% of reasons were stated before the hate tweet and 70.3% were stated after
the hate tweet. These results are summarized in table 5.3. Notice that the tweets that
were posted within one minute were included in the tweets that were posted within five
minutes. Furthermore, all of these tweets were included in the tweets that were posted
within 30 minutes.
In order to understand the most common causes of the users' frustration, we analysed the prevalence of the most common bigrams in the categories Reason and Explicit. These are "my phone", "customer service", "unlimited data", "service sucks", "phone bill" and "my data", see table 5.5. The content of the tweets focused on phone, data and/or service related issues. The BOW approach used in this work treats the words in a corpus as predictive features and ignores word sequences. When words are used as the primary feature for classification, this can lead to misclassification because the same words appear in different contexts.
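Bigram prevalence can be computed by counting adjacent token pairs, for example:

```python
from collections import Counter

def bigram_counts(tokenized_tweets):
    """Count adjacent word pairs across a corpus of tokenized tweets."""
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(zip(tokens, tokens[1:]))
    return counts
```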
Table 5.5: Most frequent bigrams present in tweets classified as Reason. The bigrams not presented in this table had a prevalence lower than 5%.
5.2 Observations
In order to see if there was any variation in the tendency to post triggering and justificatory tweets over the course of the day, we analysed the tweets by dividing the day into four six-hour periods. Looking at the data in table 5.4, the tweets were distributed relatively evenly over the day. However, we could observe a clear difference in the tendency to post justificatory versus triggering reasons: justificatory reasons were posted in 68 cases out of 100. One possible explanation is that the users posting hate tweets were contacted by other users and/or the company in question regarding the reason for the hate tweet. For instance, we observed that the mobile operators AT&T, Verizon and Sprint interact with customers who have expressed hate against them.
One of the biggest challenges in this work was the time-consuming manual labeling of the training and test data. However, even in the absence of statistically significant results, it is possible to explore aspects of the methodology such as the choice of class labels, the vector features and the sequence of steps by which the classification is carried out. It is important to emphasize that since the data set analysed in this study is limited, see Appendix A.1, one should be cautious when drawing conclusions.
One reason for caution is the high ratio of feature-vector size to the number of training tweets. This may contribute to overfitting and deceptively high accuracy values, see table 5.1. In this study, the SVM might have memorized the tweets instead of learning the underlying pattern.
Chapter 6
Future Work
For future work there are a number of technical improvements that can be made.
The study could also be improved by solving the problem of the overlapping feature spaces for the categories Hate, Explicit and Reason, see chapter 4. One suggestion is to re-define the categories. For instance, create one class with a feature space consisting
of the feature spaces belonging to these three classes. Then, based on the semantic
1
[19], Chapter 4.
28
Appendix A 29
analysis identify hate tweets, which probably have the highest rate of negative emotions
per tweet length. Another suggestion is to regard Explicit and Reason as one class
because both explain the underlying cause of hate.
It would be interesting to study whether we could find the reason for a customer's dissatisfaction even if he or she has not provided any reason on the timeline. One possible approach is to use the User Similarity Model [34]. The model says that users A and B are more closely related than users B and C if the overlap in the topics posted on their timelines is greater for A and B than for B and C. For related users it might be possible to predict the reason for a hate tweet posted by user B by looking at the reason posted by user A.
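The overlap in the User Similarity Model could, for instance, be measured with a Jaccard index over topic sets. This is our own illustrative choice; [34] may define the overlap differently:

```python
def topic_overlap(topics_a, topics_b):
    """Jaccard overlap between the topic sets of two timelines."""
    a, b = set(topics_a), set(topics_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```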
Finally, the proposed methods could be applied in other areas where tweets describe an action and a result, or anticipation and experience. For example, tweets about the anticipation of seeing a film accompanied by a review after seeing it, or a review of a hotel stay along with the prior expectations. Similarly, tweets about political hate or affinity could be analysed with the proposed method.
Appendix A
The number of collected tweets is presented in table A.1. The number of tweets in the category Other is fairly high compared to the remaining categories. The content of Other varies a lot, and therefore, in order to make this category more predictable, we collected more tweets for it.
A.2 AIC
The first step in the modeling of the IM and DM was to calculate the occurrences of
each word, w, appearing in the classified tweets. Many words did not appear more than
once across all the tweets; the data were therefore sparse. This fact was used later in the SVM analysis. The information about each word is summarized in table A.2,
where the notation class ¬A stands for not class A, meaning all the remaining classes; ¬w means not w; n11 stands for the number of tweets belonging to the target class A and containing the word w; n12 is the number of times the word appeared outside the target class; n21 is the number of tweets in the target class where the word did not appear; and n22 is the number of tweets in the other classes where the word does not appear. The number of free parameters is two: n11 and n12 .
For the purpose of readability, the following notation will be introduced: N = n11 + n12 + n21 + n22 , h = n11 + n12 and k = n11 + n21 .
Based on the word and class occurrences, the probability of each word for a specific class was calculated. The probability of class A is p and the probability of the word appearing somewhere in the training data is q. The probability of each class is known and equals 1/4, therefore the probability of class A is always 1/4 and the probability of not class A is 3/4. Nevertheless, in order to preserve generality, the notation p will be used.
P(A) = p = k/N,   P(w) = q = h/N   (A.1)
The assumption that p and q are independent leads to the derivation of the IM with two
free parameters. The joint probabilities of the IM are presented in table A.3.
The events presented in table A.3 are considered to be independent and therefore their joint probability P is
P = p^k q^h (1 − p)^(N−k) (1 − q)^(N−h)   (A.2)
with the corresponding log-likelihood
L = k ln p + h ln q + (N − k) ln(1 − p) + (N − h) ln(1 − q).   (A.3)
To find the maximized log-likelihood of the IM with respect to p and q, the following conditions have to be satisfied:
∂L/∂p = k/p − (N − k)/(1 − p) = 0   (A.4)
∂L/∂q = h/q − (N − h)/(1 − q) = 0   (A.5)
Eq. A.4 and Eq. A.5 lead to Eq. A.1. Insertion of Eq. A.1 into Eq. A.3 gives the maximized log-likelihood (MLL):
Lmax = h ln h + k ln k + (N − h) ln(N − h) + (N − k) ln(N − k) − 2N ln N.   (A.6)
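Eq. A.6 translates directly into code. With two free parameters, the AIC of the IM is −2·Lmax + 2·2; the sketch below uses the convention that x ln x is 0 at x = 0, as stated at the end of this appendix:

```python
import math

def xlnx(x):
    """x * ln x with the convention that the value is 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0

def aic_im(n11, n12, n21, n22):
    """AIC of the independence model from the contingency counts (Eq. A.6)."""
    N = n11 + n12 + n21 + n22
    h, k = n11 + n12, n11 + n21
    lmax = xlnx(h) + xlnx(k) + xlnx(N - h) + xlnx(N - k) - 2 * xlnx(N)
    return -2 * lmax + 2 * 2   # two free parameters
```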
A similar derivation of the AIC applies to the DM. The outline of the model is presented in table A.4. The notation is the following: p11 is the probability of w appearing in A, p12 is the probability of w appearing in other classes, p21 is the probability of not observing w in A, and, lastly, p22 is the probability of not observing w in other classes. Notice that p22 can be expressed as p22 = 1 − p11 − p12 − p21 , which means that the number of free parameters is 3.
The log-likelihood for the case when w appears in the target class A is maximized when
∂L/∂p11 = n11 /p11 − n22 /p22 = 0, or
n11 /p11 = n22 /p22   (A.10)
and is constant. Therefore it is possible to set Eq. A.11 equal to some constant c. Now the events can be expressed as
Finally, from Eq. 2.1 and Eq. A.15 the AIC of the DM was derived. It is worth mentioning that whenever a parameter of Eq. A.6 or Eq. A.15 was equal to zero, the limit lim x→0 x log x = 0 was applied.
Bibliography
[1] Kristin P. Bennett and Erin J. Bredensteiner. Duality and geometry in SVM classifiers. In Proc. 17th International Conf. on Machine Learning, pages 57–64. Morgan Kaufmann, 2000.
[2] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. Springer, 2014. ISBN 9781461471370.
[4] David Alfred Ostrowski. Semantic filtering in social media for trend modeling. In 2013 IEEE Seventh International Conference on Semantic Computing, pages 399–404, 2013.
[5] Sitaram Asur and Bernardo A. Huberman. Predicting the future with social media. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT '10, pages 492–499, Washington, DC, USA, 2010. IEEE Computer Society. doi: 10.1109/WI-IAT.2010.63.
[6] Bernd Hollerit, Mark Kröll, and Markus Strohmaier. Towards linking buyers and sellers: Detecting commercial intent on Twitter, 2013.
[8] Xin Wayne Zhao, Yanwei Guo, Yulan He, Han Jiang, Yuexin Wu, and Xiaoming Li. We know what you want to buy: A demographic-based system for product recommendation on microblogs. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1935–1944, New York, NY, USA, 2014. ACM. doi: 10.1145/2623330.2623351.
[9] Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 1031–1040, New York, NY, USA, 2011. ACM. doi: 10.1145/2063576.2063726.
[10] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, and Di Cai. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, August 2010.
[11] Fernando Perez-Tellez, David Pinto, John Cardiff, and Paolo Rosso. On the difficulty of clustering company tweets. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, SMUC '10, pages 95–102, New York, NY, USA, 2010. ACM. doi: 10.1145/1871985.1872001.
[14] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860, 2010.
[15] Juan M. Silva, Abu Saleh Md. Mahfujur Rahman, and Abdulmotaleb El Saddik. Web 3.0: A vision for bridging the gap between real and virtual. In Proceedings of the 1st ACM International Workshop on Communicability Design and Evaluation in Cultural and Ecological Multimedia System, CommunicabilityMS '08, pages 9–14, New York, NY, USA, 2008. ACM. doi: 10.1145/1462039.1462042.
[17] Shamanth Kumar, Fred Morstatter, and Huan Liu. Twitter Data Analytics. Springer, New York, NY, USA, 2013.
[19] Charu C. Aggarwal and ChengXiang Zhai. Mining Text Data. Springer Science+Business Media, 2012. ISBN 9781461432227.
[20] Bharath Sriram, Dave Fuhry, Engin Demir, Hakan Ferhatosmanoglu, and Murat Demirbas. Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 841–842, New York, NY, USA, 2010. ACM. doi: 10.1145/1835449.1835643.
[21] Hwon Ihm. Mining consumer attitude and behavior: An exploratory study on movie audience attitude extracted from Twitter. Journal of Convergence, 4(2):29–35, June 2013. URL http://www.ftrai.org/joc/vol4no2/v04n02_C03.pdf.
[23] Chao Yang, Sanmitra Bhattacharya, and Padmini Srinivasan. Lexical and machine learning approaches toward online reputation management. In CLEF (Online Working Notes/Labs/Workshop), 2012.
[24] Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida. Detecting spammers on Twitter. In Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), 2010.
[25] Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 91–100, New York, NY, USA, 2008. ACM. doi: 10.1145/1367497.1367510.
[26] Aixin Sun. Short text classification using very few words. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 1145–1146, New York, NY, USA, 2012. ACM. doi: 10.1145/2348283.2348511.
[27] Maria Teresa Pinheiro Melo Borges Tiago and José Manuel Cristóvão Veríssimo. Digital marketing and social media: Why bother? Business Horizons, 57(6):703–708, 2014.
[28] Denis Kondopoulos. Internet marketing advanced techniques for increased market share. Chimica Oggi-Chemistry Today, 29(3):9–12, 2011.
[29] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of Twitter posts for user profile construction on the social web. Lecture Notes in Computer Science, 6644:375–389, 2011. doi: 10.1007/978-3-642-21064-8_26.
[31] Ricardo Kawase, Bernardo Pereira Nunes, Eelco Herder, Wolfgang Nejdl, and Marco Antonio Casanova. Who wants to get fired? In Proceedings of the 5th Annual ACM Web Science Conference, WebSci '13, pages 191–194, New York, NY, USA, 2013. ACM. doi: 10.1145/2464464.2464476.
[32] Nick Pendar. Toward spotting the pedophile: Telling victim from predator in text chats. In International Conference on Semantic Computing (ICSC 2007), pages 235–241, 2007. doi: 10.1109/ICSC.2007.32.
[33] A. Sureka and S. Agarwal. Learning to classify hate and extremism promoting tweets. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, pages 320–320, Sept 2014. doi: 10.1109/JISIC.2014.65.
[34] R. Narayanan. Mining text for relationship extraction and sentiment analysis. Ph.D. dissertation, 2010.