
Degree project (Examensarbete), 30 credits

18 January 2016

Classification of Hate Tweets and Their Reasons using SVM

Natalya Tarasova
Abstract
Classification of Hate Tweets and Their Reasons using SVM
Natalya Tarasova

This study focused on finding the hate tweets posted by the customers of three mobile
operators, Verizon, AT&T and Sprint, and identifying the reasons for their dissatisfaction.
The timelines with a hate tweet were collected and studied for the presence of an
explanation.

A machine learning approach was employed using four categories: Hate, Reason,
Explanatory and Other. The classification was conducted with a one-versus-all approach
using the Support Vector Machines algorithm implemented in the LIBSVM tool.

The study resulted in two methodologies, the Naive Method (NM) and the Partial Timeline
Method (PTM). The Naive Method relied only on the feature space consisting of the most
representative words chosen with the Akaike Information Criterion. PTM utilized the fact
that the majority of the explanations were posted within a one-hour time window of the
posting of a hate tweet.

We found that the accuracy of PTM is higher than for NM. In addition, PTM saves time and
memory by analysing fewer tweets. At the same time this implies a trade-off between
relevance and completeness.

Teknisk-naturvetenskaplig fakultet, UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Supervisor: Hiromi Ishizaki
Subject reader: Sofia Cassel
Examiner: Tomas Nyberg
ISSN: 1401-5757, UPTEC F16 001
“Natalya-san, don’t be afraid of Big Data.”

Kazunori Matsumoto
Sammanfattning (Summary)

Over the past fifteen years, the Internet has become an arena where more and more of our
daily activities take place: we retrieve and share information in the form of text, sound
and images, shop, book trips and experiences, take online courses, etc. Via the Internet
we collaborate with others and create meeting places. Today we can reach out and be
reachable through various platforms. Depending on which groups one wants to reach, there
are different tools. For example, Facebook is mostly used to keep in touch with relatives,
friends and acquaintances, whereas LinkedIn is used as a platform for professional
contacts. The popularity of social media has produced a rich source of searchable data for
analysing what people feel, think and do [9, 10]. Researchers are therefore very interested
in using these data to study everything from trends and opinion polls to the spread of
influenza [3, 4]. Companies have also realised the value of using information from social
media to understand what their customers think of the services and products that the
companies provide.

In this work we focused on the social medium Twitter. The basic idea of Twitter is that
anyone, at any time, should be able to reach others by publishing a message of at most 140
characters. Such a message is called a tweet. The work was carried out at Social Media
Labs, KDDI R&D. Among other things, Social Media Labs studies what users in different
parts of the world think of a given product or service. The result is visualised on a map
and gives a quick overview of geographical trends.

We investigated whether it is possible to identify the reason why users express hate in
tweets directed at the mobile operators Verizon, AT&T and Sprint. After reading hate
tweets from about 500 users, we could conclude that it was possible to find explanations
for why Twitter users express hate towards their mobile operators. The research question
of this study was then made concrete: the goal became to develop a method that makes it
possible to identify hate tweets and the reasons that prompted them. The study resulted in
two methods: a "naive" method (the Naive Method, NM) and a more "advanced" method (the
Partial Timeline Method, PTM).

Tweets were collected through a Twitter search consisting of the word "hate" together with
the company's name, i.e. Verizon, AT&T or Sprint. If this search produced a hit, the
user's entire timeline, a chronologically ordered stream of tweets, was saved in a local
database. The timelines were then studied and the tweets were manually classified into one
of four categories: Hate, Reason, Explicit and Other. Tweets containing the word "hate"
and the company's name were classified as Hate. Tweets that were published on the same day
as the hate tweet and that stated the user's problem with the mobile operator were
classified as Reason. Messages containing both an expression of hate and a reason were
assigned to the category Explicit. Messages that were not considered to belong to any of
these three categories were classified as Other.

The manually classified messages were later used for analysis with the machine learning
algorithm Support Vector Machines (SVM). The collected data were divided into two groups:
training data and test data. The training data were used to produce a model with SVM. In
the next step this model was used to classify the test data. The whole process was carried
out for both NM and PTM, which resulted in two different models.

The model for PTM proved to be more accurate than the one for NM. However, PTM does not
include all messages from the user's timeline but analyses only the tweets published
within ± 30 minutes of the hate tweet. The choice between PTM and NM therefore involves a
trade-off. If an accurate classification is important, for example if the result is to be
used in a larger automated process, PTM is better suited than NM. If, on the other hand,
one wants a thorough processing of all tweets, NM should be used.
Acknowledgements
I wish to thank my subject reader, Sofia Cassel, for support, productive discussions and
guidance. I also want to thank Social Media Lab at KDDI R&D for their warmth and
hospitality.

Finally, I wish to thank the Sweden Japan Foundation for travel funding.

Contents

Sammanfattning (Summary)

Acknowledgements

List of Figures

List of Tables

1 Introduction
1.1 Twitter
1.2 Background
1.3 Motivation

2 Theory
2.1 Data Retrieval
2.2 Text Mining
2.3 Supervised Machine Learning
2.3.1 Support Vector Machines (Geometrical interpretation)
2.4 SVM Data Format

3 Related Work
3.1 Internal Enhancement of Feature Space
3.2 External Enhancement of Feature Space
3.3 Marketing Potential of Twitter
3.4 Antagonism in Tweets

4 Method
4.1 Defining hate tweets and reasons
4.2 Collection and Pre-processing
4.3 Training and testing
4.4 The Naive Method (NM)
4.5 The Partial Timeline Method (PTM)

5 Results and Discussion
5.1 Evaluation
5.2 Observations

6 Future Work

A
A.1 Collected data set
A.2 AIC

Bibliography
List of Figures

2.1 Data from class A and class B separated by two different planes: one
represented as a dashed line and another one represented as a solid line [1].
2.2 The maximal margin is represented by the line that goes through the
points d and c and is orthogonal to the hyperplane [1].
2.3 Two distinct classes, represented by blue and red dots, cannot be separated
by a maximal margin hyperplane [2].

4.1 The Naive Method.
4.2 The Partial Timeline Method.
List of Tables

3.1 Accuracy of the proposed methods for some of the studies

4.1 A description of different types of reasons for hate tweets.
4.2 Criteria for hate tweet labelling.

5.1 The results of the experiments.
5.2 Recall, precision and F-score
5.3 The reasons posted before and after a hate tweet were investigated from
a temporal perspective, i.e. we calculated the percentage of the tweets
belonging to Reason posted 30 minutes, 5 minutes and 1 minute before
or after the posting of a hate tweet.
5.4 Percentage of tweets per time interval
5.5 Most frequent bigrams present in tweets classified as Reason. The bigrams
that are not presented in this table had a prevalence lower than 5%.

A.1 The number of collected tweets per category.
A.2 Occurrence of a word w
A.3 Independent model
A.4 Dependent model
Chapter 1

Introduction

The proliferation of micro-blogging platforms and networking services has led to new
and previously unexplored opportunities to disseminate information. Many users share
their opinions, thoughts and feelings and express their needs and desires using social
media on a daily basis. At first sight, the disseminated information seems to be of a
diverse and chaotic character. Nevertheless, if it is not viewed in isolation but rather
as a part of a larger context, it can form a coherent structure that reveals hidden
relationships. This fact together with the relative simplicity of information retrieval
has led to the appearance of new domains of research within data mining, text mining,
machine learning and natural language processing. For example, some studies have
focused on predicting trends [3, 4] and revenues [5], on the recommendation of products
and services [6–8], and on sentiment analysis of topics such as brands and celebrities
[9, 10].

Businesses and industries followed the example of the academic world and started to
explore marketing opportunities in social media. For customers, social media has become
an important arena for sharing experiences and finding information about companies,
their products, and services. Online networking offers a useful source of information
for both parties. Consumers can find up-to-date information, and companies can mine
users' opinions on their products and services. Ideas for improving existing products and
developing new ones can be found, and new opportunities for user-user, company-user and
user-company interaction are created. Due to the improved knowledge about the customer
and the ease of reaching out to her or him, advertising is becoming more personalized.
The behavioural patterns of users are collected and analyzed to create an
individual-oriented relationship to the company. The customers, for their part, can
interact with each other and gather more knowledge about the diversity of products and
services on the market. Because of the amount of available information, both customers
and companies need reliable tools for getting a quick and simple overview of it.
Computational research based on the analysis of tweets contributes to producing such
tools [11].

Due to the size and the varying quality of the information available on the Internet,
the process of finding and extracting relevant information has become more difficult
and time consuming. In order to overcome this problem, a wide range of clustering
and summarization methodologies were developed. Due to the dynamic nature of Web
content, there is a constant need to improve and adjust already existing methods.

This study focused on classification of information retrieved from Twitter, a
micro-blogging website. In June 2015, there were 315 million active users on Twitter
around the globe [12]. With such a high number of users, almost any action in the real
world receives feedback on the networking platform. Politicians, celebrities and companies
choose to connect to their fans, followers and customers using this medium. Twitter has
played an important role in socio-political events such as the Arab Spring and the Occupy
Wall Street Movement. It attracts a diverse community of users and therefore makes
it possible to study different social phenomena. For instance, tweets have been used to
create a variety of applications: from the recommendation of the best navigational routes
based on tweets [13] to mapping the scale of the destruction caused by a tsunami [14].
Furthermore, Twitter is simple to leverage due to the existence of APIs that enable a
fast and simple connection to the website.

This study focused particularly on the commercial use of Twitter. The goal was to find
the reasons for hate towards mobile operators. The analysis was conducted on tweets
addressed to the three largest mobile operators in the US: AT&T, Verizon and Sprint.
Machine learning techniques were used to identify hate tweets and the related reasons.
The motivation and purpose of the study are described in section 1.3.

1.1 Twitter

Twitter is a microblogging website where users share their experiences, opinions and
feelings in messages of up to 140 characters called tweets. Tweets can be posted from
various mobile and desktop clients and even by other applications on behalf of the
user. A user is a person who posts a tweet. Every user has a unique user name and
an id number associated with it. A user can have followers, people who get updated on
what the user has posted. Other users can comment on the content of posted tweets
or share them, in other words retweet them. It is also possible to like the content of a
tweet by clicking the Favorite button. The user can manually designate a topic for the
tweet by adding a hashtag, represented by the '#' sign. In order to address another user,
an "at-sign" @ has to be added in front of the user's name.
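
The tweet conventions described above can be illustrated with a short sketch. It assumes that hashtags and addressed user names consist only of letters, digits and underscores; the patterns and function names are illustrative and are not taken from the study or from any Twitter library.

```python
import re

# Simplified patterns (an assumption): a '#' or '@' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

MAX_TWEET_LENGTH = 140  # character limit described in the text


def extract_entities(tweet):
    """Return the hashtags and the addressed user names found in a tweet."""
    return {
        "hashtags": HASHTAG.findall(tweet),
        "mentions": MENTION.findall(tweet),
    }


def fits_in_tweet(text):
    """Check whether a text respects the 140-character limit."""
    return len(text) <= MAX_TWEET_LENGTH
```

For example, `extract_entities("@verizon my signal is gone again #fail")` separates the addressed user from the topic tag. Real tweets are noisier than these patterns assume, so this is only a starting point.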

Tweets are very noisy data because of their short length, informal language and lack of
context or background knowledge. This makes the analysis of tweets a challenging
task.

1.2 Background

According to previous studies, in order to create an efficient marketing strategy, the
online content of social media should be used to bridge the discrepancy in the
physical-virtual relationship between consumers and companies [15]. For this reason,
management systems for the analysis of web content are becoming an inevitable part of
strategy analytics for small and mid-size businesses, large corporations and organizations
alike [11]. These systems come in different forms depending on the needs of the company.
One of the methodologies used to map the relationship between the customer and the company
is a Customer Journey Map (CJM). A CJM is a directed graph that visualizes
how the relationship between the customer and the company develops. There is no official
definition of a CJM, and it exists in different variations depending on the brand and
purpose. Typical stages of a consumer-company relationship are the following: Discovery,
Investigation, Order, Usage, and Termination. Each category describes
the current state of the relationship between the user and the company or its products.
If the information posted on the Internet by customers can be automatically classified
according to CJM, it can help create personalized offers and better foresee the needs of
the customer.

Prior to the project described in this report, another study on the classification of
tweets using Support Vector Machines (SVM) was conducted. The goal of that study was
to explore how well the model could be used for tweets. Therefore, the classification
categories were taken from CJM. During the labeling of the tweets, it was noticed
that the number of hate tweets about the mobile carriers and their services was
surprisingly high. Hate tweets are tweets that explicitly express hate towards a
company (in this case only mobile carriers, but the concept can be extended to other
industries) or its products and/or services. For example, the tweet "I hate Verizon"
conveys a strong feeling but does not explain what the specific problem is. The lack of an
explanation makes it impossible to classify hate tweets according to CJM.

The negative sentiment of tweets damages the reputation of the company. Therefore,
an understanding of the underlying reasons would provide a basis for better decision
making. By analysing tweets, it is possible to minimize the guesswork of companies
regarding the issues that create dissatisfaction.

1.3 Motivation

The question that prompted this study was whether it was possible to determine the
reason for hate tweets. From reading the timelines that included such tweets, we
concluded that in many cases it was possible to determine the source of dissatisfaction
with the services or products provided by the mobile operators. We also noticed that
the majority of hate tweets in the data collection had an explanation either before or
after the hate tweet. Therefore the primary goal of the study became the development
of a method for identifying hate tweets and their reasons.
Chapter 2

Theory

Prior to the classification of tweets with a machine learning algorithm, it is necessary
to collect data, remove superfluous words, and define the classes and how they can be
represented. This section explains these major steps in text mining along with a geometric
interpretation of the principles behind SVM. Finally, it describes the format of the files
used for training and testing SVM with the LIBSVM tool
(https://www.csie.ntu.edu.tw/~cjlin/libsvm/).

2.1 Data Retrieval

Tweets and the information related to them, such as the name of the user, publication
date, location (if enabled by the user) and profile information, can be downloaded from
Twitter using one of the numerous Twitter Application Programming Interfaces
(APIs). Twitter APIs have to be accessed using an authentication request. The request
is conducted through the Open Authentication protocol (OAuth). The protocol
defines how an application requests access to one of the APIs: it has to
submit the credentials issued by Twitter. To receive those, the application has to be
registered on Twitter [16].

Twitter provides the developer with several options when it comes to the choice of an
API. The most used ones are the Streaming API and the REST API. There are three major
differences between the Streaming and REST APIs:

• data access
Streaming APIs work in online mode, providing continuous access to newly posted
tweets. Once a connection between the application and Twitter is established, the
tweets will start flowing into the system. REST APIs collect only data that has
already been posted on Twitter, pulling a certain number of tweets per request.
Only tweets published within the last week can be collected using REST APIs.

• rate limits
Streaming APIs are allowed to stream 5,000 user ids concurrently [17]. The APIs’
rate limit window duration is 15 minutes long [18]. Users represented by access
tokens can make 180 requests/queries per time window. Using application-only
authentication, an application can make 450 queries/requests per 15 minutes on
its own behalf.

• search function
Search queries can be built from the key words, phrases, names of the users, dates
etc. It is also possible to include additional parameters in the search query. For
example, the search can be restricted to a certain language or geolocation. The
format of the search query is the same for both REST and Streaming. However, the
search function is implemented differently in REST and Streaming. REST provides
relevance but not completeness. For this reason it might be more appropriate to
use Streaming APIs in some cases.
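
The request budgets quoted under rate limits translate into a minimum spacing between requests. The sketch below only does this arithmetic, assuming the requests are spread evenly over the 15-minute window; the function names are illustrative and are not part of any Twitter API.

```python
def min_seconds_between_requests(requests_per_window, window_minutes=15):
    """Smallest even spacing that keeps a client within its request budget."""
    return window_minutes * 60 / requests_per_window


def max_requests(requests_per_window, hours, window_minutes=15):
    """Number of requests allowed over a number of whole hours."""
    windows = (hours * 60) // window_minutes
    return windows * requests_per_window
```

With the limits quoted above, a user token (180 requests per window) allows one request every 5 seconds, and application-only authentication (450 requests per window) allows one every 2 seconds.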

There are three types of streams: public streams, user streams and site streams. Public
streams contain public tweets published in chronological order; user streams are the
tweets from the timeline of a single user; site streams access timelines of multiple users
[17]. In this study, user streams were used to collect data from the timelines
of the users who posted hate tweets.

2.2 Text Mining

Text mining refers to the process of finding relationships and patterns in collections of
unstructured information. Text analytics can be broken down into three steps [19]:

1. Pre-processing: removal of stop-words, stemming, and tokenization.

2. Text representation: determination of the most significant features of a text.

3. Knowledge extraction: use of machine learning tools to find relationships and
hidden dimensions that are difficult to determine manually.

Stop-word removal eliminates words that carry little meaning and are about equally
frequent in any text, for example articles and pronouns. Stemming contributes to a
denser feature representation by reducing inflected words to their root. Tokenization
removes punctuation and breaks phrases into single words, unigrams. When the necessary
pre-processing is done, the content of a text is analyzed using statistical methods
to determine its characteristic features. Text data can be treated in different
ways and on various levels. It can be treated as a collection of independent unigrams.
This approach, known as bag-of-words (BOW), omits the linguistic specifics of the text
and neglects the semantics. BOW is the most common approach due to its
simplicity. However, words can be used in different contexts, and therefore important
nuances might be lost, which could lead to inaccurate results in the knowledge extraction
step. In some cases, the most common pairs of words, called bigrams, are added to the
feature space. The feature space is the collection of all the representative
characteristics of the studied classes. Each class is represented by a feature vector.
There is another, more sophisticated approach that treats data on the semantic level.
Compared to BOW, it is more computationally demanding since
it takes advantage of the existing relations between the words. It is still more common
to use BOW for the creation of the feature space.
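
The pre-processing steps and the bag-of-words representation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer; neither is the tooling used in the study.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "my", "the", "a", "an", "is", "it", "and", "to"}


def tokenize(text):
    """Lower-case the text, drop punctuation and split it into unigrams."""
    return re.findall(r"[a-z0-9&]+", text.lower())


def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, remove stop-words and stem the remaining unigrams."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def bag_of_words(tokens):
    """Count unigrams, ignoring their order (the BOW representation)."""
    return Counter(tokens)


def bigrams(tokens):
    """Pairs of adjacent tokens that could be added to the feature space."""
    return list(zip(tokens, tokens[1:]))
```

Calling `preprocess("I hate Verizon, my calls keep dropping!")` drops "I" and "my" and reduces "calls" to "call". The crude stemmer also turns "dropping" into "dropp", which shows why a proper stemming algorithm is normally preferred.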

The representation of classes does not have to be constrained to the most significant
words. On the contrary, research shows that adding extra features improves the
ensuing analysis [9, 20]. In section 3, various feature extension techniques are described.
In order to extract useful knowledge from the created representation, the features are
stored as numbers and analyzed with machine learning techniques. The conversion of words
to numbers can be performed in various ways. The technique for feature selection that
was used in this work is the Akaike Information Criterion (AIC). AIC is based on the
estimation of the information loss when modeling the underlying processes that create
the observations. Given empirical data, for example a word in a tweet, AIC estimates
how likely it is that the word appears in the target class compared to other classes
using the maximum log-likelihood. Furthermore, AIC imposes a penalty on the number of
free parameters: the higher the number of parameters, the higher the penalty. This
approach discourages overfitting. Overfitting is a common problem in machine learning:
the model fits the training data too closely and therefore misclassifies test data. It is
especially likely when the number of features is higher than the number of data points.
AIC is calculated as:

\[ \mathrm{AIC} = -2 \ln L + 2K, \tag{2.1} \]

where $L$ is the maximized likelihood of the model (so $\ln L$ is the maximized
log-likelihood) and $K$ is the number of free parameters.

It is important to emphasize that AIC does not provide an absolute measure of the precision
of a model, but rather a tool for comparing different models. Among the candidate models,
the one with the lowest AIC value is preferred.
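
The comparison of models with equation 2.1 can be sketched for a single word. The example below is an assumption-laden illustration, not the procedure of the study: it supposes that the occurrence of a word is summarised in a 2x2 table (word present or absent, target class or other classes) and compares a model in which word and class are independent (2 free parameters) with one in which they are dependent (3 free parameters). All function names are hypothetical.

```python
import math


def aic(log_likelihood, num_params):
    """Equation 2.1: AIC = -2 ln L + 2K."""
    return -2.0 * log_likelihood + 2.0 * num_params


def _log_term(count, prob):
    # Contribution count * ln(prob); an empty cell contributes nothing.
    return count * math.log(prob) if count > 0 else 0.0


def compare_models(n11, n12, n21, n22):
    """Return (AIC of independent model, AIC of dependent model).

    n11: tweets of the target class containing the word
    n12: tweets of the target class without the word
    n21: tweets of other classes containing the word
    n22: tweets of other classes without the word
    """
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    cells = ((n11, 0, 0), (n12, 0, 1), (n21, 1, 0), (n22, 1, 1))
    # Independent model: P(cell) = P(row) * P(column), 2 free parameters.
    ll_independent = sum(
        _log_term(c, rows[r] / n * cols[k] / n) for c, r, k in cells
    )
    # Dependent model: every cell has its own probability, 3 free parameters.
    ll_dependent = sum(_log_term(c, c / n) for c, _, _ in cells)
    return aic(ll_independent, 2), aic(ll_dependent, 3)
```

When the word is spread evenly over the classes, the independent model has the lower AIC because it fits equally well with fewer parameters. When the word is concentrated in the target class, the dependent model wins, which marks the word as a candidate feature.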

After defining the feature space, knowledge discovery methods can be applied to the
numerical representation of the words. A common and robust approach is Support
Vector Machines (SVM) described in section 2.3.1.

2.3 Supervised Machine Learning

The main idea of machine learning is to teach a computer to associate new data with data
that it has been exposed to earlier. The dynamic character of the data available on the
Internet is hard to deal with automatically because pre-programmed algorithms might
not describe tomorrow's reality. For instance, without machine learning algorithms
the detection of spam e-mails and fraud becomes difficult. Therefore the goal is to
recognize old patterns in new data. Gmail, for example, has been taught to distinguish
spam from non-spam e-mails rather than being explicitly programmed for it.

There are different types of machine learning: supervised, semi-supervised, unsupervised,
active and deep learning. This study focused only on supervised machine learning.
The name comes from the fact that "correct answers", i.e. labelled training data, must be
given. The algorithm creates a prediction model based on the observed samples.

Supervised learning targets two types of problems: regression and classification.
Regression addresses the prediction of continuous-valued output; classification
problems deal with discrete values. This study considered discrete-valued output
and therefore focused on the classification problem. In classification, the features of
classes are represented as vectors consisting of the most significant attributes. For
example, tweets are represented by their most significant words, which are organized in
feature vectors. Supervised learning relies on labelled training data. Usually, the
retrieval of the needed data is relatively simple. Manual labeling, on the other hand, has
proven to be more time consuming. For this reason, the approach of active learning
was evaluated in this project. Active learning algorithms require smaller training data
sets than supervised algorithms. However, active learning involves interaction,
i.e. the algorithm asks the user to label some instances of unclassified data to enhance
its "understanding" of significant features.

2.3.1 Support Vector Machines (Geometrical interpretation)

The basic idea of SVMs includes three main points: maximizing margins, the dual
formulation, and kernels [1]. Maximizing margins means that the aim is to find a plane
that separates the nearest points of two classes, A and B, by as wide a distance
as possible. This problem is illustrated in figure 2.1.

Figure 2.1: Data from class A and class B separated by two different planes: one
represented as a dashed line and another one represented as a solid line [1].

Figure 2.2: The maximal margin is represented by the line that goes through the
points d and c and is orthogonal to the hyperplane [1].

In figure 2.1, we see that the solid line is a better choice of demarcation than the dashed
one because the margin from the nearest point of each data set to the line is larger. The
classes A and B are represented by the matrices $A_{m \times n}$ and $B_{m \times n}$
respectively. Every row holds the coordinates of one point in the data set, and $x$
denotes a vector of features. The input of each class leads to an output
$y \in \{-1, +1\}$. Let us assume that the points of class A satisfy $x'w \geq 1$
and at least one point in A lies on the plane $x'w = 1$, where $w$ is the normal of this
plane. Similarly, the points of class B satisfy $x'w \leq -1$ and at least one point lies
on the plane $x'w = -1$. The distance between these two supporting hyperplanes is
$2/\|w\|$. Consequently, the distance between the two planes can be maximized by minimizing
$\|w\|$. The space between the hyperplanes should not contain any data points, which
gives rise to two constraints that should be fulfilled at the same time,
$Aw - Ie \geq 0$ and $Bw + Ie \leq 0$ (with $e$ a vector of ones). The final separating
plane lies between the supporting hyperplanes, and the objective of the minimization is
$\frac{1}{2}\|w\|^2$. The problem of finding the two closest points can be stated in the
following way:

\[
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\|w\|^2 \\
\text{subject to} \quad & Aw - Ie \geq 0, \quad Bw + Ie \leq 0.
\end{aligned}
\tag{2.2}
\]

Figure 2.3: Two distinct classes, represented by blue and red dots, cannot be separated
by a maximal margin hyperplane [2].

The line between the two closest points must be orthogonal to the supporting hyperplanes.
In order to construct such a plane, the algorithm finds the convex hulls of the two classes
and constructs a line between the two nearest points in these sets. This approach is
illustrated in figure 2.2, where the line that goes through the points d and c, denoted by
w, is the maximal margin, and the solid line orthogonal to w is the maximal margin
hyperplane. The point d and the two circle points lying on the same dashed line as the
point c are called support vectors. They support the maximal margin in the sense that if
these points moved, the maximal margin hyperplane would shift too. Notice that a change in
the position of any other point does not affect the maximal margin hyperplane unless the
point crosses the boundary. It is important that an orthogonal line can be drawn between
the two supporting planes; otherwise the two points might not be the closest
ones, or the supporting hyperplanes might not be as far apart as possible.
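
The geometric quantities of this section can be checked numerically. The sketch below assumes, as in the text, supporting hyperplanes of the form x'w = 1 and x'w = -1 without an offset term; the function names and the example points are made up for illustration.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def margin_width(w):
    """Distance 2 / ||w|| between the planes x'w = 1 and x'w = -1."""
    return 2.0 / math.sqrt(dot(w, w))


def separates(A, B, w):
    """Check that every point of A has x'w >= 1 and every point of B has x'w <= -1."""
    return all(dot(x, w) >= 1 for x in A) and all(dot(x, w) <= -1 for x in B)
```

For the points A = [[2, 0], [3, 1]] and B = [[-2, 0], [-4, 2]], the normal w = [0.5, 0] satisfies both constraints and gives a margin of width 4. A longer w that still separated the classes would give a narrower margin, which is why the norm of w is minimized.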

However, it is not always possible to perform a clear classification and therefore to
find a maximal margin hyperplane. This case is illustrated in figure 2.3, where two
classes, represented by blue and red dots, cannot be separated by a margin hyperplane.
This also means that the maximal margin classifier cannot be used. A commonly used
approach to deal with this problem is to allow a certain degree of misclassification in the
interest of a better classification of most of the data. This approach is called the
support vector classifier. In order to allow a certain degree of misclassification, a
non-negative tuning parameter C is introduced. The parameter C determines how much
freedom we have to violate the margin. If C is high, the system is tolerant of
misclassifications and the margin is wide, so several violations are allowed.
When C is low, the margin narrows, implying a classifier that is fit closely to the data.
Here C acts as a budget for margin violations. In a formulation where C instead multiplies
the sum of the slack variables as a penalty, as in equation 2.3, the effect is reversed:
a high C penalizes violations heavily and narrows the margin.

There are some classes that cannot be separated linearly because the relationship be-
tween the outcome and the predictors is non-linear. In this case, the number of predictors
is expanded by using a non-linear function. More specifically, this is done by applying
so called kernels.

The final formulation of optimization problem becomes:

n
1 X
minimize kwk + C ξi ξ ∈ Rn
2
i=1

subject to yi (hw, Φ(xi )i + b) ≥ 1 − ξi , i = 1, ..., n, (2.3)


b ∈ R, (2.4)
ξ ≥ 0, i = 1, ..., n, (2.5)

where $\xi_i$ is a slack variable that allows individual observations to be on the wrong side of
the margin, and x is mapped to a feature space by a non-linear map $\Phi$. If $\xi_i = 0$ then the
i-th observation is on the correct side of the margin; if $\xi_i > 0$ then it is on the wrong
side of the margin; if $\xi_i > 1$ then it is on the wrong side of the hyperplane.
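The three cases for the slack variable can be illustrated with a minimal sketch. It assumes that a decision value $f(x_i) = \langle w, \Phi(x_i) \rangle + b$ is already available and uses the fact that, at the optimum, $\xi_i = \max(0, 1 - y_i f(x_i))$. The function names are illustrative and are not part of LIBSVM.

```python
def slack(y, decision_value):
    """Slack of one observation: xi = max(0, 1 - y * f(x)), with y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_value)


def margin_status(xi):
    """Map a slack value to the three cases described in the text."""
    if xi == 0:
        return "correct side of the margin"
    if xi <= 1:
        return "wrong side of the margin"
    return "wrong side of the hyperplane"


# Three observations of the positive class with different decision values.
statuses = [margin_status(slack(+1, f)) for f in (2.0, 0.5, -0.5)]
```

An observation with decision value 0.5 has slack 0.5: it lies inside the margin but is still classified correctly, which is the situation the support vector classifier is designed to tolerate.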

In equation 2.3, the function $\Phi(x_i)$ does not need to be calculated explicitly. Instead,
another function that defines the inner product in the new space is used:
$K(x_i, x_j) = \Phi(x_i)^{T} \Phi(x_j)$. $K(x_i, x_j)$ is called a kernel function. The kernel
returns the inner product in the feature space but is evaluated directly on the input
data in $\mathbb{R}^{d}$. Some of the most common kernels are the following:

• Linear: $K(x_i, x_j) = x_i^{T} x_j$.

• Polynomial: $K(x_i, x_j) = (\gamma\, x_i^{T} x_j + r)^{d}$, $\gamma > 0$.

• Radial basis function (RBF): $K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^{2}}$, $\gamma > 0$.
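As an illustration, the three kernels can be written as plain Python functions operating on lists of numbers. This is a minimal sketch and not the LIBSVM implementation; the RBF kernel is written in its usual form with a negative exponent, $\exp(-\gamma \lVert x - z \rVert^{2})$, and the parameter names follow the formulas above.

```python
import math


def linear_kernel(x, z):
    """Inner product of two input vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, gamma, r, d):
    """(gamma * <x, z> + r) ** d, with gamma > 0."""
    return (gamma * linear_kernel(x, z) + r) ** d


def rbf_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

The RBF kernel equals 1 for identical vectors and decays towards 0 as the vectors move apart, so γ controls how quickly the similarity drops with distance.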

So far, the theory has been presented for the case of two classes. However,
SVM can also be used for the classification of more than two classes. The two most
common approaches are one-versus-one and one-versus-all. A one-versus-one approach
compares the classes pairwise, i.e. a separate classifier is trained for each pair of classes.
A one-versus-all approach treats the target class as one class and all the other classes
together as the other one.
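The one-versus-all relabelling can be sketched as follows. The helper name and the toy label list are assumptions made for illustration; the four category names are the ones used in this study.

```python
def one_versus_all_labels(labels, target):
    """Relabel a multi-class problem as binary: +1 for the target class, -1 otherwise."""
    return [1 if label == target else -1 for label in labels]


# Toy label list; one binary problem is created per category.
labels = ["Hate", "Other", "Reason", "Explicit", "Other"]
binary_problems = {c: one_versus_all_labels(labels, c) for c in sorted(set(labels))}
```

Each entry of `binary_problems` is the label vector for one binary SVM, so four categories lead to four separately trained models.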

2.4 SVM Data Format

The LIBSVM tool requires the input data, i.e. the training and test data for the SVM
algorithm, to be in a certain form called SVM format. It is a numerical representation of
textual data as attribute:occurrence pairs, where the attribute is the position of the word
in the AIC table for the target category and the occurrence is the number of times the word
appears in the current tweet. For example, ”hate hate hate verizon” would be encoded
as 1:3 4:1, where 1 is the position of the word ”hate” in the AIC table for the
category Hate, 3 is the number of times the word ”hate” occurs in the
current tweet, 4 is the position of the word ”verizon”, and the final 1 is its occurrence,
because it appears only once. If a tweet contains one or more words that do
not appear in the table, those words are ignored. This is referred to as the sparsity
problem.
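The encoding can be sketched in a few lines of Python. The dictionary `aic_table` below is a hypothetical stand-in for the AIC table of one category (it only contains the two positions used in the example above), and the function name is illustrative. LIBSVM expects the attribute indices in ascending order, optionally preceded by a class label.

```python
from collections import Counter


def to_svm_format(tokens, word_index, label=None):
    """Encode a tokenized tweet as 'index:count' pairs; unknown words are ignored."""
    counts = Counter(word_index[t] for t in tokens if t in word_index)
    features = " ".join(f"{index}:{count}" for index, count in sorted(counts.items()))
    if label is None:
        return features
    return f"{label} {features}".rstrip()


# Hypothetical excerpt of an AIC table: word -> position in the table.
aic_table = {"hate": 1, "verizon": 4}
line = to_svm_format("hate hate hate verizon".split(), aic_table)
```

A tweet whose words are all missing from the table is encoded as an empty feature list, which is the extreme case of the sparsity problem.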
Chapter 3

Related Work

This study proposes a framework for identification of hate tweets related to mobile
operators and their triggers using classification techniques. Therefore the review of
relevant works focuses mainly on classification of tweets and a few studies related to
hate tweets. To broaden the understanding of Twitter’s importance for companies, a
brief review of the marketing studies related to Twitter is also covered in this section.

Short text classification is a field of text classification into pre-defined categories where
the analysed data is sparse. The main goal of short text classification is to assign a
certain category to a piece of information, for example a tweet. Accurate automatic
classification of tweets is a challenging task due to their short length, variations in the
vocabulary, unclear context and the sparseness of the text. To overcome these problems,
the classification is always preceded by pre-processing and sometimes by expansion of
the vocabulary.

According to previous research, there are other decisions that might impact the outcome
of the classification. One of them is the choice of class labels. A suitable choice of labels
facilitates the division of data into well defined clusters, which benefits the representation
of classes. The labeling of data depends on the application. The choice of the labels
is often based on the observations of data or the adoption of pre-existing classification
schemes (e.g. the advertising model AIDA, where the acronym stands for Attention,
Interest, Desire and Action). ”Mining Consumer Attitude and Behavior” by Hwon
et al. [21] shows that an appropriate clustering, in this case AIDA, can support the
methodology and reveal hidden relationships. However, in some cases class-labels cannot
be determined beforehand due to their dynamic nature. News and trending topics are
two examples of constantly evolving and changing subjects [22], [3]. Nevertheless, even
this type of problem utilizes a framework of pre-determined generic categories, such as
technology, art etc, that supports the classification task.

As stated earlier, tweets are informal, short and sparse (i.e. a certain word might
occur seldom in the tweets), therefore the removal of noise and creation of a denser
feature space are important for future feature vector construction. The pre-processing
procedure is described in the section Text mining. Pre-processing steps were highlighted
in the studies by Lee et al., Wang et al. and Perez et al. [3], [9], [11]. In addition to pre-
processing, some studies [23], [22], [4] employed filtering techniques in order to improve
the relevance level of the collected tweets. The keywords for filtering are often derived
from the observed data and/or defined from dictionaries or other external resources.
For example, Yang et al. [23] in a study from 2013 suggested a method for filtering out
ambiguous uses of a company name. The classification was binary: a tweet was either related
to the company or not. Each category was represented by keywords that were
defined as the words most frequently searched for by Internet users. For example, the
top three keywords for the company Apple were: apple, apple store, apple iphone 4.
After the keywords were determined, two different filters were created in order to ensure
high accuracy of recognition of the keywords. Filter 1 checked whether the keyword
appeared as a whole keyword; Filter 2 determined the relevant single tokens. For instance, if
the keyword is ”iphone 4” and the target tweet is ”I love iphone”, Filter 1 would reject
the tweet, whereas Filter 2 would recognize it as being relevant.
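The difference between the two filters can be illustrated with a rough sketch. It is an interpretation of the description above and not the implementation of Yang et al.; in particular, the function names and the set of relevant tokens are assumptions.

```python
def filter_whole_keyword(tweet, keyword):
    """Filter 1 (sketch): accept only if the complete keyword phrase occurs in the tweet."""
    tokens = tweet.lower().split()
    phrase = keyword.lower().split()
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def filter_single_tokens(tweet, relevant_tokens):
    """Filter 2 (sketch): accept if any relevant single token occurs in the tweet."""
    tokens = set(tweet.lower().split())
    return any(token in tokens for token in relevant_tokens)


# The example from the text: keyword "iphone 4", tweet "I love iphone".
rejected_by_filter_1 = not filter_whole_keyword("I love iphone", "iphone 4")
accepted_by_filter_2 = filter_single_tokens("I love iphone", {"iphone"})
```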

Classification of tweets can be improved by the proper determination of the most significant
words and further refined by lexical and semantic analysis of tweets [9]. For example, the
feature space can be expanded by n-grams and part-of-speech tagging. An n-gram is a
sequence of n consecutive words; n-gram models are used, for example, for predicting the
next word in a sequence. Part-of-speech tagging is a way of disambiguating the words
in a corpus by tagging them grammatically. One of the methods proposed by Perez et al. [11]
exploited only the textual information conveyed in the tweets. For example, they investigated
the co-occurrence of words and the lexical similarity of tweets. But even this feature space
expansion is not sufficient for an accurate classification. Therefore researchers have explored
other possible ways of enhancing the data representation. Generally, there are two ways to
improve the classification of tweets. One approach is to mine internal resources, i.e. to
mine deeply the inherent characteristics of tweets such as the number of words, punctuation
etc. Some studies conducted using this methodology are described in section 3.1. The
opposite approach is to leverage external resources and widen the context of the collected
tweets. Often the tweets are complemented with information from knowledge bases
(e.g. Wikipedia), online dictionaries, review pages etc. This approach is described in
section 3.2.

3.1 Internal Enhancement of Feature Space

Aside from tweets, Twitter provides the reader with additional information such as topic-
indicative hashtags, context-enhancing links, user profile information, lists of followers
etc. Previous studies have shown that a higher classification accuracy can be achieved
by expanding the feature space using this internal information [22], [24], [9]. For example,
feature-based enhancement was explored by Sriram et al. [20]. They improved the
classification of tweets by adding a nominal feature, i.e. the author’s name, and seven
binary features. These were, for example, the absence of shortenings, emotions, and slang
words. Benevenuto et al. [24] proposed a classification method for distinguishing spammers,
i.e. users who post spam, from non-spammers by integrating 23 different metrics
discovered on Twitter. Interestingly, the evaluation of the features showed that even
the least significant ones improved the classification compared to the baseline method.
Therefore it is reasonable to assume that even low-ranked features have discriminatory
power. Furthermore, regarding users not as single elements but as
part of a larger network has been shown to be beneficial [3].

3.2 External Enhancement of Feature Space

One of the big challenges of tweet classification is the sparseness of tweets, i.e. many words
appear only a few times in a corpus. For this reason, some tweets might be represented
by an empty vector. To alleviate this problem, attempts have been made to
create a denser feature space. A common technique is to incorporate external resources
such as search engines, open source knowledge bases (e.g. Wikipedia) and online
dictionaries. Perez et al. [11] proposed three ways of enriching
the original feature vectors: incorporating general information about the company
provided by Wikipedia, enriching the tweets that really refer to companies, and expanding
only the ambiguous words with external information. The results showed that the
third approach performed better than the other methods on specific company names
(Armani, Warner, Cadillac etc), while the first approach performed better than the other
methods on generic company names (Parl, Sprint, Southwest etc). In a paper from 2008 Phan et
al. [25] explored the enrichment of the tweets with extracted topics and found that it
improved the classification and outperformed the baseline.

Although external feature enrichment has great potential, it is still a time-consuming
process and strongly dependent on the choice of the external resources. This explains
why many studies prefer to explore the internal characteristics of tweets.
Method                                                 Accuracy
BOW approach for classification with C 5.0 [3]         65 %
Network-based classification with C 5.0 [3]            70 %
DFICF with Naive Bayes [9]                             71 %
Adding user attributes; training with SVM [24]         87 %
Tf and clarity with Maximum Entropy classifier [26]    70 %
BOW and eight additional features [20]                 95 %
Self-Term expansion with K-means classifier [11]       60-95 %
TEM-Wiki with K-means classifier [11]                  60-97 %
TEM-Full with K-means classifier [11]                  54-73 %

Table 3.1: Accuracy of the proposed methods for some of the studies

3.3 Marketing Potential of Twitter

The spreading of information over the Internet requires companies to revise their
marketing strategies [27]. According to an interview study, a presence on the Web allows
companies to understand consumers’ consumption habits, detect and anticipate negative
reactions etc [28]. A better understanding of users’ preferences along with advances in
Twitter mining and classification techniques has created opportunities for the development
of user models. Based on knowledge about users’ interests and tweeting
patterns, scientists have tried to understand how to better target users with product
recommender systems [8]. A paper from 2011 [29] investigated how tweeting activities
could support the modeling and personalization of user profiles. Based on the hashtags
and the topics of the posted tweets, the study compared three profile models with respect
to their applicability to a news recommendation system.

3.4 Antagonism in Tweets

The topic of hate and radicalization on Twitter has been addressed in a handful of studies.
Burnap and Williams [30] addressed the identification of hateful and/or antagonistic
statements against certain races and religions. In a paper from 2013, Kawase et al.
investigated what consequences hateful tweets about jobs might have [31]. As far as we
are aware, no study has addressed hate tweets related to mobile operators.

Pendar [32] studied the identification of sexual predators in online chats and
showed that the performance of SVM can be improved by adding n-grams to
create a more specific context for the use of the most relevant words. In a study from 2014
Agarwal and Sureka [33] focused on the classification of hate and extremism promoting
tweets. They observed that exclamations such as ”send them home” and ”get them
out” appeared frequently in the collected tweets. These types of phrases follow
a certain part-of-speech pattern: verb-pronoun-noun, verb-pronoun-adverb. However,
phrases of positive character such as ”leave them alone” and ”they are peaceful”
follow the same pattern. For this reason typed dependencies were utilized. Typed
dependencies define the relationship between non-adjacent words. The authors
identified the tweets by utilizing hashtags such as #Terrorism, #Islamofobia etc. The
data were manually labelled and the list of indicative hashtags was extended based
on the observations. Furthermore, the feature vectors were expanded with additional
features such as the presence of religious and war related terms, negative emotions, bad
words and exclamation marks. The results of the study showed that all the utilized features
are strong indicators of hate promotion and improve the classification results.
Chapter 4

Method

This chapter gives an overview of the methodology employed in this work. In partic-
ular, we provide a definition of the classification categories and describe the process of
collection and pre-processing of the tweets. Finally, the proposed methods - the Naive
Method (NM) and the Partial Timeline Method (PTM) - are introduced in the sections
4.4 and 4.5.

4.1 Defining hate tweets and reasons

In this study, the definition of a hate tweet was constrained to the presence of two
components: the verb ”hate” in combination with the object that the verb was
addressed to. The object was represented by the name of the company or the pronoun
”it” if it pointed to the company. In addition, words and phrases that described
the company’s services and/or products were also included as an object of hate. Examples
are ”I hate Verizon” and ”I absolutely hate Sprint’s service”. Hate tweets without any stated
explanation for the hate were labeled as Hate.

When reading the timelines of the users who posted a hate tweet, we noticed that
the reasons, if stated, could appear before and/or after the hate tweet. The differences
between these ways of stating the reason were studied more closely in order to determine
if they ought to be treated as separate categories.

First, we looked at the cases when the reasons were stated before the hate tweet, serving
as the premise. These types of reasons were called triggering reasons. Second, we
looked at the cases when the reasons were stated after the hate tweet, serving as an
explanation. These reasons were called justificatory reasons. Third, we looked at the
cases where the reasons were stated both before and after the hate tweet; these were
called combined sequences. The observed types of reasons are summarized in table 4.1.

Category              Description                     Example

Justificatory reason  The reason is explained after   First the user posted:
                      the hate statement and it         I hate sprint so much.
                      justifies the previously        Then the explanation was posted:
                      posted hate tweet.                So upset because our wifi is out and all
                                                        I want to do was watch Supernatural.

Triggering reason     The user first described an     The user explained the problem:
                      issue related to the mobile       Pissed not understanding why my phone
                      service and then made a hate      isn’t ready to be picked up when it was
                      statement.                        suppose to be ready yesterday.
                                                      Consecutive post is a hate tweet:
                                                        I hate AT&T.

Combined sequence     A combination of justificatory
                      and triggering reasons.

Table 4.1: A description of different types of reasons for hate tweets.

These three categories were compared to each other with respect to word
frequency, whether they were addressed to someone (i.e. the presence of a username)
and the time of posting. The results are presented in table 5.4 in the chapter Results
and Discussion. The differences between these cases were insignificant, and they
were therefore treated collectively as a single category, Reason. A tweet was labeled as
Reason if the following criteria were fulfilled:

• it addressed an issue related to the mobile operator or its service

• it was stated during the same day as the hate tweet

• it contained the conjunction because (also spelled cuz, cus, coz, bs, b/c, cause) or the
words reason or why (optional criterion).

In certain cases the hate and the reason were expressed in the same tweet. This category
was called Explicit. Some users even posted one or two tweets, classified as Reason,
before or after the tweet classified as Explicit. The latter category was not further
broken down based on the order in which the hate and the reason appeared, because this
approach did not yield significant results for justificatory and triggering reasons.

The final categorization of the tweets is presented in table 4.2.

Category  Structure                            Example

Hate      The word hate and the name of the    Dear AT&T I hate u.
          company appear in the same tweet.    I absolutely hate Sprint.
          It is clear that the author hates
          the company or its services and/or
          products, but no reason is given.

Explicit  Hate and the reason are explicitly   Starting to hate at&t slow service #at&t.
          stated in the same tweet.            Starbucks free wifi is better than my AT&T
                                               wifi. I hate it so much.

Reason    A tweet that clarifies the reason    The trigger or reason was posted at 03:52:03:
          behind hate.                           Welp only got 40% battery and when my phone
                                                 died it’s over cause it wont charge for
                                                 some reason smh.
                                               Hate tweet posted at 03:58:25:
                                                 And I hate going to the Verizon store smh.

Other     Any other tweets on the user’s       My grandparents are so country.
          timeline.                            Back to work in the morning after 9 days.

Table 4.2: Criteria for hate tweet labelling.

The retweets of hate were not included in the training data. The reason was that these
tweets can be seen as an expression of sympathy and are not necessarily related to the
service experience of the user who retweeted.

4.2 Collection and Pre-processing

In order to collect training data, a search query for the identification of hate tweets and
thereby relevant timelines was created. The query had the same format as a hate tweet,
i.e. it utilized a combination of two words: ”Hate” and the name of the operator, e.g.
Verizon. If these two words were present in the same tweet the entire timeline from that
day was pulled and stored in an Excel table. The underlying assumption was that the
reason for the hate tweet would also be posted during the same day.

The tweets were collected with the Streaming and REST APIs from Twitter, stored in
Excel and manually searched for the reasons.

The labelled tweets were then pre-processed in order to remove noise and create a denser
feature space. In this study, the pre-processing followed a classical scheme described in
Text mining. The first step was to transform the tweets into separate tokens and to remove
punctuation and stop-words. The pre-processing did not utilize the relationship
between the words, i.e. it followed the BOW approach. The next step was to stem all
the words, replace URLs by the word ”url” and replace usernames by the word
”username”. Different spellings of a word were replaced by one alternative¹.
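The pre-processing steps can be sketched as a single pass over the tokens of a tweet. This is a minimal illustration under several assumptions: the stop-word list and the spelling map below are tiny hypothetical examples, the function name is illustrative, and stemming is left out because the stemmer used is not specified here.

```python
import re

# Hypothetical, deliberately small resources for illustration only.
STOP_WORDS = {"i", "the", "a", "to", "so", "is", "my"}
SPELLING_VARIANTS = {"at&t": "att", "@att": "att", "#att": "att"}


def preprocess(tweet):
    """Tokenize a tweet, normalize spellings, URLs and usernames, drop noise."""
    tokens = []
    for raw in tweet.lower().split():
        if raw in SPELLING_VARIANTS:          # unify different spellings
            tokens.append(SPELLING_VARIANTS[raw])
        elif re.match(r"https?://", raw):     # replace URLs
            tokens.append("url")
        elif raw.startswith("@") and len(raw) > 1:  # replace usernames
            tokens.append("username")
        else:
            word = re.sub(r"[^\w]", "", raw)  # strip punctuation
            if word and word not in STOP_WORDS:
                tokens.append(word)
    return tokens


tokens = preprocess("I hate @ATT so much! http://t.co/x @bob")
```

The spelling map is checked before the username rule so that a company handle such as @att is mapped to the company name rather than to the generic token ”username”.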

Pre-processing resulted in four word lists, one for each category. These lists, i.e. feature
sets, were compared to each other using AIC in order to determine the most representative
words for each category. The threshold for the model selection was set to 0.1
to avoid overlap between the words from different classes. The application of AIC is
explained in section A.2.

4.3 Training and testing

Training and testing were performed using LIBSVM. The calculation of AIC was per-
formed in a self-developed program written in Java.

The training and testing procedure was as follows:

1. The data were converted to SVM format, see section 2.4.

2. 20 % of the collected data were saved as test data (proportional to the number of
tweets in each category) and the remaining 80 % formed training data.

3. Training and test data were scaled to the range [-1, +1] using the command svm-scale.

¹ For example, att, att, @att, #att were replaced by att.

Figure 4.1: The Naive Method.

4. An RBF kernel was chosen. The relationship between the feature words and the
categories is non-linear and therefore it is reasonable to assume that an RBF kernel
is a suitable choice. The expression for an RBF kernel was presented in section
2.3.1.

5. The parameters C and γ were selected by grid search using the script grid.py.

6. The SVM algorithm was trained on the training data set. For the classification
task a one-versus-all approach was used.

7. The predictive power of the model was estimated on the test set.
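Step 2 of the procedure above, the 80/20 split that preserves the proportion of tweets in each category, can be sketched as follows. The function name, the fixed random seed and the representation of a sample as a (label, features) pair are assumptions made for illustration; the split in this study was not necessarily implemented this way.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (label, features) pairs so that each label keeps the same test share."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[0]].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting per category rather than over the whole data set matters here because the category Other is much larger than the remaining categories; a purely random split could leave a small category underrepresented in the test set.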

4.4 The Naive Method (NM)

The Naive Method is depicted in figure 4.1. In the first stage, a tweet was either
categorized as Other or not. If a tweet was not classified as Other, it was automatically
checked to see if it belonged to Hate. If it did not belong to Hate, the method checked
if it belonged to Explicit. If the tweet could not be classified as Explicit then it was
automatically labeled as Reason. The tweets were classified using a binary method,
i.e. each tweet was tested to see if it belonged to: 1. Explicit or not, 2. Hate or not, 3.
Reason or not. The same test set was used to predict the accuracy of each model.
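The cascade of binary decisions can be sketched as a small function. The three predicates stand in for the trained binary SVM models; the keyword rules used below are toy placeholders chosen only to make the sketch runnable and do not reflect the actual models.

```python
def naive_method(tweet, is_other, is_hate, is_explicit):
    """Apply the binary classifiers in the order Other, Hate, Explicit."""
    if is_other(tweet):
        return "Other"
    if is_hate(tweet):
        return "Hate"
    if is_explicit(tweet):
        return "Explicit"
    return "Reason"  # fallback when no earlier stage accepts the tweet


# Toy stand-ins for the trained binary models (assumptions, not the real classifiers).
def toy_is_other(tweet):
    return "hate" not in tweet and "because" not in tweet


def toy_is_hate(tweet):
    return "hate" in tweet and "because" not in tweet


def toy_is_explicit(tweet):
    return "hate" in tweet and "because" in tweet


label = naive_method("i hate sprint", toy_is_other, toy_is_hate, toy_is_explicit)
```

Because Reason is the fallback of the cascade, any error made in an earlier stage propagates to the later categories.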

The classification with NM revealed three major misclassifications:

• tweets of an antagonistic character that are not related to the mobile operators,

• tweets related to mobile issues or the carriers but not related to hate and its
reasons,

• overlapping feature vectors for the categories Hate, Reason and Explicit leading
to misclassification.

Due to these misclassifications, we studied alternative ways to classify the hate tweets.
This resulted in PTM. The two methods are compared in chapter 5.

Figure 4.2: The Partial Timeline Method.

4.5 The Partial Timeline Method (PTM)

In a study from 2012 Sun et al. [26] mimicked human labelling to create a filter for
classification. This idea was adopted in our study and adapted to the proposed
method described in this section. This method is not an extension of NM.

In this study we focused on the improvement of the classification using only the information
in the collected tweets. In order to improve the classification and solve the earlier
stated issues, several of the internal features mentioned in section 3.1 were investigated:
the use of usernames, URLs and the time of tweeting.

The schematic representation of PTM can be seen in figure 4.2. The first step of the
method identifies a hate tweet or a self-explanatory tweet. It then searches the timeline
for proximate tweets that were posted within a one-hour time window and classifies
them as Reason or Other. PTM relies on the observation that the majority of the
explanations are posted within 30 minutes before or after the hate tweet, see table
5.3 in section 5.
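The selection of proximate tweets can be sketched with the standard datetime module. The function name, the representation of a timeline as (timestamp, text) pairs and the calendar date are assumptions for illustration; the two times 03:52:03 and 03:58:25 are taken from the example in table 4.2, with the tweet texts abbreviated.

```python
from datetime import datetime, timedelta


def tweets_in_window(timeline, hate_time, half_window=timedelta(minutes=30)):
    """Return the tweets posted within half_window before or after the hate tweet."""
    return [
        (timestamp, text)
        for timestamp, text in timeline
        # skip the hate tweet itself, keep everything inside the one-hour window
        if timestamp != hate_time and abs(timestamp - hate_time) <= half_window
    ]


day = datetime(2015, 1, 1)  # hypothetical date
timeline = [
    (day.replace(hour=1, minute=10), "My grandparents are so country."),
    (day.replace(hour=3, minute=52, second=3), "only got 40% battery"),
    (day.replace(hour=3, minute=58, second=25), "And I hate going to the Verizon store smh."),
]
hate_time = timeline[2][0]
candidates = tweets_in_window(timeline, hate_time)
```

Only the tweets returned by this step are passed on to the Reason-versus-Other classifier, which is why PTM analyses fewer tweets than NM.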
Chapter 5

Results and Discussion

5.1 Evaluation

This study showed that the classification of hate tweets and their reasons is
feasible. The two proposed methods, see section 4, were evaluated in terms of
accuracy and F-score. The accuracy of the classification for NM and PTM along with
the name of the category is presented in table 5.1. As can be seen from the table,
PTM generally performs better than NM. For a sense of context, the accuracy of the
proposed methods can be compared to some previous studies, see table 3.1. A direct
comparison is difficult to perform due to the differences in the size and content of the data
sets.

The accuracy may not be a true representation of how reliable the methods are. This is
because of the unbalanced number of tweets in the different categories. The precision, recall
and F-score can be seen in table 5.2. One value stands out clearly:
the precision for NM, category Reason, which has the highest possible value, i.e. 100%.
Looking at the precision in the context of recall and F-score, however, the performance of
the method is not as good as the precision value alone would suggest. It is important to
emphasize that the amount of training and test data used for conducting the experiments
was comparatively small. The main reason was the time-consuming process of manually
labeling the data. Instead of collecting more data, this project focused on developing the
classification scheme.
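The relationship between precision, recall and F-score can be made concrete with a short sketch. It assumes that the F-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NM, category Reason (table 5.2): precision 1.00 and recall 0.53.
f_reason = round(f_score(1.00, 0.53), 2)
```

For NM and the category Reason this gives 0.69: a perfect precision cannot compensate for the low recall, which is why the F-score is the more informative measure for unbalanced categories.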

As table 5.1 shows, PTM is more accurate than NM. In addition, it saves
time and memory by analysing fewer tweets. However, PTM neglects the tweets posted
outside the one-hour time window. This means that choosing between the proposed
methods implies a trade-off between relevance and completeness.

Table 5.1: The results of the experiments.

Method, category       Accuracy
NM, Other              0.77
NM, Explicit           0.88
NM, Hate               0.66
NM, Reason             0.78
PTM, Hate+Explicit     0.98
PTM, Reason            0.87

Method, category       Recall   Precision   F-score
NM, Other              0.78     0.95        0.86
NM, Explicit           0.79     0.54        0.86
NM, Hate               0.93     0.48        0.63
NM, Reason             0.53     1.00        0.69
PTM, Hate+Explicit     0.96     0.66        0.79
PTM, Reason            0.84     0.76        0.80

Table 5.2: Recall, precision and F-score

Reason          within 30 min   within 5 min   within 1 min
Stated before   77.5%           62.5%          45%
Stated after    70.3%           51.6%          27.5%

Table 5.3: The reasons posted before and after a hate tweet were investigated from
a temporal perspective, i.e. we calculated the percentage of the tweets belonging to
Reason posted 30 minutes, 5 minutes and 1 minute before or after the posting of a hate
tweet.

PTM is based on the observation that the majority of the users posted the reason within
five minutes of the hate tweet: 62.5% of the reasons stated before the hate tweet and
51.6% of the reasons stated after it were posted within five minutes. Within a time window
of half an hour, 77.5% of the reasons were stated before the hate tweet and 70.3% were
stated after the hate tweet. These results are summarized in table 5.3. Notice that the
tweets that were posted within one minute are included in the tweets that were posted
within five minutes. Furthermore, all of these tweets are included in the tweets that were
posted within 30 minutes.

In order to understand what the most common causes of the users’ frustration were, we
analysed the prevalence of the most common bigrams in the categories Reason and Explicit.
These are ”my phone”, ”customer service”, ”unlimited data”, ”service sucks”, ”phone
bill”, ”my data”, see table 5.5. The content of the tweets focused on phone, data and/or
service related issues. The BOW approach used in this work utilized the words within a
corpus as predictive features and ignored word sequences. When single words are the
primary features for classification, this approach can lead to misclassification because
the same word is used in different contexts.
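A bigram count of this kind can be sketched with the standard library. The definition of prevalence used here, the share of tweets that contain a bigram at least once, is an assumption, as are the function name and the toy tweets.

```python
from collections import Counter


def bigram_prevalence(tweets):
    """Percentage of tweets that contain each bigram at least once."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        # a set, so that a bigram repeated within one tweet is counted once
        counts.update(set(zip(tokens, tokens[1:])))
    return {" ".join(bigram): 100.0 * n / len(tweets) for bigram, n in counts.items()}


# Toy tweets for illustration only.
toy_tweets = [
    "my phone died",
    "my phone bill is high",
    "customer service sucks",
    "no data",
]
prevalence = bigram_prevalence(toy_tweets)
```

Unlike the BOW features, bigrams keep part of the word order, so ”phone bill” and ”my phone” are counted as different features even though they share the word ”phone”.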
Category        Time interval          Percentage of tweets (%)
Triggering      06:01 am - 12:00 pm    30
                12:01 pm - 06:00 pm    25
                06:01 pm - 12:00 am    35
                12:01 am - 06:00 am    10
Justification   06:01 am - 12:00 pm    40.1
                12:01 pm - 06:00 pm    25.3
                06:01 pm - 12:00 am    29.1
                12:01 am - 06:00 am    5.5

Table 5.4: Percentage of tweets per time interval

Bigram                 Percentage (%)
”my phone”             20
”customer service”     5
”unlimited data”       5
”service sucks”        5
”phone bill”           5
”my data”              5

Table 5.5: Most frequent bigrams present in tweets classified as Reason. The bigrams
that are not presented in this table had a prevalence lower than 5%.

5.2 Observations

In order to see if there was any variation in the tendency to post triggering and
justificatory tweets over the course of the day, we analysed the tweets by dividing
the day into four six-hour periods. Looking at the data in table 5.4, the tweets were
distributed relatively evenly over the day. However, we could observe a clear difference
in the tendency to post justificatory reasons vs triggering reasons. Justificatory
reasons were posted in 68 cases out of 100. One possible explanation is that the users
posting hate tweets were contacted by other users and/or the company in question
regarding the reason for the hate tweet. For instance, we observed that the mobile
operators AT&T, Verizon and Sprint interact with the customers who have expressed
hate against them.

One of the biggest challenges with this work was the time-consuming manual labeling of
the training and test data. However, even in the absence of statistically significant
results, it is possible to explore aspects of the methodology such as the choice of class labels,
vector features and the sequence of steps by which the classification is carried out. It
is important to emphasize that since the data set analysed in this study is limited, see
Appendix A.1, one should be cautious when drawing conclusions.

One reason for caution is the high ratio of feature vector size to the number of training
tweets. This may contribute to overfitting and deceptively high values for the accuracy,
see table 5.1. In this study, the SVM might have memorized the tweets instead of
learning the underlying pattern.
Chapter 6

Future Work

For future work there are a number of technical improvements that can be made:

• expand the data set.

A larger data set would reduce the risk of overfitting and improve the statistical
significance of the results.

• utilize the information provided by Twitter.

As discussed in section 3, the classification can be improved by expanding the
feature space with profile information, list of followers, hashtags, geolocation etc.

• create denser feature vectors.

Methods¹ for creating a denser feature space might also be useful for reducing the
risk of overfitting and improving the classification.

• conduct sentiment analysis.

Sentiment analysis as well as the syntactic and semantic specifics of the tweets
could help to identify hate tweets and associated reasons.

• crawl the external sources.

One possible improvement of the study is to enrich the feature space by adding
synonyms from other websites. These might be Facebook, forums and mobile
operators’ homepages.

The study could also be improved by solving the problem of the overlapping feature
spaces for the categories Hate, Explicit and Reason, see section 4. One suggestion is to
re-define the categories, for instance by creating one class with a feature space consisting
of the feature spaces belonging to these three classes. Then, based on semantic
analysis, one could identify the hate tweets, which probably have the highest rate of
negative emotions per tweet length. Another suggestion is to regard Explicit and Reason
as one class because both explain the underlying cause of hate.

¹ [19], Chapter 4.

It would be interesting to study if we could find the reason for the dissatisfaction of a
customer even if he or she has not provided any reason on the timeline. One possible
approach is to use the User Similarity Model [34]. The model says that the users A and
B are more closely related than the users B and C if the overlap in the topics posted on
their timelines is greater for A and B than for B and C. For related users it might be
possible to predict the reason for a hate tweet posted by the user B by looking at the
reason posted by A.

Finally, the proposed methods could be applied in other areas where the tweets describe
an action and a result or anticipation and experience. For example, tweets about the
anticipation to see a film accompanied by the review of the film after seeing it or a
review of a hotel stay along with the expectations. Similarly, tweets about political hate
or affinity could be analysed with the proposed method.
Appendix A

A.1 Collected data set

The number of collected tweets is presented in table A.1. The number of tweets in
the category Other is fairly high compared to the remaining categories. The content
of Other varies a lot, and therefore we collected more tweets in order to make this
category more predictable.

A.2 AIC

It is difficult to judge a word’s significance for a class just by looking at the
number of its occurrences. Some words, e.g. articles and prepositions, appear frequently
in all the classes without adding any value to the analysis. Therefore, AIC was used
to quantify the importance of each word. Every word w was tested for each class A
against two models: an independent and a dependent model. In the following, these are
denoted IM and DM respectively. If w is representative for class A, its AIC value
for the DM is less than the AIC value of the IM, and vice versa. The model with the
lower AIC is the one whose relative distance to the unknown truth is shortest.

The first step in the modeling of the IM and DM was to count the occurrences of
each word, w, appearing in the classified tweets. Many words did not appear more than
once through all the tweets, therefore the data were sparse. This fact will be used later
in the SVM analysis. The information about each word has been summarized in table A.2,
where the notation class ¬A stands for not class A, meaning all the remaining classes;
¬w means not w; n11 is the number of tweets belonging to the target class A and
containing the word w; n12 is the number of tweets outside the target class that contain
the word; n21 is the number of tweets in the target class where the word does not
appear; n22 is the number of tweets in the other classes where the word does not appear.
The number of free parameters is two: n11 and n12.

Table A.1: The number of collected tweets per category.

Category    Number of tweets
Other       1000
Hate        132
Reason      132
Explicit    220

Table A.2: Occurrence of a word w

Token    class A    class ¬A
w        n11        n12
¬w       n21        n22

Table A.3: Independent model

Token    class A     class ¬A
w        pq          (1 − p)q
¬w       p(1 − q)    (1 − p)(1 − q)

For readability, the following notation is introduced: N = n11 + n12 + n21 + n22,
h = n11 + n12 and k = n11 + n21. Here N is the total number of tweets, h the number of
tweets containing w, and k the number of tweets in class A.
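
As an illustration of how the counts n11, n12, n21, n22 and the derived N, h and k could be obtained, a minimal sketch is given below. It assumes that each tweet is available as a pair of a token list and a class label; this data layout, the function name `contingency_counts` and the toy tweets are assumptions made for the example and are not taken from the study.

```python
def contingency_counts(tweets, word, target_class):
    """Count tweets by presence of `word` and membership in `target_class`.

    `tweets` is a list of (tokens, label) pairs (assumed layout).
    Returns (n11, n12, n21, n22).
    """
    n11 = n12 = n21 = n22 = 0
    for tokens, label in tweets:
        has_word = word in tokens
        in_class = label == target_class
        if has_word and in_class:
            n11 += 1      # word present, target class
        elif has_word:
            n12 += 1      # word present, other classes
        elif in_class:
            n21 += 1      # word absent, target class
        else:
            n22 += 1      # word absent, other classes
    return n11, n12, n21, n22


# Toy data, invented for the example.
tweets = [
    (["i", "hate", "verizon"], "Hate"),
    (["hate", "this", "network"], "Hate"),
    (["no", "signal", "again"], "Reason"),
    (["i", "hate", "mondays"], "Other"),
    (["nice", "weather"], "Other"),
]

n11, n12, n21, n22 = contingency_counts(tweets, "hate", "Hate")
N = n11 + n12 + n21 + n22   # total number of tweets
h = n11 + n12               # tweets containing the word
k = n11 + n21               # tweets in the target class
```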

Based on the word and class occurrences, the probability of each word for a specific class
was calculated. The probability of class A is p and the probability of the word to appear
somewhere in training data is q. The probability of each class is known and equals 1/4,
therefore the probability of class A is always 1/4 and the probability of not class A
is 3/4. Nevertheless, in order to preserve the generality, the notation p will be used.

\[ P(A) = p = \frac{k}{N}, \qquad P(w) = q = \frac{h}{N} \tag{A.1} \]

The assumption that p and q are independent leads to the derivation of the IM with two
free parameters. The joint probabilities of the IM are presented in table A.3.

The events presented in table A.3 are considered to be independent and therefore their
joint probability P is
\[ P = p^{k} q^{h} (1 - p)^{N-k} (1 - q)^{N-h} \tag{A.2} \]

Log-likelihood, L, for the IM is:

\[ L = k \ln p + h \ln q + (N - k) \ln (1 - p) + (N - h) \ln (1 - q) \tag{A.3} \]

Table A.4: Dependent model

Token    class A    class ¬A
w        p11        p12
¬w       p21        p22

To find the maximized log-likelihood for the IM with respect to p and q the following
conditions have to be satisfied:

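\[ \frac{\partial L}{\partial p} = \frac{k}{p} - \frac{N - k}{1 - p} = 0 \tag{A.4} \]
\[ \frac{\partial L}{\partial q} = \frac{h}{q} - \frac{N - h}{1 - q} = 0 \tag{A.5} \]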

Solving Eq. A.4 and Eq. A.5 yields the estimates in Eq. A.1. Inserting Eq. A.1 into
Eq. A.3 gives the maximum log-likelihood (MLL):

\[ L_{\max} = h \ln h + k \ln k + (N - h) \ln (N - h)
            + (N - k) \ln (N - k) - 2N \ln N \tag{A.6} \]
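
For reference, the substitution behind Eq. A.6 can be written out explicitly. With p = k/N and q = h/N from Eq. A.1, each logarithm in Eq. A.3 splits into a count term and a ln N term:

```latex
\begin{align*}
L_{\max} &= k \ln\frac{k}{N} + h \ln\frac{h}{N}
          + (N-k) \ln\frac{N-k}{N} + (N-h) \ln\frac{N-h}{N} \\
         &= k \ln k + h \ln h + (N-k) \ln (N-k) + (N-h) \ln (N-h) \\
         &\quad - \bigl[\, k + h + (N-k) + (N-h) \,\bigr] \ln N .
\end{align*}
```

The bracket sums to 2N, which accounts for the last term of Eq. A.6.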

Finally, from Eq.2.1 and Eq.A.6 the AIC for the IM is

\[ \mathrm{AIC}_{\mathrm{IM}} = -2 L_{\max} + 2 \cdot 2 \tag{A.7} \]

A similar derivation of AIC applies to the DM. The model is outlined in table A.4.
The notation is the following: p11 is the probability of w appearing in
A, p12 is the probability of w appearing in other classes, p21 is the probability of not
observing w in A, and, lastly, p22 is the probability of not observing w in other classes.
Notice that p22 can be expressed as p22 = 1 − p11 − p12 − p21. This means that the
number of free parameters is 3.

The joint probability for the DM is

\[ P = p_{11}^{n_{11}} \, p_{12}^{n_{12}} \, p_{21}^{n_{21}} \, p_{22}^{n_{22}} \tag{A.8} \]

Similarly to Eq. A.3, the log-likelihood of the events in table A.4 is

\[ L = n_{11} \ln p_{11} + n_{12} \ln p_{12} + n_{21} \ln p_{21} + n_{22} \ln p_{22} \tag{A.9} \]



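Maximizing the log-likelihood with respect to p11, with p22 = 1 − p11 − p12 − p21,
requires

\[ \frac{\partial L}{\partial p_{11}} = \frac{n_{11}}{p_{11}} - \frac{n_{22}}{p_{22}} = 0
   \quad \text{or} \quad \frac{n_{11}}{p_{11}} = \frac{n_{22}}{p_{22}} \tag{A.10} \]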

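Applying the same condition to p12 and p21 extends the last expression in Eq. A.10 to

\[ \frac{n_{12}}{p_{12}} = \frac{n_{21}}{p_{21}} = \frac{n_{22}}{p_{22}} \tag{A.11} \]

so all four ratios share the same value. Therefore it is possible to set Eq. A.11 equal
to some constant c. Now the events can be expressed as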

\[ n_{11} = c\,p_{11}, \quad n_{12} = c\,p_{12}, \quad n_{21} = c\,p_{21}, \quad n_{22} = c\,p_{22} \tag{A.12} \]

Summing up all the events gives

\[ n_{11} + n_{12} + n_{21} + n_{22} = c\,(p_{11} + p_{12} + p_{21} + p_{22}) = c \tag{A.13} \]

This also means that c = N and

\[ p_{11} = \frac{n_{11}}{N}, \quad p_{12} = \frac{n_{12}}{N}, \quad
   p_{21} = \frac{n_{21}}{N}, \quad p_{22} = \frac{n_{22}}{N} \tag{A.14} \]

Inserting Eq. A.14 into Eq. A.9 gives the MLL of the DM:

\[ L_{\max} = n_{11} \ln n_{11} + n_{12} \ln n_{12}
            + n_{21} \ln n_{21} + n_{22} \ln n_{22} - N \ln N \tag{A.15} \]

Finally, from Eq. 2.1 and Eq. A.15 the AIC of the DM is

\[ \mathrm{AIC}_{\mathrm{DM}} = -2 L_{\max} + 2 \cdot 3 \tag{A.16} \]

It is worth mentioning that whenever a count in Eq. A.6 or Eq. A.15 was equal to
zero, the limit \(\lim_{x \to 0} x \ln x = 0\) was applied.
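
The comparison of the two models can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the study; the function names are hypothetical. The IM follows Eq. A.6 and Eq. A.7. For the DM, the log-likelihood is evaluated by substituting the estimates nij/N of Eq. A.14 directly into Eq. A.9, which carries an N ln N term, and the penalty uses three free parameters as in Eq. A.16.

```python
import math


def xlogx(x):
    """Return x * ln(x), using the limit value 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0


def aic_independent(n11, n12, n21, n22):
    """AIC of the independent model (two free parameters)."""
    N = n11 + n12 + n21 + n22
    h = n11 + n12
    k = n11 + n21
    mll = xlogx(h) + xlogx(k) + xlogx(N - h) + xlogx(N - k) - 2 * xlogx(N)
    return -2 * mll + 2 * 2


def aic_dependent(n11, n12, n21, n22):
    """AIC of the dependent model (three free parameters)."""
    N = n11 + n12 + n21 + n22
    # Sum of nij * ln(nij / N), written with the x ln x helper.
    mll = xlogx(n11) + xlogx(n12) + xlogx(n21) + xlogx(n22) - xlogx(N)
    return -2 * mll + 2 * 3


def prefers_dependent(n11, n12, n21, n22):
    """True when the dependent model has the lower AIC for these counts."""
    return aic_dependent(n11, n12, n21, n22) < aic_independent(n11, n12, n21, n22)


# Invented counts: the word occurs far more often in class A than elsewhere.
associated = prefers_dependent(40, 5, 60, 295)
# Invented counts: the word occurs at the same rate inside and outside class A.
unrelated = prefers_dependent(10, 30, 90, 270)
```

When the word is distributed identically inside and outside the class, both models reach the same log-likelihood and the DM loses only through its extra parameter, so its AIC is larger by exactly 2.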
Bibliography

[1] Kristin P. Bennett and Erin J. Bredensteiner. Duality and geometry in SVM
classifiers. In Proc. 17th International Conf. on Machine Learning, pages 57–64.
Morgan Kaufmann, 2000.

[2] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduc-
tion to Statistical Learning: With Applications in R. Springer Publishing Company,
Incorporated, 2014. ISBN 1461471370, 9781461471370.

[3] K. Lee, D. Palsetia, R. Narayanan, M.M.A. Patwary, A. Agrawal, and A. Choudhary.
Twitter trending topic classification. In Data Mining Workshops (ICDMW),
2011 IEEE 11th International Conference on, pages 251–258, Dec 2011. doi:
10.1109/ICDMW.2011.171.

[4] David Alfred Ostrowski. Semantic filtering in social media for trend modeling. 2013
IEEE Seventh International Conference on Semantic Computing, pages 399–404,
2013.

[5] Sitaram Asur and Bernardo A. Huberman. Predicting the future with social me-
dia. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web
Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT ’10, pages 492–
499, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4191-
4. doi: 10.1109/WI-IAT.2010.63. URL http://dx.doi.org/10.1109/WI-IAT.
2010.63.

[6] Bernd Hollerit, Mark Kröll, and Markus Strohmaier. Towards linking buyers and
sellers: Detecting commercial intent on twitter, 2013.

[7] SongJie Gong. A collaborative filtering recommendation system algorithm based
on user clustering and item clustering. Journal of Software, 5(7):745–752, 2010.

[8] Xin Wayne Zhao, Yanwei Guo, Yulan He, Han Jiang, Yuexin Wu, and Xiaoming Li.
We know what you want to buy: A demographic-based system for product
recommendation on microblogs. In Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 1935–1944,
New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9. doi: 10.1145/2623330.2623351.
URL http://doi.acm.org/10.1145/2623330.2623351.

[9] Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. Topic
sentiment analysis in twitter: A graph-based hashtag sentiment classification ap-
proach. In Proceedings of the 20th ACM International Conference on Informa-
tion and Knowledge Management, CIKM ’11, pages 1031–1040, New York, NY,
USA, 2011. ACM. ISBN 978-1-4503-0717-8. doi: 10.1145/2063576.2063726. URL
http://doi.acm.org/10.1145/2063576.2063726.

[10] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, and Di Cai. Sentiment strength
detection in short informal text. Journal of the American Society for Information
Science and Technology, 61(12):2544–2558, 2010.

[11] Fernando Perez-Tellez, David Pinto, John Cardiff, and Paolo Rosso. On the
difficulty of clustering company tweets. In Proceedings of the 2nd International
Workshop on Search and Mining User-generated Contents, SMUC ’10, pages 95–102,
New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0386-6. doi: 10.1145/1871985.1872001.
URL http://doi.acm.org/10.1145/1871985.1872001.

[12] Twitter usage / company facts. URL https://about.twitter.com/company.

[13] Citymapper. URL
https://s3.amazonaws.com/cdn0-crashlytics-com/marketing/case-studies/Citymapper-caseStudy-1a.pdf.

[14] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter
users: Real-time event detection by social sensors. In Proceedings of the 19th
International Conference on World Wide Web, pages 851–860, 2010.

[15] Juan M. Silva, Abu Saleh Md. Mahfujur Rahman, and Abdulmotaleb El Saddik.
Web 3.0: A vision for bridging the gap between real and virtual. In Proceedings of
the 1st ACM International Workshop on Communicability Design and Evaluation
in Cultural and Ecological Multimedia System, CommunicabilityMS ’08, pages 9–14,
New York, NY, USA, 2008. ACM. ISBN 978-1-60558-319-8. doi: 10.1145/1462039.
1462042. URL http://doi.acm.org/10.1145/1462039.1462042.

[16] Twitter developer. URL https://dev.twitter.com.

[17] Shamanth Kumar, Fred Morstatter, and Huan Liu. Twitter Data Analytics.
Springer, New York, NY, USA, 2013.

[18] Rate limit. URL https://dev.twitter.com/rest/public/rate-limiting.

[19] Charu C. Aggarwal and ChengXiang Zhai. Mining Text Data. Springer Science +
Business Media, 2012. ISBN 9781461432227.

[20] Bharath Sriram, Dave Fuhry, Engin Demir, Hakan Ferhatosmanoglu, and Murat
Demirbas. Short text classification in twitter to improve information filtering. In
Proceedings of the 33rd International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’10, pages 841–842, New York, NY,
USA, 2010. ACM. ISBN 978-1-4503-0153-4. doi: 10.1145/1835449.1835643. URL
http://doi.acm.org/10.1145/1835449.1835643.

[21] Hwon Ihm. Mining consumer attitude and behavior: An exploratory study on movie
audience attitude extracted from twitter. Journal of Convergence, 4(2):29–35, June
2013. URL http://www.ftrai.org/joc/vol4no2/v04n02_C03.pdf.

[22] Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman,
and Jon Sperling. Twitterstand: News in tweets. In Proceedings of the 17th
ACM SIGSPATIAL International Conference on Advances in Geographic Information
Systems, GIS ’09, pages 42–51, New York, NY, USA, 2009. ACM. ISBN
978-1-60558-649-6. doi: 10.1145/1653771.1653781.
URL http://doi.acm.org/10.1145/1653771.1653781.

[23] Chao Yang, Sanmitra Bhattacharya, and Padmini Srinivasan. Lexical and ma-
chine learning approaches toward online reputation management. In CLEF (Online
Working Notes/Labs/Workshop), 2012.

[24] Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida.
Detecting spammers on twitter. In Collaboration, Electronic messaging, Anti-Abuse
and Spam Conference (CEAS), 2010.

[25] Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. Learning to classify
short and sparse text & web with hidden topics from large-scale data collections. In
Proceedings of the 17th International Conference on World Wide Web, WWW ’08,
pages 91–100, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2. doi: 10.
1145/1367497.1367510. URL http://doi.acm.org/10.1145/1367497.1367510.

[26] Aixin Sun. Short text classification using very few words. In Proceedings of
the 35th International ACM SIGIR Conference on Research and Development
in Information Retrieval, SIGIR ’12, pages 1145–1146, New York, NY, USA,
2012. ACM. ISBN 978-1-4503-1472-5. doi: 10.1145/2348283.2348511. URL
http://doi.acm.org/10.1145/2348283.2348511.

[27] Maria Teresa Pinheiro Melo Borges Tiago and José Manuel Cristóvão Veríssimo.
Digital marketing and social media: Why bother? Business Horizons, 57(6):703–708,
2014.

[28] Denis Kondopoulos. Internet marketing advanced techniques for increased market
share. Chimica Oggi-Chemistry Today, 29(3):9–12, 2011.

[29] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment
of twitter posts for user profile construction on the social web. 6644:375–389,
2011. doi: 10.1007/978-3-642-21064-8_26.
URL http://dx.doi.org/10.1007/978-3-642-21064-8_26.

[30] P. Burnap and M. L. Williams. Cyber hate speech on twitter: An application
of machine classification and statistical modeling for policy and decision making.
Policy & Internet, 7(3):223–242, 2015. doi: 10.1002/poi3.85.

[31] Ricardo Kawase, Bernardo Pereira Nunes, Eelco Herder, Wolfgang Nejdl, and
Marco Antonio Casanova. Who wants to get fired? In Proceedings of the 5th
Annual ACM Web Science Conference, WebSci ’13, pages 191–194, New York, NY,
USA, 2013. ACM. ISBN 978-1-4503-1889-1. doi: 10.1145/2464464.2464476. URL
http://doi.acm.org/10.1145/2464464.2464476.

[32] Nick Pendar. Toward spotting the pedophile telling victim from predator in text
chats. 2012 IEEE Sixth International Conference on Semantic Computing, 0:235–
241, 2007. doi: http://doi.ieeecomputersociety.org/10.1109/ICSC.2007.32.

[33] A. Sureka and S. Agarwal. Learning to classify hate and extremism promoting
tweets. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE
Joint, pages 320–320, Sept 2014. doi: 10.1109/JISIC.2014.65.

[34] R. Narayanan. Mining text for relationship extraction and sentiment analysis. Ph.D.
dissertation, 2010.
