Tweet as a Tool for Election Forecast: UK 2015 General Election as an Example


DRAFT
Philipp Burckhardt Raymond Duch Akitaka Matsuo
January 3, 2016

Abstract
In this essay we explore the utility of using Twitter conversations to explain
election outcomes. Our efforts are based on a large corpus of Tweets collected
during the six-month period approaching the 2015 UK election. Our analysis in-
cludes the geo-location of tweets, sentiment analysis, and issue/topic modelling.
The analysis focuses on England and Scotland. The interpretation of the Twitter
landscape of Scotland is straightforward: the Scottish National Party dominated
the Twitter conversation on all aspects including the number of tweets, general
sentiment, and the evaluation of key issues. In contrast, the results of England
are more nuanced: Generally speaking, the Labour Party was the slight favorite;
however, the Conservative Party had a slight edge over Labour on certain issues
that may have been critical in the ultimate Tory victory. We suspect that this poor
performance of English tweets in explaining election outcomes is a consequence
of the population of Twitter users being unrepresentative of the country's population.
We discuss a strategy for removing these biases by collecting and analysing
demographic information on Twitter users.


Paper prepared for Presentation at the Third Annual Meeting of the Asian Political Methodology
Society in Beijing January, 2016


1 Introduction

The UK General Election 2015 was a nightmare for polling agencies: before the election,
all of them predicted that it would produce a hung parliament, and the media
focused on the coalition negotiations that would likely follow.
In the event, the Conservative Party secured a single-party majority
and formed the new government.1 Academics also had a hard time predicting
the election results from polling information. A forthcoming special issue of Electoral
Studies is a collection of election predictions submitted by scholars before the election.
Among them, none predicted a single-party Conservative majority (Fisher
and Lewis-Beck, 2015).
In this essay we explore the utility of using Twitter conversations to explain election
outcomes. Our efforts are based on a large corpus of Tweets collected during the six-
month period approaching the 2015 UK elections. Our analysis includes the geo-location
of tweets, sentiment analysis, and issue/topic modelling. The analysis focuses on England
and Scotland. The interpretation of the Twitter landscape of Scotland is straightforward:
the Scottish National Party dominated the Twitter conversation on all aspects including
the number of tweets, general sentiment, and the evaluation of key issues. In contrast,
the results of England are more nuanced: Generally speaking, the Labour Party was the
slight favorite; however, the Conservative Party had a slight edge over Labour on certain
issues that may have been critical in the ultimate Tory victory. These results are not
consistent with the actual election outcomes; they point to the need for a bias correction
methodology which we are developing and describe in this essay.
1
For instance, http://fivethirtyeight.com/datalab/what-we-got-wrong-in-our-2015-uk-general-election-model/,
http://www.economist.com/news/britain/21651250-why-opinion-polls-went-wrong-pollderdash

2 Literature

Because of its abundance and accessibility, micro-blogging data, in particular tweets,
have been used to measure public sentiment about a range of issues. Employing Twitter
conversations to gauge the public's political sentiment is one illustration (O'Connor et al.,
2010), as are efforts to predict political affiliations (Conover et al., 2011). Predicting
election outcomes is another of these areas (cf. Gayo-Avello, 2013). Previous attempts
to predict election outcomes have had mixed success. Some articles claim that tweets or
other social media texts have decent predictive power for election outcomes. Tumasjan
et al. (2011) analyze party and party leader mentions in the 2009 German federal
election and show that the number of mentions, combined with sentiment
analysis, predicts election results with high accuracy. Curini, Ceron and Iacus (2013); Ceron
et al. (2013); Ceron, Curini and Iacus (2014) attempt to analyze several elections in Italy,
France, and the US and show that a supervised learning method developed by Hopkins
and King (2010) does a good job of explaining fluctuation in party or candidate support
in various contexts; they also predict election outcomes.
There are a host of articles challenging these optimistic findings about the use of social
media texts in predicting election outcomes. Jungherr, Jurgens and Schoen (2012) offer
a rebuttal to the findings of Tumasjan et al. (2011). They argue that the predictive
accuracy in Tumasjan et al. (2011) is an artifact of how parties were selected for inclusion
in the analysis, and that once all parties are included in the data the predictive
power of tweet data becomes essentially zero. Gayo-Avello (2013), in a comprehensive
review of the literature, argues that empirical efforts to use Twitter to predict
election outcomes face numerous limitations.
There have been some attempts to analyze social media texts in UK politics.
Lampos, Preotiuc-Pietro and Cohn (2013) analyze the 2010 General Election showing
that when the polling results of voting intention are used as training data, employing
structured supervised learning methods, the tweets can predict vote intentions at the
twitter account level. Boutet, Kim and Yoneki (2013) also analyze tweets during election
periods; they use the network of tweet-retweet relations to estimate the likely supporters
of major parties among Twitter users in the UK.
This project is another attempt to predict election outcomes employing Twitter con-
versations. Our approach which focuses on volume and sentiment in social media texts
resembles Tumasjan et al. (2011). In addition, we put a significant emphasis on the anal-
ysis of the issues discussed in the campaign period. Our approach is explorative: We first
attempt to track the change in popularity of parties through the analysis of all tweets
mentioning major parties or party leaders. We then shift our focus to the analysis of
issues discussed in the tweets. Previous attempts have put relatively little emphasis on
the issues that shape election campaigns.

3 Methodology for Downloading Tweets

We have collected the corpus of tweets using the Twitter streaming API.2 The streaming
API allows us to connect to the global stream of Twitter data. A program using the
API maintains a continuous connection to the Twitter server; tweets that satisfy
predefined conditions are downloaded automatically. We use very simple criteria to select
tweets: the names of the six largest parties and of their leaders are used as the search
terms. Table 1 presents the list of parties and search
terms. The download started on December 21, 2014. The last date of tweets we use in the
data analysis is May 6, 2015, the day before the General Election.3 In the five-and-a-half-month
period from late December 2014, we downloaded 25 million tweets. To
save on computation time we randomly sampled 8 million tweets for our data analysis.
2
https://dev.Twitter.com/docs/streaming-apis
3
Because of technical problems, tweets of three days are incomplete (February 20, 21, and March 13).
These dates are excluded from the final analysis.

Table 1: Search Terms

Party | Party Name Terms | Party Leader Terms
Conservative Party | conservatives, tories, torys, tory | davidcameron, david cameron, cemeronmustgo, cameronmuststay
Liberal Democrats | lib dem, liberal democrats, libdem, libdems | nick clegg, nickclegg
Labour Party | labour, labours, uklabour | ed miliband, edmiliband
UKIP | ukip, uk independence party | Nigel Farage
Scottish National Party | snp, scottish national party, scottishnationalparty | Nicola Sturgeon, NicolaSturgeon
Green Party | thegreenparty, green party | Natalie Bennett
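As an illustration of how the Table 1 terms can be used to tag tweets with the parties they mention, the following is a minimal Python sketch. It is not the authors' code: the whole-word, case-insensitive matching rule, the short party labels, and the function name `parties_mentioned` are assumptions made for this example, and the matching performed by the streaming API itself may differ.

```python
import re

# Search terms copied from Table 1 (party-name terms plus leader terms),
# lower-cased. The short party labels are chosen for this sketch.
PARTY_TERMS = {
    "Conservatives": ["conservatives", "tories", "torys", "tory", "davidcameron",
                      "david cameron", "cemeronmustgo", "cameronmuststay"],
    "LibDem": ["lib dem", "liberal democrats", "libdem", "libdems",
               "nick clegg", "nickclegg"],
    "Labour": ["labour", "labours", "uklabour", "ed miliband", "edmiliband"],
    "UKIP": ["ukip", "uk independence party", "nigel farage"],
    "SNP": ["snp", "scottish national party", "scottishnationalparty",
            "nicola sturgeon", "nicolasturgeon"],
    "Greens": ["thegreenparty", "green party", "natalie bennett"],
}


def _compile(terms):
    # Assumption: whole-word, case-insensitive matching, so that "tory"
    # does not match inside "history" but "#ukip" still matches "ukip".
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)


PARTY_PATTERNS = {party: _compile(terms) for party, terms in PARTY_TERMS.items()}


def parties_mentioned(text):
    """Return the set of parties whose search terms appear in the tweet text."""
    return {party for party, pattern in PARTY_PATTERNS.items()
            if pattern.search(text)}
```

The size of the returned set gives the number of parties mentioned in a tweet, which is the kind of tally needed to separate single-party tweets from tweets naming several parties or none.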

4 Tweet Classification

We attempt to classify these tweets into several categories based on the geo-location
of users, the sentiments expressed in tweets, and the electoral issues raised.
Because the search terms are vague, not all of these tweets actually discuss the UK
election, and many were created by Twitter users located outside the
UK. In order to use only the relevant tweets, we exclude tweets that cannot be geo-located
in the UK. Another crucial issue is the sentiment classification of tweets. For the purpose
of election prediction using tweets, sentiment expressed in tweets can be of considerable
importance. In this paper, we use a recently developed sentiment classifier. The issue
classification is conducted by simple pattern matching.

4.1 Geo-locating Tweets

Because of the regional variations of party systems in the UK, identifying the geo-location
of tweets is important for this project. Non-UK tweets also need to be removed
from the dataset because party names are duplicated across countries; for example, there
are Green Parties in many countries. Twitter has a geo-location feature that provides geocodes for each
tweet, but it is an opt-in feature that is only turned on with the explicit
consent of users. Because of privacy concerns, the proportion of opted-in users in the
Twitter community is too low to use it as a reliable source. By one estimate, only two
percent of Twitter users turn on this feature, and the proportion is similar in our data.4
To supplement the geo-location information, we use the location field of Twitter user profiles,
in which users can provide their location. We identify
the location using the geonames.org API, which provides a service to recover geo-locations
from partial addresses.5 If the location field of a Twitter account includes the name of a
town or city, we invoke the geonames.org API to determine the geo-location.
Through the analysis of location names we have identified 3,776 different cities or counties
in the UK. With these locations, 3,217,147 tweets are given a geo-location in the
UK. Table 2 shows the top 20 locations included in the data. In the following analysis
we locate Twitter users in English and Scottish counties, and we run separate analyses
for the two regions of the UK.
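The lookup of a free-text location field can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes the GeoNames `searchJSON` web service with its `q`, `country`, `maxRows` and `username` parameters, and assumes that the `adminName1` field of a hit carries the constituent country (e.g. "Scotland"). The helper names and the sample payload are hypothetical, and no network request is made here.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the GeoNames full-text search service (JSON variant).
GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_geonames_query(location_text, username):
    """Build a search URL restricted to the UK for a profile location string."""
    params = {
        "q": location_text,    # free text from the Twitter location field
        "country": "GB",       # keep only places in the United Kingdom
        "maxRows": 1,          # take the top-ranked match only
        "username": username,  # GeoNames account name (required by the service)
    }
    return GEONAMES_SEARCH_URL + "?" + urlencode(params)


def parse_geonames_response(payload):
    """Extract name, region and coordinates from a JSON response, or None."""
    hits = json.loads(payload).get("geonames", [])
    if not hits:
        return None  # the location text could not be placed in the UK
    top = hits[0]
    return {
        "name": top.get("name"),
        "region": top.get("adminName1"),  # assumed to be England, Scotland, ...
        "lat": float(top["lat"]),
        "lng": float(top["lng"]),
    }
```

A user whose location field yields `None` would be dropped from the UK sample; users with a hit can be assigned to England or Scotland through the `region` value.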

4.2 Sentiment Analysis

Analyzing the sentiment of text is a developing area in natural language processing.
The goal of sentiment analysis is to develop an algorithm to identify whether a
text is subjective or objective and, if it is subjective, to categorize whether it is
positive or negative. Broadly speaking, there are two different approaches to the
classification (see the extensive reviews by Pang and Lee, 2008; Paltoglou, 2014). The first is
the machine-learning approach, using a classification method such as Support Vector
Machines, Naive Bayes, or Maximum Entropy. Computational linguists have proposed a range
of machine-learning strategies and compete with each other on classification accuracy
4
http://firstmonday.org/article/view/4366/3654
5
http://www.geonames.org/export/web-services.html

Table 2: Geo Locations of Tweets

City Name | Country | Count
London | England | 1131036
Scotland | Scotland | 300096
Glasgow | Scotland | 237046
Edinburgh | Scotland | 133112
Manchester | England | 102259
Liverpool | England | 65090
Bristol | England | 50637
Wales | England | 49482
Sheffield | England | 48824
Birmingham | England | 43418
Leeds | England | 42915
East Midlands | England | 39821
Aberdeen | Scotland | 38095
Nottingham | England | 37499
Brighton | England | 37161
Newcastle upon Tyne | England | 36369
East Yorkshire | England | 32676
Cambridge | England | 29006
Norwich | England | 23287
Oxford | England | 22596

(e.g. Wilson et al., 2013). The main input to this approach is a document-term matrix,
much as in topic modeling. This approach requires a set of training
data that is usually created by human coders, although some research makes creative
use of the characters of online texts (e.g. Pak and Paroubek, 2010). The second approach
is an unsupervised, lexicon (dictionary)-based approach. The algorithms used in this
second approach search through texts and find specific terms in precompiled dictionaries.
Developing such a dictionary requires significant effort, and this approach works
better if the algorithm and dictionary are developed specifically for the domain of the texts. In
addition, this approach allows the analysis of sentence structure, which is difficult in the
machine-learning approach.
In this essay, we use a lexicon-based algorithm called VADER, developed by Hutto and
Gilbert (2014). VADER is a sophisticated text-parsing algorithm that classifies tweets
based on sentiment. In their paper Hutto and Gilbert (2014) compare the effectiveness
of their algorithm with other sentiment classification algorithms including LIWC, SentiWordNet,
and machine-learning methods using Naive Bayes and Support Vector Machines
(SVM). Their classifier is based on a gold-standard lexicon including words and other
features (e.g. emoticons and word capitalization). These elements of the lexicon are
then combined with rules that detect grammatical and syntactical conventions that typically
modify sentiment intensity through either emphasis or negation. Their comparisons
show that this simple algorithm exhibits better accuracy than the other methodologies.
The critical advantage of this method, along with other lexicon-based methodologies (e.g.
Paltoglou and Thelwall, 2012), is computational efficiency.
We applied the Hutto and Gilbert (2014) VADER algorithm to our corpus of UK tweets.
For the three million tweets in our Twitter corpus, the classifier finishes the sentiment
analysis within two minutes. The resulting classification includes four outputs:
neutral, negative, and positive sentiment scores, along with a compound measure of sentiment.
Figure 1 shows the distribution of compound sentiment scores. Following the advice of Hutto and
Gilbert (2014), we categorize tweets with compound scores greater than 0.05 as positive
sentiment tweets and tweets with compound scores less than -0.05 as negative sentiment
tweets. As Figure 1 illustrates, the distribution of sentiment in our UK tweets is trimodal.
The modal sentiment is clearly neutral. The negative and positive sentiment scores are
roughly normally distributed around -0.5 and +0.5, respectively.
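The thresholding rule described above can be written down in a few lines. The sketch below assumes that a compound score in [-1, 1] has already been computed for each tweet, for example by an implementation of VADER; the function names and the sample scores are illustrative only.

```python
from collections import Counter


def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    Scores strictly above +threshold are positive, scores strictly below
    -threshold are negative, and everything in between is neutral.
    """
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


def tally_labels(compound_scores, threshold=0.05):
    """Count how many scores fall into each sentiment class."""
    return Counter(label_sentiment(score, threshold) for score in compound_scores)
```

For instance, `tally_labels([0.6, 0.0, -0.4, 0.05])` counts one positive, one negative and two neutral tweets, because a score of exactly 0.05 does not exceed the threshold.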
Figure 1: Sentiment Classification Result [histogram; x-axis: Compound Measure (-1.0 to 1.0); y-axis: Count]

4.3 Issues

We are also interested in identifying the issues that dominated the Twitter election
conversations. Accordingly, we conduct issue classification using iterations of a simple
pattern matching strategy. First we selected important issues in the election based on
archival searches of news articles covering the election campaign and manifesto summaries
provided by news agencies.6 From this analysis we identified six key issues in the election:
economy, immigration, tax, welfare, EU, and education. For each issue, we first search
through the corpus of tweets with predefined search terms. We use the tweets that are
matched with each issue to create a document-term matrix. We then manually
determine which terms are frequently used, and therefore highly relevant, for each
issue. These relevant terms are added to the original search terms, and we conduct
another round of issue searches. We iterated this process several times until we finalized
the list of search terms. Table 3 shows the list of search terms along with the number of
tweets matched with the terms.
Table 3: Issue Search Terms

Topic | Search Terms | Count
Economy | deficit, economy, business, austerity, budget, debt, borrowing, gdp, unemployment, job(s) | 423082
Education | education, tuition, school, university, universities, apprenticeship, childcare, teachers, uni | 153582
EU | proeu, no2eu, meps, brexit, mep, EU, MEP | 158130
Immigration | immigration, racist, immigrant, migrants, boarder(s) | 162634
Tax | tax, betroom, mansion, dodging, nondom, VAT, IFS | 291107
Welfare | welfare, NHS, benefit, wowpetition | 416160
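One round of the iterative procedure in Section 4.3 might look like the sketch below: match tweets against the current seed terms, then list the most frequent other tokens in the matched tweets as candidates for manual review. This is an assumption-laden illustration rather than the authors' implementation: the tokenizer, the lower-casing of all terms, the subset of Table 3 terms, and the function names are choices made for this example, and the decision about which candidates to keep remains manual.

```python
import re
from collections import Counter

# A subset of the Table 3 seed terms, lower-cased for matching.
ISSUE_TERMS = {
    "Economy": {"deficit", "economy", "business", "austerity", "budget",
                "debt", "borrowing", "gdp", "unemployment", "job", "jobs"},
    "Education": {"education", "tuition", "school", "university"},
    "Welfare": {"welfare", "nhs", "benefit", "wowpetition"},
}


def tokenize(text):
    """Split a tweet into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify_issues(text, issue_terms=ISSUE_TERMS):
    """Return every issue that has at least one seed term in the tweet."""
    tokens = set(tokenize(text))
    return {issue for issue, terms in issue_terms.items() if tokens & terms}


def candidate_terms(tweets, issue, issue_terms=ISSUE_TERMS, top_n=10):
    """Most frequent non-seed tokens in tweets matched to `issue`.

    These counts play the role of one column sum of a document-term matrix;
    a human coder would then pick the relevant terms to add as new seeds.
    """
    seeds = issue_terms[issue]
    counts = Counter()
    for text in tweets:
        tokens = tokenize(text)
        if seeds.intersection(tokens):
            counts.update(token for token in tokens if token not in seeds)
    return counts.most_common(top_n)
```

In practice one would filter common function words before reviewing the candidates, add the accepted terms to `ISSUE_TERMS`, and repeat until the list stabilizes.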

6
Examples:
http://www.bbc.co.uk/news/election/2015/manifesto-guide
http://www.theguardian.com/politics/manifestos-2015

5 Tweet Heterogeneity

The first stage of this Twitter project has consisted of obtaining Twitter conversations
related to the UK general election and conducting topic modelling and sentiment analysis.
As we will demonstrate in the following analyses, the content of Twitter traffic does
not necessarily reflect the general preferences and sentiment of the population, so we
can think of Twitter traffic as a somewhat biased reading of public preferences and
sentiment. Accordingly, in order to generate estimates of population parameters (such
as support for the U.K. Conservative Party) using Twitter conversations, we need to
understand precisely the nature of this bias. This is the second stage of the project, which
we are currently undertaking. It consists of getting a better measure of
the heterogeneity in Twitter conversations and has three principal components.

5.1 Identifying Heterogeneous Preferences in the Twitter Sub-population

A first stage of our effort to estimate heterogeneity in Twitter conversations relies
on survey data that allow us, first, to identify who actively tweets and, secondly, to
correlate this activity with socio-demographic and political sentiment variables. We gain
two critical insights from this analysis. First, we learn the socio-demographic profiles of
both Twitter users and those who do not tweet, for example their age, income, education,
and gender distributions. This in itself will be very valuable for weighting tweets
so that they better reflect the preferences of the overall population. Secondly, we
learn how relevant political preferences vary across socio-demographic groups within both
the Twitter and non-Twitter sub-populations. These two insights from the analysis of
survey data are the basis for constructing propensity score weights. They correspond to
two potential sources of bias. First, the socio-demographic distributions of Twitter users
may differ from those of the overall population, and weighting can correct for this
distributional bias. Secondly, it may also be the case that within any socio-demographic
grouping (say, the middle-income, white, low-educated segment) Twitter users
have political preferences distinct from those of non-Twitter users, and hence also from
those of the overall population.
The results from this analysis of large-n survey data will inform the propensity score
weights, for socio-demographic segments, that should correct for any bias in the overall
political preferences and sentiment that is being registered by Twitter users.

5.2 Socio-demographic Segmentation of Tweets

In order to accomplish this weighting task we need information on the socio-demographic
composition of the Twitter users in our database; these distributions can then be compared
to those of the overall population. The proportion of educated and rich individuals in the
Twitter sub-population is likely higher than their proportion in the overall population
(or, say, the population of registered voters). We would like to be able to segment the
Twitter conversations we collect into at least very rough socio-demographic groupings.
These groupings would then be weighted to their appropriate population distribution using
the information obtained from the data collection and analysis described in the previous
section.
Generating socio-demographic profiles of Twitter users has attracted considerable
attention from computational linguists and computer scientists. There are a number of
strategies implemented in this regard although they typically are based on extracting
demographic information from user profiles; using this information to generate socio-
demographic segmentation of Twitter users; and then identifying distinguishing latent
patterns in the tweets of these different socio-demographic groupings. These latent
patterns are then used to categorise all Twitter users in our database (i.e., out-of-sample
predictions).
We briefly describe recent efforts to estimate the demographic profiles of Twitter
users. Most of these efforts use a three-step methodology: First, they collect concrete
information about a subset of Twitter users by analysing the texts in various fields of
Twitter accounts (e.g. user description) as well as other identifiable information on the
user (e.g. information on the user obtained from other social networks such as LinkedIn).
Secondly, they develop a classifier using other user information (e.g. number
of followers, number of lifetime tweets, content of tweets). Finally, they classify
user accounts that do not provide any concrete information regarding the demographic
characteristics of interest. Preotiuc-Pietro, Lampos and Aletras (2015), for example,
extract occupation information from Twitter user profiles and conduct text analysis to
categorise users into occupational classes. Sloan et al. (2015) approach the identification
of occupational groupings in a similar fashion, although they use human coders to
validate machine occupational classifications. Others have focused on Twitter text analysis
to identify distinct patterns that isolate a Twitter user's gender and age (Schwartz
et al., 2013). Burger et al. (2011) also propose a method for identifying the gender of
Twitter users.
We propose to employ a similar strategy of using information regarding Twitter users
in order to categorise them into occupational, gender and education categories. In addi-
tion to using information from Twitter user profiles we will also supplement this informa-
tion with information about the Twitter user obtained from their profiles on other social
media such as Facebook and Instagram.
As is the case with other efforts such as Preotiuc-Pietro, Lampos and Aletras
(2015), our intention is to categorise the Twitter users into distinct socio-demographic
categories, for example high-income, highly educated women working in a managerial
position. Having accomplished this, we would then attempt to identify unique patterns
in their tweets that could be used to categorise out-of-sample tweets.

5.3 Geo-location Segmentation of Tweets

We noted earlier that we have developed a strategy for geo-locating Twitter accounts.
This information can also be used for socio-demographic segmentation of Twitter users.
Mohammady and Culotta (2014) propose a strategy for categorising Twitter users into
demographic groupings by matching Twitter accounts, via the county-level geo-coded
information available from the Twitter API, with the demographic information for their
county. The machine analysis of linguistic patterns then occurs at the county level. These
county-level analyses are designed to identify the socio-demographic grouping of each Twitter user.

5.4 Propensity Score Weighting of Tweets

The analyses described above will ultimately associate propensity scores with each
of the Twitter accounts in our database: a Twitter user would have
a likelihood of falling in each of the socio-demographic cells we ultimately create as
part of the heterogeneity measurement strategies described above. For example, a Twitter
user might have a .37 likelihood of being a poorly educated male from Northern Ireland
and a .07 likelihood of being a highly educated woman
from Brighton. We also have a corpus of tweets classified according to partisanship and
political sentiment. Using them we can obtain probabilistic measures of the partisanship of
each user in our data; for instance, a user might have a .1 chance of being a Labour
supporter, a .5 chance of being a Conservative supporter, and so on. We can also calculate
sentiment towards party leaders and particular issues.
By combining these two measures associated with each Twitter user, a matrix of demographic
predictions for a user account and a vector of partisanship predictions, we can
calculate the estimated probabilities of vote choice in each of these demographic
cells. For example, the prediction might be that a Scottish male in his forties working
as an engineer has a twenty percent chance of voting for Labour, a forty percent
chance of voting for the SNP, etc. To obtain the overall vote propensities in each
geographic region, we will calculate a weighted average of vote choices using the
actual distribution of these cells in the U.K. population. This is the strategy we
are developing for the fine-grained election forecast. We now turn to our estimation of
political sentiment and our efforts to use these data to predict the outcome of the 2015
U.K. general election.
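The weighting logic of Section 5.4 can be illustrated with a small sketch. It is one possible reading of the strategy, closer to simple post-stratification than to a full propensity score model, and not the method under development: it assumes each user comes with a dictionary of cell-membership probabilities and a dictionary of party-support probabilities, and that the population share of each cell is known. All names and numbers are hypothetical.

```python
def cell_vote_shares(cell_probs, party_probs):
    """Estimate party support within each demographic cell.

    cell_probs:  one dict per user, {cell: probability of belonging to cell}
    party_probs: one dict per user, {party: probability of supporting party}
    Each user's party probabilities are averaged within a cell, weighted by
    the user's probability of belonging to that cell.
    """
    weighted_sums = {}  # cell -> {party: sum of membership * support}
    mass = {}           # cell -> sum of membership probabilities
    for cells, parties in zip(cell_probs, party_probs):
        for cell, membership in cells.items():
            mass[cell] = mass.get(cell, 0.0) + membership
            bucket = weighted_sums.setdefault(cell, {})
            for party, support in parties.items():
                bucket[party] = bucket.get(party, 0.0) + membership * support
    return {
        cell: {party: total / mass[cell] for party, total in bucket.items()}
        for cell, bucket in weighted_sums.items()
        if mass[cell] > 0
    }


def poststratify(shares_by_cell, population_share):
    """Average the cell-level shares using each cell's population share."""
    result = {}
    for cell, weight in population_share.items():
        for party, share in shares_by_cell.get(cell, {}).items():
            result[party] = result.get(party, 0.0) + weight * share
    return result
```

With two users, one certainly in cell A and leaning Labour (.8) and one certainly in cell B and leaning Conservative (.8), population shares of .25 for A and .75 for B give an overall estimate of .35 for Labour and .65 for the Conservatives, even though the two users are equally represented on Twitter.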

6 Preliminary Results

This section presents the results from the first stage of our project; specifically, our
preliminary analysis of the content of Twitter conversations. Our initial analysis is an
exploratory effort to understand whether our Twitter corpus can help us describe, and
possibly explain, election outcomes. Our analysis is mostly descriptive. We will show a
series of graphical aggregations of tweets and discuss our interpretation of them.

6.1 Tweet Counts

Figure 2 is a simple count of tweets. Until the end of March, the plot is more or less
stable, but starting in early April the number of tweets shows a clear upward trend as
the election date approaches. In addition, there are several spikes in April corresponding
to the election debates by party leaders on April 2, 16, and 30.
Figure 2: Number of Tweets [daily time series; x-axis: Date (Jan to May); y-axis: Number of Tweets]

Table 4 provides a breakdown of tweets by the number of parties mentioned. There are
some tweets which do not mention any party names or mention multiple party names. In
the following plots, we only use tweets with one party name. Figure 3 shows the number
of tweets by parties. For Scotland the plot clearly indicates that SNP dominates all other
parties especially after March, 2015. Until February, Labour and SNP had about the same
number of tweets. However from March, the number of SNP tweets was clearly larger
than Labour, and this SNP dominance became salient after the first election debates on
April 2.
English tweets present a less clear picture. In April, three parties, Conservatives,
Labour and UKIP, are mentioned quite frequently compared to the three other parties.
Although the number of tweets for the three parties fluctuates together, there is a Conser-
vatives spike on April 14 when the Conservatives published their manifesto. Toward the
election date, it seems that Labour Party Twitter volume exceeded Conservative volume.
Twitter volume on its own provides no hint of a Conservative victory.

Table 4: Party Mentions in Each Tweet

Number of Parties Mentioned | Count
0 | 1402758
1 | 5577683
2 | 1051153
3 | 153598
4 | 27599
5 | 10323
6 | 3727
Figure 3: Number of Tweets by Party [panels: All Tweets, England, Scotland; x-axis: Jan to May; series: Conservatives, LibDem, Labour, SNP, UKIP, Greens]

Our expectation is that fluctuations in Twitter sentiment will better reflect the actual
election outcome. Figure 4 presents a plot of our Twitter sentiment measure for each of
the political parties. It would appear that Labour Party sentiment was actually quite
positive, and certainly more positive than Tory sentiment, in the final month leading up
to the election.
Figure 4: Number of Positive Tweets by Party [panels: All Tweets, England, Scotland; x-axis: Jan to May; series: Conservatives, LibDem, Labour, SNP, UKIP, Greens]

The balance of positive versus negative Twitter sentiment should be more informa-
tive. We generate a measure of net differences in sentiment (positive-negative). Figure 5
summarises the net difference in sentiment for each of the parties. This clearly confirms
that the Conservative Party was on balance negatively perceived at least in Twitter con-
versations. And the Labour Party scores positively on our net difference measure and
this positive net score rises as the election approaches.
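The net difference measure can be computed per party and per day as the count of positive tweets minus the count of negative tweets. The sketch below assumes each tweet has already been reduced to a (date, party, label) record with labels "positive", "negative" or "neutral"; this record format is an assumption made for the example.

```python
from collections import defaultdict


def net_sentiment(records):
    """Return {(date, party): positive count minus negative count}.

    records: iterable of (date, party, label) tuples. Neutral tweets do not
    change the balance, but they still create an entry for that day and party.
    """
    net = defaultdict(int)
    for date, party, label in records:
        if label == "positive":
            net[(date, party)] += 1
        elif label == "negative":
            net[(date, party)] -= 1
        else:
            net[(date, party)] += 0  # register the key without moving the score
    return dict(net)
```

A positive value for a party on a given day means that favourable tweets outnumbered unfavourable ones on that day; it says nothing about the total volume of tweets.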
Figure 5: Net Positive-Negative Tweets by Party [panels: All Tweets, England, Scotland; x-axis: Jan to May; series: Conservatives, LibDem, Labour, SNP, UKIP, Greens]

6.2 Party Leader Mentions

We create similar Twitter volume and sentiment plots for the party leaders. Figure 6
presents the Twitter volume plots. As was the case for the party plots, in Scotland
Twitter volume for the SNP leader, Nicola Sturgeon, overwhelmed that for the other
party leaders. In England, Conservative leader David Cameron and Labour leader Ed
Miliband have similar Twitter volume over the course of the election period. There are
some notable differences between party and party leader mentions: UKIP leader Nigel
Farage was mentioned much less than the leaders of the two larger parties, and there is no
obvious Conservative manifesto spike in Cameron's Twitter volume.
Figure 6: Number of Tweets by Party Leader [panels: All Tweets, England, Scotland; x-axis: Jan to May; series: Cameron, Clegg, Miliband, Sturgeon, Farage, Bennett]

There is more variation in the Twitter conversation results for the leader series. Figure
7 presents the number of positive tweets for each leader. In the early part of our period
there is some evidence that Cameron has a positive lead over Miliband, but as the election
approaches Miliband's positives exceed those of Cameron.

Figure 7: Number of Positive Tweets by Leaders [panels: All Tweets, England, Scotland; x-axis: Jan to May; series: Cameron, Clegg, Miliband, Sturgeon, Farage, Bennett]

Figure 8 presents the difference between positive and negative tweets regarding each of
the leaders. These graphs suggest that on balance tweets regarding Miliband were more
positive than those regarding Cameron. As the election approaches we see
a clear net Miliband advantage in Twitter conversations.

Figure 8: Net Positive-Negative Tweets by Leaders [panels: All Tweets, England, Scotland; x-axis: Jan to May; series: Cameron, Clegg, Miliband, Sturgeon, Farage, Bennett]

6.3 Cumulative Tweet Counts

In the previous two subsections we examined the time series of the count of tweets
published on each day. These time series reflect day-to-day fluctuations in tweets responding
to the political events of the day, such as the election debates held on 2, 16, and 30 April 2015
and the publication of election manifestos. These highly variable time series of raw counts
are suitable for understanding the direction of trends in opinion change but may not
be useful for seeing the overall picture of the election landscape. In this subsection, we
show instead plots of the counts of tweets for parties and leaders accumulated from
the start of our tweet collection. Using cumulative counts of tweets has
proven to be an effective method for Twitter-based election forecasting in previous studies (e.g.
Caldarelli et al., 2014; Burnap et al., 2015).
Figures 9 - 11 are the plots of ratios of cumulative tweets. To generates the plot, we
first calculate the number of all tweets mentioning each party or leader from the start
of download period to the date in the plot, then calculate the proportion of tweets by
dividing the number for each party or leader by the total counts of tweets. Figure 9 is
the plot of party mentions, Figure 10 is the plot of party leader mentions, and Figure 11
is the plot of party and leader mentions.
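
The two steps just described (accumulate counts from the start of collection, then divide by the running total) can be sketched as below. This is an illustrative sketch only: the input layout and the name `cumulative_shares` are assumptions, and the denominator is taken to be the sum of mentions over all parties, which is one reading of "total count of tweets".

```python
from collections import defaultdict


def cumulative_shares(daily_counts):
    """Return [(day, {party: cumulative share})] in chronological order.

    `daily_counts` maps a sortable day key to {party: tweets that day}.
    The share is each party's running total divided by the running total
    over all parties (an assumption about the denominator).
    """
    running = defaultdict(int)
    series = []
    for day in sorted(daily_counts):
        for party, n in daily_counts[day].items():
            running[party] += n
        total = sum(running.values())
        shares = {p: (c / total if total else 0.0) for p, c in running.items()}
        series.append((day, shares))
    return series


# Made-up counts for two days and two parties (not data from the paper)
example = {
    "2015-01-01": {"Labour": 3, "Conservatives": 1},
    "2015-01-02": {"Labour": 1, "Conservatives": 3},
}
for day, shares in cumulative_shares(example):
    print(day, shares)
```

Because each point pools all tweets collected so far, a single noisy day moves the series less and less as the collection period lengthens.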
The cumulative counts for the parties and leaders in Scotland are close to the actual
election results,[7] although tweets mentioning UKIP make up a larger proportion than
the party's actual vote share. For England, however, the counts do not correspond to
the actual election results: Labour is ahead of the Conservatives, and UKIP and the
Greens are overrepresented in the Twitter world.[8] As our analysis of opinion poll data
in Section 7 indicates, the Twitter population in the UK is not a representative sample
of the general UK population, and in order to obtain a reliable assessment of opinion
change in the UK from tweets, we need to deal with this bias through the strategy we
mapped out in Section 5.

[7] The vote percentage in Scotland for each party is SNP: 50.0, LAB: 24.3, CON: 14.9,
LD: 7.5, UKIP: 1.6, GRN: 1.3.
[8] The vote percentage in England is CON: 41.0, LAB: 31.6, UKIP: 14.1, LD: 8.2, GRN: 4.2.

Figure 9: Ratio of Cumulative Positive Tweets for Parties

[Panels: All Tweets, England, Scotland. Horizontal axis: Jan to May. Series:
Conservatives, LibDem, Labour, SNP, UKIP, Greens.]

Figure 10: Ratio of Cumulative Positive Tweets for Leaders

[Panels: All Tweets, England, Scotland. Horizontal axis: Jan to May. Series: Cameron,
Clegg, Miliband, Sturgeon, Farage, Bennett.]

Figure 11: Ratio of Cumulative Positive Tweets for Parties and Leaders

[Panels: All Tweets, England, Scotland. Horizontal axis: Jan to May. Series:
Conservatives, LibDem, Labour, SNP, UKIP, Greens.]

6.4 Issues

As election day approached, it is clear that both the Labour Party and its leader were
generating more positive Twitter traffic than the Conservatives. While the Conservatives
clearly struggled in terms of party and leader image, they did better with respect to
the issues that dominated the 2015 General Election. Figure 12 plots the lines from
local regression smoothing (LOESS) for each issue. Three issues received much of the
Twitter attention: Economy, Welfare, and Taxes. In England, Welfare and the Economy are
the top issues, while in Scotland the Economy seems to be the primary concern after
March.
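
For readers unfamiliar with LOESS, the idea can be sketched in a few lines. The version below is a simplified local linear smoother with tricube weights; the span `frac`, the linear (rather than quadratic) local fit and the function name `loess` are assumptions made for illustration, since the settings used for the figures are not stated here.

```python
def loess(xs, ys, frac=0.5):
    """Smooth ys against xs with locally weighted linear regression.

    For each x0, the k nearest points (k = frac * n) receive tricube weights
    and a weighted straight line is fitted; the fitted value at x0 is returned.
    """
    n = len(xs)
    k = min(n, max(2, int(round(frac * n))))
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        # Bandwidth: distance to the k-th nearest point (guard against zero)
        h = sorted(dists)[k - 1] or 1.0
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Made-up daily issue counts (not data from the paper)
days = list(range(10))
counts = [120, 90, 150, 130, 400, 160, 140, 170, 155, 180]
print([round(v, 1) for v in loess(days, counts, frac=0.5)])
```

A larger `frac` averages over more days and gives a flatter line; a smaller one follows event-driven spikes more closely.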

Figure 12: Number of Issue Tweets

[Panels: All Tweets, England, Scotland. Horizontal axis: Jan to May. Issues: economy,
immigration, tax, welfare, EU, education.]

Our analysis of Twitter traffic can provide some insight into which parties are favoured
by each of the major issues that dominated the election conversation. Figures 13 and 15
show the LOESS lines for issue tweets that are associated with each of the UK parties.
We count the number of tweets for each party by combining mentions of both party names
and party leaders. In England, UKIP clearly owned two issues, immigration and the EU:
more than half of the tweets on these issues mentioned UKIP.
Welfare was one of the dominant issues during the campaign. The Conservatives were
favored in terms of the frequency of welfare tweets over the course of the entire
election period. Two other issues dominated public attention: the economy and taxes.
Early in the campaign, the frequency of Conservative and Labour association with these
issues was similar. But as election day approached, the Conservative association with
these issues advanced considerably over Labour. In terms of Twitter engagement on major
issues, the Conservatives had an advantage over the Labour Party.
Tweets associating the parties with these issues can be both positive and negative.
Figure 14 summarises the frequency of positive issue tweets for the major parties. On
the major economic issues, the economy and taxes, the Conservatives lead the Labour
Party both in overall frequency and in terms of positive tweets. While the Conservatives
have a lead in the frequency of welfare tweets throughout our sample period, Labour has
an advantage in positive welfare tweets as election day approaches.

Figure 13: Number of Issue Tweets by Parties (England)

[Panels: economy, education, EU, immigration, tax, welfare. Horizontal axis: Jan to May.
Series: Conservatives, LibDem, Labour, SNP, UKIP, Greens.]

Figure 14: Number of Issue Tweets by Parties (England, Positive)

[Panels: economy, education, EU, immigration, tax, welfare. Horizontal axis: Jan to May.
Series: Conservatives, LibDem, Labour, SNP, UKIP, Greens.]

The analysis of issue tweets in Scotland suggests that the SNP's electoral success was
not associated with the set of issues that dominated the electoral conversation in the
rest of the UK. The SNP had a large volume advantage with respect to economic issues,
but with respect to the other issues in our corpus the SNP were not favored. Clearly,
the issue of Scottish independence played a large role in the SNP's success.

Figure 15: Number of Issue Tweets by Parties (Scotland)

[Panels: economy, education, EU, immigration, tax, welfare. Horizontal axis: Jan to May.
Series: Conservatives, LibDem, Labour, SNP, UKIP, Greens.]

7 Insights from Public Opinion Polling

One of the real attractions of monitoring public opinion using Twitter traffic is that
the measure is entirely non-invasive. It avoids many of the problems associated with
survey interviewing effects. On the other hand, Twitter users are unlikely to be a
representative sample of, in this case, the UK eligible voting population. Nevertheless,
in spite of its lack of representativeness, Twitter conversations could in fact do a
good job of reflecting public sentiment about politics. This section provides some
speculative assessments of the ability of Twitter sentiment to measure accurately the
public sentiment about the UK 2015 election. In order to do this we rely on the campaign
survey data collected by the 2015 British Election Study (BES).
Wave 5 of the 2015 BES was a daily online election survey beginning thirty-eight days
before election day. A number of the questions measure sentiment that is similar to the
Twitter sentiment. This allows us to compare our daily Twitter sentiment series with a
number of the BES time series. The BES asked an 11-point thermometer scale question
regarding each of the party leaders (ranging in value from 0 to 10).[9] Figure 16
compares the BES results for Cameron and Miliband with those from the Twitter analysis
presented earlier. The BES series suggests that public evaluations of Cameron are higher
than those of Miliband, while the Twitter sentiment series suggests exactly the
opposite. An explanation that seems to be confirmed by Figure 17 is that Twitter users
have a Labour, or at least Miliband, bias. The BES surveys included a question asking
respondents if they were Twitter users. Approximately 25 percent of BES respondents
indicated they were Twitter users. Figure 17 presents responses to the leader evaluation
question for those identifying themselves as Twitter users. Miliband's evaluations
amongst Twitter users are significantly higher than Cameron's, suggesting that the
Twitter traffic result reported in Figure 8 that favoured Miliband may be related to the
partisan composition of Twitter users.

[9] The actual question wording is "How much do you like or dislike each of the
following party leaders?" The leaders' names are presented in a randomized order.

Figure 16: Leader Evaluations in the 2015 British Election Survey

[Horizontal axis: Mar 30 to May 04. Series: Cameron, Miliband.]

Figure 17: Leader Evaluations in the 2015 British Election Survey (Twitter Users)

[Horizontal axis: Mar 30 to May 04. Series: Cameron, Miliband.]

The BES also asked respondents to evaluate the major UK political parties using
the same 11-point scale employed for leader evaluations. Figure 18 presents the mean
Conservative and Labour results for the total sample while Figure 19 presents the same
evaluations for Twitter users in the BES sample. First, the Labour Party clearly leads
the Conservatives amongst both the total and Twitter samples from the BES. Second,
our Twitter sentiment scores for the Labour and Conservative parties show the same
pattern with Labour clearly having an advantage over the Conservative Party.

Figure 18: Party Evaluations in the 2015 British Election Survey

[Horizontal axis: Mar 30 to May 04. Series: Conservatives, Labour.]

Figure 19: Party Evaluations in the 2015 British Election Survey (Twitter Users)

[Horizontal axis: Mar 30 to May 04. Series: Conservatives, Labour.]

7.1 Issues

The BES included an open-ended question on the most important issue facing the country.
The exact question wording is "As far as you're concerned, what is the SINGLE MOST
important issue facing the country at the present time?" In order to compare responses
to this question with our Twitter-derived topics, we categorize answers to this question
using the same dictionaries used for the Twitter topic detection. Column 1 of Appendix
Table 5 provides a list of the resulting topics.
To improve detection performance, we manually refined the dictionary by looking at the
high-frequency terms in the BES verbatim answers that our original dictionary failed to
assign to any issue. Table 8 shows the modified dictionary. In addition, we added two
more topics: Environment, and Crime and Security. These two topics are not particularly
large, but they are clearly defined and distinguishable from the other six topics. With
this improved dictionary, more than half of the 9,852 texts left unmatched by the
original dictionary are categorized into one or more issue topics.
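
The dictionary approach described above can be sketched as follows. The terms are a small subset of the Table 8 wordlist; treating each term as a case-insensitive regular expression, and the names `TOPIC_TERMS` and `classify`, are assumptions made for this illustration rather than a description of the actual matching code.

```python
import re

# Illustrative subset of the Table 8 search terms (assumed to be regex patterns)
TOPIC_TERMS = {
    "Economy": ["economy", "austerity", "debt", "standard.+living"],
    "Immigration": ["imm.gration", "immigrant", "migrants"],
    "Welfare": ["welfare", "nhs", "pension", "rich.+poor"],
    "EU": ["brexit", "europe"],
}

# One compiled alternation per topic
TOPIC_PATTERNS = {
    topic: re.compile("|".join(f"(?:{term})" for term in terms), re.IGNORECASE)
    for topic, terms in TOPIC_TERMS.items()
}


def classify(text):
    """Return every topic whose dictionary matches the text (possibly none)."""
    return [topic for topic, pat in TOPIC_PATTERNS.items() if pat.search(text)]


# Made-up open-ended answers (not BES verbatims)
for answer in ["The economy and the NHS", "too much immagration", "don't know"]:
    print(answer, "->", classify(answer) or ["Unmatched"])
```

A pattern such as `imm.gration` tolerates one misspelled character, which is how a dictionary can pick up common respondent typos; an answer can fall into more than one topic, consistent with "one or more issue topics" above.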

Table 5: Detected Issues in BES

                        Number of Matching
Topic Name              Original List    Improved List
Economy                 8257             9246
Immigration             5454             5975
Tax                     131              131
Welfare                 3770             6319
EU                      248              547
Education               303              303
Environment                              476
Crime and Security                       797
Unmatched               9852             4488

There are noticeable differences between the volume of tweets and the frequency of the
issues that BES respondents perceive as important (Figure 20). In particular, three
issues are frequently mentioned in the Twitter traffic but rarely named as the most
important issue in the BES: the EU, education and taxes. The most striking difference
concerns taxes. Fewer than one percent of BES respondents think that taxes are the most
important issue, but the tax topic represents about eighteen percent of Twitter topics.

Figure 20: Issues on Twitter and in the British Election Study 2015 Compared

[Rows: Twitter, BES (TW User), BES (nonTW User), BES (All). Horizontal axis: 0.00 to
1.00. Issues: economy, immigration, tax, welfare, EU, education.]

Figure 20 clearly suggests the importance of the economy in Twitter traffic. The
importance of the economy is also confirmed by the BES most important issue question.
Moreover, we saw earlier in Figure 13 and Figure 14 that the Conservative Party had a
significant lead in terms of tweets that associated the party with the economy in a
favourable light. This suggests that the economy likely played an important role in the
Conservative Party victory.
In order to explore further how issue perceptions may have affected vote choice, we
return to the BES survey data. We run a conditional logit model that includes the six
issues as explanatory variables (Table 6). The baseline is the vote for the
Conservatives. Note that the economy is the only issue on which the Conservatives have
an advantage against all other parties. The respondents who think the economy is the
most important issue in the election comprise the largest group compared to the other
issues, and are more likely to vote for the Conservatives. Welfare is more or less the
opposite: voters who think welfare is the most important issue are less likely to vote
for the Conservatives and more likely to support Labour. This result does not change
even if we limit the analysis to the Twitter users among BES respondents (Table 7).
Although the issue interests of Twitter users are slightly different from those of
non-Twitter users, their vote choice is shaped by similar issue priorities. Hence, these
multivariate results seem to confirm that the economic concerns of the voters
contributed significantly to the Conservative victory.

Table 6: Conditional Logit Model of Vote Choice

                  Labour    LibDem    UKIP      Greens
Economy           0.68      0.39      1.08      1.01
                  (0.05)    (0.07)    (0.08)    (0.08)
Immigration       1.07      1.35      1.07      2.31
                  (0.06)    (0.10)    (0.06)    (0.16)
Tax               0.23      0.61      0.63      0.73
                  (0.29)    (0.38)    (0.37)    (0.38)
Welfare           1.30      0.76      0.05      0.96
                  (0.06)    (0.08)    (0.09)    (0.08)
EU                1.01      0.21      1.69      0.98
                  (0.18)    (0.22)    (0.12)    (0.32)
Education         0.63      0.88      0.89      0.76
                  (0.18)    (0.24)    (0.41)    (0.26)
Constant          0.09      1.27      1.11      1.35
                  (0.03)    (0.05)    (0.05)    (0.05)
Log Likelihood    24185.69
Num. obs.         19235
*** p < 0.01, ** p < 0.05, * p < 0.1
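
How a vote-choice model of this kind turns issue priorities into predicted vote shares can be illustrated with a minimal sketch. It assumes the standard multinomial form in which the baseline party (the Conservatives) has utility zero and every other party has its own constant and issue coefficients. The coefficient values below are hypothetical and chosen only for illustration; they are not the estimates in Tables 6 and 7, and the function name `choice_probabilities` is likewise an assumption.

```python
import math

PARTIES = ["Labour", "LibDem", "UKIP", "Greens"]  # Conservatives are the baseline


def choice_probabilities(constants, issue_coefs, issues):
    """Return {party: probability of voting for it} for one respondent.

    constants:   {party: intercept}
    issue_coefs: {party: {issue: coefficient}}
    issues:      set of issues the respondent names as most important
    """
    utilities = {"Conservatives": 0.0}  # baseline utility fixed at zero
    for party in PARTIES:
        utilities[party] = constants[party] + sum(
            issue_coefs[party].get(issue, 0.0) for issue in issues
        )
    denom = sum(math.exp(u) for u in utilities.values())
    return {party: math.exp(u) / denom for party, u in utilities.items()}


# Hypothetical coefficients for illustration only (NOT the paper's estimates)
constants = {"Labour": 0.1, "LibDem": -1.0, "UKIP": -1.0, "Greens": -1.5}
issue_coefs = {party: {"economy": -0.5} for party in PARTIES}

no_issue = choice_probabilities(constants, issue_coefs, set())
economy = choice_probabilities(constants, issue_coefs, {"economy"})
print(round(no_issue["Conservatives"], 3), round(economy["Conservatives"], 3))
```

In this setup, a negative coefficient for every non-baseline party on an issue means that naming the issue raises the predicted probability of a Conservative vote, which is the pattern the text describes for the economy.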

Table 7: Conditional Logit Model of Vote Choice (Twitter Users)

                  Labour    LibDem    UKIP      Greens
Economy           0.60      0.54      1.05      1.02
                  (0.08)    (0.13)    (0.17)    (0.13)
Immigration       0.94      1.66      1.37      2.56
                  (0.12)    (0.25)    (0.14)    (0.33)
Tax               0.05      0.03      0.50      0.36
                  (0.47)    (0.68)    (0.69)    (0.58)
Welfare           1.45      0.74      0.15      1.11
                  (0.11)    (0.15)    (0.20)    (0.14)
EU                0.91      0.10      1.84      0.64
                  (0.36)    (0.44)    (0.29)    (0.50)
Education         0.41      0.73      1.67      0.23
                  (0.27)    (0.34)    (1.03)    (0.37)
Constant          0.35      0.99      1.37      0.77
                  (0.06)    (0.10)    (0.11)    (0.09)
Log Likelihood    6875.71
Num. obs.         5435
*** p < 0.01, ** p < 0.05, * p < 0.1

8 Conclusion

In this essay we explore the utility of using Twitter conversations to explain election
outcomes. Our efforts are based on a large corpus of tweets collected during the
six-month period approaching the 2015 UK election. Our analysis includes the
geo-location of tweets, sentiment analysis, and issue/topic modelling. The analysis
focuses on England and Scotland. The interpretation of the Twitter landscape of Scotland
is straightforward: the Scottish National Party dominated the Twitter conversation on
all aspects, including the number of tweets, general sentiment, and the evaluation of
key issues. In contrast, the results for England are more nuanced: generally speaking,
the Labour Party was the slight favorite; however, the Conservative Party had a slight
edge over Labour on certain issues that may have been critical in the ultimate Tory
victory.
Twitter volume and sentiment tended to favour both the Labour Party and its leader, Mr.
Miliband. This may reflect a Labour partisan bias in the population of Twitter users.
Polling data from the BES suggested a Cameron advantage on leader evaluations. Both the
BES and Twitter suggest that the electorate had a more positive evaluation of the Labour
Party than of the Conservative Party. Our analysis of Twitter issue sentiment suggests
that concerns about the economy played to the Conservatives' advantage; this would seem
to be confirmed by our analysis of the 2015 BES survey data.
We also discussed a strategy for correcting the biases in the Twitter user population.
The idea is to obtain additional information on user demographics, such as age and
occupation, and use it to weight the volume of classified tweets that express
opinionated mentions of parties and leaders. Such assessments of Twitter users in the UK
will not only provide a useful tool for improving our explanations of UK elections
through tweets, but also help researchers interested in using tweets in the UK to
understand political or other types of issues.

References
Boutet, Antoine, Hyoungshick Kim and Eiko Yoneki. 2013. "What's in Twitter, I know what
parties are popular and who you are supporting now!" Social Network Analysis and Mining
3(4):1379–1391.
Burger, John D, John Henderson, George Kim and Guido Zarrella. 2011. "Discriminating
Gender on Twitter." In Proceedings of the 2011 Conference on Empirical Methods in
Natural Language Processing. pp. 1301–1309.
Burnap, Pete, Rachel Gibson, Luke Sloan, Rosalynd Southern and Matthew Williams. 2015.
"140 Characters to Victory?: Using Twitter to Predict the UK 2015 General Election."
Electoral Studies.
Caldarelli, Guido, Alessandro Chessa, Fabio Pammolli, Gabriele Pompa, Michelangelo
Puliga, Massimo Riccaboni and Gianni Riotta. 2014. "A Multi-Level Geographical Study of
Italian Political Elections from Twitter Data." PLoS ONE 9(5):e95809.
Ceron, Andrea, Luigi Curini and Stefano M. Iacus. 2014. "Using Sentiment Analysis to
Monitor Electoral Campaigns: Method Matters - Evidence From the United States and
Italy." Social Science Computer Review.
Ceron, Andrea, Luigi Curini, Stefano M. Iacus and Giuseppe Porro. 2013. "Every tweet
counts? How sentiment analysis of social media can improve our knowledge of citizens'
political preferences with an application to Italy and France." New Media & Society
16(2):340–358.
Conover, M, B Goncalves, J Ratkiewicz, A Flammini and F Menczer. 2011. "Predicting the
Political Alignment of Twitter Users." In Proceedings of 3rd IEEE Conference on Social
Computing (SocialCom).
Curini, Luigi, Andrea Ceron and Stefano M Iacus. 2013. "To what extent sentiment
analysis of Twitter is able to forecast electoral results? Evidence from France, Italy
and the United States." In 7th ECPR General Conference. Number September 2013 in Paper
Presented at 7th ECPR General Conference pp. 1–28.
Fisher, Stephen D. and Michael S. Lewis-Beck. 2015. "Forecasting the 2015 British
general election: The 1992 debacle all over again?" Electoral Studies.
Gayo-Avello, Daniel. 2013. "A Meta-Analysis of State-of-the-Art Electoral Prediction
From Twitter Data." Social Science Computer Review 31(6):649–679.
Hopkins, Daniel J and Gary King. 2010. "A Method of Automated Nonparametric Content
Analysis for Social Science." American Journal of Political Science 54(1):229–247.
Hutto, C J and Eric Gilbert. 2014. "VADER: A Parsimonious Rule-based Model for Sentiment
Analysis of Social Media Text." In Proceedings of the Eighth International AAAI
Conference on Weblogs and Social Media. pp. 216–225.
Jungherr, A., P. Jurgens and H. Schoen. 2012. "Why the Pirate Party Won the German
Election of 2009 or The Trouble With Predictions: A Response to Tumasjan, A., Sprenger,
T. O., Sander, P. G., & Welpe, I. M. 'Predicting Elections With Twitter: What 140
Characters Reveal About Political Sentiment'." Social Science Computer Review
30(2):229–234.
Lampos, Vasileios, Daniel Preotiuc-Pietro and Trevor Cohn. 2013. "A user-centric model
of voting intention from Social Media." Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (ACL) pp. 993–1003.
Mohammady, Ehsan and Aron Culotta. 2014. "Using County Demographics to Infer Attributes
of Twitter Users." In Proceedings of the Joint Workshop on Social Dynamics and Personal
Attributes in Social Media. pp. 7–16.
O'Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge and Noah A. Smith. 2010.
"From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series." In
Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media.
pp. 112–129.
Pak, Alexander and Patrick Paroubek. 2010. "Twitter as a Corpus for Sentiment Analysis
and Opinion Mining." LREC 10:1320–1326.
Paltoglou, Georgios. 2014. "Sentiment analysis in social media." In Online Collective
Action. Springer pp. 3–17.
Paltoglou, Georgios and Mike Thelwall. 2012. "Twitter, MySpace, Digg." ACM Transactions
on Intelligent Systems and Technology 3(4):1–19.
Pang, Bo and Lillian Lee. 2008. "Opinion Mining and Sentiment Analysis." Foundations and
Trends in Information Retrieval 2(1–2):1–135.
Preotiuc-Pietro, Daniel, Vasileios Lampos and Nikolaos Aletras. 2015. "An Analysis of
the User Occupational Class Through Twitter Content." In Proceedings of the 53rd Annual
Meeting of the Association of Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing. Vol. 125 pp. 1754–1764.
Schwartz, H. Andrew, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski,
Stephanie M. Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell,
Martin E.P. Seligman and Lyle H. Ungar. 2013. "Personality, Gender, and Age in the
Language of Social Media: The Open-Vocabulary Approach." PLoS ONE 8.
Sloan, Luke, Jeffrey Morgan, Pete Burnap and Matthew Williams. 2015. "Who Tweets?
Deriving the Demographic Characteristics of Age, Occupation and Social Class from
Twitter User Meta-Data." PLoS ONE 10(3):e0115545.
Tumasjan, A., T. O. Sprenger, P. G. Sandner and I. M. Welpe. 2011. "Election Forecasts
With Twitter: How 140 Characters Reflect the Political Landscape." Social Science
Computer Review 29(4):402–418.
Wilson, Theresa, Zornitsa Kozareva, Preslav Nakov, Sara Rosenthal, Veselin Stoyanov and
Alan Ritter. 2013. "SemEval-2013 task 2: Sentiment analysis in twitter." In Proceedings
of the International Workshop on Semantic Evaluation, SemEval. Vol. 13.

Appendix

Table 8: Topics Wordlist for British Election Study

Topic Name            Search Terms
Economy               cost of living, economy, income, business, austerity, budget, debt,
                      borrowing, gdp, employment, job(s, ), wage(, s), living costs,
                      economic growth, defecit, economic, ecconomy, financial, finance,
                      rising price, living standard, standard.+living, money
Immigration           imm.gration, im.gration, racist, immigrant, migrants, boarder(s, )
Tax                   tax, betroom, mansion, dodging, nondom, vat, ifs
Welfare               welfare, nhs, benefit, wowpetition, health, n.h.s, pension, poverty,
                      rich.+poor, elderly, wealth gap, equality, affordable housing,
                      hous.+(costs, prices), housing
EU                    proeu, no2eu, meps, brexit, mep, europe, EU
Education             education, tuition, school, university, universities,
                      apprenticeship, childcare, teachers, uni
Environment           environment, green, climate, global warming
Crime and Security    security, crime, terrorism, isis, islam, islamism, defence
