Robert Coyle

National College of Ireland
Higher Diploma in Science in Data Analytics

2013/2014
Robert Coyle
X13109278
robert.coyle@student.ncirl.ie
The Use of Twitter Activity as a Stock Market Predictor
Table of Contents
ABSTRACT
DEFINITIONS, ACRONYMS, AND ABBREVIATIONS
INTRODUCTION
RELATED WORK
SYSTEMS AND DATASETS
DESIGN AND ARCHITECTURE
Brief description of work carried out
DATASETS
Gathering of Twitter Data.
Gathering of Stock Price Data
Data Preparation
REQUIREMENTS
Data requirements
User requirements
Usability requirements
Functional Requirements
TESTING AND EVALUATION
SYSTEMS TESTING.
Apple Stock
Microsoft Stock
Tesla Stock
FORMULA FOR PREDICTING STOCK MOVEMENT
Formula Used
Apple Stock Prediction
Microsoft Stock Prediction
Tesla Stock Prediction
CONCLUSION
FURTHER DEVELOPMENT
BIBLIOGRAPHY
APPENDIX
Project Materials:
PROJECT PROPOSAL
INTRODUCTION
BACKGROUND
TECHNICAL APPROACH
SPECIAL RESOURCES REQUIRED
PROJECT PLAN
TECHNICAL DETAILS
SYSTEMS/DATASETS
EVALUATION/TEST AND ANALYSIS
CONSULTATION WITH SPECIALIZATION PERSONS
REQUIREMENTS SPECIFICATION
DOCUMENT CONTROL
REVISION HISTORY
DISTRIBUTION LIST
RELATED DOCUMENTS
1 INTRODUCTION
1.1 PURPOSE
1.2 PROJECT SCOPE
1.2.1 In Scope
1.2.2 Out of Scope
1.3 DOCUMENT SCOPE
1.4 DEFINITIONS, ACRONYMS, AND ABBREVIATIONS
2 USER REQUIREMENTS DEFINITION
2.1 USER CHARACTERISTICS
3 REQUIREMENTS SPECIFICATION
3.1 FUNCTIONAL REQUIREMENTS
3.1.1 USE CASE DIAGRAM OVERALL FUNCTIONAL REQUIREMENTS
3.1.2 REQUIREMENT 1: ACQUIRE DATA 1 AND 2
3.1.2.1 Description & Priority
3.1.2.2 Use Case
Scope
Description
Use Case Diagram
Flow Description
3.1.3 REQUIREMENT 2: CLEAN DATA 1 AND 2
3.1.3.2 Use Case
Scope
Description
Use Case Diagram
3.1.4 REQUIREMENT 2: ANALYZE DATA
3.1.4.2 Use Case
Scope
Description
Use Case Diagram
3.1.5 REQUIREMENT 2: PUBLISH DATA
3.1.5.2 Use Case
Scope
Description
Use Case Diagram
3.2 NON-FUNCTIONAL REQUIREMENTS
3.2.1 Availability: Must Have
3.2.2 Storage Requirements: Must Have
3.2.3 Connection Reliability: Must Have
3.2.4 Connection Speed: Must Have
3.2.5 Backup and Recovery: Must Have
3.2.6 Program to clean data: Must Have
3.2.7 Software Analysis tools: Must Have
3.2.8 Communication Requirements: Must Have
3.2.9 Security: Must Have
3.2.9 Data Validation: Must Have
5 INTERFACE REQUIREMENTS
5.1 GUI
An example of an analysis of tweets.
Examples of tweets analyzed on Microsoft Excel and Geo Flow
Analysis of tweets using R language
Example of Excel Data for intro to Regression.
Example of analysis completed on R Studio.
6 ANALYSIS EVOLUTION
PROGRESS MANAGEMENT REPORT 1
DOCUMENT LOCATION
APPROVALS
DISTRIBUTION
PURPOSE OF DOCUMENT
DATE OF REPORT
PERIOD COVERED
SCHEDULE STATUS
Updated Gantt chart
DEFINITIONS, ACRONYMS, AND ABBREVIATIONS
PRODUCTS COMPLETED DURING THIS PERIOD
PROBLEMS
ACTUAL
POTENTIAL
RAID LOG:
Risks
Assumptions
Issues
Dependency
PRODUCTS DUE FOR COMPLETION
PROJECT ISSUES STATUS
CONCLUSION
APPROVALS
DISTRIBUTION
DATE OF REPORT
PERIOD COVERED
SCHEDULE STATUS
PROBLEMS
ACTUAL
POTENTIAL
RAID LOG:
Risks
Assumptions
Issues
Dependency
CONCLUSION
APPROVALS
DISTRIBUTION
DATE OF REPORT
PERIOD COVERED
SCHEDULE STATUS
PROBLEMS
ACTUAL
POTENTIAL
RAID LOG:
Risks
Assumptions
Issues
Dependency
CONCLUSION
REFERENCES
Abstract
This thesis investigates the possibility of predicting stock market movement
using Twitter activity. The analysis uses data mining applications, data
analysis techniques, and correlation and regression modelling.
Twitter feeds were mined using the Twitter API and Java code to search for and
download tweets containing the words Apple, Microsoft and Tesla. These files
were then processed using Amazon Web Services and Text Wrangler. The analysis
was carried out using RStudio and Microsoft Excel. Correlation and regression
models were built, and the Granger causality test was run, in RStudio.
Visualisation techniques applied in Microsoft Excel and RStudio showed some
trends in the data.
A formula for stock market prediction for commercial use was created. Because
the data set gathered from Twitter was not large enough, and the tweets were
not specifically about the stock of the companies, noisy data may have
corrupted the analysis. A sentiment analysis was not carried out on the tweets.
Definitions, Acronyms, and Abbreviations

Term                     Definition
API                      Application programming interface
AWS                      Amazon Web Services
Causative                A form that indicates that a subject causes something else to do something, or causes a change in state of a non-volitional event.
GPOMS                    Google Profile of Mood States, an algorithm to classify public sentiment into 6 categories {Calm, Alert, Sure, Vital, Kind and Happy}
Granger causality test   A statistical hypothesis test for determining whether one time series is useful in predicting another.
NASDAQ                   National Association of Securities Dealers Automated Quotations
Noisy Data               Meaningless data.
POMS                     Profile of Mood States.
Sentiment analysis       The use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.
Text Wrangler            Text editor for Mac OS X
Tweet                    A message posted on the Twitter website.
Introduction
The stock market is an essential way for companies to raise money.
By being publicly traded and selling shares of ownership, companies can raise
additional financial capital in order to expand their business.
Historically, share prices are known to have a major influence on economic
activity and can be an indicator of social mood.
Stock market movement has always been a rich and interesting subject, with
so many factors to be analysed that for a long time it was considered
unpredictable.
Over the past few decades, companies such as Merrill Lynch and other financial
management companies have applied new computerized mathematical methods to
create models that aim to maximize their returns while minimizing their risks.
Stock market prediction has been around for years, but the rise of social
media has given it a new method of prediction.
The objective of this project is to analyse Twitter feeds for activity and
trends associated with a brand, and to see whether the brand's share price is
related to, and affected by, that Twitter activity.
This analysis looks at the number of tweets for three specific brands on the
NASDAQ: Apple, Microsoft and Tesla. A search for each company's NASDAQ symbol
within the returned tweets was also conducted as an additional exploration of
stock conversation on Twitter.
These brands were chosen because they are innovative technology companies that
are listed on the same stock exchange. The gathering of the Twitter data was
therefore not time zone dependent.
Stock market data was collected from the Yahoo Finance website, which
provides historical data for the NASDAQ.
Java code was used to acquire the tweets through Twitter's API service.
The tweets for each brand were then counted using Amazon Web Services and
Text Wrangler.
The counted tweets were subsequently analysed in RStudio, where correlation
and regression models were built and the Granger causality test was
performed.
The data was then visualised in Excel and RStudio, and the creation of a
formula for commercial use was attempted.
Related Work
In the previous study, Stock Market Prediction Using Twitter, I researched
papers on sentiment analysis of social media for the prediction of stock
market movement. The social media platform in question was Twitter.
The investigation looked at the correlation between public mood and stock
market movement, and at how it can be used to predict stock market prices.
Sentiment analysis was used to translate the tweets into moods using
algorithms such as Google Profile of Mood States.
Applying sentiment analysis to the tweets proved to be an accurate way of
analysing the data.
Analysing Twitter activity alone does not give sufficient insight into the
behavioural attitudes of investors, so an accurate prediction of stock
movement cannot be ascertained. Sentiment analysis provides the investigation
with an insight into public attitude. The more detailed the sentiment analysis
of the Twitter data, and the more reliable the stock data, the more accurate
the results.
Twitter activity alone might not give the insight a stockbroker needs to make
challenging decisions about buying or selling shares.
Systems and Datasets

Design and Architecture
Brief description of work carried out
The system was designed to acquire Twitter and stock market data and compare
the two data sets for a relationship.
For the Twitter data, Java code, an AWS script and Text Wrangler
were used to acquire and clean the data.
The financial data was acquired from the Yahoo Finance website. The data
was downloaded in Excel format and then saved as a CSV file.
The results from the cleaned Twitter data were then placed alongside the
cleaned financial data in Excel.
The Granger causality test was run in RStudio to find whether the Twitter
time series was useful for forecasting the stock price time series.
A correlation model was built to confirm the relationship between the two
data types.
Excel was then used to visualise and confirm the relationship.
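As an illustration of the correlation step, the following is a minimal Python sketch of a Pearson correlation coefficient between two daily series. It is not the R code used in this project. The function name `pearson` is hypothetical, and the two input series are the daily "Apple" and "AAPL" counts from Figure 1.11, used here only as stand-in inputs; in the analysis itself the second series would be the daily stock price, which is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Stand-in inputs (assumption): daily "Apple" and "AAPL" counts, Figure 1.11.
apple_counts = [71913, 118077, 81840, 63983, 62755]
aapl_counts = [1001, 950, 1100, 1483, 1145]

r = pearson(apple_counts, aapl_counts)
print(f"r = {r:.3f}")
```

With only five daily observations per series, any coefficient computed this way should be read as exploratory rather than conclusive.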
Datasets
There were two datasets.
The first dataset acquired was the Twitter feeds.
Historical tweets proved difficult to obtain since Twitter had sold its
data to external parties. These companies, such as DataSift, offer analysis
of historical data. While this would have been beneficial to the original
project proposal, the budget of the project was zero.
Twitter launched a Historical Data Grant scheme, which allowed academic
students to send in a proposal to gain access to Twitter's historical data.
A proposal on behalf of this project was sent to the Data Grant scheme, but
the reply from Twitter arrived far too late in the project.
The second dataset was the historical stock market data, which was gathered
from Yahoo Finance for the dates on which the tweets were collected.
Gathering of Twitter Data.
The Java code was obtained with the approval of Dr. Brian Mac Namee, a
Principal Investigator with CeADAR and a lecturer in the School of Computing
at the Dublin Institute of Technology.
The Java code was used in conjunction with the Twitter API.
To use the Twitter API, a user must first sign up for a developer account
and create an application; the user can then obtain the API keys needed to
run the code.
The code was run on my behalf at a friend's home, since my own Internet
connection was not suitable and a disconnection would have produced an
unreliable time series.
Figure 1.1: Example of the application used in Twitter. (Dev.twitter.com, 2014)
Figure 1.2: Example of the Java code used for downloading the Twitter feeds.
Figure 1.3: Demonstrates where the unique keys were entered in the Java code.
Figure 1.4: Demonstrates where the key words were entered in the Java code.
Java Code Issues

Because the Java code returned tweets so frequently, and to guard against a
system crash, the data was saved into text files daily.
The data sets retrieved from Twitter ranged from 60 megabytes to 100 megabytes,
with over 400,000 lines of tweets per day.
Five sets of text files were obtained, representing Monday to Friday, the days
on which the NASDAQ is open.
Figure 1.5: Example of the acquired Twitter feeds from the Java code in a text
file.
On one of the days the code stopped running, leaving a gap with no tweets from
3am until 8am. Because of this, only tweets that were published during the
trading hours of the NASDAQ were used.
NASDAQ trading hours are 09:30 to 16:00 Eastern Time, Monday to Friday.
During the collection week that is 14:30 to 21:00 Irish time (13:30 to 20:00
GMT).
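The trading-hours restriction can be sketched as a small Python filter. This is an illustrative sketch, not the procedure used in the project: the function name `in_trading_hours` and the sample timestamps are hypothetical, and the fixed UTC-4 offset is an assumption that holds for the collection week of 7 to 11 April 2014, when New York was on daylight time. Under that assumption the window is 13:30 to 20:00 UTC, which reads as 14:30 to 21:00 on an Irish clock in summer time.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: fixed UTC-4 offset (US Eastern Daylight Time), valid for the
# collection week of 7 to 11 April 2014. Winter dates would need UTC-5.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def in_trading_hours(timestamp):
    """Return True if a timezone-aware timestamp falls in NASDAQ trading hours."""
    local = timestamp.astimezone(EASTERN_DAYLIGHT)
    is_weekday = local.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and MARKET_OPEN <= local.time() < MARKET_CLOSE


# Hypothetical tweet timestamps in UTC on Monday 7 April 2014.
early = datetime(2014, 4, 7, 5, 0, tzinfo=timezone.utc)    # 01:00 in New York
midday = datetime(2014, 4, 7, 15, 0, tzinfo=timezone.utc)  # 11:00 in New York
print(in_trading_hours(early), in_trading_hours(midday))   # False True
```

Filtering on the exchange's own clock, rather than on a local clock, avoids having to reason about two daylight-saving calendars at once.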
Counting the Tweets
Next the tweets had to be counted.
To do this I initially proposed using Amazon Web Services because of the size
of the data sets. A word count script from the AWS website was used to count
every occurrence of the specific words in the tweets.
Figure 1.6: Example of the acquired Python script file from the AWS website.
(Aws.amazon.com, 2014)
A folder named project 2014 was created in the S3 bucket.
All necessary files, such as the Python scripts and tweet files, were uploaded
to it.
An Elastic MapReduce cluster was then created.
Figure 1.7: Example of a successful cluster from the AWS website.
(Aws.amazon.com, 2014)
Figure 1.8: Example of a text file returned from AWS.

Word counting Issues
The drawback of this script is that it counts every occurrence of a keyword, so
a tweet that mentions a keyword twice is counted twice. The resulting tweet
counts are therefore inflated.
Figure 1.9: Example of a tweet with Apple mentioned twice in Text Wrangler.
(Mac App Store, 2014)
What was needed was a way to count the number of tweets that mentioned each
keyword. A single tweet could contain all three keywords (Apple, Microsoft and
Tesla) or only one of them.
Text Wrangler was used to search the individual text files for tweets
containing each keyword separately, but it had the same problem of counting the
number of times the word occurred.
Figure 1.10: Example of tweets from Monday with Tesla mentioned, 3866
occurrences in Text Wrangler. (Mac App Store, 2014)
For this reason the analysis results contain some error: tweets that mention a
keyword more than once contribute extra counts.
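The difference between the two kinds of count can be illustrated with a short Python sketch. It assumes one tweet per line, as in the saved text files; the function name and the three sample tweets are made up for illustration and are not taken from the collected data.

```python
import re


def count_tweets_and_mentions(lines, keyword):
    """Return (tweets containing the keyword, total keyword occurrences)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    tweets = 0
    mentions = 0
    for line in lines:
        found = len(pattern.findall(line))
        if found:
            tweets += 1  # each tweet is counted once, however often it repeats the word
        mentions += found
    return tweets, mentions


# Illustrative tweets only.
sample = [
    "Apple unveils a new Apple store",
    "Tesla and Apple both rally",
    "Microsoft ships an update",
]
print(count_tweets_and_mentions(sample, "Apple"))
```

The first value is the tweet count that the analysis needed; the second is the occurrence count that the word count script and Text Wrangler produced.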
Date         Apple    AAPL   Microsoft   MSFT   Tesla   TSLA
07/04/2014   71913    1001   36417       521    3866    281
08/04/2014   118077   950    47925       613    4600    395
09/04/2014   81840    1100   24084       437    3113    301
10/04/2014   63983    1483   19521       435    3204    447
11/04/2014   62755    1145   18146       343    2140    347
Figure 1.11: Displays the key words and their occurrences per day.
The original key words were Apple, Microsoft and Tesla. I decided to also search
for each company's NASDAQ symbol. In previous research into Twitter mining and
stock prediction, researchers searched for the company symbols, as this returns
a more accurate count of tweets in which people are discussing the actual stock
of the company.
Gathering of Stock Price Data
Once the Twitter feeds had been gathered, the financial data could be
downloaded. The historical stock prices had to cover the same dates as the
Twitter feeds. The data was downloaded in Excel format and then saved as a CSV
file for analysis in R.
Historical stock prices from Yahoo Finance are only available at daily
resolution; finer-grained data would have to be streamed directly from the
NASDAQ website, which I did not have access to.
Ideally, hourly stock prices would have been used so that the price series could
be matched more closely with the Twitter feeds.
Data sets of stock prices were collected from the Yahoo Finance website for all
three companies.
Each set had seven columns: Date, Open, High, Low, Close, Volume and Adjusted
Close.
Date is the day of trading.
Open is the opening price of the stock at the start of the day's trading.
High is the highest price of the stock on that day.
Low is the lowest price of the stock on that day.
Close is the closing price of the stock at the end of the day's trading.
Volume is the number of shares traded that day.
Adjusted Close is the closing price adjusted for corporate actions such as
dividends and stock splits.
Figure 1.6: Demonstrates the acquired historical Apple stock prices for the
month of April 2014 from the Yahoo Finance website. (Finance.yahoo.com, 2014)
The closing price is the value on which this analysis focused.
Data Preparation
The cleaned Twitter counts were combined with the cleaned financial data in
Excel.
Date         Open     High     Low      Close    Volume     Adj Close   Apple    AAPL
2014-04-11   519      522.83   517.14   519.61   9704200    516.72      62755    1145
2014-04-10   530.68   532.24   523.17   523.48   8559000    520.57      63983    1483
2014-04-09   522.64   530.49   522.02   530.32   7363200    527.37      81840    1100
2014-04-08   525.19   526.12   518.7    523.44   8710300    520.53      118077   950
2014-04-07   528.02   530.9    521.89   523.47   10351800   520.56      71913    1001
Figure 4.2: Displays the key words and their occurrences per day with the stock
prices for Apple.
This was repeated for all three companies.
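The same combination step could be done in code rather than by hand in Excel. The sketch below is an assumption about how that might look, not the procedure used in this project: the function name is hypothetical, and the values are the Apple Close prices and keyword counts from the table above.

```python
# Close prices and Apple keyword counts copied from the table above.
closes = {
    "2014-04-07": 523.47,
    "2014-04-08": 523.44,
    "2014-04-09": 530.32,
    "2014-04-10": 523.48,
    "2014-04-11": 519.61,
}
apple_counts = {
    "2014-04-07": 71913,
    "2014-04-08": 118077,
    "2014-04-09": 81840,
    "2014-04-10": 63983,
    "2014-04-11": 62755,
}


def merge_by_date(prices, counts):
    """Return (date, price, count) rows for dates present in both inputs, oldest first."""
    shared = sorted(set(prices) & set(counts))
    return [(day, prices[day], counts[day]) for day in shared]


rows = merge_by_date(closes, apple_counts)
for row in rows:
    print(row)
```

Joining on the date key guards against a count being placed beside the wrong trading day, which is easy to do when the price file is sorted newest first and the counts oldest first.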
Requirements
The requirements have remained mostly the same as in the original
Requirements Specification, except that live Twitter data was used rather than
historical Twitter data. Historical Twitter data proved to be impracticable
because it had to be purchased and the project had no budget.
Data requirements
DR#   Category             Description
DR1   Use of Information   The information produced must be of use to the user.
DR2   Availability         Information generated must not be previously available to the user.
DR3   Access               The user must have access to this information.
MoSCoW status: M
User requirements
UR#   Category           MoSCoW status   Description
UR1   Analysis outcome   M               The analysis will provide Apple, Microsoft and Tesla with a
                                         better insight into the effectiveness of their advertising
                                         campaign strategy from data acquired from the Twitter feeds
                                         and stock market.
UR2   User outcome       H               This information must be of assistance to these companies.
Usability requirements
Functional Requirements
FR#   Category         Description
FR1   Acquire Data 1   The project will gather and store all necessary data from live
                       Twitter feeds using JAVA scripts in conjunction with the Twitter
                       API.
FR2   Acquire Data 2   The project will gather and store all necessary historical stock
                       market data regarding the brand, corresponding to the dates of
                       the Twitter data, acquired from the Yahoo Finance website.
FR3   Clean Data 1     The correct programs will be acquired and used to clean and
                       retrieve Twitter data regarding key words and hash tags of
                       the brand on certain dates.
FR4   Clean Data 2     Retrieve historical stock market share prices regarding
                       the brand over the same time series as the Twitter feeds data.
FR5   Analyse 1        The cleaned Twitter data is then analysed and compared.
FR6   Analyse 2        The cleaned stock market data is then analysed and
                       compared.
FR7   Publish Data     The analysis will then be published and available to the
                       customer.
MoSCoW status: M, M, H, H
Testing and Evaluation

Systems Testing.
Correlation
The correlation coefficient measures the strength of the linear relationship
between two variables. It is also known as the Pearson Product-Moment
Correlation Coefficient.
Correlation values lie on a scale from -1 to +1.
+1 for a very strong positive relationship.
-1 for a very strong negative relationship.
Regression
Regression is used to estimate or predict the relationship between one
quantitative variable and another quantitative variable.
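For a single predictor, the least-squares line and its R-squared can be worked out directly. The Python sketch below is only an illustration of the calculation (the analysis itself was done with lm in R); the function name is hypothetical, and the sample values are the Apple keyword counts and Close prices for 07/04/2014 to 11/04/2014 from the Data Preparation section.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = intercept + slope * xs; returns (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


apple = [71913, 118077, 81840, 63983, 62755]
close = [523.47, 523.44, 530.32, 523.48, 519.61]

# Fit in both directions: Close explained by count, and count explained by Close.
_, _, r2_close_on_count = fit_line(apple, close)
_, _, r2_count_on_close = fit_line(close, apple)
print(r2_close_on_count, r2_count_on_close)
```

With one predictor the R-squared is the same whichever variable is treated as the response, so it measures the strength of the linear association rather than its direction.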
Granger Causality
Granger Causality is a statistical hypothesis test for determining whether one
time series is useful in forecasting another.
Steps in testing stage
1. Check for correlation in R studio.
2. Compose a regression model.
3. Use the Granger Causality test to check whether one time series is useful
for forecasting another.
4. Shift the time series to adjust for lag.
5. Use Excel and R studio to visualize and confirm any relation.
Data sets.
The data sets used are the keyword counts returned from AWS for Apple,
Microsoft and Tesla.
As an additional investigation, the counts of each company's NASDAQ symbol
(AAPL, MSFT and TSLA) within those tweets are also used.
Apple Stock
1. Check for correlation
Figure 4.3: Displays the file AprilAAPL imported into R studio.

First the data is imported into R studio.
Figure 4.4: Displays the correlation output in R.

The correlation result shows a weak positive relation of 0.223 between Close
and the counts of the keyword Apple.
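The coefficient can be checked by hand from its definition. The following Python sketch is an illustration only (the reported value came from R); the function name is hypothetical, and the inputs are the five daily Apple keyword counts and Close prices from the Data Preparation table, in date order.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


apple = [71913, 118077, 81840, 63983, 62755]      # 07/04/2014 to 11/04/2014
close = [523.47, 523.44, 530.32, 523.48, 519.61]  # same dates
r = pearson_r(apple, close)
print(round(r, 3))
```

For these five days the calculation gives a value of about 0.223, in line with the R output described above.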
2. Regression Model
Figure 4.5: Displays the regression model output in R.

lm(formula = Apple ~ Close, data = AprilAAPL)
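Does the Apple tweet count have an effect on the Close price?
From the Multiple R-squared it is possible to see that the regression model
returned a poor result, with only 4.8% of the variation explained. Note that
the formula as written treats the Apple count as the response and Close as the
predictor; with a single predictor the Multiple R-squared is the same in either
direction.
The process was repeated for the AAPL count.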
Figure 4.6 Displays the regression model output in R.

lm(formula = AAPL ~ Close, data = AprilAAPL)
Does the AAPL tweet count have an effect on the Close price?
The regression model returned a similarly poor result, with only 0.07%
explaining Close price.
3. Granger Causality Test
Close is the dependent variable and Apple is the independent variable.
Does the Apple count help to predict Close?
In other words, does Apple Granger-cause Close?
Figure 4.7 Displays Granger Causality Test output in R for Closing price and
Apple word count.
From the result above, with a one-day lag the p-value is 0.7057.
This is greater than the significance level of 5%. Therefore the null
hypothesis cannot be rejected, meaning the Apple word count does not predict the
closing price one day later.
Figure 4.8 Displays Granger Causality Test output in R for Closing price and
AAPL word count.
A similar test was performed using the keyword AAPL as the independent variable
and Close as the dependent variable. The result was slightly better but still
did not indicate Granger causality: the p-value of 24% is greater than 5%.
Because the data set was so small, a lag of two days could not be tested.
Figure 4.9 Displays Granger Causality Test unsuccessful outputs.
The above image shows the unsuccessful outputs of the Granger Causality test
when using a lag of more than one day. The error occurs because the data set
was too small.
4. Visualization.
Figure 4.1.1 demonstrates the relationship between the Apple count and Close price.
From the above graph it is possible to see the positive relationship that the
keyword Apple has with the Close price of Apple stock. As the Apple Count rises
there is a rise in the closing stock price.
Figure 4.1.2 demonstrates the relationship between the AAPL count and Close price.
From the above graph it is possible to see the negative relationship that the
keyword AAPL has with the Close price of Apple stock. As the AAPL count rises
there is a decline in the closing stock price. This is consistent with the poor
results from the correlation and regression models. AAPL was not a key word in
the JAVA script; it was counted by searching within the tweets collected for the
key word Apple.
[Chart: Apple count and Close Price. Close price and Apple count plotted by date, 2014-04-07 to 2014-04-11; series: Close, Apple.]
Figure 4.1.3 demonstrates the relationship between the Apple count and Close price.
As the chart above shows, the Close price line follows a similar trend to the
Apple count line, about a day later.
[Chart: AAPL count and Close Price. Close price and AAPL count plotted by date, 2014-04-07 to 2014-04-11; series: Close, AAPL.]
Figure 4.1.4 demonstrates the relationship between the AAPL count and Close price.
Unfortunately the above chart shows that the Close price did not follow the
AAPL count; instead, the AAPL word count appears to follow the Close price.
This is probably the reason the correlation between the two was so low; it also
suggests that the investor community that uses the keyword AAPL (the Apple
stock symbol) is discussing the rise in Apple stock.
Microsoft Stock
The process was started again this time using the Microsoft data set.
1. Check for correlation
Figure 4.1.5 demonstrates the correlation between Microsoft and MSFT word count and
Close price.
The correlation this time is much better, with both keywords returning a
moderate correlation with Close price.
2. Regression Model
Figure 4.1.6 displays the regression model with Microsoft word count as the
independent variable.
Figure 4.1.7 displays the regression model with MSFT word count as the independent
variable.
Figures 4.1.6 and 4.1.7 show the two regression outputs from R with Close
stock price as the dependent variable.
Figure 4.1.6 displays a Multiple R-squared value of 0.96% explaining Close price.
Figure 4.1.7 displays a Multiple R-squared value of 12.6% explaining Close price.
The normality plot

If the residuals fall in a straight line that means the normality condition is met.
Figure 4.1.8 demonstrates Normality plot of Microsoft and Close price. Normality
condition is met.
Figure 4.1.9 demonstrates Normality plot of MSFT and Close price. Normality condition
is met.
3. Granger Causality Test
Figure 4.2.1 displays the Granger Causality.
Again the Granger Causality test could not be run with a lag greater than one
day. Both tests returned p-values greater than the significance level of 5%.
4. Visualization
Figure 4.2.2 demonstrates the relationship between the Microsoft count and Close price.
Figure 4.2.3 demonstrates the relationship between the MSFT count and Close price.
[Chart: Microsoft and Close Price. Close price and Microsoft count plotted by date, 4/7/14 to 4/11/14; series: Close, Microsoft.]
Figure 4.2.4 demonstrates the relationship between the Microsoft count and Close price
on a line chart.
As the chart above shows, the Close price line follows a similar trend to the
Microsoft count line, about a day later.
[Chart: MSFT and Close Price. Close price and MSFT count plotted by date, 4/7/14 to 4/11/14; series: Close, MSFT.]
Figure 4.2.5 demonstrates the relationship between the MSFT count and Close price on a
line chart.
Previous results with a one-day lag.
[Chart: Microsoft and Close Price with 1 day lag. Close price and Microsoft count plotted by date, 4/8/14 to 4/11/14; series: Close, Microsoft.]
Figure 4.2.6 demonstrates the relationship between the Microsoft count and Close price
on a line chart with a one-day lag.
[Chart: MSFT and Close Price with 1 day lag. Close price and MSFT count plotted by date, 4/8/14 to 4/11/14; series: Close, MSFT.]
Figure 4.2.7 demonstrates the relationship between the MSFT count and Close price on a
line chart with a one-day lag.
The decision was made to perform a manual lag in Excel by moving the dates of
the Microsoft count forward by one day to see if the lines in the chart match
up.
With this lag, each day's tweet count for Microsoft is aligned with the Closing
price of the following day.
The two graphs show that visually there is a relationship between the word
counts and the Close stock price.
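The manual lag amounts to pairing each day's count with a later day's Close price. A minimal Python sketch of that pairing follows; the function name is hypothetical, and because the Microsoft Close prices appear only in the charts, the Apple counts and Close prices from the Data Preparation table are used as sample input.

```python
def lag_pairs(counts, closes, lag=1):
    """Pair the count on day t with the close on day t + lag (both lists in date order)."""
    if lag < 0 or lag >= len(counts) or len(counts) != len(closes):
        raise ValueError("lag must be non-negative and shorter than the series")
    return list(zip(counts[: len(counts) - lag], closes[lag:]))


apple = [71913, 118077, 81840, 63983, 62755]      # 07/04/2014 to 11/04/2014
close = [523.47, 523.44, 530.32, 523.48, 519.61]  # same dates
pairs = lag_pairs(apple, close, lag=1)
print(pairs)
```

Each day of lag removes one pair, so five days of data leave only four points at a one-day lag and three at a two-day lag, which is why longer lags could not be examined reliably with this data set.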
A correlation and regression model was built again using the lagged data.
1. Correlation
Close price with a lag of one day.
The correlation model in figure 4.2.8 showed a strong correlation with the two
word counts, so a regression model was produced.
2. Regression Model
Figure 4.2.9 displays the regression model with Microsoft word count as the
independent variable using data with a one-day lag.
Figure 4.3.1 displays the regression model with MSFT word count as the independent
variable with data of one-day lag.
The two regression models returned a high Multiple R-squared value of 98%
explaining Close price.
The high correlation and regression results suggest that there is a relation
between the tweet counts and the closing stock price. However, the results are
very high, most likely because of the very small data set that was used.
Tesla Stock
The process was started again this time using the Tesla data set.
Correlation and regression were performed, with results similar to those from
the previous data sets.
Close price.
Close price with a one-day lag.
The keyword Tesla showed a strong correlation with the Tesla closing stock
price from the lagged data set. TSLA still displayed a moderate correlation.
Figure 4.3.3 displays the regression model with Tesla word count as the independent
variable using data with a one-day lag.
Again the regression with the lagged data set showed a huge improvement over
the non-lagged Tesla data.
[Chart: Tesla Count and Close Price. Close price and Tesla count plotted by date, 4/7/14 to 4/11/14; series: Close, Tesla.]
Figure 4.3.4 demonstrates the relationship between the Tesla word count and Close
price on a line chart.
[Chart: Tesla Count and Close Price with one day lag. Close price and Tesla count plotted by date, 4/8/14 to 4/11/14; series: Close, Tesla.]
Figure 4.3.5 demonstrates the relationship between the Tesla word count and Close
price on a line chart with a one-day lag.
Figures 4.3.4 and 4.3.5 demonstrate the difference between the non-lagged and
the lagged data sets. Figure 4.3.5 demonstrates that the one-day lag does make
a difference to the results. It shows the close relationship that the Tesla
count has with the Close price.
Formula For Predicting Stock Movement

An attempt was made to create a formula for commercial use. The small data set
limited this work, since a lag of two to three days was desired. Previous
research, Stock Market Prediction using Twitter, found that tweets predict
stock movement two to three days after the message is tweeted.
Knowing the tweet volumes of a company for two consecutive days, the percentage
movement in tweets between those two days should in turn allow us to predict
the movement in the company share price with a two or three day lag.
Formula Used
The percentage difference between two numbers
(| V1 - V2 | / ((V1 + V2)/2)) * 100
V1 = total company tweets on day one.
V2 = total company tweets on day two.
The formula was used to find the percentage difference between the stock
movement and the tweet movement.
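The formula above can be written as a small Python function. For comparison the sketch also includes the change relative to the first day's value, because the day-to-day figures tabulated in this section (for example 31.60% for Microsoft tweets, from 36417 on 07/04/2014 to 47925 on 08/04/2014) correspond to that second calculation. Both function names are hypothetical.

```python
def symmetric_percentage_difference(v1, v2):
    """(|V1 - V2| / ((V1 + V2) / 2)) * 100, the formula as stated above."""
    return abs(v1 - v2) / ((v1 + v2) / 2) * 100


def percentage_change(v1, v2):
    """Change from v1 to v2 as a percentage of v1 (signed)."""
    return (v2 - v1) / v1 * 100


# Microsoft keyword counts for 07/04/2014 and 08/04/2014.
v1, v2 = 36417, 47925
print(symmetric_percentage_difference(v1, v2))
print(percentage_change(v1, v2))
```

The two measures diverge as the gap between the two values grows, so it matters which one is applied consistently to both the tweet series and the price series.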
Apple Stock Prediction
To save time the focus is only on the key word count of Apple.
Calculate the percentage difference of Apple Tweets And Closing Price
            Difference in Apple Stock      Difference in Tweet Activity
Day One     -5.73099E-05    0.005%         0.019568162     1.96%
Day Two      0.013143818    1.31%          0.279089758    27.91%
Day Three   -0.012897873    1.29%          0.442778592    44.28%
Day Four    -0.007392833    0.73%         -0.390965218    39.09%
Figure 4.3.6 demonstrates difference in Stock Close price and Tweet activity between
days.
If the movement were not identical in percentage increase or decrease, the
formula would need to be adjusted. The movement in Tweet Activity was not
proportionate to the stock movement (that is, not pro rata).
Figure 4.3.7 demonstrates the formula for predicting the third day using Close stock
values.
Example of the formula process
Subtract the tweets of Day 1 from the tweets of Day 2.
The tweet volume has an increase of 1228 tweets, which represents a
1.9568% increase.
The Apple closing stock price of Day 1 is $523.47.
Multiply it by 1.9568%.
This projects an increase of $10.24.
Add this to the Day 1 share price:
(523.47 + 10.24) = $533.71
Closing price of Day 3 = $530.32
The formula projects a closing price of $533.71 against an actual closing price
of $530.32.
The difference between the projected and actual price is $3.39.
This represents a variance of 0.639%.
The formula used here is a straight line (1:1 ratio)

The Apple share price increases at the same rate as the Twitter feeds, within
an error level of just 0.639%.
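The worked example can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function names are hypothetical, and the two tweet counts are the values from the keyword table that differ by 1228 (62755 and 63983).

```python
def project_close(base_close, tweets_before, tweets_after):
    """Scale the base closing price by the relative change in tweet volume (1:1 ratio)."""
    change = (tweets_after - tweets_before) / tweets_before
    return base_close * (1 + change)


def variance_percent(projected, actual):
    """Absolute gap between projected and actual price, as a percentage of the actual price."""
    return abs(projected - actual) / actual * 100


projected = project_close(523.47, 62755, 63983)  # Day 1 close and the two tweet counts
error = variance_percent(projected, 530.32)      # actual Day 3 close
print(round(projected, 2), round(error, 3))
```

Because the projection is a pure 1:1 scaling, any day on which tweets move by tens of percent while the price moves by one or two percent will produce a large error, which is the pattern seen for the later days.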
Figure 4.3.8 demonstrates the formula for predicting the fourth day using Close
stock values.
The process was repeated, this time using values to predict the fourth day.
Unfortunately an error of 27.904% was returned.
Figure 4.3.9 demonstrates the formula for predicting the fifth day using Close
stock values.
The process was repeated, this time using values to predict the fifth day.
Unfortunately an error of 47.25% was returned. The formula did not apply to the
days after the third.
Calculate the percentage difference of Apple Tweets And Low Price
Figure 4.4.1 demonstrates the formula for predicting the third, fourth and
fifth days using Low Stock values.
The formula was also applied to the Low stock price to see if there was a
relation.
The best result was the prediction for the third day, with an error of
1.89%.
Calculate the percentage difference of Apple Tweets And Volume

The use of Volume in the formula was also measured.
Figure 4.4.2 demonstrates the formula for predicting the third day using the volume
values.
However this too had a high error rate of 30.23%.
Microsoft Stock Prediction

Calculate the percentage difference of Microsoft Tweets And Closing Price
            Difference in Stock            Difference in Tweet Activity
Day One      0.000502513    0.05%          0.316006261    31.60%
Day Two      0.016323456    1.63%         -0.497464789    49.74%
Day Three   -0.027427724    2.74%         -0.189461883    18.94%
Day Four    -0.003810976    0.38%         -0.070436965     7.04%
Difference in Stock Close price and Tweet activity between days.
Projecting closing stock price Day 3
Figure 4.4.4 demonstrates the formula for predicting the third, fourth and
fifth days using the Close stock values.
The formula returned a high variance for all projected days.
This shows that the formula does not apply to any of these days using Close
Stock.
Calculate the percentage difference of Microsoft Tweets And Low Price

Also considered was the formula used with the Low stock price to see if there
was a relation.
Tweets day1 - day2: 11508
Low stock of day 1 * difference of tweets day1 and day 2: 12.5580888
Stock low price day 1 + low stock of day 1 * difference of tweets day1 and day 2: 52.2980888
Low price of Day3 - projected low price day 3: -12.5580888
Difference between projected low day 3 and actual day 3 as a variance: 0.237448234 (23.74%)
Figure 4.4.7 demonstrates the formula for predicting the third day using the Low stock
values.
Again the formula showed that it did not apply to the Low Stock price.
Calculate the percentage difference of Microsoft Tweets And Volume
Figure 4.4.7 demonstrates the formula for predicting the third day using the Volume
values.
The Volume data was placed into the formula but the result shown above has a
high error rate of 44.5%.
Tesla Stock Prediction

Calculate the percentage difference of Tesla Tweets And Closing Price
            Difference in Stock             Difference in Tweet Activity
                            %                               %
Day One      0.002007934    0.200793379     0.189860321    18.98603207
Day Two      0.027922269    2.792226911    -0.32326087     32.32608696
Day Three   -0.02110152     2.110151951     0.029232252     2.923225185
Day Four     0.026816564    2.681656439     0.332084894    33.20848939
Difference in Stock Close price and Tweet activity between days.
Figure 4.4.9 demonstrates the formula for predicting the third, fourth and
fifth days using the Close stock values.
The formula had high percentage errors except for the prediction for the fifth
day, which had an error of 2.33%.
Tweets day1 - day2: -734
Low stock of day 1 * difference of tweets day1 and day 2: 38.69163476
Stock low price day 1 + low stock of day 1 * difference of tweets day1 and day 2: 242.4816348
Low price of Day3 - projected low price day 3: 48.07163476
Difference between projected low day 3 and actual day 3 as a variance: 0.198248559 (19.82485594%)
Figure 4.5.1 demonstrates the formula for predicting the day using the Low stock values.
Tweets day1 - day2: -734
Volume day 1 * difference of tweets day1 and day 2: 1369177.703
Volume day 1 + Volume day 1 * difference of tweets day1 and day 2: 8580677.703
Volume Day3 - projected Volume day 3: 877677.7031
Difference between projected Volume day 3 and actual day 3 as a variance: 0.102285359 (10.22853594%)
Figure 4.4.9 demonstrates the formula for predicting the third day using the Volume
values.
When the Low Stock and Volume values were placed into the formula they also
displayed high errors. Low Stock had an error of over 19% and the Volume
values had an error over 10%.
Conclusion
This analysis investigated the relation between Twitter activity and the share
prices of three NASDAQ-listed companies over a period of one week. A Java
script and the Twitter API were used to collect the tweets that mentioned the
keywords Apple, Microsoft and Tesla. Once the tweets were collected, a Python
script was run on Amazon Web Services to count the frequency of the words.
AWS was used because of the size of the tweet files, which were in text format
and ranged from 60 to 130 megabytes.
Text Wrangler was also used to count the frequency of tweets with the
keywords. Since one of the data sets was missing five hours of data due to a
program failure, it was decided to use only tweets from NASDAQ trading hours.
Stock data for the three companies was acquired from the Yahoo Finance
website.
Similarly, the occurrences of each company's NASDAQ symbol were counted and
used as an additional analysis. The symbols gave the opportunity to investigate
conversations directed at the actual company stock on the NASDAQ.
Analysis was performed in R studio, first using a correlation model to see how
strong a relation the tweet data had with the stock data of each company.
A linear regression model was then used to see the effect that the Twitter
data had on the stock data.
Granger Causality was tested to discover whether one of the time series helped
to predict the other at a lag measured in days. Since the data set was so
small, only a lag of one day could be tested. It returned p-values above the 5%
significance level, so the null hypothesis could not be rejected.
During visualization of the data using line graphs it was noted that there
seemed to be a relation in which the stock data followed a similar trend one
day after the tweet data. A manual lag was performed in Excel by moving the
tweet data time series forward by one day. This supported the existence of a
trend. Subsequently a correlation model was created in R studio, and the
results exhibited a strong correlation of 0.9 and over.
The creation of a formula for commercial use was attempted. The first formula
was used to find the percentage difference between the stock movement and the
tweet movement. On average there was a difference between the movement of
the stocks and the movement of the tweets.
Another formula was created to predict the close share price. Knowing the
Twitter volumes of a company for two consecutive days, the percentage
movement of tweets between those two days should in turn allow us to predict
the movement in the company share price three days later.
The formula used is a straight line (1:1 ratio).
When predicting the third day for the Apple share price, an error level of just
0.639% was returned.
This meant that the close share price increased at the same rate as the Twitter
feeds for the key word Apple, within an error level of 0.639%.
Disappointingly, the other days predicted for the Apple Close stock price were
not as accurate, returning error rates of 27.9% and 47.25%. This pattern
continued throughout the analysis of the closing price for the Microsoft and
Tesla stock.
The formula was slightly altered to accommodate the use of other variables such
as Low stock price and Volume. Again the errors were high for each one.
The main issue here is that the data set is not large enough for this form
of analysis. When acquiring the data, only tweets specifically regarding the
stock of the company should have been collected. A company on Twitter is
competing for public interest, while on the stock exchange it is competing for
capital interest. In that respect some of the tweets gathered in this analysis
are noisy data.
Further Development
Further development of the project would include extracting tweets and stock
data over a longer period of time. This would give the Granger Causality
test enough data to produce a more reliable result.
The tweets need to be selected from a niche community, preferably the
investor community who communicate through Twitter about the stocks of
companies. Tweets that mention the company symbols and the word stock
should be gathered using those keywords.
Narrowing down the selection of companies and focusing on one would
help to reduce the number of discrepancies in the tweet count.
A script should be developed to count the lines in which a word appears,
without counting the word again if it is mentioned more than once in a
tweet.
A formula could be developed that takes account of other variables that
cause movement in stock, such as the release of company financial reports,
takeover rumours, mergers or bad publicity.
Sentiment analysis of the tweets would provide a more accurate result from
the data. Analysing Twitter activity alone will not provide the analysis
with any information about the behavioural attitudes of investors.
Sentiment analysis would also provide a better insight into the public
attitude.
Bibliography
Aws.amazon.com, (2014). Word Count Example : Articles & Tutorials : Amazon
Web Services. [online] Available at: http://aws.amazon.com/articles/2273
(Accessed 22 May. 2014).
Bollen, J. and Mao, H. (2011) 'Twitter mood as a stock market predictor'
Computer.
Datasift.com, (2014). Power Decisions With Social Data | DataSift. [online]
Available at: http://datasift.com (Accessed 24 May. 2014).
Dev.twitter.com, (2014). Twitter Developers. [online] Available at:
https://dev.twitter.com (Accessed 22 May. 2014).
Finance.yahoo.com, (2014). AAPL Historical Prices | Apple Inc. Stock - Yahoo!
Finance. [online] Available at:
http://finance.yahoo.com/q/hp?s=AAPL&a=03&b=01&c=2014&d=03&e=30&f=2014&g=d
(Accessed 22 May. 2014).
Mac App Store, (2014). TextWrangler. [online] Available at:
https://itunes.apple.com/ie/app/textwrangler/id404010395?mt=12 (Accessed
22 May. 2014).
Mittal, A. and Goel, A. (2012) 'Stock prediction using Twitter sentiment analysis'
Stanford University, CS229. Available at:
http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf
Simsek, M. and Ozdemir, S. (2012) 'Analysis of the relation between Turkish
twitter messages and stock market index'.
Ucd.ie, (2014). CeADAR. [online] Available at: http://www.ucd.ie/ceadar/
Ucd.ie, (2014). Brian Mac Namee | CeADAR. [online] Available at:
http://www.ucd.ie/ceadar/people/principalinvestigators/brianmacnamee/
Appendix
Project Materials:
https://drive.google.com/folderview?id=0B4pkBIaL1W7CQzVVakgwQ3psNFk&usp=sharing
Project Proposal
Introduction
The purpose of this project is to study and analyse the activities and trends
associated with the Mobile World Congress 2014, which is being held from the
24th to the 27th of February 2014.
The Mobile World Congress is the world's largest exhibition of the mobile
industry. Mobile operators, device manufacturers and technology providers are
all represented at the exhibition.
With a large number of manufacturers attending and many product launches, the
subject can be quite broad.
The objective of this project is to analyse Twitter feeds for activities and
trends associated with the top mobile manufacturers before, during and after
the event, and to see how their stock market shares are connected to and
affected by the Twitter feeds.
Background
As Twitter matures, top brands have realized just how relevant Twitter can be as
a marketing and engagement platform.
According to Useful Social Media 98% of the top brands are on Twitter and 92%
of top brands tweet daily. There are 230 million active users on Twitter; this
provides brands with a global presence. (USM) 92% of top brands Tweet at
least once daily as audiences grow. Study shows Twitter's maturity as a
marketing and engagement platform. 98% of all top brands are active on Twitter.
The social network has matured into a valuable and necessary channel for
marketing organizations. (Usefulsocialmedia.com, 2014)
Releases such as the Samsung Galaxy s5 will hopefully see a surge of Twitter
activity in relation to Samsung during the event. According to Trusted Reviews
the release of the Samsung Galaxy s5 will take place during the event. (Trusted
Reviews) The Samsung Galaxy S5 release date looks set to be held in a matter of
days as the Korean manufacturer issues invites to a February 24 launch event,
kicking Samsung Galaxy S5 rumours into overdrive. (Trusted Reviews, 2014)
Using the data from the Twitter feeds I can then analyse them against the stock
market shares.
According to Mac Rumours, Samsung has the biggest phone market share with
Apple in second place. (Mac Rumours) Apple Continues to Lose Smartphone
Share, Gain Mobile Phone Share in 4Q 2013 (Mac Rumours, 2014)
Similar research has been done on Twitter feeds influencing stock market
shares, but this project will focus mainly on the Mobile World Congress in
relation to the stock market shares of the top five mobile device manufacturers.
Technical Approach
This objective will be achieved by:
Creating the necessary Python code to use with the Twitter API for
retrieving the data.
Gathering all data created on Twitter related to the mobile device brands
before, during and after the event.
Gathering the stock market share prices of the mobile device brands before,
during and after the event.
Cleaning all data gathered for analysis.
Analysing the Twitter activity data gathered against the stock market
share prices.
Returning the results of the analysis.
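One way the analysis step above could be approached is to measure how closely daily Twitter activity and daily share prices move together. The following is a minimal sketch only, assuming a Pearson correlation is used (the proposal does not fix the statistic); the function name `pearson` and the sample values are hypothetical and are not project data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical illustration values only (not data gathered by the project).
daily_tweet_counts = [120, 150, 310, 280, 140]
daily_closing_prices = [10.0, 10.5, 12.1, 11.8, 10.4]
r = pearson(daily_tweet_counts, daily_closing_prices)
```

A value of `r` near 1 or -1 would suggest that the two series move together or in opposition; it would not by itself show that the tweets affect the share price.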
Special Resources Required

Books to be used:
Python for Data Analysis by McKinney, W. (2013)
Twitter API: Up and Running: Learn How to Build Applications with the
Twitter API by Kevin Makice. (2009)
Writing Your Dissertation by Swetnam, D. & Swetnam, R. (2000).
Software to be used:
Python
RStudio
MySQL
Microsoft Excel
Microsoft Project
Twitter API
System storage to be used:

Twitter API
At this stage of the project I am unaware of the amount of data that I will
accumulate from Twitter.
Project Plan
Technical Details
The data will be retrieved using Python code.
R and Microsoft Excel will then be used to analyse the data.
Systems/Datasets
All datasets will be collected by myself using the online Twitter API, with
Python code collecting specific words and hash tags from the tweets during
the event's operating hours each day.
Evaluation/Test and Analysis

I am unable to state how I will test the data because we have had only
one class of Data and Web Mining so far, but I can list the types of analysis
that we will be learning.
Classification
Regression (value estimation)
Similarity matching
Clustering
Co-occurrence grouping (frequent itemset mining)

Profiling (behaviour description)
Link Prediction
Data reduction
Causal modelling
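As an illustration of the regression (value estimation) item in the list above, the sketch below fits a straight line by ordinary least squares. It is an assumption about how such an analysis might look, not the method the project committed to; the name `fit_line` and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x and co-variation of x with y around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when every x value is the same")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: tweet volume (in hundreds) against closing price.
tweet_volume = [0, 1, 2, 3]
closing_price = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(tweet_volume, closing_price)
estimated_price = intercept + slope * 4  # value estimation for an unseen volume
```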
Consultation with Specialization Persons

John O'Connor, CEO of Wellclever.
Wellclever is a startup company that provides media groups and content
producers with keyword contextual online advertising solutions.
Consulted with John for project ideas. John has over 20 years of experience in the
advertising industry.
(Wellclever, 2014)iv
Oisin Creaner, coordinator of the project for NCI.
Spoke to Oisin about project ideas involving the use of Twitter APIs.
Requirements Specification
Document Control
Revision History
Date         Version   Scope of Activity   Prepared   Reviewed   Approved
20/02/2014   1         Create              RC         X          X
23/02/2014   2         Update              RC         X          X
24/02/2014   3         Update              RC         X          X
Distribution List
Name            Title                                Version
Oisin Creaner   Lecturer
Samsung         Customer
Robert Coyle    BA
Robert Coyle    System Developer
Robert Coyle    Statistician
Robert Coyle    Tester
Robert Coyle    Advertising and Marketing Division
Related Documents
Title
Proposal Document
Comments
1 Introduction
1.1 Purpose
The purpose of this project is to study and analyze the activities and trends
associated with a brand's advertising campaign. The objective of this project is to
analyze Twitter feeds for activities and trends associated with the brand before,
during and after their advertising campaign and to see how their stock market
shares are connected to and affected by the Twitter feeds.
The intended customers are the brands themselves and their marketing and PR teams.
As Twitter matures, top brands have realized just how relevant Twitter can be as
a marketing and engagement platform.
According to Useful Social Media, 98% of the top brands are on Twitter and 92%
of top brands tweet daily. There are 230 million active users on Twitter; this
provides brands with a global presence. (USM) "92% of top brands Tweet at
least once daily as audiences grow. Study shows Twitter's maturity as a
marketing and engagement platform. 98% of all top brands are active on Twitter.
The social network has matured into a valuable and necessary channel for
marketing organizations." (Usefulsocialmedia.com, 2014)v
1.2 Project Scope

This analysis will compare different advertising campaigns run by a brand on
the release of a new or updated product and how they differ from one another. It
will also look at how a brand's advertising campaign affects their stock market
share prices.
I will be using historic Twitter feeds and historic stock market share prices.
The project will look at an individual brand such as Samsung and acquire the
necessary Twitter feeds associated with Samsung. Using the correct programs
and scripts, the project should gather any mentions of Samsung in the tweets,
including hash tags.
The data will include the time series of the tweets, which can then be matched
to the time series of the stock market data.
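The matching of the two time series could be sketched as below: tweets are counted per calendar day and each trading day's closing price is paired with that day's count. This is a minimal illustration under stated assumptions: the function names and sample values are hypothetical, the stock data is assumed to be one closing price per trading day, and tweet timestamps are assumed to be already expressed in the exchange's time zone.

```python
from collections import Counter
from datetime import date, datetime


def daily_tweet_counts(timestamps):
    """Count tweets per calendar day from a list of datetime objects."""
    return Counter(ts.date() for ts in timestamps)


def align_with_prices(tweet_counts, closing_prices):
    """Pair each trading day's closing price with that day's tweet count.

    Days with no closing price (weekends, holidays) are left out;
    trading days with no tweets get a count of 0.
    """
    return [
        (day, tweet_counts.get(day, 0), price)
        for day, price in sorted(closing_prices.items())
    ]


# Hypothetical sample data (not gathered by the project).
timestamps = [
    datetime(2014, 2, 24, 9, 15),
    datetime(2014, 2, 24, 17, 40),
    datetime(2014, 2, 25, 11, 5),
    datetime(2014, 3, 1, 12, 0),  # a Saturday: no closing price to pair with
]
closing_prices = {
    date(2014, 2, 24): 100.0,
    date(2014, 2, 25): 101.5,
    date(2014, 2, 26): 99.8,
}
rows = align_with_prices(daily_tweet_counts(timestamps), closing_prices)
```

Each element of `rows` is a `(day, tweet_count, closing_price)` tuple, which is the shape a later correlation or regression step would need.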
With a budget of zero, acquiring the historic Twitter feeds could be a difficult
task, since my research has shown that Twitter has given/sold their data to
outside companies who now sell the data for use.
1.2.1 In Scope
1. The analysis of an advertising campaign with the data gathered from
Twitter and stock market share prices.
2. The development of Python programs for cleaning data.
3. The development of an R program and the use of Microsoft Excel for
the analysis of the data.
1.2.2 Out of Scope

1. The project will not provide Samsung with outside analysis of other
brands' data.
1.3 Document Scope

The goal of this document is to describe the functional and non-functional
requirements of the Samsung advertising campaign analysis. The stakeholder
analysis was carried out prior to the requirements elicitation process.
1.4 Definitions, Acronyms, and Abbreviations

Term                   Definition
Advertising campaign   A series of messages to promote a product.
BA                     Business Analyst
Backed-up              The process of storing information (hardware or software based).
Cloud                  Internet-based service where storage, applications and servers are
                       accessed through the internet for an organization.
Data                   Information
Excel                  Microsoft Excel is a spreadsheet application used here for analyzing
                       data.
GUI                    Graphical user interface
MoSCoW                 A technique used in functional requirements: Must, Should, Could,
                       Won't. See Functional Requirements.
Python                 Type of programming language
                       Programming language
2 User Requirements Definition

2.1 User Characteristics
As part of Samsung's $14 billion advertising and marketing campaign last year
(2013), the company requires an analysis of the effectiveness of the advertising
campaign and of how the Twitter activity and their stock market prices were
affected. According to ibtimes.co.uk, Samsung was expected to spend $14 billion
on its marketing campaign. (ibtimes.co.uk) "The South Korean company is
expected to spend around $14 billion (£8.5bn, €10.3bn) on marketing and
promotion of its products in 2013, which is the biggest (as a percentage of its
total revenue) advertising budget of any company ever" (ibtimes, 2013)vi.
Samsung have not yet released their annual report for 2014.
The analysis will provide Samsung with a better insight into the effectiveness of
their advertising campaign strategy from data acquired from the Twitter feeds and
stock market. This information will assist Samsung in managing their advertising
campaign more effectively and efficiently by directing the style and approach of
the campaign towards their specific products.
3 Requirements Specification
3.1 Functional Requirements
FR#   Category         Description
FR1   Acquire Data 1   The project will gather and store all necessary data from
                       historical Twitter feeds.
FR2   Acquire Data 2   The project will gather and store all necessary historical stock
                       market data regarding the brand, corresponding to the dates of
                       the Twitter data that was acquired.
FR3   Clean Data 1     Retrieve historical Twitter data relating to key words and
                       hash tags of the brand on certain dates.
FR4   Clean Data 2     Retrieve historical stock market share prices regarding
                       the brand for the same times and dates as the historical Twitter
                       feeds data.
FR5   Analyse 1        The cleaned Twitter data is then analysed and compared.
FR6   Analyse 2        The cleaned stock market data is then analysed and compared.
FR7   Publish Data     The analysis will then be published and available to the
                       customer.
MoSCoW and Status column entries (row alignment not recoverable): M, M, H, H, H
3.1.1 Use Case Diagram Overall Functional Requirements
3.1.2 Requirement 1: Acquire Data 1 and 2

3.1.2.1 Description & Priority
The scope of this use case is to gather all the data necessary to carry out the
analysis and continue on to the next stage of the project. This requirement has a
very high status and is essential in progressing to the next stage of the analysis.
3.1.2.2 Use Case

Scope
The system shall source the historic Twitter and stock market data from online
data resources, define all access points, access the data, notify its availability
and then download the data.
Description
This use case describes the process by which the data for analysis is acquired.
Use Case Diagram
Flow Description
Precondition
The Data must be online. The data system must be operational at all times.
Activation
Use case is activated when the programmer connects to the system online.
Main Flow
1. Step: 1A. Programmer and System Developer source data.
2. Step: 2A. Programmer and Business Analyst validate data with the
Customer.
3. Step: 3A. Programmer accesses the data.
4. Step: 4A. Programmer notifies data availability to the System
Developer.
5. Step: 5A. Programmer downloads data for cleaning.
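Alternate Flow
1. Step: 1A. Programmer and System Developer source data.
2. Step: 2A. Programmer and Business Analyst validate data with the
Customer.
3. Step: 2A. Customer does not validate data. Step 1A is set to
recommence.
4. Step: 1A. Programmer and System Developer source data.
5. Step: 2A. Programmer and Business Analyst validate data with the
Customer.
6. Step: 3A. Programmer accesses the data.
7. Step: 4A. Programmer notifies data availability to the System
Developer.
8. Step: 5A. Programmer downloads data for cleaning.
Exceptional Flow
1. Step: 1A. Programmer and System Developer source data.
2. Step: 2A. Programmer and Business Analyst validate data with the
Customer.
3. Step: 2A. Customer does not validate data. Data is unavailable.
4. Use case ends.
Termination
The system has gathered all necessary data. The data is then exported to the
cloud storage system. This process has now been terminated.
Post Condition
All data gathered; move on to the next step.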
3.1.3 Requirement 2: Clean Data 1 and 2

The scope of this use case is to clean all the data gathered from the previous
requirement. A programmer and tester investigate the data for any errors such
as missing data and fix the errors. This requirement has a very high status and is
essential in progressing to the next stage of the analysis.
3.1.3.2 Use Case
Scope
The system shall clean all data sets gathered from the previous requirement,
define all error points, get recommendations for fixing the errors, fix the
errors and then export the data for analysis.
Description
This use case describes the process by which the data is cleaned for analysis.
Use Case Diagram
Flow Description
Precondition
The Data must be stored and available for cleaning at all times.
Activation
Use case is activated when the programmer connects to the cloud storage system
and retrieves the data.
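Main Flow
1. Step: 1B. Programmer and System Developer retrieve data from the
cloud storage system.
2. Step: 2B. Programmer and Tester identify errors in the data set.
3. Step: 3B. Programmer receives recommendations from System
Developer.
4. Step: 4B. Programmer with the help of the Tester fixes errors and
notifies the System Developer.
5. Step: 5B. Programmer exports the data for analysis.
Alternate Flow
1. Step: 1B. Programmer and System Developer retrieve data from the
cloud storage system.
2. Step: 2B. Programmer and Tester identify errors in the data set.
3. Step: 3B. Programmer receives recommendations from System
Developer.
4. Step: 4B. Programmer with the help of the Tester fixes errors and
notifies the System Developer.
5. Step: 2B. Programmer and Tester test system again and identify more
errors in the data set.
6. Step: 3B. Programmer receives recommendations from System
Developer.
7. Step: 4B. Programmer with the help of the Tester fixes errors and
notifies the System Developer.
8. Step: 5B. Programmer exports the data for analysis.
Exceptional Flow
1. Step: 1B. Programmer and System Developer retrieve data from the
cloud storage system.
2. Step: 2B. Programmer and Tester identify errors in the data set.
3. Step: 3B. Programmer receives recommendations from System
Developer.
4. Step: 4B. Programmer with the help of the Tester cannot fix the
errors. Data is corrupt.
5. Use case ends.
Termination
The system has cleaned all acquired data. The data is then saved onto the cloud
storage system and exported for analysis. This process has now been
terminated.
Post Condition
All data cleaned; move on to the next step.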
3.1.4 Requirement 3: Analyze Data

The scope of this use case is to analyze all the data gathered and cleaned in
the previous requirements. A Business Analyst and Statistician examine and
study the data for analysis. This requirement has a very high status and is
essential in progressing to the next stage of the analysis.
3.1.4.2 Use Case
Scope
This process involves the skills and management of the Statistician and Business
Analyst to compare and analyze all data.
The process shall calculate and prove/predict outcomes from the data with the
help of graphs for visualization. All proven data is then backed up and stored.
Description
This use case describes the process by which the data is analyzed.
Use Case Diagram
Flow Description
Precondition
The Data must be available for analysis at all times.
Activation
Use case is activated when the BA and the Statistician connect to the cloud
storage system and retrieve the data.
Main Flow
1. Step: 1C. BA and Statistician retrieve data from the cloud storage
system.
2. Step: 2C. The Statistician and BA explore and understand the data set.
3. Step: 3C. Statistician begins the calculations.
4. Step: 4C. Statistician and BA begin to visualize the data.
5. Step: 5C. Programmer backs up and stores findings with the approval
of the BA.
Alternate Flow
1. Step: 1C. BA and Statistician retrieve data from the cloud storage
system.
2. Step: 2C. The Statistician and BA explore and understand the data set.
3. Step: 3C. Statistician begins the calculations.
4. Step: 4C. Statistician and BA begin to visualize the data. BA requests
the data to be recalculated with a different approach.
5. Step: 3C. Statistician begins the new calculations.
6. Step: 4C. Statistician and BA begin to visualize the data.
7. Step: 5C. Programmer backs up and stores findings with the approval
of the BA.
Exceptional Flow
1. Step: 1C. BA and Statistician retrieve data from the cloud storage
system.
2. Step: 2C. The Statistician and BA explore and understand the data set.
Statistician and BA are unable to understand the data set. BA requests
new data set.
3. Use case ends.
Termination
The analysis is completed. The data is then saved onto the cloud storage system
and exported for publishing. This process has now been terminated.
Post Condition
All data analyzed; move on to the next step.
3.1.5 Requirement 4: Publish Data

The scope of this use case is to publish the findings from the analysis approved
in the previous requirements. A Business Analyst consults the Customer on
topics such as the proprietor of the data, the goal of the publication, the target
audience/data consumer (is the data confidential and for internal use only?),
the media in which it is published and the release date.
This requirement has a very high status.
3.1.5.2 Use Case

Scope
This process involves the communication and business skills of the BA and how
to handle the customer's requirements and outcomes.
The process involves the Customer, BA and the Advertising/Publications
division.
The process shall publicize the findings to the desired audience with the
approval of the customer and recommendations of the BA.
Description
This use case describes the process by which the data is publicized.
Use Case Diagram
Flow Description
Precondition
The data must be available for analysis at all times.
The Customer/Client must be available at all times.
Activation
Use case is activated when the findings are presented to the BA, Customer and
Advertising/Publication Division and all three are engaged in communication.
Main Flow
1. Step: 1D. BA, Customer and Advertising/Publication Division retrieve
analysis findings. Findings have acquired owner's approval.
2. Step: 2D. BA and Customer discuss the objective of the findings
release.
3. Step: 3D. BA and Customer begin to agree on the target audience/data
consumer.
4. Step: 4D. Customer decides the medium type/the style and method of
publicizing the data e.g. websites, newspaper, with the BA's approval
and the assistance of the Advertising/Publication Division.
5. Step: 5D. BA notifies Advertising/Publication Division to publish the
data.
Alternate Flow
1. Step: 1D. BA, Customer and Advertising/Publication Division retrieve
analysis findings. Findings have acquired owner's approval.
2. Step: 2D. BA and Customer discuss the objective of the findings
release.
3. Step: 3D. BA and Customer begin to agree on the target audience/data
consumer.
4. Step: 4D. Customer decides the medium type/the style and method of
publicizing the data e.g. websites, newspaper, with the BA's approval
and the assistance of the Advertising/Publication Division. Customer
decides to recommence Step 3D to change the publication
approach.
5. Step: 3D. BA and Customer begin to agree on a new target
audience/data consumer.
6. Step: 4D. Customer decides the medium type/the style and method of
publicizing the data e.g. websites, newspaper, with the BA's approval
and the assistance of the Advertising/Publication Division.
7. Step: 5D. BA notifies Advertising/Publication Division to publish the
data.
Exceptional Flow
1. Step: 1D. BA, Customer and Advertising/Publication Division retrieve
analysis findings. Findings have not acquired owner's approval.
Customer decides not to publicize the data findings due to the high
importance and confidentiality of the findings.
2. Use case ends.
Termination
The publication of the data is completed. This process has now been terminated.
Post Condition
All data publicized; all steps completed.
3.2 Non-Functional Requirements

3.2.1 Availability: Must Have
The information must be available at all times for analysis.
3.2.2 Storage Requirements: Must Have
The data kept during and after the analysis should be stored in a secure facility.
Cloud storage security protocols must be assessed. There must be enough capacity
in the cloud to hold the large amount of data.
3.2.3 Connection Reliability: Must Have
There must be a reliable connection at all times when retrieving, uploading and
updating the data. A lost connection could result in lost data.
3.2.4 Connection Speed: Must Have
There must be a fast online connection. This is needed when retrieving, uploading
and updating the data. A large data set could take some time to upload.
3.2.5 Backup and Recovery: Must Have
The data must be easily accessed, backed up and updated. It must have a system
recovery in the case of a system failure.
3.2.6 Program to clean data: Must Have
The analysis must have the correct programs to clean and fix any errors in the
data.
3.2.7 Software Analysis tools: Must Have
The analysis must have the correct software analysis tools that all divisions of
the analysis can use.
3.2.8 Communication Requirements: Must Have

The analysis must have constant communication between all divisions/parties
in the decision making process.
3.2.9 Security: Must Have
The analysis must have high security measures. The analysis is operating with
highly confidential data. Only key divisions of the analysis may have access
to the data.
3.2.10 Data Validation: Must Have
This process requires the use of external services in order to download the data.
Once the data is gathered from the services (Twitter, Nasdaq) it should be
validated.
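A validation pass over downloaded stock data might look like the sketch below. The specific checks (valid date, no duplicate dates, a positive closing price) are assumptions chosen for illustration, since the requirement does not list them; `validate_prices` and the sample rows are hypothetical.

```python
from datetime import date


def validate_prices(rows):
    """Return a list of problems found in (day, closing_price) rows."""
    problems = []
    seen_days = set()
    for index, (day, price) in enumerate(rows):
        if not isinstance(day, date):
            problems.append(f"row {index}: missing or invalid date")
            continue  # cannot check duplicates without a usable date
        if day in seen_days:
            problems.append(f"row {index}: duplicate date {day.isoformat()}")
        seen_days.add(day)
        if price is None or price <= 0:
            problems.append(f"row {index}: missing or non-positive price")
    return problems


# Hypothetical rows containing three deliberate faults.
sample_rows = [
    (date(2014, 2, 24), 100.0),
    (date(2014, 2, 24), 101.0),  # duplicate date
    (date(2014, 2, 25), None),   # missing price
    (None, 5.0),                 # missing date
]
problems = validate_prices(sample_rows)
```

An empty list means the rows passed these particular checks; the same pattern could be applied to tweet records with checks suited to that data.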
5 Interface Requirements
5.1 GUI
An example of an analysis of tweets.
vii
comprendia. 2014
Examples of tweets analyzed on Microsoft Excel and Geo Flow
viii
powerpivotblog. 2013
Analysis of tweets using R language
ix
evolutionanalytics. 2013
Example of Excel Data for intro to Regression.

This is using stock market data.
skilledup. 2013
Example of analysis completed on R Studio.
xi
datamachines. 2012
6 Analysis Evolution
The analysis will evolve over time to produce a much more focused outcome,
differentiating itself by the analysis of a specific product in the Samsung product
range. This can occur by changing the key words mined from the Twitter data,
focusing on a product line such as the Galaxy products in the Samsung range. These
include the smartphone, tablet and watch.
If the customer, Samsung, required an analysis focused on the release of a
specific product such as the Galaxy S4, which was released in April 2013, this can be
done by narrowing down the search key words, using hash tags and words such
as (#samsungS4, #SamsungGalaxyS4, #GalaxyS4, #S4), and narrowing down the
timeline to the release date of the phone.
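The narrowing described above could be sketched as a filter on hash tags and on a date window. This is an illustrative assumption about how the filtering might be coded: the function names, the sample tweets and the chosen window dates are hypothetical, and only the four hash tags come from the text.

```python
import re
from datetime import datetime

# Hash tags named in the text, compared case-insensitively.
S4_HASHTAGS = {"#samsungS4", "#SamsungGalaxyS4", "#GalaxyS4", "#S4"}


def extract_hashtags(text):
    """Return the set of hash tags in a tweet, lower-cased."""
    return {tag.lower() for tag in re.findall(r"#\w+", text)}


def select_tweets(tweets, hashtags, start, end):
    """Keep (timestamp, text) tweets inside [start, end] that use a wanted tag."""
    wanted = {tag.lower() for tag in hashtags}
    return [
        (ts, text)
        for ts, text in tweets
        if start <= ts <= end and extract_hashtags(text) & wanted
    ]


# Hypothetical tweets and a hypothetical window around the April 2013 release.
tweets = [
    (datetime(2013, 4, 27, 10, 0), "Loving my new phone #GalaxyS4"),
    (datetime(2013, 4, 27, 11, 0), "Nice weather today"),
    (datetime(2013, 1, 1, 9, 0), "Rumours about the #S4"),      # outside window
    (datetime(2013, 4, 28, 8, 0), "#S4U concert tonight"),      # different tag
]
selected = select_tweets(
    tweets, S4_HASHTAGS, datetime(2013, 4, 20), datetime(2013, 5, 4)
)
```

Matching whole tags rather than substrings avoids counting unrelated tags that merely begin with the same characters, which matters for a short tag such as #S4.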
Progress Management Report 1

Document Location
This document will be uploaded through Turnitin.
Revision History
Date of this revision: 9/03/14
Revision date   Previous revision date   Summary of changes   Changes marked
9/03/14                                  First Issue
Approvals
This project requires the following approvals.
Name           Signature   Title             Date of issue   Version
Robert Coyle               Project Manager   10/03/14        1
Distribution
Name            Title              Date of issue   Version
Oisin Creaner   Project Lecturer   10/03/14        1
Purpose of Document
The purpose of this document is to provide the project lecturer, Oisin Creaner,
with a summary of the status of the project.
Date of report
09/03/14
Period covered
10/02/14 to 9/03/14
Schedule Status
This project is still on schedule at this interval.
Updated Gantt chart
[Gantt chart: date axis 03-Feb, 23-Feb, 15-Mar, 04-Apr, 24-Apr; tasks shown include Project Proposal, Create Python codes, Data retrieval from Twitter API and Management Progress Report 1]

Term     Definition
API      Application Programming Interface
JSON     JavaScript Object Notation
NASDAQ   American Stock Exchange
RSS      Rich Site Summary
Products completed during this period

Project proposal               The project proposal was completed on time. See
                               (Coyle, 2014)
Requirements specification     Requirements specification was completed on
                               time with changes to project scope. See (Coyle,
                               2014)
Problems
Actual
Accessing Twitter API          The Twitter API has been more difficult to access
                               than first anticipated due to a change of
                               regulations and an updated version of Twitter. The
                               API only supports JSON.
Acquiring free historical      Historical feeds are proving to be difficult, as
data.                          Twitter has sold their data to approved sites for
                               resale. As this project has no budget this has
                               had a high impact on the plan. Twitter has
                               released a grant application form online for
                               accessing their historical data.
Potential
The quality and quantity of    Not having the JSON code yet, I am not sure what
the Twitter data.              my expected return of data will be. Using a site
                               called Twilert, I acquired some data but the site
                               won't gather more than the first 100 RSS feeds,
                               thus rendering the service useless.
Gathering the data in the      Once I have a response from the Twitter
required time.                 developers grant I can determine whether the
                               historical data is possible to acquire and progress
                               to the next stage of the project.
Raid Log:
Risks
Assumptions
Issues
Dependency
Products due for completion

By the next period the following should be accomplished.
Gathering of Twitter feeds.      Should have gathered all Twitter data, either
                                 historical or real time, in relation to Samsung.
Gathering of stock market        Should have gathered all Nasdaq data in relation
data.                            to Samsung in the same time series as the Twitter
                                 data.
Analysis of data.                Once all data has been gathered analysis can
                                 take place.
Preliminary presentation.        Should have Preliminary presentation completed.
Project write-up.                Commenced first draft.
Management Progress Report 2.    This report will be the end of this period.
Project Issues Status

We currently have 2 issues on the project issue log; these have not been resolved
and are currently outstanding. Both are waiting upon external client response.
Conclusion
This project, even with the setbacks, is still capable of finishing within the
original set target dates. Gathering all the data in the next week is paramount for
the success of the project. Any more delays will compromise the quality of the
project.
Currently I am waiting on a response from Twitter in relation to their
Developers grant scheme. If this is approved, all the historic data from January
2013 to March 2014 will be available and can be gathered in JSON
format. See Dependencies Ref: D02.
All necessary information has been submitted to the Twitter Developer Grant
scheme, such as dates, key words and hash tags.
Alternatives:
If this grant is not approved the project can revert to streaming the
data live from Twitter in JSON format.
If the grant approval takes too long the project can revert to
streaming the data live from Twitter in JSON format.
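Whichever route supplies the data, each tweet would arrive as a JSON object that has to be turned into a timestamp and text before analysis. The sketch below is illustrative only: it assumes the `created_at`, `text` and `entities.hashtags` fields and the timestamp layout of Twitter REST API v1.1 tweet objects, and the sample record and the name `parse_tweet` are hypothetical.

```python
import json
from datetime import datetime

# Assumed layout of the "created_at" field, e.g. "Mon Feb 24 18:05:00 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(raw_json):
    """Extract the timestamp, text and hash tags from one JSON tweet record."""
    tweet = json.loads(raw_json)
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    # Hash tags are assumed to be listed without the leading '#'.
    hashtags = [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]
    return {"created_at": created, "text": tweet["text"], "hashtags": hashtags}


# Hypothetical record for illustration (not project data).
sample = (
    '{"created_at": "Mon Feb 24 18:05:00 +0000 2014", '
    '"text": "Launch day #GalaxyS5", '
    '"entities": {"hashtags": [{"text": "GalaxyS5"}]}}'
)
record = parse_tweet(sample)
```

The parsed timestamp carries its UTC offset, so it can be converted to the stock exchange's time zone before tweets are grouped by trading day.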

Document Location
Revision History
Revision date   Previous revision date   Summary of changes   Changes marked
30/03/14                                 First Issue
Approvals
Name           Signature   Title             Date of issue   Version
Robert Coyle               Project Manager   30/03/14        1
Distribution
Name            Title              Date of issue   Version
Oisin Creaner   Project Lecturer   30/03/14        1
Purpose of Document
The purpose of this document is to provide the project lecturer, Oisin Creaner,
with a summary of the status of the project.
Date of report
30/03/14
Period covered
10/03/14 to 30/03/14
Schedule Status
Updated Gantt chart
[Gantt chart: date axis 03-Feb, 23-Feb, 15-Mar, 04-Apr, 24-Apr, 14-May; tasks shown include Project Proposal and Create Python codes]

Term     Definition
API      Application Programming Interface
JSON     JavaScript Object Notation
NASDAQ   American Stock Exchange
RSS      Rich Site Summary

Progress Management report 1   The Project management report 1 was completed
                               on time. See (Coyle, 2014)
Problems
Actual
Accessing Twitter API          The decision has been made, under advisement
                               from project lecturers, to duplicate the Twitter
                               feeds using the Twilert application.
                               Twilert provides a free service for accessing live
                               Twitter feeds; however, it only delivers 100 RSS
                               feeds per day.
                               The trial run lasts for 15 days, so it will provide
                               the project with about 1500 tweets (100 per day
                               for 15 days). These tweets will then be duplicated
                               to match the historic stock market prices.
                               The stock market data provides daily end of day
                               prices.
Potential
The quality and quantity of    The Twitter data provided by Twilert must be of
the Twitter data provided      good quality, and having enough data is essential.
by Twilert.                    Data will be duplicated otherwise.
Raid Log:
Risks
Open Risks
Date last reviewed: 30/03/2014
Risk Ref  Risk Category  Risk Description          Raised by  Date Identified  Mitigation Category  Mitigation                           Owner  Date updated
R01       technology     No data backup available  R.Coyle    10-Feb-14        prevention           Source online storage for data.      RC     10-Feb-14
R02       cost           Acquiring data for free.  R.Coyle    10-Feb-14        acceptance           Source free historic Twitter feeds.  RC     10-Feb-14
R03       time           Acquiring data on time.   R.Coyle    10-Feb-14        prevention           Source the data on time.             RC     10-Feb-14
Closed Risks
Risk Ref  Risk Category  Risk Description                 Raised by  Date Identified  Mitigation Category  Mitigation                     Owner  Date updated
R01       technology     No data backup available         R.Coyle    17-Feb-14        prevention           Source hard drive for storage  RC     10-Jun-14
R02       cost           No costs needed for use of data  R.Coyle    24-Mar-14        acceptance           Using different data.          RC     24-Mar-14
R03       time           Data will be acquired on time.   R.Coyle    24-Mar-14        contingency          Source the data on time.       RC     24-Mar-14
The Priority, Impact, Prob, Update and End Date columns are blank in both tables.
Assumptions
The purpose of this document is to surface, document, analyse and monitor the key assumptions
upon which the plan is based. Planning parameters, design parameters, issues and risks will be generated from these assumptions.
Ref #  Assumption                                    Importance     Certainty     Influence  Test                                     Test Date
A01    Lecturers will provide prompt feedback and    4 - critical   3 - Probable             Send request to test level of response   10-Feb-14
       guidance
A02    Twitter will reply to my grant request for    2 - somewhat   1 - unknown              Wait for reply.                          03-Mar-14
       the use of their historic data.
A03    RSS feeds gathered from Twitter not missing   3 - important  4 - Fact                 Unknown as of yet.                       30-Mar-14
       data.
A04    Skills developed for analysis of data.        4 - critical   4 - Fact                 Continue attending lectures.             03-Mar-14
Issues
Issues are unexpected incidents or events

Issue Ref  Issue Description              Raised by  Date Raised  Action Plan                              Status  Owner  Target Resolution Date  Actual Resolution Date
I01        Unexpected issue in accessing  RC         17-Feb-14    Identify different means of accessing    open    RC     10-Feb-14
           Twitter feeds.                                         the Twitter feeds.
I02        Twitter API access more        RC         03-Mar-14    This issue has been brought up to        closed  RC     03-Mar-14               24-Mar-14
           complex than anticipated.                              Project Lecturers. Awaiting response.
I03        No response from Twitter       RC         24-Mar-14    This issue has been brought up to        closed  RC     24-Mar-14               30-Mar-14
           developer data grant scheme.                           Project Lecturers. Alternative solution
                                                                  has been provided.
The Impact and Priority columns are blank.
Dependency
Dependency Ref  Project          Dependency Description           Raised by  Date Raised  Period Affected  Action Plan                          Owner  Target Resolution Date  Actual Resolution Date
D01             NCI Facilities   IT facilities available for      RC         10-Feb-14    Feb-Mar          Confirm availability with IT         RC     Mar-14                  Mar-14
                                 running Twitter API
D02             External Expert  Twitter historical data grant    RC         03-Mar-14    Mar-Apr          Awaiting response from Twitter for   RC     Mar-14                  Mar-14
                                 approval.                                                                 historical data grant approval.
D03             External Expert  Acquire Twitter data from        RC         30-Mar-14    Mar-Apr          Awaiting response from external      RC     Apr-14
                                 Twilert.                                                                  client.
The Impact and Priority columns are blank.

Gathering of Twitter feeds.      Should have gathered all Twitter data in relation
                                 to Samsung.
Gathering of stock market        Should have gathered all Nasdaq data in relation
data.                            to Samsung.
Analysis of data.                Once all data has been gathered analysis can
                                 take place.
Project write-up.
Management Progress Report 3.    This report will be the end of this period.
Conclusion
This project is still on course for completion within the requested timeline.
The project data source has changed since there has been no reply from the
Twitter research data grant scheme to access their historical data.
Twilert will now provide the data for the project.
It has proven to be a reliable source but can only provide access to 100 RSS feeds
per day; this data, however, will be duplicated, providing enough data to complete
the project.
Yahoo Finance will provide the historical stock market prices.
Alternatives:
If the Twitter developer grant is approved within the next 2 weeks the
project can revert to using the correct historical data.

Document Location
Revision History
Revision date   Previous revision date   Summary of changes   Changes marked
20/04/14                                 First Issue
Approvals
Name           Signature   Title             Date of issue   Version
Robert Coyle               Project Manager   20/04/14        1
Distribution
Name            Title              Date of issue   Version
Oisin Creaner   Project Lecturer   20/04/14        1
Purpose of Document
The purpose of this document is to provide the project lecturer, Oisin Creaner,
with a summary of the status of the project.
Date of report
20/04/14
Period covered
1/04/14 to 20/04/14
Schedule Status
Updated Gantt chart
[Gantt chart: date axis 03-Feb, 23-Feb, 15-Mar, 04-Apr, 24-Apr, 14-May; tasks shown include Project Proposal and Create Python codes]

Term     Definition
API      Application Programming Interface
JSON     JavaScript Object Notation
NASDAQ   American Stock Exchange
RSS      Rich Site Summary

Acquired Stock Data     This was completed on 20-04-14.
Acquired Twitter Data   This was completed on 20-04-14.
Problems
Actual
Analysis of Data        The decision has been made to use companies
                        in the same stock market.
                        The three brands I have chosen are on the
                        NASDAQ stock exchange. This has mitigated the
                        problems that would have been encountered
                        with the different currencies and time frames that
                        are associated with foreign stock exchanges.
Potential
Cleaning Twitter Data   Cleaning of Twitter data acquired from Java
                        script can be completed in the short time frame
                        that is left.
Raid Log:
Risks
Open Risks
Date last reviewed: 20/04/2014
Risk Ref  Risk Category  Risk Description          Raised by  Date Identified  Mitigation Category  Mitigation                           Owner  Date updated
R01       technology     No data backup available  R.Coyle    10-Feb-14        prevention           Source online storage for data.      RC     10-Feb-14
R02       cost           Acquiring data for free.  R.Coyle    10-Feb-14        acceptance           Source free historic Twitter feeds.  RC     10-Feb-14
R03       time           Acquiring data on time.   R.Coyle    10-Feb-14        prevention           Source the data on time.             RC     10-Feb-14
R04       time           Data analysis.            R.Coyle    20-Apr-14        prevention           Prepare and analyze data.            RC     21-Apr-14
Closed Risks
Risk Ref  Risk Category  Risk Description                 Raised by  Date Identified  Mitigation Category  Mitigation                     Owner  Date updated  End Date
R01       technology     No data backup available         R.Coyle    17-Feb-14        prevention           Source hard drive for storage  RC     10-Jun-14
R02       cost           No costs needed for use of data  R.Coyle    24-Mar-14        acceptance           Using different data.          RC     24-Mar-14
R03       time           Data is acquired.                R.Coyle    24-Mar-14        contingency          Source the data on time.       RC     20-Apr-14     20-Apr-14
The Priority, Impact, Prob and Update columns are blank in both tables.
Assumptions
The purpose of this document is to surface, document, analyze and monitor the key
assumptions upon which the plan is based. Planning parameters, design parameters, issues and risks will be generated from
these assumptions.
Ref #  Assumption                                    Importance     Certainty     Influence  Test                                     Test Date
A01    Lecturers will provide prompt feedback and    3 - important  3 - Probable             Send request to test level of response   10-Feb-14
       guidance
A04    Skills developed for analysis of data.        4 - critical   4 - Fact                 Continue attending lectures.             03-Mar-14
A05    Data can be cleaned and prepared for          4 - critical   4 - Fact                 Project lecturers can assist during      20-Apr-14
       analysis.                                                                             lecture hours.
A05    Cleaned data is adequate and can be           4 - critical   4 - Fact                 Project lecturers can assist during      20-Apr-14
       analyzed                                                                              lecture hours.
Issues

Issue Ref: I01
Issue Description: Unexpected issue in accessing Twitter feeds.
Raised by: RC
Date Raised: 17-Feb-14
Action Plan: Data was acquired.
Status: closed
Owner: RC
Target Resolution Date: 10-Feb-14
Actual Resolution Date: 20-Apr-14

Issue Ref: I02
Issue Description: Twitter API access more complex than anticipated.
Raised by: RC
Date Raised: 03-Mar-14
Action Plan: This issue has been brought up with the Project Lecturers. Awaiting response.
Status: closed
Owner: RC
Target Resolution Date: 03-Mar-14
Actual Resolution Date: 24-Mar-14

Issue Ref: I03
Issue Description: The response from the Twitter developer data grant scheme came back rejected.
Raised by: RC
Date Raised: 24-Mar-14
Action Plan: This issue has been brought up with the Project Lecturers. An alternative solution has been provided.
Status: closed
Owner: RC
Target Resolution Date: 24-Mar-14
Actual Resolution Date: 20-Apr-14
Dependency

Dependency Ref: D01
Project: NCI Facilities
Dependency Description: IT facilities available for running Twitter API
Raised by: RC
Date Raised: 10-Feb-14
Period Affected: Feb, Mar
Action Plan: Confirm availability with IT
Owner: RC
Target Resolution Date: Mar-14
Actual Resolution Date: Mar-14

Cleaning of Twitter data: Twitter data will be cleaned and a time series prepared for analysis.
Cleaning of stock market data: Stock data will be cleaned and a time series prepared for analysis. The stock market time series is per day.
Analysis of data: Once all data has been cleaned, analysis will begin.
Project write-up.
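As a minimal sketch of the Twitter-side preparation, tweets can be counted per calendar day so that they line up with a daily stock series. This assumes each cleaned tweet carries an ISO-8601 timestamp string, which this report does not specify. The function name `daily_tweet_counts` and the sample timestamps are hypothetical and used for illustration only.

```python
from collections import Counter
from datetime import datetime


def daily_tweet_counts(timestamps):
    """Return a dict mapping each calendar day to its number of tweets.

    Assumes every timestamp is an ISO-8601 string such as
    "2014-04-14T09:15:00" (an assumption, not the report's stated format).
    """
    days = (datetime.fromisoformat(ts).date() for ts in timestamps)
    # Sort by day so the result reads as a time series.
    return dict(sorted(Counter(days).items()))


# Hypothetical timestamps, for illustration only.
sample = [
    "2014-04-14T09:15:00",
    "2014-04-14T17:42:10",
    "2014-04-15T08:01:33",
]
counts = daily_tweet_counts(sample)
for day, n in counts.items():
    print(day.isoformat(), n)
```

Days with no tweets do not appear in the result, so they would need to be filled with zero before joining against the stock series.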
Conclusion
This project is still on course for completion within the requested timeline.
The project data source has changed since the Twitter Historical Data grant was denied. I have now gathered a week's worth of Twitter data associated with three companies that are on the same stock exchange.
I will now focus on Apple Inc., Tesla Motors, Inc. and Microsoft Corporation. Because these tech companies are on the same stock exchange (NASDAQ), the analysis will be more straightforward. Samsung Electronics, the company I had originally selected to base the analysis upon, is on the Korean stock market. Not only would I have a different time series, but I would also have to adjust for the currency difference.
Yahoo Finance will provide the historical stock market prices.
I am hoping to find a correlation between the Twitter activity and the stock market prices of the three brands with a lag of around three to four days.
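The lagged correlation described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis carried out in this project: it assumes the tweet counts and closing prices are already aligned as one value per trading day, and it uses a plain Pearson correlation. The function names and the sample numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def lagged_correlation(tweet_counts, close_prices, lag):
    """Correlate tweet counts on day t with close prices on day t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(tweet_counts, close_prices)
    # Drop the last `lag` tweet days and the first `lag` price days
    # so that each tweet count is paired with the price `lag` days later.
    return pearson(tweet_counts[:-lag], close_prices[lag:])


# Hypothetical data, for illustration only: prices follow tweets by 3 days.
tweets = [1, 2, 3, 4, 5, 6, 7, 8]
prices = [0, 5, 1] + [2 * v + 10 for v in tweets[:5]]
for lag in (0, 3, 4):
    print(lag, round(lagged_correlation(tweets, prices, lag), 3))
```

With only a week of data, each extra day of lag removes one observation, so correlations at lags of three to four days rest on very few points and should be read with caution.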
Alternatives:
If I can gather the stock market prices in hourly format, the analysis would be more detailed.
References
Usefulsocialmedia.com. 2014. Twitter Evolves Becoming more brand friendly | Useful Social Media. [online] Available at: http://www.usefulsocialmedia.com/measurement/Twitter-evolves--becomingmore-brand-friendly [Accessed: 9 Feb 2014].
Johnson, L. 2014. Samsung Galaxy S5 release date, news, rumours, specs and price News Trusted Reviews. [online] Available at: http://www.trustedreviews.com/news/Samsung-galaxy-s5-release-date-newsrumours-specs-and-price [Accessed: 9 Feb 2014].
Macrumors.com. 2014. Apple Continues to Lose Smartphone Share, Gain Mobile Phone Share in 4Q 2013. [online] Available at: http://www.macrumors.com/2014/01/28/apple-phone-share-4q-2013/ [Accessed: 9 Feb 2014].
Wellclever.com. 2014. Well Clever - Publisher Centric Platforms. [online] Available at: http://wellclever.com [Accessed: 9 Feb 2014].
usefulsocialmedia. 2014. Twitter Evolves - Becoming more brand friendly. [online] Available at: http://www.usefulsocialmedia.com/measurement/Twitter-evolves--becomingmore-brand-friendly [Accessed: 23 Feb 2014].
ibtimes.co.uk. 2013. Samsung's $14bn is 'Biggest Marketing Budget in History'. [online] Available at: http://www.ibtimes.co.uk/samsung-14bn-marketingbudget-biggest-history-525979 [Accessed: 28 Feb 2014].
comprendia. 2014. If A Tweet Falls In The Forest? Maximizing Twitter Engagement Through Time Of Day Analysis. [online] Available at: http://comprendia.com/2012/07/17/if-a-tweet-falls-in-the-forest-maximizingtwitter-engagement-and-exposure-through-time-of-day-analysis/ [Accessed: 24 Feb 2014].
powerpivotblog. 2013. Analyze a Twitter feed with Excel 2013, DataExplorer and GeoFlow. [online] Available at: http://www.powerpivotblog.nl/analyze-atwitter-feed-with-excel-2013-dataexplorer-and-geoflow/ [Accessed: 24 Feb 2014].
revolutionanalytics. 2013. What does Barack Obama tweet about most? [online] Available at: http://blog.revolutionanalytics.com/2013/11/what-does-barackobama-tweet-about-most.html [Accessed: 24 Feb 2014].
skilledup. 2013. 50+ (Mostly) Free Excel Add-Ins For Any Task. [online] Available at: http://www.skilledup.com/learn/businessentrepreneurship/mostly-free-excel-add-ins/ [Accessed: 24 Feb 2014].
datamachines. 2012. Decomposing North Carolina Amendment 1 with R and Tableau (part 1). [online] Available at: http://datamachines.blogspot.ie/2012/05/decomposing-north-carolinaamendment.html [Accessed: 24 Feb 2014].
Twilert. 2014. Twitter search alerts. [online] Available at: http://www.twilert.com [Accessed: 10 Mar 2014].
Twitter. 2014. Overview: Version 1.1 of the Twitter API. [online] Available at: https://dev.twitter.com/docs/api/1.1/overview [Accessed: 10 Mar 2014].
Twitter. 2014. Data Grants. [online] Available at: https://engineering.twitter.com/research/data-grants [Accessed: 10 Mar 2014].
Yahoo Finance. 2014. Samsung Electronics Co. Ltd. [online] Available at: http://finance.yahoo.com/q/hp?s=005930.KS+Historical+Prices [Accessed: 30 Mar 2014].
Yahoo Finance. 2014. Yahoo Finance - Business Finance, Stock Market, Quotes, News. [online] Available at: http://finance.yahoo.com [Accessed: 20 Apr 2014].

