
EFFECTIVE SENTIMENT ANALYSIS OF TWITTER DATA USING HADOOP ECOSYSTEM 2017

CHAPTER 1
INTRODUCTION


INTRODUCTION

Today, the textual data on the internet is growing at a rapid pace. Different industries
are trying to use this huge body of text to extract people's views about their products,
and social media is a vital source of information in this case. It is impossible to
analyse such a large amount of data manually. There are a large number of social media
websites that enable users to contribute, modify and grade content, giving users an
opportunity to express their personal opinions about specific topics. Examples of such
websites include blogs, forums, product review sites, and social networks. In my case,
Twitter data is used.

The focus of my project is to assign a polarity to each tweet, i.e. to decide whether the
author expresses a positive or negative opinion.

1.1 Sentiment Analysis:

Sentiment analysis, also known as opinion mining, is the process of computationally
identifying and categorizing opinions expressed in a piece of text, especially in order to
determine the writer's attitude towards a particular topic or product. In other words, it
detects the contextual polarity of the text and determines whether a piece of writing is
positive, negative or neutral.

1.2 Twitter Data:

Twitter, one of the largest social media sites, receives millions of tweets every day,
amounting to petabytes of data per year. For development purposes, Twitter provides a
Streaming API that gives a developer access to roughly 1% of the tweets posted at that
time, based on particular keywords. The object about which we want to perform sentiment
analysis is submitted to the Twitter API, which does the further mining and returns only
the tweets related to that object.


1.3 Hadoop:

Hadoop is an open-source framework for writing and running distributed applications
that process large amounts of data. It is an open-source Java framework for storing,
accessing and processing big data in a distributed fashion, at low cost, with a high degree
of fault tolerance and high scalability. The Hadoop ecosystem includes different modules
such as MapReduce, Flume, Hive, Pig, Sqoop, Oozie, ZooKeeper, and HBase for different
functionality.

1.4 Flume:

Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data into the Hadoop Distributed File
System (HDFS). It has a simple and flexible architecture based on streaming data flows. It is
robust and fault tolerant, with tuneable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows for online analytic
applications.

1.5 Hive:

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It
resides on top of Hadoop to summarize big data and makes querying and analysing it easy.
In this project, Apache Hive (HiveQL) with the Hadoop Distributed File System is used for
the analysis of the data. Hive provides a SQL-like interface to process data stored in HDFS,
and due to this SQL-like interface it is increasingly becoming the technology of choice for
using Hadoop.
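
For illustration only, the sketch below shows the kind of SQL-like query Hive accepts. The
table and column names (tweets, lang) are hypothetical and not taken from this project.

    -- count tweets per language in a hypothetical table of collected tweets
    SELECT lang, COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY lang
    ORDER BY tweet_count DESC;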


CHAPTER 2
LITERATURE REVIEW


Sentiment analysis has been handled as a Natural Language Processing task at many
levels of granularity. Microblog data such as Twitter, on which users post real-time reactions
to and opinions about everything, poses newer and different challenges.

Gautam Goswami, in the paper Effective Image Analysis on Twitter Streaming using
Hadoop Eco System on Amazon Web Service EC2 [3], discussed how image statistics are
analysed to know how many users shared or posted the same image across their community
on Twitter. He used Amazon Web Service EC2, where a Hadoop cluster was created to
analyse the Twitter streaming data. Other components of the Hadoop ecosystem, viz. Apache
Flume and Hive, were also used. Using Flume, Twitter streaming data was collected for a
particular interval of time and subsequently stored in the Hadoop Distributed File System
(HDFS) for further analysis, where traditional RDBMSs are not suitable. He used Hive for
mining the stored data filtered through the MapReduce phase. Only the map phase was used
to parse the semi-structured streaming data.

The paper Lexicon-Based Methods for Sentiment Analysis [4] by Maite Taboada, Julian
Brooke, Milan Tofiloski and Kimberly Voll uses dictionaries of words annotated with their
semantic orientation (polarity and strength). The paper discusses the various methods of
performing lexicon-based analysis, as well as the different datasets and dictionaries available.
Our project is a re-implementation of the simple lexicon-based approach to sentiment
analysis. Although it is not as accurate as the machine learning approach, in which an
algorithm is trained to classify data, it is the preferred approach for handling big datasets, as
training an algorithm takes a lot of time.


CHAPTER 3
THEORETICAL BACKGROUND


3.1 Structure of Hadoop

Structurally, Hadoop consists of two components, the Hadoop Distributed File System
(HDFS) and MapReduce, which perform distributed processing using a single master and
multiple slave servers.

3.1.1 MapReduce:

MapReduce is a parallel programming model for writing distributed applications, devised
at Google for the efficient processing of large amounts of data (multi-terabyte datasets) on
large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner. The MapReduce program runs on Hadoop, which is an Apache open-source
framework.

A MapReduce program executes in three stages: the map stage, the shuffle stage, and the
reduce stage.

Figure 1: MapReduce Dataflow

A) Map stage:


The mapper's job is to process the input data. Generally the input data is in the
form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The
input file is passed to the mapper function line by line. The mapper processes the data and
creates several small chunks of data.

B) Reduce stage:

This stage is the combination of the shuffle stage and the reduce stage. The
reducer's job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.

3.1.2 Hadoop Distributed File System (HDFS):

The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity hardware.
It has many similarities with existing distributed file systems. However, the differences from
other distributed file systems are significant. It is highly fault-tolerant and is designed to be
deployed on low-cost hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.

HDFS follows the master-slave architecture and it has the following elements:

Figure 2: HDFS Architecture (the NameNode holds the metadata, such as file names and replicas; the client performs read/write and block operations; data is replicated across DataNodes)


A. NameNode:

The namenode runs on commodity hardware with the GNU/Linux operating system and
the namenode software. The system hosting the namenode acts as the master server and
performs the following tasks:

It manages the file system namespace.
It regulates clients' access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.

B. DataNode:

The datanode runs on commodity hardware with the GNU/Linux operating system and the
datanode software. For every node (commodity hardware/system) in a cluster, there is a
datanode. These nodes manage the data storage of their system.

Datanodes perform read-write operations on the file systems, as per client requests.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

3.2 Flume Architecture:


Flume is deployed as one or more agents. A Flume agent is a JVM process that hosts the
components through which events flow from an external source to the next destination. Each
agent contains three components: a source, a channel and a sink.

Figure 3: Flume Architecture (web server -> source -> channel -> sink -> HDFS, hosted within an agent)

Event:

Department of Computer Science Page 9


EFFECTVE SENTIMENT ANALYSIS OF TWITTER DATA USING HADOOP ECOSYSTEM 2017

A single packet of data passed through the system (Source -> Channel -> Sink) is called an
event. In log-file terminology, an event is a line of text followed by a newline character.
Source:
A Flume source is configured within an agent; it listens for events from an external source
(e.g. a web server), reads the data, translates it into events and handles failure situations. A
source does not know how to store the events, so after receiving enough data to produce a
Flume event, it sends the events to the channel to which it is connected.
Examples: Avro source, Thrift source, Twitter 1% source, etc.
Channel:
A channel is a transient store which receives the events from the source and buffers them till
they are consumed by sinks. It acts as a bridge between the sources and the sinks. These
channels are fully transactional and they can work with any number of sources and sinks.
Examples: JDBC channel, file channel, memory channel, etc.

Sink:
A sink stores the data into centralized stores like HBase and HDFS. It consumes the data
(Events) from the channels and delivers it to the destination. The destination of the sink might
be another agent or the central stores.
Examples: HDFS sink, HBase sink, Avro sink, Kafka sink, etc.
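
As a minimal sketch of how these three components are wired together in an agent
configuration file (the agent, source, channel and sink names here are illustrative and are not
the ones used later for the Twitter setup):

    # agent "agent1" with one source, one channel and one sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # a simple netcat source listening on a local port
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # an in-memory channel buffering up to 1000 events
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000

    # a logger sink that simply logs each event it consumes
    agent1.sinks.sink1.type = logger
    agent1.sinks.sink1.channel = ch1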

3.3 Need for Sentiment Analysis:

Sentiment analysis is extremely useful in social media monitoring, as it allows us to
gain an overview of the wider public opinion behind certain topics. Social media monitoring
tools like Brandwatch Analytics make that process quicker and easier than ever before. The
applications of sentiment analysis are broad and powerful. The ability to extract insights from
social data is a practice that is being widely adopted by organizations across the world. Shifts
in sentiment on social media have been shown to correlate with shifts in the stock market.
The Obama administration used sentiment analysis to gauge public opinion on policy
announcements and campaign messages ahead of the 2012 presidential election.

The ability to quickly understand consumer attitudes and react accordingly is something that
Expedia Canada took advantage of when they noticed that there was a steady increase in
negative feedback to the music used in one of their television adverts.

3.4 Sentiment Classifier [5]:

Department of Computer Science Page 10


EFFECTVE SENTIMENT ANALYSIS OF TWITTER DATA USING HADOOP ECOSYSTEM 2017

The tweets are broken down into tokens, and each token is assigned a polarity, which
is a floating-point number ranging from -1 to 1.

A. Positive Tweets:

Positive tweets are tweets which show a good or positive response towards something,
for example tweets such as "It was an inspiring movie!!!" or "Best movie ever."

B. Negative Tweets:

Negative tweets can be classified as tweets which show a negative response towards, or
opposition to, something, for example tweets such as "Waste of time" or "Worst movie ever."

C. Neutral Tweets:

Neutral tweets can be classified as tweets which neither support or appreciate anything
nor oppose or depreciate it. They also include tweets which state facts or theories, for
example tweets such as "The Earth is round."


CHAPTER 4
PROPOSED METHOD


4.1. Proposed Architecture:

Figure 4: My proposed system architecture (tweets from the Twitter streaming source are transferred to the Hadoop Distributed File System (HDFS); HiveQL queries are created and run over them via MapReduce; the output is transferred back to HDFS and stored in Hive warehouses of positive, negative and neutral tweets)


4.2 Algorithm Description:

To assign a sentiment score to the tweets I have used the lexicon-based approach [4].
Lexicon-based techniques work on the assumption that the collective polarity of a sentence is
the sum of the polarities of the individual words. Implementing this approach requires a
dictionary, i.e. a list of words classified as positive, negative or neutral. The dictionary used
here contains 8221 words, consisting of nouns, verbs, adjectives, etc.

1. We first convert the Twitter data from its raw form into a tabular format.
2. The text of each tweet stored in the table is then split into its individual component
words (a sketch of this step is given after the list).
3. The words are looked up in the dictionary and the corresponding polarities are fetched
and stored in a second table.
4. If the sum of the polarities of these words is positive, a positive sentiment value is
assigned to the tweet. If the sum is negative, a negative sentiment value is assigned.
Otherwise the tweet is assigned a neutral sentiment value.
5. These sentiment values are stored in a new table together with the tweets.
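
As a rough HiveQL sketch of the word-splitting step (step 2), assuming the raw tweets have
already been loaded into a hypothetical table tweets(id, text):

    -- pair each tweet id with every lower-cased word of its text
    SELECT id, lower(word) AS word
    FROM tweets
    LATERAL VIEW explode(split(text, ' ')) w AS word;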

Due to infrastructure constraints I have implemented a single-node Hadoop installation.
However, the distributed nature of the project lies in the storage of the extracted tweets. The
idea is that this approach could be scaled up in the following two ways:

The tweets could be extracted on multiple compute systems, brought together in one
system and processed using the lexicon-based approach.
Alternatively, the Twitter data could be extracted on multiple compute systems, and the
tasks involved in computing the sentiment of the tweets could be distributed among the
individual compute systems.


4.3 Execution Methodology:


As an initial step, we need to collect the continuous flow of the Twitter stream. To
solve this problem statement, the following steps need to be executed in order.
As a case study, I chose the Kotlin programming language for sentiment analysis, collecting
some tweets using the Twitter API and then applying Hive SQL to categorize the tweets into
positive, negative and neutral classes based on public sentiment.

4.3.1. Create a Twitter application using the Twitter API and generate the access tokens and
consumer key:

Figure 5: Snapshot of creating a new twitter application


Figure 6: Snapshot of required keys and access tokens

4.3.2. Connect to Twitter using Apache Flume to receive the streaming data, which is in
JSON format:

Figure 7: Snapshot of the Flume configuration file for Twitter data
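
Since the configuration itself appears only as a screenshot, the following is a hedged sketch
of what such a Flume agent configuration typically looks like when the Cloudera
TwitterSource is used. The credential values, keywords and HDFS path are placeholders, not
the actual values used in this project.

    # agent "TwitterAgent": Twitter source -> memory channel -> HDFS sink
    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # Twitter streaming source (credentials and keywords are placeholders)
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
    TwitterAgent.sources.Twitter.accessToken = <access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
    TwitterAgent.sources.Twitter.keywords = <comma-separated keywords>

    # in-memory channel buffering the incoming events
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100

    # HDFS sink writing the raw JSON tweets into HDFS
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000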


Figure 8: Snapshot of Twitter data in JSON format

4.3.3. Configure Flume with HDFS so that streaming data can be stored in HDFS:

Figure 9: Snapshot of Twitter data stored in HDFS


4.3.4. Applying Hive queries on the Twitter data set:


A. Create table for the incoming data (Data Definition scripts)

The data definition script creates an external table and automatically imports the specified
attributes from HDFS by using the Cloudera [6] JSON SerDe (Serializer and Deserializer).
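
Since the script itself is not reproduced here, the following is a rough sketch of what such a
statement might look like; the jar path, table name and selected attributes are illustrative
assumptions, not the exact ones used in the project.

    -- register the Cloudera JSON SerDe (jar path is illustrative)
    ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

    -- external table over the raw JSON tweets written by Flume
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_tweets (
      id BIGINT,
      created_at STRING,
      text STRING,
      `user` STRUCT<screen_name:STRING, name:STRING, friends_count:INT>
    )
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/user/flume/tweets';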

Figure 10: Snapshot showing the content of the created table

B. Use the dictionary to compute the sentiment of each tweet by analysing the number
of positive, negative and neutral words in the tweet. Based on this number, the sentiment of
the tweet is decided.


The sentiment-computation scripts, when run, compute the sentiment of each tweet by
comparing the numbers of negative, positive and neutral words it contains; from this
comparison the polarity of the tweet is decided.
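
Since the scripts themselves appear only as screenshots, the HiveQL below is a hedged
sketch of the word-level step. It assumes the raw_tweets table sketched above and a
hypothetical dictionary table dictionary(word STRING, polarity FLOAT) loaded from the
8221-word list.

    -- explode each tweet into its individual lower-cased words
    CREATE TABLE tweet_words AS
    SELECT id, lower(word) AS word
    FROM raw_tweets
    LATERAL VIEW explode(split(text, ' ')) w AS word;

    -- attach a polarity to every word; unknown words count as neutral (0)
    CREATE TABLE word_polarity AS
    SELECT tw.id, tw.word, COALESCE(d.polarity, 0) AS polarity
    FROM tweet_words tw
    LEFT OUTER JOIN dictionary d ON tw.word = d.word;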

Once the per-word polarities are available, a final script assigns each tweet a sentiment of
positive, negative or neutral based on the sum of the polarities of its words. Finally, all the
processed Twitter data is exported as necessary.
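
A hedged sketch of that final step, continuing from the word_polarity table assumed above:

    -- sum the word polarities per tweet and label the result
    CREATE TABLE tweet_sentiment AS
    SELECT id,
           CASE
             WHEN SUM(polarity) > 0 THEN 'positive'
             WHEN SUM(polarity) < 0 THEN 'negative'
             ELSE 'neutral'
           END AS sentiment
    FROM word_polarity
    GROUP BY id;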


Figure 11: Snapshot of the dictionary showing the sentiment of each word

Figure 12: Snapshot showing the sentiment of each word in particular tweets


Figure 13: Snapshot showing the sentiment against each tweet ID

4.3.5. Visualization:

For graphical visualization and presentation of the Hive records I am using the Microsoft
Excel Power View ad-hoc reporting tool. The records are first exported in CSV format from
HDFS storage, and different graphical tools such as pie charts and bar diagrams are then
applied on top of them.
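
A minimal sketch of how the classified tweets might be written out of Hive as comma-separated
text for Excel, assuming the tweet_sentiment table from the earlier sketch (the output
directory is illustrative):

    -- write id,sentiment pairs as CSV into an HDFS directory
    INSERT OVERWRITE DIRECTORY '/user/hive/output/tweet_sentiment_csv'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT id, sentiment
    FROM tweet_sentiment;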


CHAPTER 5
RESULTS AND DISCUSSION


5.1 OUTPUT
Once the sentiment of the tweets has been calculated using the Hive query language, the
output is produced in the following tabular form:

Figure 14: Snapshot showing the sentiment against each tweet ID


Figure 15: Snapshot of tweet IDs with their corresponding sentiments in xlsx form


Figure 16: Snapshot of a pie-chart representation of the different sentiment groups in Power View


5.2 Discussion:
The above experimental setup for sentiment analysis of Twitter data using different
frameworks of the Hadoop ecosystem shows its potential for public sentiment analysis. This
methodology can be effectively adopted by any company to analyse the popularity (a kind of
advertisement study) of a newly released product when social media is used as an alternative
advertising channel.


CHAPTER 6
CONCLUSION


Conclusion

Twitter posts are a very important source of opinion on different issues and topics. They
can give keen insight into a topic and can be a good basis for analysis, which in turn can help
decision making in various areas. Apache Hadoop is one of the best options for analysing
Twitter posts. Once the system is set up using Flume and Hive, it supports the analysis of a
diverse range of topics simply by changing the keywords in the query. It also performs the
analysis on real-time data, which makes it all the more useful.

The field of data analysis has a vast future. In future work I would like to perform real-time
sentiment analysis using deep learning and machine learning algorithms. I would also like to
implement this whole system in a multi-node environment.


References

Research paper:

1) Sunil B. Mane, Yashwant Sawant, Saif Kazi, Vaibhav Shinde, "Real Time Sentiment
Analysis of Twitter Data Using Hadoop", (IJCSIT) International Journal of Computer
Science and Information Technologies, Vol. 5 (3), 2014, 3098-3100, ISSN 0975-9646.
2) Mahalakshmi R, Suseela S, "Big-SoSA: Social Sentiment Analysis and Data
Visualization on Big Data", International Journal of Advanced Research in Computer
and Communication Engineering, Vol. 4, Issue 4, April 2015, 304-306, ISSN 2278-1021.
3) Gautam Goswami, "Effective Image Analysis on Twitter Streaming using Hadoop
Ecosystem on Amazon Web Service EC2", International Journal of Advanced Research
in Computer Science and Software Engineering, Volume 5, Issue 9, 2015.
4) Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll and Manfred Stede,
"Lexicon-Based Methods for Sentiment Analysis".
5) Ajinkya Ingle, Anjali Kante, Shriya Samak, Anita Kumari, "Sentiment Analysis of
Twitter Data Using Hadoop", International Journal of Engineering Research and
General Science, Volume 3, Issue 6, November-December 2015, ISSN 2091-2730.

Web:

6) https://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
7) Hive wiki at http://www.apache.org/hadoop/hive.
