CHAPTER 1
INTRODUCTION
Today, textual data on the internet is growing at a rapid pace. Different industries are
trying to use this huge volume of text to extract people's views towards their products, and
social media is a vital source of information in this case. It is impossible to analyse such a
large amount of data manually. A large number of social media websites enable users to
contribute, modify, and grade content, giving users an opportunity to express their personal
opinions about specific topics. Examples of such websites include blogs, forums, product
review sites, and social networks. In my case, Twitter data is used.
The focus of my project is to assign a polarity to each tweet, i.e. to determine whether the
author expresses a positive or negative opinion.
Twitter, one of the largest social media sites, receives millions of tweets every day,
amounting to petabytes of data per year. For development purposes, Twitter provides a
Streaming API that gives developers access to roughly 1% of the tweets posted at a given
time, filtered by a particular keyword. The object about which we want to perform sentiment
analysis is submitted to the Twitter API, which mines the stream and returns only the tweets
related to that object.
1.3 Hadoop:
Hadoop is an open-source Java framework for writing and running distributed applications
that process large amounts of data. (Contrary to a common backronym, Hadoop is not an
acronym; the name comes from a toy elephant belonging to the son of its creator, Doug
Cutting.) It stores, accesses, and processes big data in a distributed fashion at low cost, with
a high degree of fault tolerance and high scalability. The Hadoop ecosystem includes
modules such as MapReduce, Flume, Hive, Pig, Sqoop, Oozie, ZooKeeper, and HBase for
different functionality.
1.4 Flume:
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data into the Hadoop Distributed File
System (HDFS). It has a simple and flexible architecture based on streaming data flows. It is
robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery
mechanisms. It uses a simple, extensible data model that allows for online analytic
applications.
1.5 Hive:
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It provides an
SQL-like query language called HiveQL, which allows data stored in HDFS to be
summarised, queried, and analysed without writing low-level MapReduce code.
CHAPTER 2
LITERATURE REVIEW
Sentiment analysis has been handled as a Natural Language Processing task at many
levels of granularity. Microblog data such as Twitter, on which users post real-time reactions
to and opinions about everything, poses new and different challenges.
CHAPTER 3
THEORETICAL BACKGROUND
Structurally, Hadoop consists of two core components, the Hadoop Distributed File System
(HDFS) and MapReduce, which perform distributed processing using a single master and
multiple slave servers.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage,
and the reduce stage.
A) Map stage:
The mapper's job is to process the input data. Generally the input data is a file or
directory stored in the Hadoop Distributed File System (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data and creates several small
chunks of data.
B) Reduce stage:
This stage is the combination of the shuffle stage and the reduce stage. The
reducer's job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.
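The stages above can be illustrated with a minimal, hypothetical sketch in plain Python (not Hadoop code): the mapper emits key-value pairs, the shuffle stage sorts and groups them by key, and the reducer aggregates each group. Here the job counts hashtag occurrences in tweets; all names and data are illustrative.

```python
# Sketch of the three MapReduce stages using plain Python.
from itertools import groupby

def mapper(line):
    # Map stage: emit (hashtag, 1) for every hashtag token in a tweet.
    for word in line.split():
        if word.startswith("#"):
            yield word.lower(), 1

def reducer(key, values):
    # Reduce stage: aggregate all counts for one key.
    return key, sum(values)

tweets = ["Loved it #movie #best", "Terrible plot #movie"]
mapped = [pair for line in tweets for pair in mapper(line)]
# Shuffle stage: sort and group intermediate pairs by key.
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
counts = dict(reducer(k, (v for _, v in grp)) for k, grp in shuffled)
print(counts)  # {'#best': 1, '#movie': 2}
```

In real Hadoop the mapper and reducer run on different slave nodes and the shuffle happens over the network, but the data flow is the same.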
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity hardware.
It has many similarities with existing distributed file systems; however, the differences are
significant. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware,
and provides high-throughput access to application data, making it suitable for applications
with large datasets.
HDFS follows a master-slave architecture and has the following elements:
A. NameNode:
The namenode is commodity hardware that contains the GNU/Linux operating system and
the namenode software. It is software that can run on commodity hardware. The system
hosting the namenode acts as the master server and performs the following tasks: it manages
the file system namespace, regulates clients' access to files, and executes file system
operations such as opening, closing, and renaming files and directories.
B. DataNode:
The datanode is commodity hardware with the GNU/Linux operating system and the
datanode software. For every node (commodity hardware/system) in a cluster, there is a
datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file system, as per client requests.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
A Flume agent moves data from a source, through a channel, to a sink such as HDFS. Its
components are described below.
Event:
A single packet of data passed through the system (Source -> Channel -> Sink) is called an
event. In log file terminology, an event is a line of text followed by a newline character.
Source:
A Flume source is configured within an agent and listens for events from an external source
(e.g. a web server); it reads data, translates it into events, and handles failure situations. A
source does not know how to store the events, so after receiving enough data to produce a
Flume event, it sends the events to the channel to which it is connected.
Examples: Avro source, Thrift source, Twitter 1% source, etc.
Channel:
A channel is a transient store which receives the events from the source and buffers them till
they are consumed by sinks. It acts as a bridge between the sources and the sinks. These
channels are fully transactional and they can work with any number of sources and sinks.
Examples: JDBC channel, file system channel, memory channel, etc.
Sink:
A sink stores the data in centralised stores like HBase and HDFS. It consumes the data
(events) from the channels and delivers it to the destination, which might be another agent
or the central stores.
Examples: HDFS sink, HBase sink, Avro sink, Kafka sink, etc.
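As an illustration only, a minimal Flume agent wiring these three components together might look like the following configuration sketch (the agent and component names here are arbitrary, and a netcat source and logger sink are chosen purely for simplicity):

```
# Name the components of agent a1.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a TCP port; each line becomes an event.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffers events in memory until the sink consumes them.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: writes each consumed event to the agent's log.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

The same source -> channel -> sink pattern applies when the source is Twitter and the sink is HDFS, as used later in this project.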
The ability to quickly understand consumer attitudes and react accordingly is something that
Expedia Canada took advantage of when they noticed that there was a steady increase in
negative feedback to the music used in one of their television adverts.
The tweets are broken down into tokens, and each token is assigned a polarity, which
is a floating-point number ranging from -1 to 1.
A. Positive Tweets:
Positive tweets are tweets which show a good or positive response towards something.
For example, tweets such as "It was an inspiring movie!!!" or "Best movie ever."
B. Negative Tweets:
Negative tweets can be classified as tweets which show a negative response or opposition
towards something. For example, tweets such as "Waste of time" or "Worst movie ever."
C. Neutral Tweets:
Neutral tweets can be classified as tweets which neither support or appreciate anything
nor oppose or depreciate it. This class also includes tweets which state facts or theories.
For example, tweets such as "The Earth is round."
CHAPTER 4
PROPOSED METHOD
Figure: Architecture of the proposed system. Tweets collected via the Twitter Streaming
API are transferred to HDFS, processed with MapReduce (tweets and hashtags), and
analysed by creating and running Hive queries; the output (positive, negative, and neutral
tweets) is stored in the Hive warehouse.
To assign a sentiment score to the tweets, I have used the lexicon-based approach [4].
Lexicon-based techniques work on the assumption that the collective polarity of a sentence is
the sum of the polarities of the individual words. Implementing this approach requires a
dictionary: a list of words classified as positive, negative, or neutral. The dictionary used
here contains 8,221 words, including nouns, verbs, adjectives, etc.
We first converted the Twitter data from its raw form into a tabular format.
The text part of the Tweet stored in the table is then split into its individual
component words.
The words are looked up in the dictionary and the corresponding polarity is fetched
and stored in a second table.
If the sum of the polarities of these words is positive, a positive sentiment value is
assigned to the tweet. If the sum is negative, a negative sentiment value is assigned.
Otherwise, the tweet is assigned a neutral sentiment value.
These sentiment values are stored in a new table with tweets.
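The steps above can be sketched in plain Python. The tiny dictionary here is a hypothetical stand-in for the real 8,221-word lexicon, and the word-splitting is deliberately simple:

```python
# Minimal sketch of lexicon-based scoring: each word's polarity is a
# float in [-1, 1]; the tweet's sentiment is the sign of the sum.
polarity = {"inspiring": 0.8, "best": 0.7, "waste": -0.6, "worst": -0.9}

def tweet_sentiment(text):
    # Strip simple punctuation, lowercase, and sum known polarities.
    score = sum(polarity.get(w.strip("!.,").lower(), 0.0)
                for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(tweet_sentiment("It was an inspiring movie!!!"))  # positive
print(tweet_sentiment("Waste of time"))                 # negative
print(tweet_sentiment("The Earth is round."))           # neutral
```

In the actual system this same logic is expressed as Hive queries over tables rather than as in-memory Python.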
The tweets could be extracted on multiple compute systems, brought together on one
system, and processed using the lexicon-based approach.
Alternatively, Twitter data could be extracted on multiple compute systems, and the tasks
involved in computing the sentiment of the tweets could be distributed among the
individual compute systems.
4.3.1. Create a Twitter application using the Twitter API and generate the access tokens and
consumer key:
4.3.2. Connect to Twitter to receive the streaming data, which is in JSON format, using
Apache Flume:
4.3.3. Configure Flume with HDFS so that the streaming data can be stored in HDFS:
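A hedged sketch of such a Flume configuration, following the widely used Cloudera Twitter-source example (the agent name, HDFS path, keyword, and credential placeholders are all assumptions to be replaced with real values):

```
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Source: Cloudera Twitter source, authenticated with the keys and
# tokens generated in step 4.3.1, filtered by a keyword.
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = <keyword>

# Sink: writes the raw JSON tweets into HDFS.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Channel: in-memory buffer between source and sink.
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
```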
The above code snippets show how to create an external table and automatically import the
specified attributes from HDFS by using the Cloudera [6] JSON SerDe (Serializer and
Deserializer).
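As a hedged reconstruction of such a table definition (the JAR path, table name, columns, and HDFS location are illustrative assumptions), the external table might be created as follows:

```sql
-- Register the SerDe and map the raw JSON tweets in HDFS to columns.
ADD JAR /path/to/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets/';
```

Because the table is external, dropping it later removes only the metadata, not the raw tweet files written by Flume.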
B. Use the dictionary to compute the sentiment of each tweet by analysing the number
of positive, negative, and neutral words in it. Based on this count, the sentiment of the
tweet is decided.
The above code snippets show that these scripts, when run, compute the sentiment of each
tweet by comparing the number of negative, positive, and neutral words; from this, the
polarity of the tweet is decided.
Once the polarity has been set, the final script does the simple job of labelling the
tweets as positive, negative, or neutral. Finally, all the processed Twitter data is exported
as necessary.
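A hedged HiveQL sketch of this computation (table and column names are assumptions; the dictionary table is assumed to hold a word and its numeric polarity): each tweet is split into words, the words are joined against the dictionary, and the polarities are summed per tweet.

```sql
-- Split each tweet's text into one row per word.
CREATE TABLE tweet_words AS
SELECT id, word
FROM tweets
LATERAL VIEW explode(split(lower(text), '\\s+')) w AS word;

-- Join against the dictionary and sum polarities per tweet;
-- words not in the dictionary contribute 0.
CREATE TABLE tweet_score AS
SELECT t.id, SUM(COALESCE(d.polarity, 0)) AS score
FROM tweet_words t
LEFT OUTER JOIN dictionary d ON t.word = d.word
GROUP BY t.id;

-- Label each tweet from the sign of its score.
CREATE TABLE tweet_sentiment AS
SELECT id,
       CASE WHEN score > 0 THEN 'positive'
            WHEN score < 0 THEN 'negative'
            ELSE 'neutral' END AS sentiment
FROM tweet_score;
```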
Figure 11: Snapshot of the dictionary, showing the sentiment assigned to each word
4.3.5. Visualization:
For graphical visualisation and presentation of the Hive data records, I am using Microsoft
Excel Power View, an ad-hoc reporting tool. First, the records are imported in CSV format
from HDFS storage, and then different graphical tools such as pie charts and bar diagrams
are applied on top of them.
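A hypothetical sketch of the export step (the output directory and table names are assumptions): the per-tweet sentiment is written as comma-separated text in HDFS, ready to be downloaded and opened in Excel Power View.

```sql
-- Write id,sentiment pairs as CSV files under an HDFS directory.
INSERT OVERWRITE DIRECTORY '/user/hive/output/sentiment'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT id, sentiment
FROM tweet_sentiment;
```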
CHAPTER 5
RESULTS AND DISCUSSION
5.1 OUTPUT
Once the sentiment of the tweets has been calculated using the Hive query language, the
output is produced in the following tabular form:
Figure 15: Snapshot of Twitter IDs with their corresponding sentiment in .xlsx form
Figure 16: Snapshot of a pie chart representation of the different sentiment groups in Power View
5.2 Discussion:
The above experimental setup for sentiment analysis of Twitter data using different Hadoop
frameworks shows its potential for public sentiment analysis. This methodology can be
effectively adopted by any company to analyse the popularity (a kind of advertisement
feedback) of a newly released product, if social media is used as an alternative advertising
channel.
CHAPTER 6
CONCLUSION
Conclusion
Twitter posts are a very important source of opinion on different issues and topics. They
can give keen insight into a topic and can be a good basis for analysis, which in turn can
support decision making in various areas. Apache Hadoop is one of the best options for
analysing Twitter posts. Once the system is set up using Flume and Hive, it supports
analysis of a diversity of topics simply by changing the keywords in the query. It also
performs the analysis on real-time data, which makes it all the more useful.
The future of the data analysis field is vast. In future work, I would like to do real-time
sentiment analysis using deep learning and machine learning algorithms. I would also like
to implement the whole system in a multi-node environment.
References
Research paper:
Web: