
Hubs and Authorities Computation on Twitter Data Using MapReduce
Mike McGrath mmcgrath@umd.edu
College of Information Studies
University of Maryland
College Park, MD 20742
ABSTRACT
In this paper, I describe an implementation of the Hubs and Authorities link analysis algorithm in Hadoop, and show the results of applying this algorithm to a mention graph collected from the Twitter microblogging service. In the context of Twitter, each user's Twitter feed is treated as a node in the graph, and each retweet or mention of the user by another user is treated as a link to the feed.

The performance of the algorithm on the Twitter mention graph and a large web graph dataset is analyzed, and the results of the Hubs and Authorities calculation on the Twitter data are presented.

1. INTRODUCTION
The last decade has seen a phenomenal growth in the proliferation and use of social networking sites on the Web. One of the most popular of these sites is Twitter, a microblogging website that allows users to post messages ("tweets") of 140 characters or less to a feed that can be followed by other users. On Twitter, an informal syntax has developed for mentioning other users and reposting their messages. Common practice when mentioning other users in a tweet is to precede user names with the at-sign (@). When reposting another user's tweet in one's own feed ("retweeting"), it is customary to precede the reposted message with the acronym RT and then mention the original poster using the @user convention. This network of posts, reposts, and mentions forms a directed graph, which lends itself well to the sort of link analysis normally performed on networks of hyperlinked pages on the web (e.g. PageRank, HITS, SALSA). In this paper, I perform such an analysis on a set of Twitter data using Kleinberg's Hubs and Authorities algorithm [1] to generate a ranked list of authoritative Twitter users based on the number of times they are mentioned or retweeted by other users. The algorithm itself has been implemented using Apache Hadoop, an open-source implementation of the MapReduce framework originally described by Dean and Ghemawat [2]. Hadoop allows for rapid development of software for performing distributed computation on large datasets.

2. BACKGROUND

2.1 Hubs and Authorities Algorithm
Hubs and Authorities is an iterative link analysis algorithm designed to be applied to a network of hyperlinked web pages. The algorithm assigns two values to each page in the network: a hub value based on the value of outgoing links, and an authority value based on the values of incoming links. Hubs and authorities exist in a mutually reinforcing relationship. Good hubs are those pages that link to many authoritative pages, and good authorities are those pages that are linked to by good hub pages. The authority value is designed to measure the quality of the actual content of the page, while the hub value measures the quality of links on the page. The algorithm can be specified as follows:

• Initialize the hub and authority values for each node to 1
• For k steps, where k is some natural number:
  o For each node: update its authority value by adding the sum of all hub values of incoming links to the node's current authority value
  o For each node: update its hub value by adding the sum of all authority values of outgoing links to the node's current hub value
  o Normalize each authority value by dividing by the square root of the quadratic sum of all authority values
  o Normalize each hub value by dividing by the square root of the quadratic sum of all hub values

In the context of Twitter, we can view each Twitter user's message feed as a node, and each mention or retweet by another user as a link to that node. Good Twitter hubs, then, are users who frequently mention or retweet highly authoritative Twitter users, and Twitter authorities are those users that are often mentioned or retweeted by highly ranked Twitter hubs.

2.2 MapReduce, Apache Hadoop, and Cloud9
The Hubs and Authorities analysis performed for this paper was implemented using the Hadoop framework, using components from the Cloud9 library.

MapReduce is a software framework and programming paradigm developed at Google to simplify distributed computing on very large datasets across clusters of commodity hardware. In MapReduce, a set of input key-value pairs is fed through a map function, which is applied to each pair to generate a set of intermediate key-value pairs. These intermediate key-value pairs are then fed to a reduce function, which performs some aggregate operation on the set of all values belonging to a particular key. The advantage of the MapReduce framework is that it allows map tasks to run in parallel, handles distribution of the intermediate key-value pairs to reducers, and then allows all reduce tasks to run in parallel. The framework handles task scheduling and distribution of data amongst the nodes in the cluster, saving the developer considerable time and effort.
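To make the programming model concrete, the toy job below (an illustration only, not part of this paper's implementation) counts how many times each user is mentioned: the map function emits one intermediate pair per mention, and the reduce function sums the values for each key. It uses the original org.apache.hadoop.mapred API and assumes hypothetical plain-text input lines of the form "mentioner mentioned".

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Toy MapReduce example: count mentions per user from lines of the
// (assumed) form "mentioner mentioned".
public class MentionCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      String[] users = line.toString().split("\\s+");
      // Emit one intermediate key-value pair per mention.
      out.collect(new Text(users[1]), new IntWritable(1));
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text user, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      int sum = 0;
      while (counts.hasNext()) { sum += counts.next().get(); }
      // The framework groups all values for a key at a single reducer.
      out.collect(user, new IntWritable(sum));
    }
  }
}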
Hadoop is an open-source implementation of the MapReduce
framework described by Dean and Ghemawat. It was originally
developed at Yahoo but now is maintained under the auspices of
the Apache Foundation.
Cloud9 is a library for Hadoop developed at the University of
Maryland, designed to serve as both a teaching tool and to
support research in data-intensive processing, particularly text
processing. Cloud9 provides a number of data structures that
proved useful in the implementation of Hubs and Authorities in
Hadoop. Documentation for the Cloud9 library exists at
http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/
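Before turning to the input data and the distributed implementation, the update and normalization steps from Section 2.1 can be summarized in a single-machine sketch. This is an illustration only, assuming the graph fits in memory as adjacency lists; it is not the Hadoop code described in Section 3, although, like that implementation, both updates use the values from the previous iteration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One Hubs and Authorities iteration (Section 2.1) over in-memory adjacency
// lists: incoming.get(n) lists the nodes linking to n, and outgoing.get(n)
// lists the nodes that n links to.
public class HitsIteration {

  public static void iterate(Map<String, List<String>> incoming,
                             Map<String, List<String>> outgoing,
                             Map<String, Double> authority,
                             Map<String, Double> hub) {
    Map<String, Double> newAuthority = new HashMap<String, Double>();
    Map<String, Double> newHub = new HashMap<String, Double>();

    // Update step: add the summed hub values of incoming links to the current
    // authority value, and the summed authority values of outgoing links to
    // the current hub value.
    for (String n : authority.keySet()) {
      double a = authority.get(n);
      for (String i : incoming.get(n)) a += hub.get(i);
      newAuthority.put(n, a);
    }
    for (String n : hub.keySet()) {
      double h = hub.get(n);
      for (String o : outgoing.get(n)) h += authority.get(o);
      newHub.put(n, h);
    }

    // Normalization step: divide each value by the square root of the
    // quadratic sum of values of the same kind.
    double authNorm = 0.0, hubNorm = 0.0;
    for (double a : newAuthority.values()) authNorm += a * a;
    for (double h : newHub.values()) hubNorm += h * h;
    authNorm = Math.sqrt(authNorm);
    hubNorm = Math.sqrt(hubNorm);

    authority.clear();
    hub.clear();
    for (Map.Entry<String, Double> e : newAuthority.entrySet())
      authority.put(e.getKey(), e.getValue() / authNorm);
    for (Map.Entry<String, Double> e : newHub.entrySet())
      hub.put(e.getKey(), e.getValue() / hubNorm);
  }
}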

2.3 Input Dataset Characteristics


The Twitter dataset used for processing consists of a series of Hadoop sequence files with key-value pairs in the following format:

(<mentioner>, <mentioned>) <times>

where the key is a Cloud9 PairOfStrings object, and the value is a Hadoop IntWritable object. The mention graph was generated from a set of roughly 29 million tweets posted between 2006 and 2009. The tweets were culled via the Twitter API and are not constant over time. They are essentially the previous N tweets collected from each user in some subset of Twitter users. The mention graph occupies about 53 MB on the cluster distributed file system.

Because the Twitter mention graph is not especially large, I also ran the algorithm on a 12GB web graph, in order to gauge performance on a large dataset.

2.4 Execution Environment
The experiments described in this paper were performed on a 416-node cluster of commodity machines provided by Google and IBM as part of their Academic Cloud Computing Initiative. The specifications of the individual compute nodes have not been provided, but each node on the cluster has been configured to run two map tasks or two reduce tasks at a time, for a total system capacity of 828 concurrent map tasks or reduce tasks.

2.5 Previous Research
Others have attempted to implement the Hubs and Authorities algorithm using the MapReduce framework. Dong [3] produced a three-step implementation of the algorithm using Hadoop that provided the inspiration for the implementation discussed in this paper; however, his implementation suffered from some efficiency drawbacks. All key-value pairs in his implementation were passed from job to job as text rather than sequence files. His implementation also did not make use of combiners in strategic areas that could have boosted performance, and he performed file system reads in the map phase of certain jobs that would have been more efficiently performed during job configuration.

An alternate system for ranking Twitter users exists on the web at http://trst.me. This site, built by a team of data analysts known as Infochimps, performs a PageRank-like link analysis on Twitter data comprised of 1.6 billion tweets collected since 2006. Their implementation is based on a graph of Twitter follower links and is described in more detail at http://trst.me/about.

Figure 1: MapReduce Implementation of Hubs & Authorities Algorithm

3. DESIGN AND IMPLEMENTATION
The Hubs and Authorities algorithm was implemented as a set of five individual Hadoop jobs, as shown in Figure 1. The first two jobs (Auth Formatter and Hub Formatter) read in the Twitter mention graph data and output it in a format that can be used for computation by the actual hubs and authorities calculation job. The third job (HubsAndAuthorities) updates the hub and authority values for each node, the fourth job (Normalization Step 1) finds the quadratic sums of all of the hub and authority values in the graph, and the fifth job (Normalization Step 2) completes the normalization task by dividing each hub and authority value by the square root of the correct quadratic sum.

3.1 Formatters
One formatter job produces hub weight data for each node, and the other produces authority weight data for each node. Output of these formatters is in the following format for authority data:

<name> (A, (<auth rank>, [incoming links]))

where the key is a Hadoop Text object storing the node name, and the value is a Cloud9 Tuple object with the left value set to the symbol "A" to indicate that this is authority data. The right value of this tuple is another Tuple, whose left value is a DoubleWritable representing the authority weight for this node, and the right value is a Cloud9 ArrayListWritable object containing the names of all of the incoming links to the node. I ran the Authority formatter twice. On one execution, I initialized each node's authority value to its number of retweets or mentions provided in the mention graph. On the second execution, I simply initialized all authority values to 1.0.

Similarly, hub data is formatted like so:

<name> (H, (<hub rank>, [outgoing links]))

where the key is once again the node name, and the value is a Cloud9 Tuple object with the left value set to the symbol "H," indicating that this is hub data. The right value of this tuple is another Tuple, whose left value is a DoubleWritable representing the hub weight for this node, and the right value is a Cloud9 ArrayListWritable object containing the names of all of the outgoing links from this node. Hub values were initialized to 1.0 in both executions of the algorithm.
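As an illustration of the formatter logic, a sketch of an authority formatter is shown below. This is not the actual implementation: the real jobs consumed PairOfStrings keys and produced Cloud9 Tuple and ArrayListWritable records, whereas the sketch assumes each edge has been flattened into a tab-separated Text key and writes a simplified "A|<value>|<comma-separated links>" Text record.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Authority formatter sketch. Input edges are assumed to be flattened to a
// Text key "mentioner<TAB>mentioned" with an IntWritable mention count;
// output is "A|<initial authority>|<comma-separated incoming links>".
public class AuthFormatter {

  public static class Map extends MapReduceBase
      implements Mapper<Text, IntWritable, Text, Text> {
    public void map(Text edge, IntWritable count,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] users = edge.toString().split("\t");
      // Key on the mentioned user; the mentioner is an incoming link.
      output.collect(new Text(users[1]), new Text(users[0] + "|" + count.get()));
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text user, Iterator<Text> mentions,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      StringBuilder links = new StringBuilder();
      long totalMentions = 0;
      while (mentions.hasNext()) {
        String[] parts = mentions.next().toString().split("\\|");
        if (links.length() > 0) links.append(",");
        links.append(parts[0]);
        totalMentions += Long.parseLong(parts[1]);
      }
      // Seed with the mention count (first run) or with 1.0 (second run).
      double initialAuthority = totalMentions;   // or 1.0
      output.collect(user, new Text("A|" + initialAuthority + "|" + links));
    }
  }
}

The hub formatter would be structured the same way, keyed on the mentioner and collecting outgoing links, with hub values always seeded to 1.0.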
3.2 Hub and Authority Update

3.2.1 Map
The HubsAndAuthorities job reads in the output from the formatters or from a previous iteration of the algorithm. The map task reads in each key-value pair from the input and outputs the following for each input node n:

<n> (A, (<current auth. value>, []))
<n> (A, (-1.0, [incoming links]))
for each incoming node i in the incoming link list:
    <i> (H, (<auth. value of n>, []))

<n> (H, (<current hub value>, []))
<n> (H, (-1.0, [outgoing links]))
for each outgoing node o in the outgoing link list:
    <o> (A, (<hub value of n>, []))

The hub and authority values of -1 are dummy values assigned to the output key-value pairs containing the incoming/outgoing adjacency lists, signaling to the reducer that these key-value pairs should not be used to update the authority or hub values.
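A sketch of what this map step might look like is shown below. It is not the actual code, which operated on Cloud9 Tuple and ArrayListWritable values; for readability it assumes the same simplified "TAG|value|comma-separated links" Text encoding used in the formatter sketch above.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the HubsAndAuthorities map step (Section 3.2.1), assuming
// simplified "TAG|value|links" Text records instead of Cloud9 Tuples.
public class HubsAndAuthoritiesMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  public void map(Text node, Text record,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] parts = record.toString().split("\\|", 3);
    double value = Double.parseDouble(parts[1]);
    String[] links = parts[2].isEmpty() ? new String[0] : parts[2].split(",");

    if (parts[0].equals("A")) {
      // Preserve the node's current authority value, and carry its incoming
      // links with a dummy value of -1.0.
      output.collect(node, new Text("A|" + value + "|"));
      output.collect(node, new Text("A|-1.0|" + parts[2]));
      // The node's authority value contributes to the hub value of every
      // node that links to it.
      for (String i : links) {
        output.collect(new Text(i), new Text("H|" + value + "|"));
      }
    } else {
      // Preserve the node's current hub value, and carry its outgoing links.
      output.collect(node, new Text("H|" + value + "|"));
      output.collect(node, new Text("H|-1.0|" + parts[2]));
      // The node's hub value contributes to the authority value of every
      // node it links to.
      for (String o : links) {
        output.collect(new Text(o), new Text("A|" + value + "|"));
      }
    }
  }
}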

3.2.2 Reduce
The reducers sum all authority and hub values generated by the mappers for each node and reconstruct the authority and hub data structure format described in Section 3.1.
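Under the same simplified encoding, the reduce step might be sketched as follows (again an illustration, not the actual Cloud9-based code): contributions and the preserved current value are summed per tag, and the dummy-valued records supply the adjacency lists needed to rebuild the record.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the HubsAndAuthorities reduce step (Section 3.2.2), using the
// same simplified "TAG|value|links" Text encoding as the map sketch above.
public class HubsAndAuthoritiesReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text node, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    double authSum = 0.0, hubSum = 0.0;
    String inLinks = "", outLinks = "";

    while (values.hasNext()) {
      String[] parts = values.next().toString().split("\\|", 3);
      double v = Double.parseDouble(parts[1]);
      if (v < 0) {
        // Dummy-valued records only carry the adjacency lists.
        if (parts[0].equals("A")) { inLinks = parts[2]; } else { outLinks = parts[2]; }
      } else if (parts[0].equals("A")) {
        authSum += v;  // current authority value plus hub contributions from incoming links
      } else {
        hubSum += v;   // current hub value plus authority contributions from outgoing links
      }
    }

    // Rebuild the two per-node records in the Section 3.1 format.
    output.collect(node, new Text("A|" + authSum + "|" + inLinks));
    output.collect(node, new Text("H|" + hubSum + "|" + outLinks));
  }
}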
3.3 Normalization

3.3.1 Step 1
In the first step of the normalization process, the output of the HubsAndAuthorities job is read in, and the square roots of the quadratic sums of all of the hub and authority values are generated.

3.3.1.1 Map
For each input key-value pair, the Normalization Step 1 mapper outputs intermediate key-value pairs in the following format:

<(H|A)> <value>

where the key is a Text object containing the letter H or A, and the value is a DoubleWritable hub or authority value.

3.3.1.2 Combiner
Because the reducer will only be computing two output values, this process can support at most two reducers. Thus, it benefits from the use of a combiner to square each input value and produce partial quadratic sums for the hub and authority values.
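A sketch of the map and combine steps is shown below. One detail differs from the description above, so that the sketch remains correct however many times Hadoop chooses to run the combiner: the mapper emits each value already squared, and the combiner simply sums whatever partial values it receives. The records are assumed to use the simplified "TAG|value|links" Text encoding from the earlier sketches.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Normalization Step 1 sketch: the mapper emits ("H" or "A", value squared),
// and the combiner pre-aggregates partial sums so that the two reducers
// receive far fewer records.
public class QuadraticSumMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, DoubleWritable> {
  public void map(Text node, Text record,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    String[] parts = record.toString().split("\\|", 3);
    double v = Double.parseDouble(parts[1]);
    output.collect(new Text(parts[0]), new DoubleWritable(v * v));
  }
}

// Summing partial sums is associative, so this class can safely be run as a
// combiner any number of times before the reduce phase.
class PartialSumCombiner extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  public void reduce(Text tag, Iterator<DoubleWritable> values,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double sum = 0.0;
    while (values.hasNext()) { sum += values.next().get(); }
    output.collect(tag, new DoubleWritable(sum));
  }
}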
3.3.1.3 Reduce
As mentioned in the previous section, there are at most two reducers for this job: one to aggregate hub values, and one to aggregate authority values. The reducer reads in the output from the mappers and/or combiners, computes the quadratic sums for the hub and authority values, and then computes the square root of the quadratic sums and writes them to disk in the format:

<(H|A)> <value>

where the key is a Text object containing the letter H or A, and the value is a DoubleWritable hub or authority value.
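Continuing the sketch above, the reducer for each tag simply finishes the sum and takes the square root (an illustration only; the actual implementation may differ in where the squaring happens and how the factors are written).

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Normalization Step 1 reduce sketch: each of the (at most two) reducers sums
// the squared values or partial sums for its tag and writes the square root
// of the quadratic sum.
public class QuadraticSumReducer extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  public void reduce(Text tag, Iterator<DoubleWritable> values,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double sum = 0.0;
    while (values.hasNext()) { sum += values.next().get(); }
    output.collect(tag, new DoubleWritable(Math.sqrt(sum)));
  }
}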
3.3.2 Step 2
The second stage of the normalization process is a separate Hadoop job that reads in the hub and authority values computed by the HubsAndAuthorities job, as well as the normalization factors computed in Step 1 of the normalization process, and then divides each hub or authority value by the correct normalization factor. The normalization factors are read in from the filesystem during job configuration and are distributed to the various mappers using the Hadoop JobConf object.

3.3.2.1 Map
In the configuration stage, each mapper reads the normalization factors from the JobConf. The mapper then reads in each key-value pair from the input and divides each hub or authority value by the correct normalization factor (based on whether it is tagged with an "H" or an "A"). Mapper output has the same structure as the formatter output described in Section 3.1.

3.3.2.2 Reduce
The reduce phase for Step 2 of the normalization process is simply the Hadoop identity reducer. No computation needs to be performed during the reduce phase.

The output from the normalization phase can be fed back into the HubsAndAuthorities job to begin another iteration of the computation.
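A sketch of the Step 2 map logic (Section 3.3.2.1) is shown below. The JobConf property names are hypothetical, and the records again use the simplified "TAG|value|links" Text encoding from the earlier sketches rather than the actual Cloud9 types.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Normalization Step 2 map sketch: the two normalization factors are passed
// to every mapper through the JobConf (property names here are hypothetical),
// and each hub or authority value is divided by the factor matching its tag.
public class NormalizeMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private double hubNorm = 1.0;
  private double authNorm = 1.0;

  public void configure(JobConf job) {
    // Read the square roots of the quadratic sums computed in Step 1.
    hubNorm = Double.parseDouble(job.get("hits.hub.norm", "1.0"));
    authNorm = Double.parseDouble(job.get("hits.auth.norm", "1.0"));
  }

  public void map(Text node, Text record,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] parts = record.toString().split("\\|", 3);
    double norm = parts[0].equals("H") ? hubNorm : authNorm;
    double value = Double.parseDouble(parts[1]) / norm;
    output.collect(node, new Text(parts[0] + "|" + value + "|" + parts[2]));
  }
}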
4. EXPERIMENTAL RESULTS
I ran two sets of ten iterations of the hubs and authorities computation on the input data described in Section 2.3. In the first set of iterations, I seeded each user's authority value to the number of mentions listed in the input graph. In the second set of iterations, I seeded each authority value to 1. Finally, I ran one iteration of the algorithm on a 12GB web graph in order to gauge performance on a large dataset.

4.1 Performance
Performance statistics for the Twitter mentions dataset are largely uninteresting, due to the small size of the input dataset (53MB). Each job in the algorithm was launched with 200 map tasks and 20 reduce tasks and took between 2 and 4 minutes depending on the load on the cluster.

Performance on the web graph is described in Table 1.

Table 1: Algorithm Performance on Web Graph Dataset

Step                         | HDFS Bytes Read | Launched Map Tasks | Launched Reduce Tasks | Execution Time
Hub Formatter                | 13,020,630,524  | 707                | 89                    | 12m10s
Authority Formatter          | 13,020,630,524  | 725                | 85                    | 10m51s
Hubs and Authorities Update  | 23,677,664,257  | 654                | 92                    | 19m35s
Normalize Step 1             | 34,417,142,908  | 764                | 1                     | 2m26s
Normalize Step 2             | 34,417,142,908  | 621                | 102                   | 4m48s

It is important to note that these jobs were performed on a cluster where a number of other jobs were being scheduled and executed; therefore, the execution time figures may be slightly
inflated based on job scheduling delays due to other jobs
running on the system. However, it is clear from the data that
the most computationally expensive step in the process is the
Hub and Authorities update job, which requires two
computations and record updates to be made for every node in
the graph: one update to the hub value and one update to the
authority value.

4.2 Computational Results

The following figures show the distribution of hub and authority values for each run of the algorithm. In both runs, there were a very small number of users with authority values of magnitude 10^0 to 10^-2, a large number of users with authority values with magnitudes between 10^-3 and 10^-10, and then another large group of users with authority values of 10^-22 to 10^-23. Presumably, this last group of users are those with no followers.

On the hub side, things look quite different. There is a small group of users with hub values of magnitude 10^-2 to 10^-7, but the vast majority have hub values of magnitude 10^-21 to 10^-24. This would seem to indicate that most users in the dataset were either not following anyone, or mostly followed un-authoritative users.

Figure 2: Authority Value Distribution when Authority values seeded with number of mentions
Figure 3: Authority Value Distribution when Authority values seeded to 1
Figure 4: Hub Value Distribution when Authority values seeded with number of mentions
Figure 5: Hub Value Distribution when Authority values seeded to 1

The top 20 users with the highest authority scores are presented in Tables 2 and 3:
Table 2: Top 20 Authorities when authority value seeded to number of mentions

Authority Value x10   Username
7.7029475763          tweetmeme
5.7089990369          mashable
1.3968257328          TechCrunch
1.293904751           GuyKawasaki
0.5183442548          chrisbrogan
0.4916548334          Twitter_Tips
0.4357376634          huffingtonpost
0.3768996953          garymccaffrey
0.3610229329          copyblogger
0.3479891281          danschawbel
0.3456938012          guykawasaki
0.3441078103          smashingmag
0.3316099462          problogger
0.2897367035          socialmedia2day
0.2757592647          Jason_Pollock
0.2610604856          BreakingNews
0.2537776719          nytimes
0.2328544942          rww
0.2265959406          Scobleizer
0.2224326619          alltop

Table 3: Top 20 Authorities when authority value seeded to 1

Authority Value x10   Username
7.3863306829          tweetmeme
6.0589688987          mashable
1.4499644312          TechCrunch
1.3721369066          GuyKawasaki
0.5355155392          chrisbrogan
0.5137647052          Twitter_Tips
0.4406625956          huffingtonpost
0.3691258215          copyblogger
0.3690585462          garymccaffrey
0.3627093623          danschawbel
0.3584587444          guykawasaki
0.353456226           smashingmag
0.3402338366          problogger
0.2964412809          socialmedia2day
0.2878474166          Jason_Pollock
0.2780027708          BreakingNews
0.2708411515          nytimes
0.2421932316          rww
0.2383444894          Scobleizer
0.225149232           alltop

Most of these Twitter accounts belong to Web personalities or news sites, and thus seem like reasonable candidates for the top 20 list.
The top 20 users with the highest hub scores are presented in Tables 4 and 5:

Table 4: Top 20 Hubs when authority value seeded to number of mentions

Hub Value x10         Username
0.1998010923          bloginterface
0.1834639216          mtrigo
0.1810148535          horaciolm
0.1788128827          SubZeroService
0.1787793615          mikamatikainen
0.1772074214          dennypeh
0.1768250999          dinachaz
0.1756893844          aGEEKspot
0.1756653717          undergradtv
0.174767067           eaglesflite
0.1737820146          lucadb
0.1731358266          Mile_End_Media
0.1730407097          maestro147
0.1728568818          manoj_km
0.1724871735          WebStudio13
0.172475132           AlbanyInsurance
0.1722922181          pheadrick
0.1719294126          corettajackson
0.1712461943          darlenegannon
0.1706123208          imeily

Table 5: Top 20 Hubs when authority value seeded to 1

Hub Value x10         Username
0.1988960893          bloginterface
0.1826794046          mtrigo
0.1802519587          horaciolm
0.1781621154          mikamatikainen
0.1781427651          SubZeroService
0.176498446           dennypeh
0.1761512751          dinachaz
0.1750134021          aGEEKspot
0.174938272           undergradtv
0.1741149939          eaglesflite
0.1731484603          lucadb
0.1725093268          maestro147
0.1725091931          Mile_End_Media
0.1722296026          manoj_km
0.1719086079          WebStudio13
0.1718476357          AlbanyInsurance
0.1716049536          pheadrick
0.1712464778          corettajackson
0.1705805982          darlenegannon
0.1700293877          imeily

It is less clear how to interpret the hub values in the Twitter context. According to the algorithm, these are the users who mention or retweet the most authoritative users; however, since the algorithm does not account for Twitter spam accounts, these could just be users who automatically retweet all posts from a number of well-known accounts.

5. CONCLUSIONS AND FUTURE WORK
Twitter data can be easily viewed as a directed graph, which lends itself well to various types of link analysis. However, it does not appear that Hubs and Authorities is the most appropriate link analysis algorithm to use on Twitter data. The algorithm is more complex than other types of link analysis algorithms, requiring the calculation of two values, rather than one, as in PageRank. Furthermore, the algorithm possesses a scalability bottleneck that appears to make it unsuited for use on truly large datasets. During the normalization calculations, it is necessary to compute the quadratic sums for all hub and authority values in the data, which are essentially routed to two reducers on the cluster. If the dataset is very large, these reducers could serve as a significant bottleneck. The problem can be mitigated somewhat by the use of combiners, but it still exists, even in the case where combiners are used to compute partial sums.

The Hubs and Authorities algorithm implemented in this paper also provides no way of dealing with spam accounts. I believe it would be a fairly trivial task for an adversarial Twitter user to game the algorithm by creating a network of false hubs that follow a number of popular Twitter accounts. These hubs could then be used in turn to create fake authorities. For the algorithm to be truly useful, it would need to include a heuristic for discounting suspected spam nodes.

In fact, it does not seem that the Hubs and Authorities algorithm was designed to be applied to extremely large-scale graphs in the first place. Kleinberg's paper specifies that the algorithm be run on a focused sub-graph of nodes that have already been identified as relevant. Hubs and Authorities in its basic form might be more useful as a tool to identify influential users in various subgroups within social networks, rather than a tool to be applied to the entire social graph.

This exercise has demonstrated that it is possible to perform worthwhile link analysis on Twitter data, but new algorithms will need to be developed to deal with large social network graphs, which have a richer and more varied link structure than does the Web. On Twitter, for example, there are at least three types of links between nodes: retweets, mentions, and followers, as well as standard hyperlinks to sites on the Web. Hubs and Authorities in its basic form cannot handle the richness of these links, but it may be useful as a starting point in the development of better social network analysis tools.

6. REFERENCES
[1] Kleinberg, J. M. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (Sep. 1999), 604-632. DOI= http://doi.acm.org/10.1145/324133.324140
[2] Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107-113. DOI= http://doi.acm.org/10.1145/1327452.1327492
[3] Dong, X. 2008. Hubs and Authorities Calculation using MapReduce. Technical Report. Cornell University.
