
July 7th, 2013

Large Scale Topic Modeling
By Sameer Wadkar
© 2013 Axiomine LLC
What is Topic Modeling?
The technique is called Latent Dirichlet Allocation (LDA).
An excellent explanation is available in a blog article by Edwin Chen
(http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/).
This presentation borrows heavily from that article to explain the basics of Topic Modeling.



Brief Overview of LDA
What can LDA do?
LDA extracts key topics and themes from a large corpus of text.
Each topic is an ordered list of representative words (ordered by the importance of each word to the topic).
LDA describes each document in the corpus by its allocation to the extracted topics.
It is an unsupervised learning technique:
No extensive preparation is needed to create a training dataset.
It is easy to apply for exploratory analysis.
LDA: A Quick Example
Given the sentence "I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car," an LDA model might represent it as 75% about music (a topic containing the words Bieber, Gaga, and radio) and 25% about cars (a topic containing the words driving and car).
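The representation the example describes can be sketched in a few lines. This is an illustrative toy, not actual LDA output; the topic names, word lists, and weights are invented to mirror the Bieber/Gaga sentence above.

```python
# A document under LDA is a mixture over topics (weights sum to 1.0);
# each topic is an ordered list of representative words.
doc_topics = {"music": 0.75, "cars": 0.25}

topic_words = {
    "music": ["bieber", "gaga", "radio"],  # ordered by importance to the topic
    "cars":  ["driving", "car"],
}

# The dominant topic is the one with the largest allocation.
dominant = max(doc_topics, key=doc_topics.get)
print(dominant)                  # music
print(sum(doc_topics.values()))  # 1.0 -- allocations always sum to one
```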
Sarah Palin Email Corpus
In June 2011, several thousand emails from Sarah Palin's time as governor of Alaska were released
(http://sunlightfoundation.com/blog/2011/06/15/sarahs-inbox/).
The emails were not organized in any form.
The Edwin Chen blog article discusses how LDA was used to organize these emails into categories discovered from the corpus itself.
LDA Analysis Results
LDA analysis of Sarah Palin's emails discovered the following topics (note the ordered list of words in each):

Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, fishing

Energy/Fuel/Oil Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages

Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best

Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry

Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter

Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o
LDA Sample from the Wildlife Topic
(Slide shows a sample email classified under the Wildlife topic.)
LDA Sample from Multiple Topics
(Slide shows a sample email that mixes topics.)
LDA classification of the above email:

Topic                              Allocation Percentage
Presidential Campaign/Elections    10%
Wildlife                           90%
Types of Analysis LDA Can Perform
Similarity Analysis
Which topics are similar?
Which documents are similar, based on their topic allocations?
LDA can distinguish business articles about Mergers from those about Quarterly Earnings, which leads to more potent similarity analysis.
LDA determines topic allocation based on the collocation of word groups. Hence documents about IBM and Microsoft can be discovered to be similar if they discuss similar computing topics.
Similarity analysis based on LDA is very accurate, since LDA converts the high-dimensional, noisy space of word/document allocations into a low-dimensional space of topic/document allocations.
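One common way to compare documents in the low-dimensional topic space is cosine similarity over their topic-allocation vectors. The sketch below uses invented three-topic allocations (Mergers, Quarterly Earnings, Computing) to show why an IBM document and a Microsoft document that both discuss computing come out as similar:

```python
import math

def cosine(a, b):
    # Cosine similarity between two topic-allocation vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical allocations over (Mergers, Quarterly Earnings, Computing):
ibm_doc       = [0.05, 0.10, 0.85]
microsoft_doc = [0.10, 0.05, 0.85]
merger_doc    = [0.90, 0.05, 0.05]

# The two computing-heavy documents are far closer to each other
# than either is to the mergers document.
print(cosine(ibm_doc, microsoft_doc) > cosine(ibm_doc, merger_doc))  # True
```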
Brief Overview of LDA
Topic Co-occurrence
Do certain topics occur together in documents?
Analysis of software resumes will reveal that object-oriented language skills typically co-occur with SQL and RDBMS skills.
Does topic co-occurrence change over time?
A resume corpus would reveal that Java skills were highly correlated with Flash development skills in 2007. By 2013 the correlation had shifted to Java and HTML5, though not as strongly as in 2007, indicating that HTML5 is a more specialized skill.
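Co-occurrence of the kind described above can be measured directly from the document/topic allocations: count the fraction of documents in which both topics receive a substantial allocation. The data, topic names, and 0.2 threshold below are all invented for illustration:

```python
# Hypothetical document/topic allocations for three resumes.
docs = [
    {"oop": 0.5, "sql": 0.4, "flash": 0.1},
    {"oop": 0.6, "sql": 0.3, "html5": 0.1},
    {"flash": 0.8, "html5": 0.2},
]

def cooccurrence(topic_a, topic_b, docs, threshold=0.2):
    # Fraction of documents where both topics exceed the threshold.
    both = sum(
        1 for d in docs
        if d.get(topic_a, 0.0) >= threshold and d.get(topic_b, 0.0) >= threshold
    )
    return both / len(docs)

print(cooccurrence("oop", "sql", docs))    # OOP and SQL co-occur in 2 of 3 docs
print(cooccurrence("oop", "flash", docs))  # 0.0 -- they never co-occur here
```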

Brief Overview of LDA
Time-Based Analysis
For a corpus of documents spanning a long period, do certain topics appear over time?
How does the appearance of new topics affect the distribution of other topics?
Analysis of articles from the journal Science (1880-2002) reveals this process:
http://topics.cs.princeton.edu/Science/
The browser is at http://topics.cs.princeton.edu/Science/browser/
It is a 75-topic model, and it demonstrates how topics gain and lose prominence over time, how a topic's composition changes over time, and how new topics appear.
Example: "laser" made an appearance in its topic only in 1980.
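Tracking a topic's prominence over time reduces to averaging its allocation across the documents published in each period. The sketch below uses made-up allocations for a "laser" topic, loosely echoing the Science-corpus observation that the topic only gains weight around 1980:

```python
from collections import defaultdict

# Hypothetical (year, topic-allocation) pairs for a few documents.
docs = [
    (1975, {"optics": 0.30, "laser": 0.00}),
    (1980, {"optics": 0.40, "laser": 0.20}),
    (1985, {"optics": 0.35, "laser": 0.45}),
]

def prominence_by_year(topic, docs):
    # Mean allocation of `topic` across all documents from each year.
    totals, counts = defaultdict(float), defaultdict(int)
    for year, alloc in docs:
        totals[year] += alloc.get(topic, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in totals}

print(prominence_by_year("laser", docs))  # prominence rises from 1980 on
```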

Example Based on Sarah Palin's Email Corpus
Analyze the emails that belong to the Trig/Family/Inspiration topic.
There is a spike in April 2008. Remarkably (for Topic Modeling) and unsurprisingly (for common sense), this was exactly the month Trig was born.
Topic Modeling can discover such patterns in a large text corpus without requiring a human to read the entire corpus.
Topic Modeling Toolkits
Several open source options exist:

Library          Description
Mallet           A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
R-based library  An R library for performing Topic Modeling.
Apache Mahout    A Big Data solution for Topic Modeling.

Why is a Big Data solution needed?
Topic Modeling is computationally expensive:
It requires large amounts of memory.
It requires considerable computational power.
Memory is the bigger constraint:
Most implementations run out of memory when applied to even a modest number of documents (50,000 to 100,000).
If they do not run out of memory, they slow to a crawl due to frequent garbage collection (in Java-based environments).
A Big Data based approach is needed!

Mahout for Big LDA
Apache Mahout
A Hadoop MapReduce based suite of machine learning procedures.
Implements several machine learning routines based on Bayesian techniques (e.g., generative algorithms).
Generative algorithms are iterative, and the iterations converge to a solution:
Each iteration needs the results produced by the previous iteration, so iterations cannot be executed in parallel.
Several thousand iterations may be needed to converge to a solution.
Mahout uses MapReduce to parallelize a single iteration:
Each iteration is a separate MapReduce job.
Inter-iteration communication goes through HDFS, which leads to high I/O.
The high I/O is compounded by the multi-iteration nature of the algorithm.
Mahout-based LDA:
Each iteration is slower in order to accommodate large memory requirements.
Typically 1000 iterations are needed, so runs take too long; this is unsuitable for exploratory analysis.
Fewer iterations lead to a sub-optimal solution.
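The sequential-iteration constraint described above is structural, not an implementation detail. The generic sketch below shows why: each iteration consumes the state the previous one produced, so only the work inside an iteration can be parallelized (which is exactly what Mahout's per-iteration MapReduce jobs do). The update rule here is a stand-in (Newton's method), not LDA's actual Gibbs update:

```python
def run_iterations(state, update, n_iterations):
    # Each pass reads the state written by the previous pass --
    # the loop itself cannot be parallelized across iterations.
    for _ in range(n_iterations):
        state = update(state)
    return state

# Toy stand-in update: Newton's method converging to sqrt(2).
final = run_iterations(1.0, lambda x: (x + 2.0 / x) / 2.0, 20)
print(round(final, 6))  # 1.414214
```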



Parallel LDA Based on Mallet
The parallel LDA in Mallet is based on Newman, Asuncion, Smyth and Welling, "Distributed Algorithms for Topic Models," JMLR (2009), with the SparseLDA sampling scheme and data structures from Yao, Mimno and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections," KDD (2009).
It is still memory intensive:
A large corpus leads to frequent garbage collection.
Executing Mallet's ParallelTopicModel on an 8 GB, Intel i7 quad-core machine over 500,000 US Patent abstracts takes about 400 minutes for 1000 iterations.
On 1 million patents the application makes no progress: it eventually runs out of memory or stalls due to frequent garbage collection.



Axiomine Solution: Big LDA without Hadoop
MapReduce is unsuitable for LDA-type algorithms:
Hadoop is complex and unsuited for ad-hoc analysis.
The large number of sequential iterations only allows MapReduce to be used at the iteration level, which leads to too many short MapReduce jobs.
Large-scale LDA without Big Data:
LDA is a memory-intensive process.
Off-heap memory based on Java NIO allows a process to use memory without incurring the GC penalty. The trade-off is slightly lower performance.
Exploit OS page caching to use off-heap memory.
LDA operates on text data, but storing text is orders of magnitude more expensive than storing numbers.
Massive off-heap indexes that map words to numbers significantly lower memory usage.
Reorganizing the Mallet implementation steps achieved significant performance gains and memory savings.
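The word-to-number indexing idea is simple to illustrate: store each distinct token once and represent the corpus as small integers instead of repeated strings. (The Axiomine implementation keeps such indexes off-heap via Java NIO; this Python sketch shows only the mapping itself, with an invented mini-corpus.)

```python
corpus = ["gas", "oil", "pipeline", "oil", "gas", "gas"]

vocab = {}     # word -> integer id, each distinct word stored once
encoded = []   # the corpus as compact integers
for word in corpus:
    if word not in vocab:
        vocab[word] = len(vocab)
    encoded.append(vocab[word])

print(encoded)     # [0, 1, 2, 1, 0, 0]
print(len(vocab))  # 3 -- only three distinct strings are ever stored
```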





Axiomine Solution: Performance Numbers

Machine Type                                  Corpus                               Performance
Single 8 GB, Intel i7 quad-core machine       500,000 US Patent abstracts,         1000 iterations completed
                                              600 topics                           in 2 hours
Amazon AWS hs1.8xlarge machine                2.1 million US Patent abstracts,     1000 iterations completed
(http://aws.amazon.com/ec2/instance-types/)   600 topics, using 5 CPU threads      in approximately 5 hours
High Points
Scaling is practically linear, unlike other implementations.
Each iteration takes between 7 and 15 seconds.
We contemplated Apache HAMA to achieve parallelism without incurring the disk I/O cost of Hadoop MapReduce, but network I/O would ensure worse intra-iteration performance than we could achieve on a single machine!
Big Topic Modeling without Big Data!
At Axiomine we intend to port more such popular algorithms based on lessons learned while porting LDA.
We want to enable large-scale exploratory analysis at low complexity.

Conclusion: Large Scale Analysis without Big Data
The Axiomine LDA implementation has the following benefits:
Scaling is practically linear, unlike other implementations.
Each iteration takes between 7 and 15 seconds.
We contemplated Apache HAMA to achieve parallelism without incurring the disk I/O cost of Hadoop MapReduce, but network I/O would ensure worse intra-iteration performance than we could achieve on a single machine!
Big Topic Modeling without Big Data!
At Axiomine we intend to port more such popular algorithms based on lessons learned while porting LDA.
We want to enable large-scale exploratory analysis at low complexity.
Open Source: https://github.com/sameerwadkar/largelda



