• Data examples
– Professional documents
– Culture documents
– Web pages
– Customer surveys (fields: Customer, Age, Sex, Tenure, Comments, Outcome)
• Evaluating an algorithm: retrieved documents that are relevant count as true positives (TP); retrieved documents that are not count as false positives (FP)
• Stop Words
• Stemming
• remove endings
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of “th”.
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– …
• transform words
– if a word ends with “ies” but not “eies” or “aies”, then change “ies” to “y”.
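The suffix-stripping rules above can be sketched in Python. The function name and the rule ordering (the “ies” transform checked first, so it takes precedence over the plainer “es” and “s” rules) are my assumptions, not the slide's exact algorithm:

```python
def weak_stem(word):
    """Sketch of the slide's suffix-stripping rules (a weak, rule-based stemmer).

    Rule ordering is an assumption: "ies" is checked before "es"/"s"
    so the more specific transform wins.
    """
    w = word.lower()
    # "ies" -> "y", unless the word ends in "eies" or "aies"
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # if a word ends in "es", drop the s
    if w.endswith("es"):
        return w[:-1]
    # if a word ends with a consonant other than s, followed by an s, delete the s
    if w.endswith("s") and len(w) >= 2 and w[-2] not in "aeious":
        return w[:-1]
    # if a word ends in "ing", delete it unless one letter or "th" would remain
    if w.endswith("ing"):
        stem = w[:-3]
        if len(stem) > 1 and stem != "th":
            return stem
    # if a word ends with "ed" preceded by a consonant, delete the "ed"
    # unless this leaves only a single letter
    if w.endswith("ed") and len(w) >= 3 and w[-3] not in "aeiou":
        stem = w[:-2]
        if len(stem) > 1:
            return stem
    return w
```

Note that rules like these are intentionally crude ("running" becomes "runn"); the goal is collapsing inflected variants onto a shared key, not producing dictionary words.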
• Greedy search
– Start from the full set of variables and delete one at a time
– At each step, find the least important variable
• Can use the Gini index for this in a classification problem
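A minimal sketch of the greedy backward search, assuming variable importance is measured by the Gini impurity decrease from splitting on that variable alone; the function names and this particular importance measure are assumptions:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

def split_impurity(column, labels):
    """Weighted Gini impurity after splitting the rows on one feature."""
    n = len(labels)
    total = 0.0
    for v in set(column):
        subset = [y for x, y in zip(column, labels) if x == v]
        total += len(subset) / n * gini(subset)
    return total

def backward_eliminate(X, y, keep=1):
    """Greedy search: start from the full feature set and repeatedly delete
    the least important feature, until `keep` features remain."""
    features = list(range(len(X[0])))
    while len(features) > keep:
        # importance = impurity decrease from splitting on this feature alone
        importance = {j: gini(y) - split_impurity([row[j] for row in X], y)
                      for j in features}
        least = min(features, key=lambda j: importance[j])
        features.remove(least)
    return features
```

On a toy dataset where feature 0 perfectly predicts the class and feature 1 is noise, the noise feature is deleted first.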
R function: ‘image’
Data Mining -Volinsky - 2011 - Columbia University 27
Weighting in TD space
     t1  t2  t3  t4  t5  t6
D1   24  21   9   0   0   3
D2   32  10   5   0   3   0
D3   12  16   5   0   0   0
D4    6   7   2   0   0   0
D5   43  31  20   0   3   0
D6    2   0   0  18   7   6
D7    0   0   1  32  12   0
D8    3   0   0  22   4   4
D9    1   0   0  34  27  25
D10   6   0   0  17   4  23
Term weighting by document: a 10 x 6 term-document matrix (term labels t1-t6 are placeholders)
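One common way to weight such a count matrix is TF-IDF. The sketch below applies it to the matrix above, assuming the standard tf x log(N/df) formula; the slide does not say which weighting scheme it actually uses:

```python
import math

# The 10 x 6 term-document count matrix from the slide (rows = documents D1-D10).
counts = [
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7,  6],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  4],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
]

def tf_idf(counts):
    """Weight each count by log(N / df): terms that occur in every
    document get weight 0; rare terms are up-weighted."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: in how many documents does each term appear?
    df = [sum(1 for d in range(n_docs) if counts[d][t] > 0)
          for t in range(n_terms)]
    return [[counts[d][t] * math.log(n_docs / df[t]) if df[t] else 0.0
             for t in range(n_terms)]
            for d in range(n_docs)]

weights = tf_idf(counts)
```

Note how the weighting separates the two clusters visible in the raw counts: terms 1-3 dominate D1-D5, terms 4-6 dominate D6-D10, and TF-IDF sharpens that contrast by discounting terms shared across many documents.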
In other words, the probability that a bunch of words comes from a given class
equals the product of the individual probabilities of those words.
p(x | c_k) = p(N_x | c_k) ∏_{j=1}^{N_x} p(x_j | c_k)
where N_x is the number of words in document x and x_j is its j-th word.
– Based on training data, each class has its own multinomial probability across all
words.
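The per-class word probabilities and the product rule above can be sketched as a small multinomial naive Bayes classifier. Laplace smoothing and the `alpha` parameter are my additions (unsmoothed estimates would zero out any document containing an unseen word):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate a class prior and a multinomial word distribution per class.
    alpha is a Laplace smoothing constant (an assumption, not from the slide)."""
    vocab = {w for d in docs for w in d}
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)
        word_counts = Counter(w for d in class_docs for w in d)
        total = sum(word_counts.values())
        cond[c] = {w: (word_counts[w] + alpha) / (total + alpha * len(vocab))
                   for w in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """p(x | c_k) is the product of the individual word probabilities;
    sum logs instead of multiplying to avoid underflow."""
    scores = {c: math.log(priors[c]) +
                 sum(math.log(cond[c][w]) for w in doc if w in cond[c])
              for c in priors}
    return max(scores, key=scores.get)
```

A toy usage, with hypothetical spam/ham training documents:

```python
docs = [["cheap", "pills"], ["cheap", "offer"],
        ["meeting", "agenda"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
priors, cond = train_nb(docs, labels)
```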
• To Cluster:
– Can use LSI
– Another model: Latent Dirichlet Allocation (LDA)
– LDA is a generative probabilistic model of a corpus. Documents are
represented as random mixtures over latent topics, where a topic is
characterized by a distribution over words.
• LDA:
– Three concepts: words, topics, and documents
– Documents are a collection of words and have a probability
distribution over topics
– Topics have a probability distribution over words
– Fully Bayesian Model
LDA
• Assume the data was generated by a generative process:
• θ is a document’s mixture of topics, drawn from a probability distribution
• z is a topic, made up from words from a probability distribution
• w is a word, the only real observable (N = number of words in all documents)
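The generative story can be sketched as a toy sampler. This generates synthetic documents from given topics; it is not LDA inference, and the symmetric Dirichlet parameter and function names are simplifying assumptions:

```python
import random

def generate_corpus(phi, alpha, n_docs, doc_len, rng=None):
    """Sample documents from LDA's generative process.

    phi: per-topic word distributions (one list of word probabilities per topic).
    alpha: symmetric Dirichlet parameter for the per-document topic mixture.
    """
    rng = rng or random.Random(0)
    n_topics = len(phi)
    vocab = list(range(len(phi[0])))
    corpus = []
    for _ in range(n_docs):
        # theta ~ Dirichlet(alpha): this document's mixture over topics,
        # built from normalized Gamma draws
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
        theta = [g / sum(gammas) for g in gammas]
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # topic for this word
            w = rng.choices(vocab, weights=phi[z])[0]           # word from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus
```

With two degenerate topics (topic 0 only ever emits word 0, topic 1 only word 1), every sampled word is 0 or 1, and each document's mix of the two reflects its sampled θ.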
• Example raw record (pipe-delimited movie metadata):
10013|In Harm's Way|In Harm's Way|A tough Naval officer faces the enemy while fighting in the South Pacific during World War II.|A tough Naval officer faces the enemy while fighting in the South Pacific during World War II.|en-US|Movie,NR Rating|Movies:Drama|||165|1965|USA||||||STARS-3||NR|John Wayne, Kirk Douglas, Patricia Neal, Tom Tryon, Paula Prentiss, Burgess Meredith|Otto Preminger||||Otto Preminger|
• Sentiment Analysis
– Automatically determine tone in text: positive, negative or neutral
– Typically uses collections of good and bad words
– “While the traditional media is slowly starting to take John McCain’s straight talking
image with increasingly large grains of salt, his base isn’t quite ready to give up on their
favorite son. Jonathan Alter’s bizarre defense of McCain after he was caught telling an
outright lie, perfectly captures that reluctance[.]”
– Often fit using Naïve Bayes
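A minimal sketch of the word-list approach described above (count hits against collections of good and bad words); the lexicon entries here are toy assumptions, and real systems use large curated lexicons:

```python
# Toy lexicons; entries are illustrative assumptions, not a real word list.
POSITIVE = {"good", "great", "excellent", "love", "favorite"}
NEGATIVE = {"bad", "bizarre", "lie", "reluctance", "outright"}

def sentiment(text):
    """Label text positive, negative, or neutral by counting lexicon hits."""
    words = [w.strip(".,!?\"'") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The McCain passage above illustrates why this is hard: words like "favorite" pull the score positive even though the overall tone is negative, which is one reason fitted models such as naive Bayes usually beat raw word counting.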