Lightweight Morphology
• No lexicon needed
• Basically staged sets of rewrite rules
that strip suffixes
• Handles both inflectional and derivational
suffixes
• Doesn’t guarantee that the resulting stem is
really a stem (see first bullet)
• Lack of guarantee doesn’t matter for IR
Porter Example
• Computerization
-ization -> -ize: computerize
-ize -> ε: computer
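A minimal sketch of this kind of staged suffix-stripping in Python; the stages and rules here are toy assumptions, not the actual Porter rule set:

# A toy staged suffix stripper in the spirit of Porter's algorithm.
# Each stage is an ordered list of (suffix, replacement) rewrite rules;
# the first matching rule in a stage fires, then we move to the next stage.
STAGES = [
    [("ization", "ize"), ("ational", "ate")],   # derivational suffixes
    [("ize", ""), ("ate", "")],                  # strip remaining endings
    [("s", "")],                                 # simple inflection
]

def light_stem(word):
    for stage in STAGES:
        for suffix, replacement in stage:
            if word.endswith(suffix):
                word = word[: len(word) - len(suffix)] + replacement
                break  # at most one rule fires per stage
    return word

print(light_stem("computerization"))  # computerization -> computerize -> computer

Note that, as the bullets say, no lexicon is consulted: the output may not be a real stem, which is acceptable for IR.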
Tokenization Issues
• Expanding clitics
What’re -> what are
I’m -> I am
• Multi-token words
New York
Rock ‘n’ roll
Sentence Segmentation
• !, ? relatively unambiguous
• Period “.” is quite ambiguous
Sentence boundary
Abbreviations like Inc. or Dr.
• General idea:
Build a binary classifier:
Looks at a “.”
Decides EndOfSentence/NotEOS
Can use hand-written rules or machine learning (rule sketch below)
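For illustration, a hand-written-rule sketch of such a classifier; the abbreviation list and features are assumptions, not an exhaustive rule set:

# Rule-based period disambiguation: decide EndOfSentence vs NotEOS
# for each "." using simple contextual features.
ABBREVIATIONS = {"dr.", "inc.", "mr.", "mrs.", "prof.", "etc."}

def is_end_of_sentence(tokens, i):
    """tokens[i] is a token ending in '.'; return True if it ends a sentence."""
    token = tokens[i].lower()
    if token in ABBREVIATIONS:
        return False                      # known abbreviation
    if len(token) == 2 and token[0].isalpha():
        return False                      # single-letter initial, e.g. "J."
    if i + 1 < len(tokens) and tokens[i + 1][0].islower():
        return False                      # next word lowercase: unlikely EOS
    return True

tokens = ["Dr.", "Smith", "arrived", "yesterday."]
print([is_end_of_sentence(tokens, i)
       for i, t in enumerate(tokens) if t.endswith(".")])  # [False, True]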
Word Segmentation in Chinese
Vietnamese Word Segmentation Problem (cont.)
Word segmentation ambiguities
Syllable sequences such as “nhà cửa” (house), “sắc đẹp” (beauty), and “hiệu sách” (bookstore) are words in:
a. Nhà cửa bề bộn quá (“The house is so messy”)
b. Cô ấy giữ gìn sắc đẹp. (“She takes care of her beauty.”)
c. Ngoài hiệu sách có bán cuốn này (“The bookstore out there sells this one”)
• We want to compute P(w1,w2,w3,w4,w5…wn), the probability of a sequence of words
• Alternatively, we want to compute P(w5 | w1,w2,w3,w4): the probability of a word given some previous words
• The model that computes P(W) or P(wn | w1,w2…wn-1) is called the language model
Computing P(W)
P("the", "other", "day", "I", "was", "walking", "along", "and", "saw", "a", "lizard")
The Chain Rule
P(A ∧ B) = P(A | B) P(B)
• More generally
• P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• In general
• P(x1,x2,x3,…xn) =
P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)
The Chain Rule
• How to estimate?
P(you | the river is so wide that)
Unfortunately
• There are too many possible sentences: we will never see enough data to estimate statistics for such long prefixes
Markov Assumption
P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
Bigram version:
P(wn | w1 … wn-1) ≈ P(wn | wn-1)
Estimating bigram probabilities
• The Maximum Likelihood Estimate:
P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
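A minimal sketch of this estimate over a toy corpus; the corpus and the p_mle name are invented for illustration:

from collections import Counter

# Toy training corpus with sentence boundary markers.
corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want to eat </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_mle(w, prev):
    # P(w | prev) = C(prev, w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("want", "I"))       # 2/2 = 1.0
print(p_mle("Chinese", "eat"))  # 1/2 = 0.5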
Maximum Likelihood Estimates
Raw Bigram Counts
Raw Bigram Probabilities
• Normalize by unigrams:
• Result:
Bigram Estimates of Sentence Probabilities
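A sketch of scoring a whole sentence with the bigram chain rule; the probability table here is invented for illustration, not real corpus estimates:

import math

# Bigram probabilities from a hypothetical trained model (illustrative values).
P = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.33, ("want", "to"): 0.66,
    ("to", "eat"): 0.28, ("eat", "Chinese"): 0.02, ("Chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

def sentence_prob(sentence):
    # Chain rule with the bigram (Markov) approximation:
    # P(w1..wn) ~= product of P(wi | wi-1); log space avoids underflow.
    words = sentence.split()
    logp = sum(math.log(P[(prev, w)]) for prev, w in zip(words, words[1:]))
    return math.exp(logp)

print(sentence_prob("<s> I want to eat Chinese food </s>"))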
Kinds of knowledge?
The Shannon Visualization Method
• Generate random sentences:
• Choose a random bigram (<s>, w) according to its probability
• Now choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together (a code sketch follows the example below)
• <s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
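A minimal sampling sketch of the method; the bigram counts are invented for illustration:

import random
from collections import defaultdict

# Bigram counts from a hypothetical trained model (illustrative).
counts = {
    ("<s>", "I"): 3, ("I", "want"): 2, ("I", "eat"): 1,
    ("want", "to"): 2, ("to", "eat"): 2,
    ("eat", "Chinese"): 1, ("eat", "</s>"): 2,
    ("Chinese", "food"): 1, ("food", "</s>"): 1,
}

# Group continuations by history so we can sample from P(w | prev).
continuations = defaultdict(list)
for (prev, w), c in counts.items():
    continuations[prev].append((w, c))

def shannon_sentence():
    word, sentence = "<s>", []
    while word != "</s>":
        words, weights = zip(*continuations[word])
        word = random.choices(words, weights=weights)[0]  # sample ~ P(w | prev)
        sentence.append(word)
    return " ".join(sentence[:-1])  # drop the </s> marker

print(shannon_sentence())  # e.g. "I want to eat Chinese food"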
Shakespeare
Shakespeare as corpus
The Wall Street Journal is Not Shakespeare
Why?
P(S | X) = P(X | S) P(S) / P(X)
Unknown words: Open versus closed vocabulary tasks
• If we know all the words in advance:
Vocabulary V is fixed
Closed vocabulary task
• Often we don’t know this:
Out Of Vocabulary = OOV words
Open vocabulary task
• Instead: create an unknown word token <UNK> (sketch below)
Training of <UNK> probabilities:
Create a fixed lexicon L of size V
At the text normalization phase, change any training word not in L to <UNK>
Now we train its probabilities like a normal word
At decoding time:
If text input: use <UNK> probabilities for any word not in training
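A minimal sketch of this <UNK> normalization step; the lexicon size and corpus are toy assumptions:

from collections import Counter

def build_lexicon(training_words, size):
    # Fixed lexicon L: the `size` most frequent training words.
    return {w for w, _ in Counter(training_words).most_common(size)}

def normalize(words, lexicon):
    # Any word outside L becomes the unknown-word token <UNK>;
    # <UNK> is then trained like any other word.
    return [w if w in lexicon else "<UNK>" for w in words]

train = "the cat sat on the mat the cat ran".split()
L = build_lexicon(train, size=3)
print(normalize(train, L))
print(normalize("the dog ran".split(), L))  # OOV words map to <UNK>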
Evaluation
• Perplexity: PP(W) = P(w1 w2 … wN)^(-1/N), the inverse probability of the test set, normalized by the number of words
• Chain rule: PP(W) = (Π 1/P(wi | w1 … wi-1))^(1/N)
• For bigrams: PP(W) = (Π 1/P(wi | wi-1))^(1/N)
Slide from Josh Goodman
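A sketch of computing perplexity under a bigram model in log space; the uniform stand-in model is an assumption for illustration:

import math

def perplexity(test_words, bigram_prob):
    """PP(W) = P(w1..wN)^(-1/N), via the bigram chain rule in log space."""
    log_sum = sum(
        math.log(bigram_prob(w, prev))
        for prev, w in zip(test_words, test_words[1:])
    )
    n = len(test_words) - 1            # number of bigram predictions
    return math.exp(-log_sum / n)

# Hypothetical model assigning uniform probability 0.25 to every bigram:
print(perplexity("<s> I want to eat </s>".split(), lambda w, prev: 0.25))  # 4.0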
Lower perplexity = better model
Lesson 1: the perils of overfitting
Lesson 2: zeros or not?
• Zipf’s Law:
A small number of events occur with high frequency
A large number of events occur with low frequency
You can quickly collect statistics on the high-frequency events
You might have to wait an arbitrarily long time to get valid statistics on low-frequency events
• Result:
Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
Some of the zeros in the table are really zeros, but others are simply low-frequency events you haven’t seen yet. After all, anything can happen!
How to address this?
• Answer:
Estimate the likelihood of unseen N-grams!
Smoothing is like Robin Hood:
Steal from the rich and give to the poor (in probability mass)
Laplace smoothing
• MLE estimate: P(wi) = ci / N
• Laplace estimate: P_Laplace(wi) = (ci + 1) / (N + V)
• Reconstructed counts: ci* = (ci + 1) N / (N + V)
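A sketch of the same add-one idea applied to bigrams, over a toy corpus; the corpus and names are invented for illustration:

from collections import Counter

corpus = "<s> I want to eat Chinese food </s>".split()
V = len(set(corpus))                      # vocabulary size
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_laplace(w, prev):
    # Add-one smoothing: P(w | prev) = (C(prev, w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def reconstructed_count(w, prev):
    # c* = (C(prev, w) + 1) * C(prev) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) * unigrams[prev] / (unigrams[prev] + V)

print(p_laplace("want", "I"))  # seen bigram: (1+1)/(1+8)
print(p_laplace("food", "I"))  # unseen bigram gets nonzero mass: 1/9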
Laplace-smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Big Changes to Counts
Better Discounting Methods
Good-Turing
• Intuition: use the count of things you’ve seen once to estimate the count of things you’ve never seen
• In the fishing example: 18 fish caught, 3 species seen exactly once, so P(next fish is a new species) = N1/N = 3/18
Good-Turing Intuition
Slide from Josh Goodman
Bigram frequencies of frequencies and GT re-estimates
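A sketch of the Good-Turing re-estimate c* = (c + 1) N(c+1) / N(c) from a frequency-of-frequencies table; the numbers are invented for illustration:

# N[c] = number of distinct bigrams that occurred exactly c times
# (invented numbers for illustration).
N = {1: 2000, 2: 800, 3: 400, 4: 250, 5: 180}
total_bigrams = 10000  # total bigram tokens observed

def gt_count(c):
    # Good-Turing re-estimated count: c* = (c + 1) * N[c+1] / N[c]
    return (c + 1) * N[c + 1] / N[c]

# Probability mass reserved for unseen bigrams: N1 / total tokens
print("P(unseen) =", N[1] / total_bigrams)
for c in range(1, 5):
    print(f"c = {c}: c* = {gt_count(c):.2f}")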
Backoff and Interpolation
Interpolation
• Simple interpolation (sketch below):
P̂(wn | wn-2, wn-1) = λ1 P(wn | wn-2, wn-1) + λ2 P(wn | wn-1) + λ3 P(wn), with λ1 + λ2 + λ3 = 1
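A sketch of this interpolation; the lambda weights and the stand-in distributions are assumptions, not tuned values:

# Simple linear interpolation of trigram, bigram, and unigram estimates.
# The lambda weights are assumptions here; in practice they are tuned
# on held-out data and must sum to 1.
LAMBDAS = (0.5, 0.3, 0.2)

def p_interp(w, prev2, prev1, p_tri, p_bi, p_uni):
    l1, l2, l3 = LAMBDAS
    return (l1 * p_tri(w, prev2, prev1)
            + l2 * p_bi(w, prev1)
            + l3 * p_uni(w))

# Toy stand-in distributions (uniform over a 10-word vocabulary):
print(p_interp("food", "to", "eat",
               lambda w, a, b: 0.1, lambda w, a: 0.1, lambda w: 0.1))  # 0.1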
How to set the lambdas?
• Use a held-out corpus: choose the λs that maximize the probability of the held-out data
OOV words: <UNK> word
Practical Issues
• Do everything in log space:
Avoids underflow
Adding is faster than multiplying
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
Language Modeling Toolkits
• SRILM
• CMU-Cambridge LM Toolkit
• These toolkits are publicly available
• Can use them to build N-gram models
• Lots of parameters (you need to know the theory!)
• Standard N-gram format: ARPA language model (see pages 108-109)
Google N-Gram Release