Professional Documents
Culture Documents
n-gram Models
AGCTTCG ... A? G? C? T?
Applications of LM
Speech Recognition
Statistical Machine Translation
Language Identification
Spelling correction
In Speech Recognition
Given: Observed sound - O
Find: The most likely word/sentence S*
S1: How to recognize speech. ?
S2: How to wreck a nice beach. ?
S3:
Goal: find most likely sentence (S*) given the observed sound (O)
ie. pick the sentence with the highest probability: S* = argmax P(S | O)
SL
We can use Bayes rule to rewrite this as: S* = argmax P(O | S)P(S)
P(O)
SL
Since denominator is the same for each candidate S, we can ignore it for the
argmax:
In Speech Recognition
S* = argmax P(O | S) P(S)
SL
Acoustic model
Language model
argmax
wordsequen ce
Translation model
Language model
1st Approximation
n-grams
10
mountain? tree?
pill? broccoli?
11
Bigrams
1st word
zoophyte zucchini
aardvark
aardwolf
26
12
zoophyte
zucchini
aback
12
Trigrams
1st word
2nd word
13
14
C(w1 ...wn )
P (wn | w1 ...wn-1 ) =
C(w1 ...wn-1 )
remember that?
Well see later again
15
Example 1:
8 times, followed by as
1 time, followed by more
1 time, followed by a
so we have:
Example 2:
P(<s>) =
P(on) =
P(want) =
P(have) =
P(eat) =
0.1
0.09
0.05
0.007
0.001
P(on|eat) =
P(some|eat) =
P(British|eat) =
P(I|<s>) =
P(Id|<s>) =
0.16
0.06
0.001
0.25
0.06
P(want|I) =
P(would|I) =
P(dont|I) =
P(to|want) =
P(a|want) =
0.32
0.29
0.08
0.65
0.5
P(eat|to) =
0.26
P(have|to) =
0.14
P(spend|to)=
0.09
P(food|British) =
0.6
P(<s>|food) = 0.15
= 0.1 x 0.25
x 0.32
= 0.00000012168
x 0.65
x 0.26
x 0.001
x 0.6
x 0.15
17
18
Some Adjustments
19
Solution: smoothing
20
21
Add-one Smoothing
new_count(n-gram) = old_count(n-gram) + 1
22
Add-one: Example
unsmoothed bigram counts (frequencies):
2nd word
1st word
want
to
eat
Chinese
food
lunch
Total
1087
13
C(I)=3437
want
786
C(want)=1215
to
10
860
12
C(to)=3256
eat
19
52
C(eat)=938
Chinese
120
C(Chinese)=213
food
19
17
C(food)=1506
lunch
C(lunch)=459
...
...
N=10,000
V = {a, aardvark, aardwolf, aback, , I, , want, to, , eat, Chinese, , food, , lunch, ,
zoophyte, zucchini}
Add-one: Example
unsmoothed bigram counts:
1st word
want
to
2nd word
eat
Chinese
food
lunch
Total
1087
13
C(I)=3437
want
786
C(want)=1215
to
10
860
12
C(to)=3256
eat
19
52
C(eat)=938
Chinese
120
C(Chinese)=213
food
19
17
C(food)=1506
lunch
C(lunch)=459
N=10,000
want
to
eat
Chinese
food
lunch
0.002
(8/3437)
0.0025
0.316
0.0037
0.0049
0.0009
0.264
0.0049
(6/1215)
0.0009
0.0066
to
0.647
(786/1215)
0.0030
0.0037
eat
0.002
0.020
0.002
0.0554
Chinese
0.0094
0.56
0.0047
food
0.0126
0.0113
lunch
0.0087
0.0002
want
Total
note :
8
10 000
8
P(I | I) =
3 437
P(II) =
24
want
8 9
want
to
eat
Chinese
food
lunch
Total
14
3 4
1087
1088
1
787
3437
C(I) + |V| = 5053
C(want) + |V| = 2831
to
11
861
13
eat
23
20
53
Chinese
121
food
20
18
lunch
total = 10,000
N+|V|2 = 10,000 + (1616)2
= 2,621,456
want
to
eat
Chinese
food
lunch
.215
.00019
.0028
.00019
.00019
.00019
want
.0018
(9/5053)
.0014
.00035
.278
.00035
.0025
.0031
.00247
to
.00082
.0002
.00226
.1767
.00082
.0002
.00267
eat
.00039
.00039
.0009
.00039
.0078
.0012
.0208
25
26
Add-delta Smoothing
PAddD(w1 w2 wn) =
C (w1 w2 wn) +
N+ B
27
Size:
Nature (adaptation):
28
Training phase:
Testing phase:
29
bigram of characters
30
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0042
0.0014
0.0014
0.0014
0.0014
0.0014
0.0097
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0042
0.0014
0.0014
0.0014
0.0014
0.0014
0.0097
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
0.0014
5-grams
generated from 1 trillion words
24 GB compressed
Number
Number
Number
Number
Number
Number
Number
of tokens: 1,024,908,267,229
of sentences: 95,119,665,584
of unigrams: 13,588,391
of bigrams: 314,843,401
of trigrams: 977,069,902
of fourgrams: 1,313,818,354
of fivegrams: 1,176,470,663
32