
Basic Bayes∗

USC Linguistics

December 20, 2007


        α     ¬α
 β     n11    n01
¬β     n10    n00

Table 1: n = counts

N = n11 + n01 + n10 + n00 (1)


p(α, β) = n11 / N    (2)

p(α) = (n11 + n10) / N    (3)

p(β|α) = n11 / (n11 + n10) = p(α, β) / p(α)    (4)

p(β) = (n11 + n01) / N    (5)

p(α|β) = n11 / (n11 + n01) = p(α, β) / p(β)    (6)

p(α, β) = p(α|β)p(β) = p(α)p(β|α) (7)


“Bayes’ Theorem”:

p(α|β) = p(α)p(β|α) / p(β)    (8)
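A quick numeric sketch of equations (1)–(8), using made-up contingency counts for Table 1 (the counts are assumptions, purely for illustration): computing p(α|β) via Bayes' theorem agrees with computing it directly from the counts.

```python
# Made-up contingency counts for Table 1 (illustration only).
n11, n01, n10, n00 = 30, 10, 20, 40
N = n11 + n01 + n10 + n00              # equation (1)

p_a = (n11 + n10) / N                  # p(alpha), eq (3)
p_b = (n11 + n01) / N                  # p(beta),  eq (5)
p_b_given_a = n11 / (n11 + n10)        # p(beta|alpha), eq (4)

# Bayes' theorem (8): p(alpha|beta) = p(alpha) p(beta|alpha) / p(beta)
p_a_given_b = p_a * p_b_given_a / p_b

# agrees with the direct count ratio of eq (6): n11 / (n11 + n01)
assert abs(p_a_given_b - n11 / (n11 + n01)) < 1e-12
```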

Thanks to David Wilczynski and USC's CSCI 561 slides for the general gist of
this brief introduction. Also to Grenager's Stanford lecture notes (http://www-nlp.stanford.edu/~grenager/cs121/handouts/cs121_lecture06_4pp.pdf), and particularly John A. Carroll's
Sussex notes (http://www.informatics.susx.ac.uk/courses/nlp/lecturenotes/corpus2.pdf) for the tip on
feature products; also Wikipedia for its clear presentation of multiple variables.

Extending to more variables:

p(α|β, γ) = p(α, β, γ) / p(β, γ)
          = p(α, β, γ) / (p(β)p(γ|β))
          = p(α, β)p(γ|α, β) / (p(β)p(γ|β))
          = p(α)p(β|α)p(γ|α, β) / (p(β)p(γ|β))    (9)
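A small check of the expansion (9) in exact arithmetic, on a made-up joint distribution over three binary variables (the weights are arbitrary; any positive joint works, since (9) is an identity):

```python
from itertools import product
from fractions import Fraction

# Made-up joint distribution over binary (alpha, beta, gamma).
vals = list(product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]          # arbitrary positive weights
joint = {v: Fraction(w, sum(weights)) for v, w in zip(vals, weights)}

def p(**fix):
    # probability that the named variables take the given values
    return sum(pr for (a, b, g), pr in joint.items()
               if all({"a": a, "b": b, "g": g}[k] == v for k, v in fix.items()))

# left-hand side of (9): p(alpha | beta, gamma)
lhs = p(a=1, b=1, g=1) / p(b=1, g=1)
# rightmost form of (9): p(a) p(b|a) p(g|a,b) / (p(b) p(g|b))
rhs = (p(a=1) * (p(a=1, b=1) / p(a=1)) * (p(a=1, b=1, g=1) / p(a=1, b=1))
       / (p(b=1) * (p(b=1, g=1) / p(b=1))))
assert lhs == rhs
```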

1 The Naive Approach

For ℓ, a label, and f, the features of the event¹:

p(fi|ℓ) = c(fi, ℓ) / Σj c(fj, ℓ)    (10)

p(ℓ) = c(ℓ) / Σi c(ℓi)    (11)

A new event is assigned the label which maximizes the following product.

p(ℓ) Πi p(fi|ℓ)    (12)
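The naive approach above can be sketched in a few lines: count label and feature occurrences in training data, turn them into the maximum-likelihood estimates (10) and (11), and label a new event by the maximizing product (12). The training data and labels here are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up training data: (label, features) pairs.
training = [
    ("spam", ["buy", "cheap", "pills"]),
    ("spam", ["cheap", "offer"]),
    ("ham",  ["meeting", "notes", "offer"]),
]

label_counts = Counter(lbl for lbl, _ in training)      # c(l)
feat_counts = defaultdict(Counter)                      # c(f, l)
for label, feats in training:
    feat_counts[label].update(feats)

def score(label, feats):
    # p(l) * prod_i p(f_i | l) -- equations (11) and (10)
    p = label_counts[label] / sum(label_counts.values())
    total = sum(feat_counts[label].values())            # sum_j c(f_j, l)
    for f in feats:
        p *= feat_counts[label][f] / total
    return p

best = max(label_counts, key=lambda lbl: score(lbl, ["cheap", "offer"]))
print(best)  # "spam"
```

Note that the unsmoothed estimates give zero probability to any feature unseen with a label, which is one motivation for the smoothing sections below.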

1.1 Problems

If α and β are independent:

p(α|β) = p(α)    (13)

p(α, β) = p(α)p(β)    (14)


DO NOT IMPLY:
p(α, β|γ) = p(α|γ)p(β|γ) (15)
¹ cf. John A. Carroll

(p(α, β|γ) = p(α|γ)p(β|γ)) ↔ (p(α|β, γ) = p(α|γ)) (16)

Suppose: p(α, β|γ) = p(α|γ)p(β|γ)    (17)

Multiplying both sides by p(γ), and noting that p(β|γ)p(γ) = p(β, γ):

∴ p(α, β, γ) = p(α|γ)p(β, γ)    (18)

Dividing both sides by p(β, γ):

∴ p(α|β, γ) = p(α|γ)    (19)

Many thanks to Greg Lawler, of the University of Chicago, who, in a fortuitous
meeting on a flight, provided this elegant counterexample:

• α: green die + red die = 7

• β: green die = 1

• γ: red die = 6

p(α|β) = p(α|γ) = p(α) = 1/6 (20)


but,
p(α|β, γ) = 1 (21)

∴ p(α|β, γ) ≠ p(α|γ) ∴ p(α, β|γ) ≠ p(α|γ)p(β|γ)    (22)
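The dice counterexample can be verified by enumerating all 36 equally likely (green, red) outcomes:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely (green, red) outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def p(event, given=lambda g, r: True):
    cond = [(g, r) for g, r in outcomes if given(g, r)]
    return Fraction(sum(1 for g, r in cond if event(g, r)), len(cond))

alpha = lambda g, r: g + r == 7    # green + red = 7
beta  = lambda g, r: g == 1        # green = 1
gamma = lambda g, r: r == 6        # red = 6

# equation (20): all three equal 1/6
assert p(alpha) == Fraction(1, 6)
assert p(alpha, given=beta) == Fraction(1, 6)
assert p(alpha, given=gamma) == Fraction(1, 6)

# equation (21): conditioning on both beta and gamma pins the outcome
both = lambda g, r: beta(g, r) and gamma(g, r)
assert p(alpha, given=both) == 1
```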

but anyways:
p(ℓ|f0, ..., fn) = p(ℓ) Πi=0..n p(fi|ℓ) / Πi=0..n p(fi)    (23)

Also notice how this equation can give p > 1:

Assume 3 events: (α, β), (α, γ), (δ, δ)

• p(α) = 2/3

• p(β) = 1/3

• p(γ) = 1/3

• p(β|α) = 1/2

• p(γ|α) = 1/2

p(α|β, γ) = p(α)p(β|α)p(γ|α) / (p(β)p(γ)) = (2/3 · 1/2 · 1/2) / (1/3 · 1/3) = 3/2    (24)
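The arithmetic of (24) in exact fractions, confirming that the naive "posterior" exceeds 1 and therefore cannot be a probability:

```python
from fractions import Fraction

# Probabilities read off the three-event example above.
p_a, p_b, p_g = Fraction(2, 3), Fraction(1, 3), Fraction(1, 3)
p_b_given_a = p_g_given_a = Fraction(1, 2)

# equation (24): p(a) p(b|a) p(g|a) / (p(b) p(g))
naive = p_a * p_b_given_a * p_g_given_a / (p_b * p_g)
print(naive)  # 3/2
assert naive > 1
```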
Also, be sure that c(L) = c(F):

p(ℓ) = c(ℓ) / c(L);  p(f) = c(f) / c(F)    (25)

p(ℓ|f) = (c(ℓ)/c(L)) · (c(ℓ, f)/c(ℓ)) / (c(f)/c(F)) = (c(ℓ, f)/c(L)) / (c(f)/c(F)) = c(ℓ, f)/c(f)    (26)

2 Smoothing

2.1 Linear Interpolation

Controls how much weight the non-conditioned estimate receives; tune α on reserved data.

p(x|y) = αp̂(x|y) + (1 − α)p̂(x)    (27)
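A minimal sketch of (27), with made-up MLE estimates: even when the conditional estimate is zero (the pair was never seen), the interpolated probability stays nonzero thanks to the marginal term.

```python
def interpolate(p_cond, p_marg, alpha):
    # equation (27): alpha near 1 trusts the conditional estimate,
    # alpha near 0 falls back on the marginal.
    return alpha * p_cond + (1 - alpha) * p_marg

p_hat_x_given_y = 0.0   # made-up conditional MLE: pair (x, y) never seen
p_hat_x = 0.2           # made-up marginal MLE of x

print(interpolate(p_hat_x_given_y, p_hat_x, alpha=0.7))  # ≈ 0.06, nonzero
```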

2.2 Laplace

k is “the strength of the prior”; tune k on reserved data.

p(x) = (c(x) + k) / Σx [c(x) + k] = (c(x) + k) / (N + k|X|)    (28)

p(x|y) = (c(x, y) + k) / (c(y) + k|X|)    (29)
If k=1, we are pretending we saw everything once more than we actually did; even
things that we never saw!
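A short sketch of (28) with made-up counts over a three-item vocabulary: the unseen item gets nonzero probability, and the smoothed estimates still sum to 1.

```python
from collections import Counter

counts = Counter({"a": 5, "b": 3})   # made-up observed counts
vocab = ["a", "b", "c"]              # "c" was never observed; |X| = 3
k = 1.0                              # add-one smoothing
N = sum(counts.values())

def p(x):
    # equation (28): (c(x) + k) / (N + k|X|)
    return (counts[x] + k) / (N + k * len(vocab))

assert abs(sum(p(x) for x in vocab) - 1.0) < 1e-12
print(p("c"))  # 1/11, nonzero despite zero count
```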

2.3 Another caveat?

p(ℓ) = (c(ℓ) + |F|) / (N + |F L|)    (30)

p(f|ℓ) = (c(f, ℓ) + 1) / (c(ℓ) + |F|)    (31)

p(f) = (c(f) + |L|) / (N + |F L|)    (32)

|F| is the number of feature types, |L| is the number of label types,
and |F L| is their product.

∴ p(ℓ|f) = (c(ℓ, f) + 1) / (c(f) + |L|)    (33)
