Wray Buntine
Monash University
http://topicmodels.org
Thanks to: Lan Du (Monash) and Kar Wai Lim (NUS) for some content,
Lancelot James of HKUST for teaching me some of the material
2017-09-12
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
e.g. given some measurement points about mosquitoes in Asia, how many species are there?
(see http://everynoise.com)
Music genres are constantly developing.
Which ones do you listen to?
What is the chance that a new genre is seen?
Bayesian N-grams
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
Buntine (Monash) Dirichlet Processes 2017-09-12 20 / 112
Basic Properties of (non-hierarchical) DPs and PYPs
Measures
A measure corresponds to the notion of the size or area.
Definition of Additivity for Functions
A function m(·) on domain X is additive if whenever A, B ⊂ X and
A ∩ B = ∅ then m(A ∪ B) = m(A) + m(B).
[Figure: a draw G from a DP with Gaussian base distribution, and clustered data sampled from it]
H(·) is a Gaussian.
G is a DP sample: a set of points from the Gaussian, weighted according to a DP probability vector.
θi is a sample from G (each θi originally comes from the Gaussian H(·)), so the θi have repeats.
xi ∼ Gaussian(θi, σ), so the xi are clustered.
The Pitman-Yor Process (PYP) has three arguments, PYP(d, α, H(·)) (or 2 for the 1-parameter DP), extending the DP:
discount d is the Zipfian slope for the PYP;
concentration α is inversely proportional to variance;
base distribution H(·) seeds the distribution and is its mean.
The DP is also a PYP: PYP(0, α, H(·)) ≡ DP(α, H(·)).
The PYP was originally called the “two-parameter Poisson-Dirichlet process” (Ishwaran and James, 2003).
Proper definitions later.
[Figure: PYP(d=0.0, α) rank-ordered probabilities for α = 0.1, 1, 10, 100; left panel: log probability vs. log rank, right panel: probability vs. rank]
Note how separated the plots are for different concentrations.
[Figure: PYP(d=0.1, α) rank-ordered probabilities for α = 0.1, 1, 10, 100; left panel: log probability vs. log rank, right panel: probability vs. rank]
Note how separated the plots are for different concentrations.
[Figure: PYP(d=0.5, α) rank-ordered probabilities for α = 0.1, 1, 10; left panel: log probability vs. log rank, right panel: probability vs. rank]
Note: the separation between plots for different concentrations is not as clear.
The DP and PYP samples look like ∑_{k=1}^∞ p_k δ_{θ_k}(·).
With DP(α, H(·)), the p⃗ probabilities, when rank ordered, behave like a geometric series: p_k ∝̃ e^{−k/α}.
With PYP(d, α, H(·)), the p⃗ probabilities, when rank ordered, behave like a Dirichlet series: p_k ∝̃ k^{−1/d}.
Probabilities for the PYP are thus Zipfian (somewhat like Zipf’s law) and converge slower.
We have a list of K distinct atoms x_1^*, ..., x_K^* and their occurrence counts in the data n_1, ..., n_K.
Then p(x_1, ..., x_N | PYP, d, α, H(·)) is given by

    ((α|d)_K / (α)_N) ∏_{k=1}^K (1 − d)_{n_k − 1} H(x_k^*)

where:
rising factorial (α)_N = α(α + 1) ··· (α + N − 1) = Γ(α + N)/Γ(α),
discounted rising factorial (α|d)_K = α(α + d) ··· (α + d(K − 1)) = d^K Γ(α/d + K)/Γ(α/d),
so (1 − d)_{n_k − 1} = (1 − d)(2 − d) ··· (n_k − 1 − d), which is 1 when n_k = 1.
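As a concrete check, the evidence formula above can be computed in log space. This is my own sketch, not code from the slides; the base-distribution factors H(x_k^*) are omitted, since for a fixed base distribution they do not depend on d or α.

```python
# Sketch: log-evidence of counts under a PYP, using the rising-factorial
# form (alpha|d)_K / (alpha)_N * prod_k (1-d)_{n_k - 1}.
# Base-distribution factors H(x_k^*) are omitted.
from math import lgamma, log

def pyp_log_evidence(counts, d, alpha):
    """log p(x_1..x_N | PYP, d, alpha), up to the base-distribution factor."""
    K = len(counts)
    N = sum(counts)
    if d > 0:
        # (alpha|d)_K = d^K * Gamma(alpha/d + K) / Gamma(alpha/d)
        num = K * log(d) + lgamma(alpha / d + K) - lgamma(alpha / d)
    else:
        num = K * log(alpha)          # DP case: (alpha|0)_K = alpha^K
    den = lgamma(alpha + N) - lgamma(alpha)            # (alpha)_N
    # (1-d)_{n_k - 1} = Gamma(n_k - d) / Gamma(1 - d), equal to 1 when n_k = 1
    terms = sum(lgamma(n - d) - lgamma(1 - d) for n in counts)
    return num - den + terms
```

For example, a DP (d = 0, α = 1) with a single count of 2 gives evidence 1/(α + 1) = 1/2, matching the formula by hand.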
Another way to understand the DP/PYP is to draw samples from μ(A) = ∑_{k=1}^∞ p_k δ_{θ_k}(A), e.g. using a Chinese restaurant process.
By discreteness of μ(·), samples must repeat, for instance:
θ_394563, θ_37, θ_72345, θ_37, ...
With a sequence of size N, how many distinct entries K are there?
e.g. given 5Gb of text, how many distinct words are there?
e.g. given a year’s worth of music, how many distinct genres are there?
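Such questions can be explored empirically by simulating the Chinese restaurant process. A minimal sketch (my own construction, using the standard PYP seating probabilities; the function name is mine):

```python
# Sketch: simulate the Chinese restaurant process for a PYP(d, alpha) and
# count the number of distinct values (tables) K after N draws.
import random

def crp_num_tables(N, d, alpha, rng=random):
    counts = []                          # customers per table
    for n in range(N):                   # n customers seated so far
        # existing table k is chosen w.p. (counts[k] - d)/(n + alpha);
        # a new table w.p. (len(counts)*d + alpha)/(n + alpha)
        u = rng.uniform(0, n + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c - d
            if u < acc:
                counts[k] += 1
                break
        else:
            counts.append(1)             # open a new table
    return len(counts)
```

Running this for growing N shows K growing roughly like α log N for the DP (d = 0) and like a power of N for d > 0.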
[Figure: Posterior(K | N, d=0.5, α) (left) and Posterior(K | N, d=0.9, α) (right) for α ∈ {5, 10, 20, 50, 100, 200, 500}; x-axis: number of species/tables]
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
Definitions of DPs and PYPs
Definitions?
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
Definitions of DPs and PYPs Alternatives
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
Definitions of DPs and PYPs Stick Breaking
Stick Breaking
[Figure: stick breaking, a unit stick split into p_1, p_2, ..., with remainders (1 − V_1), (1 − V_1)(1 − V_2), ...]
p_1 = V_1 where V_1 ∼ Beta(1 − d, α + d)
p_2 = V_2 (1 − p_1) where V_2 ∼ Beta(1 − d, α + 2d)
p_3 = V_3 (1 − p_1 − p_2) where V_3 ∼ Beta(1 − d, α + 3d)
p_k = V_k (1 − ∑_{j=1}^{k−1} p_j) where V_k ∼ Beta(1 − d, α + kd)
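The recursion above translates directly into a truncated sampler. This is a sketch under my own assumptions: the truncation level K and the use of `random.betavariate` are illustration choices, not part of the slides.

```python
# Sketch: sample the first K stick-breaking weights of a PYP(d, alpha).
# K is a truncation level for illustration; the true process is infinite.
import random

def stick_breaking(d, alpha, K, rng=random):
    remaining, p = 1.0, []
    for k in range(1, K + 1):
        v = rng.betavariate(1 - d, alpha + k * d)   # V_k ~ Beta(1-d, alpha+kd)
        p.append(v * remaining)                     # p_k = V_k * (1 - sum of earlier p_j)
        remaining *= 1 - v
    return p
```

The returned weights sum to less than 1; the remainder is the mass of the untruncated tail.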
Data for the GEM are in the form of indices for a size-biased ordering.
e.g. 1,1,2,3,2,3,1,4,3,5,2,...
Same as the evidence for PYP, but with a final penalty term.
Is approximated by:

    (n_1/N) · (n_2/(N − n_1)) · (n_3/(N − n_1 − n_2)) · (n_4/(N − n_1 − n_2 − n_3)) ···

i.e. (probability type 1 will be seen 1st) times
(probability type 2 will be seen 2nd given type 1 is seen 1st) times ...
Final penalty term corresponds to the choice of ordering of indices.
Outline
4 Stochastic Processes
Poisson Point Processes
The Gamma Process
Constructing a Gamma Process from its Rate
5 Inference on Hierarchies
Used to:
analyse variable-length parameter vectors and other structures with varying dimensions and elements, e.g. stochastic graphs and trees;
analyse time/space-varying effects, e.g. event bursts in Twitter data, or crime incidents in a city.
The same property holds for the gamma, negative binomial, Gaussian, and stable distributions (and their key parameter).
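The Poisson additivity being referred to can be checked empirically. A sketch (my own; it uses Knuth's classical Poisson sampler, and the rates 5 and 15 echo the example values in the figure below):

```python
# Sketch: empirically check that the sum of independent Poissons is Poisson
# with the summed rate, by matching its mean (Poisson mean = rate).
import math, random

def poisson(lam, rng=random):
    """Knuth's Poisson sampler; fine for moderate lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def empirical_mean_sum(n=20000, seed=0):
    """Mean of Poisson(5) + Poisson(15) samples; should be near 20."""
    rng = random.Random(seed)
    return sum(poisson(5, rng) + poisson(15, rng) for _ in range(n)) / n
```

With 20,000 samples the empirical mean lands very close to 5 + 15 = 20.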
Adding Poissons
[Figure: pmfs Pr(n) of independent Poissons and their sum, e.g. λ_X = 5 and λ_Y = 15 giving λ_{X+Y} = 20]
Doubling Poissons
[Figure: Poisson pmfs Pr(n) for λ = 1, 2, 4, 8, 16, 32, 64]
Doubling Gammas
[Figure: gamma densities Pr(λ) as the parameter doubles; left panel λ ∈ (0, 5), right panel λ ∈ (0, 50)]
The same can be done for the gamma, negative binomial, Gaussian, and stable distributions.
[Figure: a domain X partitioned into four sets A, B, C, D]
Example of a PPP
There are two parts to the definition that define the behaviour of the
PPP.
[Figure: an example PPP on domain X = (0, 1); the rate ρ(θ) and sampled points shown over θ]
[Figure: rate ρ(x) = x^{−1}(1 − x)^2 for the “improper” beta(0, 3)]
The rate for the “improper” beta(0, 3) is ρ(x) = x^{−1}(1 − x)^2; its integral diverges as the left boundary x → 0, so the PPP gets an infinite number of points near zero.
Stochastic Processes Poisson Point Processes
The axiomatic part of the definition says a PPP acts like a Poisson at
discrete points.
The constructive part of the definition says the PPP is like a PDF with
probability proportional to ρ(x).
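That two-part definition translates directly into a sampler on a bounded domain. This is a sketch under my own assumptions: the total rate is finite, and a bound `rho_max` on the rate is known so each point can be placed by rejection sampling.

```python
# Sketch: sample a PPP on (0, 1) in two steps matching the definition:
# (1) the number of points is Poisson(integral of rho),
# (2) points are placed i.i.d. with density proportional to rho(x).
import math, random

def sample_ppp(rho, total_rate, rho_max, rng=None):
    rng = rng or random.Random()
    # Step 1: number of points ~ Poisson(total_rate), Knuth's method
    L, n, p = math.exp(-total_rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            break
        n += 1
    # Step 2: rejection-sample each location with density proportional to rho
    points = []
    while len(points) < n:
        x = rng.random()
        if rng.random() * rho_max <= rho(x):
            points.append(x)
    return points
```

For example, with ρ(x) = 2x the total rate is 1, and repeated draws average one point per sample, concentrated toward x = 1.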
Outline
4 Stochastic Processes
Poisson Point Processes
The Gamma Process
Constructing a Gamma Process from its Rate
5 Inference on Hierarchies
Equivalently, H(·) ∼ GP(m(·), β) when, for any such partition and for all k, H(A_k) ∼ Gamma(m(A_k), β).
Hierarchical Processes
Outline
4 Stochastic Processes
Poisson Point Processes
The Gamma Process
Constructing a Gamma Process from its Rate
5 Inference on Hierarchies
[Figure: sampled points of a PPP on R+ × X, plotted as mass (w) vs. domain (x)]
Have a PPP on R+ × X.
Interpret each point (w, x) as a (mass, domain) pair.
Sample a set of points N ∼ PPP, as shown.
Usually the rate factors, ρ(w, x) = ρ_1(w) ρ_2(x), in which case the PPP is said to be homogeneous.
[Figure: another sampled point set, mass (w) vs. domain (x)]
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
Context:
α is a parent variable, λ ∼ gamma(α, β), and data vector n⃗ has each entry n_n ∼ Poisson(λ) for n = 1, ..., N.
How can our data about λ be used to influence α?
The gamma distribution is p(λ | α, β) = (β^α/Γ(α)) λ^{α−1} e^{−λβ} for λ, α, β ∈ R+.
Given data in the form of (n_1, ..., n_N), each distributed as n_n ∼ Poisson(λ).
The marginal likelihood p(n_1, ..., n_N | Poisson, α, β) is

    (1/∏_{n=1}^N n_n!) (β^α/Γ(α)) ∫ λ^{α+n_·−1} e^{−(N+β)λ} dλ = (1/∏_{n=1}^N n_n!) (β^α Γ(α + n_·))/((N + β)^{α+n_·} Γ(α))

where n_· = ∑_{n=1}^N n_n.
When used hierarchically, this becomes the likelihood for α.
Too complex to easily use in a Gibbs sampler!
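In log space the marginal is just a few log-gamma calls. A sketch (my own, mirroring the closed form above; the function name is mine):

```python
# Sketch: closed-form marginal likelihood of Poisson counts with a
# gamma(alpha, beta) prior on the rate, computed in log space.
from math import lgamma, log

def log_marginal(counts, alpha, beta):
    """log p(n_1..n_N | Poisson, alpha, beta), lambda integrated out."""
    N, n_tot = len(counts), sum(counts)
    log_factorials = sum(lgamma(n + 1) for n in counts)   # log prod n_n!
    return (alpha * log(beta) + lgamma(alpha + n_tot)
            - (alpha + n_tot) * log(N + beta) - lgamma(alpha)
            - log_factorials)
```

As a hand check, with α = β = 1 and a single count of 1, the integral ∫ λ e^{−2λ} dλ = 1/4, so the log marginal is −2 log 2.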
The original likelihood terms in α are (β^α Γ(α + n_·))/((N + β)^{α+n_·} Γ(α)).
With augmentation (auxiliary variable t), we get

    likelihood × augmentation = [(β^α Γ(α + n_·))/((N + β)^{α+n_·} Γ(α))] × [(Γ(α)/Γ(α + n_·)) S^{n_·}_{t,0} α^t] ∝ α^t (β/(N + β))^α
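The S^{n}_{t,0} here are (unsigned) Stirling numbers of the first kind, which expand the rising factorial: (α)_n = Γ(α + n)/Γ(α) = ∑_t S^n_{t,0} α^t. A sketch of the standard triangular recurrence, with a numeric check of that identity (my own code, for illustration):

```python
# Sketch: unsigned Stirling numbers of the first kind via the recurrence
# S^{m+1}_{t,0} = S^m_{t-1,0} + m * S^m_{t,0}, so that
# (a)_n = a(a+1)...(a+n-1) = sum_t S^n_{t,0} * a**t.
def stirling_first(n):
    """Return the row [S^n_{0,0}, ..., S^n_{n,0}]."""
    row = [1]                              # n = 0: empty product = 1
    for m in range(n):
        new = [0] * (len(row) + 1)
        for t, c in enumerate(row):
            new[t + 1] += c                # the S^m_{t-1,0} term
            new[t] += m * c                # the m * S^m_{t,0} term
        row = new
    return row
```

For example, x(x + 1)(x + 2) = 2x + 3x² + x³, so the n = 3 row is [0, 2, 3, 1].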
Context:
α⃗ is a K-dimensional parent probability vector, θ⃗ ∼ Dirichlet_K(β α⃗), and data vector n⃗ ∼ multinomial(θ⃗, N) for total count N.
How can our data about θ⃗ be used to influence α⃗?
Note in the DP case, α⃗ is infinite, but we will get to that later.
The Dirichlet distributionW is p(θ|K α) = Q Γ(β) θkβαk −1 for
Q
k Γ(βαk )
k
+ ~
β ∈ R and K -dim. probability vectors θ and α
~.
Given data in the form of K -dim counts n~, of total N distributed
~ N).
multinomial(θ,
n|multinomial, α
The marginal likelihood p(~ ~ , β, N) is
Q
N Γ(β) k Γ(nk + βαk )
Z Y
N Γ(β) βαk +nk −1 ~
Q θk dθ = Q
n~ k Γ(βαk ) n~ Γ(N + β) k Γ(βαk )
k
~ , β~
The Dirichlet distributionW is p(θ|K α) = Q Γ(β) θkβαk −1 for
Q
k Γ(βαk )
k
+ ~
β ∈ R and K -dim. probability vectors θ and α
~.
Given data in the form of K -dim counts n~, of total N distributed
~ N).
multinomial(θ,
n|multinomial, α
The marginal likelihood p(~ ~ , β, N) is
Q
N Γ(β) k Γ(nk + βαk )
Z Y
N Γ(β) βαk +nk −1 ~
Q θk dθ = Q
n~ k Γ(βαk ) n~ Γ(N + β) k Γ(βαk )
k
The Dirichlet distribution is p(θ⃗ | K, β, α⃗) = (Γ(β)/∏_k Γ(βα_k)) ∏_k θ_k^{βα_k − 1} for β ∈ R+ and K-dim. probability vectors θ⃗ and α⃗.
Given data in the form of K-dim. counts n⃗, of total N, distributed multinomial(θ⃗, N).
The marginal likelihood p(n⃗ | multinomial, α⃗, β, N) is

    (N choose n⃗) (Γ(β)/∏_k Γ(βα_k)) ∫ ∏_k θ_k^{βα_k + n_k − 1} dθ⃗ = (N choose n⃗) (Γ(β) ∏_k Γ(n_k + βα_k))/(Γ(N + β) ∏_k Γ(βα_k))

The likelihood terms in α⃗ are ∏_k Γ(n_k + βα_k)/∏_k Γ(βα_k); as before, too complex to use!
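The closed form above is again a handful of log-gamma calls. A sketch (mine, for illustration; the multinomial coefficient is included):

```python
# Sketch: Dirichlet-multinomial marginal likelihood in log space,
# with prior Dirichlet(beta * alpha) and counts n of total N.
from math import lgamma, exp

def log_dirmult(counts, alphas, beta):
    """log p(n | multinomial, alpha, beta), theta integrated out."""
    N = sum(counts)
    log_choose = lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)
    return (log_choose + lgamma(beta) - lgamma(N + beta)
            + sum(lgamma(n + beta * a) - lgamma(beta * a)
                  for n, a in zip(counts, alphas)))
```

Sanity check: with K = 2, a uniform prior (β α_k = 1), and one draw, either outcome has probability E[θ_k] = 1/2.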
Augment similarly to before with t_k ∼ CRD(0, βα_k, n_k).
Thus, the likelihood becomes β^{∑_k t_k} ∏_k α_k^{t_k}, which is a multinomial likelihood for α⃗.
Thus the marginal likelihood above for α⃗ simplifies to a multinomial likelihood.
For β, there is an additional term Γ(β)/Γ(N + β).
We augment this using q ∼ Beta(β, N).
With augmentation, we get

    likelihood × augmentation = [Γ(β)/Γ(N + β)] × [(Γ(N + β)/(Γ(N)Γ(β))) q^{β−1} (1 − q)^{N−1}] ∝ q^β

Combining the two augmentations, the marginal likelihood for β is q^β β^{∑_k t_k}, which is a Poisson likelihood.
And So On ...
Thus
hierarchies of gamma processes and Dirichlet processes,
which are really hierarchies of gamma distributions and Dirichlet
distributions,
can be augmented so one gets regular Poisson likelihoods and
multinomial likelihoods from child to parent.
There is a rich set of literature and methods for doing this:
Mingyuan Zhou (UT Austin) et al.
Buntine, Du et al.
The PYP has a slightly more complex augmentation,
see Buntine and Hutter, 2012,
corresponds to the generalised gamma process,
allows Zipfian probabilities to be modelled.
The DP and PYP case corresponds to a collapsed, hierarchical CRP
(Buntine and Hutter, 2012).
Outline
4 Stochastic Processes
5 Inference on Hierarchies
6 Conclusion
Other Issues
Chinese Restaurant Processes
Conclusion
Questions?
Conclusion Other Issues
Conclusion Chinese Restaurant Processes
CRP Terminology
Historical Context
CRP Example
p(x_13 = X_1^* | x_{1:12}, ...) = (2 − d)/(12 + α)
p(x_13 = X_2^* | x_{1:12}, ...) = (4 − d)/(12 + α)
p(x_13 = X_3^* | x_{1:12}, ...) = (4 − d)/(12 + α)
p(x_13 = X_4^* | x_{1:12}, ...) = (2 − d)/(12 + α)
p(x_13 = X_5^* | x_{1:12}, ...) = ((α + 4d)/(12 + α)) H(X_5^*)
CRP, cont.
    p(x_{N+1} | x_1, ..., x_N, d, α, H(·)) = ((Kd + α)/(N + α)) H(x_{N+1}) + ∑_{k=1}^K ((n_k − d)/(N + α)) δ_{X_k^*}(x_{N+1})
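The predictive rule is easy to compute for a given seating; a sketch (my own, with the base-distribution density left out since it simply multiplies the new-cluster term):

```python
# Sketch: the PYP predictive rule as a probability vector over
# "join existing cluster k" (entries 0..K-1) vs "new cluster" (last entry).
def pyp_predictive(counts, d, alpha):
    N, K = sum(counts), len(counts)
    probs = [(n - d) / (N + alpha) for n in counts]   # (n_k - d)/(N + alpha)
    probs.append((K * d + alpha) / (N + alpha))       # new cluster
    return probs
```

With counts (2, 4, 4, 2) this reproduces the example probabilities on the previous slide, and the entries sum to 1.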
[Figure: three table configurations, customers x_1, ..., x_10 seated at tables labelled ‘a’, ‘b’, ‘c’, ‘d’]
The above three table configurations all match the data stream:
a, b, a, c, b, b, d, c, a, b
[Figure: one of the table configurations above; T_k is the number of tables serving dish k]
T = T1 + T2 + ... + TK
p(x⃗, t⃗ | N, PYP, d, α)
    = ((α|d)_T/(α)_N) ∏_{k=1}^K ∑_{tables(T_k)=t_k} ∏_{t : X_t^* = k} (1 − d)_{m_t − 1} H(k)^{t_k}
    = ((α|d)_T/(α)_N) ∏_{k=1}^K S^{n_k}_{t_k,d} H(k)^{t_k}
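The S^{n}_{t,d} in the last line are generalised Stirling numbers. A sketch computing them (my own code, assuming the standard recurrence S^{m+1}_{t,d} = S^m_{t−1,d} + (m − t d) S^m_{t,d}, which reduces to the unsigned Stirling numbers of the first kind at d = 0):

```python
# Sketch: generalised Stirling numbers S^n_{t,d} via the recurrence
# S^{m+1}_{t,d} = S^m_{t-1,d} + (m - t*d) * S^m_{t,d}, with S^0_{0,d} = 1.
def gen_stirling_row(n, d):
    """Return the row [S^n_{0,d}, ..., S^n_{n,d}]."""
    row = [1.0]
    for m in range(n):
        new = [0.0] * (len(row) + 1)
        for t, s in enumerate(row):
            new[t + 1] += s                  # the S^m_{t-1,d} term
            new[t] += (m - t * d) * s        # the (m - t*d) * S^m_{t,d} term
        row = new
    return row
```

At d = 0 the n = 3 row is [0, 2, 3, 1] as for ordinary Stirling numbers, and S^n_{1,d} = (1 − d)_{n−1}, e.g. S^3_{1,0.5} = (1 − 0.5)(2 − 0.5) = 0.75.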