
ECE 8527: Introduction to Machine Learning and Pattern Recognition

LECTURE 10: EXPECTATION MAXIMIZATION (EM)

• Objectives:
Jensen’s Inequality (Special Case)
EM Theorem Proof
EM Example – Missing Data
Application: Hidden Markov Models

• Resources:
Wiki: EM History
T.D.: Brown CS Tutorial
UIUC: Tutorial
F.J.: Statistical Methods

The Expectation Maximization Algorithm (Preview)

ECE 8527: Lecture 10, Slide 1


The Expectation Maximization Algorithm (Cont.)

ECE 8527: Lecture 10, Slide 2


The Expectation Maximization Algorithm

ECE 8527: Lecture 10, Slide 3


Synopsis
• Expectation Maximization (EM) is an approach used in many settings to find
maximum likelihood estimates of the parameters of probabilistic models.

• EM is an iterative optimization method for estimating unknown parameters
given measurement data. It is used in a variety of contexts to estimate missing
data or to discover hidden (latent) variables.

• The intuition behind EM is an old one: alternate between estimating the
unknown parameters and the hidden variables. The idea had been around for a
long time, but in 1977 Dempster et al. proved convergence and explained its
relationship to maximum likelihood estimation.

• EM alternates between an expectation (E) step, which computes the expected
likelihood by treating the latent variables as if they were observed, and a
maximization (M) step, which computes maximum likelihood estimates of the
parameters by maximizing the expected likelihood found in the E step. The
parameters found in the M step are then used to begin another E step, and the
process is repeated (a minimal code sketch follows this list).

• This approach is the cornerstone of important algorithms such as hidden
Markov modeling and discriminative training, and it has been applied to fields
including human language technology and image processing.
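
To make the E/M alternation concrete, here is a minimal sketch (not from the
slides) of EM for a toy mixture of two biased coins, written in Python with
NumPy. The variable names, data sizes, and initial guesses are illustrative
assumptions; the coin is chosen uniformly at random, so the mixing weights are
fixed at 0.5 and only the two biases are estimated.

```python
import numpy as np

# Toy latent-variable problem: each count of heads comes from one of two
# biased coins, but the coin identity (the latent variable) is never observed.
rng = np.random.default_rng(0)
true_bias = np.array([0.8, 0.35])              # hidden coin biases
coin = rng.integers(0, 2, size=200)            # hidden coin choice per trial (uniform)
heads = rng.binomial(n=20, p=true_bias[coin])  # observed: heads out of 20 tosses

bias = np.array([0.6, 0.5])                    # initial guess
for _ in range(50):
    # E step: posterior probability that each trial used each coin, treating
    # the latent coin label as if it were observed (uniform prior over coins).
    log_lik = (heads[:, None] * np.log(bias) +
               (20 - heads[:, None]) * np.log(1.0 - bias))
    log_lik -= log_lik.max(axis=1, keepdims=True)      # numerical stability
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)

    # M step: maximize the expected (complete-data) log-likelihood, which here
    # reduces to responsibility-weighted head fractions.
    bias = (resp * heads[:, None]).sum(axis=0) / (20.0 * resp.sum(axis=0))

print("estimated biases:", np.round(bias, 3))   # close to 0.8 / 0.35 (possibly swapped)
```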
ECE 8527: Lecture 10, Slide 4
Special Case of Jensen’s Inequality
Lemma: If p(x) and q(x) are two discrete probability distributions, then:
$$\sum_x p(x)\log p(x) \;\ge\; \sum_x p(x)\log q(x)$$

with equality if and only if p(x) = q(x) for all x.

Proof:

$$\sum_x p(x)\log p(x) - \sum_x p(x)\log q(x) \ge 0$$

$$\sum_x p(x)\left[\log p(x) - \log q(x)\right] \ge 0$$

$$\sum_x p(x)\log\frac{p(x)}{q(x)} \ge 0
\qquad\Longleftrightarrow\qquad
\sum_x p(x)\log\frac{q(x)}{p(x)} \le 0$$

$$\sum_x p(x)\log\frac{q(x)}{p(x)} \;\le\; \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right)$$

The last step follows from a bound on the natural logarithm: $\ln x \le x - 1$.

ECE 8527: Lecture 10, Slide 5


Special Case of Jensen’s Inequality
Continuing in efforts to simplify:
$$\sum_x p(x)\log\frac{q(x)}{p(x)}
\;\le\; \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right)
\;=\; \sum_x p(x)\,\frac{q(x)}{p(x)} - \sum_x p(x)
\;=\; \sum_x q(x) - \sum_x p(x)
\;=\; 0$$

We note that since both of these functions are probability distributions, they
must sum to 1.0. Therefore, the inequality holds.
The general form of Jensen’s inequality relates a convex function of an integral
to the integral of the convex function and is used extensively in information
theory.
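
The lemma is easy to sanity-check numerically. The short Python/NumPy sketch
below (an illustration, not part of the original slides) draws random discrete
distributions and verifies both the inequality and the equality condition.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    # Two random discrete distributions over 10 outcomes.
    p = rng.random(10); p /= p.sum()
    q = rng.random(10); q /= q.sum()
    lhs = np.sum(p * np.log(p))      # sum_x p(x) log p(x)
    rhs = np.sum(p * np.log(q))      # sum_x p(x) log q(x)
    assert lhs >= rhs                # the lemma (equivalently, KL(p || q) >= 0)

# Equality holds if and only if q(x) = p(x) for all x.
q = p.copy()
assert np.isclose(np.sum(p * np.log(p)), np.sum(p * np.log(q)))
print("lemma verified on 1000 random distribution pairs")
```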

ECE 8527: Lecture 10, Slide 6


The EM Theorem
Theorem: If

$$\sum_t P_{\theta'}(t \mid y)\,\log P_{\theta}(t, y) \;\ge\; \sum_t P_{\theta'}(t \mid y)\,\log P_{\theta'}(t, y)$$

then $P_{\theta}(y) \ge P_{\theta'}(y)$.

Proof: Let y denote the observable data. Let $P_{\theta'}(y)$ be the probability
distribution of y under some model whose parameters are denoted by $\theta'$.
Let $P_{\theta}(y)$ be the corresponding distribution under a different parameter
setting $\theta$. Our goal is to prove that y is more likely under $\theta$ than
under $\theta'$.

Let t denote some hidden, or latent, variables whose distribution is governed by
the values of the parameters. Because $P_{\theta'}(t \mid y)$ is a probability
distribution that sums to one, we can write:

$$\log P_{\theta}(y) - \log P_{\theta'}(y)
\;=\; \sum_t P_{\theta'}(t \mid y)\,\log P_{\theta}(y)
\;-\; \sum_t P_{\theta'}(t \mid y)\,\log P_{\theta'}(y)$$

In the next step we exploit the dependence of y on t, using well-known
properties of conditional probability distributions.

ECE 8527: Lecture 10, Slide 7


Proof Of The EM Theorem
We can multiply each term by “1”:
 P t , y    P t , y  
log P  y   log P   y    P  t y log  P  y      P  t y log  P   y    
t  P t , y   t  P t , y  
 P t , y    P t , y  
  P  t y log      P  t y log    
 P t y   t  P  t y  
  
t
  P  t y log P t , y    P  t y log P  t , y 
t t
  P  t y log P  t y    P  t y log P t , y 
t t
  P  t y log P t , y    P  t y log P  t , y 
t t

where the inequality follows from our lemma.

Explanation: What exactly have we shown? If the last quantity is greater than
zero, then the new model will be better than the old model. This suggests a
strategy for finding the new parameters, θ: choose them to make the last
quantity positive!
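
The theorem can also be checked numerically on a toy model. The sketch below
(a hypothetical two-value latent variable with the observation fixed at y = 1;
all names are illustrative assumptions) samples random "old" and "new"
parameter settings and confirms that whenever the last quantity above is
non-negative, the observed data is at least as likely under the new setting.

```python
import numpy as np

rng = np.random.default_rng(2)

def joint(theta):
    """P_theta(t, y=1) for t = 0, 1 in a toy model:
    t ~ Uniform{0, 1}, then y = 1 with probability theta[t]."""
    return 0.5 * theta

for _ in range(10000):
    theta_old = rng.uniform(0.05, 0.95, size=2)   # old parameter setting
    theta_new = rng.uniform(0.05, 0.95, size=2)   # candidate new setting
    post_old = joint(theta_old) / joint(theta_old).sum()   # P_old(t | y=1)
    # The quantity from the theorem: expectations of log P(t, y) under P_old(t | y).
    q_new = np.sum(post_old * np.log(joint(theta_new)))
    q_old = np.sum(post_old * np.log(joint(theta_old)))
    if q_new >= q_old:
        # EM Theorem: y must then be at least as likely under theta_new.
        assert joint(theta_new).sum() >= joint(theta_old).sum()

print("EM bound held in every sampled case")
```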

ECE 8527: Lecture 10, Slide 8


Discussion
• If we start with the parameter setting $\theta'$ and find a parameter setting
$\theta$ for which our inequality holds, then the observed data, y, will be more
probable under $\theta$ than under $\theta'$.

• The name Expectation Maximization comes about because we take the expectation
of $\log P_{\theta}(t, y)$ with respect to the old distribution $P_{\theta'}(t \mid y)$
and then maximize this expectation as a function of the argument $\theta$.

• Critical to the success of the algorithm is the choice of a proper intermediate
(latent) variable, t, that makes it possible to find the maximum of the expectation
$\sum_t P_{\theta'}(t \mid y)\,\log P_{\theta}(t, y)$.

• Perhaps the most prominent use of the EM algorithm in pattern recognition is to
derive the Baum-Welch reestimation equations for a hidden Markov model.

• Many other reestimation algorithms have been derived using this approach.
ECE 8527: Lecture 10, Slide 9
Example: Estimating Missing Data

ECE 8527: Lecture 10, Slide 10


Example: Estimating Missing Data

ECE 8527: Lecture 10, Slide 11


Example: Estimating Missing Data

ECE 8527: Lecture 10, Slide 12


Example: Estimating Missing Data

ECE 8527: Lecture 10, Slide 13


Example: Estimating Missing Data
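
The worked example on the preceding slides was presented graphically and did not
survive extraction. As a stand-in, here is a minimal sketch of one common
missing-data formulation, assuming a bivariate Gaussian with a known covariance
and a missing second coordinate in some samples; the model, data sizes, and
names are illustrative assumptions rather than the slides' actual example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 2-D Gaussian with a known, correlated covariance; the second
# coordinate is missing (NaN) for some samples.  EM estimates the mean.
true_mu = np.array([1.0, -2.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])          # assumed known
x = rng.multivariate_normal(true_mu, cov, size=500)
missing = rng.random(500) < 0.4       # 40% of rows lose coordinate 1
x[missing, 1] = np.nan

mu = np.zeros(2)                      # initial guess
for _ in range(100):
    # E step: replace each missing x2 by its conditional expectation
    # E[x2 | x1, mu] = mu2 + (cov21 / cov11) * (x1 - mu1).
    filled = x.copy()
    filled[missing, 1] = mu[1] + (cov[1, 0] / cov[0, 0]) * (x[missing, 0] - mu[0])
    # M step: with the covariance known, the mean that maximizes the expected
    # complete-data log-likelihood is the sample mean of the filled-in data.
    mu = filled.mean(axis=0)

print("estimated mean:", np.round(mu, 3), " (true:", true_mu, ")")
```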

ECE 8527: Lecture 10, Slide 14


Example: Gaussian Mixtures
• An excellent tutorial on Gaussian mixture estimation can be found at
J. Bilmes, EM Estimation

• An interactive demo showing convergence of the estimate can be found at
I. Dinov, Demonstration
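
For reference alongside those tutorials, the sketch below gives a minimal
one-dimensional, two-component Gaussian mixture fit by EM (Python/NumPy; the
synthetic data, initialization, and iteration count are illustrative
assumptions, not taken from the referenced tutorials).

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 1-D data drawn from a two-component Gaussian mixture.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial guesses for the mixture weights, means, and standard deviations.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def gauss(x, mu, sigma):
    """Normal density, broadcasting samples against components."""
    return np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E step: responsibility of each component for each sample.
    dens = w * gauss(x, mu, sigma)                 # shape (N, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: responsibility-weighted reestimates of the parameters.
    n_k = resp.sum(axis=0)
    w = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print("weights:", np.round(w, 2), "means:", np.round(mu, 2), "stds:", np.round(sigma, 2))
```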

ECE 8527: Lecture 10, Slide 15


Summary
• Expectation Maximization (EM) Algorithm: a generalization of Maximum
Likelihood Estimation (MLE) based on maximizing the posterior probability that
the data was generated by a model. Its convergence proof relies on a special
case of Jensen’s inequality.
• Jensen’s Inequality: describes a relationship between two probability
distributions in terms of an entropy-like quantity. A key tool in proving that EM
estimation converges.
• The EM Theorem: proved that estimation of a model’s parameters using an
iteration of EM increases the posterior probability that the data was generated
by the model.
• Demonstrated an application of the EM Theorem to the problem of estimating
missing data points.
• Explained how EM can be used to reestimate parameters in a pattern
recognition system.

ECE 8527: Lecture 10, Slide 16
