
ECE 6504: Advanced Topics in

Machine Learning
Probabilistic Graphical Models and Large-Scale Learning

Topics
Summary of Class
Advanced Topics

Dhruv Batra
Virginia Tech
HW1 Grades
Mean: 28.5/38 ~= 74.9%
[Histogram: #Students vs. Score (0-40)]
HW2 Grades
Mean: 33.2/45 ~= 73.8%
[Histogram: #Students vs. Score (0-40)]
Administrativia
(Mini-)HW4
Out now
Due: May 7, 11:55pm
Implementation: Parameter Learning with Structured SVMs and Cutting-Plane

Final Project Webpage
Due: May 13, 11:55pm (extended from May 7)
Can't use late days any more
1-3 paragraphs
Goal
Illustrative figure
Approach
Results (with figures or tables)

Take Home Final
Out: May 8
Due: May 13, 11:55pm
No late days
Open book, open notes, open internet. Cite your sources.
No discussions!



A look back: PGMs
One of the most exciting advancements in statistical
AI in the last 10-20 years

Marriage of Graph Theory + Probability

Compact representation for exponentially-large probability distributions
Exploit conditional independencies

Generalize:
naïve Bayes
logistic regression
many more



A look back: what you learnt
Directed Graphical Models (Bayes Nets)
Representation: Directed Acyclic Graphs (DAGs), Conditional
Probability Tables (CPTs), d-Separation, v-structures, Markov Blanket,
I-Maps
Parameter Learning: MLE, MAP, EM
Structure Learning: Chow-Liu, Decomposable scores, hill climbing
Inference: Marginals, MAP/MPE, Variable Elimination

Undirected Graphical Models (MRFs/CRFs)
Representation: Junction trees, Factor graphs, treewidth, Local Markov Assumptions, Moralization, Triangulation
Inference: Belief Propagation, Message Passing, Linear Programming
Relaxations, Dual-Decomposition, Variational Inference, Mean Field
Parameter Learning: MLE, gradient descent
Structured Prediction: Structured SVMs, Cutting-Plane training

Large-Scale Learning
Online learning: perceptrons, stochastic (sub-)gradients
Distributed Learning: Dual Decomposition, Alternating Direction Method
of Multipliers (ADMM)



Main Issues in PGMs
Representation
How do we store P(X1, X2, ..., Xn)?
What does my model mean/imply/assume? (Semantics)

Inference
How do I answer questions/queries with my model, such as:
Marginal Estimation: P(X5 | X1, X4)
Most Probable Explanation: argmax P(X1, X2, ..., Xn)

Learning
How do we learn the parameters and structure of P(X1, X2, ..., Xn) from data?
Which model is right for my data?



What is this class about?
Making global predictions from local observations



A look forward
Stuff we couldn't teach you
A.K.A.: Stuff that's not on the exam!

What do people in this area work on?
What is being published in PGMs / Structured Prediction?



Error Decomposition
[Figure series, built up over four slides: Reality vs. the model class; Higher-Order Potentials enlarge the model class; the final slide highlights the "Focus of MAP Inference".]


Continuous-Variable PGMs



Multivariate Gaussian
Canonical form
Standard and canonical forms are related (reconstructed reference formulas below)
Conditioning is easy in canonical form
Marginalization is easy in standard form
Sampling is easy in standard form
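The formulas on this slide were lost in extraction. As a standard reference (my reconstruction, not the slide's rendering): with mean $\mu$ and covariance $\Sigma$,

$$p(x) \propto \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) \propto \exp\!\left(-\tfrac{1}{2}x^{\top}Jx + h^{\top}x\right), \qquad J = \Sigma^{-1}, \quad h = \Sigma^{-1}\mu$$

Conditioning reduces to sub-blocks of $(J, h)$ plus a linear shift, while marginalizing out variables just reads off sub-blocks of $(\mu, \Sigma)$, which is why each operation is easy in its respective form.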



What you've learned so far
VE & Junction Trees
Exact inference
Exponential in tree-width

Belief Propagation, Mean Field
Approximate inference for marginals/conditionals
Fast, but can give inaccurate estimates

Sample-based Inference
Approximate inference for marginals/conditionals
With enough samples, will converge to the right answer (or a high-accuracy estimate)
(If you want to be cynical, replace "enough" with "ridiculously many".)


Goal
Often we want expectations, given samples x[1], ..., x[M] from a distribution P:

$$E_P[f] \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m]), \qquad x[m] \sim P(X)$$

$$P(X = x) \approx \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}(x[m] = x)$$

Discrete random variables: $X = \{X_1, \ldots, X_n\}$
Number of samples from P(X): M
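As a quick illustration of the estimator above, here is a tiny self-contained Python check (a sketch using a continuous toy P; the principle is identical for discrete samples):

```python
import numpy as np

# Monte Carlo estimate of E_P[f] from samples x[1..M] ~ P.
# Toy check: P = N(0, 1) and f(x) = x^2, whose true expectation is 1.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)    # M = 100,000 samples from P
estimate = (x ** 2).mean()      # ~= E_P[f] = 1, up to O(1/sqrt(M)) noise
print(estimate)
```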
Forward Sampling
Sample nodes in topological order
Assignment to parents selects P(X | Pa(X))
End result is one sample from P(X)
Repeat to get more samples

D: x[m, D] ~ (Easy: 0.6, Hard: 0.4) → D = Easy
I: x[m, I] ~ (Low: 0.7, High: 0.3) → I = High
G: x[m, G | D=d, I=i] ~ ([80,100]: 0.9, [50,80): 0.08, [0,50): 0.02) → G = [80,100]
S: x[m, S | I=i] ~ (Bad: 0.2, Good: 0.8) → S = Bad
L: x[m, L | G=g] ~ (Fail: 0.1, Pass: 0.9) → L = Pass
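A minimal Python sketch of one forward-sampling pass through this network. Only the CPT rows printed above are filled in; a full implementation would store one row per assignment to a node's parents:

```python
import random

def forward_sample():
    """One forward-sampling pass: sample nodes in topological order."""
    x = {}
    x['D'] = random.choices(['Easy', 'Hard'], weights=[0.6, 0.4])[0]
    x['I'] = random.choices(['Low', 'High'], weights=[0.7, 0.3])[0]
    # In general these rows must be looked up using the sampled parents
    # (x['D'], x['I'], ...); only the rows shown on the slide appear here.
    x['G'] = random.choices(['[80,100]', '[50,80)', '[0,50)'],
                            weights=[0.9, 0.08, 0.02])[0]
    x['S'] = random.choices(['Bad', 'Good'], weights=[0.2, 0.8])[0]
    x['L'] = random.choices(['Fail', 'Pass'], weights=[0.1, 0.9])[0]
    return x

samples = [forward_sample() for _ in range(10_000)]  # repeat for more samples
```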
Multinomial Sampling
Given an assignment to its parents, Xi is a multinomial random variable.

x[m, G | D=d, I=i] ~ (v1: 0.90, v2: 0.08, v3: 0.02)

Draw U ~ Unif[0,1] and pick the value whose interval contains U:
[0, 0.90) → v1, [0.90, 0.98) → v2, [0.98, 1] → v3
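The picture above is the inverse-CDF trick; a minimal sketch (the helper name sample_multinomial is mine, not from the slides):

```python
import random

def sample_multinomial(values, probs):
    """Draw one value: U ~ Unif[0,1] lands in consecutive intervals whose
    lengths are the probabilities, e.g. [0, 0.90) -> v1, [0.90, 0.98) -> v2,
    [0.98, 1] -> v3."""
    u, cum = random.random(), 0.0
    for v, p in zip(values, probs):
        cum += p
        if u < cum:
            return v
    return values[-1]  # guard against floating-point round-off

g = sample_multinomial(['v1', 'v2', 'v3'], [0.90, 0.08, 0.02])
```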


Sample-based probability estimates
Have a set of M samples from P(X)
Can estimate any probability by counting records:

Marginals:
$$P(D=\text{Easy}, S=\text{Bad}) \approx \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}(x[m,D]=\text{Easy},\; x[m,S]=\text{Bad})$$

Conditionals:
$$P(D=\text{Easy} \mid S=\text{Bad}) \approx \frac{\sum_{m=1}^{M} \mathbf{1}(x[m,D]=\text{Easy},\; x[m,S]=\text{Bad})}{\sum_{m=1}^{M} \mathbf{1}(x[m,S]=\text{Bad})}$$

Rejection sampling: once a sample disagrees with the evidence, throw it away.
Rare events: if the evidence is unlikely, i.e., P(E = e) is small, then the effective sample size for estimating P(X | E = e) is low.
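A counting-based sketch of both estimators, assuming `samples` is the list of dicts produced by the forward-sampling sketch above (the function names are mine):

```python
def estimate_marginal(samples, event):
    """P(event) ~= fraction of samples consistent with event (a dict)."""
    hits = sum(all(s[k] == v for k, v in event.items()) for s in samples)
    return hits / len(samples)

def estimate_conditional(samples, query, evidence):
    """P(query | evidence) by rejection: discard samples that disagree
    with the evidence, then count within the survivors."""
    kept = [s for s in samples if all(s[k] == v for k, v in evidence.items())]
    if not kept:               # rare evidence -> few or no usable samples
        return float('nan')
    return estimate_marginal(kept, query)

p = estimate_conditional(samples, {'D': 'Easy'}, {'S': 'Bad'})
```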
Simulated Annealing
[Figures; Image Credit: BVZ, PAMI'01 (Alpha-Expansion)]


Effect of Exact MAP
[Figure; Image Credit: BVZ, PAMI'01]


Sontag, NIPS'10


Soft-Margin Structured SVM
Minimize
$$\min_{w,\xi}\; \frac{1}{2} \lVert w \rVert^2 + \frac{C}{N} \sum_j \xi_j$$
subject to
$$w^\top \Phi(x_j, y_j) \ge w^\top \Phi(x_j, y) + \Delta(y_j, y) - \xi_j \qquad \forall j,\; \forall y$$

Too many constraints!

Slide Credit: Thorsten Joachims


Cutting-Plane Method
$$\min_{w,\xi}\; \frac{1}{2} \lVert w \rVert^2 + \frac{C}{N} \sum_j \xi_j$$
$$w^\top \Phi(x_j, y_j) \ge w^\top \Phi(x_j, y) + \Delta(y_j, y) - \xi_j$$

Key insight of the NIPS'10 paper
What if we replace the exponentially many constraints with a smaller set (without using Cutting-Plane)?
Key contribution
There exists a rich set of distributions for which this approximation recovers the optimal parameters in the infinite-data setting
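For concreteness, a schematic cutting-plane loop for the structured SVM above. Here `psi`, `task_loss`, and `loss_aug_map` (loss-augmented MAP inference) are problem-specific placeholders I am assuming, not any particular library's API, and the inner solve is a crude subgradient stand-in for the restricted QP:

```python
import numpy as np

def cutting_plane_ssvm(data, psi, task_loss, loss_aug_map, C,
                       eps=1e-3, outer_iters=50, inner_steps=200, lr=0.01):
    """Schematic cutting-plane training. data: list of (x, y) pairs.
    psi(x, y): joint feature vector (np.ndarray); task_loss = Delta(y, yhat);
    loss_aug_map(w, x, y): argmax_yhat [ w . psi(x, yhat) + Delta(y, yhat) ]."""
    N = len(data)
    w = np.zeros_like(psi(*data[0]))
    working_set = []                            # small active-constraint set
    for _ in range(outer_iters):
        grew = False
        for (x, y) in data:
            yhat = loss_aug_map(w, x, y)        # most-violated constraint
            if task_loss(y, yhat) - w @ (psi(x, y) - psi(x, yhat)) > eps:
                working_set.append((x, y, yhat))
                grew = True
        if not grew:
            break                               # every constraint holds to eps
        for _ in range(inner_steps):            # crude stand-in for the QP:
            g = w.copy()                        # gradient of (1/2)||w||^2
            for (x, y, yhat) in working_set:    # hinge terms on working set
                if task_loss(y, yhat) > w @ (psi(x, y) - psi(x, yhat)):
                    g += (C / N) * (psi(x, yhat) - psi(x, y))
            w = w - lr * g
    return w
```

A real implementation solves the restricted QP exactly (typically in the dual) each time the working set grows, which is what gives the method its convergence guarantees.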



Meshi, ICML'10


Tarlow, UAI'10


Ladicky, IJCV'12


Lempitsky, ICCV'09


Kohli & Rother



Perturb and MAP

$$S(y) = \sum_{i \in V} \theta_i(y_i) + \sum_{(i,j) \in E} \theta_{ij}(y_i, y_j)$$

Approach
Perturb: $\tilde{\theta} = \theta + \epsilon$, $\;\epsilon \sim p(\epsilon)$
MAP: $\arg\max_y S_{\tilde{\theta}}(y)$



Perturb and MAP
Perturb: $\tilde{\theta} = \theta + \epsilon$, $\;\epsilon \sim p(\epsilon)$

Theorem: If $\epsilon$ is i.i.d. Gumbel, with one noise term per joint configuration (full-order), then the perturbed MAP solutions are EXACT samples from $p(y) \propto \exp(S(y))$.


[Papandreou & Yuille, ICCV'11]
[Hazan & Jaakkola, ICML'12]
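A small self-contained demo of this theorem in the full-order case, where every joint configuration's score gets its own i.i.d. Gumbel noise term (a toy with 4 configurations; exactly the setup the next slide notes is intractable at scale):

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.array([1.0, 0.5, -0.3, 2.0])      # scores S(y) for 4 joint configs

def perturb_and_map(S):
    eps = rng.gumbel(size=S.shape)       # one i.i.d. Gumbel term per config
    return np.argmax(S + eps)            # MAP of the perturbed model

draws = [perturb_and_map(S) for _ in range(100_000)]
empirical = np.bincount(draws, minlength=len(S)) / len(draws)
gibbs = np.exp(S) / np.exp(S).sum()      # target: p(y) proportional to exp(S(y))
print(empirical, gibbs)                  # the two should closely agree
```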



Perturb and MAP
Full-order Gumbel is hard: it needs one noise term per joint configuration, i.e., exponentially many!



Perturb and MAP
Reduced-Order Gumbel

$$S(y) = \sum_{i \in V} \theta_i(y_i) + \sum_{(i,j) \in E} \theta_{ij}(y_i, y_j)$$

Approach
Perturb (per-unary): $\tilde{\theta}_i = \theta_i + \epsilon_i$, $\;\epsilon_i \sim p(\epsilon)$
MAP: $\arg\max_y S_{\tilde{\theta}}(y)$



My Research: Multiple Predictions
[Figure: a multi-modal distribution P(y) over outputs, with the single MAP solution yMAP marked]
Sampling, a rich history: Porway & Zhu, 2011; Tu & Zhu, 2002


My Research: Multiple Predictions
[Figure: P(y) with yMAP marked]
Sampling, a rich history: Porway & Zhu, 2011; Tu & Zhu, 2002
M-Best MAP: Flerova et al., 2011; Fromer et al., 2009; Yanover et al., 2003
Ideally: M-Best Modes
My Research: Multiple Predictions
[Figure: P(y) with yMAP marked]
Our work: Diverse M-Best in MRFs [ECCV '12]
vs. Sampling (Porway & Zhu, 2011; Tu & Zhu, 2002): don't hope for diversity, explicitly encode it
vs. M-Best MAP (Flerova et al., 2011; Fromer et al., 2009; Yanover et al., 2003): solutions not guaranteed to be modes
Ideally: M-Best Modes


What next?
Seminars:
CV-ML Reading Group
https://filebox.ece.vt.edu/~cvmlreadinggroup/

Conferences:
Neural Information Processing Systems (NIPS)
International Conference on Machine Learning (ICML)
Uncertainty in Artificial Intelligence (UAI)
Artificial Intelligence & Statistics (AISTATS)

Classes (at some point in the near future)
ECE 6504: Fundamental Ideas in Machine Learning
Paper-reading class showing the lineage of ideas
Story from the first backprop papers ('89) to CNNs today
ECE 6504: Machine Learning for Big Data
Large-Scale Distributed Machine Learning
Use frameworks such as GraphLab
Implement things in CloudCV



Feedback
Student Perception of Teaching (SPOT)
https://eval.scholar.vt.edu/portal
Tell us how we're doing
What would you like to see more of?
What would you like to see less of?
ENDS MAY 8

