Fei-Fei Li & Andrej Karpathy & Justin Johnson
Lecture 14 - 29 Feb 2016

Administrative
● Everyone should be done with Assignment 3 now
● Milestone grades will go out soon

Last class
Spatial Transformer
Segmentation

Soft Attention

Videos

ConvNets for images

Feature-based approaches to Activity Recognition

Dense trajectories and motion boundary descriptors for action recognition
Wang et al., 2013

Action Recognition with Improved Trajectories
Wang and Schmid, 2013 (code available!)

The dense trajectories pipeline (Wang et al., 2013):

1. Detect feature points.
   [J. Shi and C. Tomasi, “Good features to track,” CVPR 1994]
   [Ivan Laptev 2005]
2. Track each keypoint using optical flow.
   [G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” 2003]
   [T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” 2011]
3. Extract HOG/HOF/MBH features in the (stabilized) local coordinate system of each tracklet.
4. Accumulate the features into histograms, separately according to multiple spatio-temporal layouts.

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4
=> Output volume [55x55x96]

Q: What if the input is now a small chunk of video, e.g. [227x227x3x15]?
A: Extend the convolutional filters in time and perform spatio-temporal convolutions!
E.g. use 11x11xT filters, where T = 2..15.

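For concreteness, here is a minimal sketch of such a spatio-temporal convolution (assuming PyTorch, which the slides do not use; the temporal filter size T=3 is an arbitrary choice):

```python
import torch
import torch.nn as nn

# A 15-frame RGB clip: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 15, 227, 227)

# 96 spatio-temporal filters: 11x11 in space (stride 4), T=3 in time (stride 1)
conv1 = nn.Conv3d(in_channels=3, out_channels=96,
                  kernel_size=(3, 11, 11), stride=(1, 4, 4))

out = conv1(clip)
print(out.shape)  # torch.Size([1, 96, 13, 55, 55]): 55x55 spatially, 13 time steps
```
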
Spatio-Temporal ConvNets

[3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010]

Spatio-Temporal ConvNets

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Spatio-Temporal ConvNets

Spatio-temporal convolutions (the "slow fusion" model) worked best.

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

Spatio-Temporal ConvNets

Learned filters on the first layer

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

Spatio-Temporal ConvNets
Sports-1M dataset: 1 million videos, 487 sports classes

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

Spatio-Temporal ConvNets

The motion information didn’t add all that much...

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

Spatio-Temporal ConvNets

3D VGGNet, basically.

[Learning Spatiotemporal Features with 3D Convolutional Networks, Tran et al. 2015]

Spatio-Temporal ConvNets

Two-stream networks, from Simonyan and Zisserman (of VGGNet fame):
one stream looks at RGB frames, the other at optical flow.
The two-stream version works much better than either stream alone.

[Two-Stream Convolutional Networks for Action Recognition in Videos, Simonyan and Zisserman 2014]
[T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” 2011]

Long-time Spatio-Temporal ConvNets

All 3D ConvNets so far used local motion cues to get extra
accuracy (spanning e.g. half a second or so).
Q: What if the temporal dependencies of interest are much
longer, e.g. several seconds? (event 1 ... event 2)

Long-time Spatio-Temporal ConvNets
LSTM way before it was cool

(This paper was way ahead of its time. Cited 65 times.)

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Long-time Spatio-Temporal ConvNets

[Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015]

Long-time Spatio-Temporal ConvNets

[Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015]
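Both of these papers model global motion by running a recurrent network over per-frame ConvNet features. A minimal sketch of that pattern (assuming PyTorch; the layer sizes and tiny stand-in CNN are arbitrary choices, not the papers' architectures):

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes, feat_dim=256, hidden_dim=512):
        super().__init__()
        # A tiny stand-in for a per-frame 2D CNN trunk (e.g. AlexNet/GoogLeNet)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))      # run CNN on every frame: (B*T, feat_dim)
        feats = feats.view(B, T, -1)              # regroup into sequences: (B, T, feat_dim)
        out, _ = self.lstm(feats)                 # LSTM over time: (B, T, hidden_dim)
        return self.classifier(out[:, -1])        # classify from the last time step

logits = CNNLSTM(num_classes=101)(torch.randn(2, 16, 3, 64, 64))
```
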

Summary so far
We looked at two types of architectural patterns:

1. Model temporal motion locally (3D CONV)

2. Model temporal motion globally (LSTM / RNN)

+ Fusions of both approaches at the same time.

There is another (cleaner) way!

RNN: infinite (in theory) temporal extent
(neurons that are a function of all video frames in the past)

3D CONVNET: finite temporal extent
(neurons that are only a function of finitely many video frames in the past)

Long-time Spatio-Temporal ConvNets

Beautiful: all neurons in the ConvNet are recurrent.

Only requires (existing) 2D CONV routines. No need for 3D
spatio-temporal CONV.

[Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]

Long-time Spatio-Temporal ConvNets

Normal ConvNet:

Convolution Layer

[Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]

Long-time Spatio-Temporal ConvNets

Recurrent ConvNet: CONV layer N feeds layer N+1 through an
RNN-like recurrence (GRU), so layer N+1 also depends on its own
activations at the previous timestep.

[Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]

Long-time Spatio-Temporal ConvNets
Recall the RNN update equations: Vanilla RNN, GRU, LSTM.

[Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]

Long-time Spatio-Temporal ConvNets
Recall: RNNs

Take the GRU and replace every matrix multiply with a
convolution => a convolutional GRU.

[Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]

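A minimal sketch of such a convolutional GRU cell (assuming PyTorch; the channel counts and kernel size are arbitrary, and this is an illustration of the idea rather than the paper's exact model):

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2  # 'same' padding keeps the spatial size
        # the usual GRU matrix multiplies, replaced by 2D convolutions
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        # x: (B, in_ch, H, W) input feature map, h: (B, hid_ch, H, W) hidden state
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde  # gated update, exactly as in a GRU

cell = ConvGRUCell(in_ch=64, hid_ch=64)
h = torch.zeros(1, 64, 28, 28)
for t in range(16):                       # unroll the recurrence over frames
    h = cell(torch.randn(1, 64, 28, 28), h)
```
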
i.e. we obtain an RNN CONVNET: infinite (in theory) temporal extent
(neurons that are a function of all video frames in the past).

Summary
- You think you need a fancy spatio-temporal video ConvNet
- STOP. Do you really?
- Okay, fine: do you want to model:
  - local motion? (use 3D CONV), or
  - global motion? (use LSTM).
- Try using Optical Flow in a second stream (it can sometimes work better)
- Try out GRU-RCN! (imo best model)

Unsupervised Learning

Unsupervised Learning Overview
● Definitions
● Autoencoders
○ Vanilla
○ Variational
● Adversarial Networks

Supervised vs Unsupervised

Supervised Learning
Data: (x, y); x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection,
semantic segmentation, image captioning, etc.

Unsupervised Learning
Data: x; just data, no labels!
Goal: Learn some structure of the data
Examples: Clustering, dimensionality reduction, feature
learning, generative models, etc.

Unsupervised Learning
● Autoencoders
○ Traditional: feature learning
○ Variational: generate samples
● Generative Adversarial Networks: Generate samples

Autoencoders

Input data x -> Encoder -> Features z -> Decoder -> Reconstructed input data x̂

Encoder: originally linear + nonlinearity (sigmoid); later deep,
fully-connected; later ReLU CNN.
z is usually smaller than x (dimensionality reduction).

Decoder: originally linear + nonlinearity (sigmoid); later deep,
fully-connected; later ReLU CNN (upconv).
E.g. encoder: 4-layer conv; decoder: 4-layer upconv.

Encoder and decoder sometimes share weights.
Example: dim(x) = D, dim(z) = H, w_e: H x D, w_d: D x H = w_e^T.

Train for reconstruction with no labels! The loss function
compares x̂ to x (often L2).

After training, throw away the decoder!

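A minimal sketch of this training setup (assuming PyTorch; the dimensions and random stand-in data are arbitrary):

```python
import torch
import torch.nn as nn

D, H = 784, 32                        # dim(x) = D, dim(z) = H
encoder = nn.Sequential(nn.Linear(D, H), nn.ReLU())
decoder = nn.Sequential(nn.Linear(H, D))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, D)                 # a batch of (unlabeled) data
for step in range(100):
    z = encoder(x)                    # features
    x_hat = decoder(z)                # reconstruction
    loss = ((x_hat - x) ** 2).mean()  # L2 reconstruction loss, no labels needed
    opt.zero_grad(); loss.backward(); opt.step()

# After training: throw away the decoder, keep the encoder as a feature extractor.
```
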
Autoencoders

Use the encoder to initialize a supervised model:
Input data x -> Encoder -> Features z -> Classifier -> Predicted label ŷ
(e.g. bird, plane, dog, deer, truck)

Train for the final task with a classification loss (softmax, etc.),
fine-tuning the encoder jointly with the classifier
(sometimes with small data).

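A minimal sketch of that fine-tuning step (continuing the autoencoder sketch above; the 10-class head and random labeled batch are stand-ins):

```python
import torch
import torch.nn as nn

classifier = nn.Linear(H, 10)         # e.g. 10 classes
params = list(encoder.parameters()) + list(classifier.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

x, y = torch.rand(64, D), torch.randint(0, 10, (64,))  # labeled batch
for step in range(100):
    logits = classifier(encoder(x))   # encoder initialized from the autoencoder
    loss = nn.functional.cross_entropy(logits, y)  # softmax loss for the final task
    opt.zero_grad(); loss.backward(); opt.step()   # fine-tune encoder + classifier jointly
```
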
Autoencoders: Greedy Training
In the mid 2000s, layer-wise pretraining with Restricted
Boltzmann Machines (RBMs) was common.

Training deep nets was hard in 2006!

Not common anymore: with ReLU, proper initialization, batchnorm,
Adam, etc., we can easily train from scratch.

Hinton and Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks”, Science 2006

Autoencoders

Autoencoders can reconstruct data, and can learn features to
initialize a supervised model.

Can we generate images from an autoencoder?

Variational Autoencoder
A Bayesian spin on an autoencoder - lets us generate data!

Assume our data is generated like this: sample z from a true
prior p(z), then sample x from a true conditional p(x|z).

Intuition: x is an image; z gives class, orientation, attributes, etc.

Problem: Estimate the parameters θ without access to the latent states z!

Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014

Variational Autoencoder

Prior: Assume p(z) is a unit Gaussian.

Conditional: Assume p(x|z) is a diagonal Gaussian; predict its
mean μ_x and (diagonal) covariance Σ_x with a neural net.

Decoder network with parameters θ: latent state z -> (μ_x, Σ_x).
The decoder is fully-connected or upconvolutional.

Kingma and Welling, ICLR 2014

Variational Autoencoder: Encoder

By Bayes' rule the posterior is:

p(z|x) = p(x|z) p(z) / p(x)

- p(x|z): use the decoder network =)
- p(z): Gaussian =)
- p(x): intractable integral =(

Approximate the posterior with an encoder network q(z|x) with
parameters φ: data point x -> mean μ_z and (diagonal) covariance Σ_z.
The encoder is fully-connected or convolutional.

Kingma and Welling, ICLR 2014

Variational Autoencoder

Data point x
-> Encoder network: mean μ_z and (diagonal) covariance Σ_z
   (should be close to the prior p(z))
-> Sample z from N(μ_z, Σ_z)
-> Decoder network: mean μ_x and (diagonal) covariance Σ_x
   (should be close to the data x)
-> Sample reconstructed x̂ from N(μ_x, Σ_x)

Training is like a normal autoencoder: reconstruction loss at the
end, regularization toward the prior in the middle.

Kingma and Welling, ICLR 2014

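A minimal sketch of one VAE training step (assuming PyTorch), with a unit Gaussian prior, a diagonal Gaussian q(z|x), and the reparameterization trick; for simplicity the decoder here predicts only the mean μ_x, which is an assumption beyond the slides:

```python
import torch
import torch.nn as nn

D, H = 784, 20
enc = nn.Linear(D, 2 * H)           # predicts (mu_z, log sigma_z^2)
dec = nn.Linear(H, D)               # predicts mu_x (fixed-variance decoder)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, D)               # stand-in batch of data
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick

x_hat = dec(z)
recon = ((x_hat - x) ** 2).sum(dim=1).mean()              # reconstruction term
# KL( N(mu, sigma^2) || N(0, I) ): closed form since everything is Gaussian
kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(dim=1).mean()
loss = recon + kl                   # minimizing this maximizes the lower bound
opt.zero_grad(); loss.backward(); opt.step()
```
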
Variational Autoencoder: Generate Data!

After the network is trained:

Sample z from the prior p(z)
-> Decoder network: mean μ_x and (diagonal) covariance Σ_x
-> Sample generated x̂ from N(μ_x, Σ_x)

Diagonal prior on z => independent latent variables.

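Continuing the VAE sketch above (so `dec` and `H` are the trained decoder and latent size from that snippet), generation after training is just:

```python
with torch.no_grad():
    z = torch.randn(16, H)          # sample from the prior p(z) = N(0, I)
    generated = dec(z)              # decode: the mean mu_x is the generated image
```
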
Variational Autoencoder: Math

Maximum likelihood? Maximize the likelihood of the dataset:

θ* = argmax_θ ∏_i p_θ(x_i)

Maximize the log-likelihood instead, because sums are nicer:

θ* = argmax_θ Σ_i log p_θ(x_i)

Marginalize the joint distribution:

log p_θ(x) = log ∫ p_θ(x|z) p(z) dz

Intractable integral =(

Kingma and Welling, ICLR 2014

Variational Autoencoder: Math

Expand log p_θ(x) under the approximate posterior q_φ(z|x) and
apply Bayes' rule to get a tractable lower bound:

log p_θ(x) ≥ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] - KL( q_φ(z|x) || p(z) )

This is the variational lower bound ("elbow", i.e. ELBO).
Training: maximize the lower bound.

- The first term reconstructs the input data; the expectation is
  sampled with the reparameterization trick (see paper).
- The second term says the latent states should follow the prior.
- Everything is Gaussian, so the KL term has a closed-form solution!

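For completeness, a sketch of the derivation the slides walk through (the standard one from Kingma and Welling; the original equation images did not survive extraction):

```latex
\begin{align*}
\log p_\theta(x)
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x)\right] \\
&= \mathbb{E}_{z}\left[\log \frac{p_\theta(x|z)\,p(z)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{z}\left[\log p_\theta(x|z)\right]
   - \mathbb{E}_{z}\left[\log \frac{q_\phi(z|x)}{p(z)}\right]
   + \mathbb{E}_{z}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \underbrace{\mathbb{E}_{z}\left[\log p_\theta(x|z)\right]
   - D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)}_{\text{ELBO}}
   + \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)}_{\ge 0}
\end{align*}
```

Since the last KL term is nonnegative, the ELBO lower-bounds log p_θ(x).
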
Autoencoder Overview
● Traditional Autoencoders
○ Try to reconstruct input
○ Used to learn features, initialize supervised model
○ Not used much anymore
● Variational Autoencoders
○ Bayesian meets deep learning
○ Sample from model to generate images

Generative Adversarial Nets
Can we generate images with less math?

Random noise z -> Generator -> Fake image x
Discriminator: real or fake? y
Fake examples come from the generator; real examples x come from the dataset.

Train the generator and discriminator jointly.
After training, it is easy to generate images.

Goodfellow et al, “Generative Adversarial Nets”, NIPS 2014

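A minimal sketch of that joint training (assuming PyTorch; the tiny fully-connected networks and random stand-in "real" batch are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

Z, D = 64, 784
G = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(), nn.Linear(256, D), nn.Tanh())
Dnet = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(Dnet.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, D) * 2 - 1     # stand-in for a batch of real images
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# Discriminator step: push real -> "real" (1) and fake -> "fake" (0)
fake = G(torch.randn(32, Z))
d_loss = bce(Dnet(real), ones) + bce(Dnet(fake.detach()), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: make the discriminator call fakes "real"
g_loss = bce(Dnet(G(torch.randn(32, Z))), ones)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After training, generating images is just a forward pass: G(torch.randn(1, Z))
```
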
Generative Adversarial Nets

Generated samples, and generated samples on CIFAR-10.
(Nearest neighbor from the training set shown for comparison.)

Goodfellow et al, “Generative Adversarial Nets”, NIPS 2014

Generative Adversarial Nets: Multiscale

Laplacian pyramid of generators: generate a low-res image; then
repeatedly upsample, generate a delta, and add it in, until the
full resolution is reached. Done!

Denton et al, “Deep generative image models using a Laplacian pyramid of adversarial networks”, NIPS 2015

Generative Adversarial Nets: Multiscale

Discriminators work at every scale!

Denton et al, NIPS 2015

Generative Adversarial Nets: Multiscale

Train a separate model per class on CIFAR-10.

Denton et al, NIPS 2015

Generative Adversarial Nets: Simplifying
Generator is an upsampling network with fractionally-strided convolutions
Discriminator is a convolutional network

Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
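A minimal sketch of such an upsampling generator built from fractionally-strided (transposed) convolutions (assuming PyTorch; the channel counts and 32x32 output are illustrative, not the DCGAN paper's exact architecture):

```python
import torch
import torch.nn as nn

G = nn.Sequential(
    # z: (B, 100, 1, 1) -> (B, 128, 4, 4)
    nn.ConvTranspose2d(100, 128, 4, stride=1), nn.BatchNorm2d(128), nn.ReLU(),
    # -> (B, 64, 8, 8): each fractionally-strided conv doubles the resolution
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    # -> (B, 32, 16, 16)
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    # -> (B, 3, 32, 32), pixel values in [-1, 1]
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
)
img = G(torch.randn(1, 100, 1, 1))   # torch.Size([1, 3, 32, 32])
```
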

Generative Adversarial Nets: Simplifying

Generator
Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016

Generative Adversarial Nets: Simplifying

Samples from the model look amazing!

Radford et al, ICLR 2016

Generative Adversarial Nets: Simplifying

Interpolating between random points in latent space.

Radford et al, ICLR 2016

Generative Adversarial Nets: Vector Math
Radford et al, ICLR 2016

Smiling woman - Neutral woman + Neutral man = Smiling man
(samples from the model)

Average the z vectors, then do arithmetic.

Generative Adversarial Nets: Vector Math

Glasses man - No glasses man + No glasses woman = Woman with glasses

Radford et al, ICLR 2016

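A sketch of that latent-space arithmetic (assuming PyTorch and a trained generator G that maps z vectors to images, as in the GAN sketch earlier; the z groups would come from samples you hand-pick for each attribute):

```python
import torch

def vector_math(G, z_smiling_woman, z_neutral_woman, z_neutral_man):
    # each argument: (N, Z) batch of z vectors whose samples show that attribute
    z = (z_smiling_woman.mean(0)      # average each group of z vectors...
         - z_neutral_woman.mean(0)
         + z_neutral_man.mean(0))     # ...then do arithmetic on the averages
    return G(z.unsqueeze(0))          # decode: ideally, a smiling man
```
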
Putting everything together
Dosovitskiy and Brox, “Generating Images with Perceptual Similarity Metrics based on Deep Networks”, arXiv 2016

Start from a variational autoencoder
(x -> encoder -> μ_z, Σ_z -> z -> decoder -> μ_x, Σ_x -> x̂)
with a pixel loss between x̂ and x. Then add:

- a discriminator network: real or generated? y
- a pretrained AlexNet: an L2 loss between the features x_f of the
  real image and the features x̂_f of the reconstructed image

Putting everything together

Samples from the model, trained on ImageNet.

Dosovitskiy and Brox, “Generating Images with Perceptual Similarity Metrics based on Deep Networks”, arXiv 2016

Recap
● Videos
● Unsupervised learning
○ Autoencoders: Traditional / variational
○ Generative Adversarial Networks
● Next time: Guest lecture from Jeff Dean
