https://www.mpi-inf.mpg.de/hlcv
Rob Fergus
Dept. of Computer Science
New York University
Overview
• Traditional pipeline: Image/Video pixels → hand-designed Feature Extraction → trainable Classifier → Object Class
• Deep learning pipeline: Image/Video pixels → Layer 1 → Layer 2 → Layer 3 → simple Classifier (all stages trainable)
[Diagram also lists trainable approaches: Convolutional Neural Net, Boosting, BayesNP (unsupervised)]
Slide: M. Ranzato
Multistage Hubel&Wiesel Architecture
Slide: Y.LeCun
• Cognitron / Neocognitron [Fukushima 1971-1982]
• Convolutional Networks [LeCun 1988-present]
• Also HMAX [Poggio 2002-2006]
[Reading: Chapters 5.1-5.3, Bishop 2006]
High Level Computer Vision - June 9, 2016; slide taken from David Stutz (Aachen)
Single Layer Perceptron
Input Image
Krizhevsky et al. [NIPS2012]
• Same model as LeCun’98 but:
- Bigger model (8 layers)
- More data (10^6 vs 10^3 images)
- GPU implementation (50x speedup over CPU)
- Better regularization (DropOut)
[Chart: ImageNet 2012 top-5 error rate (%), 0-22.5% scale, by entry: SuperVision, ISI, Oxford, INRIA, Amsterdam]
Commercial Deployment
• Google & Baidu, Spring 2013 for personal image search
Intuitions Behind
Deep Networks
(following slides from Marc Aurelio Ranzato - Google)
Large Convnets for Image Classification
Large Convnets for Image Classification
• Architecture
• Training
• Results
Components of Each Layer
• Spatial/feature pooling (sum or max)
• Normalization [optional] between feature responses
→ Output features
Compare: SIFT Descriptor
Image pixels → apply Gabor filters → spatial pool (sum) → normalize to unit length → feature vector
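The final step of the SIFT-style pipeline above, normalizing the pooled descriptor to unit length, can be sketched as follows (a minimal illustration; the function name is ours, not from the slides):

```python
import math

def l2_normalize(vec):
    """Scale a descriptor to unit Euclidean length, as in the SIFT pipeline."""
    norm = math.sqrt(sum(x * x for x in vec))
    # Guard against an all-zero descriptor, which has no defined direction.
    return [x / norm for x in vec] if norm > 0 else list(vec)

desc = [3.0, 4.0]
unit = l2_normalize(desc)   # [0.6, 0.8], a unit-length vector
```

Normalization makes the descriptor invariant to overall contrast changes, which is why both SIFT and convnet layers often include it.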
Non-Linearity
• Non-linearity
– Per-feature independent
– Tanh
– Sigmoid: 1/(1+exp(-x))
– Rectified linear
• Simplifies backprop
• Makes learning faster
• Avoids saturation issues
→ Preferred option
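The three non-linearities listed above can be compared directly; a small sketch (function names are ours):

```python
import math

def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x)); saturates toward 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear: zero for negative inputs, identity otherwise.
    No saturation for x > 0, and a trivial gradient, hence faster learning."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.3f}  "
          f"sigmoid={sigmoid(x):.3f}  relu={relu(x):.1f}")
```

Tanh and sigmoid flatten out for large |x| (saturation), which shrinks gradients during backprop; the rectified linear unit avoids this on the positive side.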
Pooling
• Spatial Pooling
– Non-overlapping / overlapping regions
– Sum or max
– See Boureau et al. ICML’10 for a theoretical analysis
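Spatial pooling over non-overlapping regions, with either sum or max as the aggregation, can be sketched as (a minimal illustration; the function name is ours):

```python
def pool2d(fmap, size, op=max):
    """Pool a 2-D feature map over non-overlapping size x size regions.
    op=max gives max pooling; op=sum gives sum pooling."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            region = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(op(region))
        out.append(row)
    return out

fmap = [[1, 2, 5, 6],
        [3, 4, 7, 8],
        [0, 1, 1, 0],
        [2, 3, 0, 1]]
pool2d(fmap, 2)           # max pooling: [[4, 8], [3, 1]]
pool2d(fmap, 2, op=sum)   # sum pooling: [[10, 26], [6, 2]]
```

Overlapping pooling (as used by Krizhevsky et al.) would simply use a stride smaller than the region size.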
Architecture
Importance of Depth
Architecture of Krizhevsky et al.
• Trained on ImageNet dataset [Deng et al. CVPR’09]
• 18.2% top-5 error
[Diagram: Input Image → … → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Softmax Output]
Architecture of Krizhevsky et al.
• Only 1.1% drop in performance!
[Diagram: Input Image → … → Layer 2: Conv + Pool → Layer 3: Conv → Softmax Output]
Architecture of Krizhevsky et al.
• 5.7% drop in performance
[Diagram: Input Image → … → Layer 2: Conv + Pool → Layer 3: Conv → Softmax Output]
Tapping off Features at each Layer
Plug features from each layer into a linear SVM or soft-max classifier
[Plots comparing the invariance of Layer 1, Layer 7, and Output features under vertical translation, scale change, and rotation]
Visualizing ConvNets
Visualizing Convnets
• Deconvnet: same operations as the Convnet, but in reverse:
  – Unpool feature maps
  – Apply non-linearity
  – Convolve unpooled maps (filters copied from the Convnet)
• No inference, no learning
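The unpool step above relies on "switches" recorded during the forward max-pooling pass: each pooled value is placed back at the location of the original maximum. A minimal sketch of this mechanism (function names are ours, not from the paper):

```python
def max_pool_with_switches(fmap, size):
    """Max-pool a 2-D map and record the location ('switch') of each maximum."""
    pooled, switches = [], []
    for i in range(0, len(fmap) - size + 1, size):
        prow, srow = [], []
        for j in range(0, len(fmap[0]) - size + 1, size):
            best, loc = fmap[i][j], (i, j)
            for di in range(size):
                for dj in range(size):
                    if fmap[i + di][j + dj] > best:
                        best, loc = fmap[i + di][j + dj], (i + di, j + dj)
            prow.append(best)
            srow.append(loc)
        pooled.append(prow)
        switches.append(srow)
    return pooled, switches

def unpool(pooled, switches, h, w):
    """Deconvnet-style unpooling: each value returns to its switch location,
    all other positions stay zero."""
    out = [[0.0] * w for _ in range(h)]
    for pi, row in enumerate(pooled):
        for pj, val in enumerate(row):
            i, j = switches[pi][pj]
            out[i][j] = val
    return out

fmap = [[1, 9], [4, 2]]
pooled, sw = max_pool_with_switches(fmap, 2)   # pooled [[9]], switch (0, 1)
rec = unpool(pooled, sw, 2, 2)                 # [[0.0, 9], [0.0, 0.0]]
```

Because only the recorded maxima are restored, the reconstruction is an approximation that preserves the structure responsible for the activation.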
Deconvnet Projection from Higher Layers
[Zeiler and Fergus. arXiv’13]
[Diagram: a single feature-map activation (all others zeroed) is projected back through the copied filters to a Layer 1 reconstruction, shown alongside the Convnet’s Layer 1 feature maps]
• Patches from validation images that give maximal activation of a given feature map
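Selecting maximally-activating patches amounts to scanning the feature maps for the strongest responses and then cropping the corresponding image regions. A sketch of the search step (the function name is ours):

```python
def top_activations(feature_maps, k=9):
    """Return the k strongest responses across a set of 2-D feature maps,
    as (map_index, row, col, value) tuples. The (row, col) positions would
    then be mapped back to image patches via the layer's receptive field."""
    acts = []
    for m, fmap in enumerate(feature_maps):
        for i, row in enumerate(fmap):
            for j, v in enumerate(row):
                acts.append((v, m, i, j))
    acts.sort(reverse=True)  # strongest responses first
    return [(m, i, j, v) for v, m, i, j in acts[:k]]

maps = [[[0.1, 0.9], [0.3, 0.2]],
        [[0.8, 0.0], [0.5, 0.4]]]
top_activations(maps, k=2)   # [(0, 0, 1, 0.9), (1, 0, 0, 0.8)]
```

Running this over all validation images and keeping the top 9 per feature map yields the patch grids shown on the following slides.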
[Visualizations: Top-1 and Top-9 activations, and the corresponding image patches, for Layers 3, 4 and 5]
ImageNet Classification 2013 Results
• http://www.image-net.org/challenges/LSVRC/2013/results.php
[Chart: test error (top-5), 0.10-0.17 scale, by entry: Clarifai (extra data), NUS, Andrew Howard, UvA-Euvision, Adobe, CognitiveVision]