Saket Anand
[Figure: decision regions formed by multilayer networks; a three-layer network can form arbitrary regions, with complexity limited by the number of nodes.]
Convolutional Neural Network: Key Idea
Exploit
1. Structure
2. Local Connectivity
3. Parameter Sharing (illustrated in the sketch below)
To Give
1. Translation Invariance
2. Minor Distortion and Scale Invariance
3. Occlusion Invariance
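A minimal sketch (PyTorch is our choice; the slides name no framework) of what local connectivity and parameter sharing buy: on a 28x28 input, a dense layer needs hundreds of thousands of parameters, while a 3x3 convolution applies the same ten parameters at every position, which is also what yields translation invariance.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)                     # one 28x28 grayscale image

fc = nn.Linear(28 * 28, 28 * 28)                  # dense: every pixel connects to every output
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # local 3x3 window, weights shared spatially

print(sum(p.numel() for p in fc.parameters()))    # 615440 parameters
print(sum(p.numel() for p in conv.parameters()))  # 10 parameters (9 weights + 1 bias)
out = conv(x)                                     # the same 10 parameters applied at every position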
Handling sequences through NNs
• How do we capture sequential information using a neural network?
[Figure: an independent feed-forward network at each time step, o_t = f(U x_t); nothing connects the time steps.]
Recurrent Neural Network
[Figure: an RNN unrolled through time; the input x_t enters via weights U, the hidden state s_t carries forward via recurrent weights W, and the output o_t is produced via weights V.]
Recurrent Neural Network
• RNNs are called recurrent because they perform the same task for every element of a sequence.
[Figure: the same RNN cell unrolled over x_1, ..., x_N; the parameters U, V, W are shared across all time steps.]
Notation
• x_t is the input at time step t.
• s_t is the hidden state at time step t. It is the "memory" of the network.
• s_t is calculated from the previous hidden state and the input at the current step: s_t = f(U x_t + W s_{t−1}) (see the sketch after this list).
• Depending on the task, we may want an output at each time step or just one final output at the last time step.
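A minimal NumPy sketch of the recurrence s_t = f(U x_t + W s_{t−1}) above, with f = tanh; the sizes are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 8, 16, 5

U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
xs = rng.normal(size=(T, input_dim))                      # a length-T input sequence

s = np.zeros(hidden_dim)          # s_0: initial hidden state ("memory")
for x_t in xs:
    s = np.tanh(U @ x_t + W @ s)  # the same U and W are reused at every time step
print(s.shape)                    # (16,)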
RNN Modeling Based on Input/Output
• y_t is the ground-truth word at time step t, and y'_t is the predicted word.
• Gradients are summed over the time steps for one training example (sketched below):
∂E/∂W = Σ_t ∂E_t/∂W
[Figure: the unrolled RNN over x_{t−2}, ..., x_{t+2}, with shared parameters U, V, W at every step.]
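A hedged PyTorch sketch (our construction, not from the slides) of this summation: the total loss is the sum of per-step losses E_t, so a single backward() call accumulates ∂E/∂W = Σ_t ∂E_t/∂W through the unrolled graph.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 10)                  # per-step classifier over a 10-word vocabulary

x = torch.randn(1, 5, 8)                  # batch of 1, sequence length 5
y = torch.randint(0, 10, (1, 5))          # ground-truth word index y_t at each step

out, _ = rnn(x)                           # hidden states s_1 .. s_5
logits = head(out)                        # predicted scores y'_t
loss = nn.functional.cross_entropy(logits.view(-1, 10), y.view(-1), reduction="sum")
loss.backward()                           # BPTT: gradients are summed over all t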
Difficulties involved in BPTT
• RNNs trained with BPTT have difficulty learning long-term dependencies due to what is called the vanishing gradient problem.
• When there are many hidden layers, the error gradient weakens as it moves from the back of the network to the front, because the derivative of the sigmoid approaches zero toward its saturated extremes.
• The updates therefore carry less and less information as you move toward the front of the network.
Difficulties involved in BPTT (Cont.)
• The problem exists in CNNs too; RNNs amplify it, since the effective number of layers traversed by back-propagation grows dramatically with the sequence length (a numeric illustration follows).
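A small numeric illustration (our addition): the sigmoid derivative is at most 0.25, so the product of one such factor per traversed layer or time step shrinks toward zero.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 2.0                                       # a moderately saturated pre-activation
local_grad = sigmoid(z) * (1 - sigmoid(z))    # sigma'(2) is roughly 0.105
for depth in (1, 10, 50):
    print(depth, local_grad ** depth)         # depth 50 -> ~1e-49: almost no signal left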
Long Short-Term Memory
LSTM Architecture
• Each line carries an entire vector.
• LSTMs are explicitly designed to avoid the long-term dependency problem.
[Figure: a chain of LSTM cells unrolled over x_{t−1}, x_t, x_{t+1}; each cell contains four interacting layers (σ, σ, tanh, σ).]
LSTM Networks
[Figure: the LSTM cell. The inputs h_{t−1} and x_t feed the forget gate f_t, the input gate i_t, a tanh layer producing the candidate c̃_t, and the output gate o_t; the cell state is updated from c_{t−1} to c_t and the cell emits h_t.]
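For reference, the standard LSTM updates matching the gate labels in the figure (this is the common formulation, e.g. Hochreiter and Schmidhuber; the slides themselves do not spell it out):

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)      (forget gate)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)      (input gate)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)   (candidate values)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t          (cell state update)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                    (new hidden state)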
ResNet Analogy
How does the LSTM cell work? (Cont.)
• The key to LSTMs is the cell state, the horizontal line running through the top of the cell.
• This line runs straight through the entire chain, carrying c_{t−1} to c_t.
• Gates consist of a sigmoid neural network layer and a pointwise multiplication operation.
Gate
• The sigmoid layer outputs numbers between zero and one, representing how much of each component should be let through (a toy sketch follows).
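A toy NumPy sketch (our construction) of a gate: a sigmoid layer produces values in (0, 1), and a pointwise multiplication decides how much of each component passes.

import numpy as np

rng = np.random.default_rng(1)
W_gate = rng.normal(size=(4, 4))
x = rng.normal(size=4)
signal = rng.normal(size=4)                   # the vector being gated

gate = 1.0 / (1.0 + np.exp(-(W_gate @ x)))    # sigmoid layer: each entry in (0, 1)
print(gate * signal)                          # pointwise multiply: gated output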
LSTM Operations: Forget
• The first step is to decide what information to throw away from the cell state; this decision is made by the forget gate f_t.
[Figure: the LSTM cell with the forget-gate path highlighted: a σ layer on [h_{t−1}, x_t] outputs f_t, which multiplies c_{t−1} pointwise.]
Coupled Input and Forget Gates
[Figure: a variant that couples the input gate to the forget gate, i_t = 1 − f_t, so the cell-state update becomes c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ c̃_t.]
Gated Recurrent Unit (GRU)
• Combine the forget and input gates into a single “update gate.”
[Figure: the GRU cell. A reset gate r_t and an update gate z_t act on h_{t−1} and x_t; the candidate h̃_t is blended as h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t.]
Gated Recurrent Unit (GRU)
• z_t = σ(W_z · [h_{t−1}, x_t]) (the remaining GRU updates are listed below)
• Bidirectional RNNs
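Only z_t is shown above; for completeness, the remaining updates in the standard GRU formulation (Cho et al., 2014) are:

r_t = σ(W_r · [h_{t−1}, x_t])            (reset gate)
h̃_t = tanh(W · [r_t ⊙ h_{t−1}, x_t])     (candidate state)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t    (final blend)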
Dense Trajectories
• Dense trajectories and motion boundary descriptors for action recognition:
Wang et al., 2013
Sports-1M dataset: 1 million videos, 487 sports classes
Spatio-Temporal ConvNets
• Large-scale Video Classification with Convolutional Neural Networks,
Karpathy et al., 2014
3D VGGNet, basically.
Spatio-Temporal ConvNets
• Two-Stream Convolutional Networks for Action Recognition in Videos:
Simonyan and Zisserman 2014
Combining predictions:
• Return the prediction at the last time step
• Max-pool the predictions over time
• Sum the predictions over time and return the max
• Linearly weight the predictions over time
Less than 1% difference in output among the four choices (a sketch of the four rules follows).
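A NumPy sketch (our construction) of the four combination rules listed above, applied to per-time-step class scores:

import numpy as np

rng = np.random.default_rng(2)
T, num_classes = 10, 5
preds = rng.random((T, num_classes))          # class scores at each time step

last = preds[-1]                              # 1) prediction at the last time step
max_pooled = preds.max(axis=0)                # 2) max-pool the predictions over time
summed = preds.sum(axis=0)                    # 3) sum over time, then take the max
w = np.linspace(0.1, 1.0, T)[:, None]         # (hypothetical) linear weights over time
weighted = (w * preds).sum(axis=0)            # 4) linearly weighted predictions

for p in (last, max_pooled, summed, weighted):
    print(p.argmax())                         # final class under each rule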
Bi-Directional RNN
A Multi-Stream Bi-Directional Recurrent Neural Network for Fine-Grained
Action Detection, CVPR 2016
Image Captioning
Image Sentence Datasets
Attention mechanism from Show, Attend and Tell only lets us softly attend to fixed grid positions … can we do better?
Spatial Transformer Networks
Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015
Attention: Recap
• Soft attention:
• Easy to implement: produce a distribution over input locations, reweight the features, and feed the result as input (sketched after this list)
• Attend to arbitrary input locations using spatial transformer
networks
• Hard attention:
• Attend to a single input location
• Can’t use gradient descent!
• Need reinforcement learning!
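A minimal NumPy sketch (our construction) of the soft-attention bullet above: score each grid location, softmax into a distribution, and reweight the features; every step is differentiable, so plain gradient descent works.

import numpy as np

rng = np.random.default_rng(3)
L, D = 49, 64                          # a 7x7 grid of D-dimensional CNN features
features = rng.normal(size=(L, D))
query = rng.normal(size=D)             # e.g. the captioning RNN's hidden state

scores = features @ query              # one relevance score per grid location
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax: a distribution over locations
context = weights @ features           # reweighted features, fed back as input
print(context.shape)                   # (64,)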
Other Image Captioning Works
• Explain Images with Multimodal Recurrent Neural Networks, Mao et
al.
• Deep Visual-Semantic Alignments for Generating Image Descriptions,
Karpathy and Fei-Fei
• Show and Tell: A Neural Image Caption Generator, Vinyals et al.
• Long-term Recurrent Convolutional Networks for Visual Recognition
and Description, Donahue et al.
• Learning a Recurrent Visual Representation for Image Caption
Generation, Chen and Zitnick
Learning Representation
Unsupervised Learning with LSTMs, arXiv 2015.
Pose Estimation
Recurrent Network Models for Human Dynamics, ICCV 2015
Reidentification
Recurrent Convolutional Network for Video-based Person Re-Identification,
CVPR 2016
OCR
Recursive Recurrent Nets with
Attention Modeling for OCR in the
Wild, CVPR 2016