
Visual Attention based Image Captioning

Anadi Chaman (12105), K. V. Sameer Raja (12332)


Dr. Amitabha Mukherjee
Indian Institute of Technology, Kanpur

Introduction

In this project, we worked on generating descriptive captions for images using neural language models. Our work is a variant of the CNN-LSTM architecture based on the visual attention models proposed by Kelvin Xu et al. [6]. We have incorporated the use of phrase embeddings for generating captions and compared the performance obtained with them against that obtained from word embeddings.

Previous Work

Ryan Kiros [3] proposed a neural-network-based caption generating model. It used a multimodal log-bilinear model that was biased by the features obtained from the input image.

Andrej Karpathy [1] developed a model that uses multimodal embeddings to align image features and text based on a ranking model. Their multimodal neural network architecture was found to outperform retrieval baselines.

Oriol Vinyals [5] proposed a CNN-LSTM architecture, where feature vectors obtained from a CNN and word embeddings are used to determine the LSTM gate values. Beam search is finally used at the output to generate captions.

Approach

Senna software has been used to obtain phrases from the captions available as part of the training data. The embedding of a phrase is obtained by taking the sum of the embeddings of the words belonging to that phrase.

For generating annotation vectors, we used a pre-trained CNN, namely the Oxford VGGNet trained on the ImageNet dataset, and we use an LSTM architecture for generating the captions.

Due to the large vocabulary size of phrases, we kept the ones with the highest frequency and replaced the rest with the UNK symbol. This reduced our vocabulary to 10000 phrases.
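A minimal sketch of how the phrase vocabulary reduction and the summed word2vec phrase embeddings described above could be implemented. The 10000-phrase frequency cutoff, the gensim loading call, and the file path are illustrative assumptions; Senna itself is not invoked here.

```python
from collections import Counter

import numpy as np
from gensim.models import KeyedVectors

UNK = "UNK"
VOCAB_SIZE = 10000  # keep only the most frequent phrases

def build_phrase_vocab(caption_phrases, vocab_size=VOCAB_SIZE):
    """caption_phrases: list of phrase lists (one list per caption),
    e.g. the chunks produced by Senna. Keeps the most frequent phrases."""
    counts = Counter(p for phrases in caption_phrases for p in phrases)
    return {p for p, _ in counts.most_common(vocab_size)}

def replace_rare(phrases, vocab):
    """Map out-of-vocabulary phrases to the UNK symbol."""
    return [p if p in vocab else UNK for p in phrases]

def phrase_embedding(phrase, word_vectors, dim=300):
    """Phrase embedding = sum of word2vec embeddings of its words.
    Words missing from the pre-trained model contribute nothing."""
    vec = np.zeros(dim, dtype=np.float32)
    for w in phrase.split():
        if w in word_vectors:
            vec += word_vectors[w]
    return vec

# Illustrative usage (path to the Google News word2vec binary is assumed):
# word_vectors = KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin", binary=True)
# vocab = build_phrase_vocab(all_caption_phrases)
# emb = phrase_embedding("a brown dog", word_vectors)
```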

Architecture

Figure 1: System flow diagram

Convolution Neural Network

CNNs are a feedforward type of neural network that convolve and sub-sample an image at successive stages to yield feature maps.

We pass images of size 224 x 224 as input to a pre-trained CNN, where at the first stage they are convolved with 4 different filters to yield 4 sub-images. This process is continued until 512 feature maps of size 14 x 14 each are obtained. Looking across these feature maps, we get 196 annotation vectors (one per spatial location), each of dimensionality 512.

Figure 2: Feature map extraction using a CNN [4]
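A minimal sketch of extracting 196 x 512 annotation vectors from a VGG-style convolutional layer, using the torchvision VGG19 model as a stand-in for the Oxford VGGNet weights we actually used; the layer slicing and preprocessing constants are assumptions.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Conv features up to (and including) the last conv block, before pooling;
# for a 224 x 224 input this yields a 512 x 14 x 14 feature volume.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def annotation_vectors(image_path):
    """Return a (196, 512) tensor: one 512-d annotation vector per
    spatial location of the 14 x 14 conv feature map."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = feature_extractor(img)          # (1, 512, 14, 14)
    return feats.squeeze(0).flatten(1).T        # (196, 512)
```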

LSTM RNN

Figure 3: An LSTM cell [6]

The LSTM is conditioned on the embedding of the previous word, E y_{t-1}, its previous hidden state, h_{t-1}, and the context vector, \hat{z}_t, where T_{D+m+n,n} denotes an affine transformation:

(i_t, f_t, o_t, g_t)^T = (\sigma, \sigma, \sigma, \tanh)^T \, T_{D+m+n,n}(E y_{t-1}; h_{t-1}; \hat{z}_t)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)
p(y_t \mid a, y_{t-1}) \propto \exp\left( L_o (E y_{t-1} + L_h h_t + L_z \hat{z}_t) \right)
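A minimal numpy sketch of one step of this LSTM cell, with the three inputs concatenated and passed through a single affine map T; the weights and dimensions below are illustrative placeholders, not the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, z_t, c_prev, T, b):
    """One LSTM step conditioned on the previous word embedding Ey_prev (D,),
    previous hidden state h_prev (n,), and context vector z_t (m,).
    T has shape (4n, D + n + m); b has shape (4n,)."""
    n = h_prev.shape[0]
    x = np.concatenate([Ey_prev, h_prev, z_t])      # (D + n + m,)
    pre = T @ x + b                                  # stacked gate pre-activations
    i = sigmoid(pre[:n])                             # input gate
    f = sigmoid(pre[n:2 * n])                        # forget gate
    o = sigmoid(pre[2 * n:3 * n])                    # output gate
    g = np.tanh(pre[3 * n:])                         # candidate cell state
    c = f * c_prev + i * g                           # c_t = f ⊙ c_{t-1} + i ⊙ g
    h = o * np.tanh(c)                               # h_t = o ⊙ tanh(c_t)
    return h, c

# Illustrative shapes: D = 300 (embedding), m = 512 (context), n = 256 (hidden)
D, m, n = 300, 512, 256
rng = np.random.default_rng(0)
T = rng.normal(scale=0.01, size=(4 * n, D + n + m))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=D), np.zeros(n), rng.normal(size=m), np.zeros(n), T, b)
```
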
Attention Model

Given the annotation vectors, a context vector is generated which points to different portions of the given image. Mathematically:

\hat{z}_t = \sum_i s_{t,i} \, a_i

We have employed a hard attention model, which is a stochastic mechanism. The weights s_{t,i} are sampled from a Multinoulli(\alpha_i) distribution. These \alpha_i are learned using a network that takes the previous hidden state h_{t-1} and the annotation vectors as input.

The new objective function accounting for sampling is given by:

L_s = \sum_s p(s \mid a) \log p(y \mid s, a) \le \log \sum_s p(s \mid a) \, p(y \mid s, a) = \log p(y \mid a)
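A minimal sketch of the hard attention step: scoring the 196 annotation vectors against the previous hidden state, sampling one location from the resulting Multinoulli distribution, and taking its annotation vector as the context. The small scoring network and its dimensions are illustrative assumptions, not the exact attention network we trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_weights(annotations, h_prev, W_a, W_h, w):
    """Compute alpha_i = softmax(score_i) over the 196 locations, where
    score_i = w . tanh(W_a a_i + W_h h_prev). Shapes: annotations (196, 512),
    h_prev (n,), W_a (k, 512), W_h (k, n), w (k,)."""
    scores = np.tanh(annotations @ W_a.T + h_prev @ W_h.T) @ w   # (196,)
    scores -= scores.max()                                       # numerical stability
    alpha = np.exp(scores)
    return alpha / alpha.sum()

def hard_attention_context(annotations, alpha):
    """Hard attention: sample a single location s_t ~ Multinoulli(alpha)
    and return its annotation vector as the context vector z_t."""
    s_t = rng.choice(len(alpha), p=alpha)
    return annotations[s_t], s_t

# Illustrative usage with random inputs (k = 256, n = 256):
annotations = rng.normal(size=(196, 512))
h_prev = np.zeros(256)
W_a, W_h, w = rng.normal(size=(256, 512)), rng.normal(size=(256, 256)), rng.normal(size=256)
alpha = attention_weights(annotations, h_prev, W_a, W_h, w)
z_t, location = hard_attention_context(annotations, alpha)
```
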
Dataset and Resources

The Flickr8k dataset contains 8000 images, and each image has 5 captions describing it, summing to 40000 caption-image pairs. We used 30000 captions for training, 5000 captions for validation, and 5000 captions for testing.

The word embeddings used for obtaining phrase embeddings are derived from a pre-trained word2vec model trained on the Google News corpus.
Results

Input                      Vocabulary   METEOR
phrases                    10000        0.062
phrases (pre embeddings)   10000        0.06
phrases                    36220        0.041
words [2]                  9630         0.067
words (beam search)        9630         0.089
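Since the best METEOR score above comes from decoding with beam search, here is a minimal sketch of beam search over a next-token distribution. The function next_log_probs is a hypothetical stand-in for the trained LSTM decoder, and the start/end token ids and beam width are assumptions.

```python
import numpy as np

START, END = 0, 1        # assumed special token ids
BEAM_WIDTH = 5
MAX_LEN = 20

def beam_search(next_log_probs, beam_width=BEAM_WIDTH, max_len=MAX_LEN):
    """next_log_probs(seq) -> log-probability vector over the vocabulary
    for the next token given the partial caption `seq` (a list of ids).
    Returns the highest-scoring caption found."""
    beams = [([START], 0.0)]                 # (sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_p = next_log_probs(seq)
            top = np.argsort(log_p)[-beam_width:]
            for tok in top:
                candidates.append((seq + [int(tok)], score + float(log_p[tok])))
        # Keep only the best `beam_width` partial captions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (completed if seq[-1] == END else beams).append((seq, score))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])[0]
```
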
Conclusions

The marginal decrease in accuracy when using phrases is due to the replacement of a large number of phrases in the training data with the UNK symbol. If an efficient phrase vocabulary reduction technique is employed, we hope that phrase input will have better accuracy compared to word input.
References

[1] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306, 2014.
[2] kelvinxu. arctic-captions. https://github.com/kelvinxu/arctic-captions.
[3] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 595-603, 2014.
[4] A. Sironi. Bigger Faster Convolutional Neural Networks. cvlabwww.epfl.ch/projects/bigger_faster_convolutional_neural_networks.html.
[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
[6] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
