Ye Tian                      Tianlun Li
Stanford University          Stanford University
yetian1@stanford.edu         tianlunl@stanford.edu
Work [2] aligns the two modalities through a multimodal embedding. In work [1] (Xu et al.), an attention mechanism is used in the generation of image captions. They use a convolutional neural network to encode the image, and a recurrent neural network together with an attention mechanism to generate the caption. By visualizing the attention weights, we can explain which part of the image the model is focusing on while generating the caption. This paper is also what our project is based on.

3. Image Caption Generation with Attention Mechanism

3.1. Extracting Features

The input of the model is a single raw image and the output is a caption y encoded as a sequence of 1-of-K encoded words,

y = \{y_1, \ldots, y_C\}, \quad y_i \in \mathbb{R}^K

where K is the size of the vocabulary and C is the length of the caption.

To extract a set of feature vectors, which we refer to as annotation vectors, we use a convolutional neural network:

a = \{a_1, \ldots, a_L\}, \quad a_i \in \mathbb{R}^D

The extractor produces L vectors, each of which corresponds to a part of the image as a D-dimensional representation.

In the work [1], the feature vectors were extracted from the convolutional layer before the fully connected layer. We will try different convolutional layers, compare the results, and choose the layer that produces feature vectors containing the most precise information about the relationship between salient features and clusters in the image.

The context vector z_t captures the visual information related to a particular input location. E \in \mathbb{R}^{m \times K} is an embedding matrix; m and n are the embedding and LSTM dimensionality, respectively; \sigma and \odot are the logistic sigmoid activation and element-wise multiplication, respectively.

The model defines a mechanism that computes z_t from the annotation vectors a_i, i = 1, \ldots, L, corresponding to the features extracted at different image locations. z_t is a representation of the relevant part of the image input at time t. For each location i, the mechanism generates a positive weight \alpha_{ti} that can be interpreted as the relative importance to give to location i in blending the a_i's together. The model computes the weight \alpha_{ti} with an attention model f_{att}, for which a multilayer perceptron conditioned on the previous state h_{t-1} is used:

e_{ti} = f_{att}(a_i, h_{t-1}) \quad (4)

\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})} \quad (5)

After the weights are computed, the model then computes the context vector z_t by

z_t = \phi(\{a_i\}, \{\alpha_{ti}\}), \quad (6)

where \phi is a function that returns a single vector given the set of annotation vectors and their corresponding weights.

The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs:

c_0 = f_{init,c}\Big(\frac{1}{L} \sum_i^L a_i\Big), \quad h_0 = f_{init,h}\Big(\frac{1}{L} \sum_i^L a_i\Big)
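To make equations (4)-(6) concrete, the following is a minimal NumPy sketch of one soft attention step, assuming a one-hidden-layer MLP for f_att and a simple weighted average for \phi; the weight names, sizes, and random initialization are illustrative assumptions rather than our actual implementation.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def soft_attention(a, h_prev, params):
    """One attention step: a is (L, D) annotation vectors, h_prev is (n,) LSTM state.
    Returns the context vector z_t of shape (D,) and the weights alpha of shape (L,)."""
    W_a, W_h, w = params["W_a"], params["W_h"], params["w"]   # MLP weights for f_att
    hidden = np.tanh(a @ W_a + h_prev @ W_h)                  # scores each location, eq. (4)
    e = hidden @ w                                            # (L,)
    alpha = softmax(e)                                        # eq. (5)
    z = (alpha[:, None] * a).sum(axis=0)                      # eq. (6) with phi = weighted average
    return z, alpha

# Example with random annotation vectors: L = 49 locations, D = 1024 features, n = 512 LSTM units.
L, D, n, H = 49, 1024, 512, 256
rng = np.random.default_rng(0)
params = {"W_a": rng.normal(size=(D, H)) * 0.01,
          "W_h": rng.normal(size=(n, H)) * 0.01,
          "w": rng.normal(size=(H,)) * 0.01}
a = rng.normal(size=(L, D))
h_prev = np.zeros(n)
z, alpha = soft_attention(a, h_prev, params)
print(z.shape, alpha.sum())   # (1024,) 1.0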
4. Architecture

CNN features have the potential to describe the image. To leverage this potential for natural language, the usual method is to extract sequential information and convert it into language. Most recent image captioning works extract a feature map from the top layers of a CNN and get the scores of the words at every step. Now our goal is, in addition to captioning, to also recognize the objects in the image to which every word refers. In other words, we want position information. Thus we need to extract features from a lower level of the CNN, encode them into a vector that is dominated by the feature vector corresponding to the object the word describes, and pass them into the RNN. Motivated by the work of [1], we design the architectures below.

\{a_i\} are the feature vectors from the CNN. We obtain these feature maps from the inception-5b layer of GoogLeNet [3].
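As a point of reference, here is a minimal sketch of how the annotation vectors \{a_i\} can be pulled out of a GoogLeNet layer with a forward hook, assuming a recent torchvision GoogLeNet; the layer name, expected shapes, and use of random inputs are illustrative assumptions, not our exact pipeline.

import torch
import torchvision

# Assumed: torchvision's GoogLeNet (pretrained weights would be loaded in practice).
model = torchvision.models.googlenet(weights=None).eval()

captured = {}
def save_features(module, inputs, output):
    captured["fmap"] = output          # feature map produced by the hooked layer

# Hook inception5b; a lower layer such as inception3b could be hooked the same way.
hook = model.inception5b.register_forward_hook(save_features)

image = torch.randn(1, 3, 224, 224)    # stand-in for a preprocessed 224x224 image
with torch.no_grad():
    model(image)
hook.remove()

fmap = captured["fmap"]                # expected to be roughly (1, 1024, 7, 7) here
# Flatten the spatial grid into L = H*W annotation vectors of dimension D = channels.
a = fmap.flatten(2).transpose(1, 2)    # (1, L, D)
print(a.shape)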
Figure 1: The first is the original model. The second is the model with more weight on attention. The third is the model depending only on attention.

5. Dataset and Features

We use the MSCOCO dataset. A picture is represented by a dictionary whose keys are as follows: [sentids, filepath, filename, imgid, split, sentences, cocoid], where sentences contains five sentences related to the picture. We have 82,783 training images and 40,504 validation images to train and test our model.

For the sentences, we build a word-index mapping, add a #START# and an #END# symbol to the two ends, and add #NULL# symbols to pad them to the same length. Because some words in the sentences are very sparse, when we generate a vocabulary from the sentences we need to set a threshold to decrease the classification error. The threshold should not only remove the sparse words, but also avoid producing too many unknown words when predicting sentences. Thus, we observe the curve of vocabulary size versus threshold and the curve of total word count versus threshold, which are shown in Fig. 2.
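A minimal Python sketch of this vocabulary construction is given below; the #START#, #END#, and #NULL# tokens follow the text, while the #UNK# token for rare words and the exact counting details are assumptions.

from collections import Counter

def build_vocab(captions, threshold=10):
    """captions: list of tokenized sentences (lists of words).
    Keeps words whose count is at least `threshold`; rare words map to #UNK#."""
    counts = Counter(w for sent in captions for w in sent)
    words = [w for w, c in counts.items() if c >= threshold]
    vocab = ["#NULL#", "#START#", "#END#", "#UNK#"] + sorted(words)
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    return vocab, word_to_idx

def encode(sentence, word_to_idx, max_len):
    """Wrap a sentence in #START#/#END# and pad with #NULL# to max_len."""
    tokens = ["#START#"] + sentence + ["#END#"]
    tokens += ["#NULL#"] * (max_len - len(tokens))
    unk = word_to_idx["#UNK#"]
    return [word_to_idx.get(w, unk) for w in tokens]

captions = [["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sleeps"]]
vocab, word_to_idx = build_vocab(captions, threshold=2)
print(encode(["a", "dog", "flies"], word_to_idx, max_len=8))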
Figure 2: The upper figure is the vocabulary size versus threshold curve; the lower figure is the total word count versus threshold curve.
The former curve is exponential and the latter is linear, so we choose 10 as our threshold.

For the images, we preprocess them by cropping to 224x224 and subtracting the mean. Because there is already a large amount of data, we do not apply any data augmentation.

We tried extracting features from the inception3b, inception4b and inception5b layers of GoogLeNet. Among them, inception5b is a relatively higher layer than the previous two, so it contains less region information and is harder to train to produce attention. The inception3b layer is relatively lower, so it contains more region information but is less expressive for captioning.
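A small sketch of the image preprocessing described above (center crop to 224x224 and mean subtraction); the per-channel mean values are the commonly used ImageNet means and are an assumption, as is the center-crop choice.

import numpy as np

# Assumed per-channel RGB mean (standard ImageNet values); the actual mean may differ.
MEAN_RGB = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def preprocess(image):
    """image: HxWx3 uint8 array. Center-crop to 224x224 and subtract the channel mean."""
    h, w, _ = image.shape
    top, left = (h - 224) // 2, (w - 224) // 2
    crop = image[top:top + 224, left:left + 224].astype(np.float32)
    return crop - MEAN_RGB

img = (np.random.rand(256, 320, 3) * 255).astype(np.uint8)   # stand-in for an MSCOCO image
print(preprocess(img).shape)   # (224, 224, 3)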
6. Experiment, Result and Discussion

6.1. Training

We used a large minibatch size of 512 samples. At the beginning, we could overfit a small dataset of 1000 samples, but when we went to the full dataset of 120,000 samples, we could not overfit it even when we increased the number of hidden units and the depth of the attention function. We then adopted a gradual tuning method: train the model on datasets of size 1000, 10,000 and 60,000, and gradually pick our hyperparameters. Finally we obtained a good model with 60,000 training samples, an LSTM hidden size of 512 and MLP hidden sizes of [1024, 512, 512, 512], which generalizes decently well.
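The sketch below illustrates this gradual tuning loop; train_and_validate is a hypothetical stand-in for our actual training and validation code, and the candidate-selection details are assumptions.

import random

def train_and_validate(config, subset):
    """Hypothetical stand-in: train a model with `config` on `subset`, return a validation score."""
    return random.random()

base_config = {"batch_size": 512,
               "lstm_hidden_size": 512,
               "mlp_hidden_sizes": [1024, 512, 512, 512]}

def gradual_tuning(dataset, candidates, subset_sizes=(1000, 10000, 60000)):
    """Train on growing subsets, keeping the best-scoring hyperparameters at each stage."""
    best = candidates[0]
    for size in subset_sizes:
        subset = dataset[:size]
        best = max(candidates, key=lambda cfg: train_and_validate(cfg, subset))
        candidates = [best]   # in practice, new variants around `best` would be tried next
    return best

dataset = list(range(60000))   # placeholder for the training examples
print(gradual_tuning(dataset, [base_config]))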
6.2. Results

Some of our results are shown in Fig. 4 and Fig. 5. We can see that the generated sentences express the pictures quite well. The main parts of the images are recognized and appear in the sentences, and some of the minor parts are also encoded, such as the flower in the corner of xxxxxxxxx. There are also some mistakes, such as the refrigerator and lamp in xxxxxxxxxx, but we human beings can easily make such mistakes too, since there do exist similar objects in the image. The generated sentences also follow grammar well.

As for attention, our model is only able to recognize the most important part of the images; that is, the attention weights at each step are the same. There are two major reasons. Firstly, since the features are input at the first step of the LSTM, the overall information of the image has already been fed into the decoder, which is enough to generate a decent sentence, so the following inputs can be coarser. This is exactly the motivation of our other models, which have the potential to work better given more fine-tuning. Secondly, the receptive field of inception5 is quite large (139 x 139), so focusing on the main part of the image is enough to get a good sentence. To address this problem, we can use lower-level features from the CNN with a more expressive f_{att}, i.e., deepen the f_{att} CNN and enlarge the number of hidden units in each layer.

We used the BLEU score as the quantitative metric for our results. We can see that our model cannot achieve the state of the art. To get better results, we should enlarge our model, and train and tune it further.
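For completeness, here is a small example of computing a corpus-level BLEU score with NLTK; this is a generic illustration rather than our exact evaluation script, and the sentences below are made up.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against all reference captions for its image
# (MSCOCO provides five per image; two are shown here for brevity).
references = [
    [["a", "dog", "runs", "on", "the", "grass"],
     ["a", "dog", "is", "running", "outside"]],
]
hypotheses = [["a", "dog", "runs", "on", "grass"]]

smooth = SmoothingFunction().method1   # avoids zero scores when higher-order n-grams are missing
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")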
Figure 4: (a) a black bear with flowers resting on a bed; (b) a woman standing on top of a court with a tennis racquet; (c) a meal sitting in top of a table on a wooden table.

Figure 5: (a) a man are sitting at a kitchen area with a group inside; (b) a kitchen is filled with a lamp and refrigerator; (c) a woman standing on front of a phone and head to a house.
References

[1] Xu, Kelvin, et al. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044 (2015).

[2] Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, 2015.

[3] Szegedy, Christian, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.