Visualizing Deep Neural Networks Classes and Features
By Fabien Tencé on 7 July 2016

Introduction
Neural networks are very powerful tools to classify data, but they are very hard to
debug. Indeed, they do a lot of computation with low-level operations, so they are like
black boxes: we provide inputs and get outputs without any understanding of how the
neural network finds its results.

A few years ago, some scientists found ways to delve into the networks used for image
categorization. Instead of doing backpropagation on the weights, as during the learning
phase of a neural network, they did backpropagation on the images themselves: in the
example below (edited from CS231n), considering x as inputs and w as weights, at each
learning step the gradient (red) is applied to the x instead of the w.


In this article, we will use the method and code from Google, Simonyan, Yosinski and
Chollet to try to visualize the classes and convolutional layers learnt by popular neural
networks. The code provided in this article uses the Keras library.

Naive Approach
The core idea of this visualisation is to input a random image to the neural network.
Then, specific outputs of chosen layers are maximized using backpropagation on the
image. These outputs can be the last layer, representing the classes, or intermediate
convolutional layers, representing features learnt by the network.

Using Keras, here is how to do this:

import numpy as np

import scipy.misc
import time
import os
import h5py

from keras.models import Sequential
from keras.layers import Convolution2D, ZeroPadding2D, MaxPooling2D, Flatten, Dense, Dropout
from keras import backend as K

# VGG16 mean values
MEAN_VALUES = np.array([103.939, 116.779, 123.68]).reshape((3, 1, 1))

# path to the model weights file
weights_path = 'vgg16_weights.h5'

# util function to convert a tensor into a valid image
def deprocess(x):
    x += MEAN_VALUES                        # add VGG16 mean values

    x = x[::-1, :, :]                       # change from BGR to RGB
    x = x.transpose((1, 2, 0))              # change from (Channel, Height, Width) to (Height, Width, Channel)

    x = np.clip(x, 0, 255).astype('uint8')  # clip to [0; 255] and convert to int
    return x

# creates a VGG16 model and loads the weights if available
# (see the VGG-16 pre-trained model for Keras gist on GitHub)
def VGG_16(w_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1, 1), input_shape=(3, 224, 224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='linear'))  # avoid softmax (see Simonyan 2013)

    if w_path:
        model.load_weights(w_path)

    return model

# creates the VGG model and loads the weights
model = VGG_16(weights_path)

# specify input and output of the network
input_img = model.layers[0].input
layer_output = model.layers[-1].output

# list of the generated images after learning
kept_images = []

# update coefficient
learning_rate = 500.

for class_index in [130, 351, 736, 850]:  # 130 flamingo, 351 hartebeest, 736 pool table, 850 teddy bear
    print('Processing filter %d' % class_index)
    start_time = time.time()

    # the loss is the activation of the neuron for the chosen class
    loss = layer_output[0, class_index]

    # we compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, input_img)[0]

    # this function returns the loss and grads given the input picture
    # also add a flag to disable the learning phase (in our case dropout)
    iterate = K.function([input_img, K.learning_phase()], [loss, grads])

    np.random.seed(1337)  # for reproducibility
    # we start from a gray image with some random noise
    input_img_data = np.random.normal(0, 10, (1,) + model.input_shape[1:])  # (1,) for batch axis

    # we run gradient ascent for 1000 steps
    for i in range(1000):
        loss_value, grads_value = iterate([input_img_data, 0])  # 0 for test phase
        input_img_data += grads_value * learning_rate           # apply gradient to image

        print('Current loss value:', loss_value)

    # decode the resulting input image and add it to the list
    img = deprocess(input_img_data[0])
    kept_images.append((img, loss_value))
    end_time = time.time()
    print('Filter %d processed in %ds' % (class_index, end_time - start_time))

# compute the size of the grid
n = int(np.ceil(np.sqrt(len(kept_images))))

# build a black picture with enough space for the kept_images
img_height = model.input_shape[2]
img_width = model.input_shape[3]
margin = 5
height = n * img_height + (n - 1) * margin
width = n * img_width + (n - 1) * margin
stitched_res = np.zeros((height, width, 3))

# fill the picture with our saved filters
for i in range(n):
    for j in range(n):
        if len(kept_images) <= i * n + j:
            break
        img, loss = kept_images[i * n + j]
        stitched_res[(img_height + margin) * i: (img_height + margin) * i + img_height,
                     (img_width + margin) * j: (img_width + margin) * j + img_width, :] = img

# save the result to disk
scipy.misc.toimage(stitched_res, cmin=0, cmax=255).save('naive_results_%dx%d.png' % (n, n))

To run this code, you will need Keras, of course, and the VGG16 weights learnt for
ILSVRC 2014. You can find them on the VGG-16 pre-trained model for Keras GitHub gist.

While the idea is simple, there are some tricky parts in the code. First, you must be
careful about how the images were fed to the network during the learning phase. Usually,
the mean value of each pixel in the dataset, or of each channel, is subtracted from each
pixel of the input image. The order of the channels can be a source of errors too: it can
be RGB or BGR depending on the image library used (RGB for PIL, BGR for
OpenCV). Finally, if the last layer has a softmax activation, this activation should be
removed. Indeed, maximizing a softmax output for one class can be done in two ways:
maximizing that class's score before the softmax, or minimizing all the other classes' scores
before the softmax. The latter often happens, resulting in very noisy images; see
Simonyan 2013: Deep Inside Convolutional Networks: Visualising Image Classification
Models and Saliency Maps.
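
For reference, here is a minimal sketch of the inverse of the deprocess function from the
script above, i.e. how a real RGB image would be prepared for this VGG16 before being fed
to the network (the preprocess name is mine, and MEAN_VALUES is the array defined in the script):

def preprocess(img):
    # img is a (Height, Width, Channel) uint8 RGB image
    x = img.astype('float32').transpose((2, 0, 1))  # to (Channel, Height, Width)
    x = x[::-1, :, :]                               # RGB to BGR, the order these weights expect
    x -= MEAN_VALUES                                # subtract the per-channel means
    return x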

For my tests, I used four classes; you can find the index of all classes in the
synset_words.txt file:

Top left, class 130, flamingo

Top right, class 351, hartebeest

Bottom left, class 736, pool table, billiard table, snooker table

Bottom right, class 850, teddy, teddy bear

Here are the results produced by the previous script, for a learning rate of 250, 500,
750 and 1000:


The results are not great, to say the least, but with a bit of imagination and knowing the
classes, we can distinguish some interesting details. In the lower left we can imagine
part of a pool table with one or two balls. In the lower right we can imagine heads or
limbs of teddy bears. So even if the results are not exploitable, the algorithm is not
producing garbage. With a bit of tweaking, I might be able to make cleaner and nicer
images.

An interesting result with these images is that they all have a very high confidence rate
(>99%) in their respective classes. This process is the basis of the generation of
adversarial and fooling examples, that is, images that score very high for a single
class but that are unrecognizable by humans. See Deep Neural Networks Are Easily
Fooled: High Confidence Predictions For Unrecognizable Images and Breaking Linear
Classifiers on ImageNet for further details.

Using Regularization to Generate More Natural Images
The images produced by the previous algorithm are not natural images: they contain very
high frequencies and saturated colors. One way to avoid this behavior is to modify the
loss so that the learning process favors more natural images over unnatural ones. The
other method is to apply some modifications to the image after each optimization step, so
that the algorithm tends toward nicer images. This approach is described in
Understanding Neural Networks Through Deep Visualization. This method is more
flexible and easier to use, as there are a lot of image filters already available. We will
review some of the operations we can do on the images and the effects they have.

Clipping
The most obvious way to modify the image is to ensure it is a valid image: all pixel values
must be between (0,0,0) and (255,255,255) for a 24-bit image. In the case of the VGG16
network, the mean is subtracted from the input, so at each step we must modify the input
image tensor as follows:

input_img_data = np.clip(input_img_data, 0.-MEAN_VALUES, 255.-MEAN_VALUES)

This regularization ensures that all pixels have a reasonable influence on the final
output. Here is an example of the effects of this regularization, with a learning rate of
1000 and 1000 iterations:


The result is not clearly better than without clipping; it only slightly reduces the
saturation and the high frequencies. As it mostly serves as a safeguard against images
outside the valid range, we will keep this regularization for the other tests.

Decay
While clipping avoids values outside the valid range of images, it does nothing to make
the images look more natural. A simple regularization is to pull the image closer to the
mean at each step. It avoids bright pixels with very high values in red, green or blue. The
code for decay is:

if l2decay > 0:
    input_img_data *= (1 - l2decay)

with l2decay the amount of decay. This value is usually very low, around 0.0001, but it
really depends on the strength of the learning rate. For high learning rates, decay must
be stronger to compensate for the large modifications to the image. Here are example
results with clipping and a decay of 0.0001 and 0.01, with a learning rate of 1000 and
1000 iterations:

As we can see, the higher the decay, the grayer the image for the same learning rate. The
decay acts as a force that pulls the image toward the mean image, which is often mostly
gray. Decay alone does not produce great results because it mostly reduces saturation
but does not remove many high frequencies.

Blur

With the problem of unnaturally bright pixels partly addressed, it's time to focus on the
high frequencies produced in the images. The most obvious solution is to apply a blur to
the image to make it smoother. As blurring is a bit slow and computationally intensive, it is
often applied only once in a while. Moreover, applying a small blur many times has quite a
similar effect to applying a big blur once in a while. The code to blur the image is the
following:

from scipy.ndimage import gaussian_filter

if blurStd != 0 and i % blurEvery == 0:
    input_img_data = gaussian_filter(input_img_data, sigma=[0, 0, blurStd, blurStd])  # blur along height and width only

with blurStd the standard deviation of the Gaussian kernel, blurEvery the frequency of
the blurring and i the optimization step number. Usually, the standard deviation takes
values from 0.3 to 1 and the blur is applied every 4 to 8 updates. Of course, Gaussian
filters with a high standard deviation should be applied less often than those with a low
standard deviation. Again, these values depend on the learning rate. Here are the results
with clipping and a blur of std of respectively 0.5 every 8 updates, 1 every 8 updates, 0.5
every 4 updates and 1 every 4 updates (still 1000 iterations):


Using only blur, the images begin to be recognizable. The pool table can be identified
without clues; the flamingo and hartebeest can be guessed, but it is still difficult. For the
teddy bear, it is very difficult to find out what the image represents. From the example
above, we can see that the blurring does indeed remove high frequencies, but it also
makes the colors very dim.

Median Filter
While blur gives some nice results, there is still a lot of room for improvement. So I tried
other image noise-reduction filters and found the median filter. It has the nice
property of preserving edges, which are important for both humans and neural nets to
recognize images. The code to apply a median filter to the image is the following:

from scipy.ndimage import median_filter

if mFilterSize != 0 and i % mFilterEvery == 0:
    input_img_data = median_filter(input_img_data, size=(1, 1, mFilterSize, mFilterSize))


with mFilterSize the median filter size, mFilterEvery the frequency of the filtering and
i the optimization step number. Like the blur, we don't need to apply the filter at each
step, and I found that median filters of size 3×3 or 5×5 applied every 4 to 12 updates can
give good results. Similarly to the blur, these values depend on the learning rate, and big
median filters should be applied less often than small ones. Here are the results for
clipping and, respectively, a median filter of size 3 every 12 updates, size 3 every 8, size 5
every 12 and size 5 every 8 (still 1000 iterations):

The median filter gives quite good results, keeping the shapes while removing high
frequencies. It is still a bit difficult to determine the content of each image, but with a
filter of size 5 applied every 12 updates it may be possible to guess the 4 classes.
Overall, the median filter seems to be a good alternative to the blur filter.

Others

There are many other regularizations used to produce better images, but I couldn't test
them all. For instance, you can see how Yosinski clips pixels with small norm or
contribution. Many other image-enhancing filters could be used; look at GIMP and
Photoshop for ideas.

Picking The Best-Looking Images

All these regularizations aim at better-looking images. But better-looking does not
mean optimal with regard to the loss. After each regularization, we can observe a drop in
the loss value. This is not really a problem, as the final result looks better, but it raises the
question of which image to present as the best result.

In this article, I chose the easiest solution: I keep the very last image generated after
clipping but before all other regularizations. Indeed, blurring in particular, but other
regularizations too, may remove important details. By ending with one or more pure
gradient ascent steps, I ensure that the images contain fine details. The regularizations are
there to keep the algorithm from drifting toward high-frequency images (the combined
sketch in the next section shows one way to implement this).

There are, of course, other solutions, like keeping the image with the highest loss, but a
high loss does not necessarily mean a better-looking image. Some tests should be done
to see whether it is really important to define a strategy for picking an image and, if so,
which strategy works best.

Combining Regularizations and Algorithm Hyperparameters

So we have an optimization algorithm and regularization methods, each with several
parameters. These parameters are called hyperparameters, as they are not the
parameters of the model but, in our case, those used to modify the image (usually, it is
the parameters of the model that are modified).

Each of these hyperparameters has an important impact on the generated result. As it is
slow and impractical to test each parameter, I relied on Random Search for Hyper-
Parameter Optimization. The idea is simple: instead of doing a grid search over the
hyperparameters, we do a random search, maximizing the chance of finding a good
value for one or several very important parameters.
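
To make this concrete, here is a hypothetical sketch of such a random search over the
hyperparameters used in this article; the sampling ranges are illustrative guesses, not the
ones I actually searched:

import numpy as np

rng = np.random.RandomState(0)

def sample_hyperparameters():
    # ranges are illustrative, roughly covering the values mentioned in this article
    return {
        'learning_rate': 10 ** rng.uniform(1, 4.5),  # log-uniform, ~10 to ~30000
        'mFilterSize': rng.choice([0, 3, 5]),        # 0 disables the median filter
        'mFilterEvery': rng.randint(4, 13),          # apply every 4 to 12 updates
        'l2decay': 10 ** rng.uniform(-5, -1),        # log-uniform decay strength
    }

# each sampled configuration is then fed to the generation loop
trials = [sample_hyperparameters() for _ in range(20)]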


However, in our case this is not that easy, as the parameters have a huge impact on each
other: a high learning rate requires high decay, heavy blurring requires low decay, etc.
Moreover, there are a lot of choices in how the algorithm works:

How the learning rate evolves during the learning phase;

How the gradient ascent is done (classic, Nesterov, RMSProp, Adam, etc.);

How the first image is initialized (uniform random, Gaussian, etc.), and often two
similarly generated random images can produce very different results;

How the loss is defined; this question will be very important when we optimize
the convolutional filters.

As if this were not enough, hyperparameters and algorithm choices that work well for a
specific neural net may not work for others. For this article, I worked on the VGG16
and the CaffeNet Yosinski networks. For these two networks, I found, using both
manual and random search, that clipping and median filtering alone worked quite well,
combined with a constant learning rate and classic gradient ascent. The starting
images were generated with a normal(0, 10) distribution. You can find the code for this
algorithm at the beginning of the post and in the clipping and median filter sections; a
combined sketch follows below.
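
Putting the pieces together, here is a minimal sketch of the resulting update loop, reusing
model, iterate, MEAN_VALUES and deprocess from the script at the beginning of the post.
The hyperparameter values are the ones reported for the VGG16 results below; skipping the
filter on the last step is one way to implement the image-picking strategy described above:

from scipy.ndimage import median_filter
import numpy as np

learning_rate = 8000.            # value used for the VGG16 class results below
mFilterSize, mFilterEvery = 5, 4
n_steps = 1000

np.random.seed(1337)
input_img_data = np.random.normal(0, 10, (1,) + model.input_shape[1:])

for i in range(n_steps):
    loss_value, grads_value = iterate([input_img_data, 0])
    input_img_data += grads_value * learning_rate

    # clipping: keep every pixel inside the valid image range at every step
    input_img_data = np.clip(input_img_data, 0. - MEAN_VALUES, 255. - MEAN_VALUES)

    # median filter applied periodically, but never on the last step, so the
    # kept image ends with pure gradient ascent plus clipping
    if mFilterSize != 0 and i % mFilterEvery == 0 and i < n_steps - 1:
        input_img_data = median_filter(input_img_data, size=(1, 1, mFilterSize, mFilterSize))

img = deprocess(input_img_data[0])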

Results
Classes
VGG16
These results were found using a learning rate of 8000, clipping, a median filter of size 5
applied every 4 updates and 1000 iterations. The images are strange: the colors seem to be
wrong, with a lot of green. Maybe it's a bug in my code, but the colors of some objects in
the images are good, so I would say the problem is elsewhere. Here are the results, in
order: goldfish, hen, magpie, scorpion, American lobster, flamingo, German shepherd,
starfish, hartebeest, giant panda, abacus, aircraft carrier, assault rifle, broom, feather
boa, mountain tent, pool table and teddy bear:


What is interesting is that although most of the images are hardly distinguishable, some
fine details are visible, for instance the magpie's head. Maybe the median filter is
not enough to regularize the images; other tests should be done with the VGG16
network, and there is definitely a problem with the green channel!

CaffeNet Yosinski
These results were found using a learning rate of 30000, clipping, a median filter of size
5 applied every 4 updates and 1000 iterations. I found these results quite amazing,
even if the quality could still be improved. I was able to identify most of the classes
represented by the images without any clue.

Animals gave the best results. In order: goldfish, hen, magpie, scorpion, American
lobster, flamingo, German shepherd, starfish, hartebeest and giant panda:


Man-made objects were a bit more challenging, but many are still recognizable. In order:
abacus, aircraft carrier, assault rifle, broom, feather boa, mountain tent, pool table and
teddy bear:


Using the same technique as on the VGG16, I had far better results with the CaffeNet
Yosinski. I don't know exactly why, but it proves that it is possible to generate human-
recognizable images using a trained deep net. It seems, however, that some deep nets
are harder to visualise than others.

Concluding this part on class visualization, here are the first 200 iterations with a
learning rate of 30000, clipping and a median filter of size 5 applied every 4 updates, for
the hen class and the CaffeNet Yosinski:

[Video: the first 200 iterations of the gradient ascent]

This shows that the convergence is pretty fast and that the 1000 iterations used in the
previous results may not be needed for all classes. An early stopping mechanism using
the loss value should be added to make the generation faster without losing quality.
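
As a starting point, here is a hypothetical sketch of such a mechanism, omitting the
regularization steps for brevity; the patience value is an illustrative guess and should span
at least one median-filter period, since each filtering step temporarily lowers the loss:

best_loss = -np.inf
wait, patience = 0, 50   # illustrative: stop after 50 steps without improvement

for i in range(1000):
    loss_value, grads_value = iterate([input_img_data, 0])
    input_img_data += grads_value * learning_rate
    input_img_data = np.clip(input_img_data, 0. - MEAN_VALUES, 255. - MEAN_VALUES)

    if loss_value > best_loss:
        best_loss, wait = loss_value, 0
    else:
        wait += 1
        if wait >= patience:
            print('Stopping early at step %d' % i)
            break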

Filters
So far, we maximized the output of one class, but it is possible to do the same with each
layer to understand what it is detecting. The deeper in the network, the more
complex the patterns a filter can recognize. The loss is a bit different for filters, and you
basically have two choices: you can optimize one filter or all filters in a layer. I chose the
latter because it allows me to generate bigger images for shallow layers. The loss
function is the following:

loss = K.sum(layer_output[:, layer_index, :, :])

I chose to generate images the same size as the input of the model, but it is also possible
to remove the fully connected part of the network to generate images of arbitrary sizes;
see How convolutional neural networks see the world.

The last step is to choose the optimized layer. Convolutional layers give the best result
but you must be careful to optimize the layers AFTER the activation (in our case ReLU).
Optimizing before the activation gives very poor results.
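
As an illustration, here is a minimal, hypothetical sketch of how this loss can be wired up
with the Sequential model from the beginning of the post. Note that the Convolution2D
layers there were built with activation='relu', so their output tensor is already the
post-activation one; the layer and filter indices are arbitrary choices of mine:

layer_index = 3                                   # a shallow conv layer, chosen for illustration
layer_output = model.layers[layer_index].output   # shape: (batch, filters, height, width)

filter_index = 0
loss = K.sum(layer_output[:, filter_index, :, :])  # sum the whole activation map of one filter

grads = K.gradients(loss, input_img)[0]
iterate = K.function([input_img, K.learning_phase()], [loss, grads])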


In the following, lr means learning rate and mf means median filter, followed by its size.
Clipping is always applied and there are 200 iterations of gradient ascent. Only two
typical filters are presented for each layer.

VGG16
relu1_1 (lr 1)

relu1_2 (lr 1)

relu2_1 (lr 2, mf 3 every 12)

relu2_2 (lr 2, mf 3 every 12)



relu3_1 (lr 6, mf 3 every 6)

relu3_2 (lr 10, mf 3 every 6)

relu3_3 (lr 10, mf 3 every 6)


relu4_1 (lr 40, mf 5 every 4)

relu4_2 (lr 40, mf 5 every 4)

relu4_3 (lr 40, mf 5 every 4)


relu5_1 (lr 80, mf 5 every 4)

relu5_2 (lr 80, mf 5 every 4)

relu5_3 (lr 80, mf 5 every 4)


CaffeNet Yosinski
relu1 (lr 10)

relu2 (lr 50, mf 3 every 12)

relu3 (lr 100, mf 3 every 6)


relu4 (lr 200, mf 5 every 6)

relu5 (lr 300, mf 5 every 4)

Even if we know how the networks work, it is still very impressive to see how
lower layers and filters learn to extract simple features like lines and colors and, using
these features, how higher layers and filters learn complex shapes and even classes.
Indeed, we can distinguish shellfish, cups, birds, balls and pandas in the images
generated for the last layers.


What is also interesting is that the VGG16 and the CaffeNet Yosinski learn the same
kind of low-level filters, and we can wonder whether this is also true for high-level filters;
see Convergent Learning: Do different neural networks learn the same representations?

Conclusion
In this article, I explained how to generate images using backpropagation on deep
networks trained for image classification.

By maximizing class outputs, these generated images can be used to find fooling examples:
images that are unrecognizable to a human but are given a very high confidence in one
class by the deep network. With a bit of tuning, this process can also generate images
that humans can recognize. It gives us some feedback on what the network learnt to be a
good example of a class.

By maximizing convolutional layers, these generated images give us some understanding
of the inner workings of the network. While many low-level filters only detect edges in
certain directions, high-level filters can detect very complex shapes.

Finally, note that this work is very similar to Understanding Neural Networks Through
Deep Visualization. The difference in the results is mainly due to the fact that I do not
normalize the resulting images (see the normalization in Yosinski's work in
gradient_optimizer.py, saveimagesc in image_misc.py and norm01 in image_misc.py).
This difference induces large differences in the hyperparameters too. I also use a median
filter instead of blur, which gives far better results in my opinion.

Ideas For Improvements
While the results are quite good, there is room for improvement. Regularization of the
generated image during learning may not be sufficient to generate good-looking images
(if that is even possible). The way backpropagation is done and the loss function could be
tweaked to improve the images. A good starting point is to compare the activations for a
real image and a generated image. It might be possible to find ways to make these
activations look the same and hope that the algorithm then generates good images. This
comparison could be done using Yosinski's Deep Visualization Toolbox.
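
As a hypothetical sketch of that comparison in Keras (get_activations, layer_index and
real_img are illustrative names of mine, and preprocess is the sketch given earlier in this
article):

get_activations = K.function([input_img, K.learning_phase()],
                             [model.layers[layer_index].output])

real_acts = get_activations([preprocess(real_img)[None], 0])[0]  # real photograph
gen_acts = get_activations([input_img_data, 0])[0]               # generated image

# one simple point of comparison: the mean activation of each filter
print(real_acts.mean(axis=(0, 2, 3)))
print(gen_acts.mean(axis=(0, 2, 3)))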

References

How convolutional neural networks see the world

Jason Yosinski

Simonyan 2013: Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Research Blog: Inceptionism: Going Deeper into Neural Networks

Bergstra 2012: Random Search for Hyper-Parameter Optimization

Going further:

Striving for Simplicity: The All Convolutional Net

