What important truth do very few people agree with you on?
If you had asked this question to Prof. Geoffrey Hinton in the year 2010, he
would have answered that Convolutional Neural Networks (CNN) had the
potential to produce a seismic shift in solving the problem of image
classification. Back then researchers in the field would not have bothered to
think twice about that comment. Deep Learning was that uncool!
That was the year the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) was launched.
Two years later, with the publication of the paper, “ImageNet Classification with
Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey E. Hinton, he and a handful of researchers were proven right. It was a
seismic shift that broke the Richter scale! The paper forged a new landscape in
Computer Vision by demolishing old ideas in one masterful stroke.
The paper used a CNN to get a Top-5 error rate (rate of not finding the true
label of a given image among its top 5 predictions) of 15.3%. The next best
result trailed far behind (26.2%). When the dust settled, Deep Learning was
cool again.
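To make the Top-5 metric concrete, here is a minimal sketch of how it could be computed. The function name `top5_error` and the toy scores are made up for illustration; they are not from the paper.

```python
def top5_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is not among the k highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Indices of the k classes with the highest scores.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(true_labels)

# Toy example with 6 classes and 2 images (made-up scores).
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # class 0 has the lowest score
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # class 5 has the lowest score
]
labels = [0, 1]
error = top5_error(scores, labels)
print(error)
```

In this toy case the first image is a miss (its true class ranks sixth) and the second is a hit, so half of the images count as errors.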
In the next few years, multiple teams would build CNN architectures that beat
human-level accuracy.
The architecture used in the 2012 paper is popularly called AlexNet after the
first author Alex Krizhevsky. In this post, we will go over its architecture and
discuss its key contributions.
Input
As mentioned above, AlexNet was the winning entry in ILSVRC 2012. It solves
the problem of image classification where the input is an image of one of 1000
different classes (e.g. cats, dogs etc.) and the output is a vector of 1000
numbers. The ith element of the output vector is interpreted as the probability
that the input image belongs to the ith class. Therefore, the sum of all elements
of the output vector is 1.
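AlexNet produces this probability vector with a softmax over its final layer. The sketch below shows one common way to write a softmax in plain Python; the three input scores are made up, and a real network would have 1000 of them.

```python
import math

def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; it does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The class with the largest raw score gets the largest probability, and the elements add up to 1 (up to floating-point rounding).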
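AlexNet works with RGB images of a fixed size of 256×256. This means all images
in the training set and all test images need to be rescaled to 256×256. The
network itself is fed crops taken from inside these images, as described under
Data Augmentation by Random Crops below.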
AlexNet Architecture
AlexNet was much larger than previous CNNs used for computer vision tasks
(e.g. Yann LeCun’s LeNet paper in 1998). It has 60 million parameters and
650,000 neurons and took five to six days to train on two GTX 580 3GB GPUs.
Today there are much more complex CNNs that can run on faster GPUs very
efficiently even on very large datasets. But back in 2012, this was huge!
Let’s look at the architecture.
AlexNet consists of 5 Convolutional Layers and 3 Fully Connected Layers.
The first two convolutional layers are each followed by an Overlapping Max Pooling
layer, in which adjacent pooling windows overlap (the paper uses 3×3 windows
with a stride of 2). The third, fourth and fifth convolutional layers are
connected directly. The fifth convolutional layer is followed by an Overlapping
Max Pooling layer, the output of which goes into a series of two fully connected
layers. The second fully connected layer feeds into the third fully connected
layer, a softmax classifier with 1000 class labels.
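To see how the image shrinks as it moves through these layers, we can trace the spatial sizes with the standard output-size formula for convolution and pooling. This is only a sketch: the kernel sizes, strides, padding values and channel counts below are assumptions taken from the paper and from common reimplementations, not from this post, and the 227×227 input is the crop size mentioned later under data augmentation.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
# These hyperparameters are assumptions (paper + common implementations).
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]

def trace_shapes(size=227, channels=3):
    """Return (name, height, width, channels) after each layer."""
    shapes = []
    for name, kernel, stride, pad, out_channels in LAYERS:
        size = out_size(size, kernel, stride, pad)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, size, size, channels))
    return shapes

shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name}: {h}x{w}x{c}")

# Number of values flattened into the first fully connected layer.
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("inputs to first fully connected layer:", flattened)
```

Under these assumptions the feature map goes from 55×55 after the first convolution down to 6×6×256 after the last pooling layer, which flattens to 9216 values.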
ReLU nonlinearity is applied after all the convolution and fully connected
layers. The ReLU nonlinearity of the first and second convolution layers is
followed by a local response normalization step before pooling. Later
researchers did not find normalization very useful, so we will not go over it
in detail.
ReLU Nonlinearity
An important feature of AlexNet is the use of the ReLU (Rectified Linear Unit)
nonlinearity. Tanh or sigmoid activation functions used to be the usual choice
when training a neural network model. AlexNet showed that with the ReLU nonlinearity,
deep CNNs could be trained much faster than with saturating activation
functions like tanh or sigmoid. The figure below from the paper shows that using
ReLUs (solid curve), a four-layer CNN could reach a 25% training error rate six times
faster than an equivalent network using tanh (dotted curve). This was tested on
the CIFAR-10 dataset.
Let’s see why it trains faster with ReLUs. The ReLU function is given by
f(x) = max(0,x)
Compare the plots of the two functions, tanh and ReLU. The tanh function
saturates at very high or very low values of x. In these regions, the slope of the
function goes very close to zero. This can slow down gradient descent. On the
other hand, the ReLU function’s slope is not close to zero for higher positive
values of x. This helps the optimization to converge faster. For negative values
of x, the slope is zero, but most of the neurons in a neural network usually
end up having positive values. ReLU wins over the sigmoid function too for the
same reason.
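The saturation argument is easy to check numerically. The sketch below evaluates the slope of tanh and of ReLU at a few sample inputs; the helper names and the sample points are chosen here purely for illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [-5.0, -1.0, 0.5, 1.0, 5.0]:
    print(f"x={x:+.1f}  tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

At x = 5 the tanh slope is already well below 0.001, while the ReLU slope stays at 1 for every positive input.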
Reducing Overfitting
What is overfitting?
Remember the kid from your middle school class who did very well in tests, but
did poorly whenever the questions on the test required original thinking and
were not covered in the class? Why did he do so poorly when confronted with a
problem he had never seen before? Because he had memorized the answers to
questions covered in the class without understanding the underlying concepts.
Similarly, the size of the Neural Network is its capacity to learn, but if you are
not careful, it will try to memorize the examples in the training data without
understanding the concept. As a result, the Neural Network will work
exceptionally well on the training data without having learned the real concept, and it
will fail to work well on new and unseen test data. This is called overfitting.
Data Augmentation
Showing a Neural Net different variations of the same image helps prevent
overfitting. You are forcing it to not memorize! Often it is possible to generate
additional data from existing data for free! Here are a few tricks used by the
AlexNet team.
If we have an image of a cat in our training set, its mirror image is also a valid
image of a cat. Please see the figure below for an example. So we can double
the size of the training dataset by simply flipping the image about the vertical
axis.
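Mirroring is a one-liner in most languages. Here is a minimal sketch that treats an image as a list of pixel rows; the tiny 2×3 "image" is a stand-in for real pixel data.

```python
def flip_horizontal(image):
    """Mirror an image (a list of rows of pixels) about its vertical axis."""
    return [list(reversed(row)) for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = flip_horizontal(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```

Each row is reversed while the order of the rows is kept, and flipping twice gives back the original image.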
Data Augmentation by Random Crops
In addition, cropping the original image randomly will also lead to additional
data that is just a shifted version of the original data.
The authors of AlexNet extracted random crops of size 227×227 from inside the
256×256 image boundary to use as the network’s inputs. The paper reports that
random crops, together with their horizontal reflections, increase the size of
the training data by a factor of 2048.
Notice the four randomly cropped images look very similar but they are not
exactly the same. This teaches the Neural Network that minor shifting of pixels
does not change the fact that the image is still that of a cat. Without data
augmentation, the authors would not have been able to use such a large
network because it would have suffered from substantial overfitting.
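Random cropping can be sketched in a few lines as well. The code below is an illustration, not the authors' implementation: the image is a nested list whose "pixels" are just their own (row, column) coordinates, and the helper names are hypothetical.

```python
import random

def random_crop(image, crop_size, rng=random):
    """Return a crop_size x crop_size window at a random offset inside image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]

def num_crop_positions(image_size, crop_size):
    """Number of distinct top-left offsets for a square crop in a square image."""
    return (image_size - crop_size + 1) ** 2

# Stand-in image: each pixel stores its own (row, column) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
crop = random_crop(image, 227, random.Random(0))
print(len(crop), len(crop[0]))
print(num_crop_positions(256, 227))
```

Because every crop is a contiguous window, two crops of the same image differ only by a small shift, which is exactly the variation we want the network to ignore.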
Dropout
With about 60M parameters to train, the authors experimented with other ways
to reduce overfitting too. So they applied another technique called dropout that
was introduced by G.E. Hinton in another paper in 2012. In dropout, each neuron is
dropped from the network with a probability of 0.5 during training. When a neuron is dropped, it
does not contribute to either forward or backward propagation. So every input
goes through a different network architecture.
As a result, the learnt weight parameters are more robust and do not get
overfitted easily. During testing, there is no dropout and the whole network is
used, but the output is scaled by a factor of 0.5 to account for the neurons dropped
during training. Dropout increases the number of iterations needed to converge
by a factor of 2, but without dropout, AlexNet would overfit substantially.
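Here is a minimal sketch of dropout as described above, with neurons zeroed at random during training and outputs scaled by 0.5 at test time. The function names are hypothetical and the activations are all set to 1.0 so the effect is easy to read. Note that many modern frameworks instead rescale during training ("inverted dropout"), which is a different but equivalent convention.

```python
import random

def dropout_train(activations, p_drop=0.5, rng=random):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop=0.5):
    """Testing: keep every neuron but scale outputs by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
train_mean = sum(dropout_train(acts, rng=rng)) / len(acts)
test_mean = sum(dropout_test(acts)) / len(acts)
print(train_mean, test_mean)
```

The average activation seen during training is close to 0.5, and the test-time scaling makes the full network produce the same average, so the next layer sees inputs of a similar magnitude in both modes.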