
CS 5970 – Technical Paper Assignments

Santhosh reddy malgireddy


Overview of Gradient descent algorithms
https://arxiv.org/pdf/1609.04747.pdf
The domain/application:
Gradient-based optimization methods are popular in machine learning and deep
neural network applications. In large-scale problems, stochastic methods are preferred due to
their good scaling properties. In this presentation, I compare the performance of Adam
and other gradient descent optimization algorithms: batch gradient descent, stochastic gradient
descent, gradient descent with momentum, and Nesterov accelerated gradient.

The loss function lets us quantify the quality of any particular set of weights W.
The goal of optimization is to find the W that minimizes the loss function. We will now motivate
and slowly develop an approach to optimizing the loss function. My goal is to eventually
optimize neural networks, where we cannot easily use any of the tools developed in the
convex optimization literature.

Approach:

Gradient descent is one of the most popular algorithms to perform optimization and by far
the most common way to optimize neural networks. At the same time, every state-of-the-art
deep learning library contains implementations of various algorithms to optimize
gradient descent, e.g. Lasagne, Caffe, and Keras. These algorithms, however, are often used
as black-box optimizers, as practical explanations of their strengths and weaknesses are
hard to come by. Here I will go over the intuitions behind the behaviour of the different
algorithms for optimizing gradient descent, which will help put them to use. Gradient descent
is a way to minimize an objective function J(θ) parameterized by a model's parameters
by updating the parameters in the opposite direction of the gradient of the objective
function ∇θJ(θ) w.r.t. the parameters. The learning rate η determines the size of the
steps we take to reach a (local) minimum. In other words, we follow the direction of the
slope of the surface created by the objective function downhill until we reach a valley.
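
As a toy illustration (my own, not from the paper), a single gradient descent step on the
one-dimensional objective J(θ) = θ² with learning rate η = 0.1 moves θ from 3.0 to 2.4,
i.e. towards the minimum at 0:

theta = 3.0
eta = 0.1                     # learning rate
grad = 2 * theta              # gradient of J(theta) = theta**2
theta = theta - eta * grad    # 3.0 - 0.1 * 6.0 = 2.4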

Vanilla gradient descent or batch gradient descent:

It computes the gradient of the cost function w.r.t. the parameters θ for the entire
training dataset:
θ = θ − η · ∇θJ(θ)
As we need to calculate the gradients for the whole dataset to perform just one update,
batch gradient descent can be very slow and is intractable for datasets that do not fit in
memory. Batch gradient descent also does not allow us to update our model online, i.e.
with new examples on-the-fly. In code, batch gradient descent looks something like this:
for i in range(no_epochs):
    # Gradient of the loss over the entire training set w.r.t. the current parameters.
    params_grad = evaluate_gradient(loss_function, data, params)
    # Step in the opposite direction of the gradient, scaled by the learning rate.
    params = params - learning_rate * params_grad

For a pre-defined number of epochs, we first compute the gradient vector params_grad
of the loss function for the whole dataset w.r.t. our parameter vector params. Note that
state-of-the-art deep learning libraries provide automatic differentiation that efficiently
computes the gradient w.r.t. some parameters. If you derive the gradients yourself, then
gradient checking is a good idea. We then update our parameters in the opposite direction of
the gradients, with the learning rate determining how big of an update we perform. Batch
gradient descent is guaranteed to converge to the global minimum for convex error
surfaces and to a local minimum for non-convex surfaces.
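
As a concrete sketch (my own illustration, not the paper's code), assume a linear model
trained with a mean squared error loss; then evaluate_gradient, a simple gradient check,
and the full-batch loop above could look like this:

import numpy as np

def mse_loss(data, params):
    # Mean squared error of a linear model y ≈ X @ params.
    X, y = data
    return np.mean((X @ params - y) ** 2)

def evaluate_gradient(loss_function, data, params):
    # Analytic gradient of the MSE loss above w.r.t. params.
    X, y = data
    return 2.0 / len(y) * X.T @ (X @ params - y)

# Toy dataset.
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])
data, params = (X, y), np.zeros(3)
learning_rate, no_epochs = 0.1, 200

# Gradient check on one coordinate: analytic vs. centered finite difference.
eps, j = 1e-6, 0
e = np.zeros(3)
e[j] = eps
numeric = (mse_loss(data, params + e) - mse_loss(data, params - e)) / (2 * eps)
print(numeric, evaluate_gradient(mse_loss, data, params)[j])  # should agree closely

# Full-batch gradient descent.
for i in range(no_epochs):
    params_grad = evaluate_gradient(mse_loss, data, params)
    params = params - learning_rate * params_grad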

Mini-batch gradient descent:


In large-scale applications the training data can be on the order of millions of examples.
Hence, it seems wasteful to compute the full loss function over the entire training set in
order to perform only a single parameter update. A very common approach to addressing
this challenge is to compute the gradient over batches of the training data. For example,
in current state-of-the-art ConvNets, a typical batch contains 256 examples from the entire
training set of 1.2 million. This batch is then used to perform a parameter update. The
algorithm for mini-batch gradient descent is as follows:

for i in range(no_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

The size of the mini-batch is a hyperparameter but it is not very common to cross-
validate it. It is usually based on memory constraints (if any), or set to some value, e.g.
32, 64 or 128. We use powers of 2 in practice because many vectorized operation
implementations work faster when their inputs are sized in powers of 2.
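
The get_batches helper used above is not defined in the snippet; one simple way to write
it (my own sketch, assuming data is an indexable sequence such as a list or NumPy array of
examples) is:

def get_batches(data, batch_size=50):
    # Yield successive mini-batches from the (already shuffled) data;
    # the last batch may contain fewer than batch_size examples.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]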

Stochastic gradient descent:

The extreme case of this is a setting where the mini-batch contains only a single example.
This process is called Stochastic Gradient Descent (SGD) (sometimes also called on-line
gradient descent). This is relatively uncommon in practice because, thanks to vectorized
code optimizations, it can be computationally much more efficient to evaluate the gradient
for 100 examples at once than the gradient for one example 100 times. Even though
SGD technically refers to using a single example at a time to evaluate the gradient, you
will hear people use the term SGD even when referring to mini-batch gradient descent
(mentions of MGD for "Minibatch Gradient Descent" or BGD for "Batch Gradient
Descent" are rare), where it is usually assumed that mini-batches are used.


for i in range(no_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

Gradient Descent with momentum:


There is an algorithm called momentum, or gradient descent with momentum, that almost
always works faster than the standard gradient descent algorithm. In one sentence, the
basic idea is to compute an exponentially weighted average of the gradients, and then use
that average to update the weights instead. With the gradient descent variants discussed
above, the updates can oscillate up and down, which slows down gradient descent and
prevents us from using a much larger learning rate. In particular, if we were to use a
much larger learning rate we might end up overshooting and diverging. The need to keep
the oscillations from getting too big therefore forces us to use a learning rate that is not
too large. Another way of viewing this problem is that we want learning to be slower along
the oscillating (vertical) direction and faster along the (horizontal) direction that points
towards the minimum, which is possible with gradient descent with momentum. In
pseudocode, the update looks like this:
update = learning_rate * gradient
velocity = momentum * velocity - update
parameter = parameter + velocity
In this algorithm, after a few iterations we will find that gradient descent with
momentum ends up taking steps with much smaller oscillations in the vertical direction
while moving more quickly in the horizontal direction. This allows the algorithm to take
a more direct path, damping out the oscillations on its way to the minimum. There is,
however, one problem with momentum: when we are very close to the goal, the momentum is
in most cases still high and it does not know that it should slow down. This can cause it
to overshoot or oscillate around the minimum.
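
A minimal NumPy sketch of this update rule (my own illustration, reusing the hypothetical
mse_loss, evaluate_gradient, data, learning_rate, and no_epochs from the earlier example)
could look like this:

params = np.zeros(3)                  # restart from the same initial point as before
velocity = np.zeros_like(params)
momentum = 0.9                        # a typical value for the momentum coefficient

for i in range(no_epochs):
    params_grad = evaluate_gradient(mse_loss, data, params)
    velocity = momentum * velocity - learning_rate * params_grad   # exponentially weighted step
    params = params + velocity
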
Nesterov accelerated gradient:
It overcomes the problem of deciding where to slow down and where to speed up by starting
to slow down early. In momentum we first compute the gradient and then make a jump in that
direction, amplified by whatever momentum we had previously. NAG does the same thing
but in the other order: first we make a big jump based on our stored momentum, and
then we calculate the gradient at that point and make a small correction. This seemingly
minor change gives significant practical speedups. The parameter update is as follows.
vt = γ vt−1 + η · ∇θJ(θ − γ vt−1)
θ = θ − vt
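
Using the same toy setup as before, a minimal sketch of this update (again my own
illustration, not the paper's code) evaluates the gradient at the look-ahead point
θ − γ vt−1:

params = np.zeros(3)
velocity = np.zeros_like(params)
gamma = 0.9                                           # momentum coefficient

for i in range(no_epochs):
    lookahead = params - gamma * velocity             # where momentum alone would take us
    params_grad = evaluate_gradient(mse_loss, data, lookahead)
    velocity = gamma * velocity + learning_rate * params_grad
    params = params - velocity
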
Summary:
From the above variants we can observe how the form of accelerated gradient descent
differs from classical gradient descent. In particular, gradient descent is
a local algorithm, both in space and time, because where we go next depends only on
the information at our current point (like a Markov chain). On the other hand, accelerated
gradient descent uses additional past information to take an extragradient step by adding
a "momentum" term.

