
CRITICAL REVIEW

CS 5970 – Technical Paper Assignments


Santhosh Reddy Malgireddy

Adam is used to perform optimization and is one of the best optimizers at present. The
authors claim that it combines ideas from RMSProp and gradient descent with momentum,
and that it also inherits the advantages of AdaGrad.

Why is it one of the best?
• Adam is a drop-in replacement optimization algorithm for stochastic gradient descent
when training deep learning models (see the usage sketch after this list).
• Adam combines the best properties of the AdaGrad and RMSProp algorithms to
provide an optimization algorithm that can handle sparse gradients on noisy
problems.
• Adam is relatively easy to configure, since the default configuration parameters do
well on most problems.
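
As a quick illustration of the drop-in replacement point, here is a minimal training-loop
sketch assuming PyTorch; the model, data, and step count are placeholders of my own, not
material from the reviewed paper.

    import torch
    import torch.nn as nn

    # Placeholder model and data, for illustration only.
    model = nn.Linear(10, 1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss_fn = nn.MSELoss()

    # Swapping SGD for Adam is a one-line change; the keyword defaults below
    # match the values recommended in the Adam paper.
    # optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                                 betas=(0.9, 0.999), eps=1e-8)

    for step in range(100):
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(x), y)    # forward pass
        loss.backward()                # backward pass: compute gradients
        optimizer.step()               # Adam update of every parameter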

What features does it have?


• Parameter updates are invariant to re-scaling of the gradient. This means that if we have
some objective function f(x) and we change it to k*f(x) (where k is some constant),
there will be no effect on the updates; a short check of this claim follows the list.
• The size of each step is approximately bounded by the step-size hyperparameter α.
• It does not require a stationary objective. That means the f(x) we talked about might
change with time and the algorithm will still converge.
• It naturally performs step-size annealing. Remember that with classical SGD we used
to decrease the step size after some epochs; nothing of that sort is needed here.
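
A quick check of the re-scaling claim, written out from the standard Adam update rather
than taken verbatim from the review: each step is Δθ = −α · m̂ / (√v̂ + ε), where m̂ is the
bias-corrected running average of the gradients and v̂ of their squares. Replacing every
gradient g with k·g (for k > 0) turns m̂ into k·m̂ and v̂ into k²·v̂, so the step becomes
−α · k·m̂ / (k·√v̂ + ε) = −α · m̂ / (√v̂ + ε/k), which is the original step up to the
negligible ε term.
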
Parameters required and their recommended default values:
• β1 — the decay rate for the running average of the gradient (0.9)
• β2 — the decay rate for the running average of the squared gradient (0.999)
• α — the step-size parameter (0.001)
• ε — a small constant that prevents division-by-zero errors (10^-8)
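
To show how these four hyperparameters fit together, here is a minimal NumPy sketch of the
Adam update on a toy quadratic objective; the objective and step count are my own
placeholders, while the update itself follows the description in the Adam paper.

    import numpy as np

    theta = np.array([5.0])                 # parameter of the toy objective f(θ) = θ²
    alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8   # the defaults listed above
    m = np.zeros_like(theta)                # running average of the gradient (decayed by β1)
    v = np.zeros_like(theta)                # running average of the squared gradient (decayed by β2)

    for t in range(1, 10001):
        g = 2 * theta                       # gradient of the toy objective
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)          # bias correction for the zero-initialised m
        v_hat = v / (1 - beta2**t)          # bias correction for the zero-initialised v
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # each step is roughly bounded by α

    print(theta)                            # approaches the minimum at θ = 0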

Comparison:

In a deterministic (full-batch) gradient method we calculate the gradient over the whole
dataset and then apply the update, so the cost of each iteration grows linearly with the
dataset size. In stochastic gradient methods we calculate the gradient on one data point
and then apply the update, so the cost of each iteration is independent of the dataset
size. Each iteration of stochastic gradient descent is much faster, but it usually takes
many more iterations to train the network. Perhaps the bigger problem in applying
gradient descent to an objective function is how to choose the step-size parameter, i.e.
the learning rate, at each iteration. Gradient descent also maintains a single learning
rate (termed alpha) for all weight updates, and that learning rate does not change during
training. In Adam, by contrast, a learning rate is maintained for each network weight
(parameter) and is adapted separately as learning unfolds. Gradient descent is often very
slow near the optimum, so it is advisable to use a variable step size, i.e. learning rate,
which is exactly what Adam provides.
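
To make the iteration-cost comparison concrete, here is a minimal NumPy sketch on a linear
least-squares objective; the dataset and learning rate are my own placeholders, not
material from the reviewed paper. The full-batch update touches every row of X per step,
while the stochastic update touches a single row.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 5))         # placeholder dataset
    y = rng.normal(size=10000)
    w = np.zeros(5)
    lr = 0.01                               # one fixed learning rate shared by every weight

    # Deterministic (full-batch) gradient descent: cost per step grows with the dataset size.
    grad_full = X.T @ (X @ w - y) / len(y)
    w -= lr * grad_full

    # Stochastic gradient descent: cost per step is independent of the dataset size.
    i = rng.integers(len(y))
    grad_one = X[i] * (X[i] @ w - y[i])
    w -= lr * grad_one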

References:

1. https://arxiv.org/pdf/1609.04747.pdf
2. http://www.cs.berkeley.edu/~barron/BarronICCV2015_supp.pdf
