Adam is used to perform optimization and is one of the best-performing optimizers at present. Its authors describe it as combining the advantages of two extensions of stochastic gradient descent, AdaGrad and RMSProp, together with momentum as used in gradient descent with momentum.
Why is it considered one of the best?
Adam is a replacement optimization algorithm for stochastic gradient descent when training deep-learning models.
Adam combines the best properties of the AdaGrad and RMSProp algorithms to
provide an optimization algorithm that can handle sparse gradients on noisy
problems.
Adam is relatively easy to configure, since the default configuration parameters work well on most problems.
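To make the update concrete, here is a minimal sketch of a single Adam step (the function name and defaults are illustrative; the default values shown are the ones commonly cited for Adam):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter vector w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum-like) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (RMSProp-like) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

The division by `sqrt(v_hat)` is what gives each parameter its own effective step size, which is discussed in the comparison below.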
Comparison:
In a deterministic (full-batch) gradient method we calculate the gradient over the whole dataset and then apply the update, so the cost of each iteration grows linearly with the dataset size.
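The iteration-cost contrast can be sketched on a toy least-squares problem (the data, learning rates, and function names here are hypothetical): the full-batch gradient touches every row of the dataset, while a stochastic gradient touches a single row.

```python
import numpy as np

# Toy linear least-squares problem: loss(w) = mean((X @ w - y)**2)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)

def full_batch_grad(w):
    # Touches all 1000 rows: the cost grows linearly with the dataset size.
    return 2 * X.T @ (X @ w - y) / len(X)

def stochastic_grad(w, i):
    # Touches a single row i: the cost is independent of the dataset size.
    xi, yi = X[i], y[i]
    return 2 * (xi @ w - yi) * xi
```

A full-batch update loop calls `full_batch_grad` once per iteration; a stochastic loop calls `stochastic_grad` on one randomly chosen index per iteration, many more times.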
In stochastic gradient methods we calculate the gradient on a single datapoint and then apply the update, so the iteration cost is independent of the dataset size. Each iteration of stochastic gradient descent is much faster, but it usually takes many more iterations to train the network.

Perhaps the bigger problem in applying gradient descent to an objective function is choosing the step-size parameter, i.e. the learning rate, at each iteration. Gradient descent maintains a single learning rate (termed alpha) for all weight updates, and this learning rate does not change during training. Adam, in contrast, maintains a learning rate for each network weight (parameter) and adapts it separately as learning unfolds. Gradient descent is often very slow near the optimum, so a variable step size, i.e. an adaptive learning rate, is advisable.
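The single-alpha limitation can be seen on a toy ill-conditioned quadratic (a hypothetical setup, not from the source): plain gradient descent must pick one alpha small enough for the steepest direction, which starves the flat direction, whereas Adam's per-parameter scaling moves both coordinates at a similar pace.

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * (100*w0**2 + w1**2): curvature differs 100x per coordinate.
    return np.array([100.0 * w[0], w[1]])

# Plain gradient descent: one global alpha for every weight,
# capped by the steep w0 direction (needs alpha < 2/100 to stay stable).
w_gd = np.array([1.0, 1.0])
alpha = 0.015
for _ in range(100):
    w_gd = w_gd - alpha * grad(w_gd)
# w0 converges quickly, but w1 shrinks only by a factor (1 - 0.015) per step.

# Adam: each coordinate gets its own effective step, roughly lr / sqrt(v_hat).
w_adam = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g           # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    w_adam = w_adam - lr * m_hat / (np.sqrt(v_hat) + eps)
```

After the same number of iterations, gradient descent's flat coordinate has barely moved, while both of Adam's coordinates are near the optimum.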