random initializations. Notably, Chapelle & Erhan (2011) used the random initialization of Glorot & Bengio (2010) and SGD to train the 11-layer autoencoder of Hinton & Salakhutdinov (2006), and were able to surpass the results reported by Hinton & Salakhutdinov (2006). While these results still fall short of those reported in Martens (2010) for the same tasks, they indicate that learning deep networks is not nearly as hard as was previously believed.

The first contribution of this paper is a much more thorough investigation of the difficulty of training deep and temporal networks than has been previously done. In particular, we study the effectiveness of SGD when combined with well-chosen initialization schemes and various forms of momentum-based acceleration. We show that while a definite performance gap seems to exist between plain SGD and HF on certain deep and temporal learning problems, this gap can be eliminated or nearly eliminated (depending on the problem) by careful use of classical momentum methods or Nesterov's accelerated gradient. In particular, we show how certain carefully designed schedules for the constant of momentum μ, which are inspired by various theoretical convergence-rate theorems (Nesterov, 1983; 2003), produce results that even surpass those reported by Martens (2010) on certain deep-autoencoder training tasks. For the long-term dependency RNN tasks examined in Martens & Sutskever (2011), which first appeared in Hochreiter & Schmidhuber (1997), we obtain results that fall just short of those reported in that work, where a considerably more complex approach was used.

Our results are particularly surprising given that momentum and its use within neural network optimization has been studied extensively before, such as in the work of Orr (1996), and it was never found to have such an important role in deep learning. One explanation is that previous theoretical analyses and practical benchmarking focused on local convergence in the stochastic setting, which is more of an estimation problem than an optimization one (Bottou & LeCun, 2004). In deep learning problems this final phase of learning is not nearly as long or important as the initial transient phase (Darken & Moody, 1993), where a better argument can be made for the beneficial effects of momentum.

In addition to the inappropriate focus on purely local convergence rates, we believe that the use of poorly designed standard random initializations, such as those in Hinton & Salakhutdinov (2006), and suboptimal meta-parameter schedules (for the momentum constant in particular) has hampered the discovery of the true effectiveness of first-order momentum methods in deep learning. We carefully avoid both of these pitfalls in our experiments and provide a simple-to-understand and easy-to-use framework for deep learning that is surprisingly effective and can be naturally combined with techniques such as those in Raiko et al. (2011).

We will also discuss the links between classical momentum and Nesterov's accelerated gradient method (which has been the subject of much recent study in convex optimization theory), arguing that the latter can be viewed as a simple modification of the former which increases stability, and can sometimes provide a distinct improvement in performance, as we demonstrated in our experiments. We perform a theoretical analysis which makes clear the precise difference in local behavior of these two algorithms. Additionally, we show how HF employs what can be viewed as a type of momentum through its use of special initializations to conjugate gradient that are computed from the update at the previous time-step. We use this property to develop a more momentum-like version of HF which combines some of the advantages of both methods to further improve on the results of Martens (2010).

2. Momentum and Nesterov's Accelerated Gradient

The momentum method (Polyak, 1964), which we refer to as classical momentum (CM), is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations. Given an objective function f(θ) to be minimized, classical momentum is given by:

v_{t+1} = μ v_t − ε ∇f(θ_t)    (1)
θ_{t+1} = θ_t + v_{t+1}        (2)

where ε > 0 is the learning rate, μ ∈ [0, 1] is the momentum coefficient, and ∇f(θ_t) is the gradient at θ_t.

Since directions d of low-curvature have, by definition, slower local change in their rate of reduction (i.e., d⊤∇f), they will tend to persist across iterations and be amplified by CM. Second-order methods also amplify steps in low-curvature directions, but instead of accumulating changes they reweight the update along each eigen-direction of the curvature matrix by the inverse of the associated curvature. And just as second-order methods enjoy improved local convergence rates, Polyak (1964) showed that CM can considerably accelerate convergence to a local minimum, requiring √R-times fewer iterations than steepest descent to reach the same level of accuracy, where R is the condition number of the curvature at the minimum and μ is set to (√R − 1)/(√R + 1).

Nesterov's Accelerated Gradient (abbrv. NAG; Nesterov, 1983) has been the subject of much recent attention by the convex optimization community (e.g., Cotter et al., 2011; Lan, 2010). Like momentum, NAG is a first-order optimization method with better convergence rate guarantee than gradient descent in certain situations.
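The updates in Eqs. (1)-(2), and NAG written in the analogous momentum form (cf. Eqs. (12)-(13) in the appendix), can be sketched as follows. This is a minimal illustration, not the paper's experimental code; the function names and the constants λ, ε, μ are our own choices, picked so that ελ is large enough for the two methods to behave differently.

```python
def cm_step(theta, v, grad, eps, mu):
    # Classical momentum, Eqs. (1)-(2):
    # v_{t+1} = mu * v_t - eps * grad(theta_t); theta_{t+1} = theta_t + v_{t+1}
    v_new = mu * v - eps * grad(theta)
    return theta + v_new, v_new

def nag_step(theta, v, grad, eps, mu):
    # NAG in momentum form (cf. Eqs. (12)-(13) in the appendix): identical
    # to CM except the gradient is taken at the look-ahead point theta + mu * v.
    v_new = mu * v - eps * grad(theta + mu * v)
    return theta + v_new, v_new

# A single high-curvature eigendirection of a quadratic: q(x) = lam * x^2 / 2.
# With eps * lam = 1.2 and mu = 0.9, CM's iterates decay slowly while
# oscillating, whereas NAG's smaller effective momentum mu * (1 - eps * lam)
# (see Theorem 2.1) damps the oscillations and converges much faster.
lam, eps, mu = 1.2, 1.0, 0.9
grad = lambda x: lam * x
x_cm = x_nag = 1.0
v_cm = v_nag = 0.0
for _ in range(100):
    x_cm, v_cm = cm_step(x_cm, v_cm, grad, eps, mu)
    x_nag, v_nag = nag_step(x_nag, v_nag, grad, eps, mu)
print(abs(x_cm), abs(x_nag))
```

After 100 steps, the NAG iterate is many orders of magnitude closer to the minimizer at 0 than the CM iterate, consistent with the discussion of effective momentum below.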
On the importance of initialization and momentum in deep learning
To help quantify precisely the way in which CM and NAG differ, we analyzed the behavior of each method when applied to a positive definite quadratic objective q(x) = x⊤Ax/2 + b⊤x. We can think of CM and NAG as operating independently over the different eigendirections of A. NAG operates along any one of these directions equivalently to CM, except with an effective value of μ that is given by μ(1 − ελ), where λ is the associated eigenvalue/curvature.

The first step of this argument is to reparameterize q(x) in terms of the coefficients of x under the basis of eigenvectors of A. Note that since A = U⊤DU for a diagonal D and orthonormal U (as A is symmetric), we can reparameterize q(x) by the matrix transform U and optimize y = Ux using the objective p(y) ≡ q(x) = q(U⊤y) = y⊤UU⊤DUU⊤y/2 + b⊤U⊤y = y⊤Dy/2 + c⊤y, where c = Ub. We can further rewrite p as p(y) = Σ_{i=1}^n [p]_i([y]_i), where [p]_i(t) = λ_i t²/2 + [c]_i t and λ_i > 0 are the diagonal entries of D (and thus the eigenvalues of A) and correspond to the curvature along the associated eigenvector directions. As shown in the appendix (Proposition 6.1), both CM and NAG, being first-order methods, are invariant to these kinds of reparameterizations by orthonormal transformations such as U. Thus when analyzing the behavior of either algorithm applied to q(x), we can instead apply them to p(y), and transform the resulting sequence of iterates back to the default parameterization (via multiplication by U⁻¹ = U⊤).

Theorem 2.1. Let p(y) = Σ_{i=1}^n [p]_i([y]_i) such that [p]_i(t) = λ_i t²/2 + c_i t. Let ε and μ be arbitrary and fixed. Denote by CM_x(μ, ε, p, y, v) and CM_v(μ, ε, p, y, v) the parameter vector and the velocity vector respectively, obtained by applying one step of CM (i.e., Eq. 1 and then Eq. 2) to the function p at point y, with velocity v, momentum coefficient μ, and learning rate ε. Define NAG_x and NAG_v analogously. Then the following holds for z ∈ {x, v}:

CM_z(μ, ε, p, y, v) = ( CM_z(μ, ε, [p]_1, [y]_1, [v]_1), …, CM_z(μ, ε, [p]_n, [y]_n, [v]_n) )
NAG_z(μ, ε, p, y, v) = ( CM_z(μ(1 − ελ_1), ε, [p]_1, [y]_1, [v]_1), …, CM_z(μ(1 − ελ_n), ε, [p]_n, [y]_n, [v]_n) )

Proof. See the appendix.

The theorem has several implications. First, CM and NAG become equivalent when ε is small (when ελ ≪ 1 for every eigenvalue λ of A), so NAG and CM are distinct only when ε is reasonably large. When ε is relatively large, NAG uses smaller effective momentum for the high-curvature eigen-directions, which prevents oscillations (or divergence) and thus allows the use of a larger ε than is possible with CM for a given μ.

3. Deep Autoencoders

The aim of our experiments is three-fold. First, to investigate the attainable performance of stochastic momentum methods on deep autoencoders starting from well-designed random initializations; second, to explore the importance and effect of the schedule for the momentum parameter μ, assuming an optimal fixed choice of the learning rate ε; and third, to compare the performance of NAG versus CM.

For our experiments with feed-forward nets, we focused on training the three deep autoencoder problems described in Hinton & Salakhutdinov (2006) (see sec. A.2 for details). The task of the neural network autoencoder is to reconstruct its own input subject to the constraint that one of its hidden layers is of low dimension. This "bottleneck" layer acts as a low-dimensional code for the original input, similar to other dimensionality reduction techniques like Principal Component Analysis (PCA). These autoencoders are some of the deepest neural networks with published results, ranging between 7 and 11 layers, and have become a standard benchmarking problem (e.g., Martens, 2010; Glorot & Bengio, 2010; Chapelle & Erhan, 2011; Raiko et al., 2011). See the appendix for more details.

Because the focus of this study is on optimization, we only report training errors in our experiments. Test error depends strongly on the amount of overfitting in these problems, which in turn depends on the type and amount of regularization used during training. While regularization is an issue of vital importance when designing systems of practical utility, it is outside the scope of our discussion. And while it could be objected that the gains achieved using better optimization methods are only due to more exact fitting of the training set in a manner that does not generalize, this is simply not the case in these problems, where under-trained solutions are known to perform poorly on both the training and test sets (underfitting).

The networks we trained used the standard sigmoid nonlinearity and were initialized using the sparse initialization technique (SI) of Martens (2010) that is described in sec. 3.1. Each trial consists of 750,000 parameter updates on minibatches of size 200. No regularization is used. The schedule for μ was given by the following formula:

μ_t = min(1 − 2^{−1−log₂(⌊t/250⌋+1)}, μ_max)    (5)

where μ_max was chosen from {0.999, 0.995, 0.99, 0.9, 0}. This schedule was motivated by Nesterov (1983) who advocates using what amounts to μ_t = 1 − 3/(t + 5) after some manipulation
task    | 0 (SGD) | 0.9N | 0.99N | 0.995N | 0.999N | 0.9M | 0.99M | 0.995M | 0.999M | SGD_C | HF*   | HF
Curves  | 0.48    | 0.16 | 0.096 | 0.091  | 0.074  | 0.15 | 0.10  | 0.10   | 0.10   | 0.16  | 0.058 | 0.11
Mnist   | 2.1     | 1.0  | 0.73  | 0.75   | 0.80   | 1.0  | 0.77  | 0.84   | 0.90   | 0.9   | 0.69  | 1.40
Faces   | 36.4    | 14.2 | 8.5   | 7.8    | 7.7    | 15.3 | 8.7   | 8.3    | 9.3    | NA    | 7.5   | 12.0

Table 1. The table reports the squared errors on the problems for each combination of μ_max and a momentum type (N = NAG, M = CM). When μ_max is 0 the choice of NAG vs CM is of no consequence so the training errors are presented in a single column. For each choice of μ_max, the highest-performing learning rate is used. The column SGD_C lists the results of Chapelle & Erhan (2011) who used 1.7M SGD steps and tanh networks. The column HF* lists the results of HF without L2 regularization, as described in sec. 5; and the column HF lists the results of Martens (2010).
problem | before | after
Curves  | 0.096  | 0.074
Mnist   | 1.20   | 0.73
Faces   | 10.83  | 7.7

Table 2. The effect of low-momentum finetuning for NAG. The table shows the training squared errors before and after the momentum coefficient is reduced. During the primary (transient) phase of learning we used the optimal momentum and learning rates.

(see appendix), and by Nesterov (2003) who advocates a constant μ_t that depends on (essentially) the condition number. The constant μ_t achieves exponential convergence on strongly convex functions, while the 1 − 3/(t + 5) schedule is appropriate when the function is not strongly convex. The schedule of Eq. 5 blends these proposals. For each choice of μ_max, we report the learning rate that achieved the best training error. Given the schedule for μ, the learning rate ε was chosen from {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001} in order to achieve the lowest final training error after our fixed number of updates.

Table 1 summarizes the results of these experiments. It shows that NAG achieves the lowest published results on this set of problems, including those of Martens (2010). It also shows that larger values of μ_max tend to achieve better performance and that NAG usually outperforms CM, especially when μ_max is 0.995 and 0.999. Most surprising and importantly, the results demonstrate that NAG can achieve results that are comparable with some of the best HF results for training deep autoencoders. Note that the previously published results on HF used L2 regularization, so they cannot be directly compared. However, the table also includes experiments we performed with an improved version of HF (see sec. 2.1) where weight decay was removed towards the end of training.

We found it beneficial to reduce μ to 0.9 (unless μ is 0, in which case it is unchanged) during the final 1000 parameter updates of the optimization without reducing the learning rate, as shown in Table 2. It appears that reducing the momentum coefficient allows for finer convergence to take place, whereas otherwise the overly aggressive nature of CM or NAG would prevent this. This phase shift between optimization that favors fast accelerated motion along the error surface (the transient phase) followed by a more careful optimization-as-estimation phase seems consistent with the picture presented by Darken & Moody (1993). However, while asymptotically it is the second phase which must eventually dominate computation time, in practice it seems that for deeper networks in particular, the first phase dominates overall computation time as long as the second phase is cut off before the remaining potential gains become either insignificant or entirely dominated by overfitting (or both).

It may be tempting then to use lower values of μ from the outset, or to reduce it immediately when progress in reducing the error appears to slow down. However, in our experiments we found that doing this was detrimental in terms of the final errors we could achieve, and that despite appearing to not make much progress, or even becoming significantly non-monotonic, the optimizers were doing something apparently useful over these extended periods of time at higher values of μ.

A speculative explanation as to why we see this behavior is as follows. While a large value of μ allows the momentum methods to make useful progress along slowly-changing directions of low-curvature, this may not immediately result in a significant reduction in error, due to the failure of these methods to converge in the more turbulent high-curvature directions (which is especially hard when μ is large). Nevertheless, this progress in low-curvature directions takes the optimizers to new regions of the parameter space that are characterized by closer proximity to the optimum (in the case of a convex objective), or just higher-quality local minima (in the case of non-convex optimization). Thus, while it is important to adopt a more careful scheme that allows fine convergence to take place along the high-curvature directions, this must be done with care. Reducing μ and moving to this fine convergence regime too early may make it difficult for the optimization to make significant progress along the low-curvature directions, since without the benefit of momentum-based acceleration, first-order methods are notoriously bad at this (which is what motivated the use of second-order methods like HF for deep learning).
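The schedule of Eq. 5 is straightforward to implement; the following is a minimal sketch (the function name is ours; the constant 250 and the μ_max values are from the text). Note that 2^{−1−log₂(n+1)} simplifies exactly to 1/(2(n+1)) for n = ⌊t/250⌋, which makes the stepwise character of the schedule easy to see.

```python
import math

def momentum_schedule(t, mu_max=0.999):
    # Eq. 5: mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max).
    # Since 2^(-1 - log2(n + 1)) = 1 / (2 * (n + 1)), this equals
    # min(1 - 1 / (2 * (t // 250 + 1)), mu_max): a schedule that rises
    # toward 1 roughly like Nesterov's 1 - 3 / (t + 5), but is capped at
    # mu_max and changes value only every 250 updates.
    return min(1.0 - 2.0 ** (-1 - math.log2(t // 250 + 1)), mu_max)

# The schedule climbs in discrete steps toward mu_max:
print([momentum_schedule(t) for t in (0, 250, 1000, 250_000, 749_999)])
```

With μ_max = 0.999, the schedule starts at 0.5, reaches 0.9 by update 1000, and sits at the cap μ_max for most of the 750,000 updates of a trial.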
as its dimensionality or the input variance). Indeed, for tasks that do not have many irrelevant inputs, a larger scale of the input-to-hidden weights (namely, 0.1) worked better, because the aforementioned disadvantage of large input-to-hidden weights does not apply. See table 4 for a summary of the initializations used in the experiments. Finally, we found centering (mean subtraction) of both the inputs and the outputs to be important to reliably solve all of the training problems. See the appendix for more details.

4.2. Experimental Results

We conducted experiments to determine the efficacy of our initializations, the effect of momentum, and to compare NAG with CM. Every learning trial used the aforementioned initialization, 50,000 parameter updates on minibatches of 100 sequences, and the following schedule for the momentum coefficient μ: μ = 0.9 for the first 1000 parameter updates, after which μ = μ₀, where μ₀ can take the values {0, 0.9, 0.98, 0.995}. For each μ₀, we use the empirically best learning rate chosen from {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶}.

The results are presented in Table 5, as the average loss over 4 different random seeds. Instead of reporting the loss being minimized (which is the squared error or cross entropy), we use the more interpretable zero-one loss, as is standard practice with these problems. For the bit memorization, we report the fraction of timesteps that are computed incorrectly. And for the addition and the multiplication problems, we report the fraction of cases where the error in the RNN's final output prediction exceeded 0.04.

Our results show that despite the considerable long-range dependencies present in the training data for these problems, RNNs can be successfully and robustly trained to solve them, through the use of the initialization discussed in sec. 4.1, momentum of the NAG type, a large μ₀, and a particularly small learning rate (as compared with feedforward networks). Our results also suggest that larger values of μ₀ achieve better results with NAG but not with CM, possibly due to NAG's tolerance of larger μ₀'s (as discussed in sec. 2).

Although we were able to achieve surprisingly good training performance on these problems using a sufficiently strong momentum, the results of Martens & Sutskever (2011) appear to be moderately better and more robust. They achieved lower error rates and their initialization was chosen with less care, although the initializations are in many ways similar to ours. Notably, Martens & Sutskever (2011) were able to solve these problems without centering, while we had to use centering to solve the multiplication problem (the other problems are already centered). This suggests that the initialization proposed here, together with the method of Martens & Sutskever (2011), could achieve even better performance. But the main achievement of these results is a demonstration of the ability of momentum methods to cope with long-range temporal dependency training tasks to a level which seems sufficient for most practical purposes. Moreover, our approach seems to be more tolerant of smaller minibatches, and is considerably simpler than the particular version of HF proposed in Martens & Sutskever (2011), which used a specialized update damping technique whose benefits seemed mostly limited to training RNNs to solve these kinds of extreme temporal dependency problems.

5. Momentum and HF

Truncated Newton methods, which include the HF method of Martens (2010) as a particular example, work by optimizing a local quadratic model of the objective via the linear conjugate gradient algorithm (CG), which is a first-order method. While HF, like all truncated-Newton methods, takes steps computed using partially converged calls to CG, it is naturally accelerated along at least some directions of lower curvature compared to the gradient. It can even be shown (Martens & Sutskever, 2012) that CG will tend to favor convergence to the exact solution of the quadratic sub-problem first along higher-curvature directions (with a bias towards those which are more clustered together in their curvature-scalars/eigenvalues).

While CG accumulates information as it iterates which allows it to be optimal in a much stronger sense than any other first-order method (like NAG), once it is terminated, this information is lost. Thus, standard truncated Newton methods can be thought of as persisting information which accelerates convergence (of the current quadratic) only over the number of iterations CG performs. By contrast, momentum methods persist information that can inform new updates across an arbitrary number of iterations.

One key difference between standard truncated Newton methods and HF is the use of "hot-started" calls to CG, which use as their initial solution the one found at the previous call to CG. While this solution was computed using old gradient and curvature information from a previous point in parameter space and possibly a different set of training data, it may be well-converged along certain eigen-directions of the new quadratic, despite being very poorly converged along others (perhaps worse than the default initial solution of the zero vector). However, to the extent to which the new local quadratic model resembles the old one, and in particular in the more difficult-to-optimize directions of low curvature (which will arguably be more likely to persist across nearby locations in parameter space), the previous solution will be a preferable starting point to 0, and may even allow for gradually increasing levels of convergence along certain directions which persist
Table 5. Each column reports the errors (zero-one losses; sec. 4.2) on different problems for each combination of μ₀ and momentum type (NAG, CM), averaged over 4 different random seeds. The biases column lists the error attainable by learning the output biases and ignoring the hidden state. This is the error of an RNN that failed to establish communication between its inputs and targets. For each μ₀, we used the fixed learning rate that gave the best performance.
in the local quadratic models across many updates.

The connection between HF and momentum methods can be made more concrete by noticing that a single step of CG is effectively a gradient update taken from the current point, plus the previous update reapplied, just as with NAG, and that if CG terminated after just 1 step, HF becomes equivalent to NAG, except that it uses a special formula based on the curvature matrix for the learning rate instead of a fixed constant. The most effective implementations of HF even employ a "decay" constant (Martens & Sutskever, 2012) which acts analogously to the momentum constant μ. Thus, in this sense, the CG initializations used by HF allow us to view it as a hybrid of NAG and an exact second-order method, with the number of CG iterations used to compute each update effectively acting as a dial between the two extremes.

Inspired by the surprising success of momentum-based methods for deep learning problems, we experimented with making HF behave even more like NAG than it already does. The resulting approach performed surprisingly well (see Table 1). For a more detailed account of these experiments, see sec. A.6 of the appendix.

If viewed on the basis of each CG step (instead of each update to the parameters θ), HF can be thought of as a peculiar type of first-order method which approximates the objective as a series of quadratics only so that it can make use of the powerful first-order CG method. So apart from any potential benefit to global convergence from its tendency to prefer certain directions of movement in parameter space over others, perhaps the main theoretical benefit to using HF over a first-order method like NAG is its use of CG, which, while itself a first-order method, is well known to have strongly optimal convergence properties for quadratics, and can take advantage of clustered eigenvalues to accelerate convergence (see Martens & Sutskever (2012) for a detailed account of this well-known phenomenon). However, it is known that in the worst case CG, when run in batch mode, will converge asymptotically no faster than NAG (also run in batch mode) for certain specially designed quadratics with very evenly distributed eigenvalues/curvatures. Thus it is worth asking whether the quadratics which arise during the optimization of neural networks by HF are such that CG has a distinct advantage in optimizing them over NAG, or if they are closer to the aforementioned worst-case examples. To examine this question we took a quadratic generated during the middle of a typical run of HF on the curves dataset and compared the convergence rate of CG, initialized from zero, to NAG (also initialized from zero). Figure 5 in the appendix presents the results of this experiment. While this experiment indicates some potential advantages to HF, the closeness of the performance of NAG and HF suggests that these results might be explained by the solutions leaving the area of trust in the quadratics before any extra speed kicks in, or more subtly, that the faithfulness of the approximation goes down just enough as CG iterates to offset the benefit of the acceleration it provides.

6. Discussion

Martens (2010) and Martens & Sutskever (2011) demonstrated the effectiveness of the HF method as a tool for performing optimizations for which previous attempts to apply simpler first-order methods had failed. While some recent work (Chapelle & Erhan, 2011; Glorot & Bengio, 2010) suggested that first-order methods can actually achieve some success on these kinds of problems when used in conjunction with good initializations, their results still fell short of those reported for HF. In this paper we have completed this picture and demonstrated conclusively that a large part of the remaining performance gap that is not addressed by using a well-designed random initialization is in fact addressed by careful use of momentum-based acceleration (possibly of the Nesterov type). We showed that careful attention must be paid to the momentum constant μ, as predicted by the theory for local and convex optimization.

Momentum-accelerated SGD, despite being a first-order approach, is capable of accelerating directions of low curvature just like an approximate Newton method such as HF. Our experiments support the idea that this is important, as we observed that the use of stronger momentum (as determined by μ) had a dramatic effect on optimization performance, particularly for the RNNs. Moreover, we showed that HF can be viewed as a first-order method, and as a generalization of NAG in particular, and that it already derives some of its benefits through a momentum-like mechanism.
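The CG behavior discussed in sec. 5, fast convergence when the curvatures are clustered, with hot-starting corresponding to a nonzero initial solution, can be illustrated with a minimal linear-CG sketch. This is our own simplified implementation applied to a synthetic quadratic, not the HF code of Martens (2010); the two-cluster eigenvalue spectrum is an assumption chosen to make CG's advantage visible.

```python
import numpy as np

def cg(A, b, x0, iters):
    # Linear conjugate gradient for the quadratic q(x) = x^T A x / 2 - b^T x
    # (A symmetric positive definite). Passing a nonzero x0 is the analogue
    # of HF's "hot-started" CG calls between consecutive sub-problems.
    x = x0.copy()
    r = b - A @ x            # residual, i.e. the negative gradient of q at x
    d = r.copy()
    for _ in range(iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return x

rng = np.random.default_rng(0)
n = 50
# A quadratic whose eigenvalues form two tight clusters: in exact arithmetic
# CG converges in as many iterations as there are distinct eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate([np.full(n // 2, 1.0), np.full(n // 2, 100.0)])
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
err = np.linalg.norm(cg(A, b, np.zeros(n), iters=3) - x_star)
print(err)  # tiny: two eigenvalue clusters need only a couple of CG iterations
```

A gradient or momentum method on the same quadratic would need far more than two iterations at this condition number, which is the sense in which CG, and hence HF, can outpace NAG when the eigenvalues are clustered.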
References

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166, 1994.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS. MIT Press, 2007.

Bottou, L. and LeCun, Y. Large scale online learning. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, volume 16, pp. 217. MIT Press, 2004.

Chapelle, O. and Erhan, D. Improved preconditioner for Hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574, 2011.

Dahl, G.E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42, 2012.

Darken, C. and Moody, J. Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems, pp. 1009-1009, 1993.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249-256, May 2010.

Graves, A. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jaeger, H. Personal communication, 2012.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78-80, 2004.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.

Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, pp. 1-33, 2010.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. Neural Networks: Tricks of the Trade, pp. 546-546, 1998.

Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Martens, J. and Sutskever, I. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1033-1040, 2011.

Martens, J. and Sutskever, I. Training deep and recurrent networks with Hessian-free optimization. Neural Networks: Tricks of the Trade, pp. 479-535, 2012.

Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 2012.

Mohamed, A., Dahl, G.E., and Hinton, G. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14-22, Jan. 2012.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372-376, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer, 2003.

Orr, G.B. Dynamics and algorithms for stochastic search. 1996.

Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Raiko, T., Valpola, H., and LeCun, Y. Deep learning made easier by linear transformations in perceptrons. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, 2011.

Sutskever, I., Martens, J., and Hinton, G. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1017-1024, June 2011.

Wiegerinck, W., Komoda, A., and Heskes, T. Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A: Mathematical and General, 27(13):4425, 1999.
Figure 2. The trajectories of CM, NAG, and SGD are shown. Although the value of the momentum is identical for both experiments, CM exhibits oscillations along the high-curvature directions, while NAG exhibits no such oscillations. The global minimizer of the objective is at (0,0). The red curve shows gradient descent with the same learning rate as NAG and CM, the blue curve shows NAG, and the green curve shows CM. See section 2 of the paper.

Appendix

If we define

    v_t ≡ θ_t − θ_{t−1}    (10)
    μ_t ≡ (a_t − 1)/a_{t+1}    (11)

then the combination of eqs. 9 and 11 implies:

    y_t = θ_{t−1} + μ_{t−1} v_{t−1}

which can be used to rewrite eq. 7 as follows:

    θ_t = θ_{t−1} + μ_{t−1} v_{t−1} − ε_{t−1} ∇f(θ_{t−1} + μ_{t−1} v_{t−1})    (12)
    v_t = μ_{t−1} v_{t−1} − ε_{t−1} ∇f(θ_{t−1} + μ_{t−1} v_{t−1})    (13)
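Eqs. 12 and 13 make NAG directly comparable to CM: the only difference is where the gradient is evaluated. The following sketch implements one step of each on a toy two-dimensional quadratic loosely in the spirit of Figure 2; the curvatures and hyperparameters are illustrative values, not the paper's.

```python
import numpy as np

# Toy quadratic f(theta) = 0.5 * theta^T A theta with one high-curvature
# direction (illustrative values, not taken from the paper).
A = np.diag([100.0, 1.0])
grad = lambda theta: A @ theta

def cm_step(theta, v, eps, mu):
    # classical momentum: gradient evaluated at theta itself
    v = mu * v - eps * grad(theta)
    return theta + v, v

def nag_step(theta, v, eps, mu):
    # NAG (eqs. 12-13): gradient evaluated at the partial update theta + mu*v
    v = mu * v - eps * grad(theta + mu * v)
    return theta + v, v

theta_cm, v_cm = np.array([1.0, 1.0]), np.zeros(2)
theta_nag, v_nag = np.array([1.0, 1.0]), np.zeros(2)
eps, mu = 0.01, 0.9
for _ in range(100):
    theta_cm, v_cm = cm_step(theta_cm, v_cm, eps, mu)
    theta_nag, v_nag = nag_step(theta_nag, v_nag, eps, mu)
```

Along the high-curvature direction (ελ = 1 here), CM's velocity keeps oscillating, while NAG's effective momentum μ(1 − ελ) vanishes, which is the stabilizing effect quantified by Theorem 2.1.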
Proposition 6.1. Let {(ε_i, μ_i)}_{i=1}^∞ be an arbitrary sequence of learning parameters, let x_0, v_0 be an arbitrary initial position and velocity, and let {(x_i, v_i)}_{i=0}^∞ be the sequence of iterates obtained by CM by optimizing q starting from x_0, v_0, with the learning parameters ε_i, μ_i at iteration i.

Next, let y_0, w_0 be given by U x_0, U v_0, and let {(y_i, w_i)}_{i=0}^∞ be the sequence of iterates obtained by CM by optimizing p starting from y_0, w_0, with the learning parameters ε_i, μ_i at iteration i.

Then the following holds for all i:

    y_i = U x_i
    w_i = U v_i

The above also applies when CM is replaced with NAG.

Proof. First, notice that

    x_{i+1} = CM_x(μ_i, q, x_i, v_i)
    v_{i+1} = CM_v(μ_i, q, x_i, v_i)

and that

    y_{i+1} = CM_x(μ_i, p, y_i, w_i)
    w_{i+1} = CM_v(μ_i, p, y_i, w_i)

The proof is by induction. The claim is immediate for i = 0. To see that it holds for i + 1 assuming it holds for i, consider:

    w_{i+1} = μ_i w_i − ε_i ∇p(y_i)
            = μ_i w_i − ε_i U ∇q(U^T y_i)    (definition of p and the chain rule)
            = μ_i w_i − ε_i U ∇q(x_i)        (x_i = U^T y_i)
            = μ_i U v_i − ε_i U ∇q(x_i)      (by induction)
            = U (μ_i v_i − ε_i ∇q(x_i))
            = U v_{i+1}
    y_{i+1} = y_i + w_{i+1} = U x_i + U v_{i+1} = U x_{i+1}

This completes the proof for CM; the proof for NAG is nearly identical.

Given this result, and the reparameterization p of q according to its eigencomponents, the results of Theorem 2.1 can thus be obtained.

Proof of Theorem 2.1. We first show that for separable problems, CM (or NAG) is precisely equivalent to many simultaneous applications of CM (or NAG) to the one-dimensional problems that correspond to each problem dimension. We then show that for NAG, the effective momentum for a one-dimensional problem depends on its curvature. We prove these results for one step of CM or NAG, although they generalize to larger n.

Consider one step of CM_v:

    CM_v(μ, p, y, v) = μv − ε∇p(y)
                     = (μ[v]_1 − ε ∇_{[y]_1} p(y), ..., μ[v]_n − ε ∇_{[y]_n} p(y))
                     = (μ[v]_1 − ε [p]_1′([y]_1), ..., μ[v]_n − ε [p]_n′([y]_n))
                     = (CM_v(μ, [p]_1, [y]_1, [v]_1), ..., CM_v(μ, [p]_n, [y]_n, [v]_n))

This shows that one step of CM_v on p is precisely equivalent to n simultaneous applications of CM_v to the one-dimensional quadratics [p]_i, all with the same μ and ε. A similar argument shows that a single step of each of CM_x, NAG_x, and NAG_v can be obtained by applying each of them to the n one-dimensional quadratics [p]_i.

Next we show that NAG, applied to a one-dimensional quadratic with a momentum coefficient μ, is equivalent to CM applied to the same quadratic with the same learning rate, but with a momentum coefficient μ(1 − ελ_i). We show this by expanding NAG_v(μ, [p]_i, y, v) (where y and v are scalars):

    NAG_v(μ, [p]_i, y, v) = μv − ε [p]_i′(y + μv)
                          = μv − ε (λ_i(y + μv) + c_i)
                          = μv − ε λ_i μv − ε (λ_i y + c_i)
                          = μ(1 − ελ_i)v − ε [p]_i′(y)
                          = CM_v(μ(1 − ελ_i), [p]_i, y, v)

Since eq. 2 is identical to eq. 4, it follows that

    NAG_x(μ, [p]_i, y, v) = CM_x(μ(1 − ελ_i), [p]_i, y, v)

A.3. Autoencoder Problem Details

We experiment with the three autoencoder problems from Hinton & Salakhutdinov (2006), which are described in Table 6.

A.4. RNN Problem Details

We considered 4 of the pathological long-term dependency problems from Hochreiter & Schmidhuber (1997), which consist of artificial datasets designed to have various non-trivial long-range dependencies.
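The curvature-dependent effective momentum in the proof above can be checked numerically. The following sketch, with made-up values for λ_i, c_i, ε, μ, y, and v, verifies that one NAG velocity update on a one-dimensional quadratic equals a CM update with momentum μ(1 − ελ_i):

```python
# One-dimensional quadratic [p]_i(y) = 0.5*lam*y**2 + c*y, so [p]_i'(y) = lam*y + c.
# All numeric values here are illustrative, not taken from the paper.
lam, c = 2.0, 0.3

def dp(y):
    return lam * y + c

def cm_v(mu, y, v, eps):
    # classical momentum velocity update: mu*v - eps*[p]_i'(y)
    return mu * v - eps * dp(y)

def nag_v(mu, y, v, eps):
    # NAG velocity update: gradient taken at the lookahead point y + mu*v
    return mu * v - eps * dp(y + mu * v)

eps, mu, y, v = 0.1, 0.9, 1.5, -0.4
# NAG equals CM run with the curvature-dependent effective momentum mu*(1 - eps*lam)
lhs = nag_v(mu, y, v, eps)
rhs = cm_v(mu * (1.0 - eps * lam), y, v, eps)
```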
Figure 3. In each figure, the vertical axis is time and the horizontal axis indexes the hidden units. The leftmost figure shows the hidden state sequences of an RNN on the addition problem with the aforementioned parameter initialization. Note the gentle oscillations of the hidden states. The middle figure shows the hidden state sequence of the same RNN after 1000 parameter updates. The hidden state sequence is almost indistinguishable, differing by approximately 10^-4 for each pixel, so despite the little progress that has been made, the oscillations are preserved, and thus the parameter setting still has a chance of solving the problem. The rightmost figure shows an RNN with the same initialization after 1000 parameter updates, where the output biases were initialized to zero. Notice how the hidden state sequence has many fewer oscillations, although both parameter settings fail to establish communication between the inputs and the target.
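The output-bias initialization compared in Figure 3 (and discussed in section A.5) can be sketched as follows; the array shapes and the toy targets are our own illustrative assumptions:

```python
import numpy as np

# Sketch of "bias centering": initialize the output bias to the mean of the
# targets so the network predicts the average target at initialization,
# without having to flatten its hidden-state dynamics to fit that mean.
# Shapes and target values are illustrative assumptions.
rng = np.random.default_rng(0)
targets = rng.choice([0.0, 1.0], size=1000)    # toy targets with mean near 0.5
n_hidden, n_out = 100, 1

W_out = np.zeros((n_hidden, n_out))            # small/zero initial output weights
b_centered = np.full(n_out, targets.mean())    # centered output bias
b_zero = np.zeros(n_out)                       # the initialization that loses oscillations

h = rng.standard_normal(n_hidden)              # some initial hidden state
y_centered = h @ W_out + b_centered            # already predicts the target mean
y_zero = h @ W_out + b_zero                    # must move parameters to fit the mean
```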
within its hidden state, retaining it all of the way to the end. This is made especially difficult because there are no easier-to-learn short- or medium-term dependencies that can act as hints to help the learning see the long-term ones, and because of the noise in the addition and the multiplication problems.

A.5. The effect of bias centering

In our experiments, we observed that the hidden state oscillations, which are likely important for relaying information across long distances, had a tendency to disappear after the early stages of learning (see fig. 3), causing the optimization to converge to a poor local optimum. We surmised that this tendency was due to the fact that the smallest modification to the parameters (as measured in the standard norm, the quantity which steepest descent can be thought of as minimizing) that allowed the output units to predict the average target output involved making the outputs constant by way of making the hidden state constant. By centering the target outputs, the initial bias is correct by default, and thus the optimizer is not forced into making this poor early choice that dooms it later on.

Similarly, we found it necessary to center the inputs to reliably solve the multiplication problem; the momentum methods were unable to solve this problem without input centering. We speculate that the reason for this is that the inputs associated with the multiplication problem are positive, causing the hidden state to drift in the direction of the input-to-hidden vector, which in turn leads to saturation of the hidden state. When the inputs are centered, the net input to the hidden units will initially be closer to zero when averaged over a reasonable window of time, thus preventing this issue.

Figure 3 shows how the hidden state sequence evolves when the output units are not centered. While the oscillations of the RNN's hidden state are preserved if the output bias is initialized to be the mean of the targets, they disappear if the output bias is set to zero (in this example, the mean of the targets is 0.5). This is a consequence of the isotropic nature of stochastic gradient descent, which causes it to minimize the L2 distance of its optimization paths, and of the fact that the L2-closest parameter setting that outputs the average of the targets is obtained by slightly adjusting both the biases and making the hidden states constant, to utilize the hidden-to-output weights.

A.6. Hybrid HF-Momentum approach

Our hybrid HF-momentum algorithm is characterized by the following changes to the original approach outlined by Martens (2010): we disposed of the line search and used a fixed learning rate of 1 which, after some point chosen by hand, we gradually decayed to zero using a simple schedule; we gradually increased the value of the decay constant (similarly to how we increased the momentum μ); and we used smaller minibatches, while computing the gradient only on the current minibatch (as opposed to on the full dataset, as is commonly done with HF).

Analogously to our experiments with momentum, we adjusted the settings near the end of optimization (after the transient phase) to improve fine-grained local convergence/stochastic estimation. In particular, we switched back to using a more traditional version of HF along the lines of the one described by Martens (2010), which involved using much larger minibatches, or just computing the gradient on more data/the entire training set, raising the learning rate back to 1 (or as high as the larger minibatches permit), etc., and also decreasing the L2 weight decay to squeeze out some extra performance.

The results of these experiments were encouraging, although preliminary, and we report them in Table 1, noting that they are not directly comparable to the experiments performed with CM and NAG due to the use of incomparable amounts of computation (fewer iterations, but with large minibatches).

For each autoencoder training task we ran, we hand-designed a schedule for the learning rate, the decay constant, and the minibatch size, after a small amount of trial and error. The details for CURVES and MNIST are given below.

For CURVES, we first ran HF iterations with 50 CG steps each, for a total of 250k CG steps, using a learning rate of 1.0 and a decay constant of 0.95, with gradient and curvature products computed on minibatches consisting of 1/16th of the training set. Next, we increased the decay constant to 0.999 and annealed the learning rate according to the schedule 500/(t − 4000). The error reached 0.0094 by this point, and from here we tuned the method to achieve fine local convergence, which we achieved primarily through running the method in full batch mode. We first ran HF for 100k total CG steps with the number of CG steps per update increased to 200, the decay constant lowered to 0.995, and the learning rate raised back to 1 (which was stable because we were running in batch mode). The error reached 0.087 by this point. Finally, we lowered the L2 weight decay from 2e-5 (which was used in (Martens, 2010)) to near zero and ran 500k total steps more, to arrive at an error of 0.058.

For MNIST, we first ran HF iterations with 50 CG steps each, for a total of 25k CG steps, using a learning rate of 1.0 and a decay constant of 0.95, with gradient and curvature products computed on minibatches consisting of 1/20th of the training set. Next, we increased the decay constant to 0.995, the learning rate was annealed according to 250/t, and the method was run for another 225k CG steps. The error reached 0.81 by this point.
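As an illustration, the hand-designed CURVES schedule described above could be written as follows; the function name, the cap of the annealed learning rate at 1, and reading t as the HF iteration count are our assumptions, with only the constants taken from the text:

```python
# Sketch of the hand-designed CURVES schedule described above. The function
# name, the min(1, ...) cap, and interpreting t as the HF iteration count are
# assumptions; only the numeric constants come from the text.
def curves_schedule(t):
    """Return (learning_rate, decay_constant) at HF iteration t."""
    if t <= 4000:
        return 1.0, 0.95                  # initial phase: fixed rate, modest decay
    lr = min(1.0, 500.0 / (t - 4000))     # anneal the rate to zero as 500/(t - 4000)
    return lr, 0.999                      # later phase: stronger decay constant
```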
[Figure: curves comparing NAG and CG; vertical axis from 0.000 to 0.018, horizontal axis from 0 to 500.]