
Algorithms for unconstrained optimization

1 Introduction

We saw in the previous document on unconstrained optimization that at a minimum the gradient
∇f (x∗) = 0. Thus, it is tempting to determine the optimum by simply solving this set of n
nonlinear equations. However, this is not an attractive idea, since solving a nonlinear system is
itself a hard problem (in fact, one technique for solving nonlinear equations relies on optimization).
Also, recall that a stationary point could be a maximum, a minimum or a saddle point: simply
setting the gradient equal to 0 does not guarantee that we have found a minimum.

Typical algorithms to determine a minimum are iterative, i.e., starting from an initial point x0 , a
sequence of new points x1 , x2 , . . . is generated till some termination criteria are met. There are
two fundamental strategies for moving from xk to xk+1. Most algorithms fall into one of these two
classes:

• Line search methods. In line search methods, starting from xk , a search direction p is
determined and we move along this direction to xk+1 , i.e.,

xk+1 = xk + αp, (1)

so that f (xk+1 ) < f (xk ). α can be determined by solving the following univariate optimiza-
tion problem:
min_{α>0} f (xk + αp)     (2)

One might assume that solving the above problem exactly, i.e., determining its global minimum,
gives the maximum benefit. However, this may itself be a difficult optimization problem, and it
turns out that an approximate solution guaranteeing a sufficient decrease along the chosen
direction is often preferred.

• Trust region methods. In trust region methods, a model function m is built around xk such
that it approximates the true function f around xk . Since this is only an approximation, we
“trust” this model only in a certain region called the trust region and we choose the new
point xk+1 such that it lies in the trust region and minimizes m(xk + p), i.e., we determine p
by solving

min_p m(xk + p),     (3)

subject to xk + p lying in the trust region. Usually, the trust region is a ball of a
certain radius around xk . If there is not sufficient decrease in the value of the true objective
function, we conclude that the model was not accurate enough and hence reduce the trust
region size and repeat.

Thus, to compare the two approaches: in line search methods, we first fix a search direction and
then compute the step size; in trust region methods, we decide a priori the maximum distance we
are willing to move and then determine the direction and step size.

2 Line search methods

As mentioned earlier, in line search methods, we choose a search direction and then proceed in
that direction. The direction p should be chosen to be a descent direction, i.e., at least locally, if
we move along that direction, the objective function value decreases.

The directions p in the line search methods can often be written as:

p = −Bk⁻¹ ∇f (xk),     (4)

where Bk is symmetric and invertible. When Bk is positive definite, p is a descent direction.
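
As a small illustration of (4), here is a minimal sketch in Python (assuming NumPy is available; B_k and grad_k are illustrative placeholder names for the current matrix and gradient) that obtains p by solving a linear system rather than forming the inverse explicitly:

```python
import numpy as np

def search_direction(B_k, grad_k):
    # Solve B_k p = -grad_f(x_k); avoids explicitly inverting B_k.
    return np.linalg.solve(B_k, -grad_k)
```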


Depending on the choice of Bk , we have the following different methods.

2.1 Steepest descent

Since −∇f (xk) is the direction of steepest descent, an obvious choice is to set p = −∇f (xk).
This corresponds to setting Bk = I, the identity matrix, in the above equation. The
advantage of the method is that knowledge of only the first derivative is sufficient. However, it is
often slow and sensitive to scaling of f (x). Also, the method may converge to a saddle point and
hence one must test if the determined solution is a minimum or not.
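
A minimal sketch of the resulting iteration, assuming for simplicity a fixed step length (a proper choice of α is discussed in Section 2.4); grad_f, alpha and the tolerance are illustrative placeholders:

```python
import numpy as np

def steepest_descent(grad_f, x0, alpha=0.1, tol=1e-6, max_iter=10000):
    # B_k = I, so the search direction is simply the negative gradient.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stationary point reached (could be a saddle)
            break
        x = x - alpha * g
    return x
```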

2.2 Newton method

The Newton direction is determined by approximating the function by a quadratic around xk
and then determining the minimum of this quadratic function. It can be shown that the search
direction can be expressed as:

p = −(∇²f (xk))⁻¹ ∇f (xk)     (5)

Since a function behaves like a quadratic near the optimum, if one starts close to the optimum,
the approximation is valid and one in fact converges rapidly (quadratically) to the optimum. How-
ever, second order derivatives have to be calculated, which may be computationally demanding,
and a system of linear equations has to be solved at each iteration. In addition, if the Hessian is
indefinite at some point, the direction may not be a descent direction; in that case the Hessian is
forced to be positive definite by adding a suitable matrix. Using a step length of 1 does not
guarantee convergence from an arbitrary starting point, and hence a step length is chosen as
described below.
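
A sketch of how the Newton direction might be computed, including one common (illustrative) way of forcing positive definiteness by adding a multiple of the identity until a Cholesky factorization succeeds; grad_f and hess_f are assumed to be user-supplied functions:

```python
import numpy as np

def newton_direction(grad_f, hess_f, x, beta=1e-3):
    g = grad_f(x)
    H = hess_f(x)
    n = len(x)
    tau = 0.0
    while True:
        try:
            # Cholesky succeeds only if H + tau*I is positive definite.
            np.linalg.cholesky(H + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)
    # Solve the (possibly modified) Newton system (H + tau*I) p = -g.
    return np.linalg.solve(H + tau * np.eye(n), -g)
```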

2.3 Quasi-Newton method

In quasi-Newton methods, we replace the true Hessian by an approximation and update it at each
iteration. If some conditions are met, the updated matrix is guaranteed to be positive definite and
hence the search directions are guaranteed to be descent directions. Since the true Hessian is
not needed, there is no need to compute second order derivatives. Moreover, convergence is
usually faster than the steepest descent method. Depending on the choice of the updating rule,
we have different methods. Define sk = xk+1 − xk and yk = ∇f (xk+1) − ∇f (xk).

• Symmetric rank one update (SR1)

Bk+1 = Bk + ((yk − Bk sk)(yk − Bk sk)') / ((yk − Bk sk)' sk)     (6)

• BFGS
Bk+1 = Bk − (Bk sk sk' Bk) / (sk' Bk sk) + (yk yk') / (yk' sk)     (7)

• DFP. Denoting Hk = Bk⁻¹, we directly update the inverse of Bk:

Hk+1 = Hk − (Hk yk yk' Hk) / (yk' Hk yk) + (sk sk') / (yk' sk)     (8)

The first method updates Bk by a rank one matrix, while the second and third methods update
Bk (or its inverse) by a rank two matrix.
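
A sketch of the BFGS update (7) as it might appear in code; in practice the update is usually skipped when the curvature yk' sk is not sufficiently positive, a safeguard shown here with an illustrative threshold:

```python
import numpy as np

def bfgs_update(B, s, y, curvature_tol=1e-10):
    # B_{k+1} = B_k - (B_k s s' B_k)/(s' B_k s) + (y y')/(y' s)
    if y @ s <= curvature_tol:
        return B                     # skip the update to preserve positive definiteness
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```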

2.4 Choice of step length α

As mentioned previously, it is not necessary to determine α exactly; a choice of α that approxi-
mately solves (2) is sufficient. In particular, α is chosen so that it satisfies the following condition
(Armijo condition):
(f (xk + αp) − f (xk)) / α ≤ c ∇f (xk)' p     (9)
where c is a small positive number (usually 10⁻⁴). Physically, it says that the average rate
of decrease in going from xk to xk+1 should be at least a fraction c of the local rate of change
along p at xk. One simple method to achieve this is to start with a reasonable value of α (1 in
Newton and quasi-Newton methods) and keep decreasing it till the condition is met.
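
A minimal backtracking sketch enforcing (9); the contraction factor rho and the value of c are common illustrative choices, and x and p are assumed to be NumPy arrays:

```python
def backtracking_line_search(f, grad_f, x, p, alpha=1.0, c=1e-4, rho=0.5):
    fx = f(x)
    slope = grad_f(x) @ p            # directional derivative; negative for a descent direction
    # Shrink alpha until the Armijo (sufficient decrease) condition holds.
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho
    return alpha
```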

3 Trust region methods

In trust region methods, we approximate the true function by a model function which is usually a
quadratic approximation:
mk(p) = f (xk) + ∇f (xk)' p + (1/2) p' Bk p,     (10)

If Bk = ∇²f (xk), the method is a trust region Newton method. If Bk is an approximation to the
Hessian, we have a trust region quasi-Newton method. We then solve the following subproblem:

min_p mk(p), subject to |p| ≤ ∆,     (11)

where ∆ is the radius of the trust region. Assume that the above problem has been solved and we have
p and hence xk+1 = xk + p. In order to decide if the model is good enough, we use the following
ratio:
ρk = (f (xk) − f (xk + p)) / (mk(0) − mk(p)),     (12)
which is the ratio of the actual reduction achieved to that predicted by the model. The denomi-
nator is always positive, so if ρk is negative the true objective actually increased; we must
conclude that the approximation is not accurate, reject the step, decrease the trust region and
solve the subproblem again. If ρk is close to 1, the model is reasonably accurate and we enlarge
the trust region in the next iteration. If ρk is positive but much smaller than 1, we keep the trust
region unchanged, while if it is close to zero or negative, we reduce it at the next iteration.
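
A sketch of this accept/adjust logic; the thresholds 0.25 and 0.75, the factors 0.25 and 2, and delta_max are illustrative choices, not prescribed above:

```python
def update_trust_region(rho, delta, delta_max=10.0):
    # Poor agreement between model and function: shrink the region.
    if rho < 0.25:
        delta = 0.25 * delta
    # Very good agreement: enlarge the region (up to a maximum radius).
    elif rho > 0.75:
        delta = min(2.0 * delta, delta_max)
    accept_step = rho > 0            # keep x_{k+1} only if f actually decreased
    return delta, accept_step
```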

3.1 Solution of trust region subproblem

As in line search methods, it is not necessary to solve the subproblem exactly. First notice that
if the trust region is very small, the solution will lie along the steepest descent direction. Likewise,
if the trust region is very large and Bk is positive definite, we obtain the so-called Newton step, i.e.,
−Bk⁻¹∇f (xk). When the trust region size is somewhere in between, the solution depends on ∆ in a
complicated manner. If we could determine this path exactly, we would have solved the above
problem. Instead, we replace this complicated path by a piecewise-linear path: we move along
the steepest descent direction and then take a turn towards the Newton direction, while
simultaneously ensuring that we remain within the trust region. It turns out this is sufficient for
achieving reasonable progress.
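
This piecewise path is commonly known as the dogleg path. A sketch for positive definite Bk: take the full Newton step if it lies in the region, otherwise follow steepest descent to the model minimizer along −g (the Cauchy point) and then head toward the Newton step until the boundary is reached; all names are illustrative:

```python
import numpy as np

def dogleg_step(g, B, delta):
    p_newton = np.linalg.solve(B, -g)
    if np.linalg.norm(p_newton) <= delta:
        return p_newton                           # full Newton step fits in the region
    p_cauchy = -(g @ g) / (g @ B @ g) * g         # minimizer of the model along -g
    if np.linalg.norm(p_cauchy) >= delta:
        return -(delta / np.linalg.norm(g)) * g   # steepest descent, cut at the boundary
    # Move from the Cauchy point toward the Newton step until hitting the boundary.
    d = p_newton - p_cauchy
    a, b, c = d @ d, 2 * (p_cauchy @ d), p_cauchy @ p_cauchy - delta**2
    t = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_cauchy + t * d
```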

4 Termination criteria

The algorithm terminates if one can decide that the minimum is reached or that no further improve-
ment is possible. No single criterion is used to decide termination; some commonly used criteria
are as follows:

• |∇f (xk)| < ε1

• |f (xk+1) − f (xk)| < ε2 (1 + |f (xk)|)

• |xk+1 − xk| < ε3 (1 + |xk|),

where the εi are tolerances and | · | is a vector norm, i.e., the size of the vector, usually the 2
norm.
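
A sketch combining the three tests (here with a simple OR; in practice one may require several of them to hold simultaneously); eps1, eps2 and eps3 are user-chosen tolerances:

```python
import numpy as np

def converged(grad_new, f_old, f_new, x_old, x_new,
              eps1=1e-6, eps2=1e-9, eps3=1e-9):
    small_gradient = np.linalg.norm(grad_new) < eps1
    small_f_change = abs(f_new - f_old) < eps2 * (1 + abs(f_old))
    small_x_change = np.linalg.norm(x_new - x_old) < eps3 * (1 + np.linalg.norm(x_old))
    return small_gradient or small_f_change or small_x_change
```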

