
Lecture Notes on First-Order Methods in

Simple Settings: Smooth and Non-smooth Methods

Robert M. Freund

May 2013

© 2013 Massachusetts Institute of Technology. All rights reserved.

The material presented herein is based on a variety of sources, but most specifically on:

- "Mirror descent and nonlinear projected subgradient methods for convex optimization" by Amir Beck and Marc Teboulle, Operations Research Letters 31 (2003), pp. 167-175,

- "Introductory Lectures on Convex Optimization" by Yurii Nesterov, Kluwer Academic Publishers, 2004,

- "Approximation accuracy, gradient methods, and error bound for structured convex optimization" by Paul Tseng, Mathematical Programming B (2010), 125, pp. 263-295, and

- "On accelerated proximal gradient methods for convex-concave optimization" by Paul Tseng, unpublished notes dated May 21, 2008.

1 Smooth Method in Simple Setting

1.1 Problem Setting and Basics for Smooth Method

Our problem of interest is

    P :  minimize_x  f(x)
         s.t.        x ∈ R^n ,                                   (1)

where f(·) is a differentiable convex function defined on R^n. We denote the optimal objective value by v* and let x* denote an optimal solution of P, when such a solution exists.

Let ∇f(x) denote the gradient of f(·) at x, and recall the gradient inequality:

    f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all y ∈ R^n .            (2)

We will measure distances between points using the Euclidean norm ‖·‖ := ‖·‖_2, where ‖x‖_2 := √(x^T x).

We assume that f(·) has a Lipschitz gradient. That is, there is a scalar L for which

    ‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖   for all x, y ∈ R^n .            (3)

Let B(c, R) denote the ball centered at c with radius R, namely:

    B(c, R) := {x ∈ R^n : ‖x − c‖ ≤ R} .

Table 1 presents a simple gradient scheme for solving (P).

Simple Gradient Scheme

Step 0: Initialize. Initialize with x^0 and i ← 0.

Step 1: Compute New Point. Compute ∇f(x^i), and then compute:

    x^{i+1} ← x^i − (1/L) ∇f(x^i) .

Step 2: Update counter and repeat. i ← i + 1 and Goto Step 1.

Table 1: A Simple Gradient Scheme.
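To make the scheme concrete, here is a minimal sketch in Python (not part of the original notes). The quadratic test function, its gradient, and the value of L are our own illustrative choices; for f(x) = ½ xᵀQx with Q diagonal and positive, L can be taken to be the largest diagonal entry of Q.

```python
import numpy as np

def simple_gradient_scheme(grad_f, L, x0, num_iters):
    """Run the simple gradient scheme of Table 1: x_{i+1} <- x_i - (1/L) grad f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - grad_f(x) / L
    return x

# Illustrative test problem: f(x) = 0.5 * x^T Q x with Q diagonal, so v* = 0 at x* = 0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
L = 10.0  # largest eigenvalue of Q

x_k = simple_gradient_scheme(grad_f, L, x0=[5.0, 5.0], num_iters=200)
# Theorem 1.1 guarantees f(x_k) <= f(x*) + L * ||x* - x0||^2 / (2k)
```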

1.2 Analysis and Complexity of the Simple Gradient Scheme

We begin with a basic property of Lipschitz gradients:

Proposition 1.1 If f(·) has Lipschitz gradient with constant L, then

    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖^2   for all x, y ∈ R^n .

Proof: We use the fundamental theorem of calculus applied in a multivariate setting. We have

    f(y) = f(x) + ∫_0^1 ∇f(x + t(y − x))^T (y − x) dt

         = f(x) + ∇f(x)^T (y − x) + ∫_0^1 [∇f(x + t(y − x)) − ∇f(x)]^T (y − x) dt

         ≤ f(x) + ∇f(x)^T (y − x) + ∫_0^1 ‖∇f(x + t(y − x)) − ∇f(x)‖ ‖y − x‖ dt

         ≤ f(x) + ∇f(x)^T (y − x) + ∫_0^1 tL‖y − x‖^2 dt

         = f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖^2 .

We also need the following arithmetic result:

Proposition 1.2 For any x the following holds:


    ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2 = ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2 .

Proof: This is just a rearrangement of terms using the fact that x^{i+1} = x^i − (1/L)∇f(x^i). Expanding the left side of the proposition we have:

    ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2

    = L [ (x^i − x^{i+1})^T (x − x^i) + (1/2)‖x‖^2 − x^T x^i + (1/2)‖x^i‖^2 − (1/2)‖x‖^2 + x^T x^{i+1} − (1/2)‖x^{i+1}‖^2 ]

    = L [ (x^{i+1})^T x^i − ‖x^i‖^2 + (1/2)‖x^i‖^2 − (1/2)‖x^{i+1}‖^2 ]

    = L [ −(1/2)‖x^i‖^2 + (x^{i+1})^T x^i − (1/2)‖x^{i+1}‖^2 ]

    = −(L/2)‖x^{i+1} − x^i‖^2

    = −L(x^{i+1} − x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2

    = ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2 .

Our main algorithmic analysis result for the simple gradient scheme is:

Theorem 1.1 After k iterations of the simple gradient scheme, the following holds:

    f(x^k) ≤ f(x) + L‖x − x^0‖^2 / (2k)   for all x ∈ R^n .

Corollary 1.1 Suppose our region of interest is B(x^0, R), which may or may not contain an optimal solution x*. Then:

    f(x^k) ≤ min_{x ∈ B(x^0, R)} f(x) + R^2 L / (2k) .

Let R be such that x* ∈ B(x^0, R) for some optimal solution x*, let ε > 0, and let

    k := ⌈ R^2 L / (2ε) ⌉ .

Then f(x^k) ≤ v* + ε.

Proof of Theorem 1.1: We have:

    f(x^{i+1}) ≤ f(x^i) + ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2   (from Proposition 1.1)

              = f(x^i) − L‖x^{i+1} − x^i‖^2 + (L/2)‖x^{i+1} − x^i‖^2

              ≤ f(x^i) ,

whereby we see that the function values f(x^i) are decreasing in i. Restating the first inequality above again for a different purpose, we have for all x:

    f(x^{i+1}) ≤ f(x^i) + ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2   (from Proposition 1.1)

              = f(x^i) + ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2   (from Proposition 1.2)

              ≤ f(x) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2 .   (from (2))

Summing up and recalling from the start of the proof that the sequence f(x^i) is nonincreasing, we have:

    k f(x^k) ≤ Σ_{i=0}^{k−1} f(x^{i+1}) ≤ k f(x) + (L/2)‖x − x^0‖^2 − (L/2)‖x − x^k‖^2 .

Dividing by k and noting that the final subtractand above is nonnegative, we obtain:

    f(x^k) ≤ f(x) + L‖x − x^0‖^2 / (2k) .

1.3 Comments and Extensions

1. It is important for future extensions and analysis to observe that x^{i+1} solves the following elementary optimization problem:

    x^{i+1} = arg min_x { f(x^i) + ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 } .

2. How might the scheme and the analysis be modified if L is not explicitly known?

3. How might the scheme and the analysis be modified if one can efficiently do a linesearch to compute:

    t_i := arg min_t { f(x^i − t∇f(x^i)) } ,

instead of (implicitly) assigning the value t_i ← 1/L at each iteration?

2 Non-smooth Method in Simple Setting

2.1 Problem Setting and Basics for Non-smooth Method

Our problem of interest still is

    P :  minimize_x  f(x)
         s.t.        x ∈ R^n ,

where f(·) is a convex function, not necessarily differentiable, defined on R^n. As before, we denote the optimal objective value by v*.

Let ∂f(x) denote the set of subgradients of f(·) at x; namely, g ∈ ∂f(x) if and only if:

    f(y) ≥ f(x) + g^T (y − x)   for all y ∈ R^n .                (4)

Recall that ∂f(x) is nonempty, and we assume that computing a subgradient of f(·) at any given x can be done efficiently. We will continue to measure distances between points using the Euclidean norm ‖·‖ := ‖·‖_2, where ‖x‖_2 := √(x^T x).

We assume that f(·) has Lipschitz function values. That is, there is a scalar L for which

    |f(y) − f(x)| ≤ L‖y − x‖   for all x, y ∈ R^n .              (5)

Also, recall that B(c, R) denotes the ball centered at c with radius R, namely:

    B(c, R) := {x ∈ R^n : ‖x − c‖ ≤ R} .

Table 2 presents a simple subgradient scheme.

Simple Subgradient Scheme

Step 0: Initialize. Define step-sizes t_i, i = 0, 1, 2, . . . .
Initialize with x^0, i ← 0.

Step 1: Compute New Point. Compute g^i ∈ ∂f(x^i), and then compute:

    x^{i+1} ← x^i − t_i g^i .

Step 2: Update counter and repeat. i ← i + 1 and Goto Step 1.

Table 2: A Simple Subgradient Scheme.
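As an illustration, here is a minimal Python sketch of the scheme in Table 2 (not part of the original notes), tracking the best objective value found so far, which is the quantity that Theorem 2.1 bounds. The test function f(x) = ‖x‖₁ with subgradient sign(x), and the constants R, L, and k, are our own choices; the constant step-size is the one from Corollary 2.1.

```python
import numpy as np

def simple_subgradient_scheme(f, subgrad_f, step_sizes, x0):
    """Run x_{i+1} <- x_i - t_i g_i and return the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for t in step_sizes:
        x = x - t * subgrad_f(x)
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x, best_val

# Illustrative test problem: f(x) = ||x||_1, nonsmooth convex, v* = 0 at x* = 0.
# sign(x) is a valid subgradient, and |f(y) - f(x)| <= sqrt(n) ||y - x||, so L = sqrt(n).
f = lambda x: np.sum(np.abs(x))
subgrad_f = lambda x: np.sign(x)

n, k, R = 2, 999, 10.0
L = np.sqrt(n)
t = R / (L * np.sqrt(k + 1))  # constant step-size from Corollary 2.1
best_x, best_val = simple_subgradient_scheme(f, subgrad_f, [t] * (k + 1), x0=[3.0, -2.0])
# Corollary 2.1 guarantees best_val <= v* + R*L/sqrt(k+1), since x* lies in B(x0, R)
```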

2.2 Analysis and Complexity of the Simple Subgradient Scheme

We begin with a simple relationship between the Lipschitz constant and the
norm of any subgradient:

Proposition 2.1 If g ∈ ∂f(x), then ‖g‖ ≤ L.

Proof: Let g ∈ ∂f(x) be given. If g = 0, the result is trivial. If g ≠ 0, then since f(x + g) ≥ f(x) + g^T (x + g − x) = f(x) + g^T g, it follows that

    ‖g‖^2 = g^T g ≤ f(x + g) − f(x) ≤ |f(x + g) − f(x)| ≤ L‖g‖ ,

which upon dividing by ‖g‖ yields ‖g‖ ≤ L.


Our main algorithmic analysis result is:

Theorem 2.1 After k iterations of the simple subgradient scheme, the following holds:

    min_{0≤j≤k} f(x^j) ≤ f(x) + ( L^2 Σ_{i=0}^k t_i^2 + ‖x − x^0‖^2 ) / ( 2 Σ_{i=0}^k t_i )   for all x ∈ R^n .

Corollary 2.1 Suppose we wish to run our algorithm for k iterations, and our region of interest is B(x^0, R), which may or may not contain an optimal solution x*. If we set the step-sizes to be:

    t_i := t := R / (L√(k+1)) ,

we obtain:

    min_{0≤j≤k} f(x^j) ≤ min_{x ∈ B(x^0, R)} f(x) + RL/√(k+1) .

If we are given ε > 0 and we wish to compute a point whose objective function value is within ε of the optimal value v*, then we can do so within

    k := ⌈ R^2 L^2 / ε^2 ⌉ − 1

iterations using the constant step-size

    t_i = t = ε / L^2 .

Proof of Theorem 2.1: If we have just computed x^{i+1}, we have for any x:

    ‖x^{i+1} − x‖^2 = ‖x^i − t_i g^i − x‖^2

                    = ‖x^i − x‖^2 + t_i^2 ‖g^i‖^2 + 2 t_i (g^i)^T (x − x^i)

                    ≤ ‖x^i − x‖^2 + t_i^2 ‖g^i‖^2 + 2 t_i (f(x) − f(x^i))   (subgradient inequality)

                    ≤ ‖x^i − x‖^2 + t_i^2 L^2 + 2 t_i (f(x) − f(x^i)) ,   (Proposition 2.1)

which rearranges to:

    2 t_i f(x^i) ≤ 2 t_i f(x) + t_i^2 L^2 + ‖x^i − x‖^2 − ‖x^{i+1} − x‖^2 .

Denote:

    f̂_k := min_{0≤i≤k} f(x^i) .

Now sum up the above inequality over i:

    2 Σ_{i=0}^k t_i f̂_k ≤ 2 Σ_{i=0}^k t_i f(x^i) ≤ 2 Σ_{i=0}^k t_i f(x) + Σ_{i=0}^k t_i^2 L^2 + ‖x^0 − x‖^2 − ‖x^{k+1} − x‖^2 .

Since the last subtractand is nonnegative, it can be eliminated, and dividing by 2 Σ_{i=0}^k t_i yields:

    f̂_k ≤ f(x) + ( L^2 Σ_{i=0}^k t_i^2 + ‖x − x^0‖^2 ) / ( 2 Σ_{i=0}^k t_i ) .

2.3 Comments and Extensions

1. The first result in Corollary 2.1 presumes that we know in advance how many iterations we wish to run the method. This enables a constant step-size t to be determined that is optimal for the inequality of the corollary. What if one wishes to run the method simply for a while? How might one proceed with an analysis of, say, choosing t_i := C/√(i+1) for some suitably chosen constant C?

2. How might the scheme and the analysis be modified if L is not explicitly known?

3. Suppose that our optimization problem includes a feasibility set S that is computationally efficient to work with, in the sense that computing the (Euclidean) projection P_S(x) of a point x onto the set S is easy to do. (This is the case when S is a box, a ball, or the standard simplex Q := {x ∈ R^n : x ≥ 0, e^T x = 1}, where e = (1, . . . , 1)^T denotes the column vector of 1s.) Then it is a simple exercise to modify the main algorithm step to:

    x^{i+1} ← P_S(x^i − t_i g^i) .

The proof needs to be suitably modified, but can be done so easily if one relies on the following contraction property of projection operators:

Proposition 2.2 If S is a convex set, then for all x, y ∈ R^n it holds that ‖P_S(x) − P_S(y)‖ ≤ ‖x − y‖.

4. A slightly modified step-size rule is to incorporate the norm of g^i into the step-size itself. If we set t_i = ε_i/‖g^i‖ for some ε_i, then ‖x^{i+1} − x^i‖ = t_i‖g^i‖ = ε_i, so that ε_i is literally the length of the step. It is a useful exercise to explore how the results of Theorem 2.1 and Corollary 2.1 change with this step-size rule.
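As a small illustration of comment 3 above, here is a sketch of the projected variant when S is a box, so that the projection P_S is just a componentwise clip. The test problem, the diminishing step-sizes, and all names are our own illustrative choices, not part of the original notes.

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n is a componentwise clip."""
    return np.clip(x, lo, hi)

def projected_subgradient(f, subgrad_f, project, step_sizes, x0):
    """x_{i+1} <- P_S(x_i - t_i g_i), returning the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for t in step_sizes:
        x = project(x - t * subgrad_f(x))
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x, best_val

# Illustrative problem: minimize f(x) = |x1 - 3| + |x2| over the box [0,1]^2.
# The optimum is x* = (1, 0) with v* = 2.
f = lambda x: abs(x[0] - 3.0) + abs(x[1])
subgrad_f = lambda x: np.array([np.sign(x[0] - 3.0), np.sign(x[1])])
project = lambda x: project_box(x, 0.0, 1.0)

steps = [0.1 / np.sqrt(i + 1) for i in range(500)]
best_x, best_val = projected_subgradient(f, subgrad_f, project, steps, x0=[0.0, 1.0])
# best_val approaches v* = 2, and the iterates never leave the box
```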

3 An Accelerated Smooth Method (with Optimal Iteration Complexity) in Simple Setting

In this section we re-consider the case when f(·) is differentiable, i.e., smooth. We present an accelerated version of the gradient method of Section 1.1 for solving (1), which has superior theoretical and practical performance over the gradient scheme presented in Section 1.1. Recall the gradient inequality (2). As in Section 1.1, we will consider the case when f(·) has a Lipschitz gradient, see (3), and recall Proposition 1.1, which states a useful inequality for functions whose gradient is Lipschitz.

Table 3 presents an accelerated simple gradient scheme.
Table 3 presents an accelerated simple gradient scheme.

Accelerated Simple Gradient Scheme

Step 0: Initialize. Define step-size parameters θ_i ∈ (0, 1], i = 0, 1, . . . .
Initialize with x^0 and z^0 := x^0, and i ← 0.

Step 1: Compute New Points. Define y^i ← (1 − θ_i)x^i + θ_i z^i and compute ∇f(y^i). Then compute:

    z^{i+1} ← z^i − (1/(θ_i L)) ∇f(y^i)

    x^{i+1} ← (1 − θ_i)x^i + θ_i z^{i+1} .

Step 2: Update counter and repeat. i ← i + 1 and Goto Step 1.

Table 3: An Accelerated Simple Gradient Scheme.
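Here is a minimal Python sketch of the scheme in Table 3 (not part of the original notes), using the step-size sequence θ_i = 2/(i+2). The quadratic test function and the value of L are our own illustrative choices.

```python
import numpy as np

def accelerated_gradient_scheme(grad_f, L, x0, num_iters):
    """Accelerated scheme of Table 3 with theta_i = 2/(i+2)."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()  # z^0 := x^0
    for i in range(num_iters):
        theta = 2.0 / (i + 2)             # theta_0 = 1
        y = (1 - theta) * x + theta * z
        z = z - grad_f(y) / (theta * L)
        x = (1 - theta) * x + theta * z
    return x

# Same illustrative test problem as before: f(x) = 0.5 * x^T Q x, L = 10, v* = 0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

x_k = accelerated_gradient_scheme(grad_f, L=10.0, x0=[5.0, 5.0], num_iters=100)
# Theorem 3.1 guarantees f(x_k) <= f(x*) + 2 * L * ||x* - x0||^2 / (k+1)^2
```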

Note that the algorithm keeps track of three points at each iteration: x^i, y^i, and z^i. However, information about y^i is not used in subsequent iterations, and indeed the explicit computation or storage of y^i can be eliminated from the algorithm description. The algorithm therefore really only works with the pair x^i, z^i at each iteration. Note that y^i is used only to compute the gradient which is used to determine the value of z^{i+1} in Step 1. The new point z^{i+1} is also used to determine the value of x^{i+1} in Step 1.

The algorithm requires a sequence of step-size parameters θ_i that satisfies θ_0 = 1, θ_i ∈ (0, 1] for i = 0, 1, . . ., and that also satisfies the following inequality, which prevents the sequence from decreasing too quickly:

    (1 − θ_{i+1})/θ_{i+1}^2 ≤ 1/θ_i^2   for i = 0, 1, . . . .    (6)

The next proposition presents properties of two different step-size sequences; its proof is deferred to later.

Proposition 3.1

1. Consider the sequence

    θ_i = 2/(i + 2) ,   i = 0, 1, . . . .                        (7)

This sequence satisfies the step-size inequality (6) strictly.

2. Consider the sequence

    θ_0 = 1 ,   θ_{i+1} = 2/( 1 + √(1 + 4/θ_i^2) ) ,   i = 0, 1, . . . .   (8)

This sequence satisfies θ_i ≤ 2/(i + 2), i = 0, 1, . . ., and satisfies the step-size inequality (6) at equality.
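The two sequences are easy to generate and check numerically. The following sketch (our own, not part of the notes) computes both sequences and evaluates the quantity (1 − θ_{i+1})/θ_{i+1}² − 1/θ_i², which inequality (6) requires to be nonpositive.

```python
import math

def theta_seq_7(n):
    """Sequence (7): theta_i = 2/(i+2)."""
    return [2.0 / (i + 2) for i in range(n)]

def theta_seq_8(n):
    """Sequence (8): theta_0 = 1, theta_{i+1} = 2/(1 + sqrt(1 + 4/theta_i^2))."""
    thetas = [1.0]
    for _ in range(n - 1):
        t = thetas[-1]
        thetas.append(2.0 / (1.0 + math.sqrt(1.0 + 4.0 / t**2)))
    return thetas

def max_violation_of_6(thetas):
    """Largest value of (1 - theta_{i+1})/theta_{i+1}^2 - 1/theta_i^2; (6) requires <= 0."""
    return max((1.0 - b) / b**2 - 1.0 / a**2 for a, b in zip(thetas, thetas[1:]))

s7, s8 = theta_seq_7(50), theta_seq_8(50)
# Sequence (7) satisfies (6) strictly (the violation equals -1/4 for every i),
# sequence (8) satisfies (6) at equality (violation 0 up to rounding),
# and sequence (8) stays below 2/(i+2), as claimed in Proposition 3.1.
```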

3.1 Analysis and Complexity of the Accelerated Simple Gradient Scheme

Our main algorithmic analysis result is:

Theorem 3.1 Suppose that the sequence {θ_i}_{i=0}^∞ is given by (7) or (8). Then after k iterations of the accelerated simple gradient scheme, the following holds:

    min_{0≤i≤k} f(x^i) ≤ f(x) + 2L‖x − x^0‖^2/(k + 1)^2   for all x ∈ R^n .

Corollary 3.1 Suppose our region of interest is S_R := {x : ‖x − x^0‖ ≤ R} for some R, which may or may not contain an optimal solution x*. Then:

    min_{0≤i≤k} f(x^i) ≤ min_{x ∈ S_R} f(x) + 2LR^2/(k + 1)^2 .

If we wish to compute a point whose objective function value is within ε of the optimal value v*, then we can do so within

    k := ⌈ √( 2L‖x* − x^0‖^2/ε ) ⌉ − 1

iterations.
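To get a feel for the three guarantees side by side, the following sketch (our own, not part of the notes) evaluates the iteration counts from Corollaries 1.1, 2.1, and 3.1 for one illustrative choice of R, L, and ε. Note that the L in Corollary 2.1 bounds function values rather than gradients, so using the same numeric value for all three is purely for illustration.

```python
import math

# Iteration counts sufficient for an epsilon-optimal solution, taking R = ||x* - x0||:
# smooth gradient (Cor. 1.1), nonsmooth subgradient (Cor. 2.1), accelerated (Cor. 3.1).
R, L, eps = 1.0, 100.0, 1e-3

k_gradient = math.ceil(R**2 * L / (2 * eps))                  # O(1/eps)
k_subgradient = math.ceil(R**2 * L**2 / eps**2) - 1           # O(1/eps^2)
k_accelerated = math.ceil(math.sqrt(2 * L * R**2 / eps)) - 1  # O(1/sqrt(eps))

print(k_gradient, k_subgradient, k_accelerated)
```

Even on this small example, the gap between the three rates is dramatic, which is why the accelerated method of this section is of such interest.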

In order to facilitate the proof of Theorem 3.1, we will utilize the following
arithmetic result:

Proposition 3.2 For all x the following holds:

    ∇f(y^i)^T (x − y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2 = ∇f(y^i)^T (z^{i+1} − y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 .
2 2 2

Proof: This is just a rearrangement of terms using the definition of z^{i+1}:

    z^{i+1} = z^i − (1/(θ_i L)) ∇f(y^i) .

Expanding the left side of the proposition we have:

    ∇f(y^i)^T (x − y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2

    = −∇f(y^i)^T y^i + θ_i L [ (z^i − z^{i+1})^T x + (1/2)‖x‖^2 − x^T z^i + (1/2)‖z^i‖^2 − (1/2)‖x‖^2 + x^T z^{i+1} − (1/2)‖z^{i+1}‖^2 ]

    = −∇f(y^i)^T y^i + θ_i L [ (1/2)‖z^i‖^2 − (1/2)‖z^{i+1}‖^2 ]

    = ∇f(y^i)^T (z^{i+1} − y^i) + θ_i L [ (z^{i+1} − z^i)^T z^{i+1} + (1/2)‖z^i‖^2 − (1/2)‖z^{i+1}‖^2 ]

    = ∇f(y^i)^T (z^{i+1} − y^i) + θ_i L [ ‖z^{i+1}‖^2 − (z^i)^T z^{i+1} + (1/2)‖z^i‖^2 − (1/2)‖z^{i+1}‖^2 ]

    = ∇f(y^i)^T (z^{i+1} − y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 .

We now introduce some facilitating notation. Let f_l(x, y) denote the first-order linear expansion of f(·) about the point y, evaluated at x, namely:

    f_l(x, y) := f(y) + ∇f(y)^T (x − y) .                        (9)

With this notation, the conclusion of Proposition 3.2 can be written equivalently as:

    f_l(x, y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2 = f_l(z^{i+1}, y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 .   (10)
The next result will be used in the proof of Theorem 3.1.

Lemma 3.1 If f(x) ≤ f(x^{i+1}) or (6) holds at equality, then

    ((1 − θ_{i+1})/θ_{i+1}^2)(f(x^{i+1}) − f(x)) + (L/2)‖x − z^{i+1}‖^2 ≤ ((1 − θ_i)/θ_i^2)(f(x^i) − f(x)) + (L/2)‖x − z^i‖^2 .

Proof: We have for all x:

    f(x^{i+1}) ≤ f(y^i) + ∇f(y^i)^T (x^{i+1} − y^i) + (L/2)‖x^{i+1} − y^i‖^2   (from Prop. 1.1)

    = f_l(x^{i+1}, y^i) + (L/2)‖x^{i+1} − y^i‖^2   (from (9))

    = f_l((1 − θ_i)x^i + θ_i z^{i+1}, y^i) + (L/2)‖(1 − θ_i)x^i + θ_i z^{i+1} − y^i‖^2   (def. of x^{i+1})

    = (1 − θ_i) f_l(x^i, y^i) + θ_i f_l(z^{i+1}, y^i) + (L/2)‖(1 − θ_i)x^i + θ_i z^{i+1} − y^i‖^2   (linearity)

    = (1 − θ_i) f_l(x^i, y^i) + θ_i f_l(z^{i+1}, y^i) + (θ_i^2 L/2)‖z^{i+1} − z^i‖^2   (def. of y^i)

    = (1 − θ_i) f_l(x^i, y^i) + θ_i [ f_l(z^{i+1}, y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 ]   (rearranging)

    = (1 − θ_i) f_l(x^i, y^i) + θ_i [ f_l(x, y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2 ]   (from (10))

    ≤ (1 − θ_i) f(x^i) + θ_i f(x) + (θ_i^2 L/2)‖x − z^i‖^2 − (θ_i^2 L/2)‖x − z^{i+1}‖^2 .   (gradient inequality)

Subtracting f(x) from each side, dividing by θ_i^2, and rearranging yields:

    (1/θ_i^2)(f(x^{i+1}) − f(x)) + (L/2)‖x − z^{i+1}‖^2 ≤ ((1 − θ_i)/θ_i^2)(f(x^i) − f(x)) + (L/2)‖x − z^i‖^2 .   (11)

If f(x) ≤ f(x^{i+1}), then (6) implies that

    ((1 − θ_{i+1})/θ_{i+1}^2)(f(x^{i+1}) − f(x)) ≤ (1/θ_i^2)(f(x^{i+1}) − f(x)) ,

which combined with (11) proves the result.

If (6) is satisfied at equality, then

    (1 − θ_{i+1})/θ_{i+1}^2 = 1/θ_i^2 ,

which combined with (11) proves the result.


Proof of Theorem 3.1: Let x be given. If f(x) ≥ min_{0≤i≤k} f(x^i), then the conclusion of the theorem follows trivially. Otherwise f(x) < min_{0≤i≤k} f(x^i), in which case from Lemma 3.1 we have:

    ((1 − θ_{i+1})/θ_{i+1}^2)(f(x^{i+1}) − f(x)) + (L/2)‖x − z^{i+1}‖^2 ≤ ((1 − θ_i)/θ_i^2)(f(x^i) − f(x)) + (L/2)‖x − z^i‖^2 ,   i = 0, . . . , k−1 .

This cascading chain of inequalities can be combined for i = 0, . . . , k−2 to yield:

    ((1 − θ_{k−1})/θ_{k−1}^2)(f(x^{k−1}) − f(x)) + (L/2)‖x − z^{k−1}‖^2 ≤ ((1 − θ_0)/θ_0^2)(f(x^0) − f(x)) + (L/2)‖x − z^0‖^2 .

Noting that θ_0 = 1 and z^0 = x^0 yields:

    ((1 − θ_{k−1})/θ_{k−1}^2)(f(x^{k−1}) − f(x)) + (L/2)‖x − z^{k−1}‖^2 ≤ (L/2)‖x − x^0‖^2 .

Invoking (11) for i = k−1 yields:

    (1/θ_{k−1}^2)(f(x^k) − f(x)) + (L/2)‖x − z^k‖^2 ≤ ((1 − θ_{k−1})/θ_{k−1}^2)(f(x^{k−1}) − f(x)) + (L/2)‖x − z^{k−1}‖^2 ,

which when combined with the previous inequality yields:

    (1/θ_{k−1}^2)(f(x^k) − f(x)) + (L/2)‖x − z^k‖^2 ≤ (L/2)‖x − x^0‖^2 .

Lastly, noting from Proposition 3.1 that θ_{k−1} ≤ 2/(k+1), and also that ‖x − z^k‖ ≥ 0, we obtain:

    f(x^k) ≤ f(x) + θ_{k−1}^2 (L/2)‖x − x^0‖^2 ≤ f(x) + (4/(k+1)^2)(L/2)‖x − x^0‖^2 = f(x) + 2L‖x − x^0‖^2/(k + 1)^2 .

Proof of Proposition 3.1: We first prove (1.). For the sequence (7) it follows trivially that θ_0 = 1. For a given i we have

    (1 − θ_{i+1})/θ_{i+1}^2 = ((3 + i)^2/4)(1 − 2/(3 + i)) = (i^2 + 4i + 3)/4 < (i^2 + 4i + 4)/4 = (i + 2)^2/4 = 1/θ_i^2 ,

which proves (1.). To prove (2.), we show that the formula for θ_{i+1} in (8) is simply an application of the quadratic formula applied to (6) at equality. Indeed, given θ_i, the equality version of (6) is

    (1/θ_{i+1})^2 − (1/θ_{i+1}) − (1/θ_i)^2 = 0 ,

which is quadratic in α := 1/θ_{i+1}. Invoking the quadratic formula to solve this equation yields precisely (8). Note also from (8) that θ_{i+1} is monotone increasing in θ_i. By definition, θ_0 = 1 satisfies θ_i ≤ 2/(2 + i) at i = 0. Suppose that θ_i ≤ 2/(2 + i) holds for some i. Then from monotonicity we have

    θ_{i+1} = 2/( 1 + √(1 + 4/θ_i^2) ) ≤ 2/( 1 + √(1 + 4/(2/(2+i))^2) ) = 2/( 1 + √(1 + (2+i)^2) ) < 2/( 1 + (2+i) ) = 2/( 2 + (i+1) ) ,

and the proof is completed by induction.

3.2 Comments and Extensions

1. How might the scheme and the analysis be modified if L is not explicitly known?

2. One might ask whether the complexity bound in Theorem 3.1 can be improved. In a rather deep result, it was shown by Nemirovskii and Yudin that this bound is order-optimal, in the sense that no algorithm that relies only on first-order information can have a better worst-case complexity bound, except perhaps by an absolute constant. In this sense the accelerated simple gradient method is an optimal algorithm.

4 The Conditional-Gradient Method for Constrained Optimization

Let us now consider the following optimization problem:

    P :  minimize_x  f(x)
         s.t.        x ∈ S ,

where f(x) is a differentiable convex function on S, and S is a closed and bounded convex set. Herein we describe the conditional-gradient method for solving P, which is also called the Frank-Wolfe method. This method is based on the premise that the set S is well-suited for linear optimization. This means that either S is itself a system of linear inequalities S = {x | Ax ≤ b} over which linear optimization is an easy task, or more generally that the problem:

    LO_c :  minimize_x  c^T x
            s.t.        x ∈ S

is easy to solve for any given objective function vector c. This being the case, suppose that we have a given iterate value x^k ∈ S. The linearization of the function f(x) at x = x^k is:

    z_1(x) := f(x^k) + ∇f(x^k)^T (x − x^k) ,

which is the first-order Taylor expansion of f(·) at x^k. Since we can easily do linear optimization on S, let us solve:

    LP :  minimize_x  z_1(x) = f(x^k) + ∇f(x^k)^T (x − x^k)
          s.t.        x ∈ S ,

which simplifies to:

    LP :  minimize_x  ∇f(x^k)^T x
          s.t.        x ∈ S .

Let x̄^k denote the optimal solution to this problem. Then since S is a convex set, the line segment joining x^k and x̄^k is also in S, and we can choose our next iterate to be some convex combination of x^k and x̄^k:

    x^{k+1} ← x^k + λ_k (x̄^k − x^k) .

This can be done in several ways. One way is to choose a step-size λ_k according to some pre-determined rule, such as λ_k = 2/(k+2). Another way is to choose λ_k by performing a line-search of f(·) over this segment. That is, we might determine λ_k as the optimal solution to the following line-search problem:

    LS :  minimize_λ  f(x^k + λ(x̄^k − x^k))
          s.t.        0 ≤ λ ≤ 1 .

Algorithm 1 presents a formal statement of the conditional gradient method just described.

Algorithm 1 Conditional Gradient Method for minimizing f(x) over x ∈ S

Initialize at x^0 ∈ S, LB_{−1} ← −∞, k ← 0.

At iteration k:

1. Compute ∇f(x^k), and then compute:

    x̄^k ← arg min_{x ∈ S} { f(x^k) + ∇f(x^k)^T (x − x^k) } ,

    LB_k ← max{ LB_{k−1}, f(x^k) + ∇f(x^k)^T (x̄^k − x^k) } .

2. Set x^{k+1} ← x^k + λ_k (x̄^k − x^k), where λ_k ∈ [0, 1].

Regarding the conditional gradient method, note that the lower bound values LB_k result from the convexity of f(·) and the gradient inequality for convex functions:

    f(x) ≥ f(x^k) + ∇f(x^k)^T (x − x^k)   for any x ∈ S .

Therefore

    min_{x ∈ S} f(x) ≥ min_{x ∈ S} { f(x^k) + ∇f(x^k)^T (x − x^k) } = f(x^k) + ∇f(x^k)^T (x̄^k − x^k) ,

and so the optimal objective function value of P is always bounded below by the value of LB_k.

We have the following computational guarantee for the conditional gradient method:

Theorem 4.1 Suppose that S is a bounded set and that ∇f(·) is Lipschitz on S, namely:

    ‖∇f(y) − ∇f(x)‖ ≤ L_f ‖x − y‖   for any x, y ∈ S .

If the step-sizes λ_i are determined either by a line-search or by the step-size rule λ_i = 2/(i+2), then after k iterations of the conditional gradient method the following holds:

    min_{i=0,...,k} f(x^i) − LB_k ≤ L_f Diam_S^2/(k + 2) ,

where Diam_S := max_{x,y ∈ S} ‖x − y‖_2.

Proof: Under construction.
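As an illustration, here is a minimal Python sketch (our own, not part of the notes) of the conditional gradient method when S is the standard simplex, where the linear subproblem LP is solved exactly by picking the unit vector e_j corresponding to the smallest component of the gradient. The test problem and the pre-determined step-size rule λ_k = 2/(k+2) follow the discussion above; all names are our own.

```python
import numpy as np

def frank_wolfe_simplex(f, grad_f, x0, num_iters):
    """Conditional gradient method over the standard simplex.
    The LP subproblem min_{x in S} g^T x is attained at the vertex e_j, j = argmin_j g_j."""
    x = np.asarray(x0, dtype=float)
    lower_bound = -np.inf
    for k in range(num_iters):
        g = grad_f(x)
        j = int(np.argmin(g))           # solve LP exactly at a simplex vertex
        x_bar = np.zeros_like(x)
        x_bar[j] = 1.0
        lower_bound = max(lower_bound, f(x) + g @ (x_bar - x))  # LB_k update
        lam = 2.0 / (k + 2)             # pre-determined step-size rule
        x = x + lam * (x_bar - x)
    return x, lower_bound

# Illustrative problem: minimize f(x) = ||x - c||^2 over the simplex,
# where c itself lies in the simplex, so v* = 0.
c = np.array([0.2, 0.3, 0.5])
f = lambda x: np.sum((x - c)**2)
grad_f = lambda x: 2.0 * (x - c)

x_k, lb = frank_wolfe_simplex(f, grad_f, x0=np.array([1.0, 0.0, 0.0]), num_iters=300)
# f(x_k) approaches v* = 0, the iterates stay feasible, and lb is a valid lower bound on v*
```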
