Robert M. Freund
May 2013
© 2013 Massachusetts Institute of Technology. All rights reserved.
The material presented herein is based on a variety of sources, but most specifically on:

Our problem of interest is:
$$P: \quad \text{minimize}_x \ f(x) \qquad \text{s.t.} \quad x \in \mathbb{R}^n . \qquad (1)$$
We assume that $f(\cdot)$ has a Lipschitz gradient. That is, there is a scalar $L$ for which
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \ \text{ for all } x, y \in \mathbb{R}^n .$$
Also, $B(c, R)$ denotes the ball centered at $c$ with radius $R$, namely:
$$B(c, R) := \{x \in \mathbb{R}^n : \|x - c\| \le R\} .$$
The simple gradient scheme uses the update:
$$x^{i+1} \leftarrow x^i - \tfrac{1}{L} \nabla f(x^i) .$$
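To make the scheme concrete, here is a minimal Python sketch on a small quadratic test problem of our own choosing (for a quadratic $f(x) = \frac{1}{2}x^T Q x - b^T x$, the gradient is Lipschitz with $L$ equal to the largest eigenvalue of $Q$):

```python
import numpy as np

# Hypothetical test problem: f(x) = 1/2 x^T Q x - b^T x, grad f(x) = Q x - b.
# Its gradient is Lipschitz with constant L = largest eigenvalue of Q.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(Q).max()

def grad_f(x):
    return Q @ x - b

def simple_gradient_scheme(x0, k):
    """Run k iterations of x^{i+1} <- x^i - (1/L) grad f(x^i)."""
    x = x0.copy()
    for _ in range(k):
        x = x - (1.0 / L) * grad_f(x)
    return x

x_star = np.linalg.solve(Q, b)        # exact minimizer, for comparison
x_k = simple_gradient_scheme(np.zeros(2), 200)
print(np.linalg.norm(x_k - x_star))   # small: the iterates approach x*
```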
Proposition 1.1 If $f(\cdot)$ has Lipschitz gradient with constant $L$, then
$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2}\|y - x\|^2 \ \text{ for all } x, y \in \mathbb{R}^n .$$
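Proposition 1.1 is easy to sanity-check numerically. The sketch below (our own illustration, not from the text) samples random pairs for a quadratic whose gradient-Lipschitz constant is known exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # f(x) = 1/2 x^T Q x, grad f(x) = Q x
L = np.linalg.eigvalsh(Q).max()          # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

# Check f(y) <= f(x) + grad f(x)^T (y - x) + (L/2)||y - x||^2 on random pairs.
for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert f(y) <= upper + 1e-12
print("Proposition 1.1 holds on all sampled pairs")
```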
Proof: This is just a rearrangement of terms using the fact that $x^{i+1} = x^i - \frac{1}{L}\nabla f(x^i)$, i.e., $\nabla f(x^i) = L(x^i - x^{i+1})$. Expanding the left side of the proposition we have:
$$\begin{aligned}
&\nabla f(x^i)^T (x - x^i) + \frac{L}{2}\|x - x^i\|^2 - \frac{L}{2}\|x - x^{i+1}\|^2 \\
&= L\left[(x^i - x^{i+1})^T (x - x^i) + \tfrac{1}{2}\|x\|^2 - x^T x^i + \tfrac{1}{2}\|x^i\|^2 - \tfrac{1}{2}\|x\|^2 + x^T x^{i+1} - \tfrac{1}{2}\|x^{i+1}\|^2\right] \\
&= L\left[(x^{i+1})^T x^i - \|x^i\|^2 + \tfrac{1}{2}\|x^i\|^2 - \tfrac{1}{2}\|x^{i+1}\|^2\right] \\
&= L\left[-\tfrac{1}{2}\|x^i\|^2 + (x^{i+1})^T x^i - \tfrac{1}{2}\|x^{i+1}\|^2\right] \\
&= -\frac{L}{2}\|x^{i+1} - x^i\|^2 .
\end{aligned}$$
Our main algorithmic analysis result for the simple gradient scheme is:
Theorem 1.1 After $k$ iterations of the simple gradient scheme, the following holds:
$$f(x^k) \le f(x) + \frac{L\|x - x^0\|^2}{2k} \ \text{ for all } x \in \mathbb{R}^n .$$
Corollary 1.1 Suppose our region of interest is $B(x^0, R)$, which may or may not contain an optimal solution $x^*$. Then:
$$f(x^k) \le \min_{x \in B(x^0, R)} f(x) + \frac{R^2 L}{2k} .$$
Let $R$ be such that $x^* \in B(x^0, R)$ for some optimal solution $x^*$, let $\varepsilon > 0$, and let
$$k := \left\lceil \frac{R^2 L}{2\varepsilon} \right\rceil .$$
Then $f(x^k) \le v^* + \varepsilon$.
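The iteration count in the corollary is straightforward to compute; a small sketch, with illustrative values of $R$, $L$, and $\varepsilon$ of our own choosing:

```python
import math

# Illustrative values (not from the text): region radius R, Lipschitz
# constant L of the gradient, and target accuracy epsilon.
R, L, epsilon = 10.0, 4.0, 0.01

# Corollary 1.1: k := ceil(R^2 L / (2 epsilon)) iterations suffice
# to guarantee f(x^k) <= v* + epsilon.
k = math.ceil(R ** 2 * L / (2 * epsilon))
print(k)  # 20000 iterations suffice
```

Note the $O(1/\varepsilon)$ dependence on the accuracy $\varepsilon$.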
Proof of Theorem 1.1: We have:
$$f(x^{i+1}) \le f(x^i) + \nabla f(x^i)^T (x^{i+1} - x^i) + \frac{L}{2}\|x^{i+1} - x^i\|^2 = f(x^i) - \frac{1}{2L}\|\nabla f(x^i)\|^2 \le f(x^i) \quad \text{(from Proposition 1.1)},$$
whereby we see that the function values $f(x^i)$ are decreasing in $i$. Restating the first inequality above again for a different purpose, we have for all $x$ (using the gradient inequality for the convex function $f(\cdot)$ together with the identity established in the preceding proof):
$$f(x^{i+1}) \le f(x) + \frac{L}{2}\|x - x^i\|^2 - \frac{L}{2}\|x - x^{i+1}\|^2 .$$
Summing up over $i = 0, \ldots, k-1$ and recalling from the start of the proof that the sequence $f(x^i)$ is nonincreasing, we have:
$$k f(x^k) \le \sum_{i=0}^{k-1} f(x^{i+1}) \le k f(x) + \frac{L}{2}\|x - x^0\|^2 - \frac{L}{2}\|x - x^k\|^2 ,$$
and the result follows upon dividing by $k$ and discarding the last (nonpositive) term.
2. How might the scheme and the analysis be modified if $L$ is not explicitly known?
3. How might the scheme and the analysis be modified if one can efficiently do a linesearch to compute:
$$\bar\alpha := \arg\min_{\alpha \ge 0} f\left(x^i - \alpha \nabla f(x^i)\right) \ ?$$

Subgradient Scheme

Our problem of interest is again:
$$P: \quad \text{minimize}_x \ f(x) \qquad \text{s.t.} \quad x \in \mathbb{R}^n ,$$
where now $f(\cdot)$ is convex but not necessarily differentiable.
Also, recall that $B(c, R)$ denotes the ball centered at $c$ with radius $R$, namely:
$$B(c, R) := \{x \in \mathbb{R}^n : \|x - c\| \le R\} .$$
The simple subgradient scheme: Initialize with $x^0$, $i \leftarrow 0$. At iteration $i$, compute a subgradient $g^i \in \partial f(x^i)$ and update:
$$x^{i+1} \leftarrow x^i - t_i g^i ,$$
where $t_i > 0$ is the step-size at iteration $i$.
We begin with a simple relationship between the Lipschitz constant and the norm of any subgradient:
If $g \in \partial f(x)$, then from the subgradient inequality and the Lipschitz continuity of $f(\cdot)$ we have:
$$\|g\|^2 = g^T g \le f(x + g) - f(x) \le |f(x + g) - f(x)| \le L\|g\| ,$$
whereby $\|g\| \le L$.
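To illustrate the subgradient scheme, here is a minimal Python sketch on a piecewise-linear test function of our own choosing, $f(x) = \max_j (a_j^T x + b_j)$, for which any maximizing row $a_j$ is a subgradient at $x$. Note that we must track the best iterate, since the values $f(x^i)$ need not decrease:

```python
import numpy as np

# Hypothetical test problem: f(x) = max_j (a_j^T x + b_j), a nonsmooth convex
# function; a subgradient at x is a_j for any index j attaining the maximum.
A = np.array([[1.0, 2.0], [-1.0, 1.0], [0.0, -1.0]])
b = np.array([0.0, 1.0, -0.5])

def f(x):
    return np.max(A @ x + b)

def subgrad(x):
    return A[np.argmax(A @ x + b)]

def subgradient_scheme(x0, steps):
    """x^{i+1} <- x^i - t_i g^i with g^i a subgradient of f at x^i."""
    x, best_x, best_val = x0.copy(), x0.copy(), f(x0)
    for t in steps:
        x = x - t * subgrad(x)
        if f(x) < best_val:                   # track the best iterate:
            best_x, best_val = x.copy(), f(x) # f(x^i) need not decrease
    return best_x, best_val

steps = [1.0 / np.sqrt(i + 1) for i in range(500)]
best_x, best_val = subgradient_scheme(np.zeros(2), steps)
print(best_val)   # approaches the optimal value of this test problem
```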
Theorem 2.1 After $k$ iterations of the simple subgradient scheme, the following holds:
$$\min_{0 \le j \le k} f(x^j) \le f(x) + \frac{L^2 \sum_{i=0}^{k} t_i^2 + \|x - x^0\|^2}{2 \sum_{i=0}^{k} t_i} \ \text{ for all } x \in \mathbb{R}^n .$$
Corollary 2.1 Suppose we wish to run our algorithm for $k$ iterations, and our region of interest is $B(x^0, R)$, which may or may not contain an optimal solution $x^*$. If we set the step-sizes to be:
$$t_i := t := \frac{R}{L\sqrt{k+1}} ,$$
we obtain:
$$\min_{0 \le j \le k} f(x^j) \le \min_{x \in B(x^0, R)} f(x) + \frac{RL}{\sqrt{k+1}} .$$
If we are given $\varepsilon > 0$ and we wish to compute a point whose objective function value is within $\varepsilon$ of the optimal value $v^*$, then we can do so within
$$k := \left\lceil \frac{R^2 L^2}{\varepsilon^2} \right\rceil - 1$$
iterations.
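The contrast with the smooth case is worth noting: the iteration count now grows like $1/\varepsilon^2$ rather than $1/\varepsilon$. A small arithmetic sketch, with illustrative values of our own choosing:

```python
import math

# Illustrative values (our own): radius R, Lipschitz constant L, accuracy eps.
R, L, eps = 10.0, 4.0, 0.1

# Corollary 2.1: with t := R / (L * sqrt(k+1)), running
# k := ceil(R^2 L^2 / eps^2) - 1 iterations guarantees accuracy eps.
k = math.ceil(R ** 2 * L ** 2 / eps ** 2) - 1
t = R / (L * math.sqrt(k + 1))
print(k, t)
```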
Proof of Theorem 2.1: If we have just computed $x^{i+1}$, we have for any $x$:
$$\|x^{i+1} - x\|^2 = \|x^i - t_i g^i - x\|^2 = \|x^i - x\|^2 - 2 t_i (g^i)^T (x^i - x) + t_i^2 \|g^i\|^2 \le \|x^i - x\|^2 - 2 t_i \left(f(x^i) - f(x)\right) + t_i^2 L^2 ,$$
where the inequality uses the subgradient inequality $f(x) \ge f(x^i) + (g^i)^T (x - x^i)$ and the bound $\|g^i\| \le L$. Denote:
$$\bar f_k := \min_{0 \le i \le k} f(x^i) .$$
Summing the inequality above over $i = 0, \ldots, k$ and rearranging yields:
$$2 \sum_{i=0}^{k} t_i \, \bar f_k \le 2 \sum_{i=0}^{k} t_i f(x^i) \le 2 \sum_{i=0}^{k} t_i f(x) + \sum_{i=0}^{k} t_i^2 L^2 + \|x^0 - x\|^2 - \|x^{k+1} - x\|^2 .$$
Since the last subtractand is nonnegative, it can be eliminated, and dividing by $2 \sum_{i=0}^{k} t_i$ yields:
$$\bar f_k \le f(x) + \frac{L^2 \sum_{i=0}^{k} t_i^2 + \|x - x^0\|^2}{2 \sum_{i=0}^{k} t_i} .$$
1. The first result in Corollary 2.1 presumes that we know in advance how many iterations we wish to run the method. This enables a constant step-size $t$ to be determined that is optimal for the inequality of the corollary. What if one wishes to run the method simply for a while? How might one proceed with an analysis of, say, choosing $t_i := \frac{C}{\sqrt{i+1}}$ for some suitably chosen constant $C$?
2. How might the scheme and the analysis be modified if $L$ is not explicitly known?
Accelerated Simple Gradient Scheme
Step 0: Initialize. Initialize with $x^0$; set $z^0 \leftarrow x^0$, $i \leftarrow 0$.
Step 1: Update. Compute:
$$y^i \leftarrow (1 - \alpha_i) x^i + \alpha_i z^i , \qquad z^{i+1} \leftarrow z^i - \frac{1}{\alpha_i L} \nabla f(y^i) , \qquad x^{i+1} \leftarrow (1 - \alpha_i) x^i + \alpha_i z^{i+1} .$$
Note that the algorithm keeps track of three points at each iteration: $x^i$, $y^i$, and $z^i$. However, information about $y^i$ is not used in subsequent iterations, and indeed the explicit computation or storage of $y^i$ can be eliminated from the algorithm description. The algorithm therefore really only works with the pair $x^i$, $z^i$ at each iteration. Note that $y^i$ is used only to compute the gradient which is used to determine the value of $z^{i+1}$ in Step 1. The new point $z^{i+1}$ is also used to determine the value of $x^{i+1}$ in Step 1.
The algorithm requires a sequence of step-size parameters $\{\alpha_i\}$ that satisfies $\alpha_0 = 1$, $\alpha_i \in (0, 1]$ for $i = 0, 1, \ldots$, and that also satisfies the following inequality that prevents the sequence from decreasing too quickly:
$$\frac{1 - \alpha_{i+1}}{\alpha_{i+1}^2} \le \frac{1}{\alpha_i^2} \quad \text{for } i = 0, 1, \ldots . \qquad (6)$$
The next proposition presents and proves properties of two different step-size sequences. Its proof is deferred to later.
Proposition 3.1
1. The sequence
$$\alpha_i := \frac{2}{i + 2} , \quad i = 0, 1, \ldots \qquad (7)$$
satisfies $\alpha_0 = 1$ and the step-size inequality (6).
2. The sequence given by $\alpha_0 := 1$ and
$$\alpha_{i+1} := \frac{2}{1 + \sqrt{1 + 4/\alpha_i^2}} , \quad i = 0, 1, \ldots \qquad (8)$$
satisfies $\alpha_i \le \frac{2}{i + 2}$, $i = 0, 1, \ldots$ and satisfies the step-size inequality (6) at equality.
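Both step-size sequences are easy to generate and check numerically; a small sketch:

```python
import math

# Sequence (7): alpha_i = 2 / (i + 2); sequence (8): alpha_0 = 1 and
# alpha_{i+1} = 2 / (1 + sqrt(1 + 4 / alpha_i^2)), per Proposition 3.1.
def seq7(n):
    return [2.0 / (i + 2) for i in range(n)]

def seq8(n):
    a = [1.0]
    for _ in range(n - 1):
        a.append(2.0 / (1.0 + math.sqrt(1.0 + 4.0 / a[-1] ** 2)))
    return a

for alphas in (seq7(50), seq8(50)):
    for i in range(len(alphas) - 1):
        # step-size inequality (6): (1 - a_{i+1}) / a_{i+1}^2 <= 1 / a_i^2
        lhs = (1.0 - alphas[i + 1]) / alphas[i + 1] ** 2
        assert lhs <= 1.0 / alphas[i] ** 2 + 1e-9

# (8) also stays below 2/(i+2), as the proposition asserts:
a8 = seq8(50)
assert all(a8[i] <= 2.0 / (i + 2) + 1e-12 for i in range(50))
print("both sequences satisfy the step-size inequality (6)")
```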
Theorem 3.1 Suppose that the sequence $\{\alpha_i\}_{i=0}^{\infty}$ is given by (7) or (8). Then after $k$ iterations of the accelerated simple gradient scheme, the following holds:
$$\min_{0 \le i \le k} f(x^i) \le f(x) + \frac{2L\|x - x^0\|^2}{(k+1)^2} \ \text{ for all } x \in \mathbb{R}^n .$$
Corollary 3.1 Suppose our region of interest is $S_R := \{x : \|x - x^0\| \le R\}$ for some $R$, which may or may not contain an optimal solution $x^*$. Then:
$$\min_{0 \le i \le k} f(x^i) \le \min_{x \in S_R} f(x) + \frac{2LR^2}{(k+1)^2} .$$
If we wish to compute a point whose objective function value is within $\varepsilon$ of the optimal value $v^*$, then we can do so within
$$k := \left\lceil \sqrt{\frac{2L\|x^* - x^0\|^2}{\varepsilon}} \right\rceil - 1$$
iterations.
In order to facilitate the proof of Theorem 3.1, we will utilize the following arithmetic result:
Proposition 3.2 For the iterates of the accelerated simple gradient scheme and any $x$:
$$\nabla f(y^i)^T (x - y^i) + \frac{\alpha_i L}{2}\|x - z^i\|^2 - \frac{\alpha_i L}{2}\|x - z^{i+1}\|^2 = \nabla f(y^i)^T (z^{i+1} - y^i) + \frac{\alpha_i L}{2}\|z^{i+1} - z^i\|^2 .$$
We now introduce some facilitating notation. Let $f_l(x, y)$ denote the first-order linear expansion of $f(\cdot)$ about the point $y$ evaluated at $x$, namely:
$$f_l(x, y) := f(y) + \nabla f(y)^T (x - y) .$$
With this notation, the conclusion of Proposition 3.2 can be written equivalently as:
$$f_l(x, y^i) + \frac{\alpha_i L}{2}\|x - z^i\|^2 - \frac{\alpha_i L}{2}\|x - z^{i+1}\|^2 = f_l(z^{i+1}, y^i) + \frac{\alpha_i L}{2}\|z^{i+1} - z^i\|^2 . \qquad (10)$$
The next result will be used in the proof of Theorem 3.1. Using Proposition 1.1 and the fact that $x^{i+1} - y^i = \alpha_i(z^{i+1} - z^i)$, we have:
$$\begin{aligned}
f(x^{i+1}) &\le f_l(x^{i+1}, y^i) + \frac{L}{2}\|x^{i+1} - y^i\|^2 \\
&= (1 - \alpha_i) f_l(x^i, y^i) + \alpha_i f_l(z^{i+1}, y^i) + \frac{\alpha_i^2 L}{2}\|z^{i+1} - z^i\|^2 & \text{(def. $y^i$)} \\
&= (1 - \alpha_i) f_l(x^i, y^i) + \alpha_i \left[f_l(z^{i+1}, y^i) + \frac{\alpha_i L}{2}\|z^{i+1} - z^i\|^2\right] & \text{(rearranging)} \\
&= (1 - \alpha_i) f_l(x^i, y^i) + \alpha_i \left[f_l(x, y^i) + \frac{\alpha_i L}{2}\|x - z^i\|^2 - \frac{\alpha_i L}{2}\|x - z^{i+1}\|^2\right] & \text{(from (10))} \\
&\le (1 - \alpha_i) f(x^i) + \alpha_i \left[f(x) + \frac{\alpha_i L}{2}\|x - z^i\|^2 - \frac{\alpha_i L}{2}\|x - z^{i+1}\|^2\right] . & \text{(Grad.-Ineq.)}
\end{aligned}$$
Subtract $f(x)$ from each side, divide by $\alpha_i^2$, and rearrange to yield:
$$\frac{1}{\alpha_i^2}\left(f(x^{i+1}) - f(x)\right) + \frac{L}{2}\|x - z^{i+1}\|^2 \le \frac{1 - \alpha_i}{\alpha_i^2}\left(f(x^i) - f(x)\right) + \frac{L}{2}\|x - z^i\|^2 . \qquad (11)$$
If $f(x) \le f(x^{i+1})$, then (6) implies that
$$\frac{1 - \alpha_{i+1}}{\alpha_{i+1}^2}\left(f(x^{i+1}) - f(x)\right) \le \frac{1}{\alpha_i^2}\left(f(x^{i+1}) - f(x)\right) ,$$
since $\frac{1 - \alpha_{i+1}}{\alpha_{i+1}^2} \le \frac{1}{\alpha_i^2}$. Combining this with (11) yields:
$$\frac{1 - \alpha_{i+1}}{\alpha_{i+1}^2}\left(f(x^{i+1}) - f(x)\right) + \frac{L}{2}\|x - z^{i+1}\|^2 \le \frac{1 - \alpha_i}{\alpha_i^2}\left(f(x^i) - f(x)\right) + \frac{L}{2}\|x - z^i\|^2 , \quad i = 0, \ldots, k-1 .$$
Applying (11) at $i = k-1$ gives:
$$\frac{1}{\alpha_{k-1}^2}\left(f(x^k) - f(x)\right) + \frac{L}{2}\|x - z^k\|^2 \le \frac{1 - \alpha_{k-1}}{\alpha_{k-1}^2}\left(f(x^{k-1}) - f(x)\right) + \frac{L}{2}\|x - z^{k-1}\|^2 ,$$
which when combined with the previous inequality (telescoping back to $i = 0$, where $\alpha_0 = 1$ and $z^0 = x^0$) yields:
$$\frac{1}{\alpha_{k-1}^2}\left(f(x^k) - f(x)\right) + \frac{L}{2}\|x - z^k\|^2 \le \frac{L}{2}\|x - x^0\|^2 .$$
Therefore, using $\alpha_{k-1} \le \frac{2}{k+1}$,
$$f(x^k) \le f(x) + \alpha_{k-1}^2 \, \frac{L}{2}\|x - x^0\|^2 \le f(x) + \frac{4}{(k+1)^2} \cdot \frac{L}{2}\|x - x^0\|^2 = f(x) + \frac{2L\|x - x^0\|^2}{(k+1)^2} .$$
Proof of Proposition 3.1: We first prove (1.). For the sequence (7) it follows trivially that $\alpha_0 = 1$. For a given $i$ we have
$$\frac{1 - \alpha_{i+1}}{\alpha_{i+1}^2} = \frac{(3+i)^2}{4}\left(1 - \frac{2}{3+i}\right) = \frac{i^2 + 4i + 3}{4} < \frac{i^2 + 4i + 4}{4} = \frac{(i+2)^2}{4} = \frac{1}{\alpha_i^2} ,$$
which proves (1.). To prove (2.), we show that the formula for $\alpha_{i+1}$ in (8) is simply an application of the quadratic formula applied to (6) at equality. Indeed, given $\alpha_i$, then the equality version of (6) is
$$\left(\frac{1}{\alpha_{i+1}}\right)^2 - \left(\frac{1}{\alpha_{i+1}}\right) - \left(\frac{1}{\alpha_i}\right)^2 = 0 ,$$
which is quadratic in $\beta := \frac{1}{\alpha_{i+1}}$. Invoking the quadratic formula to solve this equation yields precisely (8). Note also from (8) that $\alpha_{i+1}$ is monotone increasing in $\alpha_i$. By definition $\alpha_0 = 1$, which satisfies $\alpha_i \le \frac{2}{2+i}$ at $i = 0$.
Suppose that $\alpha_i \le \frac{2}{2+i}$ holds for some $i$. Then from monotonicity we have
$$\alpha_{i+1} = \frac{2}{1 + \sqrt{1 + \frac{4}{\alpha_i^2}}} \le \frac{2}{1 + \sqrt{1 + (2+i)^2}} < \frac{2}{1 + (2+i)} = \frac{2}{2 + (i+1)} ,$$
completing the induction.
1. How might the scheme and the analysis be modified if $L$ is not explicitly known?
2. One might ask whether the complexity bound in Theorem 3.1 can be improved. In a rather deep result, it was shown by Nemirovskii and Yudin that this bound is order-optimal, in the sense that no algorithm that relies only on first-order information can have a better worst-case complexity bound, except perhaps by an absolute constant. In this sense the accelerated simple gradient method is an optimal algorithm.
Conditional Gradient Method

Our problem of interest is now:
$$P: \quad \text{minimize}_x \ f(x) \qquad \text{s.t.} \quad x \in S ,$$
where $S$ is a convex set for which the linear optimization problem
$$LO_c: \quad \text{minimize}_x \ c^T x \qquad \text{s.t.} \quad x \in S$$
is easy to solve for any given objective function vector $c$. This being the case, suppose that we have a given iterate value $x^k \in S$. The linearization of the function $f(x)$ at $x = x^k$ is:
$$f(x^k) + \nabla f(x^k)^T (x - x^k) ,$$
which is the first-order Taylor expansion of $f(\cdot)$ at $x^k$. Since we can easily do linear optimization on $S$, let us solve:
$$\text{minimize}_x \ f(x^k) + \nabla f(x^k)^T (x - x^k) \qquad \text{s.t.} \quad x \in S ,$$
which is equivalent to solving:
$$\text{minimize}_x \ \nabla f(x^k)^T x \qquad \text{s.t.} \quad x \in S .$$
Let $\bar x^k$ denote the optimal solution to this problem. Then since $S$ is a convex set, the line segment joining $x^k$ and $\bar x^k$ is also in $S$, and we can choose our next iterate to be some convex combination of $x^k$ and $\bar x^k$:
$$x^{k+1} \leftarrow x^k + \bar\alpha_k (\bar x^k - x^k) ,$$
where the step-size $\bar\alpha_k \in [0, 1]$ can, for example, be chosen by the linesearch:
$$\text{minimize}_\alpha \ f\left(x^k + \alpha(\bar x^k - x^k)\right) \qquad \text{s.t.} \quad 0 \le \alpha \le 1 .$$
Algorithm 1 Conditional Gradient Method for minimizing $f(x)$ over $x \in S$
Initialize at $x^0 \in S$, $LB^{-1} \leftarrow -\infty$, $k \leftarrow 0$.
At iteration $k$:
$$\min_{i=0,\ldots,k} f(x^i) - LB^k \le \frac{L_f \, \mathrm{Diam}_S^2}{k+2} ,$$
where $\mathrm{Diam}_S := \max_{x, y \in S} \|x - y\|_2$.
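As a minimal illustration of the method, here is a Python sketch on a test problem of our own choosing: minimizing a simple quadratic over the unit simplex, where $LO_c$ amounts to selecting the vertex $e_j$ with $j = \arg\min_j c_j$. The step-size rule $\bar\alpha_k = 2/(k+2)$ used here is one common choice:

```python
import numpy as np

# Sketch of the conditional gradient method for minimizing
# f(x) = 1/2 ||x - c||^2 over the unit simplex S (our own test problem).
# Linear optimization over the simplex is trivial: the minimizer of g^T x
# over S is the vertex e_j with j = argmin_j g_j.
c = np.array([0.2, 0.5, 0.3])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c

def conditional_gradient(x0, k):
    x = x0.copy()
    for i in range(k):
        g = grad_f(x)
        xbar = np.zeros_like(x)          # solve LO_c with c = grad f(x^k):
        xbar[np.argmin(g)] = 1.0         # a vertex of the simplex
        alpha = 2.0 / (i + 2)            # one common step-size rule
        x = x + alpha * (xbar - x)       # convex combination: stays in S
    return x

x0 = np.array([1.0, 0.0, 0.0])           # a vertex of S
x_k = conditional_gradient(x0, 500)
print(f(x_k))                            # c lies in S, so the optimal value is 0
```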