
Lecture Notes on First-Order Methods in

Simple Settings: Smooth and Non-smooth Methods

Robert M. Freund

May 2013

© 2013 Massachusetts Institute of Technology. All rights reserved.

The material presented herein is based on a variety of sources, but most specifically on:

- "Mirror descent and nonlinear projected subgradient methods for convex optimization" by Amir Beck and Marc Teboulle, Operations Research Letters 31 (2003), pp. 167-175,

- "Introductory Lectures on Convex Optimization" by Yurii Nesterov, Kluwer Academic Publishers, 2004,

- "Approximation accuracy, gradient methods, and error bound for structured convex optimization" by Paul Tseng, Mathematical Programming B (2010), 125, pp. 263-295, and

- "On accelerated proximal gradient methods for convex-concave optimization" by Paul Tseng, unpublished notes dated May 21, 2008.

1 Smooth Method in Simple Setting

1.1 Problem Setting and Basics for Smooth Method

Our problem of interest is

    P :  minimize_x  f(x)
         s.t.        x ∈ R^n ,                                   (1)

where f(·) is a differentiable convex function defined on R^n. We denote the optimal objective value by v* and let x* denote an optimal solution of P, when such a solution exists.

Let ∇f(x) denote the gradient of f(·) at x, and recall the gradient inequality:

    f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all y ∈ R^n .            (2)

We will measure distances between points using the Euclidean norm ‖·‖ := ‖·‖_2, where ‖x‖_2 := √(x^T x).

We assume that f(·) has a Lipschitz gradient. That is, there is a scalar L for which

    ‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖   for all x, y ∈ R^n .            (3)

Let B(c, R) denote the ball centered at c with radius R, namely:

    B(c, R) := {x ∈ R^n : ‖x − c‖ ≤ R} .

Table 1 presents a simple gradient scheme for solving (P).

Simple Gradient Scheme

Step 0: Initialize. Initialize with x^0 and i ← 0.

Step 1: Compute New Point. Compute ∇f(x^i), and then compute:

    x^{i+1} ← x^i − (1/L) ∇f(x^i) .

Step 2: Update counter and repeat. i ← i + 1 and Goto Step 1.

Table 1: A Simple Gradient Scheme.
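To make the scheme concrete, here is a minimal sketch in Python (not part of the original notes). The quadratic test function, its gradient, and the value of L are our own illustrative choices; for f(x) = ½ xᵀQx with Q diagonal and positive, L can be taken to be the largest diagonal entry of Q.

```python
import numpy as np

def simple_gradient_scheme(grad_f, L, x0, num_iters):
    """Run the simple gradient scheme of Table 1: x_{i+1} <- x_i - (1/L) grad f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - grad_f(x) / L
    return x

# Illustrative test problem: f(x) = 0.5 * x^T Q x with Q diagonal, so v* = 0 at x* = 0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
L = 10.0  # largest eigenvalue of Q

x_k = simple_gradient_scheme(grad_f, L, x0=[5.0, 5.0], num_iters=200)
# Theorem 1.1 guarantees f(x_k) <= f(x*) + L * ||x* - x0||^2 / (2k)
```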

1.2 Analysis and Complexity of the Simple Gradient Scheme

We begin with a basic property of Lipschitz gradients:

Proposition 1.1 If f(·) has Lipschitz gradient with constant L, then

    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖^2   for all x, y ∈ R^n .

Proof: We use the fundamental theorem of calculus applied in a multivariate setting. We have

    f(y) = f(x) + ∫_0^1 ∇f(x + t(y − x))^T (y − x) dt

         = f(x) + ∇f(x)^T (y − x) + ∫_0^1 [∇f(x + t(y − x)) − ∇f(x)]^T (y − x) dt

         ≤ f(x) + ∇f(x)^T (y − x) + ∫_0^1 ‖∇f(x + t(y − x)) − ∇f(x)‖ ‖y − x‖ dt

         ≤ f(x) + ∇f(x)^T (y − x) + ∫_0^1 tL‖y − x‖^2 dt

         = f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖^2 .

We also need the following arithmetic result:

Proposition 1.2 For any x the following holds:


    ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2 = ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2 .

Proof: This is just a rearrangement of terms using the fact that x^{i+1} = x^i − (1/L)∇f(x^i). Expanding the left side of the proposition we have:

    ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2

    = L [ (x^i − x^{i+1})^T (x − x^i) + (1/2)‖x‖^2 − x^T x^i + (1/2)‖x^i‖^2 − (1/2)‖x‖^2 + x^T x^{i+1} − (1/2)‖x^{i+1}‖^2 ]

    = L [ (x^{i+1})^T x^i − ‖x^i‖^2 + (1/2)‖x^i‖^2 − (1/2)‖x^{i+1}‖^2 ]

    = L [ −(1/2)‖x^i‖^2 + (x^{i+1})^T x^i − (1/2)‖x^{i+1}‖^2 ]

    = −(L/2)‖x^{i+1} − x^i‖^2

    = −L(x^{i+1} − x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2

    = ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2 .

Our main algorithmic analysis result for the simple gradient scheme is:

Theorem 1.1 After k iterations of the simple gradient scheme, the following holds:

    f(x^k) ≤ f(x) + L‖x − x^0‖^2 / (2k)   for all x ∈ R^n .

Corollary 1.1 Suppose our region of interest is B(x^0, R), which may or may not contain an optimal solution x*. Then:

    f(x^k) ≤ min_{x ∈ B(x^0, R)} f(x) + R^2 L / (2k) .

Let R be such that x* ∈ B(x^0, R) for some optimal solution x*, let ε > 0, and let

    k := ⌈ R^2 L / (2ε) ⌉ .

Then f(x^k) ≤ v* + ε.

Proof of Theorem 1.1: We have:

    f(x^{i+1}) ≤ f(x^i) + ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2   (from Proposition 1.1)

              = f(x^i) − L‖x^{i+1} − x^i‖^2 + (L/2)‖x^{i+1} − x^i‖^2

              ≤ f(x^i) ,

whereby we see that the function values f(x^i) are decreasing in i. Restating the first inequality above again for a different purpose, we have for all x:

    f(x^{i+1}) ≤ f(x^i) + ∇f(x^i)^T (x^{i+1} − x^i) + (L/2)‖x^{i+1} − x^i‖^2   (from Proposition 1.1)

              = f(x^i) + ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2   (from Proposition 1.2)

              ≤ f(x) + (L/2)‖x − x^i‖^2 − (L/2)‖x − x^{i+1}‖^2 .   (from (2))

Summing up and recalling from the start of the proof that the sequence f(x^i) is nonincreasing, we have:

    k f(x^k) ≤ Σ_{i=0}^{k−1} f(x^{i+1}) ≤ k f(x) + (L/2)‖x − x^0‖^2 − (L/2)‖x − x^k‖^2 .

Dividing by k and noting that the final subtractand above is nonnegative, we obtain:

    f(x^k) ≤ f(x) + L‖x − x^0‖^2 / (2k) .

1.3 Comments and Extensions

1. It is important for future extensions and analysis to observe that x^{i+1} solves the following elementary optimization problem:

    x^{i+1} = arg min_x { f(x^i) + ∇f(x^i)^T (x − x^i) + (L/2)‖x − x^i‖^2 } .

2. How might the scheme and the analysis be modified if L is not explicitly known?

3. How might the scheme and the analysis be modified if one can efficiently do a linesearch to compute:

    t_i := arg min_t { f(x^i − t∇f(x^i)) } ,

instead of (implicitly) assigning the value t_i ← 1/L at each iteration?

2 Non-smooth Method in Simple Setting

2.1 Problem Setting and Basics for Non-smooth Method

Our problem of interest still is

    P :  minimize_x  f(x)
         s.t.        x ∈ R^n ,

where f(·) is a convex function, not necessarily differentiable, defined on R^n. As before, we denote the optimal objective value by v*.

Let ∂f(x) denote the set of subgradients of f(·) at x; namely, g ∈ ∂f(x) if and only if:

    f(y) ≥ f(x) + g^T (y − x)   for all y ∈ R^n .                (4)

Recall that ∂f(x) is nonempty, and we assume that computing a subgradient of f(·) at any given x can be done efficiently. We will continue to measure distances between points using the Euclidean norm ‖·‖ := ‖·‖_2, where ‖x‖_2 := √(x^T x).

We assume that f(·) has Lipschitz function values. That is, there is a scalar L for which

    |f(y) − f(x)| ≤ L‖y − x‖   for all x, y ∈ R^n .              (5)

Also, recall that B(c, R) denotes the ball centered at c with radius R, namely:

    B(c, R) := {x ∈ R^n : ‖x − c‖ ≤ R} .

Table 2 presents a simple subgradient scheme.

Simple Subgradient Scheme

Step 0: Initialize. Define step-sizes t_i, i = 0, 1, 2, . . . .
Initialize with x^0, i ← 0.

Step 1: Compute New Point. Compute g^i ∈ ∂f(x^i), and then compute:

    x^{i+1} ← x^i − t_i g^i .

Step 2: Update counter and repeat. i ← i + 1 and Goto Step 1.

Table 2: A Simple Subgradient Scheme.
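As an illustration, here is a minimal Python sketch of the scheme in Table 2 (not part of the original notes), tracking the best objective value found so far, which is the quantity that Theorem 2.1 bounds. The test function f(x) = ‖x‖₁ with subgradient sign(x), and the constants R, L, and k, are our own choices; the constant step-size is the one from Corollary 2.1.

```python
import numpy as np

def simple_subgradient_scheme(f, subgrad_f, step_sizes, x0):
    """Run x_{i+1} <- x_i - t_i g_i and return the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for t in step_sizes:
        x = x - t * subgrad_f(x)
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x, best_val

# Illustrative test problem: f(x) = ||x||_1, nonsmooth convex, v* = 0 at x* = 0.
# sign(x) is a valid subgradient, and |f(y) - f(x)| <= sqrt(n) ||y - x||, so L = sqrt(n).
f = lambda x: np.sum(np.abs(x))
subgrad_f = lambda x: np.sign(x)

n, k, R = 2, 999, 10.0
L = np.sqrt(n)
t = R / (L * np.sqrt(k + 1))  # constant step-size from Corollary 2.1
best_x, best_val = simple_subgradient_scheme(f, subgrad_f, [t] * (k + 1), x0=[3.0, -2.0])
# Corollary 2.1 guarantees best_val <= v* + R*L/sqrt(k+1), since x* lies in B(x0, R)
```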

2.2 Analysis and Complexity of the Simple Subgradient Scheme

We begin with a simple relationship between the Lipschitz constant and the
norm of any subgradient:

Proposition 2.1 If g ∈ ∂f(x), then ‖g‖ ≤ L.

Proof: Let g ∈ ∂f(x) be given. If g = 0, the result is trivial. If g ≠ 0, then since f(x + g) ≥ f(x) + g^T (x + g − x) = f(x) + g^T g, it follows that

    ‖g‖^2 = g^T g ≤ f(x + g) − f(x) ≤ |f(x + g) − f(x)| ≤ L‖g‖ ,

which upon dividing by ‖g‖ yields ‖g‖ ≤ L.


Our main algorithmic analysis result is:

Theorem 2.1 After k iterations of the simple subgradient scheme, the following holds:

    min_{0≤j≤k} f(x^j) ≤ f(x) + ( L^2 Σ_{i=0}^k t_i^2 + ‖x − x^0‖^2 ) / ( 2 Σ_{i=0}^k t_i )   for all x ∈ R^n .

Corollary 2.1 Suppose we wish to run our algorithm for k iterations, and our region of interest is B(x^0, R), which may or may not contain an optimal solution x*. If we set the step-sizes to be:

    t_i := t := R / (L√(k+1)) ,

we obtain:

    min_{0≤j≤k} f(x^j) ≤ min_{x ∈ B(x^0, R)} f(x) + RL/√(k+1) .

If we are given ε > 0 and we wish to compute a point whose objective function value is within ε of the optimal value v*, then we can do so within

    k := ⌈ R^2 L^2 / ε^2 ⌉ − 1

iterations using the constant step-size

    t_i = t = ε / L^2 .

Proof of Theorem 2.1: If we have just computed x^{i+1}, we have for any x:

    ‖x^{i+1} − x‖^2 = ‖x^i − t_i g^i − x‖^2

                    = ‖x^i − x‖^2 + t_i^2 ‖g^i‖^2 + 2 t_i (g^i)^T (x − x^i)

                    ≤ ‖x^i − x‖^2 + t_i^2 ‖g^i‖^2 + 2 t_i (f(x) − f(x^i))   (subgradient inequality)

                    ≤ ‖x^i − x‖^2 + t_i^2 L^2 + 2 t_i (f(x) − f(x^i)) ,   (Proposition 2.1)

which rearranges to:

    2 t_i f(x^i) ≤ 2 t_i f(x) + t_i^2 L^2 + ‖x^i − x‖^2 − ‖x^{i+1} − x‖^2 .

Denote:

    f̂_k := min_{0≤i≤k} f(x^i) .

Now sum up the above inequality over i:

    2 Σ_{i=0}^k t_i f̂_k ≤ 2 Σ_{i=0}^k t_i f(x^i) ≤ 2 Σ_{i=0}^k t_i f(x) + Σ_{i=0}^k t_i^2 L^2 + ‖x^0 − x‖^2 − ‖x^{k+1} − x‖^2 .

Since the last subtractand is nonnegative, it can be eliminated, and dividing by 2 Σ_{i=0}^k t_i yields:

    f̂_k ≤ f(x) + ( L^2 Σ_{i=0}^k t_i^2 + ‖x − x^0‖^2 ) / ( 2 Σ_{i=0}^k t_i ) .

2.3 Comments and Extensions

1. The first result in Corollary 2.1 presumes that we know in advance how many iterations we wish to run the method. This enables a constant step-size t to be determined that is optimal for the inequality of the corollary. What if one wishes to run the method simply for a while? How might one proceed with an analysis of, say, choosing t_i := C/√(i+1) for some suitably chosen constant C?

2. How might the scheme and the analysis be modified if L is not explicitly known?

3. Suppose that our optimization problem includes a feasibility set S that is computationally efficient to work with, in the sense that computing the (Euclidean) projection P_S(x) of a point x onto the set S is easy to do. (This is the case when S is a box, a ball, or the standard simplex Q := {x ∈ R^n : x ≥ 0, e^T x = 1}, where e = (1, . . . , 1)^T denotes the column vector of 1s.) Then it is a simple exercise to modify the main algorithm step to:

    x^{i+1} ← P_S(x^i − t_i g^i) .

The proof needs to be suitably modified, but can be done so easily if one relies on the following contraction property of projection operators:

Proposition 2.2 If S is a convex set, then for all x, y ∈ R^n it holds that ‖P_S(x) − P_S(y)‖ ≤ ‖x − y‖.

4. A slightly modified step-size rule is to incorporate the norm of g^i into the step-size itself. If we set t_i = ε_i/‖g^i‖ for some ε_i, then ‖x^{i+1} − x^i‖ = t_i‖g^i‖ = ε_i, so that ε_i is literally the length of the step. It is a useful exercise to explore how the results of Theorem 2.1 and Corollary 2.1 change with this step-size rule.
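As a small illustration of comment 3 above, here is a sketch of the projected variant when S is a box, so that the projection P_S is just a componentwise clip. The test problem, the diminishing step-sizes, and all names are our own illustrative choices, not part of the original notes.

```python
import numpy as np

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n is a componentwise clip."""
    return np.clip(x, lo, hi)

def projected_subgradient(f, subgrad_f, project, step_sizes, x0):
    """x_{i+1} <- P_S(x_i - t_i g_i), returning the best point seen so far."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for t in step_sizes:
        x = project(x - t * subgrad_f(x))
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x, best_val

# Illustrative problem: minimize f(x) = |x1 - 3| + |x2| over the box [0,1]^2.
# The optimum is x* = (1, 0) with v* = 2.
f = lambda x: abs(x[0] - 3.0) + abs(x[1])
subgrad_f = lambda x: np.array([np.sign(x[0] - 3.0), np.sign(x[1])])
project = lambda x: project_box(x, 0.0, 1.0)

steps = [0.1 / np.sqrt(i + 1) for i in range(500)]
best_x, best_val = projected_subgradient(f, subgrad_f, project, steps, x0=[0.0, 1.0])
# best_val approaches v* = 2, and the iterates never leave the box
```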

3 An Accelerated Smooth Method (with Optimal Iteration Complexity) in Simple Setting

In this section we re-consider the case when f(·) is differentiable, i.e., smooth. We present an accelerated version of the gradient method of Section 1.1 for solving (1), which has superior theoretical and practical performance over the gradient scheme presented in Section 1.1. Recall the gradient inequality (2). As in Section 1.1, we will consider the case when f(·) has a Lipschitz gradient, see (3), and recall Proposition 1.1, which states a useful inequality for functions whose gradient is Lipschitz.

Table 3 presents an accelerated simple gradient scheme.
Table 3 presents an accelerated simple gradient scheme.

Accelerated Simple Gradient Scheme

Step 0: Initialize. Define step-size parameters θ_i ∈ (0, 1], i = 0, 1, . . . .
Initialize with x^0 and z^0 := x^0, and i ← 0.

Step 1: Compute New Points. Define y^i ← (1 − θ_i)x^i + θ_i z^i and compute ∇f(y^i). Then compute:

    z^{i+1} ← z^i − (1/(θ_i L)) ∇f(y^i)

    x^{i+1} ← (1 − θ_i)x^i + θ_i z^{i+1} .

Step 2: Update counter and repeat. i ← i + 1 and Goto Step 1.

Table 3: An Accelerated Simple Gradient Scheme.
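Here is a minimal Python sketch of the scheme in Table 3 (not part of the original notes), using the step-size sequence θ_i = 2/(i+2). The quadratic test function and the value of L are our own illustrative choices.

```python
import numpy as np

def accelerated_gradient_scheme(grad_f, L, x0, num_iters):
    """Accelerated scheme of Table 3 with theta_i = 2/(i+2)."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()  # z^0 := x^0
    for i in range(num_iters):
        theta = 2.0 / (i + 2)             # theta_0 = 1
        y = (1 - theta) * x + theta * z
        z = z - grad_f(y) / (theta * L)
        x = (1 - theta) * x + theta * z
    return x

# Same illustrative test problem as before: f(x) = 0.5 * x^T Q x, L = 10, v* = 0.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

x_k = accelerated_gradient_scheme(grad_f, L=10.0, x0=[5.0, 5.0], num_iters=100)
# Theorem 3.1 guarantees f(x_k) <= f(x*) + 2 * L * ||x* - x0||^2 / (k+1)^2
```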

Note that the algorithm keeps track of three points at each iteration: x^i, y^i, and z^i. However, information about y^i is not used in subsequent iterations, and indeed the explicit computation or storage of y^i can be eliminated from the algorithm description. The algorithm therefore really only works with the pair x^i, z^i at each iteration. Note that y^i is used only to compute the gradient which is used to determine the value of z^{i+1} in Step 1. The new point z^{i+1} is also used to determine the value of x^{i+1} in Step 1.

The algorithm requires a sequence of step-size parameters θ_i that satisfies θ_0 = 1, θ_i ∈ (0, 1] for i = 0, 1, . . ., and that also satisfies the following inequality, which prevents the sequence from decreasing too quickly:

    (1 − θ_{i+1})/θ_{i+1}^2 ≤ 1/θ_i^2   for i = 0, 1, . . . .    (6)

The next proposition presents properties of two different step-size sequences; its proof is deferred to later.

Proposition 3.1

1. Consider the sequence

    θ_i = 2/(i + 2) ,   i = 0, 1, . . . .                        (7)

This sequence satisfies the step-size inequality (6) strictly.

2. Consider the sequence

    θ_0 = 1 ,   θ_{i+1} = 2/( 1 + √(1 + 4/θ_i^2) ) ,   i = 0, 1, . . . .   (8)

This sequence satisfies θ_i ≤ 2/(i + 2), i = 0, 1, . . ., and satisfies the step-size inequality (6) at equality.
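The two sequences are easy to generate and check numerically. The following sketch (our own, not part of the notes) computes both sequences and evaluates the quantity (1 − θ_{i+1})/θ_{i+1}² − 1/θ_i², which inequality (6) requires to be nonpositive.

```python
import math

def theta_seq_7(n):
    """Sequence (7): theta_i = 2/(i+2)."""
    return [2.0 / (i + 2) for i in range(n)]

def theta_seq_8(n):
    """Sequence (8): theta_0 = 1, theta_{i+1} = 2/(1 + sqrt(1 + 4/theta_i^2))."""
    thetas = [1.0]
    for _ in range(n - 1):
        t = thetas[-1]
        thetas.append(2.0 / (1.0 + math.sqrt(1.0 + 4.0 / t**2)))
    return thetas

def max_violation_of_6(thetas):
    """Largest value of (1 - theta_{i+1})/theta_{i+1}^2 - 1/theta_i^2; (6) requires <= 0."""
    return max((1.0 - b) / b**2 - 1.0 / a**2 for a, b in zip(thetas, thetas[1:]))

s7, s8 = theta_seq_7(50), theta_seq_8(50)
# Sequence (7) satisfies (6) strictly (the violation equals -1/4 for every i),
# sequence (8) satisfies (6) at equality (violation 0 up to rounding),
# and sequence (8) stays below 2/(i+2), as claimed in Proposition 3.1.
```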

3.1 Analysis and Complexity of the Accelerated Simple Gradient Scheme

Our main algorithmic analysis result is:

Theorem 3.1 Suppose that the sequence {θ_i}_{i=0}^∞ is given by (7) or (8). Then after k iterations of the accelerated simple gradient scheme, the following holds:

    min_{0≤i≤k} f(x^i) ≤ f(x) + 2L‖x − x^0‖^2/(k + 1)^2   for all x ∈ R^n .

Corollary 3.1 Suppose our region of interest is S_R := {x : ‖x − x^0‖ ≤ R} for some R, which may or may not contain an optimal solution x*. Then:

    min_{0≤i≤k} f(x^i) ≤ min_{x ∈ S_R} f(x) + 2LR^2/(k + 1)^2 .

If we wish to compute a point whose objective function value is within ε of the optimal value v*, then we can do so within

    k := ⌈ √( 2L‖x* − x^0‖^2/ε ) ⌉ − 1

iterations.
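To get a feel for the three guarantees side by side, the following sketch (our own, not part of the notes) evaluates the iteration counts from Corollaries 1.1, 2.1, and 3.1 for one illustrative choice of R, L, and ε. Note that the L in Corollary 2.1 bounds function values rather than gradients, so using the same numeric value for all three is purely for illustration.

```python
import math

# Iteration counts sufficient for an epsilon-optimal solution, taking R = ||x* - x0||:
# smooth gradient (Cor. 1.1), nonsmooth subgradient (Cor. 2.1), accelerated (Cor. 3.1).
R, L, eps = 1.0, 100.0, 1e-3

k_gradient = math.ceil(R**2 * L / (2 * eps))                  # O(1/eps)
k_subgradient = math.ceil(R**2 * L**2 / eps**2) - 1           # O(1/eps^2)
k_accelerated = math.ceil(math.sqrt(2 * L * R**2 / eps)) - 1  # O(1/sqrt(eps))

print(k_gradient, k_subgradient, k_accelerated)
```

Even on this small example, the gap between the three rates is dramatic, which is why the accelerated method of this section is of such interest.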

In order to facilitate the proof of Theorem 3.1, we will utilize the following
arithmetic result:

Proposition 3.2 For all x the following holds:

    ∇f(y^i)^T (x − y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2 = ∇f(y^i)^T (z^{i+1} − y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 .
2 2 2

Proof: This is just a rearrangement of terms using the definition of z^{i+1}:

    z^{i+1} = z^i − (1/(θ_i L)) ∇f(y^i) .

Expanding the left side of the proposition we have:

    ∇f(y^i)^T (x − y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2

    = −∇f(y^i)^T y^i + θ_i L [ (z^i − z^{i+1})^T x + (1/2)‖x‖^2 − x^T z^i + (1/2)‖z^i‖^2 − (1/2)‖x‖^2 + x^T z^{i+1} − (1/2)‖z^{i+1}‖^2 ]

    = −∇f(y^i)^T y^i + θ_i L [ (1/2)‖z^i‖^2 − (1/2)‖z^{i+1}‖^2 ]

    = ∇f(y^i)^T (z^{i+1} − y^i) + θ_i L [ (z^{i+1} − z^i)^T z^{i+1} + (1/2)‖z^i‖^2 − (1/2)‖z^{i+1}‖^2 ]

    = ∇f(y^i)^T (z^{i+1} − y^i) + θ_i L [ ‖z^{i+1}‖^2 − (z^i)^T z^{i+1} + (1/2)‖z^i‖^2 − (1/2)‖z^{i+1}‖^2 ]

    = ∇f(y^i)^T (z^{i+1} − y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 .

We now introduce some facilitating notation. Let f_l(x, y) denote the first-order linear expansion of f(·) about the point y, evaluated at x, namely:

    f_l(x, y) := f(y) + ∇f(y)^T (x − y) .                        (9)

With this notation, the conclusion of Proposition 3.2 can be written equivalently as:

    f_l(x, y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2 = f_l(z^{i+1}, y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 .   (10)
The next result will be used in the proof of Theorem 3.1.

Lemma 3.1 If f(x) ≤ f(x^{i+1}) or (6) holds at equality, then

    ((1 − θ_{i+1})/θ_{i+1}^2)(f(x^{i+1}) − f(x)) + (L/2)‖x − z^{i+1}‖^2 ≤ ((1 − θ_i)/θ_i^2)(f(x^i) − f(x)) + (L/2)‖x − z^i‖^2 .

Proof: We have for all x:

    f(x^{i+1}) ≤ f(y^i) + ∇f(y^i)^T (x^{i+1} − y^i) + (L/2)‖x^{i+1} − y^i‖^2   (from Prop. 1.1)

    = f_l(x^{i+1}, y^i) + (L/2)‖x^{i+1} − y^i‖^2   (from (9))

    = f_l((1 − θ_i)x^i + θ_i z^{i+1}, y^i) + (L/2)‖(1 − θ_i)x^i + θ_i z^{i+1} − y^i‖^2   (def. of x^{i+1})

    = (1 − θ_i) f_l(x^i, y^i) + θ_i f_l(z^{i+1}, y^i) + (L/2)‖(1 − θ_i)x^i + θ_i z^{i+1} − y^i‖^2   (linearity)

    = (1 − θ_i) f_l(x^i, y^i) + θ_i f_l(z^{i+1}, y^i) + (θ_i^2 L/2)‖z^{i+1} − z^i‖^2   (def. of y^i)

    = (1 − θ_i) f_l(x^i, y^i) + θ_i [ f_l(z^{i+1}, y^i) + (θ_i L/2)‖z^{i+1} − z^i‖^2 ]   (rearranging)

    = (1 − θ_i) f_l(x^i, y^i) + θ_i [ f_l(x, y^i) + (θ_i L/2)‖x − z^i‖^2 − (θ_i L/2)‖x − z^{i+1}‖^2 ]   (from (10))

    ≤ (1 − θ_i) f(x^i) + θ_i f(x) + (θ_i^2 L/2)‖x − z^i‖^2 − (θ_i^2 L/2)‖x − z^{i+1}‖^2 .   (gradient inequality)

Subtracting f(x) from each side, dividing by θ_i^2, and rearranging yields:

    (1/θ_i^2)(f(x^{i+1}) − f(x)) + (L/2)‖x − z^{i+1}‖^2 ≤ ((1 − θ_i)/θ_i^2)(f(x^i) − f(x)) + (L/2)‖x − z^i‖^2 .   (11)

If f(x) ≤ f(x^{i+1}), then (6) implies that

    ((1 − θ_{i+1})/θ_{i+1}^2)(f(x^{i+1}) − f(x)) ≤ (1/θ_i^2)(f(x^{i+1}) − f(x)) ,

which combined with (11) proves the result.

If (6) is satisfied at equality, then

    (1 − θ_{i+1})/θ_{i+1}^2 = 1/θ_i^2 ,

which combined with (11) proves the result.


Proof of Theorem 3.1: Let x be given. If f(x) ≥ min_{0≤i≤k} f(x^i), then the conclusion of the theorem follows trivially. Otherwise f(x) < min_{0≤i≤k} f(x^i), in which case from Lemma 3.1 we have:

    ((1 − θ_{i+1})/θ_{i+1}^2)(f(x^{i+1}) − f(x)) + (L/2)‖x − z^{i+1}‖^2 ≤ ((1 − θ_i)/θ_i^2)(f(x^i) − f(x)) + (L/2)‖x − z^i‖^2 ,   i = 0, . . . , k−1 .

This cascading chain of inequalities can be combined for i = 0, . . . , k−2 to yield:

    ((1 − θ_{k−1})/θ_{k−1}^2)(f(x^{k−1}) − f(x)) + (L/2)‖x − z^{k−1}‖^2 ≤ ((1 − θ_0)/θ_0^2)(f(x^0) − f(x)) + (L/2)‖x − z^0‖^2 .

Noting that θ_0 = 1 and z^0 = x^0 yields:

    ((1 − θ_{k−1})/θ_{k−1}^2)(f(x^{k−1}) − f(x)) + (L/2)‖x − z^{k−1}‖^2 ≤ (L/2)‖x − x^0‖^2 .

Invoking (11) for i = k−1 yields:

    (1/θ_{k−1}^2)(f(x^k) − f(x)) + (L/2)‖x − z^k‖^2 ≤ ((1 − θ_{k−1})/θ_{k−1}^2)(f(x^{k−1}) − f(x)) + (L/2)‖x − z^{k−1}‖^2 ,

which when combined with the previous inequality yields:

    (1/θ_{k−1}^2)(f(x^k) − f(x)) + (L/2)‖x − z^k‖^2 ≤ (L/2)‖x − x^0‖^2 .

Lastly, noting from Proposition 3.1 that θ_{k−1} ≤ 2/(k+1), and also that ‖x − z^k‖ ≥ 0, we obtain:

    f(x^k) ≤ f(x) + θ_{k−1}^2 (L/2)‖x − x^0‖^2 ≤ f(x) + (4/(k+1)^2)(L/2)‖x − x^0‖^2 = f(x) + 2L‖x − x^0‖^2/(k + 1)^2 .

Proof of Proposition 3.1: We first prove (1.). For the sequence (7) it follows trivially that θ_0 = 1. For a given i we have

    (1 − θ_{i+1})/θ_{i+1}^2 = ((3 + i)^2/4)(1 − 2/(3 + i)) = (i^2 + 4i + 3)/4 < (i^2 + 4i + 4)/4 = (i + 2)^2/4 = 1/θ_i^2 ,

which proves (1.). To prove (2.), we show that the formula for θ_{i+1} in (8) is simply an application of the quadratic formula applied to (6) at equality. Indeed, given θ_i, the equality version of (6) is

    (1/θ_{i+1})^2 − (1/θ_{i+1}) − (1/θ_i)^2 = 0 ,

which is quadratic in α := 1/θ_{i+1}. Invoking the quadratic formula to solve this equation yields precisely (8). Note also from (8) that θ_{i+1} is monotone increasing in θ_i. By definition, θ_0 = 1 satisfies θ_i ≤ 2/(2 + i) at i = 0. Suppose that θ_i ≤ 2/(2 + i) holds for some i. Then from monotonicity we have

    θ_{i+1} = 2/( 1 + √(1 + 4/θ_i^2) ) ≤ 2/( 1 + √(1 + 4/(2/(2+i))^2) ) = 2/( 1 + √(1 + (2+i)^2) ) < 2/( 1 + (2+i) ) = 2/( 2 + (i+1) ) ,

and the proof is completed by induction.

3.2 Comments and Extensions

1. How might the scheme and the analysis be modified if L is not explicitly known?

2. One might ask whether the complexity bound in Theorem 3.1 can be improved. In a rather deep result, it was shown by Nemirovskii and Yudin that this bound is order-optimal, in the sense that no algorithm that relies only on first-order information can have a better worst-case complexity bound, except perhaps by an absolute constant. In this sense the accelerated simple gradient method is an optimal algorithm.

4 The Conditional-Gradient Method for Constrained Optimization

Let us now consider the following optimization problem:

    P :  minimize_x  f(x)
         s.t.        x ∈ S ,

where f(x) is a differentiable convex function on S, and S is a closed and bounded convex set. Herein we describe the conditional-gradient method for solving P, which is also called the Frank-Wolfe method. This method is based on the premise that the set S is well-suited for linear optimization. This means that either S is itself a system of linear inequalities S = {x | Ax ≤ b} over which linear optimization is an easy task, or more generally that the problem:

    LO_c :  minimize_x  c^T x
            s.t.        x ∈ S

is easy to solve for any given objective function vector c. This being the case, suppose that we have a given iterate value x^k ∈ S. The linearization of the function f(x) at x = x^k is:

    z_1(x) := f(x^k) + ∇f(x^k)^T (x − x^k) ,

which is the first-order Taylor expansion of f(·) at x^k. Since we can easily do linear optimization on S, let us solve:

    LP :  minimize_x  z_1(x) = f(x^k) + ∇f(x^k)^T (x − x^k)
          s.t.        x ∈ S ,

which simplifies to:

    LP :  minimize_x  ∇f(x^k)^T x
          s.t.        x ∈ S .

Let x̄^k denote the optimal solution to this problem. Then since S is a convex set, the line segment joining x^k and x̄^k is also in S, and we can choose our next iterate to be some convex combination of x^k and x̄^k:

    x^{k+1} ← x^k + λ_k (x̄^k − x^k) .

This can be done in several ways. One way is to choose a step-size λ_k according to some pre-determined rule, such as λ_k = 2/(k+2). Another way is to choose λ_k by performing a line-search of f(·) over this segment. That is, we might determine λ_k as the optimal solution to the following line-search problem:

    LS :  minimize_λ  f(x^k + λ(x̄^k − x^k))
          s.t.        0 ≤ λ ≤ 1 .

Algorithm 1 presents a formal statement of the conditional gradient method just described.

Algorithm 1 Conditional Gradient Method for minimizing f(x) over x ∈ S

Initialize at x^0 ∈ S, LB_{−1} ← −∞, k ← 0.

At iteration k:

1. Compute ∇f(x^k), and then compute:

    x̄^k ← arg min_{x ∈ S} { f(x^k) + ∇f(x^k)^T (x − x^k) } ,

    LB_k ← max{ LB_{k−1}, f(x^k) + ∇f(x^k)^T (x̄^k − x^k) } .

2. Set x^{k+1} ← x^k + λ_k (x̄^k − x^k), where λ_k ∈ [0, 1].

Regarding the conditional gradient method, note that the lower bound values LB_k result from the convexity of f(·) and the gradient inequality for convex functions:

    f(x) ≥ f(x^k) + ∇f(x^k)^T (x − x^k)   for any x ∈ S .

Therefore

    min_{x ∈ S} f(x) ≥ min_{x ∈ S} { f(x^k) + ∇f(x^k)^T (x − x^k) } = f(x^k) + ∇f(x^k)^T (x̄^k − x^k) ,

and so the optimal objective function value of P is always bounded below by the value of LB_k.

We have the following computational guarantee for the conditional gradient method:

Theorem 4.1 Suppose that S is a bounded set and that ∇f(·) is Lipschitz on S, namely:

    ‖∇f(y) − ∇f(x)‖ ≤ L_f ‖x − y‖   for any x, y ∈ S .

If the step-sizes λ_i are determined either by a line-search or by the step-size rule λ_i = 2/(i+2), then after k iterations of the conditional gradient method the following holds:

    min_{i=0,...,k} f(x^i) − LB_k ≤ L_f Diam_S^2/(k + 2) ,

where Diam_S := max_{x,y ∈ S} ‖x − y‖_2.

Proof: Under construction.
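As an illustration, here is a minimal Python sketch (our own, not part of the notes) of the conditional gradient method when S is the standard simplex, where the linear subproblem LP is solved exactly by picking the unit vector e_j corresponding to the smallest component of the gradient. The test problem and the pre-determined step-size rule λ_k = 2/(k+2) follow the discussion above; all names are our own.

```python
import numpy as np

def frank_wolfe_simplex(f, grad_f, x0, num_iters):
    """Conditional gradient method over the standard simplex.
    The LP subproblem min_{x in S} g^T x is attained at the vertex e_j, j = argmin_j g_j."""
    x = np.asarray(x0, dtype=float)
    lower_bound = -np.inf
    for k in range(num_iters):
        g = grad_f(x)
        j = int(np.argmin(g))           # solve LP exactly at a simplex vertex
        x_bar = np.zeros_like(x)
        x_bar[j] = 1.0
        lower_bound = max(lower_bound, f(x) + g @ (x_bar - x))  # LB_k update
        lam = 2.0 / (k + 2)             # pre-determined step-size rule
        x = x + lam * (x_bar - x)
    return x, lower_bound

# Illustrative problem: minimize f(x) = ||x - c||^2 over the simplex,
# where c itself lies in the simplex, so v* = 0.
c = np.array([0.2, 0.3, 0.5])
f = lambda x: np.sum((x - c)**2)
grad_f = lambda x: 2.0 * (x - c)

x_k, lb = frank_wolfe_simplex(f, grad_f, x0=np.array([1.0, 0.0, 0.0]), num_iters=300)
# f(x_k) approaches v* = 0, the iterates stay feasible, and lb is a valid lower bound on v*
```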
