
Background Material

Lecture notes of UCLA Math164 Spring 2016


Instructor: Wotao Yin
Authors: Ernest K. Ryu and Wotao Yin
April 12, 2017

This chapter reviews the following components to prepare the reader
for this optimization book.

1. Linear algebra: vectors, matrices, vector spaces, subspaces, linear
independence, norms, linear transformations.

2. Sets: affine sets, convex sets, closedness.

3. Functions: continuity, limits, derivatives, partial derivatives, level
sets, coerciveness, convexity, linearity.

Linear algebra

Real vectors
The set of real numbers is denoted as R.
A real vector or vector x of dimension n is written as

    x = [ x_1 ]
        [ x_2 ]
        [  ⋮  ]
        [ x_n ]

or, in a more text-friendly fashion, x = [x_1 x_2 ⋯ x_n]^T. Alternatively,
it is also written as x^T = [x_1 x_2 ⋯ x_n]. Here, each x_i, i = 1, . . . , n, is
a real number.

There are some operations defined for vectors: for any vectors x, y, z
and scalars α, β ∈ R,

1. The + operation: x + y is a vector, and the operation satisfies
x + y = y + x and x + (y + z) = (x + y) + z.

2. The zero vector: there exists a unique vector called the zero vector
(written as 0) such that x + 0 = x. (Do not confuse 0 with the
scalar 0. They are different.)

3. The − sign and operation: there exists a unique vector for x
(written as −x) such that x + (−x) = 0. The vector y − z is defined
as y + (−z).
4. Scalar multiplication: αx = [αx_1 ⋯ αx_n]^T is a vector, and this
scalar multiplication satisfies α(βx) = (αβ)x and 1x = x.

5. Distributivities: α(x + y) = αx + αy and (α + β)x = αx + βx.
Following this property and the definition of 0, we have 0x = 0.

Linear subspace and linear independence


A set of vectors {x_1, . . . , x_m} is linearly independent or independent if
the only way to select the scalars α_1, . . . , α_m such that ∑_{i=1}^m α_i x_i = 0
is α_1 = ⋯ = α_m = 0. This implies that, if a vector y can be
linearly expressed by x_1, . . . , x_m, that is, y = ∑_{i=1}^m α_i x_i for some scalars
α_1, . . . , α_m, then the values of α_1, . . . , α_m must be unique.

The set of points ∑_{i=1}^m α_i x_i with arbitrary values of α_1, . . . , α_m ∈ R
is called the linear subspace spanned by x_1, . . . , x_m. Let this space be
denoted by V = span{x_1, . . . , x_m}. The dimension of V, written as
dim(V), equals m if those vectors are linearly independent. In this
case, those vectors also form a basis of V. That is, any vector in V can
be uniquely linearly represented by the set of vectors.

When the set of vectors x_1, . . . , x_m is not linearly independent
(that is, linearly dependent), we cannot call them a basis. By definition, we
can find α_1, . . . , α_m that include at least one nonzero entry such that
∑_{i=1}^m α_i x_i = 0. Without loss of generality, assume α_m is nonzero,
so x_m = ∑_{i=1}^{m−1} (−α_i/α_m) x_i; that is, x_m can be linearly represented by
the remaining vectors, rendering itself redundant. Every vector in
V can be linearly represented just using x_1, . . . , x_{m−1}. Hence, we
can eliminate x_m. This elimination is repeated until, say, x_1, . . . , x_{m′}
are linearly independent. These m′ vectors form a basis for V and
dim(V) = m′.
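As a quick numerical check (a sketch using NumPy; the vectors below are a made-up example), dim(span{x_1, . . . , x_m}) equals the rank of the matrix whose columns are the vectors:

```python
import numpy as np

# Hypothetical example: three vectors in R^3; the third is the sum of
# the first two, so the set is linearly dependent.
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = x1 + x2  # redundant vector

X = np.column_stack([x1, x2, x3])

# dim(span{x1, x2, x3}) is the rank of the matrix of columns: 2, not 3,
# so {x1, x2} already forms a basis of the span.
print(np.linalg.matrix_rank(X))  # 2
```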
The natural basis of R^n consists of the vectors

    e_1 = [1 0 ⋯ 0]^T,  e_2 = [0 1 0 ⋯ 0]^T,  . . . ,  e_n = [0 ⋯ 0 1]^T.

Indeed, any vector x ∈ R^n can be written as x = x_1 e_1 + ⋯ + x_n e_n.

Matrix
We write an m × n matrix as

    A = [ a_11  a_12  ⋯  a_1n ]
        [ a_21  a_22  ⋯  a_2n ]
        [  ⋮     ⋮          ⋮  ]
        [ a_m1  a_m2  ⋯  a_mn ]
The transpose of an m × n matrix is an n × m matrix with each pair
of a_ij and a_ji having swapped their positions. The diagonal entries a_ii,
i = 1, 2, . . . , min{m, n}, retain their positions.

A column vector and a row vector are special matrices. The transpose
of a column vector is a row vector, and vice versa.

If A has the same number of rows and columns, i.e., m = n, then
we call it a square matrix. Consider a square matrix A ∈ R^{n×n}. If
A^T = A, then A is symmetric. The identity matrix I is a special
case. Consider the quadratic function f(x) = x^T A x = ∑_{i,j∈[n]} a_ij x_i x_j, where
x ∈ R^n and [n] is an abbreviation of {1, . . . , n}. If x^T A x ≥ 0 for
all x ∈ R^n, then we say A is positive semi-definite. In the special case
that x^T A x > 0 for all nonzero x ∈ R^n, A is called positive definite or,
to avoid ambiguity, strictly positive definite. Note that positive semi-definite
and positive definite matrices do not need to be symmetric.
To see this, observe that the value of x^T A x remains unchanged if we add 1 to a_ij
and subtract 1 from a_ji. Nonetheless, in x^T A x, we often use a symmetric
matrix; if it is not so, we can replace A by its symmetrization
(1/2)(A + A^T).
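The two observations above can be checked numerically (a sketch using NumPy; the matrix A is a made-up example): x^T A x is unchanged by the ±1 modification, and semi-definiteness can be tested via the eigenvalues of the symmetrization.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.0, 2.0]])   # not symmetric

# Adding 1 to a_ij and subtracting 1 from a_ji leaves x^T A x unchanged.
B = A.copy()
B[0, 1] += 1.0
B[1, 0] -= 1.0

x = rng.standard_normal(2)
assert np.isclose(x @ A @ x, x @ B @ x)

# Positive semi-definiteness is decided by the symmetrization (A + A^T)/2,
# whose eigenvalues are real since it is symmetric.
S = (A + A.T) / 2
print(np.all(np.linalg.eigvalsh(S) >= 0))  # True for this A
```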
A linear transformation T from R^n to R^m is such that T(αx +
βy) = αT(x) + βT(y) for all α, β ∈ R and x, y ∈ R^n. Any linear
transformation from R^n to R^m can be represented by a matrix: for
any x ∈ R^n, T(x) = Ax, where y = Ax is a vector with the following
entries:

    y_i = [a_i1 a_i2 ⋯ a_in] x = ∑_{j=1}^n a_ij x_j.

The rank of a matrix is closely connected to the linear dependence
of its columns and rows. Let a_i be the ith column of A. The
rank of A, written as rank(A), equals the dimension of the linear
subspace that is spanned by a_1, . . . , a_n. In particular, rank(A) = n
if and only if these columns are linearly independent; otherwise,
rank(A) < n. Odd as it may look, rank(A) also equals
the dimension of the linear subspace that is spanned by the rows of
A (we transpose each row into a column vector). Therefore, we have
rank(A) ≤ min{m, n}. When the equality holds, we say A has full
rank (and A does not need to be a square matrix).

The rank of a matrix remains the same under any of the following
operations: (i) reordering the rows or columns, (ii)
multiplying by a nonzero scalar, (iii) rank(A) = rank(BA) if B ∈ R^{p×m} is
full rank and p ≥ m, and (iv) rank(A) = rank(AC) if C ∈ R^{n×q} is full
rank and q ≥ n.
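Properties (iii) and (iv) can be illustrated numerically (a sketch using NumPy; the random matrices are made-up examples, full rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
A[:, 2] = A[:, 0] + A[:, 1]        # force rank(A) = 2

B = rng.standard_normal((5, 4))    # full rank with p = 5 >= m = 4 (w.p. 1)
C = rng.standard_normal((3, 3))    # full rank square matrix (w.p. 1)

r = np.linalg.matrix_rank
# Multiplying by a full-rank B on the left or C on the right preserves rank.
print(r(A), r(B @ A), r(A @ C))    # all equal 2
```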
Metric
The n-dimensional Euclidean space is denoted as R^n. The inner
product between two vectors x, y is

    ⟨x, y⟩ = ∑_{i=1}^n x_i y_i,

which is also written as the multiplication of the row vector x^T with the
column vector y, that is, ⟨x, y⟩ = x^T y. When ⟨x, y⟩ = 0, we say x is
perpendicular to y, written x ⊥ y.

One of the most useful inequalities is the Cauchy-Schwarz inequality:

    |⟨x, y⟩| ≤ ‖x‖_2 ‖y‖_2.

The = holds if and only if x = αy, or αx = y, for some scalar
α. Because of this, the Cauchy-Schwarz inequality can be quite loose.
Nonetheless, it is very easy to use.

The inner product induces the following measure (its distance to 0)
for a point, which is the 2-norm,

    ‖x‖_2 = √⟨x, x⟩ = √(∑_{i=1}^n x_i²).

It is easy to verify that it satisfies the three criteria that define a norm:
for any x, y ∈ R^n,

1. ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0.

2. ‖αx‖ = |α| ‖x‖, for any scalar α.

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖.

Part 3 is the triangle inequality. For the 2-norm, it holds with = if
and only if x = αy or αx = y for some α ≥ 0.
Besides, the following 1-norm and ∞-norm are also norms:

    ‖x‖_1 = ∑_{i=1}^n |x_i|,
    ‖x‖_∞ = max{|x_1|, . . . , |x_n|}.

The following inequalities between different norms are useful: for
any x ∈ R^n,

    ‖x‖_∞ ≤ ‖x‖_2 ≤ ‖x‖_1,
    ‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2,                  (1)
    ‖x‖_∞ ≤ ‖x‖_2 ≤ √n ‖x‖_∞,
    ‖x‖_∞ ≤ ‖x‖_1 ≤ n ‖x‖_∞.
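These inequalities can be spot-checked on a random vector (a sketch using NumPy; the vector is a made-up example, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
x = rng.standard_normal(n)

inf_n = np.linalg.norm(x, np.inf)  # ||x||_inf
two_n = np.linalg.norm(x, 2)       # ||x||_2
one_n = np.linalg.norm(x, 1)       # ||x||_1

# The four chains of inequalities between the norms:
assert inf_n <= two_n <= one_n
assert two_n <= one_n <= np.sqrt(n) * two_n
assert inf_n <= two_n <= np.sqrt(n) * inf_n
assert inf_n <= one_n <= n * inf_n
print("all inequalities hold")
```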
It is important to know where each = holds in these inequalities.

Let p > 0 and define

    ‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p}.

When 0 < p < 1, ‖x‖_p is not a norm since Property 3 is violated
(Properties 1 and 2 are still satisfied). Interestingly, its pth power
‖x‖_p^p does satisfy Property 3 but instead violates Property 2.

A semi-norm satisfies the three norm properties except that it allows
‖x‖ = 0 for nonzero x. A common semi-norm is the discrete total
variation ∑_{i=1}^{n−1} |x_{i+1} − x_i|. A nonzero constant vector has 0 total
variation, so the total variation violates Property 1.
We can generalize the 2-norm to the matrix-induced norm. Let G
be a symmetric and positive semi-definite matrix. The G-induced function

    ‖x‖_G = √⟨x, Gx⟩

is well defined (by the positive semi-definiteness of G, ⟨x, Gx⟩ =
x^T G x ≥ 0). ‖x‖_G becomes a norm if G is strictly positive definite. In
particular, when G = I, we recover ‖x‖_G = ‖x‖_2.

Matrix-induced norms are used in the analysis of algorithms for
least-squares problems and the design of proximal algorithms.

Suppose λ_1, λ_n > 0 are the minimal and maximal eigenvalues of
G. For any x ∈ R^n, we have the inequality:

    √λ_1 ‖x‖_2 ≤ ‖x‖_G ≤ √λ_n ‖x‖_2.
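The eigenvalue bound can be verified on a random positive definite G (a sketch using NumPy; G and x are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
G = M.T @ M + np.eye(4)            # symmetric positive definite

lams = np.linalg.eigvalsh(G)       # eigenvalues sorted ascending
lam1, lamn = lams[0], lams[-1]     # minimal and maximal eigenvalues

x = rng.standard_normal(4)
xG = np.sqrt(x @ G @ x)            # ||x||_G
x2 = np.linalg.norm(x)             # ||x||_2

# sqrt(lam_1) ||x||_2 <= ||x||_G <= sqrt(lam_n) ||x||_2
assert np.sqrt(lam1) * x2 <= xG <= np.sqrt(lamn) * x2
print("bound verified")
```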

The well-known cosine rule can be stated as: for any a, b, c ∈ R^n,

    ‖a − b‖_2² = ‖a − c‖_2² + ‖b − c‖_2² − 2 ‖a − c‖_2 ‖b − c‖_2 cos(θ),

where θ is the angle between the vectors a − c and b − c. We know
that cos(θ) = ⟨a − c, b − c⟩ / (‖a − c‖_2 ‖b − c‖_2). From this identity, we
arrive at the three-point identity:

    (1/2)‖a − b‖_2² = (1/2)‖a − c‖_2² + (1/2)‖b − c‖_2² − ⟨a − c, b − c⟩,    (2)

which is commonly used to transform an inner product into a set of
squared terms.
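The three-point identity (2) can be checked directly on random points (a sketch using NumPy; the points are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, c = rng.standard_normal((3, 6))   # three random points in R^6

def half_sq(v):
    """(1/2) ||v||_2^2"""
    return 0.5 * np.dot(v, v)

# (1/2)||a-b||^2 = (1/2)||a-c||^2 + (1/2)||b-c||^2 - <a-c, b-c>
lhs = half_sq(a - b)
rhs = half_sq(a - c) + half_sq(b - c) - np.dot(a - c, b - c)
assert np.isclose(lhs, rhs)
print("three-point identity verified")
```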

Complementary subspace
Consider an m-dimensional linear subspace V ⊆ R^n. It must hold
that m ≤ n.

When m < n, there exists a set of points, known as the complementary
subspace and denoted as V^⊥, such that

    ⟨x, y⟩ = 0,  for all x ∈ V, y ∈ V^⊥.


The dimension of V^⊥ is n − m.

Suppose {x_1, . . . , x_m} spans V and {y_1, . . . , y_{n−m}} spans V^⊥. Then,
V can be written in three ways:

    V = { ∑_{i=1}^m α_i x_i : α_1, . . . , α_m ∈ R }
      = { x : ⟨x, y⟩ = 0 for all y ∈ V^⊥ }
      = { x : ⟨x, y_i⟩ = 0, i = 1, . . . , n − m }.

When a linear subspace V has dimension 1, it is called a line.
When it has dimension n − 1, it is called a hyperplane, named after
the plane, which is a two-dimensional linear subspace in R^3.
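A basis of V^⊥ can be computed from the SVD (a sketch using NumPy; the spanning vectors are a made-up example): the left singular vectors beyond the first m = dim(V) span the complement.

```python
import numpy as np

# Hypothetical example: V = span of two vectors in R^4, so
# dim(V^perp) = 4 - 2 = 2.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])         # columns span V

U, s, _ = np.linalg.svd(X, full_matrices=True)
m = int(np.sum(s > 1e-12))         # dim(V) = number of nonzero singular values
Y = U[:, m:]                       # remaining left singular vectors span V^perp

# Every vector of V is orthogonal to every column of Y.
assert np.allclose(X.T @ Y, 0)
print(Y.shape[1])                  # dim(V^perp) = 2
```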

Set

Affine set
Directly following from its definition, a linear subspace must contain
0. In particular, by setting α_1 = ⋯ = α_m = 0, we obtain ∑_{i=1}^m α_i x_i =
0 ∈ V.

Pick x_0 ∉ V, and create the new set

    V′ = { x + x_0 : x ∈ V }.

The point 0 is no longer in V′. (Otherwise, there exists x ∈ V such
that 0 = x + x_0. Hence, x_0 = −x and, by the linearity of V, −x ∈ V,
so x_0 ∈ V. Contradiction.) We say V is translated to V′ by (adding) x_0.

A linear subspace or its translation is called an affine set.

Formally, a set U is called affine if, for any x, y ∈ U, the line connecting
x and y (that is, all the points θx + (1 − θ)y for θ ∈ R) is a subset
of U. It is easy to show that the two definitions are equivalent.

A linear subspace is a special affine set, and an affine set is either a
linear subspace (if containing 0) or a translation of a linear subspace.

Let A ∈ R^{m×n}, b ∈ R^m. We have seen that the set V = {x : Ax =
0} is a linear subspace. It turns out U = {y : Ay = b} is an affine set
and is a translation of {x : Ax = 0}. To see this, let y_0 ∈ U. For each
point x ∈ V, it is easy to check that x + y_0 is a point in U. Conversely,
for each point y ∈ U, y − y_0 is a point in V.

Open and closed sets


Given a norm ‖·‖, the induced ball centered at a point x_0 with radius
ε > 0 is

    B_{‖·‖}(x_0; ε) = { x : ‖x − x_0‖ ≤ ε }.

When we use the 2-norm, we write B instead of B_{‖·‖_2} for simplicity.

Mathematically, S is open if, for any x ∈ S, there exists some ε > 0
such that B(x; ε) is contained in S. Since we require ε > 0 (which can
depend on x), the points of S that can fail to satisfy the condition are
precisely the boundary points of S. So, intuitively, a set S in R^n is
open if it does not contain any boundary point.

On the contrary, a closed set must contain all of its boundary
points. Mathematically, a set S is closed if its complementary set
S^c = { x : x ∉ S } is open.

The definitions of open and closed sets do not change if the ball
B(x_0; ε) is replaced by the open ball B°(x_0; ε) = { x : ‖x − x_0‖_2 < ε }.
In fact, one can use a ball induced by any norm.

Alternatively, a set of R^n is closed if the limit of any convergent
sequence from the set is also in the set. Similarly, a set of R^n is open
if its complementary set is closed.

Useful sets
The set with no point is called the empty set. A set with exactly one
point is called a singleton set or singleton. The set that has all
the points in the space R^n is called the full set.

The set R^n_+ = { x : x_1, . . . , x_n ≥ 0 } is called the nonnegativity set
or the first orthant. The set R^n_{++} = { x : x_1, . . . , x_n > 0 } is called the
(strict) positivity set.

A box set or box refers to the set { x : l_i ≤ x_i ≤ u_i, i = 1, . . . , n }
for two given vectors l, u that satisfy −∞ ≤ l_i < ∞, −∞ < u_i ≤ ∞,
and l_i ≤ u_i.

The standard simplex is the set

    Δ = { x : x ≥ 0, ∑_{i=1}^n x_i = 1 }.

Any element p of this set is a probability density vector of some n
possible events.

Convex set
A set S is convex if, for any x, y ∈ S, the line segment [x, y] = { θx +
(1 − θ)y : 0 ≤ θ ≤ 1 } is a subset of S. From any point of a convex set,
one can follow a straight line completely within the set to reach any
other point of the set.

All the sets defined above (including the empty set and singletons)
are convex.

A convex set can be formed by convexly combining a set of vectors
x_1, . . . , x_m ∈ R^n:

    conv{x_1, . . . , x_m} = { ∑_{i=1}^m θ_i x_i : θ ∈ Δ },

where θ = [θ_1 θ_2 ⋯ θ_m]^T ∈ R^m. This set is called the convex hull
of the vectors. (Recall the linear combination allows all θ ∈ R^m. The rarely
used affine combination restricts θ to ∑_{i=1}^m θ_i = 1 but θ is not necessarily
nonnegative.)

Convex sets remain convex after pre- and post-linear transforms:
let S ⊆ R^n be a convex set, b be a point, and A ∈ R^{m×n} and
B ∈ R^{n×q} be matrices. Then,

1. the set AS + b = { Ax + b : x ∈ S } is convex (here b ∈ R^m),

2. the set { y : By + b ∈ S } is convex (here b ∈ R^n).

By properly choosing A, B, b, we can conclude that the translation, rotation,
reflection, and affine projection of a convex set are convex sets.

Function

Domain and extended value


The set of vectors where a function f maps to a scalar is call the do-
main of a function, written as dom( f ). Outside its domain, a function
is not defined. However, it is often useful to define extended-value
functions as they conveniently incorporate domains of functions or
constraint sets in the objective function.
An extended-valued function is a function maps to the extended
real line, R = R {, }. By default, a function in this book is
extended-valued.
Typically, we let f (x) = for all x 6 dom( f ) in the context
of minimization problems. A function is proper if it never maps to
and not everywhere to . Equivalently, a function f is proper if
dom( f ) is nonempty.
Let S ⊆ R^n. The indicator function

    ι_S(x) = { 0,  if x ∈ S,
             { ∞,  otherwise,

is an extended-valued function. Indicator functions allow us to write
constraints x ∈ S into the objective function and treat the problem
as an unconstrained minimization problem. Although this treatment
does not simplify the problem, it makes the application of certain
algorithms more convenient.
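An indicator function is straightforward to code (a sketch; the set S, the helper name `indicator`, and the objective f are made-up examples, not from the text):

```python
import math

def indicator(S):
    """Return the extended-valued indicator function of the set S,
    where S is given as a membership predicate."""
    def iota(x):
        return 0.0 if S(x) else math.inf
    return iota

# Hypothetical example: S is the nonnegative orthant in R^2.
iota_S = indicator(lambda x: all(xi >= 0 for xi in x))

def f(x):                       # some objective; here f(x) = x1 + x2
    return x[0] + x[1]

# Minimizing f over S is the same as minimizing f + iota_S over all of R^2:
# infeasible points are assigned the value infinity.
print(f((1, 2)) + iota_S((1, 2)))    # 3.0  (feasible point)
print(f((-1, 2)) + iota_S((-1, 2)))  # inf  (infeasible point)
```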
The following conventions are used with ∞: for any α ∈ R, α < ∞,
α + ∞ = ∞, α − ∞ = −∞, ∞ + ∞ = ∞, and ∞ > −∞.
Limit and continuity
The limit lim_{x→x_0} f(x), if it exists, is a value ℓ ∈ R̄ such that, for each
ε > 0 (no matter how small), there exists δ > 0 such that the inequality
|f(x) − ℓ| < ε holds for all points x ∈ B(x_0, δ) \ {x_0}.

A function f is continuous at x_0 ∈ dom(f) if f(x_0) = lim_{x→x_0} f(x).
Note that x can approach x_0 from any direction, so this excludes x_0
on the boundary of dom(f). When dom(f) = R^n and f is continuous
everywhere, we say this function is continuous.

When analyzing a function, semi-continuity is a very useful concept,
which is weaker than continuity. It is often assumed so that a
minimizer of the function can be attained. Let x_1, x_2, . . . ∈ dom(f) be
a sequence of points such that x_0 = lim_{i→∞} x_i. An extended-valued
function f is lower semi-continuous at x_0 if f(x_0) ≤ lim inf_{i→∞} f(x_i).
Roughly speaking, the values of f near x_0 are either close to, or greater
than, f(x_0). Technically, both sides of ≤ can be ∞ or −∞.
Related to semi-continuity is the closedness of a function. A function
is closed if its epigraph

    epi(f) = { (x, α) ∈ R^n × R : f(x) ≤ α }

is a closed set.

A function is lower semi-continuous if and only if it is closed.
Therefore, the two properties are used interchangeably. If a function
f has a closed domain dom(f) and is continuous on dom(f), then
the function is closed. In addition, closedness is preserved in the
following settings: for any closed functions f_1, f_2, . . . , f_m,

1. f_i(Ax + b) is a closed function of x, where the matrix A and vector
b are arbitrarily given;

2. ∑_{i=1}^m λ_i f_i(x) is a closed function of x for λ_1, . . . , λ_m ≥ 0;

3. max{ f_1(x), . . . , f_m(x) } is a closed function of x.

The Weierstrass theorem provides a sufficient condition for one
to attain the minimizer of a function: if f is a proper closed function
and S is a bounded closed set such that dom(f) ∩ S is nonempty,
then there exists a point x_0 ∈ dom(f) ∩ S such that

    f(x_0) = min{ f(x) : x ∈ dom(f) ∩ S }.

Derivative, gradient, Hessian
Let φ be a function that maps from R to R. Its first and second derivatives,
if they exist, are defined as, respectively,

    (dφ/dλ)(λ) = lim_{ε→0} (1/ε) ( φ(λ + ε) − φ(λ) ),

    (d²φ/dλ²)(λ) = lim_{ε→0} (1/ε) ( (dφ/dλ)(λ + ε) − (dφ/dλ)(λ) ).

Partial derivatives, defined for functions that map from R^n to R and
written as ∂_{x_i} f, are derivatives taken with respect to the individual
variables (while keeping the remaining variables fixed). The gradient
of a function f : R^n → R is written as

    ∇f(x) = [ ∂_{x_1} f(x)  ∂_{x_2} f(x)  ⋯  ∂_{x_n} f(x) ]^T,

which is a column vector in R^n.


The Hessian of f is the matrix of second partial derivatives, written
and defined as:

    ∇²f(x) = [ ∂²f/∂x_1²(x)      ∂²f/∂x_1∂x_2(x)  ⋯  ∂²f/∂x_1∂x_n(x) ]
             [ ∂²f/∂x_2∂x_1(x)  ∂²f/∂x_2²(x)      ⋯  ∂²f/∂x_2∂x_n(x) ]
             [       ⋮                 ⋮                    ⋮          ]
             [ ∂²f/∂x_n∂x_1(x)  ∂²f/∂x_n∂x_2(x)  ⋯  ∂²f/∂x_n²(x)     ]

The matrix exists when f is twice continuously differentiable, and it is
symmetric since ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i for any 1 ≤ i, j ≤ n.
The chain rules are often used. Some specific examples are left as
exercises.

1. ∇_x f(Ax + b) = A^T ∇f(Ax + b).

2. Let x ≠ 0. Then ∇‖x‖_2 = x/‖x‖_2.

3. Let x be such that x_i ≠ 0 for all i. Then ∇‖x‖_1 = [sign(x_1) ⋯ sign(x_n)]^T.

4. ... matrix functions involving norms, trace, and matrix dot product.
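Gradient formulas like 1 and 2 above can be spot-checked against finite differences (a sketch using NumPy; the helper `num_grad` and the random A, b, x are made-up examples):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(5)
x = rng.standard_normal(4)

# Check grad ||x||_2 = x / ||x||_2 at a nonzero point.
g_exact = x / np.linalg.norm(x)
assert np.allclose(num_grad(np.linalg.norm, x), g_exact, atol=1e-5)

# Check grad_x f(Ax + b) = A^T grad f(Ax + b) with f = ||.||_2.
A = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
h = lambda v: np.linalg.norm(A @ v + b)
g_exact = A.T @ ((A @ x + b) / np.linalg.norm(A @ x + b))
assert np.allclose(num_grad(h, x), g_exact, atol=1e-5)
print("gradient formulas verified")
```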

Convex function
Convex functions play an important role in optimization because of
the convenient properties they have. They are easier to analyze and
to minimize.
A function f is convex if

    f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y),  for all x, y ∈ R^n, θ ∈ [0, 1].

Roughly speaking, a convex function is "bowl" shaped. Because we
evaluate f at θx + (1 − θ)y, the domain of a convex function is a
convex set.

A function is convex if and only if, for any set of points x_1, . . . , x_m
and θ ∈ Δ, the inequality

    f( ∑_{i=1}^m θ_i x_i ) ≤ ∑_{i=1}^m θ_i f(x_i)     (3)

holds.

The inequality (3) is related to Jensen's inequality. Let X be a random
variable and E denote the corresponding expectation. If f is
convex, then

    f(EX) ≤ E f(X).     (4)

When X has m possible outcomes x_1, . . . , x_m with the corresponding
probabilities θ_1, . . . , θ_m, then (4) reduces to (3).
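The discrete form of Jensen's inequality can be demonstrated numerically (a sketch using NumPy; the convex function f(t) = t² and the random outcomes are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(6)
f = np.square                      # f(t) = t^2 is convex

m = 5
x = rng.standard_normal(m)         # m possible outcomes of X
theta = rng.random(m)
theta /= theta.sum()               # probabilities: theta >= 0, sum to 1

# f(E X) <= E f(X), i.e., f(sum theta_i x_i) <= sum theta_i f(x_i)
lhs = f(np.dot(theta, x))
rhs = np.dot(theta, f(x))
assert lhs <= rhs
print("Jensen's inequality holds")
```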
Theorem 1. Every local minimizer of a convex function is its global minimizer.

The proof of this theorem is left as an exercise.

The convexity of a function or a set of functions is preserved under
linear transformations, nonnegative summation, point-wise maximum,
projection, etc. All norms are convex functions. The maximal
distance to a bounded set (not necessarily convex) is a convex function,
and so is the minimal distance to a convex set. The reader is referred
to Boyd-Vandenberghe for these results.

Note that we say a function f(x_1, . . . , x_m) is convex if it is convex
with respect to all the variables jointly. There are functions f that
are convex only in each variable while the values of the remaining
variables are fixed. We call these functions multi-convex (or bi-convex
if m = 2). A multi-convex function is not necessarily convex.

If f(x, y) is a convex function and is proper in y for each fixed x,
then the function g(x) = min_y f(x, y) is a convex function.

A convex function must be continuous in its domain with possible
exceptions only at the boundary of its domain. A discontinuity in
the interior of its domain would cause a contradiction to convexity,
so it cannot occur. Furthermore, a convex function is locally Lipschitz in the
interior of its domain.

Exercises

1. CZ2.1: Let A ∈ R^{m×n} and rank(A) = n. Show that m ≥ n.

2. Consider A ∈ R^{m×n}, b ∈ R^m and the linear system Ax = b.
Show that the system has a solution if and only if rank(A) =
rank([A b]). CZ2.2: furthermore, the solution is unique if rank(A) =
rank([A b]) = n.

3. CZ2.9: Show that for any x, y ∈ R^n, ‖x‖_2 − ‖y‖_2 ≤ ‖x − y‖_2.
The same holds for other norms: ‖·‖_1, ‖·‖_∞, ‖·‖_G, where G is a
symmetric positive semi-definite matrix. (Check the definition of
norm. It will be ideal to prove it using the properties that all these
norms share.) When does the equality hold?

4. Directly verify (2). Do not use the cosine rule.

5. Consider the function f(x_1, x_2, x_3) = 4x_1² + x_2² + 4x_3² + x_1x_2 +
3x_1x_3 + 2x_2x_1 + x_2x_3 − x_3x_1 + x_2x_3. Find a symmetric matrix A
such that f(x_1, x_2, x_3) = x^T A x, where x = [x_1 x_2 x_3]^T. Is the matrix
A unique? Is A positive (semi)definite?

6. Find the rank and nullspace of

    A = [ 2  2  8 ]
        [ 2  3  2 ]
        [ 3  1  4 ]

7. Verify that if G is a symmetric positive definite matrix, then the
inner product

    ⟨x, y⟩_G = x^T G y

satisfies the following properties of an inner product:

(a) ⟨x, x⟩_G ≥ 0, and = holds if and only if x = 0.

(b) ⟨x, y⟩_G = ⟨y, x⟩_G.

(c) ⟨x, y + z⟩_G = ⟨x, y⟩_G + ⟨x, z⟩_G.

(d) for any α ∈ R, ⟨αx, y⟩_G = α⟨x, y⟩_G.

What if G is asymmetric? What if G is symmetric but only positive
semidefinite?

8. Prove the inequalities (1) and specify when each inequality holds
with =.

9. Prove Theorem 1.

10. Prove that a set is convex if and only if the intersection of the set
with any line is convex.

11. Let S_1, S_2 ⊆ R^n be two convex sets. Prove that the following sets
are convex:

(a) Intersection: S_1 ∩ S_2.

(b) Minkowski sum: S_1 + S_2 = { x + y : x ∈ S_1, y ∈ S_2 }.

(c) Partial intersection/sum:

    { (x, y + z) : x ∈ R^{n_1}, y, z ∈ R^{n_2}, (x, y) ∈ S_1, (x, z) ∈ S_2 },

where n_1 + n_2 = n.

12. Let A ∈ R^{n×n}, b ∈ R^n, c ∈ R. Among the definitions introduced
in this chapter, which can ensure the convexity of the following
set:

    { x ∈ R^n : x^T A x + b^T x + c ≤ 0 }.

What if we replace ≤ by ≥?

13. Let ‖·‖ be a norm. Let A ∈ R^{m×n}, b ∈ R^m. Prove that ‖Ax − b‖
is a convex function of x.

14. A function is convex if and only if it is so when restricted to
any line. Use this property to show that a differentiable function
f : R^n → R is convex if and only if dom f is convex and

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩,  ∀ x, y ∈ dom f.     (5)

(The proof is given in Boyd-Vandenberghe 3.1.3.) Can we replace
(5) by the following condition?

    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ 0,  ∀ x, y ∈ dom f.
