
Least squares and the singular value decomposition
Ivan Markovsky
University of Southampton
Outline
QR and SVD decompositions
Least squares and least norm problems
Extensions of the least squares problem
Recursive
Multiobjective
Regularized
Constrained
QR and SVD decompositions
Orthonormal set of vectors
Consider a finite set of vectors $Q := \{ q_1, \dots, q_k \} \subset \mathbb{R}^n$

$Q$ is orthogonal: $\langle q_i, q_j \rangle := q_i^\top q_j = 0$, for all $i \neq j$

$Q$ is normalized: $\|q_i\|_2^2 := \langle q_i, q_i \rangle = 1$, $i = 1, \dots, k$

$Q$ is orthonormal: $Q$ is orthogonal and normalized

With $Q := \begin{bmatrix} q_1 & \cdots & q_k \end{bmatrix}$: $Q$ orthonormal $\iff Q^\top Q = I_k$

Properties:

orthonormal vectors are independent

multiplication with $Q$ preserves inner product and norm:
$$\langle Qz, Qy \rangle = z^\top Q^\top Q y = z^\top y = \langle z, y \rangle$$
Orthogonal projectors
Consider an orthonormal set $Q := \{ q_1, \dots, q_k \}$ and $L := \mathrm{span}(Q) \subseteq \mathbb{R}^n$.

$Q$ is an orthonormal basis for $L$.

With $Q := \begin{bmatrix} q_1 & \cdots & q_k \end{bmatrix}$, $Q^\top Q = I_k$; however, for $k < n$, $QQ^\top \neq I_n$.

$\Pi_{\mathrm{span}(Q)} := QQ^\top$ is an orthogonal projector onto $\mathrm{span}(Q)$, i.e.,
$$\Pi_L\, x = \arg\min_y \|x - y\|_2 \quad \text{subject to} \quad y \in L$$

Properties: $\Pi^2 = \Pi$, $\Pi = \Pi^\top$ (necessary and sufficient for an orthogonal projector)

$\Pi^\perp := I - \Pi$ is also an orthogonal projector; it projects onto $\big(\mathrm{col\,span}(\Pi)\big)^\perp \subseteq \mathbb{R}^n$, the orthogonal complement of the column span of $\Pi$.
Orthonormal basis for $\mathbb{R}^n$

Consider an orthonormal set $Q := \{ q_1, \dots, q_k \} \subset \mathbb{R}^n$ of $k = n$ vectors.

Then $Q := \begin{bmatrix} q_1 & \cdots & q_n \end{bmatrix}$ is called orthogonal and satisfies $Q^\top Q = I_n$.

It follows that $Q^{-1} = Q^\top$ and
$$QQ^\top = \sum_{i=1}^n q_i q_i^\top = I_n$$

Expansion in an orthonormal basis: $x = QQ^\top x$

$a := Q^\top x$ gives the coordinates of $x$ in the basis $Q$

$x = Qa$ reconstructs $x$ from the coordinates $a$

Geometrically, multiplication by $Q$ (and $Q^\top$) is a rotation.
Gram-Schmidt (G-S) procedure
Given an independent set $a_1, \dots, a_k \in \mathbb{R}^n$,
G-S produces an orthonormal set $q_1, \dots, q_k \in \mathbb{R}^n$ such that
$$\mathrm{span}(a_1, \dots, a_r) = \mathrm{span}(q_1, \dots, q_r), \quad \text{for all } r \leq k$$

G-S procedure: Let $q_1 := a_1 / \|a_1\|_2$. At the $i$th step, $i = 2, \dots, k$:

orthogonalize $a_i$ w.r.t. $q_1, \dots, q_{i-1}$:
$$v_i := \big( I - \Pi_{\mathrm{span}(q_1, \dots, q_{i-1})} \big)\, a_i \qquad \text{(the projection of } a_i \text{ on } \big(\mathrm{span}(q_1, \dots, q_{i-1})\big)^\perp\text{)}$$

normalize the result: $q_i := v_i / \|v_i\|_2$
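As an illustration (not part of the original slides), a minimal numpy sketch of the procedure above; the tolerance `tol` used to decide that $v_i = 0$ is an assumption, and columns with $v_i = 0$ are skipped, anticipating the modified procedure described below.

```python
import numpy as np

def gram_schmidt(A, tol=1e-12):
    """Orthonormalize the columns of A by the G-S procedure.

    Returns Q whose columns are an orthonormal basis of col span(A).
    Columns that are numerically dependent on the previous ones are skipped.
    """
    Q = []
    for a in A.T:                       # loop over the columns a_1, ..., a_k
        v = a.astype(float)
        for q in Q:                     # subtract the projection on span(q_1, ..., q_{i-1})
            v = v - (q @ a) * q
        if np.linalg.norm(v) > tol:     # skip (numerically) dependent columns
            Q.append(v / np.linalg.norm(v))
    return np.column_stack(Q)

A = np.random.default_rng(0).standard_normal((6, 3))
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: the columns of Q are orthonormal
```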
QR decomposition
The G-S procedure gives as a byproduct scalars $r_{ji}$, $j \leq i$, $i = 1, \dots, k$, such that
$$a_i = (q_1^\top a_i)\, q_1 + \cdots + (q_{i-1}^\top a_i)\, q_{i-1} + \|v_i\|_2\, q_i = r_{1i}\, q_1 + \cdots + r_{ii}\, q_i$$

In matrix form, G-S produces the matrix decomposition
$$\underbrace{\begin{bmatrix} a_1 & a_2 & \cdots & a_k \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} q_1 & q_2 & \cdots & q_k \end{bmatrix}}_{Q} \underbrace{\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1k} \\ 0 & r_{22} & \cdots & r_{2k} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & r_{kk} \end{bmatrix}}_{R}$$

with orthonormal $Q \in \mathbb{R}^{n \times k}$ and upper triangular $R \in \mathbb{R}^{k \times k}$.
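In practice the factorization is computed by a library routine rather than by G-S. A small numpy check (an illustration added here, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # tall matrix, full column rank with probability one
Q, R = np.linalg.qr(A)                   # reduced QR: Q is 5x3, R is 3x3

print(Q.shape, R.shape)                  # (5, 3) (3, 3)
print(np.allclose(Q.T @ Q, np.eye(3)))   # Q has orthonormal columns
print(np.allclose(R, np.triu(R)))        # R is upper triangular
print(np.allclose(A, Q @ R))             # A = Q R
```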
If $a_1, \dots, a_k$ are dependent, then $v_i := \big( I - \Pi_{\mathrm{span}(q_1, \dots, q_{i-1})} \big) a_i = 0$ for some $i$.

Conversely, if $v_i = 0$ for some $i$, then $a_i$ is linearly dependent on $a_1, \dots, a_{i-1}$.

Modified G-S procedure: when $v_i = 0$, skip to the next input vector $a_{i+1}$.

$\implies$ $R$ is in upper staircase form (the empty elements are zeros).
Full QR
$$A = \underbrace{\begin{bmatrix} Q_1 & Q_2 \end{bmatrix}}_{\text{orthogonal}} \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$

$\mathrm{col\,span}(A) = \mathrm{col\,span}(Q_1)$

$\big(\mathrm{col\,span}(A)\big)^\perp = \mathrm{col\,span}(Q_2)$

Procedure for finding $Q_2$:
complete $A$ to a full-rank matrix, e.g., $A_{\rm m} := \begin{bmatrix} A & I \end{bmatrix}$, and apply G-S on $A_{\rm m}$.

Application: complete an orthonormal matrix $Q_1 \in \mathbb{R}^{n \times k}$ to an orthogonal matrix $Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \in \mathbb{R}^{n \times n}$ (by computing the full QR of $\begin{bmatrix} Q_1 & I \end{bmatrix}$).
Singular value decomposition (SVD)
The SVD is used as both a computational and an analytical tool.

Any $m \times n$ matrix $A$ of rank $r$ has a reduced SVD
$$A = \underbrace{\begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix}}_{U_1} \underbrace{\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}}_{\Sigma_1} \underbrace{\begin{bmatrix} v_1 & \cdots & v_r \end{bmatrix}^\top}_{V_1^\top}$$

where $U_1$ and $V_1$ are orthonormal

$\sigma_1 \geq \cdots \geq \sigma_r$ are called singular values

$u_1, \dots, u_r$ are called left singular vectors

$v_1, \dots, v_r$ are called right singular vectors
Full SVD: $A = U \Sigma V^\top$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal and
$$\Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \text{where } \Sigma_1 = \mathrm{diag}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}$$
(the zero blocks have $n - r$ columns and $m - r$ rows).

Note that the singular values of $A$ are
$$\sigma(A) := \big\{ \sigma_1, \dots, \sigma_r, \underbrace{0, \dots, 0}_{\min(n-r,\, m-r)} \big\}$$

$\sigma_{\min}(A)$: the smallest singular value of $A$

$\sigma_{\max}(A)$: the largest singular value of $A$
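A quick numpy illustration of the reduced ("thin") versus full SVD (an addition, not from the slides; `full_matrices` selects the form):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

# full SVD: U is 6x6, V is 4x4, the singular values are returned as a vector
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)        # (6, 6) (4,) (4, 4)

# "thin" SVD: U1 has min(m, n) = 4 columns; for a full-rank A this is the reduced SVD
U1, s1, V1t = np.linalg.svd(A, full_matrices=False)
Sigma1 = np.diag(s1)
print(np.allclose(A, U1 @ Sigma1 @ V1t))  # A = U1 Sigma1 V1^T
```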
Proof of existence of an SVD
The proof is constructive and uses induction. W.l.o.g. assume $m \geq n$.

End of induction: a vector $A \in \mathbb{R}^{m \times 1}$ has the (unique) SVD
$$A = U \Sigma V^\top, \quad \text{with} \quad U := A / \|A\|_2, \quad \Sigma := \|A\|_2, \quad V := 1$$

Inductive step: choose $v_i \in \mathbb{R}^n$ with $\|v_i\|_2 = 1$ achieving $\|A_i v_i\|_2 = \|A_i\|_2$ and let
$$A_i v_i =: \sigma_i u_i, \quad \text{where} \quad \sigma_i := \|A_i\|_2$$

Complete $v_i$ and $u_i$ to orthogonal matrices (QR decomposition)
$$V_i := \begin{bmatrix} v_i & \tilde V_i \end{bmatrix} \quad \text{and} \quad U_i := \begin{bmatrix} u_i & \tilde U_i \end{bmatrix}$$

We have that, for certain $w \in \mathbb{R}^{n-1}$ and $A_{i+1} \in \mathbb{R}^{(m-1) \times (n-1)}$,
$$U_i^\top A_i V_i = \begin{bmatrix} \sigma_i & w^\top \\ 0 & A_{i+1} \end{bmatrix}$$

Next we show that $w = 0$.
Proof of existence of an SVD

$$\sigma_i^2 = \|A_i\|_2^2 = \|U_i^\top A_i V_i\|_2^2 = \max_{v \neq 0} \frac{\|U_i^\top A_i V_i\, v\|_2^2}{\|v\|_2^2} \geq \frac{\Big\| U_i^\top A_i V_i \begin{bmatrix} \sigma_i \\ w \end{bmatrix} \Big\|_2^2}{\Big\| \begin{bmatrix} \sigma_i \\ w \end{bmatrix} \Big\|_2^2} = \frac{1}{\sigma_i^2 + w^\top w} \left\| \begin{bmatrix} \sigma_i^2 + w^\top w \\ A_{i+1} w \end{bmatrix} \right\|_2^2 \geq \frac{(\sigma_i^2 + w^\top w)^2}{\sigma_i^2 + w^\top w} = \sigma_i^2 + w^\top w$$

The inequality $\sigma_i^2 \geq \sigma_i^2 + w^\top w$ can be true only when $w = 0$. $\blacksquare$
Geometric fact motivating the SVD
The image of the unit ball under a linear map is a hyperellipsoid.
$$\underbrace{\begin{bmatrix} 1.00 & 1.50 \\ 0 & 1.00 \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} 0.89 & -0.45 \\ 0.45 & 0.89 \end{bmatrix}}_{U} \underbrace{\begin{bmatrix} 2.00 & 0 \\ 0 & 0.50 \end{bmatrix}}_{\Sigma} \underbrace{\begin{bmatrix} 0.45 & 0.89 \\ -0.89 & 0.45 \end{bmatrix}}_{V^\top}$$

[Figure: the unit circle, with the right singular vectors $v_1$, $v_2$ marked, is mapped by $A$ to an ellipse with semi-axes $\sigma_1 u_1$ and $\sigma_2 u_2$.]
Low-rank approximation
Given

a matrix $A \in \mathbb{R}^{m \times n}$, $m \geq n$, and

an integer $r$, $0 < r < n$,

find
$$\widehat{A}^* := \arg\min_{\widehat{A}} \|A - \widehat{A}\| \quad \text{subject to} \quad \mathrm{rank}(\widehat{A}) \leq r$$

Interpretation:
$\widehat{A}^*$ is an optimal rank-$r$ approximation of $A$ w.r.t. the norm $\|\cdot\|$, e.g.,
$$\|A\|_{\rm F}^2 := \sum_{i=1}^m \sum_{j=1}^n a_{ij}^2 \qquad \text{or} \qquad \|A\|_2 := \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}$$
Solution via SVD

$$\widehat{A}^* := \arg\min_{\widehat{A}} \|A - \widehat{A}\|_{\rm F} \quad \text{subject to} \quad \mathrm{rank}(\widehat{A}) \leq r \qquad \text{(LRA)}$$

Theorem. Let $A = U \Sigma V^\top$ be the SVD of $A$ and partition
$$U =: \begin{bmatrix} U_1 & U_2 \end{bmatrix}, \qquad \Sigma =: \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \qquad V =: \begin{bmatrix} V_1 & V_2 \end{bmatrix},$$
where $U_1$ and $V_1$ have $r$ columns and $\Sigma_1$ is $r \times r$.

A solution to (LRA) is
$$\widehat{A}^* = U_1 \Sigma_1 V_1^\top$$

It is unique if and only if $\sigma_r \neq \sigma_{r+1}$.
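Before the proof, a numerical check of the theorem (an illustrative sketch, not part of the slides): truncate the SVD after $r$ terms and compare the Frobenius error with $\sqrt{\sum_{i > r} \sigma_i^2}$, the value quoted on the numerical-rank slide further below.

```python
import numpy as np

def lowrank_approx(A, r):
    """Optimal rank-r approximation of A in the Frobenius (and 2-) norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))
r = 2
Ahat = lowrank_approx(A, r)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.linalg.matrix_rank(Ahat))                 # 2
print(np.isclose(np.linalg.norm(A - Ahat, 'fro'),
                 np.sqrt(np.sum(s[r:] ** 2))))     # error = sqrt(sum of discarded sigma_i^2)
```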
Proof of the low-rank approximation theorem
Let $\widehat{A}^*$ be a solution to (LRA) and let $\widehat{A}^* = \bar U \bar\Sigma \bar V^\top$ be an SVD of $\widehat{A}^*$.
$$\|A - \widehat{A}^*\|_{\rm F} = \|\underbrace{\bar U^\top A \bar V}_{B} - \bar\Sigma\|_{\rm F} \quad \implies \quad \bar\Sigma \text{ is an optimal approximation of } B$$

Partition $B =: \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$ conformably with $\bar\Sigma =: \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}$ and observe that
$$\mathrm{rank}\Big(\begin{bmatrix} \bar\Sigma_1 & B_{12} \\ 0 & 0 \end{bmatrix}\Big) \leq r \ \text{ and } \ B_{12} \neq 0 \ \implies\ \Big\| B - \begin{bmatrix} \bar\Sigma_1 & B_{12} \\ 0 & 0 \end{bmatrix} \Big\|_{\rm F} < \Big\| B - \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \Big\|_{\rm F}$$
so that $B_{12} = 0$. Similarly $B_{21} = 0$. Observe also that
$$\mathrm{rank}\Big(\begin{bmatrix} B_{11} & 0 \\ 0 & 0 \end{bmatrix}\Big) \leq r \ \text{ and } \ B_{11} \neq \bar\Sigma_1 \ \implies\ \Big\| B - \begin{bmatrix} B_{11} & 0 \\ 0 & 0 \end{bmatrix} \Big\|_{\rm F} < \Big\| B - \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \Big\|_{\rm F}$$
so that $B_{11} = \bar\Sigma_1$. Therefore, $B = \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & B_{22} \end{bmatrix}$.
Proof of the low-rank approximation theorem
Let $B_{22} = U_{22} \Sigma_{22} V_{22}^\top$ be the SVD of $B_{22}$. Then the matrix
$$\begin{bmatrix} I & 0 \\ 0 & U_{22}^\top \end{bmatrix} B \begin{bmatrix} I & 0 \\ 0 & V_{22} \end{bmatrix} = \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & \Sigma_{22} \end{bmatrix}$$
has optimal rank-$r$ approximation $\bar\Sigma = \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}$, so that
$$\min\big(\mathrm{diag}(\bar\Sigma_1)\big) \geq \max\big(\mathrm{diag}(\Sigma_{22})\big)$$

Therefore
$$A = \bar U \begin{bmatrix} I & 0 \\ 0 & U_{22} \end{bmatrix} \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & V_{22} \end{bmatrix}^\top \bar V^\top$$
is an SVD of $A$.
Proof of the low-rank approximation theorem
SVD of $A$:
$$A = \bar U \begin{bmatrix} I & 0 \\ 0 & U_{22} \end{bmatrix} \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & V_{22} \end{bmatrix}^\top \bar V^\top$$

Then, if $\sigma_r > \sigma_{r+1}$, the rank-$r$ SVD truncation
$$\widehat{A} = \bar U \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \bar V^\top = \bar U \begin{bmatrix} I & 0 \\ 0 & U_{22} \end{bmatrix} \begin{bmatrix} \bar\Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & V_{22} \end{bmatrix}^\top \bar V^\top$$
is unique and $\widehat{A}^*$ is the unique solution of (LRA). $\blacksquare$

Note that $\widehat{A}^*$ is simultaneously optimal in any unitarily invariant norm.
Numerical rank

$$\sqrt{\textstyle\sum_{i = r+1}^{\min(m,n)} \sigma_i^2} \;=\; \min_{\widehat{A}} \|A - \widehat{A}\|_{\rm F} \quad \text{subject to} \quad \mathrm{rank}(\widehat{A}) \leq r$$
and
$$\sigma_{r+1} = \min_{\widehat{A}} \|A - \widehat{A}\|_2 \quad \text{subject to} \quad \mathrm{rank}(\widehat{A}) \leq r$$
are measures of the distance of $A$ to the manifold of rank-$r$ matrices.

In particular, $\sigma_{\min}(A)$ is the distance of $A$ to rank deficiency.

$\mathrm{rank}(A, \varepsilon) :=$ the number of singular values of $A$ larger than $\varepsilon$ is called the numerical rank of $A$.

Note that $\mathrm{rank}(A, \varepsilon)$ depends on an a priori given tolerance $\varepsilon$.
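A direct implementation of the definition (illustrative; the example matrix and the tolerances are made up):

```python
import numpy as np

def numerical_rank(A, tol):
    """Number of singular values of A that are larger than tol."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > tol))

# a nearly rank-1 matrix: the second singular value is of the order 1e-9
A = np.outer([1.0, 2.0, 3.0], [1.0, 1.0]) + 1e-9 * np.eye(3, 2)
print(numerical_rank(A, 1e-6))    # 1 (numerical rank for tolerance 1e-6)
print(numerical_rank(A, 1e-12))   # 2 (the exact rank)
```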
Pseudo-inverse $A^+ := V_1 \Sigma_1^{-1} U_1^\top \in \mathbb{R}^{n \times m}$

$\mathrm{rank}(A) = n = m \implies A^+ = A^{-1}$

$\mathrm{rank}(A) = n \implies A^+ = (A^\top A)^{-1} A^\top$

$\mathrm{rank}(A) = m \implies A^+ = A^\top (A A^\top)^{-1}$

In general, $A^+ y$ is the least squares, least norm solution of $Ax = y$.

Note that the pseudo-inverse depends on the rank of $A$.
In practice the numerical rank $\mathrm{rank}(A, \varepsilon)$ is used.

The SVD, using the numerical rank and the pseudo-inverse, is the most reliable way of solving $Ax = y$.
It should be used in cases when $A$ is ill-conditioned.
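A sketch of the truncated-SVD pseudo-inverse based on the numerical rank (the tolerance is an assumption; numpy's `np.linalg.pinv` implements the same idea through its `rcond` cutoff):

```python
import numpy as np

def pinv_trunc(A, tol=1e-10):
    """Pseudo-inverse A^+ = V1 Sigma1^{-1} U1^T, keeping only singular values > tol."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol))                  # numerical rank
    return (Vt[:r, :].T / s[:r]) @ U[:, :r].T

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
x = pinv_trunc(A) @ y                          # least squares solution of Ax = y
print(np.allclose(x, np.linalg.lstsq(A, y, rcond=None)[0]))   # True
```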
Condition number $\kappa(A) := \sigma_{\max}(A) / \sigma_{\min}(A)$

Geometrically, $\kappa(A)$ is the eccentricity of the hyperellipsoid $\{ Ax \mid \|x\|_2 = 1 \}$.

[Figure: ellipse with semi-axes $\sigma_1 u_1$ and $\sigma_2 u_2$.]

$\kappa(A)$ measures the sensitivity of $A^+ y$ to perturbations in $y$ and $A$.

For large $\kappa(A)$ (above a few thousand) $A$ is called ill-conditioned.
Least squares and least norm
Least squares
Consider an overdetermined system of linear equations $Ax = y$.

Problem: given $A \in \mathbb{R}^{m \times n}$, $m > n$, and $y \in \mathbb{R}^m$, find $x \in \mathbb{R}^n$.

For most $A$ and $y$, there is no solution $x$.

Least squares approximation:
choose $x$ that minimizes the 2-norm of the residual (equation error) $e(x) := y - Ax$;
a minimizing $x$ is called a least squares approximate solution
$$\widehat{x}_{\rm ls} := \arg\min_x \| \underbrace{y - Ax}_{e(x)} \|_2$$
Geometric interpretation: project $y$ onto the image of $A$
($\widehat{y}_{\rm ls} := A \widehat{x}_{\rm ls}$ is the projection)
$$e_{\rm ls} := y - A \widehat{x}_{\rm ls} = y - \widehat{y}_{\rm ls} \in \mathbb{R}^m$$

[Figure: $y$ decomposed as $\widehat{y}_{\rm ls} \in \mathrm{col\,span}(A)$ plus the orthogonal residual $e_{\rm ls}$.]
$$A \widehat{x}_{\rm ls} = \widehat{y}_{\rm ls} \quad \iff \quad \begin{bmatrix} A & \widehat{y}_{\rm ls} \end{bmatrix} \begin{bmatrix} \widehat{x}_{\rm ls} \\ -1 \end{bmatrix} = 0 \quad \iff \quad \begin{bmatrix} a_i^\top & \widehat{y}_{{\rm ls},i} \end{bmatrix} \begin{bmatrix} \widehat{x}_{\rm ls} \\ -1 \end{bmatrix} = 0, \ \text{ for } i = 1, \dots, m$$
($a_i^\top$ is the $i$th row of $A$)

$(a_i, \widehat{y}_{{\rm ls},i})$, for all $i$, lies on the subspace perpendicular to $(\widehat{x}_{\rm ls}, -1)$

data point: $(a_i, y_i) = (a_i, \widehat{y}_{{\rm ls},i}) + (0, e_{{\rm ls},i})$

the approximation error $(0, e_{{\rm ls},i})$ is the vertical distance from $(a_i, y_i)$ to the subspace
Another geometric interpretation of the LS approximation:
[Figure: in $\mathbb{R}^{n+1} = \mathbb{R}^n \times \mathbb{R}$, the data point $(a_i, y_i)$ lies at vertical distance $e_{{\rm ls},i}$ from $(a_i, \widehat{y}_{{\rm ls},i})$, which lies on the subspace $\mathrm{null}\big(\begin{bmatrix} \widehat{x}_{\rm ls}^\top & -1 \end{bmatrix}\big)$; the vector $(\widehat{x}_{\rm ls}, -1)$ is normal to this subspace.]
Notes
Assuming $m \geq n = \mathrm{rank}(A)$, i.e., $A$ is full column rank,
$$\widehat{x}_{\rm ls} = (A^\top A)^{-1} A^\top y$$
is the unique least squares approximate solution.

$\widehat{x}_{\rm ls}$ is a linear function of $y$

if $A$ is square, $\widehat{x}_{\rm ls} = A^{-1} y$

$\widehat{x}_{\rm ls}$ is an exact solution if $Ax = y$ has an exact solution

$\widehat{y}_{\rm ls} := A \widehat{x}_{\rm ls} = A (A^\top A)^{-1} A^\top y$ is a least squares approximation of $y$
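A numpy cross-check of the closed-form solution against a library solver (illustrative; in practice `lstsq` or a QR-based solve is preferred over forming $A^\top A$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 4))          # m > n, full column rank with probability one
y = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)      # (A^T A)^{-1} A^T y via the normal equations
x_lstsq  = np.linalg.lstsq(A, y, rcond=None)[0]   # library least squares solver

print(np.allclose(x_normal, x_lstsq))             # True
e = y - A @ x_lstsq
print(np.allclose(A.T @ e, 0))                    # the residual is orthogonal to col span(A)
```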
Projector onto the span of A
The $m \times m$ matrix
$$\Pi_{\mathrm{col\,span}(A)} := A (A^\top A)^{-1} A^\top$$
is the orthogonal projector onto $L := \mathrm{col\,span}(A)$.

The columns of $A$ are an arbitrary basis for $L$.
If the columns of $Q$ form an orthonormal basis for $L$, then
$$\Pi_{\mathrm{col\,span}(Q)} := Q Q^\top$$


Orthogonality principle
The least squares residual vector
$$e_{\rm ls} := y - A \widehat{x}_{\rm ls} = \underbrace{\big( I_m - A (A^\top A)^{-1} A^\top \big)}_{\Pi_{(\mathrm{col\,span}(A))^\perp}} \, y$$
is orthogonal to $\mathrm{col\,span}(A)$:
$$\langle e_{\rm ls}, A x \rangle = y^\top \big( I_m - A (A^\top A)^{-1} A^\top \big) A x = 0, \quad \text{for all } x \in \mathbb{R}^n$$

[Figure: $y$, its projection $\widehat{y}_{\rm ls}$ on $\mathrm{col\,span}(A)$, and the orthogonal residual $e_{\rm ls}$.]
Least squares via QR decomposition
Let $A = QR$ be the QR decomposition of $A$.
$$(A^\top A)^{-1} A^\top = (R^\top Q^\top Q R)^{-1} R^\top Q^\top = (R^\top R)^{-1} R^\top Q^\top = R^{-1} Q^\top$$
so that
$$\widehat{x}_{\rm ls} = R^{-1} Q^\top y \qquad \text{and} \qquad \widehat{y}_{\rm ls} := A \widehat{x}_{\rm ls} = Q Q^\top y$$

Let $A =: \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix}$ and consider the sequence of LS problems
$$A_i x_i = y, \qquad \text{where } A_i := \begin{bmatrix} a_1 & \cdots & a_i \end{bmatrix}, \ \text{ for } i = 1, \dots, n$$

Define $R_i$ as the leading $i \times i$ submatrix of $R$ and $Q_i := \begin{bmatrix} q_1 & \cdots & q_i \end{bmatrix}$. Then
$$\widehat{x}_{i,{\rm ls}} = R_i^{-1} Q_i^\top y$$
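The QR route in numpy (a sketch; `np.linalg.solve` is used on the triangular $R$ for self-containedness, although a dedicated triangular solver would be cheaper):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 5))
y = rng.standard_normal(30)

Q, R = np.linalg.qr(A)                    # reduced QR, A = Q R
x_qr = np.linalg.solve(R, Q.T @ y)        # x_ls = R^{-1} Q^T y (R is upper triangular)
y_hat = Q @ (Q.T @ y)                     # y_ls = Q Q^T y, the projection of y

print(np.allclose(x_qr, np.linalg.lstsq(A, y, rcond=None)[0]))  # True
print(np.allclose(y_hat, A @ x_qr))                              # True
```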
Least norm solution
Consider an underdetermined system $Ax = y$, with full rank $A \in \mathbb{R}^{m \times n}$ ($m < n$).

The set of solutions is
$$\{ x \in \mathbb{R}^n \mid A x = y \} = \{ x_{\rm p} + z \mid z \in \mathrm{null}(A) \}$$
where $x_{\rm p}$ is a particular solution, i.e., $A x_{\rm p} = y$.

Least norm problem:
$$x_{\rm ln} := \arg\min_x \|x\|_2 \quad \text{subject to} \quad A x = y$$
Geometric interpretation:

$x_{\rm ln}$ is the projection of $0$ onto the solution set

orthogonality principle: $x_{\rm ln} \perp \mathrm{null}(A)$

[Figure: in $\mathbb{R}^n$, the affine solution set $\mathrm{null}(A) + x_{\rm p}$ and its point $x_{\rm ln}$ closest to the origin, at distance $\|x_{\rm ln}\|_2$.]
Derivation of the solution: Lagrange multipliers
Consider the least norm problem with $A$ full rank:
$$\min_x \|x\|_2^2 \quad \text{subject to} \quad A x = y$$

Introduce Lagrange multipliers $\lambda \in \mathbb{R}^m$:
$$L(x, \lambda) = x^\top x + \lambda^\top (A x - y)$$

The optimality conditions are
$$\nabla_x L(x, \lambda) = 2x + A^\top \lambda = 0, \qquad \nabla_\lambda L(x, \lambda) = A x - y = 0$$

From the first condition, $x = -A^\top \lambda / 2$; substituting into the second gives
$$\lambda = -2 (A A^\top)^{-1} y \quad \implies \quad x_{\rm ln} = A^\top (A A^\top)^{-1} y$$
Solution via QR decomposition
Let $A^\top = Q R$ be the QR decomposition of $A^\top$.
$$A^\top (A A^\top)^{-1} = Q R (R^\top Q^\top Q R)^{-1} = Q R (R^\top R)^{-1} = Q (R^\top)^{-1}$$
is a right inverse of $A$. Then
$$x_{\rm ln} = Q (R^\top)^{-1} y$$
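Both formulas for the least norm solution, checked numerically on a random underdetermined system (illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 8))           # underdetermined: m = 3 < n = 8
y = rng.standard_normal(3)

x_ln = A.T @ np.linalg.solve(A @ A.T, y)  # x_ln = A^T (A A^T)^{-1} y

Q, R = np.linalg.qr(A.T)                  # QR decomposition of A^T
x_qr = Q @ np.linalg.solve(R.T, y)        # x_ln = Q (R^T)^{-1} y

print(np.allclose(x_ln, x_qr))            # True
print(np.allclose(A @ x_ln, y))           # x_ln solves Ax = y
print(np.allclose(x_ln, np.linalg.pinv(A) @ y))  # and equals A^+ y
```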
Extensions
Weighted least squares
Given a positive definite matrix $W \in \mathbb{R}^{m \times m}$, define the weighted 2-norm
$$\|e\|_W^2 := e^\top W e$$

Weighted least squares approximation problem:
$$\widehat{x}_{W,{\rm ls}} := \arg\min_x \|y - Ax\|_W$$

The orthogonality principle holds by defining the inner product as $\langle e, y \rangle_W := e^\top W y$, and
$$\widehat{x}_{W,{\rm ls}} = (A^\top W A)^{-1} A^\top W y$$
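A numerical check of the weighted LS formula (illustrative; the weight matrix is a randomly generated positive definite matrix, and the Cholesky-based "whitening" reformulation is a standard equivalent route):

```python
import numpy as np

rng = np.random.default_rng(12)
A = rng.standard_normal((10, 3))
y = rng.standard_normal(10)
M = rng.standard_normal((10, 10))
W = M @ M.T + 10 * np.eye(10)                     # a positive definite weight matrix

x_w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # (A^T W A)^{-1} A^T W y

# equivalent: ordinary LS after "whitening" with a Cholesky factor W = L L^T
L = np.linalg.cholesky(W)
x_chol = np.linalg.lstsq(L.T @ A, L.T @ y, rcond=None)[0]

print(np.allclose(x_w, x_chol))                   # True
```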
Recursive least squares
Let $a_i^\top$ be the $i$th row of $A$:
$$A = \begin{bmatrix} a_1^\top \\ \vdots \\ a_m^\top \end{bmatrix}$$

With this notation,
$$\|y - Ax\|_2^2 = \sum_{i=1}^m (y_i - a_i^\top x)^2$$
and
$$\widehat{x}_{\rm ls} = \widehat{x}_{\rm ls}(m) := \Big( \sum_{i=1}^m a_i a_i^\top \Big)^{-1} \sum_{i=1}^m a_i y_i$$

$(a_i, y_i)$ corresponds to a measurement;
often the measurements $(a_i, y_i)$ come sequentially (e.g., in time).
Recursive computation of
$$\widehat{x}_{\rm ls}(m) = \Big( \sum_{i=1}^m a_i a_i^\top \Big)^{-1} \sum_{i=1}^m a_i y_i$$

$P(0) = 0 \in \mathbb{R}^{n \times n}$, $q(0) = 0 \in \mathbb{R}^n$.

For $m = 0, 1, \dots$:
$$P(m+1) := P(m) + a_{m+1} a_{m+1}^\top, \qquad q(m+1) := q(m) + a_{m+1} y_{m+1}.$$

If $P(m)$ is invertible, $\widehat{x}_{\rm ls}(m) = P^{-1}(m)\, q(m)$.

Notes:

in each step, the algorithm requires inversion of an $n \times n$ matrix

$P(m)$ invertible $\implies$ $P(m')$ invertible, for all $m' > m$
Rank-1 update formula
$$(P + a a^\top)^{-1} = P^{-1} - \frac{1}{1 + a^\top P^{-1} a}\, (P^{-1} a)(P^{-1} a)^\top$$

Notes:

gives an $O(n^2)$ method for computing $P^{-1}(m+1)$ from $P^{-1}(m)$

standard methods based on dense LU, QR, or SVD for computing $P^{-1}(m+1)$ require $O(n^3)$ operations
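A sketch of recursive least squares that maintains $P^{-1}(m)$ with the rank-1 update (illustration; the initialization $P^{-1}(0) = \delta I$ with large $\delta$ is an assumption, a standard regularization used because $P(0) = 0$ is not invertible):

```python
import numpy as np

def rls(rows, ys, delta=1e6):
    """Recursive least squares: process the measurements (a_i, y_i) one at a time.

    Maintains Pinv = P(m)^{-1} via the rank-1 update formula, O(n^2) work per
    measurement.  Pinv starts at delta*I, i.e., P(0) = (1/delta) I, a small
    regularization instead of the singular P(0) = 0.
    """
    n = rows.shape[1]
    Pinv = delta * np.eye(n)
    q = np.zeros(n)
    for a, y in zip(rows, ys):
        Pa = Pinv @ a
        Pinv -= np.outer(Pa, Pa) / (1.0 + a @ Pa)   # (P + a a^T)^{-1}
        q += a * y
    return Pinv @ q

rng = np.random.default_rng(8)
A = rng.standard_normal((200, 4))
y = rng.standard_normal(200)
print(np.allclose(rls(A, y), np.linalg.lstsq(A, y, rcond=None)[0], atol=1e-4))
```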
Multiobjective least squares
Least squares minimizes the cost function $J_1(x) := \|y - Ax\|_2^2$.

Consider a second cost function $J_2(x) := \|z - Bx\|_2^2$, which we want to minimize together with $J_1$.

Usually the criteria $\min_x J_1(x)$ and $\min_x J_2(x)$ are competing.

Common example: $J_2(x) := \|x\|_2^2$, i.e., minimize $J_1$ with small $x$.

Feasible objectives:
$$\{ (\alpha, \beta) \in \mathbb{R}^2 \mid \text{there is } x \in \mathbb{R}^n \text{ such that } J_1(x) = \alpha, \ J_2(x) = \beta \}$$

Optimal trade-off curve: the boundary of the feasible objectives;
the corresponding $x$ is called Pareto optimal.
Set of Pareto optimal solutions
Example:

green area: feasible

white area: infeasible

black line: marginally feasible

Pareto optimal solutions correspond to points on the line.

[Figure: feasible region in the $(J_1, J_2)$-plane with the optimal trade-off curve as its boundary.]

For any $\gamma \geq 0$,
$$\widehat{x}(\gamma) = \arg\min_x J_1(x) + \gamma J_2(x)$$
is Pareto optimal. By varying $\gamma \in [0, \infty)$, $\widehat{x}(\gamma)$ sweeps all Pareto optimal solutions.
Regularized least squares
Tychonov regularization:
$$\widehat{x}_{\rm tych}(\gamma) = \arg\min_x \|y - Ax\|_2^2 + \gamma \|x\|_2^2$$

The solution
$$\widehat{x}_{\rm tych}(\gamma) = (A^\top A + \gamma I_n)^{-1} A^\top y$$
exists for any $\gamma > 0$, independently of the size and rank of $A$.

Trade-off between

fitting accuracy $J_1(x) = \|y - Ax\|_2^2$, and

solution size $J_2(x) = \|x\|_2^2$.
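The regularized solution computed in two equivalent ways (a sketch with an arbitrary $\gamma$); the stacked form shows that Tychonov regularization is itself an ordinary LS problem with data $\begin{bmatrix} A \\ \sqrt{\gamma} I \end{bmatrix}$ and $\begin{bmatrix} y \\ 0 \end{bmatrix}$:

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((15, 6))
y = rng.standard_normal(15)
gamma = 0.5

# closed form: (A^T A + gamma I)^{-1} A^T y
x_tych = np.linalg.solve(A.T @ A + gamma * np.eye(6), A.T @ y)

# equivalent stacked least squares problem  min || [A; sqrt(gamma) I] x - [y; 0] ||_2
A_aug = np.vstack([A, np.sqrt(gamma) * np.eye(6)])
y_aug = np.concatenate([y, np.zeros(6)])
x_stack = np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]

print(np.allclose(x_tych, x_stack))    # True
```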
Quadratically constrained least squares
Consider again the biobjective LS problem: $\min_x J_1(x)$ and $J_2(x)$.

Scalarization approach:
$$\widehat{x}_{\rm tych}(\gamma) = \arg\min_x J_1(x) + \gamma J_2(x)$$
where $\gamma \geq 0$ is a trade-off parameter.

Constrained optimization approach:
$$\widehat{x}_{\rm constr}(\delta) = \arg\min_x J_1(x) \quad \text{subject to} \quad J_2(x) \leq \delta$$
where $\delta$ is an upper bound on the $J_2$ objective.
Regularized least squares
Tychonov regularization corresponds to the scalarization approach for

fitting accuracy $J_1(x) = \|y - Ax\|_2^2$, and

solution size $J_2(x) = \|x\|_2^2$.

The constrained optimization approach leads in this case to
$$\widehat{x}_{\rm constr}(\delta) = \arg\min_x \|y - Ax\|_2^2 \quad \text{subject to} \quad \|x\|_2^2 \leq \delta^2,$$
i.e., least squares minimization over the ball $\mathcal{U}_\delta := \{ x \mid \|x\|_2^2 \leq \delta^2 \}$.

The solution of the latter problem involves a scalar nonlinear equation.
Secular equation
If $\|A^+ y\|_2^2 \leq \delta^2$, then $\widehat{x}_{\rm constr}(\delta) = A^+ y$.

If $\|A^+ y\|_2^2 > \delta^2$, then it can be shown that $\widehat{x}_{\rm constr}(\delta)$ lies on the boundary of $\mathcal{U}_\delta$, i.e., $\|\widehat{x}_{\rm constr}(\delta)\|_2^2 = \delta^2$.

The Lagrangian of
$$\min_x \|y - Ax\|_2^2 \quad \text{subject to} \quad \|x\|_2^2 = \delta^2$$
is $\|y - Ax\|_2^2 + \lambda (\|x\|_2^2 - \delta^2)$, where $\lambda$ is a Lagrange multiplier.

A necessary and sufficient optimality condition is
$$\widehat{x}_{\rm tych}^\top(\lambda)\, \widehat{x}_{\rm tych}(\lambda) = \delta^2, \qquad \text{where } \widehat{x}_{\rm tych}(\lambda) := (A^\top A + \lambda I)^{-1} A^\top y$$

The nonlinear equation in $\lambda$
$$y^\top A (A^\top A + \lambda I)^{-2} A^\top y = \delta^2$$
is called the secular equation. It has a unique positive solution because $\|\widehat{x}_{\rm tych}(\lambda)\|$ is monotonically decreasing on the interval $[0, \infty)$ and by assumption $\|\widehat{x}_{\rm tych}(0)\|_2^2 > \delta^2$.
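A sketch of the constrained problem solved through the secular equation by bisection on $\lambda$ (illustration; it relies on the monotonic decrease of $\|\widehat{x}_{\rm tych}(\lambda)\|$, and the tolerance and test data are arbitrary):

```python
import numpy as np

def constrained_ls(A, y, delta, tol=1e-10):
    """Solve min ||y - A x||_2 subject to ||x||_2 <= delta via the secular equation."""
    x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
    if np.linalg.norm(x_ls) <= delta:           # unconstrained solution is already feasible
        return x_ls

    def x_tych(lam):                            # x_tych(lambda) = (A^T A + lambda I)^{-1} A^T y
        n = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

    lo, hi = 0.0, 1.0
    while np.linalg.norm(x_tych(hi)) > delta:   # bracket: ||x_tych(lambda)|| decreases in lambda
        hi *= 2.0
    while hi - lo > tol:                        # bisection on the secular equation
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(x_tych(mid)) > delta:
            lo = mid
        else:
            hi = mid
    return x_tych(0.5 * (lo + hi))

rng = np.random.default_rng(10)
A = rng.standard_normal((12, 4))
y = rng.standard_normal(12)
x = constrained_ls(A, y, delta=0.1)
print(np.linalg.norm(x))                        # approximately 0.1: the constraint is active
```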
Total least squares (TLS)
The LS method minimizes the 2-norm of the equation error $e(x) := y - Ax$:
$$\min_{x, e} \|e\|_2 \quad \text{subject to} \quad A x = y - e$$
Alternatively, the equation error $e$ can be viewed as a correction on $y$.

The TLS method is motivated by the asymmetry of the LS method:
both $A$ and $y$ are given data, but only $y$ is corrected.

TLS problem:
$$\min_{x, \Delta A, \Delta y} \big\| \begin{bmatrix} \Delta A & \Delta y \end{bmatrix} \big\|_{\rm F} \quad \text{subject to} \quad (A + \Delta A)\, x = y + \Delta y$$
$\Delta A$ is the correction on $A$, $\Delta y$ the correction on $y$.

Frobenius matrix norm: $\|C\|_{\rm F} := \sqrt{\sum_{i=1}^m \sum_{j=1}^n c_{ij}^2}$, where $C \in \mathbb{R}^{m \times n}$
Geometric interpretation of the TLS criterion
In the case $n = 1$, the problem of solving approximately $A x = y$ is
$$\begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} x = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}, \qquad x \in \mathbb{R}$$

Geometric interpretation:
fit a line $L(x)$ passing through $0$ to the points $(a_1, y_1), \dots, (a_m, y_m)$

LS minimizes the sum of squared vertical distances from $(a_i, y_i)$ to $L(x)$

TLS minimizes the sum of squared orthogonal distances from $(a_i, y_i)$ to $L(x)$
[Figure: in $\mathbb{R}^{n+1} = \mathbb{R}^n \times \mathbb{R}$, the data point $(a_i, y_i)$ is projected orthogonally onto the point $(\widehat{a}_i, \widehat{y}_{{\rm tls},i})$, which lies on the subspace $\mathrm{null}\big(\begin{bmatrix} \widehat{x}_{\rm tls}^\top & -1 \end{bmatrix}\big)$; the vector $(\widehat{x}_{\rm tls}, -1)$ is normal to this subspace.]
Solution of the TLS problem
Let $\begin{bmatrix} A & y \end{bmatrix} = U \Sigma V^\top$ be the SVD of the data matrix $\begin{bmatrix} A & y \end{bmatrix}$ and
$$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_{n+1}), \qquad U = \begin{bmatrix} u_1 & \cdots & u_{n+1} \end{bmatrix}, \qquad V = \begin{bmatrix} v_1 & \cdots & v_{n+1} \end{bmatrix}.$$

A TLS solution of $Ax = y$ exists if and only if $v_{n+1, n+1} \neq 0$ (the last element of $v_{n+1}$)
and is unique if and only if $\sigma_n \neq \sigma_{n+1}$.

In the case when a TLS solution exists and is unique, it is given by
$$\widehat{x}_{\rm tls} = -\frac{1}{v_{n+1, n+1}} \begin{bmatrix} v_{1, n+1} \\ \vdots \\ v_{n, n+1} \end{bmatrix}$$
and the corresponding TLS corrections are
$$\begin{bmatrix} \Delta A_{\rm tls} & \Delta y_{\rm tls} \end{bmatrix} = -\sigma_{n+1}\, u_{n+1} v_{n+1}^\top$$

(Corollary of the low-rank approximation theorem stated earlier.)
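A direct implementation of the formulas above (illustrative; it assumes the generic case in which the TLS solution exists and is unique):

```python
import numpy as np

def tls(A, y):
    """Total least squares solution of A x ~ y via the SVD of [A y]."""
    m, n = A.shape
    C = np.column_stack([A, y])                 # data matrix [A y]
    U, s, Vt = np.linalg.svd(C)
    v = Vt[-1, :]                               # right singular vector v_{n+1}
    if np.isclose(v[-1], 0.0):
        raise ValueError("TLS solution does not exist (v_{n+1,n+1} = 0)")
    x_tls = -v[:n] / v[-1]
    dC = -s[-1] * np.outer(U[:, n], Vt[n, :])   # correction [dA dy] = -sigma_{n+1} u_{n+1} v_{n+1}^T
    return x_tls, dC

rng = np.random.default_rng(11)
A = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
x, dC = tls(A, y)
dA, dy = dC[:, :3], dC[:, 3]
print(np.allclose((A + dA) @ x, y + dy))        # the corrected system is consistent
```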
References
1. S. Boyd. EE263: Introduction to Linear Dynamical Systems.
2. G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins, 1996.
3. L. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997.
4. B. Vanluyten, J. C. Willems, and B. De Moor. Model reduction of systems with symmetries. In Proc. of the CDC, pages 826–831, 2005.
5. I. Markovsky and S. Van Huffel. Overview of total least squares methods. Signal Processing, 87:2283–2302, 2007.
