You are on page 1of 71

Math 243M, Numerical Linear Algebra

Lecture Notes
Plamen Koev

Contents
Chapter 1. Preliminaries
1. Motivationwhy study numerical linear algebra?
2. Matricesnotation and operations
3. Norms of vectors and matrices
4. Absolute and relative errors
5. Computer arithmetic
6. Writing for loopschasing entries around a matrix
7. Determining the complexity of an algorithm

5
5
7
10
13
15
19
20

Chapter 2. The tools of numerical linear algebra


1. The goals of numerical linear algebra
2. Examples of how the matrices Eij (x) and Si (d) work

23
23
25

Chapter 3. Solving linear systems


1. Solving a linear system
2. Gaussian Elimination
3. Pivoting
4. Cholesky decomposition
5. Forward and backward errors and stability
6. Condition numbers
7. Stability of the LU decomposition
8. Givens rotations
9. The QR algorithm
10. Least squares problems
11. Perturbation theory for Ax = b
12. Eigenvalues and eigenvectors
13. Eigenvalue algorithms
14. QR iteration
15. The Singular Value Decomposition (SVD)
16. Krylov subspace methods
17. The Discrete Fourier Transform
18. High accuracy computations

27
27
28
31
35
37
39
41
42
45
47
48
50
51
51
56
58
61
67

Bibliography

71

CHAPTER 1

Preliminaries
1. Motivationwhy study numerical linear algebra?
The unique challenges of numerical linear algebra are likely to keep it a vibrant area of research and its
techniques important despite continuing advances in computing technology. Here are some of the reasons.
The computational bottleneck in most mathematical models is a linear system or an
eigenvalue problem.
Most natural phenomena are described by differential equations. Modeling those phenomena
requires solving these differential equations which is rarely possible explicitly. Thus one attempts
to discretize and solve them numerically. This usually means picking a set of points and getting a
linear problem for the [volume, temperature, pressure, etc.] at these points.
Efficiency
Computers are slow, so to speak. Everyone expects to get answers within a reasonable amount
of time, preferably immediately. Results within seconds would be perfect, but waiting minutes
might be alright. Predicting weather only makes sense if the data comes back before we see what
the weather really is. Waiting months for computational results is unlikely to be acceptable in any
situation. So, how large of a linear system can be solved in (say) less than a day?
If were modeling a 3D phenomenon on a uniform 100-by-100-by-100 mesh, this means we need
to solve at n = 1, 000, 000 points, i.e, a million-by-million linear system. This mesh isnt all that
dense to give great approximations, but its a start. With the average unstructured algorithm
requiring O(n3 ) operations, this means (106 )3 = 1018 operations. How long will it take to perform
that many operations? Modern machines run a few billion operations per second at best, say 1010
to be optimistic. So we need 108 seconds to solve our problem. This is 3.16 years give or take.
There are a lot of assumptions in the above calculation, but you get the picture. You cant
just throw MATLABs linear solver at a problem and expect to solve a decent size problem.
There are, of course, adaptive meshes, but even those only go so far. There are structureexploiting algorithms that run in less than O(n3 ) time, but these algorithms need to be tailored
to every problem (so you need to know numerical linear algebra). There are multicore processors
and parallel computers that can perform multiple operations simultaneously, but you need to know
and understand the algorithms in order to properly divide the work among processors. This alone
is highly nontrivial task.
Efficiency is about performing the minimum number of operations to perform a given computational task. For example, x8 can be computed with 7 multiplications in the obvious way x x x,
or with 3 as ((x2 )2 )2 . Thus the obvious way to compute xn would take O(n) operations, but
the clever, repeated squaring method takes O(log n), which is a lot better. Ultimately one can
compute xn as en ln x , which takes a fixed time for any n and x, i.e., O(1) operations, which is best.
Accuracy For reasons that require its own course to fully explain and understand, computations
are performed in IEEE binary floating point arithmetic. This means most every number gets
rounded before its stored, and most every floating point computation results in a rounding error.
5

1. PRELIMINARIES

These
1 3 errors are tiny, but they can accumulate. A lot. If one tries to compute the determinant
in MATLAB, one deservedly gets 0.
3 9
>> det([1 3; 3 9])
ans = 0
0.1 0.3

If, however, one tries the determinant of one tenth of that, i.e., 0.3
0.9 , one gets
>> det([0.1 0.3; 0.3 0.9])
ans = 1.6653e-17
The problem is that 0.1, etc., are not finite binary fractions, so the 0.1 is approximated by
whatever the closest binary fraction is that is representable in the computer. The determinant of
the stored matrix then may no longer be zero, a matter further complicated by the fact that the
already rounded input quantities are subject to further rounding errors as the determinant is being
computed.
So, instead of a zero we got a tiny quantity. So we need to recognize when tiny quantities are
really zeros.
Now that its clear that:
we cant even input most quantities exactly in the computer. Most will get rounded as they
are stored;
the rounded input quantities are subject to further rounding errors during arithmetic operations.
So, what results can we possibly trust?
Computers are widely used, so we can trust a lot, but one of the main goals of this course to
learn how to recognize the abilities as well as the limitations of numerical linear algebra algorithms.

2. MATRICESNOTATION AND OPERATIONS

2. Matricesnotation and operations


A matrix will be assumed to be real unless mentioned otherwise. The standard notation for an m n
matrix is

a11 a12 a1n


a21 a22 a2n

A= .
..
.. ,
..
..
.
.
.
am1
also denoted as A =

am2

amn

[aij ]m,n
i,j=1 .

Identity matrix

I=

1
..

.
1

Its size most often will not need to be specified


Transpose

a11 a21
a12 a22

AT = .
..
..
.
a1n
T

a2n

as it will be apparent from context.

..
.

am1
am2
..
.

amn

[aji ]n,m
i,j=1 .

or, equivalently, A =
The conjugate transpose is defined as A = [
aji ]n,m
i,j=1 .
A matrix A is upper triangular if it is zero below its main diagonal, i.e., if aij = 0 for all i > j:

a11 a12 a1n

a22 a2n

A=
.. .
..

.
.
ann
A is also called unit upper triangular matrix if, in addition, it has ones on the main diagonal, i.e.,
a11 = a22 = = ann = 1. (Unit) lower triangular matrices are defined analogously.
A tridiagonal matrix is a matrix that is nonzero only on its diagonal and the first super- and
sub-diagonal, i.e., aij = 0 unless |i j| 1:

a1 b1

..
c1 a2

.
.
A=

..
..

.
. bn1
cn1
an
A matrix A is bidiagonal if it is only nonzero

a1 b1

a2
A=

only on its diagonal and the superdiagonal:

..

..

bn1
an

1. PRELIMINARIES

Matrix-vector product

a11 a12
a21 a22

..
..
..
.
.
.
am1 am2

a1n
a2n
..
.

amn

x1
x2
..
.

a11 x1 + a12 x2 + + a1n xn


a21 x1 + a22 x2 + + a2n xn
..
.

am1 x1 + am2 x2 + + amn xn

xn

Note that A is m n, x is an n-vector and the result is an m-vector.


It is very useful to think of the product Ax as a linear combination of the columns of A:

a1n
a12
a11
x1
a11 a12 a1n
a2n
a22
a21
a21 a22 a2n x2

+
x
+
x
=
x

..

.. .
..
..
..
..
..
n
2
1
..

.
.
.
.
.
.
.
am1

am2

amn

am2

am1

xn

amn

n,k
Matrixmatrix product. If A = [aij ]m,n
i,j=1 is m n and B = [bij ]i,j=1 is n k, then the product
C = [cij ]m,k
i,j=1 is an m k matrix, where

cij =

n
X

ait btj

t=1

Inverse. For a square matrix A, its inverse is A1 such that A1 A = AA1 = I.


Products of inverses and transposes. The order is reversed.
(AB)1 = B 1 A1

and

(AB)T = B T AT .

Symmetric and Hermitian matrices. A real matrix such that AT = A is called symmetric. A
(complex) matrix such that A = A is called Hermitian.
Symmetric/Hermitian positive definite matrix is a symmetric/Hermitian matrix such that x Ax > 0
for all nonzero vectors x. This condition is equivalent to all eigenvalues being positive.
Orthogonal/unitary matrices. A real m n, m n, matrix Q is orthogonal if QT Q = I. A complex
m n, m n, matrix Q is unitary if Q Q = I. In either case Q1 = Q .
Note that a matrix need not be square in order to be orthogonal/unitary, but it does need to
have at least as many rows as it has columns.
A product of orthogonal matrices is also an orthogonal matrix, namely, if P and Q are orthogonal, then (P Q)T P Q = QT P T P Q = QT Q = I.
Eigenvalues and eigenvectors. We say that is an eigenvalue of an n n matrix A if Ax = x for
some nonzero vector x, called a (right) eigenvector. A left eigenvector is one such that y T A = y T .
An n n matrix has n eigenvalues. Those are the are the roots of the characteristic polynomial
det(A I) (each taken with its multiplicity).
When A is n n and symmetric, it has n real eigenvalues and n eigenvectors. The eigenvectors
can be chosen to form an orthonormal set. When all eigenvectors are placed in an (orthogonal)
matrix Q, we have A = QT Q, where = diag (1 , 2 , . . . , n ) and 1 , 2 , . . . , n are the (real)
eigenvalues. The left and right eigenvectors corresponding to the same eigenvalue are the same.
Problems:
(1) Prove that the product of two lower triangular matrices is lower triangular.
(2) Prove that the inverse of a lower triangular invertible matrix is lower triangular.
(3) Prove that if a triangular matrix is orthogonal, then it must be diagonal. What are the elements
on the diagonal?

2. MATRICESNOTATION AND OPERATIONS

(4) Prove that if the m n, m n matrix A is orthogonal and k < n, then the m k matrix B
obtained by taking the first k columns of A is also orthogonal.
(5) Give an example or explain why, if A is m n, m > n, orthogonal, then AT is not orthogonal.
(6) Prove that if A and B are orthogonal, then AB is also orthogonal.

10

1. PRELIMINARIES

3. Norms of vectors and matrices


Well need to measure the difference or distance between vectors and between matrices (to, e.g., tell
how close the computed solution is to the actual one), so we need norms.
Vector norms. A vector norm is a function k k : Cn R such that for any n-vectors x and y
(1) kxk 0, with kxk = 0 only when x = 0;
(2) kcxk = |c| kxk for all scalars c C;
(3) kx + yk kxk + kyk.
Examples of vector norms are:
kxk1

= |x1 | + |x2 | + + |xn |

kxk

= max |xj |
j
p
= |x1 |2 + |x2 |2 + + |xn |2

kxk2

(the absolute values in the two norm are needed for complex vectors).
For the 2-norm, we have, very importantly

kxk2 = x x.
Also, if Q is orthogonal (or unitary), then
p
p

kQxk2 = (Qx) Qx = x Q Qx = x x = kxk2 ,


i.e., orthogonal matrices do not change the 2-norm!
Matrix norms. A matrix norm is a function k k : Cnn R, if it satisfies for matrices A and B:
(1) kAk 0, with kAk = 0 only when A = 0;
(2) kcAk = |c| kAk for all scalars c C;
(3) kA + Bk kAk + kBk.
If is also satisfies the fourth condition below, it is called an operator norm:
(4) kABk kAk kBk

(submultiplicativity).

Every vector norm induces a matrix norm via


kAk = max
x6=0

Equivalently (since

kAxk
kxk

kAxk
.
kxk

x
),
= A kxk
kAk = max kAxk.
kxk=1

Trivially, for any operator norm,


kAxk kAkkxk.
Any induced matrix norm is also an operator norm. For it automatically satisfies conditions (1) through
(3). For condition (4) (submultiplicativity), if x is the vector of norm 1 such that kABk = maxkxk=1 kABxk,
then
kABk = kABxk = kA(Bx)k kAk kBxk kAk kBk kxk = kAk kBk.

3. NORMS OF VECTORS AND MATRICES

11

Theorem 3.1. The following are the induced 1, 2, and norms for matrices:
X
kAk1 = max
|aij |;
j

kAk

= max
i

kAk2

|aij |;

q
|max (AT A)|.

For symmetric matrices kAk2 = |max (A)|, the absolute value of the largest by magnitude eigenvalue of A.
Proof. We prove the third part only. Let AT A = QT Q be the eigendecomposition of AT A, where
= diag (1 , . . . , n ), 1 n 0. Then, since kQxk2 = kxk2 ,

2
2
kAk2 = max kAxk2
kxk2 =1

= max xT AT Ax
kxk2 =1

= max xT QT Qx
kxk2 =1

max (Qx)T (Qx)

kQxk2 =1

= max y T y
kyk2 =1

= y12 1 + + yn2 n
1 (y12 + + yn2 )
= 1 ,
with equality for y = e1 = (1, 0, 0, . . . , 0)T .

The matrix 2-norm is unaffected by orthogonal unitary transformations for the same reason the vector
2-norm is not: kQAk2 = kAk2 .
The Frobenius norm is a matrix norm, which is not an operator norm:
v
uX
p
u n
|aij |2 = trace (A A).
kAkF t
i,j=1

It is also unaffected by orthogonal/unitary transformations:


p
p
kQAkF = trace ((QA) QA) = trace (A A) = kAkF .
Problems:
(1) Prove
that if W is a symmetric positive definite matrix, then the function defined by kxkW =

xT W x is a vector norm.
(2) Find the 1,2, and norms of the vectors

1
0
x = 1 , y = 2 .
3
4

12

1. PRELIMINARIES

(3) Find the 1 and norms of the matrices

1 1 1
A = 1 0 0 ,
3 1 7

0
B= 1
3

1 1
1 2 .
2 4

(4) Find the 2 norm of the matrices



A=

1
2

1
2


,

B=

0
2

2
3


.

(5) Prove that kIk 1 for any operator norm and that kIk = 1 for any induced norm.
(6) Prove that kQk2 = 1 for any orthogonal matrix.

4. ABSOLUTE AND RELATIVE ERRORS

13

4. Absolute and relative errors


The norms of vectors and matrices come into play when determining the error or error bounds in various
computation. What does getting the right answer mean? If we wanted to compute a single (scalar) quantity,
this would mean getting its sign and leading digits correct.
What does it mean to get the leading digits correct? It means that the true answer x and the computed
answer x
share the first few leading digits. It means that if we were to subtract x
from x we would get a
quantity that is much smaller compared to x. For example, having 4 digits in common would mean that
xx
would be at 10, 000 times smaller than x. In other words we want that


x x


(4.1)
x , where  1.
The above quantity (4.1) is called relative error between the desired quantity x and its computed counterpart x
.
1
If x and x
agree to 4 (decimal) digits, this means that the relative error is 10,000
= 104 . In general a
relative error of 10k would mean the answer is true to k decimal digits. On the other side, a relative error
of 1 (or greater) would mean that x
has no correct digits at all.
A small relative error is the gold standard in computationsit guarantees correct sign and leading
digits regardless of the magnitude of x.
In contrast a small absolute error bound
|x x
|
may or may not mean much about the error in x
depending on the relative magnitudes of x and . A value of
= 104 will mean that x
is accurate to 6 digits if x = 100 for example, but will mean nothing if x = 107 .
With vectors the gold standard is that each entry of the vector is computed with a small relative error.
In practice, however, this is difficult to achieve, so we often settle for norm accuracy. Namely, we look for
algorithms that will compute a vector x
such that
kx x
k
, where  1.
kxk
What does a small relative norm error mean for the individual entries? It means that the largest entries
of x will be guaranteed to be accurate whereas the tiny ones may or may not be.
Here is the reason: Assume for simplicity we use the infinity norm (the story is the same for the other
norms) and that

100, 000
1, 000 .
x=
10
Suppose we have a computed solution x
with a relative norm error 103 . Namely,
kx x
k
= 103 .
kxk
Since kxk = 100, 000, this means kx x
k = 105 103 = 102 .
For the individual entries x
i , i = 1, 2, 3, we have
|xi x
i | kx x
k = 102 .
This error bound means different things for x
1 , x
2 , and x
3 because of their magnitudes:
|x1 x
1 | 102 means
|x2 x
2 | 102 means
|x3 x
3 | 102 means

|x1
x1 |
|x1 |
|x2
x2 |
|x2 |
|x3
x3 |
|x3 |

102
x1
102
x2
102
x3

= 104 , i.e., x
1 is accurate to at least 4 decimal digits;
= 102 , i.e., x
2 is accurate to at least 2 decimal digits;
= 1, i.e., x
3 may have no accurate digits at all.

14

1. PRELIMINARIES

Understanding the correct implications about the number of correct digits in each computed quantity
based on absolute and relative error bounds is critical in numerical linear algebra.
Problems.
(1) If x
= 1.1234567890 1040 is computed with relative error 106 , write down all digits of x that
you can infer to be correct.
(2) If the vector

1, 000.123456
10.123456
x
=
0.01234
is known to be computed with a relative error satisfying
kx x
k
106 ,
kxk
write down the vector x and in every entry only specify the digits which you know to be correct.

5. COMPUTER ARITHMETIC

15

5. Computer arithmetic
Pretty much all computers nowadays conform to the IEEE standard for binary floating point arithmetic [1].
In the most popular double precision, the numbers take 64 bits to represent: 1 for sign, 11 for exponent,
and 52 for the fraction. Any nonzero number can be written as
(1)s 2e1023 (1.f1 f2 . . . f52 ),
where 1 e 2046 is the exponent (e = 0 and e = 2047 are specialwe discuss below) and f = f1 f2 . . . f52
are the fraction bits. Since in binary the first significant digit is 1, we get an extra, 53rd significant digit for
freeit does not need to be stored (the leading significant digit of a binary number is always 1!).
The binary floating point representation of the number 0 is e = 0, f = 0. It can have either sign. By
convention +0 = 0.
What about the remaining floating point numbers with e = 0, i.e., the ones with e = 0, f 6= 0? These
are called subnormal numbers, because the assumption of a leading significant digit 1 is dropped. In every
other aspect, these are perfectly functional floating point numbers of the form
(1)s 21022 (0.f1 f2 . . . f52 )
and serve an extremely important purpose to ensure that the smallest distance between floating point
numbers is also a floating point number. This way a 6= b means a b 6= 0. Note that even though the
exponent e = 0 it is treated as if it were 1 (thus the power of 2 is 1022, not 1023).
The exponent e = 2047 (all 1s in binary) is also special.
If e = 2047, f = 0, this is infinity or INF. It behaves like infinity in the sense that 1/INF = 0
and 1/0 =INF. Why infinity? In a well designed calculation it should never appear, right? If it
appears, something went wrong and the programmer should be notified. But what happens if this
is performed on an unmanned craft in outer space? One may as well let the calculations continue.
For example, say we have a parallel two resistor circuit where the overall resistance is
1
R= 1
1 .
+
R1
R2
Shorts shouldnt happen, but what if one did? If R1 = 0, floating point arithmetic will still correctly
return the correct overall resistance R = 0.
If e = 2047 and f 6= 0, this is NaN or Not-a-Number. It is an indication that a 0/0 or a similar
calculation occurred.
The range of floating point numbers is 21074 (the smallest subnormal number, e = 0, f1 = = f51 = 0,
f52 = 1) to 21023 (2 252 ) 21024 these are also the underflow and overflow thresholds, respectively.
When arithmetic operations are performed then the result is always the correctly rounded true answer.
In other words, one can think that each of the operations +, , , / is performed exactly, then its rounded
to the nearest floating point number.
Without belaboring the point further, we will use a simplified model in which the result of any floating
point computation will be
(5.1)

fl(a b) = (a b)(1 + ),

where ||  and {+, , , /} is one of the four arithmetic operations.


The quantity  is machine precision and in double precision floating point arithmetic  = 253 , or half
a unit in the last place. Since we are rounding to the nearest floating point number, the relative error cant
exceed half the value of the last significant bit, which is 252 (very much like rounding to integer can have
an error that does not exceed 0.5, or half the smallest integer which is 1).
So, from now on all error analysis will rely on (5.1).

16

1. PRELIMINARIES

Catastrophic cancellation in floating point. How is accuracy lost in floating point? It turns out
addition (of same sign quantities), multiplication, and division preserve the relative accuracy of intermediate
results. In other words, if we have two intermediate computed quantities a
> 0 and b > 0 accurate to (say)

10 digits, then a
+ b, a
b and a
/b will also be accurate to about 10 digits.
In contrast the difference a
b may be inaccurate, depending on how close a is to b.
This is known as subtractive cancellation and is the way accuracy is lost in numerical linear algebra.
Understanding this helps in the design of accurate algorithms.
First, we explain why addition, multiplication, and division preserve the relative accuracy. Let a
=
a(1 + 1 ) and b = b(1 + 2 ), with |1 | k and |2 | l for some modest k and l. Assume additionally, that
k l.
In the following calculations we will ignore terms which are products of 2 or more s, e.g., 1 2 , which are
O(2 ), so too tiny to matter. This is done for convenience here and does not compromise the analysisthere
is a method, which does account for all s and yields the same conclusionssee e.g., Higham [5].
In particular, we say that
1
= 1 2 + 22 = 1 1 .
1 + 1
Then the relative errors for +, , / are:
Addition: We have fl(
a + b) = (
a + b)(1 + 3 ), where |3 | . Therefore



fl(
a + b)(1 + ) (a + b)
a + b) (a + b) (

3

=




a+b
a+b


(a(1 + 1 ) + b(1 + 2 ))(1 + 3 ) (a + b)


=

a+b


a1 + b2 + (a + b)3

(weve dropped 1 3 and 2 3 terms)
=

a+b
a|1 | + b|2 | + (a + b)|3 |

since a > 0 and b > 0


a+b
ak + bl + (a + b)

a+b
= (k + 1).

Multiplication: We have fl(


ab) = (
ab)(1 + 3 ), where |3 | . Therefore



fl(

ab) ab a(1 + 1 )b(1 + 2 )(1 + 3 ) ab

=



ab
ab
= |(1 + 1 )(1 + 2 )(1 + 3 ) 1|
= |1 + 2 + 3 |
(k + l + 1).

(again, ignoring 1 2 terms, etc.)

5. COMPUTER ARITHMETIC

17

Division: We have fl(


a/b) = (
a/b)(1 + 3 ), where |3 | . Therefore



fl(

a/b) a/b a(1 + 1 )/(b(1 + 2 ))(1 + 3 ) a/b

=



a/b
a/b


1 + 1

=
(1 + 3 ) 1
1 + 2
= |(1 + 1 )(1 2 )(1 + 3 ) 1|

note that in our convention

1
1+2

= 1 2

= |1 2 + 3 |
(k + l + 1).
We see that in all three cases (+, , /) the relative errors accumulate modestly with the resulting quantity
still having about as many correct digits as the two arguments.
One also shouldnt worry about the k and l coefficients in the relative errors. In practical computations
with n n matrices these typically grow as O(n) leaving plenty of correct digits even for n in the millions.
Subtractions are tricky: following the same calculations as with addition, the relative error in subtraction
is easily seen to be



fl(

a+b
a + b) (a + b) a1 b2 + (a b)3 a|1 | + b|2 | + |a b||3 |

k + .

=



a+b
ab
|a b|
|a b|
Unlike +, , /, this relative error now depends on a and b! In particular, if a b is much smaller than a + b,
i.e., when a and b share a few significant digits, the relative error can be enormous!
For example, if a and b have 7 digits in common, e.g., a = 1.234567444444444 and b = 1.234567333333333,
then
a+b
107
|a b|
and fl(
a b) will have 7 fewer correct significant digits than either a
and b!
Problems.
(1) In double precision IEEE 754 arithmetic (e.g., in MATLAB), what is the largest floating point
number a so that fl(100 + a) = 100? Specify the exact value. For example, a = 2100 is one such
number, but is not the largest.
(2) (Hard) Prove that if a and b are binary floating point numbers, then a/b cant fall exactly half way
between two floating point numbers. In other words, there will never be an ambiguity on how to
round a fraction.
Solution: Say, its k-digit floating point arithmetic and we have 2 numbers a and b such that
c = a/b is exactly halfway between 2 floating point numbers, i.e., c has exactly k + 1 significant
digits. The first digit of c must be 1 (it always is), and the (k + 1)st must also be one so that c falls
exactly halfway between two floating point numbers. Now bc, regardless of what b is, must have at
least (k + 1) significant digits, which is impossible since bc = a which only has k significant digits.
(3) (Hard) If 2b > a > b > 0 then fl(a b) = a b, i.e., a b is an exact binary floating point number.
Solution: If a = 1.a1 a2 . . . a52 2k and b = 1.b1 b2 . . . b52 2n , then 2b > a > b implies k = n
or k = n 1.
If k = n, since a b > 0, then a and b align in the subtraction a b and a b thus cant have
more than 52 significant digits (since the leading ones will cancel), i.e., it is an exactly representable
floating point number. There is no need to round.

18

1. PRELIMINARIES

If k = n 1, then a and b are misaligned by one place. The condition 2b > a means a b < b,
i.e., the leading 1 in a will get cancelled in the subtraction, leaving only an at most 53 digit number,
which again is exactly representable.

6. WRITING FOR LOOPSCHASING ENTRIES AROUND A MATRIX

19

6. Writing for loopschasing entries around a matrix


The goal of this section is to address the issue of writing for loops that touch the entries of a matrix
in a particular order. That order will depend on what we are trying to do with the matrix, but for now we
will learn how to write loops.
If we are trying to eliminate the entries of a matrix below the diagonal, one column at a time, starting
with the (2, 1) entry, we could write for an n n matrix:
for j=1:n-1
for i=j+1:n
A(i,j)=0
pause
end
end

% columns 1 through n-1


% in column j, we go through entries j+1 through n
% for now physically just setting the entry to zero

The same idea, but go through the entries in each column starting from the bottom:
for j=1:n-1
for i=n:-1:j+1
A(i,j)=0
pause
end
end

% columns 1 through n-1


% in column j, we go through entries n up through j+1
% for now physically just setting the entry to zero

Next, we go through all subdiagonals of A starting at the very bottom lefthand corner (i.e., the entry
(n, 1)):
for i=n-1:-1:1
for j=1:n-i
A(i+j,j)=0
pause
end
end

% this is subdiagonal i, i.e., subdiagonal n-1 is the (n,1) entry, etc.


% in subdiagonal j we go through columns 1 through n-i
% for now physically just setting the entry to zero

Problems.
(1) Write a program that sets to zero the entries below the diagonal in a matrix, one row at a time,
starting with the second row. Namely, the program will set a21 = 0, a31 = 0, a32 = 0, a41 = 0, a42 =
0, a43 = 0, etc.
(2) Write a program that computes the average value of each of the 2n 1 anti diagonals of an n n
matrix A. Apply this program to the Pascal matrices pascal(n) in MATLAB and find the size
of the smallest Pascal matrix for which the largest average value of any of the anti diagonals is at
least 1010 . Turn in your program and the size n of that matrix.

20

1. PRELIMINARIES

7. Determining the complexity of an algorithm


Programs that take too long to run never get run. William Kahan
7.1. Motivation. Consider the matrix-matrix multiplication C = AB, where A, B, and C are n n.
The formula for every entry of C is
n
X
cij =
aik bkj .
k=1
2

Each of the n entries cij will require n multiplications and n additions for a total of 2n3 operations.
Is this reasonable if A = Eij (x)? Forming AB is equivalent to changing row j of B only by adding a
multiple of row i to it. Namely, we are changing n entries only bj1 , . . . , bjn using two arithmetic operations
per entry:
bjk = bjk + xbik
for a total of 2n operations.
It therefore matters tremendously if we account for the structure of our problem (and dont just blindly
call matrix-matrix multiplication) and save major computational resources!
Such decisions have huge implications on algorithms (which are otherwise mathematically equivalent)
and there is systematic way of analyzing the cost of an algorithm.
7.2. Formal analysis of the complexity of an algorithm. The analysis of the complexity of an
algorithm, we assume that each arithmetic operation {+, , , /} takes one unit of time. This is not a bad
assumption and allows us to perform complexity analysis without always referring to the latest technology.
As of this writing, on modern computes, addition, subtraction, and multiplication, do take one unit of time,
whereas division takes about 8. We ignore such peculiarities and assume all arithmetic operations (including
square roots, sines, cosines, etc.) take only one unit of time. It makes for a systemic analysis of all linear
algebra algorithms.
The complexity f (n) of an algorithm is the number of arithmetic operations that it performs. It is a
function of the problem size n. The goal is expressing f (n) as a function of n, most often in the form
f (n) = ank + O(nk1 ).
Note that:
The problem size n is usually the size of a matrix, the number of discretization points, size of the
mesh or other appropriate quantity that measures the size of the problem.
The complexity behavior matters when n gets larger, which is exactly when the lower order terms
(O(nk1 )) no longer contribute anything significant compared with ank . This is why these lower
order terms are most often ignored in the complexity analysis and an algorithm is said to have
complexity ank or simply nk .
It is common practice in numerical analysis to count each arithmetic operation (+, , , /, square
root) as 1 and compare algorithms that way. Computer technology changes rapidly so who knows
how many clock cycles each operation takes on a computer this month. At the time of this writing,
multiplication, addition, or subtraction can be done in 1 clock cycle, with division taking 6 to
8. Such performance is only possible, however, through sophisticated cache optimization and
programming techniques, which we will not address.
The complexity of an algorithm can be determined by counting the number of arithmetic operations
which can be done either analytically or experimentally.
The goal is to figure out what f (n) is as a function of n, i.e., to find a and k.

7. DETERMINING THE COMPLEXITY OF AN ALGORITHM

21

Analytically, one uses formulas such as:


n
X

i=

i=1

n(n + 1)
2

and

n
X

i2 =

i=1

n(n + 1)(2n + 1)
.
6

Experimentally, one adds a variable, e.g., flopcount, to the code and adds 1 to it every time an
arithmetic operation is performed. Then the code is run for different (reasonably large) values of n, say n1
and n2 .
The variable flopcount will return f (n1 ) and f (n2 ). Since
f (n1 ) = ank1 + O(nk1
)
1

and f (n2 ) = ank2 + O(nk1


)
2

we have for large n1 , n2


log(f (n1 )) log(ank1 ) = log(a) + k log(n1 )
log(f (n2 )) log(ank1 ) = log(a) + k log(n2 )
so
log(f (n1 )) log(f (n2 )) = k(log(n1 ) log(n2 ))
and
k=

log(f (n1 )) log(f (n2 ))


.
log(n1 ) log(n2 )

Then
a

f (n1 )
.
nk1

This relationship should improve as n1 , n2 increase.


Example 7.1. Consider the following algorithm that back solves for the solution x of a lower triangular
linear system Ax = b. An appropriate measure of the size of this problem is n, the size of the matrix. The
solution is
x1 = b1 /a11
x2 = (b2 a21 x1 )/a22
..
.
xn = (bn an1 x1 an,n1 xn1 )/ann .
So, analytically, this takes
1 + 3 + 5 + + (2n 1) =

n
n
X
X
n(n + 1)
n = n2
(2i 1) = 2
in=2
2
i=1
i=1

operations.
On the other side, experimentally, the code that solves Ax = b is
for i=1:n
x(i)=b(i);
for j=1:i-1
x(i)=x(i)-A(i,j)*x(j);
end
x(i)=x(i)/A(i,i);
end

22

1. PRELIMINARIES

Adding to the code the counter flopcount, we get


flopcount=0;
for i=1:n
x(i)=b(i);
for j=1:i-1
x(i)=x(i)-A(i,j)*x(j);
flopcount=flopcount+2;
end
x(i)=x(i)/A(i,i);
flopcount=flopcount+1;
end
Running the augmented code for some reasonable values of n, say 500 and 505, we get for flopcount
250000 and 255025, respectively. Now we have
log(255025) log(250000)
=2
k=
log(505) log(500)
and
255025
a=
=1
5052
as expected.
Thus the complexity of the algorithm is n2 .
The advantage of the experimental method is that one does not need to understand how an algorithm
works or what it does in order to analyze it.
Here is a table of the costs of conventional linear algebra algorithms:
Type of operation
Cost
Matrix-vector product
2n2
Matrix-matrix product
2n3
Solution to a triangular system
n2
Upper triangular matrix times a vector n2
Solution to a tridiagonal linear system O(n)
2 3
Computing the LU decomposition
3n
1 3
Cholesky decomposition
3n
Problems:
(1) Determine the complexity (in the form ank ) of the following code, which takes as an input a column
vector b of length m:
function x = hwComplexity(b)
m=length(b);
x(m) = 1;
for i= m-1:-1:1
x(i) = 2;
for j=i+1:m
x(i) = x(i)- 2*x(j);
end
x(i)=x(i)/2;
end

CHAPTER 2

The tools of numerical linear algebra


1. The goals of numerical linear algebra
We consider three types of problems:
Linear systems, Ax = b, A is n n;
Least squares problems, Ax = b, A is m n, m > n;
Eigenvalue and singular value problems, Ax = x.
In each of these problems, the goal would be to create zeros in the matrix A while preserving the solution
in the process.
When solving Ax = b, the goal is to reduce A to an upper triangular matrix, which is then trivial to
solve.
When solving Ax = x, the goal is again to create zeros in the lower triangular part of A through
similarity transformations, which would preserve the eigenvalues, which at the end would be on the diagonal
of A.
For this we will use two matrices only. These are the tools of our trade and are the only transformations
we will utilize to solve linear systems, least squares problems, eigenvalue, and singular value problems.
Adding a multiple of one row from another
We get to add x times row j to row i of a matrix A by multiplying A on the left by

..

.
..

Eij (x) =

x
1

..

.
1
(x is in position (i, j)). This matrix differs from the identity matrix only in the (i, j)th entry x.
Scaling a row.
We scale row i of A by d by multiplying A on the left by

..

.
Si (d) =

.
..

1
Note that both matrices Eij (x) and Si (d) invert very easily:
(Eij (x))1 = Eij (x)

and
23

(Si (d))1 = Si (d1 ).

24

For example,

1
0 1

0 0
0 3

2. THE TOOLS OF NUMERICAL LINEAR ALGEBRA

1
1
0

1
0
=
0
0

0 1
3 0 1

and

1
2

1
1

1
12

1
1

2. EXAMPLES OF HOW THE MATRICES Eij (x) AND Si (d) WORK

25

2. Examples of how the matrices Eij (x) and Si (d) work


Consider the Vandermonde matrix

(2.1)

1
1
A=
1
1

1
2
3
4

1
4
9
16

1
8
.
27
64

(1) In order to add 3 times row 2 to row 4 of A, we form the product:



1 1
1 1 1 1
1
0 1
1 2 4 8 1 2

E42 (3) V =
0 0 1
1 3 9 27 = 1 3
4 10
1 4 16 64
0 3 0 1
(2) In order to subtract 3 times row 2 from row

1
1
1
0
1

E42 (3) V =
1
0
0 1
1
0 3 0 1

1
4
9
28

4 of A, we form the product:



1
1
1 1 1
1
2
2 4 8
=
3
3 9 27 1
2 2
4 16 64

1
8
.
27
96

1 1
4 8
.
9 27
4 40

(3) If we wanted to create a zero in position (4, 1) using row 1, we would subtract 1 times row 1 from
row 4 of A by forming the product:

1 1 1 1
1 1 1 1
1
1 2 4 8 1 2 4 8
0 1

E41 (1) V =
1 3 9 27 = 1 3 9 27 .
0 0 1
0 1 12 56
1 4 16 64
1 0 0 1
(4) If we wanted to scale row 2 of A by (say)


1
1
1

S2 (2) V =
1

1
1
1

2 wed form the product:


1
1
1
1
1 1 1

2 4 8
= 2 4 8 16 .

1
3
9
27
3 9 27
1
4 16
64
4 16 64

(5) Column operations work analogously by applying the appropriate matrix Eij (x) or Si (d) on the
right. For example, if we wanted to subtract 2 times column 2 from column 3, wed form the
product

1 1
1 1
1
1 1 1 1
1 2

1 2
4 8
1 2
0 8

=
.
V E23 (2) =
1 3

9 27
1
1 3
3 27
1 4 16 64
1
1 4
8 64
Problems. For the Vandermonde matrix (2.1):
(1) Find the matrix Eij that uses row 2 to create a zero in position (4, 4).
(2) Find the matrix Eij that uses column 4 to create a zero in position (2, 1).
(3) Find the matrix that scales column 4 by 1.

CHAPTER 3

Solving linear systems


1. Solving a linear system
There are two linear systems that are particularly easy to solve: Orthogonal and triangular:
if Q is orthogonal, then the solution to Qx = b is b = QT b;
if A is triangular, then Ax = b it is easily solved via substitution. In matrix form

a11 a12 a1n


x1
b1

a22 a2n

x2 b2
= .

.
.
.
..
.. .. ..

ann
xn
bn
means
a11 x1 + a12 x2 + + a1n xn = b1
a22 x2 + + a2n xn = b2
..
.
an1,n1 xn1 + an1,n xn = bn1
ann xn = bn .
The solution to the above system is easily obtained by substitution, starting with xn and going
back
x1 = (b1 a11 x1 a1,n1 xn1 )/a11
..
.
xn1 = (bn1 an1,n xn )/an1,n1
xn = bn /ann .
Therefore, in order to solve a linear system, our first goal is to reduce it to a triangular form which is
then easy to solve.
We can do this using either the elimination matrices Eij (x) (obtaining the Gaussian elimination algorithm) or orthogonal matrices (obtaining the QR algorithm, which is also useful for solving least squares
problems).

27

28

3. SOLVING LINEAR SYSTEMS

2. Gaussian Elimination
Gaussian elimination is the process of reducing a matrix to upper triangular form using only subtraction
of a multiple of one row from another, i.e., the matrices Eij (x). The result is a decomposition
A = LU,
where L is unit lower triangular and U is upper triangular.
We start with the simplest 2 2 case. Let


a11 a12
A=
.
a21 a22
Then

E21 (l21 )A =

1
l21


1

a11
a21

a12
a22


=

a11
0

a12
a022


,

21
is the factor that we need to multiply the first row by so that when we subtract it from the
where l21 = aa11
second we get a zero in position (2, 1) and a022 = a22 a12 l21 . Therefore

1 
 


1
a11 a12
1
a11 a12
A=
=
.
l21 1
0 a022
l21 1
0 a022

We have factored A into a the product of a lower triangular matrix (well call it L) and an upper triangular
matrix U .
In the general n n case, the process is analogous we proceed one column at a time with the diagonal
entry being used as pivot to eliminate all entries in the corresponding column.
Let lij be the multiplier needed to multiply the pivot in position (j, j) so that when the jth row is
subtracted from the ith, we get a zero in position (i, j).
In particular, in order to eliminate the entries in the first column a21 through an1 we would need
multipliers li1 = ai1 /a11 , i = 2, 3, . . . , n. Forming the product
A1 = En1 (ln1 )En1,1 (ln1,1 ) E21 (l21 )A
will result in a matrix A1 whose entries in the first column below the main diagonal are zero. Moving all the
Es on the other side (and recalling that Eij (x)1 = Eij (x)) we get
A = E21 (l21 )E31 (l31 ) En1 (ln1 )A1 .
The matrix A1 is called the first Schur complement of A.
We repeat the same process for the remaining columns and get

n1
n
Y Y
(2.1)
A=
Eij (lij ) U.
j=1 i=j+1

The above decomposition (2.1) is already sufficient to solve Ax = b. It is equivalent to solving

n1
n
n1
n
Y Y
Y Y
Ux =
Eij (lij ) b =
Eij (lij ) b,
j=1 i=j+1

an upper triangular system.

j=1 i=j+1

2. GAUSSIAN ELIMINATION

It turns out that the product of all the Es in (2.1) is

1
l21
n1
n

Y Y

L=
Eij (lij ) = l31
..
j=1 i=j+1
.
ln1

29

the lower triangular matrix

1
l32
..
.

1
..
.

ln2

...

..

.
ln,n1

This fact requires no proof as it is a direct consequence of how the product of matrices is formed, but it is
highly recommended that the reader perform several LU decompositions in order to make friends with
the process.
One thus obtains the LU decomposition of A

1
u11 u12 u13 . . . u1n

l21 1
u22 u23 . . . u2n

l31 l32 1
u33 . . . u3n
A = LU =

..
..
..
..
..
..

.
.
.
.
.
.
ln1

ln2

...

ln,n1

unn

Here is the MATLAB code for the LU decomposition of A:


function [L,U]=lu143M(A);
n=size(A,1); U=A; L=eye(n);
for j=1:n-1
for i=j+1:n
L(i,j) =U(i,j)/U(j,j);
U(i,j:n) = U(i,j:n) - L(i,j)*U(j,j:n);
end
end
2.1. Storing L and U . Since the upper triangular part of L and the lower triangular part of U are
zero it makes sense to save memory by storing the nontrivial components of L where the zeros of U are.
This is the practical implementation in LAPACK. For example the LU decomposition

1
5 6 7
2 1
0 8 9
3 4 1
0 0 10
will be encoded in the matrix

5 6
2 8
3 4
Here is the MATLAB code that does just that:

7
9 .
10

function A=lu143Mexpert(A);
n=size(A,1);
for j=1:n-1
for i=j+1:n
A(i,j) =A(i,j)/A(j,j);
A(i,j+1:n) = A(i,j+1:n) - A(i,j)*A(j,j+1:n);
end
end

30

3. SOLVING LINEAR SYSTEMS

This code returns the encoded A. If L and U are needed, they can be obtained as L=eye(n)+tril(L,-1)
and U=triu(A).
2.2. Operation count for computing the LU decomposition. Referring back to the algorithm,
its cost is
n1
n
X X
2
(2(n j + 1) + 1) = n3 + O(n2 ).
3
j=1 i=j+1
2.3. Solving Ax = b using the LU decomposition. Once we have the decomposition A = LU
solving Ax = b is very easy. It requires two triangular solves: first we solve Ly = b for y and then U x = y
for x. We thus have x = U 1 y = U 1 L1 b = (LU )1 b = A1 b.
2.4. Uniqueness of the LU decomposition. If A is nonsingular, its LU decomposition is unique.
1
To see that, write A = LU = L1 U1 . Then L1
. On the left we have a unit lower triangular
1 L = U1 U
matrix (see the problems after this section) and on the right we have an upper triangular matrix. This is
only possible if both equal the identity matrix, i.e., L = L1 and U = U1 .
2.5. LDU decomposition. Sometimes its preferable to work with the so called LDU decomposition
of a matrix, which is readily obtained from the LU decomposition by factoring the diagonal out of U , leaving
U unit upper triangular:
A = LDU.
Problems.
(1) Compute the LU decomposition of the following matrices:

1 1 1
1 1 1
1 2
3
1 2 4 , 1 2 3 , 1 0 4 .
1 3 9
1 3 6
2 10
2
(2) Compute the LU decomposition of
the 4 4 Pascal matrix (pascal(4) in MATLAB)
the Vandermonde matrix

1 1 1 1
1 2 4 8

1 3 9 27 .
1 4 16 64
(3) Prove that the product of two unit lower triangular matrices is also unit lower triangular.
(4) Prove that the inverse of a unit lower triangular matrix is also unit lower triangular.
(5) Compute the LDU decomposition of the matrix

2 2 4
2
5 5 .
4
5 39

3. PIVOTING

31

3. Pivoting
3.1. The need for pivoting. One faces the obvious problem of not being able to start Gaussian
elimination if the (1, 1) entry is 0, e.g., the matrix


0
1

1
1


.

Serious problems also occur if the 0 is replaced by a very small number, e.g., 254 :

A=

With a right hand side b =
are

1
0

254
1

1
1


, the solution to Ax = b is x

=
L

1
254

0
1


,

=
U

254
0

1
1


. However, the computed L and U

1
254

(note that the (2, 2) entry of U is 1 254 which is rounded to 254 ). Then
U
=
L

254
1

1
0


,

which is a matrix very different from A:


U
Ak
kL
= 1,
kAk
 
0
a 100% relative error. The computed solution is then x
=
, also a 100% relative error with the true x
1
even though the matrix A is well conditioned (A) 2.6.
This method is neither backward not forward stable. The problem, loosely speaking, is that in the
process of working with a matrix with norm kAk 2, we ended up with intermediate quantities that are 254
times larger.
All this justifies the need for pivoting.

3.2. Partial and complete pivoting. Partial pivoting occurs when during Gaussian elimination one
chooses the largest element in the current column at or below the diagonal as a pivot element. When that
element is chosen among the entire Schur complement, this is complete pivoting.
In practice, partial pivoting almost always suffices to achieve stability (although there are extremely rare
counterexamples), so this is the default setting in MATLAB.
The process of pivoting produces the decomposition
A = P LU,
where P is a permutation matrix and L and U are as usual.

32

3. SOLVING LINEAR SYSTEMS

To understand the process, we proceed by induction. We start with P0 = I, L = I, U = A. After k steps


we have

u11 . . . u1k
u1,k+1
...
u1n
1

..
..
..
..
..
..
..

.
.
.
.
.
.
.

lk1

u
u
.
.
.
u
.
.
.
1
kk
k,k+1
kn

lk+1,1 . . . lk+1,k 1

u
.
.
.
u
k+1,k+1
k+1,n

A=P
.

.
.
..
..
..
.
..
..
..

...
.
.
.

li1

ui,k+1
...
uin
...
lik
1

..
..
..
..
..
..

.
.
...
.
.
.
.
un,k+1
...
unn
ln1
...
lnk
1
Say ui,k+1 is the largest (in absolute value) among uk+1,k+1 through un,k+1 , i.e., ui,k+1 is our pivot that
needs to be brought to position (k + 1, k + 1). Let S be the permutation matrix that swaps rows k + 1 and
i. So we have

1
u11 . . . u1k
u1,k+1
...
u1n

..
..
..
..
..
..
..

.
.
.
.
.
.
.

lk1

.
.
.
1
u
u
.
.
.
u
kk
k,k+1
kn

lk+1,1 . . . lk+1,k 1

u
.
.
.
u
i,k+1
in

A=P
S

.
.
..
..
..
.
.
.
.

.
.
...
.
.
.
.

li1

uk+1,k+1 . . . uk+1,n
...
lik
1

..
..
..
..
..
..

.
.
...
.
.
.
.
un,k+1
...
unn
ln1
...
lnk
1
We factor S into L from the right (swapping columns k + 1 and i).

1
u11 . . .

..
.
..
..

.
.

lk1

.
.
.
1

lk+1,1 . . . lk+1,k 0

A=P

..
..
..
..

.
.
.
.

li1

...
lik
1
0

..
..
..
.
.

.
.
.
.
ln1
...
lnk
1
Then we factor S out of

..

lk1

li1

A=P S
..

lk+1,1

..

.
ln1

L on the left by swapping rows k + 1 and i.

u11 . . .

..
..

.
.

...
1

...
lik
1

..
..
..

.
.
.

. . . lk+1,k
1

..
..
..

.
.
.
...
lnk
1

u1k
..
.

u1,k+1
..
.

...
..
.

u1n
..
.

ukk

uk,k+1
ui,k+1
..
.

...
...

ukn
uin
..
.

uk+1,k+1
..
.
un,k+1

...
. . . uk+1,n
..
...
.
...

u1,k+1
..
.

...
..
.

u1n
..
.

ukk

uk,k+1
ui,k+1
..
.

...
...

ukn
uin
..
.

un,k+1

unn

u1k
..
.

uk+1,k+1
..
.

...
. . . uk+1,n
..
...
.
...

unn

3. PIVOTING

33

Thus we update P by S and we have our new L and U that we can proceed with in the usual manner for
one more step since the pivot is now on the diagonal.
With complete pivoting, one also swaps columns by multiplying by permutation matrices on the right,
ultimately getting
A = P LU Q,
where P and Q are permutation matrices and L and U are as before.
Example. We will compute the LU decomposition with partial pivoting of the matrix

1
2
1
2 2 4 .
1
1 1
The largest element (in absolute value) in the first column is 2 so we need to swap rows

1
2
1
0 1 0
2 2 4
2 2 4 = 1 0 0 1
2
1
1
1 1
0 0 1
1
1 1
then we carry out the elimination of the first column

1
2 2 4
0 1 0
0
1 1
= 1 0 0 0.5 1
0.5 0 1
0
2
1
0 0 1
we now need to swap rows 2 and 3 in U

2 2
1
1 0 0
0 1 0
0 0 1 0
2
= 1 0 0 0.5 1
0
1
0.5 0 1
0 1 0
0 0 1
{z
}
|

0 1 0
1
2 2 4
2
1
= 1 0 0 0.5 0 1 0
0 0 1
0.5 1 0
0
1 1
we need to swap rows 2 and 3 in L to restore its shape

2 2
1 0 0
1
0 1 0
0
2
= 1 0 0 0 0 1 0.5 1
0 0 1
0 1 0
0.5 0 1
0
1
|
{z
}

0 0 1
1
2 2
4
0
1
2
1 .
= 1 0 0 0.5
0 1 0
0.5 0.5 1
0
0 1.5

1 and 2.

4
1
1

4
1
1

Condensed technique. One can compute an LU decomposition with pivoting by avoiding the need to
write the full permutation matrices and record the columns indices instead. Also, we record the multipliers
of L in the lower triangular portion of U . We start with the following
1
2
3

1
2
1

2
1
2 4
1 1

which represents the fact that the columns of P start off in their natural order 1, 2, 3. We need to pivot
rows 1 and 2. This is automatically registered in P as the 1 and 2 gets swapped as well. The entries of L
are entered in boldface in where the zeros in U would be.

34

3. SOLVING LINEAR SYSTEMS

1
1
2 2
3 1

2
1
2
2 4 1
1 1
3

2 2
0.5
1
0.5
2

4
2
1 3
1
1

2 2
0.5
2
0.5
1

4
2
1 3
1
1

2
0.5
0.5

2
4
2
1
0.5 1.5

The matrix P would thus contain the columns of the identity in order 2, 3, 1 and we have

0 0 1
1
2 2
4
U =
1
2
1 .
P = 1 0 0 L = 0.5
0 1 0
0.5 0.5 1
1.5
Problems:
(1) Compute the decomposition A = P LU from Gaussian elimination with partial pivoting of the
matrix

0 1 1
1 1 1 .
1 3 3

4. CHOLESKY DECOMPOSITION

35

4. Cholesky decomposition
If a matrix is symmetric and positive definite, it turns out, it can be decomposed into triangular factors
in half the time that LU takes and there is no need to pivot. This is the so-called Cholesky decomposition.
Recall that a matrix A is symmetric positive definite if A is symmetric and xT Ax > 0 for all nonzero
vectors x.
In particular, the positive definiteness of A implies that it is nonsingular and has an LDU decomposition
A = L1 DU , where L1 is unit lower triangular, U is unit upper triangular and D is diagonal (we call the
lower triangular factor L1 because we need L for later).
Since A = AT = (L1 DU )T = U T DLT1 (note that D = DT since D is diagonal), U T DLT1 is also an LDU
decomposition of A, thus we must have L1 = U T (because U T would be obtained by the same Gaussian
elimination process as L1 ). Now, D has positive entries on the diagonal (or otherwise we can find a vector
x such that xT Ax 0which vector is that?), thus D1/2 is real and A = (L1 D1/2 ) (L1 D1/2 )T . For
L L1 D1/2 ,
A = LLT .
This is the Cholesky decomposition of A. Here L is lower triangular with positive diagonal (but not necessarily
unit lower triangular).
There are two main properties that make the Cholesky decomposition deserve special attention: its
stable without the need to pivot and it can be computed in half the time that it takes to compute the LU
decomposition.
First on stability. If A is s.p.d., i.e., xT Ax > 0 for all x 6= 0, if we pick x = ei + zej (i.e., x is a vector of
zeros except for 1 and z in positions i and j, respectively), we get
0 < xT Ax = aii + 2zaij + z 2 ajj .
The above inequality holds true for all z, thus we must have
a2ij < aii ajj ,
implying that we can never have a zero on the main diagonal and also that the largest element of the matrix
(or any of the subsequent Schur complements which are also s.p.d.) is always on the diagonal. Pivoting is
not required for the Cholesky decomposition to be stable.
Now efficiency. Since the symmetry is preserved in Gaussian elimination, the Schur complement will be
symmetric at every step. Thus only the elements at and above the diagonal are worked with and updated,
1
thus halving the work. We need to symmetrize the LU decomposition at the end, so we scale by D 2 ,
where D = diag (u1 , . . . , unn ), to obtain the Cholesky decomposition.
Here is the code. Compare it with the LU decomposition code. It returns LT rather than L, which is
typical and is what MATLAB does.
function U=cholesky(A)
U=triu(A);
n=size(A,1);
for j=1:n
for i=j+1:n
t = U(j,i)/U(j,j);
U(i,i:n) = U(i,i:n) - t*U(j,i:n);
end
end
for j=1:n
U(j,j:n)=U(j,j:n)/sqrt(U(j,j));
end

% utilize only the upper triangular part of A

% technically we need U(i,j), but U(i,j)=U(j,i)


% only the diagonal and above is updated

% we rescale by D^(-1/2) to obtain Cholesky

36

3. SOLVING LINEAR SYSTEMS

The cost is half that of Gaussian elimination:

1 3
3n

+ O(n2 ).

Problems.
(1) Prove that the cost of Cholesky is 31 n3 + O(n2 ). Analytically or empirically.
(2) Compute the Cholesky decomposition of the matrices

4 6 10
1 1 1
A = 6 25 39 B = 1 2 3 .
10 39 110
1 3 6

5. FORWARD AND BACKWARD ERRORS AND STABILITY

37

5. Forward and backward errors and stability


Relative and absolute errors. In numerical linear algebra, we try to compute the solution x to a
problem, but instead compute x
, the result of many roundoff errors. Since x is often a vector, we use norms.
The quantity
kx x
k
is called absolute error, but depending on the size of x, a small absolute error may or may not be
indicative of how close x is to x
. For example an absolute error of 0.0001 might be a good thing if x is close
to 1, but it tells us nothing useful if x is close to 0.00000001.
Instead, the relative error
kx x
k
kxk
is a lot more meaningful: a relative error less than 104 means that x and x
have at least 4 significant
decimal digits in common.
Forward and backward errors. Say we have to compute f (x) for a given input x, but we end up
computing f(x).
f(x)k
is the relative forward error.
The quantity kf (x) f(x)k is the absolute forward error and kf (x)
kf (x)k
Ideally, algorithms will deliver a small forward error (absolute or relative). Unfortunately, this is not possible
in general without the sacrifice of significant computing resources as the following example suggests:
If x = 1020 , y = 1, z = 1020 , then the computed x + y + z is 0, which has 100% relative error with the
true result 1. I.e., one of the simplest calculationsadding three numbersresulted in 100% error. There
are hacks to get the sum of any three numbers right (and, in general, the sum of n numbers right) with very
little additional effort. In general, however, obsessing with forward errors is not a good idea simply because
there are problems that dont deserve to be computed accurately. What does that mean?
If f is such that a small change in x results in large change of f (x), then the effect of simply storing x
in the computer (call the stored one x
) will result in an f (
x) that is far from f (x). Why would we want an
accurate algorithm for f if the accurate f (
x) well get is already far off f (x) which is what we really want.
In such cases were better off getting an indication of the sensitivity of f and reformulating the problem so
that we compute something that is less susceptible to roundoff errors.
This leads us to the concept of backward error. Say we end up computing f(x). We have
f(x) = f (x + x)
for some x, which is called backward error. In other words we computed the solution to a nearby problem.
All algorithms we will consider will be backward stable, i.e., the computed solution will be the exact
solution to a nearby problem. If the problem under consideration is stable, i.e., small changes in the input
result in small changes in the output, a small backward error will imply a small forward error.
5.1. Stability. An algorithm is forward stable if the relative forward error is small, i.e., if
|f (x) f(x)|
= O().
|f (x)|
An algorithm is backward stable if the relative backward error is small, i.e., if f(x) = f (x + x), then
|x|
= O().
|x|
The quantity O() is meant that the error does not exceed a modest multiple (say less than 100) of a
polynomial in the size of the problem (i.e., the matrix size n) times the machine precision, i.e., the bound is

38

3. SOLVING LINEAR SYSTEMS

of the form
100g(n),
where g(n) is a polynomial in n.
Most significantly O() is a bound that does not depend on x only on its size and modestly so.

6. CONDITION NUMBERS

39

6. Condition numbers
So how sensitive is a problem to perturbations?
|f (x + x) f (x)| |f 0 (x)| |x|,
so we call |f 0 (x)| the absolute condition number and similarly since
|f (x + x) f (x)|
|x| |xf 0 (x)|

,
|f (x)|
|x|
|f (x)|
0

(x)|
the quantity |xf
|f (x)| is called relative condition number.
The condition number tell us how much the backward error gets magnified to get the forward condition
number. Since in all our algorithms the backward errors will be small, the forward error will be small
for problems with small condition numberwell conditioned problemsand large errors in ill conditioned
problems.

Condition number of solving Ax = b. In this section we address the question of the sensitivity of
the solution to a linear system to perturbations. Namely, if we perturb A to A = A + A, how much does x
change?
x = b. Then (A + A)(x + x) = b or
Consider the perturbed solution x
= x + x such that A
Ax + A
x = 0, i.e.,
x = A1 A
x,
which implies
kxk = kA1 A
xk kA1 kkAkk
xk
and in turn
kAk
kxk
kAkkA1 k
.
k
xk
kAk
This justifies calling the quantity
(A) = kAkkA1 k
condition number of A.
It tells us how relative changes in A are multiplied to result in relative perturbations in x.
The condition number is always at least 1:
(A) 1.
1

This is because 1 = kIk = kAA k kAkkA k = (A) (recall that kIk = 1 in any norm). In the two-norm
equality is attained for orthogonal matrices 2 (Q) = 1, since the two norm of any orthogonal matrix is 1.
6.1. Practical interpretation of (A) when solving Ax = b. We can expect at least relative
16
and thus
errors in A just from storing A in the computer. Therefore kAk
kAk 10
kxk
(A)1016 .
k
xk
Since a relative error of 10k means we will have about k correct decimal digits in the answer, the
condition number (A) can be interpreted as a measurement of how many digits will be lost in solving
Ax = b. If (A) 105 , then we can expect to lose 5 decimal digits and the answer to have about 11 correct
decimal digits.
In general, we would expect the solution to Ax = b to have about 16 log10 (A) correct digits.
This can all be verified empirically very easily in MATLAB.

40

3. SOLVING LINEAR SYSTEMS

n=50;
kappa=10^5;
% generate a random matrix with condition number kappa
[Q,R]=qr(rand(n));
A=Q*diag(linspace(1,kappa,n))*Q;
x=rand(n,1); % our random solution
b=A*x;
% right hand side
y=A\b;
% the computed solution
disp(The relative error is:);
norm(y-x)/norm(y)
disp(Expected number of correct decimal digits is at least);
16-log10(kappa)
disp(Actual number of correct decimal digits (in the norm sense):);
-log10(norm(y-x)/norm(y))
Problems:
(1) Find (by hand) the two-norm condition number of the matrix


0 2
A=
1 1
(2) Prove that if L is the Cholesky factor of A then 2 (A) = (2 (L))2 .

7. STABILITY OF THE LU DECOMPOSITION

41

7. Stability of the LU decomposition


This section can be skipped on a first reading.
We saw that Gaussian elimination without pivoting can be unstable, so we only focus on partial pivoting.
If A is nonsingular and P A = LU is computed by Gaussian elimination with partial pivoting., then
kAk
U
= P A + A, where
L
= O().
kLkkU k
For backward stability we need
kAk
(7.1)
= O().
kAk
Partial pivoting ensures that the multipliers needed to elimination do not exceed 1 in absolute value, i.e.,
k
lij 1. Thus 1 kLk1 n, so in order for (7.1) to hold we must have kU
kAk = O(1). The parameter ,
also often defined as
uij
max
,
ij aij
is called the growth factor of Gaussian elimination with partial pivoting.
In practice, years of experience have shown that γ = O(1), although there are pathological examples such as the Kahan matrix
K = [  1   0   0  ...   0   1
      -1   1   0  ...   0   1
      -1  -1   1  ...   0   1
       :   :        .       :
      -1  -1  -1  ...  -1   1 ],
for which
U = [ 1  0  0  ...  0     1
      0  1  0  ...  0     2
      0  0  1  ...  0     4
      :             .     :
      0  0  0  ...  0  2^(n-1) ].
The growth factor is 2^(n-1), a worst case scenario. Matrices like this are never encountered in practice, so Gaussian elimination with partial pivoting is stable in practice, and this is the reason MATLAB defaults to it when one calls A\b.
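The growth factor is easy to observe experimentally. The following MATLAB sketch (an illustration, not from the text; the matrix W is just the worst-case example above, built explicitly) compares γ for a random matrix with γ for the pathological one:
n=60;
A=randn(n);
[L,U,P]=lu(A);                                % Gaussian elimination with partial pivoting
gamma_random=max(abs(U(:)))/max(abs(A(:)))    % typically O(1)
W=eye(n)-tril(ones(n),-1); W(:,n)=1;          % 1 on the diagonal, -1 below it, 1 in the last column
[L,U,P]=lu(W);
gamma_worst=max(abs(U(:)))/max(abs(W(:)))     % equals 2^(n-1)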


8. Givens rotations
A Givens rotation is the third tool we'll use. It is not really a third tool, since it is a product of the first two, but we will use it often enough that it deserves separate attention.
The need for a Givens rotation arises because our main tool for creation of zeros, the matrix Eij , is not
orthogonal.
A Givens rotation is a matrix of the form
G = [  cos x   sin x
      -sin x   cos x ].
One trivially verifies that GᵀG = I, so G is indeed orthogonal. The idea is to choose x so that G creates zeros:
G [ a      [ c
    b ]  =   0 ].
Since orthogonal matrices do not change the 2-norm of a vector, we must have c = √(a² + b²). Also, from
 a cos x + b sin x = √(a² + b²)
−a sin x + b cos x = 0
we get
cos x = a / √(a² + b²)   and   sin x = b / √(a² + b²).
For example, the Givens rotation that kills the (2, 1) entry in the vector [3, 4]ᵀ is
G = [  3/5   4/5
      -4/5   3/5 ],
since G [3, 4]ᵀ = [5, 0]ᵀ.
Givens rotations in n × n matrices. Since we'll use Givens rotations to create zeros in large matrices, we also call a Givens rotation the n × n matrix
G = [ 1
         .
           cos x   ...   sin x
             :      .      :
          -sin x   ...   cos x
                             .
                               1 ],
where the cos x and sin x entries sit in rows and columns i and j.
Note that G differs from the identity matrix only in the 4 entries in positions (i, i), (i, j), (j, i), and (j, j).
One trivially verifies that G is still orthogonal; it rotates rows i and j and leaves the rest of the matrix
unchanged.
A Givens rotation is nothing new. A Givens rotation is simply a sequence of our previous two tools: if we perform Gaussian elimination on G we get
[  cos x   sin x     [   1      0    [ cos x    0      [ 1   tan x
  -sin x   cos x ] =   -tan x   1 ]     0     sec x ]    0     1   ].
In other words, using a Givens rotation is equivalent to the following sequence of operations:
(1) adding a multiple of the second row to the first;
(2) scaling both rows;
(3) subtracting a multiple of the first row from the second.
Complex Givens rotations. Obviously if x or y is complex, then a real Givens rotation may not be enough to zero out the y in the vector [x, y]ᵀ. Instead, we take the complex Givens rotation
G = [  c   s̄
      -s   c ],
where c is real and s is complex such that |c|² + |s|² = 1. One directly checks that G*G = I, so G is unitary. We need
G [ x      [ α √(|x|² + |y|²)
    y ]  =          0          ],
where α can be any complex number on the unit circle, |α| = 1.
Writing cx + s̄y = α√(|x|² + |y|²) and −sx + cy = 0 means c = α x̄ / √(|x|² + |y|²). For c to be real, we must choose α = x/|x|, so that
c = |x| / √(|x|² + |y|²)   and   s = c y / x.
One must obviously check that x ≠ 0. If x = 0, then we take c = 0, s = 1.


MATLAB implementation.
function G=givens(x,y)
if x == 0
c = 0; s = 1;
else
c = abs(x)/sqrt(abs(x)^2+abs(y)^2);
s = c*y/x;
end
G = [c conj(s);-s c];
Practical application of an n × n Givens rotation to an n × n matrix. Every Givens rotation (when applied on the left) only changes two rows of the matrix, say i and j. For the purpose of efficiency, it is critical to change only those two rows. In practice we form
G [ a_i1  a_i2  ...  a_in
    a_j1  a_j2  ...  a_jn ]
and replace rows i and j of A with the result.
In MATLAB notation this is particularly simple to write: A([i,j],:)=G*A([i,j],:).
For example, if in the Vandermonde matrix from the previous section we wanted to use row 1 to create a zero in position (4, 1) using a Givens rotation, we'd form:
[  1/√2   0   0   1/√2     [ 1  1   1   1       [ 2/√2   5/√2   17/√2   65/√2
    0     1   0    0         1  2   4   8    =    1      2       4       8
    0     0   1    0         1  3   9  27         1      3       9      27
  -1/√2   0   0   1/√2 ]     1  4  16  64 ]       0      3/√2   15/√2   63/√2 ].


Equivalently, we can form the product
[  1/√2   1/√2     [ 1  1   1   1       [ 2/√2   5/√2   17/√2   65/√2
  -1/√2   1/√2 ]     1  4  16  64 ]  =    0      3/√2   15/√2   63/√2 ];
these are rows 1 and 4 in the matrix GA. Rows 2 and 3 are unchanged.
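Assuming the givens function from the MATLAB implementation above is saved on the path, this computation can be reproduced with a few lines (an illustrative sketch, not part of the original text):
A=[1 1 1 1; 1 2 4 8; 1 3 9 27; 1 4 16 64];   % the 4x4 Vandermonde matrix
G=givens(A(1,1),A(4,1));                     % built from the (1,1) and (4,1) entries
A([1 4],:)=G*A([1 4],:);                     % update only rows 1 and 4
A                                            % the (4,1) entry is now zero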
Problems. For the 4 × 4 Vandermonde matrix (2.1):
(1) Find the Givens rotation that rotates rows 2 and 4 to create a zero in position (4, 4).
(2) Find the Givens rotation that rotates columns 1 and 2 to create a zero in position (1, 1).


9. The QR algorithm
The QR algorithm reduces a matrix to an upper triangular form by using Givens rotations to zero out
all the entries below the main diagonal.
We start with an m × n matrix A, then choose an appropriate Givens rotation G₁ to zero out the (m, 1) entry and get A₁ = G₁A. Then we choose another Givens rotation G₂ to zero out the (m − 1, 1) entry, obtaining A₂ = G₂A₁. Note that A₂ = G₂A₁ = G₂G₁A. We go on until we get an upper triangular matrix R. If the total number of entries to be zeroed out is k, then
R = G_k ⋯ G₂G₁A.
Therefore A = G₁ᵀG₂ᵀ ⋯ G_kᵀ R, since the G_i's are orthogonal. We define Q = G₁ᵀG₂ᵀ ⋯ G_kᵀ, which is an orthogonal matrix as a product of orthogonal ones. The decomposition
A = QR
is called the QR decomposition of A.
So here is the QR algorithm. We need two nested loops to kill all entries below the main diagonal. We
start with column 1, killing entries (2, 1) through (m, 1), then proceed with column 2 and so on.
function [Q,A]=qr143M(A)
[m,n]=size(A);
Q=eye(m);
for j=1:n                            % columns j=1 through n
    for i=j+1:m                      % all elements below the diagonal in column j
        G=givens(A(j,j),A(i,j));     % form the Givens that uses A(j,j) to kill A(i,j)
        A([j i],j:n)=G*A([j i],j:n); % apply G to rows j and i of A (and only columns j:n)
        Q(:,[j,i])=Q(:,[j,i])*G';    % accumulate all the rotations in Q (note the transpose)
    end
end

For square matrices, solving Ax = b then is easy. From QRx = b we get Rx = QT b, which is a triangular
linear system which we solve through substitution. It will turn out that solving the least squares problem
Ax = b for a rectangular A has the exact same solution.
One can compute the QR decomposition through other methods, e.g., Householder transformations,
which yield the same decomposition.
Uniqueness of the QR decomposition. The QR decomposition is unique only up to the signs of the rows of R, because A = (QJ)(JR) is also a QR decomposition of A for any choice of a diagonal matrix J with +1's and −1's on the diagonal. The matrix QJ would still be orthogonal and JR upper triangular. In the complex case one can choose J to be any diagonal matrix of complex signs (i.e., complex numbers on the unit circle).
Reduced QR decomposition. The QR algorithm of an m n matrix yields an m m matrix Q and
an m n matrix R. If we were to take only the first n columns of Q and the first n rows of R, the resulting
m n matrix Q will still be orthogonal and the resulting n n matrix R will still be upper triangular and
the product will still equal A. This is called the reduced QR decomposition. The reason is that the last
m n rows of the original Q multiply zeros in the bottom m n rows of the original R and can thus be
ignored.


For example, for
A = [ 1  1
      1  2
      1  3
      1  4 ]
the full QR decomposition has a 4 × 4 orthogonal Q and a 4 × 2 upper triangular R whose last two rows are zero. Discarding the last two columns of Q and the last two rows of R gives the reduced QR decomposition
[ 1  1      [ 0.5000  -0.6708
  1  2    =   0.5000  -0.2236    [ 2.0000   5.0000
  1  3        0.5000   0.2236      0        2.2361 ].
  1  4 ]      0.5000   0.6708 ]
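In MATLAB the reduced decomposition can also be requested directly as qr(A,0); the sketch below (illustrative only) confirms that the full and reduced factors reproduce A equally well:
A=[1 1; 1 2; 1 3; 1 4];
[Q,R]=qr(A);                  % full: Q is 4x4, R is 4x2
[Q0,R0]=qr(A,0);              % reduced: Q0 is 4x2, R0 is 2x2
norm(Q*R-A), norm(Q0*R0-A)    % both are O(eps)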

Problems.
(1) Compute (by hand) the QR decomposition of the matrix
[ 1  2
  1  3 ].
(2) Write a program to compute a QL decomposition of a matrix as a product of an orthogonal matrix and a lower triangular matrix. Turn in your code and the result of your code (the matrices Q and L) for the 5 × 5 Pascal matrix pascal(5).
(3) Compute the reduced QR decomposition of the 5 × 3 matrix obtained in MATLAB as A=pascal(5); A=A(:,1:3); Turn in the matrix Q only.


10. Least squares problems


A linear system Ax = b, where A is m × n with m > n, is overdetermined and in general would not have a solution. In these cases one looks for the vector x that minimizes the two norm of the residual r = Ax − b:
min_x ‖Ax − b‖₂.
This problem is referred to as a least squares problem.


We follow Demmel [3, Sec. 3.2.1]. Let x be such that AᵀAx = Aᵀb. We will show that x minimizes ‖r‖₂. If y = x + e then
(Ay − b)ᵀ(Ay − b) = (Ae + Ax − b)ᵀ(Ae + Ax − b)
                  = (Ae)ᵀ(Ae) + (Ax − b)ᵀ(Ax − b) + 2(Ae)ᵀ(Ax − b)
                  = ‖Ae‖² + ‖Ax − b‖² + 2eᵀ(AᵀAx − Aᵀb)
                  = ‖Ae‖² + ‖Ax − b‖²,
which is minimized for e = 0.
The relationship AᵀAx = Aᵀb is referred to as the normal equations. In practice the solution to the least squares problem is never found using these equations because κ₂(AᵀA) = κ₂(A)² (see the more detailed analysis in section 11.2).
Instead we use the reduced QR decomposition. The normal equations imply (QR)ᵀ(QR)x = (QR)ᵀb and in turn RᵀRx = RᵀQᵀb. When A is of full column rank (i.e., rank A = n = rank R), we get Rx = Qᵀb, a triangular linear system whose solution is x = R⁻¹Qᵀb.
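As a quick illustration (a sketch using only built-in functions; the sizes are arbitrary), the least squares solution via the reduced QR decomposition agrees with MATLAB's backslash, which solves the same minimization for rectangular A:
m=100; n=5;
A=rand(m,n); b=rand(m,1);
[Q,R]=qr(A,0);          % reduced QR decomposition
x=R\(Q'*b);             % solve Rx = Q'b by back substitution
norm(x-A\b)             % backslash returns the same least squares solution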
Problems.
(1) Using the QR algorithm, find the best fit polynomial of degree 3 that fits the data f (x) =
1, 2, 1, 0, 1 for x = 1, 0, 1, 2, 3, respectively. Report the coefficients of the polynomial to 4
significant digits.


11. Perturbation theory for Ax = b


11.1. Condition number of a matrix. What is the condition number of the problem of solving a system with a matrix A? In other words, how much does the solution change if we were to perturb the matrix A to A + δA?
Say the solution to the perturbed system is x̃ = x + δx, i.e.,
(A + δA)(x + δx) = b.
After using that Ax = b we get
δx = −A⁻¹ δA x̃,
implying that
‖δx‖ ≤ ‖A⁻¹‖ ‖δA‖ ‖x̃‖,
or
‖δx‖ / ‖x̃‖ ≤ ‖A⁻¹‖ ‖A‖ · ‖δA‖ / ‖A‖.
In other words the relative change in A gets magnified by ‖A⁻¹‖ ‖A‖ to get the relative change in x. This justifies calling the quantity
κ(A) = ‖A⁻¹‖ ‖A‖
the condition number of A. When we refer to a particular norm, we write, e.g., κ₂(A) = ‖A‖₂ ‖A⁻¹‖₂.
Properties of κ(A).
• κ(A) ≥ 1 in the 1, 2, and infinity norms. This follows from 1 = ‖I‖ = ‖AA⁻¹‖ ≤ ‖A‖ ‖A⁻¹‖.
• κ₂(Q) = 1 for orthogonal matrices Q: κ₂(Q) = ‖Q‖₂ ‖Q⁻¹‖₂ = ‖QI‖₂ ‖QᵀI‖₂ = ‖I‖₂ ‖I‖₂ = 1, since the 2-norm is unaffected by multiplication by an orthogonal matrix.
11.2. Backward stability. The addition of three numbers is backward stable:
fl(a + b + c) = fl(fl(a + b) + c) = fl((a + b)(1 + δ) + c)
              = ((a + b)(1 + δ) + c)(1 + δ₁)
              = a(1 + δ)(1 + δ₁) + b(1 + δ)(1 + δ₁) + c(1 + δ₁)
              = a₁ + b₁ + c₁,
where a₁ = a(1 + δ)(1 + δ₁), b₁ = b(1 + δ)(1 + δ₁), c₁ = c(1 + δ₁), and |δ| ≤ ε, |δ₁| ≤ ε.
In other words the floating point result of the sum of the numbers a, b, and c is the exact sum of the numbers a₁, b₁, and c₁. The relative perturbation in a₁ is
|a₁ − a| / |a| = |(1 + δ)(1 + δ₁) − 1| = |δ + δ₁| ≤ 2ε,
where we ignored the δδ₁ term, which is O(ε²). It is common practice to ignore second and higher order terms in backward error analysis.
We get similar inequalities for b₁ and c₁ and say that the backward error in the sum of three numbers is 2ε. Any algorithm whose backward error is O(ε) is called backward stable.
As an exercise (see the problems after this section), prove that multiplication by a Givens rotation G is backward stable, i.e., fl(G̃A) = G(A + E), where ‖E‖₂ = O(ε)‖A‖₂ and G̃ is the floating point representation of G.
Theorem 11.1. Applying a series of Givens rotations is backward stable:
fl(G̃_k ⋯ G̃₁ A) = G_k ⋯ G₁ (A + E),
where ‖E‖₂ = O(ε)‖A‖₂ and G̃_i is the floating point representation of the Givens rotation G_i, i = 1, 2, …, k.

11. PERTURBATION THEORY FOR Ax = b

49

Proof. The proof goes by induction, with the k = 1 step being trivial. Assume that fl(G̃_{k−1} ⋯ G̃₁ A) = G_{k−1} ⋯ G₁ A + E₁, where ‖E₁‖₂ = O(ε)‖A‖₂, and set B ≡ fl(G̃_{k−1} ⋯ G̃₁ A). Applying one more rotation and using the one-rotation result (with backward error E₂, ‖E₂‖₂ = O(ε)‖B‖₂) gives
fl(G̃_k B) = G_k (B + E₂) = G_k (G_{k−1} ⋯ G₁ A + E₁) + G_k E₂ = G_k G_{k−1} ⋯ G₁ A + G_k E₁ + G_k E₂,
where ‖G_k E₁ + G_k E₂‖₂ ≤ ‖E₁‖₂ + ‖E₂‖₂ ≤ O(ε)‖A‖₂ + O(ε)‖B‖₂ = O(ε)‖A‖₂, since ‖B‖₂ ≤ ‖G_{k−1} ⋯ G₁ A‖₂ + ‖E₁‖₂ = ‖A‖₂ + O(ε)‖A‖₂.

Theorem 11.2. The QR decomposition of a matrix is backward stable, i.e., if Q̃ and R̃ are the computed Q and R, then
Q̃R̃ = A + δA,  where  ‖δA‖ / ‖A‖ = O(ε).
Theorem 11.3. If Ax = b is solved by first computing the QR decomposition of A and then solving Rx = Qᵀb via backward substitution, then
(A + δA) x̃ = b,  where  ‖δA‖ / ‖A‖ = O(ε)
and
‖x̃ − x‖ / ‖x‖ = O(ε κ(A)).
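Theorem 11.2 is easy to observe numerically; the following sketch (an illustration, not a proof) measures the relative backward error of the built-in QR decomposition, which comes out as a modest multiple of ε:
n=200;
A=randn(n);
[Q,R]=qr(A);
norm(Q*R-A)/norm(A)     % O(eps), roughly 1e-15
norm(Q'*Q-eye(n))       % Q is orthogonal to working precision as well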
Problems.
(1) Prove that multiplying by a Givens rotation G is backward stable: fl(G̃A) = G(A + E), where ‖E‖₂ = O(ε)‖A‖₂ and G̃ is the floating point representation of G.


12. Eigenvalues and eigenvectors


Say A = XΛX⁻¹ is the eigendecomposition of A.
We start by arguing that if A is not symmetric, then this is not the decomposition one should compute. First off, Λ could contain Jordan blocks, which are extremely sensitive to perturbations: a single O(ε) perturbation in a multiple eigenvalue will destroy a Jordan block. Additionally, the computation of the eigenvalue decomposition cannot be backward stable if X is ill conditioned: even if we computed X exactly and Λ with a tiny error δΛ, ‖δΛ‖ = O(ε)‖A‖, we get A + δA = X(Λ + δΛ)X⁻¹, so
‖δA‖ = ‖X δΛ X⁻¹‖ ≤ ‖X‖ ‖δΛ‖ ‖X⁻¹‖ = O(ε)‖A‖ κ(X).
With κ(X) arbitrarily large, this would not be backward stable.
Therefore in our quest for eigenvalues, we restrict ourselves to orthogonal transformations only. There
is an orthogonal similarity transformation that reveals the eigenvalues of A. It yields the so called Schur
decomposition
A = QT QT ,
where Q is orthogonal and T is upper triangular. Since A and T are similar, they have the same eigenvalues,
and since T is upper triangular, its eigenvalues are on the diagonal.
The existence of the Schur form follows easily by induction. Let λ be an eigenvalue and u a unit eigenvector corresponding to it. Let U = [u, V] be a unitary matrix with first column u.1 (V is an n × (n − 1) matrix.)
Then using that Au = λu, u*u = 1 and V*u = 0 (since every column of V is orthogonal to u), we get
U*AU = [ u*    A [u, V] = [ u*Au   u*AV     = [ λ   u*AV
         V* ]               V*Au   V*AV ]       0   V*AV ],
where the 0 stands for a zero vector of length n − 1. We can then proceed by induction with the trailing (n − 1) × (n − 1) matrix V*AV.
So our goal is to compute the Schur decomposition of A using orthogonal similarity transformations.
Those, of course, will be comprised of Givens rotations.
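MATLAB computes the Schur decomposition directly with the built-in schur command; the sketch below (illustrative) verifies the factorization and that the eigenvalues sit on the diagonal of T (the 'complex' flag requests the triangular, rather than quasi-triangular, form for a real nonsymmetric matrix):
A=randn(5);
[Q,T]=schur(A,'complex');     % Q unitary, T upper triangular
norm(Q*T*Q'-A)                % O(eps): the similarity reproduces A
[sort(diag(T)) sort(eig(A))]  % same eigenvalues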
Problems.
(1) Let x and y be vectors and the rank 1 matrix A be defined as A = xy T . In terms of x and y, find
the only nonzero eigenvalue of A and a left and right eigenvector for it. What is kAk2 ?

1We must allow for U to be unitary rather than orthogonal, because λ may be complex. One method to obtain U is to
take an arbitrary n n matrix, replace the first column with u, and compute its QR decomposition. The first column of Q will
be u.


13. Eigenvalue algorithms


Since the eigenvalues of a matrix are the zeros of the characteristic polynomial det(A − λI) = 0 of degree n, the Abel-Ruffini theorem says that we cannot find an explicit algebraic expression for the roots for n > 4. Therefore we must use an iterative process that converges to the roots. This is not a contradiction, since we end up computing only floating point approximations to the roots, which is all we need.
There are many competing algorithms for the eigenvalue problem:
Power method
Inverse iteration
Rayleigh quotient iteration
Divide-and-conquer
QR iteration
DQDS (differential quotient difference with shift)
We refer to the classical texts [3, 4] for the details on all these algorithms and focus on the two algorithms
currently of choice: QR iteration for the non symmetric eigenvalue problem and DQDS for the symmetric
eigenvalue problem (and also for the Singular Value Decomposition (SVD) which we will study later).
14. QR iteration
We start with a word of caution: the QR iteration is an eigenvalue algorithm and is not to be confused
with the QR algorithm, which computes the QR decomposition.
The QR iteration is an eigenvalue algorithm that computes the eigenvalues of a matrix. It consists of
forming the QR decomposition, multiplying the factors in reverse order RQ and repeating the process until
it converges to the Schur form:
repeat until convergence
[Q,R] = qr(A)
A=R*Q
Notice that this is a similarity transformation: RQ = QT QRQ = QT AQ, so the eigenvalues of A are
preserved in the process.
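In runnable MATLAB the unshifted iteration reads as follows (a minimal sketch with a fixed number of steps rather than a proper convergence test; the symmetric test matrix is an arbitrary choice that guarantees real eigenvalues):
A=randn(6); A=A+A';     % symmetric test matrix with real eigenvalues
for k=1:200
    [Q,R]=qr(A);
    A=R*Q;              % similar to the previous A, so the eigenvalues are preserved
end
diag(A)                 % approximates eig of the original matrix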
The problem with this algorithm is that each step takes O(n3 ) time, so even if we converged in one
step per eigenvalue, it would still take O(n4 ) overall, which is too slow. We want (and will get!) an O(n3 )
algorithm. Additionally, the convergence of this algorithm depends on the ratio between adjacent eigenvalues,
so when we have close eigenvalues, it can be very slow.
We address these issues one at a time. First, we will reduce the matrix A to Hessenberg form (so each
step of the QR algorithm becomes O(n2 )) and then introduce shifts to make the QR algorithm converge in
about 2-3 steps per eigenvalue.
14.1. Hessenberg form. The problem with the cost of the QR iteration above is that it requires that
n(n 1)/2 Givens rotations be performed in order to compute the QR decomposition. At O(n) per Givens
rotation, each step of the QR iteration costs O(n3 ).
Our next goal is to first find a matrix, similar to A, which has a lot of zeros below the main diagonal so
that the QR decomposition of it will require a lot less Givens rotations to perform. This is the Hessenberg
form of A: It has zeros below the subdiagonal (i.e., aij = 0 for i > j + 1) and requires only n 1 Givens
rotations to compute its QR decomposition. At the cost of O(n) per Givens, this means O(n2 ) per step of
QR iteration!
Here is how the reduction to Hessenberg form works. We first use a Givens rotation G to zero out the
(3, 1) entry of A using row 2, then complete the similarity transformation by multiplying by G on the right
(which rotates columns 2 and 3). Now, very importantly, this does not disturb the zero in position (3, 1) that


we just created. We then use row 2 again to create a zero in position (4, 1), again completing the similarity on the right, which again does not disturb the zero we just created in position (4, 1) (since this rotates columns 2 and 4), nor the one we already had in position (3, 1). We proceed like this, zeroing out all elements in the first column down to the (n, 1) entry.
It might be tempting to zero out the (2, 1) entry as well. We could do so by applying a Givens on the
left rotating rows 1 and 2. However, completing the similarity on the right by rotating columns 1 and 2 will
destroy all the zeros in the first column we just created. Thus we leave the (2, 1) entry intact.
We proceed with the second column and so on. One can easily see that neither the Givens rotations
needed to create the new zeros nor the completion of the similarity transformations on the right affect any
of the zeros we already have.
Here is the algorithm.
function A=hessenberg(A)                 % reduces a matrix to Hessenberg form
n=size(A,1);
for j=1:n-2                              % columns 1 through n-2
    for i=j+2:n                          % rows j+2 through n
        G=givens(A(j+1,j),A(i,j));       % we kill entry A(i,j) with A(j+1,j)
        A([j+1 i],:)=G*A([j+1 i],:);     % apply Givens on the left: rotate rows j+1 and i
        A(:,[j+1 i])=A(:,[j+1 i])*G';    % complete the similarity: rotate columns j+1 and i
    end
end
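A quick check of the reduction (assuming hessenberg and givens above are saved on the path): the output should have only roundoff-size entries below the first subdiagonal and the same eigenvalues as the input.
A=randn(6);
H=hessenberg(A);
max(max(abs(tril(H,-2))))         % entries below the subdiagonal are O(eps)
norm(sort(eig(H))-sort(eig(A)))   % the similarity preserves the eigenvalues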

14.2. Wilkinson shift. While the QR iteration as we have already described it works most of the time, there are examples on which it fails to converge at all or on which the convergence is very slow.
For example, QR iteration fails to converge at all on the matrix [0 1; 1 0]: the Givens rotation for the QR step is [0 1; −1 0], so the QR iteration cycles
[ 0  1      [ 0  -1      [ 0  1
  1  0 ] →   -1   0 ] →    1  0 ] → ⋯
and never settles down, even though the eigenvalues +1 and −1 are perfectly nice.
Additionally, QR iteration can be slow to converge to clustered eigenvalues, i.e., ones that have small relative gaps
|λᵢ − λⱼ| / |λᵢ|.
For example the eigenvalues 1.00001 and 1.00002 are clustered (the relative gap between them is 10⁻⁵), whereas 0.00001 and 0.00002 are not (the relative gap between these two is 1).
Both of these problems can be remedied by introducing shifts, which is actually a very simple change to the algorithm. If the eigenvalues of A are λᵢ, i = 1, 2, …, n, those of A − sI are λᵢ − s, i = 1, 2, …, n. If λᵢ and λⱼ have a small relative gap, then the relative gap between λᵢ − s and λⱼ − s is
|λᵢ − λⱼ| / |λᵢ − s|.
So long as we pick s near λᵢ this new relative gap can be made very large (i.e., close to 1). Then λᵢ − s and λⱼ − s will be very well separated. For example, if λᵢ = 1.00001 and λⱼ = 1.00002, choosing any shift between 1 and 1.00003 makes the relative gap between λᵢ − s and λⱼ − s at least 0.5.
One shift which works particularly well is the Wilkinson shift: it is the eigenvalue of the trailing 2 × 2 submatrix
[ a_{n-1,n-1}   a_{n-1,n}
  a_{n,n-1}     a_{n,n}   ]
that is closer to ann . With this shift the QR iteration converges in 2-3 steps per eigenvalue. It requires that
we change the inner loop of QR iteration to


[Q,R]=qr(A-s*I)
A=R*Q+s*I
Note that with the shift, we still have a similarity transformation: if A − sI = QR, then
RQ + sI = QᵀQRQ + sI = Qᵀ(A − sI)Q + sI = QᵀAQ − sQᵀQ + sI = QᵀAQ.
14.3. Convergence. When have we converged to an eigenvalue in position (n, n)? Theoretically, this happens when a_{n,n−1} = 0. But in the presence of roundoff errors zeros are elusive, so we settle for machine precision ε. Thus we say that we've converged when |a_{n,n−1}| < 10⁻¹⁶ |a_{nn}|.
14.4. Our ultimate QR iteration algorithm. Our ultimate algorithm performs n − 1 Givens rotations to compute the QR decomposition, records them, then applies their transposes on the right.
function A=QRiteration(A)
A=hessenberg(A);                     % first reduce A to Hessenberg form
n=size(A,1);
StoredGivens=zeros(2,2,n);           % allocate space to store Givens rotations
i=n;                                 % compute eigenvalue i starting with i=n and going down
while i>1                            % so long as we have eigenvalues left to compute ...
    L=eig(A(i-1:i,i-1:i));           % compute the Wilkinson shift:
    if abs(L(1)-A(i,i))<abs(L(2)-A(i,i))    % ... it's the eigenvalue of
        shift=L(1);                  % ... the 2x2 submatrix A(i-1:i,i-1:i)
    else
        shift=L(2);                  % ... that is closer to A(i,i)
    end
    for k=1:i
        A(k,k)=A(k,k)-shift;         % subtract the shift to form A-sI
    end
    for k=2:i
        G=givens(A(k-1,k-1),A(k,k-1));   % form the QR by killing entries (k,k-1) ...
        A(k-1:k,:)=G*A(k-1:k,:);         % ... using Givens rotations, applied on the left
        StoredGivens(:,:,k)=G;           % ... and storing them so they can be applied on the right
    end
    for k=2:i
        G=StoredGivens(:,:,k);       % recover Givens from storage
        A(:,k-1:k)=A(:,k-1:k)*G';    % and complete the similarity on the right
    end
    for k=1:i
        A(k,k)=A(k,k)+shift;         % add the shift back in
    end
    if abs(A(i,i-1))<10^(-16)*abs(A(i,i))   % check for convergence
        i=i-1;                       % if so, move on to eigenvalue i-1
    end
end

14.5. The symmetric eigenvalue problem. If A is symmetric, the exact same QR iteration yields the eigenvalues of A. The difference is that the Hessenberg form of A is now tridiagonal (since if QAQᵀ = H and Aᵀ = A, we get Hᵀ = H, so H must be symmetric, thus tridiagonal).
One can account for that fact by only applying each Givens rotation to only those entries in the rows
and columns of A that are nonzero.


14.6. Operation count. We will only do the accounting in the case when only the eigenvalues (but
not the eigenvectors) are desired.
The reduction to Hessenberg form requires (n − 2) + (n − 3) + ⋯ + 1 Givens rotations, each costing O(n), for a total of O(n³).
The iterative part of the algorithm then requires 2-3 iterations for each of the n eigenvalues. Each iteration consists of 2(n − 1) Givens rotations (n − 1 rotations applied to each side of the matrix).
In the nonsymmetric case each Givens rotation costs O(n). In the symmetric case each Givens rotation only changes 4 entries in each of the 2 rows (or columns when applied on the right) that it touches. This is 4 · 2 · 4 = 32 arithmetic operations each. The overall cost of the iterative part is thus:
In the nonsymmetric case: O(n3 );
In the symmetric case: O(n2 ).
Note that in the symmetric case the finite part, i.e., the reduction to tridiagonal form costs O(n3 ), while the
potentially infinite iterative part takes O(n2 ) and is negligible compared to the finite one.
14.7. Stability. The following theorem states that the application of one or multiple Givens rotations
to a matrix A (possibly with a shift) is backward stable.
Theorem 14.1. If A = QBQᵀ is computed using Givens rotations in floating point arithmetic with precision ε, then for the computed factors Q̃, B̃ we have
Q̃ B̃ Q̃ᵀ = A + δA,   where   ‖δA‖ / ‖A‖ = O(ε).

This theorem indicates that the reduction to Hessenberg form as well as the subsequent eigenvalue computation is backward stable. In other words, the computed eigenvalues λ̃ᵢ will be those of a nearby matrix A + δA, where ‖δA‖/‖A‖ = O(ε). How close are the λ̃ᵢ to the λᵢ?
Let x and y be the unit right and left eigenvectors corresponding to λ, i.e., Ax = λx and y*A = λy*, ‖x‖₂ = ‖y‖₂ = 1.
We write (A + δA)(x + δx) = (λ + δλ)(x + δx). Since Ax = λx, and after ignoring second order terms (i.e., ones with two δ's), we get
A δx + δA x = λ δx + δλ x.
Multiplying on the left by y* we get
y*A δx + y* δA x = λ y* δx + δλ y*x
and thus δλ y*x = y* δA x, implying
δλ = y* δA x / (y*x)   and   |δλ| ≤ ‖δA‖ / |y*x|.
Therefore eigenvalues for which the left and right eigenvectors are nearly orthogonal (i.e., 1/|y*x| = sec θ(x, y) is large) will be very ill conditioned and will likely be computed inaccurately in the presence of roundoff.
Even when y*x is not small, e.g., in the symmetric case when x and y coincide, there is no guarantee that the computed eigenvalue will be accurate if it is very small compared to ‖A‖. In the symmetric case the above implies
|δλ| ≲ ‖δA‖ = O(ε)‖A‖ = |λ_max| O(ε).
Therefore eigenvalues close to λ_max will be computed accurately and those much smaller will progressively have fewer and fewer correct digits.
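The quantity 1/|y*x| can be examined directly in MATLAB; the sketch below (illustrative; it uses the three-output form of eig, which also returns left eigenvectors) prints an estimated condition number for each eigenvalue:
A=randn(6);
[V,D,W]=eig(A);                   % right eigenvectors V, left eigenvectors W
kappa_lambda=zeros(1,6);
for i=1:6
    x=V(:,i)/norm(V(:,i));        % unit right eigenvector
    y=W(:,i)/norm(W(:,i));        % unit left eigenvector
    kappa_lambda(i)=1/abs(y'*x);  % sec of the angle between x and y
end
kappa_lambda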


Problems:
(1) Modify the MATLAB code QRiteration to accumulate the eigenvector matrix and compute (and
turn in) the eigenvector matrix of pascal(5). You will need to accumulate all the Givens rotations
for the reduction to Hessenberg form, then the Givens rotations in the reduction to Schur form
(which for this symmetric matrix will be diagonal). Compare your output with that of eig.
(2) Modify the MATLAB code QRiteration so that when the matrix is tridiagonal, it only takes
O(n2 ) operations to compute all the eigenvalues. You will need to modify the application of the
Givens rotations so that they only affect the nonzero entries of A then empirically estimate the
total cost. Turn in your code and the number of flops for the 20 20 tridiagonal matrix with
a_ii = 2, i = 1, 2, …, n, and a_ij = −1 for |i − j| = 1.


15. The Singular Value Decomposition (SVD)


If A is m × n, m ≥ n, then the Singular Value Decomposition (SVD) of A is
A = U Σ Vᵀ,
where U is m × m orthogonal, Σ = diag(σ₁, σ₂, …, σₙ) with σ₁ ≥ σ₂ ≥ ⋯ ≥ σₙ ≥ 0 is m × n diagonal, and V is n × n orthogonal.
To see this, let σ = ‖A‖₂ and let v be the unit vector such that ‖Av‖₂ = max_{‖w‖₂=1} ‖Aw‖₂ = σ. Then Av = σu, where ‖u‖₂ = 1. Now we complete u and v to orthogonal m × m and n × n matrices [u, U] and [v, V], respectively. Thus
[u, U]ᵀ A [v, V] = [ uᵀAv   uᵀAV
                     UᵀAv   UᵀAV ].
We have uᵀAv = σ and UᵀAv = σUᵀu = 0 (since U is the orthogonal complement of u). Also, uᵀAV = 0, because otherwise
σ = ‖A‖₂ = ‖[u, U]ᵀ A [v, V]‖₂ ≥ ‖e₁ᵀ [u, U]ᵀ A [v, V]‖₂ = ‖uᵀA[v, V]‖₂ = ‖[σ, uᵀAV]‖₂ > σ,
a contradiction. Therefore
[u, U]ᵀ A [v, V] = [ σ   0
                     0   B ]
and we proceed by induction with the matrix B.
Reduced SVD. If we were to take the first n columns of U and the first n rows of Σ and the entire V, we obtain the reduced SVD
A = U Σ Vᵀ,
where U is now m × n with orthonormal columns, Σ is n × n diagonal, and V is n × n orthogonal.
Properties of the SVD.
(1) σ₁ = ‖A‖₂.
(2) A_k = σ₁u₁v₁ᵀ + ⋯ + σ_k u_k v_kᵀ is the best rank-k approximation to A (see the sketch after this list).
(3) If A is symmetric, σᵢ = |λᵢ|, i = 1, 2, …, n.
(4) The eigenvalues of AᵀA are the σᵢ², and the eigenvalues of AAᵀ are the σᵢ² together with m − n zeros.
(5) rank(A) is equal to the number of nonzero singular values.
(6) If A is nonsingular, ‖A⁻¹‖₂ = 1/σₙ and κ₂(A) = σ₁/σₙ.
(7) If A has full rank, then the solution to the least squares problem min_x ‖Ax − b‖₂ is x = V Σ⁻¹ Uᵀ b (using the reduced SVD).
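The sketch below (illustrative, using the built-in svd) checks property (2) numerically: the two-norm error of the rank-k truncation equals σ_{k+1}.
A=rand(8,5);
[U,S,V]=svd(A);
k=2;
Ak=U(:,1:k)*S(1:k,1:k)*V(:,1:k)';   % rank-k truncated SVD
[norm(A-Ak) S(k+1,k+1)]             % the two numbers agree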

Computing the SVD. The computation of the SVD consists of two parts. First, we use Givens rotations to reduce the matrix to bidiagonal form through a process called Golub-Kahan bidiagonalization.
We use m − 1 Givens rotations on the left to zero out the entries below the diagonal in the first column and then n − 2 Givens rotations on the right to zero out the third through the last entries in the first row. This way only 2 nonzeros remain in the first row, and creating the zeros in the first row does not affect the zeros we already created in the first column. We then proceed by induction with the trailing (m − 1) × (n − 1) matrix:
G_{m-1} ⋯ G_2 G_1 [ a11  a12  a13  ...  a1n       [ a11'  a12'  a13'  ...  a1n'
                    a21  a22  a23  ...  a2n         0     a22'  a23'  ...  a2n'
                     :    :    :         :     =     :     :     :          :
                    am1  am2  am3  ...  amn ]       0     am2'  am3'  ...  amn' ]
and
[ a11'  a12'  a13'  ...  a1n'                          [ a11'  a12''   0    ...   0
  0     a22'  a23'  ...  a2n'                            0     a22''  a23'' ...  a2n''
   :     :     :          :    G_1' G_2' ⋯ G_{n-2}'  =    :     :      :          :
  0     am2'  am3'  ...  amn' ]                          0     am2''  am3'' ...  amn'' ].

15.1. MATLAB algorithm.
function A=GKbidiagonalization(A)
[m,n]=size(A);
for j=1:n
    for i=m:-1:j+1
        G=givens(A(i-1,j),A(i,j));    % kill A(i,j) using the entry right above it
        A(i-1:i,:)=G*A(i-1:i,:);      % rotate rows i-1 and i
    end
    for i=n:-1:j+2
        G=givens(A(j,i-1),A(j,i));    % kill A(j,i) using the entry to its left
        A(:,i-1:i)=A(:,i-1:i)*G';     % rotate columns i-1 and i (note the transpose)
    end
end
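Assuming the givens function from section 8 is on the path, the following sketch verifies that the reduction produces an upper bidiagonal matrix and, being an orthogonal equivalence, preserves the singular values:
A=rand(7,4);
B=GKbidiagonalization(A);
max(max(abs(tril(B,-1))))    % zeros below the diagonal
max(max(abs(triu(B,2))))     % zeros above the first superdiagonal
norm(svd(B)-svd(A))          % the singular values are unchanged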
15.2. The SVD of a bidiagonal matrix. If B is bidiagonal, then B T B is tridiagonal and its eigenvalues are the squares of the singular values of B (and of A) and can be computed in O(n2 ) time using QR
iteration.
15.3. Further reading. The best method for computing the SVD of a bidiagonal matrix is the differential quotient-difference with shift (dqds). It has both optimal complexity and high relative accuracy, meaning
all singular values, including the tiniest ones are computed with most of their digits correct, irrespective of
condition numbers. See Demmel [3, section 5.4.2.].
Problems.
(1) Prove that if A has full rank, then the solution to the least squares problem min_x ‖Ax − b‖₂ is x = V Σ⁻¹ Uᵀ b.


16. Krylov subspace methods


What linear algebra problems can we solve if the only thing we can do with a matrix A is form Ax?
This happens for example when A is very large and sparse, but there are other situations when we can form
Ax via a black box (e.g., when A is highly structured and a fast multiplication is possible, but the size of
A prevents us from forming it explicitly).
It turns out we can build better and better approximations to the solutions of Ax = b and Ax = λx.
The guiding hope is, of course, that we can get to those approximations rather quickly.
We take a random vector b and form the subspace
K_m = span{b, Ab, A²b, …, A^{m−1}b}.
This is called an mth Krylov subspace of A. Denote K_m = [b, Ab, A²b, …, A^{m−1}b] (this is an n × m matrix). Since the sequence b, Ab, A²b, … converges to an eigenvector of A, the matrix K_m gets progressively ill conditioned. This is why we consider an orthogonal basis of the space K_m, i.e., we consider the matrix Q_m in the reduced QR decomposition of K_m.
Next, we establish that the projection of A onto the space K_m (i.e., the matrix QmᵀAQm) is Hessenberg. If eᵢ is the ith column of the identity, then for any matrix B, Beᵢ is its ith column; thus we have
A K_m = [Ab, A²b, …, A^m b] = K_m [e₂, e₃, …, e_m, c] ≡ K_m C,
where c ≡ K_m⁻¹ A^m b and
C = [ 0  0  ...  0  c_1
      1  0  ...  0  c_2
      0  1  ...  0  c_3
      :       .      :
      0  0  ...  1  c_m ]
is Hessenberg. Now if Q_m R_m = K_m is the reduced QR decomposition of K_m, then K_m⁻¹ A K_m = C means
(Q_m R_m)⁻¹ A Q_m R_m = C
or
Q_mᵀ A Q_m = R_m C R_m⁻¹ ≡ H,
where H is also Hessenberg (because one can easily verify that the product of the two upper triangular matrices R_m and R_m⁻¹ with the Hessenberg matrix C is also Hessenberg).
What are the entries of H? We have
A Q_m = Q_m H.
We compare column j on both sides. On the left it is Aq_j, where q_j is the jth column of Q_m, and on the right it is Q_m h_j, where h_j is the jth column of H. Namely,
A q_j = Σ_{i=1}^{j+1} h_{ij} q_i.
Multiplying both sides by q_kᵀ on the left (for k ≤ j) gets us q_kᵀ A q_j = Σ_{i=1}^{j+1} h_{ij} q_kᵀ q_i = h_{kj}, which implies
h_{j+1,j} q_{j+1} = A q_j − Σ_{i=1}^{j} h_{ij} q_i.
Therefore we can get the vector q_{j+1} by forming Aq_j and subtracting all its components in the directions of q₁, q₂, …, q_j. This is just the modified Gram-Schmidt algorithm.


Arnoldi. Arnoldi's algorithm approximates the eigenvalues of the matrix A by those of H.
function H=arnoldi(A,m,b)             % m = number of steps, b = initial vector
n=size(A,1);
Q=zeros(n,m+1); H=zeros(m+1,m);       % initialize Q, H
Q(:,1) = b/norm(b);
for j=1:m
    v=A*Q(:,j);
    for i=1:j
        H(i,j)=Q(:,i)'*v;
        v=v-H(i,j)*Q(:,i);            % subtracting all components in directions 1 through j
    end
    H(j+1,j) = norm(v);
    Q(:,j+1) = v/H(j+1,j);
end
Arnoldi returns the (m + 1) × m Hessenberg matrix H̄ = Q_{m+1}ᵀ A Q_m. The matrix H that we need for the eigenvalue computation is H = H̄(1 : m, 1 : m), but H̄ will be useful in an upcoming algorithm, GMRES.
Lanczos. The Lanczos method is Arnoldi's method applied to a symmetric matrix. The matrix H is then tridiagonal. This means that each new vector v needs to be orthogonalized with respect to the previous two vectors only. This leads to substantial computational savings, but is otherwise identical to Arnoldi.
function H=lanczos(A,m,b)             % m = number of steps, b = initial vector
n=size(A,1);
Q=zeros(n,m+1); H=zeros(m+1);         % initialize Q, H
Q(:,1) = b/norm(b);
for j=1:m
    v=A*Q(:,j);
    H(j,j)=Q(:,j)'*v;
    v=v-H(j,j)*Q(:,j);                % subtracting all components in directions j and j-1
    if j>1
        v=v-H(j-1,j)*Q(:,j-1);
    end
    H(j+1,j)=norm(v);
    H(j,j+1)=H(j+1,j);                % H is symmetric so H(j,j+1)=H(j+1,j)
    Q(:,j+1) = v/H(j+1,j);
end
GMRES. GMRES stands for Generalized Minimum Residuals and is a Krylov subspace method used for finding an approximate solution to Ax = b. The idea is to find a vector x ∈ K_m that minimizes ‖Ax − b‖₂. Thus x = Q_m c for some vector c and
min_{x∈K_m} ‖Ax − b‖₂ = min_c ‖A Q_m c − b‖₂.
The vector b is chosen as the initial vector, thus b ∈ K_m ⊆ K_{m+1} and also A Q_m c ∈ K_{m+1}. Therefore multiplying by Q_{m+1}ᵀ does not change the 2-norm above, so
min_c ‖A Q_m c − b‖₂ = min_c ‖Q_{m+1}ᵀ A Q_m c − Q_{m+1}ᵀ b‖₂ = min_c ‖H̄ c − Q_{m+1}ᵀ b‖₂.
Since b is (up to scaling) the first column of Q_{m+1} we have Q_{m+1}ᵀ b = ‖b‖₂ e₁, and therefore the GMRES solution c is the solution to the least squares problem min_c ‖H̄ c − ‖b‖₂ e₁‖₂. Once we have c, we obtain x = Q_m c.
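Here is a bare-bones GMRES sketch (an illustration under the assumptions above: it repeats the Arnoldi loop so that it also has access to Q_m, which the arnoldi function above does not return; there are no restarts and no preconditioning). For a test matrix, norm(A*gmres_sketch(A,b,m)-b) should decrease as m grows.
function x=gmres_sketch(A,b,m)
n=size(A,1);
Q=zeros(n,m+1); H=zeros(m+1,m);
Q(:,1)=b/norm(b);
for j=1:m                          % Arnoldi, exactly as above
    v=A*Q(:,j);
    for i=1:j
        H(i,j)=Q(:,i)'*v;
        v=v-H(i,j)*Q(:,i);
    end
    H(j+1,j)=norm(v);
    Q(:,j+1)=v/H(j+1,j);
end
c=H\(norm(b)*eye(m+1,1));          % least squares problem: minimize ||H*c - ||b|| e1||
x=Q(:,1:m)*c;                      % the GMRES approximation from K_m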


The symmetric version of this algorithm is called MINRES.


17. The Discrete Fourier Transform

For the rest of this section i = √−1.


The matrix of the N × N Discrete Fourier Transform (DFT) is the Vandermonde matrix
Φ_N = [w^{jk}]_{j,k=0}^{N−1},
where w = e^{−2πi/N} = cos(2π/N) − i sin(2π/N) is a principal Nth root of unity.
Its utility comes from:
(1) Φ⁻¹ = (1/N) Φ̄.
(2) Z is the imaginary part of Φ_{2N+2}(2 : N + 1, 2 : N + 1); also Z⁻¹ is a scalar multiple of Z, since Z is symmetric.
(3) One can multiply by Φ in O(N log N) time (vs. O(N²) for conventional matrix-vector multiplication).
Proof of (1): Since w̄ = w⁻¹ and w^N = 1,
(Φ Φ̄)_{lj} = Σ_{k=0}^{N−1} Φ_{lk} Φ̄_{kj} = Σ_{k=0}^{N−1} w^{lk} w̄^{kj} = Σ_{k=0}^{N−1} w^{k(l−j)}.
This is the sum of N ones if l = j, and a geometric sum if l ≠ j with value
(1 − w^{N(l−j)}) / (1 − w^{l−j}) = 0.
Part (2) is obvious. Part (3) is the Fast Fourier Transform (FFT).
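Property (1) and the connection to MATLAB's fft (which uses the same w = e^{−2πi/N}) are easy to check; the sketch below (illustrative) builds Φ_N explicitly so that it stays self-contained:
N=8;
w=exp(-2*pi*1i/N);
Phi=w.^((0:N-1)'*(0:N-1));     % Phi(j,k) = w^(jk), j,k = 0,...,N-1
a=rand(N,1);
norm(Phi*a-fft(a))             % multiplication by Phi is exactly the FFT
norm(inv(Phi)-conj(Phi)/N)     % property (1): Phi^{-1} = (1/N)*conj(Phi)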

17.1. The FFT. Assume for simplicity that N = 2^m.
Let a be the vector a = [a₀, …, a_{N−1}]ᵀ. Forming the product Φa means we need to evaluate the polynomial
a(x) = a₀ + a₁x + ⋯ + a_{N−1}x^{N−1}
at all N roots of unity x = w^j, j = 0, 1, …, N − 1.
We write
a(x) = a₀ + a₁x + ⋯ + a_{N−1}x^{N−1}
     = (a₀ + a₂x² + a₄x⁴ + ⋯) + x(a₁ + a₃x² + a₅x⁴ + ⋯)
     = a_even(x²) + x · a_odd(x²).
Thus we need to evaluate the two polynomials a_even and a_odd at (w^j)², j = 0, 1, …, N − 1. The computational savings come from the recognition that (w^j)², j = 0, 1, …, N − 1, are just N/2, not N, points, each repeated twice, since w^{2j} = w^{2(j+N/2)}.
Thus computing the FFT of a vector of size N is the same as computing two FFTs of size N/2 and combining the results with N/2 multiplications and N additions.
Here is the algorithm:
function FFT(a)
if length(a)=1
    return a
else
    a1=FFT(a_even)
    a2=FFT(a_odd)
    w=e^(-2*pi*i/N)
    u=[w^0, w^1, ..., w^(N/2-1)]
    return [a1 + u.*a2, a1 - u.*a2]    % the multiplication is meant componentwise
end
We used the fact that w^{j+N/2} = −w^j.
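One way to turn this pseudocode into runnable MATLAB is sketched below (an illustration that assumes a is a column vector whose length is a power of 2; the function name myfft is arbitrary). For a quick test, a=rand(16,1); norm(myfft(a)-fft(a)) should be of order 10⁻¹⁵.
function y=myfft(a)
% recursive radix-2 FFT of a column vector whose length is a power of 2
N=length(a);
if N==1
    y=a;
else
    a1=myfft(a(1:2:N-1));             % FFT of the even-indexed coefficients a0,a2,...
    a2=myfft(a(2:2:N));               % FFT of the odd-indexed coefficients a1,a3,...
    u=exp(-2*pi*1i/N).^((0:N/2-1)');  % w^0, ..., w^(N/2-1)
    y=[a1+u.*a2; a1-u.*a2];           % combine, using w^(j+N/2) = -w^j
end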


For the cost C(N) of the algorithm, we assume the powers of w are precomputed and stored, so we have
C(N) = 2C(N/2) + 3N/2 = 4C(N/4) + 2 · (3N/2) = 8C(N/8) + 3 · (3N/2) = ⋯ = (3N/2) log₂ N.


17.2. Multiplication by V. We now explain how to perform the multiplication V x in O(N² log N) time. Since V is N² × N² and x is N² × 1, we partition x into sections of size N:
x = [ x¹
      x²
      ⋮
      x^N ],
where each
xⁱ = [ x_{i1}
       x_{i2}
       ⋮
       x_{iN} ]
is an N × 1 vector (for a total of N² × 1 for x). Let yⁱ = Z xⁱ, i = 1, 2, …, N. Writing V = [z_{jk} Z]_{j,k=1}^N (the block matrix whose (j, k) block is z_{jk}Z), the jth block (of length N) of V x is
z_{j1} Z x¹ + z_{j2} Z x² + ⋯ + z_{jN} Z x^N = z_{j1} y¹ + z_{j2} y² + ⋯ + z_{jN} y^N,
and its ith entry is
z_{j1} y_i¹ + z_{j2} y_i² + ⋯ + z_{jN} y_i^N.


The trick now is to recognize that the N entries of V x in positions 1, N + 1, 2N + 1, …, (N − 1)N + 1 are exactly
Z [ y_1^1        [ z_{11} y_1^1 + z_{12} y_1^2 + ⋯ + z_{1N} y_1^N
    y_1^2          z_{21} y_1^1 + z_{22} y_1^2 + ⋯ + z_{2N} y_1^N
     ⋮       =      ⋮
    y_1^N ]        z_{N1} y_1^1 + z_{N2} y_1^2 + ⋯ + z_{NN} y_1^N ].
Similarly, we get the entries in positions 2, N + 2, 2N + 2, … by applying Z to [y_2^1, y_2^2, …, y_2^N]ᵀ, and so on.


If Zb(x) is the MATLAB routine that takes a vector x as an input and returns Zx, then the multiplication of an N² × 1 vector x by V is done as follows:
function y=ZNNx(x)
N=sqrt(length(x));
for i=1:N
    y((i-1)*N+1:i*N)=Zb(x((i-1)*N+1:i*N));    % y^i = Z x^i, block by block
end
for i=1:N
    y(i:N:end)=Zb(y(i:N:end));                % apply Z across the blocks
end


17.3. The right hand side of T_{N×N} x = b. We saw that solving the heat equation in 2D means solving T_{N×N} x = b for a certain right hand side b. We will establish its exact form.
Say we're solving the heat equation on [0, 1] × [0, 1], subdivided into N + 1 parts, so that in both the x and y directions we have N + 2 points, 2 of which carry boundary conditions and N of which are unknown. Say all data is in an (N + 2) × (N + 2) matrix u with the unknown values in u(2 : N + 1, 2 : N + 1). The boundary conditions are u(:, 1), u(:, N + 2), u(1, :) and u(N + 2, :).
Here are all our equations for the u_{ij}'s:
4u_{22} − u_{21} − u_{23} − u_{12} − u_{32} = 0
4u_{23} − u_{22} − u_{24} − u_{13} − u_{33} = 0
⋮
4u_{2,N} − u_{2,N−1} − u_{2,N+1} − u_{1,N} − u_{3,N} = 0
4u_{2,N+1} − u_{2,N} − u_{2,N+2} − u_{1,N+1} − u_{3,N+1} = 0
4u_{32} − u_{31} − u_{33} − u_{22} − u_{42} = 0
4u_{33} − u_{32} − u_{34} − u_{23} − u_{43} = 0
⋮
4u_{3,N} − u_{3,N−1} − u_{3,N+1} − u_{2,N} − u_{4,N} = 0
4u_{3,N+1} − u_{3,N} − u_{3,N+2} − u_{2,N+1} − u_{4,N+1} = 0
⋮
4u_{N,2} − u_{N,1} − u_{N,3} − u_{N−1,2} − u_{N+1,2} = 0
4u_{N,3} − u_{N,2} − u_{N,4} − u_{N−1,3} − u_{N+1,3} = 0
⋮
4u_{N,N} − u_{N,N−1} − u_{N,N+1} − u_{N−1,N} − u_{N+1,N} = 0
4u_{N,N+1} − u_{N,N} − u_{N,N+2} − u_{N−1,N+1} − u_{N+1,N+1} = 0
4u_{N+1,2} − u_{N+1,1} − u_{N+1,3} − u_{N,2} − u_{N+2,2} = 0
4u_{N+1,3} − u_{N+1,2} − u_{N+1,4} − u_{N,3} − u_{N+2,3} = 0
⋮
4u_{N+1,N} − u_{N+1,N−1} − u_{N+1,N+1} − u_{N,N} − u_{N+2,N} = 0
4u_{N+1,N+1} − u_{N+1,N} − u_{N+1,N+2} − u_{N,N+1} − u_{N+2,N+1} = 0

66

3. SOLVING LINEAR SYSTEMS

Careful inspection reveals that the right hand side is then
T_{N×N} x = [ u_{12} + u_{21}
              u_{13}
              ⋮
              u_{1,N}
              u_{1,N+1} + u_{2,N+2}
              u_{3,1}
              0
              ⋮
              0
              u_{3,N+2}
              ⋮
              u_{N,1}
              0
              ⋮
              0
              u_{N,N+2}
              u_{N+2,2} + u_{N+1,1}
              u_{N+2,3}
              ⋮
              u_{N+2,N}
              u_{N+2,N+1} + u_{N+1,N+2} ].
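A sketch of how this right hand side could be assembled in MATLAB from the boundary data (illustrative; it assumes u is the (N + 2) × (N + 2) data matrix above and that the unknowns are ordered row block by row block, matching the vector just displayed):
b=zeros(N^2,1);
for i=2:N+1                          % loop over the unknowns u(i,j)
    for j=2:N+1
        k=(i-2)*N+(j-1);             % position of u(i,j) in the unknown vector x
        if i==2,   b(k)=b(k)+u(1,j);   end     % top boundary neighbor
        if i==N+1, b(k)=b(k)+u(N+2,j); end     % bottom boundary neighbor
        if j==2,   b(k)=b(k)+u(i,1);   end     % left boundary neighbor
        if j==N+1, b(k)=b(k)+u(i,N+2); end     % right boundary neighbor
    end
end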


18. High accuracy computations


At this point we have come to the understanding that ill conditioned matrices will cause trouble. The smallest singular value of a matrix with condition number 10¹³ will barely have 3 decimal digits correct, and that of a matrix with condition number 10¹⁸ will likely have no correct digits at all.
In this section we explain where the loss of accuracy occurs and describe situations where that loss of accuracy may be avoided altogether by designing specialized algorithms that compute accurate eigenvalues regardless of conventional condition numbers.
The culprit: subtractive cancellation. Suppose we have two quantities ã and b̃ in the computer, which have been computed in floating point arithmetic and are now accurate to (say) 10 digits compared to the true values a and b. We will assume a > 0 and b > 0.
The main observation here is that ã + b̃, ã · b̃ and ã/b̃ will also be accurate to about 10 digits, whereas ã − b̃ may not. This does not automatically mean ã − b̃ will have fewer correct digits, but it means that subtractions are the only place where loss of accuracy occurs, so they deserve special attention if accuracy is to be preserved.
First, we explain why addition, multiplication, and division are OK. Let ã = a(1 + δ₁) and b̃ = b(1 + δ₂), with |δ₁| ≤ kε and |δ₂| ≤ lε for some modest k and l. Assume additionally that k ≥ l.
In the following calculations we will ignore terms which are products of 2 or more δ's, e.g., δ₁δ₂, which are O(ε²), so too tiny to matter. This is done for convenience here and does not compromise the analysis; there is a method which does account for all the δ's and yields the same conclusions, see e.g., Higham [5]. In particular, we say that
1/(1 + δ₁) = 1 − δ₁ + δ₁² − ⋯ = 1 − δ₁.
Then the relative errors for +, ·, / are:
Addition: We have fl(ã + b̃) = (ã + b̃)(1 + δ₃), where |δ₃| ≤ ε. Therefore
|fl(ã + b̃) − (a + b)| / (a + b) = |(ã + b̃)(1 + δ₃) − (a + b)| / (a + b)
  = |(a(1 + δ₁) + b(1 + δ₂))(1 + δ₃) − (a + b)| / (a + b)
  = |aδ₁ + bδ₂ + (a + b)δ₃| / (a + b)          (we've dropped the δ₁δ₃ and δ₂δ₃ terms)
  ≤ (a|δ₁| + b|δ₂| + (a + b)|δ₃|) / (a + b)     (since a > 0 and b > 0)
  ≤ (akε + blε + (a + b)ε) / (a + b)
  ≤ (k + 1)ε.
Multiplication: We have fl(ã b̃) = (ã b̃)(1 + δ₃), where |δ₃| ≤ ε. Therefore
|fl(ã b̃) − ab| / (ab) = |a(1 + δ₁) b(1 + δ₂)(1 + δ₃) − ab| / (ab)
  = |(1 + δ₁)(1 + δ₂)(1 + δ₃) − 1|
  = |δ₁ + δ₂ + δ₃|                              (again, ignoring the δ₁δ₂ terms, etc.)
  ≤ (k + l + 1)ε.


Division: We have fl(ã/b̃) = (ã/b̃)(1 + δ₃), where |δ₃| ≤ ε. Therefore
|fl(ã/b̃) − a/b| / (a/b) = |a(1 + δ₁)/(b(1 + δ₂)) · (1 + δ₃) − a/b| / (a/b)
  = |(1 + δ₁)/(1 + δ₂) · (1 + δ₃) − 1|
  = |(1 + δ₁)(1 − δ₂)(1 + δ₃) − 1|              (note that in our convention 1/(1 + δ₂) = 1 − δ₂)
  = |δ₁ − δ₂ + δ₃|
  ≤ (k + l + 1)ε.
We see that in all three cases (+, ·, /) the relative errors accumulate modestly, with the resulting quantity still having about as many correct digits as the two arguments.
One also shouldn't worry about the k and l coefficients in the relative errors. In practical computations with n × n matrices these typically grow as O(n), leaving plenty of correct digits even for n in the millions.
Subtractions are tricky: following the same calculations as with addition, the relative error in subtraction is easily seen to be
|fl(ã − b̃) − (a − b)| / |a − b| = |aδ₁ − bδ₂ + (a − b)δ₃| / |a − b| ≤ (a|δ₁| + b|δ₂| + |a − b||δ₃|) / |a − b| ≤ ((a + b) / |a − b|) kε + ε.
Unlike +, ·, /, this relative error now depends on a and b! In particular, if |a − b| is much smaller than a + b, i.e., when a and b share a few significant digits, the relative error can be enormous!
For example, if a and b have 7 digits in common, e.g., a = 1.234567444444444 and b = 1.234567333333333, then
(a + b) / |a − b| ≈ 10⁷
and fl(ã − b̃) will have 7 fewer correct significant digits than either ã or b̃!
This is how accuracy is lost in numerical computations: when subtracting quantities that
(1) are results of previous floating point computations (and thus carry rounding errors), and
(2) share significant digits.
Item (1) above has a (pleasant) caveat: if a and b are initial data (and thus assumed to be known exactly), then a − b is computed accurately, as per the fact that fl(a − b) = (a − b)(1 + δ) with relative error such that |δ| ≤ ε.
The moral of this story is that
the output will be accurate if we only multiply, divide, add, and only subtract initial data!
If we perform other subtractions, it does not automatically mean the results will be inaccurate, just that there's no guarantee of accuracy.
For example, the formula for the determinant of the Vandermonde matrix V = [x_i^{j−1}]_{i,j=1}^n,
det V = ∏_{i>j} (x_i − x_j),
will always be accurate because it only involves subtractions of initial data and multiplications.
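As a small MATLAB illustration (not from the text), the product formula can be compared with the generic determinant; for nearly coincident nodes the product formula keeps its accuracy, while det applied to the explicitly formed matrix need not:
x=[1 2 3 4 4+1e-12];
V=fliplr(vander(x));                 % V(i,j) = x(i)^(j-1)
d=1;
for i=2:length(x)
    for j=1:i-1
        d=d*(x(i)-x(j));             % only initial data is ever subtracted
    end
end
[d det(V)]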


Only true subtractions should be avoided. Just because there is a minus in the formula does not
mean this is a potential problem: it must be a true subtraction (i.e., a positive quantity gets subtracted
from a positive quantity). Subtracting a positive from a negative is a net addition and cannot result in loss
of significant digits.
For example, we can be certain to get an accurate result if we multiply two matrices with positive entries (since we'll only add and multiply), but it is also OK to multiply two matrices with checkerboard sign pattern, since then the multiplication will not involve any true subtractions regardless of all the minuses: each entry of the product is either a sum of positive terms or the negative of such a sum, so no significant digits are lost despite all the minus signs.
This is an important observation, since the inverses of the Hilbert and Pascal matrices have checkerboard sign pattern and in general all linear algebra with these matrices can be performed to high relative
accuracy [6].
How can we perform linear algebra without subtracting? This obviously cannot always be done, but there are many classes of structured matrices where accurate computations are possible through the avoidance of subtractive cancellation [2, 6]. The problems after this section provide an example of how, by avoiding subtractive cancellation, one can compute accurate results in about the same amount of time that MATLAB will take (and deliver the wrong answer!).
Problems.
(1) Compute the smallest eigenvalue of the 100 × 100 matrix H, where
h_{ij} = 1 / (i + j).
Turn in your code and all 16 digits of the smallest eigenvalue. The first 4 digits are 1.001 · 10⁻¹⁵¹.
While there are many ways to solve this problem, here are some ideas.
This matrix is an example of a Cauchy matrix. A Cauchy matrix, C = C(x, y), is defined as
c_{ij} = 1 / (x_i + y_j)
for i, j = 1, 2, …, n, where x = (x₁, x₂, …, xₙ)ᵀ and y = (y₁, y₂, …, yₙ)ᵀ are vectors of length n. The vectors x and y are the initial data.
One way to solve this problem is to first compute an accurate inverse C⁻¹ using Cramer's rule and then use QRiteration (or MATLAB's eig) to compute the largest eigenvalue of C⁻¹, which will be accurate as the largest eigenvalue of an accurate inverse. This largest eigenvalue of C⁻¹ is the reciprocal of the smallest eigenvalue of C, which is what we want.
To use Cramer's rule, use the facts that
• any submatrix of a Cauchy matrix is also Cauchy. For example, if we erase row 1 and column 1 of the Cauchy matrix C(x, y) we obtain the Cauchy matrix C(x(2 : n), y(2 : n));


• the determinant of the Cauchy matrix is
det C(x, y) = [ ∏_{i=1}^{n} ∏_{j=i+1}^{n} (x_j − x_i)(y_j − y_i) ] / [ ∏_{i=1}^{n} ∏_{j=1}^{n} (x_i + y_j) ],
which will not result in subtractive cancellation and will thus be accurate.
In the process of writing your code, you may want to write it as a function of n. This way, for
small n, e.g., n = 6, 7, your code should match the output of MATLABs eig. For these values of
n, the matrix H would still be well conditioned and MATLABs results will still be accurate.
You may also want to test your code against the output of invhilb, which is a built-in function
in MATLAB that accurately computes the inverse of the Hilbert matrix [1/(i + j 1)]ni,j=1 (note
the slight difference with the matrix H), so the minimum eigenvalue of the Hilbert matrix can be
computed accurately in MATLAB as 1/max(eig(invhilb(100))).
(2) Compute the smallest eigenvalue of the 200 × 200 Pascal matrix. Turn in your code and answer (all 16 digits of it). The first few digits of the answer are 2.9 · 10⁻¹¹⁹.
Here are some suggestions. Details on the Pascal matrix can be obtained on Wikipedia. While
there are many approaches to solve this problem, one way would be to compute the largest eigenvalue of the inverse. The inverse (which has checkerboard sign pattern) can (once again) be computed without performing any subtractions if one takes the correct approach. Instead of eliminating
the matrix in the typical Gaussian elimination fashion, try to eliminate it by using only ADJACENT rows and columns. This process is called Neville elimination. Once you eliminate the first
row and first column, you will see that the Schur complement is also a Pascal matrix of one size
less. In matrix form this elimination can be written as
L Pₙ Lᵀ = P′_{n−1},
where L is a lower bidiagonal matrix with ones on the main diagonal and −1 on the first subdiagonal, and P′_{n−1} is an n × n matrix with zeros in the first row and column, except for the (1, 1) entry (which equals one), and the matrix P_{n−1} in the lower right hand corner. You can now observe (no need to prove) that if you have P_{n−1}⁻¹ you can compute Pₙ⁻¹ using the above equality without performing any subtractions.
(3) Let A = LLᵀ, where
L = [ a  d  0
      0  b  e
      0  0  c ].
Design a subtraction-free algorithm that computes the Cholesky factor of A given a, b, c, d, and e that performs no more than 19 arithmetic operations (it can even be done with 18). Turn in your code and your output in format long e format for a = 10⁻²⁰, b = 1, c = 2, d = 1, e = 1.
The obvious call L=diag([1e-20 1 2])+diag([1 1],1); chol(L*L') will return an error saying that the computed A is not positive definite. Yet the true matrix A is positive definite.

Bibliography
1. ANSI/IEEE, New York, IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985 ed., 1985.
2. J. Demmel, M. Gu, S. Eisenstat, I. Slapničar, K. Veselić, and Z. Drmač, Computing the singular value decomposition with high relative accuracy, Linear Algebra Appl. 299 (1999), no. 1-3, 21-80.
3. J. W. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997. MR 1463942 (98m:65001)
4. G. Golub and C. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
5. N. J. Higham, Accuracy and Stability of Numerical Algorithms, second ed., SIAM, Philadelphia, 2002. MR 2003g:65064
6. Plamen Koev, Accurate computations with totally nonnegative matrices, SIAM J. Matrix Anal. Appl. 29 (2007), 731-751.
