
Numerical methods for

elliptic partial differential equations


Arnold Reusken
Preface
This is a book on the numerical approximation of partial differential equations. Below we give an overview of the structure of this book:

Elliptic boundary value problems (chapter 1):
- Poisson equation: scalar, symmetric, elliptic.
- Convection-diffusion equation: scalar, nonsymmetric, singularly perturbed.
- Stokes equation: system, symmetric, indefinite.

Weak formulation (chapter 2).

Finite element method:
- Basic principles (chapter 3); application to Poisson equation.
- Streamline-diffusion FEM (chapter 4); application to convection-diffusion equation.
- FEM for Stokes equation (chapter 5).

Iterative methods:
- Basics on linear iterative methods (chapter 6).
- Preconditioned CG method (chapter 7); application to Poisson equation.
- Krylov subspace methods (chapter 8); application to convection-diffusion equation.
- Multigrid methods (chapter 9).
- Iterative methods for saddle-point problems (chapter 10); application to Stokes equation.

Adaptivity:
- A posteriori error estimation (chapter ??).
- Grid refinement techniques (chapter ??).
Contents

1 Introduction to elliptic boundary value problems 9


1.1 Preliminaries on function spaces and domains . . . . . . . . . . . . . . . . . . . . 9
1.2 Scalar elliptic boundary value problems . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.1 Formulation of the problem . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.3 Existence, uniqueness, regularity . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 The Stokes equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Weak formulation 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Sobolev spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 The spaces $W^m(\Omega)$ based on weak derivatives . . . . . . . . . . . . . . 23
2.2.2 The spaces $H^m(\Omega)$ based on completion . . . . . . . . . . . . . . . . . 25
2.2.3 Properties of Sobolev spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 General results on variational formulations . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Minimization of functionals and saddle-point problems . . . . . . . . . . . . . . . 43
2.5 Variational formulation of scalar elliptic problems . . . . . . . . . . . . . . . . . . 45
2.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.2 Elliptic BVP with homogeneous Dirichlet boundary conditions . . . . . . 46
2.5.3 Other boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.4 Regularity results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.5 Riesz-Schauder theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6 Weak formulation of the Stokes problem . . . . . . . . . . . . . . . . . . . . . . . 56
2.6.1 Proof of the inf-sup property . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.6.2 Regularity of the Stokes problem . . . . . . . . . . . . . . . . . . . . . . . 60
2.6.3 Other boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3 Galerkin discretization and finite element method 63


3.1 Galerkin discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Examples of finite element spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.1 Simplicial finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.2 Rectangular finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Approximation properties of finite element spaces . . . . . . . . . . . . . . . . . . 68
3.4 Finite element discretization of scalar elliptic problems . . . . . . . . . . . . . . . 75
3.4.1 Error bounds in the norm $\|\cdot\|_1$ . . . . . . . . . . . . . . . . . . . . 75
3.4.2 Error bounds in the norm $\|\cdot\|_{L^2}$ . . . . . . . . . . . . . . . . . . 77
3.5 Stiffness matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.5.1 Mass matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.6 Isoparametric finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.7 Nonconforming finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4 Finite element discretization of a convection-diffusion problem 87


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 A variant of the Céa-lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3 A one-dimensional hyperbolic problem and its finite element discretization . . . . 93
4.4 The convection-diffusion problem reconsidered . . . . . . . . . . . . . . . . . . . . 100
4.4.1 Well-posedness of the continuous problem . . . . . . . . . . . . . . . . . . 101
4.4.2 Finite element discretization . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.3 Stiffness matrix for the convection-diffusion problem . . . . . . . . . . . . 113

5 Finite element discretization of the Stokes problem 115


5.1 Galerkin discretization of saddle-point problems . . . . . . . . . . . . . . . . . . . 115
5.2 Finite element discretization of the Stokes problem . . . . . . . . . . . . . . . . . 117
5.2.1 Error bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.2 Other finite element spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6 Linear iterative methods 127


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Basic linear iterative methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3 Convergence analysis in the symmetric positive definite case . . . . . . . . . . . . 134
6.4 Rate of convergence of the SOR method . . . . . . . . . . . . . . . . . . . . . . . 137
6.5 Convergence analysis for regular matrix splittings . . . . . . . . . . . . . . . . . . 140
6.5.1 Perron theory for positive matrices . . . . . . . . . . . . . . . . . . . . . . 141
6.5.2 Regular matrix splittings . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.6 Application to scalar elliptic problems . . . . . . . . . . . . . . . . . . . . . . . . 146

7 Preconditioned Conjugate Gradient method 151


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2 Conjugate Gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3 Introduction to preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.4 Preconditioning based on a linear iterative method . . . . . . . . . . . . . . . . . 161
7.5 Preconditioning based on incomplete LU factorizations . . . . . . . . . . . . . . . 162
7.5.1 LU factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.5.2 Incomplete LU factorization . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.5.3 Modified incomplete Cholesky method . . . . . . . . . . . . . . . . . . . . 169
7.6 Problem based preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.7 Preconditioned Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . 170

8 Krylov Subspace Methods 175


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.2 The Conjugate Gradient method reconsidered . . . . . . . . . . . . . . . . . . . . 176
8.3 MINRES method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.4 GMRES type of methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.5 Bi-CG type of methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

9 Multigrid methods 197
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.2 Multigrid for a one-dimensional model problem . . . . . . . . . . . . . . . . . . . 198
9.3 Multigrid for scalar elliptic problems . . . . . . . . . . . . . . . . . . . . . . . . . 203
9.4 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.4.2 Approximation property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.4.3 Smoothing property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.4.4 Multigrid contraction number . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.4.5 Convergence analysis for symmetric positive definite problems . . . . . . . 218
9.5 Multigrid for convection-dominated problems . . . . . . . . . . . . . . . . . . . . 223
9.6 Nested Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.7 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.8 Algebraic multigrid methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.9 Nonlinear multigrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

10 Iterative methods for saddle-point problems 229


10.1 Block diagonal preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.2 Application to the Stokes problem . . . . . . . . . . . . . . . . . . . . . . . . . . 232

A Functional Analysis 235


A.1 Different types of spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
A.2 Theorems from functional analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 238

B Linear Algebra 241


B.1 Notions from linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
B.2 Theorems from linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

Chapter 1

Introduction to elliptic boundary value problems

In this chapter we introduce the classical formulation of scalar elliptic problems and of the Stokes
equations. Some results known from the literature on existence and uniqueness of a classical
solution will be presented. Furthermore, we briefly discuss the issue of regularity.

1.1 Preliminaries on function spaces and domains

The boundary value problems that we consider in this book will be posed on domains $\Omega \subset \mathbb{R}^n$, $n = 1,2,3$. In the remainder we always assume that

$\Omega$ is open, bounded and connected.

Moreover, the boundary $\partial\Omega$ of $\Omega$ should satisfy certain smoothness conditions that will be introduced in this section. For this we need so-called Hölder spaces.

By $C^k(\Omega)$, $k \in \mathbb{N}$, we denote the space of functions $f : \Omega \to \mathbb{R}$ for which all (partial) derivatives

\[ D^\alpha f := \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1}\cdots\partial x_n^{\alpha_n}}, \qquad \alpha = (\alpha_1,\dots,\alpha_n), \quad |\alpha| = \alpha_1 + \dots + \alpha_n, \]

of order $|\alpha| \le k$ are continuous functions on $\Omega$. The space $C^k(\bar\Omega)$, $k \in \mathbb{N}$, consists of all functions in $C^k(\Omega) \cap C(\bar\Omega)$ for which all derivatives of order $\le k$ have continuous extensions to $\bar\Omega$. Since $\bar\Omega$ is compact, the functional

\[ f \mapsto \max_{|\alpha|\le k}\,\max_{x\in\bar\Omega}|D^\alpha f(x)| = \max_{|\alpha|\le k}\|D^\alpha f\|_{\infty,\bar\Omega} =: \|f\|_{C^k(\bar\Omega)} \]

defines a norm on $C^k(\bar\Omega)$. The space $(C^k(\bar\Omega), \|\cdot\|_{C^k(\bar\Omega)})$ is a Banach space (cf. Appendix A.1). Note that $f \mapsto \max_{|\alpha|\le k}\|D^\alpha f\|_{\infty,\Omega}$ does not define a norm on $C^k(\Omega)$.

For $f : \Omega \to \mathbb{R}$ we define its support by

\[ \mathrm{supp}(f) := \overline{\{\, x \in \Omega \mid f(x) \ne 0 \,\}}. \]

The space $C_0^k(\Omega)$, $k \in \mathbb{N}$, consists of all functions in $C^k(\Omega)$ which have a compact support in $\Omega$, i.e., $\mathrm{supp}(f) \subset \Omega$. The functional $f \mapsto \max_{|\alpha|\le k}\|D^\alpha f\|_{\infty,\Omega}$ defines a norm on $C_0^k(\Omega)$, but $(C_0^k(\Omega), \|\cdot\|_{C^k(\Omega)})$ is not a Banach space.

For a compact set $D \subset \mathbb{R}^n$ and $\lambda \in (0,1]$ we introduce the quantity

\[ [f]_{\lambda,D} := \sup\Big\{\, \frac{|f(x)-f(y)|}{\|x-y\|^\lambda} \;\Big|\; x,y \in D,\ x \ne y \,\Big\} \qquad \text{for } f : D \to \mathbb{R}. \]

We write $f \in C^{0,\lambda}(\bar\Omega)$ and say that $f$ is Hölder continuous in $\bar\Omega$ with exponent $\lambda$ if $[f]_{\lambda,\bar\Omega} < \infty$. A norm on the space $C^{0,\lambda}(\bar\Omega)$ is defined by

\[ f \mapsto \|f\|_{C(\bar\Omega)} + [f]_{\lambda,\bar\Omega}. \]

We write $f \in C^{0,\lambda}(\Omega)$ and say that $f$ is Hölder continuous in $\Omega$ with exponent $\lambda$ if for arbitrary compact subsets $D \subset \Omega$ the property $[f]_{\lambda,D} < \infty$ holds. An important special case is $\lambda = 1$: the space $C^{0,1}(\Omega)$ [or $C^{0,1}(\bar\Omega)$] consists of all Lipschitz continuous functions on $\Omega$ [$\bar\Omega$].

The space $C^{k,\lambda}(\Omega)$ [$C^{k,\lambda}(\bar\Omega)$], $k \in \mathbb{N}$, $\lambda \in (0,1]$, consists of those functions in $C^k(\Omega)$ [$C^k(\bar\Omega)$] for which all derivatives $D^\alpha f$ of order $|\alpha| = k$ are elements of $C^{0,\lambda}(\Omega)$ [$C^{0,\lambda}(\bar\Omega)$]. On $C^{k,\lambda}(\bar\Omega)$ we define a norm by

\[ f \mapsto \|f\|_{C^k(\bar\Omega)} + \sum_{|\alpha|=k}[D^\alpha f]_{\lambda,\bar\Omega}. \]

Note that

\[ C^{k,\lambda}(\bar\Omega) \subset C^k(\bar\Omega) \quad \text{for all } k \in \mathbb{N},\ \lambda \in (0,1], \]
\[ C^{k,\lambda_2}(\bar\Omega) \subset C^{k,\lambda_1}(\bar\Omega) \quad \text{for all } k \in \mathbb{N},\ 0 < \lambda_1 \le \lambda_2 \le 1, \]

and similarly with $\bar\Omega$ replaced by $\Omega$. We use the notation $C^{k,0}(\Omega) := C^k(\Omega)$ [$C^{k,0}(\bar\Omega) := C^k(\bar\Omega)$].

Remark 1.1.1 The inclusion $C^{k+1}(\bar\Omega) \subset C^{k,\lambda}(\bar\Omega)$, $\lambda \in (0,1]$, is in general not true. Consider $n = 2$ and $\Omega = \{\, (x,y) \mid -1 < x < 1,\ -1 < y < \sqrt{|x|}\, \}$. The function

\[ f(x,y) = \begin{cases} (\operatorname{sign} x)\,y^{3/2} & \text{if } y > 0, \\ 0 & \text{otherwise}, \end{cases} \]

belongs to $C^1(\bar\Omega)$, but $f \notin C^{0,\lambda}(\bar\Omega)$ if $\lambda \in (\tfrac34, 1]$.

Based on these Hölder spaces we can now characterize smoothness of the boundary $\partial\Omega$.

Definition 1.1.2 For $k \in \mathbb{N}$, $\lambda \in [0,1]$ the property $\partial\Omega \in C^{k,\lambda}$ (the boundary $\partial\Omega$ is of class $C^{k,\lambda}$) holds if at each point $x_0 \in \partial\Omega$ there are a ball $B = \{\, x \in \mathbb{R}^n \mid \|x-x_0\| < \delta \,\}$, $\delta > 0$, and a bijection $\psi : B \to E \subset \mathbb{R}^n$ such that

\[ \psi(B \cap \Omega) \subset \mathbb{R}^n_+ := \{\, x \in \mathbb{R}^n \mid x_n > 0 \,\}, \tag{1.1a} \]
\[ \psi(B \cap \partial\Omega) \subset \partial\mathbb{R}^n_+, \tag{1.1b} \]
\[ \psi \in C^{k,\lambda}(B), \qquad \psi^{-1} \in C^{k,\lambda}(E). \tag{1.1c} \]

For the case $n = 2$ this is illustrated in Figure 1.

[Figure 1]

A very important special case is $\partial\Omega \in C^{0,1}$. In this case all transformations $\psi$ (and their inverses) must be Lipschitz continuous functions and we then call $\Omega$ a Lipschitz domain. This holds, for example, if $\partial\Omega$ consists of different patches which are graphs of smooth functions (e.g., polynomials) and at the interface between different patches the interior angles are bounded away from zero. In Figure 2 we give an illustration for the two dimensional case.

[Figure 2]

A domain $\Omega$ is convex if for arbitrary $x,y \in \Omega$ the inclusion $\{\, tx + (1-t)y \mid t \in [0,1] \,\} \subset \Omega$ holds.

In almost all theoretical analyses presented in this book it suffices to have $\partial\Omega \in C^{0,1}$. Moreover, the domains used in practice usually satisfy this condition. Therefore, in the remainder of this book we always consider such domains, unless stated otherwise explicitly.

Assumption 1.1.3 In this book we assume that the domain $\Omega \subset \mathbb{R}^n$ is such that

$\Omega$ is open, connected and bounded,
$\partial\Omega$ is of class $C^{0,1}$.

One can show that if this assumption holds then $C^{k+1}(\bar\Omega) \subset C^{k,1}(\bar\Omega)$ (cf. remark 1.1.1).

1.2 Scalar elliptic boundary value problems

1.2.1 Formulation of the problem

On $C^2(\Omega)$ we define a linear second order differential operator $L$ as follows:

\[ Lu = \sum_{i,j=1}^n a_{ij}\,\frac{\partial^2 u}{\partial x_i\,\partial x_j} + \sum_{i=1}^n b_i\,\frac{\partial u}{\partial x_i} + cu, \tag{1.2} \]

with $a_{ij}$, $b_i$ and $c$ given functions on $\Omega$. Because $\frac{\partial^2 u}{\partial x_i\,\partial x_j} = \frac{\partial^2 u}{\partial x_j\,\partial x_i}$ we may assume, without loss of generality, that

\[ a_{ij}(x) = a_{ji}(x) \]

holds for all $x \in \Omega$. Corresponding to the differential operator $L$ we can define a partial differential equation

\[ Lu = f, \tag{1.3} \]

with $f$ a given function on $\Omega$. In (1.2) the part containing the second derivatives only, i.e.

\[ \sum_{i,j=1}^n a_{ij}\,\frac{\partial^2 u}{\partial x_i\,\partial x_j}, \]

is called the principal part of $L$. Related to this principal part we have the $n \times n$ symmetric matrix

\[ A(x) = \big(a_{ij}(x)\big)_{1\le i,j\le n}. \tag{1.4} \]

Note that due to the symmetry of $A$ the eigenvalues are real. These eigenvalues, which may depend on $x \in \Omega$, are denoted by

\[ \lambda_1(x) \le \lambda_2(x) \le \dots \le \lambda_n(x). \]

Hyperbolicity, parabolicity, or ellipticity of the differential operator $L$ is determined by these eigenvalues. The operator $L$, or the partial differential equation in (1.3), is called elliptic at the point $x \in \Omega$ if all eigenvalues of $A(x)$ have the same sign. The operator $L$ and the corresponding differential equation are called elliptic if $L$ is elliptic at every $x \in \Omega$. Note that this property is determined by the principal part of $L$ only.

Remark 1.2.1 If the operator $L$ is elliptic, then we may assume that all eigenvalues of the matrix $A(x)$ in (1.4) are positive:

\[ 0 < \lambda_1(x) \le \lambda_2(x) \le \dots \le \lambda_n(x) \quad \text{for all } x \in \Omega. \]

The operator $L$ (and the corresponding boundary value problem) is called uniformly elliptic if $\inf\{\, \lambda_1(x) \mid x \in \Omega \,\} > 0$ holds. Note that if the operator $L$ is elliptic with coefficients $a_{ij} \in C(\bar\Omega)$ then the function $x \mapsto \lambda_1(x)$ is continuous on the compact set $\bar\Omega$ and hence $L$ is uniformly elliptic. Using

\[ \sum_{i,j=1}^n a_{ij}(x)\,\xi_i\xi_j = \xi^T A(x)\,\xi \ge \lambda_1(x)\,\xi^T\xi, \]

we obtain that the operator $L$ is uniformly elliptic if and only if there exists a constant $a_0 > 0$ such that

\[ \sum_{i,j=1}^n a_{ij}(x)\,\xi_i\xi_j \ge a_0\,\xi^T\xi \quad \text{for all } \xi \in \mathbb{R}^n,\ x \in \Omega. \]
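As a small numerical illustration (our addition, not part of the original text; the coefficient functions below are an invented example), uniform ellipticity can be tested by sampling the smallest eigenvalue of $A(x)$ over the domain:

\begin{verbatim}
import numpy as np

# Invented example coefficients on Omega = (0,1)^2:
# a11 = 2 + x, a22 = 2 + y, a12 = a21 = x*y.
xs = np.linspace(0.0, 1.0, 101)
lam_min = np.inf
for x in xs:
    for y in xs:
        A = np.array([[2.0 + x, x * y], [x * y, 2.0 + y]])
        lam_min = min(lam_min, np.linalg.eigvalsh(A)[0])  # smallest eigenvalue
print(lam_min)
\end{verbatim}

For these coefficients the sampled minimum is positive and bounded away from zero, so the estimate above holds with $a_0$ equal to this minimum (up to the sampling resolution).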

We obtain a boundary value problem when we combine the partial differential equation in (1.3) with certain boundary conditions for the unknown function $u$. For ease we restrict ourselves to problems with Dirichlet boundary conditions, i.e., we impose:

\[ u = g \quad \text{on } \partial\Omega, \]

with $g$ a given function on $\partial\Omega$. Other types of boundary conditions are the so-called Neumann boundary condition, i.e., a condition on the normal derivative $\frac{\partial u}{\partial n}$ on $\partial\Omega$, and the mixed boundary condition, which is a linear combination of a Dirichlet and a Neumann boundary condition. Summarizing, we consider a linear second order Dirichlet boundary value problem in $\Omega \subset \mathbb{R}^n$:

\[ Lu = \sum_{i,j=1}^n a_{ij}\,\frac{\partial^2 u}{\partial x_i\,\partial x_j} + \sum_{i=1}^n b_i\,\frac{\partial u}{\partial x_i} + cu = f \quad \text{in } \Omega, \tag{1.5a} \]
\[ u = g \quad \text{on } \partial\Omega, \tag{1.5b} \]

where $(a_{ij}(x))_{1\le i,j\le n}$ is such that the problem is elliptic. A solution $u$ of (1.5) is called a classical solution if $u \in C^2(\Omega) \cap C(\bar\Omega)$. The functions $(a_{ij}(x))_{1\le i,j\le n}$, $(b_i(x))_{1\le i\le n}$ and $c(x)$ are called the coefficients of the operator $L$.

1.2.2 Examples

We assume $n = 2$, i.e. a problem with two independent variables, say $x_1 = x$ and $x_2 = y$. Then the differential operator is given by

\[ Lu = a_{11}\frac{\partial^2 u}{\partial x^2} + 2a_{12}\frac{\partial^2 u}{\partial x\,\partial y} + a_{22}\frac{\partial^2 u}{\partial y^2} + b_1\frac{\partial u}{\partial x} + b_2\frac{\partial u}{\partial y} + cu. \]

In this case we have $\lambda_1(x)\lambda_2(x) = \det(A(x))$ and the ellipticity condition can be formulated as

\[ a_{11}(x,y)\,a_{22}(x,y) - a_{12}^2(x,y) > 0, \qquad (x,y) \in \Omega. \]

Examples of elliptic equations are the Laplace equation

\[ \Delta u := \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{in } \Omega, \]

the Poisson equation (cf. Poisson [72])

\[ -\Delta u = f \quad \text{in } \Omega, \tag{1.6} \]

the reaction-diffusion equation

\[ -\Delta u + cu = f \quad \text{in } \Omega, \tag{1.7} \]

and the convection-diffusion equation

\[ -\varepsilon\Delta u + b_1\frac{\partial u}{\partial x} + b_2\frac{\partial u}{\partial y} = f \quad \text{in } \Omega, \quad \varepsilon > 0. \tag{1.8} \]

If we add Dirichlet boundary conditions to the Poisson equation in (1.6), we obtain the classical Dirichlet problem for Poisson's equation:

\[ -\Delta u = f \quad \text{in } \Omega, \qquad u = g \quad \text{on } \partial\Omega. \tag{1.9} \]

Remark 1.2.2 We briefly comment on the convection-diffusion equation in (1.8). If $|\varepsilon/b_1| \ll 1$ or $|\varepsilon/b_2| \ll 1$ (in a part of the domain) then the diffusion term $-\varepsilon\Delta u$ can be seen as a perturbation of the convection term $b_1\frac{\partial u}{\partial x} + b_2\frac{\partial u}{\partial y}$ (in a part of the domain). The convection-diffusion equation is of elliptic type. However, for $\varepsilon = 0$ we obtain the so-called reduced equation, which is of hyperbolic type. In view of this the convection-diffusion equation with $|\varepsilon/b_1| \ll 1$ or $|\varepsilon/b_2| \ll 1$ is called a singularly perturbed equation. The fact that the elliptic equation (1.8) is then in some sense close to a hyperbolic equation results in some special phenomena that do not occur in diffusion dominated problems (as (1.9)). For example, in a convection dominated problem (e.g., an equation as (1.8) with $\varepsilon \ll 1$ and $b_i = 1$, $i = 1,2$) the solution $u$ shows a behaviour in which most of the information is transported in certain directions (streamlines). So we observe a behaviour as in the hyperbolic problem ($\varepsilon = 0$), in which the solution satisfies an ordinary differential equation along each characteristic. Another phenomenon is the occurrence of boundary layers. If we combine the equation in (1.8) with Dirichlet boundary conditions on $\partial\Omega$ then in general these boundary conditions are not appropriate for the hyperbolic problem ($\varepsilon = 0$). As a result, if $|\varepsilon/b_1| \ll 1$ or $|\varepsilon/b_2| \ll 1$ we often observe that on a part of the boundary (corresponding to the outflow boundary in the hyperbolic problem) there is a small neighbourhood in which the solution $u$ varies very rapidly. Such a neighbourhood is called a boundary layer.

For a detailed analysis of singularly perturbed convection-diffusion equations we refer to Roos et al. [76]. An illustration of the two phenomena described above is given in Section ??. Finally we note that for the numerical solution of a problem with a singularly perturbed equation special tools are needed, both with respect to the discretization of the problem and the iterative solver for the discrete problem. □
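The boundary layer phenomenon can be made concrete with a one-dimensional model problem. The following Python sketch is our own addition (the model problem $-\varepsilon u'' + u' = 1$ on $(0,1)$, $u(0) = u(1) = 0$, is a standard example and is not taken from this section); it evaluates the exact solution for several values of $\varepsilon$:

\begin{verbatim}
import numpy as np

def u_exact(x, eps):
    # Exact solution of -eps*u'' + u' = 1 on (0,1), u(0) = u(1) = 0,
    # written in a form that avoids overflow for small eps.
    return x - (np.exp((x - 1.0) / eps) - np.exp(-1.0 / eps)) \
               / (1.0 - np.exp(-1.0 / eps))

x = np.array([0.5, 0.9, 0.99, 0.999, 1.0])
for eps in [1e-1, 1e-2, 1e-3]:
    print(f"eps = {eps:g}:", np.round(u_exact(x, eps), 4))
\end{verbatim}

Away from $x = 1$ the computed values are close to the solution $u = x$ of the reduced (hyperbolic) problem; within a distance of order $\varepsilon$ from the outflow boundary $x = 1$ they drop to the boundary value $0$.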

1.2.3 Existence, uniqueness, regularity

For the elliptic boundary value problems introduced above, a first important topic that should be addressed concerns the existence and uniqueness of a solution. If a unique solution exists then another issue is the smoothness of the solution and how this smoothness depends on the data (source term, boundary condition, coefficients). Such smoothness results are called regularity properties of the problem. The topic of existence, uniqueness and regularity has been, and still is, the subject of many mathematical studies. We will not treat these topics here. We only give a few references to standard books in this field: Gilbarg and Trudinger [39], Miranda [64], Lions and Magenes [60], Hackbusch [45], [47].

We note that for the classical formulation of an elliptic boundary value problem it is often rather hard to establish satisfactory results on existence, uniqueness or regularity. In Section 2.5 we will discuss the variational (or weak) formulation of elliptic boundary value problems. In that setting, additional tools for the analysis of existence, uniqueness and regularity are available and (many) more results are known.

Example 1.2.3 The reaction-diffusion equation can be used to show that a solution of an elliptic Dirichlet boundary value problem as in (1.5) need not be unique. Consider the problem in (1.7) on $\Omega = (0,1)^2$, with $f = 0$ and $c(x,y) = -(\alpha\pi)^2 - (\beta\pi)^2$, $\alpha,\beta \in \mathbb{N}$, combined with zero Dirichlet boundary conditions. Then both $u(x,y) \equiv 0$ and $u(x,y) = \sin(\alpha\pi x)\sin(\beta\pi y)$ are solutions of this boundary value problem. □
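The claim in this example can be verified symbolically; the following sympy check is our own addition ($\alpha = 2$, $\beta = 3$ is an arbitrary choice):

\begin{verbatim}
import sympy as sp

x, y = sp.symbols('x y')
alpha, beta = 2, 3                          # arbitrary positive integers
u = sp.sin(alpha*sp.pi*x) * sp.sin(beta*sp.pi*y)
c = -(alpha*sp.pi)**2 - (beta*sp.pi)**2
# residual of -Delta u + c*u; u also vanishes on the boundary of (0,1)^2
print(sp.simplify(-sp.diff(u, x, 2) - sp.diff(u, y, 2) + c*u))   # 0
\end{verbatim}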

Example 1.2.4 Even for very simple elliptic problems a classical solution may not exist. Consider

\[ -a\,\frac{\partial^2 u}{\partial x^2} = 1 \quad \text{in } (0,1), \qquad u(0) = u(1) = 0, \]

with $a(x) = 1$ for $0 \le x \le 0.5$ and $a(x) = 2$ for $0.5 < x \le 1$. Clearly the second derivative of a solution $u$ of this problem cannot be continuous at $x = 0.5$. □

We present a typical result from the literature on existence and uniqueness of a classical solution. For this we need a certain condition on $\partial\Omega$. The domain $\Omega$ is said to satisfy the exterior sphere condition if for every $x_0 \in \partial\Omega$ there exists a ball $B$ such that $\bar B \cap \bar\Omega = \{x_0\}$. Note that this condition is fulfilled, for example, if $\Omega$ is convex or if $\partial\Omega \in C^{2,0}$.

Theorem 1.2.5 ([39], Theorem 6.13) Consider the boundary value problem (1.5) and assume that

(i) $L$ is uniformly elliptic,
(ii) $\Omega$ satisfies the exterior sphere condition,
(iii) the coefficients of $L$ and the function $f$ belong to $C^{0,\lambda}(\bar\Omega)$, $\lambda \in (0,1)$,
(iv) $c \le 0$ holds,
(v) the boundary data are continuous: $g \in C(\partial\Omega)$.

Then the problem (1.5) has a unique classical solution $u \in C^{2,\lambda}(\Omega) \cap C(\bar\Omega)$.

With respect to regularity of the solution it is important to distinguish between interior smoothness (i.e., in $\Omega$) and global smoothness (i.e., in $\bar\Omega$). A typical result on interior smoothness is given in the next theorem:

Theorem 1.2.6 ([39], Theorem 6.17) Let $u \in C^2(\Omega)$ be a solution of (1.5). Suppose that $L$ is elliptic and that there are $k \in \mathbb{N}$, $\lambda \in (0,1)$ such that the coefficients of $L$ and the function $f$ are in $C^{k,\lambda}(\Omega)$. Then $u \in C^{k+2,\lambda}(\Omega)$ holds. If the coefficients and $f$ are in $C^\infty(\Omega)$, then $u \in C^\infty(\Omega)$.

This result shows that the interior regularity depends on the smoothness of the coefficients and of the right-hand side $f$, but does not depend on the smoothness of the boundary (data). A result on global regularity is given in:

Theorem 1.2.7 ([39], Theorem 6.19) Let $u \in C^2(\Omega) \cap C(\bar\Omega)$ be a classical solution of (1.5). Suppose that $L$ is uniformly elliptic and that there are $k \in \mathbb{N}$, $\lambda \in (0,1)$ such that the coefficients of $L$ and the function $f$ are in $C^{k,\lambda}(\bar\Omega)$, and $\partial\Omega \in C^{k+2,\lambda}$. Assume that $g$ can be extended on $\bar\Omega$ such that $g \in C^{k+2,\lambda}(\bar\Omega)$. Then $u \in C^{k+2,\lambda}(\bar\Omega)$ holds.

For a global regularity result as in the previous theorem to hold, the smoothness of the boundary (data) is important. In practice one often has a domain with a boundary consisting of the union of straight lines (in 2D) or planes (3D). Then the previous theorem does not apply and the global regularity of the solution can be rather low, as is shown in the next example.

Example 1.2.8 [from [47], p. 13] We consider (1.9) with $\Omega = (0,1)\times(0,1)$, $f \equiv 0$, $g(x,y) = x^2$ (so $g \in C(\partial\Omega)$, $g \in C^\infty(\bar\Omega)$). Then Theorem 1.2.5 guarantees the existence of a unique classical solution $u \in C^2(\Omega)\cap C(\bar\Omega)$. However, $u$ is not an element of $C^2(\bar\Omega)$.

Proof. Assume that $u \in C^2(\bar\Omega)$ holds. From this and $\Delta u = 0$ in $\Omega$ it follows that $\Delta u = 0$ in $\bar\Omega$ holds. From $u = g = x^2$ on $\partial\Omega$ we get $u_{xx}(x,0) = 2$ for $x \in [0,1]$ and $u_{yy}(0,y) = 0$ for $y \in [0,1]$. It follows that $\Delta u(0,0) = 2$ must hold, which yields a contradiction. □

1.3 The Stokes equations

The $n$-dimensional Navier-Stokes equations model the motion of an incompressible viscous medium. They can be derived using basic principles from continuum mechanics (cf. [43]). The unknowns are the velocity field $\mathbf{u}(x) = (u_1(x),\dots,u_n(x))$ and the pressure $p(x)$, $x \in \Omega$. If one considers a steady-state situation then these Navier-Stokes equations, in dimensionless quantities, are as follows:

\[ -\nu\Delta u_i + \sum_{j=1}^n u_j\frac{\partial u_i}{\partial x_j} + \frac{\partial p}{\partial x_i} = f_i \quad \text{in } \Omega,\ 1 \le i \le n, \tag{1.10a} \]
\[ \sum_{j=1}^n \frac{\partial u_j}{\partial x_j} = 0 \quad \text{in } \Omega, \tag{1.10b} \]

with $\nu > 0$ a parameter that is related to the viscosity of the medium. Using the notation $\mathbf{u} := (u_1,\dots,u_n)^T$, $\operatorname{div}\mathbf{u} := \sum_{j=1}^n \frac{\partial u_j}{\partial x_j}$, $\mathbf{f} = (f_1,\dots,f_n)^T$, we obtain the more compact formulation

\[ -\nu\Delta\mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u} + \nabla p = \mathbf{f} \quad \text{in } \Omega, \tag{1.11a} \]
\[ \operatorname{div}\mathbf{u} = 0 \quad \text{in } \Omega. \tag{1.11b} \]

Note that the pressure $p$ is determined only up to a constant by these Navier-Stokes equations. The problem has to be completed with suitable boundary conditions. One simple possibility is to take homogeneous Dirichlet boundary conditions for $\mathbf{u}$, i.e., $\mathbf{u} = 0$ on $\partial\Omega$. If in the Navier-Stokes equations the nonlinear convection term $(\mathbf{u}\cdot\nabla)\mathbf{u}$ is neglected, which can be justified in situations where the viscosity parameter $\nu$ is large, one obtains the Stokes equations. From a simple rescaling argument (replace $\mathbf{u}$ by $\nu^{-1}\mathbf{u}$) it follows that without loss of generality in the Stokes equations we can assume $\nu = 1$. Summarizing, we obtain the following Stokes problem:

\[ -\Delta\mathbf{u} + \nabla p = \mathbf{f} \quad \text{in } \Omega, \tag{1.12a} \]
\[ \operatorname{div}\mathbf{u} = 0 \quad \text{in } \Omega, \tag{1.12b} \]
\[ \mathbf{u} = 0 \quad \text{on } \partial\Omega. \tag{1.12c} \]

This is a stationary boundary value problem, consisting of $n+1$ coupled partial differential equations for the unknowns $(u_1,\dots,u_n)$ and $p$.

In [2] the notion of ellipticity is generalized to systems of partial differential equations. It can be shown (cf. [2, 47]) that the Stokes equations indeed form an elliptic system. We do not discuss existence and uniqueness of a classical solution of the Stokes problem. In chapter 2 we discuss the variational formulation of the Stokes problem. For this formulation the issue of existence, uniqueness and regularity of a solution is treated in section 2.6.
Chapter 2

Weak formulation

2.1 Introduction
For solving a boundary value problem it can be (very) advantageous to consider a generalization
of the classical problem formulation, in which larger function spaces are used and a weaker
solution (explained below) is allowed. This results in the variational formulation (also called
weak formulation) of a boundary value problem. In this section we consider an introductory
example which illustrates that even for a very simple boundary value problem the choice of an
appropriate solution space is an important issue. This example also serves as a motivation for
the introduction of the Sobolev spaces in section 2.2.

Consider the following elliptic two-point boundary value problem:

\[ -(a u')' = 1 \quad \text{in } (0,1), \tag{2.1a} \]
\[ u(0) = u(1) = 0. \tag{2.1b} \]

We assume that the coefficient $a$ is an element of $C^1([0,1])$ and that $a(x) > 0$ holds for all $x \in [0,1]$. This problem then has a unique solution in the space

\[ V_1 = \{\, v \in C^2([0,1]) \mid v(0) = v(1) = 0 \,\}. \]

This solution is given by

\[ u(x) = \int_0^x \frac{-t + c}{a(t)}\,dt, \qquad c := \Big(\int_0^1 \frac{t}{a(t)}\,dt\Big)\Big(\int_0^1 \frac{1}{a(t)}\,dt\Big)^{-1}, \]

which may be checked by substitution in (2.1). If one multiplies the equation (2.1a) by an arbitrary function $v \in V_1$, integrates both the left and right-hand side and then applies partial integration, one can show that $u \in V_1$ is the solution of (2.1) if and only if

\[ \int_0^1 a(x)u'(x)v'(x)\,dx = \int_0^1 v(x)\,dx \quad \text{for all } v \in V_1. \tag{2.2} \]
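As a sanity check (our addition; the coefficient $a(t) = 1 + t$ is an arbitrary example satisfying the assumptions), the closed-form solution can be verified symbolically:

\begin{verbatim}
import sympy as sp

x, t = sp.symbols('x t', positive=True)
a = 1 + t                                   # example coefficient, a > 0 on [0,1]
c = sp.integrate(t/a, (t, 0, 1)) / sp.integrate(1/a, (t, 0, 1))
u = sp.integrate((-t + c)/a, (t, 0, x))     # the closed-form solution above

# substitute into (2.1): -(a u')' should be 1, and u(0) = u(1) = 0
residual = -sp.diff((1 + x)*sp.diff(u, x), x)
print(sp.simplify(residual))                                  # 1
print(sp.simplify(u.subs(x, 0)), sp.simplify(u.subs(x, 1)))   # 0 0
\end{verbatim}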

This variational problem can be reformulated as a minimization problem. For this we introduce the notion of a bilinear form.

Definition 2.1.1 Let $X$ be a vector space. A mapping $k : X \times X \to \mathbb{R}$ is called a bilinear form if for arbitrary $\alpha,\beta \in \mathbb{R}$ and $u,v,w \in X$ the following holds:

\[ k(\alpha u + \beta v, w) = \alpha\,k(u,w) + \beta\,k(v,w), \]
\[ k(u, \alpha v + \beta w) = \alpha\,k(u,v) + \beta\,k(u,w). \]

The bilinear form is symmetric if $k(u,v) = k(v,u)$ holds for all $u,v \in X$. □

Lemma 2.1.2 Let $X$ be a vector space and $k : X \times X \to \mathbb{R}$ a symmetric bilinear form which is positive, i.e., $k(v,v) > 0$ for all $v \in X$, $v \ne 0$. Let $f : X \to \mathbb{R}$ be a linear functional. Define $J : X \to \mathbb{R}$ by

\[ J(v) = \tfrac12\,k(v,v) - f(v). \]

Then $J(u) < J(v)$ for all $v \in X$, $v \ne u$, holds if and only if

\[ k(u,v) = f(v) \quad \text{for all } v \in X. \tag{2.3} \]

Moreover, there exists at most one minimizer $u$ of $J(\cdot)$.

Proof. Take $u,w \in X$, $t \in \mathbb{R}$. Note that

\[ J(u+tw) - J(u) = t\big(k(u,w) - f(w)\big) + \tfrac12 t^2\,k(w,w) =: g(t;u,w). \]

"$\Rightarrow$". If $J(u) < J(v)$ for all $v \in X$, $v \ne u$, then the function $t \mapsto g(t;u,w)$ must be strictly positive for all $w \in X \setminus \{0\}$ and $t \in \mathbb{R} \setminus \{0\}$. It follows that $k(u,w) - f(w) = 0$ for all $w \in X$.
"$\Leftarrow$". From (2.3) it follows that $J(u+tw) - J(u) = \tfrac12 t^2\,k(w,w)$ for all $w \in X$, $t \in \mathbb{R}$. For $v \ne u$, $w = v-u$, $t = 1$ this yields $J(v) - J(u) = \tfrac12\,k(v-u,v-u) > 0$.
We finally prove the uniqueness of the minimizer. Assume that $k(u_i,v) = f(v)$ for all $v \in X$ and for $i = 1,2$. It follows that $k(u_1-u_2,v) = 0$ for all $v \in X$. For the choice $v = u_1-u_2$ we get $k(u_1-u_2,u_1-u_2) = 0$. From the property that the bilinear form $k(\cdot,\cdot)$ is positive we conclude $u_1 = u_2$. □

Note that in this lemma we do not claim existence of a solution.


For the minimizer $u$ (if it exists) the relation

\[ J(v) - J(u) = \tfrac12\,k(v-u,v-u) \quad \text{for all } v \in X \tag{2.4} \]

holds. We now return to the example and take

\[ X = V_1, \qquad k(u,v) = \int_0^1 a(x)u'(x)v'(x)\,dx, \qquad f(v) = \int_0^1 v(x)\,dx. \]

Note that all assumptions of lemma 2.1.2 are fulfilled. It then follows that the unique solution of (2.1) (or, equivalently, of (2.2)) is also the unique minimizer of the functional

\[ J(v) = \int_0^1 \big[\tfrac12 a(x)v'(x)^2 - v(x)\big]\,dx. \tag{2.5} \]

We consider a case in which the coefficient $a$ is only piecewise continuous (and not differentiable at all $x \in (0,1)$). Then the problem in (2.1) is not well-defined. However, the definitions of the bilinear form $k(\cdot,\cdot)$ and of the functional $J(\cdot)$ still make sense. We now analyze a minimization problem with a functional as in (2.5) in which the coefficient $a$ is piecewise constant:

\[ a(x) = \begin{cases} 1 & \text{if } x \in [0,\tfrac12], \\ 2 & \text{if } x \in (\tfrac12,1]. \end{cases} \]

We show that for this problem the choice of an appropriate solution space is a delicate issue. Note that due to lemma 2.1.2 the minimization problem in $X = V_1$ has a corresponding equivalent variational formulation as in (2.3). With our choice of the coefficient $a$ the functional $J(\cdot)$ takes the form

\[ J(v) := \int_0^{1/2} \big[\tfrac12 v'(x)^2 - v(x)\big]\,dx + \int_{1/2}^1 \big[v'(x)^2 - v(x)\big]\,dx. \tag{2.6} \]

This functional is well-defined on the space $V_1$. The functional $J$, however, is also well-defined if $v$ is only differentiable, and even if we allow $v$ to be nondifferentiable at $x = \tfrac12$. We introduce the spaces

\begin{align*}
V_2 &= \{\, v \in C^1([0,1]) \mid v(0) = v(1) = 0 \,\}, \\
V_3 &= \{\, v \in C^1([0,\tfrac12]) \cap C^1([\tfrac12,1]) \cap C([0,1]) \mid v(0) = v(1) = 0 \,\}, \\
V_4 &= \{\, v \in C^1([0,\tfrac12]) \cap C^1([\tfrac12,1]) \mid v(0) = v(1) = 0 \,\}.
\end{align*}

Note that $V_1 \subset V_2 \subset V_3 \subset V_4$ and that on all these spaces the functional $J(\cdot)$ is well-defined. Moreover, with $X = V_i$, $i = 1,\dots,4$, and

\[ k(u,v) = \int_0^{1/2} u'(x)v'(x)\,dx + \int_{1/2}^1 2u'(x)v'(x)\,dx, \qquad f(v) = \int_0^1 v(x)\,dx, \tag{2.7} \]

all assumptions of lemma 2.1.2 are fulfilled. We define a (natural) norm on these spaces:

\[ |||w|||^2 := \int_0^{1/2} w'(x)^2\,dx + \int_{1/2}^1 w'(x)^2\,dx. \tag{2.8} \]

One easily checks that this indeed defines a norm on the space $V_4$ and thus also on the subspaces $V_i$, $i = 1,2,3$. Furthermore, this norm is induced by the scalar product

\[ (w,v)_1 := \int_0^{1/2} w'(x)v'(x)\,dx + \int_{1/2}^1 w'(x)v'(x)\,dx \tag{2.9} \]

on $V_4$, and

\[ |||w|||^2 \le k(w,w) \le 2\,|||w|||^2 \quad \text{for all } w \in V_4 \tag{2.10} \]

holds. We show that in the space $V_3$ the minimization problem has a unique solution.

Lemma 2.1.3 The problem $\min_{v\in V_3} J(v)$ has a unique solution $u$ given by

\[ u(x) = \begin{cases} -\tfrac12 x^2 + \tfrac{5}{12}\,x & \text{if } 0 \le x \le \tfrac12, \\[2pt] -\tfrac14 x^2 + \tfrac{5}{24}\,x + \tfrac{1}{24} & \text{if } \tfrac12 \le x \le 1. \end{cases} \tag{2.11} \]
Proof. Note that $u \in V_3$ and even $u \in C^\infty([0,\tfrac12]) \cap C^\infty([\tfrac12,1])$. We use the notation $u_L'(\tfrac12) = \lim_{x\uparrow\frac12} u'(x)$ and similarly for $u_R'(\tfrac12)$. We apply lemma 2.1.2 with $X = V_3$. For arbitrary $v \in V_3$ we have

\begin{align*}
k(u,v) - f(v) &= \int_0^{1/2} \big(u'(x)v'(x) - v(x)\big)\,dx + \int_{1/2}^1 \big(2u'(x)v'(x) - v(x)\big)\,dx \\
&= u_L'(\tfrac12)\,v(\tfrac12) - \int_0^{1/2} \big(u''(x)+1\big)v(x)\,dx \tag{2.12} \\
&\quad - 2u_R'(\tfrac12)\,v(\tfrac12) - \int_{1/2}^1 \big(2u''(x)+1\big)v(x)\,dx.
\end{align*}

Due to $u''(x) = -1$ on $[0,\tfrac12]$, $u''(x) = -\tfrac12$ on $[\tfrac12,1]$ and $u_L'(\tfrac12) - 2u_R'(\tfrac12) = 0$ we obtain $k(u,v) = f(v)$ for all $v \in V_3$. From lemma 2.1.2 we conclude that $u$ is the unique minimizer in $V_3$. □

Thus with $X = V_3$ a minimizer $u$ exists and the relation (2.4) takes the form

\[ J(v) - J(u) = \tfrac12\,k(v-u,v-u), \]

with $k(\cdot,\cdot)$ as in (2.7). Due to (2.10) the norm $|||\cdot|||$ can be used as a measure for the distance from the minimum (i.e. $J(v) - J(u)$):

\[ \tfrac12\,|||v-u|||^2 \le J(v) - J(u) \le |||v-u|||^2. \tag{2.13} \]
Before we turn to the minimization problems in the spaces $V_1$ and $V_2$ we first present a useful lemma.

Lemma 2.1.4 Define $W := \{\, v \in C^\infty([0,1]) \mid v(0) = v(1) = 0 \,\}$. For every $u \in V_3$ there is a sequence $(u_n)_{n\ge1}$ in $W$ such that

\[ \lim_{n\to\infty} |||u_n - u||| = 0. \tag{2.14} \]

Proof. Take $u \in V_3$ and define $\bar u(x) := u'(x)$ for all $x \in [0,1]$, $x \ne \tfrac12$, with $\bar u(\tfrac12)$ a fixed arbitrary value, and $\bar u(-x) := \bar u(x)$ for all $x \in [0,1]$. Then $\bar u$ is even and $\bar u \in L^2(-1,1)$. From Fourier analysis it follows that there is a sequence

\[ \bar u_n(x) = \sum_{k=0}^n a_k\cos(k\pi x), \quad n \in \mathbb{N}, \]

such that

\[ \lim_{n\to\infty}\|\bar u - \bar u_n\|_{L^2}^2 = \lim_{n\to\infty}\int_{-1}^1 \big(\bar u(x) - \bar u_n(x)\big)^2\,dx = 0. \]

Note that due to the fact that $u$ is continuous and $u(0) = u(1) = 0$ we get $a_0 = \tfrac12\int_{-1}^1 \bar u(x)\,dx = \int_0^{1/2} u'(x)\,dx + \int_{1/2}^1 u'(x)\,dx = 0$. Define $u_n(x) = \sum_{k=1}^n \frac{a_k}{k\pi}\sin(k\pi x)$ for $n \ge 1$. Then $u_n \in W$, $u_n' = \bar u_n$ and

\[ |||u - u_n|||^2 \le \|\bar u - \bar u_n\|_{L^2}^2 \]

holds. Thus it follows that $\lim_{n\to\infty} |||u_n - u||| = 0$. □

Lemma 2.1.5 Let $u \in V_3$ be given by (2.11). For $i = 1,2$ the following holds:

\[ \inf_{v\in V_i} J(v) = \min_{v\in V_3} J(v) = J(u). \tag{2.15} \]

Proof. Take $i = 1$ or $i = 2$. $I := \inf_{v\in V_i} J(v)$ is defined as the greatest lower bound of $J(v)$ for $v \in V_i$. From $V_i \subset V_3$ it follows that $J(u) \le I$ holds. Suppose that $J(u) < I$ holds, i.e. we have $\delta := I - J(u) > 0$. Due to $W \subset V_i$ and lemma 2.1.4 there is a sequence $(u_n)_{n\ge1}$ in $V_i$ such that $\lim_{n\to\infty} |||u - u_n||| = 0$ holds. Using (2.13) we obtain

\[ J(u_n) = J(u) + \big(J(u_n) - J(u)\big) \le I - \delta + |||u_n - u|||^2. \]

So for $n$ sufficiently large we have $J(u_n) < I$, which on the other hand is not possible because $I$ is a lower bound of $J(v)$ for $v \in V_i$. We conclude that $J(u) = I$ holds. □

The result in this lemma shows that the infimum of $J(v)$ for $v \in V_2$ is equal to $J(u)$ and thus, using (2.13), it follows that the minimizer $u \in V_3$ can be approximated to any accuracy, measured in the norm $|||\cdot|||$, by elements from the smaller space $V_2$. The question arises why in the minimization problem the space $V_3$ is used and not the seemingly more natural space $V_2$. The answer to this question is formulated in the following lemma.

Lemma 2.1.6 There does not exist $\hat u \in V_2$ such that $J(\hat u) \le J(v)$ for all $v \in V_2$.

Proof. Suppose that such a minimizer, say $\hat u \in V_2$, exists. From lemma 2.1.5 we then obtain

\[ J(\hat u) = \min_{v\in V_2} J(v) = \inf_{v\in V_2} J(v) = J(u), \]

with $u$ as in (2.11) the minimizer in $V_3$. Note that $\hat u \in V_3$. From lemma 2.1.2 it follows that the minimizer in $V_3$ is unique and thus $\hat u = u$ must hold. But then $\hat u = u \notin V_2$, which yields a contradiction. □

The same arguments as in the proof of this lemma can be used to show that in the smaller space $V_1$ there also does not exist a minimizer. Based on these results the function $u \in V_3$ is called the weak solution of the minimization problem in $V_2$. From (2.15) we see that for solving the minimization problem in the space $V_2$, in the sense that one wants to compute $\inf_{v\in V_2} J(v)$, it is natural to consider the minimization problem in the larger space $V_3$.

We now consider the even larger space $V_4$ and show that the minimization problem still makes sense (i.e. has a unique solution). However, the minimum value does not equal $\inf_{v\in V_2} J(v)$.

Lemma 2.1.7 The problem $\min_{v\in V_4} J(v)$ has a unique solution $\tilde u$ given by

\[ \tilde u(x) = \begin{cases} -\tfrac12\,x(x-1) & \text{if } 0 \le x \le \tfrac12, \\[2pt] -\tfrac14\,x(x-1) & \text{if } \tfrac12 < x \le 1. \end{cases} \]

Proof. Note that $\tilde u \in V_4$ holds. We apply lemma 2.1.2. Recall the relation (2.12):

\[ k(\tilde u,v) - f(v) = \tilde u_L'(\tfrac12)\,v(\tfrac12) - \int_0^{1/2}\big(\tilde u''(x)+1\big)v(x)\,dx - 2\tilde u_R'(\tfrac12)\,v(\tfrac12) - \int_{1/2}^1\big(2\tilde u''(x)+1\big)v(x)\,dx. \]

From $\tilde u''(x) = -1$ on $[0,\tfrac12]$, $\tilde u''(x) = -\tfrac12$ on $[\tfrac12,1]$ and $\tilde u_L'(\tfrac12) = \tilde u_R'(\tfrac12) = 0$ it follows that $k(\tilde u,v) = f(v)$ for all $v \in V_4$. We conclude that $\tilde u$ is the unique minimizer in $V_4$. □

A straightforward calculation yields the following values for the minima of the functional $J(\cdot)$ in $V_3$ and in $V_4$, respectively:

\[ J(u) = -\frac{11}{12}\cdot\frac{1}{32} = -\frac{11}{384}, \qquad J(\tilde u) = -\frac{1}{32}. \]

From this we see that, opposite to $u$, we should not call $\tilde u$ a weak solution of the minimization problem in $V_2$, because for $\tilde u$ we have $J(\tilde u) < \inf_{v\in V_2} J(v)$.
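This straightforward calculation can be reproduced with a few lines of sympy (our addition):

\begin{verbatim}
import sympy as sp

x = sp.symbols('x')
half = sp.Rational(1, 2)

def J(v1, v2):
    # J(v) from (2.6) for v given piecewise: v1 on [0,1/2], v2 on [1/2,1]
    return (sp.integrate(half*sp.diff(v1, x)**2 - v1, (x, 0, half))
            + sp.integrate(sp.diff(v2, x)**2 - v2, (x, half, 1)))

u1 = -x**2/2 + sp.Rational(5, 12)*x                          # u on [0,1/2]
u2 = -x**2/4 + sp.Rational(5, 24)*x + sp.Rational(1, 24)     # u on [1/2,1]
t1, t2 = -x*(x - 1)/2, -x*(x - 1)/4                          # u-tilde
print(J(u1, u2), J(t1, t2))   # -11/384 and -1/32
\end{verbatim}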

We conclude the discussion of this example with a few remarks on important issues that play an important role in the remainder of this book.

1) Both the theoretical analysis and the numerical solution methods treated in this book heavily rely on the variational formulation of the elliptic boundary value problem (as, for example, in (2.2)). In section 2.3 general results on existence, uniqueness and stability of variational problems are presented. In the sections 2.5 and 2.6 these are applied to variational formulations of elliptic boundary value problems. The finite element discretization method treated in chapter 3 is based on the variational formulation of the boundary value problem. The derivation of the conjugate gradient (CG) iterative method, discussed in chapter 7, is based on the assumption that the given (discrete) problem can be formulated as a minimization problem with a functional $J$ very similar to the one in lemma 2.1.2.

2) The bilinear form used in the weak formulation often has properties similar to those of an inner product, cf. (2.7), (2.9), (2.10). To take advantage of this one should formulate the problem in an inner product space. Then the structure of the space (inner product) fits nicely to the variational problem.

3) To guarantee that a weak solution actually lies in the space one should use a space that is large enough but not too large. This can be realized by completion of the space in which the problem is formulated. The concept of completion is explained in section 2.2.2.

The conditions discussed in the remarks 2) and 3) lead to Hilbert spaces, i.e. inner product spaces that are complete. The Hilbert spaces that are appropriate for elliptic boundary value problems are the Sobolev spaces. These are treated in section 2.2.

2.2 Sobolev spaces

The Hölder spaces $C^{k,\lambda}(\bar\Omega)$ that are used in the classical formulation of elliptic boundary value problems in chapter 1 are Banach spaces but not Hilbert spaces. In this section we introduce Sobolev spaces. All Sobolev spaces are Banach spaces; some of them are Hilbert spaces. In our treatment of elliptic boundary value problems we only need these Hilbert spaces and thus we restrict ourselves to the presentation of this subset of Sobolev Hilbert spaces. A very general treatment of Sobolev spaces is given in [1].
2.2.1 The spaces $W^m(\Omega)$ based on weak derivatives

We take $u \in C^1(\Omega)$ and $\phi \in C_0^\infty(\Omega)$. Since $\phi$ vanishes identically outside some compact subset of $\Omega$, one obtains by partial integration in the variable $x_j$:

\[ \int_\Omega \frac{\partial u(x)}{\partial x_j}\,\phi(x)\,dx = -\int_\Omega u(x)\,\frac{\partial\phi(x)}{\partial x_j}\,dx, \]

and thus

\[ \int_\Omega D^\alpha u(x)\,\phi(x)\,dx = -\int_\Omega u(x)\,D^\alpha\phi(x)\,dx, \quad |\alpha| = 1, \]

holds. Repeated application of this result yields the fundamental Green's formula

\[ \int_\Omega D^\alpha u(x)\,\phi(x)\,dx = (-1)^{|\alpha|}\int_\Omega u(x)\,D^\alpha\phi(x)\,dx \tag{2.16} \]

for all $\phi \in C_0^\infty(\Omega)$, $u \in C^k(\Omega)$, $k = 1,2,\dots$, and $|\alpha| \le k$.

Based on this formula we introduce the notion of a weak derivative:

Definition 2.2.1 Consider $u \in L^2(\Omega)$ and $|\alpha| > 0$. If there exists $v \in L^2(\Omega)$ such that

\[ \int_\Omega v(x)\,\phi(x)\,dx = (-1)^{|\alpha|}\int_\Omega u(x)\,D^\alpha\phi(x)\,dx \quad \text{for all } \phi \in C_0^\infty(\Omega), \tag{2.17} \]

then $v$ is called the $\alpha$-th weak derivative of $u$ and is denoted by $D^\alpha u := v$. □

Two elementary results are given in the next lemma.

Lemma 2.2.2 If for $u \in L^2(\Omega)$ the $\alpha$-th weak derivative exists then it is unique (in the usual Lebesgue sense). If $u \in C^k(\Omega)$ then for $0 < |\alpha| \le k$ the $\alpha$-th weak derivative and the classical $\alpha$-th derivative coincide.

Proof. The second statement follows from the first one and Green's formula (2.16). We now prove the uniqueness. Assume that $v_i \in L^2(\Omega)$, $i = 1,2$, both satisfy (2.17). Then it follows that

\[ \int_\Omega \big(v_1(x) - v_2(x)\big)\phi(x)\,dx = \langle v_1 - v_2, \phi\rangle_{L^2} = 0 \quad \text{for all } \phi \in C_0^\infty(\Omega). \]

Since $C_0^\infty(\Omega)$ is dense in $L^2(\Omega)$ this implies that $\langle v_1 - v_2, \phi\rangle_{L^2} = 0$ for all $\phi \in L^2(\Omega)$ and thus $v_1 - v_2 = 0$ (a.e.). □

Remark 2.2.3 As a warning we note that if the classical derivative of $u$, say $D^\alpha u$, exists almost everywhere in $\Omega$ and $D^\alpha u \in L^2(\Omega)$, this does not guarantee the existence of the $\alpha$-th weak derivative. This is shown by the following example:

\[ \Omega = (-1,1), \qquad u(x) = 0 \ \text{if } x < 0, \qquad u(x) = 1 \ \text{if } x \ge 0. \]

The classical first derivative of $u$ on $\Omega \setminus \{0\}$ is $u'(x) = 0$. However, the weak first derivative of $u$ as defined in 2.2.1 does not exist.

A further noticeable observation is the following: if $u \in C^\infty(\Omega) \cap C(\bar\Omega)$ then $u$ does not always have a first weak derivative. This is shown by the example $\Omega = (0,1)$, $u(x) = \sqrt{x}$. The only candidate for the first weak derivative of $u$ is $v(x) = \frac{1}{2\sqrt{x}}$. However, $v \notin L^2(\Omega)$. □

The Sobolev space $W^m(\Omega)$, $m = 1,2,\dots$, consists of all functions in $L^2(\Omega)$ for which all $\alpha$-th weak derivatives with $|\alpha| \le m$ exist:

\[ W^m(\Omega) := \{\, u \in L^2(\Omega) \mid D^\alpha u \ \text{exists for all } 0 < |\alpha| \le m \,\}. \tag{2.18} \]

Remark 2.2.4 By definition, for $u \in W^m(\Omega)$, Green's formula

\[ \int_\Omega D^\alpha u(x)\,\phi(x)\,dx = (-1)^{|\alpha|}\int_\Omega u(x)\,D^\alpha\phi(x)\,dx \quad \text{for all } \phi \in C_0^\infty(\Omega),\ |\alpha| \le m, \]

holds. □

For $m = 0$ we define $W^0(\Omega) := L^2(\Omega)$. In $W^m(\Omega)$ a natural inner product and corresponding norm are defined by

\[ \langle u,v\rangle_m := \sum_{|\alpha|\le m}\langle D^\alpha u, D^\alpha v\rangle_{L^2}, \qquad \|u\|_m := \langle u,u\rangle_m^{1/2}, \qquad u,v \in W^m(\Omega). \tag{2.19} \]

It is easy to verify that $\langle\cdot,\cdot\rangle_m$ defines an inner product on $W^m(\Omega)$. We now formulate a main result:

Theorem 2.2.5 The space $(W^m(\Omega), \langle\cdot,\cdot\rangle_m)$ is a Hilbert space.

Proof. We must show that the space $W^m(\Omega)$ with the norm $\|\cdot\|_m$ is complete. For $m = 0$ this is trivial. We consider $m \ge 1$. First note that for $v \in W^m(\Omega)$:

\[ \|v\|_m^2 = \sum_{|\alpha|\le m}\|D^\alpha v\|_{L^2}^2. \tag{2.20} \]

Let $(u_k)_{k\ge1}$ be a Cauchy sequence in $W^m(\Omega)$. From (2.20) it follows that if $\|u_k - u_\ell\|_m \to 0$ then $\|D^\alpha u_k - D^\alpha u_\ell\|_{L^2} \to 0$ for all $0 \le |\alpha| \le m$. Hence, $(D^\alpha u_k)_{k\ge1}$ is a Cauchy sequence in $L^2(\Omega)$ for all $|\alpha| \le m$. Since $L^2(\Omega)$ is complete it follows that there exists a unique $u^{(\alpha)} \in L^2(\Omega)$ with $\lim_{k\to\infty} D^\alpha u_k = u^{(\alpha)}$ in $L^2(\Omega)$. For $|\alpha| = 0$ this yields $\lim_{k\to\infty} u_k = u^{(0)}$ in $L^2(\Omega)$. For $0 < |\alpha| \le m$ and arbitrary $\phi \in C_0^\infty(\Omega)$ we obtain

\begin{align*}
\langle u^{(\alpha)}, \phi\rangle_{L^2} &= \lim_{k\to\infty}\langle D^\alpha u_k, \phi\rangle_{L^2} \tag{2.21} \\
&= \lim_{k\to\infty}(-1)^{|\alpha|}\langle u_k, D^\alpha\phi\rangle_{L^2} = (-1)^{|\alpha|}\langle u^{(0)}, D^\alpha\phi\rangle_{L^2}.
\end{align*}

From this it follows that $u^{(\alpha)} \in L^2(\Omega)$ is the $\alpha$-th weak derivative of $u^{(0)}$. We conclude that $u^{(0)} \in W^m(\Omega)$ and

\[ \lim_{k\to\infty}\|u_k - u^{(0)}\|_m^2 = \lim_{k\to\infty}\sum_{|\alpha|\le m}\|D^\alpha u_k - D^\alpha u^{(0)}\|_{L^2}^2 = \lim_{k\to\infty}\sum_{|\alpha|\le m}\|D^\alpha u_k - u^{(\alpha)}\|_{L^2}^2 = 0. \]

This shows that the Cauchy sequence $(u_k)_{k\ge1}$ in $W^m(\Omega)$ has its limit point in $W^m(\Omega)$ and thus this space is complete. □

Similar constructions can be applied if we replace the Hilbert space $L^2(\Omega)$ by the Banach space $L^p(\Omega)$, $1 \le p < \infty$, of measurable functions for which $\|u\|_p := \big(\int_\Omega |u(x)|^p\,dx\big)^{1/p}$ is bounded. This results in Sobolev spaces which are usually denoted by $W_p^m(\Omega)$. For notational simplicity we deleted the index $p = 2$ in our presentation. For $p \ne 2$ the Sobolev space $W_p^m(\Omega)$ is a Banach space but not a Hilbert space. In this book we only need the Sobolev space with $p = 2$ as defined in (2.18). For $p \ne 2$ we refer to the literature, e.g. [1].

2.2.2 The spaces $H^m(\Omega)$ based on completion

In this section we introduce the Sobolev spaces using a different technique, namely one based on the concept of completion. We recall a basic result (cf. Appendix A.1).

Lemma 2.2.6 Let $(Z, \|\cdot\|)$ be a normed space. Then there exists a Banach space $(X, \|\cdot\|_*)$ such that $Z \subset X$, $\|x\|_* = \|x\|$ for all $x \in Z$, and $Z$ is dense in $X$. The space $X$ is called the completion of $Z$. This space is unique, except for isometric (i.e., norm preserving) isomorphisms.

Here we consider the function space

\[ Z_m := \{\, u \in C^\infty(\Omega) \mid \|u\|_m < \infty \,\}, \tag{2.22} \]

endowed with the norm $\|\cdot\|_m$ as defined in (2.19), i.e., we want to construct the completion of $(Z_m, \|\cdot\|_m)$. For $m = 0$ this results in $L^2(\Omega)$, since $C^\infty(\Omega)$ is dense in $L^2(\Omega)$. Hence, we only consider $m \ge 1$. Note that $Z_m \subset W^m(\Omega)$ and that this embedding is continuous. One can apply the general result of lemma 2.2.6, which then defines the completion of the space $Z_m$. However, here we want to present a more constructive approach which reveals some interesting relations between this completion and the space $W^m(\Omega)$.

First, note that due to (2.20) a Cauchy sequence $(u_k)_{k\ge1}$ in $Z_m$ is also a Cauchy sequence in $L^2(\Omega)$, and thus to every such sequence there corresponds a unique $u \in L^2(\Omega)$ with $\lim_{k\to\infty}\|u_k - u\|_{L^2} = 0$. The space $V_m \supset Z_m$ is defined as follows:

\[ V_m := \{\, u \in L^2(\Omega) \mid \lim_{k\to\infty}\|u_k - u\|_{L^2} = 0 \ \text{for a Cauchy sequence } (u_k)_{k\ge1} \ \text{in } Z_m \,\}. \]

One easily verifies that $V_m$ is a vector space.

Lemma 2.2.7 $V_m$ is the closure of $Z_m$ in the space $W^m(\Omega)$.

Proof. Take $u \in V_m$ and let $(u_k)_{k\ge1}$ be a Cauchy sequence in $Z_m$ with $\lim_{k\to\infty}\|u_k - u\|_{L^2} = 0$. From (2.20) it follows that $(D^\alpha u_k)_{k\ge1}$, $0 < |\alpha| \le m$, are Cauchy sequences in $L^2(\Omega)$. Let $u^{(\alpha)} := \lim_{k\to\infty} D^\alpha u_k$ in $L^2(\Omega)$. As in (2.21) one shows that $u^{(\alpha)}$ is the $\alpha$-th weak derivative $D^\alpha u$ of $u$. Using $D^\alpha u = \lim_{k\to\infty} D^\alpha u_k$ in $L^2(\Omega)$, for $0 < |\alpha| \le m$, we get

\[ \lim_{k\to\infty}\|u_k - u\|_m^2 = \lim_{k\to\infty}\sum_{|\alpha|\le m}\|D^\alpha u_k - D^\alpha u\|_{L^2}^2 = 0. \]

Since $(u_k)_{k\ge1}$ is a sequence in $Z_m$ we have shown that $V_m$ is the closure of $Z_m$ in $W^m(\Omega)$. □

On the space $V_m$ we can take the same inner product (and corresponding norm) as used in the space $W^m(\Omega)$ (cf. (2.19)). From lemma 2.2.7 and the fact that in the space $Z_m$ the norm is the same as the norm of $W^m(\Omega)$ it follows that $(V_m, \|\cdot\|_m)$ is the completion of $(Z_m, \|\cdot\|_m)$. Since the norm $\|\cdot\|_m$ is induced by an inner product, $(V_m, \langle\cdot,\cdot\rangle_m)$ is a Hilbert space. This defines the Sobolev space

\[ H^m(\Omega) := (V_m, \langle\cdot,\cdot\rangle_m) = \text{completion of } (Z_m, \langle\cdot,\cdot\rangle_m). \]
It is clear from lemma 2.2.7 that

\[ H^m(\Omega) \subset W^m(\Omega) \]

holds. A fundamental result is the following:

Theorem 2.2.8 The equality $H^m(\Omega) = W^m(\Omega)$ holds.

Proof. The first proof of this result was presented in [63]. A proof can also be found in [1, 65]. □

We see that the construction using weak derivatives (space $W^m(\Omega)$) and the one based on completion (space $H^m(\Omega)$) result in the same Sobolev space. In the remainder we will only use the notation $H^m(\Omega)$.

The result in theorem 2.2.8 holds for arbitrary open sets $\Omega$ in $\mathbb{R}^n$. If the domain satisfies certain very mild smoothness conditions (it suffices to have assumption 1.1.3) one can prove a somewhat stronger result, which we will need further on:

Theorem 2.2.9 Let $\hat H^m(\Omega)$ be the completion of the space $(C^\infty(\bar\Omega), \langle\cdot,\cdot\rangle_m)$. Then

\[ \hat H^m(\Omega) = H^m(\Omega) = W^m(\Omega) \]

holds.

Proof. We refer to [1]. □

Note that $C^\infty(\bar\Omega) \subsetneq Z_m$ and thus $\hat H^m(\Omega)$ results from the completion of a smaller space than $H^m(\Omega)$.
Remark 2.2.10 If assumption 1.1.3 is not satisfied then it may happen that $\hat H^m(\Omega) \subsetneq W^m(\Omega)$ holds. Consider the example

\[ \Omega = \{\, (x,y) \in \mathbb{R}^2 \mid x \in (-1,0)\cup(0,1),\ y \in (0,1) \,\} \]

and take $u(x,y) = 1$ if $x > 0$, $u(x,y) = 0$ if $x < 0$. Then $D^{(1,0)}u = D^{(0,1)}u = 0$ on $\Omega$ and thus $u \in W^1(\Omega)$. However, one can verify that there does not exist a sequence $(\phi_k)_{k\ge1}$ in $C^1(\bar\Omega)$ such that

\[ \|u - \phi_k\|_1^2 = \|u - \phi_k\|_{L^2}^2 + \sum_{|\alpha|=1}\|D^\alpha\phi_k\|_{L^2}^2 \to 0 \quad \text{for } k \to \infty. \]

Hence, $C^1(\bar\Omega)$ is not dense in $W^1(\Omega)$, i.e., $\hat H^1(\Omega) \ne W^1(\Omega)$. The equality $H^1(\Omega) = W^1(\Omega)$, however, does hold. □
The completion technique can also be applied if in (2.22) the space $C^\infty(\Omega)$ is replaced by the smaller space $C_0^\infty(\Omega)$. This yields another class of important Sobolev spaces:

\[ H_0^m(\Omega) := \text{completion of the space } (C_0^\infty(\Omega), \langle\cdot,\cdot\rangle_m). \tag{2.23} \]

The space $H_0^m(\Omega)$ is a Hilbert space that is in general strictly smaller than $H^m(\Omega)$.
Remark 2.2.11 In general we have $H_0^1(\Omega) \subsetneq H^1(\Omega)$. Consider, as a simple example, $\Omega = (0,1)$, $u(x) = x$. Then $u \in H^1(\Omega)$, but for arbitrary $\phi \in C_0^\infty(\Omega)$ we have

\begin{align*}
\|u - \phi\|_1^2 &= \|u - \phi\|_{L^2}^2 + \|u' - \phi'\|_{L^2}^2 \\
&\ge \int_0^1 \big(u'(x) - \phi'(x)\big)^2\,dx = \int_0^1 \big(1 - 2\phi'(x) + \phi'(x)^2\big)\,dx \\
&\ge 1 - 2\int_0^1 \phi'(x)\,dx = 1 - 2\big(\phi(1) - \phi(0)\big) = 1.
\end{align*}

Hence $u \notin H_0^1(\Omega) = \overline{C_0^\infty(\Omega)}^{\,\|\cdot\|_1}$. □

The technique of completion can also be applied if instead of $\|\cdot\|_m$ one uses the norm $\|u\|_{m,p} = \big(\sum_{|\alpha|\le m}\|D^\alpha u\|_p^p\big)^{1/p}$, $1 \le p < \infty$. This results in Sobolev spaces denoted by $H_p^m(\Omega)$. For $p = 2$ we have $H_2^m(\Omega) = H^m(\Omega)$. For $p \ne 2$ these spaces are Banach spaces but not Hilbert spaces. A result as in theorem 2.2.8 also holds for $p \ne 2$: $H_p^m(\Omega) = W_p^m(\Omega)$.

We now formulate a result on a certain class of piecewise smooth functions which form a subset of the Sobolev space $H^m(\Omega)$. This subset plays an important role in the finite element method that will be presented in chapter 3.

Theorem 2.2.12 Assume that $\Omega$ can be partitioned as $\bar\Omega = \cup_{i=1}^N \bar\Omega_i$, with $\Omega_i \cap \Omega_j = \emptyset$ for all $i \ne j$, and that for all $\Omega_i$ the assumption 1.1.3 is fulfilled. For $m \in \mathbb{N}$, $m \ge 1$, define

\[ V_m = \{\, u \in L^2(\Omega) \mid u_{|\Omega_i} \in C^m(\bar\Omega_i) \ \text{for all } i = 1,\dots,N \,\}. \]

For $u \in V_m$ the following holds:

\[ u \in H^m(\Omega) \iff u \in C^{m-1}(\bar\Omega). \]

Proof. First we need some notation. Let $\Gamma_i := \partial\Omega_i$. The outward unit normal on $\Gamma_i$ is denoted by $n^{(i)}$. Let $\Gamma_{ij} := \Gamma_i \cap \Gamma_j$ ($= \bar\Omega_i \cap \bar\Omega_j$) and let $\Gamma_{int}$ denote the set of all those intersections $\Gamma_{ij}$ with $\mathrm{meas}_{n-1}(\Gamma_{ij}) > 0$ (in 2D with triangles: intersections by sides are taken into account but intersections by vertices are not). Similarly, $\Gamma_{i0} := \Gamma_i \cap \partial\Omega$ and $\Gamma_b$ is the set of all $\Gamma_{i0}$ with $\mathrm{meas}_{n-1}(\Gamma_{i0}) > 0$. For $\Gamma_{ij} \in \Gamma_{int}$ let $n^{(ij)}$ be the unit normal pointing outward from $\Omega_i$ (thus $n^{(ij)} = -n^{(ji)}$). Finally, for $u \in V_1$ let

\[ [u]_{ij}(x) = \lim_{t\downarrow0}\Big( u\big(x - t\,n^{(ij)}\big) - u\big(x + t\,n^{(ij)}\big) \Big), \quad x \in \Gamma_{ij} \in \Gamma_{int}, \]

be the jump of $u$ across $\Gamma_{ij}$.

We now consider the case $m = 1$, i.e., $u \in V_1$. Let $v \in L^2(\Omega)$ be given by $v(x) = \frac{\partial u(x)}{\partial x_k}$ for $x \in \Omega_i$, $i = 1,\dots,N$. For arbitrary $\phi \in C_0^\infty(\Omega)$ we have

\[ \int_\Omega u\,\frac{\partial\phi}{\partial x_k}\,dx = \sum_{i=1}^N \int_{\Omega_i} u\,\frac{\partial\phi}{\partial x_k}\,dx = -\sum_{i=1}^N \int_{\Omega_i} v\,\phi\,dx + \sum_{i=1}^N \int_{\Gamma_i} u\,\phi\,n_k^{(i)}\,ds = -\int_\Omega v\,\phi\,dx + \sum_{i=1}^N \int_{\Gamma_i} u\,\phi\,n_k^{(i)}\,ds. \tag{2.24} \]

For the last term in this expression we have

\[ \sum_{i=1}^N \int_{\Gamma_i} u\,\phi\,n_k^{(i)}\,ds = \sum_{\Gamma_{ij}\in\Gamma_{int}}\int_{\Gamma_{ij}} [u]_{ij}\,\phi\,n_k^{(ij)}\,ds + \sum_{\Gamma_{i0}\in\Gamma_b}\int_{\Gamma_{i0}} u\,\phi\,n_k^{(i)}\,ds =: R_{int} + R_b. \]

We have $R_b = 0$ because $\phi(x) = 0$ on $\partial\Omega$.

If $u \in H^1(\Omega)$ holds then the weak derivative $\frac{\partial u}{\partial x_k}$ must be equal to $v$ (for all $k = 1,\dots,n$). From

\[ \int_\Omega u\,\frac{\partial\phi}{\partial x_k}\,dx = -\int_\Omega \frac{\partial u}{\partial x_k}\,\phi\,dx = -\int_\Omega v\,\phi\,dx, \qquad \phi \in C_0^\infty(\Omega), \]

it follows that $R_{int} = 0$ must hold for all $\phi \in C_0^\infty(\Omega)$. This implies that the jump of $u$ across $\Gamma_{ij}$ is zero and thus $u \in C(\bar\Omega)$ holds. Conversely, if $u \in C(\bar\Omega)$ then $R_{int} = 0$ and from the relation (2.24) it follows that the weak derivative $\frac{\partial u}{\partial x_k}$ exists. Since $k$ is arbitrary we conclude $u \in H^1(\Omega)$. This completes the proof for the case $m = 1$.

For $m > 1$ we use an induction argument. Assume that the statement holds for $m$. We consider $m+1$. Take $u \in V_{m+1}$ and assume that $u \in H^{m+1}(\Omega)$ holds. Take an arbitrary multi-index $\gamma$ with $|\gamma| \le m-1$. Classical derivatives will be denoted by $D_c^\gamma$ and weak ones by $D^\gamma$ (with a multi-index $\beta$ used analogously). From the induction hypothesis we obtain $w := D^\gamma u \in C(\bar\Omega)$. From $u \in H^{m+1}(\Omega)$ it follows that $D^\beta w \in H^1(\Omega)$ for $|\beta| \le 1$. Furthermore, for these $\beta$ we also have, due to $u \in V_{m+1}$, that $D^\beta w = D_c^\beta w \in C^1(\bar\Omega_i)$ for $i = 1,\dots,N$. From the result for $m = 1$ it now follows that $D^\beta w \in C(\bar\Omega)$ for $|\beta| \le 1$, i.e., $D^\beta w$ is continuous across the internal interfaces $\Gamma_{ij}$. We conclude that $D^\beta D^\gamma u \in C(\bar\Omega)$ for all $|\gamma| \le m-1$, $|\beta| \le 1$, i.e., $u \in C^m(\bar\Omega)$. Conversely, if $u \in V_{m+1}$ and $u \in C^m(\bar\Omega)$ then $D^\gamma u \in C(\bar\Omega)$ for $|\gamma| \le m$. From the result for $m = 1$ it follows that $D^\gamma u \in H^1(\Omega)$ for all $|\gamma| \le m$ and thus $u \in H^{m+1}(\Omega)$ holds. □
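The mechanism behind the case $m = 1$ (the jump term $R_{int}$ must vanish) can be illustrated numerically in one dimension. The following Python sketch is our own addition; $\Omega = (0,1)$ is split at $x = \tfrac12$ and $\phi$ is an arbitrary test function vanishing at the endpoints:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

# For piecewise smooth u, integration by parts on (0,1/2) and (1/2,1) gives
#   int u phi' dx + int u' phi dx = -[u] phi(1/2),  [u] = u(1/2+) - u(1/2-).
phi  = lambda x: np.sin(np.pi*x)**2
dphi = lambda x: np.pi*np.sin(2*np.pi*x)

def residual(u, du):
    a = quad(lambda x: u(x)*dphi(x), 0, 1, points=[0.5])[0]
    b = quad(lambda x: du(x)*phi(x), 0, 1, points=[0.5])[0]
    return a + b

hat   = lambda x: np.where(x < 0.5, x, 1.0 - x)   # continuous hat function
dhat  = lambda x: np.where(x < 0.5, 1.0, -1.0)
step  = lambda x: np.where(x < 0.5, 0.0, 1.0)     # jump of size 1 at 1/2
dstep = lambda x: 0.0*x

print(residual(hat, dhat))    # ~0: the hat function is in H^1(0,1)
print(residual(step, dstep))  # ~-1 = -[u]*phi(1/2): the step is not in H^1
\end{verbatim}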

2.2.3 Properties of Sobolev spaces

There is an extensive literature on the theory of Sobolev spaces, see for example Adams [1], Marti [61], Nečas [65], Wloka [99], Alt [3], and the references therein. In this section we collect a few results that will be needed further on.

A first important question concerns the smoothness of functions from $H^m(\Omega)$ in the classical (i.e., $C^k(\bar\Omega)$) sense. For example, one can show that if $\Omega \subset \mathbb{R}$ then all functions from $H^1(\Omega)$ must be continuous on $\Omega$. This, however, is not true for the two dimensional case, as the following example shows:

Example 2.2.13 In this example we show that functions in $H^1(\Omega)$, with $\Omega \subset \mathbb{R}^2$, are not necessarily continuous on $\Omega$. With $r := (x_1^2 + x_2^2)^{1/2}$ let $B(0,\delta) := \{\, (x_1,x_2) \in \mathbb{R}^2 \mid r < \delta \,\}$ for $\delta > 0$. We take $\Omega = B(0,\tfrac12)$. Below we also use $\Omega_\varepsilon := \Omega \setminus \overline{B(0,\varepsilon)}$ with $0 < \varepsilon < \tfrac12$. On $\Omega$ we define the function $u$ by $u(0,0) := 0$ and $u(x_1,x_2) := \ln(\ln(1/r))$ otherwise. Using polar coordinates one obtains

\[ \int_\Omega u(x)^2\,dx = \lim_{\varepsilon\downarrow0}\int_{\Omega_\varepsilon} u(x)^2\,dx = 2\pi\lim_{\varepsilon\downarrow0}\int_\varepsilon^{1/2}\big[\ln(\ln(1/r))\big]^2\,r\,dr < \infty, \]

so $u \in L^2(\Omega)$ holds. Note that $u \in C^\infty(\bar\Omega\setminus\{0\})$. For the first derivatives we have

\[ \int_\Omega \sum_{i=1,2}\Big(\frac{\partial u(x)}{\partial x_i}\Big)^2\,dx = 2\pi\lim_{\varepsilon\downarrow0}\int_\varepsilon^{1/2}\frac{1}{r^2(\ln r)^2}\,r\,dr = \frac{2\pi}{\ln 2}. \]

It follows that the classical first derivatives $\frac{\partial u}{\partial x_i}$ exist a.e. on $\Omega$ and are elements of $L^2(\Omega)$ (note, however, remark 2.2.3). For arbitrary $\phi \in C_0^\infty(\Omega)$ we have, using Green's formula on $\Omega_\varepsilon$:

\[ \int_{\Omega_\varepsilon} u(x)\frac{\partial\phi(x)}{\partial x_1}\,dx = \int_{\partial B(0,\varepsilon)} u(x)\phi(x)\,n_{x_1}\,ds - \int_{\Omega_\varepsilon}\frac{\partial u(x)}{\partial x_1}\,\phi(x)\,dx. \]

Note that

\[ \lim_{\varepsilon\downarrow0}\Big|\int_{\partial B(0,\varepsilon)} u(x)\phi(x)\,n_{x_1}\,ds\Big| \le \lim_{\varepsilon\downarrow0} 2\pi\varepsilon\,\|\phi\|_\infty\,\big|\ln(\ln(1/\varepsilon))\big| = 0. \]

So we have

\[ \int_\Omega u(x)\frac{\partial\phi(x)}{\partial x_1}\,dx = -\int_\Omega \frac{\partial u(x)}{\partial x_1}\,\phi(x)\,dx. \]

We conclude that $\frac{\partial u}{\partial x_1}$ is the generalized partial derivative of $u$ with respect to the variable $x_1$. The same argument yields an analogous result for the derivative w.r.t. $x_2$. We conclude that $u \in H^1(\Omega)$. It is clear that $u$ is not continuous on $\Omega$. □
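Both integrals in this example can be confirmed numerically; the following check is our own addition:

\begin{verbatim}
import numpy as np
from scipy.integrate import quad

# u = ln(ln(1/r)) on B(0,1/2): u and its first derivatives are square
# integrable, yet u is unbounded near the origin.
f0 = lambda r: np.log(np.log(1.0/r))**2 * r   # integrand of ||u||^2 / (2 pi)
f1 = lambda r: 1.0/(r*np.log(r)**2)           # integrand of |u|_1^2 / (2 pi)

print(2*np.pi*quad(f0, 0, 0.5)[0])                                # finite
print(2*np.pi*quad(f1, 0, 0.5, limit=200)[0], 2*np.pi/np.log(2))  # ~9.0647 both
\end{verbatim}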

We now formulate an important general result which relates smoothness in the Sobolev sense (weak derivatives) to classical smoothness properties. For normed spaces $X$ and $Y$ a linear operator $I : X \to Y$ is called a continuous embedding if $I$ is continuous and injective. Such a continuous embedding is denoted by $X \hookrightarrow Y$. Furthermore, usually for $x \in X$ the corresponding element $Ix \in Y$ is denoted by $x$, too ($X \hookrightarrow Y$ is formally replaced by $X \subset Y$).

Theorem 2.2.14 If $m - \frac n2 > k$ (recall: $\Omega \subset \mathbb{R}^n$) then there exist continuous embeddings

\[ H^m(\Omega) \hookrightarrow C^k(\bar\Omega), \tag{2.25a} \]
\[ H_0^m(\Omega) \hookrightarrow C^k(\bar\Omega). \tag{2.25b} \]

Proof. Given in [1]. □

It is trivial that for $m \ge 0$ there are continuous embeddings $H^{m+1}(\Omega) \hookrightarrow H^m(\Omega)$ and $H_0^{m+1}(\Omega) \hookrightarrow H_0^m(\Omega)$. In the next theorem a result on compactness of embeddings (cf. Appendix A.1) is formulated. We recall that if $X, Y$ are Banach spaces then a continuous embedding $X \hookrightarrow Y$ is compact if and only if each bounded sequence in $X$ has a subsequence which is convergent in $Y$.
Theorem 2.2.15 The continuous embeddings

\[ H^{m+1}(\Omega) \hookrightarrow H^m(\Omega) \quad \text{for } m \ge 0, \tag{2.26a} \]
\[ H_0^{m+1}(\Omega) \hookrightarrow H_0^m(\Omega) \quad \text{for } m \ge 0, \tag{2.26b} \]
\[ H^m(\Omega) \hookrightarrow C^k(\bar\Omega) \quad \text{for } m - \tfrac n2 > k, \tag{2.26c} \]
\[ H_0^m(\Omega) \hookrightarrow C^k(\bar\Omega) \quad \text{for } m - \tfrac n2 > k \tag{2.26d} \]

are compact.

Proof. We sketch the idea of the proof. In [1] it is shown that the embeddings

\[ H^1(\Omega) \hookrightarrow L^2(\Omega), \qquad H_0^1(\Omega) \hookrightarrow L^2(\Omega) \]

are compact. This proves the results in (2.26a) and (2.26b) for $m = 0$. The results in (2.26a) and (2.26b) for $m \ge 1$ can easily be derived from this as follows. Let $(u_k)_{k\ge1}$ be a bounded sequence in $H^{m+1}(\Omega)$ ($m \ge 1$). Then $(D^\alpha u_k)_{k\ge1}$ is a bounded sequence in $H^1(\Omega)$ for $|\alpha| \le m$. Thus this sequence has a subsequence $(D^\alpha u_{k'})_{k'\ge1}$ that converges in $L^2(\Omega)$. Hence, the subsequence $(u_{k'})_{k'\ge1}$ converges in $H^m(\Omega)$. This proves the compactness of the embedding $H^{m+1}(\Omega) \hookrightarrow H^m(\Omega)$. The result in (2.26b) for $m \ge 1$ can be shown in the same way.

With a similar shift argument one can easily show that it suffices to prove the results in (2.26c) and (2.26d) for the case $k = 0$. The analysis for the case $k = 0$ is based on the following general result (which is easy to prove): if $X, Y, Z$ are normed spaces with continuous embeddings $I_1 : X \to Y$, $I_2 : Y \to Z$, and if at least one of these embeddings is compact, then the continuous embedding $I_2 \circ I_1 : X \to Z$ is compact. For $m - \frac n2 > 0$ there exist $\lambda, \mu \in (0,1)$ with $0 < \mu < \lambda < m - \frac n2$. The following continuous embeddings exist:

\[ H^m(\Omega) \hookrightarrow C^{0,\lambda}(\bar\Omega) \hookrightarrow C^{0,\mu}(\bar\Omega) \hookrightarrow C(\bar\Omega). \]

In this sequence only the first embedding is nontrivial. This one is proved in [1], theorem 5.4. Furthermore, from [1] theorem 1.31 it follows that the second embedding is compact. We conclude that for $m - \frac n2 > 0$ the embedding $H^m(\Omega) \hookrightarrow C(\bar\Omega)$ is compact. This then yields the result in (2.26c) for $k = 0$. The same line of reasoning can be used to show that (2.26d) holds. □

The result in the following theorem is a basic inequality that will be used frequently.

Theorem 2.2.16 (Poincaré-Friedrichs inequality) There exists a constant $C$ that depends only on $\mathrm{diam}(\Omega)$ such that

\[ \|u\|_{L^2} \le C\Big(\sum_{|\alpha|=1}\|D^\alpha u\|_{L^2}^2\Big)^{1/2} \quad \text{for all } u \in H_0^1(\Omega). \]

Proof. Because $C_0^\infty(\Omega)$ is dense in $H_0^1(\Omega)$ it is sufficient to prove the inequality for $u \in C_0^\infty(\Omega)$. Without loss of generality we can assume that $(0,\dots,0) \in \Omega$. Let $a > 0$ be such that $\Omega \subset [-a,a]^n =: E$. Take $u \in C_0^\infty(\Omega)$ and extend this function by zero outside $\Omega$. Note that

\[ u(x_1,\dots,x_n) = u(-a,x_2,\dots,x_n) + \int_{-a}^{x_1}\frac{\partial u(t,x_2,\dots,x_n)}{\partial t}\,dt. \]

Since $u(-a,x_2,\dots,x_n) = 0$ we obtain, using the Cauchy-Schwarz inequality,

\begin{align*}
u(x)^2 &= \Big(\int_{-a}^{x_1} 1\cdot\frac{\partial u(t,x_2,\dots,x_n)}{\partial t}\,dt\Big)^2 \\
&\le \int_{-a}^{x_1} 1^2\,dt\;\int_{-a}^{x_1}\Big(\frac{\partial u(t,x_2,\dots,x_n)}{\partial t}\Big)^2\,dt \\
&\le 2a\int_{-a}^{a}\Big(\frac{\partial u(t,x_2,\dots,x_n)}{\partial t}\Big)^2\,dt \qquad \text{for } x \in E.
\end{align*}

Note that the latter term does not depend on $x_1$. Integration with respect to the variable $x_1$ results in

\[ \int_{-a}^a u(x_1,\dots,x_n)^2\,dx_1 \le 4a^2\int_{-a}^a \big(D^{(1,0,\dots,0)}u(x)\big)^2\,dx_1 \qquad \text{for } x \in E, \]

and integration with respect to the other variables gives

\[ \int_E u(x)^2\,dx \le 4a^2\int_E \big(D^{(1,0,\dots,0)}u(x)\big)^2\,dx \le 4a^2\sum_{|\alpha|=1}\|D^\alpha u\|_{L^2}^2, \]

and thus the desired result is proved. □
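As a simple check of the inequality (our addition; $u(x,y) = \sin(\pi x)\sin(\pi y)$ on $\Omega = (0,1)^2$ is an arbitrary admissible function):

\begin{verbatim}
import sympy as sp

x, y = sp.symbols('x y')
u = sp.sin(sp.pi*x)*sp.sin(sp.pi*y)          # u in H_0^1((0,1)^2)
l2   = sp.sqrt(sp.integrate(u**2, (x, 0, 1), (y, 0, 1)))
semi = sp.sqrt(sp.integrate(sp.diff(u, x)**2 + sp.diff(u, y)**2,
                            (x, 0, 1), (y, 0, 1)))
print(l2, semi, sp.simplify(l2/semi))   # 1/2, pi/sqrt(2), ratio 1/(sqrt(2)*pi)
\end{verbatim}

So for this particular $u$ the inequality holds with any $C \ge 1/(\sqrt{2}\,\pi) \approx 0.225$.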

Corollary 2.2.17 For $u \in H^m(\Omega)$ define

\[ |u|_m^2 := \sum_{|\alpha|=m}\|D^\alpha u\|_{L^2}^2. \tag{2.27} \]

There exists a constant $C$ such that

\[ |u|_m \le \|u\|_m \le C\,|u|_m \quad \text{for all } u \in H_0^m(\Omega), \]

i.e., $|\cdot|_m$ and $\|\cdot\|_m$ are equivalent norms on $H_0^m(\Omega)$.

Proof. The inequality $|u|_m \le \|u\|_m$ is trivial. For $m = 1$ the inequality $\|u\|_1 \le C|u|_1$ directly follows from the Poincaré-Friedrichs inequality. For $u \in H_0^2(\Omega)$ we obtain $\|u\|_2^2 = \|u\|_1^2 + |u|_2^2 \le C^2|u|_1^2 + |u|_2^2$. Application of the Poincaré-Friedrichs inequality to $D^\alpha u \in H_0^1(\Omega)$, $|\alpha| = 1$, yields

\[ |u|_1^2 = \sum_{|\alpha|=1}\|D^\alpha u\|_{L^2}^2 \le c\sum_{|\alpha|=1}\sum_{|\beta|=1}\|D^\beta D^\alpha u\|_{L^2}^2 \le c\,|u|_2^2. \]

Thus $\|u\|_2 \le C|u|_2$ holds. For $m > 2$ the same reasoning is applicable. □

In the weak formulation of the elliptic boundary value problems one must treat boundary conditions. For this the next result will be needed.

Theorem 2.2.18 (Trace operator) There exists a unique bounded linear operator
\[ \gamma : H^1(\Omega) \to L^2(\partial\Omega), \qquad \|\gamma(u)\|_{L^2(\partial\Omega)} \le c\|u\|_1, \tag{2.28} \]
with the property that for all $u \in C^1(\bar\Omega)$ the equality $\gamma(u) = u_{|\partial\Omega}$ holds.
Proof. We define $\gamma : C^1(\bar\Omega) \to L^2(\partial\Omega)$ by $\gamma(u) = u_{|\partial\Omega}$ and will show that
\[ \|\gamma(u)\|_{L^2(\partial\Omega)} \le c\|u\|_1 \quad \text{for all } u \in C^1(\bar\Omega) \tag{2.29} \]
holds. The desired result then follows from the extension theorem A.2.3. We give a proof of (2.29) for the two-dimensional case. The general case can be treated in the same way. In a neighbourhood of $x \in \partial\Omega$ we take a local coordinate system $(\xi,\eta)$ such that locally the boundary can be represented as
\[ \Gamma_{loc} = \{ (\xi, \phi(\xi)) \mid \xi \in [-a,a] \} \quad \text{with } a > 0,\ \phi \in C^{0,1}([-a,a]), \]
and a small strip below the graph of $\phi$ is contained in $\Omega$:
\[ S := \{ (\xi,\eta) \mid \xi \in [-a,a],\ \eta \in [\phi(\xi)-b, \phi(\xi)) \} \subset \Omega, \]
for $b > 0$ sufficiently small. Take $u \in C^1(\bar\Omega)$. Note that
\[ u(\xi,\phi(\xi)) = u(\xi,t) + \int_t^{\phi(\xi)} \frac{\partial u(\xi,\eta)}{\partial\eta}\, d\eta \quad \text{for } t \in [\phi(\xi)-b, \phi(\xi)]. \]
Using the inequality $(\alpha+\beta)^2 \le 2(\alpha^2+\beta^2)$ and application of the Cauchy–Schwarz inequality yields
\[ u(\xi,\phi(\xi))^2 \le 2u(\xi,t)^2 + 2\Big( \int_t^{\phi(\xi)} 1\cdot\frac{\partial u(\xi,\eta)}{\partial\eta}\, d\eta \Big)^2 \le 2u(\xi,t)^2 + 2(\phi(\xi)-t)\int_t^{\phi(\xi)} \Big( \frac{\partial u(\xi,\eta)}{\partial\eta} \Big)^2 d\eta \le 2u(\xi,t)^2 + 2b\int_{\phi(\xi)-b}^{\phi(\xi)} \Big( \frac{\partial u(\xi,\eta)}{\partial\eta} \Big)^2 d\eta. \]
In this last expression only the first term on the right-hand side depends on $t$. Integration over $t \in [\phi(\xi)-b, \phi(\xi)]$ results in
\[ b\, u(\xi,\phi(\xi))^2 \le 2\int_{\phi(\xi)-b}^{\phi(\xi)} u(\xi,t)^2\, dt + 2b^2 \int_{\phi(\xi)-b}^{\phi(\xi)} \Big( \frac{\partial u(\xi,\eta)}{\partial\eta} \Big)^2 d\eta = 2\int_{\phi(\xi)-b}^{\phi(\xi)} \Big( u(\xi,\eta)^2 + b^2\Big( \frac{\partial u(\xi,\eta)}{\partial\eta} \Big)^2 \Big) d\eta. \]
Integration over $\xi \in [-a,a]$ and division by $b$ gives
\[ \int_{-a}^{a} u(\xi,\phi(\xi))^2\, d\xi \le 2\int_S \Big( b^{-1}u(\xi,\eta)^2 + b\Big( \frac{\partial u(\xi,\eta)}{\partial\eta} \Big)^2 \Big) d\xi\, d\eta. \]
If $\phi \in C^1([-a,a])$ then for the local arc length variable $s$ on $\Gamma_{loc}$ we have
\[ ds = \sqrt{1+\phi'(\xi)^2}\, d\xi. \]
Since $0 \le \phi'(\xi)^2 \le C$ for $\xi \in [-a,a]$ we obtain
\[ \int_{\Gamma_{loc}} u(s)^2\, ds \le C \int_{-a}^{a} u(\xi,\phi(\xi))^2\, d\xi \le C\big( b^{-1}\|u\|_{L^2(S)}^2 + b|u|_{1,S}^2 \big) \le C\|u\|_{1,S}^2. \tag{2.30} \]
If $\phi$ is only Lipschitz continuous on $[-a,a]$ then $\phi'$ exists almost everywhere on $[-a,a]$ and $|\phi'(\xi)|$ is bounded (Rademacher's theorem). Hence, the same argument can be applied. Finally note that $\partial\Omega$ can be covered by a finite number of local parts $\Gamma_{loc}$. Addition of the local inequalities in (2.30) then yields the result in (2.29).
The operator $\gamma$ defined in theorem 2.2.18 is called the trace operator. For $u \in H^1(\Omega)$ the function $\gamma(u) \in L^2(\partial\Omega)$ represents the boundary values of $u$ and is called the trace of $u$. For $\gamma(u)$ one often uses the notation $u_{|\partial\Omega}$. For example, for $u \in H^1(\Omega)$, the identity $u_{|\partial\Omega} = 0$ means that $\gamma(u) = 0$ in the $L^2(\partial\Omega)$ sense.

The space $\operatorname{range}(\gamma)$ can be shown to be dense in $L^2(\partial\Omega)$ but is strictly smaller than $L^2(\partial\Omega)$. For a characterization of this subspace one can use a Sobolev space with a broken index:
\[ H^{\frac12}(\partial\Omega) = \operatorname{range}(\gamma) = \{ v \in L^2(\partial\Omega) \mid \exists\, u \in H^1(\Omega) : v = \gamma(u) \}. \tag{2.31} \]
The space $H^{\frac12}(\partial\Omega)$ is a Hilbert space which has similar properties as the usual Sobolev spaces. We do not discuss this topic here, since we will not need this space in the remainder.
Using the trace operator one can give another natural characterization of the space $H_0^1(\Omega)$:

Theorem 2.2.19 The equality
\[ H_0^1(\Omega) = \{ u \in H^1(\Omega) \mid u_{|\partial\Omega} = 0 \} \]
holds.

Proof. We only prove "$\subset$". For a proof of "$\supset$" we refer to [47], theorem 6.2.42, or [1], remark 7.54. First note that $\{ u \in H^1(\Omega) \mid u_{|\partial\Omega} = 0 \} = \ker(\gamma)$. Furthermore, $C_0^\infty(\Omega) \subset \ker(\gamma)$ and the trace operator $\gamma : H^1(\Omega) \to L^2(\partial\Omega)$ is continuous. From the latter it follows that $\ker(\gamma)$ is closed in $H^1(\Omega)$. This yields
\[ H_0^1(\Omega) = \overline{C_0^\infty(\Omega)}^{\|\cdot\|_1} \subset \overline{\ker(\gamma)}^{\|\cdot\|_1} = \ker(\gamma), \]
and this proves "$\subset$".
Finally, we collect a few results on Green's formulas that hold in Sobolev spaces. For notational simplicity the function arguments $x$ are deleted in the integrals, and in boundary integrals like, for example, $\int_{\partial\Omega} \gamma(u)\gamma(v)\, ds$ we delete the trace operator $\gamma$.
Theorem 2.2.20 The following identities hold, with $n = (n_1,\dots,n_n)$ the outward unit normal on $\partial\Omega$ and $H^m := H^m(\Omega)$:
\[ \int_\Omega \frac{\partial u}{\partial x_i} v\, dx = -\int_\Omega u \frac{\partial v}{\partial x_i}\, dx + \int_{\partial\Omega} uv\, n_i\, ds, \quad u, v \in H^1,\ 1 \le i \le n, \tag{2.32a} \]
\[ \int_\Omega \Delta u\, v\, dx = -\int_\Omega \nabla u\cdot\nabla v\, dx + \int_{\partial\Omega} \frac{\partial u}{\partial n} v\, ds, \quad u \in H^2,\ v \in H^1, \tag{2.32b} \]
\[ \int_\Omega u \operatorname{div} \mathbf v\, dx = -\int_\Omega \nabla u\cdot\mathbf v\, dx + \int_{\partial\Omega} u\, \mathbf v\cdot n\, ds, \quad u \in H^1,\ \mathbf v \in (H^1)^n. \tag{2.32c} \]

Proof. These results immediately follow from the corresponding formulas in $C^\infty(\bar\Omega)$, the continuity of the trace operator and a density argument based on theorem 2.2.9.
For the dual space of $H_0^m(\Omega)$ the following notation is used:
\[ H^{-m}(\Omega) := H_0^m(\Omega)'. \tag{2.33} \]
The norm on this space is denoted by $\|\cdot\|_{-m}$:
\[ \|\phi\|_{-m} := \sup_{v \in H_0^m(\Omega)} \frac{|\phi(v)|}{\|v\|_m}, \qquad \phi \in H^{-m}(\Omega). \]
2.3 General results on variational formulations

In section 2.1 we already gave an example of a variational problem. In the previous section we introduced Hilbert spaces that will be used for the variational formulation of elliptic boundary value problems in sections 2.5 and 2.6. In this section we present some general existence and uniqueness results for variational problems. These results will play a key role in the analysis of well-posedness of the weak formulations of elliptic boundary value problems. They will also be used in the discretization error analysis for the finite element method in chapter 3.

A remark on notation: for elements from a Hilbert space we use boldface notation (e.g., $u$), elements from the dual space (i.e., bounded linear functionals) are denoted by $f, g$, etc., and for linear operators between spaces we use capitals (e.g., $L$).

Let $H_1$ and $H_2$ be Hilbert spaces. A bilinear form $k : H_1 \times H_2 \to \mathbb R$ is continuous if there is a constant $\gamma$ such that for all $x \in H_1$, $y \in H_2$:
\[ |k(x,y)| \le \gamma \|x\|_{H_1}\|y\|_{H_2}. \tag{2.34} \]
For a continuous bilinear form $k : H_1 \times H_2 \to \mathbb R$ we define its norm by $\|k\| = \sup\{ |k(x,y)| \mid \|x\|_{H_1} = 1,\ \|y\|_{H_2} = 1 \}$. A fundamental result is given in the following theorem:
Theorem 2.3.1 Let $H_1, H_2$ be Hilbert spaces and $k : H_1 \times H_2 \to \mathbb R$ be a continuous bilinear form. For $f \in H_2'$ consider the variational problem:
\[ \text{find } u \in H_1 \text{ such that } k(u,v) = f(v) \text{ for all } v \in H_2. \tag{2.35} \]
The following two statements are equivalent:

1. For arbitrary $f \in H_2'$ the problem (2.35) has a unique solution $u \in H_1$ and $\|u\|_{H_1} \le c\|f\|_{H_2'}$ holds with a constant $c$ independent of $f$.

2. The conditions (2.36) and (2.37) hold:
\[ \exists\, \varepsilon > 0 : \ \sup_{v \in H_2} \frac{k(u,v)}{\|v\|_{H_2}} \ge \varepsilon\|u\|_{H_1} \quad \text{for all } u \in H_1, \tag{2.36} \]
\[ \forall\, v \in H_2,\ v \ne 0,\ \exists\, u \in H_1 : \ k(u,v) \ne 0. \tag{2.37} \]
Moreover, for the constants $c$ and $\varepsilon$ one can take $c = \varepsilon^{-1}$.
Proof. We introduce the linear continuous operator $L : H_1 \to H_2'$,
\[ (Lu)(v) := k(u,v). \tag{2.38} \]
Note that for all $u \in H_1$
\[ \|Lu\|_{H_2'} = \sup_{v \in H_2} \frac{(Lu)(v)}{\|v\|_{H_2}} = \sup_{v \in H_2} \frac{k(u,v)}{\|v\|_{H_2}}. \tag{2.39} \]
Furthermore, $u \in H_1$ satisfies (2.35) if and only if $Lu = f$ holds.

1. $\Rightarrow$ 2. From 1. it follows that $L : H_1 \to H_2'$ is bijective. For arbitrary $u \in H_1$ and $f := Lu$ we have:
\[ \|u\|_{H_1} \le c\|f\|_{H_2'} = c\|Lu\|_{H_2'} = c\sup_{v\in H_2} \frac{k(u,v)}{\|v\|_{H_2}}. \]
From this it follows that (2.36) holds with $\varepsilon = \frac1c$.
Take a fixed $v \in H_2$, $v \ne 0$. The linear functional $w \mapsto \langle v,w\rangle_{H_2}$ is an element of $H_2'$. There exists $u \in H_1$ such that $k(u,w) = \langle v,w\rangle_{H_2}$ for all $w \in H_2$. Taking $w = v$ yields $k(u,v) = \|v\|_{H_2}^2 > 0$. Hence, (2.37) holds.

2. $\Rightarrow$ 1. Let $u \in H_1$ be such that $Lu = 0$. Then $k(u,v) = (Lu)(v) = 0$ for all $v \in H_2$. From condition (2.36) it follows that $u = 0$. We conclude that $L : H_1 \to H_2'$ is injective. Let $R(L) \subset H_2'$ be the range of $L$ and $L^{-1} : R(L) \to H_1$ the inverse mapping. From (2.39) and (2.36) it follows that $\|Lu\|_{H_2'} \ge \varepsilon\|u\|_{H_1}$ for all $u \in H_1$ and thus
\[ \|L^{-1}f\|_{H_1} \le \frac1\varepsilon \|f\|_{H_2'} \quad \text{for all } f \in R(L). \tag{2.40} \]
Hence the inverse mapping is bounded. From corollary A.2.6 it follows that $R(L)$ is closed in $H_2'$. Assume that $R(L) \ne H_2'$. Then there exists $g \in R(L)^\perp$, $g \ne 0$. Let $J : H_2' \to H_2$ be the Riesz isomorphism. For arbitrary $u \in H_1$ we get
\[ 0 = \langle g, Lu\rangle_{H_2'} = \langle Jg, JLu\rangle_{H_2} = (Lu)(Jg) = k(u, Jg). \]
This is a contradiction to (2.37). We conclude that $R(L) = H_2'$ and thus $L : H_1 \to H_2'$ is bijective. From (2.40) we obtain, with $u := L^{-1}f$, $\|u\|_{H_1} \le \frac1\varepsilon\|f\|_{H_2'}$ for arbitrary $f \in H_2'$.
Remark 2.3.2 The condition (2.37) can also be formulated as follows:
\[ \big( v \in H_2 \text{ such that } k(u,v) = 0 \text{ for all } u \in H_1 \big) \ \Rightarrow\ v = 0. \]
The condition (2.36) is equivalent to
\[ \exists\, \varepsilon > 0 : \ \inf_{u \in H_1\setminus\{0\}} \sup_{v \in H_2} \frac{k(u,v)}{\|u\|_{H_1}\|v\|_{H_2}} \ge \varepsilon, \tag{2.41} \]
and is often called the inf-sup condition. In the finite dimensional case with $\dim(H_1) = \dim(H_2) < \infty$ this condition implies the result in (2.37) and thus is necessary and sufficient for existence and uniqueness, as can be seen from the following. Let $L : H_1 \to H_2'$ be as in (2.38). If $\dim(H_1) = \dim(H_2) < \infty$ we have
\[ L \text{ is bijective} \iff L \text{ is injective} \iff \inf_{u\ne0} \frac{\|Lu\|_{H_2'}}{\|u\|_{H_1}} > 0 \iff \inf_{u\ne0}\sup_v \frac{k(u,v)}{\|u\|_{H_1}\|v\|_{H_2}} > 0. \tag{2.42} \]
The latter condition seems to be weaker than the inf-sup condition in (2.41), since a uniform $\varepsilon > 0$ is required there. However, in the finite dimensional case it is easy to show, using a compactness argument, that these two conditions are equivalent. In infinite dimensional Hilbert spaces the inf-sup condition (2.41) is in general really stronger than the one in (2.42).
As we saw in the analysis above, it is natural to identify the bounded bilinear form $k : H_1 \times H_2 \to \mathbb R$ with a bounded linear operator $L : H_1 \to H_2'$ via $(Lu)(v) = k(u,v)$. The result in theorem 2.3.1 is a reformulation of the following result that can be found in functional analysis textbooks. Let $L : H_1 \to H_2'$ be a bounded linear operator. Then $L$ is an isomorphism if and only if the following two conditions hold:

(a) $L$ is injective and $R(L)$ is closed in $H_2'$,

(b) $L' : H_2 \to H_1'$ defined by $(L'v)(u) = (Lu)(v)$ is injective.

These two conditions correspond to (2.36) and (2.37), respectively. □
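In the finite dimensional case the inf-sup constant of (2.41) can be computed explicitly: for $k(u,v) = v^T K u$ on $\mathbb R^n$ with Euclidean norms one has $\sup_v k(u,v)/\|v\| = \|Ku\|$, so the best $\varepsilon$ in (2.36) is the smallest singular value of $K$. The following sketch (added for illustration, assuming NumPy is available) checks this:

```python
import numpy as np

# Inf-sup constant in finite dimensions: for k(u, v) = v^T K u with
# Euclidean norms, sup_v k(u,v)/|v| = |Ku|, so the best epsilon in
# (2.36) equals the smallest singular value of K.
rng = np.random.default_rng(0)
K = rng.standard_normal((4, 4))
eps = np.linalg.svd(K, compute_uv=False).min()

u = rng.standard_normal(4)
# condition (2.36) for this u: |Ku| >= eps * |u|
print(np.linalg.norm(K @ u), eps * np.linalg.norm(u))
```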
The following lemma will be used in the analysis below.

Lemma 2.3.3 Let $H_1, H_2$ be Hilbert spaces and $k : H_1 \times H_2 \to \mathbb R$ be a continuous bilinear form. For every $u \in H_1$ there exists a unique $w \in H_2$ such that
\[ \langle w, v\rangle_{H_2} = k(u,v) \quad \text{for all } v \in H_2. \]
Furthermore, if (2.36) is satisfied, then $\|w\|_{H_2} \ge \varepsilon\|u\|_{H_1}$, with $\varepsilon > 0$ as in (2.36), holds.

Proof. Take $u \in H_1$. Then $g : v \mapsto k(u,v)$ defines a continuous linear functional on $H_2$. From the Riesz representation theorem it follows that there exists a unique $w \in H_2$ such that $\langle w,v\rangle_{H_2} = g(v) = k(u,v)$ for all $v \in H_2$, and $\|g\|_{H_2'} = \|w\|_{H_2}$. Using (2.36) we obtain
\[ \|w\|_{H_2} = \|g\|_{H_2'} = \sup_{v\in H_2}\frac{g(v)}{\|v\|_{H_2}} = \sup_{v\in H_2}\frac{k(u,v)}{\|v\|_{H_2}} \ge \varepsilon\|u\|_{H_1}, \]
and thus the result is proved.
Definition 2.3.4 Let $H$ be a Hilbert space. A bilinear form $k : H \times H \to \mathbb R$ is called $H$-elliptic if there exists a constant $\gamma > 0$ such that
\[ k(u,u) \ge \gamma\|u\|_H^2 \quad \text{for all } u \in H. \]

As an immediate consequence of theorem 2.3.1 we obtain the following.

Theorem 2.3.5 (Lax–Milgram) Let $H$ be a Hilbert space and $k : H \times H \to \mathbb R$ a continuous $H$-elliptic bilinear form with ellipticity constant $\gamma$. Then for every $f \in H'$ there exists a unique $u \in H$ such that
\[ k(u,v) = f(v) \quad \text{for all } v \in H. \]
Furthermore, the inequality $\|u\|_H \le \frac1\gamma\|f\|_{H'}$ holds.

This theorem will play an important role in the analysis of well-posedness of the weak formulation of scalar elliptic problems in section 2.5.
In the remainder of this section we analyze well-posedness for a variational problem which has a special saddle-point structure. This structure is such that the analysis applies to the Stokes problem. This application is treated in section 2.6.
Let $V$ and $M$ be Hilbert spaces and
\[ a : V \times V \to \mathbb R, \qquad b : V \times M \to \mathbb R \]
be continuous bilinear forms. For $f_1 \in V'$, $f_2 \in M'$ we define the following variational problem:

find $(\phi,\lambda) \in V \times M$ such that
\[ a(\phi,\psi) + b(\psi,\lambda) = f_1(\psi) \quad \text{for all } \psi \in V, \tag{2.43a} \]
\[ b(\phi,\mu) = f_2(\mu) \quad \text{for all } \mu \in M. \tag{2.43b} \]
We now define $H := V \times M$ and
\[ k : H \times H \to \mathbb R, \qquad k(u,v) := a(\phi,\psi) + b(\psi,\lambda) + b(\phi,\mu), \quad \text{with } u := (\phi,\lambda),\ v := (\psi,\mu). \tag{2.44} \]
On $H$ we use the product norm $\|u\|_H^2 = \|\phi\|_V^2 + \|\lambda\|_M^2$, for $u = (\phi,\lambda) \in H$. If we define $f \in H' = V' \times M'$ by $f(\psi,\mu) = f_1(\psi) + f_2(\mu)$ then the problem (2.43) can be reformulated in the setting of theorem 2.3.1 as follows:
\[ \text{find } u \in H \text{ such that } k(u,v) = f(v) \text{ for all } v \in H. \tag{2.45} \]
Below we will analyze the well-posedness of the problem (2.45) and thus of (2.43). In particular, we will derive conditions on the bilinear forms $a(\cdot,\cdot)$ and $b(\cdot,\cdot)$ that are necessary and sufficient for existence and uniqueness of a solution. The main result is presented in theorem 2.3.10.
We start with a few useful results. We need the following null space:
\[ V_0 := \{ \phi \in V \mid b(\phi,\mu) = 0 \text{ for all } \mu \in M \}. \tag{2.46} \]
Note that both $V_0$ and $V_0^\perp := \{ \phi \in V \mid \langle\phi,\psi\rangle_V = 0 \text{ for all } \psi \in V_0 \}$ are closed subspaces of $V$. These subspaces are Hilbert spaces if we use the scalar product $\langle\cdot,\cdot\rangle_V$. As usual the corresponding dual spaces are denoted by $V_0'$ and $(V_0^\perp)'$.
We introduce an inf-sup condition for the bilinear form $b(\cdot,\cdot)$:
\[ \exists\, \beta > 0 : \ \sup_{\psi\in V} \frac{b(\psi,\mu)}{\|\psi\|_V} \ge \beta\|\mu\|_M \quad \text{for all } \mu \in M. \tag{2.47} \]

Remark 2.3.6 Due to $V = V_0 \oplus V_0^\perp$ and $b(\psi,\mu) = 0$ for all $\psi \in V_0$, $\mu \in M$, the condition in (2.47) is equivalent to the condition
\[ \exists\, \beta > 0 : \ \sup_{\psi\in V_0^\perp} \frac{b(\psi,\mu)}{\|\psi\|_V} \ge \beta\|\mu\|_M \quad \text{for all } \mu \in M. \tag{2.48} \]
Lemma 2.3.7 Assume that (2.47) holds. For $g \in (V_0^\perp)'$ the variational problem
\[ \text{find } \lambda \in M \text{ such that } b(\psi,\lambda) = g(\psi) \text{ for all } \psi \in V_0^\perp \tag{2.49} \]
has a unique solution. Furthermore, $\|\lambda\|_M \le \beta^{-1}\|g\|_{(V_0^\perp)'}$ holds.

Proof. We apply theorem 2.3.1 with $H_1 = M$, $H_2 = V_0^\perp$ and $k(\lambda,\psi) = b(\psi,\lambda)$ (note the interchange of arguments). From (2.48) it follows that condition (2.36) is fulfilled. Take $\psi \in V_0^\perp$, $\psi \ne 0$. Then $\psi \notin V_0$ and thus there exists $\mu \in M$ with $b(\psi,\mu) \ne 0$. Hence, the second condition (2.37) is also satisfied. Application of theorem 2.3.1 yields that the problem (2.49) has a unique solution $\lambda \in M$ and $\|\lambda\|_M \le \beta^{-1}\|g\|_{(V_0^\perp)'}$ holds.
Note that, opposite to (2.36), in (2.47) we take the supremum over the first argument of the bilinear form. In the following lemma we formulate a result in which the supremum is taken over the second argument:

Lemma 2.3.8 The condition (2.47) (or (2.48)) implies:
\[ \exists\, \beta > 0 : \ \sup_{\mu\in M} \frac{b(\psi,\mu)}{\|\mu\|_M} \ge \beta\|\psi\|_V \quad \text{for all } \psi \in V_0^\perp. \tag{2.50} \]
Proof. Take $\psi \in V_0^\perp$, $\psi \ne 0$, and define $g \in (V_0^\perp)'$ by $g(\chi) = \langle\psi,\chi\rangle_V$ for $\chi \in V_0^\perp$. From lemma 2.3.7 it follows that there exists a unique $\lambda \in M$ such that
\[ b(\chi,\lambda) = g(\chi) \quad \text{for all } \chi \in V_0^\perp \]
and $\|\lambda\|_M \le \beta^{-1}\|g\|_{(V_0^\perp)'} = \beta^{-1}\|\psi\|_V$ holds. We conclude that
\[ \sup_{\mu\in M} \frac{b(\psi,\mu)}{\|\mu\|_M} \ge \frac{b(\psi,\lambda)}{\|\lambda\|_M} = \frac{g(\psi)}{\|\lambda\|_M} = \frac{\|\psi\|_V^2}{\|\lambda\|_M} \ge \beta\|\psi\|_V \]
holds for arbitrary $\psi \in V_0^\perp$.

In general, (2.50) does not imply (2.47). Take, for example, the bilinear form that is identically zero, i.e., $b(\psi,\mu) = 0$ for all $\psi \in V$, $\mu \in M$. Then (2.47) does not hold. However, since $V_0 = V$ and thus $V_0^\perp = \{0\}$ it follows that (2.50) does hold.
Application of lemma 2.3.3 in combination with the inf-sup properties in (2.47) and (2.50) yields the following corollary.
Corollary 2.3.9 Assume that (2.47) holds. For every $\mu \in M$ there exists a unique $z_\mu \in V_0^\perp$ such that
\[ \langle z_\mu, \psi\rangle_V = b(\psi,\mu) \quad \text{for all } \psi \in V_0^\perp. \]
Furthermore, $\|z_\mu\|_V \ge \beta\|\mu\|_M$ holds. For every $\psi \in V_0^\perp$ there exists a unique $w_\psi \in M$ such that
\[ \langle w_\psi, \mu\rangle_M = b(\psi,\mu) \quad \text{for all } \mu \in M. \]
Furthermore, $\|w_\psi\|_M \ge \beta\|\psi\|_V$ holds.

Proof. The first part follows by applying lemma 2.3.3 with $H_1 = M$, $H_2 = V_0^\perp$, $k(\mu,\psi) = b(\psi,\mu)$ in combination with (2.48). For the second part we use lemma 2.3.3 with $H_1 = V_0^\perp$, $H_2 = M$, $k(\psi,\mu) = b(\psi,\mu)$ in combination with (2.50).
In the following main theorem we present necessary and sufficient conditions on the bilinear forms $a(\cdot,\cdot)$ and $b(\cdot,\cdot)$ such that the saddle-point problem (2.43) has a unique solution which depends continuously on the data.

Theorem 2.3.10 Let $V, M$ be Hilbert spaces and $a : V\times V\to\mathbb R$, $b : V\times M\to\mathbb R$ be continuous bilinear forms. Define $H := V\times M$ and let $k : H\times H\to\mathbb R$ be the continuous bilinear form defined in (2.44). For $f \in H' = V'\times M'$ consider the variational problem
\[ \text{find } u \in H \text{ such that } k(u,v) = f(v) \text{ for all } v \in H. \tag{2.51} \]
The following two statements are equivalent:

1. For arbitrary $f \in H'$ the problem (2.51) has a unique solution $u \in H$ and $\|u\|_H \le c\|f\|_{H'}$ holds with a constant $c$ independent of $f$.

2. The inf-sup condition (2.47) holds and the conditions (2.52a), (2.52b) are satisfied:
\[ \exists\, \alpha > 0 : \ \sup_{\psi\in V_0} \frac{a(\phi,\psi)}{\|\psi\|_V} \ge \alpha\|\phi\|_V \quad \text{for all } \phi \in V_0, \tag{2.52a} \]
\[ \forall\, \psi \in V_0,\ \psi \ne 0,\ \exists\, \phi \in V_0 : \ a(\phi,\psi) \ne 0. \tag{2.52b} \]
Moreover, if the second statement holds, then for $c$ in the first statement one can take $c = (\beta + 2\|a\|)^2\,\alpha^{-1}\beta^{-2}$.
Proof. From theorem 2.3.1 (with $H_1 = H_2 = H$) it follows that statement 1 is equivalent to:

1'. For $H_1 = H_2 = H$ the conditions (2.36) and (2.37) are satisfied.

We now prove that the statements 1' and 2 are equivalent. We recall the condition (2.36):
\[ \exists\,\varepsilon>0 : \ \sup_{(\psi,\mu)\in H} \frac{a(\phi,\psi) + b(\psi,\lambda) + b(\phi,\mu)}{(\|\psi\|_V^2 + \|\mu\|_M^2)^{1/2}} \ge \varepsilon\big( \|\phi\|_V^2 + \|\lambda\|_M^2 \big)^{1/2} \quad \text{for all } (\phi,\lambda) \in H. \tag{2.53} \]
Define $u = (\phi,\lambda)$, $v = (\psi,\mu)$ and $k(u,v)$ as in (2.44).

We have to prove: $\{(2.53), (2.37)\} \iff \{(2.47), (2.52a), (2.52b)\}$. This is done in the following five steps: a) $(2.53) \Rightarrow (2.47)$; b) $\{(2.53), (2.47)\} \Rightarrow (2.52a)$; c) $\{(2.47), (2.37)\} \Rightarrow (2.52b)$; d) $\{(2.47), (2.52a)\} \Rightarrow (2.53)$; e) $\{(2.52a), (2.52b)\} \Rightarrow (2.37)$.
a). If in (2.53) we take $\phi = 0$ we obtain
\[ \sup_{\psi\in V} \frac{b(\psi,\lambda)}{\|\psi\|_V} = \sup_{(\psi,\mu)\in H} \frac{b(\psi,\lambda)}{(\|\psi\|_V^2 + \|\mu\|_M^2)^{1/2}} \ge \varepsilon\|\lambda\|_M \quad \text{for all } \lambda \in M, \]
and thus the inf-sup condition (2.47) holds.
b). Take $\phi_0 \in V_0$. The functional $g : \psi \mapsto -a(\phi_0,\psi)$, $\psi \in V_0^\perp$, is linear and bounded, i.e., $g \in (V_0^\perp)'$. Application of lemma 2.3.7 yields that there exists $\lambda \in M$ such that $b(\psi,\lambda) = -a(\phi_0,\psi)$ for all $\psi \in V_0^\perp$. In (2.53) we take $(\phi,\lambda) = (\phi_0,\lambda)$. Every $\psi \in V$ is decomposed as $\psi = \psi_0 + \psi^\perp$ with $\psi_0 \in V_0$, $\psi^\perp \in V_0^\perp$. Using $b(\psi,\lambda) + b(\phi_0,\mu) = b(\psi^\perp,\lambda)$ we obtain from (2.53)
\[ \sup_{(\psi,\mu)\in H} \frac{a(\phi_0,\psi_0) + a(\phi_0,\psi^\perp) + b(\psi^\perp,\lambda)}{(\|\psi\|_V^2 + \|\mu\|_M^2)^{1/2}} \ge \varepsilon\big( \|\phi_0\|_V^2 + \|\lambda\|_M^2 \big)^{1/2}. \]
From this we get, using $b(\psi^\perp,\lambda) = -a(\phi_0,\psi^\perp)$ for all $\psi^\perp \in V_0^\perp$:
\[ \sup_{\psi_0\in V_0} \frac{a(\phi_0,\psi_0)}{\|\psi_0\|_V} \ge \sup_{(\psi,\mu)\in H} \frac{a(\phi_0,\psi_0)}{(\|\psi\|_V^2 + \|\mu\|_M^2)^{1/2}} \ge \varepsilon\big( \|\phi_0\|_V^2 + \|\lambda\|_M^2 \big)^{1/2} \ge \varepsilon\|\phi_0\|_V, \]
and thus the condition (2.52a) holds.
c). Take $\psi_0 \in V_0$, $\psi_0 \ne 0$. The functional $g : \psi \mapsto -a(\psi,\psi_0)$, $\psi \in V_0^\perp$, is an element of $(V_0^\perp)'$. From lemma 2.3.7 it follows that there exists $\lambda \in M$ such that $b(\psi,\lambda) = -a(\psi,\psi_0)$ for all $\psi \in V_0^\perp$. In condition (2.37) we take $v = (\psi_0,\lambda)$. Then there exists $u = (\phi,\mu) \in H$ such that $k(u,v) \ne 0$, i.e.,
\[ a(\phi,\psi_0) + b(\psi_0,\mu) + b(\phi,\lambda) = a(\phi,\psi_0) + b(\phi,\lambda) \ne 0. \]
Decompose $\phi$ as $\phi = \phi_0 + \phi^\perp$, $\phi_0 \in V_0$, $\phi^\perp \in V_0^\perp$, and use the definition of $\lambda$ to get
\[ 0 \ne a(\phi,\psi_0) + b(\phi,\lambda) = a(\phi_0,\psi_0) + a(\phi^\perp,\psi_0) + b(\phi^\perp,\lambda) = a(\phi_0,\psi_0). \]
Hence, the result in (2.52b) holds.
d). Let $u = (\phi,\lambda) \in H$ be given. We decompose $\phi$ as $\phi = \phi_0 + \phi^\perp$, $\phi_0 \in V_0$, $\phi^\perp \in V_0^\perp$. We assume that $\phi^\perp \ne 0$, $\lambda \ne 0$. From corollary 2.3.9 it follows that:
\[ \exists\, z \in V_0^\perp : \ \langle z,\psi\rangle_V = b(\psi,\lambda) \ \forall\,\psi \in V_0^\perp; \quad \|z\|_V \ge \beta\|\lambda\|_M, \tag{2.54} \]
\[ \exists\, w \in M : \ \langle w,\mu\rangle_M = b(\phi^\perp,\mu) \ \forall\,\mu \in M; \quad \|w\|_M \ge \beta\|\phi^\perp\|_V. \tag{2.55} \]
From assumption (2.52a) it follows that there exist $\alpha > 0$ and $\psi_0 \in V_0$ with $\|\psi_0\|_V = 1$ such that $a(\phi_0,\psi_0) \ge \alpha\|\phi_0\|_V$ holds. We now introduce
\[ \psi := \delta_1\psi_0 + \frac{z}{\|z\|_V}, \qquad \mu := \delta_2\frac{w}{\|w\|_M}, \qquad \delta_1,\delta_2 > 0. \]
Note that $\|\psi\|_V^2 + \|\mu\|_M^2 = \delta_1^2 + 1 + \delta_2^2$. We obtain:
\[ \sup_{v\in H}\frac{k(u,v)}{\|v\|_H} \ge \sup_{\delta_1,\delta_2} \frac{a(\phi,\psi) + b(\psi,\lambda) + b(\phi,\mu)}{(1 + \delta_1^2 + \delta_2^2)^{1/2}} = \sup_{\delta_1,\delta_2} \frac{\delta_1 a(\phi_0,\psi_0) + a(\phi_0, z/\|z\|_V) + a(\phi^\perp,\psi) + \delta_2\|w\|_M + \|z\|_V}{(1 + \delta_1^2 + \delta_2^2)^{1/2}} \ge \sup_{\delta_1,\delta_2} \frac{(\delta_1\alpha - \|a\|)\|\phi_0\|_V + (\delta_2\beta - \|a\|(\delta_1+1))\|\phi^\perp\|_V + \beta\|\lambda\|_M}{(1 + \delta_1^2 + \delta_2^2)^{1/2}}. \]
We take $\delta_1, \delta_2$ such that $\delta_1\alpha - \|a\| = \beta$ and $\delta_2\beta - \|a\|(\delta_1+1) = \beta$. This results in
\[ (\delta_1, \delta_2) = \Big( \frac{\|a\| + \beta}{\alpha},\ \frac{\beta + \|a\|(\delta_1+1)}{\beta} \Big). \]
Note that $\alpha \le \|a\|$, $\delta_1 \ge 1$, $\delta_2 \ge 1$, and thus $(1 + \delta_1^2 + \delta_2^2)^{1/2} \le \delta_1 + \delta_2 \le \frac{(\beta + 2\|a\|)^2}{\alpha\beta}$. We conclude that
\[ \sup_{v\in H}\frac{k(u,v)}{\|v\|_H} \ge \frac{\alpha\beta^2}{(\beta + 2\|a\|)^2}\big( \|\phi_0\|_V + \|\phi^\perp\|_V + \|\lambda\|_M \big) \ge \frac{\alpha\beta^2}{(\beta + 2\|a\|)^2}\|u\|_H \tag{2.56} \]
holds. Using a continuity argument the same result holds if $\phi^\perp = 0$ or $\lambda = 0$. Hence condition (2.53) holds with $\varepsilon = \frac{\alpha\beta^2}{(\beta + 2\|a\|)^2}$.
e). Take $\psi \in V_0$, $\psi \ne 0$. From (2.52) and theorem 2.3.1 with $H_1 = H_2 = V_0$, $k(u,v) = a(u,v)$ it follows that there exists a unique $\phi \in V_0$ such that $a(\phi,\chi) = \langle\psi,\chi\rangle_V$ for all $\chi \in V_0$ and $\|\phi\|_V \le \alpha^{-1}\|\psi\|_{V_0'} = \alpha^{-1}\|\psi\|_V$. If we take $\chi = \psi$ we obtain
\[ \sup_{\tilde\phi\in V_0} \frac{a(\tilde\phi,\psi)}{\|\tilde\phi\|_V} \ge \frac{a(\phi,\psi)}{\|\phi\|_V} = \frac{\|\psi\|_V^2}{\|\phi\|_V} \ge \alpha\|\psi\|_V. \tag{2.57} \]
We introduce the adjoint bilinear form (with $u = (\phi,\lambda)$, $v = (\psi,\mu)$):
\[ \tilde k(u,v) := k(v,u) = \tilde a(\phi,\psi) + b(\psi,\lambda) + b(\phi,\mu), \qquad \tilde a(\phi,\psi) := a(\psi,\phi). \]
From (2.57) it follows that
\[ \sup_{\psi\in V_0} \frac{\tilde a(\phi,\psi)}{\|\psi\|_V} \ge \alpha\|\phi\|_V \quad \text{for all } \phi \in V_0. \]
Using this one can prove with exactly the same arguments as in part d) that
\[ \sup_{v\in H}\frac{\tilde k(u,v)}{\|v\|_H} \ge \frac{\alpha\beta^2}{(\beta + 2\|a\|)^2}\|u\|_H \quad \text{for all } u \in H \]
holds. Thus for every $u \in H$, $u \ne 0$ there exists $v \in H$ such that $\tilde k(u,v) \ne 0$ and thus $k(v,u) \ne 0$, too. This shows that (2.37) holds and completes the proof of a)–e). The final statement in the theorem follows from the final result in theorem 2.3.1 and the choice of $\varepsilon$ in part d).
Remark 2.3.11 The final result in theorem 2.3.10 predicts that if we scale such that $\|a\| = 1$ then the stability constant $c = (\beta + 2\|a\|)^2\alpha^{-1}\beta^{-2}$ is large when the values of the constants $\alpha$ and $\beta$ are much smaller than one. We now give an example with $\|a\| = 1$ in which the stability deteriorates like $\alpha^{-1}\beta^{-2}$ for $\alpha \downarrow 0$, $\beta \downarrow 0$. This shows that the behaviour $c \sim \alpha^{-1}\beta^{-2}$ for the stability constant is sharp.
Take $V = \mathbb R^2$ with the Euclidean norm $\|\cdot\|_2$, $M = \mathbb R$ and let $e_1 = (1\ 0)^T$, $e_2 = (0\ 1)^T$ be the standard basis vectors in $\mathbb R^2$. For fixed $\alpha > 0$, $\beta > 0$ we introduce the bilinear forms
\[ b(\phi,\mu) = \beta\mu\, e_1^T\phi, \qquad \phi \in \mathbb R^2,\ \mu \in \mathbb R, \]
\[ a(\phi,\psi) = \psi^T A\phi, \qquad A := \begin{pmatrix} 0 & 1 \\ -1 & \alpha \end{pmatrix}, \qquad \phi,\psi \in \mathbb R^2. \]
We then have $V_0 = \operatorname{span}(e_2)$ and a simple computation yields
\[ \sup_{\phi\in V} \frac{b(\phi,\mu)}{\|\phi\|_2} = \beta|\mu| \quad \text{for all } \mu, \]
\[ \sup_{\psi\in V_0} \frac{a(\phi,\psi)}{\|\psi\|_2} = \alpha\|\phi\|_2 \quad \text{for all } \phi \in V_0. \]
With $u = (\phi,\lambda)$, $v = (\psi,\mu) \in \mathbb R^3$ we have
\[ k(u,v) = a(\phi,\psi) + b(\psi,\lambda) + b(\phi,\mu) = v^T Cu, \qquad C := \begin{pmatrix} 0 & 1 & \beta \\ -1 & \alpha & 0 \\ \beta & 0 & 0 \end{pmatrix}. \]
We consider the functional $f(v) = f_1(\psi) + f_2(\mu) = \mu = (0\ 0\ 1)v$ with norm $\|f\|_{H'} = 1$. The unique solution $u \in V\times M = \mathbb R^3$ such that $k(u,v) = f(v)$ for all $v \in V\times M$ is the unique solution of $Cu = (0\ 0\ 1)^T$. Hence
\[ u = \Big( \frac1\beta,\ \frac{1}{\alpha\beta},\ -\frac{1}{\alpha\beta^2} \Big)^T. \]
From this it follows that for all $0 < \alpha \le 1$, $0 < \beta \le 1$ we have
\[ \frac{1}{\alpha\beta^2}\|f\|_{H'} \le \|u\|_H \le \frac{3}{\alpha\beta^2}\|f\|_{H'}. \]
□
Important sufficient conditions for well-posedness of the problem (2.43) are formulated in the following corollary.

Corollary 2.3.12 For arbitrary $f_1 \in V'$, $f_2 \in M'$ consider the variational problem (2.43): find $(\phi,\lambda) \in V\times M$ such that
\[ a(\phi,\psi) + b(\psi,\lambda) = f_1(\psi) \quad \text{for all } \psi \in V, \tag{2.58a} \]
\[ b(\phi,\mu) = f_2(\mu) \quad \text{for all } \mu \in M. \tag{2.58b} \]
Assume that the bilinear forms $a(\cdot,\cdot)$ and $b(\cdot,\cdot)$ are continuous and satisfy the following two conditions:
\[ \exists\,\beta > 0 : \ \sup_{\psi\in V} \frac{b(\psi,\mu)}{\|\psi\|_V} \ge \beta\|\mu\|_M \quad \forall\,\mu \in M \quad \text{(inf-sup condition)}, \tag{2.59a} \]
\[ \exists\,\gamma > 0 : \ a(\psi,\psi) \ge \gamma\|\psi\|_V^2 \quad \forall\,\psi \in V \quad \text{($V$-ellipticity)}. \tag{2.59b} \]
Then the conditions (2.47) and (2.52) (with $\alpha = \gamma$) are satisfied and the problem (2.58) has a unique solution $(\phi,\lambda)$. Moreover, the stability bound $\|(\phi,\lambda)\|_H \le \frac{(\beta + 2\|a\|)^2}{\gamma\beta^2}\|(f_1,f_2)\|_{H'}$ holds.

Proof. Apply theorem 2.3.10.
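The finite dimensional analogue of this corollary is the statement that the block matrix $\bigl(\begin{smallmatrix} A & B^T \\ B & 0 \end{smallmatrix}\bigr)$ is invertible whenever $A$ is symmetric positive definite and $B$ has full row rank (the discrete counterparts of (2.59b) and (2.59a)). A minimal sketch (added for illustration, assuming NumPy is available):

```python
import numpy as np

# Discrete saddle-point system: A SPD (V-ellipticity) and B of full row
# rank (inf-sup) imply unique solvability of the block system (2.58).
rng = np.random.default_rng(1)
n, m = 6, 2
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)        # symmetric positive definite
B = rng.standard_normal((m, n))    # full row rank with probability 1

K = np.block([[A, B.T],
              [B, np.zeros((m, m))]])
sol = np.linalg.solve(K, rng.standard_normal(n + m))
phi, lam = sol[:n], sol[n:]        # primal part and multiplier part
print(np.linalg.cond(K))           # finite: the system is invertible
```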
2.4 Minimization of functionals and saddle-point problems

In the variational problems treated in the previous section we did not assume symmetry of the bilinear forms. In this section we introduce certain symmetry properties and show that in that case equivalent alternative problem formulations can be derived.

First we reconsider the case of a continuous bilinear form $k : H\times H\to\mathbb R$ that is $H$-elliptic. This situation is considered in the Lax–Milgram theorem 2.3.5. In addition we now assume that the bilinear form is symmetric: $k(u,v) = k(v,u)$ for all $u, v$.

Theorem 2.4.1 Let $H$ be a Hilbert space and $k : H\times H\to\mathbb R$ a continuous $H$-elliptic symmetric bilinear form. For $f \in H'$ let $u \in H$ be the unique solution of the variational problem
\[ k(u,v) = f(v) \quad \text{for all } v \in H. \tag{2.60} \]
Then $u$ is the unique minimizer of the functional
\[ J(v) := \tfrac12 k(v,v) - f(v). \tag{2.61} \]
Proof. From the Lax–Milgram theorem it follows that the variational problem (2.60) has a unique solution $u \in H$. For arbitrary $z \in H$, $z \ne 0$ we have, with ellipticity constant $\gamma > 0$:
\[ J(u+z) = \tfrac12 k(u+z, u+z) - f(u+z) = \tfrac12 k(u,u) - f(u) + k(u,z) - f(z) + \tfrac12 k(z,z) = J(u) + \tfrac12 k(z,z) \ge J(u) + \tfrac12\gamma\|z\|_H^2 > J(u), \]
where we used the symmetry of $k(\cdot,\cdot)$ and, in the third equality, that $k(u,z) = f(z)$ by (2.60). This proves the desired result.
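The finite dimensional counterpart of theorem 2.4.1 is the familiar fact that for a symmetric positive definite matrix $K$ the quadratic functional $J(v) = \frac12 v^T Kv - b^Tv$ is minimized exactly by the solution of $Kv = b$. A short sketch (an added illustration, assuming NumPy is available) verifies this with steepest descent on $J$:

```python
import numpy as np

# Minimizing J(v) = 1/2 v^T K v - b^T v (K SPD) recovers the solution
# of K v = b; grad J(v) = K v - b vanishes exactly at the minimizer.
rng = np.random.default_rng(2)
G = rng.standard_normal((5, 5))
K = G @ G.T + 5 * np.eye(5)        # symmetric positive definite
b = rng.standard_normal(5)

u = np.linalg.solve(K, b)          # solution of the linear system

v = np.zeros(5)                    # steepest descent on J
for _ in range(500):
    v -= 0.01 * (K @ v - b)        # step along -grad J(v)
print(np.linalg.norm(v - u))       # -> 0
```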
We now reconsider the variational problem (2.43) and the result formulated in corollary 2.3.12.

Theorem 2.4.2 For arbitrary $f_1 \in V'$, $f_2 \in M'$ consider the variational problem (2.43): find $(\phi,\lambda) \in V\times M$ such that
\[ a(\phi,\psi) + b(\psi,\lambda) = f_1(\psi) \quad \text{for all } \psi \in V, \tag{2.62a} \]
\[ b(\phi,\mu) = f_2(\mu) \quad \text{for all } \mu \in M. \tag{2.62b} \]
Assume that the bilinear forms $a(\cdot,\cdot)$ and $b(\cdot,\cdot)$ are continuous and satisfy the conditions (2.47) and (2.52). In addition we assume that $a(\cdot,\cdot)$ is symmetric. Define the functional $L : V\times M\to\mathbb R$ by
\[ L(\psi,\mu) = \tfrac12 a(\psi,\psi) + b(\psi,\mu) - f_1(\psi) - f_2(\mu). \]
Then the unique solution $(\phi,\lambda)$ of (2.62) is also the unique element in $V\times M$ for which
\[ L(\phi,\mu) \le L(\phi,\lambda) \le L(\psi,\lambda) \quad \text{for all } \psi \in V,\ \mu \in M \tag{2.63} \]
holds.
Proof. From theorem 2.3.10 it follows that the problem (2.62) has a unique solution. Take a fixed element $(\phi,\lambda) \in V\times M$. We will prove:
\[ L(\phi,\mu) \le L(\phi,\lambda)\ \forall\,\mu\in M \iff b(\phi,\mu) = f_2(\mu)\ \forall\,\mu\in M, \tag{2.64} \]
\[ L(\phi,\lambda) \le L(\psi,\lambda)\ \forall\,\psi\in V \iff a(\phi,\psi) + b(\psi,\lambda) = f_1(\psi)\ \forall\,\psi\in V. \]
From this it follows that $(\phi,\lambda)$ satisfies (2.63) if and only if $(\phi,\lambda)$ is a solution of (2.62). This then proves the statement of the theorem. We now prove (2.64). Note that
\[ L(\phi,\mu) \le L(\phi,\lambda)\ \forall\,\mu\in M \iff b(\phi,\mu) - f_2(\mu) \le b(\phi,\lambda) - f_2(\lambda)\ \forall\,\mu\in M \iff b(\phi,\mu) - f_2(\mu) \le 0\ \forall\,\mu\in M \iff b(\phi,\mu) = f_2(\mu)\ \forall\,\mu\in M, \]
where in the second step $\mu$ is replaced by $\mu + \lambda$ and in the last step we use that with $\mu$ also $-\mu$ lies in $M$. From this the first result in (2.64) follows. For the second result we first note
\[ L(\phi,\lambda) \le L(\psi,\lambda)\ \forall\,\psi\in V \iff \tfrac12 a(\phi,\phi) + b(\phi,\lambda) - f_1(\phi) \le \tfrac12 a(\psi,\psi) + b(\psi,\lambda) - f_1(\psi)\ \forall\,\psi\in V \iff 0 \le \tfrac12 a(\chi,\chi) + a(\phi,\chi) + b(\chi,\lambda) - f_1(\chi)\ \forall\,\chi\in V, \]
where we substituted $\psi = \phi + \chi$ and used the symmetry of $a(\cdot,\cdot)$. Now note that $\frac12 a(\chi,\chi)$ is a quadratic term in $\chi$ and $a(\phi,\chi) + b(\chi,\lambda) - f_1(\chi)$ is linear in $\chi$. A scaling argument (replace $\chi$ by $t\chi$, let $t \downarrow 0$, and then replace $\chi$ by $-\chi$) now yields
\[ L(\phi,\lambda) \le L(\psi,\lambda)\ \forall\,\psi\in V \iff 0 \le a(\phi,\chi) + b(\chi,\lambda) - f_1(\chi)\ \forall\,\chi\in V \iff a(\phi,\chi) + b(\chi,\lambda) = f_1(\chi)\ \forall\,\chi\in V, \]
and thus the second result in (2.64) holds.
Note that if $a(\cdot,\cdot)$ is symmetric then (2.52a) implies (2.52b). Due to the property (2.63) the problem (2.62) with a symmetric bilinear form $a(\cdot,\cdot)$ is called a saddle-point problem.
2.5 Variational formulation of scalar elliptic problems

2.5.1 Introduction

We reconsider the example of section 2.1, i.e., the two-point boundary value problem
\[ -(au')' = 1 \quad \text{in } (0,1), \tag{2.65a} \]
\[ u(0) = u(1) = 0, \tag{2.65b} \]
with $a(x) > 0$ for all $x \in [0,1]$. Let $V_1$, $k(\cdot,\cdot)$ and $f(\cdot)$ be as defined in section 2.1:
\[ V_1 = \{ v \in C^2([0,1]) \mid v(0) = v(1) = 0 \}, \]
\[ k(u,v) = \int_0^1 a(x)u'(x)v'(x)\, dx, \qquad f(v) = \int_0^1 v(x)\, dx. \]
The two-point boundary value problem has a corresponding variational formulation:
\[ \text{find } u \in V_1 \text{ such that } k(u,v) = f(v) \text{ for all } v \in V_1. \tag{2.66} \]
One easily checks that $u \in V_1$ solves (2.65) iff $u$ is a solution of (2.66). Hence, if the problem (2.65) has a solution $u \in C^2([0,1])$ this must also be the unique (due to lemma 2.1.2) solution of (2.66). As in section 2.1 we now consider this problem with a discontinuous piecewise constant function $a$. Then the classical formulation (2.65) is not well-defined, whereas the variational problem does make sense. However, in section 2.1 it is shown that the problem (2.66) has no solution (the space $V_1$ is too small). Since in the bilinear form $k(\cdot,\cdot)$ only first derivatives occur, the larger space $V_2 := \{ v \in C^1([0,1]) \mid v(0) = v(1) = 0 \}$ seems to be more appropriate. This leads to the weaker variational formulation:
\[ \text{find } u \in V_2 \text{ such that } k(u,v) = f(v) \text{ for all } v \in V_2. \tag{2.67} \]
However, it is shown in section 2.1 that the problem (2.67) still has no solution. The key step is to take the completion of the space $V_2$ (or, equivalently, of $V_1$):
\[ H_0^1((0,1)) = \overline{C_0^\infty((0,1))}^{\|\cdot\|_1} = \overline{V_1}^{\|\cdot\|_1} = \overline{V_2}^{\|\cdot\|_1}. \]
Thus we consider:
\[ \text{find } u \in H_0^1((0,1)) \text{ such that } k(u,v) = f(v) \text{ for all } v \in V_2. \]
Both the bilinear form $k(\cdot,\cdot)$ and $f(\cdot)$ are continuous on $H_0^1((0,1))$ and thus this problem is equivalent to the variational problem
\[ \text{find } u \in H_0^1((0,1)) \text{ such that } k(u,v) = f(v) \text{ for all } v \in H_0^1((0,1)). \tag{2.68} \]
From the Lax–Milgram lemma 2.3.5 it follows that there exists a unique solution (which is usually called the weak solution) of the variational problem (2.68). For this existence and uniqueness result it is essential that we used the Sobolev space $H_0^1((0,1))$, which is a Hilbert space. In section 2.1 we considered a space $V_3$ with $V_2 \subset V_3 \subset H_0^1((0,1))$ and showed that the function $u$ given in (2.11) solves the variational problem in the space $V_3$. Due to $\overline{V_3}^{\|\cdot\|_1} = H_0^1((0,1))$ this function $u$ is also the unique solution of (2.68).

We summarize the fundamental steps discussed in this section in the following diagram:
\[ (2.65) \ \xrightarrow{\ \text{weaker formulation, due to reduction of order of differentiation}\ } \ (2.67) \ \xrightarrow{\ \text{weaker formulation, due to completion of space}\ } \ (2.68). \]
A very similar approach can be applied to a large class of elliptic boundary value problems, as will be shown in the following sections.
Remark 2.5.1 For the weak formulation in (2.68) to have a unique solution it is important that the bilinear form is elliptic. The following example illustrates this. Consider (2.65) with $a(x) = \sqrt x$. Then the solution of this problem is given by $u(x) = \frac23\sqrt x\,(1-x)$ (as can be checked by substitution). Note that $u \notin V_2$. Since $u \in C^2((0,1)) \cap C([0,1])$, this is the classical solution of (2.65), cf. section 1.2. However, due to $\int_0^1 u'(x)^2\, dx = \infty$ it follows that $u \notin H^1((0,1))$ and thus the weak formulation as in (2.68) does not have a solution. □
2.5.2 Elliptic BVP with homogeneous Dirichlet boundary conditions

In this section we derive and analyze variational formulations for a class of scalar elliptic boundary value problems. We consider a linear second order differential operator of the form
\[ Lu = -\sum_{i,j=1}^n \frac{\partial}{\partial x_i}\Big( a_{ij}\frac{\partial u}{\partial x_j} \Big) + \sum_{i=1}^n b_i\frac{\partial u}{\partial x_i} + cu. \tag{2.69} \]
Note that this form differs from the one in (1.2). If the coefficients $a_{ij}$ are differentiable, then due to
\[ \sum_{i,j=1}^n \frac{\partial}{\partial x_i}\Big( a_{ij}\frac{\partial u}{\partial x_j} \Big) = \sum_{i,j=1}^n a_{ij}\frac{\partial^2 u}{\partial x_i\partial x_j} + \sum_{i,j=1}^n \frac{\partial a_{ij}}{\partial x_i}\frac{\partial u}{\partial x_j} \]
the operator $L$ can be written in the form as in (1.2) with the same $c$ as in (2.69), with $a_{ij}$ unchanged and with $b_i$ replaced by $b_i - \sum_{j=1}^n \frac{\partial a_{ij}}{\partial x_j}$, respectively.
As in section 1.2.1 the coefficients that determine the principal part of the operator are collected in a matrix
\[ A(x) = (a_{ij}(x))_{1\le i,j\le n}. \]
We assume that the problem is uniformly elliptic:
\[ \exists\,\alpha_0 > 0 : \ \xi^T A(x)\,\xi \ge \alpha_0\,\xi^T\xi \quad \text{for all } \xi \in \mathbb R^n,\ x \in \Omega. \tag{2.70} \]
We use the notation $b(x) = (b_1(x),\dots,b_n(x))^T$.
In this section we only discuss the case with homogeneous Dirichlet boundary conditions, i.e., we consider the following elliptic boundary value problem:
\[ Lu = f \quad \text{in } \Omega, \tag{2.71a} \]
\[ u = 0 \quad \text{on } \partial\Omega. \tag{2.71b} \]
We now derive a (weaker) variational formulation of this problem along the same lines as in the previous section. For this derivation we assume that the equation (2.71a) has a solution $u$ in the space $V := \{ u \in C^2(\bar\Omega) \mid u = 0 \text{ on } \partial\Omega \}$. Multiplication of (2.71a) with $v \in C_0^\infty(\Omega)$ and using partial integration implies that $u$ also satisfies
\[ \int_\Omega \nabla u^T A\nabla v + b\cdot\nabla u\, v + cuv\, dx = \int_\Omega fv\, dx. \]
Based on this, we introduce a bilinear form and a linear functional:
\[ k(u,v) = \int_\Omega \nabla u^T A\nabla v + b\cdot\nabla u\, v + cuv\, dx, \qquad f(v) = \int_\Omega fv\, dx. \tag{2.72} \]
We conclude that a solution $u \in V$ also solves the following variational problem:
\[ \text{find } u \in V \text{ such that } k(u,v) = f(v) \text{ for all } v \in C_0^\infty(\Omega). \]
Note that in the bilinear form no higher than first derivatives occur. This motivates the use of spaces obtained by completion w.r.t. the norm $\|\cdot\|_1$, which leads to the Sobolev space $H_0^1(\Omega) = \overline{C_0^\infty(\Omega)}^{\|\cdot\|_1}$. One may check that $C_0^\infty(\Omega) \subset V \subset H_0^1(\Omega)$ and thus $\overline V^{\|\cdot\|_1} = H_0^1(\Omega)$. We thus obtain the following.

The variational formulation of (2.71) is given by:
\[ \text{find } u \in H_0^1(\Omega) \text{ such that } k(u,v) = f(v) \text{ for all } v \in H_0^1(\Omega), \tag{2.73} \]
with $k(\cdot,\cdot)$ and $f(\cdot)$ as in (2.72).

It is easy to verify that if the problem (2.73) has a smooth solution $u \in C^2(\bar\Omega)$ and if the coefficients are sufficiently smooth then $u$ is also a solution of (2.71). In this sense this problem is the correct weak formulation.
Remark 2.5.2 There is a subtle reason why in the derivation of the weak formulation we used the test space $C_0^\infty(\Omega)$ and not $C^\infty(\bar\Omega)$. The reason for this choice is closely related to the type of boundary condition. In the situation here we have prescribed boundary values which are automatically fulfilled in the space $V$ and also (after completion) in $\overline V^{\|\cdot\|_1} = H_0^1(\Omega)$. Therefore, the differential equation should be tested in the form $\int_\Omega (Lu - f)v\, dx = 0$ only in the interior of $\Omega$, i.e., with functions $v$ that are zero on the boundary. Hence we take $v \in C_0^\infty(\Omega)$. In problems with other types of boundary conditions it may be necessary to take test functions $v \in C^\infty(\bar\Omega)$. This will be further explained in section 2.5.3. □

We now analyze existence and uniqueness of the variational problem (2.73) by means of the Lax–Milgram lemma 2.3.5. We use the following mild smoothness assumptions concerning the coefficients in the differential operator:
\[ a_{ij} \in L^\infty(\Omega)\ \forall\, i,j, \qquad b_i \in H^1(\Omega)\cap L^\infty(\Omega)\ \forall\, i, \qquad c \in L^\infty(\Omega). \tag{2.74} \]
Theorem 2.5.3 Let (2.70) and (2.74) hold and assume that the condition
\[ -\tfrac12\operatorname{div} b + c \ge 0 \quad \text{a.e. in } \Omega \]
is fulfilled. Then for every $f \in L^2(\Omega)$ the variational problem (2.73) with $f(v) := \int_\Omega fv\, dx$ has a unique solution $u$. Moreover, the inequality
\[ \|u\|_1 \le C\|f\|_{L^2} \]
holds with a constant $C$ independent of $f$.
Proof. We use the Lax–Milgram lemma and the fact that $\|\cdot\|_1$ and $|\cdot|_1$, defined by $|u|_1^2 = \sum_{|\alpha|=1}\|D^\alpha u\|_{L^2}^2$, are equivalent norms on $H_0^1(\Omega)$.
From
\[ |f(v)| = \Big| \int_\Omega fv\, dx \Big| \le \|f\|_{L^2}\|v\|_{L^2} \le \|f\|_{L^2}\|v\|_1 \]
it follows that $f(\cdot)$ defines a bounded linear functional on $H_0^1(\Omega)$. We now check the boundedness of the bilinear form $k(\cdot,\cdot)$ for $u, v \in H_0^1(\Omega)$:
\[ |k(u,v)| \le \sum_{i,j=1}^n \Big| \int_\Omega a_{ij}\frac{\partial u}{\partial x_j}\frac{\partial v}{\partial x_i}\, dx \Big| + \sum_{i=1}^n \Big| \int_\Omega b_i\frac{\partial u}{\partial x_i} v\, dx \Big| + \Big| \int_\Omega cuv\, dx \Big| \]
\[ \le \sum_{i,j=1}^n \|a_{ij}\|_{L^\infty}\Big\|\frac{\partial u}{\partial x_j}\Big\|_{L^2}\Big\|\frac{\partial v}{\partial x_i}\Big\|_{L^2} + \sum_{i=1}^n \|b_i\|_{L^\infty}\Big\|\frac{\partial u}{\partial x_i}\Big\|_{L^2}\|v\|_{L^2} + \|c\|_{L^\infty}\|u\|_{L^2}\|v\|_{L^2} \]
\[ \le \sum_{i,j=1}^n \|a_{ij}\|_{L^\infty}\|u\|_1\|v\|_1 + \sum_{i=1}^n \|b_i\|_{L^\infty}\|u\|_1\|v\|_{L^2} + \|c\|_{L^\infty}\|u\|_{L^2}\|v\|_{L^2} \le C\|u\|_1\|v\|_1. \]
Note that $C_0^\infty(\Omega)$ is dense in $H_0^1(\Omega)$, the bilinear form is continuous, and $\|\cdot\|_1$ and $|\cdot|_1$ are equivalent norms. Hence, for the ellipticity, $k(u,u) \ge \gamma\|u\|_1^2$ (with $\gamma > 0$), to hold it suffices to show $k(u,u) \ge \gamma|u|_1^2$ for all $u \in C_0^\infty(\Omega)$.
Take $u \in C_0^\infty(\Omega)$. From the uniform ellipticity assumption (2.70) it follows (with $\xi = \nabla u$) that
\[ \sum_{i,j=1}^n \int_\Omega a_{ij}\frac{\partial u}{\partial x_j}\frac{\partial u}{\partial x_i}\, dx \ge \alpha_0 \sum_{j=1}^n \int_\Omega \Big( \frac{\partial u}{\partial x_j} \Big)^2 dx = \alpha_0|u|_1^2, \]
with $\alpha_0 > 0$. Using partial integration we obtain
\[ \sum_{i=1}^n \int_\Omega b_i\frac{\partial u}{\partial x_i} u\, dx = \frac12 \sum_{i=1}^n \int_\Omega b_i\frac{\partial(u^2)}{\partial x_i}\, dx = -\frac12 \int_\Omega \operatorname{div} b\; u^2\, dx. \]
Collecting these results we get
\[ k(u,u) \ge \alpha_0|u|_1^2 + \int_\Omega \Big( -\frac12\operatorname{div} b + c \Big) u^2\, dx. \]
The desired result follows from the assumption $-\frac12\operatorname{div} b + c \ge 0$ (a.e.).
We now formulate two important special cases.

Corollary 2.5.4 For every $f \in L^2(\Omega)$ the Poisson equation (in variational formulation)
\[ \text{find } u \in H_0^1(\Omega) \text{ such that } \int_\Omega \nabla u\cdot\nabla v\, dx = \int_\Omega fv\, dx \text{ for all } v \in H_0^1(\Omega) \]
has a unique solution. Moreover, $\|u\|_1 \le C\|f\|_{L^2}$ holds with a constant $C$ independent of $f$.

Corollary 2.5.5 For $f \in L^2(\Omega)$, $\varepsilon > 0$ and $b_i \in H^1(\Omega)\cap L^\infty(\Omega)$, $i = 1,\dots,n$, consider the convection-diffusion problem (in variational formulation)
\[ \text{find } u \in H_0^1(\Omega) \text{ such that } \varepsilon\int_\Omega \nabla u\cdot\nabla v\, dx + \int_\Omega b\cdot\nabla u\, v\, dx = \int_\Omega fv\, dx \text{ for all } v \in H_0^1(\Omega). \]
If $\operatorname{div} b \le 0$ holds (a.e.), then this problem has a unique solution, and $\|u\|_1 \le C\|f\|_{L^2}$ holds with a constant $C$ independent of $f$.

We note that the condition $\operatorname{div} b \le 0$ holds, for example, if all $b_i$, $i = 1,\dots,n$, are constants. In the singular perturbation case it may happen that the stability constant deteriorates: $C = C(\varepsilon) \to \infty$ if $\varepsilon \downarrow 0$.
2.5.3 Other boundary conditions

In this section we consider variational formulations of elliptic problems with boundary conditions that are not of homogeneous Dirichlet type. For simplicity we only discuss the case $L = -\Delta$. The corresponding results for general second order differential operators ($L$ as in (2.69)) are very similar.

Inhomogeneous Dirichlet boundary conditions. First we treat the Poisson equation with Dirichlet boundary data that are not identically zero:
\[ -\Delta u = f \quad \text{in } \Omega, \]
\[ u = \varphi \quad \text{on } \partial\Omega. \]
Assume that this problem has a solution in the space $V = \{ u \in C^2(\bar\Omega) \mid u = \varphi \text{ on } \partial\Omega \}$. After completion (w.r.t. $\|\cdot\|_1$) this will yield the space $\{ u \in H^1(\Omega) \mid \gamma(u) = \varphi \}$ where $\gamma$ is the trace operator. As in the previous section the boundary conditions are automatically fulfilled in this space and thus we take test functions $v \in C_0^\infty(\Omega)$ (cf. remark 2.5.2). Multiplication of the differential equation with such a function $v$, partial integration and using completion with respect to $\|\cdot\|_1$ results in the following variational problem:
\[ \text{find } u \in \{ u \in H^1(\Omega) \mid u_{|\partial\Omega} = \varphi \} \text{ such that } \int_\Omega \nabla u\cdot\nabla v\, dx = \int_\Omega fv\, dx \text{ for all } v \in H_0^1(\Omega). \tag{2.75} \]
If $u$ and $\tilde u$ are solutions of (2.75) then $u - \tilde u \in H_0^1(\Omega)$ and $\int_\Omega \nabla(u-\tilde u)\cdot\nabla v\, dx = 0$ for all $v \in H_0^1(\Omega)$. Taking $v = u - \tilde u$ it follows that $u = \tilde u$ and thus we have at most one solution. To prove existence we introduce a transformed problem. For the identity $u_{|\partial\Omega} = \varphi$ ($\iff \gamma(u) = \varphi$) to make sense, we assume that the boundary data $\varphi$ are such that $\varphi \in \operatorname{range}(\gamma)$. Then there exists $u_0 \in H^1(\Omega)$ such that $u_{0|\partial\Omega} = \varphi$.
Lemma 2.5.6 Assume $f \in L^2(\Omega)$ and $\varphi \in \operatorname{range}(\gamma)$. Take $u_0 \in H^1(\Omega)$ such that $\gamma(u_0) = \varphi$. Then $u$ solves the variational problem (2.75) iff $w = u - u_0$ solves the following:
\[ \text{find } w \in H_0^1(\Omega) \text{ such that } \int_\Omega \nabla w\cdot\nabla v\, dx = \int_\Omega fv\, dx - \int_\Omega \nabla u_0\cdot\nabla v\, dx \text{ for all } v \in H_0^1(\Omega). \tag{2.76} \]

Proof. Trivial.

Note that $\tilde f(v) = \int_\Omega fv\, dx - \int_\Omega \nabla u_0\cdot\nabla v\, dx$ defines a continuous linear functional on $H_0^1(\Omega)$. The Lax–Milgram lemma yields existence of a solution of (2.76) and thus of (2.75).
Natural boundary conditions. We now consider a problem in which also (normal) derivatives of $u$ occur in the boundary condition:
\[ -\Delta u = f \quad \text{in } \Omega, \tag{2.77a} \]
\[ \frac{\partial u}{\partial n} + \sigma u = \varphi \quad \text{on } \partial\Omega, \tag{2.77b} \]
with $\sigma \in \mathbb R$ a constant and $\frac{\partial u}{\partial n} = \nabla u\cdot n$ the normal derivative at the boundary. For this problem the following difficulty arises, related to the (normal) derivative in the boundary condition. For $u \in H^1(\Omega)$ the weak derivative $D^\alpha u$, $|\alpha| = 1$, is an element of $L^2(\Omega)$. It can be shown that it is not possible to define unambiguously $v_{|\partial\Omega}$ for $v \in L^2(\Omega)$. In other words, there is no trace operator which in a satisfactory way defines $\frac{\partial u}{\partial n}\big|_{\partial\Omega}$ for $u \in H^1(\Omega)$. This is the reason why for the solution $u$ we search in the space $H^1(\Omega)$, which does not take the boundary conditions into account. Due to this, for the derivation of an appropriate weak formulation, we multiply (2.77a) with test functions $v \in C^\infty(\bar\Omega)$ (and not $C_0^\infty(\Omega)$, cf. remark 2.5.2). This results in
\[ \int_\Omega \nabla u\cdot\nabla v\, dx - \int_{\partial\Omega} \nabla u\cdot n\, v\, ds = \int_\Omega fv\, dx \quad \text{for all } v \in C^\infty(\bar\Omega). \]
We now use the boundary condition (2.77b) and then obtain
\[ \int_\Omega \nabla u\cdot\nabla v\, dx + \sigma\int_{\partial\Omega} uv\, ds = \int_\Omega fv\, dx + \int_{\partial\Omega} \varphi v\, ds \quad \text{for all } v \in C^\infty(\bar\Omega). \]
This results in the following variational problem:
\[ \text{find } u \in H^1(\Omega) \text{ such that } \int_\Omega \nabla u\cdot\nabla v\, dx + \sigma\int_{\partial\Omega} uv\, ds = \int_\Omega fv\, dx + \int_{\partial\Omega} \varphi v\, ds \quad \forall\, v \in H^1(\Omega). \tag{2.78} \]
It is easy to verify that if the problem (2.78) has a smooth solution $u \in C^2(\bar\Omega)$ and if $\partial\Omega$ is sufficiently smooth then $u$ is also a solution of (2.77). In this sense this problem is the correct weak formulation.
Note that now the space $H^1(\Omega)$ is used and not $H_0^1(\Omega)$. The space $H^1(\Omega)$ does not contain any information concerning the boundary condition (2.77b). The boundary data are part of the bilinear form used in (2.78). In the case of Dirichlet boundary conditions, as in (2.73) and (2.75), this is the other way around: the solution space is such that the boundary conditions are automatically fulfilled and the boundary data do not influence the bilinear form. The latter class of boundary conditions are called essential boundary conditions (these are a priori fulfilled by the choice of the solution space). Boundary conditions as in (2.77b) are called natural boundary conditions (these are automatically fulfilled if the variational problem is solved).
We now analyze existence and uniqueness of the variational formulation. For this we need two variants of the Poincaré–Friedrichs inequality:

Lemma 2.5.7 There exist constants $C_1$ and $C_2$ such that
\[ \|u\|_1^2 \le C_1\Big( |u|_1^2 + \int_{\partial\Omega} u^2\, ds \Big) \quad \text{for all } u \in H^1(\Omega), \tag{2.79a} \]
\[ \|u\|_1^2 \le C_2\Big( |u|_1^2 + \Big| \int_\Omega u\, dx \Big|^2 \Big) \quad \text{for all } u \in H^1(\Omega). \tag{2.79b} \]
[Note that for $u \in H_0^1(\Omega)$ the first result reduces to the Poincaré–Friedrichs inequality.]
Proof. For $i = 1, 2$ we define $q_i : H^1(\Omega)\to\mathbb R$ by
\[ q_1(u) = \int_{\partial\Omega} u^2\, ds, \qquad q_2(u) = \Big| \int_\Omega u\, dx \Big|^2. \]
Then $q_i$ is continuous on $H^1(\Omega)$ (for $i = 1$ this follows from the continuity of the trace operator), $q_i(\lambda u) = \lambda^2 q_i(u)$ for all $\lambda \in \mathbb R$, and if $u$ is equal to a constant, say $c$, then $q_i(u) = q_i(c) = 0$ iff $c = 0$.
Assume that there does not exist a constant $C$ such that $\|u\|_1^2 \le C\big( |u|_1^2 + q_i(u) \big)$ for all $u \in H^1(\Omega)$. Then there exists a sequence $(u_k)_{k\ge1}$ in $H^1(\Omega)$ such that
\[ 1 = \|u_k\|_1^2 \ge k\big( |u_k|_1^2 + q_i(u_k) \big) \quad \text{for all } k. \tag{2.80} \]
This sequence is bounded in $H^1(\Omega)$. Since the embedding $H^1(\Omega) \hookrightarrow L^2(\Omega)$ is compact, there exists a subsequence $(u_{k(\ell)})_{\ell\ge1}$ that converges in $L^2(\Omega)$:
\[ \lim_{\ell\to\infty} u_{k(\ell)} = u \quad \text{in } L^2(\Omega). \]
From (2.80) it follows that $\lim_{\ell\to\infty} |u_{k(\ell)}|_1 = 0$ and thus
\[ \lim_{\ell\to\infty} D^\alpha u_{k(\ell)} = 0, \quad \text{if } |\alpha| = 1, \text{ in } L^2(\Omega). \]
From this we conclude that
\[ u \in H^1(\Omega), \qquad \lim_{\ell\to\infty} u_{k(\ell)} = u \text{ in } H^1(\Omega), \qquad D^\alpha u = 0 \text{ (a.e.) if } |\alpha| = 1. \]
Hence, $u$ must be constant (a.e.) on $\Omega$, say $u = c$. From (2.80) we obtain
\[ \lim_{\ell\to\infty} q_i(u_{k(\ell)}) = 0. \]
Using the continuity of $q_i$ it follows that $q_i(c) = q_i(u) = 0$ and thus $c = 0$. This yields a contradiction:
\[ 1 = \lim_{\ell\to\infty} \|u_{k(\ell)}\|_1^2 = \|u\|_1^2 = \|c\|_1^2 = 0. \]
Thus the results are proved.
Using this lemma we can prove the following result for the variational problem in (2.78):

Theorem 2.5.8 Consider the variational problem (2.78) with $\sigma > 0$, $f \in L^2(\Omega)$ and $\varphi \in L^2(\partial\Omega)$. This problem has a unique solution $u$ and the inequality
\[ \|u\|_1 \le C\big( \|f\|_{L^2} + \|\varphi\|_{L^2(\partial\Omega)} \big) \]
holds with a constant $C$ independent of $f$ and $\varphi$.

Proof. For $v \in H^1(\Omega)$ define the linear functional
\[ g : v \mapsto \int_\Omega fv\, dx + \int_{\partial\Omega} \varphi v\, ds. \]
Using the continuity of the trace operator we obtain
\[ |g(v)| \le \|f\|_{L^2}\|v\|_{L^2} + \|\varphi\|_{L^2(\partial\Omega)}\|\gamma(v)\|_{L^2(\partial\Omega)} \le c\big( \|f\|_{L^2} + \|\varphi\|_{L^2(\partial\Omega)} \big)\|v\|_1, \]
and thus $g \in H^1(\Omega)'$. Define $k(u,v) = \int_\Omega \nabla u\cdot\nabla v\, dx + \sigma\int_{\partial\Omega} uv\, ds$. The continuity of this bilinear form follows from
\[ |k(u,v)| \le |u|_1|v|_1 + \sigma\|\gamma(u)\|_{L^2(\partial\Omega)}\|\gamma(v)\|_{L^2(\partial\Omega)} \le |u|_1|v|_1 + \sigma C\|u\|_1\|v\|_1 \le C\|u\|_1\|v\|_1. \]
The ellipticity can be concluded from the result in (2.79a):
\[ k(u,u) = |u|_1^2 + \sigma\int_{\partial\Omega} u^2\, ds \ge C\|u\|_1^2 \quad \text{for all } u \in H^1(\Omega), \]
with $C > 0$. Application of the Lax–Milgram lemma completes the proof.
We now analyze the problem with pure Neumann boundary conditions, i.e., $\sigma = 0$ in (2.77b). Clearly, for this problem we cannot have uniqueness: if $u$ is a solution then for any constant $c$ the function $u + c$ is also a solution. Moreover, for existence of a solution the data $f$ and $\varphi$ must satisfy a certain condition. Assume that $u \in H^2(\Omega)$ is a solution of (2.77) for the case $\sigma = 0$; then, using (2.32b) with $v = 1$,
\[ \int_\Omega f\, dx = -\int_\Omega \Delta u\, dx = \int_\Omega \nabla u\cdot\nabla 1\, dx - \int_{\partial\Omega} \frac{\partial u}{\partial n}\, ds = -\int_{\partial\Omega} \varphi\, ds \]
must hold. This motivates the introduction of the compatibility relation:
\[ \int_\Omega f\, dx + \int_{\partial\Omega} \varphi\, ds = 0. \tag{2.81} \]
To obtain uniqueness, for the solution space we take a subspace of $H^1(\Omega)$ consisting of functions $u$ with $\langle u,1\rangle_{L^2} = \langle u,1\rangle_1 = 0$:
\[ H_*^1(\Omega) := \Big\{ u \in H^1(\Omega) \ \Big|\ \int_\Omega u\, dx = 0 \Big\}. \]
Since this is a closed subspace of $H^1(\Omega)$ it is a Hilbert space. Instead of (2.78) we now consider:
\[ \text{find } u \in H_*^1(\Omega) \text{ such that } \int_\Omega \nabla u\cdot\nabla v\, dx = \int_\Omega fv\, dx + \int_{\partial\Omega} \varphi v\, ds \text{ for all } v \in H_*^1(\Omega). \tag{2.82} \]
For this problem we have existence and uniqueness:

Theorem 2.5.9 Consider the variational problem (2.82) with $f \in L^2(\Omega)$, $\varphi \in L^2(\partial\Omega)$ and assume that the compatibility relation (2.81) holds. Then this problem has a unique solution $u$ and the inequality
\[ \|u\|_1 \le C\big( \|f\|_{L^2} + \|\varphi\|_{L^2(\partial\Omega)} \big) \]
holds with a constant $C$ independent of $f$ and $\varphi$.

Proof. For $v \in H_*^1(\Omega)$ define the linear functional
\[ g : v \mapsto \int_\Omega fv\, dx + \int_{\partial\Omega} \varphi v\, ds. \]
The continuity of this functional is shown in the proof of theorem 2.5.8. Define $k(u,v) = \int_\Omega \nabla u\cdot\nabla v\, dx$. The continuity of this bilinear form is trivial. For $u \in H_*^1(\Omega)$ we have $\int_\Omega u\, dx = 0$ and thus, using the result in (2.79b), we get
\[ k(u,u) = |u|_1^2 = |u|_1^2 + \Big| \int_\Omega u\, dx \Big|^2 \ge C\|u\|_1^2 \quad \text{for all } u \in H_*^1(\Omega), \]
with a constant $C > 0$. Hence, the bilinear form is $H_*^1(\Omega)$-elliptic. From the Lax–Milgram lemma it follows that there exists a unique solution $u \in H_*^1(\Omega)$ such that $k(u,v) = g(v)$ for all $v \in H_*^1(\Omega)$. Note that $k(u,1) = 0$ and, due to the compatibility relation, $g(1) = 0$. It follows that for the solution $u$ we have $k(u,v) = g(v)$ for all $v \in H^1(\Omega)$.
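Both the compatibility relation and the zero-mean normalization have direct discrete counterparts. As an illustration (an added sketch, assuming NumPy is available), consider $-u'' = f$ on $(0,1)$ with $u'(0) = u'(1) = 0$: the discrete Neumann matrix is singular with the constants in its kernel, the system is solvable because $f$ has zero mean, and subtracting the mean of the solution fixes the free constant:

```python
import numpy as np

# 1D Neumann problem -u'' = f, u'(0) = u'(1) = 0, on cell midpoints.
# Solvable since f has zero mean; unique after enforcing zero mean of u.
N = 100
h = 1.0 / N
x = (np.arange(N) + 0.5) * h
f = np.cos(2 * np.pi * x)                 # compatibility: int f dx = 0

A = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h**2
A[0, 0] -= 1.0 / h**2                     # zero-flux boundary rows:
A[-1, -1] -= 1.0 / h**2                   # singular matrix, kernel = constants

u = np.linalg.lstsq(A, f, rcond=None)[0]  # minimum-norm solution
u -= u.mean()                             # zero-mean normalization
exact = np.cos(2 * np.pi * x) / (2 * np.pi)**2
print(np.abs(u - exact).max())            # small, decreases as N grows
```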
Remark 2.5.10 For the case $\sigma < 0$ it may happen that the problem (2.77) has a nontrivial kernel (and thus we do not have uniqueness). Moreover, in general this kernel is not as simple as for the case $\sigma = 0$. As an example, consider the problem
\[ -u''(x) = 0 \quad \text{for } x \in (0,1), \]
\[ -u'(0) - 2u(0) = 0, \]
\[ u'(1) - 2u(1) = 0. \]
All functions in $\operatorname{span}(u_*)$ with $u_*(x) = 2x - 1$ are solutions of this problem. □
Mixed boundary conditions. It may happen that in a boundary value problem both natural and essential boundary conditions occur. We discuss a typical example. Let $\Gamma_e$ and $\Gamma_n$ be parts of the boundary $\partial\Omega$ such that $\operatorname{meas}_{n-1}(\Gamma_e) > 0$, $\operatorname{meas}_{n-1}(\Gamma_n) > 0$, $\bar\Gamma_e \cup \bar\Gamma_n = \partial\Omega$, $\Gamma_e \cap \Gamma_n = \emptyset$. Now consider the following boundary value problem:
\[ -\Delta u = f \quad \text{in } \Omega, \tag{2.83a} \]
\[ u = 0 \quad \text{on } \Gamma_e, \tag{2.83b} \]
\[ \frac{\partial u}{\partial n} = \varphi \quad \text{on } \Gamma_n. \tag{2.83c} \]
The Dirichlet (= essential) boundary conditions are fulfilled by the choice of the solution space:
\[ H_{\Gamma_e}^1(\Omega) := \{ u \in H^1(\Omega) \mid \gamma(u) = 0 \text{ on } \Gamma_e \}. \]
The Neumann (= natural) boundary conditions will be part of the linear functional used in the variational problem. A similar derivation as for the previous examples results in the following variational problem:
\[ \text{find } u \in H_{\Gamma_e}^1(\Omega) \text{ such that } \int_\Omega \nabla u\cdot\nabla v\, dx = \int_\Omega fv\, dx + \int_{\Gamma_n} \varphi v\, ds \text{ for all } v \in H_{\Gamma_e}^1(\Omega). \tag{2.84} \]
One easily verifies that if this problem has a smooth solution $u \in C^2(\bar\Omega)$ then $u$ is a solution of the problem (2.83).

Remark 2.5.11 For the proof of existence and uniqueness we need the following Poincaré–Friedrichs inequality in the space $H_{\Gamma_e}^1(\Omega)$:
\[ \exists\, C > 0 : \ \|u\|_{L^2} \le C|u|_1 \quad \text{for all } u \in H_{\Gamma_e}^1(\Omega). \tag{2.85} \]
For a proof of this result we refer to the literature, e.g. [3], Remark 5.16. □
Theorem 2.5.12 The variational problem (2.84) with $f \in L^2(\Omega)$ and $\varphi \in L^2(\Gamma_n)$ has a unique solution $u$ and the inequality
\[ \|u\|_1 \le C\big( \|f\|_{L^2} + \|\varphi\|_{L^2(\Gamma_n)} \big) \]
holds with a constant $C$ independent of $f$ and $\varphi$.

Proof. For $v \in H_{\Gamma_e}^1(\Omega)$ define the linear functional
\[ g : v \mapsto \int_\Omega fv\, dx + \int_{\Gamma_n} \varphi v\, ds. \]
The continuity of this linear functional can be shown as in the proof of theorem 2.5.8. Define the bilinear form $k(u,v) = \int_\Omega \nabla u\cdot\nabla v\, dx$. The continuity of $k(\cdot,\cdot)$ is trivial. From (2.85) it follows that $k(u,u) \ge C\|u\|_1^2$ for all $u \in H_{\Gamma_e}^1(\Omega)$, with $C > 0$. Hence the bilinear form is $H_{\Gamma_e}^1(\Omega)$-elliptic. Application of the Lax–Milgram lemma completes the proof.
2.5.4 Regularity results

In this section we present a few results from the literature on global smoothness of the solution. First the notion of $H^m$-regularity is introduced. For ease of presentation we restrict ourselves to elliptic boundary value problems with homogeneous Dirichlet boundary conditions.

Definition 2.5.13 ($H^m$-regularity) Let $k : H_0^1(\Omega)\times H_0^1(\Omega)\to\mathbb R$ be a continuous elliptic bilinear form and $f \in H^{-1}(\Omega) = H_0^1(\Omega)'$. For the unique solution $u \in H_0^1(\Omega)$ of the problem
\[ k(u,v) = f(v) \quad \text{for all } v \in H_0^1(\Omega) \]
the inequality
\[ \|u\|_1 \le C\|f\|_{-1} \]
holds with a constant $C$ independent of $f$. This property is called $H^1$-regularity of the variational problem. If for some $m > 1$ and $f \in H^{m-2}(\Omega)$ the unique solution $u$ of
\[ k(u,v) = \int_\Omega fv\, dx \quad \text{for all } v \in H_0^1(\Omega) \]
satisfies
\[ \|u\|_m \le C\|f\|_{m-2} \]
with a constant $C$ independent of $f$, then the variational problem is said to be $H^m$-regular. □

The result in the next theorem is an analogon of the result in theorem 1.2.7, but now the smoothness is measured using Sobolev norms instead of Hölder norms.

Theorem 2.5.14 ([39], Theorem 8.13) Assume that $u \in H_0^1(\Omega)$ is a solution of (2.73) (for existence of $u$: see theorem 2.5.3). For some $m \in \mathbb N$ assume that $\partial\Omega \in C^{m+2}$ and:

for $m = 0$: $f \in L^2(\Omega)$, $a_{ij} \in C^{0,1}(\bar\Omega)\ \forall\, i,j$, $b_i \in L^\infty(\Omega)\ \forall\, i$, $c \in L^\infty(\Omega)$;

for $m \ge 1$: $f \in H^m(\Omega)$, $a_{ij} \in C^{m,1}(\bar\Omega)\ \forall\, i,j$, $b_i \in C^{m-1,1}(\bar\Omega)\ \forall\, i$, $c \in C^{m-1,1}(\bar\Omega)$.

Then $u \in H^{m+2}(\Omega)$ holds and
\[ \|u\|_{m+2} \le C\big( \|u\|_{L^2} + \|f\|_m \big) \tag{2.86} \]
with a constant $C$ independent of $f$ and $u$.

Corollary 2.5.15 Assume that the assumptions of theorem 2.5.3 and of theorem 2.5.14 are fulfilled. Then the variational problem (2.73) is $H^{m+2}$-regular.

Proof. Due to theorem 2.5.3 the problem has a unique solution $u$ and $\|u\|_1 \le C\|f\|_{L^2}$ holds. Now combine this with the result in (2.86):
\[ \|u\|_{m+2} \le C\big( \|u\|_{L^2} + \|f\|_m \big) \le C_1\big( \|f\|_{L^2} + \|f\|_m \big) \le 2C_1\|f\|_m \]
with a constant $C_1$ independent of $f$.
Note that in these regularity results there is a severe condition on the smoothness of the boundary. For example, for $m = 0$, i.e., $H^2$-regularity, we have the condition $\partial\Omega \in C^2$. In practice, this assumption often does not hold. For convex domains one can prove $H^2$-regularity without assuming such a strong smoothness condition on $\partial\Omega$. The following result is due to [53]:

Theorem 2.5.16 Let $\Omega$ be convex. Suppose that the assumptions of theorem 2.5.3 hold and in addition $a_{ij} \in C^{0,1}(\bar\Omega)$ for all $i, j$. Then the unique solution of (2.73) satisfies
\[ \|u\|_2 \le C\|f\|_{L^2} \]
with a constant $C$ independent of $f$, i.e., the variational problem (2.73) is $H^2$-regular.

We note that very similar regularity results hold for elliptic problems with natural boundary conditions (as in (2.77b)). In problems with mixed boundary conditions, however, one in general has less regularity.
2.5.5 Riesz–Schauder theory

In this section we show that for the variational formulation of the convection-diffusion problem results on existence and uniqueness can be derived that avoid the condition $-\frac12\operatorname{div} b + c \ge 0$ as used in theorem 2.5.3. The analysis is based on the so-called Riesz–Schauder theory and uses results on compact embeddings of Sobolev spaces ([47], Thm. 7.2.14).
... in preparation ...
2.6 Weak formulation of the Stokes problem

We recall the classical formulation of the Stokes problem with homogeneous Dirichlet boundary conditions:
\[ -\Delta u + \nabla p = f \quad \text{in } \Omega, \tag{2.87a} \]
\[ \operatorname{div} u = 0 \quad \text{in } \Omega, \tag{2.87b} \]
\[ u = 0 \quad \text{on } \partial\Omega. \tag{2.87c} \]
In this section we derive a variational formulation of this problem and prove existence and uniqueness of a weak solution.

From the formulation of the Stokes problem it is clear that the pressure $p$ is determined only up to a constant. In order to eliminate this degree of freedom we introduce the additional requirement
\[ \langle p, 1\rangle_{L^2} = \int_\Omega p\, dx = 0. \]
Assume that the Stokes problem has a solution $u \in V := \{ u \in C^2(\bar\Omega)^n \mid u = 0 \text{ on } \partial\Omega \}$, $p \in M := \{ p \in C^1(\bar\Omega) \mid \int_\Omega p\, dx = 0 \}$. Then $(u,p)$ also solves the following variational problem: find $(u,p) \in V\times M$ such that
\[ \int_\Omega \nabla u\cdot\nabla v\, dx - \int_\Omega p\operatorname{div} v\, dx = \int_\Omega f\cdot v\, dx \quad \forall\, v \in C_0^\infty(\Omega)^n, \tag{2.88} \]
\[ \int_\Omega q\operatorname{div} u\, dx = 0 \quad \forall\, q \in M, \]
with $\int_\Omega \nabla u\cdot\nabla v\, dx = \langle\nabla u,\nabla v\rangle_{L^2} := \sum_{i=1}^n \langle\nabla u_i,\nabla v_i\rangle_{L^2}$. We introduce the bilinear forms and the linear functional
\[ a(u,v) := \int_\Omega \nabla u\cdot\nabla v\, dx, \tag{2.89a} \]
\[ b(v,q) := -\int_\Omega q\operatorname{div} v\, dx, \tag{2.89b} \]
\[ f(v) := \int_\Omega f\cdot v\, dx. \tag{2.89c} \]
Note that no derivatives of the pressure occur in (2.88). To obtain a weak formulation in appropriate Hilbert spaces we apply the completion principle. For the velocity we use completion w.r.t. $\|\cdot\|_1$ and for the pressure we use the norm $\|\cdot\|_{L^2}$:
\[ \overline{C_0^\infty(\Omega)^n}^{\|\cdot\|_1} = H_0^1(\Omega)^n, \qquad \overline M^{\|\cdot\|_{L^2}} = L_0^2(\Omega) := \Big\{ p \in L^2(\Omega) \ \Big|\ \int_\Omega p\, dx = 0 \Big\}. \]
This results in the following weak formulation of the Stokes problem, with $V := H_0^1(\Omega)^n$, $M := L_0^2(\Omega)$:

Find $(u,p) \in V\times M$ such that
\[ a(u,v) + b(v,p) = f(v) \quad \text{for all } v \in V, \tag{2.90a} \]
\[ b(u,q) = 0 \quad \text{for all } q \in M. \tag{2.90b} \]

Lemma 2.6.1 Suppose that $u \in C^2(\bar\Omega)^n$ and $p \in C^1(\bar\Omega)$ satisfy (2.90). Then $(u,p)$ is a solution of (2.87).
Proof. Using partial integration it follows that
\[ \int_\Omega \big( -\Delta u + \nabla p - f \big)\cdot v\, dx = 0 \quad \text{for all } v \in C_0^\infty(\Omega)^n, \]
and thus $-\Delta u + \nabla p = f$ in $\Omega$. Note that by Green's formula we have $\int_\Omega \operatorname{div} u\, dx = 0$ for $u \in H_0^1(\Omega)^n$. Hence in (2.90b) we can take $q = \operatorname{div} u$, which yields $\int_\Omega (\operatorname{div} u)^2\, dx = 0$ and thus $\operatorname{div} u = 0$ in $\Omega$.

To show the well-posedness of the variational formulation of the Stokes problem we apply corollary 2.3.12. For this we need the following inf-sup condition, which will be proved in section 2.6.1:
\[ \exists\,\beta > 0 : \ \sup_{v\in H_0^1(\Omega)^n} \frac{\int_\Omega q\operatorname{div} v\, dx}{\|v\|_1} \ge \beta\|q\|_{L^2} \quad \forall\, q \in L_0^2(\Omega). \tag{2.91} \]
Using this property we obtain a fundamental result on well-posedness of the variational Stokes problem:
Theorem 2.6.2 For every $f \in L^2(\Omega)^n$ the Stokes problem (2.90) has a unique solution $(u,p) \in V\times M$. Moreover, the inequality
\[ \|u\|_1 + \|p\|_{L^2} \le C\|f\|_{L^2} \]
holds with a constant $C$ independent of $f$.

Proof. We can apply corollary 2.3.12 with $V, M, a(\cdot,\cdot), b(\cdot,\cdot)$ as defined above and $f_1(v) = \int_\Omega f\cdot v\, dx$, $f_2 = 0$. The continuity of $b(\cdot,\cdot)$ on $V\times M$ follows from
\[ |b(v,q)| = \Big| \int_\Omega q\operatorname{div} v\, dx \Big| \le \|q\|_{L^2}\|\operatorname{div} v\|_{L^2} \le \sqrt n\,\|q\|_{L^2}\|v\|_1 \quad \text{for } (v,q) \in V\times M. \]
The inf-sup condition is given in (2.91). Note that the minus sign in (2.89b) does not play a role for the inf-sup condition in (2.91). The continuity of $a(\cdot,\cdot)$ on $V\times V$ is clear. The $V$-ellipticity follows from the Poincaré–Friedrichs inequality:
\[ a(u,u) = \sum_{i=1}^n |u_i|_1^2 \ge c\sum_{i=1}^n \|u_i\|_1^2 = c\|u\|_1^2 \quad \text{for all } u \in V. \]
Application of corollary 2.3.12 yields the desired result.
Remark 2.6.3 Note that the bilinear form $a(\cdot,\cdot)$ is symmetric and thus we can apply theorem 2.4.2. This shows that the variational formulation of the Stokes problem is equivalent to a saddle-point problem. □
2.6.1 Proof of the inf-sup property

In this section we derive the fundamental inf-sup property in (2.91). First we note that a function $f \in L^2(\Omega)$ induces a bounded linear functional on $H_0^1(\Omega)$ that is given by:
\[ u \mapsto \int_\Omega f(x)u(x)\, dx, \quad u \in H_0^1(\Omega), \qquad \|f\|_{-1} = \sup_{u\in H_0^1(\Omega)} \frac{|\langle f,u\rangle_{L^2}|}{\|u\|_1}. \]
We now define the first (partial) derivative of an $L^2$-function (in the sense of distributions). For $f \in L^2(\Omega)$ the mapping
\[ F : u \mapsto -\int_\Omega f(x)D^\alpha u(x)\, dx, \quad |\alpha| = 1,\ u \in H_0^1(\Omega), \]
defines a bounded linear functional on $H_0^1(\Omega)$. This functional is denoted by $F =: D^\alpha f \in H^{-1}(\Omega) = H_0^1(\Omega)'$. Its norm is defined by
\[ \|D^\alpha f\|_{-1} := \sup_{u\in H_0^1(\Omega)} \frac{|(D^\alpha f)(u)|}{\|u\|_1} = \sup_{u\in H_0^1(\Omega)} \frac{|\langle f, D^\alpha u\rangle_{L^2}|}{\|u\|_1}. \]
Based on these partial derivatives we define
\[ \nabla f = \Big( \frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n} \Big), \qquad \|\nabla f\|_{-1} = \Big( \sum_{i=1}^n \Big\| \frac{\partial f}{\partial x_i} \Big\|_{-1}^2 \Big)^{1/2} = \Big( \sum_{|\alpha|=1} \|D^\alpha f\|_{-1}^2 \Big)^{1/2}. \]
In the next theorem we present a rather deep result from analysis. Its proof (for the case $\partial\Omega \in C^{0,1}$) is long and technical.

Theorem 2.6.4 There exists a constant $C$ such that for all $p \in L^2(\Omega)$:
\[ \|p\|_{L^2} \le C\big( \|p\|_{-1} + \|\nabla p\|_{-1} \big). \]

Proof. We refer to [65], lemma 7.1, or [31], remark III.3.1.

Remark 2.6.5 From the definitions of $\|p\|_{-1}$, $\|\nabla p\|_{-1}$ it immediately follows that $\|p\|_{-1} \le \|p\|_{L^2}$ and $\|\nabla p\|_{-1} \le \sqrt n\,\|p\|_{L^2}$ for all $p \in L^2(\Omega)$. Hence, using theorem 2.6.4 it follows that $\|\cdot\|_{L^2}$ and $\|\cdot\|_{-1} + \|\nabla\cdot\|_{-1}$ are equivalent norms on $L^2(\Omega)$. This can be seen as a (nontrivial) extension of the (trivial) result that on $H^m(\Omega)$, $m \ge 1$, the norms $\|\cdot\|_m$ and $\|\cdot\|_{m-1} + \|\nabla\cdot\|_{m-1}$ are equivalent.
From the result in theorem 2.6.4 we obtain the following:

Lemma 2.6.6 There exists a constant $C$ such that
\[ \|p\|_{L^2} \le C\|\nabla p\|_{-1} \quad \text{for all } p \in L_0^2(\Omega). \]

Proof. Suppose that this result does not hold. Then there exists a sequence $(p_k)_{k\ge1}$ in $L_0^2(\Omega)$ such that
\[ 1 = \|p_k\|_{L^2} \ge k\|\nabla p_k\|_{-1} \quad \text{for all } k. \tag{2.92} \]
From the fact that the continuous embedding $H_0^1(\Omega) \hookrightarrow L^2(\Omega)$ is compact, it follows that $L^2(\Omega) = L^2(\Omega)' \hookrightarrow H_0^1(\Omega)' = H^{-1}(\Omega)$ is a compact embedding. Hence there exists a subsequence $(p_{k(\ell)})_{\ell\ge1}$ that is a Cauchy sequence in $H^{-1}(\Omega)$. From (2.92) and theorem 2.6.4 it follows that $(p_{k(\ell)})_{\ell\ge1}$ is a Cauchy sequence in $L^2(\Omega)$ and thus there exists $p \in L^2(\Omega)$ such that
\[ \lim_{\ell\to\infty} p_{k(\ell)} = p \quad \text{in } L^2(\Omega). \tag{2.93} \]
From (2.92) we get $\lim_{\ell\to\infty} \|\nabla p_{k(\ell)}\|_{-1} = 0$ and thus
\[ \lim_{\ell\to\infty} \frac{\partial p_{k(\ell)}}{\partial x_i}(\phi) = 0 \quad \text{for all } \phi \in C_0^\infty(\Omega) \text{ and all } i = 1,\dots,n. \]
In combination with (2.93) this yields
\[ 0 = \lim_{\ell\to\infty} \frac{\partial p_{k(\ell)}}{\partial x_i}(\phi) = -\lim_{\ell\to\infty} \Big\langle p_{k(\ell)}, \frac{\partial\phi}{\partial x_i} \Big\rangle_{L^2} = -\Big\langle p, \frac{\partial\phi}{\partial x_i} \Big\rangle_{L^2} \quad \text{for } i = 1,\dots,n. \]
Hence $p \in H^1(\Omega)$ and $\nabla p = 0$. It follows that $p$ is equal to a constant (a.e.), say $p = c$. From (2.93) and $p_{k(\ell)} \in L_0^2(\Omega)$ it follows that $\int_\Omega p\, dx = \int_\Omega c\, dx = 0$ and thus $c = 0$. This results in a contradiction:
\[ 1 = \lim_{\ell\to\infty} \|p_{k(\ell)}\|_{L^2} = \|p\|_{L^2} = \|c\|_{L^2} = 0, \]
and thus the proof is complete.
Theorem 2.6.7 The inf-sup property (2.91) holds.

Proof. From lemma 2.6.6 it follows that there exists $c > 0$ such that $\|\nabla q\|_{-1} \ge c\|q\|_{L^2}$ for all $q \in L_0^2(\Omega)$. Hence, for suitable $k$ with $1 \le k \le n$ we have
\[ \Big\| \frac{\partial q}{\partial x_k} \Big\|_{-1} \ge \frac{c}{\sqrt n}\|q\|_{L^2} \quad \text{for all } q \in L_0^2(\Omega). \]
Thus there exists $v \in H_0^1(\Omega)$ with $\|v\|_1 = 1$ and
\[ \Big| \frac{\partial q}{\partial x_k}(v) \Big| = \Big| \int_\Omega q\frac{\partial v}{\partial x_k}\, dx \Big| \ge \frac12\frac{c}{\sqrt n}\|q\|_{L^2} \quad \text{for all } q \in L_0^2(\Omega). \]
For $\mathbf v = (v_1,\dots,v_n) \in H_0^1(\Omega)^n$ defined by $v_k = v$, $v_i = 0$ for $i \ne k$ we have
\[ \sup_{w\in H_0^1(\Omega)^n} \frac{\int_\Omega q\operatorname{div} w\, dx}{\|w\|_1} = \sup_{w\in H_0^1(\Omega)^n} \frac{\big| \int_\Omega q\operatorname{div} w\, dx \big|}{\|w\|_1} \ge \frac{\big| \int_\Omega q\operatorname{div}\mathbf v\, dx \big|}{\|\mathbf v\|_1} = \frac{\big| \int_\Omega q\frac{\partial v}{\partial x_k}\, dx \big|}{\|v\|_1} \ge \frac12\frac{c}{\sqrt n}\|q\|_{L^2} \]
for all $q \in L_0^2(\Omega)$. This completes the proof.
2.6.2 Regularity of the Stokes problem

We present two results from the literature concerning regularity of the Stokes problem. The first result is proved in [26, 56]:

Theorem 2.6.8 Let $(u,p) \in H_0^1(\Omega)^n\times L_0^2(\Omega)$ be the solution of the Stokes problem (2.90). For $m \ge 0$ assume that $f \in H_0^m(\Omega)^n$ and $\partial\Omega \in C^{m+2}$. Then $u \in H^{m+2}(\Omega)^n$, $p \in H^{m+1}(\Omega)$ and the inequality
\[ \|u\|_{m+2} + \|p\|_{m+1} \le C\|f\|_m \tag{2.94} \]
holds, with a constant $C$ independent of $f$.

If the property in (2.94) holds then the Stokes problem is said to be $H^{m+2}$-regular. Note that even for $H^2$-regularity (i.e., $m = 0$) one needs the assumption $\partial\Omega \in C^2$, which in practice is often not fulfilled. For convex domains this assumption can be avoided (as in theorem 2.5.16). The following result is presented in [54] (only $n = 2$) and in [30] (for $n \ge 2$):

Theorem 2.6.9 Let $(u,p) \in H_0^1(\Omega)^n\times L_0^2(\Omega)$ be the solution of the Stokes problem (2.90). Suppose that $\Omega$ is convex. Then $u \in H^2(\Omega)^n$, $p \in H^1(\Omega)$ and the inequality
\[ \|u\|_2 + \|p\|_1 \le C\|f\|_{L^2} \]
holds, with a constant $C$ independent of $f$.
2.6.3 Other boundary conditions

For a Stokes problem with nonhomogeneous Dirichlet boundary conditions, say $u = g$ on $\partial\Omega$, a compatibility condition is needed:
\[ \int_{\partial\Omega} g\cdot n\, ds = 0. \tag{2.95} \]
... other boundary conditions ...
... in preparation ...
Chapter 3

Galerkin discretization and finite element method
3.1 Galerkin discretization

We consider a variational problem as in theorem 2.3.1, i.e., for $f \in H_2'$ the variational problem is given by:
\[ \text{find } u \in H_1 \text{ such that } k(u,v) = f(v) \text{ for all } v \in H_2. \tag{3.1} \]
We assume that the bilinear form $k(\cdot,\cdot)$ is continuous,
\[ \exists\, M : \ k(u,v) \le M\|u\|_{H_1}\|v\|_{H_2} \quad \text{for all } u \in H_1,\ v \in H_2, \tag{3.2} \]
and that the conditions (2.36) and (2.37) from theorem 2.3.1 hold:
\[ \exists\,\varepsilon > 0 : \ \sup_{v\in H_2} \frac{k(u,v)}{\|v\|_{H_2}} \ge \varepsilon\|u\|_{H_1} \quad \text{for all } u \in H_1, \tag{3.3} \]
\[ \forall\, v \in H_2,\ v \ne 0,\ \exists\, u \in H_1 : \ k(u,v) \ne 0. \tag{3.4} \]
From theorem 2.3.1 we know that for a continuous bilinear form the conditions (3.3) and (3.4) are necessary and sufficient for well-posedness of the variational problem in (3.1).
The Galerkin discretization of the problem (3.1) is based on the following simple idea. We assume finite dimensional subspaces $H_{1,h} \subset H_1$, $H_{2,h} \subset H_2$ (note: in concrete cases the index $h$ will correspond to some mesh size parameter) and consider the finite dimensional variational problem
\[ \text{find } u_h \in H_{1,h} \text{ such that } k(u_h,v_h) = f(v_h) \text{ for all } v_h \in H_{2,h}. \tag{3.5} \]
This problem is called the Galerkin discretization of (3.1) (in $H_{1,h}\times H_{2,h}$). We now discuss the well-posedness of this Galerkin discretization. First note that the continuity of $k : H_{1,h}\times H_{2,h}\to\mathbb R$ follows from (3.2). From theorem 2.3.1 it follows that we need the conditions (3.3) and (3.4) with $H_i$ replaced by $H_{i,h}$, $i = 1, 2$. However, because $H_{i,h}$ is finite dimensional we only need (3.3), since this implies (3.4) (see remark 2.3.2). Thus we formulate the following (discrete) inf-sup condition in the space $H_{1,h}\times H_{2,h}$:
\[ \exists\,\varepsilon_h > 0 : \ \sup_{v_h\in H_{2,h}} \frac{k(u_h,v_h)}{\|v_h\|_{H_2}} \ge \varepsilon_h\|u_h\|_{H_1} \quad \text{for all } u_h \in H_{1,h}. \tag{3.6} \]
We now prove two fundamental results:

Theorem 3.1.1 (Céa lemma) Let (3.2), (3.3), (3.4), (3.6) hold. Then the variational problem (3.1) and its Galerkin discretization (3.5) have unique solutions $u$ and $u_h$, respectively. Furthermore, the inequality
\[ \|u - u_h\|_{H_1} \le \Big( 1 + \frac{M}{\varepsilon_h} \Big) \inf_{v_h\in H_{1,h}} \|u - v_h\|_{H_1} \tag{3.7} \]
holds.

Proof. The result on existence and uniqueness follows from theorem 2.3.1 and the fact that in the finite dimensional case (3.3) implies (3.4). From (3.1) and (3.5) it follows that
\[ k(u - u_h, v_h) = 0 \quad \text{for all } v_h \in H_{2,h}. \tag{3.8} \]
For arbitrary $v_h \in H_{1,h}$ we have, due to (3.6), (3.8), (3.2):
\[ \|v_h - u_h\|_{H_1} \le \frac{1}{\varepsilon_h}\sup_{w_h\in H_{2,h}} \frac{k(v_h - u_h, w_h)}{\|w_h\|_{H_2}} = \frac{1}{\varepsilon_h}\sup_{w_h\in H_{2,h}} \frac{k(v_h - u, w_h)}{\|w_h\|_{H_2}} \le \frac{M}{\varepsilon_h}\|v_h - u\|_{H_1}. \]
From this and the triangle inequality
\[ \|u - u_h\|_{H_1} \le \|u - v_h\|_{H_1} + \|v_h - u_h\|_{H_1} \quad \text{for all } v_h \in H_{1,h}, \]
the result follows.
The result in this theorem simplifies if we consider the important special case $H_1 = H_2 =: H$, $H_{1,h} = H_{2,h} =: H_h$ and assume that the bilinear form $k(\cdot,\cdot)$ is elliptic on $H$.

Corollary 3.1.2 Consider the case $H_1 = H_2 =: H$ and $H_{1,h} = H_{2,h} =: H_h$. Assume that (3.2) holds and that the bilinear form $k(\cdot,\cdot)$ is $H$-elliptic with ellipticity constant $\gamma$. Then the variational problem (3.1) and its Galerkin discretization (3.5) have unique solutions $u$ and $u_h$, respectively. Furthermore, the inequality
\[ \|u - u_h\|_H \le \frac M\gamma \inf_{v_h\in H_h} \|u - v_h\|_H \tag{3.9} \]
holds.

Proof. Because $k(\cdot,\cdot)$ is $H$-elliptic the conditions (3.3) (with $\varepsilon = \gamma$), (3.4) and (3.6) (with $\varepsilon_h = \gamma$) are satisfied. From theorem 3.1.1 we conclude that unique solutions $u$ and $u_h$ exist. Using $k(u - u_h, v_h) = 0$ for all $v_h \in H_h$ and the ellipticity and continuity we get for arbitrary $v_h \in H_h$:
\[ \|u - u_h\|_H^2 \le \frac1\gamma k(u - u_h, u - u_h) = \frac1\gamma k(u - u_h, u - v_h) \le \frac M\gamma \|u - u_h\|_H\|u - v_h\|_H. \]
Hence the inequality in (3.9) holds.
In chapter 4 and chapter 5 we will use theorem 3.1.1 in the discretization error analysis. In the remainder of this chapter we only consider cases with $H_1 = H_2 =: H$, $H_{1,h} = H_{2,h} =: H_h$ and $H$-elliptic bilinear forms, such that corollary 3.1.2 can be applied.
An improvement of the bound in (3.9) can be obtained if $k(\cdot,\cdot)$ is symmetric:

Corollary 3.1.3 Assume that the conditions as in corollary 3.1.2 are satisfied. If in addition
the bilinear form k(, ) is symmetric, the inequality
s
M
ku uh kH inf ku vh kH (3.10)
vh Hh

holds.

1
Proof. Introduce the norm |||v||| := k(v, v) 2 on H. Note that

kvkH |||v||| M kvkH for all v H.

The space (H, ||| |||) is a Hilbert space and due to |||v|||2 = k(v, v), k(u, v) |||u||||||v||| the bilinear
form has ellipticity constant and continuity constant w.r.t. the norm ||| ||| both equal to 1.
Application of corollary 3.1.2 in the space (H, ||| |||) yields

|||u uh ||| inf |||u vh |||


vh Hh

and thus we obtain


s
1 1 M
ku uh kH |||u uh ||| inf |||u vh ||| inf ku vh kH ,
vh Hh vh Hh

which completes the proof.

Assume $H_1 = H_2 = H$ and $H_{1,h} = H_{2,h} = H_h$. For the actual computation of the solution $u_h$ of the Galerkin discretization we need a basis of the space $H_h$. Let $\{\phi_i\}_{1 \leq i \leq N}$ be a basis of $H_h$, i.e., every $v_h \in H_h$ has a unique representation
$$v_h = \sum_{j=1}^N v_j \phi_j \quad \text{with } \mathbf{v} := (v_1, \ldots, v_N)^T \in \mathbb{R}^N.$$
The Galerkin discretization can be reformulated as:
$$\text{find } \mathbf{v} \in \mathbb{R}^N \text{ such that } \sum_{j=1}^N k(\phi_j, \phi_i)\, v_j = f(\phi_i), \quad i = 1, \ldots, N. \qquad (3.11)$$
This yields the linear system of equations
$$K \mathbf{v} = b, \quad \text{with } K_{ij} = k(\phi_j, \phi_i), \ b_i = f(\phi_i), \ 1 \leq i, j \leq N. \qquad (3.12)$$
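To make (3.11)-(3.12) concrete, the following is a minimal sketch (Python/NumPy) of the assembly of $K$ and $b$. The problem, grid and basis are illustrative choices not fixed by the text: the bilinear form $k(u,v) = \int_0^1 u'v'\,dx$, the right-hand side $f(v) = \int_0^1 v\,dx$, and piecewise linear hat functions on a uniform grid with homogeneous Dirichlet conditions.

    import numpy as np

    def assemble_1d_poisson(N):
        """Assemble K and b of (3.12) for k(u,v) = int u'v' dx, f(v) = int v dx,
        with N interior P1 hat functions on a uniform grid, h = 1/(N+1)."""
        h = 1.0 / (N + 1)
        K = np.zeros((N, N))
        b = np.full(N, h)                # f(phi_i) = int phi_i dx = h
        for i in range(N):
            K[i, i] = 2.0 / h            # k(phi_i, phi_i)
            if i + 1 < N:
                K[i, i + 1] = -1.0 / h   # k(phi_{i+1}, phi_i)
                K[i + 1, i] = -1.0 / h
        return K, b

    K, b = assemble_1d_poisson(99)
    v = np.linalg.solve(K, b)            # nodal coefficients of u_h

For this data the exact solution of the underlying problem $-u'' = 1$, $u(0)=u(1)=0$ is $u(x) = x(1-x)/2$, so the midpoint coefficient should be close to $0.125$.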
In the remainder of this chapter we discuss concrete choices for the space $H_h$, namely the so-called finite element spaces. These spaces turn out to be very suitable for the Galerkin discretization of scalar elliptic boundary value problems. Finite element spaces can also be used for the Galerkin discretization of the Stokes problem. This topic is treated in chapter 5. Once a space $H_h$ is known one can investigate approximation properties of this space and derive bounds for $\inf_{v_h \in H_h} \|u - v_h\|_H$ (with $u$ the weak solution of the elliptic boundary value problem), cf. section 3.3. Due to the Cea-lemma we then have a bound for the discretization error $\|u - u_h\|_H$ (see section 3.4).
In Part III of this book (iterative methods) we discuss techniques that can be used for solving the linear system in (3.12).

3.2 Examples of finite element spaces

In this section we introduce finite element spaces that are appropriate for the Galerkin discretization of elliptic boundary value problems. We only present the main principles. An extensive treatment of finite element techniques can be found in, for example, [27], [28], [21].

To simplify the presentation we only consider finite element methods for elliptic boundary value problems in $\Omega \subset \mathbb{R}^n$ with $n \leq 3$.
Starting point for the finite element approach is a subdivision of the domain $\Omega$ into a finite number of subsets $T$. Such a subdivision is called a triangulation and is denoted by $\mathcal{T}_h = \{T\}$. For the subsets $T$ we only allow:
$$T \text{ is an } n\text{-simplex (i.e., interval, triangle, tetrahedron), or } T \text{ is an } n\text{-rectangle.} \qquad (3.13)$$
Furthermore, the triangulation $\mathcal{T}_h = \{T\}$ should be such that
$$\bar\Omega = \cup_{T \in \mathcal{T}_h} T, \qquad (3.14a)$$
$$\operatorname{int} T_1 \cap \operatorname{int} T_2 = \emptyset \quad \text{for all } T_1, T_2 \in \mathcal{T}_h,\ T_1 \neq T_2, \qquad (3.14b)$$
$$\text{any edge [face] of any } T_1 \in \mathcal{T}_h \text{ is either a subset of } \partial\Omega \text{ or an edge [face] of another } T_2 \in \mathcal{T}_h. \qquad (3.14c)$$

Definition 3.2.1 A triangulation that satisfies (3.13) and (3.14) is called admissible. □
Note that a triangulation can be admissible only if the domain $\Omega$ is polygonal (i.e., $\partial\Omega$ consists of lines and/or planes). If the domain is not polygonal we can approximate it by a polygonal domain $\Omega_h$ and construct an admissible triangulation of $\Omega_h$ (see ...) or use isoparametric finite elements (section 3.6).

We consider a family of admissible triangulations denoted by $\{\mathcal{T}_h\}$. Let $h_T := \operatorname{diam}(T)$ for $T \in \mathcal{T}_h$. The index parameter $h$ of $\mathcal{T}_h$ is taken such that
$$h = \max\{\, h_T \mid T \in \mathcal{T}_h \,\}.$$
Furthermore, for $T \in \mathcal{T}_h$ we define
$$\rho_T := \sup\{\, \operatorname{diam}(B) \mid B \text{ is a ball contained in } T \,\}, \qquad \sigma_T := \frac{h_T}{\rho_T} \in [1, \infty).$$

Definition 3.2.2 A family of admissible triangulations $\{\mathcal{T}_h\}$ is called regular if
1. the parameter $h$ approaches zero: $\inf\{\, h \mid \mathcal{T}_h \in \{\mathcal{T}_h\} \,\} = 0$,
2. $\exists\, \sigma$: $\sigma_T = \frac{h_T}{\rho_T} \leq \sigma$ for all $T \in \mathcal{T}_h$ and all $\mathcal{T}_h \in \{\mathcal{T}_h\}$.
A family of admissible triangulations $\{\mathcal{T}_h\}$ is called quasi-uniform if
$$\exists\, \bar\sigma: \quad \frac{h}{\rho_T} \leq \bar\sigma \quad \text{for all } T \in \mathcal{T}_h \text{ and all } \mathcal{T}_h \in \{\mathcal{T}_h\}.$$

3.2.1 Simplicial finite elements

We now introduce a very important class of finite element spaces. Let $\{\mathcal{T}_h\}$ be a family of admissible triangulations of $\Omega$ consisting only of $n$-simplices.
The space of polynomials in $\mathbb{R}^n$ of degree less than or equal to $k$ is denoted by $\mathbb{P}_k$, i.e., $p \in \mathbb{P}_k$ is of the form
$$p(x) = \sum_{|\alpha| \leq k} \gamma_\alpha\, x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}, \quad \gamma_\alpha \in \mathbb{R}.$$
The dimension of $\mathbb{P}_k$ is
$$\dim \mathbb{P}_k = \binom{n+k}{k}.$$
The spaces of simplicial finite elements are given by
$$\mathbb{X}_h^0 := \{\, v \in L^2(\Omega) \mid v|_T \in \mathbb{P}_0 \ \text{for all } T \in \mathcal{T}_h \,\}, \qquad (3.15a)$$
$$\mathbb{X}_h^k := \{\, v \in C(\bar\Omega) \mid v|_T \in \mathbb{P}_k \ \text{for all } T \in \mathcal{T}_h \,\}, \quad k \geq 1. \qquad (3.15b)$$
Thus these spaces consist of piecewise polynomials which, for $k \geq 1$, are continuous on $\bar\Omega$.
Remark 3.2.3 From theorem 2.2.12 it follows that $\mathbb{X}_h^k \subset H^1(\Omega)$ for all $k \geq 1$. □
We will also need simplicial finite element spaces with functions that are zero on $\partial\Omega$:
$$\mathbb{X}_{h,0}^k := \mathbb{X}_h^k \cap H_0^1(\Omega), \quad k \geq 1. \qquad (3.16)$$

3.2.2 Rectangular finite elements

Let $\{\mathcal{T}_h\}$ be a family of admissible triangulations consisting only of $n$-rectangles.
The space of polynomials in $\mathbb{R}^n$ of degree less than or equal to $k$ with respect to each of the variables is denoted by $\mathbb{Q}_k$, i.e., $p \in \mathbb{Q}_k$ is of the form
$$p(x) = \sum_{0 \leq \alpha_i \leq k} \gamma_\alpha\, x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}, \quad \gamma_\alpha \in \mathbb{R}.$$
The dimension of $\mathbb{Q}_k$ is
$$\dim \mathbb{Q}_k = (k+1)^n.$$
The spaces of rectangular finite elements are given by
$$\mathbb{Q}_h^0 := \{\, v \in L^2(\Omega) \mid v|_T \in \mathbb{Q}_0 \ \text{for all } T \in \mathcal{T}_h \,\}, \qquad (3.17a)$$
$$\mathbb{Q}_h^k := \{\, v \in C(\bar\Omega) \mid v|_T \in \mathbb{Q}_k \ \text{for all } T \in \mathcal{T}_h \,\}, \quad k \geq 1, \qquad (3.17b)$$
$$\mathbb{Q}_{h,0}^k := \mathbb{Q}_h^k \cap H_0^1(\Omega), \quad k \geq 1. \qquad (3.17c)$$
3.3 Approximation properties of finite element spaces

In this section, for $u \in H^2(\Omega)$ we derive bounds for the approximation error $\inf_{v_h \in H_h} \|u - v_h\|_1$ with $H_h = \mathbb{X}_h^k$ or $H_h = \mathbb{Q}_h^k$ (note that $H_h$ depends on the parameter $k$).
The main idea of the analysis is as follows. First we will introduce an interpolation operator $I_h^k : C(\bar\Omega) \to H_h$. Recall that we assumed $n \leq 3$. The Sobolev embedding theorem 2.2.14 yields
$$H^m(\Omega) \hookrightarrow C(\bar\Omega) \quad \text{for } m \geq 2,$$
and thus the interpolation operator is well-defined for $u \in H^m(\Omega)$, $m \geq 2$. We will prove interpolation error bounds of the form (cf. theorem 3.3.9)
$$\|u - I_h^k u\|_t \leq c\, h^{m-t} |u|_m \quad \text{for } 2 \leq m \leq k+1, \ t = 0, 1.$$
This implies (corollary 3.3.10)
$$\inf_{v_h \in H_h} \|u - v_h\|_t \leq c\, h^{m-t} |u|_m \quad \text{for } 2 \leq m \leq k+1, \ t = 0, 1.$$
We first introduce the interpolation operators $I_X^k : C(\bar\Omega) \to \mathbb{X}_h^k$ and $I_Q^k : C(\bar\Omega) \to \mathbb{Q}_h^k$. Then we formulate some useful results that will be applied to prove the main result in theorem 3.3.9.

We start with the definition of an interpolation operator $I_X^k : C(\bar\Omega) \to \mathbb{X}_h^k$. For the description of this operator the so-called barycentric coordinates are useful:

Definition 3.3.1 Let $T$ be a nondegenerate $n$-simplex with vertices $a_j \in \mathbb{R}^n$, $j = 1, \ldots, n+1$. Then $T$ can be described by
$$T = \Big\{\, \sum_{j=1}^{n+1} \lambda_j a_j \ \Big|\ 0 \leq \lambda_j \leq 1 \ \forall j, \ \sum_{j=1}^{n+1} \lambda_j = 1 \,\Big\}. \qquad (3.18)$$
To every $x \in T$ there corresponds a unique $(n+1)$-tuple $(\lambda_1, \ldots, \lambda_{n+1})$ as in (3.18). These $\lambda_j$, $1 \leq j \leq n+1$, are called the barycentric coordinates of $x \in T$. The mapping $x \to (\lambda_1, \ldots, \lambda_{n+1})$ is affine. □

Using these barycentric coordinates we define the set
$$L_k(T) := \Big\{\, \sum_{j=1}^{n+1} \lambda_j a_j \ \Big|\ \lambda_j \in \{0, \tfrac{1}{k}, \ldots, \tfrac{k-1}{k}, 1\} \ \forall j, \ \sum_{j=1}^{n+1} \lambda_j = 1 \,\Big\},$$
which is called the principal lattice of order $k$ (in $T$). Examples for $n = 2$ and $n = 3$ are given in figure ...
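The principal lattice is easy to generate directly from the barycentric description. A short sketch (Python; the vertex coordinates below are arbitrary illustrative data, not taken from the text):

    import itertools
    import numpy as np

    def principal_lattice(vertices, k):
        """Points of L_k(T) for the simplex with the given (n+1) vertices:
        barycentric coordinates with entries in {0, 1/k, ..., 1} summing to 1."""
        n1 = len(vertices)                      # n + 1 vertices
        pts = []
        for c in itertools.product(range(k + 1), repeat=n1):
            if sum(c) == k:
                lam = np.array(c) / k           # barycentric coordinates
                pts.append(lam @ np.asarray(vertices, dtype=float))
        return np.array(pts)

    # lattice of order 2 on the unit triangle: 6 points (= dim P_2 for n = 2)
    print(principal_lattice([[0, 0], [1, 0], [0, 1]], 2))

Note that the number of lattice points equals $\binom{n+k}{k} = \dim \mathbb{P}_k$, consistent with lemma 3.3.2 below.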
This principal lattice can be used to determine a unique polynomial $p \in \mathbb{P}_k$:

Lemma 3.3.2 Let $T$ be a nondegenerate $n$-simplex. Then any polynomial $p \in \mathbb{P}_k$ is uniquely determined by its values on the principal lattice $L_k(T)$.

Proof. See, for example, [67]. □
Let $\mathcal{T}_h = \{T\}$ be an admissible triangulation of $\Omega$ consisting only of $n$-simplices. For $u \in C(\bar\Omega)$ we define a corresponding function $I_X^k u \in L^2(\Omega)$ by piecewise polynomial interpolation on each simplex $T \in \mathcal{T}_h$:
$$\forall\, T \in \mathcal{T}_h: \quad (I_X^k u)|_T \in \mathbb{P}_k \ \text{ such that } \ (I_X^k u)(x_j) = u(x_j) \ \ \forall\, x_j \in L_k(T). \qquad (3.19)$$
The piecewise polynomial function $I_X^k u$ is continuous on $\bar\Omega$:

Lemma 3.3.3 For $k \geq 1$ and $u \in C(\bar\Omega)$ we have $I_X^k u \in \mathbb{X}_h^k$.
Proof. By definition we have that $(I_X^k u)|_T \in \mathbb{P}_k$. Thus we only have to show that $I_X^k u$ is continuous across interfaces between adjacent $n$-simplices $T_1, T_2 \in \mathcal{T}_h$. For $n = 1$ this is trivial, since the endpoints $a_1, a_2$ of a 1-simplex $[a_1, a_2]$ are used as interpolation points. We now consider $n = 2$. Define $\Gamma := T_1 \cap T_2$ and $p_i := (I_X^k u)|_{T_i}$, $i = 1, 2$. Note that $k+1$ points of the principal lattice lie on the face $\Gamma$:
$$L_k(T_1) \cap \Gamma = L_k(T_2) \cap \Gamma =: \{x_1, \ldots, x_{k+1}\} \quad \text{with } x_i \neq x_j \text{ for } i \neq j.$$
Since these $x_j$ are interpolation points we have that $p_1(x_j) = p_2(x_j) = u(x_j)$ for $j = 1, \ldots, k+1$. The functions $(p_i)|_\Gamma$ are one-dimensional polynomials of degree $\leq k$. We conclude that $(p_1)|_\Gamma = (p_2)|_\Gamma$ holds, and thus $I_X^k u$ is continuous across the interface $\Gamma$. The case $n = 3$ (or even $n \geq 3$) can be treated similarly. □

For the space of rectangular finite elements, $\mathbb{Q}_h^k$, an interpolation operator $I_Q^k : C(\bar\Omega) \to \mathbb{Q}_h^k$ can be defined in a very similar way. For this we introduce a uniform grid on a rectangle in $\mathbb{R}^n$. For a given interval $[a, b]$ a uniform grid with mesh size $\frac{b-a}{k}$ is given by
$$G_{[a,b]}^k := \{\, a + j\,\tfrac{b-a}{k} \mid 0 \leq j \leq k \,\}.$$
On an $n$-rectangle $T = \prod_{i=1}^n [a_i, b_i]$ we define a uniform lattice by
$$L_k(T) := \prod_{i=1}^n G_{[a_i, b_i]}^k.$$
Using a tensor product argument it follows that any polynomial $p \in \mathbb{Q}_k$, $k \geq 1$, is uniquely determined by its values on the set $L_k(T)$. Let $\mathcal{T}_h = \{T\}$ be an admissible triangulation of $\Omega$ consisting only of $n$-rectangles. For $u \in C(\bar\Omega)$ we define a corresponding function $I_Q^k u \in L^2(\Omega)$ by piecewise polynomial interpolation on each $n$-rectangle $T \in \mathcal{T}_h$:
$$\forall\, T \in \mathcal{T}_h: \quad (I_Q^k u)|_T \in \mathbb{Q}_k \ \text{ such that } \ (I_Q^k u)(x_j) = u(x_j) \ \ \forall\, x_j \in L_k(T). \qquad (3.20)$$
With similar arguments as used in the proof of lemma 3.3.3 one can show the following:
Lemma 3.3.4 For $k \geq 1$ and $u \in C(\bar\Omega)$ we have $I_Q^k u \in \mathbb{Q}_h^k$. □

For the analysis of the interpolation error we begin with two elementary lemmas.
Lemma 3.3.5 Let $T, \hat T \subset \mathbb{R}^n$ be two sets as in (3.13) and $F(\hat x) = B\hat x + c$ an affine mapping such that $F(\hat T) = T$. Then the following inequalities hold:
$$\|B\|_2 \leq \frac{h_T}{\rho_{\hat T}}, \qquad \|B^{-1}\|_2 \leq \frac{h_{\hat T}}{\rho_T}.$$

Proof. We will prove the first inequality. The second one then follows from the first one by using $F^{-1}(T) = \hat T$ with $F^{-1}(x) = B^{-1}x - B^{-1}c$.
Note that
$$\|B\|_2 = \frac{1}{\rho_{\hat T}} \max\{\, \|B\hat x\|_2 \mid \hat x \in \mathbb{R}^n,\ \|\hat x\|_2 = \rho_{\hat T} \,\}. \qquad (3.21)$$
Let $B(\hat a; \rho_{\hat T})$ be a ball with centre $\hat a$ and diameter $\rho_{\hat T}$ that is contained in $\hat T$. Take $\hat x \in \mathbb{R}^n$ with $\|\hat x\|_2 = \rho_{\hat T}$. For $\hat y_1 = \hat a + \frac12 \hat x \in \hat T$ and $\hat y_2 = \hat a - \frac12 \hat x \in \hat T$ we have
$$\hat x = \hat y_1 - \hat y_2, \qquad F(\hat y_i) \in T, \ i = 1, 2,$$
and thus
$$\|B\hat x\|_2 = \|B(\hat y_1 - \hat y_2)\|_2 = \|F(\hat y_1) - F(\hat y_2)\|_2 \leq h_T. \qquad (3.22)$$
From (3.21) and (3.22) we obtain $\|B\|_2 \leq \frac{h_T}{\rho_{\hat T}}$. □
Lemma 3.3.6 Let $K$ and $\hat K$ be Lipschitz domains in $\mathbb{R}^n$ that are affine equivalent:
$$F(\hat K) = K \quad \text{with } F(\hat x) = B\hat x + c, \ \det B \neq 0.$$
For $m \geq 0$, $v \in H^m(K)$ define $\hat v := v \circ F : \hat K \to \mathbb{R}$. Then $\hat v \in H^m(\hat K)$ and there exists a constant $C$ such that
$$|\hat v|_{m, \hat K} \leq C \|B\|_2^m\, |\det B|^{-\frac12}\, |v|_{m, K} \quad \text{for all } v \in H^m(K), \qquad (3.23a)$$
$$|v|_{m, K} \leq C \|B^{-1}\|_2^m\, |\det B|^{\frac12}\, |\hat v|_{m, \hat K} \quad \text{for all } \hat v \in H^m(\hat K). \qquad (3.23b)$$

Proof. Since $C^\infty(\bar K)$ is dense in $H^m(K)$ it suffices to prove (3.23a) for $v \in C^\infty(\bar K)$. For $m = 0$ this result follows from
$$|\hat v|_{0, \hat K}^2 = \int_{\hat K} \hat v(\hat x)^2\, d\hat x = \int_K v(x)^2\, |\det B|^{-1}\, dx = |\det B|^{-1} |v|_{0, K}^2.$$
For the case $m \geq 1$ we need some basic results on Frechet derivatives. For $v \in C^\infty(\bar K)$ the Frechet derivative $D^m v(x) : \mathbb{R}^n \times \cdots \times \mathbb{R}^n \to \mathbb{R}$ is an $m$-linear form. Let $e_j$ be the $j$-th basis vector in $\mathbb{R}^n$. For $|\alpha| = m$ and for suitable $i_1, \ldots, i_m \in \mathbb{N}$ we have
$$D^\alpha v(x) = \frac{\partial^{|\alpha|} v(x)}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}} = \frac{\partial^{|\alpha|} v(x)}{\partial x_{i_1} \cdots \partial x_{i_m}} = D^m v(x)(e_{i_1}, \ldots, e_{i_m}) \qquad (3.24)$$
(note the subtle difference in notation between $D^\alpha$ and $D^m$). Let $E$ be an $m$-linear form on $\mathbb{R}^n$. Then both
$$\|E\|_2 := \max_{y^i \in \mathbb{R}^n} \frac{|E(y^1, \ldots, y^m)|}{\|y^1\|_2 \cdots \|y^m\|_2} \quad \text{and} \quad \|E\|_\infty := \max_{1 \leq i_j \leq n} |E(e_{i_1}, \ldots, e_{i_m})|$$
define norms on the space of $m$-linear forms on $\mathbb{R}^n$. Using the norm equivalence property it follows that there exists a constant $c$, independent of $E$, such that
$$\|E\|_\infty \leq \|E\|_2 \leq c\, \|E\|_\infty.$$
If we take $E = D^m v(x)$ and use (3.24) we get
$$\max_{|\alpha| = m} |D^\alpha v(x)| \leq \|D^m v(x)\|_2 \leq c \max_{|\alpha| = m} |D^\alpha v(x)|. \qquad (3.25)$$
The chain rule applied to $\hat v(\hat x) = v(B\hat x + c)$, with $x = B\hat x + c$, results in
$$D^m \hat v(\hat x)(y^1, \ldots, y^m) = D^m v(x)(By^1, \ldots, By^m)$$
and thus
$$\|D^m \hat v(\hat x)\|_2 \leq \|B\|_2^m\, \|D^m v(x)\|_2. \qquad (3.26)$$
Combination of (3.25) and (3.26) yields
$$\max_{|\alpha| = m} |D^\alpha \hat v(\hat x)| \leq c\, \|B\|_2^m \max_{|\alpha| = m} |D^\alpha v(x)|.$$
Using this we finally obtain
$$|\hat v|_{m, \hat K}^2 = \sum_{|\alpha| = m} \int_{\hat K} \big( D^\alpha \hat v(\hat x) \big)^2\, d\hat x \leq C \int_{\hat K} \max_{|\alpha| = m} |D^\alpha \hat v(\hat x)|^2\, d\hat x \leq C \|B\|_2^{2m} \int_{\hat K} \max_{|\alpha| = m} |D^\alpha v(x)|^2\, d\hat x = C \|B\|_2^{2m} \int_K \max_{|\alpha| = m} |D^\alpha v(x)|^2\, |\det B|^{-1}\, dx \leq C \|B\|_2^{2m}\, |\det B|^{-1} \sum_{|\alpha| = m} \int_K |D^\alpha v(x)|^2\, dx = C \|B\|_2^{2m}\, |\det B|^{-1}\, |v|_{m,K}^2.$$
This proves the result in (3.23a). The result in (3.23b) follows from (3.23a) and $F^{-1}(K) = \hat K$ with $F^{-1}(x) = B^{-1}x - B^{-1}c$. □
The following result is a generalization of the Poincare-Friedrichs inequality in (2.79b) and will be used in the proof of theorem 3.3.8.

Lemma 3.3.7 Let $K$ be a Lipschitz domain in $\mathbb{R}^n$. There exists a constant $C$ such that
$$\|u\|_m^2 \leq C \Big( |u|_m^2 + \sum_{|\alpha| \leq m-1} \Big( \int_K D^\alpha u\, dx \Big)^2 \Big) \quad \text{for all } u \in H^m(K).$$
(Here $|\cdot|_m$ and $\|\cdot\|_m$ denote Sobolev (semi)norms on the domain $K$.)

Proof. For $m = 1$ this result is given in (2.79b). From the result in (2.79b) it also follows that
$$\|u\|_{L^2}^2 \leq C \Big( |u|_1^2 + \Big( \int_K u\, dx \Big)^2 \Big) \quad \text{for all } u \in H^1(K). \qquad (3.27)$$
We introduce the notation (for $u \in H^m(K)$):
$$\mu_\ell := \sum_{|\alpha| = \ell} \|D^\alpha u\|_{L^2(K)}^2, \qquad \nu_\ell := \sum_{|\alpha| = \ell} \Big( \int_K D^\alpha u\, dx \Big)^2, \quad \ell = 0, \ldots, m.$$
Note that for $\ell \leq m-1$ we have $\sum_{|\alpha| = \ell} |D^\alpha u|_1^2 = \mu_{\ell+1}$. Using this and the inequality (3.27) with $u$ replaced by $D^\alpha u$ we get for $\ell \leq m-1$:
$$\mu_\ell = \sum_{|\alpha| = \ell} \|D^\alpha u\|_{L^2(K)}^2 \leq C \sum_{|\alpha| = \ell} \Big( |D^\alpha u|_1^2 + \Big( \int_K D^\alpha u\, dx \Big)^2 \Big) = C \big( \mu_{\ell+1} + \nu_\ell \big).$$
From this it follows that
$$\|u\|_m^2 = \sum_{\ell=0}^m \mu_\ell \leq C \Big( \mu_m + \sum_{\ell=0}^{m-1} \nu_\ell \Big) = C \Big( |u|_m^2 + \sum_{|\alpha| \leq m-1} \Big( \int_K D^\alpha u\, dx \Big)^2 \Big),$$
which completes the proof. □

The next theorem, due to Bramble-Hilbert [20], is a fundamental one:

Theorem 3.3.8 Let $K$ be a Lipschitz domain in $\mathbb{R}^n$ and $Y$ a Banach space. Suppose $L : H^m(K) \to Y$, $m \geq 1$, is a linear bounded operator such that
$$L(p) = 0 \quad \text{for all } p \in \mathbb{P}_{m-1}.$$
Then there exists a constant $C$ such that
$$\|Lu\|_Y \leq C\, |u|_m \quad \text{for all } u \in H^m(K). \qquad (3.28)$$

Proof. First note that
$$\|Lu\|_Y = \|L(u - p)\|_Y \leq \|L\|\, \|u - p\|_m \quad \text{for all } p \in \mathbb{P}_{m-1}. \qquad (3.29)$$
Let $p(x) = \sum_{|\alpha| \leq m-1} \gamma_\alpha x_1^{\alpha_1} \cdots x_n^{\alpha_n} \in \mathbb{P}_{m-1}$. For any given $u \in H^m(K)$ one can show that the coefficients $\gamma_\alpha$ can be taken such that
$$\int_K D^\alpha p\, dx = \int_K D^\alpha u\, dx \quad \text{for } |\alpha| \leq m-1$$
holds (hint: the ordering $|\alpha| = m-1$, $|\alpha| = m-2$, $\ldots$, yields a linear system for the coefficients with a nonsingular lower triangular matrix). Using the result in lemma 3.3.7 we obtain
$$\|u - p\|_m^2 \leq C \Big( |u - p|_m^2 + \sum_{|\alpha| \leq m-1} \Big( \int_K D^\alpha (u - p)\, dx \Big)^2 \Big) = C\, |u - p|_m^2 = C\, |u|_m^2. \qquad (3.30)$$
Combination of (3.29) and (3.30) completes the proof. □
We now present a main result on the interpolation error:

Theorem 3.3.9 Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-simplices and let $\mathbb{X}_h^k$ be the corresponding finite element space as in (3.15b). For $2 \leq m \leq k+1$ and $t \in \{0, 1\}$ the following holds:
$$\|u - I_X^k u\|_t \leq C h^{m-t} |u|_m \quad \text{for all } u \in H^m(\Omega). \qquad (3.31)$$
Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-rectangles and let $\mathbb{Q}_h^k$ be the corresponding finite element space as in (3.17b). For $2 \leq m \leq k+1$ and $t \in \{0, 1\}$ the following holds:
$$\|u - I_Q^k u\|_t \leq C h^{m-t} |u|_m \quad \text{for all } u \in H^m(\Omega). \qquad (3.32)$$
The constants $C$ in (3.31) and (3.32) are independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. We will prove the result in (3.31). Very similar arguments can be used to show that the result in (3.32) holds. Take $2 \leq m \leq k+1$. The constants $C$ used below are all uniform with respect to $u \in H^m(\Omega)$ and $\mathcal{T}_h \in \{\mathcal{T}_h\}$. We will show that for all $\ell \in \{0, 1\}$
$$|u - I_X^k u|_\ell \leq C h^{m-\ell} |u|_m \quad \text{for all } u \in H^m(\Omega)$$
holds, with $|\cdot|_0 := \|\cdot\|_{L^2}$. The result in (3.31) then follows from this and from $\|v\|_1^2 = |v|_0^2 + |v|_1^2$. Due to
$$|u - I_X^k u|_\ell^2 = \sum_{T \in \mathcal{T}_h} |u - I_X^k u|_{\ell, T}^2$$
it suffices to prove for $\ell \in \{0, 1\}$ and for arbitrary $T \in \mathcal{T}_h$:
$$|u - I_X^k u|_{\ell, T} \leq C h^{m-\ell} |u|_{m, T} \quad \text{for all } u \in H^m(\Omega). \qquad (3.33)$$
Let $\hat T$ be the unit $n$-simplex and $F : \hat T \to T$ an affine transformation $F(\hat x) = B\hat x + c$ such that $F(\hat T) = T$. Due to the fact that the family $\{\mathcal{T}_h\}$ is regular, there exists a constant $C$ such that
$$\|B\|_2\, \|B^{-1}\|_2 \leq \frac{h_T}{\rho_T} \cdot \frac{h_{\hat T}}{\rho_{\hat T}} \leq C. \qquad (3.34)$$
Note that $\|p\|_* := \sum_{\hat x \in L_k(\hat T)} |p(\hat x)|$ defines a norm on $\mathbb{P}_k$. Since all norms on $\mathbb{P}_k$ are equivalent there exists a constant $C$ such that
$$\|p\|_{m, \hat T} \leq C \|p\|_* \quad \text{for all } p \in \mathbb{P}_k. \qquad (3.35)$$
The continuous embedding $H^m(\hat T) \hookrightarrow C(\hat T)$ yields:
$$\exists\, C: \quad \|v\|_{\infty, \hat T} \leq C \|v\|_{m, \hat T} \quad \text{for all } v \in H^m(\hat T). \qquad (3.36)$$
Let $\hat I_X^k : C(\hat T) \to \mathbb{P}_k$ be the interpolation operator on the unit $n$-simplex as defined in (3.19) (with $\mathcal{T}_h = \{\hat T\}$). We then have
$$(I_X^k u) \circ F = \hat I_X^k (u \circ F) = \hat I_X^k \hat u \quad \text{with } \hat u := u \circ F. \qquad (3.37)$$
Define the linear operator $L := \operatorname{id} - \hat I_X^k : H^m(\hat T) \to H^m(\hat T)$. For this operator we have $Lp = 0$ for all $p \in \mathbb{P}_k$ and thus, due to $m \leq k+1$, $Lp = 0$ for all $p \in \mathbb{P}_{m-1}$. Furthermore, using (3.35)
and (3.36) we get
$$\|Lv\|_{m, \hat T} \leq \|v\|_{m, \hat T} + \|\hat I_X^k v\|_{m, \hat T} \leq \|v\|_{m, \hat T} + C \|\hat I_X^k v\|_* \leq \|v\|_{m, \hat T} + C \sum_{\hat x \in L_k(\hat T)} |v(\hat x)| \leq \|v\|_{m, \hat T} + C \|v\|_{\infty, \hat T} \leq C \|v\|_{m, \hat T}.$$
Thus we can apply theorem 3.3.8, which yields
$$\|v - \hat I_X^k v\|_{m, \hat T} \leq C\, |v|_{m, \hat T} \quad \text{for all } v \in H^m(\hat T). \qquad (3.38)$$
For $u \in H^m(\Omega)$ we obtain, using lemma 3.3.5, lemma 3.3.6 and the results in (3.34), (3.37), (3.38):
$$|u - I_X^k u|_{\ell, T} \leq C \|B^{-1}\|_2^\ell\, |\det B|^{\frac12}\, |u \circ F - (I_X^k u) \circ F|_{\ell, \hat T} = C \|B^{-1}\|_2^\ell\, |\det B|^{\frac12}\, |\hat u - \hat I_X^k \hat u|_{\ell, \hat T} \leq C \|B^{-1}\|_2^\ell\, |\det B|^{\frac12}\, \|\hat u - \hat I_X^k \hat u\|_{m, \hat T} \leq C \|B^{-1}\|_2^\ell\, |\det B|^{\frac12}\, |\hat u|_{m, \hat T} \leq C \|B^{-1}\|_2^\ell\, \|B\|_2^m\, |u|_{m, T} = C \big( \|B^{-1}\|_2 \|B\|_2 \big)^\ell\, \|B\|_2^{m-\ell}\, |u|_{m, T} \leq C \|B\|_2^{m-\ell}\, |u|_{m, T} \leq C h^{m-\ell} |u|_{m, T}.$$
This proves the result in (3.33). □
Corollary 3.3.10 Under the same assumptions as in theorem 3.3.9 we have
$$\inf_{v_h \in \mathbb{X}_h^k} \|u - v_h\|_t \leq C h^{m-t} |u|_m, \qquad (3.39)$$
$$\inf_{v_h \in \mathbb{Q}_h^k} \|u - v_h\|_t \leq C h^{m-t} |u|_m. \qquad (3.40)$$
Furthermore, the results in (3.39) and (3.40) hold for $u \in H^m(\Omega) \cap H_0^1(\Omega)$ with $\mathbb{X}_h^k$, $\mathbb{Q}_h^k$ replaced by $\mathbb{X}_{h,0}^k$ and $\mathbb{Q}_{h,0}^k$, respectively.
Proof. The first part is clear. The second part follows from the fact that for $u \in H^m(\Omega) \cap H_0^1(\Omega)$ we have $I_X^k u \in \mathbb{X}_{h,0}^k$ and $I_Q^k u \in \mathbb{Q}_{h,0}^k$. □
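The predicted orders $h^{m-t}$ are easy to observe numerically. The following sketch (Python/NumPy; the smooth test function is our own illustrative choice) checks the case $n = 1$, $k = 1$, $m = 2$: the $L^2$ interpolation error should behave like $h^2$ and the $H^1$ seminorm error like $h$, with the error norms approximated by a 2-point Gauss rule per cell.

    import numpy as np

    u  = lambda x: np.sin(np.pi * x)
    du = lambda x: np.pi * np.cos(np.pi * x)

    for N in [8, 16, 32, 64]:
        h = 1.0 / N
        x = np.linspace(0.0, 1.0, N + 1)
        xl, xr = x[:-1], x[1:]
        slope = (u(xr) - u(xl)) / h          # derivative of the P1 interpolant
        e0sq = e1sq = 0.0
        for g in (-1.0 / np.sqrt(3.0), 1.0 / np.sqrt(3.0)):
            q = 0.5 * (xl + xr) + 0.5 * h * g        # Gauss points, weight h/2
            Iu_q = u(xl) + slope * (q - xl)          # interpolant at Gauss points
            e0sq += 0.5 * h * np.sum((u(q) - Iu_q) ** 2)
            e1sq += 0.5 * h * np.sum((du(q) - slope) ** 2)
        print(f"h=1/{N}: L2 error {np.sqrt(e0sq):.2e}, H1 semierror {np.sqrt(e1sq):.2e}")

Halving $h$ should roughly quarter the first column and halve the second.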
We now prove so-called local and global inverse inequalities. These results can be used to bound the $H^1$-norm of a finite element function in terms of its $L^2$-norm.

Lemma 3.3.11 (inverse inequalities) Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-simplices ($n$-rectangles) and let $V_h := \mathbb{X}_h^k$ ($= \mathbb{Q}_h^k$) be the corresponding finite element space. For $m \geq 0$ there exists a constant $c$ independent of $h$ such that
$$|v_h|_{m+1, T} \leq c\, h_T^{-1} |v_h|_{m, T} \quad \text{for all } T \in \mathcal{T}_h \text{ and all } v_h \in V_h.$$
If in addition the family of triangulations is quasi-uniform, then there exists a constant $c$ independent of $h$ such that
$$|v_h|_1 \leq c\, h^{-1} \|v_h\|_{L^2} \quad \text{for all } v_h \in V_h.$$

Proof. We consider the case of simplices. The other case can be treated very similarly. For $T \in \mathcal{T}_h$ let $F(\hat x) = B_T \hat x + c$ be an affine transformation such that $F(\hat T) = T$, where $\hat T$ is the unit simplex. Note that on the finite dimensional space $\mathbb{P}_k(\hat T)$ all norms are equivalent. Using lemma 3.3.6 we get, with $\hat v_h = v_h \circ F$,
$$|v_h|_{m+1, T} \leq c\, \|B_T^{-1}\|_2^{m+1}\, |\det B_T|^{\frac12}\, |\hat v_h|_{m+1, \hat T} \leq c\, h_T^{-m-1}\, |\det B_T|^{\frac12}\, |\hat v_h|_{m+1, \hat T} \leq c\, h_T^{-m-1}\, |\det B_T|^{\frac12}\, |\hat v_h|_{m, \hat T} \leq c\, h_T^{-1}\, |v_h|_{m, T},$$
which proves the local inverse inequality. Note that $v_h \in H^1(\Omega)$. Thus for $m = 0$ we can sum up these local results and using the quasi-uniformity assumption (i.e., $h_T^{-1} \leq c\, h^{-1}$) we then obtain
$$|v_h|_1^2 = \sum_{T \in \mathcal{T}_h} |v_h|_{1,T}^2 \leq c \sum_{T \in \mathcal{T}_h} h_T^{-2} |v_h|_{0,T}^2 \leq c\, h^{-2} \sum_{T \in \mathcal{T}_h} |v_h|_{0,T}^2 = c\, h^{-2} \|v_h\|_{L^2}^2,$$
and thus the global inverse inequality is proved. □
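The sharpness of the global inverse inequality can be seen from a highly oscillatory finite element function. A small check (Python; the uniform 1D grid with alternating nodal values $\pm 1$ is our own illustrative worst case):

    import numpy as np

    # check |v_h|_1 ~ h^{-1} ||v_h||_{L2} for v_h with alternating nodal values
    for N in [10, 100, 1000]:
        h = 1.0 / N
        v = np.array([(-1.0) ** i for i in range(N + 1)])    # nodal values
        slopes = np.diff(v) / h                              # +-2/h per cell
        semi1 = np.sqrt(np.sum(h * slopes ** 2))             # |v_h|_1
        a, b = v[:-1], v[1:]                                 # cell endpoint values
        l2 = np.sqrt(np.sum(h * (a * a + a * b + b * b) / 3.0))  # exact ||v_h||
        print(f"h=1/{N}: |v|_1 / ||v|| = {semi1 / l2:.1f},  1/h = {N}")

The printed ratio grows proportionally to $1/h$ (here it equals $2\sqrt{3}/h$), so the factor $h^{-1}$ in lemma 3.3.11 cannot be improved.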
3.4 Finite element discretization of scalar elliptic problems

In this section we consider the Galerkin discretization of the scalar elliptic problem
$$\text{find } u \in H_0^1(\Omega) \text{ such that } k(u, v) = f(v) \text{ for all } v \in H_0^1(\Omega), \qquad (3.41)$$
with a bilinear form and right-hand side as in (2.72), i.e.:
$$k(u, v) = \int_\Omega \nabla u^T A \nabla v + b \cdot \nabla u\, v + cuv\, dx, \qquad f(v) = \int_\Omega f v\, dx, \qquad (3.42a)$$
$$\text{with } -\tfrac12 \operatorname{div} b + c \geq 0 \ \text{ a.e. in } \Omega, \qquad (3.42b)$$
$$\text{and } \exists\, \alpha_0 > 0: \quad \xi^T A(x) \xi \geq \alpha_0\, \xi^T \xi \quad \text{for all } \xi \in \mathbb{R}^n, \ x \in \Omega, \qquad (3.42c)$$
$$\text{and } a_{ij} \in L^\infty(\Omega) \ \forall\, i, j, \quad b_i \in H^1(\Omega) \cap L^\infty(\Omega) \ \forall\, i, \quad c \in L^\infty(\Omega). \qquad (3.42d)$$
For the Galerkin discretization we use finite element subspaces $H_h = \mathbb{X}_{h,0}^k$ or $H_h = \mathbb{Q}_{h,0}^k$. We prove bounds for the discretization error $\|u - u_h\|_1$ (section 3.4.1) and $\|u - u_h\|_{L^2}$ (section 3.4.2).

3.4.1 Error bounds in the norm $\|\cdot\|_1$

We first consider the Galerkin discretization of (3.41) with simplicial finite elements. Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-simplices and $\mathbb{X}_{h,0}^k$, $k \geq 1$, the corresponding finite element space as in (3.16). The discrete problem is given by
$$\text{find } u_h \in \mathbb{X}_{h,0}^k \text{ such that } k(u_h, v_h) = f(v_h) \text{ for all } v_h \in \mathbb{X}_{h,0}^k. \qquad (3.43)$$
We have the following result concerning the discretization error:

Theorem 3.4.1 Assume that the conditions (3.42b)-(3.42d) are fulfilled and that the solution $u \in H_0^1(\Omega)$ of (3.41) lies in $H^m(\Omega)$ with $m \geq 2$. Let $u_h$ be the solution of (3.43). For $2 \leq m \leq k+1$ the following holds:
$$\|u - u_h\|_1 \leq C h^{m-1} |u|_m,$$
with a constant $C$ independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. From the proof of theorem 2.5.3 it follows that the bilinear form $k(\cdot,\cdot)$ is continuous and $H_0^1(\Omega)$-elliptic. From corollary 3.1.2 it follows that the continuous and discrete problems have unique solutions and that
$$\|u - u_h\|_1 \leq C \inf_{v_h \in \mathbb{X}_{h,0}^k} \|u - v_h\|_1$$
holds. Now apply corollary 3.3.10 with $t = 1$. □
A very similar result holds for the Galerkin discretization with rectangular finite elements. Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-rectangles and $\mathbb{Q}_{h,0}^k$, $k \geq 1$, the corresponding finite element space as in (3.17c). The discrete problem is given by
$$\text{find } u_h \in \mathbb{Q}_{h,0}^k \text{ such that } k(u_h, v_h) = f(v_h) \text{ for all } v_h \in \mathbb{Q}_{h,0}^k. \qquad (3.44)$$
We have the following result concerning the discretization error:

Theorem 3.4.2 Assume that the conditions in (3.42b)-(3.42d) are fulfilled and that the solution $u \in H_0^1(\Omega)$ of (3.41) lies in $H^m(\Omega)$ with $m \geq 2$. Let $u_h$ be the solution of (3.44). For $2 \leq m \leq k+1$ the following holds:
$$\|u - u_h\|_1 \leq C h^{m-1} |u|_m,$$
with a constant $C$ independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. The same arguments as in the proof of theorem 3.4.1 can be used. □
Note that in the preceding two theorems we used the smoothness assumption $u \in H_0^1(\Omega) \cap H^m(\Omega)$ with $m \geq 2$. Sufficient conditions for this to hold are given in section 2.5.4, theorem 2.5.14 and theorem 2.5.16. In the literature one can find discretization error bounds for the case when $u$ is less regular, i.e., $u \in H_0^1(\Omega)$ but $u \notin H^2(\Omega)$ (cf., for example, [?]). One simple result for the case of minimal smoothness ($u \in H_0^1(\Omega)$ only) is given in:

Theorem 3.4.3 Assume that the conditions of theorem 2.5.3 are fulfilled. Let $u_h$ be the solution of (3.43). Then we have:
$$\lim_{h \to 0} \|u - u_h\|_1 = 0.$$

Proof. Define $V := H_0^1(\Omega) \cap H^2(\Omega)$. Note that $\bar V^{\|\cdot\|_1} = H_0^1(\Omega)$. Take $\epsilon > 0$. From corollary 3.1.2 we obtain
$$\|u - u_h\|_1 \leq C \inf_{v_h \in \mathbb{X}_{h,0}^k} \|u - v_h\|_1. \qquad (3.45)$$
There exists $v \in V$ such that
$$\|u - v\|_1 \leq \frac{\epsilon}{2C}. \qquad (3.46)$$
From corollary 3.3.10 it follows that $\|v - I_X^k v\|_1 \leq C h |v|_2$, and thus for $h$ sufficiently small we have
$$\|v - I_X^k v\|_1 \leq \frac{\epsilon}{2C}. \qquad (3.47)$$
Combination of (3.45), (3.46) and (3.47) yields
$$\|u - u_h\|_1 \leq C \|u - I_X^k v\|_1 \leq C \big( \|u - v\|_1 + \|v - I_X^k v\|_1 \big) \leq \epsilon,$$
and thus the result is proved. □

Remark 3.4.4 Comment on results for cases with other boundary conditions .... □
3.4.2 Error bounds in the norm $\|\cdot\|_{L^2}$

In this section we derive a bound for the discretization error in the $L^2$-norm. For the analysis we will need a duality argument, i.e., an argument in which the dual problem of the given variational problem (3.41) plays a role. For $k(\cdot,\cdot)$ and $f(\cdot)$ as in (3.42a) we define the dual problem by
$$\text{find } \tilde u \in H_0^1(\Omega) \text{ such that } k(v, \tilde u) = f(v) \text{ for all } v \in H_0^1(\Omega). \qquad (3.48)$$
Note that if $k(\cdot,\cdot)$ is continuous and $H_0^1(\Omega)$-elliptic then this dual problem has a unique solution. The dual problem is said to be $H^2$-regular (cf. section 2.5.4) if
$$\exists\, C: \quad \|\tilde u\|_2 \leq C \|f\|_{L^2} \quad \text{for all } f \in L^2(\Omega).$$
The following result concerning the finite element discretization error holds:

Theorem 3.4.5 Suppose that the assumptions of theorem 3.4.1 [theorem 3.4.2] are fulfilled and that the dual problem (3.48) is $H^2$-regular. For $2 \leq m \leq k+1$ the inequality
$$\|u - u_h\|_{L^2} \leq C h^m |u|_m$$
holds, with a constant $C$ independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. We give the proof for the case of simplicial finite elements. Exactly the same arguments can be used for rectangular finite elements.
The bilinear form $k(\cdot,\cdot)$ is continuous and $H_0^1(\Omega)$-elliptic and thus the problem (3.41), its Galerkin discretization and the dual problem (3.48) are uniquely solvable. Define $e_h = u - u_h$ and note that $e_h \in H_0^1(\Omega)$. Let $\tilde u \in H_0^1(\Omega) \cap H^2(\Omega)$ be the solution of the dual problem
$$k(v, \tilde u) = \int_\Omega e_h v\, dx \quad \text{for all } v \in H_0^1(\Omega).$$
Using the Galerkin orthogonality, $k(e_h, v_h) = 0$ for all $v_h \in \mathbb{X}_{h,0}^k$, we get
$$\|e_h\|_{L^2}^2 = \int_\Omega e_h^2\, dx = k(e_h, \tilde u) = k(e_h, \tilde u - I_X^k \tilde u) \leq C \|e_h\|_1\, \|\tilde u - I_X^k \tilde u\|_1. \qquad (3.49)$$
From corollary 3.3.10 and the $H^2$-regularity of the dual problem we obtain
$$\|\tilde u - I_X^k \tilde u\|_1 \leq C h |\tilde u|_2 \leq C h \|e_h\|_{L^2}. \qquad (3.50)$$
Combining (3.49) and (3.50) results in
$$\|e_h\|_{L^2} \leq C h \|e_h\|_1.$$
Now apply theorem 3.4.1 [theorem 3.4.2]. □
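The two orders predicted by theorem 3.4.1 and theorem 3.4.5 (for $k = 1$, $m = 2$: $O(h)$ in $H^1$, $O(h^2)$ in $L^2$) can be observed with the 1D assembly from the sketch after (3.12). The problem below is an illustrative choice with known exact solution, not one fixed by the text:

    import numpy as np

    # solve -u'' = pi^2 sin(pi x), u(0)=u(1)=0 (exact u = sin(pi x)) with P1 elements
    def solve_and_errors(N):
        h = 1.0 / N
        x = np.linspace(0.0, 1.0, N + 1)
        K = (np.diag(2.0 * np.ones(N - 1)) - np.diag(np.ones(N - 2), 1)
             - np.diag(np.ones(N - 2), -1)) / h
        b = np.pi ** 2 * np.sin(np.pi * x[1:-1]) * h        # lumped load vector
        uh = np.concatenate([[0.0], np.linalg.solve(K, b), [0.0]])
        slope = np.diff(uh) / h
        e0sq = e1sq = 0.0
        for g in (-1.0 / np.sqrt(3.0), 1.0 / np.sqrt(3.0)): # 2-pt Gauss per cell
            q = 0.5 * (x[:-1] + x[1:]) + 0.5 * h * g
            uh_q = uh[:-1] + slope * (q - x[:-1])
            e0sq += 0.5 * h * np.sum((np.sin(np.pi * q) - uh_q) ** 2)
            e1sq += 0.5 * h * np.sum((np.pi * np.cos(np.pi * q) - slope) ** 2)
        return np.sqrt(e0sq), np.sqrt(e1sq)

    for N in [8, 16, 32, 64]:
        e0, e1 = solve_and_errors(N)
        print(f"h=1/{N}: L2 error {e0:.2e}, H1 semierror {e1:.2e}")

The $L^2$ column should shrink by a factor of about 4 per refinement, the $H^1$ column by about 2, illustrating the extra order gained by the duality argument.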
Remark 3.4.6 Comment on sufficient conditions for $H^2$-regularity of the dual problem ... □

3.5 Stiffness matrix

In this section we consider the discrete problem in (3.43) with a bilinear form and right-hand side as in (3.42). We will discuss properties of the linear system described in (3.12). For this we need a suitable basis of the finite element space $\mathbb{X}_{h,0}^k$. The following lemma gives a general tool for constructing a basis in some finite element space.
Lemma 3.5.1 Let $H$ be a finite dimensional vector space. Assume that for $i = 1, \ldots, N$ we have $\phi_i \in H$ and $\psi_i \in H'$ such that the following conditions are satisfied:
$$\psi_i(\phi_i) \neq 0 \ \text{for all } i, \qquad \psi_i(\phi_j) = 0 \ \text{for all } i \neq j, \qquad (3.51a)$$
$$\text{for all } v \in H,\ v \neq 0: \quad \psi_i(v) \neq 0 \ \text{for some } i. \qquad (3.51b)$$
Then $(\phi_i)_{1 \leq i \leq N}$ forms a basis of $H$.

Proof. Let $\beta_1, \ldots, \beta_N$ be such that $\sum_{j=1}^N \beta_j \phi_j = 0$. Using (3.51a) we get
$$0 = \psi_i\Big( \sum_{j=1}^N \beta_j \phi_j \Big) = \sum_{j=1}^N \beta_j \psi_i(\phi_j) = \beta_i \psi_i(\phi_i) \quad \text{for } i = 1, \ldots, N,$$
and thus $\beta_i = 0$ for all $i$. This yields that $\phi_i$, $i = 1, \ldots, N$, are independent. Hence, $N \leq k := \dim(H) = \dim(H')$ holds. We now show that $N \geq k$ holds, too. Let $v_1, \ldots, v_k$ be a basis of $H$. Define the matrix $L \in \mathbb{R}^{N \times k}$ by $L_{ij} = \psi_i(v_j)$. Let $x \in \mathbb{R}^k$ be such that $Lx = 0$. We then have
$$\sum_{j=1}^k \psi_i(v_j)\, x_j = \psi_i\Big( \sum_{j=1}^k x_j v_j \Big) = 0 \quad \text{for all } i = 1, \ldots, N.$$
Using (3.51b) this yields $\sum_{j=1}^k x_j v_j = 0$ and thus $x = 0$. Hence, $L$ has full column rank and thus $N \geq k$ holds. □
The set $(\psi_i)_{1 \leq i \leq N}$ as in (3.51) forms a basis of $H'$ and is called the dual basis of $(\phi_i)_{1 \leq i \leq N}$.
We now construct a so-called nodal basis of the space of simplicial finite elements $\mathbb{X}_{h,0}^k$. We will associate a basis function to each interpolation point in the principal lattice of $T \in \mathcal{T}_h$ that does not lie on $\partial\Omega$. To make this more precise, for an admissible triangulation $\mathcal{T}_h$, consisting of $n$-simplices, we introduce the grid
$$\cup_{T \in \mathcal{T}_h} \{\, x_j \in L_k(T) \mid x_j \notin \partial\Omega \,\} =: \{x_1, \ldots, x_N\} =: \mathcal{V}$$
with $x_i \neq x_j$ for all $i \neq j$. For each $x_i \in \mathcal{V}$ we define a corresponding function $\phi_i$ as follows:
$$\forall\, T \in \mathcal{T}_h: \quad (\phi_i)|_T \in \mathbb{P}_k \ \text{ and } \ \forall\, x_j \in L_k(T): \ \phi_i(x_j) = \begin{cases} 0 & \text{if } x_j \neq x_i, \\ 1 & \text{if } x_j = x_i. \end{cases} \qquad (3.52)$$
From lemma 3.3.3 it follows that for all $k \geq 1$ we have $\phi_i \in \mathbb{X}_{h,0}^k$. Thus we have a collection of functions $(\phi_i)_{1 \leq i \leq N}$ with the properties:
$$\phi_i \in \mathbb{X}_{h,0}^k; \qquad \forall\, x_j \in \mathcal{V}: \ \phi_i(x_j) = \delta_{ij}, \quad 1 \leq i \leq N. \qquad (3.53)$$

Lemma 3.5.2 The functions $(\phi_i)_{1 \leq i \leq N}$ form a basis of $\mathbb{X}_{h,0}^k$.

Proof. Introduce the linear functionals $\psi_i \in (\mathbb{X}_{h,0}^k)'$:
$$\psi_i(u) = u(x_i) \quad \text{for } u \in \mathbb{X}_{h,0}^k, \ x_i \in \mathcal{V}, \ i = 1, \ldots, N.$$
One easily verifies that for $\phi_i$, $\psi_i$, $i = 1, \ldots, N$, the conditions of lemma 3.5.1 are satisfied. □

Due to the property $\phi_i(x_j) = \delta_{ij}$ the functions $\phi_i$ are called nodal basis functions. In exactly the same way one can construct nodal basis functions for other finite element spaces like, for example, $\mathbb{X}_h^k$, $\mathbb{Q}_{h,0}^k$, $\mathbb{Q}_h^k$.

We consider the discrete problem (3.43) and use the nodal basis $(\phi_i)_{1 \leq i \leq N}$ of $\mathbb{X}_{h,0}^k$ to reformulate this problem as explained in (3.11)-(3.12). This results in the linear system of equations
$$K_h \mathbf{v}_h = b_h, \quad \text{with } (K_h)_{ij} = k(\phi_j, \phi_i), \ (b_h)_i = f(\phi_i), \ 1 \leq i, j \leq N. \qquad (3.54)$$

The matrix $K_h$ is called the stiffness matrix. In the remainder of this section we derive some important properties of this matrix that will play an important role in chapters 6-9, where we discuss iterative solution methods for the linear system in (3.54). Below we assume that for the bilinear form $k(\cdot,\cdot)$ the conditions (3.42b)-(3.42d) are satisfied.

The stiffness matrix is sparse.

We introduce
$$q_{\mathrm{row}}(K_h) = \text{maximum number of nonzero entries per row in } K_h,$$
$$q_{\mathrm{col}}(K_h) = \text{maximum number of nonzero entries per column in } K_h.$$
Lemma 3.5.3 Let $\{\mathcal{T}_h\}$ be a regular family of triangulations consisting of $n$-simplices and for each $\mathcal{T}_h$ let $K_h$ be the stiffness matrix defined in (3.54). There exists a constant $q$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that
$$\max\{\, q_{\mathrm{row}}(K_h),\ q_{\mathrm{col}}(K_h) \,\} \leq q.$$
Proof. Take a fixed $i$ with $1 \leq i \leq N$. Define a neighbourhood of $x_i$ by
$$N_{x_i} := \cup\{\, T \in \mathcal{T}_h \mid x_i \in L_k(T) \,\} = \operatorname{supp}(\phi_i).$$
From the assumption that we have a regular family of triangulations it follows that
$$|N_{x_i}| \leq M \qquad (3.55)$$
with a constant $M$ independent of $i$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$, where $|N_{x_i}|$ denotes the number of simplices in $N_{x_i}$. Assume that $(K_h)_{ij} \neq 0$. Using the fact that we have a nodal basis it follows that $x_j \in N_{x_i}$, i.e., $x_j$ is a lattice point in $N_{x_i}$. Using (3.55) we get that the number of lattice points in $N_{x_i}$ can be bounded by a constant, say $q$, independent of $i$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$. Hence $q_{\mathrm{row}}(K_h) \leq q$ holds. The same arguments apply if one interchanges $i$ and $j$. □

Note that the constant $q$ depends on the degree $k$ used in the finite element space $\mathbb{X}_{h,0}^k$. The result of this lemma shows that the number of nonzero entries in the $N \times N$-matrix $K_h$ is bounded by $qN$. If $h \to 0$ then $N \to \infty$ and the number of nonzero entries in $K_h$ is proportional to $N$. Therefore the stiffness matrix is said to be sparse.
The stiffness matrix is positive definite.

Lemma 3.5.4 For the stiffness matrix defined in (3.54) we have:
$$K_h + K_h^T \ \text{is symmetric positive definite.}$$
Proof. Take $\mathbf{v} \in \mathbb{R}^N$, $\mathbf{v} \neq 0$ and define $u = \sum_{j=1}^N v_j \phi_j \in \mathbb{X}_{h,0}^k$. Note that $u \neq 0$. Using the fact that the bilinear form is elliptic we get
$$\mathbf{v}^T (K_h + K_h^T) \mathbf{v} = 2\, \mathbf{v}^T K_h \mathbf{v} = 2\, k(u, u) > 0,$$
and thus the symmetric matrix $K_h + K_h^T$ is positive definite. □

As a direct consequence we have:

Corollary 3.5.5 If in (3.42) we have $b = 0$ then the bilinear form $k(\cdot,\cdot)$ is symmetric and the stiffness matrix $K_h$ is symmetric positive definite.

The stiffness matrix is ill-conditioned.

We now derive sharp bounds for the condition number of the stiffness matrix. We restrict ourselves to the case $b = 0$ in (3.42). Then the stiffness matrix $K_h$ is symmetric positive definite and its spectral condition number is given by
$$\kappa(K_h) = \|K_h\|_2\, \|K_h^{-1}\|_2 = \frac{\lambda_{\max}(K_h)}{\lambda_{\min}(K_h)}.$$
We first give a result (due to [89]) on diagonal scaling of a symmetric positive definite matrix. We use the notation $D_A := \operatorname{diag}(A)$ for a square matrix $A$.
Lemma 3.5.6 Let $A \in \mathbb{R}^{N \times N}$ be a symmetric positive definite matrix and let $q$ be such that $q_{\mathrm{row}}(A) \leq q$. For any nonsingular diagonal matrix $D \in \mathbb{R}^{N \times N}$ we have
$$\kappa\big( D_A^{-\frac12} A D_A^{-\frac12} \big) \leq q\, \kappa(DAD).$$
Proof. Define $\tilde A = D_A^{-\frac12} A D_A^{-\frac12}$ and note that this matrix is symmetric positive definite and $\operatorname{diag}(\tilde A) = I$. Let $\tilde A = LL^T$ be the Cholesky factorization of $\tilde A$. Let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^N$. Then we have
$$\|L^T e_i\|_2^2 = \langle L^T e_i, L^T e_i \rangle = \langle \tilde A e_i, e_i \rangle = \tilde A_{ii} = 1,$$
$$|\tilde A_{ij}| = |\langle L^T e_j, L^T e_i \rangle| \leq \|L^T e_j\|_2\, \|L^T e_i\|_2 = 1,$$
and thus $\|\tilde A\|_2 \leq \|\tilde A\|_\infty \leq q$ holds. For an arbitrary nonsingular diagonal matrix $D \in \mathbb{R}^{N \times N}$ we have:
$$\kappa(\tilde A) = \|\tilde A\|_2\, \|\tilde A^{-1}\|_2 \leq q\, \|L^{-T} L^{-1}\|_2 = q\, \|L^{-1}\|_2^2 \leq q\, \|L^{-1} D^{-1}\|_2^2\, \max_j |D_{jj}|^2 = q\, \|L^{-1} D^{-1}\|_2^2\, \max_j \|L^T D e_j\|_2^2 \leq q\, \|L^{-1} D^{-1}\|_2^2\, \|L^T D\|_2^2 = q\, \|D^{-1} \tilde A^{-1} D^{-1}\|_2\, \|D \tilde A D\|_2 = q\, \kappa(D \tilde A D).$$
Since $D \tilde A D = \big( D D_A^{-\frac12} \big) A \big( D_A^{-\frac12} D \big)$ and $D D_A^{-\frac12}$ runs over all nonsingular diagonal matrices as $D$ does, this shows that the desired result holds. □
The result in this lemma shows that for the sparse symmetric positive definite stiffness matrix $K_h$ the symmetric scaling with the diagonal matrix $D_{K_h}$ is in a certain sense optimal. Hence, we investigate the condition number of the scaled matrix
$$\tilde K_h := D_{K_h}^{-\frac12} K_h D_{K_h}^{-\frac12}. \qquad (3.56)$$
The following result is based on the analysis presented in [9].

Theorem 3.5.7 Suppose $b = 0$ in (3.42). Let $\{\mathcal{T}_h\}$ be a regular family of triangulations consisting of $n$-simplices and for each $\mathcal{T}_h$ let $K_h$ be the stiffness matrix defined in (3.54). Then there exists a constant $C$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that
$$\kappa(\tilde K_h) \leq \begin{cases} C N \big( 1 + \log \frac{h}{h_{\min}} \big) & \text{if } n = 2, \\ C N^{\frac23} & \text{if } n = 3, \end{cases} \qquad (3.57)$$
with $h_{\min} = \min\{\, h_T \mid T \in \mathcal{T}_h \,\}$.

Proof. We need the following embedding results (cf. [1], theorem 5.4):
$$H^1(\Omega) \hookrightarrow L^6(\Omega) \quad \text{for } n = 3, \qquad (3.58)$$
$$H^1(\Omega) \hookrightarrow L^q(\Omega) \quad \text{for } n = 2, \ 1 \leq q < \infty. \qquad (3.59)$$
For the embedding in (3.59) one can analyze the dependence of the norm of the embedding operator on $q$. This results in (cf. [9]):
$$\|u\|_{L^q} \leq C \sqrt{q}\, \|u\|_1 \quad \text{for all } u \in H^1(\Omega), \ q \geq 1, \qquad (3.60)$$
with a constant $C$ independent of $u$ and $q$. Note that if for $c_1 > 0$, $c_2$ we have
$$c_1 \langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq \langle K_h \mathbf{v}, \mathbf{v} \rangle \leq c_2 \langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \quad \text{for all } \mathbf{v} \in \mathbb{R}^N, \qquad (3.61)$$
then $\kappa(\tilde K_h) \leq \frac{c_2}{c_1}$ holds.
For $\mathbf{v} \in \mathbb{R}^N$ we define $u := \sum_{i=1}^N v_i \phi_i$. Note that each nodal basis function $\phi_i$ is associated to a grid point $x_i$ such that $\phi_i(x_i) = 1$, $\phi_i(x_j) = 0$ for $j \neq i$. The set of grid points $(x_i)_{1 \leq i \leq N}$ is denoted by $\mathcal{V}$. We have
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle = \sum_{x_i \in \mathcal{V}} (D_{K_h})_{ii}\, u(x_i)^2 = \sum_{x_i \in \mathcal{V}} k(\phi_i, \phi_i)\, u(x_i)^2. \qquad (3.62)$$
There are constants $d_1 > 0$ and $d_2$ such that $d_1 |\phi_i|_1^2 \leq k(\phi_i, \phi_i) \leq d_2 |\phi_i|_1^2$ for all $i$. Using the lemmas 3.3.5 and 3.3.6 one can show that there are constants $d_1 > 0$ and $d_2$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that
$$d_1\, h_T^{-2} |T| \leq k(\phi_i, \phi_i) \leq d_2\, h_T^{-2} |T| \quad \text{for all } T \in \mathcal{T}_h \text{ and all } x_i \in T. \qquad (3.63)$$
Combination of (3.62) and (3.63) yields
$$d_1 \sum_{T \in \mathcal{T}_h} h_T^{-2} \|u\|_{0,T}^2 \leq \langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq d_2 \sum_{T \in \mathcal{T}_h} h_T^{-2} \|u\|_{0,T}^2 \qquad (3.64)$$
with constants $d_1 > 0$ and $d_2$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ and of $\mathbf{v} \in \mathbb{R}^N$. For $T \in \mathcal{T}_h$ let $F(\hat x) = B\hat x + c$ be an affine mapping with $F(\hat T) = T$, where $\hat T$ is the unit $n$-simplex. From
$$\langle K_h \mathbf{v}, \mathbf{v} \rangle = k(u, u) \leq C |u|_1^2 = C \sum_{T \in \mathcal{T}_h} |u|_{1,T}^2 \leq C \sum_{T \in \mathcal{T}_h} |\hat u|_{1, \hat T}^2\, h_T^{-2}\, |\det B| \leq C \sum_{T \in \mathcal{T}_h} \|\hat u\|_{0, \hat T}^2\, h_T^{-2}\, |\det B| \quad (\text{norm equivalence on } \mathbb{P}_k(\hat T)) \\ = C \sum_{T \in \mathcal{T}_h} h_T^{-2} \|u\|_{0,T}^2 \leq C \langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \qquad (3.65)$$
it follows that the second inequality in (3.61) holds with a constant $c_2$ independent of $\mathbf{v}$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$. We now consider the first inequality in (3.61). First note that for arbitrary $\beta > 2$, $\gamma \geq 0$ and $w \in L^\beta(\Omega)$ we have, using the discrete Holder inequality:
$$\sum_{T \in \mathcal{T}_h} h_T^{-\gamma} \int_T w^2\, dx \leq \Big( \sum_{T \in \mathcal{T}_h} h_T^{-\gamma \frac{\beta}{\beta-2}} \Big)^{1 - \frac{2}{\beta}} \Big( \sum_{T \in \mathcal{T}_h} \Big( \int_T w^2\, dx \Big)^{\frac{\beta}{2}} \Big)^{\frac{2}{\beta}} \leq \Big( \sum_{T \in \mathcal{T}_h} h_{\min}^{-\gamma \frac{\beta}{\beta-2}} \Big)^{1 - \frac{2}{\beta}} \|w\|_{L^\beta(\Omega)}^2 \leq C\, h_{\min}^{-\gamma}\, N^{1 - \frac{2}{\beta}}\, \|w\|_{L^\beta(\Omega)}^2. \qquad (3.66)$$
We now distinguish $n = 3$ and $n = 2$. First we treat $n = 3$. We use the Holder inequality to get
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C \sum_{T \in \mathcal{T}_h} h_T^{-2} \|u\|_{0,T}^2 = C \sum_{T \in \mathcal{T}_h} h_T^{-2} \int_T u^2\, dx \leq C \sum_{T \in \mathcal{T}_h} \Big( \int_T h_T^{-2p}\, dx \Big)^{\frac1p} \Big( \int_T u^{2q}\, dx \Big)^{\frac1q} \quad \Big( \frac1p + \frac1q = 1 \Big) \leq C \sum_{T \in \mathcal{T}_h} h_T^{\frac{3-2p}{p}} \Big( \int_T u^{2q}\, dx \Big)^{\frac1q}.$$
Now take $p = \frac32$, $q = 3$ and apply (3.66) with $\gamma = 0$, $\beta = 6$. This yields
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C \sum_{T \in \mathcal{T}_h} \Big( \int_T u^6\, dx \Big)^{\frac13} \leq C N^{\frac23} \|u\|_{L^6(\Omega)}^2.$$
We use the embedding result (3.58) and thus obtain
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C N^{\frac23} \|u\|_1^2 \leq C N^{\frac23} k(u, u) = C N^{\frac23} \langle K_h \mathbf{v}, \mathbf{v} \rangle. \qquad (3.67)$$
Combination of the results in (3.65) and (3.67) proves the result in (3.57) for $n = 3$.
We consider $n = 2$. Using the Holder inequality it follows that for $p > 1$:
$$\|u\|_{0,T}^2 \leq \Big( \int_T u^{2p}\, dx \Big)^{\frac1p} \Big( \int_T 1\, dx \Big)^{1 - \frac1p} \leq C h_T^{2 - \frac2p} \Big( \int_T u^{2p}\, dx \Big)^{\frac1p}.$$
Using this we get
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C \sum_{T \in \mathcal{T}_h} h_T^{-2} \|u\|_{0,T}^2 \leq C \sum_{T \in \mathcal{T}_h} h_T^{-\frac2p} \Big( \int_T u^{2p}\, dx \Big)^{\frac1p}.$$
We apply the argument of (3.66), now with $\beta = 2p > 2$ and $\gamma = \frac2p$, and use the result in (3.60). This yields
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C\, h_{\min}^{-\frac2p}\, N^{\frac{p-1}{p}}\, \|u\|_{L^{2p}(\Omega)}^2 \leq C\, p\, h_{\min}^{-\frac2p}\, N^{\frac{p-1}{p}}\, \|u\|_1^2.$$
Note that $|\Omega| \leq \sum_{T \in \mathcal{T}_h} h_T^2 \leq C N h^2$ and thus $N \geq C^{-1} h^{-2}$. We then obtain
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C\, p \Big( \frac{h}{h_{\min}} \Big)^{\frac2p} N \|u\|_1^2 \leq C\, p \Big( \frac{h}{h_{\min}} \Big)^{\frac2p} N \langle K_h \mathbf{v}, \mathbf{v} \rangle.$$
The constant $C$ can be chosen independent of $p$. For $p = \max\{2, \log \frac{h}{h_{\min}}\}$ we have $p \big( \frac{h}{h_{\min}} \big)^{\frac2p} \leq C \big( 1 + \log \frac{h}{h_{\min}} \big)$ and thus
$$\langle D_{K_h} \mathbf{v}, \mathbf{v} \rangle \leq C \Big( 1 + \log \frac{h}{h_{\min}} \Big) N \langle K_h \mathbf{v}, \mathbf{v} \rangle. \qquad (3.68)$$
Combination of the results in (3.65) and (3.68) proves the result (3.57) for $n = 2$. □

Remark 3.5.8 In [9] one can find an example which shows that for $n = 2$ the logarithmic term cannot be avoided. If the family of triangulations is quasi-uniform then $\frac{h}{h_{\min}} \leq \sigma$ for a constant $\sigma$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ and furthermore, $N = \mathcal{O}(h^{-2})$ for $n = 2$, $N^{\frac23} = \mathcal{O}(h^{-2})$ for $n = 3$. Hence, for the quasi-uniform case we have $\kappa(\tilde K_h) \leq C h^{-2}$ for $n = 2$ and $n = 3$. Moreover, in this case the diagonal of $K_h$ is well-conditioned, $\kappa(D_{K_h}) = \mathcal{O}(1)$, and thus the scaling in (3.56) is not essential. We emphasize that for the general case of a regular (possibly non quasi-uniform) family of triangulations the scaling is essential: a result as in (3.57) does in general not hold for the matrix $K_h$. Finally we note that for the quasi-uniform case it is not difficult to prove that there exists a constant $C > 0$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that $\kappa(\tilde K_h) \geq C h^{-2}$ holds, both for $n = 2$ and $n = 3$. □
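The growth $\kappa(K_h) = \mathcal{O}(h^{-2})$ in the quasi-uniform case is easy to reproduce. A short check (Python; the 1D model stiffness matrix is an illustrative stand-in for the general setting of the theorem):

    import numpy as np

    # quasi-uniform check: cond(K_h) grows like h^{-2}
    for N in [10, 20, 40, 80]:
        h = 1.0 / N
        K = (np.diag(2.0 * np.ones(N - 1)) - np.diag(np.ones(N - 2), 1)
             - np.diag(np.ones(N - 2), -1)) / h
        kappa = np.linalg.cond(K, 2)
        print(f"h=1/{N}: cond(K_h) = {kappa:9.1f},  cond * h^2 = {kappa * h * h:.3f}")

The product $\kappa(K_h)\, h^2$ settles to a constant (for this matrix, $4/\pi^2 \approx 0.405$), confirming the $h^{-2}$ behaviour.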
3.5.1 Mass matrix
Apart from the stiffness matrix the so-called mass matrix also plays an important role in finite elements. This matrix depends on the choice of the basis in the finite element space but not on the bilinear form $k(\cdot,\cdot)$.
Let $(\phi_i)_{1 \leq i \leq N}$ be the nodal basis of the finite element space $\mathbb{X}_{h,0}^k$ as defined in (3.52). The mass matrix $M_h \in \mathbb{R}^{N \times N}$ is given by
$$(M_h)_{ij} = \int_\Omega \phi_i \phi_j\, dx = \langle \phi_i, \phi_j \rangle_{L^2}. \qquad (3.69)$$
Note that this matrix is symmetric positive definite. As for the stiffness matrix we use a diagonal scaling with the diagonal matrix $D_{M_h} := \operatorname{diag}(M_h)$. The next result shows that the scaled mass matrix is uniformly well-conditioned:

Theorem 3.5.9 Let $\{\mathcal{T}_h\}$ be a regular family of triangulations consisting of $n$-simplices and for each $\mathcal{T}_h$ let $M_h$ be the mass matrix defined in (3.69). Then there exists a constant $C$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that
$$\kappa\big( D_{M_h}^{-\frac12} M_h D_{M_h}^{-\frac12} \big) \leq C.$$

Proof. Take $\mathcal{T}_h \in \{\mathcal{T}_h\}$. For $\mathbf{v} \in \mathbb{R}^N$ we define $u := \sum_{i=1}^N v_i \phi_i$. The constants that appear in the proof are independent of $\mathcal{T}_h$ and of $\mathbf{v}$. For each $T \in \mathcal{T}_h = \{T\}$ let $F : \hat T \to T$ be an affine transformation between the unit simplex $\hat T$ and $T$. Furthermore, $\hat u := u \circ F$. We use the index set
$$I_T := \{\, i \mid T \subset \operatorname{supp}(\phi_i) \,\}.$$
Note that $|I_T|$ is uniformly bounded. We have
$$\langle M_h \mathbf{v}, \mathbf{v} \rangle = \langle u, u \rangle_{L^2} = \sum_{T \in \mathcal{T}_h} |u|_{0,T}^2. \qquad (3.70)$$
The nodal point associated to $\phi_i$ is denoted by $x_i$ ($1 \leq i \leq N$). Using lemma 3.3.6 and the norm equivalence property in the space $\mathbb{P}_k(\hat T)$ it follows that there exist constants $c_1 > 0$ and $c_2$ such that
$$c_1 |u|_{0,T}^2 \leq |T| \sum_{z_i \in L_k(T)} u(z_i)^2 \leq c_2 |u|_{0,T}^2,$$
and thus, using $u(x_i) = v_i$, we get
$$c_1 |u|_{0,T}^2 \leq |T| \sum_{i \in I_T} v_i^2 \leq c_2 |u|_{0,T}^2.$$
Define $d_i := |\operatorname{supp}(\phi_i)|$. For $i \in I_T$ the quantity $|T|\, d_i^{-1}$ is uniformly (w.r.t. $T$) bounded both from below by a strictly positive constant and from above (by 1). If we combine this with the result in (3.70) we get (with different constants $c_1 > 0$, $c_2$):
$$c_1 \langle M_h \mathbf{v}, \mathbf{v} \rangle \leq \sum_{T \in \mathcal{T}_h} \sum_{i \in I_T} d_i v_i^2 \leq c_2 \langle M_h \mathbf{v}, \mathbf{v} \rangle.$$
Hence
$$c_1 \langle M_h \mathbf{v}, \mathbf{v} \rangle \leq \sum_{i=1}^N d_i v_i^2 \leq c_2 \langle M_h \mathbf{v}, \mathbf{v} \rangle$$
with $c_1 > 0$. Note that
$$(D_{M_h})_{ii} = \langle M_h e_i, e_i \rangle = \int_{\operatorname{supp}(\phi_i)} \phi_i^2\, dx,$$
thus there are constants $c_1 > 0$, $c_2$ independent of $i$ such that $c_1 d_i \leq (D_{M_h})_{ii} \leq c_2 d_i$. We then obtain
$$c_1 \langle M_h \mathbf{v}, \mathbf{v} \rangle \leq \langle D_{M_h} \mathbf{v}, \mathbf{v} \rangle \leq c_2 \langle M_h \mathbf{v}, \mathbf{v} \rangle$$
with $c_1 > 0$. Thus the result is proved. □
Corollary 3.5.10 Let $\{\mathcal{T}_h\}$ be a quasi-uniform family of triangulations consisting of $n$-simplices and for each $\mathcal{T}_h$ let $M_h$ be the mass matrix defined in (3.69). Then there exists a constant $C$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that
$$\kappa(M_h) \leq C.$$

Proof. Note that
$$(M_h)_{ii} = \int_{\operatorname{supp}(\phi_i)} \phi_i^2\, dx.$$
Using this in combination with the quasi-uniformity of $\{\mathcal{T}_h\}$ it follows that the spectral condition number of $D_{M_h} = \operatorname{diag}(M_h)$ is uniformly bounded. Now apply theorem 3.5.9. □
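In contrast to the stiffness matrix, the conditioning of the (quasi-uniform) mass matrix does not deteriorate as $h \to 0$. A quick illustration (Python; 1D P1 mass matrix on a uniform grid as an illustrative stand-in):

    import numpy as np

    # the quasi-uniform mass matrix has an h-independent condition number
    for N in [10, 40, 160]:
        h = 1.0 / N
        M = h / 6.0 * (np.diag(4.0 * np.ones(N - 1)) + np.diag(np.ones(N - 2), 1)
                       + np.diag(np.ones(N - 2), -1))
        print(f"h=1/{N}: cond(M_h) = {np.linalg.cond(M, 2):.3f}")

The printed values approach a fixed constant (here 3) as the grid is refined.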
3.6 Isoparametric finite elements

See Handbook Ciarlet, chapter 6.

3.7 Nonconforming finite elements
Chapter 4

Finite element discretization of a convection-diffusion problem

4.1 Introduction
In this chapter we consider the convection-diffusion boundary value problem
$$-\varepsilon \Delta u + b \cdot \nabla u + cu = f \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega,$$
with a constant $\varepsilon \in (0, 1]$, $b_i \in H^1(\Omega) \cap L^\infty(\Omega)$ $\forall i$, $c \in L^\infty(\Omega)$ and $f \in L^2(\Omega)$. Furthermore, we also assume the smoothness property $\operatorname{div} b \in L^\infty(\Omega)$. The weak formulation of the problem is analyzed in section 2.5.2. We introduce
$$k(u, v) = \int_\Omega \varepsilon \nabla u \cdot \nabla v + b \cdot \nabla u\, v + cuv\, dx, \qquad f(v) = \int_\Omega f v\, dx. \qquad (4.1)$$
The weak formulation of this convection-diffusion problem is as follows:
$$\text{find } u \in H_0^1(\Omega) \text{ such that } k(u, v) = f(v) \text{ for all } v \in H_0^1(\Omega). \qquad (4.2)$$
In theorem 2.5.3 it is shown that if we assume
$$-\frac12 \operatorname{div} b + c \geq 0 \quad \text{in } \Omega, \qquad (4.3)$$
then this variational problem has a unique solution. In this chapter we treat the finite element discretization of the problem (4.2) for the convection-dominated case, i.e., $\varepsilon \ll \max_i \|b_i\|_{L^\infty}$. Then the problem is singularly perturbed and the standard finite element method in general yields a poor approximation of the continuous solution. A significant improvement results if one introduces suitable artificial stabilization terms in the Galerkin discretization. For such a stabilization many different techniques exist, which leads to the large class of so-called stabilized finite element methods that are known in the literature. In section 4.4.2 we will explain and analyze one very popular method from this class, namely the streamline diffusion finite element method (SDFEM). In section 4.3 we consider a simple one-dimensional convection-diffusion equation to illustrate a few basic phenomena related to standard finite element discretization. To gain a better understanding of the (poor) behaviour of the standard finite element method in the convection-dominated case we reconsider its discretization error analysis in section 4.4.
In the remainder of this section we briefly discuss the topic of regularity of the variational problem (4.2). In section 2.5.4 regularity results of the form $\|u\|_m \leq C \|f\|_{m-2}$, $m = 1, 2, \ldots$, with a constant $C$ independent of $f$, are presented (with smoothness assumptions on the coefficients and on the domain). In the convection-dominated case it is of interest to analyze the dependence of the (stability) constant $C$ on $\varepsilon$. An important result of this analysis is given in the following theorem.

Theorem 4.1.1 Assume that
$$-\frac12 \operatorname{div} b(x) + c(x) \geq \beta_0 > 0 \quad \text{a.e. in } \Omega. \qquad (4.4)$$
Then the solution $u$ of (4.2) satisfies
$$\varepsilon^{\frac12} \|u\|_1 + \|u\|_{L^2} \leq C \|f\|_{L^2} \qquad (4.5)$$
with a constant $C$ independent of $f$ and of $\varepsilon$. Furthermore, if the regularity property $u \in H^2(\Omega)$ holds, then the inequality
$$\varepsilon^{\frac32} \|u\|_2 \leq C \|f\|_{L^2} \qquad (4.6)$$
holds, with a constant $C$ independent of $f$ and of $\varepsilon$.
Proof. Using partial integration, (4.4) and the Poincare-Friedrichs inequality we get
$$k(u, u) = \varepsilon |u|_1^2 + \int_\Omega \Big( -\frac12 \operatorname{div} b + c \Big) u^2\, dx \geq \varepsilon |u|_1^2 + \beta_0 \|u\|_{L^2}^2 \geq c \big( \varepsilon \|u\|_1^2 + \|u\|_{L^2}^2 \big)$$
with $c > 0$ independent of $\varepsilon$. In combination with
$$k(u, u) = f(u) \leq \|f\|_{L^2} \|u\|_{L^2} \leq \|f\|_{L^2} \big( \varepsilon \|u\|_1^2 + \|u\|_{L^2}^2 \big)^{\frac12}$$
this yields
$$\varepsilon^{\frac12} \|u\|_1 + \|u\|_{L^2} \leq 2 \big( \varepsilon \|u\|_1^2 + \|u\|_{L^2}^2 \big)^{\frac12} \leq 2\, c^{-1} \|f\|_{L^2},$$
and thus the result in (4.5) holds. If $u \in H^2(\Omega)$ then the equality $-\varepsilon \Delta u + b \cdot \nabla u + cu = f$ holds in $L^2(\Omega)$ (where all derivatives are weak ones). Hence, using (4.5) and $\varepsilon \leq 1$, we obtain
$$\varepsilon \|\Delta u\|_{L^2} \leq \|f\|_{L^2} + \|b\|_{L^\infty} \|u\|_1 + \|c\|_{L^\infty} \|u\|_{L^2} \leq c\, \varepsilon^{-\frac12} \|f\|_{L^2} \qquad (4.7)$$
with a constant $c$ independent of $f$ and $\varepsilon$. We use the following result (lemma 8.1 in [57]):
$$\exists\, c: \quad \|v\|_2 \leq c \big( \|v\|_{L^2} + \|\Delta v\|_{L^2} \big) \quad \text{for all } v \in H^2(\Omega) \cap H_0^1(\Omega).$$
Combination of this with (4.7) and (4.5) yields
$$\varepsilon^{\frac32} \|u\|_2 \leq c \big( \varepsilon^{\frac32} \|\Delta u\|_{L^2} + \varepsilon^{\frac32} \|u\|_{L^2} \big) \leq c \|f\|_{L^2},$$
and thus the result (4.6) holds. □
Remark 4.1.2 The constants in (4.5) and (4.6) depend on $\beta_0$ in (4.4). For the analysis the assumption $\beta_0 > 0$ is essential. For the case $\beta_0 \geq 0$ a slight modification of the analysis results in a stability bound
$$\varepsilon^{\frac12} \|u\|_1 + \sqrt{\beta_0}\, \|u\|_{L^2} \leq C \min\big\{ \varepsilon^{-\frac12},\ \beta_0^{-\frac12} \big\} \|f\|_{L^2},$$
with a constant $C$ independent of $f$, $\varepsilon$ and $\beta_0$. □

The results in theorem 4.1.1 indicate that derivatives of the solution $u$ (e.g., $\|u\|_1$) grow if $\varepsilon \downarrow 0$. This is due to the fact that in general in such a convection-diffusion problem there are boundary and internal layers in which the solution (or some of its derivatives) can vary exponentially. For an analysis of these boundary layers we refer to the literature, e.g. [76]. In certain special cases it is possible to obtain bounds on the derivative in streamline direction which are significantly better than the general bound $\|u\|_1 \leq C \varepsilon^{-\frac12} \|f\|_{L^2}$ in (4.5). We now present two such results. The first one is for a relatively simple one-dimensional problem, whereas the second one is related to a two-dimensional convection-diffusion problem with Neumann boundary conditions on the outflow boundary.
Theorem 4.1.3 For $f \in L^2([0,1])$ consider the following problem (with weak derivatives):
$$-\varepsilon u''(x) + u'(x) = f(x) \ \text{ for } x \in (0,1), \qquad u(0) = u(1) = 0,$$
with $\varepsilon \in (0, 1]$. The unique solution $u$ satisfies:
$$\max\{\, \|u\|_{L^\infty},\ \varepsilon \|u'\|_{L^\infty} \,\} \leq (1 - e^{-1})^{-1} \|f\|_{L^1}, \qquad (4.8)$$
$$\|u'\|_{L^1} \leq 2\, (1 - e^{-1})^{-1} \|f\|_{L^1}. \qquad (4.9)$$

Proof. We reformulate the differential equation in the equivalent form
$$-\big( e^{-x/\varepsilon} u'(x) \big)' = \frac1\varepsilon e^{-x/\varepsilon} f(x) =: g(x).$$
In textbooks on ordinary differential equations (e.g. [95]) it is shown that the solution can be represented using a Green's function. For this we introduce the two fundamental solutions $u_1(x) = 1 - e^{x/\varepsilon}$ and $u_2(x) = 1 - e^{(x-1)/\varepsilon}$ (note: $u_1$ and $u_2$ satisfy the homogeneous differential equation and $u_1(0) = u_2(1) = 0$). The solution is given by
$$u(x) = \int_0^1 G(x, t)\, g(t)\, dt, \qquad G(x, t) := \frac{-\varepsilon}{1 - e^{-1/\varepsilon}} \begin{cases} u_1(t)\, u_2(x) & \text{if } t \leq x, \\ u_1(x)\, u_2(t) & \text{if } t > x. \end{cases}$$
We use $C := (1 - e^{-1/\varepsilon})^{-1}$. Note that $C \leq (1 - e^{-1})^{-1}$ for $\varepsilon \in (0, 1]$. Using $g(t) = \frac1\varepsilon e^{-t/\varepsilon} f(t)$ we get (for $x \in [0,1]$)
$$u(x) = C u_2(x) \int_0^x (1 - e^{-t/\varepsilon}) f(t)\, dt - C u_1(x)\, e^{-x/\varepsilon} \int_x^1 e^{(x-t)/\varepsilon} \big( 1 - e^{(t-1)/\varepsilon} \big) f(t)\, dt.$$
Note that $|u_2(x)| \leq 1$, $|u_1(x)|\, e^{-x/\varepsilon} \leq 1$. We obtain
$$|u(x)| \leq C \int_0^x |f(t)|\, dt + C \int_x^1 |f(t)|\, dt = C \|f\|_{L^1}. \qquad (4.10)$$
From
$$u'(x) = C u_2'(x) \int_0^x (1 - e^{-t/\varepsilon}) f(t)\, dt - C u_1'(x) \int_x^1 e^{-t/\varepsilon} \big( 1 - e^{(t-1)/\varepsilon} \big) f(t)\, dt = -\frac{C}{\varepsilon} e^{(x-1)/\varepsilon} \int_0^x (1 - e^{-t/\varepsilon}) f(t)\, dt + \frac{C}{\varepsilon} \int_x^1 e^{(x-t)/\varepsilon} \big( 1 - e^{(t-1)/\varepsilon} \big) f(t)\, dt$$
we get
$$|u'(x)| \leq \frac{C}{\varepsilon} \int_0^x |f(t)|\, dt + \frac{C}{\varepsilon} \int_x^1 |f(t)|\, dt = \frac{C}{\varepsilon} \|f\|_{L^1}. \qquad (4.11)$$
Combination of (4.10) and (4.11) proves the result in (4.8). We also have
$$\int_0^1 |u'(x)|\, dx \leq \frac{C}{\varepsilon} \int_0^1 e^{(x-1)/\varepsilon} \int_0^x (1 - e^{-t/\varepsilon}) |f(t)|\, dt\, dx + \frac{C}{\varepsilon} \int_0^1 e^{x/\varepsilon} \int_x^1 e^{-t/\varepsilon} \big( 1 - e^{(t-1)/\varepsilon} \big) |f(t)|\, dt\, dx.$$
For the first term on the right-hand side we have
$$\frac1\varepsilon \int_0^1 e^{(x-1)/\varepsilon} \int_0^x (1 - e^{-t/\varepsilon}) |f(t)|\, dt\, dx =: \frac1\varepsilon \int_0^1 e^{(x-1)/\varepsilon} F(x)\, dx = F(1) - \int_0^1 e^{(x-1)/\varepsilon} \big( 1 - e^{-x/\varepsilon} \big) |f(x)|\, dx \leq F(1) \leq \int_0^1 |f(t)|\, dt = \|f\|_{L^1}.$$
The second term can be treated similarly. This then yields
$$\int_0^1 |u'(x)|\, dx \leq 2\, C \|f\|_{L^1}$$
and thus the result in (4.9). □

Note that in (4.9) we have a bound on the derivative measured in the $L^1$-norm (which is weaker than the $L^2$-norm) that is independent of $\varepsilon$. Similar results for a more general one-dimensional convection-diffusion problem are given in [76] (section 1.1.2) and [38].

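The two bounds in theorem 4.1.3 can be made vivid with a concrete example. For $f \equiv 1$ the exact solution of $-\varepsilon u'' + u' = 1$, $u(0) = u(1) = 0$ is $u(x) = x - (e^{x/\varepsilon} - 1)/(e^{1/\varepsilon} - 1)$; the following sketch (Python; this particular $f$ is our own illustrative choice) evaluates $u'$ and shows that $\|u'\|_{L^\infty}$ blows up like $1/\varepsilon$ in the outflow layer while $\|u'\|_{L^1}$ stays bounded, as (4.8) and (4.9) predict:

    import numpy as np

    for eps in [1e-1, 1e-2, 1e-3]:
        x = np.linspace(0.0, 1.0, 200001)          # fine enough to resolve the layer
        du = 1.0 - np.exp((x - 1.0) / eps) / (eps * (1.0 - np.exp(-1.0 / eps)))
        l1 = np.sum(np.abs(du)) * (x[1] - x[0])    # Riemann sum for ||u'||_L1
        print(f"eps={eps:7.0e}: ||u'||_inf = {np.max(np.abs(du)):10.1f}, "
              f"||u'||_L1 = {l1:.3f}")

Here $\|f\|_{L^1} = 1$, so (4.9) predicts $\|u'\|_{L^1} \leq 2(1-e^{-1})^{-1} \approx 3.16$, and the printed $L^1$ values indeed stay below this independently of $\varepsilon$.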
We now present a result for a two-dimensional problem.

Theorem 4.1.4 For $f \in L^2(\Omega)$, $\Omega := (0,1)^2$ and a constant $c \geq 0$, consider the convection-diffusion problem
$$-\varepsilon \Delta u + u_x + cu = f \quad \text{in } \Omega,$$
$$\frac{\partial u}{\partial x} = 0 \quad \text{on } \Gamma_E := \{\, (x, y) \in \partial\Omega \mid x = 1 \,\}, \qquad (4.12)$$
$$u = 0 \quad \text{on } \partial\Omega \setminus \Gamma_E.$$
This problem has a unique solution $u \in H^2(\Omega)$ and the inequality
$$c \|u\|_{L^2} + \|u_x\|_{L^2} \leq 2 \|f\|_{L^2} \qquad (4.13)$$
holds.
Proof. First note that the weak formulation of this problem has a unique solution $u \in H^1(\Omega)$. Using the fact that $\Omega$ is convex it follows that $u \in H^2(\Omega)$ holds and thus the problem (with weak derivatives) in (4.12) has a unique solution $u \in H^2(\Omega)$. From the differential equation we get
$$\|u_x\|_{L^2}^2 = \langle f, u_x \rangle_{L^2} + \varepsilon \langle u_{yy}, u_x \rangle_{L^2} + \varepsilon \langle u_{xx}, u_x \rangle_{L^2} - c \langle u, u_x \rangle_{L^2}.$$
Using Green's formulas and the boundary conditions for the solution $u$ we obtain, with $\Gamma_W := \{\, (x, y) \in \partial\Omega \mid x = 0 \,\}$:
$$\langle u_{yy}, u_x \rangle_{L^2} = -\frac12 \big\langle \frac{\partial}{\partial x} (u_y)^2, 1 \big\rangle_{L^2} = -\frac12 \int_{\Gamma_E} u_y^2\, dy \leq 0,$$
$$\langle u_{xx}, u_x \rangle_{L^2} = \frac12 \big\langle \frac{\partial}{\partial x} (u_x)^2, 1 \big\rangle_{L^2} = -\frac12 \int_{\Gamma_W} u_x^2\, dy \leq 0,$$
$$\langle u, u_x \rangle_{L^2} = \int_{\Gamma_E} u^2\, dy - \langle u_x, u \rangle_{L^2}, \quad \text{and thus } \langle u, u_x \rangle_{L^2} \geq 0.$$
Hence, we have
$$\|u_x\|_{L^2}^2 \leq \langle f, u_x \rangle_{L^2} \leq \|f\|_{L^2} \|u_x\|_{L^2}. \qquad (4.14)$$
Testing the differential equation with $u$ (instead of $u_x$) yields
$$c \|u\|_{L^2}^2 = \langle f, u \rangle_{L^2} - \varepsilon \|\nabla u\|_{L^2}^2 - \langle u_x, u \rangle_{L^2}.$$
This yields
$$c \|u\|_{L^2}^2 \leq \|f\|_{L^2} \|u\|_{L^2}. \qquad (4.15)$$
Combination of (4.14) and (4.15) completes the proof. □
We note that for this problem a similar $\varepsilon$-independent bound for the derivative $u_y$ does not hold. A sharp inequality of the form $\|u_y\|_{L^2} \leq \varepsilon^{-\frac12} \|f\|_{L^2}$ can be shown. Furthermore, for the uniform bound on $\|u_x\|_{L^2}$ in (4.13) to hold it is essential that we consider the convection-diffusion problem with Neumann boundary conditions at the outflow boundary. Due to this there is no exponential boundary layer at the outflow boundary.
4.2 A variant of the Cea-lemma

In the analysis of the finite element method in chapter 3 the basic Cea-lemma plays an important role. In the analysis of finite element methods for convection-dominated elliptic problems we will need a variant of this lemma that is presented in this section and is based on a basic lemma given in [94]:

Lemma 4.2.1 Let $U$ be a normed linear space with norm $\|\cdot\|$ and $V$ a subspace of $U$. Let $s(\cdot,\cdot)$ be a continuous bilinear form on $U \times V$ and $t(\cdot,\cdot)$ a bilinear form on $U \times V$ such that for all $u \in U$ the functional $v \to t(u, v)$ is bounded on $V$. Define $r := s + t$ and assume that $r$ is $V$-elliptic. Let $c_0 > 0$ and $c_1$ be such that
$$r(v, v) \geq c_0 \|v\|^2 \quad \text{for all } v \in V, \qquad (4.16)$$
$$s(u, v) \leq c_1 \|u\|\, \|v\| \quad \text{for all } u \in U, \ v \in V. \qquad (4.17)$$
On $U$ we define the semi-norm $\|u\|_* := \sup_{v \in V} \frac{t(u, v)}{\|v\|}$. Then the following holds:
$$|r(u, v)| \leq \max\{c_1, 1\} \big( \|u\| + \|u\|_* \big) \|v\| \quad \text{for all } u \in U, \ v \in V, \qquad (4.18)$$
$$\sup_{v \in V} \frac{r(u, v)}{\|v\|} \geq \frac{c_0}{1 + c_0 + c_1} \big( \|u\| + \|u\|_* \big) \quad \text{for all } u \in V. \qquad (4.19)$$
Proof. For $u \in U$, $v \in V$ we have
$$|r(u, v)| \leq |s(u, v)| + |t(u, v)| \leq c_1 \|u\|\, \|v\| + \|u\|_*\, \|v\| \leq \max\{c_1, 1\} \big( \|u\| + \|u\|_* \big) \|v\|,$$
and thus (4.18) holds. We now consider (4.19). Take a fixed $u \in V$ and $\delta \in (0, 1)$. Then there exists $v_\delta \in V$ such that $\|v_\delta\| = 1$ and $\delta \|u\|_* \leq t(u, v_\delta)$. Note that
$$r(u, v_\delta) = s(u, v_\delta) + t(u, v_\delta) \geq \delta \|u\|_* - c_1 \|u\|,$$
and thus for $w_\delta := u + \frac{c_0 \|u\|}{1 + c_1} v_\delta \in V$ we obtain
$$r(u, w_\delta) = r(u, u) + \frac{c_0 \|u\|}{1 + c_1} r(u, v_\delta) \geq c_0 \|u\|^2 + \frac{c_0 \|u\|}{1 + c_1} \big( \delta \|u\|_* - c_1 \|u\| \big) = \frac{c_0}{1 + c_1} \big( \|u\| + \delta \|u\|_* \big) \|u\|. \qquad (4.20)$$
Furthermore,
$$\|w_\delta\| \leq \|u\| + \frac{c_0 \|u\|}{1 + c_1} = \frac{1 + c_0 + c_1}{1 + c_1} \|u\| \qquad (4.21)$$
holds. Combination of (4.20) and (4.21) yields
$$\frac{r(u, w_\delta)}{\|w_\delta\|} \geq \frac{c_0}{1 + c_0 + c_1} \big( \|u\| + \delta \|u\|_* \big).$$
Because $w_\delta \in V$ and $\delta \in (0, 1)$ is arbitrary this proves the result in (4.19). □

We emphasize that the seminorm $\|\cdot\|_*$ on $U$ depends on the bilinear form $t(\cdot,\cdot)$ and on the subspace $V$. Also note that in (4.18) we have a boundedness result on $U \times V$, whereas in (4.19) we have an inf-sup bound on $V \times V$.
Using this lemma we can derive the following variant of the Cea-lemma (theorem 3.1.1).

Theorem 4.2.2 Let the conditions as in lemma 4.2.1 be satisfied. Take $f \in U'$ and assume that there exist $u \in U$, $v \in V$ such that
$$r(u, w) = f(w) \quad \text{for all } w \in U, \qquad (4.22a)$$
$$r(v, w) = f(w) \quad \text{for all } w \in V. \qquad (4.22b)$$
Then the following holds:
$$\|u - v\| + \|u - v\|_* \leq C \inf_{w \in V} \big( \|u - w\| + \|u - w\|_* \big) \qquad (4.23)$$
$$\text{with } C := 1 + \max\{c_1, 1\}\, \frac{1 + c_0 + c_1}{c_0}. \qquad (4.24)$$
Proof. Let $w \in V$ be arbitrary. Using (4.18), (4.19) and the Galerkin property $r(u - v, z) = 0$ for all $z \in V$ we get
$$\|v - w\| + \|v - w\|_* \leq \frac{1 + c_0 + c_1}{c_0} \sup_{z \in V} \frac{r(v - w, z)}{\|z\|} = \frac{1 + c_0 + c_1}{c_0} \sup_{z \in V} \frac{r(u - w, z)}{\|z\|} \leq \frac{1 + c_0 + c_1}{c_0} \max\{c_1, 1\} \big( \|u - w\| + \|u - w\|_* \big).$$
Using this and the triangle inequality
$$\|u - v\| + \|u - v\|_* \leq \|v - w\| + \|v - w\|_* + \|u - w\| + \|u - w\|_*$$
we obtain the result. □

In this theorem there are significant differences compared to the Cea-lemma. For example, in theorem 4.2.2 we do not assume that $U$ (or $V$) is a Hilbert space and we do not assume an inf-sup property for the bilinear form $r(\cdot,\cdot)$ on $U \times V$ (only on $V \times V$, cf. (4.19)). On the other hand, in theorem 4.2.2 we assume existence of solutions in $U$ and $V$, cf. (4.22), whereas in the Cea-lemma existence and uniqueness of solutions follows from assumptions on continuity and inf-sup properties of the bilinear form.

4.3 A one-dimensional hyperbolic problem and its finite element discretization

If in a convection-diffusion problem with a bilinear form as in (4.1) one formally takes $\varepsilon = 0$ this results in a hyperbolic differential operator. In this section we give a detailed treatment of a very simple one-dimensional hyperbolic problem. We show well-posedness of this problem and explain why a standard finite element discretization method suffers from an instability. Furthermore, a stabilization technique is introduced that results in a finite element method with much better properties. In section 4.4 essentially the same analysis is applied to the finite element discretization of the convection-diffusion problem (4.1)-(4.2).

We consider the hyperbolic problem
$$b u'(x) + u(x) = f(x), \quad x \in I := (0, 1), \quad b > 0 \text{ a given constant}, \qquad u(0) = 0. \qquad (4.25)$$
For the weak formulation we introduce the Hilbert spaces $H_1 = \{\, v \in H^1(I) \mid v(0) = 0 \,\}$, $H_2 = L^2(I)$. The norm on $H_1$ is $\|v\|_1^2 = \|v'\|_{L^2}^2 + \|v\|_{L^2}^2$. We define the bilinear form
$$k(u, v) = \int_0^1 \big( b u' v + u v \big)\, dx$$
on $H_1 \times H_2$.

Theorem 4.3.1 Let $f \in L^2(I)$. There exists a unique $u \in H_1$ such that
$$k(u, v) = \langle f, v \rangle_{L^2} \quad \text{for all } v \in H_2. \qquad (4.26)$$
Moreover, $\|u\|_1 \leq c \|f\|_{L^2}$ holds with $c$ independent of $f$.
Proof. We apply theorem 2.3.1. The bilinear form $k(\cdot,\cdot)$ is continuous on $H_1 \times H_2$:
$$|k(u, v)| \leq b \|u'\|_{L^2} \|v\|_{L^2} + \|u\|_{L^2} \|v\|_{L^2} \leq 2 \max\{1, b\}\, \|u\|_1 \|v\|_{L^2}, \quad u \in H_1, \ v \in H_2.$$
For $u \in H_1$ we have
$$\sup_{v \in H_2} \frac{k(u, v)}{\|v\|_{L^2}} = \sup_{v \in H_2} \frac{\langle b u' + u, v \rangle_{L^2}}{\|v\|_{L^2}} = \|b u' + u\|_{L^2} = \big( b^2 \|u'\|_{L^2}^2 + \|u\|_{L^2}^2 + 2b \langle u', u \rangle_{L^2} \big)^{\frac12}.$$
Using $u(0) = 0$ we get $\langle u', u \rangle_{L^2} = u(1)^2 - \langle u, u' \rangle_{L^2}$ and thus $\langle u', u \rangle_{L^2} \geq 0$. Hence we get
$$\sup_{v \in H_2} \frac{k(u, v)}{\|v\|_{L^2}} \geq \min\{1, b\}\, \|u\|_1 \quad \text{for all } u \in H_1,$$
i.e., the inf-sup condition (2.36) in theorem 2.3.1 is satisfied. We now consider the condition (2.37) in this theorem. Let $v \in H_2$ be such that $k(u, v) = 0$ for all $u \in H_1$. This implies $b \int_0^1 u' v\, dx = -\int_0^1 u v\, dx$ for all $u \in C_0^\infty(I)$ and thus $v \in H^1(I)$ with $v' = \frac1b v$ (weak derivative). Using this we obtain
$$\int_0^1 u v\, dx = -b \int_0^1 u' v\, dx = -b\, u(1) v(1) + b \int_0^1 u v'\, dx = -b\, u(1) v(1) + \int_0^1 u v\, dx \quad \text{for all } u \in H_1,$$
and thus $u(1) v(1) = 0$ for all $u \in H_1$. This implies $v(1) = 0$. Using this and $b v' - v = 0$ yields
$$\|v\|_{L^2}^2 = \langle v, v \rangle_{L^2} + \langle b v' - v, v \rangle_{L^2} = b \langle v', v \rangle_{L^2} = \frac{b}{2} v(1)^2 - \frac{b}{2} v(0)^2 = -\frac{b}{2} v(0)^2 \leq 0.$$
This implies $v = 0$ and thus condition (2.37) is satisfied. Application of theorem 2.3.1 now yields existence and uniqueness of a solution $u \in H_1$ and
$$\|u\|_1 \leq c \sup_{v \in H_2} \frac{\langle f, v \rangle_{L^2}}{\|v\|_{L^2}} = c \|f\|_{L^2},$$
which completes the proof. □

For the discretization of this problem we use a Galerkin method with a standard finite element space. To simplify the notation we use a uniform grid and consider only linear finite elements. Let $h = \frac1N$, $x_i = ih$, $0 \leq i \leq N$, and
$$\mathbb{X}_h = \{\, v \in C(\bar I) \mid v(0) = 0, \ v|_{[x_i, x_{i+1}]} \in \mathbb{P}_1 \ \text{for } 0 \leq i \leq N-1 \,\}.$$
Note that $\mathbb{X}_h \subset H_1$ and $\mathbb{X}_h \subset H_2$. The discretization is as follows:
$$\text{determine } u_h \in \mathbb{X}_h \text{ such that } k(u_h, v_h) = \langle f, v_h \rangle_{L^2} \text{ for all } v_h \in \mathbb{X}_h. \qquad (4.27)$$
For the error analysis of this method we apply the Cea-lemma (theorem 3.1.1). The conditions (3.2), (3.3), (3.4) in theorem 3.1.1 have been shown to hold in the proof of theorem 4.3.1. It remains to verify the discrete inf-sup condition:
$$\exists\, \beta_h > 0: \quad \sup_{v_h \in \mathbb{X}_h} \frac{k(u_h, v_h)}{\|v_h\|_{L^2}} \geq \beta_h \|u_h\|_1 \quad \text{for all } u_h \in \mathbb{X}_h. \qquad (4.28)$$
Related to this we give the following lemma:
Lemma 4.3.2 The inf-sup property (4.28) holds with $\beta_h = c\, h$, $c > 0$ independent of $h$.

Proof. For $u_h \in \mathbb{X}_h$ we have $\langle u_h', u_h \rangle_{L^2} = \frac12 u_h(1)^2 \geq 0$ and thus
$$\sup_{v_h \in \mathbb{X}_h} \frac{k(u_h, v_h)}{\|v_h\|_{L^2}} \geq \frac{k(u_h, u_h)}{\|u_h\|_{L^2}} = \frac{b \langle u_h', u_h \rangle_{L^2} + \|u_h\|_{L^2}^2}{\|u_h\|_{L^2}} \geq \|u_h\|_{L^2}.$$
Now apply an inverse inequality, cf. lemma 3.3.11, $\|v_h'\|_{L^2} \leq c\, h^{-1} \|v_h\|_{L^2}$ for all $v_h \in \mathbb{X}_h$, resulting in $\|u_h\|_{L^2} \geq \frac12 \big( \|u_h\|_{L^2} + c\, h \|u_h'\|_{L^2} \big) \geq c\, h\, \|u_h\|_1$ with a constant $c > 0$ independent of $h$. □

Remark 4.3.3 The result in the previous lemma is sharp in the sense that the best (i.e., largest) inf-sup constant $\beta_h$ in (4.28) in general satisfies $\beta_h \leq c\, h$. This can be deduced from a numerical experiment or a technical analytical derivation. Here we present results of a numerical experiment. We consider the continuous and discrete problems as in (4.26), (4.27) with $b = 1$. Discretization of the bilinear forms $(u, v) \to \langle u', v \rangle_{L^2}$, $(u, v) \to \langle u, v \rangle_{L^2}$ and $(u, v) \to \langle u', v' \rangle_{L^2}$ in the finite element space $\mathbb{X}_h$ (with respect to the nodal basis) results in the $N \times N$-matrices
$$C_h = \frac12 \begin{pmatrix} 0 & 1 & & & \\ -1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 0 & 1 \\ & & & -1 & 1 \end{pmatrix}, \qquad M_h = \frac{h}{6} \begin{pmatrix} 4 & 1 & & & \\ 1 & 4 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 4 & 1 \\ & & & 1 & 2 \end{pmatrix}, \qquad (4.29)$$
$$A_h = \frac1h \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}.$$
Note that
$$\inf_{u_h \in \mathbb{X}_h} \sup_{v_h \in \mathbb{X}_h} \frac{k(u_h, v_h)}{\|u_h\|_1 \|v_h\|_{L^2}} = \inf_{x \in \mathbb{R}^N} \sup_{y \in \mathbb{R}^N} \frac{\langle (C_h + M_h) x, y \rangle_2}{\langle (A_h + M_h) x, x \rangle_2^{\frac12}\, \langle M_h y, y \rangle_2^{\frac12}} = \inf_{x \in \mathbb{R}^N} \frac{\|M_h^{-\frac12} (C_h + M_h) x\|_2}{\|(A_h + M_h)^{\frac12} x\|_2} = \inf_{z \in \mathbb{R}^N} \frac{\|M_h^{-\frac12} (C_h + M_h)(A_h + M_h)^{-\frac12} z\|_2}{\|z\|_2} = \big\| (A_h + M_h)^{\frac12} (C_h + M_h)^{-1} M_h^{\frac12} \big\|_2^{-1} =: \beta_h.$$
A (MATLAB) computation of the quantity $q(h) := \beta_h / h$ yields: $q(\frac{1}{10}) = 1.3944$, $q(\frac{1}{50}) = 1.3987$, $q(\frac{1}{250}) = 1.3988$. Hence, in this case $\beta_h$ is proportional to $h$. □
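This experiment is straightforward to reproduce. A short Python/NumPy sketch, assuming the matrices $C_h$, $M_h$, $A_h$ are as reconstructed in (4.29) above (so the printed values should be close to, though not necessarily identical with, the ones quoted):

    import numpy as np

    def spd_sqrt(S):
        """Square root of a symmetric positive definite matrix."""
        w, V = np.linalg.eigh(S)
        return (V * np.sqrt(w)) @ V.T

    def infsup(N):
        h = 1.0 / N
        e = np.ones(N - 1)
        C = 0.5 * (np.diag(e, 1) - np.diag(e, -1)); C[-1, -1] = 0.5
        M = h / 6.0 * (4.0 * np.eye(N) + np.diag(e, 1) + np.diag(e, -1))
        M[-1, -1] = h / 3.0
        A = (2.0 * np.eye(N) - np.diag(e, 1) - np.diag(e, -1)) / h
        A[-1, -1] = 1.0 / h
        # beta_h = || (A+M)^{1/2} (C+M)^{-1} M^{1/2} ||_2^{-1}
        B = spd_sqrt(A + M) @ np.linalg.solve(C + M, spd_sqrt(M))
        return 1.0 / np.linalg.norm(B, 2)

    for N in [10, 50, 250]:
        print(f"1/h = {N}: beta_h / h = {infsup(N) * N:.4f}")

The essentially constant ratio $\beta_h / h$ confirms that the inf-sup constant degenerates linearly in $h$.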

Using the inf-sup result of the previous lemma we obtain the following corollaries.

Corollary 4.3.4 The discrete problem (4.27) has a unique solution $u_h \in \mathbb{X}_h$ and the following stability bound holds:
$$\|u_h\|_{L^2} \leq \|f\|_{L^2}. \qquad (4.30)$$

Proof. Existence and uniqueness of a solution follows from continuity and ellipticity of the bilinear form $k(\cdot,\cdot)$ on $\mathbb{X}_h \times \mathbb{X}_h$. From
$$\|u_h\|_{L^2}^2 \leq k(u_h, u_h) = \langle f, u_h \rangle_{L^2} \leq \|f\|_{L^2} \|u_h\|_{L^2}$$
we obtain the stability result. □

Note that this stability result for the discrete problem is weaker than the one for the continuous problem in theorem 4.3.1.

Corollary 4.3.5 Let $u \in H_1$ and $u_h \in \mathbb{X}_h$ be the solutions of (4.26) and (4.27), respectively. From theorem 3.1.1 we obtain the error bound
$$\|u - u_h\|_1 \leq c\, h^{-1} \inf_{v_h \in \mathbb{X}_h} \|u - v_h\|_1,$$
with a constant $c$ independent of $h$. If $u \in H^2(I)$ holds, we obtain
$$\|u - u_h\|_1 \leq c\, \|u''\|_{L^2}. \qquad (4.31)$$

Remark 4.3.6 Experiment. Is the result in (4.31) sharp? Expectation (for suitable $f$):
$$|u - u_h|_1 \leq c, \qquad \|u - u_h\|_{L^2} \leq c\, h.$$
□

These results show that, due to the deterioration of the inf-sup stability constant $\beta_h$ for $h \to 0$, the discretization with standard linear finite elements is not satisfactory. A heuristic explanation for this instability phenomenon can be given via the matrix $C_h$ in (4.29) that represents the finite element discretization of $u \to u'$. The differential equation (in strong form) is $u' = -\frac1b u + \frac1b f$ on $(0, 1)$, which is a first order ordinary differential equation. The initial condition is given by $u(0) = 0$. For the discretization of $u'(x_i)$ we use (cf. $C_h$) the central difference $\frac{1}{2h} \big( u(x_{i+1}) - u(x_{i-1}) \big)$. Thus for the approximation of $u$ at "time" $x = x_i$ we use $u$ at the "future time" $x = x_{i+1}$, which is an unnatural approach.

We now turn to the question how a better finite element discretization for this very simple problem can be developed. One possibility is to use suitable different finite element spaces $H_{1,h} \subset H_1$ and $H_{2,h} \subset H_2$. This leads to a so-called Petrov-Galerkin method. We do not treat such methods here, but refer to the literature, for example [76]. From an implementation point of view it is convenient to use only one finite element space instead of two different ones. We will show how a satisfactory discretization with (only) the space $\mathbb{X}_h$ of linear finite elements can be obtained by using the concept of stabilization.
A first stabilized method is based on the following observation. If $u \in H_1$ satisfies (4.26), then
$$\int_0^1 (b u' + u)\, b v'\, dx = \langle f, b v' \rangle_{L^2} \quad \text{for all } v \in H_1 \qquad (4.32)$$
also holds. By adding this equation to the one in (4.26) it follows that the solution $u \in H_1$ satisfies
$$\langle b u' + u, b v' + v \rangle_{L^2} = \langle f, b v' + v \rangle_{L^2} \quad \text{for all } v \in H_1. \qquad (4.33)$$
Based on this we introduce the notation
k1 (u, v) := hbu + u, bv + viL2 , u, v H1 ,
f1 (v) := hf, bv + viL2 , v H1 .
The bilinear form k1 (, ) is continuous on H1 H1 and f1 is continuous on H1 . Moreover, k1 (, )
is symmetric and using hv , viL2 = 12 v(1)2 0 for v H1 , we get
k1 (v, v) = b2 kv k2L2 + kvk2L2 + 2bhv , viL2 min{b2 , 1}kvk21 , for v H1 , (4.34)
and thus k1 (, ) is elliptic on H1 . The discrete problem is as follows:
determine uh Xh such that k1 (uh , vh ) = f1 (vh ) for all vh Xh . (4.35)
Due to Xh H1 and the H1 -ellipticity of k1 (, ) this problem has a unique solution uh Xh .
For the discretization error we obtain the following result.
Lemma 4.3.7 Let u H1 be the solution of (4.26) (or (4.33)) and uh the solution of (4.35).
The following holds:
ku uh k1 c inf ku vh k1
vh Xh

with a constant c independent of h. If u H 2 (I) then


ku uh k1 chku kL2 (4.36)
holds with a constant c independent of h.
Proof. Apply corollary 3.1.2.

From (4.34) and f1 (v) max{b, 1} 2kf kL2 kvk1 it follows that the discrete problem has the
stability property kuh k1 c kf kL2 which is similar to the stability property of the continuous
solution given in theorem 4.3.1 and (significantly) better than the one for the original discrete
problem, cf. (4.30). This explains why the discretization in (4.35) is called a stabilized finite
element method. From (4.34) one can see that k1 (, ) contains an (artificial) diffusion term that
is not present in k(, ). Note that the bounds in lemma 4.3.7 are significantly better than the
ones in corollary 4.3.5.
If u H 2 (I) then from (4.36) we have the L2 -error bound
ku uh kL2 c hku kL2 . (4.37)
In section 3.4.2, for elliptic problems, a duality argument is used to derive an L2 -error bound of
the order h2 for linear finite elements. Such a duality argument can not be applied to hyperbolic
problems due to the fact that the H 2 -regularity assumption that is crucial in the analysis in
section 3.4.2 is usually not satisfied for hyperbolic problems. In the following remark this is
made clear for the simple hyperbolic problem that is treated in this section.
Remark 4.3.8 Consider the problem in (4.25) with b = 1 and f (x) = 1 e
x 1, with a

constant 1. Substitution shows that the solution is given by u(x) = 1 (ex 1). Note that
u, f C (I). Further elementary computations yield
1p
kf kL2 2, ku kL2 .
4
Hence a bound ku kL2 c kf kL2 with a constant c independent of f L2 (I) can not hold, i.e.,
this problem is not H 2 -regular. 

97
We now generalize the stabilized finite element method presented above and show that using
this generalization we can derive a method with an H 1 -error bound of the order h (as in (4.36))
1
and an improved L2 -error bound of the order h1 2 . This generalization is obtained by adding
-times, with a parameter in [0, 1], the equation in (4.32) to the one in (4.26). This shows that
the solution u H1 of (4.26) also satisfies

k (u, v) = f (v) for all v H1 , with (4.38a)



k (u, v) := hbu + u, bv + viL2 , f (v) := hf, bv + viL2 . (4.38b)

Note that for = 0 we have the original variational formulation and that = 1 results in the
problem (4.33). For 6= 1 the bilinear form k (, ) is not symmetric. For all [0, 1] we have
f H1 . The discrete problem is as follows:

determine uh Xh such that k (uh , vh ) = f (vh ) for all vh Xh . (4.39)

The discrete solution uh (if it exists) depends on . We investigate how can be chosen such
that the discretization error (bound) is minimized. For this analysis we use the abstract results
in section 4.2. We write

k (u, v) = s (u, v) + t (u, v), u, v H1 , with


s (u, v) = hbu , bv iL2 + hu, viL2 , t (u, v) = hu, bv iL2 + hbu , viL2 .

The bilinear form s (, ) defines a scalar product on H1 . We introduce the norm and the
seminorm (cf. lemma 4.2.1)

1 t (u, vh )
|||u||| := s (u, u) 2 , kuk,h, := sup , for u H1 .
vh Xh |||vh |||

Note that
1
t (u, u) = b( + 1)hu , uiL2 = b( + 1)u(1)2 0 for all u H1 , (4.40)
2
and
1
(b |u|1 + kukL2 ) |||u||| b |u|1 + kukL2 for all u H1 . (4.41)
2

Lemma 4.3.9 For all [0, 1] the continuous problem (4.38) and the discrete problem (4.39)
have unique solutions u and uh , respectively. The discrete solution satisfies the stability bound

b |uh |1 + kuh kL2 2kf kL2 .

Proof. For = 0 the existence and uniqueness of solutions is given in theorem 4.3.1 and
corollary 4.3.4. The stability result for = 0 also follows from corollary 4.3.4. For > 0 we
obtain, using (4.40),

k (u, u) = b2 ku k2L2 + kuk2L2 + t (u, u) kuk21 for u H1 , with := min{b2 , 1} > 0,

and
k (u, v) (bku kL2 + kukL2 )(bkv kL2 + kvkL2 ) c kuk1 kvk1 for u, v H1 .

98
Hence k (, ) is elliptic and continuous on H1 . The Lax-Milgram lemma implies that both the
continuous and the discrete problem have a unique solution. For v H1 we have, cf. (4.40),

k (v, v) = s (v, v) + t (v, v) |||v|||2 . (4.42)

Furthermore, using 1 we get



f (v) kf kL2 (b|v|1 + kvkL2 ) kf kL2 b |v|1 + kvkL2 2kf kL2 |||v||| for v H1 .

This yields |||uh |||2 k (uh , uh ) = f (uh ) 2kf kL2 |||uh ||| and thus

b |uh |1 + kuh kL2 2|||uh ||| 2kf kL2 ,

which completes the proof.

Lemma 4.3.10 Let u and uh be the solutions of (4.38) and (4.39), respectively. The following
error bound holds:

|||u uh ||| + ku uh k,h, 4 inf |||u vh ||| + ku vh k,h, . (4.43)
vh Xh

Proof. To derive this error bound we use theorem 4.2.2 with U = H1 , V = Xh , r(, ) = k (, ),
k k = ||| ||| , k k = k k,h, . We verify the corresponding conditions in lemma 4.2.1. The
bilinear form s (, ) is continuous on U = H1 : s (u, v) |||u||| |||v||| . Hence (4.17) is satisfied with
c1 = 1. For u H1 the functional v t (u, v) is clearly continuous on V = Xh . From (4.42) it
follows that condition (4.16) is satisfied with c0 = 1. Application of theorem 4.2.2 yields

|||u uh ||| + ku uh k,h, C inf |||u vh ||| + ku vh k,h, ,
vh Xh

with C = 1 + max{c1 , 1} 1+cc00+c1 = 4.

For the Sobolev space H1 we have H1 C(I) and thus the nodal interpolation

IX : H1 C(I), (IX u)(xi ) = u(xi ), 0 i N,

is well-defined.

Theorem 4.3.11 Let u H1 and uh Xh be the solutions of (4.38) and (4.39), respectively.
For all [0, 1] the error bound
 b 1 
b |u uh |1 + ku uh kL2 C b |u IX u|1 + (1 + min{ , })ku IX ukL2
h
holds with a constant C independent of h, , b and u.

Proof. From lemma 4.3.10 and (4.41) we obtain


 
b |u uh |1 + ku uh kL2 4 2 b |u IX u|1 + ku IX ukL2 + ku IX uk,h, . (4.44)

Define eh := u IX u, and note that eh (0) = eh (1) = 0. Thus we we have

heh , vh iL2 = heh , vh iL2 for all vh Xh .

99
Using this and the inverse inequality |vh |1 ch1 kvh kL2 for all vh Xh we obtain
t (eh , vh ) b(1 )heh , vh iL2
keh k,h, = sup = sup
vh Xh |||vh ||| vh Xh (b2 |vh |2 2 12
1 + kvh kL2 )
(4.45)
bkeh kL2 |vh |1  1 b
sup 1 c min , keh kL2 ,
vh Xh ( + ch2 b2 ) 2 b|vh |1 h

with c independent of , h and b. The result follows from combination of (4.44) and (4.45).

Corollary 4.3.12 Let u and uh be as in theorem 4.3.11 and assume that u H 2 (I). Then the
following error bound holds for [0, 1]:
h 
b |u uh |1 + ku uh kL2 Ch h + b + b min{1, } ku kL2

(4.46)
b
with a constant C independent of h, , b and u.

The term between square brackets in (4.46) is minimal for h b if we take


h
= opt = . (4.47)
b
We consider three cases:
= 0 (no stabilization): Then we get ku uh kL2 chku kL2 . This bound for the L2 -error is
better than the one that follows from (4.31), cf. also remark 4.3.6.
= 1 (full stabilization): Then we obtain ku uh k1 chku kL2 , which is the same bound as in
lemma 4.3.7.
= opt (optimal value): This results in
1
|u uh |1 chku kL2 , ku uh kL2 ch1 2 ku kL2 . (4.48)

Hence, the bound for the norm | |1 is the same as for = 1, but we have an improvement in
the L2 -error bound.
From these discretization error results and from the stability result in lemma 4.3.9 we see
that = 0 leads to poor accuracy and poor stability properties. The best stability property is
for the case = 1. A somewhat weaker stability property but a better approximation property
is obtained for = opt . For = opt we have a good compromise between sufficient stability
and high approximation quality. Finding such a compromise is a topic that is important in all
stabilized finite element methods.
Remark 4.3.13 Experiments to show dependence of errors on . Is the bound in (4.46) sharp?

4.4 The convection-diffusion problem reconsidered


In chapter 3 we treated the finite element discretization of the variational problem in (4.2).
Under the assumption (4.3) we have

k(u, u) |u|21 for all u H01 (), (4.49)


k(u, v) M |u|1 |v|1 for all u, v H01 (), (4.50)

100
with a constant M independent of . Now recall the standard Galerkin discretization with linear
finite elements, i.e.: uh X1h,0 such that

k(uh , vh ) = f (vh ) for all vh X1h,0 .

From the analysis in chapter 3 (corollary 3.1.2 and corollary 3.3.10) we obtain the discretization
error bound
M M
|u uh |1 inf |u vh |1 C h|u|2 , (4.51)
vh X1h,0

provided u H 2 (). The constant C is independent of h and . We can apply the duality
argument used in section 3.4.2 to derive a bound for the error in the L2 -norm. The dual
problem has the same form as in(4.1)-(4.2) but with b replaced by b. Assume that (4.4) also
holds with b replaced by b and that the solution of the dual problem lies in H 2 () (the latter
is true if is convex). Then a regularity result as in (4.6) holds for the solution of the dual
problem. Using this we obtain
1
ku uh kL2 C2 2 h2 |u|2 , (4.52)

with a constant C independent of h and .


Even if |u|2 remains bounded for 0 (no boundary layers) the bounds in (4.51) and (4.52) tend
to infinity for 0. These bounds, however, are too pessimistic and do not reflect important
phenomena that are observed in numerical experiments. For example, from experiments it is
known that the standard linear finite element discretization yields satisfactory results if h
and 0. This, however, is not predicted by the bounds in (4.51) and (4.52).

Below we present a refined analysis based on the same approach as in section 4.3 which re-
sults in better bounds for the discretization error. These bounds reflect important properties of
the standard Galerkin finite element discretization applied to the convection-diffusion problem
in (4.2) and show the effect of introducing a stabilization. In section 4.4.1 we consider well-
posedness of the problem in (4.2). In section 4.4.2 we analyze a stabilized finite element method
for this problem.

4.4.1 Well-posedness of the continuous problem


The (sharp) result in (4.49) shows that in the norm | |1 (or equivalently k k1 ) the convection-
diffusion problem is not uniformly well-posed for 0. In this section we introduce other norms
in which the problem is uniformly well-posed. For a better understanding of the analysis we
first present results for a two-dimensional hyperbolic problem:

Remark 4.4.1 We consider a two-dimensional variant of the hyperbolic problem treated in


section 4.3. Let := (0, 1)2 , b := (1 0)T , f L2 (). The continuous problem is as follows:
determine u such that
b u + u = f in
(4.53)
u = 0 on W := {(x, y) | x = 0 }.
u
Let H1 be the space of functions u L2 () for which the weak derivative ux = x exists:
H1 := { u L2 () | ux L2 () }. This space with the scalar product

hu, viH1 = hu, viL2 + hb u, b viL2 = hu, viL2 + hux , vx iL2 (4.54)

101
is a Hilbert space (follows from the same arguments as for the Sobolev space H 1 ()). Take
H2 := L2 (). The bilinear form corresponding to the problem in (4.53) is

k(u, v) = hb u, viL2 + hu, viL2 , u H1 , v H2 .

Using the same arguments as in the proof of theorem 4.3.1 one can show that there exists a
unique u H1 such that
k(u, v) = hf, viL2 for all v H2

and that kukH1 c kf kL2 holds with a constant c independent of f L2 (). Thus this
hyperbolic problem is well-posed in the space H1 H2 . Note that the stability result is similar
to the one for the convection-diffusion problem in theorem 4.1.4. 

We now turn to the convection-diffusion problem as in (4.1)-(4.2). As in section 4.3 the


analysis uses the abstract setting given in section 4.2. We will need the following assumption:

1
There are constants 0 , cb such that div b + c 0 0, kckL cb 0 . (4.55)
2

We take cb := 0 if 0 = 0. Note that this assumption is somewhat stronger than the one in (4.3)
but still covers the important special case div b = 0, c constant and c 0.

Theorem 4.4.2 Consider the variational problem in (4.2) and assume that (4.55) holds. For
u H01 () define the (semi-)norms
1
|||u||| := |u|21 + 0 kuk2L2 2 , (4.56a)
R
b u v dx
kb uk = kuk := sup . (4.56b)
vH 1 () |||v|||
0

Then we have the continuity bound

for all u, v H01 (),



k(u, v) max{cb , 1} |||u||| + kb uk |||v||| (4.57)

and the infsup result

k(u, v) 1
for all u H01 ().

sup |||u||| + kb uk (4.58)
vH 1 () |||v|||
0
2 + max{cb , 1}

Proof. We apply lemma 4.2.1 with U = V = H01 (), norm k k = ||| ||| and
Z Z
s(u, v) = u v + cuv dx , t(u, v) = b uv dx .

For given u H01 () we have

c
|t(u, v)| c kvkL2 c |v|1 |||v|||

102
and thus v t(u, v) is bounded on V . Note that k(u, v) = s(u, v)+ t(u, v) holds. For u H01 ()
we have
Z
k(u, u) = u u + b uu + cu2 dx
Z
1
= u u + ( div b + c)u2 dx
2
Z
u u + 0 u2 dx = |||u|||2

and thus (4.16) is satisfied with c0 = 1. Furthermore, for all u, v H01 () we have

|s(u, v)| |u|1 |v|1 + kckL kukL2 kvkL2


1 1
|u|21 + cb 0 kuk2L2 2 |v|21 + cb 0 kvk2L2 2
max{cb , 1}|||u||| |||v||| .

Hence (4.17) holds with c1 = max{cb , 1}. The results in (4.18) and (4.19) then yield (4.57) and
(4.58), respectively.

The result in this theorem can be interpreted as follows. Let H1 be the space H01 () endowed
with the norm ||| ||| + kb k and H2 the space H01 () with the norm ||| ||| . Note that
these norms are problem dependent. The spaces H1 and H2 are Hilbert spaces. Using the linear
operator L : H1 H2 , L(u)(v) := k(u, v) the variational problem (4.2) can be reformulated as
follows: find u H1 such that Lu = f . From the results in theorem 2.3.1 and theorem 4.4.2 it
follows that L is an isomorphism and that the inequalities

kLkH2 H1 max{cb , 1} , kL1 kH1 H2 2 + max{cb , 1}

hold. Hence, the operator L : H1 H2 is well-conditioned uniformly in . In this sense the


norms ||| ||| + kb k and ||| ||| are natural for the convection-diffusion problem if one is
interested in the case 0. If we take 0 > 0 and in the definition of the norms in (4.56)
1
using a density argument it follows that kb uk=0 = 0 2 kb ukL2 .
formally put = 0 then
Furthermore |||u|||=0 = 0 kukL2 . The resulting norms in the spaces H1 and H2 are precisely
those used in the well-posedness of the hyperbolic problem in remark 4.4.1.

The infsup bound in the previous theorem implies the following stability result for the vari-
ational problem (4.2).

Corollary 4.4.3 Consider the variational problem in (4.2) and assume that (4.55) holds with
0 > 0. Then the inequality
p  1
|u|1 + 0 kukL2 + 2 kb uk 2 2 + max{cb , 1} 0 2 kf kL2 (4.59)

holds.

103
Proof. From k(u, v) = hf, viL2 for all v H01 () and (4.58) we obtain
 hf, viL2
|||u||| + kb uk 2 + max{cb , 1} sup
vH 1 () |||v|||
0
 kf kL2 kvkL2
2 + max{cb , 1} sup 1
vH01 () (|v|21 + 0 kvk2L2 ) 2
 1
2 + max{cb , 1} 0 2 kf kL2 .
Furthermore, note that
1 p 
|||u||| |u|1 + 0 kukL2
2
holds.
1
From this corollary it follows that for the case 0 > 0 the inequality 2 kuk1 + kukL2 Ckf kL2
holds with a constant C independent of f and . This result is proved in theorem 4.1.1, too.
However, from corollary 4.4.3 we also obtain
kb uk Ckf kL2 (4.60)
with C independent of f and . Hence, we have a bound on the derivative in streamline direction.
Taking (formally) = 0 we obtain a bound kukL2 + kb ukL2 Ckf kL2 , which is (for the
example b = (1 0)T ) the same as the stability bound kukH1 ckf kL2 derived in remark 4.4.1.
For = 0 and 0 > 0 the norm k k is equivalent to the L2 -norm. A result that relates
the norm k k to the more tractable L2 -norm also for > 0 is given in the following lemma.

Lemma 4.4.4 Let {Th } be a regular quasi-uniform family of triangulations of consisting


of n-simplices and let Vh := Xkh,0 H01 () be the corresponding finite element space. Let
Ph : L2 () Vh be the L2 -orthogonal projection on Vh :
hPh w, vh iL2 = hw, vh iL2 for all w L2 (), vh Vh .
Assume that 0 > 0. There exists a constant C > 0 independent of h and such that for
0 h2 :
1
CkPh wkL2 kwk 0 2 kwkL2 for all w L2 (). (4.61)

Proof. The second inequality in (4.61) follows from


hw, viL2 1
kwk = sup 1 0 2 kwkL2
vH01 () (|v|21 + 0 kvk2L2 ) 2

For the first inequality we need the global inverse inequality from lemma 3.3.11: |vh |1
ch1 kvh kL2 for all vh Vh . Using this inequality we get
hw, viL2 hw, Ph wiL2
kwk = sup
vH 1 () |||v|||
0
|||Ph w|||
kPh wk2L2 1
= 1 c2 2
+ 0 ) 2 kPh wkL2 CkPh wkL2 ,
(|Ph w|21 + 0 kPh wk2L2 ) 2 h

and thus the first inequality in (4.61) holds, too.

104
4.4.2 Finite element discretization
We now analyze the Galerkin finite element discretization of the convection-diffusion problem.
For ease of presentation we only consider simplicial finite elements. The case with rectangular
finite elements can be treated analogously. Let {Th } be a regular family of triangulations of
consisting of n-simplices and let

Vh := Xkh,0 H01 (), k 1,

be the corresponding finite element space. The standard discretization is as follows:

determine uh Vh such that k(uh , vh ) = hf, vh iL2 for all vh Vh . (4.62)

We now use the same stabilization approach as in section 4.3. Assume that for the solution
u H01 () of the convection-diffusion problem we have u H 2 (). Then
Z
u + b u + cu v dx = hf, viL2 for all v H01 (),


holds, but also for arbitrary R:


Z
for all v H01 ().

u + b u + cu b v dx = hf, b viL2

Adding these equations it follows that the solution u satisfies

hu + b u + cu, v + b viL2 = hf, v + b iL2 for all v H01 (),

or, equivalently,

k(u, v) + hu + b u + cu, b viL2 = hf, viL2 + hf, b viL2 for all v H01 ().

This leads to the following discretization: determine uh Vh such that

k (uh , vh ) = f (vh ) for all vh Vh , (4.63a)


X
with k (uh , vh ) := k(uh , vh ) + T huh + b uh + cuh , b vh iT , (4.63b)
T Th
X
f (vh ) := hf, vh iL2 + T hf, b vh iT . (4.63c)
T Th
P
We use the notation h, iT = h, iL2 (T ) . In (4.63) we consider a sum T Th h, iT instead of
h, iL2 because for uh Vh the second derivatives in uh are well-defined in each T Th but uh
is not well-defined across edges (faces) in the triangulation. In (4.63) we use a (stabilization)
parameter T for each T Th instead of one global parameter . This offers the possibility to
adapt the stabilization parameter to the local mesh size and thus obtain better results if the
triangulation is strongly nonuniform.
The continuous solution u H01 () satisfies

k (u, v) = f (v) for all v H01 (), (4.64)

provided u H 2 (T ) for all T Th . In the remainder we assume that u has this regularity
property. If T = 0 for all T we have the standard (unstabilized) method as in (4.62). In the

105
remainder of this section we present an error analysis of the discretization (4.63) along the same
lines as in section 4.3. We use the abstract analysis in section 4.2 with the spaces
U := { v H01 () | v|T H 2 (T ) for all T Th }, V = Vh .
Note that U depends on Th . We split k (, ) as follows:
k (u, v) = s (u, v) + t (u, v), u, v U,
X
s (u, v) := hu, viL2 + hcu, viL2 + T hb u, b viT ,
T Th
X
t (u, v) := hb u, viL2 + T hu + cu, b viT .
T Th

We only consider T 0. Then k k = ||| ||| defines a norm on U :


X
|||u|||2 := |u|21 + 0 kuk2L2 + T kb uk2T , u U.
T Th

We also use the seminorm


t (u, vh )
kuk,h, := sup , u U. (4.65)
vh Vh |||vh |||
We will apply theorem 4.2.2. For this we have to verify the corresponding conditions in
lemma 4.2.1. First note that vh t (u, vh ) is trivially bounded on Vh and thus the semi-
norm in (4.65) is well-defined. The conditions (4.16)-(4.17) in lemma 4.2.1 are verified in the
following two lemmas.
We always assume that (4.55) holds.
Lemma 4.4.5 The bilinear form s (, ) is continuous on U U :
s (u, v) max{cb , 1}|||u||| |||v||| for all u, v U.
Proof. The result follows from:
X
s (u, v) |u|1 |v|1 + cb 0 kukL2 kvkL2 + T kb ukT kb vkT
T Th
X 1 X 1
max{cb , 1} |u|21 + 0 kuk2L2 + T kb uk2T 2
|v|21 + 0 kvk2L2 + T kb vk2T 2

T Th T Th

= max{cb , 1}|||u||| |||v||| .

Below we use a local inverse inequality (lemma 3.3.11):


|vh |m,T inv hT |vh |m+1,T for all vh Vh , m = 0, 1.T Th , (4.66)
with a constant inv > 0 independent of h and T .
Lemma 4.4.6 If
2 o
1 2 hT
n 1
0 T min , inv for all T Th (4.67)
2 0 c2b
holds then the bilinear form k (, ) is elliptic on Vh :
1
k (vh , vh ) |||vh |||2 for all vh Vh .
2

106
Proof. Using hb vh , vh iL2 = 21 hdiv b vh , vh iL2 and (4.55) we obtain
X X
k (vh , vh ) |vh |21 + 0 kvh k2L2 + T kb vh k2T + T hvh + cvh , b vh iT
T Th T Th
X (4.68)
= |||vh |||2 + T hvh + cvh , b vh iT .
T Th
1
For a bound on the last term in (4.68) we use kvh kT 1 1
inv hT |vh |1,T , T 1 inv hT 2
2
1
and T 12 0 2 c1
b :
X
T hvh + cvh , b vh iT
T T
h
X  
1 1
T inv hT |vh |1,T kb vh kT + 0 cb kvh kT kb vh kT
T Th
X h  1 p  p  1 p i
|vh |1,T T kb vh kT + 0 kvh kT T kb vh kT
T Th
2 2
1 X h i
|vh |21,T + 0 kvh k2T + T kb vh k2T
2
T Th
1 X  1
= |vh |21 + 0 kvh k2L2 + T kb vh k2T = |||vh |||2 .
2 2
T Th

Using this in (4.68) proves the result.

In the condition (4.67) the bound 01 c1


b should be taken if 0 = 0 or cb = 0.
From the ellipticity result in the previous lemma we see that if we take T = > 0 such
that (4.67) is satisfied, a term kb vh k2L2 is added in the ellipticity lower bound |||vh |||2 which
enhances stability. The bilinear form corresponding to this additional term is (u, v) hb
u, b viL2 , which models diffusion in the streamline direction b. Therefore the stabilized
method (4.63) with T > 0 is called the streamline diffusion finite element method, SDFEM.

Remark 4.4.7 If Vh is the space of piecewise linear finite elements then (vh )|T = 0. Inspection
of the proof shows that the result of the lemma holds with the condition (4.67) replaced by the
weaker condition 0 T 21 01 c2
b . 

Corollary 4.4.8 If (4.67) is satisfied then the discrete problem (4.63a) has a unique solution
uh Vh . For 0 > 0 we have the stability bound
X 1 1 p 
T kb uh k2T
p
|uh |1 + 0 kuh kL2 + 2
2 3 + h kf kL2 , with h := max T .
T Th
0 T Th

Proof. The bilinear form k (, ) is trivially bounded on the finite dimensional space Vh .
Lemma 4.4.6 yields Vh -ellipticity of the bilinear form. Existence of a unique solution follows
from the Lax-Milgram lemma. For the left handside of the stabiliy inequality we have
X 1
T kb uk2T 2 3|||uh ||| .
p
|uh |1 + 0 kuh kL2 +
T Th

107
Lemma 4.4.6 yields
|||uh |||2 2k (uh , uh ) = 2f (uh )
Combining this with
X
f (uh ) = hf, uh iL2 + T hf, b uh iT
T Th
X 1 X 1
kf kL2 kuh kL2 + T kf k2T 2
T kb uh k2T 2

T Th T Th
1 p 1 p 
kf kL2 |||uh ||| + h kf kL2 |||uh ||| = + h kf kL2 |||uh ||| .
0 0
completes the proof.

Remark 4.4.9 As an example consider the case with linear finite elements, T = for all T
and 0 = 1. Then the stability result of this corollary takes the form
 1
|uh |1 + kuh kL2 + kb uh kL2 2 3 1 + kf kL2 , for [0, c2 ]. (4.69)
2 b
note the similarity of this result with the one in corollary 4.4.3 (for the continuous problem)
and in corollary 4.3.9 (for the stabilized finite element method applied to the 1D hyperbolic
problem). 
From the results in corollary 4.4.3 and (4.69) we see that one obtains the strongest stability
result if T is chosen as large as possible (maximal stabilization). In section 4.3 it is shown that
smaller values for the stabilization parameter may lead to smaller discretization errors. Below
we give an analyis on how to chose the parameter T such that the (bound for the) discretization
error in minimized.
Application of theorem 4.2.2 yields:
Theorem 4.4.10 Assume that (4.67) is satisfied. For the discrete solution uh of (4.63a) we
have the error bound

|||u uh ||| + ku uh k,h, C inf |||u vh ||| + ku vh k,h, (4.70)
vh Vh

with C = 1 + max{cb , 1} 3 + 2 max{cb , 1} .
Proof. From lemma 4.4.5 and lemma 4.4.6 it follows that the conditions (4.16) and (4.17)
are satisfied with c0 = 12 , c1 = max{cb , 1}. Now we use theorem 4.2.2.

The norm ||| ||| is given in terms of the usual L2 - and H 1 -norm. For the right handside in
(4.70) we need a bound on ku vh k,h, . Such a result is given in the following lemma. We will
need the assumption
kdiv bkL 0 0 . (4.71)
(0 as in (4.55)). This can always be satisfied (for suitable 0 ) if 0 > 0. For the cases 0 = 0
this implies that div b = 0 must hold.
Lemma 4.4.11 Let T be such that (4.67) holds and assume that (4.55) and (4.71) are satisfied.
For u U the following estimate holds:
 X kbk,T 1
min T1 , 2 2
p 
kuk,h, |u|1 + 0 (1 + 0 )kukL2 + kukT .
+ 2inv h2T 0
T Th

108
Proof. By definition we have
P
hb u, vh iL2 + T Th T hu + cu, b vh iT
kuk,h, = sup . (4.72)
vh Vh |||vh |||
1
We first treat the second term in the nominator. Using the inverse inequality (4.66) and T2
1
1 1
1 inv hT 2 , T2 1 2 c1 we get
2 2 0 b
X X 
T hu + cu, b vh iT T |u|2,T + cb 0 kukT kb vh kT
T Th T Th
X 1 p p
|u|1,T + 0 kukT T kb vh kT
T Th
2
 X 1 p 2  1 (4.73)
|u|1,T + 0 kukT 2 |||v |||
h
2
T Th
1
|u|21 + 0 kuk2L2 2 |||vh |||
p 
|u|1 + 0 kukL2 |||vh ||| .

For the first term in the nominator in (4.72) we obtain, using partial integration,

|hb u, vh iL2 | |hu, (div b)vh iL2 | + |hu, b vh iL2 |


0 0 kukL2 kvh kL2 + |hu, b vh iL2 | (4.74)
p
0 0 kukL2 |||vh ||| + |hu, b vh iL2 |.

We write |||vh |||2 = |vh |21,T + 0 kvh k2T + T kb vh k2T =: T Th T2 . For the last term in
P P
T Th
(4.74) we have
X X
|hu, b vh iL2 | = hu, b vh iT kukT kb vh kT .
T Th T Th

1
From kb vh kT T 2 T and
kbk,T 1
kb vh kT kbk,T |vh |1,T = 1 |vh |21,T + 2inv h2T 0 |vh |21,T 2

( + 2inv h2T 0 ) 2

kbk,T 1 kbk,T
1 |vh |21,T + 0 |vh |2T 2
1 T ,
( + 2inv h2T 0 ) 2 ( + 2inv h2T 0 ) 2
we get
X  1 kbk,T
|hu, b vh iL2 | kukT min T 2 , 1 T
T Th ( + 2inv h2T 0 ) 2

h X kbk2,T i1
min T1 , kuk2T
 2
|||vh ||| .
+ 2inv h2T 0
T Th

Using this in (4.74) we get

|hb u, vh iL2 | h X kbk2,T i1


min T1 , 2 2
p 
0 0 kukL2 + kukT .
|||vh ||| + 2inv h2T 0
T Th

109
Combining this with the results in (4.72) and (4.73) completes the proof.

For the estimation of the approximation error in (4.70) we use an interpolation operator (e.g.,
nodal interpolation)
IVh : H Vh = Xkh,0 (k 1),
that satisfies
ku IVh ukT chm
T |u|m,T (4.75a)
|u IVh u|1,T chm1
T |u|m,T , (4.75b)
for u H m (), with 2 m k + 1. A main discretization error result is given in the following
theorem.

Theorem 4.4.12 Assume that (4.55), (4.71) hold and that T is such that (4.67) is satisfied.
Let u be the solution of (4.2) and assume that u H 2 (). Let uh Vh = Xkh,0 be the solution
of the discrete problem (4.63). For 2 m k + 1 the discretization error bound
X 1
T kb (u uh )k2T
p
|u uh |1 + 0 ku uh kL2 + 2

T Th
 
hm1 + 0 (1 + 0 )hm |u|m
p
C (4.76)
n X  kbk2,T  2 o 21
T kbk2,T h2m2 + min T1 , h2m

+C T T |u|m,T , (4.77)
+ 2inv h2T 0
T Th

holds, with C independent of u, h, , 0 , T and b.

Proof. We apply theorem 4.4.10. For the left handside in (4.70) we have
1 h X 1 i
T kb (u uh )k2T 2 .
p
|||u uh ||| + ku uh k,h, |u uh |1 + 0 ku uh kL2 +
3 T T h

For the right handside in (4.70) we obtain, using kb (u vh )kT kbk,T |u vh |1,T and
lemma 4.4.11:

inf |||u vh ||| + ku vh k,h, |||u IVh u||| + ku IVh uk,h,
vh Vh
 p 
C |u IVh u|1 + 0 (1 + 0 )ku IVh ukL2 +
 X  kbk2,T  21
T kbk2,T |u IVh u|21,T + min T1 , ku IVh uk2T

C .
+ 2inv h2T 0
T Th

Using the approximation error bounds in (4.75) we obtain the result.

Note that this theorem covers the cases T = 0 for all T (i.e., no stabilization) and 0 = 0.
To gain more insight we consider a special case:
Remark 4.4.13 We take b = (1 0)T , c 1 (hence 0 = 1, 0 = 0), T = for all T and m = 2.
Then the estimate in the previous theorem takes the form
  h 1 
|u uh | + ku uh kL2 + k (u uh )kL2 C h + h + + min , p |u|2 ,
x /h + 1

110
with C independent of u, h, and . For 0 this result is very similar to the one in
corollary 4.3.12 for the one-dimensional hyperbolic problem. 
For the case without stabilization we obtain the following.
Corollary 4.4.14 If the assumptions of theorem 4.4.12 are fulfilled we have the following dis-
cretization error bounds for the case = 0 for all T :
h
|u uh |1 C 1 + hm1 |u|m (4.78)

ku uh kL2 Chm1 |u|m if 0 > 0, (4.79)
with a constant C independent of u, h and .
For 0 these bounds are much better than the ones in (4.51) and (4.52) which resulted from the
standard finite element errror analysis. Moreover, the results in (4.78), (4.79) reflect important
properties of the standard Galerkin finite element discretization that are observed in practice:
If h . holds then (4.78) yields |u uh |1 Chm1 |u|m with C independent of h and
, which is an optimal discretization error bound. This explains the fact that for h .
the standard Galerkin finite element method applied to the convection-diffusion problem
usually yields an accurate discretization. Note, however, that for 1 the condition
h . is very unfavourable in view of computational costs.
For fixed h, even if u is smooth (i.e., |u|m bounded for 0) the H 1 -error bound in (4.78)
tends to infinity for 0. Thus, if the analysis is sharp, we expect that an instability
phenomenom can occur for 0.
For the case 0 > 0 we have the suboptimal bound hm1 |u|m for the L2 -norm of the
discretization error (for an optimal bound we should have hm |u|m ). If u is smooth (|u|m
bounded for 0) the discretization error in kkL2 will be arbitrarily small if h is sufficiently
small, even if h. Hence, for the case 0 > 0 the L2 -norm of the discretization error can
not show an instability phenomenon as described in the previous remark for the H 1 -norm.
Note, however, that the L2 -norm is weaker than the H 1 -norm and in particular allows a
more oscillatory behaviour of the error.
We now turn to the question whether the results can be improved by chosing a suitable value
for the stabilization parameter T . For the term between square brackets in (4.77) we have
kbk2,T
T kbk2,T h2m2 + min T1 ,
2m
hT gT (T )kbk,T h2m2

T T , (4.80)
+ 2inv h2T 0
kbk,T
1
h2T . For hT kbk1

with gT () := kbk,T + min kbk,T , ,T the function gT attains
its minimum at = hT kbk1 ,T . Remember the condition on T in (4.67). This leads to the
parameter choice
hT  hT 1 2
T,opt := T , with T := min 1, inv kbk,T . (4.81)
kbk,T 2
If kbk,T = 0 we take T,opt = 0. Note that T,opt hT kbk1 ,T and thus for hT sufficiently
1 1 2
small the condition T,opt 2 0 cb in (4.67) is satisfied. The second condition in (4.67) is
satisfied due to the definition of T,opt in (4.81). If T = 1 we have
h2T
gT (T,opt ) T,opt kbk,T + = 2hT ,
T,opt kbk,T

111
hT
and if T < 1 this implies kbk,T 22
inv and thus

kbk,T 2
gT (T,opt ) T,opt kbk,T + hT (1 + 22
inv )hT .

Hence, for T = T,opt we obtain the following bound for the T -dependent term in (4.77):
n X  kbk2,T 2m  2 o 12 1 1
T kbk2,T h2m2 + min T1 , CkbkL2 hm 2 |u|m .

T hT |u|m,T
+ 2inv h2T 0
T Th

Using this in theorem 4.4.12 we obtain the following corollary.

Corollary 4.4.15 Let the assumptions be as in theorem 4.4.12. For T = T,opt we get the
estimate
X 1
T,opt kb (u uh )k2T 2
p
|u uh |1 + 0 ku uh kL2 +
T Th
 1 
0 (1 + 0 )h + kbkL2 h hm1 |u|m .
p
C + (4.82)

This implies
1
 0 (1 + 0 ) kbkL2  m1
|u uh |1 C 1 + h+ h h |u|m , (4.83a)


1
kbkL2  m1
ku uh kL2 C + (1 + 0 )h + h h |u|m if 0 > 0,
0 0
(4.83b)
X 1  1 
T,opt kb (u uh )k2T 2 C + 0 (1 + 0 )h + kbkL2 h hm1 |u|m .
 p
(4.83c)
T Th

The constants C are independent of u, h, , 0 , T and b.

Some comments related to these discretization error bounds:


q 
The bound in (4.83a) is of the form c 1 + h hm1 |u|m and thus better than the one in
(4.78) if h.

The bound in (4.83b) is of the form c( + h)hm1 |u|m and thus better than the one in
1
(4.79) if h. For 0 we have a bound of the form chm 2 |u|m which, for m = 2, is
similar to the bound in (4.48) for the one-dimensional hyperbolic problem.

The result in (4.83c) shows a control on the streamline derivative of the discretization
error. Such a control is not present in the case T = 0 (no stabilization). If T = 1 for all
T (i.e., 12 2inv kbk,T hT ) and hT c0 h with c0 > 0 independent of T and h we obtain
r

+ 1 hm1 |u|m ,

kb (u uh )kL2 c
h

and thus an optimal bound of the form kb (u uh )kL2 chm1 |u|m if h.

112
In (4.82) there is a correct scaling of , 0 and b. Note that T,opt = T kbkh,T
T
has a scaling
w.r.t. kbk,T that is the same as in the one-dimensional hyperbolic problem in (4.47).

In case of linear finite elements the condition on T in (4.67) can be simplified to T


1 1 2
2 0 cb , cf. remark 4.4.7. Due to this one can take T = 1 in (4.81) and thus T,opt =
hT
kbk,T . In the general case (quadratic or higher order finite elements) one does not use
T,opt as in (4.81) in practice, because inv is not known. Instead one often takes the
simplified form
hT  hT 1
T,opt := T , with T := min 1, kbk,T ,
kbk,T 2

in which, if necessary, kbk,T is replaced by an approximation.

4.4.3 Stiffness matrix for the convection-diffusion problem


Stiffness matrix for SDFEM: nonsymmetric. Condition number ? (example: hyperbolic problem
with SDFEM).

113
114
Chapter 5

Finite element discretization of the


Stokes problem

5.1 Galerkin discretization of saddle-point problems


We recall the abstract variational formulation of a saddle-point problem as introduced in sec-
tion 2.3, (2.43).
Let V and M be Hilbert spaces and
a : V V R, b:V M R
be continuous bilinear forms. For f1 V , f2 M we consider the following variational problem:
find (, ) V M such that
a(, ) + b(, ) = f1 () for all V (5.1a)
b(, ) = f2 () for all M (5.1b)
We define H := V M and
k : H H R, k(U, V) := a(, ) + b(, ) + b(, )
(5.2)
with U := (, ), V := (, )
If we define F (, ) = f1 () + f2 () then the problem (5.1) can be reformulated as follows:
find U H such that k(U, V) = F (V) for all V H (5.3)
We note that in this section we use the notation U, V, F instead of the (more natural) notation
u, v, f that is used in the chapters 2 and 3. The reason for this is that the latter symbols are
confusing in view of the notation used in section 5.2.
For the Galerkin discretization of this variational problem we introduce finite dimensional sub-
spaces Vh and Mh :
Vh V, Mh M, Hh := Vh Mh H
The Galerkin discretization is as follows:
find Uh Hh such that k(Uh , Vh ) = F (Vh ) for all Vh Hh (5.4)
An equivalent formulation is: find (h , h ) Vh Mh such that
a(h , h ) + b( h , h ) = f1 ( h ) for all h Vh (5.5a)
b(h , h ) = f2 (h ) for all h Mh (5.5b)

115
For the discretization error we have a variant of the Cea-lemma 3.1.1:

Theorem 5.1.1 Consider the variational problem (5.1) with continuous bilinear forms a(, )
and b(, ) that satisfy:
b(,)
>0: supV kkV kkM M (5.6a)
>0 : a(, ) kk2V V (5.6b)
b(h ,h )
h > 0 : suph Vh kh kV h kh kM h Mh (5.6c)

Then the problem (5.1) and its Galerkin discretization (5.5) have unique solutions (, ) and
(h , h ), respectively. Furthermore the inequality

k h kV + k h kM C inf k h kV + inf k h kM
h Vh h Mh


2 1 + 1 h2 (2kak + kbk)3 .

holds, with C =

Proof. We shall apply lemma 3.1.1 to the variational problem (5.1) and its Galerkin dis-
cretization (5.5). Hence, we have to verify the conditions (3.2), (3.3), (3.4), (3.6). First note
that 
|k(U, V)| kak + kbk kUkH kVkH
holds and thus the condition (3.2) is satisfied with M = kak + kbk. Due to the assumptions
(5.6a) and (5.6b) it follows from corollary 2.3.12 and theorem 2.3.10 that the conditions (3.3),
(3.4) are satisfied. Due to (5.6b) and (5.6c) it follows from corollary 2.3.12 and theorem 2.3.10,
with V and M replaced by Vh and Mh , respectively, that the condition (3.6) is fulfilled, too.
Moreover, from the final statement in theorem 2.3.10 we obtain that (3.6) holds with
2 2
h = h2 h + 2kak h2 kbk + 2kak

Application of lemma 3.1.1 yields


1 M 1
k h k2V + k h k2M 2
(1 +
) inf k h k2V + k h k2M 2
h (h ,h )Vh Mh
p
From this and the inequalities + 2 2 + 2 2( + ), for 0, 0, the result
follows.

Remark 5.1.2 The condition (5.6c) implies dim(Vh ) dim(Mh ). This can be shown by the
following argument. Let ( j )1jm be a basis of Vh and (i )1ik a basis of Mh . Define the
matrix B Rkm by
Bij = b( j , i )
From (5.6c) it follows that for every h Mh , h 6= 0, there exists h Vh such that b( h , h ) 6=
0. Thus for every y Rk , y 6= 0, there exists x Rm such that yT Bx 6= 0, i.e., xT BT y 6= 0.
This implies that all columns of BT , and thus all rows of B, are independent. A necessary
condition for this is k m. 

116
5.2 Finite element discretization of the Stokes problem
We recall the variational formulation of the Stokes problem (with homogeneous Dirichlet bound-
ary conditions) given in section 2.6 : find (u, p) V M such that

a(u, v) + b(v, p) = f (v) for all v V (5.7a)


b(u, q) = 0 for all q M (5.7b)

with

V := H01 ()n , M := L20 ()


Z
a(u, v) := u v dx
Z

b(v, q) := q div v dx
Z
f (v) := f v dx

For the Galerkin discretization of this problem we use the simplicial finite element spaces defined
in section 3.2.1, i.e., for a given family {Th } of admissible triangulations of we define the pair
of spaces:
(Vh , Mh ) := (Xkh,0 )n , Xhk1 L20 () , k 1

(5.8)
A short discussion concerning other finite element spaces that can be used for the Stokes problem
is given in section 5.2.2. For k 2, the spaces in (5.8) are called Hood-Taylor finite elements [52].
Note that for k = 1 the pressure-space X0h L20 () contains discontinuous functions, whereas for
k 2 all functions in the pressure space are continuous. The Stokes problem fits in the general
setting presented in section 5.1. The discrete problem is as in (5.5): find (uh , ph ) such that

a(uh , vh ) + b(vh , ph ) = f (vh ) for all vh Vh (5.9a)


b(uh , qh ) = 0 for all qh Mh (5.9b)

From the analysis in section 2.6 it follows that the conditions (5.6a) and (5.6b) in theorem 5.1.1
are satisfied. The following remark shows that the condition in (5.6c), which is often called the
discrete inf-sup condition, needs a careful analysis:
Remark 5.2.1 Take n = 2, = (0, 1)2 and a uniform triangulation Th of that is defined as
follows. For N N and h := N1+1 the domain is subdivided in squares with sides of length h
and vertices in the set { (ih, jh) | 0 i, j N + 1 }. The triangulation Th is obtained by subdi-
viding every square in two triangles by inserting a diagonal from (ih, jh) to ((i + 1)h, (j + 1)h).
The spaces (Vh , Mh ) are defined as in (5.8) with k = 1. The space Vh has dimension 2N 2 and
dim(Mh ) = 2(N + 1)2 1. From dim(Vh ) < dim(Mh ) and remark 5.1.2 it follows that the
condition (5.6c) does not hold.
The same argument applies to the three dimensional case with a uniform triangulation of
(0, 1)3 consisting of tetrahedra (every cube is subdivided in 6 tetrahedra). In this case we
have dim(Vh ) = 3N 3 and dim(Mh ) = 6(N + 1)3 1.
We now show that also the lowest order rectangular finite elements in general do not satisfy
(5.6c). For this we consider n = 2, = (0, 1)2 and use a triangulation consisting of squares
1
Tij := [ih, (i + 1)h)] [jh, (j + 1)h], 0 i, j N, h :=
N +1

117
We assume that N is odd. The corresponding lowest order rectangular finite element spaces
(Vh , Mh ) = (Q1h,0 , Q0h L20 ()) are defined in (3.17). We define ph Mh by

(ph )|Tij = (1)i+j (checkerboard function)

For uh Vh we use the notation uh = (u, v), u(ih, jh) =: ui,j , v(ih, jh) =: vi,j . Then we have:
Z Z
ph div uh dx = (1)i+j uh n ds
Tij Tij
h
= (1)i+j (ui+1,j+1 + ui+1,j ) (ui,j+1 + ui,j )
2 
+ (vi+1,j+1 + vi,j+1 ) (vi+1,j + vi,j )

Using (uh )| = 0 we get, for 0 k N + 1,


N
X N
X
(1)j (uk,j+1 + uk,j ) = 0, (1)i (vi+1,k + vi,k ) = 0
j=0 i=0

and thus
Z N Z
X
b(uh , ph ) = ph div uh dx = ph div uh dx = 0
i,j=0 Tij

for arbitrary uh Vh . We conclude that there exists ph Mh , ph 6= 0, such that supvh Vh b(vh , ph ) =
0 and thus the discrete inf-sup conditon does not hold for the pair (Vh , Mh ). 

Definition 5.2.2 Let {Th } be a family of admissible triangulations of . Suppose that to every
Th {Th } there correspond finite element spaces Vh V and Mh M . The pair (Vh , Mh ) is
called stable if there exists a constant > 0 independent of Th {Th } such that
b(vh , qh )
sup kqh kL2 for all qh Mh (5.10)
vh Vh kvh k1

5.2.1 Error bounds


In this section we derive error bounds for the discretization of the Stokes problem with Hood-
Taylor finite elements. We will prove that for k = 2 the Hood-Taylor spaces are stable. Theo-
rem 5.1.1 is used to derive error bounds.
In the analysis of this finite element method we will need an approximation operator which is
applicable to functions u H 1 () and yields a reasonable approximation of u in the subspace
of continuous piecewise linear functions. Such an operator was introduced by Clement in [29]
and is denoted by IXC (Clement operator).
Let {Th } be a regular family of triangulations of consisting of n-simplices and X1h,0 the corre-
sponding finite element space of continuous piecewise linear functions. For the definition of the
Clement operator we need the nodal basis of this finite element space. Let {xi }1iN be the set
of vertices in Th that lie in the interior of . To every xi we associate a basis function i X1h,0
with the property i (xi ) = 1, i (xj ) = 0 for all j 6= i. Then {i }1iN forms a basis of the
space X1h,0 . We define a neighbourhood of the point xi by

xi := supp(i ) = { T Th | xi T }

118
and a neighbourhood of T Th by

T := { xi | xi T }

The local L2 -projection Pi : L2 (xi ) P0 is defined by:


Z
1
Pi v = |xi | v dx
x i

The Clement operator IXC : H01 () X1h,0 is defined by

N
X
IXC u = (Pi u)i (5.11)
i=1

For this operator the following approximation properties hold:


Theorem 5.2.3 (Clement operator.) There exists a constant C independent of Th {Th }
such that for all u H01 () and all T Th :

kIXC uk1,T C kuk1,T (5.12a)


ku IXC uk0,T C hT kuk1,T (5.12b)
1
ku IXC uk0,T C hT kuk1,T
2
(5.12c)

Proof. We refer to [29] and [13].

Variants of this operator are discussed in [13, 81, 14]. Results as in theorem 5.2.3 also hold
if H01 () and X1h,0 are replaced by H 1 () and X1h , respectively.
Using the Clement operator one can reformulate the stability condition (5.10) in another form
that turns out to be easier to handle. This reformulation is given in [93] and applies to a large
class of finite element spaces. Here we only present a simplified variant that applies to the
Hood-Taylor finite element spaces. We will need the mesh-dependent norm
sX
kqh k1,h := h2T kqh k20,T , qh X1h L20 ()
T Th

Theorem 5.2.4 Let {Th } be a regular family of triangulations. The Hood-Taylor pair of finite
element spaces (Vh , Mh ) as in (5.8), k 2, is stable iff there exists a constant > 0 independent
of Th {Th } such that
b(vh , qh )
sup kqh k1,h for all qh Mh (5.13)
vh Vh kvh k1

Proof. For T Th let F (x) = Bx + c be an affine mapping such that F (T ) = T , where T


is the unit n-simplex. Using the lemmas 3.3.5 and 3.3.6 and dim(Pk ) < we get, for qh Mh
and qh := qh F :

kqh k20,T = |qh |21,T CkB1 k22 | det B||qh |21,T


CkB1 k22 | det B||qh |20,T C h2 2
T kqh k0,T

119
with a constant C independent of qh and of T . This yields kqh k1,h Ckqh kL2 and thus the
stability property implies (5.13).
Assume that (5.13) holds. Take an arbitrary qh Mh with kqh kL2 = 1. The constants C below
are independent of qh and of Th {Th }. From the inf-sup property of the continuous problem
it follows that there exists > 0, independent of qh , and v H01 ()n such that

kvk1 = 1, b(v, qh )

We apply the Clement operator to the components of v:


n
wh := IXC v = IXC v1 , . . . , IXC vn X1h,0 Vh


From theorem 5.2.3 we get

kwh k1 Ckvk1 = C
X
h2 2 2
T kv wh k0,T Ckvk1 = C
T Th

From this we obtain


Z
|b(wh v, qh )| = |qh div (wh v) dx|

X Z
=| qh (wh v) dx|
T Th T
X
kqh k0,T kwh vk0,T (5.14)
T Th
X 1 X 1
h2T kqh k20,T 2
h2 2
T kwh vk0,T
2

T Th T Th

Ckqh k1,h

Define := supvh Vh b(v h ,qh )


kvh k1 . From (5.13) we have kqh k1,h

. Using this in combination with
the result in (5.14) we obtain

b(wh , qh )
C b(wh , qh ) = C b(v, qh ) + C b(wh v, qh )
kwh k1
 C 
C Ckqh k1,h C

And thus for a suitable constant > 0 independent of qh and of Th {Th }.

Theorem 5.2.5 Let {Th } be a regular family of triangulations consisting of simplices. We


assume that every T Th has at least one vertex in the interior of . Then the Hood-Taylor
pair of finite element spaces with k = 2 is stable.

Proof. We consider only n {2, 3}. Take qh Mh , qh 6= 0. The constants used in the
proof are independent of Th {Th } and of qh . The set of edges in Th is denoted by E. This
set is partitioned in edges which are in the interior of and edges which are part of :

120
E = Eint Ebnd . For every E E, mE denotes the midpoint of the edge E. Every E Eint with
endpoints a1 , a2 Rn is assigned a vector tE := a1 a2 . For E Ebnd we define tE := 0. Since
qh X1h the function x tE qh (x) is continuous across E, for E Eint . We define

wE := tE qh (mE ) tE , for E E
n
Due to lemma 3.3.2 a unique wh X2h,0 is defined by
(
0 if xi is a vertex of T Th
wh (xi ) =
wE if xi = mE for E E

For each T Th the set of edges of T is denoted by ET . By using quadrature we see that for
any p P2 which is zero at the vertices of T we have

|T | X
Z
p(x) dx = p(mE )
T 2n 1
EET

We obtain:
Z Z X Z
qh div wh dx = qh wh dx = (qh )|T wh dx
T Th T
X |T | X
= (qh )|T wh (mE ) (5.15)
2n 1
T Th EET
X |T | X 2
= tE qh (mE )
2n 1
T Th EET

Using the fact that (qh )|T is constant one easily checks that
X 2
Ckqh k20,T tE qh (mE ) Ckqh k20,T (5.16)
EET

with C > 0 and C independent of T . Combining this with (5.15) we get

|T |
Z X
qh div wh dx C kqh k20,T
2n 1
T Th
X (5.17)
C h2T kqh k20,T = Ckqh k21,h
T Th

Let ET be the set of all edges of the unit n-simplex T . In the space { v P2 | v is zero at the vertices of T }
P 1
2 2 are equivalent. Using this componentwise for the
the norms kvk1,T and EET v(m E )
vector-function wh := wh F we get:

|wh |21,T Ch2 2 2 2


T |T | |wh |1,T ChT |T | kwh k1,T
X X
Ch2
T |T | kwh (mE )k22 = Ch2
T |T | kwE k22
EET EET

121
Summation over all simplices T yields, using (5.16),
X X X
kwh k21 C|wh |21 C |wh |21,T C h2
T |T | kwE k22
T Th T Th EET
X X 2
=C h2
T |T | tE qh (mE ) ktE k22 (5.18)
T Th EET
X
C h2T kqh k20,T = C kqh k21,h
T Th

From (5.17) and (5.18) we obtain

b(wh , qh )
C kqh k1,h
kwh k1

with a constant C > 0 independent of qh and of Th {Th }. Now apply theorem 5.2.4.

One can also prove stability for higher order Hood-Taylor finite elements:

Theorem 5.2.6 Let {Th } be a regular family of triangulations as in theorem 5.2.5. Then the
Hood-Taylor pair of finite element spaces with k 3 is stable.

Proof. We refer to the literature: [15, 16, 22].

Remark 5.2.7 The condition that every T Th has at least one vertex in the interior of is a
mild one. Let S := { T Th | T has no vertex in the interior of }. If S 6= {} then a suitable
bisection of each T S (and of one of the neighbours of T ) results in a modified admissible
triangulation for which the condition is satisfied. In certain cases the condition can be avoided
(for example, for n = 2, k = 2, 3, cf. [87]) or replaced by another similar assumption on the
geometry of the triangulation (cf. remark 3.2 in [16]). An example which shows that the stability
result does in general not hold without an assumption on the geometry of the triangulation is
given in [16] remark 3.3. 

For the discretization of the Stokes problem with Hood-Taylor finite elements we have the fol-
lowing bound for the discretization error:

Theorem 5.2.8 Let {Th } be a regular family of triangulations as in theorem 5.2.5. Consider the
discrete Stokes problem (5.9) with Hood-Taylor finite element spaces as in (5.8), k 2. Suppose
that the continuous solution (u, p) lies in H m ()n H m1 () with m 2. For m k + 1 the
following holds:
ku uh k1 + kp ph kL2 C hm1 |u|m + |p|m1


with a constant C independent of Th {Th } and of (u, p).

Proof. We apply theorem 5.1.1 with V = H01 ()n , M = L20 () and (Vh , Mh ) the pair of Hood-
Taylor finite element spaces. From the analysis in section 2.6 it follows that the conditions (5.6a)
and (5.6b) are satisfied. From theorem 5.2.5 or theorem 5.2.6 it follows that the discrete inf-sup
property (5.6c) holds with a constant h independent of Th . Hence we have

ku uh k1 + kp ph kL2 C inf ku vh k1 + inf kp qh kL2 (5.19)
vh Vh qh Mh

122
For the first term on the right handside we use (componentwise) the result of corollary 3.3.10.
This yields:
inf ku vh k1 Chm1 |u|m (5.20)
vh Vh
with a constant C independent of u and of Th {Th }.
Using p L20 () it follows that
inf kp qh kL2 = inf kp qh kL2
qh Mh qh Xk1
h

For m = 2 we can use the Clement operator of theorem 5.2.3 and for m 3 the result in
corollary 3.3.10 to bound the approximation error for the pressure:
inf kp qh kL2 Chm1 |p|m1 (5.21)
qh Mh

Combination of (5.19), (5.20) and (5.21) completes the proof.

Sufficient conditions for (u, p) H 2 ()n H 1 () to hold are given in section 2.6.2.
As in section 3.4.2 one can derive an L2 -error bound for the velocity using a duality argument.
For this we have to assume H 2 -regularity of the Stokes problem (cf. section 2.6.2):
Theorem 5.2.9 Consider the Stokes problem and its Hood-Taylor discretization as described
in theorem 5.2.8. In addition we assume that the Stokes problem is H 2 -regular. Then for
2 m k + 1 the inequality
ku uh kL2 Chm |u|m + |p|m1


holds, with C independent of Th {Th } and of (u, p).


Proof. The variational Stokes problem can be reformulated as in (5.3) :
find U H such that k(U, V) = F (V) for all V H
with H = H01 ()n L20 (), k(, ) as in (5.2), U = (u, p), F (V) = F ((v, q)) = f v dx. Let
R

eh = U Uh = (u uh , p ph ) be the discretization error. We consider the dual problem with


f = u uh : let U = (u, p) H be such that
Z
k(U, V) = k(V, U) = Fe (V) := (u uh ) v dx V = (v, q) H

For V = eh we obtain, using the Galerkin orthogonality of eh :
ku uh k2L2 = Fe (eh ) = k(eh , U) = k(eh , U wh ) wh Hh = Vh Mh
and thus
ku uh k2L2 C keh kH inf kU wh kH
wh Hh

C keh kH inf ku vh k1 + inf kp qh kL2
vh Vh qh Mh

For the second term on the right handside we can use approximation results as in (5.20), (5.21).
In combination with the H 2 -regularity this yields

inf ku vh k1 + inf kp qh kL2 C h |u|2 + |p|1 C hku uh kL2
vh Vh qh Mh

For keh kH ku uh k1 + kp ph kL2 we can use the result in theorem 5.2.8. Thus we conlcude
that
ku uh k2L2 C hm1 |u|m + |p|m1 h ku uh kL2


holds.

123
5.2.2 Other finite element spaces
In this section we briefly discuss some other pairs of finite element spaces that are used in prac-
tice for solving Stokes (and Navier-Stokes) problems.

Rectangular finite elements.


Let {Th } be a regular family of triangulations consisting of n-rectangles. The pair of rectangular
finite element spaces is given by (cf. (3.17)):

(Vh , Mh ) = (Qkh,0 )n , Qhk1 L20 () , k 1




In remark 5.2.1 it is shown that for k = 1 this pair in general will not be stable. In [11] it
is proved that the pair (Vh , Mh ) with k = 2 is stable both for n = 2 and n = 3. In [87] it is
proved that for all k 2 the pair (Vh , Mh ) is stable if n = 2. In these stable cases one can prove
disretization error bounds as in theorem 5.2.8 and theorem 5.2.9. The analysis is very similar
to the one presented for the case of simplicial finite elements in section 5.2.1.

Mini-element.
Let {Th } be a regular family of triangulations consisting of simplices. For every T Th we can
define a so-called bubble function
(Q
n+1
i=1 i (x) for x T
bT (x) =
0 otherwise

with i (x), i = 1, . . . , n + 1, the barycentric coordinates of x T . Define the space of bubble


functions B := span{ bT | T Th }. The mini-element, introduced in [4] is given by the pair

(Vh , Mh ) = (X1h,0 B)n , X1h L20 ()




This element is stable, cf. [40, 4]. An advantage of this element compared to, for example, the
Hood-Taylor element with k = 2 is that the implementation of the former is relatively easy. This
is due to the following. The unknowns associated to the bubble basis functions can be elimi-
nated by a simple local technique (so-called static condensation) and the remaining unknowns
for the velocity and pressure basis functions are associated to the same set of points, namely the
vertices of the simplices. In case of Hood-Taylor elements (k = 2) one also needs the midpoints
of edges for some of the velocity unknowns. Hence, the data structures for the mini-element
are relatively simple. A disadvantage of the mini-element is its low accuracy (only P1 for the
velocity).

IsoP2 P1 element.
This element is a variant of the Hood-Taylor element with k = 2. Let {Th } be a regular family
of triangulations consisting of simplices. Given Th we construct a refinement T 1 h by dividing
2
each n-simplex T Th , n = 2 or n = 3, into 2n subsimplices by connecting the midpoints of
the edges of T . Note that for n = 3 this construction is not unique. The space of continuous
functions which are piecewise linear on the simplices in T 1 h and zero on is denoted by X11 h,0 .
2 2
The isoP2 P1 element consists of the pair of spaces

(Vh , Mh ) = (X11 h,0 )n , X1h L20 ()



2

Both for n = 2 and n = 3 this pair is stable. This can be shown using the analysis of sec-
tion 5.2.1. The proofs of theorem 5.2.4 and of theorem 5.2.5 apply, with minor modifications,

124
to the isoP2 P1 pair, too. In the discrete velocity space Vh the degrees of freedom (unknowns)
are associated to the vertices and the midpoints of edges of T Th . This is the same as for the
discrete velocity space in the Hood-Taylor pair with k = 2. This explains the name isoP2 P1 .
Note, however, that the accuracy for the velocity for the isoP2 P1 element is only O(h) in
the norm k k1 , whereas for the Hood-Taylor pair with k = 2 one has O(h2 ) in the norm k k1
(provided the solution is sufficiently smooth).

Nonconforming Crouzeix-Raviart: in preparation.

In certain situations, if the pair (Vh , Mh ) of finite element spaces is not stable, one can still
successfully apply these spaces for discretization of the Stokes problem, provided one uses an
appropriate stabilization technique. We do not discuss this topic here. An overview of some
useful stabilization methods is given in [73], section 9.4.

125
126
Chapter 6

Linear iterative methods

The discretization of elliptic boundary value problems like the Poisson equation or the Stokes
equations results in a large sparse linear system of equations. For the numerical solution of such
a system iterative methods are applied. Important classes of iterative methods are treated in
the chapters 7-10. In this chapter we present some basic results on linear iterative methods and
discuss some classical iterative methods like, for example, the Jacobi and Gauss-Seidel method.
In our applications these methods turn out to be very inefficient and thus not very suitable for
practical use. However, these methods play a role in the more advanced (and more efficient)
methods treated in the chapters 7-10. Furthermore, these basic iterative methods can be used
to explain important notions such as convergence rate and efficiency. Standard references for
a detailed analysis of basic iterative methods are Varga [92], Young [100]. We also refer to
Hackbusch [46],[48] and Axelsson [6] for an extensive analysis of these methods.

In the remainder of this chapter we consider a (large sparse) linear system of equations

Ax = b (6.1)

with a nonsingular matrix A Rnn . The solution of this system is denoted by x .

6.1 Introduction
We consider a given iterative method, denoted by xk+1 = (xk ), k 0, for solving the system
in (6.1). We define the error as
ek = xk x , k 0.
The iterative method is called a linear iterative method if there exists a matrix C (depending
on the particular method but independent of k) such that for the errors we have the recursion

ek+1 = Cek , k 0. (6.2)

The matrix C is called the iteration matrix of the method. In the next section we will see that
basic iterative methods are linear. Also the multigrid methods discussed in chapter 9 are linear.
The Conjugate Gradient method, however, is a nonlinear iterative method (cf. chapter 7).
From (6.2) it follows that ek = Ck e0 for all k, and thus limk ek = 0 for arbitrary e0 if and
only if limk Ck = 0. Based on this, the linear iterative method with iteraton matric C is is
called convergent if
lim Ck = 0 . (6.3)
k

127
An important characterization for convergence is related to the spectral radius of the iteration
matrix. To derive this characterization we first need two lemmas.

Lemma 6.1.1 For all B Rnn and all > 0 there exists a matrix norm k k on Rnn such
that
kBk (B) +

Proof. For the given matrix B there exists a nonsingular matrix T Cnn which transforms
B to its Jordan normal form:

T1 BT = J, J = blockdiag(J )1m ,
with J = ,

1
.. ..
. .
or J = ..
, (B), 1m
. 1


For the given > 0 define D := diag(, 2 , . . . , n ) Rnn and T := TD , J := D1


JD .
Note that J has the same form as J, only with the entries 1 on the codiagonal replaced by .
For C Rnn define
n
X
kCk := kT1
CT k = max |(T1
CT )ij |
1in
j=1

This defines a matrix norm on Rnn . Furthermore,

kBk = kT1
BT k = kJ k max || + (B) +
(B)

This proves the result of the lemma.

Lemma 6.1.2 (Stable matrices) For B Rnn the following holds:

lim Bk = 0 if and only if (B) < 1.


k

Proof. . Take > 0 such that (B) + < 1 holds and let k k be the matrix norm
defined in lemma 6.1.1. Then

kBk k kBkk ((B) + )k

holds. Hence, limk kBk k = 0 and thus limk Bk = 0.


. Let max{ kCxk n
kxk | x C , x 6= 0 } be the complex maximum norm on C
nn . Take

(C) and v Cn , v 6= 0, such that Cv = v and || = (C). Then ||kvk =


kvk = kCvk . From this it follows that (C) kCk holds for arbitrary C Cnn .
From limk Bk = 0 we get limk kBk k = 0 and thus, due to (B)k = (Bk ) kBk k ,
we have limk (B)k = 0. Thus (B) < 1 must hold.

128
Corollary 6.1.3 For any B Rnn and any matrix norm k k on Rnn we have

(B) kBk

Proof. If (B) = 0 then B = 0 and the result holds. For (B) 6= 0 define B := (B)1 B.
Assume that (B) > kBk. Then 1 = (B) > kBk holds and thus limk kBkk = 0. Using
kBk k kBkk this yields limk Bk = 0. From lemma 6.1.2 we conclude (B) < 1 which gives
a contradiction.

From lemma 6.1.2 we obtain the following result:


Theorem 6.1.4 A linear iterative method is convergent if and only if for the corresponding
iteration matrix C we have (C) < 1.
If (C) < 1 then the iterative method converges and the spectral radius (C) even yields a
quantitative result for the rate of convergence. To see this, we first formulate a lemma:
Lemma 6.1.5 For any matrix norm k k on Rnn and any B Rnn the following equality
holds:
lim kBk k1/k = (B).
k

Proof. From corollary 6.1.3 we get (B)k = (Bk ) kBk k and thus
1
(B) kBk k k for all k 1 (6.4)

Take arbitrary > 0 and define B := ((B) + )1 B. Then (B) < 1 and thus limk Bk = 0.
Hence there exists k0 such that for all k k0 , kBk k 1, i.e., ((B) + )k kBk k 1. We get
1
kBk k k (B) + for all k k0 (6.5)

From (6.4) and (6.5) it follows that limk kBk k1/k = (B).

For the error ek = Ck e0 we have


kek k 1 kCk e0 k 1
max max
e0 Rn ke0 k e 0
e R n ke 0 k e
1 1 1  k1
kCk k kCk k k
e e
From lemma 6.1.5 we have that, for k large enough,

kCk k1/k (C)

Hence, to reduce the norm of an arbitrary starting error ke0 k by a factor 1/e we need asymp-
totically (i.e. for k large enough) approximately ( ln((C)))1 iterations. Based on this we
call ln((C)) the asymptotic convergence rate of the iterative method (in the literature, e.g.
Hackbusch [48], sometimes (C) is called the asymptotic convergence rate).
The quantity kCk is the contraction number of the iterative method. Note that

kek k kCkkek1 k for all k1

holds, and
(C) kCk.

129
From these results we conclude that (C) is a reasonable measure for the rate of convergence,
provided k is large enough. For k small it may be better to use the contraction number as a
measure for the rate of convergence. Note that the asymptotic convergence rate does not depend
on the norm k k. In some situations, measuring the rate of convergence using the contraction
number or using the asymptotic rate of convergence is the same. For example, if we use the
Euclidean norm and if C is symmetric then

(C) = kCk2 (6.6)

holds. However, in other situations, for example if C is strongly nonsymmetric, one can have
(C) kCk.

To measure the quality (efficiency) of an iterative method one has to consider the following
two aspects:

The arithmetic costs per iteration. This can be quantified in flops needed for one iteration.

The rate of convergence. This can be quantified using ln((C)) (asymptotic convergence
rate) or kCk (contraction number).

To be able to compare iterative methods the notion of complexity is introduced. We assume:

A given linear system.

A given error reduction factor R, i.e. we wish to reduce the norm of an arbitrary starting
error by a factor R.

The complexity of an iterative method is then defined as the order of magnitude of the number
of flops needed to obtain an error reduction with a factor R for the given problem. In this
notion the arithmetic costs per iteration and the rate of convergence are combined. The quality
of different methods for a given problem (class) can be compared by means of this complexity
concept. Examples of this are given in section 6.6

6.2 Basic linear iterative methods


In this section we introduce classical linear iterative schemes, namely the Richardson, (damped)
Jacobi, Gauss-Seidel and SOR methods. For the convergence analysis that is presented in the
sections 6.3 and 6.5 it is convenient to put these methods in the general framework of of so-called
matrix splittings (cf. Varga [92]).

We first show how a splitting of the matrix A in a natural way results in a linear iterative
method. We assume a splitting of the matrix A such that

A =MN , where (6.7a)


M is nonsingular, and (6.7b)
for arbitrary y we can solve Mx = y with relatively low costs. (6.7c)

For the solution x of the system in (6.1) we have

Mx = Nx + b .

130
The splitting of A results in the following matrix splitting iterative method. For a given starting
vector x0 Rn we define
Mxk+1 = Nxk + b , k 0 (6.8)
This can also be written as
xk+1 = xk M1 (Axk b). (6.9)
From (6.9) it follows that for the error ek = xk x we have

ek+1 = (I M1 A)ek .

Hence the iteration in (6.8), (6.9) is a linear iterative method with iteration matrix

C = I M1 A = M1 N. (6.10)

The condition in (6.7c) is introduced to obtain a method in (6.8) for which the arithmetic costs
per iteration are acceptable. Below we will see that the above-mentioned classical iterative
methods can be derived using a suitable matrix splitting. These methods satisfy the conditions
in (6.7b), (6.7c), but unfortunately, when applied to discrete elliptic boundary value problems,
the convergence rates of these methods are in general very low. This is illustrated in section 6.6.

Richardson method
The simplest linear iterative method is the Richardson method:
 0
x a given starting vector ,
(6.11)
xk+1 = xk (Axk b) , k 0 .

with a parameter R, 6= 0. The iteration matrix of this method is given by C = I A.

Jacobi method
A second classical and very simple method is due to Jacobi. We introduce the notation

D := diag(A) , A=DLU , (6.12)

with L a lower triangular matrix with zero entries on the diagonal and U an upper triangular
matrix with zero entries on the diagonal. We assume that A has only nonzero entries on the
diagonal, so D is nonsingular. The method of Jacobi is the iterative method as in (6.8) based
on the matrix splitting
M := D , N := L + U .
The method of Jacobi is as follows
 0
x a given starting vector ,
Dxk+1 = (L + U)xk + b , k 0 .

This can also be formulated row by row:


( 0
x a given starting vector,
(6.13)
aii xk+1 = j6=i aij xkj + bi ,
P
i i = 1, 2, . . . , n , k0.

From this we see that in the method of Jacobi we solve the ith equation ( nj=1 aij xj = bi ) for
P
the ith unknown (xi ) using values for the other unknowns (xj , j 6= i) computed in the previous

131
iteration.
The iteration can also be represented as

xk+1 = (I D1 A)xk + D1 b, k0

In the Jacobi method the computational costs per iteration are low, namely comparable to
one matrix-vector multiplication Ax, i.e. cn flops (due to the sparsity if A).
We introduce a variant of the Jacobi method in which a parameter is used. This method is given
by
xk+1 = (I D1 A)xk + D1 b, k 0 (6.14)

with a given real parameter 6= 0. This method corresponds to the splitting

1 1
M= D, N=( 1)D + L + U (6.15)

For = 1 we obtain the Jacobi method. For 6= 1 this method is called the damped Jacobi
method (damped due to the fact that in practice one usually takes (0, 1)).

Gauss-Seidel method
This method is based on the matrix splitting

M := D L , N := U .

This results in the method:

x0 a given starting vector ,




(D L)xk+1 = Uxk + b , k 0 .

This can be formulated row wise:


( 0
x a given starting vector ,
(6.16)
aii xk+1 = j<i aij xk+1 j>i aij xkj + bi ,
P P
i j i = 1, . . . , n .

For the Gauss-Seidel method to be feasible we assume that D is nonsingular. In the Jacobi
method (6.13) for the computation of xk+1 i (i.e. for solving the ith equation for the ith unknown
xi ) we use the values xkj , j 6= i, whereas in the GaussSeidel method (6.16) for the computation
of xk+1
i we use xk+1
j , j < i and xkj , j > i.
The iteration matrix of the Gaus-Seidel method is given by

C = (D L)1 U = I (D L)1 A .

SOR method
The Gauss-Seidel method in (6.16) can be rewritten as

x0 a given starting vector ,


(

xk+1 = xki k+1


+ ji aij xkj bi /aii ,
P P 
i j<i aij xj i = 1, . . . , n, k 0.

132
From this representation it is clear how xk+1 i can be obtained by adding a certain correction term
k
to xi . We now introduce a method in which this correction term is multiplied by a parameter
>0: 0
x a given starting vector ,

k+1 k k+1 k b /a , (6.17)
P P 
x i = xi j<i aij x j + ji aij xj i ii
i = 1, . . . , n, k 0 .

This method is the Successive Overrelaxation method (SOR). The terminology over is used
because in general one should take > 1 (cf. Theorem 6.4.3 below). For = 1 the SOR method
results in the Gauss-Seidel method. In matrix-vector notation the SOR method is as follows:

Dxk+1 = (1 )Dxk + (Lxk+1 + Uxk + b),

or, equivalently,
1 1
(D L)xk+1 = [(1 )D + U]xk + b .

From this it is clear that the SOR method is also a matrix splitting iterative method, corre-
sponding to the splitting (cf. (6.15))
1 1
M := DL , N := ( 1)D + U .

The iteration matrix is given by
1
C = C = I M1 A = I ( D L)1 A.

For the SOR method the arithmetic costs per iteration are comparable to those of the Gauss-
Seidel method.

The Symmetric Successive Overrelaxation method (SSOR) is a variant of the SOR method.
One SSOR iteration consists of two SOR steps. In the first step we apply an SOR iteration as
in (6.17) and in the second step we again apply an SOR iteration but now with the reversed
ordering of the unknowns. In formulas we thus have:

k+ 1 k+ 21
P 
xi 2 = xki k b /a ,
P
a x
j<i ij j + a x
ji ij j i ii i = 1, 2, . . . , n
k+ 1 k+ 1
 
xk+1 k+1
P P
= xi 2 j>i aij xj + ji aij xj 2 bi /aii , i = n, . . . , 1

i

This method results if we use a matrix splitting with


1
M= (D L)D1 (D U) . (6.18)
(2 )
Although the arithmetic costs for one SSOR iteration seem to be significantly higher than for
one SOR iteration, one can implement SSOR in such a way that the costs per iteration are
approximately the same as for SOR (cf. [68]). In many cases the rate of convergence of both
methods is about the same. Often, in the SSOR method the sensitivity of the convergence rate
with respect to variation in the relaxation parameter is much lower than in the SOR method
(cf. Axelsson and Barker [7]). Finally we note that, if A is symmetric positive definite then the
matrix M in (6.18) is symmetric positive definite, too (such a property does not hold for the
SOR method). Due to this property the SSOR method can be used as a preconditioner for the
Conjugate Gradient method. This is further explained in chapter 7.

133
6.3 Convergence analysis in the symmetric positive definite case
For the classical linear iterative methods we derive convergence results for the case that A is
symmetric positive definite. Recall that for square symmetric matrices B and C we use the
notation B < C (B C) if C B is positive definite (semi-definite).
We start with an elementary lemma:

Lemma 6.3.1 Let B Rnn be a symmetric positive definite matrix. The smallest and largest
eigenvalues of B are denoted by min (B) and max (B), respectively. The following holds:

2
(I B) < 1 iff 0<< (6.19)
max (B)
2
min (I B) = (I opt B) = 1
(B) + 1
(6.20)
2
for opt =
min (B) + max (B)

Proof. The eigenvalues of I B are given by { 1 | (B) }. Define opt as in (6.20).


We then have

(I B) = max{ |1 min (B)| , |1 max (B)| }



1 max (B)
if 0
= 1 min (B) if 0 opt

max (B) 1 if opt

Hence (I B) < 1 iff > 0 and max (B) 1 < 1. This proves the result in (6.19). The
result in (6.20) follows from

min (I B) = 1 opt min (B)



2min (B) 2
=1 =1
min (B) + max (B) (B) + 1

As an immediate consequence of this lemma we get a convergence result for the Richardson
method.

Corollary 6.3.2 Let A be symmetric positive definite. For the iteration matrix of the Richard-
son method C = I A we have
2
(C ) < 1 iff 0<< (6.21)
max (A)
2
min (C ) = (Copt ) = 1
(A) + 1
(6.22)
2
for opt =
min (A) + max (A)

We now consider the Jacobi method. From Theorem 6.1.4 we obtain that this method is conver-
gent if and only if (I D1 A) < 1 holds. A simple example shows that the method of Jacobi

134
1 21

1 1
does not converge for every symmetric positive definite matrix A: consider A = 1 1 21 1 ,
1 1 1 12
with spectrum (A) = { 12 , 3 21 }. Then (I D1 A) = |1 111 3 12 | = 43 .
2
From the analysis in section 6.5.2 (theorems 6.5.12 and 6.5.13) it follows that if A is symmetric
positive definite and aij 0 for all i 6= j, then the Jacobi method is convergent.
A convergence result for the damped Jacobi method can be derived in which the assumption
aij 0 is avoided:

Theorem 6.3.3 Let A be a symmetric positive definite matrix. For the iteration matrix of the
damped Jacobi method C = I D1 A we have

2
(C ) < 1 iff 0<< (6.23)
max (D1 A)
2
min (C ) = (Copt ) = 1 1
(D A) + 1
(6.24)
2
for opt =
min (D1 A) + max (D1 A)
1 1
Proof. Note that D 2 AD 2 is symmetric positive definite. Apply lemma 6.3.1 with
1 1 1 1
B = D 2 AD 2 and note that (D 2 AD 2 ) = (D1 A).

If the matrix A is the stiffness matrix resulting from a finite element discretization of a scalar
elliptic boundary value problem then in general the condition number (D1 A) is very large
(namely h2 , with h a mesh size parameter). The result in the previous theorem shows that
for such problems the rate of convergence of a (damped) Jacobi method is very low. This is
illustrated in section 6.6.

For the convergence analysis of both the Gauss-Seidel and the SOR method the following lemma
is useful.

Lemma 6.3.4 Let A be symmetric positive definite and assume that M is such that

M + MT > A (6.25)

Then M is nonsingular and

(I M1 A) kI M1 AkA < 1

holds.

Proof. Assume that Mx = 0. Then h(M + MT )x, xi = 0 and using assumption (6.25) this
1 1
implies x = 0. Hence M is nonsingular. We introduce C := I A 2 M1 A 2 and note that
1
kI M1 AkA = kCk2 = (CT C) 2

From M + MT > A it follows that


1 1 1 1 1 1
A 2 (MT + M1 )A 2 = A 2 MT (M + MT )M1 A 2 > A 2 MT AM1 A 2

135
Using this we get
1 1 1 1
0 CT C = I A 2 MT A 2 I A 2 M1 A 2
 
1 1 1 1
= I A 2 (MT + M1 )A 2 + A 2 MT AM1 A 2 < I

and thus (CT C) < 1 holds.

Using this lemma we immediately get a main convergence result for the Gauss-Seidel method.
Theorem 6.3.5 Let A be symmetric positive definite. Then we have
(C) kCkA < 1 with C := I (D L)1 A
and thus the Gauss-Seidel method is convergent.
Proof. The Gauss-Seidel method corresponds to the matrix-splitting A = M N with
M = D L. Note that M + MT = D + (D L LT ) = D + A > A holds. Application of
lemma 6.3.4 proves the result.

We now consider the SOR method. Recall that this method corresponds to the matrix-splitting
A = M N with M = 1 D L.
Theorem 6.3.6 Let A be symmetric positive definite. Then for (0, 2) we have
1
(C ) kC kA < 1 with C := I ( D L)1 A

and thus the SOR method is convergent.
1
Proof. For M = D L we have
2 2 
M + MT = D L LT = D+A>A if (0, 2)

Application of lemma 6.3.4 proves the result.

In the following lemma we show that for every matrix A (i.e., not necessarily symmetric) with
a nonsingular diagonal the SOR method with / (0, 2) is not convergent.
Lemma 6.3.7 Let A Rnn be a matrix with aii 6= 0 for all i. For the iteration matrix
C = I ( 1 D L)1 A of the SOR method we have
(C ) |1 | for all 6= 0
Proof. Define L := D1 L and U := D1 U. Then we have
1
C = I (I L)1 (I L U) = I L (1 )I + U)
1
(1 )I + U) = (1 )n . Let {i | 1 i n } = (C ) be
 
Hence, det(C ) = det I L
the spectrum of the iteration matrix. Then due to fact that the determinant of a matrix equals
the product of its eigenvalues we get ni=1 |i | = |1 |n . Thus there must be an eigenvalue with
modulus at least |1 |.

Jacobi for the nonsymmetric case .....in preparation.

Block-Jacobi method ....in preparation.

136
6.4 Rate of convergence of the SOR method
The result in theorem 6.3.6 shows that in the symmetric positive definite case the SOR method is
convergent if we take (0, 2). This result, however, does not quantify the rate of convergence.
Moreover, it is not clear how the rate of convergence depends on the choice of the parameter .
It is known that for certain problems a suitable choice of the parameter can result in an SOR
method which has a much higher rate of convergence than the Jacobi and Gauss-Seidel methods.
This is illustrated in example 6.6.4. However, the relation between the rate of convergence and
the parameter is strongly problem dependent and for most problems it is not known how a
good (i.e. close to optimal) value for the parameter can be determined.
In this section we present an analysis which, for a relatively small class of block-tridiagonal
matrices, shows the dependence of the spectral radius of the SOR iteration matrix on the
parameter . For related (more general) results we refer to the literature, e.g., Young [100],
Hageman and Young [50], Varga [92]. For a more recent treatment we refer to Hackbusch [48].
We start with a technical lemma. Recall the decomposition A = D L U, with D = diag(A),
L and U strictly lower and upper triangular matrices, respectively.

Lemma 6.4.1 Consider A = D L U with det(A) 6= 0. Assume that A has the block-
tridiagonal structure

D11 A12
.. ..
A21
. .

A=
.. .. .. , Dii Rni ni , 1 i k,

. . .
(6.26)

.. ..
. .

Ak1,k
Ak,k1 Dkk

with diag(A) = blockdiag(D11 , . . . , Dkk )

Then the eigenvalues of zD1 L + z1 D1 U are independent of z C, z 6= 0.

Proof. For z C, z 6= 0 define Gz := zD1 L + z1 D1 U. Note that

1 1

0 z D11 A12
1 ..
zD22 A21
0 .

Gz = .. .. ..
. . .


..

. 1 1


0 D A
z k1,k1 k1,k

zD1
kk Ak,k1 0

Introduce Tz := blockdiag(I1 , zI2 , . . . , z k1 Ik ) with Ii the ni ni identity matrix. Now note


that
T1z Gz Tz = G1 = D L + D U
1 1

This similarity transformation with Tz does not change the spectrum and thus (Gz ) =
(D1 L + D1 U) holds for all z 6= 0. The latter spectrum is independent of z.

We collect a few properties in the following lemma.

137
Lemma 6.4.2 Let A be as in lemma 6.4.1. Let CJ = I D1 A and C = I ( 1 D L)1 A
be the iteration matrices of the Jacobi and SOR method, respectively. The following holds

(a) (CJ ) (CJ )


(b) 0 (C ) = 1
(c) For 6= 0, 6= 0 we have
+1
(C ) 1 (CJ )
2

Proof. With L := D1 L and U := D1 U we have CJ = L + U, C = (I L)1 (1



)I + U . From lemma 6.4.1 with z = 1 and z = 1 we have (L + U) = (L U) and thus
(L + U) (L U) = (L + U) holds, which proves (a). If 0 (C ) then we
have
0 = det (I L)1 (1 )I + U = det (1 )I + U = (1 )n
 

and thus = 1, i.e., the result in (b) holds. For (C ), 6= 0 and 6= 0 we have

det(C I) = det (I L)1 [(1 )I + U (I L)]



1 1 1
= det 2 ( 2 L + 2 U) ( + 1)I


1 1 1 +1 
= n 2 n det ( 2 L + 2 U) 1 I
2

Using lemma 6.4.1 we get (for 6= 0, 6= 0):

+1 1 1
(C ) 1 ( 2 L + 2 U) = (CJ )
2

which proves the result in (c).

Now we can prove a main result on the rate of convergence of the SOR method.

Theorem 6.4.3 Let A be as in lemma 6.4.1 and CJ , C the iteration matrices of the Jacobi and
SOR method, respectively. Assume that all eigenvalues of CJ are real and that := (CJ ) < 1
(i.e. the method of Jacobi is convergent). Define
!2
2
opt := p =1+ p (6.27)
1 + 1 2 1 + 1 2

The following holds


(
1
p 2
+
4 2 2 4( 1) for 0 < opt
(C ) = (6.28)
1 for opt < 2

and
opt 1 = (Copt ) < (C ) < 1 for all (0, 2), 6= opt , (6.29)

holds.

138
Proof. We only consider (0, 2). Introduce L := D1 L and U := D1 U. First we treat
the case where there exists (0, 2) such that (C ) = 0, i.e., C = 0. This implies = 1,
U = 0, = 0 and opt = 1. From U = 0 we get C = (1 )(I L)1 , which yields
(C ) = |1 |. One now easily verifies that for this case the results in (6.28) and (6.29) hold.
We now consider the case with (C ) > 0 for all (0, 2). Take (C ), 6= 0. From
lemma 6.4.2 it follows that
+1
1 = (CJ ) [, ]
2
A simple computation yields

1 p 2
= || 2 2 4( 1) (6.30)
4

We first consider with opt < 2. Then 2 2 4( 1) 2 2 4( 1) 0 and thus


from (6.30) we obtain
1
|| = 2 2 ( 2 2 4( 1) = 1

4
Hence in this case all eigenvalues of C have modulus 1 and this implies (C ) = 1,
which proves the second part of (6.28). We now consider with 0 < opt and thus
2 2 4( 1) 0. If is such that 2 2 4( 1) 0 then (6.30) yields

1 p 2
|| = || 2 2 4( 1)
4
The maximum value is attained for the + sign and with the value = , resulting in

1 p 2
|| = + 2 2 4( 1) (6.31)
4

There may be eigenvalues (C ) that correspond to (CJ ) with 2 2 4( 1) < 0.


As shown above, this yields corresponding (C ) with || = 1. Due to

1 2 1
+ 2 2 4( 1) 2 2 1
p
4 4

we conclude that the maximum value for || is attained for the case (6.31) and thus

1 p 2
(C ) = + 2 2 4( 1)
4

which proves the first part in (6.28). An elementary computation shows that for (0, 2) the
function (C ) as defined in (6.28) is continuous, monotonically decreasing on (0, opt ] and
monotonically increasing on [opt , 2). Morover, both for 0 and 2 we have the function
value 1. From this the result in (6.29) follows.

In (6.27) we see that opt > 1 holds, which motivates the name over-relaxation. Note that we
do not require symmetry of the matrix A. However, we do assume that the eigenvalues of CJ
are real. A sufficient condition for the latter to hold is that A is symmetric. For different values
of the function (C ) defined in (6.28) is shown in figure 6.1.

139
1

0.9

mu=0.95
0.8

Spectral radius SOR iteration matrix 0.7

mu=0.9
0.6

0.5

mu=0.6
0.4

0.3

0.2

0.1
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
omega

Figure 6.1: Function (C )

Corollary 6.4.4 If we take = 1 then the SOR method is the same as the Gauss-Seidel
method. Hence, if A satisfies the assumptions in theorem 6.4.3 we obtain from (6.28)

(C1 ) = 2 = (CJ )2

Thus ln((C1 )) = 2 ln(CJ ), i.e., the asymptotic convergence rate of the Gauss-Seidel method
is twice the one of the Jacobi method. 

Assume that for = (CJ ) we have = 1 with 1. From Theorem 6.4.3 we


obtain (provided A fulfills the conditions of this theorem)thep
following estimate related
to the
2 2
convergence of the SOR method: (Copt ) = (1 ) (1 + 2 1 /2) = 1 2 2 + O().
Hence the method of Jacobi has an asymptotic convergence rate ln() = ln(1 ) and
the SOR method has an asymptotic convergence rate ln((Copt )) ln(1 2 2) 2 2.

Note that for small we have 2 2 , and thus the SOR method has a significantly higher
rate of convergence than the method of Jacobi.

6.5 Convergence analysis for regular matrix splittings


We present a general convergence analysis for so called regular matrix splitting methods, due
to Varga [92]. For this analysis we need some fundamental results on the largest eigenvalue
of a positive matrix and its corresponding eigenvector. These results, due to Perron [71], are
presented in section 6.5.1. In this section for B, C Rnn we use the notation B C (B > C)
iff bij cij (bij > cij ) for all i, j. The same ordering notation is used for vectors. For B Rnn
we define |B| = (|bij |)1i,jn and similarly for vectors.

140
6.5.1 Perron theory for positive matrices
For a matrix A Rnn an eigenvalue (A) for which || = (A) holds is not necessarily
real. If we assume A > 0 then it can be shown that (A) (A) holds and, moreover, that the
corresponding eigenvector is strictly positive. These and other related results, due to Perron [71],
are given in lemma 6.5.2, theorem 6.5.3 and theorem 6.5.5.
We start the analysis with an elementary lemma.

Lemma 6.5.1 For B, C Rnn the following holds

0 B C (B) (C)

Proof. From 0 B C we get 0 Bk Ck for all k. Hence, kBk k kCk k for all k.
1/k
Recall that for arbitrary A Rnn we have (A) = limk kAk k (cf. lemma 6.1.5). Using
this we get (B) (C).

Lemma 6.5.2 Take A Rnn with A > 0. For (A) with || = (A) and w Cn , w 6= 0,
with Aw = w the relation
A|w| = (A)|w|
holds, i.e., (A) is an eigenvalue of A.

Proof. With these and w we have

(A)|w| = |||w| = |w| = |Aw| |A||w| = A|w| (6.32)

Assume that we have < in (6.32). Then there exsts > (A) such that |w| A|w|
k |w| k
and thus Ak |w| k |w| for all k N. This yields kAk k kA
k |w| k

k and thus
1/k
(A) = limk kAk k , which is a contradiction with > (A). We conclude that in
(6.32) equality must hold, i.e., A|w| = (A)|w|.

Theorem 6.5.3 (Perron) For A Rnn with A > 0 the following holds:

(A) > 0 is an eigenvalue of A (6.33a)


There exists a vector v > 0 such that Av = (A)v (6.33b)
If Aw = (A)w holds, then w span(v) with v from (6.33b) (6.33c)

Proof. From lemma 6.5.2 we obtain that there exists w 6= 0 such that

A|w| = (A)|w| (6.34)

holds. Thus (A) is an eigenvalue of A. The eigenvector |w| from (6.34) contains at least one
entry that is strictly positive. Due to this and A > 0 we have that A|w| > 0, which due to
(6.34) implies (A) > 0 and |w| > 0. From this the results in (6.33a) and (6.33b) follow.
Assume that there exists x 6= 0 independent of v such that Ax = (A)x. For arbitrary 1 k n
define = xvkk and z = x v. Note that zk = 0 and due to the assumption that x and v are in-
dependent we have z 6= 0. We also have Az = (A)z. From lemma 6.5.2 we get A|z| = (A)|z|,

141
which results in a contradicton, because (A|z|)k > 0 and (A)(|z|)k = 0. Thus the result in
(6.33c) is proved.

The eigenvalue (A) and corresponding eigenvector v > 0 (which is unique up to scaling)
are called the Perron root and Perron vector.
If instead of A > 0 we only assume A 0 then the results (6.33a) and (6.33b) hold with >
replaced by as is shown in the following corollary. Clearly, for A 0 the result (6.33c) does
not always hold (take A = 0).

Corollary 6.5.4 For A Rnn with A 0 the following holds:

(A) is an eigenvalue of A (6.35a)


There exists a nonzero vector v 0 such that Av = (A)v (6.35b)

Proof. For (0, 1] define A := aij + 1i,jn . From theorem 6.5.3 it follows that for
:= (A ) there exists a vector v > 0 such that A v = v holds. We scale v such
that kv k = 1. Then, for all this vector is contained in the compact set { x Rn | kxk =
1 } =: S. Hence there exists a decreasing sequence 1 > 1 > 2 > . . ., with limj j = 0 and
limj vj = v S. Thus v 6= 0 and from vj > 0 for all j it follows that v 0. Note that
0 A Ai Aj for all i j. Using lemma 6.5.1 we get (A) i i for all i j. From
this it follows that
lim i = (A) (6.36)
i

Taking the limit i in the equation Ai vi = i vi yields Av = v and thus is an


eigenvalue of A with v 6= 0. This implies (A). In combination with (6.36) this yields
= (A), which completes the proof.

In the next theorem we present a few further results for the Perron root of a positive matrix.

Theorem 6.5.5 (Perron) For A Rnn with A > 0 the following holds:

(A) is a simple eigenvalue of A (note that this implies (6.33c)) (6.37a)


For all (A), 6= (A), we have || < (A) (6.37b)
No nonnegative eigenvector belongs to any other eigenvalue than (A) (6.37c)

Proof. We use the Jordan form A = TT1 (cf. Appendix B) with a matrix of the form
= blockdiag(i )1is and

i 1
.. ..
. . Rki ki , 1 i s,

i =
.
. . 1

with all i (A). Due to (6.33c) we know that the eigenspace corresponding to the eigenvalue
(A) is one dimensional. Thus there is only one block i with i = (A). Let the ordering of
the blocks in be such that the first block 1 corresponds to the eigenvalue 1 = (A). We
will now show that its dimension must be k1 = 1. Let ej be the j-th basis vector in Rn and
define t := Te1 , t := TT ek1 . From ATe1 = Te1 we get At = (A)t and thus t is the Perron
vector of A. This implies t > 0. Note that AT TT ek1 = TT T ek1 and thus AT t = (A)t.

142
Since AT > 0 this implies that t is the Perron vector of AT and thus t > 0. Using that both
t and t are strictly positive we get 0 < tT t = eTk1 T1 Te1 = eTk1 e1 . This can only be true if
k1 = 1. We conclude that there is only one Jordan block corresponding to (A) and that this
block has the size 1 1, i.e., (A) is simple eigenvalue.
We now consider (6.37b). Let w Cn , w 6= 0, = ei (A) (i.e., || = (A)) be such that
Aw = w. From lemma 6.5.2 we get that A|w| = (A)|w| and from (6.33c) it follows that
|w| > 0 holds. We introduce k , rk R, with rk > 0, such that wk = rk eik , 1 k n, and
D := diag(eik )1kn . Then D|w| = w holds and thus

AD|w| = Aw = w = D|w| = ei (A)D|w| = ei DA|w|

This yields
ei D1 AD A |w| = 0


Consider the k-th row of this identity:


n
X
ei(+k j ) 1 akj |wj | = 0


j=1

Due to akj |wj | > 0 for all j this can only be true if ei(+k j ) 1 = 0 for all j = 1, . . . , n. We
take j = k and thus obtain ei = 1, hence = ei (A) = (A). This shows that (6.37b) holds.
We finally prove (6.37c). Assume Aw = w with a nonzero vector w 0 and 6= (A)
holds. Application of theorem 6.5.3 to AT implies that there exists a vector x > 0 such that
AT x = (AT )x = (A)x. Note that xT Aw = xT w and xT Aw = wT AT x = (A)wT x. This
implies ( (A))wT x = 0 and thus, because wT x > 0, we obtain = (A), which contradicts
6= (A). This completes the proof of the theorem.

From corollary 6.5.4 we know that for A 0 there exists an eigenvector v 0 corresponding to
the eigenvalue (A). Under the stronger assumption A 0 and irreducible (cf. Appendix B)
this vector must be strictly positive (as for the case A > 0). This and other related results for
nonnegative irreducible matrices are due to Frobenius [37].

Theorem 6.5.6 (Frobenius) Let A Rnn be irreducible and A 0. Then the following
holds:

(A) > 0 is a simple eigenvalue of A (6.38a)


There exists a vector v > 0 such that Av = (A)v (6.38b)
No nonnegative eigenvector belongs to any other eigenvalue than (A) (6.38c)

Proof. Given in, for example, theorem 4.8 in Fiedler [34].

6.5.2 Regular matrix splittings


A class of special matrix splittings consists of so-called regular splittings. In this section we will
discuss the corresponding matrix splitting methods. In particular we show that for a regular
splitting the corresponding iterative method is convergent. We also show how basic iterative
methods like the Jacobi or Gauss-Seidel method fit into this setting.

143
Definition 6.5.7 A matrix splitting A = M N is called a regular splitting if

M is regular, M1 0 and M A (6.39)

Recall that the iteration matrix of a matrix splitting method (based on the splitting A =
M N) is given by C = I M1 A = M1 N.
Theorem 6.5.8 Assume that A1 0 holds and that A = M N is a regular splitting. Then
(A1 N)
(C) = (M1 N) = <1
1 + (A1 N)
holds.
Proof. The matrices I C = M1 A and I + A1 N = A1 M are nonsingular. We use the
identities

A1 N = (I C)1 C (6.40)
C = (I + A1 N)1 A1 N (6.41)

Because C 0 we can apply corollary 6.5.4. Hence there exists a nonzero vector v 0 such
that Cv = (C)v. Due to the fact that I C is nonsingular we have (C) 6= 1. From (6.40) we
get
(C)
A1 Nv = (I C)1 Cv = v (6.42)
1 (C)
From this and A1 0, N 0, v 0 we conclude A1 Nv 0 and (C) < 1. From (6.42)
(C) (C)
it also follows that 1(C) is a positive eigenvalue of A1 N. This implies 1(C) (A1 N),
which can be reformulated as
(A1 N)
(C) (6.43)
1 + (A1 N)
From A1 N 0 and corollary 6.5.4 it follows that there exists a nonzero vector w 0 such
that A1 Nw = (A1 N)w. Using (6.41) we get

(A1 N)
Cw = (I + A1 N)1 A1 Nw = w
1 + (A1 N)
(A1 N)
Thus 1+(A1 N) is a positive eigenvalue of C. This yields

(A1 N)
(C) (6.44)
1 + (A1 N)
Combination of (6.43) and (6.44) completes the proof.

x
From the fact that the function x 1+x is increasing and using lemma 6.5.1 one immedi-
ately obtains the following result.
Corollary 6.5.9 Assume that A1 0 holds and that A = M1 N1 = M2 N2 are two
regular splittings with N1 N2 . Then

(I M1 1
1 A) (I M2 A) < 1

holds.

144
For the application of these general results to concrete matrix splitting methods it is convenient
to introduce the following class of matrices.

Definition 6.5.10 (M-matrix) A matrix A Rnn is called M-matrix if it is nonsingular and


has the following two properties

A1 0 (6.45a)
aij 0 for all i 6= j (6.45b)

Consider an M-matrix A and let sk 0 be the k-th P columns of A1 . From the identity Ask = ek
(k-th basis vector) it follows that akk (sk )k = 1 j6=k akj (sk )j 1 holds, and thus akk > 0.
Hence, in an M-matrix all diagonal entries are strictly positive. Another property that we will
need further on is given in the following lemma.

Lemma 6.5.11 Let A be an M-matrix. Assume that the matrix B has the properties bij 0
for all i 6= j and B A. Then B is an M-matrix, too. Furthermore, the inequalities

0 B1 A1

hold.

Proof. Let DA := diag(A) and DB := diag(B). Because A is an M-matrix we have


that DA is nonsingular and from B A it follows that DB is nonsingular, too. Note that
NA := DA A 0. We conclude that A = DA NA is a regular splitting and from theo-
rem 6.5.8 it follows that (CA ) < 1 with CA := I D1 1
A A. Furthermore, with CB := I DB B
we have 0P CB CA and thus (CBP ) (CA ) < 1 holds. Thus we have the representations
A1 = ( k=0 AC k )D1 and B1 = ( Ck )D1 . From the latter and C 0, D1 0
A k=0 B B B B
we obtain B1 0 and we can conclude that B in an M-matrix. The inequality B1 A1
follows by using CB CA , D1 1
B DA .

There is an extensive literature on properties of M-matrices, cf. [12], [34]. A few results are
given in the following theorem.

Theorem 6.5.12 For A Rnn the following results hold:

(a) If A is irreducibly diagonally dominant and aii > 0 for all i, aij 0 for all i 6= j, then A
is an M-matrix.

(b) Assume that aij 0 for all i 6= j. Then A is an M-matrix if and only if all eigenvalues of
A have positive real part.

(c) Assume that aij 0 for all i 6= j. Then A is an M-matrix if A + AT is positive definite
(this follows from (b)).

(d) If A is symmetric positive definite and aij 0 for all i 6= j, then A is an M-matrix (this
follows from (b)).

(e) If A is a symmetric M-matrix then A is symmetric positive definite (this follows from (b)).

(f ) If A is an M-matrix and B results from A after a Gaussian elimination step without pivot-
ing, then B is an M-matrix, too (i.e. Gaussian elimination without pivoting preserves the
M-matrix property).

145
Proof. A proof can be found in [12].

We now show that for M-matrices the Jacobi and Gauss-Seidel methods correspond to regu-
lar splittings. Recall the decomposition A = D L U.

Theorem 6.5.13 Let A be an M-matrix. Then both MJ := D and MGS := D L result in


regular splittings. Furthermore

(I (D L)1 A) (I D1 A) < 1 (6.46)

holds.
Proof. In the proof lemma 6.5.11 it is shown that the method of Jacobi corresponds to a
regular splitting. For the Gauss-Seidel method note that MGS = D L has only nonpositive
off-diagonal entries and MGS A = U 0. From lemma 6.5.11 it follows that MGS is an
M-matrix, hence M1 GS 0 holds. Thus the Gauss-Seidel method corresponds to a regular split-
ting, too. Now note that NGS := U NJ := L + U holds and thus corollary 6.5.9 yields the
result in (6.46).

This result shows that for an M-matrix both the Jacobi and Gauss-Seidel method are con-
vergent. Moreover, the asymptotic convergence rate of the Gauss-Seidel method is at least as
high as for the Jacobi method. If A is the result of the discretization of an elliptic boundary
value problem then often the arithmetic costs per iteration are comparable for both methods.
In such cases the Gauss-Seidel method is usually more efficient than the method of Jacobi.

The SOR method corresponds to a splitting A = M N with M = 1 D L. If A is


an M-matrix then for > 1 the matrix M A has strictly positive diagonal entries and thus
this is not a regular splitting. For (0, 1] one can apply the same arguments as in the proof
of theorem 6.46 to show that for an M-matrix A the SOR method corresponds to a regular
splitting, and

(I M1 1
1 A) (I M2 A) < 1 for all 0 < 2 1 1

holds.

6.6 Application to scalar elliptic problems


In this section we apply basic iterative methods to discrete scalar elliptic model problems. We
recall the weak formulation of the Poisson equation and the convection-diffusion problem:
(
find u H01 () such that
1
R R
u v dx = f v dx for all v H0 ()
(
find u H01 () such that
for all v H01 ()
R R R
u v dx + b u v dx = f v dx
with > 0 and b = (b1 , b2 ) with given constants b1 0, b2 0. We take = (0, 1)2 . We
1 k
use nested uniform triangulations with mesh size parameter h = 20 2 , k = 1, 2, 3, 4. These
problems are discretized using the finite element method with piecewise linear finite elements.
For the convection-diffusion problem, we use the streamline-diffusion stabilization technique (for

146
the convection-dominated case). The resulting discrete problems are denoted by (P) (Poisson
problem) and (CD) (convection-diffusion problem).

Example 6.6.1 (Model problem (P)) For the Poisson equation we obtain a stiffness matrix
A that is symmetric positive definite and for which (D1 A) = O(h2 ) holds. In Table 6.1 we
show the results for the method of Jacobi applied to this problem with different values of h. For
the starting vector we take x0 = 0. We use the Euclidean norm k k2 . By # we denote the
number of iterations needed to reduce the norm of the starting error by a factor R = 103 .
We observe that when we halve the mesh size h we need approximately four times as many

h 1/40 1/80 1/160 1/320


# 2092 8345 33332 133227

Table 6.1: Method of Jacobi applied to problem (P).

iterations. This is in agreement with (D1 A) = O(h2 ) and the result in theorem 6.3.3. 

We take a reduction factor R = 103 and consider model problem (P). Then the complexity of
the method of Jacobi is cn2 flops (c depends on R but is independent of n). For model problem
(P) there are methods that have complexity cn with < 2. In particular = 1 21 for the SOR
method, = 1 14 for preconditioned Conjugate Gradient (chapter 7) and = 1 for the multigrid
method (chapter 9). It is clear that if n is large a reduction of the exponent will result in a
significant gain in efficiency, for example, for h = 1/320 we have n h2 105 and n2 1010 .
Also note that = 1 is a lower bound because for one matrix-vector multiplication Ax we
already need cn flops.

Example 6.6.2 (Model problem (P)) In Table 6.2 we show results for the situation as de-
scribed in example 6.6.1 but now for the Gauss-Seidel method instead of the method of Jacobi.
For this model problem with R = 103 the Gauss-Seidel method has a complexity cn2 , which is

h 1/40 1/80 1/160 1/320


# 1056 4193 16706 66694

Table 6.2: Gauss-Seidel method applied to problem (P).

of the same order of magnitude as for the method of Jacobi. 

Example 6.6.3 (Model problem (CD)) It is important to note that in the Gauss-Seidel
method the results depend on the ordering of the unknowns, whereas in the method of Jacobi
the resulting iterates are independent of the ordering. We consider model problem (CD) with
b1 = cos(/6), b2 = sin(/6). We take R = 103 and h = 1/160. Using an ordering of the grid
points (and corresponding unknowns) from left to right in the domain (0, 1)2 we obtain the results
as in Table 6.3. When we use the reversed node ordering then get the results shown in Table 6.4.
These results illustrate a rather general phenomenon: if a problem is convection-dominated
then for the Gauss-Seidel method it is advantageous to use a node ordering corresponding (as
much as possible) to the direction in which information is transported. 

147
100 102 104
# 17197 856 14

Table 6.3: Gauss-Seidel method applied to problem (CD).

100 102 104


# 17220 1115 285

Table 6.4: Gauss-Seidel method applied to problem (CD).

Example 6.6.4 We consider the model problem (P) as in example 6.6.1, with h = 1/160. In
Figure 6.2 for different values of the parameter we show the corresponding number of SOR
iterations (#), needed for an error reduction with a factor R = 103 . The same experiment is
performed for the model problem (CD) as in example 6.6.3 with h = 1/160, = 102 . The
results are shown in Figure 6.3. Note that with a suitable value for an enormous reduction

5
10

4
10

3
10

2
10
1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2

Figure 6.2: SOR method applied to model problem (P).

in the number of iterations needed can be achieved. Also note the rapid change in the number
of iterations (#) close to the optimal value. 

148
4
10

3
10

2
10

1
10
0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

Figure 6.3: SOR method applied to model problem (CD)

149
150
Chapter 7

Preconditioned Conjugate Gradient


method

7.1 Introduction
In this chapter we discuss the Conjugate Gradient method (CG) for the iterative solution of
sparse linear systems with a symmetric positive definite matrix.
In section 7.2 we introduce and analyze the CG method. This method is based on the
formulation of the discrete problem as a minimization problem. The CG method is nonlinear and
of a different type as the basic iterative methods discussed in chapter 3. The CG method is not
suitable for solving strongly nonsymmetric problems, as for example a discretized convection-
diffusion problem with a dominating convection. Many variants of CG have been developed
which are applicable to linear systems with nonsymmetric matrix. A few of these methods are
treated in chapter 8. In the CG method and in the variants for nonsymmetric problems the
resulting iterates are contained in a so-called Krylov subspace, which explains the terminology
Krylov subspace methods. A detailed treatment of these Krylov subspace methods is given in
Saad [78]. An important concept related to all these Krylov subspace methods is the so-called
preconditioning technique. This will be explained in section 7.3.

7.2 Conjugate Gradient method


In section 2.4 it is shown that to a variational problem with a symmetric elliptic bilinear form
there corresponds a canonical minimization problem. Similarly, to a linear system with a sym-
metric positive definite matrix there corresponds a natural minimization problem. We consider
a system of equations
Ax = b (7.1)
with A Rnn symmetric positive definite. The unique solution of this problem is denoted by
x . In this chapter we use the notation
1
hy, xi = yT x , hy, xiA = yT Ax , kxkA = hx, xiA
2
, (x, y Rn ) . (7.2)
Since A ist symmetric positive definite the bilinear form h, iA defines an inner product on Rn .
This inner product is called the A-inner product or energy inner product.
We define the functional F : Rn R by
1
F (x) = hx, Axi hx, bi . (7.3)
2

151
For this F we have

DF (x) = F (x) = Ax b and D 2 F (x) = A .

So F is a quadratic functional with a second derivative (Hessian) which is positive definite.


Hence F has a unique minimizer and the gradient of F is equal to zero at this minimizer. Thus
we obtain:
min{F (x) | x Rn } = F (x ) , (7.4)
i.e. minimization of the functional F yields the unique solution x of the system in (7.1).
This result is an analogon of the one discussed for symmetric bilinear forms in section 2.4. In
this section we consider two methods that are based on (7.4) and in which first certain search
directions are determined and then a line search is applied. Such methods are of the following
form:  0
x a given starting vector,
(7.5)
xk+1 = xk + opt (xk , pk )pk , k 0 .
In (7.5), pk 6= 0 is the search direction at xk and opt (xk , pk ) is the optimal steplength at xk in
the direction pk . The vector Axk b = F (xk ) is called the residual (at xk ) and denoted by

rk := Axk b. (7.6)

From the definition of F we obtain the identity


1
F (xk + pk ) = F (xk ) + hpk , Axk bi + 2 hpk , Apk i =: () .
2
The function : R R is quadratic and () > 0 holds. So has a unique minimum at opt
iff (opt ) = 0. This results in the following formula for opt :

hpk , rk i
opt (xk , pk ) := . (7.7)
hpk , Apk i

For the residuals we have the recursion

rk+1 = rk + opt (xk , pk )Apk , k 0. (7.8)

For (0), i.e. the derivative of F at xk in the direction pk , we have (0) = hpk , rk i. The
direction pk with kpk k2 = 1 for which the modulus of this derivative is maximal is given by
pk = rk /krk k2 . This follows from | (0)| = |hpk , rk i| kpk k2 krk k2 , in which we have equality
only if pk = rk ( R). The sign and length of pk are irrelevant because the right sign and
the optimal length are determined by the steplength parameter opt . With the choice pk = rk
we obtain the Steepest Descent method:
( 0
x a given starting vector,
k ,rk i
x k+1 = xk hrhrk ,Ar k r .
i
k

In general the Steepest Descent method converges only slowly. The reason for this is already
clear from a simple example with n = 2. We take
 
1 0
A= , 0 < 1 < 2 , b = (0, 0)T (hence, x = (0, 0)T ).
0 2

152
x0

x1

x3

x2

Figure 7.1: Steepest Descent method

The function F (x1 , x2 ) = 12 hx, Axi hx, bi = 12 (1 x21 + 2 x22 ) has level lines Nc = {(x1 , x2 )
R2 | F (x1 , x2 ) = c} which are ellipsoids. Assume that 2 1 holds (so (A) 1). Then
the ellipsoids are stretched in x1 -direction as is shown in Figure 7.1, and convergence is very slow.

We now introduce the CG method along similar lines as in Hackbusch [48]. To be able to
formulate the weakness of the Steepest Descent method we introduce the following notion of
optimality. Let V be a subspace of Rn .

y is called optimal for the subspace V if
(7.9)
F (y) = minzV F (y + z)

So y is optimal for V if on the hyperplane y+V the functional F is minimal at y. Assume P a given
y and subspace V . Let d1 , . . . , ds be a basis of V and for c Rs define g(c) = F (y + si=1 ci di ).
Then y is optimal for V iff g(0) = 0 holds. Note that
g
(0) = hF (y), di i = hAy b, di i
ci
Hence we obtain the following:

y optimal for V hdi , Ay bi = 0 for i = 1, . . . , s . (7.10)

In the Steepest Descent method we have pk = rk . From (7.7) and (7.8) we obtain

hrk , rk i
hrk , rk+1 i = hrk , rk i hrk , Ark i = 0 .
hrk , Ark i

Using (7.10) we conclude that in the Steepest Descent method xk+1 is optimal for the subspace
span{pk }. This is also clear from Fig. 7.1 : for example, x3 is optimal for the subspace spanned
by the search direction x3 x2 . From Fig. 7.1 it is also clear that xk is not optimal for the
subspace spanned by all previous search directions. For example x3 can be improved in the
search direction p1 = (x2 x1 ): for = opt (x3 , p1 ) we have F (x4 ) = F (x3 + p1 ) < F (x3 ).
Now consider a start with p0 = r0 and thus x1 = x0 + opt (x0 , p0 )p0 (as in Steepest Descent).
We assume that the second search direction p1 is chosen such that hp1 , Ap0 i = 0 holds. Due to

153
the fact that A is symmetric positive definite we have that p1 and p0 are independent. Define
x2 = x1 + opt (x1 , p1 )p1 . Note that now hp0 , b Ax2 i = 0 and also hp1 , b Ax2 i = 0 and thus
(cf. 7.10) x2 is optimal for span{p0 , p1 }. For the special case n = 2 as in the example shown in
figure 7.1 we have span{p0 , p1 } = R2 . Hence x2 is optimal for R2 which implies x2 = x ! This
is illustrated in Fig. 7.2.
We have constructed search directions p0 , p1 and an iterand x2 such that x2 is optimal for the

x0

x1

x2

Figure 7.2: Conjugate Gradient method

two-dimensional subspace span{p0 , p1 }. This leads to the basic idea behind the Conjugate Gra-
dient (CG) method: we shall use search directions such that xk is optimal for the k-dimensional
subspace span{p0 , p1 , . . . , pk1 }. In the Steepest Descent method the iterand xk is optimal for
the one-dimensional subspace span{pk1 }. This difference results in much faster convergence
of the CG iterands as compared to the iterands in the Steepest Descent method.

We will now show how to construct appropriate search directions such that this optimality
property holds. Moreover, we derive a method for the construction of these search directions
with low computational costs.
As in the Steepest Descent method, we start with p0 = r0 and x1 as in (7.5). Recall that x1 is
optimal for span{p0 }. Assume that for a given k with 1 k < n, linearly independent search
directions p0 , ..., pk1 are given such that xk as in (7.5) is optimal for span{p0 , ..., pk1 }. We
introduce the notation
Vk = span{p0 , ..., pk1 }
and assume that xk 6= x , i.e., rk 6= 0 (if xk = x we do not need a new search direc-
tion). We will show how pk can be taken such that xk+1 , defined as in (7.5), is optimal for
span{p0 , p1 , ..., pk } =: Vk+1 . We choose pk such that
pk A Vk , i.e. pk VkA (7.11)
holds. This A-orthogonality condition does not determine a unique search direction pk . The
Steepest Descent method above was based on the observation that rk = F (xk ) is the direction
of steepest descent at xk . Therefore we use this direction to determine the new search direction.
A unique new search direction pk is given by the following:
pk VkA such that kpk rk kA = min kp rk kA (7.12)
A
pVk

154
The definition of p1 is illustrated in Fig. 7.3.

V1 6 3r
1





  
  V A 
  - 1 
 V1 = span{p0 }.
 p1 
 

Figure 7.3: Definition of a search direction in CG

Note that pk is the A-orthogonal projection of rk on VkA . This yields the following formula for
the search direction pk :

k1 k1
X hpj , rk iA j X hpj , Ark i j
pk = rk p = rk
p . (7.13)
j=0
hpj , pj iA
j=0
hpj , Apj i

We assumed that xk is optimal for Vk and that rk 6= 0. From the former we get that hpj , rk i = 0
for j = 0, . . . , k 1, i.e., rk Vk (note that here we have and not A ). Using rk 6= 0
we conclude that rk / Vk and thus from (7.13) it follows that pk
/ Vk . Hence, pk is linearly
0
independent of p , . . . , p k1 and

Vk+1 = span{p0 , . . . , pk } has dimension k + 1. (7.14)

Given this new search direction the new iterand is defined by

xk+1 = xk + opt (xk , pk )pk (7.15)

with opt as in (7.7). Using the definition of opt we obtain

hpk , b Axk+1 i = hpk , rk i + opt (xk , pk )hpk , Apk i = 0.

Due to (7.11) and the optimality of xk for the subspace Vk (cf. also (7.10)) we have for j < k

hpj , b Axk+1 i = hpj , b Axk i opt (xk , pk )hpj , Apk i = 0.

Using (7.10) we conclude that xk+1 is optimal for the subspace Vk+1 ! The search directions
pk defined as in (7.13) (p0 := r0 ) and the iterands as in (7.15) define the Conjugate Gradient
method. This method is introduced in Hestenes and Stiefel [51].

We now derive some important properties of the CG method.

Theorem 7.2.1 Let x0 Rn be given and m < n be such that for k = 0, 1, . . . , m we have
xk 6= x and pk , xk+1 as in (7.13), (7.15). Define Vk = span{p0 , . . . , pk1 } (0 k m + 1).

155
Then the following holds for all k = 1, . . . , m + 1:

dim(Vk ) = k (7.16a)
k 0
x x + Vk (7.16b)
k 0
F (x ) = min{F (x) | x x + Vk } (7.16c)
0 k1 0 0 k1 0
Vk = span{r , ..., r } = span{r , Ar , ..., A r } (7.16d)
j k
hp , r i = 0, for all j = 0, 1, ..., k 1 (7.16e)
j k
hr , r i = 0, for all j = 0, 1, ..., k 1 (7.16f)
k k k1
p span{r , p } (for k m) (7.16g)

Proof. The result in (7.16a) is shown in the derivation of the method, cf. (7.14). The result
in (7.16b) can be shown by induction using xk = xk1 + opt (xk1 , pk1 )pk1 . The construc-
tion of the search directions and new iterands in the CG method is such that xk is optimal
for Vk , i.e., F (xk ) = min{F (xk + w) | w Vk }. Using xk x0 + Vk this can be rewritten as
F (xk ) = min{F (x0 + w) | w Vk } which proves the result in (7.16c).
We introduce the notation Rk = span{r0 , . . . , rk1 } and prove Vk Rk by induction. For
k = 1 this holds due to p0 = r0 . Assume that it holds for some k m. Since Vk+1 =
span{Vk , pk } and Vk Rk Rk+1 , we only have to show pk Rk+1 . From (7.13) it fol-
lows that pk span{p0 , . . . , pk1 , rk } = span{Vk , rk } Rk+1 , which completes the induc-
tion argument. Using dim(Vk ) = k it follows that that Vk = Rk must hold. Hence the first
equality in (7.16d) is proved. We introduce the notation Wk = span{r0 , Ar0 , ..., Ak1 r0 }
and prove Rk Wk by induction. For k = 1 this is trivial. Assume that for some k m,
Rk Wk holds. Due to Rk+1 = span{Rk , rk } and Rk Wk Wk+1 we only have to show
rk Wk+1 . Note that rk = rk1 + opt (xk1 , pk1 )Apk1 and rk1 Rk Wk Wk+1 ,
Apk1 AVk = ARk AWk Wk+1 . Thus rk Wk+1 holds, which completes the induction.
Due to dim(Rk ) = k it follows that Rk = Wk must hold. Hence the second equality in (7.16d)
is proved.
The search directions and iterands are such that xk is optimal for Vk = span{p0 , . . . , pk1 }.
From (7.10) we get hpj , rk i = 0 for j = 0, . . . , k 1 and thus (7.16e) holds. Due to Vk =
span{r0 , ..., rk1 } this immediately yields (7.16f), too. To prove (7.16g) we use the formula
(7.13). Note that rj+1 = rj + opt (xj , pj )Apj and thus Apj span{rj+1 , rj }. From this and
(7.16f) it follows that for j k 1 we have hpj , rk iA = hApj , rk i = 0. Thus in the sum in
(7.13) all terms with j k 2 are zero.

The result in (7.16g) is very important for an efficient implementation of the CG method. Com-
bining this result with the formula given in (7.13) we immediately obtain that in the summation
in (7.13) there is only one nonzero term, i.e. for pk we have the formula

hpk1 , Ark i
pk = rk pk1 . (7.17)
hpk1 , Apk1 i

From (7.17) we see that we have a simple and cheap two term recursion for the search directions
in the CG method. Combination of (7.5),(7.7),(7.8) and (7.17) results in the following CG

156
algorithm:
x a given starting vector; r0 = Ax0 b
0




for k 0 (if rk 6= 0) :






k k1
pk = rk hphrk1,Ap
,Ap
i
k1 p
i
k1 ( if k = 0 then p0 := r0 ) (7.18)


k k


x k+1 = xk + (xk , pk )pk with (xk , pk ) = hp ,r i


opt opt hp ,Apk i
k


= rk + opt (xk , pk )Apk
k+1
r

Some manipulations result in the following alternative formulas for pk and


opt :
hrk , rk i
pk = rk + pk1 ,
hrk1 , rk1 i
(7.19)
hrk , rk i
opt (xk , pk ) = .
hpk , Apk i
Using these formulas in (7.18) results in a slightly more efficient algorithm.
The subspace Vk = span{r0 , Ar0 , ..., Ak1 r0 } in (7.16d) is called the Krylov subspace of dimen-
sion k corresponding to r0 , denoted by Kk (A; r0 ).
The CG method is of a different type as the basic iterative methods discussed in chapter 6. One
important difference is that the CG method is nonlinear. The error propagation xk+1 x =
(xk x ) is determined by a nonlinear function and thus there does not exist an error
iteration matrix (as in the case of basic iterative methods) which determines the convergence
behaviour. Related to this, in the CG method we often observe a phenomenon called superlinear
convergence. This type of convergence behaviour is illustrated in Example 7.2.3. For a detailed
analysis of this phenomenon we refer to Van der Sluis and Van der Vorst [90].
Another difference between CG and basic iterative methods is that the CG method yields the
exact solution x in at most n iterations. This follows from the property in (7.16c). However,
in practice this will not occur due the effect of rounding errors. Moreover, in practical applica-
tions n is usually very large and for efficiency reasons one does not want to apply n CG iterations.

We now discuss the arithmetic costs per iteration and the rate of convergence of the CG method.
If we use the CG algorithm with the formulas in (7.19) then in one iteration we have to compute
one matrix-vector multiplication, two inner products and a few vector updates, i.e. (if A is a
sparse matrix) we need cn flops. The costs per iteration are of the same order of magnitude as
for the Jacobi, Gauss-Seidel and SOR method.
With respect to the rate of convergence of the CG method we formulate the following theorem.
Theorem 7.2.2 Define Pk := { p Pk | p(0) = 1 }. Let xk , k 0 be the iterands of the CG
method and ek = xk x . The following holds:
kek kA = min kpk (A)e0 kA (7.20)
pk Pk

min max |pk ()| ke0 kA (7.21)


pk Pk (A)
p !k
(A) 1
2 p ke0 kA (7.22)
(A) + 1

Proof. From (7.16b) we get $e^k \in e^0 + V_k$. Due to $V_k = \operatorname{span}\{r^0, \ldots, r^{k-1}\}$ and (7.16f)
we have $Ae^k = r^k \perp V_k$ and thus $e^k \perp_A V_k$. This implies $\|e^k\|_A = \min_{v_k \in V_k} \|e^0 - v_k\|_A$. Note
that $v_k \in V_k$ can be represented as

$$v_k = \sum_{j=0}^{k-1} \gamma_j A^j r^0 = \sum_{j=0}^{k-1} \gamma_j A^{j+1} e^0 .$$

Hence,

$$\|e^k\|_A = \min_{\gamma \in \mathbb{R}^k} \Big\| e^0 - \sum_{j=0}^{k-1} \gamma_j A^{j+1} e^0 \Big\|_A = \min_{p_k \in \bar{P}_k} \|p_k(A)e^0\|_A .$$

This proves the result in (7.20). The result in (7.21) follows from

$$\|p_k(A)e^0\|_A \le \|p_k(A)\|_A \|e^0\|_A = \max_{\lambda \in \sigma(A)} |p_k(\lambda)|\, \|e^0\|_A .$$

Let $I = [\lambda_{\min}, \lambda_{\max}]$ with $\lambda_{\min}$ and $\lambda_{\max}$ the extreme eigenvalues of A. From the results above
we have

$$\|e^k\|_A \le \min_{p_k \in \bar{P}_k} \max_{\lambda \in I} |p_k(\lambda)|\, \|e^0\|_A .$$

The min-max quantity in this upper bound can be analyzed using Chebyshev polynomials,
defined by $T_0(x) = 1$, $T_1(x) = x$, $T_{m+1}(x) = 2xT_m(x) - T_{m-1}(x)$ for $m \ge 1$. These polynomials
have the representation

$$T_k(x) = \frac{1}{2}\Big[ \big(x + \sqrt{x^2-1}\big)^k + \big(x - \sqrt{x^2-1}\big)^k \Big] \qquad (7.23)$$

and for any interval $[a,b]$ with $b < 1$ they have the following property:

$$\min_{p_k \in P_k,\, p_k(1)=1}\; \max_{x \in [a,b]} |p_k(x)| = 1 \Big/ T_k\Big(\frac{2-a-b}{b-a}\Big) .$$

We introduce $q_k(x) = p_k(1-x)$ and then get

$$\min_{p_k \in \bar{P}_k} \max_{\lambda \in I} |p_k(\lambda)| = \min_{p_k \in \bar{P}_k}\; \max_{x \in [1-\lambda_{\max},\, 1-\lambda_{\min}]} |p_k(1-x)|$$
$$= \min_{q_k \in P_k,\, q_k(1)=1}\; \max_{x \in [1-\lambda_{\max},\, 1-\lambda_{\min}]} |q_k(x)|$$
$$= 1 \Big/ T_k\Big(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}\Big) = 1 \Big/ T_k\Big(\frac{\kappa(A)+1}{\kappa(A)-1}\Big) .$$

Using the representation (7.23) we get

$$T_k\Big(\frac{\kappa(A)+1}{\kappa(A)-1}\Big) \ge \frac{1}{2}\left( \frac{\kappa(A)+1}{\kappa(A)-1} + \sqrt{\Big(\frac{\kappa(A)+1}{\kappa(A)-1}\Big)^2 - 1}\; \right)^{k} = \frac{1}{2}\left( \frac{\sqrt{\kappa(A)}+1}{\sqrt{\kappa(A)}-1} \right)^{k} ,$$

which then yields the bound in (7.22). □


So if we measure errors using the A-norm and neglect the factor 2 in (7.22), it follows that
on average per iteration the error is reduced by a factor $(\sqrt{\kappa(A)}-1)/(\sqrt{\kappa(A)}+1)$. In this
bound one can observe a clear relation between $\kappa(A)$ and the rate of convergence of the CG
method: a larger condition number results in a lower rate of convergence. For $\kappa(A) \gg 1$ the
reduction factor is of the form $1 - 2/\sqrt{\kappa(A)}$, which is significantly better than the bounds for
the contraction numbers of the Richardson and (damped) Jacobi methods, which are of the form
$1 - c/\kappa(A)$. For the case $\kappa(A) \sim ch^{-2}$ the latter takes the form $1 - ch^2$, whereas for CG we
have an (average) reduction factor $1 - ch$.
Often the bound in (7.22) is rather pessimistic because the phenomenon of superlinear conver-
gence is not expressed in this bound.
For a further theoretical analysis of the CG method we refer to Axelsson and Barker [7],
Golub and Van Loan [41] and Hackbusch [48].
Example 7.2.3 (Poisson model problem) We apply the CG method to the discrete Poisson
equation from section 6.6. First we discuss the complexity of the CG method for this model
problem. In this case we have $\kappa(A) \sim ch^{-2}$. Using (7.22) it follows that (in the A-norm) the
error is reduced with approximately a factor

$$\rho := \frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1} \approx 1 - 2c_1 h \qquad (7.24)$$

per iteration. The arithmetic costs are cn flops per iteration. So for a reduction of the error with
a factor R we need approximately $\ln R / |\ln(1 - 2c_1 h)| \cdot cn \sim ch^{-1} n \sim cn^{3/2}$ flops. We conclude
that the complexity is of the same order of magnitude as for the SOR method with the optimal
value for the relaxation parameter. However, note that, opposite to the SOR method, in the
CG method we do not have the problem of choosing a suitable parameter value. In Table 7.1
we show results which can be compared with the results in section 6.6. We use the Euclidean
norm and # denotes the number of iterations needed to reduce the starting error with a factor
$R = 10^3$.

h 1/40 1/80 1/160 1/320


# 65 130 262 525

Table 7.1: CG method applied to Poisson equation.

In figure 7.4 we illustrate the phenomenon of superlinear convergence in the CG method. For
the case h = 1/160 we show the actual error reduction in the A-norm, i.e.

$$\rho_k := \frac{\|x^k - x^*\|_A}{\|x^{k-1} - x^*\|_A} ,$$

in the first 250 iterations. The factor $(\sqrt{\kappa(A)}-1)/(\sqrt{\kappa(A)}+1)$ has the value $\rho = 0.96$ (horizontal
line in figure 7.4). There is a clear decreasing tendency of $\rho_k$ during the iteration process. For
large values of k, $\rho_k$ is significantly smaller than $\rho$. Finally, we note that an irregular convergence
behaviour as in figure 7.4 is typical for the CG method. □

7.3 Introduction to preconditioning


In this section we consider the general concept of preconditioning and discuss a few precondi-
tioning techniques. Consider a (sparse) system Ax = b, $A \in \mathbb{R}^{n\times n}$ (not necessarily symmetric
positive definite), for which an approximation $W \approx A$ is available with the following properties:

$$Wx = y \text{ can be solved with low computational costs } (\sim cn \text{ flops}). \qquad (7.25a)$$
$$\kappa(W^{-1}A) \ll \kappa(A). \qquad (7.25b)$$

Figure 7.4: Error reduction of CG applied to Poisson problem.

An approximation W with these properties is called a preconditioner for A. In (7.25a) it is


implicitly assumed that the matrix W does not contain many more nonzero entries than the
matrix A, i.e. W is a sparse matrix, too. In the sections below three popular techniques for
constructing preconditioners will be explained. In section 7.7 results of numerical experiments
are given which show that using an appropriate preconditioner one can improve the efficiency of
an iterative method significantly. The combination of a given iterative method (e.g. CG) with a
preconditioner results in a so-called preconditioned iterative method (e.g. PCG in section 7.7).
As an introductory example, to explain the basic idea of preconditioned iterative methods, we
assume that both A and W are symmetric positive definite and show how the basic Richardson
iterative method (which is not used in practice) can be combined with a preconditioner W. We
consider the Richardson method with parameter value $\alpha = 1/\lambda_{\max}(A)$, i.e.:

$$x^{k+1} = x^k - \alpha (Ax^k - b) \quad\text{with } \alpha := \frac{1}{\lambda_{\max}(A)} . \qquad (7.26)$$

For the iteration matrix $C = I - \alpha A$ of this method we have

$$\rho(C) = \rho(I - \alpha A) = \max\{\, |1 - \lambda/\lambda_{\max}(A)| \mid \lambda \in \sigma(A) \,\} = 1 - \frac{\lambda_{\min}(A)}{\lambda_{\max}(A)} = 1 - \frac{1}{\kappa(A)} . \qquad (7.27)$$

When we apply the same method to the preconditioned system

$$\tilde{A}x = \tilde{b} , \quad \tilde{A} := W^{-1}A , \quad \tilde{b} := W^{-1}b ,$$

we obtain

$$x^{k+1} = x^k - \tilde{\alpha}(\tilde{A}x^k - \tilde{b}) = x^k - \tilde{\alpha}\, W^{-1}(Ax^k - b) \quad\text{with } \tilde{\alpha} = 1/\lambda_{\max}(\tilde{A}) . \qquad (7.28)$$

This method is called the preconditioned Richardson method. Note that if we assume that
(an estimate of) $\lambda_{\max}(\tilde{A})$ is known, then we do not need the preconditioned matrix $\tilde{A}$ in this
method. In (7.28) we have to compute $z := W^{-1}(Ax^k - b)$, i.e., $Wz = Ax^k - b$. Due to the
condition in (7.25a) z can be computed with acceptable arithmetic costs. For the spectral radius
of the iteration matrix $\tilde{C}$ of the preconditioned method we obtain, using $\sigma(\tilde{A}) = \sigma(W^{-1}A) = \sigma(W^{-\frac12}AW^{-\frac12}) \subset (0, \infty)$,

$$\rho(\tilde{C}) = \rho(I - \tilde{\alpha}\tilde{A}) = \max\{\, |1 - \lambda/\lambda_{\max}(\tilde{A})| \mid \lambda \in \sigma(\tilde{A}) \,\} = 1 - \frac{\lambda_{\min}(\tilde{A})}{\lambda_{\max}(\tilde{A})} = 1 - \frac{1}{\kappa(\tilde{A})} . \qquad (7.29)$$

From (7.27) and (7.29) we conclude that if $\kappa(W^{-1}A) \ll \kappa(A)$ (cf. (7.25b)), then $\rho(\tilde{C}) \ll \rho(C)$
and the convergence of the preconditioned method will be much faster than for the original one.
Note that for W = diag(A) the preconditioned Richardson method coincides with the damped
Jacobi method.
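As a simple illustration, here is a minimal Python sketch of the preconditioned Richardson iteration (7.28); the callback solve_W (which applies $W^{-1}$, i.e. solves Wz = r) and the estimate lam_max of $\lambda_{\max}(W^{-1}A)$ are hypothetical ingredients of our sketch, not prescribed by the text.

    import numpy as np

    def preconditioned_richardson(A, b, x0, solve_W, lam_max, tol=1e-8, maxiter=1000):
        # x^{k+1} = x^k - alpha * W^{-1}(Ax^k - b), alpha = 1/lambda_max(W^{-1}A); cf. (7.28).
        x = x0.copy()
        alpha = 1.0 / lam_max
        for k in range(maxiter):
            r = A @ x - b
            if np.linalg.norm(r) <= tol:
                break
            x -= alpha * solve_W(r)    # solve Wz = r, then update with z
        return x

For solve_W = lambda r: r / np.diag(A), i.e. W = diag(A), this is precisely the damped Jacobi method mentioned above.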

7.4 Preconditioning based on a linear iterative method


In this section we explain how a preconditioner W can be obtained from a given (basic) linear
iterative method. Recall the general form of a linear iterative method:

$$x^{k+1} = x^k - M^{-1}(Ax^k - b) . \qquad (7.30)$$

If one uses this iterative method for preconditioning then W := M is taken as the preconditioner
for A. If the method (7.30) converges then W is a reasonable approximation of A in the sense
that $\rho(I - W^{-1}A) < 1$.
The iteration in (7.30) corresponds to an iterative method and thus $M^{-1}y$ ($y \in \mathbb{R}^n$) can be
computed with acceptable arithmetic costs. Hence the condition in (7.25a), with W = M, is
satisfied.
Related to the implementation of such a preconditioner we note the following. In an iterative
method the matrix M is usually not formed explicitly (cf. Gauss-Seidel or SOR),
i.e. the iteration (7.30) can be implemented without explicitly computing M. The solution of
Wx = y, i.e. of Mx = y, is the result of (7.30) with k = 0, $x^0 = 0$, b = y. From this it
follows that the computation of the solution of Wx = y can be implemented by performing one
iteration of the iterative method applied to Az = y with starting vector 0, as the sketch below illustrates.
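A minimal sketch of this implementation idea in Python (the function name is our own): one forward Gauss-Seidel sweep for Az = y with starting vector 0 returns $M^{-1}y$ for the Gauss-Seidel choice of M (the lower triangular part of A, including the diagonal), without M ever being formed.

    import numpy as np

    def apply_gauss_seidel_preconditioner(A, y):
        # One Gauss-Seidel iteration for Az = y with starting vector 0;
        # this is forward substitution with M = lower triangular part of A,
        # i.e. it returns M^{-1} y without forming M explicitly.
        n = len(y)
        z = np.zeros(n)
        for i in range(n):
            z[i] = (y[i] - A[i, :i] @ z[:i]) / A[i, i]
        return z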

A bound for $\kappa(W^{-1}A)$ (cf. (7.25b)) is presented in the following lemma.


Lemma 7.4.1 We assume that A and M are symmetric positive definite matrices and that the
method in (7.30) is convergent, i.e. $\rho(C) < 1$ for the iteration matrix $C = I - M^{-1}A$. Then the following holds:

$$\kappa(M^{-1}A) \le \frac{1 + \rho(C)}{1 - \rho(C)} . \qquad (7.31)$$

Proof. Because A and M are symmetric positive definite it follows that

$$\sigma(M^{-1}A) = \sigma(M^{-\frac12}AM^{-\frac12}) \subset (0, \infty) .$$

Using $\rho(I - M^{-1}A) < 1$ we obtain that $\sigma(M^{-1}A) \subset (0, 2)$. The eigenvalues of $M^{-1}A$ are
denoted by $\mu_i$:

$$0 < \mu_1 \le \mu_2 \le \ldots \le \mu_n < 2 .$$

Hence $\rho(C) = \max\{|1-\mu_1|, |1-\mu_n|\}$ holds and

$$\kappa(M^{-1}A) = \frac{\mu_n}{\mu_1} = \frac{1 + (\mu_n - 1)}{1 - (1 - \mu_1)} \le \frac{1 + |1-\mu_n|}{1 - |1-\mu_1|} .$$

So

$$\kappa(M^{-1}A) \le \frac{1+\rho(C)}{1-\rho(C)}$$

holds. □

With respect to the bound in (7.31) we note that the function $x \mapsto \frac{1+x}{1-x}$ increases monotonically on $[0, 1)$.
In the introductory example above we have seen that it is favourable to have a small value for
$\kappa(M^{-1}A)$. In (7.31) we have a bound on $\kappa(M^{-1}A)$ that decreases if $\rho(C)$ decreases. This
indicates that the higher the convergence rate of the iterative method in (7.30), the better the
quality of M as a preconditioner for A.

Example 7.4.2 (Discrete Poisson equation) We consider the matrix A resulting from the
finite element discretization of the Poisson equation as described in section 6.6. If we use the
method of Jacobi, i.e. M = D, then $\rho(C) \approx 1 - ch^2$ holds and (7.31) results in

$$\kappa(D^{-1}A) \lesssim 2ch^{-2} . \qquad (7.32)$$

In this model problem the eigenvalues are known and it can be shown that the exponent $-2$ in
(7.32) is sharp. When we use the SSOR basic iterative method in (7.30), with M as in (6.18)
and with an appropriate value for $\omega$, we have $\rho(C) \approx 1 - ch$ and thus

$$\kappa(M^{-1}A) \lesssim 2ch^{-1} . \qquad (7.33)$$

Hence, in this example, for the SSOR preconditioner the quantity $\kappa(M^{-1}A)$ is significantly
smaller than for the Jacobi preconditioner. So in a preconditioned Richardson method or in
a preconditioned conjugate gradient method (cf. Example 7.7.1) the SSOR preconditioner
results in a method with a higher rate of convergence than the Jacobi preconditioner. □

7.5 Preconditioning based on incomplete LU factorizations


In this section we discuss a very popular preconditioning technique, which is based on the classi-
cal Gaussian elimination principle. Using the Gaussian elimination method in combination with
partial pivoting (row permutations), if necessary, results in an LU factorization of the matrix A.
The LU factorization of A can be used for solving a linear system with matrix A. However, for
the large sparse systems that we consider, this solution method is inefficient. In an incomplete
factorization method the Gaussian elimination is only partly performed, which then yields an
approximate LU factorization $A \approx LU$ with L and U sparse.
Here we only explain the basic concepts of incomplete factorization methods. For an extensive
discussion of this topic we refer to Axelsson [6], Saad [78] and to Bruaset [25] (the latter contains
many references related to this subject).

7.5.1 LU factorization
The direct method of Gaussian elimination for solving a system Ax = b is closely related to
the LU factorization of A. We recall the following: for every square matrix A there exists a
permutation matrix P, a lower triangular matrix L, with diag(L) = I and an upper triangular
matrix U such that the factorization
PA = LU (7.34)
holds. If A is nonsingular, then for given P these L and U are unique. To simplify the discussion
we only consider the case P = I, i.e. we do not use pivoting. It is known that a factorization as in
(7.34) with P = I exists if A is symmetric positive definite or if A is an M-matrix. Many different
algorithms for the computation of an LU factorization exist (cf. Golub and Van Loan [41]). A
standard technique is presented in the following algorithm, in which aij is overwritten by lij if
i > j and by uij otherwise.

LU factorization.

    For k = 1, ..., n-1
        If a_kk = 0 then quit else
        For i = k+1, ..., n                                    (7.35)
            ℓ := a_ik / a_kk ;   a_ik := ℓ ;
            For j = k+1, ..., n
                a_ij := a_ij - ℓ a_kj .

Clearly, the Gaussian elimination process fails if we encounter a zero pivot. In the if-condition
in (7.35) it is checked whether the pivot in the k-th elimination step is equal to zero. If this con-
dition is never true, the Gaussian elimination algorithm (7.35) yields an LU decomposition as in
(7.34) with P = I. In the k-th step of the Gaussian elimination process we eliminate the nonzero
entries below the diagonal in the k-th column. Due to this the entries in the $(n-k) \times (n-k)$
lower right block of the matrix change. This corresponds to the assignment $a_{ij} := a_{ij} - \ell a_{kj}$,
with a loop over i and j. In the assignment $a_{ik} := \ell$, with a loop over i, the values of $l_{ik}$, $i > k$,
are computed. Finally note that in the k-th step of the elimination process the entries $a_{mj}$, with
$1 \le m \le k$ and $j \ge m$, do not change; these are the entries $u_{mj}$ ($1 \le m \le k$, $j \ge m$) of the
matrix U.
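As an illustration, a minimal Python version of algorithm (7.35), overwriting A in place as described above (our own sketch; A is assumed to be a numpy array of floats):

    import numpy as np

    def lu_inplace(A):
        # Algorithm (7.35): overwrite A with L (strict lower part, diag(L) = I) and U.
        n = A.shape[0]
        for k in range(n - 1):
            if A[k, k] == 0.0:
                raise ZeroDivisionError("zero pivot: pivoting would be needed")
            for i in range(k + 1, n):
                ell = A[i, k] / A[k, k]
                A[i, k] = ell                    # store l_ik
                A[i, k+1:] -= ell * A[k, k+1:]   # a_ij := a_ij - ell * a_kj
        return A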

Another possible implementation of Gaussian elimination is based on solving the $n^2$ equations
A = LU for the $n^2$ unknowns $(l_{ij})_{1\le j<i\le n}$, $(u_{ij})_{1\le i\le j\le n}$. These $n^2$ equations are given by

$$a_{ik} = \sum_{j=1}^{\min(i,k)} l_{ij} u_{jk} , \qquad 1 \le i,k \le n . \qquad (7.36)$$

This yields the following explicit formulas for $l_{ik}$ and $u_{ik}$:

$$l_{ik} = \Big( a_{ik} - \sum_{j=1}^{k-1} l_{ij} u_{jk} \Big) \Big/ u_{kk} , \qquad 1 \le k < i \le n , \qquad (7.37)$$

$$u_{ik} = a_{ik} - \sum_{j=1}^{i-1} l_{ij} u_{jk} , \qquad 1 \le i \le k \le n . \qquad (7.38)$$

Thus we can compute L and U row by row, i.e. we take i = 1, 2, ..., n and for i fixed we compute
$l_{ik}$ by (7.37) with k = 1, ..., i-1 and then $u_{ik}$ by (7.38) with k = i, ..., n. We discuss a simple
implementation of this row-wise Gaussian elimination process. We take i fixed and introduce
the notation

$$a_{ik}^{(m)} := a_{ik} - \sum_{j=1}^{m-1} l_{ij} u_{jk} , \qquad 1 \le k, m \le n . \qquad (7.39)$$

Note that $a_{ik}^{(1)} = a_{ik}$ and

$$u_{ik} = a_{ik}^{(i)} \quad\text{for } k \ge i ,$$
$$l_{ik} = a_{ik}^{(k)} / u_{kk} \quad\text{for } k < i , \qquad (7.40)$$
$$a_{ik}^{(m+1)} = a_{ik}^{(m)} - l_{im} u_{mk} .$$

Using these formulas the entries $l_{ik}$ and $u_{ik}$ can be computed as follows. Note that $u_{1k} = a_{1k}$
for k = 1, ..., n. Assume that the rows 1, ..., i-1 of L and U have been computed; then $l_{ik}$,
$1 \le k < i$, and $u_{ik}$, $i \le k \le n$, are determined by

    For k = 1, ..., i-1
        l_ik = a_ik^(k) / u_kk
        For j = k+1, ..., n
            a_ij^(k+1) = a_ij^(k) - l_ik u_kj
    For k = i, ..., n
        u_ik = a_ik^(i)

As in (7.35) we can overwrite the matrix A, and we then obtain the following algorithm, which
is commonly used for a row-contiguous data structure:

Row-wise LU factorization.

    For i = 2, ..., n
        For k = 1, ..., i-1                                    (7.41)
            ℓ := a_ik / a_kk ;   a_ik := ℓ ;
            For j = k+1, ..., n
                a_ij := a_ij - ℓ a_kj .

For ease of presentation we deleted the statement If akk = 0 then quit. If in both algorithms,
(7.35) and (7.41), a zero pivot does not occur (i.e. akk = 0 is never true), then both algorithms
yield identical LU factorizations.
For certain classes of matrices, for example symmetric matrices or matrices having a band struc-
ture, there exist Gaussian elimination algorithms which take advantage of special properties of
the matrix. Such specialized algorithms enhance efficiency. A well-known example is the Cholesky
decomposition method, in which for a symmetric positive definite matrix A a factorization
A = LLT is computed (here L is lower triangular, but diag(L) is not necessarily equal to I).
Based on the formula in (7.36) the following algorithm is obtained:

Cholesky factorization.

    For k = 1, ..., n
        a_kk := ( a_kk - Σ_{j=1}^{k-1} a_kj^2 )^{1/2}                    (7.42)
        For i = k+1, ..., n
            a_ik := ( a_ik - Σ_{j=1}^{k-1} a_ij a_kj ) / a_kk
To obtain a stable Gaussian elimination algorithm it is important to use a (partial) pivoting
strategy. We do not discuss this topic here but refer to the literature, e.g. Golub and Van
Loan [41]. We note that if the matrix A is symmetric positive definite or weakly diagonally
dominant, then a straightforward implementation of Gaussian elimination is stable even without
using pivoting. For example, the Cholesky algorithm as in (7.42), applied to a symmetric positive
definite matrix, is stable.
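For completeness, a minimal Python rendering of the Cholesky algorithm (7.42), overwriting the lower triangle of A with L (our own sketch; A is assumed symmetric positive definite):

    import numpy as np

    def cholesky_inplace(A):
        # Algorithm (7.42): the lower triangle of A is overwritten by L, A = L L^T.
        n = A.shape[0]
        for k in range(n):
            A[k, k] = np.sqrt(A[k, k] - A[k, :k] @ A[k, :k])
            for i in range(k + 1, n):
                A[i, k] = (A[i, k] - A[i, :k] @ A[k, :k]) / A[k, k]
        return A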

7.5.2 Incomplete LU factorization


In general, in a Gaussian elimination process applied to a sparse matrix an unacceptable amount
of fill-in is created in the matrices L and U. In this section we describe a simple way to avoid
excessive fill-in, resulting in an incomplete LU factorization. We note that in this section we
again use the notation L to denote a lower triangular matrix with diag(L) = I and U to denote
an upper triangular matrix. These L and U, however, may differ from the L and U used in the
LU decomposition discussed in Section 7.5.1.

We introduce the graph of the matrix A:

$$G(A) := \{\, (i,j) \mid 1 \le i,j \le n \text{ and } a_{ij} \ne 0 \,\} .$$

Let S be a subset of the index set $\{\, (i,j) \mid 1 \le i,j \le n \,\}$. We call this subset the sparsity pattern
and we assume:

$$\{\, (i,i) \mid 1 \le i \le n \,\} \subset S , \qquad G(A) \subset S . \qquad (7.43)$$

In our applications the matrices A are such that all diagonal entries are nonzero and thus (7.43)
reduces to the condition $G(A) \subset S$. We now simply enforce sparsity of L and U by setting
every entry in L and U to zero if the corresponding index is outside the sparsity pattern, i.e.
we introduce the condition:

$$l_{ij} = u_{ij} = 0 \quad\text{if } (i,j) \notin S . \qquad (7.44)$$
We apply Gaussian elimination and we require sparsity of L and U as formulated in (7.44). This
then yields an incomplete LU factorization. As for the (complete) LU factorization method dis-
cussed in Section 7.5.1, several different implementations of an incomplete factorization method
exist. We present a few well-known algorithms. We assume that no zero pivot occurs in the
algorithms below. Theorem 7.5.3 gives sufficient conditions on the matrix A such that this as-
sumption is fulfilled.

We start with the incomplete Cholesky factorization $A = LL^T - R$ based on algorithm (7.42).
For this algorithm to make sense, the matrix A should be symmetric positive definite. To
preserve symmetry we assume that the sparsity pattern is symmetric, i.e. if $(i,j) \in S$ then
$(j,i) \in S$, too. We use a formulation in which $a_{ij}$ is overwritten by $l_{ij}$ if $i \ge j$.

Incomplete Cholesky factorization.

    For k = 1, ..., n
        a_kk := ( a_kk - Σ_{j=1}^{k-1} a_kj^2 )^{1/2}                    (7.45)
        For i = k+1, ..., n: If (i,k) ∈ S then
            a_ik := ( a_ik - Σ_{j=1}^{k-1} a_ij a_kj ) / a_kk

The sums in this algorithm should be taken only over those j for which the corresponding indexes
(k, j) and (i, j) are in S.
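A minimal dense-storage Python sketch of (7.45); the pattern is passed as a set S of index pairs (our own interface choice). Since $G(A) \subset S$, entries outside S are zero and are never updated, so the full sums below automatically reduce to the sums over S:

    import numpy as np

    def incomplete_cholesky(A, S):
        # Algorithm (7.45): lower triangle of A is overwritten by L with
        # l_ik = 0 for (i,k) outside the symmetric sparsity pattern S.
        A = A.copy()
        n = A.shape[0]
        for k in range(n):
            A[k, k] = np.sqrt(A[k, k] - A[k, :k] @ A[k, :k])
            for i in range(k + 1, n):
                if (i, k) in S:
                    A[i, k] = (A[i, k] - A[i, :k] @ A[k, :k]) / A[k, k]
        return A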

The first thorough analysis of incomplete factorization techniques is given in Meijerink and
Van der Vorst [62]. In that paper a modified (i.e. incomplete) version of algorithm (7.35) is
considered:
Incomplete LU factorization.

    For k = 1, ..., n-1
        For j = k+1, ..., n: If (k,j) ∉ S then a_kj := 0        (*)
        For i = k+1, ..., n: If (i,k) ∉ S then a_ik := 0        (*)
        For i = k+1, ..., n                                     (7.46)
            ℓ := a_ik / a_kk ;   a_ik := ℓ ;
            For j = k+1, ..., n
                a_ij := a_ij - ℓ a_kj .

Compared to algorithm (7.35) only the lines (*) have been added. In these lines certain entries
in the k-th row of U and in the k-th column of L are set to zero, according to the condition (7.44).
Algorithm (7.46) has a simple structure and is easy to analyze (cf. theorem 7.5.1). However,
this algorithm is a rather inefficient implementation of incomplete factorization. Below we
reformulate the algorithm, resulting in a significantly more efficient implementation, given in
algorithm (7.56).

Theorem 7.5.1 Assume that algorithm (7.46) does not break down (i.e. $a_{kk} = 0$ is never true).
Then this algorithm results in an incomplete factorization $A = LU - R$ with

$$l_{ij} = u_{ji} = 0 \quad\text{for } 1 \le i < j \le n , \qquad (7.47)$$
$$l_{ij} = u_{ij} = 0 \quad\text{for } (i,j) \notin S , \qquad (7.48)$$
$$r_{ij} = 0 \quad\text{for } (i,j) \in S . \qquad (7.49)$$

Proof. The result in (7.47) holds because L (U) is lower (upper) triangular. By construction
(cf. lines (*) in the algorithm) the result in (7.48) holds. It remains to prove the result in (7.49).
The standard basis vector with value 1 in the m-th entry is denoted by $e_m$. By $v_m = (v_m^1, v_m^2, \ldots, v_m^n)^T$
we denote a generic n-vector with $v_m^i = 0$ for $i \le m$. Note that a standard (complete) Gaussian
elimination as in algorithm (7.35) can be represented in matrix formulation as:

$$A_1 = A$$
$$\text{For } k = 1, \ldots, n-1: \qquad (7.50)$$
$$\quad A_{k+1} = L_k A_k , \quad\text{with } L_k \text{ of the form } L_k = I + v_k e_k^T .$$

The matrices $A_{k+1}$ have the property $(A_{k+1})_{ij} = 0$ if $i > j$ and $j \le k$. Then $U := A_n = L_{n-1} L_{n-2} \cdots L_1 A$ holds. Using $L_k^{-1} = I - v_k e_k^T$ we obtain the LU factorization

$$A = L_1^{-1} L_2^{-1} \cdots L_{n-1}^{-1} U = \Big( I - \sum_{k=1}^{n-1} v_k e_k^T \Big) U =: LU . \qquad (7.51)$$

The k-th stage of algorithm (7.46) consists of two parts. First the k-th row and k-th column are
modified by setting certain entries to zero (lines (*) in (7.46)) and then a standard Gaussian
elimination step as in (7.50) is applied to the modified matrix. In matrix formulation this yields:

$$A_1 = A \qquad (7.52a)$$
$$\text{For } k = 1, \ldots, n-1:$$
$$\quad \tilde{A}_k = A_k + R_k , \quad\text{with} \qquad (7.52b)$$
$$\quad R_k \text{ of the form } R_k = v_k e_k^T + e_k v_k^T , \text{ and} \qquad (7.52c)$$
$$\quad (R_k)_{ij} = 0 \text{ for all } (i,j) \in S , \qquad (7.52d)$$
$$\quad A_{k+1} = L_k \tilde{A}_k , \quad\text{with} \qquad (7.52e)$$
$$\quad L_k \text{ of the form } L_k = I + v_k e_k^T . \qquad (7.52f)$$

Again, the matrix $A_{k+1}$ has the property $(A_{k+1})_{ij} = 0$ if $i > j$ and $j \le k$. The three vectors $v_k$
that occur in (7.52c) and (7.52f) may all be different. From the form of $R_k$ and $L_m$ (cf. (7.52c),
(7.52f)) we obtain

$$L_m R_k = R_k \quad\text{for } m < k . \qquad (7.53)$$

Now note that for the resulting upper triangular matrix $U := A_n$ we get, using (7.53) and the
notation $\hat{L} := L_{n-1} L_{n-2} \cdots L_1$:

$$U = L_{n-1} \tilde{A}_{n-1} = L_{n-1} A_{n-1} + L_{n-1} R_{n-1}$$
$$= L_{n-1} L_{n-2} \tilde{A}_{n-2} + \hat{L} R_{n-1}$$
$$= L_{n-1} L_{n-2} A_{n-2} + \hat{L} R_{n-2} + \hat{L} R_{n-1}$$
$$= \ldots = \hat{L} A + \hat{L} (R_1 + R_2 + \ldots + R_{n-1}) .$$

As in (7.51) we define $L := \hat{L}^{-1} = L_1^{-1} L_2^{-1} \cdots L_{n-1}^{-1} = I - \sum_{k=1}^{n-1} v_k e_k^T$ and we get

$$LU = A + \sum_{j=1}^{n-1} R_j .$$

Hence $A = LU - R$ with $R := \sum_{j=1}^{n-1} R_j$, and the result in (7.49) follows from (7.52d). □
We can use the results in theorem 7.5.1 to derive a much more efficient implementation of
algorithm (7.46). Using the condition in (7.44) (or (7.48)) for the incomplete LU factorization,
we obtain for $L = (l_{ij})_{1\le i,j\le n}$, $U = (u_{ij})_{1\le i,j\le n}$:

$$l_{ij} = u_{ji} = 0 \quad\text{for } 1 \le i < j \le n ,$$
$$l_{ij} = u_{ij} = 0 \quad\text{for } (i,j) \notin S , \qquad (7.54)$$
$$l_{ii} = 1 \quad\text{for } 1 \le i \le n .$$

By |S| we denote the number of elements in the sparsity pattern S. After using (7.54) there are
still |S| entries in L and U which have to be determined. From (7.49) we deduce that

$$a_{ij} = (LU)_{ij} \quad\text{for } (i,j) \in S . \qquad (7.55)$$

This yields |S| (nonlinear) equations for these unknown entries of L and U. We now follow the
line of reasoning as in (7.36)-(7.41) for the complete LU factorization. From (7.55) we obtain
(cf. (7.36)):

$$a_{ik} = \sum_{j=1}^{\min(i,k)} l_{ij} u_{jk} \quad\text{if } 1 \le i,k \le n \text{ and } (i,k) \in S .$$
This yields explicit formulas for $l_{ik}$ and $u_{ik}$ (cf. (7.37)):

$$l_{ik} = \Big( a_{ik} - \sum_{j=1}^{k-1} l_{ij} u_{jk} \Big) \Big/ u_{kk} \quad\text{if } 1 \le k < i \le n \text{ and } (i,k) \in S ,$$
$$u_{ik} = a_{ik} - \sum_{j=1}^{i-1} l_{ij} u_{jk} \quad\text{if } 1 \le i \le k \le n \text{ and } (i,k) \in S .$$

Thus we can compute L and U row by row. We take i fixed and use the notation as in (7.39).
This yields (cf. (7.40)):

$$u_{ik} = a_{ik}^{(i)} \quad\text{if } k \ge i \text{ and } (i,k) \in S ,$$
$$l_{ik} = a_{ik}^{(k)} / u_{kk} \quad\text{if } k < i \text{ and } (i,k) \in S ,$$
$$a_{ik}^{(m+1)} = a_{ik}^{(m)} - l_{im} u_{mk} .$$

Using these formulas the entries $l_{ik}$ and $u_{ik}$ can be computed as follows. Note that for
k = 1, ..., n, $u_{1k} = a_{1k}$ if $(1,k) \in S$ and $u_{1k} = 0$ otherwise. Assume that the rows 1, ..., i-1 of L
and U have been computed; then $l_{ik}$ and $u_{ik}$ are determined by

    For k = 1, ..., i-1
        If (i,k) ∈ S then l_ik = a_ik^(k) / u_kk
        For j = k+1, ..., n
            a_ij^(k+1) = a_ij^(k) - l_ik u_kj                   (*)
    For k = i, ..., n
        If (i,k) ∈ S then u_ik = a_ik^(i)

In the computation of $l_{ik}$ and $u_{ik}$ we use $a_{ik}^{(m)}$ only for $(i,k) \in S$. Hence the update in line (*) is
needed only if $(i,j) \in S$. Again we use a formulation in which we overwrite the matrix A, and
we obtain the following incomplete version of algorithm (7.41):

Incomplete row-wise LU factorization.

    For i = 2, ..., n
        For k = 1, ..., i-1: If (i,k) ∈ S then                  (7.56)
            ℓ := a_ik / a_kk ;   a_ik := ℓ ;
            For j = k+1, ..., n: If (i,j) ∈ S then
                a_ij := a_ij - ℓ a_kj .

Remark 7.5.2 The results in theorem 7.5.1 and the construction (7.54)-(7.56) following that
theorem show that if algorithm (7.46) does not break down, then algorithm (7.56) and algo-
rithm (7.46) are equivalent, in the sense that these two algorithms yield the same incomplete
LU factorization. Moreover, the derivation in (7.54)-(7.56) implies that if an incomplete LU
factorization exists which satisfies (7.47)-(7.49), then these L (with diag(L) = I) and U are
unique.
Note that the implementation in (7.56) is more efficient than the implementation (7.46). In the
latter algorithm it may well happen that certain assignments $a_{ij} := a_{ij} - \ell a_{kj}$ in the j-loop are
superfluous, since for a higher value of k (in the k-loop) these previously computed values are
set to zero. □
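To make (7.56) concrete, here is a minimal dense-storage Python sketch of the row-wise incomplete LU factorization; again the sparsity pattern is a set S of index pairs (our own interface choice), and a production code would use a sparse data structure instead. For S = G(A) this is the classical ILU(0) factorization.

    import numpy as np

    def ilu(A, S):
        # Algorithm (7.56): row-wise incomplete LU with sparsity pattern S.
        # On exit A holds L (strict lower part, diag(L) = I) and U.
        A = A.copy()
        n = A.shape[0]
        for i in range(1, n):
            for k in range(i):
                if (i, k) in S:
                    ell = A[i, k] / A[k, k]
                    A[i, k] = ell
                    for j in range(k + 1, n):
                        if (i, j) in S:
                            A[i, j] -= ell * A[k, j]
        return A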

As stated in theorem 7.5.1 and remark 7.5.2, a unique incomplete LU factorization which satisfies
(7.47)-(7.49) exists, if algorithm (7.46) does not break down. The following result is proved in
[62] theorem 2.3.

Theorem 7.5.3 If A is an M-matrix then algorithm (7.46) does not break down. If A is in
addition symmetric positive definite, and the pattern S is symmetric, algorithm (7.45) does not
break down.

If A is an M-matrix, then the unique incomplete LU factorization can be computed using, for
example, algorithm (7.46) or algorithm (7.56). If A is in addition symmetric positive definite,
and the pattern S is symmetric, then we can use algorithm (7.45), too. With respect to the
stability of an incomplete LU factorization we give the following result, which is proved in
Meijerink and Van der Vorst [62] theorem 3.2.

Theorem 7.5.4 Let A be an M-matrix. The incomplete LU factorization of A as described in
theorem 7.5.1 is denoted by $A = LU - R$. The complete LU factorization of A is denoted by
$A = \hat{L}\hat{U}$. Then

$$|l_{ij}| \le |\hat{l}_{ij}| \quad\text{for all } 1 \le i,j \le n$$

holds. Hence, the construction of an incomplete LU factorization is at least as stable as the
construction, without any pivoting, of a complete LU factorization. If A is in addition sym-
metric then the construction of an incomplete Cholesky factorization is at least as stable as the
construction of a complete Cholesky factorization.

Proof. Given in Meijerink and Van der Vorst [62]. □

Even if A is an M-matrix the construction of a complete LU factorization may suffer from
instabilities. However, if the matrix is weakly diagonally dominant, a complete LU factoriza-
tion can be computed, without any pivoting, in a stable way. We conclude that for a weakly
diagonally dominant M-matrix (e.g. the matrices of our model problems) the construction of an
incomplete LU factorization (using (7.56) or (7.45)) is a stable process.

We can use the incomplete LU decomposition to construct a basic iterative method. Such a
method is obtained by taking M := LU, with L and U as in theorem 7.5.1, and applying the
iteration $x^{k+1} = x^k - M^{-1}(Ax^k - b)$. For this method we have to compute an incomplete LU
factorization of the given matrix A and per iteration we have to solve a system of the form
LUx = y. The latter can be done with low computational costs by a forward and backward
substitution process. The iteration matrix of this method is given by $I - (LU)^{-1}A$. With respect
to the convergence of this iterative method we give the following theorem.

Theorem 7.5.5 Assume that A is an M-matrix. For the incomplete LU factorization as in
theorem 7.5.1 we have:

$$\rho\big( I - (LU)^{-1}A \big) < 1 .$$

7.5.3 Modified incomplete Cholesky method


In the modified incomplete Cholesky method the fill-in entries which are neglected in the loop
over i in algorithm (7.45) are taken into account, in the sense that these are moved to the
diagonal. Moving these entries to the corresponding diagonal elements does not cause any
additional fill-in. The algorithm is as follows:

Modified incomplete Cholesky factorization.

    For k = 1, ..., n
        a_kk := ( a_kk - Σ_{j=1}^{k-1} a_kj^2 )^{1/2}
        For i = k+1, ..., n                                     (7.57)
            If (i,k) ∈ S then
                a_ik := ( a_ik - Σ_{j=1}^{k-1} a_ij a_kj ) / a_kk
            else
                a_ii := a_ii - ( a_ik - Σ_{j=1}^{k-1} a_ij a_kj ) / a_kk

Again the sums in this algorithm should only be taken over those j for which the corresponding
indexes are in S. One can prove that this algorithm, if it does not break down, yields an
incomplete factorization

$$A = LL^T + R$$

with

$$l_{ij} = 0 \quad\text{if } (i,j) \notin S ,$$
$$(LL^T)_{ij} = a_{ij} \quad\text{if } (i,j) \in S ,\; i \ne j ,$$
$$\sum_{j=1}^{n} (LL^T)_{ij} = \sum_{j=1}^{n} a_{ij} \quad\text{for all } i .$$

It is known that for certain problems this lumping strategy (moving in each row certain
entries to the diagonal) can improve the quality of the incomplete Cholesky preconditioner
significantly. This is illustrated in numerical experiments for the Poisson equation in section 7.7.

7.6 Problem based preconditioning


In preparation

7.7 Preconditioned Conjugate Gradient Method


A convergence analysis of the CG method results in a bound for the rate of convergence of the
CG method as in (7.22), which depends on the spectral condition number of the matrix A. A
larger condition number will result in a lower rate of convergence. In Section 7.3 we discussed
the concept of preconditioning. In this section we apply this preconditioning technique to the
CG method.

Let B be a given regular n × n matrix. Consider the following transformation of the original
problem given in (7.1):

$$\tilde{A}\tilde{x} = \tilde{b} \quad\text{with}\quad \tilde{A} := B^{-1}AB^{-T} , \quad \tilde{x} := B^{T}x , \quad \tilde{b} := B^{-1}b . \qquad (7.58)$$

The matrix $\tilde{A}$ is symmetric positive definite, so we can apply the CG method from (7.18) to the
system in (7.58). This results in the following algorithm (which is not used in practice, because
in general the computation of $\tilde{A} = B^{-1}AB^{-T}$ will be too expensive):
$$\begin{cases}
\tilde{x}^0 \text{ a given starting vector};\; \tilde{r}^0 = \tilde{A}\tilde{x}^0 - \tilde{b}\\[1mm]
\text{for } k \ge 0 \ (\text{if } \tilde{r}^k \ne 0):\\[1mm]
\quad \tilde{p}^k = \tilde{r}^k + \dfrac{\langle \tilde{r}^k, \tilde{r}^k\rangle}{\langle \tilde{r}^{k-1}, \tilde{r}^{k-1}\rangle}\, \tilde{p}^{k-1} \quad (\text{if } k = 0:\ \tilde{p}^0 := \tilde{r}^0)\\[1mm]
\quad \tilde{x}^{k+1} = \tilde{x}^k + \alpha_{\rm opt}(\tilde{x}^k, \tilde{p}^k)\, \tilde{p}^k \quad\text{with } \alpha_{\rm opt}(\tilde{x}^k, \tilde{p}^k) = -\dfrac{\langle \tilde{r}^k, \tilde{r}^k\rangle}{\langle \tilde{p}^k, \tilde{A}\tilde{p}^k\rangle}\\[1mm]
\quad \tilde{r}^{k+1} = \tilde{r}^k + \alpha_{\rm opt}(\tilde{x}^k, \tilde{p}^k)\, \tilde{A}\tilde{p}^k .
\end{cases} \qquad (7.59)$$

This algorithm yields approximations $\tilde{x}^k$ for the solution $\tilde{x} = B^T x^*$ of the transformed system.
To obtain an algorithm in the original variables we introduce the notation

$$p^k := B^{-T}\tilde{p}^k , \quad x^k := B^{-T}\tilde{x}^k , \quad r^k := B\tilde{r}^k , \qquad (7.60)$$
$$z^k := B^{-T}\tilde{r}^k = B^{-T}B^{-1}r^k = W^{-1}r^k , \quad\text{with } W := BB^T . \qquad (7.61)$$
Using this notation we can reformulate the algorithm in (7.59) as follows:

$$\begin{cases}
x^0 \text{ a given starting vector};\; r^0 := Ax^0 - b\\[1mm]
\text{for } k \ge 0 \ (\text{if } r^k \ne 0):\\[1mm]
\quad \text{solve } z^k \text{ from } Wz^k = r^k\\[1mm]
\quad p^k := z^k + \dfrac{\langle z^k, r^k\rangle}{\langle z^{k-1}, r^{k-1}\rangle}\, p^{k-1} \quad (\text{if } k = 0:\ p^0 := z^0)\\[1mm]
\quad x^{k+1} := x^k + \alpha_{\rm opt}(x^k, p^k)\, p^k \quad\text{with } \alpha_{\rm opt}(x^k, p^k) = -\dfrac{\langle z^k, r^k\rangle}{\langle p^k, Ap^k\rangle}\\[1mm]
\quad r^{k+1} := r^k + \alpha_{\rm opt}(x^k, p^k)\, Ap^k .
\end{cases} \qquad (7.62)$$

This algorithm, which yields approximations $x^k$ for the solution $x^*$ of the original system, is
called the Preconditioned Conjugate Gradient method (PCG) with preconditioner W. For W = I
we obtain the algorithm as in (7.18).

Note that in the algorithm in (7.62) the matrix B is involved only through $W = BB^T$. Hence this
algorithm is applicable whenever a symmetric positive definite matrix W is available. Such a matrix
has a corresponding Cholesky decomposition $W = BB^T$. This decomposition, however, plays
a role only in the theoretical derivation of the method (cf. (7.58)) and is not needed in the
algorithm.
Using the identity $\|\tilde{x}^k - \tilde{x}\|_{\tilde{A}} = \|x^k - x^*\|_A$ we obtain that the error reduction in (7.62), mea-
sured in $\|\cdot\|_A$, is the same as the error reduction in (7.59), measured in $\|\cdot\|_{\tilde{A}}$. Based on the
result in (7.22) we have that the rate of convergence of the algorithm in (7.59), and thus of the
PCG algorithm, too, is determined by

$$\kappa(\tilde{A}) = \kappa(B^{-1}AB^{-T}) = \kappa(W^{-1}A) .$$

So for a significant increase of the rate of convergence due to preconditioning we should have a
preconditioner W with $\kappa(W^{-1}A) \ll \kappa(A)$. In the PCG algorithm in (7.62) we have to solve a
system with matrix W in every iteration. So the matrix W should be such that the solution of
this system can be computed with low computational costs (not much more than one matrix-
vector multiplication). Note that these requirements for the preconditioner are as in section 7.3.
For the PCG method we need a symmetric positive definite preconditioner W. In the previous
sections we discussed the symmetric positive definite preconditioners $W = M_{SSOR}$ (SSOR precondition-
ing), $W = LL^T$ (incomplete Cholesky) and $W = \bar{L}\bar{L}^T$ (modified incomplete Cholesky). In
the example below we apply the PCG method with these three preconditioners to the discrete
Poisson equation.
Example 7.7.1 (Poisson model problem) We consider the discrete Poisson equation as in
section 6.6. We apply the PCG method with the SSOR preconditioner. For the parameter $\omega$
in the SSOR preconditioning we use the value $\omega_{\rm opt}$ as in theorem 6.4.3, i.e. $\omega$ is such that
the spectral radius of the iteration matrix of the SOR method is minimal. In Table 7.2 we show
results that can be compared with the results in Table 7.1. We measure the error reduction in
the Euclidean norm $\|\cdot\|_2$. By # we denote the number of iterations needed to reduce the norm
of the starting error by a factor $R = 10^3$.

    h                              1/40    1/80    1/160   1/320
    ω = 2/(1 + sin(πh))            1.854   1.924   1.961   1.981
    #                              11      16      22      32

Table 7.2: PCG with SSOR preconditioner applied to Poisson problem.

In Axelsson and Barker [7] it is shown that for this model problem the SSOR preconditioner, with
an appropriate value for the parameter $\omega$, results in $\kappa(W^{-1}A) \sim ch^{-1}$ and thus (cf. (7.22)) we
expect an error reduction per iteration (measured in the A-norm) with at least a factor $1 - c\sqrt{h}$.
If this is the case, then for this problem the PCG method has a complexity of $O(n^{5/4})$ flops.
The results in Table 7.2 are consistent with such a reduction factor of the form $1 - c\sqrt{h}$. Ap-
parently, the choice $\omega = \omega_{\rm opt}$, as explained above, is appropriate. Related to this we note that
in Axelsson and Barker [7] it is shown that often the rate of convergence of the PCG method
with SSOR preconditioning is not very sensitive with respect to perturbations in the value of
the parameter $\omega$. This phenomenon is illustrated in Table 7.3, where we show the results of
PCG with SSOR preconditioning for $h = \frac{1}{160}$ and for several values of $\omega$.

    ω    1.80   1.85   1.90   1.95   1.96   1.97   1.98   1.99
    #    33     29     26     23     22     22     23     24

Table 7.3: PCG with SSOR preconditioner applied to Poisson problem.

In Table 7.4 we show the results obtained with the incomplete Cholesky preconditioner, i.e.
$W = LL^T$, and with the modified incomplete Cholesky preconditioner, i.e. $W = \bar{L}\bar{L}^T$ (cf.
Section 7.5). In both cases, for the sparsity pattern we used S = G(A). In the literature these
algorithms are denoted by ICCG and MICCG, respectively.

The results for ICCG indicate that for the preconditioned system we have $\kappa(W^{-1}A) \sim ch^{-2}$,
where the constant is better than for the unpreconditioned system with W = I. The results
for MICCG indicate that for the preconditioned system we have $\kappa(W^{-1}A) \sim ch^{-1}$, which is
comparable to the result with SSOR preconditioning. □

    h            1/40   1/80   1/160   1/320
    ICCG, #      20     40     79      157
    MICCG, #     8      11     14      20

Table 7.4: PCG with (M)IC preconditioner applied to Poisson problem.

Remark 7.7.2 (in preparation) IC preconditioning is often more robust than MIC precondi-
tioning, for example for problems with discontinuous coefficients. 

Remark 7.7.3 (in preparation) On the eigenvalue distribution of a preconditioned system in


relation to CG convergence. 

Chapter 8

Krylov Subspace Methods

8.1 Introduction
In Section 7.2 the CG method for solving Ax = b has been derived as a minimization method
for the functional

$$F(x) = \tfrac{1}{2}\langle x, Ax\rangle - \langle x, b\rangle .$$

If A is symmetric positive definite then this F is a quadratic functional with a unique minimizer
and minimization of F is equivalent to solving Ax = b.
If A is not symmetric positive definite then the nice minimization properties of CG do not
hold and it is not clear whether CG is still useful. If A is not symmetric positive definite we
can still try the CG algorithm. In practice we often observe that for nonsymmetric problems
in which the symmetric part (i.e. $\frac12(A + A^T)$) is positive definite, the CG algorithm is still
a fairly efficient solver if the skew-symmetric part (i.e. $\frac12(A - A^T)$) is small compared to
the symmetric part. In other words, the CG algorithm can be used for solving nonsymmetric
problems in which the nonsymmetric part is a perturbation of a symmetric positive definite part.
In problems with moderate nonsymmetry ($\|A - A^T\| \sim \|A + A^T\|$) or with strong nonsymmetry
($\|A - A^T\| \gg \|A + A^T\|$) the CG algorithm generally diverges. For such nonsymmetric problems
other Krylov subspace methods have been developed.

Example 8.1.1 We consider the discrete convection-diffusion problem as in section 6.6 with
$b_1 = \cos(\pi/6)$, $b_2 = \sin(\pi/6)$ and h = 1/160. We take $x^0 = 0$ and an error reduction
factor $R = 10^3$. The CG algorithm is applied to this problem for different values of the parameter
$\varepsilon$. The results are shown in Table 8.1.

    ε    10^2   10^1   10^0   10^{-1}   10^{-2}
    #    190    233    322    DIV       DIV

Table 8.1: CG method applied to convection-diffusion problem.

Note that for large values of $\varepsilon$ the problem is nearly symmetric and the convergence behaviour
of the CG method is reasonable. For smaller values of $\varepsilon$ the nonsymmetry of the problem is
increasing and the CG method fails. □

In section 8.2 below we show that, for A symmetric positive definite, the CG method can be
seen as a projection method. Using this point of view we can develop variants of the CG method

which can be used for problems in which A is symmetric but indefinite or A is nonsymmetric. In
recent years many such variants have been introduced. For an overview of these methods we
refer to Saad [78], Freund et al. [36], Greenbaum [42], Sleijpen and Van der Vorst [85]. We will
discuss a few important methods and explain the main approaches in this field of nonsymmetric
Krylov subspace methods.

8.2 The Conjugate Gradient method reconsidered


For a given nonsingular $A \in \mathbb{R}^{n\times n}$ and given $r \in \mathbb{R}^n$ we define the Krylov subspace as follows:

$$K_k(A; r) := \operatorname{span}\{r, Ar, A^2 r, \ldots, A^{k-1}r\} . \qquad (8.1)$$

In the remainder of this chapter the Krylov subspace $K_k(A; r^0)$, with $r^0 = Ax^0 - b$ the starting
residual, will play an important role. To avoid certain technical details, we make the following
assumption concerning the starting vector x0 :

Assumption 8.2.1 In the remainder of this chapter we assume that x0 is chosen such that
dim(Kk (A; r0 )) = k for k = 1, 2, . . . , n.

We note that in the generic case this assumption is fulfilled. Only for special choices of x0 one
has dim(Kk (A; r0 )) < k for k < n. We emphasize that the formulations of the algorithms which
are discussed in the remainder of this chapter do not depend on this assumption.

We first reconsider the CG method applied to the problem Ax = b with A symmetric pos-
itive definite. Using the results of theorem 7.2.1 we obtain that

$$x^k \in x^0 + K_k(A; r^0)$$

holds and

$$A(x^k - x^*) = Ax^k - b = r^k \perp K_k(A; r^0) ,$$

or, equivalently,

$$\langle A(x^k - x^*), z\rangle = \langle x^k - x^*, z\rangle_A = 0 \quad\text{for all } z \in K_k(A; r^0) .$$

We conclude that $x^k - x^0$ is the A-orthogonal projection (i.e. with respect to the A-inner product
$\langle\cdot,\cdot\rangle_A$) of the starting error $x^* - x^0$ on $K_k(A; r^0)$. This is illustrated in figure 8.1.

Figure 8.1: CG as a projection method (the right angle is with respect to the inner product $\langle\cdot,\cdot\rangle_A$).

From the
observations above it follows that the CG iterate $x^k$ can be characterized as the unique solution
of the following problem:

$$\begin{cases} \text{determine } x^k \in x^0 + K_k(A; r^0) \text{ such that}\\ \|x^k - x^*\|_A = \min\{\, \|x - x^*\|_A \mid x \in x^0 + K_k(A; r^0) \,\} . \end{cases} \qquad (8.2)$$

Because $x^k - x^* \perp_A z \;\Leftrightarrow\; Ax^k - b \perp z$, an equivalent formulation of this problem is:

$$\begin{cases} \text{determine } x^k \in x^0 + K_k(A; r^0) \text{ such that}\\ Ax^k - b \perp K_k(A; r^0) . \end{cases} \qquad (8.3)$$

We will now derive an algorithm ((8.16) below), different from the CG algorithm, that can
be used to solve this problem. For this algorithm and the CG algorithm the computational
costs per iteration are comparable and, in exact arithmetic, these two algorithms yield the same
iterands. The ideas underlying this alternative algorithm will play an important role in the
derivation of algorithms for the case that A is not symmetric positive definite.

We start with a simple method for computing an orthogonal basis of the Krylov subspace,
the so-called Lanczos method:

$$\begin{cases}
q^0 := 0;\; q^1 := r^0/\|r^0\|;\; \beta_0 := 0;\\[1mm]
\text{for } j \ge 1:\\[1mm]
\quad q^{j+1} := Aq^j - \beta_{j-1} q^{j-1} ,\\[1mm]
\quad \alpha_j := \langle q^{j+1}, q^j\rangle ,\\[1mm]
\quad q^{j+1} := q^{j+1} - \alpha_j q^j , \quad \beta_j := \|q^{j+1}\| ,\\[1mm]
\quad q^{j+1} := q^{j+1}/\beta_j .
\end{cases} \qquad (8.4)$$

With induction one easily proves:


Theorem 8.2.2 If A is symmetric then the set $q^1, q^2, \ldots, q^k$ forms an orthogonal basis of the
Krylov subspace $K_k(A; r^0)$ ($k \le n$).

Note that the method uses only a three-term recursion, i.e. $q^{j+1}$ can be determined from
$q^j$, $q^{j-1}$, and that the costs per iteration are low. Given the basis for the Krylov subspace
of dimension j we need one matrix-vector multiplication, two inner products and a few vector
updates to compute the orthogonal basis for the Krylov subspace of dimension j + 1.
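A minimal Python sketch of the Lanczos method (8.4), returning the basis vectors and recursion coefficients (the packaging of the output is our own choice; breakdowns $\beta_j = 0$ are excluded by assumption 8.2.1):

    import numpy as np

    def lanczos(A, r0, k):
        # Lanczos method (8.4): three-term recursion building an orthogonal
        # basis q^1, ..., q^{k+1} of Krylov subspaces of the symmetric matrix A.
        n = len(r0)
        Q = np.zeros((n, k + 1))
        alpha = np.zeros(k)
        beta = np.zeros(k)
        Q[:, 0] = r0 / np.linalg.norm(r0)
        for j in range(k):
            q = A @ Q[:, j]
            if j > 0:
                q -= beta[j - 1] * Q[:, j - 1]   # subtract beta_{j-1} q^{j-1}
            alpha[j] = q @ Q[:, j]
            q -= alpha[j] * Q[:, j]
            beta[j] = np.linalg.norm(q)
            Q[:, j + 1] = q / beta[j]
        return Q, alpha, beta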
Define

$$Q_j := [q^1\; q^2\; \ldots\; q^j] \qquad (n \times j \text{ matrix with columns } q^i) .$$

The recursion in (8.4) can be rewritten as

$$Aq^j = \beta_{j-1} q^{j-1} + \alpha_j q^j + \beta_j q^{j+1} , \qquad (8.5)$$

and thus

$$AQ_k = Q_k \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{k-1} \\ & & \beta_{k-1} & \alpha_k \end{pmatrix} + \beta_k q^{k+1} (0, 0, \ldots, 0, 1) =: Q_k T_k + \beta_k q^{k+1} e_k^T \qquad (8.6)$$
holds. Due to the orthogonality of the basis we have

$$Q_k^T Q_k = I_k \;\; (k \times k \text{ identity matrix}); \qquad Q_k^T q^{k+1} = 0 . \qquad (8.7)$$

For solving the problem (8.3) we have to compute $x^k \in x^0 + K_k(A; r^0)$ which satisfies the
orthogonality property

$$\langle Ax^k - b, z\rangle = 0 \quad\text{for all } z \in K_k(A; r^0) .$$

This yields the condition

$$Q_k^T(Ax^k - b) = Q_k^T\big( A(x^k - x^0) + r^0 \big) = 0 . \qquad (8.8)$$

Note that $q^1 = r^0/\|r^0\|$ and $Q_k^T r^0 = \|r^0\|(1, 0, \ldots, 0)^T =: \|r^0\| e_1$. Since the vector $x^k - x^0$ must
be an element of $K_k(A; r^0)$ it can be represented using the basis $q^1, q^2, \ldots, q^k$, i.e. there exists
a $y^k \in \mathbb{R}^k$ such that $x^k - x^0 = Q_k y^k$. Using this the condition (8.8) can be formulated as

$$Q_k^T A Q_k y^k = -\|r^0\| e_1 . \qquad (8.9)$$

With the results in (8.6) and (8.7) we obtain that the solution of the problem (8.3) (or (8.2))
is given by

$$T_k y^k = -\|r^0\| e_1 , \qquad (8.10a)$$
$$x^k = x^0 + Q_k y^k . \qquad (8.10b)$$

Note that $T_k = Q_k^T A Q_k$ is a symmetric positive definite tridiagonal k × k matrix. So the vector
$x^k$ can be obtained by first solving the tridiagonal system in (8.10a) and then computing $x^k$
as in (8.10b). This, however, would result in an algorithm with high computational costs per
iteration. We now show that based on (8.10a), (8.10b) an algorithm can be derived in which
the iterand $x^k$ can be updated from the previous iterand $x^{k-1}$ in a simple and cheap way (as in
the CG algorithm). To derive this algorithm we represent $T_k$ using its LU factorization (which
exists, because $T_k$ is symmetric positive definite):
$$T_k = L_k U_k = \begin{pmatrix} 1 & & & \\ l_2 & 1 & & \\ & \ddots & \ddots & \\ & & l_k & 1 \end{pmatrix} \begin{pmatrix} u_1 & \beta_1 & & \\ & u_2 & \ddots & \\ & & \ddots & \beta_{k-1} \\ & & & u_k \end{pmatrix} ,$$

for k = 1, 2, ..., where the $\beta_i$ are the same as in the matrix $T_k$. We also introduce the notation
$P_k = [p^1\; p^2\; \ldots\; p^k] := Q_k U_k^{-1}$, $z^k := -L_k^{-1}\|r^0\| e_1$. From (8.10a) and (8.10b) we then obtain

$$x^k = x^0 - Q_k T_k^{-1} \|r^0\| e_1 = x^0 - Q_k U_k^{-1} L_k^{-1} \|r^0\| e_1 = x^0 + P_k z^k . \qquad (8.11)$$

From the k-th column in the identity $P_k U_k = Q_k$ one obtains $p^{k-1}\beta_{k-1} + p^k u_k = q^k$ and thus
the simple update formula

$$p^k = \frac{1}{u_k}\big( q^k - \beta_{k-1}\, p^{k-1} \big) . \qquad (8.12)$$

From the last row in the identity $T_k = L_k U_k$ we obtain $l_k u_{k-1} = \beta_{k-1}$ and $l_k \beta_{k-1} + u_k = \alpha_k$,
i.e.

$$l_k = \frac{\beta_{k-1}}{u_{k-1}} , \qquad u_k = \alpha_k - l_k \beta_{k-1} . \qquad (8.13)$$
If we represent $z^k$ as $z^k = \begin{pmatrix} z^{k-1} \\ \zeta_k \end{pmatrix}$ with $\zeta_k \in \mathbb{R}$ (k = 1, 2, ...), it follows from (8.11) that

$$x^k = x^0 + P_{k-1} z^{k-1} + p^k \zeta_k = x^{k-1} + \zeta_k p^k . \qquad (8.14)$$

Finally, from the last equation in $L_k z^k = -\|r^0\| e_1$ it follows that $l_k \zeta_{k-1} + \zeta_k = 0$ and thus

$$\zeta_k = -l_k \zeta_{k-1} . \qquad (8.15)$$

In (8.12), (8.13), (8.14) and (8.15) we have recursion formulas which allow a simple update
$k-1 \to k$. Combining these formulas with the Lanczos algorithm (8.4) for computing $q^k$ results
in the following Lanczos iterative solution method:


$$\begin{cases}
q^0 := 0;\; q^1 := r^0/\|r^0\|;\; \beta_0 = p^0 = l_1 = 0;\; \zeta_1 = -\|r^0\|;\\[1mm]
\text{for } j \ge 1:\\[1mm]
\quad q^{j+1} := Aq^j - \beta_{j-1} q^{j-1} , \quad \alpha_j := \langle q^{j+1}, q^j\rangle ,\\[1mm]
\quad \text{if } j > 1 \text{ then } l_j = \beta_{j-1}/u_{j-1} \text{ and } \zeta_j = -l_j \zeta_{j-1} ,\\[1mm]
\quad u_j = \alpha_j - l_j \beta_{j-1} , \qquad (8.16)\\[1mm]
\quad p^j = \frac{1}{u_j}\big( q^j - \beta_{j-1}\, p^{j-1} \big) ,\\[1mm]
\quad x^j = x^{j-1} + \zeta_j p^j ,\\[1mm]
\quad q^{j+1} := q^{j+1} - \alpha_j q^j , \quad \beta_j := \|q^{j+1}\| ,\\[1mm]
\quad q^{j+1} := q^{j+1}/\beta_j .
\end{cases}$$

This algorithm for computing the solution xk of (8.2) (or (8.3)) has about the same computa-
tional costs as the CG algorithm presented in section 7.2.

In the derivation of the Lanczos iterative solution method (8.16) the following ingredients are
important:

An orthogonal basis of the Krylov subspace can be computed with
low costs, using the Lanczos method (8.4).   (8.17)

As an approximation of the original system, the projected much
smaller system $T_k y^k = -\|r^0\| e_1$ in (8.10a) is solved.   (8.18)

The computation of the orthogonal basis in (8.17) and the solution
of the projected system in (8.18) can be implemented in such a way
that we only need simple update formulas $k-1 \to k$.   (8.19)

In the derivation of the projected system (8.10a) the fact that we have an orthogonal basis plays
a crucial role.
The approach discussed above is a starting point for the development of methods which can
be used in cases where A is not symmetric positive definite. In generalizing this approach to
systems in which A is not symmetric positive definite one encounters the following two major

difficulties:

If A is not symmetric positive definite, then an A-inner product does
not exist, and thus the problem (8.2) does not make sense,   (8.20)

and

if A is not symmetric, then an orthogonal basis of $K_k(A; r^0)$
cannot be computed with low computational costs.   (8.21)

In section 8.3 we consider the case that matrix A is not positive definite, but still symmetric
(i.e. symmetric indefinite). Then we can still use the Lanczos method to compute, in a cheap
way, an orthogonal basis of the Krylov subspace. To deal with the problem formulated in (8.20)
one can replace the error minimization in the A-norm in (8.2) by a residual minimization in the
Euclidean norm, i.e. minimize $\|Ax - b\|$ over the space $x^0 + K_k(A; r^0)$. For every nonsingular
matrix A this residual minimization problem has a unique solution. Furthermore, as will be
shown in section 8.3, this residual minimization problem can be solved with low computational
costs if an orthogonal basis of the Krylov subspace is available. A well-known method for solving
symmetric indefinite problems, which is based on using the Lanczos method (8.4) for computing
the solution of the residual minimization problem is the MINRES method.

In section 8.4 and Section 8.5 we assume that the matrix A is not even symmetric. Then
both the problem formulated in (8.20) and the problem formulated in (8.21) arise. We can deal
with the problem in (8.20) as in the MINRES method, i.e. we can use residual minimization in
the euclidean norm instead of error minimization in the A-norm. It will turn out that, just as for
the symmetric indefinite case, this residual minimization problem can be solved with low costs if
an orthogonal basis of the Krylov subspace is available. However, due to the nonsymmetry (cf.
(8.21)), for computing such an orthogonal basis we now have to use a method which is computa-
tionally much more expensive than the Lanczos method. An important method which is based
on the idea of computing an orthogonal basis of the Krylov subspace and using this basis to solve
the residual minimization problem is the GMRES method. We discuss this method in section 8.4.

Another important class of methods for solving nonsymmetric problems is treated in section 8.5.
In these methods one does not compute the solution of an error or residual minimization prob-
lem (as is done in CG, MINRES, GMRES). Instead one tries to determine xk x0 + Kk (A; r0 )
which satisfies an orthogonality condition similar to the one in (8.3). It turns out that using
this approach one can avoid the expensive computation of an orthogonal basis of the Krylov
subspace. The main example from this class is the Bi-CG method. The Bi-CG method has led
to many variants. A few popular variants are considered in section 8.5, too.

8.3 MINRES method


In this section we discuss the MINRES method (Minimal Residual), which can be used for
problems with A symmetric and (possibly) indefinite. The method is introduced in Paige and
Saunders [70]. For symmetric A the Lanczos method in (8.4) can be used to find, with low com-
putational costs, an orthogonal basis $q^1, q^2, \ldots, q^k$ of the Krylov space $K_k(A; r^0)$, k = 1, 2, ....
The recursion in (8.4) can be rewritten as

$$Aq^j = \beta_{j-1} q^{j-1} + \alpha_j q^j + \beta_j q^{j+1} , \qquad (8.22)$$

and thus

$$AQ_k = Q_{k+1} \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{k-1} \\ & & \beta_{k-1} & \alpha_k \\ & & & \beta_k \end{pmatrix} =: Q_{k+1} T_k . \qquad (8.23)$$

Note that $T_k$ is a (k+1) × k matrix. Due to the orthogonality of the basis we have

$$Q_k^T Q_k = I_k , \qquad Q_k^T q^{k+1} = 0 . \qquad (8.24)$$

The MINRES method, introduced in Paige and Saunders [70], is based on the following residual
minimization problem:

$$\begin{cases} \text{Given } x^0 \in \mathbb{R}^n, \text{ determine } x^k \in x^0 + K_k(A; r^0) \text{ such that}\\ \|Ax^k - b\| = \min\{\, \|Ax - b\| \mid x \in x^0 + K_k(A; r^0) \,\} , \end{cases} \qquad (8.25)$$

where $r^0 := Ax^0 - b$. Note that the Euclidean norm is used and that for any regular A this
minimization problem has a unique solution $x^k$, which is illustrated in figure 8.2.
Clearly, we have a projection: $r^k = Ax^k - b$ is the orthogonal projection (with respect to $\langle\cdot,\cdot\rangle$) of $r^0$
on $R := A(K_k(A; r^0)) = \operatorname{span}\{Ar^0, A^2r^0, \ldots, A^kr^0\}$; see figure 8.2.

Figure 8.2: Residual minimization.

Any $x \in K_k(A; r^0)$ can be represented as $x = -Q_k y$ with $y \in \mathbb{R}^k$ and using this we obtain:

$$\|A(x^0 + x) - b\| = \|AQ_k y - r^0\| = \|Q_{k+1} T_k y - r^0\| = \|Q_{k+1}(T_k y - \|r^0\| e_1)\| = \|T_k y - \|r^0\| e_1\| . \qquad (8.26)$$

So $x^k$ as in (8.25) can be obtained from

$$\|T_k y^k - \|r^0\| e_1\| = \min\{\, \|T_k y - \|r^0\| e_1\| \mid y \in \mathbb{R}^k \,\} , \qquad (8.27a)$$
$$x^k = x^0 - Q_k y^k . \qquad (8.27b)$$

From (8.27) we see that the residual minimization problem in (8.25) leads to a least squares
problem with the (k+1) × k tridiagonal matrix $T_k$. Due to the structure of this matrix Givens
rotations are very suitable for solving the least squares problem in (8.27). Combination of the
Lanczos algorithm (for computing an orthogonal basis) with a least squares solver based on
Givens rotations results in the MINRES algorithm. We will now derive this algorithm.
First we recall that for $(x, y) \ne (0, 0)$ a unique orthogonal Givens rotation is given by

$$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \quad\text{such that}\quad c^2 + s^2 = 1 , \quad G \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} w \\ 0 \end{pmatrix} \text{ with } w > 0 . \qquad (8.28)$$
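A small Python helper computing the Givens rotation of (8.28) (our own sketch):

    import numpy as np

    def givens(x, y):
        # Returns (c, s, w) with c^2 + s^2 = 1 such that
        # [[c, s], [-s, c]] applied to (x, y) gives (w, 0) with w > 0; cf. (8.28).
        w = np.hypot(x, y)
        return x / w, y / w, w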

The least squares problem in (8.27a) is solved using an orthogonal transformation $V_k \in \mathbb{R}^{(k+1)\times(k+1)}$
such that

$$V_k T_k = \begin{pmatrix} R_k \\ 0 \end{pmatrix} , \qquad R_k \in \mathbb{R}^{k\times k} \text{ upper triangular.} \qquad (8.29)$$

Define $b_k := V_k e_1 =: \begin{pmatrix} \hat{b}_k \\ b_{k,k+1} \end{pmatrix}$, with $\hat{b}_k \in \mathbb{R}^k$. Then the solution of the least squares problem
is given by $y^k = \|r^0\|\, R_k^{-1} \hat{b}_k$. We show how the matrices $R_k$ and vectors $b_k$, k = 1, 2, ..., can be
computed using short (and thus cheap) recursions. We introduce the notation

$$G_j = \begin{pmatrix} I_{j-1} & & \\ & c_j & s_j \\ & -s_j & c_j \end{pmatrix} \in \mathbb{R}^{(j+1)\times(j+1)} \quad\text{with } c_j^2 + s_j^2 = 1 .$$

Given $T_1$ one can compute $c_1, s_1, r_1$ such that

$$G_1 \begin{pmatrix} \alpha_1 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} r_1 \\ 0 \end{pmatrix} . \qquad (8.30)$$

Given $T_2$ and $G_1$ one can compute $c_2, s_2, r_2$ such that

$$G_2 \begin{pmatrix} G_1 & \\ & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \alpha_2 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} r_2 \\ 0 \end{pmatrix} , \quad r_2 \in \mathbb{R}^2 . \qquad (8.31)$$

For $k \ge 3$ and for given $T_k$, $G_{k-1}$, $G_{k-2}$ one can compute $c_k, s_k, r_k$ such that

$$G_k \begin{pmatrix} G_{k-1} & \\ & 1 \end{pmatrix} \begin{pmatrix} G_{k-2} & & \\ & 1 & \\ & & 1 \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \beta_{k-1} \\ \alpha_k \\ \beta_k \end{pmatrix} = \begin{pmatrix} r_k \\ 0 \end{pmatrix} , \quad r_k \in \mathbb{R}^k . \qquad (8.32)$$

Note that $r_k$ has at most three nonzero entries:

$$r_k = (0, \ldots, 0, r_{k,k-2}, r_{k,k-1}, r_{k,k})^T . \qquad (8.33)$$

Using these Givens transformations $G_j$ the orthogonal transformations $V_j$, $j \ge 1$, are defined
as follows:

$$V_1 := G_1 , \qquad V_j := G_j \begin{pmatrix} V_{j-1} & \\ & 1 \end{pmatrix} , \quad j \ge 2 .$$

One easily checks, using induction, that

$$V_k T_k = \begin{pmatrix} R_k \\ 0 \end{pmatrix} , \qquad R_k \text{ the upper triangular matrix with columns } r_1, r_2, \ldots, r_k$$

(each $r_j$ extended by zeros to length k), with $r_j$ as in (8.30), (8.31), (8.32).
For $b_k = V_k e_1 =: (\hat{b}_k, b_{k,k+1})^T$ we have the recursion

$$b_1 = G_1 e_1 , \qquad b_j = \begin{pmatrix} \hat{b}_{j-1} \\ c_j\, b_{j-1,j} \\ -s_j\, b_{j-1,j} \end{pmatrix} , \quad j \ge 2 . \qquad (8.34)$$

(Notation: $b_{j-1,j}$ is the j-th entry of $b_{j-1}$.) We now derive a simple recursion for the vector $x^k$
in (8.27b). Define the matrix $P_k = (p^1 \ldots p^k) := Q_k R_k^{-1}$ with columns $p^j$ ($1 \le j \le k$). From
$P_k R_k = Q_k$ and the nonzero structure of the columns of $R_k$ (cf. (8.33)) it follows that

$$p^1 = q^1/r_{1,1} , \qquad p^2 = (q^2 - r_{2,1}\, p^1)/r_{2,2} ,$$
$$p^j = (q^j - r_{j,j-2}\, p^{j-2} - r_{j,j-1}\, p^{j-1})/r_{j,j} , \quad j \ge 3 . \qquad (8.35)$$

Note that using (8.34) we can rewrite (8.27b) as

$$x^k = x^0 - \|r^0\| Q_k R_k^{-1} \hat{b}_k = x^0 - \|r^0\| P_k \hat{b}_k = x^0 - \|r^0\| P_{k-1} \hat{b}_{k-1} - \|r^0\| b_{k,k}\, p^k = x^{k-1} - \|r^0\| b_{k,k}\, p^k . \qquad (8.36)$$

This leads to the following method:

MINRES algorithm.
Given $x^0$, compute $r^0 = Ax^0 - b$. For k = 1, 2, ...:
    Compute $q^k$, $\alpha_k$, $\beta_k$ using the Lanczos method.
    Compute $r_k$ using (8.30), (8.31) or (8.32) (note (8.33)).
    Compute $p^k$ using (8.35).
    Compute $b_{k,k}$ using (8.34).
    Compute update: $x^k = x^{k-1} - \|r^0\| b_{k,k}\, p^k$.

Note that in each iteration of this method we need only one matrix-vector multiplication and a
few relatively cheap operations, like scalar products and vector additions.

Remark 8.3.1 If for a given symmetric regular matrix A and given starting residual $r^0$ assump-
tion 8.2.1 does not hold, then there exists a minimal $k_0 \le n$ such that $AK_{k_0}(A; r^0) = K_{k_0}(A; r^0)$.
In the Lanczos method we then obtain (using exact arithmetic) $\beta_{k_0} = 0$ and thus the itera-
tion stops for $k = k_0$. It can be shown that $x^{k_0}$ computed in the MINRES algorithm satisfies
$Ax^{k_0} = b$ and thus we have solved the linear system. □

We now derive the preconditioned MINRES algorithm. For this we assume a given symmetric
positive definite matrix M. Let L be such that $M = LL^T$. We consider the preconditioned
system

$$L^{-1}AL^{-T} z = L^{-1}b , \qquad z = L^T x .$$

Note that $\tilde{A} := L^{-1}AL^{-T}$ is symmetric. For given $x^0 \in \mathbb{R}^n$ we have $z^0 = L^T x^0$ and the starting
residual of the preconditioned problem satisfies $\tilde{A}z^0 - L^{-1}b = L^{-1}r^0$. We apply the Lanczos
method to construct an orthogonal basis $q^1, \ldots, q^k$ of the space $K_k(\tilde{A}; L^{-1}r^0)$. We want to avoid
computations with the matrices L and $L^T$. This can be achieved if we reformulate the algorithm
using the transformations

$$t^j := Lq^j , \qquad \tilde{t}^j := L\tilde{q}^j , \qquad w^j := L^{-T}q^j = M^{-1}t^j .$$
Using these definitions we obtain an equivalent formulation of the algorithm (8.4) applied to $\tilde{A}$
with $r = L^{-1}r^0$, which is called the preconditioned Lanczos method:

$$\begin{cases}
t^0 := 0;\; w^0 := M^{-1}r^0;\; \|r\| := \langle w^0, r^0\rangle^{1/2} ;\\[1mm]
t^1 := r^0/\|r\|;\; w^1 := w^0/\|r\|;\; \beta_0 := 0;\\[1mm]
\text{for } j \ge 1:\\[1mm]
\quad t^{j+1} := Aw^j - \beta_{j-1} t^{j-1} ,\\[1mm]
\quad \alpha_j := \langle t^{j+1}, w^j\rangle , \qquad (8.37)\\[1mm]
\quad t^{j+1} := t^{j+1} - \alpha_j t^j ,\\[1mm]
\quad w^{j+1} := M^{-1}t^{j+1} ;\quad \beta_j := \langle w^{j+1}, t^{j+1}\rangle^{1/2} ,\\[1mm]
\quad t^{j+1} := t^{j+1}/\beta_j ,\\[1mm]
\quad w^{j+1} := w^{j+1}/\beta_j .
\end{cases}$$
Note that for M = I we obtain the algorithm (8.4) and that in each iteration a system with the
matrix M must be solved. As a consequence of theorem 8.2.2 we get:

Theorem 8.3.2 The set $w^1, w^2, \ldots, w^k$ defined in algorithm (8.37) is orthogonal with respect to
$\langle\cdot,\cdot\rangle_M$ and forms a basis of the Krylov subspace $K_k(M^{-1}A; M^{-1}r^0)$ ($k \le n$).

Proof. From theorem 8.2.2 and the definition of $w^j$ it follows that $(L^T w^j)_{1\le j\le k}$ forms
an orthogonal basis of $K_k(L^{-1}AL^{-T}; L^{-1}r^0)$ with respect to the Euclidean scalar product.
Note that $\langle L^T w^j, L^T w^i\rangle = 0$ iff $\langle w^j, w^i\rangle_M = 0$, and $L^T w^j \in K_k(L^{-1}AL^{-T}; L^{-1}r^0)$ iff
$w^j \in K_k(M^{-1}A; M^{-1}r^0)$. □

Define $W_j := [w^1\; w^2\; \ldots\; w^j] \in \mathbb{R}^{n\times j}$. From theorem 8.3.2 it follows that $W_k^T M W_k = I_k$.
From (8.23) we obtain, using $L^T W_k = Q_k$, that $L^{-1}AL^{-T}\, L^T W_k = L^T W_{k+1} T_k$ holds and thus

$$M^{-1}AW_k = W_{k+1} T_k . \qquad (8.38)$$

The matrix $M^{-1}A$ is symmetric with respect to $\langle\cdot,\cdot\rangle_M$. Instead of (8.25) we consider the
following minimization problem:

$$\begin{cases} \text{Given } x^0 \in \mathbb{R}^n, \text{ compute } x^k \in x^0 + K_k(M^{-1}A; M^{-1}r^0) \text{ such that}\\ \|M^{-1}Ax^k - M^{-1}b\|_M = \min\{\, \|M^{-1}Ax - M^{-1}b\|_M \mid x \in x^0 + K_k(M^{-1}A; M^{-1}r^0) \,\} \end{cases} \qquad (8.39)$$

with $r^0 = Ax^0 - b$. Using arguments as in (8.26) it follows that the solution of the minimization
problem (8.39) can be obtained from

$$\|T_k y^k - \|r\| e_1\| = \min\{\, \|T_k y - \|r\| e_1\| \mid y \in \mathbb{R}^k \,\} , \qquad (8.40a)$$
$$x^k = x^0 - W_k y^k , \qquad (8.40b)$$

with $\|r\| = \langle M^{-1}r^0, r^0\rangle^{1/2}$. This problem can be solved using Givens rotations along the same
lines as for the unpreconditioned case. Thus we get the following:
Preconditioned MINRES algorithm.
Given $x^0$, compute $r^0 = Ax^0 - b$, $w^0 = M^{-1}r^0$, $\|r\| = \langle w^0, r^0\rangle^{1/2}$.
For k = 1, 2, ...:
    Compute $w^k$, $\alpha_k$, $\beta_k$ using the preconditioned Lanczos method (8.37).
    Compute $r_k$ using (8.30), (8.31) or (8.32) (note (8.33)).
    Compute $p^k$ using (8.35) with $q^k$ replaced by $w^k$.
    Compute $b_{k,k}$ using (8.34).
    Compute update: $x^k = x^{k-1} - \|r\| b_{k,k}\, p^k$.

The minimization property (8.39) yields a convergence result for the preconditioned MINRES
method:

Theorem 8.3.3 Let $A \in \mathbb{R}^{n\times n}$ be symmetric and $M \in \mathbb{R}^{n\times n}$ symmetric positive definite. For
$x^k$, $k \ge 0$, computed in the preconditioned MINRES algorithm we define $r^k = M^{-1}(Ax^k - b)$.
The following holds:

$$\|r^k\|_M = \min_{p_k \in P_k;\, p_k(0)=1} \|p_k(M^{-1}A)r^0\|_M \le \Big( \min_{p_k \in P_k;\, p_k(0)=1}\; \max_{\lambda \in \sigma(M^{-1}A)} |p_k(\lambda)| \Big)\, \|r^0\|_M . \qquad (8.41)$$

Proof. The equality result follows from

$$\|r^k\|_M = \min_{p_{k-1} \in P_{k-1}} \big\| M^{-1}A\big( x^0 - p_{k-1}(M^{-1}A)r^0 \big) - M^{-1}b \big\|_M$$
$$= \min_{p_{k-1} \in P_{k-1}} \|r^0 - M^{-1}A\, p_{k-1}(M^{-1}A)r^0\|_M$$
$$= \min_{p_k \in P_k;\, p_k(0)=1} \|p_k(M^{-1}A)r^0\|_M .$$

Note that $M^{-1}A$ is symmetric with respect to $\langle\cdot,\cdot\rangle_M$, and thus

$$\|p_k(M^{-1}A)r^0\|_M \le \|p_k(M^{-1}A)\|_M \|r^0\|_M = \max_{\lambda \in \sigma(M^{-1}A)} |p_k(\lambda)|\, \|r^0\|_M$$

holds. □

From this result it follows that bounds on the reduction of the (preconditioned) residual can be
obtained if one assumes information on the spectrum of $M^{-1}A$. We present two results that are
well-known in the literature. Proofs, which are based on approximation properties of Chebyshev
polynomials, are given in, for example, [42].

Theorem 8.3.4 Let A, M and $r^k$ be as in theorem 8.3.3. Assume that all eigenvalues of $M^{-1}A$
are positive. Then

$$\frac{\|r^k\|_M}{\|r^0\|_M} \le 2\Big( 1 - 2\big/\big(\sqrt{\kappa(M^{-1}A)} + 1\big) \Big)^{k} , \qquad k = 0, 1, \ldots$$

holds.

We note that in this bound the dependence on the condition number $\kappa(M^{-1}A)$ is the same as
in well-known bounds for the preconditioned CG method.

Theorem 8.3.5 Let A, M and $r^k$ be as in theorem 8.3.3. Assume that $\sigma(M^{-1}A) \subset [a, b] \cup [c, d]$
with $a < b < 0 < c < d$ and $b - a = d - c$. Then

$$\frac{\|r^k\|_M}{\|r^0\|_M} \le 2\Big( 1 - 2\Big/\Big(\sqrt{\tfrac{ad}{bc}} + 1\Big) \Big)^{[k/2]} , \qquad k = 0, 1, \ldots \qquad (8.42)$$

holds.

In the special case $a = -d$, $b = -c$ the reduction factor in (8.42) takes the form $1 - 2/(\kappa(M^{-1}A) + 1)$. Note that here the dependence on $\kappa(M^{-1}A)$ is different from the positive definite case in
theorem 8.3.4.

8.4 GMRES type of methods


In this (and the following) section we do not assume that A is symmetric. We only assume
that A is regular. In GMRES (Generalized Minimal Residual) type of methods one first
computes an orthogonal basis of the Krylov subspace and then, using this basis, one determines
the xk satisfying the minimal residual criterion in (8.25). It can be shown (cf. Faber and
Manteuffel [33]) that only for a very small class of nonsymmetric matrices it is possible to
compute this xk satisfying the minimal residual criterion with low computational costs. This
is related to the fact that in general for a nonsymmetric matrix we do not have a method for
computing an orthogonal basis of the Krylov subspace with low computational costs (cf. the
Lanczos algorithm for the symmetric case).
In GMRES the so-called Arnoldi algorithm, introduced in Arnoldi [5], is used for computing
an orthogonal basis of the Krylov subspace:

$$\begin{cases}
q^1 := r^0/\|r^0\| ;\\[1mm]
\text{for } j \ge 1:\\[1mm]
\quad q^{j+1} := Aq^j ,\\[1mm]
\quad \text{for } i = 1, \ldots, j: \qquad (8.43)\\[1mm]
\qquad h_{ij} := \langle q^{j+1}, q^i\rangle ,\\[1mm]
\qquad q^{j+1} := q^{j+1} - h_{ij} q^i ,\\[1mm]
\quad h_{j+1,j} := \|q^{j+1}\| ,\\[1mm]
\quad q^{j+1} := q^{j+1}/h_{j+1,j} .
\end{cases}$$

When we put the coefficients $h_{ij}$ ($1 \le i \le j+1 \le k+1$) in a matrix denoted by $H_k$ we obtain:

$$H_k = \begin{pmatrix}
h_{11} & h_{12} & \cdots & \cdots & h_{1k}\\
h_{21} & h_{22} & h_{23} & \cdots & h_{2k}\\
 & \ddots & \ddots & & \vdots\\
 & & \ddots & \ddots & h_{k-1,k}\\
 & & & h_{k,k-1} & h_{k,k}\\
 & & & & h_{k+1,k}
\end{pmatrix} \qquad (8.44)$$

This is a (k+1) × k matrix of upper Hessenberg form. We also use the notation

$$Q_j := [q^1\; q^2\; \ldots\; q^j] \qquad (n \times j \text{ matrix with columns } q^i) .$$

186
Using this notation, the Arnoldi algorithm results in

AQk = Qk+1 Hk . (8.45)

The result in (8.45) is similar to the result in (8.23). However, note that the matrix Hk in (8.45)
contains significantly more nonzero elements than the tridiagonal matrix Tk in (8.23).
Using induction it can be shown that q1 , q2 , ..., qk forms an orthogonal basis of the Krylov
subspace Kk (A; r0 ). As in the derivation of (8.27a),(8.27b) for the MINRES method, using the
fact that we have an orthogonal basis, we obtain that the xk that satisfies the minimal residual
criterion (8.25) can be characterized by the least squares problem:

kHk yk kr0 ke1 k = min{kHk y kr0 ke1 k | y Rk } (8.46)


xk = x0 Qk yk . (8.47)

The GMRES algorithm has the following structure :

1. Start : choose x0 ; r0 := Ax0 b; q1 := r0 /kr0 k







2. Arnoldi method (8.43) for the computation of an orthogonal



basis q1 , q2 , . . . , qk of Kk (A; r0 )



(8.48)


3. Solve a least squares problem :
yk such that kHk yk kr0 ke1 k = min{kHk y kr0 ke1 k | y Rk }.





xk := x0 Qk yk

The GMRES method is introduced in Saad and Schultz [80]. For a detailed discussion of
implementation aspects of the GMRES method we refer to that paper. In [80] it is shown that
using similar techniques as in the derivation of the MINRES method the least squares problem
in step 3 in (8.48) can be solved with low computational costs. However, step 2 in (8.48) is
expensive, both with respect to memory and arithmetic work. This is due to the fact that in
the kth iteration we need computations involving q1 , q2 , ..., qk1 to determine qk . To avoid
computations involving all the previous basis vectors, the GMRES method with restart is often
used in practice. In GMRES(m) we apply m iterations of the GMRES method as in (8.48),
then we define x0 := xm and again apply m iterations of the GMRES method with this new
starting vector, etc.. Note that for k > m the iterands xk do not fulfill the minimal residual
criterion (8.25). In Saad and Schultz [80] it is shown that (in exact arithmetic) the GMRES
method cannot break down and that (as in CG) the exact solution is obtained in at most n
iterations. The minimal residual criterion implies that in GMRES the residual is reduced in
every iteration . These nice properties of GMRES do not hold for the GMRES(m) algoritm; a
well known difficulty with the GMRES(m) method is that it can stagnate.

Example 8.4.1 (Convection-diffusion problem) We apply the GMRES(m) method to the


discrete convection-diffusion problem of section 6.6 with b1 = cos( 6 ), b2 = sin( 6 ) and for several
values of , m, h. For the starting vector we take x0 = 0. In Table 8.2 we show the number of
iterations needed to reduce the Euclidean norm of the starting residual with a factor 103 .
From these results we see that for this model problem the number of iterations increases
significantly when h is decreased. Also we observe a certain robustness with respect to variation
in . Based on the results in Table 8.2 we obtain that for m small (i.e. 1-20) the GMRES(m)
method is more efficient than for m large (i.e. 20). 

187
m: 10 20 40 80
1
= 101 h = 32 97 72 61 56
1
= 102 h = 32 68 75 80 59
1
= 104 h = 32 61 59 60 59
1
= 101 h = 64 270 191 146 147
1
= 102 h = 64 127 134 150 160
1
= 104 h = 64 119 114 114 114

Table 8.2: # iterations for GMRES(m).

There are other methods which are of GMRES type, in the sense that these methods (in exact
arithmetic) yield iterands defined by the minimal residual criterion (8.25). These methods differ
in the approach that is used for computing the minimal residual iterand. Examples of GMRES
type of methods are the Generalized Conjugate Residual method (GCR) and Orthodir. These
variants of GMRES seem to be less popular because for many problems they are at least as
expense as GMRES and numerically less stable. For a further discussion and comparison of
GMRES type methods we refer to Saad and Schultz [79], Barrett et al. [10] and Freund et
al. [36].

8.5 Bi-CG type of methods


The GMRES method is expensive (both with respect to memory and arithmetic work) due to
the fact that for the computation of an orthogonal basis of the Krylov subspace we need long
recursions (cf. Arnoldi method (8.43)). In this respect there is an essential difference with the
symmetric case, because then we can use short recursions for the computation of an orthogo-
nal basis (cf. (8.4)). Also note that the implementation of GMRES (using Givens rotations to
solve the least squares problem) is rather complicated, compared to the implementation of the
CG method.

The Bi-CG method which we discuss below is based on a generalized Lanczos method that
is used for computing a reasonable basis of the Krylov subspace. This generalized Lanczos
method uses short recursions (as in the Lanczos method), but the resulting basis will in general
not be orthogonal. The implementation of the Bi-CG method is as simple as the implementation
of the CG method.

The Bi-CG method is based on the bi-Lanczos (also called nonsymmetric Lanczos) method:
0

v := v0 := 0; v1 := v1 := r0 /kr0 k2 ; 0 = 0 = 0;





For j 1 :

:= hAvj , vj i

j
j+1 := Avj vj j1 ; (8.49)
w j j1 v
j+1 := A v j v j1 vj1 ;
T j j

w



j := kwj+1 k, vj+1 := wj+1 /j ,





j := hvj+1 , wj+1 i, wj+1 := wj+1 /j .

188
If A = AT holds, then the two recursions in (8.49) are the same and the bi-Lanczos method re-
duces to the Lanczos method in (8.4). In the bi-Lanczos method it can happen that hvj+1 , wj+1 i =
0 , even if vj+1 6= 0 and wj+1 6= 0. In that case the algorithm is not executable anymore and
this is called a (serious) breakdown. Using induction we obtain that for the two sequences of
vectors generated by Bi-CG the following properties hold:

span{v1 , v2 , . . . , vj } = Kj (A; r0 ) (8.50)


1 2 j j T 0
span{v , v , . . . , v } = K (A ; r ) (8.51)
hvi , vj i = 0 if i 6= j , hvi , vi i = 1 . (8.52)

Based on (8.52) we call vi and vj (i 6= j) bi-orthogonal. In general the vj (j = 1, 2, ...) will not
be orthogonal. Using the notation

Vj := [v1 . . . vj ], Vj := [v1 . . . vj ]

we obtain the identities



1 1
..
1
2 .

AVk = Vk
.. .. .. + k vk+1 0 . . . 0 1

. . .
.. ..
. . k1


k1 k
=: Vk Tk + k vk+1 eTk (8.53)

and
VkT Vk = Ik , VkT vk+1 = 0 . (8.54)
In Bi-CG we do not use a minimal residual criterion as in (8.25) but the following criterion
based on an orthogonality condition:

Determine xk x0 + Kk (A; r0 ) such that Axk b Kk (AT ; r0 ) . (8.55)

The existence of an xk satisfying the criterion in (8.55) is not guaranteed ! If the criterion (8.55)
cannot be fulfilled, the Bi-CG algorithm in (8.58) below will break down. For the case that A
is symmetric positive definite, the criteria in (8.55) and in (8.3) are equivalent and (in exact
arithmetic) the Bi-CG algorithm will yield the same iterands as the CG algorithm.

Using (8.50)-(8.52) we see that the Bi-CG iterand xk , characterized in (8.55), satisfies

VkT (AVk yk r0 ) = 0 ; xk = x0 + Vk yk (yk Rk ).

Due to the relations in (8.53), (8.54) this yields the following characterization:

Tk yk = kr0 k2 e1 (8.56)
k 0 k
x = x + Vk y . (8.57)

Note that this is very similar to the characterization of the CG iterand in (8.10a), (8.10b).
However, in (8.56) the tridiagonal matrix Tk need not be symmetric positive definite and Vk in
general will not be orthogonal. Using an LU-decomposition of the tridiagonal matrix Tk we can

189
compute yk , provided Tk is nonsingular, and then determine xk . An efficient implementation
of this approach can be derived along the same lines as for the Lanczos iterative method in
section 8.2. This then results in the Bi-CG algorithm, introduced in Lanczos [59] (cf. also
Fletcher [35]):

starting vector x0 ; p0 = p0 = r0 = r0 = b Ax0 ; 0 := kr0 k2






For k 0 :






k := hApk , pk i; k := k /k ;




xk+1 := xk + pk

k
k+1 := rk Apk (8.58)
r k
rk+1 := rk k AT pk




k+1 := hrk+1 , rk+1 i; k+1 := k+1 /k ;




pk+1 := rk+1 + k+1 pk




pk+1 := rk+1 + k+1 pk

Note: here and in the remainder of this section the residual is defined by rk = b Axk (instead
of Axk b).
The Bi-CG algorithm is simple and has low computational costs per iteration (compared to
GMRES type methods). A disadvantage is that a breakdown can occur (k = 0 or k = 0).
A near breakdown will result in numerical instabilities. To avoid these (near) breakdowns
variants of Bi-CG have been developed that use so-called look-ahead Lanczos algorithms for
computing a basis of the Krylov subspace. Also the criterion in (8.55) can be replaced by
another criterion to avoid a breakdown caused by the fact that the Bi-CG iterand as in (8.55)
does not exist. The combination of a look-ahead Lanczos approach and a criterion based on
minimization of a quasi-residual is the basis of the QMR (Quasi Minimal Residual) method.
For a discussion of the look-ahead Lanczos approach and QMR we refer to Freund et al. [36].
For the Bi-CG method there are only very few theoretical convergence results. A variant
of Bi-CG is analyzed in Bank and Chan [8]. A disadvantage of the Bi-CG method is that we
need a multiplication by AT which is often not easily available. Below we discuss variants of
the Bi-CG method which only use multiplications with the matrix A (two per iteration). For
many problems these methods have a higher rate of convergence than the Bi-CG method.

We introduce the BiCGSTAB method (from Van der Vorst [91])and the CGS (Conjugate
Gradients Squared) method (from Sonneveld [86]). These methods are derived from the Bi-CG
method. We assume that the Bi-CG method does not break down.
We first reformulate the Bi-CG method using a notation based on matrix polynomials. With
Tk , Pk Pk defined by

T0 (x) = 1, P0 (x) = 1 ,
Pk (x) = Pk1 (x) k1 xTk1 (x) , k 1,
Tk (x) = Pk (x) + k Tk1 (x) , k 1,

with k , k as in (8.58), we have for the search directions pk and the residuals rk resulting from
the Bi-CG method:

rk = Pk (A)r0 , (8.59)
0
pk = Tk (A)r . (8.60)

190
Results as in (8.59), (8.60) also hold for rk and pk with A replaced by AT and r0 replaced by
r0 . For the sequences of residuals and search directions, generated by the Bi-CG method, we
define related transformed sequences:

rk :=Qk (A)rk , (8.61)


k k
p :=Qk (A)p , (8.62)
with Qk Pk .

Note that corresponding to a given residual rk there corresponds an iterand

xk := A1 (b rk ). (8.63)

In the BiCGSTAB method and the CGS method we compute the iterands xk corresponding to
a suitable polynomial Qk . These polynomials are chosen in such a way that the xk can be
computed with simple (i.e short) recursions involving rk , pk and A. The costs per iteration
of these algorithms will be roughly the same as the costs per iteration in Bi-CG. An important
advantage is that we do not need AT . Clearly, from an efficiency point of view it is favourable
to have a polynomial Qk such that krk k = kQk (A)rk k krk k holds. For obtaining a hybrid
Bi-CG method one can try to find a polynomial Qk such that for the corresponding transformed
quantities we have low costs per iteration (short recursions) and a (much) smaller transformed
residual. The first example of such a polynomial is due to Sonneveld [86]. He proposes:

Qk (x) = Pk (x), (8.64)

with Pk the Bi-CG polynomial. The iterands xk corresponding to this Qk are computed in the
CGS method (cf. (8.75)). Another choice is proposed in Van der Vorst [91]:

Qk (x) = (1 k1 x)(1 k2 x) (1 0 x). (8.65)

The choice of the parameters j is discussed below (cf. (8.73)). The iterands xk corresponding
to this Qk are computed in the BiCGSTAB algorithm.
We now show how the BiCGSTAB algorithm can be derived. First note that for the
BiCGSTAB polynomial we have

Qk+1 (A) = (I k A)Qk (A).

From the Bi-CG algortihm and the definition of pk we obtain

pk+1 = Qk+1 (A)pk+1 = Qk+1 (A)rk+1 + k+1 (I k A)Qk (A)pk


= rk+1 + k+1 (pk k Apk ). (8.66)

Similarly, for the transformed residuals we obtain the recursion

rk+1 = (rk k Apk ) k A(rk k Apk ). (8.67)

For the iterands related to these transformed residuals we have

xk+1 xk = A1 (rk rk+1 ) = k pk + k (rk k Apk ),

and thus we have the recursion

xk+1 = xk + k pk + k (rk k Apk ). (8.68)

191
Note that in (8.66), (8.67) and (8.68) we have simple recursions in which the scalars k and k
defined in the Bi-CG algorithm are used. We now show that for these scalars one can derive
other, more feasible, formulas. We consider

k = hrk , rk i

The coefficient for the highest order term of the Bi-CG polynomial Pk is equal to (1)k 0 1 k1 .
So we have
rk = Pk (AT )r0 = (1)k 0 1 k1 (AT )k r0 + wk ,
with wk Kk (AT ; r0 ). Using this in the definition of k and the orthogonality condition for rk
in (8.55) we obtain the relation

k = (1)k 0 1 k1 h(AT )k r0 , rk i. (8.69)

We now define the quantity


k := hr0 , rk i
in which we use the transformed residual rk . The coefficient for the highest order term of the
BiCGSTAB polynomial Qk is equal to (1)k 0 1 k1 . So we have

Qk (AT )r0 = (1)k 0 1 k1 (AT )k r0 + wk ,

with wk Kk (AT ; r0 ). Using this in the definition of k and the orthogonality condition for rk
in (8.55) we obtain the relation

k = hr0 , Qk (A)rk i = hQk (AT )r0 , rk i


= (1)k 0 1 k1 h(AT )k r0 , rk i. (8.70)

The results in (8.69),(8.70) yield the following formula for k+1

k+1 h(AT )k+1 r0 , rk+1 i


k+1 = = k
k h(AT )k r0 , rk i
k+1 k
= ( )( ). (8.71)
k k

Similarly, for the scalar k defined in the Bi-CG algorithm, the formula

k
k = (8.72)
hApk , r0 i

can be derived. We finally discuss the choice of the parameters j in the BiCGSTAB polynomial.
We use the notation
1
rk+ 2 := rk k Apk .
The recursion for the transformed residuals can be rewritten as
1 1
rk+1 = rk+ 2 k Ark+ 2 .

The k is now defined by a standard line search:


1 1 1 1
krk+ 2 k Ark+ 2 k = min krk+ 2 Ark+ 2 k.

192
This results in 1 1
hArk+ 2 , rk+ 2 i
k = 1 1 . (8.73)
hArk+ 2 , Ark+ 2 i
Using the recursion in (8.66), (8.67), (8.68) , the formulas for the scalars in (8.71), (8.72) and
the choice for k as in (8.73) we obtain the following Bi-CGSTAB algorithm, where for ease of
notation we dropped the notation for the transformed variables.
starting vector x0 ; r0 = b Ax0 ; choose r0 ( e.g. = r0 )




p1 = c1 = 0. 1 = 0, 1 = 1 = 1,










for k 0 :

k 0
k = hr , r i, k = (k1 /k1 )(k /k1 ),



pk = k pk1 + rk k k1 ck1 , (8.74)

c k = Apk ,

k = hck , r0 i, k = k /k ,




k+ 21 k ck , ck+ 21 = Ark+ 12 ,



r = r k
k+ 21 k+ 21 1 1

= hc , r i/hck+ 2 , ck+ 2 i,



k
1 1 1
xk+1 = xk + k pk + k rk+ 2 , rk+1 = rk+ 2 k ck+ 2 .

This Bi-CGSTAB method is introduced in Van der Vorst [91] as a variant of Bi-CG and of
the CGS method. Variants of the Bi-CGSTAB method, denoted by Bi-CGSTAB(), are dis-
cussed in Sleijpen and Fokkema [84].

In the CGS method we take Qk (x) = Pk (x), with Pk the Bi-CG polynomial. This results
in transformed residuals satisfying the relation
rk = (Pk (A))2 r0 .
This explains the name Conjugate Gradients Squared. Along the same lines as above for the
Bi-CGSTAB method one can derive the following CGS algorithm from the Bi-CG algorithm (cf.
Sonneveld [86]):
starting vector x0 ; r0 := b Ax0 ; q0 := q1 := 0; 1 := 1;




For k 0 :






k := hr0 ; rk i; k := k /k1 ;




k k k
w := r + k q


q := w + k (qk + k qk1 )
k k
(8.75)

v k := Aqk

k := hr0 , vk i; k := k /k ;





qk+1 := wk k vk



rk+1 := rk k A(wk + qk+1 )




xk+1 := xk + k (wk + qk+1 )

Note that both in the Bi-CGSTAB method and in the CGS method we have relatively low costs
per iteration (two matrix vector products and a few inner products) and that we do not need
AT . The fact that in the CGS polynomial we use the square of the Bi-CG polynomial often
results in a rather irregular convergence behaviour (cf. Van der Vorst [91]). The Bi-CGSTAB
polynomial (8.65),(8.73) is chosen such that the resulting method in general has a less irregular
convergence behaviour than the CGS method.

193
Example 8.5.1 (Convection-diffusion problem) In Figure 8.3 we show the results of the
CGS method and the Bi-CGSTAB method applied to the convection-diffusion equation as in
1
example 8.4.1, with = 102 , h = 32 . As a measure for the error reduction we use

10
log(kd ) := 10 log(||Axk b||2 /||Ax0 b||2 ) = 10 log(||Axk b||2 /||b||2 ).

In figure 8.3 we show the values of 10 log( d ) for k = 0, 1, 2, , 60.


k

12

60
0

CGS

-12 Bi-CGSTAB

Figure 8.3: Convergence behaviour of CGS and of Bi-CGSTAB

Note that for both methods we first have a growth of the norm of the defect (in CGS even up to
1012 !). In this example we observe that indeed the Bi-CGSTAB has a smoother convergence
behaviour than the CGS method. 

We finally note that for all these nonsymmetric Krylov subspace methods the use of a suitable
preconditioner is of great importance for the efficiency of the methods. There is only very little
analysis in this field, and in general the choice of the preconditioner is based on trial and error.
Often variants of the ILU factorization are used as a preconditioner.
The preconditioned Bi-CGSTAB algorithm, with preconditioner W, is as follows (cf. Sleijpen

194
and Van der Vorst [85]):

starting vector x0 ; r0 = b Ax0 ; choose r0 ( e.g. = r0 )






p1 = c1 = 0. 1 = 0, 1 = 1 = 1,






for k 0 :






k = hrk , r0 i, k = (k1 /k1 )(k /k1 ),



1 1
solve pk+ 2 from Wpk+ 2 = rk k k1 ck1 ,




1

pk = pk1 + pk+ 2 ,

k
k = Apk , (8.76)
c
k = hck , r0 i, k = k /k ,




1
rk+ 2 = rk k ck ,




1 1 1

solve yk+ 2 from Wyk+ 2 = rk+ 2 ,



1 1
ck+ 2 = Ayk+ 2 ,




1 1 1 1
k = hck+ 2 , rk+ 2 i/hck+ 2 , ck+ 2 i,




1 1 1
xk+1 = xk + k pk + k yk+ 2 , rk+1 = rk+ 2 k ck+ 2 .

195
196
Chapter 9

Multigrid methods

9.1 Introduction
In this chapter we treat multigrid methods (MGM) for solving discrete scalar elliptic boundary
value problems. We first briefly discuss a few important differences between multigrid methods
and the iterative methods treated in the preceding chapters .

The basic iterative methods and the Krylov subspace methods use the matrix A and the right-
hand side b which are the result of a discretization method. The fact that these data correspond
to a certain underlying continuous boundary value problem is not used in the iterative method.
However, the relation between the data (A and b) and the underlying problem can be useful for
the development of a fast iterative solver. Due to the fact that A results from a discretization
procedure we know, for example, that there are other matrices which, in a certain natural sense,
are similar to the matrix A. These matrices result from the discretization of the underlying
continuous boundary value problem on other grids than the grid corresponding to the given
discrete problem Ax = b. The use of discretizations of the given continuous problem on several
grids with different mesh sizes plays an important role in multigrid methods.

We will see that for a large class of discrete elliptic boundary value problems multigrid methods
have a significantly higher rate of convergence than the methods treated in the preceding chap-
ters. Often multigrid methods even have optimal complexity.

Due to the fact that in multigrid methods discrete problems on different grids are needed,
the implementation of multigrid methods is in general (much) more involved than the imple-
mentation of, for example, Krylov subspace methods. We also note that for multigrid methods it
is relatively hard to develop black box solvers which are applicable to a wide class of problems.

In section 9.2 we explain the main ideas of the MGM using a simple one dimensional prob-
lem. In section 9.3 we introduce multigrid methods for discrete scalar elliptic boundary value
problems. In section 9.4 we present a convergence analysis of these multigrid methods. Opposite
to the basic iterative and Krylov subspace methods, in the convergence analysis we will need the
underlying continuous problem. The standard multigrid method discussed in the sections 9.2-9.4
is efficient only for diffusion-dominated elliptic problems. In section 9.5 we consider modifica-
tions of standard multigrid methods which are used for convection-dominated problems. In
section 9.6 we discuss the principle of nested iteration. In this approach we use computations on
relatively coarse grids to obtain a good starting vector for an iterative method (not necessarily

197
a multigrid method). In section 9.7 we show some results of numerical experiments. In sec-
tion 9.8 we discuss so-called algebraic multigrid methods. In these methods, as in basic iterative
methods and Krylov subspace methods, we only use the given matrix and righthand side, but
no information on an underlying grid structure. Finally, in section 9.9 we consider multigrid
techniques which can be applied directly to nonlinear elliptic boundary value problems without
using a linearization technique.

For a thorough treatment of multigrid methods we refer to the monograph of Hackbusch [44].
For an introduction to multigrid methods requiring less knowledge of mathematics, we refer to
Wesseling [96], Briggs [23], Trottenberg et al. [69]. A theoretical analysis of multigrid methods
is presented in [19].

9.2 Multigrid for a one-dimensional model problem


In this section we consider a simple model situation to show the basic principle behind the
multigrid approach. We consider the two-point boundary value model problem
u (x) = f (x),

x := (0, 1)
(9.1)
u(0) = u(1) = 0 .
The variational formulation of this problem is: find u H01 () such that
Z 1 Z
u v dx = f v dx for all v H01 ()

0

For the discretization we introduce a sequence of nested uniform grids. For = 0, 1, 2, . . . , we


define
h = 21 (mesh size) , (9.2)
n = h1
1 (number of interior grid points) , (9.3)
,i = ih , i = 0, 1, ..., n + 1 (grid points) , (9.4)
int
= {,i | 1 i n } (interior grid) , (9.5)
Th = { [,i , ,i+1 ] | 0 i n } (triangulation) (9.6)
The space of linear finite elements corresponding to the triangulation Th is given by
X1h ,0 = { v C() | v|[,i ,,i+1] P1 , i = 0, . . . , n , v(0) = v(1) = 0 }
The standard nodal basis in this space is denoted by (i )1in . This basis induces an isomor-
phism
n
X
n 1
P : R Xh ,0 , P x = xi i (9.7)
i=1

The Galerkin discretization in the space X1h ,0yields a linear system


Z 1 Z 1

A x = b , (A )ij = i j dx, (b )i = f i dx (9.8)
0 0

The solution of this discrete problem is denoted by x . The solution of the Galerkin discretization
in the space X1h ,0 is given by u = P x . A simple computation shows that

A = h1
tridiag(1, 2, 1) R
n n

198
Note that, apart from a scaling factor, the same matrix results from a standard discretization
with finite differences of the problem (9.1).
Clearly, in practice one should not solve the problem in (9.8) using an iterative method (a
Cholesky factorization A = LLT is stable and efficient). However, we do apply a basic iterative
method here, to illustrate a certain smoothing property which plays an important role in
multigrid methods. We consider the damped Jacobi method

1
xk+1
= xk h (A xk b ) with (0, 1] . (9.9)
2
The iteration matrix of this method is given by

1
C = C () = I h A .
2
In this simple model problem an orthogonal eigenvector basis of A , and thus of C , too, is
known. This basis is closely related to the Fourier modes:

w (x) = sin(x), x [0, 1], = 1, 2, ... .

Note that w satisfies the boundary conditions in (9.1) and that (w ) (x) = ()2 w (x) holds,
and thus w is an eigenfunction of the problem in (9.1). We introduce vectors z Rn , 1
n , which correspond to the Fourier modes w restricted to the interior grid int :

z := [w (,1 ), w (,2 ), ..., w (,n )]T .

These vectors form an orthogonal basis of Rn . For = 2 we give an illustration in figure 9.1.
To a vector z there corresponds a frequency . If < 12 n holds then the vector z , or the

o x o
x x

x x

x x : z12
x

0 1 : z42
o o o o

o o

Figure 9.1: two discrete Fourier modes.

1
corresponding finite element function P z , is called a low frequency mode, and if 2 n

199
holds then this vector [finite element function] is called a high frequency mode. These vectors
z are eigenvectors of the matrix A :
4
A z = sin2 ( h )z ,
h 2
and thus we have

C z = (1 2 sin2 ( h ))z . (9.10)
2
From this we obtain
kC k2 = max1n |1 2 sin2 ( 2 h )|
(9.11)
= 1 2 sin2 ( 2 h ) = 1 12 2 h2 + O(h4 ) .

From this we see that the damped Jacobi method is convergent, but that the rate of convergence
will be very low for h small (cf. section 6.3).
Note that the eigenvalues and the eigenvectors of C are functions of h [0, 1]:

, := 1 2 sin2 ( h ) =: g (h ) , with (9.12a)
2
2
g (y) = 1 2 sin ( y) (y [0, 1]) . (9.12b)
2
Hence, the size of the eigenvalues , can directly be obtained from the graph of the function
g . In figure 9.2 we show the graph of the function g for a few values of . From the graphs

1
= 3
1
= 2

2
= 3

-1 =1

Figure 9.2: Graph of g .

in this figure we conclude that for a suitable choice of we have |g (y)| 1 if y [ 21 , 1]. We
choose = 23 (then |g ( 21 )| = |g (1)| holds). Then we have |g 2 (y)| 13 for y [ 21 , 1]. Using this
3
and the result in (9.12a) we obtain
1 1
|, | for n .
3 2
Hence:

200
the high frequency modes are strongly damped by the iteration matrix C .

From figure 9.2 it is also clear that the low rate of convergence of the damped Jacobi method is
caused by the low frequency modes (h 1).
Summarizing, we draw the conclusion that in this example the damped Jacobi method will
smooth the error. This elementary observation is of great importance for the two-grid method
introduced below. In the setting of multigrid methods the damped Jacobi method is called a
smoother. The smoothing property of damped Jacobi is illustrated in figure 9.3. It is impor-

0 1 0 1

Graph of the error after one damped


Graph of a starting error. Jacobi iteration ( = 23 ).

Figure 9.3: Smoothing property of damped Jacobi.

tant to note that the discussion above concerning smoothing is related to the iteration matrix
C , which means that the error will be made smoother by the damped Jacobi method, but not
(necessarily) the new iterand xk+1 .

In multigrid methods we have to transform information from one grid to another. For that
purpose we introduce so-called prolongations and restrictions. In a setting with nested finite
element spaces these operators can be defined in a very natural way. Due to the nestedness the
identity operator
I : X1h1 ,0 X1h ,0 , I v = v
is well-defined. This identity operator represents linear interpolation as is illustrated for = 2
in figure 9.4.
The matrix representation of this interpolation operator is given by

x x X1h1,0
0 x 1
x

I2
x ?
x
x
x
x X1h2,0
0 x
x
x 1
x

Figure 9.4: Canonical prolongation.

p : Rn1 Rn , p := P1 P1 (9.13)

201
A simple computation yields
1


2
1
1 1


2 2
1
1

p = (9.14)

2
..
.
1


2
1
1
2 n n1

We can also restrict a given grid function v on int int


to a grid function on 1 . An obvious
approach is to use a restriction r based on simple injection:
(rinj v )() = v () if int
1 .

When used in a multigrid method then often this restriction based on injection is not satisfactory
(cf. Hackbusch [44], section 3.5). A better method is obtained if a natural Galerkin property is
satisfied. It can easily be verified (cf. also lemma 9.3.2) that with A , A1 and p as defined
in (9.8), (9.13) we have
r A p = A1 iff r = pT (9.15)
Thus the natural Galerkin condition r A p = A1 implies the choice
r = pT (9.16)
for the restriction operator.

The two-grid method is based on the idea that a smooth error, which results from the ap-
plication of one or a few damped Jacobi iterations, can be approximated fairly well on a coarser
grid. We now introduce this two-grid method.
Consider A x = b and let x be the result of one or a few damped Jacobi iterations applied
to a given starting vector x0 . For the error e := x x we have
A e = b A x =: d ( residual or defect) (9.17)
Based on the assumption that e is smooth it seems reasonable to make the approximation
e p e1 with an appropriate vector (grid function) e1 Rn1 . To determine the vector
e1 we use the equation (9.17) and the Galerkin property (9.15). This results in the equation
A1 e1 = r d
for the vector e1 . Note that x = x + e x + p e1 . Thus for the new iterand we take
x := x + p e1 . In a more compact formulation this two-grid method is as follows:
procedure TGM (x , b )


if = 0 then x0 := A1

0 b0 else





begin
x := J (x , b ) ( smoothing it., e.g. damped Jacobi )




d1 := r (b A x ) ( restriction of defect ) (9.18)
1
e1 := A1 d1 ( solve coarse grid problem )




x := x + p e1 ( add correction )








TGM := x
end;

202
Often, after the coarse grid correction x := x + p e1 , one or a few smoothing iterations are
applied again. Smoothing before/after the coarse grid correction is called pre/post-smoothing.
Besides the smoothing property a second property which is of great importance for a multigrid
method is the following:

The coarse grid system A1 e1 = d1 is of the same form as the system A x = b .

Thus for solving the problem A1 e1 = d1 approximately we can apply the two-grid algo-
rithm in (9.18) recursively. This results in the following multigrid method for solving A x = b :

procedure MGM (x , b )
if = 0 then x0 := A1 0 b0 else
begin
x := J1 (x , b ) ( presmoothing )
d1 := r (b A x )
(9.19)
e01 := 0; for i = 1 to do ei1 := MGM1 (ei1
1 , d1 );
x := x + p e1

x := J2 (x , b ) ( postsmoothing )
MGM := x
end;
If one wants to solve the system on a given finest grid, say with level number , i.e. A x = b ,
then we apply some iterations of MGM (x , b ).

Based on efficiency considerations (cf. section 9.3) we usually choose = 1 (V -cycle) or


= 2 (W -cycle) in the recursive call in (9.19). For the case = 3 the structure of one multi-
grid iteration with {1, 2} is illustrated in figure 9.5.

=3
B  B 
B  B 
B  B 
=2 B  B : smoothing

A
B  B  A 
B  B  A   : solve exactly
=1 B  B  A 
A A
B  B  A  A  A 
=0 B B A A A
    

=1 =2
Figure 9.5: Structure of one multigrid iteration

9.3 Multigrid for scalar elliptic problems


In this section we introduce multigrid methods which can be used for solving discretized elliptic
boundary value problems. Opposite to the CG method, the applicability of multigrid methods

203
is not restricted to (nearly) symmetric problems. Multigrid methods can also be used for solving
problems which are strongly nonsymmetric (convection dominated). However, for such problems
one usually has to modify the standard multigrid approach. These modifications are discussed
in section 9.5.
We will introduce the two-grid and multigrid method by generalizing the approach of section 9.2
to the higher (i.e., two and three) dimensional case. We consider the finite element discretization
of scalar elliptic boundary value problems as discussed in section 3.4. Thus the continuous
variational problem is of the form
(
find u H01 () such that
(9.20)
k(u, v) = f (v) for all v H01 ()

with a bilinear form and righthand side as in (3.42):


Z Z
k(u, v) = uT Av + b uv + cuv dx , f (v) = f v dx

The coefficients A, b, c are assumed to satisfy the conditions in (3.42). For the discretization of
this problem we use simplicial finite elements. The case with rectangular finite elements can be
treated in a very similar way. Let {Th } be a regular family of triangulations of consisting of n-
simplices and Xkh,0 , k 1, the corresponding finite element space as in (3.16). The presentation
and implementation of the multigrid method is greatly simplified if we assume a given sequence
of nested finite element spaces.
Assumption 9.3.1 In the remainder of this chapter we always assume that we have a sequence
V := Xkh ,0 , = 0, 1, . . ., of simplicial finite element spaces which are nested:

V V+1 for all (9.21)

We note that this assumption is not necessary for a succesful application of multigrid methods.
For a treatment of multigrid methods in case of non-nestedness we refer to [69] (?). The con-
struction of a hierarchy of triangulations such that the corresponding finite element spaces are
nested is discussed in chapter ??.
In V we use the standard nodal basis (i )1in as explained in section 3.5. This basis
induces an isomorphism
Xn
n
P : R V , P x = xi i
i=1
The Galerkin discretization: Find u V such that

k(u , v ) = f (v ) for all v V

can be represented as a linear system

A x = b , with (A )ij = k(j , i ), (b )i = f (i ), 1 i, j n (9.22)

Along the same lines as in the one-dimensional case we introduce a multigrid method for solving
this system of equations on an arbitrary level 0.
For the smoother we use one of the basic iterative methods discussed in section 6.2. For this
method we use the notation

xk+1 = S (xk , b ) = xk M1 k
(A x b) , k = 0, 1, . . .

204
The corresponding iteration matrix is denoted by

S = I M1
A

For the prolongation we use the matrix representation of the identity I : V1 V , i.e.,

p := P1 P1 (9.23)

The choice of the restriction is based on the following elementary lemma:


Lemma 9.3.2 Let A , 0, be the stiffness matrix defined in (9.22) and p as in (9.23). Then
for r : Rn Rn1 we have:

r A p = A1 if and only if r = pT

Proof. For the stiffness matrix matrix the identity

hA x, yi = k(P x, P y) for all x, y Rn

holds. From this we get

r A p = A1
hA p x, rT yi = hA1 x, yi for all x, y Rn1
k(P1 x, P rT y) = k(P1 x, P1 y) for all x, y Rn1

Using the ellipticity of k(, ) it now follows that

r A p = A1
P rT y = P1 y for all y Rn1
rT y = P1 P1 y = p y for all y Rn1
rT = p

Thus the claim is proved.

Thus for the restriction we take:


r := pT (9.24)
Using these components we can define a multigrid method with exactly the same structure as
in (9.19)

procedure MGM (x , b )
if = 0 then x0 := A1 0 b0 else
begin
x := S1 (x , b ) ( presmoothing )
d1 := r (b A x )
(9.25)
e01 := 0; for i = 1 to do ei1 := MGM1 (ei1
1 , d1 );
x := x + p e1

x := S2 (x , b ) ( postsmoothing )
MGM := x
end;

205
We briefly comment on some important issues related to this multigrid method.

Smoothers
For many problems basic iterative methods provide good smoothers. In particular the Gauss-
Seidel method is often a very effective smoother. Other smoothers used in practice are the
damped Jacobi method and the ILU method.

Prolongation and restriction


If instead of a discretization with nested finite element spaces one uses a finite difference or a
finite volume method then one can not use the approach in (9.23) to define a prolongation. How-
ever, for these cases other canonical constructions for the prolongation operator exist. We refer
to Hackbusch [44], [69] or Wesseling [96] for a treatment of this topic. A general technique for the
construction of a prolongation operator in case of nonnested finite element spaces is given in [17].

Arithmetic costs per iteration


We discuss the arithmetic costs of one MGM iteration as defined in (9.25). For this we introduce
a unit of arithmetic work on level :
W U := # flops needed for A x b computation. (9.26)
We assume:
W U1 . g W U with g < 1 independent of (9.27)
Note that if T is constructed through a uniform global grid refinement of T1 (for n = 2:
subdivision of each triangle T T1 into four smaller triangles by connecting the midpoints of
the edges) then (9.27) holds with g = ( 21 )n . Furthermore we make the following assumptions
concerning the arithmetic costs of each of the substeps in the procedure MGM :
x := S (x , b ) : costs . W U
d1 := r (b A x )

total costs . 2 W U
x := x + p e1
For the amount of work in one multigrid V-cycle ( = 1) on level , which is denoted by V M G ,
we get using := 1 + 2 :
V M G . W U + 2W U + V M G1 = ( + 2)W U + V M G1

. ( + 2) W U + W U 1 + . . . + W U 1 + V M G0
. ( + 2) 1 + g + . . . + g1 W U + V M G0 (9.28)


+2
. W U
1g
In the last inequality we assumed that the costs for computing x0 = A1 0 b0 (i.e., V M G0 ) are
negligible compared to W U . The result in (9.28) shows that the arithmetic costs for one V-cycle
are proportional (if ) to the costs of a residual computation. For example, for g = 18
(uniform refinement in 3D) the arithmetic costs of a V-cycle with 1 = 2 = 1 on level are
comparable to 4 12 times the costs of a residual computation on level .
For the W-cycle ( = 2) the arithmetic costs on level are denoted by W M G . We have:
W M G . W U + 2W U + 2W M G1 = ( + 2)W U + 2W M G1
. ( + 2) W U + 2W U 1 + 22 W U 2 + . . . + 21 W U 1 + W M G0


. ( + 2) 1 + 2g + (2g)2 + . . . + (2g)1 W U + W M G0


206
From this we see that to obtain a bound proportional to W U we have to assume
1
g<
2
Under this assumption we get for the W-cycle
+2
W M G . W U
1 2g

(again we neglected W M G0 ). Similar bounds can be obtained for 3, provided g < 1 holds.

9.4 Convergence analysis


In this section we present a convergence analysis for the multigrid method. Our approach is
based on the so-called approximation- and smoothing property, introduced by Hackbusch (cf.
[44, 48]). For a discussion of other analyses we refer to remark 9.4.22.

9.4.1 Introduction
One easily verifies that the two-grid method is a linear iterative method. The iteration matrix
of this method with 1 presmoothing and 2 postsmoothing iterations on level is given by

CT G, = CT G, (2 , 1 ) = S 2 (I p A1 1
1 r A )S (9.29)

with S = I M1
A the iteration matrix of the smoother.

Theorem 9.4.1 The multigrid method (9.25) is a linear iterative method with iteration matrix
CM G, given by

CM G,0 = 0 (9.30a)
S 2
I p (I CM G,1 )A1 1

CM G, = 1 r A S (9.30b)
= CT G, + S 2 p CM G,1 A1 1
1 r A S , = 1, 2, . . . (9.30c)

Proof. The result in (9.30a) is trivial. The result in (9.30c) follows from (9.30b) and the
definition of CT G, . We now prove the result in (9.30b) by induction. For = 1 it follows from
(9.30a) and (9.29). Assume that the result is correct for 1. Then MGM1 (y1 , z1 ) defines
a linear iterative method and for arbitrary y1 , z1 Rn1 we have

MGM1 (y1 , z1 ) A1 1
1 z1 = CM G,1 (y1 A1 z1 ) (9.31)

We rewrite the algorithm (9.25) as follows:

x1 := S1 (xold
, b )
x2 := x1 + p MGM1 0, r (b A x1 )


xnew
:= S2 (x2 , b )

207
From this we get
2 2
xnew
x = xnew
A1
b = S (x x )
= S 2 x1 x + p MGM1 0, r (b A x1 )


Now we use the result (9.31) with y1 = 0, z1 := r (b A x1 ). This yields

xnew x = S 2 x1 x + p (A1 1

1 z1 CM G,1 A1 z1
= S 2 I p (I CM G,1 )A1
 1
1 r A (x x )
= S 2 I p (I CM G,1 )A1 1 old
x )

1 r A S (x

This completes the proof.

The convergence analysis will be based on the following splitting of the two-grid iteration matrix,
with 2 = 0, i.e. no postsmoothing:
1
kCT G, (0, 1 )k2 = k(I p A1
1 r A )S k2
1
(9.32)
kA1 1
p A1 r k2 kA S k2

In section 9.4.2 we will prove a bound of the form kA1 1 1


p A1 r k2 CA kA k2 . This result
is called the approximation property. In section 9.4.3 we derive a suitable bound for the term
kA S 1 k2 . This is the so-called smoothing property. In section 9.4.4 we combine these bounds
with the results in (9.32) and in theorem 9.4.1. This yields bounds for the contraction number
of the two-grid method and of the multigrid W-cycle. For the V-cycle a more subtle analysis is
needed. This is presented in section 9.4.5. In the convergence analysis we need the following:
Assumption 9.4.2 In the sections 9.4.29.4.5 we assume that the family of triangulations
{Th } corresponding to the finite element spaces V , = 0, 1, . . ., is quasi-uniform and that
h1 ch with a constant c independent of .

We formulate three results that will be used in the analysis further on. First we recall the global
inverse inequality that is proved in lemma 3.3.11:

|v |1 c h1
kv kL2 for all v V

with a constant c independent of . Note that for this result we need assumption 9.4.2.
We now show that, apart from a scaling factor, the isomorphism P : (Rn , h, i) (V , h, iL2 )
and its inverse are uniformly (w.r.t. ) bounded:
Lemma 9.4.3 There exist constants c1 > 0 and c2 independent of such that
1
n
c1 kP xkL2 h2 kxk2 c2 kP xkL2 for all x Rn (9.33)

Proof. Let M be the mass matrix, i.e., (M )ij = hi , j iL2 . Note the basic equality

kP xk2L2 = hM x, xi for all x Rn (9.34)

There are constants d1 , d2 > 0 independent of such that


(
d1 hn for all i, j
hi , j iL2
d2 hn for all i = j

208
From this and from the sparsity of M we obtain

d2 hn (M )ii kM k2 kM k d1 hn (9.35)

Using the upper bound in (9.35) in combination with (9.34) we get

kP xk2L2 kM k2 kxk22 d1 hn kxk22 ,

which proves the first inequality in (9.33). We now use corollary 3.5.10. This yields min (M )
c max (M ) with a strictly positive constant c independent of . Thus we have

min (M ) ckM k2 chn , c > 0, independent of

This yields
kP xk2L2 = hM x, xi min (M )kxk22 chn kxk22 ,
which proves the second inequality in (9.33).

The third preliminary result concerns the scaling of the stiffness matrix:
Lemma 9.4.4 Let A be the stiffness matrix as in (9.22). Assume that the bilinear form is
such that the usual conditions (3.42) are satisfied. Then there exist constants c1 > 0 and c2
independent of such that
c1 hn2 kA k2 c2 hn2
Proof. First note that
hA x, yi
kA k2 = maxn
x,yR kxk2 kyk2
Using the result in lemma 9.4.3, the continuity of the bilinear form and the inverse inequality
we get
hA x, yi k(v , w )
maxn chn max
x,yR kxk2 kyk2 v ,w V kv kL2 kw kL2
|v |1 |w |1
chn max c hn2
v ,w V kv kL2 kw kL2

and thus the upper bound is proved. The lower bound follows from
hA x, yi
maxn max hA ei , ei i = k(i , i ) c|i |21 chn2
x,yR kxk2 kyk2 1in

The last inequality can be shown by using for T supp(i ) the affine transformation from the
unit simplex to T .

9.4.2 Approximation property


In this section we derive a bound for the first factor in the splitting (9.32).
In the analysis we will use the adjoint operator P : V Rn which satisfies hP x, v iL2 =
hx, P v i for all x Rn , v V . As a direct consequence of lemma 9.4.3 we obtain
1
n
c1 kP v k2 h2 kv kL2 c2 kP v k2 for all v V (9.36)

209
with constants c1 > 0 and c2 independent of . We now formulate a main result for the conver-
gence analysis of multigrid methods:

Theorem 9.4.5 (Approximation property.) Consider A , p , r as defined in (9.22),


(9.23),(9.24). Assume that the variational problem (9.20) is such that the usual conditions (3.42)
are satisfied. Moreover, the problem (9.20) and the corresponding dual problem are assumed to
be H 2 -regular. Then there exists a constant CA independent of such that

kA1 1 1
p A1 r k2 CA kA k2 for = 1, 2, . . . (9.37)

Proof. Let b Rn be given. The constants in the proof are independent of b and of .
Consider the variational problems:

u H01 () : k(u, v) = h(P )1 b , viL2 for all v H01 ()


u V : k(u , v ) = h(P )1 b , v iL2 for all v V
u1 V1 : k(u1 , v1 ) = h(P )1 b , v1 iL2 for all v1 V1

Then
A1 1 1 1
b = P u and A1 r b = P1 u1

hold. Hence we obtain, using lemma 9.4.3,


12 n
k(A1 1 1
p A1 r )b k2 = kP (u u1 )k2 c h ku u1 kL2 (9.38)

Now we apply theorem 3.4.5 and use the H 2 -regularity of the problem. This yields

ku u1 kL2 ku ukL2 + ku1 ukL2


(9.39)
ch2 |u|2 + +ch21 |u|2 ch2 k(P )1 b kL2

Now we combine (9.38) with (9.39) and use (9.36). Then we get

k(A1 1 2n
p A1 r )b k2 c h kb k2

and thus kA1 1 2n


p A1 r k2 c h . The proof is completed if we use lemma 9.4.4.

Note that in the proof of the approximation property we use the underlying continuous problem.

9.4.3 Smoothing property


In this section we derive inequalities of the form

kA S k2 g()kA k2

where g() is a monotonically decreasing function with lim g() = 0. In the first part of
this section we derive results for the case that A is symmetric positive definite. In the second
part we discuss the general case.

Smoothing property for the symmetric positive definite case.


We start with an elementary lemma:

210
Lemma 9.4.6 Let B Rmm be a symmetric positive definite matrix with (B) (0, 1]. Then
we have
1
kB(I B) k2 for = 1, 2, . . .
2( + 1)
Proof. Note that
1 
kB(I B) k2 = max x(1 x) =
x(0,1] +1 +1


A simple computation shows that +1 is decreasing on [1, ).

Below for a few basic iterative methods we derive the smoothing property for the symmetric
case, i.e., b = 0 in the bilinear form k(, ). We first consider the Richardson method:
Theorem 9.4.7 Assume that in the bilinear form we have b = 0 and that the usual conditions
(3.42) are satisfied. Let A be the stiffness matrix in (9.22). For c0 (0, 1] we have the smoothing
property
c0 1
kA (I A ) k2 kA k2 , = 1, 2, . . .
(A ) 2c0 ( + 1)
holds.
Proof. Note that A is symmetric positive definite. Apply lemma 9.4.6 with B := A ,
:= c0 (A )1 . This yields
1 1 1
kA (I A ) k2 1 (A ) = kA k2
2( + 1) 2c0 ( + 1) 2c0 ( + 1)
and thus the result is proved.

A similar result can be shown for the damped Jacobi method:


Theorem 9.4.8 Assume that in the bilinear form we have b = 0 and that the usual conditions
(3.42) are satisfied. Let A be the stiffness matrix in (9.22) and D := diag(A ). There exists
an (0, (D1 1
A ) ], independent of , such that the smoothing property
1
kA (I D1
A ) k2 kA k2 , = 1, 2, . . .
2( + 1)
holds.
1 1
Proof. Define the symmetric positive definite matrix B := D 2 A D 2 . Note that
(D )ii = (A )ii = k(i , i ) c |i |21 c hn2 , (9.40)
with c > 0 independent of and i. Using this in combination with lemma 9.4.4 we get
kA k2
kBk2 c, c independent of .
min (D )
Hence for (0, 1c ] (0, (D1 1
A ) ] we have ( B) (0, 1]. Application of lemma 9.4.6,
with B = B, yields
1 1
kA (I D1 1 2
A ) k2 kD k2 k B(I B) k2 kD k2
2

kD k2 1
kA k2
2( + 1) 2( + 1)
and thus the result is proved.

211
Remark 9.4.9 The value of the parameter used in theorem 9.4.8 is such that (D1
A ) =
1 1
(D 2 A D 2 ) 1 holds. Note that
1 1 hA x, xi hA ei , ei i
(D 2 A D 2 ) = max max =1
n
xR hD x, xi 1in hD ei ei i
and thus we have 1. This is in agreement with the fact that in multigrid methods one
usually use a damped Jacobi method as a smoother. 

We finally consider the symmetric Gauss-Seidel method. This method is the same as the SSOR
method with parameter value = 1. Thus it follows from (6.18) that this method has an
iteration matrix
S = I M1
A , M = (D L )D1 T
(D L ) , (9.41)
where we use the decomposition A = D L LT with D a diagonal matrix and L a strictly
lower triangular matrix.
Theorem 9.4.10 Assume that in the bilinear form we have b = 0 and that the usual conditions
(3.42) are satisfied. Let A be the stiffness matrix in (9.22) and M as in (9.41). The smoothing
property
c
kA (I M1
A ) k2 + 1 kA k2 , = 1, 2, . . .

holds with a constant c independent of and .


Proof. Note that M = A + L D1 T
L and thus M is symmetric positive definite. Define
1 1
the symmetric positive definite matrix B := M 2 A M 2 . From
hBx, xi hA x, xi hA x, xi
0 < max = max = max 1
xRn hx, xi xR hM x, xi
n
xR hA x, xi + hD1 LT x, LT xi
n

it follows that (B) (0, 1]. Application of lemma 9.4.6 yields


1 1
kA (I M1 2 2
A ) k2 kM k2 kB(I B) k2 kM k2 2( + 1)
From (9.40) we have kD1 2n
k2 c h . Using the sparsity of A we obtain
kL k2 kLT k2 kL k kL k1 c(max |(A )ij |)2 ckA k22
i,j

In combination with lemma 9.4.4 we then get


kM k2 kD1 T 2n
k2 kL k2 kL k2 c h kA k22 ckA k2 (9.42)
and this completes the proof.

For the symmetric positive definite case smoothing properties have also been proved for other
iterative methods. For example, in [98, 97] a smoothing property is proved for a variant of the
ILU method and in [24] it is shown that the SPAI (sparse approximate inverse) preconditioner
satisfies a smoothing property.

Smoothing property for the nonsymmetric case.


For the analysis of the smoothing property in the general (possibly nonsymmetric) case we can
not use lemma 9.4.6. Instead the analysis will be based on the following lemma (cf. [74, 75]):

212
Lemma 9.4.11 Let k k be any induced matrix norm and assume that for B Rmm the
inequality kBk 1 holds. The we have
r
+1 2
k(I B)(I + B) k 2 , for = 1, 2, . . .

Proof. Note that


     

X k +1
X  k
(I B)(I + B) = (I B) B =IB + B
k k k1
k=0 k=1

This yields
   
X

k(I B)(I + B) k 2 +
k k1
k=1
       
1
Using k 2 ( + 1) and we get (with [ ] the round
k k1 k k
down operator):
   
X

k k1
k=1
[ 21 (+1)]        
X  X 
= +
k k1 k1 k
1 [ 12 (+1)]+1

[ 21 ]     [ 21 ]    
X  X 
= +
k k1 m m1
1 m=1
[ 21 ]
X     

  

=2 =2 1
k k1 [ 2 ] 0
k=1

An elementary analysis yields (cf., for example, [75])


  r
2
1 2 for 1
[ 2 ]

Thus we have proved the bound.

Corollary 9.4.12 Let k k be any induced matrix norm. Assume that for a linear iterative
method with iteration matrix I M1
A we have

kI M1
A k 1 (9.43)

Then for S := I 12 M1
A the following smoothing property holds:
r
2
kA S k 2 kM k , = 1, 2, . . .

213
Proof. Define B = I M1
A and apply lemma 9.4.11:
r
1  2
kA S k kM k
k(I B)(I + B) k 2 kM k
2

Remark 9.4.13 Note that in the smoother in corollary 9.4.12 we use damping with a factor 21 .
Generalizations of the results in lemma 9.4.11 and corollary 9.4.12 are given in [66, 49, 32]. In
[66, 32] it is shown that the damping factor 21 can be replaced by an arbitrary damping factor
(0, 1). Also note that in the smoothing property in corollary 9.4.12 we have a -dependence
1
of the form 2 , whereas in the symmetric case this is of the form 1 . It [49] it is noted that
1
this loss of a factor 2 when going to the nonsymmetric case is due to the fact that complex
eigenvalues may occur. Assume that M1 A is a normal matrix. The assumption (9.43) implies
that (M1 A ) K := { z C | |1 z| 1 }. We have:

1 1 1 1
kM1 2
A (I 2 M A ) k2 max |z(1 z) |2 = max |z(1 z) |2
zK 2 zK 2
1 1
= max |1 + ei |2 | ei |2
[0,2] 2 2
1 1 1 1
= max 4( + cos )( cos )
[0,2] 2 2 2 2
4 
= max 4(1 ) =
[0,1] +1 +1

Note that the latter function of also occurs in the proof of lemma 9.4.6. We conclude that for
the class of normal matrices M1 A an estimate of the form

1 1 c
kM1
A (I 2 M A ) k2 , = 1, 2, . . .

is sharp with respect to the -dependence. 

To verify the condition in (9.43) we will use the following elementary result:

Lemma 9.4.14 If E Rmm is such that there exists a c > 0 with

kExk22 chEx, xi for all x Rm

then we have kI Ek2 1 for all [0, 2c ].

Proof. Follows from:

k(I E)xk22 = kxk22 2hEx, xi + 2 kExk22


2
kxk22 ( )kExk22
c
2
kxk22 if ( ) 0
c

We now use these results to derive a smoothing property for the Richardson method.

214
Theorem 9.4.15 Assume that the bilinear form satisfies the usual conditions (3.42). Let A
be the stiffness matrix in (9.22). There exist constants > 0 and c independent of such that
the following smoothing property holds:
c
kA (I h2n A ) k2 kA k2 , = 1, 2, . . .

Proof. Using lemma 9.4.3, the inverse inequality and the ellipticity of the bilinear form we
get, for arbitrary x Rn :

hA x, yi 1
n k(P x, v )
kA xk2 = max c h2 max
n
yR kyk2 v V kv kL2
1
n |P x|1 |v |1 1
n1
c h2 max c h2 |P x|1
v V kv kL2
1 1
n1 1 n1 1
c h2 k(P x, P x) 2 = c h2 hA x, xi 2

From this and lemma 9.4.14 it follows that there exists a constant > 0 such that

kI 2h2n
A k2 1 for all (9.44)
1 n2
Define M := 2 h I. From lemma 9.4.4 it follows that there exists a constant cM independent
of such that kM k2 cM kA k2 . Application of corollary 9.4.12 proves the result of the lemma.

We now consider the damped Jacobi method.

Theorem 9.4.16 Assume that the bilinear form satisfies the usual conditions (3.42). Let A be
the stiffness matrix in (9.22) and D = diag(A ). There exist constants > 0 and c independent
of such that the following smoothing property holds:
c
kA (I D1
A ) k2 kA k2 , = 1, 2, . . .

1
Proof. We use the matrix norm induced by the vector norm kykD := kD2 yk2 for y Rn .
1
1
Note that for B Rn n we have kBkD = kD2 BD 2 k2 . The inequalities

kD1 2n
k2 c1 h , (D ) c2 (9.45)

hold with constants c1 , c2 independent of . Using this in combination with lemma 9.4.3, the
inverse inequality and the ellipticity of the bilinear form we get, for arbitrary x Rn :

1 1 1 1
1 1 hA D 2 x, D 2 yi k(P D 2 x, P D 2 y)
kD 2 A D 2 xk2 = max = max
yRn kyk2 yRn kyk2
1 1
|P D 2 x|1 kP D 2 ykL2
c h1 max
yRn kyk2
1
n1 1 1 1
c h2 |P D 2 x|1 kD 2 k2 c |P D 2 x|1
1 1 1 1 1 1
c k(P D 2 x, P D 2 x) 2 = c hD 2 A D 2 x, xi 2

215
From this and lemma 9.4.14 it follows that there exists a constant > 0 such that
1 1
kI 2D1
A kD = kI 2D A D k2 1 for all
2 2

1
Define M := 2 D . Application of corollary 9.4.12 with k k = k kD in combination with (9.45)
yields
1 1 1
kA (I h D1
A )
k2 (D 2
) kA (I 2 M A ) kD
c c c
kM kD = kD k2 kA k2
2
and thus the result is proved.

9.4.4 Multigrid contraction number


In this section we prove a bound for the contraction number in the Euclidean norm of the multi-
grid algorithm (9.25) with 2. We follow the analysis in [44, 48].
Apart from the approximation and smoothing property that have been proved in the sec-
tions 9.4.2 and 9.4.3 we also need the following stability bound for the iteration matrix of
the smoother:
CS : kS k2 CS for all and (9.46)

Lemma 9.4.17 Consider the Richardson method as in theorem 9.4.7 or theorem 9.4.15. In
both cases (9.46) holds with CS = 1.

Proof. In the symmetric case (theorem 9.4.7) we have


c0
kS k2 = kI A k2 = max 1 c0 1
(A ) (A ) (A )

For the general case (theorem 9.4.15) we have, using (9.44):


1 1
kS k2 = kI h2n
A k2 = k I + (I 2h2n
A )k2
2 2
1 1
+ kI 2h2n A k2 1
2 2

Lemma 9.4.18 Consider the damped Jacobi method as in theorem 9.4.8 or theorem 9.4.16. In
both cases (9.46) holds.

Proof. Both in the symmetric and nonsymmetric case we have


1
1
kS kD = kD2 (I D1
A )D k2 1
2

and thus
1 1
1 1 1 1
kS k2 kD 2 (D2 S D 2 ) D2 k2 (D2 ) kS kD (D2 )
Now note that D is uniformly (w.r.t. ) well-conditioned.

Treatment of symmetric Gauss-Seidel method: in preparation.

216
Using lemma 9.4.3 it follows that for p = P1 P1 we have
Cp,1 kxk2 kp xk2 Cp,2 kxk2 for all x Rn1 (9.47)

with constants Cp,1 > 0 and Cp,2 independent of .


We now formulate a main convergence result for the multigrid method.

Theorem 9.4.19 Consider the multigrid method with iteration matrix given in (9.30) and pa-
rameter values 2 = 0, 1 = > 0, 2. Assume that there are constants CA , CS and a
monotonically decreasing function g() with g() 0 for such that for all :

kA1 1 1
p A1 r k2 CA kA k2 (9.48a)
kA S k2 g() kA k2 , 1 (9.48b)
kS k2 CS , 1 (9.48c)

For any (0, 1) there exists a such that for all

kCM G, k2 , = 0, 1, . . .

holds.

Proof. For the two-grid iteration matrix we have

kCT G, k2 kA1 1
p A1 r k2 kA S k2 CA g()

Define = kCM G. k2 . From (9.30) we obtain 0 = 0 and for 1:



CA g() + kp k2 1 kA1
1 r A S k2
1
CA g() + Cp,2 Cp,1 1 kp A1
1 r A S k2
1
1 k(I p A1

CA g() + Cp,2 Cp,1 1 r A )S k2 + kS k2
1
1 CA g() + CS CA g() + C 1

CA g() + Cp,2 Cp,1
1
with C := Cp,2 Cp,1 (CA g(1)+ CS ). Elementary analysis shows that for 2 and any (0, 1)
the sequence x0 = 0, xi = CA g()+C xi1 , i 1, is bounded by for g() sufficiently small.

Remark 9.4.20 Consider A , p , r as defined in (9.22), (9.23),(9.24). Assume that the vari-
ational problem (9.20) is such that the usual conditions (3.42) are satisfied. Moreover, the prob-
lem (9.20) and the corresponding dual problem are assumed to be H 2 -regular. In the multigrid
method we use the Richardson or the damped Jacobi method described in section 9.4.3. Then
the assumptions 9.48 are fulfilled and thus for 2 = 0 and 1 sufficiently large the multigrid
W-cylce has a contractrion number smaller than one indpendent of . 

Remark 9.4.21 Let CM G, (2 , 1 ) be the iteration matrix of the multigrid method with 1 pre-
and 2 postsmoothing iterations. With := 1 + 2 we have
 
CM G, (2 , 1 ) = CM G, (0, ) kCM G, (0, )k2
Using theorem 9.4.19 we thus get, for 2, a bound for the spectral radius of the iteration
matrix CM G, (2 , 1 ). 

217
Remark 9.4.22 Note on other convergence analyses. Xu, Yserentant (quasi-uniformity not
needed in BPX). Comment on regularity. Book Bramble.

9.4.5 Convergence analysis for symmetric positive definite problems


In this section we analyze the convergence of the multigrid method for the symmetric positive
definite case, i.e., the stiffness matrix A is assumed to be symmetric positive definite. This
property allows a refined analysis which proves that the contraction number of the multigrid
method with 1 (the V-cycle is included !) and 1 = 2 1 pre- and postsmoothing iterations
is bounded by a constant smaller than one independent of . The basic idea of this analysis is
due to [18] and is further simplified in [44, 48].

Throughout this section we make the following


Assumption 9.4.23 In the bilinear form k(, ) in (9.20) we have b = 0 and the conditions
(3.42) are satisfied.
Due to this the stiffness matrix A is symmetric positive definite and we can define the energy
scalar product and corresponding norm:
1
hx, yiA := hA x, yi , kxkA := hx, xiA2 x, y Rn

We only consider smoothers with an iteration matrix S_ℓ = I − M_ℓ^{-1} A_ℓ in which M_ℓ is symmetric positive definite. Important examples are the smoothers analyzed in section 9.4.3:

Richardson method:   M_ℓ = c_0^{-1} ρ(A_ℓ) I ,  c_0 ∈ (0, 1]   (9.49a)
Damped Jacobi:       M_ℓ = ω^{-1} D_ℓ ,  ω as in thm. 9.4.8   (9.49b)
Symm. Gauss-Seidel:  M_ℓ = (D_ℓ − L_ℓ) D_ℓ^{-1} (D_ℓ − L_ℓ^T)   (9.49c)

For symmetric matrices B, C ∈ R^{m×m} we use the notation B ≤ C iff ⟨Bx, x⟩ ≤ ⟨Cx, x⟩ for all x ∈ R^m.
Lemma 9.4.24 For M_ℓ as in (9.49) the following properties hold:

A_ℓ ≤ M_ℓ  for all ℓ   (9.50a)
∃ C_M :  ||M_ℓ||_2 ≤ C_M ||A_ℓ||_2  for all ℓ   (9.50b)

Proof. For the Richardson method the result is trivial. For the damped Jacobi method we have ω ∈ (0, ρ(D_ℓ^{-1} A_ℓ)^{-1}] and thus ρ(ω D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2}) ≤ 1. This yields A_ℓ ≤ ω^{-1} D_ℓ = M_ℓ.
The result in (9.50b) follows from ||D_ℓ||_2 ≤ ||A_ℓ||_2. For the symmetric Gauss-Seidel method the result (9.50a) follows from M_ℓ = A_ℓ + L_ℓ D_ℓ^{-1} L_ℓ^T and the result in (9.50b) is proved in (9.42).
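Both properties of the lemma are easy to check numerically. The following Python sketch does this for the three smoothers in (9.49), using the 1D Poisson stiffness matrix as a model for A_ℓ (an assumption for illustration, not a matrix from the text), with admissible parameters c_0 = 1 and ω = ρ(D^{-1}A)^{-1}.

import numpy as np

n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # model s.p.d. matrix
D = np.diag(np.diag(A))
L = -np.tril(A, -1)                                    # A = D - L - L^T

rho_A = np.max(np.abs(np.linalg.eigvalsh(A)))
c0 = 1.0
M_rich = rho_A / c0 * np.eye(n)                        # Richardson (9.49a)
omega = 1.0 / np.max(np.abs(np.linalg.eigvals(np.linalg.solve(D, A))))
M_jac = D / omega                                      # damped Jacobi (9.49b)
M_sgs = (D - L) @ np.linalg.solve(D, (D - L).T)        # symm. Gauss-Seidel (9.49c)

for name, M in [("Richardson", M_rich), ("Jacobi", M_jac), ("sym. GS", M_sgs)]:
    lam_min = np.min(np.linalg.eigvalsh(M - A))        # >= 0 (up to round-off) iff A <= M, cf. (9.50a)
    ratio = np.linalg.norm(M, 2) / np.linalg.norm(A, 2)   # a candidate C_M in (9.50b)
    print(f"{name:10s}  min eig(M - A) = {lam_min:9.2e}   ||M||/||A|| = {ratio:.2f}")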

We introduce the following modified approximation property:

∃ C̃_A :  ||M_ℓ^{1/2} (A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ) M_ℓ^{1/2}||_2 ≤ C̃_A  for ℓ = 1, 2, . . .   (9.51)

We note that the standard approximation property (9.37) implies the result (9.51) if we consider the smoothers in (9.49):

Lemma 9.4.25 Consider M_ℓ as in (9.49) and assume that the approximation property (9.37) holds. Then (9.51) holds with C̃_A = C_M C_A.

Proof. Trivial.

One easily verifies that for the smoothers in (9.49) the modified approximation property (9.51) implies the standard approximation property (9.37) if κ(M_ℓ) is uniformly (w.r.t. ℓ) bounded. The latter property holds for the Richardson and the damped Jacobi method.

We will analyze the convergence of the two-grid and multigrid method using the energy scalar product. For matrices B, C ∈ R^{n_ℓ × n_ℓ} that are symmetric w.r.t. ⟨·,·⟩_A we use the notation B ≤_A C iff ⟨Bx, x⟩_A ≤ ⟨Cx, x⟩_A for all x ∈ R^{n_ℓ}. Note that B ∈ R^{n_ℓ × n_ℓ} is symmetric w.r.t. ⟨·,·⟩_A iff (A_ℓ B)^T = A_ℓ B holds. We also note the following elementary property for symmetric matrices B, C ∈ R^{n_ℓ × n_ℓ}:

B ≤ C  ⟺  B A_ℓ ≤_A C A_ℓ   (9.52)

We now turn to the two-grid method. For the coarse grid correction we introduce the notation Q_ℓ := I − p_ℓ A_{ℓ-1}^{-1} r_ℓ A_ℓ. For symmetry reasons we only consider ν_1 = ν_2 = ν/2 with ν > 0 even. The iteration matrix of the two-grid method is given by

C_{TG,ℓ} = C_{TG,ℓ}(ν) = S_ℓ^{ν/2} Q_ℓ S_ℓ^{ν/2}

Due to the symmetric positive definite setting we have the following fundamental property:

Theorem 9.4.26 The matrix Q_ℓ is an orthogonal projection w.r.t. ⟨·,·⟩_A.

Proof. Follows from

Q_ℓ² = Q_ℓ  and  (A_ℓ Q_ℓ)^T = A_ℓ Q_ℓ

As a direct consequence we have

0 ≤_A Q_ℓ ≤_A I   (9.53)

The next lemma gives another characterization of the modified approximation property:

Lemma 9.4.27 The property (9.51) is equivalent to

0 ≤_A Q_ℓ ≤_A C̃_A M_ℓ^{-1} A_ℓ  for ℓ = 1, 2, . . .   (9.54)

Proof. Using (9.52) we get

||M_ℓ^{1/2} (A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ) M_ℓ^{1/2}||_2 ≤ C̃_A  for all ℓ
  ⟺  −C̃_A I ≤ M_ℓ^{1/2} (A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ) M_ℓ^{1/2} ≤ C̃_A I  for all ℓ
  ⟺  −C̃_A M_ℓ^{-1} ≤ A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ ≤ C̃_A M_ℓ^{-1}  for all ℓ
  ⟺  −C̃_A M_ℓ^{-1} A_ℓ ≤_A Q_ℓ ≤_A C̃_A M_ℓ^{-1} A_ℓ  for all ℓ

In combination with (9.53) this proves the result.

We now present a convergence result for the two-grid method:

Theorem 9.4.28 Assume that (9.50a) and (9.51) hold. Then we have

||C_{TG,ℓ}(ν)||_A ≤ max_{y∈[0,1]} y (1 − C̃_A^{-1} y)^ν
                 = { (1 − C̃_A^{-1})^ν                      if C̃_A ≥ ν + 1      (9.55)
                   { C̃_A/(ν+1) · (ν/(ν+1))^ν               if C̃_A ≤ ν + 1

Proof. Define X_ℓ := M_ℓ^{-1} A_ℓ. This matrix is symmetric w.r.t. the energy scalar product and from (9.50a) it follows that

0 ≤_A X_ℓ ≤_A I   (9.56)

holds. From lemma 9.4.27 we obtain 0 ≤_A Q_ℓ ≤_A C̃_A X_ℓ. Note that due to this, (9.56) and the fact that Q_ℓ is an A-orthogonal projection which is not identically zero we get

C̃_A ≥ 1   (9.57)

Using (9.53) we get

0 ≤_A Q_ℓ ≤_A α C̃_A X_ℓ + (1 − α) I  for all α ∈ [0, 1]   (9.58)

Hence, using S_ℓ = I − X_ℓ we have

0 ≤_A C_{TG,ℓ}(ν) ≤_A (I − X_ℓ)^{ν/2} ( α C̃_A X_ℓ + (1 − α) I ) (I − X_ℓ)^{ν/2}

for all α ∈ [0, 1], and thus

||C_{TG,ℓ}(ν)||_A ≤ min_{α∈[0,1]} max_{x∈[0,1]} ( α C̃_A x + (1 − α) ) (1 − x)^ν

A minimax result (cf., for example, [83]) shows that in the previous expression the min and max operations can be interchanged. A simple computation yields

max_{x∈[0,1]} min_{α∈[0,1]} ( α C̃_A x + (1 − α) ) (1 − x)^ν
  = max ( max_{x∈[0, C̃_A^{-1}]} C̃_A x (1 − x)^ν ,  max_{x∈[C̃_A^{-1}, 1]} (1 − x)^ν )
  = max_{x∈[0, C̃_A^{-1}]} C̃_A x (1 − x)^ν = max_{y∈[0,1]} y (1 − C̃_A^{-1} y)^ν

This proves the inequality in (9.55). An elementary computation shows that the equality in (9.55) holds.
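This "elementary computation" can also be checked numerically. The following Python sketch compares a direct maximization of y(1 − C̃_A^{-1}y)^ν on [0, 1] with the closed form in (9.55); the sample values of C̃_A and ν are arbitrary test values.

import numpy as np

def bound_numeric(CA, nu):
    # brute-force maximization of y*(1 - y/CA)^nu over [0, 1]
    y = np.linspace(0.0, 1.0, 200001)
    return np.max(y * (1.0 - y / CA) ** nu)

def bound_closed(CA, nu):
    # the two cases in (9.55)
    if CA >= nu + 1:
        return (1.0 - 1.0 / CA) ** nu
    return CA / (nu + 1) * (nu / (nu + 1.0)) ** nu

for CA, nu in [(1.0, 2), (2.0, 2), (4.0, 2), (4.0, 4), (10.0, 2)]:
    print(CA, nu, bound_numeric(CA, nu), bound_closed(CA, nu))

Up to the grid resolution of the brute-force search, the two values agree in all cases.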

We now show that the approach used in the convergence analysis of the two-grid method in theorem 9.4.28 can also be used for the multigrid method.
We start with an elementary result concerning a fixed point iteration that will be used in theorem 9.4.30.

Lemma 9.4.29 For given constants c > 1, ν ≥ 1 define g : [0, 1) → R by

g(ξ) = { (1 − 1/c)^ν                                            if 0 ≤ ξ ≤ 1 − ν(c−1)^{-1}
       { ν^ν/(ν+1)^{ν+1} · c(1−ξ) ( 1 + ξ/(c(1−ξ)) )^{ν+1}     if 1 − ν(c−1)^{-1} ≤ ξ < 1   (9.59)

For τ ∈ N, τ ≥ 1, define the sequence ξ_{τ,0} = 0, ξ_{τ,i+1} = g(ξ_{τ,i}^τ) for i ≥ 0. The following holds:

  g(ξ) is continuous and increasing on [0, 1)
  For c = C̃_A, g(0) coincides with the upper bound in (9.55)
  g(ξ) = ξ  iff  ξ = c/(c + ν)
  The sequence (ξ_{τ,i})_{i≥0} is monotonically increasing, and ξ_τ := lim_{i→∞} ξ_{τ,i} < 1;
  ξ_τ is the first intersection point of the graphs of g(ξ^τ) and of ξ, and

  c/(c + ν) = ξ_1 ≥ ξ_2 ≥ . . .  with  lim_{τ→∞} ξ_τ = g(0)

Proof. Elementary calculus.

As an illustration for two pairs (c, ν) we show the graph of the function g in figure 9.6.

Figure 9.6: Function g(ξ) for ν = 2, c = 4 (left) and ν = 4, c = 4 (right).
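The quantities behind this figure can be reproduced with a few lines of Python. The following sketch implements g from (9.59) (as reconstructed above) and the fixed point iteration ξ_{τ,i+1} = g(ξ_{τ,i}^τ) of lemma 9.4.29; the parameters c = 4 and ν ∈ {2, 4} are those of figure 9.6.

def g(xi, c, nu):
    # the function g from (9.59)
    if xi <= 1.0 - nu / (c - 1.0):                     # first branch
        return (1.0 - 1.0 / c) ** nu
    t = c * (1.0 - xi)
    return nu**nu / (nu + 1.0) ** (nu + 1) * t * (1.0 + xi / t) ** (nu + 1)

def fixed_point(c, nu, tau, n_iter=200):
    # xi_{tau,i+1} = g(xi_{tau,i}^tau); monotonically increasing limit xi_tau < 1
    xi = 0.0
    for _ in range(n_iter):
        xi = g(xi**tau, c, nu)
    return xi

for nu in (2, 4):
    print(nu, g(0.0, c=4.0, nu=nu),          # two-grid bound g(0)
          fixed_point(c=4.0, nu=nu, tau=2),  # multigrid bound xi_2
          4.0 / (4.0 + nu))                  # tau = 1 fixed point c/(c+nu)

One observes g(0) ≤ ξ_2 ≤ c/(c + ν), consistent with the last property in the lemma.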

Theorem 9.4.30 We take ν_1 = ν_2 = ν/2 and consider the multigrid algorithm with iteration matrix C_{MG,ℓ} = C_{MG,ℓ}(ν, τ) as in (9.30). Assume that (9.50a) and (9.51) hold. For c = C̃_A, τ ≥ 2 and ν as in (9.30), let ξ_τ ≤ c/(c + ν) be the fixed point defined in lemma 9.4.29. Then

||C_{MG,ℓ}||_A ≤ ξ_τ

holds.

Proof. From (9.30) we have

C_{MG,ℓ} = S_ℓ^{ν/2} ( I − p_ℓ (I − C_{MG,ℓ-1}^τ) A_{ℓ-1}^{-1} r_ℓ A_ℓ ) S_ℓ^{ν/2}
         = S_ℓ^{ν/2} (Q_ℓ + R_ℓ) S_ℓ^{ν/2} ,   R_ℓ := p_ℓ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} r_ℓ A_ℓ

The matrices S_ℓ and Q_ℓ are symmetric w.r.t. ⟨·,·⟩_A. If C_{MG,ℓ-1} is symmetric w.r.t. ⟨·,·⟩_{A_{ℓ-1}} then from

(A_ℓ R_ℓ)^T = ( (A_ℓ p_ℓ A_{ℓ-1}^{-1}) (A_{ℓ-1} C_{MG,ℓ-1}^τ) (A_{ℓ-1}^{-1} r_ℓ A_ℓ) )^T = A_ℓ R_ℓ

it follows that R_ℓ is symmetric w.r.t. ⟨·,·⟩_A, too. By induction we conclude that for all ℓ the matrices R_ℓ and C_{MG,ℓ} are symmetric w.r.t. ⟨·,·⟩_A. Note that

0 ≤_{A_{ℓ-1}} C_{MG,ℓ-1}^τ  ⟹  0 ≤ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1}  ⟹  0 ≤ p_ℓ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} r_ℓ  ⟹  0 ≤_A R_ℓ

holds. Thus, by induction and using 0 ≤_A Q_ℓ we get

0 ≤_A Q_ℓ + R_ℓ ,   0 ≤_A C_{MG,ℓ}  for all ℓ   (9.60)

For ℓ ≥ 0 define ξ_ℓ := ||C_{MG,ℓ}||_A. Hence, 0 ≤_A C_{MG,ℓ} ≤_A ξ_ℓ I holds. For arbitrary x ∈ R^{n_ℓ} we have

⟨R_ℓ x, x⟩_A = ⟨C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} r_ℓ A_ℓ x, A_{ℓ-1}^{-1} r_ℓ A_ℓ x⟩_{A_{ℓ-1}}
            ≤ ξ_{ℓ-1}^τ ⟨A_{ℓ-1}^{-1} r_ℓ A_ℓ x, A_{ℓ-1}^{-1} r_ℓ A_ℓ x⟩_{A_{ℓ-1}} = ξ_{ℓ-1}^τ ⟨x, (I − Q_ℓ)x⟩_A

and thus

R_ℓ ≤_A ξ_{ℓ-1}^τ (I − Q_ℓ)   (9.61)

holds. Define X_ℓ := M_ℓ^{-1} A_ℓ. Using (9.58), (9.60) and (9.61) we get

0 ≤_A Q_ℓ + R_ℓ ≤_A (1 − ξ_{ℓ-1}^τ) Q_ℓ + ξ_{ℓ-1}^τ I
             ≤_A (1 − ξ_{ℓ-1}^τ) ( α C̃_A X_ℓ + (1 − α) I ) + ξ_{ℓ-1}^τ I  for all α ∈ [0, 1]

Hence, for all α ∈ [0, 1] we have

0 ≤_A C_{MG,ℓ} ≤_A (I − X_ℓ)^{ν/2} ( (1 − ξ_{ℓ-1}^τ) ( α C̃_A X_ℓ + (1 − α) I ) + ξ_{ℓ-1}^τ I ) (I − X_ℓ)^{ν/2}

This yields

ξ_ℓ ≤ min_{α∈[0,1]} max_{x∈[0,1]} ( (1 − ξ_{ℓ-1}^τ) ( α C̃_A x + 1 − α ) + ξ_{ℓ-1}^τ ) (1 − x)^ν

As in the proof of theorem 9.4.28 we can interchange the min and max operations in the previous expression. A simple computation shows that for η ∈ [0, 1] we have

max_{x∈[0,1]} min_{α∈[0,1]} ( (1 − η) ( α C̃_A x + 1 − α ) + η ) (1 − x)^ν
  = max ( max_{x∈[0, C̃_A^{-1}]} ( (1 − η) C̃_A x + η ) (1 − x)^ν ,  max_{x∈[C̃_A^{-1}, 1]} (1 − x)^ν ) = g(η)

where g(·) is the function defined in lemma 9.4.29 with c = C̃_A. Thus ξ_ℓ satisfies ξ_0 = 0 and ξ_ℓ ≤ g(ξ_{ℓ-1}^τ) for ℓ ≥ 1. Application of the results in lemma 9.4.29 completes the proof.

The bound for the multigrid contraction number in theorem 9.4.30 decreases if ν increases. Moreover, for τ → ∞ the bound converges to the bound for the two-grid contraction number in theorem 9.4.28.

Corollary 9.4.31 Consider A_ℓ, p_ℓ, r_ℓ as defined in (9.22), (9.23), (9.24). Assume that the variational problem (9.20) is such that b = 0 and that the usual conditions (3.42) are satisfied. Moreover, the problem is assumed to be H²-regular. In the multigrid method we use one of the smoothers (9.49). Then the assumptions (9.50a) and (9.51) are satisfied and thus for ν_1 = ν_2 ≥ 1 the multigrid V-cycle has a contraction number (w.r.t. ||·||_A) smaller than one independent of ℓ.

9.5 Multigrid for convection-dominated problems

9.6 Nested Iteration


We consider a sequence of discretizations of a given boundary value problem, as for example in (9.22):

A_ℓ x_ℓ = b_ℓ ,  ℓ = 0, 1, 2, . . . .

We assume that for a certain ℓ = ℓ̄ we want to compute the solution x_ℓ̄ of the problem A_ℓ̄ x_ℓ̄ = b_ℓ̄ using an iterative method (not necessarily a multigrid method). In the nested iteration method we use the systems on coarse grids to obtain a good starting vector x_ℓ̄^0 for this iterative method with relatively low computational costs. The nested iteration method for the computation of this starting vector x_ℓ̄^0 is as follows


compute the solution x_0 of A_0 x_0 = b_0
x_1^0 := p_1 x_0    (prolongation of x_0)
x_1^k := result of k iterations of an iterative method
         applied to A_1 x_1 = b_1 with starting vector x_1^0
x_2^0 := p_2 x_1^k    (prolongation of x_1^k)                                   (9.62)
x_2^k := result of k iterations of an iterative method
         applied to A_2 x_2 = b_2 with starting vector x_2^0
  ...
x_ℓ̄^0 := p_ℓ̄ x_{ℓ̄−1}^k .

In this nested iteration method we use a prolongation p_ℓ : R^{n_{ℓ-1}} → R^{n_ℓ}. The nested iteration principle is based on the idea that p_ℓ x_{ℓ-1} should be a reasonable approximation of x_ℓ, because A_{ℓ-1} x_{ℓ-1} = b_{ℓ-1} and A_ℓ x_ℓ = b_ℓ are discretizations of the same continuous problem. With respect to the computational costs of this approach we note the following (cf. Hackbusch [44], section 5.3). For the nested iteration to be a feasible approach, the number of iterations applied on the coarse grids (i.e. k in (9.62)) should not be too large and the number of grid points in the union of all coarse grids (i.e. levels 0, 1, 2, ..., ℓ̄−1) should be at most of the same order of magnitude as the number of grid points in the level ℓ̄ grid. Often, if one uses a multigrid solver these two conditions are satisfied. Usually in multigrid we use coarse grids such that the number of grid points decreases in a geometric fashion, and for k in (9.62) we can often take k = 1 or k = 2 due to the fact that on the coarse grids we use the multigrid method, which has a high rate of convergence.
Note that if one uses the algorithm MGM_ℓ from (9.25) as the solver on level ℓ then the implementation of the nested iteration method can be done with only little additional effort, because the coarse grid data structures and coarse grid operators (e.g. A_ℓ, ℓ < ℓ̄) needed in the nested iteration method are already available.

If in the nested iteration method we use a multigrid iterative solver on all levels we obtain the following algorithmic structure:

x_0 := A_0^{-1} b_0 ;  x_0^k := x_0
for ℓ = 1 to ℓ̄ do
begin
    x_ℓ^0 := p_ℓ x_{ℓ-1}^k                                    (9.63)
    for i = 1 to k do  x_ℓ^i := MGM_ℓ(x_ℓ^{i-1}, b_ℓ)
end;

For the case ℓ̄ = 3 and k = 1 this method is illustrated in figure 9.7.

Figure 9.7: Multigrid and nested iteration (schematic: starting from x_0 on level 0, each level ℓ = 1, 2, 3 applies the prolongation p_ℓ followed by one iteration MGM_ℓ(x_ℓ^0, b_ℓ)).
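In Python, (9.63) reads as follows. This is a sketch: the callables prolongate(l, x) and MGM(l, x, b) (one multigrid iteration on level l) as well as the level data A[l], b[l] are assumed to be supplied by the application; they are not defined in the text.

import numpy as np

def nested_iteration(A, b, prolongate, MGM, lbar, k=1):
    """Nested iteration (9.63). A, b: lists of level matrices / right-hand sides."""
    x = np.linalg.solve(A[0], b[0])          # exact solve on the coarsest level
    for l in range(1, lbar + 1):
        x = prolongate(l, x)                 # starting vector x_l^0 = p_l x_{l-1}^k
        for _ in range(k):                   # k multigrid iterations on level l
            x = MGM(l, x, b[l])
    return x                                 # accepted approximation of x_lbar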

Remark 9.6.1 The prolongation p_ℓ used in the nested iteration may be the same as the prolongation used in the multigrid method. However, from the point of view of efficiency it is sometimes better to use in the nested iteration a prolongation that has a higher order of accuracy than the prolongation used in the multigrid method.

9.7 Numerical experiments


We consider the Poisson model problem described in section 6.6 and apply a multigrid method
to this problem. In this section we present some results of numerical experiments and discuss
the complexity of the multigrid method for this model problem.

Example 9.7.1 (Poisson model problem) We apply a multigrid algorithm as in (9.25) to the discrete Poisson equation described in section 6.6. For the smoother we use a Gauss-Seidel method. The starting vector is x^0 = 0. For the parameters in the algorithm we choose ν_1 = 2, ν_2 = 0, τ = 2. We solve the discrete problem on the triangulation T_{h_ℓ} with mesh size h_ℓ = 2^{−(ℓ+1)}. In table 9.1 we show the error reduction ||x^{k+1} − x*||_2 / ||x^k − x*||_2 for several values of ℓ and of k. For a better comparison with the basic iterative methods and with the CG method, we also computed the number of iterations (#) needed to reduce the norm of the starting error by a factor 10³. The results are shown in table 9.2. From these results it is clear that the contraction number is not close to one, even if the mesh size is small. In other words, the rate of convergence does not deteriorate if the mesh size h becomes smaller. This is a crucial difference compared with basic iterative methods and CG.

ℓ     h       k = 1    k = 3    k = 5    k = 7    k = 9
3    1/16     0.080    0.055    0.061    0.067    0.070
4    1/32     0.056    0.053    0.058    0.062    0.066
5    1/64     0.044    0.055    0.059    0.062    0.065
6    1/128    0.043    0.054    0.058    0.061    0.063

Table 9.1: Multigrid error reduction, k = 1, 3, . . . , 9.

h    1/16   1/32   1/64   1/128
#     3      3      3      3

Table 9.2: # iterations for the multigrid method.

Complexity. Consider the situation described in example 9.7.1. Then the arithmetic costs per multigrid iteration are c n_ℓ flops and the error reduction per iteration is bounded by ξ < 1 with ξ independent of ℓ (as proved in section 9.4). To obtain a reduction of a starting error by a fixed factor R we then need at most ln R / |ln ξ| iterations, i.e. the arithmetic costs are approximately ln R / |ln ξ| · c n_ℓ flops. We conclude that the multigrid method has complexity O(n_ℓ). Note that this is optimal in the sense that for one matrix-vector multiplication A_ℓ x_ℓ we already need O(n_ℓ) flops. A nice feature of multigrid methods is that such an optimal complexity property holds for a large class of interesting problems.
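As a quick illustration of this iteration-count estimate (a one-line sketch; the value ξ = 0.06 is the reduction factor observed in table 9.1):

import math
xi, R = 0.06, 1.0e3
print(math.ceil(math.log(R) / abs(math.log(xi))))   # 3 iterations, matching table 9.2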

With respect to the efficiency of multigrid methods we note the following. The rate of convergence will increase if ν_1 + ν_2 or τ is increased. However, in that case the arithmetic costs per iteration will also grow. Analysis of the dependence of the multigrid contraction number on ν_1, ν_2, τ and numerical experiments have shown that for many problems we obtain an efficient method if we take ν_1 + ν_2 ∈ {1, 2, 3, 4} and τ ∈ {1, 2}. In other words, in general many (> 4) smoothing iterations or more than two recursive calls in (9.25) will make a multigrid method less efficient.

Stopping criterion. In general, for the discrete solution x_ℓ (with corresponding finite element function u_ℓ = P_ℓ x_ℓ) we have a discretization error, so it does not make sense to solve the discrete problem to machine accuracy. For a large class of elliptic boundary value problems the following estimate for the discretization error holds: ||u − u_ℓ|| ≤ c h_ℓ². If in the multigrid iteration one has an arbitrary starting vector (e.g., 0) then the error reduction factor R should be taken proportional to h_ℓ^{−2}. Using the multigrid iteration MGM_ℓ one then needs approximately ln R / |ln ξ| ≈ ln(c h_ℓ^{−2}) / |ln ξ| ≈ ĉ ln n_ℓ / |ln ξ| iterations to obtain an approximation with the desired accuracy. Per iteration we need O(n_ℓ) flops. Hence we conclude: When we use a multigrid method for computing an approximation of u_ℓ with accuracy comparable to the discretization error in u_ℓ, the arithmetic costs are of the order

c n_ℓ ln n_ℓ flops .   (9.64)

Multigrid and nested iteration. For an analysis of the multigrid method used in a nested iteration we refer to Hackbusch [44]. From this analysis it follows that a small fixed number of MGM_ℓ iterations (i.e. k in (9.62)) on each level ℓ in the nested iteration method is sufficient to obtain an approximation of x_ℓ with accuracy comparable to the discretization error in x_ℓ.
The arithmetic costs of this combined multigrid and nested iteration method are of the order

c / |ln ξ| · n_ℓ̄ flops.   (9.65)

When we compare the costs in (9.64) with the costs in (9.65) we see that using the nested iteration approach results in a more efficient algorithm. From the work estimate in (9.65) we conclude: Using multigrid in combination with nested iteration we can compute an approximation of x_ℓ̄ with accuracy comparable to the discretization error in x_ℓ̄ and with arithmetic costs C n_ℓ̄ flops (C independent of ℓ̄).

Example 9.7.2 To illustrate the behaviour of multigrid in combination with nested iteration we show numerical results for an example from Hackbusch [44]. In the Poisson problem as in example 9.7.1 we take boundary conditions and a righthand side such that the solution is given by u(x, y) = ½ y³/(x + 1), so we consider:

−Δu = −( 3y/(x + 1) + y³/(x + 1)³ )  in Ω = (0, 1)² ,
u(x, y) = ½ y³/(x + 1)  on ∂Ω .

For the discretization we apply linear finite elements on a family of nested uniform triangulations with mesh size h_ℓ = 2^{−(ℓ+1)}. The discrete solution on level ℓ is denoted by u_ℓ. The discretization error, measured in a weighted Euclidean norm, is given in table 9.3.

h               1/8           1/16          1/32          1/64
||u_ℓ − u||    2.64·10⁻⁵     6.89·10⁻⁶     1.74·10⁻⁶     4.36·10⁻⁷

Table 9.3: Discretization errors.

From these results one can observe a c h² behaviour of the discretization error. We apply the nested iteration approach of section 9.6 in combination with the multigrid method. We start with a coarsest triangulation T_{h_0} with mesh size h_0 = ½ (this contains only one interior grid point). For the prolongation used in the nested iteration we take the prolongation p_ℓ as in the multigrid method (linear interpolation). When we apply only one multigrid iteration on each level (i.e. k = 1 in (9.63)) we obtain approximations x_ℓ^0 (= p_ℓ x_{ℓ-1}^1) and x_ℓ^1 of x_ℓ (= P_ℓ^{-1} u_ℓ) (cf. figure 9.7). The errors in these approximations are given in table 9.4. In that table we also give the errors for the case with two multigrid iterations on each level (i.e., k = 2 in (9.63)). Comparing the results in table 9.4 with the discretization errors given in table 9.3 we see that we only need two multigrid iterations on each grid to compute an approximation of x_ℓ (0 ≤ ℓ ≤ ℓ̄) with accuracy comparable to the discretization error in x_ℓ.


9.8 Algebraic multigrid methods

9.9 Nonlinear multigrid

h        ||x_ℓ^i − x_ℓ||_2 , k = 1        ||x_ℓ^i − x_ℓ||_2 , k = 2
1/8      x_2^0:  7.24·10⁻³                x_2^0:  6.47·10⁻³
         x_2^1:  5.98·10⁻⁴                x_2^1:  4.92·10⁻⁴
                                          x_2^2:  2.86·10⁻⁵
1/16     x_3^0:  2.09·10⁻³                x_3^0:  1.73·10⁻³
         x_3^1:  1.30·10⁻⁴                x_3^1:  9.91·10⁻⁵
                                          x_3^2:  4.91·10⁻⁶
1/32     x_4^0:  5.17·10⁻⁴                x_4^0:  4.43·10⁻⁴
         x_4^1:  2.54·10⁻⁵                x_4^1:  1.82·10⁻⁵
                                          x_4^2:  8.52·10⁻⁷
1/64     x_5^0:  1.23·10⁻⁴                x_5^0:  1.12·10⁻⁴
         x_5^1:  4.76·10⁻⁶                x_5^1:  3.25·10⁻⁶
                                          x_5^2:  1.47·10⁻⁷

Table 9.4: Errors in nested iteration.

Chapter 10

Iterative methods for saddle-point problems

In this chapter we discuss a class of iterative methods for solving a linear system with a matrix of the form

K = [ A   B^T ]
    [ B   0   ]                                            (10.1)

A ∈ R^{m×m} symmetric positive definite,  B ∈ R^{n×m},  rank(B) = n < m

The so-called Schur complement matrix is given by S := B A^{-1} B^T. Note that S is symmetric positive definite. The symmetric matrix K is (strongly) indefinite:

Lemma 10.0.1 The matrix K has m strictly positive and n strictly negative eigenvalues.

Proof. From the factorization

K = [ I         0 ] [ A   0  ] [ I   A^{-1} B^T ]
    [ B A^{-1}  I ] [ 0   −S ] [ 0   I          ]

it follows that K is congruent to the matrix blockdiag(A, −S), which has m strictly positive and n strictly negative eigenvalues. Now apply Sylvester's inertia theorem.

Remark 10.0.2 Consider a linear system of the form

K ( v )   ( f_1 )
  ( w ) = ( f_2 )                                          (10.2)

with K as in (10.1). Define the functional L : R^m × R^n → R by L(v, w) = ½⟨Av, v⟩ + ⟨Bv, w⟩ − ⟨f_1, v⟩ − ⟨f_2, w⟩. Using the same arguments as in the proof of theorem 2.4.2 one can easily show that (v*, w*) is a solution of the problem (10.2) iff

L(v*, w) ≤ L(v*, w*) ≤ L(v, w*)  for all v ∈ R^m, w ∈ R^n

Due to this property the linear system (10.2) is called a saddle-point problem.
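The saddle-point characterization is easy to verify numerically. The following Python sketch builds a small random saddle-point system (all data are hypothetical test values) and checks the two inequalities at randomly sampled points:

import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
G = rng.standard_normal((m, m)); A = G @ G.T + m * np.eye(m)   # s.p.d.
B = rng.standard_normal((n, m))                                # rank n < m a.s.
f1, f2 = rng.standard_normal(m), rng.standard_normal(n)

K = np.block([[A, B.T], [B, np.zeros((n, n))]])
sol = np.linalg.solve(K, np.concatenate([f1, f2]))
v_star, w_star = sol[:m], sol[m:]

L = lambda v, w: 0.5 * v @ A @ v + (B @ v) @ w - f1 @ v - f2 @ w

for _ in range(5):   # random perturbations: L(v*, w) <= L(v*, w*) <= L(v, w*)
    v = v_star + rng.standard_normal(m)
    w = w_star + rng.standard_normal(n)
    assert L(v_star, w) <= L(v_star, w_star) + 1e-10 <= L(v, w_star) + 1e-10
print("saddle-point inequalities verified on random samples")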
In section 8.3 we discussed the preconditioned MINRES method for solving a linear system with
a symmetric indefinite matrix. This method can be applied to the system in (10.2). Recall that
the preconditioner must be symmetric positive definite. In section 10.1 we analyze a particular
preconditioning technique for the matrix K. In section 10.2 we apply these methods to the
discrete Stokes problem.

10.1 Block diagonal preconditioning
In this section we analyze the effect of symmetric preconditioning of the matrix K in (10.1) with a block diagonal matrix

M := [ M_A   0   ]
     [ 0     M_S ]

M_A ∈ R^{m×m}, M_A = M_A^T > 0 ;  M_S ∈ R^{n×n}, M_S = M_S^T > 0

The preconditioned matrix is given by

K̃ = M^{-1/2} K M^{-1/2} = [ Ã   B̃^T ]
                           [ B̃   0   ]

Ã := M_A^{-1/2} A M_A^{-1/2} ,  B̃ := M_S^{-1/2} B M_A^{-1/2}

We first consider a very special preconditioner, which in a certain sense is optimal:

Lemma 10.1.1 For M_A = A and M_S = S we have

σ(K̃) = { ½(1 − √5) , 1 , ½(1 + √5) }

Proof. Note that

K̃ = [ I   B̃^T ] ,  B̃ = S^{-1/2} B A^{-1/2}
     [ B̃   0   ]

The matrix B̃ has a nontrivial kernel. For v ∈ ker(B̃), v ≠ 0, we have K̃ (v, 0)^T = (v, 0)^T and thus 1 ∈ σ(K̃). For λ ∈ σ(K̃), λ ≠ 1, we get

[ I   B̃^T ] ( v )     ( v )
[ B̃   0   ] ( w ) = λ ( w ) ,   w ≠ 0

This holds iff λ(λ − 1) ∈ σ(B̃ B̃^T) = σ(I) = {1} and thus λ = ½(1 ± √5).

Note that from the result in (8.41) it follows that the preconditioned MINRES method with the preconditioner as in lemma 10.1.1 yields (in exact arithmetic) the exact solution in at most three iterations. In most applications (e.g., the Stokes problem) it is very costly to solve linear systems with the matrices A and S. Hence this preconditioner is not feasible. Instead we will use approximations M_A of A and M_S of S. The quality of these approximations is measured by the following spectral inequalities, with γ_A, γ_S > 0:

γ_A M_A ≤ A ≤ Γ_A M_A
γ_S M_S ≤ S ≤ Γ_S M_S                                      (10.3)

Using an analysis as in [77, 82] we obtain a result for the eigenvalues of the preconditioned matrix:

Theorem 10.1.2 For the matrix K̃ with preconditioners that satisfy (10.3) we have:

σ(K̃) ⊂ [ ½( γ_A − √(γ_A² + 4Γ_S Γ_A) ) , ½( γ_A − √(γ_A² + 4γ_S γ_A) ) ] ∪ [ γ_A , ½( Γ_A + √(Γ_A² + 4Γ_S Γ_A) ) ]
Proof. We use the following inequalities

γ_A I ≤ Ã ≤ Γ_A I   (10.4a)
γ_A A^{-1} ≤ M_A^{-1} ≤ Γ_A A^{-1}   (10.4b)
γ_S I ≤ M_S^{-1/2} S M_S^{-1/2} ≤ Γ_S I   (10.4c)

Note that B̃ B̃^T = M_S^{-1/2} B M_A^{-1} B^T M_S^{-1/2}. Using (10.4b) and (10.4c) we get

γ_A γ_S I ≤ B̃ B̃^T ≤ Γ_A Γ_S I   (10.5)

Take λ ∈ σ(K̃). Then λ ≠ 0 and there exists (v, w) ≠ (0, 0) such that

Ã v + B̃^T w = λ v
B̃ v = λ w                                                  (10.6)

From v = 0 it follows that w = 0; hence, v ≠ 0 must hold. From (10.6) we obtain (Ã + λ^{-1} B̃^T B̃) v = λ v and thus λ ∈ σ(Ã + λ^{-1} B̃^T B̃). Note that σ(B̃^T B̃) = σ(B̃ B̃^T) ∪ {0}. We first consider the case λ > 0. Using (10.5) and (10.4a) we get

γ_A I ≤ Ã + λ^{-1} B̃^T B̃ ≤ ( Γ_A + λ^{-1} Γ_S Γ_A ) I

and thus γ_A ≤ λ ≤ Γ_A + λ^{-1} Γ_S Γ_A holds. This yields

λ ∈ [ γ_A , ½( Γ_A + √(Γ_A² + 4Γ_S Γ_A) ) ]

We now consider the case λ < 0. From (10.5) and (10.4a) it follows that

Ã + λ^{-1} B̃^T B̃ ≥ ( γ_A + λ^{-1} Γ_S Γ_A ) I

and thus λ ≥ γ_A + λ^{-1} Γ_S Γ_A. This yields λ ≥ ½( γ_A − √(γ_A² + 4Γ_S Γ_A) ). Finally we derive an upper bound for λ < 0. We introduce μ := −λ > 0. From (10.6) it follows that for λ < 0, w ≠ 0 must hold. Furthermore, we have

B̃ (Ã + μ I)^{-1} B̃^T w = μ w

and thus μ ∈ σ( B̃ (Ã + μ I)^{-1} B̃^T ). From I + μ Ã^{-1} ≤ (1 + μ γ_A^{-1}) I and (10.4c) we obtain

B̃ (Ã + μ I)^{-1} B̃^T = B̃ Ã^{-1/2} (I + μ Ã^{-1})^{-1} Ã^{-1/2} B̃^T ≥ (1 + μ/γ_A)^{-1} B̃ Ã^{-1} B̃^T
                      = (1 + μ/γ_A)^{-1} M_S^{-1/2} S M_S^{-1/2} ≥ (1 + μ/γ_A)^{-1} γ_S I

We conclude that μ ≥ (1 + μ/γ_A)^{-1} γ_S holds. Hence, for λ = −μ we get λ ≤ ½( γ_A − √(γ_A² + 4γ_S γ_A) ).

Remark 10.1.3 Note that if A = A = S = S = 1, i.e., MA = A and MS = S, we obtain


(K) = { 12 (1 5)} [ 1 , 21 (1 + 5) ], which is sharp (cf. lemma 10.1.1). 
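The eigenvalue inclusion of theorem 10.1.2 can be illustrated numerically. In the following Python sketch the matrices and the (deliberately crude, diagonal) preconditioners M_A, M_S are hypothetical test data; the constants γ_A, Γ_A, γ_S, Γ_S are computed as the extreme generalized eigenvalues, so that (10.3) holds sharply.

import numpy as np

def inv_sqrt(M):
    # inverse square root of an s.p.d. matrix via its eigendecomposition
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(1)
m, n = 8, 3
G = rng.standard_normal((m, m)); A = G @ G.T + m * np.eye(m)
B = rng.standard_normal((n, m))
S = B @ np.linalg.inv(A) @ B.T

MA = np.diag(np.diag(A))                 # crude s.p.d. approximations of A and S
MS = np.diag(np.diag(S))

WA, WS = inv_sqrt(MA), inv_sqrt(MS)
gA, GA = np.linalg.eigvalsh(WA @ A @ WA)[[0, -1]]   # gamma_A, Gamma_A in (10.3)
gS, GS = np.linalg.eigvalsh(WS @ S @ WS)[[0, -1]]   # gamma_S, Gamma_S in (10.3)

K = np.block([[A, B.T], [B, np.zeros((n, n))]])
W = np.block([[WA, np.zeros((m, n))], [np.zeros((n, m)), WS]])
lam = np.sort(np.linalg.eigvalsh(W @ K @ W))        # spectrum of the preconditioned matrix

neg = (0.5*(gA - np.sqrt(gA**2 + 4*GS*GA)), 0.5*(gA - np.sqrt(gA**2 + 4*gS*gA)))
pos = (gA, 0.5*(GA + np.sqrt(GA**2 + 4*GS*GA)))
print(neg, lam[:n])   # the n negative eigenvalues lie in the first interval
print(pos, lam[n:])   # the m positive eigenvalues lie in the second interval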

10.2 Application to the Stokes problem
In this section the results of the previous sections are applied to the discretized Stokes problem that is treated in section 5.2. We consider the Galerkin discretization of the Stokes problem with Hood-Taylor finite element spaces

(V_h, M_h) = ( (X_{h,0}^k)^d , X_h^{k-1} ∩ L_0^2(Ω) ) ,  k ≥ 2

Here we use the notation d for the dimension of the velocity vector (∈ R^d). For the bases in these spaces we use standard nodal basis functions. In the velocity space V_h = (X_{h,0}^k)^d the set of basis functions is denoted by (ψ_i)_{1≤i≤m}. Each ψ_i is a vector function in R^d with d − 1 components identically zero. The basis in the pressure space M_h = X_h^{k-1} ∩ L_0^2(Ω) is denoted by (φ_i)_{1≤i≤n}. The corresponding isomorphisms are given by

P_{h,1} : R^m → V_h ,  P_{h,1} v = Σ_{i=1}^m v_i ψ_i
P_{h,2} : R^n → M_h ,  P_{h,2} w = Σ_{i=1}^n w_i φ_i

The stiffness matrix for the Stokes problem is given by

K = [ A   B^T ] ∈ R^{(m+n)×(m+n)} ,  with
    [ B   0   ]

⟨A v, ṽ⟩ = a(P_{h,1} v, P_{h,1} ṽ) = ∫_Ω ∇(P_{h,1} v) · ∇(P_{h,1} ṽ) dx ,  v, ṽ ∈ R^m
⟨B v, w⟩ = b(P_{h,1} v, P_{h,2} w) = −∫_Ω P_{h,2} w div(P_{h,1} v) dx ,  v ∈ R^m, w ∈ R^n

The matrix A = blockdiag(A_1, . . . , A_d) is symmetric positive definite and A_1 = . . . = A_d =: A is the stiffness matrix corresponding to the Galerkin discretization of the Poisson equation in the space X_{h,0}^k of simplicial finite elements.
We now discuss preconditioners for the matrix A and the Schur complement S = B A^{-1} B^T. The preconditioner for the matrix A is based on a symmetric multigrid method applied to the diagonal block A. Let C_{MG} be the iteration matrix of a symmetric multigrid method applied to the matrix A, as defined in section 9.4.5. The matrix M_{MG} is defined by C_{MG} =: I − M_{MG}^{-1} A. This matrix, although not explicitly available, can be used as a preconditioner for A. For given y the vector M_{MG}^{-1} y is the result of one multigrid iteration with starting vector equal to zero applied to the system A v = y.
From the analysis in section 9.4.5 it follows that M_{MG} is symmetric and under certain reasonable assumptions we have σ(I − M_{MG}^{-1} A) ⊂ [0, ξ_{MG}] with the contraction number ξ_{MG} < 1 independent of the mesh size parameter h. For the preconditioner M_A of A we take

M_A := blockdiag(M_{MG}, . . . , M_{MG})   (d blocks)

For this preconditioner we then have the following spectral inequalities

(1 − ξ_{MG}) M_A ≤ A ≤ M_A ,  with ξ_{MG} < 1 independent of h   (10.7)
For the preconditioner M_S of the Schur complement S we use the mass matrix in the pressure space, which is defined by

⟨M_S w, z⟩ = ⟨P_{h,2} w, P_{h,2} z⟩_{L²}  for all w, z ∈ R^n   (10.8)

This mass matrix is symmetric positive definite and (after diagonal scaling, cf. section 3.5.1) in general well-conditioned. In practice the linear systems of the form M_S w = q are solved approximately by applying a few iterations of an iterative solver (for example, CG). We recall the stability property of the Hood-Taylor finite element pair (V_h, M_h) (cf. section 5.2.1):

∃ β > 0 :  sup_{u_h ∈ V_h} b(u_h, q_h) / ||u_h||_1 ≥ β ||q_h||_{L²}  for all q_h ∈ M_h   (10.9)

with β independent of h. Using this stability property we get the following spectral inequalities for the preconditioner M_S:

Theorem 10.2.1 Let M_S be the pressure mass matrix defined in (10.8). Assume that the stability property (10.9) holds. Then

β² M_S ≤ S ≤ d M_S   (10.10)

holds.
Proof. For w ∈ R^n we have:

max_{v∈R^m} ⟨Bv, w⟩ / ⟨Av, v⟩^{1/2} = max_{v∈R^m} ⟨B A^{-1/2} v, w⟩ / ||v||
                                   = max_{v∈R^m} ⟨v, A^{-1/2} B^T w⟩ / ||v||
                                   = ||A^{-1/2} B^T w|| = ⟨Sw, w⟩^{1/2}

Hence, for arbitrary w ∈ R^n:

⟨Sw, w⟩^{1/2} = max_{u_h ∈ V_h} b(u_h, P_{h,2} w) / |u_h|_1   (10.11)

Using this and the stability bound (10.9) we get

⟨Sw, w⟩^{1/2} ≥ β ||P_{h,2} w||_{L²} = β ⟨M_S w, w⟩^{1/2}

and thus the first inequality in (10.10) holds. Note that

|b(u_h, P_{h,2} w)| ≤ ||div u_h||_{L²} ||P_{h,2} w||_{L²} ≤ √d |u_h|_1 ||P_{h,2} w||_{L²} = √d |u_h|_1 ⟨M_S w, w⟩^{1/2}

holds. Combining this with (10.11) proves the second inequality in (10.10).
Corollary 10.2.2 Suppose that for solving a discrete Stokes problem with stiffness matrix K we use a preconditioned MINRES method with preconditioners M_A (for A) and M_S (for S) as defined above. Then the inequalities (10.3) hold with constants γ_A, Γ_A, γ_S, Γ_S that are independent of h. From theorem 10.1.2 it follows that the spectrum of the preconditioned matrix K̃ is contained in a set [a, b] ∪ [c, d] with a < b < 0 < c < d, all independent of h, and with b − a = d − c. From (8.42) we then conclude that the residual reduction factor can be bounded by a constant smaller than one independent of h.

Appendix A

Functional Analysis

A.1 Different types of spaces


Below we give some definitions of elementary notions from functional analysis (cf. for example
Kreyszig [55]). We restrict ourselves to real spaces, i.e. for the scalar field we take R.

Real vector space. A real vector space is a set X of elements, called vectors, together with
the algebraic operations vector addition and multiplication of vectors by real scalars. Vector
addition should be commutative and associative. Multiplication by scalars should be associative
and distributive.
Example A.1.1 Examples of real vector spaces are R^n and C([a, b]).
Normed space. A normed space is a vector space X with a norm defined on it. Here a norm on a vector space X is a real-valued function on X whose value at x ∈ X is denoted by ||x|| and which has the properties

||x|| ≥ 0
||x|| = 0 ⟺ x = 0
||αx|| = |α| ||x||                                         (A.1)
||x + y|| ≤ ||x|| + ||y||

for arbitrary x, y ∈ X, α ∈ R.
Example A.1.2 Examples of normed spaces are

(R^n, ||·||_∞)  with ||x||_∞ = max_{1≤i≤n} |x_i| ,
(R^n, ||·||_2)  with ||x||_2² = Σ_{i=1}^n x_i² ,
(C([a, b]), ||·||_∞)  with ||f||_∞ = max_{t∈[a,b]} |f(t)| ,
(C([a, b]), ||·||_{L²})  with ||f||_{L²} = ( ∫_a^b f(t)² dt )^{1/2} .

Banach space. A Banach space is a complete normed space. This means that in X every
Cauchy sequence, in the metric defined by the norm, has a limit which is an element of X.

Example A.1.3 Examples of Banach spaces are:

(R^n, ||·||_2) ,
(R^n, ||·||)  with any norm ||·|| on R^n ,
(C([a, b]), ||·||_∞) .

The completeness of the space in the second example follows from the fact that on a finite dimensional space all norms are equivalent. The completeness of the space in the third example is a consequence of the following theorem: The limit of a uniformly convergent sequence of continuous functions is a continuous function.

Remark A.1.4 The space (C([a, b]), ||·||_{L²}) is not complete. Consider for example the sequence f_n ∈ C([0, 1]), n ≥ 1, defined by

f_n(t) = 0            if t ≤ ½
f_n(t) = n(t − ½)     if ½ ≤ t ≤ ½ + 1/n
f_n(t) = 1            if ½ + 1/n ≤ t ≤ 1 .

Then for m, n ≥ N we have

||f_n − f_m||²_{L²} = ∫_0^1 |f_n(t) − f_m(t)|² dt ≤ ∫_{1/2}^{1/2 + 1/N} 1 dt = 1/N .

So (f_n)_{n≥1} is a Cauchy sequence. For the limit function f we would have

f(t) = 0  if 0 ≤ t ≤ ½ ,   f(t) = 1  if ½ + ε ≤ t ≤ 1 ,

for arbitrary ε > 0. So f cannot be continuous.

Inner product space. An inner product space is a (real) vector space X with an inner product defined on X. For such an inner product we need a mapping of X × X into R, i.e. with every pair of vectors x and y from X there is associated a scalar denoted by ⟨x, y⟩. This mapping is called an inner product on X if for arbitrary x, y, z ∈ X and α ∈ R the following holds:

⟨x, x⟩ ≥ 0                                                 (A.2)
⟨x, x⟩ = 0 ⟺ x = 0                                         (A.3)
⟨x, y⟩ = ⟨y, x⟩                                            (A.4)
⟨αx, y⟩ = α⟨x, y⟩                                          (A.5)
⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩ .                              (A.6)

An inner product defines a norm on X:

||x|| = √⟨x, x⟩ .

An inner product and the corresponding norm satisfy the Cauchy-Schwarz inequality:

|⟨x, y⟩| ≤ ||x|| ||y||  for all x, y ∈ X.                  (A.7)

Example A.1.5 Examples of inner product spaces are:

R^n  with ⟨x, y⟩ = Σ_{i=1}^n x_i y_i ,
C([a, b])  with ⟨f, g⟩ = ∫_a^b f(t) g(t) dt .

Hilbert space. An inner product space which is complete is called a Hilbert space.

Example A.1.6 Examples of Hilbert spaces are:

R^n  with ⟨x, y⟩ = Σ_{i=1}^n x_i y_i ,
L²([a, b])  with ⟨f, g⟩ = ∫_a^b f(t) g(t) dt .

We note that the space C([a, b]) with the inner product (and corresponding norm) as in Example A.1.5 results in the normed space (C([a, b]), ||·||_{L²}). In Remark A.1.4 it is shown that this space is not complete. Thus the inner product space C([a, b]) as in Example A.1.5 is not a Hilbert space.

Completion. Let X be a Banach space and Y a subspace of X. The closure Ȳ of Y in X is defined as the set of accumulation points of Y in X, i.e. x ∈ Ȳ if and only if there is a sequence (x_n)_{n≥1} in Y such that x_n → x. If Y = Ȳ holds, then Y is called closed (in X). The subspace Y is called dense in X if Ȳ = X.
Let (Z, ||·||) be a given normed space. Then there exists a Banach space (X, ||·||_X) (which is unique, except for isometric isomorphisms) such that Z ⊂ X, ||x||_X = ||x|| for all x ∈ Z, and Z is dense in X. The space X is called the completion of Z.

The space L²(Ω). Let Ω be a domain in R^n. We denote by L²(Ω) the space of all Lebesgue measurable functions f : Ω → R for which

||f||_0 := ||f||_{L²} := ( ∫_Ω |f(x)|² dx )^{1/2} < ∞

In this space functions are identified that are equal almost everywhere (a.e.) on Ω. The elements of L²(Ω) are thus actually equivalence classes of functions. One writes f = 0 [f = g] if f(x) = 0 [f(x) = g(x)] a.e. in Ω. The space L²(Ω) with

⟨f, g⟩ = ∫_Ω f(x) g(x) dx

is a Hilbert space. The space C_0^∞(Ω) (all functions in C^∞(Ω) which have a compact support in Ω) is dense in L²(Ω): the closure of C_0^∞(Ω) with respect to ||·||_0 equals L²(Ω). In other words, the completion of the normed space (C_0^∞(Ω), ||·||_0) results in the space L²(Ω).
Dual space. Let (X, ||·||) be a normed space. The set of all bounded linear functionals f : X → R forms a real vector space. On this space we can define the norm:

||f|| := sup{ |f(x)| / ||x||  :  x ∈ X, x ≠ 0 }.

This results in a normed space called the dual space of X and denoted by X′.

Bounded linear operators. Let (X, ||·||_X) and (Y, ||·||_Y) be normed spaces and T : X → Y be a linear operator. The (operator) norm of T is defined by

||T||_{Y←X} := sup{ ||Tx||_Y / ||x||_X  :  x ∈ X, x ≠ 0 }.

The operator T is bounded iff ||T||_{Y←X} < ∞. The operator T is called an isomorphism if T is bijective (i.e., injective and surjective) and both T and T^{-1} are bounded.

Compact linear operators. Let X and Y be Banach spaces and T : X → Y a linear bounded operator. The operator T is compact if for every bounded set A in X the image set B := T(A) is precompact in Y (this means that every sequence in B must contain a convergent subsequence).

Continuous embedding; compact embedding. Let X and Y be normed spaces. A linear operator I : X → Y is called a continuous embedding if I is bounded (or, equivalently, continuous, see theorem A.2.2) and injective. The embedding is called compact if I is continuous and compact. An equivalent characterization of a compact embedding is that for every bounded sequence (x_k)_{k≥1} in X the image sequence (I x_k)_{k≥1} has a subsequence that is a Cauchy sequence in Y.

A.2 Theorems from functional analysis


Below we give a few classical results from functional analysis (cf. for example Kreyszig [55]).

Theorem A.2.1 (Arzelà-Ascoli.) A subset K of C(Ω̄) is precompact (i.e., every sequence has a convergent subsequence) if and only if the following two conditions hold:

(i) ∃ M :  ||f||_∞ ≤ M for all f ∈ K   (K is bounded)

(ii) ∀ ε > 0 ∃ δ > 0 :  ∀ f ∈ K :  |f(x) − f(y)| < ε for all x, y with ||x − y|| < δ
     (K is uniformly equicontinuous)

Theorem A.2.2 (Boundedness of linear operators.) Let X and Y be normed spaces and T : X → Y a linear mapping. Then T is bounded if and only if T is continuous.

Theorem A.2.3 (Extension of operators.) Let X be a normed space and Y a Banach space. Suppose X_0 is a dense subspace of X and T : X_0 → Y a bounded linear operator. Then there exists a unique extension T̃ : X → Y with the properties

(i) T x = T̃ x for all x ∈ X_0

(ii) If (x_k)_{k≥1} ⊂ X_0, x ∈ X and lim_{k→∞} x_k = x, then T̃ x = lim_{k→∞} T x_k

(iii) ||T̃||_{Y←X} = ||T||_{Y←X_0}

Theorem A.2.4 (Banach fixed point theorem.) Let (X, ||·||) be a Banach space. Let F : X → X be a (possibly nonlinear) contraction, i.e. there is a constant γ < 1 such that for all x, y ∈ X:

||F(x) − F(y)|| ≤ γ ||x − y||.

Then there exists a unique x ∈ X (called a fixed point) such that

F(x) = x

holds.
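A one-line illustration (a sketch, not part of the theory): the map F(x) = cos(x) is a contraction on [0, 1] (its derivative is bounded by sin(1) < 1 there and it maps [0, 1] into itself), so the fixed point iteration converges to the unique solution of cos(x) = x.

import math

x = 0.0
for _ in range(100):
    x = math.cos(x)
print(x)   # approx. 0.739085..., the unique fixed point of cos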

Theorem A.2.5 (Corollary of the open mapping theorem.) Let X and Y be Banach spaces and T : X → Y a linear bounded operator which is bijective. Then T^{-1} is bounded, i.e., T : X → Y is an isomorphism.

Corollary A.2.6 Let X and Y be Banach spaces and T : X → Y a linear bounded operator which is injective. Let R(T) = { T x : x ∈ X } be the range of T. Then T^{-1} : R(T) → X is bounded if and only if R(T) is closed (in Y).

Theorem A.2.7 (Orthogonal decomposition.) Let U ⊂ H be a closed subspace of a Hilbert space H. Let U^⊥ = { x ∈ H : ⟨x, y⟩ = 0 for all y ∈ U } be the orthogonal complement of U. Then H can be decomposed as H = U ⊕ U^⊥, i.e., every x ∈ H has a unique representation x = u + v, u ∈ U, v ∈ U^⊥. Moreover, the identity ||x||² = ||u||² + ||v||² holds.

Theorem A.2.8 (Riesz representation theorem.) Let H be a Hilbert space with inner product denoted by ⟨·,·⟩ and corresponding norm ||·||_H. Let f be an element of the dual space H′, with norm ||f||_{H′}. Then there exists a unique w ∈ H such that

f(x) = ⟨w, x⟩  for all x ∈ H.

Furthermore, ||w||_H = ||f||_{H′} holds. The linear operator J_H : f → w is called the Riesz isomorphism.

Corollary A.2.9 Let H be a Hilbert space with inner product denoted by ⟨·,·⟩. The bilinear form

⟨f, g⟩_{H′} := ⟨J_H f, J_H g⟩ ,  f, g ∈ H′

defines a scalar product on H′. The space H′ with this scalar product is a Hilbert space.

Appendix B

Linear Algebra

B.1 Notions from linear algebra


Below we give some definitions and elementary notions from linear algebra. The collection of all real n × n matrices is denoted by R^{n×n}.

For A = (a_ij)_{1≤i,j≤n} ∈ R^{n×n} the transpose A^T is defined by A^T = (a_ji)_{1≤i,j≤n}.

Spectrum, spectral radius. By σ(A) we denote the spectrum of A, i.e. the collection of all eigenvalues of the matrix A. Note that in general σ(A) contains complex numbers (even for A real). We use the notation ρ(A) for the spectral radius of A ∈ R^{n×n}:

ρ(A) := max{ |λ| : λ ∈ σ(A) }.

Vector norms. On R^n we can define a norm ||·||, i.e. a real-valued function on R^n with properties as in (A.1). Important examples of such norms are:

||x||_1 := Σ_{i=1}^n |x_i|   (1-norm),                     (B.1)
||x||_2 := ( Σ_{i=1}^n x_i² )^{1/2}   (2-norm or Euclidean norm),  (B.2)
||x||_∞ := max_{1≤i≤n} |x_i|   (maximum norm).             (B.3)

Cauchy-Schwarz inequality. On R^n we can define an inner product by ⟨x, y⟩ := x^T y. The norm corresponding to this inner product is the Euclidean norm (B.2). The Cauchy-Schwarz inequality (A.7) takes the form:

|x^T y| ≤ ||x||_2 ||y||_2  for all x, y ∈ R^n.             (B.4)

Matrix norms. A matrix norm on R^{n×n} is a real-valued function whose value at A ∈ R^{n×n} is denoted by ||A|| and which has the properties

||A|| ≥ 0, and ||A|| = 0 iff A = 0
||αA|| = |α| ||A||
||A + B|| ≤ ||A|| + ||B||                                  (B.5)
||AB|| ≤ ||A|| ||B||

for all A, B ∈ R^{n×n} and all α ∈ R. A special class of matrix norms are those induced by a vector norm. For a given vector norm ||·|| on R^n we define an induced matrix norm by:

||A|| := sup{ ||Ax|| / ||x||  :  x ∈ R^n, x ≠ 0 }  for A ∈ R^{n×n}.   (B.6)

Induced by the vector norms in (B.1), (B.2), (B.3) we obtain the matrix norms ||A||_1, ||A||_2, ||A||_∞. From the definition of the induced matrix norm it follows that

||Ax|| ≤ ||A|| ||x||  for all A ∈ R^{n×n}, x ∈ R^n         (B.7)

and that the properties (B.5) hold.

In the same way one can define a matrix norm on C^{n×n}. In this book we will always use real induced matrix norms as defined in (B.6).

Condition number. For a nonsingular matrix A the spectral condition number is defined by

κ(A) := ||A||_2 ||A^{-1}||_2 .

We note that condition numbers can be defined with respect to other matrix norms, too.

Below we introduce some notions related to special properties which matrices A ∈ R^{n×n} may have.
The matrix A is symmetric if A = A^T holds. The matrix A is normal if the equality A^T A = A A^T holds. Note that every symmetric matrix is normal.

A symmetric matrix A is positive definite if x^T A x > 0 holds for all x ≠ 0. In that case, A is said to be symmetric positive definite.

A matrix A is weakly diagonally dominant if the following condition is fulfilled:

Σ_{j≠i} |a_ij| ≤ |a_ii|  for all i, with strict inequality for at least one i.

The matrix A is called irreducible if there does not exist a permutation matrix P such that P^T A P is a two by two block matrix in which the (2, 1) block is a zero block. The matrix A is called irreducibly diagonally dominant if A is irreducible and weakly diagonally dominant.

A matrix A is an M-matrix if it has the following properties:

a_ij ≤ 0 for all i ≠ j ,
A is nonsingular and all the entries in A^{-1} are ≥ 0.

B.2 Theorems from linear algebra


In this section we give some basic results from linear algebra. For the proofs we refer to an
introductory linear algebra text, e.g. Strang [88] or Lancaster and Tismenetsky [58].

Theorem B.2.1 (Results on eigenvalues and eigenvectors). For A, B ∈ R^{n×n} the following results hold:

E1. σ(A) = σ(A^T).

E2. σ(AB) = σ(BA) ,  ρ(AB) = ρ(BA).

E3. A ∈ R^{n×n} is normal if and only if A has an orthogonal basis of eigenvectors. In general these eigenvectors and corresponding eigenvalues are complex, and orthogonality is meant with respect to the complex Euclidean inner product.

E4. If A is symmetric then A has an orthogonal basis of real eigenvectors. Furthermore, all eigenvalues of A are real.

E5. A is symmetric positive definite if and only if A is symmetric and σ(A) ⊂ (0, ∞).

Theorem B.2.2 (Results on matrix norms). For A ∈ R^{n×n} the following results hold:

N1. ||A||_1 = max_{1≤j≤n} Σ_{i=1}^n |a_ij|.

N2. ||A||_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|.

N3. ||A||_2 = √ρ(A^T A).

N4. If A is normal then ||A||_2 = ρ(A).

N5. Let ||·|| be an induced matrix norm. The following inequality holds:

ρ(A) ≤ ||A||.

Using (N4) and (E5) we obtain the following results for the spectral condition number:

κ(A) = ρ(A) ρ(A^{-1}) ,  if A is normal ,                  (B.8)
κ(A) = λ_max / λ_min ,  if A is symmetric positive definite,  (B.9)

with λ_max and λ_min the largest and smallest eigenvalue of A, respectively.
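The formulas N1-N3 and (B.9) are easy to check numerically; a quick sketch (the test matrix is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

n1 = np.max(np.sum(np.abs(A), axis=0))             # N1: maximal column sum
ninf = np.max(np.sum(np.abs(A), axis=1))           # N2: maximal row sum
n2 = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))  # N3: sqrt(rho(A^T A))
print(np.allclose([n1, ninf, n2],
                  [np.linalg.norm(A, 1), np.linalg.norm(A, np.inf),
                   np.linalg.norm(A, 2)]))         # True

S = A @ A.T + 5 * np.eye(5)                        # symmetric positive definite
w = np.linalg.eigvalsh(S)                          # sorted ascending
print(np.isclose(np.linalg.cond(S, 2), w[-1] / w[0]))   # (B.9): True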

Theorem B.2.3 (Jordan normal form). For every A ∈ R^{n×n} there exists a nonsingular matrix T such that A = T Λ T^{-1} with a matrix of the form Λ = blockdiag(Λ_i)_{1≤i≤s},

Λ_i := [ λ_i   1              ]
       [      λ_i   ·.        ]
       [            ·.    1   ]  ∈ C^{k_i × k_i} ,  1 ≤ i ≤ s,
       [                 λ_i  ]

and {λ_1, . . . , λ_s} = σ(A).

Bibliography

[1] R. A. Adams. Sobolev Spaces. Academic Press, 1975.

[2] S. Agmon, A. Douglis, and L. Nirenberg. Estimates near the boundary of solutions of elliptic partial differential equations satisfying general boundary conditions II. Comm. on Pure and Appl. Math., 17:35-92, 1964.

[3] H. W. Alt. Lineare Funktionalanalysis, 2. ed. Springer, Heidelberg, 1992.

[4] D. N. Arnold, F. Brezzi, and M. Fortin. A stable finite element for the Stokes equations. Calcolo, 21:337-344, 1984.

[5] W. E. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quart. Appl. Math., 9:17-29, 1951.

[6] O. Axelsson. Iterative Solution Methods. Cambridge University Press, NY, 1994.

[7] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Theory and Computation. Academic Press, Orlando, 1984.

[8] R. E. Bank and T. F. Chan. A composite step biconjugate gradient method. Numer. Math., 66:295-319, 1994.

[9] R. E. Bank and L. R. Scott. On the conditioning of finite element equations with highly refined meshes. SIAM J. Numer. Anal., 26:1383-1394, 1989.

[10] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, 1994.

[11] M. Bercovier and O. Pironneau. Error estimates for finite element solution of the Stokes problem in primitive variables. Numer. Math., 33:211-224, 1979.

[12] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. Academic Press, NY, 1979.

[13] C. Bernardi. Optimal finite element interpolation on curved domains. SIAM J. Numer. Anal., 26:1212-1240, 1989.

[14] C. Bernardi and V. Girault. A local regularization operator for triangular and quadrilateral finite elements. SIAM J. Numer. Anal., 35:1893-1915, 1998.

[15] D. Boffi. Stability of higher-order triangular Hood-Taylor methods for the stationary Stokes equations. Math. Models Methods Appl. Sci., 4:223-235, 1994.

[16] D. Boffi. Three-dimensional finite element methods for the Stokes problem. SIAM J. Numer. Anal., 34:664-670, 1997.

[17] D. Braess, M. Dryja, and W. Hackbusch. A multigrid method for nonconforming FE-discretisations with application to non-matching grids. Computing, 63:1-25, 1999.

[18] D. Braess and W. Hackbusch. A new convergence proof for the multigrid method including the V-cycle. SIAM J. Numer. Anal., 20:967-975, 1983.

[19] J. H. Bramble. Multigrid Methods. Longman, Harlow, 1993.

[20] J. H. Bramble and S. R. Hilbert. Estimation of linear functionals on Sobolev spaces with applications to Fourier transforms and spline interpolation. SIAM J. Numer. Anal., 7:113-124, 1970.

[21] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods. Springer, New York, 1994.

[22] F. Brezzi and R. S. Falk. Stability of higher-order Hood-Taylor methods. SIAM J. Numer. Anal., 28:581-590, 1991.

[23] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial (2nd ed.). SIAM, Philadelphia, 2000.

[24] O. Bröker, M. Grote, C. Mayer, and A. Reusken. Robust parallel smoothing for multigrid via sparse approximate inverses. SIAM J. Sci. Comput., 32:1395-1416, 2001.

[25] A. M. Bruaset. A Survey of Preconditioned Iterative Methods. Longman, Harlow, 1995.

[26] L. Cattabriga. Su un problema al contorno relativo al sistema di equazioni di Stokes. Rend. Sem. Mat. Univ. Padova, 31:308-340, 1961.

[27] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. North Holland, 1978.

[28] P. G. Ciarlet. Basic error estimates for elliptic problems. In P. G. Ciarlet and J. L. Lions, editors, Handbook of Numerical Analysis, Volume II: Finite Element Methods (Part 1). North Holland, Amsterdam, 1991.

[29] P. Clément. Approximation by finite element functions using local regularization. RAIRO Anal. Numér. (M2AN), 9(R-2):77-84, 1975.

[30] M. Dauge. Stationary Stokes and Navier-Stokes systems on two- or three-dimensional domains with corners. Part I: linearized equations. SIAM J. Math. Anal., 20:74-97, 1989.

[31] G. Duvaut and J. L. Lions. Les Inéquations en Mécanique et en Physique. Dunod, Paris, 1972.

[32] E. Ecker and W. Zulehner. On the smoothing property of multi-grid methods in the non-symmetric case. Numerical Linear Algebra with Applications, 3:161-172, 1996.

[33] V. Faber and T. Manteuffel. Orthogonal error methods. SIAM J. Numer. Anal., 20:352-362, 1984.

[34] M. Fiedler. Special Matrices and their Applications in Numerical Mathematics. Nijhoff, Dordrecht, 1986.

[35] R. Fletcher. Conjugate gradient methods for indefinite systems. In G. A. Watson, editor, Numerical Analysis Dundee 1975, Lecture Notes in Mathematics, Vol. 506, pages 73-89, Berlin, 1976. Springer.

[36] R. W. Freund, G. H. Golub, and N. M. Nachtigal. Iterative solution of linear systems. Acta Numerica, pages 57-100, 1992.

[37] G. Frobenius. Über Matrizen aus nicht negativen Elementen. Preuss. Akad. Wiss., pages 456-477, 1912.

[38] E. Gartland. Strong uniform stability and exact discretizations of a model singular perturbation problem and its finite difference approximations. Appl. Math. Comput., 31:473-485, 1989.

[39] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer, Berlin, Heidelberg, 1977.

[40] V. Girault and P.-A. Raviart. Finite Element Methods for Navier-Stokes Equations, volume 5 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1986.

[41] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 2. edition, 1989.

[42] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997.

[43] M. E. Gurtin. An Introduction to Continuum Mechanics, volume 158 of Mathematics in Science and Engineering. Academic Press, 1981.

[44] W. Hackbusch. Multigrid Methods and Applications, volume 4 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1985.

[45] W. Hackbusch. Theorie und Numerik elliptischer Differentialgleichungen. Teubner, Stuttgart, 1986.

[46] W. Hackbusch. Iterative Lösung großer schwachbesetzter Gleichungssysteme. Teubner, 1991.

[47] W. Hackbusch. Elliptic Differential Equations: Theory and Numerical Treatment, volume 18 of Springer Series in Computational Mathematics. Springer, Berlin, 1992.

[48] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied Mathematical Sciences. Springer, New York, 1994.

[49] W. Hackbusch. A note on Reusken's lemma. Computing, 55:181-189, 1995.

[50] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[51] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stand., 49:409-436, 1952.

[52] P. Hood and C. Taylor. A numerical solution of the Navier-Stokes equations using the finite element technique. Comp. and Fluids, 1:73-100, 1973.

[53] J. Kadlec. On the regularity of the solution of the Poisson problem on a domain with boundary locally similar to the boundary of a convex open set. Czechoslovak Math. J., 14(89):386-393, 1964. (russ.).

[54] R. B. Kellogg and J. E. Osborn. A regularity result for the Stokes problem in a convex polygon. J. Funct. Anal., 21:397-431, 1976.

[55] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, New York, 1978.

[56] O. A. Ladyzhenskaya. Funktionalanalytische Untersuchungen der Navier-Stokesschen Gleichungen. Akademie-Verlag, Berlin, 1965.

[57] O. A. Ladyzhenskaya and N. A. Uraltseva. Linear and Quasilinear Elliptic Equations, volume 46 of Mathematics in Science and Engineering. Academic Press, New York, London, 1968.

[58] P. Lancaster and M. Tismenetsky. The Theory of Matrices. Academic Press, Orlando, 2. edition, 1985.

[59] C. Lanczos. Solution of systems of linear equations by minimized iterations. J. Res. Natl. Bur. Stand., 49:33-53, 1952.

[60] J. L. Lions and E. Magenes. Non-homogeneous Boundary Value Problems and Applications, Vol. I. Springer, Berlin, 1972.

[61] J. T. Marti. Introduction to Sobolev Spaces and Finite Element Solution of Elliptic Boundary Value Problems. Academic Press, London, 1986.

[62] J. A. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148-162, 1977.

[63] N. Meyers and J. Serrin. H = W. Proc. Nat. Acad. Sci. USA, 51:1055-1056, 1964.

[64] C. Miranda. Partial Differential Equations of Elliptic Type. Springer, Berlin, 1970.

[65] J. Nečas. Les Méthodes Directes en Théorie des Équations Elliptiques. Masson, Paris, 1967.

[66] O. Nevanlinna. Convergence of Iterations for Linear Equations. Birkhäuser, Basel, 1993.

[67] R. A. Nicolaides. On a class of finite elements generated by Lagrange interpolation. SIAM J. Numer. Anal., 9:435-445, 1972.

[68] W. Niethammer. The SOR method on parallel computers. Numer. Math., 56:247-254, 1989.

[69] U. Trottenberg, C. W. Oosterlee, and A. Schüller. Multigrid. Academic Press, London, 2001.

[70] C. C. Paige and M. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12:617-629, 1975.

[71] O. Perron. Zur Theorie der Matrizen. Math. Ann., 64:248-263, 1907.

[72] S. D. Poisson. Remarques sur une équation qui se présente dans la théorie des attractions des sphéroïdes. Nouveau Bull. Soc. Philomathique de Paris, 3:388-392, 1813.

[73] A. Quarteroni and A. Valli. Numerical Approximation of Partial Differential Equations, volume 23 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1994.

[74] A. Reusken. On maximum norm convergence of multigrid methods for two-point boundary value problems. SIAM J. Numer. Anal., 29:1569-1578, 1992.

[75] A. Reusken. The smoothing property for regular splittings. In W. Hackbusch and G. Wittum, editors, Incomplete Decompositions: (ILU)-Algorithms, Theory and Applications, volume 41 of Notes on Numerical Fluid Mechanics, pages 130-138, Braunschweig, 1993. Vieweg.

[76] H.-G. Roos, M. Stynes, and L. Tobiska. Numerical Methods for Singularly Perturbed Differential Equations, volume 24 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1996.

[77] T. Rusten and R. Winther. A preconditioned iterative method for saddlepoint problems. SIAM J. Matrix Anal. Appl., 13:887-904, 1992.

[78] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, London, 1996.

[79] Y. Saad and M. H. Schultz. Conjugate gradient-like algorithms for solving nonsymmetric linear systems. Math. Comp., 44:417-424, 1985.

[80] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7:856-869, 1986.

[81] L. R. Scott and S. Zhang. Finite element interpolation of nonsmooth functions satisfying boundary conditions. Math. Comp., 54:483-493, 1990.

[82] D. Silvester and A. Wathen. Fast iterative solution of stabilised Stokes systems. Part II: using general block preconditioners. SIAM J. Numer. Anal., 31:1352-1367, 1994.

[83] M. Sion. On general minimax theorems. Pacific J. of Math., 8:171-176, 1958.

[84] G. L. G. Sleijpen and D. R. Fokkema. BiCGstab(l) for linear equations involving matrices with complex spectrum. ETNA, 1:11-32, 1993.

[85] G. L. G. Sleijpen and H. van der Vorst. Optimal iteration methods for large linear systems of equations. In C. B. Vreugdenhil and B. Koren, editors, Numerical Methods for Advection-Diffusion Problems, volume 45 of Notes on Numerical Fluid Mechanics, pages 291-320. Vieweg, Braunschweig, 1993.

[86] P. Sonneveld. CGS: a fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 10:36-52, 1989.

[87] R. Stenberg. Error analysis of some finite element methods for the Stokes problem. Math. Comp., 54:495-508, 1990.

[88] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, San Diego, 3. edition, 1988.

[89] A. van der Sluis. Condition numbers and equilibration of matrices. Numer. Math., 14:14-23, 1969.

[90] A. van der Sluis and H. van der Vorst. The rate of convergence of conjugate gradients. Numer. Math., 48:543-560, 1986.

[91] H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 13:631-644, 1992.

[92] R. S. Varga. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, New Jersey, 1962.

[93] R. Verfürth. Error estimates for a mixed finite element approximation of the Stokes problem. RAIRO Anal. Numér., 18:175-182, 1984.

[94] R. Verfürth. Robust a posteriori error estimates for stationary convection-diffusion equations. SIAM J. Numer. Anal., 43:1766-1782, 2005.

[95] W. Walter. Gewöhnliche Differentialgleichungen. Heidelberger Taschenbücher. Springer, Berlin, 1972.

[96] P. Wesseling. An Introduction to Multigrid Methods. Wiley, Chichester, 1992.

[97] G. Wittum. Linear iterations as smoothers in multigrid methods: Theory with applications to incomplete decompositions. Impact Comput. Sci. Eng., 1:180-215, 1989.

[98] G. Wittum. On the robustness of ILU-smoothing. SIAM J. Sci. Stat. Comp., 10:699-717, 1989.

[99] J. Wloka. Partial Differential Equations. Cambridge University Press, Cambridge, 1987.

[100] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, NY, 1971.
