
802360A

Introduction to inverse problems (Spring 2014)

Sari Lasanen

March 3, 2014
Introduction to inverse problems (4 op)

Learning outcomes: Upon completion, the student will be able to

recognise several inverse problems

describe typical properties of inverse problems

solve simple inverse problems with accurate and inaccurate data.

Additional material:

1. Jari Kaipio, Erkki Somersalo: Statistical and Computational Inverse Problems. Springer-Verlag (Applied Mathematical Sciences, Vol. 160).

2. Daniela Calvetti, Erkki Somersalo: Introduction to Bayesian Scientific Computing. Ten Lectures on Subjective Computing. Springer (Surveys and Tutorials in the Applied Mathematical Sciences, Vol. 2).

Chapter 1

Inverse problems

Inverse problems belong to the field of applied mathematics, but they also contain elements
from pure mathematics. Several inverse problems contribute both to practical applications
and to the pure mathematics underneath them. Even the best mathematical journals, like
the Annals of Mathematics, contain publications about inverse problems. The main
scientific journals that are dedicated solely to inverse problems are Inverse Problems (IP),
Inverse Problems and Imaging (IPI), Journal of Inverse and Ill-posed Problems and
Inverse Problems in Science and Engineering. These journals are available at the Oulu
University Library, in particular through the Nelli portal (which also has remote access).

1.1 What is an inverse problem?


In inverse problems, one tries to get information about an unknown object by using indi-
rect and possibly inaccurate data. Examples of familiar inverse problems include medical
imaging (CT scans, ultrasound imaging), image enhancement in image processing and
detection of rain with weather radars. In this course, we acquaint ourselves with
mathematical inverse problems and solve some simple inverse problems in practice.
The name inverse problem originates from the fact that one first has to know a direct
problem that tells how the given data y depends on the unknown quantity x. Typically
the data y is produced by some physical phenomenon, and the direct theory is the
mathematical description of that phenomenon. Say, x ↦ F(x) =: y. The data y
and the unknown x are often either vectors or functions.

Definition 1. The mapping F that takes the unknown to the corresponding data is
called the direct theory (also forward mapping).

The inverse problem is to seek the x that has produced y. In layman's terms,

Direct problem: from cause to consequences

Inverse problems: from consequences to the cause.

In mathematical terms, the question is plainly about the determination of the inverse
mapping F^{-1}, but we will see later that inaccurate data makes things more complicated.

1.2 Examples of inverse problems and their typical
properties
Example 1
Direct problem: Add all numbers that are on the same row or on the same column or of
the same color.

? ? ? ? ?
? 1 5 7 ?
? 4 3 8 ?
? 6 2 9 ?

Inverse problem: Determine the numbers when only row, column and color sums are
known.

3 11 10 24 10
13 ? ? ? 13
15 ? ? ? 9
17 ? ? ? 10

Inverse problems are often more difficult than direct problems.

Example 2
Direct problem: Determine the function f ∈ C¹(0, 1) when its derivative f'(t) = 3t² and
the initial value f(0) = 0 are given.

Inverse problem: Determine the derivative of f ∈ C¹(0, 1) when

    f(t) = t³.

This problem is easy to solve, but some difficulties arise when the given data is inaccurate:
say, instead of f we are given

    g(t) = f(t) + (1/100) sin(100t),

which has the derivative

    g'(t) = 3t² + cos(100t).
Solutions of inverse problems are often strongly disturbed by small changes
in data.
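As a quick numerical illustration of this sensitivity, here is a minimal sketch in Python
(the grid and the finite-difference rule are choices of this illustration, not part of the notes):

    import numpy as np

    # Grid on [0, 1], exact data f(t) = t^3 and the perturbed data g from above.
    t = np.linspace(0.0, 1.0, 1001)
    f = t**3
    g = f + np.sin(100.0 * t) / 100.0      # ||g - f|| is at most 1/100

    # Central finite differences approximate the derivatives of f and g.
    df = np.gradient(f, t)
    dg = np.gradient(g, t)

    print("max |g - f|   =", np.max(np.abs(g - f)))    # about 0.01
    print("max |Dg - Df| =", np.max(np.abs(dg - df)))  # close to 1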

Figure 1.1: The disturbed data g is close to f... but the corresponding derivatives are not!

Example 3
In image deblurring, one tries to improve a blurred photo.

Direct problem: Mathematically model how a crisp photo can be transformed into a
blurred photo.

Inverse problem: Transform a blurred photo into a crisp photo.


A black and white digital picture can be represented as a matrix

    M ∈ R^{n×m},

whose elements M_ij represent the pixel color: the larger the value, the lighter the color
(see Figures 1.2 and 1.3).

Figure 1.2: Black and white picture consists of pixels (=rectangular elements of a single
color).


Figure 1.3: A 9 × 9 matrix of grey-shade pixels and the scale of pixel colors.

Image blurring can be modeled with a Gaussian convolution (choose n = m for simplicity)

    M̃_kl = C_kl Σ_{i,j=1}^{n} e^{-(|k-i|²/n² + |l-j|²/n²)/(2σ²)} M_ij,

where k, l = 1, ..., n, σ > 0 describes the width of the blur, and the norming constant is

    C_kl = ( Σ_{i,j=1}^{n} e^{-(|k-i|²/n² + |l-j|²/n²)/(2σ²)} )^{-1}.

Every pixel shade M_kl is mapped to a weighted average M̃_kl. The neighboring pixels have
more weight than distant ones.

Direct problem: determine M̃ when M is known.

Inverse problem: determine M when M̃ is known.

For a small image n, m = 256, but for a good quality picture n and m can be several
thousand, so that the matrix contains millions of elements. In inverse problems, the
unknown is often very high dimensional.
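The blurring model above can be written as one large matrix acting on the vectorized
image. The following sketch builds that matrix for a small image; the value of the width
parameter sigma and the random test image are assumptions of this illustration:

    import numpy as np

    n = 16          # a small n x n image (n = m)
    sigma = 0.05    # assumed blur width parameter

    # 1-D weights exp(-|k-i|^2 / (2 sigma^2 n^2)); the 2-D weight is their product.
    idx = np.arange(n)
    d2 = (idx[:, None] - idx[None, :])**2 / n**2
    w1 = np.exp(-d2 / (2.0 * sigma**2))
    A = np.kron(w1, w1)                     # n^2 x n^2 blur matrix acting on vectorized images
    A /= A.sum(axis=1, keepdims=True)       # rows sum to one: the norming constants C_kl

    M = np.random.rand(n, n)                # a "crisp" test image
    M_blur = (A @ M.ravel()).reshape(n, n)  # the direct problem: blurred image
    print(A.shape)                          # (256, 256) already for a 16 x 16 image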

Example 4
A weather radar transmits electromagnetic pulses in microwave frequencies (5600-5650
MHz, wavelength approximately 5.3 cm). The pulses are reflected back from obstacles,
like rain drops, snowflakes and hail. The weather radar detects the reflected pulses, and
the pulse travel time tells the distance between the transmitter and the obstacle. Typical
maximum detection range of a weather radar is around 250 km. Radar measurements
can be made in several directions by moving the transmitting/receiving antenna. The
magnitude of the reflected signal tells how heavy the rain is. The Doppler effect also
tells how fast the rain drops are moving.

Direct problem: Determine the reflected signal when the distribution of the rain drops
(and their speed) is known.
Inverse problem: Determine the distribution of the rain drops (and their speed) when the
reflected signal is known.
For simplicity, let's consider here the mathematical description of the problem for a
single object. The transmitted signal is of the form

    P e(t) sin(ω₀ t),

where ω₀ is the carrier frequency, P is the transmitted power and e(t) describes the pulse
shape. The equation of motion for a single object is

    r(t) = x₂ + x₃ t + (1/2) x₄ t²,

where x₂ is the distance to the radar, x₃ is the velocity of the object and x₄ is the
acceleration of the object. The reflected signal is modeled as

    z(t) = x₁ e(t − (2/c) x₂) exp( i (2ω₀/c) (x₃ t + (1/2) x₄ t²) ),

where x₁ is the power of the reflected signal and c is the speed of light. The power x₁
satisfies the so-called radar equation

    x₁ = C σ P / ((4π)² x₂⁴),

where C is a constant (independent of the radar) and the radar cross section σ depends
on the reflectivity and size of the object.

Direct problem: Determine t ↦ z(t) when (x₁, x₂, x₃, x₄) are known.

Inverse problem: Determine (x₁, x₂, x₃, x₄) when the function t ↦ z(t) (or some of its
values z(t_k), t_k > 0, k = 1, ..., m) is given.

Figure 1.4: Weather radar image (Finnish Meteorological Institute).

In inverse problems, one uses indirect data about the unknown object.

Other radar imaging applications:


Remote sensing of space debris (ground-based radar observations of shattered
and broken satellites, rocket debris, lost tools etc. that slowly fall towards the Earth.
For example, the International Space Station (ISS) has to evade falling debris once or
twice per year).
Remote sensing of the moon (Electromagnetic pulses reflect from the surface of the
Moon).
Ionospheric research (Auroras, other effects of solar storms, space weather). Utilizes
incoherent scattering, where the transmitted signal forces the ionospheric plasma
to oscillate and thus produce weak electromagnetic signals that are detected at the
ground-based receiver.
Ground penetrating radar. Works at microwave frequencies.

Example 5
In medical computerized tomography (CT scan), one forms slice images of interior parts
of the patient from x-ray data

Figure 1.5: CT scanner (image: Siemens Press Picture).

Different tissues (like muscle and bone) absorb different amounts of x-ray radiation:
their mass absorption coefficients have different values. When the total x-ray absorption
across the body is measured in several different directions, one can form slice images of
the body. Actually, one reconstructs the mass absorption coefficient across the body.

Figure 1.6: Construction of slice images vs. ordinary x-ray images. In ordinary x-
ray image, only absorption directly through the body is seen. In tomography, several
directions are used to create a slice image.

Let (x, y) ↦ f(x, y) be a piecewise continuous function that represents the mass
absorption coefficient at the point (x, y). When I₀ is the intensity of the transmitted x-ray
radiation and I₁ is the intensity of the received x-ray radiation that has traveled through
the body along the path C with parametrization r(t), t ∈ [t₁, t₂], then

    ln(I₀/I₁) = ∫_C f ds = ∫_{t₁}^{t₂} f(r(t)) |r'(t)| dt,

where the term in the middle denotes the path integral of f over the path C. Here we
assume for simplicity that the cross-section of the body is contained in the square S =
[−1, 1] × [−1, 1], so that f(x, y) = 0 when |x| > 1 or |y| > 1.
Figure 1.7: The path integrals of function f are calculated along different paths (like in
this picture) that are straight lines.

For example, if the line segment C = {(x, y) ∈ R² : x = y, −1 ≤ y ≤ 1}, then we can
take r(t) = (t, t), −1 ≤ t ≤ 1, as the parametrization and

    ln(I₀/I₁) = ∫_C f ds = ∫_{−1}^{1} f(r(t)) |r'(t)| dt = ∫_{−1}^{1} f(t, t) |(1, 1)| dt = √2 ∫_{−1}^{1} f(t, t) dt.

Direct problem: When the function f is known, determine the integrals

    ∫_C f ds

along all different line segments C that start and end at the boundary of S.

Inverse problem: Determine f when we know all its integrals

    ∫_C f ds

along different line segments C that start and end at the boundary of S.

The inverse problem can be solved with the help of Fourier analysis. However, this is
outside the scope of this course. Therefore, we proceed to the following practical problem.

In practice, the measurements can't be taken along every line, but only along finitely
many lines. The fewer lines there are, the less information there is about the unknown
function f. The problem is that different functions can have the same integrals along
finitely many lines. For example, take f(x, y) = x² + y² when (x, y) ∈ B(0, 1) and
f(x, y) = 0 otherwise. Then the integral of f along C = {(x, y) ∈ R² : y = 0, −1 ≤ x ≤ 1}
is

    ∫_{−1}^{1} x² dx = 2/3,

which is the same as the integral of the function f̃(x, y) = 1/3. Since f is rotationally
symmetric, the integral along any straight line C = {(x, y) ∈ R² : y = ax, (x, y) ∈
B(0, 1)}, a ∈ R, has the same value.
In tomography, the limitedness of the data is usually compensated by restricting the form
of the allowed solutions. Instead of any function, we allow the solution to be only of the form

    f(x, y) = Σ_{j=1}^{n} c_j φ_j(x, y),

where n is a fixed positive number, the functions φ_j are known and the coefficients c_j ∈ R
are unknown. For example, the functions φ_j can be characteristic functions of disjoint
rectangles (the pixels),

    φ_j(x, y) = 1 when (x, y) ∈ I_j,  and  φ_j(x, y) = 0 otherwise.

Figure 1.8: Rectangle Ij .

Let's assume that there are m paths, say C_i, whose parametrizations are r_i : [t₁, t₂] →
R², i = 1, ..., m. Then the data is modeled by the equations

    y_i = ∫_{C_i} f ds = Σ_{j=1}^{n} ∫_{t₁}^{t₂} φ_j(r_i(t)) |r_i'(t)| c_j dt = Σ_{j=1}^{n} M_ij c_j,

where

    M_ij = ∫_{t₁}^{t₂} φ_j(r_i(t)) |r_i'(t)| dt

and i = 1, ..., m, j = 1, ..., n. The inverse problem is now to determine the coefficients
c = (c₁, ..., c_n) when the vector y = (y₁, ..., y_m) is given. The matrix M of the direct
theory is known, because the functions φ_j and the paths C_i are known. Additionally,
the data will also contain disturbances!
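A minimal sketch of how the matrix M can be assembled numerically for pixel basis
functions and straight-line paths. The pixel grid, the choice of horizontal measurement
lines and the simple quadrature rule below are illustrative assumptions, not part of the notes:

    import numpy as np

    npix = 8                                 # npix x npix pixels covering S = [-1,1] x [-1,1]
    edges = np.linspace(-1.0, 1.0, npix + 1)

    def pixel_index(p):
        """Index j of the pixel I_j containing the point p (or -1 outside S)."""
        x, y = p
        if not (-1.0 <= x <= 1.0 and -1.0 <= y <= 1.0):
            return -1
        ix = min(np.searchsorted(edges, x, side="right") - 1, npix - 1)
        iy = min(np.searchsorted(edges, y, side="right") - 1, npix - 1)
        return iy * npix + ix

    # m horizontal lines y = const, parametrized by r_i(t) = (t, y_i), t in [-1, 1].
    m = 10
    heights = np.linspace(-0.9, 0.9, m)
    t = np.linspace(-1.0, 1.0, 400)
    dt = t[1] - t[0]

    M = np.zeros((m, npix * npix))
    for i, yi in enumerate(heights):
        speed = 1.0                          # |r_i'(t)| = |(1, 0)| = 1
        for tk in t:
            j = pixel_index((tk, yi))
            if j >= 0:
                M[i, j] += speed * dt        # M_ij = integral of phi_j(r_i(t)) |r_i'(t)| dt

    print(M.shape, M.sum(axis=1)[:3])        # each row sums approximately to the line length 2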


Figure 1.9: Greyscale image with low resolution.

Figure 1.10: CT scan: Different shades represent different values of f (image: Siemens Press
Picture).

In practical inverse problems the reconstruction (= forming of the image of the unknown
object) has to be done from somewhat limited data. Moreover, the unknown is often
approximated with the help of finite-dimensional vectors.

Example 6

In electrical impedance tomography (EIT), electrical measurements on the boundary of
an object give information about the interior structure of the object. One can either
feed current into the object and measure the voltage, or feed voltage and measure the currents.
Let u = u(x) denote the voltage in the object D ⊂ R³ and let there be a voltage f on the
boundary of D. Let σ ∈ C^∞(D̄) denote the electric conductivity of the object D. Then

Figure 1.11: Voltage-current measurements on the object D.

the function u ∈ C²(D) ∩ C¹(D̄) satisfies

    ∇ · (σ ∇u)(x) = 0,  x ∈ D,
    u(x) = f(x),  x ∈ ∂D.

The current on the boundary of D can be derived from the voltage u by

    g(x) = σ(x) n(x) · ∇u(x),  x ∈ ∂D,

where n(x) is the outer normal vector of ∂D.

Direct problem: Determine g when σ and f are given.

Inverse problem: Determine σ when g is known for every f ∈ C¹(∂D).

This type of inverse problem is more commonly known as an inverse boundary value
problem. The solution of this particular inverse problem is known for quite general D and σ.
The solution method requires knowledge of e.g. partial differential equations, so it is not
in the scope of this course. However, the last chapter contains some relevant methods for
the practical solution of the problem.
Where is it used?

Medical imaging (heart and lungs).

Noninvasive testing (e.g. integrity tests for glass jars, airplane wings, bridges).

Process monitoring in industry (e.g. mixture homogeneity monitoring).

The problem also has a coarser counterpart that is widely used: body composition
monitoring by bioelectrical impedance analysis. The measuring setup is the same: small
currents are fed into the body and voltages are measured. The main difference to EIT is
the forward model: instead of the more accurate partial differential equation, the forward
theory consists of a parametrized coarse approximation, where the main assumption is
that the person's body is a cylinder that has the same height as the person and the
water content of the body corresponds to the volume of the cylinder. The voltage is used to
calculate the water content of the cylinder. Other parameters that are commonly used
to tune the model are age, weight, gender...
Inverse problems are a way to extract information about objects that are
difficult or impossible to study otherwise.

Example 7
Medical ultrasound imaging gives a picture of the patient's interior structures on the basis of
sound waves. The main principle is the following: sound pulses are transmitted inside the
patient (at frequencies of 2-15 MHz). The pulses partially reflect backwards from different
structures of the body. The backscattered signal is received and transformed into brightness
values. The procedure is repeated along different measurement lines.

Figure 1.12: Ultrasound imaging 1. Pulse reflects from boundaries (the same color cor-
responds here to homogeneous structure)

Ultrasound imaging is based on several physical simplifications that are more or less
inaccurate. First of all, the sound speed is assumed to be constant regardless of the
tissue. This makes the size of the organs slightly distorted. Moreover, the model does
not take into account multipath propagation and wave diffraction, which can place objects
in wrong places in images. Very rough surfaces can also produce speckled images.


Figure 1.13: Ultrasound imaging 2. The backscattered pulse (blue line) is received and
transformed into brightness values (red curve) by using an envelope curve.

A more precise mathematical model would be to use the physics of acoustic wave
propagation through inhomogeneous media. A time-harmonic acoustic wave u in a domain
D ⊂ Rⁿ satisfies the equation

    Δu(x) + (ω²/c²(x)) u(x) = 0,  x ∈ D,

where ω is the frequency and c(x) is the speed of sound in the inhomogeneous media.
The sound source is described by the equation

    n(x) · ∇u(x) = f(x),  x ∈ ∂D,

where n(x) is the outer normal vector of the surface ∂D. The sound wave at the surface is

    g(x) = u(x),  x ∈ ∂D.

The function x ↦ u(x) is connected to the physical pressure wave p(x, t) through the equation
p(x, t) = Re( u(x) e^{−iωt} ).

Direct problem: Determine u when f and c are given.

Inverse problem: Determine the function c when g is given for every f.

This inverse problem is an inverse boundary value problem.


In inverse problems, mathematics is used to improve different imaging methods.

The same acoustic equation can be used to describe the propagation of seismic waves
(= earthquake waves). This method is used in mapping the inner structure of the Earth.
Sound waves also propagate very well in water, which leads to so-called sonar imaging.

Example 8
Inverse scattering problems are a class of mathematically challenging inverse problems.
In inverse scattering, a wave (e.g. sound or electromagnetic) is transmitted towards an
unknown obstacle or inhomogeneity, which then distorts the propagating wave. The
distortion of the incident wave is described by a scattered wave, which is finally observed
far away from the unknown. The inverse problem is to deduce the properties of the
unknown from the observations of the scattered wave.

Figure 1.14: The scattering. The incident wave is u^i. The scatterer produces the scattered
wave u^s. The total wave is u = u^i + u^s.

In mathematical scattering theory, the term wave is usually replaced with the term
field, which means a multivariate function. A common simplification is to assume that
the time-dependence is time-harmonic, meaning that u(x, t) = e^{−iωt} u(x).

Time-harmonic acoustic scattering from inhomogeneous media is described by the equations

    u(x) = u^i(x) + u^s(x),

    Δu(x) + (ω²/c²(x)) u(x) = 0,  x ∈ Rⁿ,

where ω is the frequency of the incident wave. Additionally, we require the radiation condition

    lim_{|x|→∞} |x| ( ∂u^s(x)/∂|x| − i (ω/c) u^s(x) ) = 0  uniformly in every direction x/|x|.

The function c(x) describes the speed of sound in the media. It is a physical quantity depending
on the structure of the media (e.g. its molecular composition). Above, it is assumed
that c > 0 is a smooth function that has a constant value far away from the unknown.

In the direct scattering problem, one determines the scattered field u^s when u^i and c are
known. The incident field is usually a plane wave u^i(x) = e^{iω a·x}, where a is a unit vector.

In the inverse acoustic scattering problem, one tries to determine the function c when u^s
is given far away from the unknown scatterer and also u^i is known.

Electromagnetic scattering from inhomogeneous media: Let E = E(x, t) ∈
C²(R³ × R₊; R³) and H = H(x, t) ∈ C²(R³ × R₊; R³) be the electric field and the magnetic
field, respectively. In isotropic media, these fields satisfy the following Maxwell's equations

    ∇ × E(x, t) + μ₀ ∂H/∂t (x, t) = 0,
    ∇ × H(x, t) − ε(x) ∂E/∂t (x, t) = σ(x) E(x, t).

Assuming a time-harmonic time dependence gives

    E(x, t) = ε₀^{−1/2} E(x) e^{−iωt},  H(x, t) = μ₀^{−1/2} H(x) e^{−iωt},

where ω is the frequency and ε₀, μ₀ are the permittivity and permeability of the vacuum.
The corresponding time-harmonic Maxwell's equations are

    ∇ × E(x) − ik H(x) = 0,   (1.1)
    ∇ × H(x) + ik n(x) E(x) = 0,   (1.2)

where the refraction coefficient

    n(x) = (1/ε₀) ( ε(x) + i σ(x)/ω )

depends on the media and k = ω √(ε₀ μ₀).

Let E^i and H^i be time-harmonic solutions of the Maxwell's equations in the vacuum
(ε ≡ ε₀ and σ ≡ 0). The sum of the incident and the scattered field, E = E^i + E^s,
H = H^i + H^s, satisfies (1.1) and (1.2). The additional radiation condition reads as

    lim_{|x|→∞} ( H^s × x − |x| E^s ) = 0

uniformly in every direction x/|x|.

Direct problem: Determine E^s and H^s when E^i, H^i and n(x) are given.

Inverse problem: Determine n(x) when H^s and E^s are given far away from the scatterer
and E^i and H^i are known.

Inverse scattering problems are mathematically challenging.

1.3 Classification of inverse problems
(A) Mathematical inverse problems. For example

Inverse scattering problems (e.g. scattering from media, data acquired with
single or several frequencies or directions).
Inverse boundary value problems (like electrical impedance tomography)
Mathematical tomography
Inverse initial value problems
Inverse eigenvalue problems.

(B) Practical and computational inverse problems. For example

Image enhancement
Remote sensing (including ecological, geological and astronomical targets)
Medical imaging
Noninvasive testing (including industrial process monitoring)
Retrospective inverse problems (like determination of the source for pollution
particles in the atmosphere)
Biological inverse problems (like phylogenetic problems: From DNA differ-
ences, determine the family tree of different species).
Economic problems (determine the parameters of economical models)

1.4 Recap
In inverse problems, one uses indirect observations to obtain information about the
unknown target. Applications can be found in several fields (where?). Inverse problems
can be divided into mathematical and computational problems. Typical properties are

more difficult than direct problems,

sensitive to small disturbances in data, and

in practical inverse problems, the amount of data is limited.

Please be prepared to

mention practical examples of inverse problems that utilize indirect data (direct
data = observing the values of the unknown, indirect data ≠ direct data),

explain what is meant by image deblurring when the blurring function is given,

represent CT scans as a mathematical problem (with equations),

explain what is meant by an inverse boundary value problem, when the equations
of the direct problem are given,

explain what is meant by an inverse scattering problem when the equations of the
direct scattering are given,

define what is the direct theory.

Chapter 2

Well-posed and ill-posed inverse


problems

In this chapter, we study inverse problems in normed vector spaces. Especially, we pay
attention to the solvability and stability of the inverse problem.

Definition 2. A set V is a linear vector space over the scalar field K if there exist
mappings V × V ∋ (x, y) ↦ x + y ∈ V and K × V ∋ (λ, x) ↦ λx ∈ V that satisfy

1. (x + y) + z = x + (y + z)

2. x + y = y + x

3. there exists 0 ∈ V such that x + 0 = x

4. for every x ∈ V there exists (−x) ∈ V such that x + (−x) = 0 (we also denote
x − y := x + (−y))

5. λ(μx) = (λμ)x

6. 1x = x

7. λ(x + y) = λx + λy

8. (λ + μ)x = λx + μx

for all x, y, z ∈ V and λ, μ ∈ K.

Definition 3. A normed vector space is a pair (V, ‖·‖), where V is a linear vector space
over the field K and x ↦ ‖x‖ is a real-valued mapping such that

1. ‖x‖ ≥ 0 for every x ∈ V, and ‖x‖ = 0 if and only if x = 0,

2. ‖λx‖ = |λ| ‖x‖ for all x ∈ V and λ ∈ K, and

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ V.

Let (V, ‖·‖) be a normed vector space. The set

    B(x₀, r) = {x ∈ V : ‖x₀ − x‖ < r} ⊂ V,

where x₀ ∈ V and r > 0, is called an open ball at x₀ with radius r. The (ordinary)
topology of a normed space is defined by open balls (a set U ⊂ V is open if and only if
for every x ∈ U there exists r > 0 such that B(x, r) ⊂ U).

Let V₁ and V₂ be two linear vector spaces with norms ‖·‖₁ and ‖·‖₂, respectively.
Let D ⊂ V₁. Recall that a function F : D → V₂ is continuous at x₁ ∈ D if for every
ε > 0 there exists δ > 0 so that ‖F(x₁) − F(x₂)‖₂ < ε if ‖x₁ − x₂‖₁ < δ and x₂ ∈ D. The
function F is continuous on D if it is continuous at every x ∈ D.

The linear vector space Rⁿ, n ≥ 1, is equipped with the usual norm, where the norm of a
vector x = (x₁, ..., xₙ) ∈ Rⁿ is

    |x| = ( Σ_{i=1}^{n} |x_i|² )^{1/2}.

Recall that a finite-dimensional vector space suits well for describing the unknowns in
imaging problems, since an image can be represented as a matrix (which, in turn, can be
represented as a vector by rearranging the elements).

2.1 Well-posed inverse problems


Definition 4. Let V₁ and V₂ be two normed vector spaces with norms ‖·‖₁ and ‖·‖₂,
respectively. Let V ⊂ V₁, W ⊂ V₂, and let F : V → W be a given mapping. The direct
problem is to determine y = F(x) when an arbitrary x ∈ V is given. The inverse problem
is to determine such x ∈ V that y = F(x) when an arbitrary y ∈ W is given. The mapping
F is called the direct theory.

The next definition is important in the theory of inverse problems.

Definition 5 (Jacques Hadamard). A problem is well-posed if

1. it has a solution,

2. the solution is unique, and

3. the solution depends continuously on the data.

Consider the abstract inverse problem of Definition 4:

(*) Find x ∈ V that satisfies y = F(x) when the given data y ∈ W is arbitrary.

When is the inverse problem (*) well-posed?

The problem (*) has to have a solution.

For every y ∈ W there has to be an x ∈ V such that y = F(x). In other words, the
direct theory F : V → W needs to be a surjection, meaning that F(V) = W.

Sub-problem A: Characterisation. What is the image F(V)? Characterise
those y ∈ V₂ that correspond to unknowns x ∈ V.

The solution of (*) has to be unique.

If x₁, x₂ ∈ V are two solutions satisfying F(x₁) = F(x₂) ∈ W, then x₁ = x₂
has to hold. In other words, the direct theory needs to be an injection.
Uniqueness is important also for practical problems. For example, in medical
imaging non-uniqueness adds the danger of misdiagnosis. Indeed, it is undesirable
that a malignant change in tissue would appear as healthy tissue in images.

Sub-problem B: Identification. Is there enough data to determine the solution
uniquely? This is usually the first challenging step in mathematical inverse problems.

When F is both injective and surjective, then it is a bijection and the inverse
mapping F^{-1} : W → V exists. The inverse mapping F^{-1} : W → V has to be
continuous.

In practical inverse problems there is a rule of thumb: the given data is never
exactly the same as in the mathematical formulation. There are several reasons
for this. (i) Measurement equipment has limited accuracy. (ii) Electrical
devices suffer from intrinsic errors, like thermal noise. (iii) The direct theory is not
necessarily completely correct. It may contain approximations. (iv) There can
be external disturbances in the measurement environment. (v) In numerical
calculations, the real numbers are replaced by floating point numbers that are
of finite accuracy.

When F^{-1} is continuous at y₁ ∈ W, then for given ε > 0 there is δ > 0 such
that |F^{-1}(y₁) − F^{-1}(y₂)| < ε whenever y₂ ∈ W and |y₁ − y₂| < δ. Especially,
if y₁ = F(x₁) for some x₁ ∈ V and y₂ ∈ W is of the form

    y₂ = F(x₁) + e,

where F(x₁) + e ∈ W and |e| < δ, then

    |F^{-1}(y₁) − F^{-1}(y₂)| = |x₁ − F^{-1}(F(x₁) + e)| < ε.

Sub-problem C: Stability. How much do small changes in the data disturb the
corresponding mathematical solutions?

Remark 1. There are also two additional sub-problems.

Sub-problem D: Reconstruction. How is x obtained from the given y ∈ F(V)?
This is another important step for mathematical inverse problems.

Sub-problem E: Numerical reconstruction. A precise or approximative method for
recovering the unknown from available data, which can be imprecise or incomplete.
Example 1. (Uniqueness) An object with mass m experiences a time-dependent force
F(t) = (F₁(t), F₂(t), F₃(t)), where the real functions F_i are continuous for i = 1, 2, 3. The
path of the object is described by a function g(t) = (g₁(t), g₂(t), g₃(t)). When F and
g(0), g'(0) are known, then the path g(t) of the object satisfies

    F(t) = m g''(t),  t ∈ (0, 1),   (2.1)
    g'(0) = (v₁, v₂, v₃),   (2.2)
    g(0) = (a₁, a₂, a₃).   (2.3)

Let

    V₁ = C([0, 1]; R³) = {F = (F₁, F₂, F₃) : F_i : [0, 1] → R, i = 1, 2, 3, is continuous}

equipped with the norm ‖F‖₁ = sup_{t∈[0,1]} |F(t)| and

    V₂ = C²([0, 1]; R³) = {g = (g₁, g₂, g₃) : g_i : [0, 1] → R, i = 1, 2, 3, is twice
    continuously differentiable and the derivatives are continuous up to the end-points}

equipped with the norm

    ‖g‖₂ = sup_{t∈[0,1], k=0,1,2} | (d^k g / dt^k)(t) |.

Inverse problem: Determine F ∈ V₁ when the object moves along the path g ∈ V₂.

Let's prove the uniqueness. Let F₁, F₂ ∈ V₁ be two forces that satisfy (2.1)-(2.3)
when the path g is given. Then

    ‖F₁ − F₂‖₁ = sup_{t∈[0,1]} |F₁(t) − F₂(t)| = sup_{t∈[0,1]} |m g''(t) − m g''(t)| = 0.

Hence F₁ = F₂. A similar inverse problem was used in the 19th century to find the planet
Neptune. The data was the path of a known planet, Uranus, whose orbit had curious
irregularities that could not be explained unless the gravity pull of an unknown planet
affected it.

Example 2. (Reconstruction) Let V₁ = V₂ = C([0, 1]) equipped with the norm ‖f‖ =
sup_{t∈[0,1]} |f(t)|. Let's consider the direct theory

    F f(t) = ∫₀¹ f(ts) ds.

Let g ∈ C([0, 1]) be given. The inverse problem is to determine f ∈ C([0, 1]) so that g = F f.
Because F f(0) = f(0), we may assume that t ≠ 0 in the following. By the change of
variables ts = r we get

    g(t) = F f(t) = (1/t) ∫₀ᵗ f(r) dr.

Especially

    ∫₀ᵗ f(r) dr = t g(t),

and we may differentiate (by using the fundamental theorem of calculus) to obtain

    f(r) = d( t g(t) )/dt |_{t=r}

when r ∈ (0, 1].
2.2 Ill-posed inverse problems
Definition 6. If the problem is not well-posed, then it is ill-posed.

Let us consider the inverse problem of Definition 4:

(*) Find x ∈ V that satisfies y = F(x) when the given data y ∈ W is arbitrary.

There are several possibilities that make a problem ill-posed:

1. There is no solution.

We may end up in this situation if the given data contains disturbances. For
example, if we are given y = F(x) + e instead of y = F(x), where e ∈ V₂ is
not known and y ∉ Im(F). Despite this fact, we would like to extract information
about the unknown x.

Example 3. Let us consider the Fredholm integral equation of the first kind

    g(x) = ∫₀¹ K(x, y) f(y) dy,  x ∈ [0, 1],   (2.4)

where the integral kernel (x, y) ↦ K(x, y) is a C¹-function. Inverse problem: determine
the function f ∈ C([0, 1]) when the function g ∈ C([0, 1]) is given.

When g is a continuous function that is not differentiable, then the solution does
not exist. Indeed, the right hand side of the integral equation (2.4) is always
differentiable, since it holds that

    ( ∫₀¹ K(x + h, y) f(y) dy − ∫₀¹ K(x, y) f(y) dy ) / h = ∫₀¹ ( (K(x + h, y) − K(x, y)) / h ) f(y) dy
    = ∫₀¹ ( (1/h) ∫_x^{x+h} ∂_x K(x', y) dx' ) f(y) dy
    = (1/h) ∫_x^{x+h} ∫₀¹ ∂_x K(x', y) f(y) dy dx'.

2. The solution exists but is nonunique.

More than one unknown produces the same data, i.e. y = F(x₁) = F(x₂) for some
x₁ ≠ x₂.

Especially computational inverse problems suffer from nonuniqueness because of the
limitedness of the data. In mathematical inverse problems, the data is typically a
function, but in practice only finitely many (approximative) values of this function
can be observed.

Example 4. Let the direct theory be the linear mapping F : C(0, 1) → C¹(0, 1) that
takes the function f to its integral

    F f(x) = ∫₀ˣ f(y) dy.

The inverse problem is to determine f ∈ C(0, 1) so that F f = g when such g ∈ C¹(0, 1)
is given that g(0) = 0. The solution is obtained by differentiating, i.e. f = g', and
it is easy to see that the solution is unique. However, if the function g is only given
at the points t₁, ..., tₙ ∈ (0, 1), then there exist several different functions g that give
the same data g(t_i), i = 1, ..., n. Each function that is compatible with the data
gives a different derivative g' = f.

Example 5. In practical inverse problems the unknown is often a higher dimensional
vector than the given data vector. One simple example is any matrix equation

    y_i = Σ_{j=1}^{n} M_ij x_j,

where i = 1, ..., m and n > m. Then there are n unknown variables x_j and they
satisfy only m equations. In the next chapter, we consider finite-dimensional linear
inverse problems in more detail.

Example 6. See Example 5 of Chapter 1.

3. The solution does not depend continuously on the data.

If the direct theory F : V → W is a bijection, then there exists the inverse function
F^{-1} : W → V, but it is not always continuous even though F is continuous.
Then arbitrarily small disturbances of the data can produce large disturbances in the
unknown.

Example 7. Consider the mapping

    F f(x) = ∫₀ˣ f(y) dy.

Let V₁ = V₂ = C([0, 1]) and V = V₁, W = F(V). Then W is a linear subspace
of C([0, 1]) and, hence, W is also a normed vector space equipped with the norm of
V₂. Let us show first that F : V → W is continuous. Let ε > 0. Then

    ‖F f₁ − F f₂‖ = sup_{x∈[0,1]} | ∫₀ˣ f₁(y) dy − ∫₀ˣ f₂(y) dy | = sup_{x∈[0,1]} | ∫₀ˣ ( f₁(y) − f₂(y) ) dy |
    ≤ sup_{x∈[0,1]} ∫₀ˣ sup_{y∈[0,1]} |f₁(y) − f₂(y)| dy = ‖f₁ − f₂‖ sup_{x∈[0,1]} ∫₀ˣ dy < ε

when ‖f₁ − f₂‖ < δ with the choice δ = ε.

Inverse problem: Determine f ∈ V when g = F f ∈ W is given. One sees easily that
there is a unique solution, which is obtained by differentiation. We show that F^{-1} = D
is not continuous at the zero function. Let ε = 1/4 and δ > 0. Then there exists
g(x) = δ sin(x/δ) ∈ W such that

    ‖g‖ = sup_{x∈[0,1]} |g(x)| ≤ δ,

but

    ‖Dg‖ = sup_{x∈[0,1]} |cos(x/δ)| = 1 > ε

regardless of how small δ is. Hence F^{-1} : W → V is not continuous.
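The following sketch illustrates Example 7 numerically; the concrete value of δ and the
grid are just sample choices of this illustration:

    import numpy as np

    delta = 1e-3
    x = np.linspace(0.0, 1.0, 200001)
    g = delta * np.sin(x / delta)      # g = F f for f(y) = cos(y / delta), so g belongs to W

    print("||g||  =", np.max(np.abs(g)))                   # at most delta, i.e. tiny
    print("||Dg|| =", np.max(np.abs(np.cos(x / delta))))   # equals 1 no matter how small delta is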

2.3 Recap
Inverse problems can be either well posed or ill posed.

An inverse problem usually has several subproblems (characterisation, identifiability,
stability, reconstruction and numerical reconstruction).

Please memorise

the definition of a well posed problem and an ill-posed problem, and

what characterization, identifiability, stability, and reconstruction mean in connection
with inverse problems.

Please be able to tell

what practical benefits follow from the uniqueness of an inverse problem,

what practical disadvantage follows from the instability of an inverse problem,

why limitedness of data can render an inverse problem ill-posed,

why inexactness of data can render an inverse problem ill-posed, and

examples of situations where the data is inexact.

Chapter 3

Finite-dimensional linear inverse


problems

Definition 7. An inverse problem is called finite dimensional if the unknown and the data
belong to finite dimensional vector spaces. An inverse problem is called linear if the direct
theory is linear.

Let's recap some facts from linear algebra. Here K is either R or C.

The space Kⁿ is equipped with the inner product

    (x, y) = Σ_{i=1}^{n} x_i ȳ_i,

where x = (x₁, ..., xₙ) and y = (y₁, ..., yₙ) ∈ Kⁿ.

When x = (x₁, ..., xₙ) and y = (y₁, ..., yₙ) ∈ Kⁿ, then the norm and the inner
product satisfy

    |x|² = (x, x),

and the Cauchy-Schwarz inequality |(x, y)| ≤ |x| |y| holds, i.e.

    | Σ_{i=1}^{n} x_i ȳ_i | ≤ ( Σ_{i=1}^{n} |x_i|² )^{1/2} ( Σ_{i=1}^{n} |y_i|² )^{1/2}.

The set V ⊂ Kⁿ is a linear subspace of Kⁿ if for every a, b ∈ K and x, y ∈ V it
holds that ax + by ∈ V.

Let V ⊂ Kⁿ be a linear subspace. The mapping F : V → Kᵐ is linear if

    F(ax + by) = aF(x) + bF(y)

when a, b ∈ K and x, y ∈ V.

A basis of a linear subspace V ⊂ Kⁿ consists of linearly independent vectors
{e₁, ..., e_k} that satisfy

    V = { x ∈ Kⁿ : x = Σ_{i=1}^{k} a_i e_i, a_i ∈ K, i = 1, ..., k }.

The basis {e₁, ..., e_k} of a linear subspace V ⊂ Kⁿ is orthonormal if

    (e_i, e_i) = 1 and (e_i, e_j) = 0

when i ≠ j and i, j = 1, ..., k.

Notation: (x₁, ..., xₙ) denotes the column vector whose entries are x₁, ..., xₙ.

Finite-dimensional linear mappings F : Kⁿ → Kᵐ can be represented by using
matrices. Equip Kⁿ and Kᵐ with their natural bases {e₁, ..., eₙ} and {f₁, ..., f_m},
respectively, where

    e₁ = (1, 0, 0, ..., 0), e₂ = (0, 1, 0, ..., 0), ..., eₙ = (0, ..., 0, 1)

and

    f₁ = (1, 0, 0, ..., 0), f₂ = (0, 1, 0, ..., 0), ..., f_m = (0, ..., 0, 1).

When x = Σ_{j=1}^{n} x_j e_j, then

    F(x) = F( Σ_{j=1}^{n} x_j e_j ) = Σ_{j=1}^{n} x_j F(e_j).

Especially

    F(x)_i = (F(x), f_i) = Σ_{j=1}^{n} x_j (F(e_j), f_i) = Σ_{j=1}^{n} M_ij x_j,

where the matrix of the linear mapping F : V → Kᵐ with respect to the bases
{e₁, ..., eₙ} and {f₁, ..., f_m} is

    M_ij = (F e_j, f_i).

Conversely, every matrix corresponds to a linear mapping.
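A short sketch of this construction in code: the matrix of a linear mapping is obtained
column by column from the images of the natural basis vectors. The mapping F below is
an arbitrary example, not from the notes:

    import numpy as np

    def F(x):
        """An example linear mapping F : R^3 -> R^2."""
        return np.array([x[0] + 2.0 * x[1], 3.0 * x[2] - x[0]])

    n, m = 3, 2
    M = np.zeros((m, n))
    for j in range(n):
        e_j = np.zeros(n)
        e_j[j] = 1.0
        M[:, j] = F(e_j)       # M_ij = (F e_j, f_i): the j-th column of M is F(e_j)

    x = np.array([1.0, -2.0, 0.5])
    print(M @ x, F(x))         # both give the same vector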

3.1 Well and ill posed finite dimensional linear problems

Let's study a finite dimensional linear inverse problem, where the direct theory F : V →
W between linear subspaces V ⊂ Rⁿ and W ⊂ Rᵐ is defined by using a matrix M ∈ R^{m×n},
i.e. F(x) = Mx when x ∈ V. Here all matrices are defined with respect to the natural bases.

Linear finite dimensional inverse problem (*)

Let V ⊂ Rⁿ and W ⊂ Rᵐ be linear subspaces and M ∈ R^{m×n}. Find such x ∈ V
that satisfies y = Mx when an arbitrary y ∈ W is given.

In this chapter, we show that (*) is well posed if

for every y ∈ W the equation y = Mx has a solution x ∈ V, and

the equation Mx = 0 has only the trivial solution x = 0 in the linear subspace V.

Respectively, (*) is ill posed if at least one of the following conditions holds:

for some y ∈ W the equation y = Mx does not have a solution x ∈ V, or

there exists x ∈ V such that x ≠ 0 and Mx = 0.

Remark 2. In this course, we use methods from linear algebra to solve matrix equations.
The matrix equation y = Mx can be solved for example by (i) solving the system of
equations by back substitution or (ii) using the Gauss-Jordan elimination method. In
practice, very large problems are solved numerically by using a computer. However, in
computer aided solutions the accuracy of the solution is limited by the finite accuracy
of the numerical computations.

3.1.1 Injectivity

Recall the following result from linear algebra.

Theorem 1. Let V ⊂ Rⁿ be a linear subspace. The linear mapping F : V → Rᵐ is an
injection if and only if its kernel N(F) = {x ∈ V : F(x) = 0} contains only the zero
vector. This is equivalent to the fact that the kernel N(M) = {x ∈ Rⁿ : Mx = 0} of the
matrix M of the linear mapping F satisfies V ∩ N(M) = {0}.

Proof. See courses of linear algebra.

Corollary 1 (Identifiability). The solution of the inverse problem (*) is unique if and only
if N(F) = {0}. This is equivalent to V ∩ N(M) = {0}.

The direct theory in (*) is injective if and only if Mx = 0 has only the trivial solution in
the subspace V.

Example 8. Let W = R², V = R³ and

    M = [ 1 1 0
          0 0 1 ].

Then Mx = 0 if and only if x₁ + x₂ = 0 and x₃ = 0. In other words,

    N(M) = {(x₁, −x₁, 0) : x₁ ∈ R} ≠ {0}.

The inverse problem (*) is ill posed, because the solution is not unique.
Remark 3. When V = Rⁿ, W = Rᵐ and n > m, the direct theory isn't injective.

Example 9. Let V = {x = (x₁, x₂, x₃) ∈ R³ : x₁ + x₂ + x₃ = 0}, W = R² and

    M = [ 1 0 0
          1 3 1 ].

Is the solution of (*) unique?

Solution: Let x = (x₁, x₂, x₃) ∈ V solve 0 = Mx. Then

    (0, 0) = M (x₁, x₂, x₃) = (x₁, x₁ + 3x₂ + x₃),

which implies that 0 = x₁ and 0 = x₁ + 3x₂ + x₃ = 3x₂ + x₃. Because x ∈ V, we also have
x₁ + x₂ + x₃ = 0, i.e. x₂ + x₃ = 0. Subtracting the two conditions gives 2x₂ = 0, so x₂ = 0
and x₃ = 0. Hence, the solution is unique.

Remark 4 (Affine problems). An affine subspace V_af ⊂ Rⁿ is a set such that V_af +
x₀ = {x + x₀ : x ∈ V_af} is a linear subspace for some x₀ ∈ Rⁿ. If in Example 9 the
linear subspace V is replaced with an affine subspace, say V_af = {(x₁, x₂, x₃) ∈ R³ :
x₁ + x₂ + x₃ = 1}, then either (a) the uniqueness has to be checked from the condition
Mx = Mx̃ ⟹ x = x̃, or (b) the affine problem needs to be reformulated as a linear problem.
In (b), we may proceed as follows. The unknown x = (x₁, x₂, x₃) has to satisfy
x₁ + x₂ + x₃ = 1 in order that x ∈ V_af. This equation can be added to the matrix equation,
and we obtain a new linear problem

    [ y₁ ]   [ 1 0 0 ] [ x₁ ]
    [ y₂ ] = [ 1 3 1 ] [ x₂ ] ,  i.e.  ỹ = M̃ x,
    [ 1  ]   [ 1 1 1 ] [ x₃ ]

where x ∈ R³ and ỹ = (y₁, y₂, 1) ∈ R³.

Example 10. Let us study Example 1 of Chapter 1. Denote by y = (y₁, ..., y₁₁) the
data vector and by x = (x₁, ..., x₉) the unknown vector. Let's form the direct theory
F : R⁹ → R¹¹ from the row, column and color sums by using a matrix M ∈ R^{11×9}.

    y₁   y₂  y₃  y₄  y₅
    y₉   x₁  x₄  x₇  y₆
    y₁₀  x₂  x₅  x₈  y₇
    y₁₁  x₃  x₆  x₉  y₈

Chapter 1.2, Example 1: determine the numbers from their row, column and color sums.
Then y = Mx, where the 11 × 9 matrix M is

    M = [ 0 0 0 0 1 0 0 0 0
          1 1 1 0 0 0 0 0 0
          0 0 0 1 1 1 0 0 0
          0 0 0 0 0 0 1 1 1
          0 1 1 0 0 0 0 0 0
          0 0 0 1 0 0 0 1 0
          0 0 0 0 0 1 1 0 0
          1 0 0 0 0 0 0 0 1
          1 0 0 1 0 0 1 0 0
          0 1 0 0 1 0 0 1 0
          0 0 1 0 0 1 0 0 1 ].

From linear algebra, we know that the equation 0 = Mx has only the trivial solution x = 0
if and only if M can be transformed by the Gauss-Jordan elimination method into a matrix
whose first nine rows form the 9 × 9 identity matrix and whose last two rows are zero rows.
Let's proceed with the Gauss-Jordan elimination method. Eliminating the unknowns one by
one with elementary row operations (subtracting rows from one another, scaling rows and
reordering them) transforms M step by step into the reduced form described above: the
first nine rows become the 9 × 9 identity matrix and the last two rows become zero rows.
Hence, the equation Mx = 0 has only the trivial solution, and the mapping
R⁹ ∋ x ↦ Mx ∈ R¹¹ is an injection.
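The same conclusion can be checked numerically; a sketch using the rows of M written
out above:

    import numpy as np

    M = np.array([
        [0, 0, 0, 0, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1, 1, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 1, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 1],
    ], dtype=float)

    # rank(M) = 9 = number of columns  <=>  Mx = 0 only for x = 0  <=>  injectivity.
    print(np.linalg.matrix_rank(M))   # prints 9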

3.1.2 Surjectivity

The inverse problem (*) has a solution if for every y ∈ W there exists x ∈ V that satisfies

    y = Mx.

Example 11. Let

    M = [ 0 1 1
          1 1 0
          2 2 0
          1 0 0 ].

Does (*) have a solution when W = R⁴ and V = R³?

Solution: Let us apply the Gauss-Jordan elimination method to the augmented matrix

    [ 0 1 1 | y₁ ]
    [ 1 1 0 | y₂ ]
    [ 2 2 0 | y₃ ]
    [ 1 0 0 | y₄ ].

Adding the fourth row to the first gives the first row (1 1 1 | y₁ + y₄); subtracting the
second row from it gives (0 0 1 | y₁ + y₄ − y₂). Subtracting twice the second row from the
third gives (0 0 0 | y₃ − 2y₂), and subtracting the fourth row from the second gives
(0 1 0 | y₂ − y₄). Reordering the rows, we end up with

    [ 1 0 0 | y₄           ]
    [ 0 1 0 | y₂ − y₄      ]
    [ 0 0 1 | y₁ + y₄ − y₂ ]
    [ 0 0 0 | y₃ − 2y₂     ].

The inverse problem (*) does not have a solution when 0 ≠ y₃ − 2y₂.

Lemma 1. Let V ⊂ Rⁿ be a linear subspace. The image F(V) ⊂ Rᵐ of the linear
mapping F : V → Rᵐ, whose matrix is M, is a linear subspace that consists of those
linear combinations y of the columns M_i of M that satisfy

    y = Σ_{i=1}^{n} a_i M_i,

where a = (a₁, ..., aₙ) ∈ V.

Proof. Since V is a linear subspace, ax₁ + bx₂ ∈ V when x₁, x₂ ∈ V and a, b ∈ R.
By the linearity of F we have F(ax₁ + bx₂) = aF(x₁) + bF(x₂) ∈ F(V). Hence F(V) is a
linear subspace. Furthermore, the vector y = (y₁, ..., y_m) ∈ F(V) if and only if there
exist multipliers x₁, ..., xₙ such that (x₁, ..., xₙ) ∈ V and

    y_i = Σ_{j=1}^{n} M_ij x_j,  i = 1, ..., m.   (3.1)

Denoting in (3.1) by (M_j)_i := M_ij the ith element of the jth column M_j of M, we see that
(3.1) says exactly that y = Σ_{j=1}^{n} x_j M_j with (x₁, ..., xₙ) ∈ V.

Remark 5. When V = Rⁿ, W = Rᵐ and n < m, then (*) is ill posed, since the image
R(M) ⊂ Rᵐ is by Lemma 1 a linear subspace of dimension at most n < m.

Let's recall more linear algebra. The transpose of a matrix M ∈ R^{m×n} is the matrix
(Mᵀ)_ij = M_ji, j = 1, ..., m, i = 1, ..., n. The inner product of Rⁿ satisfies

    (x, y) = yᵀ x,

where x = (x₁, ..., xₙ), y = (y₁, ..., yₙ) ∈ Rⁿ.

Let V be a linear subspace of Rⁿ and {e₁, ..., e_k} an orthonormal basis of V with respect
to the inner product of Rⁿ. The orthogonal projection of a vector x ∈ Rⁿ onto the subspace V
is the vector

    P(x) = Σ_{i=1}^{k} (x, e_i) e_i.   (3.2)

Let Q_V be the matrix of the linear mapping x ↦ P(x). By (3.2),

    Q_V = ( e₁ e₂ ... e_k ) ( e₁ᵀ ; e₂ᵀ ; ... ; e_kᵀ ) = Σ_{i=1}^{k} e_i e_iᵀ.
The matrix Q_V is called the orthogonal projection onto V. Especially Q_Vᵀ = Q_V and

    Q_V² = Q_V Q_V = ( e₁ ... e_k ) ( e₁ᵀ ; ... ; e_kᵀ ) ( e₁ ... e_k ) ( e₁ᵀ ; ... ; e_kᵀ ).

The product in the middle is the k × k matrix with entries (e_i, e_j), which by orthonormality
is the identity matrix. Hence

    Q_V² = ( e₁ ... e_k ) ( e₁ᵀ ; ... ; e_kᵀ ) = Q_V.

A vector x ∈ V if and only if Q_V x = x, because every x ∈ V can be represented as a linear
combination x = Σ_{i=1}^{k} (x, e_i) e_i of the basis vectors.

Theorem 2 (Characterization). The inverse problem (*) has a solution if and only if
Q_{F(V)} y = y, where Q_{F(V)} is the orthogonal projection onto F(V).

Proof. The solution exists if and only if y ∈ F(V). This is equivalent to Q_{F(V)} y = y.
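A sketch of how Theorem 2 can be used in practice: an orthonormal basis of F(V) = R(M)
can be computed e.g. with a QR decomposition, and the data is solvable exactly when the
projection leaves it unchanged. The matrix below is the one from Example 11; the data
vectors are arbitrary examples:

    import numpy as np

    M = np.array([[0., 1., 1.],
                  [1., 1., 0.],
                  [2., 2., 0.],
                  [1., 0., 0.]])

    Q, _ = np.linalg.qr(M)          # columns of Q: an orthonormal basis e_1, ..., e_k of F(V)
    P = Q @ Q.T                     # orthogonal projection Q_{F(V)} = sum_i e_i e_i^T

    y_good = M @ np.array([1., 2., -1.])   # consistent data (here y3 = 2 y2 holds)
    y_bad = np.array([1., 1., 5., 0.])     # here y3 - 2 y2 = 3, so the data is inconsistent

    for y in (y_good, y_bad):
        print(np.allclose(P @ y, y))       # True for solvable data, False otherwise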

3.1.3 Continuity of the inverse

In Example 7 of Chapter 2, we saw that in a general normed space the inverse of a
continuous mapping need not be continuous. Let us show that in finite dimensional linear
problems the inverse of the direct theory, when it exists, is always continuous.

Theorem 3. Let V ⊂ Rⁿ be a linear subspace. The linear mapping F : V → Rᵐ is
continuous.

Proof. Take first V = Rⁿ. Denote

    M_ij = (F(e_j))_i,  i = 1, ..., m, j = 1, ..., n,

where {e₁, ..., eₙ} is the natural basis of Rⁿ. By linearity

    |F(x) − F(y)|² = |F(x − y)|² = Σ_{i=1}^{m} ( Σ_{j=1}^{n} M_ij (x_j − y_j) )².

Apply the Cauchy-Schwarz inequality:

    ( Σ_{j=1}^{n} M_ij (x_j − y_j) )² ≤ ( Σ_{j=1}^{n} M_ij² ) ( Σ_{j=1}^{n} (x_j − y_j)² ) ≤ ( Σ_{j=1}^{n} M_ij² ) |x − y|²,

which implies

    |F(x) − F(y)|² ≤ ( Σ_{i=1}^{m} Σ_{j=1}^{n} M_ij² ) |x − y|².

Therefore F is continuous.

Suppose then that V is a linear subspace and {e₁, ..., e_k} its orthonormal basis. Denote by
x = Σ_{j=1}^{k} x_j e_j and y = Σ_{j=1}^{k} y_j e_j two arbitrarily chosen vectors from V.
Similarly as before,

    |F(x) − F(y)|² ≤ ( Σ_{i=1}^{m} Σ_{j=1}^{k} M̃_ij² ) Σ_{j=1}^{k} (x_j − y_j)²,

where

    M̃_ij = (F(e_j))_i,  i = 1, ..., m, j = 1, ..., k.

Moreover

    Σ_{j=1}^{k} (x_j − y_j)² = ( Σ_{j=1}^{k} (x_j − y_j) e_j, Σ_{j=1}^{k} (x_j − y_j) e_j ) = |x − y|².

Corollary 2 (Continuity of the solution). Let V ⊂ Rⁿ and W ⊂ Rᵐ be linear subspaces.
If the linear direct theory F : V → W is bijective, then F^{-1} : W → V is continuous.

Proof. Because F is a bijection, F^{-1} exists. The inverse of a linear mapping is
linear, since if y = Fx and ỹ = Fx̃ then

    F^{-1}(ay + bỹ) = F^{-1}(aFx + bFx̃) = F^{-1}(F(ax + bx̃)) = ax + bx̃ = aF^{-1}(y) + bF^{-1}(ỹ).

By Theorem 3, F^{-1} is continuous.

Example 12. Let

    M = [ 1 2
          3 1 ]

be the matrix of the direct theory.

Inverse problem (*): determine x ∈ R² such that y = Mx when y ∈ R² is given.

Study whether the inverse problem (*) is well posed.

Solution: Let y = (y₁, y₂) ∈ R². Denote x = (x₁, x₂) and set y = Mx, i.e.

    [ y₁ ]   [ 1 2 ] [ x₁ ]
    [ y₂ ] = [ 3 1 ] [ x₂ ].

A square matrix M is invertible if its determinant det(M) is non-zero. In such a case,
M^{-1} = adj(M)/det(M), where adj(M) is the adjugate matrix of M. In the case of a 2 × 2
matrix,

    det [ a b; c d ] = ad − bc  and  [ a b; c d ]^{-1} = (1/(ad − bc)) [ d −b; −c a ].

Since det(M) = 1·1 − 2·3 = −5 ≠ 0, the matrix M has the inverse

    M^{-1} = −(1/5) [ 1 −2
                     −3  1 ].

Therefore, (*) has a solution

    [ x₁ ]   [ 1 2 ]^{-1} [ y₁ ]            [ 1 −2 ] [ y₁ ]   [ −(1/5) y₁ + (2/5) y₂ ]
    [ x₂ ] = [ 3 1 ]      [ y₂ ] = −(1/5) [ −3  1 ] [ y₂ ] = [  (3/5) y₁ − (1/5) y₂ ].

Moreover, the solution is unique since

    Mx = Mx̃  ⟹  M^{-1}Mx = M^{-1}Mx̃  ⟹  x = x̃.

The solution also depends continuously on the data by Corollary 2. Hence (*) is well posed.

Theorem 4 (The case of a square matrix). Let M ∈ R^{n×n} and V = W = Rⁿ. The inverse
problem (*) is well posed if and only if det(M) ≠ 0.

Proof. When det(M) ≠ 0, there exists the inverse M^{-1}. It follows that the solution exists
(x = M^{-1}y), it is unique (Mx = Mx̃ ⟹ M^{-1}Mx = M^{-1}Mx̃ ⟹ x = x̃) and the solution
depends continuously on the data by Corollary 2. Hence the inverse problem is well posed.
We also need to show that the inverse problem is ill posed when det(M) = 0. In this case,
the matrix M is not invertible, which implies that the linear mapping F is not bijective.
Because the direct theory of a well posed problem is always a bijection, the problem is ill
posed.

Example 13. Let

    M = [ 11  10   14
          12  11  −13
          14  13  −66 ]

be the matrix of the direct theory, and V = W = R³. Study whether the inverse problem
(*) is well posed.

Solution: The matrix M is a square matrix and the direct theory maps between the full
spaces. By Theorem 4, we only need to calculate the determinant of M:

    det(M) = 11·(11·(−66) − (−13)·13) − 10·(12·(−66) − (−13)·14) + 14·(12·13 − 11·14)
           = −11·11·66 + 11·13·13 + 10·12·66 − 10·13·14 + 14·12·13 − 14·11·14
           = (−121 + 120)·66 + 11·13·13 + 2·13·14 − 14·11·14
           = −66 + 11·13·13 + (26 − 154)·14
           = −66 + 11·13·13 − 128·(13 + 1)
           = −66 + (143 − 128)·13 − 128 = 15·13 − 194 = 15·(3 + 10) − 194 = 45 − 44 = 1.

Since det(M) = 1 ≠ 0, the inverse problem is well posed.
3.2 Ill conditioned solutions

An ill posed problem can be sensitive to disturbances of the data, but also well posed problems
can behave differently with respect to disturbances of a fixed magnitude in the data.
Loosely speaking, we say that a problem A is more ill conditioned than a problem B if
disturbances of the same order produce larger changes in the solution of problem A than
in problem B.

Example 14. Let y, ỹ ∈ R⁸ be of the form y = Mx + ε and ỹ = M̃x + ε, where
x = (1, 1, 1, 1, 1, 1, 1, 1), ε = (0, 0, 0, 0, 0, 0, 0, 0.02) and M, M̃ are the 8 × 8 matrices whose
elements are M_ij = (1/i) δ_ij and M̃_ij = 2^{−i} δ_ij. Here δ_ij is the Kronecker delta, i.e.
δ_ij = 0 if i ≠ j and δ_ij = 1 if i = j. The matrices M and M̃ are non-singular, but

    M^{-1} y = x + M^{-1} ε = (1, 1, 1, 1, 1, 1, 1, 1.16)  and
    M̃^{-1} ỹ = x + M̃^{-1} ε = (1, 1, 1, 1, 1, 1, 1, 1 + 2⁸ · 0.02).

The last element is changed by 2⁸ · 0.02 = 5.12. Although the problem is well posed,
the disturbed data produces a solution that is not a very accurate approximation of the
unknown.

A well posed problem that is very ill conditioned closely resembles an ill posed problem
where the solution does not depend continuously on the data. Ill conditioned problems
need to be taken into account, since most practical situations generate inaccurate and noisy
data.
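A sketch of Example 14 in code form:

    import numpy as np

    i = np.arange(1, 9)
    M = np.diag(1.0 / i)          # M_ij = (1/i) delta_ij
    Mt = np.diag(2.0 ** (-i))     # M~_ij = 2^(-i) delta_ij

    x = np.ones(8)
    eps = np.zeros(8)
    eps[-1] = 0.02

    print(np.linalg.solve(M, M @ x + eps))     # last entry 1.16
    print(np.linalg.solve(Mt, Mt @ x + eps))   # last entry 1 + 2^8 * 0.02 = 6.12
    print(np.linalg.cond(M), np.linalg.cond(Mt))   # condition numbers 8 and 128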

3.2.1 Definition of condition numbers

Recall that the (Hermitian) adjoint of M ∈ C^{m×n} is M* = M̄ᵀ. Moreover, the number λ
is an eigenvalue of an n × n matrix M if there exists a vector x ≠ 0 so that Mx = λx.
Then x is called an eigenvector. The eigenvalues of a square matrix M are the zeros of
the characteristic polynomial p(λ) = det(M − λI).

In comparing the ill conditionedness of different problems, the following quantities are
useful.

Definition 8. The singular values of a matrix M ∈ C^{m×n} are the numbers σ_i(M) that are
the square roots of the eigenvalues λ_i of M*M, i.e. σ_i(M) = √λ_i, where i = 1, ..., n.

The matrix M*M has only non-negative eigenvalues λ_i, since the corresponding eigenvectors
e_i satisfy

    0 ≤ (M e_i, M e_i) = (M*M e_i, e_i) = λ_i (e_i, e_i) = λ_i |e_i|².

Moreover, det(M*M) = det(M*) det(M) = det(M̄ᵀ) det(M) = |det(M)|² ≠ 0 for non-singular
matrices, so that zero is not an eigenvalue of M*M when M is non-singular.

Definition 9. The condition number κ(M) of a non-singular matrix M ∈ C^{n×n} is

    κ(M) = ‖M‖ ‖M^{-1}‖,

where the matrix norm ‖M‖ = σ_max(M) is the largest singular value of M.

Different condition numbers can be defined by using different matrix norms.

Theorem 5. Let M ∈ C^{n×n} be a non-singular matrix. The largest singular value of M^{-1} is

    σ_max(M^{-1}) = 1 / σ_min(M),

where σ_min(M) is the smallest singular value of M.

We apply here two lemmas.

Lemma 2. Let A, B ∈ C^{n×n} be non-singular. Then AB and BA have equal eigenvalues.

Proof. Let's look at the characteristic polynomials. By the calculation rules for determinants,
we obtain

    det(AB − λI) = det(A(B − λA^{-1})) = det(A) det(B − λA^{-1})
                 = det(B − λA^{-1}) det(A) = det((B − λA^{-1})A) = det(BA − λI).

The matrices AB and BA share the characteristic polynomial, which implies that their
eigenvalues are equal.

Lemma 3. Let A ∈ C^{n×n} be non-singular. The eigenvalues of the matrix A^{-1} are the
inverse numbers of the eigenvalues of A.

Proof. Let's study the characteristic polynomial

    det(A − λI) = det(λA(λ^{-1}I − A^{-1})) = λⁿ det(A) det(λ^{-1}I − A^{-1}).

Because A is non-singular, its eigenvalues are non-zero. Hence λ is a zero of the characteristic
polynomial of A if and only if λ^{-1} is a zero of the characteristic polynomial of A^{-1}.

Proof. (Theorem 5) Let us find the singular values of M^{-1} by calculating

    (M^{-1})* M^{-1} = (M*)^{-1} M^{-1} = (M M*)^{-1},

whose eigenvalues are by Lemma 3 the inverses of the eigenvalues of M M*. By Lemma 2,
the eigenvalues of M M* are equal to the eigenvalues of M*M. Therefore, the singular values of
M^{-1} are the inverse numbers of the singular values of M. Especially, σ_max(M^{-1}) = 1/σ_min(M).

3.2.2 Interpretation of condition numbers

Let x ∈ Rⁿ solve y = Mx. Let the given data be y + δy, where δy ∈ Rⁿ represents
disturbances. Adding δy to the data leads to an additional term δx in the solution:

    y + δy = M(x + δx).

Let's compare the relative errors |δx|/|x| and |δy|/|y| of the solution and the data.

We start with the term

    |Mx| = |y|.

The norm and the inner product are related by

    |Mx|² = (Mx, Mx) = (M*Mx, x).   (3.3)

Recall the spectral theorem of linear algebra.

Theorem 6. Let A : V → V be a self-adjoint linear mapping in a finite-dimensional inner
product space V. Then V has an orthonormal basis that consists of eigenvectors of A.

Now M*M is self-adjoint (meaning that (M*M)* = M*M). Denote by λ_i the eigenvalues
of M*M and by e_i the corresponding eigenvectors. By Theorem 6, a vector x can be represented
as x = Σ_{i=1}^{n} x_i' e_i, where (x₁', ..., xₙ') ∈ Rⁿ are the coordinates of x with respect to the
basis e_i. Then the quadratic form (3.3) can be represented as

    (M*Mx, x) = ( Σ_{i=1}^{n} x_i' M*M e_i, Σ_{i=1}^{n} x_i' e_i ) = Σ_{i=1}^{n} λ_i |x_i'|².

By estimating the expression from above by using the largest eigenvalue, we obtain

    |y| = |Mx| ≤ √(max_{1≤i≤n} λ_i) |x| = σ_max(M) |x|  ⟹  |x| ≥ |y| / σ_max(M).   (3.4)

Similarly for M^{-1} we get

    |δx| = |M^{-1} δy| ≤ ( 1 / √(min_{1≤i≤n} λ_i) ) |δy|  ⟹  |δx| ≤ |δy| / σ_min(M),   (3.5)

where Lemma 3 has been applied. By (3.4) and (3.5), the relative errors satisfy

    |δx| / |x| ≤ κ(M) |δy| / |y|.

The condition number hence gives an upper bound for the relative errors. When the
condition number is large (say, > 10⁵), even numerical rounding errors start to affect the
inversion of a matrix.
Example 15. The identity matrix has condition number 1. This is the smallest possible
condition number.

Example 16. In Example 14, the condition numbers are

    κ(M) = 1 / (1/8) = 8

and

    κ(M̃) = 2⁻¹ / 2⁻⁸ = 128.
Example 17. Let's calculate the condition number of

    M = [ 11  10   14
          12  11  −13
          14  13  −66 ].

First compute

    MᵀM = [  461   424  −926
             424   390  −861
            −926  −861  4721 ].

The eigenvalues are the zeros of the characteristic polynomial

    p(λ) = det [ 461−λ   424    −926
                 424     390−λ  −861
                −926    −861    4721−λ ],

so let's set

    p(λ) = (461−λ)((390−λ)(4721−λ) − 861²) − 424·(424·(4721−λ) − 861·926)
           − 926·(424·(−861) − (390−λ)·(−926)) = 0.   (3.6)

The equation (3.6) has solutions λ₁, λ₂ and λ₃, whose square roots are

    (√λ₁, √λ₂, √λ₃) ≈ (0.0006, 21.8, 71.4).

The condition number is now

    κ(M) ≈ 71.4 / 0.0006 ≈ 10⁵.

Let y = Mx + ε be given. When ‖ε‖ ≤ 1/5, how accurate is x? Say, x = (0, 0, 1) and
ε = (0.1, −0.1, 0.1). Then

    Mx = (14, −13, −66)ᵀ

and

    y = Mx + ε = (14.1, −13.1, −65.9)ᵀ.

The determinant of M is

    det(M) = 11·(11·(−66) − (−13)·13) − 10·(12·(−66) − (−13)·14) + 14·(12·13 − 11·14) = 1,

so, computing the cofactors,

    M^{-1} = adj(M) / det(M) = [ −557   842  −284
                                  610  −922   311
                                    2    −3     1 ].

Applying the matrix inverse leads to

    M^{-1}(Mx + ε) = x + M^{-1}ε = x + (−168.3, 184.3, 0.6)ᵀ,

which is quite far from x = (0, 0, 1).
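These numbers are easy to verify numerically; a sketch:

    import numpy as np

    M = np.array([[11., 10.,  14.],
                  [12., 11., -13.],
                  [14., 13., -66.]])

    print(np.linalg.det(M))                       # 1 (up to rounding)
    print(np.linalg.svd(M, compute_uv=False))     # approx. 71.4, 21.8, 0.0006
    print(np.linalg.cond(M))                      # approx. 1.2e5

    x = np.array([0., 0., 1.])
    eps = np.array([0.1, -0.1, 0.1])
    print(np.linalg.solve(M, M @ x + eps) - x)    # approx. (-168.3, 184.3, 0.6)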


Example 18. Let's work through a more pathological example of large condition numbers.
Consider the convolution

    g(θ) = ∫_{−π}^{π} R(θ − φ) f(φ) dφ,

where θ ∈ [−π, π] and the functions R and f are twice continuously differentiable 2π-periodic
functions, i.e. R(θ + n·2π) = R(θ) and f(θ + n·2π) = f(θ) for every n ∈ Z. Assume
additionally that R(−θ) = R(θ) and R(θ) ≥ 0 for θ ∈ [0, π].

We are given the data

    g(θ₁), ..., g(θₙ),

where θ_j = hj, j = 1, ..., n, and h = 2π/n, n = 2^m for some m > 3, and we know the
function R. What can we say about the function f? The integral g(θ) is defined as a
limit of Riemann sums

    Sₙ(θ) = Σ_{j=1}^{n} R(θ − θ_j^{(n)}) f(θ_j^{(n)}) hₙ,

when the mesh is densified (e.g. n = 2^m and m → ∞). Let's write out

    g(θ_k) = ∫_{−π}^{π} R(θ_k − φ) f(φ) dφ − Sₙ(θ_k) + Sₙ(θ_k) = Σ_{j=1}^{n} R(θ_k − θ_j) f(θ_j) h + e_k,

where the approximation error is

    e_k = ∫_{−π}^{π} R(θ_k − φ) f(φ) dφ − Sₙ(θ_k).

Denote

    M_kj = R(θ_k − θ_j) h

and

    x_k = f(θ_k) and y_k = g(θ_k)

when k, j = 1, ..., n. We replace the problem with a matrix problem

    y = Mx + e,

where the given data y is inaccurate.

What is the condition number of $M$? The matrix is of the form
\[
M = h\begin{pmatrix}
R(0) & R(h) & R(2h) & \cdots & R((n-2)h) & R((n-1)h) \\
R(h) & R(0) & R(h) & \cdots & R((n-3)h) & R((n-2)h) \\
R(2h) & R(h) & R(0) & \cdots & R((n-4)h) & R((n-3)h) \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
R((n-1)h) & R((n-2)h) & R((n-3)h) & \cdots & R(h) & R(0)
\end{pmatrix}.
\]
By the periodicity of $R$, the matrix $M$ is a so-called circulant matrix. A matrix $M \in \mathbb{R}^{n\times n}$ is called circulant if
\[
M = \begin{pmatrix}
m_1 & m_n & m_{n-1} & \cdots & m_3 & m_2 \\
m_2 & m_1 & m_n & \cdots & m_4 & m_3 \\
m_3 & m_2 & m_1 & \cdots & m_5 & m_4 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
m_n & m_{n-1} & m_{n-2} & \cdots & m_2 & m_1
\end{pmatrix}
\]
for some vector $(m_1, \dots, m_n) \in \mathbb{R}^n$.

Lemma 4. The eigenvalues of a circulant matrix $M \in \mathbb{R}^{n\times n}$ are
\[
\lambda_k = \sum_{j=1}^n m_j \exp(-2\pi i (j-1)(k-1)/n), \qquad k = 1, \dots, n,
\]
and a circulant matrix $M$ is unitarily similar to a diagonal matrix (i.e. there exists a unitary matrix $U$ so that $U^* M U$ is a diagonal matrix).

Proof. There exists $F^{(k)} \in \mathbb{C}^n$ so that $MF^{(k)} = \lambda_k F^{(k)}$ for $k = 1, \dots, n$. Indeed, take
\[
F^{(k)}_j = \exp(2\pi i (j-1)(k-1)/n), \qquad k, j = 1, \dots, n.
\]
Straightforwardly,
\[
(MF^{(k)})_j = \sum_{l=1}^n M_{jl} F^{(k)}_l = \sum_{l=1}^n m_{(j-l+1)\bmod n} \exp(2\pi i (l-1)(k-1)/n)
= \sum_{L=1}^n m_L \exp(-2\pi i (L-1)(k-1)/n)\,\exp(2\pi i (j-1)(k-1)/n)
= \lambda_k F^{(k)}_j.
\]
But now $F^{(k)} \ne 0$, so $\lambda_k$ is an eigenvalue.


Moreover, the eigenvectors are orthogonal. Indeed, if $k \ne l$, the inner product of $F^{(k)}$ and $F^{(l)}$ is
\[
(F^{(k)}, F^{(l)}) = \sum_{j=1}^n \exp(2\pi i (j-1)(k-1)/n)\,\overline{\exp(2\pi i (j-1)(l-1)/n)}
= \sum_{j=1}^n \exp(2\pi i (j-1)(k-l)/n)
= \sum_{j'=0}^{n-1} z^{j'}
= \frac{1 - z^n}{1 - z}
= \frac{1 - \exp(2\pi i (k-l))}{1 - \exp(2\pi i (k-l)/n)}
= 0,
\]
where we apply the geometric progression formula for $z = \exp(2\pi i (k-l)/n) \ne 1$. If $k = l$, then
\[
(F^{(k)}, F^{(k)}) = \sum_{j=1}^n \exp(2\pi i (j-1)(k-1)/n)\,\overline{\exp(2\pi i (j-1)(k-1)/n)} = n.
\]

Set $U = \frac{1}{\sqrt{n}}(F^{(1)}, \dots, F^{(n)})$. Then
\[
U^*U = \frac{1}{n}\begin{pmatrix} \overline{F^{(1)}}^{\,T} \\ \vdots \\ \overline{F^{(n)}}^{\,T} \end{pmatrix}(F^{(1)}, \dots, F^{(n)}) = I_{n\times n},
\]
proving the unitarity. Moreover, $MU = U\,\mathrm{diag}(\lambda_1, \dots, \lambda_n)$, which implies the similarity. $\square$

The absolute values of the eigenvalues of $M$ are also its singular values, since
\[
M^*M = U\,\mathrm{diag}(\overline{\lambda_1}, \dots, \overline{\lambda_n})\,U^*\,U\,\mathrm{diag}(\lambda_1, \dots, \lambda_n)\,U^* = U\,\mathrm{diag}(|\lambda_1|^2, \dots, |\lambda_n|^2)\,U^*
\]
is similar to $\mathrm{diag}(|\lambda_1|^2, \dots, |\lambda_n|^2)$, and similar matrices share their eigenvalues.
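The eigenvalue formula of Lemma 4 says that the eigenvalues are, up to ordering and conjugation conventions, the discrete Fourier transform of the first column $(m_1, \dots, m_n)$. A small NumPy check (a sketch; the vector m below is an arbitrary example, not one from these notes):

```python
import numpy as np

n = 8
m = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.5, 1.0, 2.0])   # arbitrary first column

# Build the circulant matrix: C[i, j] = m[(i - j) mod n], so its first column is m.
C = np.array([[m[(i - j) % n] for j in range(n)] for i in range(n)])

# Lemma 4: the eigenvalues are (up to ordering) the DFT of the first column.
eig_fft = np.fft.fft(m)
eig_num = np.linalg.eigvals(C)
print(np.allclose(np.sort(np.abs(eig_fft)), np.sort(np.abs(eig_num))))  # True
```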
Let $m_j = R(h(j-1))\,h$, $j = 1, \dots, n$. The eigenvalues of the corresponding circulant matrix $M$ are then
\[
\lambda_k = \sum_{j=1}^n h\,R(h(j-1))\exp(-2\pi i (j-1)(k-1)/n).
\]

Suppose that $M$ is non-singular. When $k = 1$, then
\[
\lambda_1 = \sum_{j=1}^n h\,R(h(j-1)).
\]
When $k = n/2 + 1$ ($n$ even!), then
\[
|\lambda_{n/2+1}| = \Big| \sum_{j=1}^n (-1)^{j-1}\,h\,R(h(j-1)) \Big|.
\]
The condition number satisfies the estimate
\[
\kappa(M) \ge \frac{|\lambda_1|}{|\lambda_{n/2+1}|}.
\]
Let's proceed by writing
\[
|\lambda_{n/2+1}| = \Big| \sum_{j=1}^n (-1)^{j-1}\,h\,R(h(j-1)) \Big|
\overset{\text{even and odd }j}{=} h\,\Big| \sum_{J=0}^{n/2-1} \big( R(2Jh) - R((2J+1)h) \big) \Big|
\overset{\text{fundamental theorem of calculus}}{=} h\,\Big| \sum_{J=0}^{n/2-1} \int_{2Jh}^{(2J+1)h} \frac{dR}{d\theta}(\theta)\,d\theta \Big|.
\]
Then we divide the summation into two parts: one containing the integrals over $[0, \pi]$ and one containing the integrals over $[\pi, 2\pi]$:
\[
|\lambda_{n/2+1}| = h\,\Big| \sum_{J=0}^{n/4-1} \int_{2Jh}^{(2J+1)h} \frac{dR}{d\theta}(\theta)\,d\theta
+ \sum_{J=n/4}^{n/2-1} \int_{2Jh}^{(2J+1)h} \frac{dR}{d\theta}(\theta)\,d\theta \Big|.
\]
In the second sum substitute $J' = J - n/4$; since $2(n/4)h = \pi$, the integrals then run over $[2J'h + \pi, (2J'+1)h + \pi]$, and by the periodicity of $dR/d\theta$ they equal the integrals over $[2J'h - \pi, (2J'+1)h - \pi]$:
\[
|\lambda_{n/2+1}| = h\,\Big| \sum_{J=0}^{n/4-1} \int_{2Jh}^{(2J+1)h} \frac{dR}{d\theta}(\theta)\,d\theta
+ \sum_{J'=0}^{n/4-1} \int_{2J'h-\pi}^{(2J'+1)h-\pi} \frac{dR}{d\theta}(\theta)\,d\theta \Big|.
\]
Perform the change of variables $\theta' = -\theta$ in the second sum and use the antisymmetry of $R'$ (as $R$ is even, $R'$ is odd):
\[
|\lambda_{n/2+1}| = h\,\Big| \sum_{J=0}^{n/4-1} \int_{2Jh}^{(2J+1)h} \frac{dR}{d\theta}(\theta)\,d\theta
- \sum_{J'=0}^{n/4-1} \int_{\pi-(2J'+1)h}^{\pi-2J'h} \frac{dR}{d\theta}(\theta')\,d\theta' \Big|.
\]
Let's replace the summation index by $J = n/4 - J' - 1$, so that $\pi - 2J'h = (2J+2)h$ and $\pi - (2J'+1)h = (2J+1)h$:
\[
|\lambda_{n/2+1}| = h\,\Big| \sum_{J=0}^{n/4-1} \Big( \int_{2Jh}^{(2J+1)h} \frac{dR}{d\theta}(\theta)\,d\theta
- \int_{(2J+1)h}^{(2J+2)h} \frac{dR}{d\theta}(\theta')\,d\theta' \Big) \Big|
= h\,\Big| \sum_{J=0}^{n/4-1} \int_{2Jh}^{(2J+1)h} \Big( \frac{dR}{d\theta}(\theta) - \frac{dR}{d\theta}(\theta+h) \Big)\,d\theta \Big|.
\]
By the fundamental theorem of calculus
\[
|\lambda_{n/2+1}| = h\,\Big| \sum_{J=0}^{n/4-1} \int_{2Jh}^{(2J+1)h} \int_{\theta}^{\theta+h} \frac{d^2R}{d\theta^2}(\theta')\,d\theta'\,d\theta \Big|.
\]
Taking the absolute values inside the integrals gives
\[
|\lambda_{n/2+1}| \le h \int_0^{\pi} \int_{\theta}^{\theta+h} \sup_{\theta'}\Big| \frac{d^2R}{d\theta^2}(\theta') \Big|\,d\theta'\,d\theta
\le \pi h^2 \sup_{\theta'}\Big| \frac{d^2R}{d\theta^2}(\theta') \Big|,
\]
implying (since $R \ge 0$ gives $\lambda_1 \ge hR(0)$)
\[
\kappa(M_{n\times n}) \ge \frac{hR(0)}{\pi h^2 \sup_{\theta}|R''(\theta)|} = \frac{R(0)}{2\pi^2 \sup_{\theta}|R''(\theta)|}\;n.
\]
The larger $n$ is, the more unstable the inversion of the matrix $M_{n\times n}$ becomes. This is typical behavior for finite-dimensional approximations of smoothing convolutions.
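This growth can be observed numerically. The sketch below uses a hypothetical smooth, even, non-negative $2\pi$-periodic kernel $R(\theta) = \exp(\cos\theta)$ (not one appearing in these notes) and prints the condition number of the circulant discretisation for increasing $n$:

```python
import numpy as np

def conv_matrix(R, n):
    """Circulant discretisation M[k, j] = R(theta_k - theta_j) * h of the periodic convolution."""
    h = 2 * np.pi / n
    theta = h * np.arange(n)
    return h * R(theta[:, None] - theta[None, :])

# Hypothetical smooth, even, non-negative 2*pi-periodic kernel (illustration only).
R = lambda t: np.exp(np.cos(t))

for n in [8, 12, 16, 20, 24]:
    M = conv_matrix(R, n)
    print(n, np.linalg.cond(M))   # the condition number grows rapidly with n
```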

3.3 Recap
In finite dimensional linear inverse problems the direct theory F : V W is a
linear mapping between two finite-dimensional linear subspaces V, W . The direct
theory can be represented with the help of a matrix.

The finite-dimensional linear inverse problem is well posed if

for every y W the equation y = M x has a solution x V .


the equation M x = 0 has only the trivial solution x = 0 in V .

The finite-dimensional linear inverse problem is ill posed if at least one of the fol-
lowing claims holds:

for some y W the equation y = M x does not have a solution x V .


there exists x V ,that satisfies x 6= 0 and M x = 0.

For the well-posedness of the inverse problem in the case of a square matrix M, it is enough to check that the determinant of M does not vanish.

If the data contains too much disturbance, the solution of a well-posed problem can be far from the true solution. A well-posed problem which is highly ill-conditioned can resemble an ill-posed problem, where the solution does not depend continuously on the data.

Learning outcomes:
student knows how to study whether a finite-dimensional linear inverse problem is
well posed.

student can recognise and give examples of ill posed finite-dimensional linear inverse
problems.

student knows the definition of condition numbers

student can calculate the condition number of a given matrix


Please find out:
how an ill conditioned problem and an ill posed problem differ

how condition number is connected to the solution of disturbed equations


Please remember

what exact data and inexact (disturbed) data mean

that functions are usually approximated by finite dimensional vectors in numerical


calculations

that a well posed finite dimensional approximation of an ill posed function valued
problem can have increasing condition numbers as the dimensionality of the problem
grows.

Chapter 4

Approximate solutions and


regularization

Let's study classical approximate solutions for ill-posed or ill-conditioned finite-dimensional linear problems.

4.1 Least squares method


Ill-posed linear inverse problem (⋆)

Let $M \in \mathbb{R}^{m\times n}$. Determine $x \in \mathbb{R}^n$ so that $y = Mx$, when $y \in \mathbb{R}^m$ is given and
\[
\mathbb{R}^m \ne M(\mathbb{R}^n).
\]

Example 19. Perform two identical measurements of an unknown quantity $x_0 \in \mathbb{R}$. The measurement event is not ideal, so both measured values contain noise, which is additive. The measured values are
\[
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}x_0 + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \in \mathbb{R}^2.
\]
Then $M = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$ and $M(\mathbb{R}) = \{(x, x) : x \in \mathbb{R}\} \ne \mathbb{R}^2$. Now (⋆) is ill posed since there is no solution when $y_1 \ne y_2$. Also in this case, we would like to get information about the unknown $x_0$ on the basis of the data $y = (y_1, y_2)$.
When (⋆) has no solution, one way to proceed is to loosen the concept of the solution by considering approximate solutions that do not necessarily satisfy $y = Mx$. In the least squares method the approximate solution of $y = Mx$ is taken to be the $\hat x \in \mathbb{R}^n$ that satisfies
\[
|M\hat x - y| = \min_{x\in\mathbb{R}^n} |Mx - y|. \tag{4.1}
\]
In other words,
\[
\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n} |Mx - y|.
\]

The notation argmin means the argument $x$ of the functional $x \mapsto |Mx - y|$ that gives the minimum value. The hat above the vector $\hat x$ is used to indicate that the value is not necessarily the exact solution but an estimate.

Definition 10. The least squares (LS) solution of (⋆) is
\[
\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n} |Mx - y|.
\]

Remark 6. The functionals $x \mapsto |Mx - y|$ and $x \mapsto |Mx - y|^2$ attain their minimum at the same points $\hat x$ (because $s \mapsto s^2$ is strictly increasing on $[0, \infty)$). The norm can be squared in order to make the calculations easier!

Example 20. Let $M = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ and the given data $y = (1, \frac{1}{10})$. When $x = (x_1, x_2) \in \mathbb{R}^2$, the functional to be minimised is
\[
f(x_1, x_2) = |Mx - y|^2 = \Big| \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} 1 \\ \frac{1}{10} \end{pmatrix} \Big|^2 = (x_1 - 1)^2 + \frac{1}{100} > 0.
\]
At the minimum point, $x_1 = 1$ and $x_2$ is a free parameter. In other words, the least squares solutions are
\[
\hat x = (1, x_2),
\]
where $x_2 \in \mathbb{R}$. A problem that has no solutions has infinitely many approximate solutions. (Here the true value was $x_0 = (1, 0)$ and the disturbance $\varepsilon = (0, \frac{1}{10})$.)

4.1.1 Existence and (non)uniqueness of LS solutions


Remark 7. If equation y = M x has a solution x, then x is also LS solution, since from
0 = M x y it follows that |M x y| = 0, which is the minimum value of a non-negative
functional x 7 |M x y|. If LS solution x is such that |M x y| > 0, then y = M x has
no solution (why?). LS solution x does not always satisfy y = M x.
The next theorem recasts the minimisation problem as a linear equation! The result is
used both in theory and numerics.
Theorem 7. Let $M \in \mathbb{R}^{m\times n}$ and $y \in \mathbb{R}^m$. The minimisation problem
\[
\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n} |Mx - y|
\]
has the same solutions $\hat x \in \mathbb{R}^n$ as the equation
\[
M^TM\hat x = M^Ty.
\]
Proof. Calculate first the inner product
\[
f(x) = |Mx - y|^2 = (Mx - y, Mx - y) = (Mx, Mx) - (y, Mx) - (Mx, y) + (y, y) = (M^TMx, x) - 2(M^Ty, x) + (y, y). \tag{4.2}
\]

All extreme values of $f : \mathbb{R}^n \to \mathbb{R}$, especially its minimum, are located at critical points (that is, the minimum satisfies $\nabla f(\hat x) = 0$). The partial derivatives are
\[
\frac{\partial}{\partial x_k}(M^Ty, x) = \frac{\partial}{\partial x_k}\sum_{j=1}^n (M^Ty)_j x_j = \sum_{j=1}^n (M^Ty)_j\frac{\partial x_j}{\partial x_k} = (M^Ty)_k, \tag{4.3}
\]
where $k = 1, \dots, n$. On the other hand, the partial derivatives
\[
\frac{\partial}{\partial x_k}(M^TMx, x) = \frac{\partial}{\partial x_k}\sum_{j=1}^n (M^TMx)_j x_j
\]
can be calculated by differentiating a product. Namely,
\[
\frac{\partial}{\partial x_k}\sum_{j=1}^n (M^TMx)_j x_j
= \sum_{j=1}^n \frac{\partial (M^TMx)_j}{\partial x_k}x_j + \sum_{j=1}^n (M^TMx)_j\frac{\partial x_j}{\partial x_k}
= \sum_{j=1}^n \frac{\partial}{\partial x_k}\Big( \sum_{i=1}^n (M^TM)_{ji}x_i \Big)x_j + (M^TMx)_k
= \sum_{j=1}^n (M^TM)_{jk}x_j + (M^TMx)_k = 2(M^TMx)_k \tag{4.4}
\]
by the symmetry of $M^TM$. Every minimum point $\hat x$ satisfies
\[
0 = \nabla f(\hat x) \overset{(4.2)}{=} \nabla\big( (M^TM\hat x, \hat x) - 2(M^Ty, \hat x) + (y, y) \big) \overset{(4.3),(4.4)}{=} 2M^TM\hat x - 2M^Ty + 0. \tag{4.5}
\]
On the other hand, when $\hat x$ satisfies $M^TM\hat x = M^Ty$ and $x \in \mathbb{R}^n$, it follows that
\[
f(x) = |M(x - \hat x) + M\hat x - y|^2 = |M(x - \hat x)|^2 + 2(M(x - \hat x), M\hat x - y) + |M\hat x - y|^2
= |M(x - \hat x)|^2 + 2(x - \hat x, M^TM\hat x - M^Ty) + |M\hat x - y|^2
= |M(x - \hat x)|^2 + |M\hat x - y|^2 \ge |M\hat x - y|^2 = f(\hat x).
\]
Hence $\hat x$ is a minimum point of $f$. $\square$


Does the LS solution always exist? Let's recall some linear algebra. The orthogonal complement of a linear subspace $V \subset \mathbb{R}^n$ is the subspace
\[
V^\perp = \{x \in \mathbb{R}^n : (x, y) = 0 \ \forall y \in V\}.
\]
We have $\mathbb{R}^n = V \oplus V^\perp$ (meaning that every $x \in \mathbb{R}^n$ is of the form $x = x_1 + x_2$, where $x_1 \in V$, $x_2 \in V^\perp$ and $(x_1, x_2) = 0$). Moreover, $(V^\perp)^\perp = V$.

Denote $R(A) = A(\mathbb{R}^n)$ for $A \in \mathbb{R}^{m\times n}$.

Lemma 5. Let $A \in \mathbb{C}^{m\times n}$. Then
\[
R(A^*)^\perp = N(A).
\]
Proof. Let $x \in R(A^*)^\perp$. For every $y \in \mathbb{C}^m$, we have
\[
0 = (A^*y, x) = (y, Ax). \tag{4.6}
\]
By choosing $y = Ax$ in (4.6), we see that $0 = |Ax|^2$. Then $Ax = 0$, i.e. $x \in N(A)$. Therefore, $R(A^*)^\perp \subset N(A)$. On the other hand, if $x \in N(A)$, then
\[
(A^*y, x) = (y, Ax) = 0
\]
for every $y \in \mathbb{C}^m$, so that $x \in R(A^*)^\perp$. Hence $N(A) \subset R(A^*)^\perp$. $\square$

Theorem 8. Let $M \in \mathbb{R}^{m\times n}$ and $y \in \mathbb{R}^m$. Then there exists a least squares solution
\[
\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n} |Mx - y|.
\]
Moreover, the LS solution is unique if and only if $N(M) = \{0\}$. For LS solutions $\hat x_1 \ne \hat x_2$, it holds that $\hat x_1 - \hat x_2 \in N(M)$.

Proof. By Theorem 7 the minimisation problem is equivalent to the equation $M^TM\hat x = M^Ty$. Let's study the solvability of $M^TM\hat x = M^Ty$. First, we show that
\[
N(M) = N(M^TM). \tag{4.7}
\]
Clearly $N(M) \subset N(M^TM)$. Furthermore, $x \in N(M^TM)$, which means that $M^TMx = 0$, if and only if
\[
0 = (M^TMx, z) = (Mx, Mz)
\]
for every $z \in \mathbb{R}^n$. Especially, when $z = x$, we get $|Mx| = 0$, which is equivalent to $x \in N(M)$. In other words $N(M^TM) \subset N(M)$. Hence, (4.7) holds, which implies that $M^TM$ is injective if and only if $M$ is injective. Next, we show that $M^Ty \in R(M^TM)$. By choosing $A = M$ and $A = M^TM$ in Lemma 5, we get with the help of (4.7) that
\[
R(M^T) = N(M)^\perp = N(M^TM)^\perp = R(M^TM).
\]
Therefore, $M^TM\hat x = M^Ty$ has at least one solution, and the solution is unique if and only if $N(M) = \{0\}$. Moreover, $M^TM(\hat x_1 - \hat x_2) = 0$ when $\hat x_1$ and $\hat x_2$ are two LS solutions. $\square$

Example 21. We have the following noisy observations of the unknown $x = (x_1, x_2) \in \mathbb{R}^2$:
\[
1 = x_1 + e_1, \qquad
3 = x_1 + x_2 + e_2, \qquad
4 = x_1 + x_2 + e_3, \qquad
2 = x_2 + e_4.
\]
Find an approximate solution by using the least squares method. Denote
\[
y = \begin{pmatrix} 1 \\ 3 \\ 4 \\ 2 \end{pmatrix}, \qquad
M = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}
\qquad\text{and}\qquad
e = \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix}.
\]
Determine the LS solution for the given data $y = Mx + e$. Let's calculate
\[
M^TM = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}
\qquad\text{and}\qquad
M^Ty = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ 3 \\ 4 \\ 2 \end{pmatrix}
= \begin{pmatrix} 8 \\ 9 \end{pmatrix}.
\]
We obtain the equation
\[
M^TM\hat x = M^Ty, \qquad
\begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} \hat x_1 \\ \hat x_2 \end{pmatrix} = \begin{pmatrix} 8 \\ 9 \end{pmatrix},
\]
which has the solution $(\hat x_1, \hat x_2) = \big( \frac65, \frac{11}{5} \big)$.

Example 22. Consider the problem in Example 8, where
\[
M = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]
We already know that $N(M) = \{(x_1, x_2, x_3) \in \mathbb{R}^3 : x_1 = -x_2,\ x_3 = 0\}$. Let $y = (1, 1)$. Then
\[
M^TM = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\qquad\text{and}\qquad
M^Ty = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
\]
Now $\det(M^TM) = 0$, so the matrix $M^TM$ is singular. However, the equation $M^Ty = M^TM\hat x$ has infinitely many solutions
\[
\hat x = (x_1, 1 - x_1, 1),
\]
where $x_1 \in \mathbb{R}$. For example, $\hat x = (0, 1, 1)$ and $\hat x = (5, -4, 1)$.

Minimum norm solutions


Two mathematicians are solving the same equation separately with the least squares method. They discover that the solution is not unique. Both want to represent their solution graphically and compare the pictures. The comparison becomes feasible if they choose a common rule for picking the representative among the least squares solutions. One way to do this is to use the following concept.

Definition 11. A least squares solution $\hat x$ of (⋆) is called the minimum norm solution if
\[
|\hat x| = \min\{|x| : x \in \mathbb{R}^n,\ M^TMx = M^Ty\}.
\]

Example 23. In Example 20 the minimum norm solution is $(\hat x_1, \hat x_2) = (1, 0)$.

The next theorem shows that working with minimum norm solutions removes the
nonuniqueness of the approximate solutions (but the cost is increased error in the LS
solution).

Theorem 9. The least squares minimum norm solution of (⋆) is unique.

Proof. Let $\hat x$ be an LS solution of (⋆). Denote by $Q_{R(M^T)}$ the orthogonal projection onto $R(M^T)$. By Lemma 5, $\mathbb{R}^n = R(M^T) \oplus N(M)$, and by (4.7)
\[
M^TM(Q_{R(M^T)}\hat x) = M^TM(Q_{R(M^T)}\hat x + Q_{N(M)}\hat x) = M^TM\hat x = M^Ty,
\]
so that $Q_{R(M^T)}\hat x$ is also an LS solution. Moreover,
\[
|Q_{R(M^T)}\hat x|^2 < |Q_{R(M^T)}\hat x|^2 + |Q_{N(M)}\hat x + z|^2 = |\hat x + z|^2
\]
for any $z \in N(M)$ such that $z \ne -Q_{N(M)}\hat x$. $\square$

Remark 8. According to the above proof, the minimum norm solution is of the form
\[
Q_{R(M^T)}\hat x \overset{\text{Lemma 5}}{=} Q_{N(M)^\perp}\hat x.
\]

Singular value decompositions


A practical way to obtain LS minimum norm solutions is to use singular value decompo-
sition.

Definition 12. A singular value decomposition (SVD) of a matrix $M \in \mathbb{C}^{m\times n}$ is
\[
M = UDV^*,
\]
where $U$ and $V$ are unitary ($U^{-1} = U^*$ and $V^{-1} = V^*$) and
\[
D_{ij} = \sigma_j\,\delta_{ij},
\]
where $i = 1, \dots, m$, $j = 1, \dots, n$ and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$ are the singular values of $M$.

Definition 13. Let $M \in \mathbb{C}^{m\times n}$ have a singular value decomposition $M = UDV^*$, where
\[
D = \begin{pmatrix} \widetilde D_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}
\]
for some $r \in \{1, \dots, \min(m, n)\}$ and $\widetilde D_{ii} > 0$ when $i = 1, \dots, r$. The matrix
\[
M^+ = V\begin{pmatrix} \widetilde D^{-1}_{r\times r} & 0_{r\times(m-r)} \\ 0_{(n-r)\times r} & 0_{(n-r)\times(m-r)} \end{pmatrix}U^*
\]
is called the Moore-Penrose pseudoinverse of $M$.

Remark 9. 1) Especially $M^+ = VD^+U^*$. 2) If $D$ is a non-singular square matrix, then
\[
M^+ = VD^+U^* = VD^{-1}U^* = M^{-1}.
\]

Example 24. When
\[
M = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix},
\qquad\text{then}\qquad
M^+ = \begin{pmatrix} \frac14 & 0 & 0 & 0 \\ 0 & \frac13 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.
\]

Theorem 10. Let $M \in \mathbb{C}^{m\times n}$ have Moore-Penrose pseudoinverse $M^+$. Then

1. $Q_{R(M)} = MM^+$.

2. $Q_{N(M)^\perp} = M^+M$.

3. $Q_{N(M)^\perp}M^+ = M^+Q_{R(M)} = M^+$.

Proof. Let $M = UDV^*$ be the SVD of $M$, where $D_{ii} > 0$ if and only if $i \le r$. Denote $V = (V_1 \cdots V_n)$. Then
\[
0 = Mx = UDV^*x
\quad\Longleftrightarrow\quad
0 = DV^*x = \begin{pmatrix} D_{11}(x, V_1) \\ \vdots \\ D_{rr}(x, V_r) \\ 0_{(m-r)\times 1} \end{pmatrix}
\]
if and only if $x = \sum_{i=r+1}^n x_i'V_i$. Then
\[
N(M) = \mathrm{span}\{V_{r+1}, \dots, V_n\}. \tag{4.8}
\]
In the same way, for $M^* = VD^*U^*$ we get $N(M^*) = \mathrm{span}\{U_{r+1}, \dots, U_m\}$. By Lemma 5,
\[
R(M) = N(M^*)^\perp = \mathrm{span}\{U_1, \dots, U_r\}. \tag{4.9}
\]
Then
\[
MM^+ = (UDV^*)(VD^+U^*) = U(DD^+)U^*
= U\begin{pmatrix} I_{r\times r} & 0_{r\times(m-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(m-r)} \end{pmatrix}U^*
= (U_1, \dots, U_r)\begin{pmatrix} U_1^* \\ \vdots \\ U_r^* \end{pmatrix},
\]
which is the orthogonal projection onto the image space by (4.9). Claim 2 follows from (4.8) similarly. Claim 3 follows from using the above formulas for the projections and noticing that
\[
M^+ = V\begin{pmatrix} \widetilde D^{-1}_{r\times r} & 0_{r\times(m-r)} \\ 0_{(n-r)\times r} & 0_{(n-r)\times(m-r)} \end{pmatrix}U^*
= (V_1 \cdots V_r)\,\widetilde D^{-1}\begin{pmatrix} U_1^* \\ \vdots \\ U_r^* \end{pmatrix}. \qquad\square
\]

Corollary 3. Let a non-trivial matrix $M \in \mathbb{R}^{m\times n}$ have SVD $M = UDV^T$. Then the LS minimum norm solution of $y = Mx$ is
\[
\hat x = M^+y.
\]
Proof. The vector $\hat x = M^+y$ satisfies
\[
M^TM\hat x = M^TM(M^+y) = M^T(MM^+)y = M^Ty,
\]
since $MM^+ = Q_{R(M)}$ by Theorem 10 and $y = Q_{R(M)}y + (I - Q_{R(M)})y = Q_{R(M)}y + Q_{N(M^T)}y$ by Lemma 5.

The vector $\hat x$ has minimum norm by the Remark after Theorem 9, since
\[
Q_{R(M^T)}M^+y \overset{\text{Lemma 5}}{=} Q_{N(M)^\perp}M^+y \overset{\text{Theorem 10}}{=} M^+y. \qquad\square
\]

Example 25. Determine the LS minimum norm solution of $y = Mx$ when $y = (3, 4)$ and $M$ has an SVD of the form
\[
M = UDV^T, \qquad
U = V = \begin{pmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2} \\ \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{pmatrix}, \qquad
D = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}.
\]
Solution:
\[
\hat x = M^+y = VD^+U^Ty
= \begin{pmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2} \\ \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{pmatrix}
\begin{pmatrix} \frac12 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2} \\ \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{pmatrix}
\begin{pmatrix} 3 \\ 4 \end{pmatrix}
= \begin{pmatrix} \frac14 & \frac14 \\ \frac14 & \frac14 \end{pmatrix}\begin{pmatrix} 3 \\ 4 \end{pmatrix}
= \begin{pmatrix} \frac74 \\ \frac74 \end{pmatrix}.
\]
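Numerically, the minimum norm LS solution is obtained with the Moore-Penrose pseudoinverse. A sketch for Example 25 (assuming U = V is the orthogonal matrix with columns (1,1)/√2 and (1,−1)/√2, which gives M = [[1,1],[1,1]]):

```python
import numpy as np

# The matrix assembled from the SVD of Example 25 (under the assumption above).
M = np.array([[1.0, 1.0],
              [1.0, 1.0]])
y = np.array([3.0, 4.0])

# The Moore-Penrose pseudoinverse gives the minimum norm LS solution.
x_mn = np.linalg.pinv(M) @ y
print(x_mn)          # [1.75 1.75], i.e. (7/4, 7/4)

# Any other LS solution differs from x_mn by an element of N(M) and has a larger norm.
```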

4.1.2 Accuracy of the LS solution

How close is the LS solution of $y = Mx$ to the truth when the data is of the form $y = Mx_0 + \varepsilon$? In the case of minimum norm solutions
\[
\hat x = M^+y = M^+Mx_0 + M^+\varepsilon = Q_{N(M)^\perp}x_0 + M^+Q_{R(M)}\varepsilon
\]
by Theorem 10. The difference to the true unknown is
\[
x_0 - \hat x = (I - Q_{N(M)^\perp})x_0 - M^+Q_{R(M)}\varepsilon = Q_{N(M)}x_0 - M^+Q_{R(M)}\varepsilon.
\]
The minimum norm solution $\hat x$ does not contain the components of the unknown $x_0$ that belong to the null space of $M$.

The LS solution $\hat x$ is immune to the components of the noise $\varepsilon$ that belong to $R(M)^\perp$.

The LS solution is affected by the components of $\varepsilon$ that belong to $R(M)$.

How much does the noise term $M^+Q_{R(M)}\varepsilon$ affect the LS minimum norm solution? Consider the condition numbers. Although $M = UDV^T$ may not be invertible in the whole space, we can consider smaller subspaces. Denote
\[
M = U\begin{pmatrix} \widetilde D_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}V^T
= (U_1 \cdots U_r)\,\widetilde D\begin{pmatrix} V_1^T \\ \vdots \\ V_r^T \end{pmatrix},
\]
where $\widetilde D$ is invertible. Recall that $N(M)^\perp = \mathrm{span}\{V_1, \dots, V_r\}$ and $R(M) = \mathrm{span}\{U_1, \dots, U_r\}$. Then the linear mapping $F : N(M)^\perp \ni x \mapsto Mx \in R(M)$ is invertible and its matrix with respect to the bases $\{V_1, \dots, V_r\}$ and $\{U_1, \dots, U_r\}$ is $\widetilde D$. Especially, $\widetilde D$ is non-singular and has condition number
\[
\kappa(\widetilde D) = \frac{\widetilde D_{11}}{\widetilde D_{rr}}.
\]
Hence, the relative errors satisfy
\[
\frac{|M^+\varepsilon|}{|Q_{N(M)^\perp}x_0|} \le \kappa(\widetilde D)\,\frac{|Q_{R(M)}\varepsilon|}{|Mx_0|}.
\]
The worst case relative error is explained by the non-zero singular values of $M$. Problems occur when $M$ has very small (compared to the matrix norm) non-zero singular values.

4.1.3 Regularisation

In the least squares method, the ill-posed problem $y = Mx$ is replaced with a closely related well-posed problem $M^TM\hat x = M^Ty$. In general, regularisation means a method to replace an ill-conditioned problem with a better conditioned problem. A linear finite-dimensional inverse problem that contains disturbances is often presented in the following form.

Linear finite-dimensional inverse problem with disturbances (⋆⋆)

Let $M \in \mathbb{R}^{m\times n}$ be a theory matrix. Estimate the unknown $x_0 \in \mathbb{R}^n$, when the data $y = Mx_0 + \varepsilon \in \mathbb{R}^m$ is given.

4.1.4 Truncated singular value decompositions


Let the matrix $M \in \mathbb{R}^{m\times n}$ have SVD $M = UDV^T$ and let the given data be $y = Mx_0 + \varepsilon$. If we choose the LS solution $\hat x = \operatorname{argmin}_{x\in\mathbb{R}^n} |y - Mx|$ as an estimate for the unknown, then the ill-conditionedness of the problem can be described by the condition number
\[
\kappa = \frac{D_{11}}{D_{rr}},
\]
where $D_{11}$ is the largest singular value of $M$ and $D_{rr}$ is the smallest non-zero singular value of $M$. A simple method to improve the ill-conditionedness of the problem is to replace the smallest non-zero singular values of $M$ with zeros.

Definition 14. Let $M \in \mathbb{C}^{m\times n}$ be a non-trivial matrix with SVD $M = UDV^T$, where $D_{11} \ge D_{22} \ge \dots \ge D_{rr} > 0$ and $D_{ij} = 0$ otherwise. The truncated singular value decomposition (TSVD) of $M$ is the matrix
\[
M_{(k)} = UD_{(k)}V^T,
\]
where $k$ is some number in $\{1, \dots, r-1\}$, $(D_{(k)})_{ii} = D_{ii}$ when $i \le k$ and $(D_{(k)})_{ij} = 0$ otherwise.

With the TSVD of $M$ we get a regularised problem
\[
y = M_{(k)}x, \tag{4.10}
\]
whose LS solution is
\[
\hat x_{(k)} = M_{(k)}^+y. \tag{4.11}
\]
The condition number for equation (4.10) is
\[
\kappa_k = \frac{D_{11}}{D_{kk}}.
\]
Definition 15. The TSVD regularised solution of $y = Mx$ is
\[
\hat x_{(k)} = M_{(k)}^+y.
\]
Remark 10. The cost of the smaller condition number is the larger kernel $N(M_{(k)})$. This decreases the accuracy of the estimate $M_{(k)}^+y$. We saw earlier that the minimum norm solution (here $M_{(k)}^+y$) does not contain components from the linear subspace $N(M_{(k)})$.

Example 26. Let $M$ have the SVD
\[
M = U\begin{pmatrix} \frac{1}{10} & 0 & 0 & 0 \\ 0 & \frac{1}{10} & 0 & 0 \\ 0 & 0 & \frac{1}{10} & 0 \\ 0 & 0 & 0 & \frac{1}{10000} \end{pmatrix}V^T
\]
and let the given data be $y = Mx_0 + \varepsilon$. Then
\[
M_{(3)} = U\begin{pmatrix} \frac{1}{10} & 0 & 0 & 0 \\ 0 & \frac{1}{10} & 0 & 0 \\ 0 & 0 & \frac{1}{10} & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}V^T
\qquad\text{and}\qquad
M_{(3)}^+ = V\begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}U^T.
\]
Let's study the case where $|\varepsilon| \le \frac{1}{10}$. Say, $\varepsilon = U\big( \frac{2}{100}, \frac{1}{100}, \frac{3}{100}, \frac{1}{100} \big)^T$ for example. Then the equation $y = Mx$ has the solution
\[
M^{-1}y = x_0 + V\begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 10000 \end{pmatrix}U^TU\begin{pmatrix} \frac{2}{100} \\ \frac{1}{100} \\ \frac{3}{100} \\ \frac{1}{100} \end{pmatrix}
= x_0 + V\begin{pmatrix} \frac{2}{10} \\ \frac{1}{10} \\ \frac{3}{10} \\ 100 \end{pmatrix}.
\]
Now $|x_0 - M^{-1}y| \ge 100$. But with TSVD we get
\[
M_{(3)}^+y = M_{(3)}^+Mx_0 + V\begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}U^TU\begin{pmatrix} \frac{2}{100} \\ \frac{1}{100} \\ \frac{3}{100} \\ \frac{1}{100} \end{pmatrix}
= V\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}V^Tx_0 + V\begin{pmatrix} \frac{2}{10} \\ \frac{1}{10} \\ \frac{3}{10} \\ 0 \end{pmatrix}
= x_0 - V_4V_4^Tx_0 + V\begin{pmatrix} \frac{2}{10} \\ \frac{1}{10} \\ \frac{3}{10} \\ 0 \end{pmatrix},
\]
which implies that
\[
|x_0 - M_{(3)}^+y| = \Big( |(V_4, x_0)|^2 + \frac{14}{100} \Big)^{1/2}.
\]
If the inner product of $x_0$ with $V_4$ is not too large, then the TSVD regularised solution gives a better estimate for the unknown than $M^{-1}y$ does.
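In practice a TSVD solution is computed directly from the SVD by inverting only the k largest singular values. A sketch (the test matrix, true unknown and noise below are hypothetical illustrative choices, not the data of Example 26):

```python
import numpy as np

def tsvd_solution(M, y, k):
    """LS solution of the truncated problem y = M_(k) x, i.e. x_(k) = M_(k)^+ y."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_inv = np.zeros_like(s)
    s_inv[:k] = 1.0 / s[:k]              # keep only the k largest singular values
    return Vt.T @ (s_inv * (U.T @ y))

# Hypothetical ill-conditioned 4x4 test matrix with singular values (10, 10, 10, 1e-4).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = U @ np.diag([10.0, 10.0, 10.0, 1e-4]) @ V.T

x0 = np.array([1.0, -2.0, 0.5, 1.0])
y = M @ x0 + 1e-2 * rng.standard_normal(4)

print(np.linalg.norm(x0 - np.linalg.solve(M, y)))    # error of the naive inverse solve
print(np.linalg.norm(x0 - tsvd_solution(M, y, 3)))   # error of the TSVD solution (typically much smaller)
```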

The accuracy of the TSVD regularised solution is
\[
|x_0 - \hat x_{(k)}| = |x_0 - M_{(k)}^+y| = |x_0 - M_{(k)}^+Mx_0 - M_{(k)}^+\varepsilon|.
\]
It is easy to verify that $M_{(k)}^+M = M_{(k)}^+M_{(k)}$, which by Theorem 10 is the orthogonal projection $Q_{N(M_{(k)})^\perp}$. Then
\[
|x_0 - \hat x_{(k)}|^2 = |(I - Q_{N(M_{(k)})^\perp})x_0 - M_{(k)}^+\varepsilon|^2 = |Q_{N(M_{(k)})}x_0 - M_{(k)}^+\varepsilon|^2
= |Q_{N(M_{(k)})}x_0|^2 - 2(Q_{N(M_{(k)})}x_0, M_{(k)}^+\varepsilon) + |M_{(k)}^+\varepsilon|^2
\overset{\text{Theorem 10}}{=} |Q_{N(M_{(k)})}x_0|^2 - 2(Q_{N(M_{(k)})}x_0, Q_{N(M_{(k)})^\perp}M_{(k)}^+\varepsilon) + |M_{(k)}^+\varepsilon|^2
= |Q_{N(M_{(k)})}x_0|^2 + |M_{(k)}^+\varepsilon|^2.
\]
The regularised solution $\hat x_{(k)}$ does not contain those components of the unknown $x_0$ that belong to the kernel of $M_{(k)}$, i.e. $\mathrm{span}\{V_{k+1}, \dots, V_n\}$.

The regularised solution $\hat x_{(k)}$ is immune to those components of $\varepsilon$ that belong to $R(M_{(k)})^\perp = \mathrm{span}\{U_{k+1}, \dots, U_m\}$.

The regularised solution $\hat x_{(k)}$ is affected by those components of $\varepsilon$ that belong to $R(M_{(k)}) = \mathrm{span}\{U_1, \dots, U_k\}$.

4.1.5 Tikhonov regularization
Let $x_0 \in \mathbb{R}^n$ be the unknown, $M \in \mathbb{R}^{m\times n}$ the theory matrix and
\[
y = Mx_0 + \varepsilon \in \mathbb{R}^m \tag{4.12}
\]
the given data. In an ill-conditioned problem the LS minimum norm solution
\[
\hat x = M^+y = M^+Mx_0 + M^+\varepsilon = Q_{N(M)^\perp}x_0 + M^+\varepsilon
\]
can be very inaccurate due to the term $M^+\varepsilon$, although $|y - M\hat x|$ is as small as possible. One way to improve the ill-conditioned problem is to search for approximative solutions $x$ for which $|y - Mx|$ is small but the norm $|x|$ is not too large.

In Tikhonov regularisation the approximative solution of $y = Mx$ is taken to be the $x_\alpha$ that minimizes the Tikhonov functional
\[
L_\alpha(x) := |Mx - y|^2 + \alpha|x|^2, \tag{4.13}
\]
i.e.
\[
x_\alpha = \operatorname*{argmin}_{x\in\mathbb{R}^n}\big( |Mx - y|^2 + \alpha|x|^2 \big).
\]
The number $\alpha > 0$ is a constant that is called the regularization parameter.

Remark 11. The Tikhonov functional differs from the LS functional by a penalization term $\alpha|x|^2$. The aim of the penalization term is to reject those vectors that contain very large inaccuracies.

The terms in the Tikhonov functional are squared norms. The next theorem shows that this is a very useful choice. The following theorem also gives a way to solve the minimisation problem by solving a linear equation.

Theorem 11. Let $\alpha > 0$. The minimisation problem
\[
|Mx_\alpha - y|^2 + \alpha|x_\alpha|^2 = \min_{x\in\mathbb{R}^n}\big( |Mx - y|^2 + \alpha|x|^2 \big)
\]
has a unique solution $x_\alpha$. The solution $x_\alpha$ coincides with the unique solution of
\[
(M^TM + \alpha I)x_\alpha = M^Ty.
\]
Proof. Let's write the Tikhonov functional in the form
\[
|Mx - y|^2 + \alpha|x|^2 = \Big| \begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}x - \begin{pmatrix} y \\ 0 \end{pmatrix} \Big|^2,
\]
which leads to an LS problem. We can use Theorem 7 to find the minimiser by solving
\[
\begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}^T\begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}x_\alpha = \begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}^T\begin{pmatrix} y \\ 0 \end{pmatrix},
\]
in other words
\[
(M^TM + \alpha I)x_\alpha = M^Ty.
\]
This equation has a unique solution by Theorem 8, since the kernel of $\begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}$ contains only zero by the lower row of
\[
0 = \begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}x = \begin{pmatrix} Mx \\ \sqrt{\alpha}\,x \end{pmatrix}. \qquad\square
\]
Remark 12. Above, it was actually verified that Tikhonov regularisation corresponds to the LS solution of
\[
\begin{pmatrix} M \\ \sqrt{\alpha}\,I \end{pmatrix}x = \begin{pmatrix} y \\ 0 \end{pmatrix}.
\]

Example 27. Consider Example 17, where the matrix
\[
M = \begin{pmatrix} -11 & -10 & 14 \\ 12 & 11 & 13 \\ 14 & 13 & 66 \end{pmatrix}
\]
has condition number about $10^5$. Let $y = Mx_0 + \varepsilon \in \mathbb{R}^3$ be the given data. Consider the case where the unknown is $x_0 = (0, 0, 1)$ and $\varepsilon = (0.1, 0.1, -0.1)$. Then
\[
Mx_0 = \begin{pmatrix} 14 & 13 & 66 \end{pmatrix}^T
\qquad\text{and}\qquad
y = Mx_0 + \varepsilon = \begin{pmatrix} 14.1 & 13.1 & 65.9 \end{pmatrix}^T.
\]
We saw in Example 17 that
\[
M^{-1}(Mx_0 + \varepsilon) = x_0 + \begin{pmatrix} 168.3 & -184.3 & 0.6 \end{pmatrix}^T.
\]
Let's solve the problem with Tikhonov regularisation. Calculate first
\[
M^TM = \begin{pmatrix} 461 & 424 & 926 \\ 424 & 390 & 861 \\ 926 & 861 & 4721 \end{pmatrix}.
\]
Choose $\alpha = 0.01$ and calculate
\[
x_\alpha = (M^TM + \alpha I)^{-1}M^Ty
= \begin{pmatrix} 461.01 & 424 & 926 \\ 424 & 390.01 & 861 \\ 926 & 861 & 4721.01 \end{pmatrix}^{-1}
\begin{pmatrix} -11 & 12 & 14 \\ -10 & 11 & 13 \\ 14 & 13 & 66 \end{pmatrix}
\begin{pmatrix} 14.1 \\ 13.1 \\ 65.9 \end{pmatrix}
\approx \begin{pmatrix} 0.003 \\ 0.006 \\ 1.001 \end{pmatrix}.
\]
Wow!
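Numerically, the Tikhonov regularised solution is obtained by solving the linear system of Theorem 11. A sketch (the 2x2 matrix, true unknown and noise below are hypothetical illustrative choices, not the matrix of Example 27):

```python
import numpy as np

def tikhonov(M, y, alpha):
    """Minimiser of |Mx - y|^2 + alpha*|x|^2, computed from (M^T M + alpha I) x = M^T y."""
    n = M.shape[1]
    return np.linalg.solve(M.T @ M + alpha * np.eye(n), M.T @ y)

# Hypothetical ill-conditioned 2x2 example.
M = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x0 = np.array([0.0, 1.0])
y = M @ x0 + np.array([1e-3, -1e-3])

print(np.linalg.solve(M, y))    # naive solve: wildly amplified, far from x0
print(tikhonov(M, y, 1e-2))     # regularised solution: stays bounded (about (0.5, 0.5) here)
```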

Choosing the regularisation parameter

How does the parameter $\alpha$ affect the regularised solution? First, we ask what happens to $x_\alpha$ when $\alpha \to 0$ or when $\alpha \to \infty$. We need to calculate the limits
\[
\lim_{\alpha\to 0^+}(M^TM + \alpha I)^{-1}M^Ty
\qquad\text{and}\qquad
\lim_{\alpha\to\infty}(M^TM + \alpha I)^{-1}M^Ty,
\]
if they exist.

Assume for simplicity that zero is not an eigenvalue of $M^TM$. Then the inverse $(M^TM)^{-1}$ exists and we can study
\[
|x_\alpha - \hat x| = |(M^TM + \alpha I)^{-1}M^Ty - (M^TM)^{-1}M^Ty|.
\]
The difference of two inverse matrices satisfies
\[
B^{-1} - C^{-1} = B^{-1}(I - BC^{-1}) = B^{-1}(C - B)C^{-1}.
\]
Especially,
\[
(M^TM + \alpha I)^{-1} - (M^TM)^{-1} = (M^TM + \alpha I)^{-1}(-\alpha I)(M^TM)^{-1}.
\]
Then
\[
|(M^TM + \alpha I)^{-1}M^Ty - (M^TM)^{-1}M^Ty| \le \alpha\,\|(M^TM + \alpha I)^{-1}\|\,|(M^TM)^{-1}M^Ty|.
\]
Recall that $\|(M^TM + \alpha I)^{-1}\|$ is the inverse of the smallest eigenvalue $\lambda_{\min}$ of the matrix $(M^TM + \alpha I)$. Denote by $u_{\min}$ the corresponding normed eigenvector. We can estimate the smallest eigenvalue as follows:
\[
\lambda_{\min} = ((M^TM + \alpha I)u_{\min}, u_{\min}) = (M^TMu_{\min}, u_{\min}) + \alpha \ge (M^TMu_{\min}, u_{\min}) \ge \lambda_{\min}(M^TM).
\]
Then we get the estimate
\[
|(M^TM + \alpha I)^{-1}M^Ty - (M^TM)^{-1}M^Ty| \le \alpha\,\lambda_{\min}(M^TM)^{-1}\,|(M^TM)^{-1}M^Ty|,
\]
which implies that
\[
\lim_{\alpha\to 0^+} x_\alpha = \lim_{\alpha\to 0^+}(M^TM + \alpha I)^{-1}M^Ty = (M^TM)^{-1}M^Ty. \tag{4.14}
\]
In the general case, it holds that
\[
\lim_{\alpha\to 0^+} x_\alpha = \lim_{\alpha\to 0^+}(M^TM + \alpha I)^{-1}M^Ty = M^+y. \tag{4.15}
\]
In the same way,
\[
|(M^TM + \alpha I)^{-1}M^Ty| = \alpha^{-1}|(\alpha^{-1}M^TM + I)^{-1}M^Ty| \le \alpha^{-1}\lambda_{\min}(I)^{-1}|M^Ty|,
\]
which implies that
\[
\lim_{\alpha\to\infty} x_\alpha = \lim_{\alpha\to\infty}(M^TM + \alpha I)^{-1}M^Ty = 0. \tag{4.16}
\]
For large values of $\alpha$ the regularised solution approaches zero. For small values of $\alpha$ the regularised solution approaches the LS minimum norm solution.

The choice of $\alpha$ can be done as follows.

Definition 16. Let $y = Mx_0 + \varepsilon$ be the given data, where $|\varepsilon| \le e$. According to Morozov's discrepancy principle the parameter $\alpha$ is chosen so that
\[
|Mx_\alpha - y| = e,
\]
if this choice is possible.

The idea is to avoid the situation where the regularised solution fits the error rather than the true unknown. The aim is to get $x_\alpha$ close to $x_0$, so that
\[
|Mx_\alpha - y| = |(Mx_\alpha - Mx_0) - \varepsilon| \approx |\varepsilon|.
\]

Example 28. Let's take
\[
M = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
Let the given data be $y = Mx_0 + \varepsilon = (2, 1)$, and we know that $|\varepsilon| \le \frac{1}{10}$. We determine $\alpha$ by Morozov's discrepancy principle. We define the function
\[
[0, \infty) \ni \alpha \mapsto Mx_\alpha = M(M^TM + \alpha I)^{-1}M^Ty
= \begin{pmatrix} \frac{1}{1+\alpha} & 0 \\ 0 & \frac{1}{1+\alpha} \end{pmatrix}\begin{pmatrix} 2 \\ 1 \end{pmatrix}
= \begin{pmatrix} \frac{2}{1+\alpha} \\ \frac{1}{1+\alpha} \end{pmatrix}
\]
($\alpha$ is now a variable!). Set
\[
\frac{1}{10} = e = |Mx_\alpha - y|
= \Big| \begin{pmatrix} \frac{2}{1+\alpha} \\ \frac{1}{1+\alpha} \end{pmatrix} - \begin{pmatrix} 2 \\ 1 \end{pmatrix} \Big|
= \sqrt{\Big( \frac{2}{1+\alpha} - 2 \Big)^2 + \Big( \frac{1}{1+\alpha} - 1 \Big)^2}
= \frac{\sqrt{5}\,\alpha}{1+\alpha}.
\]
We get the equation
\[
\frac{1}{10}(1+\alpha) = \sqrt{5}\,\alpha
\quad\Longleftrightarrow\quad
\alpha = \frac{1}{10\sqrt{5} - 1} \approx 0.05.
\]
Then $x_{\alpha=0.05} = \big( \frac{2}{1+0.05}, \frac{1}{1+0.05} \big) \approx (1.90, 0.95)$.
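Because the residual α ↦ |Mx_α − y| is non-decreasing (as is verified right after this example), the Morozov parameter can also be found numerically, e.g. by bisection. A sketch reproducing Example 28 (the bracketing interval and iteration count below are arbitrary choices):

```python
import numpy as np

def tikhonov(M, y, alpha):
    n = M.shape[1]
    return np.linalg.solve(M.T @ M + alpha * np.eye(n), M.T @ y)

def morozov_alpha(M, y, e, lo=1e-12, hi=1e6, iters=100):
    """Find alpha with |M x_alpha - y| = e by bisection on a log scale,
    using that the residual is a non-decreasing function of alpha."""
    f = lambda a: np.linalg.norm(M @ tikhonov(M, y, a) - y) - e
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return np.sqrt(lo * hi)

# The data of Example 28: M = I, y = (2, 1), noise level e = 1/10.
M = np.eye(2)
y = np.array([2.0, 1.0])
alpha = morozov_alpha(M, y, 0.1)
print(alpha)                      # approx. 1/(10*sqrt(5) - 1) ≈ 0.047
print(tikhonov(M, y, alpha))      # approx. (1.91, 0.96)
```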
When can Morozov's discrepancy principle be applied? Let $M \in \mathbb{R}^{m\times n}$ have SVD $M = UDV^T$. Let the given data be $y = Mx_0 + \varepsilon$. We use Tikhonov regularization for $\alpha > 0$ and obtain the estimate
\[
x_\alpha = (M^TM + \alpha I)^{-1}M^Ty,
\]
where
\[
M^TM + \alpha I = VD^TU^TUDV^T + \alpha I = VD^TDV^T + \alpha VV^T = V(D^TD + \alpha I)V^T
\]
has eigenvalues $D_{ii}^2 + \alpha$ (or the eigenvalue is $\alpha$). From the SVD we get
\[
x_\alpha = \big( V(D^TD + \alpha I)V^T \big)^{-1}VD^TU^Ty = V(D^TD + \alpha I)^{-1}D^TU^Ty.
\]
Hence
\[
Mx_\alpha = UDV^TV(D^TD + \alpha I)^{-1}D^TU^Ty = UD(D^TD + \alpha I)^{-1}D^TU^Ty
\]
can be written as
\[
(Mx_\alpha)_i = \sum_{j=1}^{\min(m,n)} U_{ij}\,\frac{D_{jj}^2}{D_{jj}^2 + \alpha}\sum_{k=1}^m U_{kj}y_k.
\]
The squared norm of $Mx_\alpha - y$ is
\[
f(\alpha) := |Mx_\alpha - y|^2 = \sum_{j=1}^{\min(m,n)}\Big( \frac{\alpha}{D_{jj}^2 + \alpha} \Big)^2(U^Ty)_j^2 + \sum_{j=\min(m,n)+1}^{m}(U^Ty)_j^2.
\]
What values does $f$ have? Let's calculate the derivative
\[
f'(\alpha) = \sum_{j=1}^{\min(m,n)}\frac{d}{d\alpha}\Big( \frac{\alpha}{D_{jj}^2 + \alpha} \Big)^2(U^Ty)_j^2
= \sum_{j=1}^{\min(m,n)} 2\,\frac{\alpha}{D_{jj}^2 + \alpha}\Big( \frac{1}{D_{jj}^2 + \alpha} - \frac{\alpha}{(D_{jj}^2 + \alpha)^2} \Big)(U^Ty)_j^2
= \sum_{j=1}^{\min(m,n)}\frac{2\alpha D_{jj}^2}{(D_{jj}^2 + \alpha)^3}(U^Ty)_j^2 \ge 0.
\]
Then $f'(\alpha) \ge 0$, which makes $f$ non-decreasing! By (4.16),
\[
\lim_{\alpha\to\infty} f(\alpha) = \lim_{\alpha\to\infty}|M(M^TM + \alpha I)^{-1}M^Ty - y|^2 = |y|^2,
\]
and by (4.15)
\[
\lim_{\alpha\to 0^+} f(\alpha) = |MM^+y - y|^2 \overset{\text{Theorem 10}}{=} |Q_{R(M)^\perp}y|^2.
\]
When $|\varepsilon| \le e$, Morozov's discrepancy principle can be applied if
\[
|Q_{R(M)^\perp}y| \le e \le |y|. \tag{4.17}
\]

Accuracy of Tikhonov regularised solutions

Let $M \in \mathbb{R}^{m\times n}$ and let $y = Mx_0 + \varepsilon$ be the given data. The accuracy of the Tikhonov regularised solution is
\[
|x_0 - x_\alpha| = |x_0 - (M^TM + \alpha I)^{-1}M^TMx_0 - (M^TM + \alpha I)^{-1}M^T\varepsilon|,
\]
which depends on two terms that have opposite behavior as a function of $\alpha$. Denote
\[
G_1(\alpha) = \big( I - (M^TM + \alpha I)^{-1}M^TM \big)x_0
\qquad\text{and}\qquad
G_2(\alpha) = (M^TM + \alpha I)^{-1}M^T\varepsilon.
\]
By equations (4.14)-(4.16), we have that
\[
\lim_{\alpha\to 0} G_1(\alpha) = (I - M^+M)x_0 \ \text{(good)}, \qquad
\lim_{\alpha\to 0} G_2(\alpha) = M^+\varepsilon \ \text{(bad)},
\]
\[
\lim_{\alpha\to\infty} G_1(\alpha) = x_0 \ \text{(bad)}, \qquad
\lim_{\alpha\to\infty} G_2(\alpha) = 0 \ \text{(good)}.
\]
The Tikhonov regularized solution is immune to those components of $\varepsilon$ that belong to $R(M)^\perp$.

The Tikhonov regularised solution is affected by those components of $\varepsilon$ that belong to $R(M)$.

The larger the regularization parameter $\alpha$ is, the smaller the effect of the noise on the solution is. However, simultaneously the distortion caused by the penalization increases.

The penalization distorts the regularised solution even for exact data.

Generalizations

More generally, Tikhonov regularisation means the minimisation problem
\[
x_\alpha = \operatorname*{argmin}_{x\in\mathbb{R}^n}\big( |Mx - y|^2 + \alpha|Bx|^2 \big),
\]
where $B = B_{n'\times n}$ is usually some matrix whose singular values are positive. The vector $Bx$ represents some unwanted feature of the approximative solution.

Example 29.
\[
B = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}.
\]
The matrix $B$ penalizes the differences of neighboring points. This forces the approximative solution to be smoother (see the sketch below).
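A sketch of this generalised Tikhonov problem with a first-difference penalty. The forward matrix M, the true x0 and the noise level below are hypothetical illustrative choices; the minimiser is computed from (M^T M + alpha B^T B) x = M^T y.

```python
import numpy as np

def general_tikhonov(M, y, B, alpha):
    """Minimiser of |Mx - y|^2 + alpha*|Bx|^2 via (M^T M + alpha B^T B) x = M^T y."""
    return np.linalg.solve(M.T @ M + alpha * B.T @ B, M.T @ y)

n = 50
# First-difference penalty (a square version of the B in Example 29).
B = np.eye(n) - np.eye(n, k=-1)

# Hypothetical smoothing forward matrix and a smooth true unknown (illustration only).
t = np.linspace(0.0, 1.0, n)
M = np.exp(-30.0 * (t[:, None] - t[None, :])**2)
x0 = np.sin(2 * np.pi * t)
rng = np.random.default_rng(1)
y = M @ x0 + 0.01 * rng.standard_normal(n)

x_hat = general_tikhonov(M, y, B, 1e-3)
print(np.linalg.norm(x0 - x_hat) / np.linalg.norm(x0))   # relative error of the smoothed estimate
```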
In regularisation, also other norms than inner product norms can be used. For example,
\[
x_\alpha = \operatorname*{argmin}_{x\in\mathbb{R}^n}\Big( |Mx - y|^2 + \alpha\sum_{i=1}^n |x_i| \Big),
\]
where the penalisation term is the so-called $\ell^1$ norm. In such a case, the minimisation problem is solved numerically by different methods than in the case of Tikhonov regularisation.

4.2 Recap
Least squares method:

when no solutions exist, gives approximative solutions.


LS solution exists always but is not necessarily unique.
LS problem can be ill-conditioned.

Tikhonov regularisation:

ill-posed/ill-conditioned problem is replaced by a well posed closely related


problem
the regularised solution is more robust than LS solution against inaccuracies
in data.
some unwanted feature of the approximative solution is penalised .

Please be able to

define what LS solutions, TSVD regularisation and Tikhonov regularisation are

calculate LS solutions for simple cases

calculate TSVD and Tikhonov regularised solutions when the SVD is given

use Morozov's discrepancy principle

choose a suitable approximative solution method for simple problems

Please find out

why approximative solutions are used.

what are the differences between a solution and an approximative solution

what are the differences between LS method and regularisation.

how the regularisation parameter affects the Tikhonov regularised solutions.

Please remember

what is SVD.

what is the minimum norm solution.

Chapter 5

Statistical inverse problems

A statistical inverse problem does not give answer to the question what is the
unknown x0 but rather to the question what do we know about the unknown x0 .

In layman terms, inverse problem is to deduce the causes from consequences. In a


similar way, a statistical inverse problem is to estimate the probabilities of causes
when the noisy consequences and their probabilities are known.

The aim of this chapter is to understand the solution principle of a statistical inverse problem. In Section 5.1 we recap necessary concepts from probability theory (see the blue text below) and multidimensional integration, which is used in calculating the probabilities. In Section 5.2 we meet statistical inverse problems.

Solution principle of a finite dimensional statistical inverse problem:

1. The data and the unknown are modeled as random vectors $Y = (Y_1, \dots, Y_m)$ and $X = (X_1, \dots, X_n)$.

2. The given data y Rm is a sample of the random vector Y .

3. The distribution of the unknown is called the prior distribution. It represents our
knowledge about the unknown.

4. The solution of the statistical inverse problem is the posterior distribution, which
is the conditional distribution of X given Y = y and it has probability density
function (pdf)
f (x|y) = cf (y|x)fpr (x) (Bayes formula)
where f (y|x) is the conditional pdf of Y given X = x,
fpr (x) is the prior pdf of X and c > 0 a norming constant.

Remark 13. The word prior refers to the time when the observation y of the value of Y is not yet available. The word posterior refers to the time when Y = y is available.

Example 30. What does it mean that the distribution represents knowledge about the
unknown? Consider two simple cases (a) unknown x0 R and (b) unknown x0 R2 .
(a) Let the unknown X be tomorrows temperature at noon. Today, we do not know
for sure what is the exact value of X. However, X may have a probability distribution,
whose pdf is, say, f (x). Below are some examples of f .
Figure 5.1: Pdf f: Temperatures in the interval [−10, 0] seem to be unlikely, as are temperatures in [+5, +10]. The temperature +2 has the highest value. On the basis of f, we believe that tomorrow's temperature at noon is around +2 degrees.
Figure 5.2: Pdf f: Temperatures in [5, 10] look unlikely. The temperature −2 has the highest value, but the density is quite wide. This reflects uncertainty about tomorrow's temperature at noon.

Figure 5.3: Pdf f: Temperatures in [−10, −5] and [5, 10] seem to be quite unlikely. The temperature +2 has the highest value, but −2 is also at a local maximum. This reflects uncertainty about tomorrow's temperature at noon. We actually have two scenarios of how the weather will develop.

(b) Let the unknown X = (X1 , X2 ), where X1 and X2 are, for instance, parameters
of the elliptic path X1 x2 + X2 y 2 = 10 of an asteroid moving in the same plane as planets.
Let the pdf of X be f (x) = f (x1 , x2 ).

Figure 5.4: Pdf f: the function f = f(x1, x2) of two variables x1 and x2 can be presented either with colors or with elevations. The values of x1 are on the horizontal axis and the values of x2 are on the vertical axis. For example, the value f(10, 5) is the number corresponding to the color at the coordinates x1 = 10, x2 = 5. It seems that the values of the unknown are close to (10, 5), since f has its highest values there. On the other hand, values near the point (−10, −10) seem to be unlikely.

In the same way, we can fix pdfs for finite-dimensional unknowns of any inverse problem, like color values in image deblurring, the coefficients of approximated mass absorption coefficients in CT scans and even the coefficients of approximated conductivities in electrical impedance tomography.
Remark 14. In statistical inverse problems, the random vectors are usually very high
dimensional. The visualisation of high-dimensional pdfs is usually done few coordinates
at a time or by using statistics of distributions.

5.1 On probability theory


We recap those concepts of probability theory that are of importance to statistical inverse
problems such as
random vector
independence
pdf
expectations (including variance)
conditional distributions, Bayes' formula

Measure theoretic principles of probability theory


Let $\Omega$ be a set, whose elements $\omega$ are called elementary events. Let $\mathcal{F}$ be a family of subsets of $\Omega$ that forms a $\sigma$-algebra, i.e.

1. $\Omega \in \mathcal{F}$.
2. If $A \in \mathcal{F}$, then $A^C \in \mathcal{F}$.
3. If $A_i \in \mathcal{F}$ when $i \in \mathbb{N}$, then $\bigcup_{i=1}^\infty A_i \in \mathcal{F}$.

Sets $A, B \in \mathcal{F}$ are called events.

The union of events $A \cup B$ means that either $A$ or $B$ (or both) happens. The intersection $A \cap B$ means that both events happen. The complement $A^C = \Omega \setminus A$ means that the event $A$ does not happen.

Definition 17. Let $\Omega$ be a set and $\mathcal{F}$ some $\sigma$-algebra of subsets of $\Omega$. A mapping $P : \mathcal{F} \to [0, 1]$ is a probability measure if

1. $P(\Omega) = 1$.
2. If $A_i \in \mathcal{F}$, $i \in \mathbb{N}$, are pairwise disjoint, i.e. $A_i \cap A_j = \emptyset$ for all $i \ne j$, then $P(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)$ (countable additivity).

The number $P(A)$ is called the probability of the event $A \in \mathcal{F}$. The triple $(\Omega, \mathcal{F}, P)$ is called a probability space.

Definition 18. Two events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$.

70
Random vectors
Let $(\Omega, \mathcal{F}, P)$ be a probability space. The Borel sets of $\mathbb{R}^n$ form the smallest $\sigma$-algebra $\mathcal{B}(\mathbb{R}^n)$ that contains all open sets of $\mathbb{R}^n$.

Definition 19. A random variable $X$ is a mapping $X : \Omega \to \mathbb{R}$ such that $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}(\mathbb{R})$. The distribution of $X$ is the mapping $\mathcal{B}(\mathbb{R}) \ni B \mapsto P(X \in B)$.

A random vector $X$ is a mapping $X : \Omega \to \mathbb{R}^n$ such that $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}(\mathbb{R}^n)$. The distribution of $X$ is the mapping $\mathcal{B}(\mathbb{R}^n) \ni B \mapsto P(X \in B)$.

Remark 15. Notation: $P(X \in A) = P(X^{-1}(A)) = P(\{\omega \in \Omega : X(\omega) \in A\})$.

We skip the proof of the next theorem, which relies on properties of Borel sets (namely, the generation of Borel sets with the help of hyperrectangles).

Theorem 12. The mapping $X : \Omega \to \mathbb{R}^n$ is a random vector if and only if the components $X_i$, $i = 1, \dots, n$, of $X = (X_1, \dots, X_n)$ are random variables.

Definition 20. Two random vectors $X : \Omega \to \mathbb{R}^n$ and $Y : \Omega \to \mathbb{R}^m$ are independent if
\[
P(\{X \in A\} \cap \{Y \in B\}) = P(X \in A)\,P(Y \in B)
\]
for all Borel sets $A \in \mathcal{B}(\mathbb{R}^n)$ and $B \in \mathcal{B}(\mathbb{R}^m)$.

Why measure theory?


In the beginning of the 20th century, probability theory was not considered pure mathematics since there was no axiomatic starting point. The sixth of Hilbert's famous 23 problems demanded the axiomatisation of probability theory in the following words:
6. Mathematical Treatment of the Axioms of Physics. The investigations on the
foundations of geometry suggest the problem: To treat in the same manner, by
means of axioms, those physical sciences in which already today mathematics plays
an important part; in the first rank are the theory of probabilities and mechanics.

The axiomatisation was finally done after the development of abstract measure and
integration theory at the end of 1920s. The father of the axioms of probability
theory is A. N. Kolmogorov (1903-1987). This has been the only consistent way of
treating the probability theory.

As mathematical objects, the random variables and random vectors are just ordi-
nary functions: they do not have any randomness attached to them and there is
no mechanism to produce random numbers. This may seem a little odd... that
randomness is modeled without any randomly occuring phenomena...?

In Kolmogorov axiomatisation, the random phenomenon is only partially modeled!

For example, consider a random phenomenon that produces a real number


(e.g. the time when the elevator arrives after pushing the button), which is
modeled as a random variable X.

The values of X are real numbers, but we do not beforehand know which value
X will take. Our knowledge of X is imperfect.
When the elevator arrives at time x0 , then x0 is a sample of X. This means
that x0 = X(0 ) for some 0 .
Mathematics does not tell how we ended up with X(0 ). The mechanism for
producing the elementary event 0 is unknown.
Although we know exactly the function X, the set and probability P , we
can not say anything else about the value of X except what the distribution
P (X B), where B B(R), reveals.

Multidimensional Riemann integral


The probability theory works best with Lebesgues integration (which does not belong to
prerequisites of this course). Therefore, we will use Riemann integration. (We follow the
books Apostol: Calculus (vol II), Lang: Analysis I, Apostol: Mathematical Analysis).
Let B Rn be n-dimensional hyperrectangular
B = {x = (x1 , ..., xn ) Rn : ai xi bi , i = 1, ..., n}
where ai , bi R and ai < bi . Denote the interior of B with Int(B).
Definition 21. Function f : B R is called a step function if the hyperrectangular
B can be divided into hyperrectangulars Bi , i = 1, ..m in such a way that there exists
numbers ci R so that
f (x) = ci ,
when x Int(Bi ), i = 1, ..., m.
Definition 22. Let f be as in Def. 21. The integral of the step functionf over the set
B is Z m
X
f (x)dx := ci Vol(Bi )
B i=1
where Vol(Bi ) is the volume of the hyperrectangular
(i) (i)
Bi = {x = (x1 , ..., xn ) Rn : aj xj bj , j = 1, .., n}
i.e. n
Y (i) (i)
Vol(Bi ) = (bj aj ).
j=1

Definition 23. Let B be a hyperrectangular. Let f : B R be a bounded function. If


there is only one number I R, such that
Z Z
s(x)dx I S(x)dx
B B

for every step functions s : B R such that s f , and for every step function S : B R,
such that f S, then we say that f is Riemann integrable (over B) and we denote
Z
f (x)dx = I.
B

Let K(B) denote the set of all step functions f : B R.
Theorem 13. Bounded function f : B R is Riemann integrable if and only if
Z Z
sup s(x)dx = I = inf S(x)dx
sK(B) SK(B)
sf f S

in which case Z
f (x)dx = I.
B

Theorem 14 (Fubinis theorem for Riemann integrable functions). Let B Rn and


C Rm be compact hyperrectangulars. Let f : B C R be such an integrable function
that Z
f (x, y)dy
C
R
exists for every x B. Then the mapping B 3 x 7 C
f (x, y)dy is also integrable and
Z Z  Z
f (x, y)dy dx = f (z)dz.
B C BC

By Fubinis theorem, multidimensional integral can be calculated by iteration of 1D


integrals (as long as all the integrals sare well-defined). For example, in 3D case
Z Z b3 Z b2 Z b1  
f (x)dx = f (x1 , x2 , x3 )dx1 dx2 dx3 ,
B x3 =a3 x2 =a2 x1 =a1

as long as all the integrals are well-defined. Moreover, we can change the order of inte-
gration.

The integral over the whole space is defined as an improper integral i.e. we take a
limit of integrals over increasing sets.

Similarly, when f is non-negative, Fubinis theorem is still true when the compact
sets are replaced with whole spaces.

When f has also non-negative values, f can be written as f = f+ f , where


f+ , f 0, and we pursue after
Z Z Z
f (x)dx = f+ (x)dx f (x)dx,

if possible.

Probability density functions


Definition 24. A probability density function (pdf ) is an integrable mapping f : Rn
[0, ) for which Z
f (x)dx = 1.
Rn

Example 31. Let
\[
f(x) = \begin{cases} \dfrac{1}{2^n}, & x \in [-1, 1]^n, \\[4pt] 0, & x \notin [-1, 1]^n. \end{cases}
\]
Then
\[
\int f(x)\,dx = \int_{[-1,1]^n}\frac{1}{2^n}\,dx \overset{\text{Fubini}}{=} \Big( \frac12\int_{-1}^{1}dx \Big)^n = 1.
\]

Example 32. Let
\[
f(x) = \frac{1}{(2\pi)^{n/2}}\,e^{-\frac12|x|^2}.
\]
Then
\[
\int f(x)\,dx = \frac{1}{(2\pi)^{n/2}}\int_{\mathbb{R}^n}e^{-\frac12|x|^2}\,dx
= \frac{1}{(2\pi)^{n/2}}\int e^{-\frac12(x_1^2+\dots+x_n^2)}\,dx_1\cdots dx_n
\overset{\text{Fubini}}{=} 1.
\]
Definition 25. Let $(\Omega, \mathcal{F}, P)$ be a probability space. A random variable $X : \Omega \to \mathbb{R}$ is said to have a pdf $f_X$ if $f_X : \mathbb{R} \to [0, \infty)$ is a pdf such that
\[
P(a \le X \le b) = \int_a^b f_X(x)\,dx
\]
for all $a, b \in \mathbb{R}$, $a \le b$.

Definition 26. Let $(\Omega, \mathcal{F}, P)$ be a probability space. The random vector $X = (X_1, \dots, X_n) : \Omega \to \mathbb{R}^n$ is said to have a pdf $f_X$ if $f_X : \mathbb{R}^n \to [0, \infty)$ is a pdf such that
\[
P(a_i \le X_i \le b_i,\ i = 1, \dots, n) = \int_{[a_1,b_1]\times\cdots\times[a_n,b_n]} f_X(x)\,dx
\]
for all $a_i, b_i \in \mathbb{R}$, $a_i \le b_i$, $i = 1, \dots, n$. The pdf $f_X$ is called the joint probability density function of $X_1, \dots, X_n$.

Definition 27. The function
\[
f_{X_i}(x) = \int_{x_1=-\infty}^{\infty}\cdots\int_{x_{i-1}=-\infty}^{\infty}\int_{x_{i+1}=-\infty}^{\infty}\cdots\int_{x_n=-\infty}^{\infty} f_X(x_1, \dots, x_n)\,dx_1\cdots dx_{i-1}\,dx_{i+1}\cdots dx_n
\]
is called the marginal pdf of $X_i$.


Theorem 15. Two random vectors X and Y with joint pdf f(X,Y ) = f(X,Y ) (x, y) are
independent if
f(X,Y ) (x, y) = fX (x)fY (y).

Some statistics of distributions


Definition 28. Let X be a random vector, whose pdf is fX : Rn [0, ). The expec-
tation of X is the vector m = (m1 , ..., mn ) Rn with components
Z
mi = xi fX (x)dx
Rn
R
whenever xi fX (x) is integrable for all i = 1, ..., n. We denote E[X] := xfX (x)dx = m.

Remark 16. Random vectors do not always have the expectation.
Definition 29. Let X be a random vector with pdf fX : Rn [0, ) and expectation
E[X] = (m1 , ..., mn ). The covariance matrix CX Rnn of X is defined by equations
Z
(CX )ij = (xi mi )(xj mj )fX (x)dx,
Rn

where i, j = 1, . . . , n (whenever all the above integrals exist).


Remark 17. The covariance matrix CX is always symmetric and its eigenvalues are
non-negative. Indeed,
Z Z
(CX )ij = (xi mi )(xj mj )fX (x)dx = (xj mj )(xi mi )fX (x)dx = (CX )ji
Rn Rn

and if CX u = u and kuk = 1, then


n n
!
X X
= (CX u, u) = (CX )ij uj ui
i=1 j=1
n Z
X
= (xi mi )ui (xj mj )uj fX (x)dx
i,j=1 Rn
Z n
! n
!
X X
= (xi mi )ui (xj mj )uj fX (x)dx
Rn i=1 j=1
Z
= g(x)2 fX (x)dx 0,
Rn
Pn
where g(x) = i=1 (xi mi )ui .
Definition 30. Let X : Rn and Y : Rm be random vectors with joint pdf j
f(X,Y ) : Rn+m R and expectations E[X] = mX and E[Y ] = mY . The cross covariance
matrix CXY Rnm of X and Y has elements
Z Z 
(CXY )ij = (xi (mX )i )(yj (mY )j )f(X,Y ) (x, y)dx dy, i = 1, .., n j = 1, .., m
Rm Rn

whenever all the above integrals exist.


T
Remark 18. CXY = CY X .

Gaussian distributions
The random vector $Z : \Omega \to \mathbb{R}^n$ has a Gaussian (or multinormal) distribution if its pdf is of the form
\[
f_Z(x) = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\,e^{-\frac12(x-m)^TC^{-1}(x-m)},
\]
where $m \in \mathbb{R}^n$ and $C \in \mathbb{R}^{n\times n}$ is a symmetric non-singular matrix whose eigenvalues are positive. We denote $Z \sim N(m, C)$, meaning that $Z$ has a Gaussian distribution with expectation $m$ and covariance matrix $C$.

Lemma 6. The function
\[
f_Z(x) = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\,e^{-\frac12(x-m)^TC^{-1}(x-m)}
\]
is a pdf. If $Z : \Omega \to \mathbb{R}^n$ is a random vector and $Z \sim N(m, C)$, then
\[
E[Z] = m
\]
and the covariance matrix of $Z$ is
\[
C_Z = C.
\]
Proof. Clearly, $f_Z \ge 0$. Let's check what
\[
I = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n} e^{-\frac12(x-m)^TC^{-1}(x-m)}\,dx
\]
is. Perform the change of variables $x' = x - m$:
\[
I = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n} e^{-\frac12(x')^TC^{-1}x'}\,dx'.
\]
Perform another change of variables $x'' = C^{-\frac12}x'$. Recall that $C^{-\frac12} = U\,\mathrm{diag}\big( \frac{1}{\sqrt{\lambda_1}}, \dots, \frac{1}{\sqrt{\lambda_n}} \big)\,U^T$, where we have used the eigenvalue decomposition $C = U\,\mathrm{diag}(\lambda_1, \dots, \lambda_n)\,U^T$. We get
\[
I = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n} e^{-\frac12|x''|^2}\,|\det(C^{1/2})|\,dx''.
\]
We need to calculate the integrals
\[
I = \frac{1}{\sqrt{(2\pi)^n}}\int_{\mathbb{R}^n} e^{-\frac12(x_1^2 + x_2^2 + \dots + x_n^2)}\,dx_1\cdots dx_n
= \Big( \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac12 x^2}\,dx \Big)^n.
\]
A neat way to do this is to write
\[
\Big( \int_{\mathbb{R}} e^{-\frac12 x^2}\,dx \Big)^2 = \int_{\mathbb{R}^2} e^{-\frac12(x^2+y^2)}\,dx\,dy
\]
in polar coordinates $x = r\cos(\phi)$ and $y = r\sin(\phi)$. We get
\[
\Big( \int_{\mathbb{R}} e^{-\frac12 x^2}\,dx \Big)^2 = \int_0^{2\pi}\int_0^{\infty} e^{-\frac12 r^2}\,r\,dr\,d\phi = 2\pi,
\]
and hence
\[
\int_{\mathbb{R}} e^{-\frac12 x^2}\,dx = \sqrt{2\pi},
\]
implying that $I = 1$. The same method can be applied to
\[
E[Z] = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n} x\,e^{-\frac12(x-m)^TC^{-1}(x-m)}\,dx = m
\]
and
\[
(C_Z)_{ij} = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n}(x_i - m_i)(x_j - m_j)\,e^{-\frac12(x-m)^TC^{-1}(x-m)}\,dx = C_{ij}. \qquad\square
\]
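As a computational aside, samples from N(m, C) are usually drawn via a Cholesky factor of C. The following sketch (with arbitrary illustrative m and C, not taken from these notes) also checks the sample mean and covariance:

```python
import numpy as np

# Arbitrary illustrative mean and (symmetric positive definite) covariance.
m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

L = np.linalg.cholesky(C)                              # C = L L^T
rng = np.random.default_rng(0)
samples = m + rng.standard_normal((100000, 2)) @ L.T   # each row is a sample of N(m, C)

print(samples.mean(axis=0))                            # approx. m
print(np.cov(samples, rowvar=False))                   # approx. C
```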

The concept of probability

There are rarely disputes in mathematics but the meaning of P (X B) is such. The
question is simply: what does P (X B) stand for? There are two schools:

1. Frequentistic: The probability of the event is the relative number of occurences of


the event when the experiment is repeated infinitely many times

2. Bayesian: The probability of the event is the degree of our belief that the event
will happen. (We use this one!)

Subjective Bayesian interpretation makes it possible to attach distribution to events that


can not be readily repeated (for example, we can talk about the probability that there
are other intelligent lifeforms elsewhere in the universe). Different people can choose
different probabilities for the same event. In frequentistic formulation, there is only one
probability for the event.

Remark 19. Why Bayesian viewpoint? In inverse problems, there scarcely is objective
information about the unknown. The Bayesian viewpoint allows us to complement the
objective information by using believable prior distributions.

Pros:

Honesty: The prior distribution contains all the prior information about the un-
known that is used in the problem solving. For example, in regularisation methods
the prior information is contained in the choice of the method, which makes applied
prior informations harder to compare.

Flexibility: In a prior distributions one can combine exact/objective information


with uncertain/subjective information.

Possibilities: Any distribution is in principle allowed to be chosen as the prior


distribution.

Robustness: Prior distributions helps to compensate the effects of noise and dis-
turbances in data.

Cons:

different prior distributions lead to different posterior distributions

the uncertainties in prior distributions are subjectively evaluated. It may happen


that prior distribution represents too strong claims about the unknown.

the (objective) accuracy of posterior distribution has to be usually tested with


computer simulations.

Probabilities and density functions
Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}^n$ a random vector. The pdf is a tool for calculating the probabilities $P(X \in B)$.

This tool has some limitations:

Not all random vectors have a pdf.

The pdf is not unique.

Example 33 (Probability distribution without pdf). Let X be a random vector with pdf
fX : R [0, ). The random vector (X, X) does not have pdf.

Proof by contradiction: Assume that there is pdf f(X,X) (x, y). Denote

B = {(x, y) R R : x 6= y}

(is a Borel set whose indicator function 1B (x, y) is Riemann integrable). The probability
distribution gives to the set B the value P ((X, X) B) = 0 since (X, X) / B. From the
existence of pdf, it follows that
Z
0 = P ((X, X) B) = f(X,X) (x, y)dxdy
B
Z Z x Z Z
= f(X,X) (x, y)dxdy + f(X,X) (x, y)dxdy = 1,
x= y= x= y=x

which is impossible. (The slightly dubious Fubini is not actually necessary here, since we
can also divide the integration area into the corresponding parts). Hence, there is no pdf
for (X, X).
Example 34 (Pdf is not unique). Let X : Rn be rv with pdf

fX (x) = 1[0,1] (x). (5.1)

For every a < b it holds that


Z b Z b
P (X [a, b]) = 1[0,1] (x)dx = 1(0,1) (x)dx.
a a

Hence, also
feX (x) = 1(0,1) (x)
is the pdf of X. Clearly feX 6 fX . For multidimensional example, take n-dimensional
random vector X = (X1 , . . . , Xn ), with statistically independent components with pdf
given by (5.1). Then
fX (x1 , . . . , xn ) = 1[0,1]n (x1 , . . . , xn )
and
feX (x1 , . . . , xn ) = 1(0,1)n (x1 , . . . , xn )
define the same probability distribution.

Definition 31. Let X : Rn be a random vector. Different pdfs f : Rn [0, )
satisfying Z
P (X B) = fX (x)dx
B
for all hyperrectangulars B Rn , are called versions of the pdf of X.
Remark 20. Let X be n-dimensional and Y such an m-dimensional random vector that
the random vector (X, Y ) has (joint) pdf f(X,Y ) (x, y). When the marginal pdf
Z
fX (x) = f(X,Y ) (x, y)dy

exists, then it is a version of the pdf of X. Indeed,


P (X B) = P (X B Y Rm ) = P ((X, Y ) B Rm )
Z Z
= f(X,Y ) (x, y)dxdy = fX (x)dx
BRm B
n
for every hyperrectangular B R .
The next lemma shows that the continuous version of pdf is unique.
Lemma 7. Let X : Rn be a random vector whose pdf has versions fX and feX . If
there exists such an open set O Rn that
Z
fX dx = 1
O

and fX : O [0, ) and feX : O [0, ) are continuous, then


fX (x) = feX (x)
when x O.
Proof. Let fX , feX and O satisfy the assumptions. Proof by contradiction: Assume that
fX 6= fe in O. Then there exists x0 O so that
fX (x0 ) feX (x0 ) > 0
(if the difference is negative, just change the roles of f and fe). By continuity of g :=
fX feX there exists r > 0 so that
|g(x0 ) g(x)| < /2
whenever x B(x0 , r). Then
g(x) = g(x0 ) (g(x0 ) g(x)) g(x0 ) |g(x0 ) g(x)| /2 = /2
for every x B(x0 , r). Let B B(x0 , r) be a hyperrectangular. Then
Z Z
V ol(B)
g(x)dx dx > 0.
B B 2 2
On the other hand, the pdf of X are related to the same distributions so that
Z Z
g(x)dx = fX (x) feX (x)dx = 0,
B B
which we proved above to be positive. Hence g 0 in O.

Conditional pdfs
Let $(\Omega, \mathcal{F}, P)$ be a probability space.

Definition 32. Let $X : \Omega \to \mathbb{R}^n$ and $Y : \Omega \to \mathbb{R}^m$ be random vectors with joint pdf $f_{(X,Y)} : \mathbb{R}^n\times\mathbb{R}^m \to \mathbb{R}$ whose marginal pdf satisfies $0 < f_Y(y_0) < \infty$ at $y_0 \in \mathbb{R}^m$. The conditional pdf of $X$ given $Y = y_0$ is the mapping
\[
\mathbb{R}^n \ni x \mapsto f_X(x|Y = y_0) = \frac{f_{(X,Y)}(x, y_0)}{f_Y(y_0)}. \tag{5.2}
\]
Other aspects of the conditional pdf can be seen by using abstract measure theory (not done in this course).

Remark 21. The condition $Y = y_0$ means that a random event $\omega$ has happened and the random vector $Y$ has attained the sample value $y_0 = Y(\omega_0)$, where $\omega_0 \in \Omega$. In practical inverse problems this means that the noisy value of the data is observed/measured (the noisy data is then available, a.k.a. given).

The conditional pdf is a pdf:
\[
\int_{\mathbb{R}^n} f_X(x|Y = y_0)\,dx = \frac{1}{f_Y(y_0)}\int_{\mathbb{R}^n} f_{(X,Y)}(x, y_0)\,dx
\overset{\text{marginal pdf}}{=} \frac{f_Y(y_0)}{f_Y(y_0)} = 1.
\]
If $X$ and $Y$ are statistically independent, then knowing the value of $Y$ does not affect the distribution of $X$, since
\[
f_X(x|Y = y_0) = \frac{f_{(X,Y)}(x, y_0)}{f_Y(y_0)} \overset{\text{independence}}{=} \frac{f_X(x)f_Y(y_0)}{f_Y(y_0)} = f_X(x).
\]

The significance of the conditional pdf in statistical inverse problems is based on the
fact than there is dependence between the unknown X and the data Y . When Y = y0 is
given, it can change the distribution of the unknown X.
Conditional pdfs are more easily handled by the Bayes formula.

Theorem 16. (Bayes formula) Let X : Rn and Y : Rm be two random vectors


and let fX be the pdf of X and fY (y|X = x) be the conditional pdf of Y given X = x (or
fY be pdf of Y and fX (x|Y = y) be conditional pdf of X given Y = y).
If the mapping
(x, y) 7 fY (y|X = x)fX (x)

(or (x, y) 7 fX (x|Y = y)fY (y))


is integrable, then it is a (Riemann-integrable) version of the joint pdf f(X,Y ) (x, y).

Proof. Skipped. (not hard with Lebesgues integral but requires pretty much more ma-
terial with Riemann integral ).

Different versions are united with the help of continuity.

Corollary 4. Let X : Rn and Y : Rm be two random vector. If there exists
open sets O1 Rn and O2 Rm , satisfying
Z Z
fX (x)dx = 1and fY (y|X = x)dy = 1 x (5.3)
O1 O2

and, moreover, fX is a bounded continuous function on O1 and (x, y) 7 fY (y|X = x) is


a bounded continuous function on O1 O2 , then

fY (y0 |X = x)fX (x)


fX (x|Y = y0 ) = R (5.4)
fY (y0 |X = x)fX (x)dx

is a conditional
R pdf of X = x that is on O1 uniquely determined and continuous. whenever
y0 O2 and fY (y0 |X = x)fX (x)dx > 0.

Proof. The product of two Riemann integrable bounded functions is Riemann integrable.
By Theorem 16, the product fY (y|X = x)fX (x) is a version of f(X,Y ) (x, y). Since
Z Z Z 
Fubini (5.3)
fX (x)fY (y|X = x)dxdy = fX (x) fY (y|X = x)dy dx = 1,
O1 O2 O1 O2

then Lemma 7 holds for fX (x)fY (y|X = x) when O = O1 O2 , which proves the unique-
ness of f(X,Y ) on O by continuity. By Def. 32,

f(X,Y ) (x, y0 ) fY (y0 |X = x)fX (x)


fX (x|Y = y0 ) = =R
fY (y0 ) fY (y0 |X = x)fX (x)dx

whenever the denominator is positive. The finiteness of the denominator follows from the
boundedness of the pdfs. The value
Z Z Z
fY (y0 |X = x)fX (x)dx = fY (y0 |X = x)fX (x)dx + fY (y0 |X = x)fX (x)dx
O1 O1C

does not depend on the second integral, since


Z  Z
fY (y0 |X = x)fX (x)dx sup fY (y0 |X = x) fX (x)dx
O1C x O1C
Z
(5.3)
= (1 1O1 (x))fX (x)dx = 1 1 = 0.

Definition 33. Let X and Y be as in Def. 32. The conditional expectation of X given
Y = y0 is Z
E[X|Y = y0 ] = xfX (x|Y = y0 )dx,
Rn
if the integral exists.

We do not prove the following theorem, since the proof requires measure theoretic
tools.
Theorem 17. Let X be Rn -valued random vector that is statistically independent from
the Rm -valued random vector Z. Let G : Rn Rm Rk be a continuous mapping and
let G(x0 , Z) have pdf for every x0 Rn . Then
fG(X,Z) (y|X = x0 ) = fG(x0 ,Z) (y)
for all y Rk .

Transformations of random vectors


Theorem 18. Let G : Rn Rm be a continuous function and let X : Rn be random
vector. Then G(X) is a random vector.
Proof. We actually need only to prove that the preimage G1 (B) of an open set B Rm
is open. The same holds then also when the word open is replaced with the word Borel.
But this requirement is just the topological definition of the continuity.
Example 35. Let X : Rn ja : Rm be random vectors. The following are also
random vectors
1. aX, a R
2. X + a , a Rn
3. kXk (is a random variable)
4. Y = F (X) + , where F : Rn Rm is continuous.
Example 36. Lets determine the pdf of X + a when the pdf of X is fX and a Rn is
some constant vector. The distribution of X + a is of the form
Z
P (X + a B) = P (X B a) = fX (x)dx (5.5)
Ba

where B a is a translation of a hyperrectangular B, that is,


B a = {x a : x B}.
Lets make the change of variables H(x) = x a in (5.5). We obtain
Z Z
P (X + a B) = fX (x)dx = fX (x a)dx
Ba B

for every hyperrectangular B, so that fX+a (x) = fX (x a).


The combination of Example 36 and Theorem 17 leads to the following corollary.
Corollary 5. Let F : Rn Rm be continuous, X : Rn and : Rm be two
statistically independent random vectors and f pdf of . When Y = F (X) + , then
fY (y|X = x0 ) = f (y F (x0 ))
Proof. Choose G(x, y) = F (x) + y in Theorem 17. Then
Theorem 17 Ex. 36
fY (y|X = x0 ) = f+F (x0 ) (y) = f (y F (x0 )).

5.2 Statistical inverse problems
Consider an inverse problem where we are given the noisy data $y_0 = F(x_0) + \varepsilon \in \mathbb{R}^m$ about the unknown $x_0 \in \mathbb{R}^n$. The direct theory $F : \mathbb{R}^n \to \mathbb{R}^m$ is here continuous.

We often have statistical information about the noise $\varepsilon$. For example, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_m)$ could consist of statistically independent components with the probability distributions
\[
P(a \le \varepsilon_i \le b) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_a^b \exp\Big( -\frac{1}{2\sigma^2}y^2 \Big)\,dy,
\]
where $i = 1, \dots, m$, $a < b \in \mathbb{R}$ and $\sigma > 0$.

When $F$ is a linear mapping whose matrix is $M$, then Morozov's discrepancy principle is unavailable, since
\[
P(|\varepsilon| > e) \ge P(|\varepsilon_i| > e) > 0
\]
for any $e \ge 0$. One option is to consider statistical solution methods.

Let $(\Omega, \mathcal{F}, P)$ be a probability space.

Finite-dimensional statistical inverse problem

1. The unknown X : Ω → Rn and the noise ε : Ω → Rm are random vectors. The direct theory F : Rn → Rm is a continuous function.

2. The data Y = F(X) + ε is a random vector Y : Ω → Rm (Example 35).

3. The given data y0 = F(x0) + ε0 ∈ Rm is the value of the sample Y(ω0) = F(X(ω0)) + ε(ω0).

4. The distribution of the unknown B ↦ P(X ∈ B) is called the prior distribution and its density fX(x) the prior pdf. We denote fpr(x) = fX(x).

5. The solution of the statistical inverse problem is the posterior distribution, whose pdf is

   fpost(x) := fX(x | Y = y0) = fY(y0 | X = x) fpr(x) / ∫_{Rn} fY(y0 | X = x) fpr(x) dx.     (Cor. 4, Cor. 5)
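As a numerical illustration (my own sketch, not part of the lecture notes), the unnormalised posterior can simply be evaluated on a grid as likelihood times prior and then normalised. Here this is done for a hypothetical scalar problem with direct theory F(x) = x^3, Gaussian noise and a Gaussian prior; all numerical values are arbitrary choices.

    import numpy as np

    # Hypothetical 1D illustration: data y0 = F(x0) + noise with F(x) = x**3,
    # Gaussian noise N(0, sigma^2) and Gaussian prior N(0, 1).
    sigma = 0.1

    def F(x):
        return x ** 3

    y0 = F(0.8) + 0.05                    # a fixed "measured" value for the demo

    x = np.linspace(-3.0, 3.0, 2001)      # grid for the unknown
    dx = x[1] - x[0]
    likelihood = np.exp(-(y0 - F(x)) ** 2 / (2 * sigma ** 2))
    prior = np.exp(-x ** 2 / 2)

    unnormalised = likelihood * prior
    posterior = unnormalised / (unnormalised.sum() * dx)   # normalise numerically

    print(x[np.argmax(posterior)])        # the posterior concentrates near x0 = 0.8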

Example 37. Let the noise ε ~ N(0, Cε) and the unknown X ~ N(0, CX), let the noise and the unknown be statistically independent, and let F : Rn → Rm be linear with matrix M. The given data y0 = M x0 + ε0 is a sample of the random variable Y = M X + ε. Then by Cor. 5, we have

fY(y | X = x) = fε(y − M x) = 1/√((2π)^m det(Cε)) exp(−(1/2) (y − M x)^T Cε^{-1} (y − M x)),

which is continuous and bounded. The prior pdf

fpr(x) = 1/√((2π)^n det(CX)) exp(−(1/2) x^T CX^{-1} x)

is also continuous and bounded. By Cor. 4 the posterior pdf is

fpost(x) = Cy0 exp(−(1/2) (y0 − M x)^T Cε^{-1} (y0 − M x)) exp(−(1/2) x^T CX^{-1} x),

where Cy0 is a norming constant. Let's simplify this expression. Consider

−(1/2) (y0 − M x)^T Cε^{-1} (y0 − M x) − (1/2) x^T CX^{-1} x
  = −(1/2) y0^T Cε^{-1} y0 + (1/2) x^T M^T Cε^{-1} y0 + (1/2) y0^T Cε^{-1} M x − (1/2) x^T (M^T Cε^{-1} M + CX^{-1}) x.

Denote

Cpost = (M^T Cε^{-1} M + CX^{-1})^{-1}

and add terms so that we have a quadratic form

−(1/2) (y0 − M x)^T Cε^{-1} (y0 − M x) − (1/2) x^T CX^{-1} x
  = −(1/2) y0^T Cε^{-1} y0 + (1/2) x^T Cpost^{-1} Cpost M^T Cε^{-1} y0 + (1/2) y0^T Cε^{-1} M Cpost Cpost^{-1} x − (1/2) x^T Cpost^{-1} x
  = −(1/2) y0^T Cε^{-1} y0 − (1/2) (x − mpost)^T Cpost^{-1} (x − mpost) + (1/2) mpost^T Cpost^{-1} mpost,

where

mpost = Cpost M^T Cε^{-1} y0 = (M^T Cε^{-1} M + CX^{-1})^{-1} M^T Cε^{-1} y0.

The posterior pdf is (up to a norming constant) a Gaussian pdf! The norming constant Cy0 is then well known and we obtain

fpost(x) = 1/√((2π)^n det(Cpost)) exp(−(1/2) (x − mpost)^T Cpost^{-1} (x − mpost)).

The posterior pdf is multinormal with mean

mpost = (M^T Cε^{-1} M + CX^{-1})^{-1} M^T Cε^{-1} y0

and covariance matrix

Cpost = (M^T Cε^{-1} M + CX^{-1})^{-1}.

Especially, if Cε = σ^2 I and CX = c I, then

mpost = (M^T M + (σ^2/c) I)^{-1} M^T y0,

which leads to

mpost = argmin_{x∈Rn} ( |M x − y0|^2 + (σ^2/c) |x|^2 ).
84
When the noise is distributed as N(0, σ^2 I) and the prior distribution is N(0, c I), then the posterior expectation coincides with the Tikhonov-regularised solution with regularisation parameter α = σ^2/c.

The prior can be interpreted so that

Xi ~ N(0, c)

represents knowledge of the unknown telling us that the values of the unknown are not exactly known, but we feel that negative and positive values of the components are equally likely and large values of the components are quite unlikely. Independence between the components allows large variation between the values of the components.
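A quick numerical check of the formulas above (my own sketch; the matrix M and the variances are arbitrary choices): compute mpost and Cpost for Cε = σ^2 I and CX = c I, and verify that mpost coincides with the Tikhonov-regularised solution with regularisation parameter σ^2/c.

    import numpy as np

    # Small hypothetical linear Gaussian inverse problem (illustration only).
    rng = np.random.default_rng(0)
    n, m = 5, 3
    M = rng.standard_normal((m, n))
    sigma2, c = 0.01, 1.0                 # noise variance and prior variance
    Ce = sigma2 * np.eye(m)               # noise covariance  C_eps = sigma^2 I
    Cx = c * np.eye(n)                    # prior covariance  C_X   = c I

    x_true = rng.standard_normal(n)
    y0 = M @ x_true + rng.multivariate_normal(np.zeros(m), Ce)

    # Posterior mean and covariance of the linear Gaussian model.
    Cpost = np.linalg.inv(M.T @ np.linalg.inv(Ce) @ M + np.linalg.inv(Cx))
    mpost = Cpost @ M.T @ np.linalg.inv(Ce) @ y0

    # Tikhonov-regularised solution with regularisation parameter sigma^2 / c.
    alpha = sigma2 / c
    x_tik = np.linalg.solve(M.T @ M + alpha * np.eye(n), M.T @ y0)

    print(np.allclose(mpost, x_tik))      # True: the two solutions coincide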

Recap: Finite-dimensional linear Gaussian statistical inverse problem

The given data is y0 = M x0 + ε0, where M ∈ R^(m×n).

The statistical model of the noise is an m-dimensional Gaussian random vector ε, distributed according to N(0, Cε), i.e.

fε(y) = 1/√((2π)^m det(Cε)) exp(−(1/2) y^T Cε^{-1} y)

for all y ∈ Rm.

The statistical model of the unknown is an n-dimensional Gaussian random vector X that is independent of ε and distributed according to N(0, CX), i.e.

fpr(x) = 1/√((2π)^n det(CX)) exp(−(1/2) x^T CX^{-1} x)

for all x ∈ Rn.

The statistical model of the data is Y = M X + ε.

The solution is the posterior pdf

fpost(x) = fY(y0 | X = x) fpr(x) / ∫_{Rn} fY(y0 | X = x) fpr(x) dx
         = cy0 exp(−(1/2) (y0 − M x)^T Cε^{-1} (y0 − M x)) exp(−(1/2) x^T CX^{-1} x),

which simplifies to

fpost(x) = 1/√((2π)^n det(Cpost)) exp(−(1/2) (x − mpost)^T Cpost^{-1} (x − mpost)),

where

mpost = (M^T Cε^{-1} M + CX^{-1})^{-1} M^T Cε^{-1} y0

and

Cpost = (M^T Cε^{-1} M + CX^{-1})^{-1}.

In more general cases, the unknown and the noise can have non-zero expectations, and the unknown and the noise need not be independent.

5.2.1 Likelihood function fY(y0 | X = x)

Consider a statistical inverse problem where the data Y is an m-dimensional rv and the unknown X is an n-dimensional rv.

Definition 34. Let y0 ∈ Rm be a sample of Y. The function x ↦ fY(y0 | X = x) is called the likelihood function.
The likelihood function can contain information about

inaccuracies due to external disturbances (noise)

inaccuracies of the direct theory.

The case of an independent noise term

Let X and ε be independent random vectors and denote Y = F(X) + ε, where the forward mapping F : Rn → Rm is continuous.

If the random vector ε has a pdf, then the conditional pdf of Y = F(X) + ε given X = x is, by Corollary 5,

fY(y0 | X = x) = fε+F(x)(y0) = fε(y0 − F(x)).     (5.6)
Example 38 (CT scan). The unknown X-ray mass absorption coefficient f = f(x′, y′) is approximated by

f(x′, y′) = Σ_{j=1}^n xj φj(x′, y′),   (x′, y′) ∈ R2,

where x = (x1, . . . , xn) ∈ Rn contains the unknowns and the functions φj are fixed. The data can be (coarsely) modeled as a vector y = (y1, . . . , ym) whose components are

yi = ∫_{Ci} f ds + εi = Σ_{j=1}^n ( ∫_{Ci} φj ds ) xj + εi = (M x)i + εi,

where i = 1, . . . , m and the random vector ε is distributed according to N(0, σ^2 I). Then we end up with the statistical inverse problem

Y = M X + ε.

When X and ε are taken to be statistically independent, the likelihood function is

fY(y0 | X = x) = 1/(2πσ^2)^(m/2) exp(−|y0 − M x|^2/(2σ^2)).
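A minimal sketch of how M and the likelihood could be assembled in practice (my own illustration, not from the notes): here the basis functions φj are taken to be indicators of the pixels of an N × N grid, and the curves Ci are rays along pixel rows and columns, so every line integral ∫_{Ci} φj ds is simply the pixel width.

    import numpy as np

    # Toy CT model: phi_j = indicator of pixel j on an N x N grid over [0,1]^2,
    # rays C_i run along pixel rows and columns, so the line integral of phi_j
    # over C_i equals the pixel width 1/N whenever pixel j lies on ray i.
    N = 8
    n = N * N                       # number of unknowns (pixels)
    m = 2 * N                       # number of rays (N rows + N columns)
    M = np.zeros((m, n))
    for k in range(N):
        for l in range(N):
            j = k * N + l           # pixel (k, l) stored row-major
            M[k, j] = 1.0 / N       # horizontal ray through row k
            M[N + l, j] = 1.0 / N   # vertical ray through column l

    rng = np.random.default_rng(1)
    sigma = 0.01
    x_true = rng.random(n)                        # some absorption values
    y0 = M @ x_true + sigma * rng.standard_normal(m)

    def likelihood(x):
        """Gaussian likelihood f_Y(y0 | X = x) for the model Y = M X + eps."""
        r = y0 - M @ x
        return (2 * np.pi * sigma ** 2) ** (-m / 2) * np.exp(-r @ r / (2 * sigma ** 2))

    print(likelihood(x_true) > likelihood(np.zeros(n)))   # True for this draw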

Model errors

Next, we allow model errors for the direct theory and the unknown.

Theorem 19. Let Y be an m-dimensional rv, X an n-dimensional rv and U a k-dimensional rv such that the joint pdf f(X,U) is positive and the conditional pdfs fY(y | (X, U) = (x, u)) and fU(u | X = x) are given. Then the conditional pdf

fY(y | X = x) = ∫_{Rk} fY(y | (X, U) = (x, u)) fU(u | X = x) du

whenever fX(x) > 0.

Proof. We need to determine

fY(y | X = x) = f(X,Y)(x, y) / fX(x).

By definition, the marginal pdf

f(X,Y)(x, y) = ∫_{Rk} f(X,Y,U)(x, y, u) du,

where the integrand is determined by Theorem 16. Then

fY(y | X = x) = ∫_{Rk} ( f(X,Y,U)(x, y, u) / f(X,U)(x, u) ) ( f(X,U)(x, u) / fX(x) ) du,

which gives the claim by the definition of conditional pdfs.
Example 39 (Approximation error). Consider the statistical inverse problem Y = F(X) + ε, where the unknown X and the noise ε are statistically independent. For computational reasons, a high-dimensional X is often approximated by a lower-dimensional rv Xn. Let's take Xn = Pn X, where Pn : RN → RN is an orthogonal projection onto some n-dimensional subspace of RN, where n < N (and also m < N). Then

F(X) = F(Xn) + (F(X) − F(Xn)) =: F(Xn) + U,

which leads to

Y = F(X) + ε = F(Xn) + U + ε.

According to Theorem 19, the likelihood function for Xn can be expressed as

fY(y | Xn = x) = ∫_{Rm} fU(u | Xn = x) fε(y − F(x) − u) du,     (5.7)

whenever the assumptions of the theorem are fulfilled. Especially fU(u | Xn = x) needs to be available.

The integral (5.7) is often computationally costly. One approximation is to replace U by a rv Ũ that is similarly distributed but independent of X. When the prior distribution of X is given, then Ũ + ε has a known probability distribution. When this distribution has a pdf, then

fY(y | Xn = x) = fε+Ũ(y − F(x)).
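In practice the pdf fε+Ũ is often obtained by simulation: draw samples of X from the prior, form U = F(X) − F(Pn X), and fit, say, a Gaussian to the samples of U + ε. The sketch below is my own illustration of this idea for a hypothetical linear forward model; the matrix A, the dimensions and the Gaussian fit are assumptions, not part of the notes.

    import numpy as np

    # Hypothetical setup: F(x) = A x with A of size m x N; the coarse model keeps
    # only the first n coordinates (Pn = projection onto them). The pdf of U + eps
    # is approximated by a Gaussian fitted to samples drawn under the prior.
    rng = np.random.default_rng(2)
    m, N, n = 10, 50, 10
    A = rng.standard_normal((m, N)) / np.sqrt(N)
    sigma = 0.05

    def F(x_full):                       # full forward model (illustrative)
        return A @ x_full

    def F_coarse(x_n):                   # forward model applied to the projected unknown
        x_full = np.zeros(N)
        x_full[:n] = x_n
        return A @ x_full

    # Draw prior samples X ~ N(0, I_N), form U = F(X) - F(Pn X) and add the noise.
    S = 5000
    samples = np.empty((S, m))
    for k in range(S):
        x = rng.standard_normal(N)
        U = F(x) - F_coarse(x[:n])
        samples[k] = U + sigma * rng.standard_normal(m)

    mean_ue = samples.mean(axis=0)
    cov_ue = np.cov(samples, rowvar=False)

    def log_likelihood(x_n, y):
        """Approximate log f_Y(y | Xn = x_n) using the fitted Gaussian for U + eps
        (up to an additive constant)."""
        r = y - F_coarse(x_n) - mean_ue
        return -0.5 * r @ np.linalg.solve(cov_ue, r)

    y_obs = F(rng.standard_normal(N)) + sigma * rng.standard_normal(m)
    print(log_likelihood(np.zeros(n), y_obs))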
Example 40 (Inaccuracies of the forward model). Let the forward model F : Rn → Rm be a linear mapping whose matrix M = Mθ depends continuously on θ ∈ R, where the value of θ is not precisely known. For example, in image enhancement (Chapter 1.2) the blurring map

m̃kl = Ckl Σ_{i,j=1}^n exp(−(|k − i|^2/n^2 + |l − j|^2/n^2)/(2θ^2)) mij

contains such a parameter. Then we may model the inaccuracy of θ with a probability distribution. Say, θ, X and ε are statistically independent and fθ(s) is the pdf of θ. Then

Y = Mθ X + ε = G(θ, X, ε)

is a random vector, since

G : R × Rn × Rm ∋ (s, x, z) ↦ Ms x + z

is continuous. By Theorem 17,

fY(y | (X, θ) = (x, s)) = fG(s,x,ε)(y) = fε(y − Ms x).

Under the assumptions of Theorem 19, we have

fY(y | X = x) = ∫_R fε(y − Ms x) fθ(s) ds.
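Since θ is scalar here, the marginalised likelihood is a one-dimensional integral and can be approximated by a simple quadrature. The following sketch is my own; the matrix-valued function M_theta, the Gaussian hyperprior for θ and all numerical values are hypothetical.

    import numpy as np

    # Marginal likelihood f_Y(y | X = x) = integral of f_eps(y - M_s x) f_theta(s) ds,
    # approximated by a Riemann sum over a grid of s-values.
    sigma = 0.1                          # noise std, eps ~ N(0, sigma^2 I)
    theta_mean, theta_std = 1.0, 0.2     # hypothetical Gaussian hyperprior for theta

    def M_theta(s, n=4, m=4):
        """Hypothetical forward matrix depending continuously on the parameter s."""
        i = np.arange(m)[:, None]
        j = np.arange(n)[None, :]
        return np.exp(-(i - j) ** 2 / (2 * s ** 2)) / n

    def likelihood_given_theta(y, x, s):
        r = y - M_theta(s) @ x
        return (2 * np.pi * sigma ** 2) ** (-len(y) / 2) * np.exp(-r @ r / (2 * sigma ** 2))

    def marginal_likelihood(y, x, grid=np.linspace(0.3, 2.0, 400)):
        ds = grid[1] - grid[0]
        f_theta = (np.exp(-(grid - theta_mean) ** 2 / (2 * theta_std ** 2))
                   / (theta_std * np.sqrt(2 * np.pi)))
        vals = np.array([likelihood_given_theta(y, x, s) for s in grid])
        return np.sum(vals * f_theta) * ds

    x = np.ones(4)
    y = M_theta(1.1) @ x + sigma * np.random.default_rng(3).standard_normal(4)
    print(marginal_likelihood(y, x))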

5.2.2 The prior pdf fpr(x)

The prior pdf represents the information that we have about the unknown and also describes our perception of the lack of information.

Assume that x ∈ Rn corresponds to values of some unknown function g at fixed points of [0, 1] × [0, 1], say

xi = g(ti),

where ti ∈ [0, 1] × [0, 1] for i = 1, ..., n.

Possible prior information (function g ↔ vector x):

Some values of g are known exactly or inexactly ↔ some components of x are known exactly or inexactly.

Smoothness of g ↔ behavior of the neighboring components of x.

The range of g is known (e.g. g ≥ 0, or monotonicity) ↔ the subset where x lies is known (e.g. xi ≥ 0, xi ≤ xi+1).

Symmetry of g ↔ symmetry of x.

Other restrictions on g (e.g. if g : R3 → R3 is a magnetic field, then ∇ · g = 0) ↔ restrictions on x (equations G(x) = 0).

Possible statistical models (unknown vector x ∈ Rn ↔ statistical model X : Ω → Rn):

Some components of x are known exactly or inexactly ↔ Xi = mi + Zi, where the rv Zi represents the inaccuracy of mi.

The vectors that span x are known, x = Σ_{i=1}^{n′} ai ei with n′ ≤ n ↔ X = Σ_{i=1}^{n′} Zi ei, where Zi models the uncertainty of the coefficients.

The behavior of the neighboring components of x ↔ statistical dependencies between the components of X, i.e. the joint distribution of X.

The subset containing x (e.g. xi ≥ 0) ↔ e.g. P(∩i {Xi ≥ 0}) = 1.

5.3 Different prior pdfs

Let X : Ω → Rn be a random vector that models the unknown and let fpr : Rn → [0, ∞) denote its pdf. Next, we meet some pdfs that can often be used as fpr.

Uniform distribution

Let B ⊂ Rn be a closed and bounded hyperrectangle

B = {x ∈ Rn : ai ≤ xi ≤ bi, i = 1, .., n},

where ai < bi for i = 1, .., n.

The random vector X is uniformly distributed on B if

fpr(x) = (1/|B|) 1B(x),

where the number |C| := ∫_C dx.

The unknown belongs to the set B, i.e. the ith component belongs to the interval [ai, bi].

Reflects almost complete uncertainty about the values of the unknown: we only know that they belong to B.

The set B needs to be bounded in order for fpr to be a proper pdf.

The posterior pdf

fpost(x) = fY(y0 | X = x) 1B(x) / (fY(y0) |B|)

is the renormed and restricted likelihood.

ℓ1-prior

Define the ℓ1-norm by

‖x‖_1 = Σ_{i=1}^n |xi|

for all x ∈ Rn.

A random vector X has the ℓ1-prior, if

fpr(x) = (α/2)^n exp(−α ‖x‖_1).

The components Xi are statistically independent.

The pdf fXi is symmetric w.r.t. the origin and the expectation is zero.

The parameter α reflects how likely we consider large values of the unknown to be (larger α makes them less likely).

5.3.1 ℓ2-prior

A random vector X has the ℓ2-prior, if

fpr(x) = (α/π)^(n/2) exp(−α |x|^2).

The components of X are independent and normally distributed.

Figure 5.5: Pdf of the 1-dimensional ℓ1-prior for α = 0.5, 1, 2.

Figure 5.6: Pdf of the 1-dimensional ℓ2-prior for α = 0.5, 1, 2.
Cauchy prior

A random vector X has the Cauchy prior if

fpr(x) = (α/π)^n ∏_{i=1}^n 1/(1 + α^2 xi^2),

where x ∈ Rn.

The components Xi are independent.

The pdf fXi is symmetric w.r.t. the origin.

No expectation exists (large tail probabilities).

Reflects best a situation where some of the components of the unknown can attain large values.

Figure 5.7: Pdf of the 1-dimensional Cauchy prior for α = 0.5, 1, 2.
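The one-dimensional pdfs shown in Figures 5.5-5.8 are easy to evaluate directly; a small sketch (my own, with the values of α chosen only for illustration):

    import numpy as np

    # 1D densities of the l1-, l2- (Gaussian) and Cauchy priors, cf. Figures 5.5-5.8.
    def l1_pdf(x, alpha):
        return (alpha / 2) * np.exp(-alpha * np.abs(x))

    def l2_pdf(x, alpha):
        return np.sqrt(alpha / np.pi) * np.exp(-alpha * x ** 2)

    def cauchy_pdf(x, alpha):
        return (alpha / np.pi) / (1 + alpha ** 2 * x ** 2)

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    for alpha in (0.5, 1.0, 2.0):
        # rough normalisation check on the plotted window; the Cauchy value stays
        # a bit below one because its heavy tails extend beyond [-10, 10]
        print(alpha,
              (l1_pdf(x, alpha) * dx).sum(),
              (l2_pdf(x, alpha) * dx).sum(),
              (cauchy_pdf(x, alpha) * dx).sum())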

Discrete Markov fields

Let the unknown represent the values of some n′-variable function f : Rn′ → R at points ti ∈ Rn′, i = 1, ..., n.

The neighborhoods Ni ⊂ {1, ..., n} of the indices i ∈ {1, . . . , n} consist of sets such that

1. i ∉ Ni

2. i ∈ Nj if and only if j ∈ Ni.

Figure 5.8: Pdf of N(0, 1), pdf of the 1D Cauchy prior and pdf of the 1D ℓ1-prior.

Definition 35. A random vector X is a discrete Markov field with respect to the neighborhood system Ni, i = 1, .., n, if

fXi(x | (X1, X2, .., Xi−1, Xi+1, Xi+2, ..., Xn) = (x1, x2, .., xi−1, xi+1, xi+2, ..., xn))
  = fXi(x | Xk = xk, k ∈ Ni).

The components Xi of a discrete Markov field depend only on the neighboring components Xk, k ∈ Ni.

Theorem 20 (Hammersley-Clifford). Let the rv X : Ω → Rn be a discrete Markov field with respect to the neighborhood system Ni, i = 1, .., n. If X has a pdf fX > 0, then

fX(x) = c exp(−Σ_{i=1}^n Vi(x)),

where Vi : Rn → R depends only on xi and its neighbor components xk, k ∈ Ni.

Example 41 (Total variation prior). Let the rv X model an image consisting of N × N pixels so that the corresponding matrix is organised as an n = N^2-dimensional vector. The rv X : Ω → Rn is distributed according to the total variation prior, if

fpr(x) = c exp(−Σ_{j=1}^n Vj(x)),

where

Vj(x) = Σ_{i∈Nj} lij |xi − xj|

and the neighborhood Nj of the index j consists only of the indices of those pixels i that share an edge with the pixel j. Moreover, the number lij is the length of the common edge between pixels i and j.

The total variation Σ_{j=1}^n (1/2) Σ_{i∈Nj} lij |xi − xj| is small if the difference between the color value xi of a pixel i and the corresponding values of its neighboring pixels is small, except possibly for those pixel sets whose boundaries have very short length.
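A small sketch (my own) of how the total variation energy of an N × N image could be evaluated, assuming square pixels of unit side length so that every shared edge has lij = 1:

    import numpy as np

    def tv_energy(img):
        """Total variation sum_j (1/2) sum_{i in N_j} l_ij |x_i - x_j| for unit-size
        pixels (l_ij = 1), computed by counting every shared edge exactly once."""
        return np.abs(np.diff(img, axis=1)).sum() + np.abs(np.diff(img, axis=0)).sum()

    def tv_prior_unnormalised(img):
        """Unnormalised TV prior density c * exp(-sum_j V_j(x)) = c * exp(-2 * TV)."""
        return np.exp(-2.0 * tv_energy(img))

    flat = np.ones((8, 8))                       # constant image: TV = 0
    edge = np.ones((8, 8))
    edge[:, 4:] = 0.0                            # one sharp vertical edge
    print(tv_energy(flat), tv_energy(edge))      # 0.0 and 8.0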
Example 42 (1D Gaussian smoothness priors). Let X be a rv that corresponds to the values of an unknown function g at points ti ∈ [0, 1], i = 1, .., n, where 0 = t0 < t1 < · · · < tn < 1 are equidistant points and g(t) = 0 for t ≤ 0.

Fix the prior pdf of X as

fpr(x) = c exp(−α (x1^2 + Σ_{i=2}^n (xi − xi−1)^2)).

The boundary component is forced to zero, i.e. X0 = g(0) = 0.

If α is large, then the neighboring components of X are more likely to be close to each other.

A random walk model.

Similarly, also higher differences can be used. For example,

fpr(x) = c exp(−(1/(2a^4)) (x1^2 + (x2 − 2x1)^2 + Σ_{i=3}^n (xi − 2xi−1 + xi−2)^2))

corresponds to the second differences.
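Samples from the first-order smoothness prior above are easy to draw, because under this pdf the increments X1, X2 − X1, ..., Xn − Xn−1 are independent N(0, 1/(2α)) random variables; a short sketch (my own illustration):

    import numpy as np

    def sample_smoothness_prior(n, alpha, rng=np.random.default_rng()):
        """Draw one sample from fpr(x) = c exp(-alpha (x1^2 + sum (x_i - x_{i-1})^2)).
        The increments are independent N(0, 1/(2 alpha)), so a sample is their
        cumulative sum (a random walk started from x0 = 0)."""
        increments = rng.normal(0.0, np.sqrt(1.0 / (2.0 * alpha)), size=n)
        return np.cumsum(increments)

    samples = np.stack([sample_smoothness_prior(100, alpha=50.0) for _ in range(5)])
    print(samples.shape)                 # five random-walk realisations of length 100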


Example 43 (2D Gaussian smoothness priors). Let g : [0, 1]^2 → R be a continuous function such that g = 0 outside [0, 1]^2. Let X be the rv corresponding to the values of g(t, s) at the points

{ti ∈ [0, 1] × [0, 1] : i = 1, .., n^2} = {(k/n, j/n) : k, j = 1, ..., n}.

Set

fpr(x) = c exp(−Σ_j Vj(x)),

where

Vj = |4xj − Σ_{i∈Nj} xi|^2

and Nj contains only the indices of the points ti that are next to the point tj (above it, below it, to the left or to the right of it).

Positivity constraint

If we know that the unknown has non-negative components, then we may restrict and renorm the pdf:

fpr(x) = c f+(x) fX(x),

where

f+(x) = 1 if xi ≥ 0 for all i = 1, .., n, and f+(x) = 0 otherwise.

Hierarchical priors

When the unknown is modeled as a random vector whose pdf depends continuously on a parameter θ ∈ Rn′, it is possible to model the uncertainty of the parameter by attaching a pdf to it.

Let X : Ω → Rn be the rv that models the unknown and let the pdf of X be fX. Let θ : Ω → Rn′ be a rv that models the unknown parameter and let its pdf be fθ. Assume that we have the conditional pdf of X given θ = s, that is,

x ↦ fX(x | θ = s) = fXs(x)

is known for all s ∈ Rn′. When the product fXs(x) fθ(s) is integrable, we have the joint distribution

f(X,θ)(x, s) = fXs(x) fθ(s).

Option 1) The unknown is modeled as a rv X with pdf

fpr(x) = ∫ fXs(x) fθ(s) ds1 · · · dsn′

(whenever the marginal exists). The corresponding posterior pdf is

fpost(x) = c fY(y | X = x) fpr(x)

whenever fY(y) > 0.

Option 2) Also the hyperparameter is taken to be part of the unknown, and as the prior pdf we set the joint pdf

fpr(x, s) = fXs(x) fθ(s),

which implies that the posterior pdf is

fpost(x, s) = c fY(y | (X, θ) = (x, s)) fpr(x, s) = c fY(y | X = x) fpr(x, s)

whenever fY(y) > 0 (note that the likelihood function does not depend on s but only on x).

In options 1 and 2 the prior pdf is called a hierarchical prior, the parameter θ : Ω → Rn′ is called a hyperparameter and its distribution a hyperprior.

Example 44. Let X : Ω → R3 be the rv that models the unknown and has pdf

fpr(x; s) = (√s / (2π)^(3/2)) exp(−(1/2) x1^2 − (s/2) (x2 − x1)^2 − (1/2) (x3 − x2)^2),

where s ∈ R is an unknown parameter. We model this parameter as a random variable θ : Ω → R and denote

fX(x | θ = s) = fpr(x; s).

As the hyperprior, we set

fθ(s) = f+(s) λ exp(−λs),

where λ > 0, and f+(s) = 1 for s > 0 and 0 otherwise. Then

f(X,θ)(x, s) = f+(s) (√s λ / (2π)^(3/2)) exp(−(1/2) x1^2 − (s/2) (x2 − x1)^2 − (1/2) (x3 − x2)^2) exp(−λs)

and

fX(x) = (λ / (2π)^(3/2)) exp(−(1/2) x1^2 − (1/2) (x3 − x2)^2) ∫_0^∞ √s exp(−(s/2) (x2 − x1)^2 − λs) ds
      = (λ / (2π)^(3/2)) exp(−(1/2) x1^2 − (1/2) (x3 − x2)^2) ∫_0^∞ s^(1/2) exp(−s ((1/2) (x2 − x1)^2 + λ)) ds
      = (λ / (2π)^(3/2)) exp(−(1/2) x1^2 − (1/2) (x3 − x2)^2) (1/((1/2) (x2 − x1)^2 + λ)^(3/2)) ∫_0^∞ s^(1/2) exp(−s) ds
      = (λ / (2π)^(3/2)) exp(−(1/2) x1^2 − (1/2) (x3 − x2)^2) Γ(3/2) / ((1/2) (x2 − x1)^2 + λ)^(3/2)
      = λ exp(−(1/2) x1^2 − (1/2) (x3 − x2)^2) / (2π ((x2 − x1)^2 + 2λ)^(3/2)).

Here the value of the Gamma function is Γ(3/2) = √π/2.
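The marginalisation over s is easy to verify numerically; the following sketch (my own) compares the closed-form expression above with a direct quadrature over s at an arbitrary test point x = (x1, x2, x3):

    import numpy as np

    lam = 1.0
    x1, x2, x3 = 0.3, 1.2, 0.9           # arbitrary test point

    def integrand(s):
        """Joint density f(X,theta)(x, s) at the fixed test point, i.e.
        f+(s) sqrt(s) lam exp(-lam s) / (2 pi)^(3/2) times the Gaussian factors."""
        return (np.sqrt(s) * lam / (2 * np.pi) ** 1.5
                * np.exp(-0.5 * x1 ** 2 - 0.5 * s * (x2 - x1) ** 2 - 0.5 * (x3 - x2) ** 2)
                * np.exp(-lam * s))

    s = np.linspace(1e-6, 60.0, 200001)
    numerical = np.sum(integrand(s)) * (s[1] - s[0])

    closed_form = (lam * np.exp(-0.5 * x1 ** 2 - 0.5 * (x3 - x2) ** 2)
                   / (2 * np.pi * ((x2 - x1) ** 2 + 2 * lam) ** 1.5))
    print(numerical, closed_form)        # the two values agree closely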

Figure 5.9: Pdf f(x) = λ/(x^2 + 2λ)^(3/2) for λ = 0.3, 1, 2.

The differences between the components of X are independent.

The difference X2 − X1 has a Cauchy-type distribution (a transformed Beta distribution), which gives a slightly lower probability to the occurrence of very large values.

Uncertainty about the variance of X2 − X1 produced a distribution that allows large values with higher probability than the Gaussian distribution.

Figure 5.10: The Cauchy prior and the pdf f(x) = λ/(x^2 + 2λ)^(3/2).

5.4 Studying the posterior distribution


5.4.1 Decision theory

Let the pdfs f(X,Y), fX > 0 and fY > 0 exist and be continuous. Denote

fpost(x; y) = fX(x | Y = y)

when y ∈ Rm.

The multidimensional function fpost(x; y) can be very hard to visualize properly. Can we extract some information about the unknown on the basis of the posterior pdf? We turn our attention to the field of statistics called decision theory.

Decision theory answers the question: which function h : Rm → Rn is such that the vector h(y) resembles most closely (in some sense) the unknown x that has produced the observation y = F(x) + ε?

In statistics, the function h is called an estimator and the value h(y) an estimate.

Let us fix in what sense an estimator is best. We first fix a loss function

L : Rn × Rn → [0, ∞)

that measures the accuracy of the estimate h(y) when the unknown is x as L(x, h(y)) (low values of L mean accurate estimates). For example, we can take L(x, h(y)) = |x − h(y)|^2.

Assume that L is fixed and x ↦ L(x, z) fpost(x) is integrable for all z ∈ Rn.

If y ∈ Rm, then the value h(y) ∈ Rn of the estimator h is chosen so that it minimizes the posterior expectation

∫_{Rn} L(x, h(y)) fpost(x; y) dx,

i.e.

h(y) = argmin_{z∈Rn} ∫_{Rn} L(x, z) fpost(x; y) dx.

When the data is y, we look for the h(y) that gives the smallest possible posterior expectation. The number

r(h) = ∫_{Rm} ( ∫_{Rn} L(x, h(y)) fpost(x; y) dx ) fY(y) dy

is called the Bayes risk. An application of the Fubini theorem leads to

r(h) = ∫_{Rn} ( ∫_{Rm} L(x, h(y)) fY(y | X = x) dy ) fpr(x) dx.

The interpretation of the Bayes risk is that when the unknown is X and the noisy data is Y, then the Bayes risk r(h) of the estimator h is the expected loss with respect to the joint distribution of X and Y, i.e. r(h) = E[L(X, h(Y))].
Example 45 (CM estimate). Take L(x, z) = |x − z|^2 as the loss function. Let mpost(y) denote the posterior expectation

mpost(y) = ∫_{Rn} x fpost(x; y) dx

and Cpost(y) the posterior covariance matrix

(Cpost(y))ij = ∫_{Rn} (xi − (mpost(y))i)(xj − (mpost(y))j) fpost(x; y) dx.

Then

∫_{Rn} L(x, h(y)) fpost(x; y) dx
  = ∫_{Rn} |x − h(y)|^2 fpost(x; y) dx
  = ∫_{Rn} |x − mpost(y) + mpost(y) − h(y)|^2 fpost(x; y) dx
  = ∫_{Rn} ( |x − mpost(y)|^2 + 2 Σ_{i=1}^n (x − mpost(y))i (mpost(y) − h(y))i + |mpost(y) − h(y)|^2 ) fpost(x; y) dx
  = ∫_{Rn} |x − mpost(y)|^2 fpost(x; y) dx
    + 2 Σ_{i=1}^n (mpost(y) − h(y))i ∫_{Rn} (x − mpost(y))i fpost(x; y) dx
    + |mpost(y) − h(y)|^2 ∫_{Rn} fpost(x; y) dx
  = ∫_{Rn} |x − mpost(y)|^2 fpost(x; y) dx + |mpost(y) − h(y)|^2.

The minimum loss is attained when |mpost(y) − h(y)|^2 = 0, i.e. h(y) = mpost(y), so that

∫_{Rn} L(x, h(y)) fpost(x; y) dx = Σ_{i=1}^n (Cpost(y))ii.

In other words, the expectation of the loss function is the sum of the diagonal elements of the posterior covariance matrix, i.e. its trace.

The posterior expectation is often denoted by xCM (CM = conditional mean).
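In practice mpost(y) is usually computed numerically, by quadrature in low dimensions or by Monte Carlo in high dimensions. A minimal sketch (my own, for a one-dimensional posterior evaluated on a grid with a hypothetical cubic direct theory):

    import numpy as np

    # CM estimate and posterior variance for a 1D posterior evaluated on a grid.
    # Hypothetical model: y = x^3 + noise, noise ~ N(0, sigma^2), prior N(0, 1).
    sigma, y_obs = 0.1, 0.5
    x = np.linspace(-3.0, 3.0, 4001)
    dx = x[1] - x[0]
    post = np.exp(-(y_obs - x ** 3) ** 2 / (2 * sigma ** 2)) * np.exp(-x ** 2 / 2)
    post /= post.sum() * dx                          # normalise numerically

    x_cm = (x * post).sum() * dx                     # conditional mean estimate
    var_post = ((x - x_cm) ** 2 * post).sum() * dx   # posterior variance
    print(x_cm, var_post)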
Example 46 (MAP estimate). We say that a pdf is unimodal if its global maximum is attained at only one point.

Let δ > 0 and Lδ(x, z) = 1_{B(z,δ)^C}(x) for x, z ∈ Rn. Let x ↦ fpost(x; y) be unimodal for the given data y ∈ Rm. The limit of the estimate

hδ(y) = argmin_{z∈Rn} ∫_{Rn} 1_{B(z,δ)^C}(x) fpost(x; y) dx = argmin_{z∈Rn} ∫_{Rn \ B(z,δ)} fpost(x; y) dx

is

lim_{δ→0+} hδ(y) = xMAP(y),

where

xMAP(y) = argmax_{x∈Rn} fpost(x; y).

The maximum a posteriori estimate xMAP(y) is useful when expectations are hard to obtain. It can also be written as

xMAP(y) = argmax_{x∈Rn} fY(y | X = x) fpr(x).

The MAP estimate is often used also in situations where the posterior pdf is not unimodal, whereby it is not unique.
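Since the norming constant does not affect the maximiser, xMAP(y) is in practice computed by minimising the negative logarithm of fY(y | X = x) fpr(x). A sketch (my own; it assumes SciPy's optimize.minimize and reuses the linear Gaussian model, where the result can be checked against mpost):

    import numpy as np
    from scipy.optimize import minimize

    # MAP estimate by minimising -log( f_Y(y0 | X = x) f_pr(x) ).
    # Linear Gaussian model, so x_MAP must coincide with m_post.
    rng = np.random.default_rng(4)
    n, m, sigma2, c = 4, 6, 0.01, 1.0
    M = rng.standard_normal((m, n))
    y0 = M @ rng.standard_normal(n) + np.sqrt(sigma2) * rng.standard_normal(m)

    def neg_log_post(x):
        return np.sum((y0 - M @ x) ** 2) / (2 * sigma2) + np.sum(x ** 2) / (2 * c)

    x_map = minimize(neg_log_post, np.zeros(n)).x
    m_post = np.linalg.solve(M.T @ M + (sigma2 / c) * np.eye(n), M.T @ y0)
    print(np.allclose(x_map, m_post, atol=1e-4))     # True up to optimiser tolerance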
In addition to point estimates we can also determine componentwise Bayesian confidence intervals for the unknown, e.g. by choosing a in such a way that

Ppost(|Xi − x̂i| ≤ a) = 1 − α,

where x̂i is the chosen estimate of the ith component and e.g. α = 0.05.

5.5 Recap

About probability theory

The conditional pdf of the random vector X given Y = y (with marginal pdf fY(y) > 0) is

fX(x | Y = y) = f(X,Y)(x, y) / fY(y).

The Bayes formula

fX(x | Y = y) fY(y) = f(X,Y)(x, y) = fY(y | X = x) fX(x)

holds for continuous pdfs (in the case of discontinuities, only up to versions).

Statistical inverse problem

The unknown and the data are modeled as random vectors X and Y.

The probability distributions of X and Y represent quantitative and qualitative information about X and Y, and the lack of such information.

The given data y0 is a sample of Y, i.e. y0 = Y(ω0) for some elementary event ω0.

The solution of a statistical inverse problem is the conditional pdf of X given Y = y0 (with fY(y0) > 0).

Posterior pdf

consists of the (normed) product of the likelihood function x ↦ fY(y0 | X = x) and the prior pdf x ↦ fpr(x).

can be used in determining estimates and confidence intervals for the unknown.

Typical priors include Gaussian priors (especially smoothness priors), the ℓ1-prior, the Cauchy prior and the total variation prior (e.g. for 2D images).

Please learn:

definitions of prior and posterior pdf

how to define posterior pdf (up to the norming constant) when the unknown and
noise are statistically independent and the needed pdfs are continuous.

how to write the expressions for the posterior pdf, its mean and covariance, in the
linear Gaussian case.

how to explain the connection between Tikhonov regularisation and Gaussian linear
inverse problems

how to form the hierarchical prior pdf when the conditional pdf and the hyperprior
are given

definition of CM-estimate as the conditional mean

definition of MAP-estimate as a maximizer of the posterior pdf
