Richard J. Baxter
MSc in Computer Science (2nd Year)
Computer Science Department
University of Cape Town

Hosted by:
Laboratoire Fizeau
Université Nice Sophia Antipolis
and
Observatoire de la Côte d'Azur

November 1, 2011

richard.jonathan.baxter@gmail.com
david.mary@unice.fr
chiara.ferrari@oca.eu
Acknowledgements
Many people and organisations came together to make this report, and the trip that created it, possible. I would like to thank everyone at ADION for accepting me for the Henri Poincaré Junior Fellowship, which formed my primary source of income and academic support.
I would also like to extend my thanks to the Square Kilometre Array (SKA) project and the National Research Foundation (NRF) in South Africa for their contributions towards travel expenses and equipment: namely, my laptop, without which I would not have been able to do any work at all!
This brings me to the wonderful people who hosted and helped me along at l'Université Nice Sophia Antipolis and l'Observatoire de la Côte d'Azur, especially Dr. David Mary and Dr. Chiara Ferrari, who both supported me and made sure that I was on my feet at all times, which enabled me to get a substantial amount of work done. Dr. Ferrari made sure I was housed, knew my way around and gave me valuable scientific insight that I, as a non-astronomer, would not otherwise have gained. Dr. Mary's constant supervision (especially in the first month) made sure I got up to speed very fast. I could not have learned what I did without either supervisor's assistance.
I would also like to thank my friend Arwa Dabbech, who guided me through the many important details and pedantic questions I asked. Again, without her help I would not have learned as much as I did in the time I had. The teasing out of concepts and problems was invaluable and does not go unappreciated.
I would also like to mention my supervisors at the University of Cape Town, Dr. Patrick Marais, Dr. Michelle Kuttel, Dr. Kurt van der Hyden and Dr. Ian Stewart, for their constant guidance.
Thanks also to my parents, who supported me on my long trip away from home, especially my mother, who helped me with travel arrangements. Special thanks to my girlfriend, Elizabeth Braae, who in addition to her support spent many hours reading over my work and expertly correcting my grammar, spelling and mathematics errors.
Lastly, to all the friends I made in Nice, who made my time there more than just an academic endeavour: I can honestly say that I had a great time with you all.
I learnt much about many things, in many areas of life and of academia. I am truly grateful.
Contents

1 Introduction and Background
    1.1 Convolution
        1.1.1 Discrete Convolution
    1.2 Fourier Transform
        1.2.1 Introduction and Usage
        1.2.2 Discrete Fourier Transforms
        1.2.3 Fast Fourier Transform
        1.2.4 Convolution Theorem for FFT
    1.3 Wavelet Transform
        1.3.1 Deficiencies of the Fourier Transforms
        1.3.2 Decomposition and Reconstruction
        1.3.3 Usage
        1.3.4 Fast Wavelet Transform
    1.4 Radio Interferometry
        1.4.1 Observed Data and PSFs: Both Clean and Dirty

    3.2 Maximum Likelihood (ML)
        3.2.1 Constrained Maximum Likelihood and Negative Log Likelihood
        3.2.2 Poisson: RLA
        3.2.3 Additive Gaussian Noise: ISRA
            3.2.3.1 Maintaining Positivity via Shifting the Data
            3.2.3.2 Maintaining Positivity via Splitting the Data
        3.2.4 Regularised ISRA and RLA
            3.2.4.1 Exit Criteria
    3.3 Algorithmic Implementation
        3.3.1 Computational Operators for Convolution
        3.3.2 RLA
        3.3.3 ISRA

4 Maximum a posteriori (MAP)
    4.1 Sparse Representation
        4.1.1 Concepts and Definitions
        4.1.2 Definitions of Sparsity
            4.1.2.1 Intuitively
            4.1.2.2 Descriptively
            4.1.2.3 Formally
    4.2 MAP Synthesis
        4.2.1 The Maximum Likelihood Term
        4.2.2 The Prior Term
        4.2.3 The MAP-Synthesis Model
    4.3 Matching Pursuit Algorithm
        4.3.1 Description of the Algorithm
            4.3.1.1 Exit Criteria
        4.3.2 Computational Considerations
            4.3.2.1 Wavelet Computational Operators
            4.3.2.2 Computing the Coefficients
            4.3.2.3 Computing the Norms
        4.3.3 Manifestation of the Prior
    4.4 Experimentation
        4.4.1 Set up
        4.4.2 Analysis of Dictionaries
            4.4.2.1 The Identity Dictionary
            4.4.2.2 The Symlet Dictionary
            4.4.2.3 The Symlet+Identity Dictionary
        4.4.3 Analysis of Loop-Gains

5 Conclusion

A Notations and Concepts
    A.1 Mathematical Notations
        A.1.1 Element-wise Vector Multiplication
    A.2 Covariance Matrix
    A.3 Hyperparameter
    A.4 Descent Direction
    A.5 KKT First Order Optimality Conditions

B Proofs
    B.2 Proofs From Chapter 2
    B.3 Proofs From Chapter 3
    B.4 Proofs From Chapter 4
Chapter 1

Introduction and Background

1.1 Convolution

1.1.1 Discrete Convolution

\[(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m)\, g(n - m) \tag{1.2}\]
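For finitely supported signals, the sum in Eq. (1.2) can be evaluated directly. The following is a minimal sketch (in Python with NumPy; the function name `conv1d` and the toy signals are illustrative, not from this report) that agrees with NumPy's built-in `np.convolve`:

```python
import numpy as np

def conv1d(f, g):
    """Direct evaluation of the discrete convolution (Eq. 1.2):
    (f*g)(n) = sum_m f(m) g(n-m), for finitely supported signals."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(f)):
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out

f = np.array([0.0, 1.0, 0.0, 2.0])   # two "points", of value 1 and 2
g = np.array([1.0, 1.0, 1.0])        # a small "stamp"
print(conv1d(f, g))                  # matches np.convolve(f, g)
```

Note how the value-2 point scales the stamp by 2, exactly as in the stamping analogy below.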
A more intuitive way of visualising a convolution is to picture a simple image consisting of a number of sparsely placed, infinitely small points of value 1; call this 2D function f. If we take another picture, a 2D function g, and we stamp it on each one of the points in f, we will have achieved something like the convolution of f with g. If a point on f was equal to 2, then g would be multiplied by 2 when stamped. See Figure 1.1 for an example.
For images with just points this seems intuitive. For more complex images the analogy breaks down a bit. If we take f to contain more than just points, say a line or some other more complicated shape, we can imagine stamping g at every single point of f. Figure 1.2 demonstrates this concept.
Convolution is used in many applications. Blurring images is a prime example: the Gaussian blur is simply an image convolved with an image of a 2D Gaussian curve. However, if one were to imagine the blurring process (or had one used an image processing application such as Photoshop or Gimp), one would realise that any attempt to un-blur or re-sharpen an image back to its original state is never perfect.
Figure 1.1: Left: three points at (100,100), (300,300), (200,400) (not visible at this scale). Middle: a square function. Right: the convolution of the two.
1.2 Fourier Transform

1.2.1 Introduction and Usage
A Fourier transform takes some function f(x) and decomposes it into an infinite series of sinusoidal functions of different frequencies and amplitudes. Adding these frequencies back together returns the original function. Figure 1.3 shows the first four elements of the Fourier series of a square wave, and the wave this series would produce were it to run to infinity. If we take the Fourier transform of this signal, we get a continuous function F(x) representing the frequency space of f(x). We can see in Figure 1.4 that at the 8 micrometre range there is a spike. Depending on where the signal f(x) originates from, this could mean different things. In chemical analysis, the spike could represent an abundance or lack of a certain chemical compound. If it were a radio signal, it could be a signal from a radio station operating at a certain frequency. In astronomy it could represent the presence of an object that is known to emit signals of that frequency, say a pulsar or maser source. The graphs in Figure 1.4 are from a Fourier-Transform Infra-Red (FTIR) spectrometer, so they are most likely showing an abundance of a certain chemical bond. In any case, analysis of the frequency domain makes certain insights not merely easier to obtain, but in some cases possible at all.
1. Under perfect circumstances deconvolution can be done; however, under any noise or contamination, even in seemingly negligible amounts, this does not hold (see Section 2.2).
Figure 1.3: Red: nth element of the series. Blue: sum over n elements. Dashed: sum over n − 1 elements. If this series were to go to infinity, the resultant wave would be a square wave.
\[\mathcal{T}(f)(s) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i t s}\, dt \tag{1.3}\]
1.2.2 Discrete Fourier Transforms

The discrete version of the Fourier transform has a similar form to the continuous one. Given a discrete vector g ∈ C^N with entries g(t), t = 0, …, N − 1, the transform at frequency s is

\[\mathcal{T}(g)(s) = \sum_{t=0}^{N-1} g(t)\, e^{-2\pi i t s / N} \tag{1.4}\]
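Eq. (1.4) can be checked numerically against a library FFT. A small sketch (Python/NumPy; `dft` is an illustrative name, and the e^{−2πits/N} normalisation is the standard convention assumed here) evaluates the sum naively and compares it with `np.fft.fft`, which uses the same convention:

```python
import numpy as np

def dft(g):
    """Naive evaluation of the DFT (Eq. 1.4):
    T(g)(s) = sum_t g(t) exp(-2*pi*i*t*s/N), one frequency at a time."""
    N = len(g)
    t = np.arange(N)
    return np.array([np.sum(g * np.exp(-2j * np.pi * t * s / N))
                     for s in range(N)])

g = np.random.default_rng(5).random(8)
print(np.allclose(dft(g), np.fft.fft(g)))  # True
```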
Figure 1.5: Top Left: An Image of some random text. Top Right: The Fourier transform. Bottom
Left: The image rotated. Bottom Right: The Fourier transform of the rotated image. Notice how in the
Fourier domain the main axis is aligned with the rotation of the original image. Using this information, a
scanned page that has an unknown rotation can be corrected to its original rotation by simply detecting
this main axis on the Fourier transform, a far simpler task than trying to detect lines of text in the image
plane. Once rotated back to an upright position, text recognition or other advanced processing steps
can be utilised.
1.2.3 Fast Fourier Transform

The Fast Fourier Transform (FFT) is an algorithm for calculating the DFT. It was first developed by Cooley and Tukey [2]. Whilst direct computation is O(N²) for N samples, the FFT does it in O(N log N). This significant speed-up is what makes much of modern signal processing feasible.
The FFT in this form requires that N be a power-of-two, but there have been modifications that allow for
non power-of-two sizes, such as the Prime-Factor FFT Algorithm [8] or the chirp-z algorithm [18].
1.2.4 Convolution Theorem for FFT

\[\mathcal{T}(f * g) = \mathcal{T}(f)\,\mathcal{T}(g) \tag{1.5}\]
In other words, the transform of a convolution is the product of the transforms, and the transform of a product is the convolution of the transforms.
This allows us to calculate a discrete convolution using two Fast Fourier Transforms, one element-wise multiplication and one inverse transform. Classic calculation of the discrete convolution of an N × N image is an O(N⁴) problem, whereas use of the FFT reduces this to O(N² log N + N²) = O(N² log N).
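The theorem is easy to verify numerically. The sketch below (Python/NumPy; written for illustration, not part of the original report) computes a circular convolution via two FFTs and a point-wise product, then spot-checks one output pixel against the direct double sum:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((64, 64))
g = rng.random((64, 64))

# Convolution theorem: FFT(f * g) = FFT(f) . FFT(g) (element-wise), so a
# circular convolution is two FFTs, a point-wise product and one inverse FFT.
conv_fft = np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(g)).real

# Direct circular convolution of a single output pixel, as a spot check
# (computing every pixel this way is the O(N^4) approach mentioned above).
n0, n1 = 5, 9
direct = sum(f[m0, m1] * g[(n0 - m0) % 64, (n1 - m1) % 64]
             for m0 in range(64) for m1 in range(64))
print(np.isclose(conv_fft[n0, n1], direct))  # True
```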
1.3 Wavelet Transform

1.3.1 Deficiencies of the Fourier Transforms
Recall that a Fourier transform takes data in image space and transforms it into Fourier space (or frequency
space). This can be seen as transforming intensity data to sinusoidal coefficients. Owing to the periodic
and unbounded nature of the sine function, the transform can create artifacts when discretised and under a
bounded region. Take the transform of a 2D square function on a finite map for example (Figure 1.6). The
sharp discontinuities in the image cause the distinct ringing effects in the frequency domain. Unbounded,
these rings would go on infinitely but, because of the finite space, the signal either cuts off or wraps
(depending on the algorithm used). Performing the inverse transform would ideally return the original
square function, but instead returns an approximation.
Figure 1.6: Left: Square function. Center: Fourier Transform of square function. Right: Square
function transformed and inverse transformed. Notice the artifacts and the sharp dips to the sides of
the restored square function.
The square function example is contrived when considering real data, but there are cases where thresholding techniques can create sharp discontinuities and hence these artifacts. Wave-like functions that are bounded within some finite area, called wavelets, have been shown to represent certain physical signals better than the infinitely periodic sine wave (see Figure 1.7) [15].
Figure 1.7: Left: Mexican Hat Wavelet. Center: Meyer Wavelet. Right: Morlet Wavelet.
1.3.2 Decomposition and Reconstruction

Decomposition is the act of expressing one form in terms of another. With the Fourier transform we decompose an image into a series of sine waves. This decomposition essentially gives us a number of sine-wave coefficients. Each coefficient (a complex number) represents the phase and amplitude of a sine wave, with the sine wave's frequency determined by its location in the Fourier plane. Computing the sine wave for each coefficient and adding them all together will reconstruct the original image.
The wavelet transform instead aims to decompose the image into a series of coefficients of bounded, sine-like functions called wavelets, rather than of periodic sine waves.
1.3.3 Usage

The 2D sine wave has only four parameters (frequency in x and y, amplitude and phase), so its decomposition can be represented by a 2D complex-valued map. The wavelet has five (scale in x and y, shift in x and y, and amplitude), which cannot; one could, however, represent it with a 4D real-valued map. A discrete transform is employed for computational purposes, but generating a 4D map is too memory intensive. Instead a multi-resolution approach is used, which performs multiple levels, L, of decomposition on an N × N image/map.
Figure 1.8: The above three wavelets are examples of what the wavelet transform uses to decompose
the image. Left and Center are level 1 wavelets whereas on the Right is a level 2 wavelet. Notice the
difference in scale. In the same way a Fourier transform decomposes an image into a number of sine
waves, the wavelet transform decomposes an image into wavelets.
At each level, three wavelets are used for decomposition: one circularly symmetric, one scaled in x, and one scaled in y.
Figure 1.9: Example of a JPEG2000 2-level wavelet decomposition. Since this is a compression scheme, the residual image is down-sampled (or decimated) at each level [4].
In scientific computing the residual is not down-sampled (i.e. it remains undecimated) and the decomposition is calculated at every (2^L)th pixel on the first level, every (2^{L−1})th on the second level, every (2^{L−2})th on the third, and so forth. Similar to the Fourier transform, the coefficients can be used in conjunction with the wavelet function to reconstruct the original image. Unlike the Fourier transform, the residual image must be added back as well.
1.3.4 Fast Wavelet Transform

The first Discrete Wavelet Transforms (DWT) were devised by Haar [10] in 1909 and were based on discontinuous square-like functions. The Fast Wavelet Transform (FWT) was initially developed by Mallat [14]. Ingrid Daubechies later built on Mallat's work, creating the set of wavelets that are most commonly used today [6][9].
Whilst DWTs usually run in O(N log N), the FWT runs in O(N) for N samples or data points. Complexity-wise this is faster than the FFT (which is O(N log N)), but the wavelet computations are significantly more intensive, and hence for most small cases (less than 512²) the FWT is slower than the FFT.
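As an illustration of a perfect-reconstruction wavelet decomposition, here is a minimal one-level 2D Haar DWT in Python/NumPy (a sketch written for this context, not the report's own implementation): the image splits into a coarse residual plus three detail sub-bands, and the inverse rebuilds it exactly.

```python
import numpy as np

def haar2d(a):
    """One level of the 2D Haar DWT: a coarse residual (ca) and three
    detail sub-bands (horizontal, vertical, diagonal)."""
    s = (a[0::2] + a[1::2]) / 2      # row averages
    d = (a[0::2] - a[1::2]) / 2      # row differences
    ca = (s[:, 0::2] + s[:, 1::2]) / 2
    ch = (s[:, 0::2] - s[:, 1::2]) / 2
    cv = (d[:, 0::2] + d[:, 1::2]) / 2
    cd = (d[:, 0::2] - d[:, 1::2]) / 2
    return ca, ch, cv, cd

def ihaar2d(ca, ch, cv, cd):
    """Exact inverse: the four coefficient maps reconstruct the image."""
    s = np.empty((ca.shape[0], ca.shape[1] * 2))
    d = np.empty_like(s)
    s[:, 0::2], s[:, 1::2] = ca + ch, ca - ch
    d[:, 0::2], d[:, 1::2] = cv + cd, cv - cd
    a = np.empty((s.shape[0] * 2, s.shape[1]))
    a[0::2], a[1::2] = s + d, s - d
    return a

img = np.random.default_rng(3).random((8, 8))
print(np.allclose(ihaar2d(*haar2d(img)), img))  # True
```

Applying `haar2d` again to `ca` gives the second level, mirroring the multi-level scheme described above.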
1.4 Radio Interferometry

The details of radio interferometry are beyond the scope of this report. Briefly, for readers unfamiliar with astronomy: radio astronomy involves surveying the sky at radio frequencies. This can be done with either a single radio dish, or with multiple dishes working in unison.
Using multiple dishes enables one to synthesise a dish with aperture equal to the sum of the dishes' apertures and diameter equal to the largest distance between any two dishes (called the baseline).
Larger baselines mean that higher-resolution data can be observed, while a denser array of radio telescopes allows fainter sources to be detected. It is impossible, or at least very improbable, to synthesise a dish with coverage equal to a (theoretical) dish with diameter equal to the baseline. This lack of full coverage causes an implicit convolution, which is the reason deconvolution is needed.
1.4.1 Observed Data and PSFs: Both Clean and Dirty
The importance of convolution to radio interferometry comes from the way in which data is collected.
For reasons that will not be explored in detail here, when a radio interferometer measures a section of the sky, I_T, the data it outputs is I_T convolved with the PSF of that particular interferometer. The Point Spread Function (PSF) is what the interferometer would measure if it were to observe a single infinitely small but sufficiently bright point source in the sky. This can be seen as a result of the incomplete coverage of the radio telescope array.
In the literature, and hence in this report, there are standard names for certain sets of data. They fall under two categories, dirty and clean. The Dirty Beam is the same as the PSF, and the two terms will be used interchangeably in this report. The Clean Beam is what a perfect interferometer would measure. As there is no such thing, it is usually modelled as a Gaussian curve: at the centre of every PSF is a Gaussian-like curve, and the clean beam is fitted to represent a smooth version of this.
The True Image is the perfect observation, which we do not have. This could be referred to as the Clean Image, but it is not to be confused with what might be called the cleaned image, which is the resulting image after deconvolution. As deconvolution will never, with certainty, reproduce the True Image, the two can never be said to be the same; for clarity the latter shall be referred to as the Reconstructed Image. The Dirty Image, mathematically, is the True Image convolved with the Dirty Beam. This is the data we need to deconvolve in order to approximate the True Image. The chapters that follow show various attempts at deconvolution to reconstruct the True Image.
Chapter 2
2.1
The following chapter builds our initial model of the sky, so that we might derive a rigorous mathematical way to deconvolve our data. This and the following chapters draw on Introduction to Image Reconstruction and Inverse Problems by Thiebaut [21], Deconvolution and Blind Deconvolution in Astronomy by Pantin, Starck & Murtagh [17], Sparse priors in unions of representation spaces for radiointerferometric image reconstruction by Mary, Bourguignon, Theys & Lanteri [16], and the Master's thesis Reconstruction of Radio-Interferometer images using sparse representations by Arwa Dabbech [3], amongst other sources that will be referenced more specifically in the text.
2.1.1 Building a Model
When mapping a region of the sky, the region is split into a grid. Each cell of this grid (i.e. a pixel) measures the amount of photons received by the observing instrument from a certain angular section of the sky. We use y to denote the intensity map that was observed over some section of the sky and x for the true (albeit unknown) intensity over the same region.
The goal of processing an observation is to find x. If we assume perfect observing conditions and a perfect observing instrument we can simply use the trivial model

y = x

However, when utilising a radio interferometer, it is well known that the received signal is affected by the PSF of the interferometer. We denote the PSF by h. With this consideration our model becomes

y ≈ h ∗ x
Owing to cosmic background, instrument noise and atmospheric interference, amongst other factors, our signal will be unavoidably contaminated by noise. This noise can affect the signal before it has reached the instrument (external noise/temperature), n_e, as well as after the signal has been
Figure 2.1: Left: Actual Intensity map of the sky. Middle: Point spread function of the interferometer.
Right: The convolution of the two, this is the data that the interferometer produces (minus any noise).
received (internal instrumental noise/temperature), n_i. This means we can model our data as

y ≈ h ∗ (x + n_e) + n_i

This can be simplified to use one noise term:

\[
\begin{aligned}
y &\approx h * (x + n_e) + n_i \\
&= (h * x) + (h * n_e) + n_i \\
&= (h * x) + n'_e + n_i \\
&= (h * x) + n
\end{aligned} \tag{2.1}
\]

where n'_e = h ∗ n_e. Since noise convolved with a uniform PSF is just the same noise, n'_e will have the same variance and mean as n_e. We use n to denote the combination of internal and external noise, since in all further formulations they will be treated the same.
It is important to note that each pixel or cell of our intensity data can be modelled independently from the others. This is a safe assumption, as we have no physical reason to believe that photons from one section of the sky are necessarily correlated with those from any other section.
Figure 2.2: Left: Convolution of the true sky intensity and the PSF. Middle: Gaussian Noise Right:
The data that the interferometer produces, the noise affected and convoluted sky intensity.
2.1.2 Discretised Data

Since we work with discretised data (pixels) we take the matrix form of this model:

\[y \approx Hx + n \tag{2.2}\]

y has not been made into a k × k matrix but rather a flattened vector

\[y = [y(1,1),\, y(1,2),\, \ldots,\, y(1,k),\, y(2,1),\, y(2,2),\, \ldots,\, y(k,k)]^T \tag{2.3}\]

\[y = [y_1,\, y_2,\, \ldots,\, y_N]^T \tag{2.4}\]
2.1.3

The construction and meaning of the PSF's response matrix is as follows. Take the PSF centred at the point (1, 1) and, just as we flattened y, flatten h to obtain [h_1, h_2, h_3, ..., h_N]^T.
We now define the PSF response matrix as

\[
H \triangleq \begin{bmatrix}
h_1 & h_N & h_{N-1} & \cdots & h_2 \\
h_2 & h_1 & h_N & \cdots & h_3 \\
h_3 & h_2 & h_1 & \cdots & h_4 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
h_N & h_{N-1} & h_{N-2} & \cdots & h_1
\end{bmatrix} \tag{2.5}
\]
which is a matrix such that the ith column is the flattened PSF circularly shifted i − 1 times. This can also be seen to represent a PSF at every possible point on a k × k map.
It can be shown that for any map a and its flattened version a, Ha = h ∗ a.
An example and a sketch proof showing that

\[Hx = x_1 H_1 + x_2 H_2 + \cdots + x_N H_N \tag{2.6}\]

(where H_i denotes the ith column of H) is given in Appendix B.2.6. As can be seen, Hx gives rise to a linear combination of PSFs applied to every point of x, each scaled by the intensity at that point, which is the definition of a convolution.
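The identity Ha = h ∗ a can be verified numerically. A sketch (Python/NumPy; the random flattened vectors stand in for an actual PSF and image) builds H column by column from circular shifts, as in Eq. (2.5), and compares Hx against the circular convolution computed via the FFT:

```python
import numpy as np

N = 16
rng = np.random.default_rng(1)
h = rng.random(N)   # flattened PSF
x = rng.random(N)   # flattened image

# H: the i-th column is the flattened PSF circularly shifted i-1 times
H = np.column_stack([np.roll(h, i) for i in range(N)])

# Hx equals the circular convolution h * x (computed here via the FFT)
circ = np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)).real
print(np.allclose(H @ x, circ))  # True
```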
2.2

An easy approach is simply to transform the dirty map to Fourier space and divide, i.e. to perform a direct inversion [21]. After all, in the Fourier domain the dirty image is simply the multiplication of the Fourier transform of the true sky intensity by the Fourier transform of the PSF. Inversion should work, and we follow this line of reasoning:

\[
\begin{aligned}
y &= h * x \\
\mathcal{F}(y) &= \mathcal{F}(h)\,\mathcal{F}(x) \\
\mathcal{F}(x) &= \frac{\mathcal{F}(y)}{\mathcal{F}(h)} \\
x_{\mathrm{direct}} &= \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(y)}{\mathcal{F}(h)}\right)
\end{aligned} \tag{2.7}
\]
Figure 2.3: When performing the direct inversion on the convoluted map, the original intensity map
can be restored.
We expect to recover the original image (Figure 2.3), but when this is performed on observed data the result is simply noise (Figure 2.4). The problem is the additional noise term. When this is taken into account, the equation becomes somewhat less helpful:

\[
\begin{aligned}
y &= h * x + n \\
\mathcal{F}(y) &= \mathcal{F}(h)\,\mathcal{F}(x) + \mathcal{F}(n) \\
\mathcal{F}(x) &= \frac{\mathcal{F}(y) - \mathcal{F}(n)}{\mathcal{F}(h)} \\
\mathcal{F}(x) &= \frac{\mathcal{F}(y)}{\mathcal{F}(h)} - \frac{\mathcal{F}(n)}{\mathcal{F}(h)} \\
x_{\mathrm{direct}} &= \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(y)}{\mathcal{F}(h)} - \frac{\mathcal{F}(n)}{\mathcal{F}(h)}\right)
\end{aligned} \tag{2.8}
\]

2.2.1 Ill-Conditioning
We can see that the F(n)/F(h) term spoils what we thought would be a simple solution. Division by the small values in the Fourier transform of the PSF causes this term to corrupt the result significantly, even when the signal-to-noise ratio is very high. Since convolution by the PSF (also called instrumental transmission) is a process in which the noise is usually non-negligible, noise amplification will always be a significant consideration in deconvolution problems. This amplification of the noise by the solution method is called ill-conditioning. We, of course, strive to find well-conditioned solutions that do not amplify the noise [21].
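The amplification is easy to demonstrate: with a broader PSF, the small Fourier coefficients turn a noise level of σ = 0.1 into reconstruction errors that are orders of magnitude larger. A sketch (NumPy assumed; sizes and widths arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256
x = np.zeros(N)
x[100] = 100.0                                         # bright point source
h = np.exp(-0.5 * (np.arange(N) - N // 2) ** 2 / 4.0)  # broader Gaussian PSF
h /= h.sum()

y_clean = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))
y_noisy = y_clean + rng.normal(0.0, 0.1, N)            # sigma = 0.1, as in Fig 2.4

# Naive direct inversion on the noisy dirty map.
x_rec = np.real(np.fft.ifft(np.fft.fft(y_noisy) / np.fft.fft(h)))

# The reconstruction error dwarfs the injected noise level.
print(np.std(x_rec - x) > 1000 * 0.1)  # True: the problem is ill-conditioned
```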
Figure 2.4: When performing the direct inversion on noisy data, the equation produces a noise-like result. In this example the dirty image was contaminated by a seemingly insignificant noise level (σ = 0.1).
2.3 Inverse Filtering
2.3.1 Frequency Cutoff
Noise usually dominates in the high-frequency range, whereas the sources we are trying to detect usually occupy lower frequencies. One solution is simply to cut off the high frequencies, where we know noise dominates, and then perform the inversion.
$$x_{\mathrm{direct}} = F^{-1}(t) \quad \text{for} \quad t(u) = \begin{cases} \dfrac{F(y)(u)}{F(h)(u)} & \text{if } u < u_{\mathrm{cutoff}} \\ 0 & \text{otherwise} \end{cases} \quad (2.9)$$
This does produce more acceptable results; however, the sharp cutoff in the Fourier domain causes clear artifacts in the image plane (noticeable ripples, or ringing) [21]. These ripples distort faint objects in the background and produce negative-valued solutions, which we know are not physically possible (see Figure 2.5).
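A sketch of eqn (2.9) in NumPy (the cutoff frequency here is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 256
x = np.zeros(N)
x[100] = 100.0
h = np.exp(-0.5 * (np.arange(N) - N // 2) ** 2 / 4.0)
h /= h.sum()

y = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x))) + rng.normal(0.0, 0.1, N)

freqs = np.abs(np.fft.fftfreq(N))   # normalised frequency of each Fourier bin
t = np.fft.fft(y) / np.fft.fft(h)   # direct inversion in Fourier space...
t[freqs >= 0.1] = 0.0               # ...but zeroed above the cutoff
x_cut = np.real(np.fft.ifft(t))

x_naive = np.real(np.fft.ifft(np.fft.fft(y) / np.fft.fft(h)))
# The cutoff suppresses the amplified noise, though ringing can still
# push some pixels negative.
print(np.std(x_cut - x) < np.std(x_naive - x))  # True
```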
2.3.2 Wiener Inverse-Filter
Frequency cutoff can be seen as a simple form of inverse-filter. More advanced inverse-filters have been developed that ensure certain criteria are met in order to create more reliable images. For instance, the Wiener inverse-filter ensures that the result is, on average, as close as possible to the true object brightness (in a least-squares sense) [21].
Figure 2.5: Frequency cutoff of the direct inversion, with T the cutoff function. We see somewhat improved results over the direct inversion (Fig. 2.4), even under considerably greater noise (σ = 10 as opposed to σ = 0.1). Other techniques have proven to give better results: for instance, in this case there is a point source and a diffuse source that are not resolved in this solution but can be resolved by other methods. The solution also contains negative, and hence non-physical, values.
Details of this filter are beyond the scope of this report, but can be found in Thiebaut [21] or most textbooks on linear filters.
2.4 CLEAN
2.4.1 Brief History
2.4.2
The CLEAN algorithm assumes that all the sources in the observed patch of sky are point sources. Starting with the dirty map, it simply finds the brightest point within it and subtracts a fraction (governed by the loop factor) of the normalised PSF centred at that point (the exact value is detailed below). The subtracted flux is added to a point map at the same coordinates. The map from which the algorithm subtracts is referred to as the residual map, or residue, as it contains what is left over. The algorithm then repeats, using the residue to find and subtract the brightest point again.
The algorithm stops once the maximum brightness falls below a certain threshold (classically, one hundredth of the initial maximum brightness), or once the residual is considered statistically close enough to noise: for Gaussian noise, for example, this would be the mean squared error falling below some predetermined threshold. A maximum number of iterations can also be set.
After the algorithm stops, one is left with the residual and a point map. We convolve the point map with the normalised clean beam to reconstruct the observation. In effect, at each iteration of CLEAN the dirty beam was estimated to have had a certain effect on the image; that effect was removed and a clean beam's effect was put in its place. When we finally add the residual back to the reconstructed map, the total flux is preserved. Not adding the residue back would cause elements in the point map to appear as faint but significant sources; adding the residue back restores the original noise, so these previously significant sources fade away, becoming clearly insignificant.
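The loop described above can be sketched in a few lines. This is a minimal 1-D illustration (NumPy assumed; the PSF, source positions and stopping threshold are arbitrary choices, and the final clean-beam convolution is omitted):

```python
import numpy as np

def clean(dirty, psf, loop=0.25, threshold_frac=0.01, max_iter=10_000):
    """Minimal CLEAN: repeatedly subtract a fraction of the peak-normalised,
    shifted PSF from the residue and record the removed flux in a point map."""
    residual = dirty.copy()
    points = np.zeros_like(dirty)
    centre = int(np.argmax(psf))            # position of the PSF peak
    stop = threshold_frac * residual.max()  # classic exit: 1/100 of initial peak
    for _ in range(max_iter):
        i = int(np.argmax(residual))
        peak = residual[i]
        if peak < stop:
            break
        frac = loop * peak                  # flux removed this iteration
        residual -= frac * np.roll(psf, i - centre)
        points[i] += frac
    return points, residual

# Two point sources blurred by a unit-peak Gaussian "dirty beam".
N = 128
psf = np.exp(-0.5 * (np.arange(N) - N // 2) ** 2 / 9.0)
dirty = 10.0 * np.roll(psf, 30 - N // 2) + 5.0 * np.roll(psf, 80 - N // 2)

points, residual = clean(dirty, psf)
print(residual.max() < 0.01 * dirty.max())  # True: cleaned down to the threshold
```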
2.4.2.1
Exit Criteria
The CLEAN algorithm is one in which a residual is defined and progressively reduced so as to reconstruct the original. We do not know enough about the statistics of the data to create statistically justified, robust exit criteria. However, we expect that as we remove statistically interesting characteristics of the data from the residue, the residue will become more and more like background noise. If we know the noise level, with standard deviation σₙ, we can exit when the maximum value of the residue (or the standard deviation of the residue) falls below some predefined value,
$$\max(r) < c \quad \text{or} \quad \sigma_r < c \quad (2.10)$$
where σ_r is the standard deviation of the residue r, and c is usually in the range [3σₙ, 5σₙ].
Another criterion, if the data is assumed to be under additive Gaussian noise, is to calculate by how much the standard deviation of the residue has changed from one iteration to the next. When no significant structures are left in the residue, the percentage difference in the standard deviation from the last iteration will be extremely small, as both will be essentially noise. The exit criterion is therefore
$$\frac{|\sigma_{r^{(k+1)}} - \sigma_{r^{(k)}}|}{\sigma_{r^{(k)}}} < \epsilon \quad (2.11)$$
where σ_r is the standard deviation of the residue r, and ε is a predefined constant, usually a small number in the range of 10⁻⁵ to 10⁻⁷.
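Criterion (2.11) is straightforward to compute. A sketch (NumPy assumed; the noise level and the toy "structure" are arbitrary):

```python
import numpy as np

def should_stop(r_prev, r_curr, eps=1e-6):
    """Exit test of eqn (2.11): relative change of the residue's std deviation."""
    s_prev, s_curr = np.std(r_prev), np.std(r_curr)
    return abs(s_curr - s_prev) / s_prev < eps

rng = np.random.default_rng(4)
noise = rng.normal(0.0, 1.0, 10_000)
bump = 5.0 * np.exp(-0.5 * (np.arange(10_000) - 5_000) ** 2 / 100.0 ** 2)

# Removing real structure changes the std noticeably; once the residue is
# essentially noise, successive iterations barely change it.
print(should_stop(noise + bump, noise))        # False: still converging
print(should_stop(noise, noise * 1.0000001))   # True: time to exit
```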
Figure 2.6: Top left: σ = 0. Top right: σ = 1. Middle left: σ = 5. Middle right: σ = 10. Bottom left: the sky model. Bottom right: over-CLEANing (σ = 1). All reconstructions used a loop factor of 0.25. For low levels of noise the CLEAN algorithm performs admirably. As the data becomes noisier, the defined boundary of the faint background source becomes more obscure. Even more noise causes many faint false-positive point-source detections as the algorithm tries to pick point sources out of the noise. The over-CLEANing example shows what happens when the algorithm is run for too many iterations.
2.4.3 Normalisation
The PSF and clean beam must be normalised in order to preserve the overall flux of the image. Say that at a certain iteration the brightest point has the value 1.5. If the loop factor is 0.1, then 0.15 must be subtracted from that point. If the PSF h has a maximum value of, say, 0.3 (which will be at the centre of the PSF), we subtract 0.5 times the normalised PSF from the residue (max(h) × 0.5 = 0.3 × 0.5 = 0.15). Since the normalised PSF has a total flux of 1, we have removed 0.5 flux units from the residue, and we must add 0.5 to the point map so that the total flux of both images remains the same.
More formally, let α = max(r) × loop / max(h). We subtract αh from the residue, centred at the maximum point, and add α to that same location on the point map. Since the PSF and the clean beam (a Gaussian) are both normalised to have a total flux of 1, these effects ultimately cancel each other out.
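The bookkeeping in the worked example above can be written out directly (plain Python; the numbers are the ones from the text):

```python
# Brightest residual pixel 1.5, loop factor 0.1, PSF peak value 0.3.
peak, loop, psf_max = 1.5, 0.1, 0.3

alpha = peak * loop / psf_max      # alpha = max(r) * loop / max(h)
print(round(alpha, 10))            # 0.5 flux units moved this iteration

# Subtracting alpha * h removes alpha * max(h) at the peak pixel itself...
print(round(alpha * psf_max, 10))  # 0.15, the required fraction of the peak

# ...and, since the normalised PSF has total flux 1, removes alpha flux units
# in total, which is exactly what gets added to the point map.
```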
2.4.4
Tests on a simulated point source showed clearly accurate results. On more complicated fields with a diffuse background source, the results were not as good but still acceptable: all complex sources were well resolved, and the diffuse background was resolved, albeit not as smoothly as the stronger sources. The simulated sky model tested incorporates a complex source, a diffuse source and a point source, and demonstrates many of the advantages and disadvantages of CLEAN.
2.4.4.1
With no noise, the diffuse source is clearly resolved, as are the weak point source and the large source. This is maintained as the noise is increased to a level of σ = 1, although the diffuse source starts to break up at its edges.
When the noise is increased to σ = 5, the diffuse source becomes very spotty and so obscured that one cannot confidently say where its boundaries lie. The weak point source could be said to be resolved, but the many false-positive point sources amongst the diffuse source make this impossible to discern.
With σ = 10, the weak source is completely obscured and many false-positive point sources are present. The diffuse source is now barely resolved, swamped in the noise. The strong, complex source is not as well resolved as at lower noise levels, but retains much of its original shape and flux. Figure 2.6 shows these results.
2.4.4.2 Corrugation Effect
Leaving the CLEAN algorithm to run even further produces divergent visual results. The removal of the main, strong sources causes a seemingly insignificant bump in the data some distance and direction away. After enough iterations this bump becomes significant enough that the algorithm detects it as a source, which creates another bump, and so on. This introduces a repeated corrugation effect in the final results. This is often referred to as over-CLEANing and demonstrates the importance of a robust exit criterion. See Figure 2.6, bottom right.
Chapter 3
The building blocks of Maximum Likelihood techniques will be outlined in this chapter. We wish to find a general optimisation scheme for some function J(x); in this way we have a structure around which to build different models under different noise assumptions.
We derive this general model with some important considerations already in place. One is that the solution x must be positive, reflecting the fact that intensities cannot be negative. We also assume that the function to be optimised is convex; this ensures that only one solution exists and that our model will converge to it.
3.1.1
A well-known method for finding maxima/minima under constraints is the use of a Lagrange function. In
particular we use the Karush-Kuhn-Tucker (KKT) First Order Optimality Conditions [12].
We use the following Lagrange function:
$$L(x, \lambda) = J(x) - \langle \lambda, g(x) \rangle \quad (3.1)$$
where x ∈ R^N is the vector representing the desired data, λ is the vector of Lagrange multipliers, J(x) is the function to be optimised and g(x) is the function expressing the constraints.
We use g(x) = x to ensure positivity of the solution. Let x* and λ* be the optimal values of x and λ respectively, and note that ∇g(x*) = 1. The KKT conditions become:
$$\text{Stationarity:} \qquad \nabla L(x^*, \lambda^*) = 0 \;\Rightarrow\; \lambda^*_i = [\nabla J(x^*)]_i \;\;\forall i \quad (3.2a)$$
$$\text{Positivity of } x\text{:} \qquad x^*_i \geq 0 \;\;\forall i \quad (3.2b)$$
$$\text{Positivity of } \lambda\text{:} \qquad \lambda^*_i \geq 0 \;\;\forall i \quad (3.2c)$$
$$\text{Complementary slackness:} \qquad \lambda^*_i x^*_i = 0 \;\;\forall i \quad (3.2d)$$
Condition (a) simply states that an optimal solution is a stationary point (a minimum or a maximum). Later we assume convexity of the function to ensure that this stationary point is a minimum. Conditions (b) and (c) are trivially satisfied by g(x) = x, so we attempt to solve for the remaining condition, (d).
Consider that −∇J(x^{(k)}) is (trivially) a descent direction of J at x^{(k)} (A.4). This allows us to express a gradient descent algorithm:
$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla J(x^{(k)}) \quad (3.3)$$
Later considerations will show that a multiplicative form of the algorithm is computationally more desirable than an additive form. However, the as-yet-unspecified ∇J might contain both additive and subtractive terms. To handle this, we define a vector of functions F(x) = [F₁(x) F₂(x) ... F_N(x)], where each Fᵢ is a real-valued positive function that depends on J(x). Whilst a seemingly arbitrary definition at this point, this term later allows us to create the multiplicative form of the algorithm.
We formulate the gradient descent algorithm again:
$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \operatorname{diag}(F(x^{(k)}))\operatorname{diag}(x^{(k)})\,\nabla J(x^{(k)}) \quad (3.4)$$
where α^{(k)} > 0 is a relaxation or dampening parameter. Since diag(F(x^{(k)}))diag(x^{(k)}) is a positive definite matrix, −diag(F(x^{(k)}))diag(x^{(k)})∇J(x^{(k)}) is still a descent direction.
3.1.2
We assume that J(x) is convex and has a finite (unconstrained) global minimum (i.e. the minimum is not achieved when tending towards +∞ or −∞). This minimum is realised where ∇J(x) = 0. We take two functions U(x) and V(x) that are positive for strictly positive arguments (i.e. U(x), V(x) > 0 for x > 0) and write
$$\nabla J(x^{(k)}) = V(x^{(k)}) - U(x^{(k)}) \quad (3.5)$$
We now define F(x^{(k)}) ≡ 1/V(x^{(k)}). We obtain
$$x^{(k+1)} = x^{(k)} + \alpha^{(k)} \operatorname{diag}\!\left(\frac{1}{V(x^{(k)})}\right)\operatorname{diag}(x^{(k)})\,[U(x^{(k)}) - V(x^{(k)})] = x^{(k)} + \alpha^{(k)} \operatorname{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)[U(x^{(k)}) - V(x^{(k)})] \quad (3.6)$$
If any element of V(x^{(k)}) equals 0, this equation has no solution. In practice an adjusted function is used, V′(x^{(k)}) = V(x^{(k)}) + ε·1, for ε a sufficiently small positive constant.
We need to ensure that x^{(k)} > 0 for all k. If we set x^{(0)} > 0, we require (B.3.8)
$$x^{(k)} + \alpha^{(k)} \operatorname{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)[U(x^{(k)}) - V(x^{(k)})] > 0 \quad (3.7)$$
If U(x^{(k)}) > V(x^{(k)}), then it is sufficient that α^{(k)} > 0. If not, then (B.3.8)
$$\alpha^{(k)} < \frac{1}{1 - U(x^{(k)})/V(x^{(k)})} \quad (3.8)$$
In this case 0 < U(x^{(k)})/V(x^{(k)}) < 1, and moreover ∇J(x^{(k)}) = V(x^{(k)}) − U(x^{(k)}) → 0 as k → ∞, so the ratio U(x^{(k)})/V(x^{(k)}) → 1. Hence
$$1 < \frac{1}{1 - U(x^{(k)})/V(x^{(k)})} < +\infty \quad (3.9)$$
and the constant choice α^{(k)} = 1 satisfies the condition. Whilst one can attempt more accurate estimations of α^{(k)}, setting it to 1 is sufficient, and also allows us to gain a multiplicative form of the problem: (B.3.10)
$$x^{(k+1)} = \operatorname{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right) U(x^{(k)}) \quad (3.10)$$
3.1.3
In general we cannot say that (3.10) will necessarily converge to an answer. We can say that, in general, x^{(k+1)} = T(x^{(k)}) will converge if T is a contraction, i.e. if there exists η < 1 such that for all (x₁, x₂), ||T(x₁) − T(x₂)|| ≤ η||x₁ − x₂||. For the two algorithms outlined below this does hold, and they will converge to an answer.
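As an illustration of scheme (3.10), here is a toy fixed-point iteration (NumPy assumed; the quadratic J, the matrix Q and the exact positive solution are hypothetical choices made so that the fixed point is known in advance):

```python
import numpy as np

# Take J(x) = 0.5 x^T Q x - b^T x with entrywise-positive Q and b, so that
# grad J(x) = Qx - b = V(x) - U(x) with V(x) = Qx and U(x) = b (both positive
# for positive x). The multiplicative update x <- x * U(x)/V(x) then has the
# solution of Qx = b as its fixed point.
N = 5
Q = np.eye(N) + 0.05 * np.ones((N, N))        # positive entries, well conditioned
x_true = np.array([1.0, 2.0, 0.5, 3.0, 1.5])  # a known positive solution
b = Q @ x_true

x = np.ones(N)               # any positive starting point
for _ in range(500):
    x = x * b / (Q @ x)      # eqn (3.10); the iterate stays positive

print(np.allclose(x, x_true))  # True: converged to the positive minimiser
```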
3.2
3.2.1
This chapter aims to maximise the probability of the data y over all possible values of x:
$$x_{\mathrm{ML}} \equiv \arg\max_x \Pr(y|x) = \arg\min_x -\log(\Pr(y|x)) \equiv \arg\min_x \Phi_{\mathrm{ML}}(x) \quad (3.11)$$
Φ_ML denotes the negative log-likelihood. Working with the negative log is a common strategy when maximising probability functions, as minimising or maximising a function involves finding its derivative, and log-derivatives are often easier to work with than derivatives of products of probabilities.
Since we are interested in positive values only, we constrain the maximum-likelihood problem so that we may use the Lagrangian method outlined in the previous section:
$$x_{\mathrm{CML}} \equiv \arg\min_x \Phi_{\mathrm{ML}}(x) \quad \text{subject to} \quad x_{\mathrm{CML}} \geq 0 \quad (3.12)$$
3.2.2
Poisson: RLA
Let us now try to maximise Pr(y|x) assuming that each pixel is modelled as a Poisson random variable. We start with the likelihood function,
$$P(k; \lambda) = \frac{\lambda^k}{k!} e^{-\lambda} \quad (3.13)$$
where k is the number of observed occurrences of the event and λ is the expected number of occurrences. For a vector of independent pixels,
$$P(k; \ell) = \prod_i \frac{\ell_i^{k_i}}{k_i!} e^{-\ell_i} \quad (3.14)$$
We again apply our problem to the equation, setting the expected value to Hx, which is our expected data given that y was observed:
$$\Pr(y|x) \equiv P(y; Hx) = \prod_i \frac{(Hx)_i^{y_i}}{y_i!} e^{-(Hx)_i} \quad (3.15)$$
Again, we take the negative log-likelihood to find the minimum, and constrain the problem to positive values:
$$J(x) \equiv \Phi_P(x) \equiv -\ln \Pr(y|x) = -\ln P(y; Hx) = -\ln \prod_i \frac{(Hx)_i^{y_i}}{y_i!} e^{-(Hx)_i} = -\sum_i \ln\!\left(\frac{(Hx)_i^{y_i}}{y_i!} e^{-(Hx)_i}\right)$$
$$= \sum_i \underbrace{\left[-y_i \ln((Hx)_i) + \ln(y_i!) + (Hx)_i\right]}_{D_i(x)} \quad \text{subject to} \quad x \geq 0 \quad (3.16)$$
Similarly, it can be shown that the second derivative of Φ_P is positive everywhere, and hence Φ_P is convex. Thus finding the root of the gradient will yield the minimum:
$$\nabla J(x) \equiv \nabla \Phi_P(x) = \nabla \sum_i D_i(x) = \sum_i \nabla D_i(x) \quad (3.17)$$
We know that ∇D_i(x) = (∂D_i/∂x_1, ∂D_i/∂x_2, ..., ∂D_i/∂x_j, ...) and that ∂(Hx)_i/∂x_j = h_{ij}, so
$$\frac{\partial D_i}{\partial x_j} = \frac{\partial}{\partial x_j}\left(-y_i \ln((Hx)_i) + \ln(y_i!) + (Hx)_i\right) = h_{ij}\left(1 - \frac{y_i}{(Hx)_i}\right) \quad (3.18)$$
Summing over i, and using the fact that the normalised PSF gives Hᵀ1 = 1,
$$\nabla J(x) = H^T\!\left(1 - \frac{y}{Hx}\right) = 1 - H^T\!\left(\frac{y}{Hx}\right) \quad (3.19)$$
Noting that 1 − Hᵀ(y/Hx) = ∇J(x) = V(x) − U(x), with V(x) = 1 and U(x) = Hᵀ(y/Hx), we recall the multiplicative iterative algorithm from section 3.1, eqn (3.10):
$$x^{(k+1)} = \operatorname{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right) U(x^{(k)}) = \operatorname{diag}(x^{(k)})\, H^T\!\left(\frac{y}{Hx^{(k)}}\right) \quad (3.20)$$
This equation is known as the Richardson-Lucy Algorithm (RLA) after its inventors [13].
3.2.3 Gaussian: ISRA
Pr(y|x) represents the probability that y is the observed data, given that the actual intensity is x, as represented by the model
$$y = Hx + n \quad (3.21)$$
Let us try to maximise Pr(y|x) assuming that each pixel is modelled as a random variable under additive Gaussian noise:
$$G(k; \mu, \sigma) \equiv \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(k-\mu)^2}{2\sigma^2}} \quad (3.22)$$
where k is the random variable, μ is the mean and σ² is the variance.
We wish to find a single term to represent all the pixels in our data. We use a vector k ∈ R^N instead of a single variable k and take the product of all G(kᵢ), assuming all elements are independent of one another:
$$\prod_i G(k_i; u_i, s_i) = \prod_i \frac{1}{s_i\sqrt{2\pi}} \exp\!\left(-\frac{(k_i - u_i)^2}{2 s_i^2}\right) \quad (3.23)$$
where u ∈ R^N is the corresponding vector of means and s ∈ R^N the corresponding vector of standard deviations. The standard deviations of y are denoted σ_y = (σ_{y_1}, ..., σ_{y_N}). We set the mean equal to Hx and obtain
$$\Pr(y|x) \equiv \prod_i G(y_i; (Hx)_i, \sigma_{y_i}) = \prod_i \frac{1}{\sigma_{y_i}\sqrt{2\pi}} \exp\!\left(-\frac{(y_i - (Hx)_i)^2}{2\sigma_{y_i}^2}\right) \quad (3.24)$$
Maximising this unconstrained equation will yield the least incorrect solution. To do this we take the negative log-likelihood and minimise:
$$J(x) \equiv \Phi_G(x) \equiv -\ln \Pr(y|x) = \sum_i \underbrace{\left[\ln(\sigma_{y_i}\sqrt{2\pi}) + \frac{(y_i - (Hx)_i)^2}{2\sigma_{y_i}^2}\right]}_{D_i(x)} \quad \text{subject to} \quad x \geq 0 \quad (3.26)$$
It can be shown that the second derivative of Φ_G is positive everywhere, and hence Φ_G is convex. Thus finding the root of the gradient will yield the minimum:
$$\nabla J(x) \equiv \nabla \Phi_G(x) = \sum_i \nabla D_i(x) \quad (3.27)$$
where ∇D_i(x) = (∂D_i/∂x_1, ∂D_i/∂x_2, ..., ∂D_i/∂x_j, ...) and
$$\frac{\partial D_i}{\partial x_j} = \frac{\partial}{\partial x_j}\left[\ln(\sigma_{y_i}\sqrt{2\pi}) + \frac{(y_i - (Hx)_i)^2}{2\sigma_{y_i}^2}\right] = \frac{h_{ij}}{\sigma_{y_i}^2}\left((Hx)_i - y_i\right) \quad (3.28)$$
Writing W ≡ diag(1/σ_{y_1}², ..., 1/σ_{y_N}²), this gives
$$\nabla J(x) = H^T W H x - H^T W y \quad (3.29)$$
(3.29)
However, since Gaussian noise may contain negative values, the positivity of y, and hence HT Wy, is not
ensured. We shift y by a sufficiently large number such that it contains only positive elements.
y0 = Hx + n + d
(3.31)
where d = min(|y|)1.
Since H is the circular convolution based on h, a normalised PSF, it is clear to see that a constant vector
convolved by it will just return that same vector, i.e. H1 = 1 and hence, Hd = d. So,
y0 = Hx + n + d
= Hx + n + Hd
(3.32)
= H(x + d) + n
= Hx0 + n
where x0 x + d. We are now trying to minimise
X
(y0 (Hx0 )i )2
1
+ i
J(x0 ) G (x0 ) =
ln
2y2i
2yi
i
subject to x d 0
(3.33)
It is simple to recalculate that J(x0 ) = V(x0 ) U(x0 ) = HT WHx0 HT Wy0 and reformulating the
Lagrangian formula with constraint G(x0 ) = x0 d yields
x0(k) d
U(x0(k) )
V(x0(k) )
x0(k) d
= d + diag
HT Wy0
HT WHx0(k)
x0(k+1) = d + diag
(3.34)
34
Another method to ensure that U(x) and V(x) are positive is to split HᵀWy into its positive and negative parts. Recall that HᵀWHx − HᵀWy = ∇J(x) = V(x) − U(x), and split HᵀWy = (HᵀWy)⁺ + (HᵀWy)⁻.
We now choose V(x) = HᵀWHx − (HᵀWy)⁻ and U(x) = (HᵀWy)⁺. We insert these terms into the Lagrangian formulation:
$$x^{(k+1)} = \operatorname{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right) U(x^{(k)}) = \operatorname{diag}\!\left(\frac{x^{(k)}}{H^T W H x^{(k)} - (H^T W y)^-}\right) (H^T W y)^+ \quad (3.35)$$
3.2.4
Many functions can be used to regularise the data. Regularisation allows a robust definition of the residue, and thus an exit criterion can be formulated. In the next chapter we shall go into detail about wavelet dictionaries and constraints using sparsity; by way of introduction to the concept, wavelet regularisation is applied here to the RLA and ISRA [20]. For a fuller understanding, one could read through the next chapter and then return to this section.
We start by defining the residual at each iteration k:
$$r^{(k)} = y - Hx^{(k)} \quad (3.36)$$
so that y = Hx^{(k)} + r^{(k)}. Substituting into the RLA update gives
$$x^{(k+1)} = \operatorname{diag}(x^{(k)})\, H^T\!\left(\frac{y}{Hx^{(k)}}\right) = \operatorname{diag}(x^{(k)})\, H^T\!\left(\frac{Hx^{(k)} + r^{(k)}}{Hx^{(k)}}\right) \quad (3.37)$$
We use a deconstruction function to transform the residual r^{(k)}, process this data (e.g. a frequency cutoff) and then use its corresponding reconstruction function to produce a denoised or filtered residue r̃^{(k)}. More formally, we take a dictionary of atoms, or shapes; for this report it is apt to imagine a dictionary of wavelets. Deconstructing the residue into coefficients (Aᵀr^{(k)}), we hard-threshold the coefficients, keeping only those whose magnitude exceeds a certain level (T_τ, for τ ∈ [3σ_{r^{(k)}}, 5σ_{r^{(k)}}]), before applying a reconstruction (S, where SAᵀ = I). This results in
$$\tilde{r}^{(k)} = S(T_\tau(A^T r^{(k)})) \quad (3.38)$$
and therefore the regularised RLA,
$$x^{(k+1)} = \operatorname{diag}(x^{(k)})\, H^T\!\left(\frac{Hx^{(k)} + \tilde{r}^{(k)}}{Hx^{(k)}}\right) \quad (3.39)$$
and, similarly, the regularised ISRA,
$$x^{(k+1)} = \operatorname{diag}\!\left(\frac{x^{(k)}}{H^T W H x^{(k)}}\right) H^T W (Hx^{(k)} + \tilde{r}^{(k)}) \quad (3.40)$$
A major problem for the unregularised RLA and ISRA is that there is no robust exit criterion (Fig. 3.1). One of the big advantages of the regularised algorithms is the inclusion of the filtered residue: using it, as was done with CLEAN (section 2.4.2.1), an exit criterion can be formulated based on how noise-like the residue has become:
$$\frac{|\sigma_{r^{(k+1)}} - \sigma_{r^{(k)}}|}{\sigma_{r^{(k)}}} < \epsilon \quad (3.41)$$
3.3 Algorithmic Implementation
3.3.1 Computational Operators for Convolution
For x ∈ M_{N,N}, x̄ ∈ R^{N²} the flattened x, h the PSF map, and H the PSF response matrix, we define:
CONV(h, x) ≡ 2D(Hx̄), the convolution operator, implemented with Fast Fourier Transforms: CONV(h, x) = invFFT(FFT(h) · FFT(x)).
CONV(trans(h), x) ≡ 2D(Hᵀx̄), where trans(h) is an operator that flips h vertically and horizontally around the centre. trans(h) can be efficiently calculated by taking the real part³ of FFT(FFT(h))/N, where N is the number of elements in h.
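A sketch of these operators (NumPy assumed; the array sizes are arbitrary). The second check verifies that CONV(trans(h), ·) really applies Hᵀ, via the adjoint identity ⟨CONV(h, x), y⟩ = ⟨x, CONV(trans(h), y)⟩:

```python
import numpy as np

def conv(h, x):
    """CONV(h, x): circular 2-D convolution implemented with FFTs."""
    return np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(x)))

def trans(h):
    """trans(h): h flipped vertically and horizontally about the origin,
    computed as the real part of FFT(FFT(h))/N."""
    return np.real(np.fft.fft2(np.fft.fft2(h))) / h.size

rng = np.random.default_rng(6)
h, x, y = (rng.random((16, 16)) for _ in range(3))

# trans(h) is h circularly reversed along both axes...
flipped = np.roll(h[::-1, ::-1], shift=(1, 1), axis=(0, 1))
print(np.allclose(trans(h), flipped))                      # True
# ...and convolving with it applies the transpose operator H^T.
print(np.allclose(np.sum(conv(h, x) * y),
                  np.sum(x * conv(trans(h), y))))          # True
```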
3.3.2 RLA
The Richardson-Lucy Algorithm (RLA) [19][13] is the iterative scheme for deconvolution under Poisson noise. Returning to the multiplicative scheme in eqn (3.20), we deduce an equation that maps more easily onto 2D maps and computer operations: (B.3.42)
$$x^{(k+1)} = x^{(k)} \odot \mathrm{CONV}\!\left(\mathrm{trans}(h), \frac{y}{\mathrm{CONV}(h, x^{(k)})}\right) \quad (3.42)$$
where ⊙ denotes the element-wise product, x^{(k)} is the kth estimate of the map, y is the convolved data and h is the PSF. The initial guess can be set to any positive map; x^{(0)} = 1 is sufficient.
Algorithmically, MATLAB-esque pseudo-code is given:
RLA(y,h,x0,n)
  ht = real(FFT(FFT(h))) / numel(h)
  xk = x0
  for i = 1:n
    Hx = invFFT( FFT(h) .* FFT(xk) )
    xk = xk .* invFFT( FFT(ht) .* FFT(y ./ Hx) )
  return xk
3 We
take the real part as there may be imaginary components after the Fourier Transform.
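For reference, the same scheme can be sketched in NumPy (under the same assumptions: a normalised PSF stored with its peak at the origin for circular convolution; the toy sky, PSF width and iteration count are arbitrary, and a small guard constant avoids division by near-zero values):

```python
import numpy as np

def rla(y, h, x0, n_iter, guard=1e-12):
    """Richardson-Lucy iterations of eqn (3.42) using FFT-based operators."""
    ht = np.real(np.fft.fft2(np.fft.fft2(h))) / h.size     # trans(h)
    Fh, Fht = np.fft.fft2(h), np.fft.fft2(ht)
    xk = x0.copy()
    for _ in range(n_iter):
        Hx = np.real(np.fft.ifft2(Fh * np.fft.fft2(xk)))   # CONV(h, xk)
        ratio = y / np.maximum(Hx, guard)
        xk = xk * np.real(np.fft.ifft2(Fht * np.fft.fft2(ratio)))
    return xk

# Toy check: a blurred point source is progressively re-concentrated.
N = 64
d = np.minimum(np.arange(N), N - np.arange(N))             # circular distance
h = np.exp(-0.5 * (d[:, None] ** 2 + d[None, :] ** 2) / 4.0)
h /= h.sum()                                               # origin-centred PSF
x_true = np.zeros((N, N)); x_true[20, 30] = 50.0
y = np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(x_true)))

x_rec = rla(y, h, np.ones((N, N)), 200)
print(np.unravel_index(np.argmax(x_rec), x_rec.shape) == (20, 30))  # True
```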
3.3.3 ISRA
The Image Space Reconstruction Algorithm, originally developed by Daube-Witherspoon and Muehllehner [5], is the iterative algorithm used to deconvolve data under Gaussian noise. As for the RLA above, we take eqn (3.35) and map it to computational operators: (B.3.43)
$$x^{(k+1)} = x^{(k)} \odot \frac{\left(\mathrm{CONV}(\mathrm{trans}(h), Wy)\right)^+}{\mathrm{CONV}(\mathrm{trans}(h), W\,\mathrm{CONV}(h, x^{(k)})) - \left(\mathrm{CONV}(\mathrm{trans}(h), Wy)\right)^-} \quad (3.43)$$
where W is applied as an element-wise weighting by the inverse variances.
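A NumPy sketch under the simplifying assumptions W = I and an entirely positive dirty map, in which case (HᵀWy)⁻ = 0 and the scheme reduces to x ← x ⊙ Hᵀy / (HᵀHx):

```python
import numpy as np

def isra(y, h, x0, n_iter, guard=1e-12):
    """ISRA iterations of eqn (3.43) with W = I and a positive dirty map."""
    ht = np.real(np.fft.fft2(np.fft.fft2(h))) / h.size     # trans(h)
    Fh, Fht = np.fft.fft2(h), np.fft.fft2(ht)
    conv = lambda F, a: np.real(np.fft.ifft2(F * np.fft.fft2(a)))
    Hty = conv(Fht, y)                  # U = H^T y, constant over iterations
    xk = x0.copy()
    for _ in range(n_iter):
        HtHx = conv(Fht, conv(Fh, xk))  # V = H^T H x at the current estimate
        xk = xk * Hty / np.maximum(HtHx, guard)
    return xk

# Toy check: noiseless blurred point source.
N = 64
d = np.minimum(np.arange(N), N - np.arange(N))
h = np.exp(-0.5 * (d[:, None] ** 2 + d[None, :] ** 2) / 4.0)
h /= h.sum()
x_true = np.zeros((N, N)); x_true[20, 30] = 50.0
y = np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(x_true)))

x_rec = isra(y, h, np.ones((N, N)), 200)
fit = lambda xm: np.linalg.norm(
    np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(xm))) - y)
print(fit(x_rec) < 0.5 * fit(np.ones((N, N))))  # True: data misfit reduced
```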
Figure 3.1: RLA after 100, 200, 400 and 600 iterations, together with the simulated sky model (bottom left). The graph (bottom right) shows the error from the original image. After around 200 iterations the error starts to increase; after 600 iterations the image has deviated significantly from any correct solution. There is no way of knowing when this error is minimised, as that requires knowledge of the original intensity, so an estimate of the ideal number of iterations is not robustly defined.
Chapter 4
We now maximise the posterior probability Pr(x|y) using Bayes' theorem:
$$x_{\mathrm{MAP}} \equiv \arg\max_x \Pr(x|y) = \arg\max_x \frac{\Pr(y|x)\Pr(x)}{\Pr(y)} = \arg\max_x \Pr(y|x)\Pr(x) \quad (4.2)$$
We take Pr(y) = 1, as it is an event that has already passed; thus its probability is certain.
As before, we consider minimising Φ_MAP, the negative log probability:
$$\Phi_{\mathrm{MAP}}(x) = -\log(\Pr(y|x)\Pr(x)) = -\log(\Pr(y|x)) - \log(\Pr(x)) \quad (4.3)$$
4.1 Sparse Representation
4.1.1
The gist of sparse representation is to use a predefined dictionary of shapes, or atoms, and a vector of coefficients to pick out a subset of those atoms. We hope to be able to synthesise x with the use of this dictionary.
So, for instance, we take some shape, be that a wavelet, a pillbox, a point, or whatever shape is felt to represent the data best. Let us define s as a K × K map containing only this shape, centred at (1, 1). We take its flattened version s̄ = [s₁, s₂, ..., s_N] for N = K².
In the same way that H was defined in section 2.1.3,
$$B = \begin{bmatrix} s_1 & s_N & \cdots & s_2 \\ s_2 & s_1 & \cdots & s_3 \\ s_3 & s_2 & \cdots & s_4 \\ \vdots & \vdots & \ddots & \vdots \\ s_N & s_{N-1} & \cdots & s_1 \end{bmatrix} \quad (4.4)$$
This is simply our shape flattened and shifted to every point on the map. Each column is called an atom
and represents a flattened map of some description.
The atoms need not be simply shifted versions of some base atom. For instance, a dictionary may contain atoms that change shape depending on their position or shift; in this case we say the shape has shift variance. In general, shift invariance is preferred but is not always possible, owing to the way certain shape functions are computed, notably wavelets.
It is also possible to create a dictionary that contains two or more shapes. Say we have a number of dictionaries B₁, B₂, ..., B_M; we can define a dictionary by simply concatenating their columns/atoms. We denote this by D = [B₁ B₂ ... B_M]. For this reason a dictionary is defined, in general, as an N × L matrix, for L the total number of atoms.
We now take some vector α ∈ R^L and assume we can model x by some linear combination of atoms in some dictionary D. Mathematically, we assume x = Dα. Furthermore, we wish for α to be sparse, which means that most of the elements of α are zero: only a few atoms will model the image.
The following mathematical expansion is given to illustrate this sparse representation model. Dᵢ is simply the vector equal to the ith atom (or column) of D. Thus,
$$x = D\alpha = \alpha_1 D_1 + \alpha_2 D_2 + \cdots + \alpha_L D_L \quad (4.5)$$
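Eqn (4.5) in code (NumPy assumed; the dictionary here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N, L = 64, 128                   # overcomplete: more atoms than samples
D = rng.standard_normal((N, L))  # a toy dictionary; columns are the atoms

alpha = np.zeros(L)
alpha[[5, 40, 99]] = [2.0, -1.0, 0.5]   # sparse: only three active atoms

x = D @ alpha                    # synthesis: a linear combination of atoms
print(np.allclose(x, 2.0 * D[:, 5] - 1.0 * D[:, 40] + 0.5 * D[:, 99]))  # True
print(np.count_nonzero(alpha))   # 3
```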
4.1.2 Definitions of Sparsity
4.1.2.1 Intuitively
Sparsity, in its simplest terms, means a vector or signal with few non-zero elements. Sparsity can also be seen as a measure of how compressible a set of data is, or in other words, how easily the data can be described by a few elements. If a signal were composed of 4 sine waves, for example, we would only need to know the definition of the sine wave and those 4 components. If the signal were random noise, however, representing it as a composition of sine waves would take a huge number of coefficients to approximate it sufficiently.
This, of course, depends heavily on the dictionary upon which the coefficients are based. For certain signals a dictionary of wavelets might represent them very well, while for another signal sine waves might compose a more precise representation.
4.1.2.2 Descriptively
To solidify this notion, we say that a signal is strictly sparse if most of its coefficients are zero given some dictionary. For natural signals, however, a less prohibitive definition is more useful. A weakly sparse signal is one whereby a few coefficients have large magnitudes and the rest are comparatively close to zero. Such weakly sparse signals are also referred to as compressible.
Natural signals usually have specific properties, and with the use of a representative dictionary they can be sparsified, albeit only in the weak, not the strict, sense.
Effectively sparsifying data depends heavily on the dictionary used. A small and simple dictionary will struggle to represent the data in a few coefficients; a large dictionary with more complex entries will usually give a more compact representation.
Compressibility is similar to being able to express ideas in language: given a small number of simple words, an idea might take many of those words to be expressed fully, if it can be expressed at all. With a large number of complex and detailed words, an idea could be expressed in just a few of them. Furthermore, certain dictionaries simply might not have the right words for what is to be expressed. Alternatively, a combination of complex and simple words might express an idea in even fewer words than complex words alone would. [15]
4.1.2.3 Formally
We now define ways to measure these notions with mathematical rigour. A simple measure is the number of non-zero elements. This is called the l₀ norm, defined as ||x||₀ = #{i : xᵢ ≠ 0} for x a real-valued vector.
The l₀ norm is not continuous, which has unavoidable mathematical implications for convexity in the following sections. In such cases we can use the l_p norm (p ∈ R⁺), defined as
$$||x||_p = \Big(\sum_i |x_i|^p\Big)^{1/p}$$
The l₁ norm is simply the sum of the absolute values of the elements, and the l₂ norm is the Euclidean norm. In the following section an l_p norm is used in the prior that describes sparsity; recall that convexity of the term being minimised is an important assumption made when formulating the problem (3.1.2), and among the l_p norms it is p ≥ 1 (in particular p = 1, the closest convex case to l₀) that preserves convexity.
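These norms are one-liners (NumPy assumed; note that the l₀ "norm" is a count, not a true norm):

```python
import numpy as np

def lp_norm(x, p):
    """||x||_p = (sum_i |x_i|^p)^(1/p) for p > 0; p = 0 counts non-zeros."""
    if p == 0:
        return np.count_nonzero(x)
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

x = np.array([3.0, 0.0, -4.0, 0.0])
print(lp_norm(x, 0))   # 2: two non-zero elements
print(lp_norm(x, 1))   # 7.0: sum of absolute values
print(lp_norm(x, 2))   # 5.0: the Euclidean norm
```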
4.2 MAP Synthesis
We can now formulate the sparse representation of our data more rigorously. We start with our model and assume that x ∈ R^N can be modelled by a dictionary D of size N × L and a sparse vector of coefficients α ∈ R^L.
Since y = Hx + n and we assumed that x = Dα, we define our MAP-synthesis model:
$$y = HD\alpha + n \quad (4.6)$$
We recall our attempt to minimise the negative log of Pr(x|y) (3.2.3 & 3.2.2). From this we can see that we need only formulate Φ_ML and define Φ_prior:
$$x_{\mathrm{MAP}} = \arg\min_x \Phi_{\mathrm{MAP}}(x) = \arg\min_x \Phi_{\mathrm{ML}}(x) + \Phi_{\mathrm{prior}}(x) \quad (4.7)$$
where x_MAP is the desired solution of this MAP-synthesis model.
4.2.1
Let us consider the problem for additive Gaussian noise². As before (3.2.3), we take the likelihood of the Gaussian; this time, however, we use the multi-dimensional definition of the Gaussian distribution [1]:
$$G(x; u, \Lambda) = \frac{|\Lambda|^{1/2}}{(2\pi)^{N/2}} \exp\!\left(-\frac{1}{2}(x - u)^T \Lambda (x - u)\right) \quad (4.8)$$
where u ∈ R^N is the vector of means and Λ is the inverse covariance matrix. We model y as a multi-dimensional Gaussian with means equal to HDα and standard deviations σ ∈ R^N. Since we assume all variables in y to be independent of one another, Λ = diag(1/σ²). Thus,
$$\Phi_G(x) = -\log(G(x; u, \Lambda)) = -\log\!\left[\frac{|\Lambda|^{1/2}}{(2\pi)^{N/2}} \exp\!\left(-\frac{1}{2}(x-u)^T \Lambda (x-u)\right)\right] = -\log\frac{|\Lambda|^{1/2}}{(2\pi)^{N/2}} + \frac{1}{2}(x-u)^T \Lambda (x-u) \quad (4.9)$$
and hence,
$$\arg\min_x \Phi_{\mathrm{MAP}} = \arg\min_x \left(\Phi_G(x) + \Phi_{\mathrm{prior}}\right) = \arg\min_x \frac{1}{2}(x-u)^T \Lambda (x-u) + \Phi_{\mathrm{prior}} \quad (4.10)$$
We ignore the first term of (4.9) since it does not depend on x.
For our model we use G(y; HDα, R), where R is the inverse covariance matrix of y:
$$\arg\min_\alpha \Phi_{\mathrm{MAP}} = \arg\min_\alpha \frac{1}{2}(HD\alpha - y)^T R\,(HD\alpha - y) + \Phi_{\mathrm{prior}} \quad (4.11)$$
² The Poisson case may also be formulated; however, it is not included in this report.
4.2.2
42
This leaves only prior to be defined. We assumed that was sparse, so we need a prior that will minimise
on this constraint, as mentioned in the introduction to this chapter. The lp norm is used in the definition of
the prior:
prior = ||||pp
(4.12)
(4.13)
where c is a constant and is again the hyperparameter. Defining Pr(x) like this forces prior = ||||pp + a
for some constant a.
This definition allows for any , but we would like our model to consider only if is an exact representation.
So we define a set x = { R|D = x}, that contains all possible values that reconstruct x. This set
may be empty, a singleton, or it may contain multiple solutions.
Multiple solutions arise from an overly large number of atoms, i.e. that L > N
This, however, does not account for dictionaries for which there is no γ that represents x. We would like the probability of such a case to be 0. We now attempt again to define:

\[ \Pr(x) = \Pr(D\gamma) = \begin{cases} c\, e^{-\lambda \|\gamma\|_p^p} & \text{if } \Lambda_x \neq \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{4.14} \]
Thus ℑ_prior = λ‖γ‖_p^p. However, this does not guarantee a unique optimal solution, so we add a minimising argument in place of γ:

\[ \Pr(x) = \Pr(D\gamma) = \begin{cases} c\, e^{-\lambda \|\gamma^*(x)\|_p^p} & \text{if } \Lambda_x \neq \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{4.15} \]

where \(\gamma^*(x) = \arg\min_{\gamma \in \Lambda_x} \|\gamma\|_p^p\).
4.2.3

We have finished. The final MAP-synthesis model for a signal under additive Gaussian noise is:

\[
\begin{aligned}
x_{MAP} &= \arg\min_x \Im_{MAP}(x) = \arg\min_x \big( \Im_{ML}(x) + \Im_{prior}(x) \big) \\
&= \arg\min_x \left[ \frac{1}{2}(HD\gamma - y)^T R (HD\gamma - y) + \lambda \|\gamma\|_p^p \right]
\end{aligned} \tag{4.16}
\]
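For concreteness, the objective in (4.16) can be evaluated directly with NumPy. The following is a minimal sketch on made-up toy matrices; the names (`map_objective`, `lam`, `p`) are illustrative and not from the report.

```python
import numpy as np

def map_objective(gamma, H, D, y, R, lam, p=1):
    """Evaluate (4.16): 0.5*(HD*gamma - y)^T R (HD*gamma - y) + lam*||gamma||_p^p."""
    resid = H @ (D @ gamma) - y
    return 0.5 * resid @ R @ resid + lam * np.sum(np.abs(gamma) ** p)

# Toy 2-pixel problem: identity PSF and dictionary, unit-variance noise.
H = np.eye(2)
D = np.eye(2)
y = np.array([1.0, 0.0])
R = np.eye(2)            # inverse covariance of y

val = map_objective(np.zeros(2), H, D, y, R, lam=1.0)   # data term only: 0.5*||y||^2
```

With γ = 0 the prior term vanishes and only the data misfit remains; with γ reproducing y exactly, only the sparsity penalty λ‖γ‖₁ remains.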
4.3
4.3.1
The Matching Pursuit Algorithm was pioneered by [16] and computes an approximation to the MAP-synthesis solution. The algorithm is greedy: it attempts to approximate γ by iteratively maximising the probability function, modifying one element of γ at a time. We start with the Maximum Likelihood function of a signal under additive Gaussian noise³ from (4.16):
\[ y = HD\gamma + n \]
\[ \Im_{ML}(x) = \frac{1}{2}(HD\gamma - y)^T R (HD\gamma - y) \tag{4.17} \]

\[
\begin{aligned}
\gamma^* &= \arg\min_\gamma \frac{1}{2} \left\| R^{\frac{1}{2}} H D \gamma - R^{\frac{1}{2}} y \right\|^2 \\
&= \arg\min_\gamma \frac{1}{2} \left\| C\gamma - z \right\|^2
\end{aligned} \tag{4.18}
\]
where we define z = R^{1/2} y and C = R^{1/2} HD. Here z is the whitened data, with covariance matrix equal to I, the identity matrix, and C is a dictionary of atoms that are convolved with the PSF and then whitened: a PSF- and noise-adjusted dictionary, if you will. We are now able to say

\[ z = C\gamma + \epsilon \quad \text{where } \epsilon = R^{\frac{1}{2}} n \tag{4.19} \]
We define a starting residual r^{(0)} = z and an initial γ^{(0)} = 0, and take an iterative greedy approach. We assume that at each step the residual is made up of a single atom and that the rest is simply noise. This is not actually the case, but this approach allows us to ignore all other signals and to find a good estimate for a single coefficient in γ. Once this coefficient is found, the algorithm simply repeats until a stopping criterion is met.

More formally, we assume

\[ r^{(k)} = \gamma_i C_i + \epsilon \quad \text{for some } i \in [1, \ldots, L] \tag{4.20} \]

\[
\begin{aligned}
\hat{\gamma}_i &= \arg\min_{\gamma_i} \frac{1}{2} \left( \gamma_i C_i - r^{(k)} \right)^T I \left( \gamma_i C_i - r^{(k)} \right) \\
&= \arg\min_{\gamma_i} \frac{1}{2} \left\| \gamma_i C_i - r^{(k)} \right\|^2
\end{aligned} \tag{4.21}
\]

³ The effect of the prior is implicit in the algorithm, as will be explained later in this section.
⁴ Where σ_{y_i}² is the variance of y_i.
Setting the derivative to zero (B.4.22):

\[ \hat{\gamma}_i C_i - r^{(k)} = 0 \implies \hat{\gamma}_i = \frac{\langle r^{(k)}, C_i \rangle}{\|C_i\|^2} \tag{4.22} \]
To find the best index, we minimise over the index rather than over γ_i (B.4.23):

\[
\begin{aligned}
m &= \arg\min_i \frac{1}{2} \left\| \hat{\gamma}_i C_i - r^{(k)} \right\|^2 \\
&= \arg\max_i \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} = \arg\max_i \frac{|\langle r^{(k)}, C_i \rangle|}{\|C_i\|}
\end{aligned} \tag{4.23}
\]
We simply update the residue by removing the mth atom and updating the coefficient vector:

\[ \gamma^{(k+1)} = \gamma^{(k)} + \delta \quad \text{where } \delta_i = \begin{cases} \frac{\langle r^{(k)}, C_i \rangle}{\|C_i\|^2} & \text{if } i = m \\ 0 & \text{otherwise} \end{cases} \tag{4.24} \]

\[ r^{(k+1)} = r^{(k)} - \delta_m C_m = r^{(k)} - \frac{\langle r^{(k)}, C_m \rangle}{\|C_m\|^2}\, C_m \tag{4.25} \]
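The updates (4.23)-(4.25) assemble into a few lines of NumPy. The sketch below works on an explicit matrix C, so it ignores the FFT/FWT machinery of section 4.3.2; the function name and tolerance are illustrative, not from the report.

```python
import numpy as np

def matching_pursuit(C, z, n_iters=100, t=1e-6):
    """Greedy matching pursuit on z ~ C @ gamma, following (4.23)-(4.26)."""
    norms = np.linalg.norm(C, axis=0)      # ||C_i||, precomputed once
    gamma = np.zeros(C.shape[1])
    r = z.astype(float).copy()             # r^(0) = z
    for _ in range(n_iters):
        corr = C.T @ r                     # <r^(k), C_i> for every atom
        sig = np.abs(corr) / norms         # significance values
        m = int(np.argmax(sig))            # most significant atom (4.23)
        if sig[m] < t:                     # exit criterion (4.26)
            break
        delta = corr[m] / norms[m] ** 2    # coefficient update (4.24)
        gamma[m] += delta
        r -= delta * C[:, m]               # residual update (4.25)
    return gamma, r

rng = np.random.default_rng(0)
C = rng.standard_normal((8, 16))           # toy whitened dictionary
z = 2.0 * C[:, 3]                          # signal made of a single atom
gamma, r = matching_pursuit(C, z)
# z = C @ gamma + r holds exactly at every step; here atom 3 is found in one pass.
```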
Exit Criteria

Stopping criteria are usually based on some characteristic of the residual: either when ‖r‖² or max(|r|) falls below a user-determined threshold. Similarly, one can use the same stopping criterion as was used in CLEAN and Regularised ISRA (2.4.2.1):

\[ \frac{\left| \sigma_r^{(k+1)} - \sigma_r^{(k)} \right|}{\sigma_r^{(k)}} < \epsilon \]

where σ_r is the standard deviation of the residue r, and ε is a predefined constant, usually a small number in the range of 10⁻⁵ to 10⁻⁷.
A more robust exit criterion is found when the best coefficient is selected for removal from the residue. Each coefficient has a (normalised) significance value |⟨r^{(k)}, C_i⟩| / ‖C_i‖. The atom with the highest significance value is the one that is removed from the residue. When the maximum significance falls below a preset threshold (usually in [1, 4]), the best atom can no longer be said to have a significant effect on the result, and hence we stop the algorithm. More formally, the algorithm exits when

\[ \frac{\left| \langle r^{(k)}, C_m \rangle \right|}{\|C_m\|} < t \tag{4.26} \]

where m is the index of the most significant atom (as defined in eqn 4.23) and t ∈ [1, 4].
4.3.2 Computational Considerations

In eqn 4.23, the best index of the coefficient vector needs to be found. This involves calculating |⟨r^{(k)}, C_i⟩| / ‖C_i‖ for all i. This is typically split into two stages: 1) calculation of |⟨r^{(k)}, C_i⟩| for all i, which must be done at every iteration, and 2) calculation of ‖C_i‖ for all i, which can be precalculated.
[Figure 4.1 panels: wavelet atoms "Wavelet sym6 @ 31", "Wavelet sym6 @ 33" and "Wavelet sym6 @ 40", scaled by their coefficients and summed together with further atoms to form the "Dirty Image".]
Figure 4.1: Given an image x, we take its wavelet decomposition, WAVDEC(x) = W^T x = γ ∈ R^M. As shown above, γ is a vector of M coefficients. Scaling the wavelets by the corresponding coefficients and composing them together results in the original image. Reconstruction is this composition; however, a Fast Wavelet Transform is used instead of simple addition.
4.3.2.1
Before we continue, it would be useful to define some computational operators. It is common to use FFT and
FWT operations to perform matrix multiplications and various dictionary-based operations. Some slightly
informal definitions are given:
\[
\begin{bmatrix} \langle r^{(k)}, C_1 \rangle \\ \langle r^{(k)}, C_2 \rangle \\ \vdots \end{bmatrix}
= \begin{bmatrix} \langle r^{(k)}, (R^{\frac{1}{2}} H D)_1 \rangle \\ \langle r^{(k)}, (R^{\frac{1}{2}} H D)_2 \rangle \\ \vdots \end{bmatrix}
= D^T H^T \tilde{r}^{(k)} \tag{4.27}
\]

We split the equation into its separate dictionaries, denoting the whitened residue by \(\tilde{r}^{(k)} = R^{\frac{1}{2}} r^{(k)}\), i.e. \(\tilde{r}^{(k)}_i = r^{(k)}_i / \sigma_i\):

\[
D^T H^T \tilde{r}^{(k)} = \begin{bmatrix} I^T \\ W^T \end{bmatrix} H^T \tilde{r}^{(k)}
= \begin{bmatrix} H^T \tilde{r}^{(k)} \\ W^T H^T \tilde{r}^{(k)} \end{bmatrix} \tag{4.28}
\]
To simplify the following explanation, we write \(\bar{r}^{(k)} = 2D(\tilde{r}^{(k)})\) for the 2-D form of the whitened residue, and let h denote the PSF map. To calculate \(H^T \tilde{r}^{(k)}\), we simply take⁵

\[ 2D(H^T \tilde{r}^{(k)}) = \text{CONV}(\text{trans}(h), \bar{r}^{(k)}) \tag{4.29} \]

and for the wavelet part,

\[ W^T H^T \tilde{r}^{(k)} = \text{WAVDEC}\big( 2D(H^T \tilde{r}^{(k)}) \big) \tag{4.30} \]

⁵ We take the real part, as there might be imaginary components after the Fourier Transforms.
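The CONV and trans(h) operations above can be sketched with NumPy's FFT (circular convolution assumed, as elsewhere in this chapter). The adjoint identity ⟨CONV(h, x), y⟩ = ⟨x, CONV(trans(h), y)⟩ is what lets H^T be applied as a convolution with the flipped PSF; the array sizes below are arbitrary toy values.

```python
import numpy as np

def conv2_circ(h, x):
    """Circular 2-D convolution via the FFT; the real part is taken since FFT
    round-off can leave tiny imaginary components."""
    return np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(x)))

def trans(h):
    """trans(h): the kernel of the adjoint H^T, i.e. h indexed as h[(-n) mod N]."""
    return np.roll(np.flip(h), shift=(1, 1), axis=(0, 1))

rng = np.random.default_rng(1)
h = rng.standard_normal((4, 4))               # toy PSF
x = rng.standard_normal((4, 4))
y = rng.standard_normal((4, 4))

lhs = np.sum(conv2_circ(h, x) * y)            # <H x, y>
rhs = np.sum(x * conv2_circ(trans(h), y))     # <x, H^T y>
# lhs and rhs agree: convolving with trans(h) applies H^T.
```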
The atom norms split in the same way:

\[
\begin{bmatrix} \|C_1\| \\ \|C_2\| \\ \vdots \end{bmatrix}
= \begin{bmatrix} \|(R^{\frac{1}{2}} H I)_1\| \\ \|(R^{\frac{1}{2}} H I)_2\| \\ \vdots \\ \|(R^{\frac{1}{2}} H W)_1\| \\ \|(R^{\frac{1}{2}} H W)_2\| \\ \vdots \end{bmatrix}
= \begin{bmatrix} \|R^{\frac{1}{2}} H I_1\| \\ \|R^{\frac{1}{2}} H I_2\| \\ \vdots \\ \|R^{\frac{1}{2}} H W I_1\| \\ \|R^{\frac{1}{2}} H W I_2\| \\ \vdots \end{bmatrix}
= \frac{1}{\sigma} \begin{bmatrix} \|H I_1\| \\ \|H I_2\| \\ \vdots \\ \|H W I_1\| \\ \|H W I_2\| \\ \vdots \end{bmatrix} \tag{4.31}
\]

where the last step assumes a uniform noise level σ.
I_i is simply a vector in which the ith element is 1, and 0 everywhere else. For each i, one computes 2D(H I_i) = CONV(h, Ī_i), and we then use this to calculate the norm:

\[ \forall i \quad \|H I_i\| = \sqrt{\langle H I_i, H I_i \rangle} = \sqrt{(H I_i)^T (H I_i)} \]

Since the PSF is shift-invariant, ‖H I_i‖ = ‖H I_j‖ for all i, j, so it need only be calculated once.
With the wavelet dictionary, take each i and compute 2D(H W I_i) = CONV(h, 2D(W I_i)) = CONV(h, WAVREC(I_i)). Using this we can calculate the norm:

\[ \forall i \quad \|H W I_i\| = \sqrt{\langle H W I_i, H W I_i \rangle} = \sqrt{(H W I_i)^T (H W I_i)} \]

Since wavelet transforms are generally shift-variant (see figure 4.2), this needs to be calculated once for each i. For all discrete wavelet transforms, subsets of this operation are shift-invariant, and one norm calculation will evaluate to the same value for all elements in that subset.
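The identity-dictionary shortcut is easy to check numerically: convolving with a shifted delta circularly shifts the PSF, which leaves the norm unchanged. A small sketch under that assumption (the 3×3 PSF values are arbitrary):

```python
import numpy as np

def conv2_circ(h, x):
    # Circular 2-D convolution via the FFT, real part taken.
    return np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(x)))

def delta(i, j, shape):
    d = np.zeros(shape)
    d[i, j] = 1.0
    return d

h = np.array([[0.0, 0.25, 0.0],
              [0.25, 0.0, 0.25],
              [0.0, 0.25, 0.0]])      # toy PSF

n00 = np.linalg.norm(conv2_circ(h, delta(0, 0, h.shape)))
n12 = np.linalg.norm(conv2_circ(h, delta(1, 2, h.shape)))
# n00 == n12 == ||h||: one norm computation covers every identity atom.
```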
4.3.3

This algorithm does not explicitly define the prior term that represents sparsity. It is manifest in the iterative process, whereby the initial iterations have high coefficient values which then drop off in later iterations. Considering that most solutions converge within 5000 iterations (for 512×512 data) and that the most significant coefficient values are calculated in the first 400-500 iterations, we can easily see that the coefficient vector will be sparse. At most there are 5000 non-zero coefficients in a vector sized at least 512² = 262144, which means that at most 5000/262144 = 1.91% of the coefficients are non-zero. The coefficient vector can at least be said to be weakly sparse, considering that at most 500/262144 = 0.19% of the coefficients will be significant. See figure 4.3.
[Figure 4.2 panels: "Centered PSF", "Offcenter PSF", and the same wavelet shown at two different locations.]
Figure 4.2: Top Left: Centered PSF. Top Right: Off-center PSF. Bottom Left: Wavelet at one location. Bottom Right: The same wavelet at another location. The PSF is shift-invariant, hence the function at one location will simply be a shifted version of any other PSF. Since the wavelet is shift-variant, it is not guaranteed that the wavelet will be a shifted version of any other wavelet, as demonstrated above.
[Figure 4.3 panels: log-scale plots of Residue STD, Significance, Max Threshold = 4, Threshold and γ* against iteration number, for three test cases.]
Figure 4.3: The green and yellow lines represent the significance of each iteration's effect. Once the significance reaches a threshold, the algorithm is no longer making significant changes to the solution and is stopped. These are just 3 cases, but all others follow a similar pattern in that the first 500 iterations contain the bulk of the significant changes (note that the y-axis is logarithmic).
4.4 Experimentation

4.4.1 Set up
Tests were run on various simulated sky maps to measure the ability to deconvolve diffuse, point-like and other complex sources. Varying levels of Gaussian noise were added to each sky map to gain insight into the robustness of the algorithm. All tests used the same PSF.
Besides comparing this algorithm with others, there are a number of variables within the algorithm that can
be tested; namely: loop-gain, noise and dictionary used.
As in the CLEAN algorithm, the loop-gain variable controls how much of the atom is removed from the residual and added to the reconstructed map at each iteration. It can also be seen as a damping factor. The lower the loop-gain, the longer it takes to reach a solution; however, a lower loop-gain would be expected to give slightly better, more reliable results.
The choice of dictionary or dictionaries is also very important. In these tests, three cases were used: the identity dictionary, which makes the MP synthesis algorithm almost equivalent to CLEAN; a wavelet dictionary (the Symlet wavelet, or sym6 in MATLAB); and a combination of both identity and sym6.
The three main metrics used were Time to Complete, Number of Iterations to Complete and the Error Term,
which is calculated against the ideal data using an element-wise squared difference.
4.4.2
Analysis of Dictionaries
4.4.2.1 Identity Dictionary
As expected, this dictionary performed much like CLEAN. Point-like sources were easily detected, as were complex sources. Diffuse sources were resolved; however, they appear very spotty, as the algorithm was picking points rather than areas, especially under heavy noise (see figures 4.4, 4.5 and 4.6).
[Table 4.1: for each of the Error Term, Time to Complete, Number of Iterations and Time per Iteration, the Average, Std Dev and pairwise Student t-tests ttest(I, Sym6), ttest(Sym6, I+Sym6) and ttest(I, I+Sym6) over the I, Sym6 and I+Sym6 dictionaries, both including and ignoring the maxed-iteration cases.]
Table 4.1: This table outlines various metrics and the differences they generated based on the dictionary used. A Student t-test was used to determine whether each difference is significant (marked in pink for < 0.05 and gray for < 0.15). Some cases did not run to completion but instead reached a maximum iteration limit; for completeness, statistics are given both for the original sets of data and for the sets excluding the tests that hit the iteration limit.
Timing Results: Whilst the identity dictionary alone took slightly more iterations to complete than the Symlet+Identity and Symlet dictionaries (see table 4.1), the algorithm ran approximately 2 to 8 times faster per iteration. The only difference between these two groups is the use of the FWT; since all dictionaries must use the Fast Fourier Transform for convolution, the FWT is the obvious cause of the slowdown. Despite the faster run time per iteration, the Symlet and Symlet+Identity dictionaries completed in close to the same amount of time as the Identity dictionary.
4.4.2.2 Symlet Dictionary
Compared to the Identity dictionary, the Symlet dictionary handles diffuse sources far better, owing to its ability to use large low-resolution atoms. Point sources are reconstructed just as well (figure 4.6). Overall, the number of iterations needed to reach the exit criteria was lower than for the Identity dictionary, but the extra computation time means that it takes about the same time to execute.

Even though diffuse sources are resolved more smoothly, by the error term metric the Identity dictionary actually performs better (table 4.1).
                      Ignoring Maxed Iteration Cases              Including Maxed Iteration Cases
                      0.1           0.5           1.0             0.1           0.5           1.0
Error Term
  Average             2,495,514.47  2,661,422.65  2,861,777.60    2,621,670.44  2,896,499.78  3,559,207.14
  Std Dev             162,742.93    148,527.35    258,854.07      242,533.54    407,687.98    1,027,607.59
  ttest(0.1, 0.5)     0.085                                       0.060
  ttest(0.5, 1.0)     0.107                                       0.056
  ttest(0.1, 1.0)     0.007                                       0.006
Time to Complete
  Average             1,634.40      849.29        667.43          3,485.56      2,934.50      2,843.55
  Std Dev             835.83        517.75        341.56          2,400.69      2,814.23      2,901.33
  ttest(0.1, 0.5)     0.081                                       0.611
  ttest(0.5, 1.0)     0.455                                       0.939
  ttest(0.1, 1.0)     0.009                                       0.392
Number of Iterations
  Average             3,001.17      2,365.57      1,776.86        6,501.08      5,547.00      5,203.58
  Std Dev             741.67        2,251.35      1,545.81        3,689.58      4,268.83      4,386.04
  ttest(0.1, 0.5)     0.503                                       0.564
  ttest(0.5, 1.0)     0.580                                       0.848
  ttest(0.1, 1.0)     0.049                                       0.301
Time per Iteration
  Average             528.79        485.39        478.36          531.20        527.03        524.44
  Std Dev             201.73        187.45        192.26          188.61        182.04        186.06
  ttest(0.1, 0.5)     0.698                                       0.956
  ttest(0.5, 1.0)     0.946                                       0.973
  ttest(0.1, 1.0)     0.312                                       0.679
Table 4.2: A similarly generated Student t-test table. See table 4.1 description.
4.4.2.3 Identity + Symlet Dictionary

Fortunately, the combination of the two dictionaries showed a best-of-both-worlds situation. It resolves the diffuse sources as well as the Symlet-only dictionary but maintains an error not significantly different from that of the Identity dictionary.

Performance-wise, it does not compute significantly slower than the Symlet-only dictionary, either per iteration or in terms of total time (table 4.1), although it is still not as fast as the Identity-only dictionary.
4.4.3 Analysis of Loop-Gains

As can be seen in table 4.2, the loop gain has a significant effect on the error term: a lower loop-gain leads to a more accurate reconstruction. One might say, based on the statistics, that the major disadvantage of a lower loop-gain is that it takes twice as long, on average, to complete. However, most of the cases that reached the maximum number of iterations, especially those that used the Symlet dictionary, did so owing to the high level of noise as much as to the low loop gain. When considering all the cases, the statistical significance of the time differences disappears, since all cases take a similar number of iterations to complete.
[Figure 4.4 panels: reconstructions for the I, Sym6 and I+Sym6 dictionaries (columns) at loop gains 0.1, 0.5 and 1.0 (rows), with error terms:
0.1: 2456360.645348 (I), 2918732.981053 (Sym6), 2579154.788337 (I+Sym6)
0.5: 2773598.929160 (I), 3038604.876926 (Sym6), 3033169.501075 (I+Sym6)
1.0: 2683631.647627 (I), 3681863.337349 (Sym6), 3578489.048304 (I+Sym6)]
Figure 4.4: Loop gain (y) vs Dictionary (x) comparison, using a noise level of σ = 5. Error terms are given underneath the images.
[Figure 4.5 panels: reconstructions for the I, Sym6 and I+Sym6 dictionaries (columns) at loop gains 0.1, 0.5 and 1.0 (rows), with error terms:
0.1: 2686786.222410 (I), 3163747.295927 (Sym6), 2682176.504975 (I+Sym6)
0.5: 3856083.611988 (I), 3139778.559479 (Sym6), 5457156.341997 (I+Sym6)
1.0: 3409940.094938 (I), 4704211.175027 (Sym6), 5256322.598075 (I+Sym6)]
Figure 4.5: Loop gain (y) vs Dictionary (x) comparison, using a noise level of σ = 10. Error terms are given underneath the images.
[Figure 4.6 panels: reconstructions for the I, Sym6 and I+Sym6 dictionaries (columns) at several noise levels (rows). Error terms printed beneath the panels include 2334693.955064, 2701127.109364, 2431422.648873, 2695706.082107, 2361818.557768, 2448318.460755, 2456360.645348, 2918732.981053, 2579154.788337, 2686786.222410, 3163747.295927 and 2682176.504975.]
Figure 4.6: Noise (y) vs Dictionary (x) comparison, using a loop gain of 0.1. Error terms are given underneath the images.
[Figure 4.7 panels: reconstructions at loop gains 0.1, 0.5 and 1.0 (columns) for several noise levels (rows), using the Identity+Symlet dictionary, with error terms printed beneath the panels.]
Figure 4.7: Noise (y) vs Loop Gain (x) comparison, using the Identity+Symlet dictionary. Error terms are given underneath the images.
Chapter 5
Conclusion
This report aimed to follow my progress as I learned the basics of signal processing and deconvolution. Starting with simple direct-inverse solutions, we discovered how they are ill-conditioned; frequency cutoff and frequency filtering solutions gave more promising results.
The CLEAN algorithm was looked at in some depth, both for its historical importance and for its continuing role as a widely used deconvolution algorithm.
Maximum Likelihood techniques were explored with focus on ISRA and RLA. Using a prior (MAP) in the solution allowed us to use our knowledge of the compressibility of our data by wavelets in order to improve our deconvolution techniques. The Matching Pursuit algorithm is able to solve this problem greedily with the use of a wavelet dictionary, a Dirac-Delta dictionary, and a combination of the two, with varying degrees of success depending on noise level and loop-gain.

Matching Pursuit is a precursor to even more advanced techniques. Arwa Dabbech [3] explored multiresolution and iterative analysis-by-synthesis approaches to deconvolve various types of sources.
Although not reflected in this report, these algorithms were run on a number of different sources, with similar results. While differing noise levels and object intensities often require tweaking of parameters, the algorithms explored here were very robust in reconstructing convolved data.
Bibliography
[1] Do Chuong, The multivariate gaussian distribution, 2008, http://cs229.stanford.edu/section/
cs229-gaussians.pdf (Accessed October 25, 2011).
[2] J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation 19 (1965), no. 90, 297–301.
[3] A. Dabbech, Reconstruction of radio-interferometer images using sparse representations, Master's thesis, Tunisia Polytechnic School, March-June 2011.
[4] A. Damato, File:jpeg2000 2-level wavelet transform-lichtenstein.png, 17 May 2007, http://www.interaction-design.org/references/ (Accessed October 12, 2011).
[5] M.E. Daube-Witherspoon and G. Muehllehner, An iterative image space reconstruction algorithm suitable for volume ECT, Medical Imaging, IEEE Transactions on 5 (1986), no. 2, 61–66.
[6] I. Daubechies, Orthonormal bases of compactly supported wavelets, Communications on Pure and Applied Mathematics 41 (1988), no. 7, 909–996.
[7] M. Elad, P. Milanfar, and R. Rubinstein, Analysis versus synthesis in signal priors, Inverse Problems 23 (2007), 947–968.
[8] I.J. Good, The interaction algorithm and practical Fourier analysis, Journal of the Royal Statistical Society. Series B (Methodological) 20 (1958), no. 2, 361–372.
[9] A. Graps, An introduction to wavelets, Computational Science & Engineering, IEEE 2 (1995), no. 2, 50–61.
[10] Alfred Haar, Zur Theorie der orthogonalen Funktionensysteme, Mathematische Annalen 69 (1910), 331–371, 10.1007/BF01456326.
[11] J.A. Högbom, Aperture synthesis with a non-regular distribution of interferometer baselines, Astronomy and Astrophysics Supplement Series 15 (1974), 417.
[12] H. Lanteri, M. Roche, O. Cuevas, and C. Aime, A general method to devise maximum-likelihood signal restoration multiplicative algorithms with non-negativity constraints, Signal Processing 81 (2001), no. 5, 945–974.
[13] L.B. Lucy, An iterative technique for the rectification of observed distributions, The Astronomical Journal 79 (1974), 745.
[14] S.G. Mallat, A theory for multiresolution signal decomposition: The wavelet representation, Pattern Analysis and Machine Intelligence, IEEE Transactions on 11 (1989), no. 7, 674–693.
[16] D. Mary, S. Bourguignon, C. Theys, and H. Lanteri, Sparse priors in unions of representation spaces for radio-interferometric image reconstruction, 2011.
[17] E. Pantin, J.L. Starck, and F. Murtagh, Deconvolution and blind deconvolution in astronomy, Blind Image Deconvolution: Theory and Applications (2007), 1–38.
[18] L. Rabiner, R. Schafer, and C. Rader, The chirp z-transform algorithm, IEEE Transactions on Audio and Electroacoustics 17 (1969), no. 2, 86–92.
[19] W.H. Richardson, Bayesian-based iterative method of image restoration, JOSA 62 (1972), no. 1, 55–59.
[20] J.L. Starck, F. Murtagh, and A. Bijaoui, Multiresolution support applied to image filtering and restoration, Graphical Models and Image Processing 57 (1995), no. 5, 420–431.
[21] E. Thiebaut, Introduction to image reconstruction and inverse problems, Optics in Astrophysics (2005), 397–422.
[22] B.P. Wakker and U.J. Schwarz, The multi-resolution CLEAN and its application to the short-spacing problem in interferometry, Astronomy and Astrophysics 200 (1988), 312–322.
[23] John Wallace, Fourier-transform spectroscopy: Interferometer simplifies FTIR spectrometer, 2011, http://www.optoiq.com/index/photonics-technologies-applications/lfw-display/lfw-article-display/332945/articles/laser-focus-world/volume-44/issue-7/world-news/fourier-transform-spectroscopy-interferometer-simplifies-ftir-spectrometer.html (Accessed October 12, 2011).
Appendix A
Mathematical Notations
R is the set of real numbers
R⁺ is the set of non-negative real numbers
N is the set of natural numbers
N⁺ is the set of positive natural numbers
Rᵖ is the set of real vectors of length p
M_{N,L} is the set of real N × L matrices

∇F(x) is the gradient of F at x
arg min_x F(x) is the value of x that minimises F(x)
‖x‖ is the Euclidean norm of x
‖x‖_p is the ℓ_p norm of x
⟨x, y⟩ is the inner product of x and y
x ∘ y is the element-wise product of x and y
x ⊘ y is the element-wise division of x by y
xᵖ is the element-wise pth power of x
x > y denotes element-wise comparison

x_i is the ith element of x
H_i is the ith column of H
H_{i,j} is the element of H in the ith column and jth row
H^T is the transpose of H
diag(x) is the diagonal matrix with x as its diagonal elements
A.1.1

Element-wise vector multiplication is expressed via the use of diagonal matrices. The element-wise multiplication of x and y can be expressed as diag(x)y. This can be extended to more than two vectors; that is, for x, y, z, we would use diag(x)diag(y)z.
A.2
Covariance Matrix
Let each element of y ∈ R^N be modeled by a Gaussian distribution with respective means µ ∈ R^N and variances σ² ∈ R^N.

We define the covariance matrix of y, an N × N matrix, by cov(y)_{i,j} equal to the covariance between y_i and y_j. If all the elements in y are independent of one another, then cov(y) = diag(σ²).

We may also use the term inverse covariance matrix to define cov⁻¹(y)_{i,j} equal to the inverse covariance between y_i and y_j. If all variables are independent, cov⁻¹(y) = diag(1/σ²).
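The diagonal special case can be verified directly: for independent components, inverting the diagonal entries of cov(y) agrees with the full matrix inverse. A toy NumPy check (the σ values are arbitrary):

```python
import numpy as np

sigma = np.array([1.0, 2.0, 0.5])        # standard deviations of independent y_i
cov = np.diag(sigma ** 2)                # cov(y) = diag(sigma^2)
cov_inv = np.diag(1.0 / sigma ** 2)      # claimed inverse covariance
# cov_inv matches the general matrix inverse of cov.
```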
A.3
Hyperparameter
A hyperparameter is a parameter of a prior term, not a parameter within it; i.e. it is not a parameter of the underlying model. It is used to tune or weight the effect of the prior versus the effect of the maximum likelihood. As there is no clear mathematical distinction between a hyperparameter and a parameter of the model in this particular example, this is a conceptual difference, but an important one nonetheless.
A.4
Descent Direction
Given some real-valued function F(x) for x ∈ R^N (i.e. F : R^N → R) and a point x ∈ R^N, a vector p ∈ R^N is called a descent direction if ⟨p, ∇F(x)⟩ < 0. In a more intuitive sense, consider that if ⟨p, ∇F(x)⟩ = 0 then p is orthogonal to the gradient of F. If the inner product is negative, however, this implies that p points in a direction that goes down the gradient. Taking an infinitely small step in this descent direction will decrease the value of F.

In slightly more rigorous terms, starting at x^{(0)}, we find a p^{(0)} that is a descent direction. We create x^{(1)} = x^{(0)} + α p^{(0)} for a sufficiently small step size α > 0, and we know that F(x^{(1)}) < F(x^{(0)}). Repeating this step many times over will trace a path towards a minimum of F.

This is an intuitive explanation of a result known as Taylor's Theorem. In reality, we can take α ≤ 1, but the exact value used will be contextual.

It is clear that for any x^{(k)}, −∇F(x^{(k)}) is a descent direction, since ⟨−∇F(x^{(k)}), ∇F(x^{(k)})⟩ = −‖∇F(x^{(k)})‖² < 0.
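The claim is easy to check numerically for a simple quadratic F(x) = ‖Ax − b‖², whose gradient is 2Aᵀ(Ax − b). The matrices below are arbitrary toy values:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([1.0, -1.0])

F = lambda x: float(np.sum((A @ x - b) ** 2))
grad_F = lambda x: 2.0 * A.T @ (A @ x - b)

x = np.array([1.0, 1.0])
p = -grad_F(x)                    # negative gradient: always a descent direction
# <p, grad F(x)> = -||grad F(x)||^2 < 0, and a small step along p decreases F.
```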
A.5

The constrained problem is handled with a Lagrangian of the form

\[ L(x, \lambda) = J(x) + \lambda^T g(x) \tag{A.1} \]

where λ is the Lagrange multiplier, J(x) is the function to be maximised and g(x) is the function expressing the constraints. x is our desired data and hence the argument that will maximise J(x).

With x* and λ* denoting the optimal values of x and λ respectively, the first-order optimality conditions are as follows [12, pg 947]:

Stationarity
\[ \nabla L(x^*, \lambda^*) = 0 \iff \lambda^*_i [\nabla g(x^*)]_i = -[\nabla J(x^*)]_i \iff \lambda^*_i = -\frac{[\nabla J(x^*)]_i}{[\nabla g(x^*)]_i}, \ \forall i \tag{A.2a} \]

Positivity of x
\[ g(x^*) \geq 0 \iff x^*_i \geq 0, \ \forall i \tag{A.2b} \]

Positivity of λ
\[ \lambda^* \geq 0 \iff \lambda^*_i \geq 0, \ \forall i \tag{A.2c} \]

Complementary Slackness
\[ \lambda^* \cdot g(x^*) = 0 \iff -\frac{[\nabla J(x^*)]_i}{[\nabla g(x^*)]_i}\, [g(x^*)]_i = 0, \ \forall i \tag{A.2d} \]
Appendix B
Proofs
B.2
B.2.6
Consider a k × k map a defined by

\[ a_{i,j} = \begin{cases} 2 & \text{if } i = 2 \text{ and } j = 1 \\ 0 & \text{otherwise} \end{cases} \tag{B.1} \]

This is two times the discrete Dirac-Delta function centered at (2, 1). The flattened version is thus defined as a = [0, 2, 0, ..., 0]^T ∈ R^N, and we can now see

\[
Ha =
\begin{bmatrix}
h_1 & h_N & h_{N-1} & \cdots & h_2 \\
h_2 & h_1 & h_N & \cdots & h_3 \\
h_3 & h_2 & h_1 & \cdots & h_4 \\
\vdots & & & \ddots & \vdots \\
h_N & h_{N-1} & h_{N-2} & \cdots & h_1
\end{bmatrix}
\begin{bmatrix} 0 \\ 2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
=
\begin{bmatrix} 2h_N \\ 2h_1 \\ 2h_2 \\ \vdots \\ 2h_{N-1} \end{bmatrix} \tag{B.2}
\]

which is simply twice the flattened PSF at position (2, 1). This is what we would get if we convolved the PSF with two times the Dirac-Delta function centered at (2, 1).
This result can be generalised to any k × k map and its flattened version x = [x_1, ..., x_N]^T for N = k². We use the notation H_i to represent the ith column of H; H_i is by definition the flattened PSF centered at the ith position.

\[
\begin{aligned}
Hx &= \begin{bmatrix} H_1 & H_2 & \cdots & H_N \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \\
&= \begin{bmatrix}
x_1 h_1 + x_2 h_N + \cdots + x_N h_2 \\
x_1 h_2 + x_2 h_1 + \cdots + x_N h_3 \\
\vdots \\
x_1 h_N + x_2 h_{N-1} + \cdots + x_N h_1
\end{bmatrix} \\
&= x_1 H_1 + x_2 H_2 + \cdots + x_N H_N
\end{aligned} \tag{B.3}
\]

What this shows us is that Hx is the PSF applied to every point of x, scaled by the intensity at that point, which is the definition of a convolution.
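A quick 1-D numerical version of this proof: build the circulant matrix H whose ith column is the flattened PSF rolled to position i, and check that H @ x matches circular convolution computed via the FFT. The toy h and x below are arbitrary.

```python
import numpy as np

h = np.array([4.0, 1.0, 0.0, 1.0])                      # toy flattened PSF
N = len(h)
H = np.column_stack([np.roll(h, i) for i in range(N)])  # column i = PSF shifted to i

x = np.array([0.0, 2.0, 0.0, 0.0])                      # 2x a Dirac delta at index 1

direct = H @ x                                          # = 2 * (h rolled by one), as in (B.2)
via_fft = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))
# direct and via_fft agree: applying H is circular convolution with h.
```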
B.3

B.3.8

\[
\begin{aligned}
x^{(k)} + \alpha^{(k)} \frac{x^{(k)}}{V(x^{(k)})}\left[ U(x^{(k)}) - V(x^{(k)}) \right] &> 0 \\
\alpha^{(k)} \frac{x^{(k)}}{V(x^{(k)})}\left[ U(x^{(k)}) - V(x^{(k)}) \right] &> -x^{(k)} \\
\alpha^{(k)} \left[ \frac{U(x^{(k)})}{V(x^{(k)})} - 1 \right] &> -1 \\
\alpha^{(k)} \left[ 1 - \frac{U(x^{(k)})}{V(x^{(k)})} \right] &< 1 \\
\alpha^{(k)} &< \frac{1}{1 - \frac{U(x^{(k)})}{V(x^{(k)})}}
\end{aligned} \tag{B.4}
\]
B.3.10

B.3.19

\[
\begin{aligned}
x^{(k+1)} &= x^{(k)} + \operatorname{diag}\!\left( \frac{x^{(k)}}{V(x^{(k)})} \right)\left[ U(x^{(k)}) - V(x^{(k)}) \right] \\
&= x^{(k)} + \operatorname{diag}\!\left( \frac{x^{(k)}}{V(x^{(k)})} \right) U(x^{(k)}) - \operatorname{diag}\!\left( \frac{x^{(k)}}{V(x^{(k)})} \right) V(x^{(k)}) \\
&= x^{(k)} + \operatorname{diag}\!\left( \frac{x^{(k)}}{V(x^{(k)})} \right) U(x^{(k)}) - x^{(k)} \\
&= \operatorname{diag}\!\left( \frac{x^{(k)}}{V(x^{(k)})} \right) U(x^{(k)})
\end{aligned} \tag{B.5}
\]
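The multiplicative form in (B.5) is exactly one ISRA step. A minimal sketch under the assumption U = HᵀWy and V = HᵀWHx, with toy positive data so that U and V stay positive:

```python
import numpy as np

def isra_step(x, H, W, y):
    """One ISRA update x <- x * U(x)/V(x), with U = H^T W y and V = H^T W H x (B.5)."""
    U = H.T @ W @ y
    V = H.T @ W @ (H @ x)
    return x * U / V

H = np.array([[1.0, 0.5],
              [0.5, 1.0]])           # toy blur matrix
W = np.eye(2)                        # weights (inverse variances)
x_true = np.array([1.0, 2.0])
y = H @ x_true                       # noiseless data

x_next = isra_step(x_true, H, W, y)  # the true image is a fixed point
```

Because the update is a ratio of positive quantities, a positive starting image stays positive, which is the non-negativity property ISRA is built around.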
\[
\nabla D_i(x) = \begin{bmatrix} h_{i1} \\ h_{i2} \\ \vdots \end{bmatrix} \frac{(Hx)_i - y_i}{\sigma_{y_i}^2},
\qquad \text{therefore } \nabla J(x) = \sum_i \nabla D_i(x)
\]

Writing out the sum column by column,

\[
\nabla J(x) =
\begin{bmatrix} h_{11} \\ h_{12} \\ \vdots \end{bmatrix} \frac{(Hx)_1 - y_1}{\sigma_{y_1}^2}
+ \begin{bmatrix} h_{21} \\ h_{22} \\ \vdots \end{bmatrix} \frac{(Hx)_2 - y_2}{\sigma_{y_2}^2}
+ \cdots \tag{B.6}
\]
B.3.28

\[
\begin{aligned}
\frac{\partial}{\partial x_j} D_i &= \frac{\partial}{\partial x_j} \left[ \frac{1}{2\sigma_{y_i}^2} \left( y_i - (Hx)_i \right)^2 \right] \\
&= \frac{1}{2\sigma_{y_i}^2} \frac{\partial}{\partial x_j} \left[ y_i^2 - 2y_i (Hx)_i + (Hx)_i^2 \right] \\
&= \frac{1}{2\sigma_{y_i}^2} \frac{\partial}{\partial x_j} \left[ -2y_i (Hx)_i + (Hx)_i^2 \right]
\end{aligned} \tag{B.7}
\]

Consider briefly \(\frac{\partial}{\partial x_j}(Hx)_i\). Since

\[
Hx = \begin{bmatrix} h_{11} & h_{12} & \cdots \\ h_{21} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \end{bmatrix}
= \begin{bmatrix} h_{11}x_1 + h_{12}x_2 + \ldots \\ h_{21}x_1 + h_{22}x_2 + \ldots \\ \vdots \end{bmatrix} \tag{B.8}
\]

we have \(\frac{\partial}{\partial x_j}(Hx)_i = \frac{\partial}{\partial x_j}\left( h_{i1}x_1 + h_{i2}x_2 + \ldots \right) = h_{ij}\). We return to

\[
\begin{aligned}
\frac{\partial}{\partial x_j} D_i &= \frac{1}{2\sigma_{y_i}^2} \left( -2y_i h_{ij} + 2h_{ij}(Hx)_i \right) \\
&= \frac{h_{ij}}{\sigma_{y_i}^2} \left( (Hx)_i - y_i \right)
\end{aligned} \tag{B.9}
\]
B.3.29

Consider that

\[
\nabla J(x) =
\begin{bmatrix}
\frac{h_{11}\left( (Hx)_1 - y_1 \right)}{\sigma_{y_1}^2} + \frac{h_{21}\left( (Hx)_2 - y_2 \right)}{\sigma_{y_2}^2} + \cdots \\
\vdots
\end{bmatrix} \tag{B.10}
\]

\[
= \begin{bmatrix} h_{11} & h_{21} & \cdots \\ h_{12} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_{y_1}^2} & 0 & \cdots \\ 0 & \frac{1}{\sigma_{y_2}^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}
\begin{bmatrix} (Hx)_1 - y_1 \\ (Hx)_2 - y_2 \\ \vdots \end{bmatrix}
= H^T W (Hx - y) = H^T W H x - H^T W y \tag{B.11}
\]

where \(W = \operatorname{diag}(1/\sigma_y^2)\).
B.3.42

\[
\begin{aligned}
\bar{x}^{(k+1)} &= 2D\!\left( x^{(k+1)} \right)
= 2D\!\left( \operatorname{diag}\!\left( x^{(k)} \right) H^T \frac{y}{Hx^{(k)}} \right) \\
&= 2D\!\left( x^{(k)} \right) \circ 2D\!\left( H^T \frac{y}{Hx^{(k)}} \right) \\
&= \bar{x}^{(k)} \circ \text{CONV}\!\left( \text{trans}(h),\ 2D\!\left( \frac{y}{Hx^{(k)}} \right) \right) \\
&= \bar{x}^{(k)} \circ \text{CONV}\!\left( \text{trans}(h),\ \frac{2D(y)}{2D(Hx^{(k)})} \right) \\
&= \bar{x}^{(k)} \circ \text{CONV}\!\left( \text{trans}(h),\ \frac{\bar{y}}{\text{CONV}(h, \bar{x}^{(k)})} \right)
\end{aligned} \tag{B.12}
\]
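This is the Richardson-Lucy update written with FFT convolutions. A minimal sketch: with a PSF normalised to sum to 1, the true image is a fixed point of (B.12) on noiseless data (the arrays and names below are illustrative toy values):

```python
import numpy as np

def conv2_circ(h, x):
    # Circular 2-D convolution via the FFT, real part taken.
    return np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(x)))

def trans(h):
    # Flip in both axes, then roll by one: kernel of the adjoint H^T.
    return np.roll(np.flip(h), shift=(1, 1), axis=(0, 1))

def rla_step(x, h, y):
    """One update of (B.12): x <- x * CONV(trans(h), y / CONV(h, x))."""
    return x * conv2_circ(trans(h), y / conv2_circ(h, x))

h = np.array([[0.5, 0.25],
              [0.25, 0.0]])          # toy PSF, normalised to sum to 1
x_true = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
y = conv2_circ(h, x_true)            # noiseless dirty image
x_next = rla_step(x_true, h, y)      # x_true is a fixed point of the update
```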
B.3.43
B.4
B.4.5
\[
\begin{aligned}
x = D\gamma &= \begin{bmatrix} D_1 & D_2 & \cdots & D_L \end{bmatrix}
\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_L \end{bmatrix}
= \begin{bmatrix} D_{1,1} & D_{1,2} & \cdots & D_{1,L} \\ D_{2,1} & D_{2,2} & \cdots & D_{2,L} \\ \vdots & \vdots & & \vdots \end{bmatrix}
\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_L \end{bmatrix} \\
&= \begin{bmatrix} \gamma_1 D_{1,1} \\ \gamma_1 D_{2,1} \\ \vdots \\ \gamma_1 D_{N,1} \end{bmatrix}
+ \begin{bmatrix} \gamma_2 D_{1,2} \\ \gamma_2 D_{2,2} \\ \vdots \\ \gamma_2 D_{N,2} \end{bmatrix}
+ \cdots
+ \begin{bmatrix} \gamma_L D_{1,L} \\ \gamma_L D_{2,L} \\ \vdots \\ \gamma_L D_{N,L} \end{bmatrix} \\
&= \gamma_1 D_1 + \gamma_2 D_2 + \cdots + \gamma_L D_L
\end{aligned} \tag{B.14}
\]
B.4.18

Minimising over γ in (4.18):

\[
\begin{aligned}
\arg\min_\gamma \Im_{ML}(x) &= \arg\min_\gamma \Im_{ML}(D\gamma) \\
&= \arg\min_\gamma \frac{1}{2} (HD\gamma - y)^T R (HD\gamma - y) \\
&= \arg\min_\gamma \frac{1}{2} (HD\gamma - y)^T R^{\frac{1}{2}} R^{\frac{1}{2}} (HD\gamma - y) \\
&= \arg\min_\gamma \frac{1}{2} (HD\gamma - y)^T \left( R^{\frac{1}{2}} \right)^T R^{\frac{1}{2}} (HD\gamma - y) \\
&= \arg\min_\gamma \frac{1}{2} \left( R^{\frac{1}{2}}(HD\gamma - y) \right)^T \left( R^{\frac{1}{2}}(HD\gamma - y) \right) \\
&= \arg\min_\gamma \frac{1}{2} \left\| R^{\frac{1}{2}} HD\gamma - R^{\frac{1}{2}} y \right\|^2 \\
&= \arg\min_\gamma \frac{1}{2} \left\| C\gamma - z \right\|^2
\end{aligned} \tag{B.15}
\]
B.4.22

\[
\begin{aligned}
\gamma_i C_i - r^{(k)} &= 0 \\
\gamma_i C_i &= r^{(k)} \\
\gamma_i (C_i)^T C_i &= (r^{(k)})^T C_i \\
\gamma_i \langle C_i, C_i \rangle &= \langle r^{(k)}, C_i \rangle \\
\gamma_i &= \frac{\langle r^{(k)}, C_i \rangle}{\langle C_i, C_i \rangle} = \frac{\langle r^{(k)}, C_i \rangle}{\|C_i\|^2}
\end{aligned} \tag{B.16}
\]
B.4.23

\[
\begin{aligned}
m &= \arg\min_i \frac{1}{2} \left( \gamma_i^2 C_i^T C_i - \gamma_i C_i^T r^{(k)} - \gamma_i (r^{(k)})^T C_i + (r^{(k)})^T r^{(k)} \right) \\
&= \arg\min_i \frac{1}{2} \left( \gamma_i^2 \|C_i\|^2 - 2\gamma_i \langle r^{(k)}, C_i \rangle + \|r^{(k)}\|^2 \right) \\
&= \arg\min_i \frac{1}{2} \left( \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^4}\|C_i\|^2 - 2\frac{\langle r^{(k)}, C_i \rangle}{\|C_i\|^2}\langle r^{(k)}, C_i \rangle + \|r^{(k)}\|^2 \right) \\
&= \arg\min_i \frac{1}{2} \left( \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} - 2\frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} + \|r^{(k)}\|^2 \right) \\
&= \arg\min_i \frac{1}{2} \left( \|r^{(k)}\|^2 - \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} \right) \\
&= \arg\max_i \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} = \arg\max_i \frac{|\langle r^{(k)}, C_i \rangle|}{\|C_i\|}
\end{aligned} \tag{B.17}
\]
B.4.27

\[
\begin{aligned}
\begin{bmatrix} \langle r^{(k)}, C_1 \rangle \\ \langle r^{(k)}, C_2 \rangle \\ \vdots \end{bmatrix}
&= \begin{bmatrix} \langle r^{(k)}, (R^{\frac{1}{2}}HD)_1 \rangle \\ \langle r^{(k)}, (R^{\frac{1}{2}}HD)_2 \rangle \\ \vdots \end{bmatrix}
= \begin{bmatrix} \langle r^{(k)}, R^{\frac{1}{2}}HD_1 \rangle \\ \langle r^{(k)}, R^{\frac{1}{2}}HD_2 \rangle \\ \vdots \end{bmatrix}
= \begin{bmatrix} (R^{\frac{1}{2}}HD_1)^T r^{(k)} \\ (R^{\frac{1}{2}}HD_2)^T r^{(k)} \\ \vdots \end{bmatrix} \\
&= \begin{bmatrix} D_1^T H^T (R^{\frac{1}{2}})^T r^{(k)} \\ D_2^T H^T (R^{\frac{1}{2}})^T r^{(k)} \\ \vdots \end{bmatrix}
= \begin{bmatrix} D_1^T H^T R^{\frac{1}{2}} r^{(k)} \\ D_2^T H^T R^{\frac{1}{2}} r^{(k)} \\ \vdots \end{bmatrix}
= \begin{bmatrix} D_1^T H^T \tilde{r}^{(k)} \\ D_2^T H^T \tilde{r}^{(k)} \\ \vdots \end{bmatrix}
\end{aligned} \tag{B.18}
\]

where \(\tilde{r}^{(k)} = R^{\frac{1}{2}} r^{(k)}\) is the whitened residue.