
A Rigorous Introduction to Deconvolution Techniques

Richard J. Baxter¹
MSc in Computer Science (2nd Year)
Computer Science Department
University of Cape Town

Hosting Supervisor: David Mary²
Hosting Supervisor: Chiara Ferrari³

Hosted by:
Laboratoire Fizeau
Université Nice Sophia Antipolis
and
Observatoire de la Côte d'Azur

November 1, 2011

¹ richard.jonathan.baxter@gmail.com
² david.mary@unice.fr
³ chiara.ferrari@oca.eu

Acknowledgements
Many people and organisations came together to make this report and the trip that created it possible. I would like to thank everyone at ADION for accepting me for the Henri Poincaré Junior Fellowship, which formed the primary source of income and academic support.
I would also like to extend my thanks to the Square Kilometer Array Project (SKA) and the National Research Fund (NRF) in South Africa for their contributions towards travel expenses and equipment, namely my laptop, without which I would not have been able to do any work at all!
This brings me to the wonderful people who hosted and helped me along at l'Université Nice Sophia Antipolis and l'Observatoire de la Côte d'Azur, especially Dr. David Mary and Dr. Chiara Ferrari, who both supported me and made sure that I was on my feet at all times, which enabled me to get a substantial amount of work done. Dr. Ferrari made sure I was housed, knew my way around and gave me valuable scientific insight that I, as a non-astronomer, would not otherwise have gained. Dr. Mary's constant supervision (especially in the first month) made sure I got up to speed very fast. I could not have learned what I did without either supervisor's assistance.
I would also like to thank my friend, Arwa Dabbech, who guided me along with the many important details and pedantic questions I asked. Again, without her help I would not have learned as much as I did in the time I had. The teasing out of concepts and problems was invaluable and does not go unappreciated.
I would also like to give mention to my supervisors at the University of Cape Town, Dr. Patrick Marais, Dr. Michelle Kuttel, Dr. Kurt van der Hyden and Dr. Ian Stewart, for their constant guidance.
Also to my parents, who supported me in my long trip away from home, especially my mother, who helped me with travel arrangements. Special thanks to my girlfriend, Elizabeth Braae, who in addition to her support also spent many hours reading over my work and expertly correcting my grammar, spelling and mathematics errors.
Lastly, to all the friends I made in Nice who made my time there more than just an academic endeavour. I can honestly say that I had a great time with you all.
I learnt much about many things in many areas of life and many areas of academia. I am truly grateful.

Abstract: Why I am in Nice

My time at the Université Nice Sophia Antipolis was primarily intended for me to learn as much about signal processing for deconvolution as possible in the time I was there.
The general direction of this report follows in parallel the direction of understanding I have undertaken: as I learned additional techniques, additional chapters were added. Under the guidance of Dr. David Mary, I hope that I have produced a cogent story of understanding in line with my own progression. Each chapter remains largely self-contained, and revisions of chapters did, of course, take place as I learned more.
The particular form of deconvolution focused on here appears in the field of Radio Interferometry, although it also appears in many other fields such as medical imaging and computer vision.
This Introduction chapter outlines, very basically, some key concepts in Radio Astronomy and signal processing. Naïve deconvolution and the CLEAN algorithm are explored in Chapter 2. In Chapter 3, Maximum Likelihood techniques are defined and explained in a mathematical and programmatic way. The penultimate chapter forms the bulk of this report, explaining in detail a Maximum a Posteriori synthesis approach for deconvolution and the greedy Matching Pursuit algorithm that computes it. The final chapter contains some closing remarks.
There are many sections and proofs that I wrote to aid my own understanding of the mathematics, which are included in the appendix.
I hope that my understanding of these deconvolution topics will be reflected back unto the reader with the clarity and completeness with which it was reflected unto me.

Contents

1 Introduction and Background
  1.1 Convolution
    1.1.1 Discrete Convolution
  1.2 Fourier Transform
    1.2.1 Introduction and Usage
    1.2.2 Discrete Fourier Transforms
    1.2.3 Fast Fourier Transform
    1.2.4 Convolution Theorem for FFT
  1.3 Wavelet Transform
    1.3.1 Deficiencies of the Fourier Transforms
    1.3.2 Decomposition and Reconstruction
    1.3.3 Usage
    1.3.4 Fast Wavelet Transform
  1.4 Radio Interferometry
    1.4.1 Observed Data and PSFs: Both Clean and Dirty

2 Naïve Deconvolution and CLEAN
  2.1 Formulation of the Problem
    2.1.1 Building a Model
    2.1.2 Discretised Data
    2.1.3 PSF Response Matrix
  2.2 The Naïve Solution: Direct Inversion
    2.2.1 Ill-Conditioning
  2.3 Frequency Cutoff and Filtering Based Solutions
    2.3.1 Frequency Cutoff
    2.3.2 Wiener Inverse-Filter
  2.4 CLEAN
    2.4.1 Brief History
    2.4.2 Description of the algorithm
      2.4.2.1 Exit Criteria
    2.4.3 Normalising the beam
    2.4.4 Results and Findings
      2.4.4.1 Strong Complex Source, Weak Point Source and Diffuse Source
      2.4.4.2 Corrugation Effect

3 Maximum Likelihood Techniques
  3.1 Optimising a Function under Positivity Constraint
    3.1.1 Lagrangian Maximisation via KKT
    3.1.2 Enforcing a Multiplicative Form
    3.1.3 Convergence of the Solution
  3.2 Maximum Likelihood (ML)
    3.2.1 Constrained Maximum Likelihood and Negative Log Likelihood
    3.2.2 Poisson: RLA
    3.2.3 Additive Gaussian Noise: ISRA
      3.2.3.1 Maintaining Positivity via Shifting the Data
      3.2.3.2 Maintaining Positivity via Splitting the Data
    3.2.4 Regularised ISRA and RLA
      3.2.4.1 Exit Criteria
  3.3 Algorithmic Implementation
    3.3.1 Computational Operators for Convolution
    3.3.2 RLA
    3.3.3 ISRA

4 Maximum a Posteriori (MAP)
  4.1 Sparse Representation
    4.1.1 Concepts and Definitions
    4.1.2 Definitions of Sparsity
      4.1.2.1 Intuitively
      4.1.2.2 Descriptively
      4.1.2.3 Formally
  4.2 MAP Synthesis
    4.2.1 The Maximum Likelihood Term
    4.2.2 The Prior Term
    4.2.3 The MAP-Synthesis Model
  4.3 Matching Pursuit Algorithm
    4.3.1 Description of the Algorithm
      4.3.1.1 Exit Criteria
    4.3.2 Computational Considerations
      4.3.2.1 Wavelet Computational Operators
      4.3.2.2 Computing the Coefficients
      4.3.2.3 Computing the Norms
    4.3.3 Manifestation of the prior
  4.4 Experimentation
    4.4.1 Set up
    4.4.2 Analysis of Dictionaries
      4.4.2.1 The Identity Dictionary
      4.4.2.2 The Symlet Dictionary
      4.4.2.3 The Symlet+Identity Dictionary
    4.4.3 Analysis of Loop-Gains

5 Conclusion

A Notations and Concepts
  A.1 Mathematical Notations
    A.1.1 Element-wise Vector Multiplication
  A.2 Covariance Matrix
  A.3 Hyperparameter
  A.4 Descent Direction
  A.5 KKT First Order Optimality Conditions

B Proofs
  B.2 Proofs From Chapter 2
    B.2.6 PSF Response Matrix Performs a Convolution (2.6)
  B.3 Proofs From Chapter 3
    B.3.8 Constraint on α^(k) to Ensure Positivity (3.8)
    B.3.10 Multiplicative Form of the Optimisation Function (3.10)
    B.3.19 Calculation of J(x) under Poisson Noise (3.19)
    B.3.28 Partial Derivative of D_i (3.28)
    B.3.29 Calculation of J(x) under Additive Gaussian Noise (3.29)
    B.3.42 From Mathematical to Computational: RLA (3.42)
    B.3.43 From Mathematical to Computational: ISRA (3.43)
  B.4 Proofs From Chapter 4
    B.4.5 Dictionary is a Linear Combination of Atoms (4.5)
    B.4.18 Minimising Φ_D on α (4.18)
    B.4.22 Minimising the Inner Term of Φ_i (4.22)
    B.4.23 Minimising Φ_ML on the Index (4.23)
    B.4.27 Initial Coefficient Calculation (4.27)

List of Figures

1.1 Simple Convolution Example
1.2 Another Simple Convolution Example
1.3 First three elements of the square wave expansion
1.4 Example of Fourier Transform in an FTIR (Fourier Transform Infra Red) spectrometer
1.5 FFT Example in Image Processing
1.6 FFT Ringing and Aliasing Example
1.7 Wavelet Function Examples
1.8 Wavelet Decompositions
1.9 JPEG Wavelet Decompositions
2.1 Example of a source map affected by a PSF
2.2 Example of a source map affected by a PSF
2.3 Direct Inversion for Deconvolution
2.4 Ill-Conditioning of Direct Inversion
2.5 Frequency Cutoff of Direct Inversion
2.6 CLEAN Results
3.1 RLA Results
4.1 Wavelet Decomposition and Reconstruction Example
4.2 Example of Shift Variance
4.3 Example of Fast Convergence of Solution
4.4 Loop vs Dict Comparison (Noise 5)
4.5 Loop vs Dict Comparison (Noise 10)
4.6 Noise vs Dictionary Comparison (Loop 0.1)
4.7 Noise vs Loop Comparison (Sym6+I)
List of Tables

4.1 Statistical Differences Between Different Dictionaries
4.2 Statistical Differences Between Different Loop Gains

Chapter 1

Introduction and Background


1.1 Convolution

Convolution is a mathematical operation on two functions that produces a combined function. It can intuitively be seen as adding one function, f, to another function, g, at every point of g. This becomes clear in the discrete case, but the continuous case is given first. A convolution is denoted by the operator $*$ and is defined as

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \qquad (1.1)$$

for two complex-valued continuous functions f and g on the complex numbers.

From this definition we see that a convolution is akin to taking a sliding weighted average of g multiplied by f(t) at some t. This can also be seen as taking the same weighted average of f over g. This also implies that the convolution operator is commutative ($f * g = g * f$).

1.1.1 Discrete Convolution

For the discrete case we have

$$(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m)\, g(n - m) \qquad (1.2)$$

A more intuitive way of visualising a convolution is to picture a simple image consisting of a number of sparsely placed, infinitely small points of value 1; call this 2D function f. If we take another picture, another 2D function g, and we stamp it on each one of the points in f, we will have achieved something like the convolution of f with g. If a point on f was equal to 2, then g would be multiplied by 2 when stamped. See Figure 1.1 for an example.
For images with just points this seems intuitive. For more complex images this analogy breaks down a bit. If we take f to contain more than just points, say a line or some other more complicated shape, we can imagine stamping g at every single point of f. Figure 1.2 demonstrates this concept.
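As an aside, the stamping picture above is easy to reproduce numerically. The short sketch below is an illustration only, assuming NumPy and SciPy are available (it is not code from this report): a sparse map of point sources is convolved with a small square stamp, exactly as in Figures 1.1 and 1.2.

```python
import numpy as np
from scipy.signal import convolve2d

# f: a sparse "image" of point sources; g: the stamp placed at each point.
f = np.zeros((64, 64))
f[16, 16] = 1.0          # a point source of value 1
f[40, 40] = 2.0          # a point source of value 2: its stamp is doubled

g = np.zeros((9, 9))
g[3:6, 3:6] = 1.0        # a small square "stamp"

result = convolve2d(f, g, mode="same")

# The stamp appears around (16, 16) and, scaled by 2, around (40, 40).
print(result[14:19, 14:19])
print(result[38:43, 38:43])
```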
Convolution is used for many applications. Blurring images is a prime example: the Gaussian blur is simply an image convolved with an image of a 2D Gaussian curve. However, if one were to imagine the blurring process (or had one used an image processing application such as Photoshop or Gimp), one would realise that any attempt to un-blur or re-sharpen an image back to its original state is never perfect.

It turns out that there is no mathematically certain way to deconvolve an image given one of the images it was convolved with¹.

50

0.9

50

0.9

50

0.9

100

0.8

100

0.8

100

0.8

150

0.7

150

0.7

150

0.7

200

0.6

200

0.6

200

0.6

250

0.5

250

0.5

250

0.5

300

0.4

300

0.4

300

0.4

350

0.3

400

500
50

100

150

200

250

300

350

400

450

500

50

100

150

200

250

300

350

400

450

500

0.2

450

0.1

500

0.3

400

0.2

450

0.1

350

0.3

400

0.2

450

350

0.1

500

50

100

150

200

250

300

350

400

450

500

Figure 1.1: Left: Three points at (100,100), (300,300), (200,400) - not visible. Middle: a square
function. Right: Convolution of the two.
3

1
2000

50

0.9

50

100

0.8

100

150

1800

2.5

100

50

1600
150

0.7

150

200

0.6

200

250

0.5

250

300

0.4

1400

200
250

1.5

300
1

350
400

350

0.3

400

0.2

0.5
450

450

500
50

100

150

200

250

300

350

400

450

500

0.1

500
50

100

150

200

250

300

350

400

450

500

1200

1000
300
800
350
600
400
400
450

200

500
50

100

150

200

250

300

350

400

450

500

Figure 1.2: Convolution of 2 non-point-like images

1.2 Fourier Transform

1.2.1 Introduction and Usage

A Fourier transform takes some function f(x) and decomposes it into an infinite series of sinusoidal functions of different frequencies and amplitudes. Adding these frequencies back together returns the original function. Figure 1.3 shows the first four elements of the Fourier Series of a square wave, and the wave if this series were to run to infinity. If we take the Fourier transform of this signal, we get a continuous function F(x) representing the frequency space of f(x). We can see in Figure 1.4 that at the 8 micrometer range there is a spike. Depending on where the signal f(x) originates from, this could mean different things. In chemical analysis, the spike could represent an abundance or lack of a certain chemical compound. If it were a radio signal it could be a signal from a radio station operating at a certain frequency. In Astronomy it could represent the presence of an object that is known to emit signals of that frequency, a pulsar or maser source, say. The graphs in Figure 1.4 are from a Fourier-Transform Infra-Red (FTIR) spectrometer, so are most likely showing an abundance of a certain chemical bond. In any case, analysis of the frequency domain makes certain insights not only easier, but sometimes possible at all.
¹ Under perfect circumstances deconvolution can be done; however, under any noise or contamination, even seemingly negligible amounts, this does not hold (see Section 2.2).

Figure 1.3: Red: nth element of the series. Blue: Sum over n elements. Dashed: Sum over n − 1 elements. If this series were to go to infinity, the resultant wave will be a square wave.

More formally, the Fourier Transform of a function $g(t): \mathbb{C}^N \to \mathbb{C}^N$ at $s \in \mathbb{C}^N$ is defined as

$$\mathcal{T}(g(t), s) = \int g(t)\, e^{-2\pi i t s}\, dt \qquad (1.3)$$

For brevity we write $\mathcal{F}(g)$.

1.2.2 Discrete Fourier Transforms

The discrete version of the Fourier Transform has a similar form to the continuous one. Given a discrete vector function $g(t): \mathbb{C}^N \to \mathbb{C}^N$ and $s \in \mathbb{C}^N$,

$$\mathcal{T}(g(t), s) = \sum_{t=1}^{N} g(t)\, e^{-2\pi i t s} \qquad (1.4)$$

Again, we simply write $\mathcal{F}(g)$.


Discrete Fourier Transforms have a wide variety of computational uses since all data on a computer is
discretised. Images can be seen as 2D matrices. Performing a 2D Discrete Fourier Transform will reveal
frequency information that would otherwise be obscured or too difficult to calculate otherwise (see Figure
1.5). In signal processing, the ability to transform a signal into the frequency domain is invaluable.
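As a small illustration of this, the sketch below (using NumPy's FFT routines; it is my own toy example, not code from the report) inspects the 2D transform of a simple image directly:

```python
import numpy as np

image = np.zeros((256, 256))
image[96:160, 96:160] = 1.0                      # a simple 2D square function

spectrum = np.fft.fftshift(np.fft.fft2(image))   # put the zero frequency at the centre
power = np.abs(spectrum)

# Broad, image-wide structure contributes near the centre of the shifted
# spectrum; sharp edges and fine detail contribute to the outer (high)
# frequencies.
print(power[128, 128], power[128, 200])
```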

Figure 1.4: Example of Fourier Transform in an FTIR (Fourier Transform Infra Red) spectrometer. (a) Direct signal received from the FTIR of overall power received over a certain space. (b) The Fourier transform of the direct signal showing the absorbance of certain frequencies [23]. These frequencies show an abundance of a material that absorbs at the 8 micrometer range.

Figure 1.5: Top Left: An Image of some random text. Top Right: The Fourier transform. Bottom
Left: The image rotated. Bottom Right: The Fourier transform of the rotated image. Notice how in the
Fourier domain the main axis is aligned with the rotation of the original image. Using this information, a
scanned page that has an unknown rotation can be corrected to its original rotation by simply detecting
this main axis on the Fourier transform, a far simpler task than trying to detect lines of text in the image
plane. Once rotated back to an upright position, text recognition or other advanced processing steps
can be utilised.

1.2.3 Fast Fourier Transform

The Fast Fourier Transform (FFT) is an algorithm to calculate the DFT. It was first developed by Cooley and Tukey [2]. Whilst direct computation of the DFT is O(N²) for N samples, the FFT does it in O(N log N). As it is a significantly faster algorithm, it allows much of modern signal processing to be done.
The FFT in this form requires that N be a power of two, but there have been modifications that allow for non-power-of-two sizes, such as the Prime-Factor FFT Algorithm [8] or the chirp-z algorithm [18].

1.2.4 Convolution Theorem for FFT

Given two functions $f, g: \mathbb{C}^N \to \mathbb{C}^N$, it can be shown that:

$$\mathcal{T}(f * g) = \mathcal{T}(f)\,\mathcal{T}(g) \qquad (1.5)$$
$$\mathcal{T}(f \cdot g) = \mathcal{T}(f) * \mathcal{T}(g)$$

In other words, the transform of a convolution is the product of the transforms, and the transform of a product is the convolution of the transforms.
This allows us to calculate a discrete convolution using two Fast Fourier Transforms, one element-wise multiplication and one inverse transform. Classic calculation of the discrete convolution of two N × N images is an O(N⁴) problem, whereas the use of the FFT reduces this to O(N² log N).
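The theorem is easy to verify numerically. The sketch below (assuming NumPy; the quadruple loop is deliberately the slow direct calculation referred to above) computes a circular convolution both ways on a small map and checks that they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
f = rng.standard_normal((N, N))
g = rng.standard_normal((N, N))

# Via the convolution theorem: f * g = T^-1( T(f) . T(g) )  (circular).
conv_fft = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(g)))

# Direct circular convolution: the slow O(N^4) sum.
conv_direct = np.zeros((N, N))
for m in range(N):
    for n in range(N):
        for p in range(N):
            for q in range(N):
                conv_direct[m, n] += f[p, q] * g[(m - p) % N, (n - q) % N]

print(np.allclose(conv_fft, conv_direct))   # True
```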

1.3 Wavelet Transform

1.3.1 Deficiencies of the Fourier Transforms

Recall that a Fourier transform takes data in image space and transforms it into Fourier space (or frequency space). This can be seen as transforming intensity data into sinusoidal coefficients. Owing to the periodic and unbounded nature of the sine function, the transform can create artifacts when discretised and restricted to a bounded region. Take the transform of a 2D square function on a finite map, for example (Figure 1.6). The sharp discontinuities in the image cause the distinct ringing effects in the frequency domain. Unbounded, these rings would go on infinitely but, because of the finite space, the signal either cuts off or wraps (depending on the algorithm used). Performing the inverse transform would ideally return the original square function, but instead returns an approximation.
Figure 1.6: Left: Square function. Center: Fourier Transform of square function. Right: Square function transformed and inverse transformed. Notice the artifacts and the sharp dips to the sides of the restored square function.
The square function example is contrived when considering real data, but there are cases where thresholding techniques can create sharp discontinuities and hence these artifacts. The use of wave-like functions that are bounded within some finite area, called wavelets, has been shown to represent certain physical signals better than the infinitely periodic sine wave (see Figure 1.7) [15].

Figure 1.7: Left: Mexican Hat Wavelet. Center: Meyer Wavelet. Right: Morlet Wavelet.

1.3.2 Decomposition and Reconstruction

Decomposition is the act of breaking one form down into another. With the Fourier transform we are decomposing an image into a series of sine waves. This decomposition essentially gives us a number of sine-wave coefficients. Each coefficient (a complex number) represents the phase and amplitude of a sine wave (with the sine wave's frequency determined by its location in the Fourier plane). Computing the sine wave for each coefficient and adding all of these together reconstructs the original image.
The wavelet transform instead aims to use bounded sine-like functions called wavelets to decompose the image into a series of coefficients, rather than periodic sine waves.

1.3.3 Usage

The 2D sine wave has only four parameters (frequency in x and y, amplitude and phase), so its decomposition can be represented by a 2D complex-valued map. The wavelet has five (scale in x and y, shift in x and y, and amplitude), which cannot; one could, however, represent it with a 4D real-valued map. A discrete transform is employed for computational purposes, but generating a 4D map is too memory intensive. Instead a multi-resolution approach is used, which performs multiple levels, L, of decomposition on an N × N image/map.

Figure 1.8: The above three wavelets are examples of what the wavelet transform uses to decompose the image. Left and Center are level 1 wavelets whereas on the Right is a level 2 wavelet. Notice the difference in scale. In the same way a Fourier transform decomposes an image into a number of sine waves, the wavelet transform decomposes an image into wavelets.
At each level, three wavelets are used for decomposition: one circularly symmetric, one scaled in x, and one scaled in y (see Figure 1.8). Instead of decomposing over the whole image, the transform decomposes at every $p^{th}$ pixel. So a 512 × 512 image decomposed at every 2nd pixel would produce a 256 × 256 coefficient map for each of the three wavelets and leave a residue image. The residue image is down-sampled by a factor of two and then the decomposition is performed again on every 2nd pixel. This repeats until a 1 × 1 image remains, or, more usually, until a preset number of levels have been decomposed. A good example of how this is stored is shown in Figure 1.9.

Figure 1.9: Example of a JPEG2000 2-level wavelet decomposition. Since this is a compression scheme the residual image is down-sampled (or decimated) at each level [4].

In scientific computing the residual is not down-sampled (i.e. it remains undecimated) and the decomposition is calculated at each $(2^L)^{th}$ pixel on the first level, every $(2^{L-1})^{th}$ on the second level, $(2^{L-2})^{th}$ on the third, and so forth. Similarly to the Fourier transform, the coefficients can be used in conjunction with the wavelet function to reconstruct the original image. Unlike the Fourier transform, the residual image must be added back as well.
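To make the decomposition/reconstruction cycle concrete, the following sketch assumes the PyWavelets package (my own choice; the report does not prescribe a library) with a Symlet wavelet. `wavedec2` gives the decimated transform described for JPEG2000; the undecimated variant described above would correspond to its stationary counterpart, `swt2`.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
image = rng.standard_normal((512, 512))

# Two decomposition levels: a coarse residual plus, per level, three detail
# coefficient maps (one per wavelet orientation).
coeffs = pywt.wavedec2(image, wavelet="sym6", level=2)
residual, details_level2, details_level1 = coeffs
print(residual.shape, [d.shape for d in details_level1])

# Adding the residual and every detail contribution back reconstructs the image.
reconstructed = pywt.waverec2(coeffs, wavelet="sym6")
print(np.allclose(image, reconstructed))     # True (perfect reconstruction)
```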

1.3.4 Fast Wavelet Transform

The first Discrete Wavelet Transforms (DWT) were devised by Haar [10] in 1909 and were based on discontinuous square-like functions. The Fast Wavelet Transform (FWT) was initially developed by Mallat [14]. Ingrid Daubechies later built on Mallat's work, creating the set of wavelets that are most commonly used today [6][9].
Whilst DWTs usually run in O(N log N), the FWT runs in O(N) for N samples or data points. Complexity-wise this is faster than the FFT (which is O(N log N)); however, the wavelet computations are significantly more intensive and hence for most small cases (less than 512²) the FWT is slower than the FFT.

1.4 Radio Interferometry

Details of radio interferometry are out of the scope of this report. Briefly, for readers unfamiliar with Astronomy, Radio Astronomy involves the surveying of the sky at radio wave frequencies. This can be done with either a single radio dish, or by use of multiple dishes working in unison.
Using multiple dishes enables one to synthesise a dish with aperture equal to the sum of the dishes' apertures and diameter equal to the largest distance between any two dishes (called the baseline).
Larger baselines mean higher resolution data can be observed, and a denser array of radio telescopes allows for fainter sources to be detected. It is impossible, or at least very improbable, to synthesise a dish with coverage equal to a (theoretical) dish with diameter equal to the baseline. This lack of full coverage causes an implicit convolution, which is the reason for deconvolution.

1.4.1 Observed Data and PSFs: Both Clean and Dirty

The importance of convolution to radio interferometry comes from the way in which data is collected. For reasons that will not be explored in detail here, when a radio interferometer measures a section of the sky, $I_T$, we find that the data it outputs is $I_T$ convolved with the PSF of that particular interferometer. The Point Spread Function (PSF) is what the interferometer would measure if it were to observe a single infinitely small but sufficiently bright point source in the sky. This can be seen as a result of the incomplete coverage of the radio telescope array.
In the literature, and hence in this report, there are standard names for certain sets of data. They fall under two categories, dirty and clean. The Dirty Beam is the same as the PSF and the terms will be used interchangeably in this report. The Clean Beam is what a perfect interferometer would measure. As there is no such thing, it is usually modeled as a Gaussian curve: at the center of every PSF is a Gaussian-like curve, and the clean beam is fitted to represent a smooth version of this².
The True Image is the perfect observation, which we do not have. This could be referred to as the Clean Image, but it is not to be confused with what might be called the cleaned image, which is the resulting image after deconvolution. As deconvolution will never, with certainty, reproduce the True Image, they can never be said to be the same; for clarity it shall be referred to as the Reconstructed Image. The Dirty Image, mathematically, is the True Image convolved with the Dirty Beam. This is the data we need to deconvolve in order to approximate the True Image. The chapters that follow show various attempts at deconvolution to reconstruct the True Image.
² This is done via mathematical deduction, human estimation or a combination of both.

The Fourier transform has many uses in scientific applications as it allows a dataset to be transformed into its frequency domain. Figure 2.1 shows an example of a simulated sky with an ASKAP (Australian SKA Pathfinder) PSF. An example of a 1D signal, as used in spectroscopy, is shown in Figure 1.4.

Chapter 2

Naïve Deconvolution and CLEAN

In this chapter we build our mathematical model and expand on some vital concepts via explanation of some naïve deconvolution attempts as well as an algorithm named CLEAN.

2.1 Formulation of the Problem

This chapter will build our initial model of the sky such that we might be able to derive a rigorous mathematical way by which to deconvolve our data. This and the following chapters are built from Introduction to Image Reconstruction and Inverse Problems by Thiebaut [21], Deconvolution and Blind Deconvolution in Astronomy by Pantin, Starck & Murtagh [17], Sparse priors in unions of representation spaces for radio-interferometric image reconstruction by Mary, Bourguignon, Theys & Lanteri [16] and a Masters thesis, Reconstruction of Radio-Interferometer images using sparse representations, by Arwa Dabbech [3], amongst other sources that will be more specifically referenced in the text.

2.1.1 Building a Model

When mapping a region of the sky, the region is split into a grid. Each cell of this grid (i.e. a pixel) measures the number of photons that are received by the observing instrument from a certain angular section of the sky. We use y to denote the intensity map that was observed over some section of the sky and x for the true (albeit unknown) intensity over the same region.
The goal of processing an observation is to find x. If we assume perfect observing conditions and a perfect observing instrument we can simply use the trivial model
$$y = x$$
However, when utilising a radio interferometer, it is well known that the received signal is affected by the PSF of the interferometer. We denote the PSF by h. With this consideration our model becomes
$$y \approx h * x$$
Owing to cosmic background, instrument noise and atmospheric interference, amongst other considerations, our signal will be unavoidably contaminated by noise. This noise can affect the signal before it has reached the instrument (external noise/temperature), $n_e$, as well as after the signal has been received (internal instrumental noise/temperature), $n_i$. This would mean that we could model our data as
$$y \approx h * (x + n_e) + n_i$$

Figure 2.1: Left: Actual intensity map of the sky. Middle: Point spread function of the interferometer. Right: The convolution of the two; this is the data that the interferometer produces (minus any noise).
This can be simplified to use one noise term:
$$\begin{aligned} y &\approx h * (x + n_e) + n_i \\ &= (h * x) + (h * n_e) + n_i \\ &= (h * x) + n'_e + n_i \qquad (2.1) \\ &= (h * x) + n \end{aligned}$$
where $n'_e = (h * n_e)$. Since noise convolved with a uniform PSF is just the same noise, $n'_e$ will have the same variance and mean as $n_e$. We use n to denote the combination of internal and external noise, since in all further formulations they will be treated the same.
It is important to note that each pixel or cell of our intensity data can be modeled independently from the others. This is a safe assumption, as we have no physical reason to believe that the photons from one section of the sky are necessarily correlated with those from any other section.
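The data model above is straightforward to simulate. The sketch below (NumPy only; the toy Gaussian PSF, source values and noise level are my own assumptions standing in for real interferometer data) produces a dirty map of the form y ≈ h ∗ x + n, as in Figure 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256

x_true = np.zeros((N, N))
x_true[60, 80] = 200.0              # a bright point source
x_true[150:200, 100:170] = 5.0      # a faint diffuse source

# Toy PSF, normalised to unit total flux, with its peak at the map centre.
yy, xx = np.mgrid[0:N, 0:N]
psf = np.exp(-((xx - N // 2) ** 2 + (yy - N // 2) ** 2) / (2.0 * 4.0 ** 2))
psf /= psf.sum()

# Circular convolution via the FFT (the PSF origin is moved to pixel (0, 0)).
H = np.fft.fft2(np.fft.ifftshift(psf))
convolved = np.real(np.fft.ifft2(H * np.fft.fft2(x_true)))

sigma = 0.1
dirty = convolved + rng.normal(0.0, sigma, size=(N, N))   # y = h*x + n
```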
Figure 2.2: Left: Convolution of the true sky intensity and the PSF. Middle: Gaussian noise. Right: The data that the interferometer produces, the noise-affected and convolved sky intensity.

2.1.2 Discretised Data

Since we work with discretised data (pixels), we take the matrix form of this data,
$$y \approx Hx + n \qquad (2.2)$$
where y, x and n are the data, actual intensity and noise vectors, and H is the PSF response matrix.
y is not kept as a k × k matrix but rather flattened into a vector,
$$y = [y(1,1),\, y(1,2),\, \ldots,\, y(1,k),\, y(2,1),\, y(2,2),\, \ldots,\, y(k,k)]^T \qquad (2.3)$$
and for simplicity we say
$$y = [y_1,\, y_2,\, \ldots,\, y_N]^T \qquad (2.4)$$
for $N = k^2$. We say that y is y flattened. Similarly we define x as x flattened and n as n flattened.
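The flattening convention of equation (2.3) is just the usual row-major ordering; in NumPy, for example (an illustration of mine, not code from the report):

```python
import numpy as np

y_map = np.arange(1, 10).reshape(3, 3)   # a small 3x3 "image"
y = y_map.ravel()                        # [y(1,1), y(1,2), y(1,3), y(2,1), ...]
print(y)                                 # [1 2 3 4 5 6 7 8 9]
```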

2.1.3 PSF Response Matrix

The construction and meaning of the PSF response matrix is as follows. Let us take the PSF centered around the point (1, 1) and, similarly to how we flattened y, flatten h to obtain $[h_1, h_2, h_3, \ldots, h_N]^T$.
We now define the PSF response matrix as:
$$H \equiv \begin{bmatrix} h_1 & h_N & h_{N-1} & \cdots & h_2 \\ h_2 & h_1 & h_N & \cdots & h_3 \\ h_3 & h_2 & h_1 & \cdots & h_4 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_N & h_{N-1} & h_{N-2} & \cdots & h_1 \end{bmatrix} \qquad (2.5)$$
which is a matrix such that the $i^{th}$ column is the flattened PSF circularly shifted $i - 1$ times. This can also be seen as representing a PSF at every possible point on a k × k map.
It can be shown that for any map a and its flattened version a, $Ha = h * a$.
An example and a sketch proof showing that
$$Hx = x_1 H_1 + x_2 H_2 + \cdots + x_N H_N \qquad (2.6)$$
is given in Appendix B.2.6, where $H_i$ denotes the $i^{th}$ column of H. As can be seen, Hx gives rise to a linear combination of PSFs applied to every point of x, each scaled by the intensity at that point, which is the definition of a convolution.
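The construction can be checked numerically on a small map. The sketch below (NumPy; my own illustration, not the proof of Appendix B.2.6) builds H column by column as "a PSF at every possible point on the k × k map", using 2D circular shifts of the PSF, and confirms that applying H to a flattened map reproduces the circular convolution computed via the FFT.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8
h = rng.random((k, k))      # a toy PSF, with its origin taken at pixel (0, 0)
a = rng.random((k, k))      # an arbitrary map

# Column (p, q): the PSF circularly shifted so its origin lands on pixel (p, q).
H = np.zeros((k * k, k * k))
for p in range(k):
    for q in range(k):
        H[:, p * k + q] = np.roll(np.roll(h, p, axis=0), q, axis=1).ravel()

Ha = H @ a.ravel()
conv = np.real(np.fft.ifft2(np.fft.fft2(h) * np.fft.fft2(a))).ravel()
print(np.allclose(Ha, conv))   # True: H a = h * a
```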

2.2 The Naïve Solution: Direct Inversion

An easy approach is simply to transform the dirty map to Fourier space and divide, i.e. to perform a direct inversion [21]. After all, in the Fourier domain the dirty image is simply the multiplication of the Fourier transform of the true sky intensity by the Fourier transform of the PSF. Inversion should work, and we follow this line of reasoning:
$$\begin{aligned} y &= h * x \\ \mathcal{F}(y) &= \mathcal{F}(h)\,\mathcal{F}(x) \\ \mathcal{F}(x) &= \frac{\mathcal{F}(y)}{\mathcal{F}(h)} \\ x_{\text{direct}} &= \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(y)}{\mathcal{F}(h)}\right) \end{aligned} \qquad (2.7)$$

Figure 2.3: When performing the direct inversion on the convoluted map, the original intensity map can be restored.
We expect the original image (Figure 2.3), but when this is performed on observed (noisy) data the result is simply noise (Figure 2.4). The problem is the additional noise term. When this is taken into account the equation becomes somewhat less helpful:
$$\begin{aligned} y &= h * x + n \\ \mathcal{F}(y) &= \mathcal{F}(h)\,\mathcal{F}(x) + \mathcal{F}(n) \\ \mathcal{F}(x) &= \frac{\mathcal{F}(y) - \mathcal{F}(n)}{\mathcal{F}(h)} \\ \mathcal{F}(x) &= \frac{\mathcal{F}(y)}{\mathcal{F}(h)} - \frac{\mathcal{F}(n)}{\mathcal{F}(h)} \\ x_{\text{direct}} &= \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(y)}{\mathcal{F}(h)} - \frac{\mathcal{F}(n)}{\mathcal{F}(h)}\right) \end{aligned} \qquad (2.8)$$

2.2.1 Ill-Conditioning

We can see that the $\mathcal{F}(n)/\mathcal{F}(h)$ term spoils what we thought would be a simple solution. Division by the small values in the transform of the PSF causes this term to corrupt the result significantly, even when the signal-to-noise ratio is very high. Since observation through the PSF (also called instrumental transmission) is a process in which the noise is usually non-negligible, noise amplification will always be a significant consideration in deconvolution problems. This noise amplification is called ill-conditioning of a solution; we, of course, strive to find well-conditioned solutions that do not amplify the noise [21].
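The effect is easy to reproduce. In the sketch below (NumPy; the PSF, source values and noise level are toy assumptions of mine), direct inversion recovers the noiseless map almost exactly, while the same inversion applied to data with σ = 0.1 noise is dominated by the amplified F(n)/F(h) term.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256

x = np.zeros((N, N))
x[100, 120] = 100.0           # point source
x[150:180, 60:90] = 2.0       # faint extended source

yy, xx = np.mgrid[0:N, 0:N]
psf = np.exp(-((xx - N // 2) ** 2 + (yy - N // 2) ** 2) / 2.0)   # narrow toy PSF
psf /= psf.sum()
H = np.fft.fft2(np.fft.ifftshift(psf))           # F(h)

y_clean = np.real(np.fft.ifft2(H * np.fft.fft2(x)))
y_noisy = y_clean + rng.normal(0.0, 0.1, (N, N))

x_from_clean = np.real(np.fft.ifft2(np.fft.fft2(y_clean) / H))
x_from_noisy = np.real(np.fft.ifft2(np.fft.fft2(y_noisy) / H))

print(np.max(np.abs(x_from_clean - x)))   # essentially zero: inversion works without noise
print(np.max(np.abs(x_from_noisy - x)))   # orders of magnitude larger: F(n)/F(h) dominates
```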

Figure 2.4: When performing the direct inversion on noisy data, the equation produces a noise-like result. In this example the dirty image was contaminated by a seemingly insignificant noise level (σ = 0.1).

2.3 Frequency Cutoff and Filtering Based Solutions

2.3.1 Frequency Cutoff

Noise usually dominates in the high frequency range, whereas the sources we are trying to detect usually have frequency ranges lower than this. One solution is simply to cut off the high frequencies where we know noise dominates and perform the inversion again:
$$x_{\text{direct}} = \mathcal{F}^{-1}(t) \quad \text{for} \quad t = \begin{cases} \dfrac{\mathcal{F}(y)}{\mathcal{F}(h)} & \text{if } \left|\dfrac{\mathcal{F}(y)}{\mathcal{F}(h)}\right| < u_{\text{cutoff}} \\ 0 & \text{otherwise} \end{cases} \qquad (2.9)$$
This does produce more acceptable results; however, the sharp cutoff in the Fourier domain causes clear artifacts in the image plane (noticeable ripples or ringing) [21]. These ripples distort faint objects in the background and cause negative-valued solutions, which we know are not physically possible (see Figure 2.5).
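A minimal version of this cutoff is sketched below (NumPy; the source map, PSF and noise level are toy assumptions of mine rather than the data behind Figure 2.5, and the threshold of 64000 is only borrowed from that figure's setting).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 256
x = np.zeros((N, N))
x[100, 120] = 100.0
x[150:180, 60:90] = 2.0

yy, xx = np.mgrid[0:N, 0:N]
psf = np.exp(-((xx - N // 2) ** 2 + (yy - N // 2) ** 2) / 2.0)
psf /= psf.sum()
H = np.fft.fft2(np.fft.ifftshift(psf))
y = np.real(np.fft.ifft2(H * np.fft.fft2(x))) + rng.normal(0.0, 10.0, (N, N))

ratio = np.fft.fft2(y) / H                          # F(y) / F(h)
t = np.where(np.abs(ratio) < 64000.0, ratio, 0.0)   # zero the noise-dominated ratios
x_cutoff = np.real(np.fft.ifft2(t))                 # no longer blows up, but shows ringing
```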

2.3.2 Wiener Inverse-Filter

Frequency cutoff can be seen as a simple form of inverse filter. More advanced inverse filters have been developed to ensure certain criteria are met in order to create more reliable images.
For instance, the Wiener Inverse-Filter ensures that the result is, on average, as close as possible to the true object brightness (in a least-squares sense of close) [21].

Figure 2.5: Frequency cutoff of direct inversion with T, the cutoff function. We see somewhat improved results over the direct inversion (Fig. 2.4), even under considerably greater noise (σ = 10 as opposed to σ = 0.1). Other techniques have proven to give better results; for instance in this case there is a point source and a diffuse source that are not resolved in this solution but can be resolved by other methods. Also, the solution contains negative, and hence non-physical, values.
Details of this filter are left out as a matter of scope, but can be found in Thiebaut [21] or most textbooks
on linear filters.

2.4 CLEAN

2.4.1 Brief History

CLEAN was initially developed by Högbom [11] to solve the problems of deconvolution in Radio Astronomy. It proved very successful and remains one of the most important deconvolution algorithms in radio interferometry today. It is simple to implement, simple to understand and historically proven to be very effective. In the past, radio interferometers primarily scanned at longer wavelengths. This caused a decrease in resolution and thus many sources were not resolved. With complex sources being blurred so much by the low resolution, they can appear as point sources. As the primary assumption of CLEAN is that all sources in the sky are point sources (see the description of the algorithm), it performed well in these conditions.
As radio astronomers scan higher frequencies and gain access to larger baseline arrays, point sources resolve into more complex shapes or clusters of stars. CLEAN still performs somewhat adequately for clusters and can resolve larger structures to a degree, but its initial assumption hampers it. As telescopes gain more and more sensitivity, diffuse sources become observable as well. Again, CLEAN does not perform well deconvolving these sources, especially under noisy conditions, as it tends to pick random points within the source, often leading to a dotted effect. A multi-resolution CLEAN algorithm was developed by Wakker & Schwarz [22] that addresses some of these deficiencies. Detailed analysis of the multi-resolution CLEAN was not explored in this report as other multi-resolution techniques were deemed more within the scope.

2.4.2 Description of the algorithm

The CLEAN algorithm assumes that all the sources in the observed patch of sky are point sources. Starting with the dirty map, it simply finds the brightest point within it and subtracts a fraction (called the loop factor) of the normalised PSF centered at that point (the exact value is detailed below). The subtracted PSF value is added back to a point map at the same coordinates. The map from which the algorithm subtracts is referred to as the residual map or residue, as it contains what is left over. The algorithm then repeats, using the residue to find and subtract the brightest point again.
The algorithm stops once the maximum brightness falls below a certain threshold (classically, one hundredth of the initial maximum brightness) or once the residual is considered to be statistically close enough to noise; e.g. for Gaussian noise this would be the mean squared error term falling below some predetermined threshold. Alternatively, a maximum number of iterations can be set.
After the algorithm stops, one is left with the residual and a point map. We convolve the point map with the normalised clean beam to reconstruct the observation. This, in effect, means that at each iteration of CLEAN the dirty beam was estimated to have had a certain effect on the image, that effect was removed and a clean beam's effect was put in its place. When we finally add the residual back to the reconstructed map, the total flux is preserved. Not adding the residue back will cause elements in the point map to appear as faint but significant sources; adding back the residue adds back the original noise, and the previously significant sources fade away, becoming clearly insignificant.
2.4.2.1 Exit Criteria

The CLEAN algorithm is one in which a residual is defined and progressively reduced so as to reconstruct the original. We do not know enough about the statistics of the data to create statistically justified, robust exit criteria. However, we expect that as we remove statistically interesting characteristics of the data from the residue, the residue will eventually become more and more like background noise. If we know the noise level, with standard deviation $\sigma_n$, we can exit when the maximum value of the residue (or the standard deviation of the residue¹) falls below some predefined value,
$$\max(r) < c \quad \text{or} \quad \sigma_r < c \qquad (2.10)$$
where $\sigma_r$ is the standard deviation of the residue r, and c is usually in the range $[3\sigma_n, 5\sigma_n]$.
Another criterion, if the data is assumed to be under additive Gaussian noise, is to calculate by how much the standard deviation of the residue has changed from one iteration to the next. When no significant structures are left in the residue, the percentage difference in the standard deviation from the last iteration will be extremely small, as both will be essentially noise. The exit criterion is therefore
$$\frac{|\sigma_r^{(k+1)} - \sigma_r^{(k)}|}{\sigma_r^{(k)}} < \epsilon \qquad (2.11)$$
where $\sigma_r^{(k)}$ is the standard deviation of the residue r at iteration k, and $\epsilon$ is a predefined constant, usually a small number in the range of $10^{-5}$ to $10^{-7}$.

¹ If Gaussian noise is assumed.

Figure 2.6: Top Left: σ = 0. Top Right: σ = 1. Middle Left: σ = 5. Middle Right: σ = 10. Bottom Left: The sky model. Bottom Right: Over-CLEANing (σ = 1). For low levels of noise the CLEAN algorithm performs admirably. As the data becomes more noisy, the defined boundary of the faint background source becomes more obscure. Even more noise causes many faint false-positive point source detections as the algorithm tries to pick point sources out of the noise. An example of over-CLEANing shows what happens when the algorithm is performed over too many iterations.

2.4.3 Normalising the beam

The PSF and the clean beam must be normalised. This is to preserve the overall flux of the image. Say that at a certain iteration the brightest point was valued at 1.5. If the loop factor is 0.1, it means that 0.15 must be subtracted from that point. If the PSF, h, has a maximum value of, say, 0.3 (which will be at the center of the PSF), we subtract 0.5 times the normalised PSF from the residue ($\max(h) \cdot 0.5 = 0.3 \cdot 0.5 = 0.15$). So we have removed 0.5 flux units from the residue. We must now add 0.5 to the point map, so that the total flux of both images remains the same.
More formally, if $\gamma = \max(r) \cdot \text{loop} / \max(h)$, we remove $\gamma h$ from the residue at the maximum point and add $\gamma$ to that same location on the point map. Since the PSF and the clean beam (Gaussian) are both normalised to have a total flux of 1, these effects will ultimately cancel each other out.
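For concreteness, a minimal CLEAN loop can be written as below. This is a sketch only (NumPy; my own simplified illustration rather than the implementation used for the results in this report): the PSF is assumed to be normalised to unit total flux with its peak at the map centre, it is shifted circularly for simplicity, and only the simple peak-threshold exit criterion is used.

```python
import numpy as np

def clean(dirty, psf, loop_gain=0.25, n_iter=6000, threshold=0.0):
    """Return (point map, residue) after Hogbom-style CLEAN iterations."""
    residue = dirty.copy()
    points = np.zeros_like(dirty)
    peak_y, peak_x = np.unravel_index(np.argmax(psf), psf.shape)
    for _ in range(n_iter):
        iy, ix = np.unravel_index(np.argmax(residue), residue.shape)
        if residue[iy, ix] < threshold:
            break
        # gamma = max(r) * loop / max(h): the flux removed this iteration.
        gamma = loop_gain * residue[iy, ix] / psf[peak_y, peak_x]
        shifted = np.roll(np.roll(psf, iy - peak_y, axis=0), ix - peak_x, axis=1)
        residue -= gamma * shifted          # take the dirty beam's effect out
        points[iy, ix] += gamma             # and record it in the point map
    return points, residue

# The reconstructed image is the point map convolved with the (normalised)
# clean beam, with the final residue added back to preserve the total flux.
```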

2.4.4 Results and Findings

Tests on a simulated point source showed clearly accurate results. On more complicated sources with a diffuse background source, the results were not as good but still acceptable. All complex sources were well resolved and the diffuse background was resolved, albeit not as smoothly as the stronger sources. The tested simulated sky model incorporates a complex, a diffuse and a point source and demonstrates many of the advantages and disadvantages of CLEAN.

2.4.4.1 Strong Complex Source, Weak Point Source and Diffuse Source

With no noise the diffuse source is clearly resolved, as are the weak point source and the large source. This is maintained as noise is increased to a level of σ = 1, although the diffuse source starts to break up at its edges.
When increasing it to σ = 5 the diffuse source becomes very spotty and has become so obscured as to make one insufficiently confident to say where its boundaries lie. The weak point source could be said to be resolved, but the many false-positive point sources amongst the diffuse source make this impossible to discern.
With σ = 10, the weak source is completely obscured, and many false-positive point sources are present. The diffuse source is now barely resolved, swamped in the noise. The strong, complex source is not as well resolved as at lower noise levels but retains much of its original shape and flux. Figure 2.6 shows these results.
2.4.4.2 Corrugation Effect

Leaving the CLEAN algorithm to run even further produces additional, divergent visual results. The removal of the main, strong sources causes a seemingly insignificant bump in the data a certain distance and direction away. After enough iterations, this bump becomes significant enough that the algorithm detects it as a significant source, which creates another bump, and so on. This introduces a repeated corrugation effect in the final results. This is often referred to as over-cleaning and demonstrates the importance of a robust exit criterion. See Figure 2.6, bottom right.

Chapter 3

Maximum Likelihood Techniques

3.1 Optimising a Function under Positivity Constraint

The building blocks of Maximum Likelihood techniques will be outlined in this section. We wish to find a general optimisation scheme for some function J(x). In this way we have a structure around which to build different models under different noise constraints.
We derive this general model with some important considerations already in place. One is that the data, x, must be positive, to reflect the fact that the data is real. We also assume that the function to be optimised is convex. This ensures that only one solution will be found and that our model will converge to this solution.

3.1.1

Lagrangian Maximisation via KKT

A well-known method for finding maxima/minima under constraints is the use of a Lagrange function. In
particular we use the Karush-Kuhn-Tucker (KKT) First Order Optimality Conditions [12].
We use the following Lagrange Function:
L(x, ) = J(x) h, g(x)i

(3.1)

where x RN is the vector representing the desired data, is the Lagrange multiplier, J(x) is the function
to be maximised and g(x) is the function expressing the constraints.
We use g(x) = x to ensure positivity of the solution. Let x and be the optimal solutions of x and
respectively, and note that g(x ) = 1. The KKT conditions become: 1
Stationarity
L(x , ) = 0 i = [J(x )]i , i

(3.2a)

x 0 xi 0, i

(3.2b)

0 i 0, i

(3.2c)

Positivity of x

Positivity of
1 The

original KKT first-order optimality conditions can be found in Appendix A.2.

3.1 Optimising a Function under Positivity Constraint


Complementary Slackness
x = 0 [J(x )]i x = 0

29
(3.2d)

Condition (a) simply states that an optimal solution is a stationary point (a minimum or maximum). Later
we assume convexity of the function to ensure that the solution is a minimum. Conditions (b) and (c) are
trivially satisfied by g(x) = x. So we attempt to solve for the remaining condition, (d).
Consider that −∇J(x^{(k)}) is (trivially) a descent direction of J at x^{(k)} (A.4). This would allow us to express a gradient descent algorithm:

x^{(k+1)} = x^{(k)} − α^{(k)} [∇J(x^{(k)})]    (3.3)

Later considerations will show that a multiplicative form of the algorithm is more desirable than an additive form from a computational standpoint. However, the as-yet-unspecified J might have additions or subtractions in its formulation. To avoid this problem we define a vector of functions F(x) = [F_1(x) F_2(x) ... F_N(x)], where each F_i is a real-valued positive function that depends on J(x). Whilst a seemingly arbitrary definition at this point, this term later allows us to create the multiplicative form of the algorithm.
We formulate a gradient descent algorithm again:

x^{(k+1)} = x^{(k)} + α^{(k)} diag(F(x^{(k)})) diag(x^{(k)}) [−∇J(x^{(k)})]    (3.4)

where α^{(k)} > 0 is a relaxation or dampening parameter. Also, since diag(F(x^{(k)})) diag(x^{(k)}) is a positive definite matrix, diag(F(x^{(k)})) diag(x^{(k)}) [−∇J(x^{(k)})] is still a descent direction.

3.1.2 Enforcing a Multiplicative Form

We assume that J(x) is convex and has a finite (unconstrained) global minimum (i.e. the minimum is not attained only in the limit as x tends towards +∞ or −∞). This minimum is attained where ∇J(x) = 0. We take two functions U(x) and V(x) that are positive for strictly-positive arguments (i.e. U(x), V(x) ≥ 0 for x > 0) and write

−∇J(x^{(k)}) = U(x^{(k)}) − V(x^{(k)})

We now define F(x^{(k)}) ≡ 1 / V(x^{(k)}).    (3.5)

We obtain

x^{(k+1)} = x^{(k)} + α^{(k)} diag(1 / V(x^{(k)})) diag(x^{(k)}) [U(x^{(k)}) − V(x^{(k)})]
          = x^{(k)} + α^{(k)} diag(x^{(k)} / V(x^{(k)})) [U(x^{(k)}) − V(x^{(k)})]    (3.6)

If any element of V(x^{(k)}) is equal to 0, then this equation has no solution. In practice an adjusted function is used, V'(x^{(k)}) = V(x^{(k)}) + ε·1, for ε a sufficiently small positive constant.
We need to ensure that x^{(k)} > 0 for all k. If we set x^{(0)} > 0, we require (B.3.8)

x^{(k)} + α^{(k)} diag(x^{(k)} / V(x^{(k)})) [U(x^{(k)}) − V(x^{(k)})] > 0    (3.7)

If U(x^{(k)}) > V(x^{(k)}) then it is sufficient that α^{(k)} > 0. If not, (B.3.8)

α^{(k)} < 1 / (1 − U(x^{(k)}) / V(x^{(k)}))    (3.8)

The case U(x^{(k)}) = V(x^{(k)}) will not arise, as it means the solution has already been found. If U(x^{(k)}) < V(x^{(k)}), then, since U(x^{(k)}), V(x^{(k)}) > 0, we have 0 < U(x^{(k)})/V(x^{(k)}) < 1. We also assume that −∇J(x^{(k)}) = U(x^{(k)}) − V(x^{(k)}) → 0 as k → ∞, hence U(x^{(k)})/V(x^{(k)}) → 1. This being the case, U(x^{(0)})/V(x^{(0)}) ≤ U(x^{(k)})/V(x^{(k)}) < 1 for all k, and thus

1 < 1 / (1 − U(x^{(0)})/V(x^{(0)})) ≤ 1 / (1 − U(x^{(k)})/V(x^{(k)})) < +∞    (3.9)

satisfies the condition. Whilst one can attempt to make more accurate estimates of α^{(k)}, setting it to 1 is sufficient and also yields a multiplicative form of the problem: (B.3.10)

x^{(k+1)} = diag(x^{(k)} / V(x^{(k)})) U(x^{(k)})    (3.10)

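As a concrete illustration of this scheme, the following minimal MATLAB-style sketch applies the multiplicative update of eqn 3.10 with a small safeguard on V. The handles U_fun and V_fun are hypothetical placeholders for whichever noise model (Poisson or Gaussian, sections 3.2.2 and 3.2.3) is being used.

% Generic multiplicative update (eqn 3.10), sketch only.
% U_fun, V_fun: hypothetical handles returning U(x) and V(x) for the chosen model.
x = ones(N, 1);                              % any strictly positive starting point x(0)
for k = 1:n_iter
    x = x .* U_fun(x) ./ (V_fun(x) + eps);   % eps plays the role of the adjusted V'(x)
end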
3.1.3 Convergence of the Solution

In general we cannot say that eqn 3.10 will necessarily converge to an answer. We can say that, in general, x^{(k+1)} = T(x^{(k)}) will converge if T is a contraction, i.e. if there exists η < 1 such that for all (x_1, x_2), ||T(x_1) − T(x_2)|| ≤ η ||x_1 − x_2||. For the two algorithms outlined below this does hold, and they will converge to an answer.

3.2 Maximum Likelihood (ML)

3.2.1 Constrained Maximum Likelihood and Negative Log Likelihood

This section aims to maximise the probability of the data y given all possible x values:

x_ML ≡ arg max_x Pr(y|x) = arg min_x −log(Pr(y|x)) ≡ arg min_x φ_ML(x)    (3.11)

φ is used to denote the negative log likelihood; minimising it is a common strategy for maximising probability functions, since minimising or maximising a function involves finding its derivative, and log-derivatives are generally easier to work with than derivatives of raw probabilities.
Since we are interested in positive values only, we constrain the maximum likelihood problem so that we may use the Lagrangian method outlined in the previous section:

x_CML ≡ arg min_x φ_ML(x)   subject to   x_CML ≥ 0    (3.12)

3.2.2 Poisson: RLA

Let us now try to maximise Pr(y|x) assuming that each pixel is modelled by a Poisson distribution. We again start with the likelihood function,

P(k; λ) ≡ (λ^k / k!) e^{−λ}    (3.13)

where k is the number of observed occurrences of the event² and λ is the expected number of occurrences.

² We assume an occurrence of an event is the detection of a photon by the observing device.

For a vector k we take the product over its elements, since we assume each element is independent of the others. The corresponding vector of expected numbers of occurrences is denoted by ℓ:

∏_i P(k_i; ℓ_i) = ∏_i (ℓ_i^{k_i} / k_i!) e^{−ℓ_i}    (3.14)

We again apply our problem to this equation, setting the expected value to Hx, which is our expected data given that y was observed:

Pr(y|x) ≡ P(y; Hx) = ∏_i ((Hx)_i^{y_i} / y_i!) e^{−(Hx)_i}    (3.15)
Again, we take the negative log likelihood to find the minimum, and constrain it to positive values:

J(x) ≡ φ_P(x) ≡ −ln Pr(y|x)
     = −ln P(y; Hx)
     = −ln ∏_i ((Hx)_i^{y_i} / y_i!) e^{−(Hx)_i}
     = −∑_i ln[ ((Hx)_i^{y_i} / y_i!) e^{−(Hx)_i} ]
     = ∑_i [ −y_i ln((Hx)_i) + ln(y_i!) + (Hx)_i ]  ≡  ∑_i D_i(x),   subject to x ≥ 0    (3.16)

Similarly, it can be shown that the second derivative of φ_P is positive everywhere, hence φ_P is convex. Thus finding the root of the gradient will yield the equation expressing the minimum value:

∇J(x) ≡ ∇φ_P(x) = ∇ ∑_i D_i(x) = ∑_i ∇D_i(x)    (3.17)

We know that ∇D_i(x) = [∂D_i/∂x_1, ∂D_i/∂x_2, ..., ∂D_i/∂x_j, ...]^T and that ∂(Hx)_i/∂x_j = h_{ij} (Eqn. B.8), so

∂D_i/∂x_j = ∂/∂x_j [ −y_i ln((Hx)_i) + ln(y_i!) + (Hx)_i ]
          = ∂/∂x_j [ (Hx)_i − y_i ln((Hx)_i) ]
          = h_{ij} − y_i h_{ij} / (Hx)_i    (3.18)


We can now say (B.3.19),

∇J(x) = 1 − H^T (y / Hx)    (3.19)

where the division y / Hx is element-wise. Noting that 1 − H^T(y/Hx) = ∇J(x) = V(x) − U(x), i.e. V(x) = 1 and U(x) = H^T(y/Hx), we recall the multiplicative iterative algorithm from section 3.1, eqn 3.10:

x^{(k+1)} = diag(x^{(k)} / V(x^{(k)})) U(x^{(k)})
          = diag(x^{(k)}) H^T ( y / Hx^{(k)} )    (3.20)

This equation is known as the Richardson–Lucy Algorithm (RLA) after its inventors [13].

3.2.3 Additive Gaussian Noise: ISRA

Pr(y|x) represents the probability that y is the observed data, given that the actual intensity is x, as represented by the model

y = Hx + n    (3.21)

Let us try to maximise Pr(y|x) assuming that each pixel is modelled as a random variable under additive Gaussian noise:

G(k; μ, σ) ≡ (1 / (σ √(2π))) exp( −(k − μ)² / (2σ²) )    (3.22)

where k is the random variable, μ is the mean and σ² is the variance.
We wish to find a single term to represent all the pixels in our data. We use a vector k ∈ R^N instead of a single variable k and take the product of all G(k_i), assuming all elements are independent of one another:

∏_i G(k_i; u_i, s_i) = ∏_i (1 / (s_i √(2π))) exp( −(k_i − u_i)² / (2s_i²) )    (3.23)

where u ∈ R^N is the corresponding vector of means and s ∈ R^N is the corresponding vector of standard deviations. The standard deviations of y are denoted by σ_y = (σ_{y_1}, ..., σ_{y_N}). We set the mean equal to Hx and obtain

Pr(y|x) ≡ ∏_i G(y_i; (Hx)_i, σ_{y_i}) = ∏_i (1 / (σ_{y_i} √(2π))) exp( −(y_i − (Hx)_i)² / (2σ_{y_i}²) )    (3.24)
Maximising this unconstrained equation will yield the least incorrect solution. To do this we take the negative log likelihood and minimise:

φ_G(x) ≡ −ln Pr(y|x) = −ln ∏_i (1 / (σ_{y_i} √(2π))) exp( −(y_i − (Hx)_i)² / (2σ_{y_i}²) )    (3.25)

We now expand it and constrain it to positive values:

J(x) ≡ φ_G(x) = −∑_i ln[ (1 / (σ_{y_i} √(2π))) exp( −(y_i − (Hx)_i)² / (2σ_{y_i}²) ) ]
             = ∑_i [ −ln(1 / (σ_{y_i} √(2π))) + (y_i − (Hx)_i)² / (2σ_{y_i}²) ]  ≡  ∑_i D_i(x),   subject to x ≥ 0    (3.26)

It can be shown that the second derivative of φ_G is positive everywhere, hence φ_G is convex. Thus finding the root of the gradient will yield the equation expressing the minimum value:

∇J(x) = ∇φ_G(x) = ∇ ∑_i D_i(x) = ∑_i ∇D_i(x)    (3.27)

where ∇D_i(x) = [∂D_i/∂x_1, ∂D_i/∂x_2, ..., ∂D_i/∂x_j, ...]^T. We calculate the j-th partial derivative (B.3.28):

∂D_i/∂x_j = ∂/∂x_j [ −ln(1 / (σ_{y_i} √(2π))) + (y_i − (Hx)_i)² / (2σ_{y_i}²) ]
          = (h_{ij} / σ_{y_i}²) ((Hx)_i − y_i)    (3.28)

Therefore (B.3.29),

∇J(x) = H^T W H x − H^T W y    (3.29)

where W = cov^{−1}(y), the inverse covariance matrix of y (see A.2).

Considering that H^T W H x − H^T W y = ∇J(x) = V(x) − U(x), with V(x) = H^T W H x and U(x) = H^T W y, we recall the multiplicative iterative algorithm from section 3.1, eqn 3.10:

x^{(k+1)} = diag(x^{(k)} / V(x^{(k)})) U(x^{(k)})
          = diag( x^{(k)} / (H^T W H x^{(k)}) ) H^T W y    (3.30)

This equation is known as the Image Space Reconstruction Algorithm (ISRA) and was originally developed by Daube-Witherspoon and Muehllehner [5].
3.2.3.1 Maintaining Positivity via Shifting the Data

However, since Gaussian noise may contain negative values, the positivity of y, and hence of H^T W y, is not ensured. We therefore shift y by a sufficiently large constant so that it contains only positive elements:

y' = Hx + n + d    (3.31)

where d = |min(y)|·1, the magnitude of the most negative element of y.
Since H is the circular convolution based on h, a normalised PSF, a constant vector convolved with it returns that same vector, i.e. H1 = 1 and hence Hd = d. So,

y' = Hx + n + d
   = Hx + n + Hd
   = H(x + d) + n
   = Hx' + n    (3.32)
where x' ≡ x + d. We are now trying to minimise

J(x') ≡ φ_G(x') = ∑_i [ −ln(1 / (σ_{y_i} √(2π))) + (y'_i − (Hx')_i)² / (2σ_{y_i}²) ]   subject to x' − d ≥ 0    (3.33)

It is simple to recalculate that ∇J(x') = V(x') − U(x') = H^T W H x' − H^T W y', and reformulating the Lagrangian with the constraint g(x') = x' − d yields

x'^{(k+1)} = d + diag( (x'^{(k)} − d) / V(x'^{(k)}) ) U(x'^{(k)})
           = d + diag( (x'^{(k)} − d) / (H^T W H x'^{(k)}) ) H^T W y'    (3.34)

This algorithm converges to x' = x + d. Taking x = x' − d yields the solution.

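The bookkeeping involved is small; the following MATLAB-style sketch iterates eqn 3.34 directly on 2D maps using FFT-based convolution (see section 3.3.1). The variable names, the PSF maps h and ht = trans(h), and the weight map s = 2D(1/σ²) are assumptions for illustration only.

% Sketch of the shifted-data ISRA iteration (eqn 3.34).
ht    = real(fft2(fft2(h))) / numel(h);                  % trans(h)
d     = abs(min(y(:)));                                  % shift magnitude
yp    = y + d;                                           % y' = y + d, strictly positive
xp    = ones(size(y)) + d;                               % x'(0) > d
HtWyp = real(ifft2( fft2(ht) .* fft2(s .* yp) ));        % U(x') = H'*W*y' (constant)
for k = 1:n_iter
    Hxp    = real(ifft2( fft2(h)  .* fft2(xp) ));        % H*x'
    HtWHxp = real(ifft2( fft2(ht) .* fft2(s .* Hxp) ));  % V(x') = H'*W*H*x'
    xp     = d + (xp - d) .* HtWyp ./ HtWHxp;            % eqn 3.34
end
x = xp - d;                                              % undo the shift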
3.2.3.2 Maintaining Positivity via Splitting the Data

Another method to ensure that U(x) and V(x) are positive is to split H^T W y into its positive and negative parts. Recall that H^T W H x − H^T W y = ∇J(x) = V(x) − U(x). We may split H^T W y such that H^T W y = (H^T W y)^+ + (H^T W y)^−.
We now choose V(x) = H^T W H x − (H^T W y)^− and U(x) = (H^T W y)^+. Inserting these terms into the multiplicative scheme (eqn 3.10):

x^{(k+1)} = diag(x^{(k)} / V(x^{(k)})) U(x^{(k)})
          = diag( x^{(k)} / (H^T W H x^{(k)} − (H^T W y)^−) ) (H^T W y)^+    (3.35)

3.2.4 Regularised ISRA and RLA

Many functions can be used to regularise the data. Regularisation allows a robust definition of the residue, and thus an exit criterion can be formulated. In the next chapter we shall go into detail about wavelet dictionaries and constraints using sparsity. As an introduction to the concept, wavelet regularisation is applied here to RLA and ISRA [20]. For a fuller understanding, one could read through the next chapter and then return to this section.
We start by defining the residual at each iteration k:

r^{(k)} = y − Hx^{(k)}   ⇔   y = Hx^{(k)} + r^{(k)}    (3.36)

and recall the RLA equation (3.20):

x^{(k+1)} = diag(x^{(k)}) H^T ( y / Hx^{(k)} )
          = diag(x^{(k)}) H^T ( (Hx^{(k)} + r^{(k)}) / Hx^{(k)} )    (3.37)

We use a deconstruction function to transform the residual r^{(k)}, process this representation (e.g. a frequency cutoff), and then use the corresponding reconstruction function to produce a denoised or filtered residue r̃^{(k)}. More formally, we take a dictionary of atoms or shapes; for this report it is apt to imagine a dictionary of wavelets. Deconstructing the residue into coefficients (A^T r^{(k)}), we hard-threshold them with a threshold level T_λ, for λ ∈ [3σ_{r^{(k)}}, 5σ_{r^{(k)}}], before applying a reconstruction S (where S A^T = I). This results in:

r̃^{(k)} = S( T_λ( A^T r^{(k)} ) )    (3.38)

and therefore

x^{(k+1)} = diag(x^{(k)}) H^T ( (Hx^{(k)} + r̃^{(k)}) / Hx^{(k)} )    (3.39)

Similarly, this can be done for ISRA:

x^{(k+1)} = diag( x^{(k)} / (H^T W H x^{(k)}) ) H^T W ( Hx^{(k)} + r̃^{(k)} )    (3.40)

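A minimal MATLAB-style sketch of the residual filtering step r̃ = S(T_λ(A^T r)) follows, assuming the Wavelet Toolbox functions wavedec2/waverec2 play the role of the deconstruction/reconstruction pair, a 4-level sym6 decomposition, and a threshold factor of 3 (one choice from the stated range).

% Sketch: wavelet hard-thresholding of the residual map r (2D).
r      = y - real(ifft2( fft2(h) .* fft2(xk) ));   % residual at iteration k
[c, S] = wavedec2(r, 4, 'sym6');                   % deconstruction A'*r (analysis)
t      = 3 * std(r(:));                            % threshold level in [3*sigma_r, 5*sigma_r]
c(abs(c) < t) = 0;                                 % hard-threshold: keep only significant coefficients
r_filt = waverec2(c, S, 'sym6');                   % reconstruction S(.) gives the filtered residue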
3.2.4.1 Exit Criteria

A major problem with the unregularised RLA and ISRA is that there is no robust exit criterion (Fig 3.1). One of the great advantages of the regularised algorithms is the inclusion of the filtered residue. Using the residue, as was done with CLEAN (section 2.4.2.1), an exit criterion can be formulated based on how noise-like the residue has become:

| σ_{r^{(k+1)}} − σ_{r^{(k)}} | / σ_{r^{(k)}} < ε    (3.41)

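In code this criterion is a one-liner; the sketch below assumes r_prev and r_curr hold the residual maps of two consecutive iterations and ε ≈ 10^−5.

% Sketch of the exit criterion (eqn 3.41).
epsilon   = 1e-5;
converged = abs(std(r_curr(:)) - std(r_prev(:))) / std(r_prev(:)) < epsilon;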
3.3 Algorithmic Implementation

3.3.1 Computational Operators for Convolution

We define 2D(x) = x̄ as the 1D-to-2D operator, transforming x ∈ R^{N²} into x̄ ∈ M_{N,N}; 1D(x̄) = x is the reverse operator, flattening the square matrix into a vector.
For x̄ ∈ M_{N,N}, x ∈ R^{N²} the flattened x̄, h the PSF map and H the PSF response matrix, we define:

CONV(h, x̄) = 2D(Hx), the convolution operator.
  Implemented with Fast Fourier Transforms: CONV(h, x̄) = invFFT(FFT(h) ⊙ FFT(x̄)).
CONV(trans(h), x̄) = 2D(H^T x), where trans(h) is an operator that flips h vertically and horizontally about the centre.
  trans(h) can be efficiently calculated by taking the real part³ of FFT(FFT(h))/N, where N is the number of elements in h.

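These operators map directly onto a few lines of MATLAB; the sketch below defines them as anonymous functions (the names are illustrative, not part of the operators defined above).

% Sketch of the computational operators (assumes h and x are same-size 2D maps).
CONV  = @(h, x) real(ifft2( fft2(h) .* fft2(x) ));   % CONV(h, x) = 2D(H*x)
TRANS = @(h)    real(fft2(fft2(h))) / numel(h);      % trans(h): h flipped about the centre
CONVT = @(h, x) CONV(TRANS(h), x);                   % CONV(trans(h), x) = 2D(H'*x)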
3.3.2 RLA

The Richardson–Lucy Algorithm (RLA) [19][13] is the iterative scheme for deconvolution under Poisson noise. Returning to the multiplicative scheme in eqn 3.20, we deduce an equation that maps more easily onto 2D maps and computational operators (B.3.42):

x̄^{(k+1)} = x̄^{(k)} ⊙ CONV( trans(h), ȳ / CONV(h, x̄^{(k)}) )    (3.42)

where x̄^{(k)} is the k-th estimate of the map, ȳ is the convolved data and h is the PSF. The initial guess can be set to any positive map; x̄^{(0)} = 1 is sufficient.
Algorithmically, MATLAB-esque pseudo-code is given:

RLA(y, h, x0, n)
    ht = real(FFT(FFT(h))) / numel(h)                   % trans(h)
    xk = x0
    for i = 0:n
        Hx = invFFT( FFT(h) .* FFT(xk) )                % CONV(h, xk)
        xk = xk .* invFFT( FFT(ht) .* FFT(y ./ Hx) )    % CONV(trans(h), y ./ Hx)
    end
    return xk

³ We take the real part as there may be imaginary components after the Fourier Transform.

where numel(h) is the number of elements in h, real(x) takes the real part of the complex-valued x, and .* and ./ are element-wise multiplication and division respectively.
We can also reduce the number of FFT operations per iteration from six to four by precalculating h_fft = FFT(h) and ht_fft = FFT(ht).

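For reference, a hypothetical call to the sketch above on simulated data might look as follows (poissrnd is from the Statistics Toolbox; the sky map and iteration count are placeholders).

% Hypothetical usage of RLA on a simulated 512x512 sky map `sky` and normalised PSF `h`.
dirty = real(ifft2( fft2(h) .* fft2(sky) ));   % noiseless convolved map
y     = poissrnd(max(dirty, 0));               % Poisson noise on the (non-negative) intensities
xk    = RLA(y, h, ones(size(y)), 200);         % 200 iterations; no robust exit criterion (Fig 3.1)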
3.3.3 ISRA

The Image Space Reconstruction Algorithm, originally developed by Daube-Witherspoon and Muehllehner [5], is the iterative algorithm used to deconvolve data under Gaussian noise. As for RLA above, we take eqn 3.35 and map it onto the computational operators (B.3.43):

x̄^{(k+1)} = x̄^{(k)} ⊙ CONV(h_t, s̄ ⊙ ȳ)^+ / [ CONV(h_t, s̄ ⊙ CONV(h, x̄^{(k)})) − CONV(h_t, s̄ ⊙ ȳ)^− ]    (3.43)

where s̄ ≡ 2D(1/σ²) and h_t ≡ trans(h).


This equation is better expressed in code:

ISRA(y, h, x0, s, n)
    ht = real(FFT(FFT(h))) / numel(h)                    % trans(h)
    Ht_Wy       = invFFT( FFT(ht) .* FFT(s .* y) )
    Ht_Wy_plus  = max( Ht_Wy, 0 )                        % all negative values are set to 0
    Ht_Wy_minus = min( Ht_Wy, 0 )                        % all positive values are set to 0
    xk = x0
    for i = 0:n
        Hx     = invFFT( FFT(h) .* FFT(xk) )
        Ht_WHx = invFFT( FFT(ht) .* FFT(s .* Hx) )
        xk     = xk .* Ht_Wy_plus ./ ( Ht_WHx - Ht_Wy_minus )
    end
    return xk
where numel(h) is the number of elements in h, real(x) takes the real part of the complex-valued x, and .* and ./ are element-wise multiplication and division respectively.
The number of FFT operations per iteration can be reduced from six to four by precalculating h_fft = FFT(h) and ht_fft = FFT(ht).

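A hypothetical call on simulated data with Gaussian noise of known standard deviation σ might look as follows (the names are placeholders).

% Hypothetical usage of ISRA; sky and h are a 512x512 sky map and normalised PSF.
sigma = 5;
y  = real(ifft2( fft2(h) .* fft2(sky) )) + sigma * randn(size(sky));
s  = ones(size(y)) / sigma^2;                 % 2D(1/sigma^2): the diagonal of W as a map
xk = ISRA(y, h, ones(size(y)), s, 500);       % 500 iterations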


Figure 3.1: RLA after 100, 200, 400, 600 iterations and the simulated sky model (bottom left). The
graph (bottom right) shows the error from the original image. After around 200 iterations the error
starts to increase. After 600 iterations, the image has deviated significantly from any correct solution.
There is no way of knowing when this error is minimised as it requires knowledge of the original intensity.
An estimate of the ideal number of iterations is not robustly defined.

Chapter 4

Maximum a Posteriori (MAP)


Maximum Likelihood, unfortunately, can produce ill-conditioned problems, whereas the Maximum a Posteriori approach, based on Bayesian mathematics, has proven a far more effective path [21].
Instead of trying to maximise Pr(y|x), which yields the Maximum Likelihood solution, we now attempt to maximise Pr(x|y). This represents the probability that x is the actual intensity map of the sky, given that we have observed y.
y = Hx + n    (4.1)

We calculate the best solution using Bayes' Theorem:

x_MAP ≡ arg max_x Pr(x|y) = arg max_x Pr(y|x) Pr(x) / Pr(y) = arg max_x Pr(y|x) Pr(x)    (4.2)

Pr(y) can be dropped from the maximisation: y has already been observed, so Pr(y) is a constant that does not depend on x.
As before, we consider minimising φ_MAP, the negative log probability:

φ_MAP(x) = −log(Pr(y|x) Pr(x))
         = −log(Pr(y|x)) − log(Pr(x))
         = φ_ML(x) + φ_prior(x)    (4.3)

The difference from the Maximum Likelihood method is clearly the addition of the prior, or a priori penalty. This prior is a qualitative addition or regularisation to the problem and is used to represent already known information about the data.
For instance, if we know the final image will have a certain smoothness, we would make the prior represent some form of roughness, say the variance of the data, so that we may minimise it. The solution will then be the one that is maximally likely as well as maximally smooth. A hyperparameter, λ, (A.3) is usually included to weight the prior, for instance φ_prior(x) = λ·var(x).
The prior that will be used in the upcoming section is one which represents sparsity.

4.1 Sparse Representation

4.1.1 Concepts and Definitions

The gist of sparse representation is to use a predefined dictionary of shapes, or atoms, and a vector of coefficients to pick out a subset of those atoms. We hope to be able to synthesise x with the use of this dictionary.
For instance, we take some shape, be that a wavelet, a pillbox, a point or whatever shape is felt will represent the data best. Let us define s̄ as a K × K map containing only this shape, centred at (1, 1). We take its flattened version s = [s_1, s_2, ..., s_N]^T for N = K².
In the same way that H was defined in section 2.1.3, we can define an example dictionary

B = [ s_1      s_N      s_{N−1}  ...  s_2
      s_2      s_1      s_N      ...  s_3
      s_3      s_2      s_1      ...  s_4
      ...      ...      ...      ...  ...
      s_N      s_{N−1}  s_{N−2}  ...  s_1 ]    (4.4)

This is simply our shape flattened and shifted to every point on the map. Each column is called an atom
and represents a flattened map of some description.
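As shown in the toy MATLAB-style sketch below, such a dictionary is just the circulant matrix built from the flattened shape; the sizes are purely illustrative, and in practice B is never formed explicitly.

% Toy construction of the shift dictionary B of eqn 4.4 from a small K x K shape.
K = 8;  sbar = zeros(K);  sbar(1,1) = 1;  sbar(1,2) = 0.5;  sbar(2,1) = 0.5;  % shape at (1,1)
s = sbar(:);  N = numel(s);
B = zeros(N);
for j = 1:N
    B(:, j) = circshift(s, j - 1);   % atom j: the flattened shape shifted to position j
end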
The atoms need not be defined as simply shifted versions of some base atom. For instance a dictionary may
contain atoms that can change shape depending on their position or shift. In this case we say the shape has
shift variance. In general, shift invariance is preferred but is not always possible owing to the way certain
shape functions are used in computation, namely wavelets.
It is also possible to create a dictionary that contains two or more shapes. Say we have a number of
dictionaries B1 , B2 , ..., BM ; we can define a dictionary by simply concatenating their columns/atoms. We
denote this by D = [B1 B2 ... BM ]
For this reason a dictionary is defined as an N × L matrix, for L the total number of atoms.
We now take some vector γ ∈ R^L and assume we can model x by some linear combination of atoms in some dictionary D. Mathematically, we assume x = Dγ. Furthermore, we wish γ to be sparse, which means that most of the elements of γ are zero; only a few atoms will then model the image.
The following expansion illustrates this sparse representation model. Noting that D_i is simply the vector equal to the i-th atom (or column) of D,

x = Dγ = γ_1 D_1 + γ_2 D_2 + ... + γ_L D_L    (4.5)

(see section B.4.5 for proof).


We can now see that the elements of γ are simply multipliers for each of the atoms in D. For this reason they are referred to as the coefficients of the sparse representation. It is our goal to find a γ that reconstructs x exactly.
There is no guarantee, nor expectation, that the sparse representation of x will actually equal x in a strict sense. However, we assume that it can, and we seek as close an approximation to x as is possible.

4.1.2 Definitions of Sparsity

4.1.2.1 Intuitively

Sparsity in its simplest terms means a vector or signal with few non-zero elements. Sparsity can also be seen as a measure of how compressible a set of data is, or in other words, how easily the data can be described by a few elements. If a signal were composed of 4 sine waves, for example, we would only need to know the definition of the sine wave and those 4 coefficients. If the signal were random noise, however, representing it as a composition of sine waves would take a huge number of coefficients to approximate it sufficiently.
This of course depends heavily on the dictionary upon which the coefficients are based. Certain signals might be represented very well by a dictionary of wavelets, while for another signal sine waves might compose a more precise representation.
4.1.2.2 Descriptively

To solidify this notion, we say that a signal is strictly sparse if most of its coefficients are zero given some dictionary. For natural signals, however, a less prohibitive definition is more effective. A weakly sparse signal is one whereby a few coefficients have large magnitudes and the rest are comparatively close to zero. These weakly sparse signals are also referred to as compressible.
Natural signals usually have specific properties, and with the use of a representative dictionary they can be sparsified, albeit only in the weak, and not the strict, sense.
Effectively sparsifying data depends heavily on the dictionary used. A small and simple dictionary will struggle to represent the data in a few coefficients; a large dictionary with more complex entries will usually give a more representative result.
Compressibility is similar to being able to express ideas in language: given a small number of simple words, an idea might take many of those words to be expressed fully, if it can be expressed at all. With a large number of complex and detailed words, an idea could be expressed in a few of these words. Furthermore, certain dictionaries just might not have the right words for what is to be expressed. Alternatively, a combination of complex and simple words might express an idea in even fewer words than complex words alone. [15]
4.1.2.3 Formally

We now define a way to measure these notions in a mathematically rigorous fashion. A simple measure is to count the number of non-zero elements. This is called the l_0 norm, defined as ||x||_0 = #{i : x_i ≠ 0} for x a real-valued vector¹.
The l_0 norm is not continuous, which has unavoidable implications for convexity in the following sections. In those cases we can use the l_p norm (p ∈ R₊), defined as ||x||_p = (∑_i |x_i|^p)^{1/p}.
The l_1 norm is simply the sum of the absolute values of the elements, and the l_2 norm is the Euclidean norm. In the following section, an l_p norm with 0 < p ≤ 1 is used in the prior that describes sparsity; p = 1 is the smallest exponent for which the norm remains convex, and convexity of the minimised term is an important assumption made when formulating the problem (3.1.2).

¹ Where #A denotes the number of elements in some set A.

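A quick numeric check of these measures on a toy vector:

% Sparsity measures on a toy strictly sparse vector (illustrative only).
x  = [0 0 3 0 -1 0 0 2]';
l0 = nnz(x);                       % l0 "norm": number of non-zero elements -> 3
l1 = sum(abs(x));                  % l1 norm -> 6
l2 = norm(x);                      % l2 (Euclidean) norm -> sqrt(14)
p  = 0.5;
lp = sum(abs(x).^p)^(1/p);         % lp norm as defined above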
4.2 MAP Synthesis

We can now formulate the sparse representation of our data more rigorously. We start with our model and assume that x ∈ R^N can be modelled by a dictionary D of size N × L and a sparse vector of coefficients γ ∈ R^L.
Since y = Hx + n and we assumed that x = Dγ, we define our MAP-synthesis model:

y ≡ HDγ + n    (4.6)

We recall our attempt to minimise the negative log of Pr(x|y) (3.2.3 & 3.2.2). From this we can see that we need only formulate φ_ML and define φ_prior:

x_MAP = arg min_x φ_MAP(x) = arg min_x [ φ_ML(x) + φ_prior(x) ]    (4.7)

where x_MAP is the desired solution of this MAP-synthesis model.

4.2.1 The Maximum Likelihood Term

Let us consider the problem for Gaussian additive noise². As before (3.2.3), we take the maximum likelihood of the Gaussian. This time, however, we use the multi-dimensional definition of the Gaussian distribution [1]:

G(x; u, Σ) = (1 / ((2π)^{N/2} |Σ|^{1/2})) exp( −(1/2) (x − u)^T Σ^{−1} (x − u) )    (4.8)

where u ∈ R^N is the vector of means and Σ is the covariance matrix. We model y as a multi-dimensional Gaussian with mean equal to HDγ and variances σ² ∈ R^N. Since we assume all the variables in y to be independent of one another, Σ = diag(σ²). Thus,

φ_G ≡ −log(G(y; u, Σ))
    = −log[ (1 / ((2π)^{N/2} |Σ|^{1/2})) exp( −(1/2) (y − u)^T Σ^{−1} (y − u) ) ]
    = −log( 1 / ((2π)^{N/2} |Σ|^{1/2}) ) + (1/2) (y − u)^T Σ^{−1} (y − u)    (4.9)

and hence,

arg min_x φ_MAP = arg min_x ( φ_G + φ_prior )
               = arg min_x [ −log( 1 / ((2π)^{N/2} |Σ|^{1/2}) ) + (1/2) (y − u)^T Σ^{−1} (y − u) + φ_prior ]
               = arg min_x [ (1/2) (y − u)^T Σ^{−1} (y − u) + φ_prior ]    (4.10)

We ignore the first term, −log(1/((2π)^{N/2}|Σ|^{1/2})), since we are minimising over x and it does not depend on x.
For our model we set the mean u = Hx = HDγ and write R = Σ^{−1}, the inverse covariance matrix of y:

arg min φ_MAP = arg min [ (1/2) (HDγ − y)^T R (HDγ − y) + φ_prior ]    (4.11)

² The Poisson case may also be formulated; however, it is not included in this report.

4.2.2 The Prior Term

This leaves only φ_prior to be defined. We assumed that γ is sparse, so we need a prior that penalises departures from this constraint, as mentioned in the introduction to this chapter. The l_p norm is used in the definition of the prior:

φ_prior = λ ||γ||_p^p    (4.12)

where λ is the hyperparameter of the prior (A.3).

We must ensure that we are able to define Pr(x) such that our chosen φ_ML(x) will be obtained [7]. Naïvely we can simply define it as

Pr(x) = Pr(Dγ) = c · e^{−λ||γ||_p^p}    (4.13)

where c is a constant and λ is again the hyperparameter. Defining Pr(x) like this forces φ_prior = λ||γ||_p^p + a for some constant a.
This definition allows any γ, but we would like our model to consider γ only if it is an exact representation. So we define a set Γ_x = {γ ∈ R^L | Dγ = x} that contains all possible γ values that reconstruct x. This set may be empty, a singleton, or it may contain multiple solutions. Multiple solutions arise from an overly large number of atoms, i.e. when L > N.
This, however, does not account for dictionaries for which there is no γ that represents x; we would like the probability of such a case to be 0. We therefore define

Pr(x) = Pr(Dγ) = { c · e^{−λ||γ||_p^p}   if Γ_x ≠ ∅
                 { 0                      otherwise    (4.14)

Thus φ_prior = λ||γ||_p^p. However, this does not guarantee a unique optimal solution, so we add a minimising argument in place of γ:

Pr(x) = { c · e^{−λ||γ*(x)||_p^p}   if Γ_x ≠ ∅
        { 0                          otherwise,   where γ*(x) = arg min_{γ ∈ Γ_x} ||γ||_p^p    (4.15)

4.2.3 The MAP-Synthesis Model

We have finished. The final MAP-synthesis model for a signal under additive Gaussian noise is:

x_MAP = arg min_x φ_MAP(x) = arg min_x [ φ_ML(x) + φ_prior(x) ]
      = arg min_γ [ (1/2) (HDγ − y)^T R (HDγ − y) + λ||γ||_p^p ]    (4.16)

4.3 Matching Pursuit Algorithm

4.3.1 Description of the Algorithm

The Matching Pursuit Algorithm was pioneered by [16] and computes an approximation to the MAP-synthesis solution³. The algorithm is greedy: it attempts to approximate γ by iteratively maximising the probability function, modifying one element of γ at a time.
We start with the Maximum Likelihood function of a signal under additive Gaussian noise, from eqn 4.16:

y = HDγ + n
φ_ML(x) = (1/2) (HDγ − y)^T R (HDγ − y)    (4.17)

where y ∈ R^N is our data, H is the N × N PSF operator, D is an N × L dictionary, γ ∈ R^L is the desired vector of coefficients, Dγ = x ∈ R^N is the solution and R is the inverse covariance matrix of y. Since all variables in y are independent, R = diag(1/σ_{y_i}²).⁴

Instead of minimising on x, we minimise on γ (B.4.18):

arg min_x φ_ML(x) = arg min_γ φ_ML(Dγ)
                  = arg min_γ (1/2) ||R^{1/2} H D γ − R^{1/2} y||²
                  = arg min_γ (1/2) ||C γ − z||²    (4.18)

where we define z = R^{1/2} y and C = R^{1/2} H D. Here z is the whitened data, with covariance matrix equal to I, the identity matrix, and C is a dictionary of atoms that have been convolved with the PSF and then whitened: a PSF- and noise-adjusted dictionary, if you will. We are now able to say

z = Cγ + η,   where η = R^{1/2} n    (4.19)

We define a starting residual r^{(0)} = z and an initial γ^{(0)} = 0, and take an iterative, greedy approach. We assume that at each step the residual is made up of a single atom plus noise. This is not actually the case, but this approach allows us to ignore all other signals and find a good estimate for a single coefficient of γ. Once this coefficient is found, the algorithm simply repeats until a stopping criterion is met.
More formally, we assume

r^{(k)} = γ̃_i C_i + η   for some i ∈ [1, ..., L]    (4.20)

where γ̃ is the desired coefficient vector.


If we consider maximising a single coefficient γ̃_i for a known i, we maximise the likelihood of eqn 4.20:

γ̃_i = arg max_{γ̃_i} Pr(r^{(k)} | γ̃_i C_i)
    = arg min_{γ̃_i} (1/2) (γ̃_i C_i − r^{(k)})^T I (γ̃_i C_i − r^{(k)})
    = arg min_{γ̃_i} (1/2) ||γ̃_i C_i − r^{(k)}||²    (4.21)

³ The effect of the prior is implicit in the algorithm, as will be explained later in this section.
⁴ Where σ_{y_i}² is the variance of y_i.

Since (1/2)||γ̃_i C_i − r^{(k)}||² ≥ 0, to minimise this term we calculate the following (B.4.22):

C_i^T ( γ̃_i C_i − r^{(k)} ) = 0   ⇒   γ̃_i = ⟨r^{(k)}, C_i⟩ / ||C_i||²    (4.22)

To find the best atom we minimise over the index i, rather than over γ̃_i (B.4.23):

m = arg min_i (1/2) ||γ̃_i C_i − r^{(k)}||² = arg max_i ⟨r^{(k)}, C_i⟩² / ||C_i||² = arg max_i |⟨r^{(k)}, C_i⟩| / ||C_i||    (4.23)

We then update the coefficient vector and remove the m-th atom's contribution from the residue:

γ^{(k+1)} = γ^{(k)} + δ,   where δ_i = { ⟨r^{(k)}, C_i⟩ / ||C_i||²   if i = m
                                       { 0                           otherwise    (4.24)

r^{(k+1)} = r^{(k)} − γ̃_m C_m = r^{(k)} − ( ⟨r^{(k)}, C_m⟩ / ||C_m||² ) C_m

x is simply reconstructed from the dictionary and the coefficient vector:

x^{(k)} = D γ^{(k)}    (4.25)

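Putting eqns 4.22–4.25 together, a small-scale MATLAB-style sketch of Matching Pursuit over an explicit (whitened) dictionary matrix C follows. This is illustrative only: in practice C is never formed explicitly, and the inner products are computed with FFT/FWT operators as described in section 4.3.2. The threshold t implements the exit criterion of section 4.3.1.1 below.

% Matching Pursuit sketch: C is N x L, z is the whitened data, t the significance threshold.
function gamma = matching_pursuit(C, z, t, max_iter)
    [~, L]    = size(C);
    gamma     = zeros(L, 1);
    r         = z;                                % r(0) = z
    col_norms = sqrt(sum(C.^2, 1))';              % ||C_i||, precalculated once
    for k = 1:max_iter
        corr     = C' * r;                        % <r(k), C_i> for all i
        [sig, m] = max(abs(corr) ./ col_norms);   % most significant atom (eqn 4.23)
        if sig < t                                % exit criterion (eqn 4.26)
            break
        end
        a        = corr(m) / col_norms(m)^2;      % optimal coefficient (eqn 4.22)
        gamma(m) = gamma(m) + a;                  % coefficient update (eqn 4.24)
        r        = r - a * C(:, m);               % residual update
    end
end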
4.3.1.1 Exit Criteria

Stopping criteria are usually based on some characteristic of the residual: either ||r||² or max(|r|) falling below a user-determined threshold. Similarly, one can use the same stopping criterion as was used in CLEAN and the regularised ISRA (2.4.2.1):

| σ_{r^{(k+1)}} − σ_{r^{(k)}} | / σ_{r^{(k)}} < ε

where σ_r is the standard deviation of the residue r and ε is a predefined constant, usually a small number in the range 10^−5 to 10^−7.
A more robust exit criterion arises when the best coefficient is selected for removal from the residue. Each coefficient has a (normalised) significance value, |⟨r^{(k)}, C_i⟩| / ||C_i||. The atom with the highest significance value is the one removed from the residue. When this maximum significance falls below a preset threshold (usually in [1, 4]), the best atom can no longer be said to have a significant effect on the result and we stop the algorithm. More formally, the algorithm exits when

|⟨r^{(k)}, C_m⟩| / ||C_m|| < t    (4.26)

where m is the index of the most significant atom (as defined in eqn 4.23) and t ∈ [1, 4].

4.3.2 Computational Considerations

In eqn 4.23 the best index of the coefficient vector needs to be found. This involves calculating |⟨r^{(k)}, C_i⟩| / ||C_i|| for all i. This is typically split into two stages: 1) calculation of |⟨r^{(k)}, C_i⟩| for all i, which must be done at every iteration, and 2) calculation of ||C_i|| for all i, which can be precalculated.

Let us take a typical dictionary made up of the identity matrix and a wavelet dictionary. More formally, D = [I W]. This means we define C = R^{1/2} H D = [(R^{1/2} H I) (R^{1/2} H W)].

Figure 4.1: Given an image x, we take its wavelet decomposition, WAVDEC_W(x) = α ∈ R^M. As shown above, α is a vector of M coefficients. Scaling the wavelets by the corresponding coefficients and composing them together results in the original image. Reconstruction is this composition; however, a Fast Wavelet Transform is used instead of simple addition.

4.3.2.1 Wavelet Computational Operators

Before we continue, it is useful to define some computational operators. It is common to use FFT and FWT operations to perform matrix multiplications and various dictionary-based operations. Some slightly informal definitions are given.
We define 2D(x) = x̄ as the 1D-to-2D operator, transforming x ∈ R^{N²} into x̄ ∈ M_{N,N}; 1D(x̄) = x is the reverse operator, flattening the square matrix into a vector.
In a previous section (3.3.1) we defined the computational convolution operators; we follow in a similar vein for the wavelet deconstruction and reconstruction operators.
For x̄ ∈ M_{N,N}, h the PSF map, H the PSF response matrix and W a wavelet dictionary, we define:

CONV(h, x̄) = 2D(Hx), the convolution operator.
  Implemented with Fast Fourier Transforms: CONV(h, x̄) = invFFT(FFT(h) ⊙ FFT(x̄)).
CONV(trans(h), x̄) = 2D(H^T x), where trans(h) is an operator that flips h vertically and horizontally about the centre.
  trans(h) can be efficiently calculated by taking the real part⁵ of FFT(FFT(h))/N, where N is the number of elements in h.
WAVDEC_W(x̄) = W^T x is the Wavelet Deconstruction (analysis) function corresponding to the wavelet W.
  This is usually implemented with a Fast Wavelet Transform and effectively decomposes the input image into a map of coefficients over multiple wavelet resolutions (see Figure 4.1).
WAVREC_W(ᾱ) = Wα is the Wavelet Reconstruction function corresponding to the wavelet W.
  This is usually implemented with an inverse Fast Wavelet Transform, taking a coefficient map and transforming it back into an image.
4.3.2.2 Computing the Coefficients

With this, we can compute the required inner products (B.4.27). Since C_i = (R^{1/2} H D)_i and R is symmetric, ⟨r^{(k)}, C_i⟩ = D_i^T H^T R^{1/2} r^{(k)}; stacking all i,

[ ⟨r^{(k)}, C_1⟩, ⟨r^{(k)}, C_2⟩, ... ]^T = D^T H^T r̃^{(k)}    (4.27)

where r̃^{(k)} = R^{1/2} r^{(k)} is the whitened residue, with elements r̃_i^{(k)} = r_i^{(k)} / σ_{y_i}. Splitting the dictionary into its separate parts,

D^T H^T r̃^{(k)} = [I W]^T H^T r̃^{(k)} = [ H^T r̃^{(k)} ; W^T H^T r̃^{(k)} ]    (4.28)

To simplify the following explanation, let r̄^{(k)} = 2D(r̃^{(k)}) denote the whitened residual as a 2D map; h denotes the PSF map.
To calculate H^T r̃^{(k)} we simply take

2D(H^T r̃^{(k)}) = CONV(trans(h), r̄^{(k)})    (4.29)

To calculate W^T H^T r̃^{(k)} we take

2D(W^T H^T r̃^{(k)}) = WAVDEC_W(2D(H^T r̃^{(k)})) = WAVDEC_W(CONV(trans(h), r̄^{(k)}))    (4.30)

⁵ We take the real part as there might be imaginary components after the Fourier Transforms.

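In MATLAB-style code, using the operators above and assuming the Wavelet Toolbox's wavedec2 as WAVDEC_W, these two correlations are computed as follows; rk here stands for the whitened residual as a 2D map and ht for trans(h).

% Sketch of eqns 4.29-4.30: correlations of the residual with both dictionaries.
HTr        = real(ifft2( fft2(ht) .* fft2(rk) ));   % 2D(H' * r), eqn 4.29 (identity atoms)
[WTHTr, S] = wavedec2(HTr, 4, 'sym6');              % WAVDEC_W(2D(H' * r)), eqn 4.30 (wavelet atoms)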
4.3.2.3 Computing the Norms

Calculation of ||C_i|| is somewhat simpler. For the identity part of the dictionary,

||C_i|| = ||(R^{1/2} H I)_i|| = ||R^{1/2} H I_i|| = σ_i^{−1} ||H I_i||,

and for the wavelet part,

||C_i|| = ||(R^{1/2} H W)_i|| = ||R^{1/2} H W I_i|| = σ_i^{−1} ||H W I_i||.    (4.31)

Here I_i is simply the vector whose i-th element is 1 and which is 0 everywhere else. For each i one computes 2D(H I_i) = CONV(h, Ī_i), and then uses this to calculate the norm:

∀i,  ||H I_i|| = √⟨H I_i, H I_i⟩ = √((H I_i)^T (H I_i))

Since the PSF is shift-invariant, ||H I_i|| = ||H I_j|| for all i, j, so it need only be calculated once.
For the wavelet dictionary, take each i and compute 2D(H W I_i) = CONV(h, 2D(W I_i)) = CONV(h, WAVREC_W(Ī_i)). Using this we calculate the norm:

∀i,  ||H W I_i|| = √⟨H W I_i, H W I_i⟩ = √((H W I_i)^T (H W I_i))

Since wavelet transforms are generally shift-variant (see Figure 4.2), this needs to be calculated once for each i. For any discrete wavelet transform, subsets of this operation are shift-invariant, and one norm calculation will evaluate to the same value for all elements in that subset.

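A MATLAB-style sketch of this precalculation follows; it assumes the Wavelet Toolbox's waverec2 as WAVREC_W, with S the bookkeeping matrix and M the coefficient-vector length obtained from a previous wavedec2 of a map of the working size (both are assumptions for illustration).

% Norm precalculation sketch (eqn 4.31, up to the 1/sigma_i factors).
e      = zeros(size(h));  e(1,1) = 1;                          % 2D(I_i): a unit impulse
normHI = norm( real(ifft2( fft2(h) .* fft2(e) )), 'fro' );     % ||H*I_i||, identical for all i
normHW = zeros(M, 1);                                          % wavelet atoms are shift-variant
for i = 1:M
    ci = zeros(1, M);  ci(i) = 1;                              % coefficient vector I_i
    wi = waverec2(ci, S, 'sym6');                              % 2D(W*I_i): the i-th wavelet atom
    normHW(i) = norm( real(ifft2( fft2(h) .* fft2(wi) )), 'fro' );
end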
4.3.3 Manifestation of the Prior

This algorithm does not explicitly define the prior term that represents sparsity. It is manifest in the iterative process, whereby the initial iterations produce high coefficient values which then drop off in later iterations. Considering that most solutions converge within 5000 iterations (512×512 data) and that the most significant coefficient values are calculated in the first 400-500 iterations, we can easily see that the coefficient vector will be sparse. At most there are 5000 non-zero coefficients in a vector of size at least 512² = 262144, so at most 5000/262144 ≈ 1.91% of the coefficients are non-zero. The coefficient vector can at least be said to be weakly sparse, considering that at most 500/262144 ≈ 0.19% of the coefficients will be significant. See Figure 4.3.


Figure 4.2: Top Left: Centred PSF. Top Right: Off-centre PSF. Bottom Left: A wavelet at one location. Bottom Right: The same wavelet at another location. The PSF is shift-invariant, hence the PSF at one location is simply a shifted version of the PSF at any other. Since the wavelet transform is shift-variant, it is not guaranteed that a wavelet will be a shifted version of any other wavelet, as demonstrated above.


Figure 4.3: The green and yellow lines represent the significance of each iteration's effect. Once the significance reaches a threshold, the algorithm is no longer making significant changes to the solution and is stopped. These are just 3 cases, but all others follow a similar pattern in that the first 500 iterations contain the bulk of the significant changes (note that the y-axis is logarithmic).


4.4 Experimentation

4.4.1 Set up

Tests were run on various simulated sky maps to measure the ability to deconvolve diffuse, point-like and other complex sources. Varying levels of Gaussian noise were added to each sky map to gain insight into the robustness of the algorithm. All tests used the same PSF.
Besides comparing this algorithm with others, there are a number of variables within the algorithm that can be tested, namely: loop-gain, noise and the dictionary used.
As in the CLEAN algorithm, the loop-gain variable controls how much of the atom is removed from the residual and added to the reconstructed map at each iteration. It can also be seen as a dampening factor. The lower the loop-gain, the longer it will take to reach a solution; however, it would be expected that a lower loop-gain gives slightly better, more reliable results.
The choice of dictionary (or dictionaries) is also very important. In these tests, three cases were used: the identity dictionary, which makes the MP-synthesis algorithm almost equivalent to CLEAN; a wavelet dictionary (the Symlet wavelet, or sym6 in MATLAB); and a combination of both identity and sym6.
The three main metrics used were Time to Complete, Number of Iterations to Complete and the Error Term, which is calculated against the ideal data using an element-wise squared difference.

4.4.2 Analysis of Dictionaries

4.4.2.1 The Identity Dictionary

As expected, this dictionary performed much like CLEAN. Point-like sources were easily detected, as were complex sources. Diffuse sources were resolved; however, they appear very spotty, as the algorithm was picking points rather than areas, especially under heavy noise (see figures 4.4, 4.5 and 4.6).


Error Term
Average
Std Dev
ttest(I, Sym6)
ttest(Sym6, I+Sym6)
ttest(I, I+Sym6)
Time to Complete
Average
Std Dev
ttest(I, Sym6)
ttest(Sym6, I+Sym6)
ttest(I, I+Sym6)
Number Of Iterations
Average
Std Dev
ttest(I, Sym6)
ttest(Sym6, I+Sym6)
ttest(I, I+Sym6)
Time per Iteration
Average
Std Dev
ttest(I, Sym6)
ttest(Sym6, I+Sym6)
ttest(I, I+Sym6)

Ignoring Maxed Iteration Cases


I
Sym6
I + Sym6

Including Maxed Iteration Cases
I
Sym6
I + Sym6

2,557,330.44 2,866,545.61 2,662,928.56


233,065.86
173,094.53
220,805.28
0.015
0.108
0.469

2,909,585.86 3,176,364.45 2,991,427.04


905,234.03
560,490.70
783,232.61
0.397
0.514
0.914

950.78
503.00
0.601

1,206.21
1,054.26

929.98
550.86

1,572.79
1,002.74
0.022

3,807.82
2,814.88

0.586

0.951

0.833
3,404.63
1,719.20
0.114

0.022

1,871.50
1,612.13

1,423.00
804.03

5,603.42
3,525.52
0.840

5,936.25
4,382.41

0.560
0.926

638.41
14.28

645.85
19.85

278.33
11.11
0.000

639.64
21.00

0.475
0.000

5,712.00
4,512.39

0.903

0.010
76.67
13.38
0.000

3,883.01
3,110.61

664.69
28.54

0.024
0.000

Table 4.1: This table outlines various metrics and the differences they exhibit depending on the dictionary used. A Student t-test was used to determine whether each difference is significant (marked in pink for p < 0.05 and gray for p < 0.15). Since some cases did not run to completion (they reached the maximum iteration limit), statistics are reported both for the full set of runs and for the set excluding the runs that hit the iteration limit.

Timing Results: Whilst the identity dictionary alone took slightly more iterations to complete compared
to the Symlet+Identity and Symlet dictionaries (see table 4.1), the algorithm ran approximately 2 to 8
times faster per iteration. The only difference between these two groups is the use of the FWT. Since all
dictionaries must use the Fast Fourier Transform for Convolution, the FWT is the obvious reason for this
slowdown. Despite the faster run time per iteration, the Symlet and Symlet+Identity dictionary completed
in close to the same amount of time as the Identity dictionary.
4.4.2.2 The Symlet Dictionary

Compared to the Identity dictionary, the Symlet dictionary handles diffuse sources far better, owing to its ability to use large, low-resolution atoms. Point sources are reconstructed just as well (figure 4.6). Overall, the number of iterations needed to reach the exit criterion was lower than with the Identity dictionary, but the extra computation per iteration means that it takes about the same time to execute.
Even though diffuse sources are resolved more smoothly, by the error-term metric the Identity dictionary actually performs better (table 4.1).

                      Ignoring Maxed Iteration Cases            Including Maxed Iteration Cases
Loop gain:            0.1            0.5            1.0          0.1            0.5            1.0
Error Term
  Average             2,495,514.47   2,661,422.65   2,861,777.60 2,621,670.44   2,896,499.78   3,559,207.14
  Std Dev             162,742.93     148,527.35     258,854.07   242,533.54     407,687.98     1,027,607.59
  ttest(0.1, 0.5)     0.085                                      0.060
  ttest(0.5, 1.0)     0.107                                      0.056
  ttest(0.1, 1.0)     0.007                                      0.006
Time to Complete
  Average             1,634.40       849.29         667.43       3,485.56       2,934.50       2,843.55
  Std Dev             835.83         517.75         341.56       2,400.69       2,814.23       2,901.33
  ttest(0.1, 0.5)     0.081                                      0.611
  ttest(0.5, 1.0)     0.455                                      0.939
  ttest(0.1, 1.0)     0.009                                      0.392
Number Of Iterations
  Average             3,001.17       2,365.57       1,776.86     6,501.08       5,547.00       5,203.58
  Std Dev             741.67         2,251.35       1,545.81     3,689.58       4,268.83       4,386.04
  ttest(0.1, 0.5)     0.503                                      0.564
  ttest(0.5, 1.0)     0.580                                      0.848
  ttest(0.1, 1.0)     0.049                                      0.301
Time per Iteration
  Average             528.79         485.39         478.36       531.20         527.03         524.44
  Std Dev             201.73         187.45         192.26       188.61         182.04         186.06
  ttest(0.1, 0.5)     0.698                                      0.956
  ttest(0.5, 1.0)     0.946                                      0.973
  ttest(0.1, 1.0)     0.312                                      0.679

Table 4.2: A similarly generated Student t-test table; see the description of table 4.1.

4.4.2.3 The Symlet+Identity Dictionary

Fortunately, the combination of the two dictionaries gives a best-of-both-worlds situation. It resolves the diffuse sources as well as the Symlet-only dictionary, but maintains an error not significantly different from that of the Identity dictionary.
Performance-wise, it is not significantly slower than the Symlet-only dictionary, either per iteration or in terms of total time (table 4.1), although it is still not as fast as the Identity-only dictionary.

4.4.3 Analysis of Loop-Gains

As can be seen in table 4.2, the loop gain has a significant effect on the error term: a lower loop-gain leads to a more accurate reconstruction. One might say, based on the statistics, that the major disadvantage of a lower loop-gain is that it takes, on average, twice as long to complete. However, most of the cases that reached the maximum number of iterations, especially those using the Symlet dictionary, did so owing to the high level of noise as much as to the low loop gain. When considering all the cases, the statistical significance of the time differences disappears, since all cases take a similar number of iterations to complete.

Figure 4.4: Loop gain (y) vs Dictionary (x) comparison, using a noise level of σ = 5. Error terms are given underneath the images.

Figure 4.5: Loop gain (y) vs Dictionary (x) comparison, using a noise level of σ = 10. Error terms are given underneath the images.

Figure 4.6: Noise (y) vs Dictionary (x) comparison, using a loop gain of 0.1. Error terms are given underneath the images.

Figure 4.7: Noise (y) vs Loop Gain (x) comparison, using the Identity+Symlet dictionary. Error terms are given underneath the images.

Chapter 5

Conclusion
This report aimed to follow my progress as I learned the basics of signal processing and deconvolution. Starting with simple direct-inverse solutions and discovering how they are ill-conditioned, frequency-cutoff and frequency-filtering solutions gave more promising results.
The CLEAN algorithm was looked at in some depth, both for its historical impact and for its continued wide use as a deconvolution algorithm.
Maximum Likelihood techniques were explored with a focus on ISRA and RLA. Using a prior (MAP) in the solution allowed us to exploit the compressibility of our data by wavelets in order to improve our deconvolution techniques. The Matching Pursuit algorithm is able to solve this problem greedily with the use of a wavelet dictionary, a Dirac-delta dictionary, or a combination of the two, with varying degrees of success depending on noise level and loop-gain.
Matching Pursuit is a precursor to even more advanced techniques. Arwa Dabbech [3] explored multi-resolution and iterative analysis-by-synthesis approaches to deconvolve various types of sources.
Although not reflected in this report, these algorithms were run on a number of different sources, with similar results. Although differing noise levels and object intensities often require tweaking of parameters, the algorithms explored here proved very robust in reconstructing convolved data.

Bibliography

[1] Chuong Do, The multivariate Gaussian distribution, 2008, http://cs229.stanford.edu/section/cs229-gaussians.pdf (accessed October 25, 2011).
[2] J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation 19 (1965), no. 90, 297-301.
[3] A. Dabbech, Reconstruction of radio-interferometer images using sparse representations, Master's thesis, Tunisia Polytechnic School, March-June 2011.
[4] A. Damato, File:jpeg2000 2-level wavelet transform-lichtenstein.png, 17 May 2007, http://www.interaction-design.org/references/ (accessed October 12, 2011).
[5] M.E. Daube-Witherspoon and G. Muehllehner, An iterative image space reconstruction algorithm suitable for volume ECT, IEEE Transactions on Medical Imaging 5 (1986), no. 2, 61-66.
[6] I. Daubechies, Orthonormal bases of compactly supported wavelets, Communications on Pure and Applied Mathematics 41 (1988), no. 7, 909-996.
[7] M. Elad, P. Milanfar, and R. Rubinstein, Analysis versus synthesis in signal priors, Inverse Problems 23 (2007), 947-968.
[8] I.J. Good, The interaction algorithm and practical Fourier analysis, Journal of the Royal Statistical Society, Series B (Methodological) 20 (1958), no. 2, 361-372.
[9] A. Graps, An introduction to wavelets, IEEE Computational Science & Engineering 2 (1995), no. 2, 50-61.
[10] Alfred Haar, Zur Theorie der orthogonalen Funktionensysteme, Mathematische Annalen 69 (1910), 331-371, doi:10.1007/BF01456326.
[11] J.A. Högbom, Aperture synthesis with a non-regular distribution of interferometer baselines, Astronomy and Astrophysics Supplement Series 15 (1974), 417.
[12] H. Lantéri, M. Roche, O. Cuevas, and C. Aime, A general method to devise maximum-likelihood signal restoration multiplicative algorithms with non-negativity constraints, Signal Processing 81 (2001), no. 5, 945-974.
[13] L.B. Lucy, An iterative technique for the rectification of observed distributions, The Astronomical Journal 79 (1974), 745.
[14] S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989), no. 7, 674-693.
[15] S.G. Mallat, A wavelet tour of signal processing, Academic Press, 1999.
[16] D. Mary, S. Bourguignon, C. Theys, and H. Lantéri, Sparse priors in unions of representation spaces for radio-interferometric image reconstruction, 2011.
[17] E. Pantin, J.L. Starck, and F. Murtagh, Deconvolution and blind deconvolution in astronomy, Blind Image Deconvolution: Theory and Applications (2007), 138.
[18] L. Rabiner, R. Schafer, and C. Rader, The chirp z-transform algorithm, IEEE Transactions on Audio and Electroacoustics 17 (1969), no. 2, 86-92.
[19] W.H. Richardson, Bayesian-based iterative method of image restoration, JOSA 62 (1972), no. 1, 55-59.
[20] J.L. Starck, F. Murtagh, and A. Bijaoui, Multiresolution support applied to image filtering and restoration, Graphical Models and Image Processing 57 (1995), no. 5, 420-431.
[21] E. Thiébaut, Introduction to image reconstruction and inverse problems, Optics in Astrophysics (2005), 397-422.
[22] B.P. Wakker and U.J. Schwarz, The multi-resolution CLEAN and its application to the short-spacing problem in interferometry, Astronomy and Astrophysics 200 (1988), 312-322.
[23] John Wallace, Fourier-transform spectroscopy: interferometer simplifies FTIR spectrometer, 2011, http://www.optoiq.com/index/photonics-technologies-applications/lfw-display/lfwarticle-display/332945/articles/laser-focus-world/volume-44/issue-7/world-news/fourier-transform-spectroscopy-interferometer-simplifies-ftir-spectrometer.html (accessed October 12, 2011).

Appendix A

Notations and Concepts


A.1  Mathematical Notations

R        is the set of all the real numbers
R+       is the set of all the non-zero positive real numbers
N        is the set of all the natural numbers
N+       is the set of all the non-zero positive natural numbers
R^p      is the set of all real-valued vectors of size p ∈ N+
M_{N,L}  is a real-valued N × L matrix

For x, y ∈ R^N, H an M × L matrix and F(x) a function, then:

∇F(x)          is the gradient vector of F(x)
arg min F(x)   is the variable x that minimises F(x)
arg max F(x)   is the variable x that maximises F(x)
||x||          is the Euclidean norm of x
||x||_p        is the l_p norm of x (see section 4.1.2.3)
⟨x, y⟩         is the inner product of x and y
≜              is used to denote a definition as well as an equality
x ⊙ y          is the element-wise multiplication of x by the respective elements in y, i.e. (x ⊙ y)_i = x_i y_i (see section A.1.1 for more details)
x / y          is the element-wise division of x by the respective elements in y, i.e. (x/y)_i = x_i / y_i
x^p            is the element-wise power function by p ∈ R of the elements in x, i.e. (x^p)_i = (x_i)^p
x > y          is the element-wise greater-than operator, i.e. x_i > y_i, ∀i; similarly defined for {<, ≤, ≥, =}
x_i            is the i-th element of x
H_i            is the i-th column of H
H_{i,j}        is the element of H in the i-th column and j-th row
H^T            is the transpose of H
diag(x)        is the diagonal matrix with x as its diagonal elements

As is usual, I is the identity matrix and 1 is a vector of 1s.

A.1.1  Element-wise Vector Multiplication

Element-wise vector multiplication is expressed via the use of diagonal matrices. The element-wise multiplication of x and y can be expressed as diag(x)y.
This can be extended to more than two vectors; that is, for x, y, z we would use diag(x)diag(y)z.
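The following is a minimal numpy sketch (my own illustration, not from the report) verifying this identity on small made-up vectors.

import numpy as np

# diag(x) y equals the element-wise product x * y,
# and diag(x) diag(y) z equals x * y * z.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
z = np.array([7.0, 8.0, 9.0])

assert np.allclose(np.diag(x) @ y, x * y)
assert np.allclose(np.diag(x) @ np.diag(y) @ z, x * y * z)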

A.2  Covariance Matrix

Let each element of y ∈ R^N be modeled by a Gaussian distribution with respective means μ ∈ R^N and variances σ² ∈ R^N.
We define the covariance matrix of y, an N × N matrix, by cov(y)_{i,j} equal to the covariance between y_i and y_j. If all the elements in y are independent of one another, then cov(y) = diag(σ²).
We may also use the term inverse covariance matrix to define cov^{-1}(y)_{i,j} equal to the inverse covariance between y_i and y_j. If all variables are independent, cov^{-1}(y) = diag(1/σ²).
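As a small illustration, the numpy sketch below (with hypothetical, made-up standard deviations) builds the diagonal covariance matrix of an independent Gaussian vector and checks that its inverse is the diagonal matrix of reciprocal variances.

import numpy as np

# Hypothetical per-element standard deviations of an independent Gaussian vector y.
sigma = np.array([0.5, 1.0, 2.0])

cov = np.diag(sigma**2)             # cov(y) = diag(sigma^2)
inv_cov = np.diag(1.0 / sigma**2)   # cov^{-1}(y) = diag(1/sigma^2)
assert np.allclose(cov @ inv_cov, np.eye(len(sigma)))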

A.3  Hyperparameter

A hyperparameter is a parameter of a prior term, not a parameter of the underlying model itself. It is used to tune or weight the effect of the prior versus the effect of the maximum likelihood. As there is no clear mathematical distinction between a hyperparameter and a parameter of the model in this particular example, this is a conceptual difference, but an important one nonetheless.

A.4  Descent Direction

Given some real valued function F(x) for x ∈ R^N (i.e. F : R^N → R) and x ∈ R^N, a vector p ∈ R^N is called a descent direction if ⟨p, ∇F(x)⟩ < 0. In a more intuitive sense, consider that if ⟨p, ∇F(x)⟩ = 0 then p is orthogonal to the gradient of F. If the inner product is negative, however, this implies that p points in a direction that goes down the gradient. Taking an infinitely small step in this descent direction will decrease the value of F.
In slightly more rigorous terms, starting at x^(0), we find a p^(0) that is a descent direction. We create x^(1) = x^(0) + α p^(0) for a sufficiently small step α > 0. We know that F(x^(1)) < F(x^(0)). Repeating this step many times over will trace a path towards the minimum value of F.
This is an intuitive explanation of a result known as Taylor's Theorem. In reality, we can take α ≈ 1, but the exact value used will be contextual.
It is clear that for any x^(k), −∇F(x^(k)) is a descent direction since ⟨−∇F(x^(k)), ∇F(x^(k))⟩ = −⟨∇F(x^(k)), ∇F(x^(k))⟩ = −||∇F(x^(k))||² < 0.
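A quick numerical check of this property is sketched below in numpy, using a hypothetical quadratic F and an arbitrary starting point of my own choosing.

import numpy as np

# F(x) = 0.5 ||x - c||^2 has gradient grad F(x) = x - c.
c = np.array([1.0, -2.0, 0.5])
F = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_F = lambda x: x - c

x0 = np.array([3.0, 1.0, -1.0])
p = -grad_F(x0)                     # the negative gradient as a candidate direction
assert np.dot(p, grad_F(x0)) < 0    # <p, grad F(x0)> < 0, so p is a descent direction
assert F(x0 + 1e-3 * p) < F(x0)     # a small step along p decreases F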

A.5  KKT First Order Optimality Conditions

Recall that the Lagrange equation is
\[
L(x, \lambda) = J(x) - \langle \lambda, g(x) \rangle
\tag{A.1}
\]
where \(\lambda\) is the Lagrange multiplier, J(x) is the function to be maximised and g(x) is the function expressing the constraints. x is our desired data and hence the argument that will maximise J(x).
With \(x^*\) and \(\lambda^*\) to denote the optimal values of x and \(\lambda\) respectively, the first order optimality conditions are as follows: [12, pg 947]

Stationarity
\[
\nabla L(x^*, \lambda^*) = 0 \iff \lambda_i^* [\nabla g(x^*)]_i = [\nabla J(x^*)]_i \iff \lambda_i^* = \frac{[\nabla J(x^*)]_i}{[\nabla g(x^*)]_i}, \ \forall i
\tag{A.2a}
\]

Positivity of x
\[
g(x^*) \geq 0 \iff x_i^* \geq 0, \ \forall i
\tag{A.2b}
\]

Positivity of \(\lambda\)
\[
\lambda^* \geq 0 \iff \lambda_i^* \geq 0, \ \forall i
\tag{A.2c}
\]

Complementary Slackness
\[
\langle \lambda^*, g(x^*) \rangle = 0 \iff \frac{[\nabla J(x^*)]_i}{[\nabla g(x^*)]_i}\,[g(x^*)]_i = 0, \ \forall i
\tag{A.2d}
\]

Appendix B

Proofs

B.2  Proofs From Chapter 2

B.2.6  PSF Response Matrix Performs a Convolution (2.6)

Take a, a k × k map, with i, j ∈ [1, 2, ..., k], defined as:
\[
a_{i,j} = \begin{cases} 2 & \text{if } i = 2 \text{ and } j = 1 \\ 0 & \text{otherwise} \end{cases}
\tag{B.1}
\]
This is two times the discrete Dirac-Delta function centered around (2, 1). The flattened version is thus defined as a = [0, 2, 0, ..., 0]^T ∈ R^N and we can now see

\[
Ha = \begin{pmatrix}
h_1 & h_N & h_{N-1} & \cdots & h_2 \\
h_2 & h_1 & h_N & \cdots & h_3 \\
h_3 & h_2 & h_1 & \cdots & h_4 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
h_N & h_{N-1} & h_{N-2} & \cdots & h_1
\end{pmatrix}
\begin{pmatrix} 0 \\ 2 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
= \begin{pmatrix} 2h_N \\ 2h_1 \\ 2h_2 \\ \vdots \\ 2h_{N-1} \end{pmatrix}
\tag{B.2}
\]
which is simply twice the flattened PSF at position (2, 1). This is what we would get if we convolved the PSF by two times the Dirac-Delta function centered around (2, 1).
This result can be generalised to any k × k map x and its flattened version x = [x_1, ..., x_N]^T for N = k². We use the notation H_i to represent the i-th column of H; H_i is by definition the flattened PSF circularly shifted i times. We use the normal notation for a vector, with x_i denoting the i-th element of x:

\begin{align}
Hx &= \begin{pmatrix} H_1 & H_2 & \cdots & H_N \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} \\
&= \begin{pmatrix}
h_1 & h_N & \cdots & h_2 \\
h_2 & h_1 & \cdots & h_3 \\
\vdots & \vdots & \ddots & \vdots \\
h_N & h_{N-1} & \cdots & h_1
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} \\
&= \begin{pmatrix}
x_1 h_1 + x_2 h_N + \cdots + x_N h_2 \\
x_1 h_2 + x_2 h_1 + \cdots + x_N h_3 \\
\vdots \\
x_1 h_N + x_2 h_{N-1} + \cdots + x_N h_1
\end{pmatrix} \\
&= \begin{pmatrix} x_1 h_1 \\ x_1 h_2 \\ \vdots \\ x_1 h_N \end{pmatrix}
+ \begin{pmatrix} x_2 h_N \\ x_2 h_1 \\ \vdots \\ x_2 h_{N-1} \end{pmatrix}
+ \cdots
+ \begin{pmatrix} x_N h_2 \\ x_N h_3 \\ \vdots \\ x_N h_1 \end{pmatrix} \\
&= x_1 H_1 + x_2 H_2 + \cdots + x_N H_N \tag{B.3}
\end{align}
What this shows us is that Hx is a PSF applied to every point of x, scaled to the intensity at that point, which is the definition of a convolution.
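The equivalence can also be checked numerically. The following is a small numpy sketch (my own, not part of the proof) using a 1D flattened PSF, a circulant matrix built column by column, and the FFT to compute the circular convolution.

import numpy as np

# Column i of H is the flattened PSF h circularly shifted i places (0-indexed here).
rng = np.random.default_rng(0)
h = rng.random(8)     # flattened PSF
x = rng.random(8)     # flattened map

H = np.column_stack([np.roll(h, i) for i in range(len(h))])

direct = H @ x
circular_conv = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))
assert np.allclose(direct, circular_conv)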

B.3  Proofs From Chapter 3

B.3.8  Constraint on α^(k) to Ensure Positivity (3.8)



\begin{align}
x^{(k)} + \alpha^{(k)}\,\mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)\left[U(x^{(k)}) - V(x^{(k)})\right] &> 0 \\
\alpha^{(k)}\,\mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)\left[U(x^{(k)}) - V(x^{(k)})\right] &> -x^{(k)} \\
\alpha^{(k)}\,\frac{x^{(k)}}{V(x^{(k)})} \odot \left[U(x^{(k)}) - V(x^{(k)})\right] &> -x^{(k)} \\
\alpha^{(k)}\,\frac{U(x^{(k)}) - V(x^{(k)})}{V(x^{(k)})} &> -1 \\
\alpha^{(k)}\,\frac{U(x^{(k)})}{V(x^{(k)})} - \alpha^{(k)} &> -1 \\
\alpha^{(k)}\left(1 - \frac{U(x^{(k)})}{V(x^{(k)})}\right) &< 1 \\
\alpha^{(k)} &< \frac{1}{1 - \dfrac{U(x^{(k)})}{V(x^{(k)})}} \tag{B.4}
\end{align}
where the division by x^(k) in the fourth line is valid since x^(k) > 0, and the inequalities are understood element-wise.
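In practice the bound has to hold for every component of x^(k). A minimal numpy sketch of one way to pick such an α^(k) (using made-up values of U, V and x, and taking the most restrictive component) is given below.

import numpy as np

# Hypothetical positive values at iteration k.
x = np.array([1.0, 0.5, 2.0])
U = np.array([0.8, 1.2, 0.3])
V = np.array([1.0, 1.0, 1.0])

ratio = U / V
# Only components with U_i < V_i (a locally decreasing update) constrain alpha;
# elsewhere the update keeps x_i positive for any alpha > 0.
mask = ratio < 1
alpha_max = np.min(1.0 / (1.0 - ratio[mask]))

alpha = 0.9 * alpha_max                   # any value below the bound
x_next = x + alpha * (x / V) * (U - V)    # additive form of the update
assert np.all(x_next > 0)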

B.3.10  Multiplicative Form of the Optimisation Function (3.10)

\begin{align}
x^{(k+1)} &= x^{(k)} + \mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)\left[U(x^{(k)}) - V(x^{(k)})\right] \\
&= x^{(k)} + \mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)U(x^{(k)}) - \mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)V(x^{(k)}) \\
&= x^{(k)} + \mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)U(x^{(k)}) - x^{(k)} \\
&= \mathrm{diag}\!\left(\frac{x^{(k)}}{V(x^{(k)})}\right)U(x^{(k)}) \tag{B.5}
\end{align}
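A one-line numerical check (a numpy sketch with arbitrary positive values of my own, not from the report) that the additive step with α^(k) = 1 coincides with this multiplicative form:

import numpy as np

x = np.array([1.0, 0.5, 2.0])
U = np.array([0.8, 1.2, 0.3])
V = np.array([1.0, 1.0, 1.0])

additive = x + (x / V) * (U - V)    # additive step with alpha = 1
multiplicative = x * U / V          # diag(x/V) U
assert np.allclose(additive, multiplicative)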

B.3.19  Calculation of ∇J(x) under Poisson noise (3.19)

\begin{align}
\nabla D_i(x) &= \begin{pmatrix} h_{i1} - y_i h_{i1} (Hx)_i^{-1} \\ h_{i2} - y_i h_{i2} (Hx)_i^{-1} \\ \vdots \end{pmatrix}
= \begin{pmatrix} h_{i1} \\ h_{i2} \\ \vdots \end{pmatrix} - \begin{pmatrix} y_i h_{i1} (Hx)_i^{-1} \\ y_i h_{i2} (Hx)_i^{-1} \\ \vdots \end{pmatrix} \\
\text{therefore, } \nabla J(x) &= \sum_i \nabla D_i(x) \\
&= \left[\begin{pmatrix} h_{11} \\ h_{12} \\ \vdots \end{pmatrix} + \begin{pmatrix} h_{21} \\ h_{22} \\ \vdots \end{pmatrix} + \cdots\right]
- \left[\begin{pmatrix} y_1 h_{11} (Hx)_1^{-1} \\ y_1 h_{12} (Hx)_1^{-1} \\ \vdots \end{pmatrix} + \begin{pmatrix} y_2 h_{21} (Hx)_2^{-1} \\ y_2 h_{22} (Hx)_2^{-1} \\ \vdots \end{pmatrix} + \cdots\right] \\
&= \begin{pmatrix} h_{11} + h_{21} + \cdots \\ h_{12} + h_{22} + \cdots \\ \vdots \end{pmatrix}
- \begin{pmatrix} y_1 h_{11} (Hx)_1^{-1} + y_2 h_{21} (Hx)_2^{-1} + \cdots \\ y_1 h_{12} (Hx)_1^{-1} + y_2 h_{22} (Hx)_2^{-1} + \cdots \\ \vdots \end{pmatrix} \\
&= \begin{pmatrix} h_{11} & h_{21} & \cdots \\ h_{12} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \mathbf{1}
- \begin{pmatrix} h_{11} & h_{21} & \cdots \\ h_{12} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \frac{y}{Hx} \\
&= H^T \mathbf{1} - H^T \frac{y}{Hx} \\
&= \mathbf{1} - H^T \frac{y}{Hx} \qquad \text{since } H\mathbf{1} = \mathbf{1} = H^T\mathbf{1} \tag{B.6}
\end{align}
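Sketched below in numpy (assuming a normalised PSF so that the circulant H has unit row and column sums, as the last step requires), the two forms of the gradient agree; the toy values are arbitrary and of my own choosing.

import numpy as np

rng = np.random.default_rng(1)
h = rng.random(8)
h /= h.sum()                        # normalised PSF: rows and columns of H sum to 1
H = np.column_stack([np.roll(h, i) for i in range(len(h))])

x = rng.random(8) + 0.1             # strictly positive, so Hx > 0
y = rng.random(8) + 0.1

grad_general = H.T @ np.ones(8) - H.T @ (y / (H @ x))
grad_reduced = 1.0 - H.T @ (y / (H @ x))
assert np.allclose(grad_general, grad_reduced)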

B.3.28  Partial Derivative of D_i (3.28)

\begin{align}
\frac{\partial}{\partial x_j} D_i &= \frac{\partial}{\partial x_j}\left[\ln\!\left(\sqrt{2\pi}\,\sigma_{y_i}\right) + \frac{(y_i - (Hx)_i)^2}{2\sigma_{y_i}^2}\right] \\
&= \frac{1}{2\sigma_{y_i}^2}\,\frac{\partial}{\partial x_j}\,(y_i - (Hx)_i)^2 \\
&= \frac{1}{2\sigma_{y_i}^2}\,\frac{\partial}{\partial x_j}\left(y_i^2 - 2y_i(Hx)_i + (Hx)_i^2\right) \\
&= \frac{1}{2\sigma_{y_i}^2}\,\frac{\partial}{\partial x_j}\left(-2y_i(Hx)_i + (Hx)_i^2\right) \tag{B.7}
\end{align}

Consider briefly \(\frac{\partial}{\partial x_j}(Hx)_i\). Since

\[
Hx = \begin{pmatrix} h_{11} & h_{12} & \cdots \\ h_{21} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \end{pmatrix}
= \begin{pmatrix} h_{11}x_1 + h_{12}x_2 + \cdots \\ h_{21}x_1 + h_{22}x_2 + \cdots \\ \vdots \end{pmatrix}
\tag{B.8}
\]

then \((Hx)_i = h_{i1}x_1 + h_{i2}x_2 + \cdots\), therefore \(\frac{\partial}{\partial x_j}(Hx)_i = \frac{\partial}{\partial x_j}(h_{i1}x_1 + h_{i2}x_2 + \cdots) = h_{ij}\).

We return to

\begin{align}
\frac{\partial}{\partial x_j} D_i &= \frac{1}{2\sigma_{y_i}^2}\,\frac{\partial}{\partial x_j}\left(-2y_i(Hx)_i + (Hx)_i^2\right) \\
&= \frac{1}{2\sigma_{y_i}^2}\left(-2y_i h_{ij} + 2h_{ij}(Hx)_i\right) \\
&= \frac{h_{ij}}{\sigma_{y_i}^2}\left((Hx)_i - y_i\right) \tag{B.9}
\end{align}

B.3.29  Calculation of ∇J(x) under Additive Gaussian noise (3.29)

Consider that

\[
\nabla D_i(x) = \begin{pmatrix}
h_{i1}\,((Hx)_i - y_i)\,\sigma_{y_i}^{-2} \\
h_{i2}\,((Hx)_i - y_i)\,\sigma_{y_i}^{-2} \\
\vdots
\end{pmatrix}
\tag{B.10}
\]

Therefore,

\begin{align}
\nabla J(x) &= \sum_i \nabla D_i(x) \\
&= \begin{pmatrix} h_{11}((Hx)_1 - y_1)\sigma_{y_1}^{-2} \\ h_{12}((Hx)_1 - y_1)\sigma_{y_1}^{-2} \\ \vdots \end{pmatrix}
+ \begin{pmatrix} h_{21}((Hx)_2 - y_2)\sigma_{y_2}^{-2} \\ h_{22}((Hx)_2 - y_2)\sigma_{y_2}^{-2} \\ \vdots \end{pmatrix} + \cdots \\
&= \begin{pmatrix} h_{11} & h_{21} & \cdots \\ h_{12} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}
\begin{pmatrix} ((Hx)_1 - y_1)\sigma_{y_1}^{-2} \\ ((Hx)_2 - y_2)\sigma_{y_2}^{-2} \\ \vdots \end{pmatrix} \\
&= \begin{pmatrix} h_{11} & h_{21} & \cdots \\ h_{12} & h_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}
\begin{pmatrix} \sigma_{y_1}^{-2} & 0 & \cdots \\ 0 & \sigma_{y_2}^{-2} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}
\begin{pmatrix} (Hx)_1 - y_1 \\ (Hx)_2 - y_2 \\ \vdots \end{pmatrix} \\
&= H^T W (Hx - y) \\
&= H^T W H x - H^T W y \tag{B.11}
\end{align}
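The same result in numpy form, as a hedged sketch with an arbitrary dense H and made-up per-pixel variances:

import numpy as np

rng = np.random.default_rng(2)
N = 8
H = rng.random((N, N))
x = rng.random(N)
y = rng.random(N)
sigma2 = rng.random(N) + 0.5        # hypothetical noise variances
W = np.diag(1.0 / sigma2)           # W = diag(1/sigma^2)

grad_factored = H.T @ W @ (H @ x - y)
grad_expanded = H.T @ W @ H @ x - H.T @ W @ y
assert np.allclose(grad_factored, grad_expanded)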

B.3.42  From Mathematical to Computational: RLA (3.42)

\begin{align}
\bar{x}^{(k+1)} &= \mathrm{2D}\!\left(x^{(k+1)}\right)
= \mathrm{2D}\!\left(\mathrm{diag}\!\left(x^{(k)}\right) H^T \frac{y}{Hx^{(k)}}\right) \\
&= \mathrm{2D}\!\left(x^{(k)}\right) \odot \mathrm{2D}\!\left(H^T \frac{y}{Hx^{(k)}}\right) \\
&= \bar{x}^{(k)} \odot \mathrm{CONV}\!\left(\mathrm{trans}(h),\ \mathrm{2D}\!\left(\frac{y}{Hx^{(k)}}\right)\right) \\
&= \bar{x}^{(k)} \odot \mathrm{CONV}\!\left(\mathrm{trans}(h),\ \frac{\mathrm{2D}(y)}{\mathrm{2D}(Hx^{(k)})}\right) \\
&= \bar{x}^{(k)} \odot \mathrm{CONV}\!\left(\mathrm{trans}(h),\ \frac{\bar{y}}{\mathrm{CONV}(h, \bar{x}^{(k)})}\right) \tag{B.12}
\end{align}
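A possible image-domain implementation of this update is sketched below in numpy, assuming circular (FFT-based) convolutions and a hypothetical Gaussian PSF with a two-point-source sky; it is an illustration of the form of (B.12), not the report's code.

import numpy as np

def conv2d_circular(a, b):
    # Circular 2D convolution via the FFT (the CONV of equation (B.12)).
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def rla_step(x_map, y_map, psf):
    # One Richardson-Lucy update: x <- x * CONV(trans(h), y / CONV(h, x)).
    psf_t = np.roll(np.flip(psf), 1, axis=(0, 1))   # trans(h): circularly flipped PSF
    model = conv2d_circular(psf, x_map)             # CONV(h, x)
    return x_map * conv2d_circular(psf_t, y_map / model)

# Toy data: a normalised Gaussian PSF and two point sources.
n = 32
yy, xx = np.mgrid[:n, :n]
psf = np.exp(-((yy - n // 2) ** 2 + (xx - n // 2) ** 2) / 8.0)
psf /= psf.sum()
sky = np.zeros((n, n)); sky[10, 12] = 1.0; sky[20, 25] = 2.0
y_map = conv2d_circular(psf, sky)

x_map = np.ones((n, n))              # flat, strictly positive starting estimate
for _ in range(50):
    x_map = rla_step(x_map, y_map, psf)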

B.3.43  From Mathematical to Computational: ISRA (3.43)

Note that W = diag(σ^{-2}) and, for any a ∈ R^N, Wa = diag(σ^{-2})a = a ⊙ σ^{-2}. Therefore 2D(Wa) = 2D(a ⊙ σ^{-2}) = ā ⊙ s̄, where ā ≜ 2D(a) and s̄ ≜ 2D(σ^{-2}). Also we use h_t ≜ trans(h).

\begin{align}
\bar{x}^{(k+1)} &= \mathrm{2D}\!\left(x^{(k+1)}\right)
= \mathrm{2D}\!\left(\mathrm{diag}\!\left(\frac{x^{(k)}}{H^T W H x^{(k)} + \left(H^T W y\right)^-}\right)\left(H^T W y\right)^+\right) \\
&= \mathrm{2D}\!\left(\frac{x^{(k)} \odot \left(H^T W y\right)^+}{H^T W H x^{(k)} + \left(H^T W y\right)^-}\right) \\
&= \frac{\mathrm{2D}\!\left(x^{(k)}\right) \odot \mathrm{2D}\!\left(H^T W y\right)^+}{\mathrm{2D}\!\left(H^T W H x^{(k)}\right) + \mathrm{2D}\!\left(H^T W y\right)^-} \\
&= \frac{\bar{x}^{(k)} \odot \mathrm{CONV}\!\left(h_t,\ \mathrm{2D}(Wy)\right)^+}{\mathrm{CONV}\!\left(h_t,\ \mathrm{2D}(WHx^{(k)})\right) + \mathrm{CONV}\!\left(h_t,\ \mathrm{2D}(Wy)\right)^-} \\
&= \frac{\bar{x}^{(k)} \odot \mathrm{CONV}\!\left(h_t,\ \bar{s} \odot \bar{y}\right)^+}{\mathrm{CONV}\!\left(h_t,\ \bar{s} \odot \mathrm{2D}(Hx^{(k)})\right) + \mathrm{CONV}\!\left(h_t,\ \bar{s} \odot \bar{y}\right)^-} \\
&= \frac{\bar{x}^{(k)} \odot \mathrm{CONV}\!\left(h_t,\ \bar{s} \odot \bar{y}\right)^+}{\mathrm{CONV}\!\left(h_t,\ \bar{s} \odot \mathrm{CONV}(h, \bar{x}^{(k)})\right) + \mathrm{CONV}\!\left(h_t,\ \bar{s} \odot \bar{y}\right)^-} \tag{B.13}
\end{align}
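An analogous numpy sketch of one ISRA update in image space follows (again assuming FFT-based circular convolutions; the positive/negative split of the H^T W y term is kept explicit even though it is trivial for non-negative data and weights). This is my own illustration of the form of (B.13), under those assumptions.

import numpy as np

def conv2d_circular(a, b):
    # Circular 2D convolution via the FFT.
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def isra_step(x_map, y_map, psf, s_map):
    # One ISRA update; s_map is the 2D map of inverse variances, i.e. 2D(1/sigma^2).
    psf_t = np.roll(np.flip(psf), 1, axis=(0, 1))          # trans(h)
    hwy = conv2d_circular(psf_t, s_map * y_map)            # image-domain H^T W y
    num = np.maximum(hwy, 0.0)                             # (H^T W y)^+
    neg = np.maximum(-hwy, 0.0)                            # (H^T W y)^-
    den = conv2d_circular(psf_t, s_map * conv2d_circular(psf, x_map)) + neg
    return x_map * num / den

It can be iterated exactly like rla_step above, with a constant s_map when the noise variance is uniform.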

B.4  Proofs From Chapter 4

B.4.5  Dictionary is a Linear Combination of Atoms (4.5)

\begin{align}
x = D\alpha &= \begin{pmatrix} D_1 & D_2 & \cdots & D_L \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_L \end{pmatrix}
= \begin{pmatrix}
D_{1,1} & D_{1,2} & \cdots & D_{1,L} \\
D_{2,1} & D_{2,2} & \cdots & D_{2,L} \\
\vdots & \vdots & \ddots & \vdots \\
D_{N,1} & D_{N,2} & \cdots & D_{N,L}
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_L \end{pmatrix} \\
&= \begin{pmatrix}
\alpha_1 D_{1,1} + \alpha_2 D_{1,2} + \cdots + \alpha_L D_{1,L} \\
\alpha_1 D_{2,1} + \alpha_2 D_{2,2} + \cdots + \alpha_L D_{2,L} \\
\vdots \\
\alpha_1 D_{N,1} + \alpha_2 D_{N,2} + \cdots + \alpha_L D_{N,L}
\end{pmatrix} \\
&= \begin{pmatrix} \alpha_1 D_{1,1} \\ \alpha_1 D_{2,1} \\ \vdots \\ \alpha_1 D_{N,1} \end{pmatrix}
+ \begin{pmatrix} \alpha_2 D_{1,2} \\ \alpha_2 D_{2,2} \\ \vdots \\ \alpha_2 D_{N,2} \end{pmatrix}
+ \cdots
+ \begin{pmatrix} \alpha_L D_{1,L} \\ \alpha_L D_{2,L} \\ \vdots \\ \alpha_L D_{N,L} \end{pmatrix} \\
&= \alpha_1 D_1 + \alpha_2 D_2 + \cdots + \alpha_L D_L \tag{B.14}
\end{align}
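This is just the column-space view of a matrix-vector product; a short numpy check with a random, purely hypothetical dictionary:

import numpy as np

rng = np.random.default_rng(3)
N, L = 16, 5
D = rng.standard_normal((N, L))       # hypothetical dictionary of L atoms
alpha = rng.standard_normal(L)

x = D @ alpha
x_atoms = sum(alpha[i] * D[:, i] for i in range(L))
assert np.allclose(x, x_atoms)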

B.4.18  Minimising Dα (4.18)

\begin{align}
\arg\min_x \mathrm{ML}(x) &= \arg\min_\alpha \mathrm{ML}(D\alpha) \\
&= \arg\min_\alpha \frac{1}{2}(HD\alpha - y)^T R (HD\alpha - y) \\
&= \arg\min_\alpha \frac{1}{2}(HD\alpha - y)^T R^{\frac{1}{2}} R^{\frac{1}{2}} (HD\alpha - y) \\
&= \arg\min_\alpha \frac{1}{2}(HD\alpha - y)^T \left(R^{\frac{1}{2}}\right)^T R^{\frac{1}{2}} (HD\alpha - y) \\
&= \arg\min_\alpha \frac{1}{2}\left(R^{\frac{1}{2}}(HD\alpha - y)\right)^T \left(R^{\frac{1}{2}}(HD\alpha - y)\right) \\
&= \arg\min_\alpha \frac{1}{2}\left\| R^{\frac{1}{2}} H D \alpha - R^{\frac{1}{2}} y \right\|^2 \\
&= \arg\min_\alpha \frac{1}{2}\left\| C\alpha - z \right\|^2 \tag{B.15}
\end{align}

where we define the whitened data z = R^{1/2} y and C = R^{1/2} H D.
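The whitening identity can be checked numerically. The sketch below assumes R = diag(1/σ²), so that R^{1/2} = diag(1/σ), and uses arbitrary random matrices of my own; it is an illustration, not the report's code.

import numpy as np

rng = np.random.default_rng(4)
N, L = 12, 6
H = rng.standard_normal((N, N))
D = rng.standard_normal((N, L))
alpha = rng.standard_normal(L)
y = rng.standard_normal(N)
sigma2 = rng.random(N) + 0.5

R = np.diag(1.0 / sigma2)
R_half = np.diag(1.0 / np.sqrt(sigma2))
z = R_half @ y                       # whitened data
C = R_half @ H @ D

resid = H @ D @ alpha - y
lhs = 0.5 * resid @ R @ resid                  # weighted least-squares term
rhs = 0.5 * np.sum((C @ alpha - z) ** 2)       # whitened form
assert np.allclose(lhs, rhs)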

B.4.22  Minimising the Inner Term of α_i (4.22)

\begin{align}
\alpha_i C_i - r^{(k)} &= 0 \\
\alpha_i C_i &= r^{(k)} \\
\alpha_i (C_i)^T C_i &= (r^{(k)})^T C_i \\
\alpha_i \langle C_i, C_i \rangle &= \langle r^{(k)}, C_i \rangle \\
\alpha_i &= \frac{\langle r^{(k)}, C_i \rangle}{\langle C_i, C_i \rangle}
= \frac{\langle r^{(k)}, C_i \rangle}{\|C_i\|^2} \tag{B.16}
\end{align}

B.4.23  Minimising ML on the index (4.23)

\begin{align}
m &= \arg\min_i \frac{1}{2}\left\| \alpha_i C_i - r^{(k)} \right\|^2 \\
&= \arg\min_i \frac{1}{2}\left(\alpha_i C_i - r^{(k)}\right)^T \left(\alpha_i C_i - r^{(k)}\right) \\
&= \arg\min_i \frac{1}{2}\left(\alpha_i^2 C_i^T C_i - \alpha_i C_i^T r^{(k)} - \alpha_i (r^{(k)})^T C_i + (r^{(k)})^T r^{(k)}\right) \\
&= \arg\min_i \frac{1}{2}\left(\alpha_i^2 \|C_i\|^2 - 2\alpha_i \langle r^{(k)}, C_i \rangle + \|r^{(k)}\|^2\right) \\
&= \arg\min_i \frac{1}{2}\left(\frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^4}\,\|C_i\|^2 - 2\,\frac{\langle r^{(k)}, C_i \rangle}{\|C_i\|^2}\,\langle r^{(k)}, C_i \rangle + \|r^{(k)}\|^2\right) \\
&= \arg\min_i \frac{1}{2}\left(\frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} - 2\,\frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2} + \|r^{(k)}\|^2\right) \\
&= \arg\min_i \frac{1}{2}\left(\|r^{(k)}\|^2 - \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2}\right) \\
&= \arg\max_i \frac{\langle r^{(k)}, C_i \rangle^2}{\|C_i\|^2}
= \arg\max_i \frac{\left|\langle r^{(k)}, C_i \rangle\right|}{\|C_i\|} \tag{B.17}
\end{align}

where α_i = ⟨r^{(k)}, C_i⟩ / ||C_i||² from (B.16) has been substituted in the fifth line.
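Together with (B.16) this gives a matching-pursuit-style selection rule. A numpy sketch (with a random hypothetical C and residual r of my own) is shown below, checking that the selected atom indeed leaves the smallest single-atom residual.

import numpy as np

rng = np.random.default_rng(5)
N, L = 20, 8
C = rng.standard_normal((N, L))
r = rng.standard_normal(N)

corr = C.T @ r                              # <r, C_i> for every atom
norms = np.linalg.norm(C, axis=0)
alpha = corr / norms**2                     # (B.16): per-atom least-squares coefficient
m = np.argmax(np.abs(corr) / norms)         # (B.17): selected atom index

residuals = np.array([np.linalg.norm(alpha[i] * C[:, i] - r) for i in range(L)])
assert m == np.argmin(residuals)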

B.4.27  Initial Coefficient Calculation (4.27)

\begin{align}
\begin{pmatrix} \langle r^{(k)}, C_1 \rangle \\ \langle r^{(k)}, C_2 \rangle \\ \vdots \end{pmatrix}
&= \begin{pmatrix} \langle r^{(k)}, (R^{\frac{1}{2}}HD)_1 \rangle \\ \langle r^{(k)}, (R^{\frac{1}{2}}HD)_2 \rangle \\ \vdots \end{pmatrix}
= \begin{pmatrix} \langle r^{(k)}, R^{\frac{1}{2}}HD_1 \rangle \\ \langle r^{(k)}, R^{\frac{1}{2}}HD_2 \rangle \\ \vdots \end{pmatrix}
= \begin{pmatrix} (R^{\frac{1}{2}}HD_1)^T r^{(k)} \\ (R^{\frac{1}{2}}HD_2)^T r^{(k)} \\ \vdots \end{pmatrix} \\
&= \begin{pmatrix} D_1^T H^T (R^{\frac{1}{2}})^T r^{(k)} \\ D_2^T H^T (R^{\frac{1}{2}})^T r^{(k)} \\ \vdots \end{pmatrix}
= \begin{pmatrix} D_1^T H^T R^{\frac{1}{2}} r^{(k)} \\ D_2^T H^T R^{\frac{1}{2}} r^{(k)} \\ \vdots \end{pmatrix}
= D^T H^T R^{\frac{1}{2}} r^{(k)}
= D^T H^T \begin{pmatrix} \sigma_1^{-1} r_1^{(k)} \\ \sigma_2^{-1} r_2^{(k)} \\ \vdots \end{pmatrix} \tag{B.18}
\end{align}
