
Estimation Theory

Alireza Karimi
Laboratoire d'Automatique, MEC2 397
email: alireza.karimi@epfl.ch
Spring 2011
Course Objective
Extract information from noisy signals
Parameter Estimation Problem: Given a set of measured data

$\{x[0], x[1], \ldots, x[N-1]\}$

which depends on an unknown parameter vector $\theta$, determine an estimator

$\hat{\theta} = g(x[0], x[1], \ldots, x[N-1])$

where $g$ is some function.
Applications: Image processing, communications, biomedicine, system
identification, state estimation in control, etc.
Some Examples
Range estimation: We transmit a pulse that is reflected by the aircraft.
An echo is received after $\tau$ seconds. The range is estimated from the equation
$\tau = 2R/c$, where $c$ is the speed of light.
System identification: The plant is excited with a signal $u$ and the
output signal $y$ is measured, with

$y[k] = G(q^{-1}, \theta)u[k] + n[k]$

where $n[k]$ is the measurement noise. The model parameters $\theta$ are estimated
using $u[k]$ and $y[k]$ for $k = 1, \ldots, N$.
DC level in noise: Consider a set of data $\{x[0], x[1], \ldots, x[N-1]\}$ that
can be modeled as

$x[n] = A + w[n]$

where $w[n]$ is some zero-mean noise process. $A$ can be estimated using the
measured data set.
Outline
Classical estimation ($\theta$ deterministic)
Minimum Variance Unbiased Estimator (MVU)
Cramer-Rao Lower Bound (CRLB)
Best Linear Unbiased Estimator (BLUE)
Maximum Likelihood Estimator (MLE)
Least Squares Estimator (LSE)
Bayesian estimation ($\theta$ stochastic)
Minimum Mean Square Error Estimator (MMSE)
Maximum A Posteriori Estimator (MAP)
Linear MMSE Estimator
Kalman Filter
References
Main reference:
Fundamentals of Statistical Signal Processing:
Estimation Theory
by Steven M. Kay, Prentice-Hall, 1993 (available in the La Fontaine
Library, RLC). We cover Chapters 1 to 14, skipping Chapter 5 and
Chapter 9.
Other references:
Lessons in Estimation Theory for Signal Processing, Communications
and Control, by Jerry M. Mendel, Prentice-Hall, 1995.
Probability, Random Processes and Estimation Theory for Engineers,
by Henry Stark and John W. Woods, Prentice-Hall, 1986.
Review of Probability and Random Variables
Random Variables
Random Variable: A rule $X(\xi)$ that assigns to every element $\xi$ of a sample
space $\Omega$ a real value is called a RV. So $X$ is not really a variable that varies
randomly but a function whose domain is $\Omega$ and whose range is some
subset of the real line.
Example: Consider the experiment of throwing a coin twice. The sample
space (the set of possible outcomes) is

$\Omega = \{HH, HT, TH, TT\}$

We can define a random variable $X$ such that

$X(HH) = 1,\quad X(HT) = 1.1,\quad X(TH) = 1.6,\quad X(TT) = 1.8$

The random variable $X$ assigns to each event (e.g. $E = \{HT, TH\}$) a
subset of the real line (in this case $B = \{1.1, 1.6\}$).
Probability Distribution Function
For any element $\xi$ in $\Omega$, the event $\{\xi \mid X(\xi) \le x\}$ is an important event.
The probability of this event

$\Pr[\{\xi \mid X(\xi) \le x\}] = P_X(x)$

is called the probability distribution function of $X$.
Example: For the random variable defined earlier, we have:

$P_X(1.5) = \Pr[\{\xi \mid X(\xi) \le 1.5\}] = \Pr[\{HH, HT\}] = 0.5$

$P_X(x)$ can be computed for all $x \in \mathbb{R}$. It is clear that $0 \le P_X(x) \le 1$.
Remark:
For the same experiment (throwing a coin twice) we could define
another random variable that would lead to a different $P_X(x)$.
In most engineering problems the sample space is a subset of the
real line, so $X(\xi) = \xi$ and $P_X(x)$ is a continuous function of $x$.
Probability Density Function (PDF)
The Probability Density Function, if it exists, is given by:

$p_X(x) = \frac{dP_X(x)}{dx}$

When we deal with a single random variable the subscripts are removed:

$p(x) = \frac{dP(x)}{dx}$

Properties:

(i) $\int_{-\infty}^{\infty} p(x)\,dx = P(\infty) - P(-\infty) = 1$

(ii) $\Pr[\{\xi \mid X(\xi) \le x\}] = \Pr[X \le x] = P(x) = \int_{-\infty}^{x} p(u)\,du$

(iii) $\Pr[x_1 < X \le x_2] = \int_{x_1}^{x_2} p(x)\,dx$
Gaussian Probability Density Function
A random variable is distributed according to a Gaussian or normal
distribution if the PDF is given by:

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

The PDF has two parameters: $\mu$, the mean, and $\sigma^2$, the variance.
We note $X \sim \mathcal{N}(\mu, \sigma^2)$ when the random variable $X$ has a normal
(Gaussian) distribution with mean $\mu$ and standard deviation $\sigma$.
Small $\sigma$ means small variability (uncertainty) and large $\sigma$ means large
variability.
Remark: The Gaussian distribution is important because, according to the
Central Limit Theorem, the sum of $N$ independent RVs has a PDF that
converges to a Gaussian distribution as $N$ goes to infinity.
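To see the Central Limit Theorem numerically, here is a minimal Python/NumPy sketch (not part of the original slides; all values are arbitrary). It sums $N$ independent uniform RVs and compares the empirical mean and variance of the sum with the Gaussian limit $\mathcal{N}(N/2, N/12)$:

import numpy as np

rng = np.random.default_rng(0)
N = 30            # number of independent RVs in each sum
trials = 100_000  # number of Monte-Carlo realizations

# Each uniform RV on (0,1) has mean 1/2 and variance 1/12,
# so the sum approaches N(N/2, N/12) for large N.
sums = rng.uniform(0.0, 1.0, size=(trials, N)).sum(axis=1)

print("sample mean:", sums.mean(), " theory:", N / 2)
print("sample var :", sums.var(),  " theory:", N / 12)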
Some other common PDF
Chi-square $\chi^2_n$:  $p(x) = \begin{cases} \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} \exp(-x/2) & x > 0 \\ 0 & x < 0 \end{cases}$

Exponential ($\lambda > 0$):  $p(x) = \frac{1}{\lambda} \exp(-x/\lambda)\, u(x)$

Rayleigh ($\sigma > 0$):  $p(x) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right) u(x)$

Uniform ($b > a$):  $p(x) = \begin{cases} \frac{1}{b-a} & a < x < b \\ 0 & \text{otherwise} \end{cases}$

where $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$ and $u(x)$ is the unit step function.
Joint, Marginal and Conditional PDF
Joint PDF: Consider two random variables $X$ and $Y$; then

$\Pr[x_1 < X \le x_2 \text{ and } y_1 < Y \le y_2] = \int_{x_1}^{x_2}\!\!\int_{y_1}^{y_2} p(x, y)\, dx\, dy$

Marginal PDF:

$p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy \qquad\text{and}\qquad p(y) = \int_{-\infty}^{\infty} p(x, y)\, dx$

Conditional PDF: $p(x|y)$ is defined as the PDF of $X$ conditioned on
knowing the value of $Y$.
Bayes Formula: Consider two RVs defined on the same probability space;
then we have:

$p(x, y) = p(x|y)p(y) = p(y|x)p(x) \qquad\text{or}\qquad p(x|y) = \frac{p(x, y)}{p(y)}$
Independent Random Variables
Two RVs $X$ and $Y$ are independent if and only if:

$p(x, y) = p(x)p(y)$

A direct conclusion is that:

$p(x|y) = \frac{p(x, y)}{p(y)} = \frac{p(x)p(y)}{p(y)} = p(x) \qquad\text{and}\qquad p(y|x) = p(y)$

which means that conditioning does not change the PDF.
Remark: For a joint Gaussian PDF the contours of constant density are
ellipses centered at $(\mu_x, \mu_y)$. For independent $X$ and $Y$ the major (or
minor) axis is parallel to the $x$ or $y$ axis.
Expected Value of a Random Variable
The expected value, if it exists, of a random variable $X$ with PDF $p(x)$ is
defined by:

$E(X) = \int_{-\infty}^{\infty} x\, p(x)\, dx$

Some properties of the expected value:

$E\{X + Y\} = E\{X\} + E\{Y\} \qquad\qquad E\{aX\} = aE\{X\}$

The expected value of $Y = g(X)$ can be computed by:

$E(Y) = \int_{-\infty}^{\infty} g(x)\, p(x)\, dx$

Conditional expectation: The conditional expectation of $X$ given that a
specific value of $Y$ has occurred is:

$E(X|Y) = \int_{-\infty}^{\infty} x\, p(x|y)\, dx$
Moments of a Random Variable
The $r$-th moment of $X$ is defined as:

$E(X^r) = \int_{-\infty}^{\infty} x^r p(x)\, dx$

The first moment of $X$ is its expected value or mean ($\mu = E(X)$).
Moments of Gaussian RVs: A Gaussian RV with $\mathcal{N}(\mu, \sigma^2)$ has
moments of all orders in closed form:

$E(X) = \mu$
$E(X^2) = \mu^2 + \sigma^2$
$E(X^3) = \mu^3 + 3\mu\sigma^2$
$E(X^4) = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$
$E(X^5) = \mu^5 + 10\mu^3\sigma^2 + 15\mu\sigma^4$
$E(X^6) = \mu^6 + 15\mu^4\sigma^2 + 45\mu^2\sigma^4 + 15\sigma^6$
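As a sanity check on these closed-form moments, a small Monte Carlo sketch (illustrative only; $\mu$, $\sigma$ and the sample size are arbitrary) compares sample moments of draws from $\mathcal{N}(\mu, \sigma^2)$ with the formulas for $E(X^3)$ and $E(X^4)$:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 0.5
x = rng.normal(mu, sigma, size=1_000_000)

# Closed-form Gaussian moments from the slide.
m3 = mu**3 + 3 * mu * sigma**2
m4 = mu**4 + 6 * mu**2 * sigma**2 + 3 * sigma**4

print("E(X^3): sample", (x**3).mean(), " formula", m3)
print("E(X^4): sample", (x**4).mean(), " formula", m4)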
Central Moments
The $r$-th central moment of $X$ is defined as:

$E[(X - \mu)^r] = \int_{-\infty}^{\infty} (x - \mu)^r p(x)\, dx$

The second central moment (the variance) is denoted by $\sigma^2$ or $\mathrm{var}(X)$.
Central Moments of Gaussian RVs:

$E[(X - \mu)^r] = \begin{cases} 0 & \text{if } r \text{ is odd} \\ \sigma^r (r-1)!! & \text{if } r \text{ is even} \end{cases}$

where $n!!$ denotes the double factorial, that is, the product of every odd
number from $n$ down to 1.
Some properties of Gaussian RVs
If $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ then $Z = (X - \mu_x)/\sigma_x \sim \mathcal{N}(0, 1)$.
If $Z \sim \mathcal{N}(0, 1)$ then $X = \sigma_x Z + \mu_x \sim \mathcal{N}(\mu_x, \sigma_x^2)$.
If $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ then $Z = aX + b \sim \mathcal{N}(a\mu_x + b, a^2\sigma_x^2)$.
If $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$ are two independent RVs, then

$aX + bY \sim \mathcal{N}(a\mu_x + b\mu_y,\; a^2\sigma_x^2 + b^2\sigma_y^2)$

The sum of the squares of $n$ independent RVs with standard normal
distribution $\mathcal{N}(0, 1)$ has a $\chi^2_n$ distribution with $n$ degrees of freedom.
For large values of $n$, $\chi^2_n$ converges to $\mathcal{N}(n, 2n)$.
The Euclidean norm $\sqrt{X^2 + Y^2}$ of two independent RVs with
standard normal distribution has a Rayleigh distribution.
Covariance
For two RVs $X$ and $Y$, the covariance is defined as

$\sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$

$\sigma_{xy} = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} (x - \mu_x)(y - \mu_y)\, p(x, y)\, dx\, dy$

If $X$ and $Y$ are zero mean then $\sigma_{xy} = E\{XY\}$.

$\mathrm{var}(X + Y) = \sigma_x^2 + \sigma_y^2 + 2\sigma_{xy} \qquad\qquad \mathrm{var}(aX) = a^2\sigma_x^2$

Important formula: The relation between the variance and the mean of
$X$ is given by

$\sigma^2 = E[(X - \mu)^2] = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2$

The variance is the mean of the square minus the square of the mean.
Independence, Uncorrelatedness and Orthogonality
If $\sigma_{xy} = 0$, then $X$ and $Y$ are uncorrelated and

$E\{XY\} = E\{X\}E\{Y\}$

$X$ and $Y$ are called orthogonal if $E\{XY\} = 0$.
If $X$ and $Y$ are independent then they are uncorrelated:

$p(x, y) = p(x)p(y) \;\Rightarrow\; E\{XY\} = E\{X\}E\{Y\}$

Uncorrelatedness does not imply independence. For example, if $X$
is a normal RV with zero mean and $Y = X^2$, we have $p(y|x) \ne p(y)$
but

$\sigma_{xy} = E\{XY\} - E\{X\}E\{Y\} = E\{X^3\} - 0 = 0$

Correlation only captures the linear dependence between two RVs, so
it is weaker than independence.
For jointly Gaussian RVs, independence is equivalent to being
uncorrelated.
Random Vectors
Random Vector: a vector of random variables¹:

$x = [x_1, x_2, \ldots, x_n]^T$

Expectation Vector: $\mu_x = E(x) = [E(x_1), E(x_2), \ldots, E(x_n)]^T$
Covariance Matrix: $C_x = E[(x - \mu_x)(x - \mu_x)^T]$
$C_x$ is an $n \times n$ symmetric matrix which is assumed to be positive
definite and hence invertible.
The elements of this matrix are $[C_x]_{ij} = E\{[x_i - E(x_i)][x_j - E(x_j)]\}$.
If the random variables are uncorrelated then $C_x$ is a diagonal matrix.
Multivariate Gaussian PDF:

$p(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_x)}} \exp\left(-\frac{1}{2}(x - \mu_x)^T C_x^{-1} (x - \mu_x)\right)$

¹ In some books (including our main reference) there is no distinction between a random
variable $X$ and its specific value $x$. From now on we adopt the notation of our reference.
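To make the multivariate Gaussian PDF concrete, the following sketch (illustrative; the 2-D mean and covariance values are made up) evaluates $p(x)$ directly from the formula above and checks it against scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D example: mean vector and a positive-definite covariance.
mu = np.array([1.0, -0.5])
C  = np.array([[2.0, 0.6],
               [0.6, 1.0]])
x  = np.array([0.3, 0.1])

# Direct evaluation of the multivariate Gaussian PDF from the slide.
n = len(mu)
d = x - mu
p = np.exp(-0.5 * d @ np.linalg.solve(C, d)) / np.sqrt((2 * np.pi) ** n * np.linalg.det(C))

print("formula:", p)
print("scipy  :", multivariate_normal(mean=mu, cov=C).pdf(x))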
Random Processes
Discrete Random Process: $x[n]$ is a sequence of random variables
defined for every integer $n$.
Mean value: defined as $E(x[n]) = \mu_x[n]$.
Autocorrelation Function (ACF): defined as

$r_{xx}[k, n] = E(x[n]\, x[n+k])$

Wide Sense Stationary (WSS): $x[n]$ is WSS if its mean and its
autocorrelation function (ACF) do not depend on $n$.
Autocovariance function: defined as

$c_{xx}[k] = E[(x[n] - \mu_x)(x[n+k] - \mu_x)] = r_{xx}[k] - \mu_x^2$

Cross-correlation Function (CCF): defined as

$r_{xy}[k] = E(x[n]\, y[n+k])$

Cross-covariance function: defined as

$c_{xy}[k] = E[(x[n] - \mu_x)(y[n+k] - \mu_y)] = r_{xy}[k] - \mu_x\mu_y$
Discrete White Noise
Some properties of the ACF and CCF:

$r_{xx}[0] \ge |r_{xx}[k]| \qquad r_{xx}[-k] = r_{xx}[k] \qquad r_{xy}[-k] = r_{yx}[k]$

Power Spectral Density: The Fourier transforms of the ACF and CCF give
the Auto-PSD and Cross-PSD:

$P_{xx}(f) = \sum_{k=-\infty}^{\infty} r_{xx}[k] \exp(-j 2\pi f k)$

$P_{xy}(f) = \sum_{k=-\infty}^{\infty} r_{xy}[k] \exp(-j 2\pi f k)$

Discrete White Noise: a discrete random process with zero mean and
$r_{xx}[k] = \sigma^2 \delta[k]$, where $\delta[k]$ is the Kronecker impulse function. The PSD of
white noise is therefore $P_{xx}(f) = \sigma^2$, completely flat with frequency.
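The flat PSD of white noise is equivalent to the ACF being $\sigma^2\delta[k]$, which is easy to check numerically. A minimal sketch (arbitrary $\sigma^2$ and sequence length):

import numpy as np

rng = np.random.default_rng(2)
sigma2, N = 2.0, 100_000
x = rng.normal(0.0, np.sqrt(sigma2), N)

# Biased sample ACF estimate for a few lags.
def acf(x, k):
    return np.mean(x[:N - k] * x[k:]) if k > 0 else np.mean(x * x)

for k in range(4):
    print(f"r_xx[{k}] ~ {acf(x, k): .4f}   (theory: {sigma2 if k == 0 else 0.0})")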
Introduction and Minimum Variance Unbiased
Estimation
The Mathematical Estimation Problem
Parameter Estimation Problem: Given a set of measured data

$x = \{x[0], x[1], \ldots, x[N-1]\}$

which depends on an unknown parameter vector $\theta$, determine an estimator

$\hat{\theta} = g(x[0], x[1], \ldots, x[N-1])$

where $g$ is some function.
The first step is to find the PDF of the data as a function of $\theta$: $p(x; \theta)$.
Example: Consider the problem of a DC level in white Gaussian noise with
one observed data point $x[0] = \theta + w[0]$, where $w[0]$ has the PDF $\mathcal{N}(0, \sigma^2)$.
Then the PDF of $x[0]$ is:

$p(x[0]; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x[0] - \theta)^2\right)$
The Mathematical Estimation Problem
Example: Consider a data sequence that can be modeled as a linear
trend in white Gaussian noise:

$x[n] = A + Bn + w[n] \qquad n = 0, 1, \ldots, N-1$

Suppose that $w[n] \sim \mathcal{N}(0, \sigma^2)$ and is uncorrelated with all the other
samples. Letting $\theta = [A\;\; B]^T$ and $x = [x[0], x[1], \ldots, x[N-1]]^T$, the PDF is:

$p(x; \theta) = \prod_{n=0}^{N-1} p(x[n]; \theta) = \frac{1}{(\sqrt{2\pi\sigma^2})^N} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A - Bn)^2\right)$

The quality of any estimator for this problem is tied to the assumptions
made on the data model; in this example, the linear trend and the WGN
PDF assumption.
The Mathematical Estimation Problem
Classical versus Bayesian estimation
If we assume $\theta$ is deterministic we have a classical estimation
problem. The following methods will be studied: MVU, MLE, BLUE,
LSE.
If we assume $\theta$ is a random variable with a known PDF, then we have
a Bayesian estimation problem. In this case the data are described by
the joint PDF

$p(x, \theta) = p(x|\theta)p(\theta)$

where $p(\theta)$ summarizes our knowledge about $\theta$ before any data are
observed and $p(x|\theta)$ summarizes the knowledge provided by the data $x$
conditioned on knowing $\theta$. The following methods will be studied:
MMSE, MAP, Kalman Filter.
Assessing Estimator Performance
Consider the problem of estimating a DC level $A$ in uncorrelated noise:

$x[n] = A + w[n] \qquad n = 0, 1, \ldots, N-1$

Consider the following estimators:

$\hat{A}_1 = \frac{1}{N}\sum_{n=0}^{N-1} x[n] \qquad\qquad \hat{A}_2 = x[0]$

Suppose that $A = 1$, $\hat{A}_1 = 0.95$ and $\hat{A}_2 = 0.98$. Which estimator is better?
An estimator is a random variable, so its
performance can only be described by its PDF or
statistically (e.g. by Monte Carlo simulation).
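A short Monte Carlo sketch ($A$, $\sigma$, $N$ and the number of trials are arbitrary choices, not from the slides) makes the point: a single realization of either estimator tells us little, but over many realizations $\hat{A}_1$ is far more concentrated around $A$ than $\hat{A}_2$:

import numpy as np

rng = np.random.default_rng(3)
A, sigma, N, trials = 1.0, 1.0, 50, 10_000

# trials x N matrix of noisy observations x[n] = A + w[n]
x = A + sigma * rng.standard_normal((trials, N))

A1 = x.mean(axis=1)   # sample-mean estimator
A2 = x[:, 0]          # first-sample estimator

print("A1: mean", A1.mean(), " var", A1.var(), " (theory sigma^2/N =", sigma**2 / N, ")")
print("A2: mean", A2.mean(), " var", A2.var(), " (theory sigma^2   =", sigma**2, ")")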
Unbiased Estimators
An estimator that on the average yields the true value is unbiased.
Mathematically:

$E(\hat{\theta}) - \theta = 0 \qquad \text{for } a < \theta < b$

Let's compute the expectation of the two estimators $\hat{A}_1$ and $\hat{A}_2$:

$E(\hat{A}_1) = \frac{1}{N}\sum_{n=0}^{N-1} E(x[n]) = \frac{1}{N}\sum_{n=0}^{N-1} E(A + w[n]) = \frac{1}{N}\sum_{n=0}^{N-1}(A + 0) = A$

$E(\hat{A}_2) = E(x[0]) = E(A + w[0]) = A + 0 = A$

Both estimators are unbiased. Which one is better?
Now, let's compute the variance of the two estimators:

$\mathrm{var}(\hat{A}_1) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1}\mathrm{var}(x[n]) = \frac{1}{N^2}\, N\sigma^2 = \frac{\sigma^2}{N}$

$\mathrm{var}(\hat{A}_2) = \mathrm{var}(x[0]) = \sigma^2 > \mathrm{var}(\hat{A}_1)$
Unbiased Estimators
Remark: When several unbiased estimators of the same parameter from
independent sets of data are available, i.e., $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n$, a better estimator
can be obtained by averaging:

$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} \hat{\theta}_i \qquad\qquad E(\hat{\theta}) = \theta$

Assuming that the estimators have the same variance, we have:

$\mathrm{var}(\hat{\theta}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{var}(\hat{\theta}_i) = \frac{1}{n^2}\, n\, \mathrm{var}(\hat{\theta}_i) = \frac{\mathrm{var}(\hat{\theta}_i)}{n}$

By increasing $n$, the variance will decrease (if $n \to \infty$, $\hat{\theta} \to \theta$).
This is not the case for biased estimators, no matter how many estimators
are averaged.
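The sketch below (all values made up) averages $n$ independent unbiased estimates and shows the variance shrinking like $\mathrm{var}(\hat{\theta}_i)/n$, while averaging biased estimates leaves the bias untouched:

import numpy as np

rng = np.random.default_rng(4)
theta, var_i, n, trials = 5.0, 4.0, 25, 20_000

# n independent unbiased estimates per trial, each with variance var_i.
est = theta + np.sqrt(var_i) * rng.standard_normal((trials, n))
avg = est.mean(axis=1)
print("unbiased: var of average", avg.var(), " theory", var_i / n)

# Same experiment with a constant bias b: the variance still shrinks,
# but the averaged estimator keeps the bias b.
b = 1.0
avg_biased = (est + b).mean(axis=1)
print("biased:   mean of average", avg_biased.mean(), " (true value", theta, ")")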
Minimum Variance Criterion
The most logical criterion for estimation is the Mean Square Error (MSE):

$\mathrm{mse}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$

Unfortunately this criterion leads to unrealizable estimators (the
estimator will depend on $\theta$):

$\mathrm{mse}(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta]^2\} = E\{[\hat{\theta} - E(\hat{\theta}) + b(\theta)]^2\}$

where $b(\theta) = E(\hat{\theta}) - \theta$ is defined as the bias of the estimator. Therefore:

$\mathrm{mse}(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta})]^2\} + 2b(\theta)E[\hat{\theta} - E(\hat{\theta})] + b^2(\theta) = \mathrm{var}(\hat{\theta}) + b^2(\theta)$

Instead of minimizing the MSE we can minimize the variance of
the unbiased estimators:
Minimum Variance Unbiased Estimator
Minimum Variance Unbiased Estimator
Existence of the MVU Estimator: In general the MVU estimator does not
always exist. There may be no unbiased estimator, or none of the unbiased
estimators has a uniformly minimum variance.
Finding the MVU Estimator: There is no known procedure which
always leads to the MVU estimator. Three existing approaches are:
1. Determine the Cramer-Rao lower bound (CRLB) and check to see if
some estimator satisfies it.
2. Apply the Rao-Blackwell-Lehmann-Scheffe theorem (we will skip it).
3. Restrict attention to linear unbiased estimators.
Cramer-Rao Lower Bound
Cramer-Rao Lower Bound
The CRLB is a lower bound on the variance of any unbiased estimator:

$\mathrm{var}(\hat{\theta}) \ge \mathrm{CRLB}(\theta)$

Note that the CRLB is a function of $\theta$.
It tells us the best performance that can be achieved
(useful in feasibility studies and in comparisons with other estimators).
It may lead us to compute the MVU estimator.
Cramer-Rao Lower Bound
Theorem (scalar case)
Assume that the PDF $p(x; \theta)$ satisfies the regularity condition

$E\left[\frac{\partial \ln p(x; \theta)}{\partial\theta}\right] = 0 \qquad \text{for all } \theta$

Then the variance of any unbiased estimator $\hat{\theta}$ satisfies

$\mathrm{var}(\hat{\theta}) \ge \left[-E\left(\frac{\partial^2 \ln p(x; \theta)}{\partial\theta^2}\right)\right]^{-1}$

An unbiased estimator that attains the CRLB can be found iff:

$\frac{\partial \ln p(x; \theta)}{\partial\theta} = I(\theta)(g(x) - \theta)$

for some functions $g(x)$ and $I(\theta)$. The estimator is $\hat{\theta} = g(x)$ and the
minimum variance is $1/I(\theta)$.
Cramer-Rao Lower Bound
Example: Consider $x[0] = A + w[0]$ with $w[0] \sim \mathcal{N}(0, \sigma^2)$.

$p(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x[0] - A)^2\right)$

$\ln p(x[0]; A) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}(x[0] - A)^2$

Then

$\frac{\partial \ln p(x[0]; A)}{\partial A} = \frac{1}{\sigma^2}(x[0] - A) \qquad\quad \frac{\partial^2 \ln p(x[0]; A)}{\partial A^2} = -\frac{1}{\sigma^2}$

According to the theorem:

$\mathrm{var}(\hat{A}) \ge \sigma^2 \qquad I(A) = \frac{1}{\sigma^2} \qquad \hat{A} = g(x[0]) = x[0]$
Cramer-Rao Lower Bound
Example: Consider multiple observations of a DC level in WGN:

$x[n] = A + w[n] \qquad n = 0, 1, \ldots, N-1 \quad\text{with } w[n] \sim \mathcal{N}(0, \sigma^2)$

$p(x; A) = \frac{1}{(\sqrt{2\pi\sigma^2})^N} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right)$

Then

$\frac{\partial \ln p(x; A)}{\partial A} = \frac{\partial}{\partial A}\left[-\ln\left((2\pi\sigma^2)^{N/2}\right) - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right]$

$= \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A) = \frac{N}{\sigma^2}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n] - A\right)$

According to the theorem:

$\mathrm{var}(\hat{A}) \ge \frac{\sigma^2}{N} \qquad I(A) = \frac{N}{\sigma^2} \qquad \hat{A} = g(x) = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$
Transformation of Parameters
If it is desired to estimate $\alpha = g(\theta)$, then the CRLB is:

$\mathrm{var}(\hat{\alpha}) \ge \left[-E\left(\frac{\partial^2 \ln p(x; \theta)}{\partial\theta^2}\right)\right]^{-1}\left(\frac{\partial g}{\partial\theta}\right)^2$

Example: Compute the CRLB for the estimation of the power ($A^2$) of a DC
level in noise:

$\mathrm{var}(\widehat{A^2}) \ge \frac{\sigma^2}{N}(2A)^2 = \frac{4A^2\sigma^2}{N}$

Definition
Efficient estimator: An unbiased estimator that attains the CRLB is said
to be efficient.

Example: Knowing that $\bar{x} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$ is an efficient estimator for $A$, is
$\bar{x}^2$ an efficient estimator for $A^2$?
Transformation of Parameters
Solution: Knowing that $\bar{x} \sim \mathcal{N}(A, \sigma^2/N)$, we have:

$E(\bar{x}^2) = E^2(\bar{x}) + \mathrm{var}(\bar{x}) = A^2 + \frac{\sigma^2}{N} \ne A^2$

So the estimator $\widehat{A^2} = \bar{x}^2$ is not even unbiased.
Let's look at the variance of this estimator:

$\mathrm{var}(\bar{x}^2) = E(\bar{x}^4) - E^2(\bar{x}^2)$

but we have from the moments of Gaussian RVs (see the Moments of a
Random Variable slide):

$E(\bar{x}^4) = A^4 + 6A^2\frac{\sigma^2}{N} + 3\left(\frac{\sigma^2}{N}\right)^2$

Therefore:

$\mathrm{var}(\bar{x}^2) = A^4 + 6A^2\frac{\sigma^2}{N} + \frac{3\sigma^4}{N^2} - \left(A^2 + \frac{\sigma^2}{N}\right)^2 = \frac{4A^2\sigma^2}{N} + \frac{2\sigma^4}{N^2}$
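A quick simulation (illustrative values) confirms both results: the bias of $\bar{x}^2$ is about $\sigma^2/N$ and its variance is close to $4A^2\sigma^2/N + 2\sigma^4/N^2$:

import numpy as np

rng = np.random.default_rng(5)
A, sigma, N, trials = 2.0, 1.0, 20, 200_000

xbar = (A + sigma * rng.standard_normal((trials, N))).mean(axis=1)
est = xbar**2

print("bias    :", est.mean() - A**2, " theory:", sigma**2 / N)
print("variance:", est.var(), " theory:", 4 * A**2 * sigma**2 / N + 2 * sigma**4 / N**2)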
Transformation of Parameters
Remarks:
The estimator $\widehat{A^2} = \bar{x}^2$ is biased and not efficient.
As $N \to \infty$ the bias goes to zero and the variance of the estimator
approaches the CRLB. Such estimators are called asymptotically efficient.
General Remarks:
If $g(\theta) = a\theta + b$ is an affine function of $\theta$, then $\widehat{g(\theta)} = g(\hat{\theta})$ is an
efficient estimator. First, it is unbiased: $E(a\hat{\theta} + b) = a\theta + b = g(\theta)$;
moreover:

$\mathrm{var}(\widehat{g(\theta)}) \ge \left(\frac{\partial g}{\partial\theta}\right)^2 \mathrm{var}(\hat{\theta}) = a^2\,\mathrm{var}(\hat{\theta})$

but $\mathrm{var}(\widehat{g(\theta)}) = \mathrm{var}(a\hat{\theta} + b) = a^2\,\mathrm{var}(\hat{\theta})$, so the CRLB is achieved.
If $g(\theta)$ is a nonlinear function of $\theta$ and $\hat{\theta}$ is an efficient estimator,
then $g(\hat{\theta})$ is an asymptotically efficient estimator.
Cramer-Rao Lower Bound
Theorem (Vector Parameter)
Assume that the PDF $p(x; \theta)$ satisfies the regularity condition

$E\left[\frac{\partial \ln p(x; \theta)}{\partial\theta}\right] = 0 \qquad \text{for all } \theta$

Then the covariance matrix of any unbiased estimator $\hat{\theta}$ satisfies

$C_{\hat{\theta}} - I^{-1}(\theta) \succeq 0$

where $\succeq 0$ means that the matrix is positive semidefinite. $I(\theta)$ is called the
Fisher information matrix and is given by:

$[I(\theta)]_{ij} = -E\left[\frac{\partial^2 \ln p(x; \theta)}{\partial\theta_i\,\partial\theta_j}\right]$

An unbiased estimator that attains the CRLB can be found iff:

$\frac{\partial \ln p(x; \theta)}{\partial\theta} = I(\theta)(g(x) - \theta)$
CRLB Extension to Vector Parameter
Example: Consider a DC level in WGN with both $A$ and $\sigma^2$ unknown.
Compute the CRLB for the estimation of $\theta = [A \;\; \sigma^2]^T$.

$\ln p(x; \theta) = -\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2$

The Fisher information matrix is:

$I(\theta) = -E\begin{bmatrix} \frac{\partial^2 \ln p(x;\theta)}{\partial A^2} & \frac{\partial^2 \ln p(x;\theta)}{\partial A\,\partial\sigma^2} \\ \frac{\partial^2 \ln p(x;\theta)}{\partial\sigma^2\,\partial A} & \frac{\partial^2 \ln p(x;\theta)}{\partial(\sigma^2)^2} \end{bmatrix} = \begin{bmatrix} \frac{N}{\sigma^2} & 0 \\ 0 & \frac{N}{2\sigma^4} \end{bmatrix}$

The matrix is diagonal (just for this example) and can be easily inverted to
yield:

$\mathrm{var}(\hat{A}) \ge \frac{\sigma^2}{N} \qquad\qquad \mathrm{var}(\widehat{\sigma^2}) \ge \frac{2\sigma^4}{N}$

Is there any unbiased estimator that achieves these bounds?
Transformation of Parameters
If it is desired to estimate $\alpha = g(\theta)$, and the CRLB for the covariance of
$\hat{\theta}$ is $I^{-1}(\theta)$, then:

$C_{\hat{\alpha}} - \frac{\partial g}{\partial\theta}\left[-E\left(\frac{\partial^2 \ln p(x; \theta)}{\partial\theta^2}\right)\right]^{-1}\left(\frac{\partial g}{\partial\theta}\right)^T \succeq 0$

Example: Consider a DC level in WGN with $A$ and $\sigma^2$ unknown.
Compute the CRLB for the estimation of the signal-to-noise ratio $\alpha = A^2/\sigma^2$.
We have $\theta = [A \;\; \sigma^2]^T$ and $\alpha = g(\theta) = \theta_1^2/\theta_2$; then the Jacobian is:

$\frac{\partial g(\theta)}{\partial\theta} = \left[\frac{\partial g(\theta)}{\partial\theta_1} \;\; \frac{\partial g(\theta)}{\partial\theta_2}\right] = \left[\frac{2A}{\sigma^2} \;\; -\frac{A^2}{\sigma^4}\right]$

So the CRLB is:

$\mathrm{var}(\hat{\alpha}) \ge \left[\frac{2A}{\sigma^2} \;\; -\frac{A^2}{\sigma^4}\right] \begin{bmatrix} \frac{N}{\sigma^2} & 0 \\ 0 & \frac{N}{2\sigma^4} \end{bmatrix}^{-1} \begin{bmatrix} \frac{2A}{\sigma^2} \\ -\frac{A^2}{\sigma^4} \end{bmatrix} = \frac{4\alpha + 2\alpha^2}{N}$
Linear Models with WGN
If the $N$ observed data samples can be modeled as

$x = H\theta + w$

where
$x$ is the $N \times 1$ observation vector,
$H$ is the $N \times p$ observation matrix (known, rank $p$),
$\theta$ is the $p \times 1$ vector of parameters to be estimated,
$w$ is the $N \times 1$ noise vector with PDF $\mathcal{N}(0, \sigma^2 I)$,
compute the CRLB and the MVU estimator that achieves this bound.
Step 1: Compute $\ln p(x; \theta)$.
Step 2: Compute $I(\theta) = -E\left[\frac{\partial^2 \ln p(x; \theta)}{\partial\theta^2}\right]$ and the covariance
matrix of $\hat{\theta}$: $C_{\hat{\theta}} = I^{-1}(\theta)$.
Step 3: Find the MVU estimator $g(x)$ by factoring

$\frac{\partial \ln p(x; \theta)}{\partial\theta} = I(\theta)[g(x) - \theta]$
Linear Models with WGN
Step 1: $\ln p(x; \theta) = -\ln\left((\sqrt{2\pi\sigma^2})^N\right) - \frac{1}{2\sigma^2}(x - H\theta)^T(x - H\theta)$

Step 2:

$\frac{\partial \ln p(x; \theta)}{\partial\theta} = -\frac{1}{2\sigma^2}\frac{\partial}{\partial\theta}\left[x^T x - 2x^T H\theta + \theta^T H^T H\theta\right] = \frac{1}{\sigma^2}\left[H^T x - H^T H\theta\right]$

Then

$I(\theta) = -E\left[\frac{\partial^2 \ln p(x; \theta)}{\partial\theta^2}\right] = \frac{1}{\sigma^2} H^T H$

Step 3: Find the MVU estimator $g(x)$ by factoring

$\frac{\partial \ln p(x; \theta)}{\partial\theta} = I(\theta)[g(x) - \theta] = \frac{H^T H}{\sigma^2}\left[(H^T H)^{-1} H^T x - \theta\right]$

Therefore:

$\hat{\theta} = g(x) = (H^T H)^{-1} H^T x \qquad\qquad C_{\hat{\theta}} = I^{-1}(\theta) = \sigma^2 (H^T H)^{-1}$
Linear Models with WGN
For a linear model with WGN represented by $x = H\theta + w$, the MVU
estimator is:

$\hat{\theta} = (H^T H)^{-1} H^T x$

This estimator is efficient and attains the CRLB.
That the estimator is unbiased can be seen easily from:

$E(\hat{\theta}) = (H^T H)^{-1} H^T E(H\theta + w) = \theta$

The statistical performance of $\hat{\theta}$ is completely specified because $\hat{\theta}$ is a
linear transformation of the Gaussian vector $x$ and hence has a Gaussian
distribution:

$\hat{\theta} \sim \mathcal{N}(\theta, \sigma^2(H^T H)^{-1})$
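Below is a minimal sketch of this estimator (illustrative; the data follow the linear-trend example $x[n] = A + Bn + w[n]$ seen earlier, with made-up values). It builds $H$ and computes $\hat{\theta} = (H^T H)^{-1} H^T x$ and its covariance $\sigma^2(H^T H)^{-1}$; numerically, lstsq is used rather than forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(6)
N, sigma = 100, 0.5
A_true, B_true = 1.0, 0.2

n = np.arange(N)
x = A_true + B_true * n + sigma * rng.standard_normal(N)

# Observation matrix for the linear trend: columns [1, n].
H = np.column_stack([np.ones(N), n])

# MVU estimator for the linear model with WGN.
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
C_theta = sigma**2 * np.linalg.inv(H.T @ H)

print("theta_hat:", theta_hat)                   # estimates of [A, B]
print("std devs :", np.sqrt(np.diag(C_theta)))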
Example (Curve Fitting)
Consider fitting the data $x[n]$ by a $p$-th order polynomial function of $n$:

$x[n] = \theta_0 + \theta_1 n + \theta_2 n^2 + \cdots + \theta_p n^p + w[n]$

We have $N$ data samples, so:

$x = [x[0], x[1], \ldots, x[N-1]]^T$
$w = [w[0], w[1], \ldots, w[N-1]]^T$
$\theta = [\theta_0, \theta_1, \ldots, \theta_p]^T$

so $x = H\theta + w$, where $H$ is the $N \times (p+1)$ matrix:

$H = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 4 & \cdots & 2^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & N-1 & (N-1)^2 & \cdots & (N-1)^p \end{bmatrix}$

Hence the MVU estimator is:

$\hat{\theta} = (H^T H)^{-1} H^T x$
Example (Fourier Analysis)
Consider the Fourier analysis of the data $x[n]$:

$x[n] = \sum_{k=1}^{M} a_k \cos\left(\frac{2\pi k n}{N}\right) + \sum_{k=1}^{M} b_k \sin\left(\frac{2\pi k n}{N}\right) + w[n]$

so we have $\theta = [a_1, a_2, \ldots, a_M, b_1, b_2, \ldots, b_M]^T$ and $x = H\theta + w$ where

$H = [h_1^a, h_2^a, \ldots, h_M^a, h_1^b, h_2^b, \ldots, h_M^b]$

with

$h_k^a = \begin{bmatrix} 1 \\ \cos\left(\frac{2\pi k}{N}\right) \\ \cos\left(\frac{2\pi k \cdot 2}{N}\right) \\ \vdots \\ \cos\left(\frac{2\pi k (N-1)}{N}\right) \end{bmatrix}, \qquad h_k^b = \begin{bmatrix} 0 \\ \sin\left(\frac{2\pi k}{N}\right) \\ \sin\left(\frac{2\pi k \cdot 2}{N}\right) \\ \vdots \\ \sin\left(\frac{2\pi k (N-1)}{N}\right) \end{bmatrix}$

Hence the MVU estimate of the Fourier coefficients is:

$\hat{\theta} = (H^T H)^{-1} H^T x$
Example (Fourier Analysis)
After simplification (noting that $(H^T H)^{-1} = \frac{2}{N} I$), we have:

$\hat{\theta} = \frac{2}{N}\left[(h_1^a)^T x, \ldots, (h_M^a)^T x, (h_1^b)^T x, \ldots, (h_M^b)^T x\right]^T$

which is the same as the standard solution:

$\hat{a}_k = \frac{2}{N}\sum_{n=0}^{N-1} x[n]\cos\left(\frac{2\pi k n}{N}\right), \qquad \hat{b}_k = \frac{2}{N}\sum_{n=0}^{N-1} x[n]\sin\left(\frac{2\pi k n}{N}\right)$

From the properties of linear models the estimates are unbiased.
The covariance matrix is:

$C_{\hat{\theta}} = \sigma^2 (H^T H)^{-1} = \frac{2\sigma^2}{N} I$

Note that $\hat{\theta}$ is Gaussian and $C_{\hat{\theta}}$ is diagonal (the amplitude estimates
are independent).
Example (System Identication)
Consider the identification of a Finite Impulse Response (FIR) model $h[k]$,
$k = 0, 1, \ldots, p-1$, with input $u[n]$ and output $x[n]$ provided for
$n = 0, 1, \ldots, N-1$:

$x[n] = \sum_{k=0}^{p-1} h[k]\, u[n-k] + w[n] \qquad n = 0, 1, \ldots, N-1$

The FIR model can be represented by the linear model $x = H\theta + w$ where

$\theta = \begin{bmatrix} h[0] \\ h[1] \\ \vdots \\ h[p-1] \end{bmatrix}_{p \times 1} \qquad\qquad H = \begin{bmatrix} u[0] & 0 & \cdots & 0 \\ u[1] & u[0] & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ u[N-1] & u[N-2] & \cdots & u[N-p] \end{bmatrix}_{N \times p}$

The MVU estimate is $\hat{\theta} = (H^T H)^{-1} H^T x$ with $C_{\hat{\theta}} = \sigma^2 (H^T H)^{-1}$.
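A minimal sketch of this FIR identification (the true impulse response, excitation and noise level are made up): it builds the convolution matrix $H$ column by column and applies the same least-squares/MVU formula:

import numpy as np

rng = np.random.default_rng(7)
N, p, sigma = 500, 4, 0.1
h_true = np.array([1.0, 0.5, -0.25, 0.1])       # assumed FIR coefficients

u = rng.standard_normal(N)                      # excitation signal
w = sigma * rng.standard_normal(N)
x = np.convolve(u, h_true)[:N] + w              # x[n] = sum_k h[k] u[n-k] + w[n]

# H[:, k] holds u[n-k] (zeros for n < k), matching the slide's matrix.
H = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(p)])

h_hat, *_ = np.linalg.lstsq(H, x, rcond=None)   # MVU estimate (H^T H)^{-1} H^T x
print("h_true:", h_true)
print("h_hat :", h_hat)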
Linear Models with Colored Gaussian Noise
Determine the MVU estimator for the linear model $x = H\theta + w$ with $w$ a
colored Gaussian noise with PDF $\mathcal{N}(0, C)$.
Whitening approach: Since $C$ is positive definite, its inverse can be
factored as $C^{-1} = D^T D$ where $D$ is an invertible matrix. This matrix acts
as a whitening transformation for $w$:

$E[(Dw)(Dw)^T] = E(D w w^T D^T) = D C D^T = D D^{-1} D^{-T} D^T = I$

Now if we transform the linear model $x = H\theta + w$ to

$x' = Dx = DH\theta + Dw = H'\theta + w'$

where $w' = Dw \sim \mathcal{N}(0, I)$ is white, we can compute the MVU
estimator as:

$\hat{\theta} = (H'^T H')^{-1} H'^T x' = (H^T D^T D H)^{-1} H^T D^T D x$

so we have:

$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x \qquad\text{with}\qquad C_{\hat{\theta}} = (H'^T H')^{-1} = (H^T C^{-1} H)^{-1}$
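Here is a small sketch of the whitening idea (the AR(1)-style noise covariance is an assumption made for illustration). A Cholesky factor of $C^{-1}$ plays the role of $D$, and the whitened estimate matches the direct formula $(H^T C^{-1} H)^{-1} H^T C^{-1} x$:

import numpy as np

rng = np.random.default_rng(8)
N, rho, sigma = 80, 0.8, 1.0
A_true = 3.0

# Colored noise with assumed covariance C[i,j] = sigma^2 * rho^|i-j|.
idx = np.arange(N)
C = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])
w = np.linalg.cholesky(C) @ rng.standard_normal(N)

H = np.ones((N, 1))                 # DC level: x[n] = A + w[n]
x = H @ [A_true] + w

# Whitening: C^{-1} = D^T D with D = L^T, where L is the Cholesky factor of C^{-1}.
Cinv = np.linalg.inv(C)
D = np.linalg.cholesky(Cinv).T
theta_white, *_ = np.linalg.lstsq(D @ H, D @ x, rcond=None)

# Direct formula for comparison.
theta_direct = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)

print("whitened estimate:", theta_white)
print("direct formula   :", theta_direct)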
Linear Models with known components
Consider a linear model $x = H\theta + s + w$, where $s$ is a known signal. To
determine the MVU estimator let $x' = x - s$, so that $x' = H\theta + w$ is a
standard linear model. The MVU estimator is:

$\hat{\theta} = (H^T H)^{-1} H^T (x - s) \qquad\text{with}\qquad C_{\hat{\theta}} = \sigma^2 (H^T H)^{-1}$

Example: Consider a DC level and an exponential in WGN:
$x[n] = A + r^n + w[n]$, where $r$ is known. Then we have:

$\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} A + \begin{bmatrix} 1 \\ r \\ \vdots \\ r^{N-1} \end{bmatrix} + \begin{bmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{bmatrix}$

The MVU estimator is:

$\hat{A} = (H^T H)^{-1} H^T (x - s) = \frac{1}{N}\sum_{n=0}^{N-1}(x[n] - r^n) \qquad\text{with}\qquad \mathrm{var}(\hat{A}) = \frac{\sigma^2}{N}$
Best Linear Unbiased Estimators (BLUE)
Problems with finding the MVU estimator:
The MVU estimator does not always exist or may be impossible to find.
The PDF of the data may be unknown.
BLUE is a suboptimal estimator that:
restricts the estimate to be linear in the data: $\hat{\theta} = Ax$;
restricts the estimate to be unbiased: $E(\hat{\theta}) = AE(x) = \theta$;
minimizes the variance of the estimate;
needs only the mean and the variance of the data (not the PDF). As
a result, in general, the PDF of the estimate cannot be computed.
Remark: The unbiasedness restriction implies a linear model for the data.
However, BLUE may still be used if the data are suitably transformed or
the model is linearized.
Finding the BLUE (Scalar Case)
1. Choose a linear estimator for the observed data
$x[n]$, $n = 0, 1, \ldots, N-1$:

$\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] = a^T x \qquad\text{where } a = [a_0, a_1, \ldots, a_{N-1}]^T$

2. Restrict the estimate to be unbiased:

$E(\hat{\theta}) = \sum_{n=0}^{N-1} a_n E(x[n]) = \theta$

3. Minimize the variance:

$\mathrm{var}(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta})]^2\} = E\{[a^T x - a^T E(x)]^2\} = E\{a^T [x - E(x)][x - E(x)]^T a\} = a^T C a$
Finding the BLUE (Scalar Case)
Consider the problem of estimating the amplitude of a known signal in noise:

$x[n] = \theta s[n] + w[n]$

1. Choose a linear estimator: $\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] = a^T x$

2. Restrict the estimate to be unbiased: $E(\hat{\theta}) = a^T E(x) = a^T s\,\theta = \theta$,
so $a^T s = 1$ where $s = [s[0], s[1], \ldots, s[N-1]]^T$.

3. Minimize $a^T C a$ subject to $a^T s = 1$.

The constrained optimization can be solved using Lagrange multipliers:
minimize $J = a^T C a + \lambda(a^T s - 1)$.
The optimal solution is:

$\hat{\theta} = \frac{s^T C^{-1} x}{s^T C^{-1} s} \qquad\text{and}\qquad \mathrm{var}(\hat{\theta}) = \frac{1}{s^T C^{-1} s}$
Finding the BLUE (Vector Case)
Theorem (Gauss-Markov)
If the data are of the general linear model form

$x = H\theta + w$

where $w$ is a noise vector with zero mean and covariance $C$ (the PDF of $w$
is otherwise arbitrary), then the BLUE of $\theta$ is:

$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x$

and the covariance matrix of $\hat{\theta}$ is

$C_{\hat{\theta}} = (H^T C^{-1} H)^{-1}$

Remark: If the noise is Gaussian then the BLUE is the MVU estimator.
Finding the BLUE
Example: Consider the problem of a DC level in white noise:
$x[n] = A + w[n]$, where $w[n]$ has an unspecified PDF with $\mathrm{var}(w[n]) = \sigma_n^2$.
We have $\theta = A$ and $H = [1, 1, \ldots, 1]^T$. The covariance matrix is:

$C = \begin{bmatrix} \sigma_0^2 & 0 & \cdots & 0 \\ 0 & \sigma_1^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{N-1}^2 \end{bmatrix} \qquad C^{-1} = \begin{bmatrix} \frac{1}{\sigma_0^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_1^2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_{N-1}^2} \end{bmatrix}$

and hence the BLUE is:

$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x = \left(\sum_{n=0}^{N-1}\frac{1}{\sigma_n^2}\right)^{-1}\sum_{n=0}^{N-1}\frac{x[n]}{\sigma_n^2}$

and the minimum variance is:

$C_{\hat{\theta}} = (H^T C^{-1} H)^{-1} = \left(\sum_{n=0}^{N-1}\frac{1}{\sigma_n^2}\right)^{-1}$
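A short sketch of this BLUE (the per-sample noise variances are made up): each sample is weighted by $1/\sigma_n^2$, and the weighted mean is compared with the plain sample mean:

import numpy as np

rng = np.random.default_rng(9)
A_true, N = 2.0, 200

# Assumed heteroscedastic noise: each sample has its own variance sigma_n^2.
sigma2_n = rng.uniform(0.5, 5.0, N)
x = A_true + np.sqrt(sigma2_n) * rng.standard_normal(N)

weights = 1.0 / sigma2_n
A_blue = np.sum(weights * x) / np.sum(weights)   # BLUE: weighted mean
var_blue = 1.0 / np.sum(weights)                 # minimum variance from the slide

print("BLUE estimate:", A_blue, " (variance", var_blue, ")")
print("sample mean  :", x.mean(), " (variance", sigma2_n.mean() / N, ")")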
Maximum Likelihood Estimation
Maximum Likelihood Estimation
Problems: The MVU estimator often does not exist or cannot be found,
and the BLUE is restricted to linear models.
Maximum Likelihood Estimator (MLE):
can always be applied if the PDF is known;
is optimal for large data sizes;
is computationally complex and may require numerical methods.
Basic Idea: Choose the parameter value that makes the observed data
the most likely data to have been observed.
Likelihood Function: the PDF $p(x; \theta)$ when $\theta$ is regarded as a variable
(not a parameter).
ML Estimate: the value of $\theta$ that maximizes the likelihood function.
Procedure: Find the log-likelihood function $\ln p(x; \theta)$; differentiate w.r.t. $\theta$,
set to zero and solve for $\theta$.
Maximum Likelihood Estimation
Example: Consider a DC level in WGN with unknown variance:
$x[n] = A + w[n]$. Suppose that $A > 0$ and $\sigma^2 = A$. The PDF is:

$p(x; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left(-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n] - A)^2\right)$

Taking the derivative of the log-likelihood function, we have:

$\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n] - A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n] - A)^2$

What is the CRLB? Does an MVU estimator exist?
The MLE can be found by setting the above equation to zero:

$\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}}$
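The closed-form root above can be checked numerically. The sketch below (arbitrary $A$ and $N$) compares the formula with a brute-force maximization of the log-likelihood over a grid of $A$ values:

import numpy as np

rng = np.random.default_rng(10)
A_true, N = 3.0, 1_000
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)   # sigma^2 = A

# Closed-form MLE from the slide.
A_mle = -0.5 + np.sqrt(np.mean(x**2) + 0.25)

# Brute-force check: maximize the log-likelihood on a grid.
def loglike(A):
    return -0.5 * N * np.log(2 * np.pi * A) - np.sum((x - A) ** 2) / (2 * A)

grid = np.linspace(0.1, 10.0, 20_000)
A_grid = grid[np.argmax([loglike(A) for A in grid])]

print("closed-form MLE:", A_mle)
print("grid search    :", A_grid)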
Maximum Likelihood Estimation
Properties: The MLE may be biased and is not necessarily an efficient
estimator. However:
The MLE is asymptotically unbiased, meaning that $\lim_{N\to\infty} E(\hat{\theta}) = \theta$.
The MLE asymptotically attains the CRLB.
The MLE asymptotically has a Gaussian PDF: $\hat{\theta} \stackrel{a}{\sim} \mathcal{N}(\theta, I^{-1}(\theta))$.
If an MVU estimator exists, then the ML procedure will find it.
Example: Consider a DC level in WGN with known variance $\sigma^2$. Then

$p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right)$

$\frac{\partial \ln p(x; A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A) = 0 \;\Rightarrow\; \sum_{n=0}^{N-1} x[n] - N\hat{A} = 0$

which leads to

$\hat{A} = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$
MLE for Transformed Parameters (Invariance Property)
Find the MLE of $\alpha = g(\theta)$ when the PDF $p(x; \theta)$ is given.
1. If $\alpha = g(\theta)$ is a one-to-one function, then

$\hat{\alpha} = \arg\max_{\alpha} p(x; g^{-1}(\alpha)) = g(\hat{\theta})$

Example: Consider a DC level in WGN and find the MLE of $\alpha = \exp(A)$.
Since $g(\theta)$ is a one-to-one function, $\hat{\alpha} = \exp(\bar{x})$.
2. If $\alpha = g(\theta)$ is not a one-to-one function, then

$p_T(x; \alpha) = \max_{\{\theta :\, \alpha = g(\theta)\}} p(x; \theta) \qquad\text{and}\qquad \hat{\alpha} = \arg\max_{\alpha} p_T(x; \alpha)$

Example: Consider a DC level in WGN and find the MLE of $\alpha = A^2$. Since
$g(\theta)$ is not a one-to-one function:

$\hat{\alpha} = \arg\max_{\alpha \ge 0}\, \max\{p(x; \sqrt{\alpha}),\, p(x; -\sqrt{\alpha})\} = \left(\arg\max_{-\infty < A < \infty} p(x; A)\right)^2 = \hat{A}^2 = \bar{x}^2$
MLE (Extension to Vector Parameter)
Example: DC level in WGN with unknown variance. The vector
parameter $\theta = [A \;\; \sigma^2]^T$ should be estimated. We have:

$\frac{\partial \ln p(x; \theta)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)$

$\frac{\partial \ln p(x; \theta)}{\partial\sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=0}^{N-1}(x[n] - A)^2$

which leads to the following MLE:

$\hat{A} = \bar{x} \qquad\qquad \widehat{\sigma^2} = \frac{1}{N}\sum_{n=0}^{N-1}(x[n] - \bar{x})^2$

Asymptotic properties of the MLE: Under some regularity conditions,
the MLE is asymptotically normally distributed, $\hat{\theta} \stackrel{a}{\sim} \mathcal{N}(\theta, I^{-1}(\theta))$, even if
the PDF of $x$ is not Gaussian.
MLE for General Gaussian Case
Consider the general Gaussian case where $x \sim \mathcal{N}(\mu(\theta), C(\theta))$.
The partial derivatives of the log-likelihood are:

$\frac{\partial \ln p(x; \theta)}{\partial\theta_k} = -\frac{1}{2}\mathrm{tr}\left[C^{-1}(\theta)\frac{\partial C(\theta)}{\partial\theta_k}\right] + \frac{\partial\mu(\theta)^T}{\partial\theta_k} C^{-1}(\theta)(x - \mu(\theta)) - \frac{1}{2}(x - \mu(\theta))^T \frac{\partial C^{-1}(\theta)}{\partial\theta_k}(x - \mu(\theta))$

for $k = 1, \ldots, p$.
By setting the above equations equal to zero, the MLE can be found.
A particular case is when $C$ is known (the first and third terms
become zero).
In addition, if $\mu(\theta)$ is linear in $\theta$, the general linear model is obtained.
MLE for General Linear Models
Consider the general linear model $x = H\theta + w$ where $w$ is a noise vector
with PDF $\mathcal{N}(0, C)$:

$p(x; \theta) = \frac{1}{\sqrt{(2\pi)^N \det(C)}} \exp\left(-\frac{1}{2}(x - H\theta)^T C^{-1}(x - H\theta)\right)$

Taking the derivative of $\ln p(x; \theta)$ leads to:

$\frac{\partial \ln p(x; \theta)}{\partial\theta} = \frac{\partial(H\theta)^T}{\partial\theta} C^{-1}(x - H\theta)$

Then

$H^T C^{-1}(x - H\theta) = 0 \;\Rightarrow\; \hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x$

which is the same as the MVU estimator. The PDF of $\hat{\theta}$ is:

$\hat{\theta} \sim \mathcal{N}(\theta, (H^T C^{-1} H)^{-1})$
MLE (Numerical Method)
Newton-Raphson: A closed-form estimator cannot always be obtained
by maximizing the likelihood function. However, the maximum can be
computed by numerical methods like the iterative Newton-Raphson
algorithm:

$\theta_{k+1} = \theta_k - \left[\frac{\partial^2 \ln p(x; \theta)}{\partial\theta\,\partial\theta^T}\right]^{-1}\left.\frac{\partial \ln p(x; \theta)}{\partial\theta}\right|_{\theta = \theta_k}$

Remarks:
The Hessian can be replaced by the negative of its expectation, the
Fisher information matrix $I(\theta)$.
This method suffers from convergence problems (local maxima).
Typically, for large data lengths, the log-likelihood function becomes
more nearly quadratic and the algorithm will produce the MLE.
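As an illustration (not from the slides), the sketch below applies the Newton-Raphson iteration to the earlier scalar example $x[n] = A + w[n]$ with $\sigma^2 = A$, for which the derivatives of the log-likelihood are available in closed form; the data and starting point are made up:

import numpy as np

rng = np.random.default_rng(11)
A_true, N = 3.0, 500
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)   # DC level with sigma^2 = A

def grad_hess(A):
    # First and second derivatives of ln p(x; A) for the sigma^2 = A example.
    s1 = np.sum(x - A)
    s2 = np.sum((x - A) ** 2)
    grad = -N / (2 * A) + s1 / A + s2 / (2 * A**2)
    hess = N / (2 * A**2) - N / A - 2 * s1 / A**2 - s2 / A**3
    return grad, hess

A = np.mean(x)                # reasonable starting point
for k in range(20):
    g, h = grad_hess(A)
    step = g / h
    A -= step                 # Newton-Raphson update: A <- A - H^{-1} g
    if abs(step) < 1e-10:
        break

print("Newton-Raphson MLE:", A)
print("closed-form MLE   :", -0.5 + np.sqrt(np.mean(x**2) + 0.25))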
Least Squares Estimation