
Statistics and Estimation

Matthew Kupinski
Optical Sciences 512L
Laboratory #10
Due November 14, 2011
1 Introduction
The process of using data observed from an experiment to determine the value of some param-
eter of interest is known as estimation. Estimation is synonymous with data analysis and is used
throughout optics. For example, astronomers might use data taken from telescopes to estimate
the mass, diameter, etc. of a distant star. A researcher in medical imaging might use measure-
ments taken on many patients to estimate the risks associated with smoking. In this laboratory, you
will be introduced to the basics of estimation theory and the techniques of maximum-likelihood
estimation.
First, let us review basic probability theory. If you are already comfortable with probability,
then you may skip to section 3.
2 Review of Probability
Random variables (RVs) come in two flavors: discrete and continuous. In the case of discrete
RVs, it makes sense to speak of a probability associated with each value that the RV can take on.
That is, the probability that the RV x equals some value x_1 is denoted by Pr(x = x_1) (or Pr(x_1)
for short). For example, this value may be the number of photons collected during an integration
time of a detector. Random variables may also be continuous and range from -∞ to ∞. In that
case, the probability that we obtain a particular value of the continuous RV is vanishingly small. It
makes more sense to speak of obtaining a value that falls within some prescribed interval, i.e.,

\Pr(x_1 < x \le x_1 + dx) = \mathrm{pr}(x_1)\,dx. \qquad (1)

The term pr(x_1) is called the probability density function (pdf) associated with the random variable x.
The cumulative distribution function (CDF) for a RV x is the probability that the RV x is less
than or equal to some value x_1, i.e.,

F(x_1) = \Pr(x \le x_1). \qquad (2)

For continuous RVs,

F(x_1) = \int_{-\infty}^{x_1} \mathrm{pr}(x)\,dx, \qquad (3)

and for discrete RVs,

F(x_1) = \sum_{i:\, x_i \le x_1} \Pr(x_i). \qquad (4)
2.1 Moments
Let's begin with the mean or expectation value. Consider a discrete RV n that we measure
in a large number of trials. For example, we may record the number of photons collected by
a detector element in a large number of exposures, each lasting the same integration time. The
random variable may assume values n_i, where i = 1, . . . , N. The number of occurrences of the
value n_i is m(n_i). The average value of the counts is then given by

\langle n \rangle_M = \frac{1}{M} \sum_{i=1}^{N} m(n_i)\, n_i, \qquad (5)

where M is the number of trials. The frequency of occurrence is

f(n_i) = m(n_i)/M. \qquad (6)

The average value listed above is an estimate of the true mean value. We approach the true
mean value as M → ∞. From the law of large numbers,

\lim_{M \to \infty} \frac{m(n_i)}{M} = \Pr(n_i). \qquad (7)

The expression for the expectation value may be rewritten in the limit as

\langle n \rangle = \sum_{i=1}^{N} n_i \Pr(n_i). \qquad (8)

Similarly, for continuous RVs,

\langle x \rangle = \int_{-\infty}^{\infty} x'\, \mathrm{pr}(x')\,dx'. \qquad (9)
The quantity < n > (or < x >) is also known as the mean, the expected value, or the ensemble
average.
We define the kth moment of the random variable x as

\langle x^k \rangle = \int_{-\infty}^{\infty} x'^k\, \mathrm{pr}(x')\,dx' \quad \text{and} \quad \langle x^k \rangle = \sum_{i=1}^{N} x_i^k \Pr(x_i) \qquad (10)
for continuous and discrete random variables, respectively. We also consider central moments.
These are so called because they express the deviation of a random variable from its mean value.
Central moments are also designated by the power to which the quantity is raised; we may speak
of a second central moment, for instance. The second central moment, in particular, is known as
the variance and is given by

\langle (x - \bar{x})^2 \rangle = \int_{-\infty}^{\infty} (x' - \bar{x})^2\, \mathrm{pr}(x')\,dx' \quad \text{and} \quad \langle (x - \bar{x})^2 \rangle = \sum_{i=1}^{N} (x_i - \bar{x})^2 \Pr(x_i), \qquad (11)

where \bar{x} = \langle x \rangle. The square root of the variance yields the standard deviation of the RV, the
root-mean-square (rms) value of the fluctuations of the RV around its mean value.
Finally, we define the expectation of a general function f(x) as

\langle f(x) \rangle = \int_{-\infty}^{\infty} f(x')\, \mathrm{pr}(x')\,dx' \quad \text{and} \quad \langle f(x) \rangle = \sum_{i=1}^{N} f(x_i) \Pr(x_i). \qquad (12)
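As an aside, these ensemble averages can be approximated from a finite number of samples. The following MATLAB sketch (my own illustration; the mean and standard deviation are arbitrary choices) estimates the first moment, the second central moment, and the third moment of a normal RV from M samples and compares them to the analytic values.

    % Estimate moments of x ~ N(xbar, sigma^2) from M samples (illustrative sketch)
    M     = 1e5;                        % number of samples
    xbar  = 2;                          % true mean
    sigma = 3;                          % true standard deviation
    x     = xbar + sigma*randn(M, 1);   % draw M samples

    mean_est = sum(x)/M;                  % approximates <x>
    var_est  = sum((x - mean_est).^2)/M;  % second central moment, approximates sigma^2
    m3_est   = sum(x.^3)/M;               % third moment <x^3>

    fprintf('mean:     estimate %.3f, true %.3f\n', mean_est, xbar);
    fprintf('variance: estimate %.3f, true %.3f\n', var_est, sigma^2);
    fprintf('<x^3>:    estimate %.3f, true %.3f\n', m3_est, xbar^3 + 3*xbar*sigma^2);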
2.2 Characteristic Functions
We have already seen the utility of Fourier methods in treating linear, shift-invariant systems.
The concepts of the point-spread function, the optical transfer function, and the modulation transfer
function allow us to express the performance of an optical system. Fourier methods also nd use in
probability theory. A probability law may be Fourier transformed to yield a characteristic function.
We dene the characteristic function as

x
() =
_

dxpr(x) exp(2ix) = exp(2ix) . (13)


One can also dene the probability density function pr(x) in terms of the inverse Fourier trans-
form of the characteristic function
x
(). (Most statistics books actually dene the characteristic
function as an integral over exp(ix). However, to be consistent with the Fourier notation we have
used throughout this course, I will use the denition in Eqn. 13.)
Whatever the form, the characteristic function is useful in many respects, although only two
will be covered here. First, it may be used to calculate the moments of a random variable,

\langle x^n \rangle = \frac{1}{(2\pi i)^n} \left. \frac{d^n \Psi_x(\xi)}{d\xi^n} \right|_{\xi = 0}. \qquad (14)

So, if you know the characteristic function of a random variable, it is relatively easy to determine
any moment simply by taking the appropriate derivative of the characteristic function evaluated at ξ = 0.
The other use of the characteristic function we will cover deals with sums of independent
RVs. Consider the addition of two independent RVs, x and y, i.e., z = x + y. We want to
determine the probability law on z. We employ the characteristic function. We seek to determine
the characteristic function associated with the RV z, i.e., Ψ_z(ξ). The notation developed above
allows us to write

\Psi_z(\xi) = \left\langle e^{2\pi i \xi z} \right\rangle = \left\langle e^{2\pi i \xi (x+y)} \right\rangle = \left\langle e^{2\pi i \xi x} \right\rangle \left\langle e^{2\pi i \xi y} \right\rangle = \Psi_x(\xi)\, \Psi_y(\xi), \qquad (15)

where we took advantage of the statistical independence of x and y in splitting up the brackets.
This result implies that

\mathrm{pr}_z(z) = \mathrm{pr}_x(z) * \mathrm{pr}_y(z), \qquad (16)

where * represents a convolution.
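This convolution result is easy to check numerically. The MATLAB sketch below (my own illustration) adds two independent RVs that are uniform on [0, 1] and compares a histogram of the sum to the triangular pdf obtained by convolving the two rect functions.

    % z = x + y with x, y independent and uniform on [0,1];
    % pr_z(z) should be the triangular pdf given by convolving two rect functions.
    M = 1e5;
    z = rand(M, 1) + rand(M, 1);

    w       = 0.05;                          % histogram bin width
    edges   = 0:w:2;
    centers = edges(1:end-1) + w/2;
    counts  = histc(z, edges);
    pdf_est = counts(1:end-1) / (M*w);       % normalize the histogram to a density

    pdf_true = min(centers, 2 - centers);    % triangular pdf on [0, 2]

    plot(centers, pdf_est, 'o', centers, pdf_true, '-');
    xlabel('z'); ylabel('pr_z(z)'); legend('histogram estimate', 'convolution result');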
2.3 Transformations of RVs
We consider two questions: First, given the probability density function (pdf) p
x
(x) on x and
the functional relationship,
y = f(x) (17)
what is the pdf p
y
(y) on y? Second, given an available p
y
(y) in which we are able to draw samples
from (like a uniform random number generator), how does one draw samples from an arbitrary
desired pdf p
x
(x).
Figure 1: An example of a monotonic transformation. (The curve f(x) maps the bin [x_0, x_0 + dx] on the x axis to the bin [y_0, y_0 + dy] on the y axis.)
We concentrate first on variable transformations with a unique root. Starting with the first
question, we first consider the case in which the transformation from x into y is unique. In this
case we have a single root and therefore a one-to-one correspondence from x-space to y-space. A
general monotonic transformation is depicted in Fig. 1. From the figure we see that any given bin
in x maps uniquely into a bin in y. The relative number of times that bin x_0 occurs is the same as
the relative number of times that bin y_0 will occur. We express this as

\Pr(x_0 \le x \le x_0 + dx) = \Pr(y_0 \le y \le y_0 + dy). \qquad (18)

Both sides of the above equality are definitions of pdfs, i.e.,

\mathrm{pr}_x(x)\,dx = \mathrm{pr}_y(y)\,dy. \qquad (19)
If we now invert the functional relationship, x = f^{-1}(y), we can express pr_y(y) in terms of known
quantities,

\mathrm{pr}_y(y) = \mathrm{pr}_x\!\left( f^{-1}(y) \right) \left| \frac{dy}{dx} \right|^{-1}. \qquad (20)

The magnitude appears because the bin widths dx and dy are always positive. Using the transform
y = f(x) we have now transformed the pdf on x into a pdf on y.
Now we examine the consequences of a variable transformation having multiple roots. The
above example covered the case for a monotonic transformation. We can easily extend this result
to the case where the functional relation has two roots. (The general multiple-root case is just a
straightforward extension of the two-root case.) We consider the transformation shown in Fig. 2.
This function might be a parabola or a Gaussian. For multiple roots, we add the relative number of
times that root 1 and root 2 occur to find the relative number of times the corresponding bin in y is
selected. We find that

\Pr(y_0 \le y \le y_0 + dy) = \Pr(x_1 \le x \le x_1 + dx_1) + \Pr(x_2 \le x \le x_2 + dx_2), \qquad (21)

which, in terms of their pdfs, is

\mathrm{pr}_y(y)\,dy = \mathrm{pr}_x(x_1)\,dx_1 + \mathrm{pr}_x(x_2)\,dx_2. \qquad (22)
Figure 2: An example of a non-monotonic transformation. (The two roots x_1 and x_2, with bins [x_1, x_1 + dx_1] and [x_2, x_2 + dx_2], both map into the bin [y_0, y_0 + dy].)
This simplifies to

\mathrm{pr}_y(y) = \mathrm{pr}_x(x_1) \left| \frac{d}{dx} f(x_1) \right|^{-1} + \mathrm{pr}_x(x_2) \left| \frac{d}{dx} f(x_2) \right|^{-1}. \qquad (23)
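As a concrete check of Eqn. 23 (an example of my own choosing, not taken from the text above), take y = x^2 with x drawn from a zero-mean, unit-variance normal. The two roots are x_{1,2} = ±√y, |df/dx| = 2√y at both roots, and Eqn. 23 gives pr_y(y) = exp(-y/2)/√(2πy). A minimal MATLAB sketch:

    % Two-root transformation y = x^2 with x ~ N(0,1); numerical check of Eqn. 23.
    M = 1e5;
    y = randn(M, 1).^2;

    w       = 0.1;
    edges   = 0:w:6;
    centers = edges(1:end-1) + w/2;
    counts  = histc(y, edges);
    pdf_est = counts(1:end-1) / (M*w);

    % Eqn. 23 with x1 = sqrt(y), x2 = -sqrt(y), and |df/dx| = 2*sqrt(y)
    pdf_true = exp(-centers/2) ./ sqrt(2*pi*centers);

    plot(centers, pdf_est, 'o', centers, pdf_true, '-');
    xlabel('y'); ylabel('pr_y(y)'); legend('histogram estimate', 'Eqn. 23');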
Finally, let us examine the case where we have

y = F_z^{-1}(x), \qquad (24)

where x is a uniform RV between 0 and 1 and F_z^{-1}(·) is the inverse of some cumulative distribution
function F_z(·). It is easy to show that pr_y(·) is equal to pr_z(·), where pr_z(·) is the pdf that generated
F_z(·). Thus, if you know the CDF for an arbitrary distribution and can generate uniform samples
between 0 and 1, then you can generate samples of pr_z(z) using this method.
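For example (my own illustration; the Rayleigh density and its parameter s are chosen only to demonstrate the method), the Rayleigh pdf pr_z(z) = (z/s²) exp(-z²/(2s²)) has the CDF F_z(z) = 1 - exp(-z²/(2s²)), so F_z^{-1}(x) = s√(-2 log(1 - x)). Feeding uniform samples through this inverse CDF produces Rayleigh-distributed samples:

    % Inverse-CDF sampling: y = F_z^{-1}(x) with x uniform on (0,1).
    M = 1e4;
    s = 2;                                  % Rayleigh scale parameter (assumed)
    x = rand(M, 1);                         % uniform samples
    y = s * sqrt(-2*log(1 - x));            % samples from the Rayleigh pdf

    w       = 0.2;
    edges   = 0:w:10;
    centers = edges(1:end-1) + w/2;
    counts  = histc(y, edges);
    pdf_true = (centers/s^2) .* exp(-centers.^2/(2*s^2));

    plot(centers, counts(1:end-1)/(M*w), 'o', centers, pdf_true, '-');
    xlabel('y'); ylabel('pr_y(y)'); legend('samples', 'target pdf');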
2.4 Examples of Probability Laws
2.4.1 Uniform
The most elementary probability density function is the uniform law, described by the celebrated
rect function,

\mathrm{pr}(x) = \frac{1}{b}\, \mathrm{rect}\!\left( \frac{x - a}{b} \right). \qquad (25)

Many computer random-number generators produce numbers according to this law (i.e., rand).
For the homework assignment, you will use a uniform pdf to sample from arbitrary density functions.
2.4.2 Binomial
The binomial law is a discrete probability law. It is given by

\Pr(n) = \frac{N!}{n!\,(N-n)!}\, p^n q^{N-n}. \qquad (26)

The parameter N is the number of trials. The parameter n is the number of occurrences of one of
two events or, say, successes. Success occurs with a probability p ∈ [0, 1]. Finally, q = 1 - p. Note
that the sequence of successes is not specified; only that n of them occur in N trials. Consider, for
example, flipping a fair coin, i.e., p = 0.5. What is the probability of getting two heads in 10 flips?
Pr(2) ≈ 4.4%.
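The 4.4% figure follows directly from Eqn. 26 with N = 10, n = 2, and p = q = 0.5; a one-line MATLAB check (my own):

    % Pr(2 heads in 10 flips of a fair coin), Eqn. 26
    N = 10;  n = 2;  p = 0.5;  q = 1 - p;
    Pr2 = nchoosek(N, n) * p^n * q^(N - n)   % = 45/1024, approximately 0.044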
For a large number of trials N and a low probability of occurrence p ≪ 1, the binomial
probability law becomes the Poisson law. The latter is often preferred due to its functional form and
the fact that only one parameter is needed to define it.
2.4.3 Poisson
One of the most famous discrete probability laws is the Poisson, given by

\Pr(n) = e^{-a}\, \frac{a^n}{n!}, \qquad (27)
where a > 0. The parameter a is solely responsible for the shape of the probability law. It can be
shown that a is both the mean and the variance of a Poisson RV.
The parameter n may denote the number of photons that arrive at a detector element in a
prescribed period of time (integration time) or the number of raindrops on a windshield between
intermittent passes of the wipers.
2.4.4 Normal
The probability law that is most commonly used is the normal law, given by

\mathrm{pr}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \bar{x})^2}{2\sigma^2} \right). \qquad (28)

One reason that the normal law is used so frequently is the central limit theorem: sums of independent
random variables approach Gaussian distributions. The notation x ~ N(x̄, σ²) means that
x is a random variable with mean value x̄ and variance σ². The normal law is called the Gaussian
probability law when x̄ = 0.
The normal law is most commonly used for three reasons. First, many processes are Gaussian,
especially those that are related to basic thermal noise in systems. Second, the central limit
theorem, already mentioned above, states that any process that is the result of elementary random
processes acting in tandem will tend to be governed by a normal law. Third, linear systems
preserve the normality of a process, i.e., the output of a linear system excited by a normal-law
process is itself normal-law. In addition, the normal law is fully specified by the first two moments
of its variables. In many situations, these two parameters are all that can be reliably estimated. The
normal law also permits relatively complete solutions of many interesting problems.
3 Estimation Theory
Now that you have reviewed basic probability theory, we will begin the discussion of estima-
tion.
3.1 Bias, Variance, and MSE
The goal in estimation is to use data {x_i}¹ to determine, as best one can, the value of some
parameter of interest θ. In general, each observation x_i could be a vector of observations x_i, and
we might be interested in estimating a vector of parameters θ. Here, bold symbols denote vectors.
So an estimation technique can be thought of as a mapping from observation space to an estimate
of θ. That is, θ̂ = Θ({x_i}), where Θ(·) is the estimation procedure, and the hat symbol over the θ
denotes an estimate of the true value θ. For the remainder of this handout, we will assume that we
are trying to estimate a scalar θ.
The estimation procedure Θ(·) produces an estimate θ̂ which may be close to the true value θ.
We are interested in characterizing the performance of Θ(·) in estimating θ. To accomplish this,
we use the bias and variance of the estimator. The bias is a measure of how close, on average,
θ̂ is to θ. The bias is defined as

\mathrm{Bias} = \left\langle \Theta(\{x_i\}) \right\rangle - \theta. \qquad (29)
An estimator that has a bias of zero is said to be unbiased. Unbiased estimators produce estimates
that are, on average, equal to the true value of the parameter of interest. This says nothing about
how variable the estimates produced by Θ(·) are. This information is characterized by the variance
of the estimator, defined by

\mathrm{Var} = \left\langle \left( \Theta(\{x_i\}) - \left\langle \Theta(\{x_i\}) \right\rangle \right)^2 \right\rangle = \left\langle \Theta^2(\{x_i\}) \right\rangle - \left\langle \Theta(\{x_i\}) \right\rangle^2. \qquad (30)
It is desirable to have an estimator that has both a small bias and a small variance. Thus, the
bias and variance are often combined into a single expression called the mean-squared error or
MSE,

\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Var} = \left\langle \left( \Theta(\{x_i\}) - \theta \right)^2 \right\rangle. \qquad (31)

There are always difficulties when a single figure-of-merit (like MSE) is assigned to an estimator
that has two inherent figures-of-merit, the bias and the variance. For example, what if Θ_1(·) has a
bias of 2 and a variance of 4, and the estimator Θ_2(·) has a variance of 1 and a bias of √7? Both of
these estimators have an MSE of 8, so how do we decide which method (Θ_1(·) or Θ_2(·)) is the better
estimator?
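The bias, variance, and MSE of an estimator can be estimated by repeating an experiment many times in simulation. The MATLAB sketch below (my own illustration; the shrinkage estimator is invented purely to show the trade-off) compares the sample mean with an estimator that shrinks the sample mean toward zero, lowering the variance at the price of a bias, and verifies numerically that MSE = Bias² + Var.

    % Monte Carlo estimate of bias, variance, and MSE for two estimators of a mean.
    theta = 5;  sigma = 4;  N = 10;  trials = 1e4;
    est1 = zeros(trials, 1);                 % sample mean
    est2 = zeros(trials, 1);                 % shrinkage estimator (biased, lower variance)
    for t = 1:trials
        x = theta + sigma*randn(N, 1);
        est1(t) = mean(x);
        est2(t) = 0.8*mean(x);               % shrink toward zero
    end
    for est = {est1, est2}
        e    = est{1};
        bias = mean(e) - theta;
        v    = var(e, 1);                    % second central moment of the estimates
        mse  = mean((e - theta).^2);         % direct MSE; equals bias^2 + v (Eqn. 31)
        fprintf('bias = %6.3f  var = %6.3f  bias^2+var = %6.3f  MSE = %6.3f\n', ...
                bias, v, bias^2 + v, mse);
    end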
3.1.1 An Example
At this point it is probably useful to work through a simple example. Let us say that we have N
independent samples from some pdf pr(x), denoted by {x_i}; i = 1, . . . , N. The random variable x
has mean x̄ and variance σ². Our task is to use {x_i} to estimate x̄. We will, of course, be using
the sample-mean formula to accomplish this,

\hat{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (32)

¹The squiggle brackets denote the entire set of data, i.e., x_1, x_2, . . . , x_N, where N is the number of observations in
the dataset.

To relate this expression to our notation above, θ is x̄, θ̂ is the sample mean \hat{\bar{x}}, and Θ(·) is shown in Eqn. 32.
Now, let us study the bias of our estimator. We begin by taking the expectation of \hat{\bar{x}} and using
the fact that expectations are linear and can be taken inside summations:

\mathrm{Bias} = \left\langle \frac{1}{N} \sum_{i=1}^{N} x_i \right\rangle - \bar{x}
             = \frac{1}{N} \sum_{i=1}^{N} \langle x_i \rangle - \bar{x}
             = \bar{x} - \bar{x} = 0.
Thus, this estimator is unbiased. The variance of this estimator is

\mathrm{Var} = \left\langle \frac{1}{N^2} \sum_{i=1}^{N} x_i \sum_{j=1}^{N} x_j \right\rangle - \bar{x}^2
             = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \langle x_i x_j \rangle - \bar{x}^2.

Now, there are N terms in the above summation where i = j and N² - N terms where i ≠ j.
Thus, because x_i and x_j are independent when i ≠ j, the above equation simplifies to
\mathrm{Var} = \frac{1}{N^2} \left[ N \left\langle x^2 \right\rangle + (N^2 - N)\,\bar{x}^2 \right] - \bar{x}^2
             = \frac{1}{N} \left[ \left\langle x^2 \right\rangle - \bar{x}^2 \right]
             = \frac{\sigma^2}{N}. \qquad (33)

The square root of this quantity, σ/√N, is known as the standard error. This result verifies what
you probably already knew: as you increase the number of samples N used to estimate the mean,
the variance of your estimate goes down.
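A quick Monte Carlo check of Eqn. 33 (a sketch of my own): for several values of N, the empirical variance of the sample mean should track σ²/N.

    % Verify Var(sample mean) = sigma^2/N by simulation.
    sigma = 2;  trials = 1e4;
    for N = [10 100 1000]
        xbar_hat = mean(sigma*randn(N, trials), 1);   % one sample mean per column
        fprintf('N = %4d: empirical var = %.5f, sigma^2/N = %.5f\n', ...
                N, var(xbar_hat), sigma^2/N);
    end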
3.2 Maximum-Likelihood Estimation
We will cover the very important topic of maximum-likelihood (ML) estimation. ML estima-
tion is a very powerful estimation procedure that has nice asymptotic properties. It is commonly
used for data analysis and parameter estimation. We will also briefly discuss the topic of Bayesian
estimation.
3.2.1 The Model
All ML estimation procedures begin with a statistical model. The model is one of the most
important aspects of ML estimation and one area where mistakes are often made. As we will
see, ML estimation is a very powerful procedure, but if the underlying model employed in ML
estimation is in error, then any conclusions drawn from the estimation procedure may be invalid.
To help in understanding ML estimation, we will discuss two examples.
Example 1 Envision an experiment in which you are attempting to use N images of Jupiter
to estimate the diameter of the planet. The model that we assume is that our measurements of
the diameter of Jupiter are Gaussian distributed around the true value of the diameter with some
variance σ². This implies that each measurement d_i is a sample from the pdf

\mathrm{pr}(d|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} (d - \theta)^2 \right), \qquad (34)

where θ is the true diameter of Jupiter. This notation can be confusing because statisticians often
use the notation pr(x|y) to denote the probability of observing x given that the random variable y
was observed to take the value y. We are using the notation pr(d_i|θ) as the probability density of d_i
as a function of the non-random parameter θ. For example, it does not make sense for us to apply
Bayes' rule to pr(d_i|θ) because the term pr(θ) is meaningless given that θ is a fixed (although unknown)
parameter. (IMPORTANT NOTE: This is only true for a frequentist definition of probability. The
previous statement is not true for a Bayesian viewpoint of probability. We will discuss this briefly
in the last section of these notes.)
Thus far we have derived the pdf for a single measurement of the diameter of Jupiter as a function
of the true diameter of Jupiter (Eqn. 34). We can go a step further and use the independence
of our measurements to derive the pdf for our entire set of measurements {d_i}; i = 1, . . . , N as

\mathrm{pr}(\{d_i\}|\theta) = \prod_{i=1}^{N} \mathrm{pr}(d_i|\theta)
  = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (d_i - \theta)^2 \right). \qquad (35)

Equation 35 is known as the likelihood of the diameter of Jupiter. A likelihood is always the
probability (or probability density) of the observed data as a function of the value(s) you wish to
estimate. It is important to realize that assumptions went into generating this likelihood. Namely,
we assumed a model for our measurements of the diameter of Jupiter (Gaussian around the true
value of the diameter). Our likelihood expression is valid only insofar as our model is valid. An
inaccurate or inappropriate model can invalidate an estimate produced using ML techniques.
Example 2 I present another example of a statistical model for completeness. Let us say that we
wish to estimate the decay constant λ of an unknown radioactive isotope. We make N measurements
on N samples of the isotope, each sample containing K_0 atoms of the isotope. We measure
the number of decays in a 30-minute interval for the N samples. We know that the mean number
of decays k̄ in that 30-minute interval is given by

\bar{k} = K_0 \left( 1 - \exp(-30\lambda) \right), \qquad (36)

where λ has units of minutes^{-1}. We also know that the number of measured decays for each sample,
k_i; i = 1, . . . , N, is a Poisson-distributed random variable with mean k̄, i.e.,

\Pr(k_i|\lambda) = \exp(-\bar{k})\, \frac{\bar{k}^{k_i}}{k_i!}
  = \exp\!\left( -K_0 \left( 1 - \exp(-30\lambda) \right) \right) \frac{\left[ K_0 \left( 1 - \exp(-30\lambda) \right) \right]^{k_i}}{k_i!}. \qquad (37)

We have N measurements k_i, and the probability of each measurement is given by Eqn. 37. We
also know that each measurement is independent of the other measurements. That is, there is no
way that the measurement from sample 1 could affect the measurements on any of the other N - 1
samples, and so forth. Thus, the probability of observing every measurement {k_i} is given by

\Pr(\{k_i\}|\lambda) = \prod_{i=1}^{N} \Pr(k_i|\lambda)
  = \prod_{i=1}^{N} \exp\!\left( -K_0 \left( 1 - \exp(-30\lambda) \right) \right) \frac{\left[ K_0 \left( 1 - \exp(-30\lambda) \right) \right]^{k_i}}{k_i!}. \qquad (38)

Equation 38 represents the likelihood. Again, it is the probability of our entire dataset as a function
of the parameter we wish to estimate. The model in this case is represented by the exponential
decay (Eqn. 36) and the Poisson distribution for the measurements (Eqn. 37).
3.2.2 ML Estimation
A likelihood is defined as the probability or probability density of a dataset conditioned on the
parameter that one is trying to estimate. We derived two likelihoods for two different estimation
problems in the previous section. The likelihood is usually based on some model of the data and
the measurement process. Once the likelihood has been defined, ML estimation is quite easy: one
simply chooses the value of the parameter that maximizes the likelihood, i.e.,

\hat{\theta} = \operatorname*{argmax}_{\theta} \left\{ \mathrm{pr}(\{x_i\}|\theta) \right\}. \qquad (39)

I find it useful to think of ML estimation in the following sense: You have a statistical model with
some unknown parameter, so the model is not completely specified. You are choosing the value of
that parameter (or parameters) such that the data you observed would be the most likely to come
from your model.
It is often difficult to perform ML estimation using Eqn. 39 because the values for pr({x_i}|θ)
are often very small. However, because the logarithm function is a monotonic function, we can
use the log of the likelihood instead of the likelihood. ML estimation thus becomes

\hat{\theta} = \operatorname*{argmax}_{\theta} \left\{ \log\left( \mathrm{pr}(\{x_i\}|\theta) \right) \right\}. \qquad (40)

The term log(pr({x_i}|θ)) is known as the log-likelihood.
Example 1 Remember that the goal in Example 1 is to estimate the diameter of Jupiter using N
measurements that are Gaussian distributed around the true value of the diameter of Jupiter. We
previously derived the likelihood as

\mathrm{pr}(\{d_i\}|\theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (d_i - \theta)^2 \right). \qquad (41)

The ML estimate of θ is the θ that maximizes Eqn. 41. The log-likelihood is given by

\log\left( \mathrm{pr}(\{d_i\}|\theta) \right) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (d_i - \theta)^2. \qquad (42)

We will find the maximum of the log-likelihood by taking the derivative of Eqn. 42 with respect to
θ and setting the derivative equal to zero. We find that

\frac{d}{d\theta} \log\left( \mathrm{pr}(\{d_i\}|\theta) \right) = \frac{1}{\sigma^2} \sum_{i=1}^{N} (d_i - \theta)
  = \frac{1}{\sigma^2} \sum_{i=1}^{N} d_i - \frac{N\theta}{\sigma^2}. \qquad (43)

Setting this equation to zero reveals that

\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} d_i. \qquad (44)

I will leave it up to the reader to show that this is a maximum and not a minimum. Thus, the sample
mean is an ML estimate of the mean of a distribution, and the sample mean of our measurements is
the ML estimate of the diameter of Jupiter.
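The same answer can be obtained numerically, which is how ML estimation usually proceeds when no closed-form solution exists. The MATLAB sketch below (my own illustration; the true diameter and noise level are made-up numbers) simulates N Gaussian measurements, maximizes the log-likelihood of Eqn. 42 with fminsearch, and compares the result with the sample mean of Eqn. 44.

    % ML estimate of the diameter (Example 1): numerical vs. analytic.
    theta_true = 139800;     % hypothetical "true" diameter (arbitrary units)
    sigma      = 500;        % assumed measurement standard deviation
    N          = 25;
    d = theta_true + sigma*randn(N, 1);          % simulated measurements

    % Negative log-likelihood (Eqn. 42 with the sign flipped for a minimizer)
    negloglik = @(theta) (N/2)*log(2*pi*sigma^2) + sum((d - theta).^2)/(2*sigma^2);

    theta_numeric  = fminsearch(negloglik, median(d));   % numerical maximization
    theta_analytic = mean(d);                            % Eqn. 44: the sample mean

    fprintf('numerical ML: %.2f   analytic ML: %.2f   truth: %.2f\n', ...
            theta_numeric, theta_analytic, theta_true);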
Example 2 Let us do the same for Example 2. In Example 2 we derived the following likelihood,

\Pr(\{k_i\}|\lambda) = \prod_{i=1}^{N} \exp\!\left( -K_0 \left( 1 - \exp(-30\lambda) \right) \right) \frac{\left[ K_0 \left( 1 - \exp(-30\lambda) \right) \right]^{k_i}}{k_i!}. \qquad (45)

The log-likelihood is given by

\log\left( \Pr(\{k_i\}|\lambda) \right) = -N K_0 \left( 1 - \exp(-30\lambda) \right)
  + \log\!\left( K_0 \left[ 1 - \exp(-30\lambda) \right] \right) \sum_{i=1}^{N} k_i
  - \sum_{i=1}^{N} \log(k_i!). \qquad (46)

Taking the derivative of the above expression, setting that equal to zero, and solving for λ results
in the ML estimate of λ,

\hat{\lambda} = -\frac{1}{30} \log\!\left( \frac{K_0 - \frac{1}{N} \sum_{i=1}^{N} k_i}{K_0} \right). \qquad (47)

By studying this equation, you should see that this result makes quite a bit of sense. The summation
in Eqn. 47 is the sample mean of the number of decays, which we can denote by \hat{\bar{k}}. If we were to
replace the k̄ in Eqn. 36 with \hat{\bar{k}} and solve for λ, we would arrive at Eqn. 47.
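A simulation of Example 2 is sketched below (my own illustration; it assumes the Statistics Toolbox function poissrnd is available to generate Poisson counts, and the values of K_0 and λ are invented).

    % ML estimate of the decay constant lambda (Example 2, Eqn. 47).
    K0          = 1e6;        % atoms per sample (assumed)
    lambda_true = 0.01;       % true decay constant in 1/minutes (assumed)
    N           = 50;         % number of samples measured

    kbar = K0*(1 - exp(-30*lambda_true));   % mean counts in 30 minutes (Eqn. 36)
    k    = poissrnd(kbar, N, 1);            % simulated Poisson measurements

    lambda_hat = -(1/30) * log((K0 - mean(k)) / K0);   % Eqn. 47
    fprintf('ML estimate: %.5f   truth: %.5f\n', lambda_hat, lambda_true);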
3.2.3 Properties of ML Estimates
Both examples presented in these notes are one-dimensional estimation examples. That is,
our likelihood was the probability of the data given a scalar parameter we wish to estimate. ML
estimation is trivially extended to higher dimensions. The ML estimate of the vector of parameters
θ is simply the θ that maximizes either the likelihood or the log-likelihood. Often, it is difficult
to solve exactly for the θ that maximizes the likelihood, so one is forced to use an optimization
algorithm to determine the θ that maximizes the likelihood.
We have thus far discussed how to perform ML estimation, but we have yet to discuss the
performance of ML estimation, i.e., bias and variance. It has been shown that ML estimates are
either unbiased or asymptotically unbiased. That is, an ML estimate is either, on average, correct or
at least it approaches unbiasedness as the number of observations N gets large. Furthermore, using
information theory one can prove that there is a lower bound on the variance of any unbiased estimator. If
an estimation procedure is both unbiased and the variance of the estimator meets this lower bound,
then the procedure is known as an efficient estimator. If an ML estimation procedure is unbiased,
then it is also efficient. If an ML estimation procedure is not unbiased, then it is asymptotically
efficient. That is, the ML estimation procedure has a bias that approaches 0 and a variance that
approaches the lowest possible variance as N gets large. Thus, ML estimation is almost always a
reasonable method for estimation!
Fisher Information Matrix ML estimates either have the lowest possible variance that an unbiased
estimator can have or they approach this lower bound. This lower bound on the variance
(known as the Cramér-Rao bound) is determined through the Fisher information matrix. The Fisher
information matrix J for a problem where we are trying to estimate D parameters is a D × D matrix
whose elements are defined by

J_{i,j} = \left\langle \frac{\partial}{\partial\theta_i} \log\left( \mathrm{pr}(\{x_l\}|\theta) \right)\,
          \frac{\partial}{\partial\theta_j} \log\left( \mathrm{pr}(\{x_l\}|\theta) \right) \right\rangle_{\theta = \theta_{\mathrm{true}}}, \qquad (48)

where ∂/∂θ_i is the partial derivative with respect to the ith parameter in the vector θ. The Cramér-Rao
bound on the estimate of the ith parameter is given by the (i, i) element of the inverse of the Fisher
information matrix,

\mathrm{Var}_{\min}\!\left\{ \hat{\theta}_i \right\} = \left[ J^{-1} \right]_{i,i}. \qquad (49)

If we are dealing with a scalar estimation problem, then there is only one element in the Fisher
information matrix and we simply need to compute its reciprocal to derive the Cramér-Rao bound.
We consider Example 1 to illustrate the Cramér-Rao bound. We already derived the derivative
of our log-likelihood with respect to the mean diameter of Jupiter, shown in Eqn. 43 and repeated
here in a slightly different form as

\frac{d}{d\theta} \log\left( \mathrm{pr}(\{d_i\}|\theta) \right) = \frac{1}{\sigma^2} \sum_{i=1}^{N} (d_i - \theta). \qquad (50)

The Fisher information matrix (not really a matrix in this example) is given by

J = \left\langle \frac{1}{\sigma^2} \sum_{i=1}^{N} (d_i - \theta)\; \frac{1}{\sigma^2} \sum_{j=1}^{N} (d_j - \theta) \right\rangle_{\theta}
  = \frac{1}{\sigma^4} \sum_{i=1}^{N} \left\langle (d_i - \theta)^2 \right\rangle_{\theta}
  = \frac{N}{\sigma^2}. \qquad (51)

Thus, the minimum possible variance that any unbiased estimator can achieve is given by the
inverse of this expression, or by

\mathrm{Var}_{\min}\{\hat{\theta}\} = \frac{\sigma^2}{N}. \qquad (52)

We proved earlier (see Eqn. 33) that the variance of the sample-mean estimator is given by σ²/N.
Thus, the sample mean is both unbiased and achieves the Cramér-Rao bound and is therefore the
best that one can do given the limitations of the data taken.
4 Bayesian Estimation
Bayes' rule states that

\mathrm{pr}(y|x) = \frac{\mathrm{pr}(x|y)\,\mathrm{pr}(y)}{\mathrm{pr}(x)}
  = \frac{\mathrm{pr}(x|y)\,\mathrm{pr}(y)}{\int \mathrm{pr}(x|y)\,\mathrm{pr}(y)\,dy}. \qquad (53)

This statement is always true if x and y are random variables. Bayesian estimation is a somewhat
unusual proposition. A Bayesian estimate of θ is one that maximizes the following,

\mathrm{pr}(\theta|\{x_i\}) = \frac{\mathrm{pr}(\{x_i\}|\theta)\,\mathrm{pr}(\theta)}{\mathrm{pr}(\{x_i\})}, \qquad (54)

which is equivalent to choosing the θ that maximizes pr({x_i}|θ) pr(θ) because the denominator in
Eqn. 54 does not depend on θ. This is unusual because θ is not a random variable. That is, the
diameter of Jupiter is a certain quantity; we may not know its value but it does exist. The prior term
pr(θ) is, thus, a bit confusing. The key difference between Bayesian estimation and ML estimation
is that Bayesian estimation takes a Bayesian viewpoint on probability. The Bayesian viewpoint of
probability states that pr(θ) represents the belief of what θ should be. It does not represent the
density of values that we observe, as it does in the standard frequentist definition of probability.
An example might be helpful. Let us say that the estimation task is to estimate my height θ.
You take measurements of my height and derive a likelihood pr({x_i}|θ) based on some statistical
model. Now, you could perform ML estimation by choosing the θ that maximizes this likelihood.
You, however, choose to perform Bayesian estimation. Let us say that you heard that I was around
6 feet tall. So, you define a prior probability of my height as a Gaussian centered around 6 feet with
a certain variance. So, your pr(θ) is a Gaussian centered on 6 feet. Notice how this probability
represents a belief and not the distribution of my heights; I only have one height! So this Bayesian
estimation procedure would choose the height that maximizes Eqn. 54 or, equivalently, the log of
Eqn. 54, which is

\hat{\theta}_{\mathrm{bayes}} = \operatorname*{argmax}_{\theta} \left\{ \log\left( \mathrm{pr}(\{x_i\}|\theta) \right) + \log\left( \mathrm{pr}(\theta) \right) \right\}. \qquad (55)

Notice how the likelihood still plays an important role in Bayesian estimation. The term pr(θ)
effectively penalizes those answers that are not close to 6 feet tall.
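For this height example, if the measurements are Gaussian with known variance σ² and the prior is Gaussian with mean μ₀ and variance σ₀², Eqn. 55 can be maximized in closed form; the Bayesian (MAP) estimate is a weighted average of the sample mean and the prior mean. A minimal MATLAB sketch (my own; all numbers are invented):

    % ML vs. Bayesian (MAP) estimation of a height with a Gaussian prior.
    theta_true = 5.9;   sigma = 0.2;   N = 5;     % truth and measurement noise (assumed)
    mu0 = 6.0;          sigma0 = 0.1;             % prior: N(6 feet, 0.1^2)
    x = theta_true + sigma*randn(N, 1);           % simulated height measurements

    theta_ml  = mean(x);                          % maximizes the likelihood alone
    theta_map = (N*mean(x)/sigma^2 + mu0/sigma0^2) / (N/sigma^2 + 1/sigma0^2);

    fprintf('ML: %.3f   MAP: %.3f   truth: %.3f\n', theta_ml, theta_map, theta_true);

Notice how the prior pulls the estimate toward 6 feet; the pull weakens as N grows and the likelihood dominates.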
When we performed ML estimation, we had to define and justify the statistical model being
employed. For Bayesian estimation, you have to define and justify the same model but also justify
your prior pr(θ). I have read numerous papers on the same estimation procedure where people are
touting their choice of priors and berating another researcher's choice of priors. What really is
the best prior? That is a difficult, if not unanswerable, question.
Questions
BASIC PROBABILITY QUESTIONS
1. Plot probability density functions of the form: binomial, Poisson, and normal. Use the analytical
forms presented in this handout. Also, plot normal pdfs where the mean and the variance
are equal alongside Poisson distributions with the same mean and variance. Use values of
4, 10, and 30 for the mean (and variance) for your plots. Comment on what is occurring.
2. Generate 300 samples from a uniform pdf (using rand). Plot a histogram of your data
(using the hist command). Now generate a random variable

y = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad (56)

where x_i is a uniform random variable, and plot a histogram of y. Show results for N = 2,
5, and 20. Comment on the pdfs for y in each case.
3. The exponential distribution has the following pdf,

\mathrm{pr}(x) = \frac{1}{\beta} \exp(-x/\beta); \quad x \ge 0. \qquad (57)

Using only MATLAB's rand routine and your knowledge of how to transform densities, generate
500 samples from an exponential distribution. Plot a histogram of your results. Do the
same for a normal random variable with 0 mean and variance 1. Look at MATLAB's help for
erf and erfinv.
ESTIMATION QUESTIONS
4. Imagine a laser beam incident on a piece of ground glass. The intensity of the beam is
then measured with a piece of film some distance away from the ground glass. The ground
glass will cause a speckle pattern to appear on the film. Rotating or changing the ground glass
will produce an entirely different pattern. It is well known that the statistics governing the
intensity of the measured beam at one location on the film is given by

\mathrm{pr}_I(v) = \frac{1}{\bar{I}} \exp(-v/\bar{I})

for v greater than or equal to zero. Assume that we took N measurements of the intensity at
one point on the film (changing the piece of ground glass for each measurement) and we were
asked to determine the parameter Ī. How would you go about doing this? Now sample
100 observations from this density (see Prob. 3) for a particular value of Ī and estimate Ī.
5. Consider the following problem. We have a series of measurement pairs (x_i, y_i), where the x_i
are perfect measurements (i.e., there is no noise in x_i) and the measurement y_i is normally
distributed with variance σ² and mean given by a x_i. Here, a is the slope relating x to y and is
the parameter we wish to estimate. Derive the likelihood expression for the data as a function
of a and then determine the ML estimate of a. Now, generate a dataset of 10 measurements
that fit this model using an a that you specify. Now, compute the ML estimate of a and
compare the estimate with the truth.
