
1 Large-Sample Theory

Definition 1. Convergence in Distribution. Let $Y_n$ be a sequence of random variables, $H_n$ be the CDF of
$Y_n$, and $H$ be the CDF of $Y$. Then $Y_n \xrightarrow{d} Y$ if $H_n(y) \to H(y)$ for every $y$ at which $H$ is continuous.
Lemma 1. Implications

1. $Y_n \xrightarrow{d} Y$ iff $Ef(Y_n) \to Ef(Y)$ for all bounded, continuous $f$.

2. $Y_n \xrightarrow{d} Y$ and $g$ continuous $\Rightarrow$ $g(Y_n) \xrightarrow{d} g(Y)$.

Theorem 1. Central Limit Theorem. Given $X_i$ i.i.d. with mean $\mu$ and variance $\sigma^2$,

$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$

Theorem 2. Slutsky's Theorem. Suppose $Y_n \xrightarrow{d} Y$, $A_n \xrightarrow{P} a$, $B_n \xrightarrow{P} b$. Then:

$A_n + B_n Y_n \xrightarrow{d} a + bY$

Theorem 3. Delta Method. Given $X_i$ i.i.d. with mean $\mu$ and variance $\sigma^2$, and $f$ differentiable at $\mu$. Then

$\sqrt{n}\,\big(f(\bar{X}_n) - f(\mu)\big) \xrightarrow{d} N\big(0, (f'(\mu))^2 \sigma^2\big)$
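As an added illustration (not part of the notes), a small Monte Carlo sketch of the delta method with $f(x) = x^2$ and Exponential(1) data, so $\mu = \sigma^2 = 1$ and the limiting variance is $(f'(\mu))^2\sigma^2 = 4$:

# Hypothetical simulation sketch: the sample variance of sqrt(n)*(f(Xbar) - f(mu))
# should be close to (f'(mu))^2 * sigma^2 = 4 for f(x) = x^2 and Exponential(1) data.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
X = rng.exponential(scale=1.0, size=(reps, n))   # mean mu = 1, variance sigma^2 = 1
xbar = X.mean(axis=1)
stat = np.sqrt(n) * (xbar**2 - 1.0)              # sqrt(n) * (f(Xbar) - f(mu))
print(stat.var())                                # approximately 4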

Definition 2. Uniform Integrability. A sequence $X_n$ is uniformly integrable if

$\sup_{n \ge 1} E\big[|X_n|\, I\{|X_n| \ge t\}\big] \to 0$ as $t \to \infty$
Theorem 4. $X_n \xrightarrow{d} X$ and uniform integrability $\Rightarrow$ $E[X_n] \to E[X]$.

Note that this theorem does not follow from Lemma 1, because the function $f(x) = x$ is not bounded.
Theorem 5. Let $\hat{\theta}_{MLE}$ be the MLE estimate for the parameter $\theta$ of an exponential family. Then:

$\sqrt{n}\,(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} N\Big(0, \frac{1}{A''(\theta)}\Big)$

The MLE achieves the Cramér-Rao bound in an asymptotic sense, except not quite, because of superefficiency.
Theorem 6. Keener 8.18. Let $(X_i)_{i \ge 1}$ be i.i.d. rv with common CDF $F$, let $\beta \in (0, 1)$, and let $\hat{\theta}_n$ be the
$\lfloor n\beta \rfloor$'th order statistic of $X_1, X_2, \ldots, X_n$. If $F(\theta) = \beta$, and if $F'(\theta)$ exists and is finite and positive, then:

$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N\Big(0, \frac{\beta(1 - \beta)}{[F'(\theta)]^2}\Big)$
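As an added numerical check (not from the notes), take the sample median of Exponential(1) data: $\beta = 1/2$, $\theta = \log 2$, $F'(\theta) = 1/2$, so the limiting variance is $\beta(1-\beta)/[F'(\theta)]^2 = 1$.

# Hypothetical check of Theorem 6 for the median of Exponential(1) samples:
# the variance of sqrt(n)*(median - log 2) should be close to 1.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1_001, 10_000
X = rng.exponential(size=(reps, n))
med = np.median(X, axis=1)
print((np.sqrt(n) * (med - np.log(2))).var())    # approximately 1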

2 Estimating Equations and Maximum Likelihood

2.1 Weak Law for Random Functions

Definition 3. Random Element. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $(E, \mathcal{E})$ a measurable space. A
random element with values in $E$ is a function $X : \Omega \to E$ which is $(\mathcal{F}, \mathcal{E})$-measurable.

A random function is an example of a random element. Let $K$ be a compact set and let $W_i(t) = h(t, X_i)$, $t \in K$.
Assume $h(t, x)$ is continuous in $t$ for all $x$. Then the $W_i$ are random functions taking values in $C(K)$, the set of
continuous functions on $K$.

Definition 4. Supremum Norm. For $w \in C(K)$, the supremum norm of $w$ is defined as:

$\|w\|_\infty = \sup_{t \in K} |w(t)|$

Convergence in $C(K)$ means $\|w_n - w\|_\infty \to 0$.

Lemma 2. Keener p. 152. Let $W$ be a random function in $C(K)$. Define $\mu(t) = EW(t)$, $t \in K$. If $E\|W\|_\infty < \infty$,
then $\mu$ is continuous. Also:

$\sup_{t \in K} E\Big[\sup_{s : \|s - t\| < \delta} |W(s) - W(t)|\Big] \to 0$

as $\delta \to 0$.
Theorem 7. Dini's Theorem. Suppose $f_1 \ge f_2 \ge \ldots$ are positive functions in $C(K)$. If $f_n(x) \to 0$ for all $x \in K$,
then $\sup_{x \in K} f_n(x) \to 0$.

Dini's theorem turns a pointwise statement into a uniform statement. A more general form in Empirical
Process Theory is:
Theorem 8. Uniform Weak Law. Keener p. 153. Let $W, W_1, W_2, \ldots$ be i.i.d. random functions in $C(K)$, $K$
compact, with mean $\mu(t) = EW(t)$ and $E\|W\|_\infty < \infty$. Let $\bar{W}_n = \frac{1}{n}\sum_{i=1}^n W_i$. This implies $\|\bar{W}_n - \mu\|_\infty \xrightarrow{P} 0$.

Theorem 9. Let $G_n$, $n \ge 1$, be random functions in $C(K)$, $K$ compact, and suppose $\|G_n - g\|_\infty \xrightarrow{P} 0$ with $g$
a nonrandom function in $C(K)$.

1. If $t_n$, $n \ge 1$, are random variables converging in probability to a constant $t^* \in K$, $t_n \xrightarrow{P} t^*$, then
$G_n(t_n) \xrightarrow{P} g(t^*)$.

2. If $g$ achieves its maximum at a unique value $t^*$, and if $t_n$ are random variables maximizing $G_n$, so
that $G_n(t_n) = \sup_{t \in K} G_n(t)$, then $t_n \xrightarrow{P} t^*$.

3. If $K \subset \mathbb{R}$ and $g(t) = 0$ has a unique solution $t^*$, and if $t_n$ are random variables solving $G_n(t_n) = 0$,
then $t_n \xrightarrow{P} t^*$.

2.2 Consistency of the MLE

For this subsection, let $X, X_1, X_2, \ldots$ be i.i.d. with common density $f_\theta$, $\theta \in \Omega$, and let $l_n$ be the log-likelihood
function for the first $n$ observations:

$l_n(\omega) = \sum_{i=1}^n \log f_\omega(X_i)$

The MLE $\hat{\theta}_n = \hat{\theta}_n(X_1, \ldots, X_n)$ maximizes $l_n$. Assume $f_\theta(x)$ is continuous in $\theta$.


Definition 5. The Kullback-Leibler information is defined as:

$I(\theta, \omega) = E_\theta \log\big[f_\theta(X)/f_\omega(X)\big]$

Here $\theta$ is thought of as the true value of the unknown parameter and $\omega$ is a dummy variable. This information
is viewed as a measure of the information discriminating between $\theta$ and $\omega$ when $\theta$ is the true value of the
unknown parameter.

Lemma 3. If $P_\theta \ne P_\omega$, then $I(\theta, \omega) > 0$.
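As an added illustration of Definition 5 (not from the notes), $I(\theta, \omega)$ can be approximated by Monte Carlo averaging of $\log[f_\theta(X)/f_\omega(X)]$ under $P_\theta$; for $N(\theta, 1)$ versus $N(\omega, 1)$ the exact value is $(\theta - \omega)^2/2 > 0$, consistent with Lemma 3.

# Hypothetical Monte Carlo estimate of the Kullback-Leibler information I(theta, omega)
# for N(theta, 1) vs N(omega, 1); the exact value is (theta - omega)^2 / 2.
import numpy as np

rng = np.random.default_rng(2)
theta, omega = 0.0, 1.0
X = rng.normal(loc=theta, scale=1.0, size=1_000_000)
log_ratio = (-(X - theta)**2 + (X - omega)**2) / 2.0   # log f_theta(x) - log f_omega(x)
print(log_ratio.mean())                                # approximately 0.5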

Theorem 10. Define $W(\omega) = \log \frac{f_\omega(X)}{f_\theta(X)}$. If $\Omega$ is compact, $E_\theta\|W\|_\infty < \infty$, $f_\omega(x)$ is a continuous function
of $\omega$ for a.e. $x$, and $P_\omega \ne P_\theta$ for all $\omega \ne \theta$, then under $P_\theta$, $\hat{\theta}_n \xrightarrow{P} \theta$.

Note that here $\hat{\theta}_n$ is the MLE for $n$ observations, which means it maximizes $\bar{W}_n(\omega) = \frac{1}{n}\sum_{i=1}^n \log\frac{f_\omega(X_i)}{f_\theta(X_i)}$,
or equivalently the log-likelihood $l_n(\omega)$. This theorem establishes the consistency of the MLE.
Theorem 11. Suppose $\Omega = \mathbb{R}^p$, $f_\omega(x)$ is a continuous function of $\omega$ for a.e. $x$, $P_\omega \ne P_\theta$ for all $\omega \ne \theta$, and
$f_\omega(x) \to 0$ as $\|\omega\| \to \infty$. If $E_\theta\|I_K W\|_\infty < \infty$ for any compact set $K \subset \mathbb{R}^p$, and if $E_\theta \sup_{\|\omega\| > a} W(\omega) < \infty$ for
some $a > 0$, then under $P_\theta$, $\hat{\theta}_n \xrightarrow{P} \theta$.

2.3 Limiting Distribution for the MLE

Theorem 12. Keener 9.14. Assume:

1. The rv $X, X_1, X_2, \ldots$ are i.i.d. with common density $f_\theta$, $\theta \in \Omega \subset \mathbb{R}$

2. The set $A = \{x \mid f_\theta(x) > 0\}$ is independent of $\theta$

3. For all $x \in A$, $\partial^2 f_\theta(x)/\partial\theta^2$ exists and is continuous in $\theta$

4. Let $W(\theta) = \log f_\theta(X)$. The Fisher information $I(\theta)$ from a single observation exists, is finite and can
be found using either:

$I(\theta) = E_\theta\big[W'(\theta)\big]^2$, or $I(\theta) = -E_\theta W''(\theta)$

5. For all $\theta$ in the interior of $\Omega$, there exists $\epsilon > 0$ such that:

$E_\theta \big\| I_{[\theta - \epsilon,\, \theta + \epsilon]} W'' \big\|_\infty < \infty$

6. The maximum likelihood estimator $\hat{\theta}_n$ is consistent

Then for any $\theta$ in the interior of $\Omega$,

$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N\Big(0, \frac{1}{I(\theta)}\Big)$

as $n \to \infty$.
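As an added illustration of Theorem 12 (the model is hypothetical, not an example from the notes): for $X_i \sim$ Exponential with rate $\theta$, the MLE is $\hat{\theta}_n = 1/\bar{X}_n$ and $I(\theta) = 1/\theta^2$, so $\sqrt{nI(\theta)}\,(\hat{\theta}_n - \theta)$ should be approximately $N(0, 1)$.

# Hypothetical simulation: MLE for the rate of Exponential data is 1/Xbar, I(theta) = 1/theta^2,
# so sqrt(n I(theta)) * (mle - theta) should have mean ~ 0 and variance ~ 1.
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 1_000, 10_000
X = rng.exponential(scale=1.0 / theta, size=(reps, n))
mle = 1.0 / X.mean(axis=1)
z = np.sqrt(n / theta**2) * (mle - theta)
print(z.mean(), z.var())                         # approximately 0 and 1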

Lemma 4. Suppose $Y_n \xrightarrow{d} Y$, and $P(B_n) \to 1$ as $n \to \infty$. Then for arbitrary rv $Z_n$, $n \ge 1$,

$Y_n I_{B_n} + Z_n I_{B_n^c} \xrightarrow{d} Y$

as $n \to \infty$.

2.4 Confidence Intervals

Definition 6. If $\delta_0$ and $\delta_1$ are statistics, then the random interval $(\delta_0, \delta_1)$ is called a $1 - \alpha$ confidence interval
for $g(\theta)$ iff:

$P_\theta\big(g(\theta) \in (\delta_0, \delta_1)\big) \ge 1 - \alpha$

for all $\theta \in \Omega$. Also, a random set $S = S(X)$ constructed from data $X$ is called a $1 - \alpha$ confidence region
for $g(\theta)$ if:

$P_\theta\big(g(\theta) \in S\big) \ge 1 - \alpha$

for all $\theta \in \Omega$.
Definition 7. A variable that depends on both the data and the parameter, but whose distribution does not
depend on the parameter, is called a pivot.

Suppose $\sqrt{n}\,(\hat{\theta}_n - \theta) \Rightarrow N(0, 1/I(\theta))$. Then

$\sqrt{nI(\theta)}\,(\hat{\theta}_n - \theta) \Rightarrow N(0, 1)$

$\Rightarrow\; P_\theta\Big[\sqrt{nI(\theta)}\,|\hat{\theta}_n - \theta| \le z_{\alpha/2}\Big] \to 1 - \alpha$

It is often difficult to calculate the Fisher information exactly. We will discuss strategies to approximate the
Fisher information:
1. We can use $I(\hat{\theta}_n)$ instead. If $I(\theta)$ is continuous then:

$\sqrt{\frac{I(\hat{\theta}_n)}{I(\theta)}} \xrightarrow{P} 1$

Then using Slutsky's theorem, we can conclude that

$\sqrt{nI(\hat{\theta}_n)}\,(\hat{\theta}_n - \theta) = \sqrt{\frac{I(\hat{\theta}_n)}{I(\theta)}}\,\sqrt{nI(\theta)}\,(\hat{\theta}_n - \theta) \Rightarrow N(0, 1)$

2. We can also use the results from empirical process theory. Remember that $l_n(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$.
Thus by the law of large numbers

$-\frac{l_n''(\hat{\theta}_n)}{n} \xrightarrow{P} I(\theta)$

And thus by Slutsky's theorem

$\sqrt{-l_n''(\hat{\theta}_n)}\,(\hat{\theta}_n - \theta) \Rightarrow N(0, 1)$

3. Another method is profile regions. Expand $l_n(\theta)$ in a Taylor series around $\hat{\theta}_n$ (the first-order term
vanishes because $l_n'(\hat{\theta}_n) = 0$):

$l_n(\theta) = l_n(\hat{\theta}_n) + \frac{1}{2} l_n''(\tilde{\theta}_n)(\theta - \hat{\theta}_n)^2$

Here $\tilde{\theta}_n$ is a random variable between $\theta$ and $\hat{\theta}_n$. By rearranging this equation we get

$2l_n(\hat{\theta}_n) - 2l_n(\theta) = \Big[\sqrt{-l_n''(\tilde{\theta}_n)}\,(\theta - \hat{\theta}_n)\Big]^2 \Rightarrow \chi_1^2$

Note that for $Z \sim N(0, 1)$

$P\big[Z^2 \le z_{\alpha/2}^2\big] = P\big[-z_{\alpha/2} \le Z \le z_{\alpha/2}\big] = 1 - \alpha$

And thus

$P_\theta\big(2l_n(\hat{\theta}_n) - 2l_n(\theta) \le z_{\alpha/2}^2\big) \to 1 - \alpha$

This identity can now be used to calculate the asymptotic $(1 - \alpha)$ confidence interval for $\theta$.
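The following sketch ties these approaches together for a hypothetical model (Exponential data with rate $\theta$, not an example from the notes): the observed information is $-l_n''(\hat{\theta}_n) = n/\hat{\theta}_n^2$, giving a Wald interval, while the profile/likelihood-ratio region collects the $\theta$ with $2l_n(\hat{\theta}_n) - 2l_n(\theta) \le z_{\alpha/2}^2$.

# Hypothetical sketch: Wald and likelihood-ratio confidence intervals for the rate of Exponential data.
import numpy as np

rng = np.random.default_rng(4)
theta_true, n, z = 2.0, 200, 1.96                        # z = z_{alpha/2} for alpha = 0.05
x = rng.exponential(scale=1.0 / theta_true, size=n)

mle = 1.0 / x.mean()
obs_info = n / mle**2                                    # -l_n''(mle) for this model
wald = (mle - z / np.sqrt(obs_info), mle + z / np.sqrt(obs_info))

def loglik(theta):
    return n * np.log(theta) - theta * x.sum()

grid = np.linspace(mle / 3, mle * 3, 10_000)
inside = 2 * (loglik(mle) - loglik(grid)) <= z**2        # profile / likelihood-ratio region
lr = (grid[inside].min(), grid[inside].max())
print("Wald:", wald, "LR:", lr)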

3 Hypothesis Testing

3.1 General Setting and Simple vs. Simple Case

Let $\mathcal{P} = \{P_\theta, \theta \in \Omega\}$ be a family of distributions. The distributions of $\mathcal{P}$ can be classified into $H$, those for
which the hypothesis is true, and $K$, those for which the hypothesis is false, with $H \cup K = \mathcal{P}$ and the corresponding
partition of $\Omega$ into $\Omega_H$ and $\Omega_K$, $\Omega_H \cup \Omega_K = \Omega$.
Let the decision of accepting or rejecting H be d0 and d1 respectively. A nonrandomized test procedure
assigns to each possible value x of X one of these two decisions and thereby divides the sample space into
two complementary regions S0 and S1 . The set S0 is called the region of acceptance, and the set S1 the
region of rejection or critical region.
There are two types of error: rejecting the hypothesis when it is true (Type I), and accepting when it is
false (Type II).
Define the level of significance $\alpha$ such that:

$P_\theta(\delta(X) = d_1) = P_\theta(X \in S_1) \le \alpha$, for all $\theta \in \Omega_H$    (1)

Subject to this condition, it is desired to minimize $P_\theta(\delta(X) = d_0)$ for $\theta \in \Omega_K$ or equivalently maximize

$P_\theta(\delta(X) = d_1) = P_\theta(X \in S_1)$, for all $\theta \in \Omega_K$    (2)

Usually, (1) implies

$\sup_{\theta \in \Omega_H} P_\theta(X \in S_1) = \alpha$    (3)

The LHS in (3) is called the size of the test or critical region $S_1$. The probability of rejection in (2) is
called the power of the test against the alternative $\theta$. Considered as a function of $\theta$ over $\Omega$, this probability
is called the power function of the test and is denoted $\beta(\theta)$.
Now, for each $X = x$, instead of choosing $d_1$ or $d_0$ deterministically, one can do it randomly as a Bernoulli
trial with success rate $\phi(x)$. The randomized test is therefore completely characterized by the function
$\phi(x)$. The set of points $x$ for which $\phi(x) = 1$ is the region of rejection. The probability of rejection is:

$E_\theta \phi(X) = \int \phi(x)\, dP_\theta(x)$

The problem now is to select $\phi$ so as to maximize the power

$\beta(\theta) = E_\theta \phi(X)$, for all $\theta \in \Omega_K$    (4)

subject to the condition

$E_\theta \phi(X) \le \alpha$, for all $\theta \in \Omega_H$    (5)

Now if $K$ has more than one element, the test that maximizes the power will in general be different for each
alternative, so things are more complicated. But if $K$ has only one element, things simplify. Sometimes, when
there are many alternatives, we can get lucky and have one test maximize the power against all alternatives
in $K$. This is called a uniformly most powerful (UMP) test.
Theorem 13. Neyman-Pearson Lemma. Let $P_0$ and $P_1$ be probability distributions possessing densities $p_0$ and
$p_1$ respectively wrt a measure $\mu$.

(i) Existence. For testing $H: p_0$ against the alternative $K: p_1$ there exist a test $\phi$ and a constant $k$ such
that:

$E_0 \phi(X) = \alpha$    (6)

and

$\phi(x) = \begin{cases} 1, & p_1(x) > k\, p_0(x) \\ 0, & p_1(x) < k\, p_0(x) \end{cases}$    (7)

(ii) Sufficient condition for a most powerful test. If a test satisfies (6) and (7) for some $k$, then it is
most powerful for testing $p_0$ against $p_1$ at level $\alpha$.

(iii) Necessary condition for a most powerful test. If $\phi$ is most powerful at level $\alpha$ for testing $p_0$ against $p_1$,
then for some $k$ it satisfies (7) a.e. $\mu$. It also satisfies (6) unless there exists a test of size $< \alpha$ and with power 1.
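As an added illustration of Theorem 13 (the Gaussian example is hypothetical, not from the notes): for testing $p_0 = N(0, 1)$ against $p_1 = N(1, 1)$, the ratio $p_1(x)/p_0(x) = e^{x - 1/2}$ is increasing in $x$, so the most powerful level-$\alpha$ test rejects when $x$ exceeds the upper $\alpha$ quantile of $N(0, 1)$.

# Hypothetical illustration: most powerful level-alpha test of N(0,1) vs N(1,1).
# Rejecting when x > k' with P_0[X > k'] = alpha satisfies (6)-(7) since p1/p0 is increasing in x.
import numpy as np
from scipy.stats import norm

alpha = 0.05
k_prime = norm.ppf(1 - alpha)               # rejection cutoff under p0
size = 1 - norm.cdf(k_prime)                # equals alpha by construction
power = 1 - norm.cdf(k_prime - 1.0)         # P_1[X > k'], the power against p1
print(size, power)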

3.2 Simple vs Multiple - Distributions with Monotone Likelihood Ratio

Now when the set $K$ has multiple elements, in general the most powerful test of $H: \theta \le \theta_0$ against an alternative
$\theta_1 > \theta_0$ depends on $\theta_1$ and is then not UMP. However, a UMP test does exist if an additional assumption is
satisfied: the real-parameter family of densities $p_\theta(x)$ has monotone likelihood ratio.

Definition 8. Monotone Likelihood Ratio. The real-parameter family of densities $p_\theta(x)$ is said to have
monotone likelihood ratio if there exists a real-valued function $T(x)$ such that for any $\theta < \theta'$ the distributions
$P_\theta$ and $P_{\theta'}$ are distinct, and the ratio $p_{\theta'}(x)/p_\theta(x)$ is a nondecreasing function of $T(x)$.
Theorem 14. Let $\theta$ be a real parameter, and let the random variable $X$ have density $p_\theta(x)$ with monotone
likelihood ratio in $T(x)$.

(i) For testing $H: \theta \le \theta_0$ against $K: \theta > \theta_0$, there exists a UMP test, which is given by:

$\phi(x) = \begin{cases} 1, & T(x) > C \\ \gamma, & T(x) = C \\ 0, & T(x) < C \end{cases}$    (8)

where $C$ and $\gamma$ are determined by

$E_{\theta_0} \phi(X) = \alpha$    (9)

(ii) The power function

$\beta(\theta) = E_\theta \phi(X)$

of this test is strictly increasing for all points $\theta$ for which $0 < \beta(\theta) < 1$.

(iii) For all $\theta'$, the test determined by (8) and (9) is UMP for testing $H': \theta \le \theta'$ against $K': \theta > \theta'$ at
level $\alpha' = \beta(\theta')$.

(iv) For any $\theta < \theta_0$ the test minimizes $\beta(\theta)$ (the Type I error) among all tests satisfying (9).

4 Concentration Bounds

Theorem 15. Markov Inequality. Suppose $X \ge 0$ and $EX < \infty$. Then:

$P[X \ge t] \le \frac{EX}{t}$, for all $t > 0$

Theorem 16. Chebyshev. Suppose $EX^2 < \infty$. Then:

$P[|X - \mu| \ge t] \le \frac{E(X - \mu)^2}{t^2}$

Theorem 17. Chernoff bound. Assume $E \exp\{\lambda(X - \mu)\} < \infty$ for $|\lambda| \le b$. Then for all $\lambda \in [0, b]$:

$P[X - \mu \ge t] \le \frac{E \exp\{\lambda(X - \mu)\}}{\exp\{\lambda t\}}$

and therefore

$P[X - \mu \ge t] \le \exp\Big[-\sup_{\lambda \in [0, b]} \big\{\lambda t - \log E \exp\{\lambda(X - \mu)\}\big\}\Big]$

E.g. Gaussian: MGF: $E \exp\{\lambda X\} = \exp\big\{\lambda\mu + \lambda^2\sigma^2/2\big\}$, for all $\lambda \in \mathbb{R}$. Thus: $P[X - \mu \ge t] \le \exp\big\{-t^2/(2\sigma^2)\big\}$.

Definition 9. Sub-Gaussian. A RV $X$ with mean $\mu$ is called sub-Gaussian iff there exists $\sigma > 0$ such that:

$E \exp\{\lambda(X - \mu)\} \le \exp\Big\{\frac{\lambda^2\sigma^2}{2}\Big\}$, for all $\lambda \in \mathbb{R}$

$\sigma$ is the sub-Gaussian parameter.

The symmetry of the definition implies $X$ is sub-Gaussian iff $-X$ is sub-Gaussian. Thus

$P[|X - \mu| \ge t] \le 2 \exp\Big\{-\frac{t^2}{2\sigma^2}\Big\}$

Theorem 18. Hoeffding Bound. Suppose $X_i$, $i = 1, \ldots, n$, are independent, with mean $\mu_i$ and sub-Gaussian
with parameter $\sigma_i$. Then for all $t \ge 0$,

$P\Big[\sum_{i=1}^n (X_i - \mu_i) \ge t\Big] \le \exp\Big\{-\frac{t^2}{2\sum_{i=1}^n \sigma_i^2}\Big\}$
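As an added numerical check (not from the notes): Rademacher variables are mean-zero and sub-Gaussian with parameter $\sigma_i = 1$, so $P[\sum X_i \ge t] \le \exp\{-t^2/(2n)\}$, and the empirical tail should sit below this bound.

# Hypothetical check of the Hoeffding bound for sums of n Rademacher (+/-1) variables.
import numpy as np

rng = np.random.default_rng(5)
n, reps, t = 100, 100_000, 25.0
S = (2 * rng.integers(0, 2, size=(reps, n)) - 1).sum(axis=1)    # sums of +/-1 variables
print((S >= t).mean(), np.exp(-t**2 / (2 * n)))                 # empirical tail vs bound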

Theorem 19. Equivalent characterizations of sub-Gaussian variables. Given any zero-mean RV $X$, the
following properties are equivalent:

(i) There exists $\sigma \ge 0$ such that $E[\exp\{\lambda X\}] \le \exp\Big\{\frac{\lambda^2\sigma^2}{2}\Big\}$ for all $\lambda \in \mathbb{R}$.

(ii) There exist $c \ge 1$ and $Z \sim N(0, \tau^2)$ such that $P[|X| \ge s] \le c\, P[|Z| \ge s]$ for all $s \ge 0$.

(iii) There exists $\theta \ge 0$ such that $E[X^{2k}] \le \frac{(2k)!}{2^k k!}\,\theta^{2k}$ for all $k = 1, 2, \ldots$

(iv) $E \exp\Big\{\frac{\lambda X^2}{2\sigma^2}\Big\} \le \frac{1}{\sqrt{1 - \lambda}}$ for all $\lambda \in [0, 1)$.

Definition 10. A RV $X$ with mean $\mu$ is sub-exponential if there exist $\nu > 0$, $b > 0$ such that:

$E \exp\{\lambda(X - \mu)\} \le \exp\Big\{\frac{\lambda^2\nu^2}{2}\Big\}$, for all $|\lambda| \le \frac{1}{b}$
Theorem 20. Sub-exponential tail bound. Suppose that $X$ is sub-exponential with parameters $(\nu, b)$. Then:

$P[X \ge \mu + t] \le \begin{cases} \exp\big\{-\frac{t^2}{2\nu^2}\big\}, & t \in \big[0, \frac{\nu^2}{b}\big] \\ \exp\big\{-\frac{t}{2b}\big\}, & t > \frac{\nu^2}{b} \end{cases}$
Definition 11. Bernstein's condition. Given a RV $X$ with mean $\mu$ and variance $\sigma^2$. We say that Bernstein's
condition with parameter $b$ holds if:

$\big|E(X - \mu)^k\big| \le \frac{1}{2}\, k!\, \sigma^2 b^{k-2}$, for all $k = 3, 4, \ldots$

Theorem 21. Bernstein-type bound. For any RV satisfying the Bernstein condition above, we have:

$E[\exp\{\lambda(X - \mu)\}] \le \exp\Big\{\frac{\lambda^2\sigma^2/2}{1 - b|\lambda|}\Big\}$, for all $|\lambda| < \frac{1}{b}$

$P[|X - \mu| \ge t] \le 2 \exp\Big\{-\frac{t^2}{2(\sigma^2 + bt)}\Big\}$
Multivariate sub-exponential. RVs $X_1, \ldots, X_n$ are independent, and $X_k$ is sub-exponential with parameters
$(\nu_k, b_k)$ and has mean $\mu_k = E[X_k]$. The MGF is:

$E \exp\Big\{\lambda \sum_{k=1}^n (X_k - \mu_k)\Big\} = \prod_{k=1}^n E[\exp\{\lambda(X_k - \mu_k)\}] \le \exp\Big\{\frac{\lambda^2 \sum_{k=1}^n \nu_k^2}{2}\Big\}$, for all $|\lambda| < \frac{1}{\max_{k=1,\ldots,n} b_k}$

so the sum is sub-exponential with parameters $\big(\sqrt{\sum_{k=1}^n \nu_k^2},\ \max_k b_k\big)$. For example, $Z_k^2 - 1$ with
$Z_k \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$ is sub-exponential with parameters $(2, 4)$, and we have:

$P\Big[\Big|\frac{1}{n}\sum_{k=1}^n Z_k^2 - 1\Big| \ge t\Big] \le 2 \exp\Big\{-\frac{nt^2}{8}\Big\}$, for all $t \in (0, 1)$
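As an added numerical check of this chi-square example (not from the notes), the empirical tail should sit below the stated bound:

# Hypothetical check: P[|(1/n) sum Z_k^2 - 1| >= t] vs the bound 2 exp(-n t^2 / 8), t in (0, 1).
import numpy as np

rng = np.random.default_rng(6)
n, reps, t = 50, 100_000, 0.5
Z = rng.normal(size=(reps, n))
dev = np.abs((Z**2).mean(axis=1) - 1.0)
print((dev >= t).mean(), 2 * np.exp(-n * t**2 / 8))   # empirical tail vs bound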

Theorem 22. Equivalent characterizations of sub-exponential variables. For a zero-mean RV $X$, the following
statements are equivalent:

(i) There exist $\nu > 0$, $b > 0$ such that $E[\exp\{\lambda X\}] \le \exp\Big\{\frac{\lambda^2\nu^2}{2}\Big\}$ for all $|\lambda| < \frac{1}{b}$.

(ii) There exists $c_0 > 0$ such that $E[\exp\{\lambda X\}] < \infty$ for all $|\lambda| \le c_0$.

(iii) There exist $c_1, c_2 > 0$ such that $P[|X| \ge t] \le c_1 \exp\{-c_2 t\}$ for all $t \ge 0$.

(iv) $\gamma := \sup_{k \ge 2} \Big[\frac{E[X^k]}{k!}\Big]^{1/k} < \infty$.
