
6.041 Probabilistic Systems Analysis
6.431 Applied Probability

Staff
Lecturer: John Tsitsiklis

Pick up and read course information handout
Turn in recitation and tutorial scheduling form
(last sheet of course information handout)

Coursework

Quiz 1 (October 12, 12:05-12:55pm): 17%
Quiz 2 (November 2, 7:30-9:30pm): 30%
Final exam (scheduled by registrar): 40%
Weekly homework (best 9 of 10): 10%
Attendance/participation/enthusiasm in recitations/tutorials: 3%

Pick up copy of slides

Collaboration policy described in course info handout

Text: Introduction to Probability, 2nd Edition,
D. P. Bertsekas and J. N. Tsitsiklis, Athena Scientific, 2008
Read the text!

LECTURE 1

Readings: Sections 1.1, 1.2

Lecture outline
  Probability as a mathematical framework for reasoning about uncertainty
  Probabilistic models
    sample space
    probability law
  Axioms of probability
  Simple examples

Sample space Ω
  List (set) of possible outcomes
  List must be:
    Mutually exclusive
    Collectively exhaustive
  Art: to be at the right granularity

Sample space: Discrete example
  Two rolls of a tetrahedral die
  Sample space vs. sequential description
  [Figure: 4-by-4 grid of outcomes (X = first roll, Y = second roll), from (1,1) to (4,4),
   and the equivalent sequential (tree) description]

Sample space: Continuous example
  Ω = {(x, y) | 0 ≤ x, y ≤ 1}

Probability axioms

  Event: a subset of the sample space
  Probability is assigned to events

  Axioms:
  1. Nonnegativity: P(A) ≥ 0
  2. Normalization: P(Ω) = 1
  3. Additivity: If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

  Consequence:
  P({s1, s2, . . . , sk}) = P({s1}) + · · · + P({sk}) = P(s1) + · · · + P(sk)

  Axiom 3 needs strengthening
  Do weird sets have probabilities?

Probability law: Example with finite sample space

  [Figure: 4-by-4 grid of outcomes; X = first roll, Y = second roll]

  Let every possible outcome have probability 1/16
  P((X, Y ) is (1,1) or (1,2)) =
  P({X = 1}) =
  P(X + Y is odd) =
  P(min(X, Y ) = 2) =
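The following is a small Python sketch (not part of the original slides) that enumerates the 16 equally likely outcomes of the two tetrahedral-die rolls and evaluates the events above under the discrete uniform law:

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes of two rolls of a tetrahedral die.
outcomes = list(product(range(1, 5), repeat=2))

def prob(event):
    """Probability of an event (a predicate on (x, y)) under the uniform law."""
    return Fraction(sum(1 for xy in outcomes if event(*xy)), len(outcomes))

print(prob(lambda x, y: (x, y) in {(1, 1), (1, 2)}))  # 1/8
print(prob(lambda x, y: x == 1))                      # 1/4
print(prob(lambda x, y: (x + y) % 2 == 1))            # 1/2
print(prob(lambda x, y: min(x, y) == 2))              # 5/16
```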

Discrete uniform law

  Let all outcomes be equally likely. Then,
  P(A) = (number of elements of A) / (total number of sample points)

  Computing probabilities ≡ counting
  Defines fair coins, fair dice, well-shuffled decks

Continuous uniform law

  Two random numbers in [0, 1]
  Uniform law: Probability = Area

  P(X + Y ≤ 1/2) = ?
  P( (X, Y ) = (0.5, 0.3) ) = ?

Probability law: Example with countably infinite sample space

  Sample space: {1, 2, . . .}
  We are given P(n) = 2^(−n), n = 1, 2, . . .
  Find P(outcome is even)

  [Figure: bar plot of P(n): 1/2, 1/4, 1/8, 1/16, . . .]

  P({2, 4, 6, . . .}) = P(2) + P(4) + · · ·
                      = 1/2^2 + 1/2^4 + 1/2^6 + · · · = 1/3

  Countable additivity axiom (needed for this calculation):
  If A1, A2, . . . are disjoint events, then:
  P(A1 ∪ A2 ∪ · · ·) = P(A1) + P(A2) + · · ·

Remember!
  Turn in recitation/tutorial scheduling form now
  Tutorials start next week

LECTURE 2

Readings: Sections 1.3-1.4

Lecture outline
  Review
  Conditional probability
  Three important tools:
    Multiplication rule
    Total probability theorem
    Bayes' rule

Review of probability models

  Sample space Ω
    Mutually exclusive
    Collectively exhaustive
    Right granularity
  Event: Subset of the sample space
  Allocation of probabilities to events
  1. P(A) ≥ 0
  2. P(Ω) = 1
  3. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
  3'. If A1, A2, . . . are disjoint events, then:
      P(A1 ∪ A2 ∪ · · ·) = P(A1) + P(A2) + · · ·

Problem solving:
  Specify sample space
  Define probability law
  Identify event of interest
  Calculate...

Conditional probability

  P(A | B) = probability of A, given that B occurred
  B is our new universe

  Definition: Assuming P(B) ≠ 0,
    P(A | B) = P(A ∩ B) / P(B)

  P(A | B) undefined if P(B) = 0

Die roll example

  [Figure: 4-by-4 grid of outcomes; X = first roll, Y = second roll]

  Let B be the event: min(X, Y ) = 2
  Let M = max(X, Y )
  P(M = 1 | B) =
  P(M = 2 | B) =

Multiplication rule

  P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B)

  [Figure: tree with branches A / Ac, then B / Bc given A, then C / Cc,
   leading to leaves such as A ∩ B ∩ C, A ∩ Bc ∩ C, A ∩ Bc ∩ Cc]

Models based on conditional probabilities

  Event A: Airplane is flying above
  Event B: Something registers on radar screen

  P(A) = 0.05        P(Ac) = 0.95
  P(B | A) = 0.99    P(Bc | A) = 0.01
  P(B | Ac) = 0.10   P(Bc | Ac) = 0.90

  P(A ∩ B) =
  P(B) =
  P(A | B) =
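A minimal Python sketch (not from the slides) working through the radar example with the multiplication rule, the total probability theorem, and Bayes' rule:

```python
from fractions import Fraction

# Radar example: A = airplane present, B = radar registers.
P_A = Fraction(5, 100)            # P(A) = 0.05
P_B_given_A = Fraction(99, 100)   # P(B | A) = 0.99
P_B_given_Ac = Fraction(10, 100)  # P(B | Ac) = 0.10

P_Ac = 1 - P_A
P_A_and_B = P_A * P_B_given_A                      # multiplication rule
P_B = P_A * P_B_given_A + P_Ac * P_B_given_Ac      # total probability theorem
P_A_given_B = P_A_and_B / P_B                      # Bayes' rule

print(P_A_and_B, P_B, float(P_A_given_B))          # 99/2000, 289/2000, ~0.343
```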

Total probability theorem

  Divide and conquer
  Partition of the sample space into A1, A2, A3
  Have P(B | Ai), for every i

  [Figure: sample space partitioned into A1, A2, A3, with event B intersecting all three]

  One way of computing P(B):
  P(B) = P(A1)P(B | A1) + P(A2)P(B | A2) + P(A3)P(B | A3)

Bayes' rule

  Prior probabilities P(Ai): initial beliefs
  We know P(B | Ai) for each i
  Wish to compute P(Ai | B): revise beliefs, given that B occurred

  P(Ai | B) = P(Ai ∩ B) / P(B)
            = P(Ai)P(B | Ai) / P(B)
            = P(Ai)P(B | Ai) / Σj P(Aj)P(B | Aj)

LECTURE 3

Readings: Section 1.5

Lecture outline
  Review
  Independence of two events
  Independence of a collection of events

Review

  P(A | B) = P(A ∩ B) / P(B),   assuming P(B) > 0

  Multiplication rule:
  P(A ∩ B) = P(B) P(A | B) = P(A) P(B | A)

  Total probability theorem:
  P(B) = P(A)P(B | A) + P(Ac)P(B | Ac)

  Bayes' rule:
  P(Ai | B) = P(Ai)P(B | Ai) / P(B)

Models based on conditional probabilities

  3 tosses of a biased coin:
  P(H) = p, P(T ) = 1 − p

  [Figure: tree with branches p / 1 − p at each toss,
   leaves HHH, HHT, HTH, HTT, THH, THT, TTH, TTT]

  P(T HT ) =
  P(1 head) =
  P(first toss is H | 1 head) =

Independence of two events

  Intuitive "definition": P(B | A) = P(B)
    occurrence of A provides no information about B's occurrence
  Recall that P(A ∩ B) = P(A) · P(B | A)

  Defn: P(A ∩ B) = P(A) · P(B)
    Symmetric with respect to A and B
    applies even if P(A) = 0
    implies P(A | B) = P(A)

Conditioning may affect independence

  Conditional independence, given C, is defined as independence
  under the probability law P(· | C)

  Assume A and B are independent
  If we are told that C occurred, are A and B independent?

Conditioning may affect independence (ctd.)

  Two unfair coins, A and B:
  P(H | coin A) = 0.9, P(H | coin B) = 0.1
  choose either coin with equal probability

  [Figure: tree; coin A or B is chosen with probability 0.5 each, followed by
   independent tosses with P(H) = 0.9 for coin A and P(H) = 0.1 for coin B]

  Once we know it is coin A, are tosses independent?
  If we do not know which coin it is, are tosses independent?
  Compare:
    P(toss 11 = H)
    P(toss 11 = H | first 10 tosses are heads)

Independence of a collection of events

  Intuitive definition:
  Information on some of the events tells us nothing about
  probabilities related to the remaining events
  E.g.: P(A1 ∩ (Ac2 ∪ A3) | A5 ∩ Ac6) = P(A1 ∩ (Ac2 ∪ A3))

  Mathematical definition:
  Events A1, A2, . . . , An are called independent if:
  P(Ai ∩ Aj ∩ · · · ∩ Aq) = P(Ai)P(Aj) · · · P(Aq)
  for any distinct indices i, j, . . . , q (chosen from {1, . . . , n})

Independence vs. pairwise independence

  Two independent fair coin tosses
  A: First toss is H
  B: Second toss is H
  P(A) = P(B) = 1/2

  [Figure: four equally likely outcomes HH, HT, TH, TT]

  C: First and second toss give same result
  P(C) =
  P(C ∩ A) =
  P(A ∩ B ∩ C) =
  P(C | A ∩ B) =

  Pairwise independence does not imply independence

The king's sibling

  The king comes from a family of two children.
  What is the probability that his sibling is female?
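A short Python sketch (not from the slides) that enumerates the four equally likely outcomes and checks that A, B, C above are pairwise independent but not jointly independent:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))   # HH, HT, TH, TT, equally likely

def P(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] == "H"        # first toss is H
B = lambda w: w[1] == "H"        # second toss is H
C = lambda w: w[0] == w[1]       # both tosses give the same result

# Pairwise independence holds:
assert P(lambda w: A(w) and C(w)) == P(A) * P(C)   # 1/4 = 1/2 * 1/2
assert P(lambda w: B(w) and C(w)) == P(B) * P(C)
# ...but joint independence fails:
assert P(lambda w: A(w) and B(w) and C(w)) != P(A) * P(B) * P(C)   # 1/4 vs 1/8
```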

LECTURE 4

Readings: Section 1.6

Lecture outline
  Principles of counting
  Many examples
    permutations
    k-permutations
    combinations
    partitions
  Binomial probabilities

Discrete uniform law

  Let all sample points be equally likely. Then,
  P(A) = (number of elements of A) / (total number of sample points) = |A| / |Ω|

  Just count. . .

Basic counting principle

  r stages
  ni choices at stage i
  Number of choices is: n1 n2 · · · nr

  Number of license plates with 3 letters and 4 digits =
  . . . if repetition is prohibited =
  Permutations: Number of ways of ordering n elements is:
  Number of subsets of {1, . . . , n} =

Example

  Probability that six rolls of a six-sided die all give different numbers?
  Number of outcomes that make the event happen:
  Number of elements in the sample space:
  Answer:

Combinations

  (n choose k): number of k-element subsets of a given n-element set

  Two ways of constructing an ordered sequence of k distinct items:
    Choose the k items one at a time:
      n(n − 1) · · · (n − k + 1) = n!/(n − k)!  choices
    Choose k items, then order them (k! possible orders)

  Hence:
    (n choose k) · k! = n!/(n − k)!
    (n choose k) = n! / (k!(n − k)!)

Binomial probabilities

  n independent coin tosses, P(H) = p

  P(HTTHHH) =
  P(sequence) = p^(# heads) (1 − p)^(# tails)

  P(k heads) = Σ over k-head sequences of P(seq.)
             = (# of k-head seqs.) · p^k (1 − p)^(n−k)
             = (n choose k) p^k (1 − p)^(n−k)

  Also: Σ from k=0 to n of (n choose k) p^k (1 − p)^(n−k) = 1
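A small Python sketch (not from the slides) of the binomial probability formula, including a numerical check that the PMF sums to 1:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k heads in n independent tosses with P(H) = p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
print(binomial_pmf(3, n, p))                               # P(3 heads)
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))    # 1.0
```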

Coin tossing problem

  Event B: 3 out of 10 tosses were heads.
  Given that B occurred, what is the (conditional) probability
  that the first 2 tosses were heads?

  All outcomes in set B are equally likely: probability p^3 (1 − p)^7
  Conditional probability law (on B) is uniform
  Number of outcomes in B:
  Out of the outcomes in B, how many start with HH?

Partitions

  52-card deck, dealt to 4 players
  Find P(each gets an ace)

  Outcome: a partition of the 52 cards
  Number of outcomes: 52! / (13! 13! 13! 13!)

  Count number of ways of distributing the four aces: 4 · 3 · 2
  Count number of ways of dealing the remaining 48 cards:
  48! / (12! 12! 12! 12!)

  Answer:
    4 · 3 · 2 · [48! / (12! 12! 12! 12!)] / [52! / (13! 13! 13! 13!)]
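A Python sketch (not from the slides) that evaluates the partition-counting answer exactly:

```python
from fractions import Fraction
from math import factorial

deals = Fraction(factorial(52), factorial(13)**4)                   # all ways to deal 4 hands of 13
favorable = 4 * 3 * 2 * Fraction(factorial(48), factorial(12)**4)   # place the 4 aces, deal the rest

p_each_gets_an_ace = favorable / deals
print(p_each_gets_an_ace, float(p_each_gets_an_ace))                # ~0.1055
```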

LECTURE 5

Readings: Sections 2.1-2.3, start 2.4

Lecture outline
  Random variables
  Probability mass function (PMF)
  Expectation
  Variance

Random variables

  An assignment of a value (number) to every possible outcome
  Mathematically: A function from the sample space Ω to the real numbers
  discrete or continuous values
  Can have several random variables defined on the same sample space
  Notation:
    random variable X
    numerical value x

Probability mass function (PMF)
  (probability law, probability distribution of X)

  Notation:
  pX (x) = P(X = x) = P({ω ∈ Ω s.t. X(ω) = x})
  pX (x) ≥ 0,   Σx pX (x) = 1

How to compute a PMF pX (x)
  collect all possible outcomes for which X is equal to x
  add their probabilities
  repeat for all x

  Example: Two independent rolls of a fair tetrahedral die
  F : outcome of first throw
  S: outcome of second throw
  X = min(F, S)

  [Figure: 4-by-4 grid of (F, S) outcomes]

  pX (2) =

Example: X = number of coin tosses until first head
  assume independent tosses, P(H) = p > 0

  pX (k) = P(X = k)
         = P(T T · · · T H)
         = (1 − p)^(k−1) p,   k = 1, 2, . . .

  geometric PMF

Binomial PMF

  X: number of heads in n independent coin tosses
  P(H) = p

  Let n = 4:
  pX (2) = P(HHTT ) + P(HTHT ) + P(HTTH) + P(THHT ) + P(THTH) + P(TTHH)
         = 6 p^2 (1 − p)^2
         = (4 choose 2) p^2 (1 − p)^2

  In general:
  pX (k) = (n choose k) p^k (1 − p)^(n−k),   k = 0, 1, . . . , n

Expectation

  Definition: E[X] = Σx x pX (x)

  Interpretations:
    Center of gravity of PMF
    Average in large number of repetitions of the experiment
    (to be substantiated later in this course)

  Example: Uniform on 0, 1, . . . , n
  pX (x) = 1/(n + 1) for x = 0, 1, . . . , n

  E[X] = 0 · 1/(n+1) + 1 · 1/(n+1) + · · · + n · 1/(n+1) = n/2

Properties of expectations

  Let X be a r.v. and let Y = g(X)
    Hard: E[Y ] = Σy y pY (y)
    Easy: E[Y ] = E[g(X)] = Σx g(x) pX (x)

  Caution: In general, E[g(X)] ≠ g(E[X])

  Second moment: E[X^2] = Σx x^2 pX (x)

  If α, β are constants, then:
    E[α] =
    E[αX] =
    E[αX + β] =

Variance

  var(X) = E[(X − E[X])^2]
         = Σx (x − E[X])^2 pX (x)
         = E[X^2] − (E[X])^2

  Properties:
    var(X) ≥ 0
    var(αX + β) = α^2 var(X)
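A short Python sketch (not from the slides) that computes the mean and variance of a discrete PMF, and checks the identity var(X) = E[X^2] − (E[X])^2 on the uniform example above:

```python
from fractions import Fraction

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

n = 4
pmf = {x: Fraction(1, n + 1) for x in range(n + 1)}   # uniform on {0, ..., n}

second_moment = sum(x**2 * p for x, p in pmf.items())
print(mean(pmf))                        # n/2 = 2
print(variance(pmf))                    # 2
print(second_moment - mean(pmf) ** 2)   # same value, via E[X^2] - (E[X])^2
```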

LECTURE 6

Readings: Sections 2.4-2.6

Lecture outline
  Review: PMF, expectation, variance
  Conditional PMF
  Geometric PMF
  Total expectation theorem
  Joint PMF of two random variables

Review

  Random variable X: function from sample space to the real numbers
  PMF (for discrete random variables): pX (x) = P(X = x)

  Expectation:
    E[X] = Σx x pX (x)
    E[g(X)] = Σx g(x) pX (x)
    E[αX + β] = α E[X] + β
    E[X − E[X]] = 0

  var(X) = E[(X − E[X])^2] = Σx (x − E[X])^2 pX (x) = E[X^2] − (E[X])^2
  Standard deviation: σX = √var(X)

Average speed vs. average time

  Traverse a 200-mile distance at constant but random speed V

  [Figure: pV (v) puts probability 1/2 at v = 1 and 1/2 at v = 200]

  d = 200,   time in hours = T = t(V ) = 200/V

  E[T ] = E[t(V )] = Σv t(v) pV (v) =
  E[V ] =
  var(V ) =
  σV =

  E[T V ] = 200 ≠ E[T ] · E[V ]
  E[200/V ] = E[T ] ≠ 200/E[V ]

Conditional PMF and expectation

  pX|A(x) = P(X = x | A)
  E[X | A] = Σx x pX|A(x)

  [Figure: a PMF pX (x) and its conditional version given an event A]

Geometric PMF

  X: number of independent coin tosses until first head

  pX (k) = (1 − p)^(k−1) p,   k = 1, 2, . . .

  E[X] = Σ from k=1 to ∞ of k pX (k) = Σ from k=1 to ∞ of k (1 − p)^(k−1) p

  Memoryless property: Given that X > 2,
  the r.v. X − 2 has the same geometric PMF

  [Figure: pX (k), pX|X>2(k), and pX−2|X>2(k) have the same shape]

  Let A = {X ≥ 2}
  pX|A(x) =
  E[X | A] =

Total expectation theorem

  Partition of the sample space into disjoint events A1, A2, . . . , An

  P(B) = P(A1)P(B | A1) + · · · + P(An)P(B | An)
  pX (x) = P(A1)pX|A1(x) + · · · + P(An)pX|An(x)
  E[X] = P(A1)E[X | A1] + · · · + P(An)E[X | An]

  Geometric example:
  A1: {X = 1}, A2: {X > 1}
  E[X] = P(X = 1)E[X | X = 1] + P(X > 1)E[X | X > 1]
  Solve to get E[X] = 1/p

Joint PMFs

  pX,Y (x, y) = P(X = x and Y = y)

  [Figure: table of joint PMF values on a grid, with entries such as 1/20, 2/20, 3/20, 4/20]

  Σx Σy pX,Y (x, y) = 1

  pX (x) = Σy pX,Y (x, y)

  pX|Y (x | y) = P(X = x | Y = y) = pX,Y (x, y) / pY (y)
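A Python sketch of marginalization and conditioning for a joint PMF. The table below is hypothetical (the exact entries of the slide's 1/20 table are not recoverable from the extraction):

```python
from fractions import Fraction

F = Fraction
# Hypothetical joint PMF p_{X,Y}(x, y), stored as (x, y) -> probability.
p_XY = {(1, 1): F(1, 8), (1, 2): F(1, 8),
        (2, 1): F(1, 4), (2, 2): F(1, 8),
        (3, 1): F(1, 8), (3, 2): F(1, 4)}

# Marginal of X: pX(x) = sum over y of pXY(x, y)
p_X = {}
for (x, y), p in p_XY.items():
    p_X[x] = p_X.get(x, 0) + p

# Conditional PMF of X given Y = 1: pX|Y(x | 1) = pXY(x, 1) / pY(1)
p_Y1 = sum(p for (x, y), p in p_XY.items() if y == 1)
p_X_given_Y1 = {x: p / p_Y1 for (x, y), p in p_XY.items() if y == 1}

print(p_X)            # {1: 1/4, 2: 3/8, 3: 3/8}
print(p_X_given_Y1)   # {1: 1/4, 2: 1/2, 3: 1/4}
```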

LECTURE 7

Readings: Finish Chapter 2

Lecture outline
  Multiple random variables
    Joint PMF
    Conditioning
    Independence
  More on expectations
  Binomial distribution revisited
  A hat problem

Review

  pX (x) = P(X = x)
  pX,Y (x, y) = P(X = x, Y = y)
  pX|Y (x | y) = P(X = x | Y = y)
  pX (x) = Σy pX,Y (x, y)
  pX,Y (x, y) = pX (x) pY|X (y | x)
  pX,Y,Z (x, y, z) = pX (x) pY|X (y | x) pZ|X,Y (z | x, y)

Independent random variables

  Random variables X, Y , Z are independent if:
  pX,Y,Z (x, y, z) = pX (x) pY (y) pZ (z)   for all x, y, z

  [Figure: joint PMF table with entries 1/20, 2/20, 3/20, 4/20 on a grid]

  Independent?
  What if we condition on X ≤ 2 and Y ≥ 3?

Expectations

  E[X] = Σx x pX (x)
  E[g(X, Y )] = Σx Σy g(x, y) pX,Y (x, y)

  In general: E[g(X, Y )] ≠ g(E[X], E[Y ])

  E[αX + β] = α E[X] + β
  E[X + Y + Z] = E[X] + E[Y ] + E[Z]

  If X, Y are independent:
    E[XY ] = E[X] E[Y ]
    E[g(X) h(Y )] = E[g(X)] E[h(Y )]

Variances

  var(aX) = a^2 var(X)
  var(X + a) = var(X)

  Let Z = X + Y . If X, Y are independent:
    var(X + Y ) = var(X) + var(Y )

  Examples:
    If X = Y : var(X + Y ) =
    If X = −Y : var(X + Y ) =
    If X, Y indep., and Z = X − 3Y : var(Z) =

Binomial mean and variance

  X = # of successes in n independent trials
  probability of success p

  E[X] = Σ from k=0 to n of k (n choose k) p^k (1 − p)^(n−k)

  Xi = 1 if success in trial i, 0 otherwise

  E[Xi] =
  var(Xi) =
  E[X] =
  var(X) =

The hat problem

  n people throw their hats in a box and then pick one at random.
  X: number of people who get their own hat
  Find E[X]

  Xi = 1 if i selects own hat, 0 otherwise
  X = X1 + X2 + · · · + Xn

  P(Xi = 1) =
  E[Xi] =
  Are the Xi independent?
  E[X] =

Variance in the hat problem

  var(X) = E[X^2] − (E[X])^2 = E[X^2] − 1

  X^2 = Σi Xi^2 + Σ over i, j with i ≠ j of Xi Xj

  E[Xi^2] =
  P(X1 X2 = 1) = P(X1 = 1) P(X2 = 1 | X1 = 1)
  E[X^2] =
  var(X) =
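A Monte Carlo sketch (not from the slides) for the hat problem; it estimates E[X] and var(X), both of which equal 1:

```python
import random

def hat_sample(n):
    """One experiment: n people pick hats uniformly at random (a random permutation)."""
    hats = list(range(n))
    random.shuffle(hats)
    return sum(1 for i, h in enumerate(hats) if i == h)   # number of own-hat matches

n, trials = 10, 200_000
samples = [hat_sample(n) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(mean, var)    # both close to 1, matching E[X] = 1 and var(X) = 1
```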

LECTURE 8

Readings: Sections 3.1-3.3

Lecture outline
  Probability density functions
  Cumulative distribution functions
  Normal random variables

Continuous r.v.'s and PDFs

  A continuous r.v. is described by a probability density function fX

  [Figure: a density fX (x) over the sample space, with the event {a < X < b} shaded]

  P(a ≤ X ≤ b) = ∫_a^b fX (x) dx
  P(X ∈ B) = ∫_B fX (x) dx,   for "nice" sets B
  ∫_{−∞}^{∞} fX (x) dx = 1
  P(x ≤ X ≤ x + δ) = ∫_x^{x+δ} fX (s) ds ≈ fX (x) · δ

Means and variances

  E[X] = ∫ x fX (x) dx
  E[g(X)] = ∫ g(x) fX (x) dx
  var(X) = σX^2 = ∫ (x − E[X])^2 fX (x) dx

Continuous uniform r.v.

  fX (x) = 1/(b − a),   a ≤ x ≤ b

  E[X] = (a + b)/2
  σX^2 = ∫_a^b (x − (a + b)/2)^2 · 1/(b − a) dx = (b − a)^2 / 12

Cumulative distribution function (CDF)

  FX (x) = P(X ≤ x) = ∫_{−∞}^x fX (t) dt

  Also for discrete r.v.'s:
  FX (x) = P(X ≤ x) = Σ over k ≤ x of pX (k)

  [Figure: a PDF and its CDF; a discrete PMF with values 1/6, 2/6, 3/6 and its staircase CDF]

Mixed distributions

  Schematic drawing of a combination of a PDF and a PMF

  [Figure: PDF with total area 1/2 plus a point mass of 1/2; the corresponding CDF
   FX (x) = P(X ≤ x) rises continuously and jumps, passing through 1/4, 1/2, 3/4, 1]

Gaussian (normal) PDF

  Standard normal N (0, 1): fX (x) = (1/√(2π)) e^(−x²/2)

  [Figure: standard normal PDF fX (x) and CDF FX (x), with FX (0) = 0.5]

  E[X] = 0,   var(X) = 1

  General normal N (μ, σ²):
  fX (x) = (1/(σ√(2π))) e^(−(x−μ)²/2σ²)

  It turns out that: E[X] = μ and var(X) = σ².

  Let Y = aX + b
  Then: E[Y ] =
        var(Y ) =
  Fact: Y ∼ N (aμ + b, a²σ²)

The constellation of concepts

  [Diagram linking pX (x) / fX (x), FX (x), and E[X], var(X)]

Calculating normal probabilities

  No closed form available for the CDF, but there are tables (for the standard normal)

  If X ∼ N (μ, σ²), then (X − μ)/σ ∼ N (0, 1)

  If X ∼ N (2, 16):
  P(X ≤ 3) = P( (X − 2)/4 ≤ (3 − 2)/4 ) = CDF(0.25)
  [Table: standard normal CDF Φ(z) for z from 0.00 to 2.89 in steps of 0.01
   (e.g., Φ(0.25) = .5987, Φ(0.50) = .6915, Φ(1.00) = .8413, Φ(2.00) = .9772)]
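In code, the standard normal CDF can be written with the error function from the Python standard library; this sketch (not from the slides) reproduces the N(2, 16) example above:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), expressed via the standard error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# If X ~ N(2, 16), i.e. mu = 2 and sigma = 4:
print(normal_cdf(3, mu=2, sigma=4))    # Phi(0.25) ≈ 0.5987
```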

Summary of concepts

  pX (x)            fX (x)
  FX (x)            E[X], var(X)
  pX,Y (x, y)       fX,Y (x, y)
  pX|A(x)           fX|A(x)
  pX|Y (x | y)      fX|Y (x | y)

LECTURE 9

Readings: Sections 3.4-3.5

Outline
  PDF review
  Multiple random variables
    conditioning
    independence
  Examples

Continuous r.v.'s and PDFs (review)

  P(a ≤ X ≤ b) = ∫_a^b fX (x) dx
  P(x ≤ X ≤ x + δ) ≈ fX (x) · δ
  E[g(X)] = ∫ g(x) fX (x) dx

Joint PDF fX,Y (x, y)

  P((X, Y ) ∈ S) = ∫∫_S fX,Y (x, y) dx dy

  Interpretation:
  P(x ≤ X ≤ x + δ, y ≤ Y ≤ y + δ) ≈ fX,Y (x, y) · δ²

  Expectations:
  E[g(X, Y )] = ∫∫ g(x, y) fX,Y (x, y) dx dy

  From the joint to the marginal:
  fX (x) · δ ≈ P(x ≤ X ≤ x + δ),  so  fX (x) = ∫ fX,Y (x, y) dy

  X and Y are called independent if
  fX,Y (x, y) = fX (x) fY (y),   for all x, y

Buffon's needle

  Parallel lines at distance d
  Needle of length ℓ (assume ℓ < d)
  Find P(needle intersects one of the lines)

  X ∈ [0, d/2]: distance of needle midpoint to nearest line
  Θ ∈ [0, π/2]: acute angle between the needle and the lines
  Model: X, Θ uniform, independent

  fX,Θ(x, θ) = 4/(πd),   0 ≤ x ≤ d/2, 0 ≤ θ ≤ π/2

  Intersect if X ≤ (ℓ/2) sin Θ

  P(X ≤ (ℓ/2) sin Θ) = ∫∫ fX (x) fΘ(θ) dx dθ
    = (4/(πd)) ∫_0^{π/2} ∫_0^{(ℓ/2) sin θ} dx dθ
    = (4/(πd)) ∫_0^{π/2} (ℓ/2) sin θ dθ
    = 2ℓ/(πd)
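A Monte Carlo sketch (not from the slides) of Buffon's needle under the same model; the empirical frequency approaches 2ℓ/(πd):

```python
import random
from math import pi, sin

def buffon_estimate(d=2.0, l=1.0, trials=1_000_000):
    """Estimate P(needle of length l crosses a line), lines d apart (l < d)."""
    hits = 0
    for _ in range(trials):
        x = random.uniform(0.0, d / 2.0)         # midpoint distance to nearest line
        theta = random.uniform(0.0, pi / 2.0)    # acute angle with the lines
        if x <= (l / 2.0) * sin(theta):
            hits += 1
    return hits / trials

print(buffon_estimate())           # close to 2*l/(pi*d) = 1/pi ≈ 0.3183
print(2 * 1.0 / (pi * 2.0))
```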

Conditioning

  Recall: P(x ≤ X ≤ x + δ) ≈ fX (x) · δ

  By analogy, would like:
  P(x ≤ X ≤ x + δ | Y ≈ y) ≈ fX|Y (x | y) · δ

  This leads us to the definition:
  fX|Y (x | y) = fX,Y (x, y) / fY (y),   if fY (y) > 0

  For given y, the conditional PDF is a (normalized) section of the joint PDF
  If independent, fX,Y = fX fY , and we obtain fX|Y (x | y) = fX (x)

  [Figure: joint, marginal and conditional densities. A slice through the density surface
   for fixed x has area equal to the marginal density at x; renormalizing slices gives the
   conditional densities of Y given X = x.
   Image by MIT OpenCourseWare, adapted from Probability, by J. Pitman, 1999.]

Stick-breaking example

  Break a stick of length ℓ twice:
  break at X: uniform in [0, ℓ];
  break again at Y : uniform in [0, X]

  fX,Y (x, y) = fX (x) fY|X (y | x) = (1/ℓ)(1/x),   on the set 0 ≤ y ≤ x ≤ ℓ

  [Figure: fX (x), fY|X (y | x), and the triangular support of the joint PDF]

  fY (y) = ∫ fX,Y (x, y) dx = ∫_y^ℓ 1/(ℓx) dx = (1/ℓ) log(ℓ/y),   0 ≤ y ≤ ℓ

  E[Y | X = x] = ∫ y fY|X (y | x) dy = x/2

  E[Y ] = ∫ y fY (y) dy = ∫_0^ℓ (y/ℓ) log(ℓ/y) dy = ℓ/4

LECTURE 10

Continuous Bayes' rule; derived distributions

Readings: Section 3.6; start Section 4.1

Review

  pX|Y (x | y) = pX,Y (x, y) / pY (y)
  pX (x) = Σy pX,Y (x, y)

  fX|Y (x | y) = fX,Y (x, y) / fY (y)
  fX (x) = ∫ fX,Y (x, y) dy
  FX (x) = P(X ≤ x)

The Bayes variations

  Discrete X, discrete Y :
    pX|Y (x | y) = pX (x) pY|X (y | x) / pY (y),
    pY (y) = Σx pX (x) pY|X (y | x)
  Example:
    X = 1, 0: airplane present/not present
    Y = 1, 0: something did/did not register on radar

  Continuous counterpart (continuous X, continuous Y ):
    fX|Y (x | y) = fX (x) fY|X (y | x) / fY (y),
    fY (y) = ∫ fX (x) fY|X (y | x) dx
  Example: X: some signal; prior fX (x)
    Y : noisy version of X
    fY|X (y | x): model of the noise

  Discrete X, continuous Y :
    pX|Y (x | y) = pX (x) fY|X (y | x) / fY (y),
    fY (y) = Σx pX (x) fY|X (y | x)
  Example:
    X: a discrete signal; prior pX (x)
    Y : noisy version of X
    fY|X (y | x): continuous noise model

  Continuous X, discrete Y :
    fX|Y (x | y) = fX (x) pY|X (y | x) / pY (y),
    pY (y) = ∫ fX (x) pY|X (y | x) dx
  Example:
    X: a continuous signal; prior fX (x) (e.g., intensity of a light beam);
    Y : discrete r.v. affected by X (e.g., photon count)
    pY|X (y | x): model of the discrete r.v.

What is a derived distribution

  It is a PMF or PDF of a function of one or more random variables
  with known probability law. E.g.:
  obtaining the PDF for g(X, Y ) = Y /X involves deriving a distribution.
  Note: g(X, Y ) is a random variable

When not to find them

  Don't need the PDF of g(X, Y ) if we only want to compute its expected value:

  E[g(X, Y )] = ∫∫ g(x, y) fX,Y (x, y) dx dy

How to find them

  Discrete case:
    Obtain the probability mass for each possible value of Y = g(X):
    pY (y) = P(g(X) = y) = Σ over x with g(x) = y of pX (x)

  The continuous case, a two-step procedure:
    Get the CDF of Y : FY (y) = P(Y ≤ y)
    Differentiate to get fY (y) = dFY/dy (y)

Example

  X: uniform on [0, 2]
  Find the PDF of Y = X³
  Solution:
  FY (y) = P(Y ≤ y) = P(X³ ≤ y) = P(X ≤ y^(1/3)) = (1/2) y^(1/3)
  fY (y) = dFY/dy (y) = 1/(6 y^(2/3)),   0 ≤ y ≤ 8

Example

  Joan is driving from Boston to New York.
  Her speed is uniformly distributed between 30 and 60 mph.
  What is the distribution of the duration of the trip?

  [Figure: fV (v) = 1/30 on [30, 60]]

  Let T (V ) = 200/V . Find fT (t)

The PDF of Y = aX + b

  [Figure: fX , faX , faX+b — the density is scaled by 1/|a| and shifted by b]

  fY (y) = (1/|a|) fX ((y − b)/a)

  Use this to check that if X is normal, then Y = aX + b is also normal.

LECTURE 11

Derived distributions; convolution; covariance and correlation

Readings: Finish Section 4.1; Section 4.2

A general formula

  Let Y = g(X), with g strictly monotonic.

  The event x ≤ X ≤ x + δ is the same as g(x) ≤ Y ≤ g(x + δ),
  or (approximately) g(x) ≤ Y ≤ g(x) + δ |(dg/dx)(x)|

  Hence,
  fY (y) · |(dg/dx)(x)| = fX (x),   where y = g(x)

  [Figure: the map y = g(x); an interval [x, x + δ] maps to an interval of length ≈ δ |dg/dx|]

Example

  [Figure: fX,Y (x, y) = 1 on the unit square]

  Find the PDF of Z = g(X, Y ) = Y /X
  FZ (z) = P(Y /X ≤ z),   computed separately for z ≤ 1 and z ≥ 1

The distribution of X + Y

  Discrete case: W = X + Y ; X, Y independent

  pW (w) = P(X + Y = w)
         = Σx P(X = x) P(Y = w − x)
         = Σx pX (x) pY (w − x)

  Mechanics (convolution):
    Put the PMFs on top of each other
    Flip the PMF of Y
    Shift the flipped PMF by w (to the right if w > 0)
    Cross-multiply and add

  [Figure: lattice points (0,3), (1,2), (2,1), (3,0) on the line x + y = w]

The continuous case

  W = X + Y ; X, Y independent

  fW|X (w | x) = fY (w − x)
  fW,X (w, x) = fX (x) fW|X (w | x) = fX (x) fY (w − x)

  fW (w) = ∫ fX (x) fY (w − x) dx
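A short Python sketch (not from the slides) of the discrete convolution mechanics, applied to the sum of two fair dice:

```python
from fractions import Fraction

def convolve_pmfs(p_X, p_Y):
    """PMF of W = X + Y for independent discrete X, Y, given as dicts value -> probability."""
    p_W = {}
    for x, px in p_X.items():
        for y, py in p_Y.items():
            p_W[x + y] = p_W.get(x + y, 0) + px * py
    return p_W

die = {k: Fraction(1, 6) for k in range(1, 7)}
print(convolve_pmfs(die, die))   # sum of two dice: 2 -> 1/36, ..., 7 -> 6/36, ..., 12 -> 1/36
```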

The sum of independent normal r.v.'s

  X ∼ N (0, σx²), Y ∼ N (0, σy²), independent
  Let W = X + Y

  fW (w) = ∫ fX (x) fY (w − x) dx
         = ∫ (1/(2π σx σy)) e^(−x²/2σx²) e^(−(w−x)²/2σy²) dx
         = (algebra) = c e^(−w²/2(σx²+σy²))

  Conclusion: W is normal, with mean 0 and variance σx² + σy²
  (same argument for the nonzero-mean case)

Two independent normal r.v.'s

  X ∼ N (μx, σx²), Y ∼ N (μy, σy²), independent

  fX,Y (x, y) = fX (x) fY (y)
    = (1/(2π σx σy)) exp{ −(x − μx)²/2σx² − (y − μy)²/2σy² }

  The PDF is constant on the ellipse where
  (x − μx)²/2σx² + (y − μy)²/2σy² is constant

  The ellipse is a circle when σx = σy

  [Figure: scatter plots of samples from a two-dimensional normal distribution]

Covariance

  cov(X, Y ) = E[(X − E[X]) (Y − E[Y ])]

  Zero-mean case: cov(X, Y ) = E[XY ]
  In general: cov(X, Y ) = E[XY ] − E[X]E[Y ]

  independent ⇒ cov(X, Y ) = 0   (converse is not true)

  var(Σ from i=1 to n of Xi) = Σ from i=1 to n of var(Xi) + Σ over (i, j) with i ≠ j of cov(Xi, Xj)

Correlation coefficient

  Dimensionless version of covariance:

  ρ = E[ (X − E[X])(Y − E[Y ]) / (σX σY ) ] = cov(X, Y ) / (σX σY )

  −1 ≤ ρ ≤ 1
  |ρ| = 1  ⇔  (X − E[X]) = c(Y − E[Y ])   (linearly related)
  Independent ⇒ ρ = 0   (converse is not true)

LECTURE 12

Readings: Section 4.3; parts of Section 4.5 (mean and variance only; no transforms)

Lecture outline
  Conditional expectation
  Law of iterated expectations
  Law of total variance
  Sum of a random number of independent r.v.'s

Conditional expectations

  Given the value y of a r.v. Y :
  E[X | Y = y] = Σx x pX|Y (x | y)   (integral in the continuous case)

  Stick example: stick of length ℓ;
  break at uniformly chosen point Y ;
  break again at uniformly chosen point X
  E[X | Y = y] = y/2

  E[X | Y = y] is a number; E[X | Y ] = Y /2 is a random variable
  (it has a mean and a variance)

  Law of iterated expectations:
  E[E[X | Y ]] = Σy E[X | Y = y] pY (y) = E[X]

  In the stick example:
  E[X] = E[E[X | Y ]] = E[Y /2] = ℓ/4

var(X | Y ) and its expectation

  var(X | Y = y) = E[(X − E[X | Y = y])² | Y = y]

  var(X | Y ): a r.v. with value var(X | Y = y) when Y = y

  Law of total variance:
  var(X) = E[var(X | Y )] + var(E[X | Y ])

  Proof:
  (a) Recall: var(X) = E[X²] − (E[X])²
  (b) var(X | Y ) = E[X² | Y ] − (E[X | Y ])²
  (c) E[var(X | Y )] = E[X²] − E[(E[X | Y ])²]
  (d) var(E[X | Y ]) = E[(E[X | Y ])²] − (E[X])²
  Sum of the right-hand sides of (c), (d): E[X²] − (E[X])² = var(X)

Section means and variances

  Two sections:
  y = 1 (10 students); y = 2 (20 students)

  y = 1:  (1/10) Σ from i=1 to 10 of xi = 90
  y = 2:  (1/20) Σ from i=11 to 30 of xi = 60

  Overall average: (1/30) Σ from i=1 to 30 of xi = (90·10 + 60·20)/30 = 70

  E[X | Y = 1] = 90,   E[X | Y = 2] = 60
  E[X | Y ] = 90 w.p. 1/3, 60 w.p. 2/3

  E[E[X | Y ]] = (1/3)·90 + (2/3)·60 = 70 = E[X]

  var(E[X | Y ]) = (1/3)(90 − 70)² + (2/3)(60 − 70)² = 600/3 = 200

Section means and variances (ctd.)

  (1/10) Σ from i=1 to 10 of (xi − 90)² = 10,  so  var(X | Y = 1) = 10
  (1/20) Σ from i=11 to 30 of (xi − 60)² = 20,  so  var(X | Y = 2) = 20

  var(X | Y ) = 10 w.p. 1/3, 20 w.p. 2/3
  E[var(X | Y )] = (1/3)·10 + (2/3)·20 = 50/3

  var(X) = E[var(X | Y )] + var(E[X | Y ]) = 50/3 + 200
         = (average variability within sections) + (variability between sections)

Example

  [Figure: a mixture PDF fX (x), with a piece of height 2/3 (for Y = 1) and a piece of height 1/3 (for Y = 2)]

  E[X | Y = 1] =        var(X | Y = 1) =
  E[X | Y = 2] =        var(X | Y = 2) =
  E[X] =                var(E[X | Y ]) =
  var(X) = E[var(X | Y )] + var(E[X | Y ])

Sum of a random number of independent r.v.'s

  N : number of stores visited (N is a nonnegative integer r.v.)
  Xi: money spent in store i
    Xi assumed i.i.d., independent of N

  Let Y = X1 + · · · + XN

  E[Y | N = n] = E[X1 + X2 + · · · + Xn | N = n]
               = E[X1 + X2 + · · · + Xn]
               = E[X1] + E[X2] + · · · + E[Xn]
               = n E[X]

  E[Y | N ] = N E[X]

  E[Y ] = E[E[Y | N ]] = E[N E[X]] = E[N ] E[X]

Variance of a sum of a random number of independent r.v.'s

  var(Y ) = E[var(Y | N )] + var(E[Y | N ])

  E[Y | N ] = N E[X]  ⇒  var(E[Y | N ]) = (E[X])² var(N )
  var(Y | N = n) = n var(X)  ⇒  var(Y | N ) = N var(X)  ⇒  E[var(Y | N )] = E[N ] var(X)

  var(Y ) = E[N ] var(X) + (E[X])² var(N )
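A Monte Carlo sketch (not from the slides, with an arbitrary choice of distributions for N and Xi) that checks the two formulas above:

```python
import random

# Y = X1 + ... + XN with N ~ Uniform{0,...,4} and Xi ~ Uniform[0, 10], all independent.
def sample_Y():
    n = random.randint(0, 4)
    return sum(random.uniform(0.0, 10.0) for _ in range(n))

trials = 500_000
ys = [sample_Y() for _ in range(trials)]
mean_Y = sum(ys) / trials
var_Y = sum((y - mean_Y) ** 2 for y in ys) / trials

E_N, var_N = 2.0, 2.0            # Uniform{0,...,4}: mean 2, variance 2
E_X, var_X = 5.0, 100.0 / 12.0   # Uniform[0,10]: mean 5, variance 100/12
print(mean_Y, E_N * E_X)                              # both ≈ 10
print(var_Y, E_N * var_X + E_X ** 2 * var_N)          # both ≈ 66.7
```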

LECTURE 13

The Bernoulli process

Readings: Section 6.1

Lecture outline
  Definition of Bernoulli process
  Random processes
  Basic properties of the Bernoulli process
  Distribution of interarrival times
  The time of the kth success
  Merging and splitting

The Bernoulli process

  A sequence of independent Bernoulli trials
  At each trial i:
    P(success) = P(Xi = 1) = p
    P(failure) = P(Xi = 0) = 1 − p

  Examples:
    Sequence of lottery wins/losses
    Sequence of ups and downs of the Dow Jones
    Arrivals (each second) to a bank
    Arrivals (at each time slot) to a server

Random processes

  First view: sequence of random variables X1, X2, . . .
    E[Xt] =      var(Xt) =

  Second view: what is the right sample space?
    P(Xt = 1 for all t) =

  Random processes we will study:
    Bernoulli process (memoryless, discrete time)
    Poisson process (memoryless, continuous time)
    Markov chains (with memory/dependence across time)

Number of successes S in n time slots

  P(S = k) =
  E[S] =
  var(S) =

Interarrival times

  T1: number of trials until first success
  P(T1 = t) =
  Memoryless property
  E[T1] =
  var(T1) =

  If you buy a lottery ticket every day, what is the distribution of the
  length of the first string of losing days?

Time of the kth arrival

  Given that the first arrival was at time t, i.e., T1 = t:
  the additional time, T2, until the next arrival has the same (geometric)
  distribution, and is independent of T1

  Yk : number of trials to the kth success
  E[Yk ] =
  var(Yk ) =
  P(Yk = t) =

Splitting and merging of Bernoulli processes

  Splitting (using independent coin flips):
  Starting with a Bernoulli process in which there is a probability p of an arrival
  at each time, whenever there is an arrival we keep it with probability q or discard
  it with probability 1 − q, independently across arrivals. The process of kept
  arrivals is Bernoulli, with a probability pq of a kept arrival in each slot,
  independent of what happens in other slots. For the same reason, the process of
  discarded arrivals is also Bernoulli, with probability p(1 − q) per slot.

  [Figure 6.3: splitting of a Bernoulli process into Bernoulli(pq) and Bernoulli(p(1 − q)) streams]

  Merging (of independent Bernoulli processes):
  Start with two independent Bernoulli processes (with parameters p and q) and record
  an arrival in the merged process if and only if there is an arrival in at least one
  of the two original processes. This happens with probability p + q − pq
  [one minus the probability (1 − p)(1 − q) of no arrival in either process].
  Since different time slots in either of the original processes are independent,
  different slots in the merged process are also independent. Thus, the merged
  process is Bernoulli, with success probability p + q − pq at each time step
  (collisions are counted as one arrival).

  [Figure 6.4: merging of Bernoulli(p) and Bernoulli(q) into Bernoulli(p + q − pq)]

  Splitting and merging of Bernoulli (or other) arrival processes arise in many
  contexts. For example, a two-machine work center may see a stream of arriving
  parts to be processed and split them by sending each part to a randomly chosen
  machine. Conversely, a machine may be faced with arrivals of different types
  that are merged into a single arrival stream.

The Poisson approximation to the binomial

  The number of successes in n independent Bernoulli trials is a binomial random
  variable with parameters n and p, and its mean is np. We concentrate on the
  special case where n is large but p is small, so that the mean np has a moderate
  value. A situation of this type arises when one passes from discrete to continuous
  time, a theme to be picked up in the next section. For some examples, think of the
  number of airplane accidents on any given day: there is a large number n of trials
  (airplane flights), but each one has a very small probability p of being involved
  in an accident. Or think of counting the number of typos in a book: there is a
  large number of words, but a very small probability of misspelling any single one.

  Mathematically, we can address situations of this kind by letting n grow while
  simultaneously decreasing p, in a manner that keeps the product np at a constant
  value λ. In the limit, it turns out that the formula for the binomial PMF
  simplifies to the Poisson PMF. A precise statement is provided next, together
  with a reminder of some of the properties of the Poisson PMF that were derived
  in Chapter 2.

  Poisson approximation to the binomial:
  A Poisson random variable Z with parameter λ takes nonnegative integer values
  and is described by the PMF

    pZ (k) = e^(−λ) λ^k / k!,   k = 0, 1, 2, . . . .

  Its mean and variance are given by E[Z] = λ and var(Z) = λ.

LECTURE 14

The Poisson process

Readings: Start Section 6.2

Lecture outline
  Review of Bernoulli process
  Definition of Poisson process
  Distribution of number of arrivals
  Distribution of interarrival times
  Other properties of the Poisson process

Bernoulli review

  Discrete time; success probability p
  Number of arrivals in n time slots: binomial PMF
  Interarrival times: geometric PMF
  Time to k arrivals: Pascal PMF
  Memorylessness

Definition of the Poisson process

  [Figure: arrivals marked on a continuous time line, with disjoint intervals of durations τ1, τ2, τ3]

  P(k, τ) = Prob. of k arrivals in an interval of duration τ

  Assumptions:
    Numbers of arrivals in disjoint time intervals are independent
    Time homogeneity: P(k, τ) is the same for all intervals of duration τ
    Small interval probabilities: for VERY small δ,
      P(k, δ) ≈ 1 − λδ,  if k = 0
                 λδ,     if k = 1
                 0,      if k > 1
    λ = arrival rate

PMF of the number of arrivals N

  Finely discretize [0, t]: approximately Bernoulli
  Nt (of the discrete approximation): binomial
  Taking δ → 0 (or n → ∞) gives:

    P(k, τ) = ((λτ)^k e^(−λτ)) / k!,   k = 0, 1, . . .

  E[Nt] = λt,   var(Nt) = λt

  Example: You get email according to a Poisson process at a rate of
  λ = 0.4 messages per hour. You check your email every thirty minutes.
    Prob(no new messages) =
    Prob(one new message) =

Example

  You get email according to a Poisson process at a rate of λ = 5 messages per hour.
  You check your email every thirty minutes.
    Prob(no new messages) =
    Prob(one new message) =

Interarrival times

  Yk : time of the kth arrival

  Erlang distribution:
    fYk (y) = λ^k y^(k−1) e^(−λy) / (k − 1)!,   y ≥ 0

  [Figure: Erlang densities fYk (y) for k = 1, 2, 3]

  Time of first arrival (k = 1): exponential:
    fY1 (y) = λ e^(−λy),   y ≥ 0

  Memoryless property: The time to the next arrival is independent of the past

Bernoulli/Poisson relation

  [Figure: [0, t] discretized into n = t/δ slots, with p = λδ and np = λt]

                             POISSON            BERNOULLI
  Times of arrival           Continuous         Discrete
  Arrival rate               λ per unit time    p per trial
  PMF of # of arrivals       Poisson            Binomial
  Interarrival time distr.   Exponential        Geometric
  Time to kth arrival        Erlang             Pascal

Merging Poisson processes

  Sum of independent Poisson random variables is Poisson
  Merging of independent Poisson processes is Poisson

  [Figure: red bulb flashes (Poisson, rate λ1) and green bulb flashes (Poisson, rate λ2)
   merge into "all flashes" (Poisson, rate λ1 + λ2)]

  What is the probability that the next arrival comes from the first process?

LECTURE 15

Poisson process II

Readings: Finish Section 6.2

Lecture outline
  Review of Poisson process
  Merging and splitting
  Examples
  Random incidence

Review

  Defining characteristics:
    Time homogeneity: P(k, τ)
    Independence
    Small interval probabilities (small δ):
      P(k, δ) ≈ 1 − λδ,  if k = 0
                 λδ,     if k = 1
                 0,      if k > 1

  N is a Poisson r.v., with parameter λτ:
    P(k, τ) = ((λτ)^k e^(−λτ)) / k!,   k = 0, 1, . . .
    E[N ] = var(N ) = λτ

  Interarrival times (k = 1): exponential:
    fT1 (t) = λ e^(−λt), t ≥ 0,   E[T1] = 1/λ

  Time Yk to the kth arrival: Erlang(k):
    fYk (y) = λ^k y^(k−1) e^(−λy) / (k − 1)!,   y ≥ 0

  Memoryless property: The time to the next arrival is independent of the past

Merging Poisson processes (again)

  Sum of independent Poisson random variables is Poisson
  Merging of independent Poisson processes is Poisson

  [Figure: red bulb flashes (Poisson, rate λ1) and green bulb flashes (Poisson, rate λ2)
   merge into "all flashes" (Poisson, rate λ1 + λ2)]

  What is the probability that the next arrival comes from the first process?

Poisson fishing

  Assume: Poisson, λ = 0.6/hour. Fish for two hours;
  if no catch, continue until the first catch.

  a) P(fish for more than two hours) =
  b) P(fish for more than two and less than five hours) =
  c) P(catch at least two fish) =
  d) E[number of fish] =
  e) E[future fishing time | fished for four hours] =
  f) E[total fishing time] =

Light bulb example

  Each light bulb has an independent, exponential(λ) lifetime
  Install three light bulbs.
  Find the expected time until the last light bulb dies out.

Splitting of Poisson processes

  Assume that email traffic through a server is a Poisson process.
  Destinations of different messages are independent.
  Each message is routed along the first stream with probability p;
  routings of different messages are independent.

  [Figure: email traffic leaving MIT is split at the MIT server into a USA stream
   (probability p) and a Foreign stream (probability 1 − p)]

  Each output stream is Poisson.

Random incidence for Poisson

  Poisson process that has been running forever
  Show up at some "random time" (really means "arbitrary time")

  What is the distribution of the length of the chosen interarrival interval?

  [Figure: a time line with the chosen time instant falling inside one interarrival interval]

Random incidence in "renewal processes"

  Series of successive arrivals
    i.i.d. interarrival times (but not necessarily exponential)

  Example: Bus interarrival times are equally likely to be 5 or 10 minutes
  If you arrive at a random time:
    what is the probability that you selected a 5 minute interarrival interval?
    what is the expected time to the next arrival?

LECTURE 16

Markov processes I

Readings: Sections 7.1-7.2

Lecture outline
  Checkout counter example
  Markov process definition
  n-step transition probabilities
  Classification of states

Checkout counter model

  Discrete time n = 0, 1, . . .
  Customer arrivals: Bernoulli(p)
    geometric interarrival times
  Customer service times: geometric(q)
  State Xn: number of customers at time n

  [Figure: birth-death chain on states 0, 1, . . . , 10]

Finite state Markov chains

  Xn: state after n transitions
    belongs to a finite set, e.g., {1, . . . , m}
    X0 is either given or random

  Markov property/assumption:
  (given the current state, the past does not matter)

  pij = P(Xn+1 = j | Xn = i)
      = P(Xn+1 = j | Xn = i, Xn−1, . . . , X0)

  Model specification:
    identify the possible states
    identify the possible transitions
    identify the transition probabilities

n-step transition probabilities

  State occupancy probabilities, given initial state i:
  rij (n) = P(Xn = j | X0 = i)

  [Figure: paths from time 0 to time n, passing through some state k at time n − 1,
   with rik (n − 1) and pkj marked]

  Key recursion:
  rij (n) = Σ from k=1 to m of rik (n − 1) pkj

  With random initial state:
  P(Xn = j) = Σ from i=1 to m of P(X0 = i) rij (n)
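A Python sketch (not from the slides) of the key recursion, applied to a two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8, consistent with the two-state example used later (its steady-state probabilities are 2/7 and 5/7):

```python
# n-step transition probabilities via the key recursion r(n) = r(n-1) * P.
P = [[0.5, 0.5],
     [0.2, 0.8]]

def n_step(P, n):
    m = len(P)
    r = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]   # r(0) = identity
    for _ in range(n):
        r = [[sum(r[i][k] * P[k][j] for k in range(m)) for j in range(m)] for i in range(m)]
    return r

print(n_step(P, 1))     # equals P
print(n_step(P, 100))   # every row is close to [2/7, 5/7]
```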

Example

  [Figure: two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8;
   table of rij (n) for n = 0, 1, 2, . . . , 100, 101]

  r11(n) =      r12(n) =
  r21(n) =      r22(n) =
  r31(n) =

  Generic convergence questions:
    Does rij (n) converge to something?
    Does the limit depend on the initial state?

  n odd: r22(n) =
  n even: r22(n) =

Recurrent and transient states

  State i is recurrent if:
  starting from i, and from wherever you can go,
  there is a way of returning to i

  If not recurrent, called transient

  [Figure: chain illustrating transient states and recurrent classes]

  i transient: P(Xn = i) → 0, i visited a finite number of times

  Recurrent class:
  collection of recurrent states that communicate with each other
  and with no other state

LECTURE 17

Markov processes II

Readings: Section 7.3

Lecture outline
  Review
  Steady-state behavior
    Steady-state convergence theorem
    Balance equations
  Birth-death processes

Review

  Discrete state, discrete time, time-homogeneous
    Transition probabilities pij
    Markov property

  rij (n) = P(Xn = j | X0 = i)

  Key recursion:
  rij (n) = Σk rik (n − 1) pkj

Warmup

  [Figure: chain with states 1-9 and given transition probabilities]

  P(X1 = 2, X2 = 6, X3 = 7 | X0 = 1) =
  P(X4 = 7 | X0 = 2) =

Recurrent and transient states

  State i is recurrent if:
  starting from i, and from wherever you can go,
  there is a way of returning to i

  If not recurrent, called transient

  Recurrent class:
  collection of recurrent states that communicate with each other
  and with no other state

Periodic states

  The states in a recurrent class are periodic if they can be grouped into
  d > 1 groups so that all transitions from one group lead to the next group

Steady-state probabilities

  Do the rij (n) converge to some πj ?
  (independent of the initial state i)

  Yes, if:
    recurrent states are all in a single class, and
    the single recurrent class is not periodic

  Assuming yes, start from the key recursion
    rij (n) = Σk rik (n − 1) pkj
  and take the limit as n → ∞:

    πj = Σk πk pkj ,   for all j

  Additional equation:
    Σj πj = 1

Visit frequency interpretation

  (Long run) frequency of being in j: πj
  Frequency of transitions k → j: πk pkj
  Frequency of transitions into j: Σk πk pkj

  Expected frequency of a particular transition:
  Consider n transitions of a Markov chain with a single class which is aperiodic,
  starting from a given initial state. Let qjk (n) be the expected number of such
  transitions that take the state from j to k. Then, regardless of the initial state,
    lim as n → ∞ of qjk (n)/n = πj pjk .

  Given the frequency interpretation of πj and πk pkj , the balance equation
    πj = Σk πk pkj
  has an intuitive meaning: the expected frequency πj of visits to j equals the sum
  of the expected frequencies πk pkj of transitions that lead to j (this also applies
  to transitions from j to itself, which occur with frequency πj pjj ).
  [Figure 7.13: interpretation of the balance equations in terms of frequencies]

  In fact, some stronger statements are also true. Whenever we carry out the
  probabilistic experiment and generate a trajectory of the Markov chain over an
  infinite time horizon, the observed long-term frequency with which state j is
  visited will be exactly equal to πj , and the observed long-term frequency of
  transitions from j to k will be exactly equal to πj pjk . Even though the
  trajectory is random, these equalities hold with essential certainty, that is,
  with probability 1.

Example

  [Figure: two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8]

Birth-death processes

  [Figure: chain on states 0, 1, . . . , m, with "up" probability pi and "down"
   probability qi at state i, and self-loop probabilities 1 − p0, 1 − pi − qi, 1 − qm]

  Local balance: πi pi = πi+1 qi+1

  Special case: pi = p and qi = q for all i
    ρ = p/q = load factor
    πi+1 = πi (p/q) = πi ρ
    πi = π0 ρ^i,   i = 0, 1, . . . , m

  Assume p < q and m ≈ ∞:
    π0 = 1 − ρ
    E[Xn] =   (in steady-state)

LECTURE 18

Markov processes III

Readings: Section 7.4

Lecture outline
  Review of steady-state behavior
  Probability of blocked phone calls
  Calculating absorption probabilities
  Calculating expected time to absorption

Review

  Assume a single class of recurrent states, aperiodic, plus transient states. Then,

    lim as n → ∞ of rij (n) = πj

  where πj does not depend on the initial conditions:

    lim as n → ∞ of P(Xn = j | X0 = i) = πj

  π1, . . . , πm can be found as the unique solution of the balance equations

    πj = Σk πk pkj ,   j = 1, . . . , m,

  together with Σj πj = 1

Example

  [Figure: two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8]

  π1 = 2/7, π2 = 5/7

  Assume the process starts at state 1.
  P(X1 = 1, and X100 = 1) =
  P(X100 = 1 and X101 = 2) =

The phone company problem

  Calls originate as a Poisson process, rate λ
  Each call duration is exponentially distributed (parameter μ)
  B lines available

  Discrete time intervals of (small) length δ

  [Figure: birth-death chain on states 0, 1, . . . , i − 1, i, . . . , B − 1, B,
   with "up" probability λδ and "down" probability iμδ from state i]

  Balance equations: λ πi−1 = i μ πi

    πi = π0 (λ/μ)^i / i!

    π0 = 1 / Σ from i=0 to B of (λ/μ)^i / i!

Calculating absorption probabilities

  What is the probability ai that the process eventually settles in state 4,
  given that the initial state is i?

  [Figure: chain with absorbing states 4 and 5 and transient states 1, 2, 3;
   transition probabilities such as 0.2, 0.3, 0.4, 0.5, 0.6, 0.8 on the arcs]

  For i = 4:  ai = 1
  For i = 5:  ai = 0
  For all other i:  ai = Σj pij aj
  unique solution

Expected time to absorption

  Find the expected number of transitions μi, until reaching the absorbing state,
  given that the initial state is i.

  [Figure: chain with one absorbing state and transient states;
   transition probabilities such as 0.2, 0.4, 0.5, 0.6, 0.8 on the arcs]

  μi = 0 for i = absorbing state
  For all other i:  μi = 1 + Σj pij μj
  unique solution

Mean first passage and recurrence times

  Chain with one recurrent class; fix s recurrent

  Mean first passage time from i to s:
  ti = E[min{n ≥ 0 such that Xn = s} | X0 = i]

  t1, t2, . . . , tm are the unique solution to
    ts = 0,
    ti = 1 + Σj pij tj ,   for all i ≠ s

  Mean recurrence time of s:
  t*s = E[min{n ≥ 1 such that Xn = s} | X0 = s]
  t*s = 1 + Σj psj tj
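A Python sketch of the absorption-probability equations, solved by fixed-point iteration on a small hypothetical chain (the exact chain in the slide's figure is not recoverable, so the transition probabilities below are made up for illustration):

```python
# Absorption probabilities a_i = P(end up in state 4 | start at i); states 4 and 5 absorbing.
P = {
    1: {1: 0.2, 2: 0.5, 3: 0.3},
    2: {1: 0.4, 4: 0.6},
    3: {2: 0.3, 5: 0.7},
    4: {4: 1.0},
    5: {5: 1.0},
}

a = {i: 0.0 for i in P}
a[4] = 1.0     # already absorbed in 4
a[5] = 0.0     # absorbed in 5, can never reach 4
for _ in range(10_000):                 # iterate a_i <- sum_j p_ij a_j until convergence
    for i in (1, 2, 3):
        a[i] = sum(p * a[j] for j, p in P[i].items())
print(a)
```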

LECTURE 19

Limit theorems I

Readings: Sections 5.1-5.3; start Section 5.4

  X1, . . . , Xn i.i.d.
  Mn = (X1 + · · · + Xn)/n
  What happens as n → ∞?
  Why bother?

Lecture outline
  A tool: Chebyshev's inequality
  Convergence "in probability"
  Convergence of Mn (weak law of large numbers)

Chebyshev's inequality

  Random variable X (with finite mean μ and variance σ²)

  σ² = ∫ (x − μ)² fX (x) dx
     ≥ ∫_{−∞}^{μ−c} (x − μ)² fX (x) dx + ∫_{μ+c}^{∞} (x − μ)² fX (x) dx
     ≥ c² P(|X − μ| ≥ c)

  P(|X − μ| ≥ c) ≤ σ²/c²

  P(|X − μ| ≥ kσ) ≤ 1/k²

Deterministic limits

  Sequence an; number a

  an converges to a:  lim as n → ∞ of an = a
  an eventually gets and stays (arbitrarily) close to a

  For every ε > 0, there exists n0, such that for every n ≥ n0, we have |an − a| ≤ ε.

Convergence "in probability"

  Sequence of random variables Yn converges in probability to a number a:
  (almost all) of the PMF/PDF of Yn eventually gets concentrated
  (arbitrarily) close to a

  For every ε > 0,  lim as n → ∞ of P(|Yn − a| ≥ ε) = 0

  [Figure: PMF of Yn with mass 1 − 1/n at 0 and mass 1/n at some large value]
  Does Yn converge?

Convergence of the sample mean (weak law of large numbers)

  X1, X2, . . . i.i.d., finite mean μ and variance σ²

  Mn = (X1 + · · · + Xn)/n

  E[Mn] = μ
  var(Mn) = σ²/n

  P(|Mn − μ| ≥ ε) ≤ var(Mn)/ε² = σ²/(n ε²)

  Mn converges in probability to μ

The pollster's problem

  f : fraction of population that . . .
  ith (randomly selected) person polled:
    Xi = 1 if yes, 0 if no.

  Mn = (X1 + · · · + Xn)/n: fraction of "yes" in our sample

  Goal: 95% confidence of at most 1% error
    P(|Mn − f | ≥ .01) ≤ .05

  Use Chebyshev's inequality:
    P(|Mn − f | ≥ .01) ≤ σ²Mn /(0.01)² = σ²X /(n (0.01)²) ≤ 1/(4 n (0.01)²)

  If n = 50,000, then P(|Mn − f | ≥ .01) ≤ .05   (conservative)

Different scalings of Mn

  X1, . . . , Xn i.i.d., finite variance σ²

  Look at three variants of their sum:
    Sn = X1 + · · · + Xn: variance n σ²
    Mn = Sn/n: variance σ²/n; converges "in probability" to E[X] (WLLN)
    Sn/√n: constant variance σ²
      Asymptotic shape?

The central limit theorem

  "Standardized" Sn = X1 + · · · + Xn:
    Zn = (Sn − E[Sn]) / σSn = (Sn − nE[X]) / (√n σ)
    zero mean, unit variance

  Let Z be a standard normal r.v. (zero mean, unit variance)

  Theorem: For every c:
    P(Zn ≤ c) → P(Z ≤ c)

  P(Z ≤ c) is the standard normal CDF, Φ(c), available from the normal tables

LECTURE 20

THE CENTRAL LIMIT THEOREM

Readings: Section 5.4

  X1, . . . , Xn i.i.d., finite variance σ²
  "Standardized" Sn = X1 + · · · + Xn:
    Zn = (Sn − E[Sn]) / σSn = (Sn − nE[X]) / (√n σ)
    E[Zn] = 0,   var(Zn) = 1

  Let Z be a standard normal r.v. (zero mean, unit variance)

  Theorem: For every c:
    P(Zn ≤ c) → P(Z ≤ c)
  P(Z ≤ c) is the standard normal CDF, Φ(c), available from the normal tables

Usefulness

  universal; only means and variances matter
  accurate computational shortcut
  justification of normal models

What exactly does it say?

  CDF of Zn converges to the normal CDF
    not a statement about convergence of PDFs or PMFs

  Normal approximation:
    Treat Zn as if normal
    also treat Sn as if normal

  Can we use it when n is "moderate"?
    Yes, but no nice theorems to this effect
    Symmetry helps a lot

  [Figure: PMFs of Sn for n = 2, 4, 8, 16, 32, approaching a normal shape]

The pollster's problem using the CLT

  f : fraction of population that . . .
  ith (randomly selected) person polled:
    Xi = 1 if yes, 0 if no.
  Mn = (X1 + · · · + Xn)/n

  Suppose we want: P(|Mn − f | ≥ .01) ≤ .05

  Event of interest: |Mn − f | ≥ .01
    |X1 + · · · + Xn − nf | ≥ .01 n
    |X1 + · · · + Xn − nf | / (σ√n) ≥ .01 √n / σ

  P(|Mn − f | ≥ .01) ≈ P(|Z| ≥ .01 √n/σ) ≤ P(|Z| ≥ .02 √n)   (since σ ≤ 1/2)

Apply to the binomial

  Fix p, where 0 < p < 1
  Xi: Bernoulli(p)
  Sn = X1 + · · · + Xn: Binomial(n, p)
    mean np, variance np(1 − p)

  CDF of (Sn − np)/√(np(1 − p))  →  standard normal

Example

  n = 36, p = 0.5; find P(Sn ≤ 21)

  Exact answer:
    Σ from k=0 to 21 of (36 choose k) (1/2)^36 = 0.8785

The 1/2 correction for the binomial approximation

  P(Sn ≤ 21) = P(Sn < 22), because Sn is integer
  Compromise: consider P(Sn ≤ 21.5)

  [Figure: PMF bars at 18, 19, 20, 21, 22, with the area up to 21.5 shaded]

De Moivre-Laplace CLT (for the binomial)

  When the 1/2 correction is used, the CLT can also approximate
  the binomial p.m.f. (not just the binomial CDF)

  P(Sn = 19) = P(18.5 ≤ Sn ≤ 19.5)
    = P( (18.5 − 18)/3 ≤ (Sn − 18)/3 ≤ (19.5 − 18)/3 )
    = P(0.17 ≤ Zn ≤ 0.5)
    ≈ P(0.17 ≤ Z ≤ 0.5)
    = P(Z ≤ 0.5) − P(Z ≤ 0.17)
    = 0.6915 − 0.5675
    = 0.124

  Exact answer:  (36 choose 19) (1/2)^36 = 0.1251

Poisson vs. normal approximations of the binomial

  Poisson arrivals during a unit interval equal the sum of n (independent)
  Poisson arrivals during n intervals of length 1/n

  Let n → ∞, apply CLT (??)
  Poisson = normal (????)

  Binomial(n, p):
    p fixed, n → ∞: normal
    np fixed, n → ∞, p → 0: Poisson

    p = 1/100, n = 100: Poisson
    p = 1/10, n = 500: normal
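A Python sketch (not from the slides) comparing the exact binomial answers above with the normal approximation using the 1/2 correction:

```python
from math import comb, erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 36, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))      # 18 and 3

exact_cdf = sum(comb(n, k) for k in range(22)) / 2**n
approx_cdf = Phi((21.5 - mu) / sigma)         # with the 1/2 correction
print(exact_cdf, approx_cdf)                  # 0.8785 vs ≈ 0.878

exact_pmf = comb(n, 19) / 2**n
approx_pmf = Phi((19.5 - mu) / sigma) - Phi((18.5 - mu) / sigma)
print(exact_pmf, approx_pmf)                  # 0.1251 vs ≈ 0.124
```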

p()

p()

p()

p()

p()

pX|(x | )
Types of
Inference models/approaches
p()p()
Sample Applications
N
(x | )
pX|

Model
building
versus
inferring
unkno
wn
(x
|
)
p
p()
X|
N
p()
p()
Readings: Sections 8.1-8.2
X
Polling
variables.
E.g., assume X = aS + W
N
N
(x | )
p
X
ModelNbuilding:
X X|
Design of experiments/sampling
N
pX|(x | ) methodologies

N
It is the mark of truly educated people
| )(x | know
pX|(x
)
p
signal
S,
observe
X,
infer
a
X|
Lancet study on Iraq death toll

to be deeply moved bystatistics.

Xpresence of noise:
(x
| the
)
pX|

Estimation
in
X
(x
|
)
p

Estimator
X|
pX (x;X)
pX|(x | )
(Oscar Wilde)
X
Medical/pharmaceutical
trials
a, observe X,
pknow
estimate S.

p()
Estimator
()
Estimator
X

X
p()
Hypothesis testing: unknown
takes one of

Data
mining X

Reality
Model
N N possible values;

Estimator
p()
few
at small
(e.g., customer arrivals)
Poisson)

{0, 1}
X =+W
W aim
fW (w)
(e.g.,Netflix
competition
Estimator

N
probability
of
incorrect
decision

Estimator

Estimator
(x(x
| )
pX|

p()
N
| )
pX|
Finance

Estimator
Data
at a small
error
{0,
1} Estimator
X = aim
+W
W fW (w) Estimation:
| )
pX|(xestimation

LECTURE 21

Estimator

interpretation of experiments
Design &

polling,

trials.
W .f.W (w)
Matrix Completion
pmedical/pharmaceutical
Y |X(y | x)

Netflix
competition

Finance
Y |X (
! Partially observed matrix:pgoal
predict the
to| )
X

unobserved entries
2
5

?
2

3
1X

pX (x)

1/6

Estimator
X
10

Estimator
pX|
pX
(x; (x
) | )
pX|(x | )

p()
Estimator
N

pX|(x | )
X

p()

Estimator
X
X : unknown parameter
a r.v.) N
()
p(not

pX|(x | )
Estimator
E.g., =
mass
of
electron

pX|(x | )
p()
p
()
Bayesian:
Use priors &XBayes rule
Estimator
pX|(x | )
X
Estimator

N N

measurement
3 ?
pY (y ; )
5

1?
4
1
5 5 4
2 ?5 ? 4
3 3 1 5 2 1
pY (y ; )
3
1
2 3
N
4
51
3

3
3?
5
2 ? 1 p1Y (y ; )
pX (x)
5
2 ? X4 4
Y
1 3 1 5 4 5pY |X (y | x)
1 2
4
5?

sensors

f()

pY |X (y | x)

objects X
N
1 pY 4|X (y |5x)

W fW
1}
{0, 1}X =
X+
=W
+W
fW (w) {0,
W (w)
N
p() XX
p()
Classical statistics: X

{0, 1}
X
=
pX|(x |)
+ W
N
N

Estimator
Graph of S&P 500 index removed

due to copyright restrictions.


pY |X ( | )

Estimator

X {0, 1}

W fW (w)

Signal processing
W fW (w)

f()

Tracking,
detection, speaker identification,. . .
Y =X +W
pY (y ; )

pp()
()

{0,
1}| )
p(x
|(x
)
p
X|X|

1/6 X X4

NN

10

X = Estimator
+W

Estimator

Estimator

(x(x
| )
pX|
| )
pX|
Estimator
Estimator
XX

Estimator
Estimator

Bayesian inference: Use Bayes rule

Estimation with discrete data

Hypothesis testing
discrete data

p|X ( | x) =

f|X ( | x) =

p() pX |(x | )
pX (x)

pX (x) =

continuous data
p|X ( | x) =

p() fX |(x | )

f() pX |(x | )
pX (x)

f()pX |(x | ) d

Example:

fX (x)

Coin with unknown parameter


Observe X heads in n tosses

Estimation; continuous data


f|X ( | x) =

What is the Bayesian approach?

f() fX |(x | )

Want to find f|X ( | x)

fX (x)

Assume a prior on (e.g., uniform)

Zt = 0 + t1 + t22
Xt = Zt + Wt,

t = 1, 2, . . . , n

Bayes rule gives:


f0,1,2|X1,...,Xn (0, 1, 2 | x1, . . . , xn)
1

42

)
pX|(x | W
(w)
pX (x) =
f ( ) pf W
X | (x | ) d
(x
|
)
p
X|
W
pW ( w )
pX (x)
p()
X
Y = X + W
+
N | )
pN
N (x
X|
0 , p1Y} ( y ; ) Y = X + W
EXxaX
m{ ple:
N

| )
(x
pX|
X
(x
| |) )
pXp|(x
object at unknown location X
X|

W
pW ( w )
sensors
Least
Mean
Squares
Estimation
pX|(x | )
Estimator
p()
X X X

Y = X + W
Estimator in the absence of information
Estimation
X
N +W
{0, 1}
X=
W fW (w)

Estimator
{0, 1}
X =+W
W fW (w)
pX|(x | )

p X | (f1
= 1}1/6 X =4 + W10
| ()
Estimator
) {0,
W

ft W
E sEstimator
i m(w)
a t or
f=P()
(se nsor i1/6
se nses t4
h e o b j e c10
t |
= ) X
Estimator

{0,
=
X
+
W
W
fW
4t a
nsor
fW
(w
) = h

n c
{e1}
0,o{0,
1 } 1}
==

++
WW
W

f(w)
( dis
f10
fr o X
m X
se
i )
W (w)
()
Wf1/6
Npp()
)()
p(

Output of Bayesian Inference

- Posterior distribution: pmf pΘ|X(θ | x) or pdf fΘ|X(θ | x)
- If interested in a single answer:
  - Maximum a posteriori probability (MAP):
    pΘ|X(θ* | x) = maxθ pΘ|X(θ | x)
    fΘ|X(θ* | x) = maxθ fΘ|X(θ | x)
    minimizes probability of error; often used in hypothesis testing
  - Conditional expectation: E[Θ | X = x] = ∫ θ fΘ|X(θ | x) dθ
- Single answers can be misleading!

Estimator in the absence of information: find estimate c, to minimize E[(Θ - c)²]
- Optimal estimate: c = E[Θ]
- Optimal mean squared error: E[(Θ - E[Θ])²] = Var(Θ)
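To make the "single answer" options concrete, the sketch below (an illustration, not from the slides) continues the coin example: it computes the MAP estimate, the conditional expectation E[Θ | X = x], and the conditional variance from a posterior evaluated on a grid; the Beta(8, 4) posterior assumed here corresponds to a uniform prior with 7 heads in 10 tosses.

import numpy as np

# Posterior of Theta given X = 7 heads in n = 10 tosses, with a uniform prior:
# f_Theta|X(theta | 7) is proportional to theta^7 * (1 - theta)^3  (a Beta(8, 4) pdf)
thetas = np.linspace(0, 1, 100001)
d = thetas[1] - thetas[0]
post = thetas**7 * (1 - thetas)**3
post /= np.sum(post) * d                                # normalize numerically

theta_map = thetas[np.argmax(post)]                     # MAP: maximizes the posterior pdf
theta_lms = np.sum(thetas * post) * d                   # conditional expectation E[Theta | X = x]
var_cond = np.sum((thetas - theta_lms)**2 * post) * d   # Var(Theta | X = x)

print(theta_map)   # approx 0.70
print(theta_lms)   # approx 8/12 = 0.667
print(var_cond)    # approx 0.017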

LMS Estimation of Θ based on X

- Two r.v.'s Θ, X
- we observe that X = x
- new universe: condition on X = x
- E[(Θ - c)² | X = x] is minimized by c = E[Θ | X = x]
- E[(Θ - E[Θ | X = x])² | X = x] ≤ E[(Θ - g(x))² | X = x]
- E[(Θ - E[Θ | X])² | X] ≤ E[(Θ - g(X))² | X]
- E[(Θ - E[Θ | X])²] ≤ E[(Θ - g(X))²]
- E[Θ | X] minimizes E[(Θ - g(X))²] over all estimators g(·)

LMS Estimation w. several measurements

- Unknown r.v. Θ
- Observe values of r.v.'s X1, . . . , Xn
- Best estimator: E[Θ | X1, . . . , Xn]
- Can be hard to compute/implement
  (involves multi-dimensional integrals, etc.)
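A small simulation can illustrate the optimality statement E[(Θ - E[Θ | X])²] ≤ E[(Θ - g(X))²]. This sketch is an illustration only: it assumes a model in which Θ ~ N(0, 1) and X = Θ + W with W ~ N(0, 1), for which E[Θ | X] = X/2, and compares it against the naive estimator g(X) = X.

import numpy as np

rng = np.random.default_rng(0)
m = 200_000
theta = rng.normal(0.0, 1.0, m)          # Theta ~ N(0, 1)   (assumed model)
x = theta + rng.normal(0.0, 1.0, m)      # X = Theta + W, with W ~ N(0, 1)

lms = x / 2                              # E[Theta | X] = X/2 for this normal model
naive = x                                # a competing estimator g(X) = X

print(np.mean((theta - lms)**2))         # approx 0.5  (the optimal MSE)
print(np.mean((theta - naive)**2))       # approx 1.0  (strictly worse)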


LECTURE 22

Readings: pp. 225-226; Sections 8.3-8.4
Topics

- (Bayesian) least mean squares (LMS) estimation
- (Bayesian) linear LMS estimation

MAP estimate: Θ̂ = g(X); MAP maximizes fΘ|X(θ | x)

LMS estimate: Θ̂ = E[Θ | X] minimizes E[(Θ - g(X))²] over all estimators g(·)

LMS estimation: for any x, the estimate θ̂ = E[Θ | X = x]
minimizes E[(Θ - θ̂)² | X = x] over all estimates θ̂

[Figure labels removed: fΘ(θ), fX|Θ(x | θ), g(θ), Θ̂ = g(X).]

Conditional mean squared error: E[(Θ - E[Θ | X])² | X = x]
- same as Var(Θ | X = x): the variance of the conditional distribution of Θ, given X = x

Predicting X based on Y
- Two r.v.'s X, Y; we observe that Y = y
- new universe: condition on Y = y
- E[(X - c)² | Y = y] is minimized by c = E[X | Y = y]

Some properties of LMS estimation
- Estimator: Θ̂ = E[Θ | X]; estimation error: Θ̃ = Θ̂ - Θ
- E[Θ̃ | X = x] = 0, and E[Θ̃] = 0
- E[Θ̃ h(X)] = 0, for any function h
- cov(Θ̂, Θ̃) = 0
- Since Θ = Θ̂ - Θ̃:  var(Θ) = var(Θ̂) + var(Θ̃)
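The properties listed above (zero-mean error, error uncorrelated with any function of X, and var(Θ) = var(Θ̂) + var(Θ̃)) can be checked numerically. The sketch below is illustrative only, using an assumed normal model Θ ~ N(0, 1), X = Θ + W with W ~ N(0, 1), for which Θ̂ = E[Θ | X] = X/2.

import numpy as np

rng = np.random.default_rng(1)
m = 500_000
theta = rng.normal(0.0, 1.0, m)         # Theta ~ N(0, 1)   (assumed model)
x = theta + rng.normal(0.0, 1.0, m)     # X = Theta + W

theta_hat = x / 2                       # LMS estimator E[Theta | X]
theta_tilde = theta_hat - theta         # estimation error

print(np.mean(theta_tilde))                                    # approx 0
print(np.mean(theta_tilde * np.sin(x)))                        # approx 0: E[error * h(X)] = 0
print(np.cov(theta_hat, theta_tilde)[0, 1])                    # approx 0: cov(hat, error) = 0
print(np.var(theta_hat) + np.var(theta_tilde), np.var(theta))  # both approx 1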


Linear LMS

- Consider estimators of Θ of the form:  Θ̂ = aX + b
- Minimize E[(Θ - aX - b)²]
- Best choice of a, b; best linear estimator:
  Θ̂L = E[Θ] + (Cov(X, Θ)/var(X)) (X - E[X])

Linear LMS properties

- Θ̂L = E[Θ] + ρ (σΘ/σX) (X - E[X])
- E[(Θ̂L - Θ)²] = (1 - ρ²) σΘ²

Linear LMS with multiple data

- Consider estimators of the form:  Θ̂ = a1X1 + . . . + anXn + b
- Find best choices of a1, . . . , an, b
- Minimize:  E[(a1X1 + . . . + anXn + b - Θ)²]
- Only means, variances, covariances matter
- Set derivatives to zero: linear system in b and the ai
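Since only means, variances, and covariances matter, the best linear estimator can be formed directly from those quantities. Below is a minimal sketch (illustrative model and numbers, not from the slides) that builds Θ̂L = E[Θ] + (cov(X, Θ)/var(X))(X - E[X]) from simulated data.

import numpy as np

rng = np.random.default_rng(2)
m = 200_000
theta = rng.uniform(4.0, 10.0, m)            # some prior for Theta (an assumption)
x = theta + rng.normal(0.0, 2.0, m)          # noisy observation X = Theta + W

# Best linear estimator: Theta_L = E[Theta] + cov(X, Theta)/var(X) * (X - E[X])
cov_x_theta = np.mean((x - x.mean()) * (theta - theta.mean()))
a = cov_x_theta / np.var(x)
b = theta.mean() - a * x.mean()
theta_lin = a * x + b

print(a, b)                                  # approx 3/7 and 4 for these parameters
print(np.mean((theta - theta_lin)**2))       # approx (1 - rho^2) * var(Theta) = 12/7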
The cleanest linear LMS example

- Xi = Θ + Wi;  Θ, W1, . . . , Wn independent
- Θ ~ N(x0, σ0²),  Wi ~ N(0, σi²)
- Θ̂L = ( x0/σ0² + X1/σ1² + . . . + Xn/σn² ) / ( 1/σ0² + 1/σ1² + . . . + 1/σn² )
  (a weighted average of x0, X1, . . . , Xn)
- If all normal,  Θ̂L = E[Θ | X1, . . . , Xn]

Big picture

- Standard examples:
  - Xi uniform on [0, θ]; uniform prior on θ
  - Xi Bernoulli(p); uniform (or Beta) prior on p
  - Xi normal with mean θ, known variance σ²; normal prior on θ
- Estimation methods: MAP; MSE; Linear MSE

Choosing Xi in linear LMS

- E[Θ | X] is the same as E[Θ | X³]
- Linear LMS is different:  Θ̂ = aX + b  versus  Θ̂ = aX³ + b
- Also consider  Θ̂ = a1X + a2X² + a3X³ + b
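The weighted-average formula above can be evaluated directly; a minimal sketch with made-up values for x0, the σi, and the observations (all assumptions chosen only for illustration):

import numpy as np

# Theta ~ N(x0, sigma0^2), X_i = Theta + W_i with W_i ~ N(0, sigma_i^2)  (assumed values)
x0 = 5.0
sigma = np.array([2.0, 1.0, 0.5, 1.5])   # sigma[0] is the prior std sigma_0
obs = np.array([6.1, 5.4, 5.9])          # hypothetical observations X_1, ..., X_3

values = np.concatenate(([x0], obs))     # x0, X_1, ..., X_n
weights = 1.0 / sigma**2                 # 1/sigma_i^2, i = 0, ..., n

theta_hat = np.sum(weights * values) / np.sum(weights)
print(theta_hat)   # weighted average of x0, X_1, ..., X_n = E[Theta | X_1, ..., X_n]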


LECTURE 23

Readings: Section 9.1
(not responsible for t-based confidence intervals, in pp. 471-473)

Outline

- Classical statistics
- Maximum likelihood (ML) estimation
- Estimating a sample mean
- Confidence intervals (CIs)
- CIs using an estimated variance

Classical statistics

- X ~ pX(x; θ)
- also for vectors X and θ:  pX1,...,Xn(x1, . . . , xn; θ1, . . . , θm)
- These are NOT conditional probabilities; θ is NOT random
- mathematically: many models, one for each possible value of θ

Problem types:
- Hypothesis testing:  H0: θ = 1/2  versus  H1: θ = 3/4
- Composite hypotheses:  H0: θ = 1/2  versus  H1: θ ≠ 1/2
- Estimation: design an estimator Θ̂, to keep estimation error small

Maximum Likelihood Estimation

- Model, with unknown parameter(s):  X ~ pX(x; θ)
- Pick θ that "makes data most likely":
  θ̂ML = arg maxθ pX(x; θ)
- Compare to Bayesian MAP estimation:
  θ̂MAP = arg maxθ pΘ|X(θ | x) = arg maxθ pX|Θ(x | θ) pΘ(θ) / pX(x)
- Example: X1, . . . , Xn: i.i.d., exponential(θ)
  maxθ  ∏(i=1..n) θ e^(-θ xi)
  maxθ  [ n log θ - θ (x1 + . . . + xn) ]
  θ̂ML = n / (x1 + . . . + xn)

Desirable properties of estimators
(should hold FOR ALL θ !!!)

- Unbiased:  E[Θ̂n] = θ
  exponential example, with n = 1:  E[1/X1] = ∞ ≠ θ  (biased)
- Consistent:  Θ̂n → θ  (in probability)
  exponential example:  (X1 + . . . + Xn)/n → E[X] = 1/θ
  can use this to show that:  Θ̂n = n/(X1 + . . . + Xn) → 1/E[X] = θ
- Small mean squared error (MSE):
  E[(Θ̂ - θ)²] = var(Θ̂ - θ) + (E[Θ̂ - θ])² = var(Θ̂) + (bias)²
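For the exponential example, the ML estimate has the closed form θ̂ML = n/(x1 + . . . + xn). The sketch below (illustrative only) checks this by maximizing the log-likelihood numerically on simulated data with a hypothetical true value θ = 2.

import numpy as np

rng = np.random.default_rng(3)
theta_true = 2.0                           # hypothetical true parameter
x = rng.exponential(1.0 / theta_true, 50)  # X_i ~ exponential(theta): mean 1/theta

# Closed-form ML estimate: n / (x_1 + ... + x_n)
theta_ml = len(x) / np.sum(x)

# Same answer from a grid search over the log-likelihood n*log(theta) - theta*sum(x)
grid = np.linspace(0.01, 10, 100000)
loglik = len(x) * np.log(grid) - grid * np.sum(x)
print(theta_ml, grid[np.argmax(loglik)])   # the two agree (up to grid resolution)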


Estimate a mean

- X1, . . . , Xn: i.i.d., mean θ, variance σ²
  Xi = θ + Wi,   Wi: i.i.d., mean 0, variance σ²
- Θ̂n = sample mean = Mn = (X1 + . . . + Xn)/n
- Properties:
  E[Θ̂n] = θ   (unbiased)
  WLLN: Θ̂n → θ   (consistency)
  MSE: σ²/n
- Sample mean often turns out to also be the ML estimate.
  E.g., if Xi ~ N(θ, σ²), i.i.d.

Confidence intervals (CIs)

- An estimate Θ̂n may not be informative enough
- A 1 - α confidence interval is a (random) interval [Θ̂n-, Θ̂n+],
  s.t.  P(Θ̂n- ≤ θ ≤ Θ̂n+) ≥ 1 - α,  for all θ
- often α = 0.05, or 0.25, or 0.01
- interpretation is subtle

CI in estimation of the mean

- Θ̂n = (X1 + . . . + Xn)/n
- normal tables:  Φ(1.96) = 1 - 0.05/2
- P( |Θ̂n - θ| / (σ/√n) ≤ 1.96 ) ≈ 0.95   (CLT)
- P( Θ̂n - 1.96 σ/√n ≤ θ ≤ Θ̂n + 1.96 σ/√n ) ≈ 0.95
- More generally: let z be s.t. Φ(z) = 1 - α/2
  P( Θ̂n - z σ/√n ≤ θ ≤ Θ̂n + z σ/√n ) ≈ 1 - α

The case of unknown σ

- Option 1: use upper bound on σ
  if Xi Bernoulli: σ ≤ 1/2
- Option 2: use ad hoc estimate of σ
  if Xi Bernoulli(θ): σ̂ = √(Θ̂(1 - Θ̂))
- Option 3: use generic estimate of the variance
  start from σ² = E[(Xi - θ)²]
  σ̂² = (1/n) Σ(i=1..n) (Xi - θ)² → σ²   (but do not know θ)
  Ŝn² = (1/(n-1)) Σ(i=1..n) (Xi - Θ̂n)² → σ²   (unbiased: E[Ŝn²] = σ²)
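A minimal sketch of the 1 - α confidence interval for the mean, using Option 3 (the unbiased variance estimate Ŝn²) in place of the unknown σ; the data here are simulated and purely illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(3.0, 2.0, 100)       # hypothetical sample: true mean 3, true sigma 2
n = len(x)

theta_hat = np.mean(x)              # sample mean
s_hat = np.std(x, ddof=1)           # square root of the unbiased variance estimate S_n^2

alpha = 0.05
z = norm.ppf(1 - alpha / 2)         # z such that Phi(z) = 1 - alpha/2 (about 1.96 for 5%)

lo = theta_hat - z * s_hat / np.sqrt(n)
hi = theta_hat + z * s_hat / np.sqrt(n)
print(lo, hi)                       # approximate 95% confidence interval for theta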


LECTURE 24

Reference: Section 9.3

Course Evaluations (until 12/16)
http://web.mit.edu/subjectevaluation

Review
- Maximum likelihood estimation
- Have model with unknown parameters:  X ~ pX(x; θ)
- Pick θ that makes data most likely:  maxθ pX(x; θ)
- Compare to Bayesian MAP estimation:
  maxθ pΘ|X(θ | x),  or  maxθ pX|Θ(x | θ) pΘ(θ) / pX(x)

Outline

- Review
- Maximum likelihood estimation
- Confidence intervals
- Linear regression
- Binary hypothesis testing
  - Types of error
  - Likelihood ratio test (LRT)

Review (ctd.)
- Sample mean estimate of θ = E[X]:  Θ̂n = (X1 + . . . + Xn)/n
- 1 - α confidence interval:  P(Θ̂n- ≤ θ ≤ Θ̂n+) ≥ 1 - α
- confidence interval for sample mean: let z be s.t. Φ(z) = 1 - α/2
  P( Θ̂n - z σ/√n ≤ θ ≤ Θ̂n + z σ/√n ) ≈ 1 - α

[Textbook excerpt (Classical Statistical Inference, Chap. 9):]

. . . in the context of various probabilistic frameworks, which provide perspective and a mechanism for quantitative analysis.

We first consider the case of only two variables, and then generalize. We wish to model the relation between two variables of interest, x and y (e.g., years of education and income), based on a collection of data pairs (xi, yi), i = 1, . . . , n. For example, xi could be the years of education and yi the annual income of the ith person in the sample. Often a two-dimensional plot of these samples indicates a systematic, approximately linear relation between xi and yi. Then, it is natural to attempt to build a linear model of the form

y ≈ θ0 + θ1 x,

where θ0 and θ1 are unknown parameters to be estimated. In particular, given some estimates θ̂0 and θ̂1 of the resulting parameters, the value ŷi corresponding to xi, as predicted by the model, is

ŷi = θ̂0 + θ̂1 xi.

Generally, ŷi will be different from the given value yi, and the corresponding difference

ỹi = yi - ŷi,

is called the ith residual. A choice of estimates that results in small residuals is considered to provide a good fit to the data. With this motivation, the linear regression approach chooses the parameter estimates θ̂0 and θ̂1 that minimize the sum of the squared residuals,

Σ(i=1..n) (yi - ŷi)² = Σ(i=1..n) (yi - θ0 - θ1 xi)²,

over all θ0 and θ1; see Fig. 9.5 for an illustration.

Regression

[Slide figure: data pairs (xi, yi), the line y = θ0 + θ1 x, and the residual yi - θ0 - θ1 xi.]

Figure 9.5: Illustration of a set of data pairs (xi, yi), and a linear model y = θ̂0 + θ̂1 x, obtained by minimizing over θ0, θ1 the sum of the squares of the residuals yi - θ0 - θ1 xi.

Linear regression

- Data: (x1, y1), (x2, y2), . . . , (xn, yn)
- Model:  y ≈ θ0 + θ1 x
- min over θ0, θ1:  Σ(i=1..n) (yi - θ0 - θ1 xi)²    (*)
- Solution (set derivatives to zero):
  x̄ = (x1 + . . . + xn)/n,   ȳ = (y1 + . . . + yn)/n
  θ̂1 = Σ(i=1..n) (xi - x̄)(yi - ȳ) / Σ(i=1..n) (xi - x̄)²
  θ̂0 = ȳ - θ̂1 x̄

Interpretation of the form of the solution

- Assume a model Y = θ0 + θ1 X + W,
  with W independent of X and with zero mean
- Check that
  θ1 = cov(X, Y) / var(X) = E[(X - E[X])(Y - E[Y])] / E[(X - E[X])²]
- Solution formula for θ̂1 uses natural estimates of the variance and covariance

- One interpretation: Yi = θ0 + θ1 xi + Wi,  Wi ~ N(0, σ²), i.i.d.
  Likelihood function fX,Y(x, y; θ) is:  c · exp{ -Σ(i=1..n) (yi - θ0 - θ1 xi)² / (2σ²) }
  Take logs, same as (*)
- Least squares ↔ pretend the Wi are i.i.d. normal
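A sketch of the closed-form least-squares solution above, on made-up data pairs; it also checks the answer against numpy's polynomial least-squares routine.

import numpy as np

# Hypothetical data pairs (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
theta0 = y_bar - theta1 * x_bar
print(theta0, theta1)

# Cross-check with numpy's least squares (fit a degree-1 polynomial)
print(np.polyfit(x, y, 1))   # returns [theta1, theta0]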


The world of linear regression

- In practice, one also reports:
  - Confidence intervals for the θi
  - Standard error (estimate of σ)
  - R², a measure of explanatory power

- Multiple linear regression:
  data: (xi, x'i, x''i, yi), i = 1, . . . , n
  model:  y ≈ θ0 + θ x + θ' x' + θ'' x''
  formulation:  min over θ, θ', θ''  Σ(i=1..n) (yi - θ0 - θ xi - θ' x'i - θ'' x''i)²

The world of regression (ctd.)

- model y ≈ θ0 + θ1 h(x),  e.g., y ≈ θ0 + θ1 x²
  work with data points (yi, h(xi))
  formulation:  min Σ(i=1..n) (yi - θ0 - θ1 h(xi))²

- Some common concerns:
  - Heteroskedasticity
  - Multicollinearity
  - Choosing the right variables
  - Sometimes misused to conclude causal relations
  - etc.
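Both multiple linear regression and the "linear in θ, nonlinear in x" variant reduce to the same least-squares formulation; a sketch using numpy's linear least-squares solver on made-up data with two explanatory variables (all numbers below are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, n)   # hypothetical "true" model plus noise

# Columns: constant, x1, x2.  Adding h(x) columns, e.g. x1**2, handles nonlinear terms in x.
A = np.column_stack([np.ones(n), x1, x2])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)   # estimates of theta0, theta1, theta2 (close to 1.0, 2.0, -0.5)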

Binary hypothesis testing

- Binary θ; new terminology:
  - null hypothesis H0:  X ~ pX(x; H0)   [or fX(x; H0)]
  - alternative hypothesis H1:  X ~ pX(x; H1)   [or fX(x; H1)]
- Partition the space of possible data vectors
  Rejection region R: reject H0 iff data ∈ R
- Types of errors:
  - Type I (false rejection, false alarm): H0 true, but rejected
    α(R) = P(X ∈ R; H0)
  - Type II (false acceptance, missed detection): H0 false, but accepted
    β(R) = P(X ∉ R; H1)

Likelihood ratio test (LRT)

- Bayesian case (MAP rule): choose H1 if:
  P(H1 | X = x) > P(H0 | X = x)
  or
  P(X = x | H1) P(H1) / P(X = x)  >  P(X = x | H0) P(H0) / P(X = x)
  or
  P(X = x | H1) / P(X = x | H0)  >  P(H0) / P(H1)
- Nonbayesian version: choose H1 if
  P(X = x; H1) / P(X = x; H0) > ξ   (discrete case)
  fX(x; H1) / fX(x; H0) > ξ   (continuous case)
  (likelihood ratio test)
- threshold ξ trades off the two types of error
  choose ξ so that P(reject H0; H0) = α   (e.g., α = 0.05)


LECTURE 25

Reference: Section 9.4

Course Evaluations (until 12/16)
http://web.mit.edu/subjectevaluation

Outline

- Review of simple binary hypothesis tests
  - examples
- Testing composite hypotheses
  - is my coin fair?
  - is my die fair?
  - goodness of fit tests

Simple binary hypothesis testing

- null hypothesis H0:  X ~ pX(x; H0)   [or fX(x; H0)]
- alternative hypothesis H1:  X ~ pX(x; H1)   [or fX(x; H1)]
- Choose a rejection region R; reject H0 iff data ∈ R
- Likelihood ratio test: reject H0 if
  pX(x; H1)/pX(x; H0) > ξ   or   fX(x; H1)/fX(x; H0) > ξ
- fix false rejection probability α (e.g., α = 0.05)
  choose ξ so that P(reject H0; H0) = α

Example (test on normal mean)

- n data points, i.i.d.
  H0: Xi ~ N(0, 1)
  H1: Xi ~ N(1, 1)
- Likelihood ratio test; rejection region:
  (1/√(2π))^n exp{ -Σi (Xi - 1)²/2 }  /  [ (1/√(2π))^n exp{ -Σi Xi²/2 } ]  >  ξ
- algebra: reject H0 if:  Σ(i=1..n) Xi > γ
- Find γ such that  P( Σ(i=1..n) Xi > γ ; H0 ) = α
  use normal tables

Example (test on normal variance)

- n data points, i.i.d.
  H0: Xi ~ N(0, 1)
  H1: Xi ~ N(0, 4)
- Likelihood ratio test; rejection region:
  (1/(2√(2π)))^n exp{ -Σi Xi²/(2·4) }  /  [ (1/√(2π))^n exp{ -Σi Xi²/2 } ]  >  ξ
- algebra: reject H0 if:  Σ(i=1..n) Xi² > γ
- Find γ such that  P( Σ(i=1..n) Xi² > γ ; H0 ) = α
  the distribution of Σi Xi² is known (derived distribution problem):
  chi-square distribution; tables are available
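For these two examples the thresholds γ come straight from tables: under H0, Σ Xi is N(0, n), so γ = √n · z with Φ(z) = 1 - α, while Σ Xi² is chi-square with n degrees of freedom. A sketch (simulated data, illustrative only):

import numpy as np
from scipy.stats import norm, chi2

n, alpha = 25, 0.05

# Test on the mean: under H0 the statistic sum(X_i) is N(0, n)
gamma_mean = np.sqrt(n) * norm.ppf(1 - alpha)
print(gamma_mean)   # reject H0 if the sum of the X_i exceeds this

# Test on the variance: under H0 the statistic sum(X_i^2) is chi-square with n degrees of freedom
gamma_var = chi2.ppf(1 - alpha, df=n)
print(gamma_var)    # reject H0 if the sum of squares exceeds this

# Apply to one hypothetical data set drawn under H0
rng = np.random.default_rng(6)
x = rng.normal(0, 1, n)
print(np.sum(x) > gamma_mean, np.sum(x**2) > gamma_var)   # usually (False, False)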


Composite hypotheses

- Got S = 472 heads in n = 1000 tosses; is the coin fair?
  H0: p = 1/2   versus   H1: p ≠ 1/2
- Pick a statistic (e.g., S)
- Pick shape of rejection region (e.g., |S - n/2| > ξ)
- Choose significance level (e.g., α = 0.05)
- Choose ξ so that:  P(reject H0; H0) = 0.05
- Using the CLT:  P(|S - 500| ≤ 31; H0) ≈ 0.95;  ξ = 31
- In our example: |S - 500| = 28 < ξ
  → H0 not rejected (at the 5% level)

Is my die fair?

- Hypothesis H0:  P(X = i) = pi = 1/6,  i = 1, . . . , 6
- Observed occurrences of i: Ni
- Choose form of rejection region; chi-square test:
  reject H0 if  T = Σi (Ni - npi)² / (npi)  >  γ
- Pick critical value γ so that:
  P(reject H0; H0) = α,  i.e.,  P(T > γ; H0) = 0.05
- Need the distribution of T: (CLT + derived distribution problem)
  for large n, T has approximately a chi-square distribution
  available in tables
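Both recipes can be reproduced numerically; the sketch below reworks the coin example (S = 472 heads in 1000 tosses) and applies the chi-square test to hypothetical die counts (the counts are made up for illustration).

import numpy as np
from scipy.stats import norm, chi2

# Coin: S = 472 heads in n = 1000 tosses, H0: p = 1/2
n, S = 1000, 472
# Under H0, S is approximately N(n/2, n/4); choose xi so that P(|S - n/2| <= xi) is about 0.95
xi = norm.ppf(0.975) * np.sqrt(n / 4)
print(xi, abs(S - n / 2) > xi)   # xi is about 31; |472 - 500| = 28 < 31, so H0 is not rejected

# Die: hypothetical observed counts N_i over 600 rolls, H0: p_i = 1/6
counts = np.array([95, 110, 98, 102, 93, 102])
expected = counts.sum() / 6
T = np.sum((counts - expected)**2 / expected)
gamma = chi2.ppf(0.95, df=5)     # 5 degrees of freedom for 6 categories
print(T, gamma, T > gamma)       # reject H0 only if T exceeds gamma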

Do I have the correct pdf?

- Partition the range into bins
- npi: expected incidence of bin i (from the pdf)
- Ni: observed incidence of bin i
- Use chi-square test (as in die problem)
- Kolmogorov-Smirnov test:
  form empirical CDF, F̂X, from data
  Dn = maxx |F̂X(x) - FX(x)|
  P(√n Dn ≥ 1.36) ≈ 0.05

What else is there?

- Systematic methods for coming up with shape of rejection regions
- Methods to estimate an unknown PDF
  (e.g., form a histogram and smooth it out)
- Efficient and recursive signal processing
- Methods to select between less or more complex models
  (e.g., identify relevant explanatory variables in regression models)
- Methods tailored to high-dimensional unknown parameter vectors and huge number of data points (data mining)
- etc. etc. . . .
  (http://www.itl.nist.gov/div898/handbook/)
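A sketch of the Kolmogorov-Smirnov recipe: build the empirical CDF from (simulated, illustrative) data and compare Dn against the 1.36/√n threshold; the hypothesized FX here is the standard normal CDF, an assumption made only for this example.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.sort(rng.normal(0, 1, 200))        # hypothetical data; H0: F_X is the standard normal CDF
n = len(x)

# Empirical CDF evaluated just after and just before each data point
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n
F0 = norm.cdf(x)                          # hypothesized CDF at the data points

Dn = max(np.max(np.abs(ecdf_hi - F0)), np.max(np.abs(ecdf_lo - F0)))
print(Dn, np.sqrt(n) * Dn >= 1.36)        # True means reject H0 (at roughly the 5% level)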


MIT OpenCourseWare
http://ocw.mit.edu

6.041 / 6.431 Probabilistic Systems Analysis and Applied Probability


Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
