
6.041 Probabilistic Systems Analysis
6.431 Applied Probability

Staff
Lecturer: John Tsitsiklis

Pick up and read course information handout
Turn in recitation and tutorial scheduling form
(last sheet of course information handout)

Coursework

Quiz 1 (October 12, 12:05-12:55pm): 17%
Quiz 2 (November 2, 7:30-9:30pm): 30%
Final exam (scheduled by registrar): 40%
Weekly homework (best 9 of 10): 10%
Attendance/participation/enthusiasm in recitations/tutorials: 3%

Pick up copy of slides

Collaboration policy described in course info handout

Text: Introduction to Probability, 2nd Edition,
D. P. Bertsekas and J. N. Tsitsiklis, Athena Scientific, 2008
Read the text!

LECTURE 1

Readings: Sections 1.1, 1.2

Lecture outline
  Probability as a mathematical framework for reasoning about uncertainty
  Probabilistic models
    sample space
    probability law
  Axioms of probability
  Simple examples

Sample space Ω
  List (set) of possible outcomes
  List must be:
    Mutually exclusive
    Collectively exhaustive
  Art: to be at the right granularity

Sample space: Discrete example
  Two rolls of a tetrahedral die
  Sample space vs. sequential description
  [Figure: 4-by-4 grid of outcomes (X = first roll, Y = second roll), from (1,1) to (4,4),
   and the equivalent sequential (tree) description]

Sample space: Continuous example
  Ω = {(x, y) | 0 ≤ x, y ≤ 1}

Probability axioms

  Event: a subset of the sample space
  Probability is assigned to events

  Axioms:
  1. Nonnegativity: P(A) ≥ 0
  2. Normalization: P(Ω) = 1
  3. Additivity: If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

  Consequence:
  P({s1, s2, . . . , sk}) = P({s1}) + · · · + P({sk}) = P(s1) + · · · + P(sk)

  Axiom 3 needs strengthening
  Do weird sets have probabilities?

Probability law: Example with finite sample space

  [Figure: 4-by-4 grid of outcomes; X = first roll, Y = second roll]

  Let every possible outcome have probability 1/16
  P((X, Y ) is (1,1) or (1,2)) =
  P({X = 1}) =
  P(X + Y is odd) =
  P(min(X, Y ) = 2) =
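The following is a small Python sketch (not part of the original slides) that enumerates the 16 equally likely outcomes of the two tetrahedral-die rolls and evaluates the events above under the discrete uniform law:

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes of two rolls of a tetrahedral die.
outcomes = list(product(range(1, 5), repeat=2))

def prob(event):
    """Probability of an event (a predicate on (x, y)) under the uniform law."""
    return Fraction(sum(1 for xy in outcomes if event(*xy)), len(outcomes))

print(prob(lambda x, y: (x, y) in {(1, 1), (1, 2)}))  # 1/8
print(prob(lambda x, y: x == 1))                      # 1/4
print(prob(lambda x, y: (x + y) % 2 == 1))            # 1/2
print(prob(lambda x, y: min(x, y) == 2))              # 5/16
```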

Discrete uniform law

  Let all outcomes be equally likely. Then,
  P(A) = (number of elements of A) / (total number of sample points)

  Computing probabilities ≡ counting
  Defines fair coins, fair dice, well-shuffled decks

Continuous uniform law

  Two random numbers in [0, 1]
  Uniform law: Probability = Area

  P(X + Y ≤ 1/2) = ?
  P( (X, Y ) = (0.5, 0.3) ) = ?

Probability law: Example with countably infinite sample space

  Sample space: {1, 2, . . .}
  We are given P(n) = 2^(−n), n = 1, 2, . . .
  Find P(outcome is even)

  [Figure: bar plot of P(n): 1/2, 1/4, 1/8, 1/16, . . .]

  P({2, 4, 6, . . .}) = P(2) + P(4) + · · ·
                      = 1/2^2 + 1/2^4 + 1/2^6 + · · · = 1/3

  Countable additivity axiom (needed for this calculation):
  If A1, A2, . . . are disjoint events, then:
  P(A1 ∪ A2 ∪ · · ·) = P(A1) + P(A2) + · · ·

Remember!
  Turn in recitation/tutorial scheduling form now
  Tutorials start next week

LECTURE 2

Readings: Sections 1.3-1.4

Lecture outline
  Review
  Conditional probability
  Three important tools:
    Multiplication rule
    Total probability theorem
    Bayes' rule

Review of probability models

  Sample space Ω
    Mutually exclusive
    Collectively exhaustive
    Right granularity
  Event: Subset of the sample space
  Allocation of probabilities to events
  1. P(A) ≥ 0
  2. P(Ω) = 1
  3. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
  3'. If A1, A2, . . . are disjoint events, then:
      P(A1 ∪ A2 ∪ · · ·) = P(A1) + P(A2) + · · ·

Problem solving:
  Specify sample space
  Define probability law
  Identify event of interest
  Calculate...

Conditional probability

  P(A | B) = probability of A, given that B occurred
  B is our new universe

  Definition: Assuming P(B) ≠ 0,
    P(A | B) = P(A ∩ B) / P(B)

  P(A | B) undefined if P(B) = 0

Die roll example

  [Figure: 4-by-4 grid of outcomes; X = first roll, Y = second roll]

  Let B be the event: min(X, Y ) = 2
  Let M = max(X, Y )
  P(M = 1 | B) =
  P(M = 2 | B) =

Multiplication rule

  P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B)

  [Figure: tree with branches A / Ac, then B / Bc given A, then C / Cc,
   leading to leaves such as A ∩ B ∩ C, A ∩ Bc ∩ C, A ∩ Bc ∩ Cc]

Models based on conditional probabilities

  Event A: Airplane is flying above
  Event B: Something registers on radar screen

  P(A) = 0.05        P(Ac) = 0.95
  P(B | A) = 0.99    P(Bc | A) = 0.01
  P(B | Ac) = 0.10   P(Bc | Ac) = 0.90

  P(A ∩ B) =
  P(B) =
  P(A | B) =
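A minimal Python sketch (not from the slides) working through the radar example with the multiplication rule, the total probability theorem, and Bayes' rule:

```python
from fractions import Fraction

# Radar example: A = airplane present, B = radar registers.
P_A = Fraction(5, 100)            # P(A) = 0.05
P_B_given_A = Fraction(99, 100)   # P(B | A) = 0.99
P_B_given_Ac = Fraction(10, 100)  # P(B | Ac) = 0.10

P_Ac = 1 - P_A
P_A_and_B = P_A * P_B_given_A                      # multiplication rule
P_B = P_A * P_B_given_A + P_Ac * P_B_given_Ac      # total probability theorem
P_A_given_B = P_A_and_B / P_B                      # Bayes' rule

print(P_A_and_B, P_B, float(P_A_given_B))          # 99/2000, 289/2000, ~0.343
```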

Total probability theorem

  Divide and conquer
  Partition of the sample space into A1, A2, A3
  Have P(B | Ai), for every i

  [Figure: sample space partitioned into A1, A2, A3, with event B intersecting all three]

  One way of computing P(B):
  P(B) = P(A1)P(B | A1) + P(A2)P(B | A2) + P(A3)P(B | A3)

Bayes' rule

  Prior probabilities P(Ai): initial beliefs
  We know P(B | Ai) for each i
  Wish to compute P(Ai | B): revise beliefs, given that B occurred

  P(Ai | B) = P(Ai ∩ B) / P(B)
            = P(Ai)P(B | Ai) / P(B)
            = P(Ai)P(B | Ai) / Σj P(Aj)P(B | Aj)

LECTURE 3

Readings: Section 1.5

Lecture outline
  Review
  Independence of two events
  Independence of a collection of events

Review

  P(A | B) = P(A ∩ B) / P(B),   assuming P(B) > 0

  Multiplication rule:
  P(A ∩ B) = P(B) P(A | B) = P(A) P(B | A)

  Total probability theorem:
  P(B) = P(A)P(B | A) + P(Ac)P(B | Ac)

  Bayes' rule:
  P(Ai | B) = P(Ai)P(B | Ai) / P(B)

Models based on conditional probabilities

  3 tosses of a biased coin:
  P(H) = p, P(T ) = 1 − p

  [Figure: tree with branches p / 1 − p at each toss,
   leaves HHH, HHT, HTH, HTT, THH, THT, TTH, TTT]

  P(T HT ) =
  P(1 head) =
  P(first toss is H | 1 head) =

Independence of two events

  Intuitive "definition": P(B | A) = P(B)
    occurrence of A provides no information about B's occurrence
  Recall that P(A ∩ B) = P(A) · P(B | A)

  Defn: P(A ∩ B) = P(A) · P(B)
    Symmetric with respect to A and B
    applies even if P(A) = 0
    implies P(A | B) = P(A)

Conditioning may affect independence

  Conditional independence, given C, is defined as independence
  under the probability law P(· | C)

  Assume A and B are independent
  If we are told that C occurred, are A and B independent?

Conditioning may affect independence (ctd.)

  Two unfair coins, A and B:
  P(H | coin A) = 0.9, P(H | coin B) = 0.1
  choose either coin with equal probability

  [Figure: tree; coin A or B is chosen with probability 0.5 each, followed by
   independent tosses with P(H) = 0.9 for coin A and P(H) = 0.1 for coin B]

  Once we know it is coin A, are tosses independent?
  If we do not know which coin it is, are tosses independent?
  Compare:
    P(toss 11 = H)
    P(toss 11 = H | first 10 tosses are heads)

Independence of a collection of events

  Intuitive definition:
  Information on some of the events tells us nothing about
  probabilities related to the remaining events
  E.g.: P(A1 ∩ (Ac2 ∪ A3) | A5 ∩ Ac6) = P(A1 ∩ (Ac2 ∪ A3))

  Mathematical definition:
  Events A1, A2, . . . , An are called independent if:
  P(Ai ∩ Aj ∩ · · · ∩ Aq) = P(Ai)P(Aj) · · · P(Aq)
  for any distinct indices i, j, . . . , q (chosen from {1, . . . , n})

Independence vs. pairwise independence

  Two independent fair coin tosses
  A: First toss is H
  B: Second toss is H
  P(A) = P(B) = 1/2

  [Figure: four equally likely outcomes HH, HT, TH, TT]

  C: First and second toss give same result
  P(C) =
  P(C ∩ A) =
  P(A ∩ B ∩ C) =
  P(C | A ∩ B) =

  Pairwise independence does not imply independence

The king's sibling

  The king comes from a family of two children.
  What is the probability that his sibling is female?
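A short Python sketch (not from the slides) that enumerates the four equally likely outcomes and checks that A, B, C above are pairwise independent but not jointly independent:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))   # HH, HT, TH, TT, equally likely

def P(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] == "H"        # first toss is H
B = lambda w: w[1] == "H"        # second toss is H
C = lambda w: w[0] == w[1]       # both tosses give the same result

# Pairwise independence holds:
assert P(lambda w: A(w) and C(w)) == P(A) * P(C)   # 1/4 = 1/2 * 1/2
assert P(lambda w: B(w) and C(w)) == P(B) * P(C)
# ...but joint independence fails:
assert P(lambda w: A(w) and B(w) and C(w)) != P(A) * P(B) * P(C)   # 1/4 vs 1/8
```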

LECTURE 4

Readings: Section 1.6

Lecture outline
  Principles of counting
  Many examples
    permutations
    k-permutations
    combinations
    partitions
  Binomial probabilities

Discrete uniform law

  Let all sample points be equally likely. Then,
  P(A) = (number of elements of A) / (total number of sample points) = |A| / |Ω|

  Just count. . .

Basic counting principle

  r stages
  ni choices at stage i
  Number of choices is: n1 n2 · · · nr

  Number of license plates with 3 letters and 4 digits =
  . . . if repetition is prohibited =
  Permutations: Number of ways of ordering n elements is:
  Number of subsets of {1, . . . , n} =

Example

  Probability that six rolls of a six-sided die all give different numbers?
  Number of outcomes that make the event happen:
  Number of elements in the sample space:
  Answer:

Combinations

  (n choose k): number of k-element subsets of a given n-element set

  Two ways of constructing an ordered sequence of k distinct items:
    Choose the k items one at a time:
      n(n − 1) · · · (n − k + 1) = n!/(n − k)!  choices
    Choose k items, then order them (k! possible orders)

  Hence:
    (n choose k) · k! = n!/(n − k)!
    (n choose k) = n! / (k!(n − k)!)

Binomial probabilities

  n independent coin tosses, P(H) = p

  P(HTTHHH) =
  P(sequence) = p^(# heads) (1 − p)^(# tails)

  P(k heads) = Σ over k-head sequences of P(seq.)
             = (# of k-head seqs.) · p^k (1 − p)^(n−k)
             = (n choose k) p^k (1 − p)^(n−k)

  Also: Σ from k=0 to n of (n choose k) p^k (1 − p)^(n−k) = 1
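A small Python sketch (not from the slides) of the binomial probability formula, including a numerical check that the PMF sums to 1:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k heads in n independent tosses with P(H) = p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
print(binomial_pmf(3, n, p))                               # P(3 heads)
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))    # 1.0
```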

Coin tossing problem

  Event B: 3 out of 10 tosses were heads.
  Given that B occurred, what is the (conditional) probability
  that the first 2 tosses were heads?

  All outcomes in set B are equally likely: probability p^3 (1 − p)^7
  Conditional probability law (on B) is uniform
  Number of outcomes in B:
  Out of the outcomes in B, how many start with HH?

Partitions

  52-card deck, dealt to 4 players
  Find P(each gets an ace)

  Outcome: a partition of the 52 cards
  Number of outcomes: 52! / (13! 13! 13! 13!)

  Count number of ways of distributing the four aces: 4 · 3 · 2
  Count number of ways of dealing the remaining 48 cards:
  48! / (12! 12! 12! 12!)

  Answer:
    4 · 3 · 2 · [48! / (12! 12! 12! 12!)] / [52! / (13! 13! 13! 13!)]
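A Python sketch (not from the slides) that evaluates the partition-counting answer exactly:

```python
from fractions import Fraction
from math import factorial

deals = Fraction(factorial(52), factorial(13)**4)                   # all ways to deal 4 hands of 13
favorable = 4 * 3 * 2 * Fraction(factorial(48), factorial(12)**4)   # place the 4 aces, deal the rest

p_each_gets_an_ace = favorable / deals
print(p_each_gets_an_ace, float(p_each_gets_an_ace))                # ~0.1055
```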

LECTURE 5

Readings: Sections 2.1-2.3, start 2.4

Lecture outline
  Random variables
  Probability mass function (PMF)
  Expectation
  Variance

Random variables

  An assignment of a value (number) to every possible outcome
  Mathematically: A function from the sample space Ω to the real numbers
  discrete or continuous values
  Can have several random variables defined on the same sample space
  Notation:
    random variable X
    numerical value x

Probability mass function (PMF)
  (probability law, probability distribution of X)

  Notation:
  pX (x) = P(X = x) = P({ω ∈ Ω s.t. X(ω) = x})
  pX (x) ≥ 0,   Σx pX (x) = 1

How to compute a PMF pX (x)
  collect all possible outcomes for which X is equal to x
  add their probabilities
  repeat for all x

  Example: Two independent rolls of a fair tetrahedral die
  F : outcome of first throw
  S: outcome of second throw
  X = min(F, S)

  [Figure: 4-by-4 grid of (F, S) outcomes]

  pX (2) =

Example: X = number of coin tosses until first head
  assume independent tosses, P(H) = p > 0

  pX (k) = P(X = k)
         = P(T T · · · T H)
         = (1 − p)^(k−1) p,   k = 1, 2, . . .

  geometric PMF

Binomial PMF

  X: number of heads in n independent coin tosses
  P(H) = p

  Let n = 4:
  pX (2) = P(HHTT ) + P(HTHT ) + P(HTTH) + P(THHT ) + P(THTH) + P(TTHH)
         = 6 p^2 (1 − p)^2
         = (4 choose 2) p^2 (1 − p)^2

  In general:
  pX (k) = (n choose k) p^k (1 − p)^(n−k),   k = 0, 1, . . . , n

Expectation

  Definition: E[X] = Σx x pX (x)

  Interpretations:
    Center of gravity of PMF
    Average in large number of repetitions of the experiment
    (to be substantiated later in this course)

  Example: Uniform on 0, 1, . . . , n
  pX (x) = 1/(n + 1) for x = 0, 1, . . . , n

  E[X] = 0 · 1/(n+1) + 1 · 1/(n+1) + · · · + n · 1/(n+1) = n/2

Properties of expectations

  Let X be a r.v. and let Y = g(X)
    Hard: E[Y ] = Σy y pY (y)
    Easy: E[Y ] = E[g(X)] = Σx g(x) pX (x)

  Caution: In general, E[g(X)] ≠ g(E[X])

  Second moment: E[X^2] = Σx x^2 pX (x)

  If α, β are constants, then:
    E[α] =
    E[αX] =
    E[αX + β] =

Variance

  var(X) = E[(X − E[X])^2]
         = Σx (x − E[X])^2 pX (x)
         = E[X^2] − (E[X])^2

  Properties:
    var(X) ≥ 0
    var(αX + β) = α^2 var(X)
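A short Python sketch (not from the slides) that computes the mean and variance of a discrete PMF, and checks the identity var(X) = E[X^2] − (E[X])^2 on the uniform example above:

```python
from fractions import Fraction

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

n = 4
pmf = {x: Fraction(1, n + 1) for x in range(n + 1)}   # uniform on {0, ..., n}

second_moment = sum(x**2 * p for x, p in pmf.items())
print(mean(pmf))                        # n/2 = 2
print(variance(pmf))                    # 2
print(second_moment - mean(pmf) ** 2)   # same value, via E[X^2] - (E[X])^2
```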

LECTURE 6

Readings: Sections 2.4-2.6

Lecture outline
  Review: PMF, expectation, variance
  Conditional PMF
  Geometric PMF
  Total expectation theorem
  Joint PMF of two random variables

Review

  Random variable X: function from sample space to the real numbers
  PMF (for discrete random variables): pX (x) = P(X = x)

  Expectation:
    E[X] = Σx x pX (x)
    E[g(X)] = Σx g(x) pX (x)
    E[αX + β] = α E[X] + β
    E[X − E[X]] = 0

  var(X) = E[(X − E[X])^2] = Σx (x − E[X])^2 pX (x) = E[X^2] − (E[X])^2
  Standard deviation: σX = √var(X)

Average speed vs. average time

  Traverse a 200-mile distance at constant but random speed V

  [Figure: pV (v) puts probability 1/2 at v = 1 and 1/2 at v = 200]

  d = 200,   time in hours = T = t(V ) = 200/V

  E[T ] = E[t(V )] = Σv t(v) pV (v) =
  E[V ] =
  var(V ) =
  σV =

  E[T V ] = 200 ≠ E[T ] · E[V ]
  E[200/V ] = E[T ] ≠ 200/E[V ]

Conditional PMF and expectation

  pX|A(x) = P(X = x | A)
  E[X | A] = Σx x pX|A(x)

  [Figure: a PMF pX (x) and its conditional version given an event A]

Geometric PMF

  X: number of independent coin tosses until first head

  pX (k) = (1 − p)^(k−1) p,   k = 1, 2, . . .

  E[X] = Σ from k=1 to ∞ of k pX (k) = Σ from k=1 to ∞ of k (1 − p)^(k−1) p

  Memoryless property: Given that X > 2,
  the r.v. X − 2 has the same geometric PMF

  [Figure: pX (k), pX|X>2(k), and pX−2|X>2(k) have the same shape]

  Let A = {X ≥ 2}
  pX|A(x) =
  E[X | A] =

Total expectation theorem

  Partition of the sample space into disjoint events A1, A2, . . . , An

  P(B) = P(A1)P(B | A1) + · · · + P(An)P(B | An)
  pX (x) = P(A1)pX|A1(x) + · · · + P(An)pX|An(x)
  E[X] = P(A1)E[X | A1] + · · · + P(An)E[X | An]

  Geometric example:
  A1: {X = 1}, A2: {X > 1}
  E[X] = P(X = 1)E[X | X = 1] + P(X > 1)E[X | X > 1]
  Solve to get E[X] = 1/p

Joint PMFs

  pX,Y (x, y) = P(X = x and Y = y)

  [Figure: table of joint PMF values on a grid, with entries such as 1/20, 2/20, 3/20, 4/20]

  Σx Σy pX,Y (x, y) = 1

  pX (x) = Σy pX,Y (x, y)

  pX|Y (x | y) = P(X = x | Y = y) = pX,Y (x, y) / pY (y)
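A Python sketch of marginalization and conditioning for a joint PMF. The table below is hypothetical (the exact entries of the slide's 1/20 table are not recoverable from the extraction):

```python
from fractions import Fraction

F = Fraction
# Hypothetical joint PMF p_{X,Y}(x, y), stored as (x, y) -> probability.
p_XY = {(1, 1): F(1, 8), (1, 2): F(1, 8),
        (2, 1): F(1, 4), (2, 2): F(1, 8),
        (3, 1): F(1, 8), (3, 2): F(1, 4)}

# Marginal of X: pX(x) = sum over y of pXY(x, y)
p_X = {}
for (x, y), p in p_XY.items():
    p_X[x] = p_X.get(x, 0) + p

# Conditional PMF of X given Y = 1: pX|Y(x | 1) = pXY(x, 1) / pY(1)
p_Y1 = sum(p for (x, y), p in p_XY.items() if y == 1)
p_X_given_Y1 = {x: p / p_Y1 for (x, y), p in p_XY.items() if y == 1}

print(p_X)            # {1: 1/4, 2: 3/8, 3: 3/8}
print(p_X_given_Y1)   # {1: 1/4, 2: 1/2, 3: 1/4}
```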

LECTURE 7

Readings: Finish Chapter 2

Lecture outline
  Multiple random variables
    Joint PMF
    Conditioning
    Independence
  More on expectations
  Binomial distribution revisited
  A hat problem

Review

  pX (x) = P(X = x)
  pX,Y (x, y) = P(X = x, Y = y)
  pX|Y (x | y) = P(X = x | Y = y)
  pX (x) = Σy pX,Y (x, y)
  pX,Y (x, y) = pX (x) pY|X (y | x)
  pX,Y,Z (x, y, z) = pX (x) pY|X (y | x) pZ|X,Y (z | x, y)

Independent random variables

  Random variables X, Y , Z are independent if:
  pX,Y,Z (x, y, z) = pX (x) pY (y) pZ (z)   for all x, y, z

  [Figure: joint PMF table with entries 1/20, 2/20, 3/20, 4/20 on a grid]

  Independent?
  What if we condition on X ≤ 2 and Y ≥ 3?

Expectations

  E[X] = Σx x pX (x)
  E[g(X, Y )] = Σx Σy g(x, y) pX,Y (x, y)

  In general: E[g(X, Y )] ≠ g(E[X], E[Y ])

  E[αX + β] = α E[X] + β
  E[X + Y + Z] = E[X] + E[Y ] + E[Z]

  If X, Y are independent:
    E[XY ] = E[X] E[Y ]
    E[g(X) h(Y )] = E[g(X)] E[h(Y )]

Variances

  var(aX) = a^2 var(X)
  var(X + a) = var(X)

  Let Z = X + Y . If X, Y are independent:
    var(X + Y ) = var(X) + var(Y )

  Examples:
    If X = Y : var(X + Y ) =
    If X = −Y : var(X + Y ) =
    If X, Y indep., and Z = X − 3Y : var(Z) =

Binomial mean and variance

  X = # of successes in n independent trials
  probability of success p

  E[X] = Σ from k=0 to n of k (n choose k) p^k (1 − p)^(n−k)

  Xi = 1 if success in trial i, 0 otherwise

  E[Xi] =
  var(Xi) =
  E[X] =
  var(X) =

The hat problem

  n people throw their hats in a box and then pick one at random.
  X: number of people who get their own hat
  Find E[X]

  Xi = 1 if i selects own hat, 0 otherwise
  X = X1 + X2 + · · · + Xn

  P(Xi = 1) =
  E[Xi] =
  Are the Xi independent?
  E[X] =

Variance in the hat problem

  var(X) = E[X^2] − (E[X])^2 = E[X^2] − 1

  X^2 = Σi Xi^2 + Σ over i, j with i ≠ j of Xi Xj

  E[Xi^2] =
  P(X1 X2 = 1) = P(X1 = 1) P(X2 = 1 | X1 = 1)
  E[X^2] =
  var(X) =
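A Monte Carlo sketch (not from the slides) for the hat problem; it estimates E[X] and var(X), both of which equal 1:

```python
import random

def hat_sample(n):
    """One experiment: n people pick hats uniformly at random (a random permutation)."""
    hats = list(range(n))
    random.shuffle(hats)
    return sum(1 for i, h in enumerate(hats) if i == h)   # number of own-hat matches

n, trials = 10, 200_000
samples = [hat_sample(n) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(mean, var)    # both close to 1, matching E[X] = 1 and var(X) = 1
```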

LECTURE 8

Readings: Sections 3.1-3.3

Lecture outline
  Probability density functions
  Cumulative distribution functions
  Normal random variables

Continuous r.v.'s and PDFs

  A continuous r.v. is described by a probability density function fX

  [Figure: a density fX (x) over the sample space, with the event {a < X < b} shaded]

  P(a ≤ X ≤ b) = ∫_a^b fX (x) dx
  P(X ∈ B) = ∫_B fX (x) dx,   for "nice" sets B
  ∫_{−∞}^{∞} fX (x) dx = 1
  P(x ≤ X ≤ x + δ) = ∫_x^{x+δ} fX (s) ds ≈ fX (x) · δ

Means and variances

  E[X] = ∫ x fX (x) dx
  E[g(X)] = ∫ g(x) fX (x) dx
  var(X) = σX^2 = ∫ (x − E[X])^2 fX (x) dx

Continuous uniform r.v.

  fX (x) = 1/(b − a),   a ≤ x ≤ b

  E[X] = (a + b)/2
  σX^2 = ∫_a^b (x − (a + b)/2)^2 · 1/(b − a) dx = (b − a)^2 / 12

Cumulative distribution function (CDF)

  FX (x) = P(X ≤ x) = ∫_{−∞}^x fX (t) dt

  Also for discrete r.v.'s:
  FX (x) = P(X ≤ x) = Σ over k ≤ x of pX (k)

  [Figure: a PDF and its CDF; a discrete PMF with values 1/6, 2/6, 3/6 and its staircase CDF]

Mixed distributions

  Schematic drawing of a combination of a PDF and a PMF

  [Figure: PDF with total area 1/2 plus a point mass of 1/2; the corresponding CDF
   FX (x) = P(X ≤ x) rises continuously and jumps, passing through 1/4, 1/2, 3/4, 1]

Gaussian (normal) PDF

  Standard normal N (0, 1): fX (x) = (1/√(2π)) e^(−x²/2)

  [Figure: standard normal PDF fX (x) and CDF FX (x), with FX (0) = 0.5]

  E[X] = 0,   var(X) = 1

  General normal N (μ, σ²):
  fX (x) = (1/(σ√(2π))) e^(−(x−μ)²/2σ²)

  It turns out that: E[X] = μ and var(X) = σ².

  Let Y = aX + b
  Then: E[Y ] =
        var(Y ) =
  Fact: Y ∼ N (aμ + b, a²σ²)

The constellation of concepts

  [Diagram linking pX (x) / fX (x), FX (x), and E[X], var(X)]

Calculating normal probabilities

  No closed form available for the CDF, but there are tables (for the standard normal)

  If X ∼ N (μ, σ²), then (X − μ)/σ ∼ N (0, 1)

  If X ∼ N (2, 16):
  P(X ≤ 3) = P( (X − 2)/4 ≤ (3 − 2)/4 ) = CDF(0.25)
  [Table: standard normal CDF Φ(z) for z from 0.00 to 2.89 in steps of 0.01
   (e.g., Φ(0.25) = .5987, Φ(0.50) = .6915, Φ(1.00) = .8413, Φ(2.00) = .9772)]
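In code, the standard normal CDF can be written with the error function from the Python standard library; this sketch (not from the slides) reproduces the N(2, 16) example above:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), expressed via the standard error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# If X ~ N(2, 16), i.e. mu = 2 and sigma = 4:
print(normal_cdf(3, mu=2, sigma=4))    # Phi(0.25) ≈ 0.5987
```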

Summary of concepts

  pX (x)            fX (x)
  FX (x)            E[X], var(X)
  pX,Y (x, y)       fX,Y (x, y)
  pX|A(x)           fX|A(x)
  pX|Y (x | y)      fX|Y (x | y)

LECTURE 9

Readings: Sections 3.4-3.5

Outline
  PDF review
  Multiple random variables
    conditioning
    independence
  Examples

Continuous r.v.'s and PDFs (review)

  P(a ≤ X ≤ b) = ∫_a^b fX (x) dx
  P(x ≤ X ≤ x + δ) ≈ fX (x) · δ
  E[g(X)] = ∫ g(x) fX (x) dx

Joint PDF fX,Y (x, y)

  P((X, Y ) ∈ S) = ∫∫_S fX,Y (x, y) dx dy

  Interpretation:
  P(x ≤ X ≤ x + δ, y ≤ Y ≤ y + δ) ≈ fX,Y (x, y) · δ²

  Expectations:
  E[g(X, Y )] = ∫∫ g(x, y) fX,Y (x, y) dx dy

  From the joint to the marginal:
  fX (x) · δ ≈ P(x ≤ X ≤ x + δ),  so  fX (x) = ∫ fX,Y (x, y) dy

  X and Y are called independent if
  fX,Y (x, y) = fX (x) fY (y),   for all x, y

Buffon's needle

  Parallel lines at distance d
  Needle of length ℓ (assume ℓ < d)
  Find P(needle intersects one of the lines)

  X ∈ [0, d/2]: distance of needle midpoint to nearest line
  Θ ∈ [0, π/2]: acute angle between the needle and the lines
  Model: X, Θ uniform, independent

  fX,Θ(x, θ) = 4/(πd),   0 ≤ x ≤ d/2, 0 ≤ θ ≤ π/2

  Intersect if X ≤ (ℓ/2) sin Θ

  P(X ≤ (ℓ/2) sin Θ) = ∫∫ fX (x) fΘ(θ) dx dθ
    = (4/(πd)) ∫_0^{π/2} ∫_0^{(ℓ/2) sin θ} dx dθ
    = (4/(πd)) ∫_0^{π/2} (ℓ/2) sin θ dθ
    = 2ℓ/(πd)
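A Monte Carlo sketch (not from the slides) of Buffon's needle under the same model; the empirical frequency approaches 2ℓ/(πd):

```python
import random
from math import pi, sin

def buffon_estimate(d=2.0, l=1.0, trials=1_000_000):
    """Estimate P(needle of length l crosses a line), lines d apart (l < d)."""
    hits = 0
    for _ in range(trials):
        x = random.uniform(0.0, d / 2.0)         # midpoint distance to nearest line
        theta = random.uniform(0.0, pi / 2.0)    # acute angle with the lines
        if x <= (l / 2.0) * sin(theta):
            hits += 1
    return hits / trials

print(buffon_estimate())           # close to 2*l/(pi*d) = 1/pi ≈ 0.3183
print(2 * 1.0 / (pi * 2.0))
```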

Conditioning

  Recall: P(x ≤ X ≤ x + δ) ≈ fX (x) · δ

  By analogy, would like:
  P(x ≤ X ≤ x + δ | Y ≈ y) ≈ fX|Y (x | y) · δ

  This leads us to the definition:
  fX|Y (x | y) = fX,Y (x, y) / fY (y),   if fY (y) > 0

  For given y, the conditional PDF is a (normalized) section of the joint PDF
  If independent, fX,Y = fX fY , and we obtain fX|Y (x | y) = fX (x)

  [Figure: joint, marginal and conditional densities. A slice through the density surface
   for fixed x has area equal to the marginal density at x; renormalizing slices gives the
   conditional densities of Y given X = x.
   Image by MIT OpenCourseWare, adapted from Probability, by J. Pitman, 1999.]

Stick-breaking example

  Break a stick of length ℓ twice:
  break at X: uniform in [0, ℓ];
  break again at Y : uniform in [0, X]

  fX,Y (x, y) = fX (x) fY|X (y | x) = (1/ℓ)(1/x),   on the set 0 ≤ y ≤ x ≤ ℓ

  [Figure: fX (x), fY|X (y | x), and the triangular support of the joint PDF]

  fY (y) = ∫ fX,Y (x, y) dx = ∫_y^ℓ 1/(ℓx) dx = (1/ℓ) log(ℓ/y),   0 ≤ y ≤ ℓ

  E[Y | X = x] = ∫ y fY|X (y | x) dy = x/2

  E[Y ] = ∫ y fY (y) dy = ∫_0^ℓ (y/ℓ) log(ℓ/y) dy = ℓ/4

LECTURE 10

Continuous Bayes' rule; derived distributions

Readings: Section 3.6; start Section 4.1

Review

  pX|Y (x | y) = pX,Y (x, y) / pY (y)
  pX (x) = Σy pX,Y (x, y)

  fX|Y (x | y) = fX,Y (x, y) / fY (y)
  fX (x) = ∫ fX,Y (x, y) dy
  FX (x) = P(X ≤ x)

The Bayes variations

  Discrete X, discrete Y :
    pX|Y (x | y) = pX (x) pY|X (y | x) / pY (y),
    pY (y) = Σx pX (x) pY|X (y | x)
  Example:
    X = 1, 0: airplane present/not present
    Y = 1, 0: something did/did not register on radar

  Continuous counterpart (continuous X, continuous Y ):
    fX|Y (x | y) = fX (x) fY|X (y | x) / fY (y),
    fY (y) = ∫ fX (x) fY|X (y | x) dx
  Example: X: some signal; prior fX (x)
    Y : noisy version of X
    fY|X (y | x): model of the noise

  Discrete X, continuous Y :
    pX|Y (x | y) = pX (x) fY|X (y | x) / fY (y),
    fY (y) = Σx pX (x) fY|X (y | x)
  Example:
    X: a discrete signal; prior pX (x)
    Y : noisy version of X
    fY|X (y | x): continuous noise model

  Continuous X, discrete Y :
    fX|Y (x | y) = fX (x) pY|X (y | x) / pY (y),
    pY (y) = ∫ fX (x) pY|X (y | x) dx
  Example:
    X: a continuous signal; prior fX (x) (e.g., intensity of a light beam);
    Y : discrete r.v. affected by X (e.g., photon count)
    pY|X (y | x): model of the discrete r.v.

What is a derived distribution

  It is a PMF or PDF of a function of one or more random variables
  with known probability law. E.g.:
  obtaining the PDF for g(X, Y ) = Y /X involves deriving a distribution.
  Note: g(X, Y ) is a random variable

When not to find them

  Don't need the PDF of g(X, Y ) if we only want to compute its expected value:

  E[g(X, Y )] = ∫∫ g(x, y) fX,Y (x, y) dx dy

How to find them

  Discrete case:
    Obtain the probability mass for each possible value of Y = g(X):
    pY (y) = P(g(X) = y) = Σ over x with g(x) = y of pX (x)

  The continuous case, a two-step procedure:
    Get the CDF of Y : FY (y) = P(Y ≤ y)
    Differentiate to get fY (y) = dFY/dy (y)

Example

  X: uniform on [0, 2]
  Find the PDF of Y = X³
  Solution:
  FY (y) = P(Y ≤ y) = P(X³ ≤ y) = P(X ≤ y^(1/3)) = (1/2) y^(1/3)
  fY (y) = dFY/dy (y) = 1/(6 y^(2/3)),   0 ≤ y ≤ 8

Example

  Joan is driving from Boston to New York.
  Her speed is uniformly distributed between 30 and 60 mph.
  What is the distribution of the duration of the trip?

  [Figure: fV (v) = 1/30 on [30, 60]]

  Let T (V ) = 200/V . Find fT (t)

The PDF of Y = aX + b

  [Figure: fX , faX , faX+b — the density is scaled by 1/|a| and shifted by b]

  fY (y) = (1/|a|) fX ((y − b)/a)

  Use this to check that if X is normal, then Y = aX + b is also normal.

LECTURE 11

Derived distributions; convolution; covariance and correlation

Readings: Finish Section 4.1; Section 4.2

A general formula

  Let Y = g(X), with g strictly monotonic.

  The event x ≤ X ≤ x + δ is the same as g(x) ≤ Y ≤ g(x + δ),
  or (approximately) g(x) ≤ Y ≤ g(x) + δ |(dg/dx)(x)|

  Hence,
  fY (y) · |(dg/dx)(x)| = fX (x),   where y = g(x)

  [Figure: the map y = g(x); an interval [x, x + δ] maps to an interval of length ≈ δ |dg/dx|]

Example

  [Figure: fX,Y (x, y) = 1 on the unit square]

  Find the PDF of Z = g(X, Y ) = Y /X
  FZ (z) = P(Y /X ≤ z),   computed separately for z ≤ 1 and z ≥ 1

The distribution of X + Y

  Discrete case: W = X + Y ; X, Y independent

  pW (w) = P(X + Y = w)
         = Σx P(X = x) P(Y = w − x)
         = Σx pX (x) pY (w − x)

  Mechanics (convolution):
    Put the PMFs on top of each other
    Flip the PMF of Y
    Shift the flipped PMF by w (to the right if w > 0)
    Cross-multiply and add

  [Figure: lattice points (0,3), (1,2), (2,1), (3,0) on the line x + y = w]

The continuous case

  W = X + Y ; X, Y independent

  fW|X (w | x) = fY (w − x)
  fW,X (w, x) = fX (x) fW|X (w | x) = fX (x) fY (w − x)

  fW (w) = ∫ fX (x) fY (w − x) dx
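A short Python sketch (not from the slides) of the discrete convolution mechanics, applied to the sum of two fair dice:

```python
from fractions import Fraction

def convolve_pmfs(p_X, p_Y):
    """PMF of W = X + Y for independent discrete X, Y, given as dicts value -> probability."""
    p_W = {}
    for x, px in p_X.items():
        for y, py in p_Y.items():
            p_W[x + y] = p_W.get(x + y, 0) + px * py
    return p_W

die = {k: Fraction(1, 6) for k in range(1, 7)}
print(convolve_pmfs(die, die))   # sum of two dice: 2 -> 1/36, ..., 7 -> 6/36, ..., 12 -> 1/36
```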

The sum of independent normal r.v.'s

  X ∼ N (0, σx²), Y ∼ N (0, σy²), independent
  Let W = X + Y

  fW (w) = ∫ fX (x) fY (w − x) dx
         = ∫ (1/(2π σx σy)) e^(−x²/2σx²) e^(−(w−x)²/2σy²) dx
         = (algebra) = c e^(−w²/2(σx²+σy²))

  Conclusion: W is normal, with mean 0 and variance σx² + σy²
  (same argument for the nonzero-mean case)

Two independent normal r.v.'s

  X ∼ N (μx, σx²), Y ∼ N (μy, σy²), independent

  fX,Y (x, y) = fX (x) fY (y)
    = (1/(2π σx σy)) exp{ −(x − μx)²/2σx² − (y − μy)²/2σy² }

  The PDF is constant on the ellipse where
  (x − μx)²/2σx² + (y − μy)²/2σy² is constant

  The ellipse is a circle when σx = σy

  [Figure: scatter plots of samples from a two-dimensional normal distribution]

Covariance

  cov(X, Y ) = E[(X − E[X]) (Y − E[Y ])]

  Zero-mean case: cov(X, Y ) = E[XY ]
  In general: cov(X, Y ) = E[XY ] − E[X]E[Y ]

  independent ⇒ cov(X, Y ) = 0   (converse is not true)

  var(Σ from i=1 to n of Xi) = Σ from i=1 to n of var(Xi) + Σ over (i, j) with i ≠ j of cov(Xi, Xj)

Correlation coefficient

  Dimensionless version of covariance:

  ρ = E[ (X − E[X])(Y − E[Y ]) / (σX σY ) ] = cov(X, Y ) / (σX σY )

  −1 ≤ ρ ≤ 1
  |ρ| = 1  ⇔  (X − E[X]) = c(Y − E[Y ])   (linearly related)
  Independent ⇒ ρ = 0   (converse is not true)

LECTURE 12

Readings: Section 4.3; parts of Section 4.5 (mean and variance only; no transforms)

Lecture outline
  Conditional expectation
  Law of iterated expectations
  Law of total variance
  Sum of a random number of independent r.v.'s

Conditional expectations

  Given the value y of a r.v. Y :
  E[X | Y = y] = Σx x pX|Y (x | y)   (integral in the continuous case)

  Stick example: stick of length ℓ;
  break at uniformly chosen point Y ;
  break again at uniformly chosen point X
  E[X | Y = y] = y/2

  E[X | Y = y] is a number; E[X | Y ] = Y /2 is a random variable
  (it has a mean and a variance)

  Law of iterated expectations:
  E[E[X | Y ]] = Σy E[X | Y = y] pY (y) = E[X]

  In the stick example:
  E[X] = E[E[X | Y ]] = E[Y /2] = ℓ/4

var(X | Y ) and its expectation

  var(X | Y = y) = E[(X − E[X | Y = y])² | Y = y]

  var(X | Y ): a r.v. with value var(X | Y = y) when Y = y

  Law of total variance:
  var(X) = E[var(X | Y )] + var(E[X | Y ])

  Proof:
  (a) Recall: var(X) = E[X²] − (E[X])²
  (b) var(X | Y ) = E[X² | Y ] − (E[X | Y ])²
  (c) E[var(X | Y )] = E[X²] − E[(E[X | Y ])²]
  (d) var(E[X | Y ]) = E[(E[X | Y ])²] − (E[X])²
  Sum of the right-hand sides of (c), (d): E[X²] − (E[X])² = var(X)

Section means and variances

  Two sections:
  y = 1 (10 students); y = 2 (20 students)

  y = 1:  (1/10) Σ from i=1 to 10 of xi = 90
  y = 2:  (1/20) Σ from i=11 to 30 of xi = 60

  Overall average: (1/30) Σ from i=1 to 30 of xi = (90·10 + 60·20)/30 = 70

  E[X | Y = 1] = 90,   E[X | Y = 2] = 60
  E[X | Y ] = 90 w.p. 1/3, 60 w.p. 2/3

  E[E[X | Y ]] = (1/3)·90 + (2/3)·60 = 70 = E[X]

  var(E[X | Y ]) = (1/3)(90 − 70)² + (2/3)(60 − 70)² = 600/3 = 200

Section means and variances (ctd.)

  (1/10) Σ from i=1 to 10 of (xi − 90)² = 10,  so  var(X | Y = 1) = 10
  (1/20) Σ from i=11 to 30 of (xi − 60)² = 20,  so  var(X | Y = 2) = 20

  var(X | Y ) = 10 w.p. 1/3, 20 w.p. 2/3
  E[var(X | Y )] = (1/3)·10 + (2/3)·20 = 50/3

  var(X) = E[var(X | Y )] + var(E[X | Y ]) = 50/3 + 200
         = (average variability within sections) + (variability between sections)

Example

  [Figure: a mixture PDF fX (x), with a piece of height 2/3 (for Y = 1) and a piece of height 1/3 (for Y = 2)]

  E[X | Y = 1] =        var(X | Y = 1) =
  E[X | Y = 2] =        var(X | Y = 2) =
  E[X] =                var(E[X | Y ]) =
  var(X) = E[var(X | Y )] + var(E[X | Y ])

Sum of a random number of independent r.v.'s

  N : number of stores visited (N is a nonnegative integer r.v.)
  Xi: money spent in store i
    Xi assumed i.i.d., independent of N

  Let Y = X1 + · · · + XN

  E[Y | N = n] = E[X1 + X2 + · · · + Xn | N = n]
               = E[X1 + X2 + · · · + Xn]
               = E[X1] + E[X2] + · · · + E[Xn]
               = n E[X]

  E[Y | N ] = N E[X]

  E[Y ] = E[E[Y | N ]] = E[N E[X]] = E[N ] E[X]

Variance of a sum of a random number of independent r.v.'s

  var(Y ) = E[var(Y | N )] + var(E[Y | N ])

  E[Y | N ] = N E[X]  ⇒  var(E[Y | N ]) = (E[X])² var(N )
  var(Y | N = n) = n var(X)  ⇒  var(Y | N ) = N var(X)  ⇒  E[var(Y | N )] = E[N ] var(X)

  var(Y ) = E[N ] var(X) + (E[X])² var(N )
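A Monte Carlo sketch (not from the slides, with an arbitrary choice of distributions for N and Xi) that checks the two formulas above:

```python
import random

# Y = X1 + ... + XN with N ~ Uniform{0,...,4} and Xi ~ Uniform[0, 10], all independent.
def sample_Y():
    n = random.randint(0, 4)
    return sum(random.uniform(0.0, 10.0) for _ in range(n))

trials = 500_000
ys = [sample_Y() for _ in range(trials)]
mean_Y = sum(ys) / trials
var_Y = sum((y - mean_Y) ** 2 for y in ys) / trials

E_N, var_N = 2.0, 2.0            # Uniform{0,...,4}: mean 2, variance 2
E_X, var_X = 5.0, 100.0 / 12.0   # Uniform[0,10]: mean 5, variance 100/12
print(mean_Y, E_N * E_X)                              # both ≈ 10
print(var_Y, E_N * var_X + E_X ** 2 * var_N)          # both ≈ 66.7
```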

LECTURE 13

The Bernoulli process

Readings: Section 6.1

Lecture outline
  Definition of Bernoulli process
  Random processes
  Basic properties of the Bernoulli process
  Distribution of interarrival times
  The time of the kth success
  Merging and splitting

The Bernoulli process

  A sequence of independent Bernoulli trials
  At each trial i:
    P(success) = P(Xi = 1) = p
    P(failure) = P(Xi = 0) = 1 − p

  Examples:
    Sequence of lottery wins/losses
    Sequence of ups and downs of the Dow Jones
    Arrivals (each second) to a bank
    Arrivals (at each time slot) to a server

Random processes

  First view: sequence of random variables X1, X2, . . .
    E[Xt] =      var(Xt) =

  Second view: what is the right sample space?
    P(Xt = 1 for all t) =

  Random processes we will study:
    Bernoulli process (memoryless, discrete time)
    Poisson process (memoryless, continuous time)
    Markov chains (with memory/dependence across time)

Number of successes S in n time slots

  P(S = k) =
  E[S] =
  var(S) =

Interarrival times

  T1: number of trials until first success
  P(T1 = t) =
  Memoryless property
  E[T1] =
  var(T1) =

  If you buy a lottery ticket every day, what is the distribution of the
  length of the first string of losing days?

Time of the kth arrival

  Given that the first arrival was at time t, i.e., T1 = t:
  the additional time, T2, until the next arrival has the same (geometric)
  distribution, and is independent of T1

  Yk : number of trials to the kth success
  E[Yk ] =
  var(Yk ) =
  P(Yk = t) =

Splitting and merging of Bernoulli processes

  Splitting (using independent coin flips):
  Starting with a Bernoulli process in which there is a probability p of an arrival
  at each time, whenever there is an arrival we keep it with probability q or discard
  it with probability 1 − q, independently across arrivals. The process of kept
  arrivals is Bernoulli, with a probability pq of a kept arrival in each slot,
  independent of what happens in other slots. For the same reason, the process of
  discarded arrivals is also Bernoulli, with probability p(1 − q) per slot.

  [Figure 6.3: splitting of a Bernoulli process into Bernoulli(pq) and Bernoulli(p(1 − q)) streams]

  Merging (of independent Bernoulli processes):
  Start with two independent Bernoulli processes (with parameters p and q) and record
  an arrival in the merged process if and only if there is an arrival in at least one
  of the two original processes. This happens with probability p + q − pq
  [one minus the probability (1 − p)(1 − q) of no arrival in either process].
  Since different time slots in either of the original processes are independent,
  different slots in the merged process are also independent. Thus, the merged
  process is Bernoulli, with success probability p + q − pq at each time step
  (collisions are counted as one arrival).

  [Figure 6.4: merging of Bernoulli(p) and Bernoulli(q) into Bernoulli(p + q − pq)]

  Splitting and merging of Bernoulli (or other) arrival processes arise in many
  contexts. For example, a two-machine work center may see a stream of arriving
  parts to be processed and split them by sending each part to a randomly chosen
  machine. Conversely, a machine may be faced with arrivals of different types
  that are merged into a single arrival stream.

The Poisson approximation to the binomial

  The number of successes in n independent Bernoulli trials is a binomial random
  variable with parameters n and p, and its mean is np. We concentrate on the
  special case where n is large but p is small, so that the mean np has a moderate
  value. A situation of this type arises when one passes from discrete to continuous
  time, a theme to be picked up in the next section. For some examples, think of the
  number of airplane accidents on any given day: there is a large number n of trials
  (airplane flights), but each one has a very small probability p of being involved
  in an accident. Or think of counting the number of typos in a book: there is a
  large number of words, but a very small probability of misspelling any single one.

  Mathematically, we can address situations of this kind by letting n grow while
  simultaneously decreasing p, in a manner that keeps the product np at a constant
  value λ. In the limit, it turns out that the formula for the binomial PMF
  simplifies to the Poisson PMF. A precise statement is provided next, together
  with a reminder of some of the properties of the Poisson PMF that were derived
  in Chapter 2.

  Poisson approximation to the binomial:
  A Poisson random variable Z with parameter λ takes nonnegative integer values
  and is described by the PMF

    pZ (k) = e^(−λ) λ^k / k!,   k = 0, 1, 2, . . . .

  Its mean and variance are given by E[Z] = λ and var(Z) = λ.

LECTURE 14

The Poisson process

Readings: Start Section 6.2

Lecture outline
  Review of Bernoulli process
  Definition of Poisson process
  Distribution of number of arrivals
  Distribution of interarrival times
  Other properties of the Poisson process

Bernoulli review

  Discrete time; success probability p
  Number of arrivals in n time slots: binomial PMF
  Interarrival times: geometric PMF
  Time to k arrivals: Pascal PMF
  Memorylessness

Definition of the Poisson process

  [Figure: arrivals marked on a continuous time line, with disjoint intervals of durations τ1, τ2, τ3]

  P(k, τ) = Prob. of k arrivals in an interval of duration τ

  Assumptions:
    Numbers of arrivals in disjoint time intervals are independent
    Time homogeneity: P(k, τ) is the same for all intervals of duration τ
    Small interval probabilities: for VERY small δ,
      P(k, δ) ≈ 1 − λδ,  if k = 0
                 λδ,     if k = 1
                 0,      if k > 1
    λ = arrival rate

PMF of the number of arrivals N

  Finely discretize [0, t]: approximately Bernoulli
  Nt (of the discrete approximation): binomial
  Taking δ → 0 (or n → ∞) gives:

    P(k, τ) = ((λτ)^k e^(−λτ)) / k!,   k = 0, 1, . . .

  E[Nt] = λt,   var(Nt) = λt

  Example: You get email according to a Poisson process at a rate of
  λ = 0.4 messages per hour. You check your email every thirty minutes.
    Prob(no new messages) =
    Prob(one new message) =

Example

  You get email according to a Poisson process at a rate of λ = 5 messages per hour.
  You check your email every thirty minutes.
    Prob(no new messages) =
    Prob(one new message) =

Interarrival times

  Yk : time of the kth arrival

  Erlang distribution:
    fYk (y) = λ^k y^(k−1) e^(−λy) / (k − 1)!,   y ≥ 0

  [Figure: Erlang densities fYk (y) for k = 1, 2, 3]

  Time of first arrival (k = 1): exponential:
    fY1 (y) = λ e^(−λy),   y ≥ 0

  Memoryless property: The time to the next arrival is independent of the past

Bernoulli/Poisson relation

  [Figure: [0, t] discretized into n = t/δ slots, with p = λδ and np = λt]

                             POISSON            BERNOULLI
  Times of arrival           Continuous         Discrete
  Arrival rate               λ per unit time    p per trial
  PMF of # of arrivals       Poisson            Binomial
  Interarrival time distr.   Exponential        Geometric
  Time to kth arrival        Erlang             Pascal

Merging Poisson processes

  Sum of independent Poisson random variables is Poisson
  Merging of independent Poisson processes is Poisson

  [Figure: red bulb flashes (Poisson, rate λ1) and green bulb flashes (Poisson, rate λ2)
   merge into "all flashes" (Poisson, rate λ1 + λ2)]

  What is the probability that the next arrival comes from the first process?

LECTURE 15

Poisson process II

Readings: Finish Section 6.2

Lecture outline
  Review of Poisson process
  Merging and splitting
  Examples
  Random incidence

Review

  Defining characteristics:
    Time homogeneity: P(k, τ)
    Independence
    Small interval probabilities (small δ):
      P(k, δ) ≈ 1 − λδ,  if k = 0
                 λδ,     if k = 1
                 0,      if k > 1

  N is a Poisson r.v., with parameter λτ:
    P(k, τ) = ((λτ)^k e^(−λτ)) / k!,   k = 0, 1, . . .
    E[N ] = var(N ) = λτ

  Interarrival times (k = 1): exponential:
    fT1 (t) = λ e^(−λt), t ≥ 0,   E[T1] = 1/λ

  Time Yk to the kth arrival: Erlang(k):
    fYk (y) = λ^k y^(k−1) e^(−λy) / (k − 1)!,   y ≥ 0

  Memoryless property: The time to the next arrival is independent of the past

Merging Poisson processes (again)

  Sum of independent Poisson random variables is Poisson
  Merging of independent Poisson processes is Poisson

  [Figure: red bulb flashes (Poisson, rate λ1) and green bulb flashes (Poisson, rate λ2)
   merge into "all flashes" (Poisson, rate λ1 + λ2)]

  What is the probability that the next arrival comes from the first process?

Poisson fishing

  Assume: Poisson, λ = 0.6/hour. Fish for two hours;
  if no catch, continue until the first catch.

  a) P(fish for more than two hours) =
  b) P(fish for more than two and less than five hours) =
  c) P(catch at least two fish) =
  d) E[number of fish] =
  e) E[future fishing time | fished for four hours] =
  f) E[total fishing time] =

Light bulb example

  Each light bulb has an independent, exponential(λ) lifetime
  Install three light bulbs.
  Find the expected time until the last light bulb dies out.

Splitting of Poisson processes

  Assume that email traffic through a server is a Poisson process.
  Destinations of different messages are independent.
  Each message is routed along the first stream with probability p;
  routings of different messages are independent.

  [Figure: email traffic leaving MIT is split at the MIT server into a USA stream
   (probability p) and a Foreign stream (probability 1 − p)]

  Each output stream is Poisson.

Random incidence for Poisson

  Poisson process that has been running forever
  Show up at some "random time" (really means "arbitrary time")

  What is the distribution of the length of the chosen interarrival interval?

  [Figure: a time line with the chosen time instant falling inside one interarrival interval]

Random incidence in "renewal processes"

  Series of successive arrivals
    i.i.d. interarrival times (but not necessarily exponential)

  Example: Bus interarrival times are equally likely to be 5 or 10 minutes
  If you arrive at a random time:
    what is the probability that you selected a 5 minute interarrival interval?
    what is the expected time to the next arrival?

LECTURE 16

Markov processes I

Readings: Sections 7.1-7.2

Lecture outline
  Checkout counter example
  Markov process definition
  n-step transition probabilities
  Classification of states

Checkout counter model

  Discrete time n = 0, 1, . . .
  Customer arrivals: Bernoulli(p)
    geometric interarrival times
  Customer service times: geometric(q)
  State Xn: number of customers at time n

  [Figure: birth-death chain on states 0, 1, . . . , 10]

Finite state Markov chains

  Xn: state after n transitions
    belongs to a finite set, e.g., {1, . . . , m}
    X0 is either given or random

  Markov property/assumption:
  (given the current state, the past does not matter)

  pij = P(Xn+1 = j | Xn = i)
      = P(Xn+1 = j | Xn = i, Xn−1, . . . , X0)

  Model specification:
    identify the possible states
    identify the possible transitions
    identify the transition probabilities

n-step transition probabilities

  State occupancy probabilities, given initial state i:
  rij (n) = P(Xn = j | X0 = i)

  [Figure: paths from time 0 to time n, passing through some state k at time n − 1,
   with rik (n − 1) and pkj marked]

  Key recursion:
  rij (n) = Σ from k=1 to m of rik (n − 1) pkj

  With random initial state:
  P(Xn = j) = Σ from i=1 to m of P(X0 = i) rij (n)
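A Python sketch (not from the slides) of the key recursion, applied to a two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8, consistent with the two-state example used later (its steady-state probabilities are 2/7 and 5/7):

```python
# n-step transition probabilities via the key recursion r(n) = r(n-1) * P.
P = [[0.5, 0.5],
     [0.2, 0.8]]

def n_step(P, n):
    m = len(P)
    r = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]   # r(0) = identity
    for _ in range(n):
        r = [[sum(r[i][k] * P[k][j] for k in range(m)) for j in range(m)] for i in range(m)]
    return r

print(n_step(P, 1))     # equals P
print(n_step(P, 100))   # every row is close to [2/7, 5/7]
```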

Example

  [Figure: two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8;
   table of rij (n) for n = 0, 1, 2, . . . , 100, 101]

  r11(n) =      r12(n) =
  r21(n) =      r22(n) =
  r31(n) =

  Generic convergence questions:
    Does rij (n) converge to something?
    Does the limit depend on the initial state?

  n odd: r22(n) =
  n even: r22(n) =

Recurrent and transient states

  State i is recurrent if:
  starting from i, and from wherever you can go,
  there is a way of returning to i

  If not recurrent, called transient

  [Figure: chain illustrating transient states and recurrent classes]

  i transient: P(Xn = i) → 0, i visited a finite number of times

  Recurrent class:
  collection of recurrent states that communicate with each other
  and with no other state

LECTURE 17

Markov processes II

Readings: Section 7.3

Lecture outline
  Review
  Steady-state behavior
    Steady-state convergence theorem
    Balance equations
  Birth-death processes

Review

  Discrete state, discrete time, time-homogeneous
    Transition probabilities pij
    Markov property

  rij (n) = P(Xn = j | X0 = i)

  Key recursion:
  rij (n) = Σk rik (n − 1) pkj

Warmup

  [Figure: chain with states 1-9 and given transition probabilities]

  P(X1 = 2, X2 = 6, X3 = 7 | X0 = 1) =
  P(X4 = 7 | X0 = 2) =

Recurrent and transient states

  State i is recurrent if:
  starting from i, and from wherever you can go,
  there is a way of returning to i

  If not recurrent, called transient

  Recurrent class:
  collection of recurrent states that communicate with each other
  and with no other state

Periodic states

  The states in a recurrent class are periodic if they can be grouped into
  d > 1 groups so that all transitions from one group lead to the next group

Steady-state probabilities

  Do the rij (n) converge to some πj ?
  (independent of the initial state i)

  Yes, if:
    recurrent states are all in a single class, and
    the single recurrent class is not periodic

  Assuming yes, start from the key recursion
    rij (n) = Σk rik (n − 1) pkj
  and take the limit as n → ∞:

    πj = Σk πk pkj ,   for all j

  Additional equation:
    Σj πj = 1

Visit frequency interpretation

  (Long run) frequency of being in j: πj
  Frequency of transitions k → j: πk pkj
  Frequency of transitions into j: Σk πk pkj

  Expected frequency of a particular transition:
  Consider n transitions of a Markov chain with a single class which is aperiodic,
  starting from a given initial state. Let qjk (n) be the expected number of such
  transitions that take the state from j to k. Then, regardless of the initial state,
    lim as n → ∞ of qjk (n)/n = πj pjk .

  Given the frequency interpretation of πj and πk pkj , the balance equation
    πj = Σk πk pkj
  has an intuitive meaning: the expected frequency πj of visits to j equals the sum
  of the expected frequencies πk pkj of transitions that lead to j (this also applies
  to transitions from j to itself, which occur with frequency πj pjj ).
  [Figure 7.13: interpretation of the balance equations in terms of frequencies]

  In fact, some stronger statements are also true. Whenever we carry out the
  probabilistic experiment and generate a trajectory of the Markov chain over an
  infinite time horizon, the observed long-term frequency with which state j is
  visited will be exactly equal to πj , and the observed long-term frequency of
  transitions from j to k will be exactly equal to πj pjk . Even though the
  trajectory is random, these equalities hold with essential certainty, that is,
  with probability 1.

Example

  [Figure: two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8]

Birth-death processes

  [Figure: chain on states 0, 1, . . . , m, with "up" probability pi and "down"
   probability qi at state i, and self-loop probabilities 1 − p0, 1 − pi − qi, 1 − qm]

  Local balance: πi pi = πi+1 qi+1

  Special case: pi = p and qi = q for all i
    ρ = p/q = load factor
    πi+1 = πi (p/q) = πi ρ
    πi = π0 ρ^i,   i = 0, 1, . . . , m

  Assume p < q and m ≈ ∞:
    π0 = 1 − ρ
    E[Xn] =   (in steady-state)

LECTURE 18

Markov processes III

Readings: Section 7.4

Lecture outline
  Review of steady-state behavior
  Probability of blocked phone calls
  Calculating absorption probabilities
  Calculating expected time to absorption

Review

  Assume a single class of recurrent states, aperiodic, plus transient states. Then,

    lim as n → ∞ of rij (n) = πj

  where πj does not depend on the initial conditions:

    lim as n → ∞ of P(Xn = j | X0 = i) = πj

  π1, . . . , πm can be found as the unique solution of the balance equations

    πj = Σk πk pkj ,   j = 1, . . . , m,

  together with Σj πj = 1

Example

  [Figure: two-state chain with p11 = 0.5, p12 = 0.5, p21 = 0.2, p22 = 0.8]

  π1 = 2/7, π2 = 5/7

  Assume the process starts at state 1.
  P(X1 = 1, and X100 = 1) =
  P(X100 = 1 and X101 = 2) =

The phone company problem

  Calls originate as a Poisson process, rate λ
  Each call duration is exponentially distributed (parameter μ)
  B lines available

  Discrete time intervals of (small) length δ

  [Figure: birth-death chain on states 0, 1, . . . , i − 1, i, . . . , B − 1, B,
   with "up" probability λδ and "down" probability iμδ from state i]

  Balance equations: λ πi−1 = i μ πi

    πi = π0 (λ/μ)^i / i!

    π0 = 1 / Σ from i=0 to B of (λ/μ)^i / i!

Calculating absorption probabilities

  What is the probability ai that the process eventually settles in state 4,
  given that the initial state is i?

  [Figure: chain with absorbing states 4 and 5 and transient states 1, 2, 3;
   transition probabilities such as 0.2, 0.3, 0.4, 0.5, 0.6, 0.8 on the arcs]

  For i = 4:  ai = 1
  For i = 5:  ai = 0
  For all other i:  ai = Σj pij aj
  unique solution

Expected time to absorption

  Find the expected number of transitions μi, until reaching the absorbing state,
  given that the initial state is i.

  [Figure: chain with one absorbing state and transient states;
   transition probabilities such as 0.2, 0.4, 0.5, 0.6, 0.8 on the arcs]

  μi = 0 for i = absorbing state
  For all other i:  μi = 1 + Σj pij μj
  unique solution

Mean first passage and recurrence times

  Chain with one recurrent class; fix s recurrent

  Mean first passage time from i to s:
  ti = E[min{n ≥ 0 such that Xn = s} | X0 = i]

  t1, t2, . . . , tm are the unique solution to
    ts = 0,
    ti = 1 + Σj pij tj ,   for all i ≠ s

  Mean recurrence time of s:
  t*s = E[min{n ≥ 1 such that Xn = s} | X0 = s]
  t*s = 1 + Σj psj tj
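A Python sketch of the absorption-probability equations, solved by fixed-point iteration on a small hypothetical chain (the exact chain in the slide's figure is not recoverable, so the transition probabilities below are made up for illustration):

```python
# Absorption probabilities a_i = P(end up in state 4 | start at i); states 4 and 5 absorbing.
P = {
    1: {1: 0.2, 2: 0.5, 3: 0.3},
    2: {1: 0.4, 4: 0.6},
    3: {2: 0.3, 5: 0.7},
    4: {4: 1.0},
    5: {5: 1.0},
}

a = {i: 0.0 for i in P}
a[4] = 1.0     # already absorbed in 4
a[5] = 0.0     # absorbed in 5, can never reach 4
for _ in range(10_000):                 # iterate a_i <- sum_j p_ij a_j until convergence
    for i in (1, 2, 3):
        a[i] = sum(p * a[j] for j, p in P[i].items())
print(a)
```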

LECTURE 19

Limit theorems I

Readings: Sections 5.1-5.3; start Section 5.4

  X1, . . . , Xn i.i.d.
  Mn = (X1 + · · · + Xn)/n
  What happens as n → ∞?
  Why bother?

Lecture outline
  A tool: Chebyshev's inequality
  Convergence "in probability"
  Convergence of Mn (weak law of large numbers)

Chebyshev's inequality

  Random variable X (with finite mean μ and variance σ²)

  σ² = ∫ (x − μ)² fX (x) dx
     ≥ ∫_{−∞}^{μ−c} (x − μ)² fX (x) dx + ∫_{μ+c}^{∞} (x − μ)² fX (x) dx
     ≥ c² P(|X − μ| ≥ c)

  P(|X − μ| ≥ c) ≤ σ²/c²

  P(|X − μ| ≥ kσ) ≤ 1/k²

Deterministic limits

  Sequence an; number a

  an converges to a:  lim as n → ∞ of an = a
  an eventually gets and stays (arbitrarily) close to a

  For every ε > 0, there exists n0, such that for every n ≥ n0, we have |an − a| ≤ ε.

Convergence "in probability"

  Sequence of random variables Yn converges in probability to a number a:
  (almost all) of the PMF/PDF of Yn eventually gets concentrated
  (arbitrarily) close to a

  For every ε > 0,  lim as n → ∞ of P(|Yn − a| ≥ ε) = 0

  [Figure: PMF of Yn with mass 1 − 1/n at 0 and mass 1/n at some large value]
  Does Yn converge?

Convergence of the sample mean (weak law of large numbers)

  X1, X2, . . . i.i.d., finite mean μ and variance σ²

  Mn = (X1 + · · · + Xn)/n

  E[Mn] = μ
  var(Mn) = σ²/n

  P(|Mn − μ| ≥ ε) ≤ var(Mn)/ε² = σ²/(n ε²)

  Mn converges in probability to μ

The pollster's problem

  f : fraction of population that . . .
  ith (randomly selected) person polled:
    Xi = 1 if yes, 0 if no.

  Mn = (X1 + · · · + Xn)/n: fraction of "yes" in our sample

  Goal: 95% confidence of at most 1% error
    P(|Mn − f | ≥ .01) ≤ .05

  Use Chebyshev's inequality:
    P(|Mn − f | ≥ .01) ≤ σ²Mn /(0.01)² = σ²X /(n (0.01)²) ≤ 1/(4 n (0.01)²)

  If n = 50,000, then P(|Mn − f | ≥ .01) ≤ .05   (conservative)

Different scalings of Mn

  X1, . . . , Xn i.i.d., finite variance σ²

  Look at three variants of their sum:
    Sn = X1 + · · · + Xn: variance n σ²
    Mn = Sn/n: variance σ²/n; converges "in probability" to E[X] (WLLN)
    Sn/√n: constant variance σ²
      Asymptotic shape?

The central limit theorem

  "Standardized" Sn = X1 + · · · + Xn:
    Zn = (Sn − E[Sn]) / σSn = (Sn − nE[X]) / (√n σ)
    zero mean, unit variance

  Let Z be a standard normal r.v. (zero mean, unit variance)

  Theorem: For every c:
    P(Zn ≤ c) → P(Z ≤ c)

  P(Z ≤ c) is the standard normal CDF, Φ(c), available from the normal tables

LECTURE 20

THE CENTRAL LIMIT THEOREM

Readings: Section 5.4

  X1, . . . , Xn i.i.d., finite variance σ²
  "Standardized" Sn = X1 + · · · + Xn:
    Zn = (Sn − E[Sn]) / σSn = (Sn − nE[X]) / (√n σ)
    E[Zn] = 0,   var(Zn) = 1

  Let Z be a standard normal r.v. (zero mean, unit variance)

  Theorem: For every c:
    P(Zn ≤ c) → P(Z ≤ c)
  P(Z ≤ c) is the standard normal CDF, Φ(c), available from the normal tables

Usefulness

  universal; only means and variances matter
  accurate computational shortcut
  justification of normal models

What exactly does it say?

  CDF of Zn converges to the normal CDF
    not a statement about convergence of PDFs or PMFs

  Normal approximation:
    Treat Zn as if normal
    also treat Sn as if normal

  Can we use it when n is "moderate"?
    Yes, but no nice theorems to this effect
    Symmetry helps a lot

  [Figure: PMFs of Sn for n = 2, 4, 8, 16, 32, approaching a normal shape]

The pollster's problem using the CLT

  f : fraction of population that . . .
  ith (randomly selected) person polled:
    Xi = 1 if yes, 0 if no.
  Mn = (X1 + · · · + Xn)/n

  Suppose we want: P(|Mn − f | ≥ .01) ≤ .05

  Event of interest: |Mn − f | ≥ .01
    |X1 + · · · + Xn − nf | ≥ .01 n
    |X1 + · · · + Xn − nf | / (σ√n) ≥ .01 √n / σ

  P(|Mn − f | ≥ .01) ≈ P(|Z| ≥ .01 √n/σ) ≤ P(|Z| ≥ .02 √n)   (since σ ≤ 1/2)

Apply to the binomial

  Fix p, where 0 < p < 1
  Xi: Bernoulli(p)
  Sn = X1 + · · · + Xn: Binomial(n, p)
    mean np, variance np(1 − p)

  CDF of (Sn − np)/√(np(1 − p))  →  standard normal

Example

  n = 36, p = 0.5; find P(Sn ≤ 21)

  Exact answer:
    Σ from k=0 to 21 of (36 choose k) (1/2)^36 = 0.8785

The 1/2 correction for the binomial approximation

  P(Sn ≤ 21) = P(Sn < 22), because Sn is integer
  Compromise: consider P(Sn ≤ 21.5)

  [Figure: PMF bars at 18, 19, 20, 21, 22, with the area up to 21.5 shaded]

De Moivre-Laplace CLT (for the binomial)

  When the 1/2 correction is used, the CLT can also approximate
  the binomial p.m.f. (not just the binomial CDF)

  P(Sn = 19) = P(18.5 ≤ Sn ≤ 19.5)
    = P( (18.5 − 18)/3 ≤ (Sn − 18)/3 ≤ (19.5 − 18)/3 )
    = P(0.17 ≤ Zn ≤ 0.5)
    ≈ P(0.17 ≤ Z ≤ 0.5)
    = P(Z ≤ 0.5) − P(Z ≤ 0.17)
    = 0.6915 − 0.5675
    = 0.124

  Exact answer:  (36 choose 19) (1/2)^36 = 0.1251

Poisson vs. normal approximations of the binomial

  Poisson arrivals during a unit interval equal the sum of n (independent)
  Poisson arrivals during n intervals of length 1/n

  Let n → ∞, apply CLT (??)
  Poisson = normal (????)

  Binomial(n, p):
    p fixed, n → ∞: normal
    np fixed, n → ∞, p → 0: Poisson

    p = 1/100, n = 100: Poisson
    p = 1/10, n = 500: normal
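A Python sketch (not from the slides) comparing the exact binomial answers above with the normal approximation using the 1/2 correction:

```python
from math import comb, erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 36, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))      # 18 and 3

exact_cdf = sum(comb(n, k) for k in range(22)) / 2**n
approx_cdf = Phi((21.5 - mu) / sigma)         # with the 1/2 correction
print(exact_cdf, approx_cdf)                  # 0.8785 vs ≈ 0.878

exact_pmf = comb(n, 19) / 2**n
approx_pmf = Phi((19.5 - mu) / sigma) - Phi((18.5 - mu) / sigma)
print(exact_pmf, approx_pmf)                  # 0.1251 vs ≈ 0.124
```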

p()

p()

p()

p()

p()

pX|(x | )
Types of
Inference models/approaches
p()p()
Sample Applications
N
(x | )
pX|

Model
building
versus
inferring
unkno
wn
(x
|
)
p
p()
X|
N
p()
p()
Readings: Sections 8.1-8.2
X
Polling
variables.
E.g., assume X = aS + W
N
N
(x | )
p
X
ModelNbuilding:
X X|
Design of experiments/sampling
N
pX|(x | ) methodologies

N
It is the mark of truly educated people
| )(x | know
pX|(x
)
p
signal
S,
observe
X,
infer
a
X|
Lancet study on Iraq death toll

to be deeply moved bystatistics.

Xpresence of noise:
(x
| the
)
pX|

Estimation
in
X
(x
|
)
p

Estimator
X|
pX (x;X)
pX|(x | )
(Oscar Wilde)
X
Medical/pharmaceutical
trials
a, observe X,
pknow
estimate S.

p()
Estimator
()
Estimator
X

X
p()
Hypothesis testing: unknown
takes one of

Data
mining X

Reality
Model
N N possible values;

Estimator
p()
few
at small
(e.g., customer arrivals)
Poisson)

{0, 1}
X =+W
W aim
fW (w)
(e.g.,Netflix
competition
Estimator

N
probability
of
incorrect
decision

Estimator

Estimator
(x(x
| )
pX|

p()
N
| )
pX|
Finance

Estimator
Data
at a small
error
{0,
1} Estimator
X = aim
+W
W fW (w) Estimation:
| )
pX|(xestimation

LECTURE 21

Estimator

interpretation of experiments
Design &

polling,

trials.
W .f.W (w)
Matrix Completion
pmedical/pharmaceutical
Y |X(y | x)

Netflix
competition

Finance
Y |X (
! Partially observed matrix:pgoal
predict the
to| )
X

unobserved entries
2
5

?
2

3
1X

pX (x)

1/6

Estimator
X
10

Estimator
pX|
pX
(x; (x
) | )
pX|(x | )

p()
Estimator
N

pX|(x | )
X

p()

Estimator
X
X : unknown parameter
a r.v.) N
()
p(not

pX|(x | )
Estimator
E.g., =
mass
of
electron

pX|(x | )
p()
p
()
Bayesian:
Use priors &XBayes rule
Estimator
pX|(x | )
X
Estimator

N N

measurement
3 ?
pY (y ; )
5

1?
4
1
5 5 4
2 ?5 ? 4
3 3 1 5 2 1
pY (y ; )
3
1
2 3
N
4
51
3

3
3?
5
2 ? 1 p1Y (y ; )
pX (x)
5
2 ? X4 4
Y
1 3 1 5 4 5pY |X (y | x)
1 2
4
5?

sensors

f()

pY |X (y | x)

objects X
N
1 pY 4|X (y |5x)

W fW
1}
{0, 1}X =
X+
=W
+W
fW (w) {0,
W (w)
N
p() XX
p()
Classical statistics: X

{0, 1}
X
=
pX|(x |)
+ W
N
N

Estimator
Graph of S&P 500 index removed

due to copyright restrictions.


pY |X ( | )

Estimator

X {0, 1}

W fW (w)

Signal processing
W fW (w)

f()

Tracking,
detection, speaker identification,. . .
Y =X +W
pY (y ; )

pp()
()

{0,
1}| )
p(x
|(x
)
p
X|X|

1/6 X X4

NN

10

X = Estimator
+W

Estimator

Estimator

(x(x
| )
pX|
| )
pX|
Estimator
Estimator
XX

Estimator
Estimator

Bayesian inference: Use Bayes rule

Estimation with discrete data

Hypothesis testing
discrete data

p|X ( | x) =

f|X ( | x) =

p() pX |(x | )
pX (x)

pX (x) =

continuous data
p|X ( | x) =

p() fX |(x | )

f() pX |(x | )
pX (x)

f()pX |(x | ) d

Example:

fX (x)

Coin with unknown parameter


Observe X heads in n tosses

Estimation; continuous data


f|X ( | x) =

What is the Bayesian approach?

f() fX |(x | )

Want to find f|X ( | x)

fX (x)

Assume a prior on (e.g., uniform)

Zt = 0 + t1 + t22
Xt = Zt + Wt,

t = 1, 2, . . . , n

Bayes rule gives:


f0,1,2|X1,...,Xn (0, 1, 2 | x1, . . . , xn)
1

42

)
pX|(x | W
(w)
pX (x) =
f ( ) pf W
X | (x | ) d
(x
|
)
p
X|
W
pW ( w )
pX (x)
p()
X
Y = X + W
+
N | )
pN
N (x
X|
0 , p1Y} ( y ; ) Y = X + W
EXxaX
m{ ple:
N

| )
(x
pX|
X
(x
| |) )
pXp|(x
object at unknown location X
X|

W
pW ( w )
sensors
Least
Mean
Squares
Estimation
pX|(x | )
Estimator
p()
X X X

Y = X + W
Estimator in the absence of information
Estimation
X
N +W
{0, 1}
X=
W fW (w)

Estimator
{0, 1}
X =+W
W fW (w)
pX|(x | )

p X | (f1
= 1}1/6 X =4 + W10
| ()
Estimator
) {0,
W

ft W
E sEstimator
i m(w)
a t or
f=P()
(se nsor i1/6
se nses t4
h e o b j e c10
t |
= ) X
Estimator

{0,
=
X
+
W
W
fW
4t a
nsor
fW
(w
) = h

n c
{e1}
0,o{0,
1 } 1}
==

++
WW
W

f(w)
( dis
f10
fr o X
m X
se
i )
W (w)
()
Wf1/6
Npp()
)()
p(

Output of Bayesian Inference

- Posterior distribution: pmf pΘ|X(θ | x) or pdf fΘ|X(θ | x)
- If interested in a single answer:
  - Maximum a posteriori probability (MAP):
    pΘ|X(θ* | x) = maxθ pΘ|X(θ | x)
    fΘ|X(θ* | x) = maxθ fΘ|X(θ | x)
    minimizes probability of error; often used in hypothesis testing
  - Conditional expectation: E[Θ | X = x] = ∫ θ fΘ|X(θ | x) dθ
- Single answers can be misleading!

Estimator in the absence of information: find estimate c, to minimize E[(Θ - c)²]
- Optimal estimate: c = E[Θ]
- Optimal mean squared error: E[(Θ - E[Θ])²] = Var(Θ)
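To make the "single answer" options concrete, the sketch below (an illustration, not from the slides) continues the coin example: it computes the MAP estimate, the conditional expectation E[Θ | X = x], and the conditional variance from a posterior evaluated on a grid; the Beta(8, 4) posterior assumed here corresponds to a uniform prior with 7 heads in 10 tosses.

import numpy as np

# Posterior of Theta given X = 7 heads in n = 10 tosses, with a uniform prior:
# f_Theta|X(theta | 7) is proportional to theta^7 * (1 - theta)^3  (a Beta(8, 4) pdf)
thetas = np.linspace(0, 1, 100001)
d = thetas[1] - thetas[0]
post = thetas**7 * (1 - thetas)**3
post /= np.sum(post) * d                                # normalize numerically

theta_map = thetas[np.argmax(post)]                     # MAP: maximizes the posterior pdf
theta_lms = np.sum(thetas * post) * d                   # conditional expectation E[Theta | X = x]
var_cond = np.sum((thetas - theta_lms)**2 * post) * d   # Var(Theta | X = x)

print(theta_map)   # approx 0.70
print(theta_lms)   # approx 8/12 = 0.667
print(var_cond)    # approx 0.017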

LMS Estimation of Θ based on X

- Two r.v.'s Θ, X
- we observe that X = x
- new universe: condition on X = x
- E[(Θ - c)² | X = x] is minimized by c = E[Θ | X = x]
- E[(Θ - E[Θ | X = x])² | X = x] ≤ E[(Θ - g(x))² | X = x]
- E[(Θ - E[Θ | X])² | X] ≤ E[(Θ - g(X))² | X]
- E[(Θ - E[Θ | X])²] ≤ E[(Θ - g(X))²]
- E[Θ | X] minimizes E[(Θ - g(X))²] over all estimators g(·)

LMS Estimation w. several measurements

- Unknown r.v. Θ
- Observe values of r.v.'s X1, . . . , Xn
- Best estimator: E[Θ | X1, . . . , Xn]
- Can be hard to compute/implement
  (involves multi-dimensional integrals, etc.)
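A small simulation can illustrate the optimality statement E[(Θ - E[Θ | X])²] ≤ E[(Θ - g(X))²]. This sketch is an illustration only: it assumes a model in which Θ ~ N(0, 1) and X = Θ + W with W ~ N(0, 1), for which E[Θ | X] = X/2, and compares it against the naive estimator g(X) = X.

import numpy as np

rng = np.random.default_rng(0)
m = 200_000
theta = rng.normal(0.0, 1.0, m)          # Theta ~ N(0, 1)   (assumed model)
x = theta + rng.normal(0.0, 1.0, m)      # X = Theta + W, with W ~ N(0, 1)

lms = x / 2                              # E[Theta | X] = X/2 for this normal model
naive = x                                # a competing estimator g(X) = X

print(np.mean((theta - lms)**2))         # approx 0.5  (the optimal MSE)
print(np.mean((theta - naive)**2))       # approx 1.0  (strictly worse)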


LECTURE 22

Readings: pp. 225-226; Sections 8.3-8.4
Topics

- (Bayesian) least mean squares (LMS) estimation
- (Bayesian) linear LMS estimation

MAP estimate: Θ̂ = g(X); MAP maximizes fΘ|X(θ | x)

LMS estimate: Θ̂ = E[Θ | X] minimizes E[(Θ - g(X))²] over all estimators g(·)

LMS estimation: for any x, the estimate θ̂ = E[Θ | X = x]
minimizes E[(Θ - θ̂)² | X = x] over all estimates θ̂

[Figure labels removed: fΘ(θ), fX|Θ(x | θ), g(θ), Θ̂ = g(X).]

Conditional mean squared error: E[(Θ - E[Θ | X])² | X = x]
- same as Var(Θ | X = x): the variance of the conditional distribution of Θ, given X = x

Predicting X based on Y
- Two r.v.'s X, Y; we observe that Y = y
- new universe: condition on Y = y
- E[(X - c)² | Y = y] is minimized by c = E[X | Y = y]

Some properties of LMS estimation
- Estimator: Θ̂ = E[Θ | X]; estimation error: Θ̃ = Θ̂ - Θ
- E[Θ̃ | X = x] = 0, and E[Θ̃] = 0
- E[Θ̃ h(X)] = 0, for any function h
- cov(Θ̂, Θ̃) = 0
- Since Θ = Θ̂ - Θ̃:  var(Θ) = var(Θ̂) + var(Θ̃)
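The properties listed above (zero-mean error, error uncorrelated with any function of X, and var(Θ) = var(Θ̂) + var(Θ̃)) can be checked numerically. The sketch below is illustrative only, using an assumed normal model Θ ~ N(0, 1), X = Θ + W with W ~ N(0, 1), for which Θ̂ = E[Θ | X] = X/2.

import numpy as np

rng = np.random.default_rng(1)
m = 500_000
theta = rng.normal(0.0, 1.0, m)         # Theta ~ N(0, 1)   (assumed model)
x = theta + rng.normal(0.0, 1.0, m)     # X = Theta + W

theta_hat = x / 2                       # LMS estimator E[Theta | X]
theta_tilde = theta_hat - theta         # estimation error

print(np.mean(theta_tilde))                                    # approx 0
print(np.mean(theta_tilde * np.sin(x)))                        # approx 0: E[error * h(X)] = 0
print(np.cov(theta_hat, theta_tilde)[0, 1])                    # approx 0: cov(hat, error) = 0
print(np.var(theta_hat) + np.var(theta_tilde), np.var(theta))  # both approx 1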


Linear LMS

- Consider estimators of Θ of the form:  Θ̂ = aX + b
- Minimize E[(Θ - aX - b)²]
- Best choice of a, b; best linear estimator:
  Θ̂L = E[Θ] + (Cov(X, Θ)/var(X)) (X - E[X])

Linear LMS properties

- Θ̂L = E[Θ] + ρ (σΘ/σX) (X - E[X])
- E[(Θ̂L - Θ)²] = (1 - ρ²) σΘ²

Linear LMS with multiple data

- Consider estimators of the form:  Θ̂ = a1X1 + . . . + anXn + b
- Find best choices of a1, . . . , an, b
- Minimize:  E[(a1X1 + . . . + anXn + b - Θ)²]
- Only means, variances, covariances matter
- Set derivatives to zero: linear system in b and the ai
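Since only means, variances, and covariances matter, the best linear estimator can be formed directly from those quantities. Below is a minimal sketch (illustrative model and numbers, not from the slides) that builds Θ̂L = E[Θ] + (cov(X, Θ)/var(X))(X - E[X]) from simulated data.

import numpy as np

rng = np.random.default_rng(2)
m = 200_000
theta = rng.uniform(4.0, 10.0, m)            # some prior for Theta (an assumption)
x = theta + rng.normal(0.0, 2.0, m)          # noisy observation X = Theta + W

# Best linear estimator: Theta_L = E[Theta] + cov(X, Theta)/var(X) * (X - E[X])
cov_x_theta = np.mean((x - x.mean()) * (theta - theta.mean()))
a = cov_x_theta / np.var(x)
b = theta.mean() - a * x.mean()
theta_lin = a * x + b

print(a, b)                                  # approx 3/7 and 4 for these parameters
print(np.mean((theta - theta_lin)**2))       # approx (1 - rho^2) * var(Theta) = 12/7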
The cleanest linear LMS example

- Xi = Θ + Wi;  Θ, W1, . . . , Wn independent
- Θ ~ N(x0, σ0²),  Wi ~ N(0, σi²)
- Θ̂L = ( x0/σ0² + X1/σ1² + . . . + Xn/σn² ) / ( 1/σ0² + 1/σ1² + . . . + 1/σn² )
  (a weighted average of x0, X1, . . . , Xn)
- If all normal,  Θ̂L = E[Θ | X1, . . . , Xn]

Big picture

- Standard examples:
  - Xi uniform on [0, θ]; uniform prior on θ
  - Xi Bernoulli(p); uniform (or Beta) prior on p
  - Xi normal with mean θ, known variance σ²; normal prior on θ
- Estimation methods: MAP; MSE; Linear MSE

Choosing Xi in linear LMS

- E[Θ | X] is the same as E[Θ | X³]
- Linear LMS is different:  Θ̂ = aX + b  versus  Θ̂ = aX³ + b
- Also consider  Θ̂ = a1X + a2X² + a3X³ + b
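The weighted-average formula above can be evaluated directly; a minimal sketch with made-up values for x0, the σi, and the observations (all assumptions chosen only for illustration):

import numpy as np

# Theta ~ N(x0, sigma0^2), X_i = Theta + W_i with W_i ~ N(0, sigma_i^2)  (assumed values)
x0 = 5.0
sigma = np.array([2.0, 1.0, 0.5, 1.5])   # sigma[0] is the prior std sigma_0
obs = np.array([6.1, 5.4, 5.9])          # hypothetical observations X_1, ..., X_3

values = np.concatenate(([x0], obs))     # x0, X_1, ..., X_n
weights = 1.0 / sigma**2                 # 1/sigma_i^2, i = 0, ..., n

theta_hat = np.sum(weights * values) / np.sum(weights)
print(theta_hat)   # weighted average of x0, X_1, ..., X_n = E[Theta | X_1, ..., X_n]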


LECTURE 23

Readings: Section 9.1
(not responsible for t-based confidence intervals, in pp. 471-473)

Outline

- Classical statistics
- Maximum likelihood (ML) estimation
- Estimating a sample mean
- Confidence intervals (CIs)
- CIs using an estimated variance

Classical statistics

- X ~ pX(x; θ)
- also for vectors X and θ:  pX1,...,Xn(x1, . . . , xn; θ1, . . . , θm)
- These are NOT conditional probabilities; θ is NOT random
- mathematically: many models, one for each possible value of θ

Problem types:
- Hypothesis testing:  H0: θ = 1/2  versus  H1: θ = 3/4
- Composite hypotheses:  H0: θ = 1/2  versus  H1: θ ≠ 1/2
- Estimation: design an estimator Θ̂, to keep estimation error small

Maximum Likelihood Estimation

- Model, with unknown parameter(s):  X ~ pX(x; θ)
- Pick θ that "makes data most likely":
  θ̂ML = arg maxθ pX(x; θ)
- Compare to Bayesian MAP estimation:
  θ̂MAP = arg maxθ pΘ|X(θ | x) = arg maxθ pX|Θ(x | θ) pΘ(θ) / pX(x)
- Example: X1, . . . , Xn: i.i.d., exponential(θ)
  maxθ  ∏(i=1..n) θ e^(-θ xi)
  maxθ  [ n log θ - θ (x1 + . . . + xn) ]
  θ̂ML = n / (x1 + . . . + xn)

Desirable properties of estimators
(should hold FOR ALL θ !!!)

- Unbiased:  E[Θ̂n] = θ
  exponential example, with n = 1:  E[1/X1] = ∞ ≠ θ  (biased)
- Consistent:  Θ̂n → θ  (in probability)
  exponential example:  (X1 + . . . + Xn)/n → E[X] = 1/θ
  can use this to show that:  Θ̂n = n/(X1 + . . . + Xn) → 1/E[X] = θ
- Small mean squared error (MSE):
  E[(Θ̂ - θ)²] = var(Θ̂ - θ) + (E[Θ̂ - θ])² = var(Θ̂) + (bias)²
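For the exponential example, the ML estimate has the closed form θ̂ML = n/(x1 + . . . + xn). The sketch below (illustrative only) checks this by maximizing the log-likelihood numerically on simulated data with a hypothetical true value θ = 2.

import numpy as np

rng = np.random.default_rng(3)
theta_true = 2.0                           # hypothetical true parameter
x = rng.exponential(1.0 / theta_true, 50)  # X_i ~ exponential(theta): mean 1/theta

# Closed-form ML estimate: n / (x_1 + ... + x_n)
theta_ml = len(x) / np.sum(x)

# Same answer from a grid search over the log-likelihood n*log(theta) - theta*sum(x)
grid = np.linspace(0.01, 10, 100000)
loglik = len(x) * np.log(grid) - grid * np.sum(x)
print(theta_ml, grid[np.argmax(loglik)])   # the two agree (up to grid resolution)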


Estimate a mean

- X1, . . . , Xn: i.i.d., mean θ, variance σ²
  Xi = θ + Wi,   Wi: i.i.d., mean 0, variance σ²
- Θ̂n = sample mean = Mn = (X1 + . . . + Xn)/n
- Properties:
  E[Θ̂n] = θ   (unbiased)
  WLLN: Θ̂n → θ   (consistency)
  MSE: σ²/n
- Sample mean often turns out to also be the ML estimate.
  E.g., if Xi ~ N(θ, σ²), i.i.d.

Confidence intervals (CIs)

- An estimate Θ̂n may not be informative enough
- A 1 - α confidence interval is a (random) interval [Θ̂n-, Θ̂n+],
  s.t.  P(Θ̂n- ≤ θ ≤ Θ̂n+) ≥ 1 - α,  for all θ
- often α = 0.05, or 0.25, or 0.01
- interpretation is subtle

CI in estimation of the mean

- Θ̂n = (X1 + . . . + Xn)/n
- normal tables:  Φ(1.96) = 1 - 0.05/2
- P( |Θ̂n - θ| / (σ/√n) ≤ 1.96 ) ≈ 0.95   (CLT)
- P( Θ̂n - 1.96 σ/√n ≤ θ ≤ Θ̂n + 1.96 σ/√n ) ≈ 0.95
- More generally: let z be s.t. Φ(z) = 1 - α/2
  P( Θ̂n - z σ/√n ≤ θ ≤ Θ̂n + z σ/√n ) ≈ 1 - α

The case of unknown σ

- Option 1: use upper bound on σ
  if Xi Bernoulli: σ ≤ 1/2
- Option 2: use ad hoc estimate of σ
  if Xi Bernoulli(θ): σ̂ = √(Θ̂(1 - Θ̂))
- Option 3: use generic estimate of the variance
  start from σ² = E[(Xi - θ)²]
  σ̂² = (1/n) Σ(i=1..n) (Xi - θ)² → σ²   (but do not know θ)
  Ŝn² = (1/(n-1)) Σ(i=1..n) (Xi - Θ̂n)² → σ²   (unbiased: E[Ŝn²] = σ²)
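A minimal sketch of the 1 - α confidence interval for the mean, using Option 3 (the unbiased variance estimate Ŝn²) in place of the unknown σ; the data here are simulated and purely illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(3.0, 2.0, 100)       # hypothetical sample: true mean 3, true sigma 2
n = len(x)

theta_hat = np.mean(x)              # sample mean
s_hat = np.std(x, ddof=1)           # square root of the unbiased variance estimate S_n^2

alpha = 0.05
z = norm.ppf(1 - alpha / 2)         # z such that Phi(z) = 1 - alpha/2 (about 1.96 for 5%)

lo = theta_hat - z * s_hat / np.sqrt(n)
hi = theta_hat + z * s_hat / np.sqrt(n)
print(lo, hi)                       # approximate 95% confidence interval for theta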


LECTURE 24

Reference: Section 9.3

Course Evaluations (until 12/16)
http://web.mit.edu/subjectevaluation

Review
- Maximum likelihood estimation
- Have model with unknown parameters:  X ~ pX(x; θ)
- Pick θ that makes data most likely:  maxθ pX(x; θ)
- Compare to Bayesian MAP estimation:
  maxθ pΘ|X(θ | x),  or  maxθ pX|Θ(x | θ) pΘ(θ) / pX(x)

Outline

- Review
- Maximum likelihood estimation
- Confidence intervals
- Linear regression
- Binary hypothesis testing
  - Types of error
  - Likelihood ratio test (LRT)

Review (ctd.)
- Sample mean estimate of θ = E[X]:  Θ̂n = (X1 + . . . + Xn)/n
- 1 - α confidence interval:  P(Θ̂n- ≤ θ ≤ Θ̂n+) ≥ 1 - α
- confidence interval for sample mean: let z be s.t. Φ(z) = 1 - α/2
  P( Θ̂n - z σ/√n ≤ θ ≤ Θ̂n + z σ/√n ) ≈ 1 - α

[Textbook excerpt (Classical Statistical Inference, Chap. 9):]

. . . in the context of various probabilistic frameworks, which provide perspective and a mechanism for quantitative analysis.

We first consider the case of only two variables, and then generalize. We wish to model the relation between two variables of interest, x and y (e.g., years of education and income), based on a collection of data pairs (xi, yi), i = 1, . . . , n. For example, xi could be the years of education and yi the annual income of the ith person in the sample. Often a two-dimensional plot of these samples indicates a systematic, approximately linear relation between xi and yi. Then, it is natural to attempt to build a linear model of the form

y ≈ θ0 + θ1 x,

where θ0 and θ1 are unknown parameters to be estimated. In particular, given some estimates θ̂0 and θ̂1 of the resulting parameters, the value ŷi corresponding to xi, as predicted by the model, is

ŷi = θ̂0 + θ̂1 xi.

Generally, ŷi will be different from the given value yi, and the corresponding difference

ỹi = yi - ŷi,

is called the ith residual. A choice of estimates that results in small residuals is considered to provide a good fit to the data. With this motivation, the linear regression approach chooses the parameter estimates θ̂0 and θ̂1 that minimize the sum of the squared residuals,

Σ(i=1..n) (yi - ŷi)² = Σ(i=1..n) (yi - θ0 - θ1 xi)²,

over all θ0 and θ1; see Fig. 9.5 for an illustration.

Regression

[Slide figure: data pairs (xi, yi), the line y = θ0 + θ1 x, and the residual yi - θ0 - θ1 xi.]

Figure 9.5: Illustration of a set of data pairs (xi, yi), and a linear model y = θ̂0 + θ̂1 x, obtained by minimizing over θ0, θ1 the sum of the squares of the residuals yi - θ0 - θ1 xi.

Linear regression

- Data: (x1, y1), (x2, y2), . . . , (xn, yn)
- Model:  y ≈ θ0 + θ1 x
- min over θ0, θ1:  Σ(i=1..n) (yi - θ0 - θ1 xi)²    (*)
- Solution (set derivatives to zero):
  x̄ = (x1 + . . . + xn)/n,   ȳ = (y1 + . . . + yn)/n
  θ̂1 = Σ(i=1..n) (xi - x̄)(yi - ȳ) / Σ(i=1..n) (xi - x̄)²
  θ̂0 = ȳ - θ̂1 x̄

Interpretation of the form of the solution

- Assume a model Y = θ0 + θ1 X + W,
  with W independent of X and with zero mean
- Check that
  θ1 = cov(X, Y) / var(X) = E[(X - E[X])(Y - E[Y])] / E[(X - E[X])²]
- Solution formula for θ̂1 uses natural estimates of the variance and covariance

- One interpretation: Yi = θ0 + θ1 xi + Wi,  Wi ~ N(0, σ²), i.i.d.
  Likelihood function fX,Y(x, y; θ) is:  c · exp{ -Σ(i=1..n) (yi - θ0 - θ1 xi)² / (2σ²) }
  Take logs, same as (*)
- Least squares ↔ pretend the Wi are i.i.d. normal
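A sketch of the closed-form least-squares solution above, on made-up data pairs; it also checks the answer against numpy's polynomial least-squares routine.

import numpy as np

# Hypothetical data pairs (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
theta0 = y_bar - theta1 * x_bar
print(theta0, theta1)

# Cross-check with numpy's least squares (fit a degree-1 polynomial)
print(np.polyfit(x, y, 1))   # returns [theta1, theta0]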


The world of linear regression

- In practice, one also reports:
  - Confidence intervals for the θi
  - Standard error (estimate of σ)
  - R², a measure of explanatory power

- Multiple linear regression:
  data: (xi, x'i, x''i, yi), i = 1, . . . , n
  model:  y ≈ θ0 + θ x + θ' x' + θ'' x''
  formulation:  min over θ, θ', θ''  Σ(i=1..n) (yi - θ0 - θ xi - θ' x'i - θ'' x''i)²

The world of regression (ctd.)

- model y ≈ θ0 + θ1 h(x),  e.g., y ≈ θ0 + θ1 x²
  work with data points (yi, h(xi))
  formulation:  min Σ(i=1..n) (yi - θ0 - θ1 h(xi))²

- Some common concerns:
  - Heteroskedasticity
  - Multicollinearity
  - Choosing the right variables
  - Sometimes misused to conclude causal relations
  - etc.
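Both multiple linear regression and the "linear in θ, nonlinear in x" variant reduce to the same least-squares formulation; a sketch using numpy's linear least-squares solver on made-up data with two explanatory variables (all numbers below are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, n)   # hypothetical "true" model plus noise

# Columns: constant, x1, x2.  Adding h(x) columns, e.g. x1**2, handles nonlinear terms in x.
A = np.column_stack([np.ones(n), x1, x2])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)   # estimates of theta0, theta1, theta2 (close to 1.0, 2.0, -0.5)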

Binary hypothesis testing

- Binary θ; new terminology:
  - null hypothesis H0:  X ~ pX(x; H0)   [or fX(x; H0)]
  - alternative hypothesis H1:  X ~ pX(x; H1)   [or fX(x; H1)]
- Partition the space of possible data vectors
  Rejection region R: reject H0 iff data ∈ R
- Types of errors:
  - Type I (false rejection, false alarm): H0 true, but rejected
    α(R) = P(X ∈ R; H0)
  - Type II (false acceptance, missed detection): H0 false, but accepted
    β(R) = P(X ∉ R; H1)

Likelihood ratio test (LRT)

- Bayesian case (MAP rule): choose H1 if:
  P(H1 | X = x) > P(H0 | X = x)
  or
  P(X = x | H1) P(H1) / P(X = x)  >  P(X = x | H0) P(H0) / P(X = x)
  or
  P(X = x | H1) / P(X = x | H0)  >  P(H0) / P(H1)
- Nonbayesian version: choose H1 if
  P(X = x; H1) / P(X = x; H0) > ξ   (discrete case)
  fX(x; H1) / fX(x; H0) > ξ   (continuous case)
  (likelihood ratio test)
- threshold ξ trades off the two types of error
  choose ξ so that P(reject H0; H0) = α   (e.g., α = 0.05)


LECTURE 25

Reference: Section 9.4

Course Evaluations (until 12/16)
http://web.mit.edu/subjectevaluation

Outline

- Review of simple binary hypothesis tests
  - examples
- Testing composite hypotheses
  - is my coin fair?
  - is my die fair?
  - goodness of fit tests

Simple binary hypothesis testing

- null hypothesis H0:  X ~ pX(x; H0)   [or fX(x; H0)]
- alternative hypothesis H1:  X ~ pX(x; H1)   [or fX(x; H1)]
- Choose a rejection region R; reject H0 iff data ∈ R
- Likelihood ratio test: reject H0 if
  pX(x; H1)/pX(x; H0) > ξ   or   fX(x; H1)/fX(x; H0) > ξ
- fix false rejection probability α (e.g., α = 0.05)
  choose ξ so that P(reject H0; H0) = α

Example (test on normal mean)

- n data points, i.i.d.
  H0: Xi ~ N(0, 1)
  H1: Xi ~ N(1, 1)
- Likelihood ratio test; rejection region:
  (1/√(2π))^n exp{ -Σi (Xi - 1)²/2 }  /  [ (1/√(2π))^n exp{ -Σi Xi²/2 } ]  >  ξ
- algebra: reject H0 if:  Σ(i=1..n) Xi > γ
- Find γ such that  P( Σ(i=1..n) Xi > γ ; H0 ) = α
  use normal tables

Example (test on normal variance)

- n data points, i.i.d.
  H0: Xi ~ N(0, 1)
  H1: Xi ~ N(0, 4)
- Likelihood ratio test; rejection region:
  (1/(2√(2π)))^n exp{ -Σi Xi²/(2·4) }  /  [ (1/√(2π))^n exp{ -Σi Xi²/2 } ]  >  ξ
- algebra: reject H0 if:  Σ(i=1..n) Xi² > γ
- Find γ such that  P( Σ(i=1..n) Xi² > γ ; H0 ) = α
  the distribution of Σi Xi² is known (derived distribution problem):
  chi-square distribution; tables are available
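For these two examples the thresholds γ come straight from tables: under H0, Σ Xi is N(0, n), so γ = √n · z with Φ(z) = 1 - α, while Σ Xi² is chi-square with n degrees of freedom. A sketch (simulated data, illustrative only):

import numpy as np
from scipy.stats import norm, chi2

n, alpha = 25, 0.05

# Test on the mean: under H0 the statistic sum(X_i) is N(0, n)
gamma_mean = np.sqrt(n) * norm.ppf(1 - alpha)
print(gamma_mean)   # reject H0 if the sum of the X_i exceeds this

# Test on the variance: under H0 the statistic sum(X_i^2) is chi-square with n degrees of freedom
gamma_var = chi2.ppf(1 - alpha, df=n)
print(gamma_var)    # reject H0 if the sum of squares exceeds this

# Apply to one hypothetical data set drawn under H0
rng = np.random.default_rng(6)
x = rng.normal(0, 1, n)
print(np.sum(x) > gamma_mean, np.sum(x**2) > gamma_var)   # usually (False, False)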


Composite hypotheses

- Got S = 472 heads in n = 1000 tosses; is the coin fair?
  H0: p = 1/2   versus   H1: p ≠ 1/2
- Pick a statistic (e.g., S)
- Pick shape of rejection region (e.g., |S - n/2| > ξ)
- Choose significance level (e.g., α = 0.05)
- Choose ξ so that:  P(reject H0; H0) = 0.05
- Using the CLT:  P(|S - 500| ≤ 31; H0) ≈ 0.95;  ξ = 31
- In our example: |S - 500| = 28 < ξ
  → H0 not rejected (at the 5% level)

Is my die fair?

- Hypothesis H0:  P(X = i) = pi = 1/6,  i = 1, . . . , 6
- Observed occurrences of i: Ni
- Choose form of rejection region; chi-square test:
  reject H0 if  T = Σi (Ni - npi)² / (npi)  >  γ
- Pick critical value γ so that:
  P(reject H0; H0) = α,  i.e.,  P(T > γ; H0) = 0.05
- Need the distribution of T: (CLT + derived distribution problem)
  for large n, T has approximately a chi-square distribution
  available in tables
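Both recipes can be reproduced numerically; the sketch below reworks the coin example (S = 472 heads in 1000 tosses) and applies the chi-square test to hypothetical die counts (the counts are made up for illustration).

import numpy as np
from scipy.stats import norm, chi2

# Coin: S = 472 heads in n = 1000 tosses, H0: p = 1/2
n, S = 1000, 472
# Under H0, S is approximately N(n/2, n/4); choose xi so that P(|S - n/2| <= xi) is about 0.95
xi = norm.ppf(0.975) * np.sqrt(n / 4)
print(xi, abs(S - n / 2) > xi)   # xi is about 31; |472 - 500| = 28 < 31, so H0 is not rejected

# Die: hypothetical observed counts N_i over 600 rolls, H0: p_i = 1/6
counts = np.array([95, 110, 98, 102, 93, 102])
expected = counts.sum() / 6
T = np.sum((counts - expected)**2 / expected)
gamma = chi2.ppf(0.95, df=5)     # 5 degrees of freedom for 6 categories
print(T, gamma, T > gamma)       # reject H0 only if T exceeds gamma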

Do I have the correct pdf?

- Partition the range into bins
- npi: expected incidence of bin i (from the pdf)
- Ni: observed incidence of bin i
- Use chi-square test (as in die problem)
- Kolmogorov-Smirnov test:
  form empirical CDF, F̂X, from data
  Dn = maxx |F̂X(x) - FX(x)|
  P(√n Dn ≥ 1.36) ≈ 0.05

What else is there?

- Systematic methods for coming up with shape of rejection regions
- Methods to estimate an unknown PDF
  (e.g., form a histogram and smooth it out)
- Efficient and recursive signal processing
- Methods to select between less or more complex models
  (e.g., identify relevant explanatory variables in regression models)
- Methods tailored to high-dimensional unknown parameter vectors and huge number of data points (data mining)
- etc. etc. . . .
  (http://www.itl.nist.gov/div898/handbook/)
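A sketch of the Kolmogorov-Smirnov recipe: build the empirical CDF from (simulated, illustrative) data and compare Dn against the 1.36/√n threshold; the hypothesized FX here is the standard normal CDF, an assumption made only for this example.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.sort(rng.normal(0, 1, 200))        # hypothetical data; H0: F_X is the standard normal CDF
n = len(x)

# Empirical CDF evaluated just after and just before each data point
ecdf_hi = np.arange(1, n + 1) / n
ecdf_lo = np.arange(0, n) / n
F0 = norm.cdf(x)                          # hypothesized CDF at the data points

Dn = max(np.max(np.abs(ecdf_hi - F0)), np.max(np.abs(ecdf_lo - F0)))
print(Dn, np.sqrt(n) * Dn >= 1.36)        # True means reject H0 (at roughly the 5% level)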


MIT OpenCourseWare
http://ocw.mit.edu

6.041 / 6.431 Probabilistic Systems Analysis and Applied Probability


Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
