
Introductory Course M2 TSE

Part II: Probability Theory

This material is extracted from the books of Robert B. Ash, Basic Probability
Theory (John Wiley & Sons, 1970), and of Kai Lai Chung and Farid AitSahlia,
Elementary Probability Theory with Stochastic Processes and an Introduction
to Mathematical Finance, 4th edition (Springer, 2003).
This is a preliminary version of the document. Some sections remain
to be completed, so refer to the books above (or other literature) for the
corresponding topics.

1 Basic concepts
The classical definition of probability is the following: the probability of an
event is the number of outcomes favorable to the event, divided by the total
number of outcomes, where all outcomes are equally likely. This definition
is restrictive (finite number of outcomes) and circular (equally likely =
equally probable).
The frequency approach is based on physical observations of the following
type: if an unbiased coin is tossed independently n times, where n is very
large, the relative frequency of heads is likely to be close to 1/2. We would
like to define the probability of an event as the limit of Sn/n, where Sn is
the number of occurrences of the event. However, it is possible that Sn/n
converges to any number between 0 and 1, or has no limit at all.
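A quick way to see this frequency stabilisation at work is to simulate it. The short Python sketch below (the coin-tossing code, the sample sizes, and the seed are illustrative choices, not part of the original text) tosses a fair coin n times and prints the relative frequency of heads.

```python
import random

def relative_frequency_of_heads(n, seed=0):
    # Toss a fair coin n times and return S_n / n,
    # the relative frequency of heads.
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

for n in (10, 1000, 100000):
    print(n, relative_frequency_of_heads(n))
# The printed frequencies are typically close to 1/2 for large n,
# illustrating the frequency interpretation of probability.
```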
We need a rigorous definition of probability to construct a mathematical
theory. We now introduce the basic concepts of mathematical probability
theory.

1.1 Probability space


1.1.1 Sample space
A sample space Ω is a set of points representing possible outcomes of a ran-
dom experiment. Its choice is dictated by the problem under consideration.
Examples:

In the experiment of tossing a single die, we can choose Ω = {1, 2, 3, 4, 5, 6}.
Another choice: Ω = {"N is even", "N is odd"}, but if we are interested,
for example, in whether or not N ≥ 3, this second space is not useful.

To model the number of accidents faced by an insurance company in
one year, we can define the sample space as Ω = {0, 1, 2, . . . }. In this
example, the sample space is infinite (countable).

Let the experiment consist in selecting a person at random and mea-
suring his height. We can choose Ω = R+. Here, Ω has a continuum of
points.

1.1.2 Algebra of events


An event associated with a random experiment is a condition that is satisfied
or not satisfied in a given performance of the experiment. For example, if a
coin is tossed twice, "the number of heads is ≤ 1" is an event.
We can also say that an event corresponds to a yes/no question that
may be answered after the experiment is performed. A "yes" answer is as-
sociated with a subset of the sample space. Let, in the previous example,
Ω = {HH, HT, TH, TT}. The question "Is the number of heads ≤ 1?" can
be answered "yes" or "no". The subset of Ω corresponding to a "yes" answer
is A = {HT, TH, TT}.
Thus, in mathematical terms, an event is defined as a subset of the sample
space. Example:

Experiment: tossing a coin twice. Sample space: Ω = {HH, HT, TH, TT}.
The event "the result of the first toss is equal to the result of the second
toss" corresponds to the subset B = {HH, TT}.

We can form new events from old ones by use of the connectives "or", "and",
and "not". In terms of subsets, this corresponds to the following operations:

or = union: A ∪ B (means A or B or both);
and = intersection: A ∩ B;
not = complement: Ac = Ω \ A.
For example, in the experiment of tossing a single die with Ω = {1, 2, 3, 4, 5, 6}
and N the result, we can consider the following events: A = {N ≥ 3} =
{3, 4, 5, 6} and B = {N is even} = {2, 4, 6}. Then

A ∪ B = {N ≥ 3 or N is even} = {2, 3, 4, 5, 6}

A ∩ B = {N ≥ 3 and N is even} = {4, 6}

Ac = {N is not ≥ 3} = {N < 3} = {1, 2}

Bc = {N is not even} = {N is odd} = {1, 3, 5}

We can also apply these operations to more than two events: A1 ∪ A2 ∪
· · · ∪ An = ∪_{i=1}^{n} Ai (the set of points belonging to at least one of the
events Ai) and A1 ∩ A2 ∩ · · · ∩ An = ∩_{i=1}^{n} Ai (the set of points belonging
to all of the events Ai). We define in the same way the union and the
intersection of an infinite sequence of events.
Two events in a sample space are said to be mutually exclusive or disjoint
if it is impossible that both A and B occur during the same performance of
the experiment. Mathematically, this means that their intersection is empty:
A ∩ B = ∅. In general, the events {Ai} (a finite or infinite collection) are
mutually exclusive if no more than one of them can occur during the same
performance of the experiment:

Ai ∩ Aj = ∅ for i ≠ j.

In some ways the algebra of events is similar to the algebra of real num-
bers, with union corresponding to addition and intersection to multiplication.
For example, the commutative and associative properties hold:

A ∪ B = B ∪ A,    A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∩ B = B ∩ A,    A ∩ (B ∩ C) = (A ∩ B) ∩ C

In many ways the algebra of events differs from the algebra of real num-
bers, as some of the identities below indicate.

A ∪ A = A,    A ∪ Ac = Ω
A ∩ A = A,    A ∩ Ac = ∅
A ∪ ∅ = A,    A ∩ Ω = A
A ∩ ∅ = ∅,    A ∪ Ω = Ω

1.1.3 Class of events
In some situations, we may not have complete information about the out-
comes of an experiment. For example, if the experiment involves tossing a
coin three times, we may record the results of only the first two tosses. In
this case, we cannot consider all subsets of Ω as events.
Indeed, let

Ω = {(a1, a2, a3) | ai ∈ {H, T}, i = 1, 2, 3}

and let us consider the subset A of Ω corresponding to the condition "there
are at least two heads":

A = {(H, H, T), (H, T, H), (T, H, H), (H, H, H)}.

Imagine that after the experiment is performed, we have the following infor-
mation about the outcome:

ω = (H, T, ·).

In this case, we are not able to give a "yes" or "no" answer to the question
"Is ω ∈ A?" (this depends on the result of the last toss, and we miss this
information). So, A is not measurable with respect to the given information.
In contrast, the subset B = {(T, T, T), (T, T, H)}, which corresponds to
the condition "the first two tosses are tails", is an event. Indeed, with the
information about the first two tosses, we are always able to say whether ω
is in B or not, for all possible outcomes ω.
This leads us to consider a particular class of subsets of Ω, called the class
of events. The standard notation for the class of events is F. For reasons of
mathematical consistency, we require that F form a sigma field, which is a
collection of subsets of Ω satisfying the following three requirements.

1. Ω ∈ F

2. A ∈ F implies Ac ∈ F. That is, F is closed under complementation.

3. A1, A2, . . . ∈ F implies ∪_{i=1}^{∞} Ai ∈ F. That is, F is closed under finite
or countable union.

The above conditions also imply that F is closed under finite or countable
intersection, and that the empty set belongs to F (exercise).
Examples of sigma fields:

F = {∅, Ω}

The collection of all subsets of Ω is a sigma field.

Let Ω = {1, 2, 3, 4, 5, 6}. The following collection of subsets is a sigma
field:
F = {∅, Ω, {1, 3, 5}, {2, 4, 6}}

If Ω is a part of R (for instance, R or R+ or [0, 1]), we will typically
consider the sigma field B of Borel sets of Ω. That is, the smallest sigma
field containing the intervals (and, in consequence, unions and intersections
of intervals).

1.1.4 Probability measure

We now consider the assignment of probabilities to events. The probability
of an event should somehow reflect the relative frequency of the event in a
large number of independent repetitions of the experiment. Thus, if A ∈ F,
the probability P(A) should be a number between 0 and 1, with P(∅) = 0
and P(Ω) = 1. Furthermore, if the events A and B are disjoint (cannot
occur at the same time), then the number of occurrences of A ∪ B is the
sum of the number of occurrences of A and the number of occurrences of B,
so we should have P(A ∪ B) = P(A) + P(B). This motivates the following
definition.
A function that assigns a number P(A) to each set A in the sigma field F
is called a probability measure on F, provided that the following conditions
are satisfied:

1. P(A) ≥ 0 for every A ∈ F

2. P(Ω) = 1

3. If A1, A2, . . . is a finite or countable collection of disjoint sets in F, then

P(A1 ∪ A2 ∪ · · · ) = P(A1) + P(A2) + · · ·

Remark. Another reason for considering a class of events F instead of all
subsets of Ω is that in some cases it is impossible to define a probability
measure (in the sense of the definition above) on all subsets, just as it is
impossible to define an area for all subsets of the plane or a length for all
subsets of a line (this is the reason why we consider only the Borel sets in
Rⁿ).
From this definition, we can deduce the following properties of a proba-
bility measure (exercise):

P(∅) = 0
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
If B ⊂ A, then P(B) ≤ P(A).
P(A1 ∪ A2 ∪ · · · ) ≤ P(A1) + P(A2) + · · ·
If the sets An are nondecreasing, i.e. An ⊂ An+1, n ≥ 1, then

P(∪_{n≥1} An) = lim_{n→∞} P(An).

If the sets An are nonincreasing, i.e. An+1 ⊂ An, n ≥ 1, then

P(∩_{n≥1} An) = lim_{n→∞} P(An).

We may now give the underlying mathematical framework for probability
theory.

Definition 1. A probability space is a triple (Ω, F, P), where Ω is a set, F a
sigma field of subsets of Ω, and P a probability measure on F.

Exercises
1. Let Ω = {ω1, ω2, . . . , ωn, . . . } be a countable sample space and F be
the class of all subsets of Ω. To each sample point ωn let us attach an
arbitrary weight pn subject to the conditions

∀n: pn ≥ 0,   Σn pn = 1.

Now for any subset A of Ω, we define its probability to be the sum of
the weights of all points in it. In symbols,

∀A ⊂ Ω,   P(A) = Σ_{ωn ∈ A} pn.

Show that P is indeed a probability measure.

2. If Ω is countably infinite, may all the sample points ωn be equally
likely (that is, may all pn in the previous example be equal)?

3. Let Ω be a plane set (Ω ⊂ R²) with a finite Lebesgue measure 0 <
|Ω| < +∞ (think about a square or a circle for simplicity). Consider
all measurable subsets A of Ω and define

P(A) = |A| / |Ω|.

Show that P is a probability measure.

4. Suppose that the land of a square kingdom is divided into three strips
A, B, C of equal area and suppose the value per unit is in the ratio
of 1 : 3 : 2. For any piece of (measurable) land S in this kingdom, the
relative value with respect to that of the kingdom is then given by the
formula

V(S) = [P(S ∩ A) + 3P(S ∩ B) + 2P(S ∩ C)] / 2

where P is as in the previous exercise. Show that V is a probability
measure.

5. Show that if P and Q are two probability measures defined on the same
class of events F of Ω, then aP + bQ is also a probability measure on
F for any two nonnegative numbers a and b satisfying a + b = 1.

6. If P is a probability measure, show that the function P/2 satisfies
conditions (1) and (3) but not (2) in the definition of a probability
measure. The function P² satisfies (1) and (2) but not necessarily (3);
give a counterexample to (3).

1.2 Independence
Consider the following experiment. A person is selected at random and his
height is recorded. After this the last digit of the licence number of the next
car to pass is noted. If A is the event that the height is over 1m70, and B is
the event that the digit is 7, then, intuitively, A and B are "independent":
the knowledge about the occurrence or nonoccurrence of one of the events
should not influence the odds about the other.

In other words, we expect that the relative frequency of occurrence of B
should be the same whether we consider all repetitions of the experiment or
only those in which A occurs:

NB/N = NAB/NA.

This implies that

NAB/N = (NA/N)(NB/N).

Thus, if we interpret probabilities as relative frequencies, we should have

P(A ∩ B) = P(A)P(B).

This reasoning motivates the following definition.

Definition 2. Two events A and B are independent if P(A ∩ B) = P(A)P(B).


This can be extended to an arbitrary (possibly infinite) collection of
events.

Definition 3. Let Ai, i ∈ I, where I is an arbitrary index set, possibly
infinite, be an arbitrary collection of events on a given probability space
(Ω, F, P). The Ai are said to be independent if for each finite set of distinct
indices i1, . . . , ik ∈ I we have

P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = P(Ai1)P(Ai2) · · · P(Aik)

We list below some properties of independent events (the proofs are left
as exercises).

If in a collection of independent events Ai, i ∈ I, we replace some of the
events Ai by their complements, the new collection is also independent.
For example, if A and B are independent, then A and Bc are independent,
as well as Ac and Bc.

Any subcollection of independent events also forms, of course, a family
of independent events. However, the condition P(A1 ∩ · · · ∩ An) =
P(A1) · · · P(An) does not imply the analogous condition for any smaller
family of events! For example, it is possible to have P(A ∩ B ∩ C) =
P(A)P(B)P(C), but P(A ∩ B) ≠ P(A)P(B), P(A ∩ C) ≠ P(A)P(C),
P(B ∩ C) ≠ P(B)P(C).

Conversely, it is possible to have, for example, P(A ∩ B) = P(A)P(B),
P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C), but P(A ∩ B ∩ C) ≠
P(A)P(B)P(C). Thus A and B are independent, as are A and C, and
also B and C, but A, B, and C are not independent.

Exercises
1. What can you say about the event A if it is independent of itself? If
the events A and B are disjoint and independent, what can you say of
them?

1.3 Conditional probability


If the events A and B are not independent, the knowledge about the occur-
rence of A will change the odds about B. How can we measure this exactly?
In other words, we want to quantify the relative frequency of the occurrence
of B in the trials on which A occurs. We look only at the trials on which A
occurs and count those trials on which B occurs also. This relative frequency
NAB/NA may be represented as

NAB/NA = (NAB/N) / (NA/N).
This discussion suggests the following definition.

Definition 4. Let P(A) > 0. The conditional probability of B given A is
defined as

P(B | A) = P(A ∩ B) / P(A).
Example. Throw two unbiased dice independently. Let A = {sum of the faces = 8}
and B = {faces are equal}. Then

P(B | A) = P(A ∩ B) / P(A) = P({(4, 4)}) / P({(4, 4), (3, 5), (5, 3), (2, 6), (6, 2)})
         = (1/36) / (5/36) = 1/5
Note some consequences of the above definition.

If A and B are independent, then

P(B | A) = P(A ∩ B) / P(A) = P(A)P(B) / P(A) = P(B),

which is in accordance with the intuition.

We have P(A ∩ B) = P(A)P(B | A), and we can extend this formula
to more than two events:

P(A ∩ B ∩ C) = P(A ∩ B)P(C | A ∩ B) = P(A)P(B | A)P(C | A ∩ B)

Similarly,

P(A ∩ B ∩ C ∩ D) = P(A)P(B | A)P(C | A ∩ B)P(D | A ∩ B ∩ C),

and so on.

Example. Three cards are drawn without replacement from an ordinary
deck. Find the probability of not obtaining a heart.
Let Ai = {card i is not a heart}. Then we are looking for

P(A1 ∩ A2 ∩ A3) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) = (39/52)(38/51)(37/50).
We now formulate the most useful results on conditional probabilities.

1.3.1 Theorem of total probability

Theorem 5. Let B1, B2, . . . be a finite or countable family of mutually ex-
clusive and exhaustive events (i.e., the Bi are disjoint and their union is Ω).
If A is any event, then

P(A) = Σi P(Bi)P(A | Bi)

(the sum is taken over those i for which P(Bi) > 0).

Example. Consider the following experiment. We have a biased coin with
probability of heads equal to 1/3. We also have two urns: the first contains
3 white and 2 black balls; the second contains 1 white and 3 black balls.
We toss the coin. If the result is heads, we draw a ball from the first
urn; if the result is tails, we draw a ball from the second urn. What is the
probability that a white ball is drawn?
Let A = {a white ball is drawn}, B1 = {the coin falls heads}, B2 =
{the coin falls tails}. Then

P(A) = P(B1)P(A | B1) + P(B2)P(A | B2) = (1/3)(3/5) + (2/3)(1/4) = 11/30

1.3.2 Bayes' theorem
Notice that under the above assumptions we have

P(Bk | A) = P(A ∩ Bk) / P(A) = P(Bk)P(A | Bk) / Σi P(Bi)P(A | Bi)

This formula is referred to as Bayes' theorem. The quantity P(Bk | A) is
called an a posteriori probability. The reason for this terminology may be
seen in the example below.
Example. Consider the previous experiment with one biased coin and two
urns. Suppose that we did not observe the whole experiment but only the
final result: a black ball is drawn. We would like to estimate the a posteriori
probability that the coin fell heads. Let C = {a black ball is drawn}. We
use Bayes' theorem to compute P(B1 | C):

P(B1 | C) = P(B1)P(C | B1) / [P(B1)P(C | B1) + P(B2)P(C | B2)]
          = (1/3)(2/5) / [(1/3)(2/5) + (2/3)(3/4)] = 4/19

2 Random variables
2.1 Definition of a random variable
Intuitively, a random variable is a quantity that is measured in connection
with a random experiment. If Ω is a sample space, and the outcome of the
experiment is ω, a measuring process is carried out to obtain a number R(ω).
Thus a random variable is a real-valued function on a sample space. Let us
give some examples:

Throw a coin 10 times, and let R be the number of heads. For ω =
HHTHTTHHTH, R(ω) = 6. Another random variable, R1, is the
number of times a head is followed immediately by a tail. For the
outcome above, R1(ω) = 3.

Throw two dice. We may take the sample space to be the set of all
pairs of integers (x, y), x, y = 1, 2, . . . , 6 (36 points in all).

Let R1 = the result of the first toss. Then R1(x, y) = x.

Let R2 = the sum of the two faces. Then R2(x, y) = x + y.

Let R3 = 1 if at least one face is an even number; R3 = 0 otherwise.
Then R3(6, 5) = 1, R3(3, 6) = 1, R3(1, 3) = 0, and so on.

If we are interested in a random variable R, we generally want to know
the probability of events involving R. In general these events are of the form
"R lies in a set B ⊂ R". For instance, "R is less than 5" or "R lies in the
interval [a, b)".
Notation. The event {ω : a ≤ R(ω) < b} will often be abbreviated to
{a ≤ R < b}. We denote its probability by P(a ≤ R < b).
Example. A biased coin is tossed independently n times, with probability
p of coming up heads on a given toss. Let R be the number of heads. Then,
for integers k ≤ l,

P(k ≤ R ≤ l) = Σ_{i=k}^{l} C_n^i p^i (1 − p)^{n−i}.

The formal definition of a random variable is the following.

Definition 6. A random variable on the probability space (Ω, F, P) is a real-
valued function R defined on Ω, such that for every Borel subset B of the
reals, {ω : R(ω) ∈ B} belongs to F.

Remember that Borel sets in R are the sets obtained from the intervals
by applying the operations of union and intersection. The last condition in
the definition means that the assertions of the form "R belongs to a Borel
set B" are events on our probability space.

2.2 Classification of random variables

The way in which the probabilities P(R ∈ B) are calculated depends on the
particular nature of R. In this section we examine some standard classes of
random variables.

2.2.1 Discrete random variables

Definition 7. A random variable R is said to be discrete if the set of possible
values of R is finite or countably infinite.

Let {x1, x2, . . . } be the set of possible values of R. Then R is characterized
by its probability function pR defined by

pR(xi) = P(R = xi), i = 1, 2, . . .

We say that R has masses of probability at the points xi. The probability
function defines the probabilities of all events involving R:

P(R ∈ B) = Σ_{xi ∈ B} P(R = xi) = Σ_{xi ∈ B} pR(xi)

Another way of characterizing R is by means of the distribution function
defined by

FR(x) = P(R ≤ x), x ∈ R.
Example. Let R be the number of heads in two independent tosses of a
coin, with the probability of heads being 0.6 on a given toss. Take Ω =
{HH, HT, TH, TT} with respective probabilities 0.36, 0.24, 0.24, 0.16.
Then R can take three values: 0, 1, or 2. Its probability function is given
by
pR(0) = 0.16, pR(1) = 0.48, pR(2) = 0.36.
The distribution function of R is given by

FR(x) = 0,     x < 0,
        0.16,  0 ≤ x < 1,
        0.64,  1 ≤ x < 2,
        1,     x ≥ 2.
The distribution function of a discrete random variable is piecewise con-
stant, with jumps at the points where R has masses of probability. The size
of the jump at a point is equal to the corresponding probability mass:

FR(x) − FR(x−) = pR(x).

Thus, in the discrete case, if we know pR, we can construct FR, and, con-
versely, given FR, we can construct pR. Knowledge of either function is
sufficient to determine the probability of all events involving R.
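The passage from pR to FR is just a cumulative sum over the mass points. A minimal Python sketch for the two-toss example above (the function name is my own choice):

```python
def distribution_function(masses, x):
    # masses: dict value -> probability mass; returns F_R(x) = P(R <= x).
    return sum(p for v, p in masses.items() if v <= x)

p_R = {0: 0.16, 1: 0.48, 2: 0.36}   # probability function of the example
for x in (-0.5, 0, 0.5, 1, 1.7, 2, 3):
    print(x, distribution_function(p_R, x))
# F_R jumps by p_R(x) at each mass point and is constant in between.
```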

2.2.2 Absolutely continuous random variables

Definition 8. The random variable R is said to be absolutely continuous if
there is a nonnegative function fR defined on R such that

FR(x) ≡ P(R ≤ x) = ∫_{−∞}^{x} fR(y) dy for all real x.

fR is called the density function of R.
From the definition, it follows that the distribution function of an abso-
lutely continuous random variable is continuous. Note that if FR is differen-
tiable at x, then its derivative is given by fR:

d/dx FR(x) = d/dx ∫_{−∞}^{x} fR(y) dy = fR(x).

It can be proved that the probabilities of all events involving R may be
computed in the following way:

P(R ∈ B) = ∫_B fR(y) dy.

In particular, we can see that, in contrast with discrete random variables,
the probability for R to fall exactly at a given point x is zero:

P(R = x) = ∫_x^x fR(y) dy = 0.

We give below some important examples of absolutely continuous random
variables.

Uniform random variable. A uniform random variable on an interval
[a, b] is defined by the density function

fR(x) = 1/(b − a) for x ∈ [a, b],  and 0 otherwise.

R represents a number chosen at random between a and b in a uniform way:
that is, the probability that R will fall into an interval of length c depends
only on c and not on the position of this interval within [a, b]. Indeed, let
Ic ⊂ [a, b] be any interval of length c. Then,

P(R ∈ Ic) = ∫_{Ic} fR(y) dy = ∫_{Ic} 1/(b − a) dy = (1/(b − a)) ∫_{Ic} dy = c/(b − a).

The distribution function of R is given by

FR(x) = 0,                x < a,
        (x − a)/(b − a),  a ≤ x < b,
        1,                x ≥ b.

Notation. The uniform distribution on [a, b] is denoted U([a, b]), and we write
R ∼ U([a, b]).

Normal random variable. R has a normal distribution with parameters
μ ∈ R and σ > 0, written R ∼ N(μ, σ²), if its density function is given by

fR(x) = 1/√(2πσ²) · e^{−(x−μ)²/(2σ²)}

The distribution function of a normal random variable

FR(x) = ∫_{−∞}^{x} 1/√(2πσ²) · e^{−(y−μ)²/(2σ²)} dy

cannot be expressed in terms of elementary functions, but its properties are
well known and its values are listed in tables or may be easily obtained on
a computer. If μ = 0 and σ = 1, we say that R has a standard normal
distribution. There is a standard notation for the distribution function in
this case:

Φ(x) = ∫_{−∞}^{x} 1/√(2π) · e^{−y²/2} dy

Exponential random variable. R has an exponential distribution with
parameter λ > 0, written R ∼ E(λ), if its density function is given by

fR(x) = λe^{−λx},  x ≥ 0,
        0,         x < 0.

The distribution function of R is given by

FR(x) = 1 − e^{−λx},  x ≥ 0,
        0,            x < 0.

2.2.3 Mixed random variables

A random variable need not be discrete or absolutely continuous. There
are also "mixed" distributions that have masses of probability at some points
and are continuous elsewhere. In terms of the distribution function, FR is
piecewise continuous. Typically, FR is also piecewise differentiable, so that
we can identify the intervals where R has a density fR and the points where
R has masses of probability. For example,

FR(x) = 0,             x < 0,
        (x + 30)/200,  0 ≤ x < 120,
        1,             x ≥ 120.

FR is continuous everywhere except at two points: x = 0 and x = 120. This
means that R has masses of probability at these points:

P(R = 0) = FR(0) − FR(0−) = 0.15 − 0 = 0.15,

P(R = 120) = FR(120) − FR(120−) = 1 − 0.75 = 0.25.

On the intervals (−∞, 0), (0, 120), and (120, ∞), R has a density function

gR(x) = FR′(x) = 1/200,  x ∈ (0, 120),
                 0,      x < 0 or x > 120.

The probability that R ∈ B, for a Borel set B, is then computed in a
"mixed" way, combining the formulae for discrete and continuous random
variables:

P(R ∈ B) = ∫_B gR(y) dy + Σ_{xi ∈ B} P(R = xi),

where the xi are the points where R has masses of probability. For instance,
in the example above,

P(R > 100) = ∫_{100}^{∞} gR(y) dy + P(R = 120) = ∫_{100}^{120} (1/200) dy + 0.25 = 0.35.

2.3 Properties of distribution functions

In this section, we list some general properties of the distribution function of
an arbitrary random variable.

Theorem 9. Let F be the distribution function of an arbitrary random vari-
able R. Then

1. F(x) is nondecreasing; that is, a < b implies F(a) ≤ F(b).

2. lim_{x→+∞} F(x) = 1
3. lim_{x→−∞} F(x) = 0

4. F is continuous from the right; that is, lim_{x→x0+} F(x) = F(x0).

5. lim_{x→x0−} F(x) = P(R < x0)

6. P(R = x0) = F(x0) − F(x0−). Thus F is continuous at x0 if and only
if P(R = x0) = 0.

Remark. The random variable R is said to be continuous if its dis-
tribution function F(x) is a continuous function for all x. In any
reasonable case a continuous random variable will have a density (that
is, it will be absolutely continuous), but it is possible to establish the
existence of random variables that are continuous but not absolutely
continuous.

7. Let F be a function from the reals to the reals, satisfying properties 1, 2, 3, and
4 above. Then F is the distribution function of some random variable.
Note that property 2 implies that a density function f satisfies

∫_{−∞}^{∞} f(x) dx = 1.

It can be shown that any nonnegative function f satisfying this condition
(the integral over R is equal to 1) is the density function of some random
variable.

2.4 Joint density functions

We are going to investigate situations in which we deal simultaneously with
several random variables defined on the same sample space. For example,
suppose that a person is selected at random, and his age and weight recorded.
We may take Ω = {(x, y) | x, y ∈ R}. Let R1 be the age of the person selected
and R2 the weight; that is, R1(x, y) = x, R2(x, y) = y. We wish to assign
probabilities to events that involve R1 and R2 simultaneously. For example,
"the person is between 22 and 23 years, and between 60 and 80 kilograms":

{22 ≤ R1 ≤ 23, 60 ≤ R2 ≤ 80}
Definition 10. The joint distribution function of two arbitrary random vari-
ables R1 and R2 on the same probability space is defined by

F12(x, y) = P(R1 ≤ x, R2 ≤ y)

The pair (R1, R2) is said to be absolutely continuous if there is a nonnegative
function f12 defined on R² such that

F12(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f12(u, v) du dv  for all real x, y

f12 is called the density of (R1, R2), or the joint density of R1 and R2.


As for a single random variable, we have the following properties of an
absolutely continuous pair (R1 , R2 ).
For any Borel set B R2 ,

P ((R1 , R2 ) B) = f12 (x, y)dxdy
B

The density function has total mass 1:



f12 (x, y)dxdy = 1.

random vector (R , R , . . . , R )
joint distribution function
In a similar way, we can dene a 1 2 n with

F12...n (x1 , x2 , . . . , xn ) = P (R1 x1 , R2 x2 , . . . , Rn xn )


Example. Let the joint density of (R1, R2) be

f12(x, y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1,  and 0 elsewhere.

(This is the uniform density on the unit square.) Let us calculate the prob-
ability that 1/2 ≤ R1 + R2 ≤ 3/2:

P(1/2 ≤ R1 + R2 ≤ 3/2) = ∫∫_{1/2 ≤ x+y ≤ 3/2} 1 dx dy = 1 − 2 · (1/8) = 3/4

(making a figure can help to compute such integrals).
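The same probability can be estimated by drawing points uniformly in the unit square and counting how many satisfy the constraint; a short Monte Carlo sketch (sample size and seed are arbitrary choices of mine):

```python
import random

rng = random.Random(2)
trials = 200_000
hits = 0
for _ in range(trials):
    x, y = rng.random(), rng.random()   # uniform point in the unit square
    hits += 0.5 <= x + y <= 1.5
print(hits / trials)   # close to 3/4
```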

2.5 Relationship between joint and individual distributions

In this section, we investigate the relationship between joint and individual
distributions of random variables defined on the same probability space.

Question 1. If (R1, R2) is absolutely continuous, are R1 and R2 absolutely
continuous, and, if so, how can the individual densities of R1 and R2 be found
in terms of the joint density?

The answer to this question is positive, and the individual densities (also
called marginal densities) are given by

f1(x) = ∫_{−∞}^{∞} f12(x, y) dy,   f2(y) = ∫_{−∞}^{∞} f12(x, y) dx

Indeed, we have

F1(x) = P(R1 ≤ x) = P(R1 ≤ x, R2 ∈ (−∞, ∞))
      = ∫_{−∞}^{x} ∫_{−∞}^{∞} f12(u, v) dv du = ∫_{−∞}^{x} ( ∫_{−∞}^{∞} f12(u, v) dv ) du = ∫_{−∞}^{x} f1(u) du,

so R1 is absolutely continuous with density f1. We deal with R2 similarly.


In exactly the same way we may establish similar formulae in higher
dimensions; for example,

f12 (x, y) = f123 (x, y, z)dz


f2 (y) = f123 (x, y, z)dxdz

and so on.

Question 2. Given R1, R2 (individually) absolutely continuous, is (R1, R2)
absolutely continuous, and, if so, can the joint density be derived from the
individual densities?

The answer to this question is negative; that is, if R1 and R2 are each
absolutely continuous, then (R1, R2) is not necessarily absolutely continuous.
Furthermore, even if (R1, R2) is absolutely continuous, f1(x) and f2(y) do not
determine f12(x, y). The examples below illustrate these statements.

Example. Let R1 be absolutely continuous with density f. Take R2 ≡ R1;
that is, R2(ω) = R1(ω), ∀ω ∈ Ω. Then R2 is absolutely continuous but
(R1, R2) is not. Indeed, if L denotes the line x = y, we have P((R1, R2) ∈
L) = 1. On the other hand, if (R1, R2) had a density g(x, y), we should have

P((R1, R2) ∈ L) = ∫∫_L g(x, y) dx dy = 0

since L has area 0. This contradiction proves that (R1, R2) cannot have a
density.

Example. It is easy to check that the following two densities

f12(x, y) = (1/4)(1 + xy) for −1 ≤ x ≤ 1, −1 ≤ y ≤ 1, and 0 elsewhere;
g12(x, y) = 1/4 for −1 ≤ x ≤ 1, −1 ≤ y ≤ 1, and 0 elsewhere,

have the same marginal densities:

f1(x) = 1/2 for −1 ≤ x ≤ 1, and 0 elsewhere;   f2(y) = 1/2 for −1 ≤ y ≤ 1, and 0 elsewhere.

Thus the individual densities are not sufficient to determine the joint density.

However, there is one situation where the individual densities do deter-
mine the joint density: when the random variables are independent.

2.6 Independence of random variables

We have considered the notion of independence of events, and this can be
used to define independence of random variables. Namely, we say that ran-
dom variables are independent if events involving these random variables are
independent. The formal definition is the following.

Definition 11. Let R1, . . . , Rn be random variables on (Ω, F, P). R1, . . . , Rn
are said to be independent if for all Borel subsets B1, . . . , Bn of R we have

P(R1 ∈ B1, . . . , Rn ∈ Bn) = P(R1 ∈ B1) · · · P(Rn ∈ Bn).

Note that the last equality in itself does not imply that the events {R1 ∈
B1}, . . . , {Rn ∈ Bn} are independent. However, since we require that this
equality hold for all Borel subsets B1, . . . , Bn, it holds for any subfamily
of the Bi. Indeed, it is sufficient to replace the other ones by (−∞, ∞). For
example, in the case n = 3,

P(R1 ∈ B1, R2 ∈ B2) = P(R1 ∈ B1, R2 ∈ B2, R3 ∈ (−∞, ∞))
= P(R1 ∈ B1)P(R2 ∈ B2)P(R3 ∈ (−∞, ∞)) = P(R1 ∈ B1)P(R2 ∈ B2),

and so on. By the same reasoning, we deduce that if R1, . . . , Rn are inde-
pendent, so are R1, . . . , Rk, for k < n.
If (Ri, i ∈ I) is an arbitrary family of random variables on the space
(Ω, F, P), the Ri are said to be independent if for each finite set of distinct
indices i1, . . . , ik ∈ I, the random variables Ri1, . . . , Rik are independent.

Theorem 12. Let R1, R2, . . . , Rn be independent random variables on a given
probability space. If each Ri is absolutely continuous with density fi, then
(R1, R2, . . . , Rn) is absolutely continuous; also, for all x1, x2, . . . , xn,

f12...n(x1, x2, . . . , xn) = f1(x1)f2(x2) · · · fn(xn).

Thus in this sense the joint density is the product of the individual den-
sities.

2.7 Problems
1. An absolutely continuous random variable R has a density function
f(x) = (1/2)e^{−|x|}.

(a) Sketch the distribution function of R.

(b) Find the probability of each of the following events.

i. {|R| ≤ 2}
ii. {|R| ≤ 2 or R ≥ 0}
iii. {|R| ≤ 2 and R ≥ 1}
iv. {|R| + |R − 3| ≤ 3}
v. {R³ − R² − R − 2 ≥ 0}
vi. {e^{sin R} ≥ 1}
vii. {R is irrational} = {ω : R(ω) is an irrational number}

2. Consider a sequence of five Bernoulli trials. Let R be the number of
times that a head is followed immediately by a tail. For example, if
ω = HHTHT then R(ω) = 2, since a head is followed directly by a
tail at trials 2 and 3, and also at trials 4 and 5. Find the probability
function of R.

3 Expectation
The physical meaning of the expectation of a random variable is the average
value of this variable in a very large number of independent repetitions of the
random experiment. Before we make this definition mathematically precise,
let us consider the following example.
Suppose that we observe the length of a telephone call made from a
specific phone booth at a given time of the day (say, the first call after 12
o'clock). Suppose that the cost R2 of a call depends on its length R1 in the
following way:

If 0 ≤ R1 ≤ 3 (minutes),  R2 = 10 (cents)
If 3 < R1 ≤ 6,            R2 = 20
If 6 < R1 ≤ 9,            R2 = 30

(Assume for simplicity that the telephone is automatically disconnected after
9 minutes.)
Suppose that we repeat the experiment independently N times, where
N is very large, and record the cost of each call. If we take the arithmetic
average of the costs (the total cost divided by N) we expect physically that
it will converge to a number that we should interpret as the long-run average
cost of a call.
Suppose that P (R2 = 10) = 0.6, P (R2 = 20) = 0.25, and P (R2 = 30) =
0.15. If we observe N calls, then, roughly, {R2 = 10} will occur 0.6N times;
the total cost of the calls of this type is 10(0.6N ) = 6N . The calls with
{R2 = 20} will occur approximately 0.25N times, giving rise to a total cost
of 20(0.25N ) = 5N . Finally, {R2 = 30} will occur approximately 0.15N
times, producing a total cost of 30(0.15N ) = 4.5N . The total cost of all calls
is then equal to 6N + 5N + 4.5N = 15.5N . We deduce that the average cost
of a call is 15.5 cents.

Observe how we have computed the average:

[10(0.6N) + 20(0.25N) + 30(0.15N)] / N = 10(0.6) + 20(0.25) + 30(0.15)
                                        = Σ_y y P(R2 = y).

Thus we are taking a weighted average of the possible values of R2, where
the weights are the corresponding probabilities. This suggests the following
definition.

3.1 Expectation of discrete random variables

Definition 13. Let R be a simple random variable, that is, a discrete random
variable with a finite number of possible values. Define the expectation (also
called the expected value, average value, mean value, or mean) of R as

E[R] = Σ_x x P(R = x)

Example. Let R have a binomial distribution with parameters n and p (and
q = 1 − p). Then

E[R] = Σ_{k=0}^{n} k C_n^k p^k q^{n−k} = Σ_{k=1}^{n} n!/((k−1)!(n−k)!) p^k q^{n−k}

     = np Σ_{k=1}^{n} (n−1)!/((k−1)!(n−k)!) p^{k−1} q^{n−k} = np Σ_{l=0}^{n−1} (n−1)!/(l!(n−1−l)!) p^l q^{n−1−l}

     = np Σ_{l=0}^{n−1} C_{n−1}^l p^l q^{n−1−l} = np(p + q)^{n−1} = np

Example. Let R be a simple random variable with probability function

x      | −3   −2    0    10    15
pR(x)  | 0.1  0.35  0.2  0.05  0.3

Then its expectation is given by

E[R] = (−3) · 0.1 + (−2) · 0.35 + 0 · 0.2 + 10 · 0.05 + 15 · 0.3 = 4
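This weighted average is a one-line computation; a small Python check of the example (the dictionary encoding is my own):

```python
p_R = {-3: 0.1, -2: 0.35, 0: 0.2, 10: 0.05, 15: 0.3}   # probability function
expectation = sum(x * p for x, p in p_R.items())
print(expectation)   # 4.0 (up to floating-point rounding)
```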

If R is discrete with infinitely (countably) many possible values, the ex-
pectation is defined in the same way, but there is a little complication:
an infinite sum is not always convergent. This leads to the following con-
struction. Let R+ = max(R, 0) and R− = max(−R, 0) be the positive and
negative parts of R. We have R = R+ − R−. For example, for the simple
random variable R above, the probability functions of R+ and R− are given
by

x      | 0                    10    15
p+(x)  | 0.1+0.35+0.2 = 0.65  0.05  0.3

x      | 3    2     0
p−(x)  | 0.1  0.35  0.2+0.05+0.3 = 0.55

We define

E[R+] = Σ_x x P(R+ = x),   E[R−] = Σ_x x P(R− = x)

Since R+ and R− take on only nonnegative values, these sums are always
well defined (they may be finite or equal to +∞). Now we define

E[R] = E[R+] − E[R−]

if this is not of the form +∞ − ∞. That is, the expectation of R exists if
E[R+] and E[R−] are not both equal to +∞. It may be finite, or equal
to +∞ or −∞. The possible cases are summarized in the following table.

                   | E[R+] = a (finite)  | E[R+] = +∞
E[R−] = b (finite) | E[R] = a − b        | E[R] = +∞
E[R−] = +∞         | E[R] = −∞           | E[R] does not exist
Example. Let R have a Poisson distribution with parameter λ > 0. Its
probability function is given by

pR(n) = e^{−λ} λ^n / n!,   n = 0, 1, 2, . . .

Let us calculate the expectation of R:

E[R] = Σ_{n=0}^{∞} n e^{−λ} λ^n/n! = e^{−λ} Σ_{n=1}^{∞} λ^n/(n−1)! = λ e^{−λ} Σ_{k=0}^{∞} λ^k/k! = λ e^{−λ} e^{λ} = λ

3.2 Expectation of absolutely continuous random variables

If R is absolutely continuous, the definition of the expectation is similar, but
the sum is replaced by an integral, and P(R = x) by the density fR(x).

Definition 14. Let R be an absolutely continuous random variable with
density fR(x). The expectation of R is defined as

E[R] = ∫_{−∞}^{∞} x fR(x) dx

if this integral is well defined; that is, if E[R+] = ∫_0^{∞} x fR(x) dx and
E[R−] = ∫_{−∞}^{0} (−x) fR(x) dx are not both equal to +∞.
Note. Don't confuse, however, fR(x) and P(R = x). For an absolutely
continuous random variable, the probability P(R = x) is zero for all x, but
the density is a non-zero function. The density fR(x) is not a probability: in
particular, it need not be ≤ 1. The quantity which represents a probability
in this expression is fR(x)dx: informally speaking, this is the probability that
R belongs to the infinitesimal interval (x, x + dx).

Example. Let R be a uniform random variable on [a, b] with density function

fR(x) = 1/(b − a),  x ∈ [a, b],
        0,          otherwise.

Then

E[R] = ∫_a^b x · 1/(b − a) dx = (1/(b − a)) [x²/2]_a^b = (1/(b − a)) (b² − a²)/2 = (a + b)/2
Example. If R is an exponential random variable with parameter λ then

E[R] = ∫_0^{∞} x λe^{−λx} dx = −∫_0^{∞} x (e^{−λx})′ dx = [−x e^{−λx}]_0^{∞} + ∫_0^{∞} e^{−λx} dx
     = [−e^{−λx}/λ]_0^{∞} = 1/λ
Example. Let R be a standard normal random variable, R ∼ N(0, 1). Then

E[R] = (1/√(2π)) ∫_{−∞}^{∞} x e^{−x²/2} dx = 0

since the integrand is an odd function of x. If R ∼ N(μ, σ²) then

E[R] = ∫_{−∞}^{∞} x · 1/√(2πσ²) e^{−(x−μ)²/(2σ²)} dx = ∫_{−∞}^{∞} (y + μ) · 1/√(2πσ²) e^{−y²/(2σ²)} dy

     = ∫_{−∞}^{∞} y · 1/√(2πσ²) e^{−y²/(2σ²)} dy + μ ∫_{−∞}^{∞} 1/√(2πσ²) e^{−y²/(2σ²)} dy = 0 + μ · 1 = μ

On the last line, the first integral is equal to zero because the integrand is
odd, and the second integral is equal to 1 because this is the total mass of
the density function of an N(0, σ²) random variable.
Thus the meaning of the parameter μ of a normal random variable is its
expectation.
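These closed-form expectations are easy to confront with simulated averages, in the spirit of the frequency interpretation. A small Monte Carlo sketch with numpy (the parameter values, seed, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

a, b = 2.0, 5.0
lam = 0.5
mu, sigma = 10.0, 3.0

print(rng.uniform(a, b, n).mean(), "vs", (a + b) / 2)        # uniform on [a, b]
print(rng.exponential(1 / lam, n).mean(), "vs", 1 / lam)     # exponential(lambda)
print(rng.normal(mu, sigma, n).mean(), "vs", mu)             # normal N(mu, sigma^2)
```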
Remark. It is possible for the expectation to be infinite, or not to exist at all.
For example, let

fR(x) = 1/x²,  x ≥ 1,
        0,     x < 1.

Then

E[R] = ∫_{−∞}^{∞} x fR(x) dx = ∫_1^{∞} x · (1/x²) dx = ∞

As another example, let fR(x) = 1/(2x²), |x| ≥ 1; fR(x) = 0, |x| < 1.
Then

E[R+] = ∫_0^{∞} x fR(x) dx = (1/2) ∫_1^{∞} x · (1/x²) dx = ∞

E[R−] = ∫_{−∞}^{0} (−x) fR(x) dx = (1/2) ∫_{−∞}^{−1} (−x) · (1/x²) dx = ∞

Thus E[R] does not exist.

3.3 Expectation of mixed random variables

If R is a mixed random variable such that its distribution function is piecewise
differentiable with derivative g(x) and has jumps at the points x1, x2, . . . ,
then the expectation of R is defined as follows:

E[R] = ∫_{−∞}^{∞} x g(x) dx + Σ_{xi} xi P(R = xi)

if both terms are well defined.
Example. If R is the mixed random variable from the example in Sec-
tion 2.2.3 then

E[R] = ∫_0^{120} x · (1/200) dx + 0 · P(R = 0) + 120 · P(R = 120)
     = (1/200)(120²/2) + 120 · 0.25 = 66

3.4 General moments of random variables

If R1 : Ω → R is a random variable, and g : R → R a real-valued function on
R, then R2 = g(R1) : Ω → R is also a random variable (under some conditions
on g; for instance, this is true if g is continuous or piecewise continuous).
Let us consider the example of phone calls given at the beginning of
Section 3. The cost R2 of a call depends on the length R1 of the call, so that
R2 = g(R1) with

g(x) = 10,  if 0 ≤ x ≤ 3
       20,  if 3 < x ≤ 6
       30,  if 6 < x ≤ 9
       0,   otherwise

This example shows that we may be interested in expectations of the
form E[g(R)].
Suppose that R1 is discrete with possible values x1, x2, . . . . Then with
probability P(R1 = xi) we have R1 = xi, hence R2 = g(xi). So, the expecta-
tion of g(R1) should be given by

E[g(R1)] = E[R2] = Σ_{xi} g(xi) P(R1 = xi)

Similarly, if R is absolutely continuous with density fR, then

E[g(R)] = ∫_{−∞}^{∞} g(x) fR(x) dx

If we have an n-dimensional situation, for instance R0 = g(R1, . . . , Rn), the
preceding formulae generalize in a natural way. If R1, . . . , Rn are discrete,

E[g(R1, . . . , Rn)] = Σ_{x1, . . . , xn} g(x1, . . . , xn) P(R1 = x1, . . . , Rn = xn)

If (R1, . . . , Rn) is absolutely continuous with density f12...n, then

E[g(R1, . . . , Rn)] = ∫ · · · ∫ g(x1, . . . , xn) f12...n(x1, . . . , xn) dx1 · · · dxn

Of course, in all the definitions above, we require that the corresponding series
or integrals are well defined.

3.4.1 Terminology
If R is a random variable, the k-th moment of R (k > 0, not necessarily an
integer) is defined by

mk = E[R^k]

if the expectation exists. Thus

mk = Σ_x x^k pR(x)               if R is discrete
     ∫_{−∞}^{∞} x^k fR(x) dx     if R is absolutely continuous

The first moment is simply the expectation: m1 = E[R].
The k-th central moment of R (k > 0) is defined by

ck = E[(R − E[R])^k]
   = Σ_x (x − m1)^k pR(x)               if R is discrete
     ∫_{−∞}^{∞} (x − m1)^k fR(x) dx     if R is absolutely continuous

if E[R] is finite and the expectation in question exists. Note that the first
central moment is zero: c1 = E[R − m1] = m1 − m1 = 0.
The second central moment E[(R − E[R])²] is called the variance of R,
written σ² or Var(R). The positive square root of the variance, σ = √Var(R),
is called the standard deviation of R.
The variance may be interpreted as a measure of dispersion. A large
variance corresponds to a high probability that R will fall far from its mean,
while a small variance indicates that R is likely to be close to its mean.
The quantities E[g(R)], with g other than x^k or (x − m1)^k, are sometimes
called general moments of R.

3.4.2 Moment generating function

The moments of order k = 1, 2, . . . of a random variable R may be calculated
using the so-called moment generating function of R defined as

MR(t) = E[e^{Rt}],   t ∈ R

wherever this expectation exists. Provided MR(t) exists in an open interval
around t = 0, the k-th moment is given by

E[R^k] = MR^{(k)}(0) = d^k MR(t)/dt^k |_{t=0}

For example, MR′(t) = E[R e^{Rt}], hence MR′(0) = E[R]. Similarly, MR″(t) =
E[R² e^{Rt}] yields MR″(0) = E[R²], and so on.

Example. Let R ∼ E(λ). Then

MR(t) = E[e^{Rt}] = ∫_0^{∞} e^{xt} λ e^{−λx} dx = λ ∫_0^{∞} e^{−(λ−t)x} dx = λ/(λ − t)

provided t < λ. We have

MR′(t) = λ/(λ − t)²,   MR″(t) = 2λ/(λ − t)³,   MR‴(t) = 6λ/(λ − t)⁴

Thus we obtain

E[R] = MR′(0) = 1/λ
E[R²] = MR″(0) = 2/λ²
E[R³] = MR‴(0) = 6/λ³

and so on. In particular, Var(R) = E[R²] − (E[R])² = 2/λ² − 1/λ² = 1/λ².
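Differentiating the moment generating function at t = 0 can also be done symbolically, which is a convenient way to check such computations. A sketch using sympy (a widely available symbolic library; the variable names are my own choices):

```python
import sympy as sp

t, lam = sp.symbols("t lam", positive=True)
M = lam / (lam - t)          # MGF of the exponential distribution E(lam)

moments = [sp.diff(M, t, k).subs(t, 0) for k in (1, 2, 3)]
print(moments)               # [1/lam, 2/lam**2, 6/lam**3]

mean = moments[0]
variance = sp.simplify(moments[1] - mean**2)
print(variance)              # 1/lam**2
```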

Example. Let R ∼ P(λ). Then

MR(t) = E[e^{Rt}] = Σ_{n=0}^{∞} e^{nt} e^{−λ} λ^n/n! = e^{−λ} Σ_{n=0}^{∞} (λe^t)^n/n! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}

Let us compute the first two moments of R. We have

MR′(t) = λ e^t e^{λ(e^t − 1)},   MR″(t) = λ e^t e^{λ(e^t − 1)} + λ² e^{2t} e^{λ(e^t − 1)}

Therefore,

E[R] = MR′(0) = λ
E[R²] = MR″(0) = λ + λ²

Then Var(R) = E[R²] − (E[R])² = λ + λ² − λ² = λ.

3.5 Properties of expectation
In this section we list several basic properties of the expectation of a random
variable. (We will always assume that all expectations which appear in the
properties listed below exist.)

1. Let R1, . . . , Rn be random variables on a given probability space. Then

E[R1 + · · · + Rn] = E[R1] + · · · + E[Rn]

(if +∞ and −∞ do not both appear in the sum E[R1] + · · · + E[Rn]).
2. If E[R] exists and a is any real number, then E[aR] exists and

E[aR] = aE[R]

Basically, properties 1 and 2 say that the expectation is linear.

3. If R1 ≤ R2, then E[R1] ≤ E[R2].

4. If R ≥ 0 and E[R] = 0, then R is zero almost surely; that is, P(R =
0) = 1.

5. If Var(R) = 0, then R is essentially constant; more precisely, R = E[R]
almost surely.

This is a corollary of the previous property. Indeed, from E[(R − m1)²] =
0 we conclude that (R − m1)² = 0 almost surely, since (R − m1)² is
nonnegative. Thus R = m1 almost surely.

6. Let R1, . . . , Rn be independent random variables. If one of the following
conditions is satisfied:

a) all Ri are nonnegative, or

b) all E[Ri] are finite,

then
E[R1 R2 · · · Rn] = E[R1]E[R2] · · · E[Rn]

7. Let R be a random variable with finite mean m and variance σ² (pos-
sibly infinite). If a and b are real numbers, then

Var(aR + b) = a²σ²

8. Let R1, . . . , Rn be independent random variables, each with finite ex-
pectation. Then

Var(R1 + · · · + Rn) = Var(R1) + · · · + Var(Rn)

Corollary. If R1, . . . , Rn are independent, each with finite expecta-
tion, and a1, . . . , an, b are real numbers, then

Var(a1R1 + · · · + anRn + b) = a1² Var(R1) + · · · + an² Var(Rn)

3.6 Correlation

If R1 and R2 are random variables on a given probability space, we define
their covariance as

Cov(R1, R2) = E[(R1 − E[R1])(R2 − E[R2])] = E[R1R2] − E[R1]E[R2]

Theorem 15. If R1 and R2 are independent, then Cov(R1, R2) = 0, but not
conversely.

Proof. If R1 and R2 are independent, then E[R1R2] = E[R1]E[R2], hence
Cov(R1, R2) = 0.
For the converse, let θ be uniformly distributed between 0 and 2π. Define
R1 = cos θ, R2 = sin θ. We have

E[R1] = E[cos θ] = ∫_0^{2π} cos(x) · (1/2π) dx = 0

E[R2] = E[sin θ] = ∫_0^{2π} sin(x) · (1/2π) dx = 0

E[R1R2] = E[cos θ sin θ] = ∫_0^{2π} cos(x) sin(x) · (1/2π) dx = 0

Thus Cov(R1, R2) = 0. However, R1 and R2 are not independent. Indeed,

P(R1 > √2/2) = P(0 < θ < π/4 or 7π/4 < θ < 2π) = 1/4

P(R2 > √2/2) = P(π/4 < θ < 3π/4) = 1/4

But

P(R1 > √2/2, R2 > √2/2) = 0 ≠ P(R1 > √2/2) P(R2 > √2/2)

Definition 16. Let R1 and R2 be random variables defined on a given proba-
bility space. If Var(R1) > 0, Var(R2) > 0, we define the correlation coefficient
of R1 and R2 as

ρ(R1, R2) = Cov(R1, R2) / √(Var(R1) Var(R2))

By Theorem 15, if R1 and R2 are independent, they are uncorrelated;
that is, ρ(R1, R2) = 0, but not conversely.
It can be shown that

−1 ≤ ρ(R1, R2) ≤ 1

3.7 Indicator function

In this section, we introduce the useful notion of indicator functions.

Definition 17. The indicator of an event A is a random variable IA defined
as follows:

IA(ω) = 1, if ω ∈ A,
        0, if ω ∉ A.

Note that this is a (simple) discrete random variable since it has only two
possible values. Its expectation is given by

E[IA] = 1 · P(IA = 1) + 0 · P(IA = 0),

where

P(IA = 1) ≡ P({ω | IA(ω) = 1}) = P({ω | ω ∈ A}) ≡ P(A).

So, the expectation of IA is equal to the probability of A:

E[IA] = P(A).

Example. A single unbiased die is tossed independently n times. Let R1
be the number of 1's obtained, and R2 the number of 2's. Find E[R1R2].
Intuitively, the random variables R1 and R2 are not independent (if we obtain
a 1, we cannot obtain a 2 at the same time), so the direct evaluation of the
expectation E[R1R2] is not easy. The method of indicators allows us to greatly
simplify this problem.

If Ai is the event that the i-th toss results in a 1, and Bi the event that the
i-th toss results in a 2, then

R1 = IA1 + · · · + IAn,
R2 = IB1 + · · · + IBn.

Hence

E[R1R2] = Σ_{i,j=1}^{n} E[IAi IBj].

Now if i ≠ j, IAi and IBj are independent (exercise); hence

E[IAi IBj] = E[IAi]E[IBj] = P(Ai)P(Bj) = 1/36.

If i = j, Ai and Bi are disjoint, since the i-th toss cannot simultaneously
result in a 1 and in a 2. Thus IAi IBi = I_{Ai ∩ Bi} = 0. Thus

E[R1R2] = n(n − 1)/36

since there are n(n − 1) ordered pairs (i, j) of integers belonging to {1, 2, . . . , n}
such that i ≠ j.
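A direct simulation of the n tosses confirms the formula n(n − 1)/36. The Python sketch below uses n = 5; the sample size, seed, and function name are arbitrary choices of mine.

```python
import random

def estimate_E_R1R2(n, trials=200_000, seed=4):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        tosses = [rng.randint(1, 6) for _ in range(n)]
        r1 = tosses.count(1)      # number of 1's
        r2 = tosses.count(2)      # number of 2's
        total += r1 * r2
    return total / trials

n = 5
print(estimate_E_R1R2(n), "vs exact", n * (n - 1) / 36)
```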

4 Conditional probability and expectation

4.1 Conditional expectation given A with P(A) > 0
If R is discrete,

E[R | A] = Σ_x x P(R = x | A) = Σ_x x P({R = x} ∩ A) / P(A) = E[R IA] / P(A)

For any R, not necessarily discrete, define

E[R | A] = E[R IA] / P(A)
Example. Let R be the result of the toss of a single die and A be the event
"the result is even". Then

E[R | A] = Σ_{n=1}^{6} n P(R = n | A)

We have

P(R = 1 | A) = P(R = 3 | A) = P(R = 5 | A) = 0

and

P(R = 2 | A) = P(R = 2 and R is even) / P(R is even) = P(R = 2) / P(R is even)
             = (1/6)/(1/2) = 1/3

and also P(R = 4 | A) = P(R = 6 | A) = 1/3. Thus

E[R | A] = 2 · (1/3) + 4 · (1/3) + 6 · (1/3) = 4
We have the following consequence of the formula of total probability.

Theorem of total expectation. Let B1, B2, . . . be a finite or countable
family of mutually exclusive and exhaustive events. If R is any random
variable, then

E[R] = Σ_i P(Bi) E[R | Bi]

We say that R is independent from A if all the events involving R are
independent from A; that is, {R ∈ B} and A are independent for all Borel
sets B.

Property 18. If R is independent from A then

E[R | A] = E[R]
Example. Let N be a discrete random variable with possible values n =
0, 1, 2, . . . (for instance, N ∼ P(λ)) and let X1, X2, . . . be i.i.d. random vari-
ables, independent from N. Define

S = Σ_{i=1}^{N} Xi

Let us compute the expectation of S.

E[S] = E[Σ_{i=1}^{N} Xi] = Σ_{n=0}^{∞} P(N = n) E[Σ_{i=1}^{N} Xi | N = n]

     = Σ_{n=0}^{∞} P(N = n) E[Σ_{i=1}^{n} Xi | N = n] = Σ_{n=0}^{∞} P(N = n) E[Σ_{i=1}^{n} Xi]

     = Σ_{n=0}^{∞} P(N = n) n E[X1] = E[X1] Σ_{n=0}^{∞} n P(N = n) = E[X1]E[N]

For example, if N ∼ P(200) and X1 ∼ E(1/1000) then E[S] = 1000 · 200 =
2 · 10^5.
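The identity E[S] = E[X1]E[N] is easy to check by simulating the compound sum for the numerical example above (N ∼ P(200) and Xi exponential with mean 1000). A numpy sketch, with an arbitrary sample size and seed:

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 20_000

totals = np.empty(trials)
for k in range(trials):
    n = rng.poisson(200)                          # N ~ Poisson(200)
    totals[k] = rng.exponential(1000, n).sum()    # sum of n i.i.d. Exp with mean 1000

print(totals.mean())    # close to E[X1] * E[N] = 1000 * 200 = 200000
```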

4.2 Conditional expectation with respect to a random
variable

If R has an absolutely continuous distribution, then P(R = x) = 0 for all
x, so the conditional probability P(A | R = x) and the conditional expectation
E[R1 | R = x] cannot be defined as before. However, intuitively, these
quantities make sense. For example, let R1 and R2 be independent random
variables with the same distribution U([0, 1]). Then we would like to write,
for example,

P(R1 + R2 ≤ 1 | R1 = x) = P(x + R2 ≤ 1) = P(R2 ≤ 1 − x) = 1 − x

or

E[R1R2 | R1 = x] = E[xR2 | R1 = x] = x E[R2] = x/2

The rigorous definition of conditional probability and expectation with re-
spect to events of probability zero of the type {R = x} is rather involved
and is outside the scope of these lectures.
Instead, we give the continuous equivalents of the theorems of total prob-
ability and total expectation and define the notion of conditional density
consistent with these theorems.

Theorem 19. Let R be an absolutely continuous random variable with den-
sity function fR(x). Let A be any event and R1 any random variable. Then

P(A) = ∫_{−∞}^{∞} P(A | R = x) fR(x) dx

E[R1] = ∫_{−∞}^{∞} E[R1 | R = x] fR(x) dx

Definition 20. (conditional density)

Example. A nonnegative number R1 is chosen with the density f1(x) =
x e^{−x}, x ≥ 0; f1(x) = 0, x < 0. If R1 = x, a number R2 is chosen with
uniform density between 0 and x. Find P(R1 + R2 ≤ 2).

By the theorem of total probability,

P(R1 + R2 ≤ 2) = ∫_{−∞}^{∞} P(R1 + R2 ≤ 2 | R1 = x) f1(x) dx
               = ∫_0^{∞} P(x + R2 ≤ 2 | R1 = x) x e^{−x} dx
               = ∫_0^1 (1) x e^{−x} dx + ∫_1^2 P(R2 ≤ 2 − x | R1 = x) x e^{−x} dx + ∫_2^{∞} (0) x e^{−x} dx

We have

P(R2 ≤ 2 − x | R1 = x) = (2 − x)/x   for 1 ≤ x ≤ 2.

Therefore

P(R1 + R2 ≤ 2) = ∫_0^1 x e^{−x} dx + ∫_1^2 ((2 − x)/x) x e^{−x} dx = 1 − 2e^{−1} + e^{−2}

Notation. If E[R2 | R1 = x] = g(x), we write E[R2 | R1] = g(R1). So, the
conditional expectation with respect to a random variable is itself a random
variable (and not a constant, as an ordinary expectation is).

Example. Let, as in the previous example, R1 have density f1(x) = x e^{−x},
x ≥ 0, and let R2 be uniform on [0, x] given R1 = x. Then we have

E[R2 | R1 = x] = ∫_0^x y (1/x) dy = x/2

E[e^{R2} | R1 = x] = ∫_0^x e^y (1/x) dy = (e^x − 1)/x

Thus we write

E[R2 | R1] = R1/2

E[e^{R2} | R1] = (e^{R1} − 1)/R1

4.3 Conditional expectation with respect to a σ-field
Definition 21. Let X be a random variable on a probability space (Ω, F, P)
such that E[|X|] < ∞, and let G ⊂ F be a σ-field. The conditional expectation
of X given G (written E[X | G]) is a random variable Z satisfying the
following properties:

Z ∈ G (Z is G-measurable);

∀Y ∈ G, Y bounded, E[XY] = E[ZY].

Remark. The random variable defined above is unique in the sense that if Z1
and Z2 satisfy the properties of the conditional expectation E[X | G], then
Z1 = Z2 almost surely (P{ω | Z1(ω) ≠ Z2(ω)} = 0).

If X is square integrable (that is, E[|X|²] < ∞), the conditional expecta-
tion has a geometrical interpretation. Indeed, the set L² of square-integrable
random variables on a given probability space may be considered as a vector
space with the norm ‖X‖₂ = √(E[|X|²]) and the associated scalar product
⟨X, Y⟩ = E[XY]. For any σ-field G, the random variables which are G-
measurable form a vector subspace of L² (since X, Y ∈ G, λ ∈ R implies
λX + Y ∈ G). Let us denote this subspace by VG. The conditional expecta-
tion E[X | G] is then an orthogonal projection of X on VG.
Indeed, what is an orthogonal projection of a vector v ∈ V on a subspace
G ⊂ V? It is a vector vG which lies in G and such that v − vG is orthogonal
to G. The second condition means that for all w ∈ G, (v − vG) ⊥ w in the
sense ⟨v − vG, w⟩ = 0. Equivalently, for all w ∈ G, ⟨v, w⟩ = ⟨vG, w⟩.
In our case, the vector space V is L², the vectors are random variables, and
G is the set VG of G-measurable random variables. Looking at the definition
of E[X | G], we see that this is exactly the description of the orthogonal
projection of X on VG.
In other words, the conditional expectation E[X | G] is the G-measurable
random variable closest to X in the sense of the L²-norm (that is, the
random variable Z ∈ G which minimizes E[(Z − X)²]).

Keeping in mind this geometrical interpretation helps to understand some
of the following properties of the conditional expectation. These properties
are essential in the theory of stochastic processes and thus are extensively
used in mathematical finance.

Properties. Let X, Y, Z be random variables on a given probability space
(Ω, F, P). Let G, H ⊂ F be σ-algebras of events.

1. E[aX + bY | G] = aE[X | G] + bE[Y | G], ∀a, b ∈ R.
(The projection is linear.)

2. If Y is independent of G then E[Y | G] = E[Y].
(Don't confuse "independent" and "orthogonal in the sense of L²"! Here,
Y is not necessarily orthogonal to G. To understand this property, use
the interpretation of G as information: if Y is independent of G, the
information contained in G does not influence the distribution of Y, so
the conditioning in the expectation may be dropped.)

3. If Z ∈ G then E[XZ | G] = Z E[X | G]. In particular, E[Z | G] = Z.
(The projection on G of a vector v which already belongs to G is v
itself. In terms of information, if Z is G-measurable, it means that
given the information G the value of Z is known; thus Z is treated as
a known constant in the conditional expectation given G.)

4. If G ⊂ H then E[E[X | H] | G] = E[X | G].
(To project on a smaller subspace is equivalent to projecting first on a
greater subspace and then projecting the result on the smaller subspace.)
The comments above do not constitute the proofs of these properties but
only provide some intuition about them. To prove these properties, we have
to use the definition of the conditional expectation. For example, let us prove
property 2.
We need to prove that the (constant) random variable Z = E[Y] satisfies
the two properties of the conditional expectation of Y given G. Since it is
constant, it is G-measurable. Let G be an arbitrary bounded random variable
with G ∈ G. We have to check that E[Y G] = E[E[Y] G]. By independence of
Y with respect to G, the left-hand side is equal to E[Y]E[G]. The right-hand
side is equal to the same expression, simply by taking the constant E[Y] out
of the expectation.
We omit here the proofs of the other properties of the conditional expec-
tation.

5 Gaussian vectors
(to be completed)

