
Lectures on

Measure Theory and Probability


by

H.R. Pitt

Tata Institute of Fundamental Research, Bombay


1958
(Reissued 1964)

Notes by

Raghavan Narasimhan

No part of this book may be reproduced in any form by print, microfilm or any other means without written permission from the Tata Institute of Fundamental Research, Colaba, Bombay 5

Tata Institute of Fundamental Research, Bombay


1957
(Reissued 1963)

Contents

1 Measure Theory
    1.  Sets and operations on sets
    2.  Sequence of sets
    3.  Additive system of sets
    4.  Set Functions
    5.  Continuity of set functions
    6.  Extensions and contractions of additive functions
    7.  Outer Measure
    8.  Classical Lebesgue and Stieltjes measures
    9.  Borel sets and Borel measure
    10. Measurable functions
    11. The Lebesgue integral
    12. Absolute Continuity
    13. Convergence theorems
    14. The Riemann Integral
    15. Stieltjes Integrals
    16. L-Spaces
    17. Mappings of measures
    18. Differentiation
    19. Product Measures and Multiple Integrals

2 Probability
    1.  Definitions
    2.  Function of a random variable
    3.  Parameters of random variables
    4.  Joint probabilities and independence
    5.  Characteristic Functions
    6.  Sequences and limits of ...
    7.  Examples of characteristic functions
    8.  Conditional probabilities
    9.  Sequences of Random Variables
    10. The Central Limit Problem
    11. Cumulative sums
    12. Random Functions
    13. Random Sequences and Convergence Properties
    14. Markoff Processes
    15. L2-Processes
    16. Ergodic Properties
    17. Random function with independent increments
    18. Doob Separability and extension theory

Chapter 1

Measure Theory
1. Sets and operations on sets
We consider a space $\mathbf{X}$ of elements (or points) $x$ and systems of its subsets $X, Y, \dots$. The basic relations between sets and the operations on them are defined as follows:

(a) Inclusion: We write $X \subset Y$ (or $Y \supset X$) if every point of $X$ is contained in $Y$. Plainly, if $0$ is the empty set, $0 \subset X \subset \mathbf{X}$ for every subset $X$. Moreover, $X \subset X$, and $X \subset Y$, $Y \subset Z$ imply $X \subset Z$. $X = Y$ if $X \subset Y$ and $Y \subset X$.

(b) Complements: The complement $X'$ of $X$ is the set of points of $\mathbf{X}$ which do not belong to $X$. Then plainly $(X')' = X$ and $X' = Y$ if $Y' = X$. In particular, $0' = \mathbf{X}$, $\mathbf{X}' = 0$. Moreover, if $X \subset Y$, then $Y' \subset X'$.

(c) Union: The union of any system of sets is the set of points $x$ which belong to at least one of them. The system need not be finite or even countable. The union of two sets $X$ and $Y$ is written $X \cup Y$, and obviously $X \cup Y = Y \cup X$. The union of a finite or countable sequence of sets $X_1, X_2, \dots$ can be written $\bigcup_{n=1}^{\infty} X_n$.

(d) Intersection: The intersection of a system of sets is the set of points which belong to every set of the system. For two sets it is written $X \cap Y$ (or $X.Y$) and for a sequence $\{X_n\}$, $\bigcap_{n=1}^{\infty} X_n$. Two sets are disjoint if their intersection is $0$; a system of sets is disjoint if every pair of sets of the system is. For disjoint systems we write $X + Y$ for $X \cup Y$ and $\sum X_n$ for $\bigcup X_n$, this notation implying that the sets are disjoint.

(e) Difference: The difference $X.Y'$ or $X - Y$ between two sets $X$ and $Y$ is the set of points of $X$ which do not belong to $Y$. We shall use the notation $X - Y$ for the difference only if $Y \subset X$.

It is clear that the operations of taking unions and intersections are both commutative and associative. Also they are related to the operation of taking complements by
$$X.X' = 0,\qquad X + X' = \mathbf{X},\qquad (X \cup Y)' = X'.Y',\qquad (X.Y)' = X' \cup Y'.$$
More generally,
$$\Big(\bigcup X\Big)' = \bigcap X',\qquad \Big(\bigcap X\Big)' = \bigcup X'.$$
The four operations defined above can be reduced to two in several different ways. For example, they can all be expressed in terms of unions and complements. In fact there is complete duality, in the sense that any true proposition about sets remains true if we interchange
$$0 \text{ and } \mathbf{X},\qquad \cup \text{ and } \cap,\qquad \subset \text{ and } \supset,$$
and leave $=$ and $'$ unchanged all through.

A countable union can be written as a sum by the formula
$$\bigcup_{n=1}^{\infty} X_n = X_1 + X_1'.X_2 + X_1'.X_2'.X_3 + \cdots$$

2. Sequence of sets

A sequence of sets $X_1, X_2, \dots$ is increasing if
$$X_1 \subset X_2 \subset X_3 \subset \cdots$$
and decreasing if
$$X_1 \supset X_2 \supset X_3 \supset \cdots$$
The upper limit, $\limsup X_n$, of a sequence $\{X_n\}$ of sets is the set of points which belong to $X_n$ for infinitely many $n$. The lower limit, $\liminf X_n$, is the set of points which belong to $X_n$ for all but a finite number of $n$. It follows that $\liminf X_n \subset \limsup X_n$, and if $\limsup X_n = \liminf X_n = X$, $X$ is called the limit of the sequence, which then converges to $X$.

It is easy to show that
$$\liminf X_n = \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} X_m$$
and that
$$\limsup X_n = \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} X_m.$$
Then if $X_n \downarrow$,
$$\bigcap_{m=n}^{\infty} X_m = \bigcap_{m=1}^{\infty} X_m,\qquad \bigcup_{m=n}^{\infty} X_m = X_n,$$
so that $\liminf X_n = \limsup X_n$ and
$$\lim X_n = \bigcap_{n=1}^{\infty} X_n,$$
and similarly, if $X_n \uparrow$,
$$\lim X_n = \bigcup_{n=1}^{\infty} X_n.$$
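For a sequence of sets that is eventually periodic, the tail unions and intersections in the formulas above stabilise, so the upper and lower limits can be computed from one full period of the tail. A hedged Python sketch (the sets $A$, $B$ and the irregular first term are hypothetical choices):

```python
# lim sup / lim inf for an eventually periodic sequence of finite sets:
# X1 is irregular, after which the sets alternate A, B forever.
A, B = frozenset({1, 2, 3}), frozenset({3, 4})

def X(n):
    if n == 1:
        return frozenset({0, 7})
    return A if n % 2 == 0 else B

# Past the pre-period, the limits can be read off from one period of the tail.
N = 10
tail = [X(n) for n in (N, N + 1)]
lim_sup = tail[0] | tail[1]   # points in Xn for infinitely many n
lim_inf = tail[0] & tail[1]   # points in Xn for all but finitely many n
assert lim_sup == A | B == frozenset({1, 2, 3, 4})
assert lim_inf == A & B == frozenset({3})
```

Note that the points $0$ and $7$ of $X_1$ appear in neither limit: membership in finitely many $X_n$ contributes nothing.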

3. Additive system of sets

A system of sets which contains $\mathbf{X}$ and is closed under a finite number of complement and union operations is called a (finitely) additive system or a field. It follows from the duality principle that it is then closed under a finite number of intersection operations.

If an additive system is closed under a countable number of union and complement operations (and therefore under countable intersections), it is called a completely additive system, a Borel system or a $\sigma$-field.

It follows that any intersection (not necessarily countable) of additive or Borel systems is a system of the same type. Moreover, the intersection of all additive (or Borel) systems containing a family of sets is a uniquely defined minimal additive (or Borel) system containing the given family. The existence of at least one Borel system containing a given family is trivial, since the system of all subsets of $\mathbf{X}$ is a Borel system.

A construction of the actual minimal Borel system containing a given family of sets has been given by Hausdorff (Mengenlehre, 1927, p. 85).

Theorem 1. Any given family of subsets of a space $\mathbf{X}$ is contained in a unique minimal additive system $S_0$ and in a unique minimal Borel system $S$.

Example of a finitely additive system: The family of rectangles $a_i \le x_i < b_i$ $(i = 1, 2, \dots, n)$ in $R^n$ is not additive, but has a minimal additive $S_0$ consisting of all elementary figures and their complements. An elementary figure is the union of a finite number of such rectangles.

The intersections of sets of an additive (or Borel) system with a fixed set (of the system) form an additive (or Borel) subsystem of the original one.

4. Set Functions
Functions can be defined on a system of sets to take values in any given space. If the space is an abelian group with the group operation called addition, one can define the additivity of the set function. Thus, if $\mu$ is defined on an additive system of sets, $\mu$ is additive if
$$\mu\Big(\sum X_n\Big) = \sum \mu(X_n)$$
for any finite system of (disjoint) sets $X_n$.

In general we shall be concerned only with functions which take real values. We use the convention that the value $-\infty$ is excluded but that $\mu$ may take the value $+\infty$. It is obvious that $\mu(0) = 0$ if $\mu(X)$ is additive and finite for at least one $X$.

For a simple example of an additive set function we may take $\mu(X)$ to be the volume of $X$ when $X$ is an elementary figure in $R^n$.

If the additive property extends to countable systems of sets, the function is called completely additive, and again we suppose that $\mu(X) \ne -\infty$. Complete additivity of $\mu$ can be defined even if the field of sets $X$ is only finitely additive, provided that the $X_n$ and $\sum X_n$ belong to it.

Example of a completely additive function: $\mu(X)$ = number of elements (finite or infinite) in $X$, for all subsets $X$ of $\mathbf{X}$.

Examples of additive, but not completely additive functions:

1. $\mathbf{X}$ is an infinite set,
$$\mu(X) = 0 \text{ if } X \text{ is a finite subset of } \mathbf{X},\qquad \mu(X) = \infty \text{ if } X \text{ is an infinite subset of } \mathbf{X}.$$
Let $X$ be a countable set of elements $(x_1, x_2, \dots)$ of $\mathbf{X}$. Then
$$\mu(x_n) = 0,\qquad \sum \mu(x_n) = 0,\qquad \mu(X) = \infty.$$

2. $\mathbf{X}$ is the interval $0 \le x < 1$ and $\mu(X)$ is the sum of the lengths of finite sums of open or closed intervals with closure in $\mathbf{X}$. These sets, together with $\mathbf{X}$, form an additive system on which $\mu$ is additive but not completely additive if $\mu(\mathbf{X}) = 2$.

A non-negative, completely additive function $\mu$ defined on a Borel system $S$ of subsets of a set $\mathbf{X}$ is called a measure. It is bounded (or finite) if $\mu(\mathbf{X}) < \infty$. It is called a probability measure if $\mu(\mathbf{X}) = 1$. The sets of the system $S$ are called measurable sets.
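A concrete probability measure of this kind can be sketched in Python (a modern illustration; the point masses $2^{-(n+1)}$ on the non-negative integers are a hypothetical choice, and any summable non-negative weights would serve):

```python
from fractions import Fraction

# A probability measure on subsets of the non-negative integers, given by
# point masses 2^-(n+1); exact rational arithmetic avoids rounding.
def mu(X):
    return sum(Fraction(1, 2 ** (n + 1)) for n in X)

X1, X2 = {0, 2}, {1, 5}                 # disjoint sets
assert mu(X1 | X2) == mu(X1) + mu(X2)   # additivity on disjoint sets

# The mass of {0,...,19} is 1 - 2^-20, approaching mu of the whole space = 1.
assert mu(range(20)) == 1 - Fraction(1, 2 ** 20)
```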

5. Continuity of set functions


Definition. A set function $\mu$ is said to be continuous from below if $\mu(X_n) \to \mu(X)$ whenever $X_n \uparrow X$. It is continuous from above if $\mu(X_n) \to \mu(X)$ whenever $X_n \downarrow X$ and $\mu(X_{n_0}) < \infty$ for some $n_0$.

It is continuous if it is continuous from above and below. Continuity at $0$ means continuity from above at $0$.

(For general ideas about limits of set functions when $\{X_n\}$ is not monotonic, see Hahn and Rosenthal, Set Functions, Ch. I.)
The relationship between additivity and complete additivity can be
expressed in terms of continuity as follows.
Theorem 2. (a) A completely additive function $\mu$ is continuous.

(b) Conversely, an additive function is completely additive if it is either continuous from below, or finite and continuous at $0$. (The system of sets on which $\mu$ is defined need only be finitely additive.)

Proof. (a) If $X_n \uparrow X$, we write
$$X = X_1 + (X_2 - X_1) + (X_3 - X_2) + \cdots,$$
$$\mu(X) = \mu(X_1) + \mu(X_2 - X_1) + \cdots = \mu(X_1) + \lim_{N \to \infty} \sum_{n=2}^{N} \mu(X_n - X_{n-1}) = \lim_{N \to \infty} \mu(X_N).$$

On the other hand, if $X_n \downarrow X$ and $\mu(X_{n_0}) < \infty$, we write
$$X_{n_0} = X + \sum_{n=n_0}^{\infty} (X_n - X_{n+1}),\qquad \mu(X_{n_0}) = \mu(X) + \sum_{n=n_0}^{\infty} \mu(X_n - X_{n+1}),$$
and $\mu(X) = \lim \mu(X_n)$ as above, since $\mu(X_{n_0}) < \infty$.


(b) First, if $\mu$ is additive and continuous from below, and
$$Y = Y_1 + Y_2 + Y_3 + \cdots,$$
we write
$$Y = \lim_{N \to \infty} \sum_{n=1}^{N} Y_n,$$
$$\mu(Y) = \lim_{N \to \infty} \mu\Big(\sum_{n=1}^{N} Y_n\Big), \text{ since } \sum_{n=1}^{N} Y_n \uparrow Y,$$
$$= \lim_{N \to \infty} \sum_{n=1}^{N} \mu(Y_n)$$
by finite additivity, and therefore $\mu(Y) = \sum_{n=1}^{\infty} \mu(Y_n)$.

On the other hand, if $\mu$ is finite and continuous at $0$, and $X = \sum_{n=1}^{\infty} X_n$, we write
$$\mu(X) = \mu\Big(\sum_{n=1}^{N} X_n + \sum_{n=N+1}^{\infty} X_n\Big) = \sum_{n=1}^{N} \mu(X_n) + \mu\Big(\sum_{n=N+1}^{\infty} X_n\Big), \text{ by finite additivity,}$$
and complete additivity follows on letting $N \to \infty$, since
$$\sum_{n=N+1}^{\infty} X_n \downarrow 0 \text{ and has finite } \mu.$$
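Both continuity statements of Theorem 2 can be watched numerically for a completely additive function. A hedged Python sketch (the weight function $\mu(X) = \sum_{n \in X} 2^{-(n+1)}$ is a hypothetical choice, used here only because its values are exactly computable):

```python
from fractions import Fraction

# A completely additive, finite set function on finite subsets of the
# non-negative integers.
def mu(X):
    return sum(Fraction(1, 2 ** (n + 1)) for n in X)

# Continuity from below: Xn = {0,...,n-1} increases to X = {0,...,29}.
values = [mu(range(n)) for n in range(1, 31)]
assert values == sorted(values)      # mu(Xn) is non-decreasing
assert values[-1] == mu(range(30))   # and attains mu(lim Xn)

# Continuity at 0: Yn = {n,...,29} decreases to the empty set.
tails = [mu(range(n, 30)) for n in range(31)]
assert all(a >= b for a, b in zip(tails, tails[1:]))
assert tails[-1] == 0                # mu(Yn) -> mu(0) = 0
```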


Theorem 3 (Hahn-Jordan). Suppose that $\mu$ is completely additive in a Borel system $S$ of subsets of a space $\mathbf{X}$. Then we can write $\mathbf{X} = X^+ + X^-$ (where $X^+$, $X^-$ belong to $S$ and one may be empty) in such a way that

1. $0 \le \mu(X) \le \mu(X^+) = M$ for $X \subset X^+$,
$$-\infty < m = \mu(X^-) \le \mu(X) \le 0 \text{ for } X \subset X^-,$$
while $m \le \mu(X) \le M$ for all $X$.

Corollary 1. The upper and lower bounds $M$, $m$ of $\mu(X)$ in $S$ are attained for the sets $X^+$, $X^-$ respectively, and $m > -\infty$. Moreover, $M < \infty$ if $\mu(X)$ is finite for all $X$. In particular, a finite measure is bounded.

Corollary 2. If we write
$$\mu^+(X) = \mu(X.X^+),\qquad \mu^-(X) = -\mu(X.X^-),$$
we have
$$\mu(X) = \mu^+(X) - \mu^-(X),\qquad \mu^+(X) \ge 0,\qquad \mu^-(X) \ge 0,$$
$$\mu^+(X) = \sup_{Y \subset X} \mu(Y),\qquad \mu^-(X) = -\inf_{Y \subset X} \mu(Y).$$
If we write $\overline{\mu}(X) = \mu^+(X) + \mu^-(X)$, we have also
$$|\mu(Y)| \le \overline{\mu}(X) \text{ for all } Y \subset X.$$
It follows from the theorem and corollaries that an additive function can always be expressed as the difference of two measures, of which one (the negative part $\mu^-$ here) is bounded. From this point on, it is sufficient to consider only measures.
Proof of Theorem 3. [Hahn and Rosenthal, with modifications.] We suppose that $m < 0$, for otherwise there is nothing to prove. Let $A_n$ be defined so that $\mu(A_n) \to m$ and let $A = \bigcup_{n=1}^{\infty} A_n$. For every $n$, we write
$$A = A_k + (A - A_k),\qquad A = \bigcap_{k=1}^{n} [A_k + (A - A_k)].$$
This can be expanded as the union of $2^n$ sets of the form $\bigcap_{k=1}^{n} A_k^*$, where $A_k^* = A_k$ or $A - A_k$, and we write $B_n$ for the sum of those for which $\mu < 0$. (If there is no such set, $B_n = 0$.) Then, since $A_n$ consists of disjoint sets which either belong to $B_n$ or have $\mu \ge 0$, we get
$$\mu(A_n) \ge \mu(B_n).$$
Since the part of $B_{n+1}$ which does not belong to $B_n$ consists of a finite number of disjoint sets of the form $\bigcap_{k=1}^{n+1} A_k^*$, for each of which $\mu < 0$,
$$\mu(B_n \cup B_{n+1}) = \mu(B_n) + \mu(B_{n+1} - B_n) \le \mu(B_n),$$
and similarly
$$\mu(B_n) \ge \mu(B_n \cup B_{n+1} \cup \dots \cup B_{n'})$$
for any $n' > n$. By continuity from below, we can let $n' \to \infty$, and
$$\mu(A_n) \ge \mu(B_n) \ge \mu\Big(\bigcup_{k=n}^{\infty} B_k\Big).$$
Let $X^- = \lim_{n \to \infty} \bigcup_{k=n}^{\infty} B_k$. Then
$$\mu(X^-) \le \lim_{n \to \infty} \mu(A_n) = m,$$
and since $\mu(X^-) \ge m$ by definition of $m$, $\mu(X^-) = m$.

Now, if $X$ is any subset of $X^-$ and $\mu(X) > 0$, we have
$$m = \mu(X^-) = \mu(X) + \mu(X^- - X) > \mu(X^- - X),$$
which contradicts the fact that $m$ is $\inf_{Y} \mu(Y)$.

This proves (1) and the rest follows easily.

It is easy to prove that Corollary 2 holds also for a completely additive function on a finitely additive system of sets, but $\sup \mu(X)$, $\inf \mu(X)$ are then not necessarily attained.
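On a finite space the Hahn-Jordan decomposition is transparent: $X^+$ collects the points of positive mass and $X^-$ the rest. A Python sketch (the point masses are hypothetical):

```python
from itertools import chain, combinations

# A signed, completely additive set function on a four-point space,
# given by point masses w(x).
w = {'a': 2.0, 'b': -1.5, 'c': 0.5, 'd': -0.25}

def mu(X):
    return sum(w[x] for x in X)

Xplus  = {x for x in w if w[x] > 0}    # the positive set X+
Xminus = {x for x in w if w[x] <= 0}   # the negative set X-
M, m = mu(Xplus), mu(Xminus)

# every subset's measure lies between m and M, as Theorem 3 asserts
subsets = list(chain.from_iterable(
    combinations(sorted(w), r) for r in range(len(w) + 1)))
assert all(m <= mu(s) <= M for s in subsets)

# mu+ and mu- as in Corollary 2: mu(X) = mu+(X) - mu-(X)
def mu_plus(X):  return mu(set(X) & Xplus)
def mu_minus(X): return -mu(set(X) & Xminus)
assert mu({'a', 'b'}) == mu_plus({'a', 'b'}) - mu_minus({'a', 'b'})
```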

6. Extensions and contractions of additive functions

We get a contraction of an additive (or completely additive) function defined on a system by considering only its values on an additive subsystem. More important, we get an extension by embedding the system of sets in a larger system and defining a set function on the new system so that it takes the same values as before on the old system.

The basic problem in measure theory is to prove the existence of a measure with respect to which certain assigned sets are measurable and have assigned measures. The classical problem of defining a measure on the real line with respect to which every interval is measurable with measure equal to its length was solved by Borel and Lebesgue. We prove Kolmogoroff's theorem (due to Caratheodory in the case of $R^n$) about conditions under which an additive function on a finitely additive system $S_0$ can be extended to a measure in a Borel system containing $S_0$.
Theorem 4. (a) If $\mu(I)$ is non-negative and additive on an additive system $S_0$, and if $I_n$ are disjoint sets of $S_0$ with $I = \sum_{n=1}^{\infty} I_n$ also in $S_0$, then
$$\sum_{n=1}^{\infty} \mu(I_n) \le \mu(I).$$

(b) In order that $\mu(I)$ should be completely additive, it is sufficient that
$$\mu(I) \le \sum_{n=1}^{\infty} \mu(I_n).$$

(c) Moreover, if $\mu(I)$ is completely additive, this last inequality holds whether the $I_n$ are disjoint or not, provided that $I \subset \bigcup_{n=1}^{\infty} I_n$.

Proof. (a) For any $N$,
$$\sum_{n=1}^{N} I_n,\qquad I - \sum_{n=1}^{N} I_n$$
belong to $S_0$ and do not overlap. Since their sum is $I$, we get
$$\mu(I) = \sum_{n=1}^{N} \mu(I_n) + \mu\Big(I - \sum_{n=1}^{N} I_n\Big) \ge \sum_{n=1}^{N} \mu(I_n)$$
by finite additivity. Part (a) follows if we let $N \to \infty$, and (b) is a trivial consequence of the definition.

For (c), we write
$$\bigcup_{n=1}^{\infty} I_n = I_1 + I_2.I_1' + I_3.I_1'.I_2' + \cdots$$
and then
$$\mu(I) \le \mu\Big(\bigcup_{n=1}^{\infty} I_n\Big) = \mu(I_1) + \mu(I_2.I_1') + \cdots \le \mu(I_1) + \mu(I_2) + \cdots$$


7. Outer Measure
We define the outer measure of a set $X$, with respect to a completely additive non-negative $\mu(I)$ defined on an additive system $S_0$, to be $\inf \sum \mu(I_n)$ over all sequences $\{I_n\}$ of sets of $S_0$ which cover $X$ (that is, $X \subset \bigcup_{n=1}^{\infty} I_n$).

Since any $I$ of $S_0$ covers itself, its outer measure does not exceed $\mu(I)$. On the other hand, it follows from Theorem 4(c) that
$$\mu(I) \le \sum_{n=1}^{\infty} \mu(I_n)$$
for every sequence $(I_n)$ covering $I$, and the inequality remains true if the right hand side is replaced by its lower bound, which is the outer measure of $I$. It follows that the outer measure of a set $I$ of $S_0$ is $\mu(I)$, and there is therefore no contradiction if we use the same symbol $\mu(X)$ for the outer measure of every set $X$, whether in $S_0$ or not.
Theorem 5. If $X \subset \bigcup_{n=1}^{\infty} X_n$, then
$$\mu(X) \le \sum_{n=1}^{\infty} \mu(X_n).$$

Proof. Let $\varepsilon > 0$, $\varepsilon = \sum_{n=1}^{\infty} \varepsilon_n$. Then we can choose $I_{n\nu}$ from $S_0$ so that
$$X_n \subset \bigcup_{\nu=1}^{\infty} I_{n\nu},\qquad \sum_{\nu=1}^{\infty} \mu(I_{n\nu}) \le \mu(X_n) + \varepsilon_n,$$
and then, since
$$X \subset \bigcup_{n,\nu=1}^{\infty} I_{n\nu},$$
$$\mu(X) \le \sum_{n,\nu=1}^{\infty} \mu(I_{n\nu}) \le \sum_{n=1}^{\infty} (\mu(X_n) + \varepsilon_n) = \sum_{n=1}^{\infty} \mu(X_n) + \varepsilon,$$
and we can let $\varepsilon \to 0$.
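For finite unions of half-open intervals on the line, the outer measure is just total length after merging overlaps, and the subadditivity of Theorem 5 can be seen directly. A hedged Python sketch (the intervals are arbitrary illustrative choices):

```python
# Total length of a finite union of half-open intervals [a, b),
# computed by sorting and merging overlapping runs.
def length(intervals):
    ivs = sorted((a, b) for a, b in intervals if a < b)
    total, cur_a, cur_b = 0.0, None, None
    for a, b in ivs:
        if cur_b is None or a > cur_b:   # disjoint from the current run
            if cur_b is not None:
                total += cur_b - cur_a
            cur_a, cur_b = a, b
        else:                            # overlaps: extend the run
            cur_b = max(cur_b, b)
    if cur_b is not None:
        total += cur_b - cur_a
    return total

X  = [(0.0, 1.0)]
X1 = [(0.0, 0.6)]
X2 = [(0.4, 1.0)]
# X is covered by X1 u X2, so mu(X) <= mu(X1) + mu(X2); the inequality is
# strict here because the covering intervals overlap.
assert length(X) <= length(X1) + length(X2)
assert abs(length(X1 + X2) - 1.0) < 1e-12
```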

Definition of Measurable Sets. We say that $X$ is measurable with respect to the function $\mu$ if
$$\mu(PX) + \mu(P - PX) = \mu(P)$$
for every $P$ with $\mu(P) < \infty$.

Theorem 6. Every set $I$ of $S_0$ is measurable.

Proof. If $P$ is any set with $\mu(P) < \infty$, and $\varepsilon > 0$, we can define $I_n$ in $S_0$ so that
$$P \subset \bigcup_{n=1}^{\infty} I_n,\qquad \sum_{n=1}^{\infty} \mu(I_n) \le \mu(P) + \varepsilon.$$
Then
$$PI \subset \bigcup_{n=1}^{\infty} I I_n,\qquad P - PI \subset \bigcup_{n=1}^{\infty} (I_n - I I_n),$$
and since $I I_n$ and $I_n - I I_n$ both belong to $S_0$,
$$\mu(PI) \le \sum_{n=1}^{\infty} \mu(I I_n),\qquad \mu(P - PI) \le \sum_{n=1}^{\infty} \mu(I_n - I I_n),$$
and
$$\mu(PI) + \mu(P - PI) \le \sum_{n=1}^{\infty} \big(\mu(I I_n) + \mu(I_n - I I_n)\big) = \sum_{n=1}^{\infty} \mu(I_n) \le \mu(P) + \varepsilon$$
by additivity in $S_0$. Since $\varepsilon$ is arbitrary,
$$\mu(PI) + \mu(P - PI) \le \mu(P)$$
as required. (The opposite inequality follows from Theorem 5, so equality holds.)

We can now prove the fundamental theorem.
We can now prove the fundamental theorem.

Theorem 7 (Kolmogoroff-Caratheodory). If is a non-negative and


completely additive set function in an additive system S 0 , a measure
can be defined in a Borel system S containing S 0 and taking the original
value (I) for I S 0 .
Proof. It is sufficient to show that the measurable sets defined above
form a Borel system and that the outer measure is completely additive
on it.

If $X$ is measurable, it follows from the definition of measurability and the fact that
$$PX' = P - PX,\qquad P - PX' = PX,$$
$$\mu(PX') + \mu(P - PX') = \mu(PX) + \mu(P - PX),$$
that $X'$ is also measurable.

Next suppose that $X_1$, $X_2$ are measurable. Then if $\mu(P) < \infty$,
$$\mu(P) = \mu(PX_1) + \mu(P - PX_1), \text{ since } X_1 \text{ is measurable,}$$
$$= \mu(PX_1X_2) + \mu(PX_1 - PX_1X_2) + \mu(PX_2 - PX_1X_2) + \mu(P - P(X_1 \cup X_2)), \text{ since } X_2 \text{ is measurable.}$$
Then, since
$$(PX_1 - PX_1X_2) + (PX_2 - PX_1X_2) + (P - P(X_1 \cup X_2)) = P - PX_1X_2,$$
it follows from Theorem 5 that
$$\mu(P) \ge \mu(PX_1X_2) + \mu(P - PX_1X_2),$$
and so $X_1X_2$ is measurable.

It follows at once now that the sum and difference of two measurable sets are measurable, and if we take $P = X_1 + X_2$ in the formula defining measurability of $X_1$, it follows that
$$\mu(X_1 + X_2) = \mu(X_1) + \mu(X_2)$$
when $X_1$ and $X_2$ are measurable and $X_1X_2 = 0$. This shows that the measurable sets form an additive system $S$ in which $\mu(X)$ is additive. After Theorems 4(b) and 5, $\mu(X)$ is also completely additive in $S$. To complete the proof, therefore, it is sufficient to prove that $X = \bigcup_{n=1}^{\infty} X_n$ is measurable if the $X_n$ are measurable, and it is sufficient to prove this in the case of disjoint $X_n$.

If $\mu(P) < \infty$,
$$\mu(P) = \mu\Big(P\sum_{n=1}^{N} X_n\Big) + \mu\Big(P - P\sum_{n=1}^{N} X_n\Big) \ge \mu\Big(P\sum_{n=1}^{N} X_n\Big) + \mu(P - PX),$$
since $\sum_{n=1}^{N} X_n$ is measurable, and
$$\mu\Big(P\sum_{n=1}^{N} X_n\Big) = \sum_{n=1}^{N} \mu(PX_n)$$
by the definition of measurability applied $N - 1$ times, the $X_n$ being disjoint. Since this holds for all $N$,
$$\mu(P) \ge \sum_{n=1}^{\infty} \mu(PX_n) + \mu(P - PX) \ge \mu(PX) + \mu(P - PX)$$
by Theorem 5, and therefore $X$ is measurable.

Definition. A measure is said to be complete if every subset of a measurable set of measure zero is also measurable (and therefore has measure zero).

Theorem 8. The measure defined by Theorem 7 is complete.

Proof. If $X$ is a subset of a measurable set of measure $0$, then $\mu(X) = 0$, $\mu(PX) = 0$, and
$$\mu(P) \le \mu(PX) + \mu(P - PX) = \mu(P - PX) \le \mu(P),$$
$$\mu(P) = \mu(P - PX) = \mu(P - PX) + \mu(PX),$$
and so $X$ is measurable.

The measure defined in Theorem 7 is not generally the minimal measure generated by $\mu$, and the minimal measure is generally not complete. However, any measure can be completed by adding to the system of measurable sets $X$ the sets $X \cup N$, where $N$ is a subset of a set of measure zero, and defining $\mu(X \cup N) = \mu(X)$. This is consistent with the original definition and gives us a measure, since countable unions of sets $X \cup N$ are sets of the same form,
$$(X \cup N)' = X'N' = X'(Y' \cup (Y - N))$$
(where $N \subset Y$, $Y$ being measurable and of measure $0$) $= X_1 \cup N_1$ is of the same form, and $\mu$ is clearly completely additive on this extended system.

The essential property of a measure is complete additivity, or the equivalent continuity conditions of Theorem 2. Thus, if $X_n \uparrow X$ or $X_n \downarrow X$, then $\mu(X_n) \to \mu(X)$; if $X_n \downarrow 0$, $\mu(X_n) \to 0$; and if $X = \sum X_n$, then $\mu(X) = \sum \mu(X_n)$. In particular, the union of a sequence of sets of measure zero also has measure zero.

8. Classical Lebesgue and Stieltjes measures


The fundamental problem in measure theory is, as we have remarked already, to prove the existence of a measure taking assigned values on a given system of sets. The classical problem solved by Lebesgue is that of defining a measure on sets of points on a line in such a way that every interval is measurable and has measure equal to its length. We consider this, and generalizations of it, in the light of the preceding abstract theory.

It is no more complicated to consider measures in Euclidean space $R^k$ than in $R^1$. A set of points defined by inequalities of the form
$$a_i \le x_i < b_i\qquad (i = 1, 2, \dots, k)$$
will be called a rectangle, and the union of a finite number of rectangles, which we have called an elementary figure, will be called simply a figure. It is easy to see that the system of figures and complements of figures forms a finitely additive system in $R^k$. The volume of the rectangle defined above is defined to be $\prod_{i=1}^{k} (b_i - a_i)$. A figure can be decomposed into disjoint rectangles in many different ways, but it is easy to verify that the sum of the volumes of its components remains the same, however the decomposition is carried out (it is sufficient to show this when one rectangle is decomposed). If the volume of the complement of a figure is defined to be $+\infty$, it is easy to show by the same argument that the volume function $\mu(I)$ is finitely additive on the system $S_0$ of figures and their complements.
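The decomposition-invariance of the volume can be checked on a small example. A Python sketch (a modern illustration; the unit square and the two cuts are arbitrary):

```python
# Two different decompositions of the unit square [0,1) x [0,1) into
# disjoint rectangles give the same total volume.
def vol(rect):
    (a1, b1), (a2, b2) = rect
    return (b1 - a1) * (b2 - a2)

dec1 = [((0, 1), (0, 0.5)), ((0, 1), (0.5, 1))]    # two horizontal strips
dec2 = [((0, 0.3), (0, 1)), ((0.3, 1), (0, 1))]    # two vertical strips
assert abs(sum(map(vol, dec1)) - 1.0) < 1e-12
assert abs(sum(map(vol, dec2)) - 1.0) < 1e-12
```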
Theorem 9. The function $\mu(I)$ (defined above) is completely additive in $S_0$.

Proof. As in Theorem 2, it is sufficient to show that if $\{I_n\}$ is a decreasing sequence of figures and $I_n \downarrow 0$, then $\mu(I_n) \to 0$. If $\mu(I_n)$ does not $\to 0$, we can define $\varepsilon > 0$ so that $\mu(I_n) \ge \varepsilon$ for all $n$, and we can define a decreasing sequence of figures $H_n$ such that the closure $\overline{H}_n$ of $H_n$ lies in $I_n$, while
$$\mu(I_n - H_n) < \frac{\varepsilon}{2}.$$
It follows that $\mu(H_n) = \mu(I_n) - \mu(I_n - H_n) > \varepsilon/2$, so that $H_n$, and therefore $\overline{H}_n$, contains at least one point. But the intersection of a decreasing sequence of non-empty closed sets $(\overline{H}_n)$ is non-empty, and therefore the $H_n$, and hence the $I_n$, have a common point, which is impossible since $I_n \downarrow 0$.

The measure now defined by Theorem 7 is Lebesgue measure.

9. Borel sets and Borel measure


The sets of the minimal Borel system which contains all figures are called Borel sets, and the measure which is defined by Theorems 9 and 7 is called Borel measure when it is restricted to these sets. The following results follow immediately.

Theorem 10. A sequence of points in $R^k$ is Borel measurable and has measure $0$.

Theorem 11. Open and closed sets in $R^k$ are Borel sets.

(An open set is the sum of a sequence of rectangles, and a closed set is the complement of an open set.)

Theorem 12. If $X$ is any (Lebesgue) measurable set, and $\varepsilon > 0$, we can find an open set $G$ and a closed set $F$ such that
$$F \subset X \subset G,\qquad \mu(G - F) < \varepsilon.$$
Moreover, we can find Borel sets $A$, $B$ so that
$$A \subset X \subset B,\qquad \mu(B - A) = 0.$$
Conversely, any set $X$ for which either of these is true is measurable.

Proof. First suppose that $X$ is bounded, so that we can find a sequence of rectangles $I_n$ so that
$$X \subset \bigcup_{n=1}^{\infty} I_n,\qquad \sum_{n=1}^{\infty} \mu(I_n) < \mu(X) + \varepsilon/4.$$
Each rectangle $I_n$ can be enclosed in an open rectangle $Q_n$ (that is, a point set defined by inequalities of the form $a_i < x_i < b_i$, $i = 1, 2, \dots, k$; its measure is defined to be $\prod_{i=1}^{k} (b_i - a_i)$) of measure not greater than
$$\mu(I_n) + \frac{\varepsilon}{2^{n+2}}.$$
Then
$$X \subset Q = \bigcup_{n=1}^{\infty} Q_n,\qquad \mu(Q) \le \sum_{n=1}^{\infty} \mu(Q_n) \le \sum_{n=1}^{\infty} \mu(I_n) + \sum_{n=1}^{\infty} \frac{\varepsilon}{2^{n+2}} \le \mu(X) + \frac{\varepsilon}{4} + \frac{\varepsilon}{4} = \mu(X) + \frac{\varepsilon}{2}.$$
Then $Q$ is open and $\mu(Q - X) \le \varepsilon/2$.

Now any set $X$ is the sum of a sequence of bounded sets $X_n$ (which are measurable if $X$ is), and we can apply this to each $X_n$ with $\varepsilon/2^{n+1}$ instead of $\varepsilon$. Then
$$X = \sum_{n=1}^{\infty} X_n,\qquad X_n \subset Q_n,\qquad \bigcup_{n=1}^{\infty} Q_n = G,$$
where $G$ is open and
$$G - X \subset \bigcup_{n=1}^{\infty} (Q_n - X_n),\qquad \mu(G - X) \le \sum_{n=1}^{\infty} \mu(Q_n - X_n) \le \sum_{n=1}^{\infty} \frac{\varepsilon}{2^{n+1}} = \frac{\varepsilon}{2}.$$
The closed set $F$ is found by repeating the argument on $X'$ and complementing.


Finally, if we let $\varepsilon_n \to 0$ and $G_n$, $F_n$ are open and closed respectively, with
$$F_n \subset X \subset G_n,\qquad \mu(G_n - F_n) < \varepsilon_n,$$
and we put
$$A = \bigcup_{n=1}^{\infty} F_n,\qquad B = \bigcap_{n=1}^{\infty} G_n,$$
we see that
$$A \subset X \subset B,\qquad \mu(B - A) \le \mu(G_n - F_n) \le \varepsilon_n \text{ for all } n,$$
and so
$$\mu(B - A) = 0,$$
while $A$, $B$ are obviously Borel sets.
Conversely, if $\mu(P) < \infty$ and
$$F \subset X \subset G,$$
we have, since a closed set is measurable,
$$\mu(P) = \mu(PF) + \mu(P - PF)$$
$$\ge \mu(PX) - \mu(P(X - F)) + \mu(P - PX)$$
$$\ge \mu(PX) + \mu(P - PX) - \mu(X - F)$$
$$\ge \mu(PX) + \mu(P - PX) - \mu(G - F)$$
$$\ge \mu(PX) + \mu(P - PX) - \varepsilon,$$
true for every $\varepsilon > 0$, and therefore
$$\mu(P) \ge \mu(PX) + \mu(P - PX),$$
so that $X$ is measurable.

In the second case, $X$ is the sum of $A$ and a subset of $B$ contained in a Borel set of measure zero, and is therefore Lebesgue measurable by the completeness of Lebesgue measure.
It is possible to define measures on the Borel sets in $R^k$ in which the measure of a rectangle is not equal to its volume. All that is necessary is that they should be completely additive on figures. Measures of this kind are usually called positive Stieltjes measures in $R^k$, and Theorems 11 and 12 remain valid for them, but Theorem 10 does not. For example, a single point may have positive Stieltjes measure.

A particularly important case is $k = 1$, when a Stieltjes measure can be defined on the real line by any monotonic increasing function $\varphi(x)$. The figures $I$ are finite sums of intervals $a_i \le x < b_i$, and $\mu(I)$ is defined by
$$\mu(I) = \sum_i \{\varphi(b_i - 0) - \varphi(a_i - 0)\}.$$

The proof of Theorem 9 in this case is still valid. We observe that, since $\lim_{\beta \to b-0} \varphi(\beta) = \varphi(b - 0)$, it is possible to choose $\beta$ so that $\beta < b$ and
$$\varphi(\beta - 0) - \varphi(a - 0) > \tfrac{1}{2}\,[\varphi(b - 0) - \varphi(a - 0)].$$
The set function $\mu$ can be defined in this way even if $\varphi(x)$ is not monotonic. If $\mu$ is bounded, we say that $\varphi(x)$ is of bounded variation. In this case, the argument of Theorem 9 can still be used to prove that $\mu$ is completely additive on figures. After the remark on Corollary 2 of Theorem 3, we see that it can be expressed as the difference of two completely additive, non-negative functions $\mu^+$, $\mu^-$ defined on figures. These can be extended to a Borel system of sets $X$, and the set function $\mu = \mu^+ - \mu^-$ gives a set function associated with $\varphi(x)$. We can also write $\varphi(x) = \varphi^+(x) + \varphi^-(x)$, where $\varphi^+(x)$ increases, $\varphi^-(x)$ decreases, and both are bounded if $\varphi(x)$ has bounded variation.

A non-decreasing function $\varphi(x)$ for which $\varphi(-\infty) = 0$, $\varphi(+\infty) = 1$ is called a distribution function, and is of basic importance in probability.

10. Measurable functions


A function $f(x)$ defined on $\mathbf{X}$ and taking real values is called measurable with respect to a measure $\mu$ if $[f(x) \ge k]$ (here $[P(x)]$ denotes the set of points $x$ in $\mathbf{X}$ for which $P(x)$ is true) is measurable with respect to $\mu$ for every real $k$.
Theorem 13. The measurability condition

(i) $[f(x) \ge k]$ is measurable for all real $k$

is equivalent to each one of

(ii) $[f(x) > k]$ is measurable for all real $k$,

(iii) $[f(x) \le k]$ is measurable for all real $k$,

(iv) $[f(x) < k]$ is measurable for all real $k$.

Proof. Since
$$[f(x) \ge k] = \bigcap_{n=1}^{\infty} \Big[f(x) > k - \frac{1}{n}\Big],$$
(ii) implies (i). Also
$$[f(x) > k] = \bigcup_{n=1}^{\infty} \Big[f(x) \ge k + \frac{1}{n}\Big],$$
and so (i) implies (ii). This proves the theorem, since (i) is equivalent with (iv) and (ii) with (iii), because the corresponding sets are complements.

Theorem 14. A function which is constant in $\mathbf{X}$ is measurable. If $f$ and $g$ are measurable, so are $f \pm g$ and $fg$.

Proof. The first is obvious. To prove the second, suppose $f$, $g$ are measurable. Then
$$[f(x) + g(x) > k] = [f(x) > k - g(x)] = \bigcup_r [f(x) > r > k - g(x)] = \bigcup_r [f(x) > r]\,[g(x) > k - r],$$
the union being over all rationals $r$. This is a countable union of measurable sets, so that $f + g$ is measurable. Similarly $f - g$ is measurable. Finally,
$$[(f(x))^2 > k] = [f(x) > \sqrt{k}\,] + [f(x) < -\sqrt{k}\,] \text{ for } k \ge 0,$$
so that $f^2$ is measurable. Since
$$f(x)g(x) = \frac{1}{4}(f(x) + g(x))^2 - \frac{1}{4}(f(x) - g(x))^2,$$
$fg$ is measurable.
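The rational decomposition used in this proof can be verified exhaustively in a finite model. A hedged Python sketch (the space, the functions $f$, $g$, the level $k$, and the finite grid of rationals are all hypothetical; a finite grid of $r$ suffices here because $f$ and $g$ take finitely many values, all multiples of $1/12$):

```python
from fractions import Fraction

# A finite check of  [f + g > k] = union over rationals r of [f > r].[g > k - r].
points = range(20)
f = {x: Fraction(x, 3) for x in points}
g = {x: Fraction(7 - x, 4) for x in points}
k = Fraction(25, 12)

lhs = {x for x in points if f[x] + g[x] > k}
rhs = set()
for n in range(-12, 160):          # r = n/24 sweeps a fine grid of rationals
    r = Fraction(n, 24)
    rhs |= {x for x in points if f[x] > r and g[x] > k - r}
assert lhs == rhs

# The product identity used for fg:
assert all(f[x] * g[x] == ((f[x] + g[x]) ** 2 - (f[x] - g[x]) ** 2) / 4
           for x in points)
```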

Theorem 15. If $f_n$ is measurable for $n = 1, 2, \dots$, then so are $\limsup f_n$ and $\liminf f_n$.

Proof. For every real $k$,
$$[\limsup f_n(x) < k] = \bigcup_{m=1}^{\infty} \bigcup_{N=1}^{\infty} \bigcap_{n=N}^{\infty} \Big[f_n(x) < k - \frac{1}{m}\Big],$$
since $\limsup f_n(x) < k$ holds exactly when $f_n(x)$ stays below some $k - 1/m$ for all sufficiently large $n$. This is a countable union of countable intersections of measurable sets, and so is measurable for all real $k$. Similarly $\liminf f_n$ is measurable.

In $R^n$, a function for which $[f(x) \ge k]$ is Borel measurable for all $k$ is called a Borel measurable function or a Baire function.

Theorem 16. In $R^n$, a continuous function is Borel measurable.

Proof. The set $[f(x) \ge k]$ is closed.

Theorem 17. A Baire function of a measurable function is measurable.

Proof. The Baire functions form the smallest class which contains the continuous functions and is closed under limit operations. Since the class of measurable functions is closed under limit operations, it is sufficient to prove that a continuous function of a measurable function is measurable. Then if $\varphi(u)$ is continuous and $f(x)$ measurable, $[\varphi(f(x)) > k]$ is the set of $x$ for which $f(x)$ lies in an open set, namely the open set of points $u$ for which $\varphi(u) > k$. Since an open set is a countable union of open intervals, this set is measurable, thus proving the theorem.

Theorem 18 (Egoroff). If $\mu(X) < \infty$ and $f_n(x) \to f(x)$ p.p. in $X$, and if $\varepsilon > 0$, then we can find a subset $X_0$ of $X$ such that $\mu(X - X_0) < \varepsilon$ and $f_n(x) \to f(x)$ uniformly in $X_0$.

(We write p.p. for "almost everywhere", that is, everywhere except for a set of measure zero.)

Proof. We may plainly neglect the set of zero measure in which $f_n(x)$ does not converge to a finite limit. Let
$$X_{N,\nu} = [\,|f(x) - f_n(x)| < 1/\nu \text{ for all } n \ge N\,].$$
Then, for fixed $\nu$,
$$X_{N,\nu} \uparrow X \text{ as } N \to \infty.$$
For each $\nu$ we choose $N_\nu$ so that $X^{(\nu)} = X_{N_\nu,\nu}$ satisfies
$$\mu(X - X^{(\nu)}) < \varepsilon/2^\nu,$$
and let
$$X_0 = \bigcap_{\nu=1}^{\infty} X^{(\nu)}.$$
Then
$$\mu(X - X_0) \le \sum_{\nu=1}^{\infty} \mu(X - X^{(\nu)}) < \varepsilon,$$
and
$$|f(x) - f_n(x)| < 1/\nu \text{ for } n \ge N_\nu$$
if $x$ is in $X^{(\nu)}$, and therefore if $x$ is in $X_0$. This proves the theorem.
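The standard picture behind Egoroff's theorem is $f_n(x) = x^n$ on $[0, 1)$: it converges to $0$ pointwise but not uniformly, yet becomes uniformly convergent once a thin strip at the right end is removed. A Python sketch (a modern illustration; the grid and the choice $\delta = 0.1$ are arbitrary):

```python
# fn(x) = x^n on [0, 1): pointwise but not uniform convergence to 0,
# repaired by discarding the strip [1 - delta, 1) of measure delta.
delta = 0.1

# sup of x^50 over a fine grid of the whole interval stays near 1 ...
sup_full = max((k / 1000.0) ** 50 for k in range(1000))
assert sup_full > 0.9

# ... but on the retained set [0, 1 - delta] the sup is (1 - delta)^n -> 0.
def sup_retained(n):
    return (1 - delta) ** n

assert sup_retained(250) < 1e-10
```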

11. The Lebesgue integral


Suppose that $f(x) \ge 0$ and $f$ is measurable in $\mathbf{X}$, and let
$$0 = y_0 < y_1 < y_2 < \cdots,\qquad y_\nu \to \infty,$$
and
$$E_\nu = [\,y_\nu \le f(x) < y_{\nu+1}\,],\qquad \nu = 0, 1, 2, \dots,$$
so that $E_\nu$ is measurable and $\mathbf{X} = \sum_{\nu=0}^{\infty} E_\nu$. We call the set of the $y_\nu$, $\{y_\nu\}$, a subdivision. Let
$$S = S\{y_\nu\} = \sum_{\nu=1}^{\infty} y_\nu\, \mu(E_\nu).$$
Then we define $\sup S$ over all subdivisions $\{y_\nu\}$ to be the Lebesgue integral of $f(x)$ over $\mathbf{X}$, and write it $\int f(x)\, d\mu$. We say that $f(x)$ is integrable or summable if its integral is finite. It is obvious that changes in the values of $f$ in a null set (set of measure $0$) have no effect on the integral.

Theorem 19. Let $\{y_\nu^k\}$, $k = 1, 2, \dots$, be a sequence of subdivisions whose maximum intervals
$$\delta_k = \sup_\nu\, (y_{\nu+1}^k - y_\nu^k) \to 0 \text{ as } k \to \infty.$$
Then, if $S_k$ is the sum corresponding to $\{y_\nu^k\}$,
$$\lim_{k \to \infty} S_k = \int f(x)\, d\mu = F(\mathbf{X}).$$

Corollary . Since S k is the integral of the function taking constant values yk in the sets Ek , it follows, by leaving out suitable remainders
P
k
k
=v+1 y (E ), that F(X) is the limit of the integrals of simple functions, a simple function being a function taking constant values on each
of a finite number of measurable sets whose union is X.
Proof. If A < F(X), we can choose a subdivision {y } so that if E are
the corresponding sets, S the corresponding sum,

V
X

y (E )

=1

for a finite V. One of the (E ) can be infinite only if F(x) = and


then there is nothing to prove. Otherwise, (E ) < and we let {y }
be a subdivision with = sup(y+1 y ) and denote by S the sum
defined for {y } and by S the sum defined for the subdivision consisting
of points y and y . Since S is not decreased by insertion of extra points
of sub-division,
S S
while

V
X

y (E ) > A,

=1

S S

V
X

(E )

and, by making small enough we get S > A. Since S F(X) and


A < F(X) is arbitrary, this proves the theorem.
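The convergence of the sums $S_k$ can be watched for a concrete integrand. A hedged Python sketch (a modern illustration: $f(x) = x^2$ on $[0, 1]$ with Lebesgue measure, so $E_\nu = [y_\nu \le x^2 < y_{\nu+1}]$ has measure $\sqrt{y_{\nu+1}} - \sqrt{y_\nu}$, and the sums increase toward $\int_0^1 x^2\,dx = 1/3$):

```python
from math import sqrt

# Lebesgue sums S = sum of y_nu * mu(E_nu) for f(x) = x^2 on [0, 1],
# using equal subdivisions of the VALUE axis (not the x axis).
def lebesgue_sum(levels):
    ys = [nu / levels for nu in range(levels + 1)]
    return sum(ys[nu] * (sqrt(ys[nu + 1]) - sqrt(ys[nu]))
               for nu in range(levels))

sums = [lebesgue_sum(n) for n in (10, 100, 1000)]
assert sums == sorted(sums)          # nested refinements do not decrease S
assert abs(sums[-1] - 1 / 3) < 1e-2  # S approaches the integral 1/3
```

The subdivisions for 10, 100 and 1000 levels are nested refinements of one another, which is why the sums are monotone as Theorem 19's proof exploits.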

The definition can be extended to integrals over subsets $X$ of $\mathbf{X}$ by defining
$$F(X) = \int_X f(x)\, d\mu = \int_{\mathbf{X}} f_X(x)\, d\mu,$$
where $f_X(x) = f(x)$ for $x$ in $X$ and $f_X(x) = 0$ for $x$ in $\mathbf{X} - X$. We may therefore always assume (when it is convenient) that integrals are over the whole space $\mathbf{X}$.

The condition $f(x) \ge 0$ can easily be removed.

We define
$$f^+(x) = f(x) \text{ when } f(x) \ge 0,\qquad f^+(x) = 0 \text{ when } f(x) < 0,$$
$$f^-(x) = -f(x) \text{ when } f(x) \le 0,\qquad f^-(x) = 0 \text{ when } f(x) > 0.$$
Then $f(x) = f^+(x) - f^-(x)$ and $|f(x)| = f^+(x) + f^-(x)$. We define
$$\int f(x)\, d\mu = \int f^+(x)\, d\mu - \int f^-(x)\, d\mu$$
when both the integrals on the right are finite, so that $f(x)$ is integrable if and only if $|f(x)|$ is integrable.

In general, we use the integral sign only when the integrand is integrable in this absolute sense. The only exception to this rule is that we may sometimes write $\int_X f(x)\, d\mu = \infty$ when $f(x) \ge r(x)$ and $r(x)$ is integrable.

Theorem 20. If f(x) is integrable on X, then

F(X) = ∫_X f(x)dμ

is defined for every measurable subset X of the whole space and is completely additive on these sets.

Corollary. If f(x) ≥ 0, then F(Y) ≤ F(X) if Y ⊂ X.
Proof. It is sufficient to prove the theorem in the case f(x) ≥ 0. Let X = ∑_{n=1}^{∞} X_n, where the X_n are measurable and disjoint. Then, if {y_ν} is a subdivision,

E_ν = ∑_{n=1}^{∞} E_ν X_n, μ(E_ν) = ∑_{n=1}^{∞} μ(E_ν X_n),

and

S = ∑_{ν=1}^{∞} y_ν μ(E_ν) = ∑_{ν=1}^{∞} y_ν ∑_{n=1}^{∞} μ(E_ν X_n) = ∑_{n=1}^{∞} ∑_{ν=1}^{∞} y_ν μ(E_ν X_n) = ∑_{n=1}^{∞} S_n,

where S_n is the sum for f(x) over X_n. Since S and the S_n (which are ≥ 0) tend to F(X) and F(X_n) respectively as the maximum interval of subdivision tends to 0, we get

F(X) = ∫_X f(x)dμ = ∑_{n=1}^{∞} F(X_n).

Theorem 21. If a is a constant,

∫_X a f(x)dμ = a ∫_X f(x)dμ.

Proof. We may again suppose that f(x) ≥ 0 and that a > 0. If we use the subdivision {y_ν} for f(x) and {a y_ν} for a f(x), the sets E_ν are the same in each case, and the proof is trivial.

Theorem 22. If A ≤ f(x) ≤ B in X, then

A μ(X) ≤ F(X) ≤ B μ(X).

Theorem 23. If f(x) ≤ g(x) in X, then

∫_X f(x)dμ ≤ ∫_X g(x)dμ.

Corollary. If |f(x)| ≤ g(x) and g(x) is integrable, then so is f(x).

Theorem 24. If f(x) ≥ 0 and ∫_X f(x)dμ = 0, then f(x) = 0 p.p. in X.


Proof. If this were not so, then

[f(x) > 0] = ∑_{n=0}^{∞} [1/(n+1) ≤ f(x) < 1/n]

(where 1/0 is interpreted as ∞) has positive measure, and hence so has at least one subset

E_n = [1/(n+1) ≤ f(x) < 1/n].

Then

∫_X f(x)dμ ≥ ∫_{E_n} f(x)dμ ≥ μ(E_n)/(n+1) > 0,

which is impossible.

Corollary 1. If ∫_X f(x)dμ = 0 for all measurable X, f(x) not necessarily of the same sign, then f(x) = 0 p.p.

We have merely to apply Theorem 24 to X₁ = [f(x) ≥ 0] and to X₂ = [f(x) < 0].

Corollary 2. If ∫_X f(x)dμ = ∫_X g(x)dμ for all measurable X, then f(x) = g(x) p.p.

If f(x) = g(x) p.p. we say that f and g are equivalent.

12. Absolute Continuity


A completely additive set function F(X) defined on a Borel system is said to be absolutely continuous with respect to a measure μ on the same system if F(X) → 0 uniformly in X as μ(X) → 0. In other words, if ε > 0, we can find δ > 0 so that |F(X)| < ε for all sets X which satisfy μ(X) < δ. In particular, if F(X) is defined in R by a point function F(x) of bounded variation, then it is absolutely continuous if, given ε > 0, we can find δ > 0 so that, for every finite set of non-overlapping intervals (a_i, b_i),

∑_{i=1}^{n} |F(b_i) − F(a_i)| ≤ ε if ∑_{i=1}^{n} (b_i − a_i) < δ.

Moreover, it is clear from the proof of Theorem 3 that a set function F(X) is absolutely continuous if and only if its components F⁺(X), F⁻(X) are both absolutely continuous. An absolutely continuous point function F(x) can be expressed as the difference of two absolutely continuous non-decreasing functions, as we see by applying the method used on page 22 to decompose a function of bounded variation into two monotonic functions. We observe that the concept of absolute continuity does not involve any topological assumptions on X.
Theorem 25. If f(x) is integrable on X, then

F(X) = ∫_X f(x)dμ

is absolutely continuous.

Proof. We may suppose that f(x) ≥ 0. If ε > 0, we choose a subdivision {y_ν} so that

∑_{ν=1}^{∞} y_ν μ(E_ν) > F(X) − ε/4,

and then choose V so that

∑_{ν=1}^{V} y_ν μ(E_ν) > F(X) − ε/2.

Then, if A > y_{V+1} and E^A = [f(x) ≥ A], we have E_ν ⊂ X − E^A for ν ≤ V. Now

F(X − E^A) ≥ ∑_{ν=1}^{V} ∫_{E_ν} f(x)dμ ≥ ∑_{ν=1}^{V} y_ν μ(E_ν) > F(X) − ε/2,

and therefore

F(E^A) < ε/2.

If X′ is any measurable set,

F(X′) = F(X′E^A) + F(X′ − X′E^A) < ε/2 + A μ(X′) (since f(x) ≤ A outside E^A) ≤ ε

provided that μ(X′) ≤ ε/2A = δ.

Theorem 26. If f(x) is integrable on X and X_n → X, then

F(X_n) → F(X).

Proof. If μ(X) < ∞, this follows from Theorem 25 and the continuity of μ in the sense of Theorem 2. If μ(X) = ∞ and ε > 0, we can choose a subdivision {y_ν} and corresponding subsets E_ν of X so that

∑_{ν=1}^{∞} y_ν μ(E_ν) > F(X) − ε

(assuming that f(x) ≥ 0, as we may). But

F(X_n) = ∑_{ν=1}^{∞} F(X_n E_ν)

and F(X_n E_ν) → F(E_ν) as n → ∞ for every ν, since μ(E_ν) < ∞. Since all the terms F(X_n E_ν) are positive, it follows that

lim inf F(X_n) ≥ ∑_{ν=1}^{∞} F(E_ν) ≥ ∑_{ν=1}^{∞} y_ν μ(E_ν) > F(X) − ε.

Since F(X_n) ≤ F(X), the theorem follows.

Theorem 27. If f(x) is integrable on X and ε > 0, we can find a subset X₁ of X so that μ(X₁) < ∞, ∫_{X−X₁} |f(x)|dμ < ε, and f(x) is bounded in X₁.

Proof. The theorem follows at once from Theorems 25 and 26, since we can take X₁ = [y₁ ≤ f(x) ≤ A] for suitable y₁ > 0 and A, and this set has finite measure since f(x) is integrable.

Theorem 28. If f(x) and g(x) are integrable on X, so is f(x) + g(x), and

∫_X [f(x) + g(x)]dμ = ∫_X f(x)dμ + ∫_X g(x)dμ.

Proof. Since |f(x) + g(x)| ≤ |f(x)| + |g(x)| ≤ 2 sup(|f(x)|, |g(x)|), we have

∫_X |f(x) + g(x)|dμ ≤ 2 ∫_X sup(|f(x)|, |g(x)|)dμ = 2 ∫_{|f|≥|g|} |f(x)|dμ + 2 ∫_{|f|<|g|} |g(x)|dμ ≤ 2 ∫_X |f(x)|dμ + 2 ∫_X |g(x)|dμ,

so that f(x) + g(x) is integrable. After Theorem 27, there is no loss of generality in supposing that μ(X) < ∞. Moreover, by subdividing X into the sets (not more than 8) in which f(x), g(x), f(x) + g(x) have constant signs, the theorem can be reduced to the case in which f(x) ≥ 0, g(x) ≥ 0, and so f(x) + g(x) ≥ 0 in X.

The conclusion is obvious if f(x) is a constant c ≥ 0, for we can then take as subdivisions {y_ν} for g(x) and {y_ν + c} for g(x) + c. In the general case, if

E_ν = [y_ν ≤ g(x) < y_{ν+1}],

then

∫_X [f(x) + g(x)]dμ = ∑_{ν=0}^{∞} ∫_{E_ν} [f(x) + g(x)]dμ, by Theorem 20,

≥ ∑_{ν=0}^{∞} ∫_{E_ν} f(x)dμ + ∑_{ν=1}^{∞} y_ν μ(E_ν) = ∫_X f(x)dμ + s,

and since ∫_X g(x)dμ is the supremum of s over all subdivisions {y_ν}, we get

∫_X [f(x) + g(x)]dμ ≥ ∫_X f(x)dμ + ∫_X g(x)dμ.

On the other hand, if ε > 0 and we consider subdivisions for which

y₁ ≤ ε, y_{ν+1} ≤ (1+ε)y_ν for ν ≥ 1,

we get

∫_X [f(x) + g(x)]dμ ≤ ∑_{ν=0}^{∞} ∫_{E_ν} f(x)dμ + ∑_{ν=0}^{∞} y_{ν+1} μ(E_ν)

≤ ∫_X f(x)dμ + (1+ε)s + y₁ μ(E₀)

≤ ∫_X f(x)dμ + (1+ε) ∫_X g(x)dμ + ε μ(X),

and the conclusion follows if we let ε → 0.


Combining this result with Theorem 21, we get
TheoremR29. The integrable functions on X form a linear space over R
on which f (x)d is a linear functional.
X

This space is denoted by L(X), and f (x)L(X) means that f (x) is 36


(absolutely) integrable on X.

13. Convergence theorems


Theorem 30 (Fatou's Lemma). If φ(x) is integrable on X, and f_n(x), n = 1, 2, . . . are measurable functions, then

lim sup ∫_X f_n(x)dμ ≤ ∫_X (lim sup f_n(x))dμ if f_n(x) ≤ φ(x),

lim inf ∫_X f_n(x)dμ ≥ ∫_X (lim inf f_n(x))dμ if f_n(x) ≥ φ(x).
As immediate corollaries we have

Theorem 31 (Lebesgue's theorem on dominated convergence). If φ(x) is integrable on X, |f_n(x)| ≤ φ(x) and

f_n(x) → f(x) p.p. in X,

then

∫_X f_n(x)dμ → ∫_X f(x)dμ.

In particular, the conclusion holds if μ(X) < ∞ and the f_n(x) are uniformly bounded.
Theorem 32 (Monotone convergence theorem). If φ(x) is integrable on X, f_n(x) ≥ φ(x) and f_n(x) is an increasing sequence for each x, with limit f(x), then

lim ∫_X f_n(x)dμ = ∫_X f(x)dμ

in the sense that if either side is finite, then so is the other and the two values are the same, and if one side is +∞, so is the other.

Proof of Fatou's lemma.

The two cases in the theorem are similar. It is sufficient to prove the second, and since f_n(x) − φ(x) ≥ 0, there is no loss of generality in supposing that φ(x) = 0, f_n(x) ≥ 0.

Let f(x) = lim inf f_n(x) and suppose first that ∫_X f(x)dμ < ∞. Then, after Theorem 27, given ε > 0, we can define X₁ so that μ(X₁) < ∞ and ε > ∫_{X−X₁} f(x)dμ, while f(x) is bounded in X₁.

A straightforward modification of Egoroff's theorem together with Theorem 25 shows that we can find a set X₂ ⊂ X₁ so that

∫_{X₁−X₂} f(x)dμ < ε while f_n(x) ≥ f(x) − ε/μ(X₁)

for all x in X₂ and n ≥ N. Then

∫_X f_n(x)dμ ≥ ∫_{X₂} f_n(x)dμ ≥ ∫_{X₂} f(x)dμ − ε ≥ ∫_X f(x)dμ − 3ε for n ≥ N,

and our conclusion follows.

If

∫_X f(x)dμ = ∞,

it follows from the definition of the integral that, given A > 0, we can define φ(x) ∈ L(X) so that

∫_X φ(x)dμ ≥ A, 0 ≤ φ(x) ≤ f(x).

The argument used above now shows that

∫_X f_n(x)dμ ≥ ∫_X φ(x)dμ − 3ε ≥ A − 3ε

for sufficiently large n, and hence

lim inf ∫_X f_n(x)dμ = ∞.

Restatement of Theorems 31 and 32 in terms of series, rather than sequences, gives us

Theorem 33 (Integration of series). If u_n(x) is measurable for each n and u(x) = ∑_{n=1}^{∞} u_n(x), then

∫_X u(x)dμ = ∑_{n=1}^{∞} ∫_X u_n(x)dμ

provided that |∑_{ν=1}^{n} u_ν(x)| ≤ φ(x) for all n and x, with φ(x) ∈ L(X).

The equation is true if u_n(x) ≥ 0, in the sense that if either side is finite, then so is the other and equality holds, while if either side is ∞, so is the other.
Theorem 34 (Differentiation under the integral sign). If f(x, y) is integrable in a < x < b for each y in a neighbourhood of y = y₀, and if ∂f/∂y exists in a < x < b, then

(d/dy) ∫_a^b f(x, y)dx = ∫_a^b (∂f/∂y)dx at y = y₀,

provided that

|(f(x, y₀ + h) − f(x, y₀))/h| ≤ φ(x), φ(x) ∈ L(a, b),

for all sufficiently small h.

This theorem follows from the analogue of Theorem 31 with n replaced by a continuous variable h. The proof is similar.

14. The Riemann Integral


If we proceed to define an integral as we have done, but restrict the set function to one defined only on a finitely additive system of sets (we call this set function a measure even now), we get a theory which, in the case of functions of a real variable, is equivalent to that of Riemann. It is then obvious that an R-integrable function is also L-integrable and that the two integrals have the same value.

The more direct definition of the R-integral is that f(x) is R-integrable in a ≤ x ≤ b if it is bounded and if we can define two sequences {φ_n(x)}, {ψ_n(x)} of step functions so that φ_n(x) increases and ψ_n(x) decreases for each x,

φ_n(x) ≤ f(x) ≤ ψ_n(x), ∫_a^b (ψ_n(x) − φ_n(x))dx → 0 as n → ∞.

Since lim φ_n(x) = lim ψ_n(x) = f(x) p.p., it is clear that f(x) is L-integrable and that its L-integral satisfies

∫_a^b f(x)dx = lim ∫_a^b φ_n(x)dx = lim ∫_a^b ψ_n(x)dx,

and the common value of these is the R-integral. The following is the main theorem.
Theorem 35. A bounded function in (a, b) is R-integrable if and only if it is continuous p.p.
Lemma. If f(x) is R-integrable and ε > 0, we can define δ > 0 and a measurable set E₀ in (a, b) so that

μ(E₀) > b − a − ε, |f(x + h) − f(x)| ≤ ε for x ∈ E₀, x + h ∈ (a, b), |h| ≤ δ.

Proof of Lemma: We can define continuous functions φ(x), ψ(x) in a ≤ x ≤ b so that

(i) φ(x) ≤ f(x) ≤ ψ(x), a ≤ x ≤ b,

(ii) ∫_a^b (ψ(x) − φ(x))dx < ε²/2.

If E₀ is the set in (a, b) in which ψ(x) − φ(x) < ε/2, it is plain that μ(E₀) > b − a − ε, for otherwise the integral in (ii) would be at least ε²/2. By uniform continuity of φ(x), ψ(x), we can define δ = δ(ε) > 0 so that

|ψ(x + h) − ψ(x)| ≤ ε/2, |φ(x + h) − φ(x)| ≤ ε/2

for x, x + h in (a, b), |h| ≤ δ. Then, if x is in E₀, x + h is in (a, b) and |h| ≤ δ,

f(x + h) − f(x) ≤ ψ(x + h) − φ(x) = ψ(x) − φ(x) + ψ(x + h) − ψ(x) ≤ ε/2 + ε/2 = ε,

and similarly f(x + h) − f(x) ≥ −ε, as we require.
Proof of Theorem 35. If f(x) is R-integrable, let

ε > 0, ε_n > 0, ∑_{n=1}^{∞} ε_n ≤ ε,

and define measurable sets E_n in (a, b) by the lemma so that

μ(E_n) > b − a − ε_n, |f(x + h) − f(x)| ≤ ε_n for x ∈ E_n, |h| ≤ δ_n, δ_n = δ_n(ε_n) > 0.

Let E = ∏_{n=1}^{∞} E_n, so that

μ(E) ≥ b − a − ∑ ε_n ≥ b − a − ε.

Since f(x) is continuous at every point of E and ε is arbitrarily small, f(x) is continuous p.p.

Conversely, suppose that f(x) is continuous p.p. Then, if ε > 0, we can define E₀ so that μ(E₀) > b − a − ε and δ > 0 so that

|f(x + h) − f(x)| < ε for x ∈ E₀, |h| < δ.

If now we divide (a, b) into intervals of length at most δ, those which contain a point of E₀ contribute not more than 2ε(b − a) to the difference between the upper and lower Riemann sums S, s for f(x), while the intervals which do not contain points of E₀ have total length at most ε and contribute not more than 2εM, where M = sup |f(x)|. Hence

S − s ≤ 2ε(b − a) + 2εM,

which can be made arbitrarily small.
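The behaviour in Theorem 35 can be checked numerically (an illustrative sketch, not part of the text): for a bounded function with a single jump, hence continuous p.p., the gap between upper and lower Riemann sums shrinks with the mesh, since the jump is confined to one interval.

```python
# Upper and lower Riemann sums S, s on n equal intervals; each interval's
# sup and inf are estimated by probing a few points inside it.
def riemann_sums(f, a, b, n, probes=50):
    h = (b - a) / n
    S = s = 0.0
    for i in range(n):
        ys = [f(a + i * h + j * h / probes) for j in range(probes + 1)]
        S += h * max(ys)
        s += h * min(ys)
    return S, s

f = lambda x: 0.0 if x < 0.5 else 1.0   # one jump: continuous p.p.
S1, s1 = riemann_sums(f, 0.0, 1.0, 10)
S2, s2 = riemann_sums(f, 0.0, 1.0, 1000)
# the jump lies in a single interval, so S - s is about one mesh width
```

By contrast, a function discontinuous on a set of positive measure (such as the indicator of the rationals, which cannot be probed on a grid) would keep S − s bounded away from 0.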



15. Stieltjes Integrals


In the development of the Lebesgue integral, we have assumed that the measure μ is non-negative. It is easy to extend the theory to the case in which μ is the difference between two measures μ⁺ and −μ⁻ in accordance with Theorem 3. In this case, we define

∫_X f(x)dμ = ∫_X f(x)dμ⁺ − ∫_X f(x)d(−μ⁻),

when both integrals on the right are finite, and since μ⁺ and −μ⁻ are measures, all our theorems apply to the integrals separately and therefore to their sum, with the exception of Theorems 22, 23, 24, 30, 32, in which the sign of μ obviously plays a part. The basic inequality which takes the place of Theorem 22 is

Theorem 36. If μ = μ⁺ + μ⁻ in accordance with Theorem 3, and μ̄ = μ⁺ − μ⁻, then

|∫_X f(x)dμ| ≤ ∫_X |f(x)|dμ̄.

[The integral on the right is often written ∫_X |f(x)| |dμ|.]

Proof.


Z
Z



+

f (x)d X f (x)d( )
f (x)d =

Z
Z




+


f (x)d +
f (x)d( )

| f (x)d+ +

| f (x)|d =

| f (x)|d( )

| f (x)| |d|.

We shall nearly always suppose that μ is a measure, with μ ≥ 0, but it will be obvious when theorems do not depend on the sign of μ, and these can be extended immediately to the general case. When we deal with inequalities, it is generally essential to restrict μ to the positive case (or replace it by μ̄).

Integrals with μ taking positive and negative values are usually called Stieltjes integrals. If they are integrals of functions f(x) of a real variable x with respect to μ defined by a function β(x) of bounded variation, we write

∫_X f(x)dβ(x) for ∫_X f(x)dμ,

and if X is an interval (a, b) with β(x) continuous at a and at b, we write it as

∫_a^b f(x)dβ(x).

In particular, if β(x) = x, we get the classical Lebesgue integral, which can always be written in this form.

If β(x) is not continuous at a or at b, the integral will generally depend on whether the interval of integration is open or closed at each end, and we have to specify the integral in one of the four forms

∫_{a−0}^{b−0} f(x)dβ(x), ∫_{a−0}^{b+0}, ∫_{a+0}^{b−0}, ∫_{a+0}^{b+0}.

Finally, if f(x) = f₁(x) + i f₂(x) (f₁(x), f₂(x) real) is a complex valued function, it is integrable if f₁ and f₂ are both integrable, and we define

∫_X f(x)dμ = ∫_X f₁(x)dμ + i ∫_X f₂(x)dμ.

The inequality

|∫_X f(x)dμ| ≤ ∫_X |f(x)| |dμ|

(Theorem 36) still holds.


16. L-Spaces
A set L of elements f, g, . . . is a linear space over the field R of real numbers (and similarly over any field) if

(1) L is an abelian group with operation denoted by +.
(2) αf is defined and belongs to L for any α of R and f of L.
(3) (α + β)f = αf + βf
(4) α(f + g) = αf + αg
(5) α(βf) = (αβ)f
(6) 1·f = f.

A linear space is a topological linear space if

(1) L is a topological group under addition,
(2) scalar multiplication by α in R is continuous in this topology.

L is a metric linear space if its topology is defined by a metric. It is a Banach space if

(1) L is a metric linear space in which the metric is defined by d(f, g) = ‖f − g‖, where the norm ‖f‖ is defined as a real number for all f of L and has the properties

‖f‖ = 0 if and only if f = 0, ‖f‖ ≥ 0 always,
‖αf‖ = |α| ‖f‖, ‖f + g‖ ≤ ‖f‖ + ‖g‖,

and

(2) L is complete. That is, if a sequence f_n has the property that ‖f_n − f_m‖ → 0 as m, n → ∞, then there is a limit f in L for which ‖f_n − f‖ → 0.

A Banach space L is called a Hilbert space if an inner product (f, g) is defined for every f, g of L as a complex number and

(1) (f, g) is linear in f for each fixed g,
(2) (f, g) is the complex conjugate of (g, f),
(3) (f, f) = ‖f‖².

Two points f, g are orthogonal if (f, g) = 0. It is obvious that the integrable functions f(x) in X form a linear space L(X) on which ∫_X f(x)dμ is a linear functional. If p ≥ 1, the space of measurable functions f(x) on X for which |f(x)|^p is integrable is denoted by L^p(X), and we have the following basic theorems.

Theorem 37 (Hölder's inequality; Schwarz's inequality if p = 2). If p ≥ 1, 1/p + 1/p′ = 1, f(x) ∈ L^p(X), g(x) ∈ L^{p′}(X), then

|∫_X f(x)g(x)dμ| ≤ (∫_X |f(x)|^p dμ)^{1/p} (∫_X |g(x)|^{p′} dμ)^{1/p′}.

If p = 1, (∫_X |g(x)|^{p′} dμ)^{1/p′} is interpreted as the essential upper bound of |g(x)|, that is, the smallest number B for which |g(x)| ≤ B p.p.

Theorem 38. If q ≥ p ≥ 1 and μ(X) < ∞, then

L^q(X) ⊂ L^p(X).

If μ(X) = ∞, there is no inclusion relation between L^p, L^q. For the proof, we merely apply Hölder's theorem with f(x), g(x), p replaced by |f(x)|^p, 1, q/p respectively.

Theorem 39 (Minkowski's inequality). If p ≥ 1 and ‖f‖ = (∫_X |f(x)|^p dμ)^{1/p}, then

‖f + g‖ ≤ ‖f‖ + ‖g‖.

For the proofs see Hardy, Littlewood and Pólya: Inequalities.
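Both inequalities can be checked numerically on a discrete measure space (an illustrative sketch with randomly chosen data, not part of the text): with unit point masses on finitely many points, integrals reduce to plain sums.

```python
# Check Holder (Theorem 37) and Minkowski (Theorem 39) on the discrete
# measure space {0, ..., n-1} with unit point masses, where integrals are sums.
import random

def norm(v, p):
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

random.seed(1)
n, p = 50, 3.0
q = p / (p - 1)                      # conjugate exponent: 1/p + 1/q = 1
f = [random.uniform(-1.0, 1.0) for _ in range(n)]
g = [random.uniform(-1.0, 1.0) for _ in range(n)]

holder_lhs = abs(sum(a * b for a, b in zip(f, g)))
holder_rhs = norm(f, p) * norm(g, q)
mink_lhs = norm([a + b for a, b in zip(f, g)], p)
mink_rhs = norm(f, p) + norm(g, p)
```

Any random draw satisfies both inequalities; equality in Hölder would require |f|^p and |g|^q to be proportional.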


Theorem 40. If p ≥ 1, L^p(X) is complete. (The case p = 2 is the Riesz–Fischer theorem.)

Proof. We suppose that p < ∞ and that

‖f_n − f_m‖ → 0 as m, n → ∞

(in the notation introduced in Theorem 39), and define A_k > 0, ε_k > 0 so that ∑ A_k < ∞ and ∑ (ε_k/A_k)^p < ∞.

We can choose a sequence {n_k} so that n_{k+1} > n_k and

‖f_{n_k} − f_m‖ ≤ ε_k for m ≥ n_k,

and in particular

‖f_{n_{k+1}} − f_{n_k}‖ ≤ ε_k.

Let E_k be the set in which |f_{n_{k+1}}(x) − f_{n_k}(x)| ≥ A_k. Then

ε_k^p ≥ ∫_X |f_{n_{k+1}}(x) − f_{n_k}(x)|^p dμ ≥ ∫_{E_k} |f_{n_{k+1}}(x) − f_{n_k}(x)|^p dμ ≥ A_k^p μ(E_k),

so that

μ(∑_{k=K}^{∞} E_k) ≤ ∑_{k=K}^{∞} (ε_k/A_k)^p → 0 as K → ∞, since ∑ (ε_k/A_k)^p < ∞.

Since f_{n_k}(x) tends to a limit at every point of each set X − ∑_{k=K}^{∞} E_k (because ∑ A_k < ∞), it follows that f_{n_k}(x) tends to a limit f(x) p.p. Also, it follows from Fatou's lemma that, since ‖f_{n_k}‖ is bounded, f(x) ∈ L^p(X) and that

‖f_{n_k} − f‖ ≤ ε_k, ‖f_{n_k} − f‖ → 0 as k → ∞.

Since ‖f_{n_k} − f_m‖ → 0 as k, m → ∞, it follows from Minkowski's inequality that ‖f_m − f‖ → 0 as m → ∞.

If p = ∞, the proof is rather simpler.
From these theorems we deduce


Theorem 41. If p ≥ 1, L^p(X) is a Banach space with

‖f‖ = (∫_X |f(x)|^p dμ)^{1/p}.

L² is a Hilbert space with

(f, g) = ∫_X f(x) ḡ(x) dμ.

The spaces L^p generally have certain separability properties related to the topological properties (if any) of X.

A function with real values, defined on an additive system S₀ and taking constant values on each of a finite number of sets of S₀, is called a step function.

Theorem 42. The set of step functions (and even the set of step functions taking rational values) is dense in L^p for 1 ≤ p < ∞. If the Borel system of measurable sets in X is generated by a countable, finitely additive system S₀, then the set of step functions with rational values is countable and L^p is separable.

The proof follows easily from the definition of the integral.
Theorem 43. If every step function can be approximated in L^p(X) by continuous functions, then the continuous functions are dense in L^p(X) (assuming, of course, that X is a topological space).

In particular, continuous functions in R^n are dense in L^p(R^n). Since the measure in R^n can be generated by a completely additive function on the finitely additive countable system formed by finite unions of rectangles a_i ≤ x_i < b_i, with a_i, b_i rational, and their complements, L^p(R^n) is separable.

We have proved in Theorem 25 that an integral over an arbitrary set X is an absolutely continuous function of X. The following theorem provides a converse.


Theorem 44 (Lebesgue for R^n; Radon–Nikodym in the general case). If H(X) is completely additive and finite in X, and if X has finite measure or is the limit of a sequence of subsets of finite measure, then

H(X) = F(X) + Q(X), where F(X) = ∫_X f(x)dμ, f(x) ∈ L(X),

and Q is a (singular) function with the property that there is a set X_s of measure zero for which

0 ≤ Q(X) = Q(X·X_s)

for all measurable X. Moreover, F(X), Q(X) are unique and f(x) is unique up to a set of measure zero.

In particular, if H(X) is absolutely continuous, then Q(X) = 0 and

H(X) = F(X) = ∫_X f(x)dμ, f(x) ∈ L(X).

In this case, f(x) is called the Radon derivative of H(X) = F(X).

Proof. We assume that μ(X) < ∞; the extension is straightforward. By Theorem 3, we can suppose that H(X) ≥ 0. Let Ω be the class of measurable functions φ(x) with the property that φ(x) ≥ 0 and

∫_X φ(x)dμ ≤ H(X)

for all measurable X. Then we can find a sequence {φ_n(x)} in Ω for which

∫ φ_n(x)dμ → sup_{φ∈Ω} ∫ φ(x)dμ ≤ H(X) < ∞.

If we define ψ_n(x) = sup_{k≤n} φ_k(x) and observe that

X = ⋃_{k=1}^{n} X·[ψ_n(x) = φ_k(x)],

we see that ψ_n(x) belongs to Ω. Since ψ_n(x) increases with n for each x, it has a limit f(x) ≥ 0 which also belongs to Ω, and we can write

F(X) = ∫_X f(x)dμ ≤ H(X), Q(X) = H(X) − F(X) ≥ 0,

while

(1) ∫_X f(x)dμ = sup_{φ∈Ω} ∫_X φ(x)dμ < ∞.

Now let

Q_n(X) = Q(X) − μ(X)/n

and let X_n⁺, X_n⁻ be the sets defined by Theorem 3, for which

Q_n(X) ≥ 0 if X ⊂ X_n⁺, Q_n(X) ≤ 0 if X ⊂ X − X_n⁺.

Then

H(X) ≥ F(X) + μ(X)/n = ∫_X (f(x) + 1/n)dμ if X ⊂ X_n⁺,

and if

f*(x) = f(x) + 1/n for x ∈ X_n⁺, f*(x) = f(x) for x outside X_n⁺,

it follows that f*(x) belongs to Ω, and this contradicts (1) unless μ(X_n⁺) = 0. Hence μ(X_n⁺) = 0 and Q(X) = 0 if X is disjoint from

X_s = ∑_{n=1}^{∞} X_n⁺,

which has measure zero.

To prove uniqueness, suppose that the decomposition can be made in two ways, so that

H(X) = F₁(X) + Q₁(X) = F₂(X) + Q₂(X),

where F₁(X), F₂(X) are integrals and Q₁(X), Q₂(X) vanish on all sets disjoint from two sets of measure zero whose union X_s also has measure zero. Then

F₁(X) = F₁(X − X·X_s), F₂(X) = F₂(X − X·X_s),

F₁(X) − F₂(X) = Q₂(X − X·X_s) − Q₁(X − X·X_s) = 0.
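On a finite space the decomposition of Theorem 44 can be computed directly (an illustrative sketch with made-up numbers, not part of the text): the density is the ratio of point masses wherever μ is positive, and the singular part Q lives on the μ-null points.

```python
# Lebesgue decomposition H = F + Q on a three-point space:
# mu gives the third point measure zero, so any mass H puts there is singular.
mu = [0.5, 0.5, 0.0]
H = [0.3, 0.4, 0.25]
f = [h / m if m > 0 else 0.0 for h, m in zip(H, mu)]   # Radon derivative
F = [fi * m for fi, m in zip(f, mu)]                   # integral part
Q = [h - Fi for h, Fi in zip(H, F)]                    # singular part
# Q vanishes on every set disjoint from the mu-null point {2}
```

The uniqueness statement of the theorem is visible here too: any other split would either change f on a set of positive μ-measure or move mass of Q off the null set.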

Theorem 45. If ν(X) is absolutely continuous in X and has Radon derivative ω(x) with respect to a measure μ in X, then

∫_X f(x)dν = ∫_X f(x)ω(x)dμ

if either side exists.

Proof. We may suppose that f(x) ≥ 0, ω(x) ≥ 0, ν(X) ≥ 0. Suppose that

∫_X f(x)dν < ∞.

Then it follows from Theorem 27 that we may suppose that μ(X) < ∞. If ε > 0, we consider subdivisions {y_k} for which

y₁ ≤ ε, y_{k+1} ≤ (1+ε)y_k (k ≥ 1),

so that

s = ∑_{k=1}^{∞} y_k ν(E_k) ≤ ∫_X f(x)dν ≤ ∑_{k=0}^{∞} y_{k+1} ν(E_k) ≤ (1+ε)s + εν(X).

But

∫_X f(x)ω(x)dμ = ∑_{k=0}^{∞} ∫_{E_k} f(x)ω(x)dμ

by Theorem 20, and

y_k ν(E_k) ≤ ∫_{E_k} f(x)ω(x)dμ ≤ y_{k+1} ν(E_k)

by Theorem 22, since ν(E_k) = ∫_{E_k} ω(x)dμ, and therefore we have also

s ≤ ∫_X f(x)ω(x)dμ ≤ (1+ε)s + εν(X).

The conclusion follows on letting ε → 0.

Moreover, the first part of this inequality holds even if ∫_X f(x)dν = ∞, but in this case s is not bounded, and since the inequality holds for all s,

∫_X f(x)ω(x)dμ = ∞.
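Theorem 45 is transparent on a finite space (again an illustrative sketch with invented numbers): if ν has density ω with respect to μ, then integrating f against ν is the same as integrating fω against μ.

```python
# On points 0..3, nu({i}) = omega(i) * mu({i}); then for any f,
# the sum of f dnu equals the sum of f * omega dmu (the discrete Theorem 45).
mu = [0.2, 0.5, 0.1, 0.2]
omega = [1.5, 0.4, 2.0, 1.0]                 # Radon derivative d(nu)/d(mu)
nu = [w * m for w, m in zip(omega, mu)]
f = [3.0, -1.0, 0.5, 2.0]

lhs = sum(fi * ni for fi, ni in zip(f, nu))              # integral of f dnu
rhs = sum(fi * wi * mi for fi, wi, mi in zip(f, omega, mu))
```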

17. Mappings of measures

Suppose that we have two spaces X, X′ and a mapping x → x′ of X into X′. If S is a Borel system of measurable sets X with a measure μ in X, the mapping induces a Borel system S′ of measurable sets X′ in X′, these being defined as those sets X′ for which the inverse images X in X are measurable, the measure μ′ induced by μ on S′ being defined by μ′(X′) = μ(X), where X is the inverse image of X′.

If the mapping is (1-1), the two spaces have the same properties of measure and we call the mapping a measure isomorphism.

Theorem 46 (Change of variable). If the measures μ, μ′ in X and X′ are isomorphic under the (1-1) mapping x → x′ of X onto X′, and if f′(x′) = f(x), then

∫_X f(x)dμ = ∫_{X′} f′(x′)dμ′.

The proof is immediate if we note that the sets E_ν and E′_ν defined in X and X′ respectively by any subdivision correspond under the mapping x → x′ and have the same measure μ(E_ν) = μ′(E′_ν).

As an immediate corollary of this theorem we have


Theorem 47. If φ(t) increases for A ≤ t ≤ B with φ(A) = a, φ(B) = b, and G(x) is of bounded variation in a ≤ x ≤ b, then

∫_a^b f(x)dG(x) = ∫_A^B f(φ(t))dG(φ(t)).

In particular,

∫_a^b f(x)dx = ∫_A^B f(φ(t))dφ(t)

and, if φ(t) is absolutely continuous,

∫_a^b f(x)dx = ∫_A^B f(φ(t))φ′(t)dt.

18. Differentiation
It has been shown in Theorem 44 that any completely additive and absolutely continuous finite set function can be expressed as the integral of an integrable function, defined uniquely up to a set of measure zero, called its Radon derivative. This derivative does not depend upon any topological properties of the space X. On the other hand, the derivative of a function of a real variable is defined, classically, as a limit in the topology of R. An obvious problem is to determine the relationship between Radon derivatives and those defined by other means. We consider here only the case X = R, where the theory is familiar (but not easy). We need some preliminary results about derivatives of a function F(x) in the classical sense.
Definition. The upper and lower, right and left derivatives of F(x) at x are defined, respectively, by

D⁺F = lim sup_{h→+0} (F(x + h) − F(x))/h,

D₊F = lim inf_{h→+0} (F(x + h) − F(x))/h,

D⁻F = lim sup_{h→−0} (F(x + h) − F(x))/h,

D₋F = lim inf_{h→−0} (F(x + h) − F(x))/h.

Plainly D⁺F ≥ D₊F, D⁻F ≥ D₋F. If D⁺F = D₊F or D⁻F = D₋F, we say that F(x) is differentiable on the right or on the left, respectively, and the common values are called the right or left derivatives, F′₊, F′₋. If all four derivatives are equal, we say that F(x) is differentiable, with derivative F′(x) equal to the common value of these derivatives.
Theorem 48. The set of points at which F′₊ and F′₋ both exist but are different is countable.

Proof. It is enough to prove that the set E of points x at which F′₋(x) < F′₊(x) is countable. Let r₁, r₂, . . . be the sequence of all rational numbers arranged in some definite order. If x ∈ E, let k = k(x) be the smallest integer for which

F′₋(x) < r_k < F′₊(x).

Now let m, n be the smallest integers for which r_m < x, r_n > x and

(F(ξ) − F(x))/(ξ − x) < r_k for r_m < ξ < x,

(F(ξ) − F(x))/(ξ − x) > r_k for x < ξ < r_n.

Every x defines the triple (k, m, n) uniquely, and two numbers x₁ < x₂ cannot have the same triple (k, m, n) associated with them. For if they did, we should have

r_m < x₁ < x₂ < r_n

and therefore

(F(x₂) − F(x₁))/(x₂ − x₁) > r_k from the second inequality (with x = x₁, ξ = x₂),

(F(x₂) − F(x₁))/(x₂ − x₁) < r_k from the first (with x = x₂, ξ = x₁),

and these are contradictory. Since the number of triples (k, m, n) is countable, so is E.


Theorem 49 (Vitali's covering theorem). Suppose that every point of a bounded set E of real numbers (not necessarily measurable) is contained in arbitrarily small closed intervals of positive length belonging to a given family V. Suppose that G is an open set containing E and that ε > 0.

Then we can select a finite number N of mutually disjoint intervals I_n of V so that each I_n lies in G and

∑_{n=1}^{N} μ(I_n) ≥ μ(E) − ε, μ(E − ∑_{n=1}^{N} I_n) ≤ ε

(μ standing, of course, for outer measure).

Proof. If ε > 0, it is obviously enough, after Theorem 12, to prove the theorem in the case μ(G) ≤ μ(E) + ε. We may also suppose that all the intervals of V lie in G.

We define a sequence of intervals I₁, I₂, . . . inductively as follows. I₁ is an arbitrary interval of V containing points of E. If I₁, I₂, . . . , I_n have been defined, let l_n be the upper bound of the lengths of all the intervals of V which contain points of E and which are disjoint from I₁ + I₂ + · · · + I_n. Then, since the I_k are closed, l_n > 0 unless I₁ + I₂ + · · · + I_n ⊃ E. Now define I_{n+1} so that it is an interval of the type specified above and so that

λ_{n+1} = μ(I_{n+1}) > ½ l_n.

Then I_{n+1} is disjoint from I₁ + · · · + I_n and

S = ∑_{n=1}^{∞} I_n ⊂ G.

Suppose now that A = E − S·E has μ(A) > 0. Let J_n be the interval with the same centre as I_n and 5 times the length of I_n. We can then choose N so that

∑_{n=N+1}^{∞} μ(J_n) = 5 ∑_{n=N+1}^{∞} μ(I_n) < μ(A),

since ∑_{n=1}^{∞} μ(I_n) ≤ μ(G) ≤ μ(E) + ε < ∞ and μ(A) > 0. It follows that

μ(A − A·∑_{n=N+1}^{∞} J_n) > 0,

and that A − A·∑_{n=N+1}^{∞} J_n contains at least one point ξ. Moreover, since ξ does not belong to the closed set ∑_{n=1}^{N} I_n, we can choose from V an interval I containing ξ and such that I·I_n = 0 for n = 1, 2, . . . , N. On the other hand, I·I_n cannot be empty for all n ≥ N + 1, for if it were, we should have

0 < μ(I) ≤ l_n < 2λ_{n+1}

for all n ≥ N + 1, and this is impossible since λ_n → 0 (for ∑_{1}^{∞} μ(I_n) ≤ μ(G) < ∞). We can therefore define n′ ≥ N + 1 to be the smallest integer for which I·I_{n′} ≠ 0. But

I·I_n = 0 for n ≤ n′ − 1,

and it follows from the definition of l_{n′−1} that

0 < μ(I) ≤ l_{n′−1} < 2λ_{n′}.

Hence I, and therefore ξ, is contained in J_{n′}, since J_{n′} has five times the length of I_{n′} and I·I_{n′} ≠ 0. This is impossible, since ξ belongs to A − A·∑_{n=N+1}^{∞} J_n and n′ ≥ N + 1. Hence we must have

μ(A) = 0, so that μ(E·S) = μ(E) and ∑_{n=1}^{∞} μ(I_n) ≤ μ(G) ≤ μ(E) + ε.

We can therefore choose N so large that

∑_{n=1}^{N} μ(I_n) ≥ μ(E) − ε, μ(E − ∑_{n=1}^{N} I_n) ≤ ε.


Theorem 50. A function F(x) of bounded variation is differentiable p.p.

Proof. It is sufficient to prove the theorem when F(x) is increasing. We prove that D⁺F = D₊F p.p.; the proof that D⁻F = D₋F p.p. is similar, and the conclusion then follows from Theorem 48.

The set

[D⁺F > D₊F] = ⋃_{r₁,r₂} [D⁺F > r₁ > r₂ > D₊F],

where the union is over the countable pairs of rationals r₁, r₂. Hence, if we suppose that μ([D⁺F > D₊F]) > 0, we can find rationals r₁, r₂ such that

D⁺F > r₁ > r₂ > D₊F

in a set E of positive outer measure. Then every point x of E is the left hand end point of an interval (x, ξ) such that

F(ξ) − F(x) ≤ (ξ − x)r₂,

and we may suppose that ξ − x is arbitrarily small. It follows from Vitali's theorem that we can define a set K consisting of a finite number of such intervals so that

μ(E·K) ≥ μ(K) − ε,

while the increment F(K) of F(x) over the intervals of K satisfies

F(K) ≤ r₂ μ(K).

But every point x of E·K, with the exception of the finite set of right hand end points of intervals of K, is the left hand end point of an arbitrarily small interval (x, ξ) for which

F(ξ) − F(x) ≥ (ξ − x)r₁.

If we now apply Vitali's theorem to the intersection of E and the set of interior points of K (which has the same measure as E·K), we can construct a finite set of intervals K′ so that

K′ ⊂ K, μ(K′) ≥ μ(E·K) − ε ≥ μ(K) − 2ε,

while the increment F(K′) of F(x) over the intervals of K′ satisfies

F(K′) ≥ r₁ μ(K′).

Since F(K′) ≤ F(K), we get

r₂ μ(K) ≥ F(K) ≥ F(K′) ≥ r₁ μ(K′) ≥ r₁ (μ(K) − 2ε),

which gives a contradiction if ε is small enough. Hence we must have μ(E) = 0, and the theorem is proved.



Theorem 51. If F(x) increases and is bounded in a ≤ x ≤ b, and if F′(x) is its derivative, then F′(x) is non-negative p.p., integrable in (a, b), and satisfies

∫_a^b F′(x)dx ≤ F(b) − F(a).

Proof. Since (F(x + h) − F(x))/h ≥ 0 for h ≠ 0, it follows that

F′(x) = lim_{h→0} (F(x + h) − F(x))/h ≥ 0 p.p.

It follows now from Fatou's lemma that, if h > 0 and we put F(x) = F(b) for x > b,

∫_a^b F′(x)dx ≤ lim inf_{h→0} ∫_a^b (F(x + h) − F(x))/h dx

= lim inf_{h→0} [ (1/h) ∫_b^{b+h} F(x)dx − (1/h) ∫_a^{a+h} F(x)dx ]

≤ lim_{h→0} [F(b + h) − F(a)]

= F(b) − F(a),

since F(x) is increasing.
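The inequality of Theorem 51 can be strict, as a jump function shows (a numerical illustration, not part of the text): for the unit step at 1/2, F′(x) = 0 at every point other than the jump, so ∫₀¹ F′(x)dx = 0, while F(1) − F(0) = 1.

```python
# F is the unit step at 1/2: increasing and bounded on [0, 1].
def F(x):
    return 0.0 if x < 0.5 else 1.0

h = 1e-6
xs = [0.1, 0.3, 0.49, 0.51, 0.7, 0.9]        # points away from the jump
quotients = [(F(x + h) - F(x)) / h for x in xs]
# every quotient is 0, reflecting F'(x) = 0 p.p.; the increment
# F(1) - F(0) = 1 is carried entirely by the jump, a set of measure zero
```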


Theorem 52. If f(x) is integrable in (a, b) and

F(x) = ∫_a^x f(t)dt = 0 for a ≤ x ≤ b,

then f(x) = 0 p.p.

(This is a refinement of Corollary 1 of Theorem 24, since the condition here is lighter than the condition that the set function F(X) = ∫_X f(t)dt should vanish for all measurable X.)

Proof. Our hypothesis implies that F(X) = 0 for all open or closed intervals X and therefore, since F(X) is completely additive, F(X) = 0 for all open sets X, every open set being the sum of a countable number of disjoint open intervals. But every measurable set is the sum of a set of zero measure and the limit of a decreasing sequence of open sets, by Theorem 12, and therefore F(X) = 0 for every measurable set X. The conclusion then follows from Corollary 1 to Theorem 24.

Theorem 53 (Fundamental theorem of the calculus).

(i) If F(x) is an absolutely continuous point function and f(x) is the Radon derivative of its associated set function F(X) (which is also absolutely continuous; see page 30), then F(x) is differentiable p.p. and

F′(x) = f(x) p.p.

(ii) If f(x) ∈ L(a, b), then F(x) = ∫_a^x f(t)dt is absolutely continuous and

F′(x) = f(x) p.p.

(iii) If F(x) is absolutely continuous in a ≤ x ≤ b, then F′ is integrable and

F(x) = ∫_a^x F′(t)dt + F(a).

Proof. (i) We may suppose that F(x) increases and that F(a) = 0. If A > 0, let f_A(x) = min[A, f(x)], F_A(x) = ∫_a^x f_A(t)dt, where f(x) is the Radon derivative of F(X) and F(x) = ∫_a^x f(t)dt.

Then, since f_A(x) is bounded, it follows from Fatou's lemma that

∫_a^x F′_A(t)dt ≥ lim sup_{h→0} ∫_a^x (F_A(t + h) − F_A(t))/h dt

= lim sup_{h→0} [ (1/h) ∫_x^{x+h} F_A(t)dt − (1/h) ∫_a^{a+h} F_A(t)dt ]

= F_A(x) − F_A(a) = F_A(x),

since F_A(t) is continuous. Since f(t) ≥ f_A(t), it follows that F′(t) ≥ F′_A(t) and therefore

∫_a^x F′(t)dt ≥ ∫_a^x F′_A(t)dt ≥ F_A(x).

This holds for all A > 0, and since F_A(x) → F(x) as A → ∞ by Theorem 31, we deduce that

∫_a^x F′(t)dt ≥ F(x) = ∫_a^x f(t)dt.

Combining this with Theorem 51, we get

∫_a^x (F′(t) − f(t))dt = 0

for all x, and the conclusion follows from Theorem 52.

Parts (ii) and (iii) follow easily from (i). If we did not wish to use the Radon derivative, we could prove (ii) and (iii) with the help of the deduction from Vitali's theorem that if F(x) is absolutely continuous and F′(x) = 0 p.p., then F(x) is constant.


Theorem 54 (Integration by parts). If F(x), G(x) are of bounded variation in an open or closed interval J and

    F(x) = ½[F(x − 0) + F(x + 0)], G(x) = ½[G(x − 0) + G(x + 0)],

then

    ∫_J F(x)dG(x) = [F(x)G(x)]_J − ∫_J G(x)dF(x).

In particular, if F(x), G(x) are absolutely continuous, then

    ∫_a^b F(x)G′(x)dx = [F(x)G(x)]_a^b − ∫_a^b F′(x)G(x)dx.

Proof. We may suppose that F(x), G(x) increase on the interval and are non-negative, and define

    Φ(I) = ∫_I F(x)dG(x) + ∫_I G(x)dF(x) − [F(x)G(x)]_I

for intervals I ⊂ J. Then Φ(I) is completely additive and we shall prove that Φ(I) = 0 for all I.

Suppose first that I consists of a single point a. Then

    Φ(I) = F(a)[G(a + 0) − G(a − 0)] + G(a)[F(a + 0) − F(a − 0)] − F(a + 0)G(a + 0) + F(a − 0)G(a − 0) = 0

since 2F(a) = F(a + 0) + F(a − 0), 2G(a) = G(a + 0) + G(a − 0).

Next, if I is an open interval a < x < b,

    Φ(I) ≤ F(b − 0)[G(b − 0) − G(a + 0)] + G(b − 0)[F(b − 0) − F(a + 0)] − F(b − 0)G(b − 0) + F(a + 0)G(a + 0)
         = (F(b − 0) − F(a + 0))(G(b − 0) − G(a + 0)) = F(I)G(I),

where F(I), G(I) are the interval functions defined by F(x), G(x), and similarly Φ(I) ≥ −F(I)G(I), so that |Φ(I)| ≤ F(I)G(I).

Now, any interval is the sum of an open interval and one or two end points, and it follows from the additivity of Φ(I) that

    |Φ(I)| ≤ F(I)G(I)

for all intervals. Let ε > 0. Then, apart from a finite number of points at which F(x + 0) − F(x − 0) > ε, and at which Φ = 0, we can divide I into a finite number of disjoint intervals I_n on each of which F(I_n) ≤ ε. Then

    |Φ(I)| = |∑ Φ(I_n)| ≤ ∑ F(I_n)G(I_n) ≤ ε ∑ G(I_n) = ε G(I).

The conclusion follows on letting ε → 0.
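For absolutely continuous F and G the theorem reduces to the familiar rule ∫F G′ dx = [FG] − ∫F′ G dx, which is easy to check numerically. A minimal Python sketch, using hypothetical smooth functions F(x) = x², G(x) = sin x on [0, 1] and the midpoint rule (both choices are illustrative assumptions, not from the text):

```python
import math

# Hypothetical absolutely continuous F, G on [a, b] with known derivatives.
F = lambda x: x * x
dF = lambda x: 2 * x          # F'
G = math.sin
dG = math.cos                 # G'

a, b, n = 0.0, 1.0, 100000
h = (b - a) / n
xs = [a + (k + 0.5) * h for k in range(n)]   # midpoint-rule nodes

int_F_dG = sum(F(x) * dG(x) for x in xs) * h   # approximates int F G' dx
int_G_dF = sum(G(x) * dF(x) for x in xs) * h   # approximates int G F' dx
boundary = F(b) * G(b) - F(a) * G(a)           # [F(x)G(x)] over [a, b]

# Theorem 54 in the absolutely continuous case:
assert abs(int_F_dG - (boundary - int_G_dF)) < 1e-6
```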

Theorem 55 (Second Mean Value Theorem). (i) If f(x) ∈ L(a, b) and φ(x) is monotonic, then

    ∫_a^b f(x)φ(x)dx = φ(a + 0)∫_a^ξ f(x)dx + φ(b − 0)∫_ξ^b f(x)dx

for some ξ in a ≤ ξ ≤ b.

(ii) If φ(x) ≥ 0 and φ(x) decreases in a ≤ x ≤ b, then

    ∫_a^b f(x)φ(x)dx = φ(a + 0)∫_a^ξ f(x)dx

for some ξ, a ≤ ξ ≤ b.

Proof. Suppose that φ(x) decreases in (i), so that, if we put F(x) = ∫_a^x f(t)dt, we have

    ∫_a^b f(x)φ(x)dx = [F(x)φ(x)]_{a+0}^{b−0} − ∫_{a+0}^{b−0} F(x)dφ(x)
                     = φ(b − 0)∫_a^b f(x)dx + [φ(a + 0) − φ(b − 0)]F(ξ)

by Theorem 22 and the fact that F(x) is continuous and attains every value between its bounds at some point ξ in a ≤ ξ ≤ b. This establishes (i), and we obtain (ii) by defining φ(b + 0) = 0 and writing

    ∫_a^b f(x)φ(x)dx = [F(x)φ(x)]_{a+0}^{b+0} − ∫_{a+0}^{b+0} F(x)dφ(x) = φ(a + 0)F(ξ), with a ≤ ξ ≤ b.

A refinement enables us to assert that a < ξ < b in (i) and that a ≤ ξ < b in (ii).



19. Product Measures and Multiple Integrals


Suppose that X, X′ are two spaces of points x, x′. Then the space of pairs (x, x′) with x in X, x′ in X′ is called the product space of X and X′ and is written X × X′.

Theorem 56. Suppose that measures μ, μ′ are defined on Borel systems S, S′ of measurable sets X, X′ in two spaces X, X′ respectively. Then a measure m can be defined in X × X′ in such a way that, if X, X′ are measurable in X, X′ respectively, then X × X′ is measurable in X × X′ and

    m(X × X′) = μ(X) · μ′(X′).

(The measure m is called the product measure of μ and μ′.) The idea of product measures is basic in the theory of probability, where it is vital to observe that the product measure is not the only measure which can be defined in X × X′.

Proof. We define a rectangular set in X × X′ to be any set X × X′ with X in S, X′ in S′, and we define its measure m(X × X′) to be μ(X)μ′(X′). (An expression of the form 0 · ∞ is taken to stand for 0.) We call the sum of a finite number of rectangular sets a figure in X × X′ and define its measure to be the sum of the measures of the disjoint rectangular sets which go to form it. It is easy to verify that this definition is independent of the decomposition used and that the figures and their complements form a finitely additive system on which their measure is finitely additive.

After Kolmogoroff's theorem (Theorem 7), it is sufficient to show that m is completely additive on figures. Suppose that

    ∑_{n=1}^∞ X_n × X_n′ = X_0 × X_0′,

where the sets on the left are disjoint. If x is any point of X_0, let J_n(x) be the set of points x′ of X_0′ for which (x, x′) belongs to X_n × X_n′. Then J_n(x) is measurable in X′ for each x; it has measure μ′(X_n′) when x is in X_n and 0 otherwise. This measure μ′(J_n(x)) is plainly measurable as a function of x, and

    ∫_{X_0} μ′(J_n(x))dμ = μ(X_n)μ′(X_n′).

Moreover, ∑_{n=1}^N J_n(x) is the set of points x′ of X_0′ for which (x, x′) belongs to ∑_{n=1}^N X_n × X_n′. It is measurable and

    ∫_{X_0} μ′(∑_{n=1}^N J_n(x))dμ = ∑_{n=1}^N μ(X_n)μ′(X_n′).

But since X_0 × X_0′ = ∑_{n=1}^∞ X_n × X_n′, it follows that

    lim_{N→∞} ∑_{n=1}^N J_n(x) = X_0′ for every x of X_0,

and therefore

    lim_{N→∞} ∑_{n=1}^N μ′(J_n(x)) = μ′(X_0′) for every x of X_0.

It follows from Theorem 32 on monotone convergence that

    μ(X_0)μ′(X_0′) = lim_{N→∞} ∫_{X_0} ∑_{n=1}^N μ′(J_n(x))dμ = ∑_{n=1}^∞ μ(X_n)μ′(X_n′),

and so

    m(X_0 × X_0′) = ∑_{n=1}^∞ m(X_n × X_n′),

which completes the proof.



Theorem 57. Let X, X′ be two measure spaces with measures μ, μ′ respectively, such that X (X′) is the limit of an increasing sequence X_n (X_n′) of measurable sets of finite measure μ(X_n) (μ′(X_n′)). Let Y be a set in X × X′ measurable with respect to the product measure m defined by μ, μ′. Let Y′(x) be the set of points x′ ∈ X′ for which (x, x′) ∈ Y. Then Y′(x) is measurable in X′ for almost all x ∈ X, its measure μ′(Y′(x)) is a measurable function of x, and

    ∫_X μ′(Y′(x))dμ = m(Y).


Proof. We note first that the theorem is trivially true if Y is a rectangular set, and it follows immediately if Y is the sum of a countable number of rectangular sets. Further, it is also true for the limit of a decreasing sequence of sets of this type. In the general case, we can suppose that

    Y ⊂ Q, Q − Y ⊂ Δ,

where m(Δ) = 0 and Q, Δ are limits of decreasing sequences of sums of rectangular sets. Then, if Q′(x), Δ′(x) are defined in the same way as Y′(x), we have

    Y′(x) ⊂ Q′(x), Q′(x) − Y′(x) ⊂ Δ′(x),

where Δ′(x), Q′(x) are measurable for almost all x. But

    ∫_X μ′(Δ′(x))dμ = m(Δ) = 0,

so that μ′(Δ′(x)) = 0 for almost all x since μ′(Δ′(x)) ≥ 0, and this is enough to show that Y′(x) is measurable for almost all x and that

    μ′(Y′(x)) = μ′(Q′(x)) p.p.

Finally,

    ∫_X μ′(Y′(x))dμ = ∫_X μ′(Q′(x))dμ = m(Q) = m(Y).

Theorem 58 (Fubini's theorem). Suppose X, X′ satisfy the hypotheses of Theorem 57. If f(x, x′) is measurable with respect to the product measure defined by μ, μ′, it is measurable in x′ for almost all x and in x for almost all x′. The existence of any one of the integrals

    ∫_{X×X′} |f(x, x′)|dm,  ∫dμ ∫|f(x, x′)|dμ′,  ∫dμ′ ∫|f(x, x′)|dμ

implies that of the other two, and the existence and equality of the integrals

    ∫_{X×X′} f(x, x′)dm,  ∫dμ ∫f(x, x′)dμ′,  ∫dμ′ ∫f(x, x′)dμ.

Proof. We may obviously suppose that f(x, x′) ≥ 0. Let (y_ν) be a subdivision with E_ν = [y_ν ≤ f(x, x′) < y_{ν+1}]. The theorem holds for the function equal to y_ν in E_ν for ν = 0, 1, 2, …, N (N arbitrary) and zero elsewhere, by Theorem 57, and the general result follows easily from the definition of the integral.
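For finite spaces the content of Theorems 56-58 is elementary and can be verified directly: the integral with respect to the product measure equals either iterated integral. A small Python sketch with hypothetical weights μ, μ′ and an arbitrary function f (all values are illustrative assumptions):

```python
# Finite measure spaces: mu on X = {0,1,2}, mu' on X' = {0,1};
# product measure m({(i,j)}) = mu(i) * mu'(j).
mu  = {0: 0.2, 1: 0.5, 2: 0.3}          # hypothetical measure on X
mup = {0: 0.6, 1: 0.4}                  # hypothetical measure on X'
f = lambda i, j: (i + 1) * (j + 2)      # arbitrary non-negative function

# Integral over the product space.
int_m = sum(f(i, j) * mu[i] * mup[j] for i in mu for j in mup)

# The two iterated integrals of Fubini's theorem.
int_x_first  = sum(mup[j] * sum(f(i, j) * mu[i] for i in mu) for j in mup)
int_xp_first = sum(mu[i] * sum(f(i, j) * mup[j] for j in mup) for i in mu)

assert abs(int_m - int_x_first) < 1e-12
assert abs(int_m - int_xp_first) < 1e-12
```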


Chapter 2

Probability

1. Definitions

A measure μ defined in a space X of points x is called a probability measure or a probability distribution if μ(X) = 1. The measurable sets X are called events and the probability of the event X is the real number μ(X). Two events X₁, X₂ are mutually exclusive if X₁X₂ = 0.

The statement "x is a random variable in X with probability distribution μ" means

(i) that a probability distribution μ exists in X,

(ii) that the expression "the probability that x belongs to X", where X is a given event, will be taken to mean μ(X), sometimes written P(x ∈ X).

The basic properties of the probabilities of events follow immediately. They are that these probabilities are real numbers between 0 and 1, inclusive, and that the probability that one of a finite or countable set of mutually exclusive events (X_i) should take place, i.e. the probability of the event ∑X_i, is equal to the sum of the probabilities of the events X_i.

If a probability measure is defined in some space, it is clearly possible to work with any isomorphic measure in another space. In practice, this can often be taken to be R^k, in which case we speak of a random real vector in place of a random variable. In particular, if k = 1

we speak of a random real number or random real variable. The probability distribution is in this case defined by a distribution function F(x) increasing from 0 to 1 in −∞ < x < ∞. For example, a probability measure in any finite or countable space is isomorphic with a probability measure in R defined by a step function having jumps p_ν at ν = 0, 1, 2, …, where

    p_ν ≥ 0, ∑ p_ν = 1.

Such a distribution is called a discrete probability distribution. If F(x) is absolutely continuous, F′(x) is called the probability density function (or frequency function).
Example 1 (Tossing a coin). The space X has only two points H, T with four subsets, with probabilities given by

    μ(0) = 0, μ(H) = μ(T) = 1/2, μ(X) = μ(H + T) = 1.

If we make H, T correspond respectively with the real numbers 0, 1, we get the random real variable with distribution function

    F(x) = 0 (x < 0), = 1/2 (0 ≤ x < 1), = 1 (1 ≤ x).

Any two real numbers a, b could be substituted for 0, 1.

Example 2 (Throwing a die). The space contains six points, each with probability 1/6 (unloaded die). The natural correspondence with R gives rise to F(x) with equal jumps 1/6 at 1, 2, 3, 4, 5, 6.

2. Function of a random variable

Suppose that x is a random variable in X and that y = φ(x) is a function defined in X and taking values in a space Y. Suppose that Y contains a Borel system of measurable sets Y. Then y is called a function of the random variable x if the set [φ(x) ∈ Y] is measurable in X for every measurable Y and if we take the measure in X of this set as the probability measure of Y. Note that the mapping x → y need not be one-one.

(There is a slight ambiguity in notation, as x may denote either the random variable in X or a generic point of X. In practice, there is no difficulty in deciding which is meant.)
Example 1. x being a random real variable with distribution function (d.f.) F(x), we compute the d.f. G(y) of y = x². Since P(y < 0) = P(x² < 0) = 0, G(y) = 0 for y < 0. If a ≥ 0,

    P(y ≤ a) = P(x² ≤ a) = P(0 ≤ x ≤ √a) + P(−√a ≤ x < 0),
    G(a + 0) = F(√a + 0) − F(−√a − 0).

Example 2. If F(x) = 0 for x < 0 and G(y) is the distribution function of y = 1/x, x having d.f. F(x), then G(y) = 0 for y < 0. If a ≥ 0,

    G(a + 0) = P(y ≤ a) = P(1/x ≤ a) = P(x ≥ 1/a) = 1 − F(1/a − 0).

Since G(a) is a d.f., G(a + 0) → 1 as a → ∞, so that F must be continuous at 0. That is, P(x = 0) = 0.
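The computation in Example 1 can be illustrated by simulation. Assuming, for the sake of the sketch, that x is uniform on (−1, 1), so that F(x) = (x + 1)/2 and G(a) = F(√a) − F(−√a) = √a for 0 ≤ a ≤ 1:

```python
import random

# Monte Carlo sketch of Example 1: y = x^2 with x uniform on (-1, 1),
# so the d.f. of y is G(a) = sqrt(a) for 0 <= a <= 1 (assumed setup).
random.seed(1)
n = 200000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]

def G_empirical(a):
    # empirical frequency of the event x^2 <= a
    return sum(1 for x in xs if x * x <= a) / n

for a in (0.1, 0.25, 0.5, 0.9):
    assert abs(G_empirical(a) - a ** 0.5) < 0.01
```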

3. Parameters of random variables

A parameter of a random variable (or its distribution) is a number associated with it. The most important parameters of real distributions are the following.


(i) The mean or expectation E(φ(x)) of a real-valued function φ(x) of a random real variable x is defined by

    E(φ(x)) = ∫_X φ(x)dμ.

(ii) The standard deviation σ (or dispersion) of a random real number about its mean x̄ is defined by

    σ² = E((x − x̄)²) = ∫_X (x − x̄)²dμ
       = ∫_X x²dμ − 2x̄ ∫_X x dμ + x̄² ∫_X dμ
       = E(x²) − 2x̄² + x̄² = E(x²) − (E(x))².

σ² is called the variance of x.

(iii) The range is the interval (r, R), where

    r = sup_{F(a)=0} a, R = inf_{F(a)=1} a.

(iv) The mean error is

    ∫ |x − x̄| dμ = E(|x − x̄|).

(v) A median is a real number A for which

    F(A − 0) + F(A + 0) = 1.

(vi) The mode of an absolutely continuous distribution F(x) is the unique maximum, when it exists, of F′(x).

Some special distributions in R:

(i) The binomial distribution. Here 0 < p < 1, q = 1 − p, and n is a positive integer. x can take the values ν = 0, 1, 2, …, n with probabilities

    p_ν = C(n, ν) p^ν q^{n−ν}, ∑_{ν=0}^n p_ν = 1,

where C(n, ν) = n!/(ν!(n − ν)!).


Then

    x̄ = E(ν) = ∑_{ν=0}^n ν p_ν = np,
    σ² = E(ν²) − x̄² = ∑_{ν=0}^n ν² p_ν − n²p² = npq.

p_ν is the probability of ν successes out of n experiments, in each of which the probability of success is p.

(ii) The Poisson distribution. x can take the values ν = 0, 1, 2, … with probabilities

    p_ν = e^{−c} c^ν/ν!, ∑_{ν=0}^∞ p_ν = 1,

where c > 0. Here

    x̄ = e^{−c} ∑_{ν=0}^∞ ν c^ν/ν! = c,
    σ² = e^{−c} ∑_{ν=0}^∞ ν² c^ν/ν! − c² = c.

The binomial and Poisson distributions are discrete. The Poisson distribution is the limit of the binomial distribution as n → ∞ if we put p = c/n (p then → 0).
(iii) The rectangular distribution. This is given by

    F(x) = 0 for x ≤ a,
         = (x − a)/(b − a) for a ≤ x ≤ b,
         = 1 for b ≤ x.

It is absolutely continuous, and F′(x) = 1/(b − a) for a < x < b and = 0 for x < a, x > b. Also

    x̄ = (a + b)/2, σ² = (b − a)²/12.

(iv) The normal (or Gaussian) distribution. This is an absolutely continuous distribution F for which

    F′(x) = (1/(σ√(2π))) e^{−(x−x̄)²/2σ²}.

It is easy to verify that the mean and standard deviation are respectively x̄ and σ.

(v) The singular distribution. Here x = a has probability 1, and

    F(x) = D(x − a) = 1 if x ≥ a, = 0 if x < a.
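The stated means and variances of the binomial distribution, and the Poisson limit p = c/n, can be confirmed numerically. A Python sketch; the parameter values (n = 20, p = 0.3, c = 2) are illustrative assumptions:

```python
from math import comb, exp, factorial

# Binomial distribution: check x-bar = np and sigma^2 = npq.
n, p = 20, 0.3
q = 1 - p
probs = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(probs))
var = sum(k * k * pk for k, pk in enumerate(probs)) - mean**2
assert abs(sum(probs) - 1) < 1e-12
assert abs(mean - n * p) < 1e-9
assert abs(var - n * p * q) < 1e-9

# Poisson limit of the binomial: with p = c/n and n large,
# p_k approaches e^{-c} c^k / k!.
c, big_n = 2.0, 100000
for k in range(5):
    binom_k = comb(big_n, k) * (c / big_n)**k * (1 - c / big_n)**(big_n - k)
    poisson_k = exp(-c) * c**k / factorial(k)
    assert abs(binom_k - poisson_k) < 1e-4
```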

We now prove

Theorem 1 (Tchebycheff's inequality). If φ(x) is a non-negative function of a random variable x and k > 0, then

    P(φ(x) ≥ k) ≤ E(φ(x))/k.

Proof.

    E(φ(x)) = ∫_X φ(x)dμ = ∫_{φ(x)≥k} φ(x)dμ + ∫_{φ(x)<k} φ(x)dμ
            ≥ ∫_{φ(x)≥k} φ(x)dμ ≥ k ∫_{φ(x)≥k} dμ = kP(φ(x) ≥ k).



Corollary. If k > 0 and x̄, σ are respectively the mean and the standard deviation of a random real x, then

    P(|x − x̄| ≥ kσ) ≤ 1/k².

We merely replace k by k²σ² and φ(x) by (x − x̄)² in Theorem 1.
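A Monte Carlo sketch of the corollary, with an (assumed) uniform sample standing in for the random variable x: the observed tail frequency never exceeds the bound 1/k².

```python
import random

# Check P(|x - mean| >= k*sigma) <= 1/k^2 on a uniform(0,1) sample.
random.seed(2)
n = 100000
xs = [random.uniform(0.0, 1.0) for _ in range(n)]
mean = sum(xs) / n
sigma = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5

for k in (1.5, 2.0, 3.0):
    tail = sum(1 for x in xs if abs(x - mean) >= k * sigma) / n
    assert tail <= 1 / k**2          # Tchebycheff's bound
```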

4. Joint probabilities and independence

Suppose that X₁, X₂ are two spaces of points x₁, x₂ and that a probability measure μ is defined in a Borel system of sets X in their product space X₁ × X₂. Then the sets X₁ in X₁ for which the set [x₁ ∈ X₁, x₂ ∈ X₂] is measurable with respect to μ form a Borel system in X₁. If we define

    μ₁(X₁) = μ([x₁ ∈ X₁, x₂ ∈ X₂]),

it follows that μ₁ is a probability measure in X₁, and we define μ₂ in X₂ in the same way. We call μ₁(X₁) simply the probability that x₁ belongs to X₁ with respect to the joint distribution defined by μ.

Definition. If μ is the product measure of μ₁, μ₂, the random variables x₁, x₂ are said to be independent. Otherwise, they are dependent.

When we wish to deal at the same time with several random variables, we must know their joint probability distribution, and this, as we see at once, is not necessarily the same as the product of their separate distributions. This applies in particular to functions φ(x₁, x₂, …), for the probability distribution of the values of the function is determined by the joint distribution of (x₁, x₂, …). In this way we can define the sum x₁ + x₂ and product x₁x₂ of random variables, each being treated as a function of (x₁, x₂) over the product space X₁ × X₂ with an assigned joint probability distribution.

Theorem 2. If (x₁, x₂, …, x_n) is a random real vector, then

    E(x₁ + x₂ + ⋯ + x_n) = E(x₁) + E(x₂) + ⋯ + E(x_n),

whether x₁, x₂, …, x_n are independent or not. (E(x_i) is the mean of x_i over the product space.)


Proof. Let p be the joint probability distribution. Then

    E(∑ x_i) = ∫_Ω ∑ x_i dp (where Ω = X₁ × ⋯ × X_n) = ∑ ∫_Ω x_i dp = ∑ E(x_i).

Theorem 3. If x₁, x₂, …, x_n are independent random real variables with standard deviations σ₁, σ₂, …, σ_n, then the standard deviation σ of their sum x₁ + x₂ + ⋯ + x_n is given by

    σ² = ∑_{i=1}^n σ_i².

Proof. It is sufficient to prove the theorem for n = 2. Then

    σ² = E((x₁ + x₂ − x̄₁ − x̄₂)²)
       = E((x₁ − x̄₁)² + (x₂ − x̄₂)² + 2(x₁ − x̄₁)(x₂ − x̄₂))
       = E((x₁ − x̄₁)²) + E((x₂ − x̄₂)²) + 2E((x₁ − x̄₁)(x₂ − x̄₂)) (by Theorem 2)
       = σ₁² + σ₂² + 2 ∫_{X₁×X₂} (x₁ − x̄₁)(x₂ − x̄₂)dp
       = σ₁² + σ₂² + 2 ∫_{X₁×X₂} (x₁ − x̄₁)(x₂ − x̄₂)dμ₁dμ₂
       = σ₁² + σ₂² + 2 ∫_{X₁} (x₁ − x̄₁)dμ₁ · ∫_{X₂} (x₂ − x̄₂)dμ₂, by Fubini's theorem,
       = σ₁² + σ₂².

This is an example of a more general principle.
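The additivity of variances for independent variables is easy to see in simulation. A sketch with two assumed independent samples, one uniform and one Gaussian:

```python
import random

# Theorem 3 for n = 2: var(x1 + x2) should be close to var(x1) + var(x2)
# when x1, x2 are generated independently.
random.seed(3)
n = 200000

def var(sample):
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample) / len(sample)

x1 = [random.uniform(0, 1) for _ in range(n)]   # variance 1/12
x2 = [random.gauss(0, 2) for _ in range(n)]     # variance 4
s = [a + b for a, b in zip(x1, x2)]

assert abs(var(s) - (var(x1) + var(x2))) < 0.05
```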

Theorem 4. If φ(x₁, x₂) is a function of two independent variables x₁, x₂, then

    ∫_Ω φ(x₁, x₂)dp = ∬ φ(x₁, x₂)dμ₁dμ₂, Ω = X₁ × X₂.


The proof is immediate from the definition of independence. In particular,

Theorem 5. If x₁, x₂ are independent, then

    E(x₁x₂) = E(x₁)E(x₂) = x̄₁x̄₂.

It is not generally sufficient to know the mean or other parameters of a function of random variables. The general problem is to find its complete distribution. This can be difficult, but the most important case is fairly easy.

Theorem 6. If x₁, x₂ are independent random real numbers, with distribution functions F₁, F₂, then their sum has distribution function F(x) defined by

    F(x) = F₁ * F₂(x) = F₂ * F₁(x) = ∫_{−∞}^∞ F₁(x − u)dF₂(u).

(F(x) is called the convolution of F₁(x) and F₂(x).)


Proof. Let

    χ_x(x₁, x₂) = 1 when x₁ + x₂ ≤ x, = 0 when x₁ + x₂ > x,

so that, if we suppose that F(x + 0) = F(x) and put Ω = R × R, we have

    F(x) = ∬_Ω χ_x(x₁, x₂)dp = ∬ χ_x(x₁, x₂)dF₁(x₁)dF₂(x₂)
         = ∫ dF₂(x₂) ∫ χ_x(x₁, x₂)dF₁(x₁) = ∫ dF₂(x₂) ∫_{x₁+x₂≤x} dF₁(x₁)
         = ∫ F₁(x − x₂)dF₂(x₂) = ∫ F₁(x − u)dF₂(u),

and a similar argument shows that

    F(x) = ∫ F₂(x − u)dF₁(u).

It is obvious that F(x) increases and that F(−∞) = 0, F(+∞) = 1, so that F(x) is a distribution function. Moreover, the process can be repeated any finite number of times and we have

Theorem 7. If x₁, …, x_n are independent, with distribution functions F₁, …, F_n, then the distribution function of x₁ + ⋯ + x_n is

    F₁ * ⋯ * F_n.

Corollary. The convolution operation applied to two or more distribution functions (more generally, functions of bounded variation in (−∞, ∞)) is commutative and associative.
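For discrete variables the convolution of Theorems 6-7 becomes a finite sum, P(x₁ + x₂ = s) = ∑_a P(x₁ = a)P(x₂ = s − a). A sketch for the sum of two fair dice (an illustrative choice):

```python
from itertools import product

# Mass function of x1 + x2 for two independent fair dice, by convolution.
die = {k: 1 / 6 for k in range(1, 7)}

conv = {}
for a, b in product(die, die):
    conv[a + b] = conv.get(a + b, 0.0) + die[a] * die[b]

# Direct checks against enumeration of the 36 equally likely outcomes.
assert abs(conv[7] - 6 / 36) < 1e-12
assert abs(conv[2] - 1 / 36) < 1e-12
assert abs(sum(conv.values()) - 1) < 1e-12
```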
Theorem 8. If F₁(x), F₂(x) are distribution functions, and F₁(x) is absolutely continuous with derivative f₁(x), then F(x) = F₁ * F₂(x) is absolutely continuous and

    f(x) = F′(x) = ∫ f₁(x − u)dF₂(u) p.p.

If both F₁(x) and F₂(x) are absolutely continuous, then

    f(x) = ∫ f₁(x − u) f₂(u)du p.p.

Proof. We write

    F(x) = ∫ F₁(x − u)dF₂(u) = ∫ dF₂(u) ∫_{−∞}^{x−u} f₁(t)dt

by the fundamental theorem of the calculus,

         = ∫ dF₂(u) ∫_{−∞}^x f₁(t − u)dt = ∫_{−∞}^x dt ∫ f₁(t − u)dF₂(u),

and so

    f(x) = F′(x) = ∫ f₁(x − u)dF₂(u) p.p.,

again by the fundamental theorem of the calculus. The second part follows from Theorem 45, Chapter 1.

We shall need the following general theorem on convolutions.

Theorem 9. Suppose that F₁(x), F₂(x) are distribution functions and that φ(x) is bounded and is either continuous or is the limit of continuous functions. Then ∫ φ(x + y)dF₂(x) is B-measurable as a function of y, and

    ∫ dF₁(y) ∫ φ(x + y)dF₂(x) = ∫ φ(x)dF(x),

where F(x) = F₁ * F₂(x).

Proof. We may suppose that φ(x) ≥ 0. If we consider first the case in which φ(x) = 1 for a ≤ x < b and φ(x) = 0 elsewhere, then

    ∫ φ(x + y)dF₂(x) = F₂(b − y − 0) − F₂(a − y − 0),
    ∫ dF₁(y) ∫ φ(x + y)dF₂(x) = ∫ (F₂(b − y − 0) − F₂(a − y − 0))dF₁(y)
                              = F(b − 0) − F(a − 0) = ∫ φ(x)dF(x),

and the theorem is true for functions φ(x) of this type.

Since an open set is the union of a countable number of intervals a ≤ x < b, the theorem is true also for functions φ(x) constant in each interval of an open set and 0 elsewhere. The extension to continuous functions φ(x) and their limits is immediate.


5. Characteristic Functions

A basic tool in modern probability theory is the notion of the characteristic function φ(t) of a distribution function F(x). It is defined by

    φ(t) = ∫ e^{itx} dF(x).

Since |e^{itx}| = 1, the integral converges absolutely and defines φ(t) for all real t.

Theorem 10. |φ(t)| ≤ 1, φ(0) = 1, φ(−t) = φ̄(t) (the complex conjugate of φ(t)), and φ(t) is uniformly continuous for all t.

Proof.

    |φ(t)| = |∫ e^{itx} dF(x)| ≤ ∫ |e^{itx}| dF(x) = ∫ dF(x) = 1,
    φ(0) = ∫ dF(x) = 1,
    φ(−t) = ∫ e^{−itx} dF(x) = φ̄(t),

since F(x) is real.

If h ≠ 0,

    φ(t + h) − φ(t) = ∫ e^{itx}(e^{ixh} − 1)dF(x),
    |φ(t + h) − φ(t)| ≤ ∫ |e^{ixh} − 1| dF(x) = o(1) as h → 0

by Lebesgue's theorem, since |e^{ixh} − 1| ≤ 2.
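The defining properties in Theorem 10 are easy to check numerically for a discrete distribution, where φ(t) = ∑ p_k e^{itk}. The weights below are an arbitrary illustrative choice:

```python
import cmath

# Characteristic function of a discrete distribution: phi(t) = sum p_k e^{itk}.
probs = {0: 0.5, 1: 0.3, 4: 0.2}   # hypothetical discrete distribution

def phi(t):
    return sum(p * cmath.exp(1j * t * k) for k, p in probs.items())

# Theorem 10: phi(0) = 1, |phi(t)| <= 1, phi(-t) is the conjugate of phi(t).
assert abs(phi(0.0) - 1.0) < 1e-12
for t in (0.3, 1.7, -2.5):
    assert abs(phi(t)) <= 1.0 + 1e-12
    assert abs(phi(-t) - phi(t).conjugate()) < 1e-12
```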


Theorem 11. If φ₁(t), φ₂(t) are the characteristic functions of F₁(x), F₂(x) respectively, then φ₁(t)·φ₂(t) is the characteristic function of F₁ * F₂(x).

Proof.

    φ₁(t)·φ₂(t) = ∫ e^{ity} dF₁(y) · ∫ e^{itx} dF₂(x)
               = ∫ dF₁(y) ∫ e^{it(x+y)} dF₂(x)
               = ∫ dF₁(y) ∫ e^{itx} dF₂(x − y)
               = ∫ e^{itx} dF(x),

where F(x) = F₁ * F₂(x), by Theorem 9.

As an immediate corollary of this and Theorem 7, we have

Theorem 12. If x₁, x₂, …, x_n are independent random real variables with characteristic functions φ₁(t), φ₂(t), …, φ_n(t), then the characteristic function of x₁ + x₂ + ⋯ + x_n is φ₁(t)φ₂(t)⋯φ_n(t).
Theorem 13. Suppose that F₁(x), F₂(x) are distribution functions with characteristic functions φ₁(t), φ₂(t). Then

    ∫ φ₁(t + u)dF₂(u) = ∫ e^{itx} φ₂(x)dF₁(x).

Proof.

    ∫ e^{itx} φ₂(x)dF₁(x) = ∫ e^{itx} dF₁(x) ∫ e^{ixu} dF₂(u)
                          = ∫ dF₁(x) ∫ e^{ix(t+u)} dF₂(u)
                          = ∫ dF₂(u) ∫ e^{ix(t+u)} dF₁(x)
                          = ∫ φ₁(t + u)dF₂(u).


Theorem 14 (Inversion Formula). If

    φ(t) = ∫ e^{itx} dF(x), ∫ |dF(x)| < ∞,

then

(i) F(a + h) − F(a − h) = lim_{A→∞} (1/π) ∫_{−A}^A (sin ht / t) e^{−iat} φ(t)dt, if F(x) is continuous at a ± h;

(ii) ∫_a^{a+H} F(x)dx − ∫_{a−H}^a F(x)dx = (1/π) ∫_{−∞}^∞ ((1 − cos Ht)/t²) e^{−iat} φ(t)dt.

Corollary. The expression of a function φ(t) as an absolutely convergent Fourier-Stieltjes integral is unique. In particular, a distribution function is defined uniquely by its characteristic function.
Proof.

    (1/π) ∫_{−A}^A (sin ht / t) e^{−iat} φ(t)dt = (1/π) ∫_{−A}^A (sin ht / t) e^{−iat} dt ∫ e^{itx} dF(x)
        = ∫ dF(x) · (1/π) ∫_{−A}^A (sin ht / t) e^{it(x−a)} dt
        = ∫ dF(x) · (2/π) ∫_0^A (sin ht cos((x − a)t) / t) dt.

But

    (2/π) ∫_0^A (sin ht cos((x − a)t) / t) dt
        = (1/π) ∫_0^A (sin((x − a + h)t) / t) dt − (1/π) ∫_0^A (sin((x − a − h)t) / t) dt
        = (1/π) ∫_0^{A(x−a+h)} (sin t / t) dt − (1/π) ∫_0^{A(x−a−h)} (sin t / t) dt,

and

    (1/π) ∫_0^T (sin t / t) dt → 1/2, (1/π) ∫_0^{−T} (sin t / t) dt → −1/2 as T → ∞.

It follows that

    lim_{A→∞} (1/π) [∫_0^{A(x−a+h)} − ∫_0^{A(x−a−h)}] (sin t / t) dt
        = 0 if x > a + h,
        = 1 if a − h < x < a + h,
        = 0 if x < a − h,

and since this integral is bounded, it follows from the Lebesgue convergence theorem that

    lim_{A→∞} (1/π) ∫_{−A}^A (sin ht / t) e^{−iat} φ(t)dt = ∫_{a−h}^{a+h} dF(x) = F(a + h) − F(a − h),

provided that F(x) is continuous at a ± h.

Since the integral on the left is bounded in |h| ≤ H, we can apply Lebesgue's theorem to its integral with respect to h over |h| ≤ H, and (ii) follows.

6. Sequences and limits of distribution and characteristic functions

Theorem 15. If φ(t), φ_n(t) are the characteristic functions of distribution functions F(x) and F_n(x), and if F_n(x) → F(x) at every point of continuity of F(x), then φ_n(t) → φ(t) uniformly in any finite interval.

Proof. Let ε > 0 and choose X, N(ε) so that ±X are points of continuity of F(x) while

    (∫_{−∞}^{−X} + ∫_X^∞) dF(x) < ε/2, (∫_{−∞}^{−X} + ∫_X^∞) dF_n(x) < ε/2 for n ≥ N.

This is possible since the first inequality is clearly satisfied for large X, and

    (∫_{−∞}^{−X} + ∫_X^∞) dF_n(x) = F_n(−X − 0) + 1 − F_n(X + 0).


Since F_n is a distribution function, as n → ∞ this tends to

    F(−X) + 1 − F(X) = (∫_{−∞}^{−X} + ∫_X^∞) dF(x) < ε/2,

since F(x) is continuous at ±X.

Then

    |φ_n(t) − φ(t)| ≤ ε + |∫_{−X}^X e^{itx} d(F_n(x) − F(x))|
        = ε + |[e^{itx}(F_n(x) − F(x))]_{−X}^X − ∫_{−X}^X it e^{itx}(F_n(x) − F(x))dx|
        ≤ ε + |F_n(X − 0) − F(X − 0)| + |F_n(−X + 0) − F(−X + 0)| + |t| ∫_{−X}^X |F_n(x) − F(x)| dx
        = ε + o(1) as n → ∞,

uniformly in any finite interval of values of t, by Lebesgue's theorem.

The converse theorem is much deeper.


Theorem 16. If φ_n(t) is the characteristic function of the distribution function F_n(x) for n = 1, 2, …, and φ_n(t) → φ(t) for all t, where φ(t) is continuous at 0, then φ(t) is the characteristic function of a distribution function F(x) and

    F_n(x) → F(x)

at every continuity point of F(x).

We need the following

Lemma 1. An infinite sequence of distribution functions F_n(x) contains a subsequence F_{n_k}(x) tending to a non-decreasing limit function F(x) at every continuity point of F(x). Also

    0 ≤ F(x) ≤ 1

(but F(x) is not necessarily a distribution function).



Proof. Let (r_m) be the set of rational numbers arranged in a sequence. Then the numbers F_n(r₁) are bounded and we can select a sequence n₁₁, n₁₂, … so that F_{n_{1ν}}(r₁) tends to a limit as ν → ∞, which we denote by F(r₁). The sequence (n_{1ν}) then contains a subsequence (n_{2ν}) so that F_{n_{2ν}}(r₂) → F(r₂), and we define by induction sequences (n_{kν}), (n_{k+1,ν}) being a subsequence of (n_{kν}), so that

    F_{n_{kν}}(r_k) → F(r_k) as ν → ∞.

If we then define n_k = n_{kk}, it follows that

    F_{n_k}(r_m) → F(r_m) for all m.

Also, F(x) is non-decreasing on the rationals, and it can be defined elsewhere so as to be right continuous and non-decreasing on the reals. The conclusion follows since F(x), F_{n_k}(x) are non-decreasing and every x is a limit of rationals.


Proof of the theorem: We use the lemma to define a bounded non-decreasing function F(x) and a sequence (n_k) so that F_{n_k}(x) → F(x) at every continuity point of F(x).

If we put a = 0 in Theorem 14 (ii), we have

    ∫_0^H F_{n_k}(x)dx − ∫_{−H}^0 F_{n_k}(x)dx = (1/π) ∫_{−∞}^∞ ((1 − cos Ht)/t²) φ_{n_k}(t)dt,

and if we let k → ∞ and note that the φ_{n_k}(t) are bounded, that (1 − cos Ht)/t² ∈ L(−∞, ∞), and that F_{n_k} → F, we get

    (1/H) ∫_0^H F(x)dx − (1/H) ∫_{−H}^0 F(x)dx = (1/πH) ∫_{−∞}^∞ ((1 − cos Ht)/t²) φ(t)dt
        = (1/π) ∫_{−∞}^∞ ((1 − cos t)/t²) φ(t/H)dt.

Now, since φ(t) is bounded in (−∞, ∞) and continuous at 0, the expression on the right tends to φ(0) = lim_{k→∞} φ_{n_k}(0) = 1 as


H → ∞. Since F(x) is non-decreasing, it is easy to show that the left-hand side tends to F(∞) − F(−∞), and hence we have

    F(∞) − F(−∞) = 1,

and F(x) is a distribution function. It now follows from Theorem 15 that φ is the characteristic function of F(x).

Finally, unless F_n(x) → F(x) through the entire sequence, we can define another subsequence (n_k′) so that F_{n_k′}(x) → F*(x), and the same argument shows that F*(x) is a distribution function and that

    φ(t) = ∫ e^{itx} dF*(x).

By the corollary to Theorem 14, F(x) = F*(x), and it follows therefore that F_n(x) → F(x) at every continuity point of F(x).

7. Examples of characteristic functions

Theorem 17. (i) The binomial distribution p_ν = C(n, ν) p^ν q^{n−ν}, ν = 0, 1, 2, …, n, has the distribution function

    F(x) = ∑_{ν≤x} p_ν

and the characteristic function

    φ(t) = (q + pe^{it})^n.

(ii) The Poisson distribution p_ν = e^{−c} c^ν/ν!, ν = 0, 1, 2, …, has distribution function

    F(x) = ∑_{ν≤x} p_ν

and characteristic function

    φ(t) = e^{c(e^{it} − 1)}.


(iii) The rectangular distribution, with F′(x) = 1/(b − a) for a < x < b and 0 for x < a, x > b, has the characteristic function

    φ(t) = (e^{itb} − e^{ita}) / ((b − a)it).

(iv) The normal distribution, with F′(x) = (1/(σ√(2π))) e^{−x²/2σ²}, has characteristic function

    φ(t) = e^{−σ²t²/2}.

(v) The singular distribution

    F(x) = D(x − a) = 0, x < a; = 1, x ≥ a,

has characteristic function

    φ(t) = e^{ita}.

If a = 0, φ(t) = 1.

These are all trivial except (iv), which involves a simple contour integration.
As a corollary we have

Theorem 18. (i) The sum of independent variables with binomial distributions (p, n₁), (p, n₂) is binomial with parameters (p, n₁ + n₂).

(ii) The sum of independent variables with Poisson distributions of parameters c₁, c₂ is Poisson and has parameter c₁ + c₂.

(iii) The sum of independent variables with normal distributions (x̄₁, σ₁), (x̄₂, σ₂) has normal distribution (x̄₁ + x̄₂, σ), σ² = σ₁² + σ₂².
We have also the following trivial formal result.

Theorem 19. If x is a random real number with characteristic function φ(t), distribution function F(x), and if A, B are constants, then Ax + B has distribution function F((x − B)/A) if A > 0, and 1 − F((x − B)/A + 0) if A < 0, and characteristic function e^{itB}φ(At). In particular, −x has the characteristic function φ(−t) = φ̄(t).


Corollary. If φ(t) is a characteristic function, so is

    |φ(t)|² = φ(t)φ(−t).

The converse of Theorem 18, (ii) and (iii), is deeper. We state it without proof. For the proof, reference may be made to Probability Theory by M. Loève, pages 213-214 and 272-274.

Theorem 20. If the sum of two independent real variables is normal (Poisson), then so is each variable separately.

8. Conditional probabilities

If x is a random variable in X and C is a subset of X with positive measure, we define

    P(X/C) = μ(XC)/μ(C)

to be the conditional probability that x lies in X, subject to the condition that x belongs to C. It is clear that P(X/C) is a probability measure over all measurable X.

Theorem 21 (Bayes' Theorem). Suppose that X = ∑_{j=1}^J C_j, μ(C_j) > 0. Then

    P(C_j/X) = P(X/C_j)μ(C_j) / ∑_{i=1}^J P(X/C_i)μ(C_i).

The proof follows at once from the definition. In applications, the sets C_j are regarded as hypotheses, the numbers μ(C_j) being called the prior probabilities. The numbers P(C_j/X) are called their posterior probabilities or likelihoods under the observation of the event X.

Example 3. Two boxes A, B are offered at random with (prior) probabilities 1/3, 2/3. A contains 8 white counters and 12 red counters; B contains 4 white and 4 red counters. A counter is taken at random from the box offered. If it turns out to be white, what is the likelihood that the box offered was A?


If we denote the event of taking a red (white) counter by R (W), the space X under consideration has four points (A, R), (A, W), (B, R), (B, W). The required likelihood is

    P(A/W) = P(W/A)μ(A) / (P(W/A)μ(A) + P(W/B)μ(B)).

Here

    P(W/A) = probability of taking a white counter from A = 8/20 = 2/5,
    P(W/B) = 4/8 = 1/2.

Hence

    P(A/W) = (2/5)(1/3) / ((2/5)(1/3) + (1/2)(2/3)) = 2/7.

Thus the likelihood that the box offered was A is 2/7.
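The arithmetic of Example 3 is a direct application of Theorem 21 and can be written out as:

```python
# Bayes' theorem applied to Example 3.
prior = {"A": 1 / 3, "B": 2 / 3}          # prior probabilities of the boxes
p_white = {"A": 8 / 20, "B": 4 / 8}       # P(W / box)

denom = sum(p_white[c] * prior[c] for c in prior)
posterior_A = p_white["A"] * prior["A"] / denom

assert abs(posterior_A - 2 / 7) < 1e-12   # likelihood that the box was A
```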


Conditional probabilities arise in a natural way if we think of X as a product space X₁ × X₂ in which a measure μ (not generally a product measure) is defined. Then if we write P(X₂/X₁) for the conditional probability that x₂ ∈ X₂ with respect to the condition x₁ ∈ X₁, we have

    P(X₂/X₁) = μ(X₁ × X₂)/μ₁(X₁), where μ₁(X₁) = μ(X₁ × X₂).

The set X₁ may reduce to a single point x₁, and the definition remains valid provided that μ₁(x₁) > 0. Usually, however, μ₁(x₁) = 0; but the conditional probability with respect to a single point is not difficult to define. It follows from the Radon-Nikodym theorem that, for fixed X₂, μ(X₁ × X₂) has a Radon derivative which we can write R(x₁, X₂), with the property that

    μ(X₁ × X₂) = ∫_{X₁} R(x₁, X₂)dμ₁

for all measurable X₁. For each X₂, R(x₁, X₂) is defined for almost all x₁, and plainly R(x₁, X₂) = 1 p.p. when X₂ is the whole space X₂. But unfortunately, since the number of measurable sets X₂ is not generally countable, the union of all the


exceptional sets may not be a set of measure zero. This means that we cannot assume that, for almost all x₁, R(x₁, X₂) is a measure defined on all measurable sets X₂. If, however, it is, we write it P(X₂/x₁) and call it the conditional probability that x₂ ∈ X₂, subject to the condition that x₁ has a specified value.

Suppose now that (x₁, x₂) is a random variable in the plane with probability density f(x₁, x₂) (i.e. f(x₁, x₂) ≥ 0 and ∬ f(x₁, x₂)dx₁dx₂ = 1). Then we can define conditional probability densities as follows:

    P(x₂/x₁) = f(x₁, x₂)/f₁(x₁), f₁(x₁) = ∫ f(x₁, x₂)dx₂,

provided that f₁(x₁) > 0.


The conditional expectation of x₂ for a fixed value of x₁ is

    m(x₁) = ∫ x₂ P(x₂/x₁)dx₂ = ∫ x₂ f(x₁, x₂)dx₂ / ∫ f(x₁, x₂)dx₂.

The conditional standard deviation of x₂ for the value x₁ is σ(x₁), where

    σ²(x₁) = ∫ (x₂ − m(x₁))² P(x₂/x₁)dx₂ = ∫ (x₂ − m(x₁))² f(x₁, x₂)dx₂ / ∫ f(x₁, x₂)dx₂.

The curve x₂ = m(x₁) is called the regression curve of x₂ on x₁. It has the following minimal property: it gives the least value of

    E((x₂ − g(x₁))²) = ∬_{R×R} (x₂ − g(x₁))² f(x₁, x₂)dx₁dx₂

for all possible curves x₂ = g(x₁). If the curves x₂ = g(x₁) are restricted to a specified family, the regression relative to that family is given by the function which minimizes E in that family. For example, the linear regression is the line x₂ = Ax₁ + B for which E((x₂ − Ax₁ − B)²) is least; the nth degree polynomial regression is the polynomial curve of degree n for which the corresponding E is least.
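The minimal property of the linear regression can be illustrated by least squares on simulated data. The data-generating line x₂ = 2x₁ + 1 and the noise level below are illustrative assumptions; the fitted line gives a smaller E than nearby perturbed lines:

```python
import random

# Linear regression x2 = A*x1 + B minimizing E((x2 - A*x1 - B)^2).
random.seed(5)
n = 20000
data = []
for _ in range(n):
    x1 = random.uniform(0, 1)
    x2 = 2 * x1 + 1 + random.gauss(0, 0.5)   # assumed true line 2*x1 + 1
    data.append((x1, x2))

m1 = sum(a for a, _ in data) / n
m2 = sum(b for _, b in data) / n
cov = sum((a - m1) * (b - m2) for a, b in data) / n
var1 = sum((a - m1) ** 2 for a, _ in data) / n

A = cov / var1            # least-squares slope
B = m2 - A * m1           # least-squares intercept

def E(a, b):
    return sum((x2 - a * x1 - b) ** 2 for x1, x2 in data) / n

# The fitted line minimizes E among lines, and recovers the true one.
assert E(A, B) <= E(A + 0.1, B) and E(A, B) <= E(A, B + 0.1)
assert abs(A - 2) < 0.1 and abs(B - 1) < 0.1
```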

9. Sequences of Random Variables

We can define limit processes in connection with sequences of random variables in several different ways. The simplest is the convergence of the distribution or characteristic functions, F_n(x) or φ_n(t), of random real numbers x_n to a limiting distribution or characteristic function F(x) or φ(t). As in Theorem 15, it is sufficient to have lim_{n→∞} F_n(x) = F(x) at every continuity point of F(x). Note that this does not involve any idea of a limiting random variable. If we wish to introduce this idea, we must remember that it is necessary, when making probability statements about two or more random variables, to specify their joint probability distribution in their product space.

There are two important definitions based on this idea. We say that a sequence of random variables x_n converges in probability to a limit random variable x, and write

    x_n → x in prob.,

if

    lim_{n→∞} P(|x_n − x| > ε) = 0

for every ε > 0, P being the joint probability in the product space of x_n and x. In particular, x may be a constant C. Secondly, we say that x_n → x in mean of order α > 0 if

    lim_{n→∞} E(|x_n − x|^α) = 0.

97

The most important case is that in which = 2. The following result


is got almost immediately from the definition.
Theorem 22. If $F_n(x)$ is the distribution function of $x_n$, the necessary and sufficient condition that $x_n \to 0$ in prob. is that

$$F_n(x) \to D(x),$$

where $D(x) = 0$ for $x < 0$, $D(x) = 1$ for $x \geq 0$ is the singular distribution. The necessary and sufficient condition that $x_n \to 0$ in mean of order $\alpha$ is that

$$\lim_{n\to\infty} \int |x|^\alpha\, dF_n(x) = 0.$$

Theorem 23. (i) If $x_n \to x$ in prob., then $F_n(x) \to F(x)$.

(ii) If $x_n \to x$ in mean, then $x_n \to x$ in prob. and $F_n(x) \to F(x)$.

As corollaries we have

Theorem 24 (Tchebycheff). If $x_n$ has mean $\bar{x}_n$ and standard deviation $\sigma_n$, then $x_n - \bar{x}_n \to 0$ in prob. if $\sigma_n \to 0$.
Theorem 25 (Bernoulli: Weak law of large numbers). If $\xi_1, \xi_2, \ldots$ are independent random variables with means $m_1, m_2, \ldots$ and standard deviations $\sigma_1, \sigma_2, \ldots$, and if

$$x_n = \frac{1}{n}\sum_{\nu=1}^{n} \xi_\nu, \qquad \bar{m}_n = \frac{1}{n}\sum_{\nu=1}^{n} m_\nu,$$

then

$$x_n - \bar{m}_n \to 0 \ \text{in prob.} \quad\text{if}\quad \sum_{\nu=1}^{n} \sigma_\nu^2 = o(n^2).$$
Theorem 26 (Khintchine). If $\xi_\nu$ are independent random variables with the same distribution function and finite mean $m$, then

$$x_n = \frac{1}{n}\sum_{\nu=1}^{n} \xi_\nu \to m \ \text{in prob.}$$

(Note that this cannot be deduced from Theorem 25 since we do not assume that $\xi_\nu$ has finite standard deviation.)

Proof. Let $\phi(t)$ be the characteristic function of $\xi_\nu$, so that the characteristic function of $x_n$ is $(\phi(t/n))^n$. If

$$\phi(t) = \int e^{itx}\, dF(x),$$

we have

$$\phi(t) - 1 - mit = \int \left(e^{itx} - 1 - itx\right) dF(x).$$

Now $\dfrac{e^{itx} - 1 - itx}{t}$ is majorised by a multiple of $|x|$ and $\to 0$ as $t \to 0$ for each $x$. Hence, by Lebesgue's theorem,

$$\phi(t) - 1 - mit = o(t) \quad\text{as } t \to 0.$$

Thus

$$(\phi(t/n))^n = \left[1 + \frac{mit}{n} + o\!\left(\frac{1}{n}\right)\right]^n \to e^{mit} \quad\text{as } n \to \infty,$$

and since $e^{mit}$ is the characteristic function of $D(x - m)$, the conclusion follows easily.
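The point of Theorem 26 is that only a finite mean is needed. A simulation sketch (not part of the notes): sample means of i.i.d. Pareto variables with tail exponent 1.5, whose mean is $1.5/0.5 = 3$ but whose variance is infinite, still concentrate near 3.

```python
import random

random.seed(1)

def pareto_sample():
    # Inverse-transform sampling: P(X > x) = x**(-1.5) for x >= 1, mean = 3.
    u = 1.0 - random.random()   # u in (0, 1], avoids division by zero
    return u ** (-1.0 / 1.5)

def fraction_far_from_mean(n, trials=1000, eps=0.5):
    # Estimate P(|x_n - m| > eps) for the sample mean of n Pareto draws.
    far = 0
    for _ in range(trials):
        mean_n = sum(pareto_sample() for _ in range(n)) / n
        if abs(mean_n - 3.0) > eps:
            far += 1
    return far / trials

p_small_n = fraction_far_from_mean(50)
p_large_n = fraction_far_from_mean(2000)
print(p_small_n, p_large_n)
```

The estimated probability of a large deviation shrinks as $n$ grows, as the theorem predicts, even though Theorem 25 does not apply.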


In these definitions we need only the joint distribution of $x$ and each $x_n$ separately. In practice, of course, we may know the joint distributions of some of the $x_n$'s (they may be independent, for example), but this is not necessary.

On the other hand, when we come to consider the notion of a random sequence, the appropriate probability space is the infinite product space of all the separate variables. This is a deeper concept than those we have used till now, and we shall treat it later as a special case of the theory of random functions.

10. The Central Limit Problem


We suppose that

$$x_n = \sum_\nu x_{n\nu}$$

is a finite sum of independent random real variables $x_{n\nu}$ and that $F_{n\nu}(x)$, $F_n(x)$, $\phi_{n\nu}(t)$, $\phi_n(t)$ are the associated distribution and characteristic functions. The general central limit problem is to find conditions under which $F_n(x)$ tends to some limiting function $F(x)$ when each of the components $x_{n\nu}$ is small (in a sense to be defined later) in relation to $\sum_\nu x_{n\nu}$. Without the latter condition, there is no general result of this kind. Theorems 25 and 26 show that $F(x)$ may take the special form $D(x)$, and the next two theorems show that the Poisson and normal forms are also admissible. The general problem includes that of finding the most general class of such functions. The problem goes back to Bernoulli and Poisson and was solved (in the case of $R$) by Khintchine and P. Lévy.

Theorem 27 (Poisson). The binomial distribution $P(x = \nu) = \binom{n}{\nu} p^\nu q^{n-\nu}$, $p = \frac{c}{n}$, $q = 1 - p$, $c$ constant, tends to the Poisson distribution with mean $c$ as $n \to \infty$.
Proof.

$$\phi_n(t) = (q + p e^{it})^n = \left[1 + \frac{c(e^{it} - 1)}{n}\right]^n \to e^{c(e^{it} - 1)},$$

which, after Theorem 16, is sufficient.
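The convergence in Theorem 27 is easy to see numerically. A sketch (not from the notes) comparing the binomial $(n, c/n)$ and Poisson$(c)$ probabilities by their total variation distance:

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(c, k):
    return math.exp(-c) * c**k / math.factorial(k)

def total_variation(n, c, kmax=60):
    # Half the L1 distance between the two probability functions.
    p = c / n
    return 0.5 * sum(abs(binom_pmf(n, p, k) - poisson_pmf(c, k))
                     for k in range(kmax + 1))

c = 2.0
d10 = total_variation(10, c)
d1000 = total_variation(1000, c)
print(d10, d1000)
```

The distance drops sharply as $n$ grows with $c$ fixed.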

Theorem 28 (De Moivre). If $\xi_1, \xi_2, \ldots$ are independent random variables with the same distribution, having mean 0 and finite standard deviation $\sigma$, then the distribution of

$$x_n = \frac{\xi_1 + \xi_2 + \cdots + \xi_n}{\sqrt{n}}$$

tends to the normal distribution $(0, \sigma^2)$.

This is proved easily using the method of Theorem 26. The general theory is based on a formula due to Khintchine and Lévy, which generalizes an earlier one for distributions of finite variance due to Kolmogoroff.
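A simulation sketch (not part of the notes) of Theorem 28, using uniform summands on $(-1, 1)$, for which $\sigma^2 = 1/3$:

```python
import random
import statistics

random.seed(2)
n, trials = 400, 4000
# Standardized sums x_n = (xi_1 + ... + xi_n)/sqrt(n).
sums = [sum(random.uniform(-1.0, 1.0) for _ in range(n)) / n**0.5
        for _ in range(trials)]

sample_sd = statistics.pstdev(sums)   # should be near sqrt(1/3) ~ 0.577
# For a normal law, about 68.3% of the mass lies within one standard deviation.
within_1sd = sum(abs(s) < sample_sd for s in sums) / trials

print(sample_sd, within_1sd)
```

Both the spread and the one-standard-deviation mass agree with the normal $(0, 1/3)$ prediction.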
We say that $\psi(t)$ is a K-L function with representation $(a, G)$ if

$$\psi(t) = iat + \int_{-\infty}^{\infty} \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{1 + x^2}{x^2}\,dG(x),$$

where $a$ is a real number and $G(x)$ is bounded and non-decreasing in $(-\infty, \infty)$. The value of the integrand at $x = 0$ is defined to be $-\frac{1}{2}t^2$, and it is then continuous and bounded in $-\infty < x < \infty$ for each fixed $t$.
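The convention at $x = 0$ can be checked numerically. A sketch (not from the notes) evaluating the K-L integrand and confirming that it approaches $-\frac{1}{2}t^2$ as $x \to 0$:

```python
import cmath

def kl_integrand(t, x):
    # (e^{itx} - 1 - itx/(1+x^2)) * (1+x^2)/x^2, with the stated value at x = 0.
    if x == 0.0:
        return complex(-0.5 * t * t, 0.0)
    return (cmath.exp(1j * t * x) - 1 - 1j * t * x / (1 + x * x)) \
           * (1 + x * x) / (x * x)

t = 1.7
limit = kl_integrand(t, 0.0)
near = kl_integrand(t, 1e-4)
print(limit, abs(near - limit))
```

Expanding $e^{itx}$ shows the difference is $O(x)$ near the origin, which the computation confirms.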
Theorem 29. A K-L function $\psi(t)$ is bounded in every finite interval and defines $a$, $G$ uniquely.

Proof. The first part is obvious.

If we define

$$\Psi(t) = \psi(t) - \frac{1}{2}\int_{-1}^{1} \psi(t + u)\,du = \frac{1}{2}\int_{-1}^{1} (\psi(t) - \psi(t + u))\,du,$$

we have

$$\Psi(t) = \frac{1}{2}\int_{-1}^{1} du \int_{-\infty}^{\infty} e^{itx}(1 - e^{iux})\,\frac{1 + x^2}{x^2}\,dG(x)
= \int_{-\infty}^{\infty} e^{itx}\left(1 - \frac{\sin x}{x}\right)\frac{1 + x^2}{x^2}\,dG(x) = \int_{-\infty}^{\infty} e^{itx}\,dT(x),$$

where

$$T(x) = \int_{-\infty}^{x} \left(1 - \frac{\sin y}{y}\right)\frac{1 + y^2}{y^2}\,dG(y), \qquad
G(x) = \int_{-\infty}^{x} \frac{y^2}{(1 - \sin y/y)(1 + y^2)}\,dT(y),$$

since $\left(1 - \dfrac{\sin y}{y}\right)\dfrac{1 + y^2}{y^2}$ and its reciprocal are bounded.

This proves the theorem, since $G(x)$ is defined uniquely by $T(x)$, which is defined uniquely by $\Psi(t)$ by Theorem 14, and this in turn is defined uniquely by $\psi(t)$.
The next theorem gives analogues of Theorems 15 and 16. We shall write $G_n \to G$ if $G_n(x)$ and $G(x)$ are bounded and increasing and $G_n(x) \to G(x)$ at every continuity point of $G(x)$ and at $x = \pm\infty$.
Theorem 30. If $\psi_n(t)$ has K-L representation $(a_n, G_n)$ for each $n$, and if $a_n \to a$, $G_n \to G$, where $G(x)$ is non-decreasing and bounded, then $\psi_n(t) \to \psi(t)$ uniformly in any finite interval.

Conversely, if $\psi_n(t) \to \psi(t)$ for all $t$ and $\psi(t)$ is continuous at 0, then $\psi(t)$ has a K-L representation $(a, G)$ and $a_n \to a$, $G_n \to G$.

Proof. The first part is proved easily using the method of Theorem 15. For the second part, define $\Psi_n(t)$, $T_n(x)$ as in the last theorem. Then, since $\Psi_n(t) \to \Psi(t) = \psi(t) - \frac{1}{2}\int_{-1}^{1} \psi(t + u)\,du$, which is continuous at 0, it follows from Theorem 16 that there is a non-decreasing bounded function $T(x)$ such that $T_n \to T$. Then $G_n \to G$, where $G(x)$ is defined as in Theorem 29 and is bounded, and $\psi(t)$ plainly has K-L representation $(a, G)$, where $a_n \to a$.
Definition. We say that the random variables $x_{n1}, x_{n2}, \ldots$ are uniformly asymptotically negligible (u.a.n.) if, for every $\epsilon > 0$,

$$\sup_\nu \int_{|x| \geq \epsilon} dF_{n\nu}(x) \to 0 \quad\text{as } n \to \infty.$$

The condition that the variables are u.a.n. implies that the variables are centered in the sense that their values are concentrated near 0. The general case can be reduced to that of u.a.n. variables by considering the new variables $x_{n\nu} - c_{n\nu}$. Thus, we need only consider the u.a.n. case, since theorems for this case can be extended to the general case by trivial changes. We prove an elementary result about u.a.n. variables first.
Theorem 31. The conditions

(i) $x_{n\nu}$ are u.a.n.;

(ii) $\displaystyle\sup_\nu \int \frac{x^2}{1 + x^2}\, dF_{n\nu}(x + a_{n\nu}) \to 0$ for every set of numbers $a_{n\nu}$ for which $\sup_\nu |a_{n\nu}| \to 0$ as $n \to \infty$;

are equivalent, and each implies that

(iii) $\sup_\nu |\phi_{n\nu}(t) - 1| \to 0$ as $n \to \infty$, uniformly in every finite $t$-interval.

Proof. The equivalence of (i) and (ii) follows from the inequalities

$$\int \frac{x^2}{1 + x^2}\, dF_{n\nu}(x + a_{n\nu}) \leq (\epsilon + |a_{n\nu}|)^2 + \int_{|x| \geq \epsilon} dF_{n\nu}(x)$$

and

$$\int_{|x| \geq \epsilon} dF_{n\nu}(x) \leq \frac{1 + \epsilon^2}{\epsilon^2} \int_{|x| \geq \epsilon} \frac{x^2}{1 + x^2}\, dF_{n\nu}(x).$$

For (iii) we use the inequality $|1 - e^{itx}| \leq |xt|$ if $|xt| \leq 1$, and deduce that

$$|\phi_{n\nu}(t) - 1| = \left|\int (e^{itx} - 1)\, dF_{n\nu}(x)\right| \leq \epsilon|t| + 2\int_{|x| \geq \epsilon} dF_{n\nu}(x).$$
Theorem 32 (The Central Limit Theorem). Suppose that $x_{n\nu}$ are independent u.a.n. variables and that $F_n \to F$. Then

(i) $\psi(t) = \log \phi(t)$ is a K-L function;

(ii) if $\psi(t)$ has representation $(a, G)$ and the real numbers $a_{n\nu}$ satisfy

$$\int \frac{x}{1 + x^2}\, dF_{n\nu}(x + a_{n\nu}) = 0,$$

and are bounded uniformly in $\nu$, then

$$G_n(x) = \sum_\nu \int_{-\infty}^{x} \frac{y^2}{1 + y^2}\, dF_{n\nu}(y + a_{n\nu}) \to G(x), \qquad a_n = \sum_\nu a_{n\nu} \to a.$$

Conversely, if the conditions (i) and (ii) hold, then $F_n \to F$.


Proof. It follows from the definition of the $a_{n\nu}$ that

$$e^{-ita_{n\nu}}\,\phi_{n\nu}(t) = \int e^{itx}\, dF_{n\nu}(x + a_{n\nu}) = 1 + \alpha_{n\nu}(t),$$

where

$$\alpha_{n\nu}(t) = \int \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right) dF_{n\nu}(x + a_{n\nu}), \tag{1}$$

and

$$\beta_{n\nu}(t) = \mathrm{R}(\alpha_{n\nu}(t)) \leq 0.$$

It follows easily from the u.a.n. condition and the definition of $a_{n\nu}$ that $a_{n\nu} \to 0$ uniformly in $\nu$ as $n \to \infty$, and from Theorem 31 that $\alpha_{n\nu}(t) \to 0$ uniformly in $\nu$ for $|t| \leq H$, where $H > 0$ is fixed. Hence

$$\log \phi_{n\nu}(t) = ita_{n\nu} + \alpha_{n\nu}(t) + O(|\alpha_{n\nu}(t)|^2),$$

the $O$ being uniform in $\nu$, and, by addition,

$$\log \phi_n(t) = it\sum_\nu a_{n\nu} + \sum_\nu \alpha_{n\nu}(t) + O\Big[\sum_\nu |\alpha_{n\nu}(t)|^2\Big], \tag{2}$$

uniformly in $|t| \leq H$.

Now let

$$A_{n\nu} = -\frac{1}{2H}\int_{-H}^{H} \beta_{n\nu}(t)\,dt = \int \left[1 - \frac{\sin Hx}{Hx}\right] dF_{n\nu}(x + a_{n\nu}).$$

Using the inequality

$$\left|e^{itx} - 1 - \frac{itx}{1 + x^2}\right| \leq C(H)\left[1 - \frac{\sin Hx}{Hx}\right]$$

for $|t| \leq H$, we have

$$|\alpha_{n\nu}(t)| \leq C A_{n\nu},$$

and therefore, taking real parts in (2) and using the fact that $\sup_\nu |\alpha_{n\nu}(t)| \to 0$ uniformly in $|t| \leq H$,

$$-\sum_\nu \beta_{n\nu}(t) \leq -\log |\phi_n(t)| + o(1)\sum_\nu A_{n\nu}.$$

This again holds uniformly in $|t| \leq H$, and after integration we get

$$\sum_\nu A_{n\nu} \leq -\frac{1}{2H}\int_{-H}^{H} \log |\phi_n(t)|\,dt + o(1)\sum_\nu A_{n\nu},$$

from which it follows that $\sum_\nu A_{n\nu} = O(1)$ and that

$$\log \phi_n(t) = it\sum_\nu a_{n\nu} + \sum_\nu \alpha_{n\nu}(t) + o(1), \tag{3}$$

uniformly for $|t| \leq H$, and therefore, since $H$ is at our disposal, for each real $t$.

The first part of the conclusion follows from Theorem 30.
For the converse, our hypothesis implies that $G_n(\infty) \to G(\infty)$, and if we use the inequality

$$\left|e^{itx} - 1 - \frac{itx}{1 + x^2}\right| \leq C(t)\,\frac{x^2}{1 + x^2},$$

it follows from (1) that

$$\sum_\nu |\alpha_{n\nu}(t)| \leq C(t)\, G_n(\infty) = O(1)$$

uniformly in $n$. But $\alpha_{n\nu}(t) \to 0$ uniformly in $\nu$ for any fixed $t$, so that (2) and (3) remain valid, and

$$\log \phi_n(t) \to ita + \int \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{1 + x^2}{x^2}\,dG(x) = \psi(t) = \log \phi(t),$$

since $G_n \to G$. Hence $\phi_n(t) \to \phi(t)$, as we require.


Notes on Theorem 32.

(a) The first part of the theorem shows that the admissible limit functions for sums of u.a.n. variables are those for which $\log \phi(t)$ is a K-L function.

(b) The numbers $a_{n\nu}$ defined in stating the theorem always exist when $n$ is large, since $\int \frac{x}{1 + x^2}\, dF_{n\nu}(x + \lambda)$ is continuous in $\lambda$ and takes values of opposite signs at $\lambda = \pm 1$ when $n$ is large. They can be regarded as extra correcting terms required to complete the centralization of the variables. The u.a.n. condition centralizes each of them separately, but this is not quite enough.

(c) The definition of $a_{n\nu}$ is not the only possible one. It is easy to see that the proof goes through with trivial changes provided that the $a_{n\nu}$ are defined so that $a_{n\nu} \to 0$ and

$$\int \frac{x}{1 + x^2}\, dF_{n\nu}(x + a_{n\nu}) = o\left(\int \frac{x^2}{1 + x^2}\, dF_{n\nu}(x + a_{n\nu})\right)$$

uniformly in $\nu$ as $n \to \infty$, and this is easy to verify if we define $a_{n\nu}$ by

$$a_{n\nu} = \int_{|x| \leq \epsilon} x\, dF_{n\nu}(x)$$

for some fixed $\epsilon > 0$. This is the definition used by Gnedenko and Lévy.
We can deduce immediately the following special cases.
Theorem 33 (Law of large numbers). In order that $F_n$ should tend to the singular distribution with $F(x) = D(x - a)$, it is necessary and sufficient that

$$\sum_\nu \int \frac{y^2}{1 + y^2}\, dF_{n\nu}(y + a_{n\nu}) \to 0, \qquad a_n \to a.$$

(Here $\psi(t) = ita$, $G(x) = 0$.)


Theorem 34. In order that $F_n$ should tend to the Poisson distribution with parameter $c$, it is necessary and sufficient that

$$a_n \to \frac{c}{2} \quad\text{and}\quad \sum_\nu \int_{-\infty}^{x} \frac{y^2}{1 + y^2}\, dF_{n\nu}(y + a_{n\nu}) \to \frac{c}{2}\, D(x - 1).$$

(Here $\psi(t) = c(e^{it} - 1)$, $a = \frac{1}{2}c$, $G(x) = \frac{c}{2}\, D(x - 1)$.)

Theorem 35. In order that $F_n$ should tend to the normal $(\mu, \sigma^2)$ distribution, it is necessary and sufficient that

$$a_n \to \mu \quad\text{and}\quad \sum_\nu \int_{-\infty}^{x} \frac{y^2}{1 + y^2}\, dF_{n\nu}(y + a_{n\nu}) \to \sigma^2 D(x).$$

(Here $\psi(t) = i\mu t - \frac{\sigma^2 t^2}{2}$, $a = \mu$, $G(x) = \sigma^2 D(x)$.) From this and the note (c) after Theorem 32, it is easy to deduce
Theorem 36 (Liapounoff). If $x_{n\nu}$ has mean 0 and finite variance $\sigma_{n\nu}^2$ with $\sum_\nu \sigma_{n\nu}^2 = 1$, a necessary and sufficient condition that $x_n$ should tend to the normal $(0, 1)$ distribution is that, for every $\epsilon > 0$,

$$\sum_\nu \int_{|x| \geq \epsilon} x^2\, dF_{n\nu}(x) \to 0 \quad\text{as } n \to \infty.$$

The distributions for which $\log \phi(t)$ is a K-L function can be characterized by another property. We say that a distribution is infinitely divisible (i.d.) if, for every $n$, we can write

$$\phi(t) = (\phi_n(t))^n,$$

where $\phi_n(t)$ is a characteristic function. This means that it is the distribution of the sum of $n$ independent random variables with the same distribution.

Theorem 37. A distribution is i.d. if and only if $\log \phi(t)$ is a K-L function.

This follows at once from Theorem 32. That the condition that a distribution be i.d. is equivalent to a lighter one is shown by the following
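Infinite divisibility is concrete for the Poisson law. A numerical sketch (not from the notes): $\phi(t) = e^{c(e^{it} - 1)}$ factors exactly as the $n$th power of the characteristic function of a Poisson variable of mean $c/n$:

```python
import cmath

def poisson_cf(c, t):
    # Characteristic function of the Poisson distribution with mean c.
    return cmath.exp(c * (cmath.exp(1j * t) - 1))

c, n, t = 2.0, 7, 0.9
whole = poisson_cf(c, t)
parts = poisson_cf(c / n, t) ** n   # phi = (phi_n)^n with phi_n Poisson(c/n)
print(abs(whole - parts))
```

The factorization is exact up to floating-point rounding, for any $n$.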


Corollary 1. $\phi(t)$ is i.d. if there is a sequence of decompositions (not necessarily with identically distributed components) in which the terms are u.a.n.

Corollary 2. If a distribution is i.d., $\phi(t) \neq 0$.

Theorem 38. A distribution is i.d. if and only if it is the limit of finite compositions of u.a.n. Poisson distributions.
Proof. The result is trivial in one direction. In the other, we observe that the integral

$$\int \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{1 + x^2}{x^2}\,dG(x)$$

can be interpreted as a Riemann-Stieltjes integral and is the limit of finite sums of the type

$$\sum_j b_j\left(e^{it\xi_j} - 1 - \frac{it\xi_j}{1 + \xi_j^2}\right),$$

each term of which corresponds to a Poisson distribution.

11. Cumulative sums


Cumulative sums

$$x_n = \frac{\xi_1 + \xi_2 + \cdots + \xi_n}{\lambda_n},$$

in which $\xi_1, \xi_2, \ldots, \xi_n$ are independent and have distribution functions $B_1(x), B_2(x), \ldots, B_n(x)$ and characteristic functions $\beta_1(t), \beta_2(t), \ldots, \beta_n(t)$, are included in the more general sums considered in the central limit theorem. It follows that the limiting distribution of $x_n$ is i.d., and if $\phi(t)$ is the limit of the characteristic functions $\phi_n(t)$ of $x_n$, then $\log \phi(t)$ is a K-L function. These limits form, however, only a proper subclass of the K-L class, and the problem of characterizing this subclass was proposed by Khintchine and solved by Lévy. We denote the class by L. As in the general case, it is natural to assume always that the components $\xi_\nu/\lambda_n$ are u.a.n.

Theorem 39. If $\xi_\nu/\lambda_n$ are u.a.n. and $\phi_n(t) \to \phi(t)$, where $\phi(t)$ is non-singular, then $\lambda_n \to \infty$, $\lambda_{n+1}/\lambda_n \to 1$.

Proof. Since $\lambda_n > 0$, either $\lambda_n \to \infty$ or $(\lambda_n)$ contains a convergent subsequence $(\lambda_{n_k})$ with finite limit $\lambda$. The u.a.n. condition implies that $\beta_\nu(t/\lambda_n) \to 1$ for every $t$ and therefore, by the continuity of $\beta_\nu(t)$,

$$\beta_\nu(t) = \lim_{k\to\infty} \beta_\nu\!\left(\frac{\lambda t}{\lambda_{n_k}}\right) = 1$$

for all $t$. This means that every $\beta_\nu(t)$ is singular, and this is impossible as $\phi(t)$ is not singular.

For the second part, since

$$x_{n+1} = \frac{\xi_1 + \xi_2 + \cdots + \xi_n}{\lambda_{n+1}} + \frac{\xi_{n+1}}{\lambda_{n+1}} = \frac{\lambda_n x_n}{\lambda_{n+1}} + \frac{\xi_{n+1}}{\lambda_{n+1}},$$

and the last term is asymptotically negligible, $\dfrac{\lambda_n x_n}{\lambda_{n+1}}$ and $x_{n+1}$ have the same limiting distribution $F(x)$, and therefore

$$F_n\!\left(\frac{\lambda_{n+1} x}{\lambda_n}\right) \to F(x), \qquad F_n(x) \to F(x).$$

Now if $\rho_n = \dfrac{\lambda_{n+1}}{\lambda_n} \not\to 1$, we can choose a subsequence $(n_k)$ along which $\rho_{n_k}$ either tends to $\infty$ or to some limit $\rho \geq 0$, $\rho \neq 1$. In the first case $F_{n_k}(\rho_{n_k} x) \to F(\pm\infty) \neq F(x)$ for some $x$. In the other case

$$F(x) = \lim_{k\to\infty} F_{n_k}(\rho_{n_k} x) = F(\rho x)$$

whenever $x$ and $\rho x$ are continuity points of $F(x)$, and this is impossible unless $\rho = 1$.
A characteristic function $\phi(t)$ is called self-decomposable (s.d.) if, for every $c$ in $0 < c < 1$, it is possible to write

$$\phi(t) = \phi(ct)\,\phi_c(t),$$

where $\phi_c(t)$ is a characteristic function.

Theorem 40 (P. Lévy). A function $\phi(t)$ belongs to L if and only if it is self-decomposable, and $\phi_c(t)$ is then i.d.


Proof. First suppose that $\phi(t)$ is s.d. If it has a positive real zero, it has a smallest one, $2a$, since it is continuous, and so

$$\phi(2a) = 0, \qquad \phi(t) \neq 0 \ \text{for } 0 \leq t < 2a.$$

Then $\phi(2ac) \neq 0$ if $0 < c < 1$, and since $\phi(2a) = \phi(2ac)\,\phi_c(2a)$, it follows that $\phi_c(2a) = 0$. Hence

$$1 = 1 - \mathrm{R}(\phi_c(2a)) = \int (1 - \cos 2ax)\,dF_c(x)$$
$$= 2\int (1 - \cos ax)(1 + \cos ax)\,dF_c(x) \leq 4\int (1 - \cos ax)\,dF_c(x)$$
$$= 4\big(1 - \mathrm{R}(\phi_c(a))\big) = 4\left(1 - \mathrm{R}\!\left(\frac{\phi(a)}{\phi(ca)}\right)\right).$$

This leads to a contradiction, since $\phi(ca) \to \phi(a)$ as $c \to 1$, and it follows therefore that $\phi(t) \neq 0$ for $t \geq 0$ and likewise for $t < 0$.
If $1 \leq \nu \leq n$, it follows from our hypothesis that

$$\psi_\nu(t) = \phi_{\frac{\nu-1}{\nu}}(\nu t) = \frac{\phi(\nu t)}{\phi((\nu - 1)t)}$$

is a characteristic function, and the decomposition

$$\phi(t) = \phi_n(t) = \prod_{\nu=1}^{n} \psi_\nu(t/n)$$

shows that $\phi(t)$ is of type L with $\lambda_n = n$.
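Self-decomposability is easy to check numerically for the normal law. A sketch (not from the notes): with $\phi(t) = e^{-t^2/2}$, the factor $\phi_c(t) = \phi(t)/\phi(ct)$ is itself a normal characteristic function, of variance $1 - c^2$:

```python
import math

def phi(t):
    # Standard normal characteristic function.
    return math.exp(-0.5 * t * t)

c, t = 0.6, 1.3
phi_c = phi(t) / phi(c * t)                       # the factor phi_c(t)
normal_factor = math.exp(-0.5 * (1 - c * c) * t * t)
print(abs(phi_c - normal_factor))
```

So for every $0 < c < 1$ the required factor exists and is a genuine characteristic function, confirming that the normal law is s.d.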


Conversely, if we suppose that $\phi(t)$ is of type L, we have

$$\phi_n(t) = \prod_{\nu=1}^{n} \beta_\nu(t/\lambda_n),$$
$$\phi_{n+m}(t) = \prod_{\nu=1}^{n+m} \beta_\nu(t/\lambda_{n+m}) = \phi_n(\lambda_n t/\lambda_{n+m})\,\psi_{n,m}(t),$$

where

$$\psi_{n,m}(t) = \prod_{\nu=n+1}^{n+m} \beta_\nu(t/\lambda_{n+m}).$$

Using Theorem 39, we can choose $n$, $m(n)$ so that $\lambda_n/\lambda_{n+m} \to c$ $(0 < c < 1)$, and then $\phi_{n+m}(t) \to \phi(t)$. Since $\phi_n(t) \to \phi(t)$ uniformly in any finite $t$-interval, $\phi_n(\lambda_n t/\lambda_{n+m}) \to \phi(ct)$. It follows that $\psi_{n,m}(t)$ has a limit $\phi_c(t)$ which is continuous at $t = 0$ and is therefore a characteristic function by Theorem 16. Moreover, the form of $\psi_{n,m}(t)$ shows that $\phi_c(t)$ is i.d.
The theorem characterizes L by a property of $\phi(t)$. It is also possible to characterize it by a property of $G(x)$.

Theorem 41 (P. Lévy). The functions $\phi(t)$ of L are those for which $\log \phi(t)$ has a K-L representation $(a, G)$ in which $\dfrac{x^2 + 1}{x}\,G'(x)$ exists and decreases outside a countable set of points.
Proof. If we suppose that $\phi(t)$ is of class L and $0 < c < 1$, and ignore terms of the form $iat$, we have

$$\log \phi_c(t) = \int \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{1 + x^2}{x^2}\,dG(x) - \int \left(e^{itx} - 1 - \frac{itc^2 x}{c^2 + x^2}\right)\frac{x^2 + c^2}{x^2}\,dG(x/c),$$

and the fact that $\phi_c(t)$ is i.d. by Theorem 40 implies that $Q(x) - Q(bx)$ decreases, where $b = 1/c > 1$, $x > 0$, and

$$Q(x) = -\int_{x}^{\infty} \frac{y^2 + 1}{y^2}\,dG(y).$$

If we write $q(x) = Q(e^x)$, this means that $q(x) - q(x + d)$ decreases for $x > 0$ if $d > 0$. It follows that $q'(x)$ exists and decreases outside a countable set (see for instance Hardy, Littlewood, Polya: Inequalities, p. 91, together with Theorem 48, Chapter I, on page 51).

Then, since

$$\frac{x^2}{x^2 + 1}\cdot\frac{Q(x + h) - Q(x)}{h} \leq \frac{G(x + h) - G(x)}{h} \leq \frac{(x + h)^2}{(x + h)^2 + 1}\cdot\frac{Q(x + h) - Q(x)}{h},$$

we have $G'(x) = \dfrac{x^2}{x^2 + 1}\,Q'(x)$ and $\dfrac{x^2 + 1}{x}\,G'(x) = xQ'(x)$, which also exists and decreases outside a countable set. The same argument applies for $x < 0$. The converse part is trivial.
A more special case arises if we suppose that the components are identically distributed, and the class $L_0$ of limits for sequences of such cumulative sums can again be characterized by properties of the limits $\phi(t)$ or $G(x)$.

We say that $\phi(t)$ is stable if, for every positive constant $b$, we can find constants $a$, $b'$ so that

$$\phi(t)\,\phi(bt) = e^{iat}\,\phi(b' t).$$

This implies, of course, that $\phi(t)$ is s.d. and that $\phi_c$ then has the form $e^{ia' t}\,\phi(c' t)$.

Theorem 42 (P. Lévy). A characteristic function $\phi(t)$ belongs to $L_0$ if and only if it is stable.
Proof. If $\phi(t)$ is stable, we have, on leaving out the inessential factors of the form $e^{iat}$, $(\phi(t))^n = \phi(\lambda_n t)$ for some $\lambda_n > 0$, and so

$$\phi(t) = (\phi(t/\lambda_n))^n = \prod_{\nu=1}^{n} \beta_\nu(t/\lambda_n) \quad\text{with } \beta_\nu(t) = \phi(t),$$

which is enough to show that $\phi(t)$ belongs to $L_0$.

Conversely, if we suppose that a sequence $\lambda_n$ can be found so that

$$\phi_n(t) = (\beta(t/\lambda_n))^n \to \phi(t),$$

we write $n = n_1 + n_2$,

$$\phi_n(t) = (\beta(t/\lambda_n))^{n_1}\,(\beta(t/\lambda_n))^{n_2} = \phi_{n_1}(t\lambda_{n_1}/\lambda_n)\,\phi_{n_2}(t\lambda_{n_2}/\lambda_n).$$

Then, if $0 < c < 1$, we choose $n_1$ so that $\lambda_{n_1}/\lambda_n \to c$, and it follows that $\phi_{n_1}(t\lambda_{n_1}/\lambda_n) \to \phi(ct)$ and $\phi_{n_2}(t\lambda_{n_2}/\lambda_n) \to \phi_c(t)$. It is easy to show that this implies that $\phi_c(t)$ has the form $e^{ia' t}\,\phi(c' t)$.

It is possible to characterize the stable distributions in terms of $\log \phi(t)$ and $G(x)$.

Theorem 43 (P. Lévy). The characteristic function $\phi(t)$ is stable if and only if

$$\log \phi(t) = iat - A|t|^\alpha\left(1 + i\beta\,\frac{t}{|t|}\,\tan\frac{\pi\alpha}{2}\right), \qquad 0 < \alpha < 1 \ \text{or } 1 < \alpha < 2,$$

or

$$\log \phi(t) = iat - A|t|\left(1 + i\beta\,\frac{t}{|t|}\,\frac{2}{\pi}\log|t|\right),$$

with $A > 0$ and $-1 \leq \beta \leq +1$.

Corollary. The real stable distributions are given by $\phi(t) = e^{-A|t|^\alpha}$ $(0 < \alpha \leq 2)$.
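The case $\alpha = 1$ of the corollary is the Cauchy law, $\phi(t) = e^{-|t|}$, and stability there means that the mean of $n$ i.i.d. copies has the same distribution as a single copy. A simulation sketch (not from the notes) comparing empirical characteristic functions:

```python
import math
import random

random.seed(3)

def cauchy():
    # Inverse-transform sampling of the standard Cauchy distribution.
    return math.tan(math.pi * (random.random() - 0.5))

def ecf(samples, t):
    # Real part of the empirical characteristic function.
    return sum(math.cos(t * x) for x in samples) / len(samples)

n, trials, t = 20, 4000, 1.0
singles = [cauchy() for _ in range(trials)]
means = [sum(cauchy() for _ in range(n)) / n for _ in range(trials)]

print(ecf(singles, t), ecf(means, t), math.exp(-abs(t)))
```

Both empirical values sit near $e^{-1}$, reflecting the fact that averaging does not change a Cauchy distribution at all.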
Proof. In the notation of Theorem 41, the stability condition implies that, for every $d > 0$, we can define $d'$ so that

$$q(x) = q(x + d) + q(x + d') \qquad (x > 0).$$

Since $q(x)$ is real, the only solutions of this difference equation, apart from the special case $G(x) = AD(x)$, are given by

$$q(x) = -A_1 e^{-\alpha x}, \quad Q(x) = -A_1 x^{-\alpha}, \quad G'(x) = \frac{\alpha A_1 x^{1-\alpha}}{1 + x^2}, \qquad x > 0,$$

where $\alpha$ satisfies $1 = e^{-\alpha d} + e^{-\alpha d'}$. We can use a similar argument for $x < 0$, and we have also

$$q(x) = -A_2 e^{-\alpha|x|}, \quad Q(x) = -A_2 |x|^{-\alpha}, \quad G'(x) = \frac{\alpha A_2 |x|^{1-\alpha}}{1 + x^2}, \qquad x < 0,$$

where $d$, $d'$, $\alpha$ are the same. Moreover, since $G(x)$ is bounded, we must have $0 < \alpha \leq 2$, and since the case $\alpha = 2$ arises when $G(x) = AD(x)$ and the distribution is normal, we can suppose that $0 < \alpha < 2$. Hence

$$\log \phi(t) = iat + \alpha A_1 \int_{0}^{\infty} \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{dx}{x^{\alpha+1}} + \alpha A_2 \int_{-\infty}^{0} \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{dx}{|x|^{\alpha+1}}.$$

The conclusion follows, if $\alpha \neq 1$, from the formulae

$$\int_{0}^{\infty} (e^{itx} - 1)\,\frac{dx}{x^{\alpha+1}} = |t|^\alpha\, e^{\mp i\pi\alpha/2}\, p(\alpha) \quad\text{if } 0 < \alpha < 1,$$

$$\int_{0}^{\infty} \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{dx}{x^{\alpha+1}} = |t|^\alpha\, e^{\mp i\pi\alpha/2}\, p(\alpha) \quad\text{if } 1 < \alpha < 2$$

(easily proved by contour integration), since the remaining components $\dfrac{itx}{1 + x^2}$ or $\dfrac{itx}{1 + x^2} - itx$ merely add to the term $iat$. If $\alpha = 1$, we use the formula

$$\int_{0}^{\infty} \left(e^{itx} - 1 - \frac{itx}{1 + x^2}\right)\frac{dx}{x^2} = -\frac{\pi}{2}|t| - it\log|t| + ia_1 t,$$

which is easy to verify.

12. Random Functions


In our discussion of random functions, we shall not give proofs for all theorems, but shall content ourselves with giving references in many cases.

Let $\Omega$ be the space of functions $x(t)$ defined on some space $T$ and taking values in a space $X$. Then we call $x(t)$ a random function (or process) if a probability measure is defined in $\Omega$. We shall suppose here that $X$ is the space of real numbers and that $T$ is the same space or some subspace of it.

The basic problem is to prove the existence of measures in $\Omega$ with certain properties - usually that certain assigned sets in $\Omega$ are measurable and have assigned measures. These sets are usually associated with some natural property of the functions $x(t)$. It is sometimes convenient to denote the function (which is a point of $\Omega$) by $\omega$ and the value of the function at $t$ by $x(t, \omega)$.
A basic theorem is

Theorem 44 (Kolmogoroff). Suppose that for every finite set of distinct real numbers $t_1, t_2, \ldots, t_n$ we have a joint distribution function $F_{t_1, t_2, \ldots, t_n}(\xi_1, \xi_2, \ldots, \xi_n)$ in $R_n$, and that these distribution functions are consistent in the sense that their values are unchanged by like permutations of $(t_i)$ and $(\xi_i)$ and, if $n > m$,

$$F_{t_1, t_2, \ldots, t_n}(\xi_1, \ldots, \xi_m, \infty, \ldots, \infty) = F_{t_1, t_2, \ldots, t_m}(\xi_1, \ldots, \xi_m).$$

Then a probability measure can be defined in $\Omega$ in such a way that

$$\mathrm{Prob}(x(t_i) \leq \xi_i,\ i = 1, 2, \ldots, n) = F_{t_1, t_2, \ldots, t_n}(\xi_1, \ldots, \xi_n). \tag{1}$$

Proof. The set of functions defined by a finite number of conditions

$$a_i \leq x(t_i) \leq b_i$$

is called a rectangular set, and the union of a finite number of rectangular sets is called a figure. It is plain that intersections of figures are also figures and that the system $S_0$ of figures and their complements is additive. Moreover, the probability measure $\mu$ defined in $S_0$ by (1) is additive in $S_0$, and it is therefore enough, after Theorem 7 of Chapter 1, to show that $\mu$ is completely additive in $S_0$. It is enough to show that if $I_n$ are figures and $I_n \downarrow 0$, then $\mu(I_n) \to 0$.

We assume that $\lim \mu(I_n) > 0$ and derive a contradiction. Since only a finite number of points $t_i$ are associated with each $I_n$, the set of all these $t_i$'s is countable and we can arrange them in a sequence $(t_i)$. Now each $I_n$ is the union of a finite number of rectangular sets in the product space of a finite number of the spaces of the variables $x_i = x(t_i)$, and we can select one of these rectangles, for $n = 1, 2, \ldots$, so that it contains a closed rectangle $J_n$ with the property that $\lim_{m\to\infty} \mu(J_n I_m) > 0$. Also we may choose the $J_n$ so that $J_{n+1} \subset J_n$. We then obtain a decreasing sequence of closed non-empty rectangles $J_n$ defined by

$$a_{in} \leq y_i \leq b_{in} \qquad (i = 1, 2, \ldots, i_n).$$

For each $i$ there is at least one point $y_i$ which is contained in all the intervals $[a_{in}, b_{in}]$, and any function $x(t)$ for which $x(t_i) = y_i$ belongs to all $I_n$. This is impossible since $I_n \downarrow 0$, and therefore we have $\mu(I_n) \to 0$.
As an important special case, we have the following theorem on random sequences.

Theorem 45. Suppose that for every $N$ we have joint distribution functions $F_N(\xi_1, \ldots, \xi_N)$ in $R_N$ which are consistent in the sense of Theorem 44. Then a probability measure can be defined in the space of real sequences $(x_1, x_2, \ldots)$ in such a way that

$$P(x_i \leq \xi_i,\ i = 1, 2, \ldots, N) = F_N(\xi_1, \ldots, \xi_N).$$

Corollary. If $F_n(x)$ is a sequence of distribution functions, a probability measure can be defined in the space of real sequences so that, if $I_n$ are any open or closed intervals,

$$P(x_n \in I_n,\ n = 1, 2, \ldots, N) = \prod_{n=1}^{N} F_n(I_n).$$

The terms of the sequence are then said to be independent, and the measure is the product measure of the measures in the component spaces. The measures defined by Theorem 44 will be called K-measures. The probability measures which are useful in practice are generally extensions of K-measures, since the latter generally fail to define measures on important classes of functions. For example, if $I$ is an interval, the set of functions for which $a \leq x(t) \leq b$ for $t$ in $I$ is not K-measurable.

In the following discussion of measures with special properties, we shall suppose that the basic K-measure can be extended so as to make measurable all the sets of functions which are used.

A random function $x(t)$ is called stationary (in the strict sense) if the transformation $x(t) \to x(t + a)$ preserves measures for any real $a$ (or integer $a$ in the case of sequences).
A random function $x(t)$ is said to have independent increments if the variables $x(t_i) - x(s_i)$ are independent for non-overlapping intervals $(s_i, t_i)$. It is called Gaussian if the joint probability distribution for $x(t_1), x(t_2), \ldots, x(t_n)$ for any finite set $t_1, t_2, \ldots, t_n$ is Gaussian in $R_n$ - that is, if the functions $F$ of Theorem 44 are all Gaussian. A random function $x(t)$ is called an $L^2$-function if it has finite variance for every $t$. This means that $x(t, \omega)$ belongs to $L^2(\Omega)$ as a function of $\omega$ for each $t$, and the whole function is described as a trajectory in the Hilbert space $L^2(\Omega)$.

Many of the basic properties of an $L^2$-function can be described in terms of the auto-correlation function

$$r(s, t) = E((x(s) - m(s))(x(t) - m(t))) = E(x(s)x(t)) - m(s)m(t),$$

where $m(t) = E(x(t))$.
A condition which is weaker than that of independent increments is that the increments should be uncorrelated. This is the case if

$$E((x(t) - x(s))(x(t') - x(s'))) = E(x(t) - x(s))\,E(x(t') - x(s'))$$

for non-overlapping intervals $(s, t)$, $(s', t')$. If an $L^2$-function is centred so that $m(t) = 0$ (which can always be done trivially by considering $x(t) - m(t)$), a function with uncorrelated increments has orthogonal increments, that is,

$$E((x(t) - x(s))(x(t') - x(s'))) = 0$$

for non-overlapping intervals. The function will then be called an orthogonal random function.

The idea of a stationary process can also be weakened in the same way. An $L^2$-function is stationary in the weak sense if $r(s, t)$ depends only on $t - s$. We then write $\rho(h) = r(s, s + h)$.

We now go on to consider some special properties of random functions.

13. Random Sequences and Convergence Properties


The problems connected with random sequences are generally much simpler than those relating to random functions defined over a non-countable set. We may also use the notation $\omega$ for a sequence and $x_n(\omega)$ for its $n$th term.

Theorem 46 (The 0 or 1 principle: Borel, Kolmogoroff). The probability that a random sequence of independent variables has a property (e.g. convergence) which is not affected by changes in the values of any finite number of its terms is equal to 0 or 1.
Proof. Let $E$ be the set of sequences having the given property, so that our hypothesis is that, for every $N \geq 1$,

$$E = X_1 \times X_2 \times \cdots \times X_N \times E_N,$$

where $E_N$ is a set in the product space $X_{N+1} \times X_{N+2} \times \cdots$.

It follows that if $F$ is any figure, $FE = F \times E_N$ for large enough $N$, and

$$\mu(FE) = \mu(F)\,\mu(E_N) = \mu(F)\,\mu(E),$$

and since this holds for all figures $F$, it extends to measurable sets $F$. In particular, putting $F = E$, we get

$$\mu(E) = (\mu(E))^2, \qquad \mu(E) = 0 \ \text{or } 1.$$

We can now consider questions of convergence of series $\sum_{\nu=1}^{\infty} x_\nu$ of independent random variables.

Theorem 47. If $s_n = \sum_{\nu=1}^{n} x_\nu \to s$ p.p., then $s_n - s \to 0$ in probability and the distribution function of $s_n$ tends to that of $s$. (This follows from Egoroff's theorem.)
Theorem 48 (Kolmogoroff's inequality). If $x_\nu$ are independent, with means 0 and standard deviations $\sigma_\nu$, and if

$$T_N = \sup_{n \leq N} |s_n|, \qquad s_n = \sum_{\nu=1}^{n} x_\nu, \qquad \epsilon > 0,$$

then

$$P(T_N \geq \epsilon) \leq \frac{1}{\epsilon^2}\sum_{\nu=1}^{N} \sigma_\nu^2.$$
Proof. Let

$$E = [T_N \geq \epsilon] = \sum_{k=1}^{N} E_k, \quad\text{where}\quad E_k = [\,|s_k| \geq \epsilon,\ T_{k-1} < \epsilon\,].$$

It is plain that the $E_k$ are disjoint. Moreover,

$$\sum_{\nu=1}^{N} \sigma_\nu^2 = \int_\Omega s_N^2\,d\mu \geq \sum_{k=1}^{N}\int_{E_k} s_N^2\,d\mu = \sum_{k=1}^{N}\int_{E_k} (s_k + x_{k+1} + \cdots + x_N)^2\,d\mu$$
$$= \sum_{k=1}^{N}\int_{E_k} s_k^2\,d\mu + \sum_{k=1}^{N}\mu(E_k)\sum_{i=k+1}^{N}\sigma_i^2,$$

since the $x_\nu$ are independent and $E_k$ involves only $x_1, \ldots, x_k$. Therefore

$$\sum_{\nu=1}^{N}\sigma_\nu^2 \geq \sum_{k=1}^{N}\int_{E_k} s_k^2\,d\mu \geq \epsilon^2\sum_{k=1}^{N}\mu(E_k) = \epsilon^2\mu(E),$$

as we require.
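A Monte Carlo sketch (not from the notes) of Theorem 48 for $\pm 1$ steps, where the variance sum is just $N$: the empirical probability that the running maximum exceeds $\epsilon$ stays below the bound $N/\epsilon^2$.

```python
import random

random.seed(4)
N, trials, eps = 50, 5000, 15.0
variance_sum = float(N)   # each x_nu is +-1, so sigma_nu^2 = 1

exceed = 0
for _ in range(trials):
    s, top = 0.0, 0.0
    for _ in range(N):
        s += random.choice((-1.0, 1.0))
        top = max(top, abs(s))   # T_N = sup of |s_n| up to N
    if top >= eps:
        exceed += 1

p_hat = exceed / trials
bound = variance_sum / eps**2
print(p_hat, bound)
```

The estimate is comfortably below the Kolmogoroff bound (which is far from tight here).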
Theorem 49. If $x_\nu$ are independent with means $m_\nu$ and $\sum_{\nu=1}^{\infty} \sigma_\nu^2 < \infty$, then $\sum_{\nu=1}^{\infty} (x_\nu - m_\nu)$ converges p.p.

Proof. It is obviously enough to prove the theorem in the case $m_\nu = 0$. By Theorem 48, if $\epsilon > 0$,

$$P\left(\sup_{n \leq N} |s_{m+n} - s_m| \geq \epsilon\right) \leq \frac{1}{\epsilon^2}\sum_{\nu=m+1}^{m+N}\sigma_\nu^2,$$

and therefore

$$P\left(\sup_{n \geq 1} |s_{m+n} - s_m| > \epsilon\right) \leq \frac{1}{\epsilon^2}\sum_{\nu=m+1}^{\infty}\sigma_\nu^2,$$

and this is enough to show that

$$\lim_{m\to\infty}\sup_{n \geq 1} |s_{m+n} - s_m| = 0 \ \text{p.p.},$$

and, by the general principle of convergence, $s_n$ converges p.p.


As a partial converse of this, we have


Theorem 50. If $x_\nu$ are independent with means $m_\nu$ and standard deviations $\sigma_\nu$, $|x_\nu| \leq c$, and $\sum_{\nu=1}^{\infty} x_\nu$ converges in a set of positive measure (and therefore p.p. by Theorem 46), then $\sum_{\nu=1}^{\infty} \sigma_\nu^2$ and $\sum_{\nu=1}^{\infty} m_\nu$ converge.
Proof. Let $\beta_\nu(t)$, $\phi_n(t)$ be the characteristic functions of $x_\nu$ and $s_n = \sum_{\nu=1}^{n} x_\nu$. Then it follows from Theorem 47 that

$$\phi(t) = \prod_{\nu=1}^{\infty}\beta_\nu(t),$$

where $\phi(t) \neq 0$ in some neighbourhood of $t = 0$, the product being uniformly convergent over every finite $t$-interval. Since

$$\beta_\nu(t) = \int_{-c}^{c} e^{itx}\,dF_\nu(x),$$

it is easy to show that

$$\sigma_\nu^2 \leq -K\log|\beta_\nu(t)|$$

if $t$ is in some sufficiently small interval independent of $\nu$, and it follows that $\sum_{\nu=1}^{\infty}\sigma_\nu^2 < \infty$. Hence $\sum_{\nu=1}^{\infty}(x_\nu - m_\nu)$ converges p.p. by Theorem 49, and since $\sum x_\nu$ converges p.p., $\sum m_\nu$ also converges.



Theorem 51 (Kolmogoroff's three series theorem). Let $x_\nu$ be independent and $c > 0$,

$$x'_\nu = x_\nu \ \text{if } |x_\nu| \leq c, \qquad x'_\nu = 0 \ \text{if } |x_\nu| > c.$$

Then $\sum x_\nu$ converges p.p. if and only if the three series

$$\sum_{\nu=1}^{\infty} P(|x_\nu| > c), \qquad \sum m'_\nu, \qquad \sum \sigma_\nu'^2$$

converge, where $m'_\nu$, $\sigma'_\nu$ are the means and standard deviations of the $x'_\nu$.
Proof. First, if $\sum x_\nu$ converges p.p., then $x_\nu \to 0$, and so $x_\nu = x'_\nu$, $|x_\nu| < c$ for large enough $\nu$ for almost all sequences.

Let $p_\nu = P(|x_\nu| > c)$. Now

$$\left[\limsup_\nu |x_\nu| < c\right] = \lim_{N\to\infty}\big[\,|x_\nu| < c \ \text{for } \nu \geq N\,\big] = \lim_{N\to\infty}\prod_{\nu=N}^{\infty}[\,|x_\nu| < c\,].$$

Therefore

$$1 = P\left(\limsup_\nu |x_\nu| < c\right) = \lim_{N\to\infty}\prod_{\nu=N}^{\infty}(1 - p_\nu)$$

by the independence of the $x_\nu$. Hence $\prod_{\nu=1}^{\infty}(1 - p_\nu)$ and $\sum_{\nu=1}^{\infty} p_\nu$ converge.

The convergence of the other two series follows from Theorem 50.

Conversely, suppose that the three series converge, so that, by Theorem 50, $\sum x'_\nu$ converges p.p. But it follows from the convergence of $\sum p_\nu$ that $x_\nu = x'_\nu$ for sufficiently large $\nu$ and almost all sequences, and therefore $\sum x_\nu$ also converges p.p.
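The three series are easy to exhibit for the classical random-signs series $\sum \pm 1/\nu$. A simulation sketch (not from the notes): with $c = 1$ the truncation changes nothing, the tail probabilities and means vanish, and $\sum \sigma_\nu'^2 = \sum 1/\nu^2$ converges, so the series converges p.p.; its tail oscillation is tiny.

```python
import random

random.seed(5)

def tail_range(start, end):
    # Oscillation (max - min) of the partial sums of +-1/nu over [start, end).
    s, lo, hi = 0.0, 0.0, 0.0
    for nu in range(start, end):
        s += random.choice((-1.0, 1.0)) / nu
        lo, hi = min(lo, s), max(hi, s)
    return hi - lo

early = tail_range(1, 1000)
late = tail_range(100000, 101000)
print(early, late)
```

The late tail moves by less than $\sum_{\nu=100000}^{101000} 1/\nu \approx 0.01$, while the early partial sums swing by an amount of order 1, in line with almost-sure convergence.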


Theorem 52. If $x_\nu$ are independent, $s_n = \sum_{\nu=1}^{n} x_\nu$ converges p.p. if and only if $\prod_{\nu=1}^{\infty}\beta_\nu(t)$ converges to a characteristic function.

We do not give the proof. For the proof, see J.L. Doob, Stochastic Processes, pages 115, 116. It would seem natural to ask whether there is a direct proof of Theorem 52 involving some relationship between $T_N$ in Theorem 48 and the functions $\beta_\nu(t)$. This might simplify the whole theory.
Stated differently, Theorem 52 reads as follows:

Theorem 53. If $x_\nu$ are independent and the distribution functions of $s_n$ converge to a distribution function, then $s_n$ converges p.p.

This is a converse of Theorem 47.
Theorem 54 (The strong law of large numbers). If $x_\nu$ are independent, with zero means and standard deviations $\sigma_\nu$, and if

$$\sum_{\nu=1}^{\infty}\frac{\sigma_\nu^2}{\nu^2} < \infty,$$

then

$$\lim_{n\to\infty}\frac{1}{n}\sum_{\nu=1}^{n} x_\nu = 0 \ \text{p.p.}$$

Proof. Let $y_\nu = \dfrac{x_\nu}{\nu}$, so that $y_\nu$ has standard deviation $\sigma_\nu/\nu$. It follows then from Theorem 49 that $\sum y_\nu = \sum (x_\nu/\nu)$ converges p.p.

If we write

$$X_\nu = \sum_{j=1}^{\nu} x_j/j,$$

then $x_\nu/\nu = X_\nu - X_{\nu-1}$, and

$$\frac{1}{n}\sum_{\nu=1}^{n} x_\nu = \frac{1}{n}\sum_{\nu=1}^{n}\nu(X_\nu - X_{\nu-1}) = X_n - \frac{1}{n}\sum_{\nu=1}^{n-1} X_\nu = o(1)$$

if $X_n$ converges, by the consistency of $(C, 1)$ summation.
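A simulation sketch (not from the notes) of Theorem 54 for $\pm 1$ coin flips, where $\sigma_\nu = 1$ and $\sum \sigma_\nu^2/\nu^2$ converges: the running mean drifts toward 0 along a single seeded trajectory.

```python
import random

random.seed(6)
s = 0.0
running = {}
for n in range(1, 200001):
    s += random.choice((-1.0, 1.0))
    if n in (1000, 200000):
        running[n] = abs(s) / n   # |(1/n) * sum of x_nu|

print(running)
```

By $n = 200000$ the absolute running mean is on the order of $n^{-1/2}$, far below any fixed tolerance.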

If the series $\sum x_\nu$ does not converge p.p., it is possible to get results about the order of magnitude of the partial sums $s_n = \sum_{\nu=1}^{n} x_\nu$. The basic result is the famous law of the iterated logarithm of Khintchine.

Theorem 55 (Khintchine: Law of the iterated logarithm). Let $x_\nu$ be independent, with zero means and standard deviations $\sigma_\nu$. Let

$$B_n = \sum_{\nu=1}^{n}\sigma_\nu^2 \to \infty \quad\text{as } n \to \infty.$$

Then

$$\limsup_{n\to\infty}\frac{|s_n|}{(2B_n\log\log B_n)^{1/2}} = 1 \ \text{p.p.}$$

Corollary. If the $x_\nu$ have moreover the same distribution, with $\sigma_\nu = \sigma$, then

$$\limsup_{n\to\infty}\frac{|s_n|}{(2n\log\log n)^{1/2}} = \sigma \ \text{p.p.}$$

We do not prove this here. For the proof, see M. Loève: Probability Theory, page 260, or A. Khintchine: Asymptotische Gesetze der Wahrscheinlichkeitsrechnung, page 59.

14. Markoff Processes


A random sequence defines a discrete Markoff process if the behaviour of the sequence $x_\nu$ for $\nu \geq n$ depends only on $x_n$ (see page 123). It is called a Markoff chain if the number of possible values (or states) of $x_\nu$ is finite or countable. The states can be described by the transition probabilities ${}_n p_{ij}$, defined as the probability that a sequence for which $x_n = i$ will have $x_{n+1} = j$. Obviously

$${}_n p_{ij} \geq 0, \qquad \sum_j {}_n p_{ij} = 1.$$

If ${}_n p_{ij}$ is independent of $n$, we say that the transition probabilities are stationary, and the matrix $P = (p_{ij})$ is called a stochastic matrix.

14.. Markoff Processes

130

111

follows that a stationary Markoff chain must have stationary transition


probabilities, but the converse is not necessarily true.
It is often useful to consider one-sided chains, say for $n \ge 1$; the behaviour of the chain then depends on the initial state or the initial probability distribution of $x_1$.
The theory of Markoff chains with a finite number of states can be treated completely (see for example J.L. Doob, Stochastic Processes, page 172). In the case of stationary transition probabilities, the matrix $(p^n_{ij})$ defined by
\[ p^1_{ij} = p_{ij}, \qquad p^{n+1}_{ij} = \sum_{k} p^n_{ik} p_{kj} \]
satisfies $p^n_{ij} \ge 0$, $\sum_j p^n_{ij} = 1$ and gives the probability that a sequence with $x_1 = i$ will have $x_{n+1} = j$. The main problem is to determine the asymptotic behaviour of $p^n_{ij}$. The basic theorem is

Theorem 56 (For the proof see J.L. Doob, Stochastic Processes, page 175).
\[ \lim_{n\to\infty} \frac{1}{n} \sum_{m=1}^{n} p^m_{ij} = q_{ij} \]
where $Q = (q_{ij})$ is a stochastic matrix and $QP = PQ = Q$, $Q^2 = Q$.
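The Cesàro limit in Theorem 56 is easy to verify numerically for a small chain. The sketch below (an added illustration; the particular $3\times 3$ matrix is chosen here, not taken from the notes) averages the powers $P^m$ and checks that the limit $Q$ is stochastic with $QP = Q$ and $Q^2 = Q$, up to the $O(1/n)$ error of the finite average.

```python
# Numerical check of Theorem 56 for a small stochastic matrix
# (illustrative example; the matrix is chosen here, not from the notes).
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]

# Cesaro average (1/n) sum_{m=1}^{n} P^m, which should converge to Q.
n = 2000
Pm = [row[:] for row in P]          # current power P^m, starting at m = 1
S = [[0.0] * 3 for _ in range(3)]   # running average
for _ in range(n):
    for i in range(3):
        for j in range(3):
            S[i][j] += Pm[i][j] / n
    Pm = matmul(Pm, P)

Q = S
print([round(sum(row), 6) for row in Q])  # each row of Q should sum to 1
```

For this chain the powers $P^m$ themselves converge, so the Cesàro average converges all the more, with error of order $1/n$.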
The general behaviour of $p^n_{ij}$ can be described by dividing the states into transient states and disjoint ergodic sets of states. Almost all sequences have only a finite number of terms in any one of the transient states, and almost all sequences for which $x_n$ lies in an ergodic set have all their subsequent terms in the same set.

A random function of a continuous variable $t$ is called a Markoff process if, for $t_1 < t_2 < \cdots < t_n < t$ and intervals (or Borel sets) $I_1, I_2, \ldots, I_n$, we have
\[ P(x(t) \in I \mid x(t_1) \in I_1,\, x(t_2) \in I_2,\, \ldots,\, x(t_n) \in I_n) = P(x(t) \in I \mid x(t_n) \in I_n). \]

Part of the theory is analogous to that of Markoff chains, but the theory is less complete and satisfactory.


15. $L^2$-Processes
Theorem 57. If $r(s,t)$ is the autocorrelation function of an $L^2$ function, then
\[ r(s,t) = \overline{r(t,s)}, \]
and if $(z_i)$ is any finite set of complex numbers, then
\[ \sum_{i,j} r(t_i, t_j)\, z_i \bar{z}_j \ge 0. \]

The first part is trivial and the second part follows from the identity
\[ \sum_{i,j} r(t_i, t_j)\, z_i \bar{z}_j = E\Big( \Big| \sum_i (x(t_i) - m(t_i)) z_i \Big|^2 \Big) \ge 0. \]
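The non-negativity in Theorem 57 can be checked numerically for a concrete covariance. The sketch below (added for illustration) uses $r(s,t) = \min(s,t)$, the covariance of the Wiener function met later in §17, with arbitrarily chosen complex coefficients $z_i$.

```python
import random

# Check the quadratic-form inequality of Theorem 57 for the covariance
# r(s, t) = min(s, t) (the Wiener covariance, used here as an example)
# with randomly chosen complex coefficients z_i.
random.seed(2)
ts = [0.5, 1.0, 2.0, 3.5]
zs = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in ts]

q = sum(min(s, t) * zi * zj.conjugate()
        for s, zi in zip(ts, zs) for t, zj in zip(ts, zs))
print(q)  # real part non-negative, imaginary part ~0 by Hermitian symmetry
```

The imaginary part cancels (up to rounding) because $r(s,t) = \overline{r(t,s)}$, exactly as the first part of the theorem asserts.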

Theorem 58 (For proof see J.L. Doob, Stochastic Processes, page 72). If $m(t)$, $r(s,t)$ are given and $r(s,t)$ satisfies the conclusions of Theorem 57, then there is a unique Gaussian function $x(t)$ for which
\[ E(x(t)) = m(t), \qquad E(x(s)\overline{x(t)}) - m(s)\overline{m(t)} = r(s,t). \]

The uniqueness follows from the fact that a Gaussian process is determined by its first and second order moments $m(t)$, $r(s,t)$. Hence, if we are concerned only with properties depending on $r(s,t)$ and $m(t)$, we may suppose that all our processes are Gaussian.
Theorem 59. In order that a centred $L^2$-process should be orthogonal, it is necessary and sufficient that
\[ E(|x(t) - x(s)|^2) = F(t) - F(s) \qquad (s < t) \tag{1} \]
where $F(t)$ is a non-decreasing function. In particular, if $x(t)$ is stationary $L^2$, then $E(|x(t) - x(s)|^2) = \sigma^2(t - s)$ $(s < t)$ for some constant $\sigma^2$.

Proof. If $s < u < t$, the orthogonality condition implies that
\[ E(|x(u) - x(s)|^2) + E(|x(t) - x(u)|^2) = E(|x(t) - x(s)|^2), \]
which is sufficient to prove (1). The converse is trivial.
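A Monte Carlo sketch (added here; the Gaussian steps are an illustrative choice) shows both parts of Theorem 59 at once for a process built from independent centred increments: the increments over disjoint intervals are orthogonal, and $E(|x(t) - x(s)|^2) = t - s$, i.e. $F(t) = t$.

```python
import random

# Monte Carlo illustration of Theorem 59 for a process with independent
# (hence orthogonal) centred increments: Gaussian steps of variance equal
# to the interval length, so that E|x(t) - x(s)|^2 = t - s, i.e. F(t) = t.
random.seed(3)
N = 20_000
second_moment = 0.0
cross = 0.0
for _ in range(N):
    d1 = random.gauss(0.0, 0.5 ** 0.5)   # x(0.5) - x(0),   variance 0.5
    d2 = random.gauss(0.0, 0.5 ** 0.5)   # x(1.0) - x(0.5), variance 0.5
    second_moment += (d1 + d2) ** 2 / N  # estimates E|x(1) - x(0)|^2 ~ 1
    cross += d1 * d2 / N                 # orthogonality: should be ~ 0
print(round(second_moment, 2), round(cross, 2))
```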


We write
\[ dF = E(|dx|^2) \]
and, for stationary functions,
\[ \sigma^2\, dt = E(|dx|^2). \]
We say that an $L^2$-function is continuous at $t$ if
\[ \lim_{h\to 0} E(|x(t+h) - x(t)|^2) = 0, \]
and that it is continuous if it is continuous for all $t$. Note that this does not imply that the individual functions $x(t)$ are continuous at $t$.

Theorem 60 (Slutsky). In order that $x(t)$ be continuous at $t$, it is necessary and sufficient that $r(s,t)$ be continuous at $s = t$. It is continuous (for all $t$) if $r(s,t)$ is continuous on the line $s = t$, and then $r(s,t)$ is continuous in the whole plane.

Proof. The first part follows from the relations
\[ E(|x(t+h) - x(t)|^2) = r(t+h, t+h) - r(t+h, t) - r(t, t+h) + r(t, t) = o(1) \]
as $h \to 0$ if $r(s,t)$ is continuous at $s = t$, and
\[ r(t+h, t+k) - r(t, t) = E\big(x(t+h)(\overline{x(t+k)} - \overline{x(t)})\big) + E\big((x(t+h) - x(t))\overline{x(t)}\big) = o(1) \]
as $h, k \to 0$, by the Schwarz inequality, if $x(t)$ is continuous at $t$.

For the second part, we have
\[ r(s+h, t+k) - r(s, t) = E\big(x(s+h)(\overline{x(t+k)} - \overline{x(t)})\big) + E\big((x(s+h) - x(s))\overline{x(t)}\big) = o(1) \]
as $h, k \to 0$, by the Schwarz inequality, if $x(t)$ is continuous at $t$ and $s$.





Theorem 61. If $x(t)$ is continuous and stationary $L^2$, with $\rho(h) = r(s, s+h)$, then
\[ \rho(h) = \int_{-\infty}^{\infty} e^{ih\lambda}\, dS(\lambda) \]
where $S(\lambda)$ is non-decreasing and bounded. Moreover,
\[ S(\infty) - S(-\infty) = \rho(0) = E(|x(t)|^2) \quad \text{for all } t. \]

Proof. We have $\rho(-h) = \overline{\rho(h)}$, $\rho(h)$ is continuous at $0$, and
\[ \sum_{i,j} \rho(t_i - t_j)\, z_i \bar{z}_j \ge 0 \]
for all complex $z_i$ by Theorem 57, and the conclusion follows from Bochner's theorem (Loève, Probability Theory, pages 207-209, and Bochner, Harmonic Analysis and Probability, page 58).

The theorem for sequences is similar.

Theorem 62. If $x_n$ is stationary $L^2$, with $\rho_n = E(x_m \bar{x}_{m+n})$, then
\[ \rho_n = \int_{-\pi}^{\pi} e^{in\lambda}\, dS(\lambda) \]
where $S(\lambda)$ increases and $S(\pi) - S(-\pi) = \rho_0 = E(|x_m|^2)$.

We say that an $L^2$-random function $x(t)$ is differentiable at $t$, with derivative $x'(t)$ (a random variable), if
\[ E\left( \left| \frac{x(t+h) - x(t)}{h} - x'(t) \right|^2 \right) \to 0 \quad \text{as } h \to 0. \]

Theorem 63. In order that $x(t)$ be differentiable at $t$, it is necessary and sufficient that $\dfrac{\partial^2 r}{\partial s\, \partial t}$ exists when $t = s$. Moreover, if $x(t)$ is differentiable for all $t$, $\dfrac{\partial^2 r}{\partial s\, \partial t}$ exists on the whole plane.

(The proof is similar to that of Theorem 60.)

Integration of $x(t)$ can be defined along the same lines.


We say that $x(t)$ is $R$-integrable in $a \le t \le b$ if $\sum_i x(t_i)\delta_i$ tends to a limit in $L^2$ for any sequence of subdivisions of $(a, b)$ into intervals of lengths $\delta_i$ containing points $t_i$ respectively. The limit is denoted by $\int_a^b x(t)\, dt$.

Theorem 64. In order that $x(t)$ be $R$-integrable in $a \le t \le b$, it is necessary and sufficient that $\int_a^b \int_a^b r(s,t)\, ds\, dt$ exists as a Riemann integral.

Riemann-Stieltjes integrals can be defined similarly.


The idea of integration with respect to a random function $Z(t)$ is deeper (see e.g. J.L. Doob, Stochastic Processes, Chap. IX, §2). In the important cases, $Z(t)$ is orthogonal, and then it is easy to define the integral
\[ \int_a^b \phi(t)\, dZ(t), \]
the result being a random variable. Similarly,
\[ \int_a^b \phi(s, t)\, dZ(t) \]
will be a random function of $s$ under suitable integrability conditions. The integral of a random function $x(t)$ with respect to a random function $Z(t)$ can also be defined (Doob, Chap. IX, §5).

The most important application is to the spectral representation of a stationary process.
Theorem 65 (Doob, page 527). A continuous stationary ($L^2$) function $x(t)$ can be represented in the form
\[ x(t) = \int_{-\infty}^{\infty} e^{it\lambda}\, dZ(\lambda) \]
where $Z(\lambda)$ has orthogonal increments and
\[ E(|dZ|^2) = dS, \]
where $S(\lambda)$ is the function defined in Theorem 61.


The formula gives the spectral decomposition of $x(t)$; $S(\lambda)$ is its spectral distribution. The corresponding theorem for random sequences is

Theorem 66 (Doob, page 481). A stationary ($L^2$) sequence $x_n$ has the spectral representation
\[ x_n = \int_{-\pi}^{\pi} e^{in\lambda}\, dZ(\lambda) \]
where $Z(\lambda)$ has orthogonal increments and
\[ E(|dZ|^2) = dS, \]
$S(\lambda)$ being defined by Theorem 62.
Two or more random functions $x_i(t)$ are mutually orthogonal if $E(x_i(t)\overline{x_j(s)}) = 0$ for $i \ne j$ and all $s$, $t$.

Theorem 67. Suppose that $x(t)$ is a continuous, stationary ($L^2$) process and that $E_1, E_2, \ldots, E_k$ are measurable, disjoint sets whose union is the whole real line. Then we can write
\[ x(t) = \sum_{i=1}^{k} x_i(t) \]
where the $x_i(t)$ are mutually orthogonal and
\[ x_i(t) = \int_{-\infty}^{\infty} e^{it\lambda}\, dZ_i(\lambda) = \int_{E_i} e^{it\lambda}\, dZ(\lambda), \]
and $E(|dZ_i|^2) = 0$ outside $E_i$.

The theorem for sequences is similar. In each case, a particularly important decomposition is that in which three sets $E_i$ are defined by the Lebesgue decomposition of $S(\lambda)$ into absolutely continuous, discontinuous and singular components. For the second component, the autocorrelation function $\rho(h)$ has the form
\[ \rho(h) = \sum_{i} d_i e^{ih\lambda_i} \]

where the $d_i$ are the jumps of $S(\lambda)$ at the discontinuities $\lambda_i$, and is uniformly almost periodic.
We can define linear operations on stationary functions (Doob, page 534). In particular, if $k(s)$ is of bounded variation in $(-\infty, \infty)$, the random function
\[ y(t) = \int_{-\infty}^{\infty} x(t-s)\, dk(s) \]
can be defined, and it is easy to show that $y(t)$ has the spectral representation
\[ y(t) = \int_{-\infty}^{\infty} e^{it\lambda} K(\lambda)\, dZ(\lambda) = \int_{-\infty}^{\infty} e^{it\lambda}\, dZ_1(\lambda), \]
where
\[ K(\lambda) = \int_{-\infty}^{\infty} e^{-is\lambda}\, dk(s), \qquad E(|dZ_1(\lambda)|^2) = |K(\lambda)|^2\, dS(\lambda). \]

If $k(s) = 0$ for $s < \tau$, $\tau > 0$, we have
\[ y(t+\tau) = \int_{0}^{\infty} x(t-s)\, dk(s+\tau), \]
which depends only on the part of the function $x(t)$ before time $t$.

The linear prediction problem (Wiener) is to determine $k(s)$ so as to minimise (in some sense) the difference between $y(t)$ and $x(t)$. In so far as this difference can be made small, we can regard $y(t+\tau)$ as a prediction of the value of $x$ at time $t+\tau$ based on our knowledge of its behaviour before $t$.
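The relation $E(|dZ_1|^2) = |K(\lambda)|^2\, dS(\lambda)$ has an exact discrete analogue that is easy to verify. The sketch below (added for illustration; the two-term filter is our choice) takes $x_n$ to be orthonormal white noise, so $dS = d\lambda/2\pi$, filters it by $y_n = \sum_j k_j x_{n-j}$ with $K(\lambda) = \sum_j k_j e^{-ij\lambda}$, and compares the autocovariances of $y$ computed directly with those recovered from the spectral density $|K(\lambda)|^2/2\pi$.

```python
import math

# Discrete analogue of the filtering formula above (illustrative choices):
# y_n = k_0 x_n + k_1 x_{n-1} with orthonormal white noise x_n.
k = [1.0, 0.6]                      # filter weights k_0, k_1
rho0_direct = k[0] ** 2 + k[1] ** 2 # rho_y(0) = sum_j k_j^2
rho1_direct = k[0] * k[1]           # rho_y(1) = k_0 k_1

def rho_from_spectrum(n, steps=4096):
    # integrate e^{i n l} |K(l)|^2 / (2 pi) over (-pi, pi] (midpoint rule)
    width = 2 * math.pi / steps
    total = 0.0
    for m in range(steps):
        l = -math.pi + (m + 0.5) * width
        K = k[0] + k[1] * complex(math.cos(l), -math.sin(l))  # e^{-il}
        total += math.cos(n * l) * abs(K) ** 2 / (2 * math.pi)
    return total * width

print(round(rho_from_spectrum(0), 4), rho0_direct)
print(round(rho_from_spectrum(1), 4), rho1_direct)
```

The two routes agree, exhibiting in miniature how a linear filter multiplies the spectral distribution by $|K(\lambda)|^2$.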

16. Ergodic Properties


We state first the two basic forms of the ergodic theorem.
Theorem 68 (G.D. Birkhoff, 1932). Suppose that for $\tau \ge 0$, $T_\tau$ is a measure preserving (1-1) mapping of a measure space $\Omega$ of measure 1 onto itself and that $T_0 = I$, $T_{\tau+\tau'} = T_\tau T_{\tau'}$. Suppose that $f(\omega) \in L(\Omega)$ and that $f(T_\tau \omega)$ is a measurable function of $(\omega, \tau)$ in the product space $\Omega \times R$. Then
\[ f^*(\omega) = \lim_{\tau\to\infty} \frac{1}{\tau} \int_0^\tau f(T_t \omega)\, dt \]
exists for almost all $\omega$, $f^*(\omega) \in L(\Omega)$ and
\[ \int_\Omega f^*(\omega)\, d\omega = \int_\Omega f(\omega)\, d\omega. \]

Moreover, if $\Omega$ has no subspace of measure $> 0$ and $< 1$ invariant under all $T_\tau$,
\[ f^*(\omega) = \int_\Omega f(\omega)\, d\omega \quad \text{for almost all } \omega. \]

There is a corresponding discrete ergodic theorem for transformations $T_n = (T_1)^n$, where $n$ is an integer, the conclusion then being that
\[ f^*(\omega) = \lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^{N} f(T_n \omega) \]
exists for almost all $\omega$. In this case, however, the measurability condition on $f(T_\tau \omega)$ may be dispensed with.
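The discrete theorem can be illustrated with the classical irrational rotation of the unit interval (our choice of example, not from the notes): $T\omega = \omega + a \pmod 1$ with $a$ irrational has no invariant subset of intermediate measure, so the time average of $f(\omega) = \cos 2\pi\omega$ along almost every orbit tends to the space average $\int_0^1 f = 0$.

```python
import math

# Illustration of the discrete ergodic theorem for the irrational rotation
# T w = w + a (mod 1) and f(w) = cos(2 pi w); the space average of f is 0.
a = math.sqrt(2) - 1   # an irrational rotation number
w = 0.3                # an arbitrary starting point
N = 100_000
avg = sum(math.cos(2 * math.pi * ((w + n * a) % 1))
          for n in range(1, N + 1)) / N
print(abs(avg))        # should be very small
```

For a rotation the partial sums of $\cos 2\pi(\omega + na)$ are in fact uniformly bounded, so the average decays like $1/N$, much faster than the theorem requires.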
Theorem 69 (von Neumann). Suppose that the conditions of Theorem 68 hold and that $f(\omega) \in L^2(\Omega)$. Then
\[ \int_\Omega \left| \frac{1}{\tau} \int_0^\tau f(T_t \omega)\, dt - f^*(\omega) \right|^2 d\omega \to 0 \quad \text{as } \tau \to \infty. \]

For proofs see Doob, pages 465, 515, or P.R. Halmos, Lectures on Ergodic Theory, The Math. Soc. of Japan, pages 16, 18. The simplest proof is due to F. Riesz (Comm. Math. Helv. 17 (1945), 221-239). Theorem 68 is much deeper than Theorem 69.

The applications to random functions are as follows.


Theorem 70. Suppose that $x(t)$ is a strictly stationary random function and that $x(\omega, t) \in L(\Omega)$ for each $t$, with $\int_\Omega x(\omega, t)\, d\omega = E(x(t)) = m$. Then
\[ \lim_{\tau\to\infty} \frac{1}{\tau} \int_0^\tau x(\omega, t)\, dt = x^*(\omega) \]
exists for almost all $\omega$. If $x(t)$ is an $L^2$-function, we have also convergence in mean.

This follows at once from Theorems 68 and 69 if we define
\[ f(\omega) = x(\omega, 0), \qquad T_\tau(x(t)) = x(t+\tau). \]

Corollary. If $a$ is real,
\[ \lim_{\tau\to\infty} \frac{1}{\tau} \int_0^\tau x(\omega, t)\, e^{iat}\, dt = x^*(\omega, a) \]
exists for almost all $\omega$.


Theorem 70 is a form of the strong law of large numbers for random functions. There is an analogue for sequences.

A particularly important case arises if the translation operation $x(t) \to x(t+\tau)$ has no invariant subset whose measure is $> 0$ and $< 1$. In this case we have
\[ \lim_{\tau\to\infty} \frac{1}{\tau} \int_0^\tau x(\omega, t)\, dt = \int_\Omega x(\omega, t)\, d\omega = m \]
for almost all $\omega$. In other words, almost all functions have limiting time averages equal to the mean of the values of the function at any fixed time.

17. Random function with independent increments


The basic condition is that if $t_1 < t_2 < t_3$, then $x(t_2) - x(t_1)$ and $x(t_3) - x(t_2)$ are independent (see page 114), so that the distribution function of $x(t_3) - x(t_1)$ is the convolution of those of $x(t_2) - x(t_1)$ and $x(t_3) - x(t_2)$. We are generally interested only in the increments, and it is convenient to consider the behaviour of the function from some base point, say $0$, and modify the function by subtracting a random variable so as to make $x(0) = 0$: $x(t) \to x(t) - x(0)$. Then if $F_{t_1, t_2}$ is the distribution function of the increment $x(t_2) - x(t_1)$, we have
\[ F_{t_1, t_3}(x) = F_{t_1, t_2} * F_{t_2, t_3}(x). \]

We get the stationary case if $F_{t_1, t_2}(x)$ depends only on $t_2 - t_1$. (This by itself is not enough, but together with independence, the condition is sufficient for stationarity.) If we put
\[ F_t(x) = F_{0, t}(x), \]
we have in this case
\[ F_{t_1+t_2}(x) = F_{t_1} * F_{t_2}(x) \]
for all $t_1, t_2 > 0$.
If $x(t)$ is also an $L^2$ function with $x(0) = 0$, it follows that
\[ E(|x(t_1+t_2)|^2) = E(|x(t_1)|^2) + E(|x(t_2)|^2), \]
so that
\[ E(|x(t)|^2) = t\sigma^2, \]
where $\sigma^2 = E(|x(1)|^2)$.

Theorem 71. If $x(t)$ is stationary with independent increments, its distribution function $F_t(x)$ is infinitely divisible and its characteristic function $\phi_t(u)$ has the form $e^{t\psi(u)}$, where
\[ \psi(u) = iau + \int_{-\infty}^{\infty} \left( e^{iux} - 1 - \frac{iux}{1+x^2} \right) \frac{1+x^2}{x^2}\, dG(x), \]
$G(x)$ being non-decreasing and bounded.

Proof. The distribution function is obviously infinitely divisible for every $t$, and it follows from the stationary property that
\[ \phi_{t_1+t_2}(u) = \phi_{t_1}(u)\, \phi_{t_2}(u), \]
so that $\phi_t(u) = e^{t\psi(u)}$ for some $\psi(u)$, which must have the K-L form, as is seen by putting $t = 1$ and using Theorem 37.

Conversely, we have also the
Theorem 72. Any function $\phi_t(u)$ of this form is the characteristic function of a stationary random function with independent increments.

Proof. We observe that the conditions on $F_t(x)$ give us a system of joint distributions over finite sets of points $t_i$ which is consistent in the sense of Theorem 44, and the random function defined by the Kolmogoroff measure in Theorem 44 has the required properties.

Example 1 (Brownian motion: Wiener). The increments all have normal distributions, so that $F_t(x)$ has density
\[ F_t'(x) = \frac{1}{\sqrt{2\pi t}}\, e^{-x^2/2t}. \]

Example 2 (Poisson). The increments $x(s+t) - x(s)$ have integral values $\nu \ge 0$ with probabilities
\[ e^{-ct}\, \frac{(ct)^\nu}{\nu!}. \]

Both are $L^2$ processes.
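For the Poisson example, the form $\phi_t(u) = e^{t\psi(u)}$ of Theorem 71 specialises to $\psi(u) = c(e^{iu} - 1)$ (here keeping only the jump part of the K-L formula; no drift term). The sketch below (added for illustration, with parameters chosen by us) checks this against the characteristic function computed directly from the probabilities $e^{-ct}(ct)^\nu/\nu!$, and checks the semigroup property $\phi_{t_1+t_2} = \phi_{t_1}\phi_{t_2}$.

```python
import cmath
import math

# For the Poisson example, phi_t(u) = exp(c t (e^{iu} - 1)); compare with
# the characteristic function summed directly from the probabilities
# e^{-ct}(ct)^nu / nu!, and check phi_{t1+t2} = phi_{t1} phi_{t2}.
def phi_direct(t, u, c=2.0, terms=60):
    total = 0.0 + 0.0j
    p = math.exp(-c * t)          # probability of nu = 0
    for nu in range(terms):       # the truncated tail is negligible here
        total += p * cmath.exp(1j * u * nu)
        p *= c * t / (nu + 1)     # next Poisson probability
    return total

def phi_formula(t, u, c=2.0):
    return cmath.exp(c * t * (cmath.exp(1j * u) - 1))

u, t1, t2 = 0.7, 0.4, 1.1
print(abs(phi_direct(t1 + t2, u) - phi_formula(t1 + t2, u)))
print(abs(phi_formula(t1 + t2, u) - phi_formula(t1, u) * phi_formula(t2, u)))
```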
Theorem 73. Almost all functions $x(t)$ defined by the Kolmogoroff measure defined by the Wiener function, or any extension of it, are everywhere non-differentiable. In fact, almost all functions fail to satisfy a Lipschitz condition of order $\alpha$ ($x(t+h) - x(t) = O(|h|^\alpha)$) if $\alpha > \frac{1}{2}$, and are not of bounded variation.
Theorem 74. The Kolmogoroff measure defined by the Wiener function can be extended so that almost all functions $x(t)$ satisfy a Lipschitz condition of any order $\alpha < \frac{1}{2}$ at every point, and are therefore continuous at every point.
For proofs, see Doob, pages 392-396 and for the notion of extension,
pages 50-71.
Theorem 75. The K-measure defined by the Poisson function can be extended so that almost all functions $x(t)$ are step functions with a finite number of positive integral values in any finite interval.

The probability that $x(t)$ will be constant in an interval of length $t$ is $e^{-ct}$.
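The constancy probability $e^{-ct}$ can be checked by simulation: the Poisson function is constant on an interval of length $t$ exactly when no jump occurs there, i.e. when a Poisson count of mean $ct$ is zero. The sketch below (an added Monte Carlo illustration; parameters are our choice) samples the count via exponential waiting times.

```python
import math
import random

# Monte Carlo check of the constancy probability e^{-ct} for the Poisson
# function: x is constant on an interval of length t iff no jump occurs
# there, i.e. a Poisson(ct) count is zero.
random.seed(4)
c, t, trials = 1.0, 0.5, 20_000

def poisson_count(mean):
    # count unit-rate exponential waiting times accumulated up to `mean`
    total, n = 0.0, 0
    while True:
        total += random.expovariate(1.0)
        if total > mean:
            return n
        n += 1

constant = sum(1 for _ in range(trials) if poisson_count(c * t) == 0)
print(constant / trials, math.exp(-c * t))
```

The empirical proportion should agree with $e^{-ct} \approx 0.607$ to within sampling error.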


18. Doob Separability and extension theory

The K-measure is usually not extensive enough to give probabilities to important properties of the functions $x(t)$, e.g. continuity etc.

Doob's solution is to show that a certain subset $\Omega_0$ of $\Omega$ has outer K-measure 1. Then, if $X_1$ is any K-measurable set, Doob defines
\[ \mu(X) = \mu(X_1) \quad \text{when } X = \Omega_0 \cdot X_1, \]
and shows that $\mu$ is completely additive and defines a probability measure in a Borel system containing $\Omega_0$, with $\mu(\Omega_0) = 1$.
Doob now defines a quasi-separable K-measure as one for which there is a subset $\Omega_0$ of outer K-measure 1 and a countable set $R_0$ of real numbers with the property that
\[ \sup_{t \in I} x(t) = \sup_{t \in I \cdot R_0} x(t), \qquad \inf_{t \in I} x(t) = \inf_{t \in I \cdot R_0} x(t) \tag{$*$} \]
for every $x(t)$ in $\Omega_0$ and every open interval $I$.


If the K-measure has this property, it can be extended to a measure so that almost all functions $x(t)$ have the property ($*$).

All conditions of continuity, differentiability and related concepts can then be expressed in terms of the countable set $R_0$, and the sets of functions having the corresponding property then become measurable. Thus, in the proof of Theorem 74 we have only to show that the set of functions having the required property (of continuity or Lipschitz condition) has outer measure 1 with respect to the basic Wiener measure.

For Theorem 73, there is no need to extend the measure, for if the set of functions $x(t)$ which are differentiable at least at one point has measure zero with respect to Wiener measure, it has measure zero with respect to any extension of Wiener measure.
For a fuller account, see Doob, Probability in Function Space, Bull. Amer. Math. Soc. 53 (1947), 15-30.
