Lectures on
Measure Theory and Probability
by
H.R. Pitt
Notes by
Raghavan Narasimhan
Contents

1 Measure Theory
1. Sets and operations on sets
2. Sequence of sets
3. Additive system of sets
4. Set Functions
5. Continuity of set functions
6. Extensions and contractions of set functions
7. Outer Measure
8. Classical Lebesgue and Stieltjes measures
9. Borel sets and Borel measure
10. Measurable functions
11. The Lebesgue integral
12. Absolute Continuity
13. Convergence theorems
14. The Riemann Integral
15. Stieltjes Integrals
16. L-Spaces
17. Mappings of measures
18. Differentiation
19. Product Measures and Multiple Integrals

2 Probability
1. Definitions
2. Function of a random variable
3. Parameters of random variables
Chapter 1
Measure Theory
1. Sets and operations on sets
We consider a space X of elements (or points) x and systems of its subsets X, Y, . . .. The basic relations between sets and the operations on them are defined as follows:

(a) Inclusion: We write X ⊂ Y (or Y ⊃ X) if every point of X is contained in Y. Plainly, if ∅ is the empty set, then ∅ ⊂ X for every subset X. Moreover, X ⊂ X, and X ⊂ Y, Y ⊂ Z imply X ⊂ Z. X = Y if X ⊂ Y and Y ⊂ X.

(b) Complements: The complement X′ of X is the set of points of the space which do not belong to X. Then plainly (X′)′ = X, and X′ = Y if Y′ = X. In particular, the complement of ∅ is the whole space, and conversely. Moreover, if X ⊂ Y, then Y′ ⊂ X′.

(c) Union: The union of any system of sets is the set of points x which belong to at least one of them. The system need not be finite or even countable. The union of two sets X and Y is written X ∪ Y, and obviously X ∪ Y = Y ∪ X. The union of a finite or countable sequence of sets X_1, X_2, . . . can be written ∪_{n=1}^∞ X_n.
2. Sequence of sets

For any sequence of sets X_1, X_2, . . . we define

lim sup X_n = ∩_{n=1}^∞ ∪_{m=n}^∞ X_m,  lim inf X_n = ∪_{n=1}^∞ ∩_{m=n}^∞ X_m,

and lim X_n exists, equal to their common value, when these two sets are equal. In particular, if X_n is an increasing sequence, then ∩_{m=n}^∞ X_m = X_n, lim sup X_n = lim inf X_n, and

lim X_n = ∪_{n=1}^∞ X_n,

and similarly, if X_n is a decreasing sequence,

lim X_n = ∩_{n=1}^∞ X_n.
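A finite truncation can make these definitions concrete; the example sets below are my own, not the text's.

```python
# lim sup / lim inf of a sequence of sets, approximated over finitely many indices.

def lim_sup(sets, n_max):
    # points belonging to X_m for infinitely many m: intersect the tail unions
    return set.intersection(*[set.union(*sets[n:]) for n in range(n_max)])

def lim_inf(sets, n_max):
    # points belonging to every X_m from some index on: unite the tail intersections
    return set.union(*[set.intersection(*sets[n:]) for n in range(n_max)])

# X_n alternates between {1, 2} and {2, 3}: the point 2 belongs to every X_n,
# while 1 and 3 each occur for infinitely many n but also drop out infinitely often.
X = [{1, 2} if n % 2 == 0 else {2, 3} for n in range(10)]
print(lim_sup(X, 5))  # {1, 2, 3}
print(lim_inf(X, 5))  # {2}
```

Here lim sup collects the points that recur infinitely often and lim inf only those eventually always present, so lim X_n does not exist for this sequence.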
4. Set Functions

Functions can be defined on a system of sets to take values in any given space. If the space is an abelian group with the group operation called addition, one can define the additivity of the set function. Thus, if φ is defined on an additive system of sets, φ is additive if

φ(Σ X_n) = Σ φ(X_n)

for any finite system of disjoint sets X_n.

In general we shall be concerned only with functions which take real values. We use the convention that the value −∞ is excluded, but that φ may take the value +∞. It is obvious that φ(∅) = 0 if φ(X) is additive and finite for at least one X.

For a simple example of an additive set function we may take φ(X) to be the volume of X when X is an elementary figure in R^n.

If the additive property extends to countable systems of sets, the function is called completely additive, and again we suppose that φ(X) ≠ −∞. Complete additivity of φ can be defined even if the field of sets X is only finitely additive, provided that the X_n and Σ X_n belong to it.

Examples of completely additive functions:

1. φ(X) = number of elements (finite or infinite) in X, for all subsets X of the space.

2. X is the interval 0 ≤ x < 1 and φ(X) is the sum of the lengths of finite sums of open or closed intervals with closure in X. These sets
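The additivity of the counting measure in the first example can be checked mechanically; the sets chosen below are mine.

```python
# The counting measure phi(X) = number of elements of X is additive
# (indeed completely additive) on disjoint sets.

def phi(X):
    return len(X)  # counting measure on finite sets

A, B = {0, 1, 2}, {5, 6}               # two disjoint subsets
assert A.isdisjoint(B)
assert phi(A | B) == phi(A) + phi(B)   # additivity: phi(A + B) = phi(A) + phi(B)
print(phi(A | B))  # 5
```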
(b) Conversely, an additive function is completely additive if it is either continuous from below or finite and continuous at ∅. (The system of sets on which φ is defined need only be finitely additive.)

Proof. (a) If X_n ↑ X, we write

X = X_1 + (X_2 − X_1) + (X_3 − X_2) + · · · ,

φ(X) = φ(X_1) + Σ_{n=2}^∞ φ(X_n − X_{n−1}) = lim_{N→∞} φ(X_N).

If X_n ↓ X, then, for any n_0,

φ(X_{n_0}) = φ(X) + Σ_{n=n_0}^∞ φ(X_n − X_{n+1}),

and φ(X_n) → φ(X).

(b) If φ is continuous from below and Y = Y_1 + Y_2 + Y_3 + · · · with disjoint Y_n, we write

Y = lim_{N→∞} Σ_{n=1}^N Y_n,

and, since Σ_{n=1}^N Y_n ↑ Y,

φ(Y) = lim_{N→∞} φ(Σ_{n=1}^N Y_n) = lim_{N→∞} Σ_{n=1}^N φ(Y_n) = Σ_{n=1}^∞ φ(Y_n).

If φ is finite and continuous at ∅ and X = Σ_{n=1}^∞ X_n, we write

φ(X) = Σ_{n=1}^N φ(X_n) + φ(Σ_{n=N+1}^∞ X_n), by finite additivity,

and the last term tends to 0 as N → ∞, since Σ_{n=N+1}^∞ X_n ↓ ∅.
Corollary 1. The upper and lower bounds M, m of φ(X) in S are attained for the sets X^+, X^− respectively, and m > −∞. Moreover, M < ∞ if φ(X) is finite for all X. In particular, a finite measure is bounded.

Corollary 2. If we write

φ^+(X) = φ(X X^+), φ^−(X) = φ(X X^−),

we have

φ(X) = φ^+(X) + φ^−(X), φ^+(X) ≥ 0, φ^−(X) ≤ 0.

In the proof, we choose sets A_n so that φ(A_n) → m and let A = ∪_{n=1}^∞ A_n. For every n, we write

A = A_k + (A − A_k), A = ∩_{k=1}^n [A_k + (A − A_k)], . . .

Let X^− = lim_n ∩_{k=n}^∞ B_k. Then

φ(X^−) ≤ lim φ(A_n) = m,
6. Extensions and contractions of set functions

We get a contraction of an additive (or completely additive) function defined on a system of sets by considering only its values on an additive subsystem. More important, we get an extension by embedding the system of sets in a larger system and defining a set function on the new system so that it takes the same values as before on the old system.

The basic problem in measure theory is to prove the existence of a measure with respect to which certain assigned sets are measurable and have assigned measures. The classical problem of defining a measure on the real line with respect to which every interval is measurable with measure equal to its length was solved by Borel and Lebesgue. We prove Kolmogoroff's theorem (due to Carathéodory in the case of R^n) about conditions under which an additive function on a finitely additive system S_0 can be extended to a measure in a Borel system containing S_0.

Theorem 4. (a) If φ(I) is non-negative and additive on an additive system S_0, and if I_n are disjoint sets of S_0 with I = Σ_{n=1}^∞ I_n also in S_0, then

Σ_{n=1}^∞ φ(I_n) ≤ φ(I).

(c) φ(I) ≤ Σ_{n=1}^∞ φ(I_n) whether the I_n are disjoint or not, provided that I ⊂ ∪_{n=1}^∞ I_n.

Proof. (a) Since Σ_{n=1}^N I_n and I − Σ_{n=1}^N I_n belong to S_0,

φ(I) = Σ_{n=1}^N φ(I_n) + φ(I − Σ_{n=1}^N I_n) ≥ Σ_{n=1}^N φ(I_n)

for every N, and the conclusion follows on letting N → ∞.

(c) We write

∪_{n=1}^∞ I_n = I_1 + I_2 I_1′ + I_3 I_1′ I_2′ + · · ·

and then

φ(I) ≤ φ(∪_{n=1}^∞ I_n) = φ(I_1) + φ(I_2 I_1′) + · · · ≤ φ(I_1) + φ(I_2) + · · · .
7. Outer Measure

We define the outer measure of a set X with respect to a completely additive non-negative φ(I) defined on an additive system S_0 to be inf Σ φ(I_n) for all sequences {I_n} of sets of S_0 which cover X (that is, X ⊂ ∪_{n=1}^∞ I_n). Since any I of S_0 covers itself, its outer measure does not exceed φ(I). On the other hand, it follows from Theorem 4(c) that

φ(I) ≤ Σ_{n=1}^∞ φ(I_n)

for every sequence (I_n) covering I, and the inequality remains true if the right hand side is replaced by its lower bound, which is the outer measure of I.
Theorem 5. If X ⊂ ∪_{n=1}^∞ X_n, then

μ*(X) ≤ Σ_{n=1}^∞ μ*(X_n).

Proof. Let ε > 0. For each n we can choose sets I_{n,ν} of S_0 so that

X_n ⊂ ∪_{ν=1}^∞ I_{n,ν}, Σ_{ν=1}^∞ φ(I_{n,ν}) ≤ μ*(X_n) + ε/2^n.

Then X ⊂ ∪_{n,ν=1}^∞ I_{n,ν} and

μ*(X) ≤ Σ_{n,ν=1}^∞ φ(I_{n,ν}) ≤ Σ_{n=1}^∞ (μ*(X_n) + ε/2^n) ≤ Σ_{n=1}^∞ μ*(X_n) + ε,

and the conclusion follows since ε is arbitrary.
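The ε/2^n device in this proof is worth seeing numerically: giving the n-th set of a countable cover an error budget of ε/2^n keeps the total below ε. The point set and budget below are my own choices.

```python
# Covering a countable set of points by intervals of total length < eps,
# the n-th point getting the budget eps/2^(n+1).

eps = 0.01
points = [1 / (k + 1) for k in range(30)]  # a countable set, truncated

cover = [(x - eps / 2 ** (n + 2), x + eps / 2 ** (n + 2))
         for n, x in enumerate(points)]

total_length = sum(b - a for a, b in cover)
assert all(a < x < b for (a, b), x in zip(cover, points))  # every point is covered
assert total_length < eps                                  # total length stays below eps
print(total_length)
```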
Proof. If P is any set with μ*(P) < ∞, and ε > 0, we can define I_n in S_0 so that

P ⊂ ∪_{n=1}^∞ I_n, Σ_{n=1}^∞ φ(I_n) ≤ μ*(P) + ε.

Then

P I ⊂ ∪_{n=1}^∞ I I_n, P − P I ⊂ ∪_{n=1}^∞ (I_n − I I_n),

so that

μ*(P I) ≤ Σ_{n=1}^∞ φ(I I_n), μ*(P − P I) ≤ Σ_{n=1}^∞ φ(I_n − I I_n),

and

μ*(P I) + μ*(P − P I) ≤ Σ_{n=1}^∞ φ(I_n) ≤ μ*(P) + ε.
Then, since

(P X_1 − P X_1 X_2) + (P X_2 − P X_1 X_2) + (P − P(X_1 ∪ X_2)) = P − P X_1 X_2,

the measurability of X_1 and X_2 implies that of X_1 X_2 and X_1 ∪ X_2. To complete the proof, therefore, it is sufficient to prove that X = ∪_{n=1}^∞ X_n is measurable when the X_n are disjoint and measurable. Since Σ_{n=1}^N X_n is measurable,

μ*(P) = μ*(P Σ_{n=1}^N X_n) + μ*(P − P Σ_{n=1}^N X_n) ≥ Σ_{n=1}^N μ*(P X_n) + μ*(P − P X),

and, letting N → ∞,

μ*(P) ≥ Σ_{n=1}^∞ μ*(P X_n) + μ*(P − P X) ≥ μ*(P X) + μ*(P − P X),

by Theorem 5, and therefore X is measurable.
Definition. A measure is said to be complete if every subset of a measurable set of zero measure is also measurable (and therefore has measure zero).

Theorem 8. The measure defined by Theorem 7 is complete.

Proof. If X is a subset of a measurable set of measure 0, then μ*(X) = 0, μ*(P X) = 0, and

μ*(P) ≤ μ*(P X) + μ*(P − P X) = μ*(P − P X) ≤ μ*(P),

μ*(P) = μ*(P − P X) = μ*(P − P X) + μ*(P X),

and so X is measurable.

The measure defined in Theorem 7 is not generally the minimal measure generated by φ, and the minimal measure is generally not complete. However, any measure can be completed by adding to the system of measurable sets X the sets X ∪ N, where N is a subset of a set of measure zero, and defining μ(X ∪ N) = μ(X). This is consistent with the original definition and gives us a measure, since countable unions of sets X ∪ N are sets of the same form, the complement

(X ∪ N)′ = X′ N′ = X′(Y′ ∪ (Y − N)) (where N ⊂ Y, Y being measurable and of measure 0) = X_1 ∪ N_1

is of the same form, and μ is clearly completely additive on this extended system.
so that if X_n ↑ X, then μ(X_n) → μ(X), if X_n ↓ ∅, then μ(X_n) → 0, and if X = Σ X_n, then μ(X) = Σ μ(X_n).

φ(I_n − H_n) < ε/2^n.

It follows that φ(H_n) = φ(I_n) − φ(I_n − H_n) > δ/2, so that H_n, and therefore its closure H̄_n, contains at least one point. But the intersection of a decreasing sequence of non-empty closed sets (H̄_n) is non-empty, and therefore the H_n, and hence the I_n, have a common point, which is impossible since I_n ↓ ∅.

The measure now defined by Theorem 7 is Lebesgue measure.
Given ε > 0, we can choose a sequence (I_n) of S_0 covering X with

Σ_{n=1}^∞ φ(I_n) ≤ μ*(X) + ε/2,

and then open sets Q_n ⊃ I_n with

φ(Q_n) ≤ φ(I_n) + ε/2^{n+2}.

Then

X ⊂ Q = ∪_{n=1}^∞ Q_n, μ(Q) ≤ Σ_{n=1}^∞ φ(Q_n) ≤ Σ_{n=1}^∞ φ(I_n) + ε/2 ≤ μ*(X) + ε.

Next, writing

X = ∪_{n=1}^∞ X_n, X_n ⊂ Q_n, ∪_{n=1}^∞ Q_n = G,

we have

G − X ⊂ ∪_{n=1}^∞ (Q_n − X_n), μ(G − X) ≤ Σ_{n=1}^∞ μ(Q_n − X_n) ≤ Σ_{n=1}^∞ ε/2^{n+1} = ε.

Writing

A = ∪_{n=1}^∞ F_n, B = ∩_{n=1}^∞ G_n,

we see that

A ⊂ X ⊂ B, μ(B − A) ≤ μ(G_n − F_n) ≤ ε_n for all n,

and so

μ(B − A) = 0,

while A, B are obviously Borel sets.

Conversely, if μ*(P) < ∞ and

F ⊂ X ⊂ G,

we have, since a closed set is measurable,

μ*(P) = μ*(P F) + μ*(P − P F)
≥ μ*(P X) + μ*(P − P X) − μ(G − F)
≥ μ*(P X) + μ*(P − P X) − ε,

true for every ε > 0, and therefore X is measurable.
the union being over all rationals r. This is a countable union of measurable sets, so that f + g is measurable. Similarly, f − g is measurable. Finally,

f(x) g(x) = (1/4)( f(x) + g(x))² − (1/4)( f(x) − g(x))²,

so that f g is measurable.
∪_{N=1}^∞ ∩_{n=N}^∞ [ f_n(x) < k].

Proof. The Baire functions form the smallest class which contains the continuous functions and is closed under limit operations. Since the class of measurable functions is closed under limit operations, it is sufficient to prove that a continuous function of a measurable function is measurable. Then if Φ(u) is continuous and f(x) measurable, [Φ( f(x)) > k] is the set of x for which f(x) lies in an open set, namely the open set of points u for which Φ(u) > k. Since an open set is a countable union of open intervals, this set is measurable, thus proving the theorem.
Theorem 18 (Egoroff). If μ(X) < ∞ and f_n(x) → f(x) p.p. in X, and if ε > 0, then we can find a subset X_ε of X such that μ(X − X_ε) < ε and f_n(x) → f(x) uniformly in X_ε.

We write p.p. for almost everywhere, that is, everywhere except for a set of measure zero.

Proof. We may plainly neglect the set of zero measure in which f_n(x) does not converge to a finite limit. Let

X_{N,ν} = [| f(x) − f_n(x)| < 1/ν for all n ≥ N].

Then, for fixed ν,

X_{N,ν} ↑ X as N → ∞.

For each ν we choose N = N(ν) so that X_ν = X_{N(ν),ν} satisfies

μ(X − X_ν) < ε/2^ν,

and let

X_ε = ∩_{ν=1}^∞ X_ν.

Then

μ(X − X_ε) ≤ Σ_{ν=1}^∞ μ(X − X_ν) < ε,

and f_n(x) → f(x) uniformly in X_ε.
S = S{y_ν} = Σ_{ν=1}^∞ y_ν μ(E_ν).

Corollary. Since S_k is the integral of the function taking constant values y_k in the sets E_k, it follows, by leaving out suitable remainders Σ_{ν=V+1}^∞ y_ν μ(E_ν), that F(X) is the limit of the integrals of simple functions, a simple function being a function taking constant values on each of a finite number of measurable sets whose union is X.

Proof. If A < F(X), we can choose a subdivision {y_ν} so that, if E_ν are the corresponding sets and S the corresponding sum,

Σ_{ν=1}^∞ y_ν μ(E_ν) > A,

and then V so that

Σ_{ν=1}^V y_ν μ(E_ν) > A,

the difference S − S_V being at most Σ_{ν=V+1}^∞ y_ν μ(E_ν).
We define

f^+(x) = f(x) when f(x) ≥ 0, f^+(x) = 0 when f(x) < 0,
f^−(x) = f(x) when f(x) ≤ 0, f^−(x) = 0 when f(x) > 0,

and

∫_X f(x) dμ = ∫_X f^+(x) dμ + ∫_X f^−(x) dμ

when both the integrals on the right are finite, so that f(x) is integrable if and only if | f(x)| is integrable.

In general, we use the integral sign only when the integrand is integrable in this absolute sense. The only exception to this rule is that we may sometimes write ∫_X f(x) dμ = ∞ when f(x) ≥ r(x) and r(x) is integrable.
Suppose that X = Σ_{n=1}^∞ X_n, where the X_n are measurable and disjoint. Then, if {y_ν} is a subdivision,

E_ν = Σ_{n=1}^∞ E_ν X_n, μ(E_ν) = Σ_{n=1}^∞ μ(E_ν X_n),

and

S = Σ_{ν=1}^∞ y_ν μ(E_ν) = Σ_{ν=1}^∞ Σ_{n=1}^∞ y_ν μ(E_ν X_n) = Σ_{n=1}^∞ S_n,

where S_n is the sum for f(x) over X_n. Since S and S_n (which are ≥ 0) tend to F(X) and F(X_n) respectively as the maximum interval of subdivision tends to 0, we get

F(X) = ∫_X f(x) dμ = Σ_{n=1}^∞ F(X_n).
Theorem 21. If a is a constant,

∫_X a f(x) dμ = a ∫_X f(x) dμ.

Proof. We may again suppose that f(x) ≥ 0 and that a > 0. If we use the subdivision {y_ν} for f(x) and {a y_ν} for a f(x), the sets E_ν are the same in each case, and the proof is trivial.

Theorem 22. If A ≤ f(x) ≤ B in X, then

A μ(X) ≤ F(X) ≤ B μ(X).

Theorem 23. If f(x) ≤ g(x) in X, then

∫_X f(x) dμ ≤ ∫_X g(x) dμ.
the set

[ f(x) > 0] = ∪_{n=0}^∞ [1/(n+1) ≤ f(x) < 1/n]

has positive measure, and hence so has at least one subset E_n = [1/(n+1) ≤ f(x) < 1/n]. Then

∫_X f(x) dμ ≥ ∫_{E_n} f(x) dμ ≥ μ(E_n)/(n+1) > 0,

which is impossible.

Corollary 1. If ∫_X f(x) dμ = 0 for all X ⊂ X, f(x) not necessarily of the same sign, then f(x) = 0 p.p.
Moreover, it is clear from the proof of Theorem 3 that a set function F(X) is absolutely continuous if and only if its components F + (X),
F (X) are both absolutely continuous. An absolutely continuous point
function F(x) can be expressed as the difference of two absolutely continuous non-decreasing functions as we see by applying the method
used on page 22 to decompose a function of bounded variation into two
monotonic functions. We observe that the concept of absolute continuity
does not involve any topological assumptions on X.
Theorem 25. If f(x) is integrable on X, then

F(X) = ∫_X f(x) dμ

is absolutely continuous.

Proof. We may suppose that f(x) ≥ 0. If ε > 0, we choose a subdivision {y_ν} so that

Σ_{ν=1}^∞ y_ν μ(E_ν) > F(X) − ε/4,

and then V so that

Σ_{ν=1}^V y_ν μ(E_ν) > F(X) − ε/2.

Then, for any measurable A,

Σ_{ν=1}^V F(X E_ν − E_ν A) ≥ Σ_{ν=1}^V y_ν μ(E_ν − E_ν A) > F(X) − ε/2 − y_V μ(A),

and therefore F(A) ≤ ε/2 + y_V μ(A) < ε whenever μ(A) < ε/(2 y_V), which is the absolute continuity of F. Moreover,

Σ_{n=1}^∞ F(X_n) = Σ_n Σ_{ν=1}^∞ F(X_n E_ν) ≥ Σ_{ν=1}^∞ F(E_ν) ≥ Σ_{ν=1}^∞ y_ν μ(E_ν) > F(X) − ε.
Theorem 28. If f(x) and g(x) are integrable on X, so is f(x) + g(x), and

∫_X [ f(x) + g(x)] dμ = ∫_X f(x) dμ + ∫_X g(x) dμ.

Proof. Since

∫_X | f(x) + g(x)| dμ ≤ 2 ∫_{| f |<|g|} |g(x)| dμ + 2 ∫_{| f |≥|g|} | f(x)| dμ ≤ 2 ∫_X | f(x)| dμ + 2 ∫_X |g(x)| dμ,

f(x) + g(x) is integrable. If f and g are non-negative and {y_ν} is a subdivision with sets E_ν for g(x), then

∫_X [ f(x) + g(x)] dμ = Σ_{ν=0}^∞ ∫_{E_ν} [ f(x) + g(x)] dμ
≤ Σ_{ν=0}^∞ ∫_{E_ν} f(x) dμ + Σ_{ν=0}^∞ y_{ν+1} μ(E_ν)
≤ ∫_X f(x) dμ + (1+ε) ∫_X g(x) dμ + ε μ(X),

while

∫_X [ f(x) + g(x)] dμ ≥ ∫_X f(x) dμ + ∫_X g(x) dμ

follows similarly, and the result follows on letting ε → 0.
∫_X f(x) dμ ≤ lim inf ∫_X f_n(x) dμ.

In particular, the conclusion holds if μ(X) < ∞ and the f_n(x) are uniformly bounded.

Theorem 32 (Monotone convergence theorem). If φ(x) is integrable on X, f_n(x) ≥ φ(x), and f_n(x) is an increasing sequence for each x, with limit f(x), then

lim ∫_X f_n(x) dμ = ∫_X f(x) dμ,

in the sense that if either side is finite, then so is the other and the two values are the same, and if one side is +∞, so is the other.
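A numerical sketch (integrand and truncations mine) of the monotone convergence theorem: the truncations f_n = min( f, n) increase to f(x) = x^(−1/2) on (0, 1), and their integrals increase to ∫_0^1 f = 2.

```python
# Monotone convergence illustrated with f(x) = x**-0.5 on (0, 1).

def integral(g, a=0.0, b=1.0, steps=100000):
    # midpoint rule; crude but adequate for this illustration
    h = (b - a) / steps
    return sum(g(a + (k + 0.5) * h) for k in range(steps)) * h

f = lambda x: x ** -0.5
vals = [integral(lambda x, n=n: min(f(x), n)) for n in (1, 2, 4, 8, 16)]
assert all(u <= v for u, v in zip(vals, vals[1:]))  # the integrals increase
print(vals)  # approaches 2 (the n = 16 integral is 2 - 1/16 exactly)
```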
If ∫_X f(x) dμ = ∞, it follows from the definition of the integral that, given A > 0, we can define φ(x) ∈ L(X) so that

∫_X φ(x) dμ ≥ A, 0 ≤ φ(x) ≤ f(x).

If u_n(x) ≥ 0 and u(x) = Σ_{n=1}^∞ u_n(x), then

∫_X u(x) dμ = Σ_{n=1}^∞ ∫_X u_n(x) dμ,
(d/dy) ∫_a^b f(x, y) dx = ∫_a^b (∂f/∂y) dx,

provided that

|( f(x, y + h) − f(x, y))/h| ≤ φ(x) ∈ L(a, b).

This theorem follows from the analogue of Theorem 31 with n replaced by a continuous variable h. The proof is similar.
φ_n(x) ≤ f(x) ≤ ψ_n(x), ∫_a^b (ψ_n(x) − φ_n(x)) dx → 0 as n → ∞.

Since lim φ_n(x) = lim ψ_n(x) = f(x) p.p., it is clear that f(x) is L-integrable and that its L-integral satisfies

∫_a^b f(x) dx = lim ∫_a^b φ_n(x) dx = lim ∫_a^b ψ_n(x) dx,

and the common value of these is the R-integral. The following is the main theorem.

Theorem 35. A bounded function in (a, b) is R-integrable if and only if it is continuous p.p.

Lemma. If f(x) is R-integrable and ε > 0, we can define δ > 0 and a measurable set E_δ in (a, b) so that

μ(E_δ) > b − a − ε, ∫_a^b (Ψ(x) − Φ(x)) dx ≤ ε²/2,
≤ ε/2 + ε/2 = ε,

and similarly f(x + h) − f(x) ≥ −ε, as we require.

Proof of Theorem 35. If f(x) is R-integrable, let

ε > 0, ε_n > 0, Σ_{n=1}^∞ ε_n < ε, |h| ≤ δ_n, δ_n = δ_n(ε_n) > 0.

Let E = ∩_{n=1}^∞ E_n, so that

μ(E) ≥ b − a − Σ_{n=1}^∞ ε_n > b − a − ε.
when both integrals on the right are finite, and since μ^+ and −μ^− are measures, all our theorems apply to the two integrals separately and therefore to their sum, with the exception of Theorems 22, 23, 24, 30, 32, in which the sign of μ obviously plays a part. The basic inequality which takes the place of Theorem 22 is

Theorem 36. If μ = μ^+ + μ^− in accordance with Theorem 3, and μ̄ = μ^+ − μ^−, then

|∫_X f(x) dμ| ≤ ∫_X | f(x)| dμ̄ = ∫_X | f(x)| |dμ|.

Proof.

∫_X f(x) dμ = ∫_X f(x) dμ^+ + ∫_X f(x) dμ^− = ∫_X f(x) dμ^+ − ∫_X f(x) d(−μ^−),

so that

|∫_X f(x) dμ| ≤ ∫_X | f(x)| dμ^+ + ∫_X | f(x)| d(−μ^−) = ∫_X | f(x)| dμ̄ = ∫_X | f(x)| |dμ|.
will be obvious when theorems do not depend on the sign of μ, and these can be extended immediately to the general case. When we deal with inequalities, it is generally essential to restrict μ to the positive case (or replace it by μ̄).

Integrals with μ taking positive and negative values are usually called Stieltjes integrals. If they are integrals of functions f(x) of a real variable x with respect to a μ defined by a function φ(x) of bounded variation, we write

∫ f(x) dφ(x) for ∫ f(x) dμ, ∫_a^b f(x) dφ(x),

and the inequality takes the form

|∫_X f(x) dφ(x)| ≤ ∫_X | f(x)| |dφ(x)|.
16. L-Spaces

A set L of elements f, g, . . . is a linear space over the field R of real numbers (and similarly over any field) if

(1) L is an abelian group with operation denoted by +,
(2) α f is defined and belongs to L for any α of R and f of L,
(3) (α + β) f = α f + β f,
(4) α( f + g) = α f + α g,
(5) α(β f ) = (αβ) f,
(6) 1 · f = f.

A linear space is a topological linear space if

(1) L is a topological group under addition,
(2) scalar multiplication by α in R is continuous in this topology.

L is a metric linear space if its topology is defined by a metric. It is a Banach space if

(1) L is a metric linear space in which the metric is defined by d( f, g) = ‖ f − g‖, where the norm ‖ f ‖ is defined as a real number for all f of L and has the properties

‖ f ‖ = 0 if and only if f = 0, ‖ f ‖ ≥ 0 always,
‖α f ‖ = |α| ‖ f ‖, ‖ f + g‖ ≤ ‖ f ‖ + ‖g‖,

and

(2) L is complete. That is, if a sequence f_n has the property that ‖ f_n − f_m‖ → 0 as m, n → ∞, then there is a limit f in L for which ‖ f_n − f ‖ → 0.

A Banach space L is called a Hilbert space if an inner product ( f, g) is defined for every f, g of L as a complex number and

(1) ( f, g) is a linear functional in f and in g,
(2) ( f, g) is the complex conjugate of (g, f ),
(3) ( f, f ) = ‖ f ‖².
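The norm axioms above are easy to test mechanically on a concrete space; the l¹ norm on triples used below is my example, not the text's.

```python
# Checking the norm axioms for the l^1 norm on sequences of length 3.

def norm1(f):
    return sum(abs(x) for x in f)

f = [1.0, -2.0, 0.5]
g = [0.25, 1.0, -1.0]
alpha = -3.0

assert norm1([alpha * x for x in f]) == abs(alpha) * norm1(f)       # homogeneity
assert norm1([x + y for x, y in zip(f, g)]) <= norm1(f) + norm1(g)  # triangle inequality
assert norm1([0.0, 0.0, 0.0]) == 0.0                                # norm of 0 is 0
print(norm1(f))  # 3.5
```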
∫_X | f(x) g(x)| dμ ≤ (∫_X | f(x)|^p dμ)^{1/p} (∫_X |g(x)|^{p′} dμ)^{1/p′},

and, when p′ = ∞, the second factor is replaced by the essential upper bound of g(x), that is, the smallest number M for which |g(x)| ≤ M p.p. Moreover,

‖ f + g‖ ≤ ‖ f ‖ + ‖g‖.

In the proof of completeness, sets E_k are defined so that

μ(X E_k) ≤ (ε_k/A_k)^p, μ(X − ∩_k E_k) → 0 as K → ∞, since Σ (ε_k/A_k)^p < ∞.

The norm is

‖ f ‖ = (∫ | f(x)|^p dμ)^{1/p},

and, in the case p = 2, the inner product is

( f, g) = ∫ f(x) ḡ(x) dμ.
for all measurable X. Then we can find a sequence {φ_n(x)} in Φ for which

∫_X φ_n(x) dμ → sup_Φ ∫_X φ(x) dμ = H(X) < ∞.

If we define φ̄_n(x) = sup_{k≤n} φ_k(x) and observe that

X = ∪_{k=1}^n X_k,

where φ̄_n(x) = φ_k(x) on X_k, while

(1) ∫_X f(x) dμ = sup_Φ ∫_X φ(x) dμ < ∞.

Now let

Q_n(X) = Q(X) − μ(X)/n,

so that

Q(X) ≥ μ(X)/n + ∫_X f(x) dμ = ∫_X ( f(x) + 1/n) dμ if X ⊂ X_n^+,

and if

f̄(x) = f(x) + 1/n for x ∈ X_n^+, f̄(x) = f(x) for x outside X_n^+,

with X^+ = ∪_{n=1}^∞ X_n^+.
where F_1(X), F_2(X) are integrals and Q_1(X), Q_2(X) vanish on all sets disjoint from two sets of measure zero, whose union X_s also has measure zero. Then

F_1(X) = F_1(X − X X_s), F_2(X) = F_2(X − X X_s),
F_1(X) − F_2(X) = Q_2(X X_s) − Q_1(X X_s) = 0.

Then, it follows from Theorem 27 that we may suppose that μ(X) < ∞. If ε > 0, we consider subdivisions {y_ν} for which

y_1 ≤ ε, y_{ν+1} ≤ (1 + ε) y_ν (ν ≥ 1),

so that

s = Σ_{ν=1}^∞ y_ν μ(E_ν) ≤ ∫_X f(x) dμ ≤ Σ_{ν=0}^∞ y_{ν+1} μ(E_ν),

and

∫_X f(x) φ(x) dμ = Σ_{ν=0}^∞ ∫_{E_ν} f(x) φ(x) dμ
In particular,

∫_A f(x) dx = ∫ f(φ(t)) dφ(t).

18. Differentiation

It has been shown in Theorem 44 that any completely additive and absolutely continuous finite set function can be expressed as the integral of an integrable function defined uniquely up to a set of measure zero, called its Radon derivative. This derivative does not depend upon any topological properties of the space X. On the other hand, the derivative of a function of a real variable is defined, classically, as a limit in the topology of R. An obvious problem is to determine the relationship between Radon derivatives and those defined by other means. We consider here only the case X = R, where the theory is familiar (but not easy). We need some preliminary results about derivatives of a function F(x) in the classical sense.
Definition. The upper and lower, right and left derivatives of F(x) at x are defined, respectively, by

D^+ F = lim sup_{h→+0} (F(x + h) − F(x))/h,
D_+ F = lim inf_{h→+0} (F(x + h) − F(x))/h,
D^− F = lim sup_{h→−0} (F(x + h) − F(x))/h,
D_− F = lim inf_{h→−0} (F(x + h) − F(x))/h.

Plainly D_+ F ≤ D^+ F, D_− F ≤ D^− F. If D^+ F = D_+ F or D^− F = D_− F, we say that F(x) is differentiable on the right or on the left, respectively, and the common values are called the right or left derivatives F_+′, F_−′. If all four derivatives are equal, we say that F(x) is differentiable with derivative F′(x) equal to the common value of these derivatives.
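A numerical sketch (function and step sizes mine) of these definitions: for F(x) = |x| at x = 0, the right derivative exists and equals +1, the left equals −1, so F is differentiable on the right and on the left but not differentiable.

```python
# Approximating the upper and lower one-sided derivatives by difference
# quotients over shrinking steps h = 2**-k.

def dini(F, x, sign, n=40):
    qs = [(F(x + sign * 2.0 ** -k) - F(x)) / (sign * 2.0 ** -k)
          for k in range(10, n)]
    return max(qs), min(qs)   # approximate upper and lower limits

F = abs
D_plus_upper, D_plus_lower = dini(F, 0.0, +1)    # right derivatives
D_minus_upper, D_minus_lower = dini(F, 0.0, -1)  # left derivatives
assert D_plus_upper == D_plus_lower == 1.0       # differentiable on the right
assert D_minus_upper == D_minus_lower == -1.0    # and on the left, F'_+ != F'_-
print(D_plus_upper, D_minus_upper)
```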
Theorem 48. The set of points at which F_+′ and F_−′ both exist but are different is countable.

Proof. It is enough to prove that the set E of points x at which F_−′(x) < F_+′(x) is countable. Let r_1, r_2, . . . be the sequence of all rational numbers arranged in some definite order. If x ∈ E, let k = k(x) be the smallest integer for which

F_−′(x) < r_k < F_+′(x).

Now let m, n be the smallest integers for which r_m < x, r_n > x and

(F(ξ) − F(x))/(ξ − x) < r_k for r_m < ξ < x,
(F(ξ) − F(x))/(ξ − x) > r_k for x < ξ < r_n.

Every x defines the triple (k, m, n) uniquely, and two numbers x_1 < x_2 cannot have the same triple (k, m, n) associated with them. For if they did, we should have

r_m < x_1 < x_2 < r_n

and therefore

(F(x_1) − F(x_2))/(x_1 − x_2) < r_k from the first inequality (applied at x_2 with ξ = x_1),

while

(F(x_1) − F(x_2))/(x_1 − x_2) > r_k from the second (applied at x_1 with ξ = x_2),
Σ_{n=1}^N μ(I_n) ≥ μ*(E) − μ*(E − Σ_{n=1}^N I_n) − ε, Σ_{n=1}^∞ I_n ⊂ G,

and

Σ_{n=N+1}^∞ μ(J_n) = 5 Σ_{n=N+1}^∞ μ(I_n) < ε μ(A),

since Σ_{n=1}^∞ μ(J_n) converges, so that a point ξ of A − A Σ_{n=1}^N I_n would belong to some J_n with n ≥ N + 1 for all N, and this is impossible since ε_n → 0. Hence

μ*(A) = 0 and μ*(E S) = μ*(E),

Σ_{n=1}^N μ(I_n) ≥ μ*(E) − μ*(E − Σ_{n=1}^N I_n) − ε.
Proof. Since

F′(x) = lim_{h→0} (F(x + h) − F(x))/h p.p.,

we have

∫_a^b F′(x) dx ≤ lim inf_{h→0} ∫_a^b (F(x + h) − F(x))/h dx
= lim inf_{h→0} [ (1/h) ∫_b^{b+h} F(x) dx − (1/h) ∫_a^{a+h} F(x) dx ]
≤ lim_{h→0} [F(b + h) − F(a)]
= F(b) − F(a),

since F(x) is increasing.
F(x) = ∫_a^x F′(t) dt + F(a).

∫_a^x F_A′(t) dt = ∫_a^x lim_{h→0} (F_A(t + h) − F_A(t))/h dt
≤ lim sup_{h→0} ∫_a^x (F_A(t + h) − F_A(t))/h dt
= lim sup_{h→0} [ (1/h) ∫_x^{x+h} F_A(t) dt − (1/h) ∫_a^{a+h} F_A(t) dt ]
= F_A(x) − F_A(a) = F_A(x),

since F_A(t) is continuous. Since f(t) ≥ f_A(t), it follows that F′(t) ≥ F_A′(t), and therefore

∫_a^x F′(t) dt ≥ ∫_a^x F_A′(t) dt ≥ F_A(x),

while

∫_a^x F′(t) dt ≤ F(x) = ∫_a^x f(t) dt,

so that

∫_a^x (F′(t) − f(t)) dt = 0.
∫_I F(x) dG(x) = [F(x) G(x)]_I − ∫_I G(x) dF(x),

∫_a^b F(x) G′(x) dx = [F(x) G(x)]_a^b − ∫_a^b F′(x) G(x) dx.

Proof. We may suppose that F(x), G(x) increase on the interval and are non-negative, and define

Φ(I) = ∫_I F(x) dG(x) + ∫_I G(x) dF(x) − [F(x) G(x)]_I,

where F(I), G(I) are the interval functions defined by F(x), G(x), and similarly

Φ(I) ≤ F(I) G(I), so that |Φ(I)| ≤ F(I) G(I).

Now, any interval is the sum of an open interval and one or two end points, and it follows from the additivity of Φ(I) that

|Φ(I)| ≤ F(I) G(I)
for all intervals. Let ε > 0. Then, apart from a finite number of points at which F(x + 0) − F(x − 0) ≥ ε, and at which Φ = 0, we can divide I into a finite number of disjoint intervals I_n on each of which F(I_n) ≤ ε. Then

|Φ(I)| = |Σ Φ(I_n)| ≤ Σ F(I_n) G(I_n) ≤ ε Σ G(I_n) = ε G(I).

The conclusion follows on letting ε → 0.
for some ξ in a ≤ ξ ≤ b.

(ii) If φ(x) ≥ 0 and φ(x) decreases in a ≤ x ≤ b,

∫_a^b f(x) φ(x) dx = φ(a + 0) ∫_a^ξ f(x) dx

for some ξ, a ≤ ξ ≤ b.

Proof. Suppose that φ(x) decreases in (i), so that, if we put F(x) = ∫_a^x f(t) dt, we have

∫_a^b f(x) φ(x) dx = [F(x) φ(x)]_{a+0}^{b−0} − ∫_{a+0}^{b−0} F(x) dφ(x)
= φ(b − 0) ∫_a^b f(x) dx + [φ(a + 0) − φ(b − 0)] F(ξ)

by Theorem 22 and the fact that F(x) is continuous and attains every value between its bounds at some point ξ in a ≤ ξ ≤ b. This establishes (i), and we obtain (ii) by defining φ(b + 0) = 0 and writing

∫_a^b f(x) φ(x) dx = ∫_{a+0}^{b+0} f(x) φ(x) dx = [F(x) φ(x)]_{a+0}^{b+0} − ∫_{a+0}^{b+0} F(x) dφ(x)
= φ(a + 0) F(ξ), with a ≤ ξ ≤ b.

A refinement enables us to assert that a < ξ < b in (i) and that a ≤ ξ < b in (ii).
Σ_{n=1}^∞ X_n × X_n′ = X_0 × X_0′,

where the sets on the left are disjoint. If x is any point of X_0, let J_n(x) be the set of points x′ of X_0′ for which (x, x′) belongs to X_n × X_n′. Then J_n(x) is measurable in X′ for each x, and it has measure μ′(X_n′) when x is in X_n and measure 0 otherwise. Moreover,

∫_{X_0} Σ_{n=1}^N μ′(J_n(x)) dμ = Σ_{n=1}^N μ(X_n) μ′(X_n′),

and

lim_{N→∞} Σ_{n=1}^N μ′(J_n(x)) = μ′(X_0′) for every x of X_0,

and therefore

μ(X_0) μ′(X_0′) = lim_{N→∞} ∫_{X_0} Σ_{n=1}^N μ′(J_n(x)) dμ = Σ_{n=1}^∞ μ(X_n) μ′(X_n′),

that is,

m(X_0 × X_0′) = Σ_{n=1}^∞ m(X_n × X_n′).
rectangular sets. Then, if Q_ν(x), Δ_ν(x) are defined in the same way as Y(x), we have

Y(x) ⊂ Q_ν(x), Q_ν(x) − Y(x) ⊂ Δ_ν(x),

where Δ_ν(x), Q_ν(x) are measurable for almost all x. But

∫_X μ′(Δ_ν(x)) dμ = m(Δ_ν) = 0,

so that

μ′(Y(x)) = μ′(Q_ν(x)) p.p.

Finally, the existence of any one of

∫∫_{X×X′} | f(x, x′)| dm, ∫ dμ ∫ | f(x, x′)| dμ′, ∫ dμ′ ∫ | f(x, x′)| dμ

implies that of the other two, and the existence and equality of the integrals

∫∫_{X×X′} f(x, x′) dm, ∫ dμ ∫ f(x, x′) dμ′, ∫ dμ′ ∫ f(x, x′) dμ.
Chapter 2
Probability
1. Definitions

A measure μ defined in a space X of points x is called a probability measure or a probability distribution if μ(X) = 1. The measurable sets X are called events, and the probability of the event X is the real number μ(X). Two events X_1, X_2 are mutually exclusive if X_1 X_2 = ∅.

The statement "x is a random variable in X with probability distribution μ" means

(i) that a probability distribution μ exists in X,

(ii) that the expression "the probability that x belongs to X", where X is a given event, will be taken to mean μ(X), sometimes written P(x ∈ X).

The basic properties of the probabilities of events follow immediately. They are that these probabilities are real numbers between 0 and 1, inclusive, and that the probability that one of a finite or countable set of mutually exclusive events (X_i) should take place, i.e. the probability of the event ∪ X_i, is equal to the sum of the probabilities of the events X_i.

If a probability measure is defined in some space, it is clearly possible to work with any isomorphic measure in another space. In practice, this can often be taken to be R^k, in which case we speak of a random real vector in place of a random variable. In particular, if k = 1
we speak of a random real number or random real variable. The probability distribution is in this case defined by a distribution function F(x) increasing from 0 to 1 in −∞ < x < ∞. For example, a probability measure in any finite or countable space is isomorphic with a probability measure in R defined by a step function having jumps p_ν at ν = 0, 1, 2, . . ., where

p_ν ≥ 0, Σ p_ν = 1.

Such a distribution is called a discrete probability distribution. If F(x) is absolutely continuous, F′(x) is called the probability density function (or frequency function).
Example 1. Tossing a coin. The space X has only two points H, T, with four subsets, with probabilities given by

μ(∅) = 0, μ(H) = μ(T) = 1/2, μ(X) = μ(H + T) = 1.

If we make H, T correspond respectively with the real numbers 0, 1, we get the random real variable with distribution function

F(x) = 0 (x < 0), F(x) = 1/2 (0 ≤ x < 1), F(x) = 1 (1 ≤ x).

Any two real numbers a, b could be substituted for 0, 1.

Example 2. Throwing a die. The space contains six points, each with probability 1/6 (unloaded die). The natural correspondence with R gives rise to F(x) with equal jumps 1/6 at 1, 2, 3, 4, 5, 6.
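The distribution function of Example 1 can be written out directly, mapping H, T to 0, 1 as in the text:

```python
# Distribution function of the coin example: a step function with a jump of
# 1/2 at 0 and a jump of 1/2 at 1.

def F(x):
    # F(x) = 0 for x < 0, 1/2 for 0 <= x < 1, 1 for x >= 1
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

assert F(-0.5) == 0.0
assert F(0.0) == 0.5 and F(0.99) == 0.5
assert F(1.0) == 1.0
print([F(x) for x in (-1, 0, 0.5, 1, 2)])  # [0.0, 0.5, 0.5, 1.0, 1.0]
```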
G(a + 0) = F(a + 0) − F(−a − 0).

Example 2. If F(x) = 0 for x < 0, and G(y) is the distribution function of y = 1/x, x having d.f. F(x), then

G(y) = 0 for y < 0,

and if a ≥ 0,

G(a + 0) = P(y ≤ a) = P(x ≥ 1/a) = 1 − F(1/a − 0).

Since G(a) is a d.f., G(a + 0) → 1 as a → ∞, so that F must be continuous at 0. That is, P(x = 0) = 0.
p_ν = C(n, ν) p^ν q^{n−ν} (ν = 0, 1, . . . , n), Σ_{ν=0}^n p_ν = 1.

Then

x̄ = E(x) = Σ_{ν=0}^n ν p_ν = n p,

σ² = E(x²) − x̄² = Σ_{ν=0}^n ν² p_ν − n² p² = n p q.

The Poisson distribution has

p_ν = e^{−c} c^ν/ν! (ν = 0, 1, 2, . . .), Σ_{ν=0}^∞ p_ν = 1,

where c > 0. Here

x̄ = e^{−c} Σ_{ν=0}^∞ ν c^ν/ν! = c,

σ² = e^{−c} Σ_{ν=0}^∞ ν² c^ν/ν! − c² = c.
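The values x̄ = np and σ² = npq can be verified numerically; n = 10, p = 0.3 below are my choice of parameters.

```python
# Direct verification of the binomial mean np and variance npq.

from math import comb

n, p = 10, 0.3
q = 1 - p
pmf = [comb(n, nu) * p ** nu * q ** (n - nu) for nu in range(n + 1)]

mean = sum(nu * pmf[nu] for nu in range(n + 1))
var = sum(nu ** 2 * pmf[nu] for nu in range(n + 1)) - mean ** 2

assert abs(sum(pmf) - 1) < 1e-12   # the probabilities sum to 1
assert abs(mean - n * p) < 1e-9    # np = 3
assert abs(var - n * p * q) < 1e-9 # npq = 2.1
print(mean, var)
```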
= 1 for b ≤ x,

with

x̄ = (a + b)/2, σ² = (b − a)²/12.

The normal distribution has density

F′(x) = (1/(σ√(2π))) e^{−(x−x̄)²/2σ²}.

It is easy to verify that the mean and standard deviation are respectively x̄ and σ.

(v) The singular distribution: Here x = a has probability 1,

F(x) = D(x − a) = 1 if x ≥ a, 0 if x < a.
We now prove

Theorem 1 (Tchebycheff's inequality). If φ(x) is a non-negative function of a random variable x and k > 0, then

P(φ(x) ≥ k) ≤ E(φ(x))/k.

Proof.

E(φ(x)) = ∫_X φ(x) dμ = ∫_{φ(x)≥k} φ(x) dμ + ∫_{φ(x)<k} φ(x) dμ
≥ ∫_{φ(x)≥k} φ(x) dμ ≥ k ∫_{φ(x)≥k} dμ = k P(φ(x) ≥ k).
Corollary. If k > 0 and x̄, σ are respectively the mean and the standard deviation of a random real x, then

P(|x − x̄| ≥ kσ) ≤ 1/k².

We merely replace k by k²σ² and φ(x) by (x − x̄)² in Theorem 1.
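An empirical check (sampling setup mine) of the corollary for a uniform random variable: the observed frequency of |x − x̄| ≥ kσ never exceeds 1/k².

```python
# Tchebycheff's inequality tested against 100000 uniform samples.

import random

random.seed(0)
xs = [random.uniform(0, 1) for _ in range(100000)]
mean = sum(xs) / len(xs)
sigma = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5

for k in (1.5, 2.0, 3.0):
    freq = sum(abs(x - mean) >= k * sigma for x in xs) / len(xs)
    assert freq <= 1 / k ** 2   # the bound holds (it is far from sharp here)
print(round(sigma, 3))  # near sqrt(1/12), about 0.289
```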
E(Σ x_i) = Σ ∫ x_i dp = Σ_{i=1}^n E(x_i) (by Theorem 2),

and

σ² = σ_1² + σ_2² + 2 ∫∫_{X_1×X_2} (x_1 − x̄_1)(x_2 − x̄_2) dμ
= σ_1² + σ_2² + 2 ∫_{X_1} (x_1 − x̄_1) dμ_1 ∫_{X_2} (x_2 − x̄_2) dμ_2, by Fubini's theorem,
= σ_1² + σ_2²,

since each of the last two integrals vanishes.
= 0 when x_1 + x_2 > x, so that

F(x) = ∫∫_{x_1+x_2≤x} dF_1(x_1) dF_2(x_2) = ∫ F_1(x − x_2) dF_2(x_2) = ∫ F_1(x − u) dF_2(u),

and, when F_1 has density f_1,

f(x) = F′(x) = ∫ f_1(x − u) dF_2(u) p.p.

Proof. We write

F(x) = ∫ F_1(x − u) dF_2(u) = ∫ dF_2(u) ∫_{−∞}^{x−u} f_1(t) dt
= ∫ dF_2(u) ∫_{−∞}^x f_1(t − u) dt = ∫_{−∞}^x dt ∫ f_1(t − u) dF_2(u),

and so

f(x) = F′(x) = ∫ f_1(x − u) dF_2(u) p.p.
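A discretized sketch (grid mine) of f(x) = ∫ f_1(x − u) dF_2(u): convolving two uniform densities on (0, 1) gives the triangular density on (0, 2).

```python
# Riemann-sum approximation of the convolution density of two uniforms.

N = 1000
du = 1.0 / N

def f1(t):
    return 1.0 if 0 <= t < 1 else 0.0   # uniform density on (0, 1)

def f(x):
    # dF2(u) = du on (0, 1); sum over midpoints of the grid
    return sum(f1(x - (j + 0.5) * du) for j in range(N)) * du

assert abs(f(0.5) - 0.5) < 0.01   # triangular density: f(x) = x on (0, 1)
assert abs(f(1.0) - 1.0) < 0.01   # peak at x = 1
assert abs(f(1.5) - 0.5) < 0.01   # f(x) = 2 - x on (1, 2)
print(f(1.0))
```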
function of y, and

∫ dF_1(y) ∫ φ(x + y) dF_2(x) = ∫ φ(x) dF(x),

where

F(x) = F_1 ∗ F_2(x).

Proof. We may suppose that φ(x) ≥ 0. If we consider first the case in which φ(x) = 1 for a ≤ x < b and φ(x) = 0 elsewhere,

∫ dF_1(y) ∫ φ(x + y) dF_2(x) = ∫ [F_2(b − y − 0) − F_2(a − y − 0)] dF_1(y)
= F(b − 0) − F(a − 0)
= ∫ φ(x) dF(x),
5. Characteristic Functions

A basic tool in modern probability theory is the notion of the characteristic function φ(t) of a distribution function F(x). It is defined by

φ(t) = ∫ e^{itx} dF(x).

In particular,

φ(0) = ∫ dF(x) = 1, |φ(t)| ≤ ∫ |e^{itx}| dF(x) = ∫ dF(x) = 1,

and

φ(−t) = ∫ e^{−itx} dF(x) = the complex conjugate of φ(t).

Moreover, the characteristic function of the convolution F = F_1 ∗ F_2 is

∫ e^{itx} dF(x) = ∫ dF_1(y) ∫ e^{itx} dF_2(x − y) = ∫ e^{ity} dF_1(y) ∫ e^{itu} dF_2(u) = φ_1(t) φ_2(t).
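For the coin distribution of Chapter 2, Example 1, the characteristic function is φ(t) = (1 + e^{it})/2, and the properties above can be checked directly.

```python
# Characteristic function of the coin distribution (values 0 and 1,
# probability 1/2 each): phi(t) = E[e^{itx}] = (1 + e^{it})/2.

import cmath

def phi(t):
    return 0.5 * cmath.exp(1j * t * 0) + 0.5 * cmath.exp(1j * t * 1)

assert phi(0) == 1                       # phi(0) = 1
for t in (-3.0, -1.0, 0.5, 2.0):
    assert abs(phi(t)) <= 1 + 1e-12      # |phi(t)| <= 1
    # |phi(t)|^2 = (1 + cos t)/2 for this distribution
    assert abs(abs(phi(t)) ** 2 - (1 + cmath.cos(t).real) / 2) < 1e-12
print(phi(cmath.pi))  # essentially 0, since (1 + e^{i pi})/2 = 0
```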
Proof.

∫ e^{itx} φ_2(x) dF_1(x) = ∫ e^{itx} dF_1(x) ∫ e^{iux} dF_2(u) = ∫ dF_2(u) ∫ e^{i(t+u)x} dF_1(x) = ∫ φ_1(t + u) dF_2(u).
Theorem 14 (Inversion Formula). If

φ(t) = ∫ e^{itx} dF(x), ∫ |dF(x)| < ∞,

then

(i) F(a + h) − F(a − h) = lim_{A→∞} (1/π) ∫_{−A}^A (sin ht/t) e^{−iat} φ(t) dt

if F(x) is continuous at a ± h, and

(ii) ∫_a^{a+H} F(x) dx − ∫_{a−H}^a F(x) dx = (1/π) ∫ ((1 − cos Ht)/t²) e^{−iat} φ(t) dt.
Corollary. The expression of a function φ(t) as an absolutely convergent Fourier-Stieltjes integral is unique. In particular, a distribution function is defined uniquely by its characteristic function.

Proof.

(1/π) ∫_{−A}^A (sin ht/t) e^{−iat} φ(t) dt = (1/π) ∫ dF(x) ∫_{−A}^A (sin ht/t) e^{it(x−a)} dt.

But

(1/π) ∫_{−A}^A (sin ht/t) e^{it(x−a)} dt = (1/π) ∫_0^{A(x−a+h)} (sin t/t) dt − (1/π) ∫_0^{A(x−a−h)} (sin t/t) dt,

and

(1/π) ∫_0^T (sin t/t) dt → 1/2 as T → ∞.

It follows that

lim_{A→∞} [ (1/π) ∫_0^{A(x−a+h)} (sin t/t) dt − (1/π) ∫_0^{A(x−a−h)} (sin t/t) dt ]
= 0 if x > a + h,
= 1 if a − h < x < a + h,
= 0 if x < a − h,

and since this integral is bounded, it follows from the Lebesgue convergence theorem that

lim_{A→∞} (1/π) ∫_{−A}^A (sin ht/t) e^{−iat} φ(t) dt = ∫_{a−h}^{a+h} dF(x) = F(a + h) − F(a − h).
This is possible since the first inequality is clearly satisfied for large X, and

(∫_{−∞}^{−X} + ∫_X^∞) dF_n(x) = F_n(−X − 0) + 1 − F_n(X + 0)
≤ |F_n(−X − 0) − F(−X − 0)| + |F_n(X + 0) − F(X + 0)|
+ |t| ∫_{−X}^X |F_n(x) − F(x)| dx + O(1) as n → ∞.
Proof. Let (r_m) be the set of rational numbers arranged in a sequence. Then the numbers F_n(r_1) are bounded, and we can select a sequence n_{11}, n_{12}, . . . so that F_{n_{1ν}}(r_1) tends to a limit as ν → ∞, which we denote by F(r_1). The sequence (n_{1ν}) then contains a subsequence (n_{2ν}) so that F_{n_{2ν}}(r_2) → F(r_2), and we define by induction sequences (n_{kν}), (n_{k+1,ν}) being a subsequence of (n_{kν}), so that

F_{n_{kν}}(r_k) → F(r_k) as ν → ∞.

If we then define n_k = n_{kk}, it follows that

F_{n_k}(r_m) → F(r_m) for all m.

Also, F(x) is non-decreasing on the rationals, and it can be defined elsewhere so as to be right continuous and non-decreasing on the reals. The conclusion follows since F(x), F_{n_k}(x) are non-decreasing and every x is a limit of rationals.
Proof of the theorem: We use the lemma to define a bounded non-decreasing function F(x) and a sequence (n_k) so that F_{n_k}(x) → F(x) at every continuity point of F(x).
If we put a = 0 in Theorem 14 (ii), we have

∫_0^{H} F_{n_k}(x) dx − ∫_{−H}^{0} F_{n_k}(x) dx = (1/π) ∫ ((1 − cos Ht)/t²) φ_{n_k}(t) dt,

and since (1 − cos Ht)/t² belongs to L(−∞, ∞), we may let k → ∞ under the integral sign, so that

∫_0^{H} F(x) dx − ∫_{−H}^{0} F(x) dx = (1/π) ∫ ((1 − cos Ht)/t²) φ(t) dt = (H/π) ∫ ((1 − cos t)/t²) φ(t/H) dt.

Now, since φ(t) is bounded in (−∞, ∞) and continuous at 0, the expression on the right, after division by H, tends to φ(0) = lim_{k→∞} φ_{n_k}(0) = 1 as H → ∞, while the left-hand side, divided by H, tends to F(∞) − F(−∞). Hence F(x) is a distribution function. If F*(x) is the limit obtained from any other subsequence, the same argument gives

φ(t) = ∫ e^{itx} dF*(x).

By the corollary to Theorem 13, F(x) = F*(x), and it follows therefore that F_n(x) → F(x) at every continuity point of F(x).
The Poisson distribution has characteristic function

φ(t) = e^{c(e^{it} − 1)}.

(iii) The rectangular distribution, F′(x) = 1/(b − a) for a < x < b and 0 for x < a, x > b, has the characteristic function

φ(t) = (e^{itb} − e^{ita}) / ((b − a) it).

The normal (0, σ²) distribution, with density (1/(σ√(2π))) e^{−x²/2σ²}, has characteristic function φ(t) = e^{−σ²t²/2}, and the singular distribution F(x) = D(x − a), equal to 0 for x < a and to 1 for x ≥ a, has characteristic function e^{iat}.
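As a numerical sanity check on the rectangular case (an illustrative sketch; the grid size and the particular values of t, a, b are arbitrary), one can compare a Riemann sum for (1/(b − a)) ∫_a^b e^{itx} dx with the closed form:

```python
import cmath

def rect_cf_exact(t, a, b):
    # characteristic function of the uniform distribution on (a, b)
    if t == 0:
        return 1.0 + 0j
    return (cmath.exp(1j * t * b) - cmath.exp(1j * t * a)) / ((b - a) * 1j * t)

def rect_cf_numeric(t, a, b, n=100000):
    # midpoint Riemann sum for (1/(b-a)) * integral_a^b e^{itx} dx
    dx = (b - a) / n
    return sum(cmath.exp(1j * t * (a + (k + 0.5) * dx)) for k in range(n)) * dx / (b - a)

t, a, b = 2.5, -1.0, 3.0
exact_v = rect_cf_exact(t, a, b)
numeric_v = rect_cf_numeric(t, a, b)
print(abs(exact_v - numeric_v))
```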
8. Conditional probabilities

If x is a random variable in X and C is a subset of X with positive measure, we define

P(X/C) = μ(XC)/μ(C)

to be the conditional probability that x lies in X, subject to the condition that x belongs to C. It is clear that P(X/C) is a probability measure over all measurable X.

Theorem 21 (Bayes' Theorem). Suppose that X = Σ_{j=1}^{J} C_j, μ(C_j) > 0. Then

P(C_j/X) = P(X/C_j) μ(C_j) / Σ_{i=1}^{J} P(X/C_i) μ(C_i).
P(A/W) = P(W/A) μ(A) / (P(W/A) μ(A) + P(W/B) μ(B)).

Here

P(W/A) = probability of taking a white counter from A = 8/20 = 2/5,
P(W/B) = 4/8 = 1/2.

Hence

P(A/W) = (2/5 · 1/3) / (2/5 · 1/3 + 1/2 · 2/3) = 2/7.
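The arithmetic of this example can be checked mechanically. In the sketch below, the prior weights μ(A) = 1/3 and μ(B) = 2/3 are the values assumed in the reconstruction above; exact rational arithmetic reproduces 2/7.

```python
from fractions import Fraction

def bayes_posterior(likelihoods, priors):
    # P(C_j / X) = P(X/C_j) mu(C_j) / sum_i P(X/C_i) mu(C_i)   (Theorem 21)
    total = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / total for l, p in zip(likelihoods, priors)]

p_w_given = [Fraction(8, 20), Fraction(4, 8)]   # P(W/A), P(W/B)
priors = [Fraction(1, 3), Fraction(2, 3)]       # assumed mu(A), mu(B)
post = bayes_posterior(p_w_given, priors)
print(post[0])  # P(A/W)
```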
μ₁(X₁) = μ(X₁ × X₂).

The set X₁ may reduce to a single point x₁, and the definition remains valid provided that μ₁(x₁) > 0. Usually, however, μ₁(x₁) = 0; even so, the conditional probability with respect to a single point is not difficult to define. It follows from the Radon-Nikodym theorem that, for fixed X₂, μ(X₁ × X₂) has a Radon derivative with respect to μ₁, which we can write R(x₁, X₂), with the property that

μ(X₁ × X₂) = ∫_{X₁} R(x₁, X₂) dμ₁

for all measurable X₁. For each X₂, R(x₁, X₂) is defined for almost all x₁, and plainly 0 ≤ R(x₁, X₂) ≤ 1 p.p. But unfortunately, since the number of measurable sets X₂ is not generally countable, the union of all the exceptional sets may not be a set of measure zero. This means that we cannot assume that, for almost all x₁, R(x₁, X₂) is a measure defined on all measurable sets X₂. If, however, it is, we write it P(X₂/x₁) and call it the conditional probability that x₂ ∈ X₂, subject to the condition that x₁ has a specified value.
Suppose now that (x₁, x₂) is a random variable in the plane with probability density f(x₁, x₂) (i.e. f(x₁, x₂) ≥ 0 and ∬ f(x₁, x₂) dx₁ dx₂ = 1). Then we can define conditional probability densities as follows:

f₁(x₁) = ∫ f(x₁, x₂) dx₂,   P(x₂/x₁) = f(x₁, x₂) / f₁(x₁).

The mean and standard deviation of this conditional distribution are

m(x₁) = ∫ x₂ f(x₁, x₂) dx₂ / ∫ f(x₁, x₂) dx₂,

σ²(x₁) = ∫ (x₂ − m(x₁))² f(x₁, x₂) dx₂ / ∫ f(x₁, x₂) dx₂.
for all possible curves x₂ = g(x₁). If the curves x₂ = g(x₁) are restricted to specified families, we obtain instead the function which minimizes E in that family. For example, the linear regression is the line x₂ = Ax₁ + B for which E((x₂ − Ax₁ − B)²) is least; the nth degree polynomial regression is the polynomial curve of degree n for which the corresponding E is least.
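For a concrete discrete joint distribution, the minimizing line can be computed in closed form: A = cov(x₁, x₂)/var(x₁), B = E(x₂) − A·E(x₁). The sketch below (the particular joint distribution is invented for illustration) verifies that this choice of (A, B) beats nearby perturbations.

```python
# joint distribution of (x1, x2): points with probabilities (illustrative)
pts = [((0.0, 1.0), 0.2), ((1.0, 1.5), 0.3), ((2.0, 3.0), 0.3), ((3.0, 3.5), 0.2)]

def E(f):
    # expectation of f(x1, x2) under the discrete joint distribution
    return sum(p * f(x1, x2) for (x1, x2), p in pts)

m1, m2 = E(lambda a, b: a), E(lambda a, b: b)
var1 = E(lambda a, b: (a - m1) ** 2)
cov = E(lambda a, b: (a - m1) * (b - m2))
A = cov / var1
B = m2 - A * m1

def mse(A_, B_):
    # E((x2 - A x1 - B)^2), the quantity the linear regression minimizes
    return E(lambda a, b: (b - A_ * a - B_) ** 2)

best = mse(A, B)
for dA in (-0.1, 0.0, 0.1):
    for dB in (-0.1, 0.0, 0.1):
        assert mse(A + dA, B + dB) >= best - 1e-12
print(A, B, best)
```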
x_n → x in prob. if

P(|x_n − x| ≥ ε) → 0 as n → ∞

for every ε > 0, P being the joint probability in the product space of x_n and x. In particular, if C is a constant, x_n → C in prob. if P(|x_n − C| ≥ ε) → 0. By Tchebycheff's inequality, x_n → x in prob. if

lim_{n→∞} E(|x_n − x|²) = 0.
Theorem 23. If the x_ν are independent, with means m_ν and standard deviations σ_ν, and

x̄_n = (1/n) Σ_{ν=1}^{n} x_ν,   m̄_n = (1/n) Σ_{ν=1}^{n} m_ν,

then x̄_n − m̄_n → 0 in prob. if

Σ_{ν=1}^{n} σ_ν² = o(n²).

In particular, if m_ν = m for all ν,

x̄_n = (1/n) Σ_{ν=1}^{n} x_ν → m in prob.
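Theorem 23 can be illustrated by simulation (a sketch only; the seed and sample sizes are arbitrary): averages of independent uniform (0, 1) variables, which have m = 1/2 and σ² = 1/12, settle near 1/2.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    # (1/n) * sum of n independent uniform(0,1) variables
    return sum(random.random() for _ in range(n)) / n

means = [sample_mean(10000) for _ in range(20)]
worst = max(abs(m - 0.5) for m in means)
print(worst)  # should be of the order sigma/sqrt(n), i.e. about 0.003
```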
If

φ(t) = ∫ e^{itx} dF(x),   m = ∫ x dF(x),

we have

φ(t) − 1 − mit = ∫ (e^{itx} − 1 − itx) dF(x).

Now (e^{itx} − 1 − itx)/t is majorised by a multiple of |x| and → 0 as t → 0 for each x. Hence, by Lebesgue's theorem,

φ(t) = 1 + mit + o(t) as t → 0.
In particular, the characteristic function of x̄_n is (φ(t/n))^n → e^{imt}, so that x̄_n → m in prob.; and, for the binomial distribution with p = c/n,

[1 + (c/n)(e^{it} − 1)]^n → e^{c(e^{it} − 1)},

the Poisson characteristic function.
Theorem 28 (De Moivre). If ξ₁, ξ₂, . . . are independent random variables with the same distribution, having mean 0 and finite standard deviation σ, then the distribution of

x_n = (ξ₁ + ξ₂ + · · · + ξ_n) / (σ√n)

tends to the normal (0, 1) distribution.
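De Moivre's theorem can be checked deterministically on the characteristic-function side (a sketch; the choice of the uniform distribution on (−1/2, 1/2) is arbitrary): with σ² = 1/12, the quantity (β(t/(σ√n)))^n should approach e^{−t²/2}.

```python
import math

def unif_cf(t):
    # characteristic function of the uniform distribution on (-1/2, 1/2)
    return 1.0 if t == 0 else math.sin(t / 2.0) / (t / 2.0)

sigma = math.sqrt(1.0 / 12.0)

def normalized_sum_cf(t, n):
    # characteristic function of (xi_1 + ... + xi_n) / (sigma sqrt(n))
    return unif_cf(t / (sigma * math.sqrt(n))) ** n

t = 1.0
approx = normalized_sum_cf(t, 10 ** 6)
exact = math.exp(-t * t / 2.0)
print(approx, exact)
```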
A function ψ(t) is called a K-L function if it has the form

ψ(t) = iat + ∫ (e^{itx} − 1 − itx/(1 + x²)) · ((1 + x²)/x²) dG(x),

where a is real and G(x) is bounded and non-decreasing; the pair (a, G) is called its representation.
The representation is unique. For if ψ(t) is a K-L function with representation (a, G), put

θ(t) = ψ(t) − (1/2) ∫_{−1}^{1} ψ(t + u) du = (1/2) ∫_{−1}^{1} du ∫ e^{itx} (1 − e^{iux}) · ((1 + x²)/x²) dG(x).

Since (1/2) ∫_{−1}^{1} e^{iux} du = sin x / x, this gives

θ(t) = ∫ (1 − sin x/x) · ((1 + x²)/x²) e^{itx} dG(x) = ∫ e^{itx} dT(x),

where

T(x) = ∫_{−∞}^{x} (1 − sin y/y) · ((1 + y²)/y²) dG(y),

and conversely

G(x) = ∫_{−∞}^{x} dT(y) / ((1 − sin y/y)(1 + y²)/y²),

since 1 − sin y/y > 0 for y ≠ 0. Thus T(x), and therefore G(x), and then a, are determined uniquely by ψ(t).
The condition that the variables are u.a.n. implies that the variables are centered, in the sense that their values are concentrated near 0. The general case of u.a.n. variables is reduced to this one by considering the new variables x_ν − C_ν. Thus we need only consider the centered u.a.n. case, since theorems for this case can be extended to the general u.a.n. case by trivial changes. We prove an elementary result about u.a.n. variables first.
Theorem 31. The conditions

(i) the x_ν are u.a.n.,

(ii) sup_ν ∫ (x²/(1 + x²)) dF_ν(x + a_ν) → 0 for every set of numbers a_ν for which sup_ν |a_ν| → 0 as n → ∞,

are equivalent, and each implies that sup_ν |φ_ν(t) − 1| → 0.

Proof. The equivalence of (i) and (ii) follows from the inequalities

∫ (x²/(1 + x²)) dF_ν(x + a_ν) ≤ (ε + |a_ν|)² + ∫_{|x| ≥ ε} dF_ν(x),

∫_{|x| ≥ ε} dF_ν(x) ≤ ((1 + ε²)/ε²) ∫ (x²/(1 + x²)) dF_ν(x).

For the last part,

|φ_ν(t) − 1| = |∫ (e^{itx} − 1) dF_ν(x)| ≤ ε|t| + 2 ∫_{|x| ≥ ε} dF_ν(x).
Theorem 32 (The Central Limit Theorem). Suppose that the x_ν are independent u.a.n. variables and that F_n → F. Then

(i) ψ(t) = log φ(t) is a K-L function;

(ii) if ψ(t) has representation (a, G) and the real numbers a_ν satisfy

∫ (x/(1 + x²)) dF_ν(x + a_ν) = 0,

then

G_n(x) = Σ_ν ∫_{−∞}^{x} (y²/(1 + y²)) dF_ν(y + a_ν) → G(x)

at every continuity point of G, and a_n = Σ_ν a_ν → a.
Proof. We write

ψ_n(t) = Σ_ν ψ_ν(t),

where

ψ_ν(t) = ∫ (e^{itx} − 1 − itx/(1 + x²)) dF_ν(x + a_ν),

and, by Theorem 31,

λ_ν(t) = R(ψ_ν(t)) → 0   (1)

uniformly in |t| ≤ H and in ν. Now let

A_ν^n = −(1/2H) ∫_{−H}^{H} λ_ν(t) dt = ∫ (1 − sin Hx/(Hx)) dF_ν(x + a_ν).

Since

|e^{itx} − 1 − itx/(1 + x²)| ≤ C(H) (1 − sin Hx/(Hx))

for |t| ≤ H, we have

|ψ_ν(t)| ≤ C(H) A_ν^n,   (2)

and therefore, taking real parts in (2) and using the fact that sup_ν |φ_ν(t) − 1| → 0 uniformly in |t| ≤ H,

Σ_ν λ_ν(t) ∼ Σ_ν log |φ_ν(t)|,   Σ_ν A_ν^n ≤ −(1/2H) ∫_{−H}^{H} Σ_ν log |φ_ν(t)| dt + o(1)   (3)

uniformly for |t| ≤ H, and therefore, since H is at our disposal, for each real t. The first part of the conclusion follows from Theorem 30.

For the converse, our hypothesis implies that G_n(∞) → G(∞), and if we use the inequality

|e^{itx} − 1 − itx/(1 + x²)| ≤ C(t) x²/(1 + x²),

it follows from (1) that

Σ_ν ψ_ν(t) → ∫ (e^{itx} − 1 − itx/(1 + x²)) · ((1 + x²)/x²) dG(x).
(c) The definition of a_ν is not the only possible one. It is easy to see that the proof goes through with trivial changes provided that the a_ν are defined so that a_ν → 0 and

∫ (x/(1 + x²)) dF_ν(x + a_ν) = o( ∫ (x²/(1 + x²)) dF_ν(x + a_ν) )

uniformly in ν. This is essentially the definition used by Gnedenko and Levy.

We can deduce immediately the following special cases.
Theorem 33 (Law of large numbers). In order that F_n should tend to the singular distribution with F(x) = D(x − a), it is necessary and sufficient that

Σ_ν ∫ (y²/(1 + y²)) dF_ν(y + a_ν) → 0,   a_n → a.
Theorem 34. In order that F_n should tend to the Poisson distribution with characteristic function e^{c(e^{it} − 1)}, it is necessary and sufficient that

a_n → c/2,   Σ_ν ∫_{−∞}^{x} (y²/(1 + y²)) dF_ν(y + a_ν) → (c/2) D(x − 1).
Theorem 35. In order that F_n should tend to the normal (a, σ²) distribution, it is necessary and sufficient that

a_n → a,   Σ_ν ∫_{−∞}^{x} (y²/(1 + y²)) dF_ν(y + a_ν) → σ² D(x).
The distributions for which log φ(t) is a K-L function can be characterized by another property. We say that a distribution is infinitely divisible (i.d.) if, for every n, we can write

φ(t) = (φ_n(t))^n,

where φ_n(t) is a characteristic function. This means that it is the distribution of the sum of n independent random variables with the same distribution.

Theorem 37. A distribution is i.d. if and only if log φ(t) is a K-L function.

This follows at once from Theorem 32. That the condition that a distribution be i.d. is equivalent to a lighter one is shown by the following
theorem, in which log φ(t) is approximated by finite sums of the form

Σ_j b_j (e^{itβ_j} − 1 − itβ_j/(1 + β_j²)),   b_j ≥ 0.

Sums

x_n = (ξ₁ + ξ₂ + · · · + ξ_n)/λ_n,

in which ξ₁, ξ₂, . . ., ξ_n are independent and have distribution functions B₁(x), B₂(x), . . ., B_n(x) and characteristic functions β₁(t), β₂(t), . . ., β_n(t), are included in the more general sums considered in the central limit theorem. It follows that the limiting distribution of x_n is i.d., and if φ(t) is the limit of the characteristic functions φ_n(t) of x_n, then log φ(t) is a K-L function. These limits form, however, only a proper subclass of the K-L class, and the problem of characterizing this subclass was proposed by Khintchine and solved by Levy. We denote the class by L. As in the general case, it is natural to assume always that the components ξ_ν/λ_n are u.a.n.
for all t. This means that every φ_ν(t) is singular, and this is impossible as φ(t) is not singular.

For the second part, since

(λ_n/λ_{n+1}) x_n = (ξ₁ + ξ₂ + · · · + ξ_n)/λ_{n+1} = x_{n+1} − ξ_{n+1}/λ_{n+1}

and the last term is asymptotically negligible, (λ_n/λ_{n+1}) x_n and x_{n+1} have the same limiting distribution F(x), and therefore

F_n(λ_{n+1} x/λ_n) → F(x),   F_n(x) → F(x).
Proof. First suppose that φ(t) is s.d. If it has a positive real zero, it has a smallest, 2a, since it is continuous, and so

φ(2a) = 0,   φ(t) ≠ 0 for 0 ≤ t < 2a.

Then φ(2ac) ≠ 0 if 0 < c < 1, and since φ(2a) = φ(2ac) φ_c(2a), it follows that φ_c(2a) = 0. Hence

1 = 1 − R(φ_c(2a)) = ∫ (1 − cos 2ax) dF_c(x) ≤ 4 ∫ (1 − cos ax) dF_c(x) = 4(1 − R(φ_c(a))) → 0 as c → 1,

which is impossible.
Conversely, if φ(t) is s.d. we may put φ_ν(t) = φ(νt)/φ((ν − 1)t), so that

φ(t) = ∏_{ν=1}^{n} φ_ν(t/n),

and φ(t) appears as the limit of the distributions of sums (ξ₁ + · · · + ξ_n)/n with λ_n = n. On the other hand, if φ(t) belongs to L, then

∏_{r=1}^{n+m} β_r(t/λ_{n+m}) = φ_n(λ_n t/λ_{n+m}) ψ_{n,m}(t),

where

ψ_{n,m}(t) = ∏_{ν=n+1}^{n+m} β_ν(t/λ_{n+m}).
Since

log φ(t) = iat + ∫ (e^{itx} − 1 − itx/(1 + x²)) · ((1 + x²)/x²) dG(x),

substitution of x/c for x gives

log φ(ct) = iact + ∫ (e^{itx} − 1 − itc²x/(c² + x²)) · ((x² + c²)/x²) dG(x/c),

and the fact that φ_c(t) is i.d., by Theorem 40, implies that Q(x) − Q(bx) decreases, where b = 1/c > 1, x > 0 and

Q(x) = ∫_x^{∞} ((y² + 1)/y²) dG(y).
Since Q(x) and Q(x) − Q(bx) both decrease, Q(x) has a derivative Q′(x) ≤ 0 outside a countable set. We have

G′(x) = −(x²/(x² + 1)) Q′(x),   ((x² + 1)/x) G′(x) = −x Q′(x),

which also exists and decreases outside a countable set. The same argument applies for x < 0. The converse part is trivial.
A more special case arises if we suppose that the components are identically distributed; the class of limits for sequences of such cumulative sums can again be characterized by properties of the limits φ(t) or G(x).

We say that φ(t) is stable if, for every positive constant b′, we can find constants a, b so that

φ(t) φ(b′t) = e^{iat} φ(bt).

This implies, of course, that φ(t) is s.d. and that φ_c has the form e^{ia_c t} φ(c′t).
For if the components have the common characteristic function β(t), then

φ_n(t) = (β(t/λ_n))^n = ∏_{ν=1}^{n} β(t/λ_n),

and it follows as before that this implies that φ_c(t) has the form e^{ia_c t} φ(c′t).
It is possible to characterize the stable distributions in terms of log φ(t) and G(x).
Theorem 43 (P. Levy). The characteristic function φ(t) is stable if and only if

log φ(t) = iat − A|t|^α (1 + iβ (t/|t|) tan(πα/2)),   with 0 < α < 1 or 1 < α ≤ 2,

or

log φ(t) = iat − A|t| (1 + iβ (t/|t|) (2/π) log |t|),   with A > 0 and −1 ≤ β ≤ +1.
Corollary. The real stable distributions are given by φ(t) = e^{−A|t|^α} (0 < α ≤ 2).

Proof. In the notation of Theorem 41, the stability condition implies that, for every d > 0, we can define d′ so that

q(x) = q(x + d) + q(x + d′)   (x > 0).

Since q(x) is real, the only solutions of this difference equation, apart from the special case G(x) = A D(x), are given by

q(x) = A₁ e^{−αx},   Q(x) = A₁ x^{−α},   G′(x) = A₁ x^{1−α}/(1 + x²) for x > 0,

and similarly

G′(x) = A₂ |x|^{1−α}/(1 + x²) for x < 0.
It follows that

log φ(t) = iat + A₁ ∫_0^{∞} (e^{itx} − 1 − itx/(1 + x²)) dx/x^{α+1} + A₂ ∫_{−∞}^{0} (e^{itx} − 1 − itx/(1 + x²)) dx/|x|^{α+1},

and the conclusion follows, for t > 0, from the formulae

∫_0^{∞} (e^{itx} − 1) dx/x^{α+1} = |t|^α e^{−iπα/2} p(α) if 0 < α < 1,

∫_0^{∞} (e^{itx} − 1 − itx/(1 + x²)) dx/x^{α+1} = |t|^α e^{−iπα/2} p(α) if 1 < α < 2,

∫_0^{∞} (e^{itx} − 1 − itx/(1 + x²)) dx/x² = −(π/2)|t| − it log |t| + ia₁ t if α = 1,

with the conjugate values for t < 0.
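For α = 1 the corollary gives the Cauchy distribution: the density 1/(π(1 + x²)) should have characteristic function e^{−|t|}. A direct numerical integration confirms this (a sketch; the truncation point L and the step size are ad hoc choices):

```python
import math

def cauchy_cf_numeric(t, L=2000.0, n=400000):
    # integral_{-L}^{L} cos(tx) / (pi (1 + x^2)) dx by the trapezoidal rule;
    # the sine part vanishes by symmetry and the tail beyond L is O(1/L)
    dx = 2.0 * L / n
    total = 0.0
    for k in range(n + 1):
        x = -L + k * dx
        w = 0.5 if k in (0, n) else 1.0
        total += w * math.cos(t * x) / (math.pi * (1.0 + x * x)) * dx
    return total

t = 1.0
val = cauchy_cf_numeric(t)
print(val, math.exp(-abs(t)))
```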
For each i there is at least one point y_i which is contained in all the intervals [a_{in}, b_{in}], and any function x(t) for which x(t_i) = y_i belongs to all I_n. This is impossible since I_n ↓ ∅, and therefore we have μ(I_n) → 0.
As an important special case we have the following theorem on random sequences.
Theorem 45. Suppose that for every N we have joint distribution functions F_N(ξ₁, . . ., ξ_N) in R_N which are consistent in the sense of Theorem 44. Then a probability measure can be defined in the space of real sequences (x₁, x₂, . . .) in such a way that

P(x_i ≤ ξ_i, i = 1, 2, . . ., N) = F_N(ξ₁, . . ., ξ_N).

Corollary. If F_n(x) is a sequence of distribution functions, a probability measure can be defined in the space of real sequences so that if the I_n are any open or closed intervals,

P(x_n ∈ I_n, n = 1, 2, . . ., N) = ∏_{n=1}^{N} F_n(I_n).
for non-overlapping intervals. The function will then be called an orthogonal random function.

The idea of a stationary process can also be weakened in the same way. An L²-function is stationary in the weak sense, or L²-stationary, if r(s, t) depends only on t − s. We then write ρ(h) = r(s, s + h).

We now go on to consider some special properties of random functions.
countable set. We may also use the notation ω for a sequence and x_n(ω) for its nth term.

Theorem 46 (The 0 or 1 principle: Borel, Kolmogoroff). The probability that a random sequence of independent variables has a property (e.g. convergence) which is not affected by changes in the values of any finite number of its terms is equal to 0 or 1.

Proof. Let E be the set of sequences having the given property, so that our hypothesis is that, for every N ≥ 1,

E = X₁ × X₂ × · · · × X_N × E_N,

where E_N is a set in the product space X_{N+1} × X_{N+2} × · · ·. It follows that if F is any figure, F ∩ E = F × E_N for large enough N, and

μ(FE) = μ(F) μ(E_N) = μ(F) μ(E),

and since this holds for all figures F, it extends to measurable sets F. In particular, putting F = E, we get

μ(E) = (μ(E))²,   μ(E) = 0 or 1.
We can now consider questions of convergence of the series Σ_{ν=1}^{∞} x_ν of independent random variables.

Theorem 48. If the x_ν are independent, with means 0 and standard deviations σ_ν, and

s_n = Σ_{ν=1}^{n} x_ν,   T_N = max_{1≤n≤N} |s_n|,

then, for every ε > 0,

P(T_N ≥ ε) ≤ (1/ε²) Σ_{ν=1}^{N} σ_ν².
Proof. Let

E = [T_N ≥ ε] = Σ_{k=1}^{N} E_k,   E_k = [|s_k| ≥ ε, T_{k−1} < ε],

so that the sets E_k are disjoint. Then, since the x_ν are independent,

Σ_{ν=1}^{N} σ_ν² = ∫ s_N² dμ ≥ Σ_{k=1}^{N} ∫_{E_k} s_N² dμ
= Σ_{k=1}^{N} ∫_{E_k} (s_k + x_{k+1} + · · · + x_N)² dμ
= Σ_{k=1}^{N} ∫_{E_k} s_k² dμ + Σ_{k=1}^{N} μ(E_k) Σ_{i=k+1}^{N} σ_i²
≥ ε² Σ_{k=1}^{N} μ(E_k) = ε² P(T_N ≥ ε),

as we require.
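Theorem 48 can be verified exactly for small cases by enumerating all sign sequences (a sketch; the ±1 variables, for which σ_ν² = 1, and the particular N and ε are arbitrary small choices):

```python
from itertools import product

def max_partial_sum_prob(N, eps):
    # exact P(T_N >= eps) for independent +-1 variables (means 0, variance 1),
    # computed by enumerating all 2^N equally likely sign sequences
    count = 0
    for signs in product((-1, 1), repeat=N):
        s, m = 0, 0
        for x in signs:
            s += x
            m = max(m, abs(s))
        if m >= eps:
            count += 1
    return count / 2 ** N

N, eps = 12, 4.0
p = max_partial_sum_prob(N, eps)
bound = N / eps ** 2  # (1/eps^2) * sum of the N variances
print(p, bound)
```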
Theorem 49. If the x_ν are independent, with means m_ν and Σ_{ν=1}^{∞} σ_ν² < ∞, then Σ_{ν=1}^{∞} (x_ν − m_ν) converges p.p.

Proof. We may suppose that m_ν = 0. By Theorem 48,

P( sup_{1≤n≤N} |s_{m+n} − s_m| ≥ ε ) ≤ (1/ε²) Σ_{ν=m+1}^{m+N} σ_ν²,

and therefore

P( sup_{n≥1} |s_{m+n} − s_m| ≥ ε ) ≤ (1/ε²) Σ_{ν=m+1}^{∞} σ_ν² → 0 as m → ∞,

so that the partial sums s_n converge p.p.
Theorem 50. If the x_ν are independent and uniformly bounded, |x_ν| ≤ c, and Σ_{ν=1}^{∞} x_ν converges p.p., then Σ_{ν=1}^{∞} σ_ν² and Σ_{ν=1}^{∞} m_ν converge.

Proof. The characteristic function of the sum is

φ(t) = ∏_{ν=1}^{∞} φ_ν(t),

where φ(t) ≠ 0 in some neighbourhood of t = 0, the product being uniformly convergent over every finite t-interval. Since

φ_ν(t) = ∫_{−c}^{c} e^{itx} dF_ν(x),

it follows, on considering |φ_ν(t)|² for small t, that Σ_{ν=1}^{∞} σ_ν² < ∞. Hence Σ (x_ν − m_ν) converges p.p. by Theorem 49, and since Σ x_ν converges p.p., Σ m_ν also converges.
Then Σ x_ν converges p.p. if and only if the three series

Σ p_ν,   Σ m_ν′,   Σ σ_ν′²

converge, where x_ν′ is the truncated variable (x_ν′ = x_ν if |x_ν| < c, x_ν′ = 0 otherwise), m_ν′ and σ_ν′ are its mean and standard deviation, and p_ν = P(|x_ν| ≥ c).

Proof. First, if Σ x_ν converges p.p., then x_ν → 0 and x_ν = x_ν′, |x_ν| < c, for ν large enough, for almost all sequences. Now

[lim sup |x_ν| < c] = lim_{N→∞} [ |x_ν| < c for ν ≥ N ] = lim_{N→∞} ∏_{ν=N}^{∞} [ |x_ν| < c ].

Therefore

1 = P(lim sup |x_ν| < c) = lim_{N→∞} ∏_{ν=N}^{∞} (1 − p_ν),

so that ∏ (1 − p_ν) and Σ p_ν converge. The convergence of the other two series follows from Theorem 50.

Conversely, suppose that the three series converge, so that, by Theorem 50, Σ x_ν′ converges p.p. But it follows from the convergence of Σ p_ν that x_ν = x_ν′ for ν sufficiently large and almost all sequences, and therefore Σ x_ν also converges p.p.
Theorem 52. If the x_ν are independent, s_n = Σ_{ν=1}^{n} x_ν converges p.p. if and only if ∏_{ν=1}^{∞} φ_ν(t) converges to a characteristic function.

We do not give the proof. For the proof, see J.L. Doob, Stochastic Processes, pages 115, 116. It would seem natural to ask whether there is a direct proof of Theorem 52 involving some relationship between T_N in Theorem 48 and the functions φ_ν(t). This might simplify the whole theory.

Stated differently, Theorem 52 reads as follows: a series of independent random variables converges p.p. if and only if it converges in distribution.
Theorem 53. If the x_ν are independent with means 0 and

Σ_{ν=1}^{∞} σ_ν²/ν² < ∞,

then

lim_{n→∞} (1/n) Σ_{ν=1}^{n} x_ν = 0 p.p.
Proof. Let y_ν = x_ν/ν, so that y_ν has standard deviation σ_ν/ν. It follows then from Theorem 49 that Σ y_ν = Σ (x_ν/ν) converges p.p. If we write

X_ν = Σ_{j=1}^{ν} x_j/j,   x_ν = ν(X_ν − X_{ν−1}),

then

(1/n) Σ_{ν=1}^{n} x_ν = (1/n) Σ_{ν=1}^{n} ν(X_ν − X_{ν−1}) = X_n − (1/n) Σ_{ν=1}^{n−1} X_ν = o(1)

if X_n converges, by the consistency of (C, 1) summation.
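The summation step at the end of this proof, namely that (1/n) Σ ν(X_ν − X_{ν−1}) = X_n − (1/n) Σ_{ν<n} X_ν tends to 0 whenever X_n converges, is easy to check numerically (a sketch, using an arbitrary convergent sequence X_ν = 1 − 1/ν):

```python
def average_of_increments(X):
    # (1/n) * sum_{v=1}^{n} v * (X_v - X_{v-1}), with X_0 = 0
    n = len(X)
    prev, total = 0.0, 0.0
    for v, xv in enumerate(X, start=1):
        total += v * (xv - prev)
        prev = xv
    return total / n

n = 100000
X = [1.0 - 1.0 / v for v in range(1, n + 1)]  # X_v -> 1 as v -> infinity
lhs = average_of_increments(X)
rhs = X[-1] - sum(X[:-1]) / n  # X_n - (1/n) sum_{v=1}^{n-1} X_v (Abel summation)
print(lhs, rhs)
```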
If the series Σ σ_ν²/ν² diverges, the conclusion need not hold for the averages of the x_ν. The basic result on the precise rate of growth of s_n is the following.
Theorem 55 (Khintchine; Law of the iterated logarithm). Let the x_ν be independent, with zero means and standard deviations σ_ν, and let

B_n = Σ_{ν=1}^{n} σ_ν² → ∞ as n → ∞.

Then

lim sup_{n→∞} |s_n| / (2 B_n log log B_n)^{1/2} = 1 p.p.

In particular, if σ_ν = σ for all ν,

lim sup_{n→∞} |s_n| / (2 n log log n)^{1/2} = σ p.p.

We do not prove this here. For the proof, see M. Loeve: Probability Theory, page 260, or A. Khintchine: Asymptotische Gesetze der Wahrscheinlichkeitsrechnung, page 59.
15. L²-Processes

Theorem 57. If r(s, t) is the autocorrelation function of an L²-function, then

r(s, t) = r̄(t, s),

and if (z_i) is any finite set of complex numbers, then

Σ_{i,j} r(t_i, t_j) z_i z̄_j ≥ 0.

The first part is trivial and the second part follows from the identity

Σ_{i,j} r(t_i, t_j) z_i z̄_j = E( |Σ_i (x(t_i) − m(t_i)) z_i|² ) ≥ 0.
The uniqueness follows from the fact that a Gaussian process is determined by its first and second order moments, m(t) and r(s, t). Hence, if we are concerned only with properties depending on r(s, t) and m(t), we may suppose that all our processes are Gaussian.
Theorem 59. In order that a centred L²-process should be orthogonal, it is necessary and sufficient that

E(|x(t) − x(s)|²) = F(t) − F(s)   (s < t),   (1)

where F(s) is a non-decreasing function. In particular, if x(t) is stationary L², then E(|x(t) − x(s)|²) = σ²(t − s) (s < t) for some constant σ².

Proof. If s < u < t, the orthogonality condition implies that

E(|x(u) − x(s)|²) + E(|x(t) − x(u)|²) = E(|x(t) − x(s)|²),

which is sufficient to prove (1). The converse is trivial.
We write

dF = E(|dx|²),

and, for stationary functions,

σ² dt = E(|dx|²).

We say that an L²-function is continuous at t if

lim_{h→0} E(|x(t + h) − x(t)|²) = 0,

and that it is continuous if it is continuous for all t. Note that this does not imply that the individual functions x(t) are continuous at t.
Theorem 60 (Slutsky). In order that x(t) be continuous at t, it is necessary and sufficient that r(s, t) be continuous at s = t. It is continuous (for all t) if r(s, t) is continuous on the line t = s, and then r(s, t) is continuous in the whole plane.

Proof. The first part follows from the relation

E(|x(t + h) − x(t)|²) = r(t + h, t + h) − r(t + h, t) − r(t, t + h) + r(t, t),

and the second from the corresponding expression for r(t + h, t + k) − r(t, t).
We say that x(t) is R-integrable in a ≤ t ≤ b if Σ_i x(t_i) δ_i tends to a limit in L² for any sequence of sub-divisions of (a, b) into intervals of lengths δ_i containing points t_i respectively. The limit is denoted by ∫_a^b x(t) dt.

Theorem 64. In order that x(t) be R-integrable in a ≤ t ≤ b, it is necessary and sufficient that ∫_a^b ∫_a^b r(s, t) ds dt exists as a Riemann integral.
x(t) = ∫ e^{itλ} dZ(λ),

and, in the discrete case,

x(n) = ∫ e^{inλ} dZ(λ).

If

x(t) = Σ_{i=1}^{∞} x_i(t),

then

x_i(t) = ∫ e^{itλ} dZ_i(λ) = ∫_{E_i} e^{itλ} dZ(λ).
can be defined, and it is easy to show that y(t) has spectral representation

y(t) = ∫ e^{itλ} K(λ) dZ(λ) = ∫ e^{itλ} dZ₁(λ),

where

K(λ) = ∫_0^{∞} k(s) e^{−iλs} ds,
which depends only on the part of the function x(t) before time t.
The linear prediction problem (Wiener) is to determine k(s) so as to minimise (in some sense) the difference between y(t) and x(t). In so far as this difference can be made small, we can regard y(t + τ) as a prediction of the value of x at time t + τ, based on our knowledge of its behaviour before t.
f*(ω) = lim_{τ→∞} (1/τ) ∫_0^{τ} f(T_t ω) dt

exists for almost all ω, f*(ω) ∈ L(μ), and

∫ f*(ω) dμ = ∫ f(ω) dμ.

If the flow (T_t) is metrically transitive,

f*(ω) = ∫ f(ω) dμ for almost all ω.
There is a corresponding discrete ergodic theorem for the transformations T_n = (T₁)^n, where n is an integer, the conclusion then being that

f*(ω) = lim_{N→∞} (1/N) Σ_{n=1}^{N} f(T_n ω)

exists for almost all ω. In this case, however, the measurability condition on f(T_t ω) may be dispensed with.
Theorem 69 (Von Neumann). Suppose that the conditions of Theorem 68 hold and that f(ω) ∈ L²(μ). Then

∫ | (1/τ) ∫_0^{τ} f(T_t ω) dt − f*(ω) |² dμ → 0 as τ → ∞.

For proofs, see Doob, pages 465, 515, or P.R. Halmos, Lectures on Ergodic Theory, The Math. Soc. of Japan, pages 16, 18. The simplest proof is due to F. Riesz (Comm. Math. Helv. 17 (1945), 221-239).

Theorem 68 is much deeper than Theorem 69.
The applications to random functions are as follows. For a stationary random function x(ω, t), measurable in (ω, t),

lim_{τ→∞} (1/τ) ∫_0^{τ} x(ω, t) e^{−iat} dt = x*(ω, a)

exists for almost all ω, and in the metrically transitive case, with a = 0,

lim_{τ→∞} (1/τ) ∫_0^{τ} x(ω, t) dt = ∫ x(ω, t) dμ = m

for almost all ω. In other words, almost all functions have limiting time averages equal to the mean of the values of the function at any fixed time.
x(t₃) − x(t₁) is the convolution of the distributions of x(t₂) − x(t₁) and x(t₃) − x(t₂). We are generally interested only in the increments, and it is convenient to consider the behaviour of the function from some base point, say 0, and to modify the function by subtracting a random variable so as to make x(0) = 0, replacing x(t) by x(t) − x(0). Then if F_{t₁,t₂} is the distribution function for the increment x(t₂) − x(t₁), we have

F_{t₁,t₃}(x) = F_{t₁,t₂} ∗ F_{t₂,t₃}(x).

We get the stationary case if F_{t₁,t₂}(x) depends only on t₂ − t₁. (This by itself is not enough, but together with independence, the condition is sufficient for stationarity.) If we put

F_t(x) = F_{0,t}(x),

we have in this case

F_{t₁+t₂}(x) = F_{t₁} ∗ F_{t₂}(x)

for all t₁, t₂ > 0. If x(t) is also an L²-function with x(0) = 0, it follows that

E(|x(t₁ + t₂)|²) = E(|x(t₁)|²) + E(|x(t₂)|²),

so that

E(|x(t)|²) = σ² t,   where σ² = E(|x(1)|²).
Theorem 71. If x(t) is stationary with independent increments, its distribution function F_t(x) is infinitely divisible and its characteristic function φ_t(u) has the form e^{tψ(u)}, where

ψ(u) = iau + ∫ (e^{iux} − 1 − iux/(1 + x²)) · ((1 + x²)/x²) dG(x).

Proof. Since F_{t₁+t₂} = F_{t₁} ∗ F_{t₂}, we have φ_{t₁+t₂}(u) = φ_{t₁}(u) φ_{t₂}(u), so that φ_t(u) = e^{tψ(u)} for some ψ(u), which must have the K-L form, as is seen by putting t = 1 and using Theorem 37.
Conversely, we have also the
Theorem 72. Any function t (u) of this form is the characteristic function of a stationary random function with independent increments.
Proof. We observe that the conditions on Ft (x) gives us a system of joint
distributions over finite sets of points ti which is consistent in the sense
of Theorem 44 and the random function defined by the Kolmogoroff
measure in Theorem 44 has the required properties.
Example 1 (Brownian motion: Wiener). The increments all have normal distributions, so that

F_t′(x) = (1/(σ√(2πt))) e^{−x²/2σ²t}.
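The additivity E(|x(t₁ + t₂)|²) = σ²(t₁ + t₂) of the Wiener increments can be illustrated by simulation (a sketch; the seed, σ and the time points are arbitrary choices):

```python
import random

random.seed(1)
sigma = 1.0
t1, t2 = 0.7, 1.8
n = 200000

# x(t1 + t2) = (x(t1) - x(0)) + (x(t1 + t2) - x(t1)): a sum of independent
# normal increments with variances sigma^2 t1 and sigma^2 t2
samples = [random.gauss(0.0, sigma * t1 ** 0.5) + random.gauss(0.0, sigma * t2 ** 0.5)
           for _ in range(n)]
var = sum(x * x for x in samples) / n  # the mean is 0, so this estimates E(|x|^2)
print(var, sigma ** 2 * (t1 + t2))
```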
Example 2 (Poisson). The increments x(s + t) − x(s) have integral values