
Stat 340 Course Notes

Spring 2010

Generously Funded by MEF

























Contributions:
Riley Metzger
Michaelangelo Finistauri

Special Thanks:
Without the following people and groups, these course notes would never have been
completed: MEF, Don McLeish.

Table of Contents

Chapter 1: Probability ................................. 1
  Exercises for Chapter 1 .............................. 29
  Solutions (pp. 18)
Chapter 2: Statistics .................................. 34
  Exercises for Chapter 2 .............................. 38
  Solutions (pp. 19)
Chapter 3: Validation .................................. 42
  Exercises for Chapter 3 .............................. 52
  Solutions (pp. 14)
Chapter 4: Queuing Systems ............................. 53
  Exercises for Chapter 4 .............................. 66
  Solutions (pp. 14)
Chapter 5: Generating Random Variables ................. 67
  Exercises for Chapter 5 .............................. 75
  Solutions (pp. 14)
Chapter 6: Variance Reduction .......................... 76
  Exercises for Chapter 6 .............................. 120
  Solutions (pp. 112)

Statistical Tables

Chapter 1
Probability
Three approaches to defining probability are:

1. The classical definition: Let the sample space (denoted by S) be the set of all possible
distinct outcomes of an experiment. The probability of some event is
(number of ways the event can occur) / (number of outcomes in S),
provided all points in S are equally likely. For example, when a die is rolled, the probability
of getting a 2 is 1/6, because one of the six faces is a 2.

2. The relative frequency definition: The probability of an event is the proportion (or
fraction) of times the event occurs over a very long (theoretically infinite) series of
repetitions of an experiment or process. For example, this definition could be used to argue
that the probability of getting a 2 from a rolled die is 1/6. On the other hand, if we roll the
die 100 times and get a 2 on 30 of those rolls, we may suspect that the probability of getting
a 2 is closer to 1/3.

3. The subjective probability definition: The probability of an event is a measure of how sure
the person making the statement is that the event will happen. For example, after considering
all available data, a weather forecaster might say that the probability of rain today is 30% or
0.3.

Unfortunately, all three of these definitions have serious limitations.
1.1 Sample Spaces and Probability
Consider a phenomenon or process which is repeatable, at least in theory, and suppose that
certain events (outcomes) A_1, A_2, A_3, ... are defined. We will often refer to this
phenomenon or process as an "experiment," and refer to a single repetition of the experiment as
a "trial." Then the probability of an event A, denoted by P(A), is a number between 0 and 1.

Definition 1. A sample space S is a set of distinct outcomes for an experiment or process, with
the property that in a single trial, one and only one of these outcomes occurs. The outcomes
that make up the sample space are called sample points.
Definition 2. Let S = {a_1, a_2, a_3, ...} be a discrete sample space. Then probabilities
P(a_i) are numbers attached to the a_i's (i = 1, 2, 3, ...) such that the following two
conditions hold:
(1) 0 ≤ P(a_i) ≤ 1
(2) Σ_i P(a_i) = 1
The set of values P(a_i), i = 1, 2, ..., is called a probability distribution on S.

Definition 3. An event in a discrete sample space is a subset A ⊆ S. If the event contains only
one point, e.g. A_1 = {a_1}, we call it a simple event. An event A made up of two or more
simple events, such as A = {a_1, a_2}, is called a compound event.

Definition 4. The probability P(A) of an event A is the sum of the probabilities for all the
simple events that make up A.
Example: Suppose a 6-sided die is rolled, and let the sample space be S = {1, 2, 3, 4, 5, 6},
where 1 means the number 1 occurs, and so on. If the die is an ordinary one, we would find it
useful to define probabilities as
P(i) = 1/6 for i = 1, 2, 3, 4, 5, 6,
because if the die were tossed repeatedly (as in some games or gambling situations), then each
number would occur close to 1/6 of the time. However, if the die were weighted in a way such
that some faces were favoured over others, these numerical values would not be so useful.

Note that if we wish to consider a compound event, the probability is easily obtained. For
example, if A = "even number", then because A = {2, 4, 6} we get
P(A) = P(2) + P(4) + P(6) = 1/2.
1.2 Conditional Probability
It is often important to know how the outcome of one event will affect the outcome of another.
Consider flipping a coin: if we know that the first flip is a head, how does this affect the
outcome of a second flip? Logically this should have no effect. Now, suppose we are interested
in the values of two rolled dice (each a standard six-faced die with faces numbered 1 through
6, hereafter referred to as a D6). If the total sum of the dice is 5 and we know the value of
one of the dice, how does this affect the outcome of the second die? This is the basic concept
behind Conditional Probability.

1.2.1 Definition
Given two events A and B, and given that we know B occurs, the conditional probability of A
given B (said "A given B") is defined as:
P(A|B) = P(A ∩ B) / P(B)
Recall that A ∩ B denotes the intersection of events A and B (hereafter, this will be shortened
to AB).
Note: If A and B are independent,
P(AB) = P(A)P(B), so
P(A|B) = P(A)P(B) / P(B) = P(A).
1.2.2 Example
Consider the first example: the coin. Assume that the coin is fair, which is to say that the
probability of either outcome is exactly 1/2. Let A be the event that the first toss lands
heads, and let B be the event that the second toss lands tails. Then the conditional
probability of B given A is:
P(B|A) = P(B ∩ A) / P(A) = (1/4) / (1/2) = 1/2
Notice that the conditional probability P(B|A) is the same as the probability of B. This is not
always the case.
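Since these notes are for a simulation course, the same conclusion can be checked empirically.
The sketch below is a minimal example in Python (standard library only, not code from the
original notes): it estimates P(B|A) and P(B) for two independent fair tosses and shows they
agree.

```python
import random

random.seed(1)
trials = 100_000
n_A = n_B = n_AB = 0   # A: first toss heads, B: second toss tails

for _ in range(trials):
    first, second = random.choice("HT"), random.choice("HT")
    n_A += first == "H"
    n_B += second == "T"
    n_AB += (first == "H") and (second == "T")

# For independent tosses, P(B | A) = P(AB)/P(A) should match P(B), both near 1/2.
print("P(B | A) estimate:", n_AB / n_A)
print("P(B) estimate:    ", n_B / trials)
```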
1.3 Random Variables
There is a far more intuitive and useful representation of probabilistic events: random
variables (r.v.'s). A random variable can be thought of as an unknown numeric value which is
determined by chance.

Definition 5. A random variable is a function that assigns a real number to each point in a
sample space S.

Example 1. From the previous examples, we can let X be the random variable representing the
outcome of a coin toss. We could assign X the value 1 if the coin turns out to be a head, and 0
otherwise. We can even have a sequence of n random variables for a series of n coin tosses. In
such a case, X might be the number of heads on the n coin tosses.

Random variables come in three main types: Discrete, Continuous and Mixed (a combination of
Discrete and Continuous). Which of these categories a random variable falls into depends on the
domain of the sample space. Both the coin and dice examples have a discrete support. Time,
height or temperature are examples where the support may indicate a continuous random variable.
1.4 Discrete Random Variables
A discrete random variable is one whose sample space is finite or countably infinite. Common
sample spaces are (proper or improper) subsets of the integers.

Definition 6. The probability function (p.f.) of a random variable X is the function
f(x) = P(X = x), defined for all x ∈ S, where S is the sample space of the random variable.
The set of pairs {(x, f(x)) : x ∈ S} is called the probability distribution of X. All
probability functions must have two properties:
1. f(x) ≥ 0 for all values of x (i.e. for x ∈ S)
2. Σ_{all x ∈ S} f(x) = 1
By implication, these properties ensure that f(x) ≤ 1 for all x. We consider a few toy examples
before dealing with more complicated problems.
1.5 Expectation, Averages, Variability
Definition 7. The expected value (also called the mean or the expectation) of a discrete random
variable X with probability function f(x) is
E(X) = Σ_{all x} x f(x).
The expected value of X is also often denoted by the Greek letter μ. The expected value of X
can be thought of as the average of the X-values that would occur in an infinite series of
repetitions of the process where X is defined.

Notes:
(1) You can interpret E[g(X)] as the average value of g(X) in an infinite series of repetitions
of the process where X is defined.
(2) E[g(X)] is also known as the expected value of g(X). This name is somewhat misleading since
the average value of g(X) may be a value which g(X) never takes - hence unexpected!
(3) The case where g(X) = X reduces to our earlier definition of E(X).

Theorem 1. Suppose the random variable X has probability function f(x). Then the expected
value of some function g(X) of X is given by
E[g(X)] = Σ_{all x} g(x) f(x).

Properties of Expectation:
If your linear algebra is good, it may help to think of E as a linear operator. Otherwise,
you'll have to remember these and subsequent properties.

1. For constants a and b,
E[a g(X) + b] = a E[g(X)] + b
Proof: E[a g(X) + b] = Σ_{all x} [a g(x) + b] f(x)
                     = Σ_{all x} [a g(x) f(x) + b f(x)]
                     = a Σ_{all x} g(x) f(x) + b Σ_{all x} f(x)
                     = a E[g(X)] + b        (since Σ_{all x} f(x) = 1)

2. For constants a and b and functions g_1 and g_2, it is also easy to show that
E[a g_1(X) + b g_2(X)] = a E[g_1(X)] + b E[g_2(X)]

Variability:
While an average is a useful summary of a set of observations, or of a probability
distribution, it omits another important piece of information, namely the amount of
variability. For example, it would be possible for car doors to be the right width, on average,
and still have no doors fit properly. In the case of fitting car doors, we would also want the
door widths to all be close to this correct average. We give a way of measuring the amount of
variability next.

You might think we could use the average difference between X and μ to indicate the amount of
variation. In terms of expectation, this would be E(X − μ). However,
E(X − μ) = E(X) − μ (since μ is a constant) = 0. We soon realize that to measure variability we
need a function that is the same sign for X > μ as for X < μ. We now define

Definition 8. The variance of a r.v. X is E[(X − μ)²], and is denoted by σ² or by Var(X).

In words, the variance is the average squared distance from the mean. This turns out to be a
very useful measure of the variability of X.

Example: Let X be the number of heads when a fair coin is tossed 4 times. Then
X ~ Binomial(4, 1/2), and so μ = np = (4)(1/2) = 2. Without doing any calculations, we know
σ² ≤ 4, because X is always between 0 and 4 and hence can never be further away from μ than 2;
this makes the average squared distance from μ at most 4. The values of f(x) are

  x       0      1      2      3      4        since f(x) = (4 choose x)(1/2)^x (1/2)^(4−x)
 f(x)    1/16   4/16   6/16   4/16   1/16                 = (4 choose x)(1/2)^4

The value of Var(X) (i.e. σ²) is easily found here:
σ² = E[(X − μ)²] = Σ_{x=0}^{4} (x − μ)² f(x)
   = (0 − 2)²(1/16) + (1 − 2)²(4/16) + (2 − 2)²(6/16) + (3 − 2)²(4/16) + (4 − 2)²(1/16)
   = 1
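The same tabular calculation is easy to script. The sketch below is a minimal Python example
(not part of the original notes); the dictionary holds the Binomial(4, 1/2) probability
function tabulated above, and the two sums are exactly μ = Σ x f(x) and σ² = Σ (x − μ)² f(x).

```python
# Probability function of X ~ Binomial(4, 1/2), as tabulated above.
f = {0: 1/16, 1: 4/16, 2: 6/16, 3: 4/16, 4: 1/16}

mu = sum(x * p for x, p in f.items())               # E(X)
var = sum((x - mu) ** 2 * p for x, p in f.items())  # E[(X - mu)^2]

print("mean:", mu)       # 2.0
print("variance:", var)  # 1.0
```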
Definition 9. The standard deviation of a random variable X is σ = √(E[(X − μ)²]).

Both variance and standard deviation are commonly used to measure variability. The basic
definition of variance is often awkward to use for mathematical calculation of σ², whereas the
following two results are often useful:
(1) σ² = E(X²) − μ²
(2) σ² = E[X(X − 1)] + μ − μ²
Properties of Mean and Variance
If a and b are constants and Y = aX + b, then
μ_Y = a μ_X + b   and   σ²_Y = a² σ²_X
(where μ_X and σ²_X are the mean and variance of X, and μ_Y and σ²_Y are the mean and variance
of Y). The proof of this is left to the reader as an exercise.
1.6 Moment Generating Functions
We have now seen two functions which characterize a distribution, the probability function and
the cumulative distribution function. There is a third type of function, the moment generating
function, which also uniquely determines a distribution. The moment generating function is
closely related to other transforms used in mathematics: the Laplace and Fourier transforms.

Definition 10. Consider a discrete random variable X with probability function f(x). The moment
generating function (m.g.f.) of X is defined as
M(t) = E(e^{tX}) = Σ_x e^{tx} f(x).
We will assume that the moment generating function is defined and finite for values of t in an
interval around 0 (i.e. for some a > 0, Σ_x e^{tx} f(x) < ∞ for all t ∈ [−a, a]).

The moments of a random variable X are the expectations of the functions X^r for r = 1, 2, ....
The expected value E(X^r) is called the r-th moment of X. The mean μ = E(X) is therefore the
first moment, E(X²) is the second, and so on. It is often easy to find the moments of a
probability distribution mathematically by using the moment generating function. This often
gives easier derivations of means and variances than the direct summation methods in the
preceding section. The following theorem gives a useful property of m.g.f.'s.

Theorem 2. Let the random variable X have m.g.f. M(t). Then
E(X^r) = M^(r)(0),   r = 1, 2, ...
where M^(r)(0) stands for d^r M(t)/dt^r evaluated at t = 0.

Proof:
M(t) = Σ_x e^{tx} f(x) and, if the sum converges,
M^(r)(t) = d^r/dt^r Σ_x e^{tx} f(x)
         = Σ_x d^r/dt^r (e^{tx}) f(x)
         = Σ_x x^r e^{tx} f(x)
Therefore M^(r)(0) = Σ_x x^r f(x) = E(X^r), as stated.

This sometimes gives a simple way to find the moments for a distribution.

Example 1. Suppose X has a Binomial(n, p) distribution. Then its moment generating function is
M(t) = Σ_{x=0}^{n} e^{tx} (n choose x) p^x (1 − p)^{n−x}
     = Σ_{x=0}^{n} (n choose x) (p e^t)^x (1 − p)^{n−x}
     = (p e^t + 1 − p)^n

Therefore
M'(t)  = n p e^t (p e^t + 1 − p)^{n−1}
M''(t) = n p e^t (p e^t + 1 − p)^{n−1} + n(n − 1) p² e^{2t} (p e^t + 1 − p)^{n−2}

and so
E[X]  = M'(0)  = np,
E[X²] = M''(0) = np + n(n − 1)p²,
Var(X) = E(X²) − E(X)² = np(1 − p).
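If a computer algebra system is available, the differentiation above can be checked
symbolically. The sketch below is an illustrative example only (it assumes the third-party
sympy package, which is not mentioned in the original notes) and simply differentiates the
binomial m.g.f. at t = 0.

```python
import sympy as sp

t, p, n = sp.symbols("t p n", positive=True)
M = (p * sp.exp(t) + 1 - p) ** n                  # binomial m.g.f.

EX = sp.simplify(sp.diff(M, t).subs(t, 0))        # M'(0)  -> n*p
EX2 = sp.simplify(sp.diff(M, t, 2).subs(t, 0))    # M''(0) -> n*p + n*(n-1)*p**2
var = sp.simplify(EX2 - EX ** 2)                  # -> n*p*(1 - p)

print(EX)
print(var)
```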
1.7 Discrete Distributions
1.7.1 Discrete Uniform
The Discrete Uniform is used when every outcome is equiprobable, such as with fair dice, coins,
and simple random sampling (a surveying method). This is the simplest discrete random variable.
Let X ~ U(a, b), where the parameters a and b are the integer minimum and maximum of the
support respectively (the support is contained in the integers). For coin tosses a = 0 and
b = 1, and for a die a = 1 and b = 6. For the discrete uniform, the sample space is the set of
all integers between a and b inclusive. The uniform has the following properties:

f(x) = 1/(b − a + 1),   x ∈ S
F(x) = (x − a + 1)/(b − a + 1),   x ∈ S
E[X] = (a + b)/2
Var(X) = [(b − a + 1)² − 1]/12
1.7.2 Bernoulli
Given a trial which results in either a success or a failure (like flipping a coin; we can
consider a head as a success and a tail as a failure), we can use a random variable to
represent the outcome. This situation is modelled by a Bernoulli random variable, named after
the scientist Jacob Bernoulli. This is also known as a Bernoulli trial with probability of
success p and probability of failure q = 1 − p. This random variable has only one parameter, p,
the probability of a success. The support is simply {0, 1} (failure or success respectively).

f(x) = 1 − p if x = 0,   p if x = 1
F(x) = 1 − p if x = 0,   1 if x = 1
E[X] = p
Var(X) = p(1 − p)
1.7.3 Binomial
When there are n independent and identically distributed Bernoulli trials and the value of
interest is the number of successes (e.g. the number of heads obtained after flipping a coin a
certain number of times), the count can be described by a Binomial random variable. The
Binomial random variable has two parameters: the number of trials, n, and the probability of
success, p. The sample space is the set of possible numbers of successes in n trials,
x ∈ Z ∩ [0, n].

f(x) = (n choose x) p^x (1 − p)^{n−x}
F(x) = Σ_{i=0}^{x} f(i)
E[X] = np
Var(X) = np(1 − p)
1.7.4 Geometric
Given a series of independent Bernoulli trials, if we are interested in the number of failures
before the first success then we use a Geometric random variable. This has one parameter, the
probability of success, p. The sample space is the non-negative integers, which is intuitive:
if the first trial is a success then 0 failures occurred, but we can have arbitrarily many
failures before our first success. Especially if the coin rolls into a nearby storm drain.

f(x) = p(1 − p)^x
F(x) = 1 − (1 − p)^{x+1}
E[X] = (1 − p)/p
Var(X) = (1 − p)/p²
1.7.5 Negative Binomial
Consider the previous random variable, except that we are now interested in the k-th success.
This gives the Negative Binomial distribution. It has two parameters: the number of desired
successes, k, and the probability of each success occurring, p. The sample space for this
random variable, again, is the set of non-negative integers.

f(x) = (x + k − 1 choose k − 1) p^k (1 − p)^x
F(x) = Σ_{i=0}^{x} f(i)
E[X] = k(1 − p)/p
Var(X) = k(1 − p)/p²
1.7.6 Hypergeometric
Now assume that we have a population of size N which can be split into two subgroups,
arbitrarily called A and B (a common example is an urn full of coloured balls). The subgroups
have sizes M and N − M respectively. If we take a sample without replacement of size n from the
population, the number of elements taken from subgroup A is the random variable of interest.
Note that the two subgroups can usually be swapped in terms of notation without consequence.
Consider the example of an urn full of balls. There are M blue balls and N − M red balls. If we
take a sample of n balls out of the urn, how many blue balls are there? This assumes that each
remaining ball is equally likely to be chosen at each selection.

The parameters and support for this random variable have very specific restrictions. N, the
population size, is a positive integer (preferably greater than 2). M, the size of a
subpopulation (i.e. the number of blue balls), can be any integer in the set [0, N]. n, the
number of items selected, is an integer in the set [0, N]. The support of the sample space is a
little tricky: x ∈ {max(0, n + M − N), ..., min(M, n)}.

f(x) = (M choose x)(N − M choose n − x) / (N choose n)
F(x) = Σ_i f(i), summed over support values i ≤ x
E[X] = nM/N
Var(X) = n (M/N)(1 − M/N)(N − n)/(N − 1)
1.7.7 Poisson
The number of events that occur independently at a common rate, λ, over a fixed period of time,
t, follows the Poisson distribution. This random variable has the single parameter λt, where λ
is the rate of arrivals (a strictly positive real number) and t is the length of the time
interval of interest. The support for this random variable is the non-negative integers.

f(x) = (λt)^x e^{−λt} / x!
F(x) = e^{−λt} Σ_{i=0}^{x} (λt)^i / i!
E[X] = λt
Var(X) = λt
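As a simulation-flavoured check of the last two formulas, one can draw a large sample and
compare the sample mean and variance with λt. The sketch below is illustrative only and assumes
the third-party numpy package (not referenced in the original notes); the chosen λ and t are
arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t = 3.0, 2.0                       # rate and time window, so E[X] = Var(X) = lam*t
sample = rng.poisson(lam * t, size=200_000)

print("theoretical mean/variance:", lam * t)
print("sample mean    :", sample.mean())
print("sample variance:", sample.var())
```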
1.8 Discrete Multivariate Distributions
1.8.1 Joint Probability Functions:
First, suppose there are two random variables X and Y, and define the function
f(x, y) = P(X = x and Y = y)
        = P(X = x, Y = y).
We call f(x, y) the joint probability function of (X, Y). In general,
f(x_1, x_2, ..., x_n) = P(X_1 = x_1 and X_2 = x_2 and ... and X_n = x_n)
if there are n random variables X_1, ..., X_n.

The properties of a joint probability function are similar to those for a single variable; for
two random variables we have f(x, y) ≥ 0 for all (x, y) and
Σ_{all (x,y)} f(x, y) = 1.
Example: Consider the following numerical example, where we show f(x, y) in a table.

                  x
   f(x, y)    0     1     2
  y   1      .1    .2    .3
      2      .2    .1    .1

For example, f(0, 2) = P(X = 0 and Y = 2) = 0.2. We can check that f(x, y) is a proper joint
probability function since f(x, y) ≥ 0 for all 6 combinations of (x, y) and the sum of these 6
probabilities is 1. When there are only a few values for X and Y, it is often easier to
tabulate f(x, y) than to find a formula for it. We'll use this example below to illustrate
other definitions for multivariate distributions, but first we give a short example where we
need to find f(x, y).

Example: Suppose the range for (X, Y), which is the set of possible values (x, y), is the
following: X can be 0, 1, 2, or 3, and Y can be 0 or 1. We'll see that not all 8 combinations
(x, y) are possible in the following table of f(x, y) = P(X = x, Y = y).

                  x
   f(x, y)    0     1     2     3
  y   0      1/8   2/8   1/8    0
      1       0    1/8   2/8   1/8

Note that the range or joint p.f. for (X, Y) is a little awkward to write down here in
formulas, so we just use the table.

Marginal Distributions: We may be given a joint probability function involving more variables
than we're interested in using. How can we eliminate the variables that are not of interest?
Look at the first example above: if we're only interested in X, and don't care what value Y
takes, we can see that
P(X = 0) = P(X = 0, Y = 1) + P(X = 0, Y = 2),
so P(X = 0) = f(0, 1) + f(0, 2) = 0.3. Similarly
P(X = 1) = f(1, 1) + f(1, 2) = .3   and
P(X = 2) = f(2, 1) + f(2, 2) = .4
The distribution of X obtained in this way from the joint distribution is called the marginal
probability function of X:
   x       0     1     2
  f(x)    .3    .3    .4

In the same way, if we were only interested in Y, we obtain
P(Y = 1) = f(0, 1) + f(1, 1) + f(2, 1) = .6
since X can be 0, 1, or 2 when Y = 1. The marginal probability function of Y would be:

   y       1     2
  f(y)    .6    .4

We generally put a subscript on the f to indicate whether it is the marginal probability
function for the first or second variable. So f_1(1) would be P(X = 1) = .3, while f_2(1) would
be P(Y = 1) = 0.6. An alternative notation that you may see is f_X(x) and f_Y(y).

In general, to find f_1(x) we add over all values of y with X = x, and to find f_2(y) we add
over all values of x with Y = y. Then
f_1(x) = Σ_{all y} f(x, y)   and   f_2(y) = Σ_{all x} f(x, y).
This reasoning can be extended beyond two variables. For example, with three variables
(X_1, X_2, X_3),
f_1(x_1) = Σ_{all (x_2, x_3)} f(x_1, x_2, x_3)   and
f_{1,3}(x_1, x_3) = Σ_{all x_2} f(x_1, x_2, x_3) = P(X_1 = x_1, X_3 = x_3),
where f_{1,3}(x_1, x_3) is the marginal joint distribution of (X_1, X_3).
1.8.2 Conditional Probability Functions (described using Random Variables)
Again, we can extend a definition from events to random variables. For events A and B, recall
that P(A|B) = P(A ∩ B)/P(B). Since P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y), we make the
following definition.

Definition 11. The conditional probability function of X given Y = y is
f(x|y) = f(x, y) / f_2(y).
Similarly, f(y|x) = f(x, y) / f_1(x) (provided, of course, the denominator is not zero).

In our first example let us find f(x | Y = 1):
f(x | Y = 1) = f(x, 1) / f_2(1).
This gives:

   x              0              1              2
  f(x | Y = 1)   .1/.6 = 1/6    .2/.6 = 1/3    .3/.6 = 1/2

As you would expect, marginal and conditional probability functions are probability functions
in that they are always ≥ 0 and their sum is 1.
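These tabulations are mechanical, so they are easy to script. The sketch below is a small plain
Python example (not from the original notes) that recomputes the marginal and conditional
probability functions for the joint table used above.

```python
from collections import defaultdict

# Joint probability function f(x, y) from the example above.
f = {(0, 1): .1, (1, 1): .2, (2, 1): .3,
     (0, 2): .2, (1, 2): .1, (2, 2): .1}

f1 = defaultdict(float)   # marginal of X: sum over y
f2 = defaultdict(float)   # marginal of Y: sum over x
for (x, y), p in f.items():
    f1[x] += p
    f2[y] += p

# Conditional probability function of X given Y = 1.
f_given_y1 = {x: f[(x, 1)] / f2[1] for x in sorted(f1)}

print(dict(f1))       # ~ {0: 0.3, 1: 0.3, 2: 0.4}
print(dict(f2))       # ~ {1: 0.6, 2: 0.4}
print(f_given_y1)     # ~ {0: 1/6, 1: 1/3, 2: 1/2}
```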
1.9 Multinomial Distribution
This is the only multivariate model distribution introduced in this course, though other
multivariate distributions exist. The multinomial distribution defined below is very important.
It is a generalization of the binomial model to the case where each trial has k possible
outcomes.

Physical Setup: This distribution is the same as the Binomial except that there are k outcomes
rather than two. An experiment is repeated independently n times with k distinct outcomes each
time. Let the probabilities of these k outcomes be p_1, p_2, ..., p_k each time. Let X_1 be the
number of times the 1st outcome occurs, X_2 the number of times the 2nd outcome occurs, ...,
X_k the number of times the k-th outcome occurs. Then (X_1, X_2, ..., X_k) has a multinomial
distribution.

Notes:
(1) p_1 + p_2 + ... + p_k = 1
(2) X_1 + X_2 + ... + X_k = n.
If we wish, we can drop one of the variables (say the last), and just note that X_k equals
n − X_1 − X_2 − ... − X_{k−1}.

Illustration:
1. Suppose student marks are given in letter grades: A, B, C, D, or F. In a class of 80
students, the number of students getting A, B, ..., F might have a multinomial distribution
with n = 80 and k = 5.
Joint Probability Function: The joint probability function of X_1, ..., X_k is given by
extending the argument in the sprinters example from k = 3 to general k. There are
n! / (x_1! x_2! ... x_k!) different outcomes of the n trials in which x_1 are of the 1st type,
x_2 are of the 2nd type, etc. Each of these arrangements has probability
p_1^{x_1} p_2^{x_2} ... p_k^{x_k}, since p_1 is multiplied x_1 times in some order, etc.

Therefore f(x_1, x_2, ..., x_k) = [n! / (x_1! x_2! ... x_k!)] p_1^{x_1} p_2^{x_2} ... p_k^{x_k}

The restrictions on the x_i's are x_i = 0, 1, ..., n and Σ_{i=1}^{k} x_i = n.

As a check that Σ f(x_1, x_2, ..., x_k) = 1, we use the multinomial theorem to get
Σ [n! / (x_1! x_2! ... x_k!)] p_1^{x_1} ... p_k^{x_k} = (p_1 + p_2 + ... + p_k)^n = 1.
Here is another simple example.

Example: Every person has one of four blood types: A, B, AB and O. (This is important in
determining, for example, who may give a blood transfusion to a person.) In a large population,
let the fractions that have types A, B, AB and O, respectively, be p_1, p_2, p_3, p_4. Then, if
n persons are randomly selected from the population, the numbers X_1, X_2, X_3, X_4 of types A,
B, AB, O have a multinomial distribution with k = 4 (for Caucasian people, the values of the
p_i's are approximately p_1 = .45, p_2 = .08, p_3 = .03, p_4 = .44).

Note: We sometimes use the notation (X_1, ..., X_k) ~ Mult(n; p_1, ..., p_k) to indicate that
(X_1, ..., X_k) has a multinomial distribution.
1.10 Expectation for Multivariate Distributions:
Covariance and Correlation
It is easy to extend the definition of expected value to multiple variables. Generalizing
E[g(X)] = Σ_{all x} g(x) f(x) leads to the definition of expected value in the multivariate
case.

Definition 12.
E[g(X, Y)] = Σ_{all (x,y)} g(x, y) f(x, y)
and
E[g(X_1, X_2, ..., X_n)] = Σ_{all (x_1, ..., x_n)} g(x_1, x_2, ..., x_n) f(x_1, ..., x_n)
As before, these represent the average values of g(X, Y) and g(X_1, ..., X_n). E[g(X, Y)] could
also be determined by finding the probability function f_Z(z) of Z = g(X, Y) and then using the
definition of expected value E(Z) = Σ_{all z} z f_Z(z).

Example: Let the joint probability function f(x, y) be given by

                  x
   f(x, y)    0     1     2
  y   1      .1    .2    .3
      2      .2    .1    .1

Find E(XY) and E(X).

Solution:
E(XY) = Σ_{all (x,y)} xy f(x, y)
      = (0 × 1 × .1) + (1 × 1 × .2) + (2 × 1 × .3) + (0 × 2 × .2) + (1 × 2 × .1) + (2 × 2 × .1)
      = 1.4

To find E(X) we have a choice of methods. First, taking g(x, y) = x we get
E(X) = Σ_{all (x,y)} x f(x, y)
     = (0 × .1) + (1 × .2) + (2 × .3) + (0 × .2) + (1 × .1) + (2 × .1)
     = 1.1
Alternatively, since E(X) only involves X, we could find f_1(x) and use
E(X) = Σ_{x=0}^{2} x f_1(x) = (0 × .3) + (1 × .3) + (2 × .4) = 1.1
Property of Multivariate Expectation: It is easily proved (make sure you can do this) that
E[a g_1(X, Y) + b g_2(X, Y)] = a E[g_1(X, Y)] + b E[g_2(X, Y)]
This can be extended beyond 2 functions g_1 and g_2, and beyond 2 variables X and Y.
1.10.1 Relationships between Variables:
Definition 13. The covariance of X and Y, denoted Cov(X, Y) or σ_{XY}, is
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

For calculation purposes, it is easier to express the formula in the following form:
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E(XY − μ_X Y − X μ_Y + μ_X μ_Y)
          = E(XY) − μ_X E(Y) − μ_Y E(X) + μ_X μ_Y
          = E(XY) − E(X)E(Y) − E(Y)E(X) + E(X)E(Y)
Therefore Cov(X, Y) = E(XY) − E(X)E(Y)

Example:
In the example with joint probability function

                  x
   f(x, y)    0     1     2
  y   1      .1    .2    .3
      2      .2    .1    .1

find Cov(X, Y).

Solution: We previously calculated E(XY) = 1.4 and E(X) = 1.1. Similarly,
E(Y) = (1 × .6) + (2 × .4) = 1.4.
Therefore Cov(X, Y) = 1.4 − (1.1)(1.4) = −0.14
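The same bookkeeping can be checked numerically. The sketch below is a small plain Python
example (not from the original notes) that computes Cov(X, Y) and the correlation coefficient
(defined a little further on) directly from the joint table.

```python
import math

# Joint probability function from the example above.
f = {(0, 1): .1, (1, 1): .2, (2, 1): .3,
     (0, 2): .2, (1, 2): .1, (2, 2): .1}

def E(g):
    """Expected value of g(X, Y) under the joint probability function f."""
    return sum(g(x, y) * p for (x, y), p in f.items())

EX, EY, EXY = E(lambda x, y: x), E(lambda x, y: y), E(lambda x, y: x * y)
cov = EXY - EX * EY
var_x = E(lambda x, y: x * x) - EX ** 2
var_y = E(lambda x, y: y * y) - EY ** 2
rho = cov / math.sqrt(var_x * var_y)

print("Cov(X, Y):", cov)   # ~ -0.14
print("rho      :", rho)
```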
Theorem 3. If X and Y are independent then Cov(X, Y) = 0.

Proof: Recall E(X − μ_X) = E(X) − μ_X = 0. Let X and Y be independent. Then
f(x, y) = f_1(x) f_2(y), so
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
          = Σ_{all y} [ Σ_{all x} (x − μ_X)(y − μ_Y) f_1(x) f_2(y) ]
          = Σ_{all y} [ (y − μ_Y) f_2(y) Σ_{all x} (x − μ_X) f_1(x) ]
          = Σ_{all y} [ (y − μ_Y) f_2(y) E(X − μ_X) ]
          = Σ_{all y} 0 = 0

The following theorem gives a direct proof of the result above, and is useful in many other
situations.

Theorem 4. Suppose random variables X and Y are independent. Then, if g_1(X) and g_2(Y) are any
two functions,
E[g_1(X) g_2(Y)] = E[g_1(X)] E[g_2(Y)].
Using Theorem 4, Theorem 3 follows immediately: if X and Y are independent then
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
          = E(X − μ_X) E(Y − μ_Y) = 0

Caution: This result is not reversible. If Cov(X, Y) = 0 we cannot conclude that X and Y are
independent.
Example: Let (X, Y) have the joint probability function f(0, 0) = 0.2, f(1, 1) = 0.6,
f(2, 0) = 0.2; i.e. (X, Y) only takes on 3 values. Then

   x       0     1     2                    y       0     1
  f_1(x)  .2    .6    .2       and         f_2(y)  .4    .6

are the marginal probability functions. Since f_1(x) f_2(y) ≠ f(x, y), X and Y are not
independent. However,
E(XY) = (0 × 0 × .2) + (1 × 1 × .6) + (2 × 0 × .2) = .6
E(X) = (0 × .2) + (1 × .6) + (2 × .2) = 1   and   E(Y) = (0 × .4) + (1 × .6) = .6
Therefore, Cov(X, Y) = E(XY) − E(X)E(Y) = .6 − (1)(.6) = 0

So X and Y have covariance 0 but are not independent. If Cov(X, Y) = 0 we say that X and Y are
uncorrelated, because of the definition of correlation given below.

Definition 14. The correlation coefficient of X and Y is ρ = Cov(X, Y) / (σ_X σ_Y)

The correlation coefficient measures the strength of the linear relationship between X and Y
and is simply a rescaled version of the covariance, scaled to lie in the interval [−1, 1].

Properties of ρ:
1) Since σ_X and σ_Y, the standard deviations of X and Y, are both positive, ρ has the same
sign as Cov(X, Y). Hence the interpretation of the sign of ρ is the same as for Cov(X, Y), and
ρ = 0 if X and Y are independent. When ρ = 0 we say that X and Y are uncorrelated.
2) −1 ≤ ρ ≤ 1, and as ρ → ±1 the relation between X and Y becomes one-to-one and linear.
1.11 Mean and Variance of a Linear Combination of Random Variables
Many problems require us to consider linear combinations of random variables; examples will be
given below and in later chapters. Although writing down the formulas is somewhat tedious, we
give here some important results about their means and variances.

Results for Means:
1. E(aX + bY) = aE(X) + bE(Y) = a μ_X + b μ_Y, when a and b are constants. (This follows from
the definition of expected value.) In particular, E(X + Y) = μ_X + μ_Y and
E(X − Y) = μ_X − μ_Y.
2. Let a_i be constants (real numbers) and E(X_i) = μ_i. Then E(Σ a_i X_i) = Σ a_i μ_i. In
particular, E(Σ X_i) = Σ E(X_i).
3. Let X_1, X_2, ..., X_n be random variables which have mean μ. (You can imagine these being
some sample results from an experiment such as recording the number of occupants in cars
travelling over a toll bridge.) The sample mean is X̄ = (Σ_{i=1}^{n} X_i)/n. Then E(X̄) = μ.
Results for Covariance:
1. Cov(X, X) = E[(X − μ_X)(X − μ_X)] = E[(X − μ)²] = Var(X)
2. Cov(aX + bY, cU + dW) = ac Cov(X, U) + ad Cov(X, W) + bc Cov(Y, U) + bd Cov(Y, W), where a,
b, c, and d are constants.
This type of result can be generalized, but gets messy to write out.

Results for Variance:
1. Variance of a linear combination of random variables:
Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
2. Variance of a sum of independent random variables: Let X and Y be independent. Since
Cov(X, Y) = 0, result 1 gives
Var(X + Y) = σ²_X + σ²_Y;
i.e., for independent variables, the variance of a sum is the sum of the variances. Also note
Var(X − Y) = σ²_X + (−1)² σ²_Y = σ²_X + σ²_Y;
i.e., for independent variables, the variance of a difference is the sum of the variances.
3. Variance of a general linear combination of random variables:
Let a_i be constants and Var(X_i) = σ²_i. Then
Var(Σ a_i X_i) = Σ a²_i σ²_i + 2 Σ_{i<j} a_i a_j Cov(X_i, X_j).
This is a generalization of result 1 and can be proved using either of the methods used for
result 1.

4. Variance of a linear combination of independent random variables: Special cases of result 3
are:
a) If X_1, X_2, ..., X_n are independent then Cov(X_i, X_j) = 0, so that
Var(Σ a_i X_i) = Σ a²_i σ²_i.
b) If X_1, X_2, ..., X_n are independent and all have the same variance σ², then
Var(X̄) = σ²/n

Remark: This result is very important in probability and statistics. To recap, it says that if
X_1, ..., X_n are independent random variables with the same mean μ and the same variance σ²,
then the sample mean X̄ = (1/n) Σ_{i=1}^{n} X_i has
E(X̄) = μ
Var(X̄) = σ²/n
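A quick simulation makes the remark concrete: the variance of the sample mean of n independent
draws shrinks like σ²/n. The sketch below is illustrative only and assumes the third-party
numpy package (not referenced in the original notes).

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 25, 100_000
sigma2 = 4.0                                   # population variance

# 'reps' independent samples of size n, each from a distribution with variance 4.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1)

print("observed Var(Xbar):", xbar.var())
print("sigma^2 / n       :", sigma2 / n)       # 0.16
```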
1.11.1 Indicator Variables
The results for linear combinations of random variables provide a way of breaking up more
complicated problems involving mean and variance into simpler pieces using indicator variables;
an indicator variable is just a binary variable (0 or 1) that indicates whether or not some
event occurs. We'll illustrate this important method with an example.

Example: We have N letters to N different people, and N envelopes addressed to those people.
One letter is put in each envelope at random. Find the mean and variance of the number of
letters placed in the right envelope.

Solution:
Let X_i = 0 if letter i is not in envelope i, and X_i = 1 if letter i is in envelope i.
Then Σ_{i=1}^{N} X_i is the number of correctly placed letters. Once again, the X_i's are
dependent (why?).
First, E(X_i) = Σ_{x_i=0}^{1} x_i f(x_i) = f(1) = 1/N = E(X²_i) (since there is 1 chance in N
that letter i will be put in envelope i), and then
Var(X_i) = E(X²_i) − [E(X_i)]² = 1/N − 1/N² = (1/N)(1 − 1/N)

Exercise: Before calculating Cov(X_i, X_j), what sign do you expect it to have? (If letter i is
correctly placed, does that make it more or less likely that letter j will be placed
correctly?)

Next, E(X_i X_j) = f(1, 1) (as in the last example, this is the only non-zero term in the sum).
Now, f(1, 1) = (1/N) × 1/(N − 1), since once letter i is correctly placed there is 1 chance in
N − 1 of letter j going in envelope j.
Therefore E(X_i X_j) = 1/[N(N − 1)]

For the covariance,
Cov(X_i, X_j) = E(X_i X_j) − E(X_i)E(X_j) = 1/[N(N − 1)] − (1/N)(1/N)
             = (1/N)[1/(N − 1) − 1/N]
             = 1/[N²(N − 1)]

E(Σ_{i=1}^{N} X_i) = Σ_{i=1}^{N} E(X_i) = Σ_{i=1}^{N} 1/N = N(1/N) = 1

Var(Σ_{i=1}^{N} X_i) = Σ_{i=1}^{N} Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j)
  = Σ_{i=1}^{N} (1/N)(1 − 1/N) + 2 (N choose 2) × 1/[N²(N − 1)]
  = 1 − 1/N + 2 [N(N − 1)/2] × 1/[N²(N − 1)]
  = 1 − 1/N + 1/N
  = 1

(Common sense often helps in this course, but we have found no way of being able to say this
result is obvious. On average 1 letter will be correctly placed and the variance will be 1,
regardless of how many letters there are.)
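A short simulation of the letter-matching experiment backs up the result. The sketch below is a
plain Python illustration (not from the original notes); try changing N and note that the mean
and variance of the match count stay near 1.

```python
import random

random.seed(7)
N, reps = 10, 50_000
counts = []
for _ in range(reps):
    envelope = list(range(N))
    random.shuffle(envelope)                            # random letter-to-envelope assignment
    counts.append(sum(i == envelope[i] for i in range(N)))

mean = sum(counts) / reps
var = sum((c - mean) ** 2 for c in counts) / reps
print("mean matches:", mean)   # ~ 1
print("variance    :", var)    # ~ 1
```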
1.12 Continuous Distributions
As expected, a continuous distribution is one for which the support of the random variable is
continuous. Again, time is a very common example of such a support. Even the surface area of
the face of a dartboard can be written as the Cartesian product of two continuous random
variables.
1.12.1 Notation
The interpretation of the probability functions for continuous random variables is different
from that of the discrete variety. The probability of any single point in a continuous
distribution is 0. With continuous distributions, we are therefore more interested in the
probability of a particular range of values.

For continuous distributions, it is easier to consider the cumulative distribution function
first. This is again a non-decreasing function taking all values from 0 to 1 (it is also
monotonic when restricted to the support set) with the following properties:
1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
2. F(x) is a non-decreasing function of x.
3. P(a < X ≤ b) = F(b) − F(a) = ∫_a^b f(x) dx (by the Fundamental Theorem of Calculus); notice
that for continuous random variables there is no difference between strictly less than and less
than or equal to.

The probability density function is the analogue of the discrete probability mass function.
This function is more representative of the relative likelihood of different outcomes and is
defined through
f(x) Δx ≈ P(x < X < x + Δx) = F(x + Δx) − F(x)
As Δx becomes smaller, we divide the probability by Δx so that probabilities over intervals of
length Δx can be compared. This is the definition of a derivative, so we see that
f(x) = (d/dx) F(x) for CDF F(x).
1.12.2 Uniform
This is very similar to the discrete version: every outcome is equiprobable over the support.
This random variable also has two parameters, a and b, the minimum and maximum. The support is
the set of all real numbers in [a, b].

f(x) = 1/(b − a)
F(x) = (x − a)/(b − a)
E[X] = (a + b)/2
Var(X) = (b − a)²/12
1.12.3 Normal
This distribution models data which cluster around an average. It has two parameters: the mean
μ and the variance σ². The Gaussian distribution is the same distribution, sometimes
parameterized by the standard deviation σ rather than the variance. The mean can take any real
value, whereas the variance must be a positive real number.

f(x) = 1/(σ√(2π)) e^{−(x − μ)²/(2σ²)}
E[X] = μ
Var(X) = σ²

It is important to note that there is no closed-form expression for the cumulative distribution
function of the normal distribution. We use a table of values to determine approximate
probabilities and quantiles.

There is a very important case of the Normal distribution: the standard normal, which has a
mean of 0 and a variance of 1. Given any random variable X ~ Normal(μ, σ²), we can standardize
it by the operation
(X − μ)/σ = Z ~ Normal(0, 1).
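In place of a printed table, the standard normal CDF can be evaluated with the error function
from Python's standard library. The sketch below is illustrative only; the numbers μ = 10 and
σ² = 4 are made up for the example.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma^2), via the standard normal."""
    z = (x - mu) / sigma                       # standardize
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Example: X ~ Normal(mu=10, sigma^2=4); P(8 < X <= 12) = P(-1 < Z <= 1)
p = normal_cdf(12, 10, 2) - normal_cdf(8, 10, 2)
print(round(p, 4))                             # about 0.6827
```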
1.12.4 Exponential
The exponential distribution is used to model the times between events in a Poisson process. It
has one parameter, the rate λ, and its support is the non-negative real numbers.

f(x) = λ e^{−λx}
F(x) = 1 − e^{−λx}
E[X] = 1/λ
Var(X) = 1/λ²
1.12.5 Gamma
The gamma distribution is often used to model the waiting time until a certain number of
events. It can also be expressed as the sum of finitely many independent and identically
distributed exponential random variables. This distribution has two parameters: the number of
exponential terms, n, and the scale parameter λ. Its density involves the Gamma function, which
has some very useful properties.

The gamma function has one argument, a real number, and is defined as:
Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx

This function has three major, important properties:
1. Γ(z) = (z − 1) Γ(z − 1)   (if z is a positive integer, Γ(z) = (z − 1)!)
2. Γ(1) = 1
3. Γ(1/2) = √π

The distribution itself has the following properties and definitions:

f(x) = x^{n−1} e^{−x/λ} / (Γ(n) λ^n)
F(x) = ∫_0^{x/λ} t^{n−1} e^{−t} dt / Γ(n)
E[X] = nλ
Var(X) = nλ²
1.12.6 Chi-squared (χ²)
This distribution is formed by taking the sum of n independent squared standard normal random
variables. It has one parameter, n, called the degrees of freedom. In many instances it refers
to the number of independent terms in a statistic.

Similar to the Gamma, there are many important properties of the χ² distribution. Most of them
follow directly from the fact that the distribution is derived from a sum of squared normally
distributed random variables.

f(x) = (1/2)^{n/2} / Γ(n/2) × x^{n/2 − 1} e^{−x/2}
F(x) = intractable (no closed form)
E[X] = n
Var[X] = 2n

Consider two independent χ² random variables with m and n degrees of freedom (i.e. M ~ χ²_m and
N ~ χ²_n); then their sum is also a χ² random variable, with m + n degrees of freedom (i.e.
M + N ~ χ²_{m+n}). This is intuitive, as it follows from the fact that χ² random variables are
derived from sums of squared normals. Since the distribution function cannot be written
explicitly, we must use tables to determine probabilities.

Unlike the Normal, this distribution lacks symmetry, so quantiles must be tabulated for both
tails. The table will be of the form:

 Degrees of Freedom    0.001          ...    0.999
 1                     1.57 × 10⁻⁶    ...    10.8
 ...                   ...            ...    ...
 100                   61.9           ...    149

Each row shows the quantiles for the given degrees of freedom. The quantiles match up with
their respective probabilities along the columns. If the test sample has more than 100 degrees
of freedom, then use the Normal distribution table instead.
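The construction described above is easy to reproduce by simulation. The sketch below is
illustrative only (it assumes the third-party numpy package): summing n squared standard
normals should give a sample mean near n and a sample variance near 2n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 5, 200_000

z = rng.standard_normal(size=(reps, n))
chi2 = (z ** 2).sum(axis=1)              # each row: sum of n squared N(0,1) draws

print("sample mean    :", chi2.mean())   # ~ n = 5
print("sample variance:", chi2.var())    # ~ 2n = 10
```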
1.13 Order Statistics
Given a series of n random variables, we may be interested in the distribution of the maximum
or minimum value. First, let us assume that we have a series of n independent random variables
X_i, each with CDF F_i(x). Then let T = max X_i. We can find the CDF of this random variable:

P(T ≤ t) = P(max X_i ≤ t)                      (if the largest is at most t, then they all are)
         = P(X_1 ≤ t, X_2 ≤ t, ..., X_n ≤ t)
         = P(X_1 ≤ t) P(X_2 ≤ t) ... P(X_n ≤ t)    (by independence)
         = Π_{i=1}^{n} F_i(t)

Similarly, let Y = min X_i. Then

P(Y ≤ x) = P(min X_i ≤ x)
         = 1 − P(min X_i > x)
         = 1 − P(X_1 > x, X_2 > x, ..., X_n > x)
         = 1 − P(X_1 > x) P(X_2 > x) ... P(X_n > x)
         = 1 − (1 − F_1(x))(1 − F_2(x)) ... (1 − F_n(x))
         = 1 − Π_{i=1}^{n} (1 − F_i(x))

As an exercise, determine the distribution of the minimum of a series of n independent and
identically distributed exponential random variables.
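Before tackling the exercise, the product formula for the maximum can be checked empirically.
The sketch below is illustrative only (it assumes the third-party numpy package): for n
independent Uniform(0, 1) variables, P(max ≤ t) should be close to t^n.

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps, t = 4, 100_000, 0.8

u = rng.uniform(0.0, 1.0, size=(reps, n))
maxima = u.max(axis=1)

print("empirical P(max <= t):", (maxima <= t).mean())
print("product of CDFs, t**n:", t ** n)        # 0.4096
```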
1.14 Summary of Distributions

Discrete Uniform(a, b): f(x) = 1/(b − a + 1); F(x) = (x − a + 1)/(b − a + 1);
  mean (a + b)/2; variance [(b − a + 1)² − 1]/12; support x ∈ Z ∩ [a, b].
Bernoulli(p): f(x) = p^x (1 − p)^{1−x}; F(0) = 1 − p, F(1) = 1;
  mean p; variance p(1 − p); support x ∈ Z ∩ [0, 1].
Binomial(n, p): f(x) = (n choose x) p^x (1 − p)^{n−x}; F(x) = Σ_{i=0}^{x} f(i);
  mean np; variance np(1 − p); support x ∈ Z ∩ [0, n].
Geometric(p): f(x) = p(1 − p)^x; F(x) = 1 − (1 − p)^{x+1};
  mean (1 − p)/p; variance (1 − p)/p²; support x ∈ Z ∩ [0, ∞).
Negative Binomial(k, p): f(x) = (x + k − 1 choose k − 1) p^k (1 − p)^x; F(x) = Σ_{i=0}^{x} f(i);
  mean k(1 − p)/p; variance k(1 − p)/p²; support x ∈ Z ∩ [0, ∞).
Hypergeometric(N, M, n): f(x) = (M choose x)(N − M choose n − x)/(N choose n);
  F(x) = Σ_{i=0}^{x} f(i); mean nM/N; variance n(M/N)(1 − M/N)(N − n)/(N − 1);
  support x ∈ Z ∩ [max(0, n + M − N), min(M, n)].
Poisson(λt): f(x) = (λt)^x e^{−λt}/x!; F(x) = e^{−λt} Σ_{i=0}^{x} (λt)^i/i!;
  mean λt; variance λt; support x ∈ Z ∩ [0, ∞).
Continuous Uniform(a, b): f(x) = 1/(b − a); F(x) = (x − a)/(b − a);
  mean (a + b)/2; variance (b − a)²/12; support x ∈ R ∩ [a, b].
Normal(μ, σ²): f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}; F(x) intractable;
  mean μ; variance σ²; support x ∈ R.
Exponential(λ): f(x) = λ e^{−λx}; F(x) = 1 − e^{−λx};
  mean 1/λ; variance 1/λ²; support x ∈ R ∩ [0, ∞).
Gamma(n, λ): f(x) = x^{n−1} e^{−x/λ}/(Γ(n) λ^n); F(x) = ∫_0^{x/λ} t^{n−1} e^{−t} dt/Γ(n);
  mean nλ; variance nλ²; support x ∈ R ∩ [0, ∞).
χ²(n): f(x) = (1/2)^{n/2} x^{n/2−1} e^{−x/2}/Γ(n/2); F(x) intractable;
  mean n; variance 2n; support x ∈ R ∩ [0, ∞).

1.15 Appendix - Maximum Likelihood Estimation

There are two major schools of thought on how to estimate parameters: Frequentist and Bayesian.
We will concentrate only on the Frequentist methodology.

Frequentists use Maximum Likelihood Estimation. It is a method which maximizes the probability
of the observed data as a function of the parameters, which is to say that it finds the most
likely parameter values.

To find this, we create a function called the Likelihood Function, defined as follows for n
observations:
L(θ) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

For our data set we usually impose the independent and identically distributed assumptions. The
Likelihood Function can be simplified under these assumptions:
L(θ) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
     = P(X_1 = x_1) P(X_2 = x_2) ... P(X_n = x_n)
     = Π_{i=1}^{n} f(x_i; θ)

Since we intend to maximize this function, it is often more convenient to use the
Log-Likelihood Function, which is the natural logarithm of the Likelihood Function:
ℓ(θ) = ln(L(θ))

Let's look at some common examples.

Example 1
Suppose we have a set of n independent and identically distributed data points from an
exponential distribution, x_1, x_2, ..., x_n. Find the Likelihood Function, the Log-Likelihood
Function, and the value which maximizes them (called the Maximum Likelihood Estimator).

Finding the Likelihood Function: since we have independent and identically distributed data
points from a known distribution, the likelihood takes the form
L(λ) = Π_{i=1}^{n} f(x_i; λ)
     = Π_{i=1}^{n} λ e^{−λ x_i}
     = λ^n e^{−λ Σ_{i=1}^{n} x_i}

Finding the Log-Likelihood Function: taking the natural log of the likelihood yields
ℓ(λ) = ln(L(λ))
     = ln(λ^n e^{−λ Σ_{i=1}^{n} x_i})
     = n ln λ − λ Σ_{i=1}^{n} x_i

Finding the Maximum Likelihood Estimator (MLE): by differentiating either the Likelihood or the
Log-Likelihood, we can find a critical point. For many of the common distributions (i.e. mostly
the ones with names), there is a single global maximum, but not for all distributions.

The advantage of the Log-Likelihood over the Likelihood is that constants can be put into
separate terms, which then disappear when we take the derivative. For this example we will use
the Log-Likelihood.

ℓ'(λ) = d/dλ (n ln λ − λ Σ_{i=1}^{n} x_i)
      = n/λ − Σ_{i=1}^{n} x_i

Setting this expression equal to 0 and solving for λ̂ yields
0 = n/λ̂ − Σ_{i=1}^{n} x_i
λ̂ = n / Σ_{i=1}^{n} x_i = 1/x̄
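As a numerical sanity check, the analytic answer λ̂ = 1/x̄ can be compared with a direct
maximization of ℓ(λ). The sketch below is illustrative only (it assumes the third-party numpy
package, and the simulated data and grid search are just one convenient way to do the check).

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1 / 2.5, size=1_000)   # data with true rate lambda = 2.5

def loglik(lam):
    # l(lambda) = n ln(lambda) - lambda * sum(x_i)
    return len(x) * np.log(lam) - lam * x.sum()

grid = np.linspace(0.1, 10.0, 100_000)
lam_grid = grid[np.argmax(loglik(grid))]

print("grid-search MLE:", lam_grid)
print("1 / sample mean:", 1 / x.mean())
```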
Example 2
Find the MLE for a Binomial(n, π). Note that in many cases n is already known, so the real
value of interest is π.

Finding the likelihood function for k iid random variables:
L(π; x) = Π_{i=1}^{k} f(x_i)
        = Π_{i=1}^{k} (n choose x_i) π^{x_i} (1 − π)^{n − x_i}
        = π^{Σ_{i=1}^{k} x_i} (1 − π)^{nk − Σ_{i=1}^{k} x_i} Π_{i=1}^{k} (n choose x_i)
Now find the log-likelihood function:
ℓ(π) = ln[ π^{Σ x_i} (1 − π)^{nk − Σ x_i} Π_{i=1}^{k} (n choose x_i) ]
     = (Σ_{i=1}^{k} x_i) ln π + (nk − Σ_{i=1}^{k} x_i) ln(1 − π) + C

Taking the derivative with respect to the parameter π:
ℓ'(π̂) = (Σ_{i=1}^{k} x_i)/π̂ − (nk − Σ_{i=1}^{k} x_i)/(1 − π̂)

Solve for π̂:
0 = (Σ_{i=1}^{k} x_i)/π̂ − (nk − Σ_{i=1}^{k} x_i)/(1 − π̂)
0 = (1 − π̂) Σ_{i=1}^{k} x_i − π̂ (nk − Σ_{i=1}^{k} x_i)
  = Σ_{i=1}^{k} x_i − nk π̂
nk π̂ = Σ_{i=1}^{k} x_i
π̂ = (Σ_{i=1}^{k} x_i)/(nk) = x̄/n

As an exercise, find the MLE for the Normal Distribution.
1.16 Chapter Problems
1.16.1a List a sample space for tossing a fair coin 3 times.

1.16.1b What is the probability of 2 consecutive tails (but not 3)?

1.16.2 Machine Recognition of Handwritten Digits. Suppose that you have an optical scanner and
associated software for determining which of the digits 0, 1, ..., 9 an individual has written
in a square box. The system may of course be wrong sometimes, depending on the legibility of
the handwritten number.

1.16.2a List a sample space S that includes points (x, y), where x stands for the number
actually written, and y stands for the number that the machine identifies.
1.16.2b Suppose that the machine is asked to identify very large numbers of digits, of which
0, 1, ..., 9 occur equally often, and suppose that the following probabilities apply to the
points in your sample space:
p(0, 6) = p(6, 0) = .004;   p(0, 0) = p(6, 6) = .096
p(5, 9) = p(9, 5) = .005;   p(5, 5) = p(9, 9) = .095
p(4, 7) = p(7, 4) = .002;   p(4, 4) = p(7, 7) = .098
p(y, y) = .100 for y = 1, 2, 3, 8
Give a table with probabilities for each point (x, y) in S. What fraction of numbers is
correctly identified?

1.16.3 Prove the following Expected Value properties:
1.16.3a E[a g(X) + b] = a E[g(X)] + b, for constants a and b.
1.16.3b E[a g_1(X) + b g_2(X)] = a E[g_1(X)] + b E[g_2(X)], for constants a and b.

1.16.4 Expected Winnings in a Lottery
A small lottery sells 1000 tickets numbered 000, 001, ..., 999; the tickets cost $10 each, and
each one has a unique number. When all the tickets have been sold the draw takes place: this
consists of a single ticket from 000 to 999 being chosen at random. For ticket holders, the
prize structure is as follows:
Your ticket is drawn - win $5000.
Your ticket has the same first two numbers as the winning ticket, but the third is different -
win $100.
Your ticket has the same first number as the winning ticket, but the second number is different
- win $10.
All other cases - win nothing.
Let the random variable X represent the winnings from a given ticket. Find E(X).

1.16.5 Diagnostic Medical Tests: along with expensive, highly accurate tests used for
diagnosing the presence of some conditions in a person, there are also cheap, less accurate
ones. Suppose we have two cheap tests and one expensive test, with the following
characteristics: all three tests are positive if a person has the condition (there are no false
negatives), but the cheap tests can give false positives.
Let a person be chosen at random, and let D = {person has the condition}. The three tests are
Test 1: P(positive test | D̄) = .05; test costs $5.00
Test 2: P(positive test | D̄) = .03; test costs $8.00
Test 3: P(positive test | D̄) = 0; test costs $40.00
We want to check a large number of people for the condition, and have to choose among three
testing strategies:
1.16.5a Use Test 1, followed by Test 3 if Test 1 is positive.
1.16.5b Use Test 2, followed by Test 3 if Test 2 is positive.
1.16.5c Use Test 3.
Determine the expected cost per person under each of strategies (a), (b) and (c). We will then
choose the strategy with the lowest expected cost. It is known that about .001 of the
population have the condition (i.e. P(D) = .001, P(D̄) = .999).

1.16.6 Prove the following: let the random variable X have m.g.f. M(t). Then
E(X^r) = M^(r)(0),   r = 1, 2, ...
where M^(r)(0) stands for d^r M(t)/dt^r evaluated at t = 0.

1.16.7 Show that the Poisson distribution with probability function
f(x) = e^{−μ} μ^x / x!,   x = 0, 1, 2, ...
has m.g.f. M(t) = e^{−μ + μe^t}. Then show that E(X) = μ and Var(X) = μ.

1.16.8 A potter is producing teapots one at a time. Assume that they are produced independently
of each other, and that with probability p the pot produced will be satisfactory; the rest are
sold at a lower price. Let X be the number of rejects before a satisfactory teapot is produced
and recorded. When 12 satisfactory teapots are produced, there are 12 recorded values of X,
giving the number of rejects before each satisfactory teapot. What is the probability that the
12 values of X will consist of six 0's, three 1's, two 2's and one value which is ≥ 3?

1.16.9 Consider a race with competitors A, B, and C. They compete in 10 matches and have
probabilities of winning 0.5, 0.4, and 0.1 respectively. You can see that we only need to use
X_1 and X_2 in our formula:
f(x_1, x_2) = [10! / (x_1! x_2! (10 − x_1 − x_2)!)] (.5)^{x_1} (.4)^{x_2} (.1)^{10 − x_1 − x_2}
where A wins x_1 times and B wins x_2 times in the 10 races. Find E(X_1 X_2).
1.16.10 The joint probability function of (X, Y) is:

                  x
   f(x, y)    0     1     2
  y   0      .06   .15   .09
      1      .14   .35   .21

Calculate the correlation coefficient, ρ. What does it indicate about the relationship between
X and Y?

1.16.11 Prove the covariance results stated in Section 1.11 (Results for Covariance).

1.16.12 Suppose that n people take a blood test for a disease, where each person has
probability p of having the disease, independently of other persons. To save time and money,
blood samples from k people are pooled and analyzed together. If none of the k persons has the
disease then the test will be negative, but otherwise it will be positive. If the pooled test
is positive then each of the k persons is tested separately (so k + 1 tests are done in that
case).
1.16.12a Let X be the number of tests required for a group of k people. Show that
E(X) = k + 1 − k(1 − p)^k.
1.16.12b What is the expected number of tests required for n/k groups of k people each? If
p = .01, evaluate this for the cases k = 1, 5, 10.
1.16.12c Show that if p is small, then the expected number of tests in part (b) is
approximately n(kp + k^{−1}), and is minimized for k ≈ p^{−1/2}.

1.16.13 A manufacturer of car radios ships them to retailers in cartons of n radios. The profit
per radio is $59.50, less a shipping cost of $25 per carton, so the profit is $(59.5n − 25) per
carton. To promote sales by assuring high quality, the manufacturer promises to pay the
retailer $200X² if X radios in the carton are defective. (The retailer is then responsible for
repairing any defective radios.) Suppose radios are produced independently and that 5% of
radios are defective. How many radios should be packed per carton to maximize the expected net
profit per carton?
1.16.14 Let X have a geometric distribution with probability function
f(x) = p(1 − p)^x;   x = 0, 1, 2, ...
1.16.14a Calculate the m.g.f. M(t) = E(e^{tX}), where t is a parameter.
1.16.14b Find the mean and variance of X.
1.16.14c Use your result in (b) to show that if p is the probability of success (S) in a
sequence of Bernoulli trials, then the expected number of trials until the first S occurs is
1/p. Explain why this is obvious.

1.16.15 Find the distributions that correspond to the following moment-generating functions:
1.16.15a M(t) = 1/(3 − 2e^t), for t < ln(3/2)
1.16.15b M(t) = e^{2(e^t − 1)}, for t < ∞

1.16.16 Find the moment generating function of the discrete uniform distribution X on
{a, a + 1, ..., b}:
P(X = x) = 1/(b − a + 1), for x = a, a + 1, ..., b.
What do you get in the special case a = b and in the case b = a + 1? Use the moment generating
function in these two cases to confirm the expected value and the variance of X.

1.16.17 Let X be a random variable taking values in the set {0, 1, 2} with moments E(X) = 1,
E(X²) = 3/2.
1.16.17a Find the moment generating function of X.
1.16.17b Find the first six moments of X.
1.16.17c Find P(X = i), i = 0, 1, 2.
1.16.17d Show that any probability distribution on {0, 1, 2} is completely determined by its
first two moments.

1.16.18 Assume that each week a stock either increases in value by $1 with probability 1/2 or
decreases by $1, these moves independent of the past. The current price of the stock is $50. I
wish to purchase a call option which gives me the option of buying the stock 13 weeks from now
at a strike price of $55. Of course, if the stock price at that time is $55 or less there is no
benefit to the option and it is not exercised. Assume that the return from the option is
g(S_13) = max(S_13 − 55, 0)
where S_13 is the price of the stock in 13 weeks. What is the fair price of the option today,
assuming no transaction costs and 0% interest; i.e. what is E[g(S_13)]?

Solutions to Chapter Problems

1.15.1a List a sample space for tossing a fair coin 3 times.
The sample space consists of all possible outcomes. Let H and T denote that the result of a
toss is heads or tails respectively, and let a sequence of outcomes be represented as a string.
Then the sample space for 3 tosses of a fair coin is
S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}

1.15.1b What is the probability of 2 consecutive tails (but not 3)?
This event is the subset of the sample space
S_A = {HTT, TTH}
Since all outcomes are equiprobable (because we have assumed a fair coin), the probability of
exactly 2 consecutive tails is 2/8 = 1/4.
1.15.2 Machine Recognition of Handwritten Digits. Suppose that you have an optical scanner and
associated software for determining which of the digits 0, 1, ..., 9 an individual has written
in a square box. The system may of course be wrong sometimes, depending on the legibility of
the handwritten number.

1.15.2a Describe a sample space S that includes points (x, y), where x stands for the number
actually written, and y stands for the number that the machine identifies.
The sample space for this model is every pair of x and y, so it is of the form
S = {(0, 0), (0, 1), ..., (9, 9)}

1.15.2b Suppose that the machine is asked to identify very large numbers of digits, of which
0, 1, ..., 9 occur equally often, and suppose that the following probabilities apply to the
points in your sample space:
p(0, 6) = p(6, 0) = .004;   p(0, 0) = p(6, 6) = .096
p(5, 9) = p(9, 5) = .005;   p(5, 5) = p(9, 9) = .095
p(4, 7) = p(7, 4) = .002;   p(4, 4) = p(7, 7) = .098
p(y, y) = .100 for y = 1, 2, 3, 8
Give a table with probabilities for each point (x, y) in S. What fraction of numbers is
correctly identified?

 x\y    0     1     2     3     4     5     6     7     8     9
 0    0.096 0.000 0.000 0.000 0.000 0.000 0.004 0.000 0.000 0.000
 1    0.000 0.100 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 2    0.000 0.000 0.100 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 3    0.000 0.000 0.000 0.100 0.000 0.000 0.000 0.000 0.000 0.000
 4    0.000 0.000 0.000 0.000 0.098 0.000 0.000 0.002 0.000 0.000
 5    0.000 0.000 0.000 0.000 0.000 0.095 0.000 0.000 0.000 0.005
 6    0.004 0.000 0.000 0.000 0.000 0.000 0.096 0.000 0.000 0.000
 7    0.000 0.000 0.000 0.000 0.002 0.000 0.000 0.098 0.000 0.000
 8    0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.100 0.000
 9    0.000 0.000 0.000 0.000 0.000 0.005 0.000 0.000 0.000 0.095

The fraction of correctly identified numbers is also 1 minus the fraction of incorrectly
identified numbers:
= 1 − (P(0, 6) + P(6, 0) + P(7, 4) + P(4, 7) + P(9, 5) + P(5, 9))
= 1 − (0.004 + 0.004 + 0.002 + 0.002 + 0.005 + 0.005)
= 1 − 0.022
= 0.978
1.15.3 Prove Theorem 1: suppose the random variable X has probability function f(x). Then the
expected value of some function g(X) of X is given by
E[g(X)] = Σ_{all x} g(x) f(x)
Proof:
Let Y be a random variable representing g(X). Then Y has probability function of the form
h(y) = Σ_{x: g(x) = y} f(x)
Y is just another random variable, and hence has expectation
E[Y] = Σ_{all y} y h(y) = Σ_{all y} y Σ_{x: g(x) = y} f(x)
     = Σ_{all y} Σ_{x: g(x) = y} g(x) f(x) = Σ_{all x} g(x) f(x)
1.15.4 Expected Winnings in a Lottery
A small lottery sells 1000 tickets numbered 000, 001, ..., 999; the tickets cost $10 each. When
all the tickets have been sold the draw takes place: this consists of a single ticket from 000
to 999 being chosen at random. For ticket holders the prize structure is as follows:
Your ticket is drawn - win $5000.
Your ticket has the same first two numbers as the winning ticket, but the third is different -
win $100.
Your ticket has the same first number as the winning ticket, but the second number is different
- win $10.
All other cases - win nothing.
Let the random variable X represent the winnings from a given ticket. Find E(X).

The probabilities of the three prizes are 1/1000, 9/1000 and 90/1000 respectively, so the
expected value of the purchase of a single ticket, net of its $10 cost, is
E(X) = 5000 (1/1000) + 100 (9/1000) + 10 (90/1000) − 10
     = 5 + 0.9 + 0.9 − 10
     = −3.2
1.15.5 Example: Diagnostic Medical Tests: Often there are cheaper, less accurate tests for diagnosing the presence of some condition in a person, along with more expensive, accurate tests. Suppose we have two cheap tests and one expensive test, with the following characteristics. All three tests are positive if a person has the condition (there are no false negatives), but the cheap tests give false positives.
Let a person be chosen at random, and let D = {person has the condition}. The three tests are
Test 1: P(positive test | D̄) = .05; test costs $5.00
Test 2: P(positive test | D̄) = .03; test costs $8.00
Test 3: P(positive test | D̄) = 0; test costs $40.00
We want to check a large number of people for the condition, and have to choose among three testing strategies:
1.15.5a Use Test 1, followed by Test 3 if Test 1 is positive.
1.15.5b Use Test 2, followed by Test 3 if Test 2 is positive.
1.15.5c Use Test 3.
Determine the expected cost per person under each of strategies (a), (b) and (c). We will then choose the strategy with the lowest expected cost. It is known that about .001 of the population have the condition (i.e. P(D) = .001, P(D̄) = .999).
For strategy (a):
Expected Cost = 5 + 40(0.001 + 0.05 × 0.999) ≈ 7.04
For strategy (b):
Expected Cost = 8 + 40(0.001 + 0.03 × 0.999) ≈ 9.24
For strategy (c):
Expected Cost = 40
Therefore, the least expensive option is to use the first strategy.
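The three expected costs can be reproduced with a small R sketch; the helper function below is purely illustrative and assumes the prevalence and false-positive rates given above:

pD <- 0.001                                      # prevalence of the condition
cost <- function(cheap_cost, false_pos_rate) {
  p_positive <- pD + false_pos_rate * (1 - pD)   # cheap test positive, so Test 3 follows
  cheap_cost + 40 * p_positive
}
cost(5, 0.05)   # strategy (a): about 7.04
cost(8, 0.03)   # strategy (b): about 9.24
40              # strategy (c)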
1.15.6 Prove the following: let the random variable X have m.g.f. M(t). Then
E(X^r) = M^(r)(0), r = 1, 2, ...
where M^(r)(0) stands for d^r M(t)/dt^r evaluated at t = 0.
Proof:
By definition, the moment generating function of a random variable X with probability (density) function f(x) is
M(t) = E[e^{tX}] = ∫_S e^{tx} f(x) dx
     = ∫_S (1 + tx + t²x²/2! + ...) f(x) dx
     = 1 + t E[X] + (t²/2!) E[X²] + ... + (t^r/r!) E[X^r] + ...
Differentiate r times term by term and evaluate at t = 0. Every term with a power of t less than r has been differentiated to zero, and every term with a power greater than r still contains a factor of t, which is 0 at t = 0. The only surviving term comes from (t^r/r!) E[X^r], whose r-th derivative is E[X^r]. Hence
M^(r)(0) = E[X^r].
1.15.7 Show that the Poisson distribution with probability function
f(x) = e^{−μ} μ^x / x!, x = 0, 1, 2, ...
has m.g.f. M(t) = e^{−μ + μe^t} = e^{μ(e^t − 1)}. Then show that E(X) = μ and Var(X) = μ.
By definition, a Poisson distributed random variable has m.g.f.
M(t) = E[e^{tX}]
     = Σ_{x=0}^{∞} e^{tx} e^{−μ} μ^x / x!
     = e^{−μ} Σ_{x=0}^{∞} (μe^t)^x / x!
     = e^{−μ} e^{μe^t}
     = e^{μ(e^t − 1)}
Finding the expected value:
M'(t) = μe^t e^{μ(e^t − 1)}
M'(0) = μe^0 e^{μ(e^0 − 1)} = μ
Finding the variance:
M''(t) = μe^t e^{μ(e^t − 1)} + μ²e^{2t} e^{μ(e^t − 1)}
M''(0) = μ + μ²
Var(X) = M''(0) − [M'(0)]² = μ + μ² − μ² = μ
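These two results can also be checked numerically by simulation; the R sketch below uses an arbitrary value μ = 3 (not from the notes):

set.seed(1)
mu <- 3
x <- rpois(1e6, lambda = mu)
mean(x)   # close to mu
var(x)    # also close to mu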
1.15.8 A potter is producing teapots one at a time. Assume that they are produced independently of each other and with probability p the pot produced will be satisfactory; the rest are sold at a lower price. The number, X, of rejects before producing a satisfactory teapot is recorded. When 12 satisfactory teapots are produced, what is the probability the 12 values of X will consist of six 0s, three 1s, two 2s and one value which is 3?
1.15.9 Consider a race with competitors A, B, and C. They compete in 10 matches and have probabilities of winning 0.5, 0.4, and 0.1 respectively. You can see that we only need to use X_1 and X_2 in our formulae.
f(x_1, x_2) = [10! / (x_1! x_2! (10 − x_1 − x_2)!)] (.5)^{x_1} (.4)^{x_2} (.1)^{10 − x_1 − x_2}
where A wins x_1 times and B wins x_2 times in 10 races. Find E(X_1 X_2).
1.15.10 The joint probability function of (X, Y) is:

f(x, y)   x = 0   x = 1   x = 2
y = 0     .06     .15     .09
y = 1     .14     .35     .21

1. Calculate the correlation coefficient, ρ. What does it indicate about the relationship between X and Y?
The correlation can be found by the following calculation:
Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
           = (E(XY) − E(X)E(Y)) / √((E(X²) − E(X)²)(E(Y²) − E(Y)²))
E(XY) = 0 × (0.06 + 0.15 + 0.09 + 0.14) + 1 × 1 × 0.35 + 1 × 2 × 0.21 = 0.35 + 0.42 = 0.77
E(X) = 0 × 0.20 + 1 × 0.50 + 2 × 0.30 = 1.1
E(X²) = 0² × 0.20 + 1² × 0.50 + 2² × 0.30 = 1.9
E(Y) = 0 × 0.30 + 1 × 0.70 = 0.70
E(Y²) = 0² × 0.30 + 1² × 0.70 = 0.70
Corr(X, Y) = (0.77 − 1.1 × 0.7) / √((1.9 − 1.1²)(0.70 − 0.70²)) = 0 / √((1.9 − 1.21)(0.70 − 0.49)) = 0
From this we can determine that X and Y are uncorrelated: there is no linear relationship between them (although zero correlation does not by itself imply independence).
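The same correlation can be computed in R directly from the joint probability table (an illustrative sketch):

# rows of p correspond to y = 0, 1; columns correspond to x = 0, 1, 2
p <- matrix(c(0.06, 0.15, 0.09,
              0.14, 0.35, 0.21), nrow = 2, byrow = TRUE)
x <- 0:2; y <- 0:1
px <- colSums(p); py <- rowSums(p)          # marginal distributions
EX  <- sum(x * px);   EY  <- sum(y * py)
EX2 <- sum(x^2 * px); EY2 <- sum(y^2 * py)
EXY <- sum(outer(y, x) * p)
(EXY - EX * EY) / sqrt((EX2 - EX^2) * (EY2 - EY^2))   # 0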
1.15.11 Prove the covariance results on page 19.
1.15.12 Suppose that n people take a blood test for a disease, where each person has probability p of having the disease, independent of other persons. To save time and money, blood samples from k people are pooled and analyzed together. If none of the k persons has the disease then the test will be negative, but otherwise it will be positive. If the pooled test is positive then each of the k persons is tested separately (so k + 1 tests are done in that case).
1.15.12a Let X be the number of tests required for a group of k people. Show that
E(X) = k + 1 − k(1 − p)^k.
1.15.12b What is the expected number of tests required for n/k groups of k people each? If p = .01, evaluate this for the cases k = 1, 5, 10.
1.15.12c Show that if p is small, the expected number of tests in part (b) is approximately n(kp + k^{−1}), and is minimized for k ≈ p^{−1/2}.
1.15.13 A manufacturer of car radios ships them to retailers in cartons of n radios. The profit per radio is $59.50, less shipping cost of $25 per carton, so the profit is $(59.5n − 25) per carton. To promote sales by assuring high quality, the manufacturer promises to pay the retailer $200X² if X radios in the carton are defective. (The retailer is then responsible for repairing any defective radios.) Suppose radios are produced independently and that 5% of radios are defective. How many radios should be packed per carton to maximize expected net profit per carton?
1.15.14 Let X have a geometric distribution with probability function
f(x) = p(1 − p)^x; x = 0, 1, 2, ...
1.15.14a Calculate the m.g.f. M(t) = E[e^{tX}], where t is a parameter.
1.15.14b Find the mean and variance of X.
1.15.14c Use your result in (b) to show that if p is the probability of success (S) in a sequence of Bernoulli trials, then the expected number of trials until the first S occurs is 1/p. Explain why this is obvious.
1.15.15 Find the distributions that correspond to the following moment-generating functions:
1.15.15a M(t) = 1/(3e^{−t} − 2), for t < ln(3/2)
1.15.15b M(t) = e^{2(e^t − 1)}, for t < ∞
Since a Poisson random variable has m.g.f. e^{μ(e^t − 1)}, this m.g.f. defines a Poisson(2) random variable.
1.15.16 Find the moment generating function of the discrete uniform distribution X on {a, a + 1, ..., b};
P(X = x) = 1/(b − a + 1), for x = a, a + 1, ..., b.
What do you get in the special case a = b and in the case b = a + 1? Use the moment generating function in these two cases to confirm the expected value and the variance of X.
1.15.17 Let X be a random variable taking values in the set {0, 1, 2} with moments E(X) = 1, E(X²) = 3/2.
1.15.17a Find the moment generating function of X.
1.15.17b Find the first six moments of X.
1.15.17c Find P(X = i), i = 0, 1, 2.
1.15.17d Show that any probability distribution on {0, 1, 2} is completely determined by its first two moments.
1.15.18 Assume that each week a stock either increases in value by $1 with probability 1/2 or decreases by $1, these moves independent of the past. The current price of the stock is $50. I wish to purchase a call option which allows me (if I wish to do so) the option of buying the stock 13 weeks from now at a strike price of $55. Of course if the stock price at that time is $55 or less there is no benefit to the option and it is not exercised. Assume that the return from the option is
g(S_13) = max(S_13 − 55, 0)
where S_13 is the price of the stock in 13 weeks. What is the fair price of the option today assuming no transaction costs and 0% interest; i.e. what is E[g(S_13)]?
Chapter 2
Statistics
2.1 Parameters, Estimates & Estimators
In general, we draw a sample from a population or as output from an experiment. The goal of statistical analysis is to determine the value of a population parameter, θ, by using an estimate, θ̂, from the sample.
Population parameters give characteristics of the population such as expressions of the mean and the variance. Because each sample drawn is merely an instance of the samples that could occur, we use an estimator θ̃ to denote the possible values of θ̂ we may get.

Group        Parameter(s)
Population   μ, σ²
Sample       μ̂ = x̄, σ̂² = s²
2.2 Estimates
Our estimates are obtained using the Maximum Likelihood Estimation method. This method determines the most probable estimate using the sample data. Some common ones are listed below.

Distribution and Parameter(s)   Estimate
Normal(μ, σ²)                   μ̂ = x̄
Continuous Uniform(α, β)        β̂ = max(x_1, ..., x_n)
Exponential(λ)                  λ̂ = 1/x̄
2.3 Sampling Distributions
Each estimator is a random variable representing all possible values that our
estimate may take with each sample. Hence our estimator has a distribution
called a sampling distribution.
Example 2. Let X ~ N(μ, σ²). Our estimate of μ is μ̂ = x̄. This is the realized estimate we get from the data we have. Without the data, this estimate is also a random variable, known as an estimator and denoted by μ̃ = X̄. Since X is normally distributed, X̄ is normally distributed with mean E[X̄] = μ and variance Var(X̄) = σ²/n. These results follow from the Central Limit Theorem.
2.4 Confidence Intervals
Although we'd like to use our estimates directly, there is a good chance they will not be equal to the population parameter. As a result, we build an interval of values in which we are confident that the real value will lie. If our estimator, θ̃, has a sampling distribution with mean θ and variance Var(θ̃) (and hence standard error SE(θ̃)), then our confidence interval will be
θ̂ ± c × SE(θ̃)
where c is a table value and the standard error is the standard deviation of the estimator. For example, if μ̃ ~ N(μ, σ²/n), then the confidence interval (CI) for μ from a Normal distribution with σ² known is
Estimate ± c (Standard Error)
μ̂ ± z σ/√n = x̄ ± z σ/√n
If σ² is unknown, the CI for μ is
x̄ ± t_{n−1} s/√n.
z is a value selected from the Normal table and t_{n−1} is a value selected from the Student's t table with n − 1 degrees of freedom. We make the selection based on the confidence level (CL), which is a measure of the certainty we have that our unknown parameter is in the interval.
For example, if the CL is 95%, then we are 95% confident that our unknown parameter is in the interval.
A confidence interval is not a probability, since the interval changes with each sample. In the case of a proportion, we state
X ~ Binomial(n, π) with X approximately ~ N(nπ, nπ(1 − π)).
Hence, for a proportion:
(X − nπ) / √(nπ(1 − π)) ~ N(0, 1)
Let's review with some examples.
Example 1
A bank robber decides to measure his returns obtained by looting a series of safety deposit boxes from a randomly selected bank. He samples 45 safety deposit boxes and finds an average return of $57.80 after pawning the contents of each box. The standard deviation of the earnings per box is $9.60. Find a 99% confidence interval on the average earning from a randomly selected safety deposit box from the sampled bank.
μ̂ ± z_{α/2} σ/√45
μ̂ ± z_{0.005} σ/√45
57.8 ± 2.576 × 9.6/√45
57.8 ± 3.7
Therefore, the robber can be 99% confident that the mean return on his safety deposit boxes is between $54.10 and $61.50.
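In R, the same interval can be obtained as follows (a sketch that, like the example, treats s = 9.60 as the known standard deviation):

xbar <- 57.80; s <- 9.60; n <- 45
xbar + c(-1, 1) * qnorm(0.995) * s / sqrt(n)   # about (54.1, 61.5)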
Example 2
A group of recent high school graduates was speeding down a desolate highway in the middle of life, laughing with the hubris of youth. They gleefully cried out their speeds at various points during their adventure. The speeds cried out were
234.3 233.0 232.9 235.2 234.0 232.8 233.8 233.0
Determine a 90% confidence interval for the average speed of the carefree graduates' driving. For this, we need to calculate the sample mean, x̄ = 1869.0/8 = 233.625, as well as the sample standard error. Now to estimate the variance:
s² = Σ(x_i − x̄)²/(n − 1) = (Σ x_i² − n x̄²)/(n − 1) = (436650.2 − 8 × 233.625²)/7 ≈ 0.7279
Note that we have 7 degrees of freedom (since σ² was estimated); therefore, the critical value is t_{0.05,7} = 1.895. Then, we can solve for the CI:
x̄ ± t_{0.05,7} s/√n = 233.625 ± 0.57 = (233.05, 234.20)
Thus, we are 90% confident that the true average speed of the young drivers is between 233.05 and 234.20 kilometres per hour.
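A quick R sketch reproducing this t interval from the raw speeds:

speeds <- c(234.3, 233.0, 232.9, 235.2, 234.0, 232.8, 233.8, 233.0)
mean(speeds) + c(-1, 1) * qt(0.95, df = 7) * sd(speeds) / sqrt(8)   # about (233.05, 234.20)
# equivalently: t.test(speeds, conf.level = 0.90)$conf.int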
Example 3
A manufacturing company's production line takes a random sample of their leading product to determine the number of units that are defective. The company samples 880 specimens, 308 of which proved to be defective. Create a 95% confidence interval for the proportion of defective units on the production line. This can be modelled using a binomial distribution where p represents the proportion of defective items. The general form for the CI of a binomial response model is
p̂ ± z √(p̂(1 − p̂)/n)
0.35 ± 1.96 √(0.35 × 0.65/880)
0.35 ± 0.032
Thus, we have a 95% confidence interval of (0.318, 0.382) for the proportion of defective items in the production line.
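The same proportion interval in R (sketch):

phat <- 308 / 880; n <- 880
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)   # about (0.318, 0.382)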
Example 4
We are interested in a CI for the parameter σ². The formula is
(n − 1)s²/χ²_{α/2} < σ² < (n − 1)s²/χ²_{1−α/2}
A textbook wants to question students on their ability to determine CIs for the estimation of a variance. The sampled variance is 26.70 from a total of 10 sampled units. Find a 95% CI for these numbers.
9 × 26.7/19.02 < σ² < 9 × 26.7/2.70
12.63 < σ² < 89.00
Thus, we are 95% confident that the actual value of the variance is between 12.63 and 89.00.
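An R sketch of the chi-square interval for the variance:

s2 <- 26.70; n <- 10
(n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)   # about (12.63, 89.0)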
2.4.1 Comparison between two parameters
Often we will want to compare two parameters. If the samples are independent, then the comparison can be made with the following interval:
x̄_1 − x̄_2 ± t_{n_1+n_2−2} √(s_1²/n_1 + s_2²/n_2)
If the two samples are dependent, then we calculate the differences between the data values in the two groups. The average difference is x̄_d, and the interval for the difference between the parameters of interest has the form
x̄_d ± t_{n_d − 1} s_d/√n
In the case of two proportions, we use
p̂_1 − p̂_2 ± Z √(p̂_1(1 − p̂_1)/n_1 + p̂_2(1 − p̂_2)/n_2)
Example 5
A professor administers a test out of 40 marks. She notices a difference in the averages between students who took both STAT 230 and 231 (prepared students) versus her students who have only taken STAT 230 (unprepared students). There are 7 prepared students; they had an average of 32 marks and a variance of 4.47. There are 6 unprepared students who had an average of 30.2 marks and a variance of 0.652. Find a 90% confidence interval for the difference in their marks.
x̄_1 − x̄_2 ± t_{n_1+n_2−2} √(s_1²/n_1 + s_2²/n_2)
32 − 30.2 ± 1.796 √(4.47/7 + 0.652/6)
1.8 ± 1.55
This gives a 90% confidence interval that the difference in the two means is between 0.25 and 3.35.
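An R sketch of this two-sample interval, using the unpooled standard error and pooled degrees of freedom exactly as in the formula above:

se <- sqrt(4.47/7 + 0.652/6)
(32 - 30.2) + c(-1, 1) * qt(0.95, df = 7 + 6 - 2) * se   # about (0.25, 3.35)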
In addition to a CI, a hypothesis test (HT) could also be performed. For that,
we refer you to your STAT 231 course notes for further details.
2.5 Chapter Problems
2.5.1 A machine produces a particular type of computer chip with an average
length of 0.8 centimetres. The production machine had a sample error of
0.2 centimeters when 10 randomly selected chips were measured. Deter-
mine the 95% condence interval on the average length of the computer
chip.
2.5.2 A study of 500 participants was conducted to determine the proportion of fit people who could leap across small buildings with a running start and favourable winds. Of the participants, only 60 managed to achieve this feat. Determine the 90% confidence interval on the true proportion of people who successfully leapt over a small building.
2.5.3 The Office of the Registrar General of Canada reported that a sample of 200,000 males born in 1990 had 12% of them named "Optimus Prime." Determine the 95% confidence interval associated with the true proportion of newborn males given the awesome name of Optimus Prime.
2.5.4 Given that the sample standard deviation of hacking coughs from a sample of 20 excessive smokers is 1.6 milliseconds, determine the 95% confidence interval for the variance of annoying, hacking coughs of excessive smokers.
2.5.5 Gas prices have increased signicantly over the past decade. The aver-
age price of one litre of gasoline was recorded at a nearby station every
week. Determine the 90% condence interval for both the variance and
the standard deviation of the price in dollars of the average gas price over
the following two-month period:
59 54 53 52 51
39 49 46 49 48
2.5.6 A survey conducted compared student spending habits between two com-
parable universities. 50 students were polled on their monthly spendings
on entertainment. The average spendings were $175.53 and $171.31 each
with standard deviations $9.52 and $10.89 respectively. Determine the
95% condence interval associated with the dierence in the means of
entertainment spending between the two student bodies.
2.5.7 A competitive sports radio station has asked you to discuss the dierence
between scores in American and Canadian Whack-a-Mole teams. You
picked the champions from each nation and recorded their data over a
40-year period for the highest number of moles whacked in a single bout.
The data is as follows:
Canadian Leaders
47 49 73 50 65 70 49 47 40 43
46 35 38 40 47 39 49 37 37 36
40 37 31 48 48 45 52 38 38 36
44 40 48 45 45 36 39 44 52 47
American Leaders
47 57 52 47 48 56 56 52 50 40
46 43 44 51 36 42 49 49 40 43
39 39 22 41 45 46 39 32 36 32
32 32 37 33 44 49 44 44 49 32
2.5.7a Dene the population for this set of data.
2.5.7b What kind of sample was used?
2.5.7c How do you feel this sample representing the target population?
2.5.7d State the hypotheses you will use to compare the data sets.
2.5.7e Determine a reasonable signicance level. For what purpose did you
select this level?
2.5.7f What statistical test is reasonable to use for your hypotheses? Justify
your answer.
2.5.7g Perform statistical analysis as described in your answers to the above.
What are the results?
2.5.7h State your conclusions
2.5.7i Consider how this data set describes the original question. What are
your thoughts on the t?
2.5.7j What alternative data set would you consider using to answer the ques-
tions above?
2.5.8 Eight athletes have acquired a perfectly legal, under the counter perfor-
mance enhancer that they plan to test its eectiveness. As part of their
training regimen, they test their performance levels before using the en-
hancer as well as afterwards. To end their rigorous training program, they
perform statistical analysis on the data. Using the data that they acquired,
perform a test to determine if the enhancer is eective with signicance
level of c=0.05. Assume normality on the data.
Athlete 1 2 3 4 5 6 7 8
Without Enhancer 95 104 83 92 119 115 99 98
With Enhancer 99 107 81 93 122 113 101 98
2.5.9 An advertising rm released a new controversial ad campaign for a par-
ticular brand of clothing. The campaign was featured in six randomly
selected retailers of the product. Average weekly sales for both a month
before and after the unveiling of the new ad campaign are shown in the
table below:
Location 1 2 3 4 5 6
Before Ad 210 235 208 190 172 244
After Ad 190 170 210 188 173 228
Determine whether or not the ad campaign caused a signicant change in
sales with a signicance level of c=0.10.
2.5.10 Early educational facilities began standardized testing to determine
whether or not pre-school children could recognize a ninja. 12 out of
34 privately run pre-schools had a ninja identication success rate of less
than 80%, while 17 out of 24 publicly run pre-schools had a success rate
less than 80%. Determine a 95% condence interval for the dierence of
proportions in successful indentication of ninjas in publicly and privately
run pre-schools.
2.5.11 Consider the model Y = μ + R, where R ~ N(0, σ²). Use maximum likelihood estimation to
2.5.11a Show μ̂ = x̄.
2.5.11b Show σ̂² = Σ_{i=1}^{n} (x_i − x̄)²/n.
Chapter Problems: Solutions
2.4.1 A machine produces a particular type of computer chip with average
length of 0.8 centimetres. The production machine had a sample error of
0.2 centimeters when 10 randomly selected chips were measured. Deter-
mine the 95% condence interval on the average length of the computer
chip.
2.4.1 Solution Since σ is unknown and s must replace it, the t distribution must be used for a 95% confidence interval. Hence, with 9 degrees of freedom, t_{α/2} = 2.262.
The 95% confidence interval for the population mean is found by substituting into the formula
x̄ − t_{α/2} (s/√n) < μ < x̄ + t_{α/2} (s/√n)
Hence, 0.8 − 2.262 × 0.2/√10 < μ < 0.8 + 2.262 × 0.2/√10
0.8 − 0.143 < μ < 0.8 + 0.143
0.657 < μ < 0.943
Therefore, one can be 95% confident that the population mean computer chip length is between 0.657 and 0.943 centimetres based on a sample of 10 chips.
2.4.2 A study of 500 participants was conducted to determine the proportion of fit humans that could leap small buildings with a running start and favourable winds. Of the participants only 60 managed to match this feat of strength. Determine the 90% confidence interval on the true proportion of humans who successfully leapt over a small building under these conditions.
2.4.2 Solution Since α = 1 − 0.90 = 0.10 and z_{α/2} = 1.645, substitute into the formula
p̂ − z_{α/2} √(p̂q̂/n) < p < p̂ + z_{α/2} √(p̂q̂/n)
With p̂ = 60/500 = 0.12 and q̂ = 0.88, one gets
0.12 − 1.645 √(0.12 × 0.88/500) < p < 0.12 + 1.645 √(0.12 × 0.88/500)
0.12 − 0.024 < p < 0.12 + 0.024
0.096 < p < 0.144, or 9.6% < p < 14.4%
Hence, one can be 90% confident that the percentage of applicants who can leap over a small building is between 9.6% and 14.4%.
2.4.3 The Canadian Office of the Registrar General reported that a sample of 200,000 male births in 1990 had 12% of them named "Optimus Prime". Determine the 95% confidence interval associated with the true proportion of newborn males given the awesome name of Optimus Prime.
2.4.3 Solution From the report, p̂ = 0.12 (i.e. 12%), and n = 200,000. Since z_{α/2} = 1.96, substituting into the formula
p̂ − z_{α/2} √(p̂q̂/n) < p < p̂ + z_{α/2} √(p̂q̂/n)
yields 0.12 − 1.96 √(0.12 × 0.88/200,000) < p < 0.12 + 1.96 √(0.12 × 0.88/200,000)
0.119 < p < 0.121
Hence, one can say with 95% confidence that the true percentage of males named Optimus Prime is between 11.9% and 12.1%.
2.4.4 Given that the sample standard deviation of hacking coughs from a sample of 20 excessive smokers is 1.6 milliseconds, determine a 95% confidence interval for the variance of annoying, hacking coughs of excessive smokers.
2.4.4 Solution Since α = 0.05, the two critical values, respectively, for the 0.025 and 0.975 levels for 19 degrees of freedom are 32.852 and 8.907. The 95% confidence interval for the variance is found by substituting into the formula
(n − 1)s²/χ²_{right} < σ² < (n − 1)s²/χ²_{left}
(20 − 1)(1.6)²/32.852 < σ² < (20 − 1)(1.6)²/8.907
1.5 < σ² < 5.5
Hence, one can be 95% confident that the true variance of the hacking coughs is between 1.5 and 5.5 (milliseconds squared). (For the standard deviation, the confidence interval is (1.2, 2.3) milliseconds.)
2.4.5 Prices of gas have increased significantly over the past decade. The average price of one litre of gasoline was recorded at a nearby station every week. Determine a 90% confidence interval for both the variance and the standard deviation of the price of gasoline over the following two-month period:
59 54 53 52 51
39 49 46 49 48
2.4.5 Solution First we find the variance of the data. The estimated variance is s² = 28.2.
Then find χ²_{right} and χ²_{low} from a χ² distribution with 9 degrees of freedom: with α = 0.10, the critical values are 3.325 and 16.919, using the 0.95 and 0.05 quantiles.
Substitute and solve:
(n − 1)s²/χ²_{right} < σ² < (n − 1)s²/χ²_{low}
(10 − 1)(28.2)/16.919 < σ² < (10 − 1)(28.2)/3.325
15.0 < σ² < 76.3
Hence one can be 90% confident that the variance of the price of gasoline per litre is between 15.0 and 76.3. For the standard deviation, the interval is (3.87, 8.73).
2.4.6 A survey compared student spending habits between two comparable universities. 50 students at each were polled on their monthly spending on entertainment. The averages were $175.53 and $171.31, with standard deviations $9.52 and $10.89 respectively. Determine the 95% confidence interval associated with the difference in the means of entertainment spending between the two undergraduate bodies.
2.4.6 Solution
(X̄_1 − X̄_2) − z_{α/2} √(σ_1²/n_1 + σ_2²/n_2) < μ_1 − μ_2 < (X̄_1 − X̄_2) + z_{α/2} √(σ_1²/n_1 + σ_2²/n_2)
(175.53 − 171.31) − 1.96 √((9.52² + 10.89²)/50) < μ_1 − μ_2 < (175.53 − 171.31) + 1.96 √((9.52² + 10.89²)/50)
4.22 − 4.01 < μ_1 − μ_2 < 4.22 + 4.01
0.21 < μ_1 − μ_2 < 8.23
Since the confidence interval does not contain zero, the decision is to reject the null hypothesis of equal means.
2.4.7 A competitive sports radio station has asked you to discuss the difference between scores in American and Canadian Whack-a-Mole teams. You pick the champions from each nation and record their data over a 40-year period for the highest number of moles whacked in a single bout. The data are as follows:
Canadian Leaders
47 49 73 50 65 70 49 47 40 43
46 35 38 40 47 39 49 37 37 36
40 37 31 48 48 45 52 38 38 36
44 40 48 45 45 36 39 44 52 47
American Leaders
47 57 52 47 48 56 56 52 50 40
46 43 44 51 36 42 49 49 40 43
39 39 22 41 45 46 39 32 36 32
32 32 37 33 44 49 44 44 49 32
2.4.7a Dene the population for this set of data.
2.4.7a Solution The population is all Moles whacked by Canadian and Amer-
ican Whack-a-Mole Champions.
2.4.7b What kind of sample was used?
2.4.7b Solution A cluster sample was used.
2.4.7c How do you feel this sample represents the target population?
2.4.7c Solution Answers will vary. While this sample is not representative of
all professional Whack-a-Molers, per se, it does allow us to compare the
leaders in each league.
2.4.7d State the hypotheses you will use to compare the data sets.
2.4.7d Solution H
0
: j
1
= j
2
and H
A
: j
1
6= j
2
.
2.4.7e Determine a reasonable signicance level. For what purpose did you
select this level?
2.4.7e Solution Answers will vary. Possible answers include the 0.05 and the
0.01 signicance levels.
2.4.7f What statistical test is reasonable to use for your hypotheses? Justify
your answer.
2.4.7f Solution We will use the . test for the dierence in means.
2.4.7g Perform statistical analysis as described in your answers to the above. What are the results?
2.4.7g Solution Our test statistic is
z = (44.75 − 42.88)/√((8.88² + 7.82²)/40) ≈ 1.00
and our p-value is 0.3173.
2.4.7h State your conclusions
2.4.7h Solution We fail to reject the null hypothesis. There is not enough
evidence to conclude that there is a dierence in the number of moles
whacked by Canadian versus American champions.
2.4.7i Consider how this data set describes the original question. What are
your thoughts on the t?
2.4.7i Solution Answers will vary. One possible answer is that since we do not have a random sample of data from each nation, we cannot answer the original question asked.
2.4.7j What alternative data set would you consider using to answer the ques-
tions above?
2.4.7j Solution Answers will vary. One possible answer is that we could get a
random sample of data from each nation from a recent season.
2.4.8 Eight athletes have acquired a perfectly legal, under-the-counter performance enhancer whose effectiveness they plan to test. As part of their training regime they test their performance levels before using the enhancer as well as afterwards. To end their rigorous training programme they perform statistical analysis on the data. Using the data that they acquired, perform a test to determine if the enhancer is effective with significance level α = 0.05. Assume normality of the data.
Athlete 1 2 3 4 5 6 7 8
Without Enhancer 95 104 83 92 119 115 99 98
With Enhancer 99 107 81 93 122 113 101 98
2.4.8 Solution State the hypothesis and identify the claim. For the enhancer to be effective, the before scores must be significantly less than the after scores; hence the mean of the differences must be less than zero.
H_0: μ_D = 0 and H_A: μ_D < 0 (claim).
Find the critical value. The degrees of freedom are n − 1; in this case df = 8 − 1 = 7. The critical value for a left-tailed test with α = 0.05 is −1.895.
Compute the differences D = X_1 − X_2 (before minus after):

Before (X_1)   After (X_2)   Difference D = X_1 − X_2   Squared Diff. D²
95             99            −4                          16
104            107           −3                          9
83             81            2                           4
92             93            −1                          1
119            122           −3                          9
115            113           2                           4
99             101           −2                          4
98             98            0                           0
                             ΣD = −9                     ΣD² = 47

with mean difference D̄ = −9/8 = −1.125. With this chart, one can calculate the standard deviation of the differences
s_D = √[(ΣD² − (ΣD)²/n)/(n − 1)] = √[(47 − (−9)²/8)/7] ≈ 2.295
Calculating the test statistic:
t = (D̄ − μ_0)/(s_D/√n) = (−1.125 − 0)/(2.295/√8) ≈ −1.386
Since −1.386 > −1.895, the test statistic does not fall in the critical region, so the decision is to not reject the null hypothesis at α = 0.05.
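R's built-in t.test reproduces this paired analysis (a sketch, not part of the original solution):

before <- c(95, 104, 83, 92, 119, 115, 99, 98)
after  <- c(99, 107, 81, 93, 122, 113, 101, 98)
t.test(before, after, paired = TRUE, alternative = "less")
# t is about -1.386 on 7 df with a p-value near 0.10, so H0 is not rejected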
2.4.9 An advertising firm released a new controversial ad campaign for a particular brand of clothing. The campaign was featured in six randomly selected retailers of the product. Average weekly sales for a month before and a month after the unveiling of the new ad campaign are shown below. Determine whether or not the ad campaign caused a significant change in sales with a significance level of α = 0.10.
Location 1 2 3 4 5 6
Before Ad 210 235 208 190 172 244
After Ad 190 170 210 188 173 228
2.4.9 Solution State the hypothesis and identify the claim. We are looking for a change in sales in either direction, so we test whether the mean of the differences (before minus after) is zero.
H_0: μ_D = 0 and H_A: μ_D ≠ 0 (claim).
Find the critical value. The degrees of freedom are 5. The critical value for a two-tailed test at α = 0.10 is ±2.015.
Compute the differences D = X_1 − X_2 (before minus after):

Before (X_1)   After (X_2)   Difference D = X_1 − X_2   Squared Diff. D²
210            190           20                          400
235            170           65                          4225
208            210           −2                          4
190            188           2                           4
172            173           −1                          1
244            228           16                          256
                             ΣD = 100                    ΣD² = 4890

with mean difference D̄ = 100/6 ≈ 16.7. With this chart, one can calculate the standard deviation of the differences
s_D = √[(ΣD² − (ΣD)²/n)/(n − 1)] = √[(4890 − 100²/6)/5] ≈ 25.4
Calculating the test statistic:
t = (D̄ − μ_0)/(s_D/√n) = (16.7 − 0)/(25.4/√6) ≈ 1.610
The decision is to not reject the null hypothesis, since the test statistic 1.610 is in the non-critical region (|1.610| < 2.015).
2.4.10 Early educational facilities began standardized testing to determine whether or not pre-school children could recognize a ninja. 12 out of 34 privately run pre-schools had a ninja identification success rate of less than 80%, while 17 out of 24 publicly run pre-schools had a success rate less than 80%. Determine a 95% confidence interval for the difference of proportions in successful identification of ninjas in publicly and privately run pre-schools.
2.4.10 Solution
p̂_1 = 12/34 ≈ 0.35, q̂_1 = 0.65
p̂_2 = 17/24 ≈ 0.71, q̂_2 = 0.29
Substitute into the formula
(p̂_1 − p̂_2) ± z_{α/2} √(p̂_1 q̂_1/n_1 + p̂_2 q̂_2/n_2):
(0.35 − 0.71) ± 1.96 √(0.35 × 0.65/34 + 0.71 × 0.29/24)
−0.36 ± 0.242
−0.602 < p_1 − p_2 < −0.118
Since 0 is not contained in the interval, the decision is to reject the null hypothesis H_0: p_1 = p_2.
2.4.11 Consider the model X = μ + R, where R ~ N(0, σ²). Use maximum likelihood estimation (MLE) to
2.4.11a Show μ̂ = x̄.
2.4.11b Show σ̂² = Σ_{i=1}^{n} (x_i − μ̂)²/n.
2.4.11a&b Solution Find the log-likelihood of the Normal distribution:
L(θ) = ∏_{i=1}^{n} (1/√(2πσ²)) e^{−(x_i − μ)²/(2σ²)}
ℓ(θ) = C − (n/2) ln σ² − (1/2) Σ_{i=1}^{n} ((x_i − μ)/σ)²
To solve for the estimated value of μ, take the partial derivative with respect to μ:
∂ℓ/∂μ = Σ_{i=1}^{n} (x_i − μ)/σ²
Set this equal to zero and solve; note that σ² is a positive constant with respect to μ:
0 = Σ (x_i − μ̂)
n μ̂ = Σ x_i
μ̂ = x̄
Now for σ². Take the partial derivative with respect to σ:
∂ℓ/∂σ = −n/σ + Σ_{i=1}^{n} (x_i − μ)²/σ³
Set this equation equal to 0 and then solve for σ̂:
0 = −n/σ̂ + Σ (x_i − μ̂)²/σ̂³
0 = −n σ̂² + Σ (x_i − μ̂)²
σ̂² = Σ_{i=1}^{n} (x_i − μ̂)²/n = Σ_{i=1}^{n} (x_i − x̄)²/n
Chapter 3
Validation
In many textbooks, the author will say something like: "The waiting times between buses arriving at a bus stop follow an exponential distribution." How do we know if this is true? It's all well and good to say that a set of data follows a distribution, but how can we argue that it does?
It is important to remember that hypothesis tests are not conclusive, in that the results don't lead to concrete solutions. They test whether a particular parameter value is plausible, given both the data and a margin of error. So how can we determine which distribution to use? Consider the following set of completely random data.

Order of Occurrence   Data Value
1                     1,4
2                     1,2
3                     3,4
4                     1

How can we examine the data in such a way that we can pick a distribution that seems to fit? The best solution is to graph it and compare it to known distributions.
3.1 χ² Test for Goodness of Fit
For discrete models, we can use the χ² Test for Goodness of Fit (sometimes referred to as Pearson's χ² Test). If we have data which can be categorized (into groups called bins), then we can use this test to compare a theoretical distribution to the distribution of a set of data. Consider the following example:
3.1.1 Example
A study compares two sub-groups of a population (males and females), and looks
at whether or not the members of that population have cancer. The surveying
team found 351 participants and recorded the following data:
Males Females Total
Cancer 104 73 137
No Cancer 89 85 174
Total 193 158 351
The interest of this study is to determine whether or not having cancer is independent of one's sex. For this experiment, we assume that the two are independent. Let the null hypothesis be that these two are independent, and let the alternative hypothesis be that sex and having cancer are dependent.
Let X represent the sex of the unit, and Y represent whether or not the unit has cancer. Statistically speaking, we are interested in whether or not these two random variables are independent of each other. More specifically, we want to see if P(X = x, Y = y) = P(X = x)P(Y = y).
The underlying concept here is the multinomial distribution. Our data values are going to fall into 1 of 4 bins:
Male with cancer
Male without cancer
Female with cancer
Female without cancer
Based on the data, we can estimate the probability that a particular person of a particular sex has or does not have cancer. Expressed more generally, we have:

          X = 0    X = 1    Total
Y = 0     a        b        a + b
Y = 1     c        d        c + d
Total     a + c    b + d    a + b + c + d = n

Similarly, we can argue that each marginal distribution is binomial. Hence the expected number of males with cancer can be calculated as E[X] = np. And so, if we are interested in the expected number of males with cancer given a sample of size n, then we would be interested in n·P(X = 0)·P(Y = 0), assuming that we have independence. This becomes
n × (a + b)/n × (a + c)/n = (a + b)(a + c)/n
One can apply the same concept to the remaining three boxes, which will yield a chart of the expected values for each bin, as shown below:

            Male                       Female                     Total
Cancer      (a+b)(a+c)/n ≈ 75.33       (a+b)(b+d)/n ≈ 61.67       137
No Cancer   (c+d)(a+c)/n ≈ 95.68       (c+d)(b+d)/n ≈ 78.32       174
Total       193                        158                        351
We now have tables detailing the expected bin counts and the observed. The test statistic for the general Pearson test with k bins is
D = Σ_{i=1}^{k} (o_i − e_i)²/e_i
where, for the i-th bin, o_i is the observed frequency and e_i is the expected (theoretical) frequency. D is not exactly χ²-distributed; however, it is asymptotically χ², with one degree of freedom in this case. We need to determine the degrees of freedom for this statistic.
Using a calculator, we get an approximate test statistic of d = 14.03 on 1 degree of freedom. We calculate the probability of D > d as follows:
P(D > d) = 1 − P(D ≤ d)
         = 1 − P(D ≤ 14.03)
         ≈ 1 − 0.999
         = 0.001
Therefore, we have very strong evidence against our hypothesis and we reject it in favour of the alternative hypothesis: cancer and sex are dependent on each other.
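The test statistic and p-value can be reproduced in R from the observed and expected counts listed above (a sketch of the hand calculation, not a full independence test):

obs  <- c(104, 73, 89, 85)             # observed counts, reading across the data table
expc <- c(75.33, 61.67, 95.68, 78.32)  # expected counts from the table above
sum((obs - expc)^2 / expc)             # about 14.03
1 - pchisq(14.03, df = 1)              # about 0.0002, i.e. well below 0.001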
The intricacies of using the χ² Test in another situation may not be apparent from the previous example. So consider the following set of data:
47 199 120 59 204
128 408 47 48 79
66 61 25 91 217
These data come from some unknown discrete distribution.
Figure 3.1: Empirical CDF of Observed Data Points
We begin by asking if we can model our data using a geometric distribution. Hence, we need to determine the parameter of our model. From the review of ML estimation for the geometric distribution, we obtain an estimate of the parameter p from our data:
p̂ = 1/x̄ = n/Σ x_i = ((1/15) Σ_{i=1}^{15} x_i)^{−1} = (1799/15)^{−1} ≈ 0.00834
Therefore, we hypothesize that the data follow a Geo(0.00833) distribution.
The first order of business is to determine what intervals we should use for our data. The following rules should be imposed:
1. Let X_i be the number of data values landing in Bin i, and let p_i be the probability of a data value being in Bin i. Let n be the number of data values. Then E[X_i] = n·p_i ≥ 5 for all i. This condition is necessary as the approximation to the χ² distribution requires the expected frequencies to be greater than or equal to 5; if this is not the case, the approximation breaks down, in which case the G-test would be more appropriate.
2. Let B be the number of bins, and p be the number of parameters that had to be estimated. Then the degrees of freedom (df) are B − p − 1. There must be at least 1 df.
3. The data must be discrete.
4. The bins can vary in size.
In situations requiring calculation by hand (such as in practice examples or tests), we may relax assumption 1 to make the question doable in the allotted time.
In our sample, at least three bins are needed in order to satisfy the degrees of freedom condition. Accordingly, we divide our data into the intervals [0, 47], (47, 124] and (124, ∞). So, we get the following bins:

Bin 1        Bin 2                          Bin 3
25, 47, 47   48, 59, 61, 66, 79, 91, 120    128, 199, 204, 217, 408

Now, we can state the null and alternative hypotheses:
H_0: the data collected are governed by a Geometric distribution with parameter 0.00833.
H_A: the data are not governed by a Geometric distribution with parameter 0.00833.
The test statistic is D = Σ_{i=1}^{k} (o_i − e_i)²/e_i. For the geometric, we have
P(Bin 1) = P(0 ≤ x ≤ 47) ≈ 0.3307
P(Bin 2) = P(48 ≤ x ≤ 124) ≈ 0.3178
P(Bin 3) = P(125 ≤ x < ∞) ≈ 0.3515
The R code used to get these results is as follows:
> pgeom(47, 0.00833)
[1] 0.3306945
> pgeom(124, 0.00833) - pgeom(47, 0.00833)
[1] 0.3178285
> 1 - pgeom(124, 0.00833)
[1] 0.351477
From there, we can use the formula to get the test statistic by hand:
D = Σ_{i=1}^{3} (o_i − e_i)²/e_i
  = (3 − 15 × 0.3307)²/(15 × 0.3307) + (7 − 15 × 0.3178)²/(15 × 0.3178) + (5 − 15 × 0.3515)²/(15 × 0.3515)
  ≈ 1.8349
Now we have enough information to finish this example! We have a test statistic of 1.8349 with 1 degree of freedom, so going back to the very original form of the question:
P(D > d) = 1 − P(D ≤ d) = 1 − P(D ≤ 1.835)
We end up with two possible solutions depending on what we do from here. We can either use a simple R routine to get a more accurate answer or we can refer to the χ² table for a wider approximation. Let's compare both methods.
Using the R command:
> 1 - pchisq(1.8349, 1)
This yields the value 0.176.
Now, looking at the χ² table with 1 degree of freedom we have that 0.5 < P(D ≤ 1.8349) < 0.9, which implies 1 − 0.5 > 1 − P(D ≤ 1.8349) > 1 − 0.9, which simplifies to the range (0.10, 0.50). The latter is a wide interval; however, it shows that our null hypothesis will not be rejected at a 10% significance level. The value 0.176 is nicer since it gives us a better indication of the exact magnitude.
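The whole goodness-of-fit calculation can be scripted in R; the sketch below re-derives the estimate, bin probabilities, test statistic and p-value (small rounding differences from the hand calculation are expected):

x <- c(47, 199, 120, 59, 204, 128, 408, 47, 48, 79, 66, 61, 25, 91, 217)
p_hat <- 1 / mean(x)                        # the estimate used above, about 0.00834
probs <- c(pgeom(47, p_hat),
           pgeom(124, p_hat) - pgeom(47, p_hat),
           1 - pgeom(124, p_hat))
obs <- c(sum(x <= 47), sum(x > 47 & x <= 124), sum(x > 124))   # 3, 7, 5
D <- sum((obs - 15 * probs)^2 / (15 * probs))                  # about 1.83
1 - pchisq(D, df = 1)                                          # about 0.18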
3.2 Kolmogorov-Smirnov Test
This test compares a theoretical distribution F to the empirical cumulative distribution function (ECDF) from the data, F̂. If F and F̂ appear close, then we will assume that our data are F distributed.
Step 1 Build an ECDF from our data. We determine how far F̂ and F are from each other. We call this distance, d, the K-S distance.
Step 2 We ask if d is large. We compare d to the distances we would get by comparing F to ECDFs built from data generated from F. If d is relatively small, then we would expect the data to be F distributed (i.e. if our p-value is small, we reject F as the distribution).
3.2.1 Building an Empirical CDF
First of all, we assume that every observation has weight (probability) 1/n, where n is the number of observations. The empirical CDF is defined as follows:
F̂(x) = (1/n) Σ_{i=1}^{n} I(x_i ≤ x)
This is the proportion of observations with values less than or equal to x. Take the example of a random set of data: 1, 1, 2, 5. This has the following ECDF.
Figure 3.2: ECDF of Observations
F̂ can be built for discrete and continuous data. For continuous data we assume that P(x_i = a, x_j = a) = 0 for i ≠ j.
We create an empirical CDF as follows:
1. Order the data.
2. Assume each point is equiprobable.
3. Make a step function F_x/n, where F_x is the number of data points less than or equal to x.
3.2.2 The K-S Distance
If our ordered data are a_1, a_2, ..., a_n, then at every step of our ECDF we calculate two distances: d_{i,U} = |F(a_i) − F̂(a_i)| and d_{i,D} = |F(a_i) − F̂(a_{i−1})|, with F̂(a_0) = 0. We'll call these the up and down distances. The K-S distance, d, is max_i {d_{i,U}, d_{i,D}}.
Example:
We have received the data set {1,4,9,16}, and we suspect that it follows an
exponential distribution.
[Plot: empirical CDF of the data set {1, 4, 9, 16}]
Step 1 Build the empirical CDF.
Since we believe that this follows an exponential distribution, we can estimate its parameter:
λ̂ = 1/x̄ = 2/15
Now, we can look at the comparison between the theoretical and empirical CDF probabilities:
Figure 3.3: Theoretical vs. Empirical cdfs
F(1) = 1 − e^{−2/15}      F̂(1) = 1/4
F(4) = 1 − e^{−8/15}      F̂(4) = 1/2
F(9) = 1 − e^{−18/15}     F̂(9) = 3/4
F(16) = 1 − e^{−32/15}    F̂(16) = 1
Let d_{i,U} represent the up distance and d_{i,D} the down distance for observation i. Then, for this set of data, we have the following:
d_{1,U} = 0.1252   d_{2,U} = 0.0866   d_{3,U} = 0.0512   d_{4,U} = 0.1184
d_{1,D} = 0.1248   d_{2,D} = 0.1634   d_{3,D} = 0.1980   d_{4,D} = 0.1316
Thus, our K-S distance is d = 0.1980.
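The up and down distances, and hence d, can be computed in R as follows (sketch):

x <- sort(c(1, 4, 9, 16)); n <- length(x)
Ftheo <- pexp(x, rate = 1 / mean(x))     # fitted exponential cdf at each data point
up    <- abs(Ftheo - (1:n) / n)          # distances to the ECDF just after each jump
down  <- abs(Ftheo - (0:(n - 1)) / n)    # distances to the ECDF just before each jump
max(up, down)                            # about 0.198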
Step 2 Create N data sets derived from F.
At this point, we generate N data sets from F, build an ECDF for each, and denote them F̂_i (i = 1, ..., N). Then, we calculate the K-S distance D_i between F and each F̂_i.
Finally, we consider the frequency of D_i > d. The more i that satisfy this, the more acceptable our fitted F will be.
The following are examples of how to perform a KS test using R. The rst
method uses R code to calculate a p-value.
#set up for the data of interest
#sorted random exponential with rate 1
x<-sort(rexp(10,rate=1))
Figure 3.4: Theoretical vs. Empirical cdfs with K-S Distances
plot(ecdf(x))
#Plots Empirical CDF
lines(x,pexp(x,rate=1/mean(x)))
#Plots lines for other data values
ks.test(x,"pexp",1/mean(x))
#Calculates the ks test
The following is the code for the KS test. As an exercise, create your own
code for the KS test.
#****************************************************
#KS exp will perform a KS test and return the p-value
KSEXP <- function (data,N,d)
{
data <- sort(data)
rate <- 1/mean(data)
#ks is the set of all ks distances
ks <- NULL
for (i in 1:N)
{
#generate data from EXP(rate)
#rdata is the random data
set.seed(i)
rdata <- sort(rexp(length(data),rate=rate))
rrate <- 1/mean(rdata)
D <- max(abs(1-exp(-rrate*rdata)-seq(1/length(data),1,by=1/length(data))),
abs(1-exp(-rrate*rdata)-seq(0,1-1/length(data),by=1/length(data))))
ks<-cbind(D, ks)
} #end for
Pvalue <- mean(ks>d)
return(Pvalue)
}#end fn KSEXP
#*****************************************************
#Generate an exponential random variable
EXPGEN<- function(rate, time, seed)
{
#set.seed(seed)
newtime <- rexp(1,rate=rate) + time
return(newtime)
}
To calculate the p-value for P(D ≥ d), we want the percentage of times D_i is bigger than d (that is, the percentage of the time that the generated data are less consistent with F than our observed data):
P(D ≥ d) = the proportion of times D_i is at least d = (1/N) Σ_{i=1}^{N} I(D_i ≥ d)
If our p-value is small, then d is too large and we reject our hypothesis. We use our usual 5% rejection rule. Since the test statistic depends on the generated sets of values, the answer will vary every time the test is done. To reduce the variation, we simply use a large number of sets. We will still see variation in our test statistics; however, the larger the number of generated sets, the smaller the variation will be.
Chapter Problems: Solutions
3.3.1 Given the following data set, perform a χ² test for Goodness of Fit using a Binomial distribution with n = 20 and p̂ = 0.5.
13 8 8 12 11
9 9 10 15 13
11 7 8 10 10
9 8 11 13 9
3.3.1 Solution We need at least two bins for this test. Construct the bins as follows:
Bin 1: x ∈ [0, 9] with expectation E[Bin 1] = 20 × P(Bin 1) = 20 × 0.4119015 = 8.23803
Bin 2: x ∈ [10, 20] with expectation E[Bin 2] = 20 × P(Bin 2) = 20 × 0.5880985 = 11.76197
The actual counts for the bins are 9 and 11. Then the test statistic becomes
t = (9 − 8.23803)²/8.23803 + (11 − 11.76197)²/11.76197 ≈ 0.1199
We are testing the hypothesis H_0: p = 0.50 versus H_A: p ≠ 0.50. Then, on 1 degree of freedom,
P(T > t) = 1 − P(T ≤ t) = 1 − P(T ≤ 0.1199)
         ≈ 72.9% from R
         in the range (0.50, 0.95) from the tables
From this we have no evidence against the hypothesis that these data follow a Binomial distribution with probability of success 0.50. Note that the bins used for this test are one of many valid choices.
3.3.2 Given the following data determine an appropriate distribution and per-
form a
2
Test using MLE.
10 5 3 5 5
10 1 3 8 5
4 9 7 5 4
3.3.2 Solution Solutions will vary, but should take the following form.
Look at the Empirical CDF (note, data was modied using the jitter function
so that no points coincide with each other)
Graph Needed
This looks like it could follow a Poisson Distribution or Binomial. The MLE
for the rate parameter for the Poisson Distribution is x. At least two bins will
be required. Follow procedure as above.
3.3.3 Given the following set of Geometric data
1 3 0 0 5
0 0 0 8 0
2 4 3 0 1
0 0 0 1 3
3.3.3a Perform a χ² test under the assumption that p̂ = 0.60.
3.3.3a Solution We require at least two bins for this test to satisfy the degrees of freedom. We will construct the bins as follows:
Bin 1: x = 0 with expectation E[Bin 1] = 20 × P(Bin 1) = 20 × 0.60 = 12
Bin 2: x ∈ [1, ∞) with expectation E[Bin 2] = 20 × P(Bin 2) = 20 × 0.40 = 8
The actual counts for the bins are both 10. Then the test statistic becomes
t = (10 − 12)²/12 + (10 − 8)²/8 = 0.8333
We are testing the hypothesis H_0: p = 0.60 versus H_A: p ≠ 0.60. Then, on 1 degree of freedom,
P(T > t) = 1 − P(T ≤ t) = 1 − P(T ≤ 0.8333) ≈ 36.1% from R
From this we have no evidence against our hypothesis that p = 0.60.
3.3.3b Use the following ML estimator for p and perform the same test:
p̂ = (1 + (1/n) Σ_{i=1}^{n} x_i)^{−1}
3.3.3b Solution Find an estimate for p̂:
p̂ = (1 + (1/n) Σ_{i=1}^{n} x_i)^{−1} = (1 + 31/20)^{−1} ≈ 0.3922
Now create bins in a similar manner to the above.
Bin 1: x = 0 with expectation E[Bin 1] = 20 × P(Bin 1) = 20 × 0.3922 = 7.8431
Bin 2: x ∈ [1, ∞) with expectation E[Bin 2] = 20 × P(Bin 2) = 20 × 0.6078 = 12.1569
The actual counts for the bins are both 10. Then the test statistic becomes
t = (10 − 7.8431)²/7.8431 + (10 − 12.1569)²/12.1569 ≈ 0.9742
We are testing the hypothesis H_0: p = 0.3922 versus H_A: p ≠ 0.3922. Then, on 1 degree of freedom,
P(T > t) = 1 − P(T ≤ t) = 1 − P(T ≤ 0.9742) ≈ 32.4% from R
From this we have no evidence against our hypothesis that p = 0.3922.
3.3.4 Using R, create your own
3.3.4a
2
Test for Goodness of Fit
3.3.4b KS Test
3.3.4c Using the following data set and the KS Test that you made in 3.3.4b,
perform a KS Test for the distribution Normal(-1,2)
-3.4749 1.4955 -1.7907 -1.2529 -5.4987
-2.6497 -3.4814 0.6968 -1.1721 3.2448
-3.9419 -3.9650 4.2603 1.9561 -1.9241
-1.2756 4.9261 -5.3436 -3.2091 1.7410
3.3.5 Using the following data, determine an appropriate distribution, estimate its parameters and perform a KS Test.
0.1356 0.3759 0.0158 0.1043 0.0406
0.5534 0.7184 0.2038 0.1298 0.0265
1.2225 0.7828 0.3368 0.0490 0.5329
Look at the empirical CDF.
Graph Needed
Compare this to theoretical CDFs. Many distributions work; however, we will only show the steps for the Exponential distribution.
This distribution has only one parameter. The ML estimate of the mean is (1/n) Σ_{i=1}^{n} x_i = 5.2281/15 ≈ 0.3485, so the estimated rate is λ̂ = 1/0.3485. Then using the following R code
ks.test(data, pexp, rate = 0.3485^(-1))
we obtain a p-value of 0.8706.
3.3.6 Explain the variance in the p-value of the KS Test. How can this variance
be reduced? Is this variance present in other tests like the
2
?
3.3.6 Solution The variance in the KS Test p-value is a result of the generation
of theoretical data points to be used to compare against the ECDF. This
variance can be reduced by increasing the number of generated functions.
This is not present in most other test statistics since they do not involve
the generation of random values, but rely on the analysis of realizations.
3.3.7 Download the data le ks1.txt from the course website. Determine an
appropriate model, use Maximum Likelihood to estimate its parameters
and then perform a KS Test on your results.
3.3.7 Solution The ECDF looks like that of a Normal random variable. Mean
is 1.94 with SD 1.05.
Chapter 4
Queuing Systems
4.1 Introduction
Waiting in queues (British word for lines) is a phenomenon that occurs every-
where. For example, we wait in checkout lines in grocery stores, or we wait to
be served in a single line in a fast food restaurant. A queueing system usually
consists of multiple servers, which can be in groups or in sequence. For simplic-
ity, suppose we have a queueing system with two distinct servers, named Server
1 and Server 2. There are three ways in which these servers can be arranged.
We mention only two of them here:
First, as illustrated below, we could have Server 1 serving a person from the
queue and then send him/her to Server 2.
Figure 4.1: Queue with Two Servers in Series
A second type of arrangement is shown below in Figure 4.2. Server 1 and
Server 2 share one queue and can each serve no more than one unit (note that
its possible that both servers are busy). This is the grocery store model (servers
are cashiers), or a mode of transportation model (servers could be bus routes
to the same location). We call these parallel servers.
Figure 4.2: Queue with Two Servers in Parallel
This example raises an important topic with queues: their types. This could
happen in a grocery store: if Server 2 is the cashier for the express lane (8 items
to purchase or less) and Server 1 is a regular service lane, then they would have
dierent serving times. The following is a list of some of the dierent types of
serving:
FIFO (First In, First Out): In computer science terms, this is a "queue." For example, in a grocery store, the shopkeepers attempt to sell their oldest stock first to avoid spoilage.
LIFO(Last In, First Out): In computer science terms, this is a "stack."
Having dishes in a stack is a good example of this.
Priority Queue: Units in a queue are given dierent priority ranks mak-
ing the order of service is dependent on their priority ranking. An excellent
example of this is in an emergency room where patients are attended to
according to the severity of their injuries/condition.
SIRO (Service In Random Order): Units in the queue are served at random. For example, collision resolution protocols in cable access networks operate in a manner quite similar to SIRO.
SPT (Shortest Processing Time): Units are ranked by which takes
the least amount of time to serve. When I do my assignments, I tackle
the easiest questions rst, leaving the hardest questions at the very end.
4.1.1 Notation
Now that we have a general understanding of what a queue is, we need a common
method of describing what were talking about. Any queue can be written in
the following form:
A/B/C/D/E
A - Arrival Distribution. M represents Markovian which means the arrival
distribution is Poisson; D represents a constant; G represents a General
distribution.
B - Departure Service Distribution. Same letters as in A are used in this
case.
C - Number of servers. Note that this does not provide information on
the formation of the servers.
D - System Capacity (if omitted, then it is assumed to be innite). This
is the maximum number of units that can be in the queue at any moment.
E - Calling Population (if omitted, then it is assumed to be innite). This
is the maximum number of units that can be put into the queue.
4.2 One Server Systems
Perhaps the simplest queueing system is the single server system: a queue that has only one server. This type of system is a nice, simple setting in which to build up the basics. We will impose the assumption that arrivals and departures follow a Homogeneous Poisson Process, which means:
1. N(t) is a non-decreasing function of time t, with the initial condition that N(0) = 0.
2. The process has independent increments. Specifically, no interval is dependent on a prior non-overlapping interval. Or mathematically:
N(s_1 + t_1) − N(s_1) is independent of N(s_2 + t_2) − N(s_2)
for s_1 + t_1 < s_2, with s_1, s_2, t_1, t_2 > 0.
3. The number of events occurring over a period (0, t] follows a Poisson distribution with mean λt. If the rate λ were not constant, then we would have a Non-Homogeneous Poisson Process.
First we're going to need some tools to help us understand the distribution of the events in a queue. Assume that we have a one-server queue in which elements enter the system but never leave the line without being served (one example is standing in line at the post office). Recall that we will only be using Markovian arrivals and departures since that simplifies the mathematics (as we'll be dealing with Poisson and exponential random variables). Begin with the first arrival: what is the distribution of the time until the first event occurs? More specifically, we are asking for the probability that at least one event occurs in the interval (0, t], given that no events have occurred by the present time. This is
P(N(t) − N(0) > 0 | N(0) = 0) = 1 − P(N(t) = 0) = 1 − e^{−λt}(λt)^0/0! = 1 − e^{−λt}
This is the exponential cdf! This is interesting because:
It's a distribution with which we are familiar.
It has a simple pdf and cdf.
It also has the memoryless property.
We can handle arrivals. What if we also have departures? Assume that we are in a post office and people leave the line after realizing they filled out the wrong form for their package. This is a model where arrivals happen at a rate λ_a and departures happen at a rate λ_d. Recall that this is called an M/M/1 server. Assuming that the queue isn't empty, what is the probability that an arrival will occur before a departure?
Let A denote the time to the next arrival and D the time to the next departure. We get:
P(A < D) = ∫_0^∞ P(A < D | D = t) f_D(t) dt    , by conditioning on D
         = ∫_0^∞ P(A < t) f_D(t) dt            , by independence
         = ∫_0^∞ F_A(t) f_D(t) dt              , this is a cdf
         = ∫_0^∞ (1 − e^{−λ_a t}) λ_d e^{−λ_d t} dt
         = λ_d ( ∫_0^∞ e^{−λ_d t} dt − ∫_0^∞ e^{−(λ_a + λ_d) t} dt )
         = 1 − λ_d/(λ_a + λ_d)
         = λ_a/(λ_a + λ_d)
Example The Campus Tech store will receive Apple Macbooks 2 times a week
and can sell them to a student at a rate of 1 per week. If all events are Poisson
processes, what is the probability that the store will receive a Macbook before
selling it?
Solution: Let λ_a denote the arrival rate of Macbooks at the store, and λ_d the selling rate of Macbooks to students. Thus, λ_a = 2 and λ_d = 1, and so
P(Arrive < Sell) = λ_a/(λ_a + λ_d) = 2/3.
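A short Monte Carlo sketch in R confirms this probability (the seed and sample size are arbitrary, not from the notes):

set.seed(340)
arrivals <- rexp(1e5, rate = 2)   # time to the next Macbook arrival
sales    <- rexp(1e5, rate = 1)   # time to the next sale
mean(arrivals < sales)            # close to 2/3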
4.2.1 Algorithm
The following is an algorithm to generate a single server system. For this to
work, we use the following variables:
Time Variables
t represents time.
Counter Variables
N_A and N_D represent the number of arrivals and departures by time t, respectively.
System State Variables
n represents the number of customers in the system.
Event List
t_A and t_D represent the times of the next arrival and next departure, respectively.
For this algorithm, the termination of the queue is at a particular time T. This value is predetermined.
Set t = N_A = N_D = 0.
Set the system state to zero (n = 0).
Generate t_A, and set t_D = t_A + 1.
Then we proceed with the cases to update the system.
Case 1: t_A ≤ t_D and t_A ≤ T
Reset t = t_A. This sets the system time to the most recent event time (the arrival).
Reset n = n + 1. This increases the current number of units in the queue by one.
Reset N_A = N_A + 1. This increases the number of arrivals by one.
Generate the time of the next arrival and reset t_A.
If n = 1:
Generate a service time Y, and set t_D = t + Y. If this is the first unit in the system, then its departure time must be generated.
Case 2: t_D < t_A and t_D ≤ T
Reset t = t_D. This sets the system time to the most recent event time (the departure).
Reset n = n − 1. This decreases the current number of units in the queue by one.
Reset N_D = N_D + 1. This increases the total number of departures by one.
If n = 0:
Reset t_D = t_A + 1.
Otherwise, generate a service time Y, and reset t_D = t + Y.
Notice that this algorithm puts arrivals before departures (even if they occur simultaneously).
Case 3: min(t_A, t_D) > T and n > 0
Reset t = t_D.
Reset n = n − 1.
Reset N_D = N_D + 1.
If n = 0:
Reset t_D = t_A + 1.
Otherwise, generate a service time Y and reset t_D = t + Y.
This is the instance where no more units are permitted to enter the queue; however, there are still units in the queue which require service.
Case 4: min(t_A, t_D) > T and n = 0
Output desired information.
No items remain in the queue and no arrivals can occur. This is the terminating case.
#*****************************************************
## Modified: MM1 Code
REPAIRSTAT340 <- function(birth, death, events, machines)
{
# Define the EL = (next birth=Ta, next death=Td)
TimeList <- rep(0,machines+1) #a list of times for each state 0 to m.
n <- 0
t <- 0
prevtime <- 0
infinit <- 1000000
Td <- infinit
Ta <- EXPGEN(birth,t,10)
for (event in 1:events)
{
#Arrival before departure
if (Ta <= Td)
{
prevtime <- t
t <- Ta
TimeList[n+1] <- TimeList[n+1] + t - prevtime
n <- n + 1
Td <- EXPGEN(death,t,10)
if (n==machines) Ta <- infinit #no arrival allowed.
else Ta <- EXPGEN((machines-n)*birth,t,10)
} #endif
#Departure before arrival
else if (Td <= Ta)
{
prevtime <- t
t <- Td
TimeList[n+1] <- TimeList[n+1] + t - prevtime
n <- n - 1
Ta <- EXPGEN((machines-n)*birth,t,10)
Td <- EXPGEN(death,t,10)
if (n==0) Td <- infinit
}#endelse
}#endfor
return(TimeList/(sum(TimeList)))
}
#*****************************************************
#Testing the code:
#------------------
birthr<- 0.1
deathr<- 0.1
m <- 6
dist<-.25
#Acceptable:
ACCEPT5 <- function(pvalue)
{
#Gives a range of p-values to accept
lower <- 0.3642
upper <- 0.4924
if (pvalue < upper)
{
if (pvalue > lower) ans <- "pass"
else ans <- "notpass"
}
else ans <- "notpass"
return(ans)
}
4.3 Two Server System with Servers in Series
Modelling single-server queues is definitely a good starting point; however, it has limited applications since the real world tends to be more interesting and complicated. There are two major differences that we will begin to address in multi-server systems: server formation and the mathematics. On the plus side, one affects the other. On the down side, one of them can get really ugly with many servers.
So now let us begin with a two-server system! If you recall the picture from the preamble of this chapter with Server 1 and Server 2, we can see that we can classify systems of servers by whether the servers work on different units at the same time or whether every unit is handled by both servers in turn. Systems where each unit in the queue goes to a single server and is then finished are called parallel systems, whereas systems where each unit is served first by one server and then by the other are called series systems.
Examples of a series system can be found in many service-oriented industries. One example is the Canadian health system: each unit is a patient in the system; they enter the hospital to be served and then, five days later, they see a doctor. Another service example: every time the author of this text goes to a certain store, he gets served by a cashier. The cashier always makes a mistake and then the author has to go to the customer service desk to fix the problem.
Other examples include a sequence of interviews where the interviewee goes from one interviewer to the next.
4.3.1 Algorithm
Generating these servers using computers can be very confusing, and here is an
example which is fairly intuitive, but still takes a bit of time to get accustomed
to. It is important to understand the notation of the algorithm before you begin
reading it.
Time Variables
t represents time.
System State Variables (shorthand SS)
(n_1, n_2) represents the number of units in each server's queue (which includes the unit in service). The subscript represents the i-th queue, in this case i = 1, 2.
Counter Variables
N_A represents the number of arrivals by time t.
N_D represents the number of departures by time t.
Output Variables
A_1(n): the arrival time of unit n into the system and hence server #1's queue.
A_2(n): the arrival time of unit n into server #2's queue.
D(n): the departure time of unit n from the queuing system.
Event List
t_A, t_1, t_2: where each represents the time of the next event for each area in the queue. t_A represents the next arrival time to the queue, t_1 represents the completion/departure time with Server 1 and t_2 represents the completion/departure time with Server 2. We are assuming that arrivals will happen indefinitely; however, if there are no units at Server 1 or at Server 2, then we must set the corresponding event time to a sufficiently large number so that the algorithm works (see below for an illustration). For simplicity, we will use ∞ to symbolize this number.
It is important to note that this algorithm doesn't have a finish time and is theoretically infinite. To address this, the programmer must decide what the limit of their algorithm is, whether it be that the code will run for a certain amount of "time" or that only a certain number of people can be served. It is important to have some aptitude in programming for simulations and to be certain that you can investigate closing conditions.
Initialization:
Set t = N_A = N_D = 0.
Set SS = (0, 0).
Generate an arrival time, t_A, and then set t_1 = t_2 = ∞.
Now that everything is set up, the following algorithm will collect data.
Case 1: t_A = min(t_A, t_1, t_2)
Reset t = t_A.
Reset N_A = N_A + 1.
Reset n_1 = n_1 + 1.
Generate a new arrival time, and set t_A equal to it.
If n_1 = 1, generate a departure time and set t_1 equal to it.
Case 2: t_1 = min(t_A, t_1, t_2)
Reset t = t_1.
Reset n_1 = n_1 - 1, n_2 = n_2 + 1.
If n_1 = 0, then set t_1 = ∞; otherwise, generate a new completion time for server one and set it equal to t_1.
If n_2 = 1, generate a new completion time for server two and set it equal to t_2.
Case 3: t_2 = min(t_A, t_1, t_2)
Reset t = t_2.
Reset N_D = N_D + 1.
Reset n_2 = n_2 - 1.
If n_2 = 0, then set t_2 = ∞.
If n_2 > 0, then generate a new completion time for server two.
Remember, when generating a new service time, to add the current time to the generated value so that the event has a chance to occur.
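The following is a minimal R sketch of this two-server series algorithm; it is not part of the original notes. It assumes exponential interarrival and service times with rates lam, mu1 and mu2 (these names and the use of rexp are assumptions, not the notation above), and it stops after a fixed number of events rather than running indefinitely.
seriesqueue <- function(lam, mu1, mu2, events)
{
  t <- 0; n1 <- 0; n2 <- 0
  Narr <- 0; Ndep <- 0
  INF <- 1e9                              #stands in for "infinity"
  tA <- rexp(1, lam); t1 <- INF; t2 <- INF
  for (k in 1:events)
  {
    if (tA == min(tA, t1, t2)) {          #Case 1: arrival to server 1's queue
      t <- tA; Narr <- Narr + 1; n1 <- n1 + 1
      tA <- t + rexp(1, lam)
      if (n1 == 1) t1 <- t + rexp(1, mu1)
    } else if (t1 == min(tA, t1, t2)) {   #Case 2: completion at server 1
      t <- t1; n1 <- n1 - 1; n2 <- n2 + 1
      t1 <- if (n1 == 0) INF else t + rexp(1, mu1)
      if (n2 == 1) t2 <- t + rexp(1, mu2)
    } else {                              #Case 3: departure from server 2
      t <- t2; Ndep <- Ndep + 1; n2 <- n2 - 1
      t2 <- if (n2 == 0) INF else t + rexp(1, mu2)
    }
  }
  c(arrivals = Narr, departures = Ndep, time = t)
}
seriesqueue(1, 2, 2, 10000)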
4.4 Two-server System with Servers in Parallel
Even in the two-server case the orientation of the servers can make for a com-
plicated system. Even though it is beyond the scope of this text, a three-server
queue has 4 unique ways to be set up with the assumption that the servers
are distinct. Parallel systems add to the dynamic of the system where multiple
servers can work on units in the queue simultaneously, such as in the famous
grocery store model. Fortunately, given our assumption of a Poisson process, one will find that the math becomes more of an exercise in managing the particulars of the model than a computational nightmare. However, it can still be computationally challenging.
As an exercise, alter the algorithm given in the previous section to simu-
late a two-server system in which two servers work in parallel. How will this
algorithm change if units are placed into lines for each server (a more accurate
representation of the grocery store model)?
4.5 Repair Problem
The repair problem is a variant of a common statistical problem known as a random walk. A random walk is a model where we map a series of outcomes or events to the integers (normally; the reals can also be used). Travelling from one state to another has its own probability, and each state can be reached from any other state in a finite number of state transitions. Under this model, we have births and deaths. A birth increases the number of units in the queue, whereas a death decreases the number of units in the queue. It is important to realize that these do not always represent biological births and deaths.
Example Consider a computer lab on campus: the lab has a certain number
of computers, say 30, and technicians, say 2. If a computer breaks down, then
a technician will begin working on the computer to x it. A computer breaking
down would constitute a birth as it adds a unit to the serving system; whereas,
a technician nishing xing a computer would constitute a death since it takes
a unit out of the server. We generally assume that only one server can work on
a unit at a time.
Now, assume that we have a time-homogeneous Poisson process representing the computer repair scenario described above, except with fewer computers: 4 computers and 2 servers. The breakdown of a machine follows an exponential distribution with rate 1, and the repair times R_i are exponential with rate 2. Again, a computer breaking is a birth and a computer being successfully fixed is a death. This is an M/M/2/4/4 queue. We have the following variables:
n := number of units in the queue.
λ_n := rate of births in state n.
μ_n := rate of deaths in state n.
P_n := proportion of time spent in state n.
Theorem 5. The rate leaving state n is equal to the rate entering state n. Which is to say,
$$P_n(\lambda_n + \mu_n) = P_{n-1}\lambda_{n-1} + P_{n+1}\mu_{n+1}, \quad \text{with } P_0\lambda_0 = P_1\mu_1;$$
therefore,
$$P_n = P_0 \prod_{i=0}^{n-1} \frac{\lambda_i}{\mu_{i+1}}.$$
For the births, suppose we are in state 0, then to get to state 1 one of the
computers must break down, which is Exp(4). From state 1 to 2, the distribution
of arrivals is Exp(3), and so forth.
For the deaths, moving from states 2, 3 or 4 into the next lowest state is
modelled by an Exp(4) distribution; whereas, moving from state 1 to 0 follows
an Exp(2) distribution.
We are interested in knowing how often we are in state n. Also, in which state do we spend the most time on average? Recall
$$P_n(\lambda_n + \mu_n) = P_{n-1}\lambda_{n-1} + P_{n+1}\mu_{n+1}.$$
$$P_0(0 + 4) = P_1 \cdot 2$$
$$P_1(3 + 2) = P_0 \cdot 4 + P_2 \cdot 4$$
$$P_2(2 + 4) = P_1 \cdot 3 + P_3 \cdot 4$$
$$P_3(1 + 4) = P_2 \cdot 2 + P_4 \cdot 4$$
$$P_4(0 + 4) = P_3 \cdot 1$$
This becomes a linear system of equations, which is solvable. We end up with the solution
$$P = \frac{16}{87}\left(1,\; 2,\; \tfrac{3}{2},\; \tfrac{3}{4},\; \tfrac{3}{16}\right).$$
Be sure to check that these values sum to 1.
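As a quick check (not part of the original notes), the stationary probabilities can be rebuilt in R from the product formula in Theorem 5 and compared with the solution above:
lambda <- c(4, 3, 2, 1)              #birth rates out of states 0,1,2,3
mu     <- c(2, 4, 4, 4)              #death rates out of states 1,2,3,4
ratios <- c(1, cumprod(lambda/mu))   #P_n / P_0 for n = 0,...,4
P <- ratios/sum(ratios)
P                                    #16/87 * c(1, 2, 3/2, 3/4, 3/16)
sum(P)                               #should be 1
sum(0:4 * P)                         #expected state, roughly 1.47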
To find in which state we spend the most time, calculate
$$E[N(t)] = \sum_n n\,P_n = 0\,P_0 + \dots + 4\,P_4 \approx 1.47,$$
and compare the individual $P_n$ values: $P_1 = \frac{32}{87}$ is the largest, so we expect to be in state 1 for the largest proportion of time. The
following graphic is called a System State Diagram.
Figure 4.3: System State Diagram for Computer Problem
4.6 Chapter Problems
4.6.1 Describe the following queues in terms of course terminology. Identify whether these are series or parallel queuing systems.
4.6.1i A gas station with 8 pumps.
4.6.1i Solution This is an M/M/8 queue (or M/M/8/n/t if there is a specified maximum capacity for waiting cars in the station, n, and a maximum number of vehicles registered in a particular area, t). This is a FIFO server in parallel.
4.6.1ii A penalty box in a game of hockey.
4.6.1ii Solution This is an M/M/3/5/5 queue (per team). It is a FIFO server
in parallel.
4.6.1iii A stop light on a four-way street.
4.6.1iii Solution This is an M/M/4 queue (can vary if a limit on vehicles in
wait is in place). This is a FIFO server in parallel.
4.6.1iv A stack of dishes.
4.6.1iv Solution This is an M/M/1 queue (can vary if there is a nite number
of dishes available). This is a LIFO server in series.
4.6.1v A buet table with three stations.
4.6.1v Solution This is an M/M/3 queue (can vary with limitations imposed
on the population available to the buet). This is a FIFO server in series.
4.6.2 A pawn shop will receive musical items to sell 3 times a week, and sell them to a customer at a rate of 2 per week. Treat all actions with trading of stock as Poisson processes and assume that there will never be insufficient stock to sell to customers. Answer the following:
4.6.2a What is the probability that the store will receive a musical item before
selling one? What is the probability of selling before receiving?
4.6.2a Solution Let A denote an arrival and D a departure of a musical item at the pawn shop. Then A and D follow Poisson processes with rates $\lambda_A = 3$ and $\lambda_D = 2$ respectively. Then we have
$$P(A < D) = \frac{\lambda_A}{\lambda_A + \lambda_D} = \frac{3}{3+2} = \frac{3}{5}$$
and similarly
$$P(A > D) = \frac{\lambda_D}{\lambda_A + \lambda_D} = \frac{2}{3+2} = \frac{2}{5}.$$
Thus the probability that a musical item will arrive before one will leave is $\frac{3}{5}$, and the complementary event has probability $\frac{2}{5}$.
4.6.2b What is the probability that the store will neither receive nor sell any
stock in one week given that at the beginning of the week it has sold
nothing?
4.6.2b Solution Let E denote that any event happens with respect to musical stock in the pawn shop. Then E follows a Poisson process with rate $\lambda_E = \lambda_A + \lambda_D = 3 + 2 = 5$. Then we want
$$P(N(1) = 0 \mid N(0) = 0) = \frac{P(N(1) = 0)}{P(N(0) = 0)} = \frac{\frac{5^0 e^{-5}}{0!}}{\frac{0^0 e^{-0}}{0!}} = e^{-5} \approx 0.00674.$$
4.6.3 The surviving cast of the original Star Trek is at a convention and are
approached by various nerds. People dressed up as Klingons arrive at a
rate of 6 per minute and leave at at rate of 2 per minute. People dressed
up as Vulcans arrive at a rate of 4 per minute and leave a rate of 3 per
minute. People dressed up as gaseous energy beings appear at a rate of 1
per minute and leave at a rate of 1 per minute.
4.6.3a What is the probability that a fan will arrive before a fan will leave?
4.6.3a Solution We can summarize this by using two different events: a fan arrives, with $\lambda_A = 6 + 4 + 1 = 11$, and a fan leaves, with $\lambda_D = 2 + 3 + 1 = 6$. Then this becomes
$$P(A < D) = \frac{\lambda_A}{\lambda_A + \lambda_D} = \frac{11}{11 + 6} = \frac{11}{17}.$$
4.6.3b What is the probability that a Klingon will leave before a Vulcan?
4.6.3b Solution Similar to the previous part, we are now looking for the first event to be a Klingon departure $K_D$ rather than a Vulcan departure $V_D$:
$$P(K_D < V_D) = \frac{\lambda_{K_D}}{\lambda_{K_D} + \lambda_{V_D}} = \frac{2}{2 + 3} = \frac{2}{5}.$$
4.6.3c What is the probability that a gaseous energy being will arrive before a
Klingon arrives and a Vulcan leaves?
4.6.3c Solution This is a slight modification of the previous question. We want the first event to be a gaseous arrival $G_A$, ahead of a Klingon arrival $K_A$ and a Vulcan departure $V_D$:
$$P(\min(G_A, K_A, V_D) = G_A) = \frac{\lambda_{G_A}}{\lambda_{G_A} + \lambda_{K_A} + \lambda_{V_D}} = \frac{1}{1 + 6 + 3} = 0.10.$$
4.6.3d What is the expected number of fans to arrive at the event in a four
hour period?
4.6.3d Solution Since fan arrivals form a Poisson process with rate $\lambda_A = 11$ per minute, $N(t) \sim$ Poisson($\lambda_A t$) with $t = 240$ minutes. The expected number of arrivals in 4 hours (240 minutes) follows from the properties of the Poisson distribution:
$$E[N(t)] = \lambda_A t = 11 \times 240 = 2640.$$
Therefore, over a four-hour period, the Star Trek cast can be expected to be accosted by 2640 nerds.
4.6.3e What is the probability that a gaseous energy being will leave after ten
minutes given that it hasnt left after 5 minutes?
4.6.3e Solution We want to solve the following:
$$P(G_D > 10 \mid G_D > 5) = P(G_D > 5), \text{ by the memoryless property,}$$
$$= 1 - P(G_D \le 5) = 1 - \left(1 - e^{-5\lambda_{G_D}}\right) = e^{-5} \approx 0.0067.$$
4.6.4
Chapter 5
Generating Random
Variables
In simulations, we need to create enough data to be able to make useful observa-
tions regarding any patterns in the data. Three realizations may not be enough
to see the behaviour of a stock price over time, or to determine the growth in
populations based on birth, death and immigration models.
We want to have suciently random data, but what does this really mean?
Lets say we are given the following sequence of randomly generated numbers:
9 9 9 9 9 9 9 9 9 9
We might guess that the next number in the sequence is 9 (and we would be wrong if the next number turned out to be 12). Either way, the numbers do not appear to be random: someone could, at least in principle, figure out the next number in the sequence using math. This is referred to as being deterministic.
So, we need a method of generating random variables or numbers which ap-
pears to be random, and at the same time is computationally simple. Again,
computers in this day and age are fast in mathematical computations, but sim-
plicity means an increase in eciency when were dealing with millions of data
points or more.
5.1 Pseudo Random Number Generators
It is important to note that humans are terrible random number generators. Computers, although entirely deterministic, have the advantage of appearing to be more random than humans. Who'd have thought that would be the case?
Even though we know for a fact that the computing process for generating
random numbers is deterministic, these numbers still seem quite random. How
do they do that?
Let's say we want to generate a series of numbers which appear to be Uniform(0,1) distributed. This can be done using modular arithmetic. If you recall classical algebra, we say that $a$ is congruent to $b$ mod $m$ if $a = c \cdot m + b$, where $c$ is an integer. For example,
$$31 = 4 \cdot 7 + 3 \equiv 3 \pmod 7.$$
Assume we are working in arithmetic mod 8 with $x_n = 5x_{n-1}$ and $x_0 = 3$ ($x_0$ is called the seed). Use math to show that we end up with the following sequence of numbers: 3, 7, 3, 7, . . .. Notice that the pattern begins to repeat after 2 iterations. We say that this algorithm has a period of 2, or simply period 2.
Definition 15. The period is the largest number of unique numbers before the first repetition.
Now, assume our seed is 4. Use math to show that we end up with the following sequence of numbers: 4, 4, 4, 4, . . .. Notice that it has period 1. This problem can be fixed using Linear Congruential Generators.
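As a small illustration (not from the notes), the sequences above can be reproduced in R with a generic helper; the function name lcg and its arguments are just assumptions for this sketch.
lcg <- function(a, b, m, seed, n)
{
  x <- numeric(n)
  x[1] <- seed
  for (i in 2:n) x[i] <- (a*x[i-1] + b) %% m
  x
}
lcg(5, 0, 8, 3, 8)   #3 7 3 7 ... : period 2
lcg(5, 0, 8, 4, 8)   #4 4 4 4 ... : period 1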
5.1.1 Linear Congruential Generators
Suppose we were to edit our previous example. Before, we had a generator of the form
$$x_n = a\,x_{n-1} \pmod m.$$
Now, let us add a constant term to help circumvent the situation where our generator repeats 0s infinitely, or something similar:
$$x_n = a\,x_{n-1} + b \pmod m.$$
This helps out a bit, but we can still have the situation where $a\,x_{n-1} + b \equiv 0 \pmod m$. We need one more condition.
5.1.2 Theorem
Statement - If $m$ is a prime, then the maximum period of a Linear Congruential Generator
$$x_n = a\,x_{n-1} \pmod m, \quad a \neq 0,$$
is $m - 1$, if and only if
$$a^{m-1} \equiv 1 \pmod m \quad \text{and} \quad a^i \not\equiv 1 \pmod m \;\; \text{for all } 0 < i < m - 1.$$
5.2 Inverse Transform Theorem for Continuous
Random Variables
One of the most powerful tools available to us is the uniform distribution: specifically, the Uniform(0,1) distribution. For those using R, you will readily have access to functions like runif, rexp, rgeom, and so forth. Assuming that you do not have access to any computer software, you'll need tools to mimic these functions on your own.
Theorem - Assume the following properties hold:
1. Let $X$ be a r.v. with cdf $F$ such that
$$\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1, \qquad F(x) < F(y) \text{ for } x < y.$$
2. $F(x) \le F(y)$ for $x \le y$.
3. $F^{-1}(F(x)) = x$, if $F$ is invertible.
4. Let $U \sim$ Uniform(0,1); then
$$P(U \le t) = \int_0^t du = t.$$
Then, let $U = F(X)$.
Statement of Theorem: Let $U$ be a Uniform(0,1) r.v. For any continuous cdf $F$, the r.v. defined as $X = F^{-1}(U)$ has distribution function $F$. This means if we are given $U = 0.2$, we know $F(x) = 0.2$ and we can find the value of $X$. In summary, given a U(0,1) r.v., we can obtain an $X$ from any continuous CDF simply by inverting the function.
Proof.
$$F_X(x) = P(X \le x)$$
$$= P(F^{-1}(U) \le x), \text{ by Property 1}$$
$$= P(F(F^{-1}(U)) \le F(x)), \text{ by Property 2}$$
$$= P(U \le F(x)), \text{ by Property 3}$$
$$= F(x), \text{ by Property 4}$$
5.2.1 Examples
Example Assume we need a r.v. to have the distribution function $G(x) = x^3$. The previous theorem says that by letting $u = x^3$, where $u \sim$ U(0,1), we have $u^{1/3} = x$. And so, to obtain a value of $x$ from $G(x)$, we obtain a $u$, $u \sim$ U(0,1), and then set $x = u^{1/3}$.
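A one-line R check of this example (not from the notes): the mean of the generated values should be close to the true mean of this distribution, 3/4.
u <- runif(10000)
x <- u^(1/3)        #inverse transform for G(x) = x^3
mean(x)             #roughly 0.75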
Exponential Example Assume that we have exponential data (with some unreported rate $\lambda$) and we want to convert our data to Uniform(0,1), either because we're paranoid that someone will review our results or we just like messing with people's minds. How can we convert our data from exponential to uniform? Let $x_i$ be a data realization with respective uniform representation $u_i$:
$$F(x_i) = 1 - e^{-\lambda x_i}$$
$$u_i = 1 - e^{-\lambda x_i}$$
$$1 - u_i = e^{-\lambda x_i}$$
$$\ln(1 - u_i) = -\lambda x_i$$
$$x_i = -\frac{1}{\lambda}\ln(1 - u_i)$$
5.3 Acceptance-Rejection Method
Assume that we want to generate a random variable from a probability function $f(x)$. However, $f(x)$ is not invertible, or does not have a "nice" method we can use to generate a random variable. On the other hand, we know of a function $g(x)$ that we can easily generate from, and $g(x)$ is reasonably "close" to $f(x)$. Furthermore, there is a constant $c$ for which $c\,g(x) \ge f(x)$ for all $x$, so that $1 \ge \frac{f(x)}{c\,g(x)} \ge 0$.
What do you get when you have all of these things? The Acceptance-Rejection Method! The full algorithm for the method is as follows:
1. Generate a realization $x$ from $g(x)$.
2. Generate $u$, a Uniform(0,1) realization.
3. Check: is $u < \frac{f(x)}{c\,g(x)}$? If yes, then we've generated a value (accept $x$). If not, repeat the process.
Let's look at an example.
Lets look at an example.
Example We wish to generate a random variable from the following discrete function:
x      0    1    2
f(x)   0.3  0.4  0.3
We decide to use a Binomial(2, 1/2) random variable as our $g(x)$ function. Let's look at the probabilities:
x      0     1    2
f(x)   0.3   0.4  0.3
g(x)   0.25  0.5  0.25
We need $c\,g(x) \ge f(x)$ for all $x$. Since $0.5 > 0.4$, we do not need to concern ourselves with $x = 1$. However, 0.3 is larger than 0.25 by a factor of $\frac{6}{5}$; therefore, this is our constant to satisfy the conditions of the acceptance-rejection method. Therefore $c = \frac{6}{5}$.
x       0    1    2
f(x)    0.3  0.4  0.3
cg(x)   0.3  0.6  0.3
This completes our requirements. So, we generate a value for $x$; suppose it is $x = 1$. Then with R, we can generate a random Uniform(0,1) value: 0.2817782. Then we have $\frac{f(1)}{\frac{6}{5}g(1)} = \frac{0.4}{0.6} \approx 0.67 > 0.2817782$. Therefore, we have generated a value! In this example, $x = 0$ and $x = 2$ will always be accepted, since the R function runif never generates values at the extremities.
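The whole example can be coded up directly; this is a minimal R sketch (not from the notes), where the helper name genAR is an assumption.
f <- c(0.3, 0.4, 0.3)
g <- dbinom(0:2, 2, 0.5)                   #the Binomial(2, 1/2) proposal
cc <- 6/5
genAR <- function()
{
  repeat {
    x <- rbinom(1, 2, 0.5)                 #step 1: draw from g
    u <- runif(1)                          #step 2: Uniform(0,1)
    if (u < f[x+1]/(cc*g[x+1])) return(x)  #step 3: accept or repeat
  }
}
table(replicate(10000, genAR()))/10000     #close to 0.3, 0.4, 0.3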
5.4 Compositions
5.5 Special Cases
In mathematics, there are always simplifications under certain constraints. Since lengthy computations can be taxing on resources, simplifying them can make the work easier.
5.5.1 Poisson
Suppose we are simulating a Poisson distribution and we are looking for a computationally simple method to calculate values of its probability function. Consider the simple case: a Poisson distribution with parameter $\lambda$. First, we look at the probability function for the distribution:
$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!}$$
Let $p_i$ be the probability that $X = i$ from this distribution. Evaluating $\frac{p_{i+1}}{p_i}$, we get the following:
$$\frac{p_{i+1}}{p_i} = \frac{\lambda^{i+1} e^{-\lambda}/(i+1)!}{\lambda^{i} e^{-\lambda}/i!} = \lambda\cdot\frac{i!}{(i+1)!} = \frac{\lambda}{i+1}$$
$$p_{i+1} = p_i\cdot\frac{\lambda}{i+1}$$
This is computationally simpler than the standard method of determining sequences of probabilities. Its algorithm is as follows:
1. Generate a Uniform(0,1) random variable, $u$.
2. Set $i = 0$, calculate $p_0 = e^{-\lambda}$, and set the running total $F = e^{-\lambda}$.
3. If $u < F$, then output $x = i$ and stop.
4. Else set $p = p\cdot\frac{\lambda}{i+1}$, $F = F + p$, and $i = i + 1$.
5. Go to 3.
Repeat the whole procedure until you have enough points. This is a computational blessing in that we no longer need to concern ourselves with factorial evaluations, which take longer to compute as $i$ increases.
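A minimal R version of this generator (not from the notes; the function name rpois1 is just for illustration):
rpois1 <- function(lambda)
{
  u <- runif(1)
  i <- 0
  p <- exp(-lambda)    #p_0
  cdf <- p             #running total F
  while (u >= cdf) {
    p <- p*lambda/(i + 1)
    cdf <- cdf + p
    i <- i + 1
  }
  i
}
mean(replicate(10000, rpois1(3)))   #should be near 3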
5.5.2 Non-Homogeneous Poisson Process
A non-homogeneous Poisson process is similar to a regular Poisson process except that the rate, $\lambda$, is not constant. In fact, the rate either becomes a function of time, $\lambda(t)$, or a function of the number of events at time $t$ (denoted $N(t)$), $\lambda(N(t))$. Remember that a Poisson process is built from exponential interarrival times; in this instance, we will build the Poisson count from exponentials. We will find the number of events up to a time $T$: we keep adding exponential gaps until we surpass $T$, and the number of gaps added is the number of events. First, we look at the homogeneous Poisson process:
Algorithm
1. Let $t = 0$, $i = 0$ for the time $t$ and the number of events $i$.
2. Generate a Uniform(0,1) random variable, $u$, and set $t := t - \frac{1}{\lambda}\ln(u)$. If $t > T$, stop and output $i$.
3. Otherwise, increase $i$ by 1 and return to step 2 with a fresh $u$.
This generates events from exponential gaps with constant rate $\lambda$. For a non-constant rate, we will use a process called thinning. Again, the rate usually depends on time, $\lambda(t)$, and we choose a rate $\lambda$ which is larger than $\lambda(t)$ for all $t$. We generate exponential gaps with rate $\lambda$, but decide whether or not to accept each candidate point by comparing $\lambda(t)/\lambda$ with a uniform draw. In other words, we don't accept every point.
Algorithm
1. Set $t := 0$, $i := 0$, for the time and the number of events respectively, as above.
2. Generate a Uniform(0,1) random variable, $u$, and set $t := t - \frac{1}{\lambda}\ln(u)$. If $t > T$, stop and output $i$.
3. Generate a Uniform(0,1) random variable, $v$. If $v < \frac{\lambda(t)}{\lambda}$, set $i := i + 1$ (accept the point); otherwise, leave $i$ unchanged (reject it).
4. Return to step 2.
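A short R sketch of thinning (not from the notes). Here lambda.t is an assumed rate function bounded above by lambda.max on (0, T].
nhpp <- function(lambda.t, lambda.max, T)
{
  t <- 0; i <- 0
  repeat {
    t <- t - log(runif(1))/lambda.max                   #candidate event at rate lambda.max
    if (t > T) return(i)
    if (runif(1) < lambda.t(t)/lambda.max) i <- i + 1   #accept with prob lambda(t)/lambda.max
  }
}
nhpp(function(t) 2 + sin(t), 3, 10)   #example run with lambda(t) = 2 + sin(t) <= 3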
5.5.3 Binomial
Similar to the Poisson distribution, there is an easy way to calculate incremental increases in $X \sim$ Binomial($n, \theta$). Given a similar setup, $p_i$ is the probability that $X = i$ for a binomial with probability function $f(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$. Again, compare $\frac{p_{i+1}}{p_i}$:
$$\frac{p_{i+1}}{p_i} = \frac{\binom{n}{i+1}\theta^{i+1}(1-\theta)^{n-i-1}}{\binom{n}{i}\theta^{i}(1-\theta)^{n-i}} = \frac{\frac{n!}{(i+1)!(n-i-1)!}}{\frac{n!}{i!(n-i)!}}\cdot\frac{\theta}{1-\theta} = \frac{i!}{(i+1)!}\cdot\frac{(n-i)!}{(n-i-1)!}\cdot\frac{\theta}{1-\theta} = \frac{n-i}{i+1}\cdot\frac{\theta}{1-\theta}$$
$$p_{i+1} = p_i\,\frac{(n-i)\,\theta}{(i+1)(1-\theta)}$$
Granted, this is less aesthetically pleasing than the Poisson simplification, but it is much nicer than constantly evaluating binomial coefficients.
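As with the Poisson case, this recursion turns directly into a generator; a minimal R sketch (not from the notes, with the name rbinom1 assumed):
rbinom1 <- function(n, theta)
{
  u <- runif(1)
  i <- 0
  p <- (1 - theta)^n   #p_0
  cdf <- p
  while (u >= cdf) {
    p <- p*(n - i)*theta/((i + 1)*(1 - theta))
    cdf <- cdf + p
    i <- i + 1
  }
  i
}
mean(replicate(10000, rbinom1(10, 0.3)))   #should be near n*theta = 3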
5.5.4 Normal/Gaussian
One Gaussian Generator
1. Generate two Uniform[-1,1] random variables, $Z_1$, $Z_2$.
2. If $z_1^2 + z_2^2 > 1$, go to 1.
3. Else, let $R = \sqrt{-2\ln(z_1^2 + z_2^2)}$.
We can't use the inverse transform theorem here because $F(x)$ does not have a closed-form solution. However, the joint probability function of two independent standard normal random variables is
$$f(x, y) = f_X(x)\,f_Y(y) = \frac{1}{2\pi}\,e^{-\frac{1}{2}(x^2 + y^2)}.$$
Instead of Cartesian coordinates, we will use polar-type coordinates. This is a change of variable from $(x, y)$ to $(d, \theta)$, for $d = x^2 + y^2$ and $\theta = \arctan\left(\frac{y}{x}\right)$. For this to work, we will need the Jacobian to apply the change (see a calculus textbook for more details).
74
$$J = \begin{vmatrix} \dfrac{\partial d}{\partial x} & \dfrac{\partial d}{\partial y} \\[6pt] \dfrac{\partial \theta}{\partial x} & \dfrac{\partial \theta}{\partial y} \end{vmatrix} = \begin{vmatrix} 2x & 2y \\[6pt] \dfrac{-\tfrac{y}{x^2}}{1 + \left(\tfrac{y}{x}\right)^2} & \dfrac{\tfrac{1}{x}}{1 + \left(\tfrac{y}{x}\right)^2} \end{vmatrix} = 2 \text{ (just believe it).}$$
Now to apply the change of variable:
$$f(x, y) = \frac{1}{2\pi}\,e^{-\frac{1}{2}(x^2 + y^2)} = \frac{1}{2\pi}\,e^{-\frac{1}{2}d}\cdot\frac{1}{2}, \text{ using the inverse of the Jacobian,}$$
$$= \frac{1}{2\pi}\cdot\frac{1}{2}\,e^{-\frac{1}{2}d}.$$
Upon inspection, this is the product of the densities of two random variables! One is Uniform$(0, 2\pi)$ (for $\theta$, with density $\frac{1}{2\pi}$) and the second is Exponential with mean 2 (for $d$, with density $\frac{1}{2}e^{-d/2}$), both of which are invertible! Therefore, we can make a new algorithm:
1. Generate two Uniform(0,1) random variables, $u_1$, $u_2$.
2. Set $d = -2\ln(u_1)$, $\theta = 2\pi u_2$.
3. Then $x = \sqrt{d}\cos\theta$, $y = \sqrt{d}\sin\theta$; this becomes
$$x = \sqrt{-2\ln u_1}\,\cos(2\pi u_2) = V_1\,\frac{\sqrt{-2\ln u_1}}{\sqrt{V_1^2 + V_2^2}}$$
$$y = \sqrt{-2\ln u_1}\,\sin(2\pi u_2) = V_2\,\frac{\sqrt{-2\ln u_1}}{\sqrt{V_1^2 + V_2^2}}$$
4. Remember that these points still must lie inside the unit circle. Therefore, for $V_1 = 2u_1 - 1$, $V_2 = 2u_2 - 1$: if $V_1^2 + V_2^2 > 1$, go to 1. Otherwise, use our $X$ and $Y$ values.
This algorithm can be cleaned up to get the first algorithm mentioned in
this chapter.
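A compact R sketch of the Box-Muller form of this algorithm (not from the notes):
u1 <- runif(5000); u2 <- runif(5000)
d <- -2*log(u1)            #Exponential with mean 2
theta <- 2*pi*u2           #Uniform(0, 2*pi)
x <- sqrt(d)*cos(theta)
y <- sqrt(d)*sin(theta)
c(mean(x), var(x), mean(y), var(y))   #roughly 0, 1, 0, 1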
5.6 Chapter Problems
5.6.1 Find the period of the following random number generator. Is it optimal?
$$x_n = 5x_{n-1} \pmod{19}$$
with seed $x_0 = 5$.
5.6.1 Solution
n:    0  1  2   3   4  5  6   7  8  9
x_n:  5  6  11  17  9  7  16  4  1  5
Therefore, this has period 9. This is not optimal because there are values between 1 and 19 that do not appear.
5.6.2 Find the period of the following random number generator. Is it optimal?
$$x_n = 3x_{n-1} \pmod{19}$$
with seed 2.
5.6.2 Solution
n:    0  1  2   3   4   5   6   7  8   9   10  11  12  13  14  15  16  17  18
x_n:  2  6  18  16  10  11  14  4  12  17  13  1   3   9   8   5   15  7   2
Therefore, this has period 18. It is an optimal LCG since the period is equal to one less than the modulus.
5.6.3 Find the period of the following Linear Congruential Generator. Is it optimal?
$$x_n = 5x_{n-1} + 2 \pmod{19}$$
with seed $x_0 = 5$.
5.6.3 Solution
n:    0  1  2  3  4   5   6  7  8   9
x_n:  5  8  4  3  17  11  0  2  12  5
This is not an optimal LCG, given that it has period 9.
5.6.4 For the following LCGs, determine whether the period is $m - 1$, where $m$ is the modulus. Use the seed $x_0 = 1$ for each.
5.6.4a $x_n = 5x_{n-1} \pmod{19}$.
5.6.4a Solution From 5.6.1 we know that $5^9 \equiv 1 \pmod{19}$. Therefore, this does not have a period of $m - 1 = 18$.
5.6.4b $x_n = 8x_{n-1} \pmod{11}$.
5.6.4b Solution
n:    0  1  2  3  4   5  6  7  8  9  10
x_n:  8  9  6  4  10  3  2  5  7  1  8
Therefore, this LCG has period 10, which is $m - 1$.
5.6.4c $x_n = 3x_{n-1} \pmod{14}$.
5.6.4c Solution Since this LCG does not have a prime modulus, we must evaluate each term explicitly.
n:    0  1  2   3   4  5  6
x_n:  3  9  13  11  5  1  3
The period is 6, which is not $m - 1 = 13$.
5.6.5 Given the following Uniform(0,1) data points, calculate the data points for the following distributions:
$$u_1 = 0.25, \quad u_2 = 0.50, \quad u_3 = 0.75$$
5.6.5a Exponential with parameter $\lambda = 5$.
5.6.5a Solution From the course text we have that, given a Uniform(0,1) realization, we can generate an exponential realization using the following formula:
$$x_i = -\frac{1}{\lambda}\ln(1 - u_i)$$
Hence for each of the points we have
$$x_1 = -\tfrac{1}{5}\ln(1 - 0.25) \approx 0.0575$$
$$x_2 = -\tfrac{1}{5}\ln(1 - 0.50) \approx 0.1386$$
$$x_3 = -\tfrac{1}{5}\ln(1 - 0.75) \approx 0.2773$$
5.6.5b Pareto with parameters $\alpha = 2$, $\beta = 4$.
The Pareto distribution has the following CDF: $F(x) = 1 - \left(\frac{\alpha}{x}\right)^{\beta}$, for $x > \alpha$.
5.6.5b Solution Since we have the CDF of the Pareto distribution, we can determine the form of the transformation:
$$F(x_i) = 1 - \left(\frac{\alpha}{x_i}\right)^{\beta}$$
$$u_i = 1 - \left(\frac{\alpha}{x_i}\right)^{\beta}$$
$$1 - u_i = \left(\frac{\alpha}{x_i}\right)^{\beta}$$
$$(1 - u_i)^{1/\beta} = \frac{\alpha}{x_i}$$
$$x_i = \alpha(1 - u_i)^{-1/\beta}$$
From here, we can plug in the values and get:
$$x_1 = 2(1 - 0.25)^{-1/4} \approx 2.1491$$
$$x_2 = 2(1 - 0.50)^{-1/4} \approx 2.3784$$
$$x_3 = 2(1 - 0.75)^{-1/4} \approx 2.8284$$
5.6.6 Given the following LCG, generate three realizations from an exponential distribution with parameter $\lambda = 0.50$:
$$x_n = 7x_{n-1} + 2 \pmod{23}$$
with $x_0 = 14$.
5.6.6 Solution Determining the first three terms using the LCG:
$$x_0 = 14, \qquad x_1 = 7\cdot 14 + 2 \pmod{23} \equiv 8, \qquad x_2 = 7\cdot 8 + 2 \pmod{23} \equiv 12.$$
Hence our uniform terms are $\frac{14}{23}$, $\frac{8}{23}$, and $\frac{12}{23}$. Using these in the inverse-transform formula for the exponential distribution, we can generate the terms that we desire:
$$x_1 = -2\ln\left(1 - \tfrac{14}{23}\right) \approx 1.8766$$
$$x_2 = -2\ln\left(1 - \tfrac{8}{23}\right) \approx 0.8549$$
$$x_3 = -2\ln\left(1 - \tfrac{12}{23}\right) \approx 1.4752$$
Therefore, we have generated three realizations of an Exp(0.50) distribution: $x_1 = 1.8766$, $x_2 = 0.8549$, and $x_3 = 1.4752$.
5.6.7 Given the following values, construct a valid acceptance-rejection algorithm for the following data using a Binomial(4, 0.70) proposal:
x:     0     1     2     3     4
f(x):  0.01  0.10  0.25  0.40  0.24
From this, test whether $x = 2$ will pass the acceptance-rejection algorithm when $u = 0.50$.
5.6.7 Solution First we must determine an appropriate value of $c$ to satisfy the condition $1 \ge \frac{f(x)}{c\,g(x)}$.
x:          0       1       2       3       4
f(x):       0.01    0.10    0.25    0.40    0.24
g(x):       0.0081  0.0756  0.2646  0.4116  0.2401
f(x)/g(x):  1.2346  1.3228  0.9448  0.9718  0.9996
With this information we can determine that the minimum value of $c$ satisfying $f(x) \le c\,g(x)$ for all $x$ is 1.3228. Algorithms may vary; see Section 5.3 for an example of one such algorithm. Secondly, given $u = 0.50$ and $x = 2$, we check whether
$$0.50 < \frac{f(2)}{c\,g(2)} = \frac{0.25}{1.3228 \times 0.2646} \approx 0.7143.$$
From this result we have that $x = 2$ will pass the AR method in this instance with $u = 0.50$; therefore, we have generated a random variable.
5.6.8 Create an algorithm to generate random variables from the following mixture distribution:
$$f(x) = \begin{cases} \frac{1}{4} & x \in (0, 1) \\[4pt] \frac{3}{4}\,e^{-(x-1)} & x \in [1, \infty) \end{cases}$$
5.6.8 Solution Generate two uniform random realizations $u_1$, $u_2$. Then:
1. If $u_1 < \frac{1}{4}$, set $x = u_2$.
2. Else set $x = 1 - \ln(u_2)$.
The first branch draws from the Uniform(0,1) component; the second inverts the shifted-exponential component.
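A minimal R sketch of this composition sampler (not from the notes; it assumes the density form written above):
rmix <- function(n)
{
  u1 <- runif(n); u2 <- runif(n)
  ifelse(u1 < 1/4, u2, 1 - log(u2))   #U(0,1) branch, or 1 + Exp(1) branch
}
x <- rmix(10000)
mean(x <= 1)   #should be close to 1/4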
Chapter 6
Variance Reduction
6.1 Introduction.
Imagine that you are employed by a brokerage rm and you are handling a series
of portfolios of both high-risk and low-risk stocks to generate the optimal yield
for your clients. The current prices of all stocks in the portfolio are known, as
well as the amounts owned by the client. However, the future stock prices are
not known. How do you manage the portfolio (i.e. which securities do you buy
and which do you sell)? Note that the total value of the investors portfolio is
the sum of the value of the individual stocks, and similarly, the overall return
for the portfolio is the sum of a weighted average of the individual returns.
Consider a significant simplification of this problem: what is the return on a single European call option? A European call option is a security with a strike price, $K$, and a maturity date, $T$. On the date of maturity, you can exercise the option and buy the stock at the strike price, or buy it at the current market price (whichever is lower). Assuming we have a model that reflects market conditions, and has a
reasonable approximation of the distribution of payouts, how much should we
invest in a single stock? Other possible questions of interest: what if we were...
1. inventory control managers at a retail store: how many goods should we
order and at what time?
2. in charge of regulating trac lights in a high-trac city intersection?
3. trying to optimize code by nding the most ecient order of math?
4. looking for possible fraudulent cases in a series of health insurance claims?
5. trying to model the frequency of accidents in an amusement park?
6. standing in a grocery store check-out line, watching as the new cashier
struggles to count change and we are interested in evaluating how much
weve died on the inside during this unmoving procession?
As you have seen in earlier chapters, we use simulation to solve these problems. A simple example is integration. In math, we often find that there are many expressions which we cannot solve explicitly. The most common example of this in statistics is the normal distribution's cdf:
$$F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)dt.$$
In fact, it can be shown that any integral of the form $\int e^{-x^2}\,dx$ has no closed-form solution. What if we want to know the area under the curve of $e^{-x^2/2}$ from 0 to 1 (i.e. evaluate $\int_0^1 e^{-x^2/2}\,dx$)?
This area can be estimated using Crude Monte Carlo Integration (CMC).
This area can be estimated using Crude Monte Carlo Integration (CMC).
Example The area of a unit circle (i.e. radius equal to 1) is $\pi r^2 = \pi \approx 3.14159\ldots$ Suppose we didn't know or have the formula $\pi r^2$; how might we determine the area of the circle? One suggestion involves firing points at random at a square (with equal side lengths of size 2) centred at the origin $(0, 0)$. The circle lies in the centre of the square and the area of the square is $2 \times 2 = 4$ units squared. If points are randomly fired, then the proportion of these points that lie in the circle relative to the square is indicative of the area of the circle.
Algorithmically:
1. Set the count to zero.
2. Generate an x coordinate on -1 to 1.
3. Generate a y coordinate on -1 to 1.
4. If the point $(x, y)$ lies in the unit circle ($x^2 + y^2 \le r^2 = 1$), then increase the count by 1. Return to 2.
5. Repeat steps 2 to 4, $n$ times.
6. The area of the unit circle is approximately $\frac{\text{count}}{n} \times \text{Area of the square} = 4\,\frac{\text{count}}{n}$.
R Code:
#Calculate the area of a circle in a square
Area<-function(n){
count<-0
for (i in 1:dim(n)[1]){
if (n[i,1]^(2)+n[i,2]^(2)<=1){
count<-count+1
}
}
4*count/dim(n)[1]
}
n<-cbind(runif(10,-1,1),runif(10,-1,1))
Area(n)
The following data were generated by using R, and the specifics of how these points were created will be discussed later in this chapter. As you can see, the larger the number of pairs (n), the closer we get to the actual answer. Theoretically, we obtain the exact value of $\pi$ after collecting infinitely many points, but who can wait for infinitely many iterations?
Number of Pairs (n)    Estimate of π
10 3.6
50 3.2
100 3.32
500 3.224
1,000 3.1764
5,000 3.1648
10,000 3.14848
50,000 3.12512
100,000 3.141516
This is an example of how Crude Monte Carlo integration works. We want
to solve a problem where we take samples and evaluate them according to the
model in order to determine expectation or other properties.
6.2 Crude Monte Carlo Integration with Bounds
of Zero and One
Introduction:
This method involves repeatedly sampling heights from a function, and using the average height as a measure of the function's area.
For crude Monte Carlo integration (hereafter referred to as CMC) to work, we begin by considering a random variable derived from the uniform distribution, $U(0, 1)$.
Recall:
1. Given a random variable $X \sim U(a, b)$, its probability density function is $f(x) = \frac{1}{b-a}$, for $a < x < b$.
2. Given a continuous function $g(X)$ of our random variable $X$, the expected value of $g(X)$ is
$$E(g(X)) = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx.$$
3. An expectation is a long-run average; so, in practice, $E(X) \approx \frac{1}{n}\sum_{i=1}^{n} x_i$.
Mathematically:
Returning to the original problem, we wish to determine the area beneath $g(x)$, from $x = 0$ to $x = 1$; or, in other words, we wish to evaluate $\theta = \int_0^1 g(x)\,dx$.
To begin, we consider a random variable $X$ that is $U(0, 1)$. Next, we find $E[g(X)]$.
First, note that $E[g(X)] \approx \frac{1}{n}\sum_{i=1}^{n} g(x_i)$ because
$$E[g(X)] = \int_0^1 g(x)\,f(x)\,dx \quad \text{[we can think of this as an average height, by 2 above]}$$
$$= \int_0^1 g(x)\,(1)\,dx \quad \text{[by 1 above]}$$
$$= \int_0^1 g(x)\,dx \quad \text{[which is the area of interest!]}$$
$$\approx \frac{1}{n}\sum_{i=1}^{n} g(x_i), \quad \text{where each } x_i \text{ is a realization of a } U(0,1) \text{ random variable, by 3 above.}$$
The last line is where the approximation is used. It is not too difficult to visualize how this works, since we have already estimated averages using the mean of a set of data. Estimating a definite integral using a finite number of data points is frequently referred to as the crude Monte Carlo method.
There are restrictions on what can be used for this estimation. The following conditions must hold:
1. Each $x_i$ is a realization of $X \sim U(0, 1)$.
2. For $n$ sufficiently large, the Law of Large Numbers gives $\frac{1}{n}\sum_i g(x_i) \approx \int_0^1 g(x)\,dx$.
3. $g(x_i)$ represents the height of the function.
It may be difficult to imagine now, but a long time ago, computers were significantly slower than they are today. Making algorithms or calculations efficient was very important. Even today, efficiency can help reduce the time a simulation takes, since a single simulation could be run over one hundred thousand times before termination. From the π calculation example earlier in this chapter, that relatively simple calculation for 100,000 terms took about 4 seconds, and the code was 6 lines long. More complicated algorithms might take much longer.
Example We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.
a) Write a CMC algorithm to evaluate $\theta$.
Algorithm:
1. Generate $n$ values of $x_i$, each one being a realization of $X \sim U(0, 1)$.
2. For each $x_i$, find $g(x_i) = e^{-x_i^2/2}$.
3. The estimated area is $\tilde{\theta} = \frac{1}{n}\sum_i g(x_i)$.
b) Use a normal table or some other means to calculate $\theta$.
The $N(0, 1)$ cdf is denoted by $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$.
Then $\theta = \sqrt{2\pi}\,(\Phi(1) - \Phi(0)) = (0.8413 - 0.5)\sqrt{2\pi} \approx 0.855$.
c) Write R code to estimate $\theta$. Using $n = 10$, 100 and 1000, provide estimates for $\theta$.
x=runif(10);
g=exp(-x^2/2);
theta=mean(g);
theta
Note that estimates will change. With $n = 10$, $\tilde{\theta} = 0.848762$.
Variability of our CMC Estimator:
We will need numerical quantities that reflect the variability of the crude Monte Carlo estimator. For simplicity:
1. Let $E(g(X)) = \theta$.
2. Let $Var(g(X)) = \sigma^2$.
3. $\theta$ is estimated by $\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$.
4. The estimator of $\theta$ is $\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} g(X_i)$.
5. $n$ is the sample size, the number of heights we sample.
6. Each $x_i$ is a realization of $X \sim U(0, 1)$.
$$Var(\tilde{\theta}) = Var\!\left(\frac{1}{n}\sum_{i=1}^{n} g(X_i)\right) = \frac{1}{n^2}\,Var\!\left(\sum_{i=1}^{n} g(X_i)\right) = \frac{1}{n}\,Var(g(X)) = \frac{\sigma^2}{n}$$
To estimate $\sigma$, we use what we call the standard error of our estimator of $g(X)$. It is of the form $\sqrt{\frac{\hat{\sigma}^2}{n}}$, where the sample variance is $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(g(x_i) - \bar{g}\right)^2$ and $\bar{g}$ is the sample mean of the $g(x_i)$.
Now that we have a measure for the variance, we can now begin to explore
other types of estimation and compare these to the crude Monte Carlo method.
This is important because we will be comparing every other technique against
this method.
6.3 Population Calculation
Let's say we want to determine the value of the parameter $\theta$ (e.g. the average height of the population of Canada) to within a fixed margin of error (denoted $L$). How do we determine the number of units to include in the study?
Theory: Consider the general form of the confidence interval for the parameter $\theta$:
$$CI \approx \hat{\theta} \pm c\,\sqrt{\frac{\hat{\sigma}^2}{n}}$$
Here $n$ is the sample size, $\hat{\sigma}$ is the estimated standard deviation, $\hat{\theta}$ is the estimate of the mean and $c$ is the corresponding table value. Since $\hat{\theta}$ is generally an average and $n$ is relatively large, we assume normality, and obtain $c$ from the normal table.
The length of the interval depends on the second term; therefore, it is the point of interest in the calculation of sample size. Examining the term on its own, we can see that the overall variability in the interval is
$$L = c\,\sqrt{\frac{\sigma^2}{n}}, \qquad L^2 = \frac{c^2\sigma^2}{n}, \qquad n = \left(\frac{c\,\sigma}{L}\right)^2.$$
$L$ is prespecified in the planning stage of the PPDAC process. $\sigma^2$, however, is unknown. To estimate $\sigma^2$, we either:
1. Use historical data: if another related survey or experiment was conducted, then it is acceptable to use the estimate of the variance from that data.
2. Run a small pilot survey: take a very small sample to derive an estimate for the variance using the model selected in the planning step.
Example The Fake Bureau of Random Statistics would like to determine the average shoe size for Canadian-born citizens between the ages of 13 and 19. The desired length of the confidence interval is 0.01. A cursory survey was conducted which showed a variance of 0.104. However, a study done fifty years ago on the sales of shoes based on size for teenagers had an estimate of the variance of $\frac{40}{333}$.
1. Determine $n$ using the cursory survey. Plugging in the values (with $c = 2$, so $c^2 = 4$), we get:
$$n_1 = \left\lceil \frac{4\sigma^2}{L^2} \right\rceil = \left\lceil 4 \times 0.104 \times 0.01^{-2} \right\rceil = 4160$$
2. Determine $n$ using the historical survey. Plugging in the values, we get:
$$n_2 = \left\lceil \frac{4\sigma^2}{L^2} \right\rceil = \left\lceil 4 \times \tfrac{40}{333} \times 0.01^{-2} \right\rceil = \lceil 4804.804805 \rceil = 4805$$
3. What are the advantages and disadvantages of both methods? Compare the two results.
The first method has the advantage of using current data; however, it requires running a small survey at the beginning of the process. The second method is good since it saves both time and money; however, the data may no longer be relevant due to the age of the values, or the historical values may not properly represent the current values. In the case of this example, the numbers of samples required differ greatly. For the purpose of running a simulation, an extra 645 iterations isn't much of a strain on resources, but this doesn't necessarily hold true in the real world. Experiments and surveys are costly with respect to both time and money. From this, we can gather that by reducing the variance of the model, we can effectively save two very important resources: time and money.
In both cases above, we round up. If we do not round up, we will not have sufficiently many units in our experiment to get the desired results with respect to interval length.
Example We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.
Using the CMC algorithm, how large should your sample size be to get an answer to within 0.0001 units at an approximately 95% confidence level?
#CMC
u=runif(1000);
est=exp(-u^2/2);
variance=var(est)
variance
Code Output: 0.014
Note that 1000 was arbitrarily selected. It is a relatively small number
(from a computing point of view).
$$n = \left\lceil \frac{4\sigma^2}{L^2} \right\rceil = \left\lceil 4 \times 0.014 \times 0.0001^{-2} \right\rceil = 5{,}600{,}000$$
Hence, we need to run our code 5,600,000 times to get an answer within
0.0001 units. Imagine what would happen if this werent a computer simulation,
and the amount of substantial resources that would be required to implement a
survey of this magnitude.
6.4 Crude Monte Carlo Integration with Bounds
from A to B
The issue with the above methodology is that it only allows you to integrate functions from 0 to 1 (i.e. $\theta = \int_0^1 g(x)\,dx$). This has probably bothered you, so let's expand to integrals of the form $\theta = \int_a^b g(x)\,dx$.
To evaluate these integrals, the secret is to perform a change of variable. The goal of the change of variable is not to make the function nicer (as it was in Calculus 1) but to make the bounds nicer instead.
Example In Calculus 1, you may have integrated $\int_6^{12} \exp(4x + 2)\,dx$ by making the substitution $u = 4x + 2$. Here, we aren't interested in the function (in fact, since the computer performs the evaluation, it can get as ugly as we'd like); instead, we will try to change the bounds to be from 0 to 1, so a good choice in this case might be $u = \frac{x - 6}{6}$.
In general, for finite $a$ and $b$, a good choice is $u = \frac{x - a}{b - a}$.
In the case of infinite $a$ or $b$, we want to do the following:
1. We only want one bound to be infinite, so we use the integral rule $\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{a} f(x)\,dx + \int_{a}^{\infty} f(x)\,dx$ ($a$ finite).
2. We make an appropriate change of variable. Ensure that whatever you select as your change of variable exists on the integral's domain. For example, one possibility is $u = \frac{1}{1 + x}$. However, this would not be a good choice if it were possible for $x = -1$.
Example We wish to evaluate $\theta = \int_0^5 e^{-x^2/2}\,dx$.
1. Write a CMC algorithm to estimate $\theta$.
2. We make the change of variable $u = \frac{x}{5}$, which gives $5\,du = dx$ and $5u = x$.
Based on this change of variable, we want $\theta = \int_0^1 e^{-(5u)^2/2}\cdot 5\,du$.
Algorithm:
1. Generate $n$ values of $x_i$, each of which is a realization of $X \sim U(0, 1)$.
2. For each $x_i$, find $g(x_i) = e^{-(5x_i)^2/2}\cdot 5$.
3. The estimated area is $\tilde{\theta} = \frac{1}{n}\sum_i g(x_i)$.
Example We wish to evaluate $\theta = \int_{-5}^{\infty} e^{-x^2/2}\,dx$.
a) Write a CMC algorithm to estimate $\theta$.
b) We make the change of variable
$$u = \frac{1}{6 + x}, \quad \text{which gives} \quad -\frac{1}{u^2}\,du = dx \quad \text{and} \quad \frac{1}{u} - 6 = x.$$
Based on this change of variable, we want
$$\theta = \int_0^1 \exp\!\left(-\left(\frac{1}{u} - 6\right)^2 \Big/\, 2\right)\frac{1}{u^2}\,du.$$
Algorithm:
1. Generate $n$ values of $x_i$, each of which is a realization of $X \sim U(0, 1)$.
2. For each $x_i$, find $g(x_i) = \exp\!\left(-\left(\frac{1}{x_i} - 6\right)^2 \Big/\, 2\right)\frac{1}{x_i^2}$.
3. The estimated area is $\tilde{\theta} = \frac{1}{n}\sum_i g(x_i)$.
Example We wish to evaluate $\theta = \int_1^{\infty} \frac{1}{e^{x}}\,dx$.
1. Write a CMC algorithm to evaluate $\theta$.
2. We want to map the domain $[1, \infty)$ to $[0, 1]$. A simple method for this is to use $x = \frac{1}{u}$. This technically maps $[1, \infty) \to (0, 1]$, but since we are using continuous uniform random variables, this is not a concern, as $P[U = 1] = 0$.
Then we have $u = \frac{1}{x}$, and $-\frac{1}{u^2}\,du = dx$.
Making use of the change of variables formula, we have
$$\theta = \int_0^1 \frac{1}{\exp\!\left(\frac{1}{u}\right)}\cdot\frac{1}{u^2}\,du.$$
See above for more detail about the algorithm.
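Since this particular integral also has a closed form ($e^{-1} \approx 0.3679$, assuming the integrand above), it makes a convenient check of the change-of-variable CMC estimator in R (this check is not part of the notes):
u <- runif(100000)
mean(exp(-1/u)/u^2)   #CMC estimate of the transformed integral; about 0.368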
6.5 Crude Simulation of a Call Option Under
the Black Scholes Model
Consider the integral used to price a European call option. Recall that a call option gives the owner the right, but not the obligation, to buy the stock at the strike price $K$.
The price of a stock at any particular time $t$ is denoted by $S_t$. The stock price when you purchase your call option is $S_0$, and when you decide to make your call it is $S_T$. Hence, $t \in \{0, 1, 2, \ldots, T\}$.
Now consider an integral used to price a call option. If a European option has a payoff $V(S_T)$, where $S_T$ is the value of the stock at maturity (which occurs at time $T$), then the option can be valued using the discounted future payoff from the option, assuming a risk-neutral measure:
$$e^{-rT}E[V(S_T)] = e^{-rT}E[V(S_0 e^{X})]$$
Under the Black-Scholes model, the random variable $X = \ln(S_T/S_0)$ is normally distributed with mean $rT - \sigma^2 T/2$ and variance $\sigma^2 T$. We can generate such a normal using the inverse transform theorem: if $U \sim U(0,1)$, then $X = \Phi^{-1}(U;\, rT - \tfrac{\sigma^2}{2}T,\, \sigma^2 T)$ has the required distribution (here $\Phi^{-1}(\cdot\,;\mu, \sigma^2)$ denotes the inverse cdf of a normally distributed random variable with mean $\mu$ and variance $\sigma^2$). As shown previously, the value of the option can then be written as an expectation of a r.v. governed by a U(0,1) distribution.
$$E[f(U)] = \int_0^1 f(u)\,du, \quad \text{where } f(u) = e^{-rT}\,V\!\left(S_0 \exp\!\left(\Phi^{-1}\!\left(u;\, rT - \tfrac{\sigma^2}{2}T,\, \sigma^2 T\right)\right)\right).$$
If $K < S_T$, it means that you can buy the stock at a price that is less than the market value.
If $K > S_T$, you would not exercise the call option, since it would cost less to purchase the stock in the open market.
To model a stock option, we will use the Black-Scholes-Merton model (hereafter referred to as BSM), which is a popular financial model. From above, we can see that our option has a payoff of
$$V(S_T) = \max(0,\, S_T - K).$$
The BSM assumes risk neutrality and measures the future payoff $V(S_T)$ by discounting to the present, $t = 0$:
$$e^{-rT}E[V(S_T)] = e^{-rT}E[V(S_0 e^{X})]$$
Recall that CMC estimators are generated by evaluating a function many times at randomly selected values (generated from a U(0,1)), and then averaging these values. The following code creates a function in R that estimates the expected value of a stock option:
fn<-function(u,S0,strike,r,sigma,T){
#u is a vector of uniform[0,1] randomly generated values
#S0 is the price of the stock at time 0
#strike is the strike price of the call option
#sigma is the variability/volatility of the stock
#r is the interest rate associated with the stock
#T is the time from t=0 until maturity, also the number of compounding periods
x<-S0*exp(qnorm(u,mean=r*T-sigma^2*T/2,sd=sigma*sqrt(T)))
y<-exp(-r*T)*pmax(x-strike,0)
y}
#Test values
fn(runif(10000),45,60,0.06,0.25,5)
fn(runif(100),10,10,0.04,0.3,0.5)
One run of this code resulted in an approximate value of the option of $\hat{\theta}_{CMC} = 0.4620$, with an estimated standard error of $\sqrt{8.7\times 10^{-7}}$. A put option is an option to sell a stock for a pre-specified price (also called the strike) at a given date of maturity, and it has the same pricing structure.
An advantage of the Monte Carlo methods versus numerical techniques is that we get a simple, accurate estimator, due to the fact that we use a sample mean in our calculations. For a series of $n$ simulations, we can measure the accuracy using the standard error of our sample mean, since
$$Var(\hat{\theta}_{CMC}) = \frac{Var(f(U_1))}{n}.$$
We can also obtain the standard error of the sample mean using the standard deviation,
$$SE(\hat{\theta}_{CMC}) = \frac{\hat{\sigma}_f}{\sqrt{n}},$$
where $\hat{\sigma}_f^2$ estimates $Var(f(U))$. This is estimated using the sample standard deviation. Since our function generates a vector of estimates, we can obtain a sample estimate of $\sigma_f$ by
sqrt(var(fn(...)))
The standard error can be obtained with
sf=sqrt(var(fn(...)))
sf/sqrt(length(u))
One run of the code gave an estimate of 0.6603 for the standard deviation $\sigma_f$, or a standard error of $0.6603/\sqrt{500000} \approx 0.0009$.
6.6 Antithetic Random Variables
Introduction:
1. Recall that sample size is calculated by $n = \frac{c^2\sigma^2}{L^2}$. Hence, if we can reduce $\sigma^2$, we can reduce the required sample size. Consider the variance of a sum of two random variables:
$$Var(X + Y) = Var(X) + Var(Y) + 2\,Cov(X, Y)$$
2. We can reduce this sum by making the covariance term negative.
Mathematically:
To reduce the variance of our CMC estimator, we will build a new estimator called $\tilde{\theta}_{ANTI}$, which is comprised of two random variables, $f(U)$ and $f(V)$. To ensure the covariance term is less than zero, we can use the following conditions:
1. For two continuous $U(0,1)$ random variables, $U$ and $V$, we want their covariance to be less than 0. We can set $V = 1 - U$. This is one example of a choice for $V$ that will ensure the covariance is less than 0.
2. $f(U)$ should be a monotonic function.
Then $Cov(f(U), f(V)) < 0$.
Our antithetic estimator is
$$\tilde{\theta}_{ANTI} = \frac{1}{2n}\sum_{i=1}^{n}\left(f(U_i) + f(1 - U_i)\right)$$
Note that in the antithetic case, we end up doubling the number of observations, which is why the divisor is $2n$ instead of $n$.
How does the antithetic method improve the variability? Is the antithetic estimator unbiased?
Unbiasedness:
Again, using $U_i \sim U(0, 1)$ random variables and setting $V_i = 1 - U_i$, we get the following:
$$E[\tilde{\theta}_{ANTI}] = E\!\left[\frac{1}{2n}\sum_{i=1}^{n}\left(f(U_i) + f(1 - U_i)\right)\right]$$
$$= \frac{1}{2n}\left(E\!\left[\sum_{i=1}^{n} f(U_i)\right] + E\!\left[\sum_{i=1}^{n} f(1 - U_i)\right]\right)$$
$$= \frac{1}{2n}\left(n\theta + n\theta\right) = \frac{2n}{2n}\,\theta = \theta$$
Thus, the antithetic estimator is unbiased.
Variability:
By taking the variance of the antithetic estimator, we get the following:
$$Var(\tilde{\theta}_{ANTI}) = Var\!\left(\frac{1}{2n}\sum_{i=1}^{n}\left(f(U_i) + f(V_i)\right)\right)$$
$$= \frac{1}{4n^2}\,Var\!\left(\sum_{i=1}^{n}\left(f(U_i) + f(V_i)\right)\right)$$
$$= \frac{1}{4n^2}\sum_{i=1}^{n}\left(Var(f(U_i)) + Var(f(V_i)) + 2\,Cov(f(U_i), f(V_i))\right)$$
$$= \frac{1}{4n^2}\left(n\sigma^2 + n\sigma^2 + 2n\,Cov(f(U_i), f(V_i))\right)$$
$$= \frac{1}{2n}\left(\sigma^2 + Cov(f(U_i), f(V_i))\right) = \frac{1}{2n}\left(\sigma^2 + C\right)$$
In the last line, we let $C$ denote the covariance.
Now that we have the theoretical variance of the antithetic estimator, we can compare this to the CMC estimator.
Now that we have the theoretic variance of the antithetic estimator, we can
compare this to the CMC estimator.
Variance Issue:
Ignoring the covariance for a moment, we see that $Var(\tilde{\theta}_{ANTI}) = \frac{\sigma^2}{2n} < \frac{\sigma^2}{n} = Var(\tilde{\theta}_{CMC})$. This is not an accurate comparison, since it does not take into consideration that the antithetic estimator uses twice as many values (heights). To balance this, we need to do one of several things:
1. Compare the antithetic estimator on $\frac{n}{2}$ sampled heights to that of the CMC on $n$.
2. Compare $2\,Var(\tilde{\theta}_{ANTI})$ to $Var(\tilde{\theta}_{CMC})$.
3. Compare $Var(\tilde{\theta}_{ANTI})$ to $\frac{1}{2}Var(\tilde{\theta}_{CMC})$.
Function Evaluations: A function evaluation occurs every time you call a function. For example, in CMC, where the estimator is $\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} g(X_i)$, the number of times we call $g(x)$ is $n$. Thus, $n$ is the number of function evaluations.
Efficiency: For an equal number of function evaluations, the efficiency of a new estimator, $\hat{\theta}_{NEW}$, is calculated as
$$\text{Efficiency} = \frac{Var(\hat{\theta}_{CMC})}{Var(\hat{\theta}_{NEW})}$$
When we compare the two variances, we use the ratio of the two. A number larger than one implies that the variance of the CMC is worse than that of the NEW estimator. To interpret this number, we use an example:
Suppose that the efficiency is 5. This means that we need 5 times the number of function evaluations using the CMC estimator to obtain the same precision as the new estimator.
Example: Antithetic When comparing the antithetic estimator against the crude Monte Carlo estimator (which can be thought of as how much faster the antithetic method is):
$$\text{Efficiency} = \frac{Var(\hat{\theta}_{CMC})}{2\,Var(\hat{\theta}_{ANTI})} = \frac{\frac{\sigma^2}{n}}{\frac{2}{2n}(\sigma^2 + C)} = \frac{\sigma^2}{\sigma^2 + C}$$
Hence, we will only see an improvement in efficiency when $C < 0$.
Example We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.
a) Write an algorithm to estimate $\theta$ using an antithetic estimator.
On the domain, $g(x) = e^{-x^2/2}$ is a monotonic function; hence, an antithetic estimator would be a good choice here.
Algorithm:
1. Generate $n$ values of $u_i$, each a realization of $U \sim U(0, 1)$.
2. For each $u_i$, set $v_i = 1 - u_i$ and find $h(u_i) = e^{-u_i^2/2}$ and $h(v_i) = e^{-v_i^2/2}$.
3. The estimated area is $\tilde{\theta} = \frac{1}{2n}\sum_{i=1}^{n}\left(h(u_i) + h(v_i)\right)$.
b) Write R code to estimate $\theta$. Using $n = 10$, 100 and 1000, provide estimates for $\theta$.
u=runif(10);
v=1-u;
g1=exp(-u^2/2);
g2=exp(-v^2/2);
est=(g1+g2)/2;
theta=mean(est);
theta
Note that estimates will change. With $n = 10$, $\hat{\theta}_{ANTI} = 0.8157554$.
As a result, to make the CMC estimator's variance worse than that of the ANTI estimator, we want to ensure the covariance, $C$, is negative.
6.7 Stratified Sampling
In survey sampling, the act of stratifying is when the surveyor partitions the population into subpopulations called strata. For this method to be effective, the attributes within each stratum must be similar, while the attributes of the populations in different strata are different. A couple of examples will illustrate this concept:
Example Problem: We want to determine the average size of a Canadian citizen's English vocabulary.
Plan: Canada has two principal languages: French and English. French is primarily spoken in Quebec, while English is mainly spoken in the other 9 provinces and 3 territories. The author of the study suspects that those in Quebec will have a smaller English vocabulary than those in the rest of Canada (on average). Hence, the author breaks Canada into two strata: Quebec and the rest of Canada.
Example Problem: We want to determine the average income of Cana-
dians.
Plan: We break the country into 13 strata (one for each province and
territory) and conduct sampling in each one separately.
Question: What benefit(s) do you suspect stratifying gives?
It is very important to note that different populations have different variances. This is also true for many mathematical functions. Although it is conceivable to split a finite population into finitely many strata, we will only look at the case where there are two strata.
Consider each of these functions: finding the point through which one can draw a dividing line is quite straightforward for A. However, what about B and C? We can see that B is an oscillating function, increasing over time, and C looks like a quadratic function. Having to choose a point is arbitrary and there are infinitely many points from which we can choose; however, we have no simple way to determine the one which best isolates the variance, save for lengthy empirical calculations.
Given a function on the interval (0,1), we wish to select a point $a$ at which the variance of the function begins to differ from the rest of the function. This point is arbitrary, but should be reasonable. Thus, we can split up the function we wish to estimate:
$$\int_0^1 f(x)\,dx = \int_0^a f(x)\,dx + \int_a^1 f(x)\,dx$$
In the first integral let $x = au$ (so $dx = a\,du$), and in the second let $x = (1-a)u + a$ (so $dx = (1-a)\,du$), giving
$$\int_0^1 f(x)\,dx = a\int_0^1 f(au)\,du + (1-a)\int_0^1 f\big((1-a)u + a\big)\,du.$$
This leads to the estimate
$$\hat{\theta}_{STRAT} = \frac{a}{n_1}\sum_{i=1}^{n_1} f(y_i) + \frac{1-a}{n_2}\sum_{j=1}^{n_2} f(y_j^*),$$
where $y_i = a\,u_i$ and $y_j^* = (1-a)u_j + a$. Check that this is an unbiased estimator.
Recall: The variance of the CMC estimator, $\tilde{\theta}_{CMC}$, was $Var(\tilde{\theta}_{CMC}) = \frac{\sigma^2}{n}$. Hence, the larger the sample size $n$, the smaller the variance of our estimator.
In terms of the stratified estimator, we have two sample sizes, $n_1$ and $n_2$. In most situations, we have a maximum total sample size, $n$. In this case, $n = n_1 + n_2$.
We want to minimize the variance of our stratified estimator to improve the efficiency. The problem, stated mathematically, is:
Find $n_1$, $n_2$ by minimizing $Var(\tilde{\theta}_{STRAT})$, subject to the constraint $n = n_1 + n_2$.
Let $V_i$ be the variance of stratum $i$. Without going into detail, the solution to the problem is to make the stratum sample sizes proportional to the length of the stratum's interval times the variance of the stratum. In notation, this is $n_1 \propto a\,V_1$ and $n_2 \propto (1-a)\,V_2$. Since we often have a limit on how many units can be included (due to budget or time constraints, for example), we can determine the sample size for stratum $i$ by
$$n\left(\frac{n_i}{n_1 + n_2}\right) = n\,\frac{c\,V_i}{a\,V_1 + (1-a)\,V_2},$$
where $c = a$ or $c = 1 - a$ depending on $i$. Due to rounding, the numbers will have to be adjusted so that $n = n_1 + n_2$ to meet the constraint.
Now we derive the variance of this estimator:
$$Var(\tilde{\theta}_{STRAT}) = Var\!\left(a\,\tilde{\theta}_1 + (1-a)\,\tilde{\theta}_2\right) = a^2\,Var(\tilde{\theta}_1) + (1-a)^2\,Var(\tilde{\theta}_2) = a^2\,\frac{\sigma_1^2}{n_1} + (1-a)^2\,\frac{\sigma_2^2}{n_2},$$
where $n_i$ is the size of the sample and $\sigma_i^2$ is the variance of stratum $i$.
The following algorithm gives an efficient means to obtain and evaluate the estimator:
1. Make a graph and select an $a$.
2. Do a simple pilot study with a small $n$ to get an idea of the variance. Let $n_1 = n_2$.
3. Plug this into a program (e.g. R) to estimate the variance of each stratum.
4. Determine the optimal $n_i$, $i \in \{1, 2\}$, with $n_i$ proportional to (length of stratum $i$) $\times$ Var(stratum $i$); that is, use $n\left(\frac{n_i}{n_1 + n_2}\right)$.
5. Perform your study using the new numbers.
Example
We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.
a) Write a STRAT algorithm using the value $a = \frac{1}{3}$ to evaluate $\theta$.
\[
\hat{\theta}_{STRAT} = \frac{1/3}{n_1}\sum_{i=1}^{n_1} f(y_i) + \frac{2/3}{n_2}\sum_{j=1}^{n_2} f(y_j^*)
= \frac{1/3}{n_1}\sum_{i=1}^{n_1} e^{-(u_i/3)^2/2} + \frac{2/3}{n_2}\sum_{j=1}^{n_2} e^{-(2v_j/3+1/3)^2/2}.
\]
Algorithm:
1. Generate $n_1$ values $u_i$ and $n_2$ values $v_j$, realizations of $U \sim U(0,1)$.
2. The estimated area is
\[
\hat{\theta} = \frac{1/3}{n_1}\sum_{i=1}^{n_1} e^{-(u_i/3)^2/2} + \frac{2/3}{n_2}\sum_{j=1}^{n_2} e^{-(2v_j/3+1/3)^2/2}.
\]
b) Assuming that we can sample $n = 1000$ points (function evaluations) in total, what values of $n_1$ and $n_2$ should we use?
u=runif(1000);
v=runif(1000); # why is there a v term at all?
g1=exp(-(1/3*u)^2/2);
g2=exp(-(2/3*u+1/3)^2/2);
n1=ceiling(1000*(1/3*var(g1))/(1/3*var(g1)+2/3*var(g2)))
n2=1000-n1
During this run of the program, I obtained $n_1 = 14$ and $n_2 = 986$.
c) Write down the R code using the algorithm in a) and the values in b) to estimate $\theta$. Using $n = 10$, 100 and 1000, provide estimates for $\theta$.
u=runif(n1);
v=runif(n2);
g1=exp(-(1/3*u)^2/2);
g2=exp(-(2/3*v+1/3)^2/2);
est=1/3*mean(g1)+2/3*mean(g2)
est
Note that the estimates will change. For $n = 10$, $\tilde{\theta}_{STRAT} = 0.8832353$.
In fact, we can have $m$ disjoint strata that cover the entire domain. If we define the sizes of the strata to be $a_1, a_2, \ldots, a_m$ with variances $V_1, V_2, \ldots, V_m$, then the optimal sample sizes are
\[
n_j = n\,\frac{a_j V_j}{\sum_{i=1}^{m} a_i V_i}, \qquad j = 1, \ldots, m.
\]
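A small R sketch of this allocation rule; the integrand, the stratum boundaries and the pilot size below are illustrative assumptions, not values taken from the notes.
# optimal allocation across m strata for integrating f over (0,1)
f=function(x) exp(-x^2/2)            # example integrand (an assumption)
breaks=c(0,1/3,2/3,1)                # m = 3 strata (an assumption)
a=diff(breaks)                       # stratum lengths a_1,...,a_m
pilot=200                            # pilot draws per stratum (an assumption)
V=sapply(seq_along(a), function(j) {u=runif(pilot); var(f(breaks[j]+a[j]*u))})
n=1000
nj=round(n*a*V/sum(a*V))             # n_j proportional to a_j * V_j
nj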
Example: Call Options. Define $f$ by the BSM model; in R, this is the function fn of section 6.5. The following code stratifies our function at a prespecified value of $a$:
x=runif(50000)
y=runif(50000)
STRATA1=a*fn(a*x)
STRATA2=(1-a)*fn(a+(1-a)*y)
F=c(STRATA1,STRATA2)
var(F)/length(F)
The variance of the sample mean of the components of the vector F is var(F)/length(F), or around $9.2\times 10^{-8}$. Since each component of the vector above corresponds to two function evaluations, we should compare this with a crude Monte Carlo estimator with $n = 1{,}000{,}000$, having variance $\sigma_f^2/10^6 = 4.36\times 10^{-7}$. This corresponds to an efficiency gain of $43.6/9.2$, or around 5. We can afford to use one fifth of the sample size by simply stratifying the sample into two strata. The improvement is somewhat limited by the fact that we are still sampling in a region in which the function is 0 (although now slightly less often).
6.8 Control Variates
The crude Monte Carlo integration method is a means for us to empirically calculate the value of a definite integral which would otherwise be impossible to obtain. Consider the function $e^{x^2}$ over the interval (0,1). Although we cannot integrate that function exactly, what we can do is find a function which is very close to $e^{x^2}$ but has a known integral, such as $e^{x}$. We may be able to use this fact to help us integrate $e^{x^2}$. This is the logic behind the control variate technique.
For a given definite integral which still satisfies the crude Monte Carlo integration criteria, we have
\[
\theta = \int_0^1 f(x)\,dx
= \int_0^1 \big(f(x) - g(x) + g(x)\big)\,dx
= \int_0^1 \big(f(x)-g(x)\big)\,dx + \int_0^1 g(x)\,dx.
\]
This method will work given the following criteria:
1. $f(x)$ and $g(x)$ must be close (i.e. $f(x) \approx g(x)$).
2. $g(x)$ must be integrable in closed form, since $f(x)$ usually isn't.
This method works because the variability of $h(x) = f(x) - g(x)$ is smaller than the variability of $f(x)$ on its own. Note that "closeness" is arbitrary.
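One way to make "close" concrete (a step the notes do not spell out): for $X \sim U(0,1)$,
\[
Var\big(h(X)\big) = Var\big(f(X)\big) + Var\big(g(X)\big) - 2\,Cov\big(f(X), g(X)\big),
\]
so the control variate pays off exactly when $g$ is strongly positively correlated with $f$ and of comparable scale.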
We define the control variate estimator to be
\[
\hat{\theta}_{CV} = \frac{1}{n}\sum_{i=1}^{n} h(X_i) + \int_0^1 g(x)\,dx.
\]
As in the prior cases, we investigate unbiasedness and variability.
Unbiasedness:
\begin{align*}
E\big[\tilde{\theta}_{CV}\big] &= E\!\left[\frac{1}{n}\sum_{i=1}^n h(X_i) + \int_0^1 g(x)\,dx\right] \\
&= \frac{1}{n}\,E\!\left[\sum_{i=1}^n \big(f(X_i) - g(X_i)\big)\right] + \int_0^1 g(x)\,dx \\
&= \theta - \frac{1}{n}\,E\!\left[\sum_{i=1}^n g(X_i)\right] + \int_0^1 g(x)\,dx \\
&= \theta.
\end{align*}
Thus, the control variate estimator is unbiased.
Variability: (the convergence of the estimator to this mean holds by the Law of Large Numbers)
\begin{align*}
Var\big(\tilde{\theta}_{CV}\big) &= Var\!\left(\frac{1}{n}\sum_{i=1}^n h(x_i) + \int_0^1 g(x)\,dx\right)
= \frac{1}{n^2}\,Var\!\left(\sum_{i=1}^n h(x_i)\right)
= \frac{\sigma_h^2}{n}.
\end{align*}
Thus, $\sigma_h^2$ depends on the function $h$; and provided that $f(x) \approx g(x)$,
\[
\frac{\sigma^2_{CV}}{n} < \frac{\sigma^2_{CMC}}{n}.
\]
6.9 Optimal Control Variates
Now, if we want to improve the efficiency of our simulation method, we can consider a more general expression for the control variate, which we will call the optimal control variate (OCV for short):
\[
\tilde{\theta}_{OCV} = \frac{1}{n}\sum_{i=1}^n \big(f(X_i) - d\,g(X_i)\big) + d\int_0^1 g(x)\,dx.
\]
Note that if $d = 1$, we get the same estimator as the control variate estimator above.
For a general $d$, the variance of the estimator is as follows:
\begin{align*}
Var\big(\tilde{\theta}_{OCV}\big) &= \frac{1}{n^2}\sum_{i=1}^n Var\big(f(x_i) - d\,g(x_i)\big) \\
&= \frac{1}{n^2}\sum_{i=1}^n \Big(Var(f(x)) + d^2\,Var(g(x)) - 2d\,Cov\big(f(x), g(x)\big)\Big).
\end{align*}
We now want to find the value of $d$ that minimizes this variance. Let $Var(f(x)) = \sigma^2_f$, $Var(g(x)) = \sigma^2_g$ and $Cov(f(x), g(x)) = \sigma_{fg}$. Then
\begin{align*}
Var\big(\tilde{\theta}_{OCV}\big) &= \frac{1}{n^2}\sum_{i=1}^n\big(\sigma_f^2 + d^2\sigma_g^2 - 2d\,\sigma_{fg}\big)
= \frac{1}{n}\big(\sigma_f^2 + d^2\sigma_g^2 - 2d\,\sigma_{fg}\big); \quad\text{differentiate with respect to } d: \\
\frac{\partial}{\partial d} Var\big(\tilde{\theta}_{OCV}\big) &= \frac{1}{n}\big(0 + 2d\sigma_g^2 - 2\sigma_{fg}\big); \quad\text{set equal to } 0: \\
0 &= 2d\sigma_g^2 - 2\sigma_{fg} \\
d &= \frac{\sigma_{fg}}{\sigma_g^2} = \frac{Cov\big(f(X), g(X)\big)}{Var\big(g(X)\big)}.
\end{align*}
Thus, the variance reduction (and hence the efficiency improvement) relies entirely on the fact that the variance of a definite integral is 0, since it is a constant, and on the two functions being relatively close. Again, "closeness" is arbitrary.
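Plugging the optimal $d$ back in (a step the notes leave out) shows the size of the possible gain: with $\rho$ denoting the correlation between $f(X)$ and $g(X)$,
\[
Var\big(\tilde{\theta}_{OCV}\big) = \frac{1}{n}\left(\sigma_f^2 - \frac{\sigma_{fg}^2}{\sigma_g^2}\right) = \frac{\sigma_f^2\,(1-\rho^2)}{n},
\]
so the efficiency gain over crude Monte Carlo is $1/(1-\rho^2)$.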
Example: We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.
a) Write a CV algorithm using the control variate $g(x) = e^{-x}$ to evaluate $\theta$.
\begin{align*}
\hat{\theta}_{CV} &= \frac{1}{n}\sum_{i=1}^n h(X_i) + \int_0^1 g(x)\,dx \\
&= \frac{1}{n}\sum_{i=1}^n \left(e^{-x_i^2/2} - e^{-x_i}\right) + \int_0^1 e^{-x}\,dx \\
&= \frac{1}{n}\sum_{i=1}^n \left(e^{-x_i^2/2} - e^{-x_i}\right) + \left(1 - e^{-1}\right).
\end{align*}
Algorithm:
1. Generate $n$ values $u_i$, realizations of $U \sim U(0,1)$.
2. For each $u_i$, find $h(u_i) = e^{-u_i^2/2} - e^{-u_i}$.
3. The estimated area is
\[
\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-u_i^2/2} - e^{-u_i}\right) + \left(1 - e^{-1}\right).
\]
b) Write the R code using the algorithm in a) to estimate $\theta$. Using $n = 10$, 100 and 1000, provide estimates for $\theta$.
u=runif(10);
g1=exp(-u^2/2);
g2=exp(-u);
est=(g1-g2);
theta=mean(est)+(1-exp(-1));
sigmasq=var(est);
theta
Note that estimates will change. For $n = 10$, $\hat{\theta}_{CV} = 0.811051$.
c) Write an OCV algorithm using the control variate $g(x) = e^{-x}$ to evaluate $\theta$.
\[
\tilde{\theta}_{OCV} = \frac{1}{n}\sum_{i=1}^n \left(e^{-x_i^2/2} - d\,e^{-x_i}\right) + d\left(1 - e^{-1}\right) \quad\text{(from a)},
\]
where
\[
d = \frac{Cov\big(f(X), g(X)\big)}{Var\big(g(X)\big)}.
\]
Algorithm:
1. Generate $n$ values $u_i$, realizations of $U \sim U(0,1)$.
2. For each $u_i$, find $h(u_i) = e^{-u_i^2/2} - d\,e^{-u_i}$, where $d = \dfrac{Cov\!\left(e^{-U^2/2},\, e^{-U}\right)}{Var\!\left(e^{-U}\right)}$ is estimated from the sample.
3. The estimated area is
\[
\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-u_i^2/2} - d\,e^{-u_i}\right) + d\left(1 - e^{-1}\right).
\]
d) Write R code using the algorithm in c) to estimate $\theta$. Using $n = 10$, 100 and 1000, provide estimates for $\theta$.
u=runif(10);
g1=exp(-u^2/2);
g2=exp(-u);
d=cov(g1,g2)/var(g2)
est=(g1-d*g2);
theta=mean(est)+d*(1-exp(-1));
theta
Note: estimates will change. For $n = 10$, $\hat{\theta}_{OCV} = 0.8880204$.
e) Compare the efficiencies of the OCV, CV, STRAT and ANTI estimators with $n = 1000$.
Combining the code from above:
#CMC
u=runif(1000);
est=exp(-u^2/2);
CMC=var(est)/1000;
#ANTI
u=runif(1000);
v=1-runif(1000);
g1=exp(-u^2/2);
g2=exp(-v^2/2);
est=(g1+g2)/2;
theta=mean(est);
ANTI=var(est)/1000;
#CV
u=runif(1000);
g1=exp(-u^2/2);
g2=exp(-u);
est=(g1-g2);
theta=mean(est)+(1-exp(-1));
CV=var(est)/1000;
#OCV
u=runif(1000);
g1=exp(-u^2/2);
g2=exp(-u);
d=cov(g1,g2)/var(g2)
est=(g1-d*g2);
theta=mean(est)+d*(1-exp(-1));
OCV=var(est)/1000;
#STRAT
u=runif(1000);
v=runif(1000);
g1=exp(-(1/3*u)^2/2);
g2=exp(-(2/3*u+1/3)^2/2);
n1=ceiling(1000*(1/3*var(g1))/(1/3*var(g1)+2/3*var(g2)))
n2=1000-n1
u=runif(n1);
v=runif(n2);
g1=exp(-(1/3*u)^2/2);
g2=exp(-(2/3*v+1/3)^2/2);
est=1/3*mean(g1)+2/3*mean(g2)
STRAT=1/9*var(g1)/n1+4/9*var(g2)/n2;
output=as.matrix(t(CMC/c(STRAT,OCV,CV,ANTI)))
colnames(output)=c("STRAT","OCV","CV","ANTI")
output
   STRAT      OCV       CV     ANTI
1.306962 10.03115 2.496909  2.08554
Hence, the best (most efficient) method in this case is OCV. It is roughly 10 times more efficient than CMC. This means that the CMC code requires about 10 times as many function evaluations to achieve the same accuracy as OCV. It may not seem like much, but it implies that I can run my OCV code 1000 times to get the same accuracy as someone using CMC 10,000 times.
6.10 Importance Sampling
Recall the definition of crude Monte Carlo integration:
\[
E[g(X)] = \int f(x)\,g(x)\,dx.
\]
If $X \sim U(0,1)$, and hence $f(x) = 1$, then we have the basis of the other variance reduction techniques. Now we consider what happens if $X$ is not uniformly distributed.
In the control variate case, we changed the formula by adding and subtracting a known function $g(x)$: basically, by adding zero to the integral, retaining its unbiasedness and allowing us to evaluate it more easily. In importance sampling, we will instead multiply by 1. The known function in this case will be $h(x)$, which is selected under the following assumptions:
1. $h(x)$, or at least a constant times $h(x)$, is a pdf.
2. We have a method to generate a random variable from $h(x)$. For the moment, assume that this allows us to use any distribution R can generate, including normal, exponential, chi-squared, gamma, etc.
3. $\dfrac{f(x)\,g(x)}{h(x)} \approx k$, a constant, so that it has little variability.
Note: often, for simplicity, we ignore $g(x)$, meaning we assume $g(x) = 1$.
Theory:
Mathematically, we will have done nothing; however, this will accomplish a lot more than nothing. Most notably, it allows us to work with a proper random variable having a pdf:
\begin{align*}
\theta &= \int f(x)\,g(x)\,dx \\
&= \int f(x)\,g(x)\left(\frac{h(x)}{h(x)}\right)dx, \quad\text{for } h(x) \neq 0\ \forall x \\
&= \int \left(\frac{f(x)\,g(x)}{h(x)}\right) h(x)\,dx \\
&= E\!\left[\frac{f(X)\,g(X)}{h(X)}\right].
\end{align*}
Under our simplifying assumption, $\theta = E\!\left[\dfrac{f(X)}{h(X)}\right]$. And, since an expectation is a weighted average, we can get our estimator:
\[
\hat{\theta}_{IMP} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)\,g(x_i)}{h(x_i)}.
\]
Using our simplifying assumption, we have
\[
\hat{\theta}_{IMP} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)}{h(x_i)}.
\]
In all of these cases, the $x_i$ are generated from the density $h$, with cdf $H$.
The variance of the importance sampling technique involves using Taylor's approximation; for our purposes,
\[
Var\big(\hat{\theta}_{IMP}\big) = \frac{1}{n}\,Var\!\left(\frac{f(X)}{h(X)}\right).
\]
The following assumptions on $h(x)$ should hold:
1. $h(x)$ should be close to a constant times $f(x)$, i.e. $\dfrac{f(x)}{h(x)} \approx k$.
2. $h(x)$ should not equal zero at points where $f(x)$ is non-zero. (Where $f(x) = 0$, the point contributes nothing and can be ignored.)
Example: We wish to evaluate $\theta = \int_{-\infty}^{\infty} e^{-|x|}\,dx$.
a) Write an IMP algorithm using the $N(0,1)$ pdf to evaluate $\theta$.
\begin{align*}
\theta &= \int_{-\infty}^{\infty} e^{-|x|}\,dx \\
&= \int_{-\infty}^{\infty} e^{-|x|}\left(\frac{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right)}{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right)}\right)dx \\
&= \int_{-\infty}^{\infty}\left(\frac{e^{-|x|}}{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right)}\right)\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right)dx \\
&= E\!\left[\frac{e^{-|X|}}{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{X^2}{2}\right)}\right], \quad\text{where } X \sim N(0,1).
\end{align*}
Algorithm:
1. Generate $n$ values $x_i$, realizations of $X \sim N(0,1)$.
2. The estimated area is
\[
\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\frac{e^{-|x_i|}}{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x_i^2}{2}\right)}.
\]
b) Write R code using the algorithm in a) to estimate $\theta$. Using $n = 10$, 100 and 1000, provide estimates for $\theta$.
#IMP
u=rnorm(1000,0,1);
g1=exp(-abs(u));
g2=1/sqrt(2*pi)*exp(-u^2/2);
est=mean(g1/g2);
Note: estimates will change. For $n = 10$, $\hat{\theta}_{IMP} = 1.958993$.
c) Determine the exact value of $\theta = \int_{-\infty}^{\infty} e^{-|x|}\,dx$.
Answer: 2.
6.11 Combining Monte Carlo Estimators
We have now seen a number of different variance reduction techniques. The main challenge is to determine the best one of them for a given function. For instance, the variance formula may be used as a basis for choosing a best method; however, the main issue with this is that these variances (and efficiencies) must also be estimated from the simulation. Furthermore, it is rarely clear a priori which sampling procedure and estimator is best. For example, if a function $f$ is monotone on $[0,1]$ then an antithetic variate can be introduced with an estimator of the form
\[
\hat{\theta}_{a1} = \frac{1}{2}\big[f(U) + f(1-U)\big], \quad U \sim U[0,1]. \quad (6.1)
\]
If, however, the function is increasing to a maximum somewhere around $\frac{1}{2}$, and then decreasing thereafter, we might prefer
\[
\hat{\theta}_{a2} = \frac{1}{4}\big[f(U/2) + f((1-U)/2) + f((1+U)/2) + f(1-U/2)\big]. \quad (6.2)
\]
Notice that any weighted average of these two unbiased estimators of $\theta$ would also provide an unbiased estimator of $\theta$. The large number of potential variance reduction techniques is an embarrassment of riches. Thus the question is: which variance reduction methods should we use, and how will we know whether one is better than its competitors? The answer is to use all of the methods (within reason, of course); in fact, choosing a single method is often neither necessary nor desirable. Rather, it is preferable to use a weighted average of the available estimators, with the optimal choice of the weights provided by regression.
Suppose in general that we have $k$ estimators or statistics $\hat{\theta}_i$, $i = 1, \ldots, k$, all unbiased estimators of the same parameter $\theta$, so that $E(\hat{\theta}_i) = \theta$ for all $i$. In vector notation, letting $\boldsymbol{\Theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$, we write $E(\boldsymbol{\Theta}) = \theta\mathbf{1}$ where $\mathbf{1}$ is the $k$-dimensional column vector of ones, so that $\mathbf{1}' = (1, 1, \ldots, 1)$. Let us suppose for the moment that we know the variance-covariance matrix $V$ of the vector $\boldsymbol{\Theta}$, defined by
\[
V_{ij} = cov(\hat{\theta}_i, \hat{\theta}_j).
\]
Theorem 6. (best linear combinations of estimators)
The linear combination of the $\hat{\theta}_i$ which provides an unbiased estimator of $\theta$ and has minimum variance among all linear unbiased estimators is
\[
\hat{\theta}_{blc} = \sum_i b_i\,\hat{\theta}_i \quad (6.3)
\]
where the vector $\mathbf{b} = (b_1, \ldots, b_k)'$ is given by
\[
\mathbf{b} = (\mathbf{1}'V^{-1}\mathbf{1})^{-1}\,V^{-1}\mathbf{1}.
\]
The variance of the resulting estimator is
\[
Var(\hat{\theta}_{blc}) = \mathbf{b}'V\mathbf{b} = 1/(\mathbf{1}'V^{-1}\mathbf{1}).
\]
Proof. It is easy to see that for any linear combination (6.3), the variance of the estimator is $\mathbf{b}'V\mathbf{b}$, and we wish to minimize this quadratic form as a function of $\mathbf{b}$, subject to the constraint that the coefficients add to one, i.e. $\mathbf{b}'\mathbf{1} = 1$. Introducing a Lagrange multiplier $\lambda$, we set the derivatives with respect to the components $b_i$ equal to zero:
\[
\frac{\partial}{\partial \mathbf{b}}\left[\mathbf{b}'V\mathbf{b} + \lambda(\mathbf{b}'\mathbf{1} - 1)\right] = 0
\quad\text{or}\quad 2V\mathbf{b} + \lambda\mathbf{1} = 0,
\quad\text{so}\quad \mathbf{b} = \text{constant}\times V^{-1}\mathbf{1}.
\]
Upon requiring that the coefficients add to one, we discover that the value of the constant above is $(\mathbf{1}'V^{-1}\mathbf{1})^{-1}$.
This theorem indicates that the ideal linear combination of estimators has coefficients proportional to the row sums of the inverse covariance matrix. Notably, the variance of a particular estimator $\hat{\theta}_i$ is only one of many ingredients in that sum. In practice, we almost never know the variance-covariance matrix $V$ of a vector of estimators $\boldsymbol{\Theta}$. However, through simulation, we can evaluate these estimators using the same uniform input to each and obtain independent replicated values of $\boldsymbol{\Theta}$. This permits us to estimate the covariance matrix $V$, and since we typically conduct many simulations, this estimate can be very accurate.
Let us suppose that we have $n$ simulated values of the vector $\boldsymbol{\Theta}$, and call these $\boldsymbol{\Theta}_1, \ldots, \boldsymbol{\Theta}_n$. As usual, we estimate the covariance matrix $V$ using the sample covariance matrix
\[
\hat{V} = \frac{1}{n-1}\sum_{i=1}^{n}\big(\boldsymbol{\Theta}_i - \bar{\boldsymbol{\Theta}}\big)\big(\boldsymbol{\Theta}_i - \bar{\boldsymbol{\Theta}}\big)',
\qquad\text{where}\quad \bar{\boldsymbol{\Theta}} = \frac{1}{n}\sum_{i=1}^{n}\boldsymbol{\Theta}_i.
\]
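A hedged R sketch of this recipe (the matrix Theta of replicated estimators is assumed to come from your own simulation; the function name is made up for illustration):
# Theta: an n-by-k matrix; row i holds the k estimators computed from the
# i-th set of common uniform inputs
best.linear.combination=function(Theta) {
  V=cov(Theta)                      # sample covariance matrix of the estimators
  ones=rep(1,ncol(Theta))
  Vinv1=solve(V,ones)               # V^{-1} 1
  b=Vinv1/sum(Vinv1)                # weights proportional to the row sums of V^{-1}
  list(estimate=mean(Theta %*% b),  # combined estimate
       weights=b,
       variance=1/sum(Vinv1)/nrow(Theta))  # estimated variance of the combined mean
}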
Let us return to the example and attempt to find the best combination of the many estimators we have considered so far. To this end, let
\begin{align*}
\hat{\theta}_1 &= \frac{0.53}{2}\big[f(.47+.53u) + f(1-.53u)\big], \quad\text{an antithetic estimator,} \\
\hat{\theta}_2 &= \frac{0.37}{2}\big[f(.47+.37u) + f(.84-.37u)\big] + \frac{0.16}{2}\big[f(.84+.16u) + f(1-.16u)\big], \quad\text{a stratified-antithetic estimator,} \\
\hat{\theta}_3 &= 0.37\,f(.47+.37u) + 0.16\,f(1-.16u), \quad\text{a stratified estimator,} \\
\hat{\theta}_4 &= \int g(x)\,dx + \big[f(u) - g(u)\big], \quad\text{a control variate estimator,} \\
\hat{\theta}_5 &= \hat{\theta}_{IMP}, \quad\text{an importance sampling estimator.}
\end{align*}
All these estimators are obtained from a single input uniform random variate $U$. In order to determine the optimal linear combination, we need to generate simulated values of all 5 estimators using the same uniform random numbers as inputs. We determine the best linear combination of these estimators using the following MATLAB code:
function [o,v,b,V]=optimal(U)
% generates optimal linear combination of five estimators and outputs
% average estimator, variance and weights
% input U: a row vector of U[0,1] random numbers
T1=(.53/2)*(fn(.47+.53*U)+fn(1-.53*U));
T2=.37*.5*(fn(.47+.37*U)+fn(.84-.37*U))+.16*.5*(fn(.84+.16*U)+fn(1-.16*U));
T3=.37*fn(.47+.37*U)+.16*fn(1-.16*U);
intg=2*(.53)^3+.53^2/2;
T4=intg+fn(U)-GG(U);
T5=importance(fn,U);
X=[T1' T2' T3' T4' T5']; % each column holds the replications of one estimator;
                         % each row holds the 5 estimators computed from the same U
mean(X)
V=cov(X);                % this estimates the covariance matrix V
on=ones(5,1);
V1=inv(V);               % the inverse of the covariance matrix
b=V1*on/(on'*V1*on);     % vector of coefficients of the optimal linear combination
o=mean(X*b);             % the optimal linear combination, averaged over the inputs
v=1/(on'*V1*on);         % variance of the optimal linear combination based on a single U
One run of this estimator, called with [o,v,b,V] = optimal(unifrnd(0,1,1,1000000)), yields
o = 0.4615
b = [-0.5499, 1.4478, 0.1011, 0.0491, -0.0481].
The estimate 0.4615 is accurate to at least four decimals, which is not surprising since the variance per uniform random number input is $v = 1.13\times 10^{-5}$. In other words, the variance of the mean based on 1,000,000 uniform inputs is about $1.13\times 10^{-11}$, so the standard error is around 0.000003 and we can expect accuracy to at least 4 decimal places. Note that some of the weights are negative and one is greater than one.
Do these negative weights indicate estimators that are worse than useless? Not necessarily: the effect of subtracting some estimators may be to render the remaining function more nearly linear, and hence more easily estimated. We note that negative coefficients are quite common in regression. The efficiency gain over crude Monte Carlo is an extraordinary 40,000. However, since there are 10 function evaluations for each uniform variate input, the efficiency when adjusting for the number of function evaluations is 4,000.
This simulation, using 1,000,000 uniform random numbers and taking about 63 seconds on a Pentium IV (2.4 GHz) (including the time required to generate all five estimators), is equivalent to forty billion simulations by crude Monte Carlo: a major task on a supercomputer!
If we intended to use this simulation method repeatedly, we might well wish to see whether some of the estimators can be omitted without too much loss of information. Since the variance of the optimal estimator is $1/(\mathbf{1}'V^{-1}\mathbf{1})$, we might use this to attempt to select one of the estimators for deletion. Notice that it is not so much the covariance matrix $V$ of the estimators which enters into Theorem 6 but its inverse $J = V^{-1}$, which we can consider as a type of information matrix, by analogy with maximum likelihood theory. For example, we could choose to delete the $i$-th estimator, i.e. delete the $i$-th row and column of $V$, where $i$ is chosen to have the smallest effect on $1/(\mathbf{1}'V^{-1}\mathbf{1})$, or equivalently on its reciprocal $\mathbf{1}'J\mathbf{1} = \sum_{i,j} J_{ij}$.
In particular, if we let $V_{(i)}$ be the matrix $V$ with the $i$-th row and column deleted and $J_{(i)} = V_{(i)}^{-1}$, then we can identify $\mathbf{1}'J\mathbf{1} - \mathbf{1}'J_{(i)}\mathbf{1}$ as the loss of information when the $i$-th estimator is deleted. Since not all estimators require the same number of function evaluations, we should adjust this information by $FE(i) =$ the number of function evaluations required by the $i$-th estimator. In other words, if an estimator is to be deleted, it should be the one corresponding to
\[
\min_i \frac{\mathbf{1}'J\mathbf{1} - \mathbf{1}'J_{(i)}\mathbf{1}}{FE(i)},
\]
and we should drop this $i$-th estimator only if the minimum is less than the information per function evaluation in the combined estimator. In doing so, we increase the information available in our simulation per function evaluation.
In the above example, with all five estimators included, $\mathbf{1}'J\mathbf{1} = 88{,}757$ (with 10 function evaluations per uniform variate), so the information per function evaluation is $8{,}876$.

i   1'J1 - 1'J_(i)1   FE(i)   (1'J1 - 1'J_(i)1)/FE(i)
1   88,048            2       44,024
2   87,989            4       21,997
3   28,017            2       14,008
4   55,725            1       55,725
5   32,323            1       32,323
In this case, if we were to eliminate one of the estimators, our choice would likely be number 3, since it contributes the least information per function evaluation. However, since all contribute more than 8,876 per function evaluation, we should likely retain all five estimators.
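A hedged R sketch of this deletion rule (V is assumed to be the estimated covariance matrix of the estimators and FE the vector of function-evaluation counts, both obtained from your own simulation):
# information lost, per function evaluation, by deleting each estimator
information.loss=function(V,FE) {
  J=solve(V)
  total=sum(J)                                # 1'J1
  loss=sapply(seq_len(nrow(V)), function(i) total-sum(solve(V[-i,-i,drop=FALSE])))
  data.frame(estimator=seq_along(loss), loss=loss, FE=FE, loss.per.FE=loss/FE)
}
# drop the estimator with the smallest loss.per.FE only if that value is below
# the combined information per function evaluation, i.e. 1'J1 divided by the
# total number of function evaluations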
6.12 Variance Reduction Using Common Random Numbers
We now discuss another variance reduction technique, closely related to antithetic variates, called common random numbers. It is used, for example, whenever we wish to estimate the difference in performance between two systems, or any other quantity involving a difference, such as the slope of a function.
Example 3. For a simple example, suppose we have two estimators $\hat{\theta}_1$, $\hat{\theta}_2$ of the center of a symmetric distribution. We would like to know which of these estimators is better in the sense that it has smaller variance when applied to a sample from a specific distribution symmetric about its median. If both estimators are unbiased estimators of the median, then the first estimator is better if
\[
Var(\hat{\theta}_1) < Var(\hat{\theta}_2),
\]
and so we are interested in estimating a quantity like
\[
E\,h_1(X) - E\,h_2(X),
\]
where $X$ is a vector representing a sample from the distribution and $h_1(X) = \hat{\theta}_1^2$, $h_2(X) = \hat{\theta}_2^2$.
1. Generate samples and hence values of /
1
(A
I
), i = 1, ..., : and 1/
2
(A

), , =
1, 2, ..., : independently, and use the estimator
1
:
n

I=1
/
1
(A
I
)
1
:
n

=1
/
2
(A

).
2. Generate samples and hence values of /
1
(A
I
), /
2
(A
I
), i = 1, ..., : inde-
pendently, and use the estimator
1
:
n

I=1
(/
1
(A
I
) /
2
(A
I
)).
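A small R sketch of the two approaches; the estimators (sample mean and sample median), the symmetric distribution (a t with 5 degrees of freedom) and the sample sizes are illustrative assumptions:
set.seed(1)
m=25; reps=2000
h1=function(x) mean(x)^2              # squared value of estimator 1
h2=function(x) median(x)^2            # squared value of estimator 2
# 1. independent samples for the two terms
d.indep=replicate(reps,h1(rt(m,df=5)))-replicate(reps,h2(rt(m,df=5)))
# 2. common random numbers: both terms computed on the same sample
d.crn=replicate(reps,{x=rt(m,df=5); h1(x)-h2(x)})
c(mean(d.indep),mean(d.crn))          # both estimate E h1(X) - E h2(X)
c(var(d.indep),var(d.crn))            # the common-sample version has much smaller variance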
It seems intuitive that the second method is preferable, since it removes the variability due to the particular sample from the comparison. This is a common type of problem in which we want to estimate the difference between two expected values. For example, we may be considering investing in a new piece of equipment that will speed up processing at one node of a network, and we wish to estimate the expected improvement in performance between the new system and the old. In general, suppose that we wish to estimate the difference between two expectations, say
\[
E\,h_1(X) - E\,h_2(Y), \quad (6.4)
\]
where the random variable or vector $X$ has cumulative distribution function $F_X$ and $Y$ has cumulative distribution function $F_Y$. Notice that the variance of a Monte Carlo estimator of this difference,
\[
Var\big[h_1(X) - h_2(Y)\big] = Var[h_1(X)] + Var[h_2(Y)] - 2\,cov\big(h_1(X), h_2(Y)\big), \quad (6.5)
\]
is small if we can induce a high degree of positive correlation between the generated random variables $X$ and $Y$. This is precisely the opposite of the problem that led to antithetic random numbers, where we wished to induce a high degree of negative correlation. The following lemma is due to Hoeffding (1940) and provides a useful bound on the joint cumulative distribution function of two random variables $X$ and $Y$. Suppose $X$ and $Y$ have cumulative distribution functions $F_X(x)$ and $F_Y(y)$ respectively and a joint cumulative distribution function $G(x,y) = P[X \le x, Y \le y]$.
Lemma 7. (a) The joint cumulative distribution function $G$ of $(X, Y)$ always satisfies
\[
\big(F_X(x) + F_Y(y) - 1\big)^{+} \le G(x,y) \le \min\big(F_X(x), F_Y(y)\big) \quad (6.6)
\]
for all $x, y$.
(b) Assume that $F_X$ and $F_Y$ are continuous functions. In the case that $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(U)$ for $U$ uniform on $[0,1]$, equality is achieved on the right: $G(x,y) = \min(F_X(x), F_Y(y))$. In the case that $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(1-U)$, there is equality on the left: $(F_X(x) + F_Y(y) - 1)^{+} = G(x,y)$.
Proof. (a) Note that
\[
P[X \le x, Y \le y] \le P[X \le x], \quad\text{and similarly}\quad P[X \le x, Y \le y] \le P[Y \le y].
\]
This shows that
\[
G(x,y) \le \min\big(F_X(x), F_Y(y)\big),
\]
verifying the right side of (6.6). Similarly, for the left side,
\begin{align*}
P[X \le x, Y \le y] &= P[X \le x] - P[X \le x, Y > y] \\
&\ge P[X \le x] - P[Y > y] \\
&= F_X(x) - (1 - F_Y(y)) = F_X(x) + F_Y(y) - 1.
\end{align*}
Since $G(x,y)$ is also non-negative, the left side follows.
For (b), suppose $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(U)$; then
\begin{align*}
P[X \le x, Y \le y] &= P\big[F_X^{-1}(U) \le x,\ F_Y^{-1}(U) \le y\big] \\
&= P\big[U \le F_X(x),\ U \le F_Y(y)\big],
\end{align*}
since $P[X = x] = 0$ and $P[Y = y] = 0$. But
\[
P\big[U \le F_X(x),\ U \le F_Y(y)\big] = \min\big(F_X(x), F_Y(y)\big),
\]
verifying the equality on the right of (6.6) for common random numbers. By a similar argument,
\begin{align*}
P\big[F_X^{-1}(U) \le x,\ F_Y^{-1}(1-U) \le y\big] &= P\big[U \le F_X(x),\ 1-U \le F_Y(y)\big] \\
&= P\big[U \le F_X(x),\ U \ge 1 - F_Y(y)\big] \\
&= \big(F_X(x) - (1 - F_Y(y))\big)^{+},
\end{align*}
verifying the equality on the left.
The following theorem supports the use of common random numbers to maximize covariance and antithetic random numbers to minimize covariance.
Theorem 8. (maximum/minimum covariance)
Suppose $h_1$ and $h_2$ are both non-decreasing (or both non-increasing) functions. Subject to the constraint that $X, Y$ have cumulative distribution functions $F_X$, $F_Y$ respectively, the covariance
\[
cov\big[h_1(X), h_2(Y)\big]
\]
is maximized when $Y = F_Y^{-1}(U)$ and $X = F_X^{-1}(U)$ (i.e. for common $U(0,1)$ random numbers) and is minimized when $Y = F_Y^{-1}(U)$ and $X = F_X^{-1}(1-U)$ (i.e. for antithetic random numbers).
Proof. We sketch a proof of the theorem in the case where the distributions are continuous and $h_1$, $h_2$ are differentiable. Define $G(x,y) = P[X \le x, Y \le y]$. The following representation of covariance is useful: define
\[
H(x,y) = P(X > x, Y > y) - P(X > x)P(Y > y) = G(x,y) - F_X(x)F_Y(y). \quad (6.7)
\]
Notice that, using integration by parts,
\begin{align*}
\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} H(x,y)\,h_1'(x)\,h_2'(y)\,dx\,dy
&= -\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \frac{\partial}{\partial x}H(x,y)\,h_1(x)\,h_2'(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \frac{\partial^2}{\partial x\,\partial y}H(x,y)\,h_1(x)\,h_2(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h_1(x)\,h_2(y)\,g(x,y)\,dx\,dy
   - \int_{-\infty}^{\infty} h_1(x)\,f_X(x)\,dx \int_{-\infty}^{\infty} h_2(y)\,f_Y(y)\,dy \\
&= cov\big(h_1(X), h_2(Y)\big), \quad (6.8)
\end{align*}
where $g(x,y)$, $f_X(x)$, $f_Y(y)$ denote the joint probability density function, the probability density function of $X$ and that of $Y$ respectively. In fact, this result holds in general, even without the assumption that the distributions are continuous: the covariance between $h_1(X)$ and $h_2(Y)$, for $h_1$ and $h_2$ differentiable functions, is
\[
cov\big(h_1(X), h_2(Y)\big) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} H(x,y)\,h_1'(x)\,h_2'(y)\,dx\,dy.
\]
The formula shows that in order to maximize the covariance when $h_1$, $h_2$ are both increasing (or both decreasing) functions, it is sufficient to maximize $H(x,y)$ for each $x$ and $y$, since the product $h_1'(x)\,h_2'(y)$ is then non-negative. Since we are constraining the marginal cumulative distribution functions to be $F_X$ and $F_Y$, this is equivalent to maximizing $G(x,y)$ subject to the constraints
\[
\lim_{y\to\infty} G(x,y) = F_X(x), \qquad \lim_{x\to\infty} G(x,y) = F_Y(y).
\]
Lemma 7 shows that the maximum is achieved when common random numbers are used, and the minimum when we use antithetic random numbers.
We can argue intuitively for the use of common random numbers in the case of a discrete distribution with probability concentrated on the points indicated in the accompanying figure. This figure corresponds to a joint distribution with the following probabilities, say

x              0     0.25   0.25   0.75   0.75   1
y              0     0.25   0.75   0.25   0.75   1
P[X=x, Y=y]   .1    .2     .2     .1     .2     .2

Suppose we wish to maximize $P[X > x, Y > y]$, subject to the constraint that the probabilities $P[X > x]$ and $P[Y > y]$ are fixed. We have indicated arbitrary fixed values of $(x, y)$ in the figure. Note that if there is any weight attached to the point in the lower right quadrant (labelled $P_2$), some or all of this weight can be reassigned to the point $P_3$ in the lower left quadrant, provided that there is an equal movement of weight from the upper left $P_4$ to the upper right $P_1$. Such a movement of weight will increase the value of $G(x,y)$ without affecting $P[X \le x]$ or $P[Y \le y]$. The weight that we are able to transfer in this example is 0.1, the minimum of the weights on $P_4$ and $P_2$. In general, this continues until there is no weight in one of the off-diagonal quadrants, for every choice of $(x, y)$. The resulting distribution in this example is given by

x              0     0.25   0.25   0.75   0.75   1
y              0     0.25   0.75   0.25   0.75   1
P[X=x, Y=y]   .1    .3     0      .1     .3     .2

It is easy to see that such a joint distribution can be generated from common random numbers $X = F_X^{-1}(U)$, $Y = F_Y^{-1}(U)$.
6.13 Variance Reduction Using Conditioning
We now consider a simple but powerful generalization of control variates. Suppose that we can decompose a random variable $T$ into two components $T_1$, $\varepsilon$,
\[
T = T_1 + \varepsilon,
\]
so that $T_1$ and $\varepsilon$ are uncorrelated:
\[
Cov(T_1, \varepsilon) = 0.
\]
Assume as well that $E(\varepsilon) = 0$. Regression is one method for determining such a decomposition, and the error term $\varepsilon$ in regression satisfies these conditions. As a result, $T_1$ has the same mean as $T$, and it is easy to see that
\[
Var(T) = Var(T_1) + Var(\varepsilon).
\]
So $T_1$ has smaller variance than $T$ (unless $\varepsilon = 0$ with probability 1). This means that if we wish to estimate the common mean of $T$ and $T_1$, the estimator $T_1$ is preferable, since it has the same mean but a smaller variance.
One special case is variance reduction by conditioning. For the standard definition and properties of conditional expectation, see the Appendix. One common definition of $E[X \mid Y]$ is the unique (with probability one) function $g(Y)$ of $Y$ which minimizes $E[X - g(Y)]^2$. This definition only applies to random variables $X$ which have finite variances, and so it requires some modification when $E(X^2) = \infty$. For simplicity, we will assume here that all random variables, say $X, Y, Z$, have finite variances. We can define conditional covariance using conditional expectation as
\[
Cov(X, Y \mid Z) = E[XY \mid Z] - E[X \mid Z]\,E[Y \mid Z]
\]
and conditional variance as
\[
Var(X \mid Z) = E(X^2 \mid Z) - \big(E[X \mid Z]\big)^2.
\]
Variance reduction through conditioning is justified by the following well-known result:
Theorem 9.
(a) $E(X) = E\big[E[X \mid Y]\big]$
(b) $cov(X, Y) = E\big[cov(X, Y \mid Z)\big] + cov\big(E[X \mid Z], E[Y \mid Z]\big)$
(c) $var(X) = E\big[var(X \mid Z)\big] + var\big(E[X \mid Z]\big)$
This theorem is used as follows: suppose we are considering a candidate estimator $\hat{\theta}$ which is an unbiased estimator of $\theta$; suppose also that we have an arbitrary random variable $Z$ which is somehow related to $\hat{\theta}$, and we have chosen $Z$ carefully so that we are able to calculate the conditional expectation $T_1 = E[\hat{\theta} \mid Z]$. Then by part (a) of the above theorem, $T_1$ is also an unbiased estimator of $\theta$. Define
\[
\varepsilon = \hat{\theta} - T_1.
\]
By part (c),
\[
Var(\hat{\theta}) = Var(T_1) + Var(\varepsilon),
\]
and so $Var(T_1) = Var(\hat{\theta}) - Var(\varepsilon) < Var(\hat{\theta})$. In other words, for any variable $Z$, $E[\hat{\theta} \mid Z]$ has the same expectation as $\hat{\theta}$ but a smaller variance. In fact, the decrease in variance is largest if $Z$ and $\hat{\theta}$ are nearly independent, because in this case $E[\hat{\theta} \mid Z]$ is close to a constant and its variance is close to zero. In general, reducing the variance of an estimator by conditioning requires searching for a random variable $Z$ such that:
1. The conditional expectation $E[\hat{\theta} \mid Z]$ of the original estimator is computable.
2. $Var\big(E[\hat{\theta} \mid Z]\big)$ is substantially smaller than $Var(\hat{\theta})$.
Example 4. (hit or miss)
Suppose we wish to estimate the area under a certain graph $f(x)$ by the hit-and-miss method. A crude method would involve determining a multiple $c$ of a probability density function $g(x)$ which dominates $f(x)$, so that $cg(x) \ge f(x)$ for all $x$. We can generate points $(X, Y)$ at random, uniformly distributed under the graph of $cg(x)$, by generating $X$ by inverse transform, $X = G^{-1}(U_1)$, where $G(x)$ is the cumulative distribution function corresponding to the density $g$, and then generating $Y$ from the $U(0, cg(X))$ distribution, say $Y = cg(X)U_2$. An example, with $g(x) = 2x$, $0 < x < 1$, and $c = 1/4$, is given in the accompanying figure.
The hit-and-miss estimator of the area under the graph of $f$ is obtained by generating such random points $(X, Y)$ and counting the proportion that fall under the graph of $f$, i.e. for which $Y \le f(X)$. This proportion estimates the probability
\[
P[Y \le f(X)] = \frac{\text{area under } f(x)}{\text{area under } cg(x)} = \frac{\text{area under } f(x)}{c}.
\]
This holds since $g(x)$ is a probability density function. Notice that if we define
\[
W = \begin{cases} c & \text{if } Y \le f(X) \\ 0 & \text{if } Y > f(X), \end{cases}
\]
then
\[
E(W) = c\,\frac{\text{area under } f(x)}{\text{area under } cg(x)} = \text{area under } f(x).
\]
So $W$ is an unbiased estimator of the parameter that we wish to estimate. We might therefore estimate the area under $f(x)$ using the Monte Carlo estimator
\[
\hat{\theta}_{HM} = \frac{1}{n}\sum_{i=1}^{n} W_i,
\]
based on independent values of $W_i$. This is the hit-or-miss estimator. However, in this case it is easy to find a random variable $Z$ such that the conditional expectation $E(W \mid Z)$ can be determined in closed form. In fact, choosing $Z = X$, we obtain
\[
E[W \mid X] = \frac{f(X)}{g(X)}.
\]
This is therefore an unbiased estimator of the same parameter, and it has smaller variance than $W$. For a sample of size $n$, we should replace the crude hit-or-miss estimator $\hat{\theta}_{HM}$ by the estimator
\[
\hat{\theta}_{cond} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)}{g(X_i)} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)}{2X_i},
\]
with the $X_i$ generated from $X = G^{-1}(U_i) = \sqrt{U_i}$, $i = 1, 2, \ldots, n$, and $U_i \sim U(0,1)$. In this case, the conditional expectation results in a familiar form for the estimator $\hat{\theta}_{cond}$: although this is simply an importance sampling estimator with $g(x)$ as the importance distribution, the derivation shows that the estimator $\hat{\theta}_{cond}$ has a smaller variance than $\hat{\theta}_{HM}$.
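A short R sketch contrasting the two estimators for one concrete choice of f; the integrand f(x) = x^3 and the sample size are illustrative assumptions, with g(x) = 2x as in the example and c chosen here so that c*g(x) dominates f(x) on (0,1):
set.seed(2)
n=10000
f=function(x) x^3                     # illustrative integrand (an assumption)
g=function(x) 2*x                     # dominating density on (0,1), as in the example
cc=1/2                                # c such that cc*g(x) >= f(x) on (0,1)
u1=runif(n); u2=runif(n)
x=sqrt(u1)                            # X = G^{-1}(U1), since G(x) = x^2
y=cc*g(x)*u2                          # Y ~ U(0, c g(X))
w=cc*(y<=f(x))                        # hit-or-miss variable W
c(mean(w), mean(f(x)/g(x)))           # both estimate the area, here 1/4
c(var(w), var(f(x)/g(x)))/n           # the conditioned estimator has the smaller variance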
6.14 Problems
1. Let the antithetic estimator be denoted by
\[
\hat{\theta} = \sum_{i=1}^{n}\frac{f(U_i) + f(1-U_i)}{2n},
\]
where $U_i \sim U(0,1)$. Show that the estimator is unbiased. Note: $\theta = \int_0^1 f(x)\,dx$.
2. RANDI's main problem was that it was based on RANDU. This explains why RANDI was not a good random number generator. Because of concerns over the quality of RANDI, RANDII was used. Professor Derry Vative loves integrating things using only a U(0,1) "derived" from RANDII. In particular, he wishes to determine
\[
\theta = \int_{-\infty}^{\infty} 2^{-|x|}\,dx.
\]
(a) Find $\theta$.
(b) Write an algorithm, using Monte Carlo integration and an appropriate substitution that gives the integral(s) 0, 1 bounds, to find $\hat{\theta}$.
3. We wish to estimate the area beneath $f(x) = \ln(x+1)$ for $0 \le x \le 1$. A "close" function to $f(x)$, if needed, is $g(x) = \sqrt{x}$ (note: $g(x)$ is not a p.d.f.). The following is known about $f(x)$ and $g(x)$:
\begin{align*}
&E(f(X)) \approx 0.4; \quad E\big[(f(X))^2\big] \approx 0.2; \quad Var(f(X)) \approx 0.2; \\
&E(g(X)) = \tfrac{2}{3}; \quad E\big[(g(X))^2\big] = \tfrac{1}{2}; \quad Var(g(X)) = \tfrac{1}{18}; \\
&E(f(X)g(X)) \approx 0.3; \quad Cov(f(X), g(X)) \approx 0.03; \\
&E(f(X)f(1-X)) \approx 0.11; \quad Cov(f(X), f(1-X)) \approx -0.05.
\end{align*}
(a) Determine, or state, $\theta$.
(b) Using $g(x)$ as a control variate, state the estimate $\hat{\theta}_{cv}$ in terms of $f(x_i)$ and $g(x_i)$. Be sure to evaluate any expressions that can be evaluated.
(c) Determine the efficiency of our control variate estimator.
(d) Explain what the efficiency you calculated in (c) means to the general layperson.
(e) Using $g(x)$, state the importance estimate $\hat{\theta}_{imp}$ in terms of $f(x_i)$ and $g(x_i)$. Clearly explain where all parts of the expression come from.
(f) Would you suggest using antithetic random variables on $f(x)$? Why or why not?
(g) Determine the efficiency of our antithetic estimator.
4. Dery Vate wants to integrate $f(x) = \exp(x-1)$ from 0 to 1. She wants an algorithm that will obtain an estimate for this integral, $\theta$, with as small a variance as possible.
(a) Give a crude Monte Carlo estimate using the uniform random variables U = [0.1, 0.5, 0.8].
(b) Obtain the expected value and variance of the crude Monte Carlo estimator.
(c) She thinks she may be able to improve the process using antithetic random variables. Explain in words why this will or will not work.
(d) Obtain the expected value and variance of the antithetic estimator.
(e) Still not content with the results, she decides to use a control variate. Find the expected value and variance using the control variate $g(x) = 3(x-1)$.
5. We are interested in estimating $\theta = \int_0^1 \exp(x)\,dx$.
(a) Determine the variance for the crude Monte Carlo method.
(b) Determine the variance for the antithetic method.
(c) Determine the variance for the control variate $g(U) = U$, $U \sim U(0,1)$.
6. Use CMC to write an algorithm that can be used to estimate $\theta$ for each of the integrals below.
(a) $\theta = \int_0^1 \exp(x)\,dx$
(b) $\theta = \int_{-12}^{8} \exp(x)\,dx$
(c) $\theta = \int_1^{\infty} \frac{1}{x}\,dx$
(d) $\theta = \int_{-\infty}^{\infty} \frac{1}{x}\,dx$
(e) $\theta = \int_{-\infty}^{\infty} \exp(-x^2)\,dx$
7. In question 6, write algorithms to estimate $\theta$ using:
(a) Importance Sampling.
(b) Control Variates.
(c) Optimal Control Variates.
(d) Stratified Sampling.
(e) Antithetic Random Variables.
Solutions
6.14.1 Let the antithetic estimator be denoted by
\[
\hat{\theta} = \sum_{i=1}^{n}\frac{f(U_i) + f(1-U_i)}{2n},
\]
where $U_i \sim U(0,1)$. Show the estimator is unbiased. Note: $\theta = \int_0^1 f(x)\,dx$.
6.14.1 Solution
\begin{align*}
E\big[\hat{\theta}\big] &= E\!\left[\sum_{i=1}^{n}\frac{f(U_i) + f(1-U_i)}{2n}\right]
= \sum_{i=1}^{n}\frac{E\big(f(U_i)\big) + E\big(f(1-U_i)\big)}{2n}
= \sum_{i=1}^{n}\frac{\theta + \theta}{2n}
= \frac{1}{2n}\,2n\theta = \theta.
\end{align*}
6.14.2 RANDI's main problem was that it was based on RANDU. This explains why RANDI was not a good random number generator. Because of concerns over the quality of RANDI, RANDII was used. Professor Derry Vative loves integrating things using only a U(0,1) derived from RANDII. In particular, he wishes to determine
\[
\theta = \int_{-\infty}^{\infty} 2^{-|x|}\,dx.
\]
6.14.2a Find $\theta$.
6.14.2a Solution
\[
\theta = \int_{-\infty}^{\infty} 2^{-|x|}\,dx = 2\int_0^{\infty} 2^{-x}\,dx \quad\text{(by symmetry)}
= 2\lim_{a\to\infty}\left[-\frac{1}{\ln 2}\,2^{-x}\right]_0^a = \frac{2}{\ln 2} = 2.8854.
\]
6.14.2b Write an algorithm, using Monte Carlo integration and an appropriate substitution that gives the integral(s) 0, 1 bounds, to find $\hat{\theta}$.
6.14.2b Solution
Let
\[
t = \frac{1}{1+x}, \qquad x = \frac{1}{t} - 1, \qquad dx = -t^{-2}\,dt,
\]
so that
\[
\theta = 2\int_0^{\infty} 2^{-x}\,dx = 2\int_0^1 2^{-|1/t - 1|}\,t^{-2}\,dt.
\]
The Algorithm:
1. Generate $t_i \sim U(0,1)$; repeat $n$ times.
2. \[
\hat{\theta} = \frac{2}{n}\sum_{i=1}^{n} 2^{-|1/t_i - 1|}\,t_i^{-2}.
\]
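A hedged R sketch of this algorithm (the sample size is an arbitrary choice):
n=100000
t=runif(n)
theta.hat=2*mean(2^(-abs(1/t-1))/t^2)
theta.hat                             # should be close to 2/log(2) = 2.8854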
6.14.3 We wish to estimate the area beneath $f(x) = \ln(x+1)$ for $0 \le x \le 1$. A close function to $f(x)$, if needed, is $g(x) = \sqrt{x}$ (note: $g(x)$ is not a p.d.f.). The following is known about $f(x)$ and $g(x)$:
\begin{align*}
&E(f(X)) \approx 0.4; \quad E\big[(f(X))^2\big] \approx 0.2; \quad Var(f(X)) \approx 0.2; \\
&E(g(X)) = \tfrac{2}{3}; \quad E\big[(g(X))^2\big] = \tfrac{1}{2}; \quad Var(g(X)) = \tfrac{1}{18}; \\
&E(f(X)g(X)) \approx 0.3; \quad Cov(f(X), g(X)) \approx 0.03; \\
&E(f(X)f(1-X)) \approx 0.11; \quad Cov(f(X), f(1-X)) \approx -0.05.
\end{align*}
6.14.3a Determine, or state, $\theta$.
6.14.3a Solution $\theta = E(f(X)) \approx 0.4$.
6.14.3b Using $g(x)$ as a control variate, state the estimate $\hat{\theta}_{cv}$ in terms of $f(x_i)$ and $g(x_i)$. Be sure to evaluate any expressions that can be evaluated.
6.14.3b Solution
\[
\hat{\theta}_{cv} = \sum_{i=1}^{n}\frac{f(x_i) - g(x_i)}{n} + E\big(g(X)\big)
= \sum_{i=1}^{n}\frac{f(x_i) - g(x_i)}{n} + \frac{2}{3}.
\]
6.14.3c Determine the efficiency of our control variate estimator.
6.14.3c Solution
\begin{align*}
Eff &= \frac{Var\big(\hat{\theta}_{cmc}\big)}{Var\big(\hat{\theta}_{cv}\big)}
= \frac{Var(f(X))/n}{Var\big(f(X) - g(X)\big)/n} \\
&= \frac{Var(f(X))}{Var(f(X)) + Var(g(X)) - 2\,Cov\big(f(X), g(X)\big)} \\
&= \frac{0.2}{0.2 + \frac{1}{18} - 2(0.03)} = 1.02.
\end{align*}
6.14.3d Explain what the efficiency you calculated in (c) means to the general layperson.
6.14.3d Solution This means that running the CMC code 102 times would be about as accurate as running the CV code 100 times.
6.14.3e Using $g(x)$, state the importance estimate $\hat{\theta}_{imp}$ in terms of $f(x_i)$ and $g(x_i)$. Clearly explain where all parts of the expression come from.
6.14.3e Solution We rescale $g$ into a density:
\[
\int_0^1 a\sqrt{x}\,dx = a\,\frac{2}{3}x^{3/2}\Big|_0^1 = 1 \implies a = \frac{3}{2}.
\]
Therefore let $h(x) = \frac{3}{2}\sqrt{x}$, which means $H(x) = x^{3/2}$, and
\[
\hat{\theta}_{imp} = \sum_{i=1}^{n}\frac{f(x_i)/h(x_i)}{n}, \qquad\text{with each } x_i = u_i^{2/3},\ u_i \sim U(0,1).
\]
6.14.3f Would you suggest using antithetic random variables on $f(x)$? Why or why not?
6.14.3f Solution They should work, as $f(x)$ is a monotonic, increasing function on the interval.
6.14.3g Determine the efficiency of our antithetic estimator.
6.14.3g Solution
\begin{align*}
Eff &= \frac{Var\big(\hat{\theta}_{cmc}\big)}{Var\big(\hat{\theta}_{anti}\big)}
= \frac{Var(f(X))/n}{2\,Var\!\left(\dfrac{f(X) + f(1-X)}{2}\right)\!\Big/ n} \\
&= \frac{2\,Var(f(X))}{2\,Var(f(X)) + 2\,Cov\big(f(X), f(1-X)\big)}
= \frac{Var(f(X))}{Var(f(X)) + Cov\big(f(X), f(1-X)\big)} \\
&= \frac{0.2}{0.2 - 0.05} = 1.33.
\end{align*}
6.14.4 Dery Vate wants to integrate $f(x) = \exp(x-1)$ from 0 to 1. She wants an algorithm that will obtain an estimate for this integral, $\theta$, with as small a variance as possible.
6.14.4a Give a crude Monte Carlo estimate using the uniform random variables U = [0.1, 0.5, 0.8].
6.14.4a Solution From CMC we have an estimate for the integral as
\[
\tilde{\theta}_{CMC} = \frac{1}{n}\sum_{i=1}^{n} f(x_i) = \frac{1}{n}\sum_{i=1}^{n} e^{x_i - 1}
= \frac{0.4066 + 0.6065 + 0.8187}{3} = 0.6106.
\]
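A quick R check of this arithmetic, using only the three given uniforms:
mean(exp(c(0.1,0.5,0.8)-1))           # 0.6106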
6.14.4b Obtain the expected value and variance of the crude Monte Carlo estimator.
6.14.4b Solution
\begin{align*}
E\big[\tilde{\theta}_{CMC}\big] &= E\!\left[\frac{1}{n}\sum_{i=1}^{n} e^{x_i-1}\right]
= \frac{1}{n}\sum_{i=1}^{n} E\big[e^{x_i-1}\big]
= \frac{1}{n}\sum_{i=1}^{n}\theta = \frac{n\theta}{n} = \theta. \\
Var\big[\tilde{\theta}_{CMC}\big] &= Var\!\left[\frac{1}{n}\sum_{i=1}^{n} e^{x_i-1}\right]
= \frac{1}{n^2}\,Var\!\left[\sum_{i=1}^{n} e^{x_i-1}\right]
= \frac{1}{n^2}\sum_{i=1}^{n} Var\big[e^{x_i-1}\big]
= \frac{n\,\sigma^2_{CMC}}{n^2} = \frac{\sigma^2_{CMC}}{n}.
\end{align*}
6.14.4c She thinks she may be able to improve the process using antithetic random variables. Explain in words why this will or will not work.
6.14.4c Solution Dery Vate will find that the antithetic estimator is more efficient than crude Monte Carlo, because the function is monotonic over the interval.
6.14.4d Obtain the expected value and variance of the antithetic estimator.
6.14.4d Solution
\begin{align*}
E\big[\tilde{\theta}_{ANTI}\big] &= E\!\left[\frac{1}{2n}\sum_{i=1}^{n}\big(f(x_i) + f(1-x_i)\big)\right]
= \frac{1}{2n}\sum_{i=1}^{n}\Big(E\big[f(x_i)\big] + E\big[f(1-x_i)\big]\Big)
= \frac{1}{2n}\sum_{i=1}^{n}(\theta + \theta) = \frac{2n\theta}{2n} = \theta. \\
Var\big[\tilde{\theta}_{ANTI}\big] &= Var\!\left[\frac{1}{2n}\sum_{i=1}^{n}\big(f(x_i) + f(1-x_i)\big)\right]
= \frac{1}{4n^2}\,Var\!\left[\sum_{i=1}^{n}\big(f(x_i) + f(1-x_i)\big)\right] \\
&= \frac{1}{4n^2}\sum_{i=1}^{n}\Big(Var\big(f(x_i)\big) + Var\big(f(1-x_i)\big) + 2\,Cov\big(f(x_i), f(1-x_i)\big)\Big) \\
&= \frac{1}{4n^2}\sum_{i=1}^{n}\big(\sigma^2_{ANTI} + \sigma^2_{ANTI} + 2C\big)
= \frac{\sigma^2_{ANTI} + C}{2n}.
\end{align*}
Note that $C < 0$, since we are guaranteed to have a negative covariance.
6.14.4e Still not content with the results, she decides to use a control variate. Find the expected value and variance using the control variate $g(x) = 3(x-1)$.
6.14.4e Solution
\begin{align*}
E\big[\tilde{\theta}_{CV}\big] &= E\!\left[\frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) + g(x_i) - g(x_i)\big)\right]
= \frac{1}{n}\,E\!\left[\sum_{i=1}^{n}\big(f(x_i) + g(x_i) - g(x_i)\big)\right] \\
&= \frac{1}{n}\left(E\!\left[\sum_{i=1}^{n}\big(f(x_i) - g(x_i)\big)\right] + E\!\left[\sum_{i=1}^{n} g(x_i)\right]\right)
= \frac{1}{n}\left(0 + \sum_{i=1}^{n} E\big[g(x_i)\big]\right) \quad\text{(the first term is near 0 because } g \approx f\text{)} \\
&= \frac{n\theta}{n} = \theta, \quad\text{since } g(x) \approx f(x). \\
Var\big[\tilde{\theta}_{CV}\big] &= Var\!\left[\frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) + g(x_i) - g(x_i)\big)\right]
= \frac{1}{n^2}\,Var\!\left[\sum_{i=1}^{n}\Big(\big(f(x_i) - g(x_i)\big) + g(x_i)\Big)\right] \\
&= \frac{1}{n^2}\,Var\!\left[\sum_{i=1}^{n} g(x_i)\right] \quad\text{(treating } f - g \text{ as approximately constant)}
= \frac{1}{n^2}\sum_{i=1}^{n} Var\big[g(x_i)\big]
= \frac{\sigma^2_g}{n}.
\end{align*}
6.14.5 We are interested in estimating $\theta = \int_0^1 \exp(x)\,dx$.
6.14.5a Determine the variance for the crude Monte Carlo method.
6.14.5a Solution
\begin{align*}
Var\big(\tilde{\theta}_{CMC}\big) &= Var\!\left(\frac{1}{n}\sum_{i=1}^{n} f(x_i)\right)
= \frac{1}{n^2}\,Var\!\left(\sum_{i=1}^{n} f(x_i)\right)
= \frac{1}{n^2}\sum_{i=1}^{n} Var\big[f(x_i)\big]
\approx \frac{1}{n^2}\sum_{i=1}^{n}\big(f(x_i) - \hat{\theta}_{CMC}\big)^2.
\end{align*}
6.14.5b Determine the variance for the antithetic method.
6.14.5b Solution
\begin{align*}
Var\big(\tilde{\theta}_{ANTI}\big) &= Var\!\left(\frac{1}{2n}\sum_{i=1}^{n}\big(f(x_i) + f(1-x_i)\big)\right)
= \frac{1}{4n^2}\,Var\!\left(\sum_{i=1}^{n}\big(f(x_i) + f(1-x_i)\big)\right) \\
&= \frac{1}{4n^2}\sum_{i=1}^{n} Var\big[f(x_i) + f(1-x_i)\big] \\
&= \frac{1}{4n^2}\sum_{i=1}^{n}\Big(Var\big(f(x_i)\big) + Var\big(f(1-x_i)\big) + 2\,Cov\big(f(x_i), f(1-x_i)\big)\Big) \\
&\approx \frac{1}{4n^2}\left(\sum_{i=1}^{n}\big(f(x_i) - \hat{\theta}_{ANTI}\big)^2 + \sum_{i=1}^{n}\big(f(1-x_i) - \hat{\theta}_{ANTI}\big)^2
+ 2\sum_{i=1}^{n}\big(f(x_i) - \hat{\theta}_{ANTI}\big)\big(f(1-x_i) - \hat{\theta}_{ANTI}\big)\right).
\end{align*}
6.14.5c Determine the variance for the control variate $g(U) = U$, $U \sim U(0,1)$.
6.14.5c Solution
\begin{align*}
Var\big(\tilde{\theta}_{CV}\big) &= Var\!\left(\frac{1}{n}\sum_{i=1}^{n}\Big(\big(f(x_i) - g(x_i)\big) + g(x_i)\Big)\right)
= \frac{1}{n^2}\,Var\!\left(\sum_{i=1}^{n}\big(C + g(x_i)\big)\right)
\approx \frac{1}{n^2}\sum_{i=1}^{n}\big(g(x_i) - \hat{\theta}_{CV}\big)^2,
\end{align*}
treating the difference $f(x_i) - g(x_i)$ as approximately a constant $C$.
6.14.6 Use CMC to write an algorithm that can be used to estimate $\theta$ for each of the integrals below.
6.14.6a $\theta = \int_0^1 \exp(x)\,dx$
6.14.6b $\theta = \int_{-12}^{8} \exp(x)\,dx$
6.14.6c $\theta = \int_1^{\infty} \frac{1}{x}\,dx$
6.14.6d $\theta = \int_{-\infty}^{\infty} \frac{1}{x}\,dx$
6.14.6d Solution Note: all other solutions are either a simplification of this one, or an extension.
Define a limit point $a = 0$. This solution uses the change of variable $u = -\frac{1}{x}$ when $x < 0$ and $u = \frac{1}{x}$ when $x > 0$. Note that the substitution breaks down at $x = 0$, where the integrand does not exist; since we are using a continuous uniform on the bounds (0,1), this is not a concern. Then we have
\[
\int_{-\infty}^{\infty}\frac{1}{x}\,dx = \int_{-\infty}^{0}\frac{1}{x}\,dx + \int_{0}^{\infty}\frac{1}{x}\,dx.
\]
Now apply the change of variable $x = -\frac{1}{u}$, with $dx = \frac{1}{u^2}\,du$, to the first term and $x = \frac{1}{u}$, with $dx = -\frac{1}{u^2}\,du$, to the second term:
\begin{align*}
\int_{-\infty}^{\infty}\frac{1}{x}\,dx
&= -\int_{0}^{1}\frac{u}{u^2}\,du + \int_{0}^{1}\frac{u}{u^2}\,du
= \int_{0}^{1}\frac{1}{u}\,du - \int_{0}^{1}\frac{1}{u}\,du.
\end{align*}
Now the algorithm:
1. Generate $u_1$ and $u_2$, two U(0,1) random variables.
2. If $u_1 < 0.50$ then $x_1 = -\frac{1}{u_2}$.
3. Else $x_1 = \frac{1}{u_2}$.
4. Repeat the process $n$ times and take the average.
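A hedged R sketch of this algorithm (the sample size is an arbitrary choice); because the two half-line contributions cancel only in expectation, the running estimate is extremely unstable, which is part of the point of the exercise:
n=100000
u1=runif(n); u2=runif(n)
h=ifelse(u1<0.5, -1/u2, 1/u2)         # x<0 branch contributes -1/u, x>0 branch contributes +1/u
theta.hat=2*mean(h)                   # factor 2 because each branch is chosen with probability 1/2
theta.hat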
6.14.6e $\theta = \int_{-\infty}^{\infty} \exp(-x^2)\,dx$
6.14.7 In question 6.14.6, write algorithms to estimate $\theta$ using:
6.14.7a Importance Sampling
6.14.7a Solution For 6.14.6d.
6.14.7b Control Variates
6.14.7c Optimal Control Variates
6.14.7d Stratified Sampling
6.14.7d Solution For 6.14.6d.
6.14.7e Antithetic Random Variables
6.14.8 Find the CMC and ANTI estimators for the following integral:
\[
\int_0^1 \frac{e^{x-1}}{e^{2}-1}\,dx.
\]
Determine the efficiency of the ANTI estimator. Would you expect this to be greater or less than 1? Explain your answer.
6.14.8 Solution The CMC estimator is
\[
\tilde{\theta}_{CMC} = \frac{1}{n}\sum_{i=1}^{n}\frac{e^{x_i-1}}{e^{2}-1}.
\]
The ANTI estimator is (using $U = 1 - V$)
\[
\tilde{\theta}_{ANTI} = \frac{1}{2n}\sum_{i=1}^{n}\left(\frac{e^{x_i-1}}{e^{2}-1} + \frac{e^{-x_i}}{e^{2}-1}\right),
\]
with efficiency measure
\begin{align*}
\text{Efficiency} &= \frac{Var\big(\tilde{\theta}_{CMC}\big)}{2\,Var\big(\tilde{\theta}_{ANTI}\big)} \\
&= \frac{\dfrac{1}{n^2}\displaystyle\sum_{i=1}^{n}\left(\frac{e^{x_i-1}}{e^{2}-1} - \hat{\theta}_{CMC}\right)^2}
{\dfrac{2}{4n^2}\displaystyle\sum_{i=1}^{n}\left[\left(\frac{e^{x_i-1}}{e^{2}-1} - \hat{\theta}_{ANTI}\right)^2 + \left(\frac{e^{-x_i}}{e^{2}-1} - \hat{\theta}_{ANTI}\right)^2 + 2\,Cov\!\left(\frac{e^{x_i-1}}{e^{2}-1}, \frac{e^{-x_i}}{e^{2}-1}\right)\right]} \\
&= \frac{2\displaystyle\sum_{i=1}^{n}\left(\frac{e^{x_i-1}}{e^{2}-1} - \hat{\theta}_{CMC}\right)^2}
{\displaystyle\sum_{i=1}^{n}\left[\left(\frac{e^{x_i-1}}{e^{2}-1} - \hat{\theta}_{ANTI}\right)^2 + \left(\frac{e^{-x_i}}{e^{2}-1} - \hat{\theta}_{ANTI}\right)^2 + \frac{2}{(e^{2}-1)^2}\,Cov\big(e^{x_i-1}, e^{-x_i}\big)\right]}.
\end{align*}
This value is expected to be greater than 1, because the function is monotonic over any interval, which is the optimal setting for the antithetic estimator.
1. How large a sample size would I need, using antithetic and crude Monte
Carlo, in order to estimate the above integral, correct to four decimal
places, with probability at least 95%?
2. Under what conditions on $f$ does the use of antithetic random numbers completely correct for the variability of the Monte Carlo estimator? i.e. when is $Var\big(f(U) + f(1-U)\big) = 0$?
3. Show that if we use antithetic random numbers to generate two normal random variables $X_1$, $X_2$, each having mean $rT - \sigma^2 T/2$ and variance $\sigma^2 T$, this is equivalent to setting $X_2 = 2(rT - \sigma^2 T/2) - X_1$. In other words, it is not necessary to use the inverse transform method to generate normal random variables in order to permit the use of antithetic random numbers.
4. Show that the variance of a weighted average
\[
var\big(\alpha X + (1-\alpha)W\big)
\]
is minimized over $\alpha$ when
\[
\alpha = \frac{var(W) - cov(X, W)}{var(W) + var(X) - 2\,cov(X, W)}.
\]
Determine the resulting minimum variance. What if the random variables $X$, $W$ are independent?
5. Use a stratified random sample to integrate the function
\[
\int_0^1 \frac{e^{u} - 1}{e - 1}\,du.
\]
What do you recommend for intervals (two or three) and sample sizes? What is the efficiency gain?
6. Use a combination of stratified random sampling and an antithetic random number in the form
\[
\frac{1}{2}\big[f(U/2) + f(1 - U/2)\big]
\]
to integrate the function
\[
\int_0^1 \frac{e^{u} - 1}{e - 1}\,du.
\]
What is the efficiency gain?
7. In the case $f(x) = \dfrac{e^{x} - 1}{e - 1}$, use $g(x) = x$ as a control variate to integrate over [0,1]. Show that the variance is reduced by a factor of approximately 60. Is there much additional improvement if we use a more general quadratic function of $x$?
8. In the case $f(x) = \dfrac{e^{x} - 1}{e - 1}$, consider using $g(x) = x$ as a control variate to integrate over [0,1]. Note that regression of $f(U)$ on $g(U)$ yields $f(U) - E(f(U)) = \beta\big[g(U) - E g(U)\big] + \varepsilon$, where the error term $\varepsilon$ has mean 0 and is uncorrelated with $g(U)$, and $\beta = cov\big(f(U), g(U)\big)/var\big(g(U)\big)$. Therefore, taking expectations on both sides and reorganising the terms, $E(f(U)) = f(U) - \beta\big[g(U) - E(g(U))\big] - \varepsilon$. The Monte Carlo estimator
\[
\frac{1}{n}\sum_{i=1}^{n}\Big(f(U_i) - \beta\big[g(U_i) - E(g(U_i))\big]\Big)
\]
is an improved control variate estimator, equivalent to the one discussed above in the case $\beta = 1$. Determine how much better this estimator is than the basic control variate case $\beta = 1$ by performing simulations. Show that the variance is reduced by a factor of approximately 60. Is there much additional improvement if we use a more general quadratic function of $x$?
9. A call option pays an amount $V(S) = 1/\big(1 + \exp(S(T) - k)\big)$ at time $T$ for some predetermined price $k$. Discuss what you would use for a control variate and conduct a simulation to determine how it performs, assuming geometric Brownian motion for the stock price, interest rate 5%, annual volatility 20%, and various initial stock prices, values of $k$ and $T$.
10. It has been suggested that stocks are not log-normally distributed, but that the distribution can be well approximated by replacing the normal distribution with a Student $t$ distribution. Suppose that the daily returns $X_i$ are independent with probability density function $f(x) = c\big(1 + (x/b)^2\big)^{-2}$ (the re-scaled Student distribution with 3 degrees of freedom). We wish to estimate a weekly $VaR_{.95}$, a value $v$ such that $P\big[\sum_{i=1}^{5} X_i < v\big] = 0.95$. If we wish to do this by simulation, suggest an appropriate method involving importance sampling. Implement it and estimate the variance reduction.
11. Suppose, for example, I have three different simulation estimators $Y_1, Y_2, Y_3$ whose means depend on two unknown parameters $\theta_1, \theta_2$. In particular, suppose $Y_1, Y_2, Y_3$ are unbiased estimators of $\theta_1$, $\theta_1 + \theta_2$, $\theta_2$ respectively. Let us assume for the moment that $var(Y_i) = 1$ and $cov(Y_i, Y_j) = 1/2$ for $i \neq j$. I want to estimate the parameter $\theta_1$. Should I use only the estimator $Y_1$, which is the unbiased estimator of $\theta_1$, or some linear combination of $Y_1, Y_2, Y_3$? Compare the number of simulations necessary for a certain degree of accuracy.
12. Consider the systematic sample estimator based on the trapezoidal rule:
\[
\hat{\theta} = \frac{1}{n}\sum_{i=0}^{n-1} f(V + i/n), \qquad V \sim U[0, 1/n].
\]
Discuss the bias and variance of this estimator. In the case $f(x) = x^2$, how does it compare with other estimators, such as crude Monte Carlo and antithetic random numbers, requiring $n$ function evaluations? Are there any disadvantages to its use?
13. In the case $f(x) = \dfrac{e^{x} - 1}{e - 1}$, use $g(x) = x$ as a control variate to integrate over [0,1]. Find the optimal linear combination using the estimators (6.1) and (6.2), an importance sampling estimator and the control variate estimator above. What is the efficiency gain over crude Monte Carlo?
14. The rho of an option is the derivative of the option price with respect to the interest rate parameter $r$. What is the value of $\rho$ for a call option with $S_0 = 10$, strike $= 10$, $r = 0.05$, $T = .25$ and $\sigma = .2$? Use a simulation to estimate this slope and determine the variance of your estimator. Try using (i) independent simulations at two points and (ii) common random numbers. What can you say about the variances of your estimators?
15. For any random variables $X, Y$, prove that $P(X \le x, Y \le y) - P(X \le x)P(Y \le y) = P(X > x, Y > y) - P(X > x)P(Y > y)$ for all $x, y$.
Statistical Tables