[1] Basics
For a specified population you can calculate a mean, $\mu$, and a variance, $\sigma^2$, where the
standard deviation is the square root of the variance, i.e. $\sigma = \sqrt{\sigma^2}$. Furthermore,
you can sample the population, e.g. with a sample size n, to estimate population parameters. For a
sample you can calculate the mean of the sample, $\bar{x}$, with a sample variance, $s^2$. Similarly,
if categorical data is involved you can use a sample proportion, p, to estimate a population
proportion, $\pi$.
Using this notation, keep in mind that z-scores can be calculated using:
$$z = \frac{x - \mu}{\sigma}$$
A standard normal distribution is the distribution of a normally distributed random variable whose
mean equals zero, i.e. $\mu = 0$, and whose standard deviation equals 1, i.e. $\sigma = 1$. We can use
z-scores and the normal probability tables to evaluate the cumulative probability of a random
variable, given as Z. Note that we're using X to denote a normally distributed random variable or
normal random variable and Z to denote a standard normal random variable.
So, let's get into some of the equations. If X is a normally distributed random variable with
mean $\mu$ and standard deviation $\sigma$ then the cumulative normal probability table can be used to
compute $P(a < X < b)$ by:
$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right)$$
where Z denotes a standard normal random variable, a can be any decimal number or $-\infty$, and
b can be any decimal number or $\infty$. The endpoints $(a - \mu)/\sigma$ and $(b - \mu)/\sigma$ are
really the z-scores where a and b are values of x. Notice that the nomenclature here denotes the
mean and standard deviation of a population.
Example 1.1:
Let X be a normal random variable with a mean $\mu$ = 10 and standard deviation $\sigma$ = 2.5.
Calculate the probability that X < 14 or $P(X < 14)$.
Solution:
$$P(X < 14) = P\left(Z < \frac{14 - 10}{2.5}\right) = P(Z < 1.60) = 0.9452$$
where 0.9452 is found in the normal probability table using the value 1.60. This yields the area
of the left hand side of the normal distribution, i.e. the left hand tail. If you are looking for
X > 14, or the right hand tail, you must subtract 0.9452 from 1, or $1 - 0.9452 = 0.0548$.
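The table lookup in Example 1.1 can be cross-checked numerically. The following is a minimal
sketch, assuming Python 3.8 or later, whose standard library provides `statistics.NormalDist`.
The variable names are illustrative and are not part of the original example.

```python
from statistics import NormalDist

# Population parameters and cutoff from Example 1.1
mu, sigma = 10, 2.5
x = 14

# Standardize: z = (x - mu) / sigma
z = (x - mu) / sigma

# Left-hand tail P(X < 14); the same area is P(Z < z) on the standard normal
p_left = NormalDist(mu, sigma).cdf(x)
p_left_via_z = NormalDist().cdf(z)

# Right-hand tail P(X > 14) is the complement of the left-hand tail
p_right = 1 - p_left

print(f"z = {z:.2f}")
print(f"P(X < 14) = {p_left:.4f}")
print(f"P(X > 14) = {p_right:.4f}")
```

Computing the area both ways (directly on X and through the z-score) is only a sanity check that
standardizing does not change the probability.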
But what about a sample rather than a population? We use the nomenclature $\bar{X}$ to designate
a random variable that is sampled and $\bar{x}$ to denote the values it takes. The sample has a
mean denoted by $\mu_{\bar{X}}$ and a standard deviation denoted by $\sigma_{\bar{X}}$. Remember that
the probability distribution of a discrete random variable X is a list of each possible value of X
together with the probability that X takes that value in one trial. For an example of such a
probability distribution see Table 1 below. Also remember that each probability P(x) must be
between 0 and 1, i.e. $0 \le P(x) \le 1$, and the sum of all probabilities must equal 1, i.e.
$\sum P(x) = 1$. Therefore, for a sample we have the following equations:
$$\mu_{\bar{X}} = \sum \bar{x}\,P(\bar{x}) \qquad \sigma_{\bar{X}} = \sqrt{\sum \bar{x}^2 P(\bar{x}) - \mu_{\bar{X}}^2}$$
Example 1.2
For Table 1 find the sample mean and the standard deviation of the samples.
Table 1
For Table 1 above, the probability distribution of the sample mean is:
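Now we can apply the equations above to get the mean of the random variable $\bar{X}$ and the
corresponding standard deviation. That is,
$$\mu_{\bar{X}} = \sum \bar{x}\,P(\bar{x}) = 152\left(\tfrac{1}{16}\right) + 154\left(\tfrac{2}{16}\right) + 156\left(\tfrac{3}{16}\right) + 158\left(\tfrac{4}{16}\right) + 160\left(\tfrac{3}{16}\right) + 162\left(\tfrac{2}{16}\right) + 164\left(\tfrac{1}{16}\right) = 158$$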
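To find the standard deviation of the sample mean we must first calculate
$$\sum \bar{x}^2 P(\bar{x}) = 152^2\left(\tfrac{1}{16}\right) + 154^2\left(\tfrac{2}{16}\right) + \dots + 164^2\left(\tfrac{1}{16}\right) = 24{,}974$$
so that, with $158^2 = 24{,}964$, we get $\sigma_{\bar{X}} = \sqrt{24{,}974 - 24{,}964} = \sqrt{10} \approx 3.16$.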
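The arithmetic in Example 1.2 is easy to get wrong by hand, so here is a minimal sketch of the
same calculation in Python. It uses the values and probabilities that appear in the sum for the
mean (152 through 164, with weights 1/16 through 4/16 and back); the variable names are
illustrative.

```python
from math import isclose, sqrt

# Possible values of the sample mean and their probabilities (Example 1.2)
values = [152, 154, 156, 158, 160, 162, 164]
probs = [1 / 16, 2 / 16, 3 / 16, 4 / 16, 3 / 16, 2 / 16, 1 / 16]

# A valid probability distribution must sum to 1
assert isclose(sum(probs), 1.0)

# Mean: sum of value * probability
mean = sum(x * p for x, p in zip(values, probs))

# Second moment: sum of value^2 * probability
second_moment = sum(x ** 2 * p for x, p in zip(values, probs))

# Standard deviation: square root of (second moment - mean^2)
std_dev = sqrt(second_moment - mean ** 2)

print(f"mean = {mean:.0f}")
print(f"sum of x^2 P(x) = {second_moment:,.0f}")
print(f"standard deviation = {std_dev:.3f}")
```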
Example 2.1:
Find the (cumulative) probability that Z is less than 1.48, i.e. $P(Z < 1.48)$.
Solution:
Using the normal cumulative probability tables, find the value 1.4 on the vertical axis of the
table for positive Z and the value 0.08 on the horizontal axis. The intersection of these values is
0.9306. Therefore, the (cumulative) probability $P(Z < 1.48) = 0.9306$. This can be depicted as
shown in Figure 1 below.
There are practical uses for this. For example, the empirical rule gives the probability that the
value of a normal random variable falls within some number of standard deviations of the mean.
That is, we often talk about the probability of a random variable falling within one standard
deviation of the mean; or one sigma equals a given probability which we convert to a percent, two
sigma equals a given probability and so on. This comes from finding the interval for a probability
in the same way we just did. Consider $P(-1 < Z < 1)$. Computing the corresponding interval we get:
$$P(-1 < Z < 1) = 0.8413 - 0.1587 = 0.6826$$
That is, the probability of Z falling within $\pm 1\sigma$ is 0.6826 or 68%. The values
$\pm 2\sigma \approx 95\%$ and $\pm 3\sigma \approx 99.7\%$ are found the same way.
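The empirical rule percentages can be reproduced without tables. The sketch below assumes Python
3.8 or later (`statistics.NormalDist`) and simply evaluates $P(-k < Z < k)$ for k = 1, 2, 3. Note
that the exact one-sigma area is 0.6827; the 0.6826 obtained from the tables reflects rounding of
the two table entries.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# P(-k < Z < k) = cdf(k) - cdf(-k) for k = 1, 2, 3 standard deviations
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} sigma: {prob:.4f} ({prob:.1%})")
```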
This leads us to a discussion about the tails of the standard normal distribution and cutoff
values. Let's consider Example 2.2 below.
Example 2.2
Find the value of z* that cuts off the right hand tail of a normal distribution with an area a =
0.0250, i.e. $P(Z > z^*) = 0.0250$.
Solution:
In this case the area of the right hand tail is known, i.e. a = 0.0250, but we want to know z*, the
value that cuts off this area on the right hand side. The trick is to keep in mind that the tables
are tables of cumulative probabilities. That is, the values begin from the left hand side of the
standard normal distribution and move to the right, starting from an area of zero and
accumulating area to the maximum of 1.0. So, we can't use the area given for the right hand
side in either table to find z*. In order to use the normal cumulative probability tables we have
to find the corresponding area on the left hand side, $1 - a$, or in this case $1 - 0.0250 = 0.9750$.
This is the number we use for the area to look up in the tables. Using 0.9750 we find that
z* is 1.96. This is illustrated in Figure 3 below.
Figure 3. Finding the Value of z* Given the Right Hand Cutoff Area
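Looking up an area to find z* is the inverse of the usual table lookup. As a minimal sketch,
assuming Python 3.8 or later, `statistics.NormalDist.inv_cdf` plays the role of reading the table
"backwards"; the variable names are illustrative.

```python
from statistics import NormalDist

right_tail_area = 0.0250

# The inverse CDF works on cumulative (left-hand) area, so convert first
left_area = 1 - right_tail_area

# z* such that P(Z < z*) = left_area, i.e. P(Z > z*) = right_tail_area
z_star = NormalDist().inv_cdf(left_area)

print(f"left-hand area = {left_area:.4f}")
print(f"z* = {z_star:.2f}")
```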
But what about cases when we need to determine, for a normally distributed random variable X
and a known area a, the value of x* such that:
$$P(X < x^*) = a \quad \text{or} \quad P(X > x^*) = a$$
that is, the left hand tail or the right hand tail, whichever is required? Now we cannot directly
exploit the standard normal probability tables. In this case we need to have a bit more
information, e.g. the mean and standard deviation of the random variable X. Consider Example
2.3 below.
Example 2.3
Find x* for an area equal to 0.9332 when the mean and standard deviation of the normal
random variable X are $\mu$ = 10 and $\sigma$ = 2.5.
Solution:
If this were a standard normal random variable and a textbook problem we could use the
given area to find z* in the tables. In this case, we can begin by looking at the tables for the z*
given an area equal to 0.9332 and find z* = 1.50. What this means is that x* will be 1.50
standard deviations above the mean. So we need to essentially de-standardize in order to
find x*. We can use the following formula to do this de-standardization:
$$x^* = \mu + z^*\sigma = 10 + (1.50)(2.5) = 13.75$$
So, now let's finally consider a more practical example for the use of all this. Consider Example
2.4 below.
Example 2.4
Assume that the scores on a standardized college entrance examination are normally
distributed with a mean of 510 and a standard deviation of 60. Further, suppose you worked at
a selective university, i.e. one that only admitted students with scores in the top 5% of this
entrance examination. If you wanted to determine what the minimum score would be to meet
that criterion how could you do that?
Solution:
First, let's assume that the scores on this examination are normally distributed. And, let X be
the random variable representing that distribution. We are given that $\mu$ = 510 and $\sigma$ = 60.
Then, what we need to find is the score, x*, that produces a cutoff area of 0.05 on the right
hand side. This is similar to Example 2.2 above except that now we're looking for x* rather than
z*. But we've already seen how to de-standardize. So, first we find the left hand area or:
$$1 - a = 1 - 0.05 = 0.95$$
From the standard normal cumulative probability tables we find that z* is not listed for an area
of exactly 0.95. In this case 0.95 lies exactly between two listed values, i.e. 0.9495 (z = 1.64)
and 0.9505 (z = 1.65), so we can take the average of the two to get z* = 1.645. Since we were
given the mean and the standard deviation we use our equation to de-standardize and get:
$$x^* = \mu + z^*\sigma = 510 + (1.645 \times 60) = 608.7$$
Figure 5. Finding the Cutoff Value for a Normally Distributed Random Variable
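The admission cutoff in Example 2.4 can be sketched in a few lines. This assumes Python 3.8 or
later; the variable names are illustrative. The code de-standardizes explicitly, as in the text,
and also asks the distribution for the percentile directly as a cross-check.

```python
from statistics import NormalDist

# Exam score distribution from Example 2.4
mu, sigma = 510, 60
top_fraction = 0.05  # admit only the top 5%

# Cumulative (left-hand) area below the cutoff
left_area = 1 - top_fraction

# Step 1: find z* on the standard normal, then de-standardize
z_star = NormalDist().inv_cdf(left_area)
x_star = mu + z_star * sigma

# Cross-check: the 95th percentile of X computed directly
x_star_direct = NormalDist(mu, sigma).inv_cdf(left_area)

print(f"z* = {z_star:.3f}")
print(f"minimum score x* = {x_star:.1f}")
```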
This is all well and good if we have a population that is small enough that we can compute the
relevant statistics directly, e.g. the mean and standard deviation. But sometimes we cannot do
that. In that case we resort to samples with the mean of the sample denoted by $\bar{x}$, with a
sample variance, $s^2$, as above. And, sometimes we are interested in more than a point
estimate, i.e. using one single point to represent a large or very large population. That is what
we are really doing when we compute the mean of a sample. We are generating a point
estimate for the entire population. In reality, we usually want to generate a range of values
that is likely to contain the population parameter, e.g. the population mean, at a stated level
of confidence. This is called a confidence interval.
Example 3.1
Consider a case in which we want a level of significance, or $\alpha$, of 0.05. As we've discussed
this leads us to an area equal to 0.95 of a normal probability distribution. Now we want to
determine what we can about the cutoff values.
Solution:
We assume that a confidence interval is centered symmetrically about the mean of a normal
distribution. Given that symmetry the right and left hand tails will correspond to areas equal to
$\alpha/2$. Thus, for a standard normal distribution z* will be $z_{\alpha/2}$ and for $\alpha$ = 0.05
this is $z_{0.025}$. This is depicted in Figure 6 below.
Figure 6. Evaluating the Placement of Cutoff Values for a 95% Confidence Interval
In cases where we know the sample size, n, the sample mean, $\bar{x}$, and the population standard
deviation, $\sigma$, we can easily make evaluations for different levels of confidence. Consider
Example 3.2 below.
Example 3.2
Given a sample size of 49 with a sample mean of 35 and corresponding standard deviation of
14, evaluate a 98% confidence interval.
Solution:
Given a level of confidence of 98%, $\alpha = 1 - 0.98 = 0.02$. So, using the standard normal
probability tables $z_{\alpha/2} = z_{0.01} = 2.326$.
Earlier we demonstrated that for samples we can use $\mu_{\bar{X}} = \mu$ and
$\sigma_{\bar{X}} = \sigma/\sqrt{n}$. Now we need to expand this to cover our confidence interval.
We can do this by considering Example 3.1 and what we have learned about sample means and their
corresponding standard deviations to develop an equation for the margin of error, denoted as E,
which is the cutoff value for the sample multiplied by the standard error, denoted as SE:
$$E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = z_{\alpha/2}\,SE$$
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 35 \pm 2.326 \cdot \frac{14}{\sqrt{49}} = 35 \pm 4.652 \approx 35 \pm 4.7$$
This means that we can be 98% confident that our population mean, $\mu$, lies within the interval
(30.3, 39.7). To consider additional cases, e.g. when the population standard deviation is
unknown, look in Appendix A: Selecting the Appropriate Test Statistic.
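The confidence interval in Example 3.2 follows a fixed recipe: find the cutoff value, compute the
standard error, and multiply. The sketch below assumes Python 3.8 or later and a known
population standard deviation, as in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from Example 3.2
n = 49          # sample size
x_bar = 35      # sample mean
sigma = 14      # standard deviation (treated as known)
confidence = 0.98

# Significance level and two-sided cutoff value z_{alpha/2}
alpha = 1 - confidence
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Standard error of the mean and margin of error
standard_error = sigma / sqrt(n)
margin = z_crit * standard_error

lower, upper = x_bar - margin, x_bar + margin

print(f"z_(alpha/2) = {z_crit:.3f}")
print(f"margin of error E = {margin:.3f}")
print(f"{confidence:.0%} confidence interval: ({lower:.1f}, {upper:.1f})")
```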
correspondingly fail to accept the alternative hypothesis. In this case, $\mu_0$ is the value being
tested in the null hypothesis and we can say:
If Ha has the form Ha: $\mu < \mu_0$ we reject H0 if $\bar{x}$ is far to the left of $\mu_0$, i.e.
to the left of the critical value C such that the rejection region is the interval $(-\infty, C)$;
If Ha has the form Ha: $\mu > \mu_0$ we reject H0 if $\bar{x}$ is far to the right of $\mu_0$, i.e.
to the right of the critical value C such that the rejection region is the interval $(C, \infty)$;
If Ha has the form Ha: $\mu \ne \mu_0$ we reject H0 if $\bar{x}$ is far away from $\mu_0$ in either
direction, i.e. to the left of a lower critical value C or to the right of an upper critical value
C' such that the rejection region is $(-\infty, C) \cup (C', \infty)$.
Now what we have to do is determine the critical value or values, C. We want to be confident
in our rejection of H0; or in other words, we want only a very small probability that the value
$\bar{x}$ will fall in the rejection region when H0 is actually true. So, we'll define
$\alpha$ = 0.01 as defining a rare event which provides us with the confidence we want. Then, we
have the situation as before. This is shown in Figure 7.
Example 4.1
Suppose you are a bakery chef and have developed a recipe for the world's most delicious
chocolate cupcakes. Each of these cupcakes has 8 grams of fat per serving. Furthermore,
suppose that you know that the amount of fat in all the cupcakes baked (the population) is
normally distributed with a standard deviation of 0.15 grams. You want to make sure that, on
average, this is truly how much fat the cupcakes contain. You set aside 5 cupcakes for testing.
But, your testing equipment is not very accurate so you'll allow $\alpha$ = 0.10. How would you
validate this?
Solution:
You can use hypothesis testing to validate the statement that your cupcakes (on average) have
8 grams of fat per serving, where the serving size is one cupcake. Establish the null hypothesis
as H0: $\mu$ = 8.0. Thus, the alternative hypothesis will be Ha: $\mu \ne$ 8.0.
Remember from before that $\mu_{\bar{X}} = \mu = 8.0$ and
$\sigma_{\bar{X}} = \sigma/\sqrt{n} = 0.15/\sqrt{5} = 0.067$.
Since Ha has an inequality we have rejection regions for both left and right hand tails.
Therefore, we are looking for rejection regions with an area equal to $\alpha/2$ = 0.10/2 = 0.05
each. Using the normal probability tables we find that this corresponds to $z_{0.05}$ = 1.645. So,
the critical values will be 1.645 standard deviations of $\bar{X}$ to the right and left of the
mean 8.0.
We still have one last thing to do. We have to de-standardize this:
$$\mu_{\bar{X}} \pm z_{0.05}\,\sigma_{\bar{X}} = 8.0 \pm (1.645)(0.067) = 8.0 \pm 0.11$$
which gives critical values of 7.89 and 8.11.
Therefore, if the null hypothesis is true, for our sample of 5 cupcakes there is only a 10% chance
that we'll find a mean of 7.89 grams of fat or less or 8.11 grams of fat or more. If we do indeed
find that, we will have to reject the null hypothesis!
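The critical values for the cupcake test can be sketched as follows, assuming Python 3.8 or
later. The helper `reject_null` and the sample means passed to it in the usage lines are
hypothetical; they only illustrate how an observed mean would be compared against the rejection
regions.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from Example 4.1
mu0 = 8.0      # value tested in the null hypothesis (grams of fat)
sigma = 0.15   # known population standard deviation
n = 5          # number of cupcakes tested
alpha = 0.10   # significance level

# Standard deviation of the sample mean
se = sigma / sqrt(n)

# Two-tailed test: each tail has area alpha / 2
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# De-standardize to get the critical values on the scale of the sample mean
lower_crit = mu0 - z_crit * se
upper_crit = mu0 + z_crit * se


def reject_null(sample_mean):
    """Return True if the sample mean falls in either rejection region."""
    return sample_mean <= lower_crit or sample_mean >= upper_crit


print(f"critical values: {lower_crit:.2f} and {upper_crit:.2f}")
print(reject_null(8.02))  # hypothetical sample mean near 8.0
print(reject_null(7.85))  # hypothetical sample mean in the left tail
```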
In reality, it is more commonly the case that the population standard deviation is unknown.
Then, we use the t-distribution rather than the normal distribution and the respective test
statistics are shown below.
Type of Test                                  Test Statistic
Population standard deviation known (z)       $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
Population standard deviation unknown (t)     $T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
We also use the t-distribution when the sample size is small, i.e. n < 30.
There are essentially two types of errors that occur in hypothesis testing. These are shown in
Table 2 below. One type of error occurs when we reject the null hypothesis even though it is
actually true. This is generally considered the worst type of error and is labeled a Type I error.
The second type of error is to fail to reject the null hypothesis when it is actually false. This is
labeled a Type II error.
Table 2. Possible Outcomes of Hypothesis Testing
Our Decision          H0 is true          H0 is false
Do not reject H0      Correct decision    Type II error
Reject H0             Type I error        Correct decision
Now we need to say a bit more about $\alpha$. When we talked about confidence intervals we said
that $\alpha$ denoted the significance level. That is, the significance level reflects how much
risk we are willing to take in our results. Before, if $\alpha$ = 0.05 we were implicitly saying
that we wanted to be 95% confident we would be right or conversely we were willing to take a 5%
risk of being wrong. This established the critical values we used as criteria to accept or reject
our results. Now, we'll talk about the significance level as the probability of rejecting the null
hypothesis if it is true. So for $\alpha$ = 0.05 what we mean is that we are willing to accept that
5% of the time we will reject the null hypothesis even though it is true.
This is directly related to P values. That is, the P value is the probability, assuming the null
hypothesis is true, of obtaining a result from sampling that is equal to or more extreme than what
was actually observed. So, if the P value is less than or equal to the significance level we
reject the null hypothesis. Another way to say this is that the P value is the probability of an
observation being at least as favorable to the alternative hypothesis as the one actually
observed. Luckily for us this all boils down to the P value being equal to the rejection area that
we've already learned how to calculate. To clarify, let's consider another example.
Example 4.2
Let's suppose you're still working at your job as a bakery chef but now you want to greatly reduce
the amount of fat in your delicious cupcakes. Wouldn't it be great if you could invent
something that actually had negative fat? Let's set up an experiment to see if we have invented
a recipe that has fat, on average, equal to -4.0 grams. Since this would be such a breakthrough
we'll include many more cupcakes in our observation, i.e. 90 cupcakes. The mean of the
sample turns out to be -5.033 and the standard deviation 3.567. And, the minimum fat
measured is -13.511 and the maximum 4.490.
Solution:
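First let's set up our hypotheses:
H0: $\mu$ = -4.0
Ha: $\mu \ne$ -4.0
Since the population standard deviation is unknown, the test statistic is:
$$T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{-5.033 + 4}{3.567/\sqrt{90}} = -2.747$$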
Because we have our hypothesis set up as an equality this will be a two-sided t-test, i.e.
t < -2.747 and t > 2.747 will both be considered a rare event or extreme. Next we compare our test
statistic to the t-distribution with n - 1 degrees of freedom to find the P value. Unfortunately,
the resolution for most t-distribution tables does not extend to 89 degrees of freedom so some
interpolation is required. You can use a program such as R to calculate the value or it is easy
enough to do a linear interpolation between values in the tables to get:
$$P(T_{89} \le -2.747) = 0.00364$$
Since the t-distribution is symmetric we have the same result for $P(T_{89} \ge 2.747)$. So, the
P value will be:
$$P = 2 \times 0.00364 \approx 0.0073$$
which means that the null hypothesis is rejected at the 1% significance level. That is,
P = 0.0073 is less than $\alpha$ = 0.01.
There is yet another way to look at this problem. Let's go back to our original problem
statement. We have a sample size of n = 90. Let's also assume that we want our level of
significance to be 1%. Then, we can also use the t-distribution tables to find a critical value
given these parameters. That value is 2.632. Our test statistic of -2.747 lies to the left of
-2.632, so it falls in the left hand rejection region. So if we draw our typical diagram showing
the rejection regions we have:
[Figure: t-distribution with rejection regions in the left and right hand tails]
Either way, it doesn't look like we've validated that we made our cupcakes with negative fat!
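The test statistic and P value in Example 4.2 can be checked without tables. The Python standard
library has no t-distribution, so the sketch below integrates the t density numerically with
Simpson's rule; the helper functions `t_pdf` and `t_upper_tail` are hypothetical names written
for this illustration, and in practice a statistics package (or R, as mentioned above) would
normally be used instead.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # by symmetry, half the area lies above zero


# Inputs from Example 4.2
n, x_bar, s, mu0 = 90, -5.033, 3.567, -4.0

t_stat = (x_bar - mu0) / (s / sqrt(n))
df = n - 1

# Two-sided P value: both tails are "at least as extreme"
p_value = 2 * t_upper_tail(abs(t_stat), df)

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print(f"two-sided P value = {p_value:.4f}")
print("reject H0 at the 1% level" if p_value <= 0.01 else "fail to reject H0")
```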
produces an upper-tailed test. For these cases remember to subtract from one, keep signs
straight, etc. When we want to validate that the population parameter of one population
differs from that of another population then we set up our null hypothesis as an equality and
perform a two-tailed test.
To properly perform this hypothesis testing we need to ensure that the samples from the
respective populations have all the properties discussed before and in addition they are
independent. That is, the samples are drawn without reference to and have no connection to
each other.
In fact, there are several tests that involve two samples including:
Selecting the proper test statistic involves determining whether or not the population
standard deviations are known and whether or not they are assumed to be equal. Typically the
Example 5.1
One classic example is evaluating whether or not there is a relationship between a mother
smoking (designated by the letter s), or not (designated by the letter n), and a baby's average
birth weight. Our data consists of 50 samples for which the mother was a smoker and 100
samples for which the mother was a non-smoker. Table 3 summarizes our data. We'll establish
our level of significance at 0.050, i.e. $\alpha$ = 0.050.
Table 3
Solution:
The null hypothesis will represent the case where there is no difference. So we set up our
hypotheses as:
H0: $\mu_n - \mu_s = 0$ (no difference in birth weight)
Ha: $\mu_n - \mu_s \ne 0$ (there is a difference in birth weight)
We check the conditions and see that the data comes from a random sample and consists of
less than 10% of all cases so we believe the observations to be independent; and, the sample
sizes are both greater than 30 so we believe the distribution will be nearly normal. Therefore,
the t-score is appropriate. Since we are comparing the difference between two population
means, with some manipulation (see footnote 1) the equation for our t-score becomes:
$$T = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Footnote 1: For the derivation of this equation see Appendix C: Calculating the t-score and
Confidence Interval for the Difference of Two Means.
where $D_0$ for H0 equals 0. I have actually found a couple of different ways of calculating the
number of degrees of freedom for comparing the means of two populations. In this case we'll
set the number of degrees of freedom equal to the smaller sample size minus one, i.e.
$n_s - 1 = 49$. With a difference in sample means of 0.40 and a standard error of 0.26:
$$T = \frac{0.40 - 0}{0.26} = 1.54$$
Keep in mind that here we have the t-score and need to find the corresponding area or P value.
Again, we go to the t-tables and find that the P value falls between 0.100 and 0.200, i.e. the P
value is greater than 0.050. Therefore, we fail to reject the null hypothesis. Or stating this
another way, there is insufficient evidence in this data set to say that there is a statistically
significant difference in a baby's birth weight if the mother smokes.
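The two-sample t-score can be sketched as a small function. The function name `two_sample_t` is
hypothetical, and because Table 3 is not reproduced here, the summary statistics passed to it
below are assumed values chosen only to be consistent with the difference of 0.40 and the
standard error of about 0.26 used in the text; they should not be read as the actual Table 3
data. The degrees of freedom use the conservative choice described above.

```python
from math import sqrt


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2, d0=0.0):
    """t-score, standard error, and conservative degrees of freedom
    for the difference of two independent sample means."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_score = ((mean1 - mean2) - d0) / se
    df = min(n1, n2) - 1  # conservative: smaller sample size minus one
    return t_score, se, df


# ASSUMED summary statistics (illustrative only, not taken from Table 3):
# non-smokers: mean 7.18, sd 1.60, n = 100; smokers: mean 6.78, sd 1.43, n = 50
t_score, se, df = two_sample_t(7.18, 1.60, 100, 6.78, 1.43, 50)

print(f"standard error = {se:.2f}")
print(f"t = {t_score:.2f} with {df} degrees of freedom")
```

With the standard error rounded to 0.26 before dividing, as in the text, the t-score comes out as
1.54; carrying full precision gives a slightly larger value, which does not change the conclusion.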
In summary, to use the t-distribution to perform statistical inference the general procedure is:
1. Write the appropriate hypotheses.
2. Verify the required conditions for using the t-distribution.
a. Samples must be independent and nearly normal. For large sample sizes the
condition that the sample distribution be nearly normal can be relaxed.
b. When comparing population parameters between two populations, the samples
must be independent and each sample must satisfy the conditions in 2.a above.
3. Calculate the point estimate for the parameter of interest, e.g. the mean, the corresponding
standard error (see footnote 2), and the degrees of freedom. A software program such as R can be
used to calculate the degrees of freedom, or taking a conservative approach the smaller of
$n_1 - 1$ or $n_2 - 1$ can be used for the degrees of freedom.
4. Calculate the t-score and find the corresponding P value.
5. State the results.
Footnote 2: See Appendix B: Sampling and Estimation for a discussion about the Standard Error.
Where, as at the beginning, $s^2$ is the sample variance. We'll look at this more closely in our
discussion of hypothesis testing. For now, similar to the empirical rule which covers specific
numbers of standard deviations, there are common levels of confidence, i.e. 90%, 95%, 98%,
and 99%. Correspondingly these correlate to specific cutoff values, i.e. 1.645, 1.96, 2.326 and
2.576 respectively. This means we can reduce our equations for the confidence interval, e.g.
for 95%, to:
$$\bar{x} \pm 1.96\,\sigma/\sqrt{n} \quad \text{or} \quad \bar{x} \pm 1.96\,s/\sqrt{n} = \bar{x} \pm 1.96\,SE$$
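The four cutoff values quoted above can be regenerated rather than memorized. This is a minimal
sketch, assuming Python 3.8 or later; for each confidence level it splits the remaining area
equally between the two tails and inverts the standard normal CDF.

```python
from statistics import NormalDist

levels = [0.90, 0.95, 0.98, 0.99]

# For confidence level c, alpha = 1 - c and the cutoff is z_{alpha/2}
cutoffs = {
    level: NormalDist().inv_cdf(1 - (1 - level) / 2)
    for level in levels
}

for level in levels:
    print(f"{level:.0%} confidence: z = {cutoffs[level]:.3f}")
```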
[8] Appendix C: Calculating the t-score and Confidence Interval for the Difference of
Two Means