CHAPTER 16

Analysis of Variance (ANOVA)


GENERAL OBJECTIVE

In Chapter 10, we studied inferential methods for comparing the means of two
populations. Now we will study analysis of variance, or ANOVA, which
provides methods for comparing two or more population means. You should
be familiar with the chapter that discusses analysis of variance in your
textbook before beginning this chapter.

LESSON OUTLINE

16.1 The F-distribution
16.2 One-Way ANOVA: The Logic
16.3 One-Way ANOVA: The Procedure
16.4 Multiple Comparisons*
16.5 The Kruskal-Wallis Test*
16.6 Problems

Copyright 2012 Pearson Education, Inc. Publishing as Addison-Wesley.

16.1 The F-distribution


Analysis of variance procedures rely on a distribution called the
F-distribution, named in honor of Sir Ronald Fisher (1890-1962). A variable
is said to have an F-distribution if its distribution has a special type of
right-skewed curve, called an F-curve. There are infinitely many F-distributions,
which we identify by stating two associated degrees of freedom: a degrees of
freedom for the numerator and a degrees of freedom for the denominator. We
will now study how SPSS can be used to find an F-value, Fα, from this
distribution.

Finding the F-Value Having a Specified Area to Its Right


Example 16.1 For an F-curve with degrees of freedom, df = (4, 12), find F0.05; that is, find
the F-value having area 0.05 to its right for an F-distribution with 4 degrees of
freedom in the numerator and 12 degrees of freedom in the denominator.
Solution The SPSS function IDF.F(prob, df1, df2) returns the value from the
F-distribution, with the specified degrees of freedom, df = (df1, df2), for
which the area to the left is prob. Similar to computing a t-score, we will use
the Compute Variable dialog box.
The F-value having area 0.05 to its right has area 0.95 to its left, since the
total area under the probability curve is one. In the Numeric Expression box
type IDF.F(0.95, 4, 12). SPSS returns the F-value that has area 0.05 to its
right as F = 3.26.
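
The same lookup can be cross-checked outside SPSS. The following is a minimal sketch in Python, assuming the SciPy library is installed; Python and SciPy are not part of the SPSS procedure described here, and the variable names are ours. Like IDF.F, the function f.ppf takes the area to the left of the desired value.

```python
from scipy.stats import f

# Area to the right of the F-value we want (Example 16.1).
right_tail_area = 0.05
df_numerator, df_denominator = 4, 12

# f.ppf is the inverse cumulative distribution function, so it expects
# the area to the LEFT of the value, just as IDF.F does in SPSS.
f_value = f.ppf(1 - right_tail_area, df_numerator, df_denominator)

print(round(f_value, 2))  # 3.26
```

Passing 0.95 rather than 0.05 mirrors the reasoning in the solution: the total area under the curve is one, so an area of 0.05 to the right leaves 0.95 to the left.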

16.2 One-Way ANOVA: The Logic


Analysis of Variance (ANOVA) provides methods for comparing several
population means, that is, the means of a single variable from several
populations. In this chapter, we study one-way analysis of variance. This
type of ANOVA is called one-way analysis of variance because it compares
the means of a variable for populations that result from a classification by one
variable, called the factor. The possible values of the factor are referred to as
the levels of the factor.
One-way ANOVA is the generalization to more than two populations of the
pooled t-procedure. As in the pooled t-procedure, we make the following
assumptions.


Assumptions (Conditions) for One-Way ANOVA


1. Simple Random Samples: The samples taken from the populations under
consideration are simple random samples.
2. Independent Samples: The samples taken from the populations under
consideration are independent of one another.
3. Normal populations: For each population, the variable under
consideration is normally distributed.
4. Equal standard deviations: The standard deviations of the variable
under consideration are the same for all the populations.

16.3 One-Way ANOVA: The Procedure


The One-Way ANOVA Test
Example 16.3 Energy Consumption: The U.S. Energy Information Administration gathers
data on residential energy consumption and expenditures and publishes its
findings in Residential Energy Consumption Survey: Consumption and
Expenditures. Table 16 - 1 shows last year's energy consumptions for four
independent random samples of households in the four U.S. regions.

Table 16 - 1 Energy consumption for samples of households in four U.S. regions

Northeast   Midwest   South   West
   15          17       11     10
   10          12        7     12
   13          18        9      8
   14          13       13      7
   13          15               9
               12

At the 5% level of significance, do the data provide sufficient evidence to
conclude that a difference exists in mean annual energy consumption by
households in the four U.S. regions?


Solution Type the data into two variables named ENERGY and REGION. ENERGY
should contain all 20 data values in the four samples. REGION should take
on the four values 1, 2, 3, and 4, which associate each case with a region. The
values of REGION, 1, 2, 3, and 4, should be associated with the value labels
Northeast, Midwest, South, and West, respectively.
Step 1: State the null and alternative hypotheses.
Let μ1, μ2, μ3, and μ4 denote last year's mean energy consumptions for
households in the Northeast, Midwest, South, and West, respectively. The
null and alternative hypotheses are:
H0: μ1 = μ2 = μ3 = μ4 (mean consumptions are all equal)
Ha: Not all the mean consumptions are equal
Step 2: Decide on the significance level, α.
The test is to be performed at the 5% significance level. Thus α = 0.05.
Step 3: Compute the value of the test statistic.
1. Test the hypotheses by choosing Analyze > Compare Means >
One-Way ANOVA to open the One-Way ANOVA dialog box
(Figure 16 - 1).

Figure 16 - 1 One-Way ANOVA dialog box

2. Paste the variable ENERGY into the Dependent List box and the
variable REGION into the Factor box.
3. Click the OK button to display the results of the one-way ANOVA
in the Viewer window.

The ANOVA table (Figure 16 - 2) shows several statistics used in analysis of
variance.


Figure 16 - 2 ANOVA table from One-Way ANOVA procedure

The test statistic is F = 6.318. It has an F-distribution with df = (3, 16).
Step 4: Obtain the p-value.
The test statistic has an associated p-value = 0.005 which is given under the
column titled Sig.
Step 5: If P < α, reject H0; otherwise, do not reject H0.
The p-value is less than the specified significance level of 0.05; therefore, we
reject the null hypothesis.
Step 6: Interpret the results of the hypothesis test.
At the 5% significance level, the data provide sufficient evidence to conclude
that a difference exists in last year's mean energy consumption by households
among the four U.S. regions. That is, at least two of the regions have different
mean energy consumptions.
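
The test statistic and p-value reported by SPSS can be cross-checked with a short script. This is a minimal sketch in Python, assuming the SciPy library is installed (an assumption on our part; it is not part of the SPSS procedure). The list names are ours, and the data are those of Table 16 - 1.

```python
from scipy.stats import f_oneway

# Energy consumption samples from Table 16 - 1.
northeast = [15, 10, 13, 14, 13]
midwest = [17, 12, 18, 13, 15, 12]
south = [11, 7, 9, 13]
west = [10, 12, 8, 7, 9]

# One-way ANOVA F-test of H0: all four population means are equal.
f_statistic, p_value = f_oneway(northeast, midwest, south, west)

alpha = 0.05  # significance level chosen in Step 2
print(f"F = {f_statistic:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

The printed F-statistic should agree with the value in the ANOVA table (Figure 16 - 2) up to rounding.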

The ANOVA Table


The layout of the ANOVA table in SPSS is similar to the layout in the chapter
with the following exceptions. SPSS denotes Treatment by Between Groups
and Error by Within Groups. This is because SSTR can be thought of as the
error between the sample means and SSE can be thought of as the error within
the samples. The values of SSTR = 97.5, SSE = 82.3, and SST = 179.8 can
be read from the second column in the ANOVA table (Figure 16 - 2).
The one-way ANOVA identity,
SST = SSTR + SSE = 97.5 + 82.3 = 179.8,
shows that the total variation among all the sample data can be partitioned into
a component representing variation among the sample means and a
component representing variation within samples. The associated degrees of
freedom and mean squares are also reported in the ANOVA table.
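
To see where the entries of the ANOVA table come from, the sums of squares can be computed directly from their definitions. The following is a sketch in plain Python (our choice of language, not part of the SPSS workflow); the variable names are ours, and the data are those of Table 16 - 1.

```python
# Energy consumption samples from Table 16 - 1.
samples = {
    "Northeast": [15, 10, 13, 14, 13],
    "Midwest": [17, 12, 18, 13, 15, 12],
    "South": [11, 7, 9, 13],
    "West": [10, 12, 8, 7, 9],
}

all_values = [x for sample in samples.values() for x in sample]
n = len(all_values)   # total number of observations
k = len(samples)      # number of populations (factor levels)
grand_mean = sum(all_values) / n

def mean(values):
    return sum(values) / len(values)

# Treatment (between groups) sum of squares: variation among sample means.
sstr = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples.values())
# Error (within groups) sum of squares: variation within each sample.
sse = sum((x - mean(s)) ** 2 for s in samples.values() for x in s)
# Total sum of squares: variation of all observations about the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

mstr = sstr / (k - 1)   # treatment mean square, df = k - 1
mse = sse / (n - k)     # error mean square, df = n - k
f_statistic = mstr / mse

print(f"SSTR = {sstr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")
print(f"F = {f_statistic:.3f}")
```

Computing SST separately from SSTR and SSE, rather than as their sum, is what makes the script a check of the one-way ANOVA identity.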


16.4 Multiple Comparisons*


When the null hypothesis is rejected in a one-way ANOVA, the conclusion is
that the means are not all equal. Once you make that decision, you may also
want to know which means are different, which is the largest, or, more
generally, the relation among all the means. Methods for doing
such problems are called multiple comparisons.
SPSS provides several multiple comparison methods including the Tukey
multiple comparison method. In multiple comparisons, it is important to
distinguish between the individual confidence level and the family confidence
level. The individual confidence level is the confidence that any particular
confidence interval contains the true difference between the corresponding
population means; the family confidence level is the confidence that all the
confidence intervals simultaneously contain their respective true differences.
The Tukey multiple comparison method is based on the studentized range
distribution. The Tukey multiple comparison method for obtaining
confidence intervals for the differences between means is similar to the pooled
t-interval formula. The essential difference is that, in the Tukey multiple
comparison method the percentile of a studentized range distribution is used
instead of the percentile of a t-distribution. The effect of this is that the
(1 - α)-level confidence intervals constructed by the Tukey multiple
comparison method have a family confidence level of 1 - α. Each of the
(1 - α)-level confidence intervals constructed by the pooled t-interval formula
has an individual confidence level of 1 - α, but the family confidence level for
this set of confidence intervals is smaller than 1 - α.
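
The effect of swapping the t-percentile for a studentized range percentile can be made concrete by comparing the two multipliers that scale the standard error in each interval. The sketch below is in Python and assumes SciPy 1.7 or later, which provides studentized_range; neither is part of SPSS. The values k = 4 and error df = 16 are taken from the energy consumption example, and dividing the studentized range percentile by the square root of 2 is the usual way to put it on the same scale as a t-percentile.

```python
from math import sqrt
from scipy.stats import studentized_range, t

alpha = 0.05
k = 4           # number of population means being compared
df_error = 16   # error degrees of freedom, n - k = 20 - 4

# Multiplier used by a single pooled t-interval (two-sided).
t_multiplier = t.ppf(1 - alpha / 2, df_error)

# Multiplier used by a Tukey interval: the studentized range
# percentile divided by sqrt(2).
q_critical = studentized_range.ppf(1 - alpha, k, df_error)
tukey_multiplier = q_critical / sqrt(2)

print(f"pooled t multiplier: {t_multiplier:.3f}")
print(f"Tukey multiplier:    {tukey_multiplier:.3f}")
```

The Tukey multiplier is the larger of the two, so each Tukey interval is wider. That extra width is the price of having the stated confidence level apply to the whole family of intervals at once.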

The Tukey Multiple-Comparison Method


Example 16.6 Energy Consumption: Apply the Tukey multiple comparison method to the
energy consumption data in Table 16 - 1. Use a family confidence level of
95%.
Solution To perform Tukey multiple comparisons in SPSS,
1. Click the Post Hoc... button in the One-Way ANOVA dialog box
(Figure 16 - 1) to open the One-Way ANOVA: Post Hoc
Multiple Comparisons dialog box (Figure 16 - 3).

Figure 16 - 3 One-Way ANOVA: Post Hoc Multiple Comparisons dialog box

2. Choose the checkbox for Tukey.
3. A 95% family confidence level corresponds to a 5% significance level.
Therefore, enter 0.05 into the Significance level box.
4. Click the Continue button to close the dialog box and then click the OK
button to display the results in the Viewer window.
The Multiple Comparisons table (Figure 16 - 4) shows 95% confidence
intervals for the differences using the Tukey multiple comparisons method.
Figure 16 - 4 Multiple Comparisons table for Tukey multiple comparisons method


For example, the confidence interval for the mean difference between the
Northeast and Midwest regions is -5.429 to 2.429. Two population means are
significantly different if their confidence interval does not include 0. This is
true for the Midwest and South regions, for example. SPSS provides another
table, the Homogeneous Subsets table (Figure 16 - 5), to help decipher which
population means are different and which are equal.
Figure 16 - 5 Homogeneous Subsets table from Tukey multiple comparison procedure

Means that are lined up together in a column under Subset for alpha = 0.05
are not judged different by the Tukey multiple comparison method. Means that
do not share a column are judged not equal. That is, there is not sufficient
evidence that the population means for the regions West, South, and Northeast
differ, nor that the population means for the regions Northeast and Midwest
differ. Further, since West and Midwest do not share a column, there is
sufficient evidence that their population means are not equal. These results
have a 95% family confidence level.
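
The Tukey intervals can also be reproduced outside SPSS. The following is a sketch in Python, assuming SciPy 1.8 or later, which provides tukey_hsd (an assumption on our part; it is not part of the SPSS procedure). The names are ours, and the data are those of Table 16 - 1.

```python
from scipy.stats import tukey_hsd

# Energy consumption samples from Table 16 - 1.
samples = {
    "Northeast": [15, 10, 13, 14, 13],
    "Midwest": [17, 12, 18, 13, 15, 12],
    "South": [11, 7, 9, 13],
    "West": [10, 12, 8, 7, 9],
}
names = list(samples)

result = tukey_hsd(*samples.values())
# Simultaneous intervals with a 95% family confidence level.
intervals = result.confidence_interval(confidence_level=0.95)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        low = intervals.low[i][j]
        high = intervals.high[i][j]
        # Two means are judged different when the interval excludes 0.
        verdict = "different" if (low > 0 or high < 0) else "not different"
        print(f"{names[i]} - {names[j]}: ({low:.3f}, {high:.3f}) {verdict}")
```

Each line of output corresponds to one row of the Multiple Comparisons table (Figure 16 - 4), with the difference taken as the first region minus the second.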

16.5 The Kruskal-Wallis Test*


The Kruskal-Wallis test is a nonparametric alternative to the one-way
ANOVA procedure. The Kruskal-Wallis test determines whether several independent
samples are from the same population. The Kruskal-Wallis test applies when
the distributions (one for each population) of the variable under consideration
have the same shape, but does not require that they be normal or have any
other specific shape. Like the Mann-Whitney test, the Kruskal-Wallis test is
based on ranks.


The Kruskal-Wallis Test


Example 16.8 Vehicle Miles: The U.S. Federal Highway Administration conducts annual
surveys on motor vehicle travel by type of vehicle and publishes its findings
in Highway Statistics. Independent simple random samples of cars, buses, and
trucks were chosen and the data on number of miles driven, in thousands, by
each sampled vehicle last year are shown in Table 16 - 2.

Table 16 - 2 Number of miles driven (1000s) last year for independent samples of cars, buses, and trucks

Cars    Buses   Trucks
19.9     1.8     24.6
15.3     7.2     37.0
 2.2     7.2     21.2
 6.8     6.5     23.6
34.2    13.3     23.0
 8.3    25.4     15.3
12.0             57.1
 7.0             14.5
 9.5             26.0
 1.1

Preliminary data analysis (not shown) suggests that the distributions of miles
driven have roughly the same shape for cars, buses, and trucks but that those
distributions are far from normal. Thus the appropriate test is the
Kruskal-Wallis procedure. At the 5% significance level, do the data provide
sufficient evidence to conclude that a difference exists in last year's mean
number of miles driven among cars, buses, and trucks?

Solution The Kruskal-Wallis test is performed through the Tests for Several
Independent Samples dialog box. Type the data into two variables named MILES
and VEHICLE in a new data file. MILES should contain all 25 data values in
the three samples. VEHICLE should take on the values 1, 2, and 3,
associated with the value labels Cars, Buses, and Trucks, respectively.
Step 1: State the null and alternative hypotheses.
Let μ1, μ2, and μ3 denote last year's mean number of miles driven for cars,
buses, and trucks, respectively. The null and alternative hypotheses are:
H0: μ1 = μ2 = μ3 (mean miles driven are all equal)
Ha: Not all the means are equal
Step 2: Decide on the significance level, α.
The test is to be performed at the 5% significance level. Thus α = 0.05.

Step 3: Compute the value of the test statistic.
1. Test the hypotheses by choosing Analyze > Nonparametric Tests >
Legacy Dialogs > K Independent Samples to open the Tests for
Several Independent Samples dialog box (Figure 16 - 6).
2. Paste the variable MILES into the Test Variable List box and the
variable VEHICLE into the Grouping Variable box.

Figure 16 - 6 Tests for Several Independent Samples dialog box

Next, we need to specify the minimum and maximum integer values for the
grouping variable. The minimum value must be less than the maximum value.
Cases associated with values outside the bounds are excluded during the
analysis. This option is supplied so that the Kruskal-Wallis procedure can be
performed on a subset of the samples.
3. Click the Define Range button to open the Several Independent
Samples: Define Range dialog box (Figure 16 - 7).


Figure 16 - 7 Several Independent Samples: Define Range dialog box

We require all the cases to be analyzed; consequently, enter 1, the minimum
value in VEHICLE, into the Minimum box and 3, the maximum value in
VEHICLE, into the Maximum box.
4. Click the Continue button to close the dialog box and update the grouping
variable information in the Tests for Several Independent Samples
dialog box.
5. Click the OK button to display the results in the Viewer window.

The Ranks table (Figure 16 - 8) displays the mean ranks for each of the three
samples. If the population means are equal, we would expect the mean ranks to
be approximately equal.

Figure 16 - 8 Ranks table from Kruskal-Wallis procedure
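
The mean ranks can be reproduced by ranking all 25 observations together and averaging the ranks within each sample. This is a minimal sketch in Python, assuming SciPy is installed (our assumption, not part of the SPSS procedure); the names are ours, and the data are those of Table 16 - 2.

```python
from scipy.stats import rankdata

# Miles driven (1000s) from Table 16 - 2.
samples = {
    "Cars": [19.9, 15.3, 2.2, 6.8, 34.2, 8.3, 12.0, 7.0, 9.5, 1.1],
    "Buses": [1.8, 7.2, 7.2, 6.5, 13.3, 25.4],
    "Trucks": [24.6, 37.0, 21.2, 23.6, 23.0, 15.3, 57.1, 14.5, 26.0],
}

# Rank the combined data; tied values share the average of their ranks.
combined = [x for sample in samples.values() for x in sample]
ranks = rankdata(combined)

# Split the ranks back into the three samples and average each group.
mean_ranks = {}
start = 0
for name, values in samples.items():
    group_ranks = ranks[start:start + len(values)]
    mean_ranks[name] = float(group_ranks.mean())
    start += len(values)

for name, value in mean_ranks.items():
    print(f"{name}: mean rank = {value:.2f}")
```

A mean rank for one sample that is much larger than the others, as happens here for trucks, is the kind of imbalance the Kruskal-Wallis statistic measures.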

The Test Statistics table (Figure 16 - 9) gives the chi-square test statistic,
degrees of freedom associated with the test statistic, and the p-value of the
hypothesis test.

Figure 16 - 9 Test Statistics table from Kruskal-Wallis procedure

The test statistic is H = 9.93, which has approximately a chi-square
distribution with 2 degrees of freedom.
Step 4: Obtain the p-value.
The test statistic has an associated p-value = 0.007 which is given in the row
titled Asymp. Sig.
Step 5: If P < α, reject H0; otherwise, do not reject H0.
The p-value is less than the specified significance level of 0.05; therefore, we
reject the null hypothesis.
Step 6: Interpret the results of the hypothesis test.
At the 5% significance level, the data provide sufficient evidence to conclude
that at least one of the means is not equal to the others.
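
The Kruskal-Wallis statistic and its p-value can be cross-checked with a short script. The following is a sketch in Python, assuming SciPy is installed (our assumption; it is not part of the SPSS procedure). The list names are ours, and the data are those of Table 16 - 2.

```python
from scipy.stats import kruskal

# Miles driven (1000s) from Table 16 - 2.
cars = [19.9, 15.3, 2.2, 6.8, 34.2, 8.3, 12.0, 7.0, 9.5, 1.1]
buses = [1.8, 7.2, 7.2, 6.5, 13.3, 25.4]
trucks = [24.6, 37.0, 21.2, 23.6, 23.0, 15.3, 57.1, 14.5, 26.0]

# Kruskal-Wallis H-test; the p-value comes from the chi-square
# approximation with (number of samples - 1) degrees of freedom.
h_statistic, p_value = kruskal(cars, buses, trucks)

alpha = 0.05  # significance level chosen in Step 2
print(f"H = {h_statistic:.2f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

The data contain tied values (7.2 and 15.3 each occur twice), so the statistic includes a correction for ties.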

16.6 Problems
Problem 16.8

For the F-curve with df = (12, 5), find


a. F0.05
b. F0.01
c. F0.025

Problem 16.10

For the F-curve with df = (6, 10), find


a. F0.05
b. F0.01
c. F0.025

Problem 16.48

Movie fans use the annual Leonard Maltin Movie Guide for facts, cast
members, and reviews of over 21,000 films. The movies are rated from 4
stars (4*), indicating a very good movie, to 1 star (1*), which Leonard Maltin
refers to as a BOMB. Table 16 - 3 gives the running times, in minutes, of a
random sample of films listed in one year's guide. At the 1% significance
level, do the data provide sufficient evidence to conclude that a difference
exists in mean running times among the four rating groups?

Table 16 - 3 Running times in minutes

1* or 1.5*   2* or 2.5*   3* or 3.5*    4*
    75           97          101       101
    95           70           89       135
    84          105           97        93
    86          119          103       117
    58           87           86       126
    85           95          100       119

Problem 16.49

Copepods are tiny crustaceans that are an essential link in the estuarine food
web. Marine scientists G. Weiss, G. McManus, and H. Harvey at the
Chesapeake Biological Laboratory in Maryland designed an experiment to
determine whether dietary lipid (fat) content is important in the population
growth of a Chesapeake Bay copepod. Their findings were published as the
paper "Development and Lipid Composition of the Harpacticoid Copepod
Nitocra Spinipes Reared on Different Diets" (Marine Ecology Progress
Series, vol. 132, pp. 57-61). Independent random samples of copepods were
placed in containers containing lipid-rich diatoms, bacteria, or leafy
macroalgae. There were 12 containers total, four replicates per diet. Five
gravid (egg-bearing) females were placed in each container. Table 16 - 4
shows the number of copepods in each container after 14 days.

Table 16 - 4 Number of copepods

Diatoms   Bacteria   Macroalgae
  426       303         277
  467       301         324
  438       293         302
  497       328         272

a. Obtain the one-way ANOVA table for the data.
b. Verify the one-way ANOVA identity.
c. At the 5% significance level, do the data provide sufficient evidence to
conclude that a difference exists in the mean number of copepods among
the three different diets?


Problem 16.95

Refer to Problem 16.49. Apply the Tukey multiple comparison method to the
data in Table 16 - 3. Use a family confidence level of 95%.

Problem 16.129

Indications are that Americans have become more aware of the dangers of
excessive fat intake in their diets, although some reversal of this awareness
appears to have developed in recent years. The U.S. Department of
Agriculture publishes data on annual consumption of selected beverages in
Food Consumption, Prices, and Expenditures. Independent random samples
of lowfat-milk consumptions, measured in gallons, for 1980, 1995, and 2005
are given in Table 16 - 5.

Table 16 - 5 Lowfat milk consumptions, in gallons, for 1980, 1995, and 2005

1980    1995    2005
11.1    15.5    11.2
10.7    16.0    12.7
 8.6    16.1    17.4
 9.4    14.7    17.1
 9.2    11.5    13.4
15.1    17.1    11.4
11.6    16.2    13.9
 8.3            14.6
                15.2

At the 1% level of significance, do the data provide sufficient evidence to
conclude that there is a difference in mean (per capita) consumption of lowfat
milk for the years 1980, 1995, and 2005? Use the Kruskal-Wallis test.
