
STAT 6420 Fall 2014 J. Reeves
HW #6 Higher-Way ANOVA
Lab Fri. Nov. 14, 2014
Due Fri. Nov. 21, 2014
HW 6 - 2-Way ANOVA & Latin Square Designs
Answer the questions following the usual homework rules. Explain what your output
means and attach it, as described earlier this semester. These problems will be reviewed
briefly in lab on Monday 11/17/14.
The first problem is a balanced 2-factor design with 4 observations per cell, while
the second is a Latin Square design with $t = 5$ (that is, $t^2 = 25$ of the possible
$t^3 = 125$ combinations are observed in the design). The third is an unbalanced
2-factor design.
PROC ANOVA (in SAS) or the aov function (in R) can be used for any 1-WAY
ANOVA (balanced or not) and for balanced higher-dimensional designs. Since Latin
Squares are balanced (although incomplete) designs, this means that PROC ANOVA (or
the aov function) can be used to analyze problems 1 and 2 of this homework, but
PROC GLM (in SAS) or the lm function (in R) will be needed to correctly analyze
Problem #3. The data for problems 1-3 are not text files, but Excel files, since these are
common in practice. For your convenience, both a regular .xls file and a .csv version of
each file are stored on the S: drive and on eLC, with the .xls format being more convenient
for those who use SAS and the .csv format more useful for those using R. The file names
for the three problems are: uwisc.xls, color.xls, and agr.xls, respectively. The first
file has no header line, while the other two do have headers. The third file contains some
missing data (blank cells), although the analysis of the problem will be the same whether
one reads those lines in as they are listed (blank for the response variable) or deletes the
lines from the data set. Relatively little of each problem's point value is derived from
running code to perform analyses. Much more of the point value concerns your ability
to explain clearly what you've done and what it means.
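As a rough illustration of the file conventions just described (no header line versus header line, and blank cells for missing responses), the sketch below reads two small stand-in files with read.csv. The temporary files, the category labels in the first file, and the header names FERT, PLOT, REP, Y in the second are assumptions made for illustration; the real files are the .csv versions on the S: drive or eLC.

```r
# Stand-in for a headerless file such as the .csv version of uwisc.xls.
# Only three lines are shown; the labels MALE/DORM/etc. are assumed codings.
f1 <- tempfile(fileext = ".csv")
writeLines(c("MALE,DORM,2.56", "MALE,DORM,2.70", "FEMALE,HOUSE,3.61"), f1)
P1 <- read.csv(f1, header = FALSE, col.names = c("GENDER", "LS", "GPA"))

# Stand-in for a file with a header line and a blank response cell,
# as described for agr.xls (the header names are an assumption).
f3 <- tempfile(fileext = ".csv")
writeLines(c("FERT,PLOT,REP,Y", "F1,A,1,14.3", "F2,A,1,", "F2,C,2,11.6"), f3)
P3 <- read.csv(f3, header = TRUE, na.strings = c("", "NA"))

str(P1)
str(P3)

# Rows with a missing response can be kept or dropped; lm() and aov() omit
# them by default, so the fitted model is the same either way.
P3_complete <- P3[!is.na(P3$Y), ]
nrow(P3_complete)
```

Reading the blank cell as NA and leaving the row in place keeps the data frame aligned with the original 12-line layout, which can make it easier to see which cells are empty.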

1. UWisc Living Groups (data-set in uwisc.xls)


A large state university (University of Wisconsin-Madison) wondered if there was
much difference between students' grades based upon their living situations. The
five most typical living situations for undergraduates at UW are: Dorms,
Fraternities/Sororities, Co-ops, Apartments, and Houses. Although they were primarily
interested in the effects (if any) due to living situation, the campus administration
realized that some other factors such as age and gender might be relevant. Therefore,
they considered only Juniors and used gender as a blocking variable. The
results given below are the cumulative GPAs of a random sample of n = 40 Juniors
at UW, with r = 4 being randomly chosen from each of the 10 ((a=2)*(b=5))
Gender/Living Situation cells. These data (40 lines, no headers) can be read from
file uwisc.xls. The format for each line of data is: GENDER LS GPA.
GPAs of 40 U Wisconsin Students by Gender and Living Situation

                           LIVING SITUATION
GENDER|   DORM    | FRAT/SOR. |   CO-OP   |   APT.    |   HOUSE
------|-----------|-----------|-----------|-----------|-----------
      | 2.56 2.70 | 3.02 2.67 | 3.41 2.97 | 2.28 3.73 | 2.57 2.81
MALE  | 3.04 1.42 | 2.35 1.80 | 2.23 2.31 | 2.05 2.61 | 3.01 3.40
------|-----------|-----------|-----------|-----------|-----------
      | 3.33 1.80 | 3.87 3.00 | 3.14 4.00 | 3.69 2.55 | 3.09 1.99
FEMALE| 2.50 3.04 | 3.25 2.76 | 2.66 2.91 | 3.21 2.86 | 3.46 3.61
------|-----------|-----------|-----------|-----------|-----------
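For those working in R, the table above can also be typed in directly rather than read from uwisc.csv. The following is a minimal sketch, not the required approach: the object names (P1, fit_full, fit_add) and the level labels such as FRAT_SOR are chosen here for illustration, and the GPA values are transcribed from the table.

```r
# GPAs transcribed from the table, one vector of 4 values per cell,
# in the column order DORM, FRAT/SOR., CO-OP, APT., HOUSE.
ls_levels <- c("DORM", "FRAT_SOR", "COOP", "APT", "HOUSE")
male <- list(c(2.56, 2.70, 3.04, 1.42), c(3.02, 2.67, 2.35, 1.80),
             c(3.41, 2.97, 2.23, 2.31), c(2.28, 3.73, 2.05, 2.61),
             c(2.57, 2.81, 3.01, 3.40))
female <- list(c(3.33, 1.80, 2.50, 3.04), c(3.87, 3.00, 3.25, 2.76),
               c(3.14, 4.00, 2.66, 2.91), c(3.69, 2.55, 3.21, 2.86),
               c(3.09, 1.99, 3.46, 3.61))

P1 <- data.frame(
  GENDER = factor(rep(c("MALE", "FEMALE"), each = 20)),
  LS     = factor(rep(rep(ls_levels, each = 4), times = 2), levels = ls_levels),
  GPA    = c(unlist(male), unlist(female))
)

# Two-factor model with main effects and the Gender x Living Situation interaction.
fit_full <- aov(GPA ~ GENDER * LS, data = P1)
summary(fit_full)                          # ANOVA table
model.tables(fit_full, type = "means")     # grand, marginal, and cell means
model.tables(fit_full, type = "effects")   # deviations from the grand mean
rmse <- sqrt(deviance(fit_full) / df.residual(fit_full))
rmse

# Additive model with the interaction term omitted.
fit_add <- aov(GPA ~ GENDER + LS, data = P1)
summary(fit_add)
```

Because the design is balanced (4 observations in every cell), aov and model.tables give the usual orthogonal decomposition; the same calls would not be appropriate for the unbalanced data in Problem 3.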

(a) Give the complete statistical model for this study, including both main effects
and interaction terms.
(b) Use appropriate software to obtain the ANOVA table for the model in (a).
(c) Test for interactions and, if appropriate, for main effects, using $\alpha = .10$.
(d) Does the blocking factor (Gender) seem necessary?
(e) Use the software to calculate estimates of the grand mean, the 5 Living
Situation effects, the 2 Gender effects, and the RMSE of the model in (a).

(f) Which of the 10 interaction terms is the most negative? Does an interaction
of this magnitude seem very important, given the RMSE of the model?
(g) A randomly chosen (Female, Co-Op) Junior UWisc student (not in the original
sample) has a GPA of 2.58. Is this a surprising result or not? (Explain your
reasoning.)
(h) The university has some control over Dorms, Fraternities/Sororities, and
Co-ops as living situations, while it has no control at all over Apartments or
Houses. It wishes to know if there is any significant difference between the
average GPA scored by Juniors living in the former group (Dorm, F/S, Co-op)
as opposed to the latter group (Apt. & House). Assume for the purposes of
this problem (although it's not quite true) that each of the 10 cells in the table
above represents exactly 10% of the 20-year-old students at the University of
Wisconsin. [This question can be answered using the CONTRAST options
within SAS or R (with the former clearer than the latter), or one can use
the output and the formula from Section 1 of Unit 8b to obtain a C.I. for this
linear combination of cell means.]
(i) Re-run your program omitting the interaction term from your model and test
(at the $\alpha = .10$ level) for the effects of both Gender and Living Situation.
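For part (h), one way to set up the comparison is as a linear combination of the 10 cell means. The display below is a sketch using the standard confidence interval for a contrast in a balanced design; the symbols $\mu_{ij}$, $c_{ij}$, and $L$ are notation introduced here, and the formula in Section 1 of Unit 8b may be written differently.

```latex
% Cell means \mu_{ij}: i = gender (1, 2); j = living situation
% (1 = Dorm, 2 = F/S, 3 = Co-op, 4 = Apt., 5 = House). Equal cell weights assumed.
\[
L = \frac{1}{6}\sum_{i=1}^{2}\sum_{j=1}^{3}\mu_{ij}
  - \frac{1}{4}\sum_{i=1}^{2}\sum_{j=4}^{5}\mu_{ij},
\qquad
\hat{L} = \sum_{i,j} c_{ij}\,\bar{Y}_{ij\cdot},
\qquad
c_{ij} =
\begin{cases}
  \tfrac{1}{6},  & j \le 3,\\[2pt]
  -\tfrac{1}{4}, & j \ge 4.
\end{cases}
\]
\[
\widehat{\mathrm{SE}}(\hat{L})
  = \sqrt{\mathrm{MSE}\sum_{i,j}\frac{c_{ij}^{2}}{r}}
  = \sqrt{\mathrm{MSE}\cdot\frac{5/12}{4}},
\qquad
\hat{L} \pm t_{\alpha/2,\;ab(r-1)}\,\widehat{\mathrm{SE}}(\hat{L}).
\]
```

Here $r = 4$, the coefficients sum to zero, $\sum c_{ij}^2 = 6/36 + 4/16 = 5/12$, and the error degrees of freedom are $ab(r-1) = 30$ when MSE comes from the full model in (a).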

2. Clothing Color Preference (data-set in color.xls)


An experiment was performed to determine what color outfit college men prefer
to see on college women. (This experiment was performed at some Midwestern
University [some say Iowa State] in the middle of Winter, when there was nothing
better to do. A fashion show was held in an auditorium, where 5 different women
would come on stage wearing different outfits of different colors. The order was
rotated so as to create a Latin Square Design.) The main variable of interest was
COLOR, but the experimenters also wanted to control for the type of OUTFIT
and the person wearing the outfit (WEARER). The results shown below are the
scores on a [0 (dislike) to 100 (like)] scale for the 5*5=25 combinations presented.
Analyze this Latin Square design, assuming that each observation is based on the
same respondent's opinion. [In fact, the scores are averages over all the persons
who attended the fashion show, many of whom were other women, so it's not clear
that this experiment will really answer the question stated in the first sentence.]
Fashion Show Latin Square Results
(The abbreviations are: R-Red, W-White, Bu-Blue, Bk-Black, G-Green;
E-Evening Gown, H-Halter & Hot Pants, S-Shirt & Slacks, B-Blouse & Skirt,
J-Jeans & T-shirt.)

               WEARER
  1  |  2   |  3   |  4   |  5   |  OUTFIT
-----|------|------|------|------|---------------------------
  R  |  G   |  W   |  Bk  |  Bu  |
 64  |  75  |  71  |  93  |  86  |  E (= Evening Gown)
-----|------|------|------|------|---------------------------
  W  |  R   |  Bu  |  G   |  Bk  |
 90  |  71  |  85  |  65  |  87  |  H (= Halter & Hot Pants)
-----|------|------|------|------|---------------------------
  Bu |  W   |  Bk  |  R   |  G   |
 80  |  42  |  86  |  50  |  64  |  S (= Shirt & Slacks)
-----|------|------|------|------|---------------------------
  Bk |  Bu  |  G   |  W   |  R   |
 55  |  73  |  54  |  37  |  41  |  B (= Blouse & Skirt)
-----|------|------|------|------|---------------------------
  G  |  Bk  |  R   |  Bu  |  W   |
 50  |  67  |  40  |  81  |  44  |  J (= Jeans & T-shirt)
-----|------|------|------|------|---------------------------

(a) Read the data in and analyze it using commands similar to those shown below.
Use this output to answer parts (2b)-(2i). The data-set in slightly more useful
format is shown after part (i) and in color.xls.
Using SAS:
PROC ANOVA;
  CLASS Color Outfit Wearer;
  MODEL Score=Color Outfit Wearer;
  MEANS Color Outfit Wearer;
  MEANS Color/LSD TUKEY;
RUN;

Using R:
# P2 is the data frame read from the .csv file (columns Outfit, Wearer, Color, Score)
P2$Wearer <- factor(P2$Wearer)  # Wearer is coded 1-5; treat it as a factor, not a number
g <- aov(Score ~ Color + Outfit + Wearer, data = P2)
summary(g)
g1 <- lm(Score ~ Color + Outfit + Wearer, data = P2)
summary(g1)

(b) Which factor was of most interest to the persons doing this study?
(c) Which blocking factor had the most effect on score? Why?
(d) Does there appear to be a significant difference between the Colors with respect
to scores assigned? (Choose best answer):
(i) YES; bright colors are preferred to dark colors.
(ii) YES, but directions of difference can't be determined from ANOVA table.
(iii) NO; there is no significant evidence of difference between Colors.
(e) If one parameterizes the model in (grand mean, deviations) format:
$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \epsilon_{ijk}$,
where the $\alpha$'s, $\beta$'s, and $\gamma$'s each sum to zero over their respective
levels, what are the numerical estimates of:
grand mean = _________
deviation due to Jeans & T-shirt outfit = _______
deviation due to Wearer #2 = _______
deviation due to Blue color = _________

(f) In two different situations, you will attempt to predict the score which would
be observed if a particular (Outfit, Wearer, Color) combination were displayed.
Also, attempt to give a standard error for your prediction. In the first case,
assume that the Outfit, Wearer, and Color are each chosen at random from the
five available, and that this information is unknown to you. In the second case,
assume that you are told in advance that the randomly chosen combination
is Wearer #2, in Jeans & T-Shirt of Blue color. Fill in the four blanks below
with numerical values.

Case   Known Information                    Predicted Score   RMSE for Model
----   ---------------------------------    ---------------   ---------------
(i)    Outfit, Wearer, Color all Unknown    _______________   _______________
(ii)   Outfit=J, Wearer=2, Color=Blue       _______________   _______________

(g) The highest observed score was 93 (for Outfit=E, Wearer=4, Color=Bk).
According to the estimated parameters of this model, which of the 125 possible
combinations of (Outfit, Wearer, Color) would yield the highest expected score
(and what is it)? Of the 25 combinations which were actually used in the
Latin Square Design, is the observed maximum (E,4,Bk) the combination
which would have been expected to be the highest?
(h) Of the 10 possible paired comparisons between the different colors (Bu-Bk,
Bu-G, ... , R-W):
How many are significantly different at $\alpha = .05$ by the LSD criterion?
How many are significantly different at $\alpha = .05$ by the Tukey criterion?
(i) Urban legend suggests that men will pick Blue significantly more often than
other colors when/if a woman asks a male what color outfit he thinks she
should wear. Based on the above analysis, is there any truth to this idea?
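For the (grand mean, deviations) parameterization in part (e), the usual least-squares estimates in a balanced Latin square are deviations of level means from the grand mean. The display below is a sketch in notation introduced here (which of $\alpha$, $\beta$, $\gamma$ goes with Color, Outfit, or Wearer is an arbitrary labeling), so it should be checked against the course notes.

```latex
% Each level mean averages t = 5 observations; dots denote averaging over an index.
\[
\hat{\mu} = \bar{Y}_{\cdot\cdot\cdot}, \qquad
\hat{\alpha}_i = \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \qquad
\hat{\beta}_j = \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \qquad
\hat{\gamma}_k = \bar{Y}_{\cdot\cdot k} - \bar{Y}_{\cdot\cdot\cdot}
\]
\[
\hat{Y}_{ijk} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\gamma}_k,
\qquad
\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{(t-1)(t-2)}}
\]
```

With $t = 5$ the error degrees of freedom are $(t-1)(t-2) = 12$, and the fitted value $\hat{Y}_{ijk}$ is defined for all 125 combinations even though only 25 were observed.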

----------------------------------------------------------------------
(Possibly more useful display of Color dataset; also in color.xls)

Outfit  Wearer  Color  Score
E       1       R      64
H       1       W      90
S       1       Bu     80
B       1       Bk     55
J       1       G      50
E       2       G      75
H       2       R      71
S       2       W      42
B       2       Bu     73
J       2       Bk     67
E       3       W      71
H       3       Bu     85
S       3       Bk     86
B       3       G      54
J       3       R      40
E       4       Bk     93
H       4       G      65
S       4       R      50
B       4       W      37
J       4       Bu     81
E       5       Bu     86
H       5       Bk     87
S       5       G      64
B       5       R      41
J       5       W      44
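The listing above can be entered in R directly, which avoids depending on the file layout. The sketch below is one possible way to carry out the Latin square analysis; the object names (P2, g, grid) are illustrative, the LSD is computed by hand from the model MSE rather than through a dedicated function, and none of it is meant as the required solution.

```r
# Data transcribed from the listing; column names follow the code in part (a).
P2 <- data.frame(
  Outfit = factor(rep(c("E", "H", "S", "B", "J"), times = 5)),
  Wearer = factor(rep(1:5, each = 5)),   # coded 1-5, so convert to a factor
  Color  = factor(c("R", "W", "Bu", "Bk", "G",
                    "G", "R", "W", "Bu", "Bk",
                    "W", "Bu", "Bk", "G", "R",
                    "Bk", "G", "R", "W", "Bu",
                    "Bu", "Bk", "G", "R", "W")),
  Score  = c(64, 90, 80, 55, 50,
             75, 71, 42, 73, 67,
             71, 85, 86, 54, 40,
             93, 65, 50, 37, 81,
             86, 87, 64, 41, 44)
)

g <- aov(Score ~ Color + Outfit + Wearer, data = P2)
summary(g)                           # ANOVA table; error df = (t-1)(t-2) = 12
model.tables(g, type = "effects")    # sum-to-zero deviations for each factor
mean(P2$Score)                       # grand mean
mse <- deviance(g) / df.residual(g)  # RMSE is sqrt(mse)

# Pairwise color comparisons at alpha = .05.
TukeyHSD(g, which = "Color")                            # Tukey criterion
lsd <- qt(0.975, df.residual(g)) * sqrt(2 * mse / 5)    # Fisher LSD, 5 obs per color
color_means <- tapply(P2$Score, P2$Color, mean)
abs(outer(color_means, color_means, "-")) > lsd         # TRUE where a pair exceeds the LSD

# Expected scores for all 125 (Color, Outfit, Wearer) combinations.
grid <- expand.grid(Color = levels(P2$Color), Outfit = levels(P2$Outfit),
                    Wearer = levels(P2$Wearer))
grid$Pred <- predict(g, newdata = grid)
grid[which.max(grid$Pred), ]
```

Wearer is built as a factor on purpose: left as the integers 1-5 it would enter the model as a single numeric slope with 1 df instead of a 4-df blocking factor.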

3. Agricultural Missing Data (data-set in agr.xls)


A 2*3 agricultural randomized block design with two replications per cell was to
be conducted. Unfortunately, the gardener was lazy in some plots and unexpected
rain ruined others, so only 8 of the planned 12 plants were produced. The weights
of these 8 plants (Y for this problem) are given in the data table below. (Also given
in file agr.xls in the HW6 folder.) The format there is: FERT PLOT REP Y,
so the first entry is F1 A 1 14.3 and the last is F2 C 2 11.6. The file contains 12
entries; for the four missing observations, the Y column of the .xls file holds an
empty cell.

Yields by Fertilizer and Plot

                                  Plots
                     A        |       B        |     C
             |-------------|----------------|------------|
         F1  | 14.3, 12.6  |  11.8, 11.5    | 13.6, ---- |
Fertilizers  |-------------|----------------|------------|
         F2  | ----, ----  |  10.7, ----    | 8.9, 11.6  |
             |-------------|----------------|------------|

Below are shown three ANOVA tables derived from running PROC ANOVA in
SAS, PROC GLM in SAS, and the lm application in R, respectively, on this
data set. (The actual code used for these three programs is shown in the HW6
folder.) Examine the three tables and answer the questions about their contents
on the next page.
Table 1 ANOVA TABLE from SAS's PROC ANOVA
SOURCE        df   Sum of Squares   Mean Square   F-Stat   P-value
Fertilizers    1       10.44300       10.44300     15.05    0.0178
Plots          2        6.61667        3.30833      4.77    0.0873
Error          4        2.77533        0.69383      -X-      -X-
TOTAL          7       19.83500         -X-         -X-      -X-

Table 2 ANOVA TABLE from SAS's PROC GLM (Type III SS)
SOURCE        df   Sum of Squares   Mean Square   F-Stat   P-value
Fertilizers    1        6.16333        6.16333      3.49    0.1349
Plots          2        2.33700        1.16850      0.66    0.5643
Error          4        7.05500        1.76375      -X-      -X-
TOTAL          7       19.83500         -X-         -X-      -X-

Table 3 ANOVA TABLE from R's lm (Sequential SS)
SOURCE        df   Sum of Squares   Mean Square   F-Stat   P-value
Fertilizers    1       10.44300       10.44300      5.92    0.0717
Plots          2        2.33700        1.16850      0.66    0.5643
Error          4        7.05500        1.76375      -X-      -X-
TOTAL          7       19.83500         -X-         -X-      -X-

(a) One of the three ANOVA tables is incorrect by any reasoning. The other two
are correct, depending on what one wants to test. Explain which output is
wrong and under what conditions one would use each of the other two.
(b) If you were asked, based on the data presented above, whether you think that
FERT is a significant factor at the $\alpha = .10$ level, what would you conclude?
Explain.
(c) There are no observations in the (F2,A) cell, but the mean for this cell can be
estimated by both of the legitimate programs. Do they yield the same value?
If so, what is that value? If not, which value would you use as your estimate
for this cell's mean?
(d) The differences between the two correct programs concern how one would want
to apportion sums of squares so as to test factors. Even after that has been
determined, one could still parameterize in many equivalent ways. Suppose
that one wanted to parameterize such that the weight of the k-th replicate in
the cell with the (i-th Fertilizer and j-th Plot), $Y_{ijk}$, could be expressed as:
$Y_{ijk} = \beta_0 + \beta_1 I_{F2,ijk} + \beta_2 I_{A,ijk} + \beta_3 I_{C,ijk} + \epsilon_{ijk}$,
so that the baseline cell is (Fertilizer=F1, Plot=B). In that case, what are the
numerical estimates of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$?
(e) Discuss briefly what errors in conclusions one would have made if one had
used the incorrect ANOVA analysis instead of one of the correct analyses for
this problem.
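For readers who want to reproduce the sequential and adjusted sums of squares in R, the following is a minimal sketch under stated assumptions: the data are transcribed from the Problem 3 table, the additive model Y ~ FERT + PLOT is assumed because the tables show 4 error df, and drop1 is used here as a stand-in for SAS's Type III output (for a model without interaction, adjusting each factor for the other should give the same sums of squares). The object names are illustrative, and the actual course code in the HW6 folder may differ.

```r
# Data transcribed from the Problem 3 table; NA marks the four missing plants.
P3 <- data.frame(
  FERT = factor(rep(c("F1", "F2"), each = 6)),
  PLOT = factor(rep(rep(c("A", "B", "C"), each = 2), times = 2)),
  REP  = rep(1:2, times = 6),
  Y    = c(14.3, 12.6, 11.8, 11.5, 13.6, NA,
           NA,   NA,   10.7, NA,   8.9,  11.6)
)
# Make (F1, B) the baseline cell, matching the indicator parameterization in (d).
P3$PLOT <- relevel(P3$PLOT, ref = "B")

fit <- lm(Y ~ FERT + PLOT, data = P3)   # rows with missing Y are dropped automatically

seq_tab <- anova(fit)                   # sequential SS: FERT first, then PLOT given FERT
adj_tab <- drop1(fit, test = "F")       # each factor adjusted for the other
seq_tab
adj_tab

coef(fit)                               # intercept plus the F2, A, and C indicator terms

# Estimated mean for the empty (F2, A) cell under the additive model.
predict(fit, newdata = data.frame(FERT = "F2", PLOT = "A"))
```

Swapping the order of the terms (Y ~ PLOT + FERT) changes the sequential table but not the adjusted one, which is a quick way to see what each kind of sum of squares is testing.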
