Reeves
HW #6 Higher-Way ANOVA
Lab Fri. Nov. 14, 2014
Due Fri. Nov. 21, 2014
HW 6 - 2-Way ANOVA & Latin Square Designs
Answer the questions following the usual homework rules. Explain what your output
means and attach it, as described earlier this semester. These problems will be reviewed
briefly in lab on Monday 11/17/14.
The first problem is a balanced 2-factor design with 4 observations per cell, while
the second is a Latin Square design with $t = 5$ (that is, $t^2 = 25$ of the possible $t^3 = 125$
combinations are observed in the design). The third is an unbalanced 2-factor design.
PROC ANOVA (in SAS) or the aov function (in R) can be used for any 1-way
ANOVA (balanced or not) and for balanced higher-dimensional designs. Since Latin
Squares are balanced (although incomplete) designs, this means that PROC ANOVA (or
the aov function) can be used to analyze problems 1 and 2 of this homework, but
PROC GLM (in SAS) or the lm function (in R) will be needed to correctly analyze
Problem #3. The data for problems 1-3 are not text files, but Excel files, since these are
common in practice. For your convenience, both a regular .xls file and a .csv version of
each file are stored on the S: drive and on eLC, with the .xls format being more convenient
for those who use SAS and the .csv format more useful for those using R. The file names
for the three problems are: uwisc.xls, color.xls, and agr.xls, respectively. The first
file has no header line, while the other two do have headers. The third file contains some
missing data (blank cells), although the analysis of the problem will be the same whether
one reads those lines in as they are listed (blank for the response variable) or deletes the
lines from the data set. Relatively little of each problem's point value is derived from
running code to perform analyses. Much more of the point value concerns your ability
to explain clearly what you've done and what it means.
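For R users, reading a .csv version might look roughly like the sketch below. It is only an illustration: the inline text stands in for a headerless file such as uwisc.csv, and the file name, the column order, and the column names Gender, Living, and GPA are assumptions, not taken from the actual files. The last row, with a blank response, is made up purely to show how a blank cell is read.

```r
# Inline text standing in for a headerless CSV file (layout is assumed).
# With a real file, replace `text = txt` by the path to the file.
txt <- "Male,Dorm,2.56
Male,Dorm,2.70
Female,House,3.61
Female,House,"

# header = FALSE because the file has no header line; names are supplied by hand
P1 <- read.csv(text = txt, header = FALSE,
               col.names = c("Gender", "Living", "GPA"),
               stringsAsFactors = TRUE)

str(P1)                    # Gender and Living are factors, GPA is numeric
sum(is.na(P1$GPA))         # a blank cell in a numeric column is read as NA

# Deleting the incomplete rows by hand leaves the same rows that
# lm() keeps by default, since lm() drops rows with a missing response.
complete <- na.omit(P1)
nrow(complete)
```

For a file that does have a header line, the default header = TRUE applies and col.names can be dropped.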
GENDER| DORM      | FRAT/SOR.    | CO-OP      | APT.       | HOUSE
------|-----------|--------------|------------|------------|-----------
      |2.56 2.70  | 3.02 2.67    | 3.41 2.97  | 2.28 3.73  | 2.57 2.81
MALE  |3.04 1.42  | 2.35 1.80    | 2.23 2.31  | 2.05 2.61  | 3.01 3.40
------|-----------|--------------|------------|------------|-----------
      |3.33 1.80  | 3.87 3.00    | 3.14 4.00  | 3.69 2.55  | 3.09 1.99
FEMALE|2.50 3.04  | 3.25 2.76    | 2.66 2.91  | 3.21 2.86  | 3.46 3.61
------|-----------|--------------|------------|------------|-----------
(a) Give the complete statistical model for this study, including both main effects
and interaction terms.
(b) Use appropriate software to obtain the ANOVA table for the model in (a).
(c) Test for interactions and, if appropriate, for main effects (use $\alpha = .10$).
(d) Does the blocking factor (Gender) seem necessary?
(e) Use the software to calculate estimates of the grand mean, the 5 Living Situation
effects, the 2 Gender effects, and the RMSE of the model in (a).
(f) Which of the 10 interaction terms is the most negative? Does an interaction
of this magnitude seem very important, given the RMSE of the model?
(g) A randomly chosen (Female, Co-Op) Junior UWisc student (not in the original
sample) has a GPA of 2.58. Is this a surprising result or not? (Explain your
reasoning.)
(h) The university has some control over Dorms, Fraternities/Sororities, and
Co-ops as living situations, while it has no control at all over Apartments or
Houses. It wishes to know if there is any significant difference between the
average GPA scored by Juniors living in the former group (Dorm, F/S, Co-op)
as opposed to the latter group (Apt. & House). Assume for the purposes of
this problem (although it's not quite true) that each of the 10 cells in the table
above represents exactly 10% of the 20-year-old students at the University of
Wisconsin. [This question can be answered using the CONTRAST options
within SAS or R (with the former clearer than the latter), or one can use
the output and the formula from Section 1 of Unit 8b to obtain a C.I. for this
linear combination of cell means.]
(i) Re-run your program omitting the interaction term from your model and test
(at the $\alpha = .10$ level) for the effects of both Gender and Living Situation.
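As one possible way to set up the computations in R, the sketch below enters the 40 GPAs from the table by hand and fits the models with and without interaction. The object and column names (P1, GPA, Gender, Living) and the level labels are choices made here for illustration, and the contrast at the end uses the generic standard error for a linear combination of cell means, which is assumed (not confirmed) to match the Unit 8b formula.

```r
# GPAs from the Problem 1 table, entered cell by cell (4 per cell).
living <- c("Dorm", "FratSor", "CoOp", "Apt", "House")
male   <- c(2.56, 2.70, 3.04, 1.42,    # Dorm
            3.02, 2.67, 2.35, 1.80,    # Frat/Sor.
            3.41, 2.97, 2.23, 2.31,    # Co-op
            2.28, 3.73, 2.05, 2.61,    # Apt.
            2.57, 2.81, 3.01, 3.40)    # House
female <- c(3.33, 1.80, 2.50, 3.04,    # Dorm
            3.87, 3.00, 3.25, 2.76,    # Frat/Sor.
            3.14, 4.00, 2.66, 2.91,    # Co-op
            3.69, 2.55, 3.21, 2.86,    # Apt.
            3.09, 1.99, 3.46, 3.61)    # House
P1 <- data.frame(
  GPA    = c(male, female),
  Gender = factor(rep(c("Male", "Female"), each = 20)),
  Living = factor(rep(rep(living, each = 4), times = 2), levels = living)
)

# Full model with interaction
fit <- aov(GPA ~ Gender * Living, data = P1)
summary(fit)
model.tables(fit, type = "means")     # grand mean, marginal means, cell means
model.tables(fit, type = "effects")   # deviations from the grand mean
rmse <- sqrt(deviance(fit) / df.residual(fit))

# Additive model, interaction omitted
fit_add <- aov(GPA ~ Gender + Living, data = P1)
summary(fit_add)

# Contrast: mean of the 6 (Dorm, F/S, Co-op) cells minus mean of the
# 4 (Apt., House) cells, with every cell weighted equally within its group.
cell_means <- tapply(P1$GPA, list(P1$Gender, P1$Living), mean)   # 2 x 5 matrix
w <- c(Dorm = 1/6, FratSor = 1/6, CoOp = 1/6, Apt = -1/4, House = -1/4)
W <- rbind(w, w)                        # same weights for both genders
est  <- sum(W * cell_means)
se   <- rmse * sqrt(sum(W^2) / 4)       # 4 observations in every cell
ci90 <- est + c(-1, 1) * qt(0.95, df.residual(fit)) * se
```

The output still has to be interpreted in words; the code only produces the numbers.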
OUTFIT
----------------------
E (= Evening Gown)
H (= Halter & Hot Pants)
S (= Shirt & Slacks)
B (= Blouse & Skirt)
J (= Jeans & T-shirt)
----------------------
(a) Read the data in and analyze it using commands similar to those shown below.
Use this output to answer parts (b)-(i). The data set, in a slightly more useful
format, is shown at the end of the next page and in color.xls.
Using SAS:
    PROC ANOVA;
      CLASS Color Outfit Wearer;
      MODEL Score = Color Outfit Wearer;
      MEANS Color Outfit Wearer;
      MEANS Color / LSD TUKEY;
    RUN;
Using R:
    # Wearer is coded 1-5; convert it to a factor so it is not fit as a numeric slope
    P2$Wearer <- factor(P2$Wearer)
    g <- aov(Score ~ Color + Outfit + Wearer, data = P2)
    summary(g)
    g1 <- lm(Score ~ Color + Outfit + Wearer, data = P2)
    summary(g1)
(b) Which factor was of most interest to the persons doing this study?
(c) Which blocking factor had the most effect on score? Why?
(d) Does there appear to be a significant difference between the Colors with respect
to scores assigned? (Choose best answer):
(i) YES; bright colors are preferred to dark colors.
(ii) YES, but directions of difference can't be determined from ANOVA table.
(iii) NO; there is no significant evidence of difference between Colors.
(e) If one parameterizes the model in (grand mean, deviations) format:
$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \epsilon_{ijk},$$
where the $\alpha$s, $\beta$s, and $\gamma$s each sum to zero over their respective levels, what
are the numerical estimates of:
grand mean = _________
deviation due to Jeans & T-shirt outfit = _______
deviation due to Wearer #2 = _______
deviation due to Blue color = _________
(f) In two different situations, you will attempt to predict the score which would
be observed if a particular (Outfit, Wearer, Color) combination were displayed.
Also, attempt to give a standard error for your prediction. In the first case,
assume that the Outfit, Wearer, and Color are each chosen at random from the
five available, and that this information is unknown to you. In the second case,
assume that you are told in advance that the randomly chosen combination
is Wearer #2, in Jeans & T-Shirt of Blue color. Fill in the four blanks below
with numerical values.
Case  Known Information                 Predicted Score   Standard Error
----  --------------------------------  ---------------   ---------------
1     None                              _______________   _______________
2     Wearer #2, Jeans & T-Shirt, Blue  _______________   _______________
(g) The highest observed score was 93 (for Outfit=E, Wearer=4, Color=Bk). According
to the estimated parameters of this model, which of the 125 possible
combinations of (Outfit, Wearer, Color) would yield the highest expected score
(and what is it)? Of the 25 combinations which were actually used in the
Latin Square Design, is the observed maximum (E,4,Bk) the combination
which would have been expected to be the highest?
(h) Of the 10 possible paired comparisons between the different colors (Bu-Bk,
Bu-G, ... , R-W):
How many are significantly different at $\alpha = .05$ by the LSD criterion?
How many are significantly different at $\alpha = .05$ by the Tukey criterion?
(i) Urban legend suggests that men will pick Blue significantly more often than
other colors when a woman asks a man what color outfit he thinks she
should wear. Based on the above analysis, is there any truth to this idea?
Wearer  Outfit  Color  Score
------  ------  -----  -----
1       E       R      64
1       H       W      90
1       S       Bu     80
1       B       Bk     55
1       J       G      50
2       E       G      75
2       H       R      71
2       S       W      42
2       B       Bu     73
2       J       Bk     67
3       E       W      71
3       H       Bu     85
3       S       Bk     86
3       B       G      54
3       J       R      40
4       E       Bk     93
4       H       G      65
4       S       R      50
4       B       W      37
4       J       Bu     81
5       E       Bu     86
5       H       Bk     87
5       S       G      64
5       B       R      41
5       J       W      44
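The 25 observations for this problem can also be typed directly into R instead of being read from color.csv. The sketch below does that, checks the Latin Square property, and fits the additive model. The name P2 follows the R commands given in part (a), while grand_mean and color_dev are names introduced here. TukeyHSD is used for the Tukey comparisons; base R has no direct equivalent of SAS's LSD option, so that part is left out.

```r
# Problem 2 data: 5 wearers x 5 outfits, one color per (wearer, outfit) pair
P2 <- data.frame(
  Wearer = factor(rep(1:5, each = 5)),
  Outfit = factor(rep(c("E", "H", "S", "B", "J"), times = 5)),
  Color  = factor(c("R",  "W",  "Bu", "Bk", "G",     # Wearer 1
                    "G",  "R",  "W",  "Bu", "Bk",    # Wearer 2
                    "W",  "Bu", "Bk", "G",  "R",     # Wearer 3
                    "Bk", "G",  "R",  "W",  "Bu",    # Wearer 4
                    "Bu", "Bk", "G",  "R",  "W")),   # Wearer 5
  Score  = c(64, 90, 80, 55, 50,
             75, 71, 42, 73, 67,
             71, 85, 86, 54, 40,
             93, 65, 50, 37, 81,
             86, 87, 64, 41, 44)
)

# Latin Square check: every color meets every outfit and every wearer exactly once
stopifnot(all(table(P2$Color, P2$Outfit) == 1),
          all(table(P2$Color, P2$Wearer) == 1))

g <- aov(Score ~ Color + Outfit + Wearer, data = P2)
summary(g)
TukeyHSD(g, "Color")          # Tukey intervals for the 10 color pairs

# (grand mean, deviations) parameterization via sum-to-zero contrasts
g1 <- lm(Score ~ Color + Outfit + Wearer, data = P2,
         contrasts = list(Color = "contr.sum", Outfit = "contr.sum",
                          Wearer = "contr.sum"))
grand_mean <- coef(g1)[["(Intercept)"]]

# In a Latin Square the estimated deviations equal marginal mean minus grand mean
color_dev <- tapply(P2$Score, P2$Color, mean) - mean(P2$Score)
```

With contr.sum, lm reports only the first four deviations of each factor; the fifth is minus their sum, which is why the marginal-mean version is often easier to read.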
[Data table: weights by Fertilizer (F1, F2) and Plot (A, B, C); see agr.xls]
Below are shown three ANOVA tables derived from running PROC ANOVA in
SAS, PROC GLM in SAS, and the lm function in R, respectively, on this
data set. (The actual code used for these three programs is shown in the HW6
folder.) Examine the three tables and answer the questions about their contents
on the next page.
Table 1 ANOVA TABLE from SAS's PROC ANOVA
SOURCE        df   Sum of Squares   Mean Square   F-Stat   P-value
Fertilizers    1         10.44300      10.44300    15.05    0.0178
Plots          2          6.61667       3.30833     4.77    0.0873
Error          4          2.77533       0.69383     -X-      -X-
TOTAL          7         19.83500         -X-       -X-      -X-
Table 2 ANOVA TABLE from SAS's PROC GLM (Type III SS)
SOURCE        df   Sum of Squares   Mean Square   F-Stat   P-value
Fertilizers    1          6.16333       6.16333     3.49    0.1349
Plots          2          2.33700       1.16850     0.66    0.5643
Error          4          7.05500       1.76375     -X-      -X-
TOTAL          7         19.83500         -X-       -X-      -X-
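Table 3 ANOVA TABLE from R's lm function
SOURCE        df   Sum of Squares   Mean Square   F-Stat   P-value
Fertilizers    1         10.44300      10.44300     5.92    0.0717
Plots          2          2.33700       1.16850     0.66    0.5643
Error          4          7.05500       1.76375     -X-      -X-
TOTAL          7         19.83500         -X-       -X-      -X-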
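When comparing the tables it can help to recompute a row from its sum of squares and degrees of freedom. The sketch below is one way to do that in R. The helper anova_row is a name made up here, and the numbers are copied from the Fertilizers and Error rows of Tables 1 and 2.

```r
# Rebuild one ANOVA row (mean square, F statistic, p-value) from
# its sum of squares and df plus the error sum of squares and df.
anova_row <- function(ss, df, ss_err, df_err) {
  ms  <- ss / df
  mse <- ss_err / df_err
  f   <- ms / mse
  c(MS = ms, F = f, p = pf(f, df, df_err, lower.tail = FALSE))
}

# Fertilizers row of Table 1 (PROC ANOVA)
t1_fert <- anova_row(ss = 10.44300, df = 1, ss_err = 2.77533, df_err = 4)
round(t1_fert, 4)

# Fertilizers row of Table 2 (PROC GLM, Type III)
t2_fert <- anova_row(ss = 6.16333, df = 1, ss_err = 7.05500, df_err = 4)
round(t2_fert, 4)

# Do the sums of squares add up to the TOTAL of 19.83500?
t1_sum <- 10.44300 + 6.61667 + 2.77533      # Table 1
t2_sum <- 6.16333 + 2 * 1.16850 + 7.05500   # Table 2, Plots SS taken as df x MS
c(Table1 = t1_sum, Table2 = t2_sum)
```

Whether the component sums of squares add to the total is one of the features that distinguishes the different ways of apportioning sums of squares in an unbalanced design.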
(a) One of the three ANOVA tables is incorrect by any reasoning. The other two
are correct, depending on what one wants to test. Explain which output is
wrong and under what conditions one would use each of the other two.
(b) If you were asked, based on the data presented above, whether you think that
FERT is a significant factor at the $\alpha = .10$ level, what would you conclude?
Explain.
(c) There are no observations in the (F2,A) cell, but the mean for this cell can be
estimated by both of the legitimate programs. Do they yield the same value?
If so, what is that value? If not, which value would you use as your estimate
for this cell's mean?
(d) The differences between the two correct programs concern how one would want
to apportion sums of squares so as to test factors. Even after that has been
determined, one could still parameterize in many equivalent ways. Suppose
that one wanted to parameterize such that the weight of the k-th replicate in
the cell with the (i-th Fertilizer and j-th Plot), $Y_{ijk}$, could be expressed as:
$$Y_{ijk} = \beta_0 + \beta_1 I_{F2,ijk} + \beta_2 I_{A,ijk} + \beta_3 I_{C,ijk} + \epsilon_{ijk},$$
so that the baseline cell is (Fertilizer=F1, Plot=B). In that case, what are the
numerical estimates of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$?
(e) Discuss briefly what errors in conclusions one would have made if one had
used the incorrect ANOVA analysis instead of one of the correct analyses for
this problem.
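Under the indicator parameterization in part (d), each cell mean is a sum of coefficients, and the same bookkeeping yields a model-based mean for the empty (F2, A) cell. The display below is a sketch of that bookkeeping. The symbols $\beta_0, \dots, \beta_3$ for the coefficients and $\mu_{ij}$ for the cell means are notation assumed here, with (F1, B) as the baseline cell.

```latex
\begin{aligned}
\mu_{F1,B} &= \beta_0,           & \mu_{F2,B} &= \beta_0 + \beta_1,\\
\mu_{F1,A} &= \beta_0 + \beta_2, & \mu_{F2,A} &= \beta_0 + \beta_1 + \beta_2,\\
\mu_{F1,C} &= \beta_0 + \beta_3, & \mu_{F2,C} &= \beta_0 + \beta_1 + \beta_3.
\end{aligned}
```

Because the model has no interaction term, the (F2, A) mean is determined by the other cells even though nothing was observed there.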