
STA 6166, Section 8489, Fall 2007

Final Exam Part II


Due 13 December 2007

RAMIN SHAMSHIRI
UFID#: 9021-3353 Ramin.sh@ufl.edu
Phone: 352-392-1864 ext:217


A- During a study of the effect of an oil spill on the interstitial marine biota on sandy beaches, a graduate student collected a total of 129 animals in a stretch of beach near Catalina Island in California that had been oiled. For each animal measured, the student recorded its species, length, weight, coordinates of the collection location (where on the beach it was found), and the substrate on which it was found (sand, rock, wood, pebbles, etc.).

1- List the qualitative variables
Answer: The qualitative (categorical) variables are:
   Animal ID (whether numbered 1, ..., 129 or assigned in another format such as a name or a random code)
   Animal species
   Substrate

2- List the quantitative variables


Answer:
   Animal length
   Animal weight
   Coordinates of the collection location, if recorded numerically in the form (X, Y, Z). If the location is instead recorded as North, South, Northwest, etc., it is a categorical variable.

3- The sample consists of the 129 oiled animals collected in a stretch of beach near Catalina Island.

4- Suppose the student plans to test whether the oil spill has decreased the average size (weight and length) of individuals of the most abundant species.
a. Describe the population(s) appropriate for the inference from the test.
Answer:
i. Populations of oiled and not-oiled animals (all species): The first population contains all individuals of the most abundant species living under similar circumstances near Catalina Island that are not oiled; the second population contains all individuals of the most abundant species near Catalina Island that are oiled.
ii. Populations by individual species: The 129 oiled animals are a collection of several species (e.g. fish, birds, etc.), so it is possible to make inferences about the oiled and not-oiled members of one particular species separately from, and independently of, the other species. In this case, species No. i defines two populations, one oiled and one not oiled (i = 1, ..., number of species among the 129 observations).


b. Describe the likely hypotheses to be tested.


Answer:
Note: To answer this question, it is assumed that the 129 observed oiled-animals are different in species and may contain fish, birds, etc. So, Species No.1 for example, refers to the fish group; Species No.2 refers to birds, etc.

i. Testing mean size (two populations): To test whether the mean size (either weight or length) of a particular species of oiled animals is less than the mean size of the same species of not-oiled animals, the hypotheses for each species $i$ are:
   $H_0: \mu_{\text{oiled},i} = \mu_{\text{not-oiled},i}$
   $H_1: \mu_{\text{oiled},i} < \mu_{\text{not-oiled},i}$

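ii. Testing a relationship: To test whether there is a relationship between the level of oiling and the size of the animals (i.e. the more an animal has been oiled, the smaller its size), we construct the model below to relate x and y:
   $y = \beta_0 + \beta_1 x + \varepsilon$
The least-squares estimates of $\beta_1$ and $\beta_0$ are then:
   $\hat{\beta}_1 = S_{xy} / S_{xx} = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$
   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$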

The hypotheses are then:

   Is there a relationship between the oil effect (Y) and animal size (X)?  $H_0: \beta_1 = 0$,  $H_1: \beta_1 \neq 0$
   Is the relationship positive?  $H_0: \beta_1 = 0$,  $H_1: \beta_1 > 0$
   Is the relationship negative?  $H_0: \beta_1 = 0$,  $H_1: \beta_1 < 0$

iii. Testing means (multiple and specific comparisons): To test whether the oil spill has had the same effect on the size of animals of different species. For example, if the 129 observed animals can be categorized into n species, the student can test whether the mean changes in size of the different species are equal. In other words, have all species received the same size impact from the oil spill? The hypotheses can be written as:
   $H_0: \mu_{\text{Species }1} = \mu_{\text{Species }2} = \dots = \mu_{\text{Species }n}$
   $H_1$: At least one of the species has received a different (higher or lower) size effect from the oil spill.

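It is also possible to make this test more specific, to understand which groups have received the highest and lowest size impact from the oil spill. An example hypothesis for two species $i$ and $j$ can be written as:
   $H_0: \mu_{\text{Species }i} = \mu_{\text{Species }j}$
   $H_1: \mu_{\text{Species }i} > \mu_{\text{Species }j}$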

c. Describe the testing procedure that should be used to test the hypotheses you gave. What prior information, if any, is needed to perform this test?
Procedure 1: Testing mean size (hypotheses in b.i)
Here our two populations are: 1- all the animals (regardless of species) that were oiled and 2- all the animals in that area that were not oiled. The difference between the two means is defined by:
   $\delta = \mu_1 - \mu_2$
A sample of size $n_1 = 129$ is randomly selected from the first population, and a sample of size $n_2$ is independently drawn from the second. The difference between the two sample means, $\bar{y}_1 - \bar{y}_2$, is an unbiased point estimate of $\mu_1 - \mu_2$, and its sampling distribution has mean $\mu_1 - \mu_2$. It is important that the populations are normally distributed, or that the sample sizes are large enough for the Central Limit Theorem to apply, so that $\bar{y}_1 - \bar{y}_2$ is also (approximately) normally distributed. Our assumptions are:
   The two samples are independent.
   The distributions of the two populations are normal, or the samples are of such a size that the Central Limit Theorem is applicable.
   The variances of the two populations are equal or can be assumed equal.
We then have one of the following cases:
1- If the population variances are known, the variance of $\bar{y}_1 - \bar{y}_2$ is $\sigma_1^2/n_1 + \sigma_2^2/n_2$, and we can use the statistic below, which has the standard normal distribution ($\delta_0$ is the hypothesized difference, 0 under $H_0$):
   $z = \frac{(\bar{y}_1 - \bar{y}_2) - \delta_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
2- If the population variances are unknown and assumed equal, we use the pooled estimate of the variance and the pooled t-test, which has the t distribution with $n_1 + n_2 - 2$ degrees of freedom:
   $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{SS_1 + SS_2}{n_1 + n_2 - 2}$
   $t = \frac{(\bar{y}_1 - \bar{y}_2) - \delta_0}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}}$
3- If the population variances are unknown and not equal, we first note that inference on means may not be very useful. However, if both $n_1$ and $n_2$ are large (both over 30), the Central Limit Theorem allows us to assume that the difference between the sample means has approximately the normal distribution. For the large-sample case, we can replace $\sigma_1$ and $\sigma_2$ with $s_1$ and $s_2$ without serious loss of accuracy, so the statistic below has approximately the standard normal distribution:
   $z = \frac{(\bar{y}_1 - \bar{y}_2) - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
4- If either sample size is not large, we can compute the same statistic as in case 3. If the data come from approximately normally distributed populations, this statistic has an approximate Student t distribution, but the degrees of freedom cannot be precisely determined. A reasonable approximation is to use the degrees of freedom of the smaller sample; other approximations may also be used.

Procedure 2: Testing the relationship (hypotheses in b.ii)
We want to test whether there is a relationship between the oil level and the size of the animal; in other words, whether the more an animal is oiled, the smaller its size. To test this, we use the correlation test procedure below:
   $\rho$: the population correlation coefficient
   $r$: Pearson's product-moment correlation coefficient (the sample correlation coefficient)
   $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$, $S_{xx} = \sum (x - \bar{x})^2$, $S_{yy} = \sum (y - \bar{y})^2$

$r^2$: the coefficient of determination, a measure of the relative strength of the corresponding regression. It describes the effectiveness of the linear regression model:
   $r^2 = SSR / TSS$
F: the F statistic from the analysis of variance test of the hypothesis $\beta_1 = 0$:
   $F = \frac{(n - 2)\, r^2}{1 - r^2}$
Large values of $|r|$ produce large values of F, and both imply a strong linear relationship. If the F-value from this test leads to a P-value smaller than our significance level, we reject the null hypothesis $H_0: \beta_1 = 0$ and conclude that there is enough evidence of a linear relationship between oil level and animal size.
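The correlation procedure above can be sketched in a few lines of Python. This is only an illustration: the `oil_level` and `size` values are made-up numbers, not data from the study, and the function names are hypothetical. It computes Pearson's r from the sums of squares and then the F statistic from the relation $F = (n-2) r^2 / (1 - r^2)$.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def f_from_r(r, n):
    """F statistic for H0: beta_1 = 0 in simple linear regression."""
    return (n - 2) * r ** 2 / (1 - r ** 2)

# Hypothetical illustration data (NOT from the study).
oil_level = [1, 2, 3, 4, 5]
size = [5, 4, 3, 4, 2]

r = pearson_r(oil_level, size)
f_value = f_from_r(r, len(oil_level))
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, F = {f_value:.2f}")
```

The F-value would then be compared with an F distribution with 1 and n - 2 degrees of freedom to obtain the P-value.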


Procedure 3: ANOVA (hypotheses in b.iii)
Multiple mean comparison with the hypotheses below can be done with one-way ANOVA:
   $H_0: \mu_{\text{Species }1} = \mu_{\text{Species }2} = \dots = \mu_{\text{Species }n}$
   $H_1$: At least one of the species has received a different (higher or lower) size effect from the oil spill.
Assumptions for the F test comparing three or more means:
1- The populations from which the samples were obtained must be normally or approximately normally distributed.
2- The samples must be independent.
3- The variances of the populations must be equal.
Finding the F-test value for the analysis of variance:
Step 1- Find the mean and variance of each sample: $(\bar{X}_1, s_1^2), (\bar{X}_2, s_2^2), \dots, (\bar{X}_k, s_k^2)$

Step 2- Find the grand mean: $\bar{X}_{GM} = \sum X / N$

Step 3- Find the between-group variance (the variance of the means): $s_B^2 = \frac{\sum n_i (\bar{X}_i - \bar{X}_{GM})^2}{k - 1}$
Step 4- Find the within-group variance, which is computed from all the data and is not affected by differences in the means: $s_W^2 = \frac{\sum (n_i - 1) s_i^2}{\sum (n_i - 1)}$
Step 5- Find the F-test value: $F = s_B^2 / s_W^2$

Degrees of freedom for the numerator: k - 1 (number of groups - 1)
Degrees of freedom for the denominator: N - k (sum of the sample sizes of the groups - number of groups), where $N = n_1 + n_2 + \dots + n_k$
For this test, we do not need equal sample sizes. The F test for comparing means is always right-tailed. If there is no difference in the means, the between-group variance estimate will be approximately equal to the within-group variance estimate, the F-test value will be approximately 1, and the null hypothesis will not be rejected. If the means differ significantly, the between-group variance will be much larger than the within-group variance, so the F-test value will be significantly greater than 1 and the null hypothesis will be rejected.


For specific comparisons, we can use the Scheffe test and the Tukey test. To conduct the Scheffe test, one compares the means two at a time, using all possible combinations of means ($\bar{X}_1$ vs $\bar{X}_2$, $\bar{X}_1$ vs $\bar{X}_3$, $\bar{X}_2$ vs $\bar{X}_3$, and so on). Formula for the Scheffe test:
   $F_S = \frac{(\bar{X}_i - \bar{X}_j)^2}{s_W^2 \, [\, 1/n_i + 1/n_j \,]}$

where $\bar{X}_i$ and $\bar{X}_j$ are the means of the samples being compared, $n_i$ and $n_j$ are the respective sample sizes, and $s_W^2$ is the within-group variance. To find the critical value for the Scheffe test, multiply the critical value for the F test by k - 1:
   $F' = (k - 1)(\text{critical value})$
There is a significant difference between the two means being compared when $F_S$ is greater than $F'$.

The Tukey test can also be used after the analysis of variance has been completed to make pairwise comparisons between the groups when the groups have the same sample size. The symbol for the test value in the Tukey test is q:
   $q = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{s_W^2 / n}}$
where $\bar{X}_i$ and $\bar{X}_j$ are the means of the samples being compared, n is the size of the samples, and $s_W^2$ is the within-group variance. When the absolute value of q is greater than the critical value for the Tukey test, there is a significant difference between the two means being compared.
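As a rough illustration of the two post-hoc formulas above, the sketch below evaluates the Scheffe and Tukey test values for one pair of groups. All numbers are hypothetical (two group means of 10 and 14, n = 5 per group, within-group variance 4, k = 3 groups); the value 3.89 is the tabled F critical value for alpha = 0.05 with 2 and 12 degrees of freedom, and the function names are illustrative only.

```python
import math

def scheffe_fs(mean_i, mean_j, n_i, n_j, s_w2):
    """Scheffe test value for one pair of group means."""
    return (mean_i - mean_j) ** 2 / (s_w2 * (1 / n_i + 1 / n_j))

def scheffe_critical(f_critical, k):
    """Scheffe critical value: (k - 1) times the ANOVA F critical value."""
    return (k - 1) * f_critical

def tukey_q(mean_i, mean_j, n, s_w2):
    """Tukey test value for two groups with a common sample size n."""
    return (mean_i - mean_j) / math.sqrt(s_w2 / n)

# Hypothetical summary statistics (NOT from the study).
fs = scheffe_fs(10.0, 14.0, 5, 5, 4.0)
f_prime = scheffe_critical(3.89, 3)   # 3.89: F table, alpha = 0.05, df = (2, 12)
q = tukey_q(10.0, 14.0, 5, 4.0)

print(f"F_S = {fs:.2f}, F' = {f_prime:.2f}, significant: {fs > f_prime}")
print(f"q = {q:.3f}")
```

With these made-up inputs the Scheffe value exceeds its critical value, so the pair would be declared different; the Tukey q would be compared against a studentized-range table in the same way.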


B- Suppose a graduate student in your department shows you the following matrix of Pearson correlation coefficients for four variables:
         X1          X2          X3          X4
X1     1.000
X2     0.83343     1.0000
X3    -0.87627     0.77677     1.0000
X4     0.09951     0.47300    -0.17368     1.00000

1. Which correlation coefficients in the matrix imply that the two variables are highly correlated?
Answer: Based on notes a) to d) below, my answer to this question is summarized in the following table:

Pair        r           Interpretation
X1, X2      0.83343     Strong positive relation (highly correlated)
X1, X3     -0.87627     Strong negative relation (highly correlated)
X2, X3      0.77677     Strong positive relation (highly correlated)
X1, X4      0.09951     Weak positive relation (very low correlation)
X2, X4      0.47300     Medium positive relation (medium correlation)
X3, X4     -0.17368     Weak negative relation (low correlation)
Xi, Xi      1           Perfect (meaningless)

Table 1

a) The Pearson correlation coefficient, r, is a quantitative assessment of the strength and direction of a linear relationship between two variables. The assumption is that if a relationship exists, it is linear (Pearson's r is valid for linear relationships only). The stronger the relationship, the closer |r| is to 1; the weaker the relationship, the closer r is to 0. If the relationship is perfect (every point falls exactly on a straight line), r = +1 or -1 depending on the sign of the slope. If the variables are independent then the correlation is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables.

b) If the relationship is positive (slope > 0), r > 0, and if the relationship is negative (slope < 0), r < 0. If there is no linear relationship at all (slope = 0), r = 0. However, the size of r does not depend on the size of the slope.

c) Interpretation of the size of a correlation: The interpretation of a correlation coefficient depends on the context and purpose. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors. Several authors have offered guidelines for the interpretation of a correlation coefficient. Cohen (1988) [1] has suggested the following interpretations for correlations in psychological research:

Correlation    Negative            Positive
Small          -0.29 to -0.10      0.10 to 0.29
Medium         -0.49 to -0.30      0.30 to 0.49
Large          -1.00 to -0.50      0.50 to 1.00

Table 2

d) Any variable has a perfect relationship with itself, and it is meaningless since it is obvious. In the table, the correlation value for X1 and X1 is 1.0, which can be inferred as a meaningless perfect relation.
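The bands in Table 2 can be applied mechanically to the off-diagonal entries of the correlation matrix. The following is a minimal sketch: the function name `cohen_label` and the label "below small" (for |r| under 0.10, which Table 2 does not name) are assumptions made for this illustration, and values falling in the small gaps between bands (e.g. 0.295) are assigned to the lower band.

```python
# Off-diagonal Pearson correlations from the matrix in the question.
correlations = {
    ("X1", "X2"): 0.83343,
    ("X1", "X3"): -0.87627,
    ("X2", "X3"): 0.77677,
    ("X1", "X4"): 0.09951,
    ("X2", "X4"): 0.47300,
    ("X3", "X4"): -0.17368,
}

def cohen_label(r):
    """Classify r by Cohen's (1988) bands; 'below small' is our own label."""
    size = abs(r)
    if size >= 0.50:
        strength = "large"
    elif size >= 0.30:
        strength = "medium"
    elif size >= 0.10:
        strength = "small"
    else:
        strength = "below small"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

labels = {pair: cohen_label(r) for pair, r in correlations.items()}
for pair, label in labels.items():
    print(pair, correlations[pair], label)
```

Under these bands only the three pairs among X1, X2 and X3 fall in the "large" category.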


2. Suppose you are told that X2 is a purely categorical variable that was coded as 1,2,3, or 4 (rather than names). Is the Pearson correlation coefficient appropriate to look at the strength of the relationship between X2 and other variables? Explain.
Answer: No. If X2 is a categorical variable, the Pearson correlation coefficient is not appropriate for measuring the strength of the relationship between X2 and any of the other variables. Pearson's r is only valid for relationships between two quantitative variables, and other measures should be used when one or both variables are categorical. If a computer program is used without declaring X2 as categorical, the codes 1, 2, 3 and 4 will be treated as quantitative data, and the resulting correlation can be misleading.
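A small sketch shows why the numeric codes are misleading: the same categorical data give a different Pearson r as soon as the arbitrary codes are reassigned. The categories ("a" to "d"), the response values and both coding schemes below are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: one observation per category.
categories = ["a", "b", "c", "d"]
response = [10.0, 20.0, 30.0, 40.0]

# Two equally valid ways of coding the same categories as 1..4.
coding_1 = {"a": 1, "b": 2, "c": 3, "d": 4}
coding_2 = {"a": 2, "b": 4, "c": 1, "d": 3}

r_coding_1 = pearson_r([coding_1[c] for c in categories], response)
r_coding_2 = pearson_r([coding_2[c] for c in categories], response)
print(f"r with coding 1: {r_coding_1:.3f}")
print(f"r with coding 2: {r_coding_2:.3f}")
```

Because the labels carry no order or spacing, neither value says anything about the strength of the association; r simply reflects whichever coding happened to be chosen.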


C- The following experiment on reproductive fitness in ospreys was conducted back in 1970-1980. Review the description of the experiment and then answer the following questions.

1. Suppose location was expected to have an effect on reproductive fitness but was not of direct interest to the researcher. Should s/he simply ignore the location aspect in the analysis and use CRD with Year as the factor of interest? Explain.
Answer: Ignoring the location effect, the data are arranged as below:

Year   Fitness index (10 observations)                                        Mean      SD      Var
1970   3.53  4.27  3.82  3.28  5.12  2.85  2.6   2.42  2.76  2.18            3.283     0.918   0.843
1976   12.32 13.18 9.03  18.67 13.91 13.88 16.42 8.92  6.95  10.49           12.377    3.611   13.04
1982   36.49 29.06 19.12 30.39 23.98 21.69 31.15 28.01 16.5  19.72           25.611    6.389   40.82

With the following hypotheses:
   $H_0: \mu_{1970} = \mu_{1976} = \mu_{1982}$
   $H_1$: At least one of the above is not equal
   $\alpha$ = any reasonable level (0.05 or 0.01)
Using one-way ANOVA for this hypothesis test, we need the assumptions below:
1- The populations from which the samples were obtained must be normally or approximately normally distributed.
2- The samples must be independent.
3- The variances of the populations must be equal.
Using the Levene test for homogeneity of variance, we get an F-value of 13.03, which leads to a p-value less than 0.0001; thus we conclude that the variances of the populations are not equal. The variance column of the table above confirms this result. Since at least one of the assumptions of one-way ANOVA is not met here, we probably cannot obtain a trustworthy result from this test. A one-way ANOVA of this hypothesis gives:
   Test F-value = 69.13
   Test P-value < 0.0001
   Critical F-value = 3.53
The test F-value is larger than the critical F-value (a very small P-value, less than any reasonable significance level $\alpha$), so we reject the null hypothesis and conclude that at least one of the years differs in its mean. This result does not take the location effect into account.
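As a cross-check of the one-way ANOVA F-value quoted above, the following sketch recomputes it directly from the 30 observations, ignoring location. It follows the usual between-group / within-group variance steps; the function and variable names are illustrative only, and the P-value is not computed here.

```python
# Fitness index by year, location ignored (values from the table above).
index_by_year = {
    1970: [3.53, 4.27, 3.82, 3.28, 5.12, 2.85, 2.6, 2.42, 2.76, 2.18],
    1976: [12.32, 13.18, 9.03, 18.67, 13.91, 13.88, 16.42, 8.92, 6.95, 10.49],
    1982: [36.49, 29.06, 19.12, 30.39, 23.98, 21.69, 31.15, 28.01, 16.5, 19.72],
}

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def one_way_anova(groups):
    """Return (F, numerator df, denominator df) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group variance: spread of the group means around the grand mean.
    s_b2 = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    # Within-group variance: pooled variance of the individual groups.
    s_w2 = sum((len(g) - 1) * sample_variance(g) for g in groups) / (n_total - k)
    return s_b2 / s_w2, k - 1, n_total - k

f_value, df_num, df_den = one_way_anova(list(index_by_year.values()))
print(f"F = {f_value:.2f} with df = ({df_num}, {df_den})")
```

The result should agree closely with the F-value of 69.13 reported above, on 2 and 27 degrees of freedom.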


Considering the location effect, we first need to know whether location had any effect on the data observed within the same year. The data and hypotheses can be written as below:

1970
Location   Observations                        Mean      Var        F-value   P-value   F-crit
GAR        2.85  2.6   2.42  2.76  2.18        2.562     0.0724     17.41     0.0031    5.3176
MAS        3.53  4.27  3.82  3.28  5.12        4.004     0.52473

1976
Location   Observations                        Mean      Var        F-value   P-value   F-crit
GAR        13.88 16.42 8.92  6.95  10.49       11.332    14.527     0.82      0.391     5.317
MAS        12.32 13.18 9.03  18.67 13.91       13.422    12.085

1982
Location   Observations                        Mean      Var        F-value   P-value   F-crit
GAR        21.69 31.15 28.01 16.5  19.72       23.414    36.3475    1.209     0.303     5.317
MAS        36.49 29.06 19.12 30.39 23.98       27.808    43.4365

   $H_0: \mu_{MAS1970} = \mu_{GAR1970}$,  $H_1: \mu_{MAS1970} \neq \mu_{GAR1970}$.  Result: P-value = 0.0031 => reject $H_0$
   $H_0: \mu_{MAS1976} = \mu_{GAR1976}$,  $H_1: \mu_{MAS1976} \neq \mu_{GAR1976}$.  Result: P-value = 0.391 => fail to reject $H_0$
   $H_0: \mu_{MAS1982} = \mu_{GAR1982}$,  $H_1: \mu_{MAS1982} \neq \mu_{GAR1982}$.  Result: P-value = 0.303 => fail to reject $H_0$

Based on the F-values and P-values, location had a significant effect only in the first year of data collection (1970). For the other years (1976 and 1982), location did not have a significant effect. Since location can affect reproductive fitness, the researcher should not ignore the location aspect and use a CRD with year as the only factor of interest: as shown here, that method will not reveal the true effects of both year and location on reproductive fitness. The researcher should instead use an RCBD, treating this problem as a block design in which the blocks have more than t experimental units. This provides control over the effect of the two locations.


2. Review the attached output and choose the most appropriate analysis for this data.

(There are four different ANOVA in the output.) Explain your choice, including specifically what aspects of the analyses led to your decision and why the other analyses were inappropriate. At a minimum, you should discuss the intentions of the scientist and the assumptions of the alternative models.
Answer: Reviewing the four different outputs, I would choose the fourth one, for the reasons given below for each analysis:
1- One-way ANOVA on index with year
The assumptions for this test are that the error terms are independent and normally distributed with constant variance. This one-way ANOVA tests the hypotheses below:
   $H_0: \mu_{1970} = \mu_{1976} = \mu_{1982}$
   $H_1$: At least one of the above is not equal
The assumption of homogeneity of variance is not met here: according to the output, the F-value from Levene's test is 13.03 with 2 degrees of freedom, leading to a p-value smaller than any reasonable significance level, so we reject the null hypothesis of equal variances ($H_0: \sigma^2_{1970} = \sigma^2_{1976} = \sigma^2_{1982}$).

Since the assumption of homogeneous variance is not met, it is not appropriate to use one-way ANOVA. Moreover, as mentioned in the answer to the previous question, this method does not account for the location effect. Setting these issues aside, this test leads to the following results, which reject the null hypothesis of equal mean fitness index across years (reject $H_0: \mu_{1970} = \mu_{1976} = \mu_{1982}$).


2- RCBD on index with location as block
This method can account for the effect of location on the fitness index, but we need to check whether its assumptions are met. The assumptions for RCBD are that the blocks are selected independently, the treatments are randomly assigned to the experimental units within a block, the variances are homogeneous across treatments, and each population is approximately normally distributed. According to the output, the assumption of approximate normality is met. The Shapiro-Wilk and Kolmogorov tests, for example, have p-values of 0.69 and 0.11 respectively, neither of which rejects the null hypothesis of a normal distribution. The Q-Q plot and box plot show the same result.


Checking the assumption of homogeneity of variance from the plots of residuals against treatments, we can see that the distribution of the residuals of the model between years is not homogeneous, indicating that the assumption of homogeneous variance between treatments is not met.

The hypothesis of homogeneity of variance is also rejected by Levene's test, which has an F-value of 2.70, leading to a P-value of 0.045 < 0.05.


Since the assumptions of RCBD are not met, it is not appropriate to use its results, which are shown below:

3- RCBD on Log10(index)

Because of the unequal variances among factor levels, it may be useful to perform the analysis on transformed values of the observations, which may satisfy the assumption of equal variances. If $\sigma$ is proportional to the mean, we can use the logarithm of the $y_{ij}$. Checking the assumption of normality, the Shapiro-Wilk and Kolmogorov tests both have large P-values, which do not reject the null hypothesis of a normal distribution. The Q-Q plot and box plot also confirm this result graphically.


But we can still see that the variances are not homogeneous according to the uneven distributions of the residuals shown as below:

Since the assumption of homogeneity of variance is not met, the result of this procedure, shown below, cannot be trusted either.


4- RCBD on index - unequal variances for each year
This method provides a more appropriate procedure for making inferences in this problem. The assumption of normality is supported by the Shapiro-Wilk and Kolmogorov P-values, which are both large enough that we fail to reject the null hypothesis of normality. The corresponding Q-Q plot and box plot also show graphically that the populations are normally distributed. The plots of wtresid*Pred and wtresid*year show that the assumption of homogeneity of variance is met. Since all the assumptions of RCBD are met here, the results of this analysis can be trusted more than those of the other three analyses.


3. Based on your decision in (2), state the statistical model you chose. Be sure to identify all terms in the model.
Answer: The model that I have selected is the Randomized Complete Block Design (RCBD), which has the following equation:
   $Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$
$\mu$: the grand mean of all 30 fitness index values observed at the two sites during the 3 experimental years (about 13.76, the average of the three yearly means).
$\tau_i$: the effect due to the ith treatment. Here our treatments are the years. We have three years, so we have $\tau_{1970}$, $\tau_{1976}$ and $\tau_{1982}$.

$\beta_j$: the effect due to the jth block. In this model, our blocks are the two locations, GAR and MAS, so we have $\beta_{GAR}$ and $\beta_{MAS}$.

$\varepsilon_{ij}$: the error term. The error terms are independent observations from an approximately normal distribution with mean 0 and constant variance $\sigma^2$.


4. Given the model you chose, test the hypotheses of interest to the scientist. State the hypotheses being tested. For each set of hypotheses (if there are more than one), give the equation of the test statistic you are using and its distribution. From the output, give the value of the test statistic, the associated degrees of freedom, the p-value for the test, and your conclusion. State the conclusion in terms of the problem under study ("reject the null hypothesis" is NOT sufficient here). If you have multiple hypotheses, also discuss your choice of method for controlling the experiment-wise error rate.
Answer: The main hypothesis that the scientists are testing is whether the ban of DDT led to a recovery of fitness in the osprey. This hypothesis can be written as:
   $H_0: \mu_{\text{after ban}} = \mu_{\text{before ban}}$
   $H_1: \mu_{\text{after ban}} > \mu_{\text{before ban}}$

Other sets of hypotheses that the scientists are interested in testing are:
   $H_0: \mu_{1970} \geq \mu_{1976}$,  $H_1: \mu_{1970} < \mu_{1976}$ (Claim)
   $H_0: \mu_{1976} \geq \mu_{1982}$,  $H_1: \mu_{1976} < \mu_{1982}$ (Claim)
   $H_0: \mu_{1970} \geq \mu_{1982}$,  $H_1: \mu_{1970} < \mu_{1982}$ (Claim)
   $H_0: \mu_{1970} = \mu_{1976} = \mu_{1982}$,  $H_1$: At least one of the above is not equal
Using the ANOVA test for RCBD, we have the table of results below:

The F statistic has an F distribution with t - 1 degrees of freedom for the numerator and (t - 1)(b - 1) degrees of freedom for the denominator, where t is the number of treatments and b is the number of blocks. From the SAS outputs, we have:

The F-value is 92.57, leading to a P-value less than 0.0001, which rejects the null hypothesis of equal means between years. The numerator degrees of freedom are 2 and the denominator degrees of freedom are 11.7. Using the Tukey test to find out where the differences lie, we have the following hypotheses.


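$H_0: \mu_{1970} = \mu_{1976}$,  $H_1: \mu_{1970} \neq \mu_{1976}$ (Claim)

$H_0: \mu_{1976} = \mu_{1982}$,  $H_1: \mu_{1976} \neq \mu_{1982}$ (Claim)

$H_0: \mu_{1970} = \mu_{1982}$,  $H_1: \mu_{1970} \neq \mu_{1982}$ (Claim)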

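Testing these hypotheses with the Tukey test, we have the following result from SAS:

The test value for the Tukey test is:
   $q = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{s_W^2 / n}}$

where $\bar{X}_i$ and $\bar{X}_j$ are the means being compared, $s_W^2$ is the within-group variance, and n is the sample size for each treatment.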

Conclusion:
Considering the p-values from the SAS output table of our analyses, we reject the null hypothesis $H_0: \mu_{1970} = \mu_{1976} = \mu_{1982}$ and conclude that there is sufficient evidence that the mean fitness index differs among the three years. In terms of the problem, the results support the conclusion that the ban of DDT has led to a recovery of osprey fitness since 1972.


D- Do blood types of people tend to vary among states? Or stated another way, are state and blood type independent? The data for testing this hypothesis are given below. There are four blood types and three states; frequency is the number of observations in that row's combination of state and blood type. Perform the analysis and state your conclusion. Give the equation of the test statistic you are using and its distribution. Give the value of the test statistic, the associated degrees of freedom, the p-value for the test, and your conclusion. State the conclusion in terms of the problem under study, i.e. "reject the null hypothesis" is NOT sufficient here. (Note: if you decide to perform the test by hand, please give the critical or cutoff value you are using to determine whether to reject the null hypothesis.)

Blood Type   State   Frequency
A            FL      122
B            FL      117
AB           FL      19
O            FL      244
A            IA      1781
B            IA      351
AB           IA      289
O            IA      3301
A            MO      353
B            MO      269
AB           MO      60
O            MO      713

Answer: This problem can be solved with the test of independence of two categorical variables. The two categorical variables here are 1- blood type and 2- state. The hypotheses can be written as:
   $H_0$: State and blood type are independent
   $H_1$: $H_0$ is not true
Our significance level (probability of a type I error) is $\alpha = 0.01$. The test used for this analysis is the chi-square test, with the statistic below:
   $\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
   Expected cell count: $E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}$

Degrees of freedom: df = (rows - 1)(columns - 1). Under $H_0$, the statistic has approximately a chi-square distribution with these degrees of freedom, and the P-value is the area to the right of the observed $\chi^2$ under that distribution. Both assumptions are met: 1- The samples are random. 2- The sample sizes are sufficiently large that the expected cell counts are all 5 or more. We are using the idea that when two events E and F are independent, Pr(E | F) = Pr(E), so that Pr(E and F) = Pr(E | F) Pr(F) = Pr(E) Pr(F).


To perform the test manually, we re-arrange the data into the table below. Our grand sample size is 7619 and our degrees of freedom are (4-1)(3-1) = 6.

Observed                   State
Blood Type      FL        IA        MO        Total
A               122       1781      353       2256
B               117       351       269       737
AB              19        289       60        368
O               244       3301      713       4258
Total           502       5722      1395      7619

Table 3: Observed values

Expected cell = (row total)(column total) / 7619. For example:
   $E_{A,FL} = \frac{(2256)(502)}{7619} = 148.64$
   $E_{O,MO} = \frac{(4258)(1395)}{7619} = 779.6$

Expected                   State
Blood Type      FL            IA            MO            Total
A               148.643129    1694.29479    413.062082    2256
B               48.559391     553.499672    134.940937    737
AB              24.2467515    276.374327    67.3789211    368
O               280.550728    3197.83121    779.61806     4258
Total           502           5722          1395          7619

Table 4: Expected Values

Degrees of freedom = (4-1)(3-1) = 6

Each cell contributes (OBS - EXP)^2 / EXP, for example $(122 - 148.64)^2 / 148.64$ for cell (A, FL) and $(713 - 779.61)^2 / 779.61$ for cell (O, MO):

[(OBS-EXP)^2]/EXP          State
Blood Type      FL            IA            MO            Total
A               4.77557442    4.43712251    8.7334418     17.9461387
B               96.4616084    74.0851697    133.182952    303.72973
AB              1.13534391    0.57678154    0.80809363    2.52021908
O               4.76190441    3.32844301    5.69248733    13.7828347
Total           107.134431    82.4275167    148.416975    337.978923

Table 5: (Observed Cell - Expected Cell)^2/expected Cell

   $\chi^2 = 107.134431 + 82.4275167 + 148.416975 = 337.978923$
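The hand calculation in Tables 3 to 5 can be cross-checked with a short script. This is a sketch in plain Python, not a replacement for the SAS analysis: the names are illustrative, and the P-value uses the closed-form upper-tail probability of the chi-square distribution that holds for even degrees of freedom (df = 6 here).

```python
import math

# Observed counts from Table 3: blood type -> state -> frequency.
observed = {
    "A":  {"FL": 122, "IA": 1781, "MO": 353},
    "B":  {"FL": 117, "IA": 351,  "MO": 269},
    "AB": {"FL": 19,  "IA": 289,  "MO": 60},
    "O":  {"FL": 244, "IA": 3301, "MO": 713},
}
states = ["FL", "IA", "MO"]

row_totals = {bt: sum(counts.values()) for bt, counts in observed.items()}
col_totals = {s: sum(observed[bt][s] for bt in observed) for s in states}
n = sum(row_totals.values())

# Expected count under independence: (row total)(column total) / n.
expected = {bt: {s: row_totals[bt] * col_totals[s] / n for s in states}
            for bt in observed}

chi2 = sum((observed[bt][s] - expected[bt][s]) ** 2 / expected[bt][s]
           for bt in observed for s in states)
df = (len(observed) - 1) * (len(states) - 1)

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df."""
    assert df % 2 == 0, "closed form used here is only valid for even df"
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

p_value = chi2_upper_tail_even_df(chi2, df)
print(f"n = {n}, chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.3g}")
```

The chi-square value should match the total in Table 5, and the P-value is far below the cutoff probability of 0.005 associated with the tabled critical value.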


Performing the test in SAS also gives a similar Chi-square value.

Table 6: SAS outputs

P-value conclusion: With 6 degrees of freedom, the largest value in the chi-square table is 18.548, with a right-tail probability of 0.005. Since our chi-square value is 337.97, which is much larger than 18.548, the p-value of our test is definitely less than 0.005. The SAS output also reports a p-value below 0.0001 for our chi-square result.
Conclusion: Under any reasonable choice of type I error ($\alpha$), we reject the null hypothesis that blood type and state are independent. This means that the proportions of the blood types are not the same in the different states. In other words, the distribution of blood types is related to the state.


SAS Code for Part D:


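   /* Read one (blood type, state, count) triple at a time; @@ holds the line */
   data bloodtype;
      input bloodtype$ state$ count @@;
      datalines;
   A FL 122   B FL 117   AB FL 19   O FL 244
   A IA 1781  B IA 351   AB IA 289  O IA 3301
   A MO 353   B MO 269   AB MO 60   O MO 713
   ;
   /* Chi-square test of independence, weighting each cell by its count */
   proc freq data=bloodtype;
      tables bloodtype*state / cellchi2 chisq expected norow nocol nopercent;
      weight count;
   run;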

References:
1- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 0-8058-0283-5.
