BASIC GUIDE II
SMOKER * DISEASE Crosstabulation

                                  DISEASE
                                  .00      1.00     Total
SMOKER  .00    Count              4        2        6
               % within SMOKER    66.7%    33.3%    100.0%
               % within DISEASE   80.0%    28.6%    50.0%
        1.00   Count              1        5        6
               % within SMOKER    16.7%    83.3%    100.0%
               % within DISEASE   20.0%    71.4%    50.0%
Total          Count              5        7        12
               % within SMOKER    41.7%    58.3%    100.0%
               % within DISEASE   100.0%   100.0%   100.0%
So, in this example, 33% of non-smokers (SMOKER=0) have the disease whereas 83% of smokers (SMOKER=1) have the disease. To test for differences between groups, we create this table and run one of the following tests:
If the table is 2x2 and the numbers are large enough in each cell, we use Chi-Square with Yates' Continuity Correction.
If the table is larger than 2x2 and the numbers are large enough in each cell, we use Pearson's Chi-Square.
If the numbers are not large enough, we use Fisher's Exact Test.
Look at the "expected count less than 5" percentage displayed below the table produced by SPSS. If this is greater than or equal to 20%, then the numbers are not large enough and we should use Fisher's Exact Test (and always use the two-sided test). To obtain a p-value using CROSSTABS: Click on Statistics. Tick the Chi-Square box and then click on Continue.
Click OK
Chi-Square Tests

                               Value    df
Pearson Chi-Square(b)
Continuity Correction(a)
Likelihood Ratio
Fisher's Exact Test
Linear-by-Linear Association
N of Valid Cases
a. Computed only for a 2x2 table
b. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.50.
Under the table it says that 100% of our cells have an expected count less than 5; therefore we don't have enough subjects in our sample to use a chi-square, so we shall use Fisher's Exact Test. Our p-value is two-sided, so in this case p=0.24. We have no evidence that smokers are more likely to develop the disease (although this has a lot to do with the sample size and power of the study).
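SPSS does this calculation for you, but the two-sided Fisher's exact p-value for the smoker/disease table can be checked from first principles in a few lines of plain Python (standard library only). The function name below is ours, not an SPSS or library routine.

```python
# Two-sided Fisher's exact test for a 2x2 table, computed from the
# hypergeometric distribution. The table is the SMOKER * DISEASE example:
# rows = non-smoker/smoker, columns = no disease/disease.
from math import comb

def fisher_exact_2x2(table):
    """Return the two-sided Fisher's exact p-value for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1 = a + b
    col1 = a + c
    n = row1 + c + d

    def prob(x):
        # Probability of x in the top-left cell given fixed margins.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Two-sided p-value: sum over all tables as extreme as, or more
    # extreme than, the one observed.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

p = fisher_exact_2x2([[4, 2], [1, 5]])
print(round(p, 2))  # 0.24, the two-sided value quoted above
```

The exact value is 224/924 = 0.2424, which SPSS rounds to 0.24.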
WORK * cough Crosstabulation

                                    cough
                                    1        2        Total
WORK   Workplace 1  Count           16       19       35
                    % within WORK   45.7%    54.3%    100.0%
       Workplace 2  Count           31       21       52
                    % within WORK   59.6%    40.4%    100.0%
       Workplace 3  Count           10       7        17
                    % within WORK   58.8%    41.2%    100.0%
Total               Count           57       47       104
                    % within WORK   54.8%    45.2%    100.0%
Chi-Square Tests

                              Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square            1.764(a)  2    .414
Likelihood Ratio              1.762     2    .414
Linear-by-Linear Association  1.222     1    .269
N of Valid Cases              104
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 7.68.
We observe that 46% of Workplace 1 have a history of cough, 60% of Workplace 2 and 59% of Workplace 3. We can also see that 0% of the expected counts are less than 5. Therefore, we would use the Pearson's Chi-Square p-value, which is equal to 0.41. There is no evidence of any difference in proportions between the three groups.
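The Pearson statistic and the minimum expected count in that output can both be reproduced by hand from the cell counts (which are the frequencies used later in this example). A short Python sketch, standard library only:

```python
# Pearson's chi-square statistic and expected counts for the 3x2
# workplace-by-cough table. Counts per workplace (cough=1, cough=2)
# are taken from the frequency data in this example.
observed = [[16, 19],   # Workplace 1
            [31, 21],   # Workplace 2
            [10, 7]]    # Workplace 3

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square = sum over cells of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))

print(round(chi_sq, 3))                          # 1.764, as in the SPSS table
print(round(min(min(r) for r in expected), 2))   # 7.68, the minimum expected count
```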
cough   workplace   freq
1       1           16
2       1           19
1       2           31
2       2           21
1       3           10
2       3           7
The freq column is now used to weight the cases.
Click on the Data menu
Click on Weight Cases
Tick the 'Weight cases by' box
Highlight the freq variable and click on the arrow to place it in the Frequency Variable box
Click on OK
Then use the Crosstabs command as before on the cough and workplace variables
Click on Statistics and put a tick in the McNemar box. Click on Continue. Now click on Cells and tick the Total percentages box.
If there were two different groups at year 1 and year 2, then this would be a chi-square test.
The percentage of subjects coughing at year 1 is the total percentage for this row; in this example it is 54.8%. In fact, the percentage coughing at year 2 is also 54.8%. It is no surprise, then, that the p-value for this is 1.000 (Exact Sig. (2-sided)). There is no evidence that the prevalence of cough changed between year 1 and year 2.
Alternatively, McNemar's test is under the Non-parametric tests, 2 Related Samples option. Test Type = McNemar; Test Pair = the 2 matched variables.
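The exact McNemar p-value depends only on the two discordant cells of the paired table (subjects who changed from yes to no, and from no to yes). The sketch below is ours, with made-up discordant counts for illustration; the point is that equal discordant counts, as in the cough example where prevalence is 54.8% in both years, necessarily give p = 1.

```python
# Exact McNemar test for paired binary data, computed from the two
# discordant cells via the binomial distribution (standard library only).
# The counts passed in below are hypothetical, chosen for illustration.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c."""
    n = b + c
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    p = 2 * sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Equal discordant counts give p = 1.0, exactly the situation in the
# cough example above.
print(mcnemar_exact(5, 5))   # 1.0
```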
Testing for Normality or Log Normality

The tests for normality were given in SPSS Basic Notes 1 (and Basic Statistics 1). It is important to note that when we do group comparisons, the data must be normally distributed within each group. The easiest way to check this is to split the file by the group variable before doing your test of normality. Log-normal data have a positive skewness. They are very common in biochemistry, where there are more people with a normal, low value than people with a high value, which often indicates an illness. To test for log normality, we simply do a normality test on the variable after it has been log transformed. An example is given in these notes.
Independent Samples t-test

Appropriate for comparing the means between two groups of different subjects (who have not been matched in the sampling; see the note below on matched designs). The Compare Means option allows you to test for the difference in means for a continuous variable between two or more groups.

1) If the data are normally distributed

To open the t-test box:
Go to the Analyse menu, select Compare Means and then select Independent Samples T Test
Highlight the continuous variable to be tested and click on the relevant arrow to place it in the Test Variable box
Highlight the variable that indicates the groups and click on the relevant arrow to place it in the Grouping Variable box
Now click the Define Groups box and type in the values of the grouping variable. So for instance if you wanted to compare men and women, and in your data set men were coded as 1 and women coded as 2, then you would enter 1 and 2 into this box. Although you could argue that SPSS should be able to work this out, what this box does allow you to do is to select two subgroups of a variable. So, for instance, if you had a variable coded 1=Yes, 2=No, 8=Don't Know, then you could compare just the Yes and No groups by entering 1 and 2 into the box.
If the sample design includes matching, then we have special methods for analysing the data. The Independent Samples t-test is not appropriate because, through the sample design, you have forced the groups to be more alike than they would have been if they had been randomly sampled.
An example of the output you should get is as follows: It tests for the difference in average heights between men and women, by using independent samples.
                               Levene's Test     t-test for Equality of Means
                               F       Sig.      t        df      Sig.    Mean Diff   s.error Diff   Lower     Upper
Equal variances assumed        0.005   0.942     11.019   112     0.000   15.52       1.408          12.728    18.310
Equal variances not assumed                      11.053   62.74   0.000   15.52       1.404          12.713    18.325
There are two rows to this table. An assumption of the t-test is that our two groups (in this example, men and women) have the same variance. If they don't, it is often a sign of a violation of normality, but if we are sure that our data are normal yet have different variances, then we use the information on the bottom row of the table; more commonly, we use the top row. To test for equal variances between the two groups, we use Levene's Test of Equal Variances. Its p-value is in the second column. If this p-value is below 0.05, we have evidence of a difference in variances and we must use the bottom row for our p-value and confidence interval.

In our case we observe that men are on average 15.52cm taller than women, with a confidence interval of (12.73cm, 18.31cm). The p-value is given as 0.000, which we should report as p<0.001 (see the note below on zero p-values). Also note that the confidence interval does not contain zero, therefore we are 95% certain that men are taller than women.

2) If the data are log-normally distributed

We can also do an independent samples t-test, but this time we have to transform the data first. To do this we use the natural log function (Ln) in Compute. We do not use the log base 10 function. In order to do this:
Click on the Transform menu Click on Compute Enter a new variable name in the Target Variable box Form the expression ln(Variable Name) in the Numeric Expression box Click on OK
In most circumstances, a p-value can never equal zero. This is a consequence of using a sample to describe an unknown population: there is always a small probability that the difference you have observed has happened by chance. Therefore, for small p-values that are reported in SPSS as 0.000, we would write p<0.001 in a paper; i.e. it is a small probability, but it still exists.
In our example the variable that we wish to transform is called hdl and the new variable we have created is called loghdl (see the note below on log naming). Loghdl will now be the transformed version of hdl, and it is this that we run the t-test on.
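The same transformation can be sketched outside SPSS. The readings below are made up; the point is simply that the natural log (SPSS's Ln, Python's math.log) is what we use, not the base-10 log.

```python
# Natural-log transform of a positively skewed variable, mirroring the
# Compute step ln(hdl) above. Note that math.log is the NATURAL log;
# math.log10 is the base-10 log, which we do not use here.
import math

hdl = [0.9, 1.1, 1.3, 1.8, 2.6, 4.0]      # hypothetical HDL readings
loghdl = [math.log(x) for x in hdl]       # equivalent to SPSS ln(hdl)

print([round(v, 3) for v in loghdl])
```

A monotone transform like this preserves the ordering of the values while pulling in the long right tail.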
So to run the t-test, we repeat what we did in the example with the normally distributed variable:
Highlight the continuous variable to be tested and click on the relevant arrow to place it in the Test Variable box. This time the variable we put in is loghdl
Highlight the variable that indicates the groups and click on the relevant arrow to place it in the Grouping Variable box
Now click the Define Groups box and type in the values of the grouping variable.
You'll get a similar output. In our example, which compares HDL levels between two groups, the first in the treatment arm of a trial and the second in the placebo arm, we get:
                               Levene's Test     t-test for Equality of Means
                               F       Sig.      t        df      Sig.    Mean Diff   s.error Diff   Lower
Equal variances assumed        0.441   0.511     -0.120   39      0.905   -0.0097     0.08118        -0.17407
Equal variances not assumed                      -0.120   36.92   0.905   -0.0097     0.08118        -0.17422
In statistics, "log" is synonymous with natural logs, as we hardly ever use log base 10. Therefore, it is not unusual for transformed variables to be called log<variable> when ln<variable> may seem more appropriate.
Again we can assume equal variances (p=0.51). There is no evidence of any difference in levels of HDL between the two groups (p=0.91). We observe a mean difference of -0.0097, but this is between the log-transformed outcomes. If we exponentiate the mean difference in the table, we get 0.99. This is the ratio of the means of the treatment group and the placebo group, i.e. the average HDL reading of the treatment group is 0.99 times the average of the placebo group. If we exponentiate the limits of the confidence interval, we get a confidence interval for this ratio, which is (0.84, 1.16); that is, the average HDL of the treatment group is somewhere between 0.84 times and 1.16 times the average HDL of the placebo group. To exponentiate, we have to use either a calculator or a spreadsheet, using the EXP function.
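Python works just as well as a calculator or spreadsheet for this back-transformation; math.exp plays the role of the EXP function. The figures below are the ones from the t-test table above (the upper limit is not shown in that output, so it is omitted here).

```python
# Back-transforming the log-scale t-test results: exponentiating the mean
# difference of log(HDL) gives the ratio of the (geometric) means, and
# exponentiating a confidence limit gives the corresponding limit for
# that ratio.
import math

mean_diff_log = -0.0097     # mean difference on the log scale
lower_log = -0.17407        # lower confidence limit on the log scale

print(round(math.exp(mean_diff_log), 2))  # 0.99: treatment mean / placebo mean
print(round(math.exp(lower_log), 2))      # 0.84: lower limit of the ratio
```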
Mann-Whitney U Test (the non-parametric equivalent of the independent samples t-test)

Click on the Define Groups box and type in the values of the grouping variable corresponding to the groups being compared. This is identical to the t-test example.
In this example, comparing weight between men and women, we get the following output:
Ranks

          SEX      N      Mean Rank   Sum of Ranks
WEIGHT    Male     80     72.15       5772.00
          Female   34     23.03       783.00
          Total    114
We can see from the p-value (reported by SPSS as 0.000, but which we should report as <0.001) that the weight of males is significantly different from that of females. There are, however, NO descriptive statistics in this output, as the mean ranks and sums of ranks that are reported are completely meaningless. To get the medians, by group, use the Explore option. The confidence interval for the median difference is not available in SPSS: either seek alternative software, calculate it by hand or ask a statistician (see the reference below).
One reference for the formula for the median difference is: Campbell, M.J. and Gardner, M.J. (1989) Calculating confidence intervals for some non-parametric analyses. In Statistics with Confidence (eds M.J. Gardner and D.G. Altman), London: British Medical Journal, 71-79. It is not a regularly seen statistic in papers, so it may not be necessary in order to get your paper published; descriptives by group could be sufficient.
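If alternative software is an option, one common hand-rolled interval for a difference in medians is a simple percentile bootstrap. This is our suggestion, not the Campbell and Gardner formula, and the two small samples below are made up for illustration.

```python
# 95% percentile bootstrap confidence interval for median(x) - median(y),
# using only the standard library. Resample each group with replacement,
# recompute the difference in medians, and take the 2.5th and 97.5th
# percentiles of the bootstrap distribution.
import random
import statistics

def bootstrap_median_diff_ci(x, y, n_boot=2000, seed=1):
    rng = random.Random(seed)       # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_boot):
        bx = [rng.choice(x) for _ in x]
        by = [rng.choice(y) for _ in y]
        diffs.append(statistics.median(bx) - statistics.median(by))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

men = [80, 82, 85, 77, 90, 88, 79, 84]      # hypothetical weights (kg)
women = [62, 66, 60, 70, 64, 68, 63, 65]
lo, hi = bootstrap_median_diff_ci(men, women)
print(lo, hi)   # for these well-separated data the interval excludes zero
```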
In this case the normality we are interested in is the normality of the paired differences. To test for this, we need to calculate the difference between the two variables and test that for normality. Data that are normally distributed at both time points often have normally distributed differences. Some data that are non-normal at both time points can also have normal paired differences: for example, hdl may be log-normal at time 1 and log-normal at time 2, but the reductions in hdl could be normal. There is no hard and fast rule for the distribution of the paired differences; it will have to be checked. However, normally distributed paired differences are very common.
Paired t-test
This test is appropriate when we have a repeated measure on a single group of subjects (or our groups have been matched in the design) and the paired difference between the paired variables is normally distributed.
Select Compare Means
Select Paired-Samples T Test
Highlight the two variables to be tested and click on the relevant arrow to place them in the Paired Variables box (in this case we choose pre and post; you have to select them both before transferring them across)
Click on OK
Here is an example of some output. It demonstrates the difference in calories taken by a sample of women, pre- and post-menstruation:

Paired Samples Test

                    Paired Differences
                    Mean      SD       s.error   Lower    Upper     t       df   sig
Pair 1 pre - post   1283.33   398.64   132.88    976.91   1589.75   9.658   8    0.000
Therefore, we have observed that dietary intake on pre-menstrual days is 1283 calories higher than dietary intake on post-menstrual days, with a confidence interval of (977 calories, 1590 calories). This is highly significant (p<0.001).
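Every figure in that table follows from n, the mean difference and its SD. The sketch below reproduces them in Python; the only outside input is the t critical value 2.306 for 8 degrees of freedom, taken from standard t tables.

```python
# Reconstructing the paired t-test output from the summary statistics
# alone (n = 9 pairs of pre/post readings).
import math

n = 9
mean_diff = 1283.33
sd_diff = 398.64

se = sd_diff / math.sqrt(n)        # standard error of the mean difference
t = mean_diff / se                 # paired t statistic
t_crit = 2.306                     # t(0.975, df = 8) from tables
ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)

print(round(se, 2))                      # 132.88
print(round(t, 3))                       # 9.658
print(round(ci[0], 2), round(ci[1], 2))  # 976.91 1589.75
```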
Click on OK
Ranks

                   N
Negative Ranks     9(a)
Positive Ranks     0(b)
Ties               0(c)
Total              9
a. Dietary intake post-menstrual < Dietary intake pre-menstrual
b. Dietary intake post-menstrual > Dietary intake pre-menstrual
c. Dietary intake pre-menstrual = Dietary intake post-menstrual
The mean ranks are not appropriate summary statistics. Therefore, the figure we are most interested in is the p-value in the table overleaf (Asymp. Sig. (2-tailed)). We can conclude there is a significant difference in calorie intake between the pre-menstrual and post-menstrual periods (with a higher intake on pre-menstrual days, as the mean negative rank exceeds the mean positive rank). This time, to get the median difference, calculate the difference variable first and then Explore this new variable. To get the confidence interval for it, seek alternative help.

Quantitative variables

Correlations
Test Statistics(b)

                          Dietary intake post-menstrual - Dietary intake pre-menstrual
Z                         -2.666(a)
Asymp. Sig. (2-tailed)    .008
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test
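Where the Z of -2.666 comes from: with 9 pairs, all differences in the same direction, the rank sum for the smaller side is 0, and the output appears to use the usual large-sample normal approximation to the signed-rank sum (no continuity correction). A few lines of Python reproduce it:

```python
# Normal approximation for the Wilcoxon signed-rank statistic.
# n = number of non-tied pairs; w = sum of ranks for the positive
# differences (0 here, since all 9 differences are negative).
import math

n = 9
w = 0.0

mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w - mean_w) / sd_w

print(round(z, 3))   # -2.666, matching the Test Statistics table
```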
Correlations allow us to assess the strength of the relationship between two continuous variables.
If the data are normal, we use a Pearson's correlation coefficient
If the data are non-normal, we use a Spearman's correlation coefficient
Some people would say you need BOTH variables to be normal to do a Pearson's. This is true if you want the confidence interval. Without the confidence interval you only need one variable to be normal, but as we are advocating confidence intervals in this course, only use Pearson's if BOTH variables are normally distributed; otherwise use Spearman's. An alternative correlation coefficient for non-normal data is Kendall's Tau-b, but we will use Spearman's.
The Correlate menu is under Analyse
Click on Bivariate
Highlight the relevant variables and click on the arrow to place them in the Variables box
Select either Pearson or Spearman.
- We use Pearson when we have normally distributed data and believe we have a linear relationship.
- We use Spearman when we have non-normally distributed data and/or a non-linear relationship is to be assessed.
(In this case, as we believe height and weight are normal, we shall click Pearson)
Click on OK
SPSS gives the following outcome which shows that Height and Weight are highly correlated (correlation coefficient=0.747). SPSS does not give a confidence interval for the correlation coefficient.
Correlations

                                   HEIGHT    WEIGHT
HEIGHT   Pearson Correlation       1.000     .747**
         Sig. (2-tailed)           .         .000
         N                         114       114
WEIGHT   Pearson Correlation       .747**    1.000
         Sig. (2-tailed)           .000      .
         N                         114       114
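Although SPSS gives no confidence interval for r, one can be computed by hand with Fisher's z transformation. This is a standard method but it is our addition, not something the notes or SPSS produce; r = 0.747 and n = 114 are taken from the output above.

```python
# Approximate 95% confidence interval for a Pearson correlation via
# Fisher's z transformation: transform r, build a normal interval on the
# z scale, then transform back.
import math

r, n = 0.747, 114

z = math.atanh(r)              # Fisher's z transform of r
se = 1 / math.sqrt(n - 3)      # standard error on the z scale
lo = math.tanh(z - 1.96 * se)
hi = math.tanh(z + 1.96 * se)

print(round(lo, 2), round(hi, 2))   # roughly 0.65 to 0.82
```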
ANOVA

                                Sum of Squares   df   Mean Square   F        Sig.
DECLINE   Between Groups        320.233          2    160.117       1.384    .259
          Within Groups         6595.950         57   115.718
          Total                 6916.183         59
STRPRE    Between Groups        4513.600         2    2256.800      13.840   .000
          Within Groups         9294.800         57   163.067
          Total                 13808.400        59
The p-value for the mean decline is 0.26, whereas for the levels of stress before the task (Variable name: strpre) it is <0.001. We can conclude that there is no evidence of any difference in the average decline of stress after the task but there is strong evidence that
the levels of stress before the task were different. We can see from the descriptives that the youngest age group was more stressed, with a mean score of 52.80 (95% C.I. = (47.6, 58.1)), compared with mean scores of 33.4 for the 26-45 age group and 35.6 for the 46-65 age group. The middle age group showed the smallest decline, on average 9.5 (95% C.I. = (5.3, 13.6)), but this was not significantly smaller.

Multiple Comparison Tests

Click on Post Hoc and put a tick in the box corresponding to one of the multiple comparison tests (e.g. Scheffe, Duncan, Tukey). Click on Continue.
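The mean squares and F ratios in the ANOVA table above are simple arithmetic on the sums of squares and degrees of freedom, so they are easy to check by hand:

```python
# Each mean square is SS / df, and F is the between-groups mean square
# divided by the within-groups mean square.
def anova_f(ss_between, df_between, ss_within, df_within):
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within

print(round(anova_f(320.233, 2, 6595.950, 57), 3))    # 1.384  (DECLINE)
print(round(anova_f(4513.600, 2, 9294.800, 57), 3))   # 13.840 (STRPRE)
```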
Kruskal-Wallis
For example purposes we are using the same stress data set, but in this case we are assuming it is not normally distributed. If this were the case, then by using a one-way analysis of variance we would have a large chance of committing a Type I error. Therefore, we choose the non-parametric version, the Kruskal-Wallis one-way analysis of variance.
Enter the variables that you want to test into the Test Variable List. Enter the grouping variable, age, into the Grouping Variable box.
Click on Define Range. Enter the range of categories that you want to test; in this case we want to test from 1 (16-24) up to 3 (45-64), i.e. we are choosing a subset of the possible age groups. Click on Continue
Click on OK
Test Statistics

               DECLINE
Chi-Square     1.712
df             2
Asymp. Sig.    .425
I have missed out the mean ranks from the output as, again, they are meaningless; if you wish to explore further where any differences lie, use the Explore option. We see there is little evidence of a difference between the groups in the levels of stress decline (p=0.43), but there is strong evidence of a difference in the levels of stress before the task (p<0.001).
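For readers who want to see what the Kruskal-Wallis statistic actually is, here is a minimal sketch (no tie correction, and the data are made up, not the stress data): rank everything together, then compare the rank sums across groups.

```python
# Kruskal-Wallis H statistic for k independent groups, assuming no tied
# values. H = 12/(N(N+1)) * sum over groups of (rank sum)^2 / group size,
# minus 3(N+1).
def kruskal_wallis_h(groups):
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # assumes no ties
    n = len(pooled)
    h = 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    return h

# Three small hypothetical groups with an obvious ordering:
print(round(kruskal_wallis_h([[1.2, 2.5], [3.1, 4.7], [5.0, 6.3]]), 3))  # 4.571
```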
APPENDIX
Entering Data for Paired and Unpaired Tests
The golden rule for data entry is ONE ROW PER SUBJECT. No more and no less. Therefore, when we are comparing independent groups, we would get a data sheet such as:
i.e. we have a group variable, work, which tells us which workplace the subject is in, and an outcome variable, cough, which tells us whether the subject has a history of cough. It is not appropriate in this case to create three variables, one for each workplace: we would then have three subjects per row. When we have paired data, we get a data set such as the one overleaf.
This time year 1 is in one column and year 2 is in another column. We do not have a year variable, because that would mean we would have two rows for one subject. The consequence of all this is that the menus for paired analyses are set up differently from the menus for unpaired analyses. However, both are set up to accommodate ONE ROW PER SUBJECT and ONE SUBJECT PER ROW.