
Paired-samples test

Use this test as an alternative to the t-test for cases where data can be paired to reduce incidental variation, i.e. variation that you expect to be present but that is irrelevant to the hypothesis you want to test.

As background, let us consider exactly what we do in a conventional t-test to compare two samples. We compare the size of the difference between two means in relation to the amount of inherent variability (the random error, not related to treatment differences) in the data. If the random error is large then we are unlikely to find a significant difference between means unless this difference is also very large.

Consider the data in the table below, which shows the number of years' remission from symptoms (of cancer, AIDS, etc.) in two groups of patients: group A who received a new drug and group B who received a placebo (the controls). There were 10 patients in each group, and we will first analyse the data by conventional t-test (see Student's t-test if you are not familiar with this).
Patient     Drug    Placebo
1           7       4
2           5       3
3           2       1
4           8       6
5           3       2
6           4       4
7           10      9
8           7       5
9           4       3
10          9       8

Σx          59      45
n           10      10
Mean        5.9     4.5
Σx²         413     261
(Σx)²/n     348     203
Σd²         65      58
σ²          7.22    6.44

σd² = 7.22/10 + 6.44/10 = 1.37
σd = 1.17
t = (5.9 - 4.5)/1.17 = 1.20
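The conventional calculation tabulated above can be checked with a few lines of Python. This is only a sketch using the standard library; the function name `unpaired_t` is illustrative, and it assumes the variance of the difference between means is taken as σ1²/n1 + σ2²/n2, as in the table.

```python
import math
from statistics import mean, variance

# Years of remission for the two groups, copied from the table.
drug = [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]


def unpaired_t(a, b):
    """t statistic for two independent samples.

    The variance of the difference between the means is taken as
    s1^2/n1 + s2^2/n2, using sample variances (n - 1 divisor).
    """
    var_diff = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(var_diff)


t_unpaired = unpaired_t(drug, placebo)
print(f"unpaired t = {t_unpaired:.2f}")
```

The result is about 1.20, well below any tabulated t value for p = 0.05, so treating the two groups as independent samples gives no evidence of a difference.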

Clearly, there is no significant difference between the means. [With 18 degrees of freedom the tabulated t value for p = 0.05 is 2.10; even the smallest tabulated t value for a significant difference at p = 0.05 is 1.96.]

But drug trials are never done as randomly as this. Instead, the patients are matched as nearly as possible to exclude the effects of extraneous variation. For example, patient 1 in each group (drug or placebo) might be a Caucasian male aged 20-25; patient 2 in each group might be an Asian female aged 40-50, and so on. There is every reason to suspect that age, sex, social factors etc. could influence the course of a disease, and it would be foolish not to exclude this variation if the purpose of the trial is to see if the drug actually has an overall effect. In other words, we are not dealing with random groups but with purposefully paired observations. (The same would be true if, for example, we wanted to test effects of a fungicide against a disease on 10 farms, or to test whether a range of different bacteria are sensitive to an antibiotic, etc.)

Now we will analyse the data as paired samples.

Procedure (see worked example later)

1. Subtract each control value from the corresponding treatment value and call the difference z. (NB Always subtract in the same "direction", recording negative values where they occur.)

2. Calculate Σz, z̄ (the mean of z), Σz² and (Σz)²/n, where n is the number of pairs (z values).

3. Construct a null hypothesis. In this case it would be appropriate to "expect" no difference between the groups (drug treatment versus controls). If this were true then the observed values of z would have a mean close to zero, with variation about this mean.

4. Calculate: Σd² = Σz² - (Σz)²/n

5. Calculate: σd² = Σd²/(n - 1)

6. Square root this to find σd, then calculate the standard error of the mean difference: σd/√n

7. Find t from the equation: t = z̄/(σd/√n)

8. Consult a t table at n - 1 degrees of freedom, where n is the number of pairs (number of z values). The tabulated t value for 9 df is 2.26 (p = 0.05).

In our example the calculated t value is 5.24, which is very highly significant: it exceeds the tabulated t value for a probability (p) of 0.001. In other words, we would expect such a result to occur by chance less than once in a thousand times. So the drug is effective: we see below that it gives remission of symptoms for 1.4 ± 0.267 years (this value is the mean ± standard error of the mean). The confidence limits are 1.4 ± 0.6 years (mean ± t × standard error).
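The procedure above can be sketched in Python as a check on the arithmetic. This is a minimal illustration using only the standard library; the function name `paired_t` and the constant `T_CRIT_9DF` are illustrative names, and the critical value 2.26 is simply the tabulated figure quoted in the text rather than something the code computes.

```python
import math
from statistics import mean, stdev

# Paired observations: patient i in each list is a matched pair.
drug = [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]


def paired_t(treatment, control, hypothesized=0.0):
    """Return (mean difference, standard error, t) for paired data."""
    z = [a - b for a, b in zip(treatment, control)]  # step 1
    n = len(z)
    z_bar = mean(z)                                  # step 2
    se = stdev(z) / math.sqrt(n)                     # steps 4 to 6
    return z_bar, se, (z_bar - hypothesized) / se    # step 7


z_bar, se, t_paired = paired_t(drug, placebo)

T_CRIT_9DF = 2.26  # tabulated t for 9 df at p = 0.05 (taken from the text)
half_width = T_CRIT_9DF * se

print(f"mean difference = {z_bar:.2f}, SE = {se:.3f}, t = {t_paired:.2f}")
print(f"confidence limits: {z_bar:.1f} +/- {half_width:.1f}")
```

Because no intermediate values are rounded, t comes out as 5.25 rather than the hand-calculated 5.24.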
Patient     Drug    Placebo   Difference (z)
1           7       4         3
2           5       3         2
3           2       1         1
4           8       6         2
5           3       2         1
6           4       4         0
7           10      9         1
8           7       5         2
9           4       3         1
10          9       8         1

Σz          14
n           10
z̄           1.4
Σz²         26
(Σz)²/n     19.6
Σd²         6.4
σd²         0.71
σd          0.84
σd/√n       0.267

t = 1.4/0.267 = 5.24

It is instructive to consider what we have done in this analysis. We calculated the mean difference between the pairs of patients (treatments), calculated the standard error of this mean difference and tested it to see if it is significantly different from zero (no difference). The following diagram should make this clear.

Paired-samples t-test: print-out from "Excel"

The example above was run on "Excel", as before (see Student's t-test), but this time we select "t-Test: Paired Two Sample for Means" from the analysis tools package, click OK, select the whole data set (cells B2-C11) for Input variable range and a clear cell for Output range. See Student's t-test for explanation of other relevant entries in the print-out. [The calculated t (5.25) differs slightly from the worked example (5.24) on the previous page because the computer did not round the intermediate values during calculations.]
Patient    1   2   3   4   5   6   7   8   9   10
Drug       7   5   2   8   3   4   10  7   4   9
Placebo    4   3   1   6   2   4   9   5   3   8

t-Test: Paired Two Sample for Means

                                Variable 1    Variable 2
Mean                            5.9           4.5
Variance                        7.211111      6.5
Observations                    10            10
Pearson Correlation             0.949414
Hypothesized Mean Difference    0
df                              9
t Stat                          5.25
P(T<=t) one-tail                0.000264
t Critical one-tail             1.833114
P(T<=t) two-tail                0.000528
t Critical two-tail             2.262159

Note that the two-tailed t-test shows the drug and placebo to be significantly different at p = 0.0005 (probability of 5 in 10,000 that we would get this result by chance alone). But in this case we would be justified in using a one-tailed test (P = 0.00026) because we are testing whether the mean difference (1.4 years) is significantly greater than zero. Look at the critical t for a one-tailed test in the printout (1.83). This is the value given for p = 0.1 in a t-table: because we are testing only for a difference in one direction (above zero), we can double the normal probability of 0.05. We should have used this value, not the value for p = 0.05, in our testing for significance.

Pairwise
Input.

T-test: give the number of cases, the mean difference and the standard deviation of the difference, respectively; or give the number of cases and the t-value of the difference. Wilcoxon Ranks Test: give the number of cases and the sum of signed ranks. McNemar test: give the two integer numbers of changers from a two-by-two table in the top two boxes, i.e. the number of positive changers and the number of negative changers.

Explanation.
Pairwise tests concern the comparison of the same group of individuals, or matched pairs, being measured twice, before and after an 'intervention'. Using this methodology the respondents or their matched 'partners' function as their own control, lowering the level of unexplained variance or 'error'. Matched or pairwise data can be presented as in the following table:

Name     Before    After     D(ifference)   Sq(D-Mean)   Ranked-D
John     18        28        10             17.64        8
Steve    37        34        -3             77.44        -2
Liz      12        17        5              0.64         4.5
Mary     42        40        -2             60.84        -1
Paul     7         20        13             51.84        9.5
Joy      31        35        4              3.24         3
Mike     59        66        7              1.44         7
Nick     21        27        6              0.04         6
Linda    8         21        13             51.84        9.5
Peter    56        61        5              0.64         4.5
Total    291       349       58             265.6        49
         sd=19.05  sd=16.74  Mean=5.8       sd=5.43      Signed=3

The first three columns show the data, the following columns a number of calculations which we will now discuss. As the data show, the 10 respondents have improved 5.8 'points' on average after the intervention.

One possible way to test whether the difference between the respondents before and after the intervention is statistically significant is to apply the t-test procedure as implemented in SISA. Giving the values above {mean1=29.1 (291/10); mean2=34.9; n1=10; n2=10; sd1=19.05; sd2=16.74}, the t-test procedure produces a t-value of 0.72, with an associated single sided ('tailed') p-value of 0.2376. According to this method of testing, the intervention does not produce statistically significant improvements in the respondents' scores. However, doing it this way does not do justice to the data. Because the observations are paired, much of the variation within the two sets of data which the usual t-test (for two independent samples) takes into account might already have been taken out. In that case the usual t-test is too conservative: differences between the two sets of paired observations are not declared statistically significant as quickly as they should be.

T-test for paired observations. (Also known as the t-test for two correlated samples.) This t-test tests whether the mean change between the two measurements differs statistically significantly from zero. For the t-test procedure you have to give the number of cases, which is an integer number, in the top box; this is 10 in the above example. In the second box the mean difference is given (5.8 in the above example), and the standard deviation of the difference must be given as a positive number, with or without decimals, in the third box. The standard deviation is only used for the t-test procedure. How the standard deviation is calculated is shown in the fifth column of the example above. One takes the sum of the squares of the differences between each observed change and the mean change. This sum is then divided by the number of cases minus one and the square root is taken. In the case of the example: sqrt(265.6/(10-1)) = 5.43. One can use a calculator with statistical functions to calculate the standard deviation. Enter the scores for the difference, and take the sample standard deviation, the one with the "s" symbol, not the one with the "sigma" symbol.
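The standard deviation calculation just described can be reproduced with a short Python sketch. The variable names are illustrative, and the data are the Before and After columns of the table above; the last line forms the paired t-value as the mean difference divided by its standard error (standard deviation over the square root of n).

```python
import math

# Before and After scores for the ten respondents in the table.
before = [18, 37, 12, 42, 7, 31, 59, 21, 8, 56]
after = [28, 34, 17, 40, 20, 35, 66, 27, 21, 61]

# Column four: the difference D for each respondent.
diffs = [a - b for b, a in zip(before, after)]
n = len(diffs)
mean_diff = sum(diffs) / n

# Column five: squared deviation of each D from the mean of D.
sum_sq = sum((d - mean_diff) ** 2 for d in diffs)

# Sample standard deviation: divide by n - 1, then take the square root.
sd_diff = math.sqrt(sum_sq / (n - 1))

# Paired t-value against a hypothesized change of zero.
t_value = mean_diff / (sd_diff / math.sqrt(n))

print(f"mean = {mean_diff:.1f}, sum of squares = {sum_sq:.1f}")
print(f"sd = {sd_diff:.2f}, t = {t_value:.2f}")
```

Entering n = 10, mean = 5.8 and sd = 5.43 in the three boxes corresponds to the quantities computed here.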

As an alternative to giving a mean and a standard deviation for the t-test, one can give the t-value of the difference and leave the value zero (0) in the standard deviation box. The t-value is calculated like this: (mean - nil hypothesis)/(standard deviation/sqrt(n)) = (5.8 - 0)/(5.43/sqrt(10)). The nil hypothesis in this case equals zero (0): we expect no or zero change, given the standard error of the measurement (the standard deviation/sqrt(n) is commonly referred to as the standard error). The t-value of the difference equals 3.378, and the program gives you the p-value of 0.00407.

The t-test is critical with regard to characteristics of the data, the assumptions for the test. It is assumed that the differences between the two observations are at the interval level. Thus, someone who improved four points improved twice as much as someone who improved two points. It is further assumed that the differences are normally distributed. Thus the t-test would not be valid if, for example, a number of smaller differences were counterbalanced by one very large difference, or if all differences were of more or less the same size, or if there were a ceiling on the maximum size of the differences. The following two tests are presented for the case in which the assumptions of interval level and normal distribution are not met.

Wilcoxon Matched Pairs Signed Ranks Test. This test assumes that the data are at an ordinal-metric level, i.e. that the original data can be validly ordered, that the data after the intervention can be ordered, and that the differences between the two sets of data can be validly ordered. This assumption is slightly less critical than the interval level assumption necessary for the t-test. The assumption of a normal distribution does not have to be met, which is particularly practical if the maximum change is somehow limited. A positive aspect of the Wilcoxon test is that it is a very powerful test. If all the assumptions for the t-test are met, the Wilcoxon has about 95% of the power of the t-test.

The Wilcoxon test requires as input the number of changed cases as an integer value in the top box and the sum of the negative ranks as a positive integer number in the second box; no standard deviation is required. The sum of the negative ranks is calculated in the sixth column of the above table. This column shows the rank numbers of the differences between the two measurements given in the fourth column. This might not be immediately obvious and therefore we repeat the calculation here. In column four we have the differences, D, which ordered from small to large by absolute size give: -2, -3, 4, 5, 5, 6, 7, 10, 13, 13. As we have 10 observations, these should be apportioned the rank numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. In the case of ties (two or more equal values) we average the rank numbers. We have to do this for the D values 5 and 13, which results in the following ranking: 1, 2, 3, 4.5, 4.5, 6, 7, 8, 9.5, 9.5. An individual who did not change, i.e. for whom the difference between the first and the second measurement is zero, is excluded from the analysis. Thus, cases which did not change are not present in the data, table or analysis, neither in the ordering of the data nor in the number of cases.

Two of the rank numbers refer to negative values: the values -2 and -3, with rank numbers 1 and 2 respectively. The sum of these rank numbers equals 3, the sum of the negative ranks. Giving this value in the appropriate box shows that the expected value of the sum of signed ranks equals 27.5, with a standard deviation of 9.81. The associated z value of the difference between observed and expected sum of signed ranks equals 2.497, with a p-value of 0.00625. According to the Wilcoxon test, the difference between the respondents' scores before and after the intervention is statistically significant.

McNemar Change Test. This test studies the change in a group of respondents measured twice on a dichotomous variable. It is customary in that case to tabulate the data in a two by two table. The following example from Siegel illustrates how the test works.

Preference for candidate before and after T.V. debate

                              Preference before debate
Preference after TV debate    Reagan    Carter    Total
Reagan                        27        13        40
Carter                        7         28        35
Total                         34        41        75

In this table one group of voters is asked twice about their voting intention, before and after a television debate. As can be seen, 13 respondents changed their preference from Carter to Reagan while 7 respondents changed their preference from Reagan to Carter. The test asks whether the number of respondents changing from Reagan to Carter is similar to the number changing in the other direction. Thus, the question is whether there is a statistically significant difference between the 7 and the 13. Fill in the two change values (13 and 7) in the top two boxes (leave a zero in the bottom box), click the McNemar button, and the program provides you with the answers: a probability value of 0.179 (chi-square 1.8), or 0.264 with Yates' continuity correction (chi-square 1.25). The two candidates were not statistically significantly different in changing the voters' preference.

The program also gives you a Binomial alternative to the McNemar test: the probability of getting the same number or more out of a total number while the expectation was 50% one way and 50% the other way. Use the Binomial if the number of cases is small or if you have a cell with fewer than 5 observations. Note that the Binomial is a single sided test; use it if you want to test whether the change goes in a particular direction and whether the change is statistically significant in that regard. The Chi-square is a double sided test, with no prior expectation with regard to direction. Double the p-value of the Binomial test to get a double sided test. The doubled Binomial p-value will mostly be very close to the p-value of the Yates' Chi-square.

What is interesting is mostly not what happens inside the table, but what happens in the marginals (the row and column totals). As you can see, before the TV debate Carter had the support of 41 of the 75 voters; after the debate this decreased to 35. Carter lost the support of 6 voters, the net difference between the voters who changed in favour of Carter and the voters who changed in favour of Reagan. One of the problems with the McNemar is that it requires knowledge of the inside of the table, while we are mostly interested in the marginals. Machin and Campbell propose that, in case the inside of the table is not known, the inside of the table be estimated from the marginals. This presumes that the likelihood of changing from Carter to Reagan, or from Reagan to Carter, is independent of being a Reagan or a Carter supporter in the first place. SISA allows you to do this less preferred way of doing the McNemar also. Fill in two numbers of cases, one for each marginal, in the top two boxes, and a total number of cases in the bottom box. Click the McNemar button. (Agresti discusses this same topic in his example 10.1.3. He uses a t-test instead of the Chi-square or the Binomial alternative.)

The McNemar can be used as an alternative to the sign test. Place the number of positive changers in the top N+ box, and the number of negative changers in the second N- box. Disregard the number of non-changers. In the top table above we would have 7 positive and 3 negative changers; the McNemar test gives a Chi-square for an equal number of changers in both groups of 1.6, with a probability of 0.2059. The positive direction of the change might well have been caused by chance fluctuation. The sign test, including this McNemar version, has the disadvantage that it is not a very powerful test. However, it is also a test with few assumptions and does not require that the data have particular characteristics. It can be used on interval, ordinal and nominal data, for example to see whether after some treatment more diseased people got cured (+1) than healthy people became diseased (-1).
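The Wilcoxon and McNemar figures quoted in this section can be checked with a small Python sketch. This is not SISA's implementation: the function names are illustrative, the Wilcoxon z uses the plain normal approximation without a correction for ties, and the chi-square p-value relies on the identity that the upper tail of a chi-square with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def average_ranks(values):
    """Rank values from small to large, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_z(diffs):
    """Sum of negative ranks, its expectation, its sd, and the z value."""
    d = [x for x in diffs if x != 0]  # unchanged cases are excluded
    n = len(d)
    ranks = average_ranks([abs(x) for x in d])
    neg_sum = sum(r for r, x in zip(ranks, d) if x < 0)
    expected = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return neg_sum, expected, sd, (expected - neg_sum) / sd


def mcnemar_chi2(b, c, yates=False):
    """McNemar chi-square (1 df) for two counts of changers, with p-value."""
    num = abs(b - c) - (1 if yates else 0)
    chi2 = num ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Differences D from the respondents table.
neg_sum, expected, sd, z = wilcoxon_z([10, -3, 5, -2, 13, 4, 7, 6, 13, 5])
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
print(f"negative ranks = {neg_sum}, expected = {expected}, sd = {sd:.2f}")
print(f"z = {z:.3f}, one-sided p = {p_one_sided:.5f}")

# Changers from the TV debate table: 13 one way, 7 the other.
chi2_plain, p_plain = mcnemar_chi2(13, 7)
chi2_yates, p_yates = mcnemar_chi2(13, 7, yates=True)
print(f"McNemar: chi2 = {chi2_plain:.2f}, p = {p_plain:.3f}")
print(f"with Yates: chi2 = {chi2_yates:.2f}, p = {p_yates:.3f}")
```

The same `mcnemar_chi2` function covers the sign-test use: 7 positive and 3 negative changers give a chi-square of 1.6.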
Hypothesis test

Formula: t = (x̄ - Δ)/(s/√n)

where x̄ is the mean of the change scores, Δ is the hypothesized difference (0 if testing for equal means), s is the sample standard deviation of the differences, and n is the sample size. The number of degrees of freedom for the problem is n - 1.

A farmer decides to try out a new fertilizer on a test plot containing 10 stalks of corn. Before applying the fertilizer, he measures the height of each stalk. Two weeks later, he measures the stalks again, being careful to match each stalk's new height to its previous one. The stalks would have grown an average of 6 inches during that time even without the fertilizer. Did the fertilizer help? Use a significance level of 0.05.

null hypothesis: H0: μ = 6
alternative hypothesis: Ha: μ > 6

Stalk            1      2      3      4      5      6      7      8      9      10
Before height    35.5   31.7   31.2   36.3   22.8   28.0   24.6   26.1   34.5   27.7
After height     45.3   36.0   38.6   44.7   31.4   33.5   28.8   35.8   42.9   35.0

Subtract each stalk's before height from its after height to get the change score for each stalk; then compute the mean and standard deviation of the change scores and insert these into the formula. The problem has n - 1, or 10 - 1 = 9 degrees of freedom. The test is one-tailed because you are asking only whether the fertilizer increases growth, not whether it reduces it. The critical value from the t-table for t.05,9 is 1.833. Because the computed t-value of 2.098 is larger than 1.833, the null hypothesis can be rejected. The test has provided evidence that the fertilizer caused the corn to grow more than if it had not been fertilized. The amount of actual increase was not large (1.36 inches over normal growth), but it was statistically significant.
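The fertilizer example can be worked through in Python as follows. This is a sketch under stated assumptions: the names `EXPECTED_GROWTH` and `T_CRIT` are illustrative, and the critical value 1.833 is the tabulated figure quoted above rather than something the code derives.

```python
import math
from statistics import mean, stdev

# Heights in inches for the ten stalks, before and after two weeks.
before = [35.5, 31.7, 31.2, 36.3, 22.8, 28.0, 24.6, 26.1, 34.5, 27.7]
after = [45.3, 36.0, 38.6, 44.7, 31.4, 33.5, 28.8, 35.8, 42.9, 35.0]

EXPECTED_GROWTH = 6.0  # hypothesized mean change under H0 (from the text)
T_CRIT = 1.833         # one-tailed t for 9 df at 0.05 (from the text)

# Change score for each stalk, matched by position.
changes = [a - b for b, a in zip(before, after)]
n = len(changes)
mean_change = mean(changes)
s = stdev(changes)  # sample standard deviation (n - 1 divisor)

# Paired t statistic against the hypothesized growth of 6 inches.
t_stat = (mean_change - EXPECTED_GROWTH) / (s / math.sqrt(n))
reject_null = t_stat > T_CRIT

print(f"mean change = {mean_change:.2f}, s = {s:.2f}, t = {t_stat:.2f}")
print("reject H0" if reject_null else "do not reject H0")
```

Carrying full precision through the calculation gives a t-value very slightly below the 2.098 obtained with rounded intermediate values; the conclusion is the same.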
