
STAT171

Statistical Data Analysis


(2015)

Topic 8
Inference regarding two population means

J & B: Chapter 10
(Small sample procedures only)

1. Two (independent) sample t-tests for population means (10.3)
2. Confidence interval for µ1 − µ2 (10.3)
3. Modified (unpooled) t-test (10.3)
4. Introduction to design principles (10.1, Minitab)
5. Randomisation (10.4, Minitab)
6. Testing of paired data (10.5)
Two sample tests:
Up until now, we have only looked at
single samples and tested for the population
mean µ being a particular value.
What if we want to make a comparison
between two populations?

Example:
Identical laboratory rats are randomly
divided into two groups.
Each group is fed a different diet.
After a certain time period, the
weight gain (in g) of each rat is taken.

We wish to determine if there is a difference in average weight gain between the two diets.
That is, can we conclude that there is any difference between the two population means?
Data:
Diet 1: 73 121 110 81 105 89 128

Diet 2: 137 97 152 83 103 103 112


138 135 146 129 133 119

Summary statistics:
Sample 1: x̄1 = 101.00, s1 = 20.632, n1 = 7
Sample 2: x̄2 = 122.08, s2 = 20.962, n2 = 13

Visual summary:
(i) Comparative boxplots
(ii) Confidence intervals for the two population means

MTB > Graph > Interval plot

[Interval plot: Diet2 appears to give a higher result than Diet1.]
There are two possibilities for reality:

1. The mean weight gain for the two


populations (the two diets) is the same:
µ1 = µ2 or (µ1 - µ2 = 0)
This is the null hypothesis H0

2. The mean weight gain for the two


populations (the two diets) is NOT the
same: µ1 ≠ µ2 or (µ1 - µ2 ≠ 0)
This is the alternative hypothesis H1

We do not know which is the true state …


but we do have sample evidence on which
to base a decision.
5
There are two decision possibilities:

1. Retain H0 … the difference between


x1 and x2 is small enough to be
explained by chance variation and
therefore it is believable that µ1 = µ2.

2. Reject H0 … x1 and x2 are so far apart


that we don’t believe that the two groups
have the same population means.

We need to develop a procedure for testing:


H0: µ1 = µ2
H1: µ1 ≠ µ2

Note that we don't care what the values of the true means (µ1 and µ2) are.
Just: are the population means the same or are they different?
We have two independent samples of
sizes n1 and n2 from two separate
populations.

If the two populations have Normal


distributions, then:

For population 1 (with mean µ1 and sd σ1):

X ~ N( µ1 , σ1² )   and   X̄1 ~ N( µ1 , σ1²/n1 )

For population 2 (with mean µ2 and sd σ2):

X ~ N( µ2 , σ2² )   and   X̄2 ~ N( µ2 , σ2²/n2 )
So, for the difference in sample means:

X̄1 − X̄2 ~ Normal

E( X̄1 − X̄2 ) = E( X̄1 ) − E( X̄2 ) = µ1 − µ2

and if the two samples are independent, then:

Var( X̄1 − X̄2 ) = Var( X̄1 ) + Var( X̄2 ) = σ1²/n1 + σ2²/n2
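(Aside, not part of the Minitab material: a short Python simulation sketch can be used to check this variance result. The sample sizes and parameter values below are arbitrary illustrative choices.)

import numpy as np

rng = np.random.default_rng(0)
mu1, sigma1, n1 = 100, 20, 7      # arbitrary illustrative values
mu2, sigma2, n2 = 120, 20, 13

reps = 100_000
# simulate many pairs of independent samples and record the difference in sample means
xbar1 = rng.normal(mu1, sigma1, size=(reps, n1)).mean(axis=1)
xbar2 = rng.normal(mu2, sigma2, size=(reps, n2)).mean(axis=1)
diff = xbar1 - xbar2

print("simulated Var(X1bar - X2bar):", diff.var())
print("theory sigma1^2/n1 + sigma2^2/n2:", sigma1**2/n1 + sigma2**2/n2)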
Therefore, for independent samples:

X̄1 − X̄2 ~ N( µ1 − µ2 , σ1²/n1 + σ2²/n2 )

or

[ ( X̄1 − X̄2 ) − ( µ1 − µ2 ) ] / √( σ1²/n1 + σ2²/n2 ) ~ N( 0, 1 )
So, if the two samples come from populations with the same mean, i.e. if H0 is true (µ1 − µ2 = 0), then:

( X̄1 − X̄2 ) / √( σ1²/n1 + σ2²/n2 ) ~ N( 0, 1 ) ~ Z

This can be simplified if the two populations have the same standard deviation, that is, if σ1 = σ2 = σ:

( X̄1 − X̄2 ) / √( σ² ( 1/n1 + 1/n2 ) ) ~ N( 0, 1 ) ~ Z
σ1 and σ2 known:
If σ1 and σ2 are known, a z-test
could be carried out (whether the
σ’s are equal or not).

This is valid statistically, but in


practice σ1 and σ2 are rarely
known.

In most cases, the population


variances will have to be estimated
by s12 and s22 , and hence a t-test
will need to be carried out.

11
σ1 and σ2 unknown:
To carry out a traditional t-test we
need to assume that σ12 = σ22 .

This is usually quite reasonable in


that different treatments will often
affect the mean but not the variance.

For many measured variables, there


is often a substantial difference in
the means but both groups are
equally variable.
e.g. males v females (height)
12
Pooled estimate of variance:
We need to estimate σ².
But we have two estimates of σ²:
• s1² with (n1 − 1) degrees of freedom
• s2² with (n2 − 1) degrees of freedom

→ we take a weighted average of s1² and s2², weighted according to the d.f.

→ this is called the pooled variance sp²

σ̂² = sp² = [ (n1 − 1) s1² + (n2 − 1) s2² ] / (n1 + n2 − 2)

with n1 + n2 − 2 d.f.

(We lose 1 d.f. in using x̄1 to calculate s1², and 1 d.f. in using x̄2 to calculate s2².)
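(A minimal Python sketch of this pooled-variance calculation, outside Minitab; the function name is just for illustration.)

def pooled_variance(s1, n1, s2, n2):
    """Weighted average of the two sample variances, weighted by their d.f."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# rat-diet example from these slides
sp2 = pooled_variance(20.632, 7, 20.962, 13)
print(sp2, sp2**0.5)    # about 434.8 and 20.85, with 7 + 13 - 2 = 18 d.f.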
Constructing the t-test:
Exact variance:

Var( X̄1 − X̄2 ) = σ² × ( 1/n1 + 1/n2 )

Estimated variance:

est.Var( X̄1 − X̄2 ) = sp² × ( 1/n1 + 1/n2 )

Estimated standard error:

est.se( X̄1 − X̄2 ) = sp √( 1/n1 + 1/n2 )

Standardising:

[ ( X̄1 − X̄2 ) − ( µ1 − µ2 ) ] / [ sp √( 1/n1 + 1/n2 ) ] ~ t(n1+n2−2)
If H0 is true (µ1 − µ2 = 0), then:

( X̄1 − X̄2 ) / [ sp √( 1/n1 + 1/n2 ) ] ~ t(n1+n2−2)

If H0 is not true, the above "test statistic" is NOT distributed as a t(n1+n2−2).

So, we can carry out a two-sample test for the difference in two population means.
The observed value of the test statistic is:

tobs = ( x̄1 − x̄2 ) / [ sp √( 1/n1 + 1/n2 ) ]

where

sp = √[ ( (n1 − 1) s1² + (n2 − 1) s2² ) / (n1 + n2 − 2) ]

If doing a two-tailed test, the sign doesn't matter:
Obtain p-val = P( | t(n1+n2−2) | ≥ | tobs | )

For a < alternative, p-val = P( t(n1+n2−2) < tobs )
For a > alternative, p-val = P( t(n1+n2−2) > tobs )
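(For comparison with the Minitab output shown later, a hedged Python sketch of the pooled two-sample t-test from the summary statistics; it assumes scipy is available.)

from scipy import stats

# summary statistics from the slides (Diet 1 vs Diet 2), pooled (equal_var=True) t-test
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=101.00, std1=20.632, nobs1=7,
    mean2=122.08, std2=20.962, nobs2=13,
    equal_var=True)

print(t_stat, p_value)   # roughly t = -2.16 and two-sided p = 0.045 on 18 d.f.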
Compare p with α and Reject H0 if p ≤ α
The test can be one or two tailed depending
on what you are interested in.
The procedure is the same as for any
hypothesis test. The only difference is in
the calculation of the test statistic.

STAT170 uses a mnemonic to assist


students to remember the steps in
hypothesis testing:

H → hypotheses
A → assumptions
T → test statistic
P → p-value
C → conclusion

We will do a few additional (small) steps ...
Putting it all together:
(1) Identify the random variable and what the
groups (populations) are

(2) Formulate Hypotheses, H0 and H1


(3) Specify the significance level, α
(4) Summarise the data and any given information

(5) Specify any assumptions

(6) Specify the test to be used

(7) Determine the test statistic to use (and df)

(8) Perform any further calculations (eg Sp)

(9) Calculate the observed value of the test statistic

(10) Find the p-value (or quote the critical value)

(11) Make a decision

(12) Write a concluding statement (quite often containing the relevant confidence interval)
For the example:
Response X = weight gain (g) for rats on
two different diets
H0: µ1 = µ2
H1: µ1 ≠ µ2

α = 0.05

Data summary:
Diet 1: n1 = 7, x̄1 = 101.00, s1 = 20.632
Diet 2: n2 = 13, x̄2 = 122.08, s2 = 20.962

Assume:
• independent observations
• X is normally distributed
• σ1 = σ2 (two populations have equal standard deviations)

19
Test = two-independent-samples t-test

Test statistic is:

t = ( X̄1 − X̄2 ) / [ Sp √( 1/n1 + 1/n2 ) ]   with df = n1 + n2 − 2

where:  sp = √[ ( (n1 − 1) s1² + (n2 − 1) s2² ) / (n1 + n2 − 2) ]

(Use figures as accurate as possible in your calculations. The denominator here, n1 + n2 − 2 = 18, is also the degrees of freedom of the t.)

sp² = [ 6 × (20.632)² + 12 × (20.962)² ] / 18 ≈ 434.83

sp ≈ 20.8526

tobs ≈ ( 101.00 − 122.08 ) / [ 20.8526 × √( 1/7 + 1/13 ) ]
     ≈ −21.08 / 9.7758
     ≈ −2.156   with df = 7 + 13 − 2 = 18
p-value = P(t18 ≥ 2.16 or t18 ≤ -2.16)
0.01 < one tail area < 0.025
0.02 < p-value < 0.05 (p = 0.044 from Mtb)
(critical value = t18,0.025 = 2.101)

∴ Reject H0
We can conclude at the 5% level of
significance that the two diets do not have
an equal effect on the average weight gain
of the rats.
In fact, Diet2 results in a significantly
higher average weight gain than Diet1.
Note that a one tailed test could be
an option under certain
circumstances. However, in this
case, a two tailed test was required.
21
100(1 − α)% C.I. for µ1 − µ2:

All two-sided confidence intervals for means have the basic form:

statistic ± tcritical × estimated s.e.(statistic)

In this case:

( x̄1 − x̄2 ) ± t(n1+n2−2) × sp √( 1/n1 + 1/n2 )
For the example:
95% confidence interval for µ1 − µ2 is:

( 101.00 − 122.08 ) ± t18 × 20.8526 × √( 1/7 + 1/13 )
= ( −21.08 ± 2.101 × 9.7758 )
= ( −21.08 ± 20.539 )
= ( −41.62 , −0.54 )

→ we are 95% confident that the interval (−41.62, −0.54) includes the true difference between the population means, µ1 − µ2.
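(A small Python sketch of the same confidence-interval arithmetic, using scipy only for the t critical value.)

from math import sqrt
from scipy import stats

xbar1, s1, n1 = 101.00, 20.632, 7
xbar2, s2, n2 = 122.08, 20.962, 13

df = n1 + n2 - 2
sp = sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / df)   # pooled sd, about 20.85
se = sp * sqrt(1/n1 + 1/n2)                         # estimated s.e. of the difference
t_crit = stats.t.ppf(0.975, df)                     # about 2.101

diff = xbar1 - xbar2
print(diff - t_crit*se, diff + t_crit*se)           # about (-41.6, -0.5)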
The confidence interval can be used for a two-tailed test of significance.
H0: µ1 = µ2   OR   H0: µ1 − µ2 = 0
H1: µ1 ≠ µ2        H1: µ1 − µ2 ≠ 0

The test can be carried out at a particular level of significance (α) by determining whether zero lies in the equivalent 100(1 − α)% confidence interval.

The (two-sided) 95% confidence interval for µ1 − µ2 here is (−41.62, −0.54).

Zero lies outside the 95% C.I., therefore we can reject the hypothesis of equal population means at the 5% level of significance.
Assumptions for two-sample t-test:

• The data come from a Normal or


approximate Normal distribution.
We can check this using Normality plots

• The two samples are independent


(i.e. not related in any way).
This can’t really be checked unless we have
been involved in the collection of the data.

• The observations within each sample are


independent.

• The two population standard deviations


are the same i.e. σ1 = σ2
We can check this by checking how close s1 and
s2 are. In this case s1 = 20.63 and s2 = 20.96
 very close (unusually so).

25
Checking σ1 = σ2:
General “rough” rule:
The ratio slarger /ssmaller should be less than ~2.
Some other books “Rule of thumb”
Look at the comparative boxplots and
compare the lengths of the boxes (IQR)

But the larger the df., the more stringent we


should be.
e.g. For very large samples, the ratio
smax / smin should be less than about 1.5.

However for very small samples, we might allow


the ratio to go above 2, say up to about 2.5.
If the ratio of the sample standard deviations is too
large, we can’t pool the variances (since we can’t
assume σ1 = σ2).

Therefore we can't carry out a (pooled) two-sample t-test. We will look at the "modified" two-sample t-test later.
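(A minimal sketch of the rough rule in Python; the threshold of 2 is the rule-of-thumb quoted above, and could be tightened or relaxed with the sample sizes.)

def sd_ratio_ok(s1, s2, threshold=2.0):
    """Rough check of the equal-spread assumption: larger sd / smaller sd."""
    ratio = max(s1, s2) / min(s1, s2)
    return ratio, ratio < threshold

print(sd_ratio_ok(20.632, 20.962))   # rat diets: ratio about 1.02, pooling looks fine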


Using Minitab:

MTB > print c1 c2


Row Diet1 Diet2 Diet1 in c1
1 73 137 Diet2 in c2
2 121 97
3 110 152
4 81 83
5 105 103
6 89 103
7 128 112
8 138
9 135
10 146
11 129
12 133
13 119

MTB > STAT > Basic Statistics > 2-Sample T


Specify: Samples in different columns
then √ Assume equal variances

27
Alternative
hypothesis
≠,<,>

MTB > TwoSample 95.0 ‘Diet1’ ‘Diet2’;


SUBC> Pooled.
Two-Sample T-Test and CI: Diet1, Diet2
Twosample T for Diet 1 vs Diet 2
N Mean StDev SEMean
Diet1 7 101.0 20.6 7.8
Diet2 13 122.1 21.0 5.8
Difference = mu (Diet1) - mu (Diet2)
Estimate for difference: -21.08
95% CI for difference: (-41.62, -0.54)
T-Test of difference = 0 (vs not =):
T-Value = -2.16 P-Value = 0.045 DF = 18
Both use Pooled StDev = 20.8526
28
“Stacked” data in MTB
So far we have only had data in
“unstacked” format. Unstacked data has
each group is in a separate column.

Another format is “stacked”, where all the


data is in one column, and a “group
indicator” variable is in another column.
Minitab will do the manipulation (stacking or unstacking) for you:

MTB > Data > Stack > Columns

Advice: do not overwrite your original data columns (just in case you have made a mistake!). Name the output columns, c3 and c4, something sensible.
Result of “Stacking”

c3 contains all the data


c4 is an “indicator”

Each row represents


an “experimental unit”
(here, that is a rat).

This allows for much


more complicated
data structures to be
analysed.

MTB > STAT > Basic Statistics > 2-Sample T


Specify: Samples in one column
then √ Assume equal variances

30
MTB > TwoT 'wtgain' 'DietNo';
SUBC> Pooled.
Two-Sample T-Test and CI: wtgain, DietNo
Two-sample T for wtgain
DietNo N Mean StDev SE Mean
1 7 101.0 20.6 7.8
2 13 122.1 21.0 5.8
Difference = mu (1) - mu (2)
Estimate for difference: -21.08
95% CI for difference: (-41.62, -0.54)
T-Test of difference = 0 (vs not =):
T-Value = -2.16 P-Value = 0.045 DF = 18
Both use Pooled StDev = 20.8526 31
Checking Normality:
MTB > Graph > Probability Plot > Multiple

Must have separate plots for each group.


Stacked or unstacked data can be used.
Separate graphs or separate panels is fine.

32
Checking Normality:

The plots can also be overlaid. This allows us to see the features:
- Horizontal "shift" shows mean difference
- Slopes indicate st.dev.'s

Using unstacked data: the x-label is uninformative.
Using stacked data: the x-label is more useful (= 'wtgain' here).

Clearly, the assumption of normality for both groups here is fine.
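(Outside Minitab, separate Normal probability plots per group could be produced with scipy and matplotlib; this is only a sketch of one possible approach, not the course procedure.)

import matplotlib.pyplot as plt
from scipy import stats

diet1 = [73, 121, 110, 81, 105, 89, 128]
diet2 = [137, 97, 152, 83, 103, 103, 112, 138, 135, 146, 129, 133, 119]

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, data, label in zip(axes, [diet1, diet2], ["Diet1", "Diet2"]):
    stats.probplot(data, dist="norm", plot=ax)   # Normal probability (Q-Q) plot
    ax.set_title(label)
plt.tight_layout()
plt.show()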
Why separate Normality plots for
each sample?
We need to obtain separate Normality plots for
each sample, as they could easily have different
Normal distributions.
We must assume they have the same variance, but
they could have different means.
Example:
Two samples (n=20) were generated in Minitab:
same variance ( 52 ), different means (100 , 130).

Descriptive Statistics:
Variable N Mean StDev Min Med Max
N(100,5sq) 20 100.20 10.73 85 97 122
N(130,5sq) 20 131.94 6.71 121 131 151
[Comparative boxplot of N(100, 5sq) and N(130, 5sq).]
Plotting as one big sample
[Probability plot of the combined sample ("Both"), Normal - 95% CI: Mean 116.1, StDev 18.34, N 40, AD 1.324, P-Value < 0.005]

Plotting as two separate samples


[Probability plots of N(100, 5sq) and N(130, 5sq) separately, Normal - 95% CI:
N(100, 5sq): Mean 100.2, StDev 10.73, N 20, AD 0.583, P-Value 0.114
N(130, 5sq): Mean 131.9, StDev 6.706, N 20, AD 0.406, P-Value 0.319]
Another simulated example
Two samples each of size 200 were simulated.
Both had σ = 10, but µ1 = 100, and µ2 = 130
Descriptive Statistics: N(100,10^2), N(130,10^2)

Variable N Mean StDev Min Med Max


N(100,10^2) 200 101.21 10.05 72 100.5 126
N(130,10^2) 200 129.37 10.34 99 130.8 156

Boxplotted separately, the difference is clear; a single dotplot shows bimodality.

Nscore plotted separately, the normality is clear; a single nscore plot shows distinct non-normality.
“Modified” (= “unpooled”) t-test:
Example: Obtaining the concentration of a
chemical by two different methods.
Does the new, faster method
give the same measure of
concentration as the standard
method (which is known to be accurate)?

Standard method:
25 24 25 26
New method:
23 18 22 28 17 25 19 16

Testing: H0: µ1 = µ2
H1: µ1 ≠ µ2 α = 0.05
Data summary:
x̄1 = 25.0, s1 = 0.816, n1 = 4
x̄2 = 21.0, s2 = 4.209, n2 = 8

Check the ratio of the standard deviations:
slarger / ssmaller = 4.209 / 0.816 ≈ 5.16 >> 2

Even though the sample sizes are small, the ratio is far too large to assume equal variances.
∴ Can't pool the variances.

There is no exact test.


All we can do is carry out an
approximate test.
38
"Modified" (= "unpooled") t-test:
This test follows the same format as the ordinary two-sample t-test, except that the degrees of freedom have to be adjusted to give an approximate t-test.

The test statistic is:

t′ = ( X̄1 − X̄2 ) / √( S1²/n1 + S2²/n2 )

with observed value:

t′obs = ( x̄1 − x̄2 ) / √( s1²/n1 + s2²/n2 )

Compare t′ with the t distribution with d.f. given by:

df = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)² / (n1 − 1) + (s2²/n2)² / (n2 − 1) ]

(Do NOT put this formula on your summary sheet. It is included for completeness only! It was requested by some students in the past.)
This df is called the Smith-Satterthwaite
modification.
Historically Smith-Welch-Satterthwaite:
Smith(1936), Journal of CSIR,
Welch(1938), Biometrika,
Satterthwaite(1946), Biometrics.

Note:
The d.f. are often not going to be an integer – just get Minitab to do it.

By hand, take
df = min( n1 − 1, n2 − 1 ) = min( n1, n2 ) − 1
→ conservative, but not too bad ☺


40
By hand:

t′obs ≈ ( 25.0 − 21.0 ) / √( 0.816²/4 + 4.209²/8 )
      ≈ 4.0 / 1.543
      ≈ 2.592

(The numerator and denominator values can come in handy later for confidence interval evaluations.)

Approx. df ≈ min( n1, n2 ) − 1 = 4 − 1 = 3

p-value: 0.05 < P( | t3 | ≥ 2.592 ) < 0.10
We would retain H0.
However, this is only VERY approximate!!
(In fact, it is conservative, so we know the p-val has to be less.)

The complicated formula gives df = 7.987.
At df = 7 and 8: 0.02 < p-val < 0.05
We would reject H0.
→ Better to use Minitab …
Using Minitab:
Row standard new
1   25   23
2   24   18
3   25   22
4   26   28
5        17
6        25
7        19
8        16

Don't tick the "Assume equal variances" option.

MTB > TwoSample 95.0 'standard' 'new';


SUBC> Alternative 0.
Two Sample T-Test & Confidence Interval
Two sample T for standard vs new
N Mean StDev SEMean
standard 4 25.000 0.816 0.41
new 8 21.00 4.21 1.5
95%CI for mustd - munew: ( 0.35, 7.65)
T-Test mu standard = mu new (vs not=):
T-Value = 2.59 P-Value = 0.036 DF = 7

42
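(The same unpooled analysis as a hedged Python sketch, for cross-checking outside Minitab; scipy's Welch option uses the same Smith-Welch-Satterthwaite d.f.)

from scipy import stats

standard = [25, 24, 25, 26]
new = [23, 18, 22, 28, 17, 25, 19, 16]

# unpooled (Welch) two-sample t-test: equal_var=False
t_stat, p_value = stats.ttest_ind(standard, new, equal_var=False)
print(t_stat, p_value)    # about t = 2.59 and p = 0.036, df about 7.99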
Finishing the hypothesis test …
Decision: based on Minitab calculations,
reject H0 (at a 5% significance level).

There is sufficient evidence to be able to


conclude that the new technique is giving an
average concentration lower than the
standard technique. We are 95% certain
that this difference in population averages is
in the interval (0.35 , 7.7).

Not only is the concentration different on


average, but the variability of the results
when using the new method is a lot larger
than the variability of the standard method.

The end result of the analysis is that the new, faster method for obtaining the concentration of a chemical would not be recommended over the standard method.
Experimental Design:
In the previous example
of weight gain in rats
under two different diets,
at the design stage of the
experiment we assumed:
• the 20 rats came from a “homogeneous”
background:
- they were all close to the same age;
- they had very similar diets;
- their cage conditions (temperature,
humidity, exercise possibilities etc.) were
the same;
- etc.
• the rats were allocated totally at random to the two treatment groups, so the only systematic difference in the rats in the two groups was their diet.
Then, when we rejected H0: µ1 = µ2, we
concluded that there was evidence that
“the diet effects were different”.
i.e. the difference between the means was
due to (caused by) the difference in the
diets, and not something else.
Concluding this causality is valid only if
there are no other (systematic) differences
between the two groups.
As an extreme example, if
• the Diet1 rats were all female; and
• the Diet2 rats were all male,
the difference between the two sample
weight gain averages could be due to:
• a sex difference or
• a diet difference or
• a mixture of both
AND, it would be impossible to
separate the two effects, so you would
never know.

This is an example of what is called


“confounding” – here, the effect of
sex and the effect of diet are not
separable due to a poor experimental
design.

Therefore in setting up the


experiment, at the start the two
samples need to be as identical as
possible, so that the only difference
between them is the treatment applied.
At the beginning of the experiment,
the rats allocated to the two treatment
groups need to be as identical
(“homogeneous”) as possible:
i.e. same sex, age, size, health status
etc.
same initial weight if possible.
AND
the rats need to be treated as
identically as possible in the running
of the experiment i.e. the same
housing conditions, same
maintenance etc.
with the only difference between
the groups being the diet.
Randomisation:
No matter how hard we try for
uniformity there will always be
some differences between the rats.
Many of the differences will be
small or even undetectable, such as
genetic differences etc.
In setting up the experiment, the aim
is to ensure that those differences
(whatever they are) are equally
assigned to both groups.
The simplest way to achieve this is
to assign each of the rats randomly
to the groups.
48
To assign the rats randomly to the
groups, we use some mechanical
process.
This should result in absolutely no
chance of any subjective influences
on how the rats are chosen.

Hence, neither group will be biased


at the expense of the other, and we
are able to validly assign causality
of any differences at the end of the
experiment to the different
treatments ☺.

49
The actual randomisation of the
experimental units is usually
carried out using random
numbers.

For example, each experimental


unit is allotted a number from 1 to n
and those numbers are then
randomly assigned to the different
treatments.

This can be done by hand (e.g. placing pieces of paper in a "hat" and drawing at random), OR by computer.
Using Minitab:
MTB > Calc
> Make Patterned Data
> Simple set of numbers
Data Display
id_numbers
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20

MTB > Calc


> Random Data
> Sample from Columns

Resulting 7 id’s
for Tmt1

MTB > Print C2
14 3 18 10 7 6 13

Tmt2 would have the other 13 id's:
1, 2, 4, 5, 8, 9, 11, 12, 15, 16, 17, 19, 20
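(The same randomisation in a few lines of Python, as a sketch alongside the Minitab steps: sample 7 of the 20 id numbers for Tmt1 and give the rest to Tmt2. The seed is an arbitrary choice.)

import random

random.seed(171)                      # any seed; fixed so the run is repeatable
ids = list(range(1, 21))              # id numbers 1 to 20

tmt1 = sorted(random.sample(ids, 7))  # 7 ids chosen at random for treatment 1
tmt2 = sorted(set(ids) - set(tmt1))   # the other 13 ids go to treatment 2

print("Tmt1:", tmt1)
print("Tmt2:", tmt2)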
Further explanation of randomisation
(from Discussion Board in 2009)
Randomisation is simply the
procedure by which the experimental
units are allocated to the treatments.
The use of random numbers (or a
procedure implemented in Minitab) is
to save having to write out id tags
(e.g. slips of paper) for the units, put
them in a “hat”, shuffle them and
draw them out.
It is very important to allocate
subjects from a “pool” at random to
the treatments, to avoid
“confounding” (the situation where
differences can be explained by more than one causal factor).
Example
Suppose there are 26 rats in a cage to
be allocated to two treatment groups.
We wish to see if the rats' maze-navigating results differ by whether they are rewarded with chocolate or with cheese during the "training" sessions.
Each rat would have an id tag. Each
id would be randomly allocated to the
treatment groups such as by tossing a
coin and setting:
• Head → chocolate reward group
• Tail → cheese reward group


An example of bias
When allocating the 26 rats to the two
treatment groups, a bias would occur
if the rats were not allocated
randomly.
For example, the lab
assistant could just grab
the first 13 out of the cage
and allocate them to the
chocolate group, and then allocate the
remaining 13 to the cheese group.
This would mean the slowest or
friendliest rats would be selected first,
and the "not-so-friendly" or "scaredy-rats" selected last.
This would make the two treatment
groups different in not only diet, but
in demeanour (“personality”).

After two weeks of


intensive maze
navigation training, the
time it takes the rats to
complete the maze is
measured and analysed.

It is found that the average time for


the “chocolate” rats is significantly
less than the average time for the
“cheese” rats.
!!! WOOHOO !!!!
Question:

Is the time difference due to the type of reward food, or due to the ease of catchability of the rats????

→ Are chocolate rats quicker than cheese rats?
OR
→ Are friendly rats quicker than scaredy rats?

The two effects (food and


friendliness) are confounded
and their contributions to the
difference cannot be separated!
56
Fully randomised design

With a fully randomised design,


whatever differences there are
between the individuals, those
differences will be assigned to each
group randomly.

This will result in an increased


estimate of variability, and hence
makes it more difficult to detect
significant differences, but will not
result in bias.

 BUT: we can do better!!


57
“Restricted” randomisation

Usually the experimental units are


not allocated to the treatments using
total randomisation, but a form of
“restricted” randomisation.

In most cases with two treatments to


be compared it is desirable to have
equal sized samples (n1 = n2).

If each experimental unit is assigned


totally at random, this will almost
certainly NOT be the case (think of it
like tossing a coin – only rarely do
you see exactly half heads).

58
So the randomisation is done in
such a way that half the
experimental units are assigned
to one treatment, and the other
half to the other treatment.
The principle of "restricted randomisation" can be extended even further (essentially it is a matter of common sense).
For example:
Suppose that of the 26 rats:
• 16 are male
and
• 10 are female
59
We want to allocate rats to the two treatment groups, ensuring the rats are as identical as possible regarding their characteristics in the two groups ...

→ within the males, allocate 8 to each group;
→ within the females, allocate 5 to each group.

This idea can be extended ... (it


forms the basis of what is called
“blocking”)
60
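(For the 16-male / 10-female allocation above, a hedged Python sketch of this restricted (stratified) randomisation; the id labels and seed are hypothetical choices for the illustration.)

import random

random.seed(1)
males = ["M%d" % i for i in range(1, 17)]    # 16 male rats (hypothetical ids)
females = ["F%d" % i for i in range(1, 11)]  # 10 female rats (hypothetical ids)

group1, group2 = [], []
for stratum, half in [(males, 8), (females, 5)]:
    chosen = set(random.sample(stratum, half))   # half of this stratum to group 1
    group1 += sorted(chosen)
    group2 += sorted(set(stratum) - chosen)

print("Group 1:", group1)
print("Group 2:", group2)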
Pairing:
What if the initial differences between the rats are so great that we can't assume uniformity (i.e. the rats are very "heterogeneous" in their characteristics)?

For example, the rats could have


very different initial sizes.

If initial size affects weight gain, our


experiment will be affected by the
extra variability, and so
randomization will only be partly
effective.
61
We can try to allocate the rats in some
systematic way so that we can account
for the differences in the analysis.

One solution is to take the rats in


pairs, pairing by a factor that is very
likely to affect weight gain  initial
“fatness”. So:

Pair 1: Two fattest rats


Pair 2: Next two fattest rats


Pair n: Thinnest two rats
62
We assume the two rats in each pair
have the same attributes (regarding
anything to do with weight gain) and
randomly allocate one to Diet1 and
one to Diet2.

Therefore Diet1 and Diet2 are


measured at each level of “initial
fatness” and we assume that the only
difference between the two rats in
each pair is the treatment.

There may or may not be differences


between the pairs – it is irrelevant.

63
Here there are differences between the
pairs (they are all different weights).
This won’t cause problems in the analysis,
as we have “removed” the variability in
weight gain due to initial weight.
Within each pair, we assume the rats have identical properties.

The differences from pair to pair can be great or small. It doesn't matter, as the effect of the differences between the pairs is excluded from the analysis and only the difference within each pair is analysed.

This is done by taking the difference for each pair and testing the mean of the differences.

[Diagram: the 26 rats are taken in pairs by initial fatness (2 fattest, next 2, ..., 2 thinnest); within each pair, one rat is allocated at random to Diet1 and the other to Diet2.]
Pairing example (Captopril):
The effect of the drug Captopril
on blood pressure was assessed
by recording the (initial) blood pressure on
15 hypertensive patients. They were then
given Captopril and their blood pressure
was measured two hours later. Does the
drug have any effect on blood pressure?
Patient before after
1 130 125
2 122 121
3 124 121
4 104 106
5 112 101
6 101 85
7 121 98
8 124 105
9 115 103
10 102 98
11 98 90
12 119 98
13 106 110
14 107 103
15 100 82
An appropriate visual display is a line plot
in Minitab.
MTB > Graph > Line plot

> Series in rows or columns

The heading and vertical axis labels have been edited, and the legend has been removed (then trimmed and put back – allows for a larger graphical area).
The “Before” and “After” records are on
the one patient. i.e. each patient is the pair.
If we take the difference for each patient we
get the effect of the drug on that patient.
difference = before - after

For the general case:


Before After Difference
Pair1 X11 X12 → d1 (= X11 - X12)
Pair2 X21 X22 → d2 (= X21 – X22)
… … … …
Pairn Xn1 Xn2 → dn (= Xn1 - Xn2)

Xij is the response (BP) from patient i


(i= 1, …, n) at treatment j (j = 1,2)
We have a sample of n differences (Di).
We ignore the original data and only use the differences.
If the two treatments have the same effect, then the differences should be randomly distributed around 0,
i.e. µD should be 0.

→ We have a single sample of differences, so we obtain d̄ and sD (the observed mean and standard deviation of the sample of differences).
→ We can carry out a one-sample t-test on the differences, testing H0: µD = 0.

If H0 is true (and the assumptions hold):

( D̄ − µD ) / ( SD / √n ) ~ t(n−1)

(where n = number of differences)

If H0 is not true, the test statistic is not distributed as a t value.
The test is carried out exactly as for any other one-sample t-test.
For the example:
Patient before after diff(B-A)
1 130 125 5
2 122 121 1
3 124 121 3
4 104 106 -2
5 112 101 11
6 101 85 16
7 121 98 23
8 124 105 19
9 115 103 12
10 102 98 4
11 98 90 8
12 119 98 21
13 106 110 -4
14 107 103 4
15 100 82 18

Summary statistics for the differences:
d̄ = 9.27, sD = 8.61, n = 15

Here, a positive average difference implies a decrease, since Diff = Before − After.
H0: µD = 0
H1: µD ≠ 0

tobs = ( d̄ − µD ) / ( sD / √n ) ≈ 9.267 / ( 8.6145 / √15 ) ≈ 4.166

df = 15 − 1 = 14

p-val = P( | t14 | ≥ 4.166 ) ≈ 2 × 0.0005
MTB gives p = 0.00096 ≈ 0.001

∴ Reject H0 at the 1% significance level.
We can conclude there is strong evidence that the drug has an effect in lowering average blood pressure.

As for any t-test, the above could have been carried out as a one-tailed test.
[In this case a one-tailed test would probably have been more appropriate, as the patients are hypertensive, so we would be most interested in lowering their blood pressure.]
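(A Python sketch of the same paired analysis; scipy's paired t-test is equivalent to the one-sample t-test on the differences, and the before/after values are those listed on the earlier slide.)

from scipy import stats

before = [130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100]
after  = [125, 121, 121, 106, 101,  85,  98, 105, 103,  98, 90,  98, 110, 103,  82]

# paired t-test on (before, after) ...
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)                      # about t = 4.17 and p = 0.001 on 14 d.f.

# ... equivalent to a one-sample t-test of the differences against 0
diffs = [b - a for b, a in zip(before, after)]
print(stats.ttest_1samp(diffs, 0))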
100(1 − α)% C.I. for µD:

d̄ ± t(n−1, α/2) × sD / √n

For the example:
95% confidence interval for µD:

9.267 ± 2.145 × 8.614 / √15
= ( 9.267 ± 4.771 )
= ( 4.50 , 14.04 )

Again, the two-sided C.I. can be used for two-tailed tests of significance.
H0: µD = 0
H1: µD ≠ 0
0 lies outside the 95% C.I., so reject H0 at the 5% level of significance.
We are 95% certain the true average decrease in blood pressure is between 4.50 and 14.04 units.
Assumptions:
We are only assuming the differences are
approximately Normally distributed.
We are making absolutely no assumptions
about the original observations. In fact, if
they are not Normally distributed, it doesn’t
matter, as long as the differences are.
We are also assuming all the differences are independent. This is reasonable in the case of the hypertensive patients (e.g. unless some or all are members of one family).

For the example:
[Probability plot of diff, Normal - 95% CI: Mean 9.267, StDev 8.614, N 15, AD 0.306, P-Value 0.525]
Using Minitab:
Stat > Basic Statistics > Paired t
Q: If we did
have the
differences in a
column, what
would we do?

MTB > Paired 'Before' 'After'.


Paired T-Test and CI: Before, After
Paired T for Before - After
N Mean StDev SEMean
Before 15 112.333 10.472 2.704
After 15 103.067 12.555 3.242
Difference 15 9.26667 8.61449 2.2243
Note the sd of the differences is much
smaller than for the two separate groups !
95% CI for mean diff:(4.4961, 14.0372)
T-Test of mean diff = 0 (vs not = 0):
T-Value = 4.17 P-Value = 0.001
Testing as a two sample t-test:
If the paired design is ignored, and the
design is treated as two independent
samples (which it isn’t here, as the matching is
very strong from before to after) …

Obviously this analysis is WRONG!


MTB > TwoSample 'Before' 'After';
SUBC> Pooled.
Two-Sample T-Test & CI: Before, After
Two-sample T for Before vs After
N Mean StDev SE Mean
Before 15 112.3 10.5 2.7
After 15 103.1 12.6 3.2
Difference = mu(Before) - mu(After)
Estimate for difference: 9.26667
95% CI for diff: (0.61950, 17.91384)
T-Test of difference = 0 (vs not =)
T-Value = 2.20 P-Value = 0.037 DF=28
Both use Pooled StDev = 11.5608
74
Controlling for causes of variation :
In this experiment:
X (response) = blood pressure (in humans)
By pairing (by person), we are
controlling for variability due to such
sources as:
- age - stress - diet
- weight - sex - fitness etc.
Without pairing, the blood pressures
vary more (sp = 11.56) than the BP
difference (sD = 8.61).
This is because a large number of the
sources of variation have been mainly
controlled for (or removed) from the
experiment by doing the pairing. ☺
Example (hypothetical):
We wish to compare the
durability of two different brands of
Tyre (A and B) by measuring distance
driven until “bald”.

We have 20 cars (numbered 1, 2, …, 19, 20)


and drivers and as many of each brand
of tyre as we like.
The null hypothesis is that the mean distances for the two brands are the same.
There are many possible ways to
sensibly design an experiment here.
76
Two independent sample design:
Randomly allocate:
• 10 cars to Brand A tyres
(say, numbers: 2, 3, 4, 7, 10, 13, 15, 16, 17, 20)
and
• 10 cars to Brand B tyres.
(say, numbers 1, 5, 6, 8, 9, 11, 12, 14, 18, 19)

Drive the cars until one of the tyres is


“bald” … record X = distance driven .
Data:
XA1, XA2, … , XA10 10 independent
observations
XB1, XB2, … , XB10 per group

Use a two-independent-sample t-test (if the assumption of normality is reasonable).
Paired design:
Each car gets one Brand A and one
Brand B tyre, placed at the front, but
allocated to left/right at random.
Drive the cars until the first tyre is
“bald”, then (replace it! … safety!)
drive until the other is bald.
For each car we have two
observations … but they are NOT
independent (they were on the same car!)
Data: XA1 , XA2, …, XA19 , XA20
XB1 , XB2, …, XB19 , XB20
 D1 , D2, … , D19 , D20
NOT independent observations: Diff = A - B
Use a paired t-test (if the assumption of normality of the differences is reasonable).
Tyre Example (hypothetical):

By pairing here, we have


controlled for variability caused
by:
- driver skill
- road surface
- wheel alignment
- speed
- etc.
That is, over the relevant period, any factors which could affect tyre wear but which:
• stay the same on an individual car
and
• vary from car to car.
Paired v Unpaired:
To do a paired t-test the experiment has
to be physically set up (designed) in pairs.
This is NOT the decision of the person
doing the analysis !!
It was the decision of the person who
designed the experiment.
Examples:
• before and after
• left and right side of a person / animal
• top and bottom leaves of a plant
• two animals in a cage / people in a suburb
• two people the same age
• twins (especially identical ☺ )
80
• etc
Paired ? OR Unpaired?

One way to help determine if a


design is paired is to ask the
question:
“Would any information be lost if
the data were shuffled?”

If NO – the data is unpaired in structure


 do a two independent sample test
(pooled or unpooled)

If YES – the data has a pairing structure


 do a paired t-test on the differences

81
If the individuals have uniform
characteristics / properties – a paired
design is unnecessary.
Paired is often:
• harder to set up
• requires equal replication
• has fewer degrees of freedom

But: The greater the variation


between the individuals - the greater
the advantage in setting up the
experiment in pairs.
Warning: If you set up an unpaired two-sample test and you find large differences between the individuals - you are stuck with the two-sample design.
