
Sample Size and Power: What Is Enough?

Ceib Phillips

A basic understanding of statistical methodology is essential, both for designing quality research projects and for evaluating the medical literature. This article deals with the basic principles involved in sample size calculations and shows the concepts and factors that determine sample sizes for comparing means, proportions, and time-to-event measures. Special topics are also discussed including adjustments to sample size calculation and post hoc analyses of power. (Semin Orthod 2002;8:67-76.) Copyright 2002, Elsevier Science (USA). All rights reserved.

From the Department of Orthodontics, School of Dentistry, University of North Carolina, Chapel Hill, NC.
Supported in part by NIH DE10028 and DE05215.
Address correspondence to Ceib Phillips, MPH, PhD, Department of Orthodontics, CB7450 UNC-CH, Chapel Hill, NC 27599.
1073-8746/02/0802-0005$35.00/0
doi:10.1053/sodo.2002.32074
Seminars in Orthodontics, Vol 8, No 2 (June), 2002: pp 67-76

Every researcher who has been involved in a clinical study has an appreciation of the effort, cost, and often, inconvenience, to both investigators and study subjects. Even though investigators often hope for positive findings, in clinical research true negative findings make very valuable contributions to clinical practice. However, false negative findings can occur either by chance or because a study is underpowered, which means too few patients have been studied to allow a clinically meaningful treatment effect or difference to be detected.¹ It is important to realize that designing studies with inadequate sample sizes may lead to erroneous results and false conclusions and subsequently to inappropriate treatment of patients. Such studies not only drain existing limited resources but are also unfair to patients who have participated in them.

Underpowered studies cannot always be avoided. However, careful sample size calculations can guide researchers as to what can and cannot be accomplished in a study with a finite amount of resources. In fact, National Institute of Craniofacial Research guidelines for clinical trials⁹ now mandate the inclusion of such information on all clinical trial applications.

Sample size calculations are made using various assumptions about the anticipated treatment effect or differences between treatments together with realistic projections concerning patient accrual and follow-up. Calculating a sample size requires four things: (1) deciding on the design of the study; (2) assessing the availability of resources; (3) specifying distribution assumptions; and (4) perhaps most challengingly, defining a clinically relevant effect. Table 1 provides a set of questions that can help organize the information needed to calculate an appropriate sample size.

Table 1. Key Questions Before Sample Size Calculation
1. What is the primary outcome? Clinicians frequently include multiple outcomes in clinical studies. This creates a dilemma for calculating a sample size (n) because the n required will vary from outcome to outcome. One or two outcomes should be designated as the primary basis for sample size determinations. Other outcomes typically are considered for secondary analyses.
2. What's the scale of measurement for the outcome? Outcomes can be classified as binary or dichotomous (yes/no, present/absent), categorical (mutually exclusive categories), ordinal (ranked or rated), or continuous (measures on which arithmetic operations can be performed). In general, binary and categorical measures provide less information and thus lower power, requiring larger sample sizes.
3. What's the variability of the outcome? For binary measures, the variability is directly related to the proportion of positives (yes or present responses). For continuous measures, the sample size required increases as the standard deviation increases.
4. What is the desired level of significance and power? Typically, level of significance is set at 0.05 or 0.01, whereas power is set at 80% or 90%. Increasing the power or decreasing the level of significance increases the sample size required.
5. Are there special features in the study design? These include multigroup comparisons, clustered outcome data (such as multiple observations of the same subject or multicenter studies), long-term follow-up when dropout or attrition is expected, and inclusion of covariates (other measures that are believed to somehow affect the relationship between the treatment and the outcome). For many studies, software packages such as nQuery Advisor or EpiInfo can be used to make preliminary sample size determinations. However, special features often involve nonstandard calculations and a statistician should be consulted.
6. What defines a clinically relevant effect (effect size)? This is perhaps the most challenging question. The clinician must define the clinically relevant effect in terms of the primary outcome. What difference expressed in the values of the outcome measure would alter/change/modify the way patients are treated?

The statistical methodology for calculating sample size has been extensively developed over the years. Although the sample size calculations are performed using mathematical methods, the preparation for the calculation requires both statistical reasoning and clinical experience. Designing Clinical Research³ and Statistics⁴ are two introductory textbooks that provide a basic conceptual overview of sample size and power concepts. In this article, the basic principles underlying the sample size calculation are reviewed, and examples of sample sizes are provided.

Three kinds of data will be presented with a discussion of how each kind influences sample size calculations. The three kinds of data are: (1) nominal data, information collected in studies that can be classified by categories such as man/woman or Class I/II/III (Angle's classification); (2) continuous data, information collected in studies that can be measured on a continuous scale, such as degrees (incisor inclination, mandibular plane angle), millimeters (overjet, length of mandible), or grams of force (headgear force, shear bond strength); and (3) time-to-event, information collected in studies that measures how long it takes for a specified event to occur, such as number of months to correct to Class I molars, and what proportion of subjects reach the event in a certain amount of time.

Hypothesis Testing

Clinical research is designed to determine if a treatment has an effect or if different treatments produce different outcomes. Suppose we believe that removable appliances increase mandibular length. We might assume that in untreated Class II children, age 7 to 10 years, mandibular length growth during a 1-year period is a normally distributed variable (bell-shaped curve) with a mean μ₁ and a standard deviation (SD) σ₁. The mean refers to the average amount of mandibular growth in 1 year and is represented by μ₁ because we do not know the true amount of growth. The SD tells us how much variation we can expect in the amount of annual growth among all children age 7 to 10 years. The SD is represented by σ₁. Another special term is the standard error of the mean (SEM). SEM is a measure of variability similar to the SD. SD indicates the spread of the values for a measurement such as mandibular length changes. SEM tells us about the spread of values of the means that would occur if we drew all of the possible samples of a given size from the population and calculated the mean for each sample. It is important to understand the concept of means and SDs because their estimates are used to calculate sample sizes.

If we draw a random sample of size (n₁) from this population of 7- to 10-year-old children and measure the mandibular length changes (x₁), we would expect that the mean (average) of that sample would be scattered around the true (but unknown) amount of mandibular growth (μ₁), with a standard error given by σ₁/√n₁ (Fig 1A). Now assume that we simultaneously draw another random sample of size (n₂) from this same population, but this time the children in the second sample have been treated with removable appliances. If treatment with a removable appliance does affect mandibular growth, then we would expect that the mandibular length change (x₂) measured in the second sample would be distributed around a different mean (μ₂), with a standard error σ₂/√n₂ that depends on the variability (σ₂) in the mandibular length changes in children treated with removable appliances.

We assume that both sample means are valid estimates of what we would have found if we could have measured both entire populations of interest. One population would have consisted of all untreated Class II children between 7 and 10 years. The other population would have consisted of all Class II children 7 to 10 years who were treated with a removable appliance. The greater the difference between x̄₁ and x̄₂, the sample means for the untreated and treated children, the more likely it is that the values of the removable appliance population are not distributed about μ₁ but around some other value
(μ₂). In other words, the larger the separation between the two sample means, the stronger the evidence that the removable appliance has an effect on mandibular growth. If removable appliance therapy had no effect on mandibular growth, then we would expect very little (less than some clinically meaningful amount) or no difference between the two sample mean values (x̄₁ and x̄₂), and we would expect the mandibular length changes of the treated children to be distributed, just like those of the untreated children, around μ₁.

Figure 1. The curve on the left shows the distribution of means of random samples of size (n₁) drawn from a normally distributed population with mean μ₁ and SD σ₁. The curve on the right shows another normally distributed population with mean μ₂ and SD σ₂ from which random samples of size n₂ are drawn. The horizontal hatched areas show α, the risk of rejecting the null hypothesis (H₀: μ₁ = μ₂) when it is in fact true (false positive). The vertical hatch represents β, the risk of accepting the null hypothesis when in fact it is false (false negative). The alternative hypothesis is specified as H₁: |μ₁ − μ₂| = Δ rather than H₁: |μ₁ − μ₂| ≠ 0 because there are infinite possibilities for |μ₁ − μ₂|, and each is associated with a different likelihood of a false negative when α and the sample sizes are specified. The figure is drawn so that σ₁ = σ₂, n₁ = n₂, and α = .05.

Unfortunately, the true effect of removable appliance therapy on mandibular growth is unknown. The investigator seeks to answer the research question by drawing conclusions (known as making statistical inferences) based on the available data and probabilities. Because probabilities are not certainties, these conclusions can yield four outcomes with regard to the actual but, unfortunately, unknown truth about the difference between groups. Either the data has led the investigator to arrive at a correct decision relative to what actually happens in the population (a true negative or a true positive conclusion) or to an incorrect decision (a false positive or a false negative conclusion). These four outcomes are depicted in Table 2.

Table 2. Four Possible Outcomes When Comparing the Interpretation of a Hypothesis Test and the True Population Finding Between Treatment and No Treatment Groups

                                  Truth in the Population
Hypothesis test interpretation    Not different                        Different
Not different                     Correct conclusion (True negative)   Type II error (False negative) (β)
Different                         Type I error (False positive) (α)    Correct conclusion (True positive) (Power = 1 − β)

The strategy for arriving at these conclusions starts with formulating a null hypothesis (H₀). If we were interested in what happens in just one of the groups, the null hypothesis in our example would state that there is no change in mandibular length during 1 year (μ₁ = 0). If we were interested in comparing the changes that occur between the two groups, the null hypothesis would state that there is no difference in treatment effect (μ₂ = μ₁ or μ₂ − μ₁ = 0). The expression μ₁ represents the average growth in population 1 (untreated children). The expression μ₂ represents the average growth in population 2 (treated children). Thus, μ₂ = μ₁ is shorthand for saying that the average growth in population 1 (untreated children) is equal to the average growth in population 2 (treated children). The alternative way to write it, μ₂ − μ₁ = 0, is shorthand for saying the difference in the average growth between the two populations is 0. In our example, the null hypothesis would be that the average mandibular growth observed in the untreated children is the same as that observed in the children treated with removable appliances.

Next, we assume that the null hypothesis is true, and we calculate the probability (P) of observing just by chance an absolute difference in mandibular growth equal to or greater than that actually measured for the two samples. The decision to accept or not accept the null hypothesis is based on the relationship between the calculated P value and the risk the investigator is willing to take of incorrectly rejecting the null hypothesis. This risk level (α) is the level of significance and is chosen by the investigator before the start of the study. In other words, α
represents the risk the investigator is willing to take of incorrectly rejecting the null hypothesis (and thus incorrectly concluding there is a difference between the two groups). This is known as a type I error (or false positive) (Table 2). Investigators usually set α at 1% or at 5% (α = .01 or α = .05) to describe the risk they are willing to take of a false-positive error rate.

When the P value (ie, the probability associated with finding a difference in growth of the observed magnitude occurring, just by chance, if there is no real difference in the population) is greater than α, a result is said to be not statistically significant. The shorthand is P > .05 (if α is chosen by the investigator to be .05) or P > .01 (if α is chosen by the investigator to be .01).

A nonsignificant P value does not imply that the null hypothesis is true, that there is no difference between the two treatments, or that the two treatments are equivalent. A nonsignificant P value tells you merely that there is inadequate weight of evidence against the null hypothesis. Nonsignificant results are inconclusive because the default position represented by the null hypothesis has been neither confirmed nor rejected.

The distinction between failure to reject the null, which is the correct designation when P is greater than α, and accepting the null is important. The first implies a lack of evidence on which to reject the null, whereas the latter implies the null is true. Three possibilities exist when P is greater than α (eg, P > .05) (Table 2): (1) there is no difference or no effect (true negative); (2) you were unlucky and there really is a difference in the population, but your samples do not reflect this population difference (false negative by chance); or (3) the observed difference is real and clinically important, but the sample size was too small to reach statistical significance (false negative because of poor study design).

If the P value is less than or equal to α (P ≤ α), the result is said to be statistically significant. This does not prove with 100% certainty that the null hypothesis is false nor does it mean that the difference is clinically important. The null hypothesis is simply rejected because there is too much evidence against it: the likelihood of observing such a difference by chance, if there is no effect or no difference, is less than the acceptable risk level (α) chosen by the investigator. Two possibilities exist when P is less than α (eg, P < .05) (Table 2): (1) there is a difference or effect (true positive), or (2) you were unlucky and there really is not a difference in the population, but your samples suggest there is (false positive by chance).

If the decision is made to reject the null hypothesis, then what? An alternative hypothesis (H₁) becomes the fall-through decision. The alternative hypothesis is generally established as two-tailed, which means that the direction of the difference is not specified (ie, the change in one group may be either larger than or smaller than the other group). In other words, if the means are not the same then they must be different.

When the P value fails to reach statistical significance (ie, P > α) even though the underlying groups are truly different, then a type II error (false negative) has occurred (Table 2). The probability of committing a type II error is called β. β is the level of risk of accepting a false negative conclusion: the risk that an investigator is willing to take of declaring there is not a difference when there is one (false negative). This is not the same as α, the risk an investigator is willing to take of declaring there is a difference when there is not one (false positive).

The likelihood (or probability) of avoiding a false-negative error is termed the statistical power of the study. Power thus expresses the probability of detecting a true effect, a true positive. Adequate power traditionally has been defined as 80% or 90% (ie, 0.8 or 0.9) and is equal to (1 − β). If power = 0.8 or 0.9 then β = 0.2 or 0.1. Thus, the risk of committing a type II error (β) is 0.2 or 0.1. (Remember that the risk of committing a type I error is α.)

For specific sample sizes and values for α, power expresses how likely it is that a study will detect a difference of a certain magnitude when that difference really exists in the population. If power is 90% then we expect 9 out of 10 trials (identical studies) to indicate a statistically significant difference exists when in truth a difference does exist in the population. If the power is 10%, then we expect that in only 1 out of 10 trials would a statistically significant difference occur even though a true difference exists.

Another way to explain power is to say that when the null hypothesis is really false (there is truly a treatment effect), a study with a power of 10% has only a 10% chance of rejecting the null hypothesis and a 90% chance of being inconclusive. Thus, if there is a true treatment effect, you have only a 10% chance of detecting this difference. Your study is underpowered.

The specification of the alternative hypothesis as H₁: μ₁ ≠ μ₂ (two-tailed alternative hypothesis) works in hypothesis testing only if the decision of interest is the risk we are willing to take of accepting a false-positive decision (type I error). In other words, do we only care about whether or not we have evidence to reject the null hypothesis? If, however, the risk of a false-negative decision (type II error) is important to us or if a sample size or power calculation is desired, then the difference of interest must be explicitly stated. The general inequality of H₁: μ₁ ≠ μ₂ must be replaced by the form H₁: |μ₁ − μ₂| = 1.25 mm or whatever value the investigator considers clinically meaningful. In other words, the type II error rate or power cannot be calculated for a two-tailed alternative, which merely says that the two groups are different.

Calculating the probability (β) of a type II error, the risk of accepting a false-negative decision, requires that the size of the difference between the two groups be explicitly stated. For example, in our study we may decide that a true absolute difference in the average amount of mandibular growth in the treated and untreated children would be clinically meaningful if the difference were 1.25 mm (H₁: |μ₁ − μ₂| = 1.25). The size of this difference should represent the minimum effect that would be considered clinically relevant.

As the value specified by the alternative hypothesis changes (ie, the clinically relevant effect), the sample size required to detect a specified difference and the power of a study to detect that difference also change (if you do not change the α level and the variability in your study stays the same). This is discussed further later.

Clinically Relevant Effect

The specification of the clinically relevant effect or difference is based on clinical judgment, not statistical inference. The researcher should specify beforehand what magnitude of treatment effect would be regarded as the minimum effect size of interest (ie, the minimum change or difference that would be clinically meaningful). Because not all clinicians will share the same opinion regarding the effect that would be required to change their clinical practice, it may be prudent to obtain input from a broader base than a single investigator or even a single academic care center's faculty.⁵ Detecting small differences will generally require more patients than showing really strong treatment effects that result in large differences.

Because it is often difficult to choose the exact magnitude of treatment effect that is of interest, it is important to plan the study so that there is a high likelihood of rejecting the null hypothesis if the minimum treatment effect or difference that would be of therapeutic importance exists. There are times when a 10% difference is irrelevant; there also are times when a 10% difference is crucially important. If the rate of apical resorption after orthodontic treatment were only 10% higher than in untreated subjects with a P value of .06, it is likely that the investigator would not be disappointed that statistical significance was not attained. However, a statistically nonsignificant finding that early orthodontic treatment reduced incisor trauma by only 10% would be disappointing and inconclusive. The goal of any clinical study should be to have sufficient numbers of subjects so that clinically meaningful effects are also statistically significant.

The Role of Variability

The results of a clinical trial are based on a limited sample of patients, and therefore the observed treatment difference is simply an estimate of the true treatment difference. Even a well-designed study can only give an idea of the answer sought because of random variation in the sample studied and the variation that would occur between the samples if several samples were drawn from the population. Thus, results from a single sample are subject to statistical uncertainty, which is inversely related to the size of the sample. In other words, statistical uncertainty decreases as sample size increases. For example, you are more likely to have confidence in an estimate of the rate of incisor trauma if this estimate is based on a study of 200 rather than on a study of 20. An estimate based on the larger sample is likely to be more precise than the one based on the smaller sample.

In a study measuring continuous variables, the extent of the variability of the measure
72 Ceib PhiUips

a m o n g the patients will affect the sample size or Sample Size for Comparing Proportions
power calculations. As the variation from patient
Consider a trial designed to compare the rate of
to patient increases, detection of the same level
maxillary incisor trauma in orthodontically
of difference between treatments will require
treated and untreated children ages of 7 to 10
increasing numbers of subjects. For example, in
years old with an overjet of at least 7 ram. Previ-
our mandibular growth study, if we assume that ous studies have estimated that 20% of children
c~ = .05 a n d / 3 = .1, only 24 subjects per group who do not have orthodontic treatment experi-
would be n e e d e d to detect a 1.25-mm difference ence incisor trauma. We decided that an abso-
in mandibular length growth if the variability lute difference of 5% (.05) would be considered
was low (SD = 1.3 ram). On the other hand, 73 clinically meaningful. We also set 0~ = .05 (risk of
subjects would be required if the variability was thlse positive [type I] error) and/3 = .20 (risk of
greater (SD = 2.3 mm). Although the extent of false negative [type II] error). Because we are
the variability cannot be known in advance, it comparing proportions of children in each
can be estimated from several sources, includ- g r o u p who will experience incisor trauma, we
ing previous studies, pilot data, and educated use the )(2 formula in n query ~ to compare two
guesses. proportions for the two-tailed alternative hypothe-
sis (Ha: Iproportion 1 - proportion 2[ = .05).
These calculations tell us that 906 patients per
Estimation of Sample Size group or a total of 1,812 children (2 groups ×
Requirements 906 subjects/group) are necessary to detect the
difference we consider clinically meaningful.
The methods used to calculate the sample size Setting /3 = 0.2 gives us 80% power to detect a
will d e p e n d on the analysis plan (ie, what statis- statistically significant difference if it is true that
tical test will be used to analyze the outcome). In there is a 5% difference in the p r o p o r t i o n of
trauma (effect size) in the population between treated and untreated children 7 to 10 years old with an overjet of at least 7 mm.

addition to the analysis plan, three parameters must be specified before the calculation of a sample size: (1) the effect size of interest, (2) the acceptable risk of a false-positive finding (α), and (3) the acceptable risk of a false-negative finding (β).

Other information, such as an estimate of the variability for a continuous variable, may be required. The impact of varying the statistical parameters and/or the magnitude of the effect size on sample size requirements is discussed for comparisons of the following types of data: two proportions for a dichotomous measure (only two responses possible, eg, yes/no, present/absent), two means of a continuous measure (mm, degrees, grams), and two time-to-event measures (length of time until a specified event occurs).

Sample size calculations depicted in the tables to follow were accomplished by using the software package nQuery Advisor Version 4.0 (Statistical Solutions, Boston, MA), although investigator-written computer programs and software packages such as EpiInfo are available. EpiInfo is a free software package that includes sample size and power calculations. EpiInfo is available from the Centers for Disease Control and Prevention (www.cdc.gov).

Table 3. Two Proportions Compared

    Proportion of Trauma
Untreated       Treated
Children (%)    Children (%)            Power    Sample Size
(P1)*           (P2)†           α       (%)      Per Group
20              15              .05     80        906
20              10              .05     80        199
20               5              .05     80         76
20              15              .05     80        906
20              15              .05     90       1212
20              15              .01     80       1348
20              15              .01     90       1717
20              15              .05     80        906
50              45              .05     80       1565

*P1 is the proportion of untreated children who are expected to experience incisor trauma.
†P2 is the proportion of treated children who are expected to experience incisor trauma.

Table 3 shows various combinations of values for α, β, and effect size (magnitude of a clinically meaningful difference) and their influence on the sample size when the effect size is expressed in proportions. The effect of β is more easily understood when expressed as power (1 - β). It is apparent that to detect a small effect size

(difference in % incisor trauma = 5%) requires more patients than required to detect a large effect size (difference in % incisor trauma = 15%) if the difference actually exists. It is also clear that if more restrictive error rates (that is, decreasing the level of α and/or β) are chosen to reduce the risk of observing false-positive and false-negative results, larger sample sizes are necessary. Finally, it is noteworthy that the sample size is greatest when the proportion of an effect (ie, the rate of incisor trauma) in one of the groups is 0.5 or 50% (as long as α and β are constant).

Table 4. Two Means Are Compared

Mean Annual Mandibular
Growth (mm)
Untreated    Treated    Standard              Power    Sample Size
Group        Group      Deviation     α       (%)      Per Group
2            3          1.5           .05     80       37
2            4          1.5           .05     80       10
2            3          1.5           .05     80       37
2            3          2.0           .05     80       64
2            3          1.5           .05     80       37
2            3          1.5           .05     90       49
2            3          1.5           .01     80       55
2            3          1.5           .01     90       69
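
As a rough check on the pattern in Table 3, the per-group sample size for comparing two proportions can be approximated with the standard normal-approximation formula. What follows is a minimal sketch under stated assumptions: a two-sided test, equal group sizes, and no continuity correction. The function name and the choice of Python are illustrative only, and this is not the nQuery Advisor routine that produced the article's tables.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison of two
    independent proportions (normal approximation, equal group sizes).

    Assumption: this textbook formula stands in for the software used in
    the article; the function name is hypothetical.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power = 1 - beta
    p_bar = (p1 + p2) / 2                          # average proportion under the null
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)        # round up to whole patients


# Incisor trauma proportions taken from Table 3 (untreated, treated).
for p1, p2 in [(0.20, 0.15), (0.20, 0.10), (0.20, 0.05), (0.50, 0.45)]:
    print(p1, p2, n_per_group_two_proportions(p1, p2))
```

Under these assumptions the sketch returns 906, 199, 76, and 1565 patients per group for the four pairs shown, which agrees with the corresponding rows of Table 3. Other software, or a formula with a continuity correction, may give somewhat different numbers.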

Sample Size for Comparing Continuous Measures

Consider the same trial but now assume the intent is to compare the mandibular growth of treated and untreated children. Mandibular growth is measured on a continuous scale and is therefore summarized for each group by means. The estimate of mandibular growth (derived perhaps from other studies or a pilot sample) in untreated children is 2 mm per year. Assume that an increase of 50%, to 3 mm per year, in the treated group would be considered to be clinically worthwhile. Thus, the effect size we deem meaningful is the difference between 3 mm and 2 mm growth, or an effect size of 1 mm. Again the sample size will be calculated given a two-sided test with an α of .05 and a power of 80% (β = .2).

Because mandibular growth is a continuous measure, an estimate of the variability is also needed. We will assume an SD of 1.5 mm based on findings from previous studies. By using the formula for an unpaired t test to compare two means,⁶ 37 patients per group would be required to detect an effect size of 1 mm given the specified α, power, and variability (α = .05, β = 0.2, SD = 1.5 mm). Thus, a total of 74 patients would be required for the study (2 groups × 37 patients/group). Table 4 shows that for comparison of continuous data, sample size increases as the effect size decreases (size of the difference in growth decreases), the variability increases (SD increases), or the error rates are more restrictive (α and β decrease). Remember that power (1 - β) increases as β decreases or becomes more restrictive.

Sample Size for Comparing a Time-to-Event Measure

Time-to-event analysis is useful when the investigator is interested not only in whether some event occurs but also in how long it takes for the event to occur. An example of such an event is the clinical endpoint of correcting the Class II molar occlusion to a Class I molar occlusion. Time-to-event analyses are designed for studies in which patients are entered into a trial and followed until a specified event occurs (attainment of Class I molars), the patient is lost to follow-up, or the study ends. In clinical studies in which the main outcome is the time to an event, the power of the study depends on the number of events observed during the trial rather than the number of patients.

As an example, let us assume that in the Class II early treatment trial, the measure of interest was defined as the time required for a child treated with a removable appliance to reach the clinical endpoint of Class II molar correction compared with the time required if a headgear was used. The event would be thus defined as reaching the clinical endpoint of attaining Class I molars. Not all patients will attain Class I molars in the specified time period of the trial. Enough patients therefore must be entered and followed for a sufficient length of time to observe a critical number of events (ie, a certain number of molar corrections). Thus, the sample size calculation involves two stages: (1) deciding on the required number of events that will be considered meaningful and (2) calculating the required sample size based on this number of events.

Before the calculations can be performed, the

investigator must specify the event, a reasonable length of time within which the event might occur, an estimate of the proportion of children in one of the groups who would have had the event by the specified time, the magnitude of the difference in the proportions in the two groups, and finally, the length of time available to recruit and follow patients.

Assume that we decide that it is important to be able to detect an absolute difference of 15% between the two groups in the proportion of children that reach the clinical endpoint by 6 months. In our example, let's estimate that 25% of the removable appliance group would reach the clinical endpoint (Class I molars) by 6 months. A 15% difference in proportion of the groups to reach Class I molars means either a 15% increase or a 15% decrease in the proportion of headgear patients to reach Class I molars compared with the 25% of removable appliance patients we expect to reach this critical event. The sample size should be sufficiently large that if at least 40% of the headgear group (15% more than the removable appliance group) or only 10% of the headgear group (15% less than the removable appliance group) reached the endpoint by 6 months, the difference would almost always be detected as statistically significant. (A proportion in the headgear group between 10% and 40% would not be considered a clinically meaningful difference according to our criteria, and therefore we are not interested in designing our study to detect if differences in this range are statistically significant.) Assume that the trial is designed to recruit patients for 24 months (length of time for patient accrual) and the maximum follow-up time is 24 months. By using our program⁶ and setting α at .05 and power at 80%, 119 children per treatment group, for a total of 238 (2 groups × 119 children/group), would need to be recruited within the 24 months to detect an absolute difference of 15% between groups in the number of children who attain Class I molars in the specified time.

In studies in which time to event is short, the number of patients will be almost the same as the number of events because all patients will be followed until the event. However, in trials in which an event takes a long time to occur, a larger number of patients must be recruited to observe the required number of events. Table 5 shows that the sample size increases as the magnitude of the difference in proportions of events decreases, the estimated length of time to event increases, and as the length of time available for recruitment and follow-up of patients decreases relative to the length of the time to event.

Table 5. Occurrence of Event Are Compared for Two Groups

Estimated Length     Accrual    Follow-Up
of Time-to-Event     Time       Time                         Sample Size*
(mo)                 (mo)       (mo)         P1†     P2‡     Per Group
 6                   24         24           .25     .40     119
 6                   24         24           .25     .35     258
 6                   24         24           .25     .30     984
12                   24         24           .25     .40     155
18                   24         24           .25     .40     193
 6                   18         24           .25     .40     102
 6                   12         24           .25     .40      97
 6                   24         30           .25     .40     100
 6                   24         36           .25     .40      95

*Sample size estimated using an exponential survival curve approach with a specified accrual period.⁶
†P1 is the proportion of subjects in Group 1 to reach defined event.
‡P2 is the proportion of subjects in Group 2 to reach defined event.

The difference in the sample sizes that are required for these three different types of comparisons emphasizes an important point: the investigator must declare not only the effect size of interest but also the primary outcome of interest. If all three outcomes (proportion of incisor trauma, average mandibular growth, and proportion of subjects who attain Class I molars within 6 months) shown earlier were considered primary (important to both the investigator and the reader), then 906 children per group would be the appropriate number of subjects. This will ensure that the primary outcome requiring the largest sample size (ie, difference in incisor trauma proportion) is adequately recruited. Although a study that is based on only 37 patients/group may lead to conclusions about the difference in mandibular growth for the two groups that are both clinically important and statistically significant, no valid conclusions could be made with so few patients with regard to the incidence of incisor trauma or to the time to reach the clinical endpoint of molar correction. Thus, the primary outcome that requires the largest sample size should be used to determine the number of subjects that must be recruited to the study.
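
The 37-per-group figure for the mandibular growth comparison, and its contrast with the 906 per group needed for incisor trauma, can be checked with a short sketch. It uses the normal-approximation formula for two means plus a small-sample correction term (the squared α critical value divided by 4). Both the formula and the function name are assumptions made here for illustration; the article's figures came from the unpaired t test routine in nQuery Advisor, and an exact calculation based on the noncentral t distribution could differ slightly.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided unpaired comparison
    of two means with a common SD and equal group sizes.

    Assumption: normal approximation with a z_alpha**2 / 4 correction, used
    as a stand-in for an exact t-test calculation; the name is hypothetical.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power = 1 - beta
    d = delta / sd                                 # standardized effect size
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    return ceil(n_normal + z_alpha ** 2 / 4)       # round up to whole patients


# Effect size of 1 mm/year with an SD of 1.5 mm, as in the text.
n = n_per_group_two_means(delta=1.0, sd=1.5)
print(n, 2 * n)  # per-group and total sample size
```

With an effect size of 1 mm and an SD of 1.5 mm this gives 37 per group (74 in total), and raising the SD to 2.0 mm gives 64 per group, both in agreement with Table 4.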

Adjustments to Sample Size Calculation

Sample size calculations represent the sample size per group needed at the end of the study and therefore underestimate the number of subjects who need to be enrolled if the study requires that patients be evaluated on more than one occasion. No matter how carefully a study is planned, some patients are likely to be lost between enrollment and completion (Fig 2). Patients may be dropped because of violations of the inclusion or exclusion criteria or patient compliance, protocol errors of misdiagnosis or incorrect assignment, or patient attrition resulting from loss to follow-up or missed examinations.⁷

[Figure 2 flowchart: Screened (n1) → Meet study criteria (n2) / Do not meet study criteria (n3) → Invited to participate (n4) → Accept participation (n5) / Refuse participation (n6) → Patient does not finish study (n7) / Patient finishes study (n8; sample size calculation)]

Figure 2. Subsets of patients to be considered when planning a study. Calculating the sample size to detect a clinically meaningful effect is not the only consideration in the number of patients to be screened and enrolled in a study. The investigator will also need to estimate the proportion of patients who are likely to meet the study inclusion criteria, the proportion who will agree to participate, and the proportion who are likely to complete the study protocol (ie, the retention rate). (Modified from Phillips C: Design principles in oral and maxillofacial surgery clinical trials. Oral Maxillofac Surg Clin 13:237-243, 2001)

Special attention should be paid during the development of the protocol to mechanisms that will minimize loss of patients. The attitude of clinic staff toward patient complaints and the vigorous pursuit of patients who fail to keep study appointments are important factors in reducing dropout.⁸ Strategies that have been used include reminding patients of forthcoming appointments, assisting with transportation, minimizing waiting times, sending newsletters, providing monetary reimbursement, providing continuity of care, involving family members, and maintaining contact with patients' primary care health professionals. Orr et al⁹ found, in a multicenter clinical trial, that the factors most strongly associated with incomplete follow-up could not have been identified before enrollment. These included changes in marital or employment status, motives for enrolling in the trial, and too little time spent with the study clinician.

Sample size estimates need to be adjusted for any expected loss from the study. The number of subjects to be enrolled in the study can be calculated as the sample size required per group divided by the expected retention rate. For example, if the sample size calculation indicates 37 patients per group are needed and if the expected retention rate for the trial is 80%, then the number of enrolled subjects should be 47 per group (37/.8 = 46.25, rounded up). As might be expected, the number of subjects needed for enrollment increases as the projected retention rate decreases.

Post Hoc Analyses of Power

On occasion, well-designed studies with prior sample size and power calculations result in statistically nonsignificant results. The concern of course is that the failure to reject the null hypothesis (ie, no treatment effect) may be the result of low statistical power. The estimate of variability may have been too low, or an important effect actually may have existed that the investigator missed by chance, a false-negative outcome. This has been called the dilemma of the nonrejected null hypothesis.¹⁰ Some authors have advocated postexperiment power calculations¹¹ to explain the observed data (ie, calculating the power associated with the observed effect or finding the effect that could have been detected with higher power given the sample size, level of significance, and variability). However, these calculations can lead to contradictory evidence for and against the null hypothesis.¹⁰

An alternative approach to determining if population values (such as the true but unknown difference) are supported by the data is the calculation of confidence intervals. For some statisticians, this approach has more validity than calculating the power after the study has been completed. A 95% confidence interval is the range of values that you are 95% certain contains the true but unknown value of the difference you are seeking. The interval is bounded by

a lower limit and an upper limit and is determined by rigorous statistical methodology using specific statistics derived from the sample in your study. If the value of zero is contained within the 95% confidence interval, then it is possible that there is no treatment effect because the difference between the treated group and untreated group may equal zero.

For example, using our earlier example that compared the average annual difference in mandibular growth between 37 treated and 37 untreated children, let's assume you found that the mean difference in growth was 1 mm, but the SD was 2.5 mm. The SD of your actual data turned out to be much higher than the value you assumed for the sample size calculation. If an unpaired t test were used to make the comparison between the treated and untreated groups using these sample values (mean difference of 1 mm and SD of 2.5 mm), the P value would be .09, which tells you that there is no statistically significant difference in the average amount of mandibular growth. If we calculated the 95% confidence interval for this comparison, we would have found the interval to be defined by the limits of (-0.16 mm, +2.16 mm).

The difference in mandibular growth was calculated by subtracting the average amount of growth in untreated children from the average amount of growth in treated children. Thus, values greater than 0 tell you that the treated children grew more than untreated children. Values less than zero tell you that the treated children grew less than untreated children. A value of zero means there is no difference in the growth between treated and untreated children. Although we still do not know what the real difference in growth is, we can say that we are 95% confident that, in the population, untreated children, on average, might grow as much as 0.16 mm more than treated children (this corresponds to -0.16 mm) and that, on average, treated children will not grow more than 2.16 mm more than untreated children. The confidence interval includes zero, but the asymmetry of the negative and positive boundaries around zero is in the direction of the hypothesized effect (ie, that treated children will grow more than untreated children) and could be supportive evidence for a similar study with a sample size calculation based on larger variability estimates. If, however, the confidence interval was symmetrical around zero (eg, -1.5, +1.5) then the investigator has no evidence to support a directional effect in the population.

Conclusions

A study with negative results but adequate power to detect clinically meaningful differences may be a valuable contribution to the literature. A negative study with inadequate power is inconclusive at best. The question remains: is the sample size calculation worthwhile? The answer is yes, although sample size calculations are based on many assumptions and represent our best guess. These calculations provide a guide to the feasibility and practicality of undertaking a study and focus attention on the consideration of what clinical effect, if it exists, would impact patient care.

References

1. Cohen J: Statistical Power Analysis for the Behavioral Sciences (ed 2). Hillsdale, NJ, Lawrence Erlbaum, 1988
2. NIDCR Policies and Procedures for Investigator Initiated Clinical Trials. Available at: www.nidcr.nih.gov/research/ctp/clinical%5Ftrials.htm. Accessed 19 October 2001
3. Hulley SB, Cummings SR: Designing Clinical Research: An Epidemiologic Approach. Baltimore, MD, Williams and Wilkins, 1988
4. Freedman D, Pisani R, Purves R, et al: Statistics (ed 2). New York, NY, WW Norton, 1991
5. Fayers PM, Cuschieri A, Fielding J, et al: Sample size calculation for clinical trials: The impact of clinician beliefs. Br J Cancer 82:213-219, 2000
6. Elashoff JD: nQuery Advisor Version 4.0 User's Guide. Boston, MA, Statistical Solutions, 2000
7. Phillips C, Tulloch JFC: The randomized clinical trial (RCT) as a powerful means for understanding treatment efficacy. Semin Orthod Dentofac Orthoped 1:128-138, 1995
8. Goldman JF, Holcomb R, Perry HM, et al: Can dropout and other noncompliance be minimized in a clinical trial? Report from the Veterans Administration National Heart, Lung and Blood Institute cooperative study on antihypertensive therapy: Mild hypertension. Control Clin Trials 3:75-89, 1982
9. Orr PR, Blackhurst DW, Hawkins BS: Patient and clinic factors predictive of missed visits and inactive status on a multicenter clinical trial. Control Clin Trials 13:40-49, 1992
10. Hoenig JM, Heisey DM: The abuse of power: The pervasive fallacy of power calculations for data analysis. Am Statistician 55:19-24, 2001
11. Berry E, Coustere-Yakir, Grover NB: The significance of non-significance. QJM 91:647-653, 1998
