
A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation

BRADLEY EFRON and GAIL GONG*

This is an invited expository article for The American Statistician. It reviews the nonparametric estimation of statistical error, mainly the bias and standard error of an estimator, or the error rate of a prediction rule. The presentation is written at a relaxed mathematical level, omitting most proofs, regularity conditions, and technical details.

KEY WORDS: Bias estimation; Variance estimation; Nonparametric standard errors; Nonparametric confidence intervals; Error rate prediction.

1. INTRODUCTION

This article is intended to cover lots of ground, but at a relaxed mathematical level that omits most proofs, regularity conditions, and technical details. The ground in question is the nonparametric estimation of statistical error. "Error" here refers mainly to the bias and standard error of an estimator, or to the error rate of a data-based prediction rule.

All of the methods we discuss share some attractive properties for the statistical practitioner: they require very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated. (We will give an example of a very complicated prediction rule indeed.) An important theme of what follows is the substitution of raw computing power for theoretical analysis.

The references upon which this article is based (Efron 1979a,b, 1981a,b,c, 1982; Efron and Gong 1982) explore the connections between the various nonparametric methods, and also the relationship to familiar parametric techniques. Needless to say, there is no danger of parametric statistics going out of business. A good parametric analysis, when appropriate, can be far more efficient than its nonparametric counterpart. Often, though, parametric assumptions are difficult to justify, in which case it is reassuring to have available the comparatively crude but trustworthy nonparametric answers.

What are the bootstrap, the jackknife, and cross-validation? For a quick answer, before we begin the main exposition, we consider a problem where none of the three methods is necessary: estimating the standard error of a sample average. The data set consists of a random sample of size n from an unknown probability distribution F on the real line,

    X_1, X_2, ..., X_n ~ F  (independent and identically distributed).   (1)

Having observed X_1 = x_1, X_2 = x_2, ..., X_n = x_n, we compute the sample average x̄ = Σ_{i=1}^n x_i/n for use as an estimate of the expectation of F.

An interesting fact, and a crucial one for statistical applications, is that the data set provides more than the estimate x̄. It also gives an estimate for the accuracy of x̄, namely

    σ̂ = [ (1/(n(n−1))) Σ_{i=1}^n (x_i − x̄)² ]^{1/2}.   (2)

σ̂ is the estimated standard error of θ̂ = x̄, the root mean squared error of estimation.

The trouble with formula (2) is that it does not, in any obvious way, extend to estimators other than x̄, for example the sample median. The jackknife and the bootstrap are two ways of making this extension. Let

    x̄_(i) = (n x̄ − x_i)/(n − 1) = (1/(n−1)) Σ_{j≠i} x_j,   (3)

the sample average of the data set deleting the ith point. Also, let x̄_(·) = Σ_{i=1}^n x̄_(i)/n, the average of the deleted averages. (Actually x̄_(·) = x̄, but we need the dot notation below.) The jackknife estimate of standard error is

    σ̂_J = [ ((n−1)/n) Σ_{i=1}^n (x̄_(i) − x̄_(·))² ]^{1/2}.   (4)

The reader can verify that this is the same as (2). The advantage of (4) is an easy generalizability to any estimator θ̂ = θ̂(X_1, X_2, ..., X_n). The only change is to substitute θ̂_(i) = θ̂(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) for x̄_(i) and θ̂_(·) = Σ_{i=1}^n θ̂_(i)/n for x̄_(·).

The bootstrap generalizes (2) in an apparently different way. Let F̂ be the empirical probability distribution of the data, putting probability mass 1/n on each x_i, and let X_1*, X_2*, ..., X_n* be a random sample from F̂,

    X_1*, X_2*, ..., X_n* ~ F̂  (independent and identically distributed).   (5)

In other words each X_i* is drawn independently with replacement and with equal probability from the set {x_1, x_2, ..., x_n}.

*Bradley Efron is Professor of Statistics and Biostatistics at Stanford University. Gail Gong is Assistant Professor of Statistics at Carnegie-Mellon University. The authors are grateful to Rob Tibshirani who suggested the final example in Section 7; to Samprit Chatterjee and Werner Stuetzle who suggested looking at estimators like "BootAve" in Section 9; and to Dr. Peter Gregory of the Stanford Medical School who provided the original analysis as well as the data in Section 10. This work was partially supported by the National Science Foundation and the National Institutes of Health.
36  © The American Statistician, February 1983, Vol. 37, No. 1
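The delete-one recipe in (3) and (4) is easy to program. The sketch below is our illustration, not part of the original article (the function names are ours); it applies the generalized jackknife to an arbitrary estimator and checks that, for the sample average, formula (4) reproduces the textbook estimate (2):

```python
import numpy as np

def jackknife_se(x, estimator):
    """Formula (4): delete one point at a time, recompute the
    estimator, and measure the spread of the n leave-one-out
    values around their average."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_i = np.array([estimator(np.delete(x, i)) for i in range(n)])
    theta_dot = theta_i.mean()
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2))

x = np.array([1.2, 3.4, 0.7, 2.9, 5.1, 2.2, 4.0, 1.8])

# For the sample average, (4) agrees exactly with (2).
se_jack = jackknife_se(x, np.mean)
se_classic = x.std(ddof=1) / np.sqrt(len(x))

# Unlike (2), the same recipe applies unchanged to the median.
se_median = jackknife_se(x, np.median)
```

The only change needed to handle the median, or any other estimator, is the `estimator` argument; this is the "automatic" quality referred to in the Introduction.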
Then X̄* = Σ_{i=1}^n X_i*/n has variance

    var_* X̄* = (1/n²) Σ_{i=1}^n (x_i − x̄)²,   (6)

var_* indicating variance under sampling scheme (5). The bootstrap estimate of standard error for an estimator θ̂(X_1, X_2, ..., X_n) is

    σ̂_B = [var_* θ̂(X_1*, X_2*, ..., X_n*)]^{1/2}.   (7)

Comparing (7) with (2) we see that σ̂_B = [(n−1)/n]^{1/2} σ̂ for θ̂ = X̄. We could make σ̂_B exactly equal σ̂ for θ̂ = X̄ by adjusting definition (7) with the factor [n/(n−1)]^{1/2}, but there is no general advantage in doing so. A simple algorithm described in Section 2 allows the statistician to compute σ̂_B no matter how complicated θ̂ may be. Section 3 shows the close connection between σ̂_B and σ̂_J.

Cross-validation relates to another, more difficult, problem in estimating statistical error. Going back to (1), suppose we try to predict a new observation from F, call it X_0, using the estimator X̄ as a predictor. The expected squared error of prediction E[X_0 − X̄]² equals ((n+1)/n)μ_2, where μ_2 is the variance of the distribution F. An unbiased estimate of ((n+1)/n)μ_2 is

    ((n+1)/n) μ̂_2,   where μ̂_2 = Σ_{i=1}^n (x_i − x̄)²/(n − 1).   (8)

Cross-validation is a way of obtaining nearly unbiased estimators of prediction error in much more complicated situations. The method consists of (a) deleting the points x_i from the data set one at a time; (b) recalculating the prediction rule on the basis of the remaining n − 1 points; (c) seeing how well the recalculated rule predicts the deleted point; and (d) averaging these predictions over all n deletions of an x_i. In the simple case above, the cross-validated estimate of prediction error is

    (1/n) Σ_{i=1}^n [x_i − x̄_(i)]².   (9)

A little algebra shows that (9) equals (8) times n²/(n² − 1), this last factor being nearly equal to one. The advantage of the cross-validation algorithm is that it can be applied to arbitrarily complicated prediction rules. The connection with the bootstrap and jackknife is shown in Section 9.

2. THE BOOTSTRAP

This section describes the simple idea of the bootstrap (Efron 1979a). We begin with an example. The 15 points in Figure 1 represent various entering classes at American law schools in 1973. The two coordinates for law school i are x_i = (y_i, z_i),

    y_i = average LSAT score of entering students at school i,
    z_i = average undergraduate GPA score of entering students at school i.

(The LSAT is a national test similar to the Graduate Record Exam, while GPA refers to undergraduate grade point average.)

Figure 1. The law school data (Efron 1979b). The data points, beginning with School #1, are (576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44), (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13), (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96).

The observed Pearson correlation coefficient for these n = 15 pairs is ρ̂(x_1, x_2, ..., x_n) = .776. We want to attach a nonparametric estimate of standard error to ρ̂. The bootstrap idea is the following:

1. Suppose that the data points x_1, x_2, ..., x_n are independent observations from some bivariate distribution F on the plane. Then the true standard error of ρ̂ is a function of F, indicated σ(F),

    σ(F) = [var_F ρ̂(X_1, X_2, ..., X_n)]^{1/2}.

(It is also a function of the sample size n, and the functional form of the statistic ρ̂, but both of these are known to the statistician.)

2. We don't know F, but we can estimate it by the empirical probability distribution

    F̂: mass 1/n on each observed data point x_i, i = 1, 2, ..., n.

3. The bootstrap estimate of σ(F) is

    σ̂_B = σ(F̂).   (10)

For the correlation coefficient, and for most statistics, even very simple ones, the function σ(F) is impossible to express in closed form. That is why (10) is not in common use. However, in these days of fast and cheap computation σ̂_B can easily be approximated by Monte Carlo methods:

(i) Construct F̂, the empirical distribution function, as just described.

(ii) Draw a bootstrap sample X_1*, X_2*, ..., X_n* by independent random sampling from F̂. In other words, make n random draws with replacement from {x_1, x_2, ..., x_n}. In the law school example a typical bootstrap sample might consist of 2 copies of point 1, 0 copies of point 2, 1 copy of point 3, and so on, the total number of copies adding up to n = 15. Compute the bootstrap replication ρ̂* = ρ̂(X_1*, X_2*, ..., X_n*), that is, the value of the statistic, in this case the correlation coefficient, evaluated for the bootstrap sample.

(iii) Do step (ii) some large number B of times,



obtaining independent bootstrap replications ρ̂*1, ρ̂*2, ..., ρ̂*B, and approximate σ̂_B by

    σ̂_B ≈ [ Σ_{b=1}^B (ρ̂*b − ρ̂*·)² / (B − 1) ]^{1/2},   ρ̂*· = Σ_{b=1}^B ρ̂*b/B.   (11)

As B → ∞, (11) approaches the original definition (10). The choice of B is further discussed below, but meanwhile we won't distinguish between (10) and (11), calling both estimates σ̂_B.

Figure 2 shows B = 1000 bootstrap replications ρ̂*1, ..., ρ̂*1000 for the law school data. The abscissa is plotted in terms of ρ̂* − ρ̂ = ρ̂* − .776. Formula (11) gives σ̂_B = .127. This can be compared with the normal-theory estimate of standard error for ρ̂ (Johnson and Kotz 1970, p. 229),

    σ̂_NORM = (1 − ρ̂²)/√(n − 3) = .115.

Figure 2. Histogram of B = 1000 bootstrap replications ρ̂* for the law school data. The normal theory density curve has a similar shape, but falls off more quickly at the upper tail.

One thing is obvious about the bootstrap procedure: it can be applied just as well to any statistic, simple or complicated, as to the correlation coefficient. In Table 1 the statistic is the 25 percent trimmed mean for a sample of size n = 15. The true distribution F (now defined on the line rather than on the plane) is the standard normal N(0, 1) for the left side of the table, or one-sided negative exponential for the right side. The true standard errors σ(F) are .286 and .232, respectively. In both cases, σ̂_B, calculated with B = 200 bootstrap replications, is nearly unbiased for σ(F).

The jackknife estimate of standard error σ̂_J, described in Section 3, is also nearly unbiased in both cases, but has higher variability than σ̂_B, as shown by its higher coefficient of variation. The minimum possible coefficient of variation (C.V.), for a scale-invariant estimate of σ(F), assuming full knowledge of the parametric model, is shown in brackets. In the normal case, for example, .19 is the C.V. of [Σ(x_i − x̄)²/14]^{1/2}. The bootstrap estimate performs well by this standard, considering its totally nonparametric character and the small sample size.

Table 1. A Sampling Experiment Comparing the Bootstrap and Jackknife Estimates of Standard Error for the 25% Trimmed Mean, Sample Size n = 15

                                F Standard Normal          F Negative Exponential
                                Ave    Sd    Coeff Var     Ave    Sd    Coeff Var
Bootstrap σ̂_B (B = 200)        .287   .071   .25           .242   .078   .32
Jackknife σ̂_J                  .280   .084   .30           .224   .085   .38
True [Minimum C.V.]             .286   [.19]                .232   [.27]

Table 2 returns to the case of ρ̂, the correlation coefficient. Instead of real data we have a sampling experiment in which F is bivariate normal, true correlation ρ = .5, and the sample size is n = 14. The left side of Table 2 refers to ρ̂, while the right side refers to the statistic φ̂ = tanh⁻¹ ρ̂ = .5 log[(1 + ρ̂)/(1 − ρ̂)]. For each estimator σ̂, the root mean squared error of estimation [E(σ̂ − σ)²]^{1/2} is given in the column headed √MSE.

The bootstrap was run with B = 128 and B = 512, the latter value yielding only slightly better estimates σ̂_B. Further increasing B would be pointless. It can be shown that B = ∞ would give √MSE = .063 in the ρ̂ case, only .001 less than using B = 512. As a point of comparison, the normal-theory estimate for the standard error of ρ̂, σ̂_NORM = (1 − ρ̂²)/(n − 3)^{1/2}, has √MSE = .056.

Why not generate the bootstrap observations from an estimate of F which is smoother than F̂? This is done in lines 3, 4, and 5 of Table 2. Let Σ̂ = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)'/n be the sample covariance matrix of the observed data. The normal smoothed bootstrap draws the bootstrap sample X_1*, X_2*, ..., X_n* from F̂ ⊕ N(0, .25Σ̂), ⊕ indicating convolution. This amounts to estimating F by an equal mixture of the n distributions N(x_i, .25Σ̂), that is, by a normal window estimate. Smoothing makes little difference on the left side of the table, but is spectacularly effective in the φ̂ case. The latter result is suspect since the true sampling distribution is bivariate normal, and the function φ̂ = tanh⁻¹ ρ̂ is specifically chosen to have nearly constant standard error in the bivariate-normal family. The uniform smoothed bootstrap samples X_1*, ..., X_n* from F̂ ⊕ U(0, .25Σ̂), where U(0, .25Σ̂) is the uniform distribution on a rhombus selected so U has mean vector 0 and covariance matrix .25Σ̂. It yields moderate reductions in √MSE for both sides of the table.

The standard normal-theory estimates of line 8, Table 2, are themselves bootstrap estimates, carried out in a parametric framework. The bootstrap sample X_1*, ..., X_n* is drawn from the parametric maximum likelihood distribution F̂_NORM, the bivariate normal distribution with mean x̄ and covariance Σ̂, rather than the nonparametric maximum likelihood distribution F̂, and with only this change the bootstrap algorithm proceeds as previously described. In practice the bootstrap process is not actually carried out. If it were, and if B → ∞, then a high-order Taylor series analysis shows that σ̂_B would equal approximately (1 − ρ̂²)/(n − 3)^{1/2}, the formula actually used to compute line 8 for the ρ̂ side of Table 2.
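Steps (i)-(iii) take only a few lines of code. The following sketch is ours, not from the article; it runs the Monte Carlo algorithm on the 15 law school pairs listed in Figure 1 and reproduces ρ̂ = .776 and a bootstrap standard error near the .127 reported above (the exact value of σ̂_B depends on the random seed):

```python
import numpy as np

rng = np.random.default_rng(1983)

# The 15 (LSAT, GPA) pairs from Figure 1.
data = np.array([
    (576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44),
    (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13),
    (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96),
])

def corr(sample):
    return np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

rho_hat = corr(data)          # about .776

# Steps (ii)-(iii): B resamples of the n rows, with replacement.
B, n = 1000, len(data)
reps = np.array([corr(data[rng.integers(0, n, size=n)]) for _ in range(B)])
se_boot = reps.std(ddof=1)    # formula (11); typically close to .127
```

Resampling whole rows (pairs) is the bivariate version of drawing from F̂: each bootstrap data set is n draws with replacement from the 15 observed points.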

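The normal smoothed bootstrap changes only step (ii): after resampling rows from F̂, add independent N(0, .25Σ̂) noise to each drawn point, which is the same as sampling from the convolution F̂ ⊕ N(0, .25Σ̂). A sketch under stated assumptions (ours; the choice of B and of the statistic are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_bootstrap_se(data, stat, B=200, c=0.25):
    """Resample rows from F-hat, then add N(0, c * Sigma-hat) noise:
    each bootstrap point is a draw from F-hat convolved with
    N(0, c * Sigma-hat)."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    sigma_hat = np.cov(data, rowvar=False, bias=True)  # divisor n, as in the text
    reps = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        noise = rng.multivariate_normal(np.zeros(p), c * sigma_hat, size=n)
        reps.append(stat(data[idx] + noise))
    return np.std(reps, ddof=1)

# Illustration in the setting of Table 2: bivariate normal, rho = .5, n = 14.
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=14)
se_smooth = smoothed_bootstrap_se(
    data, lambda d: np.corrcoef(d[:, 0], d[:, 1])[0, 1])
```

Setting c = 0 recovers the ordinary bootstrap; .25 is the particular smoothing constant used in lines 3-5 of Table 2.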


Table 2. Estimates of Standard Error for the Correlation Coefficient ρ̂ and for φ̂ = tanh⁻¹ ρ̂; Sample Size n = 14, Distribution F Bivariate Normal With True Correlation ρ = .5. From a Larger Table in Efron (1981b)

Summary Statistics for 200 Trials

                                          Standard Error Estimates for ρ̂     Standard Error Estimates for φ̂
                                          Ave    Std Dev   CV    √MSE         Ave    Std Dev   CV    √MSE
1. Bootstrap B = 128                      .206   .066      .32   .067         .301   .065      .22   .065
2. Bootstrap B = 512                      .206   .063      .31   .064         .301   .062      .21   .062
3. Normal Smoothed Bootstrap B = 128      .200   .060      .30   .063         .296   .041      .14   .041
4. Uniform Smoothed Bootstrap B = 128     .205   .061      .30   .062         .298   .058      .19   .058
5. Uniform Smoothed Bootstrap B = 512     .205   .059      .29   .060         .296   .052      .18   .052
6. Jackknife                              .223   .085      .38   .085         .314   .090      .29   .091
7. Delta Method (Infinitesimal Jackknife) .175   .058      .33   .072         .244   .052      .21   .076
8. Normal Theory                          .217   .056      .26   .056         .302   0         0     .003

True Standard Error                       .218                                .299

Notice that the normal smoothed bootstrap can be thought of as a compromise between using F̂ and F̂_NORM to begin the bootstrap process.

3. THE JACKKNIFE

The jackknife estimate of standard error was introduced by Tukey in 1958 (see Miller 1974). Let θ̂_(i) = θ̂(x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n) be the value of the statistic when x_i is deleted from the data set, and let θ̂_(·) = (1/n) Σ_{i=1}^n θ̂_(i). The jackknife formula is

    σ̂_J = [ ((n−1)/n) Σ_{i=1}^n (θ̂_(i) − θ̂_(·))² ]^{1/2},

as in (4). Like the bootstrap, the jackknife can be applied to any statistic that is a function of n independent and identically distributed variables. It performs less well than the bootstrap in Tables 1 and 2, and in most cases investigated by the author (see Efron 1982), but requires less computation. In fact the two methods are closely related, which we shall now show.

Suppose the statistic of interest, which we will now call θ̂(x_1, x_2, ..., x_n), is of functional form: θ̂ = θ(F̂), where θ(F) is a functional assigning a real number to any distribution F on the sample space. Both examples in Section 2 are of this form. Let P = (P_1, P_2, ..., P_n) be a probability vector having nonnegative weights summing to one, and define the reweighted empirical distribution F̂(P): mass P_i on x_i, i = 1, 2, ..., n. Corresponding to P is a resampled value of the statistic of interest, say θ̂(P) = θ(F̂(P)). The shorthand notation θ̂(P) assumes that the data points x_1, x_2, ..., x_n are fixed at their observed values.

Another way to describe the bootstrap estimate σ̂_B is as follows. Let P* indicate a vector drawn from the rescaled multinomial distribution

    P* ~ Mult(n, P⁰)/n,   P⁰ = (1/n)(1, 1, ..., 1)',   (12)

meaning the observed proportions from n random draws on n categories, with equal probability 1/n for each category. Then

    σ̂_B = [var_* θ̂(P*)]^{1/2},   (13)

where var_* indicates variance under distribution (12). (This is true because we can take P_i* = #{X_j* = x_i}/n in step 2 of the bootstrap algorithm.)

Figure 3 illustrates the situation for the case n = 3. There are 10 possible bootstrap points. For example, the point P* = (2/3, 1/3, 0)' is the second dot from the left on the lower side of the triangle, and occurs with bootstrap probability 1/9 under (12). It indicates a bootstrap sample X_1*, X_2*, X_3* consisting of two x_1's and one x_2. The center point P* = (1/3, 1/3, 1/3)' has bootstrap probability 2/9.

The jackknife resamples the statistic at the n points

    P_(i) = (1/(n−1))(1, 1, ..., 1, 0, 1, ..., 1)'   (0 in ith place),

i = 1, 2, ..., n. These are indicated by the open circles in Figure 3. In general there are n jackknife points, compared with C(2n−1, n−1) bootstrap points.

The trouble with bootstrap formula (13) is that θ̂(P) is usually a complicated function of P (think of the examples in Sec. 2), and so var_* θ̂(P*) cannot be evaluated except by Monte Carlo methods.

Figure 3. The bootstrap and jackknife sampling points in the case n = 3. The bootstrap points (•) are shown with their probabilities.


The jackknife trick approximates θ̂(P) by a linear function of P, say θ̂_L(P), and then uses the known covariance structure of (12) to evaluate var_* θ̂_L(P*). The approximator θ̂_L(P) is chosen to match θ̂(P) at the n points P = P_(i). It is not hard to see that

    θ̂_L(P) = θ̂_(·) + (P − P⁰)'U,   (14)

where θ̂_(·) = (1/n) Σ θ̂_(i) = (1/n) Σ θ̂(P_(i)), and U is a column vector with coordinates U_i = (n − 1)(θ̂_(·) − θ̂_(i)).

Theorem. The jackknife estimate of standard error equals

    σ̂_J = [ (n/(n−1)) var_* θ̂_L(P*) ]^{1/2},

which is [n/(n−1)]^{1/2} times the bootstrap estimate of standard error for θ̂_L (Efron 1982).

In other words the jackknife is, almost,¹ a bootstrap itself. The advantage of working with θ̂_L rather than θ̂ is that there is no need for Monte Carlo: var_* θ̂_L(P*) = var_* (P* − P⁰)'U = Σ U_i²/n², using the covariance matrix for (12) and the fact that Σ U_i = 0. The disadvantage is (usually) increased error of estimation, as seen in Tables 1 and 2.

The fact that σ̂_J is almost σ̂_B for a linear approximation of θ̂ does not mean that σ̂_J is a reasonable approximation for the actual σ̂_B. That depends on how well θ̂_L approximates θ̂. In the case where θ̂ is the sample median, for instance, the approximation is very poor.

4. THE DELTA METHOD, INFLUENCE FUNCTIONS, AND THE INFINITESIMAL JACKKNIFE

There is a more obvious linear approximation to θ̂(P) than θ̂_L(P), (14). Why not use the first-order Taylor series expansion for θ̂(P) about the point P = P⁰? This is the idea of Jaeckel's infinitesimal jackknife (1972). The Taylor series approximation turns out to be

    θ̂_T(P) = θ̂(P⁰) + (P − P⁰)'U⁰,

where

    U_i⁰ = lim_{ε→0} [θ̂((1 − ε)P⁰ + ε δ_i) − θ̂(P⁰)]/ε,

δ_i being the ith coordinate vector. This suggests the infinitesimal jackknife estimate of standard error

    σ̂_IJ = [var_* θ̂_T(P*)]^{1/2} = [ Σ (U_i⁰)²/n² ]^{1/2},   (15)

with var_* still indicating variance under (12). The ordinary jackknife can be thought of as taking ε = −1/(n − 1) in the definition of U_i⁰, while the infinitesimal jackknife lets ε → 0, thereby earning the name.

The U_i⁰ are values of what Mallows (1974) calls the empirical influence function. Their definition is a nonparametric estimate of the true influence function

    IF(x) = lim_{ε→0} [θ((1 − ε)F + ε δ_x) − θ(F)]/ε,

δ_x being the degenerate distribution putting mass 1 on x. The right side of (15) is then the obvious estimate of the influence function approximation to the standard error of θ̂ (Hampel 1974),

    σ(F) ≐ [ ∫ IF²(x) dF(x)/n ]^{1/2}.

The empirical influence function method and the infinitesimal jackknife give identical estimates of standard error.

How have statisticians gotten along for so many years without methods like the jackknife or the bootstrap? The answer is the delta method, which is still the most commonly used device for approximating standard errors. The method applies to statistics of the form t(Q̄_1, Q̄_2, ..., Q̄_A), where t(·, ·, ..., ·) is a known function and each Q̄_a is an observed average, Q̄_a = Σ_{i=1}^n Q_a(x_i)/n. For example, the correlation ρ̂ is a function of A = 5 such averages: the average of the first coordinate values, the second coordinates, the first coordinates squared, the second coordinates squared, and the cross-products.

In its nonparametric formulation, the delta method works by (a) expanding t in a linear Taylor series about the expectations of the Q̄_a; (b) evaluating the standard error of the Taylor series using the usual expressions for variances and covariances of averages; and (c) substituting η(F̂) for any unknown quantity η(F) occurring in (b). For example, the nonparametric delta method estimates the standard error of ρ̂ by

    { (ρ̂²/4n) [ μ̂₄₀/μ̂₂₀² + μ̂₀₄/μ̂₀₂² + 2μ̂₂₂/(μ̂₂₀ μ̂₀₂) + 4μ̂₂₂/μ̂₁₁² − 4μ̂₃₁/(μ̂₁₁ μ̂₂₀) − 4μ̂₁₃/(μ̂₁₁ μ̂₀₂) ] }^{1/2},

where, in terms of x_i = (y_i, z_i), μ̂_gh = Σ (y_i − ȳ)^g (z_i − z̄)^h / n (Cramér 1946, p. 359).

Theorem. For statistics of the form θ̂ = t(Q̄_1, ..., Q̄_A), the nonparametric delta method and the infinitesimal jackknife give the same estimate of standard error (Efron 1981b).

The infinitesimal jackknife, the delta method, and the empirical influence function approach are three names for the same method. Notice that the results reported in line 7 of Table 2 show a severe downward bias. Efron and Stein (1981) show that the ordinary jackknife is always biased upwards, in a sense made precise in that paper. In the authors' opinion the ordinary jackknife is the method of choice if one does not want to do the bootstrap computations.

¹The factor [n/(n − 1)]^{1/2} makes σ̂_J² unbiased for σ² if θ̂ is a linear statistic, e.g., θ̂ = x̄. We could multiply σ̂_B by this same factor, and achieve the same unbiasedness, but there doesn't seem to be any general advantage to doing so.
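For a linear statistic the empirical influence values can be computed directly from their definition. The sketch below (ours; the data and the step size ε are illustrative) approximates U_i⁰ numerically for θ̂(P) = Σ P_i x_i, the functional form of the sample average, and recovers the familiar values U_i⁰ = x_i − x̄ together with the estimate (15):

```python
import numpy as np

def theta(P, x):
    """theta(P): the weighted mean, the functional form of x-bar."""
    return np.dot(P, x)

x = np.array([2.0, 4.0, 1.0, 7.0, 5.0])
n = len(x)
P0 = np.full(n, 1.0 / n)
eps = 1e-6

# Numerical directional derivatives at P0: the empirical influence values.
U0 = np.array([
    (theta((1 - eps) * P0 + eps * np.eye(n)[i], x) - theta(P0, x)) / eps
    for i in range(n)
])

sigma_ij = np.sqrt(np.sum(U0 ** 2)) / n    # formula (15)
```

For the mean this gives [Σ(x_i − x̄)²/n²]^{1/2}, slightly smaller than (2); that systematic shortfall is the downward bias visible in line 7 of Table 2.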

5. NONPARAMETRIC CONFIDENCE INTERVALS

In applied work, the usual purpose of estimating a standard error is to set confidence intervals for the unknown parameter. These are typically of the crude form θ̂ ± z_α σ̂, with z_α being the 100(1 − α) percentile point of a standard normal distribution. We can, and do, use the bootstrap and jackknife estimates σ̂_B, σ̂_J in this way. However in small-sample parametric situations, where we can do exact calculations, confidence intervals are often highly asymmetric about the best point estimate θ̂. This asymmetry, which is O(1/√n) in magnitude, is substantially more important than the Student's t correction (replacing θ̂ ± z_α σ̂ by θ̂ ± t_α σ̂, with t_α the 100(1 − α) percentile point of the appropriate t distribution), which is only O(1/n). This section discusses some nonparametric methods of assigning confidence intervals, which attempt to capture the correct asymmetry. It is abbreviated from a longer discussion in Efron (1981c), and also Chapter 10 of Efron (1982). All of this work is highly speculative, though encouraging.

We return to the law school example of Section 2. Suppose for the moment that we believe the data come from a bivariate normal distribution. The standard 68 percent central confidence interval (i.e., α = .16, 1 − 2α = .68) for ρ in this case is [.62, .87] = [ρ̂ − .16, ρ̂ + .09], obtained by inverting the approximation φ̂ ~ N(φ + ρ/(2(n − 1)), 1/(n − 3)). Compared to the crude interval ρ̂ ± z_.16 σ̂_NORM = ρ̂ ± σ̂_NORM = [ρ̂ − .12, ρ̂ + .12], this demonstrates the magnitude of the asymmetry effect described previously.

The asymmetry of the confidence interval [ρ̂ − .16, ρ̂ + .09] relates to the asymmetry of the normal-theory density curve for ρ̂, as shown in Figure 2. The bootstrap histogram shows this same asymmetry. The striking similarity between the histogram and the density curve suggests that we can use the bootstrap results more ambitiously than simply to compute σ̂_B.

Two ways of forming nonparametric confidence intervals from the bootstrap histogram are discussed in Efron (1981c). The first, called the percentile method, uses the 100α and 100(1 − α) percentiles of the bootstrap histogram, say

    θ ∈ [θ̂*(α), θ̂*(1 − α)],   (16)

as a putative 1 − 2α central confidence interval for the unknown parameter θ. Letting

    Ĉ(t) = #{θ̂*b < t}/B,

then θ̂*(α) = Ĉ⁻¹(α), θ̂*(1 − α) = Ĉ⁻¹(1 − α). In the law school example, with B = 1000 and α = .16, the 68 percent interval is ρ ∈ [.65, .91] = [ρ̂ − .12, ρ̂ + .13], almost exactly the same as the crude normal-theory interval ρ̂ ± σ̂_NORM.

Notice that the median of the bootstrap histogram is substantially higher than ρ̂ in Figure 2. In fact, Ĉ(ρ̂) = .433, only 433 out of 1000 bootstrap replications having ρ̂* < ρ̂. The bias-corrected percentile method makes an adjustment for this type of bias. Let Φ(z) indicate the CDF of the standard normal distribution, so Φ(z_α) = 1 − α, and define

    z₀ = Φ⁻¹{Ĉ(θ̂)}.

The bias-corrected putative 1 − 2α central confidence interval is defined to be

    θ ∈ [Ĉ⁻¹{Φ(2z₀ − z_α)}, Ĉ⁻¹{Φ(2z₀ + z_α)}].   (17)

If Ĉ(θ̂) = .50, the median unbiased case, then z₀ = 0 and (17) reduces to the uncorrected percentile interval (16). Otherwise the results can be quite different. In the law school example z₀ = Φ⁻¹(.433) = −.17, and for α = .16, (17) gives ρ ∈ [Ĉ⁻¹{Φ(−1.34)}, Ĉ⁻¹{Φ(.66)}] = [ρ̂ − .17, ρ̂ + .10]. This agrees nicely with the normal-theory interval [ρ̂ − .16, ρ̂ + .09].

Table 3 shows the results of a small sampling experiment, only 10 trials, in which the true distribution F was bivariate normal, ρ = .5. The bias-corrected percentile method shows impressive agreement with the normal-theory intervals. Even better are the smoothed intervals, last column. Here the bootstrap replications were obtained by sampling from F̂ ⊕ N(0, .25Σ̂), as in line 3 of Table 2, and then applying (17) to the resulting histogram.

There are some theoretical arguments supporting (16) and (17). If there exists a normalizing transformation, in the same sense as φ̂ = tanh⁻¹ ρ̂ is normalizing for the correlation coefficient under bivariate-normal sampling, then the bias-corrected percentile method automatically produces the appropriate confidence intervals. This is interesting since we do not have to know the form of the normalizing transformation to apply (17). Bayesian and frequentist justifications are given also in Efron (1981c). None of these arguments is overwhelming, and in fact (17) and (16) sometimes perform poorly. Some other methods are suggested in Efron (1981c), but the appropriate theory is still far from clear.

Table 3. Central 68% Confidence Intervals for ρ, 10 Trials of X_1, X_2, ..., X_15 Bivariate Normal With True ρ = .5. Each Interval Has ρ̂ Subtracted From Both Endpoints

Trial   ρ̂     Normal Theory   Percentile      Bias-Corrected   Smoothed and
                               Method          Percentile       Bias-Corrected
                                               Method           Percentile Method
1       .16   (−.29, .26)     (−.29, .24)     (−.28, .25)      (−.28, .24)
2       .75   (−.17, .09)     (−.05, .08)     (−.13, .04)      (−.12, .08)
3       .55   (−.25, .16)     (−.24, .16)     (−.34, .12)      (−.27, .15)
4       .53   (−.26, .17)     (−.16, .16)     (−.19, .13)      (−.21, .16)
5       .73   (−.18, .10)     (−.12, .14)     (−.16, .10)      (−.20, .10)
6       .50   (−.26, .18)     (−.18, .18)     (−.22, .15)      (−.26, .14)
7       .70   (−.20, .11)     (−.17, .12)     (−.21, .10)      (−.18, .11)
8       .30   (−.29, .23)     (−.29, .25)     (−.33, .24)      (−.29, .25)
9       .33   (−.29, .22)     (−.36, .24)     (−.30, .27)      (−.30, .26)
10      .22   (−.29, .24)     (−.50, .34)     (−.48, .36)      (−.38, .34)
AVE     .48   (−.25, .18)     (−.21, .19)     (−.26, .18)      (−.25, .18)
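Both interval methods simply read percentiles off the vector of bootstrap replications. A sketch (ours; the skewed toy data and the value of B are illustrative, and we use the standard-library `statistics.NormalDist` for Φ and Φ⁻¹):

```python
import numpy as np
from statistics import NormalDist

def percentile_interval(reps, alpha=0.16):
    """The percentile method, formula (16)."""
    return np.quantile(reps, [alpha, 1 - alpha])

def bc_percentile_interval(reps, theta_hat, alpha=0.16):
    """The bias-corrected percentile method, formula (17)."""
    nd = NormalDist()
    z0 = nd.inv_cdf(np.mean(reps < theta_hat))  # z0 = Phi^-1(C(theta-hat))
    z_alpha = nd.inv_cdf(1 - alpha)
    return np.quantile(reps, [nd.cdf(2 * z0 - z_alpha),
                              nd.cdf(2 * z0 + z_alpha)])

rng = np.random.default_rng(7)
x = rng.exponential(size=25)    # a skewed sample, so asymmetry matters
reps = np.array([rng.choice(x, size=x.size).mean() for _ in range(2000)])

lo, hi = percentile_interval(reps)
lo_bc, hi_bc = bc_percentile_interval(reps, x.mean())
```

When the bootstrap histogram is median-unbiased about θ̂, z₀ ≈ 0 and the two intervals nearly coincide, as the text notes for (16) and (17).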



13, 13 = E{a(f) - a(F)}. In the notation of Section 3, and have been interested in the expectation 13 and the
Quenouille's estimate is standard deviation Cl of R.) The bootstrap algorithm
proceeds as described in Section 2, with these two
~I = (n - l)(a(.) - a). (18)
changes: at step (ii), we calculate the bootstrap repli-
Subtracting ~I from a, to correct the bias leads to the cation R' = R(Xr, X;, ... , X:; f), and at step (iii) we
jackknife estimate of a, aj= nO- (n - l)a(.), see Miller calculate the distributional property of interest from the
(1974), and also Schucany, Gray, and Owen (1971). empirical distribution of the bootstrap replications R",
There are many ways to justify (18). Here we follow R 2, ... , R*8,
the same line of argument as in the justification of (r/. For example, we might be interested in the proba-
The bootstrap estimate of 13, which has an obvious mo- bility that the usualt statistic Vii (X - fLitS exceeds 2,
tivation, is introduced, and then (18) is related to the where fL = E{X) and S' = LeX, - X)'/(n - I). Then
bootstrap estimate by a Taylor series argument. R' = Vii (X' - X)/S', and the bootstrap estimate is
The bias can be thought of as a function of the unknown probability distribution F, β = β(F). The bootstrap estimate of bias is simply

β̂_B = β(F̂) = E*{θ(F̂*) − θ(F̂)}.   (19)

Here E* indicates expectation with respect to bootstrap sampling, and F̂* is the empirical distribution of the bootstrap sample.

In practice β̂_B must be approximated by Monte Carlo methods. The only change in the algorithm described in Section 2 is at step (iii), when instead of (or in addition to) σ̂_B we calculate

β̂_B ≈ (1/B) Σ_{b=1}^{B} θ̂*(b) − θ̂.

In the sampling experiment of Table 2 the true bias of ρ̂ for estimating ρ is β = −.014. The bootstrap estimate β̂_B, taking B = 128, has expectation −.014 and standard deviation .031 in this case, while β̂_J has expectation −.017, standard deviation .040. Bias is a negligible source of statistical error in this situation compared with variability. In applications this is usually made clear by comparison of β̂_B with σ̂_B.

The estimates (18) and (19) are closely related to each other. The argument is the same as in Section 3, except that we approximate θ(P) with a quadratic rather than a linear function of P, say θ_Q(P) = a + (P − P⁰)′b + ½(P − P⁰)′c(P − P⁰). Let θ_Q(P) be any such quadratic satisfying

θ_Q(P⁰) = θ(P⁰) = θ̂  and  θ_Q(P_(i)) = θ(P_(i)),  i = 1, 2, ..., n.

Theorem. The jackknife estimate of bias equals

β̂_J = [n/(n − 1)] [E*{θ_Q(P*)} − θ̂],

which is n/(n − 1) times the bootstrap estimate of bias for θ_Q (Efron 1982). Once again, the jackknife is, almost, a bootstrap estimate itself, except applied to a convenient approximation of θ(P).

More general problems. There is nothing special about bias and standard error as far as the bootstrap is concerned. The bootstrap procedure can be applied to almost any estimation problem. Suppose that R(X₁, X₂, ..., X_n; F) is a random variable, and we are interested in estimating some aspect of R's distribution. (So far we have taken R = θ(F̂) − θ(F).) For example, the bootstrap estimate of Prob{R > 2} is simply #{R*b > 2}/B. This calculation is used in Section 9 of Efron (1981c) to get confidence intervals for the mean μ in a situation where normality is suspect.

The cross-validation problem of Sections 8 and 9 involves a different type of error random variable R. It will be useful there to use a jackknife-type approximation to the bootstrap expectation of R,

E*{R*} ≈ R̃ + (n − 1)(R_(·) − R̃).   (20)

Here R̃ = R(x₁, x₂, ..., x_n; F̂) and R_(·) = (1/n) Σ_i R_(i), with R_(i) = R(x₁, x₂, ..., x_{i−1}, x_{i+1}, ..., x_n; F̂). The justification of (20) is the same as for the theorem of this section, being based on a quadratic approximation formula.

7. MORE COMPLICATED DATA SETS

So far we have considered the simplest kind of data sets, where all the observations come from the same distribution F. The bootstrap idea, and jackknife-type approximations (which are not discussed here), can be applied to much more complicated situations. We begin with a two-sample problem.

The data in our first example consist of two independent random samples,

X₁, X₂, ..., X_m ~ F  and  Y₁, Y₂, ..., Y_n ~ G,

F and G being two possibly different distributions on the real line. The statistic of interest is the Hodges-Lehmann shift estimate

θ̂ = median{y_j − x_i; i = 1, ..., m, j = 1, ..., n}.

We desire an estimate of the standard error σ(F, G). The bootstrap estimate is simply

σ̂_B = σ(F̂, Ĝ),

Ĝ being the empirical distribution of the y_j. This is evaluated by Monte Carlo, as in Section 3, with obvious modifications: a bootstrap sample now consists of a random sample X*₁, X*₂, ..., X*_m drawn from F̂ and an independent random sample Y*₁, Y*₂, ..., Y*_n drawn from Ĝ. (In other words, m draws with replacement from {x₁, x₂, ..., x_m}, and n draws with replacement from {y₁, y₂, ..., y_n}.) The bootstrap replication θ̂* is the median of the mn differences y*_j − x*_i. Then σ̂_B is approximated from B independent such replications as on the right side of (11). Table 4 shows the results of a sampling experiment in which m = 6, n = 9, and both F and G were uniform distributions on the interval [0, 1].
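The two-sample algorithm just described is easy to sketch in Python. This is our own illustration, not code from the paper; the sample sizes mirror the m = 6, n = 9 uniform experiment, and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def hodges_lehmann(x, y):
    """Median of all m*n pairwise differences y_j - x_i."""
    return float(np.median(np.subtract.outer(y, x)))

def two_sample_boot_se(x, y, B=200, rng=rng):
    """'Separate' bootstrap: resample x and y independently, with replacement."""
    reps = [hodges_lehmann(rng.choice(x, size=len(x), replace=True),
                           rng.choice(y, size=len(y), replace=True))
            for _ in range(B)]
    return float(np.std(reps, ddof=1))

x = rng.uniform(size=6)   # m = 6 draws from F = Uniform[0, 1]
y = rng.uniform(size=9)   # n = 9 draws from G = Uniform[0, 1]
se_boot = two_sample_boot_se(x, y)
```

The only structural difference from the one-sample algorithm of Section 2 is that the two samples are resampled independently before each replication of θ̂* is computed.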
42 The American Statistician, February 1983, Vol. 37, No.1
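Looking back at Section 6, the Monte Carlo approximation to the bootstrap bias estimate (19) follows the same resampling pattern. A sketch of our own (the statistic θ̂ = x̄², whose upward bias is roughly σ²/n, is chosen only so the estimate has a known target):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_bias(x, stat, B=2000, rng=rng):
    """Monte Carlo version of (19): average the statistic over B bootstrap
    resamples of x and subtract the statistic evaluated on x itself."""
    n = len(x)
    theta_hat = stat(x)
    reps = [stat(rng.choice(x, size=n, replace=True)) for _ in range(B)]
    return float(np.mean(reps) - theta_hat)

x = rng.normal(size=20)
# theta = (sample mean)^2 overestimates (true mean)^2 by about var/n,
# and the bootstrap bias estimate should recover roughly that amount.
bias_hat = bootstrap_bias(x, lambda s: s.mean() ** 2)
```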
Table 4. Bootstrap Estimates of Standard Error for the Hodges-Lehmann Two-Sample Shift Estimate; m = 6, n = 9; True Distributions Both F and G Uniform [0, 1]

                         Expectation   St. Dev.   C.V.   √MSE
Separate     B = 100        .165         .030     .18    .030
             B = 200        .166         .031     .19    .031
Combined     B = 100        .145         .028     .19    .036
             B = 200        .149         .025     .17    .031
True Standard Error         .167

The table is based on 100 trials of the situation. The true standard error is σ(F, G) = .167. "Separate" refers to σ̂_B calculated exactly as described in the previous paragraph. The improvement in going from B = 100 to B = 200 is too small to show up in the table.

"Combined" refers to the following idea: suppose we believe that G is really a translate of F. Then it wastes information to estimate F and G separately. Instead we can form the combined empirical distribution

Ĥ: mass 1/(m + n) on each of x₁, ..., x_m and y₁ − θ̂, ..., y_n − θ̂.

All m + n bootstrap variates X*₁, ..., X*_m, Y*₁, ..., Y*_n are then sampled independently from Ĥ. (We could add θ̂ back to the Y*_j values, but this has no effect on the bootstrap standard error estimate, since it just adds the constant θ̂ to each bootstrap replication θ̂*.)

The combined method gives no improvement here, but it might be valuable in a many-sample problem where there are small numbers of observations in each sample, a situation that arises in stratified sampling. (See Efron 1982, Ch. 8.) The main point here is that "bootstrap" is not a well-defined verb, and that there may be more than one way to proceed in complicated situations. Next we consider regression problems, where again there is a choice of bootstrapping methods.

In a typical regression problem we observe n independent real-valued quantities Y_i = y_i,

Y_i = g_i(β) + ε_i,   i = 1, 2, ..., n.   (21)

The functions g_i(·) are of known form, usually g_i(β) = g(β; t_i), where t_i is an observed p-dimensional vector of covariates; β is a vector of unknown parameters we wish to estimate. The ε_i are an independent and identically distributed random sample from some distribution F on the real line,

ε_i ~ F,

where F is assumed to be centered at zero in some sense, perhaps E{ε} = 0 or Prob{ε < 0} = .5.

Having observed the data vector Y = y = (y₁, ..., y_n), we estimate β by minimizing some measure of distance D between y and the vector of predicted values η(β) = (g₁(β), ..., g_n(β)),

β̂: min_β D(y, η(β)).

The most common choice of D is D(y, η) = Σ_i (y_i − η_i)².

Having calculated β̂, we can modify the one-sample bootstrap algorithm of Section 2, and obtain an estimate of β̂'s variability:

(i) Construct F̂ putting mass 1/n at each observed residual, F̂: mass 1/n on ε̂_i = y_i − g_i(β̂).
(ii) Construct a bootstrap data set

Y*_i = g_i(β̂) + ε*_i,   i = 1, 2, ..., n,

where the ε*_i are drawn independently from F̂, and calculate

β̂*: min_β D(y*, η(β)).

(iii) Do step (ii) some large number B of times, obtaining independent bootstrap replications β̂*1, β̂*2, ..., β̂*B, and estimate the covariance matrix of β̂ by the sample covariance matrix of the β̂*b.

In ordinary linear regression we have g_i(β) = t_i′β and D(y, η) = Σ_i (y_i − η_i)². Section 7 of Efron (1979a) shows that in this case the algorithm above can be carried out theoretically, B = ∞, and yields

Σ̂_B = σ̂² (Σ_i t_i t_i′)⁻¹,   σ̂² = Σ_i ε̂_i²/n.   (22)

This is the usual answer, except for dividing by n instead of n − p in σ̂². Of course the advantage of the bootstrap approach is that Σ̂_B can just as well be calculated if, say, g_i(β) = exp(t_i′β) and D(y, η) = Σ_i |y_i − η_i|.

There is another, simpler way to bootstrap the regression problem. We can consider each covariate-response pair x_i = (t_i, y_i) to be a single data point obtained by random sampling from a distribution F on (p + 1)-dimensional space. Then we apply the one-sample bootstrap of Section 2 to the data set x₁, x₂, ..., x_n.

The two bootstrap methods for the regression problem are asymptotically equivalent, but can perform quite differently in small-sample situations. The simple method, described last, takes less advantage of the special structure of the regression problem. It does not give answer (22) in the case of ordinary least squares. On the other hand the simple method gives a trustworthy estimate of β̂'s variability even if the regression model (21) is not correct. For this reason we use the simple method of bootstrapping on the error rate prediction problem of Sections 9 and 10.
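The residual-resampling algorithm (i)-(iii) and the simpler pairs method can be sketched side by side. This is our own least-squares illustration (the model, data, and B are invented), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 30
T = np.column_stack([np.ones(n), rng.uniform(size=n)])   # rows t_i', with a constant
y = T @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

def ols(T, y):
    return np.linalg.lstsq(T, y, rcond=None)[0]

def residual_boot_cov(T, y, B=500, rng=rng):
    """Steps (i)-(iii): resample residuals, rebuild y*, refit, and take the
    sample covariance of the bootstrap coefficient vectors."""
    fitted = T @ ols(T, y)
    resid = y - fitted                       # F-hat: mass 1/n on each residual
    reps = [ols(T, fitted + rng.choice(resid, size=len(y), replace=True))
            for _ in range(B)]
    return np.cov(np.array(reps).T)

def pairs_boot_cov(T, y, B=500, rng=rng):
    """The 'simple' method: resample whole (t_i, y_i) pairs."""
    reps = []
    for _ in range(B):
        idx = rng.choice(len(y), size=len(y), replace=True)
        reps.append(ols(T[idx], y[idx]))
    return np.cov(np.array(reps).T)

cov_resid = residual_boot_cov(T, y)
cov_pairs = pairs_boot_cov(T, y)
# For least squares, cov_resid approximates (22): sigma^2 (sum t_i t_i')^{-1};
# cov_pairs does not, but it remains trustworthy if model (21) is wrong.
```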

As a final example of bootstrapping complicated data sets, we consider a two-sample problem with censored data. The data are the leukemia remission times listed in Table 1 of Cox (1972). The sample sizes are m = n = 21. Treatment-group remission times (weeks) are 6+, 6, 6, 6, 7, 9+, 10+, 10, 11+, 13, 16, 17+, 19+, 20+, 22, 23, 25+, 32+, 32+, 34+, 35+; control-group remission times (weeks) are 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23. Here 6+ indicates a censored remission time, known only to exceed 6 weeks, while 6 is an uncensored remission time of exactly 6 weeks. None of the control-group times were censored.

We assume Cox's proportional hazards model, the hazard rate in the control group equaling e^β times that in the treatment group. The partial likelihood estimate of β is β̂ = 1.51, and we want to estimate the standard error of β̂. (Cox gets 1.65, not 1.51. Here we are using Breslow's (1972) convention for ties, which accounts for the discrepancy.)

Figure 4 shows the histogram for 1000 bootstrap replications of β̂*. Each replication was obtained by the two-sample method described for the Hodges-Lehmann estimate:

(i) Construct F̂ putting mass 1/21 at each point 6+, 6, 6, ..., 35+, and Ĝ putting mass 1/21 at each point 1, 1, ..., 23. (Notice that the "points" in F̂ include the censoring information.)
(ii) Draw X*₁, X*₂, ..., X*₂₁ by random sampling from F̂, and likewise Y*₁, Y*₂, ..., Y*₂₁ by random sampling from Ĝ. Calculate β̂* by applying the partial-likelihood method to the bootstrap data.

The bootstrap estimate of standard error for β̂, as given by (11), is σ̂_B = .42. This agrees nicely with Cox's asymptotic estimate σ̂ = .41. However, the percentile method gives quite different confidence intervals from those obtained by the usual method. For α = .05, 1 − 2α = .90, the latter interval is 1.51 ± 1.65 × .41 = [.83, 2.19]. The percentile method gives the 90 percent central interval [.98, 2.35]. Notice that (2.35 − 1.51)/(1.51 − .98) = 1.58, so that the percentile interval is considerably longer to the right of β̂ than to the left. (The bias-corrected percentile method gives almost the same answers as the uncorrected method in this case since Ĝ(β̂) = .49.)

[Figure 4. Histogram of 1000 bootstrap replications of β̂* for the leukemia data, proportional hazards model. Courtesy of Rob Tibshirani, Stanford.]

There are other reasonable ways to bootstrap censored data. One of these is described in Efron (1981a), which also contains a theoretical justification for the method used to construct Figure 4.

8. CROSS-VALIDATION

Cross-validation is an old but useful idea, whose time seems to have come again with the advent of modern computers. We discuss it in the context of estimating the error rate of a prediction rule. (There are other important uses; see Stone 1974; Geisser 1975.)

The prediction problem is as follows: each data point x_i = (t_i, y_i) consists of a p-dimensional vector of explanatory variables t_i and a response variable y_i. Here we assume y_i can take on only two possible values, say 0 or 1, indicating two possible responses, live or dead, male or female, success or failure, and so on. We observe x₁, x₂, ..., x_n, called collectively the training set, and indicated x = (x₁, x₂, ..., x_n). We have in mind a formula η(t; x) for constructing a prediction rule from the training set, also taking on values either 0 or 1. Given a new explanatory vector t₀, the value η(t₀; x) is supposed to predict the corresponding response y₀.

We assume that each x_i is an independent realization of X = (T, Y), a random vector having some distribution F on (p + 1)-dimensional space, and likewise for the "new case" X₀ = (T₀, Y₀). The true error rate err of the prediction rule η(·; x) is the expected probability of error over X₀ ~ F, with x fixed,

err = E{Q[Y₀, η(T₀; x)]},

where Q[y, η] is the error indicator

Q[y, η] = 0 if y = η,  1 if y ≠ η.

An obvious estimate of err is the apparent error rate

err̄ = Ê{Q[Y₀, η(T₀; x)]} = (1/n) Σ_i Q[y_i, η(t_i; x)].

The symbol Ê indicates expectation with respect to the empirical distribution F̂, putting mass 1/n on each x_i. The apparent error rate is likely to underestimate the true error rate, since we are evaluating η(·; x)'s performance on the same set of data used in its construction. A random variable of interest is the overoptimism, true minus apparent error rate,

R(x, F) = err − err̄ = E{Q[Y₀, η(T₀; x)]} − Ê{Q[Y₀, η(T₀; x)]}.   (23)

The expectation of R(X, F) over the random choice of X₁, X₂, ..., X_n from F,

ω(F) = E R(X, F),   (24)

is the expected overoptimism.

The cross-validated estimate of err is

err† = (1/n) Σ_i Q[y_i, η(t_i; x_(i))],

η(t_i; x_(i)) being the prediction rule based on x_(i) = (x₁, x₂, ..., x_{i−1}, x_{i+1}, ..., x_n).
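In code, err̄ and err† differ only in which training set builds the rule that predicts y_i. A generic Python sketch of our own (the nearest-class-mean rule below is a stand-in for illustration, not the discriminant rule used later in the paper):

```python
import numpy as np

def apparent_error(t, y, build_rule):
    """err-bar: build and evaluate the rule on the same training set."""
    eta = build_rule(t, y)
    return float(np.mean(eta(t) != y))

def cv_error(t, y, build_rule):
    """err-dagger: x_i never enters the construction of its own predictor."""
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        eta_i = build_rule(t[keep], y[keep])      # rule based on x_(i)
        wrong += int(eta_i(t[i:i + 1])[0] != y[i])
    return wrong / n

def nearest_mean(t, y):
    """Toy rule: predict the class whose mean vector is closer."""
    m0, m1 = t[y == 0].mean(axis=0), t[y == 1].mean(axis=0)
    return lambda tt: (np.linalg.norm(tt - m1, axis=1)
                       < np.linalg.norm(tt - m0, axis=1)).astype(int)

rng = np.random.default_rng(3)
y = np.array([0] * 7 + [1] * 7)
t = rng.normal(size=(14, 2)) + np.column_stack([y - 0.5, np.zeros(14)])
err_bar = apparent_error(t, y, nearest_mean)
err_dagger = cv_error(t, y, nearest_mean)
```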
In other words err† is the error rate over the observed data set, not allowing x_i = (t_i, y_i) to enter into the construction of the rule for its own prediction. It is intuitively obvious that err† is a less biased estimator of err than is err̄. In what follows we consider how well err† estimates err, or equivalently how well

ω† ≡ err† − err̄

estimates R(x, F) = err − err̄. (These are equivalent problems since err† − err = ω† − R(x, F).) We have used the notation ω† rather than R†, because it turns out later that it is actually ω being estimated.

We consider a sampling experiment involving Fisher's linear discriminant function. The dimension is p = 2 and the sample size of the training set is n = 14. The distribution F is as follows: Y = 0 or 1 with probability ½, and given Y = y the predictor vector T is bivariate normal with identity covariance matrix and mean vector (y − ½, 0). If F were known to the statistician, the ideal prediction rule would be to guess y₀ = 0 if the first component of t₀ was ≤ 0, and to guess y₀ = 1 otherwise. Since F is assumed unknown, we must estimate a prediction rule from the training set.

We use the prediction rule based on Fisher's estimated linear discriminant function (Efron 1975),

η(t; x) = 0 if α̂ + t′β̂ ≤ 0,  1 if α̂ + t′β̂ > 0.

The quantities α̂ and β̂ are defined in terms of n₀ and n₁, the number of y_i equal to zero and one, respectively; t̄₀ and t̄₁, the averages of the t_i corresponding to those y_i equaling zero and one, respectively; and S = [Σ_i t_i t_i′ − n₀ t̄₀ t̄₀′ − n₁ t̄₁ t̄₁′]/n:

α̂ = [t̄₀′ S⁻¹ t̄₀ − t̄₁′ S⁻¹ t̄₁]/2,   β̂ = S⁻¹(t̄₁ − t̄₀).

Table 5 shows the results of 10 simulations ("trials") of this situation. The expected overoptimism, obtained from 100 trials, is ω = .096, so that R = err − err̄ is typically quite large. However, R is also quite variable from trial to trial, often being negative. The cross-validation estimate ω† is positive in all 10 cases, and does not correlate with R. This relates to the comment that ω† is trying to estimate ω rather than R. We will see later that ω† has expectation .091, and so is nearly unbiased for ω. However, ω† is too variable itself to be very useful for estimating R, which is to say that err† is not a particularly good estimate of err. These points are discussed further in Section 9, where the two other estimates of ω appearing in Table 5, ω̂_J and ω̂_B, are introduced.

Table 5. The First 10 Trials of a Sampling Experiment Involving Fisher's Linear Discriminant Function. The Training Set Has Size n = 14. The Expected Overoptimism Is ω = .096, See Table 6

                     Error Rates                  Estimates of Overoptimism
                True   Apparent  Overoptimism  Cross-val.  Jackknife  Bootstrap (B = 200)
Trial  (n₀,n₁)  err    err̄       R             ω†          ω̂_J        ω̂_B
  1     9,5    .458    .286       .172          .214        .214       .083
  2     6,8    .312    .357      −.045          .000        .066       .098
  3     7,7    .313    .357      −.044          .071        .066       .110
  4     8,6    .351    .429      −.078          .071        .066       .107
  5     8,6    .330    .357      −.027          .143        .148       .102
  6     8,6    .318    .143       .175          .214        .194       .073
  7     8,6    .310    .071       .239          .071        .066       .087
  8     6,8    .382    .286       .096          .071        .056       .097
  9     7,7    .360    .429      −.069          .071        .087       .127
 10     8,6    .335    .143       .192          .000        .010       .048

9. BOOTSTRAP AND JACKKNIFE ESTIMATES FOR THE PREDICTION PROBLEM

At the end of Section 6 we described a method for applying the bootstrap to any random variable R(X, F). Now we use that method on the overoptimism random variable (23), and obtain a bootstrap estimate of the expected overoptimism ω(F).

The bootstrap estimate of ω = ω(F), (24), is simply

ω̂_B = ω(F̂).

As usual ω̂_B must be approximated by Monte Carlo. We generate independent bootstrap replications R*1, R*2, ..., R*B, and take

ω̂_B ≈ (1/B) Σ_{b=1}^{B} R*b.

As B goes to infinity this last expression approaches E*{R*}, the expectation of R* under bootstrap resampling, which is by definition the same quantity as ω(F̂) = ω̂_B. The bootstrap estimates ω̂_B seen in the last column of Table 5 are considerably less variable than the cross-validation estimates ω†.

What does a typical bootstrap replication consist of in this situation? As in Section 3, let P* = (P*₁, P*₂, ..., P*_n) indicate the bootstrap resampling proportions, P*_i = #{X*_j = x_i}/n. (Notice that we are considering each vector x_i = (t_i, y_i) as a single sample point for the purpose of carrying out the bootstrap algorithm.) Following through definition (13), it is not hard to see that

R* = R(X*, F̂) = Σ_{i=1}^{n} (P⁰_i − P*_i) Q[y_i, η(t_i; X*)],   (25)

where P⁰ = (1, 1, ..., 1)′/n as before, and η(·; X*) is the prediction rule based on the bootstrap sample.

Table 6 shows the results of two simulation experiments (100 trials each) involving Fisher's linear discriminant function. The left side relates to the bivariate normal situation described in Section 8: sample size n = 14, dimension d = 2, mean vectors for the two randomly selected normal distributions ±(½, 0). The right side still has n = 14, but the dimension has been raised to 5, with mean vectors ±(½, 0, 0, 0, 0). Fuller descriptions appear in Chapter 7 of Efron (1982).

Seven estimates of overoptimism were considered. In the d = 2 situation, the cross-validation estimate ω†, for example, had expectation .091, standard deviation .073, and correlation −.07 with R.
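A single replication R* of (25) is cheap to compute once the resampling proportions P*_i are tallied. A Python sketch of our own (the placeholder majority-vote rule stands in for the discriminant rule; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def majority_rule(t, y):
    """Placeholder rule: always predict the training set's majority class."""
    guess = int(y.mean() >= 0.5)
    return lambda tt: np.full(len(tt), guess)

def overoptimism_rep(t, y, build_rule, rng=rng):
    """One bootstrap replication of (25):
    R* = sum_i (P0_i - P*_i) Q[y_i, eta(t_i; X*)], with P0_i = 1/n."""
    n = len(y)
    idx = rng.choice(n, size=n, replace=True)        # the bootstrap sample X*
    p_star = np.bincount(idx, minlength=n) / n       # resampling proportions P*_i
    eta = build_rule(t[idx], y[idx])                 # rule built from X*
    q = (eta(t) != y).astype(float)                  # Q[y_i, eta(t_i; X*)]
    return float(np.sum((1.0 / n - p_star) * q))

y = np.array([0] * 7 + [1] * 7)
t = rng.normal(size=(14, 2))
omega_B = float(np.mean([overoptimism_rep(t, y, majority_rule) for _ in range(200)]))
```

Averaging many such replications gives the Monte Carlo approximation to ω̂_B described above.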

Table 6. Two Sampling Experiments Involving Fisher's Linear Discriminant Function. The Left Side of the Table Relates to the Situation of Table 5: n = 14, d = 2, True Mean Vectors ±(½, 0). The Right Side Relates to n = 14, d = 5, True Mean Vectors ±(½, 0, 0, 0, 0)

                               Dimension 2                       Dimension 5
                        Exp.  St.Dev. Corr. R  √MSE      Exp.  St.Dev. Corr. R  √MSE
Overoptimism R(X, F)    .096   .113      -       -       .184   .099      -       -
1. Ideal Constant ω     .096    0        0     .113      .184    0        0     .099
2. Cross-Validation     .091   .073    −.07    .139      .170   .095    −.15    .147
3. Jackknife            .093   .068    −.23    .145      .167   .089    −.26    .150
4. Bootstrap (B = 200)  .080   .028    −.64    .135      .103   .031    −.58    .145
5. BootRand (B = 200)   .087   .026    −.55    .130      .147   .020    −.31    .114
6. BootAve (B = 200)    .100   .036    −.18    .125      .172   .041    −.25    .118
7. Zero                  0      0        0     .149       0      0        0     .209

This gave root mean squared error, of ω† for estimating R or equivalently of err† for estimating err,

[E{(ω† − R)²}]^½ = [E{(err† − err)²}]^½ = .139.

The bootstrap, line 4, did only slightly better, √MSE = .135.

The zero estimate ω̂ ≡ 0, line 7, had √MSE = .149, which is also [E{(err̄ − err)²}]^½, the √MSE of estimating err by the apparent error rate err̄, with zero correction for overoptimism. The "ideal constant" is ω itself. If we knew ω, which we don't in genuine applications, we would use the bias-corrected estimate err̄ + ω. Line 1, left side, says that this ideal correction gives √MSE = .113.

We see that neither cross-validation nor the bootstrap are much of an improvement over making no correction at all, though the situation is more favorable on the right side of Table 6. Estimators 5 and 6, which will be described later, perform noticeably better.

The "jackknife," line 3, refers to the following idea: since ω̂_B = E*{R*} is a bootstrap expectation, we can approximate that expectation by (20). In this case (25) gives R̃ = 0, so the jackknife approximation is simply ω̂_J = (n − 1) R_(·). Evaluating this last expression, as in Chapter 7 of Efron (1982), gives

ω̂_J = (1/n) Σ_i {Q[y_i, η(t_i; x_(i))] − (1/n) Σ_j Q[y_j, η(t_j; x_(i))]}.

This looks very much like the cross-validation estimate, which can be written

ω† = (1/n) Σ_i {Q[y_i, η(t_i; x_(i))] − Q[y_i, η(t_i; x)]}.

As a matter of fact, ω̂_J and ω† have asymptotic correlation one (Gong 1982). Their nearly perfect correlation can be seen in Table 5. In the sampling experiments of Table 6, corr(ω̂_J, ω†) = .93 on the left side, and .98 on the right side. The point here is that the cross-validation estimate ω† is, essentially, a Taylor series approximation to the bootstrap estimate ω̂_B.

Even though ω̂_B and ω† are closely related in theory and are asymptotically equivalent, they behave very differently in Table 6: ω† is nearly unbiased and uncorrelated with R, but has enormous variability; ω̂_B has small variability, but is biased downwards, particularly in the right-hand case, and highly negatively correlated with R. The poor performances of the two estimators are due to different causes, and there are some grounds of hope for a favorable hybrid.

"BootRand," line 5, modified the bootstrap estimate in just one way: instead of drawing the bootstrap sample X*₁, X*₂, ..., X*_n from F̂, it was drawn from

F̂_RAND: mass π̂_i/n on (t_i, 1), mass (1 − π̂_i)/n on (t_i, 0),   i = 1, 2, ..., n.

This is a distribution supported on 2n points, the observed points x_i = (t_i, y_i) and also the complementary points (t_i, 1 − y_i). The probabilities π̂_i were those naturally associated with the linear discriminant function,

π̂_i = 1/[1 + exp{−(α̂ + t_i′β̂)}]

(see Efron 1975), except that π̂_i was always forced to lie in the interval [.1, .9].

Drawing the bootstrap sample X*₁, ..., X*_n from F̂_RAND instead of F̂ is a form of smoothing, not unlike the smoothed bootstraps of Section 2. In both cases we support the estimate of F on points beyond those actually observed in the sample. Here the smoothing is entirely in the response variable y. In complicated problems, such as the one described in Section 10, t_i can have complex structure (censoring, missing values, cardinal and ordinal scales, discrete and continuous variates, etc.) making it difficult to smooth in the t space. Notice that in Table 6 BootRand is an improvement over the ordinary bootstrap in every way: it has smaller bias, smaller standard deviation, and smaller negative correlation with R. The decrease in √MSE is especially impressive on the right side of the table.

"BootAve," line 6, involves a quantity we shall call ω̂₀. Generating B bootstrap replications involves making nB predictions η(t_i; X*b), i = 1, 2, ..., n, b = 1, 2, ..., B. Let I_ib equal 1 if x_i does not appear in the bootstrap sample X*b, and 0 otherwise. Then

ω̂₀ = [Σ_{i,b} I_ib Q[y_i, η(t_i; X*b)] / Σ_{i,b} I_ib] − err̄.

In other words, ω̂₀ + err̄ is the observed bootstrap error rate for prediction of those y_i where x_i is not involved in the construction of η(·; X*b). Theoretical arguments can be mustered to show that ω̂₀ will usually have expectation greater than ω, while ω̂_B usually has expectation less than ω. "BootAve" is the compromise estimator ω̂_AVE = (ω̂_B + ω̂₀)/2. It also performs well in Table 6, though there is not yet enough theoretical or numerical evidence to warrant unqualified enthusiasm.

The bootstrap is a general all-purpose device that can be applied to almost any problem. This is very handy,

Table 7. The Last 11 Liver Patients. Negative Numbers Indicate Missing Values

Variables: 1 Constant, 2 Age, 3 Sex, 4 Steroid, 5 Antiviral, 6 Fatigue, 7 Malaise, 8 Anorexia, 9 Liver Big, 10 Liver Firm, 11 Spleen Palp, 12 Spiders, 13 Ascites, 14 Varices, 15 Bilirubin, 16 Alk Phos, 17 SGOT, 18 Albumin, 19 Protein, 20 Histology.

y   1   2   3   4   5   6   7   8   9  10  11  12  13  14    15    16   17   18   19   20     #
1   1  45   1   2   2   1   1   1   2   2   2   1   1   2  1.90   -1  114  2.4   -1   -3   145
0   1  31   1   1   2   1   2   2   2   2   2   2   2   2  1.20   75  193  4.2   54    2   146
1   1  41   1   2   2   1   2   2   2   1   1   1   2   1  4.20   65  120  3.4   -1   -3   147
1   1  70   1   1   2   1   1   1  -3  -3  -3  -3  -3  -3  1.70  109  528  2.8   35    2   148
0   1  20   1   1   2   2   2   2   2  -3   2   2   2   2   .90   89  152  4.0   -1    2   149
0   1  36   1   2   2   2   2   2   2   2   2   2   2   2   .60  120   30  4.0   -1    2   150
1   1  46   1   2   2   1   1   1   2   2   2   1   1   1  7.60   -1  242  3.3   50   -3   151
0   1  44   1   2   2   1   2   2   2   1   2   2   2   2   .90  126  142  4.3   -1    2   152
0   1  61   1   1   2   1   1   2   1   1   2   1   2   2   .80   95   20  4.1   -1    2   153
0   1  53   2   1   2   1   2   2   2   2   1   1   2   1  1.50   84   19  4.1   48   -3   154
1   1  43   1   2   2   1   2   2   2   2   1   1   1   2  1.20  100   19  3.1   42    2   155

but it implies that in situations with special structure the bootstrap may be outperformed by more specialized methods. Here we have done so in two different ways. BootRand uses an estimate of F that is better than the totally nonparametric estimate F̂. BootAve makes use of the particular form of R for the overoptimism problem.

10. A COMPLICATED PREDICTION PROBLEM

We end this article with the bootstrap analysis of a genuine prediction problem, involving many of the complexities and difficulties typical of genuine problems. The bootstrap is not necessarily the best method here, as discussed in Section 9, but it is impressive to see how much information this simple idea, combined with massive computation, can extract from a situation that is hopelessly beyond traditional theoretical solutions. A fuller discussion appears in Efron and Gong (1981).

Among n = 155 acute chronic hepatitis patients, 33 were observed to die from the disease, while 122 survived. Each patient had associated a vector of 20 covariates. On the basis of this training set it was desired to produce a rule for predicting, from the covariates, whether a given patient would live or die. If an effective prediction rule were available, it would be useful in choosing among alternative treatments. For example, patients with a very low predicted probability of death could be given less rigorous treatment.

Let x_i = (t_i, y_i) represent the data for patient i, i = 1, 2, ..., 155. Here t_i is the 20-dimensional vector of covariates, and y_i equals 1 or 0 as the patient died or lived. Table 7 shows the data for the last 11 patients. Negative numbers represent missing values. Variable 1 is the constant 1, included for convenience. The meaning of the 19 other predictors, and their coding in Table 7, will not be explained here.

A prediction rule was constructed in three steps:

1. An α = .05 test of the importance of predictor j, H₀: β_j = 0 versus H₁: β_j ≠ 0, was run separately for j = 2, 3, ..., 20, based on the logistic model

log[π(t_i)/(1 − π(t_i))] = β₁ + β_j t_ij,   π(t_i) ≡ Prob{patient i dies}.

Among these 19 tests, 13 predictors indicated predictive power by rejecting H₀: j = 18, 13, 15, 12, 14, 7, 6, 19, 20, 11, 2, 5, 3. These are listed in order of achieved significance level, j = 18 attaining the smallest alpha.

2. These 13 predictors were tested in a forward multiple-logistic-regression program, which added predictors one at a time (beginning with the constant) until no further single addition achieved significance level α = .10. Five predictors besides the constant survived this step, j = 13, 20, 15, 7, 2.

3. A final forward, stepwise multiple-logistic-regression program on these five predictors, stopping this time at level α = .05, retained four predictors besides the constant, j = 13, 15, 7, 20.

At each of the three steps, only those patients having no relevant data missing were included in the hypothesis tests. At step 2 for example, a patient was included only if all 13 variables were available.

The final prediction rule was based on the estimated logistic regression

log[π̂(t_i)/(1 − π̂(t_i))] = Σ_{j=1,13,15,7,20} β̂_j t_ij,

where β̂_j was the maximum likelihood estimate in this model. The prediction rule was

η(t; x) = 1 if Σ_j β̂_j t_j > c, 0 if Σ_j β̂_j t_j ≤ c,   c = log(33/122).   (26)

Among the 155 patients, 133 had none of the predictors 13, 15, 7, 20 missing. When the rule η(t; x) was applied to these 133 patients, it misclassified 21 of them, for an apparent error rate err̄ = 21/133 = .158. We would like to estimate how overoptimistic err̄ is.

To answer this question, the simple bootstrap was applied as described in Section 9. A typical bootstrap sample consisted of X*₁, X*₂, ..., X*₁₅₅, randomly drawn with replacement from the training set x₁, x₂, ..., x₁₅₅. The bootstrap sample was used to construct the bootstrap prediction rule η(·; X*), following the same three steps used in the construction of η(·; x), (26). This gives a bootstrap replication R* for the overoptimism random variable R = err − err̄, essentially as in (25), but with a modification to allow for difficulties caused by missing predictor values.
[Figure 5. Histogram of 500 bootstrap replications of overoptimism for the hepatitis problem.]

Figure 5 shows the histogram of B = 500 such replications. 95 percent of these fall in the range 0 ≤ R* ≤ .12. This indicates that the unobservable true overoptimism err − err̄ is likely to be positive. The average value is

ω̂_B = (1/B) Σ_{b=1}^{B} R*b = .045,

suggesting that the expected overoptimism is roughly 30 percent as large as the apparent error rate .158. Taken literally, this gives the bias-corrected estimated error rate .158 + .045 = .203. There is obviously plenty of room for error in this last estimate, given the spread of values in Figure 5, but at least we now have some idea of the possible bias in err̄.

The bootstrap analysis provided more than just an estimate of ω(F). For example, the standard deviation of the histogram in Figure 5 is .036. This is a dependable estimate of the true standard deviation of R (see Efron 1982, Ch. 7), which by definition equals [E(err − err̄ − ω)²]^½, the √MSE of err̄ + ω as an estimate of err. Comparing line 1 with line 4 in Table 6, we expect err̄ + ω̂_B = .203 to have √MSE at least this big for estimating err.

Figure 6 illustrates another use of the bootstrap replications. The predictors chosen by the three-step selection procedure, applied to the bootstrap training set X*, are shown for the last 25 of the 500 replications. Among all 500 replications, predictor 13 was selected 37 percent of the time, predictor 15 selected 48 percent, predictor 7 selected 35 percent, and predictor 20 selected 59 percent. No other predictor was selected more than 50 percent of the time. No theory exists for interpreting Figure 6, but the results certainly discourage confidence in the causal nature of the predictors 13, 15, 7, 20.

[Figure 6. Predictors selected in the last 25 bootstrap replications for the hepatitis problem. The predictors selected by the actual data were 13, 15, 7, 20.]

[Received January 1982. Revised May 1982.]

REFERENCES

BRESLOW, N. (1972), Discussion of Cox (1972), Journal of the Royal Statistical Society, Ser. B, 34, 216-217.
COX, D.R. (1972), "Regression Models and Life-Tables," Journal of the Royal Statistical Society, Ser. B, 34, 187-220.
CRAMÉR, H. (1946), Mathematical Methods of Statistics, Princeton: Princeton University Press.
EFRON, B. (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70, 892-898.
--- (1979a), "Bootstrap Methods: Another Look at the Jackknife," Annals of Statistics, 7, 1-26.
--- (1979b), "Computers and the Theory of Statistics: Thinking the Unthinkable," SIAM Review, 21, 460-480.
--- (1981a), "Censored Data and the Bootstrap," Journal of the American Statistical Association, 76, 312-319.
--- (1981b), "Nonparametric Estimates of Standard Error: The Jackknife, the Bootstrap, and Other Resampling Methods," Biometrika, 68, 589-599.
--- (1981c), "Nonparametric Standard Errors and Confidence Intervals," Canadian Journal of Statistics, 9, 139-172.
--- (1982), The Jackknife, the Bootstrap, and Other Resampling Plans, SIAM Monograph #38, CBMS-NSF.
EFRON, B., and GONG, G. (1981), "Statistical Theory and the Computer," unpublished manuscript.
GEISSER, S. (1975), "The Predictive Sample Reuse Method With Applications," Journal of the American Statistical Association, 70, 320-328.
GONG, G. (1982), "Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression," Ph.D. dissertation, Dept. of Statistics, Stanford University.
HAMPEL, F. (1974), "The Influence Curve and Its Role in Robust Estimation," Journal of the American Statistical Association, 69, 383-393.
JAECKEL, L. (1972), "The Infinitesimal Jackknife," Bell Laboratories Memorandum #MM 72-1215-11.
JOHNSON, N., and KOTZ, S. (1970), Continuous Univariate Distributions (Vol. 2), Boston: Houghton Mifflin.
MALLOWS, C.L. (1974), "On Some Topics in Robustness," Memorandum, Bell Laboratories, Murray Hill, New Jersey.
QUENOUILLE, M. (1949), "Approximate Tests of Correlation in Time Series," Journal of the Royal Statistical Society, Ser. B, 11, 68-84.
SCHUCANY, W., GRAY, H., and OWEN, D. (1971), "On Bias Reduction in Estimation," Journal of the American Statistical Association, 66, 524-533.
STONE, M. (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
