A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation

BRADLEY EFRON and GAIL GONG
… [n/(n − 1)]^(1/2), but there is no general advantage in doing so. A simple algorithm described in Section 2 allows the statistician to compute σ̂_B no matter how complicated θ̂ may be. Section 3 shows the close connection between σ̂_B and σ̂_J.

Cross-validation relates to another, more difficult, problem in estimating statistical error. Going back to (1), suppose we try to predict a new observation from F, call it X₀, using the estimator X̄ as a predictor. The expected squared error of prediction E[X₀ − X̄]² equals ((n + 1)/n) μ₂, where μ₂ is the variance of the distribution F. An unbiased estimate of ((n + 1)/n) μ₂ is

    ((n + 1)/n) σ̂².    (8)

Cross-validation is a way of obtaining nearly unbiased estimators of prediction error in much more complicated situations. The method consists of (a) deleting the points x_i from the data set one at a time; (b) recalculating the prediction rule on the basis of the remaining n − 1 points; (c) seeing how well the recalculated rule predicts the deleted point; and (d) averaging these predictions over all n deletions of an x_i. In the simple case above, the cross-validated estimate of prediction error is

    (1/n) Σᵢ₌₁ⁿ [x_i − x̄_(i)]².    (9)
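Computationally, the recipe (a)-(d) is only a few lines. The following minimal Python sketch (our illustration, not part of the article) evaluates (9) for the simple case, predicting each deleted x_i by the mean x̄_(i) of the remaining n − 1 points:

```python
import numpy as np

def cv_prediction_error(x):
    """Cross-validated estimate (9) of squared prediction error:
    delete each x_i, predict it by the mean of the other n - 1
    points, and average the squared prediction errors."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    loo_means = (x.sum() - x) / (n - 1)   # mean with x_i deleted
    return np.mean((x - loo_means) ** 2)
```

The identity x̄_(i) = (Σ x − x_i)/(n − 1) avoids refitting n times here; in more complicated problems the prediction rule really is recalculated for each deletion.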
2. THE BOOTSTRAP

This section describes the simple idea of the bootstrap (Efron 1979a). We begin with an example. The 15 points in Figure 1 represent various entering classes at American law schools in 1973. The two coordinates for law school i are x_i = (y_i, z_i),

    y_i = average LSAT score of entering students at school i,
    z_i = average undergraduate GPA score of entering students at school i.

[Figure 1. The law school data (Efron 1979b): average undergraduate GPA (2.70 to 3.50) plotted against average LSAT (540 to 680) for the 15 schools.]

The statistic of interest is the sample correlation coefficient ρ̂ of the 15 points, and we wish to estimate its standard error. The bootstrap estimate is obtained as follows:

1. The true standard error is a function of the unknown distribution F, say σ(F). (It is also a function of the sample size n, and the functional form of the statistic ρ̂, but both of these are known to the statistician.)

2. We don't know F, but we can estimate it by the empirical probability distribution F̂: mass 1/n on each observed data point x_i, i = 1, 2, ..., n.

3. The bootstrap estimate of σ(F) is

    σ̂_B = σ(F̂).    (10)

For the correlation coefficient, and for most statistics, even very simple ones, the function σ(F) is impossible to express in closed form. That is why the bootstrap is not in common use. However, in these days of fast and cheap computation, σ̂_B can easily be approximated by Monte Carlo methods:

(i) Construct F̂, the empirical distribution function, as just described.

(ii) Draw a bootstrap sample X₁*, X₂*, ..., X_n* by independent random sampling from F̂. In other words, make n random draws with replacement from {x₁, x₂, ..., x_n}. In the law school example a typical bootstrap sample might consist of 2 copies of point 1, 0 copies of point 2, 1 copy of point 3, and so on, the total number of copies summing to n = 15.
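In code, steps (i) and (ii), repeated B times, give the Monte Carlo approximation to σ̂_B. Here is a minimal Python sketch (ours, not the article's); the array `data` stands in for the 15 points of Figure 1, using commonly reproduced values of the law school data, and should be treated as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 15 (LSAT, GPA) points of Figure 1.
data = np.array([(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03),
                 (666, 3.44), (580, 3.07), (555, 3.00), (661, 3.43),
                 (651, 3.36), (605, 3.13), (653, 3.12), (575, 2.74),
                 (545, 2.76), (572, 2.88), (594, 2.96)])

def corr(x):
    """Sample correlation coefficient of the two coordinates."""
    return np.corrcoef(x[:, 0], x[:, 1])[0, 1]

def bootstrap_se(x, stat, B=512, rng=rng):
    """Monte Carlo approximation to sigma_B: draw B bootstrap
    samples (n draws with replacement from the data) and take the
    standard deviation of the B bootstrap replications of stat."""
    n = len(x)
    reps = np.array([stat(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    return reps.std(ddof=1)

print(round(bootstrap_se(data, corr), 3))   # roughly .13 for these values
```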
Table 2. A sampling experiment comparing estimates of standard error for the correlation coefficient ρ̂ (left four columns) and for φ̂ = tanh⁻¹ ρ̂ (right four columns).

                                           σ̂ for ρ̂                      σ̂ for φ̂
                                   Exp.  St.Dev.  C.V.  √MSE     Exp.  St.Dev.  C.V.  √MSE
1. Bootstrap, B = 128              .206   .066    .32   .067     .301   .065    .22   .065
2. Bootstrap, B = 512              .206   .063    .31   .064     .301   .062    .21   .062
3. Normal Smoothed Bootstrap,
   B = 128                         .200   .060    .30   .063     .296   .041    .14   .041
4. Uniform Smoothed Bootstrap,
   B = 128                         .205   .061    .30   .062     .298   .058    .19   .058
5. Uniform Smoothed Bootstrap,
   B = 512                         .205   .059    .29   .060     .296   .052    .18   .052
6. Jackknife                       .223   .085    .38   .085     .314   .090    .29   .091
7. Delta Method
   (Infinitesimal Jackknife)       .175   .058    .33   .072     .244   .052    .21   .076
Another way to describe the bootstrap estimate is as follows. Let P* indicate a vector drawn from the rescaled multinomial distribution

    P* ~ Mult(n, P⁰)/n,    P⁰ = (1/n, 1/n, ..., 1/n),    (12)

and let θ̂(P) denote the value of the statistic when the data points are weighted by P, so that θ̂ = θ̂(P⁰). Then

    σ̂_B = [var* θ̂(P*)]^(1/2),    (13)

var* indicating variance under (12). Define the linear approximation

    θ̂_L(P) = θ̂_(·) + (P − P⁰)ᵀ U,    (14)

where θ̂_(i) = θ̂(P_(i)) is the value of the statistic with x_i deleted, θ̂_(·) = (1/n) Σᵢ θ̂_(i), and U is a column vector with coordinates U_i = (n − 1)(θ̂_(·) − θ̂_(i)).

Theorem. The jackknife estimate of standard error equals

    σ̂_J = [(n/(n − 1)) var* θ̂_L(P*)]^(1/2).

There is a more obvious linear approximation to θ̂(P) than θ̂_L(P), (14). Why not use the first-order Taylor series expansion for θ̂(P) about the point P = P⁰? This is the idea of Jaeckel's infinitesimal jackknife (1972). The Taylor series approximation turns out to be

    θ̂_T(P) = θ̂(P⁰) + (P − P⁰)ᵀ U⁰,

where

    U_i⁰ = lim_{ε→0} [θ̂(P⁰ + ε(δ_i − P⁰)) − θ̂(P⁰)]/ε,

δ_i being the ith coordinate vector. This suggests the infinitesimal jackknife estimate of standard error

    σ̂_IJ = [var* θ̂_T(P*)]^(1/2) = [Σᵢ (U_i⁰)²/n²]^(1/2),    (15)

with var* still indicating variance under (12). The ordinary jackknife can be thought of as taking ε = −1/(n − 1) in the definition of U_i⁰, while the infinitesimal jackknife lets ε → 0.

The U_i⁰ are values of the empirical influence function. The influence function of θ̂ = θ(F) at x is

    IF(x) = lim_{ε→0} [θ((1 − ε)F + ε δ_x) − θ(F)]/ε,

δ_x being the degenerate distribution putting mass 1 on x. The right side of (15) is then the obvious estimate of

    σ(F) ≐ [∫ IF²(x) dF(x)/n]^(1/2),

the influence function approximation to the standard error of θ̂ (Hampel 1974).

For the correlation coefficient ρ̂ the delta method gives

    σ̂_D = {(ρ̂²/4n) [μ̂₄₀/μ̂₂₀² + μ̂₀₄/μ̂₀₂² + 2μ̂₂₂/(μ̂₂₀ μ̂₀₂) + 4μ̂₂₂/μ̂₁₁² − 4μ̂₃₁/(μ̂₁₁ μ̂₂₀) − 4μ̂₁₃/(μ̂₁₁ μ̂₀₂)]}^(1/2),

where, in terms of x_i = (y_i, z_i), μ̂_gh = Σᵢ (y_i − ȳ)^g (z_i − z̄)^h / n (Cramér 1946, p. 359).

Theorem. For statistics of the form θ̂ = t(Q̄₁, ..., Q̄_A), functions of observed averages, the nonparametric delta method and the infinitesimal jackknife give the same estimate of standard error (Efron 1981b).

The infinitesimal jackknife, the delta method, and the empirical influence function approach are three names for the same method. Notice that the results reported in line 7 of Table 2 show a severe downward bias. Efron and Stein (1981) show that the ordinary jackknife is always biased upward, in a sense made precise in that paper. In the authors' opinion the ordinary jackknife is the method of choice if one does not want to do the bootstrap computations.
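Numerically, both estimates are easy to obtain. The Python sketch below (ours; the finite step `eps` is a hypothetical stand-in for the limit ε → 0) computes σ̂_J, and approximates σ̂_IJ of (15) for any statistic that can be evaluated under reweighting of the data points; setting ε = −1/(n − 1) instead would recover the ordinary jackknife, as noted above.

```python
import numpy as np

def jackknife_se(x, stat):
    """Ordinary jackknife:
    sigma_J = [ (n-1)/n * sum_i (theta_(i) - theta_(.))^2 ]^(1/2)."""
    n = len(x)
    theta_i = np.array([stat(np.delete(x, i, axis=0)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean()) ** 2))

def weighted_corr(x, w):
    """Correlation coefficient when point i carries weight w_i (sum 1)."""
    my, mz = w @ x[:, 0], w @ x[:, 1]
    cyy = w @ (x[:, 0] - my) ** 2
    czz = w @ (x[:, 1] - mz) ** 2
    cyz = w @ ((x[:, 0] - my) * (x[:, 1] - mz))
    return cyz / np.sqrt(cyy * czz)

def infinitesimal_jackknife_se(x, wstat, eps=1e-6):
    """Approximate (15): U0_i is the numerical derivative of the
    statistic at P0 in the direction of delta_i, and
    sigma_IJ = [ sum_i U0_i^2 / n^2 ]^(1/2)."""
    n = len(x)
    p0 = np.full(n, 1.0 / n)
    base = wstat(x, p0)
    u0 = np.empty(n)
    for i in range(n):
        p = p0 * (1 - eps)       # P0 + eps * (delta_i - P0)
        p[i] += eps
        u0[i] = (wstat(x, p) - base) / eps
    return np.sqrt(np.sum(u0 ** 2) / n ** 2)
```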
Notice that the median of the bootstrap histogram is substantially higher than ρ̂ in Figure 2. In fact, Ĉ(ρ̂) = .433, only 433 out of 1000 bootstrap replications having ρ̂* < ρ̂. The bias-corrected percentile method makes an adjustment for this type of bias. Let Φ(z) indicate the CDF of the standard normal distribution, so Φ(z_α) = 1 − α, and define

    z₀ = Φ⁻¹(Ĉ(θ̂)).

We then calculate the central 1 − 2α interval as

    [Ĉ⁻¹(Φ(2z₀ − z_α)), Ĉ⁻¹(Φ(2z₀ + z_α))].

Table 3. Central 90 percent confidence intervals for ρ in ten trials of the sampling experiment, with endpoints expressed as offsets from ρ̂; the four right-hand columns are four competing interval methods, and AVE is the average over the ten trials.

Trial    ρ̂
  1     .16   (−.29, .26)   (−.29, .24)   (−.28, .25)   (−.28, .24)
  2     .75   (−.17, .09)   (−.05, .08)   (−.13, .04)   (−.12, .08)
  3     .55   (−.25, .16)   (−.24, .16)   (−.34, .12)   (−.27, .15)
  4     .53   (−.26, .17)   (−.16, .16)   (−.19, .13)   (−.21, .16)
  5     .73   (−.18, .10)   (−.12, .14)   (−.16, .10)   (−.20, .10)
  6     .50   (−.26, .18)   (−.18, .18)   (−.22, .15)   (−.26, .14)
  7     .70   (−.20, .11)   (−.17, .12)   (−.21, .10)   (−.18, .11)
  8     .30   (−.29, .23)   (−.29, .25)   (−.33, .24)   (−.29, .25)
  9     .33   (−.29, .22)   (−.36, .24)   (−.30, .27)   (−.30, .26)
 10     .22   (−.29, .24)   (−.50, .34)   (−.48, .36)   (−.38, .34)
AVE     .48   (−.25, .18)   (−.21, .19)   (−.26, .18)   (−.25, .18)
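Given B bootstrap replications θ̂*¹, ..., θ̂*ᴮ, both the percentile interval and its bias-corrected version can be sketched in a few lines of Python (ours, not the article's); `scipy.stats.norm` supplies Φ and Φ⁻¹:

```python
import numpy as np
from scipy.stats import norm

def percentile_interval(reps, alpha=0.05):
    """Central 1 - 2*alpha percentile interval from the bootstrap CDF."""
    return np.quantile(reps, [alpha, 1 - alpha])

def bc_percentile_interval(reps, theta_hat, alpha=0.05):
    """Bias-corrected percentile interval: z0 = Phi^{-1}(C(theta_hat)),
    then [C^{-1}(Phi(2 z0 - z_alpha)), C^{-1}(Phi(2 z0 + z_alpha))]."""
    c_hat = np.mean(np.asarray(reps) < theta_hat)  # e.g. .433 in the text
    z0 = norm.ppf(c_hat)
    z_alpha = norm.ppf(1 - alpha)                  # Phi(z_alpha) = 1 - alpha
    return (np.quantile(reps, norm.cdf(2 * z0 - z_alpha)),
            np.quantile(reps, norm.cdf(2 * z0 + z_alpha)))
```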
In the sampling experiment of Table 2 the true bias of ρ̂ for estimating ρ is β = −.014. The bootstrap estimate β̂_B, taking B = 128, has expectation −.014 and standard deviation .031 in this case, while β̂_J has expectation −.017, standard deviation .040. Bias is a negligible source of statistical error in this situation, compared with variability. In applications this is usually made clear by comparison of β̂_B with σ̂_B.
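A sketch of the Monte Carlo form of the bootstrap bias estimate used in this comparison, assuming the usual definition β̂_B = E*{θ̂*} − θ̂ (our code, not the article's):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_bias(x, stat, B=128, rng=rng):
    """Bootstrap bias estimate: the average bootstrap replication of
    the statistic minus its observed value, beta_B = mean(theta*) - theta_hat."""
    n = len(x)
    theta_hat = stat(x)
    reps = [stat(x[rng.integers(0, n, size=n)]) for _ in range(B)]
    return np.mean(reps) - theta_hat
```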
The estimates (18) and (19) are closely related to each other. The argument is the same as in Section 3, except that we approximate θ̂(P) with a quadratic rather than a linear function of P, say

    Q(P) = a + (P − P⁰)ᵀ b + ½ (P − P⁰)ᵀ c (P − P⁰).

Let θ̂_Q(P) be any such quadratic satisfying

    θ̂_Q(P⁰) = θ̂(P⁰) = θ̂  and  θ̂_Q(P_(i)) = θ̂(P_(i)),  i = 1, 2, ..., n.

Theorem. The jackknife estimate of bias equals

    β̂_J = (n/(n − 1)) [E*{θ̂_Q(P*)} − θ̂],

which is n/(n − 1) times the bootstrap estimate of bias for θ̂_Q (Efron 1982). Once again, the jackknife is, almost, a bootstrap estimate itself, except applied to a convenient approximation of θ̂(P). The derivation of (20) is the same as for the theorem of this section, being based on a quadratic approximation formula.

More general problems. There is nothing special about bias and standard error as far as the bootstrap is concerned. The bootstrap procedure can be applied to almost any estimation problem. Suppose that R(X₁, X₂, ..., X_n; F) is a random variable, and we are interested in estimating some aspect of R's distribution. (So far we have taken R = θ̂(F̂) − θ(F).)
7. MORE COMPLICATED DATA SETS

So far we have considered the simplest kind of data sets, where all the observations come from the same distribution F. The bootstrap idea, and jackknife-type approximations (which are not discussed here), can be applied to much more complicated situations. We begin with a two-sample problem.

The data in our first example consist of two independent random samples,

    X₁, X₂, ..., X_m ~ F  and  Y₁, Y₂, ..., Y_n ~ G,

F and G being two possibly different distributions on the real line. The statistic of interest is the Hodges-Lehmann shift estimate

    θ̂ = median {y_j − x_i; i = 1, ..., m, j = 1, ..., n}.

We desire an estimate of the standard error σ(F, G). The bootstrap estimate is simply

    σ̂_B = σ(F̂, Ĝ),

Ĝ being the empirical distribution of the y_j. This is evaluated by Monte Carlo, as in Section 2, with obvious modifications: a bootstrap sample now consists of a random sample X₁*, X₂*, ..., X_m* drawn from F̂ and an independent random sample Y₁*, Y₂*, ..., Y_n* drawn from Ĝ. (In other words, m draws with replacement from {x₁, x₂, ..., x_m}, and n draws with replacement from {y₁, y₂, ..., y_n}.) The bootstrap replication θ̂* is the median of the mn differences y_j* − x_i*. Then σ̂_B is approximated from B independent such replications as on the right side of (11).
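The "separate" two-sample recipe just described might be coded as follows (a minimal Python sketch of ours, not the article's):

```python
import numpy as np

rng = np.random.default_rng(2)

def hodges_lehmann(x, y):
    """theta_hat = median of the m*n differences y_j - x_i."""
    return np.median(np.subtract.outer(y, x))

def two_sample_bootstrap_se(x, y, B=200, rng=rng):
    """Resample x from F_hat and y from G_hat independently (with
    replacement), recompute theta_hat, and take the standard
    deviation of the B replications."""
    m, n = len(x), len(y)
    reps = np.array([
        hodges_lehmann(x[rng.integers(0, m, size=m)],
                       y[rng.integers(0, n, size=n)])
        for _ in range(B)])
    return reps.std(ddof=1)
```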
Table 4 shows the results of a sampling experiment in which m = 6, n = 9, and both F and G were uniform distributions on the interval [0, 1]. The table is based on 100 trials of the situation. The true standard error is σ(F, G) = .167. "Separate" refers to σ̂_B calculated exactly as described in the previous paragraph. The improvement in going from B = 100 to B = 200 is too small to show up in the table.

Table 4. Bootstrap Estimates of Standard Error for the Hodges-Lehmann Two-Sample Shift Estimate; m = 6, n = 9; True Distributions Both F and G Uniform [0, 1].

                          Expectation   St. Dev.   C.V.   √MSE
Separate    B = 100          .165         .030      .18    .030
            B = 200          .166         .031      .19    .031
Combined    B = 100          .145         .028      .19    .036
            B = 200          .149         .025      .17    .031
True Standard Error          .167

"Combined" refers to the following idea: suppose we believe that G is really a translate of F. Then it wastes information to estimate F and G separately. Instead we can put mass 1/(m + n) on each of the m + n points x₁, ..., x_m, y₁ − θ̂, ..., y_n − θ̂, obtaining a single combined estimate, say Ĥ. All m + n bootstrap variates X₁*, ..., X_m*, Y₁*, ..., Y_n* are then sampled independently from Ĥ. (We could add θ̂ back to the Y_i* values, but this has no effect on the bootstrap standard error estimate, since it just adds the constant θ̂ to each bootstrap replication θ̂*.)

The combined method gives no improvement here, but it might be valuable in a many-sample problem where there are small numbers of observations in each sample, a situation that arises in stratified sampling. (See Efron 1982, Ch. 8.) The main point here is that "bootstrap" is not a well-defined verb, and that there may be more than one way to proceed in complicated situations. Next we consider regression problems, where again there is a choice of bootstrapping methods.

In a typical regression problem we observe n independent real-valued quantities Y_i = y_i,

    Y_i = g_i(β) + ε_i,    i = 1, 2, ..., n,    (21)

where the g_i(·) are functions of known form depending on an unknown parameter vector β, and the errors ε_i are an independent sample from an unknown distribution F. We estimate β by minimizing some measure of distance D between y and the vector of predicted values η(β) = (g₁(β), ..., g_n(β)),

    β̂ : min_β D(y, η(β)).

The most common choice of D is D(y, η) = Σᵢ₌₁ⁿ (y_i − η_i)².

Having calculated β̂, we can modify the one-sample bootstrap algorithm of Section 2 and obtain an estimate of β̂'s variability:

(i) Construct F̂ putting mass 1/n at each observed residual: mass 1/n on ε̂_i = y_i − g_i(β̂), i = 1, 2, ..., n.

(ii) Construct a bootstrap data set

    Y_i* = g_i(β̂) + ε_i*,    i = 1, 2, ..., n,

where the ε_i* are drawn independently from F̂, and calculate

    β̂* : min_β D(y*, η(β)).

(iii) Do step (ii) some large number B of times, obtaining independent bootstrap replications β̂*¹, β̂*², ..., β̂*ᴮ, and estimate the covariance matrix of β̂ by the sample covariance matrix of the replications, say Σ̂_B.

In ordinary linear regression we have g_i(β) = t_iᵀβ and D(y, η) = Σ(y_i − η_i)². Section 7 of Efron (1979a) shows that in this case the algorithm above can be carried out theoretically, B = ∞, and yields

    Σ̂_B = σ̂² (Σᵢ₌₁ⁿ t_i t_iᵀ)⁻¹,    σ̂² = Σᵢ₌₁ⁿ ε̂_i²/n.    (22)

This is the usual answer, except for dividing by n instead of n − p in σ̂². Of course the advantage of the bootstrap approach is that Σ̂_B can just as well be calculated if, say, g_i(β) = exp(t_iᵀβ) and D(y, η) = Σᵢ₌₁ⁿ |y_i − η_i|.

There is another, simpler way to bootstrap the regression problem. We can consider each covariate-response pair x_i = (t_i, y_i) to be a single data point obtained by random sampling from a distribution F on (p + 1)-dimensional space. Then we apply the one-sample bootstrap of Section 2 to the data set x₁, x₂, ..., x_n.
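For the ordinary linear model g_i(β) = t_iᵀβ, algorithm (i)-(iii) reads as below (a Python sketch of ours; as B → ∞ the answer approaches (22)). The pairs version would instead resample whole rows (t_i, y_i):

```python
import numpy as np

rng = np.random.default_rng(3)

def residual_bootstrap_cov(T, y, B=500, rng=rng):
    """Steps (i)-(iii): put mass 1/n on each observed residual,
    rebuild responses y* = T beta_hat + eps*, refit, and return the
    sample covariance matrix of the bootstrap replications beta*."""
    n, p = T.shape
    beta_hat, *_ = np.linalg.lstsq(T, y, rcond=None)
    resid = y - T @ beta_hat                     # step (i)
    betas = np.empty((B, p))
    for b in range(B):
        eps_star = rng.choice(resid, size=n, replace=True)
        y_star = T @ beta_hat + eps_star         # step (ii)
        betas[b], *_ = np.linalg.lstsq(T, y_star, rcond=None)
    return np.cov(betas, rowvar=False)           # step (iii)
```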
The bootstrap estimate of standard error for θ̂, as given by (11), is σ̂_B = .42. This agrees nicely with Cox's asymptotic estimate σ̂ = .41. However, the percentile method gives quite different confidence intervals from those obtained by the usual method. For α = .05, 1 − 2α = .90, the latter interval is 1.51 ± 1.65 · .41 = [.83, 2.19]. The percentile method gives the 90 percent central interval [.98, 2.35]. Notice that (2.35 − 1.51)/(1.51 − .98) = 1.58, so that the percentile interval is considerably larger to the right of θ̂ than to the left. (The bias-corrected percentile method gives almost the same answers as the uncorrected method in this case, since Ĉ(θ̂) = .49.)

The true error rate of a prediction rule η(·; x) is

    err = E{Q[Y₀, η(T₀; x)]},

where Q[y, η] is the error indicator

    Q[y, η] = 0 if y = η,  1 if y ≠ η.

An obvious estimate of err is the apparent error rate

    err̄ = Ê{Q[Y₀, η(T₀; x)]} = (1/n) Σᵢ₌₁ⁿ Q[y_i, η(t_i; x)].

The symbol Ê indicates expectation with respect to the empirical distribution F̂, putting mass 1/n on each x_i. The apparent error rate is likely to underestimate the true error rate, since we are evaluating η(·; x)'s performance on the same data from which it was constructed.
Let x_(i) denote the data set with x_i deleted, and η(·; x_(i)) the prediction rule constructed from it. The cross-validation estimate of the expected excess error is

    ω̂_CV = (1/n) Σᵢ₌₁ⁿ Q[y_i, η(t_i; x_(i))] − (1/n) Σᵢ₌₁ⁿ Q[y_i, η(t_i; x)],

which can be written

    ω̂_CV = (1/n) Σᵢ₌₁ⁿ {Q[y_i, η(t_i; x_(i))] − Q[y_i, η(t_i; x)]}.

In other words, ω̂_CV + err̄ is the observed cross-validated error rate, based on predicting each y_i from the rule η(·; x_(i)), whose construction does not involve x_i.

As a matter of fact, ω̂_J and ω̂_CV have asymptotic correlation one (Gong 1982). Their nearly perfect correlation can be seen in Table 5. In the sampling experiments of Table 6, corr(ω̂_J, ω̂_CV) = .93 on the left side, and .98 on the right side. The point here is that the cross-validation estimate ω̂_CV is, essentially, a Taylor series approximation to the bootstrap estimate ω̂_B.

Table 6 summarizes the two sampling experiments (left and right halves of the table): for each estimator it gives the expectation, standard deviation, correlation with the true excess error, and root mean squared error of ω̂ for estimating R, or equivalently of err̂ for estimating err.

Table 6. Estimates of excess error in two sampling experiments.

                          Exp.  St.Dev.  Corr.  √MSE      Exp.  St.Dev.  Corr.  √MSE
1. Ideal Constant           ·      ·       ·      ·       .184     0       0    .099
2. Cross-Validation       .091   .073    −.07   .139      .170   .094    −.15   .147
3. Jackknife              .093   .068    −.23   .145      .167   .089    −.26   .150
4. Bootstrap (B = 200)    .080   .028    −.64   .135      .103   .031    −.58   .145
5. BootRand (B = 200)     .087   .026    −.55   .130      .147   .020    −.31   .114
6. BootAve (B = 200)      .100   .036    −.18   .125      .172   .041    −.25   .118
7. Zero                     0      0       0    .149        0      0       0    .209

"BootRand," line 5, modified the bootstrap estimate in just one way: instead of drawing the bootstrap sample X₁*, X₂*, ..., X_n* from F̂, it was drawn from

    F̂_RAND: mass π_i/n on (t_i, 1), mass (1 − π_i)/n on (t_i, 0),  i = 1, 2, ..., n.

This is a distribution supported on 2n points, the observed points x_i = (t_i, y_i) and also the complementary points (t_i, 1 − y_i). The probabilities π_i were those naturally associated with the linear discriminant function.

Theoretical arguments can be mustered to show that ω̂_CV will usually have expectation greater than ω, while ω̂_B usually has expectation less than ω, which gives grounds of hope for a favorable hybrid. "BootAve" is the compromise estimator ω̂_AVE = (ω̂_B + ω̂_CV)/2. It also performs well in Table 6, though there is not yet enough theoretical or numerical evidence to warrant unqualified enthusiasm.
Table 7. Data for the last 11 of the n = 155 hepatitis patients; y = 1 if the patient died, y = 0 if the patient lived. Negative numbers represent missing values.

        1      2    3    4     5      6      7       8       9       10      11      12      13     14      15     16     17    18     19     20
  y   Const  Age  Sex  Ster  Anti  Fatig  Malais  Anorex  LivBig  LivFirm  SplPalp  Spider  Ascit  Varic   Bilir  AlkPh  SGOT  Albmn  Prot  Histol    #
  1     1     45   1    2     2     1       1       1       2       2        2        1       1      2     1.90    −1    114   2.4    −1    −3     145
  0     1     31   1    1     2     1       2       2       2       2        2        2       2      2     1.20    75    193   4.2    54     2     146
  1     1     41   1    2     2     1       2       2       2       1        1        1       2      1     4.20    65    120   3.4    −1    −3     147
  1     1     70   1    1     2     1       1       1      −3      −3       −3       −3      −3     −3     1.70   109    528   2.8    35     2     148
  0     1     20   1    1     2     2       2       2       2      −3        2        2       2      2      .90    89    152   4.0    −1     2     149
  0     1     36   1    2     2     2       2       2       2       2        2        2       2      2      .60   120     30   4.0    −1     2     150
  1     1     46   1    2     2     1       1       1       2       2        2        1       1      1     7.60    −1    242   3.3    50    −3     151
  0     1     44   1    2     2     1       2       2       2       1        2        2       2      2      .90   126    142   4.3    −1     2     152
  0     1     61   1    1     2     1       1       2       1       1        2        1       2      2      .80    95     20   4.1    −1     2     153
  0     1     53   2    1     2     1       2       2       2       2        1        1       2      1     1.50    84     19   4.1    48    −3     154
  1     1     43   1    2     2     1       2       2       2       2        1        1       1      2     1.20   100     19   3.1    42     2     155
The bootstrap is a general all-purpose device that can be applied to almost any problem. This is very handy, but it implies that in situations with special structure the bootstrap may be outperformed by more specialized methods. Here we have done so in two different ways: BootRand uses an estimate of F that is better than the totally nonparametric estimate F̂, and BootAve makes use of the particular form of R for the overoptimism problem.

10. A COMPLICATED PREDICTION PROBLEM

We end this article with the bootstrap analysis of a genuine prediction problem, involving many of the complexities and difficulties typical of genuine problems. The bootstrap is not necessarily the best method here, as discussed in Section 9, but it is impressive to see how much information this simple idea, combined with massive computation, can extract from a situation that is hopelessly beyond traditional theoretical solutions. A fuller discussion appears in Efron and Gong (1981).

Among n = 155 acute chronic hepatitis patients, 33 were observed to die from the disease, while 122 survived. Each patient had associated a vector of 20 covariates. On the basis of this training set it was desired to produce a rule for predicting, from the covariates, whether a given patient would live or die. If an effective prediction rule were available, it would be useful in choosing among alternative treatments. For example, patients with a very low predicted probability of death could be given less rigorous treatment.

Let x_i = (t_i, y_i) represent the data for patient i, i = 1, 2, ..., 155. Here t_i is the 20-dimensional vector of covariates, and y_i equals 1 or 0 as the patient died or lived. Table 7 shows the data for the last 11 patients. Negative numbers represent missing values. Variable 1 is the constant 1, included for convenience. The meaning of the 19 other predictors, and their coding in Table 7, will not be explained here.

A prediction rule was constructed in 3 steps:

1. An α = .05 test of the importance of predictor j was carried out separately for each j = 2, 3, ..., 20. Among these 19 tests, 13 predictors indicated predictive power by rejecting H₀: j = 18, 13, 15, 12, 14, 7, 6, 19, 20, 11, 2, 5, 3. These are listed in order of achieved significance level, j = 18 attaining the smallest alpha.

2. These 13 predictors were tested in a forward multiple-logistic-regression program, which added predictors one at a time (beginning with the constant) until no further single addition achieved significance level α = .10. Five predictors besides the constant survived this step, j = 13, 20, 15, 7, 2.

3. A final forward, stepwise multiple-logistic-regression program on these five predictors, stopping this time at level α = .05, retained four predictors besides the constant, j = 13, 15, 7, 20.

At each of the three steps, only those patients having no relevant data missing were included in the hypothesis tests. At step 2, for example, a patient was included only if all 13 variables were available.

The final prediction rule was based on the estimated logistic regression

    log [π̂(t_i)/(1 − π̂(t_i))] = Σ_{j ∈ {1, 13, 15, 7, 20}} β̂_j t_ij,

where β̂_j was the maximum likelihood estimate in this model. The prediction rule was

    η(t; x) = 1 if Σ_j β̂_j t_j > c,  0 if Σ_j β̂_j t_j ≤ c,    (26)

    c = log(33/122).

Among the 155 patients, 133 had none of the predictors 13, 15, 7, 20 missing. When the rule η(t; x) was applied to these 133 patients, it misclassified 21 of them, for an apparent error rate err̄ = 21/133 = .158. We would like to estimate how overoptimistic err̄ is.

To answer this question, the simple bootstrap was applied as described in Section 9. A typical bootstrap sample consisted of X₁*, X₂*, ..., X₁₅₅*, randomly drawn with replacement from the training set x₁, x₂, ..., x₁₅₅.
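Schematically, the bootstrap attack on the overoptimism of err̄ looks as follows (our Python sketch; `build_rule` is a hypothetical function that must repeat the entire three-step selection process, tests, stepwise logistic regressions and all, on each bootstrap training set, which is the expensive part):

```python
import numpy as np

rng = np.random.default_rng(4)

def bootstrap_overoptimism(t, y, build_rule, B=500, rng=rng):
    """For each bootstrap training set, rebuild the prediction rule
    from scratch, then compare its apparent error rate on the
    bootstrap sample with its error rate on the original sample;
    the average difference estimates the overoptimism of err_bar."""
    n = len(y)
    omega = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # resample the patients
        rule = build_rule(t[idx], y[idx])           # full 3-step construction
        err_star = np.mean(rule(t[idx]) != y[idx])  # apparent error on x*
        err_orig = np.mean(rule(t) != y)            # error on the full data
        omega[b] = err_orig - err_star
    return omega.mean()
```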