March 2005
Patrick J. Heagerty
Department of Biostatistics, University of Washington, P.O. Box 357232, Seattle,
Washington 98195-7232, U.S.A.
email: heagerty@u.washington.edu
and
Yingye Zheng
Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, MP 702, P.O. Box 19024,
Seattle, Washington 98109-1024, U.S.A.
Summary. The predictive accuracy of a survival model can be summarized using extensions of the proportion of variation explained by the model, or $R^2$, commonly used for continuous response models, or using extensions of sensitivity and specificity, which are commonly used for binary response models. In this article we propose new time-dependent accuracy summaries based on time-specific versions of sensitivity and specificity calculated over risk sets. We connect the accuracy summaries to a previously proposed global concordance measure, which is a variant of Kendall's tau. In addition, we show how standard Cox regression output can be used to obtain estimates of time-dependent sensitivity and specificity, and time-dependent receiver operating characteristic (ROC) curves. Semiparametric estimation methods appropriate for both proportional and nonproportional hazards data are introduced, evaluated in simulations, and illustrated using two familiar survival data sets.
Key words: Cox regression; Discrimination; Prediction; Sensitivity; Specificity.
Survival Model Predictive Accuracy and ROC Curves 93
We focus here on using Cox model methods to both generate a model score and to evaluate the prognostic potential of the model score. However, the evaluation methods that we propose can be used to summarize the accuracy of a prognostic score generated through any alternative regression or predictive method, and in this case varying coefficient methods (Hastie and Tibshirani, 1993) such as locally weighted partial likelihood estimation (Cai and Sun, 2003) provide a convenient approach for estimating key accuracy summaries. Therefore, we briefly introduce the relevant aspects of partial likelihood estimation. Under the proportional hazards assumption, $\lambda(t \mid Z_i) = \lambda_0(t) \exp(Z_i^T \beta)$, where $\lambda(t \mid Z_i) = \lim_{\delta \to 0} \delta^{-1} P[T_i \in [t, t + \delta) \mid Z_i, T_i \ge t]$. The partial likelihood score equations can be written as

$$0 = \sum_i \Delta_i \Big[ Z_i - \sum_k \pi_k(\beta, X_i) Z_k \Big],$$

where $\pi_k(\beta, t) = R_k(t) \exp(Z_k^T \beta)/W(t)$, with $W(t) = \sum_j R_j(t) \exp(Z_j^T \beta)$. Solving these equations yields the consistent and asymptotically normal maximum partial likelihood estimator (MPLE) (Cox, 1972).

1.2 Proportion of Variance Approaches
Two main approaches exist for characterizing the proportion of variation explained by a survival model. Schemper and Henderson (2000) overview an approach where the survival time is characterized by a counting process representation, $N_i(t) = 1(T_i \le t)$, and time-integrated variances are used to form the summary measure. Alternatively, O'Quigley and Xu (2001) consider the proportion of variation in the covariate, $Z_i$, that is explained by the survival time $T_i$.
Schemper and Henderson (2000) build on earlier work that extends $R^2$ to Cox regression. Their approach focuses on using the counting process, $N_i(t)$, and marginal and conditional expectations given by the survival functions $S(t) = E[1 - N_i(t)]$ and $S(t \mid Z_i) = E[1 - N_i(t) \mid Z_i]$, respectively. Because the vital status indicator $N_i(t)$ is a binary variable, Schemper and Henderson (2000) propose using the marginal variance $S(t)[1 - S(t)]$ and the conditional variance $S(t \mid Z_i)[1 - S(t \mid Z_i)]$ to characterize the proportion of variation explained by the covariates $Z_i$. In particular, a finite time range $(0, \tau)$ is considered and time-average variances are formed:

$$D(\tau) = \int_0^\tau S(t)[1 - S(t)]\, f(t)\, dt \Big/ \int_0^\tau f(t)\, dt$$

$$D_Z(\tau) = \int_0^\tau E_Z\{S(t \mid Z)[1 - S(t \mid Z)]\}\, f(t)\, dt \Big/ \int_0^\tau f(t)\, dt,$$

where $f(t)$ is the marginal density of $T_i$. Our representation above differs by a factor of 2 from the proposal of Schemper and Henderson (2000) as they also consider the mean absolute deviation, $E[|N_i(t) - S(t)|] = 2S(t)[1 - S(t)]$. Finally, the summary $V(\tau) = [D(\tau) - D_Z(\tau)]/D(\tau)$ is proposed as the proportion of variation explained by covariates. Similarly, our approach views survival data through the counting process representation, $N_i(t)$, but because $N_i(t)$ is a binary outcome we explore the extension of standard binary response accuracy summaries such as ROC curves rather than considering an extension of $R^2$.
O'Quigley and Xu (2001) also develop $R^2$ summaries for Cox regression. In their approach the role of survival time and covariate are reversed, and the proportion of variation in the covariate that is explained by survival is proposed. The authors exploit partial likelihood estimation methods because the methods provide model-based estimates of the distribution of covariates conditional on survival time. Focusing on a scalar covariate, Xu and O'Quigley (2000) show that $\pi_i(\beta, t) = R_i(t) \exp(Z_i \beta)/W(t)$ can be used to estimate the distribution of the covariate, $Z_i$, conditional on the event occurring at time $t$, $P(Z_i \le z \mid T_i = t) = \sum_j \pi_j(\beta, t)\, 1(Z_j \le z)$. O'Quigley and Xu (2001) obtain estimates of the conditional variance $\mathrm{var}(Z_i \mid T_i = t)$ and propose a global summary by integrating estimates of the marginal and conditional variance over the survival distribution. Our approach is similar in that we also use $\pi_i(\beta, t)$ to estimate conditional distributions, but rather than computing variances we estimate time-dependent versions of sensitivity and specificity defined in the following section.

1.3 Overview
In Section 2 we briefly review ROC methods proposed for summarizing the accuracy of a prognostic marker or model when the outcome of interest is a survival time. We then develop new definitions of time-dependent sensitivity and specificity that are strongly connected to partial likelihood concepts. Time-dependent accuracy measures can be used to calculate time-specific ROC curves, and time-specific area under the curve (AUC) summaries. We show that a global concordance measure is the integral, or weighted average, of time-specific AUC measures. In Section 3 we discuss the estimation of time-dependent ROC and AUC summaries and provide a method that is applicable to a proportional hazards model, and a more general method that can be used to characterize any scalar prognostic score even if proportional hazards do not obtain. Finally, in Section 4 we analyze two well-known data sets. We conclude the article with a brief discussion.

2. Censored Survival and Predictive Accuracy
2.1 Background on ROC Curve Analysis
When outcomes $Y_i$ are binary the accuracy of a prediction or classification rule is typically summarized through correct classification rates defined as sensitivity, $P(p_i > c \mid Y_i = 1)$, and specificity, $P(p_i \le c \mid Y_i = 0)$, where $p_i$ is a prediction, and $c$ is a criterion for classifying the prediction as positive ($p_i > c$) or negative ($p_i \le c$). When no a priori value of $c$ is indicated the full spectrum of sensitivities and specificities can be characterized using an ROC curve that plots the true-positive rate (sensitivity) versus the false-positive rate (1 − specificity) for all $c \in (-\infty, +\infty)$.
An ROC curve provides complete information on the set of all possible combinations of true-positive and false-positive rates, but is also more generally useful as a graphical characterization of the magnitude of separation between the case and control marker distributions. If case measurements and control measurements have no overlap then the ROC curve takes the value 1 (perfect true-positive rate) for any false-positive rate greater than 0. In this situation the marker is perfect at discriminating between cases and controls.
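The binary-outcome definitions of sensitivity, specificity, and the ROC curve in Section 2.1 can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and chosen only for illustration; the code simply evaluates the empirical versions of $P(p_i > c \mid Y_i = 1)$ and $P(p_i \le c \mid Y_i = 0)$ over all observed thresholds.

```python
def sensitivity(preds, labels, c):
    """Empirical P(p_i > c | Y_i = 1): fraction of cases classified positive."""
    cases = [p for p, y in zip(preds, labels) if y == 1]
    return sum(p > c for p in cases) / len(cases)


def specificity(preds, labels, c):
    """Empirical P(p_i <= c | Y_i = 0): fraction of controls classified negative."""
    controls = [p for p, y in zip(preds, labels) if y == 0]
    return sum(p <= c for p in controls) / len(controls)


def roc_points(preds, labels):
    """Sorted (false-positive rate, true-positive rate) pairs over all thresholds."""
    thresholds = [float("-inf")] + sorted(set(preds))
    points = [
        (1 - specificity(preds, labels, c), sensitivity(preds, labels, c))
        for c in thresholds
    ]
    return sorted(points)


# Hypothetical toy data: predictions and binary outcomes.
preds = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
points = roc_points(preds, labels)
```

Sweeping the threshold from $-\infty$ upward moves the operating point from (1, 1) toward (0, 0), which is the curve described above.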
94 Biometrics, March 2005
disease status indicator (Hanley and McNeil, 1982). Specifically, the AUC measures the probability that the marker value for a randomly selected case exceeds the marker value for a randomly selected control and is directly related to the Mann-Whitney U statistic (Hanley and McNeil, 1982; Pepe, 2003). Finally, ROC curves are particularly useful for comparing the discriminatory capacity of different potential biomarkers. For example, if for each value of specificity one marker always has a higher sensitivity, then this marker will be a uniformly better diagnostic measurement. See Zhou, McClish, and Obuchowski (2002) or Pepe (2003) for more discussion of ROC analysis.
In this section we first review previous proposals for generalizing the concepts of sensitivity and specificity for application to survival endpoints. Definitions of sensitivity and speci-
ferent definitions of specificity. In this section we focus on a scalar marker value $M_i$ that is used as a predictor of death. When our interest is in the accuracy of a regression model we will use $M_i = Z_i^T \beta$.
2.2.1 Cumulative/dynamic. For a baseline marker value, $M_i$, Heagerty et al. (2000) propose versions of time-dependent sensitivity and specificity using the definitions

$$\mathrm{sensitivity}^C(c, t): \; P(M_i > c \mid T_i \le t) = P(M_i > c \mid N_i(t) = 1)$$

$$\mathrm{specificity}^D(c, t): \; P(M_i \le c \mid T_i > t) = P(M_i \le c \mid N_i(t) = 0).$$

Using this approach, at any fixed time $t$ the entire population is classified as either a case or a control on the basis of vital status at time $t$. Also, each individual plays the role of a control for times $t < T_i$, but then contributes as a case for later times, $t \ge T_i$. Cumulative/dynamic accuracy summaries are most appropriate when a specific time $t$ (or a small collection

$[\mathrm{FP}_t^D]^{-1}(p) = \inf_c \{c : \mathrm{FP}_t^D(c) \le p\}$. In the absence of censoring $\mathrm{ROC}_t^{C/D}(p)$ can be estimated using the empirical distribution of the marker separately among cases and controls. With censored survival times Heagerty et al. (2000) develop a nonparametric estimator based on the nearest-neighbor bivariate distribution estimator of Akritas (1994). A substantive application that demonstrates use of cumulative/dynamic ROC curves for a Cox regression model can be found in Fan et al. (2002).
2.2.2 Incident/static. Etzioni et al. (1999) and Slate and Turnbull (2000) adopt an alternative definition of time-dependent sensitivity and specificity using

$$\mathrm{sensitivity}^I(c, t): \; P(M_i > c \mid T_i = t) = P(M_i > c \mid dN_i(t) = 1)$$

$$\mathrm{sensitivity}^I(c, t): \; P(M_i > c \mid T_i = t) = P(M_i > c \mid dN_i(t) = 1)$$

$$\mathrm{specificity}^D(c, t): \; P(M_i \le c \mid T_i > t) = P(M_i \le c \mid N_i(t) = 0).$$

Using this approach a subject can play the role of a control for an early time, $t < T_i$, but then play the role of case when $t = T_i$. This dynamic status parallels the multiple contributions that a subject can make to the partial likelihood function. Here sensitivity measures the expected fraction of subjects with a marker greater than $c$ among the subpopulation of individuals who die at time $t$, while specificity measures the fraction of subjects with a marker less than or equal to $c$ among those who survive beyond time $t$. Incident sensitivity and dynamic specificity are defined by dichotomizing the risk set at time $t$ into those observed to die (cases) and those observed to survive (controls). In Section 3 we discuss how the observed marker data among risk sets can be used to estimate time-dependent accuracy concepts.
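For the cumulative/dynamic definitions of Section 2.2.1, the text notes that in the absence of censoring the marker distribution can be summarized empirically among cases and controls. A minimal sketch of that empirical calculation follows; it assumes fully observed survival times, and the function names and toy data are hypothetical.

```python
def cd_sensitivity(markers, times, c, t):
    """Empirical P(M_i > c | T_i <= t): cases are subjects who have died by time t."""
    cases = [m for m, event_time in zip(markers, times) if event_time <= t]
    return sum(m > c for m in cases) / len(cases)


def cd_specificity(markers, times, c, t):
    """Empirical P(M_i <= c | T_i > t): controls are subjects still alive after t."""
    controls = [m for m, event_time in zip(markers, times) if event_time > t]
    return sum(m <= c for m in controls) / len(controls)


# Hypothetical uncensored toy data: baseline marker values and survival times.
markers = [2.0, 1.5, 0.3, -0.2, 1.1, -1.0]
times = [1.0, 2.0, 6.0, 8.0, 3.0, 9.0]

sens = cd_sensitivity(markers, times, c=1.2, t=3.0)  # cases: T_i <= 3
spec = cd_specificity(markers, times, c=1.2, t=3.0)  # controls: T_i > 3
```

With censored data this simple split is no longer valid, because vital status at time $t$ is unknown for subjects censored before $t$; that is the setting the nearest-neighbor estimator cited above addresses.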
Incident sensitivity and dynamic specificity have some appealing characteristics relative to the alternative definitions. First, incident sensitivity and dynamic specificity are based on classification of the risk set at time $t$ into case(s) and controls, and are, therefore, a natural companion to hazard models. Second, the definitions easily allow extension to time-dependent covariates using $P[M_i(t) > c \mid T_i = t]$ to define incident sensitivity and $P[M_i(t) \le c \mid T_i > t]$ to define dynamic specificity with a longitudinal marker $M_i(t)$. Use of cumulative sensitivity does not permit a time-varying marker. Finally, use of incident sensitivity and dynamic specificity allows both time-specific accuracy summaries and, as shown in Section 2.4, allows time-averaged summaries that directly relate to a familiar global concordance measure. In contrast, methods have not been proposed for meaningfully averaging the time-specific incident/static or cumulative/dynamic accuracy summaries.

2.3 Time-Dependent ROC Curves
After selecting definitions for time-dependent sensitivity and specificity, ROC curves can be computed and interpreted. In this article we focus on incident/dynamic (I/D) ROC curves defined as the function $\mathrm{ROC}_t^{I/D}(p)$, where $p$ denotes the dynamic false-positive rate, and $\mathrm{ROC}_t^{I/D}(p)$ denotes the corresponding incident true-positive rate. Specifically, let $c_p$ be defined as the threshold that yields a false-positive rate of $p$: $P(M_i > c_p \mid T_i > t) = 1 - \mathrm{specificity}^D(c_p, t) = p$. The true-positive rate, $\mathrm{ROC}_t^{I/D}(p)$, is the sensitivity that is obtained using this threshold, or $\mathrm{ROC}_t^{I/D}(p) = \mathrm{sensitivity}^I(c_p, t) = P(M_i > c_p \mid T_i = t)$. Using the true- and false-positive rate functions $\mathrm{TP}_t^I(c) = \mathrm{sensitivity}^I(c, t)$ and $\mathrm{FP}_t^D(c) = 1 - \mathrm{specificity}^D(c, t)$ allows the ROC curve to be written as the composition of $\mathrm{TP}_t^I(c)$ and the inverse function $[\mathrm{FP}_t^D]^{-1}(p) = c_p$:

$$\mathrm{ROC}_t^{I/D}(p) = \mathrm{TP}_t^I\{[\mathrm{FP}_t^D]^{-1}(p)\}$$

for $p \in [0, 1]$. We use the notation $\mathrm{AUC}(t) = \int_0^1 \mathrm{ROC}_t^{I/D}(p)\, dp$ to denote the area under the I/D ROC curve for time $t$.

2.4 Time-Dependent AUC and Concordance
In the previous subsection we discussed how ROC methods can be used to characterize the ability of a marker to distinguish cases at time $t$ from controls at time $t$. However, in many applications no a priori time $t$ is identified, and a global accuracy summary is desired. In this subsection we show how time-dependent ROC curves are related to a standard concordance summary. The global summary we adopt is

$$C = P[M_j > M_k \mid T_j < T_k],$$

which indicates the probability that the subject who died at the earlier time has a larger value of the marker. This is not the usual form (i.e., $P[M_j > M_k \mid T_j > T_k]$), but reflects the conventions for ROC analysis.
In order to understand the relationship between this discrimination summary and ROC curves we assume independence of observations $(M_j, T_j)$ and $(M_k, T_k)$, and assume that $T_j$ is continuous such that $P(T_k = T_j) = 0$. We use $P(x)$ to denote probability or density depending on the context. These assumptions imply that the concordance summary $C$ is a weighted average of the area under time-specific ROC curves,

$$P[M_j > M_k \mid T_j < T_k] = \int_t 2 \times P[\{M_j > M_k\} \mid \{T_j = t\} \cap \{t < T_k\}] \times P[\{T_j = t\} \cap \{t < T_k\}]\, dt$$

$$= \int_t \mathrm{AUC}(t) \times w(t)\, dt = E_T[\mathrm{AUC}(T) \times 2 \times S(T)]$$

with $w(t) = 2 \times f(t) \times S(t)$.

In this notation AUC(t) is based on the I/D definition of sensitivity and specificity, $\mathrm{AUC}(t) = P(M_j > M_k \mid T_j = t, T_k > t)$. See the Appendix for a derivation.
In practice we would typically restrict attention to a fixed follow-up period $(0, \tau)$. The concordance summary can be modified to account for finite follow-up:

$$C^\tau = \int_0^\tau \mathrm{AUC}(t) \times w^\tau(t)\, dt,$$

where $w^\tau(t) = 2 f(t) S(t)/W^\tau$, $W^\tau = \int_0^\tau 2 f(t) S(t)\, dt = 1 - S^2(\tau)$. The restricted concordance summary remains a weighted average of the time-specific AUCs with the weights rescaled such that they integrate to 1.0 over the range $(0, \tau)$. The interpretation of $C^\tau$ is a slight modification of the original concordance, where $C^\tau = P[M_j > M_k \mid T_j < T_k, T_j < \tau]$. Thus $C^\tau$ is the probability that the predictions for a random pair of subjects are concordant with their outcomes, given that the smaller event time occurs in $(0, \tau)$.
The concordance summary $C$ is directly related to Kendall's tau. Specifically, $C = K/2 + 1/2$, where $K$ denotes Kendall's tau (see Agresti, 2002, p. 60 for definition). Korn and Simon (1990) and Harrell et al. (1996) discuss the use of Kendall's tau ($K$ or $\tau_a$) with survival data and propose modifications to account for censored observations.

2.5 Example: Gaussian Marker and Log-Normal Disease Time
To illustrate time-dependent accuracy concepts we consider a simple example where the marker $M_i$ and the log of survival time $\log(T_i)$ follow a bivariate normal distribution. By convention we consider a higher marker value as indicative of earlier disease onset and, therefore, explore bivariate distributions with a negative correlation between the marker and log(time).
If $[M_i, \log(T_i)]$ has a bivariate normal distribution with mean (0, 0) and unit standard deviations then time-dependent incident sensitivity and dynamic 1 − specificity are

$$P(M_i > c \mid dN_i(t) = 1) = \mathrm{TP}_t^I(c) = \Phi\left[\frac{\rho \log(t) - c}{\sqrt{1 - \rho^2}}\right]$$

$$P(M_i > c \mid N_i(t) = 0) = \mathrm{FP}_t^D(c) = \frac{S_{2N}[c, \log(t); \rho]}{\Phi[-\log(t)]},$$

where $\Phi(x) = P(X < x)$ for $X \sim N(0, 1)$ and $S_{2N}[x, y; \rho] = P(X > x, Y > y)$ for $(X, Y)$ bivariate mean 0 unit normal with correlation $\rho$.
Figure 1a shows I/D ROC curves for $\rho = -0.8$. The solid line corresponds to $t = \exp(-2)$ and has an AUC of 0.923
[Figure 1: panel (a) plots sensitivity versus 1 − specificity for log(t) = −2, −1, 0, 1, 2; panel (b) plots AUC(t) and the weight w(t) versus time for rho = −0.9, −0.8, −0.7, −0.6.]
Figure 1. Incident/dynamic ROC and AUC plots for a bivariate (log) normal distribution. (a) Incident/dynamic ROC curves for a scalar marker and a disease time where $\{M_i, \log(T_i)\}$ is bivariate normal with $\rho = -0.8$. (b) Plots of AUC(t) for a scalar marker and a disease time where $\{M_i, \log(T_i)\}$ is bivariate normal with $\rho$ taking the values (−0.9, −0.8, −0.7, −0.6).
indicating very good separation between the distribution for $M_i$ among subjects with $T_i = \exp(-2)$ as compared to the marker distribution for subjects with $T_i > \exp(-2)$. Furthermore, if the threshold value $c_{10\%} = 1.19$ were used to indicate a positive test, then by definition, only 10% of the controls (i.e., $\log(T_i) > -2$) would have a value of $M_i$ greater than 1.19. The ROC plot shows that for this false-positive rate of 10% a sensitivity, or true-positive rate, of 75% can be obtained:
$\mathrm{TP}_t^I(1.19) = 0.752$. If we consider a later time such as $\log(t) = 0$ we find less overall discrimination with an AUC of 0.741. Again, specific operating points can be identified; for example, the ROC curve shows that if the false-positive rate is again controlled at 10% then a true-positive rate of only 30% is now obtained (here $c_{10\%} = 0.320$). One of the key advantages of an ROC curve is that it facilitates comparisons across different conditions in terms of the sensitivity of a marker where the specificity is controlled at a fixed level for each condition. Here we have evaluated the temporal variation in sensitivity while controlling 1 − specificity at 10%.
In Figure 1b we show the AUC(t) functions for different values of $\rho$. For each value of $\rho$ we find a decreasing AUC(t) with increasing time. In addition, with decreasing correlation between the marker and the disease time we find uniformly decreasing values for AUC(t). A global accuracy summary can be obtained using $C$, which integrates AUC(t) using the weight function proportional to $2 f(t) S(t)$. Figure 1b also displays the weight function, which for this example is $w(t) = 2 \phi(t)[1 - \Phi(t)]$, where $\phi(x)$ and $\Phi(x)$ are the standard normal density and distribution functions, respectively. In this bivariate normal situation there exists an analytical solution for the concordance: $C = \sin^{-1}(-\rho)/\pi + 0.5$. For $\rho = -0.9$ we find $C = 0.827$, while with $\rho = -0.6$ we find $C = 0.703$. Therefore, when the marker $M_i$ and log-survival time have a correlation of −0.9 there is an 82.7% chance that for a random pair of observations the marker value for the earlier survival time is greater than the marker value for the larger survival time. This concordance probability is reduced to 70.3% when $\rho = -0.6$.

3. Estimation of Incident/Dynamic Time-Dependent Accuracy
In this section we propose methods for the estimation of time-dependent accuracy summaries using a single scalar marker $M_i$. When interest is in the accuracy of a survival regression model we propose using the linear predictor as a scalar marker, $M_i = Z_i^T \beta$, and then using nonparametric or semiparametric methods to characterize the time-dependent sensitivity and specificity of the model score. In particular, we discuss how the Cox model and partial likelihood concepts can be conveniently used to provide semiparametric estimates of I/D accuracy. However, the methods that we propose do not require the model score, $M_i$, to be derived from a proportional hazards model and are potentially applicable for any prognostic scale.

3.1 Estimation: $\mathrm{TP}_t^I(c)$ and $\mathrm{FP}_t^D(c)$ under Proportional Hazards
Properties of the partial likelihood function make estimation of I/D ROC curves a natural companion to Cox regression. Here we assume that the censoring time $C_i$ is independent of the failure time $T_i$ and marker $M_i$. To clearly distinguish between the general model score, $M_i = Z_i^T \beta$, and a Cox model that uses this score, we denote $\gamma$ as the proportional hazards regression parameter $\lambda(t \mid M_i) = \lambda_0(t) \exp(M_i \gamma)$. It is well known that under a proportional hazards model the weights, $\pi_i(\gamma, t) = R_i(t) \exp(M_i \gamma)/W(t)$ introduced in Section 1.1, are used to compute an estimate of the expected value of the marker given failure: $E(M_i \mid T_i = t) = \sum_k M_k \pi_k(\gamma, t)$. However, Xu and O'Quigley (2000) show that these weights can also be used to estimate the distribution of the covariate conditional on death at time $t$:

$$\widehat{\mathrm{TP}}_t^I(c) = \hat{P}(M_i > c \mid T_i = t) = \sum_k 1(M_k > c)\, \pi_k(\hat{\gamma}, t), \qquad (1)$$

where the estimate $\hat{P}(M_i > c \mid T_i = t)$ is a consistent estimator when the Cox model for $M_i$ holds. Estimation of $\gamma$ using partial likelihood provides a semiparametric estimate for $\mathrm{TP}_t^I(c)$. An empirical estimator can be used for $\mathrm{FP}_t^D(c)$:

$$\widehat{\mathrm{FP}}_t^D(c) = \hat{P}(M_i > c \mid T_i > t) = \sum_k 1(M_k > c)\, R_k(t+)/W^R(t+), \qquad (2)$$

where $R_k(t+) = \lim_{\delta \to 0} R_k(t + |\delta|)$, and $W^R(t+) = \sum_k R_k(t+)$. The term $W^R(t+)$ denotes the size of the control set at time $t$, where we define the control set as the risk set minus subjects who fail at time $t$. Essentially, $\widehat{\mathrm{FP}}_t^D(c)$ is the empirical distribution function for marker values among the control set, and $\widehat{\mathrm{TP}}_t^I(c)$ is an exponential tilt of the empirical distribution function for the marker among risk set subjects (Anderson, 1979).

3.2 Estimation: $\mathrm{TP}_t^I(c)$ and $\mathrm{FP}_t^D(c)$ under Nonproportional Hazards
In order to use equation (1) to estimate incident sensitivity the proportional hazards assumption must be satisfied. However, this aspect can be relaxed by adopting a varying-coefficient model of the form $\lambda(t \mid M_i) = \lambda_0(t) \exp[M_i \gamma(t)]$. The time-varying coefficient function $\gamma(t)$ can be estimated either in a one-step fashion based on routine Cox model residuals, or through locally weighted partial likelihood methods. Note that if proportional hazards do obtain then $\gamma(t) \equiv 1$ when $M_i = Z_i^T \beta$.
Grambsch and Therneau (1994) describe residual-based methods for assessing the proportional hazards model that can also be used to obtain estimates of time-varying coefficient functions. In order to define the residuals we adopt the following notation: $S^{(p)}(\beta, t) = \sum_k R_k(t) \exp(Z_k^T \beta)\, Z_k^{\otimes p}$, where $Z_k^{\otimes p}$ refers to 1, $Z_k$, and $Z_k Z_k^T$ for $p = 0, 1, 2$, respectively. The scaled Schoenfeld residuals are defined for each observed ordered failure time, $t_{(j)}$, as the vector

$$r_j(\beta) = V^{-1}[\beta, t_{(j)}]\, \{Z_{(j)} - e[\beta, t_{(j)}]\},$$

where $e[\beta, t_{(j)}] = S^{(1)}[\beta, t_{(j)}]/S^{(0)}[\beta, t_{(j)}]$, $V[\beta, t_{(j)}] = S^{(2)}[\beta, t_{(j)}]/S^{(0)}[\beta, t_{(j)}] - e[\beta, t_{(j)}]\, e[\beta, t_{(j)}]^T$, and $Z_{(j)}$ denotes the covariate for the subject observed to die at time $t_{(j)}$. Grambsch and Therneau (1994) show that $E\{r_j \mid \mathcal{F}[t_{(j)}]\} \approx [\beta(t) - \beta_0]$, where $\beta_0$ is the time-averaged coefficient and $\mathcal{F}(t)$ is the right-continuous filtration specifying the survival process history. This property is used to obtain focused tests of proportionality, and to obtain estimates of the time-varying coefficient function, $\beta_k(t)$, corresponding to covariate $Z_{i,k}$. As a graphical diagnostic tool standard regression-smoothing techniques are now commonly applied to the points $[t_{(j)}, \hat{\beta}_k + r_{j,k}(\hat{\beta})]$ following a Cox model fit in order to obtain estimates of time-dependent coefficient functions, $\beta_k(t)$.
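Equations (1) and (2) of Section 3.1 amount to a weighted and an unweighted tally of marker values over the risk set. The following is a minimal sketch under stated assumptions: the coefficient `gamma` is supplied by the caller (in practice it would come from a fitted Cox model, or from a smoothed estimate evaluated at time t in the varying-coefficient case), `obs_times` holds the observed follow-up times $X_k$, and the function names and toy data are hypothetical.

```python
import math


def tp_incident(markers, obs_times, c, t, gamma):
    """Equation (1): sum_k 1(M_k > c) * pi_k(gamma, t).

    pi_k(gamma, t) = R_k(t) exp(M_k * gamma) / W(t), with R_k(t) = 1(X_k >= t).
    """
    risk_set = [m for m, x in zip(markers, obs_times) if x >= t]
    weights = [math.exp(gamma * m) for m in risk_set]
    tilted = sum(w for m, w in zip(risk_set, weights) if m > c)
    return tilted / sum(weights)


def fp_dynamic(markers, obs_times, c, t):
    """Equation (2): empirical P(M > c) over the control set, R_k(t+) = 1(X_k > t)."""
    controls = [m for m, x in zip(markers, obs_times) if x > t]
    return sum(m > c for m in controls) / len(controls)


# Hypothetical toy data: marker values and observed follow-up times.
markers = [2.0, 1.0, 0.0, -1.0]
obs_times = [1.0, 2.0, 3.0, 4.0]

tp_flat = tp_incident(markers, obs_times, c=0.5, t=2.0, gamma=0.0)
tp_tilt = tp_incident(markers, obs_times, c=0.5, t=2.0, gamma=1.0)
fp = fp_dynamic(markers, obs_times, c=0.5, t=2.0)
```

With `gamma = 0` the weights are equal and the estimate reduces to the empirical marker distribution over the risk set; a positive `gamma` shifts weight toward high-marker subjects, which is the exponential tilt described in the text.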
For the evaluation of the accuracy of a marker, $M_i$, the smoothing of Schoenfeld residuals can be used to obtain a simple estimate of I/D AUC(t) by exploiting standard Cox model output. First a Cox model of the form $\lambda_0(t) \exp(M_i \gamma)$ is fit, followed by use of regression-smoothing methods to obtain $\hat{\gamma}(t)$. Second, equation (2) can still be used to obtain estimates of false-positive rates, and (1) can now be evaluated using $\hat{\gamma}(t)$ rather than a constant value $\hat{\gamma}$:

$$\widehat{\mathrm{TP}}_t^I(c) = \hat{P}(M_i > c \mid T_i = t) = \sum_k 1(M_k > c)\, \pi_k[\hat{\gamma}(t), t]. \qquad (3)$$

By using equation (3) we are adopting the flexible semiparametric hazard model, $\lambda_0(t) \exp[M_i \gamma(t)]$, which no longer assumes proportionality, but rather only assumes smoothly varying hazard ratios over time.
More formal flexible semiparametric statistical methods can be used to estimate a varying-coefficient hazard model and subsequently produce time-dependent accuracy summaries based on minimal model assumptions. For example, Hastie and Tibshirani (1993) discuss both smooth parametric methods and nonparametric penalized likelihood methods for estimating the function $\gamma(t)$ in the model $\lambda_i(t) = \lambda_0(t) \exp[M_i \gamma(t)]$. More recently Cai and Sun (2003) characterize the properties of locally weighted partial likelihood methods used to obtain varying coefficient estimates. Using kernel weights that are specified as a function of time, $t$, allows use of local-linear estimation methods. Cai and Sun (2003) prove the pointwise consistency and asymptotic normality of the resulting function estimator, $\hat{\gamma}(t)$. Smooth parametric and/or nonparametric methods allow valid estimation of accuracy summaries such as AUC(t) based on the minimal model assumptions because models of the form $\lambda_i(t) = \lambda_0(t) \exp[M_i \gamma(t)]$ only assume linearity in $M_i$ and smoothly varying hazard ratios over time. The linearity assumption can be relaxed by using a model with single or multiple transformations of $M_i$ and a vector of time-varying coefficients.

3.3 Estimation: $\mathrm{ROC}_t^{I/D}(p)$, AUC(t), and $C^\tau$
Given estimates of $\mathrm{TP}_t^I(c)$ and $\mathrm{FP}_t^D(c)$ the area under the ROC curve at time $t$, AUC(t), and the integrated area, $C^\tau$, can be calculated. The estimated ROC curve is given as

$$\widehat{\mathrm{ROC}}_t^{I/D}(p) = \widehat{\mathrm{TP}}_t^I\{[\widehat{\mathrm{FP}}_t^D]^{-1}(p)\},$$

where $[\widehat{\mathrm{FP}}_t^D]^{-1}(p) = \inf_c \{c : \widehat{\mathrm{FP}}_t^D(c) \le p\}$. The estimated AUC(t) is simply $\widehat{\mathrm{AUC}}(t) = \int_0^1 \widehat{\mathrm{ROC}}_t^{I/D}(p)\, dp$ estimated using standard numerical integration methods such as the trapezoid rule. Finally, the estimated concordance is given by

$$\hat{C}^\tau = \int_0^\tau \widehat{\mathrm{AUC}}(t)\, \hat{w}^\tau(t)\, dt,$$

where $\widehat{\mathrm{AUC}}(t)$ is given above and $\hat{w}^\tau(t) = 2 \hat{f}(t) \hat{S}(t)/[1 - \hat{S}^2(\tau)]$. The Kaplan-Meier estimator can be used for $\hat{S}(t)$, and a discrete approximation to $f(t)$ can be used based on the increments in the Kaplan-Meier estimator. If Kaplan-Meier is used to estimate $f(t)$ and $S(t)$ then AUC(t) only needs to be evaluated at the observed failure times in order to calculate $\hat{C}^\tau$.

3.4 Inference for Incident/Dynamic Accuracy Summaries
Xu and O'Quigley (2000) show that the estimator $\widehat{\mathrm{TP}}_t^I(c)$ given in equation (1) is consistent provided that the proportional hazards model obtains, and provided the independent observations are subject to independent censoring. Parallel arguments apply for the estimator obtained using a varying-coefficient model given in equation (3) whenever a consistent estimator of $\gamma(t)$ is used. Cai and Sun (2003) show that the locally weighted MPLE is consistent under standard regularity conditions. In addition, because $\widehat{\mathrm{FP}}_t^D(c)$ is an empirical distribution function calculated over the control set (i.e., the risk set minus the case), consistency obtains provided the control set represents an unbiased sample (i.e., independent censoring). Therefore, consistent estimates of time-dependent sensitivity and specificity and corresponding AUC(t) and $C^\tau$ summaries are obtained under the proportional hazards assumption using equations (1) and (2), and under more general nonproportional hazards assumptions using equation (3). Finally, because the accuracy summaries are defined over the joint distribution of the marker $M_i$ and the survival time $T_i$, the nonparametric bootstrap of Efron (1979) based on resampling of observations $(M_i, X_i, \Delta_i)$ may be used to compute standard errors or to provide confidence intervals.

3.5 Discrete Times and General Hazard Models
Our motivation for developing tools to summarize predictive accuracy stems from interest in characterizing the prognostic potential of Cox models for continuous survival times. However, the basic time-dependent accuracy concepts and the estimation method outlined in Section 3.2 generalize to discrete survival times and/or alternative hazard regression models. The key to estimation of $\mathrm{TP}_t^I(c)$ presented in Sections 3.1 and 3.2 is that a hazard model can be used to reweight the empirical distribution of $M_i$ calculated over the risk set at time $t$. Equations (1) and (3) show specific details for Cox models. More generally, let $P(T_i = t \mid T_i \ge t, M_i)$ denote the hazard, where $P(t)$ represents either density for continuous survival times or probability for discrete times. A hazard regression model can be formulated as $g[P(T_i = t \mid T_i \ge t, M_i)] = \alpha(t) + M_i \gamma(t)$, where $g(x)$ is a link function. The Cox model is a special case where a log link is used; $\alpha(t) = \log \lambda_0(t)$; and $\gamma(t) \equiv \gamma$ under the proportional hazards assumption. Following arguments given in Xu and O'Quigley (2000) the general model implies:

$$P(M_i = m \mid T_i = t) \propto g^{-1}[\alpha(t) + m\, \gamma(t)] \times P(M_i = m \mid T_i \ge t), \qquad (4)$$

where $P(M_i = m \mid T_i \ge t)$ denotes either the marker density or probability depending on whether a continuous or discrete marker distribution is assumed. See the Appendix for a derivation. Equation (4) shows that $P(M_i = m \mid T_i = t)$ can be estimated from separate estimates of the hazard model and the distribution of the marker conditional on $T_i \ge t$. Therefore, the general estimation approach outlined in Section 3.2 can be adopted for either discrete survival times or for general hazard regression models provided that consistent estimates of $[\alpha(t), \gamma(t)]$ and $P(M_i = m \mid T_i \ge t)$ are available. Tied survival times impact choice of a method for estimating the hazard model parameters. In addition, with
discrete survival times calculation of the concordance summary $C = \int \mathrm{AUC}(t)\, w(t)\, dt$ requires modification to account for the fact that $P(T_j = T_k) \ne 0$ and, therefore, the constant 2 in the weight $w(t) = 2 f(t) S(t)$ needs to be computed as $1/P(T_j < T_k)$. Finally, Cox models are convenient because the baseline hazard, $\alpha(t) = \log \lambda_0(t)$, drops out of (4), and is thus not required for estimation of $\mathrm{TP}_t^I(c)$.

3.6 Simulations to Evaluate Incident/Dynamic Estimation
In order to demonstrate the feasibility of using Cox regression methods and the marker distribution among risk sets for estimating I/D ROC curves and global concordance we conducted a set of simulation studies.
For each of m = 500 simulated data sets a sample of n = 200 marker values, $M_i$, and survival times, $T_i$, were generated such that $(M_i, \log T_i)$ is bivariate normal with a correlation of $\rho = -0.7$. An independent log-normal censoring time was generated to yield a fixed expected fraction of censored observations (either 20% or 40% censored). For each simulated data set we estimated the I/D AUC(t) function and the concordance summary $C^\tau$ using the largest observed survival time to truncate follow-up time. We applied four methods of estimation to the censored data: maximum likelihood assuming a bivariate normal distribution for the survival time and the marker; maximum partial likelihood using the Cox model, which for this example incorrectly assumes proportional hazards; locally weighted maximum partial likelihood (MPL) estimation for the model $\lambda_0(t) \exp[M_i \gamma(t)]$ using the method of Cai and Sun (2003); and simple local linear smoothing of the scaled Schoenfeld residuals. For local MPL estimation and local linear smoothing we used an Epanechnikov kernel with a span of $n^{-1/5}$ where n is the number of observations.
In order to estimate AUC(t) and $C^\tau$ using semiparametric methods the model for the survival time conditional on the marker, $\lambda_0(t) \exp[M_i \gamma(t)]$, is combined with the observed marker distribution within each risk set according to the methods described in Section 3.2. We have adopted a survival model that assumes that the log hazard increases linearly in $M_i$ for each time $t$. The true data-generating model is actually nonlinear with a concave risk function. Therefore, for this simulation our estimation used a first-order approximation to the true conditional hazard surface.
Table 1 displays the mean and standard deviation for the estimate of AUC(t) at various values of $t$ when data are generated with 20% and with 40% censoring. When 20% of the observations are censored we find that the MLE for AUC(t) has minimal bias for $\log(t)$ between −2 and 2. Estimates based on the locally weighted MPLE and the residual smoother yield approximately unbiased estimates for all but the most extreme values of time with some negative bias observed for both the semiparametric estimators. For example, at $\log(t) = -2$ the mean AUC(t) using the locally weighted MPLE is 0.860 (relative bias of 1 − 0.860/0.884 = 3%) and using the residual smoother the average is 0.881 (relative bias of
Table 1
Simulation results for estimation of I/D accuracy. Data $(M_i, \log T_i)$ were generated as bivariate normal with a correlation of
$\rho = -0.7$. The sample size for each simulated data set was N = 200. The AUC(t) curve and the integrated curve, $C^\tau$, were
estimated using: maximum likelihood assuming a bivariate normal model; Cox model, which assumes proportional hazards; local
maximum partial likelihood for the varying-coefficient model $\lambda(t) = \lambda_0(t) \exp[\gamma(t) M_i]$; and a local linear smooth of the scaled
Schoenfeld residuals to estimate the varying-coefficient model.
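The data-generating step of the simulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `simulate_dataset` is hypothetical, both margins are taken to be standard normal, the log censoring time is given unit variance, and the correlation is taken to be negative so that larger markers go with earlier failures. Under those assumptions the location of the censoring distribution that yields a target censoring fraction has a closed form.

```python
import math
import random
from statistics import NormalDist


def simulate_dataset(n=200, rho=-0.7, censor_frac=0.2, seed=0):
    """Return a list of (marker, observed_time, event) tuples (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: log C ~ N(mu_c, 1) independent of log T ~ N(0, 1), so that
    # P(C < T) = Phi(-mu_c / sqrt(2)); solve this for mu_c.
    mu_c = -math.sqrt(2.0) * NormalDist().inv_cdf(censor_frac)
    data = []
    for _ in range(n):
        marker = rng.gauss(0.0, 1.0)
        # Conditional construction gives corr(marker, log_t) = rho.
        log_t = rho * marker + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        log_c = rng.gauss(mu_c, 1.0)
        event = log_t <= log_c  # True when the failure time is observed
        data.append((marker, math.exp(min(log_t, log_c)), event))
    return data
```

Calling `simulate_dataset(censor_frac=0.4)` gives one data set for the heavier censoring scenario; repeating the call with 500 different seeds would mimic the replication scheme described above.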
[Figure 2, panels (a) and (b): AUC and w(t) plotted against Time (days)]
Figure 2. Incident/dynamic AUC plots for the VA lung cancer data. (a) Accuracy of the model score (linear predictor) under
the assumption of proportional hazards. Estimates of I/D AUC(t) versus time with pointwise 90% confidence intervals. Using
$\tau = 365$ we obtain $C^\tau = \int_0^\tau AUC(t)\, w^\tau(t)\, dt = 0.713$ (SE = 0.026). (b) Accuracy of the model score (linear predictor) based
on a varying-coefficient multiplicative hazard model. Estimates of I/D AUC(t) versus time with pointwise 90% confidence
intervals. Using $\tau = 365$ we obtain $C^\tau = \int_0^\tau AUC(t)\, w^\tau(t)\, dt = 0.738$ (SE = 0.022).
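The integrated summary reported in the caption can be approximated by a weighted average of AUC(t) over the observed event times. The sketch below is illustrative only: the function name `concordance_summary` and its arguments are hypothetical, the weight at each event time is a discrete stand-in for $2 f(t) S(t)$, and the weights are normalized by their sum over $(0, \tau]$ instead of by an analytic constant.

```python
def concordance_summary(times, auc, surv, tau):
    """Discrete approximation to the integral of AUC(t) * w(t) over (0, tau].

    times: increasing event times
    auc:   AUC(t) estimate at each event time
    surv:  survival estimate S(t) just after each event time
    """
    weighted_sum = 0.0
    total_weight = 0.0
    prev_surv = 1.0
    for t, a, s in zip(times, auc, surv):
        if t > tau:
            break
        failure_mass = prev_surv - s     # discrete estimate of f(t) dt
        weight = 2.0 * failure_mass * s  # w(t) = 2 f(t) S(t)
        weighted_sum += a * weight
        total_weight += weight
        prev_surv = s
    if total_weight == 0.0:
        raise ValueError("no usable event times at or before tau")
    # Normalizing by the summed weights makes them integrate to one on (0, tau].
    return weighted_sum / total_weight
```

With a Kaplan-Meier estimate supplying `surv` and any estimator supplying `auc`, a call such as `concordance_summary(times, auc, surv, tau=365)` returns a summary on the same scale as the values quoted in the caption. A constant AUC(t) is returned unchanged, which is a quick sanity check on the weighting.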
Covariate         Estimate   SE      Z
Model 1
  Log(bilirubin)  0.877      0.099   8.866
Model 2

[Figure: axis label "sensitivity"; curves labeled t = 90 and t = 120]

4.2 Mayo PBC Data
Next, we consider data from a randomized placebo-controlled
trial of the drug D-penicillamine (DPCA) for the treatment of
[Figure 4, panels (a) and (b): AUC plotted against Time (days). Legend: 5 covariates: iAUC = 0.796; 4 covariates: iAUC = 0.733]
Figure 4. Incident/dynamic AUC plots for the Mayo PBC data. (a) Accuracy of the model score using five covariates,
log(bilirubin), log(prothrombin), edema, albumin, and age, and the model score using four covariates (+), where log(bilirubin)
is excluded. Lines plot the estimates of I/D AUC(t) versus time under the assumption of proportional hazards. (b) Accuracy
of the model score using five covariates, log(bilirubin), log(prothrombin), edema, albumin, and age, and the model score
using four covariates (+), where log(bilirubin) is excluded. Estimation is based on a varying-coefficient multiplicative hazard
model. Lines plot the estimates of I/D AUC(t) versus time.
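Each point on an I/D AUC(t) curve can be computed from a single risk set. The sketch below is an illustration under assumptions, not the authors' implementation: it assumes a hazard of the form $\lambda_0(t) \exp[\gamma M]$ at the time of interest, takes the marker distribution among incident cases to be the risk-set distribution reweighted by $\exp(\gamma M)$, and takes the controls to be the subjects still under observation after that time. The function name `incident_dynamic_auc` and the symbol `gamma` are hypothetical labels.

```python
import math


def incident_dynamic_auc(risk_markers, control_markers, gamma):
    """AUC at one time point from risk-set markers and a hazard coefficient."""
    # Tilt the risk-set marker distribution by exp(gamma * M); subtracting
    # the maximum exponent keeps the exponentials numerically stable.
    shift = max(gamma * m for m in risk_markers)
    raw = [math.exp(gamma * m - shift) for m in risk_markers]
    total = sum(raw)
    auc = 0.0
    for m_case, w in zip(risk_markers, raw):
        # Fraction of controls ranked below this marker value (ties count 1/2).
        wins = sum(1.0 if m_case > m else 0.5 if m_case == m else 0.0
                   for m in control_markers)
        auc += (w / total) * wins / len(control_markers)
    return auc
```

With `gamma = 0` the cases and controls share one marker distribution and the function returns 0.5 when both lists are the same; larger positive values of `gamma` move weight toward high marker values and raise the AUC. Supplying a time-specific coefficient at each event time would give a curve analogous to panel (b).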
is obtained by using bilirubin in addition to the other model covariates. The estimated AUC(t) for
the four-covariate model is approximately 0.10 units below that of the five-covariate model for t
between 0 and 2000 days.
We then relax the proportional hazards assumption and use the time-varying coefficient models
described in Section 3.2 to characterize the accuracy of the model score $M_i = Z_i^T \beta$. The
bottom panel of Figure 4 displays the AUC function based on the estimated time-varying coefficient
obtained using locally weighted MPL. Early estimates of AUC(t) now exceed 0.90 and decline sharply
to approximately 0.75 at 2000 days for the five-covariate model and to less than 0.65 at 2000 days
for the four-covariate model. The estimated AUC(t) reveals that the Mayo model is excellent at
short-term prediction but that the predictive accuracy declines to AUC(t) < 0.80 by 1 year for the
model without bilirubin, and to AUC(t) < 0.80 by 5 years for the five-covariate model. Finally,
using the time-varying coefficient produces a global concordance summary of 0.80 for the
five-covariate model and 0.72 for the model that excludes bilirubin.

5. Discussion
This article introduces a new version of time-dependent sensitivity, specificity, and associated
ROC curves that are useful for characterizing the predictive accuracy of a scalar marker, such as a
derived model score, when the outcome is a censored survival time. We show that the area under the
time-specific ROC curves can be plotted as a function of time to characterize temporal changes in
accuracy, and can be integrated using the marginal distribution of the failure time to provide a
global concordance summary. Incident sensitivity and dynamic specificity are shown to be easily
estimated using a fitted hazard model and the empirical distribution of the marker data within risk
sets. Using only routine Cox model output allows estimates of accuracy that assume proportional
hazards, and simple regression smoothing of scaled Schoenfeld residuals provides accuracy summaries
appropriate for markers that do not satisfy proportional hazards. Simulations suggest that residual
smoothing and locally weighted partial likelihood estimators both provide feasible and accurate
estimates.
Our methods explicitly decouple the generation of a predictive score from the evaluation of
prognostic accuracy. An investigator may use Cox regression to create a model score
$M_i = Z_i^T \beta$ that is a time-invariant linear combination of baseline covariates $Z_i$.
However, using the flexible methods proposed in Section 3.2 to evaluate the prognostic potential of
$M_i$ does not require commitment to the proportional hazards assumption. A practical advantage of
using $M_i = Z_i^T \beta$ is that a single scoring of the baseline covariates is conducted to
generate $M_i$, but if proportional hazards is clearly violated then a more general model such as
$\lambda_0(t) \exp[Z_i^T \beta(t)]$ may be appropriate, and would lead to a time-varying score
$M_i(t) = Z_i^T \beta(t)$.
A number of aspects warrant additional research. First, estimation methods proposed in Sections 3.1
and 3.2 assume that the censoring time is independent of the survival time. Relaxation to allow
conditional independence given the marker, $M_i$, or covariates, $Z_i$, would be useful. Second, we
have proposed estimators that assume a prospective study design. Extension to case–cohort data may
be important for characterizing the accuracy of markers for rare diseases. Third, development of
analytical approximations that characterize the large sample distribution of the proposed estimators
would facilitate approximate inference for time-dependent ROC curves, the AUC(t) curve, or the
concordance summary $C^\tau$. Finally, exploration of time-dependent accuracy methods with a
longitudinal marker, $M_i(t)$, would be important for the common prospective medical setting in
which predictive covariate information is updated over time.

Résumé
The adequacy of a survival model can be summarized using extensions of the percentage of
variability explained by the model, or R2, commonly used for models of a continuous response, or
using extensions of sensitivity and specificity, commonly used to predict a binary response. In this
article we propose a time-dependent version of adequacy, using time-specific functions of
sensitivity and specificity calculated over risk sets. We connect the adequacy summaries to a
previously proposed global concordance measure, which is an extension of Kendall's tau. In addition,
we show how to use the results of a Cox model to obtain estimates of time-dependent sensitivity and
specificity as well as time-dependent ROC (Receiver Operating Characteristic) curves. Semiparametric
estimation methods suited to both proportional and nonproportional hazards models are presented,
evaluated by simulations, and illustrated with two survival data sets.

References
Agresti, A. (2002). Categorical Data Analysis, 2nd edition. New York: John Wiley & Sons.
Akritas, M. G. (1994). Nearest neighbor estimation of a bivariate distribution under random
censoring. Annals of Statistics 22, 1299–1327.
Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika 66, 17–26.
Cai, T., Pepe, M. S., Lumley, T., Zheng, Y., and Jenny, N. S. (2003). The sensitivity and
specificity of markers for event times. University of Washington Technical Report 188, 1–30.
Cai, Z. and Sun, Y. (2003). Local linear estimation for time-dependent coefficients in Cox's
regression models. Scandinavian Journal of Statistics 30, 93–111.
Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal
Statistical Society, Series B, Methodological 34, 187–220.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1–26.
Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension
in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision
Making 19, 242–251.
Fan, V., Au, D., Heagerty, P., Deyo, R., McDonell, M., and Fihn, S. (2002). Validation of case-mix
measures derived