
Quantile Processes for Semi and Nonparametric Regression

Shih-Kang Chao

Stanislav Volgushev

Guang Cheng

arXiv:1604.02130v1 [math.ST] 7 Apr 2016

April 7, 2016

Abstract A collection of quantile curves provides a complete picture of conditional distributions. Properly centered and scaled versions of estimated curves at various quantile levels give rise
to the so-called quantile regression process (QRP). In this paper, we establish weak convergence
of QRP in a general series approximation framework, which includes linear models with increasing
dimension, nonparametric models and partial linear models. An interesting consequence is obtained in the last class of models, where parametric and non-parametric estimators are shown to
be asymptotically independent. Applications of our general process convergence results include the
construction of non-crossing quantile curves and the estimation of conditional distribution functions. As a result of independent interest, we obtain a series of Bahadur representations with
exponential bounds for tail probabilities of all remainder terms.
Keywords:
Bahadur representation, quantile regression process, semi/nonparametric model,
series estimation.

Introduction

Quantile regression is widely applied in various scientific fields such as Economics (Koenker and
Hallock (2001)), Biology (Briollais and Durrieu (2014)) and Ecology (Cade and Noon (2003)).
By focusing on a collection of conditional quantiles instead of a single conditional mean, quantile
regression allows one to describe the impact of predictors on the entire conditional distribution of the
response. A properly scaled and centered version of these estimated curves forms an underlying
(conditional) quantile regression process (see Section 2 for a formal definition). Existing literature

Postdoctoral Fellow, Department of Statistics, Purdue University, West Lafayette, IN 47906.


E-mail:
skchao74@purdue.edu. Tel: +1 (765) 496-9544. Fax: +1 (765) 494-0558. Partially supported by Office of Naval
Research (ONR N00014-15-1-2331).

Assistant Professor, Department of Statistical Science, Cornell University, 301 Malott Hall, Ithaca, NY 14853.
E-mail: sv395@cornell.edu. Part of this work was conducted while the second author was postdoctoral fellow at the
Ruhr University Bochum, Germany. During that time the second author was supported by the Sonderforschungsbereich Statistical modelling of nonlinear dynamic processes (SFB 823), Teilprojekt (C1), of the Deutsche Forschungsgemeinschaft.

Corresponding Author. Associate Professor, Department of Statistics, Purdue University, West Lafayette, IN
47906. E-mail: chengg@purdue.edu. Tel: +1 (765) 496-9549. Fax: +1 (765) 494-0558. Research Sponsored by NSF
CAREER Award DMS-1151692, DMS-1418042, and Office of Naval Research (ONR N00014-15-1-2331).

on QRP is either concerned with models of fixed dimension (Koenker and Xiao, 2002; Angrist et al.,
2006), or with a linearly interpolated version based on kernel smoothing (Qu and Yoon, 2015).
In this paper, we study weak convergence of QRP in models of the following (approximate)
form
$$Q(x;\tau) \approx Z(x)^\top \beta_n(\tau), \tag{1.1}$$
where $Q(x;\tau)$ denotes the $\tau$-th quantile of the distribution of $Y$ conditional on $X = x \in \mathbb{R}^d$ and
$Z(x) \in \mathbb{R}^m$ is a transformation vector of $x$. As noted by Belloni et al. (2011), the above framework
incorporates a variety of estimation procedures such as parametric (Koenker and Bassett, 1978),
non-parametric (He and Shi, 1994) and semi-parametric (He and Shi, 1996) ones. For example,
Z(x) = x corresponds to a linear model (with potentially high dimension), while Z(x) can be
chosen as powers, trigonometric functions or local polynomials in the non-parametric basis expansion (where
m diverges at a proper rate). Partially linear and additive models are also covered by (1.1).
Therefore, our weak convergence results are developed in a broader context than those available in
the literature.
A noteworthy result in the present paper is obtained for partially linear models
$$Q(X;\tau) = V^\top \alpha(\tau) + h(W;\tau),$$
where $X = (V^\top, W^\top)^\top \in \mathbb{R}^{k+k'}$, $\alpha(\tau)$ is an unknown Euclidean vector and $h(W;\tau)$ is an unknown
smooth function. Here, $k$ and $k'$ are both fixed. In the spirit of (1.1), we can estimate $(\alpha(\tau), h(\cdot;\tau))$
based on the following series approximation
$$h(W;\tau) \approx \widetilde{Z}(W)^\top \beta_n^\dagger(\tau).$$

Our general theorem shows the weak convergence of the joint quantile process resulting from
$(\widehat\alpha(\tau), \widehat h(\cdot;\tau))$ in $(\ell^\infty(T))^{k+1}$, where $\ell^\infty(T)$ denotes the class of uniformly bounded real functions
on $T \ni \tau$. An interesting consequence is that $\widehat\alpha(\tau)$ and $\widehat h(w_0;\tau)$ (after proper centering and scaling)
jointly converge to two independent Gaussian processes. This asymptotic independence result is
useful for simultaneously testing parametric and nonparametric components in semi-nonparametric
models$^{i}$. To the best of our knowledge, this is the first time that such a joint asymptotic result for
quantile regression is established; in fact, even the point-wise result is new. Therefore, we prove
that the joint asymptotics phenomenon discovered by Cheng and Shang (2015) even holds for
non-smooth loss functions with multivariate nonparametric covariates.
Weak convergence of QRP is very useful in developing statistical inference procedures such as
hypothesis testing on conditional distributions (Bassett and Koenker, 1982), detection of treatment
effect on the conditional distribution after an intervention (Koenker and Xiao, 2002; Qu and Yoon,
2015) and testing conditional stochastic dominance (Delgado and Escanciano, 2013). For additional
examples, the interested reader is referred to Koenker (2005). Our paper focuses on the estimation
of conditional distribution functions and construction of non-crossing quantile curves by means of
monotone rearrangement (Dette and Volgushev, 2008; Chernozhukov et al., 2010).
$^{i}$ In the literature, a statistical model is called semi-nonparametric if it contains both finite-dimensional and
infinite-dimensional unknown parameters of interest; see Cheng and Shang (2015).

Our derivation of quantile process convergence relies upon a series of Bahadur representations
which are provided in Section 5. Specifically, we obtain weak convergence results by examining the
asymptotic tightness of the leading terms in these representations, as shown in Lemma A.3. This
is achieved by applying a new maximal inequality from Kley et al. (2015). These new Bahadur
representations are more flexible than those available in the literature (see, for instance, Belloni
et al. (2011)) in terms of choosing approximation model centers. This is crucial in deriving the
joint process convergence in partial linear models. As a result of independent interest, we obtain
bounds with exponential tail probability for the remainder terms in our Bahadur representations.
Bounds of this kind are especially useful in analyzing statistical inference procedures under a divide-and-conquer setup; see Zhao et al. (2016); Shang and Cheng (2015); Volgushev et al. (2016).
The rest of this paper is organized as follows. Section 2 presents the weak convergence of
QRP under a general series approximation framework. Section 3 discusses the QRP in quantile partial linear models. As an application of our weak convergence theory, Section 4 considers various
functionals of the quantile regression process. A detailed discussion of our novel Bahadur representations is given in Section 5, and all proofs are deferred to the appendix.
Notation. Let $\{(X_i, Y_i)\}_{i=1}^n$ denote i.i.d. samples in $\mathcal{X} \times \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$. Here, the distribution of
$(X_i, Y_i)$ and the dimension $d$ can depend on $n$, i.e. triangular arrays. For brevity, let $Z = Z(X)$ and
$Z_i = Z(X_i)$. Define the empirical measure of $(Y_i, Z_i)$ by $\mathbb{P}_n$, and the true underlying measure by
$P$ with the corresponding expectation as $E$. Note that the measure $P$ depends on $n$ for triangular
array cases, but this dependence is omitted in the notation. Denote by $\|b\|$ the $L^2$-norm of a
vector $b$. $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ are the smallest and largest eigenvalues of a matrix $A$. $0_k$ denotes
a $k$-dimensional zero vector, and $I_k$ denotes the $k$-dimensional identity matrix for $k \in \mathbb{N}$. Define
$$\rho_\tau(u) = (\tau - 1(u \le 0))\,u,$$
where $1(\cdot)$ is the indicator function. $C^\eta(\mathcal{X})$ denotes the class of $\eta$-continuously differentiable functions on a set $\mathcal{X}$. $C(0,1)$ denotes the class of continuous functions defined on $(0,1)$. Define


 

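$$\psi(Y_i, Z_i; b, \tau) := Z_i\big(1\{Y_i \le Z_i^\top b\} - \tau\big), \qquad \mu(b,\tau) := E\big[\psi(Y_i, Z_i; b, \tau)\big] = E\big[Z_i\big(F_{Y|X}(Z_i^\top b|X) - \tau\big)\big],$$
and for a vector $\gamma_n(\tau) \in \mathbb{R}^m$, we define the following quantity
$$g_n := g_n(\gamma_n) := \sup_{\tau \in T} \|\mu(\gamma_n(\tau), \tau)\| = \sup_{\tau \in T} \Big\| E\big[Z_i\big(F_{Y|X}(Z_i^\top \gamma_n(\tau)|X) - \tau\big)\big] \Big\|. \tag{1.2}$$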
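To make the role of the check function $\rho_\tau$ concrete, the following minimal Python sketch (an illustration added here, not code from the paper) evaluates the check loss and verifies on a toy sample that minimizing the summed check loss over a constant recovers a sample quantile. The function names and the toy data are hypothetical.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1(u <= 0)) * u."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u


def empirical_risk(q, ys, tau):
    """Sum of check losses of the residuals y - q."""
    return sum(check_loss(y - q, tau) for y in ys)


def sample_quantile_by_check_loss(ys, tau):
    # The risk is convex and piecewise linear in q with kinks at the data,
    # so a minimizer can always be found among the observed values.
    return min(sorted(ys), key=lambda q: empirical_risk(q, ys, tau))


ys = [3.0, 1.0, 4.0, 1.5, 9.0, 2.5, 6.0]  # toy data (hypothetical)
median_hat = sample_quantile_by_check_loss(ys, 0.5)
q25_hat = sample_quantile_by_check_loss(ys, 0.25)
```

In this intercept-only case the minimizer is an order statistic; the regression estimators discussed in the paper replace the constant by a linear combination of basis functions.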

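Let $S^{m-1} := \{u \in \mathbb{R}^m : \|u\| = 1\}$ denote the unit sphere in $\mathbb{R}^m$. For a set $I \subset \{1, ..., m\}$, define
$$\mathbb{R}^m_I := \{u = (u_1, ..., u_m)^\top \in \mathbb{R}^m : u_j \neq 0 \text{ if and only if } j \in I\},$$
$$S^{m-1}_I := \{u = (u_1, ..., u_m)^\top \in S^{m-1} : u_j \neq 0 \text{ if and only if } j \in I\}.$$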


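Finally, consider the class of functions
$$\Lambda_c^\eta(\mathcal{X}, T) := \Big\{ f_\tau \in C^{\lfloor \eta \rfloor}(\mathcal{X}) : \tau \in T,\ \sup_{|j| \le \lfloor \eta \rfloor}\ \sup_{x, \tau \in T} |D^j f_\tau(x)| \le c,\ \sup_{|j| = \lfloor \eta \rfloor}\ \sup_{x \neq y, \tau \in T} \frac{|D^j f_\tau(x) - D^j f_\tau(y)|}{\|x - y\|^{\eta - \lfloor \eta \rfloor}} \le c \Big\}, \tag{1.3}$$
where $\lfloor \eta \rfloor$ denotes the integer part of a real number $\eta$, and $|j| = j_1 + ... + j_d$ for a $d$-tuple $j =
(j_1, ..., j_d)$. For simplicity, we sometimes write $\sup_\tau$ ($\inf_\tau$) and $\sup_x$ ($\inf_x$) instead of $\sup_{\tau \in T}$ ($\inf_{\tau \in T}$)
and $\sup_{x \in \mathcal{X}}$ ($\inf_{x \in \mathcal{X}}$) throughout the paper.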

Weak Convergence Results

In this section, we first present our weak convergence results of QRP in a general series approximation framework that covers linear models with increasing dimension, nonparametric models and
partial linear models. Furthermore, we demonstrate that the use of polynomial splines with local
support, such as B-splines, significantly weakens the sufficient conditions required in the above
general framework.

2.1 General Series Estimator

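Consider a general series estimator $\widehat Q(x;\tau) := \widehat\gamma(\tau)^\top Z(x)$, where for each fixed $\tau$
$$\widehat\gamma(\tau) := \operatorname*{argmin}_{\gamma \in \mathbb{R}^m} \sum_{i=1}^n \rho_\tau(Y_i - \gamma^\top Z_i), \tag{2.1}$$
and $m$ is allowed to grow as $n \to \infty$, and assume the following conditions: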


(A1) Assume that $\|Z_i\| \le \xi_m = O(n^b)$ almost surely with $b > 0$, and that $1/M \le \lambda_{\min}(E[ZZ^\top]) \le
\lambda_{\max}(E[ZZ^\top]) \le M$ holds uniformly in $n$ for some fixed constant $M > 0$.
(A2) The conditional distribution $F_{Y|X}(y|x)$ is twice differentiable w.r.t. $y$. Denote the corresponding derivatives by $f_{Y|X}(y|x)$ and $f'_{Y|X}(y|x)$. Assume that $\bar f := \sup_{y,x} |f_{Y|X}(y|x)| < \infty$
and $\overline{f'} := \sup_{y,x} |f'_{Y|X}(y|x)| < \infty$ uniformly in $n$.
(A3) Assume that uniformly in $n$, there exists a constant $f_{\min} > 0$ such that
$$\inf_{\tau \in T} \inf_x f_{Y|X}(Q(x;\tau)|x) \ge f_{\min}.$$
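The minimization in (2.1) is a linear program, so when the design has full column rank a minimizer can be found among coefficient vectors that interpolate $m$ data points. The sketch below is a toy illustration and not the authors' implementation: it uses this fact for the simplest case $Z(x) = (1, x)^\top$ (so $m = 2$) by enumerating all lines through two observations. The function names and data are hypothetical, and the cubic cost makes it suitable only for very small samples.

```python
from itertools import combinations


def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1(u <= 0)) * u."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u


def objective(gamma, xs, ys, tau):
    """Sum of check losses for the line a + b * x, gamma = (a, b)."""
    a, b = gamma
    return sum(check_loss(y - (a + b * x), tau) for x, y in zip(xs, ys))


def fit_line_quantile(xs, ys, tau):
    """Minimize the check-loss objective over lines through two data points.

    Assumes at least two distinct x values, so the design has rank 2.
    """
    best, best_val = None, float("inf")
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical pair: no line of the form a + b * x
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        val = objective((a, b), xs, ys, tau)
        if val < best_val:
            best, best_val = (a, b), val
    return best, best_val


# Toy data (hypothetical): exact line, then a noisy version of it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
exact_fit, exact_val = fit_line_quantile(xs, [1.0 + 2.0 * x for x in xs], 0.3)
ys_noisy = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]
noisy_fit, noisy_val = fit_line_quantile(xs, ys_noisy, 0.5)
```

For realistic sample sizes and larger $m$ one would solve the same problem with a linear programming or interior point routine; the enumeration is only meant to show what the estimator in (2.1) computes.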

In the above assumptions, uniformity in $n$ is necessary as we consider triangular arrays. Assumptions (A2) and (A3) are fairly standard in the quantile regression literature. Hence, we only make
a few comments on Assumption (A1). In linear models where $Z(X) = X$ and $m = d$, it holds that
$\xi_m \lesssim \sqrt{m}$ if each component of $X$ is bounded almost surely. If B-splines $\widetilde B(x)$ defined in Section
4.3 of Schumaker (1981) are adopted, then one needs to use the re-scaled version $B(x) = m^{1/2} \widetilde B(x)$
as $Z(x)$ such that (A1) holds (cf. Lemma 6.2 of Zhou et al. (1998)). In this case, we have $\xi_m \asymp \sqrt{m}$.
In addition, Assumptions (A1) and (A3) imply that for any sequence of $\mathbb{R}^m$-valued (non-random)
functions $\gamma_n(\tau)$ satisfying $\sup_{\tau \in T} \sup_x |\gamma_n(\tau)^\top Z(x) - Q(x;\tau)| = o(1)$, the smallest eigenvalues of
the matrices
$$\widetilde J_m(\tau) := E[ZZ^\top f_{Y|X}(\gamma_n(\tau)^\top Z|X)], \qquad J_m(\tau) := E[ZZ^\top f_{Y|X}(Q(X;\tau)|X)]$$
are bounded away from zero uniformly in $\tau$ for all $n$. Define for any $u \in \mathbb{R}^m$,
$$\chi_{\gamma_n}(u, Z) := \sup_{\tau \in T} \Big| u^\top J_m(\tau)^{-1} E\big[Z_i\big(1\{Y_i \le Q(X_i;\tau)\} - 1\{Y_i \le Z_i^\top \gamma_n(\tau)\}\big)\big] \Big|.$$

We are now ready to state our weak convergence result for QRP based on the general series
estimators.
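Theorem 2.1. Suppose (A1)-(A3) hold and $m^3 \xi_m^2 (\log n)^3 = o(n)$. Let $\gamma_n(\cdot) : T \to \mathbb{R}^m$ be a sequence of functions. Let $g_n := g_n(\gamma_n) = o(n^{-1/2})$ (see (1.2)) and $c_n = c_n(\gamma_n) := \sup_{x, \tau \in T} |Q(x;\tau) -
Z(x)^\top \gamma_n(\tau)|$ and assume that $m c_n \log n = o(1)$. Then for any $u_n \in \mathbb{R}^m$ satisfying $\chi_{\gamma_n}(u_n, Z) =
o(\|u_n\| n^{-1/2})$ and $\widehat\gamma(\tau)$ defined in (2.1),
$$u_n^\top(\widehat\gamma(\tau) - \gamma_n(\tau)) = -\frac{1}{n} u_n^\top J_m(\tau)^{-1} \sum_{i=1}^n Z_i\big(1\{Y_i \le Q(X_i;\tau)\} - \tau\big) + o_P\Big(\frac{\|u_n\|}{\sqrt{n}}\Big), \tag{2.2}$$
where the remainder term is uniform in $\tau \in T$. In addition, if the following limit
$$H(\tau_1, \tau_2; u_n) := \lim_{n \to \infty} \|u_n\|^{-2} u_n^\top J_m^{-1}(\tau_1) E[ZZ^\top] J_m^{-1}(\tau_2) u_n (\tau_1 \wedge \tau_2 - \tau_1 \tau_2) \tag{2.3}$$
exists for any $\tau_1, \tau_2 \in T$, then
$$\frac{\sqrt{n}}{\|u_n\|}\big(u_n^\top \widehat\gamma(\cdot) - u_n^\top \gamma_n(\cdot)\big) \rightsquigarrow G(\cdot) \text{ in } \ell^\infty(T), \tag{2.4}$$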

where G() is a centered Gaussian process with the covariance function H defined as (2.3). In
particular, there exists a version of G with almost surely continuous sample paths.
The proof of Theorem 2.1 is given in Section A.1. Theorem 2.1 holds under very general
conditions. For transformations $Z$ that have a specific local structure, the assumptions on $m$, $\xi_m$
can be relaxed considerably. Details are provided in Section 2.2.
Finally, we illustrate Theorem 2.1 in linear quantile regression models with increasing dimension, in which $g_n$, $c_n$ and $\chi_{\gamma_n}(u, Z)$ are trivially zero. As far as we are aware, this is the first
quantile process result for linear models with increasing dimension.
Corollary 2.2. (Linear models with increasing dimension) Suppose (A1)-(A3) hold with $Z(X) = X$
and $Q(x;\tau) = x^\top \gamma_n(\tau)$ for any $x$ and $\tau \in T$. Assume that $m^3 \xi_m^2 (\log n)^3 = o(n)$. In addition, if
$u_n \in \mathbb{R}^m$ is such that the following limit
$$H_1(\tau_1, \tau_2; u_n) := \lim_{n \to \infty} \|u_n\|^{-2} u_n^\top J_m^{-1}(\tau_1) E[XX^\top] J_m^{-1}(\tau_2) u_n (\tau_1 \wedge \tau_2 - \tau_1 \tau_2) \tag{2.5}$$
exists for any $\tau_1, \tau_2 \in T$, then (2.4) holds with the covariance function $H_1$ defined in (2.5). Moreover, by setting $u_n = x_0$, we have for any fixed $x_0$
$$\frac{\sqrt{n}}{\|x_0\|}\big(\widehat Q(x_0;\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow G(x_0;\cdot) \text{ in } \ell^\infty(T),$$
where $G(x_0;\cdot)$ is a centered Gaussian process with covariance function $H_1(\tau_1, \tau_2; x_0)$. In particular,
there exists a version of $G$ with almost surely continuous sample paths.
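As a worked special case (an illustrative assumption added here, not a setting singled out in the text), suppose the conditional density at the quantile does not depend on the covariates, i.e. $f_{Y|X}(Q(x;\tau)|x) = f_\tau$ for all $x$, as in a location-shift model with errors independent of $X$. The symbol $f_\tau$ is introduced only for this example. Then $J_m(\tau) = f_\tau E[XX^\top]$, and the covariance in (2.5) with $u_n = x_0$ reduces to

$$H_1(\tau_1, \tau_2; x_0) = \lim_{n \to \infty} \frac{x_0^\top E[XX^\top]^{-1} x_0}{\|x_0\|^2} \cdot \frac{\tau_1 \wedge \tau_2 - \tau_1 \tau_2}{f_{\tau_1} f_{\tau_2}}.$$

Under this assumption the limit process is, up to a design-dependent constant and the density weights $1/f_\tau$, a Brownian bridge in $\tau$.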

2.2 Local Basis Series Estimator

In this section, we assume that $Z(\cdot)$ corresponds to a basis expansion with local support. Our
main motivation for considering this setting is that it allows us to considerably weaken the assumptions
on $m$, $\xi_m$ made in the previous section. To distinguish such basis functions from the general setting
in the previous section, we shall use the notation $B$ instead of $Z$. Let $\widehat\beta(\tau)$ be defined as
$$\widehat\beta(\tau) := \operatorname*{argmin}_{b \in \mathbb{R}^m} \sum_{i=1}^n \rho_\tau(Y_i - b^\top B_i), \tag{2.6}$$
where $B_i = B(X_i)$. The notion of local support is made precise in the following sense.

(L) For each $x$, the basis vector $B(x)$ has zeroes in all but at most $r$ consecutive entries, where $r$
is fixed. Moreover, $\sup_{x,\tau} E[|B(x)^\top \widetilde J_m(\tau)^{-1} B(X)|] = O(1)$.
The above assumption holds for certain choices of basis functions, e.g., univariate B-splines.

Example 2.3. Let $\mathcal{X} = [0, 1]$, assume that (A2)-(A3) hold and that the density of $X$ over $\mathcal{X}$
is uniformly bounded away from zero and infinity. Consider the space of polynomial splines of
order $q$ with $k$ uniformly spaced knots $0 = t_1 < ... < t_k = 1$ in the interval $[0, 1]$. The space of
such splines can be represented through linear combinations of the basis functions $B_1, ..., B_{k-q-1}$
with each basis function $B_j$ having support contained in the interval $[t_j, t_{j+q+1})$. Let $B(x) :=
(B_1(x), ..., B_{k-q-1}(x))^\top$. Then the first part of assumption (L) holds with $r = q$. The condition
$\sup_{x,\tau} E[|B(x)^\top \widetilde J_m^{-1}(\tau) B(X)|] = O(1)$ is verified in the Appendix, see Section A.2.
Condition (L) ensures that the matrix $\widetilde J_m(\tau)$ has a band structure, which is useful for bounding
the off-diagonal entries of $\widetilde J_m^{-1}(\tau)$. See Lemma 6.3 in Zhou et al. (1998) for additional details.
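The local support property in the first part of (L) can be checked numerically. The sketch below is a plain-Python illustration (not taken from the paper) that evaluates a B-spline basis with the Cox-de Boor recursion and lists which entries of the basis vector are nonzero at a point. Conventions for "order" versus "degree" differ between references; here `degree` is the polynomial degree, the knot vector is a hypothetical clamped one, and at most `degree + 1` consecutive entries are nonzero.

```python
def bspline_basis(x, knots, degree):
    """Values of all B-spline basis functions at x (Cox-de Boor recursion)."""
    # Degree-0 basis: indicators of the half-open knot spans.
    vals = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
            for j in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new = []
        for j in range(len(knots) - d - 1):
            left = right = 0.0
            den_l = knots[j + d] - knots[j]
            den_r = knots[j + d + 1] - knots[j + 1]
            if den_l > 0:  # skip terms with repeated knots (0/0 := 0)
                left = (x - knots[j]) / den_l * vals[j]
            if den_r > 0:
                right = (knots[j + d + 1] - x) / den_r * vals[j + 1]
            new.append(left + right)
        vals = new
    return vals  # length len(knots) - degree - 1


# Cubic splines on [0, 1] with interior knots 0.25, 0.5, 0.75 (hypothetical).
degree = 3
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4
basis_at_x = bspline_basis(0.6, knots, degree)
nonzero = [j for j, v in enumerate(basis_at_x) if v != 0.0]
```

Because only a fixed number of neighbouring basis functions overlap, Gram-type matrices such as $E[BB^\top]$ are banded, which is the structure exploited in this section.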
Throughout this section, consider the specific centering


$$\beta_n(\tau) := \operatorname*{argmin}_{b \in \mathbb{R}^m} E\big[(B^\top b - Q(X;\tau))^2 f_{Y|X}(Q(X;\tau)|X)\big], \tag{2.7}$$
where $B = B(X)$. For basis functions satisfying condition (L), the assumptions in Theorem 2.1 in the
previous section can be replaced by the following weaker version.
(B1) Assume that $\xi_m^4 (\log n)^6 = o(n)$ and, letting $\widetilde c_n := \sup_{x,\tau} |\beta_n(\tau)^\top B(x) - Q(x;\tau)|$, that $\widetilde c_n^2 =
o(n^{-1/2})$, where $\|B(X_i)\| \le \xi_m$ almost surely.
Note that the condition $\xi_m^4 (\log n)^6 = o(n)$ in (B1) is less restrictive than $m^3 \xi_m^2 (\log n)^3 =
o(n)$ required in Theorem 2.1. For instance, in the setting of Example 2.3 where $\xi_m \asymp \sqrt{m}$, we
only require $m^2 (\log n)^6 = o(n)$, which is weaker than $m^4 (\log n)^3 = o(n)$ in Theorem 2.1. This
improvement is made possible by the local structure of the spline basis.
In the setting of Example 2.3, bounds on $\widetilde c_n$ can be obtained provided that the function $x \mapsto
Q(x;\tau)$ is smooth for all $\tau \in T$. For instance, assuming that $Q(\cdot;\cdot) \in \Lambda_c^\eta(\mathcal{X}, T)$ with $\mathcal{X} = [0, 1]$
and integer $\eta$, Remark B.1 shows that $\widetilde c_n = O(m^{-\lfloor \eta \rfloor})$. Thus the condition $\widetilde c_n^2 = o(n^{-1/2})$ holds
provided that $m^{-2\lfloor \eta \rfloor} = o(n^{-1/2})$. Since for splines we have $\xi_m \sim m^{1/2}$, this is compatible with the
restrictions imposed in assumption (B1) provided that $\eta \ge 1$.
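As a rough worked illustration of these rate conditions, assume (purely for this example) polynomial growth $m = n^a$ with $a > 0$ and the spline scaling $\xi_m \asymp \sqrt{m}$. Then

$$\xi_m^4 (\log n)^6 \asymp n^{2a} (\log n)^6 = o(n) \iff a < \tfrac{1}{2}, \qquad m^3 \xi_m^2 (\log n)^3 \asymp n^{4a} (\log n)^3 = o(n) \iff a < \tfrac{1}{4},$$

while for integer $\eta$ the approximation condition reads

$$m^{-2\eta} = n^{-2a\eta} = o(n^{-1/2}) \iff a > \frac{1}{4\eta}.$$

Under these assumptions the local basis conditions admit any exponent $a \in (1/(4\eta), 1/2)$, an interval that is nonempty for every integer $\eta \ge 1$, whereas the growth condition of Theorem 2.1 alone already restricts $a < 1/4$.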

Theorem 2.4. (Nonparametric models with local basis functions) Assume that conditions (A1)-(A3) hold with $Z = B$, (L) holds for $B$ and (B1) for $\beta_n(\tau)$. Assume that the set $I$ consists of at
most $L$ consecutive integers, where $L \ge 1$ is fixed. Then for any $u_n \in \mathbb{R}^m_I$, (2.2) holds with $\widehat\gamma(\tau)$,
$\gamma_n(\tau)$ and $Z$ being replaced by $\widehat\beta(\tau)$, $\beta_n(\tau)$ and $B$. In addition, if the following limit
$$\widetilde H(\tau_1, \tau_2; u_n) := \lim_{n \to \infty} \|u_n\|^{-2} u_n^\top J_m^{-1}(\tau_1) E[BB^\top] J_m^{-1}(\tau_2) u_n (\tau_1 \wedge \tau_2 - \tau_1 \tau_2) \tag{2.8}$$
exists for any $\tau_1, \tau_2 \in T$, then (2.4) holds with the same replacement as above, and the limit $G$ is a
centered Gaussian process with covariance function $\widetilde H$ defined as (2.8). Moreover, for any $x_0$, let
$\widehat Q(x_0;\tau) := B(x_0)^\top \widehat\beta(\tau)$ and assume that $\widetilde c_n = o(\|B(x_0)\| n^{-1/2})$. Then
$$\frac{\sqrt{n}}{\|B(x_0)\|}\big(\widehat Q(x_0;\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow G(x_0;\cdot) \text{ in } \ell^\infty(T), \tag{2.9}$$
where $G(x_0;\cdot)$ is a centered Gaussian process with covariance function $\widetilde H(\tau_1, \tau_2; B(x_0))$. In particular, there exists a version of $G$ with almost surely continuous sample paths.
The proof of Theorem 2.4 is given in Section A.2.

Remark 2.5. The proof of Theorem 2.4 and the related Bahadur representation result in Section
5.2 crucially rely on the fact that the elements of $\widetilde J_m(\tau)^{-1}$ decay exponentially fast in their distance
from the diagonal, i.e. a bound of the form $|(\widetilde J_m(\tau)^{-1})_{i,j}| \le C \gamma^{|i-j|}$ for some $\gamma < 1$. Assumption (L)
provides one way to guarantee such a result. We conjecture that similar results can be obtained
for more classes of basis functions as long as the entries of $\widetilde J_m(\tau)^{-1}$ decay exponentially fast in
their distance from suitable subsets of indices in $(j, j') \in \{1, ..., m\}^2$. This kind of result can be
obtained for matrices $\widetilde J_m(\tau)$ with specific sparsity patterns, see for instance Demko et al. (1984).
In particular, we conjecture that such arguments can be applied for tensor product B-splines, see
Example 1 in Section 5 of Demko et al. (1984). A detailed investigation of this interesting topic is
left to future research.
We conclude this section by discussing a special case where the limit in (2.9) can be characterized
more explicitly.
Remark 2.6. The covariance function $\widetilde H$ can be explicitly characterized under $u_n = B(x)$ and
univariate B-splines $B(x)$ on $x \in [0, 1]$, with an order $r$ and equidistant knots $0 = t_1 < ... < t_k = 1$.
Assume in addition to (A3) that
$$\sup_{t \in \mathcal{X}, \tau \in T} \Big| \partial_x \big|_{x=t} f_{Y|X}\big(Q(x;\tau)|x\big) \Big| < C, \text{ where } C > 0 \text{ is a constant}, \tag{2.10}$$
and that the density $f_X(x)$ of $X$ is bounded above. Then under $\widetilde c_n = o(\|B(x_0)\| n^{-1/2})$, (2.9) in Theorem
2.4 can be rewritten as
$$\sqrt{\frac{n}{B(x_0)^\top E[BB^\top]^{-1} B(x_0)}}\,\big(B(x_0)^\top \widehat\beta(\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow G(\cdot; x_0) \text{ in } \ell^\infty(T), \tag{2.11}$$
where the Gaussian process $G(\cdot; x_0)$ is defined by the following covariance function
$$\widetilde H(\tau_1, \tau_2; x_0) = \frac{\tau_1 \wedge \tau_2 - \tau_1 \tau_2}{f_{Y|X}(Q(x_0;\tau_1)|x_0)\, f_{Y|X}(Q(x_0;\tau_2)|x_0)}.$$
Although we only show the univariate case here, the same arguments are expected to hold for
tensor-product B-splines based on the same reasoning. See Section A.2 for a proof of this remark.

Joint Weak Convergence for Partial Linear Models

In this section, we consider partial linear models of the form
$$Q(X;\tau) = V^\top \alpha(\tau) + h(W;\tau), \tag{3.1}$$
where $X = (V^\top, W^\top)^\top \in \mathbb{R}^{k+k'}$ and $k, k' \in \mathbb{N}$ are fixed. An interesting joint weak convergence
result is obtained for $(\widehat\alpha(\tau), \widehat h(w_0;\tau))$ at any fixed $w_0$. More precisely, $\widehat\alpha(\tau)$ and $\widehat h(w;\tau)$ (after proper
scaling and centering) are proved to be asymptotically independent at any fixed $\tau \in T$. Therefore,
the joint asymptotics phenomenon first discovered in Cheng and Shang (2015) persists even for
non-smooth quantile loss functions. Such a theoretical result is practically useful for joint inference
on $\alpha(\tau)$ and $h(W;\tau)$; see Cheng and Shang (2015).
Expanding $w \mapsto h(w;\tau)$ in terms of basis vectors $w \mapsto \widetilde Z(w)$, we can approximate (3.1) through
the series expansion $Z(x)^\top \gamma_n^\dagger(\tau)$ by setting $Z(x) = (v^\top, \widetilde Z(w)^\top)^\top$. In this section, $\widetilde Z : \mathbb{R}^{k'} \to \mathbb{R}^m$ is
regarded as a general basis expansion that does not need to satisfy the local support assumptions
in the previous section. Estimation is performed in the following form
$$\widehat\gamma^\dagger(\tau) = (\widehat\alpha(\tau)^\top, \widehat\beta^\dagger(\tau)^\top)^\top := \operatorname*{argmin}_{a \in \mathbb{R}^k, b \in \mathbb{R}^m} \sum_{i=1}^n \rho_\tau\big(Y_i - a^\top V_i - b^\top \widetilde Z(W_i)\big). \tag{3.2}$$
For a theoretical analysis of $\widehat\gamma^\dagger$, define population coefficients $\gamma_n^\dagger(\tau) := (\alpha(\tau)^\top, \beta_n^\dagger(\tau)^\top)^\top$, where
$$\beta_n^\dagger(\tau) := \operatorname*{argmin}_{\beta \in \mathbb{R}^m} E\big[f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta^\top \widetilde Z(W)\big)^2\big] \tag{3.3}$$
similar to (2.7); see Remark 3.3 for additional explanations.


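To state our main result, we need to define a class of functions
$$\mathcal{U}_\tau := \big\{ w \mapsto g(w) \,\big|\, g \text{ measurable and } E[g^2(W) f_{Y|X}(Q(X;\tau)|X)] < \infty,\ w \in \mathbb{R}^{k'} \big\}.$$
For $V \in \mathbb{R}^k$, define for $j = 1, ..., k$,
$$h_{VW,j}(\cdot;\tau) := \operatorname*{argmin}_{g \in \mathcal{U}_\tau} E\big[(V_j - g(W))^2 f_{Y|X}(Q(X;\tau)|X)\big] \tag{3.4}$$
where $V_j$ denotes the $j$-th entry of the random vector $V$. By the definition of $h_{VW,j}$, we have for
all $\tau \in T$ and $g \in \mathcal{U}_\tau$,
$$E\big[(V - h_{VW}(W;\tau))\, g(W)\, f_{Y|X}(Q(X;\tau)|X)\big] = 0_k. \tag{3.5}$$
The matrix $A$ is defined as the coefficient matrix of the best series approximation of $h_{VW}(W;\tau)$:
$$A(\tau) := \operatorname*{argmin}_{A} E\big[f_{Y|X}(Q(X;\tau)|X)\, \|h_{VW}(W;\tau) - A \widetilde Z(W)\|^2\big]. \tag{3.6}$$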

The following two assumptions are needed in our main results.


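(C1) Define $c_n^\dagger := \sup_{\tau, w} |\widetilde Z(w)^\top \beta_n^\dagger(\tau) - h(w;\tau)|$ and assume that
$$\xi_m c_n^\dagger = o(1); \tag{3.7}$$
$$\sup_{\tau} E\big[f_{Y|X}(Q(X;\tau)|X)\, \|h_{VW}(W;\tau) - A(\tau) \widetilde Z(W)\|^2\big] = O(\lambda_n^2) \text{ with } \xi_m \lambda_n^2 = o(1); \tag{3.8}$$
(C2) We have $\max_{j \le k} |V_j| < C$ almost surely for some constant $C > 0$.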
Bounds on $c_n^\dagger$ can be obtained under various assumptions on the basis expansion and smoothness
of the function $w \mapsto h(w;\tau)$. Assume for instance that $\mathcal{W} = [0, 1]^{k'}$, that $h(\cdot;\cdot) \in \Lambda_c^\eta(\mathcal{W}, T)$ and
that $\widetilde Z$ corresponds to a tensor product B-spline basis of order $q$ on $\mathcal{W}$ with $m^{1/k'}$ equidistant knots
in each coordinate. Assuming that $(V, W)$ has a density $f_{V,W}$ such that $0 < \inf_{v,w} f_{V,W}(v, w) \le
\sup_{v,w} f_{V,W}(v, w) < \infty$ and $q > \eta$, we show in Remark B.1 that $c_n^\dagger = O(m^{-\lfloor \eta \rfloor / k'})$. Assumption
(3.8) essentially states that $h_{VW}$ can be approximated by a series estimator sufficiently well. This
assumption is necessary to ensure that $\alpha(\tau)$ is estimable at a parametric rate without undersmoothing when estimating $h(\cdot;\tau)$. In general, (3.8) is a non-trivial high-level assumption. It
can be verified under smoothness conditions on the joint density of $(X, Y)$ by applying arguments
similar to those in Appendix S.1 of Cheng et al. (2014).
In addition to (C1)-(C2), we need the following condition.
(B1) Assume that
$$\Big(\frac{m \xi_m^{2/3} \log n}{n}\Big)^{3/4} + c_n^{\dagger 2} \xi_m = o(n^{-1/2}).$$
Moreover, assume that $c_n^\dagger \lambda_n = o(n^{-1/2})$ and $m c_n^\dagger \log n = o(1)$.


We now are ready to state the main result of this section.
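Theorem 3.1. Let Conditions (A1)-(A3) hold with $Z = (V^\top, \widetilde Z(W)^\top)^\top$, and let (B1) and (C1)-(C2) hold
for $\beta_n^\dagger(\tau)$ defined in (3.3). For any sequence $w_n \in \mathbb{R}^m$ with $E\big|w_n^\top M_2(\tau_2)^{-1} \widetilde Z(W)\big| = o(\|w_n\|)$,
where $M_2(\tau) := E\big[\widetilde Z(W) \widetilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)\big]$, if
$$\Gamma_{22}(\tau_1, \tau_2) = \lim_{n \to \infty} \|w_n\|^{-2} w_n^\top M_2(\tau_1)^{-1} E\big[\widetilde Z(W) \widetilde Z(W)^\top\big] M_2(\tau_2)^{-1} w_n \tag{3.9}$$
exists, then
$$\begin{pmatrix} \sqrt{n}\,\big(\widehat\alpha(\cdot) - \alpha(\cdot)\big) \\ \frac{\sqrt{n}}{\|w_n\|}\, w_n^\top\big(\widehat\beta^\dagger(\cdot) - \beta_n^\dagger(\cdot)\big) \end{pmatrix} \rightsquigarrow \big(G_1(\cdot), ..., G_k(\cdot), G_h(\cdot)\big)^\top \text{ in } (\ell^\infty(T))^{k+1}, \tag{3.10}$$
and the multivariate process $(G_1(\cdot), ..., G_k(\cdot), G_h(\cdot))$ has the covariance function
$$\Gamma(\tau_1, \tau_2; w_n) = (\tau_1 \wedge \tau_2 - \tau_1 \tau_2) \begin{pmatrix} \Gamma_{11}(\tau_1, \tau_2) & 0_k \\ 0_k^\top & \Gamma_{22}(\tau_1, \tau_2) \end{pmatrix} \tag{3.11}$$
with
$$\Gamma_{11}(\tau_1, \tau_2) = M_{1,h}(\tau_1)^{-1} E\big[(V - h_{VW}(W;\tau_1))(V - h_{VW}(W;\tau_2))^\top\big] M_{1,h}(\tau_2)^{-1} \tag{3.12}$$
where $M_{1,h}(\tau) = E\big[(V - h_{VW}(W;\tau))(V - h_{VW}(W;\tau))^\top f_{Y|X}(Q(X;\tau)|X)\big]$. In addition, at any
fixed $w_0 \in \mathbb{R}^{k'}$, let $w_n = \widetilde Z(w_0)$ satisfy the above conditions, $\widehat h(w_0;\tau) = \widetilde Z(w_0)^\top \widehat\beta^\dagger(\tau)$ and $c_n^\dagger =
o(\|\widetilde Z(w_0)\| n^{-1/2})$. Then
$$\begin{pmatrix} \sqrt{n}\,\big(\widehat\alpha(\cdot) - \alpha(\cdot)\big) \\ \frac{\sqrt{n}}{\|\widetilde Z(w_0)\|}\,\big(\widehat h(w_0;\cdot) - h(w_0;\cdot)\big) \end{pmatrix} \rightsquigarrow \big(G_1(\cdot), ..., G_k(\cdot), G_h(w_0;\cdot)\big)^\top \text{ in } (\ell^\infty(T))^{k+1}, \tag{3.13}$$
where $(G_1(\cdot), ..., G_k(\cdot), G_h(w_0;\cdot))$ are centered Gaussian processes with joint covariance function
$\Gamma_{w_0}(\tau_1, \tau_2)$ of the form (3.11), where $\Gamma_{22}(\tau_1, \tau_2)$ is defined through the limit in (3.9) with $w_n$ replaced
by $\widetilde Z(w_0)$. In particular, there exists a version of $G_h(w_0;\cdot)$ with almost surely continuous sample
paths.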

The proof of Theorem 3.1 is presented in Section A.3. The invertibility of the matrices $M_{1,h}(\tau)$
and $M_2(\tau)$ is discussed in Remark 5.5. In general, $\widehat\alpha(\tau)$ is not semiparametric efficient, as its
covariance matrix $\tau(1 - \tau)\Gamma_{11}(\tau, \tau)$ does not achieve the efficiency bound given in Section 5 of Lee
(2003).
The joint asymptotic process convergence result (in $\ell^\infty(T)$) presented in Theorem 3.1 is new
in the quantile regression literature. The block structure of the covariance function defined in (3.11)
implies that $\widehat\alpha(\tau)$ and $\widehat h(w_0;\tau)$ are asymptotically independent for any fixed $\tau$. This effect was
recently discovered by Cheng and Shang (2015) in the case of mean regression, where it was named the joint
asymptotics phenomenon.


Remark 3.2. We point out that $E\big|w_n^\top M_2(\tau_2)^{-1} \widetilde Z(W)\big| = o(\|w_n\|)$ is a crucial sufficient condition
for asymptotic independence between the parametric and nonparametric parts. We conjecture that
this condition is also necessary. This condition holds, for example, for $w_n = \widetilde Z(w_0)$ or $w_n =
\partial_{w_j} \widetilde Z(w_0)$ at a fixed $w_0$, $j = 1, ..., k'$, where $\widetilde Z(w)$ is a vector of B-spline basis functions. However, this
condition may not hold for other estimators. Consider for instance the case $\mathcal{W} = [0, 1]$, B-splines of
order zero $\widetilde Z$ and the vector $w_n = \int_0^\delta \widetilde Z(w)\,dw$ for some $\delta > 0$. In this case $\|w_n\| \asymp 1$, and one can
show that $E\big|w_n^\top M_2(\tau_2)^{-1} \widetilde Z(W)\big| \asymp 1$ instead. A more detailed investigation of related questions
is left to future research.
Remark 3.3. A seemingly more natural choice for the centering vector, which was also considered
in Belloni et al. (2011), is
$$\gamma_n(\tau) = (\alpha_n(\tau), \beta_n(\tau)) := \operatorname*{argmin}_{(a,b)} E\big[\rho_\tau\big(Y - a^\top V - b^\top \widetilde Z(W)\big)\big],$$
which gives $g_n(\gamma_n(\tau)) = 0$. However, a major drawback of centering with $\gamma_n(\tau)$ is that in this
representation, it is difficult to find bounds for the difference $\alpha_n(\tau) - \alpha(\tau)$.

Applications of Weak Convergence Results

In this section, we consider applications of the process convergence results to the estimation of
conditional distribution functions and non-crossing quantile curves via rearrangement operators.
For the former estimation, define the functional (see Dette and Volgushev (2008), Chernozhukov
et al. (2010) or Volgushev (2013) for similar ideas)
$$\Phi : \ell^\infty((\tau_L, \tau_U)) \to \ell^\infty(\mathbb{R}), \qquad \Phi(f)(y) := \tau_L + \int_{\tau_L}^{\tau_U} 1\{f(\tau) < y\}\,d\tau.$$
A simple calculation shows that
$$\Phi(Q(x;\cdot))(y) = \begin{cases} \tau_L & \text{if } F_{Y|X}(y|x) < \tau_L, \\ F_{Y|X}(y|x) & \text{if } \tau_L \le F_{Y|X}(y|x) \le \tau_U, \\ \tau_U & \text{if } F_{Y|X}(y|x) > \tau_U. \end{cases}$$
The latter identity motivates the following estimator of the conditional distribution function
$$\widehat F_{Y|X}(y|x) := \tau_L + \int_{\tau_L}^{\tau_U} 1\{\widehat Q(x;\tau) < y\}\,d\tau,$$
where $\widehat Q(x;\tau)$ denotes the estimator of the conditional quantile function in any of the three settings
discussed in Sections 2 and 3. By following the arguments in Chernozhukov et al. (2010), one can
easily show that under suitable assumptions the functional $\Phi$ is compactly differentiable (see Section
A.5 for more details). Hence, the general process convergence results in Sections 2 and 3 allow us to
easily establish the asymptotic properties of $\widehat F_{Y|X}$; see Theorem 4.1 at the end of this section.
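The estimator of the conditional distribution function described above integrates an indicator over the quantile level. A minimal numerical sketch, assuming the fitted quantile curve at a fixed covariate value is available as a Python callable, approximates the integral with a midpoint rule. The function name, grid size and the toy quantile curve are hypothetical and chosen only for illustration.

```python
def cdf_from_quantile_curve(q_hat, y, tau_lo, tau_hi, n_grid=1000):
    """Approximate tau_lo + integral over (tau_lo, tau_hi) of 1{q_hat(tau) < y}.

    q_hat: callable tau -> estimated conditional quantile at a fixed x.
    The integral is approximated by a midpoint rule with n_grid cells.
    """
    step = (tau_hi - tau_lo) / n_grid
    count = 0
    for k in range(n_grid):
        tau = tau_lo + (k + 0.5) * step
        if q_hat(tau) < y:
            count += 1
    return tau_lo + count * step


# Toy check (hypothetical): quantile function of Uniform(0, 2) is 2 * tau,
# so the distribution function is y / 2 on [0, 2].
uniform_q = lambda tau: 2.0 * tau
f_mid = cdf_from_quantile_curve(uniform_q, 1.0, 0.1, 0.9)
f_low = cdf_from_quantile_curve(uniform_q, 0.1, 0.1, 0.9)
f_high = cdf_from_quantile_curve(uniform_q, 5.0, 0.1, 0.9)
```

The output is always between the two truncation levels and is non-decreasing in `y` even when the input curve is not monotone in the quantile level, which is what makes this construction useful for crossing quantile estimates.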
The second functional of interest is the monotone rearrangement operator, defined as follows
$$\Psi : \ell^\infty((\tau_L, \tau_U)) \to \ell^\infty((\tau_L, \tau_U)), \qquad \Psi(f)(\tau) := \inf\{y : \Phi(f)(y) \ge \tau\}.$$
The main motivation for considering $\Psi$ is that the function $\tau \mapsto \Psi(f)(\tau)$ is by construction non-decreasing. Thus for any initial estimator $\widehat Q(x;\cdot)$, its rearranged version $\Psi(\widehat Q(x;\cdot))(\tau)$ is an estimator of the conditional quantile function which avoids the issue of quantile crossing. For more
detailed discussions of rearrangement operators and their use in avoiding quantile crossing we refer
the interested reader to Dette and Volgushev (2008) and Chernozhukov et al. (2010).
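When the initial quantile estimates are only available on an equispaced grid of quantile levels, the rearrangement has a particularly simple discrete counterpart: sorting the estimated values. The sketch below illustrates this under that grid assumption; the function names and numbers are hypothetical.

```python
def has_crossing(q_values):
    """True if the estimates decrease somewhere along the tau grid."""
    return any(b < a for a, b in zip(q_values, q_values[1:]))


def rearrange_on_grid(q_values):
    """Monotone rearrangement of estimates on an equispaced tau grid.

    For a step function with equal-width cells, the generalized inverse of
    the induced distribution function returns the values in sorted order.
    """
    return sorted(q_values)


# Hypothetical estimates at tau = 0.2, 0.4, 0.6, 0.8 with one crossing.
q_initial = [1.0, 1.4, 1.3, 1.9]
q_rearranged = rearrange_on_grid(q_initial)
```

The rearranged values form a non-decreasing curve in the quantile level while using exactly the same set of fitted values as the initial estimator.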
Theorem 4.1 (Convergence of $\widehat F(y|x)$ and $\Psi(\widehat Q(x;\tau))$). For any fixed $x_0$ and an initial estimator
$\widehat Q(x_0, \cdot)$, we have for any compact sets $[\tau_L, \tau_U] \subset T$, $\mathcal{Y} \subset \mathcal{Y}_{0,T} := \{y : F_{Y|X}(y|x_0) \in T\}$
$$a_n\big(\widehat F_{Y|X}(\cdot|x_0) - F_{Y|X}(\cdot|x_0)\big) \rightsquigarrow -f_{Y|X}(\cdot|x_0)\, G\big(x_0; F_{Y|X}(\cdot|x_0)\big) \text{ in } \ell^\infty(\mathcal{Y}),$$
$$a_n\big(\Psi(\widehat Q(x_0;\cdot))(\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow G(x_0;\cdot) \text{ in } \ell^\infty((\tau_L, \tau_U)),$$
where $\widehat Q(x_0, \cdot)$, the normalization $a_n$, and the process $G(x_0;\cdot)$ are stated as follows:

1. (Linear model with increasing dimension) Suppose $Z(X) = X$, $\widehat Q(x_0, \tau) = \widehat\gamma(\tau)^\top x_0$ and the
conditions in Corollary 2.2 hold. In this case, we have $a_n = \sqrt{n}/\|x_0\|$. $G(x_0;\cdot)$ is a centered
Gaussian process with covariance function $H_1(\tau_1, \tau_2; x_0)$ defined in (2.5).

2. (Nonparametric model) Suppose $\widehat Q(x_0, \tau) = \widehat\beta(\tau)^\top B(x_0)$ and the conditions in Theorem 2.4
hold. In this case, we have $a_n = \sqrt{n}/\|B(x_0)\|$. $G(x_0;\cdot)$ is a centered Gaussian process with
covariance function $\widetilde H(\tau_1, \tau_2; B(x_0))$ defined in (2.8).

3. (Partial linear model) Suppose $x_0^\top = (v_0^\top, w_0^\top)$, $\widehat Q(x_0, \tau) = \widehat\gamma^\dagger(\tau)^\top (v_0^\top, \widetilde Z(w_0)^\top)^\top$ and the
conditions in Theorem 3.1 hold. In this case, we have $a_n = \sqrt{n}/\|\widetilde Z(w_0)\|$. $G(x_0;\cdot)$ is a
centered Gaussian process with covariance function $\Gamma_{22}(\tau_1, \tau_2; \widetilde Z(w_0))$ defined in (3.9).

The proof of Theorem 4.1 is a direct consequence of the functional delta method. Details can be
found in Section A.5.


Bahadur representations

In this section, we provide Bahadur representations for the estimators discussed in Sections 2 and
3. In Sections 5.1 and 5.2, we state Bahadur representations for general series estimators and a
more specific choice of local basis function, respectively. In particular, the latter representation
is developed with an improved remainder term. Section 5.3 contains a special case of the general
theorem in Section 5.1 that is particularly tailored to partial linear models. The remainders in
these representations are shown to have exponential tail probabilities (uniformly over T ).

5.1 A Fundamental Bahadur Representation

Our first result gives a Bahadur representation for $\widehat\gamma(\tau) - \gamma_n(\tau)$ for centering functions $\gamma_n$ satisfying
certain conditions. Recall the definition of $\widehat\gamma(\tau)$ in (2.1). This kind of representation for quantile
regression with an increasing number of covariates has previously been established in Theorem 2
of Belloni et al. (2011). Compared to their results, the Bahadur representation given below has
several advantages. First, we allow for a more general centering. This is helpful for the analysis
of partial linear models (see Sections 3 and S.1.2). Second, we provide exponential tail bounds on
remainder terms, which are much more explicit and sharper than those in Belloni et al. (2011).
Theorem 5.1. Suppose Conditions (A1)-(A3) hold and that additionally $m \xi_m^2 \log n = o(n)$. Then,
for any $\gamma_n(\cdot)$ satisfying $g_n(\gamma_n) = o(\xi_m^{-1})$ and $c_n(\gamma_n) = o(1)$, we have
$$\widehat\gamma(\tau) - \gamma_n(\tau) = -\frac{1}{n} J_m(\tau)^{-1} \sum_{i=1}^n \psi(Y_i, Z_i; \gamma_n(\tau), \tau) + r_{n,1}(\tau) + r_{n,2}(\tau) + r_{n,3}(\tau) + r_{n,4}(\tau).$$
The remainder terms $r_{n,j}$ can be bounded as follows:
$$\sup_{\tau \in T} \|r_{n,1}(\tau)\| \le \frac{1}{\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))} \frac{m \xi_m}{n} \quad \text{a.s.} \tag{5.1}$$
Moreover, we have for any $\kappa_n \ll n/\xi_m^2$, sufficiently large $n$, and a constant $C$ independent of $n$
$$P\Big(\sup_{\tau \in T} \|r_{n,j}(\tau)\| \le C\, \mathfrak{R}_j(\kappa_n)\Big) \ge 1 - 2e^{-\kappa_n}, \quad j = 2, 3, 4,$$
T

where
 m
1/2  1/2
2
n
<2 (n ) := m
log n
+
+ gn ,
n
n
 m log n 1/2  1/2
1/2  m log n 1/2
n
m
m n
+
+ gn
,
<3 (n ) :=
+
n
n
n
n
 m
1/2  1/2 
n
<4 (n ) := cn
log n
+
+ gn .
n
n

(5.2)
(5.3)
(5.4)

The proof for Theorem 5.1 can be found in Section S.1.1.

5.2 Bahadur Representation for Local Basis Series Estimator

In this section, we focus on basis expansions $B$ satisfying (L) and derive a Bahadur representation
for linear functionals of the form $u_n^\top(\widehat\beta(\tau) - \beta_n(\tau))$, where the vector $u_n$ can have at most a finite
number of consecutive non-zero entries. Such linear functionals are of interest since the estimator
of the quantile function itself as well as estimators of derivatives can be represented in exactly this
form; see Remark 5.3 for additional details. The advantage of concentrating on vectors with this
particular structure is that we can substantially improve the rates of remainder terms compared to
the general setting in Theorem 5.1.
Theorem 5.2. Suppose Conditions (A1)-(A3) and (L) hold with $Z(x) = B(x)$. Assume additionally that $m \xi_m^2 (\log n)^2 = o(n)$ and that $\widetilde c_n = o(1)$ and that $I \subset \{1, ..., m\}$ consists of at most $L$
consecutive integers. Then, for $\beta_n(\tau)$ defined as (2.7) and $u_n \in S_I^{m-1}$ we have
$$u_n^\top(\widehat\beta(\tau) - \beta_n(\tau)) = -\frac{1}{n} u_n^\top \widetilde J_m(\tau)^{-1} \sum_{i=1}^n B_i\big(1\{Y_i \le \beta_n(\tau)^\top B_i\} - \tau\big) + \sum_{k=1}^4 r_{n,k}(\tau, u_n), \tag{5.5}$$

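where the remainder terms $r_{n,j}$ can be bounded as follows:
\[
\sup_{u_n\in\mathcal{S}_I^{m-1}} \sup_{\tau\in\mathcal{T}} |r_{n,1}(\tau, u_n)| \lesssim \frac{\xi_m \log n}{n} \quad a.s. \tag{5.6}
\]
\[
\sup_{u_n\in\mathcal{S}_I^{m-1}} \sup_{\tau\in\mathcal{T}} |r_{n,4}(\tau, u_n)| \le \Big( \frac{1}{n} + \frac{1}{2}\overline{f'}\,\tilde c_n^2 \Big) \sup_{u_n\in\mathcal{S}^{m-1}} \tilde{\mathcal{E}}(u_n, B) \quad a.s. \tag{5.7}
\]
where $\tilde{\mathcal{E}}(u_n, B) := \sup_{\tau\in\mathcal{T}} E|u_n^\top \tilde J_m(\tau)^{-1} B|$. Moreover, we have for any $\kappa_n \ll n/\xi_m^2$, all sufficiently large $n$, and a constant $C$ independent of $n$,
\[
P\Big( \sup_{u_n\in\mathcal{S}_I^{m-1}} \sup_{\tau\in\mathcal{T}} |r_{n,j}(\tau, u_n)| \le C\,\tilde{\mathfrak{R}}_j(\kappa_n) \Big) \ge 1 - n^2 e^{-\kappa_n}, \quad j = 2, 3,
\]
where
\[
\tilde{\mathfrak{R}}_2(\kappa_n) := C \sup_{u_n\in\mathcal{S}^{m-1}} \tilde{\mathcal{E}}(u_n, B) \Big( \frac{\xi_m(\log n + \kappa_n^{1/2})}{n^{1/2}} + \tilde c_n^2 \Big)^2, \tag{5.8}
\]
\[
\tilde{\mathfrak{R}}_3(\kappa_n) := C \Big( \tilde c_n \frac{\kappa_n^{1/2} \vee \log n}{n^{1/2}} + \frac{\xi_m^{1/2}(\kappa_n^{1/2} \vee \log n)^{3/2}}{n^{3/4}} \Big). \tag{5.9}
\]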

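Theorem 5.2 is proved in Section S.1.2. We note that by Hölder's inequality and assumptions (A1)-(A3), we have the simple bound
\[
\sup_{u_n\in\mathcal{S}_I^{m-1}} \tilde{\mathcal{E}}(u_n, B) \le \sup_{u_n\in\mathcal{S}^{m-1}} \sup_{\tau\in\mathcal{T}} \big( u_n^\top \tilde J_m(\tau)^{-1} E[BB^\top] \tilde J_m(\tau)^{-1} u_n \big)^{1/2} = O(1).
\]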

Remark 5.3. Theorem 5.2 enables us to study several quantities associated with the quantile function $Q(x;\tau)$. For instance, consider the spline setting of Example 2.3. Setting $u_n = B(x)/\|B(x)\|$ in Theorem 5.2 yields a representation for $\hat Q(x;\tau)$, while setting $u_n = B'(x)/\|B'(x)\|$ yields a representation for the estimator of the derivative $\partial_x Q(x;\tau)$. Uniformity in $x$ follows once we observe that for different values of $x$, the support of the vector $B(x)$ is always consecutive, so that there are at most $n^l$, $l > 0$, different sets $I$ that we need to consider.

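5.3 Bahadur Representation for Partial Linear Models

In this section, we provide a joint Bahadur representation for the parametric and non-parametric parts of the partial linear model. Recall that this model is $Q(X;\tau) = h(W;\tau) + \alpha(\tau)^\top V$.
Theorem 5.4. Let Conditions (A1)-(A3), (C1)-(C2) hold with $Z = (V^\top, \tilde Z(W)^\top)^\top$ and assume $m\xi_m^2(\log n)^2 = o(n)$. Then
\[
\begin{pmatrix} \hat\alpha(\tau) - \alpha(\tau) \\ \hat\beta^\dagger(\tau) - \beta_n^\dagger(\tau) \end{pmatrix}
= -n^{-1} J_m(\tau)^{-1} \sum_{i=1}^n Z_i \big(1\{Y_i \le \gamma_n^\dagger(\tau)^\top Z_i\} - \tau\big) + \sum_{j=1}^4 r_{n,j}(\tau),
\]
where the remainder terms $r_{n,j}$ satisfy the bounds stated in Theorem 5.1 with $g_n = \xi_m c_n^{\dagger 2}$. Additionally, the matrix $J_m(\tau)^{-1}$ can be represented as
\[
J_m(\tau)^{-1} = \begin{pmatrix} M_1(\tau)^{-1} & -M_1(\tau)^{-1} A(\tau) \\ -A(\tau)^\top M_1(\tau)^{-1} & M_2(\tau)^{-1} + A(\tau)^\top M_1(\tau)^{-1} A(\tau) \end{pmatrix} \tag{5.10}
\]
where
\[
M_1(\tau) := E\big[f_{Y|X}(Q(X;\tau)|X)\,(V - A(\tau)\tilde Z(W))(V - A(\tau)\tilde Z(W))^\top\big],
\]
\[
M_2(\tau) := E\big[\tilde Z(W)\tilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)\big],
\]
and $A(\tau)$ is defined in (3.6).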
See Section S.1.3 for the proof of Theorem 5.4.
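
The block form (5.10) can be sanity-checked numerically against the block form of $J_m(\tau)$ given in Remark 5.5. The sketch below is a minimal pure-Python check with $k = 1$ and $m = 2$; the matrices `M1`, `M2` and `A` are hypothetical toy values (not taken from the paper), and the helper functions are written out only to keep the example self-contained.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def neg(A):
    return [[-x for x in row] for row in A]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def block(tl, tr, bl, br):
    """Assemble the block matrix [[tl, tr], [bl, br]]."""
    return ([r1 + r2 for r1, r2 in zip(tl, tr)]
            + [r1 + r2 for r1, r2 in zip(bl, br)])

# Hypothetical toy inputs: k = 1 parametric and m = 2 series components.
M1 = [[2.0]]                       # k x k, positive definite
M2 = [[3.0, 1.0], [1.0, 2.0]]      # m x m, positive definite
A = [[0.5, -1.0]]                  # k x m

M1_inv = [[1.0 / M1[0][0]]]
M2_inv = inv2(M2)
At = transpose(A)

# J_m in the block form displayed in Remark 5.5.
J = block(add(M1, matmul(matmul(A, M2), At)), matmul(A, M2),
          matmul(M2, At), M2)
# Candidate inverse in the block form (5.10).
J_inv = block(M1_inv, neg(matmul(M1_inv, A)),
              neg(matmul(At, M1_inv)),
              add(M2_inv, matmul(matmul(At, M1_inv), A)))

product = matmul(J, J_inv)
max_err = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print(max_err)   # distance of J * J_inv from the identity
```

The check works because $M_1(\tau)$ is exactly the Schur complement of the $M_2(\tau)$ block in $J_m(\tau)$, which is also what drives the asymptotic independence of the parametric and non-parametric parts.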


Remark 5.5. We discuss the positive definiteness of $M_1(\tau)$ and $M_2(\tau)$. Following Condition (A1) with $Z = (V^\top, \tilde Z(W)^\top)^\top$, we have
\[
1/M \le \inf_{\tau\in\mathcal{T}} \lambda_{\min}(M_1(\tau)) \le \sup_{\tau\in\mathcal{T}} \lambda_{\max}(M_1(\tau)) \le M; \tag{5.11}
\]
\[
1/M \le \inf_{\tau\in\mathcal{T}} \lambda_{\min}(M_2(\tau)) \le \sup_{\tau\in\mathcal{T}} \lambda_{\max}(M_2(\tau)) \le M, \tag{5.12}
\]
for all $n$. To see this, observe that $M_1(\tau) = (I_k \,|\, -A(\tau)) J_m(\tau) (I_k \,|\, -A(\tau))^\top$ where $I_k$ is the $k$-dimensional identity matrix, $[A|B]$ denotes the block matrix with $A$ in the left block and $B$ in the right block, and
\[
J_m(\tau) = \begin{pmatrix} M_1(\tau) + A(\tau) M_2(\tau) A(\tau)^\top & A(\tau) M_2(\tau) \\ M_2(\tau) A(\tau)^\top & M_2(\tau) \end{pmatrix},
\]
whose form follows from the definition and the condition (3.5) (see the proof of Theorem 5.4 for more details). Thus, for an arbitrary nonzero vector $a \in \mathbb{R}^k$, by Condition (A1),
\[
0 < 1/M \le a^\top M_1(\tau) a = a^\top (I_k \,|\, -A(\tau)) J_m(\tau) (I_k \,|\, -A(\tau))^\top a \le M < \infty
\]
by the strict positive definiteness of $J_m(\tau)$ for some $M > 0$.
The strict positive definiteness of $M_2(\tau)$ follows directly from the observation that
\[
0 < 1/M \le b^\top M_2(\tau) b = (0_k^\top, b^\top) J_m(\tau) (0_k^\top, b^\top)^\top \le M < \infty,
\]
for all nonzero $b \in \mathbb{R}^m$ and some $M > 0$.

APPENDIX
This appendix gives technical details of the results shown in the main text. Appendix A contains all the proofs for weak convergence results in Theorems 2.1, 2.4 and 3.1. Appendix B discusses basis approximation errors with full technical details.
Additional Notations. Define for a function $x \mapsto f(x)$ that $\mathbb{G}_n(f) := n^{1/2} \int f(x)\,(dP_n(x) - dP(x))$ and $\|f\|_{L_p(P)} = (\int |f(x)|^p\, dP(x))^{1/p}$ for $0 < p < \infty$. For a class of functions $\mathcal{G}$, let $\|P_n - P\|_{\mathcal{G}} := \sup_{f\in\mathcal{G}} |P_n f - P f|$. For any $\epsilon > 0$, the covering number $N(\epsilon, \mathcal{G}, L_p)$ is the minimal number of balls of radius $\epsilon$ (under the $L_p$-norm) needed to cover $\mathcal{G}$. The bracketing number $N_{[\,]}(\epsilon, \mathcal{G}, L_p)$ is the minimal number of $\epsilon$-brackets needed to cover $\mathcal{G}$. An $\epsilon$-bracket refers to a pair of functions $(l, u)$ within an $\epsilon$ distance: $\|u - l\|_{L_p} < \epsilon$. Throughout the proofs, $C, C_1, C_2$ etc. will denote constants which do not depend on $n$ but may have different values in different lines.
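
As a concrete instance of the quantity $\|P_n - P\|_{\mathcal{G}}$, the following Python sketch approximates it for the simple class of indicators $\{x \mapsto 1\{x \le t\}\}$ under a uniform distribution. The function name `sup_deviation`, the grid, the sample size and the seed are all assumptions chosen for illustration; the classes used in the proofs below are considerably richer.

```python
import random

def sup_deviation(sample, grid):
    """Approximate ||P_n - P||_G for G = {x -> 1{x <= t} : t in grid}
    when P is the uniform distribution on [0, 1], so that P f_t = t."""
    n = len(sample)
    xs = sorted(sample)
    worst, idx = 0.0, 0
    for t in sorted(grid):
        while idx < n and xs[idx] <= t:    # count observations <= t
            idx += 1
        worst = max(worst, abs(idx / n - t))
    return worst

rng = random.Random(0)                      # fixed seed, for reproducibility only
n = 10_000
sample = [rng.random() for _ in range(n)]
grid = [j / 200 for j in range(1, 200)]
dev = sup_deviation(sample, grid)
print(dev, n ** 0.5 * dev)                  # second value: sup over the grid of |G_n(f)|
```

The second printed value stays of constant order as $n$ grows, which is the behaviour that the maximal inequalities in the proofs quantify for more complex classes.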

APPENDIX A: Proofs for Process Convergence


A.1 Proof of Theorem 2.1

A.1.1 Proof of (2.2)

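Under conditions (A1)-(A3) and those in Theorem 2.1 it follows from Theorem 5.1 applied with $\kappa_n = c\log n$ for a suitable constant $c$ (note that the conditions $g_n = o(\xi_m^{-1})$ and $c_n = o(1)$ in Theorem 5.1 follow under the assumptions of Theorem 2.1) that
\[
u_n^\top(\hat\gamma(\tau) - \gamma_n(\tau)) + \frac{1}{n} u_n^\top J_m(\tau)^{-1} \sum_{i=1}^n Z_i \big(1\{Y_i \le Q(X_i;\tau)\} - \tau\big) = I(\tau) + o_P(\|u_n\| n^{-1/2}),
\]
where the remainder term is uniform in $\tau \in \mathcal{T}$ and
\[
I(\tau) := -n^{-1} u_n^\top J_m(\tau)^{-1} \sum_{i=1}^n Z_i \big(1\{Y_i \le Z_i^\top \gamma_n(\tau)\} - 1\{Y_i \le Q(X_i;\tau)\}\big).
\]
Under the assumption $\chi_{\gamma_n}(u_n, Z) = o(\|u_n\| n^{-1/2})$, we have $\sup_{\tau\in\mathcal{T}} |E[I(\tau)]| = o(\|u_n\| n^{-1/2})$ and moreover
\[
\sup_{\tau\in\mathcal{T}} |I(\tau) - E[I(\tau)]| \le \|u_n\| \big[\inf_{\tau\in\mathcal{T}} \lambda_{\min}(J_m(\tau))\big]^{-1} \|P_n - P\|_{\mathcal{G}_5},
\]
where the class of functions $\mathcal{G}_5$ is defined as
\[
\mathcal{G}_5(Z, \gamma_n) = \Big\{ (X, Y) \mapsto a^\top Z(X) 1\{\|Z(X)\| \le \xi_m\} \big(1\{Y \le Z(X)^\top \gamma_n(\tau)\} - 1\{Y \le Q(X;\tau)\}\big) \,\Big|\, \tau \in \mathcal{T},\ a \in \mathcal{S}^{m-1} \Big\}.
\]
It remains to bound $\|P_n - P\|_{\mathcal{G}_5}$. For any $f \in \mathcal{G}_5$ and a sufficiently large $C$, we obtain
\[
|f| \le |a^\top Z| \le \xi_m, \qquad \|f\|_{L_2(P)}^2 \le 2\bar f c_n \lambda_{\max}(E[ZZ^\top]) \le C c_n.
\]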

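By Lemma 21 of Belloni et al. (2011), the VC index of $\mathcal{G}_5$ is of the order $O(m)$. Therefore, we obtain from (S.2.2)
\[
E\|P_n - P\|_{\mathcal{G}_5} \le \tilde C \Big[ \Big( \frac{m c_n}{n} \log\frac{\xi_m}{c_n} \Big)^{1/2} + \frac{m\xi_m}{n} \log\frac{\xi_m}{c_n} \Big]. \tag{A.1}
\]
For any $\kappa_n > 0$, let
\[
r'_{N,3}(\kappa_n) = C \Big[ \Big( \frac{m c_n}{n} \log\frac{\xi_m}{c_n} \Big)^{1/2} + \frac{m\xi_m}{n} \log\frac{\xi_m}{c_n} + \Big( \frac{c_n \kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} \Big]
\]
for a sufficiently large constant $C > 0$. We obtain from (S.2.3) combined with (A.1)
\[
P\Big( \sup_{\tau\in\mathcal{T}} |I(\tau)| \ge \|u_n\| r'_{N,3}(\kappa_n) + \sup_{\tau\in\mathcal{T}} |E[I(\tau)]| \Big) \le e^{-\kappa_n}.
\]
Finally, note that under the conditions $m c_n \log n = o(1)$ and $m^3 \xi_m^2 (\log n)^3 = o(n)$ we have that $r'_{N,3}(\log n) = o(n^{-1/2})$. This completes the proof of (2.2).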

A.1.2 Proof of (2.4)

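Throughout this subsection assume without loss of generality that $\|u_n\| = 1$. It suffices to prove finite dimensional convergence and asymptotic equicontinuity. Asymptotic equicontinuity follows from (A.33). The existence of a version of the limit with continuous sample paths is a consequence of Theorem 1.5.7 and Addendum 1.5.8 in van der Vaart and Wellner (1996). So, we only need to focus on finite dimensional convergence.
Let
\[
\mathbb{G}_n(\tau) := \frac{1}{\sqrt{n}} u_n^\top J_m(\tau)^{-1} \sum_{i=1}^n Z_i \big(1\{Y_i \le Q(X_i;\tau)\} - \tau\big),
\]
and let $\mathbb{G}$ be the Gaussian process defined in (2.4). By the Cramér-Wold theorem, the goal is to show that for an arbitrary set $\{\tau_1, \dots, \tau_L\}$ and coefficients $(\zeta_1, \dots, \zeta_L) \in \mathbb{R}^L$, we have
\[
\sum_{l=1}^L \zeta_l \mathbb{G}_n(\tau_l) \rightsquigarrow \sum_{l=1}^L \zeta_l \mathbb{G}(\tau_l).
\]
Let the triangular array $V_{n,i}(\tau) := n^{-1/2} u_n^\top J_m(\tau)^{-1} Z_i (1\{Y_i \le Q(X_i;\tau)\} - \tau)$. Then for all $\tau \in \mathcal{T}$, we have $E[V_{n,i}(\tau)] = 0$, $|V_{n,i}| \lesssim n^{-1/2}\xi_m$ and $\mathrm{var}(V_{n,i}(\tau)) = n^{-1} u_n^\top J_m(\tau)^{-1} E[Z_i Z_i^\top] J_m(\tau)^{-1} u_n\, \tau(1-\tau) < \infty$ by Conditions (A1)-(A3). We can express $\mathbb{G}_n(\tau) = \sum_{i=1}^n V_{n,i}(\tau)$ and $\sum_{l=1}^L \zeta_l \mathbb{G}_n(\tau_l) = \sum_{i=1}^n \sum_{l=1}^L \zeta_l V_{n,i}(\tau_l)$. Observe that $\mathrm{var}(\sum_{i=1}^n \sum_{l=1}^L \zeta_l V_{n,i}(\tau_l)) =: \sigma_{n,L}^2$ where
\[
\sigma_{n,L}^2 = \sum_{l,l'=1}^L \zeta_l \zeta_{l'}\, u_n^\top J_m(\tau_l)^{-1} E[Z_i Z_i^\top] J_m(\tau_{l'})^{-1} u_n\, (\tau_l \wedge \tau_{l'} - \tau_l \tau_{l'}).
\]
If $0 = \lim_{n\to\infty} \sigma_{n,L}^2 = \sum_{l,l'=1}^L \zeta_l \zeta_{l'} H(\tau_l, \tau_{l'}; u_n) = \mathrm{var}(\sum_{l=1}^L \zeta_l \mathbb{G}(\tau_l))$, then by Markov's inequality $\sum_{i=1}^n \sum_{l=1}^L \zeta_l V_{n,i}(\tau_l) \to 0$ in probability, which coincides with the distribution of $\sum_{l=1}^L \zeta_l \mathbb{G}(\tau_l)$, which is a single point mass at 0. Next, consider the case $\sigma_{n,L}^2 \to \sigma_L^2 > 0$. For sufficiently large $n$ and arbitrary $v > 0$, Markov's inequality implies
\[
\begin{aligned}
\sigma_{n,L}^{-2} \sum_{i=1}^n E\Big[ \Big( \sum_{l=1}^L \zeta_l V_{n,i}(\tau_l) \Big)^2 1\Big\{ \Big| \sum_{l=1}^L \zeta_l V_{n,i}(\tau_l) \Big| > v \Big\} \Big]
&\lesssim \xi_m^2 n^{-1} \sigma_{n,L}^{-2} \sum_{i=1}^n E\Big[ 1\Big\{ \Big| \sum_{l=1}^L \zeta_l V_{n,i}(\tau_l) \Big| > v \Big\} \Big] \\
&\lesssim \xi_m^2 n^{-1} \sigma_{n,L}^{-2} v^{-2} \sum_{l,l'=1}^L \zeta_l \zeta_{l'}\, u_n^\top J_m(\tau_l)^{-1} E[Z_i Z_i^\top] J_m(\tau_{l'})^{-1} u_n\, (\tau_l \wedge \tau_{l'} - \tau_l \tau_{l'}) \\
&= o(1)
\end{aligned}
\]
since $\xi_m^2 n^{-1} = o(1)$ by the assumption $m\xi_m^2 \log n = o(n)$. Hence the Lindeberg condition is verified. The finite dimensional convergence follows from the Cramér-Wold device. This completes the proof.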
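
The factor $\tau_l \wedge \tau_{l'} - \tau_l\tau_{l'}$ in the variance computation above is the covariance of the centered indicators $1\{Y \le Q(X;\tau)\} - \tau$ at two quantile levels. The following Python sketch checks this by Monte Carlo, assuming a continuous conditional distribution so that the indicator can be simulated as $1\{U \le \tau\}$ with $U$ uniform on $[0,1]$. The function name `indicator_cov`, the number of draws and the seed are illustrative assumptions.

```python
import random

def indicator_cov(tau1, tau2, n_draws=200_000, seed=1):
    """Monte Carlo estimate of E[(1{U <= tau1} - tau1)(1{U <= tau2} - tau2)]
    for U uniform on [0, 1]. Both factors have mean zero, so this is their
    covariance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        u = rng.random()
        total += ((u <= tau1) - tau1) * ((u <= tau2) - tau2)
    return total / n_draws

tau1, tau2 = 0.3, 0.7
estimate = indicator_cov(tau1, tau2)
exact = min(tau1, tau2) - tau1 * tau2       # Brownian-bridge-type covariance kernel
print(estimate, exact)
```

The two printed numbers should agree up to Monte Carlo error, which is of order $n_{\text{draws}}^{-1/2}$.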

A.2 Proofs of Theorem 2.4, Example 2.3 and Remark 2.6

We begin by introducing some notation and useful preliminary results. For a vector $u = (u_1, \dots, u_m)^\top$ and a set $I \subset \{1, \dots, m\}$, let $u^{(I)} \in \mathbb{R}^m$ denote the vector that has entries $u_i$ for $i \in I$ and zero otherwise. For a vector $a \in \mathbb{R}^m$ with $\|a\|_0$ non-zero consecutive entries, let $k_a$ denote the position of the first non-zero entry of $a$ and define
\[
I(a, D) := \{ i : |i - k_a| \le \|a\|_0 + D \}, \tag{A.2}
\]
\[
I'(a, D) := \{ 1 \le j \le m : \exists\, i \in I(a, D) \text{ such that } |j - i| \le \|a\|_0 \}. \tag{A.3}
\]
Lemma A.1. Under (L), for an arbitrary vector $a \in \mathbb{R}^m$ with at most $\|a\|_0$ non-zero consecutive entries we have, for a constant $\gamma \in (0,1)$ independent of $n$,
\[
|(a^\top \tilde J_m^{-1}(\tau))_j| \le C_1 \|a\|_\infty \sum_{q=k_a}^{k_a + \|a\|_0} \gamma^{|q-j|}, \tag{A.4}
\]
\[
\|a^\top \tilde J_m^{-1}(\tau) - (a^\top \tilde J_m^{-1}(\tau))^{(I(a,D))}\| \lesssim \|a\|_\infty \|a\|_0 \gamma^D, \tag{A.5}
\]
\[
\|a^\top J_m^{-1}(\tau) - (a^\top J_m^{-1}(\tau))^{(I(a,D))}\| \lesssim \|a\|_\infty \|a\|_0 \gamma^D. \tag{A.6}
\]
Proof for Lemma A.1. Under (L) the matrix $Z(x)Z(x)^\top$ has no non-zero entries that are further than $r$ away from the diagonal. Thus $\tilde J_m(\tau)$ is a band matrix with band width no larger than $2r$. Apply similar arguments as in the proof of Lemma 6.3 in Zhou et al. (1998) to find that under (L) the entries of $\tilde J_m^{-1}(\tau)$ satisfy
\[
\sup_{\tau, m} |(\tilde J_m^{-1}(\tau))_{j,k}| \le C_1 \gamma^{|j-k|}
\]
for some $\gamma \in (0,1)$ and a constant $C_1$, where neither $\gamma$ nor $C_1$ depends on $n$ or $\tau$. It follows that
\[
|(a^\top \tilde J_m^{-1}(\tau))_j| \le C_1 \|a\|_\infty \sum_{q=k_a}^{k_a + \|a\|_0} \gamma^{|q-j|},
\]
and thus (A.4) is established. For a proof of (A.5) note that by (A.4) we have
\[
|(a^\top \tilde J_m^{-1}(\tau))_j| \le C_1 \|a\|_\infty \sum_{q=k_a}^{k_a + \|a\|_0} \gamma^{|q-j|} \le C_1 \|a\|_\infty \|a\|_0 \gamma^{|k_a - j| - \|a\|_0}.
\]
By the definition of $I(a, D)$ we find
\[
\|a^\top \tilde J_m^{-1}(\tau) - (a^\top \tilde J_m^{-1}(\tau))^{(I(a,D))}\| \le C \|a\|_\infty \|a\|_0 \gamma^D
\]
for a constant $C$ independent of $n$. The proof of (A.6) is similar to the proof of (A.5).
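
The key ingredient of Lemma A.1 is that the entries of the inverse of a well-conditioned band matrix decay geometrically away from the diagonal. The pure-Python sketch below illustrates this on a hypothetical tridiagonal matrix (4 on the diagonal, 1 on the first off-diagonals); the matrix, the dimension and the constants `C1` and `gamma` are assumptions chosen by hand for this toy example and are unrelated to the constants in the lemma.

```python
def invert(M):
    """Gauss-Jordan inverse without pivoting (adequate for the
    diagonally dominant matrix used below)."""
    n = len(M)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = aug[c][c]
        aug[c] = [x / p for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

m = 12
# Hypothetical band matrix: 4 on the diagonal, 1 on the first off-diagonals.
J = [[4.0 if i == j else 1.0 if abs(i - j) == 1 else 0.0 for j in range(m)]
     for i in range(m)]
J_inv = invert(J)

C1, gamma = 1.0, 0.3   # illustrative constants for this particular matrix
bound_holds = all(abs(J_inv[j][k]) <= C1 * gamma ** abs(j - k) + 1e-12
                  for j in range(m) for k in range(m))
print(bound_holds)
print([abs(J_inv[0][k]) for k in range(5)])   # first row: geometric decay
```

Truncating a row of the inverse to indices within distance $D$ of the support of $a$, as in (A.5) and (A.6), therefore discards only a geometrically small tail.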
Proof for Theorem 2.4. By Theorem 5.2 and Condition (B1), we first obtain
n
X

b ) ( ) = 1 u> Jem ( )1
u>
(
Bi (1{Yi B(x)> n ( )} ) + oP (kun kn1/2 ).
n
n n

(A.7)

i=1

e 1,n ( ) := n1 u> Jem ( )1 Pn Bi (1{Yi B(x)> n ( )} ). We claim that


Let U
n
i=1

1/2
e
u>
),
n U1,n ( ) Un ( ) = oP (kun kn

(A.8)

Pn
1
where Un ( ) := n1 u>
n Jm ( )
i=1 Bi (1{Yi Q(Xi ; )} ). Given (A.8), the process conver
>
b ) ( ) and continuity of the sample paths of the limiting process follows from
gence of un (
process convergence of u>
n Un ( ), which can be shown via exactly the same steps as in Section A.1.2
by replacing Z by B given assumptions (A1)-(A3).
Pn
1
e
To show (A.8), we proceed in two steps. Given un SIm1 , let U1,n ( ) := n1 u>
n Jm ( )
i=1 Bi (1{Yi
Q(Xi ; )} ).


1/2 ), for all u S m1 .
e

Step 1: sup T u>
n
n U1,n ( ) U1,n ( ) = oP (n
I
 >

e
e
Let I0 ( ) := E un U1,n ( ) U1,n ( ) and observe the decomposition

 >



e
e
e
u>
+ E u>
n U1,n ( ) U1,n ( ) E un U1,n ( ) U1,n ( )
n U1,n ( ) U1,n ( )

(I(un ,D)) 


1
1
= un Jem
( ) un Jem
( )
(Pn P ) Bi (1{Yi B(x)> n ( )} 1{Yi Q(Xi ; )})
n
o
(I(un ,D))
1
+ (Pn P ) un Jem
( )
Bi (1{Yi B(x)> n ( )} 1{Yi Q(Xi ; )}) + Ie0 ( )
=: Ie1 ( ) + Ie2 ( ) + Ie0 ( ).

For sup T |Ie0 ( )|, by the construction of n ( ) in (2.7),



h


i

1
>
e
sup e
I0 ( ) sup u>
J
(
)E
B
1{Y

Q(X
;

)}

1{Y

(
)}

i
i
i
i
n m
i n

1
e
c2n f0 sup E|u> Jem
( )B| e
c2n f0 sup
uS m1

uS m1

1
1
u> Jem
( )E[BB> ]Jem
( )u

1/2

= o(n1/2 ),

where the final rate follows from assumptions (A2) and e


c2n = o(n1/2 ) in (B1).
By (A.5) in Lemma A.1, let D = c log n for large enough c > 0, we have almost surely




> e1
(I(un ,D))
e1
sup e
I1 ( ) sup u>
m kun k kun k0 nc log m = o(n1/2 ).
n Jm ( ) (un Jm ( ))
T


For bounding sup T |Ie2 ( )|, observe that


n

1X >e
(un Jm ( ))(I(un ,D)) Bi (1{Yi B>
i n ( )} 1{Yi Q(Xi ; )})
n

i=1
n
X

1
n

1
=
n
=

1
n

i=1

(I(un ,D))

e
u>
n Jm ( )Bi
X

(I 0 (un ,D)c )

{i:Bi

1
n

(I(un ,D))

{i:supp(Bi )I(un ,D)6=}

n
X
i=1

(1{Yi B>
i n ( )} 1{Yi Q(Xi ; )})

e
u>
n Jm ( )Bi

(I(un ,D))

=0}

e
u>
n Jm ( )Bi

(I(un ,D))

e
u>
n Jm ( )Bi

(1{Yi B>
i n ( )} 1{Yi Q(Xi ; )})

(1{Yi B>
i n ( )} 1{Yi Q(Xi ; )})

(I 0 (un ,D)) >

(1{Yi Bi

(A.9)

n ( )} 1{Yi Q(Xi ; )}),

n
1X >e
(I(u ,D))
(I 0 (un ,D))
un Jm ( )Bi n
(1{Yi B>
} 1{Yi Q(Xi ; )}),
=
i n ( )
n
i=1

(I(u ,D))

where the third equality follows from the fact that Bi n


6= 0 can only happen for i {i :
(I 0 (un ,D)c )
Bi
= 0}, because B can only be nonzero in r consecutive entries by assumption (L), where
0
c
I (un , D) = {1, ..., m} I 0 (un , D) is the complement of I 0 (un , D) in {1, ..., m}. By restricting
0
(I 0 (un ,D)c )
ourselves on set {i : Bi
= 0}, it is enough to look at the coefficient n ( )(I (un ,D)) in the
last equality in (A.9). Hence,



sup e
I2 ( ) Pn P Ge5 (I(un ,D),I 0 (un ,D))

where for any two index sets I1 and I10




0
Ge5 (I1 , I10 ) = (X, Y ) 7 a> B(X)(I1 ) 1{Y B(X)> b(I1 ) }1{Y Q(X; )} T , b Rm , a S m1 .
With the choice of D = c log n, the cardinality of both I(un , D) and I 0 (un , D) is of order O(log n).
Hence, the VC index of Ge5 (I(un , D), I 0 (un , D)) is bounded by O(log n). Note that for any f
Ge5 (I(un , D), I 0 (un , D)), |f | . m and kf kL2 (P ) . e
cn . Applying (S.2.3) yields
P

h e


cn n 1/2 n m i
cn (log n)2 1/2 m (log n)2  e
+
+
+
sup e
I2 ( ) C
n
n
n
n
T

1 en .



4 (log n)6 = o(n) in (B1) implies that sup
e

Taking n = C log n, e
c2n = o(n1/2 ) and m
T I2 ( ) =
1/2
oP (n ).


1/2 ), for all u S m1 .

Step 2: sup T u>
n
n U1,n ( ) Un ( ) = oP (n
I


Observe that

e
u>
n U1,n ( ) Un ( )
n
X
 

1  > e
em ( )1 Jm ( )1 (I(un ,D))
=
un Jm ( )1 Jm ( )1 u>
J
Bi (1{Yi Q(Xi ; )} )
n
n
i=1

 (I(un ,D))
1 > e
un Jm ( )1 Jm ( )1
n

=: Ie3 ( ) + Ie4 ( ).

n
X
i=1

Bi (1{Yi Q(Xi ; )} )

Applying (A.5) and (A.6) in Lemma A.1 with D = c log n where c > 0 is chosen sufficiently large,
we have almost surely






e1 ( ) (u> Je1 ( ))(I(un ,D)) + sup u> J 1 ( ) (u> J 1 ( ))(I(un ,D)) m
sup e
I3 ( ) sup u>
J
n m
n m
n m
n m
T

2kun k kun k0 n

c log

1/2

m = o(n
).


Now it is left to bound sup T e
I4 ( ) . We have

(A.10)

 (I(un ,D))
1 X > e
un Jm ( )1 Jm ( )1
Bi (1{Yi Q(Xi ; )} )
Ie4 ( ) =
n
i=1

n
 (I(u ,D))
1X > e
un Jm ( )1 Jm ( )1 Bi n
(1{Yi Q(Xi ; )} ).
=
n
i=1

Hence,






1
e
Jm ( )1 Pn P G0 (I(un ,D))G4
sup e
I4 ( ) sup u>
n Jm ( )

where for any I,




G0 (I) := (B, Y ) 7 a> B(I) 1{kBk m } a S m1 ,



G4 := (X, Y ) 7 1{Yi Q(X; )} T .

The cardinality of the set I(un , c log n) is of order O(log n). Thus, the VC index for G0 (I(un , D))
is of order O(log n). The VC index of G4 is 2 (see Lemma S.2.4). By Lemma S.2.2,
N (G0 (I(un , D)) G4 , L2 (Pn ); )

AkF kL2 (Pn )

v0 (n)

where v0 (n) = O(log n). In addition, for any f G0 (I(un , D)) G4 , |f | . m and kf kL2 (P ) = O(1)
by (A1). Furthermore, by assumptions (A1)-(A2) and the definition of e
cn ,
>

un Jem ( )1 Jm ( )1 e
cn max (E[B(X)B(X)> ])f0 . e
cn .
(A.11)
By (S.2.3), we have for some constant C > 0,

h (log n)2 1/2 (log n)2  1/2 i


n
m
n m
e

P sup I2 ( ) Ce
cn
+
+
+
1 en .
n
n
n
n
T

Taking n = C log n, an application of (B1) completes the proof.


Proof for Example 2.3. As $\tilde J_m(\tau)$ is a band matrix, applying similar arguments as in the proof of Lemma 6.3 in Zhou et al. (1998) gives
\[
\sup_{\tau, m} |(\tilde J_m^{-1}(\tau))_{j,j'}| \le C_1 \gamma^{|j-j'|}, \tag{A.12}
\]
for some $\gamma \in (0,1)$ and $C_1 > 0$. Let $k_{B(x)}$ be the index of the first nonzero element of the vector $B(x)$. Then by (A.12), we have
\[
\sup_{\tau, m} |(B(x)^\top \tilde J_m^{-1}(\tau))_j| \le C_1 \|B(x)\|_\infty \sum_{j'=k_{B(x)}}^{k_{B(x)} + \|B(x)\|_0} \gamma^{|j-j'|},
\]
and also
\[
\sup_{\tau, m} E|B(x)^\top \tilde J_m^{-1}(\tau) B(X)| \le C_1 \|B(x)\|_\infty \max_{l\le m} E|B_l(X)| \sum_{j=1}^m \sum_{j'=k_{B(x)}}^{k_{B(x)} + \|B(x)\|_0} \gamma^{|j-j'|}. \tag{A.13}
\]
Since $\|B(x)\|_0$ is bounded by a constant, the sum in (A.13) is bounded uniformly. Moreover, in the present setting we have $\|B(x)\|_\infty = O(m^{1/2})$ and $\max_{l\le m} E|B_l(X)| = O(m^{-1/2})$. Therefore, for each $m$ we have
\[
\sup_{\tau\in\mathcal{T},\, x\in\mathcal{X}} E|B(x)^\top \tilde J_m^{-1}(\tau) B(X)| = O(1).
\]

Proof of Remark 2.6. Consider the product $B_j(x)B_{j'}(x)$ of two B-spline functions. The fact that $B_j(x)$ is locally supported on $[t_j, t_{j+r}]$ implies that for all $j'$ satisfying $|j - j'| \ge r$, $B_j(x)B_{j'}(x) = 0$ for all $x$, where $r \in \mathbb{N}$ is the degree of the spline. This also implies that $J_m(\tau)$ and $E[BB^\top]$ are band matrices with each column having at most $L_r := 2r + 1$ nonzero elements and each non-zero element at most $r$ entries away from the main diagonal. Recall also the fact that $\max_{j\le m} \sup_{t\in\mathbb{R}} |B_j(t)| \lesssim m^{1/2}$ (by the discussion following assumption (A1)).
Define $J_{m,D}(\tau) := D_m(\tau)E[BB^\top]$, where the matrix $D_m(\tau) := \mathrm{diag}(f_{Y|X}(Q(t_j;\tau)|t_j),\ j = 1, \dots, m)$, and $R_m(\tau) := J_m(\tau) - J_{m,D}(\tau)$. Both $J_{m,D}(\tau)$ and $R_m(\tau)$ have the same band structure as $J_m(\tau)$. For arbitrary $j, j' = 1, \dots, m$, $\tau \in \mathcal{T}$ and a universal constant $C > 0$,
\[
\begin{aligned}
|(R_m(\tau))_{j,j'}| &= \Big| E\Big[ B_j(X)B_{j'}(X) \big( f_{Y|X}(Q(X;\tau)|X) - f_{Y|X}(Q(t_j;\tau)|t_j) \big) \Big] \Big| \\
&\le 2 \max_{j\le m} \sup_{t\in\mathbb{R}} |B_j(t)|^2 \int_0^1 1\Big\{ |x - t_j| \le C\frac{r}{m} \Big\} \big| f_{Y|X}(Q(x;\tau)|x) - f_{Y|X}(Q(t_j;\tau)|t_j) \big| f_X(x)\,dx \\
&\le 2Cm \int_0^1 1\Big\{ |x - t_j| \le C\frac{r}{m} \Big\} |x - t_j|\,dx \\
&= O(m^{-1}),
\end{aligned} \tag{A.14}
\]
where the second inequality is an application of the upper bound $\max_{j\le m} \sup_{t\in\mathbb{R}} |B_j(t)| \lesssim m^{1/2}$ and the local support property of $B_j$; the third inequality follows by the assumption (2.10) and bounded $f_X(x)$. This shows that $\max_{j,j'=1,\dots,m} \sup_{\tau\in\mathcal{T}} |(R_m(\tau))_{j,j'}| = O(m^{-1})$.
Now we show the stronger result that $\sup_{\tau\in\mathcal{T}} \|R_m(\tau)\| = O(m^{-1/2})$ for later use. Let $v = (v_1, \dots, v_m)$. Denote by $k_j$ the index of the first nonzero entry in the $j$th column of $R_m(\tau)$. By the band structure of $R_m(\tau)$,
\[
\begin{aligned}
\sup_{\tau\in\mathcal{T}} \|R_m(\tau)\|^2 = \sup_{\tau\in\mathcal{T}} \sup_{v\in\mathcal{S}^{m-1}} \|R_m(\tau)v\|_2^2 &= \sup_{\tau\in\mathcal{T}} \sup_{v\in\mathcal{S}^{m-1}} \sum_{j=1}^m \Big( \sum_{i=k_j}^{k_j + L_r - 1} v_i (R_m(\tau))_{i,j} \Big)^2 \\
&\lesssim \sup_{\tau\in\mathcal{T}} \max_{j,j'} |(R_m(\tau))_{j,j'}|^2\, m = O(m^{-1}),
\end{aligned} \tag{A.15}
\]
where the last equality follows by (A.14). Note that it follows from assumptions (A1)-(A3) that
\[
f_{\min} M^{-1} < \lambda_{\min}(J_{m,D}(\tau)) \le \lambda_{\max}(J_{m,D}(\tau)) < \bar f M \tag{A.16}
\]
uniformly in $\tau \in \mathcal{T}$, where the constant $M > 0$ is defined as in Assumption (A1). Using (A.16), assumptions (A1)-(A3) and (A.15),
\[
\|J_m^{-1}(\tau) - J_{m,D}^{-1}(\tau)\| \le \|J_{m,D}^{-1}(\tau)\| \|J_{m,D}(\tau) - J_m(\tau)\| \|J_m^{-1}(\tau)\| \lesssim \sup_{\tau\in\mathcal{T}} \|R_m(\tau)\| = O(m^{-1/2})
\]
uniformly in $\tau \in \mathcal{T}$.
Without loss of generality, from now on we drop the term 1 2 1 2 out of our discussion
e 1 , 2 ; un ) defined in (2.8). From (A1)
and focus on the matrix part in the covariance function H(
we have kE[BB> ]k < M for some constant M > 0 so for any 1 , 2 T ,




1
> 1
>
1
>
1
kun k2 u>
J
(
)E[BB
]J
(
)u

u
J
(
)
E[BB
]J
(
)
u

1
2
n
1
2
n
m,D
m,D
n m
m
n








. sup Rm ( ) sup E[BB> ]Jm,D ( )1 + sup Rm ( ) sup E[BB> ]Jm ( )1
T

1/2

= O(m

).

(A.17)

Moreover, note that


1
>
1
>
1
> 1
1
u>
n Jm,D (1 ) E[BB ]Jm,D (2 ) un = un Dm (1 ) E[BB ] Dm (2 ) un

(A.18)

If un = B(x), observe that for l = 1, ..., m, as suggested by the local support property, we only
need to focus on the index l satisfying |x tl | Cr/m, for a universal constant C > 0. We have
(B(x)> Dm ( )1 )l = Bl (x)fY |X (Q(tl ; )|tl )1 = Bl (x)fY |X (Q(x; )|x)1 + R0 (tl ),

(A.19)

2
where by assumption (2.10), |R0 (tl )| maxjm suptR |Bj (t)|Cfmin
|x tl | = O(m1/2 ). Therefore,
>
1
1
>
the sparse vector B(x) Dm ( ) = fY |X (Q(x; )|x) B(x) + aB(x) , where aB(x) Rm is a vector
with the same support as B(x) (only r < nonzero components) and with nonzero components
of order O(m1/2 ). Hence, kaB(x) k = O(m1/2 ). Continued from (A.18), for any x [0, 1],




B(x)> E[BB> ]1 B(x)
2
>
1
> 1
1

kB(x)k B(x) Dm (1 ) E[BB ] Dm (2 ) B(x)
fY |X (Q(x; 1 )|x)fY |X (Q(x; 2 )|x)

1





B(x) aB(x) sup E[BB> ]1 Dm ( )1 + B(x)k1 kaB(x) E[BB> ]1 /fmin

= O(m

).


We observe that B(x)> E[BB> ]1 B(x) does not depend on 1 and 2 and can be treated as a scaling
factor and shifted out of the covariance function as (2.11). Therefore, we finish the proof.

A.3 Proof of Theorem 3.1

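Observe that
\[
\hat\alpha_j(\cdot) - \alpha_j(\cdot) = e_j^\top(\hat\gamma^\dagger(\cdot) - \gamma_n^\dagger(\cdot)),
\]
where $e_j$ denotes the $j$-th unit vector in $\mathbb{R}^{m+k}$ for $j = 1, \dots, m + k$, and
\[
w_n^\top(\hat\beta^\dagger(\tau) - \beta_n^\dagger(\tau)) = (0_k^\top, w_n^\top)(\hat\gamma^\dagger(\cdot) - \gamma_n^\dagger(\cdot)).
\]
Let $h_n^\dagger(w, \tau) = \tilde Z(w)^\top \beta_n^\dagger(\tau)$. The following results will be established at the end of the proof:
\[
\sup_{\tau\in\mathcal{T},\, j=1,\dots,k} \Big| E\Big[ e_j^\top J_m(\tau)^{-1} Z \big( 1\{Y \le Q(X;\tau)\} - 1\{Y \le \alpha(\tau)^\top V + h_n^\dagger(W,\tau)\} \big) \Big] \Big| = o(n^{-1/2}), \tag{A.20}
\]
\[
\sup_{\tau\in\mathcal{T}} \Big| (0_k^\top, w_n^\top/\|w_n\|) J_m(\tau)^{-1} E\Big[ Z \big( 1\{Y \le Q(X;\tau)\} - 1\{Y \le \alpha(\tau)^\top V + h_n^\dagger(W,\tau)\} \big) \Big] \Big| = o(n^{-1/2}). \tag{A.21}
\]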

From Theorem 5.4, we obtain that under Condition (B1)


1

1/2
e>
( ) n ( )) = n1/2 e>
), j = 1, ..., k; (A.22)
j (b
j Jm ( ) Gn ((; n ( ), )) + oP (n

>
>
1

1/2
(0>
( ) n ( )) = n1/2 (0>
)
k , wn )(b
k , wn )Jm ( ) Gn ((; n ( ), )) + oP (n

(A.23)

uniformly in T . Equation (A.20) implies that for j = 1, ..., k


1

>
1
1/2
e>
) = o(n1/2 ).
j Jm ( ) E[(Yi , Zi ; n ( ), )] = ej Jm ( ) E[1{Yi Q(Xi ; )} ] + o(n

Following similar arguments as given in the proof of (2.2) in Section A.1.1, (A.20) and (A.22) imply
that
e>
( )
j (b

n ( ))

1
n1 e>
j Jm ( )

n
X
i=1

Zi (1{Yi Q(Xi ; )} ) + oP (n1/2 ), j = 1, ..., k.

uniformly in T . Similarly, by (A.21) and (A.23) we have


1 >
1
kwn k1 wn> (b ( )n ( )) = n1 (0>
k , kwn k wn )Jm ( )

n
X
i=1

Zi (1{Yi Q(Xi ; )} )+oP (n1/2 ).

Thus, the claim will follow once we prove

G n () := (Gn,1 (), ..., Gn,k (), Gn,h ())


where
1
Gn,j ( ) := n1/2 e>
j Jm ( )

n
X
i=1

G () in (` (T ))k+1

Zi (1{Yi Q(Xi ; )} ),

j = 1, ..., k

and
>
1
Gn,h ( ) := kwn k1 n1/2 (0>
k , wn )Jm ( )

n
X
i=1

Zi (1{Yi Q(Xi ; )} ).

We need to establish tightness and finite dimensional convergence. By Lemma 1.4.3 of van der Vaart and Wellner (1996), it is enough to show the tightness of the $\mathbb{G}_{n,j}$'s and $\mathbb{G}_{n,h}$ individually. Tightness follows from asymptotic equicontinuity, which can be proved by an application of Lemma A.3. More precisely, apply Lemma A.3 with $u_n = e_j$ to prove tightness of $\mathbb{G}_{n,j}(\cdot)$ for $j = 1, \dots, k$, and Lemma A.3 with $u_n^\top = (0_k^\top, w_n^\top)$ to prove tightness of $\mathbb{G}_{n,h}(w_0; \cdot)$. Continuity of the sample paths of $\mathbb{G}_{n,h}(w_0; \cdot)$ follows by the same arguments as given at the beginning of Section A.1.2.
Next, we prove finite-dimensional convergence. Observe the decomposition
(
!
!)
n
e i ))
X

M1 ( )1 (Vi A( )Z(W
0
1/2
G n ( ) = n
1{Yi Q(Xi ; )} +
e i)
kwn k1 w> M2 ( )1 Z(W
i ( )
n

i=1

where


e i )) 1{Yi Q(Xi ; )} .
i ( ) := kwn k1 wn> A( )> M1 ( )1 (Vi A( )Z(W

By definition, we have E[i ( )] = 0 and moreover

e i ))(Vi A( )Z(W
e i ))> ]M1 ( )1 A( )wn .
E[2i ( )] . kwn k2 wn> A( )> M1 ( )1 E[(Vi A( )Z(W

Since fY |X (Q(Xi ; )|X) is bounded away from zero uniformly, it follows that

e i ))(Vi A( )Z(W
e i ))> ]k fmin max (M1 ( )) < ,
kE[(Vi A( )Z(W

by Remark 5.5. Moreover, by Lemma A.2 proven later, kA( )wn k = O(1) uniformly in , and thus
P
by kwn k , sup T E[2i ( )] = o(1). This implies that n1/2 ni=1 i ( ) = oP (1) for every fixed
T . Hence it suffices to prove finite dimensional convergence of
!
n
1 (V A( )Z(W
e
X

M
(
)
))
1
i
i
n1/2
1{Yi Q(Xi ; )} .
e i)
kwn k1 w> M2 ( )1 Z(W
n

i=1


e i )) 1{Yi Q(Xi ; )} ] = 0 and by assumpObserve that E[M1 ( )1 (hV W (Wi ; )A( )Z(W
tions (A1)-(A3), (C1)
e
sup E[kM1 ( )1 (hV W (W ; ) A( )Z(W
))k2 ]

sup
T

1
e
E[kfY |X (Q(X; )|X)(hV W (w; ) A( )Z(W
))k2 ] = o(1).
fmin min (M1 ( ))


P
e i )) 1{Yi Q(Xi ; )} = oP (1) for every
Thus, n1/2 ni=1 M1 ( )1 (hV W (Wi ; ) A( )Z(W
fixed T . So, now we only need to consider finite dimensional convergence of
!
n
n
1 (V h
X
X

M
(
)
(W
;

))
1
i
i
VW
i ( ) := n1/2
1{Yi Q(Xi ; )} .
(A.24)
e i)
kwn k1 w> M2 ( )1 Z(W
i=1

i=1


Note that
E[i (1 )i (2 )> ] = (1 2 1 2 )

11 (1 , 2 ) + o(1)

12 (1 , 2 )

12 (1 , 2 )>

22 (1 , 2 ) + o(1)

e i )]. We shall now show


where 12 (1 , 2 ) := kwn k1 E[M1 (2 )1 (Vi hV W (Wi ; 1 ))wn> M2 (2 )1 Z(W
that 12 (1 , 2 ) = o(1) uniformly in 1 , 2 . Note that from the definition of hV W (W ; ) in (3.4), by
standard argument we can write
hV W (W ; ) = E[f (Q(X; )|X)|W ]1 E[V f (Q(X; )|X)|W ].

(A.25)

From (A3) we obtain E[f (Q(X; )|X)|W ] fmin > 0, and from (C1), (A2) it follows that
|E[V (j) f (Q(X; )|X)|W ]| C f = O(1), i.e. the components of hV W (W ; ) are bounded by a
constant almost surely. Hence,
e i )]k
k12 (1 , 2 )k = kwn k1 kM1 (1 )1 E[(Vi hV W (Wi ; ))wn> M2 (2 )1 Z(W
e i )]k
kwn k1 kM1 (1 )1 kkE[(Vi hV W (Wi ; 1 ))wn> M2 (2 )1 Z(W


e i )|
. kwn k1 E kVi hV W (Wi ; 1 )k|wn> M2 (2 )1 Z(W


e i )|
. kwn k1 E |w> M2 (2 )1 Z(W
n

= o(1),

where the third inequality applies the lower bound for inf T min (M1 ( )) in Remark 5.5; the fourth
(j)
inequality follows from sup1jk, |V (j) | + |hV W (W ; )| < a.s., while the last equality follows by
the assumptions of the Theorem.
Now we prove the finite dimensional convergence (A.24). Taking arbitrary collections {1 , ..., J }
T , c1 , .., cJ Rk+1 , we need to show that
J
X

c>
j G n (j )

j=1

J X
n
X
j=1 i=1

d
c>
j i (j )

J
X

c>
j G (j ).

j=1

P
1/2 . Using the results derived
Define Vi,J = Jj=1 c>
m
j i (j ). Note that E[Vi,J ] = 0 and |Vi,J | . n
above, we have
var(Vi,J ) = o(n

)+n

J
X

(j j 0 j j 0 )c>
j (j , j 0 )cj 0 ,

j,j 0 =1

P
where (j , j 0 ) is defined as (3.11). If Jj,j 0 =1 (j j 0 j j 0 )c>
(j , j 0 )cj 0 = 0, then the distriPJ
PJ j >
P
P
>
bution of j=1 cj G (j ) is a single point mass at 0, and j=1 cj G n (j ) = Jj=1 ni=1 c>
j i (j )
converges to 0 in probability by Markovs inequality.
P
Pn
2
If n1 Jj,j 0 =1 (j j 0 j j 0 )c>
j (j , j 0 )cj 0 > 0 define sn,J =
i=1 var(Vi,J ). We will now verify
that the triangular array of random variables (Vi,J )i=1,...,n satisfies the Lindeberg condition. For
any v > 0 and sufficiently large n, Markovs inequality gives
s2
n,J

n
X
 2



2 2
2 2 2 1 2
E Vi,J
I(Vi,J v) . m
sn,J E I(V1,J v) . m
sn,J v n sn,J ,
i=1


2 n1 = o(1) by (B1). Thus the Lindeberg condition holds and it follows that
where m
Pn
i=1 Vi,J d
N (0, 1).
sn,J

Finally, it remains to prove (A.20) and (A.21). Begin by observing that






e
)) FY |X (Q(X; )|X) FY |X (( )> V + hn (W, )|X)
E (V A( )Z(W




e
e
. E (V A( )Z(W
))fY |X (Q(Xi ; )|X) h(W, ) n ( )> Z(W
)

 2 


e
e
(Xi , )|X) h(W, ) n ( )> Z(W
+ E (V A( )Z(W
))fY0 |X (Q
)




e
e
. E (hV W (W, ) A( )Z(W
))fY |X (Q(Xi ; )|X) h(W, ) n ( )> Z(W
) + c2
n

h
i


e
. cn E hV W (W, ) A( )Z(W
) fY |X (Q(Xi ; )|X) + c2
n
= O(cn n + c2
n ),

(A.26)

(Xi , ) lying on the line


where the first inequality is an application of Taylor expansion, with Q
segment of Q(X; ) and ( )> Vi +hn (Wi , ); the second inequality is the result of the orthogonality
condition (3.5), Condition (A2), Conditions (C1)-(C2); the third inequality follows from Conditions
(A2), (C1), and the last line follows from condition (C1) and the H
older inequality. For a proof of
>
1
>
1
e
(A.20) observe that by (5.10) ej Jm ( ) Z = ej M1 ( ) (V A( )Z(W )) for j = 1, ..., k. Thus we
obtain from Remark 5.5


>
E[ej Jm ( )1 Z 1{Y Q(X; )} 1{Y ( )> V + hn (W, )} ]




1
>

e
= e>
M
(
)
E
(V

A(
)
Z(W
))
F
(Q(X;

)|X)

F
((
)
V
+
h
(W,

)|X)

1
Y
|X
Y
|X
j
n
= O(cn n + c2
n ),

To prove (A.21), without loss of generality, let kwn k = 1. We note that by (5.10)
>
1
>
>
1
>
1 e
e
(0>
k , wn )Jm ( ) Zi = wn A( ) M1 ( ) (Vi A( )Z(Wi )) + wn M2 ( ) Z(Wi ).

From (A.26), (5.11) in Remark 5.5 and Lemma A.2 we obtain





e i )) 1{Yi Q(Xi ; )} 1{Yi ( )> Vi + h (Wi , )}
E wn> A( )> M1 ( )1 (Vi A( )Z(W
n
= O(cn n + c2
n ).

(A.27)

Moreover,



e i ) 1{Yi Q(Xi ; )} 1{Yi ( )> Vi + h (Wi , )}
E wn> M2 ( )1 Z(W
n




e i )fY |X (Q(Xi ; )|X) h(W, ) ( )> Z(W
e
. wn> M2 ( )1 E Z(W
)

n


2 

e i )f 0 (Q
e
(Xi , )|X) h(W, ) n ( )> Z(W
+ wn> M2 ( )1 E Z(W
)

Y |X


2 

e i )f 0 (Q
e
(Xi , )|X) h(W, ) n ( )> Z(W
= E wn> M2 ( )1 Z(W
)

Y |X
 >

e i )| c2 = o(kwn kc2 ),
. E |wn M2 (2 )1 Z(W
n
n

(A.28)

(Xi , ) lying on the line segment


where the first inequality follows from the Taylor expansion with Q
of Q(X; ) and ( )> Vi + hn (Wi , ); the second equality follows from the first order condition of
(3.3), and the last line follows Conditions (A2), (C1) and the conditions of the theorem. Combining
(A.27) and (A.28), and (B1) we obtain (A.21).
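Lemma A.2. Under Assumptions (A1)-(A3), (C2) and $\sup_{\tau\in\mathcal{T}} E[|\tilde Z(W)^\top M_2(\tau)^{-1} w_n|] = o(\|w_n\|)$, we have
\[
\sup_{\tau\in\mathcal{T}} \|A(\tau)w_n\|_2 = o(\|w_n\|). \tag{A.29}
\]
Proof for Lemma A.2. By the first order condition for obtaining $A(\tau)$,
\[
A(\tau) = E[h_{VW}(W;\tau)\tilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)]\, M_2(\tau)^{-1}.
\]
By the orthogonality condition (3.5),
\[
\begin{aligned}
\|A(\tau)w_n\| &= \big\| E[h_{VW}(W;\tau)\tilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)]\, M_2(\tau)^{-1} w_n \big\| \\
&= \big\| E[(h_{VW}(W;\tau) - V + V)\tilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)]\, M_2(\tau)^{-1} w_n \big\| \\
&= \big\| E[V \tilde Z(W)^\top f_{Y|X}(Q(X;\tau)|X)]\, M_2(\tau)^{-1} w_n \big\|.
\end{aligned}
\]
By the assumption that for fixed $j$, $|V_j| \le C$, the uniform boundedness of the conditional density in (A2), and the hypothesis $\sup_{\tau\in\mathcal{T}} E[|\tilde Z(W)^\top M_2(\tau)^{-1} w_n|] = o(\|w_n\|)$,
\[
\sup_{\tau\in\mathcal{T}} E\big[ |V_j \tilde Z(W)^\top M_2(\tau)^{-1} w_n|\, f_{Y|X}(Q(X;\tau)|X) \big] \le \bar f C \sup_{\tau\in\mathcal{T}} E\big[ |\tilde Z(W)^\top M_2(\tau)^{-1} w_n| \big] = o(\|w_n\|). \tag{A.30}
\]
This completes the proof of (A.29) by noting that $\sup_{\tau\in\mathcal{T}} \|M_2(\tau)^{-1}\| = O(1)$.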

A.4 Asymptotic Tightness of Quantile Process

In this section we establish the asymptotic tightness of the process $n^{1/2} u_n^\top U_n(\tau)$ in $\ell^\infty(\mathcal{T})$ with $u_n \in \mathbb{R}^m$ being an arbitrary vector, where
\[
U_n(\tau) = n^{-1} J_m^{-1}(\tau) \sum_{i=1}^n Z_i \big( 1\{Y_i \le Q(X_i;\tau)\} - \tau \big). \tag{A.31}
\]
Note that the results obtained in this section, in particular Lemma A.3, apply to any series expansion $Z = Z(X_i)$ satisfying Assumption (A1).
The following definition is only used in this subsection: For any non-decreasing, convex function $\Psi : \mathbb{R}_+ \to \mathbb{R}_+$ with $\Psi(0) = 0$, the Orlicz norm of a real-valued random variable $Z$ is defined as (see e.g. Chapter 2.2 of van der Vaart and Wellner (1996))
\[
\|Z\|_\Psi = \inf\big\{ C > 0 : E\Psi(|Z|/C) \le 1 \big\}. \tag{A.32}
\]
Lemma A.3 (Asymptotic Equicontinuity of Quantile Process). Under (A1)-(A3) and $\xi_m^2(\log n)^2 = o(n)$, we have for any $\varepsilon > 0$ and vector $u_n \in \mathbb{R}^m$,
\[
\lim_{\delta\to 0} \limsup_{n\to\infty} P\Big( \|u_n\|^{-1} n^{1/2} \sup_{\tau_1,\tau_2\in\mathcal{T},\, |\tau_1-\tau_2|\le\delta} \big| u_n^\top U_n(\tau_1) - u_n^\top U_n(\tau_2) \big| > \varepsilon \Big) = 0, \tag{A.33}
\]
where $U_n(\tau)$ is defined in (A.31).
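
For the power function $z \mapsto z^{2L}$ used in the proof below, the Orlicz norm in (A.32) reduces to the usual $L_{2L}$ norm $(E|Z|^{2L})^{1/(2L)}$. The following Python sketch checks this on an empirical distribution by locating the infimum in (A.32) with bisection. The function name `orlicz_norm`, the search bracket and the five data points are assumptions made for illustration.

```python
def orlicz_norm(sample, Psi, hi=1e6, iters=200):
    """Empirical Orlicz norm inf{C > 0 : mean(Psi(|z| / C)) <= 1},
    located by bisection (Psi is non-decreasing, so the mean is
    non-increasing in C)."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(Psi(abs(z) / mid) for z in sample) / len(sample) <= 1.0:
            hi = mid          # constraint satisfied: try a smaller C
        else:
            lo = mid          # constraint violated: C must be larger
    return hi

L = 3
sample = [-2.0, -0.5, 0.1, 1.0, 1.5]        # hypothetical data
norm_bisect = orlicz_norm(sample, lambda z: z ** (2 * L))
norm_closed = (sum(abs(z) ** (2 * L) for z in sample) / len(sample)) ** (1 / (2 * L))
print(norm_bisect, norm_closed)
```

This is why a moment bound of order $2L$ on the increments of the process translates directly into an Orlicz-norm bound suitable for chaining.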


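Proof of Lemma A.3. Without loss of generality, we will assume that $u_n$ is a sequence of vectors with $\|u_n\| = 1$, which can always be achieved by rescaling. Define
\[
\mathbb{G}_n(\tau) := n^{1/2} u_n^\top U_n(\tau).
\]
Consider the decomposition
\[
u_n^\top U_n(\tau_1) - u_n^\top U_n(\tau_2) = n^{-1} u_n^\top \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) \sum_i Z_i \big(1\{Y_i \le Q(X_i;\tau_1)\} - \tau_1\big) + n^{-1} u_n^\top J_m^{-1}(\tau_2) \sum_i Z_i \Delta_i(\tau_1, \tau_2),
\]
where
\[
\Delta_i(\tau_1, \tau_2) := 1\{Y_i \le Q(X_i;\tau_1)\} - \tau_1 - \big(1\{Y_i \le Q(X_i;\tau_2)\} - \tau_2\big).
\]
Note that for any $L \ge 2$,
\[
\begin{aligned}
E\big[ |u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2)|^L \big] &\lesssim \xi_m^{L-2} E\big[ |u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2)|^2 \big] \\
&= \xi_m^{L-2} u_n^\top J_m^{-1}(\tau_2) E\big[ Z_i Z_i^\top \Delta_i(\tau_1,\tau_2)^2 \big] J_m^{-1}(\tau_2) u_n \\
&\lesssim \xi_m^{L-2} |\tau_1 - \tau_2|.
\end{aligned} \tag{A.34}
\]
By the Lipschitz continuity of $\tau \mapsto J_m^{-1}(\tau)$ (cf. Lemma 13 of Belloni et al. (2011)) and positive definiteness of $J_m^{-1}(\tau)$, we have
\[
\|J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\| = \big\| J_m^{-1}(\tau_2) \big(J_m(\tau_1) - J_m(\tau_2)\big) J_m^{-1}(\tau_1) \big\| \le \frac{\overline{f'}}{f_{\min}} |\tau_1 - \tau_2| \big[ \inf_{\tau\in\mathcal{T}} \lambda_{\min}(J_m(\tau)) \big]^{-2} \lambda_{\max}(E[ZZ^\top]),
\]
where $\|\cdot\|$ denotes the operator norm of a matrix. Thus, we have for $L \ge 2$,
\[
\begin{aligned}
E\Big[ \big| u_n^\top \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) Z_i \big(1(Y_i \le Q(X_i;\tau_1)) - \tau_1\big) \big|^L \Big]
&\lesssim \xi_m^{L-2} E\Big[ \big| u_n^\top \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) Z_i \big(1(Y_i \le Q(X_i;\tau_1)) - \tau_1\big) \big|^2 \Big] \\
&\le \xi_m^{L-2} E\Big[ \big| u_n^\top \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) Z_i \big|^2 \Big] \\
&= \xi_m^{L-2} u_n^\top \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) E[Z_i Z_i^\top] \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) u_n \\
&\le \xi_m^{L-2} \big\| \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) E[Z_i Z_i^\top] \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) \big\| \\
&\lesssim \xi_m^{L-2} |\tau_1 - \tau_2|^2.
\end{aligned} \tag{A.35}
\]
To simplify notation, define
\[
\tilde V_{n,i}(\tau_1, \tau_2) := u_n^\top \big(J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\big) Z_i \big(1(Y_i \le Q(X_i;\tau_1)) - \tau_1\big) + u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1, \tau_2).
\]
Combining the bounds (A.34) and (A.35) yields
\[
\begin{aligned}
\big( E|\tilde V_{n,i}(\tau_1,\tau_2)|^{2L} \big)^{1/2L}
&\le \big( E|u_n^\top J_m^{-1}(\tau_2) Z_i \Delta_i(\tau_1,\tau_2)|^{2L} \big)^{1/2L} + \big( E|u_n^\top (J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)) Z_i (1(Y_i \le Q(X_i;\tau_1)) - \tau_1)|^{2L} \big)^{1/2L} \\
&\lesssim \big( \xi_m^{2(L-1)} |\tau_1 - \tau_2| \big)^{1/2L}.
\end{aligned} \tag{A.36}
\]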

Note that (A.36) holds for all positive integers L 1. By the fact that Gn (1 ) Gn (2 ) =
P
n1/2 ni=1 Ven,i (1 , 2 ) and EVen,i (1 , 2 ) = 0, we obtain from (A.36) that
E[|Gn (1 ) Gn (2 )|2L ]
 X
2L 
n
= nL E
Ven,i (1 , 2 )
i=1

= nL

X
n
i=1

X

 L1
E Ven,i (1 , 2 )2L +
l=1

l1 +l2 <L 1i1 ,i2 ,i3 n


l1 =1,l2 =1
i1 6=i2 6=i3

+ ... +

1i1 ,...,iL n
i1 6=...6=iL

1i1 ,i2 n
i1 6=i2


 

E Ven,i1 (1 , 2 )2L2l E Ven,i2 (1 , 2 )2l


 
 

E Ven,i1 (1 , 2 )2L2(l1 +l2 ) E Ven,i2 (1 , 2 )2l1 E Ven,i3 (1 , 2 )2l2


L
Y


2
e
E Vn,ij (1 , 2 )

j=1


 
 
 

n 2(L11)
n 2(L12)
n
2(L1)
CL nL nm
|1 2 | +
m
|1 2 |2 +
m
|1 2 |3 + ... +
|1 2 |L
2
3
L
.

L1
X
k=0

2(Lk1)

m
|1 2 |k+1 .
n(Lk1)

2 /n,
In particular we obtain for |1 2 | m

E[|Gn (1 ) Gn (2 )|2L ] . |1 2 |L .

(A.37)

For (z) = z 2L , the above equation implies that the Orlicz norm (defined in (A.32)) of Gn (1 )
G(2 ) satisfies
kGn (1 ) G(2 )k . |1 2 |1/2 .
p
Let d(, 0 ) = | 0 |, which is a metric on T . The packing number D(, d) of T with respect

to d satisfies D(, d) . 1/2 . Let


n = 2m / n 0 as n . We have
Z

n /2

(D(, d))d .

1/L

n /2

(
n /2)1L
1L
d =

1 L1
1 L1

(A.38)

For > 0,
1

(D (, d)) .

1
4

= 2/L .

(A.39)

Therefore, applying Lemma S.2.1 yields for any > 0


sup
|1 2 |

|Gn (1 ) Gn (2 )| =

sup
|1 2 |1/2 1/2

S1,n () + 2

|Gn (1 ) Gn (2 )|
sup

| 0 |1/2
n , Te


|Gn ( 0 ) Gn ( )|,

(A.40)

where Te T has at most D(


n , d) .
n2 points and S1,n () is a random variable that satisfies
 h Z
P (|S1,n ()| > z) z 8K

D(, d) d + (

n /2

 1L1
1L1

(
n /2)1L
1L1

1/2

+ 2
n )

+ ( 1/2 + 2
n ) 2/L

2L

D (, d)

i1

2L

(A.41)

for a constant K > 0. Let = and L = 6. As n , >


n . We obtain lim0 lim supn P (|S1,n ()| >
z) = 0 for any z > 0.
To bound the remaining term in (A.40), observe that
sup
d(, 0 )
n , Te

|Gn ( 0 ) Gn ( )| =

sup
2 , T
e
| 0 |
n

Now by Lemma A.4 we have



P
sup

2 ,, 0 T
| 0 |
n

|Gn ( 0 ) Gn ( )|

sup
2 ,, 0 T
| 0 |
n

|Gn ( 0 ) Gn ( )|.


|Gn ( 0 ) Gn ( )| > rn (n ) < en ,

where


m
m
m
1/2 m
1/2
rn (n ) := C
n log
+ log
+ n
n + n ,

n
n
n

(A.42)

for a sufficiently large constant C > 0. Take n = log n. Since


n (log n)1/2 = 2m (log n)1/2 / n =

o(1) and m log(n)/ n = o(1) by assumption, it follows that rn (log n) 0. Therefore, we conclude
from Lemma A.4 that
sup
d(, 0 )
n , Te



Gn ( 0 ) Gn ( ) = oP (1).

(A.43)

Applying bounds (A.41) and (A.43) to (A.40) verifies the asymptotic equicontinuity


lim lim sup P
sup |Gn (1 ) Gn (2 )| > z = 0
0 n

|1 2 |

for all z > 0.


The following result is applied in the proof of Lemma A.3.
Lemma A.4. Under (A1)-(A3), we have for any n > 0, 1/n  < 1,


>
P sup
sup
|u>
U
(
+
h)

u
U
(
)|

Cr
(,

)
3en ,
n
n
n n
n n
0h [,1h]

where n > 0, Un ( ) is defined in (A.31) and un Rm is arbitrary, and




m 1/2 kun km
m
n 1/2 kun km
log
+
log + kun k
+
n .
rn (, n ) = kun k
n
n
n
n


(A.44)

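To prove Lemma A.4, we need to establish some preliminary results. For any fixed vector $u \in \mathbb{R}^m$ and $\delta > 0$, define the function classes
\begin{align*}
\mathcal{G}_3(u) &:= \big\{ (Z, Y) \mapsto u^\top J_m(\tau)^{-1} Z\, 1\{\|Z\| \le \xi_m\} \,\big|\, \tau \in \mathcal{T} \big\}, \\
\mathcal{G}_4 &:= \big\{ (X, Y) \mapsto 1\{Y \le Q(X;\tau)\} - \tau \,\big|\, \tau \in \mathcal{T} \big\}, \\
\mathcal{G}_6(u, \delta) &:= \big\{ (Z, Y) \mapsto u^\top \{J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1}\} Z\, 1\{\|Z\| \le \xi_m\} \,\big|\, \tau_1, \tau_2 \in \mathcal{T},\ |\tau_1 - \tau_2| \le \delta \big\}, \\
\mathcal{G}_7(\delta) &:= \big\{ (X, Y) \mapsto 1\{Y \le Q(X,\tau_1)\} - 1\{Y \le Q(X,\tau_2)\} - (\tau_1 - \tau_2) \,\big|\, \tau_1, \tau_2 \in \mathcal{T},\ |\tau_1 - \tau_2| \le \delta \big\}.
\end{align*}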

Denote by $G_3$, $G_6$ and $G_7$ the envelope functions of $\mathcal{G}_3$, $\mathcal{G}_6$ and $\mathcal{G}_7$, respectively. The following covering number results will be shown in Section S.2.2: for any probability measure $Q$,
C0
,

 2
C0
N (kG6 kL2 (Q) , G6 (u, ), L2 (Q)) 2
,

 2
A7
,
N (kG7 kL2 (Q) , G7 (), L2 (Q))

N (kG3 kL2 (Q) , G3 (u), L2 (Q))

(A.45)
(A.46)
(A.47)

>

max (E[ZZ ])
f
where C0 := fmin
inf T min (Jm ( )) < given Assumptions (A1)-(A3), and A7 > 0 is a constant.
Also, G4 has VC index 2 according to Lemma S.2.4.
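Proof of Lemma A.4. Observe the decomposition
\[
u_n^\top U_n(\tau_1) - u_n^\top U_n(\tau_2) = I_1(\tau_1, \tau_2) + I_2(\tau_1, \tau_2),
\]
where
\begin{align*}
I_1(\tau_1, \tau_2) &:= n^{-1} u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) \sum_{i=1}^n Z_i \big( 1\{Y_i \le Q(X_i, \tau_1)\} - \tau_1 \big), \\
I_2(\tau_1, \tau_2) &:= n^{-1} u_n^\top J_m^{-1}(\tau_2) \sum_{i=1}^n Z_i \big( 1\{Y_i \le Q(X_i, \tau_1)\} - 1\{Y_i \le Q(X_i, \tau_2)\} - (\tau_1 - \tau_2) \big).
\end{align*}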

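Step 1: bounding $I_1(\tau_1, \tau_2)$.
Note that $\sup_{\tau_1, \tau_2 \in \mathcal{T}, |\tau_1 - \tau_2| < \delta} |I_1(\tau_1, \tau_2)| \le \|P_n - P\|_{\mathcal{G}_6(u_n,\delta) \cdot \mathcal{G}_4}$, where
\[
\mathcal{G}_6(u_n, \delta) \cdot \mathcal{G}_4 = \Big\{ u_n^\top \big( J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2) \big) Z_i \big( 1\{Y_i \le Q(X_i, \tau_3)\} - \tau_3 \big) \,\Big|\, \tau_1, \tau_2, \tau_3 \in \mathcal{T},\ |\tau_1 - \tau_2| \le \delta \Big\}.
\]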
Theorem 2.6.7 of van der Vaart and Wellner (1996) and Part 1 of Lemma S.2.4 give
N (kG4 kL2 (Pn ) , G4 , L2 (Pn ))

A4


where the envelope for G4 is G4 = 2 and A4 is a universal constant. Part 2 of Lemma S.2.4 and
Part 2 of Lemma S.2.2 imply that
2A4
N (kG6 G4 kL2 (Pn ) , G6 (un , ) G4 , L2 (Pn ))


2C0


2

1/3

2/3 3

2A4 C0


(A.48)

1/3

2/3

where 2A4 C0
note that

C for a large enough universal constant C > 0. To bound supf G6 (un ,)G4 kf kL2 (P ) ,


2
1
E u>
Jm (2 )1 Zi 1{Yi Q(Xi , 3 )} 3
n Jm (1 )

4kun k2 max (E[ZZ> ])[ inf min (Jm ( ))]2 C02 2 Ckun k2 2 ,
T

for a large enough constant C. In addition, an upper bound for the functions in G6 (un , ) G4 is
2m kun k[ inf min (Jm ( ))]1 C0 Cm kun k,
T

and we can take this upper bound as envelope.


Applying the bounds (S.2.2) and (S.2.3) and taking into account (A.48), for any un and > 0,
i
h
 log( ) 1/2 ku k
n m
m
+
log(m ) .
EkPn P kG6 (un ,)G4 c1 kun k
n
n

(A.49)

Finally, for any n > 0, let




1/2 ku k
 1/2
1
kun km
n
n m
e
log(m )
log(m ) +
n
+
kun k +
rn,1 (, n ) = C kun k
n
n
n
n

e > 0. From this, we obtain


for a sufficiently large constant C




P
sup
|I1 (1 , 2 )| rn,1 (, n ) P kPn P kG6 (un ,)G4 rn,1 (, n ) en .
1 ,2 T ,|1 2 |<

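Step 2: bounding $I_2(\tau_1, \tau_2)$.
Note that $\sup_{\tau_1, \tau_2 \in \mathcal{T}, |\tau_1 - \tau_2| < \delta} |I_2(\tau_1, \tau_2)| \le \|P_n - P\|_{\mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta)}$, where
\[
\mathcal{G}_3(u_n) \cdot \mathcal{G}_7(\delta) = \Big\{ u_n^\top J_m^{-1}(\tau_3) Z_i \big( 1\{Y_i \le Q(X_i, \tau_1)\} - 1\{Y_i \le Q(X_i, \tau_2)\} - (\tau_1 - \tau_2) \big) \,\Big|\, \tau_1, \tau_2, \tau_3 \in \mathcal{T},\ |\tau_1 - \tau_2| \le \delta \Big\}.
\]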

Lemma S.2.3, Part 2 of Lemma S.2.4 and Part 2 of Lemma S.2.2 imply that
2C0
N (kG3 G7 kL2 (Pn ) , G3 (un ) G7 (), L2 (Pn ))

1/3

2/3

where 2C0 A7
that

2A7


2

1/3

2/3 3

2C0 A7


(A.50)

C for a large enough constant C > 0. To bound supf G3 (un )G7 () kf kL2 (P ) note

2
1
E u>
n Jm (3 ) Zi 1{Yi Q(Xi , 1 )} 1{Yi Q(Xi , 2 )} (1 2 )
3kun k2 max (E[ZZ> ])[ inf min (Jm ( ))]2 Ckun k2 .
T

Moreover
sup
f G3 (un )G7 ()

sup kf k 2kun km [ inf min (Jm ( ))]1 Ckun km


T


for some constant C. Applying the bounds (S.2.2) and (S.2.3) and taking into account (A.50)
h

m 1/2 kun km
m i
log
log .
EkPn P kG3 (un )G7 () c1 kun k
+
n
n

(A.51)

For any n > 0, let







n 1/2 kun km
m 1/2 kun km
m
+
+
log
log + kun k
n
rn,2 (, n ) = C kun k
n
n
n
n

for a constant C > 0 sufficiently large, we obtain






P
sup
|I2 (1 , 2 )| rn,2 (, n ) P kPn P kG3 (un )G7 () rn,2 (, n ) en .
1 ,2 T ,|1 2 |<

Finally, $r_{n,1}(\delta, \kappa_n) \le r_{n,2}(\delta, \kappa_n)$ when $\delta < 1$. Hence, we conclude (A.44).

A.5

Proof of Theorem 4.1

As the argument x0 in Q(x0 ; ) and FY |X (y|x0 ) is fixed, simplify notations by writing Q(x0 ; ) =
b 0 ; ) = Q(
b ) and FY |X (y|x0 ) = F (y), FbY |X (y|x0 ) = Fb(y) as functions of the single arguQ( ), Q(x
ments in and y, respectively. From Theorems 2.4, 3.1 or Corollary 2.2, we have

b Q()
an Q()

G() in ` ([L , U ]),

(A.52)

where an and G depend on the model for Q(x; ) and G has continuous sample paths almost surely.
Next, note that for y Y


b
an Fb(y) F (y) = an (Q)(y)
(Q)(y) .

R1
Finally, observe that (f )(y) = L + (U L )( R)(f )(y) where (f )(y) := 0 1{f (u) < y}du
and R(f )(y) := f (L + y(U L )). The map R : ` ((L , U )) ` ((0, 1)) is linear and
continuous, hence compactly differentiable with derivative R. The map is compactly differentiable tangentially to C(0, 1) at any strictly increasing, differentiable function f0 and the
derivative of at f0 is given by df0 (h)(y) = h(f01 (y))/f00 (f01 (y)) - see Corollary 1 in Chernozhukov et al. (2010). Hence the map R is compactly differentiable at any strictly increasing function f0 ` ((L , U )) tangentially to C(L , U ). Combining this with the representation
(f )(y) = L + (U L )( R)(f )(y) it follows that is compactly differentiable at any strictly
increasing function f0 ` ((L , U )) with derivative df0 (h)(y) = f00 (f01 (y))h(f01 (y)). Thus

weak convergence of an Fb(y) F (y) follows from the functional delta method.
Next, observe that (f ) = (f ) where (f )( ) = inf{y : f (y) } denotes the generalized
inverse. Compact differentiability of at differentiable, strictly increasing functions f0 tangentially
to the space of continuous functions is established in Lemma 3.9.23 of van der Vaart and Wellner
(1996), and the derivative of at f0 is given by df0 (h)(y) = h(f01 (y))/f00 (f01 (y)). By the chain
rule for Hadamard derivatives this implies compact differentiability of tangentially to C(L , U ).
Thus the second weak convergence result again follows by the functional delta method.
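The two maps in this proof, from a quantile curve to a distribution function and back through the generalized inverse, have a simple discrete analogue. The sketch below is an illustration under assumptions that are not in the paper: the quantile curve is evaluated on an equispaced grid over the quantile range, the integral is replaced by a Riemann sum, and all function names and the example values are hypothetical. On such a grid, the generalized inverse of the resulting step function returns sorted values of the original curve, which is why the rearranged curve cannot cross.

```python
def cdf_from_quantile_curve(q_values, tau_lo, tau_hi, y):
    # Riemann-sum version of  tau_lo + int_{tau_lo}^{tau_hi} 1{Q(tau) < y} dtau
    # for a curve Q evaluated on an equispaced grid of quantile levels.
    n = len(q_values)
    below = sum(1 for q in q_values if q < y)
    return tau_lo + (tau_hi - tau_lo) * below / n

def rearranged_quantile(q_values, tau_lo, tau_hi, tau):
    # Generalized inverse inf{y : F(y) >= tau} of the step function above.
    # Just to the right of the k-th smallest value, F equals
    # tau_lo + (tau_hi - tau_lo) * k / n, so we return the first sorted value
    # at which this level reaches tau (small tolerance for rounding).
    ys = sorted(q_values)
    n = len(ys)
    for k, y in enumerate(ys, start=1):
        if tau_lo + (tau_hi - tau_lo) * k / n >= tau - 1e-12:
            return y
    return ys[-1]

# A hypothetical estimated curve with crossings (not monotone in tau).
crossing_curve = [1.0, 1.4, 1.2, 1.9, 1.7, 2.5]
levels = [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
monotone = [rearranged_quantile(crossing_curve, 0.2, 0.8, t) for t in levels]
print(monotone)
```

This only mirrors the construction of non-crossing curves; the asymptotic statements of Theorem 4.1 concern the continuous maps and their Hadamard derivatives.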


APPENDIX B: Technical Remarks on Estimation Bias


Remark B.1. In this remark we show the bound e
cn = o(mbc ) for univariate spline models
0

discussed in Example 2.3, as well as cn = O(mbc/k ) for partial linear model in Section 3. We
first show the latter below. We note that
0
e corresponds to a tensor product
Assume that W = [0, 1]k , that h(; ) c (W, T ) and that Z
0
1/k
B-spline basis of order q on W with m
equidistant knots in each coordinate. Moreover, assume
that (V, W ) has a density fV,W such that 0 < inf v,w fV,W (v, w) supv,w fV,W (v, w) < . We shall
0
show that in this case cn = O(mbc/k ) where cn is defined in Assumption (C1). Define
Z
Z
2
>
e
n,g ( ) := argmin
Z(w) h(w; )
fY |X (Q(v, w; )|(v, w))fV,W (v, w)dvdw.
(B.1)
Rm

>
e
Note that w 7 Z(w)
n,g ( ) can be viewed as a projection of a function g : W R onto the
> b : b Rm }, with respect to the inner product hg , g i =
e
spline space Bm (W) := {w 7 Z(w)
1 2

R
R
g1 (w)g2 (w)d(w), where d(w) := v fY |X (Q(v, w; )|v, w)fV,W (v, w)dv dw.
We first apply Theorem A.1 on p.1630 of Huang (2003). To do so, we need to verify Condition
A.1-A.3 of Huang (2003). Condition A.1 can be verified by invoking (A2)-(A3) in our paper and
using the bounds on fW . The choice of basis functions and knots ensures that Conditions A.2 and
A.3 hold (see the discussion on p.1630 of Huang (2003)). Thus, Theorem A.1 on p.1630 of Huang
(2003) implies there exists a constant C independent of n such that for any function on W,




sup e
Z(w)> n,g ( ) C sup g(w) .
wW

wW

Recall that W is a compact subset of Rd and h(w; ) c (W, T ). Since Bm (W) is a finite
dimensional vector space of functions, by a compactness argument there exists g (; ) Bm (W)
such that supwW |h(w; ) g (w; )| = inf gBm (W) supwW |h(w; ) g(w)| for each fixed . With
m > , the inequality in the proof for Theorem 12.8 in Schumaker (1981), with their mi being
0
our and i  m1/k yields


e
cn = sup e
Z(w)> n,h(w; ) ( ) h(w; )
w,



= sup e
Z(w)> n,h(w; ) ( ) g (w; ) + g (w; ) h(w; )
w,





sup e
Z(w)> n,h(w; )g (w; ) ( ) + sup g (w; ) h(w; )
w,

w,

(C + 1) sup

inf



sup h(w; ) g(w)

T gB(W) x

. mbc/k k 0 max sup sup |Dj h(w; )|,


|j| T wW

where bc is the greatest integer less than , max|j| sup T supxW |Dj h(w; )| = O(1) by the
assumption that h(w; ) c (W, T ) and fixed k 0 . An extension of Theorem 12.8 of Schumaker
(1981) to Besov spaces (see Example 6.29 of Schumaker (1981) or p.5573 of Chen (2007)) in similar
0
manner as Theorem 6.31 of Schumaker (1981) could refine the rate to e
cn . m/k , but we do not
pursue this direction here.

Next we show the bound e


cn = o(mbc ) in the setting of Example 2.3. Assume the density
fX (x) of X exists and 0 < inf xX fX (x) supxX fX (x) < . Define the measure (u) by
d(u) = f (Q(u; )|u)fX (u)du. Thus, x 7 B(x)> n,g ( ) with n,g defined similarly to (B.1) is now
viewed as a projection of a function g : X R onto the space B(X ) with respect to the inner
R
product hg1 , g2 i = g1 (u)g2 (u)d(u). The remaining proof is similar to the partial linear model,
with h(w; ) being replaced by Q(x; ), and we omit the details.

References
Angrist, J., Chernozhukov, V., and Fernández-Val, I. (2006). Quantile regression under misspecification, with an application to the U.S. wage structure. Econometrica, 74(2):539–563.
Bassett, Jr., G. and Koenker, R. (1982). An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association, 77(378):407–415.
Belloni, A., Chernozhukov, V., and Fernández-Val, I. (2011). Conditional quantile processes based on series or many regressors. arXiv preprint arXiv:1105.6154.
Briollais, L. and Durrieu, G. (2014). Application of quantile regression to recent genetic and -omic studies. Human Genetics, 133:951–966.
Cade, B. S. and Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1(8):412–420.
Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In Heckman, J. J. and Leamer, E., editors, Handbook of Econometrics, chapter 76. North-Holland.
Cheng, G. and Shang, Z. (2015). Joint asymptotics for semi-nonparametric regression models with partially linear structure. Annals of Statistics, 43(3):1351–1390.
Cheng, G., Zhou, L., and Huang, J. Z. (2014). Efficient semiparametric estimation in generalized partially linear additive models for longitudinal/clustered data. Bernoulli, 20(1):141–163.
Chernozhukov, V., Fernández-Val, I., and Galichon, A. (2010). Quantile and probability curves without crossing. Econometrica, 78(3):1093–1125.
Delgado, M. A. and Escanciano, J. C. (2013). Conditional stochastic dominance testing. Journal of Business & Economic Statistics, 31(1):16–28.
Demko, S., Moss, W. F., and Smith, P. W. (1984). Decay rates for inverses of band matrices. Mathematics of Computation, 43(168):491–499.
Dette, H. and Volgushev, S. (2008). Non-crossing non-parametric estimates of quantile curves. Journal of the Royal Statistical Society: Series B, 70(3):609–627.
He, X. and Shi, P. (1994). Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3(3-4):299–308.

He, X. and Shi, P. (1996). Bivariate tensor-product B-splines in a partly linear model. Journal of Multivariate Analysis, 58(2):162–181.
Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. Annals of Statistics, 31(5):1600–1635.
Kley, T., Volgushev, S., Dette, H., and Hallin, M. (2015+). Quantile spectral processes: Asymptotic analysis and inference. Bernoulli.
Knight, K. (2008). Asymptotics of the regression quantile basic solution under misspecification. Applications of Mathematics, 53(3):223–234.
Koenker, R. (2005). Quantile Regression. Cambridge University Press, New York.
Koenker, R. and Bassett, Jr., G. (1978). Regression quantiles. Econometrica, 46(1):33–50.
Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of Economic Perspectives, 15(4):143–156.
Koenker, R. and Xiao, Z. (2002). Inference on the quantile regression process. Econometrica, 70(4):1583–1612.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656.
Lee, S. (2003). Efficient semiparametric estimation of a partially linear quantile regression model. Econometric Theory, 19:1–31.
Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Annals of Probability, pages 863–884.
Puntanen, S. and Styan, G. P. H. (2005). Schur complements in statistics and probability. In Zhang, F., editor, The Schur Complement and Its Applications, volume 4 of Numerical Methods and Algorithms. Springer.
Qu, Z. and Yoon, J. (2015). Nonparametric estimation and inference on conditional quantile processes. Journal of Econometrics, 185:1–19.
Schumaker, L. (1981). Spline Functions: Basic Theory. Wiley, New York.
Shang, Z. and Cheng, G. (2015). A Bayesian splitotic theory for nonparametric models. arXiv preprint arXiv:1508.04175.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
Volgushev, S. (2013). Smoothed quantile regression processes for binary response models. arXiv preprint arXiv:1302.5644.


Volgushev, S., Chao, S.-K., and Cheng, G. (2016). Data compression for extraordinarily large data: A quantile regression approach. Technical report, Purdue University.
Zhao, T., Cheng, G., and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Annals of Statistics.
Zhou, S., Shen, X., and Wolfe, D. (1998). Local asymptotics for regression splines and confidence regions. Annals of Statistics, 26(5):1760–1782.


SUPPLEMENTARY MATERIAL: QUANTILE PROCESS FOR SEMI-NONPARAMETRIC REGRESSION MODELS
In this supplemental material, we provide the auxiliary proofs needed in the appendices. Section
S.1 develops the technicalities for Bahadur representations. Section S.2 presents some empirical
process results and computes the covering numbers of some function classes encountered in the
proofs of asymptotic tightness of the quantile process.

S.1: Proofs for Bahadur Representations


S.1.1

Proof of Theorem 5.1

Some rearranging of terms yields


b ( ), ) = n1/2 Gn ((;
b ( ), )) n1/2 Gn ((; n ( ), ))
Pn (;
+Jem ( )(b
( ) n ( )) + n1/2 Gn ((; n ( ), ))


+(n ( ), ) + (b
( ), ) (n ( ), ) Jem ( )(b
( ) n ( )) .

In other words

b ( ) n ( ) = n1/2 Jm ( )1 Gn ((; n ( ), )) + rn,1 ( ) + rn,2 ( ) + rn,3 ( ) + rn,4 ( ) (S.1.1)

where

b ( ), ),
rn,1 ( ) := Jem ( )1 Pn (;


( ), ) (n ( ), ) Jem ( )(b
( ) n ( )) ,
rn,2 ( ) := Jem ( )1 (b


b ( ), )) Gn ((; n ( ), )) ,
rn,3 ( ) := n1/2 Jem ( )1 Gn ((;

rn,4 ( ) := n1/2 (Jm ( )1 Jem ( )1 )Gn ((; n ( ), )) Jem ( )1 (n ( ), )).

The remaining proof consists in bounding the individual remainder terms.

The bound on rn,1 follows from results on duality theory for convex optimization, see Lemma
26 on page 66 in Belloni et al. (2011) for a proof.
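To bound $r_{n,2}$ and $r_{n,3}$, define the class of functions
\[
\mathcal{G}_1 := \big\{ (Z, Y) \mapsto a^\top Z \big( 1\{Y \le Z^\top b\} - \tau \big) 1\{\|Z\| \le \xi_m\} \,\big|\, \tau \in \mathcal{T},\ b \in \mathbb{R}^m,\ a \in S^{m-1} \big\}. \tag{S.1.2}
\]
Moreover, let
\[
s_{n,1} := \|P_n - P\|_{\mathcal{G}_1}.
\]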


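Observe that by Lemma S.1.2 with $t = 2$ we have
\[
\Omega_{1,n} := \Big\{ \sup_{\tau \in \mathcal{T}} \|\hat\beta(\tau) - \beta_n(\tau)\| \le \frac{4(s_{n,1} + g_n)}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde J_m(\tau))} \Big\}
\supseteq \Big\{ s_{n,1} + g_n < \frac{\inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde J_m(\tau))}{8 \xi_m \overline{f'} \lambda_{\max}(E[ZZ^\top])} \Big\} =: \Omega_{2,n}.
\]
Define the event
\[
\Omega_{3,n} := \Big\{ s_{n,1} \le C \Big[ \Big( \frac{m}{n} \log n \Big)^{1/2} + \frac{m \xi_m}{n} \log n + \Big( \frac{\kappa_n}{n} \Big)^{1/2} + \frac{\xi_m \kappa_n}{n} \Big] \Big\}.
\]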
n
n
n
n

Now it follows from Lemma S.1.3 that P (3,n ) 1 en [note that m = O(nb ) yields log m =
2 log n = o(n), = O(nb ), g = o(1) implies that for
O(log n)]. Moreover, the assumption mm
m
m n
2
n  n/m and large enough n,
C

h m
n

log n

1/2

 1/2 i
inf T 2min (Jem ( ))
mm
n
m n
log n +
+
+ gn
n
n
n
8m f 0 max (E[ZZ> ])

(S.1.3)

for n large enough. Thus, for all n for which (S.1.3) holds, 3,n 2,n 1,n . From this we obtain
that on 3,n , for a constant C2 which is independent of n, we have
sup kb
( ) n ( )k C2

h m
n

log n

1/2

 1/2
i
mm
n
m n
log n +
+
+ gn .
n
n
n

In particular, for all n for which (S.1.3) holds,


1/2 m
 1/2
i

h m
n
m
m n
log n
log n+
+gn 1en . (S.1.4)
+
+
P sup kb
( )n ( )k C2
n
n
n
n
T
The bound on rn,2 is now a direct consequence of Lemma S.1.1 and the fact that n1 mm log n =
1/2
o((n1 m log n)1/2 ) and m n n1 = o(n n1/2 ). The bound on rn,4 follows once we observe that
for any a S m1
|a> (Jm ( ) Jem ( ))a| f 0 cn max (E[ZZ> ]).

(S.1.5)

Together with the identity A1 B 1 = B 1 (B A)A1 this implies that for sufficiently large n
we have sup T krn,4 ( )k C1 (cn sn,1 + gn ) for a constant C1 which does not depend on n.
Thus, it remains to bound rn,3 . Observe that on the set {sup T kb
( ) n ( )k } we have
the bound
1
sup krn,3 ( )k
kPn P kG2 () ,
T
inf T min (Jem ( ))

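where the class of functions $\mathcal{G}_2(\delta)$ is defined as follows
\[
\mathcal{G}_2(\delta) := \big\{ (Z, Y) \mapsto a^\top Z \big( 1\{Y \le Z^\top b_1\} - 1\{Y \le Z^\top b_2\} \big) 1\{\|Z\| \le \xi_m\} \,\big|\, b_1, b_2 \in \mathbb{R}^m,\ \|b_1 - b_2\| \le \delta,\ a \in S^{m-1} \big\}. \tag{S.1.6}
\]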

It thus follows that for any , > 0






P sup krn,3 ( )k P sup kb
( ) n ( )k + P
T


kPn P kG2 ()
.
inf T min (Jem ( ))

Letting := C((n1 m log n)1/2 + (n /n)1/2 + gn ) and := Cn (n , n ) (here, n is defined in


(S.1.11) in the statement of Lemma S.1.3) with a suitable constant C, the bounds in Lemma S.1.3
and (S.1.4) yield the desired bound.

S.1.1.1

Technical details for the proof of Theorem 5.1

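Lemma S.1.1. Under assumptions (A1)-(A3) we have for any $\delta > 0$,
\[
\sup_{\tau \in \mathcal{T}} \sup_{\|b - \beta_n(\tau)\| \le \delta} \big\| \mu(b, \tau) - \mu(\beta_n(\tau), \tau) - \tilde J_m(\tau)(b - \beta_n(\tau)) \big\| \le \lambda_{\max}(E[ZZ^\top])\, \overline{f'}\, \delta^2 \xi_m.
\]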

Proof of Lemma S.1.1. Note that 0 (n ( ), ) = E[ZZ> fY |X (Z> n ( )|X)] = Jem ( ) where we
use the notation 0 (b, ) := b (b, ). Additionally, we have
(b, ) = (n ( ), ) + 0 (
n , )(b n ( )),
n = b + b, (n ( ) b) for some b, [0, 1]. Moreover, for any a Rm ,
where
a> [(b, ) (n ( ), ) 0 (n ( ), )(b n ( ))] = a> [(0 (
n , ) 0 (n ( ), ))(b n ( ))]
and thus we have for any kb n ( )k


(b, ) (n ( ), ) 0 (n ( ), )(b n ( ))




n |X) fY |X (Z> n ( )|X)
sup E a> Z Z> b n ( ) fY |X (Z>
kak=1





n n ( )) Z> (b n ( ))
f 0 sup E a> Z Z> (
kak=1




f 0 m E Z> (
n n ( )) Z> (b n ( ))

m 2 f 0 sup E[|a> Z|2 ],


kak=1

Here the last inequality follows by the Cauchy-Schwarz inequality. Since the last line does not depend on $\tau$, $b$,
this completes the proof.
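Lemma S.1.2. Let assumptions (A1)-(A3) hold. Then, for any $t > 1$,
\[
\Big\{ \sup_{\tau \in \mathcal{T}} \|\hat\beta(\tau) - \beta_n(\tau)\| \le \frac{2t(s_{n,1} + g_n)}{\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde J_m(\tau))} \Big\}
\supseteq \Big\{ (s_{n,1} + g_n) < \frac{\inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde J_m(\tau))}{4t \xi_m \overline{f'} \lambda_{\max}(E[ZZ^\top])} \Big\},
\]
where $s_{n,1} := \|P_n - P\|_{\mathcal{G}_1}$ and $\mathcal{G}_1$ is defined in (S.1.2).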

Proof of Lemma S.1.2. Observe that f : b 7 Pn (Yi Z>


i b) is convex, and the vector
b ( ) is a minimizer of Pn (Yi
Pn (; b, ) is a subgradient of f at the point b. Recalling that
>
Zi b), it follows that for any a > 0,


{sup kb
( ) n ( )k a(sn,1 + gn )} {inf inf > Pn ; n ( ) + a(sn,1 + gn ), > 0}. (S.1.7)
T

kk=1

To see this, define := (b


( )n ( ))/kb
( )n ( )k and note that by definition of the subgradient
e
we have for any n > 0

e
b ( )) Pn (Yi Z>
Pn (Yi Z>
( ) n ( )k en ) > Pn (; n ( ) + en , ).
i
i (n ( ) + n )) + (kb

b ( ) as minimizer, the inequality above can only be true


Set en = a(sn,1 + gn ). By the definition of
>
e
e
if (kb
( ) n ( )k n ) Pn (; n ( ) + n , ) 0, which yields (S.1.7).


The proof is finished once we minorize the empirical score > Pn ; n ( ) + a(sn,1 + gn ),
in (S.1.7) in terms of sn,1 + gn . To proceed, observe that under assumptions (A1)-(A3) we have by
Lemma S.1.1




sup E[ > {(Y, Z; b, ) (Y, Z; n ( ), ) ZZ> fY |X (n ( )> Z|X)(b n ( ))}]
kk=1

m f 0 max (E[ZZ> ])kb n ( )k2 .

(S.1.8)

Therefore, we have for arbitrary kk = 1, T that


> Pn (; n ( ) + a(sn,1 + gn ), )




sn,1 gn + > E (Y, Z; n ( ) + a(sn,1 + gn ), ) E (Y, Z; n ( ), )
a(sn,1 + gn ) inf min (Jem ( )) sn,1 gn m f 0 max (E[ZZ> ])a2 (sn,1 + gn )2 ,
T



where for the first inequality we recall the definition gn = sup T kE (Y, Z; n ( ), ) k and sn,1 =
kPn P kG1 ; the second inequality follows by (S.1.8). Setting a = 2t/ inf T min (Jem ( )) in en , we
see that the right-hand side of the display above is positive when
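\[
s_{n,1} + g_n < \frac{(2t - 1) \inf_{\tau \in \mathcal{T}} \lambda_{\min}^2(\tilde J_m(\tau))}{4t^2 \xi_m \overline{f'} \lambda_{\max}(E[ZZ^\top])}.
\]
Observing that for $t > 1$ we have $(2t - 1)/t^2 \ge 1/t$ and plugging $a = 2t/\inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde J_m(\tau))$ into
equation (S.1.7) completes the proof.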
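The last step of this proof compresses a short computation, which can be sketched as follows. The shorthand $s$, $\underline{\lambda}$ and $\bar\lambda$ is introduced here only for readability and does not appear in the original argument.

```latex
% Shorthand (assumed for this sketch only):
%   s := s_{n,1} + g_n,
%   \underline{\lambda} := \inf_{\tau \in \mathcal{T}} \lambda_{\min}(\tilde J_m(\tau)),
%   \bar\lambda := \lambda_{\max}(E[ZZ^\top]).
% Lower bound from the display in the proof, with a = 2t / \underline{\lambda}:
a s \underline{\lambda} - s - \xi_m \overline{f'} \bar\lambda\, a^2 s^2
  = s \Big[ (2t - 1) - \frac{4 t^2 \xi_m \overline{f'} \bar\lambda}{\underline{\lambda}^2}\, s \Big]
  > 0
\quad \Longleftrightarrow \quad
s < \frac{(2t-1)\, \underline{\lambda}^2}{4 t^2 \xi_m \overline{f'} \bar\lambda}.
% Since 2t - 1 \ge t for t \ge 1, we have (2t-1)/t^2 \ge 1/t, so it suffices that
s < \frac{\underline{\lambda}^2}{4 t\, \xi_m \overline{f'} \bar\lambda},
% which is the event on the right-hand side in the statement of Lemma S.1.2.
```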
Lemma S.1.3. Consider the classes of functions G1 , G2 () defined in (S.1.2) and (S.1.6), respectively. Under assumptions (A1)-(A3) we have for some constant C independent of n and all n > 0
provided that m = O(nb ) for some fixed b

1/2 m
 1/2 i
h m
m
n
m n
P kPn P kG1 C
log m
+
log m +
+
en .
n
n
n
n

For any n satisfying m n  n1 , we have for sufficiently large n and arbitrary n > 0


P kPn P kG2 (n ) Cn (n , n ) en ,

(S.1.9)

(S.1.10)

where

1/2 1/2
n (t, n ) := m
t

m

1/2 m
m
log(m n)
+
log(m n) + n1/2 (m tn )1/2 + n1 m n .
n
n
(S.1.11)

Proof of Lemma S.1.3. Observe that for each f G1 we have |f (x, y)| m , and the same holds
for G2 () for any value of . Additionally, similar arguments as those in the proof of Lemma 18 in
Belloni et al. (2011) imply together with Theorem 2.6.7 in van der Vaart and Wellner (1996) that,
almost surely,
N (G2 (), L2 (Pn ); )

 AkF k

L2 (Pn )

v1 (m)

,

N (G1 , L2 (Pn ); )

 AkF k

L2 (Pn )

v2 (m)

where A is some constant and v1 (m) = O(m), v2 (m) = O(m). Finally, for each f G1 we have
E[f 2 ] sup a> E[ZZ> ]a = max (E[ZZ> ]).
kak=1

On the other hand, each f G2 (n ) satisfies


E[f 2 ] sup

sup

kak=1 kb1 b2 kn




E (a> Z)2 1 |Y Z> b1 | |Z> (b1 b2 )|



sup sup E (a> Z)2 1{|Y Z> b| m n }
bRm kak=1

2f m n max (E[ZZ> ]).

Note that under assumptions (A1)-(A3) the right-hand side is bounded by cm n where c is a
constant that does not depend on n. Thus the bound in (S.2.2) implies that for m n  n1 we
have for some constant C which is independent of n,
h m
1/2 m
i
m
EkPn P kG1 C
log(m n)
log(m n) ,
(S.1.12)
+
n
n
h

1/2 m
i
m
1/2 1/2 m
EkPn P kG2 (n ) C m
n
log(m n)
+
log(m n) .
(S.1.13)
n
n
Thus (S.1.9) and (S.1.10) follow from the bound in (S.2.3) by setting t = n .

S.1.2

Proof of Theorem 5.2

We begin with the following useful decomposition

where

b ) n ( ) = n1/2 Je1 ( )Gn ((; n ( ), )) + Je1 ( )


(
m
m

4
X

Rn,k ( )

(S.1.14)

k=1

b ), ),
Rn,1 ( ) := Pn (; (


b ), ) (n ( ), ) Jem ( )((
b ) n ( )) ,
Rn,2 ( ) := ((


b ), )) Gn ((; n ( ), )) ,
Rn,3 ( ) := n1/2 Gn ((; (
Rn,4 ( ) := (n ( ), )).

> e1
(I(un ,D)) R
e1
Define rn,2 (, un ) := u>
n,k ( ) for k = 1, 3 and
n Jm ( )Rn,2 ( ), rn,k (, un ) := (un Jm ( ))



1
> e1
> e1
(I(un ,D))
e
rn,4 (, un ) := u>
J
(
)R
(
)
+
u
J
(
)

(u
J
(
))
R
(
)
+
R
(
)
.
n,4
n,1
n,3
n m
n m
n m

With those definitions we obtain

4
X

b ) n ( ) = n1/2 u> Je1 ( )Gn ((; n ( ), )) +
u>
(
rn,k (, un ).
n
n m
k=1

We will now show that the terms rn,k (, un ) defined above satisfy the bounds given in the statement
of Theorem 5.2 if we let D = c log n for a sufficiently large constant c.

The bound on rn,1 follows from Lemma S.1.7. To bound rn,2 apply Lemma S.1.4 and Lemma
S.1.6. To bound rn,3 observe that by Lemma S.1.4 the probability of the event
1/2

m (log n + n ) o
b ) B(x)> n ( )| C e
sup |B(x)> (
c2n +
.
n1/2
,x

1/2 
n+n )
is at least 1 (m + 1)en . Letting n := C e
c2n + m (logn1/2
we find that on 1

1 :=

sup

un SIm1

sup |rn,3 (, un )| . kPn P kG2 (n ,I(un ,D),I 0 (un ,D)) ,

this follows since


(I(un ,D))
b ), )) = n1/2 (u> Je1 ( ))(I(un ,D)) Gn ( (I 0 (un ,D)) (; (
b ), ))
e1
n1/2 (u>
Gn ((; (
n Jm ( ))
n m

b The bound on rn,3 now follows from Lemma S.1.5.


and a similar identity holds with n instead of .
To bound the first part of rn,4 (, un ), we proceed as in the proof of equation (S.1.18) to obtain

1
> e1

e n , B)
c2 E(u
(un Jm ( ))(n ( ), ) f 0 e
2 n

where the last line follows after a Taylor expansion. To bound the second part of rn,4 (, un )
note that sup kRn,1 ( )k + kRn,3 ( )k 3m almost surely and thus choosing D = c log n with c
(I(un ,D)) u> Je1 ( )k n1 31 1 where we used (A.5). This
e1
sufficiently large yields k(u>
m
n m
n Jm ( ))
completes the proof.
S.1.2.1

Technical details for the proof of Theorem 5.2

Lemma S.1.4. Under the assumptions of Theorem 5.2 we have for sufficiently large n and any
2 ,
n  n/m
1/2



2
b ) n ( ))| C m log n + m n + e
P sup |B(x)> ((
c
(m + 1)en .
n
n1/2
,x

where the constant C does not depend on n.

Proof of Lemma S.1.4. Apply (A.5) with a = B(x) to obtain


1
1
kB(x)> Jem
( ) (B(x)> Jem
( ))(I(B(x),D)) k . mm D ,

where I(B(x), D) is defined as (A.2), and (0, 1) is a constant independent of n. Next observe
the decomposition
b ) n ( )) = n1/2 (B(x)> Je1 ( ))(I(B(x),D)) Gn ((; (
b ), )) +
B(x)> ((
m

4
X

rn,k (, x)

k=1

where

1
b ), ),
rn,1 (, x) := (B(x)> Jem
( ))(I(B(x),D)) Pn (; (


1
b ), ) (n ( ), ) Jem ( )((
b ) n ( )) ,
rn,2 (, x) := (B(x)> Jem
( )) ((
1
rn,3 (, x) := (B(x)> Jem
( ))(n ( ), )).

and



1
1
b ), ) n1/2 Gn ((; (
b ), )) .
( ))(I(B(x),D)) Pn (; (
( ) (B(x)> Jem
rn,4 (, x) := B(x)> Jem
Letting D = c log n for a sufficiently large constant c, (S.1.14) yields sup,x |rn,4 (, x)| n1 almost
surely. Lemma S.1.7 yields the bound
sup |rn,1 (, x)| .
x,

2 log n
m
n

a.s.

b ) B(x)> n ( )|. From Lemma S.1.6 we obtain under (L)


Let n := supx, |B(x)> (


1
( )B|  n2 ,
sup |rn,2 (, x)| n2 sup E |B(x)> Jem
x,

x,

(S.1.15)

(S.1.16)



1 ( )B| = O(1) by assumption (L). Finally, note that
where supx, E |B(x)> Jem
1
b ), ))
n1/2 (B(x)> Jem
( ))(I(B(x),D)) Gn ((; (

0
1
b ), ))
= n1/2 (B(x)> Jem
( ))(I(B(x),D)) Gn ( (I (B(x),D)) (; (

where I 0 (B(x), D) is defined as (A.3). This yields





1
b ), )) . m sup kPn P kG (I(B(x),D),I 0 (B(x),D))
sup n1/2 (B(x)> Jem
( ))(I(B(x),D)) Gn ((; (
1
,x

xX

By the definition of I(B(x), D), I 0 (B(x), D), the supremum above ranges over at most m distinct
terms. Additionally, supxX |I(B(x), c log n)|+|I 0 (B(x), c log n)| . log n. Thus Lemma S.1.5 yields
1/2


 2 (log n)2 1/2
m n 
1/2

m
> e1
(I(B(x),c log n))
b
P sup n
(B(x) Jm ( ))
Gn ((; ( ), )) C
+C 1/2
men .
n
n
,x
(S.1.17)
Finally, from the definition of n as minimizer we obtain




1
|rn,3 (, x)| = (B(x)> Jem
( ))(n ( ), ))




1
= (B(x)> Jem
( ))E[B(1{Y n ( )> B} )]




1
= (B(x)> Jem
( ))E[B(FY |X (n ( )> B|X) FY |X (Q(X; )|X))]

1 0 2
fe
c O(1)
2 n

(S.1.18)

where the last line follows after a Taylor expansion and the fact that E[BfY |X (Q(X; )|X)(n> B
Q(X; ))] = 0 from the definition of n and making use of (L). Combining this with (S.1.15) (S.1.17) yields
h 2 (log n)2 1/2 1/2 2 log n
i
m n
m
m
2
2
n C
+
+
+

+
e
c
n
n
n
n
n1/2

with probability at least men . By Lemma S.1.2 we have P (n 1/(2C)) en for any n
satisfying m n  n1 This yields the assertion.

Lemma S.1.5. Let Z := {B(x)|x X } where X is the support of X. For I1 , I10 {1, ..., m},
define the classes of functions



0
Ge1 (I1 , I10 ) := (Z, Y ) 7 a> Z(I1 ) (1{Y Z> b(I1 ) } )1{kZk m } T , b Rm , a S m1 ,
(S.1.19)


(I 0 )
(I 0 )
Ge2 (, I1 , I10 ) := (Z, Y ) 7 a> Z(I1 ) (1{Y Z> b1 1 } 1{Y Z> b2 1 })1{Z Z}


b1 , b2 Rm , sup kv> b1 v> b2 k , a S m1 . (S.1.20)
vZ

Under assumptions (A1)-(A3) we have

1/2 max(|I |, |I 0 |)
 1/2 i

h max(|I |, |I 0 |)
1
1
m
n
m n
1
1
P kPn P kGe1 (I1 ,I 0 ) C
log m
+
log m +
+
en
1
n
n
n
n
(S.1.21)
1
and for any n satisfying m n  n we have for sufficiently large n and arbitrary n > 0


P kPn P kGe2 (,I1 ,I 0 ) Cn (, I1 , I10 , n ) en ,
(S.1.22)
1

where

 max(|I |, |I 0 |)
1/2 max(|I |, |I 0 |)
m
1
1
1
1
log(m n)
+
log(m n)
n
n
+n1/2 (tn )1/2 + n1 m n .

n (t, I1 , I10 , n ) := t1/2

Proof of Lemma S.1.5. We begin by observing that


  AkF kL2 (Pn ) v1 (m)
N Ge2 (, I1 , I10 ), L2 (Pn );
,

  AkF kL2 (Pn ) v2 (m)


N Ge1 (I1 , I10 ), L2 (Pn );
,

where v1 (m) = O(max(|I1 |, |I10 |)), v2 (m) = O(max(|I1 |, |I10 |)). The proof of the bound for Ge1 (I1 , I10 )
now follows by similar arguments as the proof of Lemma S.1.3. For a proof of the second part, note
that for f Ge2 we have
h
n
oi
(I 0 )
(I 0 )
(I 0 )
E[f 2 ] sup
sup
E (a> B(I1 ) )2 1 |Y B> b1 1 | |B> (b1 1 b2 1 )|
kak=1 b1 ,b2 R(n )

h
2
i
0
0
0
sup sup E (a(I1 ) )> BB> a(I1 ) 1{|Y B> b(I1 ) | n }
bRm kak=1

2f n max (E[BB> ])
n
o
where we defined R() := b1 , b2 Rm , supvZ kv> b1 v> b2 k . The rest of the proof follows
by similar arguments as the proof of Lemma S.1.3.
Lemma S.1.6. Under assumptions (A1)-(A3) we have for any a, b Rm


>e

a Jm ( )1 (b, ) a> Jem ( )1 (n ( ), ) a> (b n ( ))
f 0 sup |B(x)> b B(x)> n ( )|2 E[|a> Jem ( )1 B|].
x

Proof of Lemma S.1.6. Note that 0 (n ( ), ) = E[BB> fY |X (B> n ( )|X)] = Jem ( ). Additionally, we have
)(b n ( )),
(b, ) = (n ( ), ) + 0 (b,
= b + b, (n ( ) b) for some b, [0, 1]. Moreover,
where b
) Jem ( )](b n ( ))
a> [Jem ( )1 (b, ) Jem ( )1 (n ( ), ) (b n ( ))] = a> Jem ( )1 [0 (b,
and thus




>e
1
>
1
>e
J
(
)
(
(
),

a
(b

(
))
J
(
)
(b,

a
a


m
n
n
m
h

i

= E a> Jem ( )1 BB> fY |X (B> b|X)


fY |X (B> n ( )|X) b n ( )

h
2 i


f 0 E a> Jem ( )1 B B> (b n ( ))

f 0 sup |B(x)> b B(x)> n ( )|2 E[|a> Jem ( )1 B|].


x

Lemma S.1.7. Under assumptions (A1)-(A3) and (L) we have for any $a \in \mathbb{R}^m$ having zero entries
everywhere except at $L$ consecutive positions:
\[
\big| a^\top P_n \psi(\cdot; \hat\beta(\tau), \tau) \big| \le \frac{(L + 2r) \|a\| \xi_m}{n}.
\]

Proof of Lemma S.1.7. From standard arguments based on the optimality condition of quantile
regression (p.35 of Koenker (2005), also see equation (2.2) on p.224 of Knight (2008)), we know
that for any $\tau \in \mathcal{T}$,
\[
P_n \psi(\cdot; \hat\beta(\tau), \tau) = \frac{1}{n} \sum_{i=1}^n B_i \big( 1\{Y_i \le B_i^\top \hat\beta(\tau)\} - \tau \big) = \frac{1}{n} \sum_{i \in H_\tau} v_i B_i,
\]
where $v_i \in [-1, 1]$ and $H_\tau = \{i : Y_i = B_i^\top \hat\beta(\tau)\}$. Since $a$ has at most $L$ non-zero entries, the
dimension of the subspace spanned by $\{B_i : a^\top B_i \ne 0\}$ is at most $L + 2r$ [each vector $B_i$ by
construction has at most $r$ nonzero entries and all of those entries are consecutive]. Since the
conditional distribution of $Y$ given the covariates has a density, the data are in general position almost
surely, i.e. no more than $k$ of the points $(B_i, Y_i)$ lie in any $k$-dimensional linear space. It follows
that the cardinality of the set $H_\tau \cap \{i : a^\top B_i \ne 0\}$ is bounded by $L + 2r$. The assertion follows
after an elementary calculation.
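The mechanism behind Lemma S.1.7 is easiest to see in the simplest special case, which is sketched below under assumptions that are not in the paper: an intercept-only model ($B_i = 1$, so $m = 1$), continuous data without ties, and hypothetical function names. The quantile regression estimator is then a sample quantile, it interpolates exactly one observation, and the empirical score is at most $1/n$ in absolute value.

```python
import math
import random

def total_check_loss(ys, q, tau):
    # sum_i rho_tau(y_i - q) with rho_tau(u) = u * (tau - 1{u < 0})
    return sum((y - q) * (tau - (1.0 if y - q < 0 else 0.0)) for y in ys)

def sample_quantile(ys, tau):
    # Order statistic Y_(k) with k = ceil(n * tau); a minimizer of the check loss.
    k = max(1, math.ceil(len(ys) * tau))
    return sorted(ys)[k - 1]

def empirical_score(ys, q, tau):
    # P_n psi = (1/n) * sum_i (1{y_i <= q} - tau) for the intercept-only model
    return sum((1.0 if y <= q else 0.0) - tau for y in ys) / len(ys)

random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(101)]
for tau in (0.1, 0.25, 0.5, 0.9):
    q = sample_quantile(ys, tau)
    # the score is bounded by (number of interpolated points) / n = 1 / n
    print(tau, empirical_score(ys, q, tau), 1.0 / len(ys))
```

In the lemma, the count of interpolated points relevant for a vector $a$ with $L$ consecutive non-zero entries plays the role of the single interpolated point here.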

S.1.3

Proof of Theorem 5.4

The statement follows from Theorem 5.1 if we prove that the vector n ( ) satisfies


sup (n ( ); ) = O(m c2
n )

(S.1.23)

1 ) in Condition (C1), and establish the identity in (5.10). For the identity (5.10), we
as cn = o(m
first observe the representation
\[
J_m(\tau) = \begin{pmatrix} M_1(\tau) + A(\tau) M_2(\tau) A(\tau)^\top & A(\tau) M_2(\tau) \\ M_2(\tau) A(\tau)^\top & M_2(\tau) \end{pmatrix}, \tag{S.1.24}
\]

which follows from (3.5) and


e
e
E[(V A( )Z(W
))Z(W
)> fY |X (Q(X; )|X)] = 0, for all T .

To simplify the notations, we suppress the argument in in the following matrix calculations.
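Recall the following identity for the inverse of a $2 \times 2$ block matrix (see equation (6.0.8) on p.165 of
Puntanen and Styan (2005))
\[
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1} B D^{-1} \\ -D^{-1} C (A - BD^{-1}C)^{-1} & D^{-1} + D^{-1} C (A - BD^{-1}C)^{-1} B D^{-1} \end{pmatrix}.
\]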
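The block inverse identity recalled above is standard and can be verified exactly on a small example. The sketch below uses exact rational arithmetic from the Python standard library; the helper names and the particular matrices are hypothetical and chosen only so that all required inverses exist.

```python
from fractions import Fraction as F

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y, sign=1):
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def neg(X):
    return [[-x for x in row] for row in X]

def inverse(M):
    # Gauss-Jordan elimination in exact rational arithmetic.
    n = len(M)
    aug = [list(row) + [F(int(i == j)) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def block_inverse(A, B, C, D):
    # Blocks of the inverse, built from the Schur complement A - B D^{-1} C.
    Dinv = inverse(D)
    S = inverse(add(A, matmul(matmul(B, Dinv), C), sign=-1))
    top_right = neg(matmul(matmul(S, B), Dinv))
    bottom_left = neg(matmul(matmul(Dinv, C), S))
    bottom_right = add(Dinv, matmul(matmul(matmul(matmul(Dinv, C), S), B), Dinv))
    return S, top_right, bottom_left, bottom_right

def assemble(TL, TR, BL, BR):
    return [a + b for a, b in zip(TL, TR)] + [c + d for c, d in zip(BL, BR)]

A = [[F(4), F(1)], [F(1), F(3)]]
B = [[F(1)], [F(2)]]
C = [[F(0), F(1)]]
D = [[F(5)]]
full = assemble(A, B, C, D)
via_blocks = assemble(*block_inverse(A, B, C, D))
direct = inverse(full)
print(via_blocks == direct)
```

In the proof, the same identity is applied with the blocks of (S.1.24) in place of $A$, $B$, $C$, $D$.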
Identifying the blocks in the representation (S.1.24) with the blocks in the above representation
yields the result after some simple calculations. For a proof of (S.1.23) observe that
e
(n ( ); ) = E[(V > , Z(W
)> )> (FY |X (n ( )> Z|X) )]

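Now on one hand we have, uniformly in $\tau \in T$,
\begin{align*}
& E\big[V \big(F_{Y|X}(\alpha(\tau)^\top V + \beta_n^\dagger(\tau)^\top \tilde Z(W) \,|\, X) - \tau\big)\big] \\
&= -E\big[V f_{Y|X}(Q(X;\tau)|X)\big(Q(X;\tau) - \alpha(\tau)^\top V - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] + O(c_n^{\dagger 2}) \\
&= -E\big[V f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] + O(c_n^{\dagger 2}) \\
&= -E\big[(V - h_{VW}(W;\tau) + h_{VW}(W;\tau)) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] + O(c_n^{\dagger 2}) \\
&= -E\big[h_{VW}(W;\tau) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] + O(c_n^{\dagger 2}) \\
&= -E\big[(h_{VW}(W;\tau) - A(\tau)\tilde Z(W) + A(\tau)\tilde Z(W)) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] + O(c_n^{\dagger 2}) \\
&= -E\big[(h_{VW}(W;\tau) - A(\tau)\tilde Z(W)) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] + O(c_n^{\dagger 2}) \\
&= O(c_n^{\dagger 2} + \lambda_n c_n^\dagger).
\end{align*}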

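Here, the first equality follows after a Taylor expansion taking into account that, by the definition
of the conditional quantile function, $F_{Y|X}(Q(X;\tau)|X) \equiv \tau$. The fourth equality is a consequence
of (3.5), and the sixth equality follows since
\[
E\big[\tilde Z(W) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] = 0 \tag{S.1.25}
\]
by the definition of $\beta_n^\dagger(\tau)$ as minimizer. The last line follows by the Cauchy-Schwarz inequality.
On the other hand,
\begin{align*}
& E\big[\tilde Z(W)\big(F_{Y|X}(\alpha(\tau)^\top V + \beta_n^\dagger(\tau)^\top \tilde Z(W) \,|\, X) - \tau\big)\big] \\
&= -E\big[\tilde Z(W) f_{Y|X}(Q(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)\big] \\
&\quad + \frac{1}{2} E\big[\tilde Z(W) f'_{Y|X}(\zeta(X;\tau)|X)\big(h(W;\tau) - \beta_n^\dagger(\tau)^\top \tilde Z(W)\big)^2\big],
\end{align*}
where $\zeta(X;\tau)$ denotes an intermediate point from the Taylor expansion.

By (S.1.25), the first term in the representation above is zero, and the norm of the second term is
of the order $O(\xi_m c_n^{\dagger 2})$. This completes the proof.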
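
The block-inverse step used for (S.1.24) in the proof above can be checked numerically. The sketch below is not part of the original argument: the dimensions `p` and `q` and the randomly generated `M1`, `M2`, `A` are hypothetical stand-ins, and the closed form in `J_inv_blocks` is simply what the 2 x 2 block-inverse identity gives when the Schur complement reduces to `M1`.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 2, 4  # hypothetical dimensions of V and of the basis vector \tilde Z(W)


def random_spd(d):
    """Random symmetric positive definite d x d matrix (illustration only)."""
    g = rng.normal(size=(d, d))
    return g @ g.T + d * np.eye(d)


M1, M2 = random_spd(p), random_spd(q)
A = rng.normal(size=(p, q))

# J_m assembled block by block as in (S.1.24).
J = np.block([[M1 + A @ M2 @ A.T, A @ M2],
              [M2 @ A.T, M2]])

# Plugging the blocks into the block-inverse identity: the Schur complement
# (M1 + A M2 A^T) - (A M2) M2^{-1} (M2 A^T) equals M1.
M1_inv, M2_inv = np.linalg.inv(M1), np.linalg.inv(M2)
J_inv_blocks = np.block([[M1_inv, -M1_inv @ A],
                         [-A.T @ M1_inv, M2_inv + A.T @ M1_inv @ A]])

max_abs_error = np.max(np.abs(J_inv_blocks - np.linalg.inv(J)))
print(max_abs_error)
```

The printed discrepancy should be at the level of floating-point rounding, which is consistent with the "simple calculations" mentioned in the proof.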

APPENDIX S.2: Auxiliary Results


S.2.1 Results on empirical process theory

In this section, we collect some basic results from empirical process theory needed in our proofs.
Denote by $\mathcal{G}$ a class of functions that satisfies $|f(x)| \le F(x) \le U$ for every $f \in \mathcal{G}$ and let $\sigma^2 \ge
\sup_{f \in \mathcal{G}} P f^2$. Additionally, let for some $A > 0$, $V > 0$ and all $\epsilon > 0$,
\[
N(\epsilon, \mathcal{G}, L_2(P_n)) \le \Big(\frac{A\|F\|_{L_2(P_n)}}{\epsilon}\Big)^V. \tag{S.2.1}
\]
Note that if $\mathcal{G}$ is a VC-class, then $V$ is the VC-index of the set of subgraphs of functions in $\mathcal{G}$. In
that case, the symmetrization inequality and inequality (2.2) from Koltchinskii (2006) yield
\[
E\|P_n - P\|_{\mathcal{G}} \le c_0 \Big[ \sigma \Big( \frac{V}{n} \log \frac{A\|F\|_{L_2(P)}}{\sigma} \Big)^{1/2} + \frac{VU}{n} \log \frac{A\|F\|_{L_2(P)}}{\sigma} \Big] \tag{S.2.2}
\]
for a universal constant $c_0 > 0$ provided that $1 \ge \sigma^2 > \mathrm{const} \times n^{-1}$ [in fact, the inequality in
Koltchinskii (2006) is for $\sigma^2 = \sup_{f \in \mathcal{G}} P f^2$. However, this is not a problem since we can replace
$\mathcal{G}$ by $\mathcal{G}\sigma/(\sup_{f \in \mathcal{G}} P f^2)^{1/2}$]. The second inequality (a refined version of Talagrand's concentration
inequality) states that for any countable class of measurable functions $\mathcal{F}$ with elements mapping
into $[-M, M]$
\[
P\Big\{ \|P_n - P\|_{\mathcal{F}} \ge 2E\|P_n - P\|_{\mathcal{F}} + c_1 n^{-1/2} \Big(\sup_{f \in \mathcal{F}} P f^2\Big)^{1/2} \sqrt{t} + n^{-1} c_2 M t \Big\} \le e^{-t}, \tag{S.2.3}
\]
for all $t > 0$ and universal constants $c_1, c_2 > 0$. This is a special case of Theorem 3 in Massart
(2000) [in the notation of that paper, set $\varepsilon = 1$].

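Lemma S.2.1 (Lemma 7.1 of Kley et al. (2015)). Let $\{G_t : t \in T\}$ be a separable stochastic process
with $\|G_s - G_t\|_\Psi \le C d(s, t)$ ($\|\cdot\|_\Psi$ is defined in (A.32)) for all $s, t$ satisfying $d(s, t) \ge \bar\eta/2 \ge 0$.
Denote by $D(\epsilon, d)$ the packing number of the metric space $(T, d)$. Then, for any $\delta > 0$, $\eta \ge \bar\eta$, there
exists a random variable $S_1$ and a constant $K < \infty$ such that
\[
\sup_{d(s,t) \le \delta} |G_s - G_t| \le S_1 + 2 \sup_{d(s,t) \le \bar\eta,\ t \in \tilde T} |G_s - G_t|, \tag{S.2.4}
\]
where the set $\tilde T$ contains at most $D(\bar\eta, d)$ points, and $S_1$ satisfies
\[
\|S_1\|_\Psi \le K \Big[ \int_{\bar\eta/2}^{\eta} \Psi^{-1}(D(\epsilon, d))\, d\epsilon + (\delta + 2\bar\eta)\, \Psi^{-1}(D^2(\eta, d)) \Big], \tag{S.2.5}
\]
\[
P(|S_1| > x) \le \Big( \Psi\Big( x \Big[ 8K \Big( \int_{\bar\eta/2}^{\eta} \Psi^{-1}(D(\epsilon, d))\, d\epsilon + (\delta + 2\bar\eta)\, \Psi^{-1}(D^2(\eta, d)) \Big) \Big]^{-1} \Big) \Big)^{-1}. \tag{S.2.6}
\]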

S.2.2 Covering number calculation

A few useful lemmas on covering number are given in this section.


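Lemma S.2.2. Suppose $\mathcal{F}$ and $\mathcal{G}$ are two function classes with envelopes $F$ and $G$.
1. The class $\mathcal{F} - \mathcal{G} := \{f - g \,|\, f \in \mathcal{F}, g \in \mathcal{G}\}$ has envelope $F + G$ and
\[
\sup_Q N\big(\epsilon\|F + G\|_{Q,2}, \mathcal{F} - \mathcal{G}, L_2(Q)\big)
\le \sup_Q N\Big(\frac{\epsilon\|F\|_{Q,2}}{\sqrt 2}, \mathcal{F}, L_2(Q)\Big)\, \sup_Q N\Big(\frac{\epsilon\|G\|_{Q,2}}{\sqrt 2}, \mathcal{G}, L_2(Q)\Big). \tag{S.2.7}
\]
2. The class $\mathcal{F} \cdot \mathcal{G} := \{fg : f \in \mathcal{F}, g \in \mathcal{G}\}$ has envelope $FG$ and
\[
\sup_Q N\big(\epsilon\|FG\|_{Q,2}, \mathcal{F} \cdot \mathcal{G}, L_2(Q)\big)
\le \sup_Q N\Big(\frac{\epsilon\|F\|_{Q,2}}{2}, \mathcal{F}, L_2(Q)\Big)\, \sup_Q N\Big(\frac{\epsilon\|G\|_{Q,2}}{2}, \mathcal{G}, L_2(Q)\Big), \tag{S.2.8}
\]
where the suprema are taken over the appropriate subsets of all finitely discrete probability
measures $Q$.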
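Proof of Lemma S.2.2.
1. It is obvious that $|f - g| \le |f| + |g| \le F + G$ for any $f \in \mathcal{F}$ and $g \in \mathcal{G}$. Hence, $F + G$ is
an envelope for the function class $\mathcal{F} - \mathcal{G}$. Suppose that $A = \{f_1, \ldots, f_J\}$ and $B = \{g_1, \ldots, g_K\}$
are the centers of an $\epsilon\|F\|_{Q,2}/\sqrt 2$-net for $\mathcal{F}$ and an $\epsilon\|G\|_{Q,2}/\sqrt 2$-net for $\mathcal{G}$ respectively. For any $f - g$,
there exist $f_j$ and $g_k$ such that
\begin{align*}
\|(f_j - g_k) - (f - g)\|_{Q,2}^2 = \|(f_j - f) - (g_k - g)\|_{Q,2}^2
&\le 2\big(\|f_j - f\|_{Q,2}^2 + \|g_k - g\|_{Q,2}^2\big) \\
&\le \epsilon^2\big(\|F\|_{Q,2}^2 + \|G\|_{Q,2}^2\big) \le \epsilon^2 \|F + G\|_{Q,2}^2,
\end{align*}
where the last inequality follows from the fact that both $F$ and $G$ are nonnegative. Hence,
$\{f_j - g_k : 1 \le j \le J, 1 \le k \le K\}$ forms an $\epsilon\|F + G\|_{Q,2}$-net for the class $\mathcal{F} - \mathcal{G}$, with cardinality
$JK$.
2. See Lemma 6 of Belloni et al. (2011).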
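
The net construction in part 1 of the proof can be illustrated on a toy example. The sketch below is an illustration only, under stated assumptions: `Q` is taken to be uniform on 50 grid points, the two finite classes `class_f` and `class_g` are hypothetical, and `greedy_net` is a simple helper written for this example rather than anything from the paper.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)  # support of a hypothetical discrete Q (uniform weights)
eps = 0.3


def l2_norm(h):
    """L2(Q) norm of a function given by its values on the support of Q."""
    return float(np.sqrt(np.mean(h ** 2)))


def greedy_net(funcs, radius):
    """Pick centres greedily so that every function lies within radius of a centre."""
    centres = []
    for f in funcs:
        if all(l2_norm(f - c) > radius for c in centres):
            centres.append(f)
    return centres


# Two hypothetical finite function classes, stored as value vectors on the support.
class_f = [np.cos(t * x) for t in np.linspace(0.0, 3.0, 40)]
class_g = [s * x for s in np.linspace(-1.0, 1.0, 40)]
env_f = np.max(np.abs(class_f), axis=0)  # pointwise envelope F
env_g = np.max(np.abs(class_g), axis=0)  # pointwise envelope G

# Nets with the radii used in the proof.
net_f = greedy_net(class_f, eps * l2_norm(env_f) / np.sqrt(2))
net_g = greedy_net(class_g, eps * l2_norm(env_g) / np.sqrt(2))

# Largest distance from any f - g to the nearest difference of centres f_j - g_k.
target = eps * l2_norm(env_f + env_g)
worst = max(
    min(l2_norm((fj - gk) - (f - g)) for fj in net_f for gk in net_g)
    for f in class_f
    for g in class_g
)
print(len(net_f), len(net_g), worst <= target)
```

The differences of centres form a net of at most `len(net_f) * len(net_g)` points, which is the product bound in (S.2.7) for this particular `Q`.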
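For any fixed vector $u \in \mathbb{R}^m$ and $\delta > 0$, recall the function classes
\begin{align*}
\mathcal{G}_3(u) &= \big\{(Z, Y) \mapsto u^\top J_m(\tau)^{-1} Z \,\big|\, \tau \in T\big\}, \\
\mathcal{G}_4 &= \big\{(X, Y) \mapsto 1\{Y_i \le Q(X; \tau)\} \,\big|\, \tau \in T\big\}, \\
\mathcal{G}_6(u, \delta) &= \big\{(Z, Y) \mapsto u^\top \{J_m(\tau_1)^{-1} - J_m(\tau_2)^{-1}\} Z \,\big|\, \tau_1, \tau_2 \in T, |\tau_1 - \tau_2| \le \delta\big\}, \\
\mathcal{G}_7(\delta) &= \big\{(X, Y) \mapsto 1\{Y_i \le Q(X, \tau_1)\} - 1\{Y_i \le Q(X, \tau_2)\} - (\tau_1 - \tau_2) \,\big|\, \tau_1, \tau_2 \in T, |\tau_1 - \tau_2| \le \delta\big\}.
\end{align*}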

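Recall the following Lipschitz continuity property of $J_m^{-1}(\tau)$ by Lemma 13 of Belloni et al. (2011):
for $\tau_1, \tau_2 \in T$,
\begin{align*}
\|J_m^{-1}(\tau_1) - J_m^{-1}(\tau_2)\|
&\le \frac{\overline{f'}}{f_{\min}}\, |\tau_1 - \tau_2| \Big[\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))\Big]^{-2} \lambda_{\max}(E[ZZ^\top]) \\
&:= C_0 \Big[\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))\Big]^{-1} |\tau_1 - \tau_2|, \tag{S.2.9}
\end{align*}
where
\[
C_0 = \frac{\overline{f'}}{f_{\min}} \cdot \frac{\lambda_{\max}(E[ZZ^\top])}{\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))}.
\]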
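
The squared inverse eigenvalue in (S.2.9) is the factor one would expect from the elementary identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, which gives $\|A^{-1} - B^{-1}\| \le \|A - B\| / (\lambda_{\min}(A)\lambda_{\min}(B))$ for symmetric positive definite $A$ and $B$. The sketch below checks this inequality on generic matrices; `J1` and `J2` are hypothetical stand-ins for $J_m(\tau_1)$ and $J_m(\tau_2)$, and this is a sanity check rather than the argument of Belloni et al. (2011).

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5  # hypothetical dimension


def random_spd(d):
    """Random symmetric positive definite d x d matrix (illustration only)."""
    g = rng.normal(size=(d, d))
    return g @ g.T + np.eye(d)


J1 = random_spd(m)
h = rng.normal(size=(m, 1))
J2 = J1 + 0.1 * (h @ h.T)  # small positive semidefinite perturbation, so J2 stays SPD

# Left-hand side: spectral norm of the difference of the inverses.
lhs = np.linalg.norm(np.linalg.inv(J1) - np.linalg.inv(J2), 2)

# Right-hand side: ||J1 - J2|| divided by the product of the smallest eigenvalues.
lam1 = np.linalg.eigvalsh(J1)[0]
lam2 = np.linalg.eigvalsh(J2)[0]
rhs = np.linalg.norm(J1 - J2, 2) / (lam1 * lam2)

print(lhs, rhs)
```

Bounding both smallest eigenvalues from below by their infimum over $\tau$ produces a factor of the form $[\inf_\tau \lambda_{\min}]^{-2}$ as in (S.2.9).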

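Lemma S.2.3. $\mathcal{G}_3(u)$ has an envelope $G_3(Z) = \|u\|\xi_m [\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))]^{-1}$ and
\[
N(\epsilon\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)) \le \frac{C_0}{\epsilon},
\]
where $C_0 = \frac{\overline{f'}}{f_{\min}} \cdot \frac{\lambda_{\max}(E[ZZ^\top])}{\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))} < \infty$, for any probability measure $Q$ and $x$.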

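Proof of Lemma S.2.3. By (S.2.9), for any $\tau_1, \tau_2 \in T$ and $u$,
\begin{align*}
\big|u^\top J_m(\tau_1)^{-1} Z - u^\top J_m(\tau_2)^{-1} Z\big|
&\le \|u\|\xi_m \frac{\overline{f'}}{f_{\min}} \lambda_{\max}(E[ZZ^\top]) \Big[\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))\Big]^{-2} |\tau_1 - \tau_2| \\
&= C_0 \|G_3\|_{L_2(Q)} |\tau_1 - \tau_2|.
\end{align*}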

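Applying the relation of the covering and bracketing number on p.84 and Theorem 2.7.11 of van der
Vaart and Wellner (1996) yields for each $u$ and any probability measure $Q$,
\[
N\big(\epsilon\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)\big)
\le N_{[\,]}\big(2\epsilon\|G_3\|_{L_2(Q)}, \mathcal{G}_3(u), L_2(Q)\big)
\le N\Big(\frac{\epsilon}{C_0}, T, |\cdot|\Big) \le \frac{C_0}{\epsilon}.
\]

Lemma S.2.4.
1. $\mathcal{G}_4$ is a VC-class with VC index 2.
2. The envelopes for $\mathcal{G}_6(u, \delta)$ and $\mathcal{G}_7$ are $G_6 = \|u\|\xi_m [\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))]^{-1} C_0 \delta$ and $G_7 = 2$. Furthermore, it holds for any fixed $x$ and $\delta \le |T|$ that
\[
N(\epsilon\|G_6\|_{L_2(Q)}, \mathcal{G}_6(u, \delta), L_2(Q)) \le 2\Big(\frac{C_0}{\epsilon}\Big)^2, \tag{S.2.10}
\]
\[
N(\epsilon\|G_7\|_{L_2(Q)}, \mathcal{G}_7(\delta), L_2(Q)) \le \Big(\frac{A_7}{\epsilon}\Big)^2, \tag{S.2.11}
\]
where $A_7$ is a universal constant and $Q$ is an arbitrary probability measure.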


Proof of Lemma S.2.4.
1. Since $Q(X; \tau)$ is monotone in $\tau$, a basic VC subgraph argument shows that $\mathcal{G}_4$ has VC index 2
under the definition given on p.135 of van der Vaart and Wellner (1996).
2. By (S.2.9), the envelope for $\mathcal{G}_6(u, \delta)$ is $\|u\|\xi_m [\inf_{\tau \in T} \lambda_{\min}(J_m(\tau))]^{-1} C_0 \delta$. The envelope for $\mathcal{G}_7(\delta)$ is
obvious. By the fact that $\mathcal{G}_7(\delta) \subset \mathcal{G}_4 - \mathcal{G}_4$ and the covering number of $\mathcal{G}_4$ (implied by Theorem
2.6.7 of van der Vaart and Wellner (1996)), (S.2.11) follows by (S.2.7) of Lemma S.2.2.
As for (S.2.10), we note that $\mathcal{G}_6(u, \delta) \subset \mathcal{G}_3(u) - \mathcal{G}_3(u)$. Then, (S.2.7) of Lemma S.2.2 and
Lemma S.2.3 imply
\[
N(\epsilon\|G_6\|_{L_2(Q)}, \mathcal{G}_6(u, \delta), L_2(Q))
\le N\Big(\frac{\epsilon\|G_3\|_{L_2(Q)}}{\sqrt 2}, \mathcal{G}_3(u), L_2(Q)\Big)^2
\le 2\Big(\frac{C_0}{\epsilon}\Big)^2,
\]
where $Q$ is an arbitrary probability measure.
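
The monotonicity argument in part 1 can be illustrated with a small experiment. The sketch below is an illustration only: the function `Q` is a hypothetical conditional quantile function that is increasing in `tau`, and the code checks nestedness of the sets $\{(x, y) : y \le Q(x; \tau)\}$ on sample points rather than reproducing the subgraph argument itself.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)


def Q(x, tau):
    """Hypothetical conditional quantile function, increasing in tau."""
    return x + np.log(tau / (1.0 - tau))


taus = np.linspace(0.01, 0.99, 199)
xs, ys = rng.normal(size=8), rng.normal(size=8)

# labels[k, i] = 1{y_i <= Q(x_i; tau_k)}; each column is nondecreasing in tau.
labels = (ys[None, :] <= Q(xs[None, :], taus[:, None])).astype(int)

# For every pair of points, count the distinct label patterns seen over all tau.
max_patterns = 0
for i, j in itertools.combinations(range(len(xs)), 2):
    patterns = {tuple(row) for row in labels[:, [i, j]].tolist()}
    max_patterns = max(max_patterns, len(patterns))

print(max_patterns)
```

Because the sets are nested in `tau`, a pair of points can never exhibit both patterns (1, 0) and (0, 1), so no pair shows all four labelings.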
