
JOURNAL OF TIME SERIES ANALYSIS

J. Time Ser. Anal. (2017)


Published online in Wiley Online Library
(wileyonlinelibrary.com) DOI: 10.1111/jtsa.12237

ORIGINAL ARTICLE

A ROBBINS–MONRO ALGORITHM FOR NON-PARAMETRIC
ESTIMATION OF NAR PROCESSES WITH MARKOV SWITCHING:
CONSISTENCY

LISANDRO JAVIER FERMIN,a RICARDO RIOSb AND LUIS ANGEL RODRIGUEZa,c*

a CIMFAV, Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso, Chile
b Escuela de Matemáticas, Facultad de Ciencias, Universidad Central de Venezuela, Caracas, Venezuela
c Dpto. de Matemáticas, FACYT, Universidad de Carabobo, Valencia, Venezuela

We approach the problem of non-parametric estimation for autoregressive Markov switching processes. In this context, the
Nadaraya–Watson-type regression functions estimator is interpreted as a solution of a local weighted least-squares problem,
which does not admit a closed-form solution in the case of hidden Markov switching. We introduce a non-parametric recursive
algorithm to approximate the estimator. Our algorithm restores the missing data by means of a Monte Carlo step and estimates
the regression function via a Robbins–Monro step. We prove that non-parametric autoregressive models with Markov switching
are identifiable when the hidden Markov process has a finite state space. Consistency of the estimator is proved using the strong
α-mixing property of the model. Finally, we present some simulations illustrating the performance of our non-parametric
estimation procedure.

Received 20 March 2015; Accepted 7 March 2017

Keywords: Autoregressive process; Markov switching; Robbins–Monro approximation; non-parametric kernel estimation
MOS subject classification: Primary: 60G17; Secondary: 62G07.

1. INTRODUCTION
Markov switching autoregressive processes can be viewed as a combination of hidden Markov models (HMMs)
and threshold regression models. Switching autoregressive processes, introduced in an econometric context
by Goldfeld and Quandt (1973), have become quite popular in the literature and were employed by Hamilton
(1989) in the analysis of the gross national product of the USA across contraction and expansion regimes. In this
family of models, which combines different autoregressive models to describe the time evolution of the process,
the transition between these autoregressive models is controlled by an HMM.
Switching linear autoregressive processes with Markov regime have been extensively studied, and several applications
in economics and finance can be found in, for instance, Krolzig (1997), Kim and Nelson (1999) and
Hamilton and Raj (2003). These models are also widely used in several electrical engineering areas, including
tracking of manoeuvring targets, failure detection, wind power production and stochastic adaptive control; see, for
instance, Tugnait (1982), Doucet et al. (2000), Cappé et al. (2005) and Ailliot and Monbet (2012).
Switching nonlinear autoregressive models with Markov regime are of considerable interest to the statistics
community, especially for econometric series modelling. Among such models, considered by Francq and Roussignol
(1997), there are those that admit an additive decomposition, with particular interest in the switching ARCH


Correspondence to: Luis Angel Rodriguez, Dpto. de Matemáticas, FACYT, Universidad de Carabobo, Valencia, Venezuela. E-mail:
larodri@uc.edu.ve


models (Francq et al., 2001). However, an even more general class of switching nonlinear autoregressive processes
that do not necessarily admit an additive decomposition has also been studied by Krishnamurthy and Rydén (1998)
and Douc et al. (2004).
We consider a particular type of switching nonlinear autoregressive model $Y = \{Y_k\}_{k\ge 0}$ with Markov regime,
called a Markov switching nonlinear autoregressive (MS-NAR) process, which is defined for $k \ge 1$ by

$$Y_k = r_{X_k}(Y_{k-1}) + e_k, \qquad (1)$$

where $\{e_k\}_{k\ge 1}$ are i.i.d. random variables, the sequence $\{X_k\}_{k\ge 1}$ is a homogeneous Markov chain with state
space $\{1, \dots, m\}$, and $r_1(y), \dots, r_m(y)$ are the regression functions, assumed to be unknown.
We denote by $A$ the probability transition matrix of the Markov chain $X$, that is, $A = (a_{ij})$, with
$a_{ij} = P(X_k = j \mid X_{k-1} = i)$. We assume that the variable $Y_0$, the Markov chain $\{X_k\}_{k\ge 1}$ and the sequence $\{e_k\}_{k\ge 1}$
are mutually independent.
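
To fix ideas, here is a minimal simulation sketch of model (1) in Python (not from the article): the function name, the
Gaussian choice for the innovations and all implementation details are illustrative assumptions.

```python
import numpy as np

def simulate_ms_nar(n, A, regressions, noise_std=1.0, y0=0.0, seed=None):
    """Simulate an MS-NAR path Y_k = r_{X_k}(Y_{k-1}) + e_k, k = 1, ..., n.

    A           : (m, m) transition matrix of the hidden chain X
    regressions : list of m callables r_1, ..., r_m
    noise_std   : standard deviation of the i.i.d. Gaussian innovations e_k
    Returns Y (length n+1, starting at Y_0) and X (length n, X_1, ..., X_n).
    """
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    X = np.zeros(n, dtype=int)
    Y = np.zeros(n + 1)
    Y[0] = y0
    X[0] = rng.integers(m)                       # arbitrary initial regime X_1
    for k in range(n):
        if k > 0:
            X[k] = rng.choice(m, p=A[X[k - 1]])  # hidden Markov transition
        Y[k + 1] = regressions[X[k]](Y[k]) + noise_std * rng.normal()
    return Y, X
```

The estimation sketches given later in the article reuse data generated in this way.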
This model is a generalization of the switching linear autoregressive model with Markov regime, also known as
the MS-AR model. When the regression functions $r_i$ are linear, the MS-NAR process is simply an MS-AR model.
In the parametric case, that is, when the regression functions depend on an unknown parameter, the maximum
likelihood estimation method is commonly used. The consistency of the maximum likelihood estimator
for the MS-NAR model is given by Krishnamurthy and Rydén (1998), and consistency and asymptotic normality
are proved in a more general context by Douc et al. (2004). Several versions of the expectation–maximization
(EM) algorithm and of its variants, for instance, stochastic EM, Monte Carlo EM and simulated annealing EM
(SAEM), are implemented in Cappé et al. (2005) for the computation of the maximum likelihood estimator. A
semi-parametric estimation for the MS-NAR model was studied by Ríos and Rodríguez (2008b), where the authors
considered a conditional least-squares approach for the parameter estimation and a kernel density estimator for the
innovation probability density.
Although in many situations a key problem is how to estimate the order $m$ of the model, in this work we assume
it to be known. Nevertheless, when $m$ is not known, a possible approach is to consider the minimization of a
penalized contrast function. In this case, the main problem to be solved is the choice of a good penalty term. For
the case of an MS-AR with Gaussian innovations, Ríos and Rodríguez (2008a) considered a penalized likelihood
criterion. For finite mixture models, a penalized contrast defined from the Hankel matrices of the first algebraic
moments has been considered by Dacunha-Castelle and Gassiat (1997). In the non-parametric context, as
far as we know, this is still an open problem.
Nevertheless, before studying the consistency of any estimation procedure for this model, one needs to answer the
question of model identifiability. In the context of non-parametric estimation of MS-NAR models, as far as we know,
this problem has not been previously addressed in the literature. We tackle it following the ideas of the recent work
of Gassiat et al. (2016) on the non-parametric estimation of HMMs. The overall
idea is to identify the Markov regime first and then to ensure that the estimation method provides a unique estimate.
We consider non-parametric estimators obtained through the minimization of a quadratic contrast function. This
function has a unique minimum, given by the Nadaraya–Watson estimator when the Markov chain is observed,
ensuring in this case that the regression functions are identified with a non-parametric approach.
In this work, we consider a non-parametric regression model. That is, for $i = 1, \dots, m$, we define a Nadaraya–
Watson-type kernel estimator, given by

$$\hat r_{i,n}(y) = \frac{\sum_{k=0}^{n-1} Y_{k+1}\, K\!\left(\frac{y - Y_k}{h}\right) 1_{\{i\}}(X_{k+1})}{\sum_{k=0}^{n-1} K\!\left(\frac{y - Y_k}{h}\right) 1_{\{i\}}(X_{k+1})}. \qquad (2)$$

This Nadaraya–Watson-type estimator was introduced for HMMs by Harel and Puri (2001).
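
As an illustration (not from the article), a sketch of the complete-data estimator (2), using a Gaussian kernel for $K$;
the helper name and the convention of returning 0 when the denominator vanishes are our own choices (the latter
matches the definition recalled in Section 2.5).

```python
import numpy as np

def nw_regime_estimator(y_grid, Y, X, state, h):
    """Nadaraya-Watson-type estimator (2) of r_i on a grid, given complete data.

    Y     : array of observations Y_0, ..., Y_n (length n+1)
    X     : array of regimes X_1, ..., X_n (length n)
    state : regime label i in {0, ..., m-1}
    h     : bandwidth
    """
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    mask = (X == state)                                          # indicator 1_{i}(X_{k+1})
    num = np.zeros_like(y_grid, dtype=float)
    den = np.zeros_like(y_grid, dtype=float)
    for y_prev, y_next in zip(Y[:-1][mask], Y[1:][mask]):
        w = K((y_grid - y_prev) / h)
        num += y_next * w                                        # sum of Y_{k+1} K((y - Y_k)/h)
        den += w                                                 # sum of K((y - Y_k)/h)
    return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0)
```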
In the first part, we establish uniform consistency, assuming that a realization of the complete data $(Y_{0:n}, X_{1:n})$
is known, with $Y_{0:n} = (Y_0, \dots, Y_n)$ and $X_{1:n} = (X_1, \dots, X_n)$; that is, we prove the convergence over compact
subsets $C \subset \mathbb{R}$,

$$\sup_{y \in C} |\hat r_{i,n}(y) - r_i(y)| \to 0, \quad \text{a.s. (when } n \to \infty).$$

This is an interesting asymptotic result, but the key feature of the MS-NAR models is that, because the state
sequence $\{X_k\}_{k\ge 1}$ is generally not observable, the statistical inference has to be carried out by means of the
observations $\{Y_k\}_{k\ge 0}$ only.
In the non-parametric context, the estimators of the regression functions $r_i(y)$, for each $y$ and $i = 1, \dots, m$, can be
interpreted as solutions $\theta = (\theta_1, \dots, \theta_m)$ of the local weighted least-squares problem

$$U(y; Y_{0:n}, X_{1:n}; \theta) = \frac{1}{nh} \sum_{k=0}^{n-1} \sum_{i=1}^{m} K\!\left(\frac{y - Y_k}{h}\right) 1_{\{i\}}(X_{k+1})\, (Y_{k+1} - \theta_i)^2,$$

where the weights are specified by the kernel $K$, so that the observations $Y_k$ near $y$ have the largest influence on
the estimate of the regression function at $y$. That is,

$$\hat\theta(y) = \operatorname*{argmin}_{\theta \in \mathbb{R}^m} U(y; Y_{0:n}, X_{1:n}; \theta).$$

When a realization of the state sequence $\{X_k\}_{k\ge 1}$ is observed, the solutions of this problem are the Nadaraya–
Watson kernel estimators $\hat r_{i,n}$ defined in (2). Nevertheless, when $\{X_k\}_{k\ge 1}$ is a hidden Markov chain, the solution
must be approximated because it does not admit a closed form.
In the second part, we propose a recursive algorithm for the estimation of the regression functions $r_i$ with a
Monte Carlo step, which restores the missing data $\{X_k\}_{k\ge 1}$ by $X^t_{1:n}$, and a Robbins–Monro procedure, which
allows us to estimate the unknown value of $\theta$. This approximation minimizes the potential $U$ using the gradient
algorithm, for each fixed $y$,

$$\theta^t = \theta^{t-1} - \gamma_t\, \nabla_\theta U\!\left(y; Y_{0:n}, X^t_{1:n}; \theta^{t-1}\right),$$

where $\{\gamma_t\}$ is a sequence of real positive numbers decreasing to 0, and $\nabla_\theta U$ is the gradient of $U$ with respect to
the vector $\theta \in \mathbb{R}^m$.
In a general context, the Robbins–Monro approach is studied in Duflo (1996). Whereas EM-type algorithms
with kernel estimation are used for finite mixtures of non-parametric multi-variate densities in Benaglia et al.
(2009) and of non-parametric autoregressions with independent regimes in Franke et al. (2011), we establish in
the present work the consistency of the estimator obtained by our Robbins–Monro algorithm. This asymptotic
property is obtained for each fixed point $y$.
The article is organized as follows. In Section 2, we present the general conditions on the model ensuring the
existence of a probability density distribution and its stability. We prove that the model satisfies a strong mixing
dependence condition and establish model identifiability. Furthermore, we prove the uniform consistency of the
Nadaraya–Watson kernel estimator in the case of complete data. In Section 3, we prove the main result, namely, the
consistency of the estimator related to our Robbins–Monro algorithm. Section 4 contains some numerical experiments
on simulated data illustrating the performance of our non-parametric estimation procedure. Some of the proofs are
deferred to Appendix A.

2. PRELIMINARIES
We review the key properties of the MS-NAR model, which we shall need later when proving our results. In
addition, we prove the uniform consistency of the Nadaraya–Watson kernel estimator under the assumption that a
realization of the complete data is available.


2.1. Stability and existence of moments

It is somewhat complex to determine the stability of the MS-NAR model. In this section, we recall known results
given by Yao and Attali (1999). Our aim is to summarize sufficient conditions that ensure the existence and
uniqueness of a strictly stationary ergodic solution for the model, as well as the existence, for the respective
stationary distribution, of a moment of order $s > 1$.
E1 The Markov chain $\{X_k\}_{k\ge 1}$ is positive recurrent with probability transition matrix $A = (a_{ij})_{i,j=1:m}$.
   Hence, it has an invariant distribution that we denote by $\pi = (\pi_1, \dots, \pi_m)$.
E2 The functions $r_i$, for $i = 1, \dots, m$, are continuous.
E3 There exist positive constants $\alpha_i, b_i$, $i = 1, \dots, m$, such that for $y \in \mathbb{R}$, the following holds:
   $|r_i(y)| \le \alpha_i |y| + b_i$.
E4 $\sum_{i=1}^{m} \pi_i \log \alpha_i < 0$.
E5 $E(|e_1|^s) < \infty$, for some $s > 1$.
E6 The sequence $\{e_k\}_{k\ge 1}$ of random variables has a common probability density function $\phi(e)$ with respect
   to the Lebesgue measure.
E7 The probability density function $\phi(e)$ is everywhere positive on $\mathbb{R}$.
The MS-NAR process $Y = \{Y_k\}_{k\ge 0}$ is not, in general, a Markov process. However, condition E1 implies that
the extended process $Z = \{(Y_k, X_k)\}_{k\ge 1}$ with state space $E = \mathbb{R} \times \{1, \dots, m\}$ is a Markov chain.
We recall that a Markov chain is Fellerian if and only if, for all continuous bounded functions, the image under
the operator defined by the transition kernel of $Z$ is also bounded and continuous. It is strongly Fellerian if, for
any bounded function, the image is continuous. Under condition E2, $Z$ is a Feller chain, and it is a strong Feller
chain if, in addition, condition E6 holds.
For the problem of the existence and uniqueness of a strictly stationary ergodic solution for the model $Y$, we have,
under condition E1, that a stationary solution exists if and only if the Markov chain $Z$ has an invariant probability
measure. Furthermore, this stationary solution is unique if and only if the invariant probability measure of the
extended process $Z$ is unique. Yao and Attali (1999) first studied the properties of the extended process and then
derived the properties of the marginal process $Y$.
The MS-NAR model is called sublinear if conditions E2 and E3 hold. In Proposition 2.1, we summarize some
results, given by Yao and Attali (1999), for sublinear MS-NAR models.

Proposition 2.1 (Yao and Attali). Consider a sublinear MS-NAR process $Y = \{Y_k\}_{k\ge 0}$. Assuming E1–E7, we have
the following:

(i) There exists a unique stationary geometrically ergodic solution.
(ii) If the spectral radius of the matrix $Q_s = \left(\alpha_j^s\, a_{ij}\right)_{i,j=1,\dots,m}$ is strictly less than 1, with $s$ the same as in
condition E5, then $E(|Y_k|^s) < \infty$.

Remark 2.1. Under conditions E1–E7, there exists a strictly stationary solution for the model $Y$ if and only if the
Markov chain $Z$ has an invariant probability measure. Furthermore, this stationary solution is unique if and only
if the invariant probability measure of $Z$ is unique.
Condition E7 in Proposition 2.1 ensures that the transition kernel of the Markov chain $Z$ is $\varphi$-irreducible, which
implies the uniqueness of the invariant probability measure.
For stability, the moment condition E5 with $s > 1$ is enough, but for the asymptotic properties of the kernel
estimator, it will be necessary that $s > 2$.


2.2. Probability density

We present a technical lemma, one of whose statements establishes the existence of conditional densities of the
MS-NAR model; in addition, we give a factorization of this probability density. This factorization will be very useful
in the next sections.
Let us first introduce some notation: $V_{1:n}$ stands for the random vector $(V_1, \dots, V_n)$, and $v_{1:n} = (v_1, \dots, v_n)$
is a realization of the respective random vector. $p(V_{1:n} = v_{1:n})$ denotes the density of the random
vector $V_{1:n}$ evaluated at $v_{1:n}$. The symbol $1_B(x)$ denotes the indicator function of the set $B$, which takes the value
1 if $x \in B$ and 0 otherwise.
We consider the following assumption:

D1 The random variable $Y_0$ has a density function $p(Y_0 = y_0)$ with respect to the Lebesgue measure.

Lemma 2.1 is relevant in the framework of kernel estimation.

Lemma 2.1. Under conditions D1 and E6,

(i) The random vector $(Y_{0:n}, X_{1:n})$ admits the probability density function
$$p(Y_{0:n} = y_{0:n}, X_{1:n} = x_{1:n}) = \phi(y_n - r_{x_n}(y_{n-1})) \cdots \phi(y_1 - r_{x_1}(y_0))\, a_{x_{n-1}x_n} \cdots a_{x_1 x_2}\, \pi_{x_1}\, p(Y_0 = y_0),$$
with respect to the product measure $\lambda \otimes c$, where $\lambda$ and $c$ denote the Lebesgue and counting measures
respectively.
(ii) If $\phi$ is a bounded density, then the joint densities of $(Y_k, Y_{k'})$ and $(Y_0, Y_{k'})$ satisfy
$$p(Y_k = y_k, Y_{k'} = y_{k'}) \le \|\phi\|_\infty^2 \quad \text{and} \quad p(Y_0 = y_0, Y_{k'} = y_{k'}) \le \|\phi\|_\infty, \quad \text{for } k, k' \ge 1.$$

For the proof of this lemma, we refer the reader to Appendix A.

2.3. Strong mixing

A strictly stationary stochastic process $Y = \{Y_k\}_{k\in\mathbb{Z}}$ is said to be strong mixing ($\alpha$-mixing) if

$$\alpha_n := \sup\{|P(A \cap B) - P(A)P(B)| : A \in \mathcal{M}_{-\infty}^0,\ B \in \mathcal{M}_n^\infty\} \to 0, \quad \text{as } n \to \infty, \qquad (3)$$

where $\mathcal{M}_a^b$, with $a, b \in \mathbb{Z}$, is the $\sigma$-algebra generated by $\{Y_k\}_{k=a:b}$. It is called absolutely regular ($\beta$-mixing) if

$$\beta_n := E\left(\operatorname*{ess\,sup}\{P(B \mid \mathcal{M}_{-\infty}^0) - P(B) : B \in \mathcal{M}_n^\infty\}\right) \to 0, \quad \text{as } n \to \infty. \qquad (4)$$

The values $\alpha_n$ and $\beta_n$ are called the strong mixing and regular mixing coefficients respectively. For properties and
examples under several mixing assumptions, see Doukhan (1994). In general, we have the inequality
$2\alpha_n \le \beta_n \le 1$, which implies that all $\beta$-mixing processes are also $\alpha$-mixing. Note that the $\alpha$-mixing coefficients
can be rewritten as follows:

$$\alpha_n := \sup\{|\operatorname{cov}(\xi, \eta)| : 0 \le \|\xi\|_\infty, \|\eta\|_\infty \le 1,\ \xi \in \mathcal{M}_{-\infty}^0,\ \eta \in \mathcal{M}_n^\infty\}. \qquad (5)$$


The extended process $Z$ is geometrically ergodic. This implies a geometric rate for the $\beta$-mixing coefficients, so
that $Z$ is an $\alpha$-mixing process. From (5), we prove in Proposition 2.2 that the $\alpha$-mixing property of $Z = (Y, X)$ is
transferred to the component $Y$.

Proposition 2.2. The MS-NAR model under conditions E1–E7 is $\alpha$-mixing, and its coefficients $\alpha_n(Y)$
decrease geometrically.

The proof is given in Appendix A.

2.4. Identifiability
We prove the identifiability of the MS-NAR model, following the recent work of Gassiat et al. (2016) on non-
parametric estimation of HMMs. We consider the following assumptions:

I1 The probability transition matrix $A = (a_{ij})_{i,j=1:m}$ has full rank.
I2 The functions $r_1, \dots, r_m$ are distinct a.s.; that is, if $i \neq j$, then $r_i(y') \neq r_j(y')$ for almost all $y'$.
I3 The probability density function $\phi$ is such that the functions $\phi(y - r_1(y')), \dots, \phi(y - r_m(y'))$ are linearly
independent; that is,
$$\sum_{i=1}^{m} \lambda_i\, \phi(y - r_i(y')) = 0 \ \text{ for all } y, y' \quad \Longrightarrow \quad \lambda_1 = \dots = \lambda_m = 0.$$
I4 The probability density function $\phi$ is such that $\phi(y - \tilde r_{\tilde k}(y')) = \phi(y - r_k(y'))$ for all $y$ if and only if
$\tilde r_{\tilde k}(y') = r_k(y')$.

We denote by $p^{(3)}_{A,r}$ the probability density function of $(Y_0, Y_1, Y_2, Y_3)$. Notice that if the Markov chain $X$ is
irreducible, there exists a unique invariant distribution and (Lemma 2.1) $p^{(3)}_{A,r}$ is well defined by

$$p^{(3)}_{A,r} = p(Y_0 = y_0) \sum_{i=1}^{m} \left( \sum_{j=1}^{m} \pi_j\, a_{ji}\, \phi(y_1 - r_j(y_0)) \right) \phi(y_2 - r_i(y_1)) \left( \sum_{j=1}^{m} a_{ij}\, \phi(y_3 - r_j(y_2)) \right),$$

where $p(Y_0 = y_0)$ is the probability density of $Y_0$, and $\pi = (\pi_1, \dots, \pi_m)$ is a stationary distribution of $A$, which
is the distribution of $X_1$. For the case of a non-irreducible Markov chain, the distribution of $X_1$ has to be specified,
as many invariant distributions may arise.

Proposition 2.3. Assume that $m$ is known. Under conditions I1–I4, $A$ and $r$ are identifiable from $p^{(3)}_{A,r}$, up to
label swapping of the hidden states.

The proof of this proposition follows the same idea given in Gassiat et al. (2016).

Proof
We have to prove that if $\tilde A$ is an $m \times m$ probability transition matrix and if $\tilde r = (\tilde r_1, \dots, \tilde r_m)$ are regression
functions such that $p^{(3)}_{\tilde A, \tilde r} = p^{(3)}_{A,r}$, then there exists a permutation $\sigma$ of the set $\{1, \dots, m\}$ such that, for all
$i, j = 1, \dots, m$, $\tilde a_{ij} = a_{\sigma(i)\sigma(j)}$ and $\tilde r_i = r_{\sigma(i)}$.
From conditions I1 and I3, the functions $\left(\sum_{j=1}^{m} \pi_j\, a_{ji}\, \phi(y - r_j(y'))\right)_{i=1:m}$ are linearly independent;
similarly, the functions $\left(\sum_{j=1}^{m} a_{ij}\, \phi(y - r_j(y'))\right)_{i=1:m}$ are also linearly independent. Then, according to
Allman et al. (2009, Theorem 8), there exists a permutation $\sigma$ of the set $\{1, \dots, m\}$ such that, for all $i = 1, \dots, m$,


$$\sum_{j=1}^{m} \tilde\pi_j\, \tilde a_{ji}\, \phi(y_1 - \tilde r_j(y_0)) = \sum_{j=1}^{m} \pi_j\, a_{j\sigma(i)}\, \phi(y_1 - r_j(y_0)),$$
$$\phi(y_2 - \tilde r_i(y_1)) = \phi(y_2 - r_{\sigma(i)}(y_1)),$$
$$\sum_{j=1}^{m} \tilde a_{ij}\, \phi(y_3 - \tilde r_j(y_2)) = \sum_{j=1}^{m} a_{\sigma(i)j}\, \phi(y_3 - r_j(y_2)).$$

Now, using the commutativity of the sum, we obtain

$$\sum_{j=1}^{m} \tilde\pi_j\, \tilde a_{ji}\, \phi(y_1 - r_{\sigma(j)}(y_0)) = \sum_{j=1}^{m} \pi_{\sigma(j)}\, a_{\sigma(j)\sigma(i)}\, \phi(y_1 - r_{\sigma(j)}(y_0)),$$
$$\sum_{j=1}^{m} \tilde a_{ij}\, \phi(y_3 - r_{\sigma(j)}(y_2)) = \sum_{j=1}^{m} a_{\sigma(i)\sigma(j)}\, \phi(y_3 - r_{\sigma(j)}(y_2)).$$

Then, from conditions I3 and I4, $\tilde a_{ij} = a_{\sigma(i)\sigma(j)}$, $\tilde\pi_j\, \tilde a_{ji} = \pi_{\sigma(j)}\, a_{\sigma(j)\sigma(i)}$, and $\tilde r_i = r_{\sigma(i)}$ a.s.

Remark 2.2. Condition I2 implies the identifiability of the regression functions $r_i$ for almost all $y'$. Nevertheless,
the continuity given by condition E2 ensures the identifiability for all $y'$.

Corollary 2.1 applies in the case where the innovation $e$ is a Gaussian white noise.

Corollary 2.1. Assume that $m$ is known and that $\phi$ is the density of a Gaussian distribution with zero mean and
variance $\sigma^2$. Under conditions I1–I2, $A$ and $r$ are identifiable from $p^{(3)}_{A,r}$ up to label swapping of the hidden states.

The proof is given in Appendix A.

2.5. Kernel estimator: fully observed data case

We assume that a realization of the complete data $(Y_{0:n}, X_{1:n})$ is available. We focus on the uniform convergence
over compact sets of the Nadaraya–Watson kernel estimator defined in (2).
For a stationary MS-NAR model, $r(y) = E(Y_1 \mid Y_0 = y)$ is the quantity of interest in the autoregression function
estimation. It can be rewritten as
$$r(y) = \sum_{i=1}^{m} E(Y_1 \mid Y_0 = y, X_1 = i)\, P(X_1 = i).$$
Hence, it is sufficient to estimate each autoregression function
$$r_i(y) = E(Y_1 \mid Y_0 = y, X_1 = i), \qquad (6)$$
for $i = 1, \dots, m$ and $y \in \mathbb{R}$.
Let us introduce
$$g_i(y) := r_i(y)\, f_i(y), \quad \text{and} \quad f_i(y) := \pi_i\, p(Y_0 = y). \qquad (7)$$


The Nadaraya–Watson kernel estimator of $r_i$ is

$$\hat r_{i,n}(y) = \begin{cases} \hat g_{i,n}(y)/\hat f_{i,n}(y) & \text{if } \hat f_{i,n}(y) \neq 0, \\ 0 & \text{otherwise,} \end{cases}$$

with

$$\hat g_{i,n}(y) := \frac{1}{nh} \sum_{k=0}^{n-1} Y_{k+1}\, K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1}), \qquad \hat f_{i,n}(y) := \frac{1}{nh} \sum_{k=0}^{n-1} K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1}), \qquad (8)$$

and $K_h(y) = K(y/h)$.


To obtain the convergence of the ratio estimator $\hat r_{i,n} = \hat g_{i,n}(y)/\hat f_{i,n}(y)$, we apply the method of Collomb
(Ferraty et al., 2001), which studies simultaneously the convergence of $\hat g_{i,n}(y)$ and $\hat f_{i,n}(y)$ as $n$ tends to $\infty$.
First, however, we examine the conditions that allow us to obtain the asymptotic results.
Let us take a kernel $K : \mathbb{R} \to \mathbb{R}$, positive, symmetric, with compact support, such that $\int K(t)\,dt = 1$. We
assume that the kernel $K$ as well as the density $\phi$ are bounded, that is,

B1 $\|K\|_\infty < \infty$.
B2 $\|\phi\|_\infty < \infty$.

Under condition B1, the kernel $K$ is of order 2, that is, $\int t K(t)\,dt = 0$ and $0 < \int t^2 K(t)\,dt < \infty$.
Let $C$ be a compact subset of $\mathbb{R}$. We assume the following regularity conditions:

R1 There exist finite constants $c, \delta > 0$ such that
$$\forall y, y' \in C, \quad |K(y) - K(y')| < c\, |y - y'|^\delta.$$
R2 The density of $Y_0$, $\phi$ and the $r_i$ have continuous second derivatives in the interior of $C$.
R3 For all $k \in \mathbb{N}$, the functions
$$r_{i,k}(t, s) = E(|Y_1 Y_{k+1}| \mid Y_0 = t, Y_k = s, X_1 = i, X_{k+1} = i)$$
are continuous and uniformly bounded with respect to $k$.

We define $g_{2,i}(y) := f_i(y)\, r_{i,0}(y, y) = f_i(y)\, E(Y_1^2 \mid Y_0 = y, X_1 = i)$, which is continuous owing to
condition R3.
Finally, we impose one of the two following moment conditions:

M1 $E(\exp(|Y_0|)) < \infty$ and $E(\exp(|e_1|)) < \infty$.
M2 $E(|Y_0|^s) < \infty$ and $E(|e_1|^s) < \infty$, for some $s > 2$.

Remark 2.3. Note that M1 implies M2 and M2 implies E5. The latter is a sufficient condition for the stability of
the MS-NAR model.
In view of the independence of $Y_0$ and $e_1$, condition M1 implies
$$E(\exp(|Y_1|)) \le c\, E(\exp(|Y_0|))\, E(\exp(|e_1|)),$$

where $c$ is a strictly positive constant. Moreover, E3 and M2 imply $E(|Y_1|^s) < \infty$; this condition is also implied
by M1. Conditions M1 and M2 are assumed to hold: the former so as to obtain the a.s. uniform convergence over
compact sets and the latter for the a.s. pointwise convergence.

We now establish the uniform convergence over compact sets of the Nadaraya–Watson kernel estimator $\hat r_{i,n}$
defined in (2). For this purpose, we introduce the following three technical lemmas; their proofs are given in
Appendix A.
The first lemma allows us to treat in a unified way the asymptotic behaviour of the variances and covariances of
$\hat f_{i,n}$ and a truncated version of $\hat g_{i,n}$. The other two lemmas give asymptotic bounds for the bias and variance
terms in the estimation of the regression functions $r_i$.
We denote by $A^{(k)}_{ij}$ the $(i,j)$th entry of the $k$th power of the matrix $A$. We write $B_{n,h} \approx B_h$ to mean that
$\lim_{h\to 0}\lim_{n\to\infty} B_{n,h} = \lim_{h\to 0} B_h$; that is, for large enough $n$ and small enough $h$, $B_{n,h}$ is approximately
equal to $B_h$. Analogously, we write $B_{n,h} \preceq B_h$ to mean that $\lim_{h\to 0}\lim_{n\to\infty} B_{n,h} \le \lim_{h\to 0} B_h$. In particular,
we write $B_{n,h} \preceq B$ to mean that $B$ is a bound for the sequence $B_{n,h}$, for large enough $n$ and small enough $h$.

Lemma 2.2. Assume that the MS-NAR model satisfies conditions E1–E7, D1, B1–B2, S1 and R2–R3 on a
compact set $C$. Let $\{M_n\}_{n\ge 1}$ be a non-decreasing sequence of positive numbers tending to infinity. Let
$$T_{k,n} = a\, K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1}) + b\, Y_{k+1}\, 1_{\{|Y_{k+1}| \le M_n\}}\, K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1}).$$
Then the following statements hold, for all $y \in C$:

(i) $\operatorname{var}(T_{0,n}) \approx h\left(a^2 f_i(y) + 2ab\, g_i(y) + b^2 g_{2,i}(y)\right)\|K\|_2^2 + o(h^2)$.
(ii) $\operatorname{cov}(T_{0,n}, T_{k,n}) \preceq h^2\left(a^2 + 2ab\,(|r_i(y)| + E(|e_1|)) + b^2 r_{i,k}(y,y)\right) A^{(k)}_{i,i}\, \pi_i\, \|\phi\|_\infty^2 + o(h^3)$.
(iii) $\operatorname{cov}(T_{0,n}, T_{k,n}) \le (a^2 + 2ab M_n + b^2 M_n^2)\, 4\|K\|_\infty^2\, \alpha_k$, for any $n > 0$.

Lemma 2.3. Assume that the MS-NAR model satisfies conditions E1–E4, E6–E7, D1, B1–B2, S1, M2 and
R2–R3 on a compact set $C$. Let $\{M_n\}_{n\ge 1}$ be a positive non-decreasing sequence tending to infinity, and let $\gamma > 1$.
Then the following asymptotic inequalities hold true, for all $y \in C$:

(i) $P\left(|\hat g_{i,n}(y) - E\hat g_{i,n}(y)| > \varepsilon\right) \preceq 4\left(1 + \dfrac{\varepsilon^2 nh}{16 c_1}\right)^{-\gamma/2} + c_2\, \dfrac{16 M_n}{\varepsilon h}\, \varrho^{u_n} + c_3\, \dfrac{M_n^{(2-s)}}{\varepsilon^2 h^2}$.
(ii) $P\left(|\hat f_{i,n}(y) - E\hat f_{i,n}(y)| > \varepsilon\right) \preceq 4\left(1 + \dfrac{\varepsilon^2 nh}{16 \tilde c_1}\right)^{-\gamma/2} + c_2\, \dfrac{16}{\varepsilon h}\, \varrho^{u_n}$.

Here, $c_1 = \sup_{y\in C} g_{2,i}(y)\,\|K\|_2^2$, $\tilde c_1 = \sup_{y\in C} f_i(y)\,\|K\|_2^2$, $c_2 > 0$ and $0 < \varrho < 1$ are such that the mixing
coefficient $\alpha_n(Y) \le c_2\, \varrho^n$, $u_n = (h \log n)^{-1}$, and $c_3 = \|K\|_\infty^2$.

Lemma 2.4. Assume that the MS-NAR model satisfies conditions E1–E7, D1 and R2 on a compact set $C$. Then
the following statements hold true:

(i) $\sup_{y \in C} |E\hat g_{i,n}(y) - g_i(y)| = O(h^2)$.
(ii) $\sup_{y \in C} |E\hat f_{i,n}(y) - f_i(y)| = O(h^2)$.

Remark 2.4. Lemma 2.2 is a preliminary result, which is necessary to prove Lemma 2.3.

Let $\{h_n\}_{n\ge 1}$ be a sequence of real numbers satisfying the following condition:


S1 For all $n > 0$, $h_n > 0$, $\lim_{n\to\infty} h_n = 0$ and $\lim_{n\to\infty} n h_n = \infty$.

Theorem 2.1. Assume that the MS-NAR model (1) satisfies conditions E1–E4, E6–E7, D1, B1–B2, S1 and
R1–R3 on a compact set $C$. Then:

(i) If $nh_n/\log n \to \infty$ and condition M2 holds, then for all $y \in C$, $|\hat r_{i,n}(y) - r_i(y)| \to 0$, a.s.
(ii) If $nh_n/\log n \to \infty$ and condition M1 holds, then $\sup_{y\in C} |\hat r_{i,n}(y) - r_i(y)| \to 0$, a.s.

The proof of this theorem is given in Appendix A.

3. MAIN RESULTS
We present our Robbins–Monro-type algorithm for the non-parametric estimation of the MS-NAR model in the
partially observed data case, and we prove the consistency of the estimator.
The Nadaraya–Watson estimator $\hat r_n(y) = (\hat r_{1,n}(y), \dots, \hat r_{m,n}(y))$, for each $y$, can be interpreted as the solution
of a locally weighted least-squares problem; in our case, this has to do with finding the minimum of the potential
$U$ defined by
$$U(y; Y_{0:n}, X_{1:n}; \theta) = \frac{1}{nh} \sum_{k=0}^{n-1} \sum_{i=1}^{m} K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1})\, (Y_{k+1} - \theta_i)^2, \qquad (9)$$
with respect to $\theta = (\theta_1, \dots, \theta_m)$ in a convex open set $\Theta$ of $\mathbb{R}^m$. Thus, the regression estimator $\hat r_n$ is given by
$$\hat r_n(y) = \operatorname*{argmin}_{\theta \in \mathbb{R}^m} U(y; Y_{0:n}, X_{1:n}; \theta).$$
In the partially observed data case, that is, when we do not observe $\{X_k\}_{k\ge 1}$, we cannot obtain an explicit
expression for the solution $\hat r_n(y)$. Thus, we must consider a recursive algorithm for the approximation of this solution.
Our approach approximates the estimator $\hat r_n(y)$ by a stochastic recursive algorithm similar to that of Robbins–
Monro (Cappé et al., 2005; Duflo, 1996; Yao, 2000). This involves two steps: first, a Monte Carlo step that restores
the missing data $\{X_k\}_{k\ge 1}$ and, second, a Robbins–Monro approximation so as to minimize the potential $U$.
At this point, we introduce some further notation. For $1 \le i \le m$, $n_i(X_{1:n}) = \sum_{k=1}^{n} 1_{\{i\}}(X_k)$ is the number of
visits of the Markov chain $\{X_k\}_{k\ge 1}$ to state $i$ in the first $n$ steps, and $n_{ij}(X_{1:n}) = \sum_{k=1}^{n-1} 1_{\{(i,j)\}}(X_k, X_{k+1})$ is the
number of transitions from $i$ to $j$ in the first $n$ steps. $\psi^t = (\theta^t, A^t)$ is a vector containing the estimated functions
$\theta^t = (\theta_1^t, \dots, \theta_m^t)$ and the estimated probability transition matrix $A^t$, at the $t$th iteration of the Robbins–Monro
algorithm.

3.1. Restoration-estimation Robbins–Monro algorithm

For each fixed $y$,

Step 0. Pick an arbitrary initial realization $X^0_{1:n}$. Compute the estimated regression functions $\hat r_n^0(y) =
(\hat r_{1,n}^0(y), \dots, \hat r_{m,n}^0(y))$ from equation (2) in terms of the observed data $Y_{0:n}$ and the initial realization
$X^0_{1:n}$. Compute the estimated transition matrix $A^0 = (a^0_{ij})_{i,j=1:m}$, by $a^0_{ij} = n_{ij}(X^0_{1:n})/n_i(X^0_{1:n})$,
and the initial measure $\pi^0 = (\pi^0_{1:m})$ by $\pi^0_i = n_i(X^0_{1:n})/n$, for $i, j = 1, \dots, m$. Define $\theta^0(y; Y_{0:n}) =
\hat r_n^0(y)$, which will be denoted simply by $\theta^0$.

For $t \ge 1$,

Step R. Restore the corresponding unobserved data by drawing a sample $X^t_{1:n}$ from the conditional distribution
$p(X_{1:n} \mid Y_{0:n}, \psi^{t-1})$.

Step E. Update the estimation $\psi^t = (\theta^t, A^t)$ by
$$\theta^t = \theta^{t-1} - \gamma_t\, \nabla_\theta U\!\left(y; Y_{0:n}, X^t_{1:n}; \theta^{t-1}\right), \qquad (10)$$
where $\nabla_\theta U\!\left(y; Y_{0:n}, X^t_{1:n}; \theta^{t-1}\right) = \nabla_\theta U\!\left(y; Y_{0:n}, X^t_{1:n}; \theta\right)\big|_{\theta = \theta^{t-1}}$, $A^t = (a^t_{ij})_{i,j=1:m}$ by
$a^t_{ij} = n_{ij}(X^t_{1:n})/n_i(X^t_{1:n})$, and $\pi^t = (\pi^t_i)_{i=1:m}$ by $\pi^t_i = n_i(X^t_{1:n})/n$.

Step A. Reduce the asymptotic variance of the algorithm by using the averages $\bar\theta^t = \frac{1}{t}\sum_{k=1}^{t} \theta^k$ instead of $\theta^t$,
which can be recursively computed by $\bar\theta^0 = \theta^0$ and
$$\bar\theta^t = \bar\theta^{t-1} + \frac{1}{t}\left(\theta^t - \bar\theta^{t-1}\right). \qquad (11)$$
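
For concreteness, here is a schematic Python sketch of Steps R, E and A at a fixed point $y$ (not the authors' code). It
assumes a routine `sample_hidden_path` implementing the conditional simulation of Step R (e.g. the Carter and Kohn
sampler described below), takes $\gamma_t = 1/t$, and absorbs constant factors of the gradient into the step size, so each
component of $\theta$ is pushed towards the local ratio $\hat g_{i,n}(y)/\hat f_{i,n}(y)$.

```python
import numpy as np

def rm_restoration_estimation(y, Y, m, h, T, sample_hidden_path, rng=None):
    """Sketch of the restoration-estimation Robbins-Monro loop at a fixed y.

    sample_hidden_path(Y, theta, A, rng) -> X_{1:n}   (Step R, assumed given)
    Returns the averaged iterate theta_bar (Step A) and the final A^t.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(Y) - 1
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    theta = np.zeros(m)                  # theta^0 (in practice: SAEM-based Step 0)
    A = np.full((m, m), 1.0 / m)         # A^0
    theta_bar = theta.copy()
    for t in range(1, T + 1):
        gamma = 1.0 / t                                   # decreasing step sizes
        X = sample_hidden_path(Y, theta, A, rng)          # Step R: restore X^t_{1:n}
        w = K((y - Y[:-1]) / h) / (n * h)                 # K_h(y - Y_k)/(nh), k = 0..n-1
        for i in range(m):
            mask = (X == i)
            g_hat = np.sum(w[mask] * Y[1:][mask])         # \hat g_{i,n}(y)
            f_hat = np.sum(w[mask])                       # \hat f_{i,n}(y)
            theta[i] += gamma * (g_hat - theta[i] * f_hat)   # Step E (gradient step)
        counts = np.zeros((m, m))
        for a, b in zip(X[:-1], X[1:]):                   # empirical transition counts
            counts[a, b] += 1.0
        A = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
        theta_bar += (theta - theta_bar) / t              # Step A: averaging, eq. (11)
    return theta_bar, A
```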

The following result enables us to write the algorithm as a stochastic gradient algorithm. Let
$E^{\psi^0}\!\left(U(y; Y_{0:n}, X^t_{1:n}; \theta) \mid \mathcal{F}_{t-1}\right) = u(y; Y_{0:n}; \theta)$, with $E^{\psi^0}(\cdot) = E(\cdot \mid Y_{0:n}, \psi^0)$, $\psi^0 = (\theta^0, A^0)$, and $\mathcal{F}_{t-1}$ the
$\sigma$-algebra generated by $\{X^s_{1:n}\}_{s=1:(t-1)}$. This conditional expectation is in fact the expectation with respect to the
conditional distribution function $p(X^t_{1:n} \mid Y_{0:n}, \psi^0)$. The proof is given in Appendix A.

Lemma 3.1. For each $\theta \in \Theta$, we have
$$u(y; Y_{0:n}; \theta) = \frac{1}{nh} \sum_{k=0}^{n-1} \sum_{i=1}^{m} K_h(y - Y_k)\, P(X_{k+1} = i \mid Y_{0:n}, \psi^0)\, (Y_{k+1} - \theta_i)^2, \qquad (12)$$
and $E^{\psi^0}\!\left(\nabla_\theta U(y; Y_{0:n}, X^t_{1:n}; \theta) \mid \mathcal{F}_{t-1}\right) = \nabla_\theta u(y; Y_{0:n}; \theta)$.

Therefore, the restoration-estimation algorithm is a stochastic gradient algorithm that minimizes $u(y; Y_{0:n}; \theta)$
and can be written as

$$\theta^t = \theta^{t-1} + \gamma_t\left(-\nabla_\theta u(y; Y_{0:n}; \theta^{t-1}) + \varsigma_t\right), \qquad (13)$$

where

$$\varsigma_t = -\nabla_\theta U\!\left(y; Y_{0:n}, X^t_{1:n}; \theta^{t-1}\right) + \nabla_\theta u(y; Y_{0:n}; \theta^{t-1}). \qquad (14)$$

Thus, the stochastic gradient algorithm is obtained by perturbation of the following gradient system:

$$\dot\theta = -\nabla_\theta u(y; Y_{0:n}; \theta).$$

In what follows, we describe in detail each step of the algorithm.

Step 0: SAEM algorithm

For this step, a stochastic approximation version of the EM algorithm (SAEM), proposed by Delyon et al. (1999), is
used to maximize the likelihood of the data, assuming that the regression functions are linear and the noise is Gaussian.
This algorithm proved to be more computationally efficient than a classical Monte Carlo EM algorithm, owing to
the recycling of simulations from one iteration to the next in the smoothing phase of the algorithm. The SAEM
algorithm used here is detailed in Cappé et al. (2005, Section 11.1.6).

Step R: Carter and Kohn filter

Step R of the algorithm corresponds to a conditional simulation given $\psi^{t-1}$. Let $\pi^{t-1} = (\pi_1^{t-1}, \dots, \pi_m^{t-1})$
be the probability measure of $X_1$ defined by $\pi_i^{t-1} = n_i(X^{t-1}_{1:n})/n$, for $i = 1, \dots, m$, and $A^{t-1} = (a^{t-1}_{ij})_{i,j=1:m}$
defined by $a^{t-1}_{ij} = n_{ij}(X^{t-1}_{1:n})/n_i(X^{t-1}_{1:n})$. We describe the sampling method for the conditional distribution

$$p(X_{1:n} = x_{1:n} \mid Y_{0:n}, \psi^{t-1}) = \frac{\pi^{t-1}_{x_1}\, p(Y_1 \mid Y_0, X_1 = x_1, \psi^{t-1}) \cdots a^{t-1}_{x_{n-1} x_n}\, p(Y_n \mid Y_{n-1}, X_n = x_n, \psi^{t-1})}{p(Y_{1:n} \mid Y_0, \psi^{t-1})},$$

for all $x_{1:n} \in \{1, \dots, m\}^N$.

Carter and Kohn (1994) obtained samples of $X_{1:n}$ following a stochastic version of the HMM forward–backward
algorithm first proposed by Baum et al. (1970). This method exploits the fact that $p(X_{1:n} \mid Y_{0:n}, \psi^{t-1})$ can
be decomposed as

$$p(X_{1:n} \mid Y_{0:n}, \psi^{t-1}) = p(X_n \mid Y_{0:n}, \psi^{t-1}) \prod_{k=1}^{n-1} p(X_k \mid X_{k+1}, Y_{0:k}, \psi^{t-1}).$$

Provided that $X_{k+1}$ is known, $p(X_k \mid X_{k+1}, Y_{0:k}, \psi^{t-1})$ is a discrete distribution. The following sampling strategy
is suggested: for $k = 2, \dots, n$ and $i = 1, \dots, m$, compute recursively the optimal filter by means of

$$p(X_k = i \mid Y_{0:k}, \psi^{t-1}) \propto p(Y_k \mid Y_{k-1}, X_k = i, \psi^{t-1}) \sum_{j=1}^{m} a^{t-1}_{ji}\, p(X_{k-1} = j \mid Y_{0:k-1}, \psi^{t-1}).$$

Then sample $X_n$ from $p(X_n \mid Y_{0:n}, \psi^{t-1})$ and, for $k = n-1, \dots, 1$, sample $X_k$ using

$$p(X_k = i \mid X_{k+1} = x_{k+1}, Y_{0:k}, \psi^{t-1}) = \frac{a^{t-1}_{i x_{k+1}}\, p(X_k = i \mid Y_{0:k}, \psi^{t-1})}{\sum_{j=1}^{m} a^{t-1}_{j x_{k+1}}\, p(X_k = j \mid Y_{0:k}, \psi^{t-1})}.$$
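
A compact sketch of this forward-filtering/backward-sampling step follows (our assumptions: Gaussian innovations
with known standard deviation `sigma`, a flat initial law for $X_1$, and the current regression estimates passed as
callables `reg`); names and details are illustrative.

```python
import numpy as np

def carter_kohn_sample(Y, reg, A, sigma, rng):
    """Draw X_{1:n} from p(X_{1:n} | Y_{0:n}, psi^{t-1}) by forward filtering and
    backward sampling, with emission p(Y_k | Y_{k-1}, X_k = i) = phi(Y_k - reg[i](Y_{k-1}))."""
    n, m = len(Y) - 1, len(reg)
    def emission(k):                                # vector over regimes i
        means = np.array([r(Y[k - 1]) for r in reg])
        return np.exp(-0.5 * ((Y[k] - means) / sigma) ** 2)
    filt = np.zeros((n + 1, m))                     # filt[k] = p(X_k = . | Y_{0:k})
    filt[1] = emission(1)                           # flat initial law for X_1
    filt[1] /= filt[1].sum()
    for k in range(2, n + 1):                       # forward: optimal filter recursion
        filt[k] = emission(k) * (filt[k - 1] @ A)
        filt[k] /= filt[k].sum()
    X = np.zeros(n + 1, dtype=int)
    X[n] = rng.choice(m, p=filt[n])                 # sample X_n from the filter
    for k in range(n - 1, 0, -1):                   # backward: X_k | X_{k+1}, Y_{0:k}
        probs = A[:, X[k + 1]] * filt[k]
        X[k] = rng.choice(m, p=probs / probs.sum())
    return X[1:]                                    # X_1, ..., X_n
```

A small wrapper fixing `reg` and `sigma` would let this routine play the role of `sample_hidden_path` in the loop
sketched after Step A above.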

Following the proof reported in Rosales (2004), the sequence $\{X^t_{1:n}\}_{t\in\mathbb{N}}$ is an ergodic Markov
chain with invariant distribution $p(X_{1:n} = x_{1:n} \mid Y_{0:n}, \psi^*)$. It is sufficient to note that the sequence $\{X^t_{1:n}\}_{t\in\mathbb{N}}$
is an irreducible and aperiodic Markov chain on a finite state space, $\{1, \dots, m\}^N$. Irreducibility and aperiodicity
follow directly from the positivity of the kernel,

$$Q(X^t_{1:n} \mid X^{t-1}_{1:n}, Y_{0:n}, \psi^{t-1}) = p(X^t_n \mid Y_{0:n}, \psi^{t-1}) \prod_{k=1}^{n-1} p(X^t_k \mid X^t_{k+1}, Y_{0:k}, \psi^{t-1}) > 0.$$

In this case, the standard ergodic result for finite Markov chains applies (Kemeny and Snell, 1960):

$$\left\| Q(X^t_{1:n} \mid X^{t-1}_{1:n}, Y_{0:n}, \psi^{t-1}) - p(X_{1:n} \mid Y_{0:n}, \psi^*) \right\| \le c\, \rho^t. \qquad (15)$$

Moreover, (15) is satisfied with $c = \operatorname{card}(\{1, \dots, m\}^N)$, $\rho = (1 - 2 Q_x)$ and $Q_x = \inf_{x', \psi'} Q(x' \mid x, \psi')$ for
$x, x' \in \{1, \dots, m\}^N$.

Step E: Estimation
In each iteration of this algorithm, we evaluate $\nabla_\theta U(y; Y_{0:n}, X_{1:n}; \theta)$, the gradient of the potential. For each
$1 \le i \le m$, we compute the components

$$\frac{\partial U}{\partial \theta_i}(y; Y_{0:n}, X_{1:n}; \theta) = \hat g_{i,n}(y; Y_{0:n}, X_{1:n}) - \theta_i\, \hat f_{i,n}(y; Y_{0:n}, X_{1:n}).$$

In each iteration, this quantity is updated. This has the advantage that the ratio $\hat r_{i,n}$ is not computed directly,
avoiding the zeros of the function $\hat f_{i,n}$.

Step A: Average (or aggregation)

To reduce the asymptotic variance of the estimated parameters $\theta^t$, we adopt the averaging technique introduced
by Polyak and Juditsky (1992). The idea is to use $\bar\theta^t$, with $\bar\theta^t = \frac{1}{t}\sum_{k=1}^{t} \theta^k$, instead of $\theta^t$, which can be
recursively computed by means of equation (11).

3.2. Consistency
The convergence analysis of Robbins–Monro approximations is well studied in Duflo (1996) in the general case.
In this article, we use a framework for the convergence of the stochastic gradient algorithm similar to the one used
for the likelihood function in HMMs by Cappé et al. (2005, p. 431). A consideration, for our particular
case, is that $u(\cdot)$ is a continuously differentiable function of $\theta$. The following convergence result is given for each
fixed $y$.

Theorem 3.1. Assume condition B1, that $\{\gamma_t\}$ is a positive sequence such that $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$,
and that the closure of the set $\{\bar\theta^t\}$ is a compact subset of $\Theta$. Then, almost surely, the sequence $\{\bar\theta^t\}$ satisfies
$\lim_{t\to\infty} \nabla_\theta u(y; Y_{0:n}; \bar\theta^t) = 0$. Furthermore, $\lim_{t\to\infty} \bar\theta^t = \theta^*$ and $\nabla_\theta u(y; Y_{0:n}; \theta^*) = 0$, a.s.

Proof
Let $M^t = \sum_{s=0}^{t} \gamma_s \varsigma_s$. The sequence $\{M^t\}$ is an $\mathcal{F}_t$-martingale; in fact,

$$E(M^t \mid \mathcal{F}_{t-1}) = E(\gamma_t \varsigma_t + M^{t-1} \mid \mathcal{F}_{t-1}) = E(\gamma_t \varsigma_t \mid \mathcal{F}_{t-1}) + E(M^{t-1} \mid \mathcal{F}_{t-1}) = M^{t-1}.$$

Moreover, it satisfies $\sum_{t=1}^{\infty} E(\|M^t - M^{t-1}\|^2 \mid \mathcal{F}_{t-1}) < \infty$. Indeed,

$$E(\|M^t - M^{t-1}\|^2 \mid \mathcal{F}_{t-1}) = \gamma_t^2\, E(\|\varsigma_t\|^2 \mid \mathcal{F}_{t-1})$$

and

$$\|\varsigma_t\|^2 = \frac{4}{n^2 h^2} \sum_{i=1}^{m} \left( \sum_{k=0}^{n-1} (Y_{k+1} - \theta_i^{t-1})\, K_h(y - Y_k)\, B_i^t(k) \right)^2,$$

where $B_i^t(k) = 1_{\{i\}}(X^t_{k+1}) - E\left(1_{\{i\}}(X^t_{k+1}) \mid \mathcal{F}_{t-1}\right)$ are centred Bernoulli-type random variables. Therefore,

$$E(\|\varsigma_t\|^2 \mid \mathcal{F}_{t-1}) = \frac{4}{n^2 h^2} \sum_{i=1}^{m} \sum_{k,k'=0}^{n-1} (Y_{k+1} - \theta_i^{t-1})(Y_{k'+1} - \theta_i^{t-1})\, K_h(y - Y_k)\, K_h(y - Y_{k'})\, \Gamma_i^t(k, k'),$$

with $\Gamma_i^t(k, k') = \operatorname{cov}\left(1_{\{i\}}(X^t_{k+1}),\ 1_{\{i\}}(X^t_{k'+1}) \mid \mathcal{F}_{t-1}\right)$. Thus, by the Cauchy–Schwarz inequality, we have

$$\Gamma_i^t(k, k') \le \sqrt{\operatorname{var}(B_i^t(k) \mid \mathcal{F}_{t-1})}\, \sqrt{\operatorname{var}(B_i^t(k') \mid \mathcal{F}_{t-1})} \le 1/4,$$

and

$$E(\|\varsigma_t\|^2 \mid \mathcal{F}_{t-1}) \le \frac{1}{n^2 h^2} \sum_{i=1}^{m} \left( \sum_{k=0}^{n-1} (Y_{k+1} - \theta_i^{t-1})\, K_h(y - Y_k) \right)^2 = \|\Phi(\theta^{t-1})\|^2, \qquad (16)$$


where $\Phi(\theta) = (\Phi_1(\theta), \dots, \Phi_m(\theta))$ and $\Phi_i(\theta) = \frac{1}{nh} \sum_{k=0}^{n-1} (Y_{k+1} - \theta_i)\, K_h(y - Y_k)$.
By compactness, $\|\Phi(\cdot)\|^2$ is finite; therefore

$$\sum_{t=1}^{\infty} E(\|M^t - M^{t-1}\|^2 \mid \mathcal{F}_{t-1}) \le \|\Phi(\theta^{t-1})\|^2 \sum_{t=1}^{\infty} \gamma_t^2 < \infty.$$

Thus, by applying the conditional Borel–Cantelli lemma in Cappé et al. (2005, Lemma 11.2.9), we see that the
sequence $\{M^t\}$ has a finite limit a.s., and according to Cappé et al. (2005, Theorem 11.3.2), the sequence $\{\theta^t\}$
satisfies
$$\lim_{t\to\infty} \nabla_\theta u(y; Y_{0:n}; \theta^t) = 0.$$

By continuity of the function $\nabla_\theta u$, we prove that $\theta^* = \lim_{t\to\infty} \theta^t$ satisfies $\nabla_\theta u(y; Y_{0:n}; \theta^*) = 0$, and by the
Cesàro theorem, $\lim_{t\to\infty} \bar\theta^t = \theta^*$.
We have shown in Section 3.1 that the sequence $\{X^t_{1:n}\}_{t\in\mathbb{N}}$ is an ergodic Markov chain with invariant
distribution given by $p(X_{1:n} = x_{1:n} \mid Y_{0:n}, \psi^*)$. The rate of convergence is given by equation (15). Moreover, for all
$x_{1:n} \in \{1, \dots, m\}^N$, this invariant distribution satisfies

$$p(X_{1:n} = x_{1:n} \mid Y_{0:n}, \psi^*) = \frac{\pi^*_{x_1}\, p(Y_1 \mid Y_0, X_1 = x_1, \psi^*) \cdots a^*_{x_{n-1} x_n}\, p(Y_n \mid Y_{n-1}, X_n = x_n, \psi^*)}{p(Y_{1:n} \mid Y_0, \psi^*)},$$

where $\psi^* = (\theta^*, A^*)$, $\theta^* = \lim_{t\to\infty} \bar\theta^t$ is the limit obtained in Theorem 3.1, and $A^* = \lim_{t\to\infty} A^t$ is the
probability transition matrix of the limit Markov chain $X^*$; that is, $A^* = (a^*_{ij}(Y_{0:n}))_{i,j=1:m}$ given by

$$a^*_{ij}(Y_{0:n}) = p(X_{k+1} = j \mid X_k = i, Y_{0:n}, \psi^*),$$

and $\pi^* = (\pi^*_i)_{i=1:m}$ with $\pi^* = \lim_{t\to\infty} \pi^t$.
Since for all $t > 0$ we have $\pi^t A^t = \pi^t$, we deduce that $\pi^* A^* = \pi^*$. Now, taking $n \to \infty$, it is easy to verify
that $A^*(Y_{0:n}) \to A$ and $\pi^*(Y_{0:n}) \to \pi$, where $A$ is the probability transition matrix and $\pi$ an invariant measure
of the Markov chain $X$.
On the other hand, the $i$th component $\theta^*_i$ of the critical point $\theta^* \in \Theta \subset \mathbb{R}^m$ of the gradient is given by

$$\theta^*_i(y; Y_{0:n}) = \frac{\sum_{k=0}^{n-1} Y_{k+1}\, K_h(y - Y_k)\, p(X_{k+1} = i \mid Y_{0:n}, \psi^*)}{\sum_{k=0}^{n-1} K_h(y - Y_k)\, p(X_{k+1} = i \mid Y_{0:n}, \psi^*)} = \frac{E\left(\hat g_{i,n}(y) \mid Y_{0:n}, \psi^*\right)}{E\left(\hat f_{i,n}(y) \mid Y_{0:n}, \psi^*\right)}.$$

Theorem 2.1 implies that, if $nh_n/\log n \to \infty$, then for all $y \in C$, $\hat g_{i,n}(y) \to g_i(y)$ and $\hat f_{i,n}(y) \to f_i(y)$ a.s. as
$n \to \infty$. This implies that $E(\hat g_{i,n}(y) \mid Y_{0:n}, \psi^*) \to g_i(y)$, $E(\hat f_{i,n}(y) \mid Y_{0:n}, \psi^*) \to f_i(y)$, and
$\theta^*_i(y; Y_{0:n}) \to r_i(y)$, a.s. as $n \to \infty$. As a consequence, we obtain

$$\lim_{n\to\infty} \lim_{t\to\infty} \theta^t_i(y; Y_{0:n}) = r_i(y), \quad \text{a.s.}$$

Remark 3.1. Note that $\int f_i(y)\,dy = \pi_i$. Thus, if the compact set $C$ is such that $P(Y_0 \in C) = 1$, then
$\int_C \hat f_{i,n}(y)\,dy \to \pi_i$, when $n \to \infty$.

4. NUMERICAL EXAMPLES
We illustrate the performance of the algorithm developed in the previous section by applying it to simulated
data.


4.1. Example 1
In this first example, we use an MS-NAR model with $m = 2$ states and autoregressive functions

$$r_1(y) = 0.7y + 2e^{-(10y)^2}, \qquad r_2(y) = \frac{2}{1 + e^{10y}} - 1,$$

where $r_1$ is a bump function and $r_2$ is a decreasing logistic function. These functions were reported by Franke et al.
(2011). Let $\phi$ be the density of a Gaussian distribution with zero mean and variance $\sigma^2 = 0.4$. The transition
probability matrix is given by

$$A = \begin{pmatrix} 0.98 & 0.02 \\ 0.02 & 0.98 \end{pmatrix}.$$

We used a straightforward implementation of the algorithms described earlier. We generate a sample of length
$n = 1000$. For each $k$, we simulate $X_k$ and then use it to determine $Y_k$. The simulated data are plotted in Figure 1
(left).
For the estimation of the regression functions $r_i$, we use the standard Gaussian density as the kernel function
$K$, in spite of the fact that it is not compactly supported. As bandwidth parameter, we take $h = (n/\log(n))^{-1/5}$.
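
A short driver for this example, reusing the hypothetical helpers sketched in Section 1 (`simulate_ms_nar`,
`nw_regime_estimator`); the regression functions and the bandwidth rule follow the choices stated above.

```python
import numpy as np

# Regression functions and transition matrix of Example 1
r1 = lambda y: 0.7 * y + 2.0 * np.exp(-(10.0 * y) ** 2)    # bump component
r2 = lambda y: 2.0 / (1.0 + np.exp(10.0 * y)) - 1.0        # decreasing logistic
A = np.array([[0.98, 0.02],
              [0.02, 0.98]])

n = 1000
Y, X = simulate_ms_nar(n, A, [r1, r2], noise_std=np.sqrt(0.4), seed=1)

h = (n / np.log(n)) ** (-1.0 / 5.0)                        # bandwidth rule of the example
y_grid = np.linspace(Y.min(), Y.max(), 200)
r1_hat = nw_regime_estimator(y_grid, Y, X, state=0, h=h)   # complete-data estimates
r2_hat = nw_regime_estimator(y_grid, Y, X, state=1, h=h)
```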
Assuming that the complete data $(Y_{0:n}, X_{1:n})$ are available, we show in Figure 1 (right) the performance of $\hat r_1$
and $\hat r_2$.
We implemented the restoration-estimation algorithm for the data described earlier. The initial estimates for the
Markov chain $X^0_{1:n}$ in Step 0 of our algorithm were obtained by using a SAEM algorithm for the MS-AR model

$$Y_n = \alpha_{X_n} Y_{n-1} + b_{X_n} + \sigma_{X_n} e_n.$$

The estimated linear functions were of the form $\hat r_1(y) = 0.8239y - 0.0218$ and $\hat r_2(y) = 0.2943y + 0.6334$.
Figure 2 (left) shows the scatter plot of $Y_k$ against $Y_{k-1}$ and the linear adjustment. In Figure 2 (right), we show
the scatter plot of $Y_k$ against $Y_{k-1}$, $r_1$ and $r_2$ (solid lines) and the respective Robbins–Monro estimates (dashed
lines) for the last iteration.
We have implemented our Robbins–Monro procedure with $t = 1{:}T$ iterations and with the smoothing step
defined as

$$\gamma_t = \begin{cases} 1 & t \le T_1, \\ (t - T_1)^{-1} & t \ge T_1 + 1. \end{cases}$$

[Figure 1. Simulated data $Y_{0:n}$ (left). Estimated regression functions for the complete data $(Y_{0:n}, X_{1:n})$; the real
functions are shown with solid lines and the estimates with dashed lines (right).]

The estimated probability transition matrix obtained was

$$\hat A = \begin{pmatrix} 0.983 & 0.017 \\ 0.017 & 0.983 \end{pmatrix}.$$

The squared estimation error $\|A^t - A\|_2^2$ is shown in Figure 3; we observe that convergence is reached in fewer
than 100 iterations.

[Figure 2. Parameter estimation for SAEM and scatter plot of simulated data (left). Non-parametric estimation by the
Robbins–Monro procedure; the real functions are shown with solid lines and the estimates with dashed lines (right). The
points are labelled with respect to the real state of $X_k$: dots for $X_k = 1$ and x's for $X_k = 2$.]

[Figure 3. The squared estimation error $\|A^t - A\|_2^2$ for $t = 1{:}T$.]


4.2. Example 2
In this example, we consider an MS-NAR model with $m = 3$. The autoregressive functions are

$$r_1(y) = 0.7y + 2e^{-(10y)^2}, \qquad r_2(y) = \frac{2}{1 + e^{10y}} - 1, \qquad r_3(y) = 2\cos(y) - 1.$$

The functions $r_1$ and $r_2$ are the same as those considered in Example 4.1. We take $\phi$ to be a Gaussian density with
zero mean and variance $\sigma^2 = 0.4$. The transition matrix is given by

$$A = \begin{pmatrix} 0.98 & 0.01 & 0.01 \\ 0.01 & 0.98 & 0.01 \\ 0.01 & 0.01 & 0.98 \end{pmatrix}.$$

We simulate a sample path for $(Y, X)$ of size $n = 3000$. The data are shown in Figure 4.
We implemented our Robbins–Monro algorithm for the data $Y$, considering that $X$ is hidden. We take a standard
Gaussian density as the kernel function $K$, the bandwidth parameter $h = (n/\log(n))^{-1/5}$ and the smoothing step
$\gamma_t = t^{-0.6}$, and the initial estimates for the Markov chain $X^0_{1:n}$ in Step 0 are taken as uniform random
variables. Owing to the complexity of the regression functions in this example, the estimate for an MS-AR model
is not a good starting point.
For $T = 1000$, we obtain the following results. The estimate for the transition matrix is

$$\hat A = \begin{pmatrix} 0.9665 & 0.0244 & 0.0091 \\ 0.0161 & 0.9338 & 0.0500 \\ 0.0230 & 0.0312 & 0.9458 \end{pmatrix}.$$

Figure 5 shows the non-parametric regression functions obtained by the Robbins–Monro procedure and the squared
error for the estimate $A^t$.

4.3. Example 3
In this example, we take $m = 4$. The autoregressive functions are

$$r_1(y) = 0.7y + 2e^{-(10y)^2}, \qquad r_2(y) = \frac{2}{1 + e^{10y}} - 1,$$
$$r_3(y) = 2\cos(y) - 1, \qquad r_4(y) = (0.4y + 2.5)\, 1_{\{y<0\}} + (0.4y + 2.5)\, 1_{\{y>0\}}.$$

[Figure 4. Simulated data $Y_{0:n}$ (left). Scatter plot of $(Y_k, Y_{k-1})$, where the points are labelled with respect to the real
state of $X_k$: dots for $X_k = 1$, x's for $X_k = 2$ and circles for $X_k = 3$ (right).]

[Figure 5. Non-parametric estimation; the real functions are shown with solid lines and the estimates with dashed lines (left).
The squared estimation error $\|A^t - A\|_2^2$ for $t = 1{:}T$ (right).]

[Figure 6. Simulated data $Y_{0:n}$ (left). Scatter plot of $(Y_k, Y_{k-1})$. The points are labelled with respect to the real state of
$X_k$: dots for $X_k = 1$, x's for $X_k = 2$, circles for $X_k = 3$ and diamonds for $X_k = 4$ (right).]

We take $\phi$ to be a Gaussian density with zero mean and variance $\sigma^2 = 0.25$. The transition matrix is given by

$$A = \begin{pmatrix} 0.9000 & 0.1000 & 0 & 0 \\ 0.0500 & 0.9000 & 0.0500 & 0 \\ 0 & 0.0500 & 0.9000 & 0.0500 \\ 0 & 0 & 0.1000 & 0.9000 \end{pmatrix}.$$

We simulate a sample path of $(Y, X)$ of size $n = 3000$. The simulated data are shown in Figure 6.
For $T = 1000$, we obtain the following results. The estimated transition matrix is

$$\hat A = \begin{pmatrix} 0.7655 & 0.0151 & 0.1439 & 0.0755 \\ 0.0406 & 0.7627 & 0.1128 & 0.0839 \\ 0.1449 & 0.0705 & 0.7558 & 0.0288 \\ 0.0656 & 0.0933 & 0.1109 & 0.7302 \end{pmatrix}.$$

The non-parametric regression functions obtained by the Robbins–Monro procedure and the squared error for the
estimate $A^t$ are displayed in Figure 7.


[Figure 7. Non-parametric estimation; the real functions are shown with solid lines and the estimates with dotted lines (left).
The squared estimation error $\|A^t - A\|_2^2$ for $t = 1{:}T$ (right).]

Notice that the algorithm performs well in the first two examples, that is, the cases $m = 2$ and
$m = 3$ respectively. Moreover, convergence of the probability transition matrix $A^t$ is reached quickly. However,
when the number of states is increased to $m \ge 4$, several problems arise in the non-parametric estimation. The
algorithm has difficulties in identifying the state of the Markov chain at intersection points of the
regression functions; that is, a loss of numerical identifiability occurs at these points owing to the discretization
step and the sample size. Furthermore, a misclassification of data arises if the variance is large with respect to the
range of the regression functions. Finally, when $m$ is large, the algorithm is more sensitive to both the choice of the
starting point and the selection of the window size $h$.
It seems that some useful properties of the algorithm cease to be effective when the number of states increases.
This is probably because, as the number of parameters increases, so does the complexity of the model. This is,
clearly, a consequence of what has been dubbed the curse of dimensionality.

ACKNOWLEDGEMENTS

We thank the editor, Robert Taylor, the co-editor and two anonymous referees for their insightful comments, which
greatly contributed to the improvement of this article. L. J. Fermín acknowledges support from the DIUV REG
N02/2011 project of the Universidad de Valparaíso. L. A. Rodríguez is thankful to the Universidad de Carabobo for a
sabbatical grant. This work has been partially supported by the Anillo ACT1112 and the MathAmSud 16MATH03
SIDRE projects.

REFERENCES

Ailliot P, Monbet V. 2012. Markov-switching autoregressive models for wind time series. Environmental Modelling & Software
30: 92–101.
Allman ES, Matias C, Rhodes JA. 2009. Identifiability of parameters in latent structure models with many observed variables.
Annals of Statistics 37: 3099–3132.
Ango-Nze P, Bühlmann P, Doukhan P. 2002. Weak dependence beyond mixing and asymptotics for nonparametric regression.
Annals of Statistics 30: 397–430.
Baum LE, Petrie T, Soules G, Weiss N. 1970. A maximization technique occurring in the statistical analysis of probabilistic
functions of Markov chains. Annals of Mathematical Statistics 41: 164–171.
Benaglia T, Chauveau D, Hunter DR. 2009. An EM-like algorithm for semi- and non-parametric estimation in multivariate
mixtures. Journal of Computational and Graphical Statistics 18(2): 505–526.
Cappé O, Moulines E, Rydén T. 2005. Inference in Hidden Markov Models. New York, USA: Springer.
Carter CK, Kohn R. 1994. On Gibbs sampling for state space models. Biometrika 81: 541–553.
Dacunha-Castelle D, Gassiat E. 1997. The estimation of the order of a mixture model. Bernoulli 3(3): 279–299.
Delyon B, Lavielle M, Moulines E. 1999. Convergence of a stochastic approximation version of the EM algorithm. The Annals
of Statistics 27(1): 94–128.
Douc R, Moulines E, Rydén T. 2004. Asymptotic properties of the maximum likelihood estimator in autoregressive models
with Markov regime. Annals of Statistics 32: 2254–2304.
Doucet A, Logothetis A, Krishnamurthy V. 2000. Stochastic sampling algorithms for state estimation of jump Markov linear
systems. IEEE Transactions on Automatic Control 45(2): 188–202.
Doukhan P. 1994. Mixing: Properties and Examples. Lecture Notes in Statistics 85.
Duflo M. 1996. Algorithmes Stochastiques. Berlin: Springer-Verlag.
Ferraty F, Antón N, Vieu P. 2001. Regresión No Paramétrica: Desde la Dimensión Uno hasta la Dimensión Infinita. Bizcaia,
Spain: Servicio Editorial de la Universidad del País Vasco.
Francq C, Roussignol M. 1997. On white noises driven by hidden Markov chains. Journal of Time Series Analysis 18:
553–578.
Francq C, Roussignol M, Zakoian J-M. 2001. Conditional heteroskedasticity driven by hidden Markov chains. Journal of Time
Series Analysis 22(2): 197–220.
Franke J, Stockis JP, Tadjuidje J, Li WK. 2011. Mixtures of nonparametric autoregressions. Journal of Nonparametric Statistics
23(2): 287–303.
Gassiat E, Cleynen A, Robin S. 2016. Inference in finite state space non-parametric hidden Markov models and applications.
Statistics and Computing 26(1): 61–71.
Goldfeld SM, Quandt R. 1973. A Markov model for switching regressions. Journal of Econometrics 1: 3–16.
Hamilton JD. 1989. A new approach to the economic analysis of nonstationary time series and the business cycle.
Econometrica 57(2): 357–384.
Hamilton JD, Raj B. 2003. Advances in Markov-Switching Models: Applications in Business Cycle Research and Finance
(Studies in Empirical Economics). Heidelberg, Germany: Springer.
Harel M, Puri M. 2001. U-statistiques conditionnelles universellement consistantes pour des modèles de Markov cachés.
Comptes Rendus de l'Académie des Sciences – Série I – Mathematics 333: 953–956.
Kemeny JG, Snell JL. 1960. Finite Markov Chains. Princeton, New Jersey: Van Nostrand.
Kim C, Nelson C. 1999. State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with
Applications. Cambridge, Massachusetts, London: MIT Press.
Krishnamurthy V, Rydén T. 1998. Consistent estimation of linear and non-linear autoregressive models with Markov regime.
Journal of Time Series Analysis 19: 291–307.
Krolzig H-M. 1997. Markov-Switching Vector Autoregressions: Modelling, Statistical Inference, and Application to Business
Cycle Analysis. Berlin Heidelberg, Germany: Springer-Verlag.
Polyak BT, Juditsky AB. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and
Optimization 30: 838–855.
Rio E. 1993. Covariance inequalities for strongly mixing processes. Annales de l'Institut Henri Poincaré (B) Probabilités et
Statistiques 29: 587–597.
Rio E. 2000. Théorie Asymptotique des Processus Faiblement Dépendants, Vol. 31. Paris: Springer-SMAI.
Ríos R, Rodríguez LA. 2008a. Penalized estimate of the number of states in Gaussian linear AR with Markov regime. Electronic
Journal of Statistics 2: 1111–1128.
Ríos R, Rodríguez LA. 2008b. Estimación semiparamétrica en procesos autorregresivos con régimen de Markov. Divulgaciones
Matemáticas 16(1): 155–171.
Rosales R. 2004. MCMC for hidden Markov models incorporating aggregation of states and filtering. Bulletin of Mathematical
Biology 66(5): 1173–1199.
Tugnait J. 1982. Adaptive estimation and identification for discrete systems with Markov jump parameters. IEEE Transactions
on Automatic Control 27(5): 1054–1065.
Yao J. 2000. On recursive estimation in incomplete data models. Statistics 34: 27–51.
Yao J, Attali JG. 1999. On stability of nonlinear AR process with Markov switching. Advances in Applied Probability
32: 394–407.

APPENDIX A: PROOF OF TECHNICAL RESULTS

Proof of Lemma 2.1

Take $(e_1, \dots, e_n) = T(Y_1, \dots, Y_n)$, with $e_k = Y_k - r_{X_k}(Y_{k-1})$, for $k = 1, \dots, n$. Then the Jacobian matrix of the
transformation $T$ is triangular and the absolute value of the Jacobian is equal to 1. Hence, by the theorem
on the change of variables,

$$E(h(Y_{1:n}, X_{1:n}, Y_0)) = E(h(T^{-1}(e_{1:n}), X_{1:n}, Y_0))
= \int \sum_{i_{1:n}} h(T^{-1}(u_{1:n}), i_{1:n}, y_0)\, p(e_{1:n} = u_{1:n}, X_{1:n} = i_{1:n}, Y_0 = y_0)\, du_{1:n}\, dy_0.$$

Taking into account the independence of $Y_0$, $\{X_k\}_{k\ge 1}$ and $\{e_k\}_{k\ge 1}$, we have the factorization

$$p(e_{1:n} = u_{1:n}, X_{1:n} = i_{1:n}, Y_0 = y_0) = p(e_{1:n} = u_{1:n})\, p(X_{1:n} = i_{1:n})\, p(Y_0 = y_0),$$

and by conditions D1 and E6, we obtain

$$E(h(Y_{1:n}, X_{1:n}, Y_0)) = \int \sum_{i_{1:n}} h(T^{-1}(u_{1:n}), i_{1:n}, y_0) \prod_{k=1}^{n} \phi(u_k) \prod_{k=2}^{n} a_{i_{k-1} i_k}\, \pi_{i_1}\, p(Y_0 = y_0)\, du_{1:n}\, dy_0.$$

Thus, the first result follows.
Integrating the joint density $p(Y_{1:n} = y_{1:n}, X_{1:n} = x_{1:n}, Y_0 = y_0)$, we have

$$p(Y_k = y_k, Y_{k'} = y_{k'}) = \sum_{x_{1:n}} \int p(Y_{1:n} = y_{1:n}, X_{1:n} = x_{1:n}, Y_0 = y_0)\, dy_{0:k-1}\, dy_{k+1} \cdots dy_{k'-1}\, dy_{k'+1:n}.$$

Since $\phi(y_k - r_{x_k}(y_{k-1}))\, \phi(y_{k'} - r_{x_{k'}}(y_{k'-1})) \le \|\phi\|_\infty^2$, and using this bound in the above expression,
the integral of the remaining terms is equal to 1. Thus, $p(Y_k = y_k, Y_{k'} = y_{k'}) \le \|\phi\|_\infty^2$. In a similar
way, we can prove that $p(Y_0 = y_0, Y_{k'} = y_{k'}) \le \|\phi\|_\infty$.

Proof of Proposition 2.2
We consider the extended Markov chain $Z = \{Z_n\}_{n\ge 1}$ defined by $Z_n = (Y_n, X_n)$. Under conditions E1–E7,
from Yao and Attali (1999, Theorem 1), we have that $Z$ is a geometrically ergodic Markov chain. Denoting by
$(E, \mathcal{B})$ the state space of $Z$, by $Q$ the kernel probability transition and by $\nu$ the invariant probability measure, it
follows that the $\beta$-mixing coefficients of $Z$ take the following form (Doukhan, 1994, Section 2.4):

$$\beta_n(Z) := E\left(\sup\{|Q^{(n)}(Z, B) - \nu(B)| : B \in \mathcal{B}\}\right). \qquad (A1)$$

By Theorem 1 in Doukhan (1994, Section 2.4), geometric ergodicity implies the $\beta$-mixing property for $Z$.
Moreover, there exist $0 < \rho < 1$ and $c > 0$ such that the $\beta$-mixing coefficients satisfy $\beta_n(Z) \le c\, \rho^n$. Thus, from
the inequality $2\alpha_n(Z) \le \beta_n(Z)$, the process $Z$ is also $\alpha$-mixing.
On the other hand, the process $Y$ can be obtained from $Z$ as $Y_n = \Pi(Z_n)$, where $\Pi$ is the projection function.
Since the projection $\Pi$ is a continuous function, we have $\mathcal{M}_a^b(Y) \subseteq \mathcal{M}_a^b(Z)$ for all $a, b$. Then, from the
expression given for the $\alpha$-mixing coefficients in formula (5), we obtain

$$\alpha_n(Y) \le \alpha_n(Z) \le \frac{1}{2}\beta_n(Z) \le \frac{c}{2}\rho^n.$$

Therefore, $Y$ is $\alpha$-mixing and its coefficients $\alpha_n(Y)$ decrease geometrically.


Proof of Corollary 2.1
Given that $\phi(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-y^2/2\sigma^2}$, it is easy to verify that condition I4 holds. To prove condition I3, we
assume that, for all $y, y'$, $\sum_{i=1}^{m} \lambda_i\, \phi(y - r_i(y')) = 0$. Then, for all $y', t$, we have

$$\sum_{i=1}^{m} \lambda_i \int_{-\infty}^{+\infty} e^{ty}\, \phi(y - r_i(y'))\, dy = \sum_{i=1}^{m} \lambda_i\, e^{r_i(y')\,t + \sigma^2 t^2/2} = 0.$$

Thus,

$$\sum_{i=1}^{m} \lambda_i\, e^{r_i(y')\,t} = \sum_{k=0}^{\infty} \frac{t^k}{k!} \sum_{i=1}^{m} \lambda_i\, r_i^k(y') = 0 \quad \Longrightarrow \quad \sum_{i=1}^{m} \lambda_i\, r_i^k(y') = 0, \quad \forall k \ge 0.$$

Consider the Vandermonde matrix of order $m \times m$

$$V = \begin{pmatrix} 1 & r_1(y') & \dots & r_1^{m-1}(y') \\ 1 & r_2(y') & \dots & r_2^{m-1}(y') \\ \vdots & \vdots & \ddots & \vdots \\ 1 & r_m(y') & \dots & r_m^{m-1}(y') \end{pmatrix}.$$

The determinant of the Vandermonde matrix $V$ can be expressed as

$$\det(V) = \prod_{1 \le i < j \le m} \left( r_j(y') - r_i(y') \right).$$

From condition I2, $\det(V) \neq 0$ for almost all $y'$. Then the system of equations

$$\sum_{i=1}^{m} \lambda_i\, r_i^k(y') = 0, \quad \text{for } k = 0, \dots, m-1,$$

has a unique solution, $\lambda_1 = \dots = \lambda_m = 0$. The identifiability follows from Proposition 2.3.


Proof of Lemma 2.2
First, we recall that

$$T_{k,n} = a\, K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1}) + b\, Y_{k+1}\, 1_{\{|Y_{k+1}| \le M_n\}}\, K_h(y - Y_k)\, 1_{\{i\}}(X_{k+1}).$$

Considering the variance term, from the stationarity of $\{(Y_k, X_k)\}_{k\ge 1}$, we have

$$\begin{aligned}
\operatorname{var}(T_{0,n}) &= a^2 \operatorname{var}(K_h(y - Y_0)\, 1_{\{i\}}(X_1)) \\
&\quad + 2ab \operatorname{cov}(Y_1 1_{\{|Y_1| \le M_n\}} K_h(y - Y_0)\, 1_{\{i\}}(X_1),\ K_h(y - Y_0)\, 1_{\{i\}}(X_1)) \\
&\quad + b^2 \operatorname{var}(Y_1 1_{\{|Y_1| \le M_n\}} K_h(y - Y_0)\, 1_{\{i\}}(X_1)) \\
&= a^2 h \int f_i(y - hu)\, K^2(u)\, du \\
&\quad + 2ab\, h \int E(Y_1 1_{\{|Y_1| \le M_n\}} \mid X_1 = i, Y_0 = y - hu)\, K^2(u)\, f_i(y - hu)\, du \\
&\quad + b^2 h \int E(Y_1^2 1_{\{|Y_1| \le M_n\}} \mid X_1 = i, Y_0 = y - hu)\, K^2(u)\, f_i(y - hu)\, du \\
&\quad - h^2 \left( a \int f_i(y - hu)\, K(u)\, du + b \int E(Y_1 1_{\{|Y_1| \le M_n\}} \mid X_1 = i, Y_0 = y - hu)\, K(u)\, f_i(y - hu)\, du \right)^2.
\end{aligned}$$

From the dominated convergence theorem and conditions R2 and R3, we have, when $n \to \infty$ and $h \to 0$,

$$\operatorname{var}(T_{0,n}) \approx h\left(a^2 f_i(y) + 2ab\, g_i(y) + b^2 g_{2,i}(y)\right)\|K\|_2^2 + o(h^2).$$

Now, for the covariance terms $\operatorname{cov}(T_{0,n},T_{k,n})$, we define $U_k^s=(Y_k\mathbf 1_{\{|Y_k|\le M_n\}})^s$, $s=0,1,2$. As the process is stationary, it suffices to consider
$$\operatorname{cov}\big(U_1^sK_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^lK_h(y-Y_k)\mathbf 1_i(X_{k+1})\big).$$
Owing to the $\alpha$-dependency, using the covariance inequality of Rio (1993), this covariance is bounded by
$$\operatorname{cov}\big(U_1^sK_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^lK_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\le M_n^{s+l}\operatorname{cov}\big(K_h(y-Y_0)\mathbf 1_i(X_1),\,K_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\le 4M_n^{s+l}\|K\|_\infty^2\,\alpha_k.$$
This gives
$$\operatorname{cov}(T_{0,n},T_{k,n})\le 4\big(a^2+2abM_n+b^2M_n^2\big)\|K\|_\infty^2\,\alpha_k.$$
Set $C_k(i,v)=\{X_{k+1}=i,\,Y_k=v\}$. We recall that $A^{(k)}_{ij}$ is the $(i,j)$th entry of the $k$th power of the matrix $A$. Then
$$\begin{aligned}
E\big(U_1^sU_{k+1}^lK_h(y-Y_0)\mathbf 1_i(X_1)K_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)&=\int\!\!\int K_h(y-u)K_h(y-v)\,E\big(U_1^sU_{k+1}^l\mid C_0(i,u),C_k(i,v)\big)\\
&\qquad\times p(Y_0=u,Y_k=v)\,P(X_1=i,X_{k+1}=i)\,du\,dv\\
&\le A^{(k)}_{ii}\pi_i\,h^2\int\!\!\int K(u)K(v)\,E\big(|U_1^sU_{k+1}^l|\,\big|\,C_0(i,y-uh),C_k(i,y-vh)\big)\\
&\qquad\times p(Y_0=y-uh,Y_k=y-vh)\,du\,dv\\
&\le A^{(k)}_{ii}\pi_i\,h^2\int\!\!\int K(u)K(v)\,E\big(|Y_1^sY_{k+1}^l|\,\big|\,C_0(i,y-uh),C_k(i,y-vh)\big)\\
&\qquad\times p(Y_0=y-uh,Y_k=y-vh)\,du\,dv.
\end{aligned}$$
We evaluate each case $0\le s+l\le 2$, with $s,l=0,1$. For $s+l=0$,
$$E\big(|Y_1^sY_{k+1}^l|\mid X_1=i,Y_0=u,X_{k+1}=i,Y_k=v\big)=1$$
and, in this case, from Lemma 2.1,
$$\operatorname{cov}\big(U_1^sK_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^lK_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\le h^2A^{(k)}_{ii}\pi_i\|\varphi\|_\infty^2.$$

For $s+l=1$, we only consider the case $s=0$, $l=1$, since the case $s=1$, $l=0$ is similar. Thus,
$$E\big(|Y_1^sY_{k+1}^l|\mid X_1=i,Y_0=u,X_{k+1}=i,Y_k=v\big)=E\big(|Y_{k+1}|\mid X_{k+1}=i,Y_k=v\big)\le|r_i(v)|+E(|e_1|),$$
so that, by Lemma 2.1, by the continuity of the function $r_i(v)$ and by the moment condition on $e_1$, when we take $h\to 0$, we obtain
$$\operatorname{cov}\big(U_1^sK_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^lK_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\sim h^2\big(|r_i(y)|+E(|e_1|)\big)A^{(k)}_{ii}\pi_i\|\varphi\|_\infty^2+o(h^3).$$
For $s+l=2$, that is, $s=1$, $l=1$,
$$E\big(|Y_1^sY_{k+1}^l|\mid X_1=i,Y_0=u,X_{k+1}=i,Y_k=v\big)=r_{i,k}(u,v).$$
Then, by the continuity of the function $r_{i,k}(u,v)$ and Lemma 2.1, taking $h\to 0$, we have
$$\operatorname{cov}\big(U_1^sK_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^lK_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\sim h^2\,r_{i,k}(y,y)\,A^{(k)}_{ii}\pi_i\|\varphi\|_\infty^2+o(h^3).$$

It remains to consider the covariance term
$$\begin{aligned}
\operatorname{cov}(T_{0,n},T_{k,n})&=a^2\operatorname{cov}\big(U_1^0K_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^0K_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\\
&\quad+ab\operatorname{cov}\big(U_1^0K_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^1K_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\\
&\quad+ab\operatorname{cov}\big(U_1^1K_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^0K_h(y-Y_k)\mathbf 1_i(X_{k+1})\big)\\
&\quad+b^2\operatorname{cov}\big(U_1^1K_h(y-Y_0)\mathbf 1_i(X_1),\,U_{k+1}^1K_h(y-Y_k)\mathbf 1_i(X_{k+1})\big).
\end{aligned}$$
By collecting the bounds, we obtain, for large enough $n$ and small enough $h$, that
$$\operatorname{cov}(T_{0,n},T_{k,n})\sim h^2\big(a^2+2ab|r_i(y)|+b^2r_{i,k}(y,y)\big)A^{(k)}_{ii}\pi_i\|\varphi\|_\infty^2+o(h^3).$$
Proof of Lemma 2.3


Set $\xi_k=Y_{k+1}K_h(y-Y_k)\mathbf 1_i(X_{k+1})$ and the corresponding truncated variable $\tilde\xi_k=\xi_k\mathbf 1_{\{|Y_{k+1}|\le M_n\}}$, for $k=0,\ldots,n-1$. Define the truncated kernel estimator of $g_i$,
$$\tilde g_{i,n}(y)=\frac{1}{nh}\sum_{k=0}^{n-1}\tilde\xi_k.$$
Thus,
$$P\big(|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|>2\varepsilon\big)\le P\big(|\tilde g_{i,n}(y)-E\tilde g_{i,n}(y)|>\varepsilon\big)+P\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)-E(\hat g_{i,n}(y)-\tilde g_{i,n}(y))|>\varepsilon\big).$$
Conditions E3 and M2 imply $E(|Y_k|^s)<\infty$ for $s>2$; then, by Chebyshev's inequality,
$$P\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)-E(\hat g_{i,n}(y)-\tilde g_{i,n}(y))|>\varepsilon\big)\le\varepsilon^{-2}\operatorname{var}\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)|\big)$$
and, by the definition of the variance,
$$\operatorname{var}\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)|\big)\le E\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)|^2\big).$$
We obtain a bound on the right-hand side of the aforementioned inequality using the Hölder inequality and the stationarity of the model,
$$\begin{aligned}
E\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)|^2\big)&\le h^{-2}E\big(Y_1^2K_h^2(y-Y_0)\mathbf 1_i(X_1)\mathbf 1_{\{|Y_1|>M_n\}}\big)\\
&\le h^{-2}E(|Y_1|^s)\,\big\|Y_1^{2-s}\mathbf 1_{\{|Y_1|>M_n\}}K^2\big\|_\infty\\
&\le\|K\|_\infty^2E(|Y_1|^s)\,M_n^{2-s}h^{-2}\\
&\le c_3M_n^{2-s}h^{-2},
\end{aligned}$$
with $c_3=\|K\|_\infty^2E(|Y_1|^s)$.

Now, we consider the bound for the term $s_n^2$ given by
$$s_n^2=n^2h^2\operatorname{var}\big(\tilde g_{i,n}(y)\big)=n\operatorname{var}(\tilde\xi_0)+2\sum_{k=1}^{n-1}(n-k)\operatorname{cov}(\tilde\xi_0,\tilde\xi_k).$$
First, we use item (i) of Lemma 2.2 with $a=0$ and $b=1$,
$$\operatorname{var}(\tilde\xi_0)\sim h\,g_{2,i}(y)\|K\|_2^2+o(h^2).$$
Second, we use Tran's device to split the covariance sum of the $\tilde\xi_k$ into two terms:
$$\sum_{k=1}^{n-1}(n-k)\operatorname{cov}(\tilde\xi_0,\tilde\xi_k)=\sum_{k=1}^{u_n-1}(n-k)\operatorname{cov}(\tilde\xi_0,\tilde\xi_k)+\sum_{k=u_n}^{n-1}(n-k)\operatorname{cov}(\tilde\xi_0,\tilde\xi_k).$$
In a way similar to the case of the first bound, we apply item (ii) of Lemma 2.2 with $a=0$, $b=1$ and condition R3, and we obtain, for $k\le u_n<n$,
$$\operatorname{cov}(\tilde\xi_0,\tilde\xi_k)\sim h^2\sup_{k\in\mathbb N}\|r_{i,k}\|_\infty\|\varphi\|_\infty^2+o(h^3).\qquad\text{(A.2)}$$
For $k>u_n$, we apply item (iii) of Lemma 2.2 and obtain
$$\operatorname{cov}(\tilde\xi_0,\tilde\xi_k)\le 4\|K\|_\infty^2M_n^2\alpha_k.\qquad\text{(A.3)}$$
From Proposition 2.2, there exist $0<\rho<1$ and $c_2>0$ such that the $\alpha$-mixing coefficients satisfy $\alpha_n(Y)\le c_2\rho^n$. From inequalities (A.2) and (A.3), and taking $u_n=(h\log n)^{-1}$, we obtain
$$\sum_{k=1}^{n-1}(n-k)\operatorname{cov}(\tilde\xi_0,\tilde\xi_k)\lesssim\sup_{k\in\mathbb N}\|r_{i,k}\|_\infty\|\varphi\|_\infty^2\,h^2nu_n+4c_2\|K\|_\infty^2nM_n^2\,\frac{\rho^{u_n}}{1-\rho}=o(nh).$$
Therefore, $s_n^2=O(nh)$.
The Fuk–Nagaev inequality (Rio, 2000, Theorem 6.2), applied to the random variables $\tilde\xi_0,\ldots,\tilde\xi_{n-1}$, allows us to obtain, for any $t>0$ and any $\lambda>1$,
$$P\Big(\Big|\sum_{k=0}^{n-1}\big(\tilde\xi_k-E\tilde\xi_k\big)\Big|>4t\Big)\le 4\Big(1+\frac{t^2}{\lambda s_n^2}\Big)^{-\lambda/2}+4nM_n\alpha_{u_n}t^{-1}.$$
Taking $4t=\varepsilon nh$, we obtain the asymptotic inequality
$$P\big(|\tilde g_{i,n}(y)-E\tilde g_{i,n}(y)|>\varepsilon\big)\lesssim 4\Big(1+\frac{\varepsilon^2nh}{16\lambda c_1}\Big)^{-\lambda/2}+\frac{16c_2M_n\rho^{u_n}}{\varepsilon h},$$
where $c_1=\sup_{y\in\mathcal C}g_{2,i}(y)\|K\|_2^2$. Thus, result (i) follows. We can prove (ii) in a similar way, taking $a=1$, $b=0$ and $\tilde c_1=\sup_{y\in\mathcal C}f_i(y)\|K\|_2^2$ in Lemma 2.2.
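To fix ideas, the following is a minimal Python sketch of the truncated estimator $\tilde g_{i,n}$ and of $\hat f_{i,n}$ used above. The Gaussian kernel is a placeholder choice, and the regime indicators $X$ are treated as observed purely to keep the sketch short; in the hidden-regime setting they would have to be imputed.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def g_tilde(y, Y, X, i, h, M_n, K=gaussian_kernel):
    """Truncated estimator (nh)^{-1} sum_k Y_{k+1} 1{|Y_{k+1}| <= M_n}
    K((y - Y_k)/h) 1{X_{k+1} = i}."""
    n = len(Y) - 1
    Yk, Yk1, Xk1 = Y[:-1], Y[1:], X[1:]
    w = K((y - Yk) / h) * (Xk1 == i) * (np.abs(Yk1) <= M_n)
    return np.sum(Yk1 * w) / (n * h)

def f_hat(y, Y, X, i, h, K=gaussian_kernel):
    """Kernel estimator (nh)^{-1} sum_k K((y - Y_k)/h) 1{X_{k+1} = i}."""
    n = len(Y) - 1
    return np.sum(K((y - Y[:-1]) / h) * (X[1:] == i)) / (n * h)

# A regression estimate in regime i is then the ratio, as suggested by (A.8):
# r_hat = g_tilde(y, Y, X, i, h, M_n) / f_hat(y, Y, X, i, h)
```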

Proof of Lemma 2.4


Set $\xi_k=Y_{k+1}K_h(y-Y_k)\mathbf 1_i(X_{k+1})$, for $k=0,\ldots,n-1$. Taking the conditional expectation of $\xi_k$ given $X_{k+1}=j$, $Y_k=u$, considering the expression for $r_i(y)$ given in (6) and using the stationarity of the model, which holds by Proposition 2.1 under conditions E1–E7, we obtain
$$E(\xi_k)=\int r_i(u)K_h(y-u)\,\pi_i\,p(Y_0=u)\,du=\int K_h(y-u)\,g_i(u)\,du.\qquad\text{(A.4)}$$
Since $\hat g_{i,n}(y)=\frac{1}{nh}\sum_k\xi_k$, equation (A.4) implies that
$$E\hat g_{i,n}(y)=\int K(u)\,g_i(y-uh)\,du.\qquad\text{(A.5)}$$
By the second-order Taylor expansion of $g_i$ at $y$, we obtain
$$g_i(y-uh)=g_i(y)-uh\,g_i'(y)+\frac{(uh)^2}{2}g_i''(\tilde y_u),$$
with $\tilde y_u=(y-uh)(1-t)+ty$ for some $t\in[0,1]$. As the kernel $K$ is assumed to be of order 2, substituting the Taylor expansion into (A.5) gives
$$E\hat g_{i,n}(y)-g_i(y)=\frac{h^2}{2}\int g_i''(\tilde y_u)\,u^2K(u)\,du.$$
From condition R2, $g_i''$ is continuous. Then $g_i''(\tilde y_u)$ converges uniformly to $g_i''(y)$ over the compact set $\mathcal C$. Hence,
$$E\hat g_{i,n}(y)-g_i(y)=\frac{h^2}{2}g_i''(y)\int u^2K(u)\,du+o(h^2).\qquad\text{(A.6)}$$
Thus,
$$\sup_{y\in\mathcal C}|E\hat g_{i,n}(y)-g_i(y)|=O(h^2).\qquad\text{(A.7)}$$
The same proof works for the bias of $\hat f_{i,n}$, starting from
$$E\hat f_{i,n}(y)=\int K(u)\,\pi_i\,p_0(y-uh)\,du=\int K(u)\,f_i(y-uh)\,du.$$
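As a numerical sanity check on the smoothing-bias expansion (A.5)–(A.6), the short Python sketch below smooths a placeholder target function with the Gaussian kernel (which is of order 2 with $\int u^2K(u)\,du=1$); the printed ratios approach $g''(y)/2$, as predicted.

```python
import numpy as np

# int K(u) g(y - uh) du - g(y) ~ (h^2/2) g''(y) int u^2 K(u) du, as in (A.6).
g = lambda y: np.exp(-y**2)                         # placeholder smooth target
g2 = lambda y: (4.0 * y**2 - 2.0) * np.exp(-y**2)   # its second derivative
y = 0.3
u = np.linspace(-8.0, 8.0, 20001)
du = u[1] - u[0]
K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)      # Gaussian kernel, order 2
for h in (0.2, 0.1, 0.05):
    smoothed = np.sum(K * g(y - u * h)) * du        # int K(u) g(y - uh) du
    print(h, (smoothed - g(y)) / h**2, 0.5 * g2(y)) # ratio tends to g''(y)/2
```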

Proof of Theorem 2.1


We start with the following triangle inequality on the positivity set of $f_i(y)$,
$$|\hat r_{i,n}(y)-r_i(y)|\le\frac{1}{|\hat f_{i,n}(y)|}\,|\hat g_{i,n}(y)-g_i(y)|+\frac{|r_i(y)|}{|\hat f_{i,n}(y)|}\,|\hat f_{i,n}(y)-f_i(y)|,\qquad\text{(A.8)}$$
which implies the following inequality:
$$\sup_{y\in\mathcal C}|\hat r_{i,n}(y)-r_i(y)|\le\frac{\sup_{y\in\mathcal C}|\hat g_{i,n}(y)-g_i(y)|}{\inf_{y\in\mathcal C}|\hat f_{i,n}(y)|}+\frac{\sup_{y\in\mathcal C}|r_i(y)|}{\inf_{y\in\mathcal C}|\hat f_{i,n}(y)|}\,\sup_{y\in\mathcal C}|\hat f_{i,n}(y)-f_i(y)|.\qquad\text{(A.9)}$$
According to the bias–variance decomposition, the proof of the theorem is achieved through Lemmas 2.3 and 2.4, once the strict positivity of $\inf_{y\in\mathcal C}|\hat f_{i,n}(y)|$ is guaranteed.

Thus, applying Lemma 2.3 with $\varepsilon=\varepsilon_0\sqrt{\log n/(nh_n)}$, $M_n=n^{\kappa}$ with $\kappa>0$, $u_n=(h_n\log n)^{-1}$, $h_n=n^{-d}$ with $0<d<1$, and $\lambda$ large enough so that $\log(n)=o(\lambda)$, we have, for $\theta=(s-2)\kappa-d-2>0$ and $\frac{\varepsilon_0^2}{32c_1}=(s-2)\kappa-d-1$,
$$\begin{aligned}
P\Big(|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|>\varepsilon_0\sqrt{\tfrac{\log n}{nh_n}}\Big)&\lesssim 4\exp\Big(-\frac{\varepsilon_0^2\log(n)}{32c_1}\Big)+c_2\,\frac{16\,n^{\kappa+\frac12}\rho^{u_n}}{\varepsilon_0(\log n)^{1/2}h_n^{1/2}}+c_3\,\frac{n^{1+(2-s)\kappa}}{\varepsilon_0^2\log(n)\,h_n}\\
&\lesssim 4\,n^{-\varepsilon_0^2/(32c_1)}+c_2\,\frac{16\,n^{\kappa+\frac12}\rho^{u_n}}{\varepsilon_0(\log n)^{1/2}h_n^{1/2}}+c_3\,\frac{n^{1+(2-s)\kappa}}{\varepsilon_0^2\log(n)\,h_n}\qquad\text{(A.10)}\\
&\le c\,n^{-(1+\theta)}.
\end{aligned}$$
Applying the Borel–Cantelli lemma, the almost sure pointwise convergence of $|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|$ to 0 is proved. We proceed analogously to obtain the almost sure pointwise convergence $|\hat f_{i,n}(y)-E\hat f_{i,n}(y)|\to 0$. According to Lemma 2.4, we have
$$\inf_{y\in\mathcal C}|\hat f_{i,n}(y)|\ge\inf_{y\in\mathcal C}|f_i(y)|-\sup_{y\in\mathcal C}|\hat f_{i,n}(y)-E\hat f_{i,n}(y)|-\sup_{y\in\mathcal C}|E\hat f_{i,n}(y)-f_i(y)|\ge\frac12\inf_{y\in\mathcal C}|f_i(y)|>0.$$
Thus, the previous results obtained from Lemmas 2.3 and 2.4 and inequality (A.8) give the pointwise convergence of $|\hat r_{i,n}(y)-r_i(y)|$.

To obtain the uniform convergence on the compact set $\mathcal C$, we only need to prove an asymptotic inequality of type (A.10) for the term $\sup_{y\in\mathcal C}|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|$, and analogously for $\sup_{y\in\mathcal C}|\hat f_{i,n}(y)-E\hat f_{i,n}(y)|$, in inequality (A.9). For this, we proceed by using a truncation device as in Ango-Nze et al. (2002), assuming the moment condition M1.

Let us set $\xi_k=Y_{k+1}K_{h_n}(y-Y_k)\mathbf 1_i(X_{k+1})$ and the truncated variable $\tilde\xi_k=\xi_k\mathbf 1_{\{|Y_{k+1}|\le M_n\}}$. Then we define the truncated kernel estimator of $g_i$ by
$$\tilde g_{i,n}(y)=\frac{1}{nh_n}\sum_{k=0}^{n-1}\tilde\xi_k.$$
Since $\|K\|_\infty<\infty$, taking $M_n=M_0\log n$, we clearly obtain
$$P\Big(\sup_{y\in\mathcal C}|\hat g_{i,n}(y)-\tilde g_{i,n}(y)|>\varepsilon_0\Big)\le nP(|Y_1|>M_0\log n)\le E\big(\exp(|Y_1|)\big)\,n^{1-M_0}$$
and, by the Cauchy–Schwarz inequality and condition R3,
$$\begin{aligned}
\sup_{y\in\mathcal C}E\big(|\hat g_{i,n}(y)-\tilde g_{i,n}(y)|\big)&\le h_n^{-1}\sup_{y\in\mathcal C}E\big(|Y_1|\mathbf 1_{\{|Y_1|>M_0\log(n)\}}K_{h_n}(y-Y_0)\mathbf 1_i(X_1)\big)\\
&\le h_n^{-1}n^{-M_0/2}E\big(\exp(|Y_1|)\big)^{1/2}\sup_{y\in\mathcal C}E\big(|Y_1|^2K_{h_n}^2(y-Y_0)\mathbf 1_i(X_1)\big)^{1/2}\\
&\le c_4\,n^{-M_0/2}h_n^{-1/2},
\end{aligned}$$
where $c_4=\big(E(\exp(|Y_1|))\,\|r_{i,0}\|_\infty A_{ii}\pi_i\big)^{1/2}\|\varphi\|_\infty$.

Now, we reduce the computations by a chaining argument (Ferraty et al., 2001, pp. 32 and 78) for the case of a kernel estimator with bounded variables. Let $\mathcal C$ be covered by a finite number $\nu_n$ of intervals $B_k$ with diameter $2L_n$ and centre $t_k$. Then
$$\sup_{y\in\mathcal C}|\tilde g_{i,n}(y)-E\tilde g_{i,n}(y)|\le\max_{k=1,\ldots,\nu_n}|\tilde g_{i,n}(t_k)-E\tilde g_{i,n}(t_k)|+\sup_{y\in\mathcal C}|\tilde g_{i,n}(t_k)-\tilde g_{i,n}(y)|+\sup_{y\in\mathcal C}|E\tilde g_{i,n}(y)-E\tilde g_{i,n}(t_k)|.$$
Let us examine each term on the right-hand side of the preceding inequality. First, we have from Lemma 2.3
$$\begin{aligned}
P\Big(\max_{k=1,\ldots,\nu_n}|\tilde g_{i,n}(t_k)-E\tilde g_{i,n}(t_k)|>\frac{\varepsilon_0}{2}\sqrt{\tfrac{\log n}{nh_n}}\Big)&\le\sum_{k=1}^{\nu_n}P\Big(|\tilde g_{i,n}(t_k)-E\tilde g_{i,n}(t_k)|>\frac{\varepsilon_0}{2}\sqrt{\tfrac{\log n}{nh_n}}\Big)\\
&\lesssim\nu_n\Big(4\,n^{-\varepsilon_0^2/(128c_1)}+c_2\,\frac{32\,n^{1/2}M_n\rho^{u_n}}{\varepsilon_0(\log n)^{1/2}h_n^{1/2}}\Big).
\end{aligned}$$
For the second and third terms, we use the following inequality obtained from condition R1:
$$|\tilde g_{i,n}(t_k)-\tilde g_{i,n}(y)|\le\frac{M_n}{nh_n}\sum_{k=1}^{n}|K_{h_n}(t_k-Y_k)-K_{h_n}(y-Y_k)|\le c_5\frac{M_n}{h_n^{1+\gamma}}|y-t_k|^{\gamma}\le c_5\frac{M_n}{h_n^{1+\gamma}}L_n^{\gamma},$$
for some constants $c_5,\gamma>0$. Therefore,
$$P\Big(\sup_{y\in\mathcal C}|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|>\varepsilon_0\sqrt{\tfrac{\log n}{nh_n}}\Big)\lesssim\nu_n\Big(4\,n^{-\varepsilon_0^2/(128c_1)}+c_2\,\frac{32\,n^{1/2}M_n\rho^{u_n}}{\varepsilon_0(\log n)^{1/2}h_n^{1/2}}\Big)+2P\Big(c_5\frac{M_nL_n^{\gamma}}{h_n^{1+\gamma}}>\frac{\varepsilon_0}{2}\sqrt{\tfrac{\log n}{nh_n}}\Big)+c_4\,\frac{n^{-M_0/2}}{h_n^{1/2}}.$$
Setting $L_n^{\gamma}=n^{-1/2}h_n^{\frac12+\gamma}M_n^{-1}$, $\nu_n=c_5/L_n$, $u_n=(h_n\log n)^{-1}$, $\theta=\frac{\varepsilon_0^2}{128c_1}-\frac{1+d}{2}-d-1>0$ and $M_0=2(\theta+1)+d$, we obtain, for some constant $c>0$,
$$\begin{aligned}
P\Big(\sup_{y\in\mathcal C}|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|>\varepsilon_0\sqrt{\tfrac{\log n}{nh_n}}\Big)&\lesssim\frac{c_5}{L_n}\Big(4\,n^{-\varepsilon_0^2/(128c_1)}+c_2\,\frac{32\,n^{1/2}M_n\rho^{u_n}}{\varepsilon_0(\log n)^{1/2}h_n^{1/2}}\Big)+c_4\,\frac{n^{-M_0/2}}{h_n^{1/2}}\qquad\text{(A.11)}\\
&\le c\,n^{-(1+\theta)}.
\end{aligned}$$
Hence, the Borel–Cantelli lemma implies the a.s. convergence of the term $\sup_{y\in\mathcal C}|\hat g_{i,n}(y)-E\hat g_{i,n}(y)|$. The uniform convergence over the compact set of the regression function estimator $\hat r_{i,n}$ then follows in the same way as the a.s. pointwise convergence.
Remark A1. Note that in the proof of the a.s. pointwise convergence, the probability term in (A.10) is summable if $\theta=(s-2)\kappa-d-2>0$. This is only possible if $s>2$, and so the restriction imposed in condition M2 arises. From the asymptotic inequalities (A.10) and (A.11), we can notice that the convergence rate in Theorem 2.1 is $\sqrt{\log n/(nh_n)}$.
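For concreteness, the rate of Remark A1 and the summability constraint can be evaluated numerically; the values of $s$, $\kappa$ and $d$ below are arbitrary illustrative choices, not recommendations.

```python
import numpy as np

def rate(n, d):
    """Rate sqrt(log n / (n * h_n)) of Theorem 2.1 with bandwidth h_n = n**(-d)."""
    return np.sqrt(np.log(n) / (n * n ** (-d)))

# (A.10) is summable when theta = (s - 2) * kappa - d - 2 > 0.
s, kappa, d = 6.0, 1.0, 0.2          # illustrative placeholder values
theta = (s - 2.0) * kappa - d - 2.0
print(theta > 0.0, [round(rate(n, d), 4) for n in (10**3, 10**4, 10**5)])
```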

Proof of Lemma 3.1


Taking the expectation in (9), it follows that (12) holds. For the second part, we simply use the fact that the potential $U$ is absolutely integrable with respect to the measure $P(X_{1:n}=x\mid Y_{0:n},\theta^t)\,\lambda_c(dx)$, where $\lambda_c(dx)$ is the counting measure on $\{1,\ldots,m\}^n$. So, by the dominated convergence theorem, we have
$$\begin{aligned}
E\big(\nabla_\theta U(y,Y_{0:n},X_{1:n},\theta^t)\mid\mathcal F_{t-1}\big)&=\int\nabla_\theta U(y,Y_{0:n},x,\theta^t)\,P(X_{1:n}=x\mid Y_{0:n},\theta^t)\,\lambda_c(dx)\\
&=\nabla_\theta\int U(y,Y_{0:n},x,\theta^t)\,P(X_{1:n}=x\mid Y_{0:n},\theta^t)\,\lambda_c(dx)\\
&=\nabla_\theta\,u(y,Y_{0:n},\theta^t).
\end{aligned}$$
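This interchange is what makes a Monte Carlo estimate of the gradient unbiased: drawing regime paths $x$ from $P(X_{1:n}=x\mid Y_{0:n},\theta^t)$ and averaging $\nabla_\theta U$ over the draws approximates $\nabla_\theta u$. A minimal Python sketch follows; `grad_U`, `sample_path` and `theta` are hypothetical stand-ins, not the paper's specific objects.

```python
import numpy as np

def mc_gradient(grad_U, sample_path, theta, n_draws, rng=None):
    """Monte Carlo approximation of grad u(theta) = E[grad U(X, theta) | Y, theta]:
    average grad U over hidden regime paths X drawn from P(X | Y, theta)."""
    rng = np.random.default_rng(rng)
    draws = [grad_U(sample_path(rng), theta) for _ in range(n_draws)]
    return np.mean(draws, axis=0)

# A Robbins-Monro step would then move theta along this gradient estimate with
# step sizes gamma_t decreasing to 0:
#   theta_next = theta + gamma_t * mc_gradient(grad_U, sample_path, theta, n_draws)
```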
