Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
doi:10.1016/j.neucom.2011.10.032

Abstract: In this paper, we consider the learning rates of the support vector machine (SVM) classifier for the density level detection (DLD) problem. Using an established classification framework, we obtain an error decomposition consisting of the regularization error and the sample error. Based on this decomposition, we derive learning rates of the SVM classifier for DLD under some assumptions.

Article history: Received 2 February 2011; received in revised form 2 October 2011; accepted 3 October 2011; available online 28 December 2011. Communicated by G.B. Huang.

Keywords: Learning rates; Support vector machines; Density level detection; Anomaly detection
... give the learning rates for the DLD problem, i.e., the main result of this paper. In Section 6, we conclude the paper with some useful conclusions. Finally, the proof of the main result is arranged in the Appendix.

2. Preliminaries

In this section, we first recall the definitions and then investigate learning rates for the SVM for density level detection given in [15]. To this end we write $Y := \{-1, 1\}$ and give the following definition.
Definition 1 (See Steinwart et al. [15,16]). Let $\mu$ and $Q$ be probability measures on $X$ and $s \in (0,1)$. Then the probability measure $Q \oplus_s \mu$ on $X \times Y$ is defined by
$(Q \oplus_s \mu)(A) := s\,E_{x \sim Q}\,I_A(x, 1) + (1-s)\,E_{x \sim \mu}\,I_A(x, -1)$
for all measurable $A \subseteq X \times Y$. Here $I_A$ denotes the indicator function of the set $A$.

From the definition we know that the measure $P := Q \oplus_s \mu$ can be associated with a binary classification problem in which positive samples are drawn from $Q$ and negative samples are drawn from $\mu$. However, in general classification problems the samples are drawn from only one probability measure.
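This sampling interpretation is easy to make concrete. The following Python sketch is our own illustration, not code from [15] or from this paper: it draws a labeled sample from $P = Q \oplus_s \mu$ by flipping an $s$-coin for each point. The one-dimensional stand-ins for $Q$ and $\mu$ are arbitrary assumptions chosen only for the demonstration.

import numpy as np

rng = np.random.default_rng(0)

def sample_from_P(n_total, s, sample_Q, sample_mu):
    # Draw (x, y) from P = Q (+)_s mu: with probability s take x ~ Q and y = +1;
    # otherwise take x ~ mu and y = -1.
    labels = np.where(rng.random(n_total) < s, 1, -1)
    xs = np.array([sample_Q() if y == 1 else sample_mu() for y in labels])
    return xs, labels

# Illustration only: one-dimensional stand-ins for Q and mu (arbitrary choices).
rho = 1.0
s = 1.0 / (1.0 + rho)
X, Y = sample_from_P(1000, s,
                     sample_Q=lambda: rng.normal(0.0, 1.0),
                     sample_mu=lambda: rng.uniform(-3.0, 3.0))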
Inspired by this interpretation, let us recall that the binary classification risk for a measurable function $f : X \to \mathbb{R}$ and a distribution $P$ on $Z := X \times Y$ is defined by
$R_P(f) = P(\{(x, y) : \operatorname{sign} f(x) \neq y\})$,
where we define $\operatorname{sign}(t) := 1$ if $t > 0$ and $\operatorname{sign}(t) := -1$ otherwise. Furthermore, the Bayes risk $R_P$ of $P$ is the smallest possible classification risk with respect to $P$, i.e.,
$R_P := \inf\{R_P(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$.
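For intuition, the empirical counterpart of this risk is simply the fraction of sign disagreements on a sample. A small self-contained sketch (ours, with arbitrary toy values) using the sign convention above:

import numpy as np

def sign_(t):
    # sign(t) := 1 if t > 0 and -1 otherwise (so sign(0) = -1, as in the text)
    return np.where(t > 0, 1, -1)

def empirical_risk(f_values, y):
    # Empirical counterpart of R_P(f) = P({(x, y) : sign f(x) != y}):
    # the fraction of sign disagreements on a finite sample.
    return float(np.mean(sign_(f_values) != y))

f_values = np.array([0.2, -0.1, 0.0, 1.5])  # toy values of f at four sample points
y = np.array([1, -1, 1, -1])                # toy labels
print(empirical_risk(f_values, y))          # 0.5 for these toy values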
Steinwart et al. [15] showed that for $P := Q \oplus_s \mu$ and $s = 1/(1+\rho)$,
$R_P = R_P(f_c)$,
where
$f_c = I_{\{h > \rho\}} - I_{\{h \le \rho\}} = \operatorname{sign}(2P(y=1 \mid x) - 1)$,
i.e., $f_c$ is the Bayes rule. Moreover, Steinwart et al. [15] also showed that $S_{\mu,h,\rho}(f) \to 0$ if and only if $R_P(f) \to R_P$. Therefore a classification algorithm which makes $R_P(f)$ close to $R_P$ also makes $S_{\mu,h,\rho}$ (or simply $S_P$) close to zero. Thus our goal is also to find a function $f_T$ such that $R_P(f_T) \to R_P$.

Based on the above interpretation, Steinwart et al. [15] constructed an SVM for the DLD problem. Let $K : X \times X \to \mathbb{R}$ be a positive semidefinite kernel with reproducing kernel Hilbert space $H$. Given a sample $T = \{x_i\}_{i=1}^n$ drawn from $Q$ and a sample $T' = \{\bar{x}_j\}_{j=1}^m$ drawn from $\mu$, the SVM solves
$f_{T,T',\lambda} \in \arg\min_{f \in H} \Big\{\lambda \|f\|_H^2 + \frac{1}{(1+\rho)n}\sum_{i=1}^n l(1, f(x_i)) + \frac{\rho}{(1+\rho)m}\sum_{j=1}^m l(-1, f(\bar{x}_j))\Big\}$.  (1)

From now on the measure that we consider is $P := Q \oplus_s \mu$ with $s = 1/(1+\rho)$. The generalization error is defined as
$\mathcal{E}(f) = \int_Z l(y, f(x))\,dP$.
The corresponding empirical error is defined as
$\mathcal{E}_{T,T'}(f) = \frac{1}{(1+\rho)n}\sum_{i=1}^n l(1, f(x_i)) + \frac{\rho}{(1+\rho)m}\sum_{j=1}^m l(-1, f(\bar{x}_j))$.
Then we can rewrite (1) as
$f_{T,T',\lambda} \in \arg\min_{f \in H} \{\lambda \|f\|_H^2 + \mathcal{E}_{T,T'}(f)\}$.  (2)
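A minimal numerical sketch of the minimization in (2) may help fix ideas. This is our illustration, not the authors' algorithm or code: the Gaussian kernel, the representer-style parametrization $f = \sum_k \alpha_k K(z_k, \cdot)$ over the pooled sample, and the plain subgradient solver (with illustrative hyperparameters rho, lam, gamma, lr, steps) are all assumptions made for concreteness.

import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    # Gaussian kernel matrix K[i, j] = exp(-gamma * |A_i - B_j|^2); an assumption,
    # the paper only requires a bounded positive semidefinite kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svm_dld(X_pos, X_neg, rho=1.0, lam=1e-2, steps=2000, lr=0.1):
    # Subgradient descent on (2): lam * ||f||_H^2 + E_{T,T'}(f) with the hinge
    # loss, where f = sum_k alpha_k K(z_k, .) over the pooled sample.
    Z = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    # per-sample weights: 1/((1+rho)n) for positives, rho/((1+rho)m) for negatives
    w = np.where(y > 0, 1.0 / ((1 + rho) * len(X_pos)),
                 rho / ((1 + rho) * len(X_neg)))
    K = gaussian_kernel(Z, Z)
    alpha = np.zeros(len(Z))
    for _ in range(steps):
        f = K @ alpha
        violated = (y * f < 1).astype(float)          # hinge subgradient indicator
        grad = 2 * lam * (K @ alpha) - K @ (w * y * violated)
        alpha -= lr * grad
    return Z, alpha

# Toy usage with stand-in samples from Q and mu (arbitrary choices):
rng = np.random.default_rng(0)
X_pos = rng.normal(0.0, 1.0, size=(40, 2))    # stand-in sample T from Q
X_neg = rng.uniform(-3.0, 3.0, size=(60, 2))  # stand-in sample T' from mu
Z, alpha = svm_dld(X_pos, X_neg)
# decision function: f(x) = sum_k alpha_k K(z_k, x); predict sign f(x)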
3. Error analysis

In this section, we investigate learning rates, i.e., the decay of the excess classification error $R_P(\operatorname{sign} f_{T,T',\lambda}) - R_P$ as $n$ and $m$ become large.

The error analysis aims at bounding $R_P(\operatorname{sign} f_{T,T',\lambda}) - R_P$. But by (2), we know that the algorithm is associated with $\mathcal{E}_{T,T'}(f)$. So we need the relationship between $R_P(\operatorname{sign} f_{T,T',\lambda}) - R_P$ and $\mathcal{E}_{T,T'}(f_{T,T',\lambda}) - \mathcal{E}_{T,T'}(f_c)$. Some work has been done on this topic (see [2,4,6,24]). In this paper we consider the hinge loss. Then from [24] we know that for every measurable function $f : X \to \mathbb{R}$,
$R_P(\operatorname{sign} f) - R_P \le \mathcal{E}(f) - \mathcal{E}(f_c)$.  (3)
By [19], $f_c = \arg\min \mathcal{E}(f)$ with the minimum taken over all functions $f : X \to \bar{\mathbb{R}}$, where $\bar{\mathbb{R}} = \mathbb{R} \cup \{\pm\infty\}$.

Definition 2 (See Wu and Zhou [20]). The projection operator $\pi$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ by
$\pi(f)(x) = \begin{cases} 1, & f(x) > 1, \\ -1, & f(x) < -1, \\ f(x), & -1 \le f(x) \le 1. \end{cases}$

It is easy to see that $\pi(f)$ and $f$ induce the same classifier, i.e., $\operatorname{sign} \pi(f) = \operatorname{sign} f$, and that $l(y, \pi(f)(x)) \le l(y, f(x))$. Hence
$\mathcal{E}(\pi(f)) \le \mathcal{E}(f)$ and $\mathcal{E}_{T,T'}(\pi(f)) \le \mathcal{E}_{T,T'}(f)$.  (4)
Applying this fact to the error analysis, it is sufficient for us to bound the excess generalization error for $\pi(f)$ instead of $f$. This leads to better estimates.
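Concretely, $\pi$ is just clipping to $[-1, 1]$. A small sketch (ours; the toy values are arbitrary) checking the pointwise hinge-loss inequality behind (4):

import numpy as np

def project(f_values):
    # pi(f)(x): clip f(x) to [-1, 1]; leaves the induced classifier sign(f) unchanged
    return np.clip(f_values, -1.0, 1.0)

def hinge(y, t):
    # hinge loss l(y, t) = max(0, 1 - y t)
    return np.maximum(0.0, 1.0 - y * t)

# l(y, pi(f)(x)) <= l(y, f(x)) pointwise, so both errors can only decrease:
f = np.array([-2.3, -0.4, 0.9, 3.1])
y = np.array([-1.0, 1.0, 1.0, -1.0])
assert np.all(hinge(y, project(f)) <= hinge(y, f))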
Now we can present the error decomposition which leads to bounds of the excess generalization error for $\pi(f_{T,T',\lambda})$. Define
$f_\lambda = \arg\min_{f \in H} \{\mathcal{E}(f) + \lambda \|f\|_H^2\}$,
let the regularization error be $D(\lambda) := \mathcal{E}(f_\lambda) - \mathcal{E}(f_c) + \lambda \|f_\lambda\|_H^2$, and let the sample error $S_{T,T',\lambda}$ be defined by
$S_{T,T',\lambda} = \{\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}_{T,T'}(\pi(f_{T,T',\lambda}))\} + \{\mathcal{E}_{T,T'}(f_\lambda) - \mathcal{E}(f_\lambda)\}$.

Proposition 1. For every $\lambda > 0$,
$\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c) \le D(\lambda) + S_{T,T',\lambda}$.  (5)

Proof. Since
$\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c) \le \{\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}_{T,T'}(\pi(f_{T,T',\lambda}))\} + \{\mathcal{E}_{T,T'}(\pi(f_{T,T',\lambda})) + \lambda\|f_{T,T',\lambda}\|_H^2 - (\mathcal{E}_{T,T'}(f_\lambda) + \lambda\|f_\lambda\|_H^2)\} + \{\mathcal{E}_{T,T'}(f_\lambda) - \mathcal{E}(f_\lambda)\} + \{\mathcal{E}(f_\lambda) - \mathcal{E}(f_c) + \lambda\|f_\lambda\|_H^2\}$,
we obtain from (4) that $\mathcal{E}_{T,T'}(\pi(f)) \le \mathcal{E}_{T,T'}(f)$. This in connection with the definition of $f_{T,T',\lambda}$ tells us that
$\{\mathcal{E}_{T,T'}(\pi(f_{T,T',\lambda})) + \lambda\|f_{T,T',\lambda}\|_H^2 - (\mathcal{E}_{T,T'}(f_\lambda) + \lambda\|f_\lambda\|_H^2)\} \le 0$.
This proves (5). □

Throughout this paper, we assume that the kernel is uniformly bounded in the sense that
$\kappa := \sup_{x \in X} \sqrt{K(x,x)} < \infty$.

To state our result, we need to further introduce several concepts and notations. Firstly, we need the capacity of the space $H$, which plays an essential role in the sample error estimates. In this paper, we use covering numbers measured by empirical distances.

We also need the Tsybakov noise condition: $P$ has Tsybakov noise exponent $q > 0$ with constant $C$ if $P_X(\{x : |2\eta(x) - 1| \le t\}) \le C t^q$ for all $t > 0$, where $\eta(x) := P(y = 1 \mid x)$. However, we know that there exists a constant $c_q$ such that for all $t > 0$,
$P_X(|2\eta(x) - 1| \le t) \le C t^q \iff P_X(|2\eta(x) - 1| \le c_q t) \le t^q$.

Lemma 1 (See Wu and Zhou [20]). If $P$ has Tsybakov noise exponent $q$, then for every function $f : X \to [-1, 1]$ there holds
$E\big(l(y, f(x)) - l(y, f_c(x))\big)^2 \le 8\Big(\frac{1}{2c_q}\Big)^{q/(q+1)} \big(\mathcal{E}(f) - \mathcal{E}(f_c)\big)^{q/(q+1)}$.

Steinwart et al. [15] established an inequality between $S_P(f)$ and the excess classification risk $R_P(f) - R_P$: if $h$ has $\rho$-exponent $q$, then there exists a constant $c_1 > 0$ such that for all measurable $f : X \to \mathbb{R}$ we have
$S_P(f) \le c_1 (R_P(f) - R_P)^{q/(q+1)}$.  (8)
Then by (3) and (8),
$S_P(\operatorname{sign} f) \le c_1 (\mathcal{E}(f) - \mathcal{E}(f_c))^{q/(q+1)}$.  (9)
6. Conclusions

In this paper, we have studied the learning rates of the SVM classifier for the density level detection (DLD) problem. We regarded the DLD algorithm as a classical learning algorithm of learning theory. The definitions and theorems given benefit from recent developments in learning theory, such as [5,4,11-13,21]. As the main result of this paper, by means of an error decomposition consisting of the regularization error and the sample error, we used an iteration procedure developed in [13,21,23] and obtained learning rates for the DLD problem.
Acknowledgments

The authors wish to thank the referees for their helpful suggestions.

Appendix A

To prove Theorem 1, we first give some concentration inequalities.
Proposition 3 (One-sided Bernstein inequality, see Cucker and Smale [5]). Let $\xi$ be a random variable on a probability space $Z$ with mean $E(\xi)$ and variance $\sigma^2(\xi) = \sigma^2$. If $|\xi(z) - E(\xi)| \le M$ for almost all $z \in Z$, then for all $\varepsilon > 0$,
$\mathrm{Prob}_{z \in Z^m}\Big\{\frac{1}{m}\sum_{i=1}^m \xi(z_i) - E(\xi) > \varepsilon\Big\} \le \exp\Big\{-\frac{m\varepsilon^2}{2(\sigma^2 + M\varepsilon/3)}\Big\}$.
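The bound is easy to sanity-check numerically. The following Monte Carlo sketch is our illustration (the uniform distribution, sample size and threshold are arbitrary assumptions), comparing the empirical tail probability with the Bernstein bound:

import numpy as np

rng = np.random.default_rng(1)

# xi ~ Uniform[0, 1] (illustrative choice): mean 1/2, variance 1/12, M = 1/2.
m, eps, trials = 200, 0.1, 20000
mean, var, M = 0.5, 1.0 / 12.0, 0.5

deviations = rng.random((trials, m)).mean(axis=1) - mean
empirical = (deviations > eps).mean()
bound = np.exp(-m * eps**2 / (2 * (var + M * eps / 3)))
print(f"empirical tail {empirical:.5f} <= Bernstein bound {bound:.5f}")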
Proposition 4 (See Cucker and Smale [5]). Let $0 \le \tau \le 1$, $c, M \ge 0$, and let $G$ be a set of functions on $Z$ such that for every $g \in G$, $E(g) \ge 0$, $|g - E(g)| \le M$ and $E(g^2) \le c\,(E(g))^\tau$. Then for all $\varepsilon > 0$,
$\mathrm{Prob}_{z \in Z^m}\Big\{\sup_{g \in G} \frac{E(g) - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{(E(g))^\tau + \varepsilon^\tau}} > 4\varepsilon^{1-\tau/2}\Big\} \le \mathcal{N}(G, \varepsilon)\,\exp\Big\{-\frac{m\varepsilon^{2-\tau}}{2(c + M\varepsilon^{1-\tau}/3)}\Big\}$.
The following Lemmas 2 and 3, established by Cucker and Smale [5], will be used.

Lemma 2 (See Cucker and Smale [5]). Let $c_1, c_2, \ldots, c_l > 0$ and $s > q_1 > q_2 > \cdots > q_l > 0$. Then the equation
$x^s - c_1 x^{q_1} - c_2 x^{q_2} - \cdots - c_{l-1} x^{q_{l-1}} - c_l = 0$
has a unique positive zero $x^*$. In addition,
$x^* \le \max\{(l c_1)^{1/(s-q_1)}, (l c_2)^{1/(s-q_2)}, \ldots, (l c_{l-1})^{1/(s-q_{l-1})}, (l c_l)^{1/s}\}$.
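Lemma 2 is also easy to check numerically. The sketch below is our illustration with $l = 2$ and arbitrary values of $s$, $q_1$, $c_1$, $c_2$; it locates the unique positive root and verifies the stated bound:

import numpy as np
from scipy.optimize import brentq

# x^s - c1 * x^q1 - c2 = 0: find the unique positive zero x* and compare with
# max((l*c1)^(1/(s-q1)), (l*c2)^(1/s)) for l = 2 terms. Values are arbitrary.
s, q1 = 2.0, 0.5
c1, c2 = 1.0, 3.0
phi = lambda x: x**s - c1 * x**q1 - c2

x_star = brentq(phi, 1e-9, 1e6)          # unique positive root (phi changes sign)
bound = max((2 * c1)**(1 / (s - q1)), (2 * c2)**(1 / s))
print(f"x* = {x_star:.4f} <= bound {bound:.4f}")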
By Proposition 1, we can write the sample error as
$S_{T,T',\lambda} = \{\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c) - (\mathcal{E}_{T,T'}(\pi(f_{T,T',\lambda})) - \mathcal{E}_{T,T'}(f_c))\} + \{\mathcal{E}_{T,T'}(f_\lambda) - \mathcal{E}_{T,T'}(f_c) - (\mathcal{E}(f_\lambda) - \mathcal{E}(f_c))\} =: S_1 + S_2$.

Proposition 5. Let $\rho > 0$ and let $\mu$ and $Q$ be distributions on $X$ such that $Q$ has a density $h$ with respect to $\mu$. For $s = 1/(1+\rho)$ we write $P = Q \oplus_s \mu$. If $h$ has $\rho$-exponent $q$, then with probability at least $1 - 2\delta/3$ there holds
$S_2 \le s\Big(\frac{4\ln(6/\delta)}{3n} + \Big(\frac{2c'\ln(6/\delta)}{n}\Big)^{1/(2-\tau)} + \frac{7 B_\lambda \ln(6/\delta)}{6n}\Big) + (1-s)\Big(\frac{4\ln(6/\delta)}{3m} + \Big(\frac{2c'\ln(6/\delta)}{m}\Big)^{1/(2-\tau)} + \frac{7 B_\lambda \ln(6/\delta)}{6m}\Big) + D(\lambda)$,
where $B_\lambda := 2\kappa\sqrt{D(\lambda)/\lambda}$, $\tau = q/(q+1)$ and $c' = 8(1/(2c_q))^{q/(q+1)}$.

Proof. Since
$S_2 = \{\mathcal{E}_{T,T'}(f_\lambda) - \mathcal{E}_{T,T'}(f_c) - (\mathcal{E}(f_\lambda) - \mathcal{E}(f_c))\}$
$= s\Big(\frac{1}{n}\sum_{i=1}^n l(1, f_\lambda(x_i)) - \frac{1}{n}\sum_{i=1}^n l(1, f_c(x_i)) - \Big(\int_X l(1, f_\lambda)\,dQ - \int_X l(1, f_c)\,dQ\Big)\Big)$
$+ (1-s)\Big(\frac{1}{m}\sum_{j=1}^m l(-1, f_\lambda(\bar{x}_j)) - \frac{1}{m}\sum_{j=1}^m l(-1, f_c(\bar{x}_j)) - \Big(\int_X l(-1, f_\lambda)\,d\mu - \int_X l(-1, f_c)\,d\mu\Big)\Big)$
$=: s S_{21} + (1-s) S_{22}$,
we can first consider the random variable $\xi_1 = l(1, f_\lambda) - l(1, f_c)$ on $X \times \{1\}$ to estimate $S_{21}$. Denote
$\xi_1 = \xi_{11} + \xi_{12} = \{l(1, f_\lambda) - l(1, \pi(f_\lambda))\} + \{l(1, \pi(f_\lambda)) - l(1, f_c)\}$.
For $\xi_{11}$, from (10) we have $\|f_\lambda\|_\infty \le \kappa\|f_\lambda\|_H \le \kappa\sqrt{D(\lambda)/\lambda}$. Then we can get $0 \le \xi_{11} \le B_\lambda = 2\kappa\sqrt{D(\lambda)/\lambda}$. Hence $|\xi_{11} - E_{x\sim Q}(\xi_{11})| \le B_\lambda$. Applying Proposition 3 to $\xi_{11}$, we know that for any $\varepsilon > 0$,
$\mathrm{Prob}\Big\{\frac{1}{n}\sum_{i=1}^n \xi_{11}(x_i) - E_{x\sim Q}(\xi_{11}) > \varepsilon\Big\} \le \exp\Big\{-\frac{n\varepsilon^2}{2(\sigma^2(\xi_{11}) + B_\lambda \varepsilon/3)}\Big\}$.
Solving the quadratic equation
$\exp\Big\{-\frac{n\varepsilon^2}{2(\sigma^2(\xi_{11}) + B_\lambda \varepsilon/3)}\Big\} = \frac{\delta}{6}$
for $\varepsilon$ and using $\sigma^2(\xi_{11}) \le B_\lambda E_{x\sim Q}(\xi_{11})$, we obtain that with probability at least $1 - \delta/6$,
$\frac{1}{n}\sum_{i=1}^n \xi_{11}(x_i) - E_{x\sim Q}(\xi_{11}) \le \frac{7 B_\lambda \ln(6/\delta)}{6n} + E_{x\sim Q}(\xi_{11})$.
For $\xi_{12}$, we see that the inequality
$\frac{1}{n}\sum_{i=1}^n l(1, \pi(f_\lambda)(x_i)) - \frac{1}{n}\sum_{i=1}^n l(1, f_c(x_i)) - \Big(\int_X l(1, \pi(f_\lambda))\,dQ - \int_X l(1, f_c)\,dQ\Big) \le \varepsilon_2$
holds with probability at least $1 - \delta/6$, where
$\varepsilon_2 \le \frac{4\ln(6/\delta)}{3n} + \sqrt{\frac{4\ln(6/\delta)\,\sigma^2(\xi_{12})}{n}}$.
Lemma 1 tells us that $\sigma^2(\xi_{12}) \le c'\,(E_{x\sim Q}(\xi_{12}))^\tau$ with $\tau = q/(q+1)$ and $c' = 8(1/(2c_q))^{q/(q+1)}$. Applying Lemma 3, we have
$\varepsilon_2 \le \frac{4\ln(6/\delta)}{3n} + \Big(\frac{2c'\ln(6/\delta)}{n}\Big)^{1/(2-\tau)} + E_{x\sim Q}(\xi_{12})$.
Combining the above estimates for $\xi_{11}$ and $\xi_{12}$ with the fact
$E_{x\sim Q}(\xi_{11}) + E_{x\sim Q}(\xi_{12}) = E_{x\sim Q}(\xi_1)$,
we conclude that with probability at least $1 - \delta/3$ there holds
$S_{21} \le \frac{4\ln(6/\delta)}{3n} + \Big(\frac{2c'\ln(6/\delta)}{n}\Big)^{1/(2-\tau)} + \frac{7 B_\lambda \ln(6/\delta)}{6n} + E_{x\sim Q}(\xi_1)$.
The same argument bounds $S_{22}$ with $m$, $\mu$ and $\xi_2 := l(-1, f_\lambda) - l(-1, f_c)$ in place of $n$, $Q$ and $\xi_1$. Since $s\,E_{x\sim Q}(\xi_1) + (1-s)\,E_{x\sim \mu}(\xi_2) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_c) \le D(\lambda)$, the two bounds together yield the stated estimate. □

We next estimate $S_1$.

Proof. It is easy to obtain that
$S_1 = \{\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c) - (\mathcal{E}_{T,T'}(\pi(f_{T,T',\lambda})) - \mathcal{E}_{T,T'}(f_c))\}$
$= s\Big(\int_X l(1, \pi(f_{T,T',\lambda}))\,dQ - \int_X l(1, f_c)\,dQ - \Big(\frac{1}{n}\sum_{i=1}^n l(1, \pi(f_{T,T',\lambda})(x_i)) - \frac{1}{n}\sum_{i=1}^n l(1, f_c(x_i))\Big)\Big)$
$+ (1-s)\Big(\int_X l(-1, \pi(f_{T,T',\lambda}))\,d\mu - \int_X l(-1, f_c)\,d\mu - \Big(\frac{1}{m}\sum_{j=1}^m l(-1, \pi(f_{T,T',\lambda})(\bar{x}_j)) - \frac{1}{m}\sum_{j=1}^m l(-1, f_c(\bar{x}_j))\Big)\Big)$
$=: s S_{11} + (1-s) S_{12}$.

Let $R > 0$. Applying Proposition 4 to the function set
$\mathcal{F}_1 = \{l(1, \pi(f)(x)) - l(1, f_c(x)) : f \in B_R\}$,
we see that each function $g \in \mathcal{F}_1$ takes the form $g(x) = l(1, \pi(f)(x)) - l(1, f_c(x))$ for some $f \in B_R$. Then
$E_{x\sim Q}(g) = E_{x\sim Q}\big(l(1, \pi(f)(x)) - l(1, f_c(x))\big) \ge 0$,
and by Lemma 2 the resulting critical value of $\varepsilon$ is bounded by
$\max\Big\{\Big(\frac{16 c_p R^p}{3n}\Big)^{1/(1+p)}, \Big(\frac{8 c_p c' R^p}{n}\Big)^{1/(2-\tau+p)}\Big\}$.
Then we deduce that with probability at least $1 - \delta/6$, the inequality
$\int_X l(1, \pi(f))\,dQ - \int_X l(1, f_c)\,dQ - \Big(\frac{1}{n}\sum_{i=1}^n l(1, \pi(f)(x_i)) - \frac{1}{n}\sum_{i=1}^n l(1, f_c(x_i))\Big) \le 4\varepsilon_3^{1-\tau/2}\sqrt{\big(E_{x\sim Q}(l(1,\pi(f))) - E_{x\sim Q}(l(1,f_c))\big)^\tau + \varepsilon_3^\tau}$
holds. However, Lemma 3 implies that for $0 < \tau < 1$,
$4\varepsilon_3^{1-\tau/2}\sqrt{\big(E_{x\sim Q}(l(1,\pi(f))) - E_{x\sim Q}(l(1,f_c))\big)^\tau + \varepsilon_3^\tau}$
$\le 4\varepsilon_3^{1-\tau/2}\big(E_{x\sim Q}(l(1,\pi(f))) - E_{x\sim Q}(l(1,f_c))\big)^{\tau/2} + 4\varepsilon_3$
$\le \frac{\tau}{2}\big(E_{x\sim Q}(l(1,\pi(f))) - E_{x\sim Q}(l(1,f_c))\big) + \Big(1 - \frac{\tau}{2}\Big)(4\varepsilon_3)^{2/(2-\tau)} + 4\varepsilon_3$
$\le \frac{1}{2}\big(E_{x\sim Q}(l(1,\pi(f))) - E_{x\sim Q}(l(1,f_c))\big) + 12\varepsilon_3$.
So with probability at least $1 - \delta/6$, the inequality
$\int_X l(1, \pi(f))\,dQ - \int_X l(1, f_c)\,dQ - \Big(\frac{1}{n}\sum_{i=1}^n l(1, \pi(f)(x_i)) - \frac{1}{n}\sum_{i=1}^n l(1, f_c(x_i))\Big) \le \frac{1}{2}\big(E_{x\sim Q}(l(1,\pi(f))) - E_{x\sim Q}(l(1,f_c))\big) + 12\varepsilon_3$
holds. Similarly, we apply Proposition 4 to the function set
$\mathcal{F}_2 = \{l(-1, \pi(f)(x)) - l(-1, f_c(x)) : f \in B_R\}$
and know that with probability at least $1 - \delta/6$ the inequality
$\int_X l(-1, \pi(f))\,d\mu - \int_X l(-1, f_c)\,d\mu - \frac{1}{m}\sum_{i=1}^m \big(l(-1, \pi(f)) - l(-1, f_c)\big)(\bar{x}_i) \le \frac{1}{2}\big(E_{x\sim \mu}(l(-1,\pi(f))) - E_{x\sim \mu}(l(-1,f_c))\big) + 12\varepsilon_4$
holds, with $\varepsilon_4$ the analogue of $\varepsilon_3$ for the sample drawn from $\mu$.

Combining the above estimates for $S_1$ and $S_2$, we conclude that with probability at least $1 - \delta$ there holds
$\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c) \le \frac{1}{2}\big(\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c)\big) + 12\varepsilon_4 + \frac{4\ln(6/\delta)}{3m} + \Big(\frac{2c'\ln(6/\delta)}{m}\Big)^{1/(2-\tau)} + \frac{7 B_\lambda \ln(6/\delta)}{6m} + D(\lambda)$.
Together with $D(\lambda) \le c_\beta \lambda^\beta$, we then have
$\mathcal{E}(\pi(f_{T,T',\lambda})) - \mathcal{E}(f_c) \le 24\varepsilon_4 + \frac{8\ln(6/\delta)}{3m} + 2\Big(\frac{2c'\ln(6/\delta)}{m}\Big)^{1/(2-\tau)} + \frac{7 B_\lambda \ln(6/\delta)}{3m} + 2 c_\beta \lambda^\beta$
$\le 26\Big(\frac{392\ln(6/\delta)}{3m} + \Big(\frac{8c'\ln(6/\delta)}{m}\Big)^{1/(2-\tau)}\Big) + 24\Big(\Big(\frac{16 c_p R^p}{3m}\Big)^{1/(1+p)} + \Big(\frac{8 c_p c' R^p}{m}\Big)^{1/(2-\tau+p)}\Big) + \frac{7 B_\lambda \ln(6/\delta)}{3m} + 2 c_\beta \lambda^\beta$
$\le 26\Big(\frac{392\ln(6/\delta)}{3m} + \Big(\frac{8c'\ln(6/\delta)}{m}\Big)^{1/(2-\tau)}\Big) + 24\Big(\Big(\frac{16 c_p \lambda^{-p/2}}{3m}\Big)^{1/(1+p)} + \Big(\frac{8 c_p c' \lambda^{-p/2}}{m}\Big)^{1/(2-\tau+p)}\Big) + \frac{14\kappa\sqrt{c_\beta}\,\ln(6/\delta)\,\lambda^{(\beta-1)/2}}{3m} + 2 c_\beta \lambda^\beta$
$\le \tilde{C}\Big\{\frac{\ln(6/\delta)}{m} + \Big(\frac{\ln(6/\delta)}{m}\Big)^{(q+1)/(q+2)} + \Big(\frac{\lambda^{-p/2}}{m}\Big)^{(q+1)/(q+2+pq+p)} + \frac{\ln(6/\delta)\,\lambda^{(\beta-1)/2}}{m} + \lambda^\beta\Big\}$,
where in the third step $R$ is taken of the order $\lambda^{-1/2}$ (so that $f_{T,T',\lambda} \in B_R$), and in the last step we used $1/(2-\tau) = (q+1)/(q+2)$ and $1/(2-\tau+p) = (q+1)/(q+2+pq+p)$ for $\tau = q/(q+1)$.
References

[14] I. Steinwart, D. Hush, C. Scovel, An oracle inequality for clipped regularized risk minimizers, Adv. Neural Inf. Process. Syst. 19 (2007) 1321-1328.
[15] I. Steinwart, D. Hush, C. Scovel, A classification framework for anomaly detection, J. Mach. Learn. Res. 6 (2005) 211-232.
[16] I. Steinwart, D. Hush, C. Scovel, Density level detection is classification, Neural Inf. Process. Syst. 17 (2005) 1337-1344.
[17] A.B. Tsybakov, On nonparametric estimation of density level sets, Ann. Statist. 25 (1997) 948-969.
[18] A.B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32 (2004) 135-166.
[19] G. Wahba, Spline Models for Observational Data, SIAM, 1990.
[20] Q. Wu, D.X. Zhou, SVM soft margin classifiers: linear programming versus quadratic programming, Neural Comput. 17 (2005) 1160-1187.
[21] Q. Wu, Y. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complexity 23 (2007) 108-134.
[22] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comput. Anal. Appl. 8 (2006) 99-119.
[23] Q. Wu, Y. Ying, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171-192.
[24] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Ann. Statist. 32 (2004) 56-85.

Feilong Cao was born in Zhejiang Province, China, in August 1965. He received the BS degree in Mathematics and the MS degree in Applied Mathematics from Ningxia University, China, in 1987 and 1998, respectively. In 2003, he received the PhD degree from the Institute for Information and System Science, Xi'an Jiaotong University, China. From 1987 to 1992, he was an assistant professor; from 1992 to 2002, he was a lecturer; and in 2002, he became a full professor. His current research interests include neural networks, machine learning and approximation theory. He is the author or coauthor of more than 150 scientific papers.

Xing Xing received the BSc degree in Information and Computational Science from China Jiliang University, China, in 2008. She is currently working towards the MSc degree at China Jiliang University, China. Her research interests include learning theory and support vector machines.