
Neurocomputing 82 (2012) 84–90. doi:10.1016/j.neucom.2011.10.032


Learning rates of support vector machine classifier for density level detection

Feilong Cao (corresponding author), Xing Xing, Jianwei Zhao
Department of Information and Mathematics Sciences, China Jiliang University, Hangzhou 310018, Zhejiang Province, PR China
E-mail address: feilongcao@gmail.com (F.L. Cao)

The research was supported by the National Natural Science Foundation of China (Nos. 60873206, 61101240) and the Zhejiang Provincial Natural Science Foundation of China (No. Y6110117).

Article history: Received 2 February 2011; Received in revised form 2 October 2011; Accepted 3 October 2011; Available online 28 December 2011. Communicated by G.B. Huang.

Abstract: In this paper, we consider the learning rates of the support vector machine (SVM) classifier for the density level detection (DLD) problem. Using an established classification framework, we obtain an error decomposition which consists of a regularization error and a sample error. Based on this decomposition, we derive learning rates of the SVM classifier for DLD under some assumptions.

Keywords: Learning rates; Support vector machines; Density level detection; Anomaly detection

Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.
1. Introduction

In [15], Steinwart et al. developed a classification framework for the density level detection (DLD) problem. Here we utilize recent results on classification from [4,14,21,22] to provide rate estimates for support vector machine (SVM) algorithms for the DLD problem.

Let us begin by defining the density level detection problem. One of the most common ways to define anomalies is by saying that anomalies are not concentrated (see [8,9]). To make this precise, let $Q$ be an unknown data-generating distribution on the input space $X$. Moreover, to describe the concentration of $Q$ we need a known reference distribution $\mu$ on $X$. Assume $Q$ has a density $h$ with respect to $\mu$, i.e. $dQ = h\,d\mu$. Then the sets $\{x: h(x)>\rho\}$, $\rho>0$, describe the concentration of $Q$. Consequently, to define anomalies in terms of the concentration we only have to fix a threshold level $\rho>0$, so that a sample $x\in X$ is considered anomalous whenever $x\in\{x: h(x)\le\rho\}$. Therefore our goal is to find the density level set $\{x: h(x)\le\rho\}$, or equivalently, the $\rho$-level set $\{x: h(x)>\rho\}$. As in many other papers (see [7,17]), we assume that $\{x: h(x)=\rho\}$ is a $\mu$-zero set and hence also a $Q$-zero set.

Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting eco-system disturbances.

Let us now introduce some mathematical notation. We assume that
$$T=(x_1,\ldots,x_{n_1})\in X^{n_1}$$
is a training set which is independently drawn according to $Q$. With the help of $T$, a DLD algorithm constructs a function $f_T: X\to\mathbb{R}$ for which the set $\{x: f_T(x)>0\}$ is the estimate of the $\rho$-level set. Since in general $\{x: f_T(x)>0\}$ does not exactly coincide with $\{x: h(x)>\rho\}$, we need a performance measure which describes how well $\{x: f_T(x)>0\}$ approximates the set $\{x: h(x)>\rho\}$. The best known performance measure (see [18,3]) for measurable functions $f_T: X\to\mathbb{R}$ is
$$S_{\mu,h,\rho}(f):=\mu\bigl(\{f>0\}\,\triangle\,\{h>\rho\}\bigr),$$
where $\triangle$ denotes the symmetric difference. The goal of the DLD problem is then equivalent to finding $f_T$ such that $S_{\mu,h,\rho}(f_T)$ is close to zero.

Unfortunately, we cannot directly compute $S_{\mu,h,\rho}(f)$ since $\{h>\rho\}$ is unknown to us. To overcome this problem, a method has been proposed in [15] that provides a quantitative relationship between the classification risk and $S_{\mu,h,\rho}(f)$.

In this paper, by interpreting the DLD problem as binary classification, we use an error decomposition (see [20,21]) which consists of two parts: a regularization error and a sample error. Based on this error decomposition, we obtain learning rates for SVMs for the DLD problem under some assumptions.

The paper is organized as follows. In Section 2, we introduce the necessary notions and notation and present several useful tools which will be used in the proof of the main results. In Section 3, we show how the learning error can be decomposed into a sample error and a regularization error, and we give the learning rates for the DLD problem, i.e., the main result of this paper. In Section 4, we conclude the paper. Finally, the proof of the main result is given in the Appendix.
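To make the performance measure concrete, the following is a minimal numerical sketch (not part of the original analysis): it estimates $S_{\mu,h,\rho}(f)$ by Monte Carlo sampling from the reference measure; the particular choices of $\mu$, $h$, $\rho$ and $f$ are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the paper): reference measure mu = Uniform[0, 1],
# density h of Q with respect to mu, threshold rho, and a candidate decision function f.
h = lambda x: 2.0 * x          # density of Q w.r.t. mu on X = [0, 1]
rho = 1.0                      # threshold level
f = lambda x: x - 0.45         # candidate decision function; {f > 0} estimates {h > rho}

# Monte Carlo estimate of S_{mu,h,rho}(f) = mu({f > 0} symmetric-difference {h > rho}).
x = rng.uniform(0.0, 1.0, size=200_000)   # draws from the reference measure mu
in_estimate = f(x) > 0.0
in_true_level_set = h(x) > rho
S_hat = np.mean(in_estimate ^ in_true_level_set)   # mu-mass of the symmetric difference
print(f"estimated S_mu,h,rho(f) = {S_hat:.4f}")    # exact value here is |0.5 - 0.45| = 0.05
```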


2. Preliminaries

In this section, we first recall the relevant definitions and then investigate learning rates for the SVM for density level detection given in [15]. To this end we write $Y:=\{-1,1\}$ and give the following definition.

Definition 1 (See Steinwart et al. [15,16]). Let $\mu$ and $Q$ be probability measures on $X$ and $s\in(0,1)$. Then the probability measure $Q\oplus_s\mu$ on $X\times Y$ is defined by
$$(Q\oplus_s\mu)(A):=s\,E_{x\sim Q}I_A(x,1)+(1-s)\,E_{x\sim\mu}I_A(x,-1)$$
for all measurable $A\subset X\times Y$. Here $I_A$ denotes the indicator function of a set $A$.

From the definition we know the measure $P:=Q\oplus_s\mu$ can be associated with a binary classification problem in which positive samples are drawn from $Q$ and negative samples are drawn from $\mu$. However, in general classification problems the samples are drawn from only one probability measure. Inspired by this interpretation, let us recall that the binary classification risk for a measurable function $f: X\to\mathbb{R}$ and a distribution $P$ on $Z:=X\times Y$ is defined by
$$R_P(f)=P(\{(x,y):\operatorname{sign}f(x)\ne y\}),$$
where we define $\operatorname{sign}(t):=1$ if $t>0$ and $\operatorname{sign}(t):=-1$ otherwise. Furthermore, the Bayes risk $R_P$ of $P$ is the smallest possible classification risk with respect to $P$, i.e.
$$R_P=\inf\{R_P(f)\mid f: X\to\mathbb{R}\ \text{measurable}\}.$$

Steinwart et al. [15] showed that for $P:=Q\oplus_s\mu$ and $s=1/(1+\rho)$,
$$R_P=R_P(f_c),$$
where
$$f_c=I_{\{h>\rho\}}-I_{\{h\le\rho\}}=\operatorname{sign}(2P(y=1\mid x)-1),$$
i.e. $f_c$ is the Bayes rule. Moreover, Steinwart et al. [15] also showed that $S_{\mu,h,\rho}(f)\to 0$ if and only if $R_P(f)\to R_P$. Therefore a classification algorithm which makes $R_P(\cdot)$ close to $R_P$ also makes $S_{\mu,h,\rho}(\cdot)$, or simply $S_P(\cdot)$, close to zero. Thus our goal is also to find a function $f_T$ such that $R_P(f_T)\to R_P$.

Based on the above interpretation, Steinwart et al. [15] constructed an SVM for the DLD problem. Let $K: X\times X\to\mathbb{R}$ be a positive definite kernel with reproducing kernel Hilbert space (RKHS) $H$ (see [1]). Furthermore, let $\mu$ be a known probability measure on $X$ and let $l: Y\times\mathbb{R}\to[0,\infty)$ be the hinge loss, i.e. $l(y,t):=\max\{0,1-yt\}$, $y\in Y$, $t\in\mathbb{R}$. Then for training sets $T=(x_1,\ldots,x_n)\in X^n$ and $T'=(x'_1,\ldots,x'_m)\in X^m$, a regularization parameter $\lambda>0$, and $\rho>0$ we define
$$f_{T,T',\lambda}\in\arg\min_{f\in H}\Bigl(\lambda\|f\|_H^2+\frac{1}{(1+\rho)n}\sum_{i=1}^{n}l(1,f(x_i))+\frac{\rho}{(1+\rho)m}\sum_{j=1}^{m}l(-1,f(x'_j))\Bigr).\qquad(1)$$
The decision function of the SVM is $f_{T,T',\lambda}: X\to\mathbb{R}$. Therefore, the goal of this paper is to estimate the excess classification error $R_P(\operatorname{sign}f_{T,T',\lambda})-R_P$.

From now on the measure that we consider is $P:=Q\oplus_s\mu$ with $s=1/(1+\rho)$. The generalization error is defined as
$$E(f)=\int_Z l(y,f(x))\,dP.$$
From the definition of $P$, we know that
$$E(f)=\int_Z l(y,f(x))\,dP=s\int_X l(1,f(x))\,dQ+(1-s)\int_X l(-1,f(x))\,d\mu.$$
The corresponding empirical error is defined as
$$E_{T,T'}(f)=\frac{1}{(1+\rho)n}\sum_{i=1}^{n}l(1,f(x_i))+\frac{\rho}{(1+\rho)m}\sum_{j=1}^{m}l(-1,f(x'_j)).$$
Then we can rewrite (1) as
$$f_{T,T',\lambda}\in\arg\min_{f\in H}\bigl\{\lambda\|f\|_H^2+E_{T,T'}(f)\bigr\}.\qquad(2)$$
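The optimization problem (1)–(2) can be sketched numerically. The snippet below is an illustrative toy implementation, not the construction analyzed in this paper: it draws $n$ points from $Q$ and $m$ points from the reference measure $\mu$, represents $f$ by a Gaussian-kernel expansion over the combined sample, and runs plain subgradient descent on $\lambda\|f\|_H^2+E_{T,T'}(f)$. The distributions, kernel width, step-size schedule and iteration count are arbitrary, untuned choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data on X = R: Q is N(0, 0.5^2), the reference measure mu is Uniform[-3, 3].
n, m, rho, lam = 200, 200, 0.5, 1e-3
xq = rng.normal(0.0, 0.5, n)        # sample T drawn from Q   (treated as label +1)
xm = rng.uniform(-3.0, 3.0, m)      # sample T' drawn from mu (treated as label -1)
x = np.concatenate([xq, xm])
y = np.concatenate([np.ones(n), -np.ones(m)])
w = np.where(y > 0, 1.0 / ((1 + rho) * n), rho / ((1 + rho) * m))   # weights in (1)

def gram(a, b, width=0.5):
    """Gaussian kernel matrix with entries K(a_i, b_j)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

K = gram(x, x)
alpha = np.zeros(n + m)             # f = sum_i alpha_i K(x_i, .)

for t in range(1, 2001):            # subgradient descent on lam*||f||_H^2 + E_{T,T'}(f)
    fx = K @ alpha                                          # f evaluated at the sample
    active = (y * fx < 1).astype(float)                     # where the hinge loss is active
    grad = 2 * lam * K @ alpha - K @ (w * y * active)       # subgradient w.r.t. alpha
    alpha -= (0.5 / np.sqrt(t)) * grad                      # diminishing step size (arbitrary)

# Points with f(x) <= 0 are declared anomalous (outside the estimated rho-level set).
grid = np.linspace(-3, 3, 7)
print(np.round(gram(grid, x) @ alpha, 3))                   # decision values on a coarse grid
```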

3. Error analysis

In this section, we investigate learning rates, i.e., the decay of the excess classification error $R_P(\operatorname{sign}f_{T,T',\lambda})-R_P$ as $n$ and $m$ become large.

The error analysis aims at bounding $R_P(\operatorname{sign}f_{T,T',\lambda})-R_P$. But by (2), we know that the algorithm is associated with $E_{T,T'}(f)$. So we need the relationship between $R_P(\operatorname{sign}f_{T,T',\lambda})-R_P$ and $E_{T,T'}(f_{T,T',\lambda})-E_{T,T'}(f_c)$. Some work has been done on this topic (see [2,4,6,24]). In this paper we consider the hinge loss. Then from [24] we know that for every measurable function $f: X\to\mathbb{R}$,
$$R_P(\operatorname{sign}f)-R_P\le E(f)-E(f_c).\qquad(3)$$
By [19], $f_c=\arg\min E(f)$ with the minimum taken over all functions $f: X\to\overline{\mathbb{R}}$, where $\overline{\mathbb{R}}=\mathbb{R}\cup\{\pm\infty\}$.

Definition 2 (See Wu and Zhou [20]). The projection operator $\pi$ is defined on the space of measurable functions $f: X\to\mathbb{R}$ by
$$\pi(f)(x)=\begin{cases}1,& f(x)>1,\\ -1,& f(x)<-1,\\ f(x),& -1\le f(x)\le 1.\end{cases}$$

It is easy to see that $\pi(f)$ and $f$ induce the same classifier, i.e. $\operatorname{sign}(\pi(f))=\operatorname{sign}(f)$, and that $l(y,\pi(f)(x))\le l(y,f(x))$. Hence
$$E(\pi(f))\le E(f)\quad\text{and}\quad E_{T,T'}(\pi(f))\le E_{T,T'}(f).\qquad(4)$$
Applying this fact to the error analysis, it is sufficient to bound the excess generalization error for $\pi(f)$ instead of $f$. This leads to better estimates.

Now we can present the error decomposition which leads to bounds on the excess generalization error for $\pi(f_{T,T',\lambda})$. Define
$$f_\lambda=\arg\min_{f\in H}\bigl\{E(f)+\lambda\|f\|_H^2\bigr\}.$$

Proposition 1. Let $f_{T,T',\lambda}$ be given by (2). Then
$$E(\pi(f_{T,T',\lambda}))-E(f_c)\le D(\lambda)+S(T,T',\lambda),\qquad(5)$$
where $D(\lambda)$ is the regularization error defined by
$$D(\lambda)=\inf_{f\in H}\bigl\{E(f)-E(f_c)+\lambda\|f\|_H^2\bigr\},$$
and $S(T,T',\lambda)$ is the sample error defined by
$$S(T,T',\lambda)=\bigl\{E(\pi(f_{T,T',\lambda}))-E_{T,T'}(\pi(f_{T,T',\lambda}))\bigr\}+\bigl\{E_{T,T'}(f_\lambda)-E(f_\lambda)\bigr\}.$$

Proof. Since
$$E(\pi(f_{T,T',\lambda}))-E(f_c)\le E(\pi(f_{T,T',\lambda}))-E(f_c)+\lambda\|f_{T,T',\lambda}\|_H^2$$
$$=\bigl\{E(\pi(f_{T,T',\lambda}))-E_{T,T'}(\pi(f_{T,T',\lambda}))\bigr\}+\bigl\{E_{T,T'}(\pi(f_{T,T',\lambda}))+\lambda\|f_{T,T',\lambda}\|_H^2-E_{T,T'}(f_\lambda)-\lambda\|f_\lambda\|_H^2\bigr\}+\bigl\{E_{T,T'}(f_\lambda)-E(f_\lambda)\bigr\}+\bigl\{E(f_\lambda)-E(f_c)+\lambda\|f_\lambda\|_H^2\bigr\},$$
we obtain from (4) that $E_{T,T'}(\pi(f))\le E_{T,T'}(f)$. This, in connection with the definition of $f_{T,T',\lambda}$, tells us that
$$\bigl\{E_{T,T'}(\pi(f_{T,T',\lambda}))+\lambda\|f_{T,T',\lambda}\|_H^2-E_{T,T'}(f_\lambda)-\lambda\|f_\lambda\|_H^2\bigr\}\le 0.$$
This proves (5). □

Throughout this paper, we assume that the kernel is uniformly bounded in the sense that
$$\kappa:=\sup_{x\in X}\sqrt{K(x,x)}<\infty.$$
To state our result, we need to introduce several further concepts and notations. First, we need the capacity of the space $H$, which plays an essential role in the sample error estimates. In this paper, we use covering numbers measured by empirical distances.

Definition 3 (See Cucker and Smale [5]). Let $\mathcal{F}$ be a subset of a metric space. For any $\varepsilon>0$, the covering number $N(\mathcal{F},\varepsilon)$ is defined to be the minimal integer $l\in\mathbb{N}$ such that there exist $l$ balls with radius $\varepsilon$ covering $\mathcal{F}$.

The function sets in our situation are balls of the form $B_R=\{f\in H:\|f\|_H\le R\}$. We denote the covering number of the unit ball $B_1$ by
$$N(\varepsilon)=N(B_1,\varepsilon).$$

Definition 4 (See Wu and Zhou [20] and Wu et al. [21]). The RKHS $H$ is said to have polynomial complexity exponent $p\in(0,2)$ if there is some $c_p>0$ such that
$$\log N(\varepsilon)\le c_p(1/\varepsilon)^p.\qquad(6)$$

In our analysis an additional assumption on the behavior of $h$ around the level $\rho$ is required.

Definition 5 (See Steinwart et al. [15,16]). Let $\mu$ be a distribution on $X$ and let $h: X\to[0,\infty)$ be a measurable function with $\int h\,d\mu=1$, i.e. $h$ is a density with respect to $\mu$. For $\rho>0$ and $0\le q\le\infty$ we say that $h$ has $\rho$-exponent $q$ if there exists a constant $C>0$ such that for all $t>0$ we have
$$\mu(\{|h-\rho|\le t\})\le Ct^q.\qquad(7)$$

Condition (7) was first considered in [7]. Later, Tsybakov [17] used (7) for a density level detection method which is based on a localized version of the empirical excess mass approach. Steinwart et al. [15] showed that condition (7) is closely related to a concept for binary classification called the Tsybakov noise exponent (see [18]), as the following proposition states.

Proposition 2 (See Steinwart et al. [15]). Let $\mu$ and $Q$ be distributions on $X$ such that $Q$ has a density $h$ with respect to $\mu$. For $\rho>0$ we write $s:=1/(1+\rho)$ and define $P:=Q\oplus_s\mu$. Then for $0\le q\le\infty$ the following are equivalent:

(1) $h$ has $\rho$-exponent $q$;
(2) $P$ has Tsybakov noise exponent $q$, i.e. there exists a constant $C>0$ such that for all $t>0$ we have
$$P_X(\{|2\eta-1|\le t\})\le Ct^q,$$
where $\eta(x)=P(y=1\mid x)$.

Moreover, there exists a constant $c_q$ such that for all $t>0$,
$$P_X(\{|2\eta-1|\le t\})\le Ct^q\iff P_X(\{|2\eta-1|\le c_qt\})\le t^q.$$

Lemma 1 (See Wu and Zhou [20]). If $P$ has Tsybakov noise exponent $q$, then for every function $f: X\to[-1,1]$ there holds
$$E\bigl[(l(y,f)-l(y,f_c))^2\bigr]\le 8\Bigl(\frac{1}{2c_q}\Bigr)^{q/(q+1)}\bigl(E(f)-E(f_c)\bigr)^{q/(q+1)}.$$

Steinwart et al. [15] established an inequality between $S_P(f)$ and the excess classification risk $R_P(f)-R_P$: if $h$ has $\rho$-exponent $q$, then there exists a constant $c_1>0$ such that for all measurable $f: X\to\mathbb{R}$ we have
$$S_P(f)\le c_1\bigl(R_P(f)-R_P\bigr)^{q/(q+1)}.\qquad(8)$$
Then by (3) and (8),
$$S_P(\operatorname{sign}f)\le c_1\bigl(E(f)-E(f_c)\bigr)^{q/(q+1)}.\qquad(9)$$

Theorem 1. Let $\rho>0$ and let $\mu$ and $Q$ be distributions on $X$ such that $Q$ has a density $h$ with respect to $\mu$. For $s=1/(1+\rho)$ we write $P=Q\oplus_s\mu$. If $h$ has $\rho$-exponent $q$, (6) holds, and for some $0<\beta\le 1$ and $c_\beta>0$ we have $D(\lambda)\le c_\beta\lambda^\beta$, then there exists a constant $C_E>0$ independent of $m$ such that with confidence $1-\delta$,
$$S_P(\operatorname{sign}\pi(f_{T,T',\lambda}))\le C_E m^{-\gamma q/(q+1)},$$
where
$$\gamma=\min\Bigl\{\frac{2\beta}{(\beta+1)(pq+1)},\ \frac{2\beta(q+1)}{2\beta(q+2)+pq+p}\Bigr\}.$$

Remark 1. Scovel et al. [10] have addressed learning rates for the DLD problem with Gaussian kernels under assumptions on the density and the distribution. The main result of this paper, Theorem 1, gives learning rates of the clipped SVM classifier for the DLD problem under assumptions on the covering number, the density and the regularization error. The error decomposition method proposed in this paper is different from that adopted in [10]. In fact, Scovel et al. [10] used Talagrand's concentration inequality to estimate the sample error (see [13])
$$E(f_{T,T',\lambda})+\lambda\|f_{T,T',\lambda}\|_H^2-E_{T,T'}(f_\lambda)-\lambda\|f_\lambda\|_H^2.$$
It was shown in [10] that if $h$ has $\rho$-exponent $\alpha\in(0,\infty]$, then the learning rates have the form $O(n^{-q\alpha/((1+q)(2\alpha+1))+\epsilon})$ for $\alpha<(q+2)/(2q)$ and $O(n^{-2q\alpha/(2\alpha(2+q)+3q+4)+\epsilon})$ otherwise. Theorem 1 of this paper shows that, as a classical learning algorithm, the learning rates of the algorithm for the DLD problem have the form $O(m^{-\gamma q/(q+1)})$, where
$$\gamma=\min\Bigl\{\frac{2\beta}{(\beta+1)(pq+1)},\ \frac{2\beta(q+1)}{2\beta(q+2)+pq+p}\Bigr\}.$$

Remark 2. Smale and Zhou [12] have given an extension of learning theory to a setting where the i.i.d. assumption on the data is weakened. They hypothesized that a sequence of examples $(x_t,y_t)\in X\times Y$, $t=1,2,3,\ldots$, is drawn from a probability distribution $\rho_t$ on $X\times Y$, and deduced an optimal learning rate in non-i.i.d. settings.
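For concrete parameter values, the exponent in Theorem 1 (as reconstructed above) can be evaluated with a small helper; the function name and the example values are illustrative only.

```python
def dld_rate_exponents(beta, p, q):
    """Exponents from Theorem 1 (as reconstructed above): the excess-risk rate m^{-gamma}
    and the resulting density-level-set rate S_P = O(m^{-gamma*q/(q+1)})."""
    gamma = min(2 * beta / ((beta + 1) * (p * q + 1)),
                2 * beta * (q + 1) / (2 * beta * (q + 2) + p * q + p))
    return gamma, gamma * q / (q + 1)

# Example: beta = 1 (fastest regularization-error decay), p = 1, q = 1.
print(dld_rate_exponents(beta=1.0, p=1.0, q=1.0))   # -> (0.5, 0.25)
```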

4. Conclusion

Motivated by the classification framework established in [16], we have defined the generalization error and the corresponding empirical error for the density level detection (DLD) problem, and regarded the DLD algorithm as a classical learning algorithm of learning theory. The definitions and theorems given here benefit from recent developments in learning theory, such as [4,5,11–13,21]. As the main result of this paper, by means of an error decomposition consisting of a regularization error and a sample error, we used the iteration procedure developed in [13,21,23] and obtained learning rates for the DLD problem.

Acknowledgments

The authors wish to thank the referees for their helpful suggestions.

Appendix A

To prove Theorem 1, we first give some concentration inequalities.

Proposition 3 (One-sided Bernstein inequality, see Cucker and Smale [5]). Let $\xi$ be a random variable on a probability space $Z$ with mean $E\xi$ and variance $\sigma^2(\xi)=\sigma^2$. If $|\xi(z)-E\xi|\le M$ for almost all $z\in Z$, then for all $\varepsilon>0$,
$$\operatorname{Prob}_{z\in Z^m}\Bigl\{\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-E\xi>\varepsilon\Bigr\}\le\exp\Bigl(-\frac{m\varepsilon^2}{2(\sigma^2+M\varepsilon/3)}\Bigr).$$

Proposition 4 (See Cucker and Smale [5]). Let $0\le\tau\le 1$, $c,M\ge 0$, and let $\mathcal{G}$ be a set of functions on $Z$ such that for every $g\in\mathcal{G}$, $Eg\ge 0$, $|g-Eg|\le M$ and $Eg^2\le c(Eg)^\tau$. Then for all $\varepsilon>0$,
$$\operatorname{Prob}_{z\in Z^m}\Biggl\{\sup_{g\in\mathcal{G}}\frac{Eg-\frac{1}{m}\sum_{i=1}^{m}g(z_i)}{\sqrt{(Eg)^\tau+\varepsilon^\tau}}>4\varepsilon^{1-\tau/2}\Biggr\}\le N(\mathcal{G},\varepsilon)\exp\Bigl(-\frac{m\varepsilon^{2-\tau}}{2(c+M\varepsilon^{1-\tau}/3)}\Bigr).$$

The following Lemmas 2 and 3, established by Cucker and Smale [5], will also be used.

Lemma 2 (See Cucker and Smale [5]). Let $c_1,c_2,\ldots,c_l>0$ and $s>q_1>q_2>\cdots>q_{l-1}>0$. Then the equation
$$x^s-c_1x^{q_1}-c_2x^{q_2}-\cdots-c_{l-1}x^{q_{l-1}}-c_l=0$$
has a unique positive zero $x^*$. In addition,
$$x^*\le\max\bigl\{(lc_1)^{1/(s-q_1)},(lc_2)^{1/(s-q_2)},\ldots,(lc_{l-1})^{1/(s-q_{l-1})},(lc_l)^{1/s}\bigr\}.$$

Lemma 3 (See Cucker and Smale [5]). Let $p_1,p_2>1$ be such that $1/p_1+1/p_2=1$. Then
$$ab\le\frac{1}{p_1}a^{p_1}+\frac{1}{p_2}b^{p_2},\qquad\forall a,b>0.$$

From the definition of $f_c$, we get
$$\lambda\|f_\lambda\|_H^2\le E(f_\lambda)-E(f_c)+\lambda\|f_\lambda\|_H^2=D(\lambda),$$
and hence
$$\|f_\lambda\|_H\le\sqrt{D(\lambda)/\lambda}.\qquad(10)$$
Moreover, by choosing $f=0$, we find
$$\lambda\|f_{T,T',\lambda}\|_H^2\le E_{T,T'}(f_{T,T',\lambda})+\lambda\|f_{T,T',\lambda}\|_H^2\le E_{T,T'}(0)+\lambda\cdot 0=1,$$
and hence
$$\|f_{T,T',\lambda}\|_H\le\sqrt{1/\lambda}.\qquad(11)$$
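Before using these tools, a quick simulation can illustrate the one-sided Bernstein inequality of Proposition 3; the bounded random variable and the values of $n$ and $\varepsilon$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# xi ~ Uniform[0, 1]: mean 1/2, variance 1/12, and |xi - E xi| <= M = 1/2.
n_rep, n, eps = 100_000, 50, 0.1
samples = rng.uniform(0.0, 1.0, size=(n_rep, n))
lhs = np.mean(samples.mean(axis=1) - 0.5 > eps)                    # empirical tail probability
rhs = np.exp(-n * eps ** 2 / (2 * (1.0 / 12 + 0.5 * eps / 3)))     # Bernstein bound of Prop. 3
print(f"empirical tail {lhs:.4f} <= Bernstein bound {rhs:.4f}")
```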

By Proposition 1, we can write the sample error as
$$S(T,T',\lambda)=\bigl\{E(\pi(f_{T,T',\lambda}))-E(f_c)-\bigl[E_{T,T'}(\pi(f_{T,T',\lambda}))-E_{T,T'}(f_c)\bigr]\bigr\}+\bigl\{E_{T,T'}(f_\lambda)-E_{T,T'}(f_c)-\bigl[E(f_\lambda)-E(f_c)\bigr]\bigr\}=:S_1+S_2.$$

Proposition 5. Let $\rho>0$ and let $\mu$ and $Q$ be distributions on $X$ such that $Q$ has a density $h$ with respect to $\mu$. For $s=1/(1+\rho)$ we write $P=Q\oplus_s\mu$. If $h$ has $\rho$-exponent $q$, then with probability at least $1-2\delta/3$ there holds
$$S_2\le s\Bigl(\frac{4\ln(6/\delta)}{3n}+\Bigl(\frac{2c^*\ln(6/\delta)}{n}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6n}\Bigr)+(1-s)\Bigl(\frac{4\ln(6/\delta)}{3m}+\Bigl(\frac{2c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6m}\Bigr)+D(\lambda),$$
where $B_\lambda:=2\kappa\sqrt{D(\lambda)/\lambda}$, $\tau=q/(q+1)$ and $c^*=8(1/(2c_q))^{q/(q+1)}$.

Proof. Since
$$S_2=\bigl\{E_{T,T'}(f_\lambda)-E_{T,T'}(f_c)-\bigl(E(f_\lambda)-E(f_c)\bigr)\bigr\}$$
$$=s\Bigl(\frac{1}{n}\sum_{i=1}^{n}l(1,f_\lambda(x_i))-\frac{1}{n}\sum_{i=1}^{n}l(1,f_c(x_i))-\Bigl(\int_X l(1,f_\lambda)\,dQ-\int_X l(1,f_c)\,dQ\Bigr)\Bigr)$$
$$\quad+(1-s)\Bigl(\frac{1}{m}\sum_{j=1}^{m}l(-1,f_\lambda(x'_j))-\frac{1}{m}\sum_{j=1}^{m}l(-1,f_c(x'_j))-\Bigl(\int_X l(-1,f_\lambda)\,d\mu-\int_X l(-1,f_c)\,d\mu\Bigr)\Bigr)=:sS_{21}+(1-s)S_{22},$$
we first consider the random variable $\xi_1=l(1,f_\lambda)-l(1,f_c)$ on $X\times\{1\}$ to estimate $S_{21}$.

Write
$$\xi_1=\xi_{11}+\xi_{12}=\bigl\{l(1,f_\lambda)-l(1,\pi(f_\lambda))\bigr\}+\bigl\{l(1,\pi(f_\lambda))-l(1,f_c)\bigr\}.$$
For $\xi_{11}$, from (10) we have $\|f_\lambda\|_\infty\le\kappa\|f_\lambda\|_H\le\kappa\sqrt{D(\lambda)/\lambda}$, so $0\le\xi_{11}\le B_\lambda=2\kappa\sqrt{D(\lambda)/\lambda}$ and hence $|\xi_{11}-E_{x\sim Q}\xi_{11}|\le B_\lambda$. Applying Proposition 3 to $\xi_{11}$, we know that for any $\varepsilon>0$
$$\operatorname{Prob}\Bigl\{\frac{1}{n}\sum_{i=1}^{n}\xi_{11}(x_i)-E_{x\sim Q}\xi_{11}>\varepsilon\Bigr\}\le\exp\Bigl(-\frac{n\varepsilon^2}{2(\sigma^2(\xi_{11})+B_\lambda\varepsilon/3)}\Bigr).$$
Solving the quadratic equation
$$\exp\Bigl(-\frac{n\varepsilon^2}{2(\sigma^2(\xi_{11})+B_\lambda\varepsilon/3)}\Bigr)=\frac{\delta}{6}$$
for $\varepsilon$, we see that the inequality
$$\frac{1}{n}\sum_{i=1}^{n}l(1,f_\lambda(x_i))-\frac{1}{n}\sum_{i=1}^{n}l(1,\pi(f_\lambda)(x_i))-\Bigl(\int_X l(1,f_\lambda)\,dQ-\int_X l(1,\pi(f_\lambda))\,dQ\Bigr)\le\varepsilon_1$$
holds with probability at least $1-\delta/6$, where
$$\varepsilon_1=\frac{B_\lambda\ln(6/\delta)}{3n}+\frac{\sqrt{(B_\lambda\ln(6/\delta)/3)^2+2n\ln(6/\delta)\sigma^2(\xi_{11})}}{n}\le\frac{2B_\lambda\ln(6/\delta)}{3n}+\sqrt{\frac{2\ln(6/\delta)\sigma^2(\xi_{11})}{n}}.$$
The fact that $0\le\xi_{11}\le B_\lambda$ implies $\sigma^2(\xi_{11})\le B_\lambda E_{x\sim Q}\xi_{11}$. Therefore,
$$\varepsilon_1\le\frac{7B_\lambda\ln(6/\delta)}{6n}+E_{x\sim Q}\xi_{11}.$$

Next we consider $\xi_{12}$. Since both $\pi(f_\lambda)$ and $f_c$ take values in $[-1,1]$, $\xi_{12}$ is a random variable satisfying $|\xi_{12}|\le 2$. Applying Proposition 3 to $\xi_{12}$, we see that the inequality
$$\frac{1}{n}\sum_{i=1}^{n}l(1,\pi(f_\lambda)(x_i))-\frac{1}{n}\sum_{i=1}^{n}l(1,f_c(x_i))-\Bigl(\int_X l(1,\pi(f_\lambda))\,dQ-\int_X l(1,f_c)\,dQ\Bigr)\le\varepsilon_2$$
holds with probability at least $1-\delta/6$, where
$$\varepsilon_2\le\frac{4\ln(6/\delta)}{3n}+\sqrt{\frac{4\ln(6/\delta)\sigma^2(\xi_{12})}{n}}.$$
Lemma 1 tells us that $\sigma^2(\xi_{12})\le c^*(E_{x\sim Q}\xi_{12})^\tau$ with $\tau=q/(q+1)$ and $c^*=8(1/(2c_q))^{q/(q+1)}$. Applying Lemma 3, we have
$$\varepsilon_2\le\frac{4\ln(6/\delta)}{3n}+\Bigl(\frac{2c^*\ln(6/\delta)}{n}\Bigr)^{1/(2-\tau)}+E_{x\sim Q}\xi_{12}.$$

Combining the above estimates for $\xi_{11}$ and $\xi_{12}$ with the fact that $E_{x\sim Q}\xi_{11}+E_{x\sim Q}\xi_{12}=E_{x\sim Q}\xi_1$, we conclude that with probability at least $1-\delta/3$ there holds
$$S_{21}\le\frac{4\ln(6/\delta)}{3n}+\Bigl(\frac{2c^*\ln(6/\delta)}{n}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6n}+E_{x\sim Q}\xi_1.$$
Similarly, we can consider the random variable $\xi_2=l(-1,f_\lambda)-l(-1,f_c)$ on $X\times\{-1\}$ to estimate $S_{22}$, and obtain that with probability at least $1-\delta/3$ there holds
$$S_{22}\le\frac{4\ln(6/\delta)}{3m}+\Bigl(\frac{2c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6m}+E_{x\sim\mu}\xi_2.$$
Combining the above estimates for $S_{21}$ and $S_{22}$ with the fact that
$$sE_{x\sim Q}\xi_1+(1-s)E_{x\sim\mu}\xi_2=E(f_\lambda)-E(f_c)\le D(\lambda),$$
we conclude that with probability at least $1-2\delta/3$ there holds
$$S_2=sS_{21}+(1-s)S_{22}\le s\Bigl(\frac{4\ln(6/\delta)}{3n}+\Bigl(\frac{2c^*\ln(6/\delta)}{n}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6n}\Bigr)+(1-s)\Bigl(\frac{4\ln(6/\delta)}{3m}+\Bigl(\frac{2c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6m}\Bigr)+D(\lambda).$$
The proof is complete. □

Proposition 6. Let $\rho>0$ and let $\mu$ and $Q$ be distributions on $X$ such that $Q$ has a density $h$ with respect to $\mu$. For $s=1/(1+\rho)$ we write $P=Q\oplus_s\mu$. If $h$ has $\rho$-exponent $q$, then with probability at least $1-\delta/3$ there holds
$$S_1\le\frac{1}{2}\bigl(E(\pi(f_{T,T',\lambda}))-E(f_c)\bigr)+12s\varepsilon_3+12(1-s)\varepsilon_4,$$
where
$$\varepsilon_3\le\max\Bigl\{\frac{16\ln(6/\delta)}{3n},\Bigl(\frac{8c^*\ln(6/\delta)}{n}\Bigr)^{1/(2-\tau)},\Bigl(\frac{16c_pR^p}{3n}\Bigr)^{1/(1+p)},\Bigl(\frac{8c_pc^*R^p}{n}\Bigr)^{1/(2-\tau+p)}\Bigr\},$$
$$\varepsilon_4\le\max\Bigl\{\frac{16\ln(6/\delta)}{3m},\Bigl(\frac{8c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)},\Bigl(\frac{16c_pR^p}{3m}\Bigr)^{1/(1+p)},\Bigl(\frac{8c_pc^*R^p}{m}\Bigr)^{1/(2-\tau+p)}\Bigr\}.$$

Proof. It is easy to obtain that
$$S_1=\bigl\{E(\pi(f_{T,T',\lambda}))-E(f_c)-\bigl[E_{T,T'}(\pi(f_{T,T',\lambda}))-E_{T,T'}(f_c)\bigr]\bigr\}$$
$$=s\Bigl(\int_X l(1,\pi(f_{T,T',\lambda}))\,dQ-\int_X l(1,f_c)\,dQ-\Bigl(\frac{1}{n}\sum_{i=1}^{n}l(1,\pi(f_{T,T',\lambda})(x_i))-\frac{1}{n}\sum_{i=1}^{n}l(1,f_c(x_i))\Bigr)\Bigr)$$
$$\quad+(1-s)\Bigl(\int_X l(-1,\pi(f_{T,T',\lambda}))\,d\mu-\int_X l(-1,f_c)\,d\mu-\Bigl(\frac{1}{m}\sum_{j=1}^{m}l(-1,\pi(f_{T,T',\lambda})(x'_j))-\frac{1}{m}\sum_{j=1}^{m}l(-1,f_c(x'_j))\Bigr)\Bigr)=:sS_{11}+(1-s)S_{12}.$$

Let $R>0$. We apply Proposition 4 to the function set
$$\mathcal{F}_1=\bigl\{l(1,\pi(f)(x))-l(1,f_c(x)): f\in B_R\bigr\}.$$
Each function $g\in\mathcal{F}_1$ takes the form $g(x)=l(1,\pi(f)(x))-l(1,f_c(x))$ for some $f\in B_R$. Then
$$E_{x\sim Q}g=E_{x\sim Q}\bigl[l(1,\pi(f)(x))-l(1,f_c(x))\bigr]\ge 0$$
and
$$\frac{1}{n}\sum_{i=1}^{n}g(x_i)=\frac{1}{n}\sum_{i=1}^{n}\bigl(l(1,\pi(f)(x_i))-l(1,f_c(x_i))\bigr).$$
Since $|g(x)|=|l(1,\pi(f)(x))-l(1,f_c(x))|\le 2$, Proposition 4 asserts that
$$\operatorname{Prob}_{z\in(X\times\{1\})^n}\Biggl\{\sup_{g\in\mathcal{F}_1}\frac{E_{x\sim Q}g-\frac{1}{n}\sum_{i=1}^{n}g(x_i)}{\sqrt{(E_{x\sim Q}g)^\tau+\varepsilon^\tau}}>4\varepsilon^{1-\tau/2}\Biggr\}\le N(\mathcal{F}_1,\varepsilon)\exp\Bigl(-\frac{n\varepsilon^{2-\tau}}{2c^*+2\varepsilon^{1-\tau}/3}\Bigr).$$

Next, we bound the covering number. For each $f_1,f_2\in B_R$ and $x\in X$,
$$\bigl|\bigl(l(1,\pi(f_1)(x))-l(1,f_c(x))\bigr)-\bigl(l(1,\pi(f_2)(x))-l(1,f_c(x))\bigr)\bigr|=\bigl|l(1,\pi(f_1)(x))-l(1,\pi(f_2)(x))\bigr|\le\bigl|\pi(f_1)(x)-\pi(f_2)(x)\bigr|\le\|f_1-f_2\|_\infty,$$
and therefore
$$N(\mathcal{F}_1,\varepsilon)\le N(B_R,\varepsilon)\le\exp\{c_pR^p\varepsilon^{-p}\}.$$
Let $\varepsilon_3$ be the unique positive number $\varepsilon$ satisfying
$$\exp\Bigl(c_pR^p\varepsilon^{-p}-\frac{n\varepsilon^{2-\tau}}{2c^*+2\varepsilon^{1-\tau}/3}\Bigr)=\frac{\delta}{6}.$$
Solving this equation by Lemma 2, we get
$$\varepsilon_3\le\max\Bigl\{\frac{16\ln(6/\delta)}{3n},\Bigl(\frac{8c^*\ln(6/\delta)}{n}\Bigr)^{1/(2-\tau)},\Bigl(\frac{16c_pR^p}{3n}\Bigr)^{1/(1+p)},\Bigl(\frac{8c_pc^*R^p}{n}\Bigr)^{1/(2-\tau+p)}\Bigr\}.$$

Then we deduce that with probability at least $1-\delta/6$, the inequality
$$\int_X l(1,\pi(f))\,dQ-\int_X l(1,f_c)\,dQ-\Bigl(\frac{1}{n}\sum_{i=1}^{n}l(1,\pi(f)(x_i))-\frac{1}{n}\sum_{i=1}^{n}l(1,f_c(x_i))\Bigr)\le 4\varepsilon_3^{1-\tau/2}\sqrt{\bigl(E_{x\sim Q}l(1,\pi(f))-E_{x\sim Q}l(1,f_c)\bigr)^\tau+\varepsilon_3^\tau}$$
holds for all $f\in B_R$. Moreover, Lemma 3 implies that for $0<\tau<1$,
$$4\varepsilon_3^{1-\tau/2}\sqrt{\bigl(E_{x\sim Q}l(1,\pi(f))-E_{x\sim Q}l(1,f_c)\bigr)^\tau+\varepsilon_3^\tau}\le 4\varepsilon_3^{1-\tau/2}\bigl(E_{x\sim Q}l(1,\pi(f))-E_{x\sim Q}l(1,f_c)\bigr)^{\tau/2}+4\varepsilon_3$$
$$\le\frac{\tau}{2}\bigl(E_{x\sim Q}l(1,\pi(f))-E_{x\sim Q}l(1,f_c)\bigr)+\Bigl(1-\frac{\tau}{2}\Bigr)4^{2/(2-\tau)}\varepsilon_3+4\varepsilon_3\le\frac{1}{2}\bigl(E_{x\sim Q}l(1,\pi(f))-E_{x\sim Q}l(1,f_c)\bigr)+12\varepsilon_3.$$
So with probability at least $1-\delta/6$, the inequality
$$\int_X l(1,\pi(f))\,dQ-\int_X l(1,f_c)\,dQ-\Bigl(\frac{1}{n}\sum_{i=1}^{n}l(1,\pi(f)(x_i))-\frac{1}{n}\sum_{i=1}^{n}l(1,f_c(x_i))\Bigr)\le\frac{1}{2}\bigl(E_{x\sim Q}l(1,\pi(f))-E_{x\sim Q}l(1,f_c)\bigr)+12\varepsilon_3$$
holds.

Similarly, we apply Proposition 4 to the function set
$$\mathcal{F}_2=\bigl\{l(-1,\pi(f)(x))-l(-1,f_c(x)): f\in B_R\bigr\}$$
and find that with probability at least $1-\delta/6$ the inequality
$$\int_X l(-1,\pi(f))\,d\mu-\int_X l(-1,f_c)\,d\mu-\Bigl(\frac{1}{m}\sum_{j=1}^{m}l(-1,\pi(f)(x'_j))-\frac{1}{m}\sum_{j=1}^{m}l(-1,f_c(x'_j))\Bigr)\le\frac{1}{2}\bigl(E_{x\sim\mu}l(-1,\pi(f))-E_{x\sim\mu}l(-1,f_c)\bigr)+12\varepsilon_4$$
is valid, where $\varepsilon_4$ satisfies the bound stated in the proposition.

By (11), we can take $R=\lambda^{-1/2}$. Combining the above estimates for $S_{11}$ and $S_{12}$, we conclude that with probability at least $1-\delta/3$ the inequality
$$S_1=sS_{11}+(1-s)S_{12}\le\frac{1}{2}\bigl(E(\pi(f_{T,T',\lambda}))-E(f_c)\bigr)+12s\varepsilon_3+12(1-s)\varepsilon_4$$
holds. The proof is complete. □
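The step from the defining equation of $\varepsilon_3$ to the stated maximum can be checked numerically for concrete values; the snippet below (illustrative parameter values only) finds the positive zero with a standard root finder and compares it with the bound obtained from Lemma 2.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative parameter values (not from the paper).
n, delta, p, c_p, R, q, c_q = 1000, 0.05, 1.0, 1.0, 10.0, 1.0, 1.0
tau = q / (q + 1)
c_star = 8 * (1 / (2 * c_q)) ** (q / (q + 1))
log_term = np.log(6 / delta)

def g(eps):
    """c_p R^p eps^{-p} - n eps^{2-tau} / (2 c* + 2 eps^{1-tau} / 3) + ln(6/delta):
    eps_3 is the positive zero of this function."""
    return (c_p * R ** p * eps ** (-p)
            - n * eps ** (2 - tau) / (2 * c_star + 2 * eps ** (1 - tau) / 3)
            + log_term)

eps3 = brentq(g, 1e-8, 1e6)                       # g > 0 near 0 and g < 0 for large eps
bound = max(16 * log_term / (3 * n),
            (8 * c_star * log_term / n) ** (1 / (2 - tau)),
            (16 * c_p * R ** p / (3 * n)) ** (1 / (1 + p)),
            (8 * c_p * c_star * R ** p / n) ** (1 / (2 - tau + p)))
print(f"eps_3 = {eps3:.3f} <= stated bound {bound:.3f}")   # check for these values
```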

Proof of Theorem 1. By Proposition 1,
$$E(\pi(f_{T,T',\lambda}))-E(f_c)\le D(\lambda)+S_1+S_2.$$
We may assume $n\ge m$; when $n\le m$ the same argument gives the corresponding order in $n$. Then, with probability at least $1-2\delta/3$, there holds
$$S_2\le\frac{4\ln(6/\delta)}{3m}+\Bigl(\frac{2c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6m}+D(\lambda).$$
Since $n\ge m$, we have $\varepsilon_3\le\varepsilon_4$, and
$$S_1\le\frac{1}{2}\bigl(E(\pi(f_{T,T',\lambda}))-E(f_c)\bigr)+12\varepsilon_4$$
holds with probability at least $1-\delta/3$.

Combining the above estimates for $S_1$ and $S_2$, we conclude that with probability at least $1-\delta$ there holds
$$E(\pi(f_{T,T',\lambda}))-E(f_c)\le\frac{1}{2}\bigl(E(\pi(f_{T,T',\lambda}))-E(f_c)\bigr)+12\varepsilon_4+\frac{4\ln(6/\delta)}{3m}+\Bigl(\frac{2c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{6m}+2D(\lambda).$$
Together with $D(\lambda)\le c_\beta\lambda^\beta$, this gives
$$E(\pi(f_{T,T',\lambda}))-E(f_c)\le 24\varepsilon_4+\frac{8\ln(6/\delta)}{3m}+2\Bigl(\frac{2c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+\frac{7B_\lambda\ln(6/\delta)}{3m}+4c_\beta\lambda^\beta$$
$$\le\frac{392\ln(6/\delta)}{3m}+26\Bigl(\frac{8c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+24\Bigl(\frac{16c_pR^p}{3m}\Bigr)^{1/(1+p)}+24\Bigl(\frac{8c_pc^*R^p}{m}\Bigr)^{1/(2-\tau+p)}+\frac{7B_\lambda\ln(6/\delta)}{3m}+4c_\beta\lambda^\beta.$$
Substituting $R=\lambda^{-1/2}$ and $B_\lambda=2\kappa\sqrt{D(\lambda)/\lambda}\le 2\kappa\sqrt{c_\beta}\,\lambda^{(\beta-1)/2}$, we obtain
$$E(\pi(f_{T,T',\lambda}))-E(f_c)\le\frac{392\ln(6/\delta)}{3m}+26\Bigl(\frac{8c^*\ln(6/\delta)}{m}\Bigr)^{1/(2-\tau)}+24\Bigl(\frac{16c_p\lambda^{-p/2}}{3m}\Bigr)^{1/(1+p)}+24\Bigl(\frac{8c_pc^*\lambda^{-p/2}}{m}\Bigr)^{1/(2-\tau+p)}+\frac{14\kappa\sqrt{c_\beta}\ln(6/\delta)}{3m}\lambda^{(\beta-1)/2}+4c_\beta\lambda^\beta$$
$$\le C_E'\Biggl\{\frac{\ln(6/\delta)}{m}+\Bigl(\frac{\ln(6/\delta)}{m}\Bigr)^{(q+1)/(q+2)}+\Bigl(\frac{\lambda^{-p/2}}{m}\Bigr)^{1/(1+p)}+\Bigl(\frac{\lambda^{-p/2}}{m}\Bigr)^{(q+1)/(q+2+pq+p)}+\frac{\ln(6/\delta)}{m}\lambda^{(\beta-1)/2}+\lambda^\beta\Biggr\},$$
and this holds with probability at least $1-\delta$. Minimizing the right-hand side with respect to $\lambda$, we find that the corresponding SVM can learn with rate $m^{-\gamma}$, where
$$\gamma=\min\Bigl\{\frac{2\beta}{(\beta+1)(pq+1)},\ \frac{2\beta(q+1)}{2\beta(q+2)+pq+p}\Bigr\}.$$
From (9) we obtain
$$S_P(\operatorname{sign}f_{T,T',\lambda})\le C_E m^{-\gamma q/(q+1)}.$$
This finishes the proof. □

References

[1] M. Anthony, P.L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999.
[2] P.L. Bartlett, M.I. Jordan, J.D. McAuliffe, Convexity, classification, and risk bounds, J. Am. Stat. Assoc. 101 (2006) 138–156.
[3] S. Ben-David, M. Lindenbaum, Learning distributions by their density levels: a paradigm for learning without a teacher, J. Comput. Syst. Sci. 55 (1997) 171–182.
[4] D.R. Chen, Q. Wu, Y. Ying, D.X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learn. Res. 5 (2004) 1143–1175.
[5] F. Cucker, S. Smale, On the mathematical foundations of learning, Bull. Am. Math. Soc. 39 (2001) 1–49.
[6] G. Lugosi, N. Vayatis, On the Bayes-risk consistency of regularized boosting methods, Ann. Statist. 32 (2004) 30–55.
[7] W. Polonik, Measuring mass concentrations and estimating density contour clusters – an excess mass approach, Ann. Statist. 23 (1995) 855–881.
[8] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[9] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, 2002.
[10] C. Scovel, D. Hush, I. Steinwart, Learning rates for density level detection, Anal. Appl. 3 (2005) 356–371.
[11] S. Smale, D.X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Am. Math. Soc. 41 (2004) 279–305.
[12] S. Smale, D.X. Zhou, Online learning with Markov sampling, Anal. Appl. 7 (2009) 87–113.
[13] I. Steinwart, C. Scovel, Fast rates for support vector machines using Gaussian kernels, Ann. Statist. 35 (2007) 575–607.
[14] I. Steinwart, D. Hush, C. Scovel, An oracle inequality for clipped regularized risk minimizers, Adv. Neural Inf. Process. Syst. 19 (2007) 1321–1328.
[15] I. Steinwart, D. Hush, C. Scovel, A classification framework for anomaly detection, J. Mach. Learn. Res. 6 (2005) 211–232.
[16] I. Steinwart, D. Hush, C. Scovel, Density level detection is classification, Neural Inf. Process. Syst. 17 (2005) 1337–1344.
[17] A.B. Tsybakov, On nonparametric estimation of density level sets, Ann. Statist. 25 (1997) 948–969.
[18] A.B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Statist. 32 (2004) 135–166.
[19] G. Wahba, Spline Models for Observational Data, SIAM, 1990.
[20] Q. Wu, D.X. Zhou, SVM soft margin classifiers: linear programming versus quadratic programming, Neural Comput. 17 (2005) 1160–1187.
[21] Q. Wu, Y. Ying, D.X. Zhou, Multi-kernel regularized classifiers, J. Complexity 23 (2007) 108–134.
[22] Q. Wu, D.X. Zhou, Analysis of support vector machine classification, J. Comput. Anal. Appl. 8 (2006) 99–119.
[23] Q. Wu, Y. Ying, D.X. Zhou, Learning rates of least-square regularized regression, Found. Comput. Math. 6 (2006) 171–192.
[24] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Ann. Statist. 32 (2004) 56–85.

Feilong Cao was born in Zhejiang Province, China, in August 1965. He received the BS degree in Mathematics and the MS degree in Applied Mathematics from Ningxia University, China, in 1987 and 1998, respectively. In 2003, he received the PhD degree from the Institute for Information and System Science, Xi'an Jiaotong University, China. From 1987 to 1992, he was an assistant professor; from 1992 to 2002, a lecturer; in 2002, he became a full professor. His current research interests include neural networks, machine learning and approximation theory. He is the author or coauthor of more than 150 scientific papers.

Xing Xing received the BSc degree in Information and Computational Science from China Jiliang University, China, in 2008. She is currently working towards the MSc degree at China Jiliang University, China. Her research interests include learning theory and support vector machines.

Jianwei Zhao received the MSc degree in Mathematics from Shanxi Normal University, China, in 2003, and the PhD degree in Mathematics from the Chinese Academy of Sciences, China, in 2006. She has been an associate professor at China Jiliang University. She is currently doing postdoctoral research in the Department of Electronics and Information Engineering, Chonbuk National University, South Korea. Her current research interests include machine learning, neural networks and functional analysis.
