Multi-Class Support Vector Machines
Yann Guermeur
LORIA - CNRS
http://www.loria.fr/~guermeur
Summer School NN2008
July 4, 2008
Overview

Guaranteed risk for large margin multi-category classifiers
- Theoretical framework
- Basic uniform convergence result
- $\gamma$-$\Psi$-dimensions
- Generalized Sauer-Shelah lemma
- Nature and rate of convergence

Multi-class SVMs
- Multi-category classification with binary SVMs
- Class of functions implemented by the M-SVMs
- General formulation of the training algorithm
- Three main models of M-SVMs
- Some variants of the main models
- Margins and support vectors

Guaranteed risks for multi-class SVMs
- Bounds on the covering numbers
- Use of the Rademacher complexity

Model selection for multi-class SVMs
- Algorithms fitting the entire regularization path
- Bounds on the leave-one-out cross-validation error

Conclusions and open problems
Theoretical framework
Hypotheses and goals

Characterization of the problem
- Study of the connection between objects $x \in \mathcal{X}$ and their categories $y \in \mathcal{Y} = [\![ 1, Q ]\!]$
- Hypothesis: existence of a $\mathcal{X} \times \mathcal{Y}$-valued random pair $(X, Y)$ distributed according to a probability measure $P$
- Problem: the joint probability measure $P$ is unknown

What is available
- $D_m = ((X_i, Y_i))_{1 \le i \le m}$: i.i.d. $m$-sample from $(X, Y)$
- $\mathcal{G}$: class of functions $g$, from $\mathcal{X}$ into $\mathbb{R}^Q$ ($\mathcal{F}$: class of decision rules $f$, from $\mathcal{X}$ into $\mathcal{Y} \cup \{*\}$)
- $f(x) = \operatorname{argmax}_{1 \le k \le Q} g_k(x)$ or $f(x) = *$, in case of ex aequo

The goal
- Loss function: $\ell(y, g(x)) = \mathbb{1}_{\{g_y(x) \le \max_{k \ne y} g_k(x)\}}$ ($\ell(y, f(x)) = \mathbb{1}_{\{f(x) \ne y\}}$)
- Selection of a function minimizing over $\mathcal{G}$ the risk
\[ R(g) = E[\ell(Y, g(X))] = P(f(X) \ne Y) \]
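Basic uniform convergence result
Multi-class margin and margin risk

Definition 1 (Function $M$)
Let $M$ be the function from $\mathbb{R}^Q \times [\![ 1, Q ]\!]$ into $\mathbb{R}$ given by:
\[ \forall (v, k) \in \mathbb{R}^Q \times [\![ 1, Q ]\!], \quad M(v, k) = \frac{1}{2} \left( v_k - \max_{l \ne k} v_l \right) \]
\[ M(v, \cdot) = \max_{1 \le k \le Q} M(v, k) \]

Definition 2 (Multi-class margin of $g$ on the example $(x, y)$)
\[ \forall (g, x, y) \in \mathcal{G} \times \mathcal{X} \times \mathcal{Y}, \quad M(g, x, y) = M(g(x), y) \]

Definition 3 (Operators $\Delta$ and $\Delta^*$)
Let $g = (g_k)_{1 \le k \le Q} \in \mathcal{G}$.
- The function $\Delta g = (\Delta g_k)_{1 \le k \le Q}$, from $\mathcal{X}$ into $\mathbb{R}^Q$, is given by:
\[ \forall x \in \mathcal{X}, \quad \Delta g(x) = (M(g(x), k))_{1 \le k \le Q} \]
- The function $\Delta^* g = (\Delta^* g_k)_{1 \le k \le Q}$, from $\mathcal{X}$ into $\mathbb{R}^Q$, is given by:
\[ \forall x \in \mathcal{X}, \quad \Delta^* g(x) = (\operatorname{sign}(\Delta g_k(x)) \, M(g(x), \cdot))_{1 \le k \le Q} \]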
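The margin function $M$ and the two operators can be illustrated on a single score vector $v = g(x)$. The following is a minimal sketch, assuming 0-based category indices (the slides use $[\![ 1, Q ]\!]$), $Q \ge 2$, and the convention $\operatorname{sign}(0) = 0$; the function names are illustrative only.

```python
def margin(v, k):
    """M(v, k) = (v_k - max_{l != k} v_l) / 2 for a 0-based category index k."""
    others = [v_l for l, v_l in enumerate(v) if l != k]
    return 0.5 * (v[k] - max(others))


def delta(v):
    """Delta operator applied to one score vector v = g(x)."""
    return [margin(v, k) for k in range(len(v))]


def sign(t):
    """Sign of t, with sign(0) = 0 (assumed convention)."""
    return (t > 0) - (t < 0)


def delta_star(v):
    """Delta* operator: each component keeps the sign of Delta g_k(x),
    with common magnitude M(v, .) = max_k M(v, k)."""
    d = delta(v)
    top = max(d)
    return [sign(d_k) * top for d_k in d]
```

For the score vector `[1.0, 3.0, 2.0]`, only the component of the winning category is positive under both operators, and $\Delta^*$ equalizes the magnitudes.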
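Multi-class margin and margin risk

$\Delta^\#$ replaces $\Delta$ and $\Delta^*$ in the formulas that hold true for both operators (e.g., $R(g) = E\left[ \mathbb{1}_{\{\Delta^\# g_Y(X) \le 0\}} \right]$).

Definition 4 (Margin risk)
Let $\gamma \in \mathbb{R}_+^*$. The risk with margin $\gamma$ of $g$ is defined as:
\[ R_\gamma(g) = E\left[ \mathbb{1}_{\{\Delta^\# g_Y(X) < \gamma\}} \right] = \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}_{\{\Delta^\# g_y(x) < \gamma\}} \, dP(x, y) \]

Empirical risk with margin $\gamma$
\[ R_{\gamma,m}(g) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}_{\{\Delta^\# g_{Y_i}(X_i) < \gamma\}} \]

Class of functions of interest: $\Delta^\#_\gamma \mathcal{G}$
For $\gamma \in \mathbb{R}_+^*$, let $\pi_\gamma : \mathbb{R} \to [-\gamma, \gamma]$ be the linear squashing function defined as:
\[ \pi_\gamma(t) = \operatorname{sign}(t) \cdot \min\{|t|, \gamma\} \]
\[ \Delta^\#_\gamma g = \left( \Delta^\#_\gamma g_k \right)_{1 \le k \le Q}, \quad \Delta^\#_\gamma g_k = \pi_\gamma \circ \Delta^\# g_k, \quad \Delta^\#_\gamma \mathcal{G} = \left\{ \Delta^\#_\gamma g : g \in \mathcal{G} \right\} \]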
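The empirical margin risk and the squashing function translate directly into code. The sketch below assumes 0-based labels and uses the operator $\Delta$ for the margin of the true category; for $\gamma > 0$ the indicator takes the same value with $\Delta^*$. Names are illustrative.

```python
import math


def margin(v, k):
    """M(v, k) = (v_k - max_{l != k} v_l) / 2 for a 0-based category index k."""
    return 0.5 * (v[k] - max(v_l for l, v_l in enumerate(v) if l != k))


def squash(t, gamma):
    """Linear squashing function pi_gamma(t) = sign(t) * min(|t|, gamma)."""
    return math.copysign(min(abs(t), gamma), t)


def empirical_margin_risk(scores, labels, gamma):
    """Fraction of examples whose margin on the true category is strictly below gamma."""
    below = sum(1 for v, y in zip(scores, labels) if margin(v, y) < gamma)
    return below / len(labels)
```

Increasing `gamma` can only increase the empirical margin risk, which is the trade-off that the guaranteed risks balance against the capacity term.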
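Capacity measure of $\Delta^\#_\gamma \mathcal{G}$: covering numbers

Figure 1: $\epsilon$-net and $\epsilon$-cover of a set $E'$ in a pseudo-metric space $(E, \rho)$

Definition 5 (Covering numbers)
- $\mathcal{N}(\epsilon, E', \rho)$: minimal number of open balls of radius $\epsilon$ needed to cover $E'$ (or $+\infty$)
- $\mathcal{N}^{(p)}(\epsilon, E', \rho)$: the $\epsilon$-nets considered are included in $E'$ (proper to $E'$)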
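For a finite set, a proper $\epsilon$-net can be built greedily, and its size gives an upper bound on the proper covering number (not necessarily the minimum). This is a sketch under the assumption that points are tuples compared in the sup-norm; the helper names are hypothetical.

```python
def sup_distance(u, v):
    """Sup-norm distance between two vectors of equal length."""
    return max(abs(a - b) for a, b in zip(u, v))


def greedy_cover(points, eps, dist=sup_distance):
    """Return centers taken from `points` such that every point lies in an
    open ball of radius eps around some center (a proper eps-net)."""
    centers = []
    for p in points:
        # p becomes a new center only if no existing open ball contains it
        if all(dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers
```

The number of returned centers upper-bounds $\mathcal{N}^{(p)}(\epsilon, E', \rho)$ for the finite set passed in.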

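Classes of indicator functions

Theorem 1 (Guaranteed risk, Vapnik, 1998)
Let $\mathcal{F}$ be a class of indicator functions on a set $\mathcal{X}$. Let $N(\mathcal{F}, (X_i)_{1 \le i \le n})$ be the number of different functions (dichotomies) that this class can implement on $(X_i)_{1 \le i \le n}$, and let $\delta \in (0, 1)$. With probability at least $1 - \delta$, the risk of any function $f$ in $\mathcal{F}$ is bounded from above as follows:
\[ R(f) \le R_m(f) + \sqrt{ \frac{1}{m} \left( \ln E\, N\left( \mathcal{F}, (X_i)_{1 \le i \le 2m} \right) + \ln \frac{4}{\delta} \right) } + \frac{1}{m}. \]
$\ln E\, N\left( \mathcal{F}, (X_i)_{1 \le i \le 2m} \right)$ is the annealed entropy of $\mathcal{F}$ on the sample $(X_i)_{1 \le i \le 2m}$.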

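Classes of functions $\mathcal{G}$ (taking values in $\mathbb{R}^Q$)

Definition 6 (Pseudo-metric $d_{x^n}$)
Let $n \in \mathbb{N}^*$. For a sequence $x^n = (x_i)_{1 \le i \le n} \in \mathcal{X}^n$, define the pseudo-metric $d_{x^n}$ on $\mathcal{G}$ as:
\[ \forall (g, g') \in \mathcal{G}^2, \quad d_{x^n}(g, g') = \max_{1 \le i \le n} \| g(x_i) - g'(x_i) \|_\infty. \]
For $\epsilon \in \mathbb{R}_+^*$, let $\mathcal{N}(\epsilon, \mathcal{G}, n) = \sup_{x^n \in \mathcal{X}^n} \mathcal{N}(\epsilon, \mathcal{G}, d_{x^n})$.

Theorem 2 (Guaranteed risk)
Let $\mathcal{G}$ be the class of functions that a large margin $Q$-category classifier on a domain $\mathcal{X}$ can implement. Let $\Gamma \in \mathbb{R}_+^*$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, for every value of $\gamma$ in $(0, \Gamma]$, the risk of any function $g$ in $\mathcal{G}$ is bounded from above by:
\[ R(g) \le R_{\gamma,m}(g) + \sqrt{ \frac{2}{m} \left( \ln \left( 2 \mathcal{N}^{(p)}\left( \gamma/4, \Delta^\#_\gamma \mathcal{G}, 2m \right) \right) + \ln \frac{2\Gamma}{\gamma\delta} \right) } + \frac{1}{m}. \]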
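$\gamma$-$\Psi$-dimensions
Growth function

Definition 7 (Growth function, Vapnik & Chervonenkis, 1971)
Let $\mathcal{F}$ be a class of indicator functions on a domain $\mathcal{X}$. For $n \in \mathbb{N}^*$, let $s_{X^n} = \{x_i : 1 \le i \le n\}$ be a subset of $\mathcal{X}$ of cardinality $n$. Then, the growth function of $\mathcal{F}$, $\Pi_{\mathcal{F}}$, is defined by:
\[ \forall n \in \mathbb{N}^*, \quad \Pi_{\mathcal{F}}(n) = \sup_{s_{X^n} \subset \mathcal{X}} N(\mathcal{F}, s_{X^n}). \]

Remark 1
Some authors use the alternative definition:
\[ \forall n \in \mathbb{N}^*, \quad \Pi_{\mathcal{F}}(n) = \ln \sup_{s_{X^n} \subset \mathcal{X}} N(\mathcal{F}, s_{X^n}). \]

Remark 2
In contrast with the annealed entropy, the growth function is distribution-free.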
VC dimension

Definition 8 (VC dimension, Vapnik & Chervonenkis, 1971)
Let $\mathcal{F}$ be a class of indicator functions on a domain $\mathcal{X}$. A subset $s_{X^n} = \{x_i : 1 \le i \le n\}$ of $\mathcal{X}$ is said to be shattered by $\mathcal{F}$ if for each vector $v_y$ in $\{-1, 1\}^n$, there is a function $f_y$ in $\mathcal{F}$ satisfying
\[ (f_y(x_i))_{1 \le i \le n} = v_y. \]
The VC dimension of $\mathcal{F}$, denoted by VC-dim($\mathcal{F}$), is the maximal cardinality of a subset of $\mathcal{X}$ shattered by $\mathcal{F}$, if this cardinality is finite. If no such maximum exists, $\mathcal{F}$ is said to have infinite VC dimension.

Remark 3
VC-dim($\mathcal{F}$) $= d$ if and only if $\Pi_{\mathcal{F}}(d) = 2^d$ and $\Pi_{\mathcal{F}}(d + 1) < 2^{d+1}$.
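For a finite class on a finite domain, shattering can be checked by exhaustive enumeration. The sketch below assumes each function is stored as a tuple of values in $\{-1, 1\}$, one per domain point; it is only practical for very small examples.

```python
from itertools import combinations


def is_shattered(functions, subset):
    """True if the class realizes all 2^n sign patterns on the given point indices."""
    patterns = {tuple(f[i] for i in subset) for f in functions}
    return len(patterns) == 2 ** len(subset)


def vc_dimension(functions, n_points):
    """Largest size of a shattered subset of the domain {0, ..., n_points - 1}."""
    d = 0
    for size in range(1, n_points + 1):
        if any(is_shattered(functions, s)
               for s in combinations(range(n_points), size)):
            d = size
        else:
            # subsets of shattered sets are shattered, so no larger set can work
            break
    return d
```

Threshold functions on three ordered points, for instance, shatter single points but no pair, so their VC dimension is 1.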
$\Psi$-dimensions

Definition 9 ($\Psi$-dimensions, Ben-David et al., 1995)
Let $\mathcal{F}$ be a class of functions on a set $\mathcal{X}$ taking their values in the finite set $[\![ 1, Q ]\!]$. Let $\Psi$ be a family of mappings $\psi$ from $[\![ 1, Q ]\!]$ into $\{-1, 1, *\}$, where $*$ is thought of as a null element. A subset $s_{X^n} = \{x_i : 1 \le i \le n\}$ of $\mathcal{X}$ is said to be $\Psi$-shattered by $\mathcal{F}$ if there is a mapping $\psi^n = \left( \psi^{(i)} \right)_{1 \le i \le n}$ in $\Psi^n$ such that for each vector $v_y$ in $\{-1, 1\}^n$, there is a function $f_y$ in $\mathcal{F}$ satisfying
\[ \left( \psi^{(i)} \circ f_y(x_i) \right)_{1 \le i \le n} = v_y. \]
The $\Psi$-dimension of $\mathcal{F}$, denoted by $\Psi$-dim($\mathcal{F}$), is the maximal cardinality of a subset of $\mathcal{X}$ $\Psi$-shattered by $\mathcal{F}$, if this cardinality is finite. If no such maximum exists, $\mathcal{F}$ is said to have infinite $\Psi$-dimension.

Remark 4
Let $\mathcal{F}$ and $\Psi$ be defined as above. Extending the definition of the VC dimension so that it applies to classes of functions taking values in $\{-1, 1, *\}$, which has no consequence in practice, the following proposition holds true:
\[ \Psi\text{-dim}(\mathcal{F}) = \text{VC-dim}\left( \{ (x, \psi) \mapsto \psi \circ f(x) : f \in \mathcal{F} \} \right). \]
Main examples of $\Psi$-dimensions

Definition 10 (Graph dimension, Dudley, 1987; Natarajan, 1989)
Let $\mathcal{F}$ be a class of functions on a set $\mathcal{X}$ taking their values in $[\![ 1, Q ]\!]$. The graph dimension of $\mathcal{F}$, G-dim($\mathcal{F}$), is the $\Psi$-dimension of $\mathcal{F}$ in the specific case where $\Psi = \{\psi_k : 1 \le k \le Q\}$, such that $\psi_k$ takes the value $1$ if its argument is equal to $k$ and the value $-1$ otherwise. Reformulated in the context of multi-category classification, the functions $\psi_k$ are the indicator functions of the categories.

Definition 11 (Natarajan dimension, Natarajan, 1989)
Let $\mathcal{F}$ be a class of functions on a set $\mathcal{X}$ taking their values in $[\![ 1, Q ]\!]$. The Natarajan dimension of $\mathcal{F}$, N-dim($\mathcal{F}$), is the $\Psi$-dimension of $\mathcal{F}$ in the specific case where $\Psi = \{\psi_{k,l} : 1 \le k \ne l \le Q\}$, such that $\psi_{k,l}$ takes the value $1$ if its argument is equal to $k$, the value $-1$ if its argument is equal to $l$, and $*$ otherwise.

Remark 5
The definition of the graph dimension is inspired by the one-against-all decomposition method, whereas the definition of the Natarajan dimension is inspired by the one-against-one decomposition method.
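The difference between the graph and Natarajan dimensions can be checked by brute force on a tiny class. The sketch below assumes 0-based categories, functions stored as tuples of category indices, and `None` standing for the null element $*$; all names are illustrative.

```python
from itertools import product


def graph_family(Q):
    """psi_k: 1 if the category equals k, -1 otherwise."""
    return [lambda c, k=k: 1 if c == k else -1 for k in range(Q)]


def natarajan_family(Q):
    """psi_{k,l}: 1 on k, -1 on l, None (the null element) elsewhere."""
    return [lambda c, k=k, l=l: 1 if c == k else (-1 if c == l else None)
            for k in range(Q) for l in range(Q) if k != l]


def is_psi_shattered(functions, subset, family):
    """True if some choice of one mapping per point realizes every sign vector."""
    targets = set(product((-1, 1), repeat=len(subset)))
    for choice in product(family, repeat=len(subset)):
        realized = {tuple(psi(f[i]) for psi, i in zip(choice, subset))
                    for f in functions}
        if targets <= realized:
            return True
    return False
```

With the class `[(0, 0), (0, 1), (1, 0), (2, 2)]` on two points and $Q = 3$, the pair of points is graph-shattered but not Natarajan-shattered, since the four required functions would need only two distinct values on the first point.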
Fat-shattering or $\gamma$ dimension

Definition 12 (Fat-shattering dimension, Kearns & Schapire, 1994)
Let $\mathcal{G}$ be a class of real-valued functions on a set $\mathcal{X}$. For $\gamma \in \mathbb{R}_+^*$, a subset $s_{X^n} = \{x_i : 1 \le i \le n\}$ of $\mathcal{X}$ is said to be $\gamma$-shattered by $\mathcal{G}$ if there is a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each vector $v_y = (y_i)$ in $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying
\[ \forall i \in [\![ 1, n ]\!], \quad y_i (g_y(x_i) - b_i) \ge \gamma. \]
The fat-shattering dimension with margin $\gamma$, or $P_\gamma$ dimension, of the class $\mathcal{G}$, $P_\gamma$-dim($\mathcal{G}$), is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-shattered by $\mathcal{G}$, if this cardinality is finite. If no such maximum exists, $\mathcal{G}$ is said to have infinite $P_\gamma$ dimension.
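$\gamma$-$\Psi$-dimensions

Let $\wedge$ denote the conjunction of two events.

Definition 13 ($\gamma$-$\Psi$-dimensions)
Let $\mathcal{G}$ be a class of functions on a set $\mathcal{X}$ taking their values in $\mathbb{R}^Q$. Let $\Psi$ be a family of mappings $\psi$ from $[\![ 1, Q ]\!]$ into $\{-1, 1, *\}$. For $\gamma \in \mathbb{R}_+^*$, a subset $s_{X^n} = \{x_i : 1 \le i \le n\}$ of $\mathcal{X}$ is said to be $\gamma$-$\Psi$-shattered ($\Psi$-shattered with margin $\gamma$) by $\Delta^\# \mathcal{G}$ if there is a mapping $\psi^n = \left( \psi^{(i)} \right)_{1 \le i \le n}$ in $\Psi^n$ and a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each vector $v_y = (y_i)$ in $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying
\[ \forall i \in [\![ 1, n ]\!], \quad \begin{cases} \text{if } y_i = 1, & \exists k : \psi^{(i)}(k) = 1 \wedge \Delta^\# g_{y,k}(x_i) - b_i \ge \gamma \\ \text{if } y_i = -1, & \exists l : \psi^{(i)}(l) = -1 \wedge \Delta^\# g_{y,l}(x_i) + b_i \ge \gamma \end{cases} \]
The $\gamma$-$\Psi$-dimension, or $\Psi$-dimension with margin $\gamma$, of $\Delta^\# \mathcal{G}$, denoted by $\Psi$-dim($\Delta^\# \mathcal{G}, \gamma$), is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-$\Psi$-shattered by $\Delta^\# \mathcal{G}$, if this cardinality is finite. If no such maximum exists, $\Delta^\# \mathcal{G}$ is said to have infinite $\gamma$-$\Psi$-dimension.

This definition simplifies into that of the fat-shattering dimension when $Q = 2$.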
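Natarajan dimension with margin $\gamma$

Definition 14 (Natarajan dimension with margin $\gamma$)
Let $\mathcal{G}$ be a class of functions on a set $\mathcal{X}$ taking their values in $\mathbb{R}^Q$. For $\gamma \in \mathbb{R}_+^*$, a subset $s_{X^n} = \{x_i : 1 \le i \le n\}$ of $\mathcal{X}$ is said to be $\gamma$-N-shattered (N-shattered with margin $\gamma$) by $\Delta^\# \mathcal{G}$ if there is a set
\[ I(s_{X^n}) = \{ (i_1(x_i), i_2(x_i)) : 1 \le i \le n \} \]
of $n$ couples of distinct indexes in $[\![ 1, Q ]\!]$ and a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each vector $v_y = (y_i)$ in $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying
\[ \forall i \in [\![ 1, n ]\!], \quad \begin{cases} \text{if } y_i = 1, & \Delta^\# g_{y, i_1(x_i)}(x_i) - b_i \ge \gamma \\ \text{if } y_i = -1, & \Delta^\# g_{y, i_2(x_i)}(x_i) + b_i \ge \gamma \end{cases} \]
The Natarajan dimension with margin $\gamma$ of the class $\Delta^\# \mathcal{G}$, N-dim($\Delta^\# \mathcal{G}, \gamma$), is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-N-shattered by $\Delta^\# \mathcal{G}$, if this cardinality is finite. If no such maximum exists, $\Delta^\# \mathcal{G}$ is said to have infinite Natarajan dimension with margin $\gamma$.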
Generalized Sauer-Shelah lemma
Sauer-Shelah lemma (Classes of indicator functions)

Lemma 1 (Vapnik & Chervonenkis, 1971; Sauer, 1972; Shelah, 1972)
Let $\mathcal{F}$ be a class of indicator functions on a set $\mathcal{X}$ and let $\Pi_{\mathcal{F}}$ be its growth function. If its VC dimension $d$ is finite, then for $n \ge d$,
\[ \Pi_{\mathcal{F}}(n) \le \sum_{i=0}^d C_n^i < \left( \frac{en}{d} \right)^d \]
where $e$ is the base of the natural logarithm.
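The two sides of the Sauer-Shelah bound are easy to compare numerically. This is a small sketch with illustrative function names, assuming $n \ge d \ge 1$ and Python 3.8+ for `math.comb`.

```python
import math


def sauer_sum(n, d):
    """Sum of binomial coefficients C(n, i) for i = 0..d."""
    return sum(math.comb(n, i) for i in range(d + 1))


def sauer_closed_form(n, d):
    """Closed-form upper bound (e * n / d) ** d, valid for n >= d >= 1."""
    return (math.e * n / d) ** d
```

For $n = 10$ and $d = 3$ the sum equals 176, far below $2^{10} = 1024$, which shows the polynomial growth once $n$ exceeds the VC dimension.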

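Classes of functions from $\mathcal{X}$ into $[\![ 1, Q ]\!]$

Lemma 2 (Haussler & Long, 1995)
Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ into $[\![ 1, Q ]\!]$ and let $\Pi_{\mathcal{F}}$ be its growth function. If its Natarajan dimension $d$ is finite, then for $n \ge d$,
\[ \Pi_{\mathcal{F}}(n) \le \sum_{i=0}^d C_n^i \left( C_{Q+1}^2 \right)^i < \left( \frac{(Q+1)^2 \, e n}{2d} \right)^d \]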

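Classes of real-valued functions

Lemma 3 (Alon et al., 1997)
Let $\mathcal{G}$ be a class of functions from $\mathcal{X}$ into $[0, 1]$. For every value of $\epsilon$ in $(0, 1]$ and every integer value of $n$ satisfying $n \ge P_{\epsilon/4}$-dim($\mathcal{G}$), the following bound is true:
\[ \mathcal{N}(\epsilon, \mathcal{G}, n) < 2 \left( \frac{4n}{\epsilon^2} \right)^{d \log_2 (2en/(d\epsilon))} \]
where $d = P_{\epsilon/4}$-dim($\mathcal{G}$).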

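Classes of functions from $\mathcal{X}$ into $\mathbb{R}^Q$

Lemma 4
Let $\mathcal{G}$ be a class of functions from $\mathcal{X}$ into $[-M_{\mathcal{G}}, M_{\mathcal{G}}]^Q$. For every value of $\epsilon$ in $(0, M_{\mathcal{G}}]$ and every integer value of $n$ satisfying $n \ge$ N-dim($\Delta \mathcal{G}, \epsilon/6$), the following bound is true:
\[ \mathcal{N}^{(p)}(\epsilon, \Delta^* \mathcal{G}, n) < 2 \left( n \, Q^2 (Q-1) \left\lfloor \frac{3 M_{\mathcal{G}}}{\epsilon} \right\rfloor^2 \right)^{\left\lceil d \log_2 \left( e n C_Q^2 \left( 2 \left\lfloor \frac{3 M_{\mathcal{G}}}{\epsilon} \right\rfloor - 1 \right) / d \right) \right\rceil} \]
where $d =$ N-dim($\Delta \mathcal{G}, \epsilon/6$).

The proof does not hold true anymore if the operator $\Delta^*$ is replaced with the operator $\Delta$.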
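Nature and rate of convergence

Theorem 3
Let $\mathcal{G}$ be the class of functions from $\mathcal{X}$ into $[-M_{\mathcal{G}}, M_{\mathcal{G}}]^Q$ that a large margin $Q$-category classifier can implement. Let $\delta \in (0, 1)$. With probability at least $1 - \delta$, uniformly for every value of $\gamma$ in $(0, M_{\mathcal{G}}]$, the risk of any function $g$ in $\mathcal{G}$ is bounded from above by:
\[ R(g) \le R_{\gamma,m}(g) + \sqrt{ \frac{2}{m} \left( \ln \left( 4 \left( 2m \, Q^2 (Q-1) \left\lfloor \frac{12 M_{\mathcal{G}}}{\gamma} \right\rfloor^2 \right)^{\left\lceil d \log_2 \left( e m Q (Q-1) \left( 2 \left\lfloor \frac{12 M_{\mathcal{G}}}{\gamma} \right\rfloor - 1 \right) / d \right) \right\rceil} \right) + \ln \frac{2 M_{\mathcal{G}}}{\gamma \delta} \right) } + \frac{1}{m} \]
where $d =$ N-dim($\Delta \mathcal{G}, \gamma/24$).

In short, for some constant $c$:
\[ R(g) \le R_{\gamma,m}(g) + c \ln(m) \sqrt{\frac{d}{m}} \]

Proposition 1 (Almost sure uniform convergences)
\[ \lim_{m \to +\infty} \sup_P \; P \left( \sup_{n \ge m} \sup_{g \in \mathcal{G}} \left( R(g) - R_{\gamma,n}(g) \right) > \epsilon \right) = 0 \]
\[ \lim_{m \to +\infty} \sup_P \; P \left( \sup_{n \ge m} \sup_{g \in \mathcal{G}} \left| R_\gamma(g) - R_{\gamma,n}(g) \right| > \epsilon \right) = 0 \]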
Multi-class SVMs
Multi-category classification with binary SVMs

One-against-all method (Rifkin & Klautau, 2004)
- $Q$ SVMs: the $k$-th one distinguishes category $k$ from the $Q - 1$ other ones
- Decision rule: winner-takes-all

One-against-one method/pairwise classification (Fürnkranz, 2002)
- $C_Q^2$ SVMs: one for each pair of classes
- Decision rule: max-wins voting

Use of error correcting output codes (ECOC) (Allwein et al., 2000)
- $M = (m_{kl}) \in \mathcal{M}_{Q,N}(\{-1, 0, 1\})$: coding matrix
- $N$ SVMs: one for each of the dichotomies defined by the columns of $M$
- Decision rule: computation of a loss function
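The two simplest decision rules can be sketched as follows, taking the outputs of already trained binary SVMs as plain numbers. The representation of the pairwise outputs as a dictionary keyed by `(k, l)` with `k < l`, the 0-based category indices, and the choice of returning `None` on ties (mirroring the $*$ value used for ex aequo) are assumptions made for illustration.

```python
def winner_takes_all(scores):
    """One-against-all rule: index of the largest output, or None on a tie."""
    best = max(scores)
    winners = [k for k, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None


def max_wins(pairwise, Q):
    """One-against-one rule: pairwise[(k, l)] is the output of the SVM trained
    with category k as positive class and l as negative class; each SVM votes
    for k when its output is positive and for l otherwise."""
    votes = [0] * Q
    for (k, l), out in pairwise.items():
        votes[k if out > 0 else l] += 1
    return winner_takes_all(votes)
```

With $Q = 3$, the one-against-one scheme needs $C_3^2 = 3$ pairwise outputs, as in `max_wins({(0, 1): 1.0, (0, 2): -0.5, (1, 2): -2.0}, 3)`.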
Class of functions implemented by the M-SVMs
Reproducing kernel Hilbert space

Let $\mathcal{X}$ be a space and $(H, \langle \cdot, \cdot \rangle_H)$ a Hilbert space of functions on $\mathcal{X}$ ($H \subset \mathbb{R}^{\mathcal{X}}$).

Definition 15 (Reproducing kernel, Aronszajn, 1950)
Let $\kappa$ be a function from $\mathcal{X}^2$ into $\mathbb{R}$. For all $x \in \mathcal{X}$, let $\kappa_x$ be the function from $\mathcal{X}$ into $\mathbb{R}$ given by $\kappa_x : t \mapsto \kappa(x, t)$. $\kappa$ is a reproducing kernel of $H$ if and only if:
1. $\forall x \in \mathcal{X}$, $\kappa_x \in H$;
2. $\forall x \in \mathcal{X}$, $\forall h \in H$, $\langle h, \kappa_x \rangle_H = h(x)$ (reproducing property).

Definition 16 (Reproducing kernel Hilbert space)
If $H$ possesses a reproducing kernel, it is called a reproducing kernel Hilbert space (RKHS) or a proper Hilbert space.
Positive semidefinite kernel and RKHS

Definition 17 (Positive semidefinite (positive type) kernel)
A function $\kappa$ from $\mathcal{X}^2$ into $\mathbb{R}$ is called a positive semidefinite kernel (or a positive type kernel) if
\[ \forall n \in \mathbb{N}^*, \; \forall (a_i)_{1 \le i \le n} \in \mathbb{R}^n, \; \forall (x_i)_{1 \le i \le n} \in \mathcal{X}^n, \quad \sum_{i=1}^n \sum_{j=1}^n a_i a_j \kappa(x_i, x_j) \ge 0. \]

Theorem 4 (Moore-Aronszajn)
Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}^2$. There exists only one Hilbert space $(H, \langle \cdot, \cdot \rangle_H)$ of functions on $\mathcal{X}$ with $\kappa$ as reproducing kernel.
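Building a M-SVM starting from a kernel

Basic class of functions
Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}^2$ and let $(H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa})$ be the corresponding RKHS.
Let $\bar{\mathcal{H}} = (H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa})^Q$ and $\mathcal{H} = \left( (H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa}) + \{1\} \right)^Q$.
$\mathcal{H}$: class of functions $h = (h_k)_{1 \le k \le Q}$ from $\mathcal{X}$ into $\mathbb{R}^Q$ such that:
\[ h(\cdot) = \left( \sum_{i=1}^{m_k} \beta_{ik} \kappa(x_{ik}, \cdot) + b_k \right)_{1 \le k \le Q} \]
with $\{x_{ik} : 1 \le i \le m_k\} \subset \mathcal{X}$, $(\beta_{ik})_{1 \le i \le m_k} \in \mathbb{R}^{m_k}$ and $b_k \in \mathbb{R}$, as well as the limits of these functions when the sets $\{x_{ik} : 1 \le i \le m_k\}$ become dense in $\mathcal{X}$ in the norm induced by the kernel.

Class of functions implemented
Convex subset of $\mathcal{H}$ (defined by constraints on an affine subspace).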
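An affine model in the feature space

Theorem 5 (Mercer's theorem)
For every Mercer kernel $\kappa$, there exists a map $\Phi$ such that:
\[ \forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle \]
where $\langle \cdot, \cdot \rangle$ is the dot product of the $\ell_2$ space.

$\Phi$ is called a feature map. Let $\Phi(\mathcal{X}) = \{\Phi(x) : x \in \mathcal{X}\}$. A feature space is any of the Hilbert spaces $\left( E_{\Phi(\mathcal{X})}, \langle \cdot, \cdot \rangle \right)$ spanned by the $\Phi(\mathcal{X})$.

$\mathcal{H}$ can be seen as a class of multivariate affine functions on $\Phi(\mathcal{X})$:
\[ h(\cdot) = \left( \langle w_k, \cdot \rangle + b_k \right)_{1 \le k \le Q}, \quad w = (w_k)_{1 \le k \le Q} \in E_{\Phi(\mathcal{X})}^Q, \quad b = (b_k)_{1 \le k \le Q} \in \mathbb{R}^Q \]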
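Putting things the other way round: the kernel trick

Norms on $\bar{\mathcal{H}}$ and $E_{\Phi(\mathcal{X})}^Q$
\[ \forall \bar{h} \in \bar{\mathcal{H}}, \quad \| \bar{h} \|_{\bar{\mathcal{H}}} = \sqrt{ \sum_{k=1}^Q \| \bar{h}_k \|_{H_\kappa}^2 } = \sqrt{ \sum_{k=1}^Q \langle w_k, w_k \rangle } = \sqrt{ \sum_{k=1}^Q \| w_k \|_2^2 } = \| w \|_2 \]
\[ \| w \|_\infty = \max_{1 \le k \le Q} \| w_k \|_2 \]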
General formulation of the training algorithm

$Q \ge 3$: multi-class support vector machines
- $((x_i, y_i))_{1 \le i \le m} \in (\mathcal{X} \times [\![ 1, Q ]\!])^m$: training set
- $\ell_{\text{M-SVM}}$: convex loss function (built around the hinge loss)
- M-SVM: solution of a convex (quadratic) programming problem

Problem 1
\[ \min_{h \in \mathcal{H}} \left\{ \sum_{i=1}^m \ell_{\text{M-SVM}}(y_i, h(x_i)) + \lambda \| \bar{h} \|_{\bar{\mathcal{H}}}^2 \right\} \quad \text{s.t.} \quad \sum_{k=1}^Q h_k = 0 \]

Representer theorem
This theorem states that training (solving Problem 1) amounts to finding the values of the coefficients $\beta_{ik}$ in
\[ h(\cdot) = \left( \sum_{i=1}^m \beta_{ik} \kappa(x_i, \cdot) + b_k \right)_{1 \le k \le Q} \]
(the values of the biases $b_k$ are deduced by application of the Kuhn-Tucker conditions).
A general framework that encompasses the bi-class case

- $((x_i, y_i))_{1 \le i \le m} \in (\mathcal{X} \times \{-1, 1\})^m$: training set
- $h = (h_1, h_2) = (h_1, -h_1)$, $\tilde{h}(x) = h_1(x) = \Delta^\# h_1(x) = \frac{1}{2} \left( \langle w_1 - w_2, \Phi(x) \rangle + b_1 - b_2 \right)$
- $\ell_{\text{SVM}}(y, \tilde{h}(x)) = (1 - y \tilde{h}(x))_+$ (hinge loss)
- SVM: solution of a convex (quadratic) programming problem

Problem 2
\[ \min_{\tilde{h}} \left\{ \sum_{i=1}^m \ell_{\text{SVM}}(y_i, \tilde{h}(x_i)) + \lambda \| \bar{\tilde{h}} \|^2 \right\} \]

Representer theorem
This theorem states that training (solving Problem 2) amounts to finding the values of the coefficients $\beta_i$ in
\[ \tilde{h}(\cdot) = \sum_{i=1}^m \beta_i \kappa(x_i, \cdot) + b \]
(the value of the bias $b$ is deduced by application of the Kuhn-Tucker conditions).
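Hard margin M-SVMs and geometrical margins

Geometrical margins
\[ d_{\text{M-SVM}} = \min_{1 \le k < l \le Q} \left\{ \min \left[ \min_{i : y_i = k} \left( h_k(x_i) - h_l(x_i) \right), \min_{j : y_j = l} \left( h_l(x_j) - h_k(x_j) \right) \right] \right\} \]
\[ \forall (k, l), \; 1 \le k < l \le Q, \quad d_{\text{M-SVM},kl} = \frac{1}{d_{\text{M-SVM}}} \min \left[ \min_{i : y_i = k} \left( h_k(x_i) - h_l(x_i) - d_{\text{M-SVM}} \right), \min_{j : y_j = l} \left( h_l(x_j) - h_k(x_j) - d_{\text{M-SVM}} \right) \right] \]
\[ \forall (k, l), \; 1 \le k < l \le Q, \quad \gamma_{kl} = d_{\text{M-SVM}} \frac{1 + d_{\text{M-SVM},kl}}{\| w_k - w_l \|} \]

Connection between the penalizer and the geometrical margins
\[ \sum_{k<l} \| w_k - w_l \|^2 = Q \sum_{k=1}^Q \| w_k \|^2 - \left\| \sum_{k=1}^Q w_k \right\|^2 \]
\[ \sum_{k=1}^Q w_k = 0 \implies \sum_{k=1}^Q \| w_k \|^2 = \frac{d_{\text{M-SVM}}^2}{Q} \sum_{k<l} \left( \frac{1 + d_{\text{M-SVM},kl}}{\gamma_{kl}} \right)^2 \]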
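The algebraic identity linking the pairwise differences $\| w_k - w_l \|^2$ to the penalizer $\sum_k \| w_k \|^2$ can be checked numerically. The sketch below represents each $w_k$ as a plain tuple of numbers (a finite-dimensional feature space is assumed for illustration).

```python
from itertools import combinations


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(a * a for a in v)


def pairwise_sum(W):
    """Sum over k < l of ||w_k - w_l||^2."""
    return sum(sq_norm([a - b for a, b in zip(wk, wl)])
               for wk, wl in combinations(W, 2))


def penalizer_side(W):
    """Q * sum_k ||w_k||^2 - ||sum_k w_k||^2."""
    total = [sum(col) for col in zip(*W)]
    return len(W) * sum(sq_norm(w) for w in W) - sq_norm(total)
```

When the vectors sum to zero, as for `W = [(1, 2), (3, -1), (-4, -1)]`, the second term vanishes and the pairwise sum equals $Q \sum_k \| w_k \|^2$.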
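Three main models of M-SVMs
M-SVM of Weston and Watkins

Training algorithm - primal formulation

Problem 3 (M-SVM1, Vapnik & Blanz, 1998; Weston & Watkins, 1998; ...)
\[ \min_{h \in \mathcal{H}} \left\{ \frac{1}{2} \sum_{k=1}^Q \| w_k \|^2 + C \sum_{i=1}^m \sum_{k \ne y_i} \xi_{ik} \right\} \]
\[ \text{s.t.} \quad \begin{cases} \langle w_{y_i} - w_k, \Phi(x_i) \rangle + b_{y_i} - b_k \ge 1 - \xi_{ik}, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \xi_{ik} \ge 0, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \end{cases} \]

Remark 6
The constraint $\sum_{k=1}^Q h_k = 0$ is implicit.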
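M-SVM of Weston and Watkins

Training algorithm - dual formulation
$\alpha_{ik}$: Lagrange multiplier corresponding to the constraint $\langle w_{y_i} - w_k, \Phi(x_i) \rangle + b_{y_i} - b_k \ge 1 - \xi_{ik}$
$\alpha = (\alpha_{ik})_{1 \le i \le m, 1 \le k \le Q}$, $(\alpha_{i y_i})_{1 \le i \le m} = 0$

Problem 4 (M-SVM1)
\[ \min_\alpha \left\{ \frac{1}{2} \alpha^T H_{WW} \alpha - 1_{Qm}^T \alpha \right\} \]
\[ \text{s.t.} \quad \begin{cases} 0 \le \alpha_{ik} \le C, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \sum_{i : y_i = k} \sum_{l=1}^Q \alpha_{il} - \sum_{i=1}^m \alpha_{ik} = 0, & (1 \le k \le Q - 1) \end{cases} \]
\[ H_{WW} = \left( \left( \delta_{y_i, y_j} - \delta_{y_i, l} - \delta_{y_j, k} + \delta_{k,l} \right) \kappa(x_i, x_j) \right)_{1 \le i, j \le m, \, 1 \le k, l \le Q} \]
\[ w_k = \sum_{i : y_i = k} \sum_{l=1}^Q \alpha_{il} \Phi(x_i) - \sum_{i=1}^m \alpha_{ik} \Phi(x_i) = \sum_{i=1}^m \sum_{l=1}^Q \left( \delta_{y_i, k} - \delta_{k,l} \right) \alpha_{il} \Phi(x_i) \]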
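The expression of $w_k$ in terms of the dual variables makes the implicit sum-to-zero constraint visible: summing the coefficients $\delta_{y_i,k} - \delta_{k,l}$ over $k$ gives zero. A minimal sketch, assuming a linear kernel ($\Phi = \text{Id}$), 0-based categories, and dual variables stored as a list of rows `alpha[i][l]` (all illustrative choices):

```python
def ww_weights(alpha, X, y, Q):
    """w_k = sum_i sum_l (delta_{y_i,k} - delta_{k,l}) * alpha[i][l] * x_i,
    computed for a linear kernel (Phi = Id)."""
    dim = len(X[0])
    W = [[0] * dim for _ in range(Q)]
    for i, x in enumerate(X):
        for l in range(Q):
            for k in range(Q):
                # Kronecker deltas written as boolean comparisons
                coef = ((y[i] == k) - (k == l)) * alpha[i][l]
                for t in range(dim):
                    W[k][t] += coef * x[t]
    return W
```

Whatever the values of the multipliers, the vectors returned by `ww_weights` sum to the zero vector, so no explicit constraint on $\sum_k w_k$ is needed.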
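M-SVM of Crammer and Singer

Problem 5 (M-SVM2, Crammer & Singer, 2001)
\[ \min_{\bar{h} \in \bar{\mathcal{H}}} \left\{ \frac{1}{2} \sum_{k=1}^Q \| w_k \|^2 + C \sum_{i=1}^m \xi_i \right\} \]
\[ \text{s.t.} \quad \langle w_{y_i} - w_k, \Phi(x_i) \rangle + \delta_{y_i, k} \ge 1 - \xi_i, \quad (1 \le i \le m), (1 \le k \le Q) \]

Remark 7
The constraint $\sum_{k=1}^Q \bar{h}_k = 0$ is implicit.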
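M-SVM of Crammer and Singer

$\alpha_{ik}$: Lagrange multiplier corresponding to the constraint $\langle w_{y_i} - w_k, \Phi(x_i) \rangle + \delta_{y_i, k} \ge 1 - \xi_i$
$\alpha = (\alpha_{ik})_{1 \le i \le m, 1 \le k \le Q}$, $\delta = (\delta_{y_i, k})_{1 \le i \le m, 1 \le k \le Q}$

Problem 6 (M-SVM2)
\[ \min_\alpha \left\{ \frac{1}{2} \alpha^T H_{WW} \alpha + \delta^T \alpha \right\} \]
\[ \text{s.t.} \quad \begin{cases} \alpha_{ik} \ge 0, & (1 \le i \le m), (1 \le k \le Q) \\ \sum_{k=1}^Q \alpha_{ik} = C, & (1 \le i \le m) \end{cases} \]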
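M-SVM of Lee, Lin and Wahba

Problem 7 (M-SVM3, Lee et al., 2004)
\[ \min_{h \in \mathcal{H}} \left\{ \frac{1}{2} \sum_{k=1}^Q \| w_k \|^2 + C \sum_{i=1}^m \sum_{k \ne y_i} \xi_{ik} \right\} \]
\[ \text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\frac{1}{Q-1} + \xi_{ik}, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \xi_{ik} \ge 0, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \sum_{k=1}^Q w_k = 0, \quad \sum_{k=1}^Q b_k = 0 \end{cases} \]

Result of consistency (Zhang, 2004; Tewari & Bartlett, 2007)
This M-SVM is the only one for which training is Bayes/Fisher consistent.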
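For a fixed score vector $h(x)$, the smallest slack values satisfying the constraints of the three primal problems give three different hinge-type losses. The sketch below reads these losses off the constraints of Problems 3, 5 and 7; the function names and the 0-based label are illustrative assumptions.

```python
def hinge(t):
    """Positive part (t)_+."""
    return max(0.0, t)


def ww_loss(h, y):
    """Weston-Watkins: sum over k != y of (1 - (h_y - h_k))_+."""
    return sum(hinge(1.0 - (h[y] - h[k])) for k in range(len(h)) if k != y)


def cs_loss(h, y):
    """Crammer-Singer: a single slack, max over k != y of (1 - (h_y - h_k))_+."""
    return max(hinge(1.0 - (h[y] - h[k])) for k in range(len(h)) if k != y)


def llw_loss(h, y):
    """Lee-Lin-Wahba: sum over k != y of (h_k + 1/(Q-1))_+, with sum_k h_k = 0."""
    Q = len(h)
    return sum(hinge(h[k] + 1.0 / (Q - 1)) for k in range(Q) if k != y)
```

On the sum-to-zero score vector `[1.0, 0.5, -1.5]` with true category 0, the example is correctly classified, yet all three losses are positive because the margin requirements are not met.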
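M-SVM of Lee, Lin and Wahba

$\alpha_{ik}$: Lagrange multiplier corresponding to the constraint $\langle w_k, \Phi(x_i) \rangle + b_k \le -\frac{1}{Q-1} + \xi_{ik}$
$\alpha = (\alpha_{ik})_{1 \le i \le m, 1 \le k \le Q}$, $(\alpha_{i y_i})_{1 \le i \le m} = 0$

Problem 8 (M-SVM3)
\[ \min_\alpha \left\{ \frac{1}{2} \alpha^T H_{LLW} \alpha - \frac{1}{Q-1} 1_{Qm}^T \alpha \right\} \]
\[ \text{s.t.} \quad \begin{cases} 0 \le \alpha_{ik} \le C, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \sum_{i=1}^m \sum_{l=1}^Q \left( \frac{1}{Q} - \delta_{k,l} \right) \alpha_{il} = 0, & (1 \le k \le Q - 1) \end{cases} \]
\[ H_{LLW} = \left( \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j) \right)_{1 \le i, j \le m, \, 1 \le k, l \le Q} \]
\[ w_k = \sum_{i=1}^m \sum_{l=1}^Q \left( \frac{1}{Q} - \delta_{k,l} \right) \alpha_{il} \Phi(x_i) \]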
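Some variants of the main models
Use of different norms on $w$

Problem 9 ($\ell_\infty$-norm M-SVM)
\[ \min_{h \in \mathcal{H}} \left\{ t^2 + C \sum_{i=1}^m \sum_{k \ne y_i} \xi_{ik} \right\} \]
\[ \text{s.t.} \quad \begin{cases} \langle w_{y_i} - w_k, \Phi(x_i) \rangle + b_{y_i} - b_k \ge 1 - \xi_{ik}, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \xi_{ik} \ge 0, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \| w_k \| \le t, & (1 \le k \le Q) \end{cases} \]

$\ell_1$-norm M-SVM (Wang et al., 2006)
$\kappa(x, x') = x^T x'$ ($\Phi = \text{Id}$)

Problem 10 ($\ell_1$-norm M-SVM)
\[ \min_{h \in \mathcal{H}} \sum_{i=1}^m \ell_{\text{M-SVM}}(y_i, h(x_i)) \quad \text{s.t.} \quad \begin{cases} \sum_{k=1}^Q \| w_k \|_1 \le K \\ \sum_{k=1}^Q h_k = 0 \end{cases} \]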
Use of a different norm on $\xi$: quadratic loss M-SVMs

Definition 18 (Quadratic loss M-SVM)
A quadratic loss M-SVM is a M-SVM for which the empirical term of the objective function, $\| \xi \|_1$, is replaced by a quadratic form, $\xi^T M \xi$, where $M$ is a symmetric positive semidefinite matrix.

Definition 19 (M-SVM$^2$)
Variant of the M-SVM of Lee, Lin and Wahba corresponding to
\[ M = \left( \left( \delta_{k,l} - \frac{1}{Q} \right) \delta_{i,j} \right)_{1 \le i, j \le m, \, 1 \le k, l \le Q} \]
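Training algorithm of the M-SVM$^2$

Primal formulation

Problem 11 (M-SVM$^2$)
\[ \min_{h \in \mathcal{H}} \left\{ \frac{1}{2} \sum_{k=1}^Q \| w_k \|^2 + C \, \xi^T M \xi \right\} \]
\[ \text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\frac{1}{Q-1} + \xi_{ik}, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \sum_{k=1}^Q w_k = 0, \quad \sum_{k=1}^Q b_k = 0 \end{cases} \]

Dual formulation

Problem 12 (M-SVM$^2$)
\[ \min_\alpha \left\{ \frac{1}{2} \alpha^T \left( H_{LLW} + \frac{1}{2C} M \right) \alpha - \frac{1}{Q-1} 1_{Qm}^T \alpha \right\} \]
\[ \text{s.t.} \quad \begin{cases} \alpha_{ik} \ge 0, & (1 \le i \le m), (1 \le k \ne y_i \le Q) \\ \sum_{i=1}^m \sum_{l=1}^Q \left( \frac{1}{Q} - \delta_{k,l} \right) \alpha_{il} = 0, & (1 \le k \le Q - 1) \end{cases} \]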
Margins and support vectors
Margins and support vectors of a M-SVM

Figure 2: 3 categories linearly separable in $\mathbb{R}^2$ (axes: $x_1$, $x_2$)

Figure 3: Separating hyperplanes and soft margins of a linear M-SVM1

Figure 4: 3 categories non-linearly separable in $\mathbb{R}^3$ (axes: $x_1$, $x_2$, $x_3$; separating surfaces: $C_1 / C_2$, $C_2 / C_3$)

Figure 5: Separating hyperplanes and support vectors of a linear M-SVM1 (axes: $x_1$, $x_2$, $x_3$; hyperplanes: $C_1 / C_2$, $C_1 / C_3$, $C_2 / C_3$)
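Bounds on the covering numbers
Margin Natarajan dimension of the multi-class SVMs

Theorem 6
Let $\bar{\mathcal{H}}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\| w \|_\infty \le \Lambda_w$ and that $b = 0$. Then, for all $\epsilon \in \mathbb{R}_+^*$,
\[ \text{N-dim}\left( \Delta \bar{\mathcal{H}}, \epsilon \right) \le C_Q^2 \left( \frac{\Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\epsilon} \right)^2. \]

The proof
- does not hold true anymore if the operator $\Delta$ is replaced by the operator $\Delta^*$;
- calls for the use of the $\ell_\infty$-norm instead of the $\ell_2$-norm (used by the penalizer);
- rests directly on the one-against-one decomposition scheme.

$Q = 2$:
\[ P_\epsilon\text{-dim}\left( \bar{\mathcal{H}} \right) \le \left( \frac{\Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\epsilon} \right)^2 \]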
From covering numbers to entropy numbers

Definition 20 (Entropy numbers of a set)
Let $(E, \rho)$ be a pseudo-metric space (or $(E, \| \cdot \|_E)$ a Banach space) and $E'$ a bounded subset of $E$. Then, for $n \in \mathbb{N}^*$, the $n$-th entropy number of $E'$, $\epsilon_n(E')$, is:
\[ \epsilon_n(E') = \inf \{ \epsilon > 0 : \mathcal{N}(\epsilon, E', \rho) \le n \}. \]

Definition 21 (Entropy numbers of a bounded linear operator)
Let $(E, \| \cdot \|_E)$ and $(F, \| \cdot \|_F)$ be two Banach spaces. Let $\mathcal{L}(E, F)$ denote the Banach space of all (bounded linear) operators from $(E, \| \cdot \|_E)$ into $(F, \| \cdot \|_F)$ endowed with the norm:
\[ \forall S \in \mathcal{L}(E, F), \quad \| S \| = \sup_{e \in E : \| e \|_E = 1} \| S(e) \|_F. \]
The $n$-th entropy number of $S$ is defined as $\epsilon_n(S) = \epsilon_n(S(U_E))$, where $U_E$ is the unit ball of $E$.
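From covering numbers to entropy numbers

Definition 22 (Evaluation operator)
For $n \in \mathbb{N}^*$, let $x^n \in \mathcal{X}^n$. The evaluation operator $S_{x^n}$ on $\bar{\mathcal{H}}$ is defined as:
\[ S_{x^n} : \bar{\mathcal{H}} \to \ell_\infty^{Qn}, \quad \bar{h} = \left( \langle w_k, \cdot \rangle \right)_{1 \le k \le Q} \mapsto S_{x^n}(\bar{h}) = \left( \langle w_k, \Phi(x_i) \rangle \right)_{1 \le i \le n, \, 1 \le k \le Q} \]

Let $U$ be the unit ball of $\bar{\mathcal{H}}$ ($U = \{ \bar{h} \in \bar{\mathcal{H}} : \| w \|_\infty \le 1 \}$). The connection between $\mathcal{N}(\epsilon, U, n)$ and the entropy numbers of $S_{x^n}$ in the $\ell_\infty^{Qn}$-norm is provided by the following proposition:

Proposition 2
Let $\epsilon \in \mathbb{R}_+^*$ and $n \in \mathbb{N}^*$.
\[ \sup_{x^n \in \mathcal{X}^n} \epsilon_p(S_{x^n}) \le \epsilon \implies \mathcal{N}(\epsilon, U, n) \le p. \]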
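Upper bound on the entropy numbers
Finite-dimensional feature space

Proposition 3 (Carl & Stephani, 1990)
Let $E$ and $F$ be Banach spaces and $S \in \mathcal{L}(E, F)$. If $S$ is of rank $r$, then for $n \in \mathbb{N}^*$,
\[ \epsilon_n(S) \le 4 \| S \| n^{-1/r}. \]

Theorem 7
Let $\mathcal{H}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\| w \|_\infty \le \Lambda_w$ and $b \in [-\beta, \beta]^Q$. If the dimensionality of the space $E_{\Phi(\mathcal{X})}$ is finite and equal to $d$, then for all $\gamma \in \mathbb{R}_+^*$,
\[ \mathcal{N}^{(p)}\left( \gamma/4, \Delta^\#_\gamma \mathcal{H}, 2m \right) \le \left( 2 \left\lceil \frac{8 \beta}{\gamma} \right\rceil + 1 \right)^Q \left( \frac{64 \Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\gamma} \right)^{Qd}. \]
\[ R(h) \le R_{\gamma,m}(h) + O\left( \frac{1}{\sqrt{m}} \right) \]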
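Upper bound on the entropy numbers
Infinite-dimensional feature space

Theorem 8 (Maurey-Carl theorem, Carl & Stephani, 1990)
Let $H$ be a Hilbert space and $S$ an operator belonging to $\mathcal{L}(\ell_1^n, H)$ or $\mathcal{L}(H, \ell_\infty^n)$. Then, for each couple of integers $(k, n)$ satisfying $1 \le k \le n$,
\[ e_k(S) \le c \left( \frac{1}{k} \log_2 \left( 1 + \frac{n}{k} \right) \right)^{1/2} \| S \|, \]
where the dyadic entropy number $e_k(S)$ is equal to $\epsilon_{2^{k-1}}(S)$ and $c$ is a universal constant.

Theorem 9
Let $\mathcal{H}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\| w \|_\infty \le \Lambda_w$ and $b \in [-\beta, \beta]^Q$. Then, for all $\gamma \in \mathbb{R}_+^*$,
\[ \mathcal{N}^{(p)}\left( \gamma/4, \Delta^\#_\gamma \mathcal{H}, 2m \right) \le \left( 2 \left\lceil \frac{8 \beta}{\gamma} \right\rceil + 1 \right)^Q \cdot 2^{\frac{16 c \Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\gamma} \sqrt{\frac{2Qm}{\ln(2)}} - 1}. \]
\[ R(h) \le R_{\gamma,m}(h) + O(\dots) \]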
Use of the Rademacher complexity
Basic probabilistic tools

Definition 23 (Rademacher average)
For $n \in \mathbb{N}^*$, let $A$ be a bounded set of vectors $a = (a_i)_{1 \le i \le n}$ belonging to $\mathbb{R}^n$ and let $(\sigma_i)_{1 \le i \le n}$ be a Rademacher sequence. The Rademacher average associated with $A$, $R_n(A)$, is defined by:
\[ R_n(A) = E \left[ \sup_{a \in A} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i a_i \right| \right]. \]

Theorem 10 (Bounded differences inequality, McDiarmid, 1989)
Let $(T_i)_{1 \le i \le n}$ be a sequence of $n$ independent random variables taking values in a set $\mathcal{T}$. Let $g$ be a function from $\mathcal{T}^n$ into $\mathbb{R}$ such that there exists a sequence of nonnegative constants $(c_i)_{1 \le i \le n}$ satisfying:
\[ \forall i \in [\![ 1, n ]\!], \quad \sup_{(t_i)_{1 \le i \le n} \in \mathcal{T}^n, \, t'_i \in \mathcal{T}} \left| g(t_1, \ldots, t_n) - g(t_1, \ldots, t_{i-1}, t'_i, t_{i+1}, \ldots, t_n) \right| \le c_i. \]
Then, for all $\tau \in \mathbb{R}_+^*$, the random variable $g(T_1, \ldots, T_n)$ satisfies:
\[ P \{ g(T_1, \ldots, T_n) - E g(T_1, \ldots, T_n) > \tau \} \le e^{-2\tau^2 / c} \]
\[ P \{ E g(T_1, \ldots, T_n) - g(T_1, \ldots, T_n) > \tau \} \le e^{-2\tau^2 / c} \]
where $c = \sum_{i=1}^n c_i^2$.
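For a small finite set $A$, the Rademacher average can be computed exactly by enumerating all $2^n$ sign vectors. The sketch below assumes the variant of the definition with an absolute value inside the supremum and a set $A$ given as a list of equal-length tuples; it is meant for toy sizes only.

```python
from itertools import product


def rademacher_average(A):
    """Exact value of E sup_{a in A} (1/n) |sum_i sigma_i a_i| for a finite set A,
    obtained by averaging over all 2^n equally likely sign vectors."""
    n = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(abs(sum(s * a_i for s, a_i in zip(sigma, a)))
                     for a in A) / n
    return total / 2 ** n
```

The cost grows as $2^n |A|$, so for realistic sample sizes a Monte Carlo average over random sign vectors would replace the full enumeration.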
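Use of the Rademacher complexity
Uniform convergence result

Convexified margin risk corresponding to the M-SVM of Crammer and Singer
\[ \tilde{R}(\bar{h}) = E \left[ \left( 1 - \Delta \bar{h}_Y(X) \right)_+ \right] \]

Theorem 11
Let $\bar{\mathcal{H}}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the closed ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\| w \|_\infty \le \Lambda_w$ and $b = 0$. Let $K_{\mathcal{H}} = \Lambda_w \Lambda_{\Phi(\mathcal{X})} + 1$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, the risk of any function $\bar{h}$ in $\bar{\mathcal{H}}$ is bounded from above by:
\[ \tilde{R}(\bar{h}) \le \tilde{R}_m(\bar{h}) + \frac{4}{\sqrt{m}} + \frac{4 Q (Q-1) \Lambda_w}{m} \sqrt{ \sum_{i=1}^m \kappa(X_i, X_i) } + K_{\mathcal{H}} \sqrt{ \frac{\ln \frac{1}{\delta}}{2m} }. \]
\[ \tilde{R}(\bar{h}) \le \tilde{R}_m(\bar{h}) + O\left( \frac{1}{\sqrt{m}} \right) \]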
Bounds on the leave-one-out error
Radius-margin bound

Theorem 12 (Vapnik, 1998)
Let us consider a hard margin bi-class SVM. Let $L_m$ be the number of errors that it makes in a leave-one-out cross-validation procedure and let $\gamma = \frac{1}{\| w \|}$ denote its geometrical margin. Then the following upper bound holds true:
\[ L_m \le \frac{D_m^2}{\gamma^2} \]
where $D_m$ is the diameter of the smallest ball of the feature space containing the support vectors.
Radius-margin bound for the M-SVM of Weston and Watkins

$d_{WW} = d_{CS} = 1$

Theorem 13
Let us consider a hard margin $Q$-category M-SVM of Weston and Watkins (or Crammer and Singer) on a domain $\mathcal{X}$. Let $d_m = \{(x_i, y_i) : 1 \le i \le m\}$ be its training set, $L_m$ the number of errors resulting from applying a leave-one-out cross-validation procedure to this machine, and $D_m$ the diameter of the smallest sphere of the feature space containing the set $\{\Phi(x_i) : 1 \le i \le m\}$. Then the following upper bound holds true:
\[ L_m \le \frac{K_{CV}}{Q} D_m^2 \sum_{k<l} \left( \frac{1 + d_{WW,kl}}{\gamma_{kl}} \right)^2 \]

Constant $K_{CV}$
- The value of $K_{CV}$ is obtained by solving as many QP problems as there are support vectors.
- For $Q = 2$, $K_{CV} = 2$, and the bound reduces to the bi-class one.
Radius-margin bound for the M-SVM of Lee, Lin and Wahba

$d_{LLW} = \frac{Q}{Q-1}$

Theorem 14
Let us consider a hard margin $Q$-category M-SVM of Lee, Lin and Wahba on a domain $\mathcal{X}$. Let $d_m = \{(x_i, y_i) : 1 \le i \le m\}$ be its training set, $L_m$ the number of errors resulting from applying a leave-one-out cross-validation procedure to this machine, and $D_m$ the diameter of the smallest sphere of the feature space containing the set $\{\Phi(x_i) : 1 \le i \le m\}$. Then the following upper bound holds true:
\[ L_m \le Q^2 D_m^2 \sum_{k<l} \left( \frac{1 + d_{LLW,kl}}{\gamma_{kl}} \right)^2 \]

This bound does not reduce to the bi-class one for $Q = 2$.
Conclusions

Capacity measures of the classes of functions
- The $\gamma$-$\Psi$-dimensions play for the M-SVMs (and the MLPs!) the same role as the fat-shattering dimension for the bi-class SVMs.
- The current upper bounds on the covering numbers are suboptimal except in specific cases.
- Although the use of the Rademacher complexity currently provides the sharpest bound, better bounds, adapted to the problem of interest, should result from implementing hybrid approaches.

Guaranteed risks
- These studies highlight the specific character of the multi-class case.
- Model selection should provide a touchstone to assess the different guaranteed risks derived.
Open problems and future work

Bounds on the risk of large margin multi-category classifiers
- Computation of a bound on the universal constant of the Maurey-Carl theorem
- Use of Dudley's method of chaining to improve the VC bound
- Derivation of dedicated PAC-Bayes bounds
- ...

Model selection for M-SVMs
- Assessment of the guaranteed risks and radius-margin bounds to select the value of the soft margin parameter
- Integration, in the applications implementing the M-SVMs, of procedures automatically choosing the values of the hyperparameters
