Multi-Class Support Vector Machines
Yann Guermeur
LORIA - CNRS

http://www.loria.fr/~guermeur
Summer School NN2008
July 4, 2008

Overview

Guaranteed risk for large margin multi-category classifiers
- Theoretical framework
- Basic uniform convergence result
- $\gamma$-$\Psi$-dimensions
- Generalized Sauer-Shelah lemma
- Nature and rate of convergence

Multi-class SVMs
- Multi-category classification with binary SVMs
- Class of functions implemented by the M-SVMs
- General formulation of the training algorithm
- Three main models of M-SVMs
- Some variants of the main models
- Margins and support vectors

Overview

Guaranteed risks for multi-class SVMs
- Bounds on the covering numbers
- Use of the Rademacher complexity

Model selection for multi-class SVMs
- Algorithms fitting the entire regularization path
- Bounds on the leave-one-out cross-validation error

Conclusions and open problems

Guaranteed risk for large margin multi-category classifiers

Theoretical framework

Hypotheses and goals

Characterization of the problem
- Study of the connection between objects $x \in \mathcal{X}$ and their categories $y \in \mathcal{Y} = [\![ 1, Q ]\!]$
- Hypothesis: existence of an $\mathcal{X} \times \mathcal{Y}$-valued random pair $(X, Y)$ distributed according to a probability measure $P$
- Problem: the joint probability measure $P$ is unknown

What is available
- $D_m = ((X_i, Y_i))_{1 \leq i \leq m}$: i.i.d. $m$-sample from $(X, Y)$
- $\mathcal{G}$: class of functions $g$ from $\mathcal{X}$ into $\mathbb{R}^Q$ ($\mathcal{F}$: class of decision rules $f$ from $\mathcal{X}$ into $\mathcal{Y} \cup \{*\}$), with $f(x) = \operatorname{argmax}_{1 \leq k \leq Q} g_k(x)$, or $f(x) = *$ in case of ex aequo

The goal
- Loss function: $\ell(y, g(x)) = \mathbb{1}_{\{g_y(x) \leq \max_{k \neq y} g_k(x)\}}$ ($\ell(y, f(x)) = \mathbb{1}_{\{f(x) \neq y\}}$)
- Selection of a function $g$ minimizing over $\mathcal{G}$ the risk
$$R(g) = \mathbb{E}\left[\ell(Y, g(X))\right] = P(f(X) \neq Y)$$

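Basic uniform convergence result

Multi-class margin and margin risk

Definition 1 (Function $M$). Let $M$ be the function from $\mathbb{R}^Q \times [\![ 1, Q ]\!]$ into $\mathbb{R}$ given by:
$$\forall (v, k) \in \mathbb{R}^Q \times [\![ 1, Q ]\!], \quad M(v, k) = \frac{1}{2}\left(v_k - \max_{l \neq k} v_l\right)$$
$$M(v, \cdot) = \max_{1 \leq k \leq Q} M(v, k)$$

Definition 2 (Multi-class margin of $g$ on the example $(x, y)$).
$$\forall (g, x, y) \in \mathcal{G} \times \mathcal{X} \times \mathcal{Y}, \quad M(g, x, y) = M(g(x), y)$$

Definition 3 (Operators $\Delta$ and $\Delta^*$). For all $g = (g_k)_{1 \leq k \leq Q} \in \mathcal{G}$:
- The function $\Delta g = (\Delta g_k)_{1 \leq k \leq Q}$, from $\mathcal{X}$ into $\mathbb{R}^Q$, is given by:
$$\forall x \in \mathcal{X}, \quad \Delta g(x) = (M(g(x), k))_{1 \leq k \leq Q}$$
- The function $\Delta^* g = (\Delta^* g_k)_{1 \leq k \leq Q}$, from $\mathcal{X}$ into $\mathbb{R}^Q$, is given by:
$$\forall x \in \mathcal{X}, \quad \Delta^* g(x) = \left(\operatorname{sign}(\Delta g_k(x)) \cdot M(g(x), \cdot)\right)_{1 \leq k \leq Q}$$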
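As an illustration of Definitions 1 and 3, the margin function and the two operators can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names `margin`, `delta` and `delta_star` are hypothetical, categories are indexed from 0 rather than from 1, and the second operator follows the reading "sign of each component times the largest margin".

```python
from typing import List, Sequence


def margin(v: Sequence[float], k: int) -> float:
    """M(v, k) = (v_k - max_{l != k} v_l) / 2, with k a 0-based category index."""
    others = [v_l for l, v_l in enumerate(v) if l != k]
    return 0.5 * (v[k] - max(others))


def delta(v: Sequence[float]) -> List[float]:
    """Vector of the margins M(v, k) over all categories k."""
    return [margin(v, k) for k in range(len(v))]


def delta_star(v: Sequence[float]) -> List[float]:
    """Sign of each component of delta(v) times the largest margin M(v, .)."""
    d = delta(v)
    top = max(d)
    return [((d_k > 0) - (d_k < 0)) * top for d_k in d]


v = [2.0, 0.5, -1.0]      # outputs g(x) of a hypothetical 3-category classifier
print(delta(v))           # margins of the three categories
print(delta_star(v))      # same signs, common magnitude
```

At most one component of `delta(v)` is positive, which is why the second operator only keeps the signs and the value of the top margin.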

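Multi-class margin and margin risk

$\Delta^{\#}$ replaces $\Delta$ and $\Delta^*$ in the formulas that hold true for both operators (e.g., $R(g) = \mathbb{E}\left[\mathbb{1}_{\{\Delta^{\#} g_Y(X) \leq 0\}}\right]$).

Definition 4 (Margin risk). Let $\gamma \in \mathbb{R}_+^*$. The risk with margin $\gamma$ of $g$ is defined as:
$$R_\gamma(g) = \mathbb{E}\left[\mathbb{1}_{\{\Delta^{\#} g_Y(X) < \gamma\}}\right] = \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}_{\{\Delta^{\#} g_y(x) < \gamma\}} \, dP(x, y)$$

Empirical risk with margin $\gamma$:
$$R_{\gamma,m}(g) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}_{\{\Delta^{\#} g_{Y_i}(X_i) < \gamma\}}$$

Class of functions of interest: $\Delta^{\#}_\gamma \mathcal{G}$

For $\gamma \in \mathbb{R}_+^*$, let $\pi_\gamma : \mathbb{R} \to [-\gamma, \gamma]$ be the linear squashing function defined as:
$$\pi_\gamma(t) = \operatorname{sign}(t) \cdot \min\{|t|, \gamma\}$$
$$\Delta^{\#}_\gamma g = \left(\Delta^{\#}_\gamma g_k\right)_{1 \leq k \leq Q}, \quad \Delta^{\#}_\gamma g_k = \pi_\gamma \circ \Delta^{\#} g_k, \quad \Delta^{\#}_\gamma \mathcal{G} = \left\{\Delta^{\#}_\gamma g : g \in \mathcal{G}\right\}$$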
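The empirical margin risk and the squashing function can be sketched as follows. The names `squash` and `empirical_margin_risk` are hypothetical, the scores are assumed to be the vectors $g(X_i)$ of a classifier that is not specified here, and labels are 0-based. Only the component of the true category matters, so the plain margin $M(g(x), y)$ is used.

```python
import math
from typing import Sequence


def margin(v: Sequence[float], k: int) -> float:
    """M(v, k) = (v_k - max_{l != k} v_l) / 2, with k a 0-based category index."""
    return 0.5 * (v[k] - max(v_l for l, v_l in enumerate(v) if l != k))


def squash(t: float, gamma: float) -> float:
    """Linear squashing function: sign(t) * min(|t|, gamma)."""
    return math.copysign(min(abs(t), gamma), t)


def empirical_margin_risk(scores, labels, gamma: float) -> float:
    """Fraction of examples whose margin on the true category is below gamma."""
    hits = sum(1 for v, y in zip(scores, labels) if margin(v, y) < gamma)
    return hits / len(labels)


scores = [[2.0, 0.5, -1.0], [0.2, 0.1, 0.0], [0.0, 1.0, 3.0]]
labels = [0, 0, 1]
print(empirical_margin_risk(scores, labels, gamma=0.1))
print(empirical_margin_risk(scores, labels, gamma=0.01))
```

Squashing a margin at level $\gamma$ does not change whether it is below $\gamma$, so the risk is unaffected; the squashing only matters for the capacity of the class.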

Capacity measure of $\Delta^{\#}_\gamma \mathcal{G}$: covering numbers

Figure 1: $\epsilon$-net and $\epsilon$-cover of a set $E'$ in a pseudo-metric space $(E, \rho)$

Definition 5 (Covering numbers).
- $\mathcal{N}(\epsilon, E', \rho)$: minimal number of open balls of radius $\epsilon$ needed to cover $E'$ (or $+\infty$)
- $\mathcal{N}^{(p)}(\epsilon, E', \rho)$: the $\epsilon$-nets considered are included in $E'$ (proper to $E'$)
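For a finite set, a proper cover can be built greedily, which gives an upper bound on the proper covering number (not its exact value). The sketch below assumes a hypothetical helper name `greedy_proper_cover` and a user-supplied distance function.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_proper_cover(points: Sequence[T],
                        dist: Callable[[T, T], float],
                        eps: float) -> List[T]:
    """Pick centers among the points so that every point lies in an open
    ball of radius eps around some center. The number of centers is an
    upper bound on the proper covering number, not the minimum itself."""
    centers: List[T] = []
    for p in points:
        # p becomes a new center only if no existing open ball contains it
        if all(dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers


points = [0, 1, 2, 3, 4]
centers = greedy_proper_cover(points, lambda a, b: abs(a - b), eps=1.5)
print(centers)
```

On this toy set the greedy pass keeps three centers, while two centers (1 and 3) already suffice, which shows why the greedy count is only an upper bound.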

Basic uniform convergence result

Classes of indicator functions

Theorem 1 (Guaranteed risk, Vapnik, 1998). Let $\mathcal{F}$ be a class of indicator functions on a set $\mathcal{X}$. Let $N(\mathcal{F}, (X_i)_{1 \leq i \leq n})$ be the number of different functions (dichotomies) that this class can implement on $(X_i)_{1 \leq i \leq n}$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, the risk of any function $f$ in $\mathcal{F}$ is bounded from above as follows:
$$R(f) \leq R_m(f) + \sqrt{\frac{1}{m}\left(\ln \mathbb{E}\, N(\mathcal{F}, (X_i)_{1 \leq i \leq 2m}) + \ln\frac{4}{\delta}\right)} + \frac{1}{m}.$$

$\ln \mathbb{E}\, N(\mathcal{F}, (X_i)_{1 \leq i \leq 2m})$ is the annealed entropy of $\mathcal{F}$ on the sample $(X_i)_{1 \leq i \leq 2m}$.

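Basic uniform convergence result

Classes of functions $\mathcal{G}$ (taking values in $\mathbb{R}^Q$)

Definition 6 (Pseudo-metric $d_{x^n}$). Let $n \in \mathbb{N}^*$. For a sequence $x^n = (x_i)_{1 \leq i \leq n} \in \mathcal{X}^n$, define the pseudo-metric $d_{x^n}$ on $\mathcal{G}$ as:
$$\forall (g, g') \in \mathcal{G}^2, \quad d_{x^n}(g, g') = \max_{1 \leq i \leq n} \|g(x_i) - g'(x_i)\|_\infty.$$

For $\epsilon \in \mathbb{R}_+^*$, let $\mathcal{N}(\epsilon, \mathcal{G}, n) = \sup_{x^n \in \mathcal{X}^n} \mathcal{N}(\epsilon, \mathcal{G}, d_{x^n})$.

Theorem 2 (Guaranteed risk). Let $\mathcal{G}$ be the class of functions that a large margin $Q$-category classifier on a domain $\mathcal{X}$ can implement. Let $\Gamma \in \mathbb{R}_+^*$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, for every value of $\gamma$ in $(0, \Gamma]$, the risk of any function $g$ in $\mathcal{G}$ is bounded from above by:
$$R(g) \leq R_{\gamma,m}(g) + \sqrt{\frac{2}{m}\left(\ln\left(2\,\mathcal{N}^{(p)}(\gamma/4, \Delta^{\#}_\gamma \mathcal{G}, 2m)\right) + \ln\frac{2\Gamma}{\gamma\delta}\right)} + \frac{1}{m}.$$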

$\gamma$-$\Psi$-dimensions

Growth function

Definition 7 (Growth function, Vapnik & Chervonenkis, 1971). Let $\mathcal{F}$ be a class of indicator functions on a domain $\mathcal{X}$. For $n \in \mathbb{N}^*$, let $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ be a subset of $\mathcal{X}$ of cardinality $n$. Then, the growth function of $\mathcal{F}$, $\Pi_{\mathcal{F}}$, is defined by:
$$\forall n \in \mathbb{N}^*, \quad \Pi_{\mathcal{F}}(n) = \sup_{s_{\mathcal{X}^n} \subset \mathcal{X}} N(\mathcal{F}, s_{\mathcal{X}^n}).$$

Remark 1. Some authors use the alternative definition:
$$\forall n \in \mathbb{N}^*, \quad \Pi_{\mathcal{F}}(n) = \ln \sup_{s_{\mathcal{X}^n} \subset \mathcal{X}} N(\mathcal{F}, s_{\mathcal{X}^n}).$$

Remark 2. In contrast with the annealed entropy, the growth function is distribution-free.

VC dimension

Definition 8 (VC dimension, Vapnik & Chervonenkis, 1971). Let $\mathcal{F}$ be a class of indicator functions on a domain $\mathcal{X}$. A subset $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be shattered by $\mathcal{F}$ if for each vector $v_y$ in $\{-1, 1\}^n$, there is a function $f_y$ in $\mathcal{F}$ satisfying
$$(f_y(x_i))_{1 \leq i \leq n} = v_y.$$
The VC dimension of $\mathcal{F}$, denoted by VC-dim($\mathcal{F}$), is the maximal cardinality of a subset of $\mathcal{X}$ shattered by $\mathcal{F}$, if this cardinality is finite. If no such maximum exists, $\mathcal{F}$ is said to have infinite VC dimension.

Remark 3. VC-dim($\mathcal{F}$) $= d$ if and only if $\Pi_{\mathcal{F}}(d) = 2^d$ and $\Pi_{\mathcal{F}}(d + 1) < 2^{d+1}$.
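For a finite class on a small finite domain, the growth function and the VC dimension can be computed by brute force. The sketch below is illustrative only: the helper names and the two toy classes (thresholds and intervals on five points) are assumptions introduced for the example, and the enumeration is exponential in the domain size.

```python
from itertools import combinations


def dichotomies(functions, points):
    """Set of distinct {-1, 1}-labelings of the points realized by the class."""
    return {tuple(f(x) for x in points) for f in functions}


def growth(functions, domain, n):
    """Largest number of dichotomies over all subsets of cardinality n."""
    return max(len(dichotomies(functions, s)) for s in combinations(domain, n))


def vc_dimension(functions, domain):
    """Largest n such that some subset of cardinality n is shattered."""
    d = 0
    for n in range(1, len(domain) + 1):
        if growth(functions, domain, n) == 2 ** n:
            d = n
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return d


domain = list(range(5))
# Hypothetical toy classes: thresholds x >= t and intervals a <= x <= b
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(6)]
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else -1
             for a in range(5) for b in range(5)]

print(vc_dimension(thresholds, domain))
print(vc_dimension(intervals, domain))
```

The stopping rule relies on the fact stated in Remark 3: once the growth function falls below $2^n$, it stays below for all larger $n$.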

$\Psi$-dimensions

Definition 9 ($\Psi$-dimensions, Ben-David et al., 1995). Let $\mathcal{F}$ be a class of functions on a set $\mathcal{X}$ taking their values in the finite set $[\![ 1, Q ]\!]$. Let $\Psi$ be a family of mappings $\psi$ from $[\![ 1, Q ]\!]$ into $\{-1, 1, *\}$, where $*$ is thought of as a null element. A subset $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be $\Psi$-shattered by $\mathcal{F}$ if there is a mapping $\psi^n = (\psi^{(i)})_{1 \leq i \leq n}$ in $\Psi^n$ such that for each vector $v_y$ in $\{-1, 1\}^n$, there is a function $f_y$ in $\mathcal{F}$ satisfying
$$\left(\psi^{(i)} \circ f_y(x_i)\right)_{1 \leq i \leq n} = v_y.$$
The $\Psi$-dimension of $\mathcal{F}$, denoted by $\Psi$-dim($\mathcal{F}$), is the maximal cardinality of a subset of $\mathcal{X}$ $\Psi$-shattered by $\mathcal{F}$, if this cardinality is finite. If no such maximum exists, $\mathcal{F}$ is said to have infinite $\Psi$-dimension.

Remark 4. Let $\mathcal{F}$ and $\Psi$ be defined as above. Extending the definition of the VC dimension so that it applies to classes of functions taking values in $\{-1, 1, *\}$, which has no consequence in practice, the following proposition holds true:
$$\Psi\text{-dim}(\mathcal{F}) = \text{VC-dim}\left(\{(x, \psi) \mapsto \psi \circ f(x) : f \in \mathcal{F}\}\right).$$

Main examples of $\Psi$-dimensions

Definition 10 (Graph dimension, Dudley, 1987; Natarajan, 1989). Let $\mathcal{F}$ be a class of functions on a set $\mathcal{X}$ taking their values in $[\![ 1, Q ]\!]$. The graph dimension of $\mathcal{F}$, G-dim($\mathcal{F}$), is the $\Psi$-dimension of $\mathcal{F}$ in the specific case where $\Psi = \{\psi_k : 1 \leq k \leq Q\}$, such that $\psi_k$ takes the value $1$ if its argument is equal to $k$ and the value $-1$ otherwise. Reformulated in the context of multi-category classification, the functions $\psi_k$ are the indicator functions of the categories.

Definition 11 (Natarajan dimension, Natarajan, 1989). Let $\mathcal{F}$ be a class of functions on a set $\mathcal{X}$ taking their values in $[\![ 1, Q ]\!]$. The Natarajan dimension of $\mathcal{F}$, N-dim($\mathcal{F}$), is the $\Psi$-dimension of $\mathcal{F}$ in the specific case where $\Psi = \{\psi_{k,l} : 1 \leq k \neq l \leq Q\}$, such that $\psi_{k,l}$ takes the value $1$ if its argument is equal to $k$, the value $-1$ if its argument is equal to $l$, and $*$ otherwise.

Remark 5. The definition of the graph dimension is inspired by the one-against-all decomposition method, whereas the definition of the Natarajan dimension is inspired by the one-against-one decomposition method.
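The graph and Natarajan dimensions of a small finite class can be compared by brute force. The sketch below is an illustration under stated assumptions: functions are represented as dictionaries from points to categories, the null element is represented by `None`, and the helper names as well as the four-function toy class are hypothetical.

```python
from itertools import combinations, permutations, product


def graph_psis(Q):
    """psi_k: 1 on category k, -1 elsewhere."""
    return [lambda c, k=k: 1 if c == k else -1 for k in range(1, Q + 1)]


def natarajan_psis(Q):
    """psi_{k,l}: 1 on k, -1 on l, null (None) elsewhere."""
    return [lambda c, k=k, l=l: 1 if c == k else (-1 if c == l else None)
            for k, l in permutations(range(1, Q + 1), 2)]


def psi_shattered(functions, points, psis):
    """True if some choice of one psi per point yields all 2^n sign vectors."""
    n = len(points)
    for choice in product(psis, repeat=n):
        patterns = {tuple(psi(f[x]) for psi, x in zip(choice, points))
                    for f in functions}
        if sum(1 for p in patterns if None not in p) == 2 ** n:
            return True
    return False


def psi_dimension(functions, domain, psis):
    d = 0
    for n in range(1, len(domain) + 1):
        if any(psi_shattered(functions, s, psis)
               for s in combinations(domain, n)):
            d = n
        else:
            break
    return d


# Hypothetical class of four functions from {a, b} into {1, 2, 3}
F = [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 2, "b": 1}, {"a": 3, "b": 3}]
print(psi_dimension(F, ["a", "b"], graph_psis(3)))
print(psi_dimension(F, ["a", "b"], natarajan_psis(3)))
```

On this toy class the pair of points is shattered in the graph sense (using the indicator of category 1 at both points) but not in the Natarajan sense, so the two dimensions differ.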

Fat-shattering or $P_\gamma$ dimension

Definition 12 (Fat-shattering dimension, Kearns & Schapire, 1994). Let $\mathcal{G}$ be a class of real-valued functions on a set $\mathcal{X}$. For $\gamma \in \mathbb{R}_+^*$, a subset $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be $\gamma$-shattered by $\mathcal{G}$ if there is a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each vector $v_y = (y_i)$ in $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying
$$\forall i \in [\![ 1, n ]\!], \quad y_i \left(g_y(x_i) - b_i\right) \geq \gamma.$$
The fat-shattering dimension with margin $\gamma$, or $P_\gamma$ dimension, of the class $\mathcal{G}$, $P_\gamma$-dim($\mathcal{G}$), is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-shattered by $\mathcal{G}$, if this cardinality is finite. If no such maximum exists, $\mathcal{G}$ is said to have infinite $P_\gamma$ dimension.

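$\gamma$-$\Psi$-dimensions

Let $\wedge$ denote the conjunction of two events.

Definition 13 ($\gamma$-$\Psi$-dimensions). Let $\mathcal{G}$ be a class of functions on a set $\mathcal{X}$ taking their values in $\mathbb{R}^Q$. Let $\Psi$ be a family of mappings $\psi$ from $[\![ 1, Q ]\!]$ into $\{-1, 1, *\}$. For $\gamma \in \mathbb{R}_+^*$, a subset $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be $\gamma$-$\Psi$-shattered ($\Psi$-shattered with margin $\gamma$) by $\Delta^{\#} \mathcal{G}$ if there is a mapping $\psi^n = (\psi^{(i)})_{1 \leq i \leq n}$ in $\Psi^n$ and a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each vector $v_y = (y_i)$ in $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying
$$\forall i \in [\![ 1, n ]\!], \quad \begin{cases} \text{if } y_i = 1, & \exists k : \psi^{(i)}(k) = 1 \wedge \Delta^{\#} g_{y,k}(x_i) - b_i \geq \gamma \\ \text{if } y_i = -1, & \exists l : \psi^{(i)}(l) = -1 \wedge \Delta^{\#} g_{y,l}(x_i) + b_i \geq \gamma \end{cases}.$$
The $\gamma$-$\Psi$-dimension, or $\Psi$-dimension with margin $\gamma$, of $\Delta^{\#} \mathcal{G}$, denoted by $\Psi$-dim($\Delta^{\#} \mathcal{G}, \gamma$), is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-$\Psi$-shattered by $\Delta^{\#} \mathcal{G}$, if this cardinality is finite. If no such maximum exists, $\Delta^{\#} \mathcal{G}$ is said to have infinite $\gamma$-$\Psi$-dimension.

This definition reduces to that of the fat-shattering dimension when $Q = 2$.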

Natarajan dimension with margin $\gamma$

Definition 14 (Natarajan dimension with margin $\gamma$). Let $\mathcal{G}$ be a class of functions on a set $\mathcal{X}$ taking their values in $\mathbb{R}^Q$. For $\gamma \in \mathbb{R}_+^*$, a subset $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be $\gamma$-N-shattered (N-shattered with margin $\gamma$) by $\Delta^{\#} \mathcal{G}$ if there is a set
$$I(s_{\mathcal{X}^n}) = \{(i_1(x_i), i_2(x_i)) : 1 \leq i \leq n\}$$
of $n$ couples of distinct indexes in $[\![ 1, Q ]\!]$ and a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each vector $v_y = (y_i)$ in $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying
$$\forall i \in [\![ 1, n ]\!], \quad \begin{cases} \text{if } y_i = 1, & \Delta^{\#} g_{y,i_1(x_i)}(x_i) - b_i \geq \gamma \\ \text{if } y_i = -1, & \Delta^{\#} g_{y,i_2(x_i)}(x_i) + b_i \geq \gamma \end{cases}.$$
The Natarajan dimension with margin $\gamma$ of the class $\Delta^{\#} \mathcal{G}$, N-dim($\Delta^{\#} \mathcal{G}, \gamma$), is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-N-shattered by $\Delta^{\#} \mathcal{G}$, if this cardinality is finite. If no such maximum exists, $\Delta^{\#} \mathcal{G}$ is said to have infinite Natarajan dimension with margin $\gamma$.

Generalized Sauer-Shelah lemma

Sauer-Shelah lemma (classes of indicator functions)

Lemma 1 (Vapnik & Chervonenkis, 1971; Sauer, 1972; Shelah, 1972). Let $\mathcal{F}$ be a class of indicator functions on a set $\mathcal{X}$ and let $\Pi_{\mathcal{F}}$ be its growth function. If its VC dimension $d$ is finite, then for $n \geq d$,
$$\Pi_{\mathcal{F}}(n) \leq \sum_{i=0}^{d} C_n^i < \left(\frac{en}{d}\right)^d$$
where $e$ is the base of the natural logarithm.
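The two quantities in the lemma are easy to compare numerically. The following sketch (function names are hypothetical) evaluates the binomial sum and the closed-form bound for a range of values of $n$ and $d$.

```python
import math


def sauer_sum(n: int, d: int) -> int:
    """Sum of the binomial coefficients C(n, i) for i = 0..d."""
    return sum(math.comb(n, i) for i in range(d + 1))


def sauer_bound(n: int, d: int) -> float:
    """Closed-form upper bound (e * n / d) ** d."""
    return (math.e * n / d) ** d


for n, d in [(10, 3), (100, 3), (100, 10)]:
    print(n, d, sauer_sum(n, d), round(sauer_bound(n, d), 1))
```

The point of the lemma is visible in the numbers: for fixed $d$, both quantities grow polynomially in $n$, whereas the number of all labelings, $2^n$, grows exponentially.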

Generalized Sauer-Shelah lemma

Classes of functions from $\mathcal{X}$ into $[\![ 1, Q ]\!]$

Lemma 2 (Haussler & Long, 1995). Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ into $[\![ 1, Q ]\!]$ and let $\Pi_{\mathcal{F}}$ be its growth function. If its Natarajan dimension $d$ is finite, then for $n \geq d$,
$$\Pi_{\mathcal{F}}(n) \leq \sum_{i=0}^{d} C_n^i \left(C_{Q+1}^2\right)^i < \left(\frac{(Q + 1)^2 e n}{2d}\right)^d$$

Generalized Sauer-Shelah lemma

Classes of real-valued functions

Lemma 3 (Alon et al., 1997). Let $\mathcal{G}$ be a class of functions from $\mathcal{X}$ into $[0, 1]$. For every value of $\epsilon$ in $(0, 1]$ and every integer value of $n$ satisfying $n \geq P_{\epsilon/4}$-dim($\mathcal{G}$), the following bound is true:
$$\mathcal{N}(\epsilon, \mathcal{G}, n) < 2 \left(\frac{4n}{\epsilon^2}\right)^{d \log_2\left(2en/(d\epsilon)\right)}$$
where $d = P_{\epsilon/4}$-dim($\mathcal{G}$).

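Generalized Sauer-Shelah lemma

Classes of functions from $\mathcal{X}$ into $\mathbb{R}^Q$

Lemma 4. Let $\mathcal{G}$ be a class of functions from $\mathcal{X}$ into $[-M_{\mathcal{G}}, M_{\mathcal{G}}]^Q$. For every value of $\epsilon$ in $(0, M_{\mathcal{G}}]$ and every integer value of $n$ satisfying $n \geq$ N-dim($\Delta \mathcal{G}, \epsilon/6$), the following bound is true:
$$\mathcal{N}^{(p)}(\epsilon, \Delta^* \mathcal{G}, n) < 2 \left(n\, Q^2 (Q - 1) \left\lfloor \frac{3 M_{\mathcal{G}}}{\epsilon} \right\rfloor^2\right)^{\left\lceil d \log_2\left(e n C_Q^2 \left(2 \left\lfloor \frac{3 M_{\mathcal{G}}}{\epsilon} \right\rfloor - 1\right) / d\right) \right\rceil}$$
where $d =$ N-dim($\Delta \mathcal{G}, \epsilon/6$).

The proof does not hold true anymore if the operator $\Delta^*$ is replaced with the operator $\Delta$.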

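Nature and rate of convergence

Theorem 3. Let $\mathcal{G}$ be the class of functions from $\mathcal{X}$ into $[-M_{\mathcal{G}}, M_{\mathcal{G}}]^Q$ that a large margin $Q$-category classifier can implement. Let $\delta \in (0, 1)$. With probability at least $1 - \delta$, uniformly for every value of $\gamma$ in $(0, M_{\mathcal{G}}]$, the risk of any function $g$ in $\mathcal{G}$ is bounded from above by:
$$R(g) \leq R_{\gamma,m}(g) + \sqrt{\frac{2}{m}\left(\ln\left(4\left(2m\, Q^2 (Q - 1) \left\lfloor \frac{12 M_{\mathcal{G}}}{\gamma} \right\rfloor^2\right)^{\left\lceil d \log_2\left(e m Q (Q - 1) \left(2 \left\lfloor \frac{12 M_{\mathcal{G}}}{\gamma} \right\rfloor - 1\right) / d\right) \right\rceil}\right) + \ln\frac{2 M_{\mathcal{G}}}{\gamma \delta}\right)} + \frac{1}{m}$$
where $d =$ N-dim($\Delta \mathcal{G}, \gamma/24$).

In short:
$$R(g) \leq R_{\gamma,m}(g) + c \ln(m) \sqrt{\frac{d}{m}}$$

Proposition 1 (Almost sure uniform convergences).
$$\lim_{m \to +\infty} \sup_P P\left(\sup_{n \geq m} \sup_{g \in \mathcal{G}} \left(R(g) - R_{\gamma,n}(g)\right) > \epsilon\right) = 0$$
$$\lim_{m \to +\infty} \sup_P P\left(\sup_{n \geq m} \sup_{g \in \mathcal{G}} \left|R_\gamma(g) - R_{\gamma,n}(g)\right| > \epsilon\right) = 0$$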

Multi-class SVMs

Multi-category classification with binary SVMs

One-against-all method (Rifkin & Klautau, 2004)
- $Q$ SVMs: the $k$-th one distinguishes category $k$ from the $Q - 1$ other ones
- Decision rule: winner-takes-all

One-against-one method/pairwise classification (Fürnkranz, 2002)
- $C_Q^2$ SVMs: one for each pair of classes
- Decision rule: max-wins voting

Use of error correcting output codes (ECOC) (Allwein et al., 2000)
- $M = (m_{kl}) \in \mathcal{M}_{Q,N}(\{-1, 0, 1\})$: coding matrix
- $N$ SVMs: one for each of the dichotomies defined by the columns of $M$
- Decision rule: computation of a loss function
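The first two decision rules can be sketched as follows, given the real-valued outputs of binary classifiers that are assumed to be already trained. The function names, the 0-based category indices, the convention that a positive pairwise score favors the first category of the pair, and the tie-breaking by lowest index are all assumptions made for this illustration.

```python
from typing import Dict, Sequence, Tuple


def winner_takes_all(scores: Sequence[float]) -> int:
    """One-against-all: pick the category whose binary SVM output is largest."""
    return max(range(len(scores)), key=lambda k: scores[k])


def max_wins(pairwise: Dict[Tuple[int, int], float], Q: int) -> int:
    """One-against-one: each pair (k, l), k < l, votes for k if its score is
    positive and for l otherwise; the category with most votes wins."""
    votes = [0] * Q
    for (k, l), score in pairwise.items():
        votes[k if score > 0 else l] += 1
    return max(range(Q), key=lambda k: votes[k])


# Hypothetical outputs for one example and Q = 3 categories
print(winner_takes_all([0.2, 1.3, -0.5]))
print(max_wins({(0, 1): -0.4, (0, 2): 0.7, (1, 2): 1.1}, Q=3))
```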

Class of functions implemented by the M-SVMs

Reproducing kernel Hilbert space

Let $\mathcal{X}$ be a space and $(H, \langle \cdot, \cdot \rangle_H)$ a Hilbert space of functions on $\mathcal{X}$ ($H \subset \mathbb{R}^{\mathcal{X}}$).

Definition 15 (Reproducing kernel, Aronszajn, 1950). Let $\kappa$ be a function from $\mathcal{X}^2$ into $\mathbb{R}$. For all $x \in \mathcal{X}$, let $\kappa_x$ be the function from $\mathcal{X}$ into $\mathbb{R}$ given by $\kappa_x : t \mapsto \kappa(x, t)$. $\kappa$ is a reproducing kernel of $H$ if and only if:
1. $\forall x \in \mathcal{X}$, $\kappa_x \in H$;
2. $\forall x \in \mathcal{X}$, $\forall h \in H$, $\langle h, \kappa_x \rangle_H = h(x)$ (reproducing property).

Definition 16 (Reproducing kernel Hilbert space). If $H$ possesses a reproducing kernel, it is called a reproducing kernel Hilbert space (RKHS) or a proper Hilbert space.

Positive semidefinite kernel and RKHS

Definition 17 (Positive semidefinite (positive type) kernel). A function $\kappa$ from $\mathcal{X}^2$ into $\mathbb{R}$ is called a positive semidefinite kernel (or a positive type kernel) if
$$\forall n \in \mathbb{N}^*, \ \forall (a_i)_{1 \leq i \leq n} \in \mathbb{R}^n, \ \forall (x_i)_{1 \leq i \leq n} \in \mathcal{X}^n, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \kappa(x_i, x_j) \geq 0.$$

Theorem 4 (Moore-Aronszajn). Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}^2$. There exists only one Hilbert space $(H, \langle \cdot, \cdot \rangle_H)$ of functions on $\mathcal{X}$ with $\kappa$ as reproducing kernel.
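The quadratic form in Definition 17 can be spot-checked numerically. The sketch below evaluates it for random coefficient vectors, once with a Gaussian kernel (a standard positive semidefinite kernel) and once with a function that is not positive semidefinite. This is a sanity check on a finite sample, not a proof; the function names and the sample points are hypothetical.

```python
import math
import random


def gaussian_kernel(x: float, y: float, sigma: float = 1.0) -> float:
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def quadratic_form(kernel, xs, a) -> float:
    """sum_i sum_j a_i a_j kernel(x_i, x_j)."""
    return sum(a_i * a_j * kernel(x_i, x_j)
               for a_i, x_i in zip(a, xs) for a_j, x_j in zip(a, xs))


def min_quadratic_form(kernel, xs, trials: int = 200, seed: int = 0) -> float:
    """Smallest value of the quadratic form over random coefficient vectors."""
    rng = random.Random(seed)
    return min(quadratic_form(kernel, xs, [rng.uniform(-1, 1) for _ in xs])
               for _ in range(trials))


xs = [0.0, 0.5, 1.0, 2.5]
print(min_quadratic_form(gaussian_kernel, xs))              # never negative
print(quadratic_form(lambda x, y: -abs(x - y), [0.0, 1.0], [1.0, 1.0]))
```

The second call shows a violation of the condition: with two points and equal coefficients, the function $-|x - x'|$ yields a negative value, so it is not a positive semidefinite kernel.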

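Building an M-SVM starting from a kernel

Basic class of functions

Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}^2$ and let $(H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa})$ be the corresponding RKHS.

Let $\bar{\mathcal{H}} = (H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa})^Q$ and $\mathcal{H} = \left((H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa}) + \{1\}\right)^Q$.

$\mathcal{H}$: class of functions $h = (h_k)_{1 \leq k \leq Q}$ from $\mathcal{X}$ into $\mathbb{R}^Q$ such that:
$$h(\cdot) = \left(\sum_{i=1}^{m_k} \beta_{ik} \kappa(x_{ik}, \cdot) + b_k\right)_{1 \leq k \leq Q}$$
with $\{x_{ik} : 1 \leq i \leq m_k\} \subset \mathcal{X}$, $(\beta_{ik})_{1 \leq i \leq m_k} \in \mathbb{R}^{m_k}$ and $b_k \in \mathbb{R}$, as well as the limits of these functions when the sets $\{x_{ik} : 1 \leq i \leq m_k\}$ become dense in $\mathcal{X}$ in the norm induced by the kernel.

Class of functions implemented

Convex subset of $\mathcal{H}$ (defined by constraints on an affine subspace)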

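Basic class of functions

An affine model in the feature space

Theorem 5 (Mercer's theorem). For every Mercer kernel $\kappa$, there exists a map $\Phi$ such that:
$$\forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle$$
where $\langle \cdot, \cdot \rangle$ is the dot product of the $\ell_2$ space.

$\Phi$ is called a feature map. Let $\Phi(\mathcal{X}) = \{\Phi(x) : x \in \mathcal{X}\}$. A feature space is any of the Hilbert spaces $\left(E_{\Phi(\mathcal{X})}, \langle \cdot, \cdot \rangle\right)$ spanned by the $\Phi(\mathcal{X})$.

$\mathcal{H}$ can be seen as a class of multivariate affine functions on $\Phi(\mathcal{X})$:
$$h(\cdot) = \left(\langle w_k, \cdot \rangle + b_k\right)_{1 \leq k \leq Q}$$
$$w = (w_k)_{1 \leq k \leq Q} \in E^Q_{\Phi(\mathcal{X})}, \quad b = (b_k)_{1 \leq k \leq Q} \in \mathbb{R}^Q$$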

Basic class of functions

Putting things the other way round: the kernel trick

Norms on $\bar{\mathcal{H}}$ and $E^Q_{\Phi(\mathcal{X})}$
$$\forall \bar{h} \in \bar{\mathcal{H}}, \quad \|\bar{h}\|_{\bar{\mathcal{H}}} = \sqrt{\sum_{k=1}^{Q} \|\bar{h}_k\|^2_{H_\kappa}} = \sqrt{\sum_{k=1}^{Q} \langle w_k, w_k \rangle} = \sqrt{\sum_{k=1}^{Q} \|w_k\|^2} = \|w\|$$
$$\|w\|_\infty = \max_{1 \leq k \leq Q} \|w_k\|$$

General formulation of the training algorithm

$Q \geq 3$: multi-class support vector machines

- $((x_i, y_i))_{1 \leq i \leq m} \in (\mathcal{X} \times [\![ 1, Q ]\!])^m$: training set
- $\ell_{\text{M-SVM}}$: convex loss function (built around the hinge loss)

M-SVM: solution of a convex (quadratic) programming problem

Problem 1.
$$\min_{h \in \mathcal{H}} \left\{\sum_{i=1}^{m} \ell_{\text{M-SVM}}(y_i, h(x_i)) + \lambda \|\bar{h}\|^2_{\bar{\mathcal{H}}}\right\} \quad \text{s.t.} \quad \sum_{k=1}^{Q} h_k = 0$$

Representer theorem

This theorem states that training (solving Problem 1) amounts to finding the values of the coefficients $\beta_{ik}$ in
$$h(\cdot) = \left(\sum_{i=1}^{m} \beta_{ik} \kappa(x_i, \cdot) + b_k\right)_{1 \leq k \leq Q}$$
(the values of the biases $b_k$ are deduced by application of the Kuhn-Tucker conditions).

A general framework that encompasses the bi-class case

- $((x_i, y_i))_{1 \leq i \leq m} \in (\mathcal{X} \times \{-1, 1\})^m$: training set
- $h = (h_1, h_2) = (h_1, -h_1)$, $\tilde{h}(x) = h_1(x) = \Delta^{\#} h_1(x) = \frac{1}{2}\left(\langle w_1 - w_2, \Phi(x) \rangle + b_1 - b_2\right)$
- $\ell_{\text{SVM}}(y, \tilde{h}(x)) = \left(1 - y \tilde{h}(x)\right)_+$ (hinge loss)

SVM: solution of a convex (quadratic) programming problem

Problem 2.
$$\min_{\tilde{h}} \left\{\sum_{i=1}^{m} \ell_{\text{SVM}}\left(y_i, \tilde{h}(x_i)\right) + \lambda \|\tilde{h}\|^2_{H_\kappa}\right\}$$

Representer theorem

This theorem states that training (solving Problem 2) amounts to finding the values of the coefficients $\beta_i$ in
$$\tilde{h}(\cdot) = \sum_{i=1}^{m} \beta_i \kappa(x_i, \cdot) + b$$
(the value of the bias $b$ is deduced by application of the Kuhn-Tucker conditions).

Hard margin M-SVMs and geometrical margins

Geometrical margins
$$d_{\text{M-SVM}} = \min_{1 \leq k < l \leq Q} \min\left\{\min_{i : y_i = k} \left(h_k(x_i) - h_l(x_i)\right), \ \min_{j : y_j = l} \left(h_l(x_j) - h_k(x_j)\right)\right\}$$
For all $(k, l)$, $1 \leq k < l \leq Q$:
$$d_{\text{M-SVM},kl} = \frac{1}{d_{\text{M-SVM}}} \min\left\{\min_{i : y_i = k} \left(h_k(x_i) - h_l(x_i) - d_{\text{M-SVM}}\right), \ \min_{j : y_j = l} \left(h_l(x_j) - h_k(x_j) - d_{\text{M-SVM}}\right)\right\}$$
$$\gamma_{kl} = d_{\text{M-SVM}} \, \frac{1 + d_{\text{M-SVM},kl}}{\|w_k - w_l\|}$$

Connection between the penalizer and the geometrical margins
$$\sum_{k<l} \|w_k - w_l\|^2 = Q \sum_{k=1}^{Q} \|w_k\|^2 - \left\|\sum_{k=1}^{Q} w_k\right\|^2$$
$$\sum_{k=1}^{Q} w_k = 0 \implies \sum_{k=1}^{Q} \|w_k\|^2 = \frac{d^2_{\text{M-SVM}}}{Q} \sum_{k<l} \left(\frac{1 + d_{\text{M-SVM},kl}}{\gamma_{kl}}\right)^2$$
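The algebraic identity linking the sum of pairwise squared distances to the penalizer can be checked numerically. The sketch below draws random vectors standing in for the $w_k$, verifies the general identity, then centers the vectors so that they sum to zero and verifies the special case. All names are hypothetical and the vectors are plain Python lists.

```python
import random


def sq_norm(u):
    return sum(c * c for c in u)


def pairwise_sum(ws):
    """Sum over k < l of ||w_k - w_l||^2."""
    return sum(sq_norm([a - b for a, b in zip(ws[k], ws[l])])
               for k in range(len(ws)) for l in range(k + 1, len(ws)))


def center(ws):
    """Subtract the mean vector so that the w_k sum to zero."""
    mean = [sum(col) / len(ws) for col in zip(*ws)]
    return [[c - m for c, m in zip(w, mean)] for w in ws]


rng = random.Random(0)
Q, dim = 4, 3
ws = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(Q)]

total = [sum(col) for col in zip(*ws)]
lhs = pairwise_sum(ws)
rhs = Q * sum(sq_norm(w) for w in ws) - sq_norm(total)
print(abs(lhs - rhs) < 1e-9)      # general identity

wc = center(ws)
print(abs(pairwise_sum(wc) - Q * sum(sq_norm(w) for w in wc)) < 1e-9)
```

Under the sum-to-zero constraint, minimizing the penalizer is therefore the same as minimizing the sum of the squared norms of the pairwise differences, which is what ties it to the geometrical margins.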

Three main models of M-SVMs

M-SVM of Weston and Watkins

Training algorithm - primal formulation

Problem 3 (M-SVM1, Vapnik & Blanz, 1998; Weston & Watkins, 1998; ...).
$$\min_{h \in \mathcal{H}} \left\{\frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \sum_{k \neq y_i} \xi_{ik}\right\}$$
$$\text{s.t.} \quad \begin{cases} \langle w_{y_i} - w_k, \Phi(x_i) \rangle + b_{y_i} - b_k \geq 1 - \xi_{ik}, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \xi_{ik} \geq 0, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \end{cases}$$

Remark 6. The constraint $\sum_{k=1}^{Q} h_k = 0$ is implicit.

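M-SVM of Weston and Watkins

Training algorithm - dual formulation

$\alpha_{ik}$: Lagrange multiplier corresponding to the constraint $\langle w_{y_i} - w_k, \Phi(x_i) \rangle + b_{y_i} - b_k \geq 1 - \xi_{ik}$

$\alpha = (\alpha_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$, $(\alpha_{i y_i})_{1 \leq i \leq m} = 0$

Problem 4 (M-SVM1).
$$\min_{\alpha} \left\{\frac{1}{2} \alpha^T H_{WW} \alpha - 1_{Qm}^T \alpha\right\}$$
$$\text{s.t.} \quad \begin{cases} 0 \leq \alpha_{ik} \leq C, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \sum_{i : y_i = k} \sum_{l=1}^{Q} \alpha_{il} - \sum_{i=1}^{m} \alpha_{ik} = 0, & (1 \leq k \leq Q - 1) \end{cases}$$
$$H_{WW} = \left(\left(\delta_{y_i, y_j} - \delta_{y_i, l} - \delta_{y_j, k} + \delta_{k, l}\right) \kappa(x_i, x_j)\right)_{1 \leq i, j \leq m, \ 1 \leq k, l \leq Q}$$
$$w_k = \sum_{i : y_i = k} \sum_{l=1}^{Q} \alpha_{il} \Phi(x_i) - \sum_{i=1}^{m} \alpha_{ik} \Phi(x_i) = \sum_{i=1}^{m} \sum_{l=1}^{Q} \left(\delta_{y_i, k} - \delta_{k, l}\right) \alpha_{il} \Phi(x_i)$$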

M-SVM of Crammer and Singer

Training algorithm - primal formulation

Problem 5 (M-SVM2, Crammer & Singer, 2001).
$$\min_{\bar{h} \in \bar{\mathcal{H}}} \left\{\frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \xi_i\right\}$$
$$\text{s.t.} \quad \langle w_{y_i} - w_k, \Phi(x_i) \rangle + \delta_{y_i, k} \geq 1 - \xi_i, \quad (1 \leq i \leq m), (1 \leq k \leq Q)$$

Remark 7. The constraint $\sum_{k=1}^{Q} \bar{h}_k = 0$ is implicit.

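M-SVM of Crammer and Singer

Training algorithm - dual formulation

$\alpha_{ik}$: Lagrange multiplier corresponding to the constraint $\langle w_{y_i} - w_k, \Phi(x_i) \rangle + \delta_{y_i, k} \geq 1 - \xi_i$

$\alpha = (\alpha_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$, $\delta = (\delta_{y_i, k})_{1 \leq i \leq m, 1 \leq k \leq Q}$

Problem 6 (M-SVM2).
$$\min_{\alpha} \left\{\frac{1}{2} \alpha^T H_{WW} \alpha + \delta^T \alpha\right\}$$
$$\text{s.t.} \quad \begin{cases} \alpha_{ik} \geq 0, & (1 \leq i \leq m), (1 \leq k \leq Q) \\ \sum_{k=1}^{Q} \alpha_{ik} = C, & (1 \leq i \leq m) \end{cases}$$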

M-SVM of Lee, Lin and Wahba

Training algorithm - primal formulation

Problem 7 (M-SVM3, Lee et al., 2004).
$$\min_{h \in \mathcal{H}} \left\{\frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \sum_{k \neq y_i} \xi_{ik}\right\}$$
$$\text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \leq -\frac{1}{Q - 1} + \xi_{ik}, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \xi_{ik} \geq 0, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \sum_{k=1}^{Q} w_k = 0, \quad \sum_{k=1}^{Q} b_k = 0 \end{cases}$$

Result of consistency (Zhang, 2004; Tewari & Bartlett, 2007)

Among the three main models, this M-SVM is the only one for which training is Bayes/Fisher consistent.
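For a fixed vector of outputs $h(x)$ and a label $y$, the smallest slack values allowed by the constraints of Problems 3, 5 and 7 give the three loss functions in closed form. The sketch below writes them out; the function names are hypothetical, labels are 0-based, and the Crammer and Singer loss is evaluated on outputs with the biases already included, which is a simplification since that model has no bias term.

```python
from typing import Sequence


def hinge(t: float) -> float:
    return max(0.0, t)


def loss_ww(h: Sequence[float], y: int) -> float:
    """Weston and Watkins: sum over k != y of (1 - h_y + h_k)_+."""
    return sum(hinge(1.0 - h[y] + h[k]) for k in range(len(h)) if k != y)


def loss_cs(h: Sequence[float], y: int) -> float:
    """Crammer and Singer: max over k != y of (1 - h_y + h_k)_+."""
    return max(hinge(1.0 - h[y] + h[k]) for k in range(len(h)) if k != y)


def loss_llw(h: Sequence[float], y: int) -> float:
    """Lee, Lin and Wahba: sum over k != y of (h_k + 1/(Q-1))_+."""
    Q = len(h)
    return sum(hinge(h[k] + 1.0 / (Q - 1)) for k in range(Q) if k != y)


h = [0.2, 0.1, -0.3]   # hypothetical outputs summing to zero, Q = 3
print(loss_ww(h, 0), loss_cs(h, 0), loss_llw(h, 0))
```

The first two losses depend on the differences $h_y - h_k$, whereas the third only looks at the components of the wrong categories and relies on the sum-to-zero constraint to control $h_y$.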

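M-SVM of Lee, Lin and Wahba

Training algorithm - dual formulation

$\alpha_{ik}$: Lagrange multiplier corresponding to the constraint $\langle w_k, \Phi(x_i) \rangle + b_k \leq -\frac{1}{Q - 1} + \xi_{ik}$

$\alpha = (\alpha_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$, $(\alpha_{i y_i})_{1 \leq i \leq m} = 0$

Problem 8 (M-SVM3).
$$\min_{\alpha} \left\{\frac{1}{2} \alpha^T H_{LLW} \alpha - \frac{1}{Q - 1} 1_{Qm}^T \alpha\right\}$$
$$\text{s.t.} \quad \begin{cases} 0 \leq \alpha_{ik} \leq C, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \sum_{i=1}^{m} \sum_{l=1}^{Q} \left(\frac{1}{Q} - \delta_{k, l}\right) \alpha_{il} = 0, & (1 \leq k \leq Q - 1) \end{cases}$$
$$H_{LLW} = \left(\left(\delta_{k, l} - \frac{1}{Q}\right) \kappa(x_i, x_j)\right)_{1 \leq i, j \leq m, \ 1 \leq k, l \leq Q}$$
$$w_k = \sum_{i=1}^{m} \sum_{l=1}^{Q} \left(\frac{1}{Q} - \delta_{k, l}\right) \alpha_{il} \Phi(x_i)$$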

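Some variants of the main models

Use of different norms on $w$

Problem 9 ($\ell_\infty$-norm M-SVM).
$$\min_{h \in \mathcal{H}} \left\{\frac{1}{2} t^2 + C \sum_{i=1}^{m} \sum_{k \neq y_i} \xi_{ik}\right\}$$
$$\text{s.t.} \quad \begin{cases} \langle w_{y_i} - w_k, \Phi(x_i) \rangle + b_{y_i} - b_k \geq 1 - \xi_{ik}, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \xi_{ik} \geq 0, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \|w_k\| \leq t, & (1 \leq k \leq Q) \end{cases}$$

$\ell_1$-norm M-SVM (Wang et al., 2006)

$\kappa(x, x') = x^T x'$ ($\Phi = \text{Id}$)

Problem 10 ($\ell_1$-norm M-SVM).
$$\min_{h \in \mathcal{H}} \sum_{i=1}^{m} \ell_{\text{M-SVM}}(y_i, h(x_i)) \quad \text{s.t.} \quad \begin{cases} \sum_{k=1}^{Q} \|w_k\|_1 \leq K \\ \sum_{k=1}^{Q} h_k = 0 \end{cases}$$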

Use of a different norm on $\xi$: quadratic loss M-SVMs

Definition 18 (Quadratic loss M-SVM). A quadratic loss M-SVM is an M-SVM for which the empirical term of the objective function, $\|\xi\|_1$, is replaced by a quadratic form, $\xi^T M \xi$, where $M$ is a symmetric positive semidefinite matrix.

Definition 19 (M-SVM$^2$). Variant of the M-SVM of Lee, Lin and Wahba corresponding to
$$M = \left(\left(\delta_{k, l} - \frac{1}{Q}\right) \delta_{i, j}\right)_{1 \leq i, j \leq m, \ 1 \leq k, l \leq Q}$$

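Training algorithm of the M-SVM$^2$

Primal formulation

Problem 11 (M-SVM$^2$).
$$\min_{h \in \mathcal{H}} \left\{\frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \xi^T M \xi\right\}$$
$$\text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \leq -\frac{1}{Q - 1} + \xi_{ik}, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \sum_{k=1}^{Q} w_k = 0, \quad \sum_{k=1}^{Q} b_k = 0 \end{cases}$$

Dual formulation

Problem 12 (M-SVM$^2$).
$$\min_{\alpha} \left\{\frac{1}{2} \alpha^T \left(H_{LLW} + \frac{1}{2C} M\right) \alpha - \frac{1}{Q - 1} 1_{Qm}^T \alpha\right\}$$
$$\text{s.t.} \quad \begin{cases} \alpha_{ik} \geq 0, & (1 \leq i \leq m), (1 \leq k \neq y_i \leq Q) \\ \sum_{i=1}^{m} \sum_{l=1}^{Q} \left(\frac{1}{Q} - \delta_{k, l}\right) \alpha_{il} = 0, & (1 \leq k \leq Q - 1) \end{cases}$$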

Margins and support vectors

Margins and support vectors of an M-SVM

Figure 2: 3 categories linearly separable in $\mathbb{R}^2$ (axes: x_1, x_2)

Margins and support vectors of an M-SVM

Figure 3: Separating hyperplanes and soft margins of a linear M-SVM1

Figure 4: 3 categories non-linearly separable in $\mathbb{R}^3$ (axes: x_1, x_2, x_3; legend: C_1 / C_2, C_2 / C_3)

Figure 5: Separating hyperplanes and support vectors of a linear M-SVM1 (axes: x_1, x_2, x_3; legend: C_1 / C_2, C_1 / C_3, C_2 / C_3)

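Guaranteed risks for multi-class SVMs

Bounds on the covering numbers

Margin Natarajan dimension of the multi-class SVMs

Theorem 6. Let $\bar{\mathcal{H}}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\|w\|_\infty \leq \Lambda_w$ and that $b = 0$. Then, for all $\epsilon \in \mathbb{R}_+^*$,
$$\text{N-dim}\left(\Delta \bar{\mathcal{H}}, \epsilon\right) \leq C_Q^2 \left(\frac{\Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\epsilon}\right)^2.$$

The proof
- does not hold true anymore if the operator $\Delta$ is replaced by the operator $\Delta^*$;
- calls for the use of the $\ell_\infty$-norm instead of the $\ell_2$-norm (used by the penalizer);
- rests directly on the one-against-one decomposition scheme.

$Q = 2$:
$$P_\epsilon\text{-dim}(\tilde{\mathcal{H}}) \leq \left(\frac{\Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\epsilon}\right)^2$$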

From covering numbers to entropy numbers

Definition 20 (Entropy numbers of a set). Let $(E, \rho)$ be a pseudo-metric space (or $(E, \|\cdot\|_E)$ a Banach space) and $E'$ a bounded subset of $E$. Then, for $n \in \mathbb{N}^*$, the $n$-th entropy number of $E'$, $\epsilon_n(E')$, is:
$$\epsilon_n(E') = \inf\left\{\epsilon > 0 : \mathcal{N}(\epsilon, E', \rho) \leq n\right\}.$$

Definition 21 (Entropy numbers of a bounded linear operator). Let $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$ be two Banach spaces. Let $L(E, F)$ denote the Banach space of all (bounded linear) operators from $(E, \|\cdot\|_E)$ into $(F, \|\cdot\|_F)$ endowed with the norm:
$$\forall S \in L(E, F), \quad \|S\| = \sup_{e \in E : \|e\|_E = 1} \|S(e)\|_F.$$
The $n$-th entropy number of $S$ is defined as
$$\epsilon_n(S) = \epsilon_n(S(U_E)),$$
where $U_E$ is the unit ball of $E$.

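From covering numbers to entropy numbers

Definition 22 (Evaluation operator). For $n \in \mathbb{N}^*$, let $x^n \in \mathcal{X}^n$. The evaluation operator $S_{x^n}$ on $\bar{\mathcal{H}}$ is defined as:
$$S_{x^n} : \bar{\mathcal{H}} \to \ell_\infty^{Qn}, \quad \bar{h} = (\langle w_k, \cdot \rangle)_{1 \leq k \leq Q} \mapsto S_{x^n}(\bar{h}) = \left(\langle w_k, \Phi(x_i) \rangle\right)_{1 \leq i \leq n, \ 1 \leq k \leq Q}$$

Let $U$ be the unit ball of $\bar{\mathcal{H}}$ in the $\ell_\infty$-norm ($U = \{\bar{h} \in \bar{\mathcal{H}} : \|w\|_\infty \leq 1\}$). The connection between $\mathcal{N}(\epsilon, U, n)$ and the entropy numbers of $S_{x^n}$ is provided by the following proposition:

Proposition 2. Let $\epsilon \in \mathbb{R}_+^*$ and $n \in \mathbb{N}^*$.
$$\sup_{x^n \in \mathcal{X}^n} \epsilon_p(S_{x^n}) \leq \epsilon \implies \mathcal{N}(\epsilon, U, n) \leq p.$$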

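Upper bound on the entropy numbers

Finite-dimensional feature space

Proposition 3 (Carl & Stephani, 1990). Let $E$ and $F$ be Banach spaces and $S \in L(E, F)$. If $S$ is of rank $r$, then for $n \in \mathbb{N}^*$,
$$\epsilon_n(S) \leq 4 \|S\| n^{-1/r}.$$

Theorem 7. Let $\mathcal{H}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\|w\|_\infty \leq \Lambda_w$ and $b \in [-\beta, \beta]^Q$. If the dimensionality of the space $E_{\Phi(\mathcal{X})}$ is finite and equal to $d$, then for all $\gamma \in \mathbb{R}_+^*$,
$$\mathcal{N}^{(p)}\left(\gamma/4, \Delta^*_\gamma \mathcal{H}, 2m\right) \leq \left(2 \left\lceil \frac{8\beta}{\gamma} \right\rceil + 1\right)^Q \left(\frac{64 \Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\gamma}\right)^{Qd}.$$

$$R(h) \leq R_{\gamma,m}(h) + O\left(\sqrt{\frac{1}{m}}\right)$$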

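Upper bound on the entropy numbers

Infinite-dimensional feature space

Theorem 8 (Maurey-Carl theorem, Carl & Stephani, 1990). Let $H$ be a Hilbert space and $S$ an operator belonging to $L(\ell_1^n, H)$ or $L(H, \ell_\infty^n)$. Then, for each couple of integers $(k, n)$ satisfying $1 \leq k \leq n$,
$$e_k(S) \leq c \left(\frac{1}{k} \log_2\left(1 + \frac{n}{k}\right)\right)^{1/2} \|S\|,$$
where the dyadic entropy number $e_k(S)$ is equal to $\epsilon_{2^{k-1}}(S)$ and $c$ is a universal constant.

Theorem 9. Let $\mathcal{H}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\|w\|_\infty \leq \Lambda_w$ and $b \in [-\beta, \beta]^Q$. Then, for all $\gamma \in \mathbb{R}_+^*$,
$$\mathcal{N}^{(p)}\left(\gamma/4, \Delta^*_\gamma \mathcal{H}, 2m\right) \leq \left(2 \left\lceil \frac{8\beta}{\gamma} \right\rceil + 1\right)^Q \cdot 2^{\left\lceil \frac{16 c \Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\gamma} \sqrt{\frac{2Qm}{\ln(2)}} \right\rceil - 1}.$$

$$R(h) \leq R_{\gamma,m}(h) + O\left(m^{-1/4}\right)$$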

Use of the Rademacher complexity

Basic probabilistic tools

Definition 23 (Rademacher average). For $n \in \mathbb{N}^*$, let $A$ be a bounded set of vectors $a = (a_i)_{1 \leq i \leq n}$ belonging to $\mathbb{R}^n$ and let $(\sigma_i)_{1 \leq i \leq n}$ be a Rademacher sequence. The Rademacher average associated with $A$, $R_n(A)$, is defined by:
$$R_n(A) = \mathbb{E} \sup_{a \in A} \left|\frac{1}{n} \sum_{i=1}^{n} \sigma_i a_i\right|.$$

Theorem 10 (Bounded differences inequality, McDiarmid, 1989). Let $(T_i)_{1 \leq i \leq n}$ be a sequence of $n$ independent random variables taking values in a set $\mathcal{T}$. Let $g$ be a function from $\mathcal{T}^n$ into $\mathbb{R}$ such that there exists a sequence of nonnegative constants $(c_i)_{1 \leq i \leq n}$ satisfying:
$$\forall i \in [\![ 1, n ]\!], \quad \sup_{(t_i)_{1 \leq i \leq n} \in \mathcal{T}^n, \ t'_i \in \mathcal{T}} \left|g(t_1, \ldots, t_n) - g(t_1, \ldots, t_{i-1}, t'_i, t_{i+1}, \ldots, t_n)\right| \leq c_i.$$
Then, for all $\tau \in \mathbb{R}_+^*$, the random variable $g(T_1, \ldots, T_n)$ satisfies:
$$P\left\{g(T_1, \ldots, T_n) - \mathbb{E} g(T_1, \ldots, T_n) > \tau\right\} \leq e^{-2\tau^2 / c}$$
$$P\left\{\mathbb{E} g(T_1, \ldots, T_n) - g(T_1, \ldots, T_n) > \tau\right\} \leq e^{-2\tau^2 / c}$$
where $c = \sum_{i=1}^{n} c_i^2$.
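For a finite set of short vectors, the Rademacher average can be computed exactly by enumerating all sign sequences. The sketch below assumes the form of the definition with the absolute value inside the supremum; the function name is hypothetical and the enumeration is only practical for small $n$.

```python
from itertools import product
from typing import Sequence


def rademacher_average(A: Sequence[Sequence[float]]) -> float:
    """Exact value of E sup_{a in A} |(1/n) sum_i sigma_i a_i| for a finite
    set A of vectors of length n, averaging over all 2^n sign sequences."""
    n = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(abs(sum(s * a for s, a in zip(sigma, vec))) / n
                     for vec in A)
    return total / 2 ** n


print(rademacher_average([(1.0, 1.0)]))
print(rademacher_average([(1.0, 1.0), (1.0, -1.0)]))
```

Adding the second vector raises the value to its maximum: whatever the signs, one of the two vectors correlates perfectly with them, which is the sense in which the Rademacher average measures the ability of a set to fit random noise.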

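Uniform convergence result

Convexified margin risk corresponding to the M-SVM of Crammer and Singer
$$\tilde{R}(\bar{h}) = \mathbb{E}\left[\left(1 - \Delta \bar{h}_Y(X)\right)_+\right]$$

Theorem 11. Let $\bar{\mathcal{H}}$ be the class of functions that a $Q$-category M-SVM can implement under the hypothesis that $\Phi(\mathcal{X})$ is included in the closed ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$, that the vector $w$ satisfies $\|w\|_\infty \leq \Lambda_w$ and $b = 0$. Let $K_{\bar{\mathcal{H}}} = \Lambda_w \Lambda_{\Phi(\mathcal{X})} + 1$ and $\delta \in (0, 1)$. With probability at least $1 - \delta$, the risk of any function $\bar{h}$ in $\bar{\mathcal{H}}$ is bounded from above by:
$$\tilde{R}(\bar{h}) \leq \tilde{R}_m(\bar{h}) + \frac{4}{\sqrt{m}} + \frac{4 Q (Q - 1) \Lambda_w}{m} \sqrt{\sum_{i=1}^{m} \kappa(X_i, X_i)} + K_{\bar{\mathcal{H}}} \sqrt{\frac{\ln\frac{1}{\delta}}{2m}}.$$

$$\tilde{R}(\bar{h}) \leq \tilde{R}_m(\bar{h}) + O\left(\sqrt{\frac{1}{m}}\right)$$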

Model selection for multi-class SVMs

Bounds on the leave-one-out error

Radius-margin bound

Theorem 12 (Vapnik, 1998). Let us consider a hard margin bi-class SVM. Let $L_m$ be the number of errors that it makes in a leave-one-out cross-validation procedure and let $\gamma = \frac{1}{\|w\|}$ denote its geometrical margin. Then the following upper bound holds true:
$$L_m \leq \frac{D_m^2}{\gamma^2}$$
where $D_m$ is the diameter of the smallest ball of the feature space containing the support vectors.

Radius-margin bound for the M-SVM of Weston and Watkins

$d_{WW} = d_{CS} = 1$

Theorem 13. Let us consider a hard margin $Q$-category M-SVM of Weston and Watkins (or Crammer and Singer) on a domain $\mathcal{X}$. Let $d_m = \{(x_i, y_i) : 1 \leq i \leq m\}$ be its training set, $L_m$ the number of errors resulting from applying a leave-one-out cross-validation procedure to this machine, and $D_m$ the diameter of the smallest sphere of the feature space containing the set $\{\Phi(x_i) : 1 \leq i \leq m\}$. Then the following upper bound holds true:
$$L_m \leq \frac{K_{CV}}{Q} D_m^2 \sum_{k<l} \left(\frac{1 + d_{WW,kl}}{\gamma_{kl}}\right)^2$$

Constant $K_{CV}$
- The value of $K_{CV}$ is obtained by solving as many QP problems as there are support vectors.
- For $Q = 2$, $K_{CV} = 2$, and the bound reduces to the bi-class one.

Radius-margin bound for the M-SVM of Lee, Lin and Wahba

$d_{LLW} = \frac{Q}{Q - 1}$

Theorem 14. Let us consider a hard margin $Q$-category M-SVM of Lee, Lin and Wahba on a domain $\mathcal{X}$. Let $d_m = \{(x_i, y_i) : 1 \leq i \leq m\}$ be its training set, $L_m$ the number of errors resulting from applying a leave-one-out cross-validation procedure to this machine, and $D_m$ the diameter of the smallest sphere of the feature space containing the set $\{\Phi(x_i) : 1 \leq i \leq m\}$. Then the following upper bound holds true:
$$L_m \leq Q^2 D_m^2 \sum_{k<l} \left(\frac{1 + d_{LLW,kl}}{\gamma_{kl}}\right)^2$$

This bound does not reduce to the bi-class one for $Q = 2$.

Conclusions and open problems

Conclusions

Capacity measures of the classes of functions
- The $\gamma$-$\Psi$-dimensions play for the M-SVMs (and the MLPs!) the same role as the fat-shattering dimension for the bi-class SVMs.
- The current upper bounds on the covering numbers are suboptimal except in specific cases.
- Although the use of the Rademacher complexity currently provides the sharpest bound, better bounds, adapted to the problem of interest, should result from implementing hybrid approaches.

Guaranteed risks
- These studies highlight the specific character of the multi-class case.
- Model selection should provide a touchstone to assess the different guaranteed risks derived.

Open problems and future work

Bounds on the risk of large margin multi-category classifiers
- Computation of a bound on the universal constant of the Maurey-Carl theorem
- Use of Dudley's method of chaining to improve the VC bound
- Derivation of dedicated PAC-Bayes bounds
- ...

Model selection for M-SVMs
- Assessment of the guaranteed risks and radius-margin bounds to select the value of the soft margin parameter
- Integration in the applications implementing the M-SVMs of procedures choosing automatically the values of the hyperparameters
