
Support Vector Machines

Problem Definition
Consider a training set of n i.i.d. samples

(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)

where xᵢ is a vector of length m and yᵢ ∈ {−1, 1} is the class label for data point xᵢ.
Find a separating hyperplane

w·x + b = 0

corresponding to the decision function

f(x) = sign(w·x + b)
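
As a concrete illustration of the decision function f(x) = sign(w·x + b), here is a minimal NumPy sketch; the weight vector, bias, and sample values are made-up assumptions for the example, not taken from the slides.

```python
import numpy as np

def decision_function(X, w, b):
    """Linear SVM decision rule f(x) = sign(w·x + b) applied row-wise to X."""
    scores = X @ w + b          # signed distances (up to scaling by ||w||)
    return np.sign(scores)      # +1 or -1 class labels

# Hypothetical 2-D example: w and b would normally come from training.
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 1.0], [0.0, 3.0]])
print(decision_function(X, w, b))   # [ 1. -1.]
```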

Separating Hyperplanes
[Figure: two classes of training samples in the x(1)-x(2) plane with several possible separating hyperplanes]

which separating hyperplane should we choose?

Separating Hyperplanes
Training data is just a subset of all possible data.
Suppose the hyperplane is close to some sample xᵢ.
If we see a new sample close to xᵢ, it is likely to fall on the wrong side of the hyperplane.

[Figure: a separating hyperplane passing very close to sample xᵢ in the x(1)-x(2) plane]

The result is poor generalization (poor performance on unseen data).

Separating Hyperplanes
Instead, choose the hyperplane as far as possible from any sample.

[Figure: a separating hyperplane with a large gap to the closest samples on both sides]

New samples close to the old samples will then be classified correctly.
This gives good generalization.

SVM
Idea: maximize distance to the closest example
[Figure: two separating hyperplanes for the same data, one with a smaller distance and one with a larger distance to the closest sample xᵢ]

For the optimal hyperplane, the distance to the closest negative example equals the distance to the closest positive example.

SVM: Linearly Separable Case


SVM: maximize the margin

[Figure: separating hyperplane with a margin band of width 2d; the closest samples lie on the margin boundaries]

The margin is twice the absolute value of the distance d from the closest examples to the separating hyperplane.
A larger margin gives better generalization (performance on test data), both in practice and in theory.

SVM: Linearly Separable Case


[Figure: support vectors lying on the margin boundaries on both sides of the separating hyperplane]

Support vectors are the samples closest to the separating hyperplane; they are the most difficult patterns to classify.
The optimal hyperplane is completely defined by the support vectors.
Of course, we do not know which samples are support vectors without finding the optimal hyperplane.

SVM: Formula for the Margin


[Figure: a point x and its perpendicular distance to the hyperplane g(x) = 0 in the x(1)-x(2) plane]

g(x) = wᵗx + b

The absolute distance between x and the boundary g(x) = 0 is

|wᵗx + b| / ||w||

This distance is unchanged for the rescaled hyperplane g₁(x) = a·g(x):

|a·wᵗx + a·b| / ||a·w|| = |wᵗx + b| / ||w||

Let xᵢ be an example closest to the boundary. Set

|wᵗxᵢ + b| = 1

Now the largest-margin hyperplane is unique.

SVM: Formula for the Margin


For uniqueness, set |wᵗxᵢ + b| = 1 for any example xᵢ closest to the boundary.
Now the distance from the closest sample xᵢ to g(x) = 0 is

|wᵗxᵢ + b| / ||w|| = 1 / ||w||

[Figure: margin band of total width 2/||w|| around the separating hyperplane]

Thus the margin is

m = 2 / ||w||

SVM: Optimal Hyperplane


Maximize the margin

m = 2 / ||w||

subject to the constraints

wᵗxᵢ + b ≥ 1    if yᵢ = 1
wᵗxᵢ + b ≤ −1   if yᵢ = −1

We can convert our problem to: minimize

J(w) = (1/2) ||w||²

s.t.

yᵢ(w·xᵢ + b) ≥ 1   for all i

J(w) is a quadratic function, thus there is a single global minimum.
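
The primal problem above is a standard constrained quadratic program, so for small instances a general-purpose solver is enough to see it work. Below is a minimal sketch, assuming scipy is available and using made-up toy data (the data, starting point, and solver choice are illustrative assumptions, not part of the slides).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):
    w = theta[:d]                    # theta packs (w, b)
    return 0.5 * np.dot(w, w)        # J(w) = 1/2 ||w||^2

constraints = [
    # y_i (w·x_i + b) - 1 >= 0 for every training sample
    {"type": "ineq", "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:d] + theta[d]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints, method="SLSQP")
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
```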

Constrained Quadratic Programming


Primal problem: minimize

(1/2) ||w||²

subject to  yᵢ(w·xᵢ + b) ≥ 1  for all i

Introduce Lagrange multipliers aᵢ ≥ 0 associated with the constraints.
The solution to the primal problem is equivalent to determining the saddle point of the function

L_P = L(w, b, a) = (1/2) ||w||² − Σᵢ aᵢ [yᵢ(xᵢ·w + b) − 1]

Solving Constrained QP
At the saddle point, L_P has a minimum in w and b, requiring

∂L_P/∂w = w − Σᵢ aᵢ yᵢ xᵢ = 0   ⟹   w = Σᵢ aᵢ yᵢ xᵢ

∂L_P/∂b = −Σᵢ aᵢ yᵢ = 0   ⟹   Σᵢ aᵢ yᵢ = 0

Primal-Dual
Primal:

L_P = (1/2) ||w||² − Σᵢ aᵢ yᵢ(xᵢ·w + b) + Σᵢ aᵢ

Minimize L_P with respect to w and b, subject to aᵢ ≥ 0.

Substitute  w = Σᵢ aᵢ yᵢ xᵢ  and  Σᵢ aᵢ yᵢ = 0

Dual:

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢ·xⱼ

Maximize L_D with respect to a, subject to aᵢ ≥ 0 and Σᵢ aᵢ yᵢ = 0.

Solving QP using dual problem


Maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢᵗxⱼ

constrained to  aᵢ ≥ 0 for all i  and  Σᵢ aᵢ yᵢ = 0

a = {a₁, ..., aₙ} are new variables, one for each sample.
L_D(a) can be optimized by quadratic programming.
L_D(a) is formulated in terms of a; it depends on w and b only indirectly.
L_D(a) depends on the number of samples, not on the dimension of the samples.
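
As a sanity check of this dual formulation, one can again hand the problem to a generic solver. The sketch below maximizes L_D(a) with scipy on the same made-up toy data as before (a dedicated QP package would be the usual choice in practice; this is only an illustrative assumption).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Matrix of y_i y_j x_i·x_j used by the dual objective
H = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # maximize L_D(a) = sum_i a_i - 1/2 a^T H a  <=>  minimize its negative
    return 0.5 * a @ H @ a - a.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    bounds=[(0.0, None)] * n,                              # a_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i a_i y_i = 0
    method="SLSQP",
)
a = res.x
w = (a * y) @ X                  # w = sum_i a_i y_i x_i
sv = np.argmax(a)                # some sample with a_i > 0, i.e. a support vector
b = y[sv] - X[sv] @ w            # b = y_i - w·x_i from the KKT conditions
print("a =", np.round(a, 3), "w =", w, "b =", b)
```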

Threshold
b can be determined from the optimal a and the Karush-Kuhn-Tucker (KKT) conditions

aᵢ [yᵢ(w·xᵢ + b) − 1] = 0   for all i

aᵢ > 0 implies

yᵢ(w·xᵢ + b) = 1   ⟹   w·xᵢ + b = yᵢ   ⟹   b = yᵢ − w·xᵢ

Support Vectors
For every sample i, one of the following must hold:

aᵢ = 0,   or
aᵢ > 0 and yᵢ(w·xᵢ + b) − 1 = 0

Many aᵢ are 0, so

w = Σᵢ aᵢ yᵢ xᵢ

is a sparse solution.

Samples with aᵢ > 0 are the support vectors; they are the closest to the separating hyperplane.
The optimal hyperplane is completely defined by the support vectors.

SVM: Classification
Given a new sample x, find its label y:

y = sign(w·x + b),   where   w = Σᵢ aᵢ yᵢ xᵢ
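
Continuing the dual-QP sketch above, classification of new points needs only the multipliers, the labels, and the training samples; the variable names here refer to that earlier hypothetical snippet.

```python
import numpy as np

def svm_predict(X_new, X_train, y_train, a, b):
    """Label new samples with sign(w·x + b), where w = sum_i a_i y_i x_i."""
    w = (a * y_train) @ X_train
    return np.sign(X_new @ w + b)

# Usage with the variables from the dual-QP sketch above:
# labels = svm_predict(np.array([[1.0, 0.5]]), X, y, a, b)
```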

SVM: Example

Class 1: [1,6], [1,10], [4,11]

Class 2: [5,2], [7,6], [10,4]

SVM: Example

Solving the dual problem gives

a = (0.036, 0, 0.039, 0, 0.076, 0)ᵗ

so the support vectors are x₁, x₃, and x₅ (the samples with aᵢ > 0).

Solution: find w using

w = Σᵢ aᵢ yᵢ xᵢ ≈ (−0.33, 0.20)ᵗ

Since a₁ > 0, we can find b using

b = y₁ − wᵗx₁ ≈ 0.13
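
For readers who want to reproduce this example, here is a hedged scikit-learn sketch; a very large C approximates the separable (hard-margin) formulation, and the printed numbers should land close to the rounded values above, though not necessarily identical.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 6], [1, 10], [4, 11],   # class 1
              [5, 2], [7, 6], [10, 4]])   # class 2
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)

print("support vector indices:", clf.support_)   # expect samples 0, 2, 4
print("w =", clf.coef_[0])                        # roughly (-0.33, 0.20)
print("b =", clf.intercept_[0])                   # roughly 0.13
```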

SVM: Non Separable Case


Data is most likely not to be linearly separable, but a linear classifier may still be appropriate.

[Figure: two almost linearly separable classes with a few outliers on the wrong side]

We can apply SVM in the non-linearly-separable case; the data should be almost linearly separable for good performance.

SVM with slacks


Use nonnegative slack variables ξ₁, ..., ξₙ (one for each sample).
Change the constraints from  yᵢ(wᵗxᵢ + b) ≥ 1 for all i  to  yᵢ(wᵗxᵢ + b) ≥ 1 − ξᵢ.

ξᵢ is a measure of deviation from the ideal position for sample i:

ξᵢ > 1: sample i is on the wrong side of the separating hyperplane.
0 < ξᵢ < 1: sample i is on the right side of the separating hyperplane but within the region of maximum margin.

[Figure: one sample with ξᵢ > 1 on the wrong side of the hyperplane and one with 0 < ξᵢ < 1 inside the margin]

SVM with slacks


We would like to minimize

J(w, ξ₁, ..., ξₙ) = (1/2) ||w||² + C Σᵢ ξᵢ

constrained to  yᵢ(wᵗxᵢ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0  for all i.

C > 0 is a constant which measures the relative weight of the first and second terms:
if C is small, we allow a lot of samples not in the ideal position;
if C is large, we want to have very few samples not in the ideal position.
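
A quick way to see the role of C is to fit the soft-margin SVM at two settings and count the samples that are not in the ideal position. This is a sketch under assumptions: scikit-learn's SVC as the solver, a synthetic overlapping data set, and two C values chosen only for contrast.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (hypothetical data)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins = y * clf.decision_function(X)    # y_i (w·x_i + b)
    n_violations = int(np.sum(margins < 1))   # samples not in the ideal position
    print(f"C={C}: margin violations = {n_violations}, "
          f"support vectors = {len(clf.support_)}")
```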

SVM with slacks


J(w, ξ₁, ..., ξₙ) = (1/2) ||w||² + C Σᵢ ξᵢ

[Figure: two fits of the same data; with large C, few samples are not in the ideal position, while with small C, a lot of samples are not in the ideal position]

SVM with slacks: Dual Formulation

Maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢᵗxⱼ

constrained to  0 ≤ aᵢ ≤ C for all i  and  Σᵢ aᵢ yᵢ = 0

Find w using  w = Σᵢ aᵢ yᵢ xᵢ

Solve for b using any aᵢ with 0 < aᵢ < C and  aᵢ [yᵢ(wᵗxᵢ + b) − 1] = 0

Non Linear Mapping


Cover's theorem: a pattern-classification problem cast non-linearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

[Figure: one-dimensional data (points near -3, -2, 1, 2) that is not linearly separable]

Lift it to a two-dimensional space with φ(x) = (x, x²).

Non Linear Mapping

Solve a non-linear classification problem with a linear classifier:

1. Project the data x to a high dimension using a function φ(x).
2. Find a linear discriminant function for the transformed data φ(x).
3. The final nonlinear discriminant function is g(x) = wᵗφ(x) + w₀.

[Figure: the 1-D data lifted to 2-D with φ(x) = (x, x²); a line separates the two classes, giving decision regions R1 and R2 on the original axis]

In 2-D, the discriminant function is linear:

g((x, x²)) = w₁·x + w₂·x² + w₀

In 1-D, the discriminant function is not linear:

g(x) = w₁x + w₂x² + w₀
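
To make the lifting step concrete, the sketch below maps 1-D points through φ(x) = (x, x²) and fits a linear SVM in the lifted space. The sample points, their labels, and the use of scikit-learn's LinearSVC are assumptions chosen so that the 1-D problem is not separable by a single threshold.

```python
import numpy as np
from sklearn.svm import LinearSVC

# 1-D data: the positive class sits between two negative clusters,
# so no single threshold on x separates the classes.
x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([-1,   -1,    1,   1,  -1,  -1])

phi = np.column_stack([x, x ** 2])     # lift: phi(x) = (x, x^2)

clf = LinearSVC(C=10.0).fit(phi, y)
print("accuracy in lifted space:", clf.score(phi, y))   # separable, so 1.0

# The resulting discriminant g(x) = w1*x + w2*x^2 + w0 is quadratic in the original x.
w1, w2 = clf.coef_[0]
w0 = clf.intercept_[0]
print(f"g(x) = {w1:.2f}*x + {w2:.2f}*x^2 + {w0:.2f}")
```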

Non Linear Mapping: Another Example

Non Linear SVM

We can use any linear classifier after lifting the data into a higher dimensional space. However, we then have to deal with the curse of dimensionality:

1. poor generalization to test data
2. computationally expensive

SVM handles the curse of dimensionality:

1. Enforcing the largest margin permits good generalization. It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality.
2. Computation in the higher dimensional case is performed only implicitly through the use of kernel functions.

Non Linear SVM: Kernels

Recall the SVM optimization: maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢᵗxⱼ

and the classification rule

y = sign( Σᵢ aᵢ yᵢ xᵢ·x + b )

Note that the samples xᵢ appear only through the dot products xᵢᵗxⱼ and xᵢᵗx.

If we lift xᵢ to a high dimensional space F using φ(x), we need to compute the high dimensional product φ(xᵢ)ᵗφ(xⱼ): maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ φ(xᵢ)ᵗφ(xⱼ)

The dimensionality of the space F is not necessarily important. We may not even know the map φ.

Kernel
A kernel is a function that returns the value of the dot product between the images of its two arguments:

K(x, y) = φ(x)ᵗφ(y)

Given a function K, it is possible to verify that it is a kernel.

In the dual objective, the product φ(xᵢ)ᵗφ(xⱼ) is replaced by K(xᵢ, xⱼ): maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ K(xᵢ, xⱼ)

Now we only need to compute K(xᵢ, xⱼ) instead of φ(xᵢ)ᵗφ(xⱼ).

This is the kernel trick: we do not need to perform operations in the high dimensional space explicitly.
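
As a numerical illustration of the kernel trick, the sketch below checks that the degree-2 polynomial kernel K(x, y) = (x·y + 1)² equals the dot product of an explicitly lifted quadratic feature map. The particular feature map written out here (with its √2 scaling) is the standard construction for this kernel and is used only for the check.

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x·y + 1)^2, computed in the original space."""
    return (x @ y + 1.0) ** 2

def phi(x):
    """Explicit quadratic feature map for 2-D input matching the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly2_kernel(x, y))   # kernel value computed in 2-D
print(phi(x) @ phi(y))      # same value via the 6-D lifted space
```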

Kernel Matrix
(aka the Gram matrix): the n×n matrix of kernel values K(xᵢ, xⱼ) between all pairs of training samples.

The central structure in kernel machines.
Contains all the necessary information for the learning algorithm.
Fuses information about the data AND the kernel.
Many interesting properties.
From www.support-vector.net

Mercer's Theorem
The kernel matrix is symmetric positive definite.
Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
Every (semi-)positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write

K(x, y) = φ(x)ᵗφ(y)

Positive definite means that for any f ∈ L₂,

∫∫ K(x, y) f(x) f(y) dx dy ≥ 0

From www.support-vector.net
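
The finite-sample analogue of this condition is that the Gram matrix built from any set of points has no negative eigenvalues. The following sketch checks this numerically, assuming a Gaussian RBF kernel and randomly generated points (both are illustrative choices).

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))    # random sample points (hypothetical)

# Gram matrix K[i, j] = K(x_i, x_j)
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(K)               # K is symmetric
print("smallest eigenvalue:", eigenvalues.min())  # >= 0 (up to round-off) for a valid kernel
```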

Examples of Kernels
Some common choices (both satisfying Mercer's condition):

Polynomial kernel of degree p:

K(xᵢ, xⱼ) = (xᵢᵗxⱼ + 1)^p

Gaussian radial basis kernel (the data is lifted into an infinite-dimensional space):

K(xᵢ, xⱼ) = exp( −(1/(2σ²)) ||xᵢ − xⱼ||² )
From www.support-vector.net

Example Polynomial Kernels

From www.support-vector.net

Example: the two spirals


Separated by a hyperplane in feature space (Gaussian kernels).

From www.support-vector.net

Making Kernels
The set of kernels is closed under some operations. If K and K′ are kernels, then:

K + K′ is a kernel
cK is a kernel, if c > 0
aK + bK′ is a kernel, for a, b > 0
and so on.

We can make complex kernels from simple ones: modularity!
From www.support-vector.net
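
A short sketch of this modularity: combine two base kernels with positive weights and check that the resulting Gram matrix is still positive semi-definite. The base kernels, weights, and data below are assumptions for illustration.

```python
import numpy as np

def linear_kernel_matrix(X):
    """Gram matrix of the linear kernel K(x, y) = x·y."""
    return X @ X.T

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

# a*K + b*K' with a, b > 0 is again a kernel
K_combined = 0.5 * linear_kernel_matrix(X) + 2.0 * rbf_kernel_matrix(X)

print("smallest eigenvalue:", np.linalg.eigvalsh(K_combined).min())  # >= 0 up to round-off
```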

Non Linear SVM Recipe

Start with data x₁, ..., xₙ which lives in a feature space of dimension d.
Choose a kernel K(xᵢ, xⱼ) corresponding to some function φ(xᵢ) which takes sample xᵢ to a higher dimensional space.
Find the largest margin linear discriminant function in the higher dimensional space by using a quadratic programming package to solve: maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ K(xᵢ, xⱼ)

constrained to  0 ≤ aᵢ ≤ C for all i  and  Σᵢ aᵢ yᵢ = 0

Non Linear SVM Recipe

Weight vector w in the high dimensional space:

w = Σᵢ aᵢ yᵢ φ(xᵢ)

Linear discriminant function of largest margin in the high dimensional space:

g(φ(x)) = wᵗφ(x) + b = Σ_{xᵢ ∈ S} aᵢ yᵢ φ(xᵢ)ᵗφ(x) + b

Non-linear discriminant function in the original space:

g(x) = Σ_{xᵢ ∈ S} aᵢ yᵢ φ(xᵢ)ᵗφ(x) + b = Σ_{xᵢ ∈ S} aᵢ yᵢ K(xᵢ, x) + b

where S is the set of support vectors.
Decide class 1 if g(x) > 0, otherwise decide class 2.
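
Putting the recipe together with an off-the-shelf solver, here is a hedged scikit-learn sketch with a Gaussian RBF kernel. The synthetic two-moons data and the hyperparameter values are assumptions made only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data (hypothetical choice for the demo)
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
y = 2 * y - 1                                  # relabel to {-1, +1}

# Kernel SVM: the RBF kernel corresponds to an implicit infinite-dimensional phi
clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# g(x) = sum over support vectors of a_i y_i K(x_i, x) + b
print("number of support vectors:", len(clf.support_))
print("training accuracy:", clf.score(X, y))

# Classify a new point: decide class 1 if g(x) > 0, otherwise class 2
x_new = np.array([[0.5, 0.0]])
print("g(x_new) =", clf.decision_function(x_new), "-> label", clf.predict(x_new))
```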

Non Linear SVM

Nonlinear discriminant function:

g(x) = Σ_{xᵢ ∈ S} aᵢ yᵢ K(xᵢ, x) + b

aᵢ yᵢ is the weight of support vector xᵢ; the sum runs only over the support vectors, the most important training samples.
K(xᵢ, x) acts like an inverse distance from x to support vector xᵢ; for the Gaussian kernel

K(xᵢ, x) = exp( −(1/(2σ²)) ||xᵢ − x||² )

it is close to 1 when x is near xᵢ and decays towards 0 as x moves away.

Higher Order Polynomials


Taken from Andrew Moore
Quadratic:  φ(x) contains all d²/2 terms up to degree 2.
  Cost to build the H matrix traditionally: d² n² / 4   (2,500 n² for d = 100).
  Using φ(a)ᵗφ(b) = (aᵗb + 1)², cost to build H sneakily: d n² / 2   (50 n² for d = 100).

Cubic:  φ(x) contains all d³/6 terms up to degree 3.
  Cost traditionally: d³ n² / 12   (83,000 n² for d = 100).
  Using φ(a)ᵗφ(b) = (aᵗb + 1)³, cost sneakily: d n² / 2   (50 n² for d = 100).

Quartic:  φ(x) contains all d⁴/24 terms up to degree 4.
  Cost traditionally: d⁴ n² / 48   (1,960,000 n² for d = 100).
  Using φ(a)ᵗφ(b) = (aᵗb + 1)⁴, cost sneakily: d n² / 2   (50 n² for d = 100).

n is the number of samples, d is the number of features.

SVM Summary

Advantages:

Based on a nice theory.
Excellent generalization properties.
The objective function has no local minima.
Can be used to find non-linear discriminant functions.
The complexity of the classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space.

Disadvantages:

It is not clear how to select a kernel function in a principled manner.
Tends to be slower than other methods (in the non-linear case).
