
Support Vector Machines

Problem Definition
Consider a training set of n i.i.d. samples

(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)

where xᵢ is a vector of length m and yᵢ ∈ {−1, 1} is the class label for data point xᵢ.
Find a separating hyperplane

w·x + b = 0

corresponding to the decision function

f(x) = sign(w·x + b)
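
As a concrete illustration of the decision function f(x) = sign(w·x + b), here is a minimal NumPy sketch; the weight vector, bias, and sample values are made-up assumptions for the example, not taken from the slides.

```python
import numpy as np

def decision_function(X, w, b):
    """Linear SVM decision rule f(x) = sign(w·x + b) applied row-wise to X."""
    scores = X @ w + b          # signed distances (up to scaling by ||w||)
    return np.sign(scores)      # +1 or -1 class labels

# Hypothetical 2-D example: w and b would normally come from training.
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 1.0], [0.0, 3.0]])
print(decision_function(X, w, b))   # [ 1. -1.]
```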

Separating Hyperplanes
[Figure: two classes of training samples in the x(1)-x(2) plane with several possible separating hyperplanes]

which separating hyperplane should we choose?

Separating Hyperplanes
Training data is just a subset of all possible data.
Suppose the hyperplane is close to some sample xᵢ.
If we see a new sample close to xᵢ, it is likely to fall on the wrong side of the hyperplane.

[Figure: a separating hyperplane passing very close to sample xᵢ in the x(1)-x(2) plane]

The result is poor generalization (poor performance on unseen data).

Separating Hyperplanes
Instead, choose the hyperplane as far as possible from any sample.

[Figure: a separating hyperplane with a large gap to the closest samples on both sides]

New samples close to the old samples will then be classified correctly.
This gives good generalization.

SVM
Idea: maximize distance to the closest example
[Figure: two separating hyperplanes for the same data, one with a smaller distance and one with a larger distance to the closest sample xᵢ]

For the optimal hyperplane, the distance to the closest negative example equals the distance to the closest positive example.

SVM: Linearly Separable Case


SVM: maximize the margin

[Figure: separating hyperplane with a margin band of width 2d; the closest samples lie on the margin boundaries]

The margin is twice the absolute value of the distance d from the closest examples to the separating hyperplane.
A larger margin gives better generalization (performance on test data), both in practice and in theory.

SVM: Linearly Separable Case


[Figure: support vectors lying on the margin boundaries on both sides of the separating hyperplane]

Support vectors are the samples closest to the separating hyperplane; they are the most difficult patterns to classify.
The optimal hyperplane is completely defined by the support vectors.
Of course, we do not know which samples are support vectors without finding the optimal hyperplane.

SVM: Formula for the Margin


[Figure: a point x and its perpendicular distance to the hyperplane g(x) = 0 in the x(1)-x(2) plane]

g(x) = wᵗx + b

The absolute distance between x and the boundary g(x) = 0 is

|wᵗx + b| / ||w||

This distance is unchanged for the rescaled hyperplane g₁(x) = a·g(x):

|a·wᵗx + a·b| / ||a·w|| = |wᵗx + b| / ||w||

Let xᵢ be an example closest to the boundary. Set

|wᵗxᵢ + b| = 1

Now the largest-margin hyperplane is unique.

SVM: Formula for the Margin


For uniqueness, set |wᵗxᵢ + b| = 1 for any example xᵢ closest to the boundary.
Now the distance from the closest sample xᵢ to g(x) = 0 is

|wᵗxᵢ + b| / ||w|| = 1 / ||w||

[Figure: margin band of total width 2/||w|| around the separating hyperplane]

Thus the margin is

m = 2 / ||w||

SVM: Optimal Hyperplane


Maximize the margin

m = 2 / ||w||

subject to the constraints

wᵗxᵢ + b ≥ 1    if yᵢ = 1
wᵗxᵢ + b ≤ −1   if yᵢ = −1

We can convert our problem to: minimize

J(w) = (1/2) ||w||²

s.t.

yᵢ(w·xᵢ + b) ≥ 1   for all i

J(w) is a quadratic function, thus there is a single global minimum.
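
The primal problem above is a standard constrained quadratic program, so for small instances a general-purpose solver is enough to see it work. Below is a minimal sketch, assuming scipy is available and using made-up toy data (the data, starting point, and solver choice are illustrative assumptions, not part of the slides).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):
    w = theta[:d]                    # theta packs (w, b)
    return 0.5 * np.dot(w, w)        # J(w) = 1/2 ||w||^2

constraints = [
    # y_i (w·x_i + b) - 1 >= 0 for every training sample
    {"type": "ineq", "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:d] + theta[d]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints, method="SLSQP")
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
```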

Constrained Quadratic Programming


Primal problem: minimize

(1/2) ||w||²

subject to  yᵢ(w·xᵢ + b) ≥ 1  for all i

Introduce Lagrange multipliers aᵢ ≥ 0 associated with the constraints.
The solution to the primal problem is equivalent to determining the saddle point of the function

L_P = L(w, b, a) = (1/2) ||w||² − Σᵢ aᵢ [yᵢ(xᵢ·w + b) − 1]

Solving Constrained QP
At the saddle point, L_P has a minimum in w and b, requiring

∂L_P/∂w = w − Σᵢ aᵢ yᵢ xᵢ = 0   ⟹   w = Σᵢ aᵢ yᵢ xᵢ

∂L_P/∂b = −Σᵢ aᵢ yᵢ = 0   ⟹   Σᵢ aᵢ yᵢ = 0

Primal-Dual
Primal:

L_P = (1/2) ||w||² − Σᵢ aᵢ yᵢ(xᵢ·w + b) + Σᵢ aᵢ

Minimize L_P with respect to w and b, subject to aᵢ ≥ 0.

Substitute  w = Σᵢ aᵢ yᵢ xᵢ  and  Σᵢ aᵢ yᵢ = 0

Dual:

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢ·xⱼ

Maximize L_D with respect to a, subject to aᵢ ≥ 0 and Σᵢ aᵢ yᵢ = 0.

Solving QP using dual problem


Maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢᵗxⱼ

constrained to  aᵢ ≥ 0 for all i  and  Σᵢ aᵢ yᵢ = 0

a = {a₁, ..., aₙ} are new variables, one for each sample.
L_D(a) can be optimized by quadratic programming.
L_D(a) is formulated in terms of a; it depends on w and b only indirectly.
L_D(a) depends on the number of samples, not on the dimension of the samples.
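
As a sanity check of this dual formulation, one can again hand the problem to a generic solver. The sketch below maximizes L_D(a) with scipy on the same made-up toy data as before (a dedicated QP package would be the usual choice in practice; this is only an illustrative assumption).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Matrix of y_i y_j x_i·x_j used by the dual objective
H = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # maximize L_D(a) = sum_i a_i - 1/2 a^T H a  <=>  minimize its negative
    return 0.5 * a @ H @ a - a.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    bounds=[(0.0, None)] * n,                              # a_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i a_i y_i = 0
    method="SLSQP",
)
a = res.x
w = (a * y) @ X                  # w = sum_i a_i y_i x_i
sv = np.argmax(a)                # some sample with a_i > 0, i.e. a support vector
b = y[sv] - X[sv] @ w            # b = y_i - w·x_i from the KKT conditions
print("a =", np.round(a, 3), "w =", w, "b =", b)
```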

Threshold
b can be determined from the optimal a and the Karush-Kuhn-Tucker (KKT) conditions

aᵢ [yᵢ(w·xᵢ + b) − 1] = 0   for all i

aᵢ > 0 implies

yᵢ(w·xᵢ + b) = 1   ⟹   w·xᵢ + b = yᵢ   ⟹   b = yᵢ − w·xᵢ

Support Vectors
For every sample i, one of the following must hold:

aᵢ = 0,   or
aᵢ > 0 and yᵢ(w·xᵢ + b) − 1 = 0

Many aᵢ are 0, so

w = Σᵢ aᵢ yᵢ xᵢ

is a sparse solution.

Samples with aᵢ > 0 are the support vectors; they are the closest to the separating hyperplane.
The optimal hyperplane is completely defined by the support vectors.

SVM: Classification
Given a new sample x, find its label y:

y = sign(w·x + b),   where   w = Σᵢ aᵢ yᵢ xᵢ
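
Continuing the dual-QP sketch above, classification of new points needs only the multipliers, the labels, and the training samples; the variable names here refer to that earlier hypothetical snippet.

```python
import numpy as np

def svm_predict(X_new, X_train, y_train, a, b):
    """Label new samples with sign(w·x + b), where w = sum_i a_i y_i x_i."""
    w = (a * y_train) @ X_train
    return np.sign(X_new @ w + b)

# Usage with the variables from the dual-QP sketch above:
# labels = svm_predict(np.array([[1.0, 0.5]]), X, y, a, b)
```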

SVM: Example

Class 1: [1,6], [1,10], [4,11]

Class 2: [5,2], [7,6], [10,4]

SVM: Example

Solving the dual problem gives

a = (0.036, 0, 0.039, 0, 0.076, 0)ᵗ

so the support vectors are x₁, x₃, and x₅ (the samples with aᵢ > 0).

Solution: find w using

w = Σᵢ aᵢ yᵢ xᵢ ≈ (−0.33, 0.20)ᵗ

Since a₁ > 0, we can find b using

b = y₁ − wᵗx₁ ≈ 0.13
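
For readers who want to reproduce this example, here is a hedged scikit-learn sketch; a very large C approximates the separable (hard-margin) formulation, and the printed numbers should land close to the rounded values above, though not necessarily identical.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 6], [1, 10], [4, 11],   # class 1
              [5, 2], [7, 6], [10, 4]])   # class 2
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)

print("support vector indices:", clf.support_)   # expect samples 0, 2, 4
print("w =", clf.coef_[0])                        # roughly (-0.33, 0.20)
print("b =", clf.intercept_[0])                   # roughly 0.13
```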

SVM: Non Separable Case


Data is most likely not to be linearly separable, but a linear classifier may still be appropriate.

[Figure: two almost linearly separable classes with a few outliers on the wrong side]

We can apply SVM in the non-linearly-separable case; the data should be almost linearly separable for good performance.

SVM with slacks


Use nonnegative slack variables ξ₁, ..., ξₙ (one for each sample).
Change the constraints from  yᵢ(wᵗxᵢ + b) ≥ 1 for all i  to  yᵢ(wᵗxᵢ + b) ≥ 1 − ξᵢ.

ξᵢ is a measure of deviation from the ideal position for sample i:

ξᵢ > 1: sample i is on the wrong side of the separating hyperplane.
0 < ξᵢ < 1: sample i is on the right side of the separating hyperplane but within the region of maximum margin.

[Figure: one sample with ξᵢ > 1 on the wrong side of the hyperplane and one with 0 < ξᵢ < 1 inside the margin]

SVM with slacks


We would like to minimize

J(w, ξ₁, ..., ξₙ) = (1/2) ||w||² + C Σᵢ ξᵢ

constrained to  yᵢ(wᵗxᵢ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0  for all i.

C > 0 is a constant which measures the relative weight of the first and second terms:
if C is small, we allow a lot of samples not in the ideal position;
if C is large, we want to have very few samples not in the ideal position.
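
A quick way to see the role of C is to fit the soft-margin SVM at two settings and count the samples that are not in the ideal position. This is a sketch under assumptions: scikit-learn's SVC as the solver, a synthetic overlapping data set, and two C values chosen only for contrast.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (hypothetical data)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins = y * clf.decision_function(X)    # y_i (w·x_i + b)
    n_violations = int(np.sum(margins < 1))   # samples not in the ideal position
    print(f"C={C}: margin violations = {n_violations}, "
          f"support vectors = {len(clf.support_)}")
```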

SVM with slacks


J(w, ξ₁, ..., ξₙ) = (1/2) ||w||² + C Σᵢ ξᵢ

[Figure: two fits of the same data; with large C, few samples are not in the ideal position, while with small C, a lot of samples are not in the ideal position]

SVM with slacks: Dual Formulation

Maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢᵗxⱼ

constrained to  0 ≤ aᵢ ≤ C for all i  and  Σᵢ aᵢ yᵢ = 0

Find w using  w = Σᵢ aᵢ yᵢ xᵢ

Solve for b using any aᵢ with 0 < aᵢ < C and  aᵢ [yᵢ(wᵗxᵢ + b) − 1] = 0

Non Linear Mapping


Cover's theorem: a pattern-classification problem cast non-linearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

[Figure: one-dimensional data (points near -3, -2, 1, 2) that is not linearly separable]

Lift it to a two-dimensional space with φ(x) = (x, x²).

Non Linear Mapping

Solve a non-linear classification problem with a linear classifier:

1. Project the data x to a high dimension using a function φ(x).
2. Find a linear discriminant function for the transformed data φ(x).
3. The final nonlinear discriminant function is g(x) = wᵗφ(x) + w₀.

[Figure: the 1-D data lifted to 2-D with φ(x) = (x, x²); a line separates the two classes, giving decision regions R1 and R2 on the original axis]

In 2-D, the discriminant function is linear:

g((x, x²)) = w₁·x + w₂·x² + w₀

In 1-D, the discriminant function is not linear:

g(x) = w₁x + w₂x² + w₀
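
To make the lifting step concrete, the sketch below maps 1-D points through φ(x) = (x, x²) and fits a linear SVM in the lifted space. The sample points, their labels, and the use of scikit-learn's LinearSVC are assumptions chosen so that the 1-D problem is not separable by a single threshold.

```python
import numpy as np
from sklearn.svm import LinearSVC

# 1-D data: the positive class sits between two negative clusters,
# so no single threshold on x separates the classes.
x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([-1,   -1,    1,   1,  -1,  -1])

phi = np.column_stack([x, x ** 2])     # lift: phi(x) = (x, x^2)

clf = LinearSVC(C=10.0).fit(phi, y)
print("accuracy in lifted space:", clf.score(phi, y))   # separable, so 1.0

# The resulting discriminant g(x) = w1*x + w2*x^2 + w0 is quadratic in the original x.
w1, w2 = clf.coef_[0]
w0 = clf.intercept_[0]
print(f"g(x) = {w1:.2f}*x + {w2:.2f}*x^2 + {w0:.2f}")
```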

Non Linear Mapping: Another Example

Non Linear SVM

We can use any linear classifier after lifting the data into a higher dimensional space. However, we then have to deal with the curse of dimensionality:

1. poor generalization to test data
2. computationally expensive

SVM handles the curse of dimensionality:

1. Enforcing the largest margin permits good generalization. It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality.
2. Computation in the higher dimensional case is performed only implicitly through the use of kernel functions.

Non Linear SVM: Kernels

Recall the SVM optimization: maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ xᵢᵗxⱼ

and the classification rule

y = sign( Σᵢ aᵢ yᵢ xᵢ·x + b )

Note that the samples xᵢ appear only through the dot products xᵢᵗxⱼ and xᵢᵗx.

If we lift xᵢ to a high dimensional space F using φ(x), we need to compute the high dimensional product φ(xᵢ)ᵗφ(xⱼ): maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ φ(xᵢ)ᵗφ(xⱼ)

The dimensionality of the space F is not necessarily important. We may not even know the map φ.

Kernel
A kernel is a function that returns the value of the dot product between the images of its two arguments:

K(x, y) = φ(x)ᵗφ(y)

Given a function K, it is possible to verify that it is a kernel.

In the dual objective, the product φ(xᵢ)ᵗφ(xⱼ) is replaced by K(xᵢ, xⱼ): maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ K(xᵢ, xⱼ)

Now we only need to compute K(xᵢ, xⱼ) instead of φ(xᵢ)ᵗφ(xⱼ).

This is the kernel trick: we do not need to perform operations in the high dimensional space explicitly.
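
As a numerical illustration of the kernel trick, the sketch below checks that the degree-2 polynomial kernel K(x, y) = (x·y + 1)² equals the dot product of an explicitly lifted quadratic feature map. The particular feature map written out here (with its √2 scaling) is the standard construction for this kernel and is used only for the check.

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x·y + 1)^2, computed in the original space."""
    return (x @ y + 1.0) ** 2

def phi(x):
    """Explicit quadratic feature map for 2-D input matching the degree-2 polynomial kernel."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly2_kernel(x, y))   # kernel value computed in 2-D
print(phi(x) @ phi(y))      # same value via the 6-D lifted space
```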

Kernel Matrix
(aka the Gram matrix): the n×n matrix of kernel values K(xᵢ, xⱼ) between all pairs of training samples.

The central structure in kernel machines.
Contains all the necessary information for the learning algorithm.
Fuses information about the data AND the kernel.
Many interesting properties.
From www.support-vector.net

Mercer's Theorem
The kernel matrix is symmetric positive definite.
Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
Every (semi-)positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write

K(x, y) = φ(x)ᵗφ(y)

Positive definite means that for any f ∈ L₂,

∫∫ K(x, y) f(x) f(y) dx dy ≥ 0

From www.support-vector.net
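
The finite-sample analogue of this condition is that the Gram matrix built from any set of points has no negative eigenvalues. The following sketch checks this numerically, assuming a Gaussian RBF kernel and randomly generated points (both are illustrative choices).

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))    # random sample points (hypothetical)

# Gram matrix K[i, j] = K(x_i, x_j)
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(K)               # K is symmetric
print("smallest eigenvalue:", eigenvalues.min())  # >= 0 (up to round-off) for a valid kernel
```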

Examples of Kernels
Some common choices (both satisfying Mercer's condition):

Polynomial kernel of degree p:

K(xᵢ, xⱼ) = (xᵢᵗxⱼ + 1)^p

Gaussian radial basis kernel (the data is lifted into an infinite-dimensional space):

K(xᵢ, xⱼ) = exp( −(1/(2σ²)) ||xᵢ − xⱼ||² )
From www.support-vector.net

Example Polynomial Kernels

From www.support-vector.net

Example: the two spirals


Separated by a hyperplane in feature space (Gaussian kernels).

From www.support-vector.net

Making Kernels
The set of kernels is closed under some operations. If K and K′ are kernels, then:

K + K′ is a kernel
cK is a kernel, if c > 0
aK + bK′ is a kernel, for a, b > 0
and so on.

We can make complex kernels from simple ones: modularity!
From www.support-vector.net
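
A short sketch of this modularity: combine two base kernels with positive weights and check that the resulting Gram matrix is still positive semi-definite. The base kernels, weights, and data below are assumptions for illustration.

```python
import numpy as np

def linear_kernel_matrix(X):
    """Gram matrix of the linear kernel K(x, y) = x·y."""
    return X @ X.T

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

# a*K + b*K' with a, b > 0 is again a kernel
K_combined = 0.5 * linear_kernel_matrix(X) + 2.0 * rbf_kernel_matrix(X)

print("smallest eigenvalue:", np.linalg.eigvalsh(K_combined).min())  # >= 0 up to round-off
```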

Non Linear SVM Recipe

Start with data x₁, ..., xₙ which lives in a feature space of dimension d.
Choose a kernel K(xᵢ, xⱼ) corresponding to some function φ(xᵢ) which takes sample xᵢ to a higher dimensional space.
Find the largest margin linear discriminant function in the higher dimensional space by using a quadratic programming package to solve: maximize

L_D(a) = Σᵢ aᵢ − (1/2) Σᵢ Σⱼ aᵢ aⱼ yᵢ yⱼ K(xᵢ, xⱼ)

constrained to  0 ≤ aᵢ ≤ C for all i  and  Σᵢ aᵢ yᵢ = 0

Non Linear SVM Recipe

Weight vector w in the high dimensional space:

w = Σᵢ aᵢ yᵢ φ(xᵢ)

Linear discriminant function of largest margin in the high dimensional space:

g(φ(x)) = wᵗφ(x) + b = Σ_{xᵢ ∈ S} aᵢ yᵢ φ(xᵢ)ᵗφ(x) + b

Non-linear discriminant function in the original space:

g(x) = Σ_{xᵢ ∈ S} aᵢ yᵢ φ(xᵢ)ᵗφ(x) + b = Σ_{xᵢ ∈ S} aᵢ yᵢ K(xᵢ, x) + b

where S is the set of support vectors.
Decide class 1 if g(x) > 0, otherwise decide class 2.
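
Putting the recipe together with an off-the-shelf solver, here is a hedged scikit-learn sketch with a Gaussian RBF kernel. The synthetic two-moons data and the hyperparameter values are assumptions made only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data (hypothetical choice for the demo)
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
y = 2 * y - 1                                  # relabel to {-1, +1}

# Kernel SVM: the RBF kernel corresponds to an implicit infinite-dimensional phi
clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# g(x) = sum over support vectors of a_i y_i K(x_i, x) + b
print("number of support vectors:", len(clf.support_))
print("training accuracy:", clf.score(X, y))

# Classify a new point: decide class 1 if g(x) > 0, otherwise class 2
x_new = np.array([[0.5, 0.0]])
print("g(x_new) =", clf.decision_function(x_new), "-> label", clf.predict(x_new))
```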

Non Linear SVM

Nonlinear discriminant function:

g(x) = Σ_{xᵢ ∈ S} aᵢ yᵢ K(xᵢ, x) + b

aᵢ yᵢ is the weight of support vector xᵢ; the sum runs only over the support vectors, the most important training samples.
K(xᵢ, x) acts like an inverse distance from x to support vector xᵢ; for the Gaussian kernel

K(xᵢ, x) = exp( −(1/(2σ²)) ||xᵢ − x||² )

it is close to 1 when x is near xᵢ and decays towards 0 as x moves away.

Higher Order Polynomials


Taken from Andrew Moore
Quadratic:  φ(x) contains all d²/2 terms up to degree 2.
  Cost to build the H matrix traditionally: d² n² / 4   (2,500 n² for d = 100).
  Using φ(a)ᵗφ(b) = (aᵗb + 1)², cost to build H sneakily: d n² / 2   (50 n² for d = 100).

Cubic:  φ(x) contains all d³/6 terms up to degree 3.
  Cost traditionally: d³ n² / 12   (83,000 n² for d = 100).
  Using φ(a)ᵗφ(b) = (aᵗb + 1)³, cost sneakily: d n² / 2   (50 n² for d = 100).

Quartic:  φ(x) contains all d⁴/24 terms up to degree 4.
  Cost traditionally: d⁴ n² / 48   (1,960,000 n² for d = 100).
  Using φ(a)ᵗφ(b) = (aᵗb + 1)⁴, cost sneakily: d n² / 2   (50 n² for d = 100).

n is the number of samples, d is the number of features.

SVM Summary

Advantages:

Based on a nice theory.
Excellent generalization properties.
The objective function has no local minima.
Can be used to find non-linear discriminant functions.
The complexity of the classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space.

Disadvantages:

It is not clear how to select a kernel function in a principled manner.
Tends to be slower than other methods (in the non-linear case).
