
Classification / Regression

Support Vector Machines

Jeff Howbert, Introduction to Machine Learning, Winter 2012


Support vector machines

- Topics
  - SVM classifiers for linearly separable classes
  - SVM classifiers for non-linearly separable classes
  - SVM classifiers for nonlinear decision boundaries
    - kernel functions
  - Other applications of SVMs
  - Software



Support vector machines

Linearly separable classes

Goal: find a linear decision boundary (hyperplane) that separates the classes
Support vector machines

One possible solution



Support vector machines

Another possible solution



Support vector machines

Other possible solutions



Support vector machines

Which one is better? B1 or B2? How do you define better?



Support vector machines

Hyperplane that maximizes the margin will have better generalization


=> B1 is better than B2
Support vector machines

[Figure: candidate hyperplanes B1 and B2 with their margin boundaries (b11, b12 and b21, b22), the margin of B1, and a test sample]
Hyperplane that maximizes the margin will have better generalization
=> B1 is better than B2
Support vector machines

[Figure: maximum margin hyperplane B1 with margin boundaries b11 and b12]

    decision boundary:   w · x + b = 0
    margin boundaries:   w · x + b = +1   and   w · x + b = −1

                       {  +1  if  w · x_i + b ≥ +1
    y_i = f(x_i) =     {
                       {  −1  if  w · x_i + b ≤ −1

    margin = 2 / ||w||   (a short derivation follows below)
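For reference, here is why the margin equals 2 / ||w|| (a short derivation added here, not on the original slide):

```latex
% Take x_+ on w·x + b = +1 and x_- on w·x + b = -1, with x_+ - x_- parallel to w.
% Then, since x_+ - x_- points along the unit normal w / ||w||:
\[
  w \cdot (x_{+} - x_{-}) = (1 - b) - (-1 - b) = 2
  \quad\Longrightarrow\quad
  \text{margin} = \lVert x_{+} - x_{-} \rVert = \frac{2}{\lVert w \rVert}.
\]
```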
Support vector machines
- We want to maximize:

      margin = 2 / ||w||

- Which is equivalent to minimizing:

      L(w) = ||w||² / 2

- But subject to the following constraints:

                         {  +1  if  w · x_i + b ≥ +1
      y_i = f(x_i) =     {
                         {  −1  if  w · x_i + b ≤ −1

  - This is a constrained convex optimization problem
  - Solve with numerical approaches, e.g. quadratic programming
    (a small QP sketch follows below)
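A minimal sketch of that quadratic program, assuming cvxpy (a general-purpose convex solver not mentioned in the slides) and a hypothetical toy data set:

```python
# Hard-margin SVM primal as a QP: minimize ||w||^2 / 2
# subject to y_i (w · x_i + b) >= 1 for all i.
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two linearly separable classes in 2-D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # ||w||^2 / 2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # margin constraints
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value,
      " margin =", 2 / np.linalg.norm(w.value))
```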
Support vector machines

Solving for w that gives maximum margin:


1. Combine the objective function and constraints into a new objective
   function, using Lagrange multipliers λ_i:

      L_primal = (1/2) ||w||²  −  Σ_{i=1..N} λ_i [ y_i (w · x_i + b) − 1 ]

2. To minimize this Lagrangian, we take derivatives with respect to w and b
   and set them to 0:

      ∂L_p / ∂w = 0   ⇒   w = Σ_{i=1..N} λ_i y_i x_i

      ∂L_p / ∂b = 0   ⇒   Σ_{i=1..N} λ_i y_i = 0



Support vector machines

Solving for w that gives maximum margin:


3. Substituting and rearranging gives the dual of the Lagrangian:

      L_dual = Σ_{i=1..N} λ_i  −  (1/2) Σ_{i,j} λ_i λ_j y_i y_j ( x_i · x_j )

   which we try to maximize (not minimize).
   (The substitution is spelled out below.)
4. Once we have the λ_i, we can substitute into the previous equations to
   get w and b.
5. This defines w and b as linear combinations of the training data.
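For completeness, the substitution in step 3 can be spelled out as follows (a derivation added here, using the expressions for w and Σ λ_i y_i from the previous slide):

```latex
\[
\begin{aligned}
L_{\text{primal}}
  &= \tfrac{1}{2}\lVert w \rVert^{2}
     - \sum_{i=1}^{N} \lambda_i \bigl[ y_i (w \cdot x_i + b) - 1 \bigr] \\
\text{with } w &= \textstyle\sum_i \lambda_i y_i x_i
  \ \text{ and } \ \textstyle\sum_i \lambda_i y_i = 0: \\
\tfrac{1}{2}\lVert w \rVert^{2}
  &= \tfrac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j\, (x_i \cdot x_j), \qquad
     \sum_i \lambda_i y_i\, (w \cdot x_i)
   = \sum_{i,j} \lambda_i \lambda_j y_i y_j\, (x_i \cdot x_j), \qquad
     \sum_i \lambda_i y_i\, b = 0 \\
\Rightarrow\ L_{\text{dual}}
  &= \sum_{i=1}^{N} \lambda_i
     - \tfrac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j\, (x_i \cdot x_j).
\end{aligned}
\]
```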



Support vector machines

- Optimizing the dual is easier.
  - It is a function of the λ_i only, not of the λ_i and w.
- Convex optimization is guaranteed to find the global optimum.
- Most of the λ_i go to zero.
  - The x_i for which λ_i ≠ 0 are called the support vectors.
    These support (lie on) the margin boundaries.
  - The x_i for which λ_i = 0 lie away from the margin boundaries.
    They are not required for defining the maximum margin hyperplane.
  (A numerical sketch of solving the dual and finding the support vectors
  follows below.)
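A minimal sketch of solving the dual numerically, again assuming cvxpy and the same hypothetical toy data as before (not code from the slides):

```python
# Dual of the hard-margin SVM:
# maximize  sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j (x_i · x_j)
# subject to a_i >= 0 and sum_i a_i y_i = 0, where a_i stands for lambda_i.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

a = cp.Variable(len(y))
z = cp.multiply(a, y)                       # z_i = a_i * y_i
# The quadratic term sum_{i,j} a_i a_j y_i y_j (x_i · x_j) equals ||X^T z||^2.
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ z))
cp.Problem(objective, [a >= 0, y @ a == 0]).solve()

# Steps 4-5: w and b are linear combinations of the training data.
w = (a.value * y) @ X
sv = np.flatnonzero(a.value > 1e-6)         # support vectors: lambda_i > 0
b = y[sv[0]] - w @ X[sv[0]]                 # from y_i (w · x_i + b) = 1
print("support vectors:", sv, " w =", w, " b =", b)
```

As the slide says, most multipliers come out (numerically) zero; only the points on the margin boundaries get nonzero λ_i.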



Support vector machines

Example of solving for maximum margin hyperplane



Support vector machines

What if the classes are not linearly separable?



Support vector machines

Now which one is better? B1 or B2? How do you define better?


Support vector machines

- What if the problem is not linearly separable?

- Solution: introduce slack variables ξ_i
  - Need to minimize:

        L(w) = ||w||² / 2  +  C Σ_{i=1..N} (ξ_i)^k

  - Subject to:

                           {  +1  if  w · x_i + b ≥  1 − ξ_i
        y_i = f(x_i) =     {
                           {  −1  if  w · x_i + b ≤ −1 + ξ_i

  - C is an important hyperparameter, whose value is usually optimized by
    cross-validation (a small sketch of doing so follows below).
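A minimal sketch of choosing C by cross-validation. scikit-learn is assumed here purely for illustration (the slides list SVMlight and libSVM as software; scikit-learn's SVC wraps libSVM), and the data are synthetic:

```python
# Tune the soft-margin parameter C with 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    SVC(kernel="linear"),                      # soft-margin linear SVM
    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, # candidate values of C
    cv=5,                                      # 5-fold cross-validation
)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```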



Support vector machines

Slack variables for nonseparable data


Support vector machines

What if decision boundary is not linear?



Support vector machines
Solution: nonlinear transform of attributes
Φ: [x1, x2] → [x1, (x1 + x2)^4]



Support vector machines
Solution: nonlinear transform of attributes
Φ: [x1, x2] → [x1² − x1, x2² − x2]   (a small numerical illustration follows below)
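A minimal numpy sketch of why such a transform helps, using hypothetical data separated by a circle of radius 0.2 centered at (0.5, 0.5) (the specific data set is an assumption, not taken from the slide):

```python
# Explicit transform phi(x1, x2) = (x1^2 - x1, x2^2 - x2).
# Points inside vs. outside the circle (x1 - 0.5)^2 + (x2 - 0.5)^2 = 0.2^2
# become linearly separable, because in the transformed coordinates the
# class boundary is the straight line z1 + z2 = 0.2^2 - 0.5.
import numpy as np

def phi(X):
    """Map each row [x1, x2] to [x1^2 - x1, x2^2 - x2]."""
    return np.column_stack([X[:, 0]**2 - X[:, 0], X[:, 1]**2 - X[:, 1]])

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.where((X[:, 0] - 0.5)**2 + (X[:, 1] - 0.5)**2 > 0.2**2, 1, -1)

Z = phi(X)
# A linear rule in the transformed space reproduces the labels exactly.
print(np.array_equal(np.where(Z.sum(axis=1) > 0.2**2 - 0.5, 1, -1), y))
```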



Support vector machines

- Issues with finding useful nonlinear transforms
  - Not feasible to do manually as the number of attributes grows
    (i.e. any real-world problem)
  - Usually involves transformation to a higher-dimensional space
    - increases computational burden of SVM optimization
    - curse of dimensionality

- With SVMs, we can circumvent all of the above via the kernel trick



Support vector machines

- Kernel trick
  - Don't need to specify the attribute transform Φ(x).
  - Only need to know how to calculate the dot product of any two
    transformed samples:

        k(x1, x2) = Φ(x1) · Φ(x2)

    (a small numerical check of this identity follows after this list)
  - The kernel function k is substituted into the dual of the Lagrangian,
    allowing determination of a maximum margin hyperplane in the
    (implicitly) transformed space Φ(x).
  - All subsequent calculations, including predictions on test samples,
    are done using the kernel in place of Φ(x1) · Φ(x2).
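A small numerical check of that identity, using the degree-2 polynomial kernel as an example (the particular kernel and feature map are illustrative choices, not taken from the slide):

```python
# For k(x, z) = (x · z + 1)^2 in 2-D, an explicit 6-D transform phi exists
# with k(x, z) = phi(x) · phi(z), so the kernel computes the transformed
# dot product without ever forming phi.
import numpy as np

def phi(x):
    """Explicit feature map for (x · z + 1)^2 with x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1*x1, x2*x2,
                     np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2,
                     1.0])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])

k = (x @ z + 1.0)**2          # kernel evaluated in the original 2-D space
dot = phi(x) @ phi(z)         # dot product in the 6-D transformed space
print(np.isclose(k, dot))     # True: the two numbers agree
```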
Support vector machines

- Common kernel functions for SVM

      linear:                      k(x1, x2) = x1 · x2
      polynomial:                  k(x1, x2) = (x1 · x2 + c)^d
      Gaussian or radial basis:    k(x1, x2) = exp( −γ ||x1 − x2||² )
      sigmoid:                     k(x1, x2) = tanh( x1 · x2 + c )

  (minimal implementations follow below)
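Minimal numpy sketches of these kernels, with illustrative parameter names (γ plays the role of the Gaussian width parameter; c and d are the polynomial/sigmoid constants):

```python
import numpy as np

def linear_kernel(x1, x2):
    """k(x1, x2) = x1 · x2"""
    return x1 @ x2

def polynomial_kernel(x1, x2, c=1.0, d=3):
    """k(x1, x2) = (x1 · x2 + c)^d"""
    return (x1 @ x2 + c) ** d

def gaussian_kernel(x1, x2, gamma=1.0):
    """k(x1, x2) = exp(-gamma * ||x1 - x2||^2)"""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def sigmoid_kernel(x1, x2, c=0.0):
    """k(x1, x2) = tanh(x1 · x2 + c)"""
    return np.tanh(x1 @ x2 + c)
```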



Support vector machines

- For some kernels (e.g. Gaussian) the implicit transform Φ(x) is
  infinite-dimensional!
  - But calculations with the kernel are done in the original space, so
    computational burden and curse of dimensionality aren't a problem.
    (One way to see the infinite dimensionality is sketched below.)
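One standard way to see this (an expansion added here, not from the slide), using the Gaussian kernel with parameter γ:

```latex
\[
  \exp\!\bigl(-\gamma \lVert x_1 - x_2 \rVert^{2}\bigr)
  = e^{-\gamma \lVert x_1 \rVert^{2}}\, e^{-\gamma \lVert x_2 \rVert^{2}}\,
    e^{\,2\gamma\, x_1 \cdot x_2}
  = e^{-\gamma \lVert x_1 \rVert^{2}}\, e^{-\gamma \lVert x_2 \rVert^{2}}
    \sum_{k=0}^{\infty} \frac{(2\gamma)^{k}\,(x_1 \cdot x_2)^{k}}{k!}
\]
```

Each term (x1 · x2)^k is a polynomial kernel of degree k with a finite feature map, so the stacked implicit map Φ contains features of every degree, i.e. infinitely many coordinates.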





Support vector machines

- Applications of SVMs to machine learning
  - Classification
    - binary
    - multiclass
    - one-class
  - Regression
  - Transduction (semi-supervised learning)
  - Ranking
  - Clustering
  - Structured labels
Support vector machines

- Software
  - SVMlight
    http://svmlight.joachims.org/
  - libSVM
    http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    includes MATLAB / Octave interface

