
Classification / Regression

Support Vector Machines

Jeff Howbert, Introduction to Machine Learning, Winter 2012


Support vector machines

- Topics
  - SVM classifiers for linearly separable classes
  - SVM classifiers for non-linearly separable classes
  - SVM classifiers for nonlinear decision boundaries
    - kernel functions
  - Other applications of SVMs
  - Software



Support vector machines

Linearly separable classes

Goal: find a linear decision boundary (hyperplane) that separates the classes
Support vector machines

One possible solution



Support vector machines

Another possible solution



Support vector machines

Other possible solutions



Support vector machines

Which one is better? B1 or B2? How do you define better?



Support vector machines

Hyperplane that maximizes the margin will have better generalization


=> B1 is better than B2
Support vector machines

[Figure: candidate hyperplanes B1 and B2 with their margin boundaries (b11, b12 and b21, b22), the margin of B1, and a test sample]
Hyperplane that maximizes the margin will have better generalization
=> B1 is better than B2
Support vector machines

[Figure: maximum margin hyperplane B1 with margin boundaries b11 and b12]

    decision boundary:   w · x + b = 0
    margin boundaries:   w · x + b = +1   and   w · x + b = −1

                       {  +1  if  w · x_i + b ≥ +1
    y_i = f(x_i) =     {
                       {  −1  if  w · x_i + b ≤ −1

    margin = 2 / ||w||   (a short derivation follows below)
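For reference, here is why the margin equals 2 / ||w|| (a short derivation added here, not on the original slide):

```latex
% Take x_+ on w·x + b = +1 and x_- on w·x + b = -1, with x_+ - x_- parallel to w.
% Then, since x_+ - x_- points along the unit normal w / ||w||:
\[
  w \cdot (x_{+} - x_{-}) = (1 - b) - (-1 - b) = 2
  \quad\Longrightarrow\quad
  \text{margin} = \lVert x_{+} - x_{-} \rVert = \frac{2}{\lVert w \rVert}.
\]
```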
Support vector machines
- We want to maximize:

      margin = 2 / ||w||

- Which is equivalent to minimizing:

      L(w) = ||w||² / 2

- But subject to the following constraints:

                         {  +1  if  w · x_i + b ≥ +1
      y_i = f(x_i) =     {
                         {  −1  if  w · x_i + b ≤ −1

  - This is a constrained convex optimization problem
  - Solve with numerical approaches, e.g. quadratic programming
    (a small QP sketch follows below)
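A minimal sketch of that quadratic program, assuming cvxpy (a general-purpose convex solver not mentioned in the slides) and a hypothetical toy data set:

```python
# Hard-margin SVM primal as a QP: minimize ||w||^2 / 2
# subject to y_i (w · x_i + b) >= 1 for all i.
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two linearly separable classes in 2-D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # ||w||^2 / 2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # margin constraints
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value,
      " margin =", 2 / np.linalg.norm(w.value))
```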
Support vector machines

Solving for w that gives maximum margin:


1. Combine the objective function and constraints into a new objective
   function, using Lagrange multipliers λ_i:

      L_primal = (1/2) ||w||²  −  Σ_{i=1..N} λ_i [ y_i (w · x_i + b) − 1 ]

2. To minimize this Lagrangian, we take derivatives with respect to w and b
   and set them to 0:

      ∂L_p / ∂w = 0   ⇒   w = Σ_{i=1..N} λ_i y_i x_i

      ∂L_p / ∂b = 0   ⇒   Σ_{i=1..N} λ_i y_i = 0



Support vector machines

Solving for w that gives maximum margin:


3. Substituting and rearranging gives the dual of the Lagrangian:

      L_dual = Σ_{i=1..N} λ_i  −  (1/2) Σ_{i,j} λ_i λ_j y_i y_j ( x_i · x_j )

   which we try to maximize (not minimize).
   (The substitution is spelled out below.)
4. Once we have the λ_i, we can substitute into the previous equations to
   get w and b.
5. This defines w and b as linear combinations of the training data.
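For completeness, the substitution in step 3 can be spelled out as follows (a derivation added here, using the expressions for w and Σ λ_i y_i from the previous slide):

```latex
\[
\begin{aligned}
L_{\text{primal}}
  &= \tfrac{1}{2}\lVert w \rVert^{2}
     - \sum_{i=1}^{N} \lambda_i \bigl[ y_i (w \cdot x_i + b) - 1 \bigr] \\
\text{with } w &= \textstyle\sum_i \lambda_i y_i x_i
  \ \text{ and } \ \textstyle\sum_i \lambda_i y_i = 0: \\
\tfrac{1}{2}\lVert w \rVert^{2}
  &= \tfrac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j\, (x_i \cdot x_j), \qquad
     \sum_i \lambda_i y_i\, (w \cdot x_i)
   = \sum_{i,j} \lambda_i \lambda_j y_i y_j\, (x_i \cdot x_j), \qquad
     \sum_i \lambda_i y_i\, b = 0 \\
\Rightarrow\ L_{\text{dual}}
  &= \sum_{i=1}^{N} \lambda_i
     - \tfrac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j\, (x_i \cdot x_j).
\end{aligned}
\]
```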



Support vector machines

- Optimizing the dual is easier.
  - It is a function of the λ_i only, not of the λ_i and w.
- Convex optimization is guaranteed to find the global optimum.
- Most of the λ_i go to zero.
  - The x_i for which λ_i ≠ 0 are called the support vectors.
    These support (lie on) the margin boundaries.
  - The x_i for which λ_i = 0 lie away from the margin boundaries.
    They are not required for defining the maximum margin hyperplane.
  (A numerical sketch of solving the dual and finding the support vectors
  follows below.)
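A minimal sketch of solving the dual numerically, again assuming cvxpy and the same hypothetical toy data as before (not code from the slides):

```python
# Dual of the hard-margin SVM:
# maximize  sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j (x_i · x_j)
# subject to a_i >= 0 and sum_i a_i y_i = 0, where a_i stands for lambda_i.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

a = cp.Variable(len(y))
z = cp.multiply(a, y)                       # z_i = a_i * y_i
# The quadratic term sum_{i,j} a_i a_j y_i y_j (x_i · x_j) equals ||X^T z||^2.
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ z))
cp.Problem(objective, [a >= 0, y @ a == 0]).solve()

# Steps 4-5: w and b are linear combinations of the training data.
w = (a.value * y) @ X
sv = np.flatnonzero(a.value > 1e-6)         # support vectors: lambda_i > 0
b = y[sv[0]] - w @ X[sv[0]]                 # from y_i (w · x_i + b) = 1
print("support vectors:", sv, " w =", w, " b =", b)
```

As the slide says, most multipliers come out (numerically) zero; only the points on the margin boundaries get nonzero λ_i.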



Support vector machines

Example of solving for maximum margin hyperplane



Support vector machines

What if the classes are not linearly separable?



Support vector machines

Now which one is better? B1 or B2? How do you define better?


Support vector machines

- What if the problem is not linearly separable?

- Solution: introduce slack variables ξ_i
  - Need to minimize:

        L(w) = ||w||² / 2  +  C Σ_{i=1..N} (ξ_i)^k

  - Subject to:

                           {  +1  if  w · x_i + b ≥  1 − ξ_i
        y_i = f(x_i) =     {
                           {  −1  if  w · x_i + b ≤ −1 + ξ_i

  - C is an important hyperparameter, whose value is usually optimized by
    cross-validation (a small sketch of doing so follows below).
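A minimal sketch of choosing C by cross-validation. scikit-learn is assumed here purely for illustration (the slides list SVMlight and libSVM as software; scikit-learn's SVC wraps libSVM), and the data are synthetic:

```python
# Tune the soft-margin parameter C with 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    SVC(kernel="linear"),                      # soft-margin linear SVM
    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, # candidate values of C
    cv=5,                                      # 5-fold cross-validation
)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```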



Support vector machines

Slack variables for nonseparable data


Support vector machines

What if decision boundary is not linear?



Support vector machines
Solution: nonlinear transform of attributes
Φ: [x1, x2] → [x1, (x1 + x2)^4]



Support vector machines
Solution: nonlinear transform of attributes
Φ: [x1, x2] → [x1² − x1, x2² − x2]   (a small numerical illustration follows below)
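A minimal numpy sketch of why such a transform helps, using hypothetical data separated by a circle of radius 0.2 centered at (0.5, 0.5) (the specific data set is an assumption, not taken from the slide):

```python
# Explicit transform phi(x1, x2) = (x1^2 - x1, x2^2 - x2).
# Points inside vs. outside the circle (x1 - 0.5)^2 + (x2 - 0.5)^2 = 0.2^2
# become linearly separable, because in the transformed coordinates the
# class boundary is the straight line z1 + z2 = 0.2^2 - 0.5.
import numpy as np

def phi(X):
    """Map each row [x1, x2] to [x1^2 - x1, x2^2 - x2]."""
    return np.column_stack([X[:, 0]**2 - X[:, 0], X[:, 1]**2 - X[:, 1]])

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.where((X[:, 0] - 0.5)**2 + (X[:, 1] - 0.5)**2 > 0.2**2, 1, -1)

Z = phi(X)
# A linear rule in the transformed space reproduces the labels exactly.
print(np.array_equal(np.where(Z.sum(axis=1) > 0.2**2 - 0.5, 1, -1), y))
```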



Support vector machines

- Issues with finding useful nonlinear transforms
  - Not feasible to do manually as the number of attributes grows
    (i.e. any real-world problem)
  - Usually involves transformation to a higher-dimensional space
    - increases computational burden of SVM optimization
    - curse of dimensionality

- With SVMs, we can circumvent all of the above via the kernel trick



Support vector machines

- Kernel trick
  - Don't need to specify the attribute transform Φ(x).
  - Only need to know how to calculate the dot product of any two
    transformed samples:

        k(x1, x2) = Φ(x1) · Φ(x2)

    (a small numerical check of this identity follows after this list)
  - The kernel function k is substituted into the dual of the Lagrangian,
    allowing determination of a maximum margin hyperplane in the
    (implicitly) transformed space Φ(x).
  - All subsequent calculations, including predictions on test samples,
    are done using the kernel in place of Φ(x1) · Φ(x2).
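A small numerical check of that identity, using the degree-2 polynomial kernel as an example (the particular kernel and feature map are illustrative choices, not taken from the slide):

```python
# For k(x, z) = (x · z + 1)^2 in 2-D, an explicit 6-D transform phi exists
# with k(x, z) = phi(x) · phi(z), so the kernel computes the transformed
# dot product without ever forming phi.
import numpy as np

def phi(x):
    """Explicit feature map for (x · z + 1)^2 with x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1*x1, x2*x2,
                     np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2,
                     1.0])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])

k = (x @ z + 1.0)**2          # kernel evaluated in the original 2-D space
dot = phi(x) @ phi(z)         # dot product in the 6-D transformed space
print(np.isclose(k, dot))     # True: the two numbers agree
```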
Support vector machines

- Common kernel functions for SVM

      linear:                      k(x1, x2) = x1 · x2
      polynomial:                  k(x1, x2) = (x1 · x2 + c)^d
      Gaussian or radial basis:    k(x1, x2) = exp( −γ ||x1 − x2||² )
      sigmoid:                     k(x1, x2) = tanh( x1 · x2 + c )

  (minimal implementations follow below)
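Minimal numpy sketches of these kernels, with illustrative parameter names (γ plays the role of the Gaussian width parameter; c and d are the polynomial/sigmoid constants):

```python
import numpy as np

def linear_kernel(x1, x2):
    """k(x1, x2) = x1 · x2"""
    return x1 @ x2

def polynomial_kernel(x1, x2, c=1.0, d=3):
    """k(x1, x2) = (x1 · x2 + c)^d"""
    return (x1 @ x2 + c) ** d

def gaussian_kernel(x1, x2, gamma=1.0):
    """k(x1, x2) = exp(-gamma * ||x1 - x2||^2)"""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def sigmoid_kernel(x1, x2, c=0.0):
    """k(x1, x2) = tanh(x1 · x2 + c)"""
    return np.tanh(x1 @ x2 + c)
```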



Support vector machines

- For some kernels (e.g. Gaussian) the implicit transform Φ(x) is
  infinite-dimensional!
  - But calculations with the kernel are done in the original space, so
    computational burden and curse of dimensionality aren't a problem.
    (One way to see the infinite dimensionality is sketched below.)
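One standard way to see this (an expansion added here, not from the slide), using the Gaussian kernel with parameter γ:

```latex
\[
  \exp\!\bigl(-\gamma \lVert x_1 - x_2 \rVert^{2}\bigr)
  = e^{-\gamma \lVert x_1 \rVert^{2}}\, e^{-\gamma \lVert x_2 \rVert^{2}}\,
    e^{\,2\gamma\, x_1 \cdot x_2}
  = e^{-\gamma \lVert x_1 \rVert^{2}}\, e^{-\gamma \lVert x_2 \rVert^{2}}
    \sum_{k=0}^{\infty} \frac{(2\gamma)^{k}\,(x_1 \cdot x_2)^{k}}{k!}
\]
```

Each term (x1 · x2)^k is a polynomial kernel of degree k with a finite feature map, so the stacked implicit map Φ contains features of every degree, i.e. infinitely many coordinates.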





Support vector machines

- Applications of SVMs to machine learning
  - Classification
    - binary
    - multiclass
    - one-class
  - Regression
  - Transduction (semi-supervised learning)
  - Ranking
  - Clustering
  - Structured labels
Support vector machines

- Software
  - SVMlight
    http://svmlight.joachims.org/
  - libSVM
    http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    includes MATLAB / Octave interface

