
An Introduction to Support Vector Machines

Jinwei Gu
2008/10/16

Review: What We've Learned So Far

- Bayesian Decision Theory
- Maximum-Likelihood & Bayesian Parameter Estimation
- Nonparametric Density Estimation
  - Parzen-Window, k_n-Nearest-Neighbor
- K-Nearest Neighbor Classifier
- Decision Tree Classifier

Today: Support Vector Machine (SVM)

- A classifier derived from statistical learning theory by Vapnik et al. in 1992
- SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task
- Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
- Also used for regression (not covered today)

Reading: Chapters 5.1, 5.2, 5.3, 5.11 (5.4*) in the textbook

(Photo: V. Vapnik)

Outline

- Linear Discriminant Function
- Large Margin Linear Classifier
- Nonlinear SVM: The Kernel Trick
- Demo of SVM

Discriminant Function

- Chapter 2.4: the classifier is said to assign a feature vector x to class \omega_i if

    g_i(x) > g_j(x) \quad \text{for all } j \neq i

- For the two-category case,

    g(x) = g_1(x) - g_2(x)

  Decide \omega_1 if g(x) > 0; otherwise decide \omega_2.

- An example we've learned before: the minimum-error-rate classifier

    g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)

Discriminant Function

- It can be an arbitrary function of x, such as:
  - Nearest Neighbor
  - Decision Tree
  - Linear Functions:  g(x) = w^T x + b
  - Nonlinear Functions

Linear Discriminant Function

- g(x) is a linear function (a small sketch follows this slide):

    g(x) = w^T x + b

- The decision boundary w^T x + b = 0 is a hyperplane in the feature space: it separates the region where w^T x + b > 0 from the region where w^T x + b < 0.

- The (unit-length) normal vector of the hyperplane:

    n = \frac{w}{\|w\|}

(Figure: the hyperplane w^T x + b = 0 in the (x_1, x_2) plane, with normal vector n and the two half-spaces labeled.)
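The decision rule can be evaluated directly. Below is a minimal sketch in Python (not from the slides); the weight vector w, the bias b, and the test points are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters for w^T x + b = 0 (illustration only).
w = np.array([2.0, -1.0])
b = -0.5

def g(x):
    """Linear discriminant function g(x) = w^T x + b."""
    return w @ x + b

# Decide class +1 if g(x) > 0, otherwise class -1.
for x in [np.array([1.0, 0.0]), np.array([0.0, 2.0])]:
    label = 1 if g(x) > 0 else -1
    print(f"x = {x}, g(x) = {g(x):+.2f}, predicted class = {label:+d}")

# Unit-length normal vector of the hyperplane.
print("n =", w / np.linalg.norm(w))
```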


Linear Discriminant Function

- How would you classify these points using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!
- Which one is the best?

(Figure: 2D data points of two classes, +1 and -1, with several candidate separating lines drawn through them.)

Large Margin Linear Classifier

- The linear discriminant function (classifier) with the maximum margin is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point.
- Why is it the best?
  - It is robust to outliers and thus has strong generalization ability.

(Figure: two classes, +1 and -1, separated by a boundary with a "safe zone" on each side; the total width of the safe zone is the margin.)

Large Margin Linear Classifier

- Given a set of data points {(x_i, y_i)}, i = 1, 2, \ldots, n, where

    \text{for } y_i = +1, \quad w^T x_i + b > 0
    \text{for } y_i = -1, \quad w^T x_i + b < 0

- With a scale transformation on both w and b, the above is equivalent to

    \text{for } y_i = +1, \quad w^T x_i + b \ge 1
    \text{for } y_i = -1, \quad w^T x_i + b \le -1

Large Margin Linear Classifier

- We know that the points closest to the boundary (the support vectors) satisfy

    w^T x^+ + b = +1
    w^T x^- + b = -1

- The margin width is (see the derivation below):

    M = (x^+ - x^-) \cdot n = (x^+ - x^-) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}

(Figure: the hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = -1; the support vectors x^+ and x^- lie on these boundaries.)
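Filling in the algebra behind the last equality (a worked step that uses only the two margin-boundary equations above):

```latex
% Subtracting  w^T x^- + b = -1  from  w^T x^+ + b = +1  gives
%     w^T (x^+ - x^-) = 2.
% Projecting (x^+ - x^-) onto the unit normal n = w / \|w\| then yields
\[
M = (x^+ - x^-) \cdot \frac{w}{\|w\|}
  = \frac{w^T (x^+ - x^-)}{\|w\|}
  = \frac{2}{\|w\|}.
\]
```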

Large Margin Linear Classifier

- Formulation:

    \text{maximize} \quad \frac{2}{\|w\|}

  such that

    \text{for } y_i = +1, \quad w^T x_i + b \ge 1
    \text{for } y_i = -1, \quad w^T x_i + b \le -1

Large Margin Linear Classifier

- Formulation:

    \text{minimize} \quad \frac{1}{2} \|w\|^2

  such that

    \text{for } y_i = +1, \quad w^T x_i + b \ge 1
    \text{for } y_i = -1, \quad w^T x_i + b \le -1

Large Margin Linear Classifier

- Formulation:

    \text{minimize} \quad \frac{1}{2} \|w\|^2

  such that

    y_i (w^T x_i + b) \ge 1

Solving the Optimization Problem

- A quadratic programming problem with linear constraints:

    \text{minimize} \quad \frac{1}{2} \|w\|^2
    \text{s.t.} \quad y_i (w^T x_i + b) \ge 1

- Lagrangian function:

    \text{minimize} \quad L_p(w, b, \alpha_i) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]
    \text{s.t.} \quad \alpha_i \ge 0
Solving the Optimization Problem

    \text{minimize} \quad L_p(w, b, \alpha_i) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]
    \text{s.t.} \quad \alpha_i \ge 0

- Setting the partial derivatives of L_p with respect to w and b to zero:

    \frac{\partial L_p}{\partial w} = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i

    \frac{\partial L_p}{\partial b} = 0 \quad \Rightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0
Solving the Optimization Problem

    \text{minimize} \quad L_p(w, b, \alpha_i) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]
    \text{s.t.} \quad \alpha_i \ge 0

- Lagrangian dual problem:

    \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{s.t.} \quad \alpha_i \ge 0, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

Solving the Optimization Problem

- From the KKT condition, we know:

    \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0

  Thus, only support vectors have \alpha_i \ne 0.

- The solution has the form:

    w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i

  Get b from y_i (w^T x_i + b) - 1 = 0, where x_i is a support vector.

(Figure: the support vectors are the points lying on the margin boundaries w^T x + b = +1 and w^T x + b = -1.)

Solving the Optimization Problem

- The linear discriminant function is (a worked sketch follows this slide):

    g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i x_i^T x + b

- Notice it relies on a dot product between the test point x and the support vectors x_i.
- Also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
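The dual problem above can be handed to any off-the-shelf QP solver. The following is a rough sketch assuming the cvxopt package (the slides only say "quadratic programming") on a tiny made-up linearly separable dataset: it solves for the \alpha_i, recovers w = \sum_i \alpha_i y_i x_i and b from the support vectors, and evaluates g(x).

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

# Tiny made-up, linearly separable 2D training set (illustration only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])   # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual: maximize sum_i a_i - 1/2 a^T P a, with P_ij = y_i y_j x_i^T x_j.
# cvxopt minimizes 1/2 a^T P a + q^T a subject to G a <= h, A a = b.
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-10 * np.eye(n))  # tiny ridge for stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))           # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))     # sum_i a_i y_i = 0
b_eq = matrix(np.zeros(1))

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])

sv = alpha > 1e-6                      # support vectors have alpha_i > 0
w = (alpha[sv] * y[sv]) @ X[sv]        # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)         # from y_i (w^T x_i + b) = 1

print("support vectors:\n", X[sv])
print("g([2, 1]) =", w @ np.array([2.0, 1.0]) + b)
```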

Large Margin Linear Classifier

- What if the data is not linearly separable? (noisy data, outliers, etc.)
- Slack variables \xi_i can be added to allow misclassification of difficult or noisy data points.

(Figure: the margin boundaries w^T x + b = \pm 1 with points on the wrong side of their margin boundary; the distances by which they violate it are the slack values \xi_i.)

Large Margin Linear Classifier

- Formulation:

    \text{minimize} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i

  such that

    y_i (w^T x_i + b) \ge 1 - \xi_i
    \xi_i \ge 0

- Parameter C can be viewed as a way to control over-fitting (see the sketch below).
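As an illustrative sketch (not part of the slides), scikit-learn's SVC implements this soft-margin formulation; on a noisy made-up dataset, a small C tolerates more margin violations (typically more support vectors), while a large C penalizes them more heavily.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Two overlapping, made-up Gaussian blobs (noisy, not perfectly separable).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Sweep C: the trade-off between a wide margin and few slack violations.
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C = {C:>6}: {clf.support_vectors_.shape[0]:3d} support vectors, "
          f"training accuracy = {clf.score(X, y):.2f}")
```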

Large Margin Linear Classifier

- Formulation (Lagrangian dual problem):

    \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

  such that

    0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

Non-linear SVMs

- Datasets that are linearly separable with noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space?

(Figure: a 1D dataset on the x axis that cannot be separated by a single threshold; mapping each point x to (x, x^2) makes it linearly separable.)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Non-linear SVMs: Feature Space

- General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

    \Phi: x \rightarrow \phi(x)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Nonlinear SVMs: The Kernel Trick

- With this mapping, our discriminant function is now:

    g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i y_i \phi(x_i)^T \phi(x) + b

- There is no need to know this mapping explicitly, because we only use the dot product of feature vectors, in both training and testing.

- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

    K(x_i, x_j) = \phi(x_i)^T \phi(x_j)

Nonlinear SVMs: The Kernel Trick

- An example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.

  We need to show that K(x_i, x_j) = \phi(x_i)^T \phi(x_j):

    K(x_i, x_j) = (1 + x_i^T x_j)^2
                = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
                = [1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2}]^T \, [1,\; x_{j1}^2,\; \sqrt{2}\, x_{j1} x_{j2},\; x_{j2}^2,\; \sqrt{2}\, x_{j1},\; \sqrt{2}\, x_{j2}]
                = \phi(x_i)^T \phi(x_j),

  where \phi(x) = [1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2].

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
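A quick numerical check of the identity above (a sketch; phi below is the explicit degree-2 feature map written on the slide):

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel K(xi, xj) = (1 + xi^T xj)^2."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2D input."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj))            # kernel value computed in the input space
print(phi(xi) @ phi(xj))    # same value computed in the expanded feature space
```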

Nonlinear SVMs: The Kernel Trick

- Examples of commonly used kernel functions (written out as code below):

  Linear kernel:

    K(x_i, x_j) = x_i^T x_j

  Polynomial kernel:

    K(x_i, x_j) = (1 + x_i^T x_j)^p

  Gaussian (Radial Basis Function, RBF) kernel:

    K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)

  Sigmoid kernel:

    K(x_i, x_j) = \tanh(\beta_0 \, x_i^T x_j + \beta_1)

- In general, functions that satisfy Mercer's condition can be kernel functions.
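The same four kernels written out as plain numpy functions (a sketch; the parameters p, sigma, beta0, beta1 are free choices, not values given in the slides):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(f"{k.__name__}: {k(xi, xj):.4f}")
```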

Nonlinear SVM: Optimization

- Formulation (Lagrangian dual problem):

    \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

  such that

    0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

- The solution of the discriminant function is

    g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

- The optimization technique is the same.

Support Vector Machine: Algorithm

1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages are available)
4. Construct the discriminant function from the support vectors
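The four steps above, as a rough end-to-end sketch using scikit-learn's SVC (which wraps LIBSVM); the RBF kernel, the value of C, and the XOR-like toy data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# A made-up two-class dataset that is not linearly separable (XOR-like).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# Steps 1-3: choose a kernel and C, then let the QP solver do the work.
clf = SVC(kernel='rbf', C=10.0, gamma=1.0)
clf.fit(X, y)

# Step 4: the discriminant function is built from the support vectors.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
print("g(x) at [0.5, 0.5]:", clf.decision_function([[0.5, 0.5]]))
```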

Some Issues

- Choice of kernel
  - The Gaussian or polynomial kernel is the default choice
  - If ineffective, more elaborate kernels are needed
  - Domain experts can give assistance in formulating appropriate similarity measures

- Choice of kernel parameters
  - e.g., \sigma in the Gaussian kernel
  - \sigma is the distance between the closest points with different classifications
  - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters (a sketch follows this slide)

- Optimization criterion: hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
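A minimal cross-validation sketch for setting C and the Gaussian-kernel width, assuming scikit-learn's GridSearchCV (any validation-set search would do); the dataset and the parameter grid are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # assumes scikit-learn is installed

# Made-up data with a nonlinear (radial) class boundary.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# 5-fold cross-validation over a small, arbitrary grid; sklearn's gamma
# plays the role of 1 / (2 sigma^2) in the Gaussian kernel above.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", round(search.best_score_, 3))
```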

Summary: Support Vector Machine

1. Large Margin Classifier
   - Better generalization ability & less over-fitting

2. The Kernel Trick
   - Map data points to a higher-dimensional space in order to make them linearly separable.
   - Since only the dot product is used, we do not need to represent the mapping explicitly.

Additional Resource

- http://www.kernel-machines.org/

Demo of LibSVM

- http://www.csie.ntu.edu.tw/~cjlin/libsvm/
