
An Introduction to Support Vector Machines

Jinwei Gu
2008/10/16

Review: What We've Learned So Far

- Bayesian Decision Theory
- Maximum-Likelihood & Bayesian Parameter Estimation
- Nonparametric Density Estimation
  - Parzen-Window, k_n-Nearest-Neighbor
- K-Nearest Neighbor Classifier
- Decision Tree Classifier

Today: Support Vector Machine (SVM)

- A classifier derived from statistical learning theory by Vapnik et al. in 1992
- SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task
- Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
- Also used for regression (not covered today)

Reading: Chapters 5.1, 5.2, 5.3, 5.11 (5.4*) in the textbook

(Photo: V. Vapnik)

Outline

- Linear Discriminant Function
- Large Margin Linear Classifier
- Nonlinear SVM: The Kernel Trick
- Demo of SVM

Discriminant Function

- Chapter 2.4: the classifier is said to assign a feature vector x to class \omega_i if

    g_i(x) > g_j(x) \quad \text{for all } j \neq i

- For the two-category case,

    g(x) = g_1(x) - g_2(x)

  Decide \omega_1 if g(x) > 0; otherwise decide \omega_2.

- An example we've learned before: the minimum-error-rate classifier

    g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)

Discriminant Function

- It can be an arbitrary function of x, such as:
  - Nearest Neighbor
  - Decision Tree
  - Linear Functions:  g(x) = w^T x + b
  - Nonlinear Functions

Linear Discriminant Function

- g(x) is a linear function (a small sketch follows this slide):

    g(x) = w^T x + b

- The decision boundary w^T x + b = 0 is a hyperplane in the feature space: it separates the region where w^T x + b > 0 from the region where w^T x + b < 0.

- The (unit-length) normal vector of the hyperplane:

    n = \frac{w}{\|w\|}

(Figure: the hyperplane w^T x + b = 0 in the (x_1, x_2) plane, with normal vector n and the two half-spaces labeled.)
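The decision rule can be evaluated directly. Below is a minimal sketch in Python (not from the slides); the weight vector w, the bias b, and the test points are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters for w^T x + b = 0 (illustration only).
w = np.array([2.0, -1.0])
b = -0.5

def g(x):
    """Linear discriminant function g(x) = w^T x + b."""
    return w @ x + b

# Decide class +1 if g(x) > 0, otherwise class -1.
for x in [np.array([1.0, 0.0]), np.array([0.0, 2.0])]:
    label = 1 if g(x) > 0 else -1
    print(f"x = {x}, g(x) = {g(x):+.2f}, predicted class = {label:+d}")

# Unit-length normal vector of the hyperplane.
print("n =", w / np.linalg.norm(w))
```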


Linear Discriminant Function

- How would you classify these points using a linear discriminant function in order to minimize the error rate?
- Infinite number of answers!
- Which one is the best?

(Figure: 2D data points of two classes, +1 and -1, with several candidate separating lines drawn through them.)

Large Margin Linear Classifier

- The linear discriminant function (classifier) with the maximum margin is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point.
- Why is it the best?
  - It is robust to outliers and thus has strong generalization ability.

(Figure: two classes, +1 and -1, separated by a boundary with a "safe zone" on each side; the total width of the safe zone is the margin.)

Large Margin Linear Classifier

- Given a set of data points {(x_i, y_i)}, i = 1, 2, \ldots, n, where

    \text{for } y_i = +1, \quad w^T x_i + b > 0
    \text{for } y_i = -1, \quad w^T x_i + b < 0

- With a scale transformation on both w and b, the above is equivalent to

    \text{for } y_i = +1, \quad w^T x_i + b \ge 1
    \text{for } y_i = -1, \quad w^T x_i + b \le -1

Large Margin Linear Classifier

- We know that the points closest to the boundary (the support vectors) satisfy

    w^T x^+ + b = +1
    w^T x^- + b = -1

- The margin width is (see the derivation below):

    M = (x^+ - x^-) \cdot n = (x^+ - x^-) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}

(Figure: the hyperplane w^T x + b = 0 with margin boundaries w^T x + b = +1 and w^T x + b = -1; the support vectors x^+ and x^- lie on these boundaries.)
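Filling in the algebra behind the last equality (a worked step that uses only the two margin-boundary equations above):

```latex
% Subtracting  w^T x^- + b = -1  from  w^T x^+ + b = +1  gives
%     w^T (x^+ - x^-) = 2.
% Projecting (x^+ - x^-) onto the unit normal n = w / \|w\| then yields
\[
M = (x^+ - x^-) \cdot \frac{w}{\|w\|}
  = \frac{w^T (x^+ - x^-)}{\|w\|}
  = \frac{2}{\|w\|}.
\]
```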

Large Margin Linear Classifier

- Formulation:

    \text{maximize} \quad \frac{2}{\|w\|}

  such that

    \text{for } y_i = +1, \quad w^T x_i + b \ge 1
    \text{for } y_i = -1, \quad w^T x_i + b \le -1

Large Margin Linear Classifier

- Formulation:

    \text{minimize} \quad \frac{1}{2} \|w\|^2

  such that

    \text{for } y_i = +1, \quad w^T x_i + b \ge 1
    \text{for } y_i = -1, \quad w^T x_i + b \le -1

Large Margin Linear Classifier

- Formulation:

    \text{minimize} \quad \frac{1}{2} \|w\|^2

  such that

    y_i (w^T x_i + b) \ge 1

Solving the Optimization Problem

- A quadratic programming problem with linear constraints:

    \text{minimize} \quad \frac{1}{2} \|w\|^2
    \text{s.t.} \quad y_i (w^T x_i + b) \ge 1

- Lagrangian function:

    \text{minimize} \quad L_p(w, b, \alpha_i) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]
    \text{s.t.} \quad \alpha_i \ge 0
Solving the Optimization Problem

    \text{minimize} \quad L_p(w, b, \alpha_i) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]
    \text{s.t.} \quad \alpha_i \ge 0

- Setting the partial derivatives of L_p with respect to w and b to zero:

    \frac{\partial L_p}{\partial w} = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i

    \frac{\partial L_p}{\partial b} = 0 \quad \Rightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0
Solving the Optimization Problem

    \text{minimize} \quad L_p(w, b, \alpha_i) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]
    \text{s.t.} \quad \alpha_i \ge 0

- Lagrangian dual problem:

    \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{s.t.} \quad \alpha_i \ge 0, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

Solving the Optimization Problem

- From the KKT condition, we know:

    \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0

  Thus, only support vectors have \alpha_i \ne 0.

- The solution has the form:

    w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i

  Get b from y_i (w^T x_i + b) - 1 = 0, where x_i is a support vector.

(Figure: the support vectors are the points lying on the margin boundaries w^T x + b = +1 and w^T x + b = -1.)

Solving the Optimization Problem

- The linear discriminant function is (a worked sketch follows this slide):

    g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i x_i^T x + b

- Notice it relies on a dot product between the test point x and the support vectors x_i.
- Also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
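The dual problem above can be handed to any off-the-shelf QP solver. The following is a rough sketch assuming the cvxopt package (the slides only say "quadratic programming") on a tiny made-up linearly separable dataset: it solves for the \alpha_i, recovers w = \sum_i \alpha_i y_i x_i and b from the support vectors, and evaluates g(x).

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

# Tiny made-up, linearly separable 2D training set (illustration only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])   # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual: maximize sum_i a_i - 1/2 a^T P a, with P_ij = y_i y_j x_i^T x_j.
# cvxopt minimizes 1/2 a^T P a + q^T a subject to G a <= h, A a = b.
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-10 * np.eye(n))  # tiny ridge for stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))           # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))     # sum_i a_i y_i = 0
b_eq = matrix(np.zeros(1))

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])

sv = alpha > 1e-6                      # support vectors have alpha_i > 0
w = (alpha[sv] * y[sv]) @ X[sv]        # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)         # from y_i (w^T x_i + b) = 1

print("support vectors:\n", X[sv])
print("g([2, 1]) =", w @ np.array([2.0, 1.0]) + b)
```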

Large Margin Linear Classifier

- What if the data is not linearly separable? (noisy data, outliers, etc.)
- Slack variables \xi_i can be added to allow misclassification of difficult or noisy data points.

(Figure: the margin boundaries w^T x + b = \pm 1 with points on the wrong side of their margin boundary; the distances by which they violate it are the slack values \xi_i.)

Large Margin Linear Classifier

- Formulation:

    \text{minimize} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i

  such that

    y_i (w^T x_i + b) \ge 1 - \xi_i
    \xi_i \ge 0

- Parameter C can be viewed as a way to control over-fitting (see the sketch below).
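As an illustrative sketch (not part of the slides), scikit-learn's SVC implements this soft-margin formulation; on a noisy made-up dataset, a small C tolerates more margin violations (typically more support vectors), while a large C penalizes them more heavily.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Two overlapping, made-up Gaussian blobs (noisy, not perfectly separable).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Sweep C: the trade-off between a wide margin and few slack violations.
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C = {C:>6}: {clf.support_vectors_.shape[0]:3d} support vectors, "
          f"training accuracy = {clf.score(X, y):.2f}")
```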

Large Margin Linear Classifier

- Formulation (Lagrangian dual problem):

    \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

  such that

    0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

Non-linear SVMs

- Datasets that are linearly separable with noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space?

(Figure: a 1D dataset on the x axis that cannot be separated by a single threshold; mapping each point x to (x, x^2) makes it linearly separable.)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Non-linear SVMs: Feature Space

- General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

    \Phi: x \rightarrow \phi(x)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Nonlinear SVMs: The Kernel Trick

- With this mapping, our discriminant function is now:

    g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i y_i \phi(x_i)^T \phi(x) + b

- There is no need to know this mapping explicitly, because we only use the dot product of feature vectors, in both training and testing.

- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

    K(x_i, x_j) = \phi(x_i)^T \phi(x_j)

Nonlinear SVMs: The Kernel Trick

- An example: 2-dimensional vectors x = [x_1, x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.

  We need to show that K(x_i, x_j) = \phi(x_i)^T \phi(x_j):

    K(x_i, x_j) = (1 + x_i^T x_j)^2
                = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
                = [1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2}]^T \, [1,\; x_{j1}^2,\; \sqrt{2}\, x_{j1} x_{j2},\; x_{j2}^2,\; \sqrt{2}\, x_{j1},\; \sqrt{2}\, x_{j2}]
                = \phi(x_i)^T \phi(x_j),

  where \phi(x) = [1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2].

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
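A quick numerical check of the identity above (a sketch; phi below is the explicit degree-2 feature map written on the slide):

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel K(xi, xj) = (1 + xi^T xj)^2."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2D input."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj))            # kernel value computed in the input space
print(phi(xi) @ phi(xj))    # same value computed in the expanded feature space
```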

Nonlinear SVMs: The Kernel Trick

- Examples of commonly used kernel functions (written out as code below):

  Linear kernel:

    K(x_i, x_j) = x_i^T x_j

  Polynomial kernel:

    K(x_i, x_j) = (1 + x_i^T x_j)^p

  Gaussian (Radial Basis Function, RBF) kernel:

    K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)

  Sigmoid kernel:

    K(x_i, x_j) = \tanh(\beta_0 \, x_i^T x_j + \beta_1)

- In general, functions that satisfy Mercer's condition can be kernel functions.
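The same four kernels written out as plain numpy functions (a sketch; the parameters p, sigma, beta0, beta1 are free choices, not values given in the slides):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(f"{k.__name__}: {k(xi, xj):.4f}")
```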

Nonlinear SVM: Optimization

- Formulation (Lagrangian dual problem):

    \text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

  such that

    0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

- The solution of the discriminant function is

    g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b

- The optimization technique is the same.

Support Vector Machine: Algorithm

1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages are available)
4. Construct the discriminant function from the support vectors
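The four steps above, as a rough end-to-end sketch using scikit-learn's SVC (which wraps LIBSVM); the RBF kernel, the value of C, and the XOR-like toy data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# A made-up two-class dataset that is not linearly separable (XOR-like).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# Steps 1-3: choose a kernel and C, then let the QP solver do the work.
clf = SVC(kernel='rbf', C=10.0, gamma=1.0)
clf.fit(X, y)

# Step 4: the discriminant function is built from the support vectors.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
print("g(x) at [0.5, 0.5]:", clf.decision_function([[0.5, 0.5]]))
```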

Some Issues

- Choice of kernel
  - The Gaussian or polynomial kernel is the default choice
  - If ineffective, more elaborate kernels are needed
  - Domain experts can give assistance in formulating appropriate similarity measures

- Choice of kernel parameters
  - e.g., \sigma in the Gaussian kernel
  - \sigma is the distance between the closest points with different classifications
  - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters (a sketch follows this slide)

- Optimization criterion: hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
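A minimal cross-validation sketch for setting C and the Gaussian-kernel width, assuming scikit-learn's GridSearchCV (any validation-set search would do); the dataset and the parameter grid are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # assumes scikit-learn is installed

# Made-up data with a nonlinear (radial) class boundary.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# 5-fold cross-validation over a small, arbitrary grid; sklearn's gamma
# plays the role of 1 / (2 sigma^2) in the Gaussian kernel above.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", round(search.best_score_, 3))
```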

Summary: Support Vector Machine

1. Large Margin Classifier
   - Better generalization ability & less over-fitting

2. The Kernel Trick
   - Map data points to a higher-dimensional space in order to make them linearly separable.
   - Since only the dot product is used, we do not need to represent the mapping explicitly.

Additional Resource

- http://www.kernel-machines.org/

Demo of LibSVM

- http://www.csie.ntu.edu.tw/~cjlin/libsvm/
