
Pattern Recognition 2013

Support Vector Machines


Ad Feelders
Universiteit Utrecht

December 9, 2013


Overview

Separable Case

Kernel Functions

Allowing Errors (Soft Margin)

SVMs in R.


Linear Classifier for two classes

Linear model

    y(x) = w^T φ(x) + b                                        (7.1)

with t_n ∈ {−1, +1}.

Predict t_0 = +1 if y(x_0) ≥ 0 and t_0 = −1 otherwise.
The decision boundary is given by y(x) = 0.
This is a linear classifier in feature space φ(x).
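As a minimal illustration of this decision rule in R (the weight vector, intercept and point below are made-up values, and φ is taken to be the identity):

# Decision rule for a linear classifier, assuming phi(x) = x.
w  <- c(1.5, -2.0)              # hypothetical weight vector
b  <- 0.3                       # hypothetical intercept
x0 <- c(0.4, 0.7)               # new point to classify
y0 <- sum(w * x0) + b           # y(x0) = w'phi(x0) + b
t0 <- if (y0 >= 0) +1 else -1   # predicted class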


Mapping

The decision boundary

    y(x) = w^T φ(x) + b = 0

uses the feature map φ, which takes x into a higher-dimensional space where the data are linearly separable.


Data linearly separable

Assume the training data are linearly separable in feature space, so there is at
least one choice of w, b such that:

1. y(x_n) > 0 for t_n = +1;
2. y(x_n) < 0 for t_n = −1;

that is, all training points are classified correctly.

Putting 1. and 2. together:

    t_n y(x_n) > 0,    n = 1, . . . , N

Maximum Margin

There may be many solutions that separate the classes exactly.

Which one gives the smallest prediction error?

The SVM chooses the decision boundary with maximal margin, where the margin is the
distance between the boundary and the closest data point.


Two-class training data


Many Linear Separators


Decision Boundary


Maximize Margin


Support Vectors


Weight vector is orthogonal to the decision boundary

Consider two points x_A and x_B, both of which lie on the decision surface.
Because y(x_A) = y(x_B) = 0, we have

    (w^T x_A + b) − (w^T x_B + b) = w^T (x_A − x_B) = 0

and so the vector w is orthogonal to the decision surface.


Distance of a point to a line


[Figure: a point x, its orthogonal projection x⊥ onto the line y(x) = w^T x + b = 0,
the weight vector w, and the signed distance r, drawn in the (x_1, x_2) plane.]


Distance to decision surface (φ(x) = x)

We have

    x = x⊥ + r · w/‖w‖,                                      (4.6)

where w/‖w‖ is the unit vector in the direction of w, x⊥ is the orthogonal
projection of x onto the line y(x) = 0, and r is the (signed) distance of x
to the line. Multiply (4.6) left and right by w^T and add b:

    w^T x + b = w^T x⊥ + b + r (w^T w)/‖w‖

The left-hand side equals y(x), and w^T x⊥ + b = 0 since x⊥ lies on the line,
so we get

    r = y(x)/‖w‖                                             (4.7)

Distance of a point to a line

The signed distance of x_n to the decision boundary is

    r = y(x_n)/‖w‖

For lines that separate the data perfectly, we have t_n y(x_n) = |y(x_n)|, so
that the distance is given by

    t_n y(x_n)/‖w‖ = t_n (w^T φ(x_n) + b)/‖w‖                (7.2)

Maximum margin solution

Solve

    arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }.    (7.3)

Since 1/‖w‖ does not depend on n, it can be moved outside of the
minimization.

Direct solution of this problem would be rather complex.

A more convenient representation is possible.


Canonical Representation

The hyperplane (decision boundary) is defined by

    w^T φ(x) + b = 0

Then also

    κ (w^T φ(x) + b) = (κw)^T φ(x) + κb = 0,

so rescaling w → κw and b → κb gives just another representation of the
same decision boundary. Choose the scaling factor κ such that

    t_i (w^T φ(x_i) + b) = 1                                 (7.4)

for the point x_i closest to the decision boundary.


Canonical Representation (squares and circles denote the two classes)

[Figure: the decision boundary y(x) = 0 and the margin boundaries y(x) = 1 and
y(x) = −1, with the closest points lying on the margin boundaries.]


Canonical Representation

In this case we have

    t_n (w^T φ(x_n) + b) ≥ 1,    n = 1, . . . , N            (7.5)

Quadratic program:

    arg min_{w,b} (1/2) ‖w‖²                                 (7.6)

subject to the constraints (7.5).

This optimization problem has a unique global minimum.


Lagrangian Function

Introduce Lagrange multipliers a_n ≥ 0 to get the Lagrangian function

    L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }    (7.7)

with

    ∂L(w, b, a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n)


Lagrangian Function
and for b:

    ∂L(w, b, a)/∂b = − Σ_{n=1}^N a_n t_n

Equating the derivatives to zero yields the conditions:

    w = Σ_{n=1}^N a_n t_n φ(x_n)                             (7.8)

and

    Σ_{n=1}^N a_n t_n = 0                                    (7.9)


Dual Representation
Eliminating w and b from L(w, b, a) gives the dual representation.

    L(w, b, a) = (1/2) ‖w‖² − Σ_n a_n { t_n (w^T φ(x_n) + b) − 1 }

               = (1/2) ‖w‖² − Σ_n a_n t_n w^T φ(x_n) − b Σ_n a_n t_n + Σ_n a_n

Substituting w = Σ_n a_n t_n φ(x_n) (7.8) and using Σ_n a_n t_n = 0 (7.9):

               = (1/2) Σ_n Σ_m a_n a_m t_n t_m φ(x_n)^T φ(x_m)
                 − Σ_n Σ_m a_n t_n a_m t_m φ(x_n)^T φ(x_m) + Σ_n a_n

               = Σ_n a_n − (1/2) Σ_n Σ_m a_n t_n a_m t_m φ(x_n)^T φ(x_m)

with all sums running over n, m = 1, . . . , N.

Dual Representation

Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m φ(x_n)^T φ(x_m)    (7.10)

with respect to a and subject to the constraints

    a_n ≥ 0,    n = 1, . . . , N                             (7.11)

    Σ_{n=1}^N a_n t_n = 0.                                   (7.12)


Kernel Function

We map x to a high-dimensional space φ(x) in which the data are linearly
separable.

Performing computations in this high-dimensional space may be very
expensive.

Use a kernel function k that computes a dot product in this space
(without explicitly computing the mapping):

    k(x, x′) = φ(x)^T φ(x′)


Example: polynomial kernel


Suppose x ∈ IR³ and φ(x) ∈ IR¹⁰ with

    φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1², x_2², x_3², √2 x_1x_2, √2 x_1x_3, √2 x_2x_3)

Then

    φ(x)^T φ(z) = 1 + 2x_1z_1 + 2x_2z_2 + 2x_3z_3 + x_1²z_1² + x_2²z_2² + x_3²z_3²
                  + 2x_1x_2z_1z_2 + 2x_1x_3z_1z_3 + 2x_2x_3z_2z_3

But this can be written as

    (1 + x^T z)² = (1 + x_1z_1 + x_2z_2 + x_3z_3)²

which takes far fewer operations to compute.


Polynomial kernel: numeric example


Suppose x = (3, 2, 6) and z = (4, 1, 5).

Then

    φ(x) = (1, 3√2, 2√2, 6√2, 9, 4, 36, 6√2, 18√2, 12√2)

    φ(z) = (1, 4√2, √2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2)

and

    φ(x)^T φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025.

But

    (1 + x^T z)² = (1 + (3)(4) + (2)(1) + (6)(5))² = 45² = 2025

is a more efficient way to compute this dot product.
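As a quick numerical check of this identity in R (a minimal sketch; the helper names phi and poly_kernel are just for illustration):

# Explicit feature map for the quadratic polynomial kernel in IR^3.
phi <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], sqrt(2)*x[3],
                     x[1]^2, x[2]^2, x[3]^2,
                     sqrt(2)*x[1]*x[2], sqrt(2)*x[1]*x[3], sqrt(2)*x[2]*x[3])

# Kernel trick: the same value without constructing phi explicitly.
poly_kernel <- function(x, z) (1 + sum(x * z))^2

x <- c(3, 2, 6); z <- c(4, 1, 5)
sum(phi(x) * phi(z))    # 2025
poly_kernel(x, z)       # 2025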


Kernels
Linear kernel:

    k(x, x′) = x^T x′

Two popular non-linear kernels are the polynomial kernel

    k(x, x′) = (x^T x′ + c)^M

and the Gaussian (or radial) kernel

    k(x, x′) = exp(−‖x − x′‖² / 2σ²),                        (6.23)

or

    k(x, x′) = exp(−γ ‖x − x′‖²),

where γ = 1/(2σ²).
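In the e1071 implementation used later in these slides, these kernels correspond to the kernel, degree, coef0 and gamma arguments of svm(). A minimal sketch on made-up data (object names and parameter values are illustrative only); note that e1071 parametrizes the polynomial kernel as (γ x^T x′ + coef0)^degree:

library(e1071)

# Hypothetical two-class data set with numeric predictors x1, x2 and a factor label t.
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40),
                  t  = factor(rep(c(-1, +1), each = 20)))

fit.lin  <- svm(t ~ x1 + x2, data = dat, kernel = "linear")
fit.poly <- svm(t ~ x1 + x2, data = dat, kernel = "polynomial",
                degree = 2, coef0 = 1, gamma = 1)   # (x'x + 1)^2, the kernel of the example above
fit.rad  <- svm(t ~ x1 + x2, data = dat, kernel = "radial",
                gamma = 0.5)                        # exp(-gamma * ||x - x'||^2)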

Dual Representation with kernels


Using k(x, x′) = φ(x)^T φ(x′) we get the dual representation:

Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m k(x_n, x_m)    (7.10)

with respect to a and subject to the constraints

    a_n ≥ 0,    n = 1, . . . , N                             (7.11)

    Σ_{n=1}^N a_n t_n = 0.                                   (7.12)

Is this dual easier than the original problem?
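The dual (7.10)-(7.12) is a standard quadratic program in a and can be handed to a generic QP solver. As an illustration (not part of the lecture code), a minimal sketch that solves it for a small synthetic, linearly separable data set with the quadprog package; the tiny ridge added to the matrix is only there to satisfy the solver's positive-definiteness requirement:

library(quadprog)

# Toy linearly separable data: 20 points in two classes, linear kernel.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = -2), ncol = 2),
           matrix(rnorm(20, mean =  2), ncol = 2))
t <- rep(c(-1, +1), each = 10)
N <- length(t)

K <- tcrossprod(x)               # kernel matrix k(x_n, x_m) = x_n' x_m
Q <- outer(t, t) * K             # Q[n, m] = t_n t_m k(x_n, x_m)

# solve.QP minimises (1/2) a'Q a - 1'a, which is equivalent to maximising (7.10).
sol <- solve.QP(Dmat = Q + 1e-8 * diag(N),  # small ridge for numerical stability
                dvec = rep(1, N),
                Amat = cbind(t, diag(N)),   # column 1: sum_n a_n t_n = 0; rest: a_n >= 0
                bvec = rep(0, N + 1),
                meq  = 1)                   # the first constraint is an equality
a <- sol$solution
round(a, 3)                                 # only the support vectors get a_n > 0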


Prediction

Recall that

    y(x) = w^T φ(x) + b                                      (7.1)

Substituting

    w = Σ_{n=1}^N a_n t_n φ(x_n)                             (7.8)

into (7.1), we get

    y(x) = b + Σ_{n=1}^N a_n t_n k(x, x_n)                   (7.13)


Prediction: support vectors


KKT conditions:

    a_n ≥ 0                                                  (7.14)

    t_n y(x_n) − 1 ≥ 0                                       (7.15)

    a_n { t_n y(x_n) − 1 } = 0                               (7.16)

From (7.16) it follows that for every data point, either

1. a_n = 0, or
2. t_n y(x_n) = 1.

The former play no role in making predictions (see (7.13)), and the latter
are the support vectors, which lie on the maximum margin hyperplanes.
Only the support vectors play a role in predicting the class of new attribute
vectors!

Prediction: computing b
Since for any support vector x_n we have t_n y(x_n) = 1, we can use (7.13) to get

    t_n ( b + Σ_{m∈S} a_m t_m k(x_n, x_m) ) = 1,             (7.17)

where S denotes the set of support vectors.

Hence we have

    t_n b + t_n Σ_{m∈S} a_m t_m k(x_n, x_m) = 1

    t_n b = 1 − t_n Σ_{m∈S} a_m t_m k(x_n, x_m)

    b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)                    (7.17a)

since t_n ∈ {−1, +1} and so 1/t_n = t_n.



Prediction: computing b

A numerically more stable solution is obtained by averaging (7.17a) over
all support vectors:

    b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )      (7.18)

where N_S is the number of support vectors.
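Given a dual solution a, (7.18) is a one-liner. A minimal sketch continuing the hypothetical quadprog example above (a, t and K as computed there; a_m is essentially zero off the support set, so summing over all m is equivalent to summing over S):

S <- which(a > 1e-6)                                        # indices of the support vectors
b <- mean(t[S] - (K[S, , drop = FALSE] %*% (a * t))[, 1])   # (7.18)

y <- as.vector(K %*% (a * t)) + b                           # y(x_n) via (7.13) for the training points
table(sign(y), t)                                           # every training point on the correct side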

Example with linear kernel

Output of the optimization algorithm:

    n      x1       x2      t        a
    1   0.3858   0.4687   +1   65.5261
    2   0.4871   0.6110   −1   65.5261
    3   0.9218   0.4103   −1         0
    4   0.7382   0.8936   −1         0
    5   0.1763   0.0579   +1         0
    6   0.4057   0.3529   +1         0
    7   0.9355   0.8132   −1         0
    8   0.2146   0.0099   +1         0

Solving for w and b

    w = Σ_{n=1}^N a_n t_n x_n
      = 65.5261 · (0.3858, 0.4687)^T − 65.5261 · (0.4871, 0.6110)^T
      = (−6.64, −9.32)^T

    b^(1) = 1 − w^T x_1 = 1 + (6.64)(0.3858) + (9.32)(0.4687) = 7.9300

    b^(2) = −1 − w^T x_2 = −1 + (6.64)(0.4871) + (9.32)(0.6110) = 7.9289

Averaging these values, we get b = 7.93. So we have

    y(x) = −6.64 x_1 − 9.32 x_2 + 7.93
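The same arithmetic is easy to reproduce in R using only the two support vectors from the table (all other a_n are zero); a small sketch, not part of the original slides:

x1 <- c(0.3858, 0.4687)              # support vector 1, t = +1
x2 <- c(0.4871, 0.6110)              # support vector 2, t = -1
a  <- 65.5261

w  <- a * x1 - a * x2                # w = sum_n a_n t_n x_n
b1 <-  1 - sum(w * x1)               # from t_1 y(x_1) = 1
b2 <- -1 - sum(w * x2)               # from t_2 y(x_2) = 1
b  <- mean(c(b1, b2))
round(w, 2); round(b, 2)             # roughly (-6.64, -9.32) and 7.93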


Example: decision boundary

[Figure: scatter plot of the training points in the (x1, x2) unit square together
with the decision boundary −6.64 x_1 − 9.32 x_2 + 7.93 = 0.]

Example: prediction

Suppose we want to predict the class of z = (0.6, 0.8). We compute

    y(z) = −6.64 z_1 − 9.32 z_2 + 7.93
         = −6.64 (0.6) − 9.32 (0.8) + 7.93 = −3.51,

so we predict the negative class.

This however requires explicit computation of w.


Example: prediction
Using (7.13) and the fact that we have a linear kernel:

    y(z) = Σ_{n=1}^N a_n t_n k(z, x_n) + b = Σ_{n=1}^N a_n t_n z^T x_n + b

         = 65.5261 { (0.6)(0.3858) + (0.8)(0.4687) }
           − 65.5261 { (0.6)(0.4871) + (0.8)(0.6110) } + b = −11.44271 + b

We compute b with (7.17) and the first data point:

    b = 1 − { 65.5261 (0.3858² + 0.4687²)
              − 65.5261 [ (0.3858)(0.4871) + (0.4687)(0.6110) ] } = 7.93119

Hence

    y(z) = −11.44271 + 7.93119 = −3.51152
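The same prediction can be done in R without ever forming w, staying in the kernel representation of (7.13); a minimal sketch using only the two support vectors:

X  <- rbind(c(0.3858, 0.4687),       # support vector 1
            c(0.4871, 0.6110))       # support vector 2
tn <- c(+1, -1)
a  <- c(65.5261, 65.5261)
z  <- c(0.6, 0.8)

k  <- function(u, v) sum(u * v)                         # linear kernel
b  <- tn[1] - sum(a * tn * apply(X, 1, k, v = X[1, ]))  # (7.17a) with the first support vector
yz <- sum(a * tn * apply(X, 1, k, v = z)) + b           # (7.13)
yz                                                      # about -3.51: negative class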


Allowing Errors

So far we have assumed that the training data points are linearly separable
in feature space φ(x).

The resulting SVM gives exact separation of the training data in the original
input space x, with a non-linear decision boundary.

Class distributions typically overlap, in which case exact separation of
the training data leads to poor generalization (overfitting).


Allowing Errors

Data points are allowed to be on the wrong side of the margin
boundary, but with a penalty that increases with the distance from
that boundary.

For convenience we make this penalty a linear function of the distance
to the margin boundary.

Introduce slack variables ξ_n ≥ 0, with one slack variable for each
training data point.


Definition of Slack Variables


We define ξ_n = 0 for data points that are on the inside of the correct
margin boundary and ξ_n = |t_n − y(x_n)| for all other data points.

[Figure: the lines y(x) = 1, y(x) = 0 and y(x) = −1; points inside the correct
margin boundary have ξ = 0, margin violations that are still correctly classified
have ξ < 1, and misclassified points have ξ > 1.]


New Constraints
The exact classification constraints

    t_n y(x_n) ≥ 1,    n = 1, . . . , N                      (7.5)

are replaced by

    t_n y(x_n) ≥ 1 − ξ_n,    n = 1, . . . , N                (7.20)

Check (7.20):

ξ_n = 0 for data points that are on the inside of the correct margin boundary.
In that case y_n t_n ≥ 1.

Suppose t_n = +1 and x_n is on the wrong side of the margin boundary, i.e.
y_n t_n < 1. Since y_n = y_n t_n, we have

    ξ_n = |t_n − y_n| = |1 − y_n t_n| = 1 − y_n t_n

and therefore t_n y_n = 1 − ξ_n.

Suppose t_n = −1 . . .

New objective function


Our goal is to maximize the margin while softly penalizing points that lie
on the wrong side of the margin boundary. We therefore minimize

    C Σ_{n=1}^N ξ_n + (1/2) ‖w‖²                             (7.21)

where the parameter C > 0 controls the trade-off between the slack
variable penalty and the margin.

Alternative view (divide by C and put λ = 1/(2C)):

    Σ_{n=1}^N ξ_n + λ Σ_i w_i²

The first term represents lack-of-fit (hinge loss) and the second term takes care of
regularization.
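A minimal R sketch of this "alternative view" of the objective, evaluated for a given linear classifier (the function name and arguments are just for illustration):

# sum_n xi_n + lambda * sum_i w_i^2, with xi_n = max(0, 1 - t_n y(x_n))  (hinge loss)
soft_margin_objective <- function(w, b, X, t, lambda) {
  y  <- as.vector(X %*% w) + b        # y(x_n) = w'x_n + b
  xi <- pmax(0, 1 - t * y)            # slack: 0 inside the correct margin boundary
  sum(xi) + lambda * sum(w^2)
}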

Optimization Problem
The Lagrangian is given by

    L(w, b, a) = (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n
                 − Σ_{n=1}^N a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^N μ_n ξ_n    (7.22)

where a_n ≥ 0 and μ_n ≥ 0 are Lagrange multipliers.

The KKT conditions are given by:

    a_n ≥ 0                                                  (7.23)

    t_n y(x_n) − 1 + ξ_n ≥ 0                                 (7.24)

    a_n ( t_n y(x_n) − 1 + ξ_n ) = 0                         (7.25)

    μ_n ≥ 0                                                  (7.26)

    ξ_n ≥ 0                                                  (7.27)

    μ_n ξ_n = 0                                              (7.28)


Dual

Take the derivative with respect to w, b and ξ_n and equate to zero:

    ∂L/∂w = 0  ⇒  w = Σ_{n=1}^N a_n t_n φ(x_n)               (7.29)

    ∂L/∂b = 0  ⇒  Σ_{n=1}^N a_n t_n = 0                      (7.30)

    ∂L/∂ξ_n = 0  ⇒  a_n = C − μ_n                            (7.31)


Dual
Using these to eliminate w, b and ξ_n from the Lagrangian, we obtain the
dual Lagrangian: Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m k(x_n, x_m)    (7.32)

with respect to a and subject to the constraints

    0 ≤ a_n ≤ C,    n = 1, . . . , N                         (7.33)

    Σ_{n=1}^N a_n t_n = 0.                                   (7.34)

Note: we have a_n ≤ C since μ_n ≥ 0 (7.26) and a_n = C − μ_n (7.31).



Prediction

Recall that

    y(x) = w^T φ(x) + b                                      (7.1)

Substituting

    w = Σ_{n=1}^N a_n t_n φ(x_n)                             (7.8)

into (7.1), we get

    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b                   (7.13)

with k(x, x_n) = φ(x)^T φ(x_n).


Interpretation of Solution
We distinguish two cases:

Points with a_n = 0 do not play a role in making predictions.

Points with a_n > 0 are called support vectors. It follows from KKT condition

    a_n ( t_n y(x_n) − 1 + ξ_n ) = 0                         (7.25)

that for these points

    t_n y_n = 1 − ξ_n

Again we have two cases:

If a_n < C then μ_n > 0, because a_n = C − μ_n. Since μ_n ξ_n = 0 (7.28), it
follows that ξ_n = 0 and hence such points lie on the margin.

Points with a_n = C can be on the margin or inside the margin and can
either be correctly classified if ξ_n ≤ 1 or misclassified if ξ_n > 1.


Computing the intercept

To compute the value of b, we use the fact that those support vectors
with 0 < a_n < C have ξ_n = 0, so that t_n y(x_n) = 1; hence, as before,

    b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)                    (7.17a)

Again a numerically more stable solution is obtained by averaging (7.17a)
over all data points having 0 < a_n < C:

    b = (1/N_M) Σ_{n∈M} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )      (7.37)

where M denotes the set of points with 0 < a_n < C.


Model Selection

As usual we are confronted with the problem of selecting the
appropriate model complexity.

The relevant parameters are C and any parameters of the chosen
kernel function.
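In R this can be done with cross-validation over a grid of parameter values, for example with tune() from e1071. A minimal sketch on the Conn's syndrome data used in the next slides (the grid values are only an illustration):

> library(e1071)
> conn.tune <- tune(svm, cause ~ sodium + co2, data = conn.dat,
+                   ranges = list(cost = 2^(0:8), gamma = 2^(-4:2)))
> summary(conn.tune)           # cross-validation error for every (cost, gamma) pair
> conn.tune$best.parameters    # the selected values of C and gamma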


How to in R
> conn.svm.lin <- svm(cause ~ sodium + co2,
                      data=conn.dat, kernel="linear")
> plot(conn.svm.lin, conn.dat)
> conn.svm.lin.predict <- predict(conn.svm.lin, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.lin.predict)
   conn.svm.lin.predict
     0  1
  0 17  3
  1  2  8
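(The svm() function used here comes from the e1071 package, so it has to be loaded first; summary() on the fitted object reports, among other things, the kernel, the cost C and the number of support vectors. A small addition to the slide for completeness:)

> library(e1071)            # provides svm(), an interface to LIBSVM
> summary(conn.svm.lin)     # kernel, cost and number of support vectors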


Conn's syndrome: linear kernel

[Figure: SVM classification plot of sodium against co2 produced by
plot(conn.svm.lin, conn.dat); support vectors are marked x, the other data
points o, over the predicted class regions.]

How to in R

> conn.svm.rad <- svm(cause ~ sodium + co2, data=conn.dat)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
     0  1
  0 17  3
  1  2  8


Conn's syndrome: radial kernel, C = 1

[Figure: SVM classification plot of sodium against co2 for the radial kernel with
the default cost C = 1; support vectors are marked x, the other data points o,
over the predicted class regions.]

How to in R
> conn.svm.rad <- svm(cause ~ sodium + co2,
                      data=conn.dat, cost=100)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
     0  1
  0 19  1
  1  1  9


Conn's syndrome: radial kernel, C = 100

[Figure: SVM classification plot of sodium against co2 for the radial kernel with
cost C = 100; support vectors are marked x, the other data points o, over the
predicted class regions.]

SVM in R

LIBSVM is available in package e1071 in R.

It can also perform regression and non-binary classification.

Non-binary classification is performed as follows:

  Train K(K − 1)/2 binary SVMs on all possible pairs of classes.

  To classify a new point, let it be classified by every binary SVM, and
  pick the class with the highest number of votes.

This is done automatically by the function svm in e1071.
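A minimal illustration of the one-versus-one scheme with the built-in iris data (three classes, so K(K − 1)/2 = 3 binary SVMs are fitted internally):

> library(e1071)
> iris.svm <- svm(Species ~ ., data = iris)       # radial kernel, cost C = 1 by default
> table(iris$Species, predict(iris.svm, iris[, 1:4]))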

