
Pattern Recognition 2013

Support Vector Machines


Ad Feelders
Universiteit Utrecht

December 9, 2013


Overview

Separable Case

Kernel Functions

Allowing Errors (Soft Margin)

SVMs in R.


Linear Classifier for two classes

Linear model

    y(x) = w^T φ(x) + b                                        (7.1)

with t_n ∈ {−1, +1}.

Predict t_0 = +1 if y(x_0) ≥ 0 and t_0 = −1 otherwise.
The decision boundary is given by y(x) = 0.
This is a linear classifier in feature space φ(x).
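As a minimal illustration of this decision rule in R (the weight vector, intercept and point below are made-up values, and φ is taken to be the identity):

# Decision rule for a linear classifier, assuming phi(x) = x.
w  <- c(1.5, -2.0)              # hypothetical weight vector
b  <- 0.3                       # hypothetical intercept
x0 <- c(0.4, 0.7)               # new point to classify
y0 <- sum(w * x0) + b           # y(x0) = w'phi(x0) + b
t0 <- if (y0 >= 0) +1 else -1   # predicted class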


Mapping

The decision boundary

    y(x) = w^T φ(x) + b = 0

uses the feature map φ, which takes x into a higher-dimensional space where the data are linearly separable.


Data linearly separable

Assume the training data are linearly separable in feature space, so there is at
least one choice of w, b such that:

1. y(x_n) > 0 for t_n = +1;
2. y(x_n) < 0 for t_n = −1;

that is, all training points are classified correctly.

Putting 1. and 2. together:

    t_n y(x_n) > 0,    n = 1, . . . , N

Maximum Margin

There may be many solutions that separate the classes exactly.

Which one gives the smallest prediction error?

The SVM chooses the decision boundary with maximal margin, where the margin is the
distance between the boundary and the closest data point.


Two-class training data


Many Linear Separators


Decision Boundary


Maximize Margin


Support Vectors


Weight vector is orthogonal to the decision boundary

Consider two points x_A and x_B, both of which lie on the decision surface.
Because y(x_A) = y(x_B) = 0, we have

    (w^T x_A + b) − (w^T x_B + b) = w^T (x_A − x_B) = 0

and so the vector w is orthogonal to the decision surface.


Distance of a point to a line


[Figure: a point x, its orthogonal projection x⊥ onto the line y(x) = w^T x + b = 0,
the weight vector w, and the signed distance r, drawn in the (x_1, x_2) plane.]


Distance to decision surface (φ(x) = x)

We have

    x = x⊥ + r · w/‖w‖,                                      (4.6)

where w/‖w‖ is the unit vector in the direction of w, x⊥ is the orthogonal
projection of x onto the line y(x) = 0, and r is the (signed) distance of x
to the line. Multiply (4.6) left and right by w^T and add b:

    w^T x + b = w^T x⊥ + b + r (w^T w)/‖w‖

The left-hand side equals y(x), and w^T x⊥ + b = 0 since x⊥ lies on the line,
so we get

    r = y(x)/‖w‖                                             (4.7)

Distance of a point to a line

The signed distance of x_n to the decision boundary is

    r = y(x_n)/‖w‖

For lines that separate the data perfectly, we have t_n y(x_n) = |y(x_n)|, so
that the distance is given by

    t_n y(x_n)/‖w‖ = t_n (w^T φ(x_n) + b)/‖w‖                (7.2)

Maximum margin solution

Solve

    arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }.    (7.3)

Since 1/‖w‖ does not depend on n, it can be moved outside of the
minimization.

Direct solution of this problem would be rather complex.

A more convenient representation is possible.


Canonical Representation

The hyperplane (decision boundary) is defined by

    w^T φ(x) + b = 0

Then also

    κ (w^T φ(x) + b) = (κw)^T φ(x) + κb = 0,

so rescaling w → κw and b → κb gives just another representation of the
same decision boundary. Choose the scaling factor κ such that

    t_i (w^T φ(x_i) + b) = 1                                 (7.4)

for the point x_i closest to the decision boundary.


Canonical Representation (squares and circles denote the two classes)

[Figure: the decision boundary y(x) = 0 and the margin boundaries y(x) = 1 and
y(x) = −1, with the closest points lying on the margin boundaries.]


Canonical Representation

In this case we have

    t_n (w^T φ(x_n) + b) ≥ 1,    n = 1, . . . , N            (7.5)

Quadratic program:

    arg min_{w,b} (1/2) ‖w‖²                                 (7.6)

subject to the constraints (7.5).

This optimization problem has a unique global minimum.


Lagrangian Function

Introduce Lagrange multipliers a_n ≥ 0 to get the Lagrangian function

    L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }    (7.7)

with

    ∂L(w, b, a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n)


Lagrangian Function
and for b:

    ∂L(w, b, a)/∂b = − Σ_{n=1}^N a_n t_n

Equating the derivatives to zero yields the conditions:

    w = Σ_{n=1}^N a_n t_n φ(x_n)                             (7.8)

and

    Σ_{n=1}^N a_n t_n = 0                                    (7.9)


Dual Representation
Eliminating w and b from L(w, b, a) gives the dual representation.

    L(w, b, a) = (1/2) ‖w‖² − Σ_n a_n { t_n (w^T φ(x_n) + b) − 1 }

               = (1/2) ‖w‖² − Σ_n a_n t_n w^T φ(x_n) − b Σ_n a_n t_n + Σ_n a_n

Substituting w = Σ_n a_n t_n φ(x_n) (7.8) and using Σ_n a_n t_n = 0 (7.9):

               = (1/2) Σ_n Σ_m a_n a_m t_n t_m φ(x_n)^T φ(x_m)
                 − Σ_n Σ_m a_n t_n a_m t_m φ(x_n)^T φ(x_m) + Σ_n a_n

               = Σ_n a_n − (1/2) Σ_n Σ_m a_n t_n a_m t_m φ(x_n)^T φ(x_m)

with all sums running over n, m = 1, . . . , N.

Dual Representation

Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m φ(x_n)^T φ(x_m)    (7.10)

with respect to a and subject to the constraints

    a_n ≥ 0,    n = 1, . . . , N                             (7.11)

    Σ_{n=1}^N a_n t_n = 0.                                   (7.12)


Kernel Function

We map x to a high-dimensional space φ(x) in which the data are linearly
separable.

Performing computations in this high-dimensional space may be very
expensive.

Use a kernel function k that computes a dot product in this space
(without explicitly computing the mapping):

    k(x, x′) = φ(x)^T φ(x′)


Example: polynomial kernel


Suppose x ∈ IR³ and φ(x) ∈ IR¹⁰ with

    φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1², x_2², x_3², √2 x_1x_2, √2 x_1x_3, √2 x_2x_3)

Then

    φ(x)^T φ(z) = 1 + 2x_1z_1 + 2x_2z_2 + 2x_3z_3 + x_1²z_1² + x_2²z_2² + x_3²z_3²
                  + 2x_1x_2z_1z_2 + 2x_1x_3z_1z_3 + 2x_2x_3z_2z_3

But this can be written as

    (1 + x^T z)² = (1 + x_1z_1 + x_2z_2 + x_3z_3)²

which takes far fewer operations to compute.


Polynomial kernel: numeric example


Suppose x = (3, 2, 6) and z = (4, 1, 5).

Then

    φ(x) = (1, 3√2, 2√2, 6√2, 9, 4, 36, 6√2, 18√2, 12√2)

    φ(z) = (1, 4√2, √2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2)

and

    φ(x)^T φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025.

But

    (1 + x^T z)² = (1 + (3)(4) + (2)(1) + (6)(5))² = 45² = 2025

is a more efficient way to compute this dot product.
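As a quick numerical check of this identity in R (a minimal sketch; the helper names phi and poly_kernel are just for illustration):

# Explicit feature map for the quadratic polynomial kernel in IR^3.
phi <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], sqrt(2)*x[3],
                     x[1]^2, x[2]^2, x[3]^2,
                     sqrt(2)*x[1]*x[2], sqrt(2)*x[1]*x[3], sqrt(2)*x[2]*x[3])

# Kernel trick: the same value without constructing phi explicitly.
poly_kernel <- function(x, z) (1 + sum(x * z))^2

x <- c(3, 2, 6); z <- c(4, 1, 5)
sum(phi(x) * phi(z))    # 2025
poly_kernel(x, z)       # 2025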


Kernels
Linear kernel:

    k(x, x′) = x^T x′

Two popular non-linear kernels are the polynomial kernel

    k(x, x′) = (x^T x′ + c)^M

and the Gaussian (or radial) kernel

    k(x, x′) = exp(−‖x − x′‖² / 2σ²),                        (6.23)

or

    k(x, x′) = exp(−γ ‖x − x′‖²),

where γ = 1/(2σ²).
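In the e1071 implementation used later in these slides, these kernels correspond to the kernel, degree, coef0 and gamma arguments of svm(). A minimal sketch on made-up data (object names and parameter values are illustrative only); note that e1071 parametrizes the polynomial kernel as (γ x^T x′ + coef0)^degree:

library(e1071)

# Hypothetical two-class data set with numeric predictors x1, x2 and a factor label t.
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40),
                  t  = factor(rep(c(-1, +1), each = 20)))

fit.lin  <- svm(t ~ x1 + x2, data = dat, kernel = "linear")
fit.poly <- svm(t ~ x1 + x2, data = dat, kernel = "polynomial",
                degree = 2, coef0 = 1, gamma = 1)   # (x'x + 1)^2, the kernel of the example above
fit.rad  <- svm(t ~ x1 + x2, data = dat, kernel = "radial",
                gamma = 0.5)                        # exp(-gamma * ||x - x'||^2)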

Dual Representation with kernels


Using k(x, x′) = φ(x)^T φ(x′) we get the dual representation:

Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m k(x_n, x_m)    (7.10)

with respect to a and subject to the constraints

    a_n ≥ 0,    n = 1, . . . , N                             (7.11)

    Σ_{n=1}^N a_n t_n = 0.                                   (7.12)

Is this dual easier than the original problem?
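The dual (7.10)-(7.12) is a standard quadratic program in a and can be handed to a generic QP solver. As an illustration (not part of the lecture code), a minimal sketch that solves it for a small synthetic, linearly separable data set with the quadprog package; the tiny ridge added to the matrix is only there to satisfy the solver's positive-definiteness requirement:

library(quadprog)

# Toy linearly separable data: 20 points in two classes, linear kernel.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = -2), ncol = 2),
           matrix(rnorm(20, mean =  2), ncol = 2))
t <- rep(c(-1, +1), each = 10)
N <- length(t)

K <- tcrossprod(x)               # kernel matrix k(x_n, x_m) = x_n' x_m
Q <- outer(t, t) * K             # Q[n, m] = t_n t_m k(x_n, x_m)

# solve.QP minimises (1/2) a'Q a - 1'a, which is equivalent to maximising (7.10).
sol <- solve.QP(Dmat = Q + 1e-8 * diag(N),  # small ridge for numerical stability
                dvec = rep(1, N),
                Amat = cbind(t, diag(N)),   # column 1: sum_n a_n t_n = 0; rest: a_n >= 0
                bvec = rep(0, N + 1),
                meq  = 1)                   # the first constraint is an equality
a <- sol$solution
round(a, 3)                                 # only the support vectors get a_n > 0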


Prediction

Recall that

    y(x) = w^T φ(x) + b                                      (7.1)

Substituting

    w = Σ_{n=1}^N a_n t_n φ(x_n)                             (7.8)

into (7.1), we get

    y(x) = b + Σ_{n=1}^N a_n t_n k(x, x_n)                   (7.13)


Prediction: support vectors


KKT conditions:

    a_n ≥ 0                                                  (7.14)

    t_n y(x_n) − 1 ≥ 0                                       (7.15)

    a_n { t_n y(x_n) − 1 } = 0                               (7.16)

From (7.16) it follows that for every data point, either

1. a_n = 0, or
2. t_n y(x_n) = 1.

The former play no role in making predictions (see (7.13)), and the latter
are the support vectors, which lie on the maximum margin hyperplanes.
Only the support vectors play a role in predicting the class of new attribute
vectors!

Prediction: computing b
Since for any support vector x_n we have t_n y(x_n) = 1, we can use (7.13) to get

    t_n ( b + Σ_{m∈S} a_m t_m k(x_n, x_m) ) = 1,             (7.17)

where S denotes the set of support vectors.

Hence we have

    t_n b + t_n Σ_{m∈S} a_m t_m k(x_n, x_m) = 1

    t_n b = 1 − t_n Σ_{m∈S} a_m t_m k(x_n, x_m)

    b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)                    (7.17a)

since t_n ∈ {−1, +1} and so 1/t_n = t_n.



Prediction: computing b

A numerically more stable solution is obtained by averaging (7.17a) over
all support vectors:

    b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )      (7.18)

where N_S is the number of support vectors.
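Given a dual solution a, (7.18) is a one-liner. A minimal sketch continuing the hypothetical quadprog example above (a, t and K as computed there; a_m is essentially zero off the support set, so summing over all m is equivalent to summing over S):

S <- which(a > 1e-6)                                        # indices of the support vectors
b <- mean(t[S] - (K[S, , drop = FALSE] %*% (a * t))[, 1])   # (7.18)

y <- as.vector(K %*% (a * t)) + b                           # y(x_n) via (7.13) for the training points
table(sign(y), t)                                           # every training point on the correct side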

Example with linear kernel

Output of the optimization algorithm:

    n      x1       x2      t        a
    1   0.3858   0.4687   +1   65.5261
    2   0.4871   0.6110   −1   65.5261
    3   0.9218   0.4103   −1         0
    4   0.7382   0.8936   −1         0
    5   0.1763   0.0579   +1         0
    6   0.4057   0.3529   +1         0
    7   0.9355   0.8132   −1         0
    8   0.2146   0.0099   +1         0

Solving for w and b

    w = Σ_{n=1}^N a_n t_n x_n
      = 65.5261 · (0.3858, 0.4687)^T − 65.5261 · (0.4871, 0.6110)^T
      = (−6.64, −9.32)^T

    b^(1) = 1 − w^T x_1 = 1 + (6.64)(0.3858) + (9.32)(0.4687) = 7.9300

    b^(2) = −1 − w^T x_2 = −1 + (6.64)(0.4871) + (9.32)(0.6110) = 7.9289

Averaging these values, we get b = 7.93. So we have

    y(x) = −6.64 x_1 − 9.32 x_2 + 7.93
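The same arithmetic is easy to reproduce in R using only the two support vectors from the table (all other a_n are zero); a small sketch, not part of the original slides:

x1 <- c(0.3858, 0.4687)              # support vector 1, t = +1
x2 <- c(0.4871, 0.6110)              # support vector 2, t = -1
a  <- 65.5261

w  <- a * x1 - a * x2                # w = sum_n a_n t_n x_n
b1 <-  1 - sum(w * x1)               # from t_1 y(x_1) = 1
b2 <- -1 - sum(w * x2)               # from t_2 y(x_2) = 1
b  <- mean(c(b1, b2))
round(w, 2); round(b, 2)             # roughly (-6.64, -9.32) and 7.93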


Example: decision boundary

[Figure: scatter plot of the training points in the (x1, x2) unit square together
with the decision boundary −6.64 x_1 − 9.32 x_2 + 7.93 = 0.]

Example: prediction

Suppose we want to predict the class of z = (0.6, 0.8). We compute

    y(z) = −6.64 z_1 − 9.32 z_2 + 7.93
         = −6.64 (0.6) − 9.32 (0.8) + 7.93 = −3.51,

so we predict the negative class.

This however requires explicit computation of w.


Example: prediction
Using (7.13) and the fact that we have a linear kernel:

    y(z) = Σ_{n=1}^N a_n t_n k(z, x_n) + b = Σ_{n=1}^N a_n t_n z^T x_n + b

         = 65.5261 { (0.6)(0.3858) + (0.8)(0.4687) }
           − 65.5261 { (0.6)(0.4871) + (0.8)(0.6110) } + b = −11.44271 + b

We compute b with (7.17) and the first data point:

    b = 1 − { 65.5261 (0.3858² + 0.4687²)
              − 65.5261 [ (0.3858)(0.4871) + (0.4687)(0.6110) ] } = 7.93119

Hence

    y(z) = −11.44271 + 7.93119 = −3.51152
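The same prediction can be done in R without ever forming w, staying in the kernel representation of (7.13); a minimal sketch using only the two support vectors:

X  <- rbind(c(0.3858, 0.4687),       # support vector 1
            c(0.4871, 0.6110))       # support vector 2
tn <- c(+1, -1)
a  <- c(65.5261, 65.5261)
z  <- c(0.6, 0.8)

k  <- function(u, v) sum(u * v)                         # linear kernel
b  <- tn[1] - sum(a * tn * apply(X, 1, k, v = X[1, ]))  # (7.17a) with the first support vector
yz <- sum(a * tn * apply(X, 1, k, v = z)) + b           # (7.13)
yz                                                      # about -3.51: negative class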


Allowing Errors

So far we have assumed that the training data points are linearly separable
in feature space φ(x).

The resulting SVM gives exact separation of the training data in the original
input space x, with a non-linear decision boundary.

Class distributions typically overlap, in which case exact separation of
the training data leads to poor generalization (overfitting).


Allowing Errors

Data points are allowed to be on the wrong side of the margin
boundary, but with a penalty that increases with the distance from
that boundary.

For convenience we make this penalty a linear function of the distance
to the margin boundary.

Introduce slack variables ξ_n ≥ 0, with one slack variable for each
training data point.


Definition of Slack Variables


We define ξ_n = 0 for data points that are on the inside of the correct
margin boundary and ξ_n = |t_n − y(x_n)| for all other data points.

[Figure: the lines y(x) = 1, y(x) = 0 and y(x) = −1; points inside the correct
margin boundary have ξ = 0, margin violations that are still correctly classified
have ξ < 1, and misclassified points have ξ > 1.]


New Constraints
The exact classification constraints

    t_n y(x_n) ≥ 1,    n = 1, . . . , N                      (7.5)

are replaced by

    t_n y(x_n) ≥ 1 − ξ_n,    n = 1, . . . , N                (7.20)

Check (7.20):

ξ_n = 0 for data points that are on the inside of the correct margin boundary.
In that case y_n t_n ≥ 1.

Suppose t_n = +1 and x_n is on the wrong side of the margin boundary, i.e.
y_n t_n < 1. Since y_n = y_n t_n, we have

    ξ_n = |t_n − y_n| = |1 − y_n t_n| = 1 − y_n t_n

and therefore t_n y_n = 1 − ξ_n.

Suppose t_n = −1 . . .

New objective function


Our goal is to maximize the margin while softly penalizing points that lie
on the wrong side of the margin boundary. We therefore minimize

    C Σ_{n=1}^N ξ_n + (1/2) ‖w‖²                             (7.21)

where the parameter C > 0 controls the trade-off between the slack
variable penalty and the margin.

Alternative view (divide by C and put λ = 1/(2C)):

    Σ_{n=1}^N ξ_n + λ Σ_i w_i²

The first term represents lack-of-fit (hinge loss) and the second term takes care of
regularization.
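A minimal R sketch of this "alternative view" of the objective, evaluated for a given linear classifier (the function name and arguments are just for illustration):

# sum_n xi_n + lambda * sum_i w_i^2, with xi_n = max(0, 1 - t_n y(x_n))  (hinge loss)
soft_margin_objective <- function(w, b, X, t, lambda) {
  y  <- as.vector(X %*% w) + b        # y(x_n) = w'x_n + b
  xi <- pmax(0, 1 - t * y)            # slack: 0 inside the correct margin boundary
  sum(xi) + lambda * sum(w^2)
}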

Optimization Problem
The Lagrangian is given by

    L(w, b, a) = (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n
                 − Σ_{n=1}^N a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^N μ_n ξ_n    (7.22)

where a_n ≥ 0 and μ_n ≥ 0 are Lagrange multipliers.

The KKT conditions are given by:

    a_n ≥ 0                                                  (7.23)

    t_n y(x_n) − 1 + ξ_n ≥ 0                                 (7.24)

    a_n ( t_n y(x_n) − 1 + ξ_n ) = 0                         (7.25)

    μ_n ≥ 0                                                  (7.26)

    ξ_n ≥ 0                                                  (7.27)

    μ_n ξ_n = 0                                              (7.28)


Dual

Take the derivative with respect to w, b and ξ_n and equate to zero:

    ∂L/∂w = 0  ⇒  w = Σ_{n=1}^N a_n t_n φ(x_n)               (7.29)

    ∂L/∂b = 0  ⇒  Σ_{n=1}^N a_n t_n = 0                      (7.30)

    ∂L/∂ξ_n = 0  ⇒  a_n = C − μ_n                            (7.31)


Dual
Using these to eliminate w, b and ξ_n from the Lagrangian, we obtain the
dual Lagrangian: Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m k(x_n, x_m)    (7.32)

with respect to a and subject to the constraints

    0 ≤ a_n ≤ C,    n = 1, . . . , N                         (7.33)

    Σ_{n=1}^N a_n t_n = 0.                                   (7.34)

Note: we have a_n ≤ C since μ_n ≥ 0 (7.26) and a_n = C − μ_n (7.31).



Prediction

Recall that

    y(x) = w^T φ(x) + b                                      (7.1)

Substituting

    w = Σ_{n=1}^N a_n t_n φ(x_n)                             (7.8)

into (7.1), we get

    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b                   (7.13)

with k(x, x_n) = φ(x)^T φ(x_n).


Interpretation of Solution
We distinguish two cases:

Points with a_n = 0 do not play a role in making predictions.

Points with a_n > 0 are called support vectors. It follows from KKT condition

    a_n ( t_n y(x_n) − 1 + ξ_n ) = 0                         (7.25)

that for these points

    t_n y_n = 1 − ξ_n

Again we have two cases:

If a_n < C then μ_n > 0, because a_n = C − μ_n. Since μ_n ξ_n = 0 (7.28), it
follows that ξ_n = 0 and hence such points lie on the margin.

Points with a_n = C can be on the margin or inside the margin and can
either be correctly classified if ξ_n ≤ 1 or misclassified if ξ_n > 1.


Computing the intercept

To compute the value of b, we use the fact that those support vectors
with 0 < a_n < C have ξ_n = 0, so that t_n y(x_n) = 1; hence, as before,

    b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)                    (7.17a)

Again a numerically more stable solution is obtained by averaging (7.17a)
over all data points having 0 < a_n < C:

    b = (1/N_M) Σ_{n∈M} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )      (7.37)

where M denotes the set of points with 0 < a_n < C.


Model Selection

As usual we are confronted with the problem of selecting the
appropriate model complexity.

The relevant parameters are C and any parameters of the chosen
kernel function.
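In R this can be done with cross-validation over a grid of parameter values, for example with tune() from e1071. A minimal sketch on the Conn's syndrome data used in the next slides (the grid values are only an illustration):

> library(e1071)
> conn.tune <- tune(svm, cause ~ sodium + co2, data = conn.dat,
+                   ranges = list(cost = 2^(0:8), gamma = 2^(-4:2)))
> summary(conn.tune)           # cross-validation error for every (cost, gamma) pair
> conn.tune$best.parameters    # the selected values of C and gamma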


How to in R
> conn.svm.lin <- svm(cause ~ sodium + co2,
                      data=conn.dat, kernel="linear")
> plot(conn.svm.lin, conn.dat)
> conn.svm.lin.predict <- predict(conn.svm.lin, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.lin.predict)
   conn.svm.lin.predict
     0  1
  0 17  3
  1  2  8
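(The svm() function used here comes from the e1071 package, so it has to be loaded first; summary() on the fitted object reports, among other things, the kernel, the cost C and the number of support vectors. A small addition to the slide for completeness:)

> library(e1071)            # provides svm(), an interface to LIBSVM
> summary(conn.svm.lin)     # kernel, cost and number of support vectors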


Conn's syndrome: linear kernel

[Figure: SVM classification plot of sodium against co2 produced by
plot(conn.svm.lin, conn.dat); support vectors are marked x, the other data
points o, over the predicted class regions.]

How to in R

> conn.svm.rad <- svm(cause ~ sodium + co2, data=conn.dat)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
     0  1
  0 17  3
  1  2  8


Conn's syndrome: radial kernel, C = 1

[Figure: SVM classification plot of sodium against co2 for the radial kernel with
the default cost C = 1; support vectors are marked x, the other data points o,
over the predicted class regions.]

How to in R
> conn.svm.rad <- svm(cause ~ sodium + co2,
                      data=conn.dat, cost=100)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
     0  1
  0 19  1
  1  1  9


Conn's syndrome: radial kernel, C = 100

[Figure: SVM classification plot of sodium against co2 for the radial kernel with
cost C = 100; support vectors are marked x, the other data points o, over the
predicted class regions.]

SVM in R

LIBSVM is available in package e1071 in R.

It can also perform regression and non-binary classification.

Non-binary classification is performed as follows:

  Train K(K − 1)/2 binary SVMs on all possible pairs of classes.

  To classify a new point, let it be classified by every binary SVM, and
  pick the class with the highest number of votes.

This is done automatically by the function svm in e1071.
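A minimal illustration of the one-versus-one scheme with the built-in iris data (three classes, so K(K − 1)/2 = 3 binary SVMs are fitted internally):

> library(e1071)
> iris.svm <- svm(Species ~ ., data = iris)       # radial kernel, cost C = 1 by default
> table(iris$Species, predict(iris.svm, iris[, 1:4]))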

