Pattern Recognition

Ad Feelders (Universiteit Utrecht)

December 9, 2013
Overview
Separable Case
Kernel Functions
SVMs in R
Linear model

y(x) = w^T φ(x) + b    (7.1)
Mapping
The training data consist of input vectors x_n with corresponding target values t_n ∈ {−1, +1}, n = 1, . . . , N. New points x are classified according to the sign of y(x).
Maximum Margin
Decision Boundary
Maximize Margin
Support Vectors
Consider two points x_A and x_B, both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have

(w^T x_A + b) − (w^T x_B + b) = w^T (x_A − x_B) = 0

and so the vector w is orthogonal to the decision surface.
[Figure: geometry of the decision surface y(x) = w^T x + b = 0, showing a point x, its orthogonal projection x⊥ onto the surface, the normal vector w, and the signed distance r.]
x = x⊥ + r · w / ||w||    (4.6)

where w/||w|| is the unit vector in the direction of w, x⊥ is the orthogonal projection of x onto the line y(x) = 0, and r is the (signed) distance of x to the line. Multiply (4.6) left and right by w^T and add b:

w^T x + b = w^T x⊥ + b + r · (w^T w) / ||w||

The left-hand side is y(x), and w^T x⊥ + b = y(x⊥) = 0, while w^T w / ||w|| = ||w||^2 / ||w|| = ||w||. So we get

r = y(x) / ||w||    (4.7)
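As a quick numeric check of (4.7), a minimal R sketch (function name ours):

  # signed distance r = y(x)/||w|| of a point x to the hyperplane w'x + b = 0
  signed.dist <- function(x, w, b) (sum(w * x) + b) / sqrt(sum(w^2))

  signed.dist(c(1, 1), w = c(1, 1), b = -1)   # 1/sqrt(2), approx. 0.707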
The distance of a point x_n to the decision boundary is |y(x_n)| / ||w||. For lines that separate the data perfectly, we have t_n y(x_n) = |y(x_n)|, so that the distance is given by

t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||    (7.2)
Solve

arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }    (7.3)

Since 1/||w|| does not depend on n, it can be moved outside of the minimization.
Canonical Representation

Rescaling w → κw and b → κb does not change the distance t_n y(x_n)/||w||. We can use this freedom to set, for the point closest to the decision surface,

t_n (w^T φ(x_n) + b) = 1    (7.4)
Canonical Representation

In that case all data points satisfy the constraints

t_n (w^T φ(x_n) + b) ≥ 1,    n = 1, . . . , N    (7.5)

Maximizing 1/||w|| is then equivalent to the quadratic program

arg min_{w,b} (1/2) ||w||^2    (7.6)

subject to (7.5).
Lagrangian Function

L(w, b, a) = (1/2) ||w||^2 − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }    (7.7)

with

∂L(w, b, a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n)
Lagrangian Function

and for b:

∂L(w, b, a)/∂b = − Σ_{n=1}^N a_n t_n

Setting the derivatives to zero gives

w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.8)

and

Σ_{n=1}^N a_n t_n = 0    (7.9)
Dual Representation

Eliminating w and b from L(w, b, a) gives the dual representation:

L(w, b, a) = (1/2) ||w||^2 − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }

= (1/2) ||w||^2 − Σ_{n=1}^N a_n t_n w^T φ(x_n) − b Σ_{n=1}^N a_n t_n + Σ_{n=1}^N a_n

= (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m) − Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m) + Σ_{n=1}^N a_n

= Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n t_n a_m t_m φ(x_n)^T φ(x_m)

where we substituted (7.8) for w and used (7.9) to drop the term with b.
Dual Representation

Maximize

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m φ(x_n)^T φ(x_m)    (7.10)

subject to

a_n ≥ 0,    n = 1, . . . , N    (7.11)

Σ_{n=1}^N a_n t_n = 0.    (7.12)
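These slides use the e1071 package later on; purely as an illustration, (7.10)-(7.12) is a quadratic program that can also be solved directly, e.g. with the quadprog package. A sketch assuming a linear kernel; the small ridge added to the quadratic term is a numerical device, not part of the theory:

  library(quadprog)

  # solve the dual: max sum(a) - 0.5 a' Q a
  # s.t. sum(a_n t_n) = 0 and a_n >= 0, with Q[n,m] = t_n t_m k(x_n, x_m)
  svm.dual <- function(X, tn, eps = 1e-8) {
    N <- nrow(X)
    K <- X %*% t(X)                            # Gram matrix for the linear kernel
    Dmat <- outer(tn, tn) * K + diag(eps, N)   # ridge keeps Dmat positive definite
    dvec <- rep(1, N)
    Amat <- cbind(tn, diag(N))                 # 1st column: equality sum(a*t) = 0; rest: a_n >= 0
    bvec <- rep(0, N + 1)
    solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution   # the multipliers a_n
  }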
Kernel Function
Consider the feature mapping

φ(x) = (1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3)

Then

φ(x)^T φ(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + 2 x_3 z_3 + x_1^2 z_1^2 + x_2^2 z_2^2 + x_3^2 z_3^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 x_3 z_1 z_3 + 2 x_2 x_3 z_2 z_3

But this can be written as

(1 + x^T z)^2 = (1 + x_1 z_1 + x_2 z_2 + x_3 z_3)^2

which requires far fewer operations to compute.
For example, take x = (3, 2, 6) and z = (4, 1, 5). Then

φ(z) = (1, 4√2, √2, 5√2, 16, 1, 25, 4√2, 20√2, 5√2)

and

φ(x)^T φ(z) = 1 + 24 + 4 + 60 + 144 + 4 + 900 + 48 + 720 + 120 = 2025.

But

(1 + x^T z)^2 = (1 + (3)(4) + (2)(1) + (6)(5))^2 = 45^2 = 2025

is a more efficient way to compute this dot product.
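The same arithmetic in R (a sketch; phi as defined above):

  phi <- function(x) c(1, sqrt(2) * x, x^2,
                       sqrt(2) * c(x[1] * x[2], x[1] * x[3], x[2] * x[3]))
  x <- c(3, 2, 6); z <- c(4, 1, 5)
  sum(phi(x) * phi(z))   # 2025, via the 10-dimensional feature space
  (1 + sum(x * z))^2     # 2025, computed directly in 3 dimensions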
Kernels

Linear kernel:

k(x, x') = x^T x'

Two popular non-linear kernels are the polynomial kernel

k(x, x') = (x^T x' + c)^M

and the Gaussian (or radial) kernel

k(x, x') = exp(−||x − x'||^2 / 2σ^2),    (6.23)

or

k(x, x') = exp(−γ ||x − x'||^2),

where γ = 1/(2σ^2).
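These kernels are one-liners in R (function names ours; e1071's svm uses the gamma parametrization):

  k.linear <- function(x, z) sum(x * z)
  k.poly   <- function(x, z, c = 1, M = 2) (sum(x * z) + c)^M
  k.radial <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))
  # gamma = 1/(2*sigma^2) links the two parametrizations of the Gaussian kernel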
With a kernel function, the dual becomes: Maximize

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m k(x_n, x_m)    (7.10)

subject to

a_n ≥ 0,    n = 1, . . . , N    (7.11)

Σ_{n=1}^N a_n t_n = 0.    (7.12)
Prediction

Recall that

y(x) = w^T φ(x) + b    (7.1)

Substituting

w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.8)

gives

y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b    (7.13)
The solution satisfies the KKT conditions

a_n ≥ 0    (7.14)

t_n y(x_n) − 1 ≥ 0    (7.15)

a_n { t_n y(x_n) − 1 } = 0    (7.16)

Hence for every data point, either a_n = 0, or t_n y(x_n) = 1. The former play no role in making predictions (see 7.13), and the latter are the support vectors that lie on the maximum margin hyperplanes. Only the support vectors play a role in predicting the class of new attribute vectors!
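This is visible in e1071 (used later in these slides): the fitted object stores only the support vectors. A small synthetic illustration (data set ours; field names as in e1071's documented svm object):

  library(e1071)
  set.seed(1)
  d <- data.frame(x1 = c(rnorm(10, 0), rnorm(10, 3)),
                  x2 = c(rnorm(10, 0), rnorm(10, 3)),
                  t  = factor(rep(c(-1, 1), each = 10)))
  fit <- svm(t ~ x1 + x2, data = d, kernel = "linear")
  fit$index   # indices of the support vectors (the points with a_n > 0)
  fit$coefs   # the corresponding products a_n * t_n; all other a_n are zero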
Prediction: computing b

Since for any support vector x_n we have t_n y(x_n) = 1, we can use (7.13) to get

t_n ( b + Σ_{m∈S} a_m t_m k(x_n, x_m) ) = 1,    (7.17)

where S denotes the set of support vectors. Expanding gives

t_n b = 1 − t_n Σ_{m∈S} a_m t_m k(x_n, x_m)

Multiplying both sides by t_n and using t_n^2 = 1:

b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)    (7.17a)
Prediction: computing b

A numerically more stable solution averages (7.17a) over all support vectors:

b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )    (7.18)
Example

  x1      x2      t    a
  0.3858  0.4687  +1   65.5261
  0.4871  0.6110  -1   65.5261
  0.9218  0.4103  -1   0
  0.7382  0.8936  -1   0
  0.1763  0.0579  +1   0
  0.4057  0.3529  +1   0
  0.9355  0.8132  -1   0
  0.2146  0.0099  +1   0
w = Σ_{n=1}^N a_n t_n x_n = 65.5261 (0.3858, 0.4687)^T − 65.5261 (0.4871, 0.6110)^T = (−6.64, −9.32)^T
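The same computation in R, using only the two support vectors (the points with a_n = 0 contribute nothing):

  a  <- c(65.5261, 65.5261)
  tn <- c(1, -1)
  X  <- rbind(c(0.3858, 0.4687),
              c(0.4871, 0.6110))
  w  <- colSums((a * tn) * X)   # w = sum_n a_n t_n x_n, approx. (-6.64, -9.32)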
[Figure: the eight data points plotted in the (x1, x2) unit square, together with the maximum margin decision boundary.]
Example: prediction

We want to predict the class of the new point z = (0.6, 0.8)^T.
Using (7.13) and the fact that we have a linear kernel:

y(z) = Σ_{n=1}^N a_n t_n k(z, x_n) + b = Σ_{n=1}^N a_n t_n z^T x_n + b

= 65.5261{(0.6)(0.3858) + (0.8)(0.4687)} − 65.5261{(0.6)(0.4871) + (0.8)(0.611)} + b = −11.44271 + b

We compute b with (7.17a) and the first data point:

b = 1 − {65.5261(0.3858^2 + 0.4687^2) − 65.5261(0.3858 · 0.4871 + 0.4687 · 0.6110)} = 7.93119

Hence

y(z) = −11.44271 + 7.93119 = −3.51152

so z is assigned to class −1.
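A few lines of R reproduce these numbers (self-contained restatement of the support vectors):

  a  <- c(65.5261, 65.5261); tn <- c(1, -1)
  X  <- rbind(c(0.3858, 0.4687), c(0.4871, 0.6110))
  k1 <- drop(X %*% X[1, ])                # linear kernel values k(x_n, x_1)
  b  <- tn[1] - sum(a * tn * k1)          # (7.17a) with the first support vector: approx. 7.93
  z  <- c(0.6, 0.8)
  y.z <- sum(a * tn * drop(X %*% z)) + b  # (7.13): approx. -3.51, so class -1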
Allowing Errors

So far we assumed that the training data points are linearly separable in feature space φ(x).

The resulting SVM gives exact separation of the training data in the original input space x, with a non-linear decision boundary.

Class distributions typically overlap, in which case exact separation of the training data leads to poor generalization (overfitting).
[Figure: slack variables. Points on or beyond the correct margin boundary have ξ = 0; points inside the margin but on the correct side of the decision boundary have ξ < 1; misclassified points have ξ > 1. The lines y(x) = 1, y(x) = 0 and y(x) = −1 are shown.]
New Constraints

The exact classification constraints

t_n y(x_n) ≥ 1,    n = 1, . . . , N    (7.5)

are replaced by

t_n y(x_n) ≥ 1 − ξ_n,    n = 1, . . . , N    (7.20)

with slack variables ξ_n ≥ 0.

Check (7.20):

ξ_n = 0 for data points that are on the inside of the correct margin boundary. In that case y_n t_n ≥ 1.

Suppose t_n = +1 and x_n is on the wrong side of the margin boundary, i.e. y_n t_n < 1. Then ξ_n = |t_n − y_n|, and since y_n = y_n t_n we have

ξ_n = |t_n − y_n| = |1 − y_n t_n| = 1 − y_n t_n

and therefore t_n y_n = 1 − ξ_n.

Suppose t_n = −1 . . .
We therefore minimize

C Σ_{n=1}^N ξ_n + (1/2) ||w||^2    (7.21)

where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin. Alternative view (divide by C and put λ = 1/(2C)):

Σ_{n=1}^N ξ_n + λ Σ_i w_i^2

The first term represents lack-of-fit (hinge loss) and the second term takes care of regularization.
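At the optimum, ξ_n equals the hinge loss max(0, 1 − t_n y_n), so the alternative view can be written down directly. A sketch for the linear kernel (function names ours):

  hinge <- function(tn, y) pmax(0, 1 - tn * y)   # zero iff the point is on the correct side of the margin

  svm.objective <- function(w, b, X, tn, lambda) {
    y <- drop(X %*% w + b)
    sum(hinge(tn, y)) + lambda * sum(w^2)        # lack-of-fit + regularization, lambda = 1/(2C)
  }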
Optimization Problem

The Lagrangian is given by

L(w, b, a) = (1/2) ||w||^2 + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^N μ_n ξ_n    (7.22)

with KKT conditions

a_n ≥ 0    (7.23)

t_n y(x_n) − 1 + ξ_n ≥ 0    (7.24)

a_n (t_n y(x_n) − 1 + ξ_n) = 0    (7.25)

μ_n ≥ 0    (7.26)

ξ_n ≥ 0    (7.27)

μ_n ξ_n = 0    (7.28)
Dual

∂L/∂w = 0 ⇒ w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.29)

∂L/∂b = 0 ⇒ Σ_{n=1}^N a_n t_n = 0    (7.30)

∂L/∂ξ_n = 0 ⇒ a_n = C − μ_n    (7.31)
Dual

Using these to eliminate w, b and ξ_n from the Lagrangian, we obtain the dual Lagrangian: Maximize

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m=1}^N a_n t_n a_m t_m k(x_n, x_m)    (7.32)

subject to

0 ≤ a_n ≤ C,    n = 1, . . . , N    (7.33)

Σ_{n=1}^N a_n t_n = 0.    (7.34)
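The only change to the earlier quadprog sketch is the extra box constraint a_n ≤ C (again an illustration, not how e1071 solves it):

  library(quadprog)

  svm.dual.soft <- function(X, tn, C = 1, eps = 1e-8) {
    N <- nrow(X)
    K <- X %*% t(X)                           # linear kernel; substitute any Gram matrix
    Dmat <- outer(tn, tn) * K + diag(eps, N)  # ridge keeps Dmat positive definite
    dvec <- rep(1, N)
    # columns: equality sum(a*t) = 0, then a_n >= 0, then -a_n >= -C (i.e. a_n <= C)
    Amat <- cbind(tn, diag(N), -diag(N))
    bvec <- c(0, rep(0, N), rep(-C, N))
    solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
  }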
Prediction

Recall that

y(x) = w^T φ(x) + b    (7.1)

Substituting

w = Σ_{n=1}^N a_n t_n φ(x_n)    (7.8)

again gives

y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b    (7.13)
Interpretation of Solution

We distinguish two cases:

Points with a_n = 0 do not play a role in making predictions.

Points with a_n > 0 are called support vectors. It follows from KKT condition

a_n (t_n y(x_n) − 1 + ξ_n) = 0    (7.25)

that for these points

t_n y_n = 1 − ξ_n

Again we have two cases:

If a_n < C then μ_n > 0, because a_n = C − μ_n. Since μ_n ξ_n = 0 (7.28), it follows that ξ_n = 0 and hence such points lie on the margin.

Points with a_n = C can be on the margin or inside the margin, and can either be correctly classified if ξ_n ≤ 1 or misclassified if ξ_n > 1.
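Given a dual solution, this case analysis amounts to comparing a_n with 0 and C up to a numerical tolerance (a sketch; names ours):

  sv.type <- function(a, C, tol = 1e-6) {
    ifelse(a < tol,     "not a support vector",
    ifelse(a < C - tol, "on the margin (0 < a_n < C)",
                        "at the bound (a_n = C)"))
  }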
To compute the value of b, we use the fact that those support vectors with 0 < a_n < C have ξ_n = 0, so that t_n y(x_n) = 1, and like before we have

b = t_n − Σ_{m∈S} a_m t_m k(x_n, x_m)    (7.17a)
Model Selection
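One common approach in R is a cross-validated grid search over the kernel and cost parameters, e.g. with e1071's tune.svm (grid values ours; conn.dat as in the examples below, with cause coded as a factor):

  library(e1071)
  tuned <- tune.svm(cause ~ sodium + co2, data = conn.dat,
                    gamma = 10^(-2:2), cost = 10^(0:3))
  summary(tuned)          # cross-validated error for each (gamma, cost) pair
  tuned$best.parameters   # the selected combination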
How to in R

> library(e1071)
> conn.svm.lin <- svm(cause ~ sodium + co2,
    data=conn.dat, kernel="linear")
> plot(conn.svm.lin, conn.dat)
> conn.svm.lin.predict <- predict(conn.svm.lin, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.lin.predict)
   conn.svm.lin.predict
      0  1
    0 17  3
    1  2  8
[Figure: classification plot of the linear-kernel SVM, co2 (22-32) on the horizontal axis and sodium (138-146) on the vertical axis; support vectors are drawn as x, the remaining points as o.]
How to in R

With the radial kernel (the default of svm):

> conn.svm.rad <- svm(cause ~ sodium + co2, data=conn.dat)
> plot(conn.svm.rad, conn.dat)
[Figure: classification plot of the radial-kernel SVM with default settings, co2 (22-32) against sodium (138-146); support vectors drawn as x, remaining points as o.]
How to in R

> conn.svm.rad <- svm(cause ~ sodium + co2,
    data=conn.dat, cost=100)
> plot(conn.svm.rad, conn.dat)
> conn.svm.rad.predict <- predict(conn.svm.rad, conn.dat[,1:2])
> table(conn.dat[,3], conn.svm.rad.predict)
   conn.svm.rad.predict
      0  1
    0 19  1
    1  1  9
[Figure: classification plot of the radial-kernel SVM with cost=100, co2 (22-32) against sodium (138-146); support vectors drawn as x, remaining points as o.]
SVM in R

The svm function in the e1071 package is an interface to the libsvm library.