
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 13:

KERNEL MACHINES
Kernel Machines

Discriminant-based: no need to estimate densities first
Define the discriminant in terms of support vectors
The use of kernel functions, application-specific measures of similarity
No need to represent instances as vectors
Convex optimization problems with a unique solution
Optimal Separating Hyperplane
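
$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t$ where
$$r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$
find $\mathbf{w}$ and $w_0$ such that
$$\mathbf{w}^T \mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T \mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1$$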

(Cortes and Vapnik, 1995; Vapnik, 1995)


Margin
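
Distance from the discriminant to the closest instances on either side
Distance of $\mathbf{x}^t$ to the hyperplane is
$$\frac{|\mathbf{w}^T \mathbf{x}^t + w_0|}{\|\mathbf{w}\|}$$
We require
$$\frac{r^t(\mathbf{w}^T \mathbf{x}^t + w_0)}{\|\mathbf{w}\|} \ge \rho, \quad \forall t$$
For a unique solution, fix $\rho\|\mathbf{w}\| = 1$; then, to maximize the margin,
$$\min \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1, \ \forall t$$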
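
As a rough numerical illustration of the margin constraint, the sketch below checks $r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge 1$ on a toy sample and takes the margin to be the smallest normalized distance. The data, the values of `w` and `w0`, and the helper names are made up for illustration and are not from the slides.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def normalized_distances(X, r, w, w0):
    """Return r^t (w^T x^t + w0) / ||w|| for every instance."""
    norm = math.sqrt(dot(w, w))
    return [rt * (dot(w, xt) + w0) / norm for xt, rt in zip(X, r)]

# Toy, linearly separable sample (assumed for illustration).
X = [(2.0, 2.0), (1.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
r = [+1, +1, -1, -1]
w, w0 = (0.5, 0.5), 0.0  # hypothetical hyperplane, not a fitted model

# Primal constraint: r^t (w^T x^t + w0) >= 1 for all t.
feasible = all(rt * (dot(w, xt) + w0) >= 1.0 for xt, rt in zip(X, r))

# When the constraint is tight for the closest instances,
# their distance to the hyperplane is 1 / ||w||.
margin = min(normalized_distances(X, r, w, w0))
```

Here the closest instances satisfy the constraint with equality, so `margin` coincides with $1/\|\mathbf{w}\|$; shrinking $\|\mathbf{w}\|$ while staying feasible is what widens the margin.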
Margin
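
$$\min \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge +1, \ \forall t$$

$$\begin{aligned}
L_p &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\mathbf{w}^T \mathbf{x}^t + w_0) - 1 \right] \\
    &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t(\mathbf{w}^T \mathbf{x}^t + w_0) + \sum_{t=1}^{N} \alpha^t
\end{aligned}$$

$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$

$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0$$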

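$$\begin{aligned}
L_d &= \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\
    &= -\frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t \\
    &= -\frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t
\end{aligned}$$

subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \ \forall t$

Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; they are the support vectors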
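
To make the dual concrete, here is a minimal sketch on a two-point sample where the solution can be worked out by hand: the constraint $\sum_t \alpha^t r^t = 0$ forces $\alpha^1 = \alpha^2 = \alpha$, the dual becomes $2\alpha - 4\alpha^2$, and it is maximized at $\alpha = 1/4$. The function names are hypothetical, and computing $w_0$ as $r^t - \mathbf{w}^T \mathbf{x}^t$ averaged over the support vectors is an assumption that follows from the constraint being tight on them; it is not stated on the slide.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, r):
    """L_d = sum_t a^t - 1/2 sum_t sum_s a^t a^s r^t r^s (x^t)^T x^s."""
    n = len(X)
    quad = sum(alpha[t] * alpha[s] * r[t] * r[s] * dot(X[t], X[s])
               for t in range(n) for s in range(n))
    return sum(alpha) - 0.5 * quad

def recover_hyperplane(alpha, X, r, tol=1e-9):
    """w = sum_t a^t r^t x^t; w0 averaged over instances with a^t > 0."""
    d = len(X[0])
    w = [sum(a * rt * xt[j] for a, rt, xt in zip(alpha, r, X))
         for j in range(d)]
    support = [t for t, a in enumerate(alpha) if a > tol]
    # On a support vector r^t (w^T x^t + w0) = 1, hence w0 = r^t - w^T x^t.
    w0 = sum(r[t] - dot(w, X[t]) for t in support) / len(support)
    return w, w0, support

# Toy sample (assumed for illustration).
X = [(1.0, 1.0), (-1.0, -1.0)]
r = [+1, -1]
alpha = [0.25, 0.25]  # hand-derived maximizer for this sample

w, w0, support = recover_hyperplane(alpha, X, r)
```

With these values both instances are support vectors, and nearby feasible choices such as `[0.2, 0.2]` or `[0.3, 0.3]` give a smaller dual objective.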

Soft Margin Hyperplane
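
Not linearly separable
$$r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge 1 - \xi^t$$
Soft error
$$\sum_t \xi^t$$
New primal is
$$L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t \left[ r^t(\mathbf{w}^T \mathbf{x}^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$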
Hinge Loss

$$L_{hinge}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$$
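
A minimal sketch of the hinge loss, together with a soft-margin objective written in terms of it. It assumes $y^t = \mathbf{w}^T \mathbf{x}^t + w_0$ and uses the fact that, for fixed $\mathbf{w}$ and $w_0$, the smallest feasible slack is $\xi^t = \max(0, 1 - y^t r^t)$. The data, weights, and function names are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(y, r):
    """0 if y * r >= 1, otherwise 1 - y * r."""
    return max(0.0, 1.0 - y * r)

def soft_margin_objective(w, w0, X, r, C):
    """1/2 ||w||^2 + C * sum_t hinge(y^t, r^t), with y^t = w^T x^t + w0."""
    penalty = sum(hinge_loss(dot(w, xt) + w0, rt) for xt, rt in zip(X, r))
    return 0.5 * dot(w, w) + C * penalty

# Toy sample that is not separable with margin 1 by this hyperplane (assumed).
X = [(2.0, 2.0), (0.5, 0.5), (-1.0, -1.0), (0.5, 0.0)]
r = [+1, +1, -1, -1]
w, w0 = (0.5, 0.5), 0.0  # hypothetical weights, not a fitted model

losses = [hinge_loss(dot(w, xt) + w0, rt) for xt, rt in zip(X, r)]
objective = soft_margin_objective(w, w0, X, r, C=1.0)
```

The second instance is on the correct side but inside the margin (loss between 0 and 1), while the fourth is misclassified (loss above 1).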
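ν-SVM

$$\min \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{N}\sum_t \xi^t$$
subject to
$$r^t(\mathbf{w}^T \mathbf{x}^t + w_0) \ge \rho - \xi^t, \quad \xi^t \ge 0, \quad \rho \ge 0$$

$$L_d = -\frac{1}{2}\sum_{t=1}^{N}\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s$$
subject to
$$\sum_t \alpha^t r^t = 0, \quad 0 \le \alpha^t \le \frac{1}{N}, \quad \sum_t \alpha^t \ge \nu$$

ν controls the fraction of support vectors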


Kernel Trick
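
Preprocess input $\mathbf{x}$ by basis functions
$$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x}), \qquad g(\mathbf{z}) = \mathbf{w}^T \mathbf{z}, \qquad g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$
The SVM solution
$$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$$
$$g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x})$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$$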
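
The kernelized discriminant can be sketched as a function that takes any kernel as an argument, so the basis functions never have to be computed explicitly. The toy sample, the $\alpha^t$ values, and the function names are assumptions for illustration; as on the slide, no separate bias term is used.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, y):
    """K(x, y) = x^T y, i.e. phi is the identity map."""
    return dot(x, y)

def discriminant(x, alpha, r, X, kernel):
    """g(x) = sum_t alpha^t r^t K(x^t, x); terms with alpha^t = 0 are skipped."""
    return sum(a * rt * kernel(xt, x)
               for a, rt, xt in zip(alpha, r, X) if a > 0)

# Toy training sample and hypothetical dual variables (assumed).
X = [(1.0, 1.0), (-1.0, -1.0)]
r = [+1, -1]
alpha = [0.25, 0.25]

g_pos = discriminant((1.0, 1.0), alpha, r, X, linear_kernel)
g_neg = discriminant((-2.0, 0.0), alpha, r, X, linear_kernel)
predicted = [+1 if g >= 0 else -1 for g in (g_pos, g_neg)]
```

Swapping `linear_kernel` for another valid kernel changes the implicit feature space without touching `discriminant`.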
Vectorial Kernels
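
Polynomials of degree $q$:
$$K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T \mathbf{x}^t + 1)^q$$

For $q = 2$ and two-dimensional inputs:
$$\begin{aligned}
K(\mathbf{x}, \mathbf{y}) &= (\mathbf{x}^T \mathbf{y} + 1)^2 \\
 &= (x_1 y_1 + x_2 y_2 + 1)^2 \\
 &= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2
\end{aligned}$$
$$\boldsymbol{\phi}(\mathbf{x}) = \left[ 1, \ \sqrt{2}x_1, \ \sqrt{2}x_2, \ \sqrt{2}x_1 x_2, \ x_1^2, \ x_2^2 \right]^T$$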
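
The identity between the degree-2 polynomial kernel and the explicit six-dimensional map can be checked numerically. This is a small sketch; the input values and function names are arbitrary choices for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """K(x, y) = (x^T y + 1)^2."""
    return (dot(x, y) + 1.0) ** 2

def phi(x):
    """Explicit basis functions for the degree-2 kernel in two dimensions."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2]

x, y = (1.0, 2.0), (3.0, 4.0)       # arbitrary example inputs
k_direct = poly2_kernel(x, y)       # one inner product in input space
k_mapped = dot(phi(x), phi(y))      # inner product in the mapped space
```

Both routes give the same value up to rounding, but the kernel needs only a two-dimensional inner product.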
Vectorial Kernels

Radial-basis functions:
$$K(\mathbf{x}^t, \mathbf{x}) = \exp\left[ -\frac{\|\mathbf{x}^t - \mathbf{x}\|^2}{2s^2} \right]$$
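
A minimal sketch of the radial-basis function kernel, with the spread `s` as a parameter; the default value of `s` and the example points are arbitrary.

```python
import math

def rbf_kernel(xt, x, s=1.0):
    """K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xt, x))
    return math.exp(-sq_dist / (2.0 * s ** 2))

k_same = rbf_kernel((0.0, 0.0), (0.0, 0.0))         # identical points
k_near = rbf_kernel((0.0, 0.0), (1.0, 0.0), s=1.0)  # unit distance apart
k_wide = rbf_kernel((0.0, 0.0), (1.0, 0.0), s=2.0)  # same pair, larger spread
```

The kernel equals 1 for identical points and decays with distance; a larger `s` makes the decay slower, so more training instances influence each prediction.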

Defining kernels

Kernel engineering
Defining good measures of similarity
String kernels, graph kernels, image kernels, ...
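Empirical kernel map: define a set of templates $m_i$ and a score function $s(\mathbf{x}, m_i)$
$$\boldsymbol{\phi}(\mathbf{x}^t) = \left[ s(\mathbf{x}^t, m_1), \ s(\mathbf{x}^t, m_2), \ \ldots, \ s(\mathbf{x}^t, m_M) \right]$$
and
$$K(\mathbf{x}, \mathbf{x}^t) = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}^t)$$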
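
One way to picture the empirical kernel map is with strings, where instances are not vectors to begin with. In the sketch below the templates and the score function (number of shared character bigrams) are invented purely for illustration; any application-specific score could take their place.

```python
def bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def score(x, m):
    """Hypothetical score s(x, m): bigrams shared by x and template m."""
    return len(bigrams(x) & bigrams(m))

def empirical_map(x, templates):
    """phi(x) = [s(x, m_1), ..., s(x, m_M)]."""
    return [score(x, m) for m in templates]

def empirical_kernel(x, y, templates):
    """K(x, y) = phi(x)^T phi(y)."""
    px, py = empirical_map(x, templates), empirical_map(y, templates)
    return sum(a * b for a, b in zip(px, py))

templates = ["kernel", "margin"]  # made-up templates

phi_kern = empirical_map("kern", templates)
phi_marg = empirical_map("marg", templates)
k_cross = empirical_kernel("kern", "marg", templates)
k_self = empirical_kernel("kern", "kern", templates)
```

Because the kernel is an inner product of explicit feature vectors, it is a valid kernel whatever score function is chosen.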
Multiple Kernel Learning
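
Fixed kernel combination
$$K(\mathbf{x}, \mathbf{y}) = \begin{cases} cK(\mathbf{x}, \mathbf{y}) \\ K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y}) \\ K_1(\mathbf{x}, \mathbf{y}) \, K_2(\mathbf{x}, \mathbf{y}) \end{cases}$$

Adaptive kernel combination
$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} \eta_i K_i(\mathbf{x}, \mathbf{y})$$
$$L_d = \sum_t \alpha^t - \frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}^s)$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x})$$

Localized kernel combination
$$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i(\mathbf{x} \mid \theta) K_i(\mathbf{x}^t, \mathbf{x})$$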
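
The combination rules can be sketched as higher-order functions that build a new kernel from existing ones. In this sketch the weights $\eta_i$ are simply given; in the adaptive case they would be learned together with the $\alpha^t$, which is not shown. The helper names and example inputs are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, y):
    return dot(x, y)

def poly2_kernel(x, y):
    return (dot(x, y) + 1.0) ** 2

def weighted_sum(kernels, eta):
    """Return K(x, y) = sum_i eta_i K_i(x, y) for given weights eta_i."""
    def combined(x, y):
        return sum(e * k(x, y) for e, k in zip(eta, kernels))
    return combined

def product(k1, k2):
    """Return K(x, y) = K_1(x, y) * K_2(x, y)."""
    def combined(x, y):
        return k1(x, y) * k2(x, y)
    return combined

x, y = (1.0, 2.0), (3.0, 4.0)  # arbitrary example inputs
k_sum = weighted_sum([linear_kernel, poly2_kernel], [1.0, 1.0])(x, y)
k_avg = weighted_sum([linear_kernel, poly2_kernel], [0.5, 0.5])(x, y)
k_prod = product(linear_kernel, poly2_kernel)(x, y)
```

Keeping the weights nonnegative ensures that the weighted sum of valid kernels is itself a valid kernel.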
Multiclass Kernel Machines

1-vs-all
Pairwise separation
Error-Correcting Output Codes (section 17.5)
Single multiclass optimization
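$$\min \frac{1}{2}\sum_{i=1}^{K} \|\mathbf{w}_i\|^2 + C\sum_i \sum_t \xi_i^t$$
subject to
$$\mathbf{w}_{z^t}^T \mathbf{x}^t + w_{z^t 0} \ge \mathbf{w}_i^T \mathbf{x}^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t, \quad \xi_i^t \ge 0$$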
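
For the single multiclass formulation, the constraint says that the score of the correct class $z^t$ should exceed every other class score by at least 2. The sketch below computes the class scores, the predicted class, and the smallest slacks that satisfy the constraint for given weights. The weights, inputs, and function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def class_scores(x, W, w0):
    """g_i(x) = w_i^T x + w_i0 for every class i."""
    return [dot(wi, x) + wi0 for wi, wi0 in zip(W, w0)]

def predict(x, W, w0):
    """Choose the class with the largest score."""
    g = class_scores(x, W, w0)
    return max(range(len(g)), key=g.__getitem__)

def slacks(x, z, W, w0):
    """xi_i = max(0, g_i(x) + 2 - g_z(x)) for every class i != z."""
    g = class_scores(x, W, w0)
    return {i: max(0.0, g[i] + 2.0 - g[z]) for i in range(len(g)) if i != z}

# Hypothetical weights for three classes in two dimensions.
W = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
w0 = [0.0, 0.0, 0.0]

far = slacks((3.0, 0.0), 0, W, w0)    # correct class wins by a wide margin
close = slacks((1.0, 0.5), 0, W, w0)  # correct class wins, but by less than 2
```

The second instance is classified correctly yet still incurs a slack, which is the multiclass analogue of falling inside the margin.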
SVM for Regression
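
Use a linear model (possibly kernelized)
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$
Use the $\epsilon$-sensitive error function
$$e_\epsilon(r^t, f(\mathbf{x}^t)) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \epsilon \\ |r^t - f(\mathbf{x}^t)| - \epsilon & \text{otherwise} \end{cases}$$

$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \left( \xi_+^t + \xi_-^t \right)$$
subject to
$$\begin{aligned}
r^t - (\mathbf{w}^T \mathbf{x}^t + w_0) &\le \epsilon + \xi_+^t \\
(\mathbf{w}^T \mathbf{x}^t + w_0) - r^t &\le \epsilon + \xi_-^t \\
\xi_+^t, \ \xi_-^t &\ge 0
\end{aligned}$$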
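
A short sketch of the $\epsilon$-sensitive error and of the two slack variables, one for predictions that are too low and one for predictions that are too high. The function names and example numbers are illustrative assumptions.

```python
def eps_sensitive_error(r, f, eps):
    """0 inside the tube |r - f| < eps, otherwise |r - f| - eps."""
    return max(0.0, abs(r - f) - eps)

def svr_slacks(r, f, eps):
    """Smallest (xi_plus, xi_minus) satisfying the two constraints.

    xi_plus  covers r - f <= eps + xi_plus  (prediction too low),
    xi_minus covers f - r <= eps + xi_minus (prediction too high).
    """
    return max(0.0, (r - f) - eps), max(0.0, (f - r) - eps)

inside = eps_sensitive_error(1.0, 1.05, eps=0.1)   # within the tube
outside = eps_sensitive_error(1.0, 1.5, eps=0.1)   # outside the tube
too_low = svr_slacks(2.0, 1.0, eps=0.5)
too_high = svr_slacks(1.0, 2.0, eps=0.5)
```

At most one of the two slacks is nonzero for any instance, and their sum equals the $\epsilon$-sensitive error.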
Kernel Regression

[Figure: kernel regression fits; panels: polynomial kernel, Gaussian kernel]


Kernel Machines for Ranking
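
We require not only that the scores be in the correct order but also that they be separated by a margin of at least +1 unit.
Linear case:
$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t$$
subject to
$$\mathbf{w}^T \mathbf{x}^u \ge \mathbf{w}^T \mathbf{x}^v + 1 - \xi^t, \quad \forall t : r^u \prec r^v, \quad \xi^t \ge 0$$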
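
The ranking constraint can be sketched as a hinge on score differences: each pair in which the first instance should be ranked above the second contributes a slack when its score difference falls short of 1. The pair data, the weight vector, and the function names are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def ranking_slacks(w, pairs):
    """xi = max(0, 1 - (w^T x_u - w^T x_v)) for each pair (x_u, x_v).

    Each pair is assumed to list the preferred instance x_u first.
    """
    return [max(0.0, 1.0 - (dot(w, xu) - dot(w, xv))) for xu, xv in pairs]

def ranking_objective(w, pairs, C):
    """1/2 ||w||^2 + C * sum of the pairwise slacks."""
    return 0.5 * dot(w, w) + C * sum(ranking_slacks(w, pairs))

w = (1.0, 0.0)  # hypothetical weight vector
pairs = [
    ((3.0, 0.0), (1.0, 0.0)),  # correct order, margin met
    ((1.5, 0.0), (1.0, 0.0)),  # correct order, margin too small
    ((0.0, 0.0), (1.0, 0.0)),  # wrong order
]

xi = ranking_slacks(w, pairs)
objective = ranking_objective(w, pairs, C=1.0)
```

A correctly ordered pair can still be penalized when the two scores are closer than one unit.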
One-Class Kernel Machines
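
Consider a sphere with center $\mathbf{a}$ and radius $R$
$$\min R^2 + C\sum_t \xi^t$$
subject to
$$\|\mathbf{x}^t - \mathbf{a}\|^2 \le R^2 + \xi^t, \quad \xi^t \ge 0$$

$$L_d = \sum_t \alpha^t (\mathbf{x}^t)^T \mathbf{x}^t - \sum_{t=1}^{N}\sum_s \alpha^t \alpha^s (\mathbf{x}^t)^T \mathbf{x}^s$$
subject to
$$0 \le \alpha^t \le C, \quad \sum_t \alpha^t = 1$$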
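
A minimal sketch of the one-class primal for a given sphere: instances outside the sphere incur a slack equal to how far their squared distance exceeds $R^2$. The center, radius, data, and function names are assumptions; finding the optimal center and radius is not shown.

```python
def sq_dist(x, a):
    """Squared Euclidean distance ||x - a||^2."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))

def sphere_slack(x, a, R):
    """xi = max(0, ||x - a||^2 - R^2)."""
    return max(0.0, sq_dist(x, a) - R ** 2)

def is_outlier(x, a, R):
    """True when x falls strictly outside the sphere."""
    return sq_dist(x, a) > R ** 2

def one_class_objective(X, a, R, C):
    """R^2 + C * sum_t xi^t for a fixed center a and radius R."""
    return R ** 2 + C * sum(sphere_slack(x, a, R) for x in X)

a, R = (0.0, 0.0), 1.0  # hypothetical sphere
X = [(0.5, 0.0), (0.0, 1.0), (2.0, 0.0)]

xi = [sphere_slack(x, a, R) for x in X]
flags = [is_outlier(x, a, R) for x in X]
objective = one_class_objective(X, a, R, C=0.1)
```

The constant `C` trades off the size of the sphere against the instances left outside it: a small `C` tolerates more outliers in exchange for a smaller radius.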
Large Margin Nearest Neighbor
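
Learns the matrix $\mathbf{M}$ of the Mahalanobis metric
$$D(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j)$$
For three instances $i$, $j$, and $l$, where $i$ and $j$ are of the same class and $l$ is of a different class, we require
$$D(\mathbf{x}_i, \mathbf{x}_l) > D(\mathbf{x}_i, \mathbf{x}_j) + 1$$
If this is not satisfied, we incur a slack for the difference, and we learn $\mathbf{M}$ to minimize the sum of such slacks over all $(i, j, l)$ triples ($j$ and $l$ each being among the $k$ nearest neighbors of $i$, over all $i$)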
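
The triple constraint can be sketched as a hinge-style slack on Mahalanobis distances. In the sketch below the matrices, the three instances, and the function names are invented for illustration; writing the slack as $\max(0, D(\mathbf{x}_i, \mathbf{x}_j) + 1 - D(\mathbf{x}_i, \mathbf{x}_l))$ is an assumed reading of "a slack for the difference".

```python
def mahalanobis(xi, xj, M):
    """D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    d = [a - b for a, b in zip(xi, xj)]
    Md = [sum(m * dq for m, dq in zip(row, d)) for row in M]
    return sum(dp * mp for dp, mp in zip(d, Md))

def triple_slack(xi, xj, xl, M):
    """max(0, D(x_i, x_j) + 1 - D(x_i, x_l)); zero when the margin holds."""
    return max(0.0, mahalanobis(xi, xj, M) + 1.0 - mahalanobis(xi, xl, M))

x_i = (0.0, 0.0)
x_j = (1.0, 0.0)   # same class as x_i (assumed)
x_l = (0.0, 1.2)   # different class (assumed)

identity = [[1.0, 0.0], [0.0, 1.0]]     # plain squared Euclidean distance
stretched = [[0.25, 0.0], [0.0, 4.0]]   # hypothetical learned metric

slack_euclid = triple_slack(x_i, x_j, x_l, identity)
slack_learned = triple_slack(x_i, x_j, x_l, stretched)
```

Under the Euclidean metric the impostor `x_l` is too close and the triple is penalized; the second metric shrinks the first axis and stretches the second, which removes the slack.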
Learning a Distance Measure

LMNN algorithm (Weinberger and Saul 2009)
LMCA algorithm (Torresani and Lee 2007) uses a similar approach where $\mathbf{M} = \mathbf{L}^T \mathbf{L}$ and learns $\mathbf{L}$
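
With $\mathbf{M} = \mathbf{L}^T \mathbf{L}$, the Mahalanobis distance equals the squared Euclidean distance after mapping the instances through $\mathbf{L}$. The sketch below checks this numerically for an arbitrary, made-up $\mathbf{L}$; the helper names are hypothetical.

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matvec(A, v):
    """Matrix-vector product A v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    """Matrix product A B."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

L = [[1.0, 0.0],
     [1.0, 1.0]]             # arbitrary example, not a learned matrix
M = matmul(transpose(L), L)  # M = L^T L

d = [1.0, 2.0]               # difference x_i - x_j for some pair
dist_via_M = sum(di * mi for di, mi in zip(d, matvec(M, d)))
dist_via_L = sum(c ** 2 for c in matvec(L, d))
```

A matrix built this way is positive semidefinite, so the resulting distance can never be negative.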
Kernel Dimensionality Reduction

Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel)
Kernel LDA, CCA
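
The equivalence with canonical PCA under a linear kernel can be checked on a tiny example: the leading eigenvalue of the centered kernel matrix matches the leading eigenvalue of the scatter matrix of the centered data. This is only a sketch; the data, the use of power iteration, and all function names are assumptions chosen to keep the example dependency-free.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(X, kernel):
    """N x N kernel matrix with entries K(x^t, x^s)."""
    return [[kernel(a, b) for b in X] for a in X]

def center(K):
    """Double-center K, i.e. center the data in the feature space."""
    n = len(K)
    row = [sum(K[i]) / n for i in range(n)]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def top_eigenvalue(A, iters=200):
    """Leading eigenvalue of a symmetric PSD matrix by power iteration."""
    v = [1.0 / (i + 1) for i in range(len(A))]  # arbitrary start vector
    for _ in range(iters):
        Av = [dot(row, v) for row in A]
        norm = math.sqrt(dot(Av, Av))
        if norm == 0.0:
            return 0.0
        v = [a / norm for a in Av]
    return dot(v, [dot(row, v) for row in A])

X = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # toy data on a line

# Kernel PCA route: eigenvalue of the centered linear-kernel matrix.
lam_kernel = top_eigenvalue(center(gram_matrix(X, dot)))

# Canonical PCA route: eigenvalue of the scatter matrix of centered data.
mean = [sum(col) / len(X) for col in zip(*X)]
Xc = [[a - m for a, m in zip(x, mean)] for x in X]
scatter = [[sum(x[p] * x[q] for x in Xc) for q in range(2)] for p in range(2)]
lam_pca = top_eigenvalue(scatter)
```

Replacing `dot` with a nonlinear kernel in `gram_matrix` gives PCA in the corresponding feature space, where no explicit scatter matrix is available.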