I2ml3e Chap13

Lecture Slides for
INTRODUCTION TO MACHINE LEARNING
3RD EDITION
ETHEM ALPAYDIN
The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 13:
KERNEL MACHINES
Kernel Machines
Discriminant-based: No need to estimate densities first
Define the discriminant in terms of support vectors
The use of kernel functions, application-specific measures of similarity
No need to represent instances as vectors
Convex optimization problems with a unique solution
Optimal Separating Hyperplane
$\mathcal{X} = \{x^t, r^t\}_t$ where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$
find $w$ and $w_0$ such that
$w^T x^t + w_0 \ge +1$ for $r^t = +1$
$w^T x^t + w_0 \le -1$ for $r^t = -1$
which can be rewritten as
$r^t (w^T x^t + w_0) \ge +1$
(Cortes and Vapnik, 1995; Vapnik, 1995)
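The combined constraint $r^t (w^T x^t + w_0) \ge +1$ can be checked directly on data. The following is a minimal sketch, assuming a hypothetical toy data set and hand-picked values of w and w0 (none of these numbers come from the slides); the helper name `satisfies_margin` is also illustrative.

```python
def satisfies_margin(w, w0, X, r):
    """Return True if r^t (w^T x^t + w0) >= 1 holds for every instance."""
    return all(
        rt * (sum(wi * xi for wi, xi in zip(w, xt)) + w0) >= 1
        for xt, rt in zip(X, r)
    )

# Hypothetical toy data: two instances per class in two dimensions.
X = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
r = [+1, +1, -1, -1]

# Same direction, different scale: only the first satisfies the constraint.
ok_unit = satisfies_margin((1.0, 0.0), 0.0, X, r)
ok_small = satisfies_margin((0.1, 0.0), 0.0, X, r)
print(ok_unit, ok_small)
```

Both weight vectors separate the classes, but only the first places every instance at least one unit from the discriminant in terms of $w^T x^t + w_0$, which is why the scale of w matters in the formulation.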

Margin
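Distance from the discriminant to the closest instances on either side
Distance of $x^t$ to the hyperplane is $\dfrac{|w^T x^t + w_0|}{\|w\|}$
We require $\dfrac{r^t (w^T x^t + w_0)}{\|w\|} \ge \rho, \ \forall t$
For a unique solution, fix $\rho \|w\| = 1$, and to maximize the margin
$\min \frac{1}{2} \|w\|^2$ subject to $r^t (w^T x^t + w_0) \ge +1, \ \forall t$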
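As a small numeric illustration of the distance formula and of why minimizing $\|w\|$ maximizes the margin, here is a sketch using only the standard library. The weight vector and the test point are assumed values chosen so that the arithmetic is easy to follow.

```python
import math

def distance_to_hyperplane(w, w0, x):
    """Distance |w^T x + w0| / ||w|| of x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm

def margin_width(w):
    """Distance to the hyperplane of instances with r^t (w^T x^t + w0) = 1."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))

# Assumed example values: ||w|| = 5, so the margin is 1/5 on each side.
w, w0 = (3.0, 4.0), -5.0
d = distance_to_hyperplane(w, w0, (3.0, 4.0))
m = margin_width(w)
print(d, m)
```

Halving the norm of w (while still satisfying the constraints) would double `margin_width(w)`, which is the reason the objective is written in terms of $\|w\|^2$.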
Margin
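$\min \frac{1}{2} \|w\|^2$ subject to $r^t (w^T x^t + w_0) \ge +1, \ \forall t$
$L_p = \frac{1}{2} \|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right]$
$\phantom{L_p} = \frac{1}{2} \|w\|^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$
$\dfrac{\partial L_p}{\partial w} = 0 \Rightarrow w = \sum_{t=1}^{N} \alpha^t r^t x^t$
$\dfrac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0$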
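$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$
$\phantom{L_d} = -\frac{1}{2} w^T w + \sum_t \alpha^t$
$\phantom{L_d} = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$
subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \ \forall t$
Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; they are the support vectors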
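The relations between the dual variables and the primal solution can be traced on a problem small enough to solve by hand. The sketch below assumes a hypothetical two-instance data set whose dual solution (both multipliers equal to 0.5) follows from symmetry; it recovers w and w0 from the multipliers and compares the primal and dual objective values.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical two-instance problem; the dual solution is assumed, not computed.
X = [(1.0, 0.0), (-1.0, 0.0)]
r = [+1, -1]
alpha = [0.5, 0.5]  # both instances are support vectors

# Dual constraint: sum_t alpha^t r^t = 0
constraint = sum(a * rt for a, rt in zip(alpha, r))

# w = sum_t alpha^t r^t x^t
w = [sum(a * rt * xt[j] for a, rt, xt in zip(alpha, r, X)) for j in range(2)]

# For a support vector, r^t (w^T x^t + w0) = 1, so w0 = r^t - w^T x^t
w0 = r[0] - dot(w, X[0])

# Dual objective: sum_t alpha^t - 1/2 sum_t sum_s alpha^t alpha^s r^t r^s (x^t)^T x^s
dual = sum(alpha) - 0.5 * sum(
    alpha[t] * alpha[s] * r[t] * r[s] * dot(X[t], X[s])
    for t in range(2) for s in range(2)
)
primal = 0.5 * dot(w, w)
print(w, w0, primal, dual)
```

At the optimum the two objective values coincide, which is a quick sanity check when experimenting with a solver.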
Soft Margin Hyperplane
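Not linearly separable
$r^t (w^T x^t + w_0) \ge 1 - \xi^t$
Soft error $\sum_t \xi^t$
New primal is
$L_p = \frac{1}{2} \|w\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t (w^T x^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$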
Hinge Loss
$L_{hinge}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$
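The two cases of the hinge loss collapse into a single `max` expression. A minimal sketch follows; the function name and the sample values are illustrative.

```python
def hinge_loss(y, r):
    """Hinge loss for model output y and label r in {-1, +1}."""
    return max(0.0, 1.0 - y * r)

# Correct and outside the margin, correct but inside the margin, and wrong side.
losses = [hinge_loss(2.0, +1), hinge_loss(0.5, +1), hinge_loss(-1.0, +1)]
print(losses)
```

An instance that is classified correctly but lies inside the margin still incurs a loss, which is what distinguishes the hinge loss from the 0/1 loss.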
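ν-SVM
$\min \frac{1}{2} \|w\|^2 - \nu \rho + \frac{1}{N} \sum_t \xi^t$
subject to
$r^t (w^T x^t + w_0) \ge \rho - \xi^t, \ \xi^t \ge 0, \ \rho \ge 0$
$L_d = -\frac{1}{2} \sum_{t=1}^{N} \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s$
subject to
$\sum_t \alpha^t r^t = 0, \ 0 \le \alpha^t \le \frac{1}{N}, \ \sum_t \alpha^t \ge \nu$
ν controls the fraction of support vectors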

Kernel Trick
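Preprocess input $x$ by basis functions
$z = \phi(x)$, $g(z) = w^T z$
$g(x) = w^T \phi(x)$
The SVM solution
$w = \sum_t \alpha^t r^t z^t = \sum_t \alpha^t r^t \phi(x^t)$
$g(x) = w^T \phi(x) = \sum_t \alpha^t r^t \phi(x^t)^T \phi(x)$
$g(x) = \sum_t \alpha^t r^t K(x^t, x)$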
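The kernelized discriminant never forms w explicitly; it only needs kernel evaluations against the support vectors. The sketch below assumes a hypothetical two-instance problem with known multipliers and uses a linear kernel, so the result can be compared with the explicit $w^T x$.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel_discriminant(alpha, r, X, kernel, x):
    """g(x) = sum_t alpha^t r^t K(x^t, x), skipping instances with alpha^t = 0."""
    return sum(
        a * rt * kernel(xt, x)
        for a, rt, xt in zip(alpha, r, X)
        if a > 0
    )

# Assumed training set and multipliers (illustrative values only).
X = [(1.0, 0.0), (-1.0, 0.0)]
r = [+1, -1]
alpha = [0.5, 0.5]

x_new = (3.0, 2.0)
g = kernel_discriminant(alpha, r, X, linear_kernel, x_new)
# Explicit weight vector for comparison: w = sum_t alpha^t r^t x^t = (1, 0)
g_explicit = linear_kernel((1.0, 0.0), x_new)
print(g, g_explicit)
```

Swapping `linear_kernel` for any other kernel function changes the feature space without changing `kernel_discriminant`.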
Vectorial Kernels
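Polynomials of degree $q$:
$K(x^t, x) = (x^T x^t + 1)^q$
$K(x, y) = (x^T y + 1)^2$
$\phantom{K(x, y)} = (x_1 y_1 + x_2 y_2 + 1)^2$
$\phantom{K(x, y)} = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$
$\phantom{}\phi(x) = \left[ 1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2 \right]^T$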
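The identity between the degree-2 polynomial kernel and the inner product of the expanded feature vectors can be verified numerically. This is a minimal sketch for two-dimensional inputs; the sample points are arbitrary.

```python
import math

def poly2_kernel(x, y):
    """K(x, y) = (x^T y + 1)^2 for two-dimensional x and y."""
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** 2

def phi(x):
    """Explicit basis functions corresponding to the degree-2 polynomial kernel."""
    s = math.sqrt(2.0)
    return [1.0, s * x[0], s * x[1], s * x[0] * x[1], x[0] ** 2, x[1] ** 2]

x, y = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, y)
k_mapped = sum(a * b for a, b in zip(phi(x), phi(y)))
print(k_direct, k_mapped)
```

The kernel costs one inner product in the input space, whereas the explicit map needs six basis functions already for two inputs; the gap grows quickly with the degree and the dimensionality.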
Vectorial Kernels
Radial-basis functions:
$K(x^t, x) = \exp\left[ -\dfrac{\|x^t - x\|^2}{2 s^2} \right]$
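A direct transcription of the radial-basis kernel, as a sketch; the default spread `s=1.0` is an assumption for illustration.

```python
import math

def rbf_kernel(xt, x, s=1.0):
    """K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xt, x))
    return math.exp(-sq_dist / (2.0 * s ** 2))

same = rbf_kernel((0.0, 0.0), (0.0, 0.0))
near = rbf_kernel((1.0, 0.0), (0.0, 0.0), s=1.0)
wide = rbf_kernel((1.0, 0.0), (0.0, 0.0), s=3.0)
print(same, near, wide)
```

The kernel equals 1 for identical inputs and decays with distance; a larger s makes the decay slower, so more training instances influence each prediction.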

Defining kernels
Kernel engineering
Defining good measures of similarity
String kernels, graph kernels, image kernels, ...
Empirical kernel map: Define a set of templates $m_i$ and score function $s(x, m_i)$
$\phi(x^t) = [s(x^t, m_1), s(x^t, m_2), \ldots, s(x^t, m_M)]$
and
$K(x, x^t) = \phi(x)^T \phi(x^t)$
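An empirical kernel map works for inputs that are not vectors, as long as a score against each template can be computed. The sketch below uses strings; the score function (number of shared character bigrams) and the template words are assumptions chosen only for illustration.

```python
def bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def score(x, m):
    """Hypothetical similarity score s(x, m): number of shared bigrams."""
    return float(len(bigrams(x) & bigrams(m)))

def empirical_map(x, templates):
    """phi(x) = [s(x, m_1), ..., s(x, m_M)]."""
    return [score(x, m) for m in templates]

def empirical_kernel(x, xt, templates):
    """K(x, x^t) = phi(x)^T phi(x^t)."""
    px, pxt = empirical_map(x, templates), empirical_map(xt, templates)
    return sum(a * b for a, b in zip(px, pxt))

templates = ["kernel", "margin", "vector"]
phi_x = empirical_map("kernels", templates)
k = empirical_kernel("kernels", "colonel", templates)
print(phi_x, k)
```

Because the kernel is an inner product of explicit score vectors, it is symmetric and positive semidefinite by construction, whatever score function is used.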
Multiple Kernel Learning
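Fixed kernel combination
$K(x, y) = \begin{cases} c K(x, y) \\ K_1(x, y) + K_2(x, y) \\ K_1(x, y) K_2(x, y) \end{cases}$
Adaptive kernel combination
$K(x, y) = \sum_{i=1}^{m} \eta_i K_i(x, y)$
$L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(x^t, x^s)$
$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(x^t, x)$
Localized kernel combination
$g(x) = \sum_t \alpha^t r^t \sum_i \eta_i(x | \theta) K_i(x^t, x)$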
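A weighted sum of kernels can be wrapped as a single kernel function and passed wherever one kernel is expected. In the sketch below the weights are fixed by hand for illustration; in multiple kernel learning they would be learned together with the multipliers. The function names are assumptions.

```python
def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly2_kernel(x, y):
    return (linear_kernel(x, y) + 1.0) ** 2

def combine_kernels(kernels, eta):
    """Return K(x, y) = sum_i eta_i K_i(x, y) as a new kernel function."""
    def combined(x, y):
        return sum(e * k(x, y) for e, k in zip(eta, kernels))
    return combined

# Hand-picked nonnegative weights (illustrative, not learned).
K = combine_kernels([linear_kernel, poly2_kernel], eta=[0.25, 0.75])
value = K((1.0, 2.0), (3.0, -1.0))
print(value)
```

Keeping the weights nonnegative preserves validity, since a nonnegative combination of valid kernels is again a valid kernel.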
Multiclass Kernel Machines
1-vs-all
Pairwise separation
Error-Correcting Output Codes (section 17.5)
Single multiclass optimization
$\min \frac{1}{2} \sum_{i=1}^{K} \|w_i\|^2 + C \sum_i \sum_t \xi_i^t$
subject to
$w_{z^t}^T x^t + w_{z^t 0} \ge w_i^T x^t + w_{i0} + 2 - \xi_i^t, \ \forall i \ne z^t, \ \xi_i^t \ge 0$
SVM for Regression
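Use a linear model (possibly kernelized)
$f(x) = w^T x + w_0$
Use the $\epsilon$-sensitive error function
$e_\epsilon(r^t, f(x^t)) = \begin{cases} 0 & \text{if } |r^t - f(x^t)| < \epsilon \\ |r^t - f(x^t)| - \epsilon & \text{otherwise} \end{cases}$
$\min \frac{1}{2} \|w\|^2 + C \sum_t (\xi_+^t + \xi_-^t)$
subject to
$r^t - (w^T x^t + w_0) \le \epsilon + \xi_+^t$
$(w^T x^t + w_0) - r^t \le \epsilon + \xi_-^t$
$\xi_+^t, \xi_-^t \ge 0$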
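The $\epsilon$-sensitive error ignores deviations smaller than $\epsilon$ and grows linearly beyond that. A minimal sketch, with illustrative sample values:

```python
def eps_insensitive(r, f, eps):
    """Error is 0 inside the epsilon tube and |r - f| - eps outside it."""
    return max(0.0, abs(r - f) - eps)

inside = eps_insensitive(1.0, 1.05, eps=0.1)
outside = eps_insensitive(2.0, 1.0, eps=0.25)
print(inside, outside)
```

Because the error grows linearly rather than quadratically, a single distant instance affects the fit less than it would under squared error.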
Kernel Regression
[Figure panels: Polynomial kernel; Gaussian kernel]

Kernel Machines for Ranking
We require not only that scores be in the correct order but that they be separated by at least a +1 unit margin.
Linear case:
$\min \frac{1}{2} \|w\|^2 + C \sum_t \xi^t$
subject to
$w^T x^u \ge w^T x^v + 1 - \xi^t, \ \forall t : r^u \prec r^v, \ \xi^t \ge 0$
One-Class Kernel Machines
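Consider a sphere with center $a$ and radius $R$
$\min R^2 + C \sum_t \xi^t$
subject to
$\|x^t - a\|^2 \le R^2 + \xi^t, \ \xi^t \ge 0$
$L_d = \sum_t \alpha^t (x^t)^T x^t - \sum_{t=1}^{N} \sum_s \alpha^t \alpha^s (x^t)^T x^s$
subject to
$0 \le \alpha^t \le C, \ \sum_t \alpha^t = 1$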
Large Margin Nearest Neighbor
Learns the matrix $M$ of Mahalanobis metric
$D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
For three instances i, j, and l, where i and j are of the same class and l different, we require
$D(x_i, x_l) > D(x_i, x_j) + 1$
and if this is not satisfied, we have a slack for the difference and we learn M to minimize the sum of such slacks over all i, j, l triples (j and l being one of k neighbors of i, over all i)
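The slack for one triple can be computed directly from the distance definition. The sketch below is not the LMNN optimizer; it only evaluates the slack for a hypothetical triple under two hand-picked matrices M, to show how changing the metric can remove a violation.

```python
def mahalanobis(M, xi, xj):
    """D(xi, xj) = (xi - xj)^T M (xi - xj)."""
    d = [a - b for a, b in zip(xi, xj)]
    Md = [sum(M[p][q] * d[q] for q in range(len(d))) for p in range(len(d))]
    return sum(d[p] * Md[p] for p in range(len(d)))

def triple_slack(M, xi, xj, xl):
    """Amount by which D(xi, xl) >= D(xi, xj) + 1 is violated (0 if satisfied)."""
    return max(0.0, 1.0 + mahalanobis(M, xi, xj) - mahalanobis(M, xi, xl))

# Hypothetical triple: xj has the same class as xi, xl a different class.
xi, xj, xl = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)

identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[1.0, 0.0], [0.0, 4.0]]  # weights the second dimension more

slack_euclidean = triple_slack(identity, xi, xj, xl)
slack_stretched = triple_slack(stretched, xi, xj, xl)
print(slack_euclidean, slack_stretched)
```

Under the identity matrix the impostor xl is as close to xi as the same-class neighbor xj, so the triple has positive slack; stretching the second dimension pushes xl away and the slack vanishes.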
Learning a Distance Measure
LMNN algorithm (Weinberger and Saul 2009)
LMCA algorithm (Torresani and Lee 2007) uses a similar approach where $M = L^T L$ and learns $L$
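Writing $M = L^T L$ means the Mahalanobis distance equals the squared Euclidean distance after projecting with L, and a non-square L reduces dimensionality at the same time. A minimal sketch with an assumed 1-by-2 matrix L (the values are illustrative, not learned):

```python
def project(L, x):
    """Compute L x for a matrix L given as a list of rows."""
    return [sum(row[q] * x[q] for q in range(len(x))) for row in L]

def gram(L):
    """Compute M = L^T L."""
    n = len(L[0])
    return [[sum(row[p] * row[q] for row in L) for q in range(n)] for p in range(n)]

def mahalanobis(M, xi, xj):
    d = [a - b for a, b in zip(xi, xj)]
    return sum(d[p] * M[p][q] * d[q] for p in range(len(d)) for q in range(len(d)))

L = [[1.0, 2.0]]          # assumed projection from two dimensions to one
M = gram(L)

xi, xj = (1.0, 1.0), (0.0, 0.0)
d_metric = mahalanobis(M, xi, xj)
d_projected = sum((a - b) ** 2 for a, b in zip(project(L, xi), project(L, xj)))
print(M, d_metric, d_projected)
```

Parameterizing by L also guarantees that M is positive semidefinite without an explicit constraint.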
Kernel Dimensionality Reduction
Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel)
Kernel LDA, CCA
