
Kernel Methods in Machine Learning

Autumn 2015 Lecture 1: Introduction

Juho Rousu

ICS-E4030 Kernel Methods in Machine Learning

9. September, 2015



Course organisation

Teachers and course position

- Lecturer: Prof. Juho Rousu
- Other teachers: Dr. Celine Brouard, Dr. Sahely Bhadra, Dr. Markus Heinonen, Mr. Huibin Shen
- Advanced course in machine learning, extent 5 ECTS credits
- Targeted at 2nd-year MSc students and PhD students in Machine Learning and Data Mining, Bioinformation Technology and Computer Science
- Recommended background knowledge: Machine Learning: Basic Principles or an equivalent course, and MATLAB programming



Course organisation

Course content

Topics (approximately):
- Introduction to kernel methods
- Supervised learning with kernels
- Support vector machines
- Ranking and preference learning
- Unsupervised learning with kernels
- Kernels for structured data
- Learning with multiple kernels and targets, structured output



Course organisation

Course logistics

- Lectures: Wed 12:15-14:00, room T3
- Exercise sessions: Fri 8-10, room T3; first session Sep 18, last session Dec 4 (Q&A on the exercises)
- Course exam: 14.12.2015 at 9-12, room T1
- MyCourses page: https://mycourses.aalto.fi/course/view.php?id=6076



Course organisation

How to complete the course

- Exercises: 10 weekly sets of home assignments, 50% of the course points
- Exam: 50% of the course points
- Grading scale: 1-5



Course organisation

Exercises

- Will appear in MyCourses by Thursday evening each week
- Generally a mix of pen-and-paper and MATLAB assignments
- To be returned by Friday morning 8am the following week
- Return as a PDF to MyCourses, or on paper to the course mail box (outside the 3rd floor doors of the ICS department)
- 5 exercises worth 1 point each, plus 1 bonus point for presenting a solution in the exercise session:
  - if you complete all 5 exercises and present one, you get 6 points for that week
- A maximum of 50 points will be counted towards the course grade
- If you miss one or two weeks, you can still get full points by completing all the other exercises or presenting enough solutions in the exercise sessions



Course organisation

Course material

- Examined content: the lecture slides and exercises
- Course book (will be loosely followed): "Kernel Methods for Pattern Analysis" by Shawe-Taylor and Cristianini
  - Available as an Aalto ebook: http://site.ebrary.com/lib/aalto/detail.action?docID=10131674
- Further reading:
  - "Introduction to Support Vector Machines and other kernel-based learning methods" by Cristianini and Shawe-Taylor, http://books.google.fi/books?isbn=0521780195
  - "Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond" by Schölkopf and Smola, http://books.google.fi/books?isbn=0262194759



Course organisation

Questions



Introduction to kernel methods

Kernel methods

Key characteristics of kernel methods:


- Embedding: Data items z are embedded into a feature space via a feature map φ(z); the map may be highly non-linear and the feature space potentially very high-dimensional
- Linear models: Models are built for the patterns in the feature space (typically of the form w^T φ(z)); the optimal model is efficient to find via convex optimization
- Kernel trick: Algorithms work with kernels, the inner products of feature vectors k(x, z) = Σ_j φ_j(x) φ_j(z), rather than with the original features φ(x); this side-steps the efficiency problems of high dimensionality
- Regularized learning: To avoid overfitting, large feature weights are penalized and separation by a large margin is favoured



Introduction to kernel methods

Data analysis tasks via kernels

Many data analysis algorithms can be 'kernelized', i.e. transformed to an equivalent form by replacing object descriptions (feature vectors) by pairwise similarities (kernels):
- Classification
- Regression
- Ranking
- Novelty detection
- Clustering
- Principal component analysis, canonical correlation analysis
- Multi-label / multi-task / structured output prediction
- ...



Introduction to kernel methods

Modularity of kernel methods

- Algorithms are designed to work with arbitrary inner products (or kernels) between inputs
- The same algorithm will work with any inner product (or kernel)
- This allows the theoretical properties of the learning algorithm to be investigated once, and the results carry over to all application domains
- The kernel depends on the application domain; prior information is encoded into the kernel



Introduction to kernel methods

What is a kernel?

- A kernel is a function that calculates the similarity between two objects, e.g.
  - two proteins
  - two images
  - two documents
  - ...
- The objects x_i ∈ X and x_j ∈ X come from an input space X, e.g.
  - X = the set of all proteins in nature (a finite set)
  - X = all possible images (an infinite set)
  - X = all possible documents (an infinite set)
- A kernel is thus a function k : X × X → R



Introduction to kernel methods

What is a kernel?
- Formally, a kernel function is an inner product (scalar product, dot product), denoted by ⟨·, ·⟩
- If φ(x) = (φ_1(x), φ_2(x), . . . , φ_D(x))^T is a vector in a feature space F, the standard inner product in F is called the linear kernel:

  k(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z) = Σ_j φ_j(x) φ_j(z)

- Geometric interpretation of the linear kernel: the cosine of the angle between the two feature vectors,

  cos β = φ(x)^T φ(z) / (‖φ(x)‖ ‖φ(z)‖) = k(x, z) / (√k(x, x) √k(z, z)),

  where ‖φ(x)‖ = √(Σ_{j=1}^D φ_j(x)^2) is the Euclidean norm (a MATLAB sketch follows below).
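
A minimal MATLAB sketch of the linear kernel and its cosine interpretation; the feature vectors are made up purely for illustration:

  % Two feature vectors in R^D (made-up values)
  phix = [1; 2; 0.5];
  phiz = [0.5; 1; 1];

  k_xz = phix' * phiz;                          % linear kernel <phi(x), phi(z)>
  cosbeta = k_xz / (norm(phix) * norm(phiz));   % cosine of the angle between the vectors
  % equivalently: k_xz / (sqrt(phix'*phix) * sqrt(phiz'*phiz))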



Introduction to kernel methods

Kernel vs. Euclidean distance


- Assume two vectors φ(x), φ(x′) ∈ R^D with unit length, ‖φ(x)‖ = ‖φ(x′)‖ = 1
- Linear kernel: k(x, x′) = ⟨φ(x), φ(x′)⟩
- Euclidean distance: d(x, x′) = ‖φ(x) − φ(x′)‖ = √(Σ_{j=1}^D (φ_j(x) − φ_j(x′))^2)
- Expanding the square and using the unit length of the vectors we get:

  (1/2) d(x, x′)^2 = (1/2) ‖φ(x) − φ(x′)‖^2
                   = (1/2) (φ(x) − φ(x′))^T (φ(x) − φ(x′))
                   = (1/2) (‖φ(x)‖^2 − 2 φ(x)^T φ(x′) + ‖φ(x′)‖^2)
                   = 1 − φ(x)^T φ(x′) = 1 − k(x, x′)

  (a numeric check follows below)
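
A quick MATLAB check of this identity on an arbitrary pair of unit-normalized vectors:

  % Unit-normalize two arbitrary feature vectors
  phix = [3; 1; 2];   phix = phix / norm(phix);
  phixp = [1; 0; 1];  phixp = phixp / norm(phixp);

  lhs = 0.5 * norm(phix - phixp)^2;    % (1/2) d(x,x')^2
  rhs = 1 - phix' * phixp;             % 1 - k(x,x')
  disp([lhs rhs])                      % the two values agree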



Introduction to kernel methods

Hilbert space*
Formally, the underlying space of a kernel is required to be a Hilbert space.
A Hilbert space is a real vector space H with the following additional properties:
- Equipped with an inner product, a map ⟨·, ·⟩ which satisfies, for all objects x, x′, z ∈ H:
  - linearity: ⟨ax + bx′, z⟩ = a⟨x, z⟩ + b⟨x′, z⟩
  - symmetry: ⟨x, x′⟩ = ⟨x′, x⟩
  - positive definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 if and only if x = 0
- Complete: every Cauchy sequence {h_n}_{n≥1} of elements of H converges to an element of H
- Separable: there is a countable set of elements {h_1, h_2, . . .} in H such that for any h ∈ H and every ε > 0 there is some h_i with ‖h_i − h‖ < ε
On this course, typically H = R^D, where the dimension D is finite or infinite. Both cases are Hilbert spaces.



Introduction to kernel methods

The kernel matrix

- Usually in machine learning, data is represented as an N × D matrix, with samples along the rows and features along the columns:

      | φ_1(x_1)  φ_2(x_1)  . . .  φ_D(x_1) |
  X = | φ_1(x_2)  φ_2(x_2)  . . .  φ_D(x_2) |
      |    ...       ...     ...      ...   |
      | φ_1(x_N)  φ_2(x_N)  . . .  φ_D(x_N) |

- Each data point x (image, document, signal, etc.) is represented by a feature vector φ(x) ∈ R^D
- For data that is not already in numerical form, a pre-processing step is needed to extract features before learning



Introduction to kernel methods

The kernel matrix

- In kernel methods, a kernel matrix (also called the Gram matrix), an N × N matrix of pairwise similarity values, is used instead:

      | k(x_1, x_1)  k(x_1, x_2)  . . .  k(x_1, x_N) |
  K = | k(x_2, x_1)  k(x_2, x_2)  . . .  k(x_2, x_N) |
      |      ...          ...      ...       ...     |
      | k(x_N, x_1)  k(x_N, x_2)  . . .  k(x_N, x_N) |

- Each entry is an inner product between two data points: k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
- Since the inner product is symmetric, K is a symmetric matrix (a MATLAB sketch follows below)
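
A minimal MATLAB sketch that builds a feature matrix X and the corresponding linear-kernel Gram matrix; the data is random and purely illustrative:

  N = 5; D = 3;
  X = randn(N, D);       % N samples as rows, D features as columns

  K = X * X';            % N x N Gram matrix of the linear kernel
  norm(K - K')           % essentially zero: K is symmetric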



Introduction to kernel methods

The kernel matrix

- The Gram matrix corresponding to the kernel function k(x, z) = ⟨φ(x), φ(z)⟩ on a set of data points {x_i}_{i=1}^N is positive semidefinite: v^T K v ≥ 0 for any vector v:

  v^T K v = Σ_{i,j=1}^N v_i K_ij v_j = Σ_{i,j=1}^N v_i ⟨φ(x_i), φ(x_j)⟩ v_j
          = ⟨ Σ_{i=1}^N v_i φ(x_i), Σ_{j=1}^N v_j φ(x_j) ⟩ = ‖ Σ_{i=1}^N v_i φ(x_i) ‖^2 ≥ 0

- Consequence: the Gram matrix K has non-negative eigenvalues λ_1 ≥ · · · ≥ λ_N ≥ 0 (see the eigenvalue check below)
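
A quick MATLAB check of the non-negative eigenvalues on a random linear-kernel Gram matrix:

  X = randn(5, 3);
  K = X * X';            % Gram matrix of the linear kernel
  lambda = eig(K);       % eigenvalues of K
  min(lambda)            % non-negative, up to numerical round-off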



Introduction to kernel methods

The kernel matrix

- A matrix K is positive semidefinite if and only if it can be expressed as a product K = B^T B for some real matrix B.
- Proof:
  - "if": Assume K = B^T B. Then v^T K v = v^T B^T B v = ‖Bv‖^2 ≥ 0.
  - "only if": Assume K is PSD. Then it has an eigenvalue decomposition K = V Λ V^T, where Λ = diag(λ_1, . . . , λ_N) is a diagonal matrix containing the eigenvalues.
  - Since the eigenvalues are non-negative, we can take B = √Λ V^T, where √Λ = diag(√λ_1, . . . , √λ_N), and then B^T B = V √Λ √Λ V^T = V Λ V^T = K (see the construction sketched below).
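
A MATLAB sketch of the "only if" construction on a random PSD matrix:

  X = randn(5, 3);
  K = X * X';                      % a PSD matrix
  [V, Lambda] = eig(K);            % eigenvalue decomposition: K = V*Lambda*V'
  B = sqrt(max(Lambda, 0)) * V';   % B = sqrt(Lambda)*V'; max(.,0) guards against round-off
  norm(B' * B - K)                 % essentially zero: B'*B recovers K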



Introduction to kernel methods

Quiz: which of the following could represent a valid kernel matrix?

  A = | 1  0 |     C = | 1  1 |     E = |  1  −1 |
      | 0  1 |         | 1 −1 |         | −1   1 |

  B = | 1  2 |     D = | 1  0 |     F = | 0  1 |
      | 1  1 |         | 0  1 |         | 1  1 |



Introduction to kernel methods

What are kernels good for?

- To extract nonlinear features implicitly
- Consider a feature map φ : R^D → R^S, possibly with S ≫ D
- Example with D = 1 and S = 3:
  - φ(x) = (1, √2 x, x^2)^T
  - ⟨φ(x), φ(y)⟩ = (1, √2 x, x^2)(1, √2 y, y^2)^T = 1 + 2xy + x^2 y^2
  - k_POL(x, y) = (xy + 1)^2 = 1 + 2xy + x^2 y^2



Introduction to kernel methods

What are kernels good for?

- To extract nonlinear features implicitly (continued)
- For the example above, the explicit features are easy to compute with D = 1, but what if you have 100 px × 100 px images (i.e., D = 10^4)?
- It is then computationally infeasible to compute the features explicitly
- We can still do it with kernels (a MATLAB check of the D = 1 example follows below)
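
A minimal MATLAB check that the explicit feature map and the polynomial kernel agree for the D = 1, S = 3 example; the input values are arbitrary:

  phi = @(x) [1; sqrt(2)*x; x^2];     % explicit feature map for D = 1
  x = 1.5;  y = -0.7;

  explicit  = phi(x)' * phi(y);       % inner product in the feature space
  viakernel = (x*y + 1)^2;            % polynomial kernel k_POL(x,y)
  disp([explicit viakernel])          % the two values agree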



Introduction to kernel methods

What are kernels good for?

- To gain computational efficiency: work with an N × N kernel matrix instead of an N × D data matrix

      | x_11  x_12  . . .  x_1D |
  X = | x_21  x_22  . . .  x_2D |
      |  ...   ...   ...    ... |
      | x_N1  x_N2  . . .  x_ND |

      | k(x_1, x_1)  k(x_1, x_2)  . . .  k(x_1, x_N) |
  K = | k(x_2, x_1)  k(x_2, x_2)  . . .  k(x_2, x_N) |
      |      ...          ...      ...       ...     |
      | k(x_N, x_1)  k(x_N, x_2)  . . .  k(x_N, x_N) |

- When complex high-dimensional feature spaces are needed and the data set is of medium size (D ≫ N), the kernel approach has an edge.



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two proteins
  - e.g. a string kernel with all substrings as the features:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEK
FDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAE
LKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQ
GAMNKALELFRKDIAAKYKELGYQG

MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAA
KSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLK
PVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDE
AAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two images



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two news stories:
(Reuters) - Developed countries face a sharp year-end
slowdown led by a contraction in Germany, the OECD said
on Thursday, urging central banks to keep rates low or
pursue other forms of monetary easing if the downturn
becomes entrenched.

(Reuters) - More than 100 spacecraft have been to the
moon, including six with U.S. astronauts, but one key
piece of information about Earth’s natural satellite is
still missing -- what’s inside.



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two molecular graphs
  - e.g. a walk kernel with random walks as the underlying features



Introduction to kernel methods

Several ways to get to a kernel

Approach I. Construct φ and think about efficient ways to compute the inner product ⟨φ(x), φ(x′)⟩
- If φ(x) is very high-dimensional, computing the inner product element by element is slow; we don't want to do that
- For several cases, there are efficient algorithms that compute the kernel in low polynomial time, even when the dimension of φ is exponential
- This is sometimes referred to as the kernel trick
- We will go through several examples during this course, especially for structured data



Introduction to kernel methods

Several ways to get to a kernel

Approach II. Construct a similarity measure and show that it qualifies as a kernel:
- Show that for any set of examples the matrix K = (k(x_i, x_j))_{i,j=1}^n is positive semi-definite (PSD).
- In that case, there always exists an underlying feature representation for which the kernel is the inner product.
- Example: if you can show that the matrix is a covariance matrix for some variates, you know the matrix is PSD.



Introduction to kernel methods

Several ways to get to a kernel

Approach III. Convert a distance or a similarity into a kernel
- Take any distance d or similarity measure s (neither needs to be a kernel itself)
- In addition, a set of data points Z = {z_j}_{j=1}^M from the same domain is required (e.g. the training data)
- Construct a feature vector from the distances (and similarly for s):
  φ(x) = (d(x, z_1), d(x, z_2), . . . , d(x, z_M))
- Compute the linear kernel K_d(x, x′) = ⟨φ(x), φ(x′)⟩
- This will always work technically, but it requires that the data Z captures the essential patterns in the input space ⟹ we need enough data (a MATLAB sketch follows below)
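
A minimal MATLAB sketch of this construction, using Euclidean distances to a set of landmark points Z; all data here is randomly generated for illustration:

  N = 6; M = 4; D = 3;
  X = randn(N, D);                         % data points, one per row
  Z = randn(M, D);                         % landmark points z_1, ..., z_M

  Phi = zeros(N, M);                       % Phi(i,j) = d(x_i, z_j)
  for i = 1:N
      for j = 1:M
          Phi(i, j) = norm(X(i,:) - Z(j,:));
      end
  end

  Kd = Phi * Phi';                         % linear kernel on the distance features
  min(eig(Kd))                             % non-negative: Kd is a valid (PSD) kernel matrix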



Introduction to kernel methods

Examples of non-linear kernels

Polynomial: k(x, x′) = (⟨φ(x), φ(x′)⟩ + c)^d
Sigmoid: k(x, x′) = tanh(κ ⟨φ(x), φ(x′)⟩ + θ)
Gaussian (aka RBF): k(x, x′) = exp(−‖φ(x) − φ(x′)‖^2 / (2σ^2))
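
A minimal MATLAB sketch of these three kernels on a single pair of feature vectors; the parameter values are arbitrary:

  phix = [1; 2; 0.5];  phixp = [0.5; 1; 1];
  c = 1; d = 3; kappa = 0.1; theta = 0; sigma = 2;

  k_poly  = (phix' * phixp + c)^d;                        % polynomial kernel
  k_sig   = tanh(kappa * (phix' * phixp) + theta);        % sigmoid kernel
  k_gauss = exp(-norm(phix - phixp)^2 / (2 * sigma^2));   % Gaussian (RBF) kernel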




Introduction to kernel methods

Operations on kernels

- Examples of elementary operations that give valid kernels when applied to kernels k_1, k_2:
  - Convex combination: k(x, x′) = β_1 k_1(x, x′) + β_2 k_2(x, x′)
  - Elementwise product: k(x, x′) = k_1(x, x′) k_2(x, x′)
  - Matrix product: k(x, x′) = Σ_{i=1}^N k_1(x, x_i) k_1(x_i, x′), where x_i, i = 1, . . . , N is the training data
  - Normalization: k(x, x′) = k_1(x, x′) / √(k_1(x, x) k_1(x′, x′))
  - Multiplication with a PSD matrix B: k(x, x′) = x^T B x′
- The operations can be combined to construct arbitrarily complex kernels (a MATLAB sketch follows below)
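
A small MATLAB sketch combining some of these operations on two Gram matrices built from random data:

  X = randn(6, 3);
  K1 = X * X';                      % linear kernel matrix
  K2 = (X * X' + 1).^2;             % polynomial kernel matrix

  Ksum  = 0.7 * K1 + 0.3 * K2;      % convex combination
  Kprod = K1 .* K2;                 % elementwise product
  dg = sqrt(diag(K1));
  Knorm = K1 ./ (dg * dg');         % normalization: K1(i,j) / sqrt(K1(i,i)*K1(j,j))
  min(eig(Knorm))                   % non-negative: still a valid kernel matrix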



Introduction to kernel methods

Polynomial kernel

- Given data x ∈ R^D, the polynomial kernel is given by

  k_POL(x, x′) = (⟨x, x′⟩ + c)^S

- The integer S > 0 gives the degree of the polynomial kernel
- The real value c ≥ 0 is a weighting factor for the lower-order polynomial terms
- The underlying features are non-linear: monomial combinations x_1 · x_2 · · · x_k of degree k ≤ S of the original features x_j



Introduction to kernel methods

Polynomial kernel

k_POL(x, x′) = (⟨x, x′⟩ + c)^S

- A linear model in the polynomial feature space corresponds to a non-linear model in the original feature space
- The dimension of the polynomial feature space is roughly D^S
- But the polynomial kernel can be evaluated in time linear in the dimension D of the original data, independently of S ⟹ no computational overhead from working in the high-dimensional feature space (see the sketch below)
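
A MATLAB sketch illustrating that evaluating the polynomial kernel costs one D-dimensional inner product plus a scalar power, regardless of the degree S; the parameters are arbitrary:

  D = 10000; S = 5; c = 1;
  x = randn(D, 1);  xp = randn(D, 1);

  % One inner product in R^D followed by a scalar power: O(D) work, independent of S,
  % even though the implicit feature space has on the order of D^S dimensions.
  k = (x' * xp + c)^S;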



Introduction to kernel methods

Gaussian kernel (RBF kernel)

k(x, x′) = exp(−‖φ(x) − φ(x′)‖^2 / (2σ^2))

- The Gaussian kernel can be seen as a limit of polynomial kernels ⟹ it corresponds to an infinite-dimensional polynomial feature space
- The smoothness of the Gaussian kernel is controlled by the parameter σ (a MATLAB sketch follows below)
- Higher-order features are exponentially downweighted
  - Proof: through the Taylor expansion e^x = Σ_{n=0}^∞ x^n / n!
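
A small MATLAB sketch of how the kernel value between a fixed pair of points changes with σ; the points and bandwidths are arbitrary:

  x = [0; 0];  xp = [1; 2];
  sigmas = [0.5 1 2 5];

  % For a fixed pair of points, the kernel value approaches 1 as sigma grows
  % (a smoother, flatter kernel) and approaches 0 as sigma shrinks.
  k = exp(-norm(x - xp)^2 ./ (2 * sigmas.^2))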



Introduction to kernel methods

Exercises: Simple classification method using kernels

- Consider binary classification, with the training data divided into a positive set S_+ and a negative set S_−
- We estimate the centre of mass of each class in the feature space:
  φ̄_{S+} = (1/|S_+|) Σ_{x∈S_+} φ(x),    φ̄_{S−} = (1/|S_−|) Σ_{x∈S_−} φ(x)
- Classification is based on measuring the distance to the centre of mass of each class:
  d_+(x) = ‖φ(x) − φ̄_{S+}‖,    d_−(x) = ‖φ(x) − φ̄_{S−}‖
- We predict the class whose centre of mass is closer:
  h(x) = +1 if d_+(x) ≤ d_−(x), and h(x) = −1 otherwise



Introduction to kernel methods

Exercises: Simple classification method using kernels

- The classification rule can be equivalently expressed in terms of the linear kernel:

  h(x) = sgn( (1/|S_+|) Σ_{x_i∈S_+} k(x, x_i) − (1/|S_−|) Σ_{x_i∈S_−} k(x, x_i) − b )

- Intuitively: each data point contributes to the density of its class through the kernel value, and the class with the higher density "wins"
- This classifier is called the Parzen windows classifier due to its connection to Parzen windows kernel density estimation (a MATLAB sketch follows below)
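
A minimal MATLAB sketch of this classifier with the linear kernel on synthetic 2-D data; note that the explicit form of the offset b used here (from the squared norms of the class means) is an assumption that follows from the distance derivation above, not something given on the slide:

  % Synthetic data: positives around (1,1), negatives around (-1,-1)
  Xpos = randn(20, 2) + 1;
  Xneg = randn(20, 2) - 1;
  k = @(A, B) A * B';        % linear kernel between the rows of A and B

  % Assumed offset: b = 0.5*(||mean of phi over S+||^2 - ||mean of phi over S-||^2)
  b = 0.5 * (mean(mean(k(Xpos, Xpos))) - mean(mean(k(Xneg, Xneg))));

  x = [0.5, 0.8];            % a test point
  h = sign(mean(k(x, Xpos)) - mean(k(x, Xneg)) - b)   % +1 or -1 prediction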

