
Kernel Methods in Machine Learning

Autumn 2015 Lecture 1: Introduction

Juho Rousu

ICS-E4030 Kernel Methods in Machine Learning

9. September, 2015



Course organisation

Teachers and course position

- Lecturer: Prof. Juho Rousu
- Other teachers: Dr. Celine Brouard, Dr. Sahely Bhadra, Dr. Markus Heinonen, Mr. Huibin Shen
- Advanced course in machine learning, extent 5 ECTS credits
- Targeted at 2nd-year MSc students and PhD students in Machine Learning and Data Mining, Bioinformation Technology and Computer Science
- Recommended background knowledge: Machine Learning: Basic Principles or an equivalent course, and MATLAB programming



Course organisation

Course content

Topics (approximately):
- Introduction to kernel methods
- Supervised learning with kernels
- Support vector machines
- Ranking and preference learning
- Unsupervised learning with kernels
- Kernels for structured data
- Learning with multiple kernels and targets, structured output



Course organisation

Course logistics

- Lectures: Wed 12:15-14:00, room T3
- Exercise sessions: Fri 8-10, room T3; first session Sep 18, last session Dec 4 (Q&A on the exercises)
- Course exam: 14.12.2015 at 9-12, room T1
- MyCourses page: https://mycourses.aalto.fi/course/view.php?id=6076



Course organisation

How to complete the course

- Exercises: 10 weekly sets of home assignments, 50% of the course points
- Exam: 50% of the course points
- Grading scale: 1-5



Course organisation

Exercises

- Will appear in MyCourses by Thursday evening each week
- Generally a mix of pen-and-paper and MATLAB assignments
- To be returned by Friday morning 8am the following week
- Return as a PDF to MyCourses, or on paper to the course mail box (outside the 3rd floor doors of the ICS department)
- 5 exercises worth 1 point each, plus 1 bonus point for presenting a solution in the exercise session:
  - if you complete all 5 exercises and present one, you get 6 points for that week
- A maximum of 50 points will be counted towards the course grade
- If you miss one or two weeks, you can still get full points by completing all the other exercises or presenting enough solutions in the exercise sessions



Course organisation

Course material

- Examined content: the lecture slides and exercises
- Course book (will be loosely followed): "Kernel Methods for Pattern Analysis" by Shawe-Taylor and Cristianini
  - Available as an Aalto ebook: http://site.ebrary.com/lib/aalto/detail.action?docID=10131674
- Further reading:
  - "Introduction to Support Vector Machines and other kernel-based learning methods" by Cristianini and Shawe-Taylor, http://books.google.fi/books?isbn=0521780195
  - "Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond" by Schölkopf and Smola, http://books.google.fi/books?isbn=0262194759



Course organisation

Questions



Introduction to kernel methods

Kernel methods

Key characteristics of kernel methods:


- Embedding: Data items z are embedded into a feature space via a feature map φ(z); the map may be highly non-linear and the feature space potentially very high-dimensional
- Linear models: Models are built for the patterns in the feature space (typically of the form w^T φ(z)); the optimal model is efficient to find via convex optimization
- Kernel trick: Algorithms work with kernels, the inner products of feature vectors k(x, z) = Σ_j φ_j(x) φ_j(z), rather than with the original features φ(x); this side-steps the efficiency problems of high dimensionality
- Regularized learning: To avoid overfitting, large feature weights are penalized and separation by a large margin is favoured



Introduction to kernel methods

Data analysis tasks via kernels

Many data analysis algorithms can be 'kernelized', i.e. transformed to an equivalent form by replacing object descriptions (feature vectors) by pairwise similarities (kernels):
- Classification
- Regression
- Ranking
- Novelty detection
- Clustering
- Principal component analysis, canonical correlation analysis
- Multi-label / multi-task / structured output prediction
- ...



Introduction to kernel methods

Modularity of kernel methods

- Algorithms are designed to work with arbitrary inner products (or kernels) between inputs
- The same algorithm will work with any inner product (or kernel)
- This allows the theoretical properties of the learning algorithm to be investigated once, and the results carry over to all application domains
- The kernel depends on the application domain; prior information is encoded into the kernel



Introduction to kernel methods

What is a kernel?

- A kernel is a function that calculates the similarity between two objects, e.g.
  - two proteins
  - two images
  - two documents
  - ...
- The objects x_i ∈ X and x_j ∈ X come from an input space X, e.g.
  - X = the set of all proteins in nature (a finite set)
  - X = all possible images (an infinite set)
  - X = all possible documents (an infinite set)
- A kernel is thus a function k : X × X → R



Introduction to kernel methods

What is a kernel?
- Formally, a kernel function is an inner product (scalar product, dot product), denoted by ⟨·, ·⟩
- If φ(x) = (φ_1(x), φ_2(x), . . . , φ_D(x))^T is a vector in a feature space F, the standard inner product in F is called the linear kernel:

  k(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z) = Σ_j φ_j(x) φ_j(z)

- Geometric interpretation of the linear kernel: the cosine of the angle between the two feature vectors,

  cos β = φ(x)^T φ(z) / (‖φ(x)‖ ‖φ(z)‖) = k(x, z) / (√k(x, x) √k(z, z)),

  where ‖φ(x)‖ = √(Σ_{j=1}^D φ_j(x)^2) is the Euclidean norm (a MATLAB sketch follows below).
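
A minimal MATLAB sketch of the linear kernel and its cosine interpretation; the feature vectors are made up purely for illustration:

  % Two feature vectors in R^D (made-up values)
  phix = [1; 2; 0.5];
  phiz = [0.5; 1; 1];

  k_xz = phix' * phiz;                          % linear kernel <phi(x), phi(z)>
  cosbeta = k_xz / (norm(phix) * norm(phiz));   % cosine of the angle between the vectors
  % equivalently: k_xz / (sqrt(phix'*phix) * sqrt(phiz'*phiz))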



Introduction to kernel methods

Kernel vs. Euclidean distance


- Assume two vectors φ(x), φ(x′) ∈ R^D with unit length, ‖φ(x)‖ = ‖φ(x′)‖ = 1
- Linear kernel: k(x, x′) = ⟨φ(x), φ(x′)⟩
- Euclidean distance: d(x, x′) = ‖φ(x) − φ(x′)‖ = √(Σ_{j=1}^D (φ_j(x) − φ_j(x′))^2)
- Expanding the square and using the unit length of the vectors we get:

  (1/2) d(x, x′)^2 = (1/2) ‖φ(x) − φ(x′)‖^2
                   = (1/2) (φ(x) − φ(x′))^T (φ(x) − φ(x′))
                   = (1/2) (‖φ(x)‖^2 − 2 φ(x)^T φ(x′) + ‖φ(x′)‖^2)
                   = 1 − φ(x)^T φ(x′) = 1 − k(x, x′)

  (a numeric check follows below)
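
A quick MATLAB check of this identity on an arbitrary pair of unit-normalized vectors:

  % Unit-normalize two arbitrary feature vectors
  phix = [3; 1; 2];   phix = phix / norm(phix);
  phixp = [1; 0; 1];  phixp = phixp / norm(phixp);

  lhs = 0.5 * norm(phix - phixp)^2;    % (1/2) d(x,x')^2
  rhs = 1 - phix' * phixp;             % 1 - k(x,x')
  disp([lhs rhs])                      % the two values agree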



Introduction to kernel methods

Hilbert space*
Formally, the underlying space of a kernel is required to be a Hilbert space.
A Hilbert space is a real vector space H with the following additional properties:
- Equipped with an inner product, a map ⟨·, ·⟩ which satisfies, for all objects x, x′, z ∈ H:
  - linearity: ⟨ax + bx′, z⟩ = a⟨x, z⟩ + b⟨x′, z⟩
  - symmetry: ⟨x, x′⟩ = ⟨x′, x⟩
  - positive definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 if and only if x = 0
- Complete: every Cauchy sequence {h_n}_{n≥1} of elements of H converges to an element of H
- Separable: there is a countable set of elements {h_1, h_2, . . .} in H such that for any h ∈ H and every ε > 0 there is some h_i with ‖h_i − h‖ < ε
On this course, typically H = R^D, where the dimension D is finite or infinite. Both cases are Hilbert spaces.



Introduction to kernel methods

The kernel matrix

- Usually in machine learning, data is represented as an N × D matrix, with samples along the rows and features along the columns:

      | φ_1(x_1)  φ_2(x_1)  . . .  φ_D(x_1) |
  X = | φ_1(x_2)  φ_2(x_2)  . . .  φ_D(x_2) |
      |    ...       ...     ...      ...   |
      | φ_1(x_N)  φ_2(x_N)  . . .  φ_D(x_N) |

- Each data point x (image, document, signal, etc.) is represented by a feature vector φ(x) ∈ R^D
- For data that is not already in numerical form, a pre-processing step is needed to extract features before learning



Introduction to kernel methods

The kernel matrix

- In kernel methods, a kernel matrix (also called the Gram matrix), an N × N matrix of pairwise similarity values, is used instead:

      | k(x_1, x_1)  k(x_1, x_2)  . . .  k(x_1, x_N) |
  K = | k(x_2, x_1)  k(x_2, x_2)  . . .  k(x_2, x_N) |
      |      ...          ...      ...       ...     |
      | k(x_N, x_1)  k(x_N, x_2)  . . .  k(x_N, x_N) |

- Each entry is an inner product between two data points: k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩
- Since the inner product is symmetric, K is a symmetric matrix (a MATLAB sketch follows below)
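
A minimal MATLAB sketch that builds a feature matrix X and the corresponding linear-kernel Gram matrix; the data is random and purely illustrative:

  N = 5; D = 3;
  X = randn(N, D);       % N samples as rows, D features as columns

  K = X * X';            % N x N Gram matrix of the linear kernel
  norm(K - K')           % essentially zero: K is symmetric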



Introduction to kernel methods

The kernel matrix

- The Gram matrix corresponding to the kernel function k(x, z) = ⟨φ(x), φ(z)⟩ on a set of data points {x_i}_{i=1}^N is positive semidefinite: v^T K v ≥ 0 for any vector v:

  v^T K v = Σ_{i,j=1}^N v_i K_ij v_j = Σ_{i,j=1}^N v_i ⟨φ(x_i), φ(x_j)⟩ v_j
          = ⟨ Σ_{i=1}^N v_i φ(x_i), Σ_{j=1}^N v_j φ(x_j) ⟩ = ‖ Σ_{i=1}^N v_i φ(x_i) ‖^2 ≥ 0

- Consequence: the Gram matrix K has non-negative eigenvalues λ_1 ≥ · · · ≥ λ_N ≥ 0 (see the eigenvalue check below)
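
A quick MATLAB check of the non-negative eigenvalues on a random linear-kernel Gram matrix:

  X = randn(5, 3);
  K = X * X';            % Gram matrix of the linear kernel
  lambda = eig(K);       % eigenvalues of K
  min(lambda)            % non-negative, up to numerical round-off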



Introduction to kernel methods

The kernel matrix

- A matrix K is positive semidefinite if and only if it can be expressed as a product K = B^T B for some real matrix B.
- Proof:
  - "if": Assume K = B^T B. Then v^T K v = v^T B^T B v = ‖Bv‖^2 ≥ 0.
  - "only if": Assume K is PSD. Then it has an eigenvalue decomposition K = V Λ V^T, where Λ = diag(λ_1, . . . , λ_N) is a diagonal matrix containing the eigenvalues.
  - Since the eigenvalues are non-negative, we can take B = √Λ V^T, where √Λ = diag(√λ_1, . . . , √λ_N), and then B^T B = V √Λ √Λ V^T = V Λ V^T = K (see the construction sketched below).
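
A MATLAB sketch of the "only if" construction on a random PSD matrix:

  X = randn(5, 3);
  K = X * X';                      % a PSD matrix
  [V, Lambda] = eig(K);            % eigenvalue decomposition: K = V*Lambda*V'
  B = sqrt(max(Lambda, 0)) * V';   % B = sqrt(Lambda)*V'; max(.,0) guards against round-off
  norm(B' * B - K)                 % essentially zero: B'*B recovers K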



Introduction to kernel methods

Quiz: which of the following could represent a valid kernel matrix?

  A = | 1  0 |     C = | 1  1 |     E = |  1  −1 |
      | 0  1 |         | 1 −1 |         | −1   1 |

  B = | 1  2 |     D = | 1  0 |     F = | 0  1 |
      | 1  1 |         | 0  1 |         | 1  1 |



Introduction to kernel methods

What are kernels good for?

- To extract nonlinear features implicitly
- Consider a feature map φ : R^D → R^S, possibly with S ≫ D
- Example with D = 1 and S = 3:
  - φ(x) = (1, √2 x, x^2)^T
  - ⟨φ(x), φ(y)⟩ = (1, √2 x, x^2)(1, √2 y, y^2)^T = 1 + 2xy + x^2 y^2
  - k_POL(x, y) = (xy + 1)^2 = 1 + 2xy + x^2 y^2



Introduction to kernel methods

What are kernels good for?

- To extract nonlinear features implicitly (continued)
- For the example above, the explicit features are easy to compute with D = 1, but what if you have 100 px × 100 px images (i.e., D = 10^4)?
- It is then computationally infeasible to compute the features explicitly
- We can still do it with kernels (a MATLAB check of the D = 1 example follows below)
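
A minimal MATLAB check that the explicit feature map and the polynomial kernel agree for the D = 1, S = 3 example; the input values are arbitrary:

  phi = @(x) [1; sqrt(2)*x; x^2];     % explicit feature map for D = 1
  x = 1.5;  y = -0.7;

  explicit  = phi(x)' * phi(y);       % inner product in the feature space
  viakernel = (x*y + 1)^2;            % polynomial kernel k_POL(x,y)
  disp([explicit viakernel])          % the two values agree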



Introduction to kernel methods

What are kernels good for?

- To gain computational efficiency: work with an N × N kernel matrix instead of an N × D data matrix

      | x_11  x_12  . . .  x_1D |
  X = | x_21  x_22  . . .  x_2D |
      |  ...   ...   ...    ... |
      | x_N1  x_N2  . . .  x_ND |

      | k(x_1, x_1)  k(x_1, x_2)  . . .  k(x_1, x_N) |
  K = | k(x_2, x_1)  k(x_2, x_2)  . . .  k(x_2, x_N) |
      |      ...          ...      ...       ...     |
      | k(x_N, x_1)  k(x_N, x_2)  . . .  k(x_N, x_N) |

- When complex high-dimensional feature spaces are needed and the data set is of medium size (D ≫ N), the kernel approach has an edge.



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two proteins
  - e.g. a string kernel with all substrings as the features:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEK
FDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAE
LKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQ
GAMNKALELFRKDIAAKYKELGYQG

MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAA
KSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLK
PVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDE
AAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two images



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two news stories:
(Reuters) - Developed countries face a sharp year-end
slowdown led by a contraction in Germany, the OECD said
on Thursday, urging central banks to keep rates low or
pursue other forms of monetary easing if the downturn
becomes entrenched.

(Reuters) - More than 100 spacecraft have been to the
moon, including six with U.S. astronauts, but one key
piece of information about Earth’s natural satellite is
still missing -- what’s inside.



Introduction to kernel methods

What are kernels good for?

- To calculate the similarity between structured objects
- Similarity between two molecular graphs
  - e.g. a walk kernel with random walks as the underlying features



Introduction to kernel methods

Several ways to get to a kernel

Approach I. Construct φ and think about efficient ways to compute the inner product ⟨φ(x), φ(x′)⟩
- If φ(x) is very high-dimensional, computing the inner product element by element is slow; we don't want to do that
- For several cases, there are efficient algorithms that compute the kernel in low polynomial time, even when the dimension of φ is exponential
- This is sometimes referred to as the kernel trick
- We will go through several examples during this course, especially for structured data



Introduction to kernel methods

Several ways to get to a kernel

Approach II. Construct a similarity measure and show that it qualifies as a kernel:
- Show that for any set of examples the matrix K = (k(x_i, x_j))_{i,j=1}^n is positive semi-definite (PSD).
- In that case, there always exists an underlying feature representation for which the kernel is the inner product.
- Example: if you can show that the matrix is a covariance matrix for some variates, you know the matrix is PSD.



Introduction to kernel methods

Several ways to get to a kernel

Approach III. Convert a distance or a similarity into a kernel
- Take any distance d or similarity measure s (neither needs to be a kernel itself)
- In addition, a set of data points Z = {z_j}_{j=1}^M from the same domain is required (e.g. the training data)
- Construct a feature vector from the distances (and similarly for s):
  φ(x) = (d(x, z_1), d(x, z_2), . . . , d(x, z_M))
- Compute the linear kernel K_d(x, x′) = ⟨φ(x), φ(x′)⟩
- This will always work technically, but it requires that the data Z captures the essential patterns in the input space ⟹ we need enough data (a MATLAB sketch follows below)
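
A minimal MATLAB sketch of this construction, using Euclidean distances to a set of landmark points Z; all data here is randomly generated for illustration:

  N = 6; M = 4; D = 3;
  X = randn(N, D);                         % data points, one per row
  Z = randn(M, D);                         % landmark points z_1, ..., z_M

  Phi = zeros(N, M);                       % Phi(i,j) = d(x_i, z_j)
  for i = 1:N
      for j = 1:M
          Phi(i, j) = norm(X(i,:) - Z(j,:));
      end
  end

  Kd = Phi * Phi';                         % linear kernel on the distance features
  min(eig(Kd))                             % non-negative: Kd is a valid (PSD) kernel matrix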



Introduction to kernel methods

Examples of non-linear kernels

Polynomial: k(x, x′) = (⟨φ(x), φ(x′)⟩ + c)^d
Sigmoid: k(x, x′) = tanh(κ ⟨φ(x), φ(x′)⟩ + θ)
Gaussian (aka RBF): k(x, x′) = exp(−‖φ(x) − φ(x′)‖^2 / (2σ^2))
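
A minimal MATLAB sketch of these three kernels on a single pair of feature vectors; the parameter values are arbitrary:

  phix = [1; 2; 0.5];  phixp = [0.5; 1; 1];
  c = 1; d = 3; kappa = 0.1; theta = 0; sigma = 2;

  k_poly  = (phix' * phixp + c)^d;                        % polynomial kernel
  k_sig   = tanh(kappa * (phix' * phixp) + theta);        % sigmoid kernel
  k_gauss = exp(-norm(phix - phixp)^2 / (2 * sigma^2));   % Gaussian (RBF) kernel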




Introduction to kernel methods

Operations on kernels

- Examples of elementary operations that give valid kernels when applied to kernels k_1, k_2:
  - Convex combination: k(x, x′) = β_1 k_1(x, x′) + β_2 k_2(x, x′)
  - Elementwise product: k(x, x′) = k_1(x, x′) k_2(x, x′)
  - Matrix product: k(x, x′) = Σ_{i=1}^N k_1(x, x_i) k_1(x_i, x′), where x_i, i = 1, . . . , N is the training data
  - Normalization: k(x, x′) = k_1(x, x′) / √(k_1(x, x) k_1(x′, x′))
  - Multiplication with a PSD matrix B: k(x, x′) = x^T B x′
- The operations can be combined to construct arbitrarily complex kernels (a MATLAB sketch follows below)
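
A small MATLAB sketch combining some of these operations on two Gram matrices built from random data:

  X = randn(6, 3);
  K1 = X * X';                      % linear kernel matrix
  K2 = (X * X' + 1).^2;             % polynomial kernel matrix

  Ksum  = 0.7 * K1 + 0.3 * K2;      % convex combination
  Kprod = K1 .* K2;                 % elementwise product
  dg = sqrt(diag(K1));
  Knorm = K1 ./ (dg * dg');         % normalization: K1(i,j) / sqrt(K1(i,i)*K1(j,j))
  min(eig(Knorm))                   % non-negative: still a valid kernel matrix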



Introduction to kernel methods

Polynomial kernel

- Given data x ∈ R^D, the polynomial kernel is given by

  k_POL(x, x′) = (⟨x, x′⟩ + c)^S

- The integer S > 0 gives the degree of the polynomial kernel
- The real value c ≥ 0 is a weighting factor for the lower-order polynomial terms
- The underlying features are non-linear: monomial combinations x_1 · x_2 · · · x_k of degree k ≤ S of the original features x_j



Introduction to kernel methods

Polynomial kernel

k_POL(x, x′) = (⟨x, x′⟩ + c)^S

- A linear model in the polynomial feature space corresponds to a non-linear model in the original feature space
- The dimension of the polynomial feature space is roughly D^S
- But the polynomial kernel can be evaluated in time linear in the dimension D of the original data, independently of S ⟹ no computational overhead from working in the high-dimensional feature space (see the sketch below)
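
A MATLAB sketch illustrating that evaluating the polynomial kernel costs one D-dimensional inner product plus a scalar power, regardless of the degree S; the parameters are arbitrary:

  D = 10000; S = 5; c = 1;
  x = randn(D, 1);  xp = randn(D, 1);

  % One inner product in R^D followed by a scalar power: O(D) work, independent of S,
  % even though the implicit feature space has on the order of D^S dimensions.
  k = (x' * xp + c)^S;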



Introduction to kernel methods

Gaussian kernel (RBF kernel)

k(x, x′) = exp(−‖φ(x) − φ(x′)‖^2 / (2σ^2))

- The Gaussian kernel can be seen as a limit of polynomial kernels ⟹ it corresponds to an infinite-dimensional polynomial feature space
- The smoothness of the Gaussian kernel is controlled by the parameter σ (a MATLAB sketch follows below)
- Higher-order features are exponentially downweighted
  - Proof: through the Taylor expansion e^x = Σ_{n=0}^∞ x^n / n!
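
A small MATLAB sketch of how the kernel value between a fixed pair of points changes with σ; the points and bandwidths are arbitrary:

  x = [0; 0];  xp = [1; 2];
  sigmas = [0.5 1 2 5];

  % For a fixed pair of points, the kernel value approaches 1 as sigma grows
  % (a smoother, flatter kernel) and approaches 0 as sigma shrinks.
  k = exp(-norm(x - xp)^2 ./ (2 * sigmas.^2))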



Introduction to kernel methods

Exercises: Simple classification method using kernels

- Consider binary classification, with the training data divided into a positive set S_+ and a negative set S_−
- We estimate the centre of mass of each class in the feature space:
  φ̄_{S+} = (1/|S_+|) Σ_{x∈S_+} φ(x),    φ̄_{S−} = (1/|S_−|) Σ_{x∈S_−} φ(x)
- Classification is based on measuring the distance to the centre of mass of each class:
  d_+(x) = ‖φ(x) − φ̄_{S+}‖,    d_−(x) = ‖φ(x) − φ̄_{S−}‖
- We predict the class whose centre of mass is closer:
  h(x) = +1 if d_+(x) ≤ d_−(x), and h(x) = −1 otherwise



Introduction to kernel methods

Exercises: Simple classification method using kernels

- The classification rule can be equivalently expressed in terms of the linear kernel:

  h(x) = sgn( (1/|S_+|) Σ_{x_i∈S_+} k(x, x_i) − (1/|S_−|) Σ_{x_i∈S_−} k(x, x_i) − b )

- Intuitively: each data point contributes to the density of its class through the kernel value, and the class with the higher density "wins"
- This classifier is called the Parzen windows classifier due to its connection to Parzen windows kernel density estimation (a MATLAB sketch follows below)
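
A minimal MATLAB sketch of this classifier with the linear kernel on synthetic 2-D data; note that the explicit form of the offset b used here (from the squared norms of the class means) is an assumption that follows from the distance derivation above, not something given on the slide:

  % Synthetic data: positives around (1,1), negatives around (-1,-1)
  Xpos = randn(20, 2) + 1;
  Xneg = randn(20, 2) - 1;
  k = @(A, B) A * B';        % linear kernel between the rows of A and B

  % Assumed offset: b = 0.5*(||mean of phi over S+||^2 - ||mean of phi over S-||^2)
  b = 0.5 * (mean(mean(k(Xpos, Xpos))) - mean(mean(k(Xneg, Xneg))));

  x = [0.5, 0.8];            % a test point
  h = sign(mean(k(x, Xpos)) - mean(k(x, Xneg)) - b)   % +1 or -1 prediction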

