Lecture 1: Introduction
Mahesan Niranjan
Slides are prompts (for me); notes are what you make, off the whiteboard and from textbooks during self-study.
March 2017
Mahesan Niranjan (UoS) Machine Learning March 2017 1 / 55
Overview
Logistics
Motivation
Some examples from my research
Review of Mathematical Foundations
Linear Algebra
Calculus
Probability Theory / Statistics
Principles of Optimization
Teaching:
Ten two-hour lectures
Eight two-hour lab sessions
There is nothing to be learnt from a professor, which is not to be met with in books
- David Hume (1711-1776)
(Wikipedia: Hume had little respect for the professors of his time [...] He did not graduate)
Function Approximator: $y = f(x, \theta) + v$
Parameter Estimation
Prediction: $\hat{y}_{N+1} = f(x_{N+1}, \hat{\theta})$
Regularization
Modelling Uncertainty: $p\left(\theta \mid \{x_n, y_n\}_{n=1}^{N}\right)$
Probabilistic Inference: $E[g(\theta)] = \int g(\theta)\, p(\theta)\, d\theta \approx \frac{1}{N_s} \sum_{n=1}^{N_s} g(\theta^{(n)})$
Sequential Estimation
ECS: Advanced courses building on the foundations you will learn here:
Advanced Machine Learning
Computational Biology
Computational Finance
A. Turing; C. Nüsslein-Volhard
$$\frac{\partial M(x,t)}{\partial t} = D\, \frac{\partial^2 M(x,t)}{\partial x^2} - p_1\, M(x,t) + S(x,t)$$
B. Houchmandzadeh et al. (2007), Nature
Linear Algebra
Calculus
Optimization
Probabilities
$w^t x = |w|\,|x| \cos(\theta)$
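A minimal numpy sketch (not from the slides; the vectors are made up) checking the dot-product identity numerically:

```python
import numpy as np

# Verify w.x = |w| |x| cos(theta) on example vectors.
w = np.array([3.0, 4.0])
x = np.array([1.0, 2.0])

dot = w @ x
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
theta = np.arccos(cos_theta)

# Reconstruct the dot product from magnitudes and the angle.
reconstructed = np.linalg.norm(w) * np.linalg.norm(x) * np.cos(theta)
print(dot, reconstructed)  # both 11.0
```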
$(AB)^T = B^T A^T$
Square: number of rows = number of columns
Symmetric: $A^T = A$
Identity matrix $I$: diagonal elements 1, off-diagonals 0.
Determinant: $\det \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = a_{11} a_{22} - a_{21} a_{12}$
Trace: $\mathrm{trace}(A) = \sum_{i=1}^{n} a_{ii}$
Linear transformation
y = Ax
Homework: Look up whether the following are true and how they are proved.
$\det(A) = \prod_{i=1}^{n} \lambda_i$
$\mathrm{trace}(A) = \sum_{i=1}^{n} \lambda_i$
Real symmetric matrix: $A = U D U^T$
Columns of $U$ orthogonal.
More advanced (very powerful) topic: Singular Value Decomposition (SVD)
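The homework claims can be checked numerically. A sketch (the random matrix is illustrative) using numpy's eigendecomposition for symmetric matrices:

```python
import numpy as np

# Check: det(A) = product of eigenvalues, trace(A) = sum of eigenvalues,
# and A = U D U^T with orthonormal columns of U, for a symmetric A.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # make a real symmetric matrix

eigvals, U = np.linalg.eigh(A)  # columns of U are orthonormal eigenvectors
D = np.diag(eigvals)

assert np.isclose(np.linalg.det(A), np.prod(eigvals))
assert np.isclose(np.trace(A), np.sum(eigvals))
assert np.allclose(A, U @ D @ U.T)      # the eigendecomposition itself
assert np.allclose(U.T @ U, np.eye(4))  # U is orthogonal
```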
Function $y = f(x)$
Derivative $\frac{dy}{dx}$ is the gradient/slope; integral $\int_{x=a}^{x=b} f(x)\, dx$ is the area under the curve.
Function of several variables: $y = f(x_1, x_2, \ldots, x_p)$
Partial derivatives $\frac{\partial f}{\partial x_i}$: differentiate with respect to $x_i$, pretending all other variables remain constant.
Gradient vector: $\nabla f = \left[ \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_p} \right]^T$
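Partial derivatives and the gradient vector can be sanity-checked with central finite differences. A sketch on an illustrative function (not from the lecture):

```python
import numpy as np

# f(x1, x2) = x1^2 + 3*x1*x2, with analytic partials
# df/dx1 = 2*x1 + 3*x2 and df/dx2 = 3*x1.
def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numerical_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference in x_i
    return g

x = np.array([1.0, 2.0])
print(grad_f(x), numerical_grad(f, x))  # both close to [8., 3.]
```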
Constrained optimization:
$$\min f(x) \quad \text{subject to } g_i(x) \geq b_i, \; i = 1, 2, \ldots, m$$
Gradient descent:
$$x^{(k+1)} = x^{(k)} - \eta\, \nabla f$$
Newton's method:
$$x^{(k+1)} = x^{(k)} - H^{-1}\, \nabla f$$
Lagrangian:
$$F(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \left[ b_i - g_i(x) \right]$$
We will use various optimization algorithms in this module (later in the coursework).
Advanced Homework: Search for "CVX: Disciplined Convex Programming" and have a rough read.
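A sketch of the two update rules on a simple quadratic (the matrix and step size are illustrative); for a quadratic the Hessian is constant and one Newton step lands on the minimizer:

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x; the minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
H = A  # Hessian of the quadratic

# Gradient descent: x <- x - eta * grad(x)
x = np.zeros(2)
eta = 0.1
for _ in range(200):
    x = x - eta * grad(x)

# Newton's method: x <- x - H^{-1} grad(x); exact in one step here.
x0 = np.zeros(2)
x_newton = x0 - np.linalg.solve(H, grad(x0))

print(x, x_newton, np.linalg.solve(A, b))
```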
Discrete probabilities $P[X]$; continuous densities $p(x)$
Joint $P[X, Y]$; marginal $P[X]$; conditional $P[X \mid Y]$
Bayes' rule:
$$P[Y \mid X] = \frac{P[X \mid Y]\, P[Y]}{P[X]}$$
Marginalization:
$$P[X] = \sum_{Y} P[X \mid Y]\, P[Y]$$
Product rule: $P[X, Y] = P[X \mid Y]\, P[Y]$, so
$$P[X] = \sum_{Y} P[X, Y] = \sum_{Y} P[X \mid Y]\, P[Y]$$
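A quick sanity check of Bayes' rule and marginalization with made-up numbers for a binary $Y$:

```python
# Prior P[Y] and likelihood P[X=1 | Y] (illustrative values).
P_Y = {0: 0.7, 1: 0.3}
P_X1_given_Y = {0: 0.1, 1: 0.8}

# Marginal: P[X=1] = sum over Y of P[X=1|Y] P[Y]
P_X1 = sum(P_X1_given_Y[y] * P_Y[y] for y in P_Y)

# Posterior by Bayes' rule: P[Y=1 | X=1] = P[X=1|Y=1] P[Y=1] / P[X=1]
P_Y1_given_X1 = P_X1_given_Y[1] * P_Y[1] / P_X1
print(P_X1, P_Y1_given_X1)
```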
Univariate Gaussian:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left\{ -\frac{1}{2} \frac{(x - m)^2}{\sigma^2} \right\}$$
Multivariate Gaussian:
Mean $m$ is a vector.
Covariance matrix $C$: symmetric, positive semi-definite!
Homework: Draw sketches for different values of $m$ and $C$.
$$x \sim N(m, C), \; y = Ax \;\Rightarrow\; y \sim N(Am,\, A C A^T)$$
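The linear-transformation property can be verified empirically by sampling; a sketch with illustrative values of $m$, $C$ and $A$:

```python
import numpy as np

# Sample x ~ N(m, C), apply y = A x, and check that the sample mean and
# covariance of y match A m and A C A^T.
rng = np.random.default_rng(1)
m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])

x = rng.multivariate_normal(m, C, size=200_000)
y = x @ A.T

print(y.mean(axis=0), A @ m)                 # sample mean vs A m
print(np.cov(y, rowvar=False), A @ C @ A.T)  # sample covariance vs A C A^T
```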
Estimation
Univariate mean: $\hat{m} = \frac{1}{N} \sum_{n=1}^{N} x_n$
Univariate covariance: $\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{m})^2$
Multivariate mean: $\hat{m} = \frac{1}{N} \sum_{n=1}^{N} x_n$
Covariance matrix: $\hat{C} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{m})(x_n - \hat{m})^T$
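These estimators can be written out explicitly and compared against numpy's built-ins; a sketch on a tiny made-up data set:

```python
import numpy as np

# N=4 points in p=2 dimensions (illustrative data).
X = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [0.0, 2.0]])
N = X.shape[0]

m_hat = X.sum(axis=0) / N                                        # (1/N) sum x_n
C_hat = sum(np.outer(x - m_hat, x - m_hat) for x in X) / N       # (1/N) sum outer products

assert np.allclose(m_hat, X.mean(axis=0))
assert np.allclose(C_hat, np.cov(X, rowvar=False, bias=True))    # bias=True gives the 1/N form
print(m_hat)
print(C_hat)
```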
Classifying based on $P[\omega_j \mid x]$
Optimal classifier for simple distributions
Linear classifier: when is it optimal?
Distance based classifiers
Nearest Neighbour classifier
Mahalanobis distance
Linear discriminant analysis
Fisher LDA
Classifier Performance
Receiver Operating Characteristics (ROC) Curve
Perceptron learning rule and convergence
Compare posteriors:
$$P[\omega_1 \mid x] \gtrless P[\omega_2 \mid x]$$
$$p(x \mid \omega_1)\, P[\omega_1] \gtrless p(x \mid \omega_2)\, P[\omega_2]$$
With Gaussian class conditionals of equal covariance $C$:
$$\frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left\{ -\frac{1}{2} (x - m_1)^t C^{-1} (x - m_1) \right\} P[\omega_1] \gtrless \frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left\{ -\frac{1}{2} (x - m_2)^t C^{-1} (x - m_2) \right\} P[\omega_2]$$
For equal priors and identity covariance this reduces to comparing distances:
$$(x - m_1)^t (x - m_1) \gtrless (x - m_2)^t (x - m_2), \quad \text{i.e.} \quad |x - m_1| \gtrless |x - m_2|$$
Linear classifier:
$$w^t x + b \gtrless 0$$
Expand dimensions: $a = [w^t \; b]^t$ and $y = [x^t \; 1]^t$, so the rule becomes
$$a^t y \gtrless 0$$

random guess of the weights
repeat
    select data at random
    if not correctly classified
        update weights
until (all data correctly classified)

Update: $a \leftarrow a + \eta\, y$ for a misclassified pattern $y$ (with class-two patterns sign-flipped, so that $a^t y > 0$ is required for all data).
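The pseudocode above can be sketched directly in numpy on a toy separable problem (the data and learning rate are illustrative):

```python
import numpy as np

# Toy two-class data with labels t in {-1, +1}.
rng = np.random.default_rng(0)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

# Augment x to y = [x, 1] and sign-normalize, so every pattern should
# satisfy a^t y > 0 for a correct classifier.
Y = np.hstack([X, np.ones((len(X), 1))]) * t[:, None]

a = rng.standard_normal(3)  # random guess of the weights
eta = 1.0

while True:
    misclassified = [y for y in Y if a @ y <= 0]
    if not misclassified:  # all data correctly classified
        break
    y = misclassified[rng.integers(len(misclassified))]  # pick one at random
    a = a + eta * y  # perceptron update

print(all(a @ y > 0 for y in Y))  # True
```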
1 Some plotting
2 Bayes optimal class boundary
3 Implement your own perceptron algorithm
Bayes classifier:
$$P[\omega_1 \mid x] = \frac{p(x \mid \omega_1)\, P[\omega_1]}{p(x \mid \omega_1)\, P[\omega_1] + p(x \mid \omega_2)\, P[\omega_2]}$$
Restrictive assumptions:
Gaussian class conditionals: $p(x \mid \omega_j) = N(m_j, C_j)$
Equal covariance matrices: $C_1 = C_2 = C$
Substitute, divide through by the numerator term, and cancel common terms to get
$$P[\omega_1 \mid x] = \frac{1}{1 + \exp\{ -(w^t x + w_0) \}}$$
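The algebra can be checked numerically. In this sketch (means, covariance and priors are made up), `w` and `w0` are the standard expressions that result from the substitution described above:

```python
import numpy as np

m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
C = np.array([[1.0, 0.3], [0.3, 1.0]])
P1, P2 = 0.6, 0.4
Ci = np.linalg.inv(C)

def gauss(x, m):
    # Bivariate Gaussian density with covariance C.
    d = x - m
    return np.exp(-0.5 * d @ Ci @ d) / (2 * np.pi * np.sqrt(np.linalg.det(C)))

# Parameters of the resulting sigmoid posterior.
w = Ci @ (m1 - m2)
w0 = -0.5 * (m1 @ Ci @ m1 - m2 @ Ci @ m2) + np.log(P1 / P2)

x = np.array([0.5, -1.2])
post_bayes = gauss(x, m1) * P1 / (gauss(x, m1) * P1 + gauss(x, m2) * P2)
post_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + w0)))
print(post_bayes, post_sigmoid)  # identical up to rounding
```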
$$J(w) = \frac{w^t C_B w}{w^t C_W w}$$
$$\nabla_w J = \frac{2\, C_B w\, (w^t C_W w) - 2\, C_W w\, (w^t C_B w)}{(w^t C_W w)^2}$$
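Setting this gradient to zero gives the well-known Fisher direction $w^* \propto C_W^{-1}(m_1 - m_2)$ in the two-class case; a sketch (scatter matrices are illustrative) confirming the gradient vanishes there:

```python
import numpy as np

m1, m2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])
CW = np.array([[1.0, 0.2], [0.2, 2.0]])
CB = np.outer(m1 - m2, m1 - m2)  # between-class scatter (two-class case)

def grad_J(w):
    # Quotient-rule gradient of J(w) = (w^t CB w) / (w^t CW w).
    num, den = w @ CB @ w, w @ CW @ w
    return (2 * CB @ w * den - 2 * CW @ w * num) / den**2

w_star = np.linalg.solve(CW, m1 - m2)  # Fisher direction
print(grad_J(w_star))  # ~ [0, 0]
```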
Data: $\{x_n, f_n\}_{n=1}^{N}$
Input: $x_n \in \mathbb{R}^p$; target/output $f_n$ real valued
Model: $f = w^t x + w_0$
Output is a linear function of the input (including a constant $w_0$)
Work in $(p+1)$-dimensional space to avoid treating $w_0$ separately:
$$y = \begin{bmatrix} x \\ 1 \end{bmatrix}, \qquad a = \begin{bmatrix} w \\ w_0 \end{bmatrix}$$
Data: $\{y_n, f_n\}_{n=1}^{N}$
Model: $f = y^t a$
$p + 1$ unknowns held in vector $a$
$$E = \sum_{n=1}^{N} \left\{ y_n^t a - f_n \right\}^2 = \sum_{n=1}^{N} \left\{ \sum_{j=1}^{p+1} a_j\, y_{nj} - f_n \right\}^2$$
Error: $E = \sum_{n=1}^{N} e_n^2$
True gradient:
$$\nabla_a E = 2 \sum_{n=1}^{N} \left( y_n^t a - f_n \right) y_n$$
Gradient of a single term (for stochastic updates):
$$\nabla_a e_n^2 = 2 \left( y_n^t a - f_n \right) y_n$$
Regularized (pseudo-inverse) solution:
$$a = \left( Y^t Y + \lambda I \right)^{-1} Y^t f$$
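A sketch of this closed-form solution on synthetic data (the true weights, noise level and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X = rng.standard_normal((N, p))
Y = np.hstack([X, np.ones((N, 1))])        # augmented inputs y_n = [x_n, 1]
a_true = np.array([1.0, -2.0, 0.5, 0.3])   # last entry plays the role of w0
f = Y @ a_true + 0.01 * rng.standard_normal(N)

# a = (Y^T Y + lam * I)^{-1} Y^T f, via a linear solve for stability.
lam = 1e-3
a_hat = np.linalg.solve(Y.T @ Y + lam * np.eye(p + 1), Y.T @ f)
print(a_hat)  # close to a_true
```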
Perceptron criterion (summing over the set $\mathcal{M}$ of misclassified patterns):
$$E_P = \sum_{n \in \mathcal{M}} -a^t y_n$$
Gradient:
$$\frac{\partial E_P}{\partial a} = -\sum_{n \in \mathcal{M}} y_n$$
Gradient algorithm:
$$a^{(k+1)} = a^{(k)} + \eta \sum_{n \in \mathcal{M}} y_n$$
Stochastic gradient algorithm:
$$a^{(k+1)} = a^{(k)} + \eta\, y_n$$
Note what $y_n$ is: it is an item of data that is taken at random and happens to be misclassified by the current value of $a$ at iteration $k$.
Taking magnitudes:
$$\|a^{(k+1)} - \hat{a}\|^2 = \|a^{(k)} - \hat{a}\|^2 + 2\eta\, (a^{(k)} - \hat{a})^t y(k) + \eta^2 \|y(k)\|^2$$
If we drop the negative term $a^{(k)t} y(k)$ from the right-hand side, the equality becomes an inequality:
$$\|a^{(k+1)} - \hat{a}\|^2 < \|a^{(k)} - \hat{a}\|^2 - 2\eta\, \hat{a}^t y(k) + \eta^2 \|y(k)\|^2$$
Perceptron
Convergence of the learning rule (contd)
Of the three terms on the right-hand side, we know $\hat{a}^t y(k) > 0$, because $\hat{a}$ is assumed to be a solution.
If we select
$$\beta^2 = \max_i \|y_i\|^2, \qquad \gamma = \min_i \hat{a}^t y_i,$$
i.e. the largest of the positive term and the smallest of the negative term, then for $\eta = \gamma / \beta^2$,
$$\|a^{(k+1)} - \hat{a}\|^2 < \|a^{(k)} - \hat{a}\|^2 - \gamma^2 / \beta^2$$
(Note the inequality remains true when the right-hand side is replaced by a quantity larger than what it previously was.)
Every correction takes the guess closer to a true solution.
From an initialization $a^{(1)}$, we will find a solution in at most
$$k_0 = \frac{\|a^{(1)} - \hat{a}\|^2\, \beta^2}{\gamma^2}$$
updates.
Summary
Linear regression
Solution as pseudo inverse
Solution by gradient descent
Regularization
Perceptron
Setting up a suitable error function
Convergence of the algorithm