
Introduction to Machine Learning

Lecture 1: Introduction

Mahesan Niranjan

School of Electronics and Computer Science


University of Southampton

Slides are prompts (for me); notes are what you make, off the whiteboard
and from textbooks during self study:

We learn by doing, not by observing!

March 2017
Overview

Logistics
Motivation
Some examples from my research
Review of Mathematical Foundations
  Linear Algebra
  Calculus
  Probability Theory / Statistics
  Principles of Optimization

Emphasis is on foundations of the subject (mathematical and algorithmic).
We will not do formal mathematics here; instead we develop an
understanding of the concepts and tools.


Logistics

Teaching:
  Ten two-hour lectures
  Eight two-hour lab sessions

Assessment (in Southampton):
  20% Coursework (from W4)
  80% Semester-end written exam
  MSc pass mark 50%
  Undergraduate pass mark 40%


Assessment
Distribution of marks, COMP3206 2015/16

Difficult to fail this module, but please don't try!
Good Books

R.O. Duda, P.E. Hart & D.G. Stork, Pattern Classification
C.M. Bishop, Pattern Recognition and Machine Learning
I.H. Witten & E. Frank, Data Mining
S. Rogers & M. Girolami, A First Course in Machine Learning

"There is nothing to be learnt from a professor, which is not to be met with in books."
- David Hume (1711-1776)

(Wikipedia: Hume had little respect for the professors of his time [...] He did not graduate.)


Machine Learning: Good employment prospects!

Standard disclaimers apply!


Machine Learning: Intellectually Enriching

Mathematical / Statistical side of Artificial Intelligence

Machine Learning draws from many fields
Machine Learning as Data-driven Modelling
Single-slide overview of the subject and challenging questions

Data: $\{x_n, y_n\}_{n=1}^{N}$ or $\{x_n\}_{n=1}^{N}$

Function Approximator: $y = f(x, \theta) + v$

Parameter Estimation: $E_0 = \sum_{n=1}^{N} \| y_n - f(x_n; \theta) \|^2$

Prediction: $\hat{y}_{N+1} = f\left( x_{N+1}, \hat{\theta} \right)$

Regularization: $E_1 = \sum_{n=1}^{N} \| y_n - f(x_n; \theta) \|^2 + \lambda\, g(\|\theta\|)$

Modelling Uncertainty: $p\left( \theta \mid \{x_n, y_n\}_{n=1}^{N} \right)$

Probabilistic Inference: $E[g(\theta)] = \int g(\theta)\, p(\theta)\, d\theta \approx \frac{1}{N_s} \sum_{n=1}^{N_s} g(\theta^{(n)})$

Sequential Estimation: $\theta(n-1 \mid n-1) \rightarrow \theta(n \mid n-1) \rightarrow \theta(n \mid n)$

Kalman & Particle Filters; Reinforcement Learning
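As a concrete companion to the parameter estimation, regularization and prediction lines above, here is a minimal sketch, not from the slides; the data, the linear choice of f, and the value of the regularization weight lambda are made-up illustrations. With a quadratic penalty $g(\|\theta\|) = \|\theta\|^2$, minimizing $E_1$ has a closed form (ridge regression).

```python
import numpy as np

# Synthetic data: y = 2*x + 1 plus noise (made-up example, not from the lecture)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
X = np.hstack([x, np.ones_like(x)])          # augment with a constant column
y = 2.0 * x[:, 0] + 1.0 + 0.1 * rng.standard_normal(50)

lam = 0.1                                    # regularization weight (lambda), arbitrary

# Minimize E_1 = sum_n ||y_n - theta^T x_n||^2 + lam * ||theta||^2
# Closed form: theta = (X^T X + lam I)^{-1} X^T y
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Prediction for a new input x_{N+1}
x_new = np.array([0.5, 1.0])
print("estimated theta:", theta, " prediction at x=0.5:", x_new @ theta)
```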
Machine Learning
Many Interesting Problems (to me)

Visual Scene Recognition
Machine Translation
Computational Biology
Computational Finance
Recommender Systems
Physiological Signal Modelling

Big Data: Buzzword causing even more excitement!

Make accurate predictions and make money!
Make statements about the problem domain and become famous!

ECS: Advanced courses building on the foundations you will learn here:
  Advanced Machine Learning
  Computational Biology
  Computational Finance


Examples from my research
Example 1: Machine Translation

Phrases move due to grammatical differences.
Variability due to context of phrase.
Not data rich (electronically available parallel corpora);
solution from active learning.


Examples from my research
Example 2: Computational Finance

Constructing Sparse Portfolios

A. Takeda, M. Niranjan, J. Gotoh & Y. Kawahara (2013) Simultaneous pursuit of
out-of-sample performance and sparsity in index tracking portfolios, Computational
Management Science 10(1): 21-49.

See White Board


Molecular Biology
(Figures from: Alberts et al. Molecular Biology of the Cell)


Examples from my research
Example 3: Classifying Gene Function

2000 yeast genes
Observed (simultaneously) under 78 conditions
Some have a specific function; others not

See MATLAB Demo


Example 4: Regulation of Protein Concentrations [Yawwani Gunawardana]

Set up a predictor of protein concentration
Sparse model selects relevant features
Outliers = post-translationally regulated proteins

Transcriptome → Proteome [Yawwani Gunawardana]


Example 5: Morphogen Propagation in Development

A. Turing, C. Nüsslein-Volhard

$\frac{\partial}{\partial t} M(x, t) = D \frac{\partial^2}{\partial x^2} M(x, t) - p_1 M(x, t) + S(x, t)$

B. Houchmandzadeh et al. (2007), Nature


Is Maternal mRNA Stability Regulated? [Wei Liu]


Example 6: Systems Level Modelling [Xin Liu]


... and now for more serious matters!
Rapid Review of Foundations

Linear Algebra
Calculus
Optimization
Probabilities

This is not a course on any of the above!
We need tools from these topics.
We will quickly review what we need today, and return to each topic as
and when we need it (in just about enough depth) to understand
machine learning.


Linear Algebra: Vectors and Matrices

Vectors and matrices as collections of numbers

$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \qquad
A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1d} \\ a_{21} & a_{22} & \ldots & a_{2d} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nd} \end{pmatrix}$

Operations on collections of numbers

Scalar product:
$w^t x = \sum_{i=1}^{n} w_i x_i$

With useful geometric insights

Angle between vectors in n-dimensional space:
$w^t x = |w|\, |x| \cos(\theta)$
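A quick numerical illustration of the scalar product and the angle formula above; a minimal sketch with made-up vectors, not part of the slides.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, -1.0, 0.5])

# Scalar product: w^t x = sum_i w_i x_i
dot = w @ x

# Angle between the vectors, from w^t x = |w| |x| cos(theta)
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
theta = np.arccos(cos_theta)
print("w^t x =", dot, " angle (radians) =", theta)
```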


Vectors

Linear independence
... set of p vectors $x_j$, $j = 1, \ldots, p$
$\sum_{j=1}^{p} \alpha_j x_j = 0$
... only solution is all $\alpha_j = 0$
... no vector in the set can be expressed as a linear combination of
the others.

Scalar product as projection: projection of vector $x$ on a direction
specified by vector $u$:
$\frac{x \cdot u}{|u|} \frac{u}{|u|}$
... we will also write this as
$\frac{x^T u}{|u|} \frac{u}{|u|}$
Matrices

Simple operations, e.g. addition: $[A + B]_{ij} = [A]_{ij} + [B]_{ij}$;
transpose: $[A^T]_{ij} = [A]_{ji}$; multiplication by a scalar: $[\alpha A]_{ij} = \alpha [A]_{ij}$

Matrix multiplication:
$[A\, B]_{ij} = \sum_{k=1}^{n} [A]_{ik} [B]_{kj}$

$(AB)^T = B^T A^T$

Square: number of rows = number of columns
Symmetric: $A^T = A$
Identity matrix: $I$, diagonal elements 1, off-diagonals 0.

Determinant: $\det \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = a_{11} a_{22} - a_{21} a_{12}$

Trace: $\mathrm{trace}(A) = \sum_{i=1}^{n} a_{ii}$
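A small numerical check of the identities above, $(AB)^T = B^T A^T$, the 2x2 determinant, and the trace; a sketch with random made-up matrices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))

# (AB)^T = B^T A^T
print(np.allclose((A @ B).T, B.T @ A.T))                  # True

# det of a 2x2 matrix: a11*a22 - a21*a12
print(np.isclose(np.linalg.det(A),
                 A[0, 0] * A[1, 1] - A[1, 0] * A[0, 1]))  # True

# trace: sum of diagonal elements
print(np.isclose(np.trace(A), A[0, 0] + A[1, 1]))         # True
```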
Linear transformation

$y = A x$

Rotation: $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$
$R_\theta\, x$ rotates $x$ by angle $\theta$ radians.
Magnitude of $x$ does not change.

A special relationship between a square matrix $A$ and vector $x$:
$A x = \lambda x$
Magnitude scales, but no rotation... have you come across this?

Eigenvalues, eigenvectors
Found by
$\det(A - \lambda I) = 0$

Homework: Look up if the following are true and how they are proved.
$\det(A) = \prod_{i=1}^{n} \lambda_i$
$\mathrm{trace}(A) = \sum_{i=1}^{n} \lambda_i$
Real symmetric matrix: $A = U D U^T$
Columns of $U$ orthogonal.
More advanced (very powerful) topic: Singular value decomposition (SVD)
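As a quick numerical look at the homework claims above (determinant equals the product of eigenvalues, trace equals their sum, and a real symmetric matrix factorizes as $U D U^T$ with orthogonal $U$), here is a short sketch; the matrix is made up, and this is a check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M + M.T                        # build a real symmetric matrix

lam, U = np.linalg.eigh(A)         # eigenvalues and orthonormal eigenvectors

# det(A) = product of eigenvalues, trace(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(lam)))    # True
print(np.isclose(np.trace(A), np.sum(lam)))          # True

# A = U D U^T with U orthogonal (U^T U = I)
print(np.allclose(A, U @ np.diag(lam) @ U.T))        # True
print(np.allclose(U.T @ U, np.eye(4)))               # True
```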


Rapid Review of Foundations II: Calculus

Function $y = f(x)$
Derivative $\frac{dy}{dx}$ is gradient/slope;
Integral $\int_{x=a}^{x=b} f(x)\, dx$ is area under the curve.

Function of several variables $y = f(x_1, x_2, \ldots, x_p)$
Partial derivatives $\frac{\partial f}{\partial x_i}$: differentiate with respect to $x_i$ pretending all
other variables remain constant.

Gradient vector:
$\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_p} \end{pmatrix}$

Homework: Consider $f = x^t A x$, $A = A^T$; then $\nabla f = 2 A x$. Using scalars in two dimensions, i.e.
$x = [x_1\; x_2]^T$ and $A$ containing elements $\begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}$, verify the claim. Writing out the
algebra helps in learning!
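A numerical companion to the homework above, not a replacement for writing out the algebra: a finite-difference check that the gradient of $f = x^t A x$ with symmetric $A$ is $2 A x$; the numbers are made up.

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 3.0]])         # symmetric: A = A^T
x = np.array([1.0, -2.0])

def f(x):
    return x @ A @ x               # f(x) = x^t A x

# Claimed gradient
grad_claim = 2 * A @ x

# Central finite-difference gradient
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])

print(grad_claim, grad_fd)         # the two should agree closely
```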


Rapid Review of Foundations III: Optimization

Unconstrained optimization: $\min f(x)$

Constrained optimization:
$\min f(x)$
subject to $g_i(x) \geq b_i$, $i = 1, 2, \ldots, m$

Gradient and Hessian:

$\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_p} \end{pmatrix} \qquad
H = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_p} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \ldots & \frac{\partial^2 f}{\partial x_2 \partial x_p} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 f}{\partial x_p \partial x_1} & \frac{\partial^2 f}{\partial x_p \partial x_2} & \ldots & \frac{\partial^2 f}{\partial x_p^2}
\end{pmatrix}$


Optimizations (contd)

Example: Gradient descent algorithm
$x^{(k+1)} = x^{(k)} - \eta\, \nabla f$

Newton's Method
$x^{(k+1)} = x^{(k)} - H^{-1} \nabla f$

Example: Lagrange Multipliers
$\min f(x)$
subject to $g_i(x) \geq b_i$, $i = 1, 2, \ldots, m$

$F(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \left[ b_i - g_i(x) \right]$

We will use various optimization algorithms in this module (later in the coursework).
Advanced Homework: Search for "CVX Disciplined Convex Programming" and have a
rough read.
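To make the two update rules above concrete, here is a minimal sketch comparing gradient descent and Newton's method on a simple quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$; the matrix, the step size eta, and the iteration count are made-up choices for illustration.

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])         # positive definite, so f has a unique minimum
b = np.array([1.0, -2.0])

def grad(x):                       # gradient of f(x) = 0.5 x^T A x - b^T x
    return A @ x - b

H = A                              # Hessian of this quadratic is constant

# Gradient descent: x_{k+1} = x_k - eta * grad f
x = np.zeros(2)
eta = 0.2
for _ in range(200):
    x = x - eta * grad(x)
print("gradient descent:", x)

# Newton's method: x_{k+1} = x_k - H^{-1} grad f (one step suffices for a quadratic)
x = np.zeros(2)
x = x - np.linalg.solve(H, grad(x))
print("Newton's method: ", x)
print("exact minimizer: ", np.linalg.solve(A, b))
```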


Rapid Review of Foundations IV: Probabilities

Discrete probabilities $P[X]$
Continuous densities $p(x)$
Joint $P[X, Y]$; Marginal $P[X]$; Conditional $P[X|Y]$

$P[Y \mid X] = \frac{P[X|Y]\, P[Y]}{P[X]}$

$P[X] = \sum_{Y} P[X|Y]\, P[Y]$

$P[X, Y] = P[X|Y]\, P[Y]$

$P[X] = \sum_{Y} P[X, Y] = \sum_{Y} P[X|Y]\, P[Y]$


Gaussian Densities: Univariate and Multivariate

Univariate Gaussian
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \frac{(x - m)^2}{\sigma^2} \right)$

What are properties we know? Homework: Draw sketches for
different values of $m$ and $\sigma$.

Multivariate Gaussian
$p(x) = \frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left( -\frac{1}{2} (x - m)^t C^{-1} (x - m) \right)$

Mean $m$ is a vector
Covariance matrix $C$: symmetric, positive semi-definite!
Homework: Draw sketches for different values of $m$ and $C$

$x \sim N(m, C),\; y = A x \;\Rightarrow\; y \sim N(A m,\, A C A^T)$
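Here is a small sketch, with made-up $m$, $C$ and $A$, of drawing samples from $N(m, C)$ and checking empirically that $y = A x$ has mean $A m$ and covariance $A C A^T$, as stated above; it also previews the lab exercise on sampling.

```python
import numpy as np

rng = np.random.default_rng(3)

m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])                     # symmetric, positive definite
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])

# Draw N samples x ~ N(m, C)
N = 100_000
X = rng.multivariate_normal(m, C, size=N)

# Transform each sample: y = A x
Y = X @ A.T

# Empirical mean and covariance of y should be close to A m and A C A^T
print("empirical mean:", Y.mean(axis=0), " vs A m:", A @ m)
print("empirical cov:\n", np.cov(Y, rowvar=False), "\nvs A C A^T:\n", A @ C @ A.T)
```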
Estimation

Univariate Mean: $\hat{m} = \frac{1}{N} \sum_{n=1}^{N} x_n$

Univariate Variance: $\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{m})^2$

Multivariate Mean: $\hat{m} = \frac{1}{N} \sum_{n=1}^{N} x_n$

Covariance Matrix: $\hat{C} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{m})(x_n - \hat{m})^T$

These are known as maximum likelihood estimates (see later).

Homework: Have you noticed there are two buttons on a calculator
for estimating standard deviation, denoted $\sigma_n$ and $\sigma_{n-1}$? Find out why.
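A short sketch (made-up data) computing the estimates above with numpy, and showing the two normalizations (1/N versus 1/(N-1)) behind the two calculator buttons mentioned in the homework, without giving away the reason why both exist.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=500)

# Maximum likelihood estimates: divide by N
m_hat = X.mean(axis=0)
C_hat = (X - m_hat).T @ (X - m_hat) / X.shape[0]
print("mean estimate:", m_hat)
print("ML covariance estimate (1/N):\n", C_hat)

# The two calculator buttons: 1/N (ddof=0) vs 1/(N-1) (ddof=1) normalization
x = X[:, 0]
print("sigma_n  :", x.std(ddof=0))
print("sigma_n-1:", x.std(ddof=1))
```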


What Next?
Pattern Classification

Classifying based on $P[\omega_j \mid x]$
Optimal classifier for simple distributions
Linear classifier: when is it optimal?
Distance based classifiers
  Nearest Neighbour classifier
  Mahalanobis distance
Linear discriminant analysis
  Fisher LDA
Classifier Performance
  Receiver Operating Characteristics (ROC) Curve
Perceptron learning rule and convergence

See sketches on the whiteboard; these illustrations are important


Overview (Lecture 2)

Review of what we learnt in Lab One

Multivariate Gaussian
  Drawing samples from $N(m, C)$
  Principal directions
Introduction to Bayesian Decision Theory
  Bayes Classifier for Simple Gaussian Distributions
Simple Classifiers
  Distance to mean classifier
  Nearest Neighbour classifier
  Linear classifier (more on this later)
  Perceptron (formal setting later)
What will we learn in Lab Two?


Bayesian Decision Theory

Classes: $\omega_i$, $i = 1, \ldots, K$
Prior Probabilities: $P[\omega_1], \ldots, P[\omega_K]$;
$P[\omega_i] \geq 0$, $\sum_{i=1}^{K} P[\omega_i] = 1$
Likelihoods (class conditional probabilities): $p(x \mid \omega_i)$, $i = 1, \ldots, K$
Posterior Probability: $P[\omega_j \mid x]$

$P[\omega_j \mid x] = \frac{p(x \mid \omega_j)\, P[\omega_j]}{\sum_{i=1}^{K} p(x \mid \omega_i)\, P[\omega_i]}$

From prior knowledge: $P[\omega_i]$; from training data: $p(x \mid \omega_i)$

Decision rule: Assign $x$ to the class that maximizes posterior
probability.
The denominator is a constant, i.e. it does not depend on $j$.
Hence the decision rule becomes:
$x \rightarrow \max_j\; p(x \mid \omega_j)\, P[\omega_j]$
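A minimal sketch of the decision rule above for a two-class problem with Gaussian likelihoods; the class means, shared covariance, and priors are made up for illustration, and normalizing the scores into posteriors is optional for the decision itself.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up two-class problem
m1, m2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
C = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = np.array([0.7, 0.3])                 # P[w1], P[w2]

def classify(x):
    # Evaluate p(x|w_j) P[w_j] for each class and pick the maximum
    likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=C)
                            for m in (m1, m2)])
    scores = likelihoods * priors
    posteriors = scores / scores.sum()
    return scores.argmax() + 1, posteriors

label, post = classify(np.array([1.0, 0.5]))
print("assigned class:", label, " posteriors:", post)
```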


Bayes Classifier for Gaussian Densities
Make assumptions, cancel common terms when making comparisons...

Decision rule from: $p(x \mid \omega_j)\, P[\omega_j]$

Assume the two classes are Gaussian distributed with distinct means
and identical covariance matrices:
$p(x \mid \omega_j) = N(m_j, C)$

Substitute into Bayes classifier decision rule:

$P[\omega_1 \mid x] \gtrless P[\omega_2 \mid x]$
$p(x \mid \omega_1)\, P[\omega_1] \gtrless p(x \mid \omega_2)\, P[\omega_2]$

$\frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left( -\frac{1}{2} (x - m_1)^t C^{-1} (x - m_1) \right) P[\omega_1]
\;\gtrless\;
\frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left( -\frac{1}{2} (x - m_2)^t C^{-1} (x - m_2) \right) P[\omega_2]$


Bayes classifier for simple densities (contd)
Distinct Means; Equal, isotropic covariance matrix

Suppose the densities are isotropic and priors are equal,
i.e. $C = \sigma^2 I$ and $P[\omega_1] = P[\omega_2]$
The comparison simplifies to (see algebra on board):

$(x - m_1)^t (x - m_1) \gtrless (x - m_2)^t (x - m_2)$
$|x - m_1| \gtrless |x - m_2|$

The above is a simple distance to mean classifier.
Under the above simplistic assumptions, we only need to store one
template per class (the means)!


Bayes classifier for simple densities (contd)
Distinct Means; Common covariance matrix (but not isotropic)

Cancel common terms and take log:

$(x - m_1)^t C^{-1} (x - m_1) - (x - m_2)^t C^{-1} (x - m_2) \;\gtrless\; 2 \log \frac{P[\omega_1]}{P[\omega_2]}$

Also simplifies to a linear classifier!

$w^t x + b \gtrless 0$
$w = 2 C^{-1} (m_2 - m_1)$
$b = m_1^t C^{-1} m_1 - m_2^t C^{-1} m_2 - 2 \log \frac{P[\omega_1]}{P[\omega_2]}$

Also a distance to template classifier, where the distance is
$(x - m_1)^t C^{-1} (x - m_1)$
Known as the Mahalanobis distance
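A sketch, with made-up means, covariance and priors, that builds the linear classifier above from the class statistics and checks it against direct comparison of the two Mahalanobis distances.

```python
import numpy as np

m1, m2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
C = np.array([[1.0, 0.2], [0.2, 1.0]])
P1, P2 = 0.6, 0.4

Cinv = np.linalg.inv(C)
w = 2 * Cinv @ (m2 - m1)
b = m1 @ Cinv @ m1 - m2 @ Cinv @ m2 - 2 * np.log(P1 / P2)

def mahalanobis_sq(x, m):
    d = x - m
    return d @ Cinv @ d

x = np.array([1.2, 0.3])
# Decide class 1 when w^t x + b <= 0, equivalently when
# (x-m1)^t C^{-1} (x-m1) - (x-m2)^t C^{-1} (x-m2) <= 2 log(P1/P2)
lhs = mahalanobis_sq(x, m1) - mahalanobis_sq(x, m2)
print(w @ x + b <= 0, lhs <= 2 * np.log(P1 / P2))    # the two tests agree
```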
Implementing a linear classifier: Perceptron
Error correcting learning

Linear classifier
$w^t x + b \gtrless 0$
Expand dimensions: $a = [w^t\; b]^t$ and $y = [x^t\; 1]^t$
$a^t y \gtrless 0$

random guess of the weights
repeat
  select data at random
  if not correctly classified
    update weights
until (all data correctly classified)

Update:
$a^{(k+1)} = a^{(k)} + y^{(k)}$
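A minimal sketch of the perceptron loop above on linearly separable toy data (made up); the class labels are folded into the augmented patterns so that a correctly classified pattern always satisfies $a^t y > 0$, and the update is then $a \leftarrow a + y$ exactly as on the slide. An iteration cap is added only as a safeguard.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy, (almost surely) linearly separable data: two well-separated blobs
X1 = rng.standard_normal((50, 2)) + np.array([2.0, 2.0])
X2 = rng.standard_normal((50, 2)) + np.array([-2.0, -2.0])
X = np.vstack([X1, X2])
labels = np.hstack([np.ones(50), -np.ones(50)])

# Augment with a constant 1 and fold the +/-1 label in
Y = labels[:, None] * np.hstack([X, np.ones((100, 1))])

a = rng.standard_normal(3)          # random guess of the weights
for _ in range(10_000):             # safeguard cap on iterations
    wrong = np.where(Y @ a <= 0)[0]
    if len(wrong) == 0:             # all data correctly classified
        break
    k = rng.choice(wrong)           # select a misclassified pattern at random
    a = a + Y[k]                    # error-correcting update

print("learned weights [w^t b]:", a, " misclassified:", int(np.sum(Y @ a <= 0)))
```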
Lab 2

1 Some plotting
2 Bayes optimal class boundary
3 Implement your own perceptron algorithm



Outline

Posterior probabilities for simple Gaussian cases


Fisher Linear Discriminant
Nearest Neighbour Classifier
Classifier performance



Posterior probabilities for simple Gaussian cases
Two class problem

Bayes classifier:
$P[\omega_1 \mid x] = \frac{p(x \mid \omega_1)\, P[\omega_1]}{p(x \mid \omega_1)\, P[\omega_1] + p(x \mid \omega_2)\, P[\omega_2]}$

Restrictive assumptions:
  Gaussian $p(x \mid \omega_j) = N(m_j, C_j)$
  Equal covariance matrices: $C_1 = C_2 = C$

Substitute, divide through by the numerator term and cancel common
terms to get

$P[\omega_1 \mid x] = \frac{1}{1 + \exp\{-(w^t x + w_0)\}}$

The functional form $1/(1 + \exp(-z))$ is known as the sigmoid / logistic function
(See lab class for W3)
Fisher Linear Discriminant

Classification problem (say two classes)

Desirable properties of a direction to project:
  Means of projected data should be far apart
  Variance of projections of each class should be small


Fisher Linear Discriminant

In the p-dimensional ($R^p$) input space, find a direction on which
projected data is maximally separable:
  Projected means should be far apart
  Projected scatter of each class should be small

Projection of $x_n$ onto direction $w$ is $w^t x_n$;
Projected mean for class $j$ will be at $w^t m_j$
Variance of projections is $w^t C_j w$, where $C_j$ is the covariance matrix of
data in class $j$.

Fisher Ratio:
$J_F = \frac{(w^t m_1 - w^t m_2)^2}{w^t C_1 w + w^t C_2 w}$

We can write the numerator as $w^t C_B w$, where
$C_B = (m_1 - m_2)(m_1 - m_2)^t$, the between-class scatter matrix.
$C_W = C_1 + C_2$, the within-class scatter matrix.


Fisher Linear Discriminant (contd)

Fisher criterion to maximize:
$J(w) = \frac{w^t C_B w}{w^t C_W w}$

Set gradient to zero:
$\nabla_w J = \frac{2 C_B w\, (w^t C_W w) - 2 C_W w\, (w^t C_B w)}{(w^t C_W w)^2}$

Equate this to zero and observe
  $w^t C_W w$ and $w^t C_B w$ are scalars
  $C_B w$ points in the same direction as $m_1 - m_2$
  We are only interested in the direction of $w$

$w_F = C_W^{-1} (m_1 - m_2)$
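A short sketch, with made-up two-class data, computing the Fisher direction $w_F = C_W^{-1}(m_1 - m_2)$ and confirming it gives a larger Fisher ratio than, say, the raw difference of means.

```python
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=200)
X2 = rng.multivariate_normal([2, 1], [[3.0, 1.5], [1.5, 1.0]], size=200)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
C1, C2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
CW = C1 + C2

def fisher_ratio(w):
    return (w @ m1 - w @ m2) ** 2 / (w @ C1 @ w + w @ C2 @ w)

w_fisher = np.linalg.solve(CW, m1 - m2)    # w_F = C_W^{-1} (m1 - m2)
w_naive = m1 - m2                          # just the difference of means

print("Fisher ratio (w_F)   :", fisher_ratio(w_fisher))
print("Fisher ratio (m1-m2) :", fisher_ratio(w_naive))
```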


Linear Regression & Perceptron

Data: {x_n, f_n}_{n=1}^N
Input: x_n ∈ R^p; target / output f_n real valued
Model: f = w^t x + w_0
Output is a linear function of the input (including a constant w_0)
Work in (p + 1) dimensional space to avoid treating w_0 separately:

y = (x^t, 1)^t,   a = (w^t, w_0)^t

Data: {y_n, f_n}_{n=1}^N
Model: f = y^t a
(p + 1) unknowns held in vector a

Mahesan Niranjan (UoS) Machine Learning March 2017 43 / 55
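A quick sketch of the augmented representation (the numbers below are made up for illustration): appending a constant 1 to each input and w_0 to the weight vector makes y^t a reproduce w^t x + w_0.

```python
import numpy as np

# Hypothetical example: N = 4 inputs in p = 2 dimensions.
X = np.array([[0.5, 1.2],
              [1.0, 0.3],
              [2.1, 1.8],
              [0.2, 0.9]])

Y = np.hstack([X, np.ones((X.shape[0], 1))])   # each row is y_n^t = (x_n^t, 1)

w, w0 = np.array([0.7, -0.4]), 0.1             # some weights and an offset
a = np.append(w, w0)                           # a = (w^t, w0)^t

# f = y^t a gives the same output as w^t x + w0 for every example
assert np.allclose(Y @ a, X @ w + w0)
```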


Error and Minimization

E = Σ_{n=1}^N { y_n^t a - f_n }^2
E = Σ_{n=1}^N { Σ_{j=1}^{p+1} a_j y_{nj} - f_n }^2

To find the best a we minimize E: differentiate with respect to each of the unknowns in a and set to zero.

∂E/∂a_i = 2 Σ_{n=1}^N ( Σ_{j=1}^{p+1} a_j y_{nj} - f_n ) y_{ni}

There are (p + 1) derivatives (with respect to each a_i)
Equating them to zero gives (p + 1) equations in (p + 1) unknowns

Mahesan Niranjan (UoS) Machine Learning March 2017 44 / 55
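A sketch of the error and its partial derivatives written out exactly as the double summations above (names Y, a, f are hypothetical); it is deliberately loop-based so it can be compared against the vectorized form derived later.

```python
import numpy as np

def error_and_gradient(Y, a, f):
    """E = sum_n (y_n^t a - f_n)^2 and dE/da_i written as explicit sums."""
    N, P = Y.shape                                        # P = p + 1
    E = 0.0
    grad = np.zeros(P)
    for n in range(N):
        residual = sum(a[j] * Y[n, j] for j in range(P)) - f[n]
        E += residual ** 2
        for i in range(P):
            grad[i] += 2.0 * residual * Y[n, i]           # contribution of example n to dE/da_i
    return E, grad
```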


Solution to Regression

(p + 1) simultaneous equations to solve; the i-th row, j-th column entry of the system is shown:

the (i, j) entry of the (p + 1) x (p + 1) matrix is Σ_{n=1}^N y_{ni} y_{nj}, acting on the unknown vector (a_1, a_2, ..., a_{p+1})^t, with i-th right-hand side Σ_{n=1}^N f_n y_{ni}

i.e. for each i:  Σ_{j=1}^{p+1} ( Σ_{n=1}^N y_{ni} y_{nj} ) a_j = Σ_{n=1}^N f_n y_{ni}

Mahesan Niranjan (UoS) Machine Learning March 2017 45 / 55
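A sketch assembling this (p + 1) x (p + 1) system entry by entry (hypothetical names); the assertion checks that the accumulated sums are exactly Y^t Y and Y^t f, anticipating the matrix derivation on the next slide.

```python
import numpy as np

def normal_equations(Y, f):
    """Build the system from the slide: A[i, j] = sum_n y_ni y_nj, b[i] = sum_n f_n y_ni."""
    N, P = Y.shape
    A = np.zeros((P, P))
    b = np.zeros(P)
    for n in range(N):
        A += np.outer(Y[n], Y[n])          # adds y_ni * y_nj to entry (i, j)
        b += f[n] * Y[n]
    # the same system in matrix notation: Y^t Y a = Y^t f
    assert np.allclose(A, Y.T @ Y) and np.allclose(b, Y.T @ f)
    return A, b
```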


Derivation in vector/matrix form

Y: N x (p + 1) matrix whose n-th row is y_n^t
f: N x 1 vector of outputs
Error E = ||Ya - f||^2
Homework: Verify that the error written like this is the same as the one we wrote out in lengthy algebra.
Gradient:

∇_a E = 2 Y^t (Ya - f)

Equating the gradient to zero gives

Y^t Y a = Y^t f
a = (Y^t Y)^{-1} Y^t f

Homework: With three data points in one-dimensional input space, (x_1, f_1), (x_2, f_2) and (x_3, f_3), and two unknowns, the slope (m) and intercept (c) of a straight-line fit, write out all the expressions seen so far.

Mahesan Niranjan (UoS) Machine Learning March 2017 46 / 55
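A minimal sketch of the closed-form solution (synthetic data, hypothetical names): solve the normal equations directly and compare with a library least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: N = 50 points, p = 3 inputs plus the appended constant 1.
X = rng.normal(size=(50, 3))
Y = np.hstack([X, np.ones((50, 1))])
f = rng.normal(size=50)

a_normal = np.linalg.solve(Y.T @ Y, Y.T @ f)      # a = (Y^t Y)^{-1} Y^t f via the normal equations
a_lstsq, *_ = np.linalg.lstsq(Y, f, rcond=None)   # library least-squares solution, for comparison

assert np.allclose(a_normal, a_lstsq)
```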
Solution by Gradient Descent

Gradient vector: ∇_a E = 2 Y^t (Ya - f)

Steepest descent algorithm:
Initialize a at random
Update a^(k+1) = a^(k) - η ∇_a E
Until convergence

Second order (Newton's) method:
Initialize a at random
Update a^(k+1) = a^(k) - H^{-1} ∇_a E
Until convergence

Rapid convergence with the second order method, but the cost of computing and inverting H can be high (more on this under Neural Networks)

Mahesan Niranjan (UoS) Machine Learning March 2017 47 / 55
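A sketch of both update rules for the squared error above (hypothetical names; the step size eta and iteration count are illustrative and must be chosen small enough for the descent to converge). Since E is quadratic, its Hessian is H = 2 Y^t Y, so a single Newton step lands on the least-squares solution.

```python
import numpy as np

def steepest_descent(Y, f, eta=1e-3, iters=5000):
    """Minimize E = ||Ya - f||^2 with a^(k+1) = a^(k) - eta * grad E."""
    a = np.zeros(Y.shape[1])
    for _ in range(iters):
        grad = 2.0 * Y.T @ (Y @ a - f)       # gradient vector from the slide
        a = a - eta * grad
    return a

def newton_step(Y, f, a):
    """One second-order update a - H^{-1} grad E, with H = 2 Y^t Y for this quadratic error."""
    grad = 2.0 * Y.T @ (Y @ a - f)
    H = 2.0 * Y.T @ Y
    return a - np.linalg.solve(H, grad)
```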


Gradient and Stochastic Gradient Descent

Error E = Σ_{n=1}^N e_n^2
True gradient:

∇_a E = 2 Σ_{n=1}^N ( y_n^t a - f_n ) y_n

Gradient computed on the n-th data point alone:

∇_a e_n^2 = 2 ( y_n^t a - f_n ) y_n

Mahesan Niranjan (UoS) Machine Learning March 2017 48 / 55
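A stochastic gradient sketch (hypothetical names; eta and the number of epochs are illustrative): update a using one example's gradient at a time, visiting the data in a random order.

```python
import numpy as np

def sgd(Y, f, eta=0.01, epochs=20, seed=0):
    """Stochastic gradient descent using the per-example gradient 2 (y_n^t a - f_n) y_n."""
    rng = np.random.default_rng(seed)
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(f)):                  # random order each epoch
            a -= eta * 2.0 * (Y[n] @ a - f[n]) * Y[n]      # one-example update
    return a
```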


Regularization

Pseudo inverse solution: a = (Y^t Y)^{-1} Y^t f
This can be ill conditioned, so we could regularize by

a = (Y^t Y + λ I)^{-1} Y^t f

where λ is a small constant.
We achieve precisely this by minimizing an error of the form

||Ya - f||^2 + λ ||a||^2

Here a quadratic penalty term has been included.
Homework: Differentiate this error and derive the regularized solution.
Sparse solutions are obtained by regularizing with an l1 norm (the sum of absolute values of a, i.e. Σ_{j=1}^p |a_j|); see Lab 4.

Mahesan Niranjan (UoS) Machine Learning March 2017 49 / 55
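A one-function sketch of the quadratically regularized (ridge) solution above; the value of lam is a hand-picked illustrative constant, not a recommendation.

```python
import numpy as np

def ridge(Y, f, lam=1e-3):
    """Regularized solution a = (Y^t Y + lam * I)^{-1} Y^t f, i.e. the minimizer of ||Ya - f||^2 + lam ||a||^2."""
    P = Y.shape[1]
    return np.linalg.solve(Y.T @ Y + lam * np.eye(P), Y.T @ f)
```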


Perceptron
A suitable performance measure

Number of misclassified examples as a measure of error:
Piecewise constant (cannot differentiate)
A suitable error measure:

E_P = Σ ( - y_n^t a )

where the summation is taken over the misclassified examples.
We started by requiring y_n^t a > 0 for the positive class and y_n^t a < 0 for the negative class; we then switch the signs of the negative class examples and require y_n^t a > 0 for all the training data; so for the misclassified examples, the sum of - y_n^t a should be made as small as possible.

Mahesan Niranjan (UoS) Machine Learning March 2017 50 / 55
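A small sketch of the criterion (hypothetical names; the rows of Y are assumed to be the sign-adjusted examples, so y_n^t a > 0 means correct classification):

```python
import numpy as np

def perceptron_error(Y, a):
    """E_P = sum of -y_n^t a over the misclassified examples (those with y_n^t a <= 0)."""
    scores = Y @ a
    return -np.sum(scores[scores <= 0])   # each misclassified example contributes a positive amount
```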


Perceptron
Learning rule

Gradient:

∂E_P/∂a = Σ ( - y_n )

with the sum taken over the misclassified examples.
Gradient algorithm: a^(k+1) = a^(k) + Σ y_n
Stochastic gradient algorithm: a^(k+1) = a^(k) + y_n
Note what y_n is here: it is an item of data that is taken at random and happens to be misclassified by the current value of a at iteration k.

Mahesan Niranjan (UoS) Machine Learning March 2017 51 / 55
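A sketch of the stochastic learning rule (hypothetical names; the number of passes is illustrative, and for separable data the corrections stop after finitely many updates, as the convergence proof below shows):

```python
import numpy as np

def train_perceptron(Y, passes=1000, seed=0):
    """Stochastic rule a^(k+1) = a^(k) + y_n on randomly chosen misclassified, sign-adjusted examples."""
    rng = np.random.default_rng(seed)
    a = np.zeros(Y.shape[1])
    for _ in range(passes):
        n = rng.integers(len(Y))         # pick an example at random
        if Y[n] @ a <= 0:                # update only if it is misclassified by the current a
            a = a + Y[n]
    return a
```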


Perceptron
Convergence of the learning rule

Learning Rule: a^(k+1) = a^(k) + y^(k)
where y^(k) is a misclassified input.
Training criterion:
We start by requiring a^t y^(k) > 0 or a^t y^(k) < 0, depending on the example belonging to class 1 or class 2.
If we switch the signs of the examples of class 2, we require a^t y^(k) > 0 for all k.
On misclassified data, a^t y^(k) < 0.
If â is a solution (separable data), then for all k, â^t y^(k) > 0.
We prove convergence by showing that ||a^(k+1) - â||^2 < ||a^(k) - â||^2 for this update rule, i.e. the learning rule brings the guess closer to a valid solution.

Mahesan Niranjan (UoS) Machine Learning March 2017 52 / 55


Perceptron
Convergence of the learning rule (contd)

For the perceptron criterion the magnitude of a is not relevant (only the direction is). Hence, for some scalar α, we wish to show

||a^(k+1) - α â||^2 < ||a^(k) - α â||^2

From the update formula,

a^(k+1) - α â = ( a^(k) - α â ) + y^(k)

Taking magnitudes,

||a^(k+1) - α â||^2 = ||a^(k) - α â||^2 + 2 ( a^(k) - α â )^t y^(k) + ||y^(k)||^2

If we drop the negative term 2 (a^(k))^t y^(k) from the right hand side, the equality becomes an inequality:

||a^(k+1) - α â||^2 < ||a^(k) - α â||^2 - 2 α â^t y^(k) + ||y^(k)||^2

Mahesan Niranjan (UoS) Machine Learning March 2017 53 / 55
Perceptron
Convergence of the learning rule (contd)

Of the three terms on the right hand side, we know â^t y^(k) > 0, because â is assumed to be a solution.
If we select

β^2 = max_i ||y_i||^2
γ = min_i â^t y_i

i.e. the largest possible value of the positive term ||y^(k)||^2 and the smallest possible value of â^t y^(k), then for α = β^2 / γ,

||a^(k+1) - α â||^2 < ||a^(k) - α â||^2 - β^2

(Note the inequality remains true when the right hand side is replaced by a quantity larger than what it previously was.)
Every correction takes the guess closer to a true solution.
From an initialization a^(1), we will find a solution in at most

k_0 = ||a^(1) - α â||^2 / β^2

updates.

Mahesan Niranjan (UoS) Machine Learning March 2017 54 / 55
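A numerical sanity check of the bound (a sketch only: the toy dataset, the assumed separating vector a_hat and the margin threshold are all made up). It runs the fixed-increment rule over sign-adjusted examples and verifies that the number of corrections never exceeds k_0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, linearly separable, sign-adjusted data: keep rows y_i with a_hat^t y_i comfortably positive.
a_hat = np.array([1.0, -1.0, 0.5])             # an assumed separating solution
Y = rng.normal(size=(200, 3))
Y = Y[Y @ a_hat > 0.1]

beta2 = np.max(np.sum(Y ** 2, axis=1))         # beta^2 = max_i ||y_i||^2
gamma = np.min(Y @ a_hat)                      # gamma  = min_i a_hat^t y_i
alpha = beta2 / gamma

a = rng.normal(size=3)                         # a^(1): random initialization
k0 = np.sum((a - alpha * a_hat) ** 2) / beta2  # claimed bound on the number of corrections

corrections = 0
changed = True
while changed:                                 # repeat passes until nothing is misclassified
    changed = False
    for y in Y:
        if y @ a <= 0:                         # misclassified by the current a
            a = a + y                          # fixed-increment update
            corrections += 1
            changed = True

print(corrections, "corrections; bound k0 =", k0)
assert corrections <= k0
```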
Summary

Linear regression
Solution as pseudo inverse
Solution by gradient descent
Regularization
Perceptron
Setting up a suitable error function
Convergence of the algorithm

Mahesan Niranjan (UoS) Machine Learning March 2017 55 / 55
