Lecture 1: Introduction
Mahesan Niranjan
Slides are prompts (for me); notes are what you make, off the whiteboard and from textbooks during self-study.
March 2017
Mahesan Niranjan (UoS) Machine Learning March 2017 1 / 55
Overview
Logistics
Motivation
Some examples from my research
Review of Mathematical Foundations
Linear Algebra
Calculus
Probability Theory / Statistics
Principles of Optimization
Teaching:
Ten two-hour lectures
Eight two-hour lab sessions
There is nothing to be learnt from a professor, which is not to be met with in books
- David Hume (1711-1776)
(Wikipedia: Hume had little respect for the professors of his time [...] He did not graduate)
Function Approximator: $y = f(x, \theta) + v$
Parameter Estimation
Prediction: $\hat{y}_{N+1} = f(x_{N+1}, \hat{\theta})$
Regularization
Modelling Uncertainty: $p\left(\theta \mid \{x_n, y_n\}_{n=1}^{N}\right)$
Probabilistic Inference: $E[g(\theta)] = \int g(\theta)\, p(\theta)\, d\theta \approx \frac{1}{N_s} \sum_{n=1}^{N_s} g(\theta^{(n)})$
Sequential Estimation
ECS: Advanced courses building on the foundations you will learn here:
Advanced Machine Learning
Computational Biology
Computational Finance
A. Turing; C. Nüsslein-Volhard
$$\frac{\partial M(x,t)}{\partial t} = D\, \frac{\partial^2 M(x,t)}{\partial x^2} - p_1\, M(x,t) + S(x,t)$$
B. Houchmandzadeh et al. (2007), Nature
Linear Algebra
Calculus
Optimization
Probabilities
$w^t x = |w|\,|x| \cos(\theta)$
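A minimal numpy sketch (not from the slides; the vectors are made up) checking the dot-product identity numerically:

```python
import numpy as np

# Verify w.x = |w| |x| cos(theta) on example vectors.
w = np.array([3.0, 4.0])
x = np.array([1.0, 2.0])

dot = w @ x
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
theta = np.arccos(cos_theta)

# Reconstruct the dot product from magnitudes and the angle.
reconstructed = np.linalg.norm(w) * np.linalg.norm(x) * np.cos(theta)
print(dot, reconstructed)  # both 11.0
```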
$(AB)^T = B^T A^T$
Square: number of rows = number of columns
Symmetric: $A^T = A$
Identity matrix $I$: diagonal elements 1, off-diagonals 0.
Determinant: $\det \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = a_{11} a_{22} - a_{21} a_{12}$
Trace: $\mathrm{trace}(A) = \sum_{i=1}^{n} a_{ii}$
Linear transformation
y = Ax
Homework: Look up whether the following are true and how they are proved.
$\det(A) = \prod_{i=1}^{n} \lambda_i$
$\mathrm{trace}(A) = \sum_{i=1}^{n} \lambda_i$
Real symmetric matrix: $A = U D U^T$
Columns of $U$ orthogonal.
More advanced (very powerful) topic: Singular Value Decomposition (SVD)
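The homework claims can be checked numerically. A sketch (the random matrix is illustrative) using numpy's eigendecomposition for symmetric matrices:

```python
import numpy as np

# Check: det(A) = product of eigenvalues, trace(A) = sum of eigenvalues,
# and A = U D U^T with orthonormal columns of U, for a symmetric A.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # make a real symmetric matrix

eigvals, U = np.linalg.eigh(A)  # columns of U are orthonormal eigenvectors
D = np.diag(eigvals)

assert np.isclose(np.linalg.det(A), np.prod(eigvals))
assert np.isclose(np.trace(A), np.sum(eigvals))
assert np.allclose(A, U @ D @ U.T)      # the eigendecomposition itself
assert np.allclose(U.T @ U, np.eye(4))  # U is orthogonal
```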
Function $y = f(x)$
Derivative $\frac{dy}{dx}$ is the gradient/slope; integral $\int_{x=a}^{x=b} f(x)\, dx$ is the area under the curve.
Function of several variables: $y = f(x_1, x_2, \ldots, x_p)$
Partial derivatives $\frac{\partial f}{\partial x_i}$: differentiate with respect to $x_i$, pretending all other variables remain constant.
Gradient vector: $\nabla f = \left[ \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_p} \right]^T$
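Partial derivatives and the gradient vector can be sanity-checked with central finite differences. A sketch on an illustrative function (not from the lecture):

```python
import numpy as np

# f(x1, x2) = x1^2 + 3*x1*x2, with analytic partials
# df/dx1 = 2*x1 + 3*x2 and df/dx2 = 3*x1.
def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numerical_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference in x_i
    return g

x = np.array([1.0, 2.0])
print(grad_f(x), numerical_grad(f, x))  # both close to [8., 3.]
```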
Constrained optimization:
$$\min f(x) \quad \text{subject to } g_i(x) \geq b_i, \; i = 1, 2, \ldots, m$$
Gradient descent:
$$x^{(k+1)} = x^{(k)} - \eta\, \nabla f$$
Newton's method:
$$x^{(k+1)} = x^{(k)} - H^{-1}\, \nabla f$$
Lagrangian:
$$F(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \left[ b_i - g_i(x) \right]$$
We will use various optimization algorithms in this module (later in the coursework).
Advanced Homework: Search for "CVX: Disciplined Convex Programming" and have a rough read.
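A sketch of the two update rules on a simple quadratic (the matrix and step size are illustrative); for a quadratic the Hessian is constant and one Newton step lands on the minimizer:

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x; the minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
H = A  # Hessian of the quadratic

# Gradient descent: x <- x - eta * grad(x)
x = np.zeros(2)
eta = 0.1
for _ in range(200):
    x = x - eta * grad(x)

# Newton's method: x <- x - H^{-1} grad(x); exact in one step here.
x0 = np.zeros(2)
x_newton = x0 - np.linalg.solve(H, grad(x0))

print(x, x_newton, np.linalg.solve(A, b))
```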
Discrete probabilities $P[X]$; continuous densities $p(x)$
Joint $P[X, Y]$; marginal $P[X]$; conditional $P[X \mid Y]$
Bayes' rule:
$$P[Y \mid X] = \frac{P[X \mid Y]\, P[Y]}{P[X]}$$
Marginalization:
$$P[X] = \sum_{Y} P[X \mid Y]\, P[Y]$$
Product rule: $P[X, Y] = P[X \mid Y]\, P[Y]$, so
$$P[X] = \sum_{Y} P[X, Y] = \sum_{Y} P[X \mid Y]\, P[Y]$$
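A quick sanity check of Bayes' rule and marginalization with made-up numbers for a binary $Y$:

```python
# Prior P[Y] and likelihood P[X=1 | Y] (illustrative values).
P_Y = {0: 0.7, 1: 0.3}
P_X1_given_Y = {0: 0.1, 1: 0.8}

# Marginal: P[X=1] = sum over Y of P[X=1|Y] P[Y]
P_X1 = sum(P_X1_given_Y[y] * P_Y[y] for y in P_Y)

# Posterior by Bayes' rule: P[Y=1 | X=1] = P[X=1|Y=1] P[Y=1] / P[X=1]
P_Y1_given_X1 = P_X1_given_Y[1] * P_Y[1] / P_X1
print(P_X1, P_Y1_given_X1)
```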
Univariate Gaussian:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left\{ -\frac{1}{2} \frac{(x - m)^2}{\sigma^2} \right\}$$
Multivariate Gaussian:
Mean $m$ is a vector.
Covariance matrix $C$: symmetric, positive semi-definite!
Homework: Draw sketches for different values of $m$ and $C$.
$$x \sim N(m, C), \; y = Ax \;\Rightarrow\; y \sim N(Am,\, A C A^T)$$
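The linear-transformation property can be verified empirically by sampling; a sketch with illustrative values of $m$, $C$ and $A$:

```python
import numpy as np

# Sample x ~ N(m, C), apply y = A x, and check that the sample mean and
# covariance of y match A m and A C A^T.
rng = np.random.default_rng(1)
m = np.array([1.0, -1.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])

x = rng.multivariate_normal(m, C, size=200_000)
y = x @ A.T

print(y.mean(axis=0), A @ m)                 # sample mean vs A m
print(np.cov(y, rowvar=False), A @ C @ A.T)  # sample covariance vs A C A^T
```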
Estimation
Univariate mean: $\hat{m} = \frac{1}{N} \sum_{n=1}^{N} x_n$
Univariate covariance: $\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{m})^2$
Multivariate mean: $\hat{m} = \frac{1}{N} \sum_{n=1}^{N} x_n$
Covariance matrix: $\hat{C} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{m})(x_n - \hat{m})^T$
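These estimators can be written out explicitly and compared against numpy's built-ins; a sketch on a tiny made-up data set:

```python
import numpy as np

# N=4 points in p=2 dimensions (illustrative data).
X = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [0.0, 2.0]])
N = X.shape[0]

m_hat = X.sum(axis=0) / N                                        # (1/N) sum x_n
C_hat = sum(np.outer(x - m_hat, x - m_hat) for x in X) / N       # (1/N) sum outer products

assert np.allclose(m_hat, X.mean(axis=0))
assert np.allclose(C_hat, np.cov(X, rowvar=False, bias=True))    # bias=True gives the 1/N form
print(m_hat)
print(C_hat)
```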
Classifying based on $P[\omega_j \mid x]$
Optimal classifier for simple distributions
Linear classifier: when is it optimal?
Distance based classifiers
Nearest Neighbour classifier
Mahalanobis distance
Linear discriminant analysis
Fisher LDA
Classifier Performance
Receiver Operating Characteristics (ROC) Curve
Perceptron learning rule and convergence
Compare posteriors:
$$P[\omega_1 \mid x] \gtrless P[\omega_2 \mid x]$$
$$p(x \mid \omega_1)\, P[\omega_1] \gtrless p(x \mid \omega_2)\, P[\omega_2]$$
With Gaussian class conditionals of equal covariance $C$:
$$\frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left\{ -\frac{1}{2} (x - m_1)^t C^{-1} (x - m_1) \right\} P[\omega_1] \gtrless \frac{1}{(2\pi)^{p/2} (\det C)^{1/2}} \exp\left\{ -\frac{1}{2} (x - m_2)^t C^{-1} (x - m_2) \right\} P[\omega_2]$$
For equal priors and identity covariance this reduces to comparing distances:
$$(x - m_1)^t (x - m_1) \gtrless (x - m_2)^t (x - m_2), \quad \text{i.e.} \quad |x - m_1| \gtrless |x - m_2|$$
Linear classifier:
$$w^t x + b \gtrless 0$$
Expand dimensions: $a = [w^t \; b]^t$ and $y = [x^t \; 1]^t$, so the rule becomes
$$a^t y \gtrless 0$$

random guess of the weights
repeat
    select data at random
    if not correctly classified
        update weights
until (all data correctly classified)

Update: $a \leftarrow a + \eta\, y$ for a misclassified pattern $y$ (with class-two patterns sign-flipped, so that $a^t y > 0$ is required for all data).
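The pseudocode above can be sketched directly in numpy on a toy separable problem (the data and learning rate are illustrative):

```python
import numpy as np

# Toy two-class data with labels t in {-1, +1}.
rng = np.random.default_rng(0)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

# Augment x to y = [x, 1] and sign-normalize, so every pattern should
# satisfy a^t y > 0 for a correct classifier.
Y = np.hstack([X, np.ones((len(X), 1))]) * t[:, None]

a = rng.standard_normal(3)  # random guess of the weights
eta = 1.0

while True:
    misclassified = [y for y in Y if a @ y <= 0]
    if not misclassified:  # all data correctly classified
        break
    y = misclassified[rng.integers(len(misclassified))]  # pick one at random
    a = a + eta * y  # perceptron update

print(all(a @ y > 0 for y in Y))  # True
```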
1 Some plotting
2 Bayes optimal class boundary
3 Implement your own perceptron algorithm
Bayes classifier:
$$P[\omega_1 \mid x] = \frac{p(x \mid \omega_1)\, P[\omega_1]}{p(x \mid \omega_1)\, P[\omega_1] + p(x \mid \omega_2)\, P[\omega_2]}$$
Restrictive assumptions:
Gaussian class conditionals: $p(x \mid \omega_j) = N(m_j, C_j)$
Equal covariance matrices: $C_1 = C_2 = C$
Substitute, divide through by the numerator term, and cancel common terms to get
$$P[\omega_1 \mid x] = \frac{1}{1 + \exp\{ -(w^t x + w_0) \}}$$
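The algebra can be checked numerically. In this sketch (means, covariance and priors are made up), `w` and `w0` are the standard expressions that result from the substitution described above:

```python
import numpy as np

m1, m2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
C = np.array([[1.0, 0.3], [0.3, 1.0]])
P1, P2 = 0.6, 0.4
Ci = np.linalg.inv(C)

def gauss(x, m):
    # Bivariate Gaussian density with covariance C.
    d = x - m
    return np.exp(-0.5 * d @ Ci @ d) / (2 * np.pi * np.sqrt(np.linalg.det(C)))

# Parameters of the resulting sigmoid posterior.
w = Ci @ (m1 - m2)
w0 = -0.5 * (m1 @ Ci @ m1 - m2 @ Ci @ m2) + np.log(P1 / P2)

x = np.array([0.5, -1.2])
post_bayes = gauss(x, m1) * P1 / (gauss(x, m1) * P1 + gauss(x, m2) * P2)
post_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + w0)))
print(post_bayes, post_sigmoid)  # identical up to rounding
```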
$$J(w) = \frac{w^t C_B w}{w^t C_W w}$$
$$\nabla_w J = \frac{2\, C_B w\, (w^t C_W w) - 2\, C_W w\, (w^t C_B w)}{(w^t C_W w)^2}$$
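Setting this gradient to zero gives the well-known Fisher direction $w^* \propto C_W^{-1}(m_1 - m_2)$ in the two-class case; a sketch (scatter matrices are illustrative) confirming the gradient vanishes there:

```python
import numpy as np

m1, m2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])
CW = np.array([[1.0, 0.2], [0.2, 2.0]])
CB = np.outer(m1 - m2, m1 - m2)  # between-class scatter (two-class case)

def grad_J(w):
    # Quotient-rule gradient of J(w) = (w^t CB w) / (w^t CW w).
    num, den = w @ CB @ w, w @ CW @ w
    return (2 * CB @ w * den - 2 * CW @ w * num) / den**2

w_star = np.linalg.solve(CW, m1 - m2)  # Fisher direction
print(grad_J(w_star))  # ~ [0, 0]
```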
Data: $\{x_n, f_n\}_{n=1}^{N}$
Input: $x_n \in \mathbb{R}^p$; target/output $f_n$ real valued
Model: $f = w^t x + w_0$
Output is a linear function of the input (including a constant $w_0$)
Work in $(p+1)$-dimensional space to avoid treating $w_0$ separately:
$$y = \begin{bmatrix} x \\ 1 \end{bmatrix}, \qquad a = \begin{bmatrix} w \\ w_0 \end{bmatrix}$$
Data: $\{y_n, f_n\}_{n=1}^{N}$
Model: $f = y^t a$
$p + 1$ unknowns held in vector $a$
$$E = \sum_{n=1}^{N} \left\{ y_n^t a - f_n \right\}^2 = \sum_{n=1}^{N} \left\{ \sum_{j=1}^{p+1} a_j\, y_{nj} - f_n \right\}^2$$
Error: $E = \sum_{n=1}^{N} e_n^2$
True gradient:
$$\nabla_a E = 2 \sum_{n=1}^{N} \left( y_n^t a - f_n \right) y_n$$
Gradient of a single term (for stochastic updates):
$$\nabla_a e_n^2 = 2 \left( y_n^t a - f_n \right) y_n$$
Regularized (pseudo-inverse) solution:
$$a = \left( Y^t Y + \lambda I \right)^{-1} Y^t f$$
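A sketch of this closed-form solution on synthetic data (the true weights, noise level and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
X = rng.standard_normal((N, p))
Y = np.hstack([X, np.ones((N, 1))])        # augmented inputs y_n = [x_n, 1]
a_true = np.array([1.0, -2.0, 0.5, 0.3])   # last entry plays the role of w0
f = Y @ a_true + 0.01 * rng.standard_normal(N)

# a = (Y^T Y + lam * I)^{-1} Y^T f, via a linear solve for stability.
lam = 1e-3
a_hat = np.linalg.solve(Y.T @ Y + lam * np.eye(p + 1), Y.T @ f)
print(a_hat)  # close to a_true
```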
Perceptron criterion (summing over the set $\mathcal{M}$ of misclassified patterns):
$$E_P = \sum_{n \in \mathcal{M}} -a^t y_n$$
Gradient:
$$\frac{\partial E_P}{\partial a} = -\sum_{n \in \mathcal{M}} y_n$$
Gradient algorithm:
$$a^{(k+1)} = a^{(k)} + \eta \sum_{n \in \mathcal{M}} y_n$$
Stochastic gradient algorithm:
$$a^{(k+1)} = a^{(k)} + \eta\, y_n$$
Note what $y_n$ is: it is an item of data that is taken at random and happens to be misclassified by the current value of $a$ at iteration $k$.
Taking magnitudes:
$$\|a^{(k+1)} - \hat{a}\|^2 = \|a^{(k)} - \hat{a}\|^2 + 2\eta\, (a^{(k)} - \hat{a})^t y(k) + \eta^2 \|y(k)\|^2$$
If we drop the negative term $a^{(k)t} y(k)$ from the right-hand side, the equality becomes an inequality:
$$\|a^{(k+1)} - \hat{a}\|^2 < \|a^{(k)} - \hat{a}\|^2 - 2\eta\, \hat{a}^t y(k) + \eta^2 \|y(k)\|^2$$
Perceptron
Convergence of the learning rule (contd)
Of the three terms on the right-hand side, we know $\hat{a}^t y(k) > 0$, because $\hat{a}$ is assumed to be a solution.
If we select
$$\beta^2 = \max_i \|y_i\|^2, \qquad \gamma = \min_i \hat{a}^t y_i,$$
i.e. the largest of the positive term and the smallest of the negative term, then for $\eta = \gamma / \beta^2$,
$$\|a^{(k+1)} - \hat{a}\|^2 < \|a^{(k)} - \hat{a}\|^2 - \gamma^2 / \beta^2$$
(Note the inequality remains true when the right-hand side is replaced by a quantity larger than what it previously was.)
Every correction takes the guess closer to a true solution.
From an initialization $a^{(1)}$, we will find a solution in at most
$$k_0 = \frac{\|a^{(1)} - \hat{a}\|^2\, \beta^2}{\gamma^2}$$
updates.
Summary
Linear regression
Solution as pseudo inverse
Solution by gradient descent
Regularization
Perceptron
Setting up a suitable error function
Convergence of the algorithm