
Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA and College of Engineering and Computer Science
The Australian National University

Canberra, February – June 2013

© 2013 Christfried Webers, NICTA, The Australian National University

Overview

Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Neural Networks 1
Neural Networks 2
Kernel Methods
Sparse Kernel Methods
Graphical Models 1
Graphical Models 2
Graphical Models 3
Mixture Models and EM 1
Mixture Models and EM 2
Approximate Inference
Sampling
Principal Component Analysis
Sequential Data 1
Sequential Data 2
Combining Models
Selected Topics
Discussion and Summary

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VII

Linear Classification 1

Topics: Classification; Generalised Linear Model; Inference and Decision; Discriminant Functions; Fisher's Linear Discriminant; The Perceptron Algorithm.

Classification

Goal: given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K.
Divide the input space into different regions.

Figure: length of the petal [in cm] for a given sepal [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).

How to represent binary class labels?

Class labels are no longer real values as in regression, but a discrete set.
Two classes: t ∈ {0, 1}
(t = 1 represents class C_1 and t = 0 represents class C_2).
We can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
Note: other conventions for mapping classes to integers are possible; check the setup.

How to represent multi-class labels?

If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for t_j = 1, where j is the index of the class C_j being encoded.
Example: given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector t = (0, 1, 0, 0, 0)^T.
Note: other conventions for mapping multiple classes to integers are possible; check the setup.
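A minimal sketch of the 1-of-K coding in code; NumPy and the helper name one_hot are illustrative choices, not part of the lecture:

```python
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Membership in class C_2 (index 1) among K = 5 classes:
print(one_hot([1], K=5))          # [[0. 1. 0. 0. 0.]]
```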

Linear Model

Idea: use again a linear model as in regression, where y(x, w) is a linear function of the parameters w:
y(x_n, w) = w^T φ(x_n)
But generally y(x_n, w) ∈ ℝ.
Example: which class is y(x, w) = 0.71623?

Generalised Linear Model

Apply a mapping f : ℝ → Z to the linear model to get the discrete class labels.
Generalised linear model:
y(x_n, w) = f(w^T φ(x_n))
Activation function: f(·)
Link function: f^{-1}(·)

Figure: example of an activation function, f(z) = sign(z).
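A sketch of such a generalised linear model with sign as the activation; the basis function phi and the function names are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def phi(x):
    """Illustrative fixed basis: prepend a bias feature phi_0(x) = 1."""
    return np.concatenate(([1.0], x))

def glm_predict(w, x):
    """Generalised linear model y(x, w) = f(w^T phi(x)) with f = sign."""
    a = w @ phi(x)
    return 1.0 if a >= 0 else -1.0    # avoids np.sign's zero output at a = 0

w = np.array([-0.5, 1.0, 2.0])        # (w_0, w_1, w_2)
print(glm_predict(w, np.array([0.3, 0.4])))   # 1.0
```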

Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2 Also, infer the prior class probabilities p(C_k).
3 Use Bayes' theorem to find the posterior p(C_k | x).
4 Alternatively, model the joint distribution p(x, C_k) directly.
5 Use decision theory to assign each new x to one of the classes.

Two Classes

Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by C_k.

Consider first two classes (K = 2).
Construct a linear function of the inputs x,
y(x) = w^T x + w_0,
such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
weight vector w
bias w_0 (the negative of the bias, −w_0, is sometimes called a threshold)

Two Classes

The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in the D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: assume x_A and x_B are two points lying in the decision surface. Then
0 = y(x_A) − y(x_B) = w^T (x_A − x_B).

Two Classes

The normal distance from the origin to the decision surface is
w^T x / ‖w‖ = −w_0 / ‖w‖.

Figure: geometry of the linear discriminant in two dimensions; the decision surface y = 0 separates the regions R_1 (y > 0) and R_2 (y < 0), w is normal to the surface, the offset of the surface from the origin is −w_0/‖w‖, and y(x)/‖w‖ is the signed distance of a point x from the surface.

Two Classes

The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface, r = y(x)/‖w‖.
Write x = x_⊥ + r w/‖w‖, where x_⊥ is the orthogonal projection of x onto the decision surface. Then
y(x) = w^T (x_⊥ + r w/‖w‖) + w_0 = (w^T x_⊥ + w_0) + r w^T w/‖w‖ = r ‖w‖,
since w^T x_⊥ + w_0 = y(x_⊥) = 0.

Figure: geometry of the linear discriminant in two dimensions (regions R_1, R_2; weight vector w; origin offset −w_0/‖w‖; signed distance y(x)/‖w‖).
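A small numerical check of r = y(x)/‖w‖, with values chosen purely for illustration:

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0   # y(x) = w^T x + w_0,  ||w|| = 5
x = np.array([2.0, 1.0])

y = w @ x + w0                        # 3*2 + 4*1 - 5 = 5
r = y / np.linalg.norm(w)             # signed distance r = y(x)/||w|| = 1
print(y, r)                           # 5.0 1.0
```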

Two Classes

More compact notation: add an extra dimension to the input space and set its value to x_0 = 1.
Also define w̃ = (w_0, w) and x̃ = (1, x), so that
y(x) = w̃^T x̃.
The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.
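The augmented notation in code, continuing the illustrative values used above (names w_tilde and x_tilde are not from the lecture):

```python
import numpy as np

w0, w = -5.0, np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

w_tilde = np.concatenate(([w0], w))   # w~ = (w_0, w)
x_tilde = np.concatenate(([1.0], x))  # x~ = (1, x)

print(w_tilde @ x_tilde, w @ x + w0)  # both 5.0
```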

Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?

Figure: K − 1 one-versus-the-rest classifiers, separating C_1 from "not C_1" and C_2 from "not C_2", with regions R_1, R_2 and an ambiguous region R_3.

Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?

Figure: K(K − 1)/2 one-versus-one classifiers for three classes C_1, C_2, C_3, with regions R_1, R_2, R_3 and an ambiguous region remaining.

Multi-Class

Number of classes K > 2.
Solution: use K linear functions
y_k(x) = w_k^T x + w_{k0}.
Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between class C_k and class C_j is given by y_k(x) = y_j(x).

Figure: decision regions R_i, R_j, R_k of the multi-class linear discriminant, with two points x_A, x_B and a point x̂ on the line between them.
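A sketch of classification with K linear functions and an argmax; the random weights and helper name are illustrative:

```python
import numpy as np

def predict_class(W, w0, x):
    """y_k(x) = w_k^T x + w_k0; assign x to the class with the largest y_k."""
    y = W @ x + w0                    # vector of K discriminant values
    return int(np.argmax(y))

rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(K, D))           # row k holds w_k
w0 = rng.normal(size=K)               # biases w_k0
print(predict_class(W, w0, np.array([0.5, -1.0])))
```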

Least Squares for Classification

Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model
y_k(x) = w_k^T x + w_{k0},    k = 1, ..., K.

Least Squares for Classification

With the conventions
w̃_k = (w_{k0}, w_k^T)^T ∈ ℝ^{D+1}
x̃ = (1, x^T)^T ∈ ℝ^{D+1}
W̃ = (w̃_1 ... w̃_K) ∈ ℝ^{(D+1)×K}
we get for the (vector-valued) discriminant function
y(x) = W̃^T x̃ ∈ ℝ^K.
For a new input x, the class is then defined by the index of the largest value in the vector y(x).

Determine W̃

Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class label in the 1-of-K coding scheme.
Define a matrix T whose n-th row is t_n^T.
The sum-of-squares error can now be written as
E_D(W̃) = ½ tr{ (X̃W̃ − T)^T (X̃W̃ − T) }.
The minimum of E_D(W̃) is reached for
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T,
where X̃^† is the pseudo-inverse of X̃.
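A minimal sketch of the least-squares solution W̃ = X̃^† T and the resulting classifier; the synthetic two-class data and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic two-class data (K = 2, D = 2): one Gaussian blob per class.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)

T = np.eye(2)[labels]                          # 1-of-K target matrix, shape (N, K)
X_tilde = np.hstack([np.ones((100, 1)), X])    # augmented inputs, x_0 = 1

W_tilde = np.linalg.pinv(X_tilde) @ T          # W~ = X~^dagger T, shape (D+1, K)

Y = X_tilde @ W_tilde                          # y(x_n)^T for every n, shape (N, K)
pred = Y.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
print("components of y(x) sum to ~1:", Y.sum(axis=1)[:3])
```

The last printed line illustrates the property discussed on the next slide: under 1-of-K coding the components of y(x) sum to one, yet they are not probabilities.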

Discriminant Function for Multi-Class

The discriminant function y(x) is therefore
y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃,
where X̃ is given by the training data, and x̃ is the new input.
Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) will also obey the same constraint,
a^T y(x) + b = 0.
For the 1-of-K coding scheme, the sum of all components in t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

Deficiencies of the Least Squares Approach

Magenta curve: decision boundary for the least squares approach (green curve: decision boundary for the logistic regression model described later).

Figures: two two-class data sets with the least-squares (magenta) and logistic-regression (green) decision boundaries.

Fisher's Linear Discriminant

View linear classification as dimensionality reduction:
y(x) = w^T x
If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?

Fisher's Linear Discriminant

Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.

Fisher's Linear Discriminant - First Try

Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate the centres of the two classes,
m_1 = (1/N_1) Σ_{n∈C_1} x_n,    m_2 = (1/N_2) Σ_{n∈C_2} x_n.
Choose w so as to maximise the projection of the class means onto w,
m_2 − m_1 = w^T (m_2 − m_1).
Problem with non-uniform covariance: the projected classes can still overlap considerably.

Fisher's Linear Discriminant

Measure also the within-class variance for each class,
s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2,
where y_n = w^T x_n.
Maximise the Fisher criterion
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2).

Fisher's Linear Discriminant

The Fisher criterion can be rewritten as
J(w) = (w^T S_B w) / (w^T S_W w)
S_B is the between-class covariance
S_B = (m_2 − m_1)(m_2 − m_1)^T
S_W is the within-class covariance
S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Fisher's Linear Discriminant

The Fisher criterion
J(w) = (w^T S_B w) / (w^T S_W w)
has a maximum for Fisher's linear discriminant
w ∝ S_W^{-1} (m_2 − m_1).
Fisher's linear discriminant is NOT a discriminant, but can be used to construct one by choosing a threshold y_0 in the projection space.
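A sketch of Fisher's direction w ∝ S_W^{-1}(m_2 − m_1) on synthetic data; the data, and the threshold choice at the end (midpoint of the projected class means), are illustrative assumptions rather than part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
cov = [[2.0, 1.5], [1.5, 2.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)   # class C_1
X2 = rng.multivariate_normal([3.0, 1.0], cov, size=100)   # class C_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance

w = np.linalg.solve(S_W, m2 - m1)          # w proportional to S_W^{-1} (m_2 - m_1)

# Construct a discriminant by thresholding the projection at y_0.
y0 = 0.5 * (w @ m1 + w @ m2)
assigned_to_C2 = np.vstack([X1, X2]) @ w > y0
print("misclassified:", assigned_to_C2[:100].sum() + (~assigned_to_C2[100:]).sum())
```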

Fisher's Discriminant For Multi-Class

Assume that the dimensionality of the input space D is greater than the number of classes K.
Use D′ > 1 linear features y_k = w_k^T x and write everything in vector form (no bias involved!):
y = W^T x.
The within-class covariance is then the sum of the covariances for all K classes,
S_W = Σ_{k=1}^{K} S_k
where
S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T,    m_k = (1/N_k) Σ_{n∈C_k} x_n.

Fisher's Discriminant For Multi-Class

Between-class covariance
S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,
where m is the total mean of the input data,
m = (1/N) Σ_{n=1}^{N} x_n.
One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is given by
J(W) = tr{ (W^T S_W W)^{-1} (W^T S_B W) }.
The maximum of J(W) is determined by the D′ eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
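A sketch of the multi-class case: form S_W and S_B as above and take the D′ leading eigenvectors of S_W^{-1} S_B. The data, D′ = 2, and the use of a plain (non-symmetric) eigensolver are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
K, D, D_prime = 3, 4, 2
means = [[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]]
classes = [rng.normal(mu, 1.0, size=(60, D)) for mu in means]

m = np.vstack(classes).mean(axis=0)                       # total mean
S_W = sum((Xk - Xk.mean(0)).T @ (Xk - Xk.mean(0)) for Xk in classes)
S_B = sum(len(Xk) * np.outer(Xk.mean(0) - m, Xk.mean(0) - m) for Xk in classes)

# D' leading eigenvectors of S_W^{-1} S_B (rank of S_B is at most K - 1 = 2).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:D_prime]].real                      # columns span the projection

Y = np.vstack(classes) @ W                                # projected features y = W^T x
print(W.shape, Y.shape)                                   # (4, 2) (180, 2)
```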

Fisher's Discriminant For Multi-Class

How many linear features can one find with this method?
S_B has rank at most K − 1, because it is a sum of K rank-one matrices which are linked by the global constraint via the total mean m.
Therefore the projection onto the subspace spanned by S_B cannot yield more than K − 1 linear features.

The Perceptron Algorithm

Frank Rosenblatt (1928 - 1971)
"Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962)

The Perceptron Algorithm

The Perceptron ("MARK 1") was the first computer which could learn new skills by trial and error.

The Perceptron Algorithm

Two-class model.
Create a feature vector φ(x) by a fixed nonlinear transformation of the input x.
Generalised linear model
y(x) = f(w^T φ(x))
with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function
f(a) = +1 if a ≥ 0, and −1 if a < 0.
Target coding for the perceptron:
t = +1 if C_1, and −1 if C_2.

The Perceptron Algorithm - Error Function

Idea: minimise the total number of misclassified patterns.
Problem: as a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
Better idea: using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M,
E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n.

Perceptron - Stochastic Gradient Descent

Perceptron criterion (with notation φ_n = φ(x_n)):
E_P(w) = − Σ_{n∈M} w^T φ_n t_n
One iteration at step τ:
1 Choose a training pair (x_n, t_n).
2 Update the weight vector w by
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n.
As y(x, w) does not depend on the norm of w, one can set η = 1:
w^(τ+1) = w^(τ) + φ_n t_n.
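A sketch of the perceptron updates with η = 1; the identity-plus-bias basis φ(x) = (1, x), the synthetic data, and the stopping rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),     # class C_2
               rng.normal([3.0, 3.0], 0.5, size=(50, 2))])    # class C_1
t = np.repeat([-1.0, 1.0], 50)            # perceptron target coding

Phi = np.hstack([np.ones((100, 1)), X])   # phi(x) = (1, x), so phi_0(x) = 1
w = np.zeros(3)

for epoch in range(100):
    updates = 0
    for phi_n, t_n in zip(Phi, t):
        if w @ phi_n * t_n <= 0:          # misclassified: w^T phi_n t_n not > 0
            w = w + phi_n * t_n           # w^(tau+1) = w^(tau) + phi_n t_n
            updates += 1
    if updates == 0:                      # a full pass with no errors: done
        break

print("epochs:", epoch + 1, "w:", w)
```

On linearly separable data like this the loop terminates after a finite number of passes, in line with the convergence theorem stated at the end of this part.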

The Perceptron Algorithm - Update 1

Update of the perceptron weights from a misclassified pattern (green):
w^(τ+1) = w^(τ) + φ_n t_n

The Perceptron Algorithm - Update 2

Update of the perceptron weights from a misclassified pattern (green):
w^(τ+1) = w^(τ) + φ_n t_n

The Perceptron Algorithm - Convergence

Does the algorithm converge?
For a single update step,
−w^(τ+1)T φ_n t_n = −w^(τ)T φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^(τ)T φ_n t_n,
because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖^2 > 0, so the error contribution from the pattern used in the update is reduced.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.
