
Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA and College of Engineering and Computer Science
The Australian National University

Canberra, February – June 2013

© 2013 Christfried Webers, NICTA, The Australian National University

Overview

Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Neural Networks 1
Neural Networks 2
Kernel Methods
Sparse Kernel Methods
Graphical Models 1
Graphical Models 2
Graphical Models 3
Mixture Models and EM 1
Mixture Models and EM 2
Approximate Inference
Sampling
Principal Component Analysis
Sequential Data 1
Sequential Data 2
Combining Models
Selected Topics
Discussion and Summary

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VII

Linear Classification 1

Topics: Classification; Generalised Linear Model; Inference and Decision; Discriminant Functions; Fisher's Linear Discriminant; The Perceptron Algorithm.

Classification

Goal: given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K.
Divide the input space into different regions.

Figure: length of the petal [in cm] for a given sepal [cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).

How to represent binary class labels?

Class labels are no longer real values as in regression, but a discrete set.
Two classes: t ∈ {0, 1}
(t = 1 represents class C_1 and t = 0 represents class C_2).
We can interpret the value of t as the probability of class C_1, with only two values possible for the probability, 0 or 1.
Note: other conventions for mapping classes to integers are possible; check the setup.

How to represent multi-class labels?

If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for t_j = 1, where j is the index of the class C_j being encoded.
Example: given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector t = (0, 1, 0, 0, 0)^T.
Note: other conventions for mapping multiple classes to integers are possible; check the setup.
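A minimal sketch of the 1-of-K coding in code; NumPy and the helper name one_hot are illustrative choices, not part of the lecture:

```python
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Membership in class C_2 (index 1) among K = 5 classes:
print(one_hot([1], K=5))          # [[0. 1. 0. 0. 0.]]
```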

Linear Model

Idea: use again a linear model as in regression, where y(x, w) is a linear function of the parameters w:
y(x_n, w) = w^T φ(x_n)
But generally y(x_n, w) ∈ ℝ.
Example: which class is y(x, w) = 0.71623?

Generalised Linear Model

Apply a mapping f : ℝ → Z to the linear model to get the discrete class labels.
Generalised linear model:
y(x_n, w) = f(w^T φ(x_n))
Activation function: f(·)
Link function: f^{-1}(·)

Figure: example of an activation function, f(z) = sign(z).
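A sketch of such a generalised linear model with sign as the activation; the basis function phi and the function names are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def phi(x):
    """Illustrative fixed basis: prepend a bias feature phi_0(x) = 1."""
    return np.concatenate(([1.0], x))

def glm_predict(w, x):
    """Generalised linear model y(x, w) = f(w^T phi(x)) with f = sign."""
    a = w @ phi(x)
    return 1.0 if a >= 0 else -1.0    # avoids np.sign's zero output at a = 0

w = np.array([-0.5, 1.0, 2.0])        # (w_0, w_1, w_2)
print(glm_predict(w, np.array([0.3, 0.4])))   # 1.0
```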

Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2 Also, infer the prior class probabilities p(C_k).
3 Use Bayes' theorem to find the posterior p(C_k | x).
4 Alternatively, model the joint distribution p(x, C_k) directly.
5 Use decision theory to assign each new x to one of the classes.

Two Classes

Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by C_k.

Consider first two classes (K = 2).
Construct a linear function of the inputs x,
y(x) = w^T x + w_0,
such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
weight vector w
bias w_0 (the negative of the bias, −w_0, is sometimes called a threshold)

Two Classes

The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in the D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: assume x_A and x_B are two points lying in the decision surface. Then
0 = y(x_A) − y(x_B) = w^T (x_A − x_B).

Two Classes

The normal distance from the origin to the decision surface is
w^T x / ‖w‖ = −w_0 / ‖w‖.

Figure: geometry of the linear discriminant in two dimensions; the decision surface y = 0 separates the regions R_1 (y > 0) and R_2 (y < 0), w is normal to the surface, the offset of the surface from the origin is −w_0/‖w‖, and y(x)/‖w‖ is the signed distance of a point x from the surface.

Two Classes

The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface, r = y(x)/‖w‖.
Write x = x_⊥ + r w/‖w‖, where x_⊥ is the orthogonal projection of x onto the decision surface. Then
y(x) = w^T (x_⊥ + r w/‖w‖) + w_0 = (w^T x_⊥ + w_0) + r w^T w/‖w‖ = r ‖w‖,
since w^T x_⊥ + w_0 = y(x_⊥) = 0.

Figure: geometry of the linear discriminant in two dimensions (regions R_1, R_2; weight vector w; origin offset −w_0/‖w‖; signed distance y(x)/‖w‖).
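A small numerical check of r = y(x)/‖w‖, with values chosen purely for illustration:

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0   # y(x) = w^T x + w_0,  ||w|| = 5
x = np.array([2.0, 1.0])

y = w @ x + w0                        # 3*2 + 4*1 - 5 = 5
r = y / np.linalg.norm(w)             # signed distance r = y(x)/||w|| = 1
print(y, r)                           # 5.0 1.0
```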

Two Classes

More compact notation: add an extra dimension to the input space and set its value to x_0 = 1.
Also define w̃ = (w_0, w) and x̃ = (1, x), so that
y(x) = w̃^T x̃.
The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.
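The augmented notation in code, continuing the illustrative values used above (names w_tilde and x_tilde are not from the lecture):

```python
import numpy as np

w0, w = -5.0, np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

w_tilde = np.concatenate(([w0], w))   # w~ = (w_0, w)
x_tilde = np.concatenate(([1.0], x))  # x~ = (1, x)

print(w_tilde @ x_tilde, w @ x + w0)  # both 5.0
```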

Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?

Figure: K − 1 one-versus-the-rest classifiers, separating C_1 from "not C_1" and C_2 from "not C_2", with regions R_1, R_2 and an ambiguous region R_3.

Multi-Class

Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?

Figure: K(K − 1)/2 one-versus-one classifiers for three classes C_1, C_2, C_3, with regions R_1, R_2, R_3 and an ambiguous region remaining.

Multi-Class

Number of classes K > 2.
Solution: use K linear functions
y_k(x) = w_k^T x + w_{k0}.
Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between class C_k and class C_j is given by y_k(x) = y_j(x).

Figure: decision regions R_i, R_j, R_k of the multi-class linear discriminant, with two points x_A, x_B and a point x̂ on the line between them.
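A sketch of classification with K linear functions and an argmax; the random weights and helper name are illustrative:

```python
import numpy as np

def predict_class(W, w0, x):
    """y_k(x) = w_k^T x + w_k0; assign x to the class with the largest y_k."""
    y = W @ x + w0                    # vector of K discriminant values
    return int(np.argmax(y))

rng = np.random.default_rng(0)
K, D = 3, 2
W = rng.normal(size=(K, D))           # row k holds w_k
w0 = rng.normal(size=K)               # biases w_k0
print(predict_class(W, w0, np.array([0.5, -1.0])))
```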

Least Squares for Classification

Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model
y_k(x) = w_k^T x + w_{k0},    k = 1, ..., K.

Least Squares for Classification

With the conventions
w̃_k = (w_{k0}, w_k^T)^T ∈ ℝ^{D+1}
x̃ = (1, x^T)^T ∈ ℝ^{D+1}
W̃ = (w̃_1 ... w̃_K) ∈ ℝ^{(D+1)×K}
we get for the (vector-valued) discriminant function
y(x) = W̃^T x̃ ∈ ℝ^K.
For a new input x, the class is then defined by the index of the largest value in the vector y(x).

Determine W̃

Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class label in the 1-of-K coding scheme.
Define a matrix T whose n-th row is t_n^T.
The sum-of-squares error can now be written as
E_D(W̃) = ½ tr{ (X̃W̃ − T)^T (X̃W̃ − T) }.
The minimum of E_D(W̃) is reached for
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T,
where X̃^† is the pseudo-inverse of X̃.
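A minimal sketch of the least-squares solution W̃ = X̃^† T and the resulting classifier; the synthetic two-class data and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic two-class data (K = 2, D = 2): one Gaussian blob per class.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)

T = np.eye(2)[labels]                          # 1-of-K target matrix, shape (N, K)
X_tilde = np.hstack([np.ones((100, 1)), X])    # augmented inputs, x_0 = 1

W_tilde = np.linalg.pinv(X_tilde) @ T          # W~ = X~^dagger T, shape (D+1, K)

Y = X_tilde @ W_tilde                          # y(x_n)^T for every n, shape (N, K)
pred = Y.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
print("components of y(x) sum to ~1:", Y.sum(axis=1)[:3])
```

The last printed line illustrates the property discussed on the next slide: under 1-of-K coding the components of y(x) sum to one, yet they are not probabilities.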

Discriminant Function for Multi-Class

The discriminant function y(x) is therefore
y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃,
where X̃ is given by the training data, and x̃ is the new input.
Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) will also obey the same constraint,
a^T y(x) + b = 0.
For the 1-of-K coding scheme, the sum of all components in t_n is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

Deficiencies of the Least Squares Approach

Magenta curve: decision boundary for the least squares approach (green curve: decision boundary for the logistic regression model described later).

Figures: two two-class data sets with the least-squares (magenta) and logistic-regression (green) decision boundaries.

Fisher's Linear Discriminant

View linear classification as dimensionality reduction:
y(x) = w^T x
If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?

Fisher's Linear Discriminant

Samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.

Fisher's Linear Discriminant - First Try

Given N_1 input data of class C_1 and N_2 input data of class C_2, calculate the centres of the two classes,
m_1 = (1/N_1) Σ_{n∈C_1} x_n,    m_2 = (1/N_2) Σ_{n∈C_2} x_n.
Choose w so as to maximise the projection of the class means onto w,
m_2 − m_1 = w^T (m_2 − m_1).
Problem with non-uniform covariance: the projected classes can still overlap considerably.

Fisher's Linear Discriminant

Measure also the within-class variance for each class,
s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2,
where y_n = w^T x_n.
Maximise the Fisher criterion
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2).

Fisher's Linear Discriminant

The Fisher criterion can be rewritten as
J(w) = (w^T S_B w) / (w^T S_W w)
S_B is the between-class covariance
S_B = (m_2 − m_1)(m_2 − m_1)^T
S_W is the within-class covariance
S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

Fisher's Linear Discriminant

The Fisher criterion
J(w) = (w^T S_B w) / (w^T S_W w)
has a maximum for Fisher's linear discriminant
w ∝ S_W^{-1} (m_2 − m_1).
Fisher's linear discriminant is NOT a discriminant, but can be used to construct one by choosing a threshold y_0 in the projection space.
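A sketch of Fisher's direction w ∝ S_W^{-1}(m_2 − m_1) on synthetic data; the data, and the threshold choice at the end (midpoint of the projected class means), are illustrative assumptions rather than part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
cov = [[2.0, 1.5], [1.5, 2.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)   # class C_1
X2 = rng.multivariate_normal([3.0, 1.0], cov, size=100)   # class C_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance

w = np.linalg.solve(S_W, m2 - m1)          # w proportional to S_W^{-1} (m_2 - m_1)

# Construct a discriminant by thresholding the projection at y_0.
y0 = 0.5 * (w @ m1 + w @ m2)
assigned_to_C2 = np.vstack([X1, X2]) @ w > y0
print("misclassified:", assigned_to_C2[:100].sum() + (~assigned_to_C2[100:]).sum())
```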

Fisher's Discriminant For Multi-Class

Assume that the dimensionality of the input space D is greater than the number of classes K.
Use D′ > 1 linear features y_k = w_k^T x and write everything in vector form (no bias involved!):
y = W^T x.
The within-class covariance is then the sum of the covariances for all K classes,
S_W = Σ_{k=1}^{K} S_k
where
S_k = Σ_{n∈C_k} (x_n − m_k)(x_n − m_k)^T,    m_k = (1/N_k) Σ_{n∈C_k} x_n.

Fisher's Discriminant For Multi-Class

Between-class covariance
S_B = Σ_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,
where m is the total mean of the input data,
m = (1/N) Σ_{n=1}^{N} x_n.
One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is given by
J(W) = tr{ (W^T S_W W)^{-1} (W^T S_B W) }.
The maximum of J(W) is determined by the D′ eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
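A sketch of the multi-class case: form S_W and S_B as above and take the D′ leading eigenvectors of S_W^{-1} S_B. The data, D′ = 2, and the use of a plain (non-symmetric) eigensolver are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
K, D, D_prime = 3, 4, 2
means = [[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]]
classes = [rng.normal(mu, 1.0, size=(60, D)) for mu in means]

m = np.vstack(classes).mean(axis=0)                       # total mean
S_W = sum((Xk - Xk.mean(0)).T @ (Xk - Xk.mean(0)) for Xk in classes)
S_B = sum(len(Xk) * np.outer(Xk.mean(0) - m, Xk.mean(0) - m) for Xk in classes)

# D' leading eigenvectors of S_W^{-1} S_B (rank of S_B is at most K - 1 = 2).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:D_prime]].real                      # columns span the projection

Y = np.vstack(classes) @ W                                # projected features y = W^T x
print(W.shape, Y.shape)                                   # (4, 2) (180, 2)
```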

Fisher's Discriminant For Multi-Class

How many linear features can one find with this method?
S_B has rank at most K − 1, because it is a sum of K rank-one matrices which are linked by the global constraint via the total mean m.
Therefore the projection onto the subspace spanned by S_B cannot yield more than K − 1 linear features.

The Perceptron Algorithm

Frank Rosenblatt (1928 - 1971)
"Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962)

The Perceptron Algorithm

The Perceptron ("MARK 1") was the first computer which could learn new skills by trial and error.

The Perceptron Algorithm

Two-class model.
Create a feature vector φ(x) by a fixed nonlinear transformation of the input x.
Generalised linear model
y(x) = f(w^T φ(x))
with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function
f(a) = +1 if a ≥ 0, and −1 if a < 0.
Target coding for the perceptron:
t = +1 if C_1, and −1 if C_2.

The Perceptron Algorithm - Error Function

Idea: minimise the total number of misclassified patterns.
Problem: as a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
Better idea: using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M,
E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n.

Perceptron - Stochastic Gradient Descent

Perceptron criterion (with notation φ_n = φ(x_n)):
E_P(w) = − Σ_{n∈M} w^T φ_n t_n
One iteration at step τ:
1 Choose a training pair (x_n, t_n).
2 Update the weight vector w by
w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n.
As y(x, w) does not depend on the norm of w, one can set η = 1:
w^(τ+1) = w^(τ) + φ_n t_n.
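A sketch of the perceptron updates with η = 1; the identity-plus-bias basis φ(x) = (1, x), the synthetic data, and the stopping rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),     # class C_2
               rng.normal([3.0, 3.0], 0.5, size=(50, 2))])    # class C_1
t = np.repeat([-1.0, 1.0], 50)            # perceptron target coding

Phi = np.hstack([np.ones((100, 1)), X])   # phi(x) = (1, x), so phi_0(x) = 1
w = np.zeros(3)

for epoch in range(100):
    updates = 0
    for phi_n, t_n in zip(Phi, t):
        if w @ phi_n * t_n <= 0:          # misclassified: w^T phi_n t_n not > 0
            w = w + phi_n * t_n           # w^(tau+1) = w^(tau) + phi_n t_n
            updates += 1
    if updates == 0:                      # a full pass with no errors: done
        break

print("epochs:", epoch + 1, "w:", w)
```

On linearly separable data like this the loop terminates after a finite number of passes, in line with the convergence theorem stated at the end of this part.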

The Perceptron Algorithm - Update 1

Update of the perceptron weights from a misclassified pattern (green):
w^(τ+1) = w^(τ) + φ_n t_n

The Perceptron Algorithm - Update 2

Update of the perceptron weights from a misclassified pattern (green):
w^(τ+1) = w^(τ) + φ_n t_n

The Perceptron Algorithm - Convergence

Does the algorithm converge?
For a single update step,
−w^(τ+1)T φ_n t_n = −w^(τ)T φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^(τ)T φ_n t_n,
because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖^2 > 0, so the error contribution from the pattern used in the update is reduced.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.
