
CpE 646 Pattern Recognition and Classification
Prof. Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology
Chapter 5: Linear Discriminant Functions

Chapter 5 (Sections 5.1-5.5, 5.11):

Introduction
Linear Discriminant Functions and Decision Surfaces
Generalized Linear Discriminant Functions
The Two-Category Linearly Separable Case
Minimizing the Perceptron Criterion Function
Support Vector Machines
Introduction

In Chapter 3, the underlying probability densities were known (or given).
The training samples were used to estimate the parameters of these probability densities (ML, MAP estimations).
In this chapter, we only know the proper forms of the discriminant functions; the training samples will be used to estimate the parameters of the discriminant functions.
We focus on linear discriminant functions, which are either linear in the components of x, or linear in some functions of x.
They may not be optimal, but they are very simple to use.
Linear Discriminant Functions and Decision Surfaces

Definition of a linear discriminant function
It is a function that is a linear combination of the components of x:
    g(x) = w^t x + w_0        (1)
where w is the weight vector and w_0 is the bias.
In the case of c categories, there will be c such discriminant functions.
Linear Discriminant Functions and Decision Surfaces

A two-category classifier with a discriminant function of the form (1) uses the following rule:
    Decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0
It is equivalent to
    Decide ω_1 if w^t x > -w_0 and ω_2 otherwise
If g(x) = 0, x can be assigned to either class.
The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω_1 from points assigned to the category ω_2.
When g(x) is linear, the decision surface is a hyperplane, denoted as H.
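As a quick illustration, here is a minimal Python sketch of this decision rule (the weight vector, bias, and test points are made up for the example, not taken from the slides):

```python
import numpy as np

def classify(x, w, w0):
    """Two-category rule: decide ω_1 if g(x) > 0, ω_2 if g(x) < 0."""
    g = w @ x + w0                      # g(x) = w^t x + w_0
    if g > 0:
        return "omega_1"
    if g < 0:
        return "omega_2"
    return "on the decision surface (either class)"

w, w0 = np.array([1.0, -2.0]), 0.5      # illustrative weights (assumed values)
print(classify(np.array([3.0, 1.0]), w, w0))   # g = 1.5 > 0  -> omega_1
print(classify(np.array([0.0, 1.0]), w, w0))   # g = -1.5 < 0 -> omega_2
```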
Linear Discriminant Functions and Decision Surfaces

g(x) also provides an algebraic measure of the distance from x to the hyperplane:
    x = x_p + r w / ||w||
where x_p is the normal projection of x onto H, and r is the distance between x and H.
Since g(x_p) = w^t x_p + w_0 = 0 and w^t w = ||w||^2,
    g(x) = w^t x + w_0 = w^t x_p + w_0 + r (w^t w) / ||w|| = r ||w||,
or
    r = g(x) / ||w||
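A small numerical check of this distance formula (a sketch with made-up values for w, w_0, and x):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane w^t x + w_0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0            # ||w|| = 5 (illustrative values)
x = np.array([2.0, 1.0])
r = signed_distance(x, w, w0)                  # g(x) = 6 + 4 - 5 = 5, so r = 1.0
x_p = x - r * w / np.linalg.norm(w)            # normal projection of x onto H
print(r, w @ x_p + w0)                         # 1.0 and (numerically) 0.0
```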
Linear Discriminant Functions and Decision Surfaces

In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface.
The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w_0.
Linear Discriminant Functions and Decision Surfaces

The multi-category case
We define c linear discriminant functions
    g_i(x) = w_i^t x + w_i0,   i = 1, ..., c
Assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification is undefined.
In this case, the classifier is a linear machine.
A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in the region R_i.
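A minimal sketch of such a linear machine in Python (the three weight vectors and biases below are made-up illustrations):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class ω_i whose discriminant g_i(x) = w_i^t x + w_i0 is largest."""
    g = W @ x + w0               # all c discriminant values at once
    return int(np.argmax(g))     # index i of the winning region R_i

W = np.array([[ 1.0, 0.0],       # w_1 (assumed)
              [-1.0, 0.0],       # w_2 (assumed)
              [ 0.0, 1.0]])      # w_3 (assumed)
w0 = np.array([0.0, 0.0, -0.5])
print(linear_machine(np.array([2.0, 0.5]), W, w0))   # g = [2, -2, 0] -> class 0 (ω_1)
```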
Linear Discriminant Functions and Decision Surfaces

For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by:
    g_i(x) = g_j(x), i.e. (w_i - w_j)^t x + (w_i0 - w_j0) = 0
(w_i - w_j) is normal to H_ij, and
    d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||
Linear Discriminant Functions and Decision Surfaces

With the linear machine, it is not the weight vectors themselves, but their differences w_i - w_j, that matter.
It is easy to show that the decision regions of a linear machine are convex; this restriction limits the flexibility and accuracy of the classifier.
Every decision region is singly connected.
This is suitable when the conditional densities p(x|ω_i) are unimodal.
Generalized Linear Discriminant Functions

Decision boundaries which separate between classes may not always be linear.
The complexity of the boundaries may sometimes require the use of highly non-linear surfaces.
A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function:
    g(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_N f_N(x) + w_{N+1}
where f_i(x), 1 ≤ i ≤ N, are functions of the pattern x, and x ∈ R^d (Euclidean space).
Generalized Linear Discriminant Functions

The quadratic discriminant function
    g(x) = w_0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j        (2)
This function has (d+1) + d(d+1)/2 = (d+1)(d+2)/2 terms, and as many weighting coefficients, so it has more flexibility for a complicated separating surface.
This function produces a hyperquadric surface.
The weight matrix is defined as W = [w_ij]. It is symmetric.
Generalized Linear Discriminant Functions

Example: a quadratic decision function for a 2-dimensional feature space:
    g(x) = w_0 + w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2
where w = (w_0, w_1, w_2, ..., w_5)^T
and x = (1, x_1^2, x_1 x_2, x_2^2, x_1, x_2)^T
Generalized Linear Discriminant Functions

The commonly used quadratic decision function can be represented as the general d-dimensional quadratic surface:
    g(x) = x^t A x + x^t b + c
where the matrix A = (a_ij), the vector b = (b_1, b_2, ..., b_d)^t, and c depend on the weights w_ii, w_ij, w_i of equation (2).
Generalized Linear Discriminant Functions

If A is positive definite, then the decision function is a hyperellipsoid with axes in the directions of the eigenvectors of A.
If A = I_d (identity), the decision function is simply the d-dimensional hypersphere.
If A is negative definite, the decision function describes a hyperboloid.
In conclusion: it is only the matrix A which determines the shape and characteristics of the decision function.
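One way to check which case applies is to inspect the eigenvalues of A; below is a small sketch (the helper function and the test matrices are illustrative, and the labels follow the slide's characterization of the three cases):

```python
import numpy as np

def quadric_type(A):
    """Classify the quadratic term x^t A x by the signs of the eigenvalues of A."""
    eig = np.linalg.eigvalsh(A)          # A is symmetric, so its eigenvalues are real
    if np.all(eig > 0):
        return "positive definite -> hyperellipsoid"
    if np.all(eig < 0):
        return "negative definite -> hyperboloid (per the slide's characterization)"
    return "indefinite or singular -> some other quadric"

print(quadric_type(np.eye(3)))                     # identity -> hypersphere (a special hyperellipsoid)
print(quadric_type(np.diag([-1.0, -2.0, -0.5])))   # negative definite
```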
Generalized Linear Discriminant Functions

Example: let R^3 be the original pattern space, and let the decision function associated with the pattern classes ω_1 and ω_2 be
    g(x) = 2 x_1^2 + x_3^2 + x_2 x_3 + 4 x_1 - 2 x_2 + 1
for which g(x) > 0 if x ∈ ω_1 and g(x) < 0 if x ∈ ω_2.
Rewrite g(x) as g(x) = x^T A x + x^T b + c.
Determine the class of each of the following pattern vectors:
    [1, 1, 1], [1, 10, 0], [0, 0.5, 0]
Generalized Linear Discriminant Functions

g(x) = 2 x_1^2 + x_3^2 + x_2 x_3 + 4 x_1 - 2 x_2 + 1

g(x) = x^T A x + x^T b + c  with
    A = [ 2  0  0 ;  0  0  0 ;  0  1  1 ],   b = [ 4, -2, 0 ]^T,   c = 1

g([1 1 1]^T) = 7 > 0, so [1, 1, 1]^T ∈ ω_1
g([1 10 0]^T) = -13 < 0, so [1, 10, 0]^T ∈ ω_2
g([0 0.5 0]^T) = 0, so [0, 0.5, 0]^T lies on the decision boundary (class undefined)
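A quick numerical check of this worked example (a minimal NumPy sketch):

```python
import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([4.0, -2.0, 0.0])
c = 1.0

def g(x):
    """Quadratic discriminant g(x) = x^T A x + x^T b + c from the example."""
    return x @ A @ x + x @ b + c

for x in ([1, 1, 1], [1, 10, 0], [0, 0.5, 0]):
    print(x, g(np.array(x, dtype=float)))     # prints 7.0, -13.0, 0.0
```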
Generalized Linear Discriminant Functions

If more terms such as w_ijk x_i x_j x_k are added to the discriminant function, we obtain the class of polynomial discriminant functions.
All these can be expressed in terms of the generalized linear discriminant function
    g(x) = Σ_{i=1}^{d̂} a_i y_i(x)   or   g(x) = a^t y
where a is a d̂-dimensional weight vector, and the d̂ functions y_i(x), sometimes called φ functions, can be arbitrary functions of x.
Generalized Linear Discriminant Functions

The linear discriminant function can also be expressed in the form of a generalized linear discriminant function:
given
    g(x) = w_0 + Σ_{i=1}^{d} w_i x_i
let
    y = (1, x_1, ..., x_d)^t   and   a = (w_0, w_1, ..., w_d)^t
then we have g(x) = a^t y
Generalized Linear Discriminant Functions

By selecting these functions carefully, and letting d̂ be large enough, we can approximate any desired discriminant function.
The resulting discriminant function is not linear in x, but it is linear in y.
The d̂ functions y_i(x) map the d-dimensional x-space to the d̂-dimensional y-space. This mapping reduces the problem to one of finding a homogeneous linear discriminant function.
Generalized Linear Discriminant Functions

Example: let the quadratic discriminant function be
    g(x) = a_1 + a_2 x + a_3 x^2
The original x-space is 1-dimensional, and the 3-dimensional vector y is given by
    y = [1, x, x^2]^t
Varying x in 1-D will cause y to trace out a curve in 3-D.
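A minimal sketch of this mapping in Python (the weight vector a below is an assumed illustration, chosen so that a region that is disconnected in x corresponds to a single half-space in y):

```python
import numpy as np

def to_y(x):
    """Map a scalar x to the 3-D vector y = [1, x, x^2]^t."""
    return np.array([1.0, x, x * x])

# Illustrative weights: g(x) = -1 + 0*x + 1*x^2, i.e. decide ω_1 when |x| > 1.
a = np.array([-1.0, 0.0, 1.0])

for x in (-2.0, 0.0, 2.0):
    g = a @ to_y(x)                              # homogeneous linear discriminant a^t y
    print(x, g, "omega_1" if g > 0 else "omega_2")
```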
Generalized Linear Discriminant Functions

Disconnected regions may become connected in high dimension.
If x is drawn from p(x), the density in y-space, p̂(y), will be degenerate, being zero everywhere except on the curve, where it is infinite. This is a common problem when d̂ > d, i.e. when mapping from a lower-dimensional space to a higher-dimensional space.
Generalized Linear Discriminant Functions

The major disadvantage of the generalized linear discriminant functions is the curse of dimensionality:
    d̂ = (d + 1)(d + 2) / 2
For d = 50, d̂ = 1326.
The Two-Category Linearly Separable Case

In a 2-category case, given a linear discriminant function g(x) = a^t y, if there is a weight vector that can correctly classify all the samples, these samples are called linearly separable.
In this case, if a^t y_i > 0, y_i is labeled ω_1, and if a^t y_i < 0, y_i is labeled ω_2.
Or we can replace all samples labeled ω_2 by their negatives, and then we are looking for a weight vector a such that a^t y_i > 0 for all the samples. This weight vector is called a separating vector.
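A small sketch of this sign normalization (the augmented samples and labels below are made up for illustration):

```python
import numpy as np

def normalize_signs(Y, labels):
    """Negate every sample labeled ω_2, so a separating vector a must satisfy a^t y_i > 0 for all i."""
    Y = np.asarray(Y, dtype=float).copy()
    Y[np.asarray(labels) == 2] *= -1.0
    return Y

# Augmented samples y = [1, x]^t for some 1-D data (assumed values)
Y = np.array([[1.0, 2.0], [1.0, 1.5],      # ω_1
              [1.0, -1.0], [1.0, -2.5]])   # ω_2
labels = [1, 1, 2, 2]
Yn = normalize_signs(Y, labels)
a = np.array([0.0, 1.0])                   # a candidate weight vector
print(np.all(Yn @ a > 0))                  # True -> a is a separating vector for these samples
```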
The Two-Category Linearly Separable Case

To find this weight vector a, we consider it as a point in a weight space (the space of all possible weight vectors).
Each sample y_i will impose a constraint on the possible location of a. The equation a^t y_i = 0 defines a hyperplane in the weight space.
In data space, a is a vector perpendicular to the separating hyperplane a^t y = 0.
The region satisfying a^t y_i > 0 is a half-space on one side of the hyperplane a^t y_i = 0.
If a solution vector exists, it will lie in a solution region, which is the intersection of all the half-spaces, each satisfying a^t y_i > 0 for a particular y_i.
The Two-Category Linearly Separable Case

The solution vector is usually not unique for a finite set of samples.
Additional constraints can be imposed to find a unique solution.
One possibility is to seek a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
Another possibility is to seek a minimum-length weight vector that satisfies a^t y_i ≥ b for all i and some positive constant b, called the margin. This moves the new boundaries away from the old boundaries by b/||y_i|| for all i.
The Two-Category Linearly Separable Case

Gradient Descent Procedure
The goal is to find a solution to the set of linear inequalities a^t y_i > 0.
Define a criterion function J(a) such that when a is a solution vector, this function is minimized.
Minimizing this scalar function can be done through a gradient descent procedure.
The Two-Category Linearly Separable Case

Basic gradient descent
At iteration k = 0, a(0) is an arbitrary weight vector.
At iteration k, calculate
    a(k+1) = a(k) - η(k) ∇J(a(k))
i.e. move from a(k) along the negative of the gradient; η(k) is the step size or learning rate.
Continue until |η(k) ∇J(a(k))| < θ, where θ is a threshold.
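A minimal Python sketch of this procedure (the constant learning rate and the toy criterion J(a) = ||a - a*||^2 are assumptions for illustration):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=1000):
    """Basic gradient descent: a(k+1) = a(k) - eta(k) * grad J(a(k)), with constant eta."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        if np.linalg.norm(step) < theta:     # stop once the update falls below the threshold
            break
        a = a - step
    return a

# Toy criterion J(a) = ||a - [1, 2]||^2 with gradient 2(a - [1, 2])
grad_J = lambda a: 2.0 * (a - np.array([1.0, 2.0]))
print(gradient_descent(grad_J, np.zeros(2)))   # converges toward [1, 2]
```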
The Two-Category Linearly Separable Case

Newton descent
The iteration is based on
    a(k+1) = a(k) - H^(-1) ∇J
where H is the Hessian matrix of second partial derivatives ∂²J/∂a_i∂a_j evaluated at a(k).
Usually Newton descent will give a greater improvement per step than the simple gradient descent algorithm.
It is not applicable if the Hessian matrix is singular.
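A corresponding Newton-descent sketch (again with the assumed toy criterion; on a quadratic criterion the Newton step converges in a single iteration):

```python
import numpy as np

def newton_descent(grad_J, hess_J, a0, theta=1e-8, max_iter=100):
    """Newton descent: a(k+1) = a(k) - H^{-1} grad J(a(k)); assumes H is nonsingular."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess_J(a), grad_J(a))   # H^{-1} grad J, without forming the inverse
        if np.linalg.norm(step) < theta:
            break
        a = a - step
    return a

grad_J = lambda a: 2.0 * (a - np.array([1.0, 2.0]))
hess_J = lambda a: 2.0 * np.eye(2)
print(newton_descent(grad_J, hess_J, np.zeros(2)))     # reaches [1, 2] in one step
```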
The Two-Category Linearly Separable Case

Criterion Functions
The Perceptron criterion function:
    J_p(a) = Σ_{y ∈ Y} (-a^t y)   and then   ∇J_p = Σ_{y ∈ Y} (-y)
where Y(a) is the set of samples misclassified by a.
The update rule becomes
    a(k+1) = a(k) + η(k) Σ_{y ∈ Y} y
This is called the Batch Perceptron.
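A minimal Batch Perceptron sketch in Python (the sign-normalized samples and the fixed learning rate are assumptions for illustration):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch Perceptron: add eta times the sum of the misclassified samples to a at each step.
    Y holds augmented, sign-normalized samples, so a^t y > 0 means 'correctly classified'."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]
        if len(misclassified) == 0:          # every sample satisfies a^t y > 0
            break
        a = a + eta * misclassified.sum(axis=0)
    return a

Y = np.array([[ 1.0,  2.0], [ 1.0,  1.5],    # ω_1 samples (assumed)
              [-1.0,  1.0], [-1.0,  2.5]])   # ω_2 samples, already negated
a = batch_perceptron(Y)
print(a, np.all(Y @ a > 0))                  # a separating vector, True
```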


The Two-Category Linearly Separable Case

Some other criterion functions:
    J_q(a) = Σ_{y ∈ Y} (a^t y)^2
    J_r(a) = (1/2) Σ_{y ∈ Y'} (a^t y - b)^2 / ||y||^2
where Y' is the set of samples for which a^t y ≤ b.
    J_s(a) = ||Ya - b||^2 = Σ_{i=1}^{n} (a^t y_i - b_i)^2
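The last criterion, J_s(a) = ||Ya - b||^2, has a closed-form least-squares solution via the pseudo-inverse; a minimal sketch follows (the samples and the margin vector b are made up):

```python
import numpy as np

Y = np.array([[ 1.0,  2.0], [ 1.0,  1.5],
              [-1.0,  1.0], [-1.0,  2.5]])   # sign-normalized augmented samples (assumed)
b = np.ones(len(Y))                          # margin vector, here all ones

a, *_ = np.linalg.lstsq(Y, b, rcond=None)    # minimizes ||Ya - b||^2, i.e. a = Y^+ b
print(a, Y @ a)                              # the products a^t y_i should be close to b_i
```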
Support Vector Machine

Support vector machines (SVMs) are linear machines with margins.
SVMs rely on a nonlinear function (kernel) φ(·) to map the data into a sufficiently high dimension, in which the two categories can always be separated by a hyperplane.
Given n samples x_k, k = 1, 2, ..., n, let z_k = +1 or -1 for sample x_k in ω_1 or ω_2 respectively. We have
    y_k = φ(x_k)
    g(y) = a^t y
The separating hyperplane ensures
    z_k g(y_k) ≥ 1,   k = 1, ..., n.
Support Vector Machine

The goal in training an SVM is to find the separating hyperplane with the largest margin.
The distance from the hyperplane to a pattern y is |g(y)|/||a||. Assume a positive margin b exists; then
    z_k g(y_k) / ||a|| ≥ b,   k = 1, ..., n.
We will seek the a that maximizes b. To ensure uniqueness of the solution, we impose the constraint b||a|| = 1. This is equivalent to minimizing ||a||^2.
Support Vector Machine

The support vectors are the transformed training samples for which z_k g(y_k) = 1; they are equally close to the hyperplane, and they are the most difficult patterns to classify.
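As an illustration, the sketch below uses scikit-learn's SVC (an assumed library choice, not prescribed by the slides) with a linear kernel and a large C to approximate the hard-margin case, and then reports the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0], [2.5, 2.0]])   # made-up 2-D samples
z = np.array([-1, -1, 1, 1])                                      # labels z_k = +/-1

clf = SVC(kernel="linear", C=1e6)     # large C approximates the hard-margin SVM
clf.fit(X, z)
print(clf.support_vectors_)           # the training samples closest to the hyperplane
print(clf.coef_, clf.intercept_)      # weight vector and bias of the separating hyperplane
```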
Support Vector Machine

SVM training
The choice of φ(·) requires domain knowledge. Other common choices include polynomials, Gaussians, and radial basis functions (RBF).
Define
    L(a, α) = (1/2) ||a||^2 - Σ_{k=1}^{n} α_k [ z_k a^t y_k - 1 ]
where the α_k are Lagrange multipliers.
We will minimize L(·) with respect to a, and maximize it with respect to α_k ≥ 0.
