
CpE 646 Pattern Recognition and Classification
Prof. Hong Man
Department of Electrical and Computer Engineering
Stevens Institute of Technology
Chapter 5: Linear Discriminant Functions

Chapter 5 (Sections 5.1-5.5, 5.11):

Introduction
Linear Discriminant Functions and Decision Surfaces
Generalized Linear Discriminant Functions
The Two-Category Linearly Separable Case
Minimizing the Perceptron Criterion Function
Support Vector Machines
Introduction

In Chapter 3, the underlying probability densities were known (or given).
The training samples were used to estimate the parameters of these probability densities (ML, MAP estimations).
In this chapter, we only know the proper forms of the discriminant functions; the training samples will be used to estimate the parameters of the discriminant functions.
We focus on linear discriminant functions, which are either linear in the components of x, or linear in some functions of x.
They may not be optimal, but they are very simple to use.
Linear Discriminant Functions and Decision Surfaces

Definition of a linear discriminant function
It is a function that is a linear combination of the components of x:
    g(x) = w^t x + w_0        (1)
where w is the weight vector and w_0 is the bias.
In the case of c categories, there will be c such discriminant functions.
Linear Discriminant Functions and Decision Surfaces

A two-category classifier with a discriminant function of the form (1) uses the following rule:
    Decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0
It is equivalent to
    Decide ω_1 if w^t x > -w_0 and ω_2 otherwise
If g(x) = 0, x can be assigned to either class.
The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω_1 from points assigned to the category ω_2.
When g(x) is linear, the decision surface is a hyperplane, denoted as H.
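As a quick illustration, here is a minimal Python sketch of this decision rule (the weight vector, bias, and test points are made up for the example, not taken from the slides):

```python
import numpy as np

def classify(x, w, w0):
    """Two-category rule: decide ω_1 if g(x) > 0, ω_2 if g(x) < 0."""
    g = w @ x + w0                      # g(x) = w^t x + w_0
    if g > 0:
        return "omega_1"
    if g < 0:
        return "omega_2"
    return "on the decision surface (either class)"

w, w0 = np.array([1.0, -2.0]), 0.5      # illustrative weights (assumed values)
print(classify(np.array([3.0, 1.0]), w, w0))   # g = 1.5 > 0  -> omega_1
print(classify(np.array([0.0, 1.0]), w, w0))   # g = -1.5 < 0 -> omega_2
```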
Linear Discriminant Functions and Decision Surfaces

g(x) also provides an algebraic measure of the distance from x to the hyperplane:
    x = x_p + r w / ||w||
where x_p is the normal projection of x onto H, and r is the distance between x and H.
Since g(x_p) = w^t x_p + w_0 = 0 and w^t w = ||w||^2,
    g(x) = w^t x + w_0 = w^t x_p + w_0 + r (w^t w) / ||w|| = r ||w||,
or
    r = g(x) / ||w||
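A small numerical check of this distance formula (a sketch with made-up values for w, w_0, and x):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane w^t x + w_0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0            # ||w|| = 5 (illustrative values)
x = np.array([2.0, 1.0])
r = signed_distance(x, w, w0)                  # g(x) = 6 + 4 - 5 = 5, so r = 1.0
x_p = x - r * w / np.linalg.norm(w)            # normal projection of x onto H
print(r, w @ x_p + w0)                         # 1.0 and (numerically) 0.0
```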
Linear Discriminant Functions and Decision Surfaces

In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface.
The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w_0.
Linear Discriminant Functions and Decision Surfaces

The multi-category case
We define c linear discriminant functions
    g_i(x) = w_i^t x + w_i0,   i = 1, ..., c
Assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification is undefined.
In this case, the classifier is a linear machine.
A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in the region R_i.
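A minimal sketch of such a linear machine in Python (the three weight vectors and biases below are made-up illustrations):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class ω_i whose discriminant g_i(x) = w_i^t x + w_i0 is largest."""
    g = W @ x + w0               # all c discriminant values at once
    return int(np.argmax(g))     # index i of the winning region R_i

W = np.array([[ 1.0, 0.0],       # w_1 (assumed)
              [-1.0, 0.0],       # w_2 (assumed)
              [ 0.0, 1.0]])      # w_3 (assumed)
w0 = np.array([0.0, 0.0, -0.5])
print(linear_machine(np.array([2.0, 0.5]), W, w0))   # g = [2, -2, 0] -> class 0 (ω_1)
```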
Linear Discriminant Functions and Decision Surfaces

For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by:
    g_i(x) = g_j(x), i.e. (w_i - w_j)^t x + (w_i0 - w_j0) = 0
(w_i - w_j) is normal to H_ij, and
    d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||
Linear Discriminant Functions and Decision Surfaces

With the linear machine, it is not the weight vectors themselves, but their differences w_i - w_j, that matter.
It is easy to show that the decision regions of a linear machine are convex; this restriction limits the flexibility and accuracy of the classifier.
Every decision region is singly connected.
This is suitable when the conditional densities p(x|ω_i) are unimodal.
Generalized Linear Discriminant Functions

Decision boundaries which separate between classes may not always be linear.
The complexity of the boundaries may sometimes require the use of highly non-linear surfaces.
A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function:
    g(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_N f_N(x) + w_{N+1}
where f_i(x), 1 ≤ i ≤ N, are functions of the pattern x, and x ∈ R^d (Euclidean space).
Generalized Linear Discriminant Functions

The quadratic discriminant function
    g(x) = w_0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j        (2)
This function has (d+1) + d(d+1)/2 = (d+1)(d+2)/2 terms, and as many weighting coefficients, so it has more flexibility for a complicated separating surface.
This function produces a hyperquadric surface.
The weight matrix is defined as W = [w_ij]. It is symmetric.
Generalized Linear Discriminant Functions

Example: a quadratic decision function for a 2-dimensional feature space:
    g(x) = w_0 + w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2
where w = (w_0, w_1, w_2, ..., w_5)^T
and x = (1, x_1^2, x_1 x_2, x_2^2, x_1, x_2)^T
Generalized Linear Discriminant Functions

The commonly used quadratic decision function can be represented as the general d-dimensional quadratic surface:
    g(x) = x^t A x + x^t b + c
where the matrix A = (a_ij), the vector b = (b_1, b_2, ..., b_d)^t, and c depend on the weights w_ii, w_ij, w_i of equation (2).
Generalized Linear Discriminant Functions

If A is positive definite, then the decision function is a hyperellipsoid with axes in the directions of the eigenvectors of A.
If A = I_d (identity), the decision function is simply the d-dimensional hypersphere.
If A is negative definite, the decision function describes a hyperboloid.
In conclusion: it is only the matrix A which determines the shape and characteristics of the decision function.
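One way to check which case applies is to inspect the eigenvalues of A; below is a small sketch (the helper function and the test matrices are illustrative, and the labels follow the slide's characterization of the three cases):

```python
import numpy as np

def quadric_type(A):
    """Classify the quadratic term x^t A x by the signs of the eigenvalues of A."""
    eig = np.linalg.eigvalsh(A)          # A is symmetric, so its eigenvalues are real
    if np.all(eig > 0):
        return "positive definite -> hyperellipsoid"
    if np.all(eig < 0):
        return "negative definite -> hyperboloid (per the slide's characterization)"
    return "indefinite or singular -> some other quadric"

print(quadric_type(np.eye(3)))                     # identity -> hypersphere (a special hyperellipsoid)
print(quadric_type(np.diag([-1.0, -2.0, -0.5])))   # negative definite
```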
Generalized Linear Discriminant Functions

Example: let R^3 be the original pattern space, and let the decision function associated with the pattern classes ω_1 and ω_2 be
    g(x) = 2 x_1^2 + x_3^2 + x_2 x_3 + 4 x_1 - 2 x_2 + 1
for which g(x) > 0 if x ∈ ω_1 and g(x) < 0 if x ∈ ω_2.
Rewrite g(x) as g(x) = x^T A x + x^T b + c.
Determine the class of each of the following pattern vectors:
    [1, 1, 1], [1, 10, 0], [0, 0.5, 0]
Generalized Linear Discriminant Functions

g(x) = 2 x_1^2 + x_3^2 + x_2 x_3 + 4 x_1 - 2 x_2 + 1

g(x) = x^T A x + x^T b + c  with
    A = [ 2  0  0 ;  0  0  0 ;  0  1  1 ],   b = [ 4, -2, 0 ]^T,   c = 1

g([1 1 1]^T) = 7 > 0, so [1, 1, 1]^T ∈ ω_1
g([1 10 0]^T) = -13 < 0, so [1, 10, 0]^T ∈ ω_2
g([0 0.5 0]^T) = 0, so [0, 0.5, 0]^T lies on the decision boundary (class undefined)
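A quick numerical check of this worked example (a minimal NumPy sketch):

```python
import numpy as np

A = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([4.0, -2.0, 0.0])
c = 1.0

def g(x):
    """Quadratic discriminant g(x) = x^T A x + x^T b + c from the example."""
    return x @ A @ x + x @ b + c

for x in ([1, 1, 1], [1, 10, 0], [0, 0.5, 0]):
    print(x, g(np.array(x, dtype=float)))     # prints 7.0, -13.0, 0.0
```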
Generalized Linear Discriminant Functions

If more terms such as w_ijk x_i x_j x_k are added to the discriminant function, we obtain the class of polynomial discriminant functions.
All these can be expressed in terms of the generalized linear discriminant function
    g(x) = Σ_{i=1}^{d̂} a_i y_i(x)   or   g(x) = a^t y
where a is a d̂-dimensional weight vector, and the d̂ functions y_i(x), sometimes called φ functions, can be arbitrary functions of x.
Generalized Linear Discriminant Functions

The linear discriminant function can also be expressed in the form of a generalized linear discriminant function:
given
    g(x) = w_0 + Σ_{i=1}^{d} w_i x_i
let
    y = (1, x_1, ..., x_d)^t   and   a = (w_0, w_1, ..., w_d)^t
then we have g(x) = a^t y
Generalized Linear Discriminant Functions

By selecting these functions carefully, and letting d̂ be large enough, we can approximate any desired discriminant function.
The resulting discriminant function is not linear in x, but it is linear in y.
The d̂ functions y_i(x) map the d-dimensional x-space to the d̂-dimensional y-space. This mapping reduces the problem to one of finding a homogeneous linear discriminant function.
Generalized Linear Discriminant Functions

Example: let the quadratic discriminant function be
    g(x) = a_1 + a_2 x + a_3 x^2
The original x-space is 1-dimensional, and the 3-dimensional vector y is given by
    y = [1, x, x^2]^t
Varying x in 1-D will cause y to trace out a curve in 3-D.
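A minimal sketch of this mapping in Python (the weight vector a below is an assumed illustration, chosen so that a region that is disconnected in x corresponds to a single half-space in y):

```python
import numpy as np

def to_y(x):
    """Map a scalar x to the 3-D vector y = [1, x, x^2]^t."""
    return np.array([1.0, x, x * x])

# Illustrative weights: g(x) = -1 + 0*x + 1*x^2, i.e. decide ω_1 when |x| > 1.
a = np.array([-1.0, 0.0, 1.0])

for x in (-2.0, 0.0, 2.0):
    g = a @ to_y(x)                              # homogeneous linear discriminant a^t y
    print(x, g, "omega_1" if g > 0 else "omega_2")
```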
Generalized Linear Discriminant Functions

Disconnected regions may become connected in high dimension.
If x is drawn from p(x), the density in y-space, p̂(y), will be degenerate, being zero everywhere except on the curve, where it is infinite. This is a common problem when d̂ > d, i.e. when mapping from a lower-dimensional space to a higher-dimensional space.
Generalized Linear Discriminant Functions

The major disadvantage of the generalized linear discriminant functions is the curse of dimensionality:
    d̂ = (d + 1)(d + 2) / 2
For d = 50, d̂ = 1326.
The Two-Category Linearly Separable Case

In a 2-category case, given a linear discriminant function g(x) = a^t y, if there is a weight vector that can correctly classify all the samples, these samples are called linearly separable.
In this case, if a^t y_i > 0, y_i is labeled ω_1, and if a^t y_i < 0, y_i is labeled ω_2.
Or we can replace all samples labeled ω_2 by their negatives, and then we are looking for a weight vector a such that a^t y_i > 0 for all the samples. This weight vector is called a separating vector.
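A small sketch of this sign normalization (the augmented samples and labels below are made up for illustration):

```python
import numpy as np

def normalize_signs(Y, labels):
    """Negate every sample labeled ω_2, so a separating vector a must satisfy a^t y_i > 0 for all i."""
    Y = np.asarray(Y, dtype=float).copy()
    Y[np.asarray(labels) == 2] *= -1.0
    return Y

# Augmented samples y = [1, x]^t for some 1-D data (assumed values)
Y = np.array([[1.0, 2.0], [1.0, 1.5],      # ω_1
              [1.0, -1.0], [1.0, -2.5]])   # ω_2
labels = [1, 1, 2, 2]
Yn = normalize_signs(Y, labels)
a = np.array([0.0, 1.0])                   # a candidate weight vector
print(np.all(Yn @ a > 0))                  # True -> a is a separating vector for these samples
```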
The Two-Category Linearly Separable Case

To find this weight vector a, we consider it as a point in a weight space (the space of all possible weight vectors).
Each sample y_i will impose a constraint on the possible location of a. The equation a^t y_i = 0 defines a hyperplane in the weight space.
In data space, a is a vector perpendicular to the separating hyperplane a^t y = 0.
The region satisfying a^t y_i > 0 is a half-space on one side of the hyperplane a^t y_i = 0.
If a solution vector exists, it will lie in a solution region, which is the intersection of all the half-spaces, each satisfying a^t y_i > 0 for a particular y_i.
The Two-Category Linearly Separable Case

The solution vector is usually not unique for a finite set of samples.
Additional constraints can be imposed to find a unique solution.
One possibility is to seek a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
Another possibility is to seek a minimum-length weight vector that satisfies a^t y_i ≥ b for all i and some positive constant b, called the margin. This moves the new boundaries away from the old boundaries by b/||y_i|| for all i.
The Two-Category Linearly Separable Case

Gradient Descent Procedure
The goal is to find a solution to the set of linear inequalities a^t y_i > 0.
Define a criterion function J(a) such that when a is a solution vector, this function is minimized.
Minimizing this scalar function can be done through a gradient descent procedure.
The Two-Category Linearly Separable Case

Basic gradient descent
At iteration k = 0, a(0) is an arbitrary weight vector.
At iteration k, calculate
    a(k+1) = a(k) - η(k) ∇J(a(k))
i.e. move from a(k) along the negative of the gradient; η(k) is the step size or learning rate.
Continue until |η(k) ∇J(a(k))| < θ, where θ is a threshold.
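A minimal Python sketch of this procedure (the constant learning rate and the toy criterion J(a) = ||a - a*||^2 are assumptions for illustration):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=1000):
    """Basic gradient descent: a(k+1) = a(k) - eta(k) * grad J(a(k)), with constant eta."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        if np.linalg.norm(step) < theta:     # stop once the update falls below the threshold
            break
        a = a - step
    return a

# Toy criterion J(a) = ||a - [1, 2]||^2 with gradient 2(a - [1, 2])
grad_J = lambda a: 2.0 * (a - np.array([1.0, 2.0]))
print(gradient_descent(grad_J, np.zeros(2)))   # converges toward [1, 2]
```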
The Two-Category Linearly Separable Case

Newton descent
The iteration is based on
    a(k+1) = a(k) - H^(-1) ∇J
where H is the Hessian matrix of second partial derivatives ∂²J/∂a_i∂a_j evaluated at a(k).
Usually Newton descent will give a greater improvement per step than the simple gradient descent algorithm.
It is not applicable if the Hessian matrix is singular.
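A corresponding Newton-descent sketch (again with the assumed toy criterion; on a quadratic criterion the Newton step converges in a single iteration):

```python
import numpy as np

def newton_descent(grad_J, hess_J, a0, theta=1e-8, max_iter=100):
    """Newton descent: a(k+1) = a(k) - H^{-1} grad J(a(k)); assumes H is nonsingular."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess_J(a), grad_J(a))   # H^{-1} grad J, without forming the inverse
        if np.linalg.norm(step) < theta:
            break
        a = a - step
    return a

grad_J = lambda a: 2.0 * (a - np.array([1.0, 2.0]))
hess_J = lambda a: 2.0 * np.eye(2)
print(newton_descent(grad_J, hess_J, np.zeros(2)))     # reaches [1, 2] in one step
```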
The Two-Category Linearly Separable Case

Criterion Functions
The Perceptron criterion function:
    J_p(a) = Σ_{y ∈ Y} (-a^t y)   and then   ∇J_p = Σ_{y ∈ Y} (-y)
where Y(a) is the set of samples misclassified by a.
The update rule becomes
    a(k+1) = a(k) + η(k) Σ_{y ∈ Y} y
This is called the Batch Perceptron.
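A minimal Batch Perceptron sketch in Python (the sign-normalized samples and the fixed learning rate are assumptions for illustration):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch Perceptron: add eta times the sum of the misclassified samples to a at each step.
    Y holds augmented, sign-normalized samples, so a^t y > 0 means 'correctly classified'."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]
        if len(misclassified) == 0:          # every sample satisfies a^t y > 0
            break
        a = a + eta * misclassified.sum(axis=0)
    return a

Y = np.array([[ 1.0,  2.0], [ 1.0,  1.5],    # ω_1 samples (assumed)
              [-1.0,  1.0], [-1.0,  2.5]])   # ω_2 samples, already negated
a = batch_perceptron(Y)
print(a, np.all(Y @ a > 0))                  # a separating vector, True
```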


The Two-Category Linearly Separable Case

Some other criterion functions:
    J_q(a) = Σ_{y ∈ Y} (a^t y)^2
    J_r(a) = (1/2) Σ_{y ∈ Y'} (a^t y - b)^2 / ||y||^2
where Y' is the set of samples for which a^t y ≤ b.
    J_s(a) = ||Ya - b||^2 = Σ_{i=1}^{n} (a^t y_i - b_i)^2
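The last criterion, J_s(a) = ||Ya - b||^2, has a closed-form least-squares solution via the pseudo-inverse; a minimal sketch follows (the samples and the margin vector b are made up):

```python
import numpy as np

Y = np.array([[ 1.0,  2.0], [ 1.0,  1.5],
              [-1.0,  1.0], [-1.0,  2.5]])   # sign-normalized augmented samples (assumed)
b = np.ones(len(Y))                          # margin vector, here all ones

a, *_ = np.linalg.lstsq(Y, b, rcond=None)    # minimizes ||Ya - b||^2, i.e. a = Y^+ b
print(a, Y @ a)                              # the products a^t y_i should be close to b_i
```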
Support Vector Machine

Support vector machines (SVMs) are linear machines with margins.
SVMs rely on a nonlinear function (kernel) φ(·) to map the data into a sufficiently high dimension, in which the two categories can always be separated by a hyperplane.
Given n samples x_k, k = 1, 2, ..., n, let z_k = +1 or -1 for sample x_k in ω_1 or ω_2 respectively. We have
    y_k = φ(x_k)
    g(y) = a^t y
The separating hyperplane ensures
    z_k g(y_k) ≥ 1,   k = 1, ..., n.
Support Vector Machine

The goal in training an SVM is to find the separating hyperplane with the largest margin.
The distance from the hyperplane to a pattern y is |g(y)|/||a||. Assume a positive margin b exists; then
    z_k g(y_k) / ||a|| ≥ b,   k = 1, ..., n.
We will seek the a that maximizes b. To ensure uniqueness of the solution, we impose the constraint b||a|| = 1. This is equivalent to minimizing ||a||^2.
Support Vector Machine

The support vectors are the transformed training samples for which z_k g(y_k) = 1; they are equally close to the hyperplane, and they are the most difficult patterns to classify.
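As an illustration, the sketch below uses scikit-learn's SVC (an assumed library choice, not prescribed by the slides) with a linear kernel and a large C to approximate the hard-margin case, and then reports the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0], [2.5, 2.0]])   # made-up 2-D samples
z = np.array([-1, -1, 1, 1])                                      # labels z_k = +/-1

clf = SVC(kernel="linear", C=1e6)     # large C approximates the hard-margin SVM
clf.fit(X, z)
print(clf.support_vectors_)           # the training samples closest to the hyperplane
print(clf.coef_, clf.intercept_)      # weight vector and bias of the separating hyperplane
```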
Support Vector Machine

SVM training
The choice of φ(·) requires domain knowledge. Other common choices include polynomials, Gaussians, and radial basis functions (RBF).
Define
    L(a, α) = (1/2) ||a||^2 - Σ_{k=1}^{n} α_k [ z_k a^t y_k - 1 ]
where the α_k are Lagrange multipliers.
We will minimize L(·) with respect to a, and maximize it with respect to α_k ≥ 0.
