
LINEAR CLASSIFIERS

J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition


Classification: Problem Statement
  In regression, we model the relationship between a continuous input variable x and a continuous target variable t.

  In classification, the input variable x may still be continuous, but the target variable is discrete.

  In the simplest case, t can take only 2 values,
    e.g., let t = +1 be assigned to C1
          and t = −1 be assigned to C2


Example Problem

  Animal or Vegetable?



Linear Models for Classification
  Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.

  Example:
    2D input vector x
    Two discrete classes C1 and C2

  [Figure: 2D scatter plot of the two classes in the (x1, x2) plane.]


Two Class Discriminant Function
    y(x) = w^T x + w_0

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

  Thus y(x) = 0 defines the decision boundary.

  [Figure: geometry of the decision boundary in the (x1, x2) plane. The boundary y = 0 separates regions R1 (y > 0) and R2 (y < 0); w is normal to the boundary, y(x)/‖w‖ is the signed distance of x from it, and −w_0/‖w‖ is the distance of the boundary from the origin.]


Two-Class Discriminant Function
    y(x) = w^T x + w_0

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

  For convenience, let

    w = [w_1 … w_M]^T  →  w = [w_0 w_1 … w_M]^T

  and

    x = [x_1 … x_M]^T  →  x = [1 x_1 … x_M]^T

  so that we can express y(x) = w^T x.

  [Figure: same decision-boundary geometry as on the previous slide.]


Generalized Linear Models
  For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1.

  For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically will constrain y to lie between −1 and 1 or between 0 and 1:

    y(x) = f(w^T x + w_0)


The Perceptron
    y(x) = f(w^T x + w_0)

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

  A classifier based upon this simple generalized linear model is called a (single layer) perceptron.

  It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.
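  As an illustration (not from the slides), here is a minimal NumPy sketch of this decision rule, using the augmented-vector convention from the previous slide; the function name and array shapes are assumptions made for the example.

import numpy as np

def perceptron_predict(w, X):
    """Assign +1 (class C1) or -1 (class C2) to each row of X.

    w : (M+1,) augmented weight vector [w0, w1, ..., wM]
    X : (N, M) matrix of raw input vectors
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias input 1
    a = X_aug @ w                                     # activations w^T x
    return np.where(a >= 0, 1, -1)                    # threshold activation f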


Parameter Learning

  How do we learn the parameters of a perceptron?



Outline

  The Perceptron Algorithm


  Least-Squares Classifiers

  Fisher’s Linear Discriminant

  Logistic Classifiers

  Support Vector Machines



Case 1. Linearly Separable Inputs
  For starters, let’s assume that the training data is in fact perfectly linearly separable.

  In other words, there exists at least one hyperplane (one set of weights) that yields 0 classification error.

  We seek an algorithm that can automatically find such a hyperplane.


The Perceptron Algorithm
  The perceptron algorithm was invented by Frank Rosenblatt (1962).

  The algorithm is iterative.

  The strategy is to start with a random guess at the weights w, and to then iteratively change the weights to move the hyperplane in a direction that lowers the classification error.

  [Photo: Frank Rosenblatt (1928 – 1971)]


The Perceptron Algorithm
  Note that as we change the weights continuously, the classification error changes in discontinuous, piecewise constant fashion.

  Thus we cannot use the classification error per se as our objective function to minimize.

  What would be a better objective function?


The Perceptron Criterion
  Note that we seek w such that

    w^T x ≥ 0 when t = +1
    w^T x < 0 when t = −1

  In other words, we would like

    w^T x_n t_n ≥ 0  ∀n

  Thus we seek to minimize

    E_P(w) = − Σ_{n∈M} w^T x_n t_n

  where M is the set of misclassified inputs.


The Perceptron Criterion
    E_P(w) = − Σ_{n∈M} w^T x_n t_n

  where M is the set of misclassified inputs.

  Observations:
    E_P(w) is always non-negative.
    E_P(w) is continuous and piecewise linear, and thus easier to minimize.

  [Figure: E_P(w) plotted against a weight w_i — continuous and piecewise linear.]
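  As a brief aside (not from the slides), a NumPy sketch of evaluating the perceptron criterion on augmented inputs; the names are illustrative, and points with w^T x_n t_n exactly 0 are counted as correct in this simplified version.

import numpy as np

def perceptron_criterion(w, X_aug, t):
    """E_P(w) = -sum over misclassified n of w^T x_n t_n.

    X_aug : (N, M+1) augmented inputs, t : (N,) labels in {-1, +1}
    """
    scores = (X_aug @ w) * t          # w^T x_n t_n for every point
    misclassified = scores < 0        # the set M of misclassified inputs
    return -np.sum(scores[misclassified])
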
The Perceptron Algorithm
    E_P(w) = − Σ_{n∈M} w^T x_n t_n

  where M is the set of misclassified inputs.

    dE_P(w)/dw = − Σ_{n∈M} x_n t_n    (where the derivative exists)

  Gradient descent:

    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n

  [Figure: E_P(w) plotted against a weight w_i.]


The Perceptron Algorithm
    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n

  Why does this make sense?

    If an input from C1 (t = +1) is misclassified, we need to make its projection on w more positive.

    If an input from C2 (t = −1) is misclassified, we need to make its projection on w more negative.


The Perceptron Algorithm
  The algorithm can be implemented sequentially (see the sketch below):

    Repeat until convergence:
      For each input (x_n, t_n):
        If it is correctly classified, do nothing.
        If it is misclassified, update the weight vector to be

          w^(τ+1) = w^(τ) + η x_n t_n

  Note that this will lower the contribution of input n to the objective function:

    −(w^(τ+1))^T x_n t_n = −(w^(τ))^T x_n t_n − η (x_n t_n)^T x_n t_n < −(w^(τ))^T x_n t_n
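  To make the sequential update concrete, here is a hedged NumPy sketch (not from the slides) of the training loop, assuming augmented inputs and labels in {−1, +1}; eta and max_epochs are illustrative choices.

import numpy as np

def perceptron_train(X_aug, t, eta=1.0, max_epochs=1000):
    """Sequential perceptron learning on linearly separable data.

    X_aug : (N, M+1) augmented inputs, t : (N,) labels in {-1, +1}.
    Returns the learned augmented weight vector.
    """
    w = np.zeros(X_aug.shape[1])            # could also be a random guess
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X_aug, t):
            if t_n * (w @ x_n) <= 0:        # misclassified (boundary points treated as errors)
                w = w + eta * t_n * x_n     # w <- w + eta * x_n * t_n
                errors += 1
        if errors == 0:                     # converged: all points classified correctly
            break
    return w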


Not Monotonic
  While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase.

  Also, inputs that had previously been classified correctly may now be misclassified.

  The result is that the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.


The Perceptron Convergence Theorem
  Despite this non-monotonicity, if in fact the data are linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).


Example
  [Figure: example run of the perceptron algorithm — four snapshots of the data and decision boundary over successive updates.]


The First Learning Machine
  Mark 1 Perceptron Hardware (c. 1960)

  [Photos: visual inputs; patch board allowing configuration of inputs φ; rack of adaptive weights w (motor-driven potentiometers).]


Practical Limitations
  The Perceptron Convergence Theorem is an important result. However, there are practical limitations:

    Convergence may be slow.

    If the data are not separable, the algorithm will not converge.

    We will only know that the data are separable once the algorithm converges.

    The solution is in general not unique, and will depend upon initialization, the scheduling of input vectors, and the learning rate η.


Generalization to inputs that are not linearly separable.
  The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable.

  Example: The Pocket Algorithm (Gallant, 1990)

  Idea (sketched below):
    Run the perceptron algorithm.
    Keep track of the weight vector w* that has produced the best classification error achieved so far.
    It can be shown that w* will converge to an optimal solution with probability 1.
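  A minimal sketch of the pocket idea (not from the slides): run perceptron updates but keep the best weights seen so far. The random selection of a misclassified point and the update cap are my own choices for the example.

import numpy as np

def pocket_train(X_aug, t, eta=1.0, max_updates=10000):
    """Pocket algorithm: perceptron updates, keeping the best weights seen so far."""
    rng = np.random.default_rng(0)
    w = np.zeros(X_aug.shape[1])
    best_w, best_errors = w.copy(), np.inf
    for _ in range(max_updates):
        scores = t * (X_aug @ w)
        errors = np.sum(scores <= 0)               # current number of misclassified points
        if errors < best_errors:                   # "pocket" the best weights so far
            best_w, best_errors = w.copy(), errors
        if errors == 0:
            break
        n = rng.choice(np.flatnonzero(scores <= 0))  # pick one misclassified point
        w = w + eta * t[n] * X_aug[n]                # perceptron update
    return best_w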


Generalization to Multiclass Problems
  How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?


K>2 Classes
  Idea #1: Just use K−1 discriminant functions, each of which separates one class Ck from the rest. (One-versus-the-rest classifier.)

  Problem: Ambiguous regions

  [Figure: one-versus-the-rest classification with two linear discriminants ("not C1", "not C2") and regions R1, R2, R3, leaving a region of input space with ambiguous class membership.]


K>2 Classes
  Idea #2: Use K(K−1)/2 discriminant functions, each of which separates two classes Cj, Ck from each other. (One-versus-one classifier.)

  Each point is classified by majority vote.

  Problem: Ambiguous regions

  [Figure: one-versus-one classification of three classes C1, C2, C3, again leaving a central region where the majority vote is ambiguous.]


K>2 Classes
  Idea #3: Use K discriminant functions y_k(x), and use the magnitude of y_k(x), not just the sign:

    y_k(x) = w_k^T x

    x assigned to Ck if y_k(x) > y_j(x)  ∀j ≠ k

  Decision boundary between classes k and j:  y_k(x) = y_j(x)  →  (w_k − w_j)^T x + (w_k0 − w_j0) = 0

  This results in decision regions that are simply connected and convex.

  [Figure: decision regions Ri, Rj, Rk; any two points xA, xB in the same region are joined by a line segment that stays inside the region.]
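  A small NumPy sketch (not from the slides) of Idea #3, classifying by the largest discriminant value; the layout of W_aug (one augmented weight vector per class) is an assumption of the example.

import numpy as np

def multiclass_predict(W_aug, X):
    """Assign each row of X to the class with the largest discriminant y_k(x).

    W_aug : (K, M+1) matrix, row k = augmented weight vector [w_k0, w_k]
    X     : (N, M) raw inputs
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_aug @ W_aug.T          # (N, K) matrix of discriminant values y_k(x_n)
    return np.argmax(Y, axis=1)  # index of the winning class for each input
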
Example: Kesler’s Construction
  The perceptron algorithm can be generalized to K-class classification problems.

  Example: Kesler’s Construction:
    Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.
    Inputs are then classified in Class i if and only if

      w_i^T x > w_j^T x  ∀j ≠ i

    The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.


1-of-K Coding Scheme
  When there are K > 2 classes, target variables can be coded using the 1-of-K coding scheme:

    Input from class Ci  →  t = [0 0 … 1 … 0 0]^T,  with the 1 in element i.
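  For concreteness (not from the slides), a short NumPy sketch of 1-of-K coding; class labels are assumed to be integers 0, …, K−1, so "element i" is 0-indexed here.

import numpy as np

def one_of_k(labels, K):
    """Encode integer class labels (0..K-1) as 1-of-K target vectors."""
    labels = np.asarray(labels)
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0   # set element i of each row to 1
    return T

# e.g. one_of_k([0, 2, 1], 3) -> [[1,0,0], [0,0,1], [0,1,0]]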


Computational Limitations of Perceptrons
  Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing.

  However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function.

  This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron.

  [Figure: the XOR pattern in the (x1, x2) plane, which no single line can separate. Photo: Marvin Minsky (1927 – ).]


Multi-Layer Perceptrons
  Minsky & Papert’s book was widely misinterpreted as showing that artificial neural networks were inherently limited.

  This contributed to a decline in the reputation of neural network research through the 70s and 80s.

  However, their findings apply only to single-layer perceptrons. Multi-layer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.


End of Lecture 11
Outline

  The Perceptron Algorithm


  Least-Squares Classifiers

  Fisher’s Linear Discriminant

  Logistic Classifiers

  Support Vector Machines



Dealing with Non-Linearly Separable Inputs
  The perceptron algorithm fails when the training data are not perfectly linearly separable.

  Let’s now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.


The Least Squares Method
  In the least squares method, we simply fit the (x, t) observations with a hyperplane y(x).

  Note that this is kind of a weird idea, since the t values are binary (when K = 2), e.g., 0 or 1.

  However, it can work pretty well.


Least Squares: Learning the Parameters
  Assume D-dimensional input vectors x.

  For each class k ∈ 1…K:

    y_k(x) = w_k^T x + w_k0

  → y(x) = W̃^T x̃

  where

    x̃ = (1, x^T)^T

    W̃ is a (D + 1) × K matrix whose kth column is w̃_k = (w_k0, w_k^T)^T


Learning the Parameters
  Method #2: Least Squares

    y(x) = W̃^T x̃

  Training dataset (x_n, t_n), n = 1, …, N, where we use the 1-of-K coding scheme for t_n.

  Let T be the N × K matrix whose nth row is t_n^T.

  Let X̃ be the N × (D + 1) matrix whose nth row is x̃_n^T.

  Let R_D(W̃) = X̃ W̃ − T.

  Then we define the error as

    E_D(W̃) = ½ Σ_{i,j} R_ij² = ½ Tr{ R_D(W̃)^T R_D(W̃) }

  Setting the derivative with respect to W̃ to 0 yields:

    W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T
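  A hedged NumPy sketch (not from the slides) of this closed-form least-squares fit; the pseudo-inverse is used for numerical stability, and the function and variable names are illustrative.

import numpy as np

def least_squares_classifier(X, labels, K):
    """Fit W_tilde = pinv(X_tilde) @ T and return a classify function."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])        # N x (D+1) augmented design matrix
    T = np.zeros((N, K))
    T[np.arange(N), np.asarray(labels)] = 1.0        # N x K matrix of 1-of-K targets
    W_tilde = np.linalg.pinv(X_tilde) @ T            # (D+1) x K weight matrix

    def classify(X_new):
        X_new_tilde = np.hstack([np.ones((X_new.shape[0], 1)), X_new])
        return np.argmax(X_new_tilde @ W_tilde, axis=1)

    return W_tilde, classify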


Outline

  The Perceptron Algorithm


  Least-Squares Classifiers

  Fisher’s Linear Discriminant

  Logistic Classifiers

  Support Vector Machines



Fisher’s Linear Discriminant
  Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.

  Let

    m1 = (1/N1) Σ_{n∈C1} x_n,    m2 = (1/N2) Σ_{n∈C2} x_n

  For example, we might choose w to maximize w^T (m2 − m1), subject to ‖w‖ = 1.

  This leads to w ∝ m2 − m1.

  However, if the conditional distributions are not isotropic, this is typically not optimal.

  [Figure: two elongated class clusters; projecting onto the direction m2 − m1 leaves the classes overlapping.]


Fisher’s Linear Discriminant
  Let m1 = w^T m1, m2 = w^T m2 be the conditional means on the 1D subspace.

  Let s_k² = Σ_{n∈Ck} (y_n − m_k)² be the within-class variance on the subspace for class Ck.

  The Fisher criterion is then

    J(w) = (m2 − m1)² / (s1² + s2²)

  This can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

  where

    S_B = (m2 − m1)(m2 − m1)^T is the between-class variance, and

    S_W = Σ_{n∈C1} (x_n − m1)(x_n − m1)^T + Σ_{n∈C2} (x_n − m2)(x_n − m2)^T is the within-class variance.

  J(w) is maximized for w ∝ S_W^{-1} (m2 − m1)

  [Figure: the same two clusters projected onto the Fisher direction, which separates the classes much better.]
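  A brief NumPy sketch (not from the slides) of computing the Fisher direction w ∝ S_W^{-1}(m2 − m1) for two classes; variable names are illustrative.

import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.

    X1, X2 : (N1, D) and (N2, D) arrays of inputs from C1 and C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                         # w ∝ S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)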


Connection between Least-Squares and FLD
  Change the coding scheme used in the least-squares method to

    t_n = N / N1   for C1
    t_n = −N / N2  for C2

  Then one can show that the ML w satisfies

    w ∝ S_W^{-1} (m2 − m1)


Least Squares Classifier
  Problem #1: Sensitivity to outliers

  [Figure: two panels of two-class data, without and with additional outlying points; the least-squares decision boundary shifts noticeably when the outliers are added.]


Least Squares Classifier
  Problem #2: A linear activation function is not a good fit to binary data. This can lead to problems.

  [Figure: an example where the least-squares fit leads to a poor decision boundary.]


Outline

  The Perceptron Algorithm


  Least-Squares Classifiers

  Fisher’s Linear Discriminant

  Logistic Classifiers

  Support Vector Machines



Logistic Regression (K = 2)
    p(C1 | φ) = y(φ) = σ(w^T φ)
    p(C2 | φ) = 1 − p(C1 | φ)

  where σ(a) = 1 / (1 + exp(−a)) is the logistic sigmoid.

  [Figure: logistic regression model in 1D and 2D — σ(w^T φ) as a function of the input.]

Logistic Regression
    p(C1 | φ) = y(φ) = σ(w^T φ)
    p(C2 | φ) = 1 − p(C1 | φ)

  where

    σ(a) = 1 / (1 + exp(−a))

  Number of parameters:
    Logistic regression: M
    Gaussian model: 2M + 2·M(M+1)/2 + 1 = M² + 3M + 1


ML for Logistic Regression
    p(t | w) = Π_{n=1}^N  y_n^{t_n} (1 − y_n)^{1−t_n}

  where t = (t_1, …, t_N)^T and y_n = p(C1 | φ_n).

  We define the error function to be E(w) = −log p(t | w).

  Given y_n = σ(a_n) and a_n = w^T φ_n, one can show that

    ∇E(w) = Σ_{n=1}^N (y_n − t_n) φ_n

  Unfortunately, there is no closed-form solution for w.
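  As an illustrative aside (not from the slides), a simple batch gradient-descent sketch for minimizing E(w) using the gradient Σ_n (y_n − t_n) φ_n; the step size and iteration count are arbitrary choices.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_gd(Phi, t, eta=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    Phi : (N, M) design matrix of features phi_n, t : (N,) targets in {0, 1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)          # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)        # sum_n (y_n - t_n) phi_n
        w -= eta * grad
    return w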


End of Lecture 12
ML for Logistic Regression:
  Iterative Reweighted Least Squares

  Although there is no closed-form solution for the ML estimate of w, fortunately, the error function is convex.

  Thus an appropriate iterative method is guaranteed to find the exact solution.

  A good method is to use a local quadratic approximation to the log likelihood function (Newton-Raphson update):

    w^(new) = w^(old) − H^{-1} ∇E(w)

  where H is the Hessian matrix of E(w).


ML for Logistic Regression
    w^(new) = w^(old) − H^{-1} ∇E(w)

  where H is the Hessian matrix of E(w):

    H = Φ^T R Φ

  where R is the N × N diagonal weight matrix with R_nn = y_n (1 − y_n).

  (Note that, since R_nn ≥ 0, R is positive semi-definite, and hence H is positive semi-definite. Thus E(w) is convex.)

  Thus

    w^(new) = w^(old) − (Φ^T R Φ)^{-1} Φ^T (y − t)
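  A hedged NumPy sketch (not from the slides) of this Newton-Raphson / IRLS update; the fixed iteration count and the small ridge term added to keep H invertible are my own choices, not part of the lecture.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_irls(Phi, t, n_iters=20, ridge=1e-8):
    """Iterative reweighted least squares for logistic regression."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                  # N x N diagonal weight matrix
        H = Phi.T @ R @ Phi + ridge * np.eye(M)     # Hessian (ridge keeps it invertible)
        grad = Phi.T @ (y - t)
        w = w - np.linalg.solve(H, grad)            # w_new = w_old - H^{-1} grad
    return w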


ML for Logistic Regression
  Iterative Reweighted Least Squares

    p(C1 | φ) = y(φ) = σ(w^T φ)

  [Figure: illustration of the iteratively reweighted least squares fit — the logistic function σ(w^T φ) and the likelihood surface over the parameters (w1, w2).]


Logistic Regression
  For K > 2, we can generalize the activation function by modeling the posterior probabilities as

    p(Ck | φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j)

  where the activations a_k are given by

    a_k = w_k^T φ
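  A small NumPy sketch (not from the slides) of this softmax activation; subtracting the maximum activation before exponentiating is a standard numerical-stability trick, not something the slide specifies.

import numpy as np

def softmax_posteriors(W, Phi):
    """p(C_k | phi_n) = exp(a_k) / sum_j exp(a_j), with a_k = w_k^T phi.

    W : (K, M) matrix of class weight vectors, Phi : (N, M) feature matrix.
    """
    A = Phi @ W.T                                   # (N, K) activations a_k = w_k^T phi
    A = A - A.max(axis=1, keepdims=True)            # stability: shift before exponentiating
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)   # rows sum to 1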


Example
  [Figure: comparison of the decision boundaries obtained by least squares (left) and by logistic regression (right) on the same data.]


Outline

  The Perceptron Algorithm


  Least-Squares Classifiers

  Fisher’s Linear Discriminant

  Logistic Classifiers

  Support Vector Machines



Motivation
  The perceptron algorithm is guaranteed to provide a linear decision surface that separates the training data, if one exists.

  However, if the data are linearly separable, there are in general an infinite number of solutions, and the solution returned by the perceptron algorithm depends in a complex way on the initial conditions, the learning rate and the order in which training data are processed.

  While all solutions achieve a perfect score on the training data, they won’t all necessarily generalize as well to new inputs.


Which solution would you choose?



The Large Margin Classifier
  Unlike the perceptron algorithm, Support Vector Machines solve a problem that has a unique solution: they return the linear classifier with the maximum margin, that is, the hyperplane that separates the data and is farthest from any of the training vectors.

  Why is this good?


Support Vector Machines
  SVMs are based on the linear model y(x) = w^T φ(x) + b.

  Assume training data x_1, …, x_N with corresponding target values t_1, …, t_N, where t_n ∈ {−1, 1}.

  x is classified according to the sign of y(x).

  Assume for the moment that the training data are linearly separable in feature space.

  Then ∃ w, b :  t_n y(x_n) > 0  ∀n ∈ [1, …, N]


Maximum Margin Classifiers
  When the training data are linearly separable, there are generally an infinite number of solutions for (w, b) that separate the classes exactly.

  The margin of such a classifier is defined as the orthogonal distance in feature space between the decision boundary and the closest training vector.

  SVMs are an example of a maximum margin classifier, which finds the linear classifier that maximizes the margin.

  [Figure: a separating hyperplane y = 0 with the parallel hyperplanes y = ±1 through the closest training points; the margin is the distance between the decision boundary and these points.]


Probabilistic Motivation
  The maximum margin classifier has a probabilistic motivation.

  If we model the class-conditional densities with a KDE using Gaussian kernels with variance σ², then in the limit as σ → 0, the optimal linear decision boundary → the maximum margin linear classifier.

  [Figure: the maximum margin hyperplane y = 0 with margins y = ±1.]


Two Class Discriminant Function
  Recall:

    y(x) = w^T x + w_0

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

  Thus y(x) = 0 defines the decision boundary.

  [Figure: geometry of the decision boundary, as before — w is normal to the boundary, y(x)/‖w‖ is the signed distance of x from it, and −w_0/‖w‖ is the distance of the boundary from the origin.]


Maximum Margin Classifiers
  The distance of point x_n from the decision surface is given by:

    t_n y(x_n) / ‖w‖ = t_n (w^T φ(x_n) + b) / ‖w‖

  Thus we seek:

    argmax_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }

  [Figure: the maximum margin hyperplane y = 0 with margins y = ±1.]


Maximum Margin Classifiers
  The distance of point x_n from the decision surface is given by:

    t_n y(x_n) / ‖w‖ = t_n (w^T φ(x_n) + b) / ‖w‖

  Note that rescaling w and b by the same factor leaves the distance to the decision surface unchanged.

  Thus, without loss of generality, we consider only solutions that satisfy:

    t_n (w^T φ(x_n) + b) = 1

  for the point x_n that is closest to the decision surface.


Quadratic Programming Problem
  Then all points x_n satisfy

    t_n (w^T φ(x_n) + b) ≥ 1

  Points for which equality holds are said to be active. All other points are inactive.

  Now

    argmax_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }

  becomes equivalent to

    argmin_w (1/2) ‖w‖²

    subject to  t_n (w^T φ(x_n) + b) ≥ 1  ∀x_n

  This is a quadratic programming problem. Solving this problem will involve Lagrange multipliers.


Lagrange Multipliers
Lagrange Multipliers (Appendix C.4 in textbook)
  Used to find stationary points of a function subject to one or more constraints.

  Example (equality constraint):

    Maximize f(x) subject to g(x) = 0.

  Observations:

    1. At any point on the constraint surface, ∇g(x) must be orthogonal to the surface.

    2. Let x* be a point on the constraint surface where f(x) is maximized. Then ∇f(x) is also orthogonal to the constraint surface.

    3. ⇒ ∃ λ ≠ 0 such that ∇f(x) + λ ∇g(x) = 0 at x*.

    λ is called a Lagrange multiplier.

  [Figure: the constraint surface g(x) = 0 with ∇f(x) and ∇g(x) at a constrained stationary point x_A. Portrait: Joseph-Louis Lagrange (1736 – 1813).]


Lagrange Multipliers (Appendix C.4 in textbook)
    ∃ λ ≠ 0 such that ∇f(x) + λ ∇g(x) = 0 at x*.

  Defining the Lagrangian function as:

    L(x, λ) = f(x) + λ g(x)

  we then have

    ∇_x L(x, λ) = 0

  and

    ∂L(x, λ) / ∂λ = 0.


Example
    L(x, λ) = f(x) + λ g(x)

  Find the stationary point of

    f(x1, x2) = 1 − x1² − x2²

  subject to

    g(x1, x2) = x1 + x2 − 1 = 0

  [Figure: contour plot of f with the constraint line g(x1, x2) = 0 and the constrained stationary point (x1*, x2*).]
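  For completeness (the slide leaves the computation to the reader), the worked solution using the Lagrangian above:

    L(x, λ) = 1 − x1² − x2² + λ (x1 + x2 − 1)

    ∂L/∂x1 = −2 x1 + λ = 0
    ∂L/∂x2 = −2 x2 + λ = 0
    ∂L/∂λ  = x1 + x2 − 1 = 0

  ⇒ x1* = x2* = 1/2,  λ = 1,  and f(x1*, x2*) = 1/2.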


End of Lecture 13
Inequality Constraints
  Maximize f(x) subject to g(x) ≥ 0.

  There are 2 cases:

  1. x* on the interior (e.g., x_B)
     Here g(x) > 0 and the stationary condition is simply ∇f(x) = 0.
     This corresponds to a stationary point of the Lagrangian where λ = 0.

  2. x* on the boundary (e.g., x_A)
     Here g(x) = 0 and the stationary condition is ∇f(x) = −λ ∇g(x), λ > 0.
     This corresponds to a stationary point of the Lagrangian L(x, λ) = f(x) + λ g(x) where λ > 0.

  Thus the general problem can be expressed as maximizing the Lagrangian subject to

    1. g(x) ≥ 0
    2. λ ≥ 0            Karush-Kuhn-Tucker (KKT) conditions
    3. λ g(x) = 0

  [Figure: the region g(x) > 0 bounded by g(x) = 0, with an interior stationary point x_B and a boundary stationary point x_A.]

Minimizing vs Maximizing
  If we want to minimize f(x) subject to g(x) ≥ 0, then the Lagrangian becomes

    L(x, λ) = f(x) − λ g(x),  with λ ≥ 0.


Extension to Multiple Constraints
  Suppose we wish to maximize f(x) subject to

    g_j(x) = 0  for j = 1, …, J
    h_k(x) ≥ 0  for k = 1, …, K

  We then find the stationary points of

    L(x, λ, µ) = f(x) + Σ_{j=1}^J λ_j g_j(x) + Σ_{k=1}^K µ_k h_k(x)

  subject to

    h_k(x) ≥ 0
    µ_k ≥ 0
    µ_k h_k(x) = 0


Back to our quadratic programming problem…
Quadratic Programming Problem
    argmin_w (1/2) ‖w‖²,  subject to  t_n (w^T φ(x_n) + b) ≥ 1  ∀x_n

  Solve using Lagrange multipliers a_n:

    L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }

  [Figure: the maximum margin hyperplane y = 0 with margins y = ±1.]


Dual Representation
  Solve using Lagrange multipliers a_n:

    L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }

  Setting the derivatives with respect to w and b to 0, we get:

    w = Σ_{n=1}^N a_n t_n φ(x_n)

    Σ_{n=1}^N a_n t_n = 0


Dual Representation
    L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }

    w = Σ_{n=1}^N a_n t_n φ(x_n),    Σ_{n=1}^N a_n t_n = 0

  Substituting leads to the dual representation of the maximum margin problem, in which we maximize:

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

  with respect to a, subject to:

    a_n ≥ 0  ∀n
    Σ_{n=1}^N a_n t_n = 0

  and where k(x, x′) = φ(x)^T φ(x′).


Dual Representation
  Using w = Σ_{n=1}^N a_n t_n φ(x_n), a new point x is classified by computing

    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b

  The Karush-Kuhn-Tucker (KKT) conditions for this constrained optimization problem are:

    a_n ≥ 0
    t_n y(x_n) − 1 ≥ 0
    a_n { t_n y(x_n) − 1 } = 0

  Thus for every data point, either a_n = 0 or t_n y(x_n) = 1.

  [Figure: the maximum margin solution; the points lying on the margins y = ±1 are the support vectors.]


Solving for the Bias
  Once the optimal a is determined, the bias b can be computed by noting that any support vector x_n satisfies t_n y(x_n) = 1.

  Using y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b, we have

    t_n ( Σ_{m=1}^N a_m t_m k(x_n, x_m) + b ) = 1

  and so

    b = t_n − Σ_{m=1}^N a_m t_m k(x_n, x_m)

  A more numerically accurate solution can be obtained by averaging over all support vectors:

    b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )

  where S is the index set of support vectors and N_S is the number of support vectors.
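  As a hedged illustration (not from the slides), a NumPy sketch of this bias computation and the resulting decision function, assuming the dual coefficients a have already been obtained from a QP solver; the Gaussian kernel and its width gamma are arbitrary choices for the example.

import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def svm_bias_and_predict(a, t, X, gamma=1.0, tol=1e-8):
    """Averaged-bias formula and the resulting decision function y(x)."""
    S = np.flatnonzero(a > tol)                         # indices of support vectors
    K_SS = gaussian_kernel(X[S], X[S], gamma)
    b = np.mean(t[S] - K_SS @ (a[S] * t[S]))            # b averaged over support vectors

    def predict(X_new):
        K_newS = gaussian_kernel(X_new, X[S], gamma)
        return np.sign(K_newS @ (a[S] * t[S]) + b)      # sign of sum_n a_n t_n k(x, x_n) + b

    return b, predict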


Example (Gaussian Kernel)
  [Figure: a two-class problem in input space (x1, x2) classified by an SVM with a Gaussian kernel.]

Overlapping Class Distributions
  The SVM for non-overlapping class distributions is determined by solving

    argmin_w (1/2) ‖w‖²,  subject to  t_n (w^T φ(x_n) + b) ≥ 1  ∀x_n

  Alternatively, this can be expressed as the minimization of

    Σ_{n=1}^N E_∞( y(x_n) t_n − 1 ) + λ ‖w‖²

  where E_∞(z) is 0 if z ≥ 0, and ∞ otherwise.

  This forces all points to lie on or outside the margins, on the correct side for their class.

  To allow for misclassified points, we have to relax this E_∞ term.

  [Figure: overlapping classes with slack variables ξ: points on or outside the correct margin have ξ = 0, points inside the margin have 0 < ξ < 1, and misclassified points have ξ > 1.]


Slack Variables
  To this end, we introduce N slack variables ξ_n ≥ 0, n = 1, …, N:

    ξ_n = 0 for points on or on the correct side of the margin boundary for their class
    ξ_n = |t_n − y(x_n)| for all other points.

  Thus ξ_n < 1 for points that are correctly classified, and ξ_n > 1 for points that are incorrectly classified.

  We now minimize

    C Σ_{n=1}^N ξ_n + (1/2) ‖w‖²,  where C > 0,

  subject to

    t_n y(x_n) ≥ 1 − ξ_n,  and  ξ_n ≥ 0,  n = 1, …, N

  [Figure: the margins y = ±1 with slack variables ξ labelling points inside the margin and misclassified points.]

Dual Representation
  This leads to a dual representation, where we maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

  with constraints

    0 ≤ a_n ≤ C

  and

    Σ_{n=1}^N a_n t_n = 0

Support Vectors
  Again, a new point x is classified by computing

    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b

  For points that are on the correct side of the margin, a_n = 0.

  Thus the support vectors consist of points between their margin and the decision boundary, as well as misclassified points.

  In other words, all points that are not on the right side of their margin are support vectors.

  [Figure: same soft-margin illustration as on the previous slides.]

Bias
  Again, a new point x is classified by computing

    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b

  Once the optimal a is determined, the bias b can be computed from

    b = (1/N_M) Σ_{n∈M} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )

  where

    S is the index set of support vectors
    N_S is the number of support vectors
    M is the index set of points on the margins
    N_M is the number of points on the margins

Solving the Quadratic Programming Problem
  Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

  subject to  0 ≤ a_n ≤ C  and  Σ_{n=1}^N a_n t_n = 0

  The problem is convex.

  Standard solutions are generally O(N³).

  Traditional quadratic programming techniques are often infeasible due to computation and memory requirements.

  Instead, methods such as sequential minimal optimization can be used, which in practice are found to scale as O(N) – O(N²).


Chunking
  Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

  subject to  0 ≤ a_n ≤ C  and  Σ_{n=1}^N a_n t_n = 0

  A conventional quadratic programming solution requires that matrices with N² elements be maintained in memory:

    K ~ O(N²), where K_nm = k(x_n, x_m)
    T ~ O(N²), where T_nm = t_n t_m
    A ~ O(N²), where A_nm = a_n a_m

  This becomes infeasible when N exceeds ~10,000.


Chunking
  Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

  subject to  0 ≤ a_n ≤ C  and  Σ_{n=1}^N a_n t_n = 0

  Chunking (Vapnik, 1982) exploits the fact that the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix where a_n = 0 or a_m = 0.


Chunking
  Minimize  C Σ_{n=1}^N ξ_n + (1/2) ‖w‖²,  where C > 0

    ξ_n = 0 for points on or on the correct side of the margin boundary for their class
    ξ_n = |t_n − y(x_n)| for all other points.

  Chunking (Vapnik, 1982):
    1. Select a small number (a ‘chunk’) of training vectors
    2. Solve the QP problem for this subset
    3. Retain only the support vectors
    4. Consider another chunk of the training data
    5. Ignore the subset of vectors in all chunks considered so far that lie on the correct side of the margin, since these do not contribute to the cost function
    6. Add the remainder to the current set of support vectors and solve the new QP problem
    7. Return to Step 4
    8. Repeat until the set of support vectors does not change.

  This method reduces memory requirements to O(N_S²), where N_S is the number of support vectors. This may still be big!

Decomposition Methods
  It can be shown that the global QP problem is solved when all training vectors satisfy the following optimality conditions:

    a_i = 0      ⇒  t_i y(x_i) ≥ 1
    0 < a_i < C  ⇒  t_i y(x_i) = 1
    a_i = C      ⇒  t_i y(x_i) ≤ 1

  Decomposition methods decompose this large QP problem into a series of smaller subproblems.

  Decomposition (Osuna et al., 1997):
    Partition the training data into a small working subset B and a fixed subset N.
    Minimize the global objective function by adjusting the coefficients in B.
    Swap 1 or more vectors in B for an equal number in N that fail to satisfy the optimality conditions.
    Re-solve the global QP problem for B.

  Each step is O(|B|²) in memory.

  Osuna et al. (1997) proved that the objective function decreases on each step and will converge in a finite number of iterations.

Sequential Minimal Optimization
  Sequential Minimal Optimization (Platt, 1998) takes decomposition to the limit.

  On each iteration, the working set consists of just two vectors.

  The advantage is that in this case, the QP problem can be solved analytically.

  Memory requirements are O(N).

  Compute time is typically O(N) – O(N²).


LIBSVM
  LIBSVM is a widely used library for SVMs, developed by Chang & Lin (2001).

  It can be downloaded from www.csie.ntu.edu.tw/~cjlin/libsvm

  MATLAB interface

  Uses SMO

  Will use for Assignment 2.


End of Lecture 14
LIBSVM Example: Face Detection
  [Figure: training pipeline. Face and non-face image patches are preprocessed (subsampled and normalized to µ = 0, σ² = 1) and then passed to svmtrain.]

LIBSVM Example: MATLAB Interface
  % '-t 0' selects a linear SVM (linear kernel)
  model = svmtrain(traint, trainx, '-t 0');

  [predicted_label, accuracy, decision_values] = svmpredict(testt, testx, model);

  Accuracy = 70.0212% (661/944) (classification)


Example
  [Figure: a two-class example in input space (x1, x2).]

Relation to Logistic Regression
  The objective function for the soft-margin SVM can be written as:

    Σ_{n=1}^N E_SV(y_n t_n) + λ ‖w‖²

  where E_SV(z) = [1 − z]_+ is the hinge error function, and [z]_+ = z if z ≥ 0, and 0 otherwise.

  For t ∈ {−1, 1}, the objective function for a regularized version of logistic regression can be written as:

    Σ_{n=1}^N E_LR(y_n t_n) + λ ‖w‖²

  where E_LR(z) = log(1 + exp(−z)).

  [Figure: the hinge error E_SV(z) and the logistic error E_LR(z) plotted against z over roughly −2 ≤ z ≤ 2.]
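  As a final illustrative aside (not from the slides), the two error functions above in NumPy, which makes the comparison in the plot easy to reproduce.

import numpy as np

def hinge_error(z):
    """E_SV(z) = [1 - z]_+ , the hinge error used by the soft-margin SVM."""
    return np.maximum(0.0, 1.0 - z)

def logistic_error(z):
    """E_LR(z) = log(1 + exp(-z)), the error used by (regularized) logistic regression."""
    return np.log1p(np.exp(-z))

z = np.linspace(-2, 2, 9)
print(hinge_error(z))     # piecewise linear, exactly 0 for z >= 1
print(logistic_error(z))  # smooth, strictly positive everywhere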
