[Figure: a linear decision boundary y = 0 in the (x_1, x_2) plane, separating region R1 (y > 0) from region R2 (y < 0).]
A linear discriminant has the form

    y(x) = w^T x + w_0

    y(x) ≥ 0 → x assigned to C1
    y(x) < 0 → x assigned to C2

Thus y(x) = 0 defines the decision boundary.

w is orthogonal to the decision boundary; the signed distance of the boundary from the origin is −w_0 / ||w||, and the signed distance of a point x from the boundary is y(x) / ||w||, so that

    x = x⊥ + (y(x) / ||w||) (w / ||w||)

where x⊥ is the orthogonal projection of x onto the boundary.

Augmented notation:

    x = [x_1 … x_M]^T → [1 x_1 … x_M]^T

More generally, a fixed nonlinearity f may be applied to the linear function:

    y(x) = f(w^T x + w_0)

with y(x) ≥ 0 → x assigned to C1 and y(x) < 0 → x assigned to C2.
Logistic Classifiers
If the training data are linearly separable, the perceptron algorithm will find such a hyperplane.

The perceptron criterion is

    E_P(w) = − Σ_{n∈M} w^T x_n t_n

where M is the set of misclassified inputs.

Observations:

    E_P(w) is always non-negative.

    E_P(w) is continuous and piecewise linear, and thus easier to minimize.
CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition J. Elder
The Perceptron Algorithm
Probability & Bayesian Inference
Gradient descent:

    w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η Σ_{n∈M} x_n t_n
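In stochastic form, the update is applied whenever a pattern is misclassified. A minimal dependency-free sketch, assuming toy 2-D data (not from the lecture) with targets t_n in {−1, +1} and the bias w_0 absorbed into an augmented input [1, x_1, x_2]:

```python
def perceptron(X, t, eta=1.0, max_epochs=100):
    """Stochastic gradient descent on the perceptron criterion:
    whenever a pattern is misclassified, add eta * x_n * t_n."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        errors = 0
        for x, tn in zip(X, t):
            if tn * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * xi * tn for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:          # all patterns correctly classified
            break
    return w

# Toy linearly separable data: class +1 above the line x2 = x1, -1 below.
X = [[1, 0, 1], [1, 1, 2], [1, 2, 1], [1, 1, 0]]
t = [1, 1, -1, -1]
w = perceptron(X, t)
```

Since the toy data are linearly separable, the loop terminates with zero training errors, as the perceptron convergence theorem guarantees.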
[Figure: successive updates of the perceptron algorithm on a 2-D dataset, rotating the decision boundary until all points are correctly classified.]
Idea:
Run the perceptron algorithm
Keep track of the weight vector w* that has produced the
best classification error achieved so far.
It can be shown that w* will converge to an optimal solution
with probability 1.
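A sketch of this idea, assuming toy data with targets in {−1, +1} and the bias absorbed into an augmented first component: run perceptron updates, but keep the best weight vector seen so far "in the pocket".

```python
def pocket_perceptron(X, t, eta=1.0, max_updates=200):
    w = [0.0] * len(X[0])

    def n_errors(w):
        return sum(1 for x, tn in zip(X, t)
                   if tn * sum(wi * xi for wi, xi in zip(w, x)) <= 0)

    best_w, best_errors = list(w), n_errors(w)
    for _ in range(max_updates):
        # find the misclassified patterns and apply one perceptron update
        miscls = [(x, tn) for x, tn in zip(X, t)
                  if tn * sum(wi * xi for wi, xi in zip(w, x)) <= 0]
        if not miscls:
            return w, 0                     # converged: zero errors
        x, tn = miscls[0]
        w = [wi + eta * xi * tn for wi, xi in zip(w, x)]
        e = n_errors(w)
        if e < best_errors:                 # keep the best w* seen so far
            best_w, best_errors = list(w), e
    return best_w, best_errors

X = [[1, 0, 1], [1, 1, 2], [1, 2, 1], [1, 1, 0]]
t = [1, 1, -1, -1]
w_star, errs = pocket_perceptron(X, t)
```

For non-separable data the returned w* is the best boundary encountered, rather than the last one visited.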
[Figure: difficulties of building a K-class classifier from binary discriminants. Left: one-versus-the-rest discriminants ("C1 / not C1", "C2 / not C2") leave a region of ambiguous class. Right: one-versus-one discriminants (C1 vs C2, C1 vs C3, C2 vs C3) likewise leave an ambiguous region, marked "?".]
Instead, use a single K-class discriminant comprising K linear functions

    y_k(x) = w_k^T x + w_k0

The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e.

    (w_k − w_j)^T x + (w_k0 − w_j0) = 0

[Figure: each decision region R_k is convex: any point x̂ on the line segment between x_A and x_B in R_k also lies in R_k.]
Example: Kesler’s Construction

Kesler’s construction allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.

Inputs are then classified in Class i if and only if

    w_i^T x > w_j^T x  ∀ j ≠ i

The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.
Least-squares classification collects the K discriminants into

    ỹ(x) = W̃^T x̃

where x̃ = (1, x^T)^T and W̃ is a (D + 1) × K matrix whose kth column is w̃_k = (w_k0, w_k^T)^T.

Given a training dataset (x_n, t_n), n = 1, …, N, let

    E_D(W̃) = ½ Tr{ (X̃W̃ − T)^T (X̃W̃ − T) }

where the nth rows of X̃ and T are x̃_n^T and t_n^T. Setting the derivative with respect to W̃ to zero gives the closed-form solution

    W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃^† T

where X̃^† is the pseudo-inverse of X̃.
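The closed-form solution can be exercised in miniature. This sketch assumes 1-D inputs, two classes with targets t_n in {−1, +1}, and the augmented input x̃ = (1, x); with only two unknowns, the normal equations X̃^T X̃ w = X̃^T t reduce to a 2 × 2 system solved here by Cramer's rule (toy data, not from the lecture):

```python
def lstsq_classifier(xs, ts):
    n = len(xs)
    sx = sum(xs); sxx = sum(x * x for x in xs)
    st = sum(ts); sxt = sum(x * t for x, t in zip(xs, ts))
    # Solve [[n, sx], [sx, sxx]] @ [w0, w1] = [st, sxt] by Cramer's rule.
    det = n * sxx - sx * sx
    w0 = (st * sxx - sx * sxt) / det
    w1 = (n * sxt - sx * st) / det
    return w0, w1

xs = [0.0, 1.0, 3.0, 4.0]      # toy data: class -1 on the left, +1 on the right
ts = [-1, -1, 1, 1]
w0, w1 = lstsq_classifier(xs, ts)
# classify a new x by the sign of y(x) = w0 + w1 * x
```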
Fisher’s Linear Discriminant

Let

    m_1 = (1/N_1) Σ_{n∈C1} x_n,    m_2 = (1/N_2) Σ_{n∈C2} x_n

For example, we might choose w to maximize w^T (m_2 − m_1), subject to ||w|| = 1.

Let s_k² = Σ_{n∈C_k} (y_n − m_k)² be the within-class variance on the subspace for class C_k, where m_k here denotes the projected class mean.

The Fisher criterion is then

    J(w) = (m_2 − m_1)² / (s_1² + s_2²)

This can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

where

    S_B = (m_2 − m_1)(m_2 − m_1)^T

is the between-class variance, and

    S_W = Σ_{n∈C1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C2} (x_n − m_2)(x_n − m_2)^T

is the within-class variance.
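Maximizing J(w) yields the standard Fisher direction w ∝ S_W^{−1}(m_2 − m_1). A dependency-free sketch on toy 2-D data (all values hypothetical), with the 2 × 2 inverse written out by hand:

```python
def fisher_direction(X1, X2):
    def mean(X):
        n = len(X)
        return [sum(x[0] for x in X) / n, sum(x[1] for x in X) / n]

    m1, m2 = mean(X1), mean(X2)
    # within-class scatter S_W, summed over both classes
    S = [[0.0, 0.0], [0.0, 0.0]]
    for X, m in ((X1, m1), (X2, m2)):
        for x in X:
            d = [x[0] - m[0], x[1] - m[1]]
            S[0][0] += d[0] * d[0]; S[0][1] += d[0] * d[1]
            S[1][0] += d[1] * d[0]; S[1][1] += d[1] * d[1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[ S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det,  S[0][0] / det]]
    dm = [m2[0] - m1[0], m2[1] - m1[1]]
    # w = S_W^{-1} (m2 - m1)
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

X1 = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
X2 = [[3.0, 3.0], [4.0, 3.5], [3.5, 4.0]]
w = fisher_direction(X1, X2)
```

Projecting both classes onto w should separate them cleanly on this toy set.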
[Figure: two-class 2-D data projected onto the line joining the class means (left) versus the Fisher direction (right); the Fisher projection separates the classes better.]
Logistic Regression

The two-class logistic model is

    p(C1 | φ) = y(φ) = σ(w^T φ)
    p(C2 | φ) = 1 − p(C1 | φ)

where

    σ(a) = 1 / (1 + exp(−a))

[Figure 8.3: Logistic regression model in 1D and 2D. a) One-dimensional fit.]

Number of parameters:

    Logistic regression: M
    Gaussian model: 2M + 2M(M + 1)/2 + 1 = M² + 3M + 1
The likelihood of the training labels is

    p(t | w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1−t_n}

where t = (t_1, …, t_N)^T and y_n = p(C1 | φ_n).

The gradient of the resulting cross-entropy error is

    ∇E(w) = Σ_{n=1}^N (y_n − t_n) φ_n
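A minimal sketch of fitting the model by batch gradient descent on this error, assuming toy 1-D data (not from the lecture) with features φ_n = (1, x_n) and targets t_n in {0, 1}; the learning rate and step count are arbitrary choices:

```python
import math

def sigmoid(a):
    # numerically safe logistic sigmoid
    return 1.0 / (1.0 + math.exp(-a)) if a >= 0 else math.exp(a) / (1.0 + math.exp(a))

def fit_logistic(xs, ts, eta=0.05, n_steps=5000):
    w = [0.0, 0.0]                      # (w0, w1)
    for _ in range(n_steps):
        g = [0.0, 0.0]
        for x, t in zip(xs, ts):
            y = sigmoid(w[0] + w[1] * x)
            g[0] += (y - t) * 1.0       # (y_n - t_n) * phi_n, phi = (1, x)
            g[1] += (y - t) * x
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w

# overlapping toy data so the maximum-likelihood solution is finite
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ts = [0, 0, 1, 0, 1, 1]
w = fit_logistic(xs, ts)
```

The fitted probabilities increase with x, crossing 0.5 between the two clusters of labels.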
The Hessian is

    H = Φ^T R Φ

where R is the N × N diagonal weight matrix with R_nn = y_n (1 − y_n).

(Note that, since R_nn ≥ 0, R is positive semi-definite, and hence H is positive semi-definite. Thus E(w) is convex.)

Thus the Newton-Raphson update (iteratively reweighted least squares) is

    w^(new) = w^(old) − (Φ^T R Φ)^{−1} Φ^T (y − t)
For K > 2 classes, the posteriors are modeled with the softmax of the activations a_k = w_k^T φ:

    p(C_k | φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j)
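A small dependency-free sketch of the softmax; subtracting max(a) before exponentiating is a standard stability trick that leaves the result unchanged (the activation values below are hypothetical):

```python
import math

def softmax(a):
    m = max(a)                              # stability shift
    exps = [math.exp(ak - m) for ak in a]
    s = sum(exps)
    return [e / s for e in exps]

# activations a_k = w_k^T phi for three hypothetical classes
probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive and sum to one, so they can be read as class posteriors.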
[Figure: least-squares (left) versus logistic (right) classifiers on 2-D data.]
Assume for the moment that the training data are linearly separable in feature space.
Then ∃ (w, b) : t_n y(x_n) > 0  ∀ n ∈ [1, …, N]
When the training data are linearly separable, there are generally an
infinite number of solutions for (w, b) that separate the classes exactly.
The margin of such a classifier is defined as the orthogonal distance in
feature space between the decision boundary and the closest training
vector.
SVMs are an example of a maximum margin classifier, which finds the linear classifier that maximizes the margin.
[Figure: the maximum-margin decision boundary y = 0, margin boundaries y = 1 and y = −1, and the margin between them.]
Recall:

    y(x) = w^T x + w_0

    y(x) ≥ 0 → x assigned to C1

Thus y(x) = 0 defines the decision boundary; the signed distance of x from the boundary is y(x) / ||w||, and the boundary lies at distance −w_0 / ||w|| from the origin.
Thus, wlog, we consider only solutions that satisfy

    t_n (w^T φ(x_n) + b) = 1

for the point x_n that is closest to the decision surface.
Then all points x_n satisfy

    t_n (w^T φ(x_n) + b) ≥ 1

Now

    argmax_{w,b} { (1 / ||w||) min_n [ t_n (w^T φ(x_n) + b) ] }

    ⇒ argmin_w (1/2) ||w||²

subject to t_n (w^T φ(x_n) + b) ≥ 1  ∀ x_n.

This is a quadratic programming problem.
Solving this problem will involve Lagrange multipliers.
Maximize f(x) subject to g(x) = 0.

1. At any point on the constraint surface, ∇g(x) must be orthogonal to the surface.

Define the Lagrangian

    L(x, λ) = f(x) + λ g(x)

Then a constrained stationary point satisfies

    ∇_x L(x, λ) = 0

and

    ∂L(x, λ) / ∂λ = 0.
Example: find the stationary point (x_1*, x_2*) of

    f(x_1, x_2) = 1 − x_1² − x_2²

subject to

    g(x_1, x_2) = x_1 + x_2 − 1 = 0
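Working the example through the Lagrangian conditions gives the stationary point explicitly:

```latex
L(x_1, x_2, \lambda) = 1 - x_1^2 - x_2^2 + \lambda\,(x_1 + x_2 - 1)

\frac{\partial L}{\partial x_1} = -2x_1 + \lambda = 0 \;\Rightarrow\; x_1 = \lambda/2

\frac{\partial L}{\partial x_2} = -2x_2 + \lambda = 0 \;\Rightarrow\; x_2 = \lambda/2

\frac{\partial L}{\partial \lambda} = x_1 + x_2 - 1 = 0 \;\Rightarrow\; \lambda = 1

\Rightarrow\; (x_1^{\star}, x_2^{\star}) = \left(\tfrac{1}{2},\, \tfrac{1}{2}\right)
```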
Now maximize f(x) subject to g(x) ≥ 0.

There are 2 cases:

1. x* on the interior (e.g., x_B): here g(x) > 0 and the stationary condition is simply ∇f(x) = 0.

2. x* on the boundary (e.g., x_A): here g(x) = 0 and, at a maximum, ∇f(x) must point out of the feasible region, i.e. ∇f(x) = −µ ∇g(x) for some µ > 0.
In general, with equality constraints g_j(x) = 0 and inequality constraints h_k(x) ≥ 0, the Lagrangian is

    L(x, λ, µ) = f(x) + Σ_j λ_j g_j(x) + Σ_k µ_k h_k(x)

subject to

    h_k(x) ≥ 0
    µ_k ≥ 0
    µ_k h_k(x) = 0
For the SVM we have

    argmin_w (1/2) ||w||²,  subject to  t_n (w^T φ(x_n) + b) ≥ 1  ∀ x_n

with Lagrangian

    L(w, b, a) = (1/2) ||w||² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }
Setting the derivatives of L(w, b, a) with respect to w and b to zero gives

    w = Σ_{n=1}^N a_n t_n φ(x_n)

    Σ_{n=1}^N a_n t_n = 0

Using w = Σ_{n=1}^N a_n t_n φ(x_n), a new point x is classified by computing

    y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b

where k(x, x′) = φ(x)^T φ(x′) is the kernel function.
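A sketch of this decision rule with hypothetical values — two made-up support vectors, multipliers a_n, and a linear kernel k(x, x′) = x^T x′; nothing here is fitted to real data:

```python
def linear_kernel(x, xp):
    return sum(xi * xpi for xi, xpi in zip(x, xp))

def svm_decision(x, support, a, t, b, kernel=linear_kernel):
    # y(x) = sum_n a_n t_n k(x, x_n) + b
    return sum(an * tn * kernel(x, xn)
               for an, tn, xn in zip(a, t, support)) + b

# two hypothetical support vectors on opposite margins
support = [[1.0, 1.0], [-1.0, -1.0]]
a = [0.25, 0.25]
t = [1, -1]
b = 0.0
score = svm_decision([2.0, 2.0], support, a, t, b)
```

The sign of the score gives the predicted class; points deep on either side get scores of large magnitude.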
The Karush-Kuhn-Tucker (KKT) conditions for this constrained optimization problem are:

    a_n ≥ 0
    t_n y(x_n) − 1 ≥ 0
    a_n { t_n y(x_n) − 1 } = 0

The bias is then given by

    b = (1 / N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )

where S is the index set of support vectors and N_S is the number of support vectors.
Overlapping Class Distributions
The hard-margin objective can be written as

    Σ_{n=1}^N E_∞( t_n y(x_n) − 1 ) + λ ||w||²

where E_∞(z) is 0 if z ≥ 0, and ∞ otherwise.

To handle overlapping class distributions, introduce slack variables ξ_n ≥ 0, with ξ_n = 0 for points on or inside their correct margin boundary and

    ξ_n = | t_n − y(x_n) |  for all other points.

We now minimize

    C Σ_{n=1}^N ξ_n + (1/2) ||w||²,  where C > 0,

subject to t_n y(x_n) ≥ 1 − ξ_n, and ξ_n ≥ 0, n = 1, …, N.

[Figure: slack variables: ξ = 0 on or inside the correct margin, 0 < ξ < 1 inside the margin but correctly classified, ξ > 1 misclassified.]
Dual Representation

As in the separable case, stationarity with respect to b requires

    Σ_{n=1}^N a_n t_n = 0
Support Vectors

Thus support vectors consist of points between their margin and the decision boundary, as well as misclassified points.
Bias

The bias can again be computed by averaging over the support vectors that lie exactly on the margin, i.e. those with 0 < a_n < C and ξ_n = 0.
Solving the Quadratic Programming Problem

Maximize

    L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

subject to 0 ≤ a_n ≤ C and Σ_{n=1}^N a_n t_n = 0.

The problem is convex.

Standard solutions are generally O(N³): traditional quadratic programming techniques are often infeasible due to computation and memory requirements.

Instead, methods such as sequential minimal optimization can be used, which in practice are found to scale as O(N) – O(N²).
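For concreteness, the dual objective itself is cheap to evaluate; a dependency-free sketch with hypothetical multipliers (chosen to satisfy Σ a_n t_n = 0) and a linear kernel:

```python
def linear_kernel(x, xp):
    return sum(xi * xpi for xi, xpi in zip(x, xp))

def dual_objective(a, t, X, kernel=linear_kernel):
    # L~(a) = sum_n a_n - 1/2 sum_n sum_m a_n a_m t_n t_m k(x_n, x_m)
    N = len(a)
    quad = sum(a[n] * a[m] * t[n] * t[m] * kernel(X[n], X[m])
               for n in range(N) for m in range(N))
    return sum(a) - 0.5 * quad

X = [[1.0, 1.0], [-1.0, -1.0]]
t = [1, -1]
a = [0.25, 0.25]        # hypothetical multipliers: sum a_n t_n = 0
val = dual_objective(a, t, X)
```

The double sum over n and m is what makes the naive approach O(N²) in memory and O(N³) to solve.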
Note that evaluating the quadratic term of L̃(a) requires the N × N matrix A with entries A_nm = a_n a_m t_n t_m k(x_n, x_m), so memory alone is O(N²).
Chunking (Vapnik, 1982)
1. Select a small number (a ‘chunk’) of training vectors
2. Solve the QP problem for this subset
3. Retain only the support vectors
4. Consider another chunk of the training data
5. Ignore the subset of vectors in all chunks considered so far that lie on the correct side of
the margin, since these do not contribute to the cost function
6. Add the remainder to the current set of support vectors and solve the new QP problem
7. Return to Step 4
8. Repeat until the set of support vectors does not change.
This method reduces memory requirements to O(N_S²), where N_S is the number of support vectors. This may still be big!
Decomposition Methods

It can be shown that the global QP problem is solved when all training vectors satisfy the following optimality conditions:

    a_i = 0      ⟹ t_i y(x_i) ≥ 1.
    0 < a_i < C  ⟹ t_i y(x_i) = 1.
    a_i = C      ⟹ t_i y(x_i) ≤ 1.
Sequential minimal optimization (SMO) reduces the working set to just two vectors. The advantage is that in this case the QP problem can be solved analytically.

LIBSVM uses SMO; we will use it for Assignment 2.
Example training pipeline:

    Face images     → Preprocess: subsample & normalize (µ = 0, σ² = 1) → svmtrain
    Non-face images → Preprocess: subsample & normalize (µ = 0, σ² = 1) → svmtrain
LIBSVM Example: MATLAB Interface

[Figure: learned decision boundary in the input space (x_1, x_2).]
Relation to Logistic Regression

The objective function for the soft-margin SVM can be written as:

    Σ_{n=1}^N E_SV(y_n t_n) + λ ||w||²

where E_SV(z) = max(0, 1 − z) is the hinge error. Regularized logistic regression minimizes the analogous objective

    Σ_{n=1}^N E_LR(y_n t_n) + λ ||w||²

where E_LR(z) = log(1 + exp(−z)).
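A small sketch comparing the two error functions; note the hinge error is exactly zero beyond the margin (z ≥ 1), while the logistic error is positive everywhere:

```python
import math

def e_sv(z):
    # hinge error used by the soft-margin SVM
    return max(0.0, 1.0 - z)

def e_lr(z):
    # logistic (cross-entropy) error used by logistic regression
    return math.log(1.0 + math.exp(-z))

# both penalize badly misclassified points (z << 0) roughly linearly,
# but only the hinge error vanishes for confidently correct points
pairs = [(e_sv(z), e_lr(z)) for z in (-2, -1, 0, 1, 2)]
```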
[Figure: the hinge error E_SV and the logistic error E_LR plotted as functions of z = y t over [−2, 2].]