
Kernel Machines

CSL465/603 - Fall 2016


Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Optimal Separating Hyperplane
Support Vector Machine
Linearly Separable Case
Soft Margin

Kernel Functions
Support Vector Regression


Kernel Machines
Or, support vector machine (SVM)
Discriminant-based method
Learn class boundaries

The support vectors are the examples closest to the boundary
Kernel computes similarity between examples
Maps instance space to a higher-dimensional space
where (hopefully) linear models suffice

Choosing the right kernel is crucial


Kernel machines among best-performing learners

Optimal Separating Hyperplane (1)

Given a dataset $\mathcal{X} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where
the class labels are binary, $y_i \in \{-1, +1\}$
Find $\mathbf{w}$ and $w_0$ such that
$\mathbf{w}^T \mathbf{x}_i + w_0 \ge +1$ for $y_i = +1$
$\mathbf{w}^T \mathbf{x}_i + w_0 \le -1$ for $y_i = -1$
Which can be combined and written as
$y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge +1$, $i = 1, \dots, N$
Note we want $\ge +1$, and not $\ge 0$:
we want instances at some distance from the
hyperplane.

Optimal Separating Hyperplane (2)

[Figure: a separating hyperplane with its normal vector $\mathbf{w}$]


Margin
Distance of a point $\mathbf{x}_i$ to the hyperplane $\mathbf{w}^T \mathbf{x} + w_0 = 0$:
$\dfrac{|\mathbf{w}^T \mathbf{x}_i + w_0|}{\|\mathbf{w}\|}$ or $\dfrac{y_i(\mathbf{w}^T \mathbf{x}_i + w_0)}{\|\mathbf{w}\|}$
The distance from the hyperplane to the closest
instances is the margin.
[Figure: hyperplane, weight vector $\mathbf{w}$, and the margin]

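As a quick numerical check of the distance formula above, here is a minimal numpy sketch; the data X, y and the hyperplane parameters w, w0 are made-up illustrative values, not from the slides:

```python
import numpy as np

# Hypothetical 2-D data: two classes labelled +1 / -1 (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# A candidate hyperplane w^T x + w0 = 0.
w = np.array([1.0, 1.0])
w0 = -1.0

# Signed distance y_i (w^T x_i + w0) / ||w|| of each instance to the hyperplane.
dist = y * (X @ w + w0) / np.linalg.norm(w)
print("distances:", dist)

# The margin is the distance to the closest instance(s).
print("margin:", dist.min())
```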

Optimal Separating Hyperplane (3)

The optimal separating hyperplane is the one
maximizing the margin.
We want to choose $(\mathbf{w}, w_0)$ maximizing $\rho$ such that
$\dfrac{y_i(\mathbf{w}^T \mathbf{x}_i + w_0)}{\|\mathbf{w}\|} \ge \rho$, $\forall i$
There are an infinite number of solutions obtained by scaling $\mathbf{w}$,
so fix $\rho \|\mathbf{w}\| = 1$.
Thus, we choose the solution minimizing $\|\mathbf{w}\|$:
$\min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1$, $\forall i$

Optimal Separating Hyperplane (4)

[Figure: maximizing the margin between the two classes]


Optimal Separating Hyperplane (5)

$\min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1$, $\forall i$

This is a quadratic optimization problem.
Its complexity is in terms of the number of features, $d$.
Later we will talk about kernels that map the
points to an even higher-dimensional space!
We prefer a complexity not based on $d$.
Transform it to a problem where the complexity
depends on the number of training examples $N$;
more precisely, on the points that are on the margin (the support vectors)


Lagrange Multipliers (1)


Minimize $f(\theta)$
subject to $h_i(\theta) = 0$, for $i = 1, 2, \dots, l$
At the solution $\theta^*$, $\nabla f$ must lie in the subspace
spanned by $\nabla h_i(\theta^*)$, $i = 1, 2, \dots, l$
Lagrangian function:
$L(\theta, \beta) = f(\theta) + \sum_i \beta_i h_i(\theta)$
Lagrange multipliers: $\beta_i$

Solve $\nabla L(\theta, \beta) = 0$


Lagrange Multipliers (2)


Primal and Dual Problems


The problem over $\theta$ is the primal
Solve the equations for $\theta$ and substitute
The resulting problem over the multipliers is the dual
If it is easier, solve the dual instead of the primal
In SVMs:
Primal problem is over feature weights
Dual problem is over instance weights


What happens with inequality constraints?

Minimize $f(\theta)$
subject to $h_i(\theta) = 0$, for $i = 1, 2, \dots, l$
and $g_i(\theta) \le 0$, for $i = 1, 2, \dots, m$
Lagrange multipliers for the inequalities: $\alpha_i$
Lagrangian function:
$L(\theta, \beta, \alpha) = f(\theta) + \sum_i \beta_i h_i(\theta) + \sum_i \alpha_i g_i(\theta)$


Karush-Kuhn-Tucker Conditions
KKT conditions:
$\nabla L(\theta, \beta, \alpha) = 0$
$g_i(\theta) \le 0$
$\alpha_i \ge 0$
$\alpha_i g_i(\theta) = 0$
$h_i(\theta) = 0$

Complementarity: either a constraint is active
($g_i(\theta) = 0$) or its multiplier is zero ($\alpha_i = 0$)
In SVMs: an active constraint implies a support vector
Exercise: show that these are necessary and sufficient conditions for a convex problem

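As a numerical illustration of the KKT conditions (my own toy problem, not from the slides), the following sketch solves a small inequality-constrained problem with SciPy and checks complementarity:

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = (x1-2)^2 + (x2-2)^2  subject to g(x) = x1 + x2 - 2 <= 0.
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 2)])

# SciPy's 'ineq' convention is fun(x) >= 0, so pass -g.
cons = [{"type": "ineq", "fun": lambda x: 2 - x[0] - x[1]}]
res = minimize(f, x0=np.zeros(2), constraints=cons)
x_star = res.x                      # expected: approximately (1, 1)

# Recover the multiplier from stationarity: grad f + alpha * grad g = 0, with grad g = (1, 1).
alpha = -grad_f(x_star)[0]          # expected: approximately 2
g_val = x_star[0] + x_star[1] - 2   # expected: approximately 0 (constraint active)

print("x* =", x_star)
print("alpha =", alpha, ">= 0")
print("complementarity alpha * g(x*) =", alpha * g_val)  # approximately 0
```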

Lagrange Multipliers SVM (1)


$\min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1$, $\forall i$

Rewrite the quadratic constrained optimization
problem using Lagrange multipliers $\alpha_i$, $i = 1, \dots, N$


Lagrange Multipliers SVM (2)


$\min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1$, $\forall i$

Rewrite the quadratic constrained optimization
problem using Lagrange multipliers $\alpha_i$, $i = 1, \dots, N$:
$L_p = \max_{\alpha_i \ge 0} \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 \right]$


Lagrange Multipliers SVM (3)


$L_p = \max_{\alpha_i \ge 0} \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i(\mathbf{w}^T \mathbf{x}_i + w_0) + \sum_{i=1}^{N} \alpha_i$

Setting the derivatives to zero:
$\dfrac{\partial L_p}{\partial \mathbf{w}} = 0$
$\dfrac{\partial L_p}{\partial w_0} = 0$


Lagrange Multipliers SVM (4)


$L_p = \max_{\alpha_i \ge 0} \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i(\mathbf{w}^T \mathbf{x}_i + w_0) + \sum_{i=1}^{N} \alpha_i$

$\dfrac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
$\dfrac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i y_i = 0$

Plugging these back into $L_p$
(these are KKT conditions)


Dual Formulation SVM (1)


$L_d = \max_{\alpha_i} \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i - w_0 \sum_{i=1}^{N} \alpha_i y_i + \sum_{i=1}^{N} \alpha_i$
Substituting $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$ simplifies this expression.


Dual Formulation SVM (2)


$L_d = \max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i$

Subject to
$\sum_{i=1}^{N} \alpha_i y_i = 0$
$\alpha_i \ge 0$


Dual Formulation SVM (3)


Quadratic optimization methods can be used to
solve this maximization problem
Complexity: $O(N^3)$
The size of the dual depends on the sample size $N$
(and not on the feature dimension $d$)

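The dual can be handed to any quadratic programming routine. Below is a minimal sketch using SciPy's SLSQP on a tiny made-up dataset; a practical implementation would use a dedicated QP or SMO solver instead:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j x_i^T x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Dual objective (negated, since we minimize): 1/2 a^T Q a - sum(a)
dual = lambda a: 0.5 * a @ Q @ a - a.sum()
dual_grad = lambda a: Q @ a - np.ones(N)

constraints = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
bounds = [(0, None)] * N            # alpha_i >= 0 (hard margin)

res = minimize(dual, np.zeros(N), jac=dual_grad,
               bounds=bounds, constraints=constraints)
alpha = res.x
print("alpha =", np.round(alpha, 4))   # non-zero entries mark the support vectors
```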

Support Vector Machine (1)


Most $\alpha_i = 0$
$\alpha_i$ for points that lie outside the margin are $0$
Support vectors: points such that $\alpha_i > 0$
$\alpha_i$ for points that lie on the margin are $> 0$

$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
$w_0 = y_i - \mathbf{w}^T \mathbf{x}_i$ for any support vector
Typically $w_0$ is averaged over all support vectors

The resulting discriminant is called the support vector
machine
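Continuing the hypothetical sketch above (reusing its X, y and the alpha returned by the dual solve), w and w0 can be recovered from the support vectors like this:

```python
import numpy as np

# Assumes X, y, alpha from the dual sketch above.
sv = alpha > 1e-6                    # indices of the support vectors

# w = sum_i alpha_i y_i x_i
w = (alpha[sv] * y[sv]) @ X[sv]

# w0 = y_i - w^T x_i for any support vector; average over all of them for stability.
w0 = np.mean(y[sv] - X[sv] @ w)

# Discriminant: g(x) = w^T x + w0; the prediction is its sign.
print("w =", w, " w0 =", w0)
print("predictions:", np.sign(X @ w + w0))
```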

Support Vector Machine (2)


[Figure: decision boundary and margin; circled points (O) are the support vectors]


Soft Margin Hyperplane


The data is not linearly separable
Find the hyperplane with the least error
Define slack variables $\xi_i \ge 0$ that store the deviation
from the margin:
$y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1 - \xi_i$


Soft Error (1)

[Figure: soft margin with slack; circled points (O) are the support vectors]


Soft Error
Correctly classified example, far from the margin
(but on the correct side): $\xi_i = 0$
Correctly classified example on the margin: $\xi_i = 0$
Correctly classified example, but inside the margin: $0 < \xi_i < 1$
Incorrectly classified example (away from the
margin on the wrong side): $\xi_i \ge 1$
Therefore the total error can be summarized in
terms of the $\xi_i$'s
Soft error: $\sum_{i=1}^{N} \xi_i$

Soft Margin Hyperplane (1)


$\min_{\mathbf{w}, w_0, \xi} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1 - \xi_i$, and $\xi_i \ge 0$, $\forall i$
where $C$ is a penalty factor that stresses the
importance of reducing the soft error.
Using Lagrange multipliers:


Soft Margin Hyperplane (2)


$L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i$

This is the primal problem.

Applying the KKT conditions for optimality:
$\dfrac{\partial L_p}{\partial \mathbf{w}} = 0$
$\dfrac{\partial L_p}{\partial w_0} = 0$
$\dfrac{\partial L_p}{\partial \xi_i} = 0$



Soft Margin Hyperplane (3)


Thus the dual problem is
$L_d = \max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i$
Subject to
$\sum_{i=1}^{N} \alpha_i y_i = 0$
$0 \le \alpha_i \le C$


Soft Margin Hyperplane (4)


Thus the dual problem is
$L_d = \max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i$
Subject to
$\sum_{i=1}^{N} \alpha_i y_i = 0$
$0 \le \alpha_i \le C$
Think of $C$ as a regularization parameter
High $C$: high penalty for non-separable examples (may overfit)
Low $C$: less penalty (may underfit)
$C$ is determined using a validation set.
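A minimal scikit-learn sketch of choosing C on held-out data; the dataset, the C grid and the train/validation split are all illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical noisy, non-separable 2-D data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200) > 0, 1, -1)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pick C on the validation set: large C penalizes slack heavily (risk of overfitting),
# small C tolerates margin violations (risk of underfitting).
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    print(f"C={C:<6} validation accuracy = {clf.score(X_val, y_val):.3f}")
```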

Soft Margin Hyperplane (5)


Thus the dual problem is
$L_d = \max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i$
Subject to
$\sum_{i=1}^{N} \alpha_i y_i = 0$
$0 \le \alpha_i \le C$
This is a quadratic optimization problem
Support vectors have $\alpha_i > 0$
Misclassified examples have $\alpha_i = C$
Correctly classified examples outside the margin have $\alpha_i = 0$

Kernels (1)


Kernels (2)


Kernels (3)
Prior approaches assumed that
the data is linearly separable, or obtained the best linear fit
If the data is not linearly separable in the current
space,
perhaps it is linearly separable in some other, higher-
dimensional space.
Let $\phi$ be the function transforming the data to the higher-
dimensional space, $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^k$ (basis functions)


Kernels (4)
Transform the $d$-dimensional input feature space (the x-space)
to a $k$-dimensional feature space using $\phi$
Call the $k$-dimensional feature space the z-space
$\mathbf{z} = \phi(\mathbf{x})$, where $z_j = \phi_j(\mathbf{x})$, $j = 1, \dots, k$
Assume the data is linearly separable in the z-space:
$g(\mathbf{z}) = \mathbf{w}^T \mathbf{z} + w_0$
$g(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + w_0 = \sum_{j=0}^{k} w_j \phi_j(\mathbf{x})$
assuming $\phi_0(\mathbf{x}) = 1$ (so that $w_0$ is absorbed into the sum)

Soft Margin Hyperplane with Kernels (1)

$L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i$

$L_d = \max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i$

Soft Margin Hyperplane with Kernels (2)

In the z-space:
$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i = \sum_{i=1}^{N} \alpha_i y_i \phi(\mathbf{x}_i)$
$g(\mathbf{z}) = \mathbf{w}^T \mathbf{z} = \sum_{i=1}^{N} \alpha_i y_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$


Kernel Functions
While solving the optimization and applying the solution,
we observe that we only need to compute
$\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, the inner product of two points in the z-space

To make our life easy, define a kernel function (matrix)
$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$
The matrix of pairwise kernel values is also called the Gram matrix

So, we need not know what $\phi$ is. If we can define a
kernel function, then our job is done: the Kernel Trick
Can we use any function as a kernel function (matrix)?

Mercer's theorem: Let $K: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ be given. Then for $K$
to be a valid kernel, it is necessary and sufficient that for any
$\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ ($N < \infty$), the corresponding kernel matrix is
symmetric and positive semidefinite.

Positive Semidefinite (1)


Akin to saying, for numbers, that a number is $\ge 0$
To prove a matrix $K \in \mathbb{R}^{N \times N}$ is positive semidefinite, show
that for any $\mathbf{x} \in \mathbb{R}^{N}$, $\mathbf{x}^T K \mathbf{x} \ge 0$
Show that if $K(i, j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, then $K$ is
symmetric and positive semidefinite.

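A numerical sanity check (not a proof) of this claim, using a made-up explicit feature map phi:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 hypothetical instances in R^3

# A made-up explicit feature map phi(x); any phi gives K_ij = phi(x_i)^T phi(x_j).
phi = lambda x: np.array([x[0], x[1] ** 2, x[0] * x[2], 1.0])

Z = np.array([phi(x) for x in X])             # rows are phi(x_i)
K = Z @ Z.T                                   # Gram matrix K_ij = phi(x_i)^T phi(x_j)

print("symmetric:", np.allclose(K, K.T))
# All eigenvalues >= 0 (up to numerical error) => positive semidefinite.
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```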

Positive Semidefinite (2)


Positive Semidefinite (3)


Positive Semidefinite (4)


Examples of Kernel Functions (1)


Polynomial kernel of degree $q$:
$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^q$
Suppose $q = 1$; then we are using the original
features


Examples of Kernel Functions (2)


Polynomial kernel of degree $q$:
$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^q$
Suppose $q = 1$; then we are using the original
features
Suppose $q = 2$: the quadratic kernel
$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^2$


Examples of Kernel Functions (3)


Polynomial kernel of degree 2
[Figure: decision boundary and margin; circled points (O) are the support vectors]


Examples of Kernel Functions (5)


Polynomial kernel of degree $q$:
$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^q$
Suppose $q = 1$; then we are using the original
features
Suppose $q = 2$: the quadratic kernel
$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^2$
What is the dimension of the z-space?


Examples of Kernel Functions (6)


Radial Basis Function (Gaussian) kernel:
$K(\mathbf{x}, \mathbf{x}') = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right)$
Gaussian kernel width (radius): $\sigma$
Larger $\sigma$ implies smoother boundaries


Examples of Kernel Functions (7)


Radial Basis Functions (Gaussian Kernel)


Examples of Kernel Functions (8)


Radial Basis Function (Gaussian) kernel:
$K(\mathbf{x}, \mathbf{x}') = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right)$
Gaussian kernel width (radius): $\sigma$
Larger $\sigma$ implies smoother boundaries
What is the dimension of the z-space?

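Minimal numpy implementations of these two kernels as Gram-matrix functions; the hyperparameter values q and sigma are arbitrary:

```python
import numpy as np

def polynomial_kernel(X1, X2, q=2):
    """K(x, x') = (x^T x' + 1)^q, computed for all pairs of rows."""
    return (X1 @ X2.T + 1.0) ** q

def rbf_kernel(X1, X2, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows."""
    sq_dist = (np.sum(X1 ** 2, axis=1)[:, None]
               + np.sum(X2 ** 2, axis=1)[None, :]
               - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Example Gram matrices on hypothetical data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
print(polynomial_kernel(X, X).shape, rbf_kernel(X, X).shape)   # (5, 5) (5, 5)
```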

Defining Kernels (1)


Think of $K(\mathbf{x}, \mathbf{x}')$ as a similarity measure between $\mathbf{x}$
and $\mathbf{x}'$.
Prior knowledge can be included in the kernel
function
E.g., training examples are documents:
$K(\mathbf{x}, \mathbf{x}') =$ # shared words

E.g., training examples are strings (e.g., DNA):
$K(\mathbf{x}, \mathbf{x}') = 1\,/\,$edit distance between $\mathbf{x}$ and $\mathbf{x}'$
Edit distance is the number of insertions, deletions and/or
substitutions needed to transform $\mathbf{x}$ into $\mathbf{x}'$

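For instance, the shared-words document kernel above could be sketched as follows (the two toy documents are made up):

```python
def shared_words_kernel(doc1, doc2):
    """K(x, x') = number of distinct words the two documents share."""
    return len(set(doc1.lower().split()) & set(doc2.lower().split()))

d1 = "the cat sat on the mat"
d2 = "the dog sat on the rug"
print(shared_words_kernel(d1, d2))   # shares: the, sat, on -> 3
```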

Defining Kernels (2)


E.g., training examples are nodes in a graph (e.g., a
social network)
$K(\mathbf{x}, \mathbf{x}') = 1\,/\,$length of the shortest path connecting the nodes
$K(\mathbf{x}, \mathbf{x}') =$ # paths connecting the nodes
Diffusion kernel (using the Laplacian of the graph)

E.g., training examples are graphs, not feature
vectors
E.g., carcinogenic vs. non-carcinogenic chemical
structures
Compare substructures of the graphs
E.g., walks, paths, cycles, trees, subgraphs
$K(\mathbf{x}, \mathbf{x}') =$ number of identical random walks in both
graphs
$K(\mathbf{x}, \mathbf{x}') =$ number of subgraphs shared by both graphs

Combining Kernels (1)


Training data from multiple modalities (e.g.,
biometrics, social networks, audio/visual)
Construct new kernels by combining simpler kernels
If $K_1(\mathbf{x}, \mathbf{x}')$ and $K_2(\mathbf{x}, \mathbf{x}')$ are valid kernels, and $c$ is a positive
constant, then the following are also valid kernels:
$K(\mathbf{x}, \mathbf{x}') = c\,K_1(\mathbf{x}, \mathbf{x}')$
$K(\mathbf{x}, \mathbf{x}') = K_1(\mathbf{x}, \mathbf{x}') + K_2(\mathbf{x}, \mathbf{x}')$
$K(\mathbf{x}, \mathbf{x}') = K_1(\mathbf{x}, \mathbf{x}')\,K_2(\mathbf{x}, \mathbf{x}')$
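A quick sketch of these closure rules, reusing the hypothetical polynomial_kernel and rbf_kernel defined earlier:

```python
import numpy as np

# Assumes polynomial_kernel and rbf_kernel from the earlier sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))

K1 = polynomial_kernel(X, X, q=2)
K2 = rbf_kernel(X, X, sigma=1.0)
c = 0.5

for K in (c * K1, K1 + K2, K1 * K2):          # scaled, sum, elementwise product
    # Each combination is still symmetric and positive semidefinite.
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-8)
```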

Combining Kernels (2)


Adaptive kernel combination:
$K(\mathbf{x}, \mathbf{x}') = \sum_{m} \eta_m K_m(\mathbf{x}, \mathbf{x}')$
The dual becomes
$\max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \sum_{m} \eta_m K_m(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{N} \alpha_i$
$g(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i \sum_{m} \eta_m K_m(\mathbf{x}_i, \mathbf{x})$
Learn $\alpha_i$ and $\eta_m$ through optimization.



Loss Function for SVM (1)


Recollect the primal formulation:
$\min_{\mathbf{w}, w_0, \xi} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
subject to $y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1 - \xi_i$, and $\xi_i \ge 0$, $\forall i$


Loss Function for SVM (2)


Zero/one loss:
not continuous, NP-hard to optimize
Squared loss:
data points far from the decision boundary have significant influence
Hinge loss:
used by the SVM
Logarithmic (logistic) loss:
used by logistic regression

[Figure (Bishop, Fig. 7.5): the hinge error function used in support vector machines,
plotted against $z = y\,g(\mathbf{x})$, along with the logistic regression error rescaled by
$1/\ln 2$ so that it passes through $(0, 1)$, the misclassification error, and the squared error]

With $\xi_n = 1 - y_n g(\mathbf{x}_n)$ for margin violations, the SVM objective can be written
(up to an overall multiplicative constant) as $\sum_{n=1}^{N} E_{SV}(y_n g(\mathbf{x}_n)) + \lambda \|\mathbf{w}\|^2$,
where $E_{SV}$ is the hinge loss.
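A plotting-free numpy sketch comparing these losses as functions of the margin value z = y g(x) (the sample z values are arbitrary):

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)          # misclassification error

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)        # SVM loss: zero once z >= 1

def squared_loss(z):
    return (1.0 - z) ** 2                  # penalizes points far on the correct side too

def logistic_loss(z):
    return np.log1p(np.exp(-z)) / np.log(2.0)   # rescaled to pass through (0, 1)

z = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])
for name, fn in [("0/1", zero_one_loss), ("hinge", hinge_loss),
                 ("squared", squared_loss), ("logistic", logistic_loss)]:
    print(f"{name:>8}: {np.round(fn(z), 3)}")
```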

SVM multi-class classification (1)


Learn $K$ different kernel machines (one-vs-rest)
Each uses one class as
positive and the remaining
classes as negative
The predicted class is
$\arg\max_k g_k(\mathbf{x})$

Works best in practice

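A one-vs-rest sketch in scikit-learn terms; the 3-class dataset is synthetic and the RBF kernel choice is arbitrary:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical 3-class data.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# One kernel machine per class: class k vs the rest.
machines = {k: SVC(kernel="rbf").fit(X, (y == k).astype(int)) for k in np.unique(y)}

# Predict the class whose machine gives the largest discriminant value g_k(x).
scores = np.column_stack([machines[k].decision_function(X) for k in sorted(machines)])
y_pred = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_pred == y))
```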

SVM multi-class classification (2)


Learn K(K-1)/2 kernel
machines
Each uses one class as
positive and another
class as negative
Easier (faster) learning
per kernel machine


SVM multi-class classification (3)


Learn all margins at once:
$\min \sum_{k=1}^{K} \frac{1}{2}\|\mathbf{w}_k\|^2 + C \sum_{k=1}^{K} \sum_{i=1}^{N} \xi_i^k$
Subject to
$\mathbf{w}_{y_i}^T \mathbf{x}_i + w_{y_i 0} \ge \mathbf{w}_k^T \mathbf{x}_i + w_{k0} + 2 - \xi_i^k$, $\forall k \ne y_i$, $\xi_i^k \ge 0$
where the class label for $\mathbf{x}_i$ is $y_i$
It is an expensive optimization problem.


Optimizing the SVM Objective Function - SMO (1)

Diversion: coordinate ascent
Consider the unconstrained optimization problem
$\max_{\alpha} W(\alpha_1, \dots, \alpha_N)$
Until convergence:
For $i = 1, \dots, N$:
$\alpha_i = \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_N)$

[Figure: coordinate ascent on the contours of a quadratic function, updating one coordinate at a time]

Optimizing the SVM Objective Function - SMO (2)

Recall the SVM dual optimization problem:
$L_d = \max_{\alpha_i} -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{N} \alpha_i$
Subject to
$\sum_{i=1}^{N} \alpha_i y_i = 0$
$0 \le \alpha_i \le C$
Can we perform coordinate ascent here?
Optimize with respect to one $\alpha_i$ by fixing the remaining?
(Not directly: the equality constraint $\sum_i \alpha_i y_i = 0$ would then fix the chosen $\alpha_i$ as well,
so at least two multipliers must be updated together.)

Optimizing the SVM Objective Function - SMO (3)

Sequential Minimal Optimization:
Repeat till convergence:
Select a pair $\alpha_i$ and $\alpha_j$ to optimize.
Re-optimize $L_d$ with respect to $\alpha_i$ and $\alpha_j$, while holding all
the other $\alpha_k$ ($k \ne i, j$) fixed

Convergence is determined by checking if the KKT
conditions are satisfied.
Suppose we fix $\alpha_3, \dots, \alpha_N$; then
$\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{N} \alpha_i y_i$

Optimizing the SVM Objective Function - SMO (4)
(Platt, MSR-TR-98-14)

There are two components to SMO: an analytic method for solving for the two Lagrange
multipliers, and a heuristic for choosing which multipliers to optimize.
The two Lagrange multipliers must fulfill all of the constraints of the full problem.
The inequality constraints cause them to lie in the box $[0, C] \times [0, C]$; the linear
equality constraint causes them to lie on a diagonal line:
$y_1 \ne y_2 \Rightarrow \alpha_1 - \alpha_2 = k$
$y_1 = y_2 \Rightarrow \alpha_1 + \alpha_2 = k$
Therefore, one step of SMO must find an optimum of the objective function on a
diagonal line segment.
[Figure 1 from Platt: the box constraints and the diagonal equality-constraint line for the two multipliers]

Optimizing the SVM Objective Function - SMO (5)
(Platt, MSR-TR-98-14)

Without loss of generality, let us first compute the
second Lagrange multiplier $\alpha_2$
Compute the ends of the diagonal line segment in terms of $\alpha_2$
[Figure 1 from Platt, repeated: the box constraints and the diagonal line for the two multipliers]

Optimizing the SVM Objective Function - SMO (6)
(Platt, MSR-TR-98-14)

Without loss of generality, let us first compute the
second Lagrange multiplier $\alpha_2$
Compute the ends $(L, H)$ of the diagonal line segment in terms of $\alpha_2$:
If $y_1 \ne y_2$: $L = \max(0, \alpha_2 - \alpha_1)$, $H = \min(C, C + \alpha_2 - \alpha_1)$
If $y_1 = y_2$: $L = \max(0, \alpha_2 + \alpha_1 - C)$, $H = \min(C, \alpha_2 + \alpha_1)$

Optimizing the SVM Objective Function - SMO (7)

Rewrite $L_d$ as a function of $\alpha_2$
It is a quadratic function of the form $a\alpha_2^2 + b\alpha_2 + c$
Find the optimal value by differentiating with respect to
$\alpha_2$ and equating it to 0:
$\alpha_2^{new} = \alpha_2 + \dfrac{y_2(E_1 - E_2)}{\eta}$
where
$E_i = \left( \sum_{j=1}^{N} \alpha_j y_j \mathbf{x}_j^T \mathbf{x}_i + w_0 \right) - y_i$
$\eta = (\mathbf{x}_1 - \mathbf{x}_2)^T(\mathbf{x}_1 - \mathbf{x}_2)$

Finally clip the value of $\alpha_2^{new}$:
$\alpha_2^{new,clipped} = \begin{cases} H, & \text{if } \alpha_2^{new} > H \\ \alpha_2^{new}, & \text{if } L \le \alpha_2^{new} \le H \\ L, & \text{if } \alpha_2^{new} < L \end{cases}$

Optimizing the SVM Objective Function - SMO (8)

Now compute the new value for $\alpha_1$:
$\alpha_1^{new} = \alpha_1^{old} + y_1 y_2 \left( \alpha_2^{old} - \alpha_2^{new,clipped} \right)$
The intercept term $w_0$ is updated to ensure that the KKT
conditions are satisfied for the 1st and 2nd examples
Refer to the reading material for the exact equation
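Putting the last few slides together, here is a sketch of a single SMO pair update (illustrative only, not Platt's full algorithm; the bias update is left as a comment, since the slide defers it to the reading material):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C):
    """One SMO step: re-optimize alpha_i, alpha_j with all other multipliers fixed."""
    # Prediction errors E_k = g(x_k) - y_k, with g(x) = sum_l alpha_l y_l K(x_l, x) + b.
    E_i = (alpha * y) @ K[:, i] + b - y[i]
    E_j = (alpha * y) @ K[:, j] + b - y[j]

    # Ends (L, H) of the feasible segment for alpha_j.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0 or L == H:
        return alpha, b                       # skip degenerate pairs in this sketch

    # Unclipped optimum for alpha_j, then clip to [L, H].
    a_j_new = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    # alpha_i moves so that sum_k alpha_k y_k = 0 still holds.
    a_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - a_j_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i_new, a_j_new
    # (The bias b would be updated here so the KKT conditions hold for examples i and j.)
    return alpha, b
```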

Optimizing the SVM Objective Function - SMO (9)

Comparison with the chunking approach
(a fixed number of examples is added at every step, while
discarding examples with zero Lagrange multipliers)
The timing performance of the SMO algorithm versus the chunking algorithm
for the linear SVM on the adult data set is shown in the table below
(Platt, MSR-TR-98-14):

Training Set Size | SMO time (s) | Chunking time (s) | Non-Bound SVs | Bound SVs
            1605  |         0.4  |            37.1   |           42  |      633
            2265  |         0.9  |           228.3   |           47  |      930
            3185  |         1.8  |           596.2   |           57  |     1210
            4781  |         3.6  |          1954.2   |           63  |     1791
            6414  |         5.5  |          3684.6   |           61  |     2370
           11221  |        17.0  |         20711.3   |           79  |     4079
           16101  |        35.3  |             N/A   |           67  |     5854
           22697  |        85.7  |             N/A   |           88  |     8209
           32562  |       163.6  |             N/A   |          149  |    11558

The training set size was varied by taking random subsets of the full training set.
These subsets are nested. The "N/A" entries in the chunking time column had matrices
that were too large to fit into 128 Megabytes, hence could not be timed due to memory
thrashing. The number of non-bound and bound support vectors was determined from SMO;
the chunking results vary by a small amount, due to the tolerance of inaccuracies
around the KKT conditions.

Support Vector Regression (1)


Normally, we would use the squared error:
$E(y_i, g(\mathbf{x}_i)) = (y_i - g(\mathbf{x}_i))^2$, with $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$
For support vector regression, we use the $\epsilon$-insensitive
loss:
$E_\epsilon(y_i, g(\mathbf{x}_i)) = \begin{cases} 0, & \text{if } |y_i - g(\mathbf{x}_i)| < \epsilon \\ |y_i - g(\mathbf{x}_i)| - \epsilon, & \text{otherwise} \end{cases}$
Tolerate errors up to $\epsilon$
Errors beyond $\epsilon$ have only a linear effect

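A minimal numpy sketch contrasting the squared and the epsilon-insensitive losses (the residual values and epsilon are arbitrary):

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def eps_insensitive_loss(y, f, eps=0.5):
    # Zero inside the epsilon tube, linear beyond it.
    return np.maximum(0.0, np.abs(y - f) - eps)

residuals = np.array([0.1, 0.4, 0.6, 1.5, 3.0])
print(squared_loss(residuals, 0.0))
print(eps_insensitive_loss(residuals, 0.0))   # [0, 0, 0.1, 1.0, 2.5]
```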

Support Vector Regression (2)


Use slack variables to account for deviations
beyond $\epsilon$:
$\xi_i^+$ for positive deviations
$\xi_i^-$ for negative deviations

Thus the SVR formulation is
$\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} (\xi_i^+ + \xi_i^-)$
Subject to
$y_i - (\mathbf{w}^T \mathbf{x}_i + w_0) \le \epsilon + \xi_i^+$
$(\mathbf{w}^T \mathbf{x}_i + w_0) - y_i \le \epsilon + \xi_i^-$
$\xi_i^+, \xi_i^- \ge 0$

Support Vector Regression (3)


Introduce Lagrange multipliers $\alpha_i^+$ and $\alpha_i^-$ and
formulate the dual:
$L_d = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\, \mathbf{x}_i^T \mathbf{x}_j - \epsilon \sum_{i=1}^{N} (\alpha_i^+ + \alpha_i^-) + \sum_{i=1}^{N} y_i (\alpha_i^+ - \alpha_i^-)$
Subject to
$0 \le \alpha_i^+ \le C$, $0 \le \alpha_i^- \le C$, $\sum_{i=1}^{N} (\alpha_i^+ - \alpha_i^-) = 0$

Support Vector Regression (4)


Non-support vectors lie inside the margin (the $\epsilon$-tube): $\alpha_i^+ = \alpha_i^- = 0$

Support vectors:
On the margin (tube boundary): $0 < \alpha_i^+ < C$ or $0 < \alpha_i^- < C$
Outside the margin (outliers): $\alpha_i^+ = C$ or $\alpha_i^- = C$


Support Vector Regression (5)


The final fitted line is a weighted sum of the support vectors:
$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = \sum_{i=1}^{N} (\alpha_i^+ - \alpha_i^-)\, \mathbf{x}_i^T \mathbf{x} + w_0$
Average $w_0$ over the support vectors on the tube boundary:
$y_i = \mathbf{w}^T \mathbf{x}_i + w_0 + \epsilon$, if $0 < \alpha_i^+ < C$
$y_i = \mathbf{w}^T \mathbf{x}_i + w_0 - \epsilon$, if $0 < \alpha_i^- < C$
Similar to classification, this can be extended to use
a kernel function.


Support Vector Regression (6)


[Figures: SVR fits with a polynomial kernel and with a Gaussian kernel]

Support Vector Machines Implementations


Weka
Classification: SMO
Regression: SMOreg

LibSVM toolbox: the most popular SVM toolbox (C++)
Matlab/Python interfaces

SVMLight

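For reference, a minimal usage sketch via scikit-learn, whose SVC and SVR classes are built on LibSVM; the data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Classification with an RBF kernel.
Xc = rng.normal(size=(100, 2))
yc = np.sign(Xc[:, 0] * Xc[:, 1])            # a non-linearly-separable toy problem
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(Xc, yc)
print("n support vectors:", clf.support_vectors_.shape[0])

# Regression with the epsilon-insensitive loss.
Xr = np.linspace(-3, 3, 100)[:, None]
yr = np.sin(Xr).ravel() + 0.1 * rng.normal(size=100)
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(Xr, yr)
print("train R^2:", round(reg.score(Xr, yr), 3))
```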

Summary
Optimal Separating Hyperplane
Support Vector Machine
Linearly Separable Case
Soft Margin

Kernel Functions
Loss Function
SMO algorithm for optimization
Support Vector Regression
One-class Kernel Machines (refer to Section 13.11)
