Support Vector Machines
Rahul B. Warrier
Abstract— This article gives a brief introduction to Statistical Learning Theory (SLT) and the relatively new class of supervised learning algorithms called Support Vector Machines (SVM). The primary objective of this article is to understand the chosen topic using the concepts of linear systems theory discussed in class. To that end, a thorough mathematical analysis of the topic is presented, along with examples that illustrate the concepts being discussed. Finally, the application of SVM to a simple binary classification problem is discussed.
I. INTRODUCTION
Statistical Learning Theory (SLT) forms the basic framework for machine learning. Of the several categories of learning, SLT deals with supervised learning, which involves learning from a training set of data (examples). The training data consist of input-output pairs, where each input maps to an output. The learning problem, then, consists of finding the function that maps input to output in a predictive fashion, such that the learned function can be used to predict outputs for new input data with minimal mistakes (generalization) [6].
The Support Vector Machine (SVM) is a class of supervised learning machines for two-group classification and regression problems. Conceptually, the SVM implements the following idea: input vectors are non-linearly mapped to a high-dimensional feature space in which a linear decision surface is constructed [4].
The goal of this article is to briefly describe the mathematical formulation of SLT, and more specifically of the SVM (applied to the binary classification problem), through the concepts of systems theory. The major concepts that will be highlighted during the course of this discussion are: inner product spaces and normed spaces, Gram matrices, linear/nonlinear mappings, convex optimization, the projection theorem on Hilbert spaces, and the minimax problem.
This subject has gained immense popularity in the last few years and an ample amount of literature is available on various aspects of this topic. This article mainly follows the seminal work of Vladimir N. Vapnik [1] and his colleagues at AT&T Bell Laboratories [2], [3]. SLT concepts have also been discussed succinctly in [6], and a detailed mathematical formulation of SLT and SVM is available in the dissertation by Schölkopf [4]. A geometric interpretation of the SVM algorithm is presented in [8], [9].
This article is organized as follows: Section II deals with the formulation of the general learning problem and discusses the Empirical Risk Minimization (ERM) induction principle. The remaining sections develop the SVM formulation for the binary classification problem.

Rahul B. Warrier is a Graduate Student in Mechanical Engineering, University of Washington, Seattle, WA 98105. warrirerr@uw.edu
II. THE GENERAL LEARNING PROBLEM

Given a training set of n input-output pairs

{(x_1, y_1), ..., (x_n, y_n)},  x_i ∈ X, y_i ∈ Y,   (1)

the learning problem consists of choosing, from a set of candidate functions f(x, α), α ∈ Λ, the one that minimizes the expected risk functional

R(α) = ∫ L(y, f(x, α)) dP(x, y).   (2)

Since the joint probability measure P(x, y) is unknown, R(α) cannot be evaluated directly, and the learning machine relies instead on the ERM induction principle: the expected risk functional R(α) given by (2) is replaced by the empirical risk functional [1],

R_emp(α) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i, α)).   (3)

For a set of functions of VC dimension h, the following bound holds with probability at least 1 − η simultaneously for all α [1]:

R(α) ≤ R_emp(α) + sqrt( [ h (ln(2n/h) + 1) − ln(η/4) ] / n ).   (4)

The Structural Risk Minimization (SRM) principle controls this bound by working with a nested sequence of hypothesis spaces

H_1 ⊂ H_2 ⊂ ... ⊂ H_N,  with capacities h_1 ≤ h_2 ≤ ... ≤ h_N,   (5)

and trading off the empirical risk against the confidence term in (4).
D. Summary of SLT
The problem of learning from examples is solved in three steps:
1) Define a loss function L(y, f(x, α)) that measures the error of predicting the output as f(x, α) for input x when the actual output is y;
2) Define a nested sequence of hypothesis spaces H_n, n = 1, ..., N, whose capacity h_n increases with n;
3) Minimize the empirical risk R_emp(α) in each H_n and choose, among the solutions, the one that minimizes the right-hand side of the inequality (4).
A minimal numerical sketch of these three steps is given below.
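As an illustration of the three steps above, the following sketch uses scikit-learn [5] to minimize the empirical risk within a nested family of polynomial-kernel SVMs of increasing degree and then selects the capacity with a held-out estimate of the risk. The dataset, the parameter grid, and the use of a validation split in place of the bound (4) are assumptions of the example, not part of the original text.

```python
# Hedged sketch: ERM within nested hypothesis spaces H_1 ⊂ H_2 ⊂ ... (here,
# polynomial-kernel SVMs of increasing degree), with capacity chosen by a
# held-out estimate of the risk. Dataset and grid are illustrative only.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

best = None
for degree in range(1, 6):                    # nested spaces H_1, ..., H_5
    clf = SVC(kernel="poly", degree=degree, C=1.0).fit(X_tr, y_tr)
    emp_risk = 1.0 - clf.score(X_tr, y_tr)    # empirical risk R_emp (0-1 loss)
    val_risk = 1.0 - clf.score(X_val, y_val)  # held-out proxy for the bound's RHS
    print(f"degree={degree}  R_emp={emp_risk:.3f}  R_val={val_risk:.3f}")
    if best is None or val_risk < best[1]:
        best = (degree, val_risk)
print("selected capacity (degree):", best[0])
```

Replacing the held-out estimate by the bound (4), or by cross-validation, changes only the selection criterion, not the structure of the search.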
For binary classification the output takes only two values, y ∈ {−1, +1}, and the best possible classifier assigns to a test input x_t the label with the largest posterior probability,

f*(x_t) = arg max_{y ∈ {−1,+1}} P(y | x_t).   (10)

With the indicator (0-1) loss L(y, f(x, α)) = (1/2)|y − f(x, α)|, the functionals (2) and (3) become

R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y),   (11)
R_emp(α) = (1/n) Σ_{i=1}^{n} (1/2)|y_i − f(x_i, α)|,  x_i ∈ X,   (12)

so that R(α) is exactly the probability of misclassification. For linearly separable training data, a separating hyperplane ⟨w, x⟩ + b = 0 can be rescaled (canonical form) so that

⟨w, x_i⟩ + b ≥ +1   if y_i = +1,   (7a)
⟨w, x_i⟩ + b ≤ −1   if y_i = −1,   (7b)

or equivalently,

y_i (⟨w, x_i⟩ + b) ≥ 1.   (8)
Fig. 5. (a) 2D separable training space with possible decision boundaries. (b) Non-separable training space. (c) 2D feature space obtained with the non-linear transformation Φ(x) = (x, x²), with possible decision boundaries.
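The transformation in panel (c) of Fig. 5 is easy to reproduce. In the sketch below (the four training points and the threshold are invented for the illustration), a one-dimensional set that no single threshold can separate becomes linearly separable after the map Φ(x) = (x, x²):

```python
# Hedged sketch: the 1-D set {-2, -1, 1, 2} with labels {+1, -1, -1, +1} has no
# separating threshold, but after Phi(x) = (x, x^2) a line separates the classes.
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([+1, -1, -1, +1])

phi = np.column_stack([x, x**2])      # feature map Phi(x) = (x, x^2)

# In feature space the rule sign(x^2 - 2.5) separates the two classes:
w, b = np.array([0.0, 1.0]), -2.5
pred = np.sign(phi @ w + b)
print(phi)
print("predictions match labels:", np.array_equal(pred, y))
```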
For a hyperplane in this canonical form, the margin between the two classes is 2/‖w‖,   (13)

so maximizing the margin is equivalent to minimizing (1/2)‖w‖². The optimal (maximal-margin) hyperplane is therefore obtained from the quadratic programming (QP) problem

min_{w,b} (1/2)‖w‖²   subject to   y_i (⟨w, x_i⟩ + b) ≥ 1,  i = 1, ..., n,   (15)

whose solution (w*, b*) gives the decision function

f(x, w*) = sign(⟨w*, x⟩ + b*).   (19)
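The primal problem (15) can be solved with any QP solver; the sketch below simply uses scikit-learn's SVC [5] with a large C to approximate the hard-margin solution on invented toy data and reads back w*, b*, the margin, and the support vectors (all data and values here are assumptions of the example):

```python
# Hedged sketch: fit a (near) hard-margin linear SVM on toy separable data and
# inspect w*, b*, the support vectors, and the margin 2/||w*||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])   # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("w* =", w, " b* =", b)
print("margin 2/||w*|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
print("predictions:", clf.predict(X))
```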
B. Dual Form
The constrained QP (15) is handled by introducing non-negative Lagrange multipliers α_i ≥ 0 and forming the Lagrangian

L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{n} α_i [ y_i (⟨w, x_i⟩ + b) − 1 ].   (17)

At the saddle point of L(w, b, α) the following conditions hold:

∂L(w, b, α)/∂w |_{w=w*} = w* − Σ_{i=1}^{n} α_i y_i x_i = 0,   (20a)
∂L(w, b, α)/∂b |_{b=b*} = Σ_{i=1}^{n} α_i y_i = 0,   (20b)

so the optimal weight vector is a linear combination of the training vectors,

w* = Σ_{i=1}^{n} α_i y_i x_i.   (21)

Substituting (20a) and (20b) back into (17) yields the dual QP

max_α  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩   (22)

subject to  Σ_{i=1}^{n} α_i y_i = 0,  α_i ≥ 0,

and the bias is recovered from one support vector of each class, x_A (y_A = +1) and x_B (y_B = −1):

b* = −(1/2) Σ_{i=1}^{n} α_i y_i [ ⟨x_i, x_A⟩ + ⟨x_i, x_B⟩ ].
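The expansion (21) and the constraint (20b) can be checked numerically. In scikit-learn's SVC [5], dual_coef_ stores the products α_i y_i for the support vectors, so the sketch below (the toy data are an assumption of the example; the attribute names are scikit-learn's) recovers w* from the dual solution:

```python
# Hedged sketch: verify the support-vector expansion w* = sum_i alpha_i y_i x_i.
# In scikit-learn, clf.dual_coef_ holds alpha_i * y_i for the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_times_y = clf.dual_coef_[0]                 # alpha_i * y_i, support vectors only
w_from_dual = alpha_times_y @ clf.support_vectors_
print("w* from dual expansion:", w_from_dual)
print("w* reported by the solver:", clf.coef_[0])
print("sum alpha_i y_i =", alpha_times_y.sum())   # constraint (20b), ~0
```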
For non-separable data, if a map Φ is known that transforms the non-separable training space into a separable feature space, the QP problem (15) and its dual (22) are simply posed in that feature space, with Φ(x_i) in place of x_i. Because both the dual objective and the decision function depend on the data only through inner products, the map never has to be evaluated explicitly; it suffices to specify a kernel

K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩_H.   (26)

In matrix form the dual problem becomes

max_α  Σ_{i=1}^{n} α_i − (1/2) β^T Κ β   subject to   α^T y = 0,  α_i ≥ 0,   (28)

where Κ is the Gram matrix defined as

[Κ]_{ij} = K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩_H   (29)

and β = [α_1 y_1, ..., α_n y_n]. The decision function is then

f(x) = sign( Σ_{x_i ∈ SV} α_i y_i K(x_i, x) + b* ),   (30)

where SV denotes the set of support vectors and

b* = −(1/2) Σ_{i=1}^{n} α_i y_i [ K(x_i, x_A) + K(x_i, x_B) ].

By Mercer's Theorem⁵ we get the necessary and sufficient condition to represent a kernel K : X × X → R as an inner product in some feature space H with a map Φ as in (26).

⁵Mercer's Theorem says that a symmetric function K(x_1, x_2) can be expressed as an inner product, K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩, for some Φ if and only if K(x_1, x_2) is positive semi-definite [2].
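The Mercer condition is easy to probe numerically for a given kernel and data set. The sketch below (the kernel choice, bandwidth, and points are assumptions of the example) builds the Gram matrix (29) for a Gaussian RBF kernel and checks symmetry and positive semi-definiteness:

```python
# Hedged sketch: build the Gram matrix for an RBF kernel on a few arbitrary
# points and check Mercer's condition (symmetry + positive semi-definiteness).
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0], [2.0, -1.0]])
n = len(X)
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print("symmetric:", np.allclose(K, K.T))
print("eigenvalues:", np.linalg.eigvalsh(K))          # all >= 0 (up to round-off)
print("positive semi-definite:", np.all(np.linalg.eigvalsh(K) >= -1e-10))
```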
REFERENCES
[1] Vapnik, Vladimir N. "An overview of statistical learning theory." IEEE Transactions on Neural Networks 10.5 (1999): 988-999.
[2] Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992.
[3] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine Learning 20.3 (1995): 273-297.
[4] Schölkopf, Bernhard. Support Vector Learning. Ph.D. dissertation, 1997.
[5] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.
[6] Evgeniou, Theodoros, Massimiliano Pontil, and Tomaso Poggio. "Statistical learning theory: A primer." International Journal of Computer Vision 38.1 (2000): 9-13.
[7] Schölkopf, Bernhard. "Statistical learning and kernel methods." (2000).
[8] Bennett, Kristin P., and Erin J. Bredensteiner. "Duality and geometry in SVM classifiers." ICML. 2000.
[9] Zhou, Dengyong, et al. "Global geometry of SVM classifiers." Institute of Automation, Chinese Academy of Sciences, Tech. Rep., AI Lab (2002).
[10] Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." The Journal of Machine Learning Research 2 (2002): 45-66.
APPENDIX - A
REF. [1]: Vapnik, "An overview of statistical learning theory" (excerpt)

I. SETTING OF THE LEARNING PROBLEM
The Problem of Pattern Recognition: For this problem the supervisor's answer takes only two values, and the loss function is the indicator loss

L(y, f(x, α)) = 0 if y = f(x, α);  L(y, f(x, α)) = 1 if y ≠ f(x, α).   (3)

For this loss function, the functional (2) provides the probability of classification error (i.e., of the cases when the answers given by the supervisor and the answers given by the indicator function differ). The problem, therefore, is to find the function which minimizes the probability of classification errors when the probability measure P(x, y) is unknown, but the data (1) are given.

The Problem of Regression Estimation: Let the supervisor's answer y be a real value, and let f(x, α), α ∈ Λ, be a set of real functions which contains the regression function. It is known that if the regression function belongs to this set, then it is the one which minimizes the functional (2) with the following loss function:

L(y, f(x, α)) = (y − f(x, α))².   (4)

Thus the problem of regression estimation is the problem of minimizing the risk functional (2) with the loss function (4) in the situation where the probability measure is unknown but the data (1) are given.
The Problem of Density Estimation: For the problem of estimating a density from the set p(x, α), α ∈ Λ, one uses the loss function

L(p(x, α)) = −log p(x, α),   (5)

which one needs to minimize in order to find the approximation to the density. Since the ERM principle is a general formulation of these classical estimation problems, any theory concerning the ERM principle applies to the classical methods as well.

The General Setting of the Learning Problem: The general setting of the learning problem can be described as follows. Let the probability measure P(z) be defined on the space Z. Consider the set of functions Q(z, α), α ∈ Λ. The goal is to minimize the risk functional

R(α) = ∫ Q(z, α) dP(z),   (6)

if the probability measure P(z) is unknown but an i.i.d. sample

z_1, ..., z_l   (7)

is given. The learning problems considered above are particular cases of this general problem of minimizing the risk functional (6) on the basis of empirical data (7), where z describes a pair (x, y) and Q(z, α) is the specific loss function [for example, one of (3), (4), or (5)]. Below we will describe results obtained for the general statement of the problem. To apply them to specific problems one has to substitute the corresponding loss functions in the formulas obtained.
²Convergence in probability of the values R(α_l) means that for any ε > 0 and for any η > 0 there exists a number l₀ = l₀(ε, η) such that for any l > l₀, with probability at least 1 − η, the inequality R(α_l) − R(α₀) < ε holds true.
The theory of consistency is an asymptotic theory. It describes the necessary and sufficient conditions for convergence of the solutions obtained using the proposed method to the best possible as the number of observations is increased. The question arises:

Why do we need a theory of consistency if our goal is to construct algorithms for a small (finite) sample size?

The answer is:

We need a theory of consistency because it provides not only sufficient but necessary conditions for convergence of the empirical risk minimization inductive principle. Therefore any theory of the empirical risk minimization principle must satisfy the necessary and sufficient conditions.

In this section, we introduce the main capacity concept (the so-called Vapnik-Chervonenkis (VC) entropy) which defines the generalization ability of the ERM principle. In the next sections we show that the nonasymptotic theory of learning is based on different types of bounds that evaluate this concept for a fixed amount of observations.
Let us characterize the diversity of this set of functions on the given sample by a quantity N^Λ(z_1, ..., z_l) that represents the number of different separations of this sample that can be obtained using functions from the given set of indicator functions. Let us write this in another form: consider the set of l-dimensional binary vectors obtained by evaluating the indicator functions on the sample. N^Λ(z_1, ..., z_l) is the number of distinct vectors in this set, and its logarithm is called the random entropy. The random entropy describes the diversity of the set of functions on the given data; it is a random variable, since it was constructed using random i.i.d. data. Taking the expectation of the random entropy over the joint distribution function of the sample defines the (VC) entropy of the set of functions on samples of size l. The condition that this entropy, divided by l, tend to zero is the necessary and sufficient condition for two-sided uniform convergence of the empirical risks to the actual risks; slightly modifying this condition, one can obtain the necessary and sufficient condition for one-sided uniform convergence.

Entropy of the Set of Real Functions: Now we generalize the concept of entropy for sets of real-valued functions. Let Q(z, α), α ∈ Λ, be a set of bounded loss functions. Using this set of functions and the training set one can construct the analogous set of l-dimensional real-valued vectors, and the corresponding entropy is defined through the minimal number of elements of an ε-net of this set of vectors. Slightly modifying the resulting condition, one again obtains the necessary and sufficient condition for one-sided uniform convergence; according to the key assertion, this implies the necessary and sufficient conditions for consistency of the ERM principle.

This is the first milestone in learning theory: any machine minimizing empirical risk should satisfy this necessary and sufficient condition for consistency. However, the condition says nothing about the rate of convergence of the obtained risks to the minimal one. It is possible that the ERM principle is consistent but has an arbitrarily slow asymptotic rate of convergence.
The VC dimension of a set of real-valued functions Q(z, α), α ∈ Λ, is defined to be the VC dimension of the corresponding set of indicator functions obtained by thresholding.⁷

⁷Any indicator function separates a set of vectors into two subsets: the subset of vectors for which this function takes the value zero and the subset of vectors for which it takes the value one.
993
(20)
holds true simultaneously for all functions of the set (19),
where
(21)
fe
nc
es
re
Re
(22)
(23)
holds true simultaneously for all functions of the set, where
is determined by (22),
The theorem bounds the risks for all functions of the set
(including the function
8 This inequality describes some general properties of distribution functions
of the random variables Q(z; ), generated by the P (z ): It describes the
tails of distributions (the probability of big values for the random variables
): If the inequality (22) with p > 2 holds, then the distributions have socalled light tails (large values do not occurs very often). In this case rapid
convergence is possible. If, however, (22) holds only for p < 2 (large values
of the random variables occur rather often) then the rate of convergence
will be small (it will be arbitrarily small if p is sufficiently close to one).
994
fe
re
nc
es
Re
holds true.
From this bound it follows that for sufficiently large with
simultaneously for all
(including the
probability
one that minimizes the empirical risk) the following inequality
is valid:
(24)
and
An admissible structure is one satisfying the following three properties:
1) The set S* = ∪_n S_n is everywhere dense in the set of functions under consideration;
2) The VC dimension h_n of each set S_n of functions is finite;
3) Any element S_n of the structure contains totally bounded functions.

The SRM principle suggests that for a given set of observations one choose the element of the structure S_n, n = n(l), and the particular function within S_n, for which the guaranteed risk (20) is minimal.

The SRM principle actually suggests a tradeoff between the quality of the approximation and the complexity of the approximating function. (As n increases, the minima of the empirical risk decrease; however, the term responsible for the confidence interval [the second summand in (20)] increases. The SRM principle takes both factors into account.)

The main results of the theory of SRM are the following [9], [22].

Theorem: For any distribution function the SRM method provides convergence to the best possible solution with probability one.

In other words, the SRM method is universally strongly consistent.⁹

⁹The sample size l is considered to be small if l/h is small, say l/h < 20.
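The tradeoff described here can be made concrete with a few numbers. The sketch below evaluates the right-hand side of the VC bound quoted as inequality (4) in the main article for a structure of growing VC dimension; the empirical-risk values are invented for the illustration, so the sketch only shows the shape of the tradeoff, not results from either paper.

```python
# Hedged sketch: the SRM tradeoff. As capacity h grows the (assumed) empirical
# risk shrinks, but the VC confidence term of the bound grows; SRM picks the
# element of the structure minimizing their sum.
import numpy as np

n, eta = 1000, 0.05                              # sample size and confidence level
h = np.array([5, 20, 80, 320])                   # VC dimensions of nested elements
emp_risk = np.array([0.30, 0.08, 0.06, 0.05])    # illustrative R_emp values

confidence = np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)
bound = emp_risk + confidence

for hk, r, c, b in zip(h, emp_risk, confidence, bound):
    print(f"h={int(hk):4d}  R_emp={r:.3f}  confidence={c:.3f}  bound={b:.3f}")
print("SRM choice: h =", int(h[np.argmin(bound)]))
```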
To minimize the empirical risk over such a set of functions one has to find the parameters (weights) which minimize the empirical risk functional. There are several methods for minimizing this functional. In the case when the minimum of the empirical risk is zero one can find the exact solution, while when the minimum of this functional is nonzero one can find an approximate solution. Therefore, by constructing a separating hyperplane one can control the value of the empirical risk.

Unfortunately, the set of separating hyperplanes is not flexible enough to provide low empirical risk for many real-life problems [13]. Two opportunities were considered to increase the flexibility of the sets of functions:
1) to use a richer set of indicator functions which are superpositions of linear indicator functions (with the indicator replaced by a smooth monotonic, sigmoid, function);
2) to map the input vectors into a high-dimensional feature space and construct in this space a Δ-margin separating hyperplane (see Example 2 in Section III-C).

The first idea corresponds to the neural network. The second idea leads to support vector machines.
2) The parameters of the optimal hyperplane (the vector w₀) are a linear combination of the vectors of the training set,

w₀ = Σ_i α_i y_i x_i.   (36)

3) The solution must satisfy the following Kuhn-Tucker conditions:

α_i [ y_i (⟨w₀, x_i⟩ + b₀) − 1 ] = 0,  i = 1, ..., l.   (37)

From these conditions it follows that only some of the training vectors in the expansion (36), the support vectors, can have nonzero coefficients α_i in the expansion of w₀. The support vectors are the vectors for which, in (36), the equality is achieved. Therefore we obtain

w₀ = Σ_{support vectors} α_i y_i x_i.   (38)
Consider, for example, the feature space whose coordinates are the input coordinates together with all their pairwise products. The separating hyperplane constructed in this space is a separating second-degree polynomial in the input space. To construct a polynomial of degree d in an n-dimensional input space one has to construct a feature space with on the order of n^d coordinates, where one then constructs the optimal hyperplane. If a radial basis function kernel is used instead, the SVM machine will find both the centers and the corresponding weights.

The SVM possesses some useful properties:
- The optimization problem for constructing an SVM has a unique solution.
- The learning process for constructing an SVM is rather fast.
- Simultaneously with constructing the decision rule, one obtains the set of support vectors.
- Implementation of a new set of decision functions can be done by changing only one function (the kernel K(x, x'), which defines the dot product in the feature space); see the short sketch following this list.
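A minimal illustration of the last property, assuming scikit-learn and an invented ring-shaped data set (neither is part of the excerpt): the training call is identical for each decision-function family, and only the kernel argument changes.

```python
# Hedged sketch: changing the implemented decision surface only requires
# changing the kernel function; everything else in the training call is identical.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, C=1.0).fit(X, y)
    print(f"{kernel:6s}  support vectors: {len(clf.support_):3d}  "
          f"training accuracy: {clf.score(X, y):.2f}")
```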
VI. CONCLUSION
This article presents a very general overview of statistical learning theory. It demonstrates how an abstract analysis allows us to discover a general model of generalization.

According to this model, the generalization ability of learning machines depends on capacity concepts which are more sophisticated than merely the dimensionality of the space or the number of free parameters of the loss function (these concepts are the basis for the classical paradigm of generalization).

The new understanding of the mechanisms behind generalization not only changes the theoretical foundation of generalization (for example, from the new point of view the Occam's razor principle is not always correct), but also changes the algorithmic approaches to function estimation problems. The approach described is rather general. It can be applied to various function estimation problems, including regression, density estimation, solving inverse equations, and so on.

Statistical learning theory started more than 30 years ago. The development of this theory did not involve many researchers. After the success of the SVM in solving real-life problems, the interest in statistical learning theory significantly increased. For the first time, abstract mathematical results in statistical learning theory have a direct impact on algorithmic tools of data analysis. In the last three years a lot of articles have appeared that analyze the theory of inference and the SVM method from different perspectives. These include:
1) obtaining better constructive bounds than the classical ones described in this article (which are closer in spirit to the nonconstructive bound based on the growth function than to bounds based on the VC dimension concept); success in this direction could lead, in particular, to creating machines that generalize better than the SVM based on the concept of the optimal hyperplane;
2) extending the SVM ideology to many different problems of function and data analysis;
3) developing a theory that allows us to create kernels that possess desirable properties (for example, that can enforce desirable invariants);
Vladimir N. Vapnik was born in Russia and received the Ph.D. degree in statistics from the Institute of Control Sciences, Academy of Science of the
USSR, Moscow, Russia, in 1964.
Since 1991, he has been working for AT&T Bell
Laboratories (since 1996, AT&T Labs Research),
Red Bank, NJ. His research interests include statistical learning theory, theoretical and applied statistics,
theory and methods for solving stochastic ill-posed
problems, and methods of multidimensional function approximation. His main results in the last three
years are related to the development of the support vector method. He is author
of many publications, including seven monographs on various problems of
statistical learning theory.
REF [2]
A Training Algorithm for Optimal Margin Classifiers

Bernhard E. Boser (EECS Department, University of California, Berkeley, CA 94720; boser@eecs.berkeley.edu), Isabelle M. Guyon (AT&T Bell Laboratories, 50 Fremont Street, 6th Floor, San Francisco, CA 94105; isabelle@neural.att.com), and Vladimir N. Vapnik (AT&T Bell Laboratories, Crawford Corner Road, Holmdel, NJ 07733; vlad@neural.att.com)

Abstract
A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.

INTRODUCTION
In this paper we describe a training algorithm that automatically tunes the capacity of the classification function by maximizing the margin between training examples and class boundary [KM87], optionally after removing some atypical or meaningless examples from the training data. The resulting classification function depends only on so-called supporting patterns [Vap82]. These are those training examples that are closest to the decision boundary and are usually a small subset of the training data.
2 MAXIMUM MARGIN TRAINING ALGORITHM
The proposed training algorithm is based on the generalized portrait method described in [Vap82] that constructs separating hyperplanes with maximum margin. Here this algorithm is extended to train classifiers linear in their parameters. First, the margin between the class boundary and the training patterns is formulated in the direct space. This problem description is then transformed into the dual space by means of the Lagrangian. The resulting problem is that of maximizing a quadratic form with constraints and is amenable to efficient numeric optimization algorithms [Lue84].

The maximum margin training algorithm finds a decision function for pattern vectors x of dimension n belonging to either of two classes A and B. The input to the training algorithm is a set of p examples x_i with labels y_i:

(x_1, y_1), (x_2, y_2), ..., (x_p, y_p),   (1)

where y_k = 1 if x_k ∈ class A and y_k = −1 if x_k ∈ class B.   (2)

From these training examples the algorithm finds the parameters of the decision function D(x) during a learning phase. After training, the classification of unknown patterns is predicted according to the following rule:

x ∈ A if D(x) > 0;  x ∈ B otherwise.
The decision functions must be linear in their parameters but are not restricted to linear dependence on x. These functions can be expressed either in direct or in dual space. The direct space notation is identical to the Perceptron decision function [Ros62]:

D(x) = Σ_{i=1}^{N} w_i φ_i(x) + b.   (3)

In this equation the φ_i are predefined functions of x, and the w_i and b are the adjustable parameters of the decision function. Polynomial classifiers are a special case of Perceptrons for which the φ_i(x) are products of components of x. In vector notation, with φ(x) = (φ_1(x), ..., φ_N(x)) and w = (w_1, ..., w_N),

D(x) = w · φ(x) + b.   (7)

In the dual space, the decision function is written as an expansion over the training patterns,

D(x) = Σ_{k=1}^{p} α_k K(x_k, x) + b,   (4)

where K is a predefined kernel function and the α_k and b are the adjustable parameters. Provided that the expansion K(x, x') = Σ_i φ_i(x) φ_i(x')   (5) exists, equations 3 and 4 are dual representations of the same decision function. In particular, the kernel K(x, x') = (x · x' + 1)^q corresponds to a polynomial expansion φ(x) of order q [Pog75].

2.1 MAXIMIZING THE MARGIN IN THE DIRECT SPACE
In the direct space the problem is to find the unit-norm weight vector that maximizes the worst-case margin over the training patterns:

max_{w, ‖w‖=1} M   subject to   y_k D(x_k) ≥ M,  k = 1, 2, ..., p,   (9)

which is equivalent to  M* = max_{w, ‖w‖=1} min_k y_k D(x_k).   (10)
Figure 1: Maximum margin linear decision function D(x) = w · x + b (φ = x). The gray levels encode the absolute value of the decision function (solid black corresponds to D(x) = 0). The numbers indicate the supporting patterns.

Thus, maximizing the margin M is equivalent to minimizing the norm ‖w‖.¹ Then the problem of finding a maximum margin separating hyperplane w* stated in 9 reduces to solving the following quadratic problem:

min_w (1/2)‖w‖²   (13)
under conditions  y_k D(x_k) ≥ 1,  k = 1, 2, ..., p.

The maximum margin is M* = 1/‖w*‖.

2.2 MAXIMIZING THE MARGIN IN THE DUAL SPACE
In principle the problem stated in 13 can be solved directly with numerical techniques. However, this approach is impractical when the dimensionality of the φ-space is large or infinite. Moreover, no information is gained about the supporting patterns. The problem is therefore transformed into the dual space by means of the Lagrangian

L(w, b, α) = (1/2)‖w‖² − Σ_{k=1}^{p} α_k [ y_k D(x_k) − 1 ],   subject to α_k ≥ 0,  k = 1, 2, ..., p.   (14)

The factor one half has been included for cosmetic reasons; it does not change the solution. The optimization problem 13 is equivalent to searching a saddle point of the function L(w, b, α). This saddle point is the minimum of L(w, b, α) with respect to w, and a maximum with respect to α (α_k ≥ 0). At the solution, the following necessary (Kuhn-Tucker) condition is met:

α_k ( y_k D(x_k) − 1 ) = 0,  k = 1, 2, ..., p,   (15)

hence

w* = Σ_{k=1}^{p} α_k* y_k φ(x_k).   (16)

The patterns which satisfy y_k D(x_k) = 1 are the supporting patterns. According to equation 16, the vector w* that specifies the hyperplane with maximum margin is a linear combination of only the supporting patterns, which are those patterns for which α_k* ≠ 0. Usually the number of supporting patterns is much smaller than the number p of patterns in the training set.

¹If the training data is not linearly separable the maximum margin may be negative. In this case, M‖w‖ = 1 is imposed. Maximizing the margin is then equivalent to maximizing ‖w‖.
In the dual space the problem reduces to maximizing

J(α, b) = Σ_{k=1}^{p} α_k (1 − y_k b) − (1/2) α^T H α   subject to α_k ≥ 0,  k = 1, 2, ..., p,   (17)

where H is a square matrix with elements H_{kl} = y_k y_l K(x_k, x_l). In order for a unique solution to exist, H must be positive definite. For fixed bias b, the solution α* is obtained by maximizing J(α, b) under the conditions α_k ≥ 0. Based on equations 7 and 16, the resulting decision function is of the form

D(x) = w* · φ(x) + b = Σ_k y_k α_k* K(x_k, x) + b,   α_k* ≥ 0,   (18)

in which only the supporting patterns appear with nonzero weight.

The dimension of problem 17 equals the size of the training set, p. To avoid the need to solve a dual problem of exceedingly large dimensionality, the training data is divided into chunks that are processed iteratively [Vap82]. The maximum margin hypersurface is constructed for the first chunk and a new training set is formed consisting of the supporting patterns from the solution and those patterns x_k in the second chunk of the training set for which y_k D(x_k) < 1 − ε. A new classifier is trained and used to construct a training set consisting of supporting patterns and examples from the first three chunks which satisfy y_k D(x_k) < 1 − ε. This process is repeated until the entire training set is separated.
3 PROPERTIES OF THE ALGORITHM
3.1 PROPERTIES OF THE SOLUTION
Since maximizing the margin between the decision boundary and the training patterns is equivalent to maximizing a quadratic form in the positive quadrant, there are no local minima and the solution is always unique if H has full rank. At the optimum

J(α*) = (1/2)‖w*‖² = 1/(2 (M*)²).   (20)

The bias b* can be determined from a supporting pattern x_A of class A and a supporting pattern x_B of class B,

b* = −(1/2) (w* · φ(x_A) + w* · φ(x_B)) = −(1/2) Σ_k y_k α_k* [ K(x_A, x_k) + K(x_B, x_k) ],   (19)

or it can be treated as an additional optimization variable. In both cases the solution is found with standard nonlinear optimization algorithms for quadratic forms with linear constraints [Lue84, Loo72]. The second approach gives the largest possible margin. There is no guarantee, however, that this solution also exhibits the best generalization performance.
Figure 2 compares the decision boundary for maximum margin and mean squared error (MSE) cost functions. Unlike the MSE based decision function, which simply ignores the outlier, optimal margin classifiers are very sensitive to atypical patterns that are close to the decision boundary. These examples are readily identified as those with the largest α_k and can be eliminated either automatically or with supervision. Hence, optimal margin classifiers give complete control over the handling of outliers, as opposed to quietly ignoring them.

Figure 2: Linear decision boundary for MSE (left) and maximum margin cost functions (middle, right) in the presence of an outlier. In the rightmost picture the outlier has been removed. The numbers reflect the ranking of supporting patterns according to the magnitude of their Lagrange coefficient α_k for each class individually.

The optimum margin algorithm performs automatic capacity tuning of the decision function to achieve good generalization. An estimate for an upper bound of the generalization error is obtained with the leave-one-out method: a pattern x_k is removed from the training set, a classifier is then trained on the remaining patterns and tested on x_k. This process is repeated for all p training patterns, and the generalization error is estimated by the ratio of misclassified patterns over p. For a maximum margin classifier, two cases arise: if x_k is not a supporting pattern, the decision boundary is unchanged and x_k will be classified correctly; if x_k is a supporting pattern, the outcome is uncertain, and in the worst case the m linearly independent supporting patterns are misclassified when they are omitted from the training data. The expected generalization error is therefore bounded by the ratio m/p of the number of supporting patterns to the number of training patterns.
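The leave-one-out estimate just described is easy to reproduce numerically; the sketch below (toy Gaussian data and scikit-learn's SVC are assumptions of the example, not part of the original paper) compares it with the m/p bound:

```python
# Hedged sketch: leave-one-out error of a linear maximum-margin classifier,
# compared with the bound (number of supporting patterns) / p.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(20, 2)), rng.normal(-2, 1, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", C=10.0).fit(X[train_idx], y[train_idx])
    errors += clf.predict(X[test_idx])[0] != y[test_idx][0]

full = SVC(kernel="linear", C=10.0).fit(X, y)
print("leave-one-out error:", errors / len(X))
print("bound m/p          :", len(full.support_) / len(X))
```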
3.2 COMPUTATIONAL CONSIDERATIONS
Speed and convergence are important practical considerations of classification algorithms. The benefit of the dual space representation in reducing the number of computations required, for example for polynomial classifiers, has been pointed out already. In the dual space, each evaluation of the decision function D(x) requires m evaluations of the kernel function K(x_k, x) and forming the weighted sum of the results. This number can be further reduced through the use of appropriate search techniques which omit evaluations of K that yield negligible contributions to D(x) [Omo91].
Figure 3: Supporting patterns, labeled with the magnitude of their Lagrange coefficients α_k.

4 EXPERIMENTAL RESULTS
In all experiments, the margin is maximized with respect to w and b. Ten hypersurfaces, one per class, are used to separate the digits. Regardless of the difficulty of the problem, measured for example by the number of supporting patterns found by the algorithm, the same similarity function K(x, x') and preprocessing is used for all hypersurfaces of one experiment. The results obtained with different choices of K, corresponding to linear hyperplanes, polynomial classifiers, and basis functions, are summarized below. The effect of smoothing is investigated as a simple form of preprocessing.

[Table: error rate and average number of supporting patterns ⟨m⟩ on databases DB1 and DB2 for polynomial classifiers of degree 1 (linear) through 5; the dimension N of the corresponding feature space grows from 256 to roughly 10^12.]

In the above experiment, the performance changes drastically between first and second order polynomials. This may be a consequence of the fact that the maximum VC-dimension of a q-th order polynomial classifier is equal to the dimension n of the patterns raised to the q-th power, and thus much larger than n. A more gradual change of the VC-dimension is possible when the function K is chosen to be a power series. Better performance has been achieved on both databases using multilayer neural networks or other techniques.

Figure 4: Decision boundaries for maximum margin classifiers with second order polynomial decision rule K(x, x') = (x · x' + 1)² (left) and an exponential RBF K(x, x') = exp(−‖x − x'‖/2) (middle). The rightmost picture shows the decision boundary of a two layer neural network with two hidden units trained with backpropagation.
Better performance might be achieved with other similarity functions K(x, x'). Figure 4 shows the decision boundary obtained with a second order polynomial and a radial basis function (RBF) maximum margin classifier with K(x, x') = exp(−‖x − x'‖/2). The decision boundary of the polynomial classifier is much closer to one of the two classes. This is a consequence of the nonlinear transform from φ-space to x-space of polynomials, which realizes a position dependent scaling of distance. Radial Basis Functions do not exhibit this problem. The decision boundary of a two layer neural network trained with backpropagation is shown for comparison.

The importance of a suitable preprocessing to incorporate knowledge about the task at hand has been pointed out by many researchers. In optical character recognition, preprocessing that introduces some invariance to scaling, rotation, and other distortions is particularly important [SLD92]. As in [GVB+92], smoothing is used to achieve insensitivity to small distortions. A second order polynomial classifier was used for database DB1, and a fourth order polynomial for DB2. The smoothing kernel is Gaussian with standard deviation σ.

[Table: test-set error and average number of supporting patterns ⟨m⟩ for DB1 and DB2 with Gaussian smoothing of standard deviation σ between 0 (no smoothing) and 1.2.]
5 CONCLUSIONS
Maximizing the margin between the class boundary and training patterns is an alternative to other training methods optimizing cost functions such as the mean squared error. This principle is equivalent to minimizing the maximum loss and has a number of important features. These include automatic capacity tuning of the classification function, extraction of a small number of supporting patterns from the training data that are relevant for the classification, and uniqueness of the solution. They are exploited in an efficient learning algorithm for classifiers linear in their parameters with very large capacity, such as high order polynomial or RBF classifiers. Key is the representation of the decision function in a dual space which is of much lower dimensionality than the feature space.
Acknowledgements
We wish to thank our colleagues at UC Berkeley and AT&T Bell Laboratories for many suggestions and stimulating discussions. Comments by L. Bottou, C. Cortes, S. Sanders, S. Solla, A. Zakhor, and the reviewers are gratefully acknowledged. We are especially indebted to R. Baldick and D. Hochbaum for investigating the polynomial convergence property, S. Hein for providing the code for constrained nonlinear optimization, and D. Haussler and M. Warmuth for help and advice regarding performance bounds.

References (excerpt)
[ABR64] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
[CH53] R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, New York, 1953.
[GPP+89] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. LeCun. Comparing different neural network architectures for classifying handwritten digits. 1989.
[KM87] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen., 20:L745, 1987.
[Loo72] F. A. Lootsma, editor. Numerical Methods for Non-linear Optimization. Academic Press, 1972.
[Lue84] D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 1984.
[Omo91] S. M. Omohundro. Bumptrees for efficient function, constraint and classification learning. In R. P. Lippmann et al., editors, NIPS-90, San Mateo, CA, 1991. Morgan Kaufmann.
[PG90] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, February 1990.
[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.
[SLD92] P. Simard, Y. LeCun, and J. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann, San Mateo, CA, 1992.
[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
152
REF [3]
es
!""#$%&('*),+-#.%/10,2!#/"!34!506#87"%:9<;
=6>@?>AB#.C%)DFEG'*)('H#.I*JLKNM,O%=
PRQTS*UHVXWZYU.[ >@J%']\*^*_._"`.acbedFfHgchHbi`8a1jZgXbikl`.anmop)q#r%'*st2p'H#.(%!%uv3$#I*J%!%'1wix&sDyuz"7
I*2!#)()(|{ZIH#&n}%o7"%C%2}'H3~)H6>5J%'-3#IJ%!%'5I*%%I*'H7%&z"#2p2po!3$7%2p'H34'H8&()6&cJ%'w82}2ps5!%u4p/"'H#"T:"7"z%&
'*I*&(%)#.'%8Dy2p!%'H#.2p13$#87"7"'*/R&(4# 'H$J%}u%J8Dy/":3~'H%)(pw'H#&cz"'-)c7"#I*'.ly&J%p)w'H#&z"'
)7"#I*'#~2}!%'H#.@/"'*I*p)(pR)cz"w#I*'}),I*%)(&c(z%I*&('*/BlO"7"'*I*!#2l7"7'H&(p'*)-w&J%'/"'*I*p)(p<)z"wy#I*'
'H%)z"'*)lJ%puJu8'H%'H(#2ppH#&(p4#.C%p2p}&xw&cJ%'l2}'H#8(%!%u3$#I*J%:%'.l>5J%'p/"'H#-C'HJ%:%/4&J%')z"7"7"%&D
'*I*&(%%'*&Fsn9Rs5#)7"' pz%)(2p]!3$7%2p'H34'H8&('*/]wi&J%'$'*)(&pI*&('*/IH#)('4s-J%'H'$&J%'$&n#!%:%u
/Z#&#rIH#.C"')('H7"#.(#&('*/s5p&J%%z%&R'H()X]']J%'H''*&('H%/&J%p)R'*)z%2p&R&(v%8Dy)('H7"#.n#.C%2p'
&(#:%!%u/Z#&#"
puJvu8'H%'H(#2ppH#&(p#.C%p2pp&w)z"7"7&D '*I*&(%'*&Fsn9.)$z%&(p2pp*:%u7.2p%3~:#2:"7"z%&
&(#8%)w%(3$#&(p%)4p)$/"'H34%%)(&(#&n'*/Bv]'1#2p)(]I*%3$7"#.'R&J%'17'Hwi(3$#.%I*'<w-&J%'<)z"7"7"%&D
'*I*&(%%'*&s(9,&( #8pz%)6I*2!#)()n}IH#2"2}'H#8(%!%u#2pu.}&cJ"34)l&J"#&l#2p2"&(%97"#.&T!$#-C'H%IJ"3#.(9
)(&z%/"w67%&(pIH#2l5J"#.(#I*&n'HEG'*I*.u%p&(pL
5BTN(
re
nc
LVS86 #&(&('Hno'*I*8u%p&(pLKZ'cI*}'H8&,2}'H#8(%!%uq#2pu.%p&J"34)HKZ%'Hz"n#2%'*&s(98)HK"(#XD
/"!#2C"#)n})@wyz"%I*&(pRI*2!#)()n{Z'H)HKN7.2p%%34!#2I*2!#)n)(|{Z'H)H
Re
fe
'5&cJ"#.o88'H#.)#u.E%=4%l})cJ%'H-|*B)z%u8u.'*)(&('*/&J%'@{N)(}u8p&J"3w%7"#&(&('H($'*I*.uD
%p&(pL '4I*%)(p/"'H'*/#R34/"'*2Tw&Fso%%(3$#2T/"p)(&!C"z%&('*/77"z%2!#&(p%)HKTv +**+F #.%/
v ; * ; w]/"!34'H%)n}%"#2 '*I*&()-s5p&J34'H#. '*I*&()x + #.%/ ; #.%/I*D #.:#8%I*'
3$#&cpI*'*)
#8%/ K#8%/])J%s'*/]&J"#&&J%'7%&(!3$#2i#H8'*)(!#. )(.2!z%&(prp)~#z"#/Z(#&(pI
/"'*I*p)(pwz" %I+ *&n}%L ;
)( l )(pu1X q +T$+ + eRv +( eRv ;cT$; + eRv ;B 2! ;
+
yr&J%'IH#)('$s,J%'H' + ; &J%'o"z"#/Z(#&(pI$/"'*I*p)(prwz"%I*&n}% /"'*u.'H%'H(#&('*)4&(1 #
2p!%'H#.wyz"%I*&(pL
2}! l )(pu + ;* + 1 + + + ; + ;c
>-'*)(&n:3#&('6&cJ%'6"z"#/Z(#&(pI/"'*I*p)(pwz"%I*&n}%%%'GJ"#)&(@/"'*&('H(34!%'$8}; Zc w'*'7"#.(#.3~'*&('H)H
>R'*)(&(!3$#&('4&J%'42p!%'H#.,wyz"%I*&(p%%2}]rwy'*'$7"#.(#.3~'*&('H)J"# '&nqC'$/"'*&('H(34!%'*/B$&J%'
i(*eHncT!n*ec(e*(XB"en!LFniHyinyi i
(*eHncT(*ce e B"en! c yFinyi% ci
perceptron output
weights of the output unit,
1, ... , 5
dotproduct
Re
fe
re
nc
es
#"
e n) }u% !
C8q#/(z%)(&(!%u$&cJ%'s'*puJ8&()$ w%3 &cJ%'&e% Dy&JRJ%p/"/"'Hz"%p&5&($&J%'%z%&7"z%&5z"%}&,)(o#)&n~3~:%|D
34p*'4)(34''Hn,34'H#)z"'x 'H&J%'x&(#!%!%uo/Z#&#"=,)-#o'*)z%2p&-wEG.)('H%C%2:#&(&
)#.7"7"#IJLK
'!( lnenZF**)5H+!n-,/L. c0!+ 2X1 @ele4*3 e65 87
I*%)(&c(z%I*&(p<w/"'*I*p)(p(z%2p'*)s5#)#u#:#)n)(I*!#&n'*/]s5p&J<&J%'4I*%)(&nz%I*&(pw2}!%'H#.xJ.8D
7"'Hn7%2:#8%'*):)(34')c7"#I*'.
=#2pu.p&J"3 &cJ"#&$#2p2}s5)~wx#2p25s'*puJ.&n)w-&J%'1%'Hz"(#25%'*&s(9&n #/Z#.7%&4:%/"'H
2pIH#2}2p&(3~:%!34p*'&J%'1'Hn4#])n'*&ow '*I*&(%)qC'*2p%u.!%uv&(v#7"#&n&('H(v'*I*.u%%}&n}%
7"C%2p'H3 s5#)xwiz"%/v! .r K
%K "K X5s-J%'H&cJ%'qC"#I*9DF7"7"#u#&(p#2}u8p&J"3 s5#)
/"p)(I* 'H'*/B1>@J%'o)(82:z%&n}%
: . 2 '*)$ #)( 2p}u%J.&$34%/"|{ZIH#&(pw5&J%'o3#&J%'H3$#&n}IH#234%/"'*26w
%'Hz"%)X5>5J%'H'cwi'.KL%'Hz"(#2l%'*&Fsn9.)!3$7%2p'H34'H8&7%p'*I*'cDys5p)('42p!%'H#.Dy&7' 1/"'*I*p)(pwyz"%IcD
&(p%)H
R&J%p)#.&(pI*2p's'I*%)n&(z%I*&-#$%'*s &7'wl2p'H#.(%!%uR3$#I*J%:%'*)XKZ&J%'x)(DyIH#2p2p'*/)z"7"7"%&D
'*I*&(%%'*&s(9 >5J%'R)cz"7"7"&D '*I*&(x%'*&Fsn9!3$7%2p'H34'H.&n)$&J%'w.2p2ps5!%u]p/"'H#"p&~3#.7%)
&J%'x:"7"z%& '*I*&(%)5!.&no)(%34'J%puJR/"!34'H%)(p"#2Tw'H#&cz"')7"#I*' &J"z%uJ)n34'%8Dy2p!%'H#.
3$#.7"7%!%uIJ%8)('H#]7"}%iv&J%p))7"#I*'1#2p!%'H#./"'*I*p)(p)z"wy#I*'Rp)oI*%%)(&(z%I*&('*/s5p&J
)7'*I*:#27"%7"'H&(p'*),&J"#&'H%)z"'4J%puJRu.'H%'Hn#2p}H#&(p#.C%p2}p&Rwl&J%'%'*&s(9
Re
fe
re
nc
es
L >BC%&c#!v#1/"'*I*p)(p)z"Fw#I*'I*('*)7"%/"!%u]&n #7.2p%3~:#2-w,/"'*u'*'
W
&sZK.%'xIH#.IH'H#&n'#xw'H#&z"')7"#I*'.K K"s,J%}I*J1J"#)- .p; I*%/"!"#&('*)5w&J%',wi(3
1I*%/"!"#&n'*)
+
+
XH
;
<I*/":"#&('*)
+
+ X H ;
;
8} ; + I*%/"!"#&('*)
; +
+
;
+
s-J%'H'
+
>5J%'J.X"H7" 'Hn7%2:#8%'p),&J%'HI*%)(&c(z%I*&('*/1!&J%p)5)7"#I*'.
HH
>Gs$7"%C%2}'H3~)#.p)('!&J%'#.C '#.7"7"#I*JLl%'I*%%I*'H7%&z"#2l#.%/1%'x&('*I*J"%}IH#2,>5J%'
I*%I*'H7%&z"#2N7"C%2p'H3 p)5J%s&(,{N%/R#)('H7"#.n#&(!%uJ87'H(7%2!#.%'&J"#&ls@}2p2Lu.'H%'H(#2}p*'s'*2p2il&J%'
/"!34'H%)(p"#2pp&1wl&J%'w'H#&cz"')7"#I*'xs5p2}2lC"'4Jz%u8'.KZ#.%/ %8&-#2p2J87'H(7%2!#.%'*)&J"#&)('H7"#.n#&('
&J%'4&(#:%!%uR/Z#&c#$s5p2p2%'*I*'*)n)#.p2pu.'H%'H(#2pp*'$s'*2p2 %>5J%'4&('*I*J"%pIH#267"C%2p'H3p)J%s I*3xD
7"z%&#&n}%"#2p2}&(R&c'H#&,)z%IJrJ%}u%J8Dy/":3~'H%)(p"#2)7"#I*'*)H,&(RI*%%)(&(z%I*&7.2p%3~:#26w/"'*u'*'
"1% !v# ./"!34'H%)(p"#25)c7"#I*'op&$3$#X]C"'1%'*I*'*)()c#.]&(<I*%)(&nz%I*&J87'H(7%2!#.%'*)o!#
C%p2p2}p /":3~'H%)(p"#2w'H#&cz"')7"#I*'.
>@J%'I*%I*'H7%&z"#2l7"#.&5w&J%p)7"C%2p'H3sG#)5)(.2 '*/<:
"XBwi&J%'xIH#)('w5`c_Nb
n_gXai_8jZg*\@w%5)('H7"#8(#.C%2p'$I*2!#)()('*)X=7%&(!3$#2J87'H(7%2!#.%'})4J%'H'4/"'c{N%'*/ #)&J%'42}!%'H#.
/"'*I*p)(pwz"%I*&(p s5p&J3#!3$#23$#.u.!C'*&s'*'H&J%' '*I*&(%)Rw&cJ%']&sI*2!#)()('*)XK)('*'
lpuz"' i&,s5#),C%)('H '*/&J"#&-&(I*%)(&c(z%I*&-)cz%IJ<7%&(!3$#2J87'H(7%2!#.%'*)%%'%%2}J"#)&(
&#.9'$!.&(]#I*I*%z".&$#R)c3$#2p25#.34z"8&w,&J%'o&c(#!%!%u]/Z#&#"K&J%')(IH#2p2}'*/ \^ _._"`.abfXghXbi`.an\K
s-J%pI*J]/"'*&('Hn34!%'$&J%p)3$#8u.!Lo&sG#))J%s-&J"#&|w5&J%'$&c(#!%!%u '*I*&(%)#.'4)('H7"#.n#&('*/
s5p&J%z%&'H()C.v#.7%&(!3$#2-J."7"'Hn7%2:#8%'1&J%'<'*7'*I*&#&(p #2!z%' w&J%'17"C"#.C%p2p}&vw
I*3$3~}&n&(!%u~#84'H($#&('*)(&6'*"#.3$7%2p',p)GCz"%/"'*/1C8&cJ%'(#&(pC"'*&s'*'H&J%'5'*"7"'*I*&#&n}%
BFn" !%e$ # niT$ %. 1Tin" 1HHiT2 + Xin e' &1 ye iyi(H + 1Xen
optimal margin
optimal hyperplane
Re
fe
re
nc
es
.&('.K&J"#&4&J%p)qCz"%//"%'*)o%.&'*7%2ppI*p&(2pI*%.&#:&cJ%'R/":3~'H%)(p"#2pp&Fw&J%'R)7"#I*'Rw
)('H7"#.n#&(pL&Bwi.2p2ps5)Bwy3&J%p)Cz"%/BK&J"#&B|wZ&J%'7%&(!3$#2%J."7"'H(7%2!#.%'@IH#.C"'I*%)(&c(z%I*&('*/
wy3#@)3$#2p2"z"34C'HBwZ)cz"7"7"& '*I*&(%)'*2:#&( '6&(,&J%'&(#!%!%u)('*&B)(p*'&J%'u.'H%'H(#2}pH#&n}%
#.C%p2pp&Fs5p2p2C'J%pu
J ' 'H!#.!8{N%}&n'/"!34'H%)(p"#2)7"#I*'. y O%'*I*&(p s's5p2}2
/"'H34%)n&(#&('&J"#&&J%'R(#&(pv wi#<'H#262p|wi'q7"%C%2}'H3~)IH#8C'q#)x2ps #)
R#.%/&J%'
7%&(!3$#2J."7"'Hn7%2:#8%'u.'H%'H(#2}p*'*)s'*2p2B!1#$C%p2p2pp/"!34'H%)(p"#2wi'H#&z"')7"#I*'.
A '*&
C"'x&J%'%7%&(!3$#2J87'H(7%2!#.%'4!wi'H#&z"'x)7"#I*'.]'s5p2p2l)J%sK%&J"#&5&J%'xs'*}u%J.&()
w%&J%'
7%&(!3$#2J87'H(7%2!#.%': &J%'xw'H#&cz"')7"#I*'$IH#.C"'4s-p&(&n'H #))(34'42p!%'H#.I*34C%!"#&(p<w
)z"7"7& '*I*&n)
i
)z"7"7& '*I*&n)
>5J%'2p!%'H#.,/"'*I*p)(pwz"%I*&n}%
!R&J%',wi'H#&z"')c7"#I*'s5p2p2l#I*I*/"!%u.2p1C"'xwl&J%',w%(3R
l )n}u%
)z"7"7"& '*I*&()
s-J%'H'
p)5&J%'/".&DF7"/Zz%I*&5C'*&Fs'*'H)z"7"7& '*I*&(%)
#8%/ '*I*&n
!w'H#&cz"'-)c7"#I*'.
>5J%'/"'*I*p)(pwyz"%I*&(pRIH#.&J%'H'cwi'C'/"'*)(IH!C'*/ #)G#~&Fs42!#X.'H5%'*&s(9lpuz"'&
"
"
classification
w1
wi
wj
nc
es
support vectors z i
in feature space
re
fe
nonlinear transformation
input vector, x
Re
lpuz"'
"62!#)()n{ZIH#&(p1C81#4)z"7"7&D '*I*&n5%'*&s(94w6#.Rz""9%%s-17"#&n&('H(p)-I*%%I*'H78D
&z"#2p2p/"%'C.~{N)(&&(#.%)wn34!%u&cJ%'7"#&n&('H(:8&($)n34'J%}u%J8Dy/":3~'H%)(p"#2w'H#&cz"'-)c7"#I*'.
=7%&(!3$#2J."7"'Hn7%2:#8%'oI*%)n&(z%I*&('*/]!r&J%p)wi'H#&z"')7"#I*'$/"'*&('Hn34!%'*)&cJ%'$z%&7"z%&H>5J%'
)(!34p2!#.p&R&(o#4&sDy2!#X.'H7"'HI*'H7%&cIH#.<C"')n'*'HRC.RI*37"#.p)(&(4lpuz"'
classification
u1
ui
uj
Lagrange multipliers
us
comparison
u = K( x k,x )
k
es
support
vectors, x k
input vector, x
re
nc
Re
fe
Re
fe
re
nc
es
p)4:37".)()n:C%2p'.<}&cJ]&J%p)x'*&('H%)n}%]s'$I*%)n}/"'H~&J%'$)z"7"7&D '*I*&n%'*&s(98)#)#<%'*s
I*2!#)()xw2p'H#.(%!%u]3$#I*J%!%'.Kl#)7"s'Hwyz%26#.%/z"% 'H)#2#)%'Hz"(#2%'*&s(98)HO%'*I*&n}%
s'5s5p2p2L/"'H34%)(&n#&('5J%ss'*2p2Zp&u.'H%'H(#2pp*'*)wilJ%puJ$/"'*u%'*'7.2p%%34!#2L/"'*I*})n}%)z"wy#I*'*)
iz"7<&(/"'H !]#RJ%puJ/"!34'H%)(p"#26)7"#I*'1e/"!34'H%)n}% . >@J%'~7'Hwi(3$#.%I*'4w&J%'
#2pu.}&cJ"3 p)$I*%3$7"#.'*/v&(&J"#&w-I*2!#)n)(pIH#252p'H#.(%!%uv3$#I*J%:%'*)'.uZT2p:%'H#8$I*2!#)()n{Z'H)HK9D
%'H#.'*)n&G%'*puJ%C"%)I*2:#)()(|{Z'H*KZ#.%/R%'Hz"(#2L%'*&s%(9.)XO%'*I*&n}% K
"K%#.%/ "~#8'5/"' .&n'*/o&(x&J%'
3$# F7".!8&()lwZ&J%'/"'H #&(p~wZ&J%'5#2pu.p&J"3 #.%/$#@/"})nIHz%)()(p4wZ)n34'6wZ}&n)67"7"'H&(p'*)H
'*&#p2p),w&J%'x/"'H #&(pp)'*2p'*u#&('*/1&($#.R#87"7"'H%/"pN
N
,
!#"%$'&(*),+!&$.-0/1&3245+!3!"
Re
fe
re
nc
es
>5J%')('*&@w2!#.C'*2p'*/ &c(#!%!%uq7"#&n&('H(%)
3 6 + + * 3687 97 , 6 :1; =<
HH
p))#}/r&(1C"'$2p!%'H#.2p])n'H7"#.(#.C%2p'$|w5&J%'H'4'*%})n&()# '*I*&n #.%/]#)(IH#2:#8 )cz%IJ&cJ"#&&J%'
!%'*z"#2}p&(p'*)
> |w 6
|w 6
#.' #2pp/w%5#2p2'*2p'H34'H8&(),w6&J%'x&(#!%!%uo)n'*&~ 6'*2ps s's-}&n'&J%'x:%'*"z"#2pp&(p'*)q !
&J%',wi(3@?
6
*A> % CB
XH
>@J%'7%&(!3$#2J8"7"'H(7%2!#.%'
.
p)&J%'$z"%p"z%'$%'$s-J%pI*J<)('H7"#.(#&n'*)&J%'4&(#!%!%u1/Z#&c#$s5p&Jr#o3$#!3$#263#.u.!L5p&x/"'*&('HD
34!%'*)&cJ%'$/"!'*I*&(p E
D s,J%'H'$&J%'4/"p)(&#.%I*'C"'*&s'*'H&J%'o7" F'*I*&n}%%)w&J%'4&n#!%:%u
' *I*&(%)lwZ&s-/"B'H'H8&6I* 2!#)( )('*)p)3$#%!3$#2iK'*IH#2p2"l}u%z"' T>5J%p)6/"p)(&#.%I*'GF * })lu8 'H
C8
FB *B HJI!34K!LN!M +PO c HCI!KQ3L#M +PO c
>5J%p)$34'H#.%)HKT&J"#&&J%'7%&(!3$#2J."7"'Hn7%2:#8%'1p)4&J%'qz"%p"z%'R%'&J"#&3~:%!34p*'*) i z"8D
/"'H,&J%'4I*%)(&c(#!.&n)o %%)(&(z%I*&(!%u1#.<7%&n:3#2J8"7"'H(7%2!#.%'p)&J%'H'cwi'$#z"#/Z(#&(pI
7".u%(#.3$34!%uo7"C%2p'H3R
*ny(
.
l0 '*I*&n)5 wis,J%}I*J
6
s5p2p26C"'x&('H(34'*/r\^ _._"`.abfXghXbi`.a(\*y<=7"7'H%/"p
s')J%s&cJ"#&5&J%' '*I*&n &J"#&/" '*&(*'H%( 34 !%'*)&J%'x7%&(!3$#2TJ."7"'H(7%2!#.%'4IH#.<C"'4s-p&(&n'HR#)
#42p!%'H#.@I*34C%:"#&(pRwl&(#:%!%u '*I*&()H
M
6
"
+
s-J%'H' > "O%!%I*' <%2pwix)z"7"7& '*I*&()qe)('*'=7"7"'H%/"p K&J%'o'*"7"'*)()n}%
" 'H7"'*)('H8&( )-#I*3$7"#I*&w%(3 w6s-}&n:%u ]'#2})no)J%s&cJ"#&5&(4{N%/<&J%' '*I*&(w
7" #.(#834'*&('H)/
+ 7
%'J"#)@&($)(.2 '&cJ%'wi.2p2}s5!%uz"#/Z(#&(pI7".u%X(H#. 3$34!%u$7"C%2p'H3R
re
nc
es
l
s5p&J<'*)7"'*I*&@&( + 9 7 K")z"C F'*I*&5&n$&J%'I*%)(&c(#!.&n)H
HX
> ]
s-J%'H' p)#8 B Dy/"!34'H%)(p"#2oz"%p& '*I*&( K
36 + 6 7 })&J%' B D
/"!34'H%)(p"#2 '*I*&(%HX wl2:#8C"'*2p)HKL#.%/ p)-#4)n3$3~'*&pI B
B DF3#&ps5p&J'*X2pH'H 34'H.&n)
6 6 %
XH
Re
fe
* -
Re
fe
re
nc
es
ewZ&cJ%'6&(#:%!%u,)('*&5 IH#.x%.&C'6)('H7"#.n#&('*/C8#5J87'H(7%2!#.%'.K.&cJ%'3$#.u.!C"'*&s'*'H7"#&(&n'H(%)
wL&J%'&Fs,I*2!#)()('*)6C'* I*34'*)#8(C%p&(#.x)3$#2p2iK'*)z%2p&(!%u!4&J%' #2:z%',wZ&J%'wyz"%I*&(p"#2
&z"(%!%u$#.(C%p&n#.2!#.u.'. #%:3~}*!%u4&J%'@wz"%I*&n}%"#2 z"%/"'HI*%)(&(#:8&() #.%/
%'&J%'H'cwi''*p&J%'Hl'H#I*J%'*)#3#!34z"3 e!&cJ%})lIH#)('6%'5J"#)TI*%)(&c(z%I*&('*/4&J%'J8"7"'H(7%2!#.%'
s5p&Jx&J%'53$#%!3$#2"3$#.u8: F KH%'{N%/")l&J"#&T&J%'53$#%!34z"3 '*I*'*'*/")l)(%34'6u. 'H<e2!#.u.'
I*%)(&c#..& e!s-J%pI*JIH#)('#x)('H7"#.(#&(pw&J%',&(#:%!%u/Z#&#xs5p&J#$3$#.u.!$2!#.u8'H&J%'H
D p)5!3$7"8)()(!C%2p'
>@J%' 7"C%2p'H3 w3$#%!34p*:%u wz"%I*&n}%"#2o z"%/"'HI*%)(&(#:8&()] #.%/ IH#.
C"')(.2 '*/ 'H'coI*p'H8&(2pz%)(!%u&J%'w.2p2ps5!%u )(I J%'H3~'. p/"'R&J%'$&(# :%! %u/Z#&# !.&(#
%z"3C'Hw7&(p%)s5p&JR#4'H#)("#.C%2p')3$#2}2B%z"34C"'HwT&(#!%!%u '*I*&(%)6!R'H#IJ7"&(pL
O%&#.&Bz%&6C8)(.2 !%u&J%'"z"#/Z(#&n}I7".un#.3$34!%u7"C%2p'H3 /"'*&('H(34!%'*/oC8&J%'{N)(&l7"%&(p
w6&n#!%:%uR/Z#&#",Z@&J%p)7"C%2p'H3&J%'H'#.'4&so7"8)()(!C%2p'$z%&(I*3~'.'*}&cJ%'H&J%p)7"%&(p
wl&J%'/Z#&#~IH#.1%.&,C"')n'H7"#.(#&('*/1C8R#oJ87'H(7%2!#.%'q:s-J%pIJ<IH#)n'&J%'wyz%2p2B)('*&,w/Z##)
s'*2p2NIH#8$%.&5C"',)('H7"#.(#&('*/ K.%l&J%',7%&(!3$#2LJ."7"'Hn7%2:#8%',w)('H7"#.(#&(!%u&cJ%'{N)(&7"&n}%$w
&J%'&n#!%:%u/Z#&#~})Gwz"%/B
A '*&&J%' '*I*&n5&cJ"#&3#!34p*'*)wz"%I*&(p"#2- !&cJ%'IH#)('w)('H7"#.(#&(pRw&J%'{N)(&
7"%&(pC' + =34%u]&cJ%'RI*/":"#&('*)ow '*I*&n + )(34'1#8'R'*z"#25&(]*'H>5J%'*
I*('*)7"%/<&(o%8Dy)cz"7"7"&@&(#!%!%u '*I*&()5w&J%p)7"&(pL #.9'#%'*s)('*&5w&n#!%:%u
/Z#&#rI*8&#!%!%uv&J%'1)cz"7"7"& '*I*&()w3 &J%'<{N)(&R7"%&(pvw&(#!%!%u/Z#&c#]#.%/&J%'
'*I*&(%)Bw"&J%'l)('*I*%/47"&(p&J"#&B/"'*)T%.&B)#&n})w,I*%)(&n#!.& K*s,J%'H' p)/"'*&('Hn34!%'*/
C8 + &J%p)R)('*&R#v%'*swz"%I*&(p"#2 ; p)RI*%)(&nz%I*&('*/#.%/3$#%:3~}*'*/ #& ;
8&(!%z%:%u&J%p)7"I*'*)n)-wl!%IH'H34'H8p2p1I*%)n&(z%I*&(!%uR#$)(82:z%&n}% '*I*&(% I* 'H!%uo#2p2
&J%'$7"%&(p%)w&J%'4&(#:%!%u</Z#&#%'$'*p&J%'H{N%/")&J"#&-p&p)x:37".)()n:C%2p'&(R)('H7"#.n#&('4&J%'
&(#:%!%u)('*&5s5p&J%%z%&5'H( K%%'I*%)(&c(z%I*&()5&J%'x7%&(!3$#2L)('H7"#8(#&(!%uoJ87'H(7%2!#.%'xw%&J%'
wyz%2}2L/Z#&c#)('*&HK .&('.K8&J"#&6/Zz":%u&J%p)G7"I*'*)()&J%' #2!z%'wB&J%'@wz"%I*&n}%"#2
p)-34%8&(%pIH#2p2pR:%IH'H#)(!%uZK")(!%I*'434'#.%/134'-&n#!%:%u '*I*&()#.'I*%)n}/"'H'*/1!&J%'
7%&(!34pH#&n}%LKZ2p'H#/"!%u&(o#4)3#2p2}'H,#.%/R)3#2p2}'H@)('H7"#.n#&(pRC"'*&s'*'HR&cJ%'&Fs4I*2!#)()('*)X
,
M+
>
% CB
Z%l)z8oI*p'H8&(2po)c3$#2p2 R&J%'@wz"%I*&n}%"#2 /"'*)(IH:CX'*H)5 &J%'z"34C'H6w &cJ%'-&n#!%:%u4'H()X
!%!34p*!%u %',{N%/"),)(34'3~:%!3$#2)z"C%)n'*&5wl&(#!%!%uo'Hn)H
3 6 * 3 6
XH
ewl&J%'*)('/Z##.''*%I*2:z%/"'*/<wy3 &J%'&n#!%:%u)('*&5%'IH#.)n'H7"#.(#&(',&J%''H3#!%:%uR7"#.&w
&J%',&(#!%!%u4)('*&6s@}&cJ%z%&'H()X>)n'H7"#.(#&('@&J%'-'H3$#:%!%uo7"#8&6wB&J%',&(#:%!%u4/Z#&#%'
IH#.I*%)n&(z%I*&-#.%7%&(!3$#2B)('H7"#8(#&(!%uoJ87'H(7%2!#.%'.
>@J%})@}/"'H#IH#.<C"''*"7"'*)()n'*/wn3$#2p2p1#)Hl34!%!34p*'$&J%',wyz"%I*&(p"#2
;
M
"
Re
fe
re
nc
es
>${N%/&J%' '*I*&(% +
7"C%2p'H3w63$#%:3~}*!%u
X H
)z"C '*I*&5&(I*%)(&(#:8&()
6
M+
T
@>
;
s-J%'H'
KL#.%/ #.'&cJ%'$)#.34''*2}'H3~'H.&()4#)z%)('*/!&J%'$7%&n:3~}H#&(p7"C%2p'H3wi
I*%)(&c(z%I*&(!%u]#.]%7%&(!3$#2J8"7"'H(7%2!#.%'.K p)~#<)(IH#2:#8*Kl#.%/ /"'*)(IH!C'*)oI*%/"!"#&('cDys@})n'
!%'*z"#2}p&(p'*)H
.&(',&J"#&4 !3$7%2pp'*)&J"#&5&J%')c3$#2p2p'*)(&#/Z34p)()(!C%2p' #2!z%' !wyz"%I*&(p"#25 p)
-
3$# + 97
HX
fe
re
nc
es
Re
z"%/"'Hx&J%'$I*%)n&(#!8&()
1#.%/ >5J%p)7"C%2p'H3 /"B'Hwy3&cJ%'o7"C%2p'H3 w5I*8D
() &nz%I*&(!%u#.v%7%&(!3$#253$#.> u8: I*2!#)()(|{Z'H %2pC.&J%'1#/"/"p&(p"#2,&('H(3 s5p&J
: &J%'
wyz"%I*&(p"#2-
. z%'&(R&J%p)&('Hn3 &J%'4)(82:z%&n}%&(&J%'$7"C%2p'H3wI*%%)(&(z%I*&(!%uR&J%'4)(w&
3$#.u.!I*2:#)()(|{Z'H,p)-z"%pz%'$#.%/R'*%p)(&()@w@#../Z#)('*&X
>@J%'4wz"%I*&n}%"#2
. p)$%.&"z"#/Z(#&(pIqC'*IH#.z%)('w5&J%'o&n'H(3s5p&J
" #%!34p*:%u
. )z"C '*I*&&(4&J%',I*%) (&(#:8&() > #.%/ C"'*2p%u8)5&(&cJ%'uz"7w)(DyIH#2p2}'*/RI*% '*
7".u%(#.3$34!%u]7"C%2p'H34)X>5J%'H'cwi'.K&(I*%)(&c(z%I*&4)(wi&~3#.u.!rI*2:#)()(|{Z'H$%%'RIH#.]'*p&J%'H
)(.2 '&J%'I* '*7".u%(#.3$34!%u7"C%2p'H3 !&J%' B Dy/"!34'H%)(p"#25)7"#I*'w-&J%'R7"#.(#834'*&('H)
K"%%'IH#.)(82 '&J%'xz"#/Zn#&(pI7"8u(#.3$3~:%u7"C%2p'H3 !&J%'/Zz"#2 B6 )7"#I*'wl&J%'
7"#.(#834'*&('H) #.%/ ,y<z"@'*7'H!34'H8&()s'I*%)(&c(z%I*&-&cJ%')nw&,3$#.u.!<J."7"'H(7%2!#.%'*)4C.
)(.2 !%uo&J%'/Zz"#2Bz"#/Z(#&(pI7".u(#.334!%u~7"C%2p'H3R
5
nc
es
>5J%'R#2pu.p&J"3~)/"'*)(IH!C'*/v:r&J%'R7"' }%z%))n'*I*&(p%)4I*%)(&(z%I*&4J."7"'Hn7%2:#8%'*)!&J%':"7"z%&
)7"#I*'.$>RI*%)(&c(z%I*J87'H(7%2!#.%':r#w'H#&cz"'4)7"#I*'$%'~{N)(&J"#)&(R&c(#.%)wi(3&J%'oD
/"!34'H%)(p"#2:"7"z%& '*I*&nl<:8&(4#.oDy/"!34'H%)(p"#2Zwi'H#&z"' '*I*&(&J"z%u%J$#xIJ%.pI*',w#.
Dy/"!34'H%)(p"#2 '*I*&(wyz"%I*&(pT
fe
re
6 le
M+
>@J%'$2p:%'H#8p&Frw5&J%'$/".&DF7"/Zz%I*&4!3$7%2pp'*)HKl&J"#&x&J%'$I*2!#)()n{ZIH#&(prwz"%I*&n}%!
wi5#.Rz""9%%s- '*I*&(G %%2}R/"'H7'H%/")&J%'/".&DF7"/Zz%I*&n)H
Re
le l le
M+
6 le le B
>5J%'p/"'H#,wZI*%)(&c(z%I*&(!%u)z"7"7"%&D '*I*&(%)%'*&s(98)BI*3~'*)Tw%3I*%)n}/"'H:%uxu.'H%'H(#28wi(34)
wl&J%'/".&DF7"/Zz%I*&5!<# p2!C"'H&)7"#I*'o F
l l
"
; KIH#.RC"'x'*7"#.%/"'*/<!R&J%',wi(3
,
M+
s-J%'H'
wl&J%'!8&('*u(#2B7"'Hn#&(/"'c{N%'*/]C.&J%'49'H(%'*2 T=)cz8oI*p'H.&I*%/"p&(p&($'H%)z"'
&J"#&
" /"'c{N%'*)#]/"8&DF7"%/Zz%I*&o!#wi'H#&z"'1)7"#I*'<p)R&J"#&o#2p2,&J%''*pu.'H #2!z%'*)1!&J%'
'*"7"#.%)(p
#.']7"8)(p&( '. >Bvu%z"#.(#.8&('*'&J"#&R&J%'*)n' I*%'coI*p'H8&()]#.']7".)n}&n '.Kp&R})
%'*I*'*)()#8q#8%/1)z8oI*p'H8&~ 'H)('H )5>5J%'*'H3 &J"#&5&J%'I*%%/"}&n}%
p)-)c#&(p){Z'*/wi5#2p2 )z%I*J&J"#&
;
Lz"%I*&(p%)&J"#&)#&(p)wi 'H)('H)&J%'*'H3IH#.$&cJ%'H 'cw%'-C"'z%)('*/R#)6/"8&DF7"%/Zz%I*&()Hl=,pcD
'H(3$#8LK(# 'H(3$#8 #8%/ EG.*%%'H$ 6I*%%)(p/"'H#I*% 82:z%&n}%w5&J%'4/".&DF7"%/Zz%I*&!r&J%'
wi'H#&z"')7"#I*'u. 'H1C8$wyz"%I*&(pw&c J%',wn3
q
.
T '*7"
s-J%pI*JR&J%'*IH#2p2 8&('H.&n:#2NLz"%I*&(p%)H
s' 'H*K&J%'$I* .2!z%&(pw&J%'4/".&DF7"/Zz%I*&x: w'H#&cz"'4)7"#I*'$IH#.rC"'$u. 'HrC.#..
wyz"%I*&(p)#&(p)wi%:%u&J%' 'H)('H'H )6I*%/"p&(pLK!R7"#.&(pIHz%2!#.&(4I*%)n&(z%I*&57".2p"%34!#2BI*2!#)D
)(|{Z'H@w/"'*u%'*' !<"Dy/"!34'H%)n}%"#2l!"7"z%&-)7"#I*'%'IH#.Rz%)n'&J%',wi.2p2ps@:%uwyz"%I*&(p
.
B
Re
fe
re
nc
es
"
z"%/"'H@&J%')(!3$7%2p'4I*%)(&(#:8&()H
s-J%'H'3#&p
6 6
nc
es
e % x
p)/"'*&n'H(34!%'*/C8-&cJ%'6'*2p'H34'H8&()w&J%'6&c(#!%!%u-)('*&XKH#.%/ X H p) &cJ%'Twz"%I*&(p/"'*&('H(34!%!%u
&J%'I* .2!z%&(pwl&J%'/".&DF7"%/Zz%I*&()H
>@J%'6)(.2!z%&(px&(,&J%'7%&(!34pH#&n}%7"C%2p'H3 IH#.C'w%z"%/'coI*p'H8&(2p~C8)(.2 !%u:8&('H(3~'cD
/"!#&('7%&n:3~}H#&(p17"%C%2}'H3~)5/"'*&('H(34!%'*/C.$&cJ%'&(#!%!%u$/Z#&#K&cJ"#&IHz"('H8&(2poI*%)n&(p&z%&('
&J%')z"7"7"%& '*I*&n)HR>5J%p)4&('*I*J"%}"z%'Rp)4/"'*)(IH!C"'*/v!O%'*I*&(p
1>5J%'$C%&c#!%'*/7%&n:3#2
/"'*I*p)(pwz"%I*&n}%Rp)z"%p"z%' $
% #I*J7%&(!34pH#&(p<7"C%2p'H3 IH#8RC"')n.2 '*/1z%)(!%uq#8.)(&#.%/Z#8/R&('*I*J"%}"z%'*)H
re
&'
4 + ( !4 +) 4 + $+*0--, +=$'&/. $!-
Re
fe
3!( XFen&
+ 1He(x0 1HX &1 %
1BLi 3c(e($n# 1XH.ny cFiny
G4 += 45+ 4 + $.- 4 -5!+!4 & 4 0 - +!$'& $,!#4 -%/ #&# )
Re
fe
re
nc
es
Re
fe
re
nc
es
B 5. J
es
re
nc
fe
Re
s5p&J x s'I*%)(&c(z%I*&6/"'*I*p)(p(z%2p'*)wT/"B'H'H8&6)('*&n)lw 7"#&n&('H(%)l!$&cJ%'-7%2!#.%'.6E'*)z%2p&()
wB&J%'*)(','*"7"'H!34'H8&()IH#.C"' p)z"#2pp*'*/ #8%/q7" p/"'%pI*'p2p2:z%)n&(#&(p%)@wB&J%'7s'HwB&J%'
#2pu.}&cJ"3R % "#.3$7%2p'*)5#.',)J%s-4!$lpuz"' "l>5J%' I*2!#)n)('*)5#.'-'H7"'*)('H.&n'*/qC8oC%2!#I*9$#.%/
s-J%p&('-C"z%2p2p'*&()Hy~&J%'{Zuz"'@s'!%/"pIH#&(',)z"7"7"%&7"#&n&('H(%)ls5p&J$#/"z"C%2p'5I*!I*2p'.KZ#.%/$'Hn)
s5p&J#RIH8)()HR>5J%'$)(82:z%&n}%%)o#.'$%7%&(!3$#2l!&J%'$)('H%)n'o&J"#&% %//"'*u'*'o7".2p"%34!#2p)
'*%})n&5&J"#&3$#.9',2p'*)()'H()X .&(pI*'&J"#&6&J%'%z"34C"'H)w)cz"7"7"&57"#&n&('H(%)5'*2!#&( ',&(4&J%'
%z"3C'H5w&(#!%!%uR7"#&(&('H(%),#.')3$#2}2i
B
[[
fe
re
nc
es
Re
2!#)n)(|{Z'H
(#Xs'H(%*K
z"3$#.R7'Hw%(3$#.%I*'
'*I*p)(p&'*'.KZ=ET>
'*I*p)(p&'*'.KZ6 "
'*)(& 2:#X.'H@%'Hz"(#2%'*&s(9
"
O"7'*I*:#2#.IJ%p&('*I*&z"' 42!#H8'H5%'*&s%(9
"
>T#.C%2p' 'Hwi(3$#.%I*',w #.}%z%)-I*2!#)()n{Z'H)I*.2p2}'*I*&n'*/1wy37"z"C%2}pIH#&n}%%)#.%/Rs-'*7'H|D
34'H8&()Hl'cwi'H'H%I*'*)-)n'*'&('*&X
re
F+ ;
+
+?
es
"
(#Xs
'H(*K
" |
" "
"4
"4
"
"4
nc
/"'*u'*'xw
7.2p%%34!#2
fe
Re
w&J%'o7"C%2p'H3 #&J"#.%/Bo>@J%'$'B'*I*&w)3~.&cJ%:%u<w&cJ%})x/Z#&#.C"#)('$#)#17"'cDF7"I*'*)()n:%u
wi5)cz"7"7"&D '*I*&(,%'*&s%(9.),s5#)5! '*)(&(pu#&n'*/ !v
*-@z"5'*"7"'H:3~'H.&()xs'I*J%.)('4&J%'
)34%.&J%!%u$9'H(%'*2B#)##.z%)()n:#8$s5p&J)(%/Z#./$/"' !#&(p !R#u'*'H34'H.&s5p&J<4
XF
x&J%'6'*"7"'H!34'H8&()s5p&Jx&J%p)l/Z#&#.C"#)n'6s'6I*%)n&(z%I*&('*/$7"82}"% 34!#2"!%/"pIH#&(%wz"%I*&(p%)
C"#)('*/x/"8&DF7"%/Zz%I*&()w&J%'w%(3
B>5J%'!"7"z%&l/":3~'H%)(p"#2pp&F4s5#) ."KH#.%/x&J%'6%/"'H
w&cJ%'7".2p"%34!#2(#8%u.'*/w3 &($"l>#8C%2}' /"'*)(IH!C"'*),&J%''*)z%2p&()@w&cJ%'-'*"7"'H!34'H8&()H
>5J%'&n#!%:%u/Z#&##.'%.&52p!%'H#.2p<)('H7"#.(#.C%2p'.
.&(pI*'.K,&J"#&&J%'%z"34C"'Hw)z"7"7& '*I*&n)R!%IH'H#)(' 'H)n2}s52p. >5J%'/"'*u'*'
7"82}"%34!#2lJ"#)5%%2}
. 34',)z"7"7"%& '*I*&()&J"#.&cJ%&
'
./R/"'*u'*'7.2p%%34!#2 #.%/
' 'H 2}'*)n)R&J"#.&J%'{N)(&/"'*u'*' 7.2p%%34! #2i >@J%'1/"!34'H%)(p"#2pp&w&cJ%'Rwi'H#&z"'<)7"#I*'
wi#R/"'*u'*'o7.2p%%34!#26p)J%s' 'H + &(!34'*)x2!#.u.'H,&J"#8&J%'4/"!34'H%)(p"#2pp&w&J%'
wi'H#&z"'5)c7"#I*'wil&
#
.%//"'*u'*'7".2p"%34!#2ZI*2!#)n)(|{Z'H* .&('@&J"#&67"'HFwn3$#.%I*'-#2!348)(&6/"%'*)
%.&IJ"#8%u.'s5p&Jv!%IH'H#)n:%u/"!34'H%)(p"#2pp&w&J%'<)7"#I*' :%/"pIH#&n:%u% 'HFDi{Z&(&(!%u
7"C%2p'H34)X
lpuz"'"AB#.C"'*2p'*/<'*#.37%2}'*),w'Hn)5%$&J%'&c(#!%!%uo)('*&Gw&J%' %/R/"'*u'*'7"82}"%34!#2
)z"7"7&D '*I*&nI*2!#)()(|{Z'H*
Re
fe
re
nc
es
G2
O"z"7"7L7"#&(&H
.
% (%6&c(#!
% n&('*)(&
2i
"
2i
2i
8
2i "
. "
.
2i
"
"
"
G2
2i6
."
"
2i
"
"
2i
.
Re
fe
re
nc
es
>T#.C%2p'
5E'*)z%2p&()%C%&#!%'*/<w,# ".&JR/"'*u%'*'$7".2p"%34!#2lI*2!#)()(|{Z'HR&J yO"> /Z#&c#.C"#)('.
>5J%')(p*'4w&cJ%'&(#:%!%u$)n'*&5p).K.."K%#.%/1&cJ%')(p*'wl&J%'&('*)(&@)('*&5p) "K|..47"#&(&('Hn%)H
Test
error
2%
8.4
2.4
1%
1.7
linear
classifier
k=3nearest
neighbor
LeNet1
1.1
1.1
LeNet4
SVN
The other classifiers in the benchmark study include a linear classifier, a k=3-nearest neighbor classifier with 60,000 prototypes, and two neural networks specially constructed for digit recognition (LeNet1 and LeNet4). The authors only contributed results for support-vector networks. The results of the benchmark are given in the table above. We conclude this section by citing the paper describing the benchmark study:
  Quite a long time LeNet1 was considered state of the art. Through a series of experiments in architecture, combined with an analysis of the characteristics of recognition errors, LeNet4 was crafted.
  The support-vector network has excellent accuracy, which is most remarkable, because unlike the other high-performance classifiers it does not include knowledge about the geometry of the problem. In fact the classifier would do as well if the image pixels were encrypted, e.g. by a fixed random permutation.
The last remark suggests that further improvement of the performance of the support-vector network can be expected from the construction of functions for the dot-product that reflect a priori information about the problem at hand.
The support-vector network combines three ideas: the solution technique from optimal hyperplanes (which allows for an expansion of the solution vector on support vectors), the idea of convolution of the dot-product (which extends the solution surfaces from linear to non-linear), and the notion of soft margins (to allow for errors on the training set). The algorithm has been tested and compared to the performance of other classical algorithms. Despite the simplicity of the design of its decision surface, the new algorithm exhibits a very fine performance in the comparison study. Other characteristics, like capacity control and the ease of changing the implemented decision surface, render the support-vector network an extremely powerful and universal learning machine.
APPENDIX

In this appendix we derive both the method for constructing optimal hyperplanes and soft margin hyperplanes.

A.1 Optimal hyperplanes

It was shown above that to construct the optimal hyperplane
    (w_0 \cdot x) + b_0 = 0,
which separates a set of training data (y_1, x_1), \ldots, (y_\ell, x_\ell), one has to minimize the functional
    \Phi(w) = \tfrac{1}{2}\,(w \cdot w)
subject to the constraints
    y_i [(w \cdot x_i) + b] \ge 1, \qquad i = 1, \ldots, \ell.
To do this we use a standard optimization technique. We construct the Lagrangian
    L(w, b, \Lambda) = \tfrac{1}{2}\,(w \cdot w) - \sum_{i=1}^{\ell} \alpha_i \big( y_i [(w \cdot x_i) + b] - 1 \big),
where \Lambda^\top = (\alpha_1, \ldots, \alpha_\ell) is the vector of non-negative Lagrange multipliers corresponding to the constraints.
It is known that the solution to this optimization problem is determined by the saddle point of the Lagrangian, where the minimum should be taken with respect to the parameters w and b, and the maximum should be taken with respect to the Lagrange multipliers \Lambda. At the point of the minimum (with respect to w and b) one obtains
    \frac{\partial L}{\partial w} = w - \sum_{i=1}^{\ell} \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{\ell} \alpha_i y_i = 0.
From these equalities we derive
    w = \sum_{i=1}^{\ell} \alpha_i y_i x_i, \qquad \sum_{i=1}^{\ell} \alpha_i y_i = 0.
In vector notation the functional to be maximized can then be rewritten as
    W(\Lambda) = \Lambda^\top \mathbf{1} - \tfrac{1}{2}\, \Lambda^\top D \Lambda,
where \mathbf{1} is an \ell-dimensional unit vector and D is a symmetric \ell \times \ell matrix with elements
    D_{ij} = y_i y_j (x_i \cdot x_j).
To find the desired saddle point it remains to locate the maximum of W(\Lambda) under the constraints
    \Lambda^\top Y = 0 \quad \text{and} \quad \Lambda \ge 0,
where Y^\top = (y_1, \ldots, y_\ell).
The Kuhn-Tucker theorem plays an important part in the theory of optimization. According to this theorem, at our saddle point each Lagrange multiplier and its corresponding constraint are connected by an equality
    \alpha_i \big( y_i [(w \cdot x_i) + b] - 1 \big) = 0, \qquad i = 1, \ldots, \ell.
From this equality it follows that non-zero values \alpha_i are only achieved in the cases where the corresponding constraint is met as an equality. We call the vectors x_i for which this happens support vectors. Note that in this terminology the expansion above states that the solution vector w can be expanded on support vectors.
Another observation, based on the Kuhn-Tucker conditions at the optimal solution, is the relationship between the maximal value W(\Lambda_0) and the separation distance \rho_0:
    W(\Lambda_0) = 2 / \rho_0^2,
where \rho_0 is the margin of the optimal hyperplane.

A.2 Soft margin hyperplanes

Taking into account the expression for the training errors, to find the soft margin hyperplane one has to minimize a functional that augments \tfrac{1}{2}(w \cdot w) with a penalty on the non-negative slack variables, under the constraints that every training example satisfies its relaxed separation inequality. The Lagrange functional for this problem contains, besides the non-negative multipliers \alpha_i arising from the separation constraints, further non-negative multipliers r_i enforcing the positivity of the slack variables. We have to find the saddle point of this functional: the minimum with respect to the primal variables and the maximum with respect to the multipliers. Using the conditions of the minimum at the extremum point and substituting them back into the Lagrange functional, one finds that the soft margin hyperplane is obtained by maximizing the resulting functional with respect to the non-negative variables \alpha_i under the corresponding constraints, where the second term of the functional minimizes the least-square value of the errors. This again leads to a quadratic (dual convex) programming problem whose solution \Lambda_0 determines the hyperplane.
REF [4]
Approved dissertation
Doctoral committee:
Chair: Prof. Dr. K. Obermayer
Reviewer: Prof. Dr. S. Jahnichen
Reviewer: Prof. Dr. V. Vapnik
Date of the scientific defense: 30 September 1997
Berlin 1997
D 83
Bernhard Scholkopf
Dissertation for the degree of Dr. rer. nat. | Summary of the main results
The subject of this work is learning pattern recognition as a statistical problem. A learning machine extracts, from a set of training patterns, structure that allows it to classify new examples. The work addresses the following questions:

Which "features" should be extracted from the individual training patterns?
| To study this question, a new form of nonlinear principal component analysis ("Kernel PCA") was developed. By using integral operator kernels, a linear principal component analysis can be carried out in feature spaces of very high dimensionality (for example, in the 10^10-dimensional space of all products of 5 pixels in 16x16-pixel images). Viewed in the input space, this leads to nonlinear feature extractors. The algorithm consists of solving an eigenvalue problem, in which the choice of different kernels permits the use of a large class of different nonlinearities.

Which of the training patterns contain the most information about the decision function to be constructed?
| This question, like the following one, was investigated using the "Support Vector algorithm" proposed by Vapnik a few years ago, within the statistical paradigm of learning from examples developed by Vapnik and Chervonenkis. Through the choice of different integral operator kernels, this algorithm allows the construction of a class of decision rules that contains neural networks, polynomial classifiers, and radial basis function networks as special cases. For images of 3-D object models and handwritten digits it could be shown that the different decision rules are on a par with the best previously known methods in their classification accuracy, and that their construction uses only a small subset of the training set (1% to 10% in the examples considered), largely independent of the particular choice of kernel.

How can "a priori" information, available in addition to the training patterns, best be used (for instance, information about the invariance of a class of images under translations)?
| The work proposes three methods, all of which lead to clear improvements in classification accuracy. Two of the methods consist of constructing special integral operator kernels adapted to the problem. The third method uses invariance transformations to generate additional artificial training examples from the above-mentioned subset (the "Support Vector set") of all training patterns.
Approved: Prof. Jahnichen
Foreword
The Support Vector Machine has recently been introduced as a new technique for solving various function estimation problems, including the pattern recognition problem.
To develop such a technique, it was necessary to first extract factors responsible for
future generalization, to obtain bounds on generalization that depend on these factors,
and lastly to develop a technique that constructively minimizes these bounds.
The subject of this book is methods based on combining advanced branches of
statistics and functional analysis, developing these theories into practical algorithms
that perform better than existing heuristic approaches. The book provides a comprehensive analysis of what can be done using Support Vector Machines, achieving record
results in real-life pattern recognition problems. In addition, it proposes a new form
of nonlinear Principal Component Analysis using Support Vector kernel techniques,
which I consider as the most natural and elegant way for generalization of classical
Principal Component Analysis.
In many ways the Support Vector machine became so popular thanks to works
of Bernhard Scholkopf. The work, submitted for the title of Doktor der Naturwissenschaften, appears as excellent. It is a substantial contribution to Machine Learning
technology.
Foreword
Re
fe
re
nc
es
What is interesting about Mr. Scholkopf's work is not only its technical content, but also the varied and very intensive contacts with international research institutions. They show that the author is able both to present and place his results at the scientific forefront, and to develop his results out of the work of the community. From this perspective the technical quality of the work is also evident.
Mr. Scholkopf investigates two fundamental problems in the classification of large data sets. One is the extraction of a few relevant, strong features to reduce the flood of information; the other is the description of data examples that are characteristic of a given classification problem. Both problems are examined thoroughly and exhaustively by Mr. Scholkopf, theoretically as well as in experiments. Both the very elegant method of nonlinear feature extraction developed in this work (kernel PCA) and the proposed further developments of the Support Vector machine use weak features, and thereby depart conceptually from the philosophy of strong features described above. In this respect the work reflects, to some degree, a paradigm shift in classification and feature extraction.
Mr. Scholkopf was a welcome guest at GMD FIRST Berlin during his dissertation, and it was a pleasure to read and supervise his work. I am particularly pleased that Mr. Scholkopf will continue his research in his new position at GMD FIRST.
Stefan Jahnichen, Director, GMD FIRST
Professor, Technische Universitat Berlin
Contents
Summary
1 Introduction and Preliminaries
2 The Support Vector Algorithm (Sections 2.1-2.7)
3 Introduction . Principal Component Analysis in Feature Spaces . Kernel Principal Component Analysis . Feature Extraction Experiments . Discussion
4 Introduction . Incorporating Transformation Invariances . Image Locality and Local Feature Extractors . Experimental Results . Discussion
5 Conclusion
A Object Databases
B Object Recognition Results
Bibliography
Acknowledgements
First of all, I would like to express my gratitude to Prof. H. Bulthoff, Prof. S. Jahnichen,
and Prof. V. Vapnik for supervising the present dissertation, and to Prof. K. Obermayer for chairing the committee in the "Wissenschaftliche Aussprache". I am grateful
to Vladimir Vapnik for introducing me to the world of statistical learning theory during numerous extended discussions in his office at AT&T Bell Laboratories. I have
deep respect for the completeness and depth of the body of theoretical work that he
and his co-workers have created over the last 30 years. To Heinrich Bulthoff, I am
grateful for introducing me to the world of biological information processing, during
my work on the Diplom and the doctoral dissertation. He created a unique research
atmosphere in his group at the Max-Planck-Institut fur biologische Kybernetik, and
provided excellent facilities without which the present work would not have been possible. I would like to thank Stefan Jahnichen for his advice, and for hosting me at the
GMD Berlin during several research visits. A significant amount of the reported work
was influenced and carried out during these stays, where I closely collaborated with
A. Smola and K.-R. Muller.
Thanks for financial support in the form of grants go to the Studienstiftung des
deutschen Volkes and the Max-Planck-Gesellschaft. In addition, it was the Studienstiftung that made it possible in the first place that I got to know Vladimir Vapnik at
AT&T in 1994, and that helped in getting A. Smola join the team in 1995.
A number of people contributed to this dissertation in one way or another. Let me
start with V. Blanz, C. Burges, M. Franz, D. Herrmann, K.-R. Muller, and A. Smola,
who helped at the very end, in proofreading the manuscript, leading to many improvements of the exposition.
The work for this thesis was done at several places, and each of the groups that I
was working in deserves substantial credit. More than half of the time was spent at the
Max-Planck-Institut fur biologische Kybernetik, and I would like to thank all members
of the group for providing a stimulating interdisciplinary research atmosphere, and for
bearing with me when I maltreated their computers at night with my simulations.
Special thanks go to the people in the Object Recognition group for a number of lively
discussions, and to the small group of theoreticians at the MPI, who helped me in
various ways over the years.
Almost one year of the time of my thesis work was spent in the adaptive systems
group at AT&T and Bell Laboratories. I learnt a lot about machine learning as applied
to real-world problems from all members of this excellent group. In particular, I would
like to express my thanks to C. Burges, L. Bottou, C. Cortes, and I. Guyon for helping
me understand Support Vectors, and to L. Jackel, Y. LeCun, and C. Nohl for making
my stays possible in the first place. In addition to their scientific advice, E. Cosatto,
P. Haffner, E. Sackinger, P. Simard and C. Watkins have helped me through their
friendship. Finally, I want to express my gratitude for the possibility to use code
and databases developed and assembled by these people and their co-workers. A
substantial part of this thesis would not have been possible without this.
During my time in the USA, I also had the opportunity to spend a month at
the Center for Biological and Computational Learning (Massachusetts Institute of
Technology), hosted by T. Poggio. I would like to thank him, as well as G. Geiger,
F. Girosi, P. Niyogi, P. Sinha, and K. Sung, for hospitality and fruitful discussions.
At the GMD, I had the possibility to interact with the local connectionists group,
which (in addition to those mentioned already) included J. Kohlmorgen, N. Murata,
and G. Ratsch. The present work proted a great deal from my stays in Berlin.
When starting to do research on one's own, one cannot help noticing that the
more specialized the field of work is, the more international and widespread seems to be
the group of people interested in it. Out of the scientists working on machine learning
and perception, I want to thank J. Buhmann, S. Canu, A. Gammerman, J. Held,
D. Kersten, J. Lemm, D. Leopold, P. Mamassian, G. Roth, S. Solla, F. Wichmann,
and A. Yuille for stimulating discussions and advice.
Without first studying science, it is hard to become a scientist. Studying science
predominantly means arguing about scientific problems. During my education, I was
in the favourable position to have enough people for scientific discussions. With many
of these friends and teachers, there is still contact and exchange of ideas. I would
like to thank all of them, and in particular G. Alli, C. Becker, V. Blanz, D. Corfield,
H. Fischer, M. Franz, D. Henke, D. Herrmann, D. Janzing, U. Kappler, D. Kopf,
F. Lutz, A. Rieckers, M. Schramm, and G. Sewell.
Finally, without my parents, I would not even have studied anything in the first
place. Many thanks to them.
Summary
Learning how to recognize patterns from examples gives rise to challenging theoretical
problems: given a set of observations,
which of the observations should be used to construct the decision boundary?
which features should be extracted from each observation?
how can additional information about the decision function be incorporated in
the learning process?
The present work is devoted to the above issues, studying Support Vectors in high-dimensional feature spaces, and Kernel PCA feature extraction.
The material is organized as follows. We start with an introduction to the problem
of pattern recognition, to concepts of statistical learning theory, and to feature spaces
nonlinearly related to input space (Chapter 1). The paradigm for learning from examples which is studied in this thesis, the Support Vector algorithm, is described in
Chapter 2, including empirical results obtained on realistic pattern recognition problems. The latter in particular includes the finding that the set of Support Vectors
extracted, i.e. those examples crucial for solving a given task, is largely independent
of the type of Support Vector machine used. One specific topic in the development
of Support Vector learning, the incorporation of prior knowledge, is studied in some
detail in Chapter 4: we describe three methods for improving classifier accuracies by
making use of transformation invariances and the local structure of images. Intertwined between these two chapters, we propose a novel method for nonlinear feature
extraction (Chapter 3), which works in the same types of feature spaces as Support
Vector machines, and which forms the basis of some developments of Chapter 4. Finally, Chapter 5 gives a conclusion. As such, it partly reiterates what has just been
said, and the reader who still remembers the present summary when arriving at Chapter 5 may nd it amusing to contemplate whether the conclusion coincides with what
had been evoked in their mind by the summary that they have just nished reading.
Copyright Notice. Sections 2.4 and 2.6.1 are based on Scholkopf, Burges, and Vapnik
(1995), AAAI Press. Section 2.5 is based on Scholkopf, Sung, Burges, Girosi, Niyogi,
Poggio, and Vapnik (1996c), IEEE. Chapter 3 is based on Scholkopf, Smola, and Muller
(1997b), MIT Press. Section 4.2.1 and figures 2.5 and 4.1 are based on Scholkopf,
Burges, and Vapnik (1996a), Springer Verlag. The author reserves for himself the
non-exclusive right to republish all other material.
"To see a thing one has to comprehend it. An armchair presupposes the
human body, its joints and limbs; a pair of scissors, the act of cutting.
What can be said of a lamp or a car? The savage cannot comprehend
the missionary's Bible; the passenger does not see the same rigging as the
sailors. If we really saw the world, maybe we would understand it."
J. L. Borges, There are more things. In: The Book of Sand, 1979, Penguin,
London.
Chapter 1
The present work studies visual recognition problems from the point of view of learning
theory. This first chapter sets the scene for the main part of the thesis. It gives a brief
introduction of the problem of Learning Pattern Recognition from examples. The two
main contributions of this thesis are motivated in the conceptual part of the chapter.
Section 1.1 discusses prior knowledge that might be available in addition to the
set of training examples, and introduces the problem of extracting useful features
from individual examples. The technical part of the chapter, Sec. 1.2, gives a concise
description of some mathematical concepts of Statistical Learning Theory (Vapnik,
1995b). This theory describes learning from examples as a problem of limited sample
size statistics and provides the basis for the Support Vector algorithm. Finally,
Sec. 1.3 introduces mathematical concepts of feature spaces, which will be of central
importance to all following chapters.
above three paragraphs. It studies an inductive learning algorithm which has been
developed in the framework of statistical learning theory, and it tries to enhance it by
incorporating prior knowledge about a recognition task at hand. Finally, it studies
the extraction of features for the purpose of recognition.
Even though pattern recognition is not limited to the visual domain, we shall focus
on visual recognition. Much of what is said in this thesis, however, would equally
apply to the recognition of acoustic patterns, say.
In the remainder of this section, we introduce the terminology which is used in
discussing different aspects of visual recognition problems: these are, in turn, the
data, the tasks, and the methods for recognition.
Data. Different types of pattern recognition problems make different types of as-
ideas put forward in the following were influenced by discussions with people in the MPI's
object recognition group, in particular with V. Blanz.
Prior Knowledge. The statistical approach of learning from examples in its pure form
neglects the additional knowledge of class structure described above. However, the
latter, referred to as Prior Knowledge, can be of great practical relevance in recognition
problems.
Suppose we were given temporal sequences of detailed observations (including spectra) of double star systems, and we would like to predict whether, eventually, one of
the stars will collapse into a black hole. Given a small set of observations of different
double star systems, including target values indicating the eventual outcome (supposing these were available), a purely statistical approach of learning from examples
would probably have difficulties extracting the desired dependency. A physicist, on
the other hand, would infer the star masses from the spectra's periodicity and Doppler
shifts, and use the theory of general relativity to predict the eventual fate of the stars.
Of course, one could argue that the physicist's model of the situation is based on
a huge body of prior examples of situations and phenomena which are related to the
above in one way or another. This, however, is exactly how the term prior knowledge
should be understood in the present context. It does not refer to a Kantian a priori,
as prior to all experience, but to what is prior to a given problem of learning from
examples.
What do we do, however, if we do not have a dynamical model of what is happening
behind the scenes? In this case, which for instance applies whenever the underlying
dynamics is too complicated, the strengths of the purely statistical approach become
apparent. Let us consider the case of handwritten character recognition. When a
human writer decides to write the letter 'A', the actual outcome is the result of a series
of complicated processes, which in their entirety cannot be modelled comprehensively.
The intensity of the lines depends on chemical properties of ink and paper, their shape
on the friction between pencil and the paper, on the dynamics of the writer's joints,
and on motor programmes initiated in the brain, these in turn are based on what the
writer has learnt at school | the chain could be continued ad infinitum. Accordingly,
nobody tries to recognize characters by completely modelling their generation.
However, the lack of a complete dynamical model does not mean that there is
no prior knowledge in handwritten digit recognition. For instance, we know that
handwritten digits do not change their class membership if they are translated on a
page, or if their line thickness is slightly changed. This type of knowledge can be used
to augment the purely statistical approach (Sec. 4.2). More abstract prior knowledge
in many visual recognition tasks includes the fact that the correlations between nearby
image locations are often more reliable features for recognition than those with larger
distances (Sec. 4.3).
Features. Before we proceed to the tasks that can be performed, depending on the
available data, we need to introduce a concept widely used in both statistics and in
the analysis of human perception. In its general form, a feature detector or feature
extractor is a function which assigns a (typically scalar) value to each raw observation.
Often, a number of different such functions are applied to the observations in a feature
extraction process, leading to a preprocessed vector representation of the data. The
goal of extracting features is to improve subsequent stages of processing, be it by
improving accuracies in a recognition task, or by reducing storage requirements or
processing time.
The feature functions serving this purpose can either be specied in advance, for
instance in a way such that they incorporate prior knowledge about a problem at hand,
or computed from the set of observations. Both approaches, as well as combinations
thereof, shall be addressed in this thesis (Chapter 3, Sec. 4.2.2, Sec. 4.3).
The actual term feature is used with different meanings. In vision research and
psychophysics, it is mainly used for the optimal stimulus of a corresponding feature
detector. However, note that given a nonlinear feature detector, it may be practically
impossible to determine this optimal stimulus. In statistics, on the other hand, the
term feature mostly refers to the feature values, i.e. the outputs of feature detectors, or
to the feature detector itself. Possibly, this ambiguity arose from the fact that, in some
cases, the different meanings coincide: in the case where the feature detector consists
of a dot product with a weight vector, as in linear neural network model receptive
fields, the optimal stimulus is aligned with the weight vector, and thus the two can be
identied.
Tasks. Suppose we are only given solitary views, and neither nontrivial classes nor
objects (which were structured collections of views). Then out of the tasks of discrimination, classification, and identification, only discrimination can be carried out
on these views. This does, however, not prevent the term discrimination from being
used also in the context of classes and objects. Discrimination, the mere detection of
a difference, can be preceded by feature extraction processes; in these cases, results
will depend on the extraction process used.
Classification consists of attributing views to classes, and thus requires the existence
of classes. These can be specified abstractly (by describing features, or Gibsonian
affordances, e.g. "something to sit on") or provided (approximately) by a sample
of training views. This definition of classes by training sets is widespread in machine
learning; it will also be the paradigm that we are going to use in this thesis. One talks
about yes-no and old-new classification tasks (one specified class) or naming tasks
(several classes). Pattern recognition problems like Handwritten Digit Recognition
are examples of classification.
Similarly, identification consists in determining to which object a presented view
belongs. As objects are special types of classes, we again have the possibility for the
above tasks. Identification makes sense only for objects: for instance, it is meaningless
to ask whether the rainbow we see today is a view of the same (object as a) rainbow
Human Object Recognition. The position that object recognition is not about re-
covering physical 3-D entities, but about learning their views, and potentially also their
transformation properties, can be supported by biological and psychological evidence.
Bulthoff and Edelman (1992) have shown that when recognizing unfamiliar objects,
observers exhibit viewpoint effects which suggest that they do not recover the 3-D
shape of objects, but rather build a representation based on the actual training views
(cf. also Logothetis, Pauls, and Poggio, 1995). They thought of this representation
as an interpolation mechanism (cf. Poggio and Girosi, 1990), but one could of course
conceive of more sophisticated mechanisms for combining information contained in
the training views. In the above terminology, one might argue that due to their unfamiliarity, the wire frame objects of Bulthoff and Edelman (1992) make it very hard
to use the transformations which form the structure of the underlying class of views.
Ullman (1996) has put forward a multiple-view variant of his theory of "recognition
by alignment", where objects are recognized by aligning them with stored view templates. The alignment process can make use of certain transformations specific to the
object in question. The results of Troje and Bulthoff (1996) have shown that these
transformations in some cases directly operate on 2-D views, and that they are much
simpler than transformations using an underlying 3-D model: in experiments probing
face recognition under varying poses, observers performed better on views which were
obtained by simply applying a mirror reversal transformation to a previously seen
view, rather than by rotating the head in depth to generate the true view of the other
side. Rao and Ballard (1997) recently proposed a model in which the "what" and the
"where" pathway (Mishkin, Ungerleider, and Macko, 1983) in the visual system are
conceived of as estimating object identity and transformations, respectively. Using a
collection of patches taken from natural images, they construct a generative model for
the data which learns, transforms and linearly combines simple basis functions. Their
model, however, does not directly make use of the valuable information contained in
the temporal stream of visual data: comparing subsequent images, e.g. by optic flow
techniques, would give a more direct means of constructing processing elements encoding transformations. Indeed, in the dorsal stream (the "where" pathway), neurons
have been found coding for various types of large-field transformations (Duffy and
Wurtz, 1991). Of somewhat related interest are the large-field neurons in the fly's
visual system, coding for specific flow fields which are generated by the fly's movement
in the environment (Krapp and Hengstenberg, 1996).
Representations and Processes. The above illustrates that the question of how,
should be used which allow solving a given task within specified limits on training
time, testing speed, error rate, and memory requirements.
Implementations. So far, not much has been said about actual implementations of
recognition systems. The present work focuses on algorithmic questions rather than
on questions of implementation, both with respect to the computational side and with
respect to the biological side of the recognition problem. The former normally need
not be justified: in statistics, scientific studies of mere algorithms, without discussion
of implementation details, are abundant. In biology, which is the main focus of interest
in the group where much of the present work was carried out, the type of abstraction
presented here is much less common. Indeed, the relevance of this thesis to biological
pattern recognition is on the level of statistical properties of problems and algorithms
| not more, and not less. In our hope that this type of theoretical work should be of
interest to people studying the brain, we concur with Barlow (1995):
"If artificial neural nets, designed to imitate cognitive functions of the
brain, are truly performing tasks that are best formulated in statistical
terms, then is this not likely also to be true of cognitive function in general? The idea that the brain is an accomplished statistical decision-making
organ agrees well with notions to be sketched in the last section of this
[Barlow's, the author] article."
To study object recognition from a statistical point of view, we shall in the following
section briefly review some of the basic concepts and results of statistical learning
theory.
Out of the considerable body of theory that has been developed in statistical learning
theory by Vapnik and others (e.g. Vapnik and Chervonenkis, 1968, 1974; Vapnik, 1979,
1995a,b), we briefly review a few concepts and results which are necessary in order to
be able to appreciate the Support Vector learning algorithm, which will be used in a
substantial part of the thesis.2
For the case of two-class pattern recognition, the task of learning from examples
can be formulated in the following way: we are given a set of functions
    \{ f_\alpha : \alpha \in \Lambda \}, \qquad f_\alpha : \mathbb{R}^N \to \{\pm 1\},    (1.1)
and a set of examples
    (x_1, y_1), \ldots, (x_\ell, y_\ell) \in \mathbb{R}^N \times \{\pm 1\},    (1.2)
each one of them generated from an unknown probability distribution P (x; y) containing the underlying dependency. (Here and below, bold face characters denote
vectors.) We want to learn a function f which provides the smallest possible value
for the average error committed on independent examples randomly drawn from the
same distribution P , called the risk
    R(\alpha) = \int \tfrac{1}{2}\, |f_\alpha(x) - y| \; dP(x, y).    (1.3)
The problem is that R(\alpha) is unknown, since P(x, y) is unknown. Therefore an induction principle for risk minimization is necessary.
The straightforward approach to minimize the empirical risk
    R_{emp}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} \tfrac{1}{2}\, |f_\alpha(x_i) - y_i|    (1.4)
turns out not to guarantee a small actual risk, if the number ` of training examples
is limited. In other words: a small error on the training set does not necessarily
imply a high generalization ability (i.e. a small error on an independent test set).
This phenomenon is often referred to as overfitting (e.g. Bishop, 1995). To make the
most out of a limited amount of data, novel statistical techniques have been developed
during the last 30 years. The Structural Risk Minimization principle (Vapnik, 1979)
is based on the fact that for the above learning problem, for any \alpha \in \Lambda and \ell > h,
with a probability of at least 1 - \eta, the bound
    R(\alpha) \le R_{emp}(\alpha) + \Phi\!\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right)    (1.5)
holds, where the confidence term \Phi is defined as
    \Phi\!\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right) = \sqrt{ \frac{ h \left( \log\frac{2\ell}{h} + 1 \right) - \log(\eta/4) }{ \ell } }.    (1.6)
The parameter h is called the VC(Vapnik-Chervonenkis)-dimension of a set of functions. It describes the capacity of a set of functions. For binary classification, h is the
maximal number of points which can be separated into two classes in all possible 2^h
ways by using functions of the learning machine; i.e. for each possible separation there
exists a function which takes the value +1 on one class and -1 on the other class.
A learning machine can be thought of as a set of functions (that the machine has
at its disposal), an induction principle, and an algorithmic procedure for implementing
the induction principle on the given set of functions. Often, the term learning machine
is used to refer to its set of functions | in this sense, we talk about the capacity or
VC-dimension of learning machines.
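To make the interplay of \ell, h, and the confidence level in (1.5)-(1.6) concrete, the bound can be evaluated numerically. The following sketch is not part of the original text; it assumes Python with NumPy and simply plugs numbers into the two formulas:

import numpy as np

def confidence_term(h, ell, eta):
    # Confidence term Phi(h/ell, log(eta)/ell) of (1.6).
    return np.sqrt((h * (np.log(2.0 * ell / h) + 1.0) - np.log(eta / 4.0)) / ell)

def risk_bound(r_emp, h, ell, eta=0.05):
    # Right-hand side of (1.5): guaranteed risk holding with probability >= 1 - eta.
    return r_emp + confidence_term(h, ell, eta)

ell = 1000
for h in (10, 50, 200, 800):
    # For a fixed empirical risk, the bound grows with the VC-dimension h.
    print(h, risk_bound(r_emp=0.05, h=h, ell=ell))

For the largest h the printed value exceeds 1, i.e. the bound makes no nontrivial prediction, which is exactly the situation discussed below.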
The bound (1.5), which forms part of the theoretical basis for Support Vector
learning, deserves some further explanatory remarks.
Suppose we wanted to learn a "dependency" where P(x, y) = P(x) P(y), i.e. where
the pattern x contains no information about the label y, with uniform P (y). Given a
training sample of fixed size, we can then surely come up with a learning machine which
achieves zero training error. However, in order to reproduce the random labellings,
this machine will necessarily require a VC-dimension which is large compared to the
sample size. Thus, the confidence term (1.6), increasing monotonically with h/\ell, will
be large, and the bound (1.5) will not support possible hopes that due to the small
training error, we should expect a small test error. This makes it understandable how
(1.5) can hold independent of assumptions about the underlying distribution P (x; y):
it always holds, but it does not always make a nontrivial prediction | a bound on an
error rate becomes void if it is larger than the maximum error rate. In order to get
nontrivial predictions from (1.5), the function space must be restricted such that the
VC-dimension is small enough (in relation to the available amount of data).3
According to (1.5), given a fixed number \ell of training examples, one can control
the risk by controlling two quantities: R_{emp}(\alpha) and h(\{f_\alpha : \alpha \in \Lambda'\}), \Lambda' denoting
some subset of the index set \Lambda. The empirical risk depends on the function chosen
by the learning machine (i.e. on \alpha), and it can be controlled by picking the right \alpha.
The VC-dimension h depends on the set of functions \{f_\alpha : \alpha \in \Lambda'\} which the learning
machine can implement. To control h, one introduces a structure of nested subsets
S_n := \{f_\alpha : \alpha \in \Lambda_n\} of \{f_\alpha : \alpha \in \Lambda\},
    S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots,    (1.8)
whose VC-dimensions satisfy
    h_1 \le h_2 \le \ldots \le h_n \le \ldots    (1.9)
For a given set of observations (x_1, y_1), \ldots, (x_\ell, y_\ell) the Structural Risk Minimization
principle chooses the function f_{\alpha_\ell^n} in the subset \{f_\alpha : \alpha \in \Lambda_n\} for which the guaranteed
3 The bound (1.5), formulated in terms of the VC-dimension, is only the last element of a series of
tighter bounds which are formulated in terms of other concepts. This is due to the inequalities
    H(\ell) \le H_{ann}(\ell) \le G(\ell) \le h \left( \log\frac{2\ell}{h} + 1 \right), \quad (\ell > h).    (1.7)
The VC-dimension h is probably the most-used and best-known concept in this row. However, the
other ones lead to tighter bounds, and also play important roles in the conceptual part of statistical
learning theory: the VC-entropy H and the Annealed VC-entropy H_{ann} are used to formulate
conditions for the consistency of the empirical risk minimization principle, and for a fast rate of
convergence, respectively. The Growth function G provides both of the above, independently of
the underlying probability measure P, i.e. independently of the data. The VC-dimension h, finally,
provides a constructive upper bound on the Growth function, which can be used to design learning
machines (for details, see Vapnik, 1995b).
FIGURE 1.1: Graphical depiction of (1.5), for fixed \ell. A learning machine with larger
complexity, i.e. a larger set of functions Sn , allows for a smaller training error; a less
complex learning machine, with a smaller Si , has smaller VC-dimension and thus provides
a smaller confidence term (cf. (1.6)). Structural Risk Minimization picks a trade-off in
between these two cases by choosing the function of the learning machine fn such that
the risk bound (1.5) is minimal.
risk bound (the right hand side of (1.5)) is minimal (cf. Fig. 1.1). The procedure of
selecting the right subset for a given amount of observations is referred to as capacity
control.
We conclude this section by noting that analyses in other branches of learning
theory have led to similar insights in the trade-off between reducing the training error and limiting model complexity, for instance as described by regularization theory
(Tikhonov and Arsenin, 1977), Minimum Description Length (Rissanen, 1978; Kolmogorov, 1965), or the Bias-Variance Dilemma (Geman, Bienenstock, and Doursat,
1992). Haykin (1994); Ripley (1996) give overviews in the context of Neural Networks.
This approach works fine for small toy examples, but it fails for realistically sized
problems: for N -dimensional input patterns, there exist
    N_F = \frac{(N + d - 1)!}{d!\,(N - 1)!}    (1.13)
different monomials (1.10), comprising a feature space F of dimensionality N_F. Already 16x16 pixel input images and a monomial degree d = 5 yield a dimensionality
of 10^10.
In certain cases described below, there exists, however, a way of computing dot
products in these high-dimensional feature spaces without explicitly mapping into
them: by means of nonlinear kernels in input space RN . Thus, if the subsequent
processing can be carried out using dot products exclusively, we are able to deal with
the high dimensionality.
The following section describes how dot products in polynomial feature spaces can
be computed efficiently, followed by a section which discusses more general feature
spaces.
In order to compute dot products of the form (\Phi(x) \cdot \Phi(y)), we employ kernel representations of the form
    k(x, y) = (\Phi(x) \cdot \Phi(y)),    (1.14)
which allow us to compute the value of the dot product in F without having to carry
out the map \Phi. This method was used by Boser, Guyon, and Vapnik (1992) to extend
    (\Phi_2(x) \cdot \Phi_2(y)) = (x \cdot y)^2 \quad \text{for } \Phi_2(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2),    (1.15)-(1.16)
i.e. the desired kernel k is simply the square of the dot product in input space. Boser,
Guyon, and Vapnik (1992) note that the same works for arbitrary N, d \in \mathbb{N}: as a
straightforward generalization of a result proved in the context of polynomial approximation (Poggio, 1975, Lemma 2.1), we have:
Proposition 1.3.1 Define C_d to map x \in \mathbb{R}^N to the vector C_d(x) whose entries are
all possible d-th degree ordered products of the entries of x. Then the corresponding
kernel computing the dot product of vectors mapped by Cd is
    k(x, y) = (x \cdot y)^d.    (1.17)
This holds since
    (C_d(x) \cdot C_d(y)) = \sum_{j_1, \ldots, j_d = 1}^{N} x_{j_1} \cdots x_{j_d}\; y_{j_1} \cdots y_{j_d}    (1.18)
    = \Big( \sum_{j=1}^{N} x_j\, y_j \Big)^{d} = (x \cdot y)^d.    (1.19)
Instead of ordered products, we can use unordered ones to obtain a map \Phi_d which
yields the same value of the dot product. To this end, we have to compensate for
the multiple occurrence of certain monomials in C_d by scaling the respective monomial
entries of \Phi_d with the square roots of their numbers of occurrence. Then, by this
definition of \Phi_d, and (1.17),
    (\Phi_d(x) \cdot \Phi_d(y)) = (C_d(x) \cdot C_d(y)) = (x \cdot y)^d.    (1.20)
    k(x, y) = (x \cdot y + 1)^d.
been discussed by Boser, Guyon, and Vapnik (1992); Vapnik (1995b). To construct a
map \Phi induced by a kernel k, i.e. a map \Phi such that k computes the dot product in
the space that \Phi maps to, they use Mercer's theorem of functional analysis (Courant
and Hilbert, 1953):
Proposition 1.3.2 If k is a continuous symmetric kernel of a positive integral operator K, i.e.
    (Kf)(y) = \int_C k(x, y)\, f(x)\; dx    (1.23)
with
    \int_{C \times C} k(x, y)\, f(x)\, f(y)\; dx\, dy \ge 0 \quad \text{for all } f \in L_2(C),    (1.24)
then it can be expanded in a uniformly convergent series in terms of Eigenfunctions \psi_j and positive Eigenvalues \lambda_j of K,
    k(x, y) = \sum_{j=1}^{N_F} \lambda_j\, \psi_j(x)\, \psi_j(y),    (1.25)
where N_F \le \infty.
Note that originally proven for the case where C = [a; b], this Proposition also holds
true for general compact spaces (Dunford and Schwartz, 1963).
For the converse of Proposition 1.3.2, cf. Appendix D.1.
From (1.25), it is straightforward to construct a map \Phi, mapping into a potentially
infinite-dimensional l_2 space, which does the job. For instance, we may use
    \Phi : x \mapsto \left( \sqrt{\lambda_j}\, \psi_j(x) \right)_{j = 1, \ldots, N_F}.    (1.26)
We thus have the following result (Boser, Guyon, and Vapnik, 1992):5
Proposition 1.3.3 If k is a continuous kernel of a positive integral operator (conditions as in Proposition 1.3.2), one can construct a mapping \Phi into a space where k
acts as a dot product,
    (\Phi(x) \cdot \Phi(y)) = k(x, y).    (1.27)
Besides (1.17), Boser, Guyon, and Vapnik (1992) and Vapnik (1995b) suggest the
usage of Gaussian radial basis function kernels (Aizerman, Braverman, and Rozonoer,
1964)
    k(x, y) = \exp\!\left( - \frac{\|x - y\|^2}{2\,\sigma^2} \right)    (1.28)
and sigmoid kernels
    k(x, y) = \tanh( \kappa\,(x \cdot y) + \Theta ).    (1.29)
Note that all these kernels have the convenient property of unitary invariance, i.e.
k(x, y) = k(Ux, Uy) if U^\top = U^{-1} (if we consider complex numbers, then U^* instead
of U^\top has to be used). The radial basis function kernel additionally is translation
invariant.
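As a small illustration (assuming Python with NumPy; the parameter values below are arbitrary and not taken from the text), the kernels (1.17), (1.28) and (1.29) can be coded directly, and the unitary invariance noted above can be checked with a random orthogonal matrix:

import numpy as np

def poly_kernel(x, y, d=3):
    return (x @ y) ** d                                          # cf. (1.17)

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))    # cf. (1.28)

def sigmoid_kernel(x, y, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * (x @ y) + theta)                      # cf. (1.29)

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))                     # random orthogonal matrix
print(np.isclose(rbf_kernel(x, y), rbf_kernel(U @ x, U @ y)))    # True: k(x, y) = k(Ux, Uy)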
5 In order to identify k with a dot product in another space, it would be sufficient to have pointwise
convergence of (1.25). Uniform convergence lets us make an assertion which goes further: given an
accuracy level \epsilon > 0, there exists an n \in \mathbb{N} such that even if the range of \Phi is infinite-dimensional, k
can be approximated within accuracy \epsilon as a dot product in \mathbb{R}^n, between images of
    \Phi_n : x \mapsto \left( \sqrt{\lambda_1}\, \psi_1(x), \ldots, \sqrt{\lambda_n}\, \psi_n(x) \right).
for all x, y \in C (this implies that k is symmetric). Note that this means that any
reproducing kernel k corresponds to a dot product in another space.
To establish a connection to the dot product in a feature space F , we next assume
that k is a Mercer kernel (cf. Proposition 1.3.2). First note that it is possible to
construct a dot product such that k becomes a reproducing kernel for a Hilbert space
of functions
    f(x) = \sum_{i=1}^{\infty} a_i\, k(x, x_i) = \sum_{i=1}^{\infty} a_i \sum_{j=1}^{N_F} \lambda_j\, \psi_j(x)\, \psi_j(x_i).    (1.32)
Using only linearity, which holds for any dot product \langle \cdot, \cdot \rangle, we have
    \langle f, k(y, \cdot) \rangle = \sum_{i=1}^{\infty} a_i \sum_{j,n=1}^{N_F} \lambda_j\, \psi_j(x_i)\, \langle \psi_j, \psi_n \rangle\, \lambda_n\, \psi_n(y).    (1.33)
    \gamma_n = \langle f, \sqrt{\lambda_n}\, \psi_n \rangle = \Big\langle \sum_{i=1}^{\infty} a_i \sum_{j=1}^{N_F} \lambda_j\, \psi_j(x_i)\, \psi_j \,,\; \sqrt{\lambda_n}\, \psi_n \Big\rangle = \sqrt{\lambda_n} \sum_{i=1}^{\infty} a_i\, \psi_n(x_i).    (1.36)
Comparing (1.35) and (1.26), we see that F has the structure of a RKHS in the sense
that for f and g given by (1.35) and
    g(x) = \sum_{j=1}^{N_F} \delta_j\, \sqrt{\lambda_j}\, \psi_j(x),    (1.37)
we have
    (\gamma \cdot \delta) = \langle f, g \rangle.    (1.38)
In practice, we are given a finite amount of data x_1, \ldots, x_\ell. The following simple
observation shows that even if we do not want to (or are unable to) analyse a given
kernel k analytically, we can still compute a map \Phi such that k corresponds to a dot
product in the linear span of the \Phi(x_i):
Proposition 1.3.4 Suppose the data x_1, \ldots, x_\ell and the kernel k are such that the matrix
    K_{ij} = k(x_i, x_j)    (1.39)
is positive. Then it is possible to construct a map \Phi into a feature space F such that
    (\Phi(x_i) \cdot \Phi(x_j)) = K_{ij}.    (1.40)
Conversely, for a map \Phi into some feature space F, the matrix K_{ij} = (\Phi(x_i) \cdot \Phi(x_j))
is positive.
Proof. Being positive, K can be diagonalized as
    K = S D S^\top    (1.41)
with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then
    K_{ij} = \sum_{k=1}^{\ell} D_{kk}\, S_{ik}\, S_{jk} = \big( \sqrt{D}\, s_i \cdot \sqrt{D}\, s_j \big),
where we have defined the s_i as the rows of S (note that the columns of S would be
K's Eigenvectors). Therefore, K is the dot product matrix (or Gram matrix) of the
vectors \sqrt{D}\, s_i.6 Hence the map \Phi, defined on the x_i by
    \Phi : x_i \mapsto \sqrt{D}\, s_i,    (1.46)
satisfies (1.40). For the converse, note that for any \alpha_1, \ldots, \alpha_\ell,
    \sum_{i,j=1}^{\ell} \alpha_i \alpha_j K_{ij} = \Big( \sum_{i=1}^{\ell} \alpha_i \Phi(x_i) \cdot \sum_{j=1}^{\ell} \alpha_j \Phi(x_j) \Big) \ge 0.    (1.47)
In particular, this result implies that given data x1 ; : : : ; x`, and a kernel k which gives
rise to a positive matrix K , it is always possible to construct a feature space F of
dimensionality \ell that we are implicitly working in when using kernels.
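The construction used in the proof of Proposition 1.3.4 can be carried out numerically. The following sketch is illustrative only (it assumes Python with NumPy and uses the Gaussian kernel (1.28) on a handful of random points):

import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))                        # data x_1, ..., x_ell
K = np.array([[rbf(a, b) for b in X] for a in X])  # Gram matrix (1.39)

# Diagonalize K and define Phi(x_i) = sqrt(D) s_i as in (1.41)-(1.46);
# the rows of eigvecs are the s_i, its columns are K's Eigenvectors.
eigvals, eigvecs = np.linalg.eigh(K)
Phi = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

print(np.allclose(Phi @ Phi.T, K))                 # dot products reproduce K, cf. (1.40)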
If we perform an algorithm which requires k to correspond to a dot product in some
other space (as for instance the Support Vector algorithm to be described below), it
could happen that even though k does not satisfy Mercer's conditions in general,
it still gives rise to a positive matrix K for the given training data. In that case,
6 The fact that every positive matrix is the Gram matrix of some set of vectors is well-known in
linear algebra (see e.g. Bhatia, 1997, Exercise I.5.10).
Proposition 1.3.4 tells us that nothing will go wrong during training when we work
with these data. Moreover, if a kernel leads to a matrix K with some small negative
Eigenvalues, we can add a small multiple of some positive definite kernel to obtain a
positive matrix.7
Note, finally, that Proposition 1.3.4 does not require the x_1, \ldots, x_\ell to be elements
of a vector space. They could be any set of objects which, for some function k (which
could be thought of as a similarity measure for the objects), gives rise to a positive
matrix (k(xi ; xj ))ij . Methods based on pairwise distances or similarities have recently
attracted attention (Hofmann and Buhmann, 1997). They have the advantage of being
applicable also in cases where it is hard to come up with a sensible vector representation
of the data (e.g. in text clustering).
7 For instance, for the hyperbolic tangent kernel (1.29), Mercer's conditions have not been verified.
It does not satisfy them in general: in a series of experiments with 2-D toy data, we noticed that the
dot product matrix K had some negative Eigenvalues, for most choices of \Theta that we investigated
(except for large negative values). Nevertheless, this kernel has successfully been used in Support
Vector learning (cf. Sec. 2.3). To understand the latter, note that by shifting the kernel (i.e. choosing
different values of \Theta), one can approximate the shape of the polynomial kernel (which is known to
be positive), as a function of (x \cdot y) (within a certain range), up to a vertical offset. This offset is
irrelevant in SV learning: due to (2.15), adding a constant to all elements of the dot product matrix
does not change the solution.
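The observation in the footnote can be checked empirically. A quick sketch (illustrative only; it assumes Python with NumPy, and whether negative Eigenvalues actually appear depends on the chosen values of kappa and Theta, as noted above):

import numpy as np

def tanh_kernel(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ y) + theta)        # sigmoid kernel (1.29)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))                       # 2-D toy data
K = np.array([[tanh_kernel(a, b) for b in X] for a in X])

# A Mercer kernel would give only nonnegative Eigenvalues here.
print("smallest Eigenvalue of K:", np.linalg.eigvalsh(K).min())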
Chapter 2
This chapter discusses theoretical and empirical issues related to the Support Vector
(SV) algorithm. This algorithm, reviewed in Sec. 2.1, is based on the results of
learning theory outlined in Sec. 1.2. Via the use of kernel functions (Sec. 1.3), it gives
rise to a number of different types of pattern classifiers (Vapnik and Chervonenkis,
1974; Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995; Vapnik, 1995b).
The original contribution of the present chapter is largely empirical. Using object and digit recognition tasks, we show that the algorithm allows us to construct
high-accuracy polynomial classifiers, radial basis function classifiers, and perceptrons
(Sections 2.2 and 2.3), relying on almost identical subsets of the training set, their
Support Vector sets (Sec. 2.4). These Support Vector Sets are shown to contain
all the information necessary to solve a given classification task. To understand the
relationship between SV methods and classical techniques, we then describe a study
comparing SV machines with Gaussian kernels to classical radial basis function networks, with results favouring the SV approach. Following this, Sec. 2.6 shows that
one can utilize the error bounds of learning theory to select values for free parameters
in the SV algorithm, as for instance the degree of the polynomial kernel which will
perform best on a test set (Scholkopf, Burges, and Vapnik, 1995; Blanz, Scholkopf,
Bultho, Burges, Vapnik, and Vetter, 1996; Scholkopf, Sung, Burges, Girosi, Niyogi,
Poggio, and Vapnik, 1996c). Finally, at the end of the chapter, we summarize various
ways of understanding and interpreting the high generalization performance of SV
machines (Sec. 2.7).
Note: \{ z \mid (w \cdot z) + b = 0 \} = \{ z \mid (c\,w \cdot z) + c\,b = 0 \} for c \ne 0.
FIGURE 2.1: A separating hyperplane, written in terms of a weight vector w and a threshold
b. Note that by multiplying both w and b with the same nonzero constant, we obtain the
same hyperplane, represented in terms of different parameters. Fig. 2.2 shows how to
eliminate this scaling freedom.
the VC-dimension for this class of decision functions (Sec. 2.1.2). This algorithm is
then generalized in two steps in order to obtain SV machines: nonseparable classification problems are dealt with in Sec. 2.1.3, and nonlinear decision functions, retaining
the VC-dimension bound, are described in Sec. 2.1.4.
To be able to utilize the results of Sec. 1.3, we shall formulate the algorithm in
terms of dot products in some space F . Initially, we think of F as the input space.
In Sec. 2.1.4, we will substitute kernels for dot products, in which case F becomes a
feature space nonlinearly related to input space.
    \{ z \in F : (w \cdot z) + b = 0 \}.    (2.1)
In this formulation, we still have the freedom to multiply w and b with the same
nonzero constant (Fig. 2.1). However, the hyperplane corresponds to a canonical pair
Note: (w \cdot z_1) + b = +1 and (w \cdot z_2) + b = -1 imply (w \cdot (z_1 - z_2)) = 2, and hence
(\frac{w}{\|w\|} \cdot (z_1 - z_2)) = \frac{2}{\|w\|}.
FIGURE 2.2: By requiring the scaling of w and b to be such that the point(s) closest to the
hyperplane satisfy |(w \cdot z_i) + b| = 1, we obtain a canonical form (w, b) of a hyperplane (cf.
Fig. 2.1). Note that in this case, the margin, measured perpendicularly to the hyperplane,
equals 2/\|w\|, which can be seen by considering two opposite points which precisely satisfy
|(w \cdot z_i) + b| = 1.
    \min_{i=1,\ldots,r} |(w \cdot z_i) + b| = 1,    (2.2)
i.e. that the scaling of w and b be such that the point closest to the hyperplane has
a distance of 1/\|w\| (Fig. 2.2).1 Thus, the margin between the two classes, measured
perpendicular to the hyperplane, is at least 2/\|w\|. The possibility of introducing a
structure on the set of hyperplanes is based on the following result (Vapnik, 1995b):
Proposition 2.1.1 Let R be the radius of the smallest ball BR (a) = fz 2 F : kz ?
ak < Rg (a 2 F ) containing the points z1; : : : ; zr , and let
(2.3)
be canonical hyperplane decision functions dened on these points. Then the set ffw;b :
kwk Ag has a VC-dimension h satisfying
h < R2A2 + 1:
(2.4)
1 The condition (2.2) still allows two such pairs: given a canonical hyperplane (w, b), another one
satisfying (2.2) is given by (-w, -b). However, we do not mind this remaining ambiguity: first, the
following Proposition only makes use of \|w\|, which coincides in both cases, and second, these two
hyperplanes correspond to different decision functions sgn((w \cdot z) + b).
Note. Dropping the condition \|w\| \le A leads to a set of functions whose VC-dimension
equals N_F + 1, where N_F is the dimensionality of F. Due to \|w\| \le A, we can get
VC-dimensions which are much smaller than N_F, enabling us to work in very high
dimensional spaces; remember that the risk bound (1.5) does not explicitly depend
upon N_F, but on the VC-dimension.
To make Proposition 2.1.1 intuitively plausible, note that due to the inverse proportionality of margin and \|w\|, (2.4) essentially states that by requiring a large lower
bound on the margin (i.e. a small A), we obtain a small VC-dimension. Conversely, by
allowing for separations with small margin, we can potentially separate a much larger
class of problems (i.e. a larger class of possible labellings of the training data, cf. the
definition of the VC-dimension, following (1.6)).
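A tiny numerical sketch may help (it is not from the original text; it assumes Python with NumPy, and the point set and hyperplane parameters are arbitrary). Rescaling (w, b) to the canonical form (2.2) makes the margin 2/||w|| explicit, and (2.4) then gives an upper bound on the VC-dimension of the hyperplanes with ||w|| <= A:

import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(20, 2))            # points z_1, ..., z_r
w, b = np.array([2.0, -1.0]), 0.5       # some hyperplane parameters

# Rescale to canonical form (2.2): min_i |(w . z_i) + b| = 1.
c = 1.0 / np.min(np.abs(Z @ w + b))
w_c, b_c = c * w, c * b

margin = 2.0 / np.linalg.norm(w_c)
R = np.max(np.linalg.norm(Z - Z.mean(axis=0), axis=1))   # radius of a ball enclosing the points
A = np.linalg.norm(w_c)
print("margin:", margin, " bound (2.4) on h:", R ** 2 * A ** 2 + 1)

Shrinking the data (smaller R) or enforcing a larger margin (smaller A) lowers the capacity bound, which is the trade-off exploited in the following sections.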
Recalling that (1.5) tells us to keep both the training error and the VC-dimension
small in order to achieve high generalization ability, we conclude that hyperplane decision functions should be constructed such that they maximize the margin, and at the
same time separate the training data with as few exceptions as possible. Sections 2.1.2
and 2.1.3 will deal with these two issues, respectively.
Suppose we are given a set of examples (z_1, y_1), \ldots, (z_\ell, y_\ell), z_i \in F, y_i \in \{\pm 1\}, and
we want to find a decision function f_{w,b} = \mathrm{sgn}((w \cdot z) + b) with the property
    f_{w,b}(z_i) = y_i, \qquad i = 1, \ldots, \ell.    (2.5)
If this function exists (the nonseparable case shall be dealt with in the next section),
canonicality (2.2) implies
    y_i ((z_i \cdot w) + b) \ge 1, \qquad i = 1, \ldots, \ell.    (2.6)
As an aside, note that out of the two canonical forms of the same hyperplane (w, b),
(-w, -b), only one will satisfy equations (2.5) and (2.6). The existence of class labels
thus allows to distinguish two orientations of a hyperplane.
Following Proposition 2.1.1, a separating hyperplane which generalizes well can
thus be found by minimizing
    \tau(w) = \tfrac{1}{2}\, \|w\|^2    (2.7)
subject to (2.6). To solve this convex optimization problem, one introduces a Lagrangian
    L(w, b, \alpha) = \tfrac{1}{2}\, \|w\|^2 - \sum_{i=1}^{\ell} \alpha_i \left( y_i ((z_i \cdot w) + b) - 1 \right)    (2.8)
with Lagrange multipliers \alpha_i \ge 0, which has to be maximized with respect to \alpha
and minimized with respect to w and b. The condition that at the saddle point, the
derivatives of L with respect to the primal variables must vanish,
    \frac{\partial}{\partial b} L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \alpha) = 0,    (2.9)
leads to
    \sum_{i=1}^{\ell} \alpha_i y_i = 0    (2.10)
and
    w = \sum_{i=1}^{\ell} \alpha_i y_i z_i.    (2.11)
The solution vector thus has an expansion in terms of training examples. Note that although the solution w is unique (due to the strict convexity of (2.7), and the convexity
of (2.6)), the coefficients \alpha_i need not be.
According to the Kuhn-Tucker theorem of optimization theory (e.g. Bertsekas,
1995), at the saddle point only those Lagrange multipliers \alpha_i can be nonzero which
correspond to constraints (2.6) which are precisely met, i.e.
    \alpha_i \left[ y_i ((z_i \cdot w) + b) - 1 \right] = 0, \qquad i = 1, \ldots, \ell.    (2.12)
there always exists a hyperplane separating the point from the interior of the set. This is called a supporting hyperplane. SVs do lie on the boundary of the convex hulls of the two classes, thus they possess supporting hyperplanes. The SV optimal hyperplane is the hyperplane which lies in the middle of the two parallel supporting hyperplanes (of the two classes) with maximum distance. Vice versa, from the optimal hyperplane one can obtain supporting hyperplanes for all SVs of both classes by shifting it by 1/‖w‖ in both directions.
³ Note that this implies that the solution (w, b), where b is computed using the fact that yᵢ((w · zᵢ) + b) = 1 for SVs, is in canonical form with respect to the training data. (This makes use of the reasonable assumption that the training set contains both positive and negative examples.)
⁴ In a statistical mechanics framework, Anlauf and Biehl (1989) have put forward a similar argument for the optimal stability perceptron, also computed by constrained optimization.
Consider what happens if one example zᵢ is removed from the training set and the SV machine is trained on the remaining ℓ − 1 examples, yielding a solution (w, b). Evaluating the left-out pattern, one of four cases occurs:
1. yᵢ((zᵢ · w) + b) > 1, i.e. the pattern is classified correctly and does not lie on the margin. These are patterns that would not have become Support Vectors anyway.
2. yᵢ((zᵢ · w) + b) = 1, i.e. zᵢ exactly meets the constraint (2.6). In that case, the solution w does not change, even though the coefficients αᵢ in the dual formulation of the optimization problem might change: namely, zᵢ might have become a Support Vector (i.e. αᵢ > 0) if it had been kept in the training set. In that case, the fact that the solution is the same no matter whether zᵢ is in the training set or not means that zᵢ can be written as Σ_SVs αᵢyᵢzᵢ with αᵢ ≥ 0. Note that this is not equivalent to saying that zᵢ can be written as some linear combination of the remaining Support Vectors: since the sign of the coefficients in the linear combination is determined by the class of the respective pattern, not any linear combination will do. Strictly speaking, zᵢ must lie in the cone spanned by the yᵢzᵢ, where the zᵢ are all Support Vectors.⁵
3. 1 > yᵢ((zᵢ · w) + b) > 0, i.e. zᵢ lies within the margin, but still on the correct side of the decision boundary. In that case, the solution looks different from the one obtained with zᵢ in the training set (for, in that case, zᵢ would satisfy (2.6) after training), but classification is nevertheless correct.
4. 0 > yᵢ((zᵢ · w) + b). In that case, zᵢ will be classified incorrectly.
Note that cases 3 and 4 necessarily correspond to examples which would have become SVs if kept in the training set; case 2 potentially includes such cases. However, only case 4 leads to an error in the leave-one-out procedure. Consequently, we have the following result on the generalization error of optimal margin classifiers (Vapnik and Chervonenkis, 1974):⁶
Proposition 2.1.2 The expectation of the number of Support Vectors obtained during training on a training set of size ℓ, divided by ℓ − 1, is an upper bound on the expected probability of test error.
A sharper bound can be formulated by making a further distinction in case 2, between SVs that must occur in the solution, and those that can be expressed in terms of the other SVs (Vapnik and Chervonenkis, 1974).
Substituting the conditions for the extremum, (2.10) and (2.11), into the Lagrangian (2.8), one derives the dual form of the optimization problem: maximize

    W(α) = Σ_{i=1}^{ℓ} αᵢ − ½ Σ_{i,j=1}^{ℓ} αᵢαⱼ yᵢyⱼ (zᵢ · zⱼ)    (2.13)
⁵ Possible non-uniquenesses of the solution's expansion in terms of SVs are related to zero Eigenvalues of Kᵢⱼ = yᵢyⱼ k(xᵢ, xⱼ), cf. Proposition 1.3.4. Note, however, the above caveat on the distinction between linear combinations and linear combinations with coefficients of fixed sign.
⁶ It also holds for the generalized versions of optimal margin classifiers explained in the following sections.
subject to the constraints

    αᵢ ≥ 0,  i = 1, …, ℓ,    (2.14)

    Σ_{i=1}^{ℓ} αᵢ yᵢ = 0.    (2.15)
On substitution of the expansion (2.11) into the decision function (2.3), we obtain an expression which can be evaluated in terms of dot products between the pattern to be classified and the Support Vectors,

    f(z) = sgn( Σ_{i=1}^{ℓ} αᵢ yᵢ (z · zᵢ) + b ).    (2.16)
In practice, a separating hyperplane often does not exist. To allow for the possibility of examples violating (2.6), Cortes and Vapnik (1995) introduce slack variables

    ξᵢ ≥ 0,  i = 1, …, ℓ,    (2.17)

and relax the separation constraints to

    yᵢ((zᵢ · w) + b) ≥ 1 − ξᵢ,  i = 1, …, ℓ.    (2.18)
The SV approach to minimizing the guaranteed risk bound (1.5) consists of the following: minimize

    τ(w, ξ) = ½‖w‖² + γ Σ_{i=1}^{ℓ} ξᵢ    (2.19)

subject to the constraints (2.17) and (2.18) (cf. (2.7)). Due to (2.4), minimizing the first term is related to minimizing the VC-dimension of the considered class of learning machines, thereby minimizing the second term of the bound (1.5) (it also amounts to maximizing the separation margin, cf. the remark following (2.2), and Fig. 2.2). The term Σ_{i=1}^{ℓ} ξᵢ, on the other hand, is an upper bound on the number of misclassifications on the training set (cf. (2.18)): this controls the empirical risk term in (1.5). For a
suitable positive constant γ, this approach therefore constitutes a practical implementation of Structural Risk Minimization on the given set of functions.⁷ Note, however, that Σ_{i=1}^{ℓ} ξᵢ is significantly larger than the number of errors if many of the ξᵢ attain large values, i.e. if the classes to be separated strongly overlap, for instance due to noise. In these cases, there is no guarantee that the hyperplane will generalize well.
As in the separable case (2.11), the solution can be shown to have an expansion

    w = Σ_{i=1}^{ℓ} αᵢ yᵢ zᵢ,    (2.20)
where nonzero coefficients αᵢ can only occur if the corresponding example (zᵢ, yᵢ) precisely meets the constraint (2.18). The coefficients αᵢ are found by solving the following quadratic programming problem: maximize

    W(α) = Σ_{i=1}^{ℓ} αᵢ − ½ Σ_{i,j=1}^{ℓ} αᵢαⱼ yᵢyⱼ (zᵢ · zⱼ)    (2.21)

subject to the constraints

    0 ≤ αᵢ ≤ γ,  i = 1, …, ℓ,    (2.22)

    Σ_{i=1}^{ℓ} αᵢ yᵢ = 0.    (2.23)
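Relative to the separable case, the only change in the dual is the upper bound γ on the coefficients. Continuing the sketch above (variable names as introduced there; the value of γ is illustrative):

```python
# Soft-margin dual (2.21)-(2.23): same objective as before, but the box
# constraint 0 <= alpha_i <= gamma of (2.22) replaces alpha_i >= 0.
gamma = 10.0  # regularization constant from (2.19); illustrative value
bounds = [(0.0, gamma)] * l
res = minimize(neg_W, np.ones(l), bounds=bounds, constraints=cons)
```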
Although we have already introduced the concept of Support Vectors, one crucial ingredient of SV machines in their full generality is still missing: to allow for much more general decision surfaces, one can first nonlinearly transform a set of input vectors x₁, …, x_ℓ into a high-dimensional feature space by a map Φ : xᵢ ↦ zᵢ and then do a linear separation there.
Note that in all of the above, we made no assumptions on the dimensionality of F. We only required F to be equipped with a dot product. The patterns zᵢ that we talked about in the previous sections thus need not coincide with the input patterns. They can equally well be the results of mapping the original input patterns xᵢ into a high-dimensional feature space.
Maximizing the target function (2.21) and evaluating the decision function (2.16) then requires the computation of dot products (Φ(x) · Φ(xᵢ)) in a high-dimensional space. Under Mercer's conditions, given in Proposition 1.3.2, these expensive calculations can be reduced significantly by using a suitable function k such that

    (Φ(x) · Φ(xᵢ)) = k(x, xᵢ).    (2.24)
⁷ It slightly deviates from the Structural Risk Minimization (SRM) Principle in that (a) it does not use the bound (1.5), but a related quantity (2.19) which can be minimized efficiently, and (b) the SRM Principle strictly speaking requires the structure of sets of functions to be fixed a priori. For more details, cf. Vapnik (1995b); Shawe-Taylor, Bartlett, Williamson, and Anthony (1996).
FIGURE 2.3: By mapping the input data (top left) nonlinearly (via Φ) into a higher-dimensional feature space F (here: R³), and constructing a separating hyperplane there (bottom left), an SV machine (top right) corresponds to a nonlinear decision surface in input space (here: R², bottom right).
The decision function (2.16) then becomes

    f(x) = sgn( Σ_{i=1}^{ℓ} yᵢαᵢ k(x, xᵢ) + b ).    (2.25)
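As an illustration, kernels of the three families referred to below as (2.26)-(2.28) can be plugged into (2.25) without ever computing Φ explicitly. A minimal sketch, with parameter values of our own choosing (the exact normalizations used in the experiments, e.g. the factors 256 and 784 mentioned with Tables 2.4 and 2.8, are tailored to the input dimensionality):

```python
import numpy as np

def poly_kernel(x, z, d=3):                        # cf. (2.26)
    return (x @ z) ** d

def rbf_kernel(x, z, c=1.0):                       # cf. (2.27)
    return np.exp(-np.sum((x - z) ** 2) / c)

def sigmoid_kernel(x, z, kappa=1.0, theta=-1.0):   # cf. (2.28)
    return np.tanh(kappa * (x @ z) + theta)

def decision(x, X_sv, y_sv, alpha_sv, b, kernel):
    """Evaluate the kernelized decision function (2.25)."""
    return np.sign(sum(a * yi * kernel(x, xi)
                       for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b)
```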
Consequently, everything that has been said about the linear case also applies to nonlinear cases obtained by using a suitable kernel k instead of the Euclidean dot product (Fig. 2.3). By using different kernel functions, the SV algorithm can construct a variety of learning machines (Fig. 2.4), some of which coincide with classical architectures:
[FIGURE 2.4: Architecture of an SV machine. The input vector x is compared to the Support Vectors x₁, …, x₄ via the kernel; the results are combined with the weights, and their sum determines the classification sgn(· + b).]

polynomial classifiers of degree d,

    k(x, xᵢ) = ((x · xᵢ))^d,    (2.26)

radial basis function classifiers,

    k(x, xᵢ) = exp(−‖x − xᵢ‖² / c),    (2.27)

and neural networks with sigmoid activation,

    k(x, xᵢ) = tanh(κ (x · xᵢ) + Θ).    (2.28)
The coefficients αᵢ are again computed by solving a quadratic programming problem: maximize

    W(α) = Σ_{i=1}^{ℓ} αᵢ − ½ Σ_{i,j=1}^{ℓ} αᵢαⱼ yᵢyⱼ k(xᵢ, xⱼ)    (2.29)
subject to the constraints (2.22) and (2.23). Since k is required to satisfy Mercer's conditions, it corresponds to a dot product in another space (2.24), thus Kᵢⱼ := (yᵢyⱼ k(xᵢ, xⱼ))ᵢⱼ is a positive matrix, providing us with a problem that can be solved efficiently. To see this, note that (cf. Proposition 1.3.4)

    Σ_{i,j=1}^{ℓ} αᵢαⱼ yᵢyⱼ k(xᵢ, xⱼ) = ( Σ_{i=1}^{ℓ} αᵢyᵢΦ(xᵢ) · Σ_{j=1}^{ℓ} αⱼyⱼΦ(xⱼ) ) ≥ 0.    (2.30)
The threshold b can be computed by exploiting the fact that, by (2.12), for any Support Vector xⱼ which exactly meets the margin,

    Σ_{i=1}^{ℓ} yᵢαᵢ k(xⱼ, xᵢ) + b = yⱼ,    (2.31)

hence

    b = yⱼ − Σ_{i=1}^{ℓ} yᵢαᵢ k(xⱼ, xᵢ).    (2.32)
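In a numerical implementation one would typically average (2.32) over all Support Vectors rather than rely on a single one; a sketch continuing the earlier toy example (variable names as introduced there; the threshold 1e-8 is our own choice):

```python
# Threshold via (2.32), averaged over all Support Vectors for stability
# (in exact arithmetic any single SV would give the same b).
sv_idx = np.where(alpha > 1e-8)[0]
b = np.mean([y[j] - (alpha * y) @ K[:, j] for j in sv_idx])
```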
The SV algorithm also generalizes to regression estimation, where one estimates a real-valued function

    f(z) = (w · z) + b.    (2.33)

In the ε-insensitive formulation (cf. Fig. 2.6), one minimizes

    τ(w, ξ, ξ*) = ½‖w‖² + γ Σ_{i=1}^{ℓ} (ξᵢ + ξᵢ*)    (2.34)

subject to

    ((w · zᵢ) + b) − yᵢ ≤ ε + ξᵢ,    (2.35)
    yᵢ − ((w · zᵢ) + b) ≤ ε + ξᵢ*,    (2.36)
    ξᵢ ≥ 0,  ξᵢ* ≥ 0,    (2.37)

for all i = 1, …, ℓ.
FIGURE 2.5: Example of a Support Vector classifier found by using a radial basis function kernel k(x, y) = exp(−‖x − y‖²). Both coordinate axes range from −1 to +1. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (2.6). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task (cf. Sec. 2.5). Grey values code the modulus of the argument Σ_{i=1}^{ℓ} yᵢαᵢ k(x, xᵢ) + b of the decision function (2.25). (From Schölkopf, Burges, and Vapnik (1996a).)
Generalization to nonlinear regression estimation is carried out using kernel functions, in complete analogy to the case of pattern recognition. A suitable choice of the kernel function then allows the construction of multi-dimensional splines (Vapnik, Golowich, and Smola, 1997).
Different types of loss functions can be utilized to cope with different types of noise in the data (Müller, Smola, Rätsch, Schölkopf, Kohlmorgen, and Vapnik, 1997; Smola and Schölkopf, 1997b).
FIGURE 2.6: In SV regression, a desired accuracy ε is specified a priori. It is then attempted to fit a tube with radius ε to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables ξ) is determined by minimizing (2.34).
Multi-class problems with k classes are handled by training k binary recognizers, one separating each class from the rest, and classifying a test point x by taking

    argmax_{j=1,…,k} g^j(x),  where  g^j(x) = Σ_{i=1}^{ℓ} yᵢαᵢ^j k(x, xᵢ) + b^j    (2.38)

(note that f^j(x) = sgn(g^j(x)), cf. (2.25)). The values g^j(x) can also be used for reject decisions (e.g. Bottou et al., 1994), for instance by considering the difference between the maximum and the second highest value as a measure of confidence in the classification.
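A sketch of this arbitration rule; the containers holding the per-class coefficients αʲ, thresholds bʲ and one-versus-rest label vectors are hypothetical names of ours:

```python
import numpy as np

def ovr_predict(x, X_train, labels_per_class, alphas, bs, kernel):
    """One-versus-rest arbitration (2.38): evaluate g_j(x) for each binary
    recognizer and return the winning class, together with the difference
    between the best and second-best response as a confidence value."""
    k_vec = np.array([kernel(x, xi) for xi in X_train])
    g = np.array([(a * yj) @ k_vec + b
                  for a, yj, b in zip(alphas, labels_per_class, bs)])
    g_sorted = np.sort(g)
    return int(np.argmax(g)), g_sorted[-1] - g_sorted[-2]
```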
In the following sections, we shall report experimental results obtained with the SV algorithm. We used the Support Vector algorithm with standard quadratic programming techniques⁸ to construct polynomial, radial basis function and neural network classifiers. This was done by choosing the kernels (2.26), (2.27), (2.28) in the decision function (2.25) and in the function (2.29) to be maximized under the constraints (2.22) and (2.23). We shall start with object recognition experiments (Sec. 2.2), and then move to handwritten digit recognition (Sec. 2.3).
⁸ An existing implementation at AT&T Bell Labs was used, largely programmed by L. Bottou, C. Burges, and C. Cortes.
FIGURE 2.7: Examples from the entry level (top) and animal (bottom) databases. Left: rendered views of two 3-D models; right: 16×16 downsampled images, and four 16×16 downsampled edge detection patterns.
circumstances, we reasoned that it should be possible to separate the data with zero training error even with moderate classifier complexity, and we decided to determine the value of the constant γ (cf. (2.19)) by the following heuristic: out of all values 10ⁿ, with integer n, we chose the smallest one which made the problem separable. On the entry level databases, this led to γ = 1000, on the animal databases, to γ = 100. Of both databases, we used 12 variants, obtained by
• choosing one of three database sizes: 25 or 89 (regularly spaced), or 100 (random, uniformly distributed) views per object;
• choosing either grey-scale images or binarized silhouette images (both in downsampled versions); and
• using just 16×16 resolution images, obtained from the original images by downsampling, or additionally four more 16×16 patterns, containing downsampled versions of edge detection results obtained from the original images. Note that in the latter case, the resulting 1280-dimensional vectors contain information which is not contained in the 16×16 images, since the edge detection, involving a (nonlinear) modulus operation, is done before downsampling (cf. Blanz, Schölkopf, Bülthoff, Burges, Vapnik, and Vetter, 1996).
For more details on the databases, see (Liter et al., 1997), and Appendix A. Example images of the original models, and of the downsampled images and edge detection patterns for the entry level and the animal databases, are given in Fig. 2.7.
We trained polynomial SV machines on these 25-class recognition tasks, and obtained accuracies which in some cases exceeded 99% (see Table 2.1). A few aspects of the results deserve being pointed out:
Performance. The highest recognition rates were obtained using polynomial SV classifiers.
TABLE 2.1: Object recognition test error rates on different databases of 25 objects, using polynomial SV classifiers of varying degrees. The training sets containing 25 and 89 views per object were regularly spaced; those with 100 views were distributed uniformly. Testing was done on an independent test set of 100 random views per object. All views were taken from the upper viewing hemisphere. For further discussion, see Sec. 2.2.1.
[Table 2.1 body: test error rates (in %) over polynomial degree columns (including 12, 15, 20 and 25), for the rows entry level / animals × {25, 89, 100 views per object} × {grey scale, silhouettes}. The column alignment of the extracted numbers is not reliably recoverable here.]
TABLE 2.2: Numbers of SVs for the object recognition systems of Table 2.1, on different databases of 25 objects, using polynomial SV classifiers of varying degrees. The training sets containing 25 and 89 views per object were regularly spaced; the ones with 100 views were distributed uniformly. The numbers of SVs are averages over all 25 binary classifiers separating one object from the rest; they should be seen in relation to the size of the training set, which for the above numbers of views per object was 625, 2225, and 2500, respectively. The given numbers of SVs thus amount to roughly 10% of the database sizes. For the silhouette databases, the numbers (not shown here) are very similar, only slightly bigger.
degree:           …                            12   15   20   25
entry level:
25 grey scale     86   74   71   70   72   74   79   92
89 grey scale    219  148  132  128  128  133  144  165
100 grey scale   206  139  121  117  119  122  135  158
animals:
25 grey scale    108   96   89   90   91   95  100  112
89 grey scale    231  196  180  177  178  183  193  208
100 grey scale   235  196  176  169  169  174  185  199
Support Vectors. The numbers of SVs (Table 2.2) of the individual recognizers for each object make up about 5% to 15% of the whole databases. The fraction decreases with increasing database size.
For polynomial machines of degree 1 (i.e. separating hyperplanes in input space), the problem is not separable. In that case, all training errors show up as SVs (cf. (2.18)), causing a fairly large number of SVs. For degrees higher than 1, the number of SVs slightly increases with increasing polynomial degree. However, the increase is rather moderate, compared with the increase of the dimensionality of the feature space that we are implicitly working in (cf. Sec. 1.3). Interestingly, the number of SVs does
FIGURE 2.8: Angular distribution of the viewing angles of those training views which became SVs, for a polynomial SV machine of degree 20 on the animal (left) and entry level (right) databases (100 grey level views per object, without edge detection; the overall numbers of SVs in the two plots are n = 4632 and n = 3385). The plotted distributions for azimuth (top) and elevation (bottom) have been normalized by the corresponding distributions in the training set (see Fig. A.1). It can be seen that SVs tend to occur more often for top, front and back views. In this and the following plots, views which become SVs for more than one of the 25 binary recognizers are counted according to their frequency of occurrence. Consequently, there is no contradiction in the overall number of SVs n exceeding the database size (2500).
not change much if we add edge detection information, even though this increases the input dimensionality by a factor of 5.
As each of the training examples is associated with two viewing angles (φ, θ) (cf. Appendix A), we can look at the angular distribution of SVs and errors. It is shown in Figures 2.8-2.10, and, in more detail, in Figures B.2-B.9 in the appendix (there, we also give an example of a full SV set of one of the binary recognizers, in Fig. B.1). The density of SVs is increased at high polar angles, i.e. for viewing the objects from the top. Also, SVs tend to be found more often for frontal and back views than for views closer to the side. Top, frontal and back views typically are harder to classify than views from more generic points of view (Blanz, 1995). We can thus interpret our
finding as an indication that the density of SVs is related to the local complexity of the classification surface, i.e. the local difficulty of the classification task. Indeed, the same qualitative behaviour is found for the distribution of recognition errors (Figures 2.9 and 2.10).
There are several factors contributing to the difficulty in classifying top, frontal and back views. First note that since most objects in our databases are bilaterally symmetric, top, frontal and back views contain a large amount of redundancy. In contrast, side views of symmetric objects contain a maximal amount of information. Moreover, many objects in the databases have their main axis of elongation roughly aligned with the direction of the zero view (φ = 0, θ = 0). Consequently, frontal and back views suffer the drawback of showing a projection of a comparably small area.
As an aside, note that although the SVs live in a high-dimensional space, the particular setup of the presented object recognition experiments made it possible to discuss the relationship between the difficulty of the task, the distribution of SVs, and the distribution of errors. This is due to the low-dimensional parametrization of the examples, arising from the procedure of generating the examples by taking snapshots at well-defined viewing positions.
The hope that Support Vectors are a useful means of analysing recognition tasks will receive further support in Sec. 2.4, where we shall present results which show that different types of SV machines, obtained using different kernel functions, lead to largely the same Support Vectors if trained on the same task.
To put the SV results on this task into perspective, benchmark comparisons with other classifiers need to be carried out. We conducted a set of experiments using perceptrons with one hidden layer, endowed with 400 hidden neurons, and hyperbolic tangent activation functions in hidden and output neurons. The networks were trained by back-propagation of mean squared error (Rumelhart, Hinton, and Williams, 1986; LeCun, 1985). We used on-line (stochastic) gradient descent, i.e. the weights were updated after each pattern; training was stopped when the training error dropped below 0.1%, or after 600 learning epochs, whichever occurred earlier. Neither this procedure nor the network design was carefully optimized for the task at hand, thus the results reported in the following should be seen as baseline comparisons solely intended to facilitate assessing the reported SV results.⁹
⁹ By observing the dependency of the test error on the number of learning epochs, we were able to see that the networks were not overtrained. In addition, experiments with smaller numbers of hidden units gave worse performance (larger networks were not used, for reasons of excessive training times), hence the network capacities did not seem too large. A full-fledged comparison between SV machines and perceptrons would take into account the following issues in order to obtain optimized network designs: instead of one fully connected hidden layer, more sophisticated architectures use several layers with shared weights, extracting features of increasing complexity and invariance, while still limiting the number of free parameters. Other regularization techniques useful for improving generalization include weight decay and pruning. Similarly, early stopping can be used to deal with issues of overtraining. The training procedure can be optimized by using different error functions and output functions (e.g. softmax). Finally, for small databases with little redundancy it is sometimes advantageous to use batch updates with conjugate gradient descent, or higher order derivatives of the error function. For details, see LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel (1989); Bishop (1995); Amari, Murata, Müller, Finke, and Yang (1997).
FIGURE 2.11: Left: rendered view of a 3-D model from the chair database; right: 16×16 downsampled image, and four 16×16 downsampled edge detection patterns.
For the following two reasons, we chose the small training set, with 25 views per object. First, the error rates reported above for the large sets were already very low, and differences in performance are thus more likely to be significant for the smaller training sets. Second, training times of the neural networks were very long (in the cases reported in the following, they were longer than for SV machines by more than an order of magnitude).
On the 25-view-per-object training sets, we obtained error rates of 17.3% and 21.4% on the entry level and animal databases, respectively. Adding edge detection information, the error rates dropped to 6.8% and 11.2%, respectively. Comparing with the results in Table 2.1, we note that SV machines in almost all cases performed better. Further performance comparisons between SV machines and other classifiers are reported in the following section.
In a set of experiments using the MPI chair database (Fig. 2.11, Appendix A), different view-based recognition algorithms were compared (Blanz et al., 1996). The SV analysis for this case is less detailed than the one given in Sec. 2.2.1; however, we decided also to report these experiments, since they include further benchmark results obtained with other classifiers. The first one used oriented filters to extract features which are robust with respect to small rigid transformations of the underlying 3-D objects, followed by a decision stage based on comparisons with stored templates (for details, see Blanz, 1995; Vetter, 1994; Blanz et al., 1996). The second one, run as a baseline benchmark, was a perceptron with one hidden layer, trained by error back-propagation to minimize the mean squared error (for further details, see Sec. 2.2.1). The third system was a polynomial Support Vector machine (cf. (2.26)) with degree d = 15 and γ = 100.¹⁰ In addition, we report results of Kressel (1996), who utilized a fully quadratic polynomial classifier (Schürmann, 1996) trained on the first 50 principal components.
¹⁰ The latter was chosen as in Sec. 2.2.1. Note that these values differ from those used in (Blanz et al., 1996), which in some cases leads to different results.
TABLE 2.3: Recognition test error (in %) for the MPI chair database (Appendix A) on 25 × 100 random test views from the upper viewing hemisphere, for different training sets (viewing angles either regularly spaced, or uniformly distributed, on the upper viewing hemisphere; views were either just 16×16 images, or images plus edge detection data), and different classifiers. SV: Support Vector machine; MLP: fully connected perceptron with one hidden layer of 400 neurons; OF: oriented filter invariant feature extraction, see text; PC: quadratic polynomial classifier trained on the first 50 principal components (Kressel, 1996). Where marked with '-', results are not available.
input        training set distribution  views per obj.   SV    MLP   OF    PC
images+e.d.  regul. spaced              25               5.0   8.8   5.4   -
images+e.d.  regul. spaced              89               1.0   1.3   4.7   1.7
images+e.d.  random                     100              1.4   2.6   -     -
images+e.d.  random                     400              0.3   -     -     0.8
images       regul. spaced              25               13.2  25.4  26.0  -
images       regul. spaced              89               2.0   7.2   21.0  -
images       random                     100              4.5   7.5   -     -
images       random                     400              0.6   -     -     -
2.2.3 Discussion
Realistically rendered computer graphics images of objects provide a useful basis for evaluating object recognition algorithms. This setup enabled us to study shape re-
Handwritten digit recognition has long served as a test bed for evaluating and benchmarking classifiers (e.g. LeCun et al., 1989; Bottou et al., 1994; LeCun et al., 1995). Thus, it is imperative to evaluate the SV method on some widely used digit recognition task. In the present chapter, we use the US Postal Service (USPS) database for this purpose (Appendix C). We put particular emphasis on comparing different types of SV classifiers obtained by choosing different kernels. We report results for polynomial kernels (2.26), radial basis function kernels (2.27), and sigmoid kernels (2.28); all of them were obtained with γ = 10 (our default choice, used wherever not stated otherwise; cf. (2.19)).
Results for the three different kernels are summarized in Table 2.4. In all three cases, error rates around 4% can be achieved. They should be compared with values achieved on the same database with a five-layer neural net (LeNet1, LeCun, Boser, Denker, Henderson, Howard, Hubbard, and Jackel, 1989), 5.0%, a neural net with one hidden layer, 5.9%, and the human performance, 2.5% (Bromley and Säckinger, 1991). Results of classical RBF machines, along with further reference results, are quoted in Sec. 2.5.3.
The results show that the Support Vector algorithm allows the construction of various learning machines, all of which are performing well. The similar performance for the three different functions k suggests that among these cases, the choice of the set of decision functions is less important than capacity control in the chosen type of structure. This phenomenon is well-known for the Parzen density estimator in R^N,

    p(x) = (1/ℓ) Σ_{i=1}^{ℓ} (1/ωᴺ) k((x − xᵢ)/ω).    (2.39)
There, it is of great importance to choose an appropriate value of the bandwidth parameter ω for a given amount of data (e.g. Härdle, 1990; Bishop, 1995). Similar parallels can be drawn to the solution of ill-posed problems (for a complete discussion, see Vapnik, 1995b).

TABLE 2.4: Performance on the USPS set, for three different types of classifiers, constructed with the Support Vector algorithm by choosing different functions k in (2.25) and (2.29). Given are raw errors (i.e. no rejections allowed) on the test set. The normalization factor c = 1.04 in the sigmoid case is chosen such that c · tanh(2) = 1. For each of the ten-class classifiers, we also show the average number of Support Vectors of the ten two-class classifiers. The normalization factors of 256 are tailored to the dimensionality of the data, which is 16×16.
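For comparison, a Parzen estimator per (2.39) with a Gaussian window is only a few lines; the bandwidth ω plays the role there that capacity control plays for the SV machines. A minimal sketch with our own toy data:

```python
import numpy as np

def parzen_density(x, X, omega):
    """Parzen window estimate (2.39) with a Gaussian window k in N dims;
    omega is the bandwidth parameter discussed in the text."""
    l, N = X.shape
    u = (x - X) / omega
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (N / 2)
    return k.sum() / (l * omega ** N)

X = np.random.default_rng(0).normal(size=(200, 2))
print(parzen_density(np.zeros(2), X, omega=0.5))
```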
Overlap of SV Sets. To study the Support Vector sets for three different types of SV classifiers, we used the optimal parameters on the USPS set according to Table 2.4.¹¹
¹¹ Copyright notice: the material in this section is based on the article "Extracting support data for a given task" by B. Schölkopf, C. Burges and V. Vapnik, which appeared in: Proceedings, First International Conference on Knowledge Discovery & Data Mining, pp. 252-257, 1995. AAAI Press.
TABLE 2.5: First row: total number of different Support Vectors of three different ten-class classifiers (i.e. number of elements of the union of the ten two-class classifier Support Vector sets) obtained by choosing different functions k in (2.25) and (2.29); second row: average number of Support Vectors per two-class classifier (USPS database size: 7291). [Row labels: total # of SVs; average # of SVs. The numerical entries are not recoverable from this extraction.]
TABLE 2.6: Percentage of the Support Vector set of [column] contained in the support set of [row]; for ten-class classifiers (top) and binary recognizers for digit class 7 (bottom) (USPS set). [Rows and columns: Polynomial, RBF, Sigmoid; the numerical entries are not recoverable from this extraction.]
TABLE 2.7: Comparison of all three Support Vector sets at a time (USPS set). For each of the (ten-class) classifiers, "% intersection" gives the fraction of its Support Vector set shared with both the other two classifiers. Out of a total of 1834 different Support Vectors, 1355 are shared by all three classifiers; an additional 242 are common to two of the classifiers.
TABLE 2.8: SV set overlap experiments on the MNIST set (Fig. C.2), using the binary recognizer for digit 0. Top three tables: performances (on the 60000 element test set) and numbers of SVs for three different kernels and various parameter choices. The numbers of SVs, which should be compared to the database size, 60000, were used to select the parameters for the SV set comparison: to get a balanced comparison of the different SV sets, we decided to select parameter values such that the respective SV sets have approximately equal size (polynomial degree d = 4, radial basis function width c = 0.6, and sigmoid threshold Θ = −1.5). Bottom: SV set comparison. For each of the binary classifiers, "% intersection" gives the fraction of its Support Vector set shared with both the other two classifiers. The scaling factor 784 in the kernels stems from the dimensionality of the data; it ensures that the values of the kernels lie in similar ranges for different polynomial degrees.
TABLE 2.9: Percentage of the Support Vector set of [column] contained in the support set of [row], for the binary recognizers for digit class 7 (USPS set). The training sets for the classifiers in [row] and [column] were permuted with respect to each other (control experiment for Table 2.6); still, the overlap between the SV sets persists. [Rows and columns: Polynomial, RBF, Sigmoid; the numerical entries are not recoverable from this extraction.]
If the solution were expressible in terms of many different subsets of the training set, with only a few of them necessary at a time for expressing the decision function, the actual SV set extracted could strongly depend on the ordering of the training set, especially if the quadratic programming algorithm processes the data in chunks. In our experiments, we did use the same ordering of the training set in all three cases. To exclude the possibility that it is this ordering that causes the reported overlaps, we ran a control experiment where two classifiers with the same kernel were trained twice, on the original training set, and on a permuted version of it, respectively. We found that the two cases produced highly overlapping (to around 90%) SV sets, which means that the training set ordering hardly has an effect on the SV sets extracted: it only changes around 10% of the SV sets. In addition, repeating the experiments of Table 2.6 on permuted training sets gave results consistent with this finding: Table 2.9 shows that the overlap between SV sets of different classifiers is hardly changed when one of the training sets is permuted. We may also add that the overlap is not due to SVs corresponding to errors on the training set (cf. (2.18), with ξᵢ > 1): the considered classifiers had very few training errors.
Using a leave-one-out procedure similar to Proposition 2.1.2, Vapnik and Watkins have subsequently put forward a theoretical argument for shared SVs. We state it in the following form: if the SV sets of three SV classifiers had no overlap, we could obtain a fourth classifier which has zero test error.
To see why this is the case, note that if a pattern is left out of the training set, it will always be classified correctly by voting between the three SV classifiers trained on the remaining examples: otherwise, it would have been an SV of at least two of them, if kept in the training set. The expectation of the number of patterns which are SVs of at least two of the three classifiers, divided by the training set size, thus forms an upper bound on the expected test error of the voting system.
Training on SV Sets. As described in Sec. 2.1.2, the Support Vector set contains all the information a given classifier needs for constructing the decision function. Due to the overlap in the Support Vector sets of different classifiers, one can even train classifiers on Support Vector sets of another classifier. Table 2.10 shows that this leads to results comparable to those after training on the whole database. In Sec. 4.2.1, we
TABLE 2.10: Training classifiers on the Support Vector sets of other classifiers leads to performances on the test set which are as good as the results for training on the full database (shown are numbers of errors on the 2007-element test set, for two-class classifiers separating digit 7 from the rest). Additionally, the results for training on a random subset of the database of size 200 are displayed.
Much research has been devoted to the study of various learning algorithms which allow the extraction of these underlying regularities. No matter how different the outward appearance of these algorithms is, they all must rely on intrinsic regularities of the data. If the learning has been successful, these intrinsic regularities are captured in the values of some parameters of a learning machine; for a polynomial classifier, these parameters are the coefficients of a polynomial, for a neural net they are weights and biases, and for a radial basis function classifier they are weights and centers. This variety of different representations of the intrinsic regularities, however, conceals the fact that they all stem from a common root.
The Support Vector algorithm enables us to view these algorithms in a unified theoretical framework. The presented empirical results show that different types of SV classifiers construct their decision functions from highly overlapping subsets of the training set, and thus extract a very similar structure from the observations, which can in this sense be viewed as a characteristic of the data: the set of Support Vectors. This finding may lead to methods for compressing databases significantly by disposing of the data which is not important for the solution of a given task (cf. also Guyon, Matic, and Vapnik, 1996).
In the next section, we will take a closer look at one of the types of learning machines implementable by the SV algorithm.
and threshold that minimize an upper bound on the expected test error. The present section is devoted to an experimental comparison of these machines with a classical approach, where the centers are determined by k-means clustering and the weights are computed using error back-propagation. We consider three machines, namely a classical RBF machine, an SV machine with Gaussian kernel, and a hybrid system with the centers determined by the SV method and the weights trained by error back-propagation. Our results show that on the US Postal Service database of handwritten digits, the SV machine achieves the highest recognition accuracy, followed by the hybrid system.
Copyright notice: the material in this section is based on the article "Comparing support vector machines with Gaussian kernels to radial basis function classifiers" by B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio and V. Vapnik, which appeared in IEEE Transactions on Signal Processing, 45(11): 2758-2765, November 1997. IEEE.
Consider Fig. 2.14, depicting a two-class classification problem. Suppose we want to construct a radial basis function classifier

    f(x) = sgn( Σ_{i=1}^{ℓ} wᵢ exp(−‖x − xᵢ‖² / cᵢ) + b )    (2.40)

(b and cᵢ being constants, the latter positive) separating balls from circles, i.e. taking different values on balls and circles. How do we choose the centers xᵢ? Two extreme cases are conceivable:
The first approach consists of choosing the centers for the two classes separately, irrespective of the classification task to be solved. The classical technique of finding the centers by some clustering technique (before tackling the classification problem) is such an approach. The weights wᵢ are then usually found by either error back-propagation (Rumelhart, Hinton, and Williams, 1986) or the pseudo-inverse method (Poggio and Girosi, 1990).
An alternative approach (Fig. 2.13) consists of choosing as centers points which are critical for the classification task at hand. The Support Vector algorithm implements the latter idea. By simply choosing a suitable kernel function (2.27), it allows the construction of radial basis function classifiers. The algorithm automatically computes the number and location of the above centers, the weights wᵢ, and the threshold b. By the kernel function, the patterns are mapped nonlinearly into a high-dimensional space. There, an optimal separating hyperplane is constructed, expressed in terms of those examples which are closest to the decision boundary. These are the Support Vectors, which correspond to the centers in input space.
The goal of the present section is to compare real-world results obtained with k-means clustering and classical RBF training to those obtained when the centers, weights and threshold are automatically chosen by the Support Vector algorithm. To this end, we decided to undertake a performance study by combining expertise on the Support Vector algorithm (AT&T Bell Laboratories) and on classical radial basis function networks (Massachusetts Institute of Technology).
Three different RBF systems took part in the performance comparison:
SV system. A standard SV machine with Gaussian kernel function was constructed (cf. (2.27)).
Classical RBF system. The MIT side of the performance comparison constructed networks of the form

    g(x) = sgn( Σ_{i=1}^{K} wᵢ Gᵢ(x) + b ) = sgn( Σ_{i=1}^{K} wᵢ (2π)^{−N/2} σᵢ^{−N} exp(−‖x − cᵢ‖² / (2σᵢ²)) + b ),

with the number of centers K identical to the one automatically found by the SV algorithm. The centers cᵢ were computed by k-means clustering (e.g. Duda and Hart, 1973), and the weights wᵢ were trained by on-line mean squared error back-propagation.
The training procedure constructs ten binary recognizers for the digit classes, with RBF hidden units and logistic outputs, trained to produce the target values
FIGURE 2.13: RBF centers automatically computed by the Support Vector algorithm (indicated by extra circles), using cᵢ = 1 for all i (cf. (2.27), (2.40)). The number of SV centers accidentally coincides with the number of identifiable clusters (indicated by crosses found by k-means clustering with k = 2 and k = 3 for balls and circles, respectively), but the naive correspondence between clusters and centers is lost; indeed, 3 of the SV centers are circles, and only 2 of them are balls. Note that the SV centers are chosen with respect to the classification task to be solved.
Re
1 and 0 for positive and negative examples, respectively. The networks were
trained without weight decay, however, a bootstrap procedure was used to limit
their complexity. The nal RBF network for each class contains every Gaussian
kernel from its target class, but only several kernels from the other 9 classes,
selected such that no false positive mistakes are made. For further details, see
(Sung, 1996; Moody and Darken, 1989).
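For concreteness, a bare-bones version of the label-blind center choice used by the classical system, plain k-means, might look as follows. This is our own minimal sketch; the real system additionally trains the widths σᵢ and the weights wᵢ as described above.

```python
import numpy as np

def kmeans_centers(X, K, iters=50, seed=0):
    """Plain k-means clustering (e.g. Duda and Hart, 1973): places the K
    Gaussian centers c_i without looking at the class labels, in contrast
    to the Support Vector choice of centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute the means
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(K):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers
```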
Hybrid system. To separate the respective contributions of the SV center choice and the SV weight optimization, another RBF system was built, constructed with centers that are simply the Support Vectors arising from the SV optimization, and with the weights trained separately using mean squared error back-propagation.

FIGURE 2.14: Two-class classification problem solved by the Support Vector algorithm (cᵢ = 1 for all i; cf. Eq. 2.40).
In terms of training time, the SV machine was faster than the RBF system by about an order of magnitude.¹² The optimization, however, requires working with potentially large matrices. In the implementation that we used, the training data is processed in chunks, and matrix sizes were of the order 500 × 500. For problems with very large numbers of SVs, a modified training algorithm has recently been proposed by Osuna, Freund, and Girosi (1997).
Error Functions. Due to the constraints (2.18) and the target function (2.19), the SV algorithm puts emphasis on correctly separating the training data. In this respect, it is different from the classical RBF approach of training in the least-squares metric, which is more concerned with the general problem of estimating posterior probabilities than with directly solving a classification task at hand. There exist, however, studies investigating the question how to select RBF centers or exemplars to minimize the number of misclassifications, see for instance (Chang and Lippmann, 1993; Duda and Hart, 1973; Reilly, Cooper, and Elbaum, 1982; Barron, 1984). A classical RBF system could also be made more discriminant by using moving centers (e.g. Poggio and Girosi, 1990), or a different cost function, such as the classification figure of merit (Hampshire and Waibel, 1990). In fact, it can be shown that Gaussian RBF regularization networks are equivalent to SV machines if the regularization operator and the cost function are chosen appropriately (Smola and Schölkopf, 1997b).
¹² For noisy regression problems, on the other hand, Support Vector machines can be slower (Müller et al., 1997).
FIGURE 2.15: A simple two-class classification problem as solved by the SV algorithm (cᵢ = 1 for all i; cf. Eq. 2.40). Note that the RBF centers (indicated by extra circles) are closest to the decision boundary. Interestingly, the decision boundary is a straight line, even though a nonlinear Gaussian RBF kernel was used. This is due to the fact that only two SVs are required to solve the problem. The translational and unitary invariance of the RBF kernel then renders the situation completely symmetric.
It is important to stress that the SV machine does not minimize the empirical risk (misclassification error on the training set) alone. Instead, it minimizes the sum of an upper bound on the empirical risk and a penalty term that depends on the complexity of the classifier used.
TABLE 2.11: Numbers of centers (Support Vectors) automatically extracted by the Support Vector machine. The first row gives the total number for each binary classifier, including both positive and negative examples; in the second row, we only counted the positive SVs. The latter number was used in the initialization of the k-means algorithm, cf. Sec. 2.5.

digit class    0    1    2    3    4    5    6    7    8    9
# of SVs       274  104  377  361  334  388  236  235  342  263
# of pos. SVs  172  77   217  179  211  231  147  133  194  166
TABLE 2.12: Two-class classification: numbers of test errors (out of 2007 test patterns) for the three systems described in Sec. 2.5.

digit class           0   1   2   3   4   5   6   7   8   9
classical RBF         20  16  43  38  46  31  15  18  37  26
RBF with SV centers   9   12  27  24  32  24  19  16  26  16
full SV machine       16  8   25  19  29  23  14  12  25  16
We used the USPS database of handwritten digits (Appendix C). The SV machine results reported in the following were obtained with our default choice γ = 10 (cf. (2.19), Sec. 2.3), and c = 0.3 · N (cf. (2.27)), where N = 256 is the dimensionality of input space.¹³
Two-class classification. Table 2.11 shows the numbers of Support Vectors, i.e. RBF centers, extracted by the SV algorithm. Table 2.12 gives the results of binary classifiers separating single digits from the rest, for the systems described in Sec. 2.5.
Ten-class classification. For each test pattern, the arbitration procedure in all three systems simply returns the digit class whose recognizer gives the strongest response (cf. (2.38)). Table 2.13 shows the 10-class digit recognition error rates for our original system and the two RBF-based systems.
The fully automatic SV machine exhibits the highest test accuracy of the three systems.¹⁴ Using the SV algorithm to choose the centers for the RBF network is also better than the baseline procedure of choosing the centers by a clustering heuristic as described above. It can be seen that in contrast to the k-means cluster centers, the centers chosen by the SV algorithm allow zero training error rates.
The considered recognition task is known to be rather hard: the human error rate is 2.5% (Bromley and Säckinger, 1991), almost matched by a memory-based tangent-distance classifier (2.6%, Simard, LeCun, and Denker, 1993). Other results on this database include a Euclidean distance nearest neighbour classifier (5.9%, Simard, LeCun, and Denker, 1993), a perceptron with one hidden layer (5.9%), and a convolutional neural network (5.0%, LeCun et al., 1989). By incorporating translational and rotational invariance using the Virtual SV technique (see below, Sec. 4.2.1), we were able to improve the performance of the considered Gaussian kernel SV machine (same values of γ and c) from 4.2% to 3.2% error.
¹³ The SV machine is rather insensitive to different choices of c. For all values in 0.1, 0.2, …, 1.0, the performance is about the same (in the area of 4% to 4.5%).
¹⁴ An analysis of the errors showed that about 85% of the errors committed by the SV machine were also made by the other systems. This makes the differences in error rates very reliable.

TABLE 2.13: 10-class digit recognition error rates for three RBF classifiers constructed with different algorithms. The first system is a classical one, choosing its centers by k-means clustering. In the second system, the Support Vectors were used as centers, and in the third one, the entire network was trained using the Support Vector algorithm.
The Support Vector algorithm provides a principled way of choosing the number and the locations of RBF centers. Our experiments on a real-world pattern recognition problem have shown that, in contrast to a corresponding number of centers chosen by k-means, the centers chosen by the Support Vector algorithm allowed a training error of zero, even if the weights were trained by classical RBF methods. The interpretation of this finding is that the Support Vector centers are specifically chosen for the classification task at hand, whereas k-means does not care about picking those centers which will make a problem separable.
In addition, the SV centers yielded lower test error rates than k-means. It is interesting to note that using SV centers, while sticking to the classical procedure for training the weights, improved training and test error rates by approximately the same amount (2%). In view of the guaranteed risk bound (1.5), this can be understood in the following way: the improvement in test error (risk) was solely due to the lower value of the training error (empirical risk); the confidence term (the second term on the right hand side of (1.5)), depending on the VC-dimension and thus on the norm of the weight vector, did not change, as we stuck to the classical weight training procedure. However, when we also trained the weights with the Support Vector algorithm, we minimized the norm of the weight vector (see (2.19), (2.4)) in feature space, and thus the confidence term, while still keeping the training error zero. Thus, consistent with (1.5), the Support Vector machine achieved the highest test accuracy of the three systems.¹⁵
In the case where the available amount of training data is limited, it is important to have a means for achieving the best possible generalization by controlling characteristics of the learning machine, without having to put aside parts of the training set for validation purposes. One of the strengths of SV machines consists in the automatic capacity tuning, which was related to the fact that the minimization of (2.19) is connected to structural risk minimization, based on the bound (1.5). This capacity tuning takes place within a set of functions specified a priori by the choice of a kernel function. In the following, we go one step further and use the bound (1.5) also to predict the kernel degree which yields the best generalization for polynomial classifiers (Schölkopf, Burges, and Vapnik, 1995).
Since for SV machines we have an upper bound on the VC-dimension (Proposition 2.1.1), we can use (1.5) to get an upper bound on the expected error on an independent test set in terms of the training error and the value of ‖w‖ (or, equivalently, the margin 2/‖w‖). This bound can then be used to try to choose parameters of the learning machines such that the test error gets minimal, without actually looking at the test set.
We consider polynomial classifiers with the kernel (2.26), varying their degree d, and make the assumption that the bound (2.4) gives a reliable indication of the actual VC-dimension, i.e. that the VC-dimension can be estimated by

    h ≈ c₁ · h_est.,  h_est. = R²‖w‖²,    (2.41)

¹⁵ Two remarks on the interpretation of our findings are in order. The first result, comparing the error rates of the classical and the hybrid system, does not necessarily rule out the possibility of reducing the training error also for k-means centers by using different cost functions or codings of the output units. It should be considered as a statement comparing two sets of centers, using the same weight training algorithm to build RBF networks from them. Along similar lines, the second result, indicating the superior performance of the full SV RBF system, refers to the systems as described in this study. It does not rule out the possibility of improving classical RBF systems by suitable methods of complexity control. Indeed, the results for the SV RBF system do show that using the same architecture, but different weight training procedures, can significantly improve results.
¹⁶ Copyright notice: the material in this section is based on the article "Extracting support data for a given task" by B. Schölkopf, C. Burges and V. Vapnik, which appeared in: Proceedings, First International Conference on Knowledge Discovery & Data Mining, pp. 252-257, 1995. AAAI Press.
FIGURE 2.16: Average VC-dimension (solid) and total number of test errors of the ten two-class classifiers (dotted) for polynomial degrees 2 through 7 (for degree 1, R_emp is comparably big, so the VC-dimension alone is not sufficient for predicting R, cf. (1.5)). The baseline on the error scale, 174, corresponds to the total number of test errors of the ten best binary classifiers out of the degrees 2 through 7. The graph shows that the VC-dimension allows us to predict that degree 4 yields the best overall performance of the two-class classifiers on the test set. This is not necessarily the case for the performances of the ten-class classifiers, which are built from the two-class classifier outputs before applying the sgn functions (cf. Sec. 2.1.6).
with ‖w‖ determined by the Support Vector algorithm (note that ‖w‖ is computed in feature space, using the kernel). Thus, in order to compute h_est., we need to compute R, the radius of the smallest sphere enclosing the training data in feature space. This can be reduced to a quadratic programming problem similar to the one used in constructing the optimal hyperplane:
We formulate the problem as follows: minimize

    R²  subject to  ‖zᵢ − z‖² ≤ R²,  i = 1, …, ℓ,    (2.42)

where z is the (to be determined) center of the sphere. We use the Lagrangian

    R² − Σᵢ λᵢ (R² − (zᵢ − z)²)    (2.43)

and obtain, by setting the derivatives with respect to z and R to zero,

    z = Σᵢ λᵢ zᵢ    (2.44)

and the dual problem: maximize

    Σᵢ λᵢ (zᵢ · zᵢ) − Σ_{i,j} λᵢλⱼ (zᵢ · zⱼ)    (2.45)

subject to

    Σᵢ λᵢ = 1,  λᵢ ≥ 0,    (2.46)

where the λᵢ are Lagrange multipliers. As in the Support Vector algorithm, this problem has the property that the zᵢ appear only in dot products, so as before one can compute the dot products in feature space, replacing (zᵢ · zⱼ) by k(xᵢ, xⱼ) (where the xᵢ live in input space, and the zᵢ in feature space).
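A sketch of this computation: given any kernel matrix K on the training data, the dual (2.45)-(2.46) can be handed to a generic solver and R² read off at the optimum. Solver choice and starting point are our own, illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def sphere_radius_sq(K):
    """Squared radius of the smallest sphere enclosing the training points
    in feature space, computed from the kernel matrix K via the dual
    (2.45)-(2.46) of problem (2.42)."""
    l = K.shape[0]
    # maximize sum_i lam_i K_ii - sum_ij lam_i lam_j K_ij (negated below)
    obj = lambda lam: -(lam @ np.diag(K) - lam @ K @ lam)
    cons = {"type": "eq", "fun": lambda lam: lam.sum() - 1.0}
    res = minimize(obj, np.full(l, 1.0 / l),
                   bounds=[(0.0, None)] * l, constraints=cons)
    lam = res.x
    return lam @ np.diag(K) - lam @ K @ lam  # R^2 at the optimum
```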
In this way, we compute the radius of the minimal enclosing sphere for all the USPS training data for polynomial classifiers of degrees 1 through 7. For the same degrees, we then train a binary polynomial classifier for each digit. Using the obtained values for h_est., we can predict, for each digit, which degree polynomial will give the best generalization performance. Clearly, this procedure is contingent upon the validity of the assumption that c₁ is approximately the same for all degrees. We can then compare this prediction with the actual polynomial degree which gives the best performance on the test set. The results are shown in Table 2.14; cf. also Fig. 2.16.
TABLE 2.14: Performance of the classifiers with degree predicted by the VC-bound. Each row describes one two-class classifier separating one digit (stated in the first column) from the rest. The remaining columns contain: deg: the degree of the best polynomial as predicted by the described procedure; param.: the dimensionality of the high dimensional space, which is also the VC-dimension for the set of all separating hyperplanes in that space; h_est.: the VC-dimension estimate for the actual classifiers, which is much smaller than the number of free parameters of linear classifiers in that space; 1-7: the numbers of errors on the test set for polynomial classifiers of degrees 1 through 7. The table shows that the described procedure chooses polynomial degrees which are optimal or close to optimal.
[Of the table body, only the h_est. column survives in this extraction; for digits 0 through 9 it reads: 547, 95, 832, 1130, 977, 1117, 615, 526, 1466, 1210.]
The above method for predicting the optimal classifier functions gives good results. In four cases, the theory predicted the correct degree; in the other cases, the predicted degree gave performance close to the best possible one.
Expressing the confidence term of (1.5) via the VC-dimension estimate (2.41), the risk bound takes the form

    R(α) ≤ R_emp(α) + √( ( c₁R²‖w‖² (log(2ℓ/(c₁R²‖w‖²)) + 1) − log(η/4) ) / ℓ ).    (2.48)

Writing the slack term of (2.19) as

    Σ_{i=1}^{ℓ} ξᵢ = c₂ ℓ R_emp(w)    (2.49)

with some c₂ ≥ 1, where ℓ is the number of training examples, minimizing the objective function (2.19) then amounts to minimizing

    (1/(2γc₂ℓ)) ‖w‖² + R_emp(w).    (2.50)
Identifying w with the function index α in (2.48), we now have a formulation where the second terms of (2.50) and (2.48) are identical. The first terms cannot coincide in general: unlike the first term of (2.48), ‖w‖²/(2γc₂ℓ) is proportional to ‖w‖². However, a necessary condition to ensure that the minimum of the function that we are minimizing, (2.50), is close to that of the one that we would like to minimize, (2.48), is that the gradients of the first terms with respect to w coincide at the minimum. Hence, we have to choose γ such that

    (1/(γc₂ℓ)) w = [ c₁R² (log(2ℓ/(c₁R²‖w‖²)) − 1) / √( ℓ (c₁R²‖w‖² log(2ℓ/(c₁R²‖w‖²)) − log(η/4)) ) ] w.    (2.51)

For w ≠ 0, we thus obtain

    γ = √( ℓ (c₁R²‖w‖² log(2ℓ/(c₁R²‖w‖²)) − log(η/4)) ) / ( c₁c₂ℓR² (log(2ℓ/(c₁R²‖w‖²)) − 1) ).    (2.52)
Eq. (2.52) establishes a relationship between γ and w at the point of the solution. If we start with a non-optimal value of γ, however, we will obtain a non-optimal w, and thus (2.52) will not tell us exactly how to adjust γ. For instance, suppose we start with a γ which is too big. Then too much weight will be put on minimizing the empirical risk (cf. (2.19)), and the margin will become too small, i.e. w will become too big. We will resort to the following method: we use (2.52) to determine a new value γ′, and iterate the procedure.
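One step of this update, written out directly from (2.52) as reconstructed above; this is an illustrative sketch of ours, not the original implementation:

```python
import numpy as np

def next_gamma(w_sq, R_sq, c1, c2, l, eta):
    """Compute the updated regularization constant gamma' via (2.52),
    given ||w||^2, the sphere radius R^2, the constants c1 and c2, the
    training set size l, and the confidence parameter eta."""
    h = c1 * R_sq * w_sq                  # VC-dimension estimate, cf. (2.41)
    log_term = np.log(2 * l / h)
    num = np.sqrt(l * (h * log_term - np.log(eta / 4)))
    den = c1 * c2 * l * R_sq * (log_term - 1.0)
    return num / den
```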
The value of ‖w‖² is obtained by solving the SV quadratic programming problem (in feature space, we have ‖w‖² = Σᵢⱼ yᵢyⱼαᵢαⱼ k(xᵢ, xⱼ)); R² is computed as in Sec. 2.6.1, and c₂ is obtained from Σᵢ ξᵢ and the training error using (2.49). The values of c₁ and η must be chosen by the user. The constant c₁ characterizes the tightness of the VC-dimension bound (cf. (2.47)), and 1 − η is a lower bound on the probability with which the risk bound (2.48) holds. As long as η is not too close to 0, it hardly affects our procedure. The value of c₁ is more difficult to choose correctly; however, reasonable results can already be obtained with the default choice c₁ = 1, as we shall see below.
Statements on the convergence of this procedure are hard to obtain: to compute the mapping from γ to γ′, we have to train an SV machine and then evaluate (2.52), thus one cannot compute its derivative in closed form. In an empirical study to be described presently, the procedure exhibited well-behaved convergence behaviour. In the experiment, we used the small MNIST database (Appendix C). We found that the iteration converged no matter whether we started with a very large γ or with a tiny γ. In the following, we report results obtained when starting with a huge value of γ, which effectively leads to the construction of an optimal margin classifier (i.e. with zero training error; cf. Sec. 2.1.2: for γ = ∞, (2.22) reduces to (2.14)).
Table 2.15 shows partly encouraging results. For all 10 binary digit recognizers, the iteration converges very fast (about two steps were required to reach a value very close to the final one). In seven cases, the number of test errors decreased; in only two cases did it increase. By combining the resulting binary classifiers, we obtained a 10-class classifier with an error rate (on the 10000 element small MNIST test set) of 3.9%, slightly better than the error rate obtained both with the starting value used in the iteration, γ = 10¹⁰, and with our default choice γ = 10: in these cases, we obtained 4.0% error (cf. below, in Table 4.6).
Clearly, further experiments are necessary to validate or improve this method. In particular, it would be interesting to study a noisy classification problem, where the choice of γ should potentially have a greater impact on the quality of the solution.
We conclude this section with a note on the relationship of the model selection methods described in Sections 2.6.1 and 2.6.2. Both of the proposed methods are based on the bound (1.5). In principle, we could also apply the method of 2.6.1 for choosing γ. In that case, we would try out a series of values of γ, and pick the one which minimizes (1.5). The advantage of the present method, however, is that it does not require scanning a whole range of values. Instead, γ is chosen such that, with the help of a few iterations, the SV optimization automatically minimizes (1.5) over γ, in addition to the built-in minimization over the weights of the SV machine (cf. the remarks at the beginning of Sec. 2.6.1).
TABLE 2.15: Iterative choice of the regularization constant \lambda (cf. Sec. 2.6.2) for all ten digit recognizers on the small MNIST database. For each recognizer, the table lists the SV machine parameters and results (\lambda, \|w\|, number of SVs, training errors, test errors, and c_2) for the starting point (\lambda = 10^{10}) and for five subsequent iteration steps. In all cases, we used c_1 = 1, \eta = 0.2 (corresponding to a risk bound holding with probability of at least 0.8), and a polynomial classifier of degree 5. The constant c_2 is undefined before the first run of the algorithm. After each run, it is computed using (2.49); if R_emp is 0, we set c_2 = \infty. For \lambda = 10^{10}, we are effectively computing an optimal separating hyperplane, with zero training errors. The iteration converges very fast; moreover, in seven of the ten cases, it reduced the number of test errors (in two cases, the opposite happened).
The presented experimental results show that Support Vector machines obtain high accuracies which are competitive with state-of-the-art techniques. This was true for several visual recognition tasks. Care should be exercised, however, when generalizing this statement to other types of pattern recognition tasks. There, empirical studies have yet to be carried out, in particular since the tasks that we considered were all characterized by relatively low overlap of the different classes (for instance, in the USPS task, the human error rate is around 2.5%). In any case, the results obtained here are encouraging, in particular when taking into account that the SV algorithm was developed only recently. Below, we summarize different aspects providing insight into why SV machines generalize well:
Capacity Control. The kernel method allows one to reduce a large class of learning machines to separating hyperplanes in some space. For those, an upper bound on the VC-dimension can be given (Proposition 2.1.1). As argued in Sec. 2.1.3, minimizing the SV objective function (2.19) corresponds to trying to separate the data with a classifier of low VC-dimension, thereby approximately performing structural risk minimization. The problem of constructing the decision function requires minimizing a positive quadratic form subject to box constraints and can thus be solved efficiently.
As we saw, low VC-dimension is related to a large separation margin. Thus, analyses of the generalization performance in terms of separation margins and fat shattering dimension also bear relevance to SV machines (e.g. Schapire, Freund, Bartlett, and Lee, 1997; Shawe-Taylor, Bartlett, Williamson, and Anthony, 1996; Bartlett, 1997).
ability to the fact that the decision function is expressed in terms of a (possibly small) subset of the data. This can be viewed in the context of Algorithmic Complexity and Minimum Description Length (Vapnik, 1995b, Chapter 5, footnote 6).
A regularization framework has been proposed which contains the SV algorithm as a special case. For kernel-based function expansions, it is shown that given a regularization operator P (Tikhonov and Arsenin, 1977) mapping the functions of the learning machine into some dot product space D, the problem of minimizing the regularized risk
R_{reg}[f] = R_{emp}[f] + \frac{\lambda}{2}\,\|P f\|^2    (2.53)
with a kernel expansion
f(x) = \sum_i \alpha_i\, k(x_i, x) + b    (2.54)
leads back to the SV setting, provided the kernel satisfies
k(x_i, x_j) = \big((P k)(x_i, \cdot) \cdot (P k)(x_j, \cdot)\big).    (2.55)
(Here, (Pk)(x_i, \cdot) denotes the result of applying P to the function obtained by fixing k's first argument to x_i.) To this end, k is chosen as a Green's function of P*P. For instance, an RBF kernel thus corresponds to regularization with a functional containing a specific differential operator (Yuille and Grzywacz, 1988).
Chapter 3
3.1 Introduction
In the last chapter, we tried to show that the idea of implicitly mapping the data into a high-dimensional feature space has been a very fruitful one in the context of SV machines. Indeed, it is mainly this feature which distinguishes them from the Generalized Portrait algorithm, which has been known for more than 20 years (Vapnik and Chervonenkis, 1974), and which makes them applicable to complex real-world problems which are not linearly separable. Thus, it was natural to ask the question whether the same idea could prove fruitful in other domains of learning.
The present chapter proposes a new method for performing a nonlinear form of Principal Component Analysis. By the use of Mercer kernels, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition (Scholkopf, Smola, and Muller, 1996b, 1997b).
Copyright notice: the material in this chapter is based on the article "Nonlinear component analysis as a kernel Eigenvalue problem" by B. Scholkopf, A. Smola and K.-R. Muller, which will appear in: Neural Computation, 1997. MIT Press.
Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an Eigenvalue problem, or by using iterative algorithms which estimate principal components. For reviews of the existing literature, see Jolliffe (1986) and Diamantaras & Kung (1996); some of the classical papers are due to Pearson (1901), Hotelling (1933), and Karhunen (1946). PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent the data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called factors or latent variables of the data.
The present work studies PCA in the case where we are not interested in principal components in input space, but rather in principal components of variables, or features, which are nonlinearly related to the input variables. Among these are for instance variables obtained by taking arbitrary higher-order correlations between input variables. In the case of image analysis, this amounts to finding principal components in the space of products of input pixels.
To this end, we are computing dot products in feature space by means of kernel functions in input space (cf. Sec. 1.3). Given any algorithm which can be expressed solely in terms of dot products, i.e. without explicit usage of the variables themselves, this kernel method enables us to construct different nonlinear versions of it. Even though this general fact was known (Burges, private communication), the machine learning community has made little use of it, the exception being Vapnik's Support Vector machines (Chapter 2). In this chapter, we give an example of applying this method in the domain of unsupervised learning, to obtain a nonlinear form of PCA.
In the next section, we will first review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we formulate it in a way which uses exclusively dot products. Using kernel representations of dot products (Sec. 1.3), Sec. 3.3 presents a kernel-based algorithm for nonlinear PCA and explains some of the differences to previous generalizations of PCA. First experimental results on kernel-based feature extraction for pattern recognition are given in Sec. 3.4. We conclude with a discussion (Sec. 3.5). Some technical material which is not essential for the main thread of the argument has been relegated to Appendix D.2.
Given a set of centered observations x_k \in R^N, k = 1, \dots, M, \sum_{k=1}^{M} x_k = 0, PCA diagonalizes the covariance matrix^1
C = \frac{1}{M}\sum_{j=1}^{M} x_j x_j^\top.    (3.1)
To do this, one has to solve the Eigenvalue equation
\lambda v = C v    (3.2)
for Eigenvalues \lambda \ge 0 and Eigenvectors v \in R^N. As
C v = \frac{1}{M}\sum_{j=1}^{M} (x_j \cdot v)\, x_j,    (3.3)
all solutions v with \lambda \neq 0 must lie in the span of x_1, \dots, x_M; hence (3.2) in that case is equivalent to
\lambda\,(x_k \cdot v) = (x_k \cdot C v)  for all k = 1, \dots, M.    (3.4)
^1 More precisely, the covariance matrix is defined as the expectation of x x^\top; for convenience, we shall use the same term to refer to the estimate (3.1) of the covariance matrix from a finite sample.
In the remainder of this section, we describe the same computation in another dot product space F, which is related to the input space by a possibly nonlinear map
\Phi : R^N \to F, \quad x \mapsto X.    (3.5)
Note that the feature space F could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, upper case characters are used for elements of F, while lower case characters denote elements of R^N.
Again, we assume that we are dealing with centered data, \sum_{k=1}^{M}\Phi(x_k) = 0 (we shall return to this point later). In F, the covariance matrix takes the form
\bar{C} = \frac{1}{M}\sum_{j=1}^{M}\Phi(x_j)\Phi(x_j)^\top.    (3.6)
Again, all solutions V with \lambda \neq 0 lie in the span of \Phi(x_1), \dots, \Phi(x_M). For us, this has two useful consequences: first, we may instead consider the set of equations
\lambda\,(\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V)  for all k = 1, \dots, M,    (3.9)
and second, there exist coefficients \alpha_1, \dots, \alpha_M such that
V = \sum_{i=1}^{M} \alpha_i \Phi(x_i).    (3.10)
Substituting (3.10) into (3.9), we get
\lambda \sum_{i=1}^{M}\alpha_i\,(\Phi(x_k)\cdot\Phi(x_i)) = \frac{1}{M}\sum_{i=1}^{M}\alpha_i\Big(\Phi(x_k)\cdot\sum_{j=1}^{M}\Phi(x_j)\Big)(\Phi(x_j)\cdot\Phi(x_i))  for all k = 1, \dots, M.    (3.11)
Defining an M \times M matrix K by
K_{ij} := (\Phi(x_i)\cdot\Phi(x_j)),    (3.12)
this reads
M\lambda K\alpha = K^2\alpha,    (3.13)
where \alpha denotes the column vector with entries \alpha_1, \dots, \alpha_M. To find solutions of (3.13), we solve the Eigenvalue problem
M\lambda\,\alpha = K\alpha    (3.14)
for nonzero Eigenvalues. In Appendix D.2.1, we show that this gives us all solutions
of (3.13) which are of interest for us.
Let \lambda_1 \le \lambda_2 \le \dots \le \lambda_M denote the Eigenvalues of K (i.e. the solutions M\lambda of (3.14)), and \alpha^1, \dots, \alpha^M the corresponding complete set of Eigenvectors, with \lambda_p being the first nonzero Eigenvalue (assuming that \Phi is not identically 0). We normalize \alpha^p, \dots, \alpha^M by requiring that the corresponding vectors in F be normalized, i.e.
(V^k \cdot V^k) = 1  for all k = p, \dots, M.    (3.15)
By virtue of (3.10) and (3.14), this translates into a normalization condition for \alpha^p, \dots, \alpha^M:
1 = \sum_{i,j=1}^{M}\alpha_i^k\alpha_j^k\,(\Phi(x_i)\cdot\Phi(x_j)) = \sum_{i,j=1}^{M}\alpha_i^k\alpha_j^k K_{ij} = (\alpha^k \cdot K\alpha^k) = \lambda_k\,(\alpha^k \cdot \alpha^k).    (3.16)
For the purpose of principal component extraction, we need to compute projections onto the Eigenvectors V^k in F (k = p, \dots, M). Let x be a test point, with an image \Phi(x) in F; then
(V^k \cdot \Phi(x)) = \sum_{i=1}^{M}\alpha_i^k\,(\Phi(x_i)\cdot\Phi(x)).    (3.17)
^2 Note that in our derivation we could have used the known result (e.g. Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i \cdot x_j)_{ij} instead of (3.1); however, for the sake of clarity and extendability (in Appendix D.2.2, we shall consider the question how to center the data in F), we gave a detailed derivation.
FIGURE 3.1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just as a PCA in input space (top). Since F is nonlinearly related to input space (via \Phi), the contour lines of constant projections onto the principal Eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a pre-image of the Eigenvector in input space, as it may not even exist. Crucial to kernel PCA is the fact that there is no need to perform the map into F: all necessary computations are carried out by the use of a kernel function k in input space (here: R^2).
We thus only need to compute dot products between mapped patterns (in (3.12) and (3.17)); we never need the mapped patterns explicitly. Therefore, we can use the kernels described in Sec. 1.3. The particular kernel used then implicitly determines the space F of all possible features. The proposed algorithm, on the other hand, is a mechanism for selecting features in F.
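To make the above steps concrete, here is a minimal NumPy sketch of the procedure (assuming, as in the derivation above, that the mapped data are centered in F; the centering correction of Appendix D.2.2 is omitted):

```python
import numpy as np

def kernel_pca(X, X_test, kernel, n_components):
    """Minimal kernel PCA sketch following Eqs. (3.12)-(3.18)."""
    M = len(X)
    # Gram matrix K_ij = k(x_i, x_j), Eq. (3.12)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Eigenvalue problem M*lambda*alpha = K*alpha, Eq. (3.14)
    eigvals, eigvecs = np.linalg.eigh(K)
    idx = np.argsort(eigvals)[::-1]            # largest Eigenvalue first
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    keep = eigvals > 1e-10                     # nonzero Eigenvalues only
    # Normalization (3.16): eigval_k * (alpha^k . alpha^k) = 1
    alphas = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    # Feature extraction (3.18): (V^k . Phi(x)) = sum_i alpha_i^k k(x_i, x)
    K_test = np.array([[kernel(xt, xi) for xi in X] for xt in X_test])
    return K_test @ alphas[:, :n_components]

# usage with a degree-2 polynomial kernel (cf. (2.26)):
# feats = kernel_pca(X, X_test, lambda x, y: (x @ y) ** 2, n_components=3)
```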
FIGURE 3.2: Feature extractor constructed by kernel PCA (cf. (3.18)). In the first layer, the input vector is compared to the sample via a kernel function, chosen a priori (e.g. polynomial, Gaussian, or sigmoid). The outputs are then linearly combined using weights which are found by solving an Eigenvector problem. As shown in the text, the depicted network's function can be thought of as the projection onto an Eigenvector of a covariance matrix in a high-dimensional feature space. As a function on input space, it is nonlinear.
To extract principal components of a test point x, we thus compute projections onto the Eigenvectors by (cf. Eq. (3.17), Fig. 3.2)
(V^n \cdot \Phi(x)) = \sum_{i=1}^{M}\alpha_i^n\, k(x_i, x).    (3.18)
Properties of Kernel-PCA. For Mercer kernels, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see for instance Jolliffe, 1986; Diamantaras & Kung, 1996) carry over to kernel-based PCA, with the modifications that they become statements about a set of points \Phi(x_i), i = 1, \dots, M, in F rather than in R^N. In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the Eigenvectors are sorted in descending order of the Eigenvalue size):
the first q (q \in \{1, \dots, M\}) principal components, i.e. projections on Eigenvectors, carry more variance than any other q orthogonal directions;
the mean-squared approximation error in representing the observations by the first q principal components is minimal;^3
the first q principal components have maximal mutual information with respect to the inputs (this holds under Gaussianity assumptions, and thus depends on the particular kernel chosen and on the data).
^3 To see this, in the simple case where the data z_1, \dots, z_\ell are centered, we consider an orthogonal basis transformation W, and use the notation P_q for the projector on the first q canonical basis vectors \{e_1, \dots, e_q\}. Then the mean squared reconstruction error using q vectors is
\frac{1}{\ell}\sum_i \|z_i - W^\top P_q W z_i\|^2 = \frac{1}{\ell}\sum_i \|W z_i - P_q W z_i\|^2 = \frac{1}{\ell}\sum_i\sum_{j>q}(W z_i \cdot e_j)^2 = \frac{1}{\ell}\sum_i\sum_{j>q}(z_i \cdot W^\top e_j)^2 = \sum_{j>q}(W^\top e_j \cdot C W^\top e_j).
It can easily be seen that the values of this quadratic form (which gives the variances in the directions W^\top e_j) are minimal if the W^\top e_j are chosen as its (orthogonal) Eigenvectors with smallest Eigenvalues.
To translate these properties of PCA in F into statements about the data in input space, they need to be investigated for specific choices of the kernel. We conclude this section by noting one general property of kernel PCA in input space: for kernels which depend only on dot products or distances in input space (as all the examples that we have given so far do), kernel PCA has the property of unitary invariance. This follows directly from the fact that both the Eigenvalue problem and the feature extraction only depend on kernel values. This ensures that the features extracted do not depend on which orthonormal coordinate system we use for representing our input data.
Computational Complexity. As mentioned in Sec. 1.3, a fifth order polynomial kernel on a 256-dimensional input space yields a 10^{10}-dimensional feature space. For two reasons, kernel PCA can deal with this huge dimensionality. First, as pointed out in Sec. 3.2, we do not need to look for Eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not need to compute dot products explicitly between vectors in F (which can be impossible in practice, even if the vectors live in a lower-dimensional subspace), as we are using kernel functions. Kernel PCA thus is computationally comparable to a linear PCA on \ell observations with an \ell \times \ell dot product matrix. If k is easy to compute, as for polynomial kernels, e.g., the computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products. Furthermore, in the case where we need to use a large number \ell of observations, we may want to work with an algorithm for computing only the largest Eigenvalues, as for instance the power method with deflation (for a discussion, see Diamantaras & Kung, 1996). In addition, we can consider using an estimate of the matrix K, computed from a subset of M < \ell examples, while still extracting principal components from all \ell examples (this approach was chosen in some of our experiments described below).
The situation can be different for principal component extraction. There, we have to evaluate the kernel function M times for each extracted principal component (3.18),
rather than just evaluating one dot product as for a linear PCA. Of course, if the dimensionality of F is 10^{10}, this is still vastly faster than linear principal component extraction in F. Still, in some cases, e.g. if we were to extract principal components as a preprocessing step for classification, we might want to speed things up. This can be carried out by the reduced set technique of Burges (1996) (cf. Appendix D.1.1), proposed in the context of Support Vector machines. In the present setting, we approximate each Eigenvector
V = \sum_{i=1}^{\ell}\alpha_i\Phi(x_i)    (3.19)
by a vector
\tilde{V} = \sum_{j=1}^{m}\beta_j\Phi(z_j),    (3.20)
minimizing
\rho = \|V - \tilde{V}\|^2.    (3.21)
This can be carried out without explicitly dealing with the possibly high-dimensional space F. Since
\rho = \|V\|^2 + \sum_{i,j=1}^{m}\beta_i\beta_j\, k(z_i, z_j) - 2\sum_{i=1}^{\ell}\sum_{j=1}^{m}\alpha_i\beta_j\, k(x_i, z_j),    (3.22)
the gradient of \rho with respect to the \beta_j and the z_j is readily expressed in terms of the kernel function, thus \rho can be minimized by standard gradient methods. For the task of handwritten character recognition, this technique led to a speed-up by a factor of 50 at almost no loss in accuracy (Burges & Scholkopf, 1996; cf. Sec. 4.4.1).
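As an illustration of (3.20)-(3.22): if the approximation points z_j are held fixed, \rho is quadratic in the \beta_j, and the minimizing coefficients can be obtained in closed form, entirely through kernel evaluations. The sketch below shows this simplified case (the general method described above additionally adapts the z_j by gradient descent):

```python
import numpy as np

def reduced_set_coefficients(X, alpha, Z, kernel):
    """For V = sum_i alpha_i Phi(x_i) and fixed z_j, find beta minimizing
    rho = ||V - sum_j beta_j Phi(z_j)||^2, cf. (3.20)-(3.22)."""
    Kzz = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])
    Kzx = np.array([[kernel(zi, xi) for xi in X] for zi in Z])
    # d rho / d beta = 2 Kzz beta - 2 Kzx alpha = 0  =>  Kzz beta = Kzx alpha
    return np.linalg.solve(Kzz, Kzx @ alpha)

def rho(X, alpha, Z, beta, kernel):
    """Approximation error (3.22), expressed through kernel evaluations only."""
    Kxx = np.array([[kernel(a, b) for b in X] for a in X])
    Kzz = np.array([[kernel(a, b) for b in Z] for a in Z])
    Kzx = np.array([[kernel(a, b) for b in X] for a in Z])
    return alpha @ Kxx @ alpha + beta @ Kzz @ beta - 2 * beta @ Kzx @ alpha
```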
Finally, we add that although kernel principal component extraction is computationally more expensive than its linear counterpart, this additional investment can pay back afterwards. In experiments on classification based on the extracted principal components, we found that when we trained on nonlinear features, it was sufficient to use a linear Support Vector machine to construct the decision boundary. Linear Support Vector machines, however, are much faster in classification speed than nonlinear ones. This is due to the fact that for k(x, y) = (x \cdot y), the Support Vector decision function (2.25) can be expressed with a single weight vector w = \sum_{i=1}^{\ell} y_i\alpha_i x_i as f(x) = sgn((x \cdot w) + b). Thus the final stage of classification can be done extremely fast; the speed of the principal component extraction phase, on the other hand, and thus the accuracy-speed trade-off of the whole classifier, can be controlled by the number of components which we extract, or by the number m (cf. Eq. (3.20)).
In linear PCA, one is able to select specific axes which span the subspace into which one projects in doing principal component extraction. This way, it may for instance be possible to choose variables which are more accessible to interpretation. In the nonlinear case, there is an additional problem: some elements of F do not have pre-images in input space. To make this plausible, note that the linear span of the training examples mapped into feature space can have dimensionality up to M (the number of examples). If this exceeds the dimensionality of input space, it is rather unlikely that each vector of the form (3.10) has a pre-image (cf. Appendix D.1.2). To get interpretability, we thus need to find directions in input space (i.e. input variables) whose images under \Phi span the PCA subspace in F. This can be done with an approach akin to the one described above: we could parametrize our set of desired input variables and run the minimization of (3.22) only over those parameters. The parameters can be e.g. group parameters which determine the amount of translation, say, starting from a set of images.
Unlike linear PCA, the proposed method allows the extraction of a number of principal components which can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M \times M dot product matrix, can find at most N nonzero Eigenvalues; they are identical to the nonzero Eigenvalues of the N \times N covariance matrix. In contrast, kernel PCA can find up to M nonzero Eigenvalues, a fact that illustrates that it is impossible to perform kernel PCA directly on an N \times N covariance matrix. Even more features could be extracted by using several kernels.
Being just a basis transformation, standard PCA allows the reconstruction of the original patterns x_i, i = 1, \dots, \ell, from a complete set of extracted principal components (x_i \cdot v_j), j = 1, \dots, \ell, by expansion in the Eigenvector basis. Even from an incomplete set of components, good reconstruction is often possible. In kernel PCA, this is more difficult: we can reconstruct the image of a pattern in F from its nonlinear components; however, if we only have an approximate reconstruction, there is no guarantee that we can find an exact pre-image of the reconstruction in input space. In that case, we would have to resort to an approximation method (cf. (3.22)). Alternatively, we could use a suitable regression method for estimating the reconstruction mapping from the kernel-based principal components to the inputs.
Comparison to Other Methods for Nonlinear PCA. Starting from some of the properties characterizing PCA (see above), it is possible to develop a number of possible generalizations of linear PCA to the nonlinear case. Alternatively, one may choose an iterative algorithm which adaptively estimates principal components, and make some of its parts nonlinear to extract nonlinear features. Rather than giving a full review of this field here, we briefly describe just three approaches, and refer the reader to Diamantaras & Kung (1996) for more details.
Using nonlinear activation functions in the hidden layer would not suffice: already the linear activation functions lead to the best approximation of the data (given the number of hidden nodes), so for the nonlinearities to have an effect on the components, the architecture needs to be changed (see e.g. Diamantaras & Kung, 1996).
Kernel PCA. Kernel PCA is a nonlinear generalization of PCA in the sense that (a) it is performing PCA in feature spaces of arbitrarily large (possibly infinite) dimensionality, and (b) if we use the kernel k(x, y) = (x \cdot y), we recover original PCA. Compared to the above approaches, kernel PCA has the main advantage that no nonlinear optimization is involved; it is essentially linear algebra, as simple as standard PCA. In addition, we need not specify the number of components that we want to extract in advance. Compared to neural approaches, kernel PCA could be disadvantageous if we need to process a very large number of observations, as this results in a large matrix K. Compared to principal curves, kernel PCA is so far harder to interpret in input space; however, at least for polynomial kernels, it has a very clear interpretation in terms of higher-order features.
In this section, we present a set of experiments where we used kernel PCA (in the form given in Appendix D.2.2) to extract principal components. First, we shall take a look at a simple toy example; following that, we describe real-world experiments where we assess the utility of the extracted principal components by classification tasks.
Toy Examples. To provide some intuition on how PCA in F behaves in input space, we show a set of experiments with an artificial 2-D data set, using polynomial kernels (cf. Eq. (2.26)) of degree 1 through 4 (see Fig. 3.3). Linear PCA (on the left) only leads to 2 nonzero Eigenvalues, as the input dimensionality is 2. In contrast, nonlinear PCA allows the extraction of further components. In the figure, note that nonlinear PCA produces contour lines of constant feature value which reflect the structure in the data better than in linear PCA. In all cases, the first principal component varies monotonically along the parabola which underlies the data. In the nonlinear cases, also the second and the third components show behaviour which is similar for different polynomial degrees. The third component, which comes with small Eigenvalues (rescaled to sum to 1), seems to pick up the variance caused by the noise, as can be nicely seen in the case of degree 2. Dropping this component would thus amount to noise reduction.
In Fig. 3.3, it can be observed that for larger polynomial degrees, the principal component extraction functions become increasingly flat around the origin. Thus, different data points not too far from the origin would only differ slightly in the value of their principal components. To understand this, consider the following example: suppose we have two data points
x_1 = (1, 0)^\top, \quad x_2 = (2, 0)^\top,    (3.23)
and a kernel k(x, y) := (x \cdot y)^2. Then the differences between the entries of x_1 and
x_2 get scaled up by the kernel, namely k(x_1, x_1) = 1, but k(x_2, x_2) = 16. We can compensate for this by rescaling the individual entries of each vector x_i by
(x_i)_k \mapsto \mathrm{sign}((x_i)_k)\,|(x_i)_k|^{1/2}.    (3.24)
FIGURE 3.3: Two-dimensional toy example, with data generated in the following way: x-values have uniform distribution in [-1, 1], y-values are generated from y_i = x_i^2 + noise, where the noise is normal with standard deviation 0.2. From left to right, the polynomial degree in the kernel (2.26) increases from 1 to 4; from top to bottom, the first 3 Eigenvectors are shown, in order of decreasing Eigenvalue size. The figures contain lines of constant principal component value (contour lines); in the linear case, these are orthogonal to the Eigenvectors. We did not draw the Eigenvectors, as in the general case, they live in a higher-dimensional feature space.
Indeed, Fig. 3.4 shows that when the data are preprocessed according to (3.24) (where higher degrees are treated correspondingly), the first principal component extractors hardly depend on the degree anymore, as long as it is larger than 1.
FIGURE 3.4: PCA with kernel (2.26), degrees d = 1, \dots, 5. 100 points ((x_i)_1, (x_i)_2) were generated from (x_i)_2 = (x_i)_1^2 + noise (Gaussian, with standard deviation 0.2); all (x_i)_j were rescaled according to (x_i)_j \mapsto sgn((x_i)_j)\,|(x_i)_j|^{1/d}. Displayed are contour lines of constant value of the first principal component. Nonlinear kernels (d > 1) extract features which nicely increase along the direction of main variance in the data; linear PCA (d = 1) does its best in that respect, too, but it is limited to straight directions.
FIGURE 3.5: Two-dimensional toy example with three data clusters (Gaussians with standard deviation 0.1, depicted region: [-1, 1] \times [-0.5, 1]): first 8 nonlinear principal components extracted with k(x, y) = exp(-\|x - y\|^2 / 0.1). Note that the first 2 principal components (top left) nicely separate the three clusters. The components 3-5 split up the clusters into halves. Similarly, the components 6-8 split them again, in a way orthogonal to the above splits. Thus, the first 8 components divide the data into 12 regions.
If necessary, we can thus use (3.24) to preprocess our data. Note, however, that the above scaling problem is irrelevant for the character and object databases to be considered below: there, most entries of the patterns are 1.
Further toy examples, using radial basis function kernels (1.28) and neural network type sigmoid kernels (1.29), are shown in Figures 3.5-3.8.
Object Recognition. In this set of experiments, we used the MPI chair database with 89 training views per object (Appendix A). We computed the matrix K from all
2225 training examples, and used polynomial kernel PCA to extract nonlinear principal components from the training and test set. To assess the utility of the components, we trained a soft margin hyperplane classifier (Sec. 2.1.3) on the classification task. This is a special case of Support Vector machines, using the standard dot product as a kernel function. Table 3.1 summarizes our findings: in all cases, nonlinear components as extracted by polynomial kernels (Eq. (2.26) with d > 1) led to classification accuracies superior to standard PCA. Specifically, the nonlinear components afforded top test performances between 2% and 4% error, whereas in the linear case we obtained 17%.
FIGURE 3.6: Two-dimensional toy example with three data clusters (Gaussians with standard deviation 0.1, depicted region: [-1, 1] \times [-0.5, 1]): first 6 nonlinear principal components extracted with k(x, y) = tanh(2(x \cdot y) - 1) (the gain and threshold values were chosen according to the values used in SV machines, cf. Table 2.4). Note that the first 2 principal components are sufficient to separate the three clusters, and the third and fourth component simultaneously split all clusters into halves.
Character Recognition. To validate the above results on a widely used pattern recognition benchmark,
FIGURE 3.7: For different threshold values (from top to bottom: \Theta = -4, -2, -1, 0, 2), kernel PCA with hyperbolic tangent kernels k(x, y) = tanh(2(x \cdot y) + \Theta) exhibits qualitatively similar behaviour (data as in the previous figures). In all cases, the first two components capture the main structure of the data, whereas the third components split the clusters.
FIGURE 3.8: A smooth transition from linear PCA to nonlinear PCA is obtained by using hyperbolic tangent kernels k(x, y) = tanh(\kappa(x \cdot y) + 1) with varying gain \kappa: from top to bottom, \kappa = 0.1, 1, 5, 10 (data as in the previous figures). For \kappa = 0.1, the first two features look like linear PCA features. For large \kappa, the nonlinear region of the tanh function becomes effective. In that case, kernel PCA can exploit this nonlinearity to allocate the highest feature gradients to regions where there are data points, as can be seen nicely in the case \kappa = 10.
# of components   degree 1
64      23.0
128     17.6
256     16.8
512     n.a.
1024    n.a.
2048    n.a.
TABLE 3.1: Test error rates on the MPI chair database for linear Support Vector machines trained on nonlinear principal components extracted by PCA with polynomial kernel (2.26), for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with the number of nonzero Eigenvalues being at most the dimensionality of the space, 256; thus, we can extract at most 256 principal components. The performance for the nonlinear cases (degree > 1) is significantly better than for the linear case, illustrating the utility of the extracted nonlinear components for classification.
# of components   degree 1
32      9.6
64      8.8
128     8.6
256     8.7
512     n.a.
1024    n.a.
2048    n.a.
TABLE 3.2: Test error rates on the USPS handwritten digit database for linear Support Vector machines trained on nonlinear principal components extracted by PCA with kernel (2.26), for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with the number of nonzero Eigenvalues being at most the dimensionality of the space, 256. Clearly, nonlinear principal components afford test error rates which are superior to the linear case (degree 1).
is much better than linear classifiers operating directly on the image data (a linear Support Vector machine achieves 8.9%; Table 2.4). Our results were obtained without using any prior knowledge about symmetries of the problem at hand, explaining why the performance is inferior to Virtual Support Vector classifiers (3.2%, Table 4.1), and Tangent Distance Nearest Neighbour classifiers (2.6%, Simard, LeCun, & Denker, 1993). We believe that adding e.g. local translation invariance, be it by generating "virtual" translated examples (cf. Sec. 4.2.1) or by choosing a suitable kernel (e.g. as the ones that we shall describe in Sec. 4.3), could further improve the results.
3.5 Discussion
Feature Extraction for Classification. This chapter was devoted to the presentation of a new technique for nonlinear PCA. To develop this technique, we made use of a kernel method so far only used in supervised learning (Vapnik, 1995; Sec. 1.3). Kernel PCA constitutes a mere first step towards exploiting this technique for a large class of algorithms.
In experiments comparing the utility of kernel PCA features for pattern recognition using a linear classifier, we found two advantages of nonlinear kernels: first, nonlinear principal components afforded better recognition rates than corresponding numbers of linear principal components; and second, the performance for nonlinear components can be further improved by using more components than is possible in the linear case.
We have not yet compared kernel PCA to other techniques for nonlinear feature extraction and dimensionality reduction. We can, however, compare results to other feature extraction methods which have been used in the past by researchers working on the USPS classification problem (cf. Sec. 3.4). Our system of kernel PCA feature extraction plus linear Support Vector machine for instance performed better than LeNet1 (LeCun et al., 1989). Even though the latter result was obtained a number of years ago, it should be stressed that LeNet1 provides an architecture which contains a great deal of prior information about the handwritten character classification problem. It uses shared weights to improve transformation invariance, and a hierarchy of feature detectors resembling parts of the human visual system. These feature detectors were for instance used by Bottou and Vapnik (1992) as a preprocessing stage in their experiments in local learning. Note that, in addition, our features were extracted without taking into account that we want to do classification. Clearly, in supervised learning, where we are given a set of labelled observations (x_1, y_1), \dots, (x_\ell, y_\ell), it would seem advisable to make use of the labels not only during the training of the final classifier, but already in the stage of feature extraction.
To conclude this paragraph on feature extraction for classification, we note that a similar approach could be taken in the case of regression estimation.
Feature Space and the Curse of Dimensionality. We are doing PCA in 10^{10}-dimensional feature spaces, yet getting results in finite time which are comparable to state-of-the-art techniques. In fact, of course, we are not working in the full feature space, but just in a comparably small linear subspace of it, whose dimension equals at most the number of observations. The method automatically chooses this subspace and provides a means of taking advantage of the lower dimensionality. An approach which consisted in explicitly mapping into feature space and then performing PCA would have severe difficulties at this point: even if PCA were done based on an M \times M dot product matrix (M being the sample size) whose diagonalization is tractable, it would still be necessary to evaluate dot products in a 10^{10}-dimensional feature space to compute the entries of the matrix in the first place. Kernel-based methods avoid this problem: they do not explicitly compute all dimensions of F (loosely speaking, all possible features), but only work in a relevant subspace of F.
Note, moreover, that we did not get overfitting problems when training the linear SV classifier on the extracted features. The basic idea behind this two-step approach is very similar in spirit to nonlinear SV machines: one maps into a very complex space to be able to approximate a large class of possible decision functions, and then uses a low VC-dimension classifier in that space to control generalization.
Kernel PCA has the advantages that it does not require nonlinear optimization, but only the solution of an Eigenvalue problem, and, through the possibility of using different kernels, it comprises a fairly general class of nonlinearities that can be used. Clearly, the last point has yet to be evaluated in practice; however, for the Support Vector machine, the utility of different kernels has already been established. Different kernels (polynomial, sigmoid, Gaussian) led to fine classification performances (Table 2.4). The general question of how to select the ideal kernel for a given task (i.e. the appropriate feature space), however, is an open problem.
We conclude this chapter with a twofold outlook. The scene has been set for using the kernel method to construct a wide variety of rather general and still feasible nonlinear variants of classical algorithms. It is beyond the scope of the present work to explore all the possibilities, including many distance-based algorithms, in detail. Some of them are currently being investigated, for instance nonlinear forms of k-means clustering and kernel-based independent component analysis (Scholkopf, Smola, & Muller, 1996). Other domains where researchers have recently started to investigate the use of Mercer kernels include Gaussian Processes (Williams, 1997).
Linear PCA is being used in numerous technical and scientific applications, including noise reduction, density estimation, image indexing and retrieval systems, and the analysis of natural image statistics. Kernel PCA can be applied to all domains where traditional PCA has so far been used for feature extraction, and where a nonlinear extension would make sense.
Chapter 4
4.1 Introduction
When we are trying to extract regularities from data, we often have additional knowledge about the functions that we estimate (for a review, see Abu-Mostafa, 1995). For instance, in image classification tasks, there exist transformations which leave class membership invariant (e.g. translations); moreover, it is usually the case that images have a local structure in the sense that not all correlations between image regions carry equal amounts of information. We presently investigate the question how to make use of these two sources of knowledge.
We first present the Virtual SV method of incorporating prior knowledge about transformation invariances by applying transformations to Support Vectors, the training examples most critical for determining the classification boundary (Sec. 4.2.1).
In Sec. 4.2.2, we design kernel functions which lead to invariant classification hyperplanes. This method is applicable to invariances under the action of differentiable local 1-parameter groups of local transformations, e.g. translational invariance in pattern recognition; the Virtual SV method is applicable to any type of invariance. In the third method proposed in this chapter, we also modify the kernel functions; however, this time not to incorporate transformation invariance, but to take into account image locality by using localized receptive fields (Sec. 4.3). Following that, Sec. 4.4 and Sec. 4.5 give experimental results and a discussion, respectively.
FIGURE 4.1: Different ways of incorporating invariances in a decision function. The dashed line marks the "true" boundary, disks and circles are the training examples. We assume that prior information tells us that the classification function only depends on the norm of the input vector (the origin being in the center of each picture). Lines crossing an example indicate the type of information conveyed by the different methods of incorporating prior information. Left: generating virtual examples in a localized region around each training example; middle: incorporating a regularizer to learn tangent values (cf. Simard, Victorri, LeCun, and Denker, 1992); right: changing the representation of the data by first mapping each example to its norm. If feasible, the latter method yields the most information. However, if the necessary nonlinear transformation cannot be found, or if the desired invariances are of a localized nature, one has to resort to one of the former techniques. Finally, the reader may note that examples close to the boundary allow us to exploit prior knowledge very effectively: given a method to get a first approximation of the true boundary, the examples closest to it would allow good estimation of the true boundary. A similar two-step approach is pursued in Sec. 4.2.1. (From Scholkopf, Burges, and Vapnik (1996a).)
The Virtual SV method (Sec. 4.2.1) retains the flexibility and simplicity of virtual examples approaches, while cutting down on their computational cost significantly. The Invariant Hyperplane method (Sec. 4.2.2), on the other hand, is comparable to the method of Simard et al. (1992) in that it is applicable for all differentiable local 1-parameter groups of local symmetry transformations, comprising a fairly general class of invariances. In addition, it has an equivalent interpretation as a preprocessing operation applied to the data before learning. In this sense, it can also be viewed as changing the representation of the data to a more invariant one, in a task-dependent way.
by another machine, with a test performance not worse than after training on the full database. Using this finding as a starting point, we now investigate the question whether it might be sufficient to generate virtual examples from the Support Vectors only. After all, one might hope that it does not add much information to generate virtual examples of patterns which are not close to the boundary. In high-dimensional cases, however, care has to be exercised regarding the validity of this intuitive picture. Thus, an experimental test on a high-dimensional real-world problem is imperative.
In our experiments, we proceeded as follows (cf. Fig. 4.2; a schematic code sketch of these steps is given after the list):
1. Train a Support Vector machine to extract the Support Vector set.
2. Generate artificial examples by applying the desired invariance transformations to the Support Vectors. In the following, we will refer to these examples as Virtual Support Vectors (VSVs).
3. Train another Support Vector machine on the generated examples.^1
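A minimal sketch of the three steps, using scikit-learn's SVC as a stand-in SV trainer and 1-pixel cyclic shifts as an illustrative invariance transformation; the parameter values are placeholders, not the exact settings used in the experiments:

```python
import numpy as np
from sklearn.svm import SVC

def translate(img, dx, dy):
    # cyclically shift a 2-D image by (dx, dy) pixels
    # (a crude stand-in for a proper translation with background padding)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def virtual_sv_train(X, y, image_shape, shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    # Step 1: train a first SV machine and extract the Support Vector set.
    clf = SVC(C=10.0, kernel="poly", degree=5)
    clf.fit(X, y)
    X_sv, y_sv = X[clf.support_], y[clf.support_]
    # Step 2: generate Virtual Support Vectors by applying the invariance
    # transformations (here: 1-pixel shifts) to the Support Vectors.
    X_parts, y_parts = [X_sv], [y_sv]
    for dx, dy in shifts:
        shifted = np.array([translate(x.reshape(image_shape), dx, dy).ravel()
                            for x in X_sv])
        X_parts.append(shifted)
        y_parts.append(y_sv)
    # Step 3: train another SV machine on the SVs plus the Virtual SVs.
    clf2 = SVC(C=10.0, kernel="poly", degree=5)
    clf2.fit(np.vstack(X_parts), np.concatenate(y_parts))
    return clf2
```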
If the desired invariances are incorporated, the curves obtained by applying Lie symmetry transformations to points on the decision surface should have tangents parallel to the latter (cf. Simard et al., 1992). If we use small Lie group transformations to generate the virtual examples, this implies that the Virtual Support Vectors should be approximately as close to the decision surface as the original Support Vectors. Hence, they are fairly likely to become Support Vectors after the second training run. Vice versa, if a substantial fraction of the Virtual Support Vectors turn out to become Support Vectors in the second run, we have reason to expect that the decision surface does have the desired shape.
To express the condition of invariance of the decision function, we already need to know its coefficients, which are found only during the optimization, which in turn should already take into account the desired invariances. As a way out of this circle, we use the following ansatz: consider decision functions f = sgn \circ g, where g is defined as
g(x_j) := \sum_{i=1}^{\ell}\alpha_i y_i\,(B x_j \cdot B x_i) + b,    (4.1)
with a matrix B to be determined below. This follows Vapnik (1995b), who suggested to incorporate invariances by modifying the dot product used: any nonsingular B defines a dot product, which can equivalently be written in the form (x_j \cdot A x_i), with a positive definite matrix A = B^\top B.
^1 Clearly, the scheme can be iterated; however, care has to be exercised, since the iteration of local invariances would lead to global ones which are not always desirable; cf. the example of a '6' rotating into a '9' (Simard, LeCun, and Denker, 1993).
FIGURE 4.2: Suppose we have prior knowledge indicating that the decision function should be invariant with respect to horizontal translations. The true decision boundary is drawn as a dotted line (top left); however, as we are just given a limited training sample, different separating hyperplanes are conceivable (top right). The SV algorithm finds the unique separating hyperplane with maximal margin (bottom left), which in this case is quite different from the true boundary. For instance, it would lead to wrong classification of the ambiguous point indicated by the question mark. Making use of the prior knowledge by generating Virtual Support Vectors from the Support Vectors found in a first training run, and retraining on these, yields a more accurate decision boundary (bottom right). Note, moreover, that for the considered example, it is sufficient to train the SV machine only on virtual examples generated from the Support Vectors.
One may ask how the separating hyperplane, including its margin, would change if the training examples were transformed. It turns out that when discussing the invariance of g rather than f, these two concepts are closely related. In the following argument, we restrict ourselves to the optimal margin case (\xi_i = 0 for all i = 1, \dots, \ell), where the margin is well-defined. As the separating hyperplane and its margin are expressed in terms of Support Vectors, locally transforming a Support Vector x_i will change the hyperplane or the margin if g(x_i) changes: if |g| gets smaller than 1, the transformed pattern will lie in the margin, and the recomputed margin will be smaller; if |g| gets larger than 1, the margin might become bigger, depending on whether the pattern can be expressed in terms of the other SVs (cf. the remark in point 2 of the enumeration preceding Proposition 2.1.2). In terms of the mechanical analogy of Sec. 2.1.2: moving Support Vectors changes the mechanical equilibrium for the sheet separating the classes. Conversely, a local transformation of a non-Support Vector will never change f, even if the value of g changes, as the solution of the programming problem is expressed in terms of the Support Vectors only.
In this sense, invariance of f under local transformations of the given data corresponds to invariance of (4.1) for the Support Vectors. Note, however, that this criterion is not readily applicable: before finding the Support Vectors in the optimization, we already need to know how to enforce invariance. Thus the above observation cannot be used directly; however, it could serve as a starting point for constructing heuristics or iterative solutions. In the Virtual SV method (Sec. 4.2.1), a first run of the standard SV algorithm is carried out to obtain an initial SV set; similar heuristics could be applied in the present case.
Local invariance of g for each pattern x_j under transformations of a differentiable local 1-parameter group of local transformations L_t,
\frac{\partial}{\partial t}\Big|_{t=0} g(L_t x_j) = 0,    (4.2)
can be approximately enforced by minimizing the regularizer
\frac{1}{\ell}\sum_{j=1}^{\ell}\left(\frac{\partial}{\partial t}\Big|_{t=0} g(L_t x_j)\right)^2.    (4.3)
Note that the sum may run over labelled as well as unlabelled data, so in principle one could also require the decision function to be invariant with respect to transformations of elements of a test set. Moreover, we could use different transformations for different patterns.
For (4.1), the local invariance term (4.2) becomes
\frac{\partial}{\partial t}\Big|_{t=0} g(L_t x_j) = \frac{\partial}{\partial t}\Big|_{t=0}\left(\sum_{i=1}^{\ell}\alpha_i y_i\,(B L_t x_j \cdot B x_i) + b\right)
= \sum_{i=1}^{\ell}\alpha_i y_i\,\partial_1(B L_0 x_j \cdot B x_i)\; B\,\frac{\partial}{\partial t}\Big|_{t=0} L_t x_j,    (4.4)
using the chain rule. Here, \partial_1(B L_0 x_j \cdot B x_i) denotes the gradient of (x \cdot y) with respect to x, evaluated at the point (x \cdot y) = (B L_0 x_j \cdot B x_i).
As a side remark, note that a sufficient, albeit rather strict, condition for invariance is thus that \frac{\partial}{\partial t}\big|_{t=0}(B L_t x_j \cdot B x_i) vanish for all i, j; however, we will proceed in our derivation, with the goal to impose weaker conditions, which apply for one specific decision function rather than simultaneously for all decision functions expressible by different choices of the coefficients \alpha_i y_i.
Substituting (4.4) into (4.3), using the facts that L_0 = I and \partial_1(x \cdot y) = y^\top, yields the regularizer
\frac{1}{\ell}\sum_{j=1}^{\ell}\left(\sum_{i=1}^{\ell}\alpha_i y_i\,(B x_i)^\top B\,\frac{\partial}{\partial t}\Big|_{t=0} L_t x_j\right)^2
= \frac{1}{\ell}\sum_{j=1}^{\ell}\left(\sum_{i=1}^{\ell}\alpha_i y_i\,(B x_i)^\top B\,\frac{\partial}{\partial t}\Big|_{t=0} L_t x_j\right)\left(\sum_{k=1}^{\ell}\alpha_k y_k\,\Big(B\,\frac{\partial}{\partial t}\Big|_{t=0} L_t x_j\Big)^\top (B x_k)\right)
= \sum_{i,k=1}^{\ell}\alpha_i y_i \alpha_k y_k\,(B x_i \cdot B C B^\top B x_k),    (4.5)
where
C := \frac{1}{\ell}\sum_{j=1}^{\ell}\left(\frac{\partial}{\partial t}\Big|_{t=0} L_t x_j\right)\left(\frac{\partial}{\partial t}\Big|_{t=0} L_t x_j\right)^\top.    (4.6)
We now choose B such that (4.5) reduces to the standard SV target function (2.7) in the form obtained by substituting (2.11) into it (cf. the quadratic term of (2.13)), utilizing the dot product chosen in (4.1), i.e. such that
(B x_i \cdot B C B^\top B x_k) = (B x_i \cdot B x_k).    (4.7)
Assuming that the x_i span the whole space, this condition becomes
B^\top B C B^\top B = B^\top B,    (4.8)
or, by requiring B to be nonsingular, i.e. that no information get lost during the preprocessing, B C B^\top = I. This can be satisfied by a preprocessing matrix
B = C^{-\frac{1}{2}},    (4.9)
the nonnegative square root of the inverse of the nonnegative matrix C defined in (4.6). In practice, we use a matrix
C_\lambda := (1 - \lambda)\,C + \lambda I.    (4.10)
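A small sketch of the resulting preprocessing, assuming the tangent vectors are approximated by finite differences of some known transformation (e.g. a 1-pixel translation) and using the Eigendecomposition (4.13) for the inverse square root:

```python
import numpy as np

def tangent_covariance(X, transform, t=1.0):
    """Tangent covariance matrix C of (4.6), with d/dt L_t x approximated
    by the finite difference (transform(x) - x) / t."""
    T = np.array([(transform(x) - x) / t for x in X])
    return T.T @ T / len(X)

def invariance_preprocessing(X, transform, lam=0.1):
    """B = C_lambda^{-1/2}, cf. (4.9), (4.10), (4.13)."""
    C = tangent_covariance(X, transform)
    C_lam = (1 - lam) * C + lam * np.eye(C.shape[0])
    eigvals, S = np.linalg.eigh(C_lam)            # C_lam = S D S^T
    B = S @ np.diag(eigvals ** -0.5) @ S.T        # B = S D^{-1/2} S^T
    return B                                      # use (B x_j . B x_i) in (4.1)
```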
The derivative of k must be evaluated for specific kernels, e.g. for k(x, y) = (x \cdot y)^d, \partial_1 k(x, y) = d\,(x \cdot y)^{d-1} y^\top. To obtain a kernel-specific constraint on the matrix B, one would need to equate the result with the quadratic term in the nonlinear objective function,
\sum_{i,k}\alpha_i y_i \alpha_k y_k\, k(B x_i, B x_k).    (4.12)
Relationship to Principal Component Analysis. Let us now provide some interpretation of (4.9) and (4.6). The tangent vectors \frac{\partial}{\partial t}\big|_{t=0} L_t x_j have zero mean; thus C is a sample estimate of the covariance matrix of the random vector s\,\frac{\partial}{\partial t}\big|_{t=0} L_t x, where s \in \{\pm 1\} is a random sign. Based on this observation, we call C (4.6) the Tangent Covariance Matrix of the data set \{x_i : i = 1, \dots, \ell\} with respect to the transformations L_t.
Being positive definite, C can be diagonalized, C = S D S^\top, with an orthogonal matrix S consisting of C's Eigenvectors and a diagonal matrix D containing the corresponding positive Eigenvalues. Then we can compute
B = C^{-\frac{1}{2}} = S D^{-\frac{1}{2}} S^\top,    (4.13)
where D^{-\frac{1}{2}} is the diagonal matrix obtained from D by taking the inverse square
roots of the diagonal elements. Since the dot product is invariant under orthogonal transformations, we may drop the leading S and (4.1) becomes
g(x_j) = \sum_{i=1}^{\ell}\alpha_i y_i\,(D^{-\frac{1}{2}} S^\top x_j \cdot D^{-\frac{1}{2}} S^\top x_i) + b.    (4.14)
a feasible way to generalize to the nonlinear case. To this end, we use kernel principal component analysis (Chapter 3). This technique allows us to compute principal components in a space F nonlinearly related to input space. The kernel function k plays the role of the dot product in F, i.e. k(x, y) = (\Phi(x) \cdot \Phi(y)). To generalize (4.14) to the nonlinear case, we compute the tangent covariance matrix C (Eq. (4.6)) in feature space F, and its projection onto the subspace of F which is given by the linear span of the tangent vectors in F. There, the considerations of the linear case
^3 As an aside, note that our goal to build invariant SV machines has thus serendipitously provided us with an approach for an open problem in SV learning, namely the one of scaling: in SV machines, there has so far been no way of automatically assigning different weights to different directions in input space; in a trained SV machine, the weights of the first layer (the SVs) form a subset of the training set. Choosing these Support Vectors from the training set only gives rather limited possibilities for appropriately dealing with different scales in different directions of input space.
^4 To see this, first note that if B solves B^\top B C B^\top B = B^\top B, and B's polar decomposition is B = U B_s, with U U^\top = I and B_s = B_s^\top, then B_s also solves it. Thus, we may restrict ourselves to symmetric solutions. For our choice B = C^{-\frac{1}{2}}, B commutes with C, hence they can be diagonalized simultaneously. In this case, B^2 C B^2 = B^2 can clearly also be satisfied by any matrix which is obtained from B by setting an arbitrary selection of Eigenvalues to 0 (in the diagonal representation).
apply. The whole procedure reduces to computing dot products in F, which can be done using k, without explicitly mapping into F.
In rewriting (4.6) for the nonlinear case, we substitute finite differences, with t > 0, for the derivatives:
C := \frac{1}{\ell t^2}\sum_{j=1}^{\ell}\big(\Phi(L_t x_j) - \Phi(x_j)\big)\big(\Phi(L_t x_j) - \Phi(x_j)\big)^\top.    (4.15)
For the sake of brevity, we have omitted the summands corresponding to derivatives in the opposite direction, which ensure that the data set is centered. For the final tangent covariance matrix C, they do not make a difference, as the two negative signs cancel out.
In high-dimensional feature spaces, C cannot be computed explicitly. In complete analogy to Chapter 3, we compute another matrix whose Eigenvalues and Eigenvectors will allow us to extract features corresponding to Eigenvectors and Eigenvalues of C. This is done by taking dot products from both sides with \Phi(L_t x_i) - \Phi(x_i) (the Eigenvectors in F can be expanded in terms of the latter, by the same argument as the one leading to (3.10)). Defining
K_{ij} = k(x_i, x_j),    (4.16)
K^{t}_{ij} = k(L_t x_i, x_j),    (4.17)
K^{tt}_{ij} = k(L_t x_i, L_t x_j),    (4.18)
we get
V = \sum_{k=1}^{\ell}    (4.19)
(4.20)
\sum_{k=1}^{\ell}\gamma_k\big(\Phi(L_t x_k) - \Phi(x_k)\big)    (4.21)
and
\lambda\,(K^{tt} - K^{t} + K)\,\gamma = \frac{1}{\ell t^2}\,(K^{tt} - K^{t} + K)^2\,\gamma.    (4.22)
(4.23)
V^\top\Phi(x) = \sum_{k=1}^{\ell}    (4.25)
FIGURE 4.3: Kernel utilizing local correlations in images. To compute k(x, y) for two images x and y, we sum over products between corresponding pixels of the two images in localized regions (in the figure, this is indicated by dot products (\cdot\ \cdot)), weighted by pyramidal receptive fields. To the outputs, a first nonlinearity in the form of an exponent d_1 is applied. The resulting numbers for all patches (only two are displayed) are summed, and the d_2-th power of the result is taken as the value k(x, y). This kernel corresponds to a dot product in a polynomial space which is spanned mainly by localized correlations between pixels (see Sec. 4.3).
4. sum the z_{ij}^{d_1} over the whole image, and raise the result to the power d_2 to allow for long-range correlations of order d_2.
The resulting kernel will be of order d_1 d_2; however, it will not contain all possible correlations of d_1 d_2 pixels.
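A minimal sketch of such a locality-improved kernel, under illustrative assumptions (fixed square patches, a simple pyramidal weight mask, and patch size/step chosen arbitrarily rather than as in the experiments):

```python
import numpy as np

def pyramid_weights(size):
    # simple pyramidal receptive field: weights decay linearly from the patch center
    c = (size - 1) / 2.0
    d = np.maximum(np.abs(np.arange(size) - c)[:, None],
                   np.abs(np.arange(size) - c)[None, :])
    return (c + 1 - d) / (c + 1)

def local_correlation_kernel(x, y, patch=4, step=4, d1=2, d2=2):
    """k(x, y): weighted patchwise dot products, raised to d1, summed over
    all patches, and the sum raised to d2 (cf. Fig. 4.3)."""
    w = pyramid_weights(patch)
    total = 0.0
    for i in range(0, x.shape[0] - patch + 1, step):
        for j in range(0, x.shape[1] - patch + 1, step):
            px = x[i:i+patch, j:j+patch] * w
            py = y[i:i+patch, j:j+patch] * w
            total += np.sum(px * py) ** d1
    return total ** d2
```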
database of handwritten digits (Appendix C). This database has been used extensively in the literature, with a LeNet1 Convolutional Network achieving a test error rate of 5.0% (LeCun et al., 1989). As in Sec. 2.3, we used \lambda = 10.
Virtual Support Vectors were generated for the set of all different Support Vectors of the ten classifiers. Alternatively, one can carry out the procedure separately for the ten binary classifiers, thus dealing with smaller training sets during the training of the second machine. Table 4.1 shows that incorporating only translational invariance
already improves performance signicantly, from 4.0% to 3.2% error rate. For other
types of invariances (Fig. 4.4), we also found improvements, albeit smaller ones: generating Virtual Support Vectors by rotation or by the line thickness transformation
of Drucker, Schapire, and Simard (1993), we constructed polynomial classiers with
3.7% error rate (in both cases).
Note, moreover, that generating Virtual examples from the full database rather than just from the SV sets did not improve the accuracy, nor did it enlarge the SV set of the final classifier substantially. This finding was reproduced for the Virtual SV system mentioned in Sec. 2.5.3: in that case, similar to Table 4.1, generating Virtual examples from the full database led to identical performance, and only slightly increased the SV set size (861 instead of 806). From this, we conclude that for the considered recognition task, it is sufficient to generate Virtual examples only from the SVs; Virtual examples generated from the other patterns do not add much useful information.
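A compact sketch of the Virtual SV procedure for one binary classifier is given below. It uses the modern scikit-learn interface purely for illustration (the thesis predates that library), and the kernel degree, regularization value and 1-pixel translations are placeholder assumptions.

    import numpy as np
    from sklearn.svm import SVC

    def translate(img_vec, dx, dy, width=16):
        # Shift a flattened width x width image by (dx, dy) pixels.
        img = img_vec.reshape(width, width)
        return np.roll(np.roll(img, dy, axis=0), dx, axis=1).reshape(-1)

    def virtual_sv_retrain(X, y, degree=5, C=10.0):
        # Train, extract the SVs, add their 1-pixel translates, retrain on SVs + virtual SVs.
        svm = SVC(kernel="poly", degree=degree, C=C).fit(X, y)
        sv, sv_y = svm.support_vectors_, y[svm.support_]
        shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
        X_virtual = np.vstack([np.apply_along_axis(translate, 1, sv, dx, dy)
                               for dx, dy in shifts])
        y_virtual = np.tile(sv_y, len(shifts))
        return SVC(kernel="poly", degree=degree, C=C).fit(
            np.vstack([sv, X_virtual]), np.concatenate([sv_y, y_virtual]))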
MNIST Digit Recognition. The larger a database, the more information about invariances it already contains implicitly.
TABLE 4.2: Application of the Virtual SV method to the MNIST database. Virtual SVs were generated by translating the original SVs in all four principal directions (by 1 pixel). Results are given for the original SV machine, and two VSV systems utilizing different kernel degrees; in all cases, the constant of (2.19) was set to 10. SV: degree 5 polynomial SV classifier; VSV1: VSV machine with degree 5 polynomial kernel; VSV2: same with degree 9 kernel. The first table gives the performance: for the ten binary recognizers, as numbers of errors; for multi-class classification (T1), in terms of error rates (in %), both on the 60000 element test set. The second multi-class error rate (T2) was computed by testing only on a 10000 element subset of the full 60000 element test set. These results are given for reference purposes; they are the ones usually reported in MNIST performance studies. The second table gives numbers of SVs for all ten binary digit recognizers.
Errors (binary recognizers 0-9, and 10-class error rates T1/T2 in %):

    system     0    1    2    3    4    5    6    7    8    9    T1   T2
    SV       131   97  243  240  212  241  195  259  343  409   1.6  1.4
    VSV1      95   84  186  176  173  171  127  217  233  289   1.1  1.0
    VSV2      81   66  164  146  141  147  119  179  196  254   1.0  0.8

Support Vectors (binary recognizers 0-9):

    system     0     1     2     3     4     5     6     7     8     9
    SV      1206   757  2183  2506  1784  2255  1347  1712  3053  2720
    VSV1    2938  1887  5015  4764  3983  5235  3328  3968  6978  6348
    VSV2    3941  2136  6598  7380  5127  6466  4128  5014  8701  7720
To see whether we could still improve classification accuracies with our technique, we applied the method to the MNIST database (Appendix C) of 60000 handwritten digits. This database has become the standard for performance comparisons at AT&T Bell Labs; the error rate record of 0.7% is held by a boosted LeNet4 (Bottou et al., 1994; LeCun et al., 1995), i.e. by an ensemble of learning machines. The best single machine in the performance comparisons so far was a LeNet5 convolutional neural network (0.9%); other high performance systems include Tangent Distance nearest neighbour classifiers (1.1%), and LeNet4 with a last layer using methods of local learning (1.1%, cf. Bottou and Vapnik, 1992).
Using Virtual Support Vectors generated by 1-pixel translations, we improved a degree 5 polynomial SV classifier from 1.4% to 1.0% error rate on the 10000 element test set (Table 4.2). In this case, we applied our technique separately for all ten Support Vector sets of the binary classifiers (rather than for their union) in order to avoid having to deal with large training sets in the retraining stage. Note, moreover, that for the MNIST database, we did not compare results of the VSV technique to those for generating Virtual examples from the whole database: the latter is computationally exceedingly expensive, as it entails training on a very large training set. We did, however, make a comparison for the small MNIST database (Appendix C). There, a degree 5 polynomial classifier was improved from 3.8% to 2.5% error by the Virtual SV method, with an increase of the average SV set sizes from 324 to 823. By generating Virtual examples from the full training set, and retraining on these, we obtained a system which had slightly more SVs (939), but an unchanged error rate.
After retraining, the number of SVs more than doubled (Table 4.2). Thus, although the training sets for the second set of binary classifiers were substantially smaller than the original database (for four Virtual SVs per SV, four times the size of the original SV sets, in our case amounting to around 10^4), we concluded that the amount of data in the region of interest, close to the decision boundary, had more than doubled. Therefore, we reasoned that it should be possible to use a more complex decision function in the second stage (note that the risk bound (1.5) depends on the ratio of VC-dimension and training set size). Indeed, using a degree 9 polynomial led to an error rate of 0.8%, very close to the record performance of 0.7%.
Another interesting performance measure is the rejection error rate, defined as the percentage of patterns that would have to be rejected to attain a specified error rate (in the benchmark studies of Bottou et al. (1994) and LeCun et al. (1995), 0.5%). Note that this percentage is computed on the test set. In our case, using the confidence measure of Sec. 2.1.6, it was measured to be 0.9%, realizing a large improvement compared to the original SV system (2.4%). In the above benchmark studies, only the boosted LeNet4 ensemble performed better (0.5%).
Further improvements can possibly be achieved by combining different types of invariances. Another intriguing extension of the scheme would be to use techniques based on image correspondence (e.g. Vetter and Poggio, 1997) to extract invariance transformations from the training set. Those transformations can then be used to generate Virtual Support Vectors.6
6 Together with
FIGURE 4.5: Virtual SVs in gender classification. A: 2-D image of a 3-D head model (from the MPI head database (Troje and Bulthoff, 1996; Vetter and Troje, 1997)); B: 2-D image of the rotated 3-D head; C: artificial image, generated from A using the assumption that it belongs to a cylinder-shaped 3-D object (rotation by the same angle as B).
TABLE 4.3: Numbers of test errors for gender classification in novel pose, using Virtual SVs (qualitatively similar to Fig. 4.5). The training set contained 100 views of male and female heads (divided 49:51), taken at an azimuth of 24 degrees, downsampled to 32 × 32. The test set contained 100 frontal views of the same heads. We used polynomial SV classifiers of different degrees, generating one virtual SV per original SV. Clearly, training and test views are differently distributed; however, the amount of rotation (24 degrees) was known to the classifier in the sense that it was used for generating the Virtual SVs (Fig. 4.5): first, a simplified head model was inferred by averaging over in-depth revolutions of all the 2-D views. VSVs were generated by projecting the original SVs onto the head model, then rotating the head to the frontal view, and computing the new 2-D view.
                                   degree
    prior knowledge                 1    2    3    4    5
    no virtual SVs                 25   24   23   21   19
    virtual SVs from 3D model      11   10   10    9   10
Face Classification. Certain types of transformations, such as the translations and rotations used above, apply equally well to object recognition as they do to character recognition. There are, however, types of transformations which are specific to the class of images considered (cf. Sec. 1.1). For instance, line thickness transformations (Fig. 4.4) are specific to character recognition. To provide an example of Virtual SVs which are specific to object recognition, we generated virtual SVs corresponding to object rotations in depth, by making assumptions about the 3-D shape of objects. Clearly, such an approach would have a hard time if applied to complex objects such as chairs (Appendix A). For human heads, however, it is possible to formulate 2-D image transformations which can be applied to generate approximate novel views of heads (Fig. 4.5). Using these views improved accuracies in a small gender classification experiment. Table 4.3 gives details and results of the experiment.
TABLE 4.4: Test error rates for two object recognition databases, for views of resolution 16 × 16, using different types of approximate invariance transformations to generate Virtual SVs, and polynomial kernels of degree 20 (cf. Table 2.1). The second training run in the Virtual SV systems was done on the original SVs and the generated Virtual SVs. The training sets with 25 and 89 views per object are regularly spaced; for them, mirroring does not provide additional information. The interesting case is the one where we trained on the 100-view-per-object sets. Here, a combination of virtual SVs from mirroring and rotation substantially improves accuracies on both databases.
    database:                        animal              entry level
    training set: views per object   25    89   100      25    89   100
    Virtual SVs
    none (orig. system)            13.0   1.7   4.8    13.0   1.8   2.4
    mirroring                      13.6   1.8   4.8    14.2   2.8   3.2
    translations                   16.4   1.6   4.3    17.1  11.1   4.8
    rotations                       9.0   0.7   3.0    10.3   1.8   2.5
    rotations & mirroring           9.0   0.7   1.7     9.6   0.9   1.7
Transformations of 3-D objects, however, do not in general correspond to simple transformations of the produced 2-D images (cf. Sec. 4.4.1). For the MPI object databases (Appendix A), however, there exists a type of invariance transformation which can easily be computed from the images: as all the objects used are (approximately) bilaterally symmetric, we can easily produce another valid view of the same object, with a different viewing angle, by performing a mirror operation with respect to a vertical axis in the center of the images, say (Vetter, Poggio, and Bulthoff, 1994). If the objects were exactly symmetric, we would not expect any additional information to be gained in the case of the regularly spaced object sets (25 and 89 views per object), as in these the snapshots are already sampled symmetrically around the zero view direction, which in most cases coincided with the symmetry plane. The slight decrease in performance in that case (Table 4.4) indicates that for some objects, the symmetry only holds approximately (for snapshots, see Appendix A).
To get more robust results, we tried combining this type of invariance transformation with other types. As in the case of character recognition, we simply used translations (by 1 pixel in all four directions) and image-plane rotations (by 10 degrees in both directions). Even though these transformations are but very crude approximations of transformations which occur when a 3-D object is rotated in space, they did in some cases yield significant performance improvements.7
7 The following may serve as a partial explanation why rotations were more useful than translations. First, different snapshots at large elevations can be transformed into each other by an approximate image plane rotation, and second, image plane rotations retain the centering which was applied to the original images. Both points suggest that virtual examples generated by rotations should be more "realistic" than those generated by translations.
To examine the effect of the mirror symmetry Virtual SVs, we need to focus on the non-regularly spaced training set with 100 views per object. There, by far the best performance for both the entry level and the animal database was obtained by using both mirroring and rotations (Table 4.4).
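The transformations themselves are easy to write down; a brief sketch follows, in which the rotation angle and the use of scipy's generic image rotation are assumptions consistent with, but not taken from, the text.

    import numpy as np
    from scipy.ndimage import rotate

    def mirror(img):
        # Mirror a 2-D view with respect to a vertical axis through the image centre.
        return img[:, ::-1]

    def virtual_views(img, angle=10.0):
        # Approximate invariance transformations: mirroring plus small image-plane rotations.
        return [mirror(img),
                rotate(img, angle, reshape=False, mode="nearest"),
                rotate(img, -angle, reshape=False, mode="nearest")]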
TABLE 4.5: Speed improvement using the Reduced Set method. The second through fourth columns give numbers of errors on the 10000 element MNIST test set for the original system, the Virtual Support Vector system, and the reduced set system (for the 10-class classifiers, the error is given in %). The last three columns give, for each system, the number of vectors whose dot product must be computed in the test phase.
Virtual SV Combined with Reduced Set. Apart from the increase in overall training time (by a factor of two, in our experiments), the VSV technique has the computational disadvantage that many of the Virtual Support Vectors become Support Vectors for the second machine, increasing the cost of evaluating the decision function (2.25). However, the latter problem can be solved with the Reduced Set (RS) method (Burges, 1996, see Appendix D.1.1), which reduces the complexity of the decision function representation by approximating it in terms of fewer vectors. In a study combining the VSV and RS methods, we achieved a factor of fifty speedup in test phase over the Virtual Support Vector machine, with only a small decrease in performance (Burges and Scholkopf, 1997). We next briefly report the results of this study. The RS results reported were obtained by Chris Burges.
As a starting point for the RS computation, we used the VSV1 machine (Table 4.2), which achieved 1.0% error rate on the 10000 element MNIST test set.8
8 At the time when the described study was carried out, VSV1 was our best system; VSV2 was
not available yet.
In the experiments exploring the invariant hyperplane method (Sec. 4.2.2), we used the small MNIST database (Appendix C). We start by giving some baseline classification results.
Using a standard linear SV machine (i.e. a separating hyperplane, Sec. 2.1.3), we obtain a test error rate of 9.8%; by using a polynomial kernel of degree 4, this drops to 4.0%. In all of the following experiments, we use degree 4 kernels of various types. The number 4 was chosen as it can be written as a product of two integers, thus we could compare results to a kernel k_p^{d1,d2} with d1 = d2 = 2 (cf. sections 4.3 and 4.4.3). For the considered classification task, results for higher polynomial degrees are very similar.
In a series of experiments with a homogeneous polynomial kernel k(x, y) = (x · y)^4, using preprocessing with Gaussian smoothing kernels of standard deviation in the range 0.1, 0.2, ..., 1.0, we obtained error rates which gradually increased from 4.0% to 4.3%. We concluded that no improvement of the original 4.0% performance was possible by a simple smoothing operation.
TABLE 4.6: Error rates (in %) on the small MNIST database for the invariant hyperplane preprocessing of Fig. 4.6, for various values of λ:

    λ                0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
    error rate in %  4.2  3.8  3.6  3.6  3.7  3.8  3.8  3.9  3.9  4.0
FIGURE 4.6: The first pattern in the small MNIST database, preprocessed with B = C_λ^{-1/2} (cf. equations (4.9)-(4.10)), enforcing various amounts of invariance. Top row: λ = 0.1, 0.2, 0.3, 0.4; bottom row: λ = 0.5, 0.6, 0.7, 0.8. For some values of λ, the preprocessing resembles a smoothing operation; however, it leads to higher classification accuracies (see Sec. 4.4.2) than the latter.
In these experiments, the patterns were first rescaled to have entries in [0, 1], then B was computed, using horizontal and vertical translations, and preprocessing was carried out; finally, the resulting patterns were scaled back again (for snapshots of the resulting patterns, see Fig. 4.6). The scaling was done to ensure that patterns and derivatives lie in comparable regions of R^N (note that if the pattern background level is a constant −1, then its derivative is 0). The results show that even though (4.6) was derived for the linear case, it leads to improvements in the nonlinear case (here, for a degree 4 polynomial).
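A minimal sketch of the linear preprocessing step, assuming the regularized form (1 − λ)C + λI for the matrix whose inverse square root is applied (the function names and the omission of the [0, 1] rescaling are simplifications of my own):

    import numpy as np

    def inv_sqrt(M):
        # Inverse square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    def invariance_preprocess(X, tangents, lam=0.4):
        # tangents: one finite-difference vector per pattern and transformation.
        C = tangents.T @ tangents / len(tangents)
        C_lam = (1.0 - lam) * C + lam * np.eye(C.shape[0])
        B = inv_sqrt(C_lam)
        return X @ B.T        # each pattern x is mapped to B x (B is symmetric)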
Dimensionality Reduction. The above [0, 1] scaling operation is affine rather than linear, hence the argument leading to (4.14) does not hold for this case. We thus only report results on dimensionality reduction for the case where the data is kept in [0, 1] scaling during the whole procedure. Dropping principal components which are less important leads to substantial improvements (Table 4.7; cf. the explanation following (4.14)).
The results in Table 4.7 are somewhat distorted by the fact that the polynomial kernel is not translation invariant, and performs poorly when none of the principal components are discarded. Hence this result should not be compared to the performance of the polynomial kernel on the data in [−1, 1] scaling. (Recall that we obtained 3.6% in that case, for λ = 0.4.) In practice, of course, we may choose the scaling of the data as we like, in which case it would seem pointless to use a method which is only applicable for a rather disadvantageous representation of the data. However, nothing prevents us from using a translation invariant kernel. We opted for a radial basis function kernel (2.27) with c = 0.5. On the [−1, 1] data, for λ = 0.4, this leads to the same performance as the degree 4 polynomial, 3.6% (without invariance preprocessing, i.e. for λ = 1, the performance is 3.9%). To get the identical system on [0, 1] data, the RBF width was rescaled accordingly, to c = 0.125. Table 4.8 shows that discarding principal components can further improve performance, up to 3.3%.
TABLE 4.7: Dropping directions corresponding to small Eigenvalues of C, i.e. dropping less important principal components (cf. (4.14)), leads to substantial improvements. All results given are for the case λ = 0.4 (cf. Table 4.6); degree 4 homogeneous polynomial kernel.
TABLE 4.8: Dropping directions corresponding to small Eigenvalues of C, i.e. dropping less important principal components (cf. (4.14)), for the translation invariant RBF kernel (see text). All results given are for the case λ = 0.4 (cf. Table 4.6).
For the experiments with kernels utilizing local correlations, we again used the small MNIST database (Appendix C). As a reference result, we use the degree 4 polynomial SV machine, performing at 4.0% error (Sec. 4.4.2). To exploit locality in images, we used the pyramidal receptive field kernel k_p^{d1,d2} with diameter p = 9 (cf. Sec. 4.3) and d1·d2 = 4, i.e. degree 4 polynomial kernels which do not use all products of 4 pixels (Sec. 4.2.2). For d1 = d2 = 2, we obtained an improved error rate of 3.1%; another degree 4 kernel with only local correlations (d1 = 4, d2 = 1) led to 3.4% (Table 4.9).
Albeit better than the 4.0% for the degree 4 homogeneous polynomial, this is still worse than the Virtual SV result: generating Virtual SVs by image translations, the latter led to 2.8%. As the two methods, however, exploit different types of prior knowledge, it could be expected that combining them leads to still better performance; and indeed, this yielded the best performance of all (2.0%), halving the error rate of the original system.
For the purpose of benchmarking, we also ran our system on the USPS database. In that case, we obtained the following test error rates: SV with degree 4 polynomial kernel 4.2% (Table 2.4), Virtual SV (same kernel) 3.5%, SV with k_7^{2,2} 3.6% (for the smaller USPS images, we used a k_7 kernel rather than k_9), Virtual SV with k_7^{2,2} 3.0%. The latter compares favourably to almost all known results on that database, and is second only to a memory-based tangent-distance nearest neighbour classifier at 2.6% (Simard, LeCun, and Denker, 1993).
TABLE 4.9: Summary: error rates for various methods of incorporating prior knowledge, on the small MNIST database (Appendix C). In all cases, degree 4 polynomial kernels were used, either of the local type (Sec. 4.3), or (by default) of the complete polynomial type (2.26). In all cases, the constant of (2.19) was set to 10.

    Classifier                                               Test Error / %
    SV                                                            4.0
    Virtual SV (Sec. 4.2.1), with translations                    2.8
    Invariant hyperplane (Sec. 4.2.2), λ = 0.4                    3.6
    same, on first 100 principal components (Table 4.7)           3.7
    semi-local kernel k_9^{2,2} (Sec. 4.4.3)                      3.1
    purely local kernel k_9^{4,1} (Sec. 4.4.3)                    3.4
    Virtual SV with k_9^{2,2}                                     2.0
Object Recognition. The above results have been confirmed on the two object recognition databases used in Sec. 2.2.1 (cf. Appendix A). As in the case of the small MNIST database, we used k_9^{d1,d2}. In the present case, we chose d1 = d2 = 3, which yields a degree 9 (= 3 × 3) polynomial classifier which differs from a standard polynomial (2.26) in that it does not utilize all products of 9 pixels, but mainly local ones. Comparing the results to those obtained with standard polynomials of equal degree shows that this pre-selection of useful features significantly improves recognition results (Table 4.10).
As in the case of digit recognition, we combined this method with the Virtual SV method (Sec. 4.2.1). Since prior knowledge about image locality is different from prior knowledge about invariances, we expected further improvements to be possible. We used the same types of Virtual SVs as in Sec. 4.4.1. The results (Table 4.11) further improve upon Table 4.10, confirming the digit recognition findings reported above. In 4 of 6 cases, the resulting classifiers are better than those of Table 4.4.9
9 Note that in Table 4.4, the VSV method was used for degree 20 kernels, which on the object recognition tasks do far better than degree 9, cf. Table 2.1.
TABLE 4.10: Test error rates for two object recognition tasks, comparing kernels local in the image to complete polynomial kernels. Local kernels of degree 9 outperform complete polynomial kernels of corresponding degree. Moreover, they performed at least as well as the best polynomial classifier out of all degrees in {1, 3, 6, 9, 12, 15, 20, 25} (cf. Table 2.1).
    entry level:        polynomial kernel   local kernel k_9^{3,3}
    25 grey scale              13.0                 12.0
    89 grey scale               1.8                  1.8
    100 grey scale              2.4                  2.0
    25 silhouettes             15.4                 15.0
    89 silhouettes              2.2                  2.1
    100 silhouettes             4.0                  3.9

    animals:       degree 9 polynomial   best polynomial   local kernel k_9^{3,3}
    25 grey scale         14.8                 13.0                12.0
    89 grey scale          2.5                  1.7                 1.6
    100 grey scale         5.2                  4.4                 4.0
    25 silhouettes        17.0                 15.6                15.2
    89 silhouettes         2.8                  2.2                 2.0
    100 silhouettes        6.3                  5.2                 4.9
TABLE 4.11: Test error rates for two object recognition databases, using different types of approximate invariance transformations to generate Virtual SVs (as in Table 4.4), and local polynomial kernels k_9^{3,3} of degree 9 (cf. Sec. 4.3, Table 4.10, Table 4.4, and Table 2.1). The second training run in the Virtual SV systems was done on the original SVs and the generated Virtual SVs. The training sets with 25 and 89 views per object are regularly spaced; for them, mirroring does not provide additional information. For the non-regularly spaced 100-view-per-object sets, a combination of virtual SVs from mirroring and rotation substantially improves accuracies on both databases.
    database:                        animal              entry level
    training set: views per object   25    89   100      25    89   100
    Virtual SVs
    none (orig. system)            12.0   1.6   4.0    12.0   1.8   2.0
    mirroring                      12.5   1.7   4.6    13.1   2.9   3.3
    rotations & mirroring           8.8   1.0   1.4     8.5   1.2   1.6
4.5 Discussion
For Support Vector learning machines, invariances can readily be incorporated by generating virtual examples from the Support Vectors, rather than from the whole training set. The method yields a significant gain in classification accuracy at a moderate cost in time: it requires two training runs (rather than one), and it constructs classification rules utilizing more Support Vectors, thus slowing down classification speed (cf. (2.25)); in our case, both points amounted to a factor of about 2. Given that Support Vector machines are known to allow for short training times (Bottou et al., 1994), the first point is usually not critical. Certainly, training on virtual examples generated from the whole database would be significantly slower. To compensate for the second point, we used the reduced set method of Burges (1996) for increasing speed. This way, we obtained a system which was both fast and accurate.
As an alternative approach, we have built known invariances directly into the SVM objective function via the choice of a kernel. With its rather general class of admissible kernel functions, the SV algorithm provides ample possibilities for constructing task-specific kernels. We have considered two forms of domain knowledge: first, pattern classes were required to be locally translationally invariant, and second, local correlations in the images were assumed to be more reliable than long-range correlations. The second requirement can be seen as a more general form of prior knowledge: it can be thought of as arising partially from the fact that patterns possess a whole variety of transformations; in object recognition, for instance, we have object rotations and deformations. Mostly, these transformations are continuous, which implies that local relationships in an image are fairly stable, whereas global relationships are less reliable.
Both types of domain knowledge led to improvements on the considered pattern recognition tasks.
The method for constructing kernels for transformation invariant SV machines (invariant hyperplanes), put forward to deal with the first type of domain knowledge, has so far only been applied in the linear case, which probably explains why it only led to moderate improvements, especially when compared with the large gains achieved by the Virtual SV method. It is applicable to differentiable transformations; other types, e.g. mirror symmetry, have to be dealt with using other techniques, such as the Virtual Support Vector method. Its main advantages compared to the latter technique are that it does not slow down testing speed, and that using more invariances leaves training time almost unchanged. In addition, it is more attractive from a theoretical point of view, establishing a surprising connection to invariant feature extraction, preprocessing, and principal component analysis.
The proposed kernels respecting locality in images, on the other hand, led to large improvements; they are applicable not only in image classification but in all cases where the relative importance of subsets of product features can be specified appropriately. They do, however, slow down both training and testing by a constant factor which depends on the cost of evaluating the specific kernel used.
Clearly, SV machines have not yet been developed to their full potential, which could explain the fact that our highest accuracies are still slightly worse than the record on the MNIST database. However, SVMs present clear opportunities for further improvement. More invariances (for example, for the pattern recognition case, small rotations, or varying ink thickness) could be incorporated, possibly combined with techniques for dealing with optimization problems involving very large numbers of SVs (Osuna, Freund, and Girosi, 1997). Further, one might use only those Virtual Support Vectors which provide new information about the decision boundary, or use a
measure of such information to keep only the most important vectors. Finally, if local kernels (Sec. 4.3) prove to be as useful on the full MNIST database as they were on the small version of it, accuracies could be substantially increased, albeit at a cost in classification speed.
We conclude this chapter by noting that all three described techniques should be directly applicable to other kernel-based methods such as SV regression (Vapnik, 1995b) and kernel PCA (Chapter 3). Future work will include the nonlinear Tangent Covariance Matrix (cf. our considerations in Sec. 4.2.2), the incorporation of invariances other than translation, and the construction of kernels incorporating local feature extractors (e.g. edge detectors) different from the pyramids described in Sec. 4.3.
Chapter 5
Conclusion
We believe that Support Vector machines and Kernel Principal Component Analysis are only the first examples of a series of potential applications of Mercer-kernel-based methods in learning theory. Any algorithm which can be formulated solely in terms of dot products can be made nonlinear by carrying it out in feature spaces induced by Mercer kernels. However, the above two fields alone are large enough to render an exhaustive discussion in this thesis infeasible. Thus, we have tried to focus on some aspects of SV learning and Kernel PCA, hoping that we have succeeded in illustrating how nonlinear feature spaces can beneficially be used in complex learning tasks.
On the Support Vector side, we presented two chapters. Apart from a tutorial introduction to the theory of SV learning, the first one focused on empirical results related to the accuracy and the Support Vector sets of different SV classifiers. Considering three well-known classifier types which are included in the SV approach as special cases, we showed that they lead to similarly high accuracies and construct their decision surface from almost the same Support Vectors. Our first question raised in the Preface was which of the observations should be used to construct the decision boundary. Against the backdrop of our empirical findings, we can now take the position that the Support Vectors, if constructed in an appropriate nonlinear feature space, constitute such a subset of observations. The second SV chapter focused on algorithms and empirical results for the incorporation of prior knowledge in SV machines. We showed that this can be done both by modifying kernels and by generating Virtual examples from the set of Support Vectors. In view of the high performances obtained, we can reinforce and generalize the above answer, to include also Virtual Support Vectors, and specialize it, saying that the appropriate feature space should be constructed using prior knowledge of the task at hand. Our best performing systems used both methods simultaneously, Virtual Support Vectors and kernels incorporating prior knowledge about the local structure of images.
On Kernel Principal Component Analysis, we presented one chapter, describing the algorithm and giving first experimental results on feature extraction for pattern recognition. We saw that features extracted in nonlinear feature spaces led to recognition performances much higher than those extracted in input space (i.e. with traditional PCA). This lends itself to an answer to the second question raised in the
Preface: which features should be extracted from each observation? From our present point of view, these should be nonlinear Kernel PCA features. As Kernel PCA operates in the same types of feature spaces as Support Vector machines, the choice of the kernel, and the design of kernels to incorporate prior knowledge, should also be of importance here. As the Kernel PCA method is very recent, however, these questions have not been thoroughly investigated yet. We hope that, given a few years' time, we will be in a position to specialize our answer to the second question exactly as it was done for the first one.
We conclude with an outlook, revisiting the question of visual processing in biological systems. If the Support Vector set should prove to be a characteristic of the data largely independent of the type of learning machine used (which we have shown for three types of learning machines), one would hope that it could also be of relevance in biological learning. If a subset of observations characterizes a task rather than a particular algorithm's favourite examples, there is reason to hope that every system trying to solve this task, in particular animals, should make use of this subset in one way or another. Regarding Kernel PCA, it would be interesting to study the types of feature extractors that Kernel PCA constructs when performed on collections of images resembling those that animals are naturally exposed to. Comparing those with the ones found in neurophysiological studies could potentially assist us in trying to understand natural visual systems. If applied to the same data, and similar tasks, optimal machine learning algorithms could be as fruitful to biological thinking as biological solutions can be to engineering.
Appendix A
Object Databases
In this section, we briefly describe three object recognition databases (chairs, entry level objects, and animals) generated at the Max-Planck-Institut für biologische Kybernetik (Liter et al., 1997). We start by describing the procedure for creating the databases, and then show some images of the resulting patterns.
The training and test data was generated according to the following procedure (Blanz et al., 1996; Liter et al., 1997):
Database Generation
Snapshot Sampling. 25 different object models with uniform grey surface were rendered.
Centering. The resulting grey level pictures were centered with respect to the center of mass of the binarized image. As the objects were shown on a white background, the binarized image separates figure from ground.
Edge Detection. Four one-dimensional differential operators (vertical, horizontal, and two diagonal ones) were applied to the images, followed by taking the modulus.
1 In one case, we also generated a set with 400 random views per object.
1280 pixels.
Containing edge detection data, the parts r1, ..., r4 already provide useful features for recognition algorithms. To study the ability of an algorithm to extract features by itself, one can alternatively use only the actual image part r0 of the data, and thus train on the 256-dimensional downsampled images rather than on the 1280-dimensional inputs. In our experiments, we used both variants of the databases.
Standardization. On the chair database, the standard deviation of the 16 × 16 images with pixel values in [0, 1] was around 30 (measured on the training sets). We rescaled all databases, separately for each part r0, ..., r4, such that each part separately gives rise to training sets with standard deviation 30. This hardly affects the r0 part; however, it does change the edge detection parts r1, ..., r4. In the resulting 5 × 256-dimensional representation, the different parts arising from edge detection, or just downsampling, have comparable scaling.
Pixel Rescaling. Before we ran the algorithms on the databases, each pixel value x was rescaled according to x ↦ 2x − 1. Thus, the background level was −1, and maximal intensities were about 1.
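The standardization and rescaling steps are straightforward to express in code; a minimal sketch, with function names and array layout chosen for illustration only:

    import numpy as np

    def standardize(parts_train, parts, target_std=30.0):
        # Rescale each representation part r0..r4 so that, on the training set,
        # it has standard deviation target_std.
        return [p * (target_std / pt.std()) for pt, p in zip(parts_train, parts)]

    def rescale_pixels(x):
        # x -> 2x - 1: background level becomes -1, maximal intensities about 1.
        return 2.0 * x - 1.0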
Databases
Using the above procedure, three object recognition databases were generated.
MPI Chair Database. The first object recognition database contains 25 different chairs (figures A.2, A.3, A.4). For benchmarking purposes, the downsampled views are available via ftp://ftp.mpik-tueb.mpg.de/pub/chair dataset/. As all 25 objects belong to the same object category, recognition of chairs in the database is a subordinate level task.
MPI Entry Level Database. The entry level database contains 25 objects (figures A.5, A.6, A.7), for which psychophysical evidence suggests that they belong to different entry levels in object recognition (cf. Sec. 2.2.1).
MPI Animal Database. The animal database contains 25 different animals (figures A.8, A.9, A.10). Note that some of these animals are also contained in the entry level database (Fig. A.5).
Appendix C
USPS Database. The USPS database contains 9298 handwritten digits (7291 for training, 2007 for testing), collected from mail envelopes in Buffalo (cf. LeCun et al., 1989). Each digit is a 16 × 16 image, represented as a 256-dimensional vector with entries between −1 and 1. Preprocessing consisted of smoothing with a Gaussian kernel of width σ = 0.75.
It is known that the USPS test set is rather difficult; the human error rate is 2.5% (Bromley and Sackinger, 1991). For a discussion, see (Simard, LeCun, and Denker, 1993). Note, moreover, that some of the results reported in the literature for the USPS set have been obtained with an enhanced training set: for instance, Drucker, Schapire, and Simard (1993) used an enlarged training set of size 9709, containing some additional machine-printed digits, and note that this improves accuracies. In our experiments, only 7291 training examples were used.
fe
MNIST Database. The MNIST database (Fig. C.2) contains 120000 handwritten
Re
digits, equally divided into training and test set. The database is a modied version
of NIST Special Database 3 and NIST Test Data 1. Training and test set consist of
patterns generated by dierent writers. The images were rst size normalized to t
into a 20 20 pixel box, and then centered in a 28 28 image (Bottou et al., 1994).
Test results on the MNIST database which are given in the literature (e.g. Bottou
et al., 1994; LeCun et al., 1995) for some reason do not use the full MNIST test set
of 60000 characters. Instead, a subset of 10000 characters is used, consisting of the
test set patterns from 24476 to 34475. To obtain results which can be compared to
the literature, we also use this test set, although the larger one is preferable from the
point of view of obtaining more reliable test error estimates.
Small MNIST Database. The USPS database has been criticised (Burges, LeCun, private communication; Bottou et al. (1994)) as not providing the most adequate classifier benchmark. First, it only comes with a small test set, and second, the test set contains a number of corrupted patterns which not even humans can classify correctly. The MNIST database, which is the currently used classifier benchmark in the AT&T and Bell Labs learning research groups, does not have these drawbacks.
Appendix D
Technical Addenda
D.1 Feature Space and Kernels
In this section, we collect some material related to Mercer kernels and the corresponding feature spaces. If not stated otherwise, we assume that k is a Mercer kernel (cf. Proposition 1.3.2), and Φ is the corresponding map into a feature space F such that k(x, y) = (Φ(x) · Φ(y)).
Given a vector Ψ in F expanded in terms of mapped input patterns,

    Ψ = Σ_{i=1}^{N_x} α_i Φ(x_i),    (D.1)

the Reduced Set method seeks an approximation in terms of fewer vectors,

    Ψ' = Σ_{i=1}^{N_z} β_i Φ(z_i),    (D.2)

with N_z < N_x, by minimizing the discrepancy

    ρ = ‖Ψ − Ψ'‖².    (D.3)

The crucial point is that even if Φ is not given explicitly, ρ can be computed (and minimized) in terms of kernels, using (Φ(x) · Φ(y)) = k(x, y) (Burges, 1996).
In Sec. 4.4.1, this method is used to approximate Support Vector decision boundaries in order to speed up classification.
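The following sketch shows how the discrepancy can be evaluated and minimized purely through kernel evaluations, for a single reduced-set vector. The Gaussian kernel, the generic optimizer and the initialization are assumptions made for illustration; they are not the procedure used in the text.

    import numpy as np
    from scipy.optimize import minimize

    def rbf(a, b, gamma=0.5):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def reduced_set_single(X_sv, alpha, gamma=0.5):
        # Approximate Psi = sum_i alpha_i Phi(x_i) by beta * Phi(z),
        # minimizing ||Psi - beta Phi(z)||^2 expressed through kernels.
        K = np.array([[rbf(a, b, gamma) for b in X_sv] for a in X_sv])
        const = alpha @ K @ alpha                 # ||Psi||^2, independent of z and beta

        def rho(params):
            z, beta = params[:-1], params[-1]
            k_z = np.array([rbf(x, z, gamma) for x in X_sv])
            return const - 2.0 * beta * alpha @ k_z + beta ** 2   # k(z, z) = 1 for RBF

        z0 = np.append(X_sv[np.argmax(np.abs(alpha))], 1.0)       # start from the largest-|alpha| SV
        res = minimize(rho, z0, method="L-BFGS-B")
        return res.x[:-1], res.x[-1]

Repeating the procedure on the remaining discrepancy yields further reduced-set vectors.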
we need not expect that there is a pre-image under Φ for each vector that can be expressed as a linear combination of the vectors Φ(x_1), ..., Φ(x_ℓ). Nevertheless, it might be desirable to have a means of constructing the pre-image in the case where it does exist.
To this end, suppose we have a vector in F given in terms of an expansion of images of input data, with an unknown pre-image x_0 under Φ in input space R^N, i.e.

    Φ(x_0) = Σ_{j=1}^{ℓ} α_j Φ(x_j).    (D.4)

Then, for any x in R^N,

    k(x_0, x) = Σ_{j=1}^{ℓ} α_j k(x_j, x).    (D.5)

Suppose, moreover, that the kernel can be written as k(x, y) = f_k(x · y) with an invertible function f_k, e.g. k(x, y) = (x · y)^d with odd d, or k(x, y) = σ((x · y) + Θ) with a strictly monotonic sigmoid function σ and a threshold Θ. Given any a priori chosen basis of input space {e_1, ..., e_N}, we can then expand x_0 as

    x_0 = Σ_{i=1}^{N} (x_0 · e_i) e_i = Σ_{i=1}^{N} f_k^{-1}(k(x_0, e_i)) e_i    (D.6)
        = Σ_{i=1}^{N} f_k^{-1}( Σ_{j=1}^{ℓ} α_j k(x_j, e_i) ) e_i.    (D.7)
By using (D.5), we thus reconstructed x0 from the values of dot products between
images (in F ) of training examples and basis elements.
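A minimal sketch of (D.7), using a cubic kernel as an example of an invertible f_k; the helper names and the trivial one-term expansion used as a check are my own choices.

    import numpy as np

    def preimage(alpha, X, f_inv, k, basis=None):
        # Recover x0 from expansion coefficients alpha and patterns X (rows x_j),
        # assuming k(x, y) = f_k(x . y) with invertible f_k.
        N = X.shape[1]
        E = np.eye(N) if basis is None else basis     # basis e_1, ..., e_N
        coords = [f_inv(sum(a * k(xj, e) for a, xj in zip(alpha, X))) for e in E]
        return np.array(coords) @ E                   # x0 = sum_i f_k^{-1}(...) e_i

    k = lambda x, y: float(np.dot(x, y)) ** 3         # f_k(t) = t^3, invertible
    f_inv = lambda t: np.sign(t) * np.abs(t) ** (1.0 / 3.0)

    x0 = np.array([0.5, -1.0, 2.0])
    X = np.vstack([x0])                               # trivial expansion Phi(x0) = 1 * Phi(x0)
    print(preimage(np.array([1.0]), X, f_inv, k))     # recovers approximately [0.5, -1.0, 2.0]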
Clearly, a crucial assumption in this construction was the existence of the pre-image x_0. If this does not hold, then the discrepancy

    Σ_{j=1}^{ℓ} α_j Φ(x_j) − Φ(x_0)    (D.8)

will be nonzero. There are a number of things that we could do to make the discrepancy small:
(a) We can try to find a suitable basis in which we expand the pre-images.
(b) We can repeat the scheme, by trying to find a pre-image for the discrepancy vector. This problem has precisely the same structure as the original one (D.4), with one more term in the summation on the right hand side. Iterating this method gives an expansion of the vector in F in terms of reconstructed approximate pre-images.
(c) We have the freedom to choose the scaling of the vector in F. To see this, note that for any nonzero λ, we have, similar to (D.7),

    x_0 = λ Σ_{i=1}^{N} (x_0 · e_i/λ) e_i = λ Σ_{i=1}^{N} f_k^{-1}( Σ_{j=1}^{ℓ} α_j k(x_j, e_i/λ) ) e_i.    (D.9)

(d) Related to this scaling issue, we could also have started with

    Φ(x_0) = λ Σ_{j=1}^{ℓ} α_j Φ(x_j),    (D.10)

in which case

    x_0 = Σ_{i=1}^{N} f_k^{-1}( λ Σ_{j=1}^{ℓ} α_j k(x_j, e_i) ) e_i.    (D.11)
In the latter case, the discrepancy to be made small reads

    ‖ λ Σ_{j=1}^{ℓ} α_j Φ(x_j) − Φ(x_0) ‖².    (D.12)
For polynomial kernels, the Mercer expansion takes the form

    (x · y)^d = Σ_{i=1}^{∞} λ_i ψ_i(x) ψ_i(y).    (D.16)
To this end, let Φ_0 be a map from input space into some Hilbert space H,

    Φ_0 : x ↦ f_x,    (D.19)

and define, for a positive bounded operator T on H, the kernel

    k_T(x, y) := (f_x · T f_y).    (D.20)

By the spectral theorem, T can be written as

    T = U* M_v U,    (D.23)

with a unitary map U into an L2 space and a multiplication operator M_v (multiplication by a non-negative function v). Hence

    k_T(x, y) = (f_x · U* M_v U f_y)    (D.25)
              = (U f_x · M_v U f_y)    (D.26)
              = (√(M_v) U f_x · √(M_v) U f_y)    (D.27)
              = (M_{√v} U f_x · M_{√v} U f_y)    (D.28)
              = (Φ(x) · Φ(y)),    (D.29)

defining

    Φ : R^N → L2(R, μ),    (D.30)
    Φ(x) = M_{√v} U f_x.    (D.31)
To see the relationship to (1.26), it should be noted that the spectrum of T coincides with the essential range of v.
For simplicity, we have above made the assumption that T is bounded. The same argument, however, also works in the case of unbounded T (e.g. Reed and Simon, 1980).
For the purpose of practical applications, we are interested in maps Φ_0 and operators T ≥ 0 such that the kernel k defined by (D.20) can be computed analytically.
Without going into detail, we briefly mention an example of a map Φ_0. Define

    Φ_0 : x ↦ k(x, ·).    (D.32)
Consider the mappings corresponding to kernels of the form (1.20): suppose the monomials x_{i1} x_{i2} ... x_{id} are written such that i1 ≤ i2 ≤ ... ≤ id. Then the coefficients (such as the √2 in (1.21)), arising from the fact that different combinations of indices occur with different frequencies, are largest for i1 < i2 < ... < id (let us assume here that the input dimensionality N is not smaller than the polynomial degree d): in that case, we have a coefficient of √(d!). If i1 = i2, say, the coefficient will be √((d − 1)!). In general, if n of the x_i are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ is √((d − n + 1)!). Thus, the terms belonging to the d-th order correlations will be weighted with an extra factor √(d!) compared to the terms x_i^d, and compared to the terms where only d − 1 different components occur, they are still weighted stronger by √d. Consequently, kernel PCA with polynomial kernels will tend to pick up variance in the d-th order correlations mainly.
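To illustrate the weighting, consider the smallest nontrivial case, d = 2 and N = 2:

    (x · y)² = x_1² y_1² + 2 x_1 x_2 y_1 y_2 + x_2² y_2² = (Φ(x) · Φ(y)),  with  Φ(x) = (x_1², √2 x_1 x_2, x_2²),

so the second-order correlation component x_1 x_2 carries the extra factor √2 = √(d!) relative to the components x_i².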
Being symmetric, K has an orthonormal basis of Eigenvectors (β_i)_i with corresponding Eigenvalues μ_i; thus for all i, we have K β_i = μ_i β_i (i = 1, ..., M). To understand the relation between (3.13) and (3.14), we proceed as follows: first suppose λ, α satisfy (3.13). We may expand α in K's Eigenvector basis as

    α = Σ_{i=1}^{M} a_i β_i.    (D.34)

Equation (3.13) then reads

    Mλ Σ_{i} a_i μ_i β_i = Σ_{i} a_i μ_i² β_i,    (D.35)

i.e., for all i,

    Mλ = μ_i  or  a_i μ_i = 0.    (D.37)

If, on the other hand, λ, α satisfy (3.14), the analogous argument yields, for all i,

    Mλ = μ_i  or  a_i = 0.    (D.39)

Comparing (D.37) and (D.39), we see that all solutions of the latter satisfy the former. However, they do not give its full set of solutions: given a solution of (3.14), we may always add multiples of Eigenvectors of K with Eigenvalue μ_i = 0 and still satisfy (3.13), with the same Eigenvalue.1 Note that this means that there exist solutions of (3.13) which belong to different Eigenvalues yet are not orthogonal in the space of the expansion coefficients (for instance, take any two Eigenvectors with different Eigenvalues, and add a multiple of the same Eigenvector with Eigenvalue 0 to both of them). This, however, does not mean that the Eigenvectors of C in F are not orthogonal. Indeed, note that if α is an Eigenvector of K with Eigenvalue 0, then the corresponding vector Σ_i α_i Φ(x_i) is orthogonal to all vectors in the span of the Φ(x_j) in F, since

    (Φ(x_j) · Σ_i α_i Φ(x_i)) = (K α)_j = 0  for all j,    (D.40)

which means that Σ_i α_i Φ(x_i) = 0. Thus, the above difference between the solutions of (3.13) and (3.14) is not relevant, since we are interested in vectors in F rather than vectors in the space of the expansion coefficients of (3.10). We therefore only need to diagonalize K in order to find all relevant solutions of (3.13).
Note, finally, that the rank of K determines the dimensionality of the span of the Φ(x_j) in F, i.e. of the subspace that we are working in.
1 This observation could be used to change the coefficient vectors of the solution, e.g. to make them maximally sparse, without changing the solution.
In Sec. 3.2, we made the assumption that our mapped data is centered in F, i.e.

    Σ_{n=1}^{M} Φ(x_n) = 0.    (D.41)

We shall now drop this assumption. First note that given any Φ and any set of observations x_1, ..., x_M, the points

    Φ̃(x_i) := Φ(x_i) − (1/M) Σ_{n=1}^{M} Φ(x_n)    (D.42)

are centered. Thus, the assumptions of Sec. 3.2 now hold, and we go on to define the covariance matrix and K̃_ij = (Φ̃(x_i) · Φ̃(x_j)) in F. We arrive at our already familiar Eigenvalue problem

    λ̃ α̃ = K̃ α̃,    (D.43)

with α̃ being the expansion coefficients of an Eigenvector (in F) in terms of the points (D.42),

    Ṽ = Σ_{i=1}^{M} α̃_i Φ̃(x_i).    (D.44)
For feature extraction, we compute projections onto the Eigenvectors Ṽ^k: for a test point t,

    (Ṽ^k · Φ̃(t)) = Σ_{i=1}^{M} α̃^k_i (Φ̃(x_i) · Φ̃(t)).

Both K̃ and these projections can be evaluated in terms of the uncentered kernel matrix K_ij = k(x_i, x_j).
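Expanding (D.42) term by term gives the centered kernel matrix directly from the uncentered one; a minimal sketch (the function name is arbitrary):

    import numpy as np

    def center_kernel(K):
        # Centered kernel matrix corresponding to the points (D.42):
        # K~ = K - 1_M K - K 1_M + 1_M K 1_M, with (1_M)_ij = 1/M.
        M = K.shape[0]
        one = np.full((M, M), 1.0 / M)
        return K - one @ K - K @ one + one @ K @ one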
where the tangent covariance matrix C is given by

    C = (1/ℓ) Σ_{i=1}^{ℓ} ( ∂/∂t|_{t=0} L_t z_i ) ( ∂/∂t|_{t=0} L_t z_i )^⊤    (D.54)

(if we want to use more than one derivative operator, we also sum over these; in that case, we may want to orthonormalize the derivatives for each observation z_i). To solve the optimization problem, one introduces a Lagrangian

    L(w, b, α) = (1/2)(1 − λ)(w · C w) + (λ/2)‖w‖² − Σ_{i=1}^{ℓ} α_i (y_i((z_i · w) + b) − 1)    (D.55)

with Lagrange multipliers α_i. At the point of the solution, the gradient of L with respect to w must vanish:

    (1 − λ)C w + λ w − Σ_{i=1}^{ℓ} α_i y_i z_i = 0.    (D.56)

As the left hand side of (D.53) is non-negative for any w, C is a positive (not necessarily definite) matrix. It follows that, for λ > 0,

    C_λ := (1 − λ)C + λ I    (D.57)

is invertible, and hence

    w = Σ_{i=1}^{ℓ} α_i y_i C_λ^{-1} z_i,    (D.58)

leading to the decision function

    f(z) = sgn( Σ_{i=1}^{ℓ} α_i y_i (z · C_λ^{-1} z_i) + b ).    (D.59)
Substituting (D.58), and the fact that at the point of the solution the partial derivative of L with respect to b must vanish (Σ_{i=1}^{ℓ} α_i y_i = 0), into the Lagrangian (D.55), we get

    W(α) = (1/2) Σ_{i=1}^{ℓ} α_i y_i z_i^⊤ C_λ^{-1} C_λ C_λ^{-1} Σ_{j=1}^{ℓ} α_j y_j z_j − Σ_{i=1}^{ℓ} α_i y_i z_i^⊤ C_λ^{-1} Σ_{j=1}^{ℓ} α_j y_j z_j + Σ_{i=1}^{ℓ} α_i.    (D.60)
By virtue of the fact that C_λ, and thus also C_λ^{-1}, is symmetric, the dual form of the optimization problem takes the following form: maximize

    W(α) = Σ_{i=1}^{ℓ} α_i − (1/2) Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j (z_i · C_λ^{-1} z_j)    (D.61)

subject to

    α_i ≥ 0,  i = 1, ..., ℓ,    (D.62)
    Σ_{i=1}^{ℓ} α_i y_i = 0.    (D.63)

In the nonlinear case, the patterns z_i are the images of the input patterns under the feature map,

    x_i ↦ z_i = Φ(x_i).
Unfortunately, (D.59) and (D.61) are not simply written in terms of dot products between images of input patterns under Φ. Hence, substituting kernel functions for dot products will not do. Note, moreover, that C now is an operator in a possibly infinite-dimensional space, with C being defined as in (4.15). We cannot compute it explicitly, but we can nevertheless compute (D.59) and (D.61), which is all we need.
First note that for all x, y in R^N,

    (Φ(x) · C_λ^{-1} Φ(y)) = (C_λ^{-1/2} Φ(x) · C_λ^{-1/2} Φ(y)),    (D.64)

with C_λ^{-1/2} being the positive square root of C_λ^{-1}. At this point, methods similar to kernel PCA come to our rescue. As C_λ is symmetric, we may diagonalize it as

    C_λ = S D S^⊤,    (D.65)

hence

    C_λ^{-1} = S D^{-1} S^⊤.    (D.66)

Substituting (D.66) into (D.64), and using the fact that S is unitary, we obtain

    (Φ(x) · C_λ^{-1} Φ(y)) = (S D^{-1/2} S^⊤ Φ(x) · S D^{-1/2} S^⊤ Φ(y))    (D.67)
                           = (D^{-1/2} S^⊤ Φ(x) · D^{-1/2} S^⊤ Φ(y)).    (D.68)

This, however, is simply a dot product between kernel PCA feature vectors: S^⊤ Φ(x) computes projections onto Eigenvectors of C (i.e. features), and D^{-1/2} rescales them. Note that we have thus again arrived at the nonlinear tangent covariance matrix of Sec. 4.2.2; this time, however, the approach was motivated solely by constructing invariant hyperplanes in feature space, and the nonlinear feature extraction by the tangent covariance matrix is a mere by-product.
To carry out kernel PCA on C_λ, we essentially have to go through the analysis of kernel PCA using C_λ instead of the covariance matrix of the mapped data in F. The modifications arising from the fact that we are dealing with tangent vectors were already described in Sec. 4.2.2; hence, we shall presently only sketch the additional modifications for λ > 0: here, we are looking for solutions of the Eigenvalue equation μ V = C_λ V with μ > λ (let us assume that λ < 1, otherwise all Eigenvalues are identical to λ, the minimal Eigenvalue).2 These lie in the span of the tangent vectors. In complete analogy to (3.14), we then arrive at

    μ α = ((1 − λ)K + λ I) α,    (D.69)

and the normalization condition for the coefficients α^k of the k-th Eigenvector reads

    1 = ((μ_k − λ)/(1 − λ)) (α^k · α^k),    (D.70)

where the μ_k > λ are the Eigenvalues of (1 − λ)K + λI. Feature extraction is carried out as in (4.25).
We conclude by noting an essential difference to the approach of (4.11), which we believe is an advantage of the present method: in (4.11), the pattern preprocessing was assumed to be linear. In the present method, the goal of obtaining invariant hyperplanes in feature space naturally led to a nonlinear preprocessing operation.
2 If we want λ I also to have an effect outside of the span of the tangent vectors, we have to modify the set in which we expand our solutions.
Bibliography
Y. S. Abu-Mostafa. Hints. Neural Computation, 7(4):639{671, 1995.
M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential
function method in pattern recognition learning. Automation and Remote Control,
25:821 { 837, 1964.
S. Amari, N. Murata, K.-R. Muller, M. Finke, and H. Yang. Asymptotic statistical theory of overtraining and cross-validation. IEEE Trans. on Neural Networks, 8(5), 1997.
H. Baird. Document image defect models. In Proceedings, IAPR Workshop on Syntactic and Structural Pattern Recognition, pages 38-46, Murray Hill, NJ, 1990.
J. Bromley and E. Sackinger. Neural-network and k-nearest-neighbor classifiers. Technical Report 11359-910819-16TM, AT&T, 1991.
C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proceedings, 13th Intl. Conf. on Machine Learning, pages 71-77, San Mateo, CA, 1996. Morgan Kaufmann.
C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector
learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in
Neural Information Processing Systems 9, pages 375{381, Cambridge, MA, 1997.
MIT Press.
E. I. Chang and R. L. Lippmann. A boundary hunting radial basis function classifier which allocates centers constructively. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, San Mateo, CA, 1993. Morgan Kaufmann.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
N. Dunford and J. T. Schwartz. Linear Operators Part II: Spectral Theory, Self Adjoint
Operators in Hilbert Space. Number VII in Pure and Applied Mathematics. John
Wiley & Sons, New York, 1963.
F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
I. Guyon, B. Boser, and V. Vapnik. Automatic capacity tuning of very large VC-dimension classifiers. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 147-155. Morgan Kaufmann, San Mateo, CA, 1993.
I. Guyon, N. Matic, and V. Vapnik. Discovering informative patterns and data cleaning. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181-203. MIT Press, Cambridge, MA, 1996.
J. B. Hampshire and A. Waibel. A novel objective function for improved phoneme recognition using time-delay neural networks. IEEE Trans. Neural Networks, 1:216-228, 1990.
W. Hardle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1990.
T. Hastie and W. Stuetzle. Principal curves. JASA, 84:502-516, 1989.
K.-R. Muller, A. Smola, G. Ratsch, B. Scholkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In ICANN'97, page 999. Springer
Lecture Notes in Computer Science, 1997.
T. Vetter and N. Troje. Separation of texture and shape in images of faces for image
coding and synthesis. Journal of the Optical Society of America, in press, 1997.
G. Wahba. Convergence rates of certain approximate solutions to Fredholm integral equations of the first kind. Journal of Approximation Theory, 7:167-185, 1973.
C. K. I. Williams. Prediction with Gaussian processes. Preprint, 1997.
A. Yuille and N. Grzywacz. The motion coherence theory. In Proceedings of the International Conference on Computer Vision, pages 344-354, Washington, D.C., 1988. IEEE Computer Society Press.
REF [5]
Scikit-learn: Machine Learning in Python
Fabian Pedregosa
Gael Varoquaux
Alexandre Gramfort
Vincent Michel
Bertrand Thirion
Olivier Grisel
Nuxeo
20 rue Soleillet
75 020 Paris France
Mathieu Blondel
mblondel@ai.cs.kobe-u.ac.jp
Kobe University
1-1 Rokkodai, Nada
Kobe 657-8501 Japan
Peter Prettenhofer
Bauhaus-Universitat Weimar
Bauhausstr. 11
99421 Weimar Germany
Vincent Dubourg
Ron Weiss
Google Inc
76 Ninth Avenue
New York, NY 10011 USA
Jake Vanderplas
Astronomy Department
University of Washington, Box 351580
Seattle, WA 98195 USA
Alexandre Passos
IESL Lab
UMass Amherst
Amherst MA 01002 USA
David Cournapeau
Enthought
21 J.J. Thompson Avenue
Cambridge, CB3 0FA UK
(c) 2011 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot and Edouard Duchesnay.
Matthieu Brucher
Matthieu Perrot
Edouard Duchesnay
LNAO
Neurospin, Bat 145, CEA Saclay
91191 Gif sur Yvette France
Abstract
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is
put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic
and commercial settings. Source code, binaries, and documentation can be downloaded from
http://scikit-learn.sourceforge.net.
Keywords: Python, supervised learning, unsupervised learning, model selection

1. Introduction
The Python programming language is establishing itself as one of the most popular languages for
scientific computing. Thanks to its high-level interactive nature and its maturing ecosystem of scientific libraries, it is an appealing choice for algorithmic development and exploratory data analysis
(Dubois, 2007; Milmann and Avaizis, 2011). Yet, as a general-purpose language, it is increasingly
used not only in academic settings but also in industry.
Scikit-learn harnesses this rich environment to provide state-of-the-art implementations of many
well known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language. This answers the growing need for statistical data analysis by
non-specialists in the software and web industries, as well as in fields outside of computer-science,
such as biology or physics. Scikit-learn differs from other machine learning toolboxes in Python
for various reasons: i) it is distributed under the BSD license ii) it incorporates compiled code for
efficiency, unlike MDP (Zito et al., 2008) and pybrain (Schaul et al., 2010), iii) it depends only on
numpy and scipy to facilitate easy distribution, unlike pymvpa (Hanke et al., 2009) that has optional
dependencies such as R and shogun, and iv) it focuses on imperative programming, unlike pybrain
which uses a data-flow framework. While the package is mostly written in Python, it incorporates
the C++ libraries LibSVM (Chang and Lin, 2001) and LibLinear (Fan et al., 2008) that provide reference implementations of SVMs and generalized linear models with compatible licenses. Binary
packages are available on a rich set of platforms including Windows and any POSIX platforms.
Furthermore, thanks to its liberal license, it has been widely distributed as part of major free software distributions such as Ubuntu, Debian, Mandriva, NetBSD and Macports and in commercial
distributions such as the Enthought Python Distribution.
2. Project Vision
Code quality. Rather than providing as many features as possible, the project's goal has been to provide solid implementations. Code quality is ensured with unit tests (as of release 0.8, test coverage is 81%) and the use of static analysis tools such as pyflakes and pep8. Finally, we strive to use consistent naming for the functions and parameters used throughout, a strict adherence to the Python coding guidelines and numpy style documentation.
BSD licensing. Most of the Python ecosystem is licensed with non-copyleft licenses. While such policy is beneficial for adoption of these tools by commercial projects, it does impose some restrictions: we are unable to use some existing scientific code, such as the GSL.
Bare-bone design and API. To lower the barrier of entry, we avoid framework code and keep the number of different objects to a minimum, relying on numpy arrays for data containers.
Community-driven development. We base our development on collaborative tools such as git, github and public mailing lists. External contributions are welcome and encouraged.
Documentation. Scikit-learn provides a 300-page user guide including narrative documentation, class references, a tutorial, installation instructions, as well as more than 60 examples, some featuring real-world applications. We try to minimize the use of machine-learning jargon, while maintaining precision with regards to the algorithms employed.
3. Underlying Technologies
Numpy: the base data structure used for data and model parameters. Input data is presented as
numpy arrays, thus integrating seamlessly with other scientific Python libraries. Numpys viewbased memory model limits copies, even when binding with compiled code (Van der Walt et al.,
2011). It also provides basic arithmetic operations.
Scipy: efficient algorithms for linear algebra, sparse matrix representation, special functions and
basic statistical functions. Scipy has bindings for many Fortran-based standard numerical packages,
such as LAPACK. This is important for ease of installation and portability, as providing libraries
around Fortran code can prove challenging on various platforms.
Cython: a language for combining C and Python. Cython makes it easy to reach the performance
of compiled languages with Python-like syntax and high-level operations. It is also used to bind
compiled libraries, eliminating the boilerplate code of Python/C extensions.
4. Code Design
Objects specified by interface, not by inheritance. To facilitate the use of external objects with
scikit-learn, inheritance is not enforced; instead, code conventions provide a consistent interface.
The central object is an estimator, which implements a fit method, accepting as arguments an input data array and, optionally, an array of labels for supervised problems. Supervised estimators, such as SVM classifiers, can implement a predict method. Some estimators, which we call transformers (for example, PCA), implement a transform method, returning modified input data. Estimators
scikit-learn: 5.2, 1.17, 0.52, 0.57, 0.18, 1.34 (BSD)
mlpy: 9.47, 105.3, 73.7, 1.41, 0.79 (GPL)
pybrain: 17.5 (BSD)
Table 1: Time in seconds on the Madelon data set for various machine learning libraries exposed
in Python: MLPy (Albanese et al., 2008), PyBrain (Schaul et al., 2010), pymvpa (Hanke
et al., 2009), MDP (Zito et al., 2008) and Shogun (Sonnenburg et al., 2010). For more
benchmarks see http://github.com/scikit-learn.
may also provide a score method, which is an increasing evaluation of goodness of fit: a log-likelihood, or a negated loss function. The other important object is the cross-validation iterator, which provides pairs of train and test indices to split input data, for example for K-fold, leave-one-out, or stratified cross-validation.
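To make this convention concrete, here is a minimal sketch of an estimator that follows the fit/predict/score pattern described above; the class and its behaviour are our own illustration, not part of scikit-learn.

import numpy as np

class MeanClassifier:
    """Toy estimator following the scikit-learn convention:
    fit(X, y) learns per-class means, predict(X) assigns the
    nearest class mean. Illustrative only."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self  # fit returns self, as scikit-learn estimators do

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # squared distance from each sample to each class mean
        d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]

    def score(self, X, y):
        # accuracy: an increasing measure of goodness of fit
        return float((self.predict(X) == np.asarray(y)).mean())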
Model selection. Scikit-learn can evaluate an estimator's performance or select parameters using
cross-validation, optionally distributing the computation to several cores. This is accomplished by
wrapping an estimator in a GridSearchCV object, where the CV stands for cross-validated.
During the call to fit, it selects the parameters on a specified parameter grid, maximizing a score
(the score method of the underlying estimator). predict, score, or transform are then delegated
to the tuned estimator. This object can therefore be used transparently as any other estimator. Cross
validation can be made more efficient for certain estimators by exploiting specific properties, such
as warm restarts or regularization paths (Friedman et al., 2010). This is supported through special
objects, such as the LassoCV. Finally, a Pipeline object can combine several transformers and
an estimator to create a combined estimator to, for example, apply dimension reduction before
fitting. It behaves as a standard estimator, and GridSearchCV therefore tunes the parameters of all steps.
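A short usage sketch of the objects just described, combining a Pipeline with GridSearchCV; the dataset, parameter grid, and module paths (which follow current scikit-learn releases rather than those of release 0.8) are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Dimension reduction followed by an SVM classifier, behaving as one estimator.
pipe = Pipeline([("pca", PCA(n_components=30)), ("svc", SVC())])

# Grid search over parameters of both steps; "CV" stands for cross-validated.
grid = GridSearchCV(pipe,
                    {"pca__n_components": [10, 30], "svc__C": [1.0, 10.0]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)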
LARS. Iteratively refining the residuals instead of recomputing them gives performance gains of 2 to 10 times over the reference R implementation (Hastie and Efron, 2004). Pymvpa uses this implementation via the Rpy R bindings and pays a heavy price for memory copies.
Elastic Net. We benchmarked the scikit-learn coordinate descent implementations of Elastic Net. It
achieves the same order of performance as the highly optimized Fortran version glmnet (Friedman
et al., 2010) on medium-scale problems, but performance on very large problems is limited since
we do not use the KKT conditions to define an active set.
kNN. The k-nearest neighbors classifier implementation constructs a ball tree (Omohundro, 1989)
of the samples, but uses a more efficient brute force search in large dimensions.
PCA. For medium to large data sets, scikit-learn provides an implementation of a truncated PCA
based on random projections (Rokhlin et al., 2009).
k-means. scikit-learn's k-means algorithm is implemented in pure Python. Its performance is limited by the fact that numpy's array operations take multiple passes over data.
6. Conclusion
Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised, using a consistent, task-oriented interface, thus enabling easy comparison of methods for a
given application. Since it relies on the scientific Python ecosystem, it can easily be integrated into
applications outside the traditional range of statistical data analysis. Importantly, the algorithms,
implemented in a high-level language, can be used as building blocks for approaches specific to
a use case, for example, in medical imaging (Michel et al., 2011). Future work includes online
learning, to scale to large data sets.
References
D. Albanese, G. Merler, S. Jurman, and R. Visintainer. MLPy: high-performance python package for predictive modeling. In NIPS, MLOSS Workshop, 2008.
C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
P.F. Dubois, editor. Python: Batteries Included, volume 9 of Computing in Science & Engineering.
IEEE/AIP, May 2007.
R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: a library for large linear
classification. The Journal of Machine Learning Research, 9:1871-1874, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via
coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
I Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection
challenge, 2004.
M. Hanke, Y.O. Halchenko, P.B. Sederberg, S.J. Hanson, J.V. Haxby, and S. Pollmann. PyMVPA:
A Python toolbox for multivariate pattern analysis of fMRI data. Neuroinformatics, 7(1):37-53,
2009.
T. Hastie and B. Efron. Least Angle Regression, Lasso and Forward Stagewise. http://cran.r-project.org/web/packages/lars/lars.pdf, 2004.
V. Michel, A. Gramfort, G. Varoquaux, E. Eger, C. Keribin, and B. Thirion. A supervised clustering approach for fMRI-based inference of brain states. Pattern Recognition, epub ahead of print, April 2011. doi: 10.1016/j.patcog.2011.04.006.
K.J. Milmann and M. Avaizis, editors. Scientific Python, volume 11 of Computing in Science &
Engineering. IEEE/AIP, March 2011.
S.M. Omohundro. Five balltree construction algorithms. ICSI Technical Report TR-89-063, 1989.
V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis.
SIAM Journal on Matrix Analysis and Applications, 31(3):1100-1124, 2009.
T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. The Journal of Machine Learning Research, 11:743-746, 2010.
S. Van der Walt, S.C Colbert, and G. Varoquaux. The NumPy array: A structure for efficient
numerical computation. Computing in Science and Engineering, 11, 2011.
T. Zito, N. Wilbert, L. Wiskott, and P. Berkes. Modular toolkit for data processing (MDP): A Python
data processing framework. Frontiers in Neuroinformatics, 2, 2008.
REF [6]
International Journal of Computer Vision 38(1), 9-13, 2000
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.
Abstract. In this paper we first overview the main concepts of Statistical Learning Theory, a framework in which
learning from examples can be studied in a principled way. We then briefly discuss well known as well as emerging
learning techniques such as Regularization Networks and Support Vector Machines which can be justified in terms of the same induction principle.
1. Introduction
The goal of this paper is to provide a short introduction to Statistical Learning Theory (SLT) which studies
problems and techniques of supervised learning. For a
more detailed review of SLT see Evgeniou et al. (1999).
In supervised learning, or learning-from-examples, a machine is trained, instead of programmed, to perform a given task on a number of input-output pairs.
According to this paradigm, training means choosing
a function which best describes the relation between
the inputs and the outputs. The central question of
SLT is how well the chosen function generalizes, or
how well it estimates the output for previously unseen
inputs.
We will consider techniques which lead to solutions of the form

f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i).   (1)
The coefficients c_i are obtained by minimizing functionals of the form

H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2,

where V is a loss function which measures the goodness of the predicted output f(x_i) with respect to the given output y_i, \|f\|_K^2 is a smoothness term which can be thought of as a norm in the Reproducing Kernel Hilbert Space defined by the kernel K, and \lambda is a positive parameter which controls the relative weight between the data and the smoothness term. The choice of the loss function determines different learning techniques, each leading to a different learning algorithm for computing the coefficients c_i.
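As an illustration of the functional above, the following sketch evaluates it for a kernel expansion of the form (1) with a Gaussian kernel and the square loss; the function names and the choice of kernel and loss are our own assumptions, not prescribed by the text.

import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_risk(c, X, y, lam=0.1, sigma=1.0):
    """H[f] = (1/l) sum_i V(y_i, f(x_i)) + lam * ||f||_K^2
    with V the square loss and f(x) = sum_j c_j K(x, x_j)."""
    G = gaussian_kernel(X, X, sigma)   # Gram matrix
    f = G @ c                          # f evaluated at the training points
    emp = np.mean((y - f) ** 2)        # empirical term
    smooth = c @ G @ c                 # ||f||_K^2 for a kernel expansion
    return emp + lam * smooth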
The rest of the paper is organized as follows.
Section 2 presents the main idea and concepts in the theory. Section 3 discusses Regularization Networks and
Support Vector Machines, two important techniques
which produce outputs of the form of Eq. (1).
2.

Given a training set, the empirical risk is defined as

I_{emp}[f; \ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)).
The expected risk is

I[f] = \int_{X \times Y} V(y, f(x))\, P(x, y)\, dx\, dy,

and, with probability at least 1 - \eta,

I[f] < I_{emp}[f] + \Phi\!\left(\frac{h}{\ell}, \eta\right),   (2)

where h is the capacity and \Phi is an increasing function of h/\ell and \eta. For more information and for exact forms of the function \Phi we refer the reader to (Vapnik and Chervonenkis, 1971; Vapnik, 1998; Alon et al., 1993).
Intuitively, if the capacity of the function space in which
we perform empirical risk minimization is very large
and the number of examples is small, then the distance
between the empirical and expected risk can be large
and overfitting is very likely to occur.
Since the space F is usually very large (e.g. F could
be the space of square integrable functions), one typically considers smaller hypothesis spaces H. Moreover, inequality (2) suggests an alternative method for
achieving good generalization: instead of minimizing
the empirical risk, find the best trade-off between the
empirical risk and the complexity of the hypothesis
space measured by the second term in the r.h.s. of inequality (2). This observation leads to the method of
Structural Risk Minimization (SRM).
3. Learning Machines

3.1.
We now consider hypothesis spaces which are subsets of a Reproducing Kernel Hilbert Space (RKHS) (Wahba, 1990). A RKHS is a Hilbert space of functions f of the form f(x) = \sum_{n=1}^{N} a_n \phi_n(x), where \{\phi_n(x)\}_{n=1}^{N} is a set of given, linearly independent basis functions and N can be possibly infinite. A RKHS is equipped with a norm which is defined as

\|f\|_K^2 = \sum_{n=1}^{N} \frac{a_n^2}{\lambda_n},

where the \lambda_n are positive numbers associated with the kernel

K(x, y) = \sum_{n=1}^{N} \lambda_n \phi_n(x) \phi_n(y).

One can then consider learning machines which minimize the empirical risk

\sum_{i=1}^{\ell} V(y_i, f(x_i))

subject to

\|f\|_K \le A_m.
Equivalently, one can minimize the regularized functional

H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2.   (3)
The function minimizing (3) has the form

f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i),   (4)
The solution can also be written as

f(x) = \sum_{i=1}^{\ell} y_i b_i(x),   (11)

with b_i(x) = \sum_{j=1}^{\ell} (G + \lambda I)^{-1}_{ij} K(x_j, x), where (G)_{ij} = K(x_i, x_j) is the Gram matrix. Equation (11)
gives the dual representation of RN. Notice the difference between Eqs. (4) and (11): in the first one the
coefficients ci are learned from the data while in the
second one the basis functions bi are learned, the coefficient of the expansion being equal to the output of
the examples. We refer to (Girosi et al., 1995) for more
information on the dual representation.
3.3.
Regularization Networks
The approximation scheme that arises from the minimization of the quadratic functional

\frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2   (9)

is of the form (4), with the coefficients satisfying the linear system

(G + \lambda I)c = y,   (10)

where (c)_i = c_i. SVM regression is obtained by replacing the square loss with the \epsilon-insensitive loss,

\frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i)|_{\epsilon} + \lambda \|f\|_K^2,   (12)

and SVM classification by using the soft margin (hinge) loss,

\frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i f(x_i)|_{+} + \lambda \|f\|_K^2.   (13)
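A minimal numerical sketch of the Regularization Network solution: for the square loss, the coefficients of Eq. (4) follow from the linear system (10); the Gaussian kernel and all names below are illustrative assumptions.

import numpy as np

def fit_rn(X, y, lam=0.1, sigma=1.0):
    """Solve (G + lam*I) c = y for the coefficients of Eq. (4)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))          # Gram matrix (G)_ij = K(x_i, x_j)
    c = np.linalg.solve(G + lam * np.eye(len(X)), y)
    return c

def predict_rn(X_train, c, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i c_i K(x, x_i)."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ c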
It turns out that for both problems (12) and (13) the coefficients c_i in Eq. (4) can be found by solving a Quadratic Programming (QP) problem with linear constraints. The regularization parameter \lambda appears only in the linear constraints: the absolute values of the coefficients c_i are bounded by a constant that depends on \lambda.
3.4.
We conclude this short review with a discussion on kernels and data representations. A key issue when using
the learning techniques discussed above is the choice
of the kernel K in Eq. (4). The kernel K (xi , x j ) defines a dot product between the projections of the two
inputs xi and x j , in the feature space (the features being
\{\phi_1(x), \phi_2(x), \ldots, \phi_N(x)\} with N the dimensionality
of the RKHS). Therefore its choice is closely related to
the choice of the effective representation of the data,
i.e. the image representation in a vision application.
The problem of choosing the kernel for the machines
discussed here, and more generally the issue of finding appropriate data representations for learning, is an
important and open one. The theory does not provide
a general method for finding good data representations, but suggests representations that lead to simple solutions. Although there is not a general solution to this problem, a number of recent experimental and theoretical works provide insights for specific applications (Evgeniou et al., 2000; Jaakkola and Haussler, 1998; Mohan, 1999; Vapnik, 1998).
REF [7]
Technical Report
MSR-TR-2000-23
Re
fe
re
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
Abstract
We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces.
Contents
1 An Introductory Example
2 Learning Pattern Recognition from Examples
3 Hyperplane Classifiers
4 Support Vector Classifiers
5 Support Vector Regression
6 Further Developments
7 Kernels
8 Representing Similarities in Linear Spaces
9 Examples of Kernels
10 Representing Dissimilarities in Linear Spaces
1 An Introductory Example
Suppose we are given empirical data
(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}.   (1)
Here, the domain X is some nonempty set that the patterns xi are taken from;
the yi are called labels or targets .
Unless stated otherwise, indices i and j will always be understood to run
over the training set, i.e. i; j = 1; : : : ; m.
Note that we have not made any assumptions on the domain X other than
it being a set. In order to study the problem of learning, we need additional
structure. In learning, we want to be able to generalize to unseen data points.
In the case of pattern recognition, this means that given some new pattern x \in X, we want to predict the corresponding y \in \{\pm 1\}. By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples. To this end, we need similarity measures in X and in \{\pm 1\}. The latter is easy, as two target values can only be identical or different.
For the former, we require a similarity measure
k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'),   (2)

i.e., a function that, given two examples x and x', returns a real number characterizing their similarity. For reasons that will become clear later, the function k is called a kernel. A simple type of similarity measure is a dot product; in a dot product space the canonical dot product is

(\mathbf{x} \cdot \mathbf{x}') := \sum_{i=1}^{N} (\mathbf{x})_i (\mathbf{x}')_i.   (3)

In order to be able to use such a dot product as a similarity measure, we embed the patterns into some dot product space F via a map

\Phi : X \to F, \qquad x \mapsto \mathbf{x}.   (4)
The space F is called a feature space. To summarize, embedding the data into F has three benefits.
1. It lets us define a similarity measure from the dot product in F,

k(x, x') := (\mathbf{x} \cdot \mathbf{x}').   (5)
2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.
3. The freedom to choose the mapping \Phi will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map \Phi to change the representation into one that is more suitable for a given problem and learning algorithm.
We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute
the means of the two classes in feature space,
\mathbf{c}_1 = \frac{1}{m_1} \sum_{\{i: y_i = +1\}} \mathbf{x}_i,   (6)

\mathbf{c}_2 = \frac{1}{m_2} \sum_{\{i: y_i = -1\}} \mathbf{x}_i,   (7)

where m_1 and m_2 are the number of examples with positive and negative labels, respectively. We then assign a new point \mathbf{x} to the class whose mean is closer to it. This geometrical construction can be formulated in terms of dot products. Half-way in between \mathbf{c}_1 and \mathbf{c}_2 lies the point \mathbf{c} := (\mathbf{c}_1 + \mathbf{c}_2)/2. We compute the class of \mathbf{x} by checking whether the vector connecting \mathbf{c} and \mathbf{x} encloses an angle smaller than \pi/2 with the vector \mathbf{w} := \mathbf{c}_1 - \mathbf{c}_2 connecting the class means, in other words

y = \mathrm{sgn}((\mathbf{x} - \mathbf{c}) \cdot \mathbf{w}) = \mathrm{sgn}((\mathbf{x} - (\mathbf{c}_1 + \mathbf{c}_2)/2) \cdot (\mathbf{c}_1 - \mathbf{c}_2)) = \mathrm{sgn}((\mathbf{x} \cdot \mathbf{c}_1) - (\mathbf{x} \cdot \mathbf{c}_2) + b),   (8)

with the offset

b := \frac{1}{2}\left( \|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2 \right).   (9)
It is instructive to rewrite this decision function in terms of the patterns x_i in the input domain X. To this end, note that we do not have a dot product in X; all we have is the similarity measure k (cf. (5)). Therefore, we need to express everything in terms of the kernel, which yields

y = \mathrm{sgn}\left( \frac{1}{m_1} \sum_{\{i: y_i = +1\}} (\mathbf{x} \cdot \mathbf{x}_i) - \frac{1}{m_2} \sum_{\{i: y_i = -1\}} (\mathbf{x} \cdot \mathbf{x}_i) + b \right) = \mathrm{sgn}\left( \frac{1}{m_1} \sum_{\{i: y_i = +1\}} k(x, x_i) - \frac{1}{m_2} \sum_{\{i: y_i = -1\}} k(x, x_i) + b \right),   (10)

with

b := \frac{1}{2}\left( \frac{1}{m_2^2} \sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_1^2} \sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j) \right).   (11)
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and that k can be viewed as a density, i.e. it is positive and has integral 1,

\int_X k(x, x')\, dx = 1 \quad \text{for all } x' \in X.   (12)

In that case, (10) amounts to comparing two class-conditional density estimates,

p_1(x) := \frac{1}{m_1} \sum_{\{i: y_i = +1\}} k(x, x_i),   (13)

p_2(x) := \frac{1}{m_2} \sum_{\{i: y_i = -1\}} k(x, x_i).   (14)
Given some point x, the label is then simply computed by checking which of the
two, p1 (x) or p2 (x), is larger, which directly leads to (10). Note that this decision
is the best we can do if we have no prior information about the probabilities of
the two classes.
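The decision rule (10) is easy to transcribe directly; the sketch below assumes a Gaussian kernel and the case b = 0 discussed above, and all names are illustrative.

import numpy as np

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def predict_mean_classifier(x, X, y, k=rbf):
    """Decision rule (10) with b = 0: compare the average similarity of x
    to the positive and to the negative training examples."""
    pos = [xi for xi, yi in zip(X, y) if yi == +1]
    neg = [xi for xi, yi in zip(X, y) if yi == -1]
    s_pos = np.mean([k(x, xi) for xi in pos])
    s_neg = np.mean([k(x, xi) for xi in neg])
    return np.sign(s_pos - s_neg)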
The classifier (10) is quite close to the types of learning machines that we
will be interested in. It is linear in the feature space, while in the input domain,
it is represented by a kernel expansion. It is example-based in the sense that
the kernels are centered on the training examples, i.e. one of the two arguments
of the kernels is always a training example. The main point where the more
sophisticated techniques to be discussed later will deviate from (10) is in the
selection of the examples that the kernels are centered on, and in the weight
that is put on the individual kernels in the decision function. Namely, it will no
3
longer be the case that all training examples appear in the kernel expansion,
and the weights of the kernels in the expansion will no longer be uniform. In
the feature space representation, this statement corresponds to saying that we
will study all normal vectors w of decision hyperplanes that can be represented
as linear combinations of the training examples. For instance, we might want
to remove the influence of patterns that are very far away from the decision
boundary, either since we expect that they will not improve the generalization
error of the decision function, or since we would like to reduce the computational
cost of evaluating the decision function (cf. (10)). The hyperplane will then only
depend on a subset of training examples, called support vectors.
2 Learning Pattern Recognition from Examples

We want to estimate a function

f : X \to \{\pm 1\}   (15)

based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution
P (x; y). Our goal is to learn a function that will correctly classify unseen examples (x; y), i.e. we want f (x) = y for examples (x; y) that were also generated
from P (x; y).
If we put no restriction on the class of functions that we choose our estimate f from, however, even a function which does well on the training data,
e.g. by satisfying f (xi ) = yi for all i = 1; : : : ; m, need not generalize well to
unseen examples. To see this, note that for each function f and any test set
(\bar{x}_1, \bar{y}_1), \ldots, (\bar{x}_{\bar{m}}, \bar{y}_{\bar{m}}) \in \mathbb{R}^N \times \{\pm 1\} satisfying \{\bar{x}_1, \ldots, \bar{x}_{\bar{m}}\} \cap \{x_1, \ldots, x_m\} = \emptyset, there exists another function f^* such that f^*(x_i) = f(x_i) for all i = 1, \ldots, m, yet f^*(\bar{x}_i) \ne f(\bar{x}_i) for all i = 1, \ldots, \bar{m}. As we are only given the training data,
we have no means of selecting which of the two functions (and hence which of
the completely dierent sets of test label predictions) is preferable. Hence, only
minimizing the training error (or empirical risk ),
R_{emp}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} |f(x_i) - y_i|,   (16)

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

R[f] = \int \frac{1}{2} |f(x) - y| \, dP(x, y).   (17)
Statistical learning theory [31, 27, 28, 29], or VC (Vapnik-Chervonenkis) theory,
shows that it is imperative to restrict the class of functions that f is chosen
from to one which has a capacity that is suitable for the amount of available
training data. VC theory provides bounds on the test error. The minimization
of these bounds, which depend on both the empirical risk and the capacity of
the function class, leads to the principle of structural risk minimization [27].
The best-known capacity concept of VC theory is the VC dimension, dened as
the largest number h of points that can be separated in all possible ways using
functions of the given class. An example of a VC bound is the following: if
h < m is the VC dimension of the class of functions that the learning machine
can implement, then for all functions of that class, with a probability of at least
1 - \eta, the bound

R(\alpha) \le R_{emp}(\alpha) + \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right)   (18)

holds, where the confidence term \phi is defined as

\phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\eta/4)}{m}}.   (19)
Tighter bounds can be formulated in terms of other concepts, such as the annealed VC entropy or the Growth function. These are usually considered to be
harder to evaluate, but they play a fundamental role in the conceptual part of
VC theory [28]. Alternative capacity concepts that can be used to formulate
bounds include the fat shattering dimension [2].
The bound (18) deserves some further explanatory remarks. Suppose we
wanted to learn a "dependency" where P(x, y) = P(x) \cdot P(y), i.e. where the
pattern x contains no information about the label y, with uniform P (y). Given a
training sample of xed size, we can then surely come up with a learning machine
which achieves zero training error (provided we have no examples contradicting
each other). However, in order to reproduce the random labellings, this machine
will necessarily require a large VC dimension h. Thus, the confidence term
(19), increasing monotonically with h, will be large, and the bound (18) will
not support possible hopes that due to the small training error, we should
expect a small test error. This makes it understandable how (18) can hold
independent of assumptions about the underlying distribution P (x; y): it always
holds (provided that h < m), but it does not always make a nontrivial prediction: a bound on an error rate becomes void if it is larger than the maximum error
rate. In order to get nontrivial predictions from (18), the function space must
be restricted such that the capacity (e.g. VC dimension) is small enough (in
relation to the available amount of data).
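For intuition, one can evaluate the confidence term (19) numerically; the values of h, m and eta below are arbitrary illustrations.

import numpy as np

def vc_confidence(h, m, eta=0.05):
    """Confidence term of (19): sqrt((h*(log(2m/h) + 1) - log(eta/4)) / m)."""
    return np.sqrt((h * (np.log(2.0 * m / h) + 1.0) - np.log(eta / 4.0)) / m)

# The term grows with the VC dimension h and shrinks with the sample size m:
for h in (10, 100, 1000):
    print(h, vc_confidence(h, m=10000))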
3 Hyperplane Classifiers

To design learning algorithms, one thus needs to come up with a class of functions whose capacity can be computed.
[32] and [30] considered the class of hyperplanes
(w \cdot x) + b = 0, \qquad w \in \mathbb{R}^N,\ b \in \mathbb{R},   (20)

corresponding to decision functions

f(x) = \mathrm{sgn}((w \cdot x) + b),   (21)
and proposed a learning algorithm for separable problems, termed the Generalized Portrait, for constructing f from empirical data. It is based on two
facts. First, among all hyperplanes separating the data, there exists a unique
one yielding the maximum margin of separation between the classes,
\max_{w, b} \ \min\{\|x - x_i\| : x \in \mathbb{R}^N, \ (w \cdot x) + b = 0, \ i = 1, \ldots, m\}.   (22)
To construct this optimal hyperplane, one introduces Lagrange multipliers \alpha_i \ge 0 and the Lagrangian

L(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i ((x_i \cdot w) + b) - 1 \right),   (25)

with the conditions

\frac{\partial}{\partial b} L(w, b, \boldsymbol{\alpha}) = 0 \quad \text{and} \quad \frac{\partial}{\partial w} L(w, b, \boldsymbol{\alpha}) = 0.   (26)
Figure 1: A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem being separable, there exists a weight vector w and a threshold b such that y_i((w \cdot x_i) + b) > 0 (i = 1, \ldots, m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w \cdot x_i) + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i((w \cdot x_i) + b) \ge 1. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/\|w\|. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, i.e. (w \cdot x_1) + b = 1, (w \cdot x_2) + b = -1, and projecting them onto the hyperplane normal vector w/\|w\|.
The conditions (26) lead to

\sum_{i=1}^{m} \alpha_i y_i = 0   (27)

and

w = \sum_{i=1}^{m} \alpha_i y_i x_i.   (28)
The solution vector thus has an expansion in terms of a subset of the training
patterns, namely those patterns whose i is non-zero, called Support Vectors.
By the Karush-Kuhn-Tucker complementarity conditions
\alpha_i \left[ y_i ((x_i \cdot w) + b) - 1 \right] = 0, \qquad i = 1, \ldots, m,   (29)
the Support Vectors lie on the margin (cf. Figure 1). All remaining examples of
the training set are irrelevant: their constraint (24) does not play a role in the
optimization, and they do not appear in the expansion (28). This nicely captures
our intuition of the problem: as the hyperplane (cf. Figure 1) is completely
determined by the patterns closest to it, the solution should not depend on the
other examples.
By substituting (27) and (28) into L, one eliminates the primal variables and
arrives at the Wolfe dual of the optimization problem [e.g. 6]: find multipliers \alpha_i which

maximize \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)   (30)

subject to \quad \alpha_i \ge 0, \ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.   (31)

The hyperplane decision function can then be written as

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i (x \cdot x_i) + b \right).   (32)
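The dual (30)-(31) is a small quadratic program; the sketch below solves it for a linear kernel with a general-purpose solver (not a dedicated QP code) on made-up toy data, then forms w, b and the decision function (32) from (28) and (29). All data and names are illustrative.

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy problem.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()             # minimize the negative of W(alpha)

cons = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * len(y)                    # alpha_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(len(y)), bounds=bnds, constraints=cons, method="SLSQP")

alpha = res.x
w = (alpha * y) @ X                              # expansion (28)
sv = np.argmax(alpha)                            # a support vector has alpha_i > 0
b = y[sv] - X[sv] @ w                            # from the KKT condition (29)
print(np.sign(X @ w + b))                        # decision function (32), linear kernel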
k(x, x') = (\mathbf{x} \cdot \mathbf{x}') = (\Phi(x) \cdot \Phi(x')).   (33)
Figure 2: The idea of SV machines: map the training data into a higher-dimensional feature space via \Phi, and construct a separating hyperplane with
maximum margin there. This yields a nonlinear decision boundary in input
space. By the use of a kernel function (2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
This can be done since all feature vectors only occurred in dot products. The
weight vector (cf. (28)) then becomes an expansion in feature space, and will
thus typically no more correspond to the image of a single vector from input
space. We thus obtain decision functions of the more general form (cf. (32))
f(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i (\Phi(x) \cdot \Phi(x_i)) + b \right) = \mathrm{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b \right)   (34)

and the following quadratic program:

maximize \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)   (35)

subject to \quad \alpha_i \ge 0, \ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.   (36)
In practice, a separating hyperplane may not exist, e.g. if a high noise level
causes a large overlap of the classes. To allow for the possibility of examples
violating (24), one introduces slack variables [10, 28, 22]
\xi_i \ge 0, \qquad i = 1, \ldots, m,   (37)

together with relaxed separation constraints

y_i ((w \cdot x_i) + b) \ge 1 - \xi_i, \qquad i = 1, \ldots, m.   (38)
A classifier which generalizes well is then found by controlling both the classifier capacity (via \|w\|) and the sum of the slacks \sum_i \xi_i. The latter is done as it can be shown to provide an upper bound on the number of training errors. One possible realization of such a soft margin classifier is obtained by minimizing the objective function

\tau(w, \boldsymbol{\xi}) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i   (39)
subject to the constraints (37) and (38), for some value of the constant C > 0
determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for corresponding vectors \boldsymbol{\xi} = (\xi_1, \ldots, \xi_m). Incorporating kernels,
and rewriting it in terms of Lagrange multipliers, this again leads to the problem
of maximizing (35), subject to the constraints
0 \le \alpha_i \le C, \quad i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.   (40)

The only difference from the separable case is the upper bound C on the Lagrange multipliers \alpha_i. This way, the influence of the individual patterns (which
could be outliers) gets limited. As above, the solution takes the form (34). The threshold b can be computed by exploiting the fact that for all SVs x_i with \alpha_i < C, the slack variable \xi_i is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence

\sum_{j=1}^{m} y_j \alpha_j k(x_i, x_j) + b = y_i.   (41)

Figure 4: In SV regression, a tube with radius \varepsilon is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables \xi) is determined by minimizing (46).
Another possible realization of a soft margin variant of the optimal hyperplane uses the \nu-parametrization [22]. In it, the parameter C is replaced by a parameter \nu \in [0, 1] which can be shown to lower and upper bound the number of examples that will be SVs and that will come to lie on the wrong side of the hyperplane, respectively. It uses a primal objective function with the error term

\frac{1}{m} \sum_i \xi_i - \nu \rho,   (42)

and separation constraints

y_i ((w \cdot x_i) + b) \ge \rho - \xi_i, \qquad i = 1, \ldots, m.   (43)
5 Support Vector Regression

In SV regression, the goal is to estimate a linear function

f(x) = (w \cdot x) + b   (44)

by minimizing the \varepsilon-insensitive risk

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} |y_i - f(x_i)|_{\varepsilon},   (45)

which can be written as the constrained optimization problem

minimize \quad \tau(w, \boldsymbol{\xi}, \boldsymbol{\xi}^*) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)   (46)
subject to \quad ((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i,   (47)
\qquad\qquad\; y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^*,   (48)
\qquad\qquad\; \xi_i, \xi_i^* \ge 0   (49)
for all i = 1, \ldots, m. Note that according to (47) and (48), any error smaller than \varepsilon does not require a nonzero \xi_i or \xi_i^*, and hence does not enter the objective function (46).

Generalization to kernel-based regression estimation is carried out in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for C > 0, \varepsilon \ge 0 chosen a priori,

maximize \quad W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = -\varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j)   (50)

subject to \quad 0 \le \alpha_i, \alpha_i^* \le C, \ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) = 0.   (51)

The regression estimate takes the form

f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) k(x_i, x) + b,   (52)

where b is computed using the fact that (47) becomes an equality with \xi_i = 0 if 0 < \alpha_i < C, and (48) becomes an equality with \xi_i^* = 0 if 0 < \alpha_i^* < C.
Several extensions of this algorithm are possible. From an abstract point
of view, we just need some target function which depends on the vector (w; )
(cf. (46)). There are multiple degrees of freedom for constructing it, including
some freedom how to penalize, or regularize, dierent parts of the vector, and
some freedom how to use the kernel trick. For instance, more general loss
functions can be used for \xi, leading to problems that can still be solved efficiently
[24]. Moreover, norms other than the 2-norm \|\cdot\| can be used to regularize the solution. Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of w [17].
Finally, the algorithm can be modified such that \varepsilon need not be specified a priori. Instead, one specifies an upper bound 0 \le \nu \le 1 on the fraction of
points allowed to lie outside the tube (asymptotically, the number of SVs) and the corresponding \varepsilon is computed automatically. This is achieved by using as primal objective function

\frac{1}{2}\|w\|^2 + C \left( \nu m \varepsilon + \sum_{i=1}^{m} |y_i - f(x_i)|_{\varepsilon} \right)   (53)

instead of (45), and treating \varepsilon \ge 0 as a parameter that we minimize over [22].
6 Further Developments
Minimizing the regularized risk

R_{reg}[f] = R_{emp}[f] + \frac{\lambda}{2}\|Pf\|^2   (54)

(with a regularization parameter \lambda \ge 0) can be written as a constrained opti-
mization problem. For particular choices of the loss function, it further reduces
to a SV type quadratic programming problem. The latter thus is not specic to
SV machines, but is common to a much wider class of approaches. What gets
lost in the general case, however, is the fact that the solution can usually be expressed in terms of a small number of SVs. This specic feature of SV machines
is due to the fact that the type of regularization and the class of functions that
the estimate is chosen from are intimately related [11, 23]: the SV algorithm is
equivalent to minimizing the regularized risk on the set of functions
f(x) = \sum_{i} \alpha_i k(x_i, x) + b,   (55)
7 Kernels
We now take a closer look at the issue of the similarity measure, or kernel, k.
In this section, we think of X as a subset of the vector space \mathbb{R}^N (N \in \mathbb{N}), endowed with the canonical dot product (3).
This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

N_F = \frac{(N + d - 1)!}{d!\,(N - 1)!}   (61)

different monomials (58), comprising a feature space F of dimensionality N_F. For instance, already 16 \times 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 10^{10}.
In certain cases described below, there exists, however, a way of computing
dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of kernels nonlinear in the input space \mathbb{R}^N. Thus, if
the subsequent processing can be carried out using dot products exclusively, we
are able to deal with the high dimensionality.
The following section describes how dot products in polynomial feature
spaces can be computed efficiently.
In order to compute dot products of the form (\Phi(x) \cdot \Phi(x')), we employ kernel representations of the form

k(x, x') = (\Phi(x) \cdot \Phi(x')),   (62)
which allow us to compute the value of the dot product in F without having to
carry out the map . This method was used by [8] to extend the Generalized
Portrait hyperplane classier of [31] to nonlinear Support Vector machines. In
[1], F is termed the linearization space, and used in the context of the potential
function classication method to express the dot product between elements of
F in terms of elements of the input space.
What does k look like for the case of polynomial features? We start by
giving an example [28] for N = d = 2. For the map
i.e. the desired kernel k is simply the square of the dot product in input space.
The same works for arbitrary N, d \in \mathbb{N} [8]: as a straightforward generalization of a result proved in the context of polynomial approximation [16, Lemma 2.1], we have:

Proposition 1 Define C_d to map x \in \mathbb{R}^N to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, x') = (C_d(x) \cdot C_d(x')) = \sum_{j_1, \ldots, j_d = 1}^{N} [x]_{j_1} \cdots [x]_{j_d}\,[x']_{j_1} \cdots [x']_{j_d} = \left( \sum_{j=1}^{N} [x]_j [x']_j \right)^{d} = (x \cdot x')^d.   (65)
If x represents an image with the entries being pixel values, we can use
the kernel (x x0 )d to work in the space spanned by products of any d pixels |
provided that we are able to do our work solely in terms of dot products, without
any explicit usage of a mapped pattern d (x). Using kernels of the form (65), we
take into account higher-order statistics without the combinatorial explosion (cf.
(61)) of time and memory complexity which goes along already with moderately
high N and d.
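Proposition 1 can be checked numerically by comparing the explicit monomial feature map C_d with the kernel (65); the helper function and test values below are our own illustration.

import numpy as np
from itertools import product

def monomial_features(x, d):
    """All ordered d-th degree products of the entries of x (the map C_d)."""
    return np.array([np.prod([x[j] for j in idx])
                     for idx in product(range(len(x)), repeat=d)])

x  = np.array([1.0, 2.0, -1.0])
xp = np.array([0.5, -3.0, 2.0])
d  = 3

lhs = monomial_features(x, d) @ monomial_features(xp, d)   # explicit feature map
rhs = (x @ xp) ** d                                        # kernel (x . x')^d, Eq. (65)
print(np.isclose(lhs, rhs))                                # True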
To conclude this section, note that it is possible to modify (65) such that it maps into the space of all monomials up to degree d, defining [28]

k(x, x') = ((x \cdot x') + 1)^d.   (70)

Given a kernel k and patterns x_1, \ldots, x_m, the m \times m matrix

K := (k(x_i, x_j))_{ij}   (71)

is called the Gram matrix (or kernel matrix) of k with respect to x_1, \ldots, x_m.
An m \times m matrix K over \mathbb{C} satisfying

\sum_{i,j} c_i \bar{c}_j K_{ij} \ge 0 \quad \text{for all } c_1, \ldots, c_m \in \mathbb{C}   (72)

is called positive.

Definition 4 ((Positive definite) kernel) Let X be a nonempty set. A function k : X \times X \to \mathbb{C} which for all m \in \mathbb{N}, x_i \in X gives rise to a positive Gram matrix is called a positive definite kernel.
The term kernel stems from the first use of this type of function in the study of integral operators. A function k which gives rise to an operator T via

(Tf)(x) = \int_X k(x, x') f(x')\, dx'   (73)
is called the kernel of T. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is usually used to denote the case where equality in (72) only occurs if c_1 = \ldots = c_m = 0. Simply using the term positive kernel, on the other hand, could be confused with a kernel whose values are positive. In the literature, a number of different terms

^1 The bar in \bar{c}_j denotes complex conjugation.
are used for positive definite kernels, such as reproducing kernel, Mercer kernel, or support vector kernel.
The definitions for (positive definite) kernels and positive matrices differ only in the fact that in the former case, we are free to choose the points on which the kernel is evaluated.
Positive definiteness implies positivity on the diagonal,

k(x, x) \ge 0 \quad \text{for all } x \in X,   (74)

and symmetry,

k(x_i, x_j) = \overline{k(x_j, x_i)}.   (75)
Note that in the complex-valued case, our definition of symmetry includes complex conjugation, depicted by the bar. The definition of symmetry of matrices is analogous, i.e. K_{ij} = \overline{K_{ji}}.
Obviously, real-valued kernels, which are what we will mainly be concerned with, are contained in the above definition as a special case, since we did not require that the kernel take values in \mathbb{C} \setminus \mathbb{R}. However, it is not sufficient to require that (72) hold for real coefficients c_i. If we want to get away with real coefficients only, we additionally have to require that the kernel be symmetric,

k(x_i, x_j) = k(x_j, x_i).   (76)
It can be shown that whenever k is a (complex-valued) positive denite kernel,
re
Re
fe
positive. Hence both its eigenvalues are nonnegative, and so is their product,
K 's determinant, i.e.
(78)
0 K11 K22 ? K12K21 = K11 K22 ? K12 K 12 = K11K22 ? jK12 j2 :
Substituting k(xi ; xj ) for Kij , we get the desired inequality.
We are now in a position to construct the feature space associated with a
kernel k.
j =1
m X
m
X
0
i j k(xi ; x0j ):
i=1 j =1
(81)
(82)
re
hf; gi :=
j k(:; x0j )
nc
g(:) =
es
Re
fe
To see that this is well-dened, although it explicitly contains the expansion
coecients (which need not be unique), note that
hf; gi =
m
X
0
j =1
j f (x0j );
(83)
using k(x0j ; xi ) = k(xi ; x0j ). The latter, however, does not depend on the particular expansion of f . Similarly, for g, note that
hf; gi =
m
X
i=1
i g(xi ):
(84)
The last two equations also show that h:; :i is antilinear in the rst argument
and linear in the second one. It is symmetric, as hf; gi = hg; f i. Moreover, given
functions f1 ; : : : ; fn , and coecients
1 ; : : : ;
n 2 C , we have
n
X
i;j =1
i j hfi ; fj i =
*X
i
20
i fi ;
X
i
i fi 0;
(85)
9 Examples of Kernels
Besides (65), [8] and [28] suggest the usage of Gaussian radial basis function kernels [1],

k(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\,\sigma^2} \right).   (89)
Note that all these kernels have the convenient property of unitary invariance, i.e. k(x, x') = k(Ux, Ux') if U^\top = U^{-1} (if we consider complex numbers, then U^* instead of U^\top has to be used).
The radial basis function kernel additionally is translation invariant. Moreover, as it satisfies k(x, x) = 1 for all x \in X, each mapped example has unit length, \|\Phi(x)\| = 1. In addition, as k(x, x') > 0 for all x, x' \in X, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (3) equals the cosine of the enclosed angle. Hence

\cos(\angle(\Phi(x), \Phi(x'))) = (\Phi(x) \cdot \Phi(x')) = k(x, x') > 0,   (91)
^2 A Hilbert space H is defined as a complete dot product space. Completeness means that all sequences in H which are convergent in the norm corresponding to the dot product will actually have their limits in H, too.
which amounts to saying that the enclosed angle between any two mapped
examples is smaller than \pi/2.
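The properties just noted (unit length of the mapped examples, positive pairwise dot products, and positivity of the Gram matrix) can be verified numerically for the Gaussian kernel (89); the data below are random and purely illustrative.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X)
print(np.allclose(np.diag(K), 1.0))            # k(x, x) = 1: mapped points have unit length
print((K > 0).all())                           # all pairwise dot products positive: same orthant
print(np.all(np.linalg.eigvalsh(K) > -1e-10))  # Gram matrix is positive (semi)definite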
The examples given so far apply to the case of vectorial data. Let us at least
give one example where X is not a vector space.
Further examples include kernels for string matching, as proposed by [34, 12].
Consider coefficients c_1, \ldots, c_m \in \mathbb{C} restricted to satisfy

\sum_{i=1}^{m} c_i = 0.   (94)

A symmetric m \times m matrix K (m \ge 2) is called conditionally positive if \sum_{i,j=1}^{m} c_i \bar{c}_j K_{ij} \ge 0 for all such coefficients.
Definition 8 (Conditionally positive definite kernel) Let X be a nonempty set. A function k : X \times X \to \mathbb{C} which for all m \ge 2, x_i \in X gives rise to a conditionally positive Gram matrix is called a conditionally positive definite kernel.
The definitions for the real-valued case look exactly the same. Note that symmetry is required, also in the complex case. Due to the additional constraint on the coefficients c_i, it does not follow automatically anymore.
Note that for coefficients satisfying (94), adding a constant b to the kernel does not change the quadratic form:

\sum_{i,j} c_i c_j (k(x_i, x_j) + b) = \sum_{i,j} c_i c_j k(x_i, x_j) + b \left( \sum_i c_i \right)^2 = \sum_{i,j} c_i c_j k(x_i, x_j) \ge 0.   (95)
A standard example of a conditionally positive definite kernel which is not positive definite is

k(x, x') = -\|x - x'\|^2,   (96)

where x, x' \in X, and X is a dot product space. To see this, simply compute, for some pattern set x_1, \ldots, x_m,

\sum_{i,j} c_i c_j k(x_i, x_j) = -\sum_{i,j} c_i c_j \|x_i - x_j\|^2   (97)
 = -\sum_i c_i \sum_j c_j \|x_j\|^2 - \sum_j c_j \sum_i c_i \|x_i\|^2 + 2 \sum_{i,j} c_i c_j (x_i \cdot x_j) = 2 \sum_{i,j} c_i c_j (x_i \cdot x_j) \ge 0,   (98)

where the last line follows from (94) and the fact that k(x, x') = (x \cdot x') is a positive definite kernel. Note that without (94), (97) can also be negative (e.g., put c_1 = \ldots = c_m = 1), hence the kernel is not a positive definite one.
Without proof, we add that in fact,

k(x, x') = -\|x - x'\|^{\beta}   (99)

is conditionally positive definite for 0 < \beta \le 2.
Let us consider the kernel (96), which can be considered the canonical conditionally positive kernel on a dot product space, and see how it is related to
the dot product. Clearly, the distance induced by the norm is invariant under
translations, i.e.
kx ? x0 k = k(x ? x0 ) ? (x0 ? x0 )k
(100)
0
0
for all x; x ; x0 2 X . In other words, even complete knowledge of kx ? x k for
all x; x0 2 X would not completely determine the underlying dot product, the
reason being that the dot product is not invariant under translations. Therefore,
one needs to x an origin x0 when going from the distance measure to the dot
product. To this end, we need to write the dot product of x ? x0 and x0 ? x0 in
terms of distances:
((x ? x0 ) (x0 ? x0 )) = (x x0 ) + kx0 k2 + (x x0 ) + (x0 x0 )
?
= 21 ?kx ? x0 k2 + kx ? x0 k2 + kx0 ? x0 k2 (101)
23
By construction, this will always result in a positive definite kernel: the dot product is a positive definite kernel, and we have only translated the inputs. We have thus established the connection between (96) and a class of positive definite kernels corresponding to the dot product in different coordinate systems, related to each other by translations. In fact, a similar connection holds for a wide class of kernels:
Proposition 9 Let x0 2 X , and let k be a symmetric kernel on X X , satisfying k(x0 ; x0 ) 0. Then
k~(x; x0 ) := k(x; x0 ) ? k(x; x0 ) ? k(x0 ; x0 )
(102)
The significance of this proposition is that using conditionally positive definite kernels, we can thus generalize all algorithms based on distances to corresponding algorithms operating in feature spaces. This is an analogue of the kernel trick for distances rather than dot products, i.e. dissimilarities rather than similarities.
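A small numerical check of the preceding discussion: the kernel (96) violates positive definiteness for general coefficients but satisfies the required inequality once the zero-sum constraint (94) is imposed; the data are random and illustrative.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
K = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # k(x, x') = -||x - x'||^2, Eq. (96)

c = rng.normal(size=6)
c -= c.mean()                 # enforce the zero-sum constraint (94)
print(c @ K @ c >= -1e-10)    # quadratic form nonnegative under the constraint

c_bad = np.ones(6)            # without the zero-sum constraint ...
print(c_bad @ K @ c_bad)      # ... the quadratic form can be (and here is) negative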
REF [8]
Kristin P. Bennett
bennek@rpi.edu
Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180 USA
Erin J. Bredensteiner
Department of Mathematics, University of Evansville, Evansville, IN 47722 USA
Abstract
In this paper we provide an intuitive geometric explanation of SVM for classification from the dual perspective along with a mathematically rigorous derivation
of the ideas behind the geometry. We begin with an
explanation of the geometry of SVM based on the idea
of convex hulls. For the separable case, this geometric explanation has existed in various forms (Vapnik,
1996; Mangasarian, 1965; Keerthi et al., 1999; Bennett & Bredensteiner, in press). The new contribution
is the adaptation of the convex hull argument for the
inseparable case to the most commmonly used 2-norm
and 1-norm soft margin SVM. The primal form resulting from this argument can be regarded as an especially elegant minor variant of the -SVM formulation
(Sch
olkopf et al., 2000) or a soft margin form of the
MSM method (Mangasarian, 1965). Related geometric ideas for the -SVM formulation were developed
independently by Crisp and Burges (1999).
We develop an intuitive geometric interpretation of the standard support vector machine (SVM) for classification of both linearly
separable and inseparable data and provide
a rigorous derivation of the concepts behind
the geometry. For the separable case finding
the maximum margin between the two sets is
equivalent to finding the closest points in the
smallest convex sets that contain each class
(the convex hulls). We now extend this argument to the inseparable case by using a
reduced convex hull reduced away from outliers. We prove that solving the reduced convex hull formulation is exactly equivalent to
solving the standard inseparable SVM for appropriate choices of parameters. Some additional advantages of the new formulation are
that the effect of the choice of parameters
becomes geometrically clear and that the formulation may be solved by fast nearest point
algorithms. By changing norms these arguments hold for both the standard 2-norm and
1-norm SVM.
eb6@evansville.edu
1. Introduction
Support vector machines (SVM) are a very robust
methodology for inference with minimal parameter
choices. This should translate into the popular adaptation of SVM in many application domains by non-SVM experts. The popular success of prior methodologies like neural networks, genetic algorithms, and decision trees was enhanced by the intuitive motivation of these approaches, which in some sense enhanced the end user's ability to develop applications independently and have a sense of confidence in the results.
How do you sell a SVM to a consulting client, manager, etc? What quick description would allow an end
user to grasp the fundamentals of SVM necessary for
Figure 2. The two closest points of the convex hulls determine the separating plane.
between these two supporting hyperplanes. The distance between the two parallel supporting hyperplanes is (\alpha - \beta)/\|w\|. Therefore the distance between the two planes can be maximized by minimizing \|w\| and maximizing (\alpha - \beta). This yields the primal C-Margin problem

\min_{w, \alpha, \beta} \ \frac{1}{2}\|w\|^2 - (\alpha - \beta) \quad \text{s.t.} \quad Aw - \alpha e \ge 0, \quad -Bw + \beta e \ge 0,   (2)

and the final separating plane is x \cdot w = (\alpha + \beta)/2.
A convex combination of points is a positive weighted combination where the weights sum to one, e.g. a convex combination c of points in A is defined by c = u_1 A_1 + u_2 A_2 + \ldots + u_m A_m = u^\top A, where u \in \mathbb{R}^m, u \ge 0, and \sum_{i=1}^{m} u_i = e^\top u = 1.
Figure 3. The primal problem maximizes the distance between two parallel supporting planes.
The closest points in the two convex hulls are found by solving the C-Hull problem

\min_{u, v} \ \frac{1}{2}\|A^\top u - B^\top v\|^2 \quad \text{s.t.} \quad e^\top u = 1, \ e^\top v = 1, \ u \ge 0, \ v \ge 0,   (1)

and the separating plane passes through the midpoint (c + d)/2 of the two closest points c = A^\top u and d = B^\top v, with normal w = c - d.
There is an alternative approach to finding the best
separating plane. Consider a set of parallel supporting planes as in Figure 3. These planes are positioned
so that all the points in A satisfy x w and at least
one point in A lies on the plane x w = . Similarly,
all points in B satisfy x w and at least one point
in B lies on the plane x w = . The optimal separating plane can be found by maximizing the distance
The primal C-Margin (2) and dual C-Hull (1) formulations provide a unifying framework for explaining
other SVM formulations. By transforming C-Margin
(2) into mathematically equivalent optimization problems, different SVM formulations are produced. If we
set \alpha - \beta = 2 by defining \alpha = \gamma + 1 and \beta = \gamma - 1, then Problem (2) becomes the standard primal SVM 2-norm formulation (Vapnik, 1996)

\min_{w, \gamma} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad Aw - (\gamma + 1)e \ge 0, \quad -Bw + (\gamma - 1)e \ge 0.   (3)
The problem of finding two closest points in the reduced convex hulls can be written as an optimization
problem (RC-Hull):
\min_{u, v} \ \frac{1}{2}\|A^\top u - B^\top v\|^2 \quad \text{s.t.} \quad e^\top u = 1, \ e^\top v = 1, \ 0 \le u \le De, \ 0 \le v \le De.   (4)
The corresponding primal problem (RC-Margin) adds slack variables to (2):

\min_{w, \alpha, \beta, \xi, \eta} \ \frac{1}{2}\|w\|^2 - (\alpha - \beta) + D(e^\top \xi + e^\top \eta) \quad \text{s.t.} \quad Aw - \alpha e + \xi \ge 0, \ \xi \ge 0, \quad -Bw + \beta e + \eta \ge 0, \ \eta \ge 0,   (5)

with D = 1/K > 0. As we will prove in Theorem 4.3, the dual of RC-Margin (5) is exactly RC-Hull (4) which finds the closest points in the reduced convex hulls.
The classic inseparable SVM formulation is

\min_{w, \gamma, \xi, \eta} \ C(e^\top \xi + e^\top \eta) + \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad Aw - (\gamma + 1)e + \xi \ge 0, \quad -Bw + (\gamma - 1)e + \eta \ge 0, \quad \xi \ge 0, \ \eta \ge 0,   (6)

where C > 0 is a fixed constant. Note that the constant C is now different due to an implicit rescaling of the problem. As we will show in Theorem 4.4 the RC-Margin (5) and classic inseparable SVM (6) are equivalent for appropriate choices of C and D.
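The RC-Hull problem (4) can be solved directly with a general constrained optimizer; the sketch below uses made-up two-dimensional data and an arbitrary reduction factor D, and is only meant to illustrate the geometry, not the specific algorithms discussed in this paper.

import numpy as np
from scipy.optimize import minimize

# Toy classes: rows of A are class A points, rows of B are class B points.
A = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0]])
B = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0]])
D = 0.8                                      # upper bound reducing the hulls

def objective(z):
    u, v = z[:len(A)], z[len(A):]
    diff = A.T @ u - B.T @ v
    return 0.5 * diff @ diff                 # RC-Hull objective (4)

cons = [{"type": "eq", "fun": lambda z: z[:len(A)].sum() - 1.0},
        {"type": "eq", "fun": lambda z: z[len(A):].sum() - 1.0}]
bnds = [(0.0, D)] * (len(A) + len(B))        # 0 <= u <= De, 0 <= v <= De
z0 = np.concatenate([np.full(len(A), 1.0 / len(A)), np.full(len(B), 1.0 / len(B))])
res = minimize(objective, z0, bounds=bnds, constraints=cons, method="SLSQP")

u, v = res.x[:len(A)], res.x[len(A):]
w = A.T @ u - B.T @ v                        # normal of the separating plane
print(w)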
Re
fe
u,v
s.t.
e u = e v = 1, u 0, v 0
(7)
B w + (
1)e 0
v (B w + (
1)e) = 0
e u = e v
v 0.
(8)
0
B w + e
=0
v (B w + e)
1=eu
= e v
v 0.
(9)
The theorems can be directly generalized to the inseparable case based on reduced convex hulls. The
Wolfe dual (for example, see Mangasarian, 1969) of
RC-Margin (5) is precisely the closest points in the
reduced convex hull problem, RC-Hull (4).
Theorem 4.3 (Reduced Convex Hulls is Dual).
The Wolfe dual of the RC-Margin (5) is RC-Hull (4), or equivalently:
max 12 kA u B vk
u,v
s.t.
e u = e v = 1, De u 0, De v 0
(10)
+1
,
=
, =
, = , = , u = , v =
and D =
1
2
kwk + + De + De
u (Aw e + )
v (Bw + e + ) r s
L
w = w A u + B v = 0
L
= 1 + e u = 0, u 0
= 1 e v = 0, v 0
L
= De u = r 0
L
= De v = s 0
(11)
Aw (
+ 1)e + 0
B w + (
1)e + 0
0
0
=0
u (Aw
(
+ 1)e + )
v (B w + (
1)e + ) = 0
fe
Re
where , R, w Rn , , u, r Rm , and , v, s
Rk . To simplify the problem, substitute in w = (A u
B v), r = De u and s = De v:
max
,,u,v
s.t.
1
2
kA u B vk ( ) + De + De
u A(A u B v) + v B(A u B v)
+ e u e v u v
De De + u + v
e u = e v = 1, De u 0, De v 0
(12)
w
= A u B v
e u = e v
Ce u
0
Ce v 0
(Ce
u) = 0
(Ce v) = 0.
(13)
re
s.t.
es
L(w, , , , , u, v, r, s) =
Aw
e + 0
+ 0
B w + e
0
0
=0
u (Aw
e + )
v (B w + e + ) = 0
nc
max
C
.
w
= A u
B v
1=eu
= e v
De u 0
De v 0
(De
u) = 0
(De v) = 0.
(14)
De + De +
Class B
Class A
Class A
es
Class B
fe
re
nc
We have shown that the classical 2-norm SVM formulation is equivalent to finding the closest points in the
reduced convex hulls. This same explanation works for
versions of SVM based on alternative norms. For example consider the case of finding the closest points in
the reduced convex hulls as measured by the infinitynorm:
min kA u B vk
Re
u,v
e u = e v = 1, De u 0, De v 0
(15)
s.t.
s.t.
e A u B v e
e u = e v = 1, De u 0, De v 0
(16)
u,v,
The dual is
max
w,,,,
s.t.
De De
Aw e + 0, 0
Bw + e + 0, 0
kwk1 = 1
(17)
w,,,
s.t.
Ce + Ce + kwk1
Aw ( + 1)e + 0, 0
Bw + ( 1)e + 0, 0
(18)
Similarly finding the closest points of the reduced convex hulls using the 1-norm is equivalent to constructing a SVM regularized using an infinity-norm on w.
Specifically, solving the problem
min kA u B vk 1
min
s.t.
Ce + Ce + kwk
Aw ( + 1)e + 0, 0
Bw + ( 1)e + 0, 0
(20)
6. Conclusion
Acknowledgements

This material is based on research supported by Microsoft Research and NSF Grants 949427 and IIS-9979860.

References
Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444-452.
Mangasarian, O. L. (1969). Nonlinear programming.
New York: McGrawHill.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. Advances in kernel methods - Support vector learning.
Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12, 1083-1121.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C.,
& Anthony, M. (1998). Structural risk minimization
over data-dependent hierarchies. IEEE Transactions
on Information Theory, 44, 1926-1940.
Vapnik, V. N. (1996). The nature of statistical learning
theory. New York: Wiley.
REF [9]
REF [10]
Support Vector Machine Active Learning
with Applications to Text Classification
Simon Tong
Daphne Koller
simon.tong@cs.stanford.edu
koller@cs.stanford.edu
Abstract
Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some number of them. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
Keywords: Active Learning, Selective Sampling, Support Vector Machines, Classification, Relevance Feedback

1. Introduction
In many supervised learning tasks, labeling instances to create a training set is time-consuming and costly; thus, finding ways to minimize the number of labeled instances is beneficial. Usually, the training set is chosen to be a random sampling of instances. However, in many cases active learning can be employed. Here, the learner can actively choose the training data. It is hoped that allowing the learner this extra flexibility will reduce the learner's need for large quantities of labeled data.
Pool-based active learning for classification was introduced by Lewis and Gale (1994).
The learner has access to a pool of unlabeled data and can request the true class label for
a certain number of instances in the pool. In many domains this is a reasonable approach
since a large quantity of unlabeled data is readily available. The main issue with active
learning is nding a way to choose good requests or queries from the pool.
Examples of situations in which pool-based active learning can be employed are:
Web searching. A Web-based company wishes to search the web for particular types
of pages (e.g., pages containing lists of journal publications). It employs a number of
people to hand-label some web pages so as to create a training set for an automatic
classifier that will eventually be used to classify the rest of the web. Since human
expertise is a limited resource, the company wishes to reduce the number of pages
the employees have to label. Rather than labeling pages randomly drawn from the
web, the computer requests targeted pages that it believes will be most informative
to label.
Email filtering. The user wishes to create a personalized automatic junk email filter.
In the learning phase the automatic learner has access to the user's past email files.
It interactively brings up past email and asks the user whether the displayed email is
junk mail or not. Based on the user's answer it brings up another email and queries
the user. The process is repeated some number of times and the result is an email
filter tailored to that specific person.
Relevance feedback. The user wishes to sort through a database or website for
items (images, articles, etc.) that are of personal interest (an "I'll know it when I
see it" type of search). The computer displays an item and the user tells the learner
whether the item is interesting or not. Based on the user's answer, the learner brings
up another item from the database. After some number of queries the learner then
returns a number of items in the database that it believes will be of interest to the
user.
The first two examples involve induction. The goal is to create a classifier that works
well on unseen future instances. The third example is an example of transduction (Vapnik,
1998). The learner's performance is assessed on the remaining instances in the database
rather than a totally independent test set.
We present a new algorithm that performs pool-based active learning with support
vector machines (SVMs). We provide theoretical motivations for our approach to choosing
the queries, together with experimental results showing that active learning with SVMs can
significantly reduce the need for labeled training instances.
We shall use text classification as a running example throughout this paper. This is
the task of determining to which pre-defined topic a given text document belongs. Text
classification has an important role to play, especially with the recent explosion of readily
available text data. There have been many approaches to achieve this goal (Rocchio, 1971,
Dumais et al., 1998, Sebastiani, 2001). Furthermore, it is also a domain in which SVMs
have shown notable success (Joachims, 1998, Dumais et al., 1998) and it is of interest to
see whether active learning can offer further improvement over this already highly effective
method.
The remainder of the paper is structured as follows. Section 2 discusses the use of
SVMs both in terms of induction and transduction. Section 3 then introduces the notion
of a version space and Section 4 provides theoretical motivation for three methods for
performing active learning with SVMs. In Section 5 we present experimental results for
two real-world text domains that indicate that active learning can significantly reduce the
need for labeled instances in practice. We conclude in Section 7 with some discussion of the
potential significance of our results and some directions for future work.
Figure 1: (a) A simple linear support vector machine. (b) An SVM (dotted line) and a
transductive SVM (solid line). Solid circles represent unlabeled instances.

2. Support Vector Machines
Support vector machines (Vapnik, 1982) have strong theoretical foundations and excellent
empirical successes. They have been applied to tasks such as handwritten digit recognition,
object recognition, and text classification.
2.1 SVMs for Induction
We shall consider SVMs in the binary classification setting. We are given training data
{x_1 . . . x_n} that are vectors in some space X ⊆ R^d. We are also given their labels {y_1 . . . y_n}
where y_i ∈ {−1, 1}. In their simplest form, SVMs are hyperplanes that separate the training
data by a maximal margin (see Fig. 1a). All vectors lying on one side of the hyperplane
are labeled as −1, and all vectors lying on the other side are labeled as 1. The training
instances that lie closest to the hyperplane are called support vectors. More generally, SVMs
allow one to project the original training data in space X to a higher dimensional feature
space F via a Mercer kernel operator K. In other words, we consider the set of classifiers
of the form:
of the form:
f(x) = Σ_{i=1}^{n} α_i K(x_i, x) .    (1)
When K satisfies Mercer's condition (Burges, 1998) we can write: K(u, v) = Φ(u) · Φ(v),
where Φ : X → F and · denotes an inner product. We can then rewrite f as:
f(x) = w · Φ(x),   where   w = Σ_{i=1}^{n} α_i Φ(x_i) .    (2)
Thus, by using K we are implicitly projecting the training data into a different (often
higher dimensional) feature space F. The SVM then computes the α_i's that correspond
to the maximal margin hyperplane in F. By choosing different kernel functions we can
implicitly project the training data from X into spaces F for which hyperplanes in F
correspond to more complex decision boundaries in the original space X.
Two commonly used kernels are the polynomial kernel given by K(u, v) = (u · v + 1)^p,
which induces polynomial boundaries of degree p in the original space X,¹ and the radial basis
function kernel K(u, v) = e^{−(u−v)·(u−v)}, which induces boundaries by placing weighted
Gaussians upon key training instances. For the majority of this paper we will assume that
the modulus of the training data feature vectors is constant, i.e., for all training instances
x_i, ‖Φ(x_i)‖ = λ for some fixed λ. The quantity ‖Φ(x_i)‖ is always constant for radial basis
function kernels, and so the assumption has no effect for this kernel. For ‖Φ(x_i)‖ to be
constant with the polynomial kernels we require that ‖x_i‖ be constant. It is possible to
relax this constraint on ‖Φ(x_i)‖ and we shall discuss this at the end of Section 4.
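As an illustrative sketch (not part of the original text), Eq. (1) and the two kernels above can be evaluated directly in NumPy; the coefficients alpha are assumed to be supplied by some SVM training procedure.

    import numpy as np

    def polynomial_kernel(u, v, p=2):
        # K(u, v) = (u . v + 1)^p : induces degree-p boundaries in the input space X
        return (np.dot(u, v) + 1.0) ** p

    def rbf_kernel(u, v):
        # K(u, v) = exp(-(u - v) . (u - v)) : weighted Gaussians on key training points
        diff = u - v
        return np.exp(-np.dot(diff, diff))

    def decision_function(x, train_x, alpha, kernel):
        # Eq. (1): f(x) = sum_i alpha_i K(x_i, x); sign(f(x)) gives the predicted label
        return sum(a * kernel(xi, x) for a, xi in zip(alpha, train_x))

    # toy usage with made-up coefficients (alpha would normally come from training)
    train_x = np.array([[0.0, 1.0], [1.0, 0.0]])
    alpha = np.array([0.7, -0.7])
    print(decision_function(np.array([0.2, 0.9]), train_x, alpha, rbf_kernel))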
2.2 SVMs for Transduction
The previous subsection worked within the framework of induction. There was a labeled
training set of data and the task was to create a classifier that would have good performance
on unseen test data. In addition to regular induction, SVMs can also be used for transduction. Here we are first given a set of both labeled and unlabeled data. The learning task is
to assign labels to the unlabeled data as accurately as possible. SVMs can perform transduction by finding the hyperplane that maximizes the margin relative to both the labeled
and unlabeled data. See Figure 1b for an example. Recently, transductive SVMs (TSVMs)
have been used for text classification (Joachims, 1999b), attaining some improvements in
precision/recall breakeven performance over regular inductive SVMs.

3. Version Space
Given a set of labeled training data and a Mercer kernel K, there is a set of hyperplanes that
separate the data in the induced feature space F. We call this set of consistent hypotheses
the version space (Mitchell, 1982). In other words, hypothesis f is in version space if for
every training instance x_i with label y_i we have that f(x_i) > 0 if y_i = 1 and f(x_i) < 0 if
y_i = −1. More formally:
Definition 1 Our set of possible hypotheses is given as:

H = { f | f(x) = (w · Φ(x)) / ‖w‖ ,  where w ∈ W } ,

where our parameter space W is simply equal to F. The version space, V, is then defined
as:

V = { f ∈ H | ∀i ∈ {1 . . . n} :  y_i f(x_i) > 0 } .

Notice that since H is a set of hyperplanes, there is a bijection between unit vectors w and
hypotheses f in H. Thus we will redefine V as:

V = { w ∈ W | ‖w‖ = 1 ,  y_i (w · Φ(x_i)) > 0 ,  i = 1 . . . n } .
1. We have not introduced a bias weight in Eq. (2). Thus, the simple Euclidean inner product will produce
hyperplanes that pass through the origin. However, a polynomial kernel of degree one induces hyperplanes
that do not need to pass through the origin.
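The definition above translates into a direct membership test. The sketch below is hypothetical (it works with explicit feature vectors Φ(x_i), whereas in practice these are only accessed through the kernel):

    import numpy as np

    def in_version_space(w, phi_x, y):
        # w lies in V iff ||w|| = 1 and y_i (w . phi(x_i)) > 0 for every training instance
        w = np.asarray(w, dtype=float)
        if not np.isclose(np.linalg.norm(w), 1.0):
            return False
        margins = y * (phi_x @ w)          # all y_i (w . phi(x_i)) at once
        return bool(np.all(margins > 0))

    # toy usage: two 2-d feature vectors with opposite labels
    phi_x = np.array([[1.0, 0.2], [-0.8, -0.4]])
    y = np.array([1, -1])
    print(in_version_space(np.array([1.0, 0.0]), phi_x, y))   # True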
Figure 2: (a) Version space duality. The surface of the hypersphere represents unit weight
vectors. Each of the two hyperplanes corresponds to a labeled training instance.
Each hyperplane restricts the area on the hypersphere in which consistent hypotheses can lie. Here, the version space is the surface segment of the hypersphere
closest to the camera. (b) An SVM classifier in a version space. The dark embedded sphere is the largest radius sphere whose center lies in the version space
and whose surface does not intersect with the hyperplanes. The center of the embedded sphere corresponds to the SVM, its radius is proportional to the margin
of the SVM in F, and the training points corresponding to the hyperplanes that
it touches are the support vectors.
Note that a version space only exists if the training data are linearly separable in the
feature space. Thus, we require linear separability of the training data in the feature space.
This restriction is much less harsh than it might at first seem. First, the feature space often
has a very high dimension and so in many cases it results in the data set being linearly
separable. Second, as noted by Shawe-Taylor and Cristianini (1999), it is possible to modify
any kernel so that the data in the new induced feature space is linearly separable.²
There exists a duality between the feature space F and the parameter space W (Vapnik,
1998, Herbrich et al., 2001) which we shall take advantage of in the next section: points in
F correspond to hyperplanes in W and vice versa.
By definition, points in W correspond to hyperplanes in F. The intuition behind the
converse is that observing a training instance x_i in the feature space restricts the set of
separating hyperplanes to ones that classify x_i correctly. In fact, we can show that the set
of allowable points w in W is restricted to lie on one side of a hyperplane in W.
2. This is done by redefining for all training instances x_i: K(x_i, x_i) ← K(x_i, x_i) + ν, where ν is a positive
regularization constant. This essentially achieves the same effect as the soft margin error function (Cortes
and Vapnik, 1995) commonly used in SVMs. It permits the training data to be linearly non-separable
in the original feature space.
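A minimal sketch of the modification described in the footnote, assuming the training kernel values have been collected in a Gram matrix K (the symbol ν follows the reconstruction above):

    import numpy as np

    def regularize_gram(K, nu):
        # Add the positive constant nu to every diagonal entry K(x_i, x_i); the data
        # become linearly separable in the feature space induced by the new kernel.
        K = np.asarray(K, dtype=float)
        return K + nu * np.eye(K.shape[0])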
With this duality in hand, the maximal margin SVM of Section 2 can be written as the
following optimization problem:

maximize_{w∈F}   min_i { y_i (w · Φ(x_i)) }
subject to:      ‖w‖ = 1
                 y_i (w · Φ(x_i)) > 0 ,   i = 1 . . . n.
By having the conditions ‖w‖ = 1 and y_i (w · Φ(x_i)) > 0 we cause the solution to lie in the
version space. Now, we can view the above problem as finding the point w in the version
space that maximizes the distance min_i { y_i (w · Φ(x_i)) }. From the duality between feature
and parameter space, and since ‖Φ(x_i)‖ = λ, each Φ(x_i)/λ is a unit normal vector of a
hyperplane in parameter space. Because of the constraints y_i (w · Φ(x_i)) > 0, i = 1 . . . n,
each of these hyperplanes delimits the version space. The expression y_i (w · Φ(x_i)) can be
regarded as:
λ times the distance between the point w and the hyperplane with normal vector Φ(x_i).
Thus, we want to find the point w in the version space that maximizes the minimum
distance to any of the delineating hyperplanes. That is, SVMs find the center of the largest
radius hypersphere whose center can be placed in the version space and whose surface does
not intersect with the hyperplanes corresponding to the labeled instances, as in Figure 2b.
The normals of the hyperplanes that are touched by the maximal radius hypersphere are
the Φ(x_i) for which the distance y_i (w · Φ(x_i)) is minimal. Now, taking the original rather
than the dual view, and regarding w as the unit normal vector of the SVM and Φ(x_i) as
points in feature space, we see that the hyperplanes that are touched by the maximal radius
hypersphere correspond to the support vectors (i.e., the labeled points that are closest to
the SVM hyperplane boundary).
The radius of the sphere is the distance from the center of the sphere to one of the
touching hyperplanes and is given by y_i (w · Φ(x_i)/λ) where Φ(x_i) is a support vector.
Now, viewing w as a unit normal vector of the SVM and Φ(x_i) as points in feature space,
we have that the distance y_i (w · Φ(x_i)/λ) is 1/λ times the distance between support vector
Φ(x_i) and the hyperplane with normal vector w, which is the margin of the SVM divided
by λ. Thus, the radius of the sphere is proportional to the margin of the SVM.
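The geometric argument can be checked numerically. The following sketch is illustrative only (λ is the assumed constant norm of the feature vectors); it normalizes a weight vector and returns min_i y_i (w · Φ(x_i))/λ, the radius of the largest hypersphere centred at w that fits inside the version space:

    import numpy as np

    def version_space_radius(w, phi_x, y):
        # Distance from the unit vector w to the hyperplane with unit normal phi(x_i)/lam
        # is y_i (w . phi(x_i)) / lam; the sphere radius is the smallest such distance.
        w = np.asarray(w, dtype=float)
        w = w / np.linalg.norm(w)
        lam = np.linalg.norm(phi_x, axis=1)     # assumed (approximately) constant modulus
        return float(np.min(y * (phi_x @ w) / lam))

    phi_x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    y = np.array([1, 1, -1])
    print(version_space_radius(np.array([0.6, 0.8]), phi_x, y))   # 0.6; a negative value
                                                                  # would mean w lies outside V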
4. Active Learning
In pool-based active learning we have a pool of unlabeled instances. It is assumed that
the instances x are independently and identically distributed according to some underlying
distribution F (x) and the labels are distributed according to some conditional distribution
P (y | x).
Given an unlabeled pool U, an active learner ℓ has three components: (f, q, X). The
first component is a classifier, f : X → {−1, 1}, trained on the current set of labeled data X
(and possibly unlabeled instances in U too). The second component q(X) is the querying
function that, given a current labeled set X, decides which instance in U to query next.
The active learner can return a classifier f after each query (online learning) or after some
fixed number of queries.
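The (f, q, X) decomposition can be written as a small driver loop. The skeleton below is a hypothetical sketch, not the authors' implementation; train_svm, query_fn and oracle are assumed to be supplied by the caller.

    def active_learn(pool, oracle, train_svm, query_fn, num_queries, seed_labeled):
        # Pool-based active learning skeleton.
        #   pool         -- list of unlabeled instances U
        #   oracle       -- callable returning the true label (+1/-1) of an instance
        #   train_svm    -- maps a labeled set X (list of (x, y) pairs) to a classifier f
        #   query_fn     -- the querying component q: returns the index of the next query in U
        #   seed_labeled -- the initial labeled set X
        labeled = list(seed_labeled)
        unlabeled = list(pool)
        f = train_svm(labeled)
        for _ in range(num_queries):
            idx = query_fn(f, labeled, unlabeled)   # decide which instance in U to query next
            x = unlabeled.pop(idx)
            labeled.append((x, oracle(x)))          # request the label and add it to X
            f = train_svm(labeled)                  # a classifier may be returned after each query
        return f, labeled

Passive learning corresponds to a query_fn that simply picks a random index from the pool.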
The main difference between an active learner and a passive learner is the querying
component q. This brings us to the issue of how to choose the next unlabeled instance to
query. Similar to Seung et al. (1992), we use an approach that queries points so as to attempt
to reduce the size of the version space as much as possible. We take a myopic approach
that greedily chooses the next query based on this criterion. We also note that myopia is a
standard approximation used in sequential decision making problems (Horvitz and Rutledge,
1991, Latombe, 1991, Heckerman et al., 1994). We need two more definitions before we
can proceed:
Definition 2 Area(V) is the surface area that the version space V occupies on the hypersphere ‖w‖ = 1.

Definition 3 Given an active learner ℓ, let V_i denote the version space of ℓ after i queries
have been made. Now, given the (i + 1)th query x_{i+1}, define:

V_i^− = V_i ∩ { w ∈ W | −(w · Φ(x_{i+1})) > 0 } ,
V_i^+ = V_i ∩ { w ∈ W | +(w · Φ(x_{i+1})) > 0 } .

So V_i^− and V_i^+ denote the resulting version spaces when the next query x_{i+1} is labeled as
−1 and 1 respectively.
We wish to reduce the version space as fast as possible. Intuitively, one good way of
doing this is to choose a query that halves the version space. The following lemma says that,
for any given number of queries, the learner that chooses successive queries that halve
the version space is the learner that minimizes the maximum expected size of the version
space, where the maximum is taken over all conditional distributions of y given x:
Lemma 4 Suppose we have an input space X, finite dimensional feature space F (induced
via a kernel K), and parameter space W. Suppose active learner ℓ* always queries instances
whose corresponding hyperplanes in parameter space W halve the area of the current version
space. Let ℓ be any other active learner. Denote the version spaces of ℓ* and ℓ after i queries
as V_i* and V_i respectively. Let P denote the set of all conditional distributions of y given x.
Then,

∀ i ∈ N⁺ :   sup_{P∈P} E_P[Area(V_i*)]  ≤  sup_{P∈P} E_P[Area(V_i)] ,

with strict inequality whenever there exists a query j ∈ {1 . . . i} by ℓ that does not halve
version space V_{j−1}.
Proof. The proof is straightforward. The learner ℓ* always chooses to query instances
that halve the version space. Thus Area(V*_{i+1}) = ½ Area(V*_i) no matter what the labeling
of the query points are. Let r denote the dimension of feature space F. Then r is also the
dimension of the parameter space W. Let S_r denote the surface area of the unit hypersphere
of dimension r. Then, under any conditional distribution P, Area(V*_i) = S_r / 2^i.
Now, suppose ℓ does not always query an instance that halves the area of the version
space. Then after some number, k, of queries ℓ first chooses to query a point x_{k+1} that
does not halve the current version space V_k. Let y_{k+1} ∈ {−1, 1} correspond to the labeling
of x_{k+1} that will cause the larger half of the version space to be chosen.
Without loss of generality assume Area(V_k^−) > Area(V_k^+) and so y_{k+1} = −1. Note that
Area(V_k^−) + Area(V_k^+) = S_r / 2^k, so we have that Area(V_k^−) > S_r / 2^{k+1}.
Now consider the conditional distribution P_0:

P_0(−1 | x) = 1/2   if x ≠ x_{k+1} ,
              1     if x = x_{k+1} .
Then, under P_0,

E_{P_0}[Area(V_i)]  ≥  (1 / 2^{i−k−1}) Area(V_k^−)  >  S_r / 2^i .

Hence, ∀ i > k,

sup_{P∈P} E_P[Area(V_i*)]  <  sup_{P∈P} E_P[Area(V_i)] .
Now, suppose w* ∈ W is the unit parameter vector corresponding to the SVM that we
would have obtained had we known the actual labels of all of the data in the pool. We
know that w* must lie in each of the version spaces V_1 ⊃ V_2 ⊃ V_3 ⊃ . . ., where V_i denotes the
version space after i queries. Thus, by shrinking the size of the version space as much as
possible with each query, we are reducing as fast as possible the space in which w* can lie.
Hence, the SVM that we learn from our limited number of queries will lie close to w*.
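As a quick worked instance of the halving argument (the arithmetic is illustrative, not a result from the experiments): twenty queries that each halve the version space leave

Area(V*_{20}) = S_r / 2^{20} ≈ 9.5 × 10^{-7} · S_r ,

so after only a few dozen ideal queries the region in which w* can still lie is a vanishingly small fraction of the hypersphere.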
If one is willing to assume that there is a hypothesis lying within H that generates the
data and that the generating hypothesis is deterministic and that the data are noise free,
then strong generalization performance properties of an algorithm that halves version space
can also be shown (Freund et al., 1997). For example one can show that the generalization
error decreases exponentially with the number of queries.
Figure 3: (a) Simple Margin will query b. (b) Simple Margin will query a.
Figure 4: (a) MaxMin Margin will query b. The two SVMs with margins m− and m+ for b
are shown. (b) Ratio Margin will query e. The two SVMs with margins m− and
m+ for e are shown.
This discussion provides motivation for an approach where we query instances that split
the current version space into two equal parts as much as possible. Given an unlabeled
instance x from the pool, it is not practical to explicitly compute the sizes of the new
version spaces V− and V+ (i.e., the version spaces obtained when x is labeled as −1 and
+1 respectively). We next present three ways of approximating this procedure.
Simple Margin. Recall from Section 3 that, given some data {x_1 . . . x_i} and labels
{y_1 . . . y_i}, the SVM unit vector w_i obtained from this data is the center of the largest
hypersphere that can fit inside the current version space V_i. The position of w_i in
the version space V_i clearly depends on the shape of the region V_i; however, it is
often approximately in the center of the version space. Now, we can test each of the
unlabeled instances x in the pool to see how close their corresponding hyperplanes
in W come to the centrally placed w_i. The closer a hyperplane in W is to the point
w_i, the more centrally it is placed in the version space, and the more it bisects the
version space. Thus we can pick the unlabeled instance in the pool whose hyperplane
in W comes closest to the vector w_i. For each unlabeled instance x, the shortest
distance between its hyperplane in W and the vector w_i is simply the distance between
the feature vector Φ(x) and the hyperplane w_i in F, which is easily computed by
|w_i · Φ(x)|. This results in the natural rule: learn an SVM on the existing labeled
data and choose as the next instance to query the instance that comes closest to the
hyperplane in F.
Figure 3a presents an illustration. In the stylized picture we have flattened out the
surface of the unit weight vector hypersphere that appears in Figure 2a. The white
area is version space V_i which is bounded by solid lines corresponding to labeled
instances. The five dotted lines represent unlabeled instances in the pool. The circle
represents the largest radius hypersphere that can fit in the version space. Note that
the edges of the circle do not touch the solid lines, just as the dark sphere in 2b
does not meet the hyperplanes on the surface of the larger hypersphere (they meet
somewhere under the surface). The instance b is closest to the SVM w_i and so we
will choose to query b.
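In code, the Simple Margin rule is one line of arithmetic per candidate. A minimal sketch, assuming the current classifier is available as a callable f with f(x) = w_i · Φ(x) (so that |f(x)| is the distance of Φ(x) to the hyperplane), and matching the query-function signature of the loop sketched above:

    def simple_margin_query(f, labeled, unlabeled):
        # Query the unlabeled instance whose feature vector lies closest to the current
        # SVM hyperplane, i.e. the one minimising |w_i . phi(x)| = |f(x)|.
        distances = [abs(f(x)) for x in unlabeled]
        return distances.index(min(distances))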
MaxMin Margin. The Simple Margin method can be a rather rough approximation.
It relies on the assumption that the version space is fairly symmetric and that w_i is
centrally placed. It has been demonstrated, both in theory and practice, that these
assumptions can fail significantly (Herbrich et al., 2001). Indeed, if we are not careful
we may actually query an instance whose hyperplane does not even intersect the
version space. The MaxMin approximation is designed to overcome these problems to
some degree. Given some data {x_1 . . . x_i} and labels {y_1 . . . y_i}, the SVM unit vector
w_i is the center of the largest hypersphere that can fit inside the current version
space V_i and the radius m_i of the hypersphere is proportional³ to the size of the
margin of w_i. We can use the radius m_i as an indication of the size of the version
space (Vapnik, 1998). Suppose we have a candidate unlabeled instance x in the pool.
We can estimate the relative size of the resulting version space V− by labeling x as −1,
finding the SVM obtained from adding x to our labeled training data and looking at
the size of its margin m−. We can perform a similar calculation for V+ by relabeling
x as class +1 and finding the resulting SVM to obtain margin m+.
Since we want an equal split of the version space, we wish Area(V−) and Area(V+) to
be similar. Now, consider min(Area(V−), Area(V+)). It will be small if Area(V−) and
Area(V+) are very different. Thus we will consider min(m−, m+) as an approximation
and we will choose to query the x for which this quantity is largest. Hence, the MaxMin
query algorithm is as follows: for each unlabeled instance x compute the margins m−
and m+ of the SVMs obtained when we label x as −1 and +1 respectively; then choose
to query the unlabeled instance for which the quantity min(m−, m+) is greatest.
Figures 3b and 4a show an example comparing the Simple Margin and MaxMin Margin
methods.
3. To ease notation, without loss of generality we shall assume the constant of proportionality is 1, i.e.,
the radius is equal to the margin.

Ratio Margin. This method is similar in spirit to the MaxMin Margin method. We
use m− and m+ as indications of the sizes of V− and V+. However, we shall try to
take into account the fact that the current version space V_i may be quite elongated
and for some x in the pool both m− and m+ may be small simply because of the shape
of version space. Thus we will instead look at the relative sizes of m− and m+ and
choose to query the x for which min(m−/m+, m+/m−) is largest (see Figure 4b).
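Both retraining-based rules can be phrased as one scoring pass over the pool. The sketch below is illustrative only: train_svm_margin is an assumed helper that trains an SVM on a labeled set and returns the size of its margin, and the two scores follow the MaxMin and Ratio descriptions above.

    def candidate_margins(train_svm_margin, labeled, x):
        # Margins m_minus / m_plus of the SVMs obtained when x is labeled -1 / +1.
        m_minus = train_svm_margin(labeled + [(x, -1)])
        m_plus = train_svm_margin(labeled + [(x, +1)])
        return m_minus, m_plus

    def maxmin_query(train_svm_margin, labeled, unlabeled):
        # MaxMin Margin: query the x maximising min(m_minus, m_plus).
        scores = []
        for x in unlabeled:
            m_minus, m_plus = candidate_margins(train_svm_margin, labeled, x)
            scores.append(min(m_minus, m_plus))
        return scores.index(max(scores))

    def ratio_query(train_svm_margin, labeled, unlabeled):
        # Ratio Margin: query the x maximising min(m_minus/m_plus, m_plus/m_minus),
        # which compensates for an elongated version space where both margins are small.
        scores = []
        for x in unlabeled:
            m_minus, m_plus = candidate_margins(train_svm_margin, labeled, x)
            scores.append(min(m_minus / m_plus, m_plus / m_minus))
        return scores.index(max(scores))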
The above three methods are approximations to the querying component that always
halves version space. After performing some number of queries we then return a classifier
by learning an SVM with the labeled instances.
The margin can be used as an indication of the version space size irrespective of whether
the feature vectors have constant modulus. Thus the explanation for the MaxMin and Ratio
methods still holds even without the constraint on the modulus of the training feature
vectors. The Simple method can still be used when the training feature vectors do not
have constant modulus, but the motivating explanation no longer holds since the maximal
margin hyperplane can no longer be viewed as the center of the largest allowable sphere.
However, for the Simple method, alternative motivations have recently been proposed by
Campbell et al. (2000) that do not require the constraint on the modulus.
For inductive learning, after performing some number of queries we then return a classifier by learning an SVM with the labeled instances. For transductive learning, after querying
some number of instances we then return a classifier by learning a transductive SVM with
the labeled and unlabeled instances.
5. Experiments
For our empirical evaluation of the above methods we used two real-world text classification
domains: the Reuters-21578 data set and the Newsgroups data set.
Figure 5: (a) Average test set accuracy over the ten most frequently occurring topics when
using a pool size of 1000. (b) Average test set precision/recall breakeven point
over the ten most frequently occurring topics when using a pool size of 1000.

Topic      Simple          MaxMin          Ratio           Equivalent Random size
Earn       86.39 ± 1.65    87.75 ± 1.40    90.24 ± 2.31    34
Acq        77.04 ± 1.17    77.08 ± 2.00    80.42 ± 1.50    > 100
Money-fx   93.82 ± 0.35    94.80 ± 0.14    94.83 ± 0.13    50
Grain      95.53 ± 0.09    95.29 ± 0.38    95.55 ± 1.22    13
Crude      95.26 ± 0.38    95.26 ± 0.15    95.35 ± 0.21    > 100
Trade      96.31 ± 0.28    96.64 ± 0.10    96.60 ± 0.15    > 100
Interest   96.15 ± 0.21    96.55 ± 0.09    96.43 ± 0.09    > 100
Ship       97.75 ± 0.11    97.81 ± 0.09    97.66 ± 0.12    > 100
Wheat      98.10 ± 0.24    98.48 ± 0.09    98.13 ± 0.20    > 100
Corn       98.31 ± 0.19    98.56 ± 0.05    98.30 ± 0.19    15

Table 1: Average test set accuracy over the top ten most frequently occurring topics (most
frequent topic first) when trained with ten labeled documents. Boldface indicates
statistical significance.
For each of the ten topics we performed the following steps. We created a pool of
unlabeled data by sampling 1000 documents from the remaining data and removing their
labels. We then randomly selected two documents in the pool to give as the initial labeled
training set. One document was about the desired topic, and the other document was
not about the topic. Thus we gave each learner 998 unlabeled documents and 2 labeled
documents. After a fixed number of queries we asked each learner to return a classifier (an
SVM with a polynomial kernel of degree one⁷ learned on the labeled training documents).
We then tested the classifier on the independent test set.

Topic      Simple          MaxMin          Ratio           Equivalent Random size
Earn       86.05 ± 0.61    89.03 ± 0.53    88.95 ± 0.74    12
Acq        54.14 ± 1.31    56.43 ± 1.40    57.25 ± 1.61    12
Money-fx   35.62 ± 2.34    38.83 ± 2.78    38.27 ± 2.44    52
Grain      50.25 ± 2.72    58.19 ± 2.04    60.34 ± 1.61    51
Crude      58.22 ± 3.15    55.52 ± 2.42    58.41 ± 2.39    55
Trade      50.71 ± 2.61    48.78 ± 2.61    50.57 ± 1.95    85
Interest   40.61 ± 2.42    45.95 ± 2.61    43.71 ± 2.07    60
Ship       53.93 ± 2.63    52.73 ± 2.95    53.75 ± 2.85    > 100
Wheat      64.13 ± 2.10    66.71 ± 1.65    66.57 ± 1.37    > 100
Corn       49.52 ± 2.12    48.04 ± 2.01    46.25 ± 2.18    > 100

Table 2: Average test set precision/recall breakeven point over the top ten most frequently
occurring topics (most frequent topic first) when trained with ten labeled documents. Boldface indicates statistical significance.
The above procedure was repeated thirty times for each topic and the results were
averaged. We considered the Simple Margin, MaxMin Margin and Ratio Margin querying
methods as well as a Random Sample method. The Random Sample method simply randomly chooses the next query point from the unlabeled pool. This last method reflects what
happens in the regular passive learning setting: the training set is a random sampling of
the data.
To measure performance we used two metrics: test set classification error and, to
stay compatible with previous Reuters corpus results, the precision/recall breakeven point
(Joachims, 1998). Precision is the percentage of documents a classifier labels as relevant
that are really relevant. Recall is the percentage of relevant documents that are labeled as
relevant by the classifier. By altering the decision threshold on the SVM we can trade precision for recall and can obtain a precision/recall curve for the test set. The precision/recall
breakeven point is a one number summary of this graph: it is the point at which precision
equals recall.
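The breakeven point can be computed directly from the classifier scores on the test set. A rough sketch (hypothetical code; with a finite test set precision rarely equals recall exactly, so the closest crossing is returned):

    import numpy as np

    def breakeven_point(scores, labels):
        # scores: real-valued SVM outputs; labels: true labels in {-1, +1}.
        # Lowering the threshold trades precision for recall; return the point at
        # which the two are (approximately) equal. Assumes at least one relevant doc.
        order = np.argsort(-np.asarray(scores))            # highest score first
        labels = np.asarray(labels)[order]
        n_relevant = np.sum(labels == 1)
        best = None
        for k in range(1, len(labels) + 1):                # predict top-k as relevant
            tp = np.sum(labels[:k] == 1)
            precision = tp / k
            recall = tp / n_relevant
            if best is None or abs(precision - recall) < best[0]:
                best = (abs(precision - recall), (precision + recall) / 2.0)
        return best[1]

    print(breakeven_point([2.0, 1.1, 0.3, -0.4, -2.2], [1, 1, -1, 1, -1]))   # ~0.667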
Figures 5a and 5b present the average test set accuracy and precision/recall breakeven
points over the ten topics as we vary the number of queries permitted. The horizontal line
is the performance level achieved when the SVM is trained on all 1000 labeled documents
comprising the pool. Over the Reuters corpus, the three active learning methods perform
almost identically with little notable difference to distinguish between them. Each method
also appreciably outperforms random sampling. Tables 1 and 2 show the test set accuracy
and breakeven performance of the active methods after they have asked for just eight labeled
instances (so, together with the initial two random instances, they have seen ten labeled
instances). They demonstrate that the three active methods perform similarly on this
Reuters data set after eight queries, with the MaxMin and Ratio showing a very slight edge in
performance. The last columns in each table are of more interest. They show approximately
how many instances would be needed if we were to use Random to achieve the same level
of performance as the Ratio active learning method. In this instance, passive learning on
average requires over six times as much data to achieve comparable levels of performance as
the active learning methods. The tables indicate that active learning provides more benefit
with the infrequent classes, particularly when measuring performance by the precision/recall
breakeven point. This last observation has also been noted before in previous empirical
tests (McCallum and Nigam, 1998).

7. For SVM and transductive SVM learning we used SVMlight (Joachims, 1999a).

Figure 6: (a) Average test set accuracy over the ten most frequently occurring topics when
using a pool size of 1000. (b) Average test set precision/recall breakeven point
over the ten most frequently occurring topics when using a pool size of 1000.
We noticed that approximately half of the queries that the active learning methods
asked tended to turn out to be positively labeled, regardless of the true overall proportion
of positive instances in the domain. We investigated whether the gains that the active
learning methods had over regular Random sampling were due to this biased sampling. We
created a new querying method called BalancedRandom which would randomly sample an
equal number of positive and negative instances from the pool. Obviously in practice the
ability to randomly sample an equal number of positive and negative instances without
having to label an entire pool of instances first may or may not be reasonable depending
upon the domain in question. Figures 6a and 6b show the average accuracy and breakeven
point of the BalancedRandom method compared with the Ratio active method and regular
Random method on the Reuters dataset with a pool of 1000 unlabeled instances. The Ratio
and Random curves are the same as those shown in Figures 5a and 5b. The MaxMin and
Simple curves are omitted to ease legibility. The BalancedRandom method has a much better precision/recall breakeven performance than the regular Random method, although it is
still matched and then outperformed by the active method. For classification accuracy, the
BalancedRandom method initially has extremely poor performance (less than 50%, which is
even worse than pure random guessing) and is always consistently and significantly outperformed by the active method. This indicates that the performance gains of the active
methods are not merely due to their ability to bias the class of the instances they query.
The active methods are choosing special targeted instances and approximately half of these
instances happen to have positive labels.

Figure 7: (a) Average test set accuracy over the ten most frequently occurring topics when
using pool sizes of 500 and 1000. (b) Average breakeven point over the ten
most frequently occurring topics when using pool sizes of 500 and 1000.
Figures 7a and 7b show the average accuracy and breakeven point of the Ratio method
with two different pool sizes. Clearly the Random sampling method's performance will not be
affected by the pool size. However, the graphs indicate that increasing the pool of unlabeled
data will improve both the accuracy and breakeven performance of active learning. This is
quite intuitive since a good active method should be able to take advantage of a larger pool
of potential queries and ask more targeted questions.
We also investigated active learning in a transductive setting. Here we queried the
points as usual except now each method (Simple and Random) returned a transductive
SVM trained on both the labeled and remaining unlabeled data in the pool. As described
by Joachims (1998) the breakeven point for a TSVM was computed by gradually altering
the number of unlabeled instances that we wished the TSVM to label as positive. This
involves re-learning the TSVM multiple times and was computationally intensive. Since
our setting was transduction, the performance of each classifier was measured on the pool
of data rather than a separate test set. This reflects the relevance feedback transductive
inference example presented in the introduction.
Figure 8 shows that using a TSVM provides a slight advantage over a regular SVM in
both querying methods (Random and Simple) when comparing breakeven points. However,
the graph also shows that active learning provides notably more benefit than transduction:
indeed, using a TSVM with a Random querying method needs over 100 queries to achieve
the same breakeven performance as a regular SVM with a Simple method that has only seen
20 labeled instances.

Figure 8: Average pool set precision/recall breakeven point over the ten most frequently
occurring topics when using a pool size of 1000.

Figure 9: (a) Average test set accuracy over the five comp. topics when using a pool size
of 500. (b) Average test set accuracy for comp.sys.ibm.pc.hardware with a 500
pool size.
5.2 Newsgroups Experiments
Our second data collection was K. Lang's Newsgroups collection (Lang, 1995). We used the
five comp. groups, discarding the Usenet headers and subject lines. We processed the text
documents exactly as before, resulting in vectors of about 10000 dimensions.
Figure 10: (a) A simple example of querying unlabeled clusters. (b) Macro-average test
set accuracy for comp.os.ms-windows.misc and comp.sys.ibm.pc.hardware where
Hybrid uses the Ratio method for the first ten queries and Simple for the rest.
We placed half of the 5000 documents aside to use as an independent test set, and
repeatedly, randomly chose a pool of 500 documents from the remaining instances. We
performed twenty runs for each of the five topics and averaged the results. We used test
set accuracy to measure performance. Figure 9a contains the learning curve (averaged
over all of the results for the five comp. topics) for the three active learning methods
and Random sampling. Again, the horizontal line indicates the performance of an SVM
that has been trained on the entire pool. There is no appreciable difference between the
MaxMin and Ratio methods but, in two of the five newsgroups (comp.sys.ibm.pc.hardware
and comp.os.ms-windows.misc) the Simple active learning method performs notably worse
than the MaxMin and Ratio methods. Figure 9b shows the average learning curve for the
comp.sys.ibm.pc.hardware topic. In around ten to fifteen percent of the runs for both of
the two newsgroups the Simple method was misled and performed extremely poorly (for
instance, achieving only 25% accuracy even with fifty training instances, which is worse
than just randomly guessing a label!). This indicates that the Simple querying method may
be more unstable than the other two methods.
One reason for this could be that the Simple method tends not to explore the feature
space as aggressively as the other active methods, and can end up ignoring entire clusters
of unlabeled instances. In Figure 10a, the Simple method takes several queries before it
even considers an instance in the unlabeled cluster while both the MaxMin and Ratio query
a point in the unlabeled cluster immediately.
While MaxMin and Ratio appear more stable they are much more computationally intensive. With a large pool of s instances, they require about 2s SVMs to be learned for each
query. Most of the computational cost is incurred when the number of queries that have
already been asked is large. The reason is that the cost of training an SVM grows polynomially with the size of the labeled training set and so now training each SVM is costly (taking
over 20 seconds to generate the 50th query on a Sun Ultra 60 450MHz workstation with a
pool of 500 documents). However, when the quantity of labeled data is small, even with
a large pool size, MaxMin and Ratio are fairly fast (taking a few seconds per query) since
now training each SVM is fairly cheap. Interestingly, it is in the first ten queries that the
Simple seems to suffer the most through its lack of aggressive exploration. This motivates
a Hybrid method. We can use MaxMin or Ratio for the first few queries and then use the
Simple method for the rest. Experiments with the Hybrid method show that it maintains
the stability of the MaxMin and Ratio methods while allowing the scalability of the Simple
method. Figure 10b compares the Hybrid method with the Ratio and Simple methods on
the two newsgroups for which the Simple method performed poorly. The test set accuracy
of the Hybrid method is virtually identical to that of the Ratio method while the Hybrid
method's run time was about the same as the Simple method, as indicated by Table 3.

Query    Simple    MaxMin    Ratio    Hybrid
1        0.008     3.7       3.7      3.7
5        0.018     4.1       5.2      5.2
10       0.025     12.5      8.5      8.5
20       0.045     13.6      19.9     0.045
30       0.068     22.5      23.9     0.073
50       0.110     23.2      23.3     0.115
100      0.188     42.8      43.2     0.2

Table 3: Typical run times in seconds for the Active methods on the Newsgroups dataset
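The Hybrid rule itself is a one-line dispatch between the two strategies. A minimal sketch, assuming the Ratio and Simple querying functions are available as callables and that the switch-over point is the ten queries used in Figure 10b:

    def hybrid_query(ratio_fn, simple_fn, num_queries_asked, switch_after=10):
        # Use the stable but expensive Ratio rule while the labeled set is small, then
        # fall back to the cheap Simple rule once training an SVM per candidate labeling
        # becomes costly.
        return ratio_fn() if num_queries_asked < switch_after else simple_fn()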
6. Related Work
There have been several studies of active learning for classification. The Query by Committee algorithm (Seung et al., 1992, Freund et al., 1997) uses a prior distribution over
hypotheses. This general algorithm has been applied in domains and with classifiers for
which specifying and sampling from a prior distribution is natural. They have been used
with probabilistic models (Dagan and Engelson, 1995) and specifically with the Naive Bayes
model for text classification in a Bayesian learning setting (McCallum and Nigam, 1998).
The Naive Bayes classifier provides an interpretable model and principled ways to incorporate prior knowledge and data with missing values. However, it typically does not perform
as well as discriminative methods such as SVMs, particularly in the text classification domain (Joachims, 1998, Dumais et al., 1998).
We re-created McCallum and Nigam's (1998) experimental setup on the Reuters-21578
corpus and compared the reported results from their algorithm (which we shall call the MN-algorithm hereafter) with ours. In line with their experimental setup, queries were asked
five at a time, and this was achieved by picking the five instances closest to the current
hyperplane. Figure 11a compares McCallum and Nigam's reported results with ours. The
graph indicates that the Active SVM performance is significantly better than that of the
MN-algorithm.
An alternative committee approach to query by committee was explored by Liere and
Tadepalli (1997, 2000). Although their algorithm (LT-algorithm hereafter) lacks the theoretical justifications of the Query by Committee algorithm, they successfully used their
committee based active learning method with Winnow classifiers in the text categorization
domain. Figure 11b was produced by emulating their experimental setup on the Reuters-21578 data set and it compares their reported results with ours. Their algorithm does
not require a positive and negative instance to seed their classifier. Rather than seeding
our Active SVM with a positive and negative instance (which would give the Active SVM
an unfair advantage) the Active SVM randomly sampled 150 documents for its first 150
queries. This process virtually guaranteed that the training set contained at least one positive instance. The Active SVM then proceeded to query instances actively using the Simple
method. Despite the very naive initialization policy for the Active SVM, the graph shows
that the Active SVM accuracy is significantly better than that of the LT-algorithm.

Figure 11: (a) Average breakeven point performance over the Corn, Trade and Acq Reuters-21578 categories. (b) Average test set accuracy over the top ten Reuters-21578
categories.
Lewis and Gale (1994) introduced uncertainty sampling and applied it to a text domain
using logistic regression and, in a companion paper, using decision trees (Lewis and Catlett,
1994). The Simple querying method for SVM active learning is essentially the same as their
uncertainty sampling method (choose the instance that our current classifier is most uncertain about); however, they provided substantially less justification as to why the algorithm
should be effective. They also noted that the performance of the uncertainty sampling
method can be variable, performing quite poorly on occasions.
Two other studies (Campbell et al., 2000, Schohn and Cohn, 2000) independently developed our Simple method for active learning with support vector machines and provided
different formal analyses. Campbell, Cristianini and Smola extend their analysis for the
Simple method to cover the use of soft margin SVMs (Cortes and Vapnik, 1995) with linearly non-separable data. Schohn and Cohn note interesting behaviors of the active learning
curves in the presence of outliers.
7. Conclusions and Future Work
We have introduced a new algorithm for performing active learning with SVMs. By taking
advantage of the duality between parameter space and feature space, we arrived at three
algorithms that attempt to reduce version space as much as possible at each query. We
have shown empirically that these techniques can provide considerable gains in both the
inductive and transductive settings; in some cases shrinking the need for labeled instances
by over an order of magnitude, and in almost all cases reaching the performance achievable
on the entire pool having seen only a fraction of the data. Furthermore, larger pools of
unlabeled data improve the quality of the resulting classifier.
Of the three main methods presented, the Simple method is computationally the fastest.
However, the Simple method seems to be a rougher and more unstable approximation, as
we witnessed when it performed poorly on two of the five Newsgroup topics. If asking each
query is expensive relative to computing time then using either the MaxMin or Ratio may be
preferable. However, if the cost of asking each query is relatively cheap and more emphasis
is placed upon fast feedback then the Simple method may be more suitable. In either case,
we have shown that the use of these methods for learning can substantially outperform
standard passive learning. Furthermore, experiments with the Hybrid method indicate that
it is possible to combine the benefits of the Ratio and Simple methods.
The work presented here leads us to many directions of interest. Several studies have
noted that gains in computational speed can be obtained at the expense of generalization
performance by querying multiple instances at a time (Lewis and Gale, 1994, McCallum
and Nigam, 1998). Viewing SVMs in terms of the version space gives an insight as to where
the approximations are being made, and this may provide a guide as to which multiple
instances are better to query. For instance, it is suboptimal to query two instances whose
version space hyperplanes are fairly parallel to each other. So, with the Simple method,
instead of blindly choosing to query the two instances that are the closest to the current
SVM, it may be better to query two instances that are close to the current SVM and whose
hyperplanes in the version space are fairly perpendicular. Similar tradeoffs can be made for
the Ratio and MaxMin methods.
Bayes Point Machines (Herbrich et al., 2001) approximately find the center of mass of
the version space. Using the Simple method with this point rather than the SVM point in
the version space may produce an improvement in performance and stability. The use of
Monte Carlo methods to estimate version space areas may also give improvements.
One way of viewing the strategy of always choosing to halve the version space is that we
have essentially placed a uniform distribution over the current space of consistent hypotheses
and we wish to reduce the expected size of the version space as fast as possible. Rather
than maintaining a uniform distribution over consistent hypotheses, it is plausible that
the addition of prior knowledge over our hypothesis space may allow us to modify our
query algorithm and provide us with an even better strategy. Furthermore, the PAC-Bayesian framework introduced by McAllester (1999) considers the effect of prior knowledge
on generalization bounds and this approach may lead to theoretical guarantees for the
modified querying algorithms.
Finally, the Ratio and MaxMin methods are computationally expensive since they have
to step through each of the unlabeled data instances and learn an SVM for each possible
labeling. However, the temporarily modified data sets will only differ by one instance from
the original labeled data set and so one can envisage learning an SVM on the original data
set and then computing the incremental updates to obtain the new SVMs (Cauwenberghs
and Poggio, 2001) for each of the possible labelings of each of the unlabeled instances. Thus,
one would hopefully obtain a much more efficient implementation of the Ratio and MaxMin
methods and hence allow these active learning algorithms to scale up to larger problems.
Acknowledgments

This work was supported by DARPA's Information Assurance program under subcontract
to SRI International, and by ARO grant DAAH04-96-1-0341 under the MURI program
"Integrated Approach to Intelligent Systems".
References
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining
and Knowledge Discovery, 2:121–167, 1998.
C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In
Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:125, 1995.
D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and
Development in Information Retrieval, pages 3–12. Springer-Verlag, 1994.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification
and clustering. www.cs.cmu.edu/mccallum/bow, 1996.
A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning.
Morgan Kaufmann, 1998.
T. Mitchell. Generalization as search. Artificial Intelligence, 28:203–226, 1982.
J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART
retrieval system: Experiments in automatic document processing. Prentice-Hall, 1971.
G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In
Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
Fabrizio Sebastiani. Machine learning in automated text categorisation. Technical Report
IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, 2001.
H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of
Computational Learning Theory, pages 287–294, 1992.
J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages
278–285, 1999.
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.
V. Vapnik. Statistical Learning Theory. Wiley, 1998.