
Margin based Active Learning for LVQ Networks

F.-M. Schleif (a), B. Hammer (b), T. Villmann (c)

(a) Univ. of Leipzig, Dept. of Mathematics and Computer Science, Germany
(b) Clausthal Univ. of Technology, Dept. of Computer Science, Clausthal, Germany
(c) Univ. of Leipzig, Dept. of Medicine, Clinic for Psychotherapy, Leipzig, Germany
Abstract

In this article, we extend a local prototype-based learning model by active learning, which gives the learner the capability to select training samples during the model adaptation procedure. The proposed active learning strategy aims at an improved generalization ability of the final model. This is achieved by use of an adaptive query strategy which is more adequate for supervised learning than a simple random approach. Besides an improved generalization ability, the method also improves the speed of the learning procedure, which is especially beneficial for large data sets with multiple similar items. The algorithm is based on the idea of selecting a query on the borderline of the actual classification. This can be done by considering margins in an extension of learning vector quantization based on an appropriate cost function. The proposed active learning approach is analyzed for two kinds of learning vector quantizers, the supervised relevance neural gas and the supervised nearest prototype classifier, but it is applicable to a broader set of prototype-based learning approaches as well. The performance of the query algorithm is demonstrated on synthetic and real-life data taken from clinical proteomic studies. From the proteomic studies, high-dimensional mass spectrometry measurements were calculated which are believed to contain features discriminating the different classes. Using the proposed active learning strategies, the generalization ability of the models could be kept or improved, accompanied by a significantly improved learning speed. Both of these characteristics are important for the generation of predictive clinical models and were used in an initial biomarker discovery study.

Key words: active learning, learning vector quantization, generalization, classification, proteomic profiling
Preprint submitted to Elsevier Science 28 September 2006
1 Introduction
In supervised learning, we are frequently interested in training a classifier such that the underlying (unknown) target distribution is well estimated. Whereas traditional approaches usually adapt the model according to all available and randomly sampled training data, the field of active learning restricts itself to only a few actively selected samples. This method avoids the shortcoming of traditional approaches that the average amount of new information per sample decreases during learning and that additional data from some regions are basically redundant. Further, it accounts for the phenomenon, increasingly common e.g. in bioinformatics or web search, that unlabeled data are abundant whereas reliable labeling is costly. Variants of active and query-based learning have been proposed quite early for neural models (1; 2). Basically, two different kinds of active learning exist: active learning based on an oracle, where one can demand arbitrary points of the input space (1), and methods where one can choose different points from a given training set (3). We focus on the second strategy.

In query algorithms proposed so far, samples are chosen according to some heuristic, e.g. (1), or in a principled way by optimizing an objective function such as the expected information gain of a query, e.g. (2), or the model uncertainty, e.g. (4). A common feature of these query algorithms, however, is that they have been applied to global learning algorithms. Only a few approaches incorporate active strategies into local learning, such as (5), where a heuristic query strategy for simple vector quantization is proposed. In this paper we include active learning in two recently proposed margin-based, potentially kernelized learning vector quantization approaches, which combine the good generalization ability of margin optimization with the intuitiveness of prototype-based local learners, where subunits compete for predominance in a region of influence (6; 7).

We now briefly review the basics of this kernel extension of LVQ and its accompanying learning-theoretical generalization bounds, from which we derive a margin-based active learning strategy. We demonstrate the benefit of this mode by comparing the classification performance of the algorithm on randomly selected training data and active strategies for several data sets stemming from clinical proteomics as well as synthetic data.

Frank-Michael Schleif: Bruker Daltonik GmbH, Permoserstrasse 15, D-04318 Leipzig, Germany, Tel: +49 341 24 31-408, Fax: +49 341 24 31-404, fms@bdal.de
1.1 Generalized relevance learning vector quantization

Standard LVQ and variants as proposed by Kohonen constitute popular, simple, and intuitive prototype-based methods, but they are purely heuristically motivated local learners (8). They suffer from instabilities for overlapping classes, a strong dependence on the initialization of the prototypes, and a restriction to classification scenarios in Euclidean space. Generalized relevance learning vector quantization (GRLVQ) has been introduced by the authors to cope with these problems (6). It is based on a cost function such that neighborhood incorporation, integration of relevance learning, and kernelization of the approach become possible, which gives Supervised Neural Gas (SNG) and Supervised Relevance Neural Gas, denoted SRNG (6). The method can be accompanied by a large margin generalization bound (9), which is directly connected to the cost function of the algorithm and which opens the way towards active learning strategies, as we will discuss in this article.
We first introduce the basic algorithm. Input vectors are denoted by $v$ and their corresponding class labels by $c_v$; $L$ is the set of labels (classes). Let $V \subseteq \mathbb{R}^{D_V}$ be a set of inputs $v$. The model uses a fixed number of representative prototypes (weight vectors, codebook vectors) for each class. Let $W = \{w_r\}$ be the set of all codebook vectors and $c_r$ be the class label of $w_r$. Furthermore, let $W_c = \{w_r \,|\, c_r = c\}$ be the subset of prototypes assigned to class $c \in L$. The task of vector quantization is realized by the map $\Psi$ as a winner-takes-all rule, i.e. a stimulus vector $v \in V$ is mapped onto that neuron $s \in A$ the pointer $w_s$ of which is closest to the presented stimulus vector $v$,

$$\Psi_{V \to A}: v \mapsto s(v) = \operatorname{argmin}_{r \in A} d^{\lambda}(v, w_r) \qquad (1)$$

with $d^{\lambda}(v, w)$ being an arbitrary differentiable similarity measure, which may depend on a parameter vector $\lambda$. The subset of the input space

$$\Omega_r = \left\{ v \in V : r = \Psi_{V \to A}(v) \right\} \qquad (2)$$

which is mapped to a particular neuron $r$ according to (1) forms the (masked) receptive field of that neuron. If the class information of the weight vector is used, the boundaries $\partial \Omega_r$ generate the decision boundaries for classes. A training algorithm should adapt the prototypes such that for each class $c \in L$, the corresponding codebook vectors $W_c$ represent the class as accurately as possible.
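For illustration, the winner-takes-all rule (1) and the receptive fields (2) can be sketched as follows, assuming the weighted squared Euclidean metric used later in the paper (function names are ours, not from the original implementation):

```python
import numpy as np

def weighted_sq_euclidean(v, w, lam):
    """Parameterized similarity d_lambda(v, w) = sum_i lam_i * (v_i - w_i)^2."""
    return np.sum(lam * (v - w) ** 2)

def winner(v, W, lam):
    """Winner-takes-all rule (1): index of the closest prototype."""
    return int(np.argmin([weighted_sq_euclidean(v, w, lam) for w in W]))

def receptive_fields(V, W, lam):
    """Receptive fields (2): indices of the inputs mapped to each prototype."""
    fields = {r: [] for r in range(len(W))}
    for k, v in enumerate(V):
        fields[winner(v, W, lam)].append(k)
    return fields
```

The class labels of the prototypes are deliberately ignored here; they only enter once decision boundaries are read off from the receptive fields.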
To achieve this goal, GRLVQ optimizes the following cost function, which is related to the number of misclassifications of the prototypes, via stochastic gradient descent:

$$\mathrm{Cost}_{\mathrm{GRLVQ}} = \sum_v f\!\left(\mu^{\lambda}(v)\right) \quad \text{with} \quad \mu^{\lambda}(v) = \frac{d^{\lambda}_{r^+} - d^{\lambda}_{r^-}}{d^{\lambda}_{r^+} + d^{\lambda}_{r^-}} \qquad (3)$$

where $f(x) = (1 + \exp(-x))^{-1}$ is the standard logistic function, $d^{\lambda}_{r^+}$ is the similarity of the input vector $v$ to the nearest codebook vector labeled with $c_{r^+} = c_v$, say $w_{r^+}$, and $d^{\lambda}_{r^-}$ is the similarity measure to the best matching prototype labeled with $c_{r^-} \neq c_v$, say $w_{r^-}$. Note that the term $\mu^{\lambda}(v)$ scales the difference of the closest two competing prototypes to $(-1, 1)$; negative values correspond to correct classifications. The learning rule can be derived from the cost function by using the derivative. As shown in (10), this cost function shows robust behavior whereas original LVQ2.1 yields divergence. SRNG combines this method with neighborhood cooperation as derived in (6) and thus avoids that the algorithm gets trapped in local optima of the cost function. In subsequent experiments, we choose the weighted Euclidean metric $d^{\lambda}(v, w) = \sum_i \lambda_i (v_i - w_i)^2$ as metric.
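The summand of the cost (3) for a single sample can be sketched as follows, assuming the weighted Euclidean metric; $f(\mu) < 1/2$ corresponds to a negative $\mu$, i.e. a correct classification:

```python
import numpy as np

def grlvq_cost_term(v, c_v, W, c_W, lam):
    """Logistic transform f(mu(v)) of the relative margin term of Eq. (3)."""
    d = np.array([np.sum(lam * (v - w) ** 2) for w in W])
    same = np.array(c_W) == c_v
    d_plus = d[same].min()     # closest prototype with the correct label
    d_minus = d[~same].min()   # closest prototype with a wrong label
    mu = (d_plus - d_minus) / (d_plus + d_minus)  # in (-1, 1); negative = correct
    return 1.0 / (1.0 + np.exp(-mu))              # logistic function f
```

Summing this term over all training samples gives the full cost; the stochastic gradient update of the prototypes (not shown) follows from its derivative.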
An alternative approach is the Soft Nearest Prototype Classification and its kernelized variants incorporating metric adaptation (7), referred to as SNPC. SNPC is based on an alternative cost function which can be interpreted as a Gaussian mixture model (GMM) approach aiming at empirical risk minimization. In the following, SNPC is reviewed very briefly. Thereby we restrict ourselves to the SNPC cost function and some further important parts which are affected by the introduction of an active learning strategy. For details on the derivation of SNPC and the ordinary learning dynamic we refer to (7).
2 Soft nearest prototype classification

We keep the generic notation from the former section and review subsequently the Soft Nearest Prototype Classification (SNPC). SNPC has been proposed as an alternative stable NPC learning scheme. It introduces soft assignments of data vectors to the prototypes, which have a statistical interpretation as normalized Gaussians. In the original SNPC as provided in (11) one considers the cost function

$$E(S) = \frac{1}{N_S} \sum_{k=1}^{N_S} \sum_r u_\sigma(r|v_k)\left(1 - \alpha_{r, c_{v_k}}\right) \qquad (4)$$

with $S = \{(v, c_v)\}$ the set of all input pairs and $N_S = \#S$. The value $\alpha_{r, c_{v_k}}$ equals one if $c_{v_k} = c_r$ and 0 otherwise. $u_\sigma(r|v_k)$ is the probability that the input vector $v_k$ is assigned to the prototype $r$. A crisp winner-takes-all mapping (1) would yield $u_\sigma(r|v_k) = \delta\left(r = s(v_k)\right)$.

In order to minimize (4), in SNPC the variables $u_\sigma(r|v_k)$ are taken as soft
assignment probabilities. This allows a gradient descent on the cost function (4). As proposed in (11), the probabilities are chosen as normalized Gaussians

$$u_\sigma(r|v_k) = \frac{\exp\left(-\frac{d(v_k, w_r)}{2\sigma^2}\right)}{\sum_{r'} \exp\left(-\frac{d(v_k, w_{r'})}{2\sigma^2}\right)} \qquad (5)$$

whereby $d$ is the distance measure used in (1) and $\sigma$ is the bandwidth, which has to be chosen adequately. Then the cost function (4) can be rewritten as

$$E_{\mathrm{soft}}(S) = \frac{1}{N_S} \sum_{k=1}^{N_S} lc\left((v_k, c_{v_k})\right) \qquad (6)$$

with local costs

$$lc\left((v_k, c_{v_k})\right) = \sum_r u_\sigma(r|v_k)\left(1 - \alpha_{r, c_{v_k}}\right) \qquad (7)$$

i.e., the local error is the sum of the assignment probabilities to all prototypes of an incorrect class, and, hence, $lc((v_k, c_{v_k})) \le 1$, with the local costs depending on the whole set $W$. Because the local costs $lc((v_k, c_{v_k}))$ are continuous and bounded, the cost function (6) can be minimized by stochastic gradient descent using the derivative of the local costs as shown in (11). All prototypes are adapted in this scheme according to the soft assignments. Note that for small bandwidth $\sigma$, the learning rule is similar to LVQ2.1.
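The soft assignments (5) and the local costs (7) can be sketched as follows; the max-shift of the exponents for numerical stability is our addition and does not change the normalized values:

```python
import numpy as np

def soft_assignments(v, W, sigma, lam=None):
    """Normalized Gaussians of Eq. (5); lam optionally weights the metric."""
    lam = np.ones(W.shape[1]) if lam is None else lam
    d = np.array([np.sum(lam * (v - w) ** 2) for w in W])
    logits = -d / (2.0 * sigma ** 2)
    logits -= logits.max()          # numerical stability, cancels on normalization
    u = np.exp(logits)
    return u / u.sum()

def local_cost(v, c_v, W, c_W, sigma, lam=None):
    """Local costs lc of Eq. (7): assignment mass on wrongly labeled prototypes."""
    u = soft_assignments(v, np.asarray(W), sigma, lam)
    wrong = np.array(c_W) != c_v    # 1 - alpha_{r, c_v}
    return float(u[wrong].sum())    # bounded by 1
```

Averaging `local_cost` over the training set yields the cost (6).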
2.1 Relevance learning for SNPC

Like all NPC algorithms, SNPC heavily relies on the metric $d$, usually the standard Euclidean metric. For high-dimensional data as they occur in proteomic patterns, this choice is not adequate, since noise present in the data set accumulates and likely disrupts the classification. Thus, a focus on the (a priori unknown) relevant parts of the inputs would be much better suited. Relevance learning as introduced in (12) offers the opportunity to learn metric parameters which account for the different relevance of the input dimensions during training. In analogy to the above learning approaches, this relevance learning idea is included into SNPC, leading to SNPC-R. Instead of the metric $d(v_k, w_r)$, the metric can now be parameterized with adaptive relevance factors, giving $d^{\lambda}(v_k, w_r)$, which is included into the soft assignments (5), whereby the component $\lambda_k$ of $\lambda$ is usually chosen as the weighting parameter for input dimension $k$. The relevance parameters $\lambda_k$ can be adjusted according to the given training data by taking the derivative of the cost function, i.e. $\frac{\partial\, lc((v_k, c_{v_k}))}{\partial \lambda_j}$ using the local costs (7):

$$\frac{\partial\, lc\left((v_k, c_{v_k})\right)}{\partial \lambda_j} = \frac{1}{2\sigma^2} \sum_r u_\sigma(r|v_k)\, \frac{\partial d^{\lambda}}{\partial \lambda_j} \left(1 - \alpha_{r, c_{v_k}} - lc\left((v_k, c_{v_k})\right)\right) \qquad (8)$$

with subsequent normalization of the $\lambda_k$.

It is worth emphasizing that SNPC-R can also be used with individual metric parameters $\lambda^r$ for each prototype $w_r$, or with a class-wise metric shared by prototypes with the same class label $c_r$, as it is done here, referred to as localized SNPC-R (LSNPC-R). If the metric is shared by all prototypes, LSNPC-R reduces to SNPC-R. The respective adjustment of the relevance parameters can easily be determined in complete analogy to (8).
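A sketch of the relevance update following (8) with the subsequent normalization, assuming the weighted Euclidean metric so that $\partial d^{\lambda} / \partial \lambda_j = (v_j - w_j)^2$; the learning rate and the clipping at zero are our choices, not prescribed by the paper:

```python
import numpy as np

def relevance_gradient(v, c_v, W, c_W, sigma, lam):
    """Gradient of the local costs lc w.r.t. lambda following Eq. (8)."""
    W = np.asarray(W)
    d = np.sum(lam * (v - W) ** 2, axis=1)        # weighted metric d_lambda
    u = np.exp(-d / (2 * sigma ** 2))
    u /= u.sum()                                  # soft assignments, Eq. (5)
    wrong = (np.array(c_W) != c_v).astype(float)  # 1 - alpha_{r, c_v}
    lc = float((u * wrong).sum())                 # local costs, Eq. (7)
    dd = (v - W) ** 2                             # d d_lambda / d lambda_j
    return (u[:, None] * dd * (wrong - lc)[:, None]).sum(axis=0) / (2 * sigma ** 2)

def update_relevances(lam, grad, eta=0.01):
    """Gradient step with the subsequent normalization of the lambda_k."""
    lam = np.maximum(lam - eta * grad, 0.0)
    return lam / lam.sum()
```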
It has been pointed out in (13) that NPC classification schemes which are based on the Euclidean metric can be interpreted as large margin algorithms for which dimensionality-independent generalization bounds can be derived. Instead of the dimensionality of the data, the so-called hypothesis margin, i.e. the distance by which the hypothesis can be altered without changing the classification on the training set, serves as a parameter of the generalization bound. This result has been extended to NPC schemes with adaptive diagonal metric in (6). This fact is quite remarkable, since $D_V$ new parameters, $D_V$ being the input dimension, are added this way; still, the bound is independent of $D_V$. This result can even be transferred to the setting of individual metric parameters $\lambda^r$ for each prototype or class, as we will see below, such that a generally good generalization ability of this method can be expected. Apart from the fact that (possibly local) relevance factors allow a larger flexibility of the approach without decreasing the generalization ability, they are of particular interest for proteomic pattern analysis because they indicate potentially semantically meaningful positions.

Our active learning approach holds for each kind of such LVQ-type learning.

3 Margin based active learning

The first dimensionality-independent large margin generalization bound for LVQ classifiers has been provided in (13). For GRLVQ-type learning, a further analysis is possible, which accounts for the fact that the similarity measure is adaptive during training (9). Here, we sketch the argumentation as provided in (9) to derive a bound for a slightly more general situation where different local adaptive relevance terms are attached to the prototypes.
Theoretical generalization bound for fixed margin

Assume, for the moment, that a two-class problem with labels $\{-1, 1\}$ is given$^1$. We assume that an NPC classification scheme is used whereby the locally weighted squared Euclidean metric determines the receptive fields:

$$v \mapsto \operatorname{argmin}_{r \in A} \sum_l \lambda^r_l \left(v_l - (w_r)_l\right)^2$$

where $l$ denotes the components of the vectors and $\sum_l \lambda^r_l = 1$. We further assume that the data are chosen i.i.d. according to the data distribution $P(V)$, whose support is limited by a ball of radius $B$, and that the class labels are determined by an unknown function. Generalization bounds limit the error, i.e. the probability that the learned classifier does not classify given data correctly:

$$E_P(\Psi) = P\left(c_v \neq \Psi_{V \to A}(v)\right) \qquad (9)$$
Note that this error captures the performance for GRLVQ/SRNG networks as well as SNPC-R learning with local adaptive diagonal metric. Given a classifier and a sample $(v, c_v)$, we define the margin as

$$M_\Psi(v, c_v) = -d^{\lambda^{r^+}}_{r^+} + d^{\lambda^{r^-}}_{r^-}, \qquad (10)$$

i.e. the difference of the distances of the data point from the closest correct and the closest wrong prototype. (To be precise, we refer to the absolute value as the margin.) For a fixed parameter $\rho \in (0, 1)$, the loss function is defined as

$$L: \mathbb{R} \to \mathbb{R}, \quad t \mapsto \begin{cases} 1 & \text{if } t \le 0 \\ 1 - t/\rho & \text{if } 0 < t \le \rho \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

The term

$$\hat{E}^L_m(\Psi) = \sum_{v \in V} L\left(M_\Psi(v, c_v)\right) / |V| \qquad (12)$$

denotes the empirical error on the training data. It counts the data points which are classified wrongly and, in addition, punishes all data points with too small a margin.
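The margin (10) and the empirical error (12) with the margin loss (11) can be sketched as follows, assuming local relevance vectors, one per prototype:

```python
import numpy as np

def margin(v, c_v, W, c_W, relevances):
    """Margin M(v, c_v) of Eq. (10) with per-prototype relevance terms."""
    d = np.array([np.sum(lam * (v - w) ** 2) for w, lam in zip(W, relevances)])
    same = np.array(c_W) == c_v
    return -d[same].min() + d[~same].min()  # positive iff correctly classified

def empirical_margin_error(V, C, W, c_W, relevances, rho):
    """Empirical error (12) using the piecewise-linear margin loss L of Eq. (11)."""
    def loss(t):
        if t <= 0:
            return 1.0            # misclassified
        if t <= rho:
            return 1.0 - t / rho  # correct, but margin too small
        return 0.0                # correct with sufficient margin
    return float(np.mean([loss(margin(v, c, W, c_W, relevances))
                          for v, c in zip(V, C)]))
```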
We can now use techniques from (14) in analogy to the argumentation in (9) to relate the error (9) and the empirical error (12). As shown in (14) (Theorem 7), one can bound the deviation of the empirical error $\hat{E}^L_m(\Psi)$ from the error $E_P(\Psi)$ by the Gaussian complexity of the function class defined by the model:

$$E_P(\Psi) \le \hat{E}^L_m(\Psi) + \frac{2K}{\rho}\, G_m + \sqrt{\frac{\ln(4/\delta)}{2m}}$$

with probability at least $1 - \delta/2$. $K$ is a universal constant. $G_m$ denotes the Gaussian complexity, i.e. the expectation

$$G_m = E_{v_1, \ldots, v_m}\, E_{g_1, \ldots, g_m} \left[ \sup_F \left| \frac{2}{m} \sum_{i=1}^m g_i F(v_i) \right| \right]$$

where the expectation is taken w.r.t. independent Gaussian variables $g_1, \ldots, g_m$ with zero mean and unit variance and i.i.d. points $v_i$ sampled according to the marginal distribution induced by $P$. The supremum is taken over all NPC classifiers $F$ with prototypes $W$ of length at most $B$.

$^1$ These constraints are technical and serve to derive the generalization bounds, which have already been derived by two of the authors in (9); the active learning strategies also work for more than 2 classes and for alternative metrics.
The Gaussian complexity of NPC networks with local adaptive diagonal metric can easily be estimated using techniques from (14): the classifier can be expressed as a Boolean formula over the results of classifiers with only two prototypes. Thereby, at most $|W|(|W|-1)$ such terms exist. For two prototypes $i$ and $j$, the corresponding output can be described by the sum of a simple quadratic form and a linear term:

$$d^{\lambda^i}_i - d^{\lambda^j}_j \ge 0$$
$$\iff (v - w_i)^t \Lambda_i (v - w_i) - (v - w_j)^t \Lambda_j (v - w_j) \ge 0$$
$$\iff v^t \Lambda_i v - v^t \Lambda_j v - 2\left(\Lambda_i w_i - \Lambda_j w_j\right)^t v + (w_i)^t \Lambda_i w_i - (w_j)^t \Lambda_j w_j \ge 0$$

where $\Lambda_i$ denotes the diagonal matrix with entries $\lambda^i$. Since the size of the prototypes and the size of the inputs are restricted by $B$ and since $\lambda$ is normalized to 1, we can estimate the empirical Gaussian complexity by the sum of

$$\frac{4 B (B + 1)(B + 2)\sqrt{m}}{m}$$

for the linear term (including the bias) and

$$\frac{2 B^2 \sqrt{m}}{m}$$

for the quadratic term, using (14) (Lemma 22). The Gaussian complexity differs from the empirical Gaussian complexity by at most $\epsilon$ with probability at least $1 - 2\exp(-\epsilon^2 m/8)$. Putting these bounds together, the overall bound

$$P\left( E_P(\Psi) > \hat{E}^L_m(\Psi) + K'\, \frac{\sqrt{\ln |V|}}{\rho\, \sqrt{|V|}} \left( \sqrt{\ln(1/\delta)} + |W|^2 B^3 \right) \right) \le \delta \qquad (13)$$

results for every confidence $\delta \in (0, 1)$, $K'$ being a universal constant. This bound holds for every prototype-based learning algorithm with diagonal Euclidean metric and adaptive relevance parameters, as long as the absolute sum of the relevance parameters is restricted to 1, whereby the parameters may even be adapted locally for each prototype vector.
This formula provides an estimate of the difference between the real error of a trained classifier and the empirical error measured on the training set. Its form is very similar to generalization bounds of learning algorithms as derived in the framework of VC-theory (15). Since these bounds are worst-case bounds, they usually differ strongly from the exact test error and can give only an indication of the relevance of certain parameters. However, this information is sufficient to derive useful strategies to guide active learning. Note that the bound (13) does not include the data dimensionality, but the margin $\rho$. This bound holds for the scaled squared Euclidean metric with local adaptive relevance terms. It can directly be generalized to kernelized versions of this similarity measure. A sufficient condition for this fact is, e.g., that the measure is symmetric, vanishes for two identical arguments, and that its negative is conditionally positive definite. It should be mentioned that the margin (10) occurs as the numerator in the cost function of GRLVQ. Hence GRLVQ and SRNG maximize the margin during training according to this cost function. However, the final generalization bound holds for all training algorithms of NPC classifiers, including SNPC and SNPC-R.
Active learning strategy

The generalization bound in terms of the margin suggests an elegant scheme to transfer margin-based active learning to local learners. Margin-based sample selection has been presented e.g. in the context of SVMs in (16; 3). Obviously, the generalization ability of the GRLVQ algorithm only depends on the points with too small a margin (10). Thus, only the extremal margin values need to be limited, and a restriction of the respective update to extremal pairs of prototypes would suffice. This argument suggests schemes for active data selection if a fixed and static pattern set is available: We fix a monotonically decreasing non-negative function $L_c: \mathbb{R} \to \mathbb{R}$ and actively select training points from a given sample, in analogy to e.g. (3), based on the probability $L_c(M_\Psi(v, c_v))$ for sample $v$. Thereby, different realizations are relevant:

(1) $L_c(t) = 1$ for $t < 0$ and $L_c(t) \sim |t|^{-\beta}$ otherwise, i.e. the size of the margin determines the probability of $v$ being chosen, annealed by a parameter $\beta$ (Probabilistic strategy).

(2) $L_c(t) = 1$ if $t \le \rho$, otherwise it is 0. That means, all samples with margin smaller than $\rho$ are selected (Threshold strategy).

Both strategies focus on the samples which are not yet sufficiently represented in the model. Therefore they directly aim at an improvement of the generalization bound (13). Strategy (2) allows an adaptation of the margin parameter $\rho$ during training in accordance with the confidence of the model, in analogy to the recent proposal (3) for SVMs. For each codebook vector $w_r \in W$ we introduce a new parameter $\sigma_r$ measuring the mean distance of the data points in its receptive field (2) to the current prototype $w_r$. This parameter can easily be computed during training as a moving average with no extra cost$^2$. We choose $\rho_r$ locally as $\rho_r = 2\sigma_r$. Thus, points whose margin compares favorably to the size of the receptive fields are already represented with sufficient security and, hence, they are abandoned. For strategy (1), a confidence depending on the distance to the closest correct prototype and the overall classification accuracy can be introduced in a similar way. In doing so, the normalized margin is taken as a probability measure for data selection.
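Both query strategies can be sketched as follows. Note two assumptions of ours: the probabilistic rule uses $1/(1+|t|)^\beta$ as a bounded stand-in for the decaying selection probability, since $|t|^{-\beta}$ diverges at $t = 0$, and the moving-average update for the receptive-field size $\sigma_r$ uses a decay constant of our choosing:

```python
import numpy as np

def select_probabilistic(margins, beta=1.0, rng=np.random.default_rng(0)):
    """Strategy (1): misclassified samples (negative margin) are always queried;
    otherwise the selection probability decays with the margin, annealed by beta."""
    margins = np.asarray(margins)
    p = np.where(margins < 0, 1.0, (1.0 + np.abs(margins)) ** -beta)
    return np.flatnonzero(rng.random(len(margins)) < p)

def select_threshold(margins, sigma_r):
    """Strategy (2): query every sample whose margin is at most rho_r = 2*sigma_r,
    with sigma_r the mean distance inside the winner's receptive field."""
    return np.flatnonzero(np.asarray(margins) <= 2.0 * sigma_r)

def update_field_size(sigma_r, dist, decay=0.99):
    """Moving-average estimate of the mean distance in a receptive field."""
    return decay * sigma_r + (1.0 - decay) * dist
```

In a training loop, `update_field_size` would be called for the winning prototype after each presented sample, so the threshold $\rho_r = 2\sigma_r$ tracks the model's confidence at no extra cost.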
Validity of generalization bounds for active strategies

There are a few aspects of the generalization bound (13) which have to be reconsidered in the framework of active learning. The bound (13) is valid for a priorly fixed margin $\rho$; it is not applicable to the empirical margin observed during training, which is optimized by active learning. However, it is possible to generalize the bound (13) to this setting in the spirit of the luckiness framework of machine learning (17) by fixing prior probabilities of achieving a certain margin. Here, we derive a bound depending on a universal upper bound on the empirical margin and fixed prior probabilities.

Assume the empirical margin can be upper bounded by $C > 0$. A reasonable upper bound $C$ can e.g. be derived from the data as half the average of the distances of the points to their respective closest neighbor with different labeling. Define $\rho_i = C/i$ for $i \ge 1$. Choose prior probabilities $p_i \ge 0$ with $\sum_i p_i = 1$ which indicate the confidence in achieving an empirical margin whose size is at least $\rho_i$. Set

$$L_i: \mathbb{R} \to \mathbb{R}, \quad t \mapsto \begin{cases} 1 & \text{if } t \le 0 \\ 1 - t/\rho_i & \text{if } 0 < t \le \rho_i \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

in analogy to (11). $\hat{E}^{L_i}_m(\Psi)$ denotes the empirical error (12) using the loss function $L_i$. We are interested in the probability

$$P\left( \exists i:\ E_P(\Psi) > \hat{E}^{L_i}_m(\Psi) + \epsilon(i) \right) \qquad (15)$$

i.e. the probability that the empirical error measured with respect to a loss $L_i$ for any $i$ and the real error deviate by more than $\epsilon(i)$, where the bound

$$\epsilon(i) = K'\, \frac{\sqrt{\ln |V|}}{\rho_i\, \sqrt{|V|}} \left( \sqrt{\ln(1/(p_i \delta))} + |W|^2 B^3 \right)$$

depends on the empirical margin $\rho_i$. Note that the value $\epsilon(i)$ is chosen as the bound derived in (13) for the margin $\rho_i$ and confidence $p_i \delta$. The probability (15) upper bounds the probability of a large deviation of the real error and the empirical error with respect to a loss function which is associated to the empirical margin observed posteriorly on the given training set. For every observed margin $\rho < C$, some $\rho_i$ can be found with $\rho_i \le \rho$, such that the generalization bound $\epsilon(i)$ results in this setting. The size of $\epsilon(i)$ depends on whether the observed empirical margin corresponds to the prior confidence in reaching this margin, i.e. a large prior $p_i$.

We can limit (15) as follows:

$$P\left( \exists i:\ E_P(\Psi) > \hat{E}^{L_i}_m(\Psi) + \epsilon(i) \right) \le \sum_i P\left( E_P(\Psi) > \hat{E}^{L_i}_m(\Psi) + \epsilon(i) \right) \le \sum_i p_i\, \delta = \delta$$

because of the choice of $\epsilon(i)$ such that it corresponds to the bound (13). This argument allows us to derive bounds of a form similar to (13) for the empirical margin.

$^2$ The extra computational time to determine the active learning control variables is negligible.
Another point worth discussing concerns the assumption that the training data are i.i.d. with respect to an unknown underlying probability. For active learning, the training points are chosen depending on the observed margin; that means they are dependent. However, this argument does not apply to the scenario considered in our case: we assume a priorly fixed training set with i.i.d. data and choose the next training pattern depending on the margin such that the convergence speed of training and the generalization ability of the trained classifier are improved. This affects the learning algorithm on the given data set; however, the empirical error can nevertheless be evaluated on the whole training set independently of the algorithm, such that the bound as derived above is valid for the priorly fixed training set. For dependent data (e.g. points created online for learning), an alternative technique such as the statistical theory of online learning must be used to derive bounds (18; 19).

Adaptation to noisy data or unknown classes

We will apply the strategy of active learning to data sets for which a small classification error can be achieved with a reasonable margin. We will not test the method on very corrupted or noisy data sets for which a large margin resp. a small training error cannot be achieved. It can be expected that the active strategies as proposed above have to be adapted for very noisy learning scenarios: the active strategy as proposed above will focus on points which are not yet classified correctly resp. which lie close to the border, trying to enforce a separation which is not possible for the data set at hand. In such cases, it would be valuable to restrict the active strategy to points which are still promising, e.g. choosing points only from a limited band parallel to the decision boundary.

We would like to mention that, so far, we have restricted the active selection strategies to samples where all labels are known beforehand, because the closest correct and wrong prototypes have to be determined in (10). This setting allows to improve the training speed and performance of batch training. If data are initially unlabeled and queries can be asked for a subset of the data, we can extend these strategies in an obvious way towards this setting: in this case, the margin (10) is given by the closest two prototypes which possess a different class label, whereby the (unknown) class label of the sample point has no influence. $L_c(t)$ is then substituted by $L_c(|t|)$.
4 Synthetic and Clinical data sets

The first data set is the Wisconsin Breast Cancer Data set (WDBC) as given by UCI (20). It consists of 569 measurements with 30 features in 2 classes. The other two data sets are taken from proteomic studies, named Proteom_1 and Proteom_2, and originate from our own mass spectrometry measurements on clinical proteomic data$^3$. The Proteom_1 data set consists of 97 samples in two classes with 59 dimensions. The data set Proteom_2 consists of 97 measurements with two classes and 94 dimensions. MALDI-TOF MS combined with magnetic-bead based sample preparation was used to generate proteomic patterns from Human EDTA (ethylenediaminetetraacetic acid) plasma samples. The MALDI-TOF mass spectra were obtained using magnetic bead based weak cation exchange chromatography (WCX) (21). The sample eluates from the bead preparation were applied to an AnchorChip-Target(TM) by use of HCCA matrix. The material was randomly applied on the target by use of a ClinProt-Robot(TM) and subsequently measured using an AutoFlex II in linear mode within 1-10 kDa (Bruker Daltonik GmbH, Bremen, Germany). Thereby, each spectrum has been accumulated by use of 450 laser shots with 15 shot positions per spot on the target. Each eluated sample was fourfold spotted and averaged subsequently. Individual attributes such as gender, age, cancer-related preconditions and some others have been controlled during sample collection to avoid bias effects. All data have been collected and prepared in accordance with best clinical practices. Spectra preparation has been done by use of the Bruker ClinProt-System (Bruker Daltonik GmbH, Bremen, Germany).

The well separable Checkerboard data (Checker) as given in (12) are used as a synthetic evaluation set. It consists of 3700 data points. Further, to illustrate the differences between the different strategies during learning of the codebook vector positions, a simple spiral data set has been created and applied using the SRNG algorithm. The spiral data are generated in a similar way as the data shown in (5), and the set consists of 5238 data points. Checker as well as the spiral data are given in a two-dimensional space.
5 Experiments and Results

For classification, we use 6 prototypes for the WDBC data, 100 prototypes for the well separable Checkerboard data set as given in (12), 9 prototypes for the Proteom_1 data set and 10 for Proteom_2. The parameter settings for SRNG can be summarized as follows: learning rate for the correct prototype: 0.01, learning rate for the incorrect prototype: 0.001, and learning rate for $\lambda$: 0.01. The neighborhood range is given by $\#W/2$. For SNPC the same settings as for SRNG are used, with the additional parameters window threshold: 0.05 and width $\sigma = 2.5$ for the Gaussian kernel. Learning rates are annealed by an exponential decay. All data have been processed using a 10-fold cross-validation procedure. Results are calculated using SNPC, SNPC-R and SNG, SRNG. SNG and SRNG are used instead of GLVQ and GRLVQ as improved versions of the former using neighborhood cooperation.

We now compare both prototype classifiers using randomly selected samples with their counterparts using the proposed query strategies. The classification results are given in Tab. 1 and Tab. 3 without metric adaptation and in Tab. 2 and Tab. 3 with relevance learning, respectively. The features of all data sets have been normalized. First we upper bounded the data set by 1.0, and subsequently the data are transformed such that we end up with zero mean and variance 1.0.

$^3$ These data are measured on Bruker systems and are subject to confidentiality agreements with clinical partners, so only some details are listed here.
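The two-step normalization described above can be sketched as follows (columnwise, assuming no all-zero feature):

```python
import numpy as np

def normalize(X):
    """Upper bound each feature by 1.0 in magnitude, then standardize
    to zero mean and unit variance, as done for all data sets here."""
    X = X / np.abs(X).max(axis=0)                 # bound the data by 1.0
    return (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, variance 1.0
```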
We applied the training algorithms using the different query strategies as introduced above. The results for recognition and prediction rates using SRNG are shown in Tab. 2$^4$ and for SNPC in Tab. 3, respectively. Thereby, the recognition rate is a performance measure of the model indicating the relative number of training data points whose class labels could correctly be recognized using the model. The prediction rate is a measure of the generalization ability of the model, accounting for the relative number of correctly predicted class labels of test data points which were not used in the former training of the classifier. As explained before, each prediction rate is obtained as an average over a 10-fold cross-validation procedure.
              SNG           SNG active strategy 1       SNG active strategy 2
              Rec.   Pred.  Rec.   Pred.  Rel. #Q       Rec.   Pred.  Rel. #Q
WDBC          95%    95%    94%    94%    38%           93%    92%    9%
Proteom_1     76%    83%    76%    77%    48%           76%    85%    15%
Proteom_2     73%    67%    73%    65%    49%           73%    62%    27%
Checker       72%    67%    98%    97%    31%           99%    96%    5%

Table 1
Classification accuracies for cancer and checkerboard data sets using SNG. All data sets consist of two classes, whereby the Proteom_2 data set is quite complex. The prediction accuracies are taken from a 10-fold cross validation and show a reliably good prediction for data belonging to the WDBC data set as well as for the Proteom_1 data set. The Checker data are not modeled as well; this is due to the fixed upper limit of cycles (1000), and longer runtimes lead for these data to a nearly perfect separation.
For the WDBC data set and the Proteom 2 data set we found small improve-
ments in the prediction accuracy using active strategy 2. In parts, a small
overfitting behavior using the new query strategies can be observed. Both new
query strategies were capable of significantly decreasing the necessary number
of queries while keeping at least reliable prediction accuracies with respect to
(4) The relative number of queries is calculated with respect to the maximal num-
ber of queries possible up to convergence of SRNG using the corresponding query
strategy. The upper limit of cycles has been fixed to 1000.
[Figure: relative number of queries vs. cycle; curves for the probabilistic and the threshold strategy]
Fig. 1. Number of queries in % using active strategy 1 (Threshold) and 2 (Proba-
bilistic) executed by the SRNG algorithm on the Proteom 1 data set.
a random query approach. This is depicted for different data sets using the
SRNG algorithm in Figures 1, 2, 3 and 4, and in Figures 5 and 6 using the SNPC
algorithm. Although the number of queries differs for SNPC and SRNG on the
same data set, the generic trend is similar and clearly reflects the classification
performance of the individual classifier. Especially the threshold approach led
to a significant decrease in the number of queries in each experiment.
In Figure 7 the behavior of the proposed strategies is shown for a synthetic
spiral data set of around 5000 data points. Both active learning strategies
showed a significant decrease in the number of necessary queries; typically only
10%-30% of the queries were executed with respect to a random approach.
Using the active learning strategies the data could be learned nearly perfectly,
whereas random querying required a much larger number of cycles to obtain
acceptable results. Considering the prototype distribution, one clearly observes
the good positioning of the prototypes using the active learning strategies whereas the
              SRNG         SRNG active strategy 1     SRNG active strategy 2
              Rec.  Pred.  Rec.  Pred.  Rel. #Q      Rec.  Pred.  Rel. #Q
  WDBC        95%   94%    95%   94%    29%          97%   97%    7%
  Proteom 1   82%   88%    92%   87%    31%          97%   93%    5%
  Proteom 2   87%   81%    96%   87%    33%          96%   76%    10%
Table 2
Classification accuracies for cancer data sets using SRNG. Data characteristics are
as given before. A reliably good prediction for the WDBC data as well as for the
Proteom 1 data set can be seen. One clearly observes an improved modeling
capability by use of relevance learning and a related additional decrease in the
number of queries.
[Figure: relative number of queries vs. cycle; curves for the probabilistic and the threshold strategy]
Fig. 2. Number of queries in % using active strategy 1 (Threshold) and 2 (Proba-
bilistic) executed by the SRNG algorithm on the Proteom 2 data set.
[Figure: relative number of queries vs. cycle; curves for the probabilistic and the threshold strategy]
Fig. 3. Number of queries in % using active strategy 1 (Threshold) and 2 (Proba-
bilistic) executed by the SRNG algorithm on the WDBC data set.
random approach suffers from over-representing the core of the spiral, which has
a larger data density.
6 Conclusion
Margin based active learning strategies for GLVQ based networks have been
studied. We compared two alternative query strategies incorporating the mar-
gin criterion of the GLVQ networks with a random query selection. Both active
learning strategies show reliable or partially better results in their generalization
ability with respect to the random approach.

[Figure: relative number of queries vs. cycle; curves for the probabilistic and the threshold strategy]
Fig. 4. Number of queries in % using active strategy 1 (Threshold) and 2 (Proba-
bilistic) executed by the SNG algorithm on the Checker data set.

Thereby we found a significantly faster convergence with a much lower number
of necessary queries.
For the threshold strategy we found that it shows an overall stable behavior
with good prediction rates and a significant decrease in processing time. Due to
the automatically adapted parameter the strategy is quite simple, but it depends
on a sufficiently good estimation of the local data distribution. By scaling the
threshold parameter an application-specific trade-off between prediction accu-
racy and speed can be obtained. The probabilistic strategy has been found to
yield similar results with respect to the prediction accuracy, but the number of
queries depends strongly on the annealing schedule; simulations with less restrictive
              SNPC         SNPC active strategy 1     SNPC active strategy 2
              Rec.  Pred.  Rec.  Pred.  Rel. #Q      Rec.  Pred.  Rel. #Q
  WDBC        90%   89%    66%   93%    60%          82%   92%    15%
  Proteom 1   71%   81%    72%   82%    67%          65%   82%    21%

              SNPC-R       SNPC-R active strategy 1   SNPC-R active strategy 2
              Rec.  Pred.  Rec.  Pred.  Rel. #Q      Rec.  Pred.  Rel. #Q
  WDBC        86%   91%    85%   94%    62%          94%   95%    12%
  Proteom 1   72%   86%    89%   82%    66%          92%   87%    16%
Table 3
Classification accuracies for cancer data sets using standard SNPC and SNPC-
R. Data characteristics as above. The prediction accuracies show a reliably good
prediction for the WDBC data as well as for the Proteom 1 data set. The
standard SNPC showed an unstable behavior for the Proteom 2 data; hence
these results are not given in the table. By use of relevance learning the number of
queries as well as the prediction accuracy improved slightly with respect to the
standard approach.
[Figure: relative number of queries vs. cycle; curves for the probabilistic and the threshold strategy]
Fig. 5. Number of queries in % using active strategy 1 (Threshold) and 2 (Proba-
bilistic) executed by the SNPC-R algorithm on the Proteom 1 data set.
[Figure: relative number of queries vs. cycle; curves for the probabilistic and the threshold strategy]
Fig. 6. Number of queries in % using active strategy 1 (Threshold) and 2 (Proba-
bilistic) executed by the SNPC-R algorithm on the WDBC data set.
constraints showed a faster convergence but overfitting on smaller training
data sets. Especially for larger data sets the proposed active learning strate-
gies show great benefits in speed and prediction. In particular, for the considered
mass spectrometric cancer data sets an overall good performance improvement
has been observed. This is interesting from a practical point of view, since the
technical equipment for measuring e.g. a large number of mass spectrometric
data becomes more and more available. In mass spectrometry it is easy to
measure a sample multiple times. The replicates which are taken from such
multiple measurements are in general very similar and differ only by random
but not by systematic new information. In clinical proteomics based on
mass spectrometry, replicates are measured very often to decrease the mea-
surement variance (e.g. by averaging) or to reduce the loss of missed samples
[Figure: normalized spiral data, 1st dim vs. 2nd dim, with codebook positions for no active learning, the probabilistic strategy, and the threshold strategy]
Fig. 7. Plot of the synthetic spiral data with prototype positions as obtained using
the different query strategies.
in case of an error during a measurement of a sample. Typically 4, 8, or even
16 multiple measurements of the same sample are generated, and hence also
for moderate sample sizes (e.g. 50 samples per class) the amount of training
data becomes huge (see footnote 5). The presented approach is optimally suited
to deal with replicate measurements, which may drastically increase the number
of samples and hence typically lead to very long runtimes for ordinary training
using the considered classification algorithms.
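As a rough sketch, the two margin-based query strategies discussed above could be implemented as follows. The margin definition used here (difference of squared distances to the closest wrong-class and closest correct-class prototype) and the exponential acceptance function are plausible assumptions for illustration, not the exact formulas of the paper:

```python
import numpy as np

def margin(x, prototypes, proto_labels, y):
    """Hypothesis margin of sample x with label y: distance to the closest
    wrong-class prototype minus distance to the closest correct-class one.
    Small margins indicate samples near the current decision border."""
    d = ((prototypes - x) ** 2).sum(axis=1)
    return d[proto_labels != y].min() - d[proto_labels == y].min()

def threshold_query(x, y, prototypes, proto_labels, threshold):
    # Strategy 1: query only samples whose margin falls below a threshold,
    # i.e. samples close to the current classification border.
    return margin(x, prototypes, proto_labels, y) < threshold

def probabilistic_query(x, y, prototypes, proto_labels, temperature, rng):
    # Strategy 2: accept a sample with a probability that decreases with
    # its margin; the temperature would be annealed during training.
    p = np.exp(-margin(x, prototypes, proto_labels, y) / temperature)
    return rng.random() < min(1.0, p)
```

With an annealed temperature the probabilistic strategy starts out permissive and becomes increasingly selective, which matches the observed dependence of the query count on the annealing schedule.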
(5) Considering 50 samples per class and 16 replicates per sample one would be
confronted with 1600 highly redundant training items.
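The variance-reduction effect of averaging replicates, mentioned above, can be illustrated with a small simulation; the spectrum length, noise level, and replicate count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_spectrum = np.full(100, 5.0)   # idealized noise-free spectrum
noise_std = 1.0
# 16 replicate measurements of the same sample, each with independent noise
replicates = true_spectrum + rng.normal(0.0, noise_std, size=(16, 100))
averaged = replicates.mean(axis=0)
# Averaging r independent replicates reduces the noise standard deviation
# by a factor of sqrt(r); here 1.0 / sqrt(16) = 0.25.
residual_std = (averaged - true_spectrum).std()
```

For classifier training, however, the 16 replicates carry almost no new information beyond the single sample, which is exactly the redundancy the active learning strategies exploit.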