\[
\psi_{V \to A} : v \mapsto s(v) = \operatorname{argmin}_{r \in A} d_\lambda(v, w_r) \tag{1}
\]
with
\[
\Omega_r = \left\{ v \in V : r = \psi_{V \to A}(v) \right\} \tag{2}
\]
The subset Ω_r of the input space, which is mapped to a particular neuron r according to (1), forms the (masked) receptive field of that neuron. If the class information of the weight vectors is used, the boundaries ∂Ω_r generate the decision boundaries for the classes. A training algorithm should adapt the prototypes such that, for each class c ∈ L, the corresponding codebook vectors W_c represent the class as accurately as possible.
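As a minimal sketch (hypothetical two-prototype example; plain squared Euclidean distance in place of a general d_λ), the mapping (1) and the receptive fields (2) could be computed as:

```python
def winner(v, prototypes):
    """Crisp winner-takes-all mapping s(v) of Eq. (1): index of the closest prototype."""
    dists = [sum((vi - wi) ** 2 for vi, wi in zip(v, w)) for w in prototypes]
    return dists.index(min(dists))

def receptive_field(points, prototypes, r):
    """Receptive field of prototype r, Eq. (2): all inputs mapped to r."""
    return [v for v in points if winner(v, prototypes) == r]

W = [[0.0, 0.0], [1.0, 1.0]]              # two prototypes (hypothetical)
V = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]]  # three input points (hypothetical)
assignments = [winner(v, W) for v in V]   # [0, 1, 0]
```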
To achieve this goal, GRLVQ optimizes the following cost function, which is related to the number of misclassifications of the prototypes, via a stochastic gradient descent:
\[
\mathrm{Cost}_{\mathrm{GRLVQ}} = \sum_v f(\mu(v)) \quad \text{with} \quad \mu(v) = \frac{d_{r^+} - d_{r^-}}{d_{r^+} + d_{r^-}} \tag{3}
\]
where f(x) = (1 + exp(-x))^{-1} is the standard logistic function, d_{r+} is the similarity of the input vector v to the nearest codebook vector labeled with c_{r+} = c_v, say w_{r+}, and d_{r-} is the similarity to the best matching codebook vector labeled with c_{r-} ≠ c_v, say w_{r-}. The classifier function μ(v) scales the differences of the closest two competing prototypes to (-1, 1); negative values correspond to correct classifications. The learning rule can be derived from the cost function by taking the derivative. As shown in (10), this cost function shows robust behavior, whereas original LVQ2.1 yields divergence.
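A small sketch of one summand of the cost (3), with hypothetical values for the distances d_{r+} and d_{r-}:

```python
import math

def mu(d_plus, d_minus):
    """Classifier function of Eq. (3): value in (-1, 1), negative iff the
    closest correct prototype is nearer than the closest wrong one."""
    return (d_plus - d_minus) / (d_plus + d_minus)

def f(x):
    """Standard logistic function f(x) = (1 + exp(-x))^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))

m = mu(0.2, 0.8)   # correct prototype is closer, so mu is negative
cost_term = f(m)   # one summand of Cost_GRLVQ
```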
SRNG combines this method with neighborhood cooperation as derived in (6) and thus avoids the algorithm getting trapped in local optima of the cost function. In subsequent experiments, we choose the weighted Euclidean metric
\[
d_\lambda(v, w) = \sum_i \lambda_i (v_i - w_i)^2
\]
as metric.
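The weighted Euclidean metric can be sketched directly from the formula; the relevance vector λ below is hypothetical:

```python
def d_lambda(v, w, lam):
    """Weighted squared Euclidean metric: sum_i lambda_i * (v_i - w_i)^2."""
    return sum(l * (vi - wi) ** 2 for l, vi, wi in zip(lam, v, w))

lam = [0.5, 0.5, 0.0]  # the last input dimension is judged irrelevant
dist = d_lambda([1.0, 2.0, 100.0], [1.0, 0.0, -100.0], lam)  # 0.5 * 2^2 = 2.0
```

Note how a zero relevance factor removes an otherwise dominant (noisy) dimension from the distance.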
An alternative approach is the soft nearest prototype classification and its kernel variants incorporating metric adaptation (7), referred to as SNPC. SNPC is based on an alternative cost function which can be interpreted as a Gaussian mixture model (GMM) approach aiming at empirical risk minimization. In the following, SNPC is reviewed very briefly. Thereby we restrict ourselves to the SNPC cost function and some further important parts which are affected by the introduction of an active learning strategy. For details on the derivation of SNPC and the ordinary learning dynamics we refer to (7).
2 Soft nearest prototype classification
We keep the generic notations from the former section and review subsequently the Soft Nearest Prototype Classification (SNPC). SNPC is proposed as an alternative stable NPC learning scheme. It introduces soft assignments for data vectors to the prototypes, which have a statistical interpretation as normalized Gaussians. In the original SNPC as provided in (11), one considers the cost function
\[
E(S) = \frac{1}{N_S} \sum_{k=1}^{N_S} \sum_r u_\tau(r|v_k) \left( 1 - \alpha_{r, c_{v_k}} \right) \tag{4}
\]
with S = {(v, c_v)} the set of all input pairs and N_S = #S. The value α_{r,c_{v_k}} equals one if c_{v_k} = c_r and 0 otherwise. u_τ(r|v_k) is the probability that the input vector v_k is assigned to the prototype r. A crisp winner-takes-all mapping (1) would yield u_τ(r|v_k) = δ(r = s(v_k)).
In order to minimize (4), in SNPC the variables u_τ(r|v_k) are taken as soft assignment probabilities. This allows a gradient descent on the cost function (4). As proposed in (11), the probabilities are chosen as normalized Gaussians
\[
u_\tau(r|v_k) = \frac{\exp\left( -\frac{d(v_k, w_r)}{2\tau^2} \right)}{\sum_{r'} \exp\left( -\frac{d(v_k, w_{r'})}{2\tau^2} \right)} \tag{5}
\]
whereby d is the distance measure used in (1) and τ is the bandwidth, which has to be chosen adequately. Then the cost function (4) can be rewritten as
\[
E_{\mathrm{soft}}(S) = \frac{1}{N_S} \sum_{k=1}^{N_S} lc\left( (v_k, c_{v_k}) \right) \tag{6}
\]
with local costs
\[
lc\left( (v_k, c_{v_k}) \right) = \sum_r u_\tau(r|v_k) \left( 1 - \alpha_{r, c_{v_k}} \right) \tag{7}
\]
i.e., the local error is the sum of the assignment probabilities u_τ(r|v_k) over all prototypes of an incorrect class; hence lc((v_k, c_{v_k})) ≤ 1, with local costs depending on the whole set W. Because the local costs lc((v_k, c_{v_k})) are continuous and bounded, the cost function (6) can be minimized by stochastic gradient descent using the derivative of the local costs, as shown in (11). All prototypes are adapted in this scheme according to the soft assignments. Note that for small bandwidth τ, the learning rule is similar to LVQ2.1.
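A minimal sketch of the soft assignments (5) and the local cost (7), with hypothetical prototypes, labels, and bandwidth, and plain squared Euclidean distance standing in for d:

```python
import math

def soft_assignments(v, prototypes, tau):
    """Normalized Gaussian assignment probabilities u_tau(r|v) of Eq. (5)."""
    d = [sum((vi - wi) ** 2 for vi, wi in zip(v, w)) for w in prototypes]
    e = [math.exp(-di / (2.0 * tau ** 2)) for di in d]
    z = sum(e)
    return [ei / z for ei in e]

def local_cost(u, proto_labels, c_v):
    """Local cost lc of Eq. (7): soft assignment mass on wrongly labeled prototypes."""
    return sum(ur for ur, cr in zip(u, proto_labels) if cr != c_v)

W = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # hypothetical prototypes
labels = [0, 0, 1]                         # their class labels
u = soft_assignments([0.1, 0.1], W, tau=0.5)
lc = local_cost(u, labels, c_v=0)          # mass on the single wrong prototype
```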
2.1 Relevance learning for SNPC
Like all NPC algorithms, SNPC heavily relies on the metric d, usually the standard Euclidean metric. For high-dimensional data as they occur in proteomic patterns, this choice is not adequate, since noise present in the data set accumulates and likely disrupts the classification. Thus, a focus on the (a priori unknown) relevant parts of the inputs would be much more suitable. Relevance learning as introduced in (12) offers the opportunity to learn metric parameters which account for the different relevance of the input dimensions during training. In analogy to the above learning approaches, this relevance learning idea is included into SNPC, leading to SNPC-R. Instead of the metric d(v_k, w_r), the metric is now parameterized incorporating adaptive relevance factors λ, giving d_λ(v_k, w_r), which is included into the soft assignments (5), whereby the component λ_k of λ is usually chosen as the weighting parameter for input dimension k. The relevance parameters λ_k can be adjusted according to the given training data by taking the derivative of the cost function, i.e. ∂lc((v_k, c_{v_k}))/∂λ_j, using the local cost (7):
\[
\frac{\partial\, lc\left( (v_k, c_{v_k}) \right)}{\partial \lambda_j} = -\frac{1}{2\tau^2} \sum_r u_\tau(r|v_k)\, \frac{\partial d_\lambda}{\partial \lambda_j} \left( 1 - \alpha_{r, c_{v_k}} - lc\left( (v_k, c_{v_k}) \right) \right) \tag{8}
\]
with subsequent normalization of the λ_k.
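Under the weighted squared Euclidean metric (so ∂d_λ/∂λ_j = (v_j - w_{r,j})²), one gradient step on (8) followed by the normalization could be sketched as follows; all data and the learning rate 0.01 are hypothetical:

```python
import math

def relevance_gradient(v, prototypes, lam, proto_labels, c_v, tau):
    """Gradient of the local cost (7) w.r.t. the relevance factors, Eq. (8),
    for the weighted squared Euclidean metric."""
    diff2 = [[(vi - wi) ** 2 for vi, wi in zip(v, w)] for w in prototypes]
    d = [sum(l * d2 for l, d2 in zip(lam, row)) for row in diff2]
    e = [math.exp(-di / (2.0 * tau ** 2)) for di in d]
    z = sum(e)
    u = [ei / z for ei in e]                                       # Eq. (5)
    lc = sum(ur for ur, cr in zip(u, proto_labels) if cr != c_v)   # Eq. (7)
    grad = [0.0] * len(v)
    for r, row in enumerate(diff2):
        alpha = 1.0 if proto_labels[r] == c_v else 0.0
        factor = -u[r] * (1.0 - alpha - lc) / (2.0 * tau ** 2)
        for j, d2 in enumerate(row):
            grad[j] += factor * d2     # factor * (v_j - w_{r,j})^2
    return grad, lc

lam = [0.5, 0.5]
grad, lc = relevance_gradient([0.2, 0.1], [[0.0, 0.0], [1.0, 1.0]],
                              lam, [0, 1], 0, 1.0)
lam = [max(l - 0.01 * g, 0.0) for l, g in zip(lam, grad)]  # gradient step, clipped
s = sum(lam)
lam = [l / s for l in lam]                                 # subsequent normalization
```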
It is worth emphasizing that SNPC-R can also be used with individual metric parameters λ_r for each prototype w_r, or with a class-wise metric shared among prototypes with the same class label c_r, as it is done here, referred to as localized SNPC-R (LSNPC-R). If the metric is shared by all prototypes, LSNPC-R reduces to SNPC-R. The respective adjustment of the relevance parameters can easily be determined in complete analogy to (8).
It has been pointed out in (13) that NPC classification schemes which are based on the Euclidean metric can be interpreted as large margin algorithms for which dimensionality-independent generalization bounds can be derived. Instead of the dimensionality of the data, the so-called hypothesis margin, i.e. the distance by which the hypothesis can be altered without changing the classification on the training set, serves as a parameter of the generalization bound. This result has been extended to NPC schemes with adaptive diagonal metric in (6). This fact is quite remarkable, since D_V new parameters, D_V being the input dimension, are added this way; still, the bound is independent of D_V. This result can even be transferred to the setting of individual metric parameters λ_r for each prototype or class, as we will see below, such that a generally good generalization ability of this method can be expected. Apart from the fact that (possibly local) relevance factors allow a larger flexibility of the approach without decreasing the generalization ability, they are of particular interest for proteomic pattern analysis because they indicate potentially semantically meaningful positions.
Our active learning approach applies to any such LVQ-type learning scheme.
3 Margin based active learning
The first dimensionality-independent large margin generalization bound for LVQ classifiers has been provided in (13). For GRLVQ-type learning, a further analysis is possible which accounts for the fact that the similarity measure is adaptive during training (9). Here, we sketch the argumentation as provided in (9) to derive a bound for a slightly more general situation, where different local adaptive relevance terms are attached to the prototypes.
Theoretical generalization bound for fixed margin
Assume, for the moment, that a two-class problem with labels {-1, 1} is given¹. We assume that an NPC classification scheme is used whereby the locally weighted squared Euclidean metric determines the receptive fields:
\[
v \mapsto \operatorname{argmin}_{r \in A} \sum_l \lambda^r_l \left( v_l - (w_r)_l \right)^2
\]
where l denotes the components of the vectors and the relevance factors are normalized, Σ_l λ_l^r = 1. We further assume that the data are chosen i.i.d. according to the data distribution P(V), whose support is limited by a ball of radius B, and that the class labels are determined by an unknown function. Generalization bounds limit the error, i.e. the probability that the learned classifier does not classify given data correctly:
\[
E_P(\psi) = P\left( c_v \neq \psi_{V \to A}(v) \right) \tag{9}
\]
Note that this error captures the performance for GRLVQ/SRNG networks as well as for SNPC-R learning with local adaptive diagonal metric. Given a classifier and a sample (v, c_v), we define the margin as
\[
M_\lambda(v, c_v) = -d^{\lambda_{r^+}}_{r^+} + d^{\lambda_{r^-}}_{r^-}, \tag{10}
\]
i.e. the difference of the distances of the data point from the closest correct and the closest wrong prototype. (To be precise, we refer to the absolute value as the margin.) For a fixed parameter ρ ∈ (0, 1), the loss function is defined as
\[
L : \mathbb{R} \to \mathbb{R}, \quad t \mapsto
\begin{cases}
1 & \text{if } t \le 0 \\
1 - t/\rho & \text{if } 0 < t \le \rho \\
0 & \text{otherwise}
\end{cases} \tag{11}
\]
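The margin (10) could be computed as in the following sketch; the prototypes and the (here trivial) local relevance vectors are hypothetical:

```python
def margin(v, prototypes, proto_labels, c_v, lams):
    """Margin M_lambda(v, c_v) of Eq. (10): distance to the closest wrong prototype
    minus distance to the closest correct one; positive iff v is classified
    correctly. lams holds one relevance vector per prototype (local metrics)."""
    d = [sum(l * (vi - wi) ** 2 for l, vi, wi in zip(lam, v, w))
         for w, lam in zip(prototypes, lams)]
    d_plus = min(di for di, c in zip(d, proto_labels) if c == c_v)
    d_minus = min(di for di, c in zip(d, proto_labels) if c != c_v)
    return d_minus - d_plus

W = [[0.0, 0.0], [1.0, 1.0]]
lams = [[1.0, 1.0], [1.0, 1.0]]  # plain squared Euclidean as a special case
m = margin([0.2, 0.2], W, [0, 1], 0, lams)
```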
The term
\[
\hat{E}^L_m(\psi) = \sum_{v \in V} L\left( M_\lambda(v, c_v) \right) / |V| \tag{12}
\]
denotes the empirical error on the training data. It counts the data points which are classified wrongly and, in addition, punishes all data points with too small a margin.
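A sketch of the loss (11) and the empirical error (12), evaluated on hypothetical margin values:

```python
def loss(t, rho):
    """Margin loss L of Eq. (11)."""
    if t <= 0:
        return 1.0
    if t <= rho:
        return 1.0 - t / rho
    return 0.0

def empirical_error(margins, rho):
    """Empirical error of Eq. (12): mean loss over the training margins."""
    return sum(loss(t, rho) for t in margins) / len(margins)

# hypothetical margins: one misclassified point, one inside the margin zone, two safe
err = empirical_error([-0.1, 0.25, 0.8, 0.9], rho=0.5)
```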
We can now use techniques from (14), in analogy to the argumentation in (9), to relate the error (9) and the empirical error (12). As shown in (14) (Theorem 7), one can express a bound for the deviation of the empirical error Ê^L_m(ψ)

¹ These constraints are technical and serve to derive the generalization bounds, which have already been derived by two of the authors in (9); the active learning strategies also work for more than 2 classes and for alternative metrics.
and the error E_P(ψ) by the Gaussian complexity of the function class defined by the model:
\[
E_P(\psi) \le \hat{E}^L_m(\psi) + \frac{2K}{\rho} \cdot G_m + \sqrt{\frac{\ln(4/\delta)}{2m}}
\]
with probability at least 1 - δ/2. K is a universal constant. G_m denotes the Gaussian complexity, i.e. it is the expectation
\[
E_{v_1, \ldots, v_m} E_{g_1, \ldots, g_m} \left[ \sup_\psi \left| \frac{2}{m} \sum_{i=1}^m g_i\, \psi(v_i) \right| \right]
\]
where the expectation is taken w.r.t. independent Gaussian variables g_1, ..., g_m with zero mean and unit variance and i.i.d. points v_i sampled according to the marginal distribution induced by P. The supremum is taken over all NPC classifiers with prototypes W of length at most B.
The Gaussian complexity of NPC networks with local adaptive diagonal metric can easily be estimated using techniques from (14): the classifier can be expressed as a Boolean formula of the results of classifiers with only two prototypes. Thereby, at most |W|(|W| - 1) such terms exist. For two prototypes i and j, the corresponding output can be described by the sum of a simple quadratic form and a linear term:
\[
\begin{aligned}
& d^{\lambda_i}_i - d^{\lambda_j}_j \le 0 \\
\Longleftrightarrow\ & (v - w_i)^t \Lambda_i (v - w_i) - (v - w_j)^t \Lambda_j (v - w_j) \le 0 \\
\Longleftrightarrow\ & v^t \Lambda_i v - v^t \Lambda_j v - 2 \left( \Lambda_i w_i - \Lambda_j w_j \right)^t v + (w_i)^t \Lambda_i w_i - (w_j)^t \Lambda_j w_j \le 0
\end{aligned}
\]
where Λ_i denotes the diagonal matrix with the entries of λ_i.
Since the size of the prototypes and the size of the inputs are restricted by B, and since λ is normalized to 1, we can estimate the empirical Gaussian complexity by the sum of
\[
\frac{4 B (B + 1)(B + 2)}{\sqrt{m}}
\]
for the linear term (including the bias) and
\[
\frac{2 B^2}{\sqrt{m}}
\]
for the quadratic term, using (14) (Lemma 22). The Gaussian complexity differs from the empirical Gaussian complexity by at most ε with probability at least 1 - 2 exp(-ε² m / 8). Putting these bounds together, the overall bound
\[
P\left( E_P(\psi) > \hat{E}^L_m(\psi) + \frac{K}{\rho} \sqrt{\frac{\ln |V|}{|V|}} \left( \sqrt{\ln(1/\delta)} + |W|^2 B^3 \right) \right) \le \delta \tag{13}
\]
results with probability δ ∈ (0, 1), K being a universal constant, where the loss L is associated to the margin M_λ(v, c_v) for sample v. Thereby, different realizations are relevant:
(1) L_c(t) = 1 for t < 0 and L_c(t) ≤ |t|, for instance
\[
L_i : t \mapsto
\begin{cases}
1 & \text{if } t \le 0 \\
1 - t/\rho_i & \text{if } 0 < t \le \rho_i \\
0 & \text{otherwise}
\end{cases} \tag{14}
\]
in analogy to (11).
Ê^{L_t}_m(ψ) denotes the empirical error (12) using loss function L_t. We are interested in the probability
\[
P\left( \exists i : E_P(\psi) > \hat{E}^{L_i}_m(\psi) + \varepsilon(i) \right) \tag{15}
\]
i.e. the probability that the empirical error measured with respect to a loss L_i, for any i, and the real error deviate by more than ε(i), where the bound
\[
\varepsilon(i) = \frac{K}{\rho_i} \sqrt{\frac{\ln |V|}{|V|}} \left( \sqrt{\ln(1/(p_i \delta))} + |W|^2 B^3 \right)
\]
depends on the empirical margin ρ_i. Note that the value ε(i) is chosen as the bound derived in (13) for the margin ρ_i and confidence p_i δ. The probability (15) upper bounds the probability of a large deviation of the real error and the empirical error with respect to a loss function which is associated to the empirical margin observed a posteriori on the given training set. For every observed margin ρ < C, some ρ_i can be found with ρ_i ≤ ρ, such that the generalization bound ε(i) results in this setting. The size of ε(i) depends on whether the observed empirical margin corresponds to the prior confidence in reaching this margin, i.e. a large prior p_i.
We can limit (15) as follows:
\[
P\left( \exists i : E_P(\psi) > \hat{E}^{L_i}_m(\psi) + \varepsilon(i) \right) \le \sum_i P\left( E_P(\psi) > \hat{E}^{L_i}_m(\psi) + \varepsilon(i) \right) \le \sum_i p_i \delta = \delta
\]
because of the choice of ε(i) such that it corresponds to the bound (13). This argument allows us to derive bounds of a similar form to (13) for the empirical margin.
Another point worth discussing concerns the assumption that the training data are i.i.d. with respect to an unknown underlying probability. For active learning, the training points are chosen depending on the observed margin; that means they are dependent. However, this argument does not apply to the scenario considered in our case: we assume an a priori fixed training set with i.i.d. data and choose the next training pattern depending on the margin, such that the convergence speed of training and the generalization ability of the trained classifier are improved. This affects the learning algorithm on the given data set; however, the empirical error can nevertheless be evaluated on the whole training set independently of the algorithm, such that the bound as derived above is valid for the a priori fixed training set. For dependent data (e.g. points created online for learning), an alternative technique such as the statistical theory of online learning must be used to derive bounds (18; 19).
Adaptation to noisy data or unknown classes
We will apply the strategy of active learning to data sets for which a small classification error can be achieved with a reasonable margin. We will not test the method on very corrupted or noisy data sets for which a large margin resp. a small training error cannot be achieved. It can be expected that the active strategies as proposed above have to be adapted for very noisy learning scenarios: such strategies will focus on points which are not yet classified correctly resp. which lie close to the border, trying to enforce a separation which is not possible for the data set at hand. In such cases, it would be valuable to restrict the active strategy to points which are still promising, e.g. choosing points only from a limited band parallel to the training hyperplane.
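Such a band-restricted selection might be sketched as follows; the margin values and the band width eps are hypothetical, and this is only one possible realization of the idea above:

```python
def band_query(margins, eps):
    """Hypothetical band-restricted selection: query only samples whose absolute
    margin lies below eps, i.e. inside a band around the decision border."""
    return [i for i, m in enumerate(margins) if abs(m) < eps]

queried = band_query([-0.05, 0.4, 0.02, 0.9], eps=0.1)  # indices 0 and 2
```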
We would like to mention that, so far, we have restricted the active selection strategies to samples where all labels are known beforehand, because the closest correct and wrong prototypes have to be determined in (10). This setting allows us to improve the training speed and performance of batch training. If data are initially unlabeled and queries can be asked for a subset of the data, we can extend these strategies in an obvious way towards this setting: in this case, the margin (10) is given by the closest two prototypes which possess a different class label, whereby the (unknown) class label of the sample point has no influence. L_c(t) is substituted by L_c(|t|).
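The label-free margin described above could be sketched as follows (hypothetical prototypes; plain squared Euclidean distance):

```python
def unlabeled_margin(v, prototypes, proto_labels):
    """Label-free variant of the margin (10): absolute distance difference between
    the two closest prototypes that carry different class labels."""
    d = [sum((vi - wi) ** 2 for vi, wi in zip(v, w)) for w in prototypes]
    order = sorted(range(len(d)), key=lambda r: d[r])
    first = order[0]
    # closest prototype whose label differs from the overall winner's label
    second = next(r for r in order[1:] if proto_labels[r] != proto_labels[first])
    return abs(d[second] - d[first])

W = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0]]
m = unlabeled_margin([0.1, 0.0], W, [0, 0, 1])
```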
4 Synthetic and Clinical data sets
The first data set is the Wisconsin Breast Cancer data set (WDBC) as given by UCI (20). It consists of 569 measurements with 30 features in 2 classes. The other two data sets are taken from proteomic studies, named Proteom 1 and Proteom 2, and originate from our own mass spectrometry measurements on clinical proteomic data³. The Proteom 1 data set consists of 97 samples in two classes with 59 dimensions. The data set Proteom 2 consists of 97 measurements with two classes and 94 dimensions. MALDI-TOF MS combined with magnetic-bead based sample preparation was used to generate proteomic patterns from human EDTA (ethylenediaminetetraacetic acid) plasma samples. The MALDI-TOF mass spectra were obtained using magnetic bead based weak cation exchange chromatography (WCX) (21). The sample eluates from the bead preparation were applied to an AnchorChip™ target by use of HCCA matrix. The material was randomly applied on the target by use of a ClinProt-Robot™ and subsequently measured using an AutoFlex II in linear mode within 1-10 kDa (Bruker Daltonik GmbH, Bremen, Germany). Thereby, each spectrum has been accumulated by use of 450 laser shots with 15 shot positions per spot on the target. Each eluted sample was fourfold spotted and averaged subsequently. Individual attributes such as gender, age, cancer-related preconditions and some others have been controlled during sample collection to avoid bias effects. All data have been collected and prepared in accordance with best clinical practices. Spectra preparation has been done by use of the Bruker ClinProt system (Bruker Daltonik GmbH, Bremen, Germany).

The well separable checkerboard data (Checker) as given in (12) are used as a synthetic evaluation set. It consists of 3700 data points. Further, to illustrate the differences between the different strategies during learning of the codebook vector positions, a simple spiral data set has been created and applied using the SRNG algorithm. The spiral data are generated in a similar way as the data shown in (5), and the set consists of 5238 data points. Checker as well as the spiral data are given in a two-dimensional space.
5 Experiments and Results
For classification, we use 6 prototypes for the WDBC data, 100 prototypes for the well separable checkerboard data set as given in (12), 9 prototypes for the Proteom 1 data set, and 10 for Proteom 2. The parameter settings for SRNG can be summarized as follows: learning rate for correct prototypes: 0.01, learning rate for incorrect prototypes: 0.001, and learning rate for λ: 0.01. The neighborhood range is given by #W/2. For SNPC the same settings as for SRNG are used, with the additional parameters window threshold: 0.05 and width 2.5 for the Gaussian kernel. Learning rates are annealed by an exponential decay. All data have been processed using a 10-fold cross-validation procedure. Results are calculated using SNPC, SNPC-R and SNG, SRNG. SNG and SRNG are used instead of GLVQ and GRLVQ as improved versions of the former using neighborhood cooperation.

³ These data are measured on Bruker systems and are subject to confidentiality agreements with clinical partners, so only some details are listed here.
We now compare both prototype classifiers using randomly selected samples with their counterparts using the proposed query strategies. The classification results are given in Tab. 1 and Tab. 3 without metric adaptation, and in Tab. 2 and Tab. 3 with relevance learning, respectively. The features of all data sets have been normalized: first, we upper bounded the data set by 1.0, and subsequently the data are transformed such that we end up with zero mean and variance 1.0.
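The normalization described above could be sketched per feature as follows; reading the text as "divide by the maximum absolute value, then standardize" is our interpretation:

```python
import statistics

def normalize(columns):
    """Per-feature normalization as described in the text: scale each feature so its
    maximum absolute value is 1.0, then standardize to zero mean and unit variance."""
    result = []
    for col in columns:
        peak = max(abs(x) for x in col)
        col = [x / peak for x in col]                  # upper bound by 1.0
        mean = statistics.fmean(col)
        std = statistics.pstdev(col)
        result.append([(x - mean) / std for x in col])
    return result

cols = normalize([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]])
```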
We applied the training algorithms using the different query strategies as introduced above. The results for recognition and prediction rates using SRNG are shown in Tab. 2⁴ and for SNPC in Tab. 3, respectively. Thereby, the recognition rate is a performance measure of the model indicating the relative number of training data points whose class labels could be correctly recognized using the model. The prediction rate is a measure of the generalization ability of the model, accounting for the relative number of correctly predicted class labels of test data points which were not used in the former training of the classifier. As explained before, each prediction rate is obtained as an average over a 10-fold cross-validation procedure.
            SNG           SNG, active strategy 1    SNG, active strategy 2
            Rec.   Pred.  Rec.   Pred.  Rel. #Q     Rec.   Pred.  Rel. #Q
WDBC        95%    95%    94%    94%    38%         93%    92%    9%
Proteom 1   76%    83%    76%    77%    48%         76%    85%    15%
Proteom 2   73%    67%    73%    65%    49%         73%    62%    27%
Checker     72%    67%    98%    97%    31%         99%    96%    5%

Table 1
Classification accuracies for the cancer and checkerboard data sets using SNG. All data sets consist of two classes, whereby the Proteom 2 data set is quite complex. The prediction accuracies are taken from a 10-fold cross-validation and show a reliably good prediction for the WDBC data as well as for the Proteom 1 data set. The Checker data are not modeled as well; this is due to the fixed upper limit of cycles (1000), as longer runtimes lead for these data to a nearly perfect separation.
For the WDBC data set and the Proteom 2 data set, we found small improvements in the prediction accuracy using active strategy 2. In parts, a small over-fitting behavior using the new query strategies can be observed. Both new query strategies were capable of significantly decreasing the necessary number of queries while keeping at least reliable prediction accuracies with respect to a random query approach. This is depicted for the different data sets using the SRNG algorithm in Figures 1, 2, 3 and 4, and in Figures 5 and 6 using the SNPC algorithm. Although the number of queries differs for SNPC and SRNG on the same data set, the generic trend is similar but clearly reflects the classification performance of the individual classifier. Especially the threshold approach led to a significant decrease in the number of queries in each experiment.

⁴ The relative number of queries is calculated with respect to the maximal number of queries possible up to convergence of SRNG using the corresponding query strategy. The upper limit of cycles has been fixed to 1000.

Fig. 1. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SRNG algorithm on the Proteom 1 data set.

In Figure 7, the behaviour of the proposed strategies is shown for a synthetic spiral data set of around 5000 data points. Both active learning strategies showed a significant decrease in the number of necessary queries; typically only 10%-30% of the queries were executed with respect to a random approach. Using the active learning strategies, the data could be learned nearly perfectly. Random querying required a much larger number of cycles to obtain acceptable results. Considering the prototype distribution, one clearly observes the good positioning of the prototypes using the active learning strategies, whereas the
            SRNG          SRNG, active strategy 1   SRNG, active strategy 2
            Rec.   Pred.  Rec.   Pred.  Rel. #Q     Rec.   Pred.  Rel. #Q
WDBC        95%    94%    95%    94%    29%         97%    97%    7%
Proteom 1   82%    88%    92%    87%    31%         97%    93%    5%
Proteom 2   87%    81%    96%    87%    33%         96%    76%    10%

Table 2
Classification accuracies for the cancer data sets using SRNG. The data characteristics are as given before. A reliably good prediction for the WDBC data as well as for the Proteom 1 data set can be seen. One clearly observes an improved modeling capability by use of relevance learning and a related additional decrease in the number of queries.
Fig. 2. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SRNG algorithm on the Proteom 2 data set.
Fig. 3. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SRNG algorithm on the WDBC data set.
random approach suffers from over-representing the core of the spiral, which has a larger data density.
6 Conclusion
Fig. 4. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SNG algorithm on the Checker data set.

Margin-based active learning strategies for GLVQ-based networks have been studied. We compared two alternative query strategies incorporating the margin criterion of the GLVQ networks with a random query selection. Both active learning strategies show reliable or partially better results in their generalization ability with respect to the random approach. Thereby, we found a significantly faster convergence with a much lower number of necessary queries. For the threshold strategy, we found that it shows an overall stable behavior with good prediction rates and a significant decrease in processing time. Due to the automatically adapted parameter, the strategy is quite simple, but it depends on a sufficiently good estimation of the local data distribution. By scaling the threshold parameter, an application-specific trade-off between prediction accuracy and speed can be obtained. The probabilistic strategy has been found to give similar results with respect to the prediction accuracy, but the number of queries is quite dependent on the annealing strategy; simulating less restrictive
            SNPC          SNPC, active strategy 1   SNPC, active strategy 2
            Rec.   Pred.  Rec.   Pred.  Rel. #Q     Rec.   Pred.  Rel. #Q
WDBC        90%    89%    66%    93%    60%         82%    92%    15%
Proteom 1   71%    81%    72%    82%    67%         65%    82%    21%

            SNPC-R        SNPC-R, active strategy 1 SNPC-R, active strategy 2
            Rec.   Pred.  Rec.   Pred.  Rel. #Q     Rec.   Pred.  Rel. #Q
WDBC        86%    91%    85%    94%    62%         94%    95%    12%
Proteom 1   72%    86%    89%    82%    66%         92%    87%    16%

Table 3
Classification accuracies for the cancer data sets using standard SNPC and SNPC-R. Data characteristics as above. The prediction accuracies show a reliably good prediction for the WDBC data as well as for the Proteom 1 data set. The standard SNPC showed an unstable behavior on the Proteom 2 data; hence these results are not given in the table. By use of relevance learning, the number of queries as well as the prediction accuracy improved slightly with respect to the standard approach.
Fig. 5. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SNPC-R algorithm on the Proteom 1 data set.
Fig. 6. Number of queries in % using active strategy 1 (Threshold) and 2 (Probabilistic) executed by the SNPC-R algorithm on the WDBC data set.
constraints showed a faster convergence but over-fitting on smaller training data sets. Especially for larger data sets, the proposed active learning strategies show great benefits in speed and prediction. In particular, for the considered mass spectrometric cancer data sets, an overall good performance improvement has been observed. This is interesting from a practical point of view, since the technical equipment for measuring e.g. a large number of mass spectrometric data becomes more and more available. In mass spectrometry it is easy to measure a sample multiple times. The replicates which are taken from such multiple measurements are in general very similar and differ only by random but not by systematic new information. In clinical proteomics based on mass spectrometry, replicates are measured very often to decrease the measurement variance (e.g. by averaging) or to reduce the loss of missed samples
Fig. 7. Plot of the synthetic spiral data (normalized) with prototype positions as obtained using the different query strategies: no active learning, probabilistic strategy, and threshold strategy.
in case of an error during the measurement of a sample. Typically 4, 8 or even 16 multiple measurements of the same sample are generated, and hence also for moderate sample sizes (e.g. 50 samples per class) the amount of training data becomes huge⁵. The presented approach is optimally suited to deal with replicate measurements, which may drastically increase the number of samples and hence typically lead to very long runtimes for ordinary training using the considered classification algorithms.
References

[1] E. Baum, Neural net algorithms that learn in polynomial time from examples and queries, IEEE Transactions on Neural Networks 2 (1991) 5-19.
[2] Y. Freund, H. S. Seung, E. Shamir, N. Tishby, Information, prediction and query by committee, in: Advances in Neural Information Processing Systems, 1993, pp. 483-490.
[3] P. Mitra, C. Murthy, S. Pal, A probabilistic active support vector learning algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (3) (2004) 412-418.
[4] L. M. Belue, K. W. Bauer Jr., D. W. Ruck, Selecting optimal experiments for multiple output multilayer perceptrons, Neural Computation 9 (1997) 161-183.
[5] M. Hasenjäger, H. Ritter, Active learning with local models, Neural Processing Letters 7 (1998) 107-117.
⁵ Considering 50 samples per class and 16 replicates per sample, one would be confronted with 1600 highly redundant training items.
[6] B. Hammer, M. Strickert, T. Villmann, Supervised neural gas with general similarity measure, Neural Processing Letters 21 (1) (2005) 21-44.
[7] F.-M. Schleif, T. Villmann, B. Hammer, Local metric adaptation for soft nearest prototype classification to classify proteomic data, in: Fuzzy Logic and Applications: 6th Int. Workshop, WILF 2005, Lecture Notes in Computer Science (LNCS), Springer, 2006, pp. 290-296.
[8] T. Kohonen, Self-Organizing Maps, Vol. 30 of Springer Series in Information Sciences, Springer, Berlin, Heidelberg, 1995 (2nd ext. ed. 1997).
[9] B. Hammer, M. Strickert, T. Villmann, On the generalization ability of GRLVQ networks, Neural Processing Letters 21 (2) (2005) 109-120.
[10] A. Sato, K. Yamada, A formulation of learning vector quantization using a new misclassification measure, in: A. K. Jain, S. Venkatesh, B. C. Lovell (Eds.), Proceedings, Fourteenth International Conference on Pattern Recognition, Vol. 1, IEEE Computer Society, Los Alamitos, CA, USA, 1998, pp. 322-325.
[11] S. Seo, M. Bode, K. Obermayer, Soft nearest prototype classification, IEEE Transactions on Neural Networks 14 (2003) 390-398.
[12] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (8-9) (2002) 1059-1068.
[13] K. Crammer, R. Gilad-Bachrach, A. Navot, N. Tishby, Margin analysis of the LVQ algorithm, in: Proc. NIPS 2002, http://www-2.cs.cmu.edu/Groups/NIPS/NIPS2002/NIPS2002preproceedings/index.html, 2002.
[14] P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, Journal of Machine Learning Research 3 (2002) 463-482.
[15] V. Vapnik, Statistical Learning Theory, Wiley, 1998.
[16] C. Campbell, N. Cristianini, A. Smola, Query learning with large margin classifiers, in: International Conference on Machine Learning, 2000, pp. 111-118.
[17] J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Transactions on Information Theory 44 (5) (1998) 1926-1940.
[18] M. Biehl, A. Ghosh, B. Hammer, Learning vector quantization: The dynamics of winner-takes-all algorithms, Neurocomputing 69 (7-9) (2006) 660-670.
[19] M. Opper, Statistical mechanics of generalization, in: M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, 2003, pp. 1087-1090.
[20] C. Blake, C. Merz, UCI repository of machine learning databases, available at: http://www.ics.uci.edu/~mlearn/MLRepository.html (1998).
[21] E. Schaeffeler, U. Zanger, M. Schwab, M. Eichelbaum, Magnetic bead based human plasma profiling discriminates acute lymphatic leukaemia from non-diseased samples, in: 52nd ASMS Conference (ASMS) 2004, 2004, p. TPV 420.