
Knowledge-Based Systems 19 (2006) 77–83

www.elsevier.com/locate/knosys

Clusterer ensemble
Zhi-Hua Zhou *, Wei Tang
National Laboratory for Novel Software Technology, Nanjing University, Hankou Road 22, Nanjing 210093, China
Received 7 October 2003; accepted 10 December 2005
Available online 13 December 2005

Abstract
Ensemble methods that train multiple learners and then combine their predictions have been shown to be very effective in supervised learning.
This paper explores ensemble methods for unsupervised learning. Here, an ensemble comprises multiple clusterers, each of which is trained by the
k-means algorithm with different initial points. The clusters discovered by different clusterers are aligned, i.e. similar clusters are assigned the
same label, by counting their overlapped data items. Then, four methods are developed to combine the aligned clusterers. Experiments show that
clustering performance can be significantly improved by ensemble methods, and that utilizing mutual information to select a subset of clusterers
for weighted voting is a particularly good choice. Since the proposed methods work by analyzing the clustering results instead of the internal
mechanisms of the component clusterers, they are applicable to diverse kinds of clustering algorithms.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Machine learning; Ensemble learning; Clustering; Unsupervised learning; Selective ensemble

* Corresponding author. Tel.: +86 25 8368 6268; fax: +86 25 8368 6268.
E-mail address: zhouzh@nju.edu.cn (Z.-H. Zhou).
0950-7051/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2005.11.003

1. Introduction

Clustering is a fundamental technique of unsupervised learning, where the task is to find the inherent structure of unlabeled data. A good clusterer should divide the data into several clusters so that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized. Since such a technique is required everywhere and diverse inductive principles exist [1], clustering has always been an active area in machine learning.

During the past decade, ensemble methods that train multiple learners and then combine their predictions to predict new examples have been a hot topic [2]. Since the generalization ability of an ensemble can be better than that of its component learners [3], it is not a surprise that ensemble methods have been widely applied to diverse domains such as face recognition [4], optical character recognition [5], scientific image analysis [6], medical diagnosis [7], etc.

It is worth noting that almost all ensemble methods are designed for supervised learning, where the desired outputs, or labels, of the training instances are known. The known training labels are used in some ensemble methods, such as AdaBoost [8], to evaluate the component learners and then use the evaluation results to weight the learners and change the training data distribution. More importantly, the training labels are necessary for eliminating the ambiguity in combining the component predictions. For example, in voted classifiers, the votes for different class labels are counted and compared. Here it is trivial to determine which vote is for which class, because the training labels have implicitly coordinated the component classifiers so that the ith class label of all the component classifiers is the same.

The lack of training labels makes the design of ensemble methods for unsupervised learning much more difficult than that for supervised learning. For illustration, suppose there are two clusterers, each of which has discovered three clusters from a data set, and the goal is to combine the clusterers so that data items are put into the same cluster if and only if they were put into the same cluster by both of the clusterers. This task is not trivial because there is no guarantee that the ith cluster discovered by one clusterer corresponds to the ith cluster discovered by the other clusterer. So, although ensembles have been well investigated in supervised learning, few works address the issue of designing ensemble methods for clustering.

In this paper, a process for aligning the clusters discovered by different clusterers is developed, which works by measuring the similarity between the clusters through counting their overlapped data items.

Then, four methods for combining the aligned clusterers are proposed. They are voting, weighted-voting where the mutual information weights are used in voting, selective voting where the mutual information weights are used to select a subset of clusterers to vote, and selective weighted-voting where the mutual information weights are used not only in selecting but also in voting. Experimental results show that selective weighted-voting is the best method, whose performance is significantly better than that of a single clusterer. The experiments also reveal that profit is obtained by employing mutual information weights in voting, while greater profit is obtained by building selective ensembles.

The rest of this paper is organized as follows. Section 2 focuses on the generation of the component clusterers. Section 3 presents the align process and proposes methods for combining the aligned component clusterers. Section 4 reports on the experimental results. Finally, Section 5 summarizes the main contributions of this paper and raises several issues for future work.

2. Generate component clusterers

2.1. Notations

Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$ denote an unlabeled data set in a feature space of dimension d. The ith data item $x_i$ is a d-dimensional feature vector $[x_{i1}, x_{i2}, \ldots, x_{id}]^T$, where T denotes vector transpose. In order to simplify the discussion, here we assume that all the features are numerical, i.e. $x_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, d$) is numerical.

A clusterer dividing X into k clusters can be regarded as a label vector $\lambda \in \mathbb{N}^n$, which assigns the data item $x_i$ to the $\lambda_i$th cluster, i.e. $C_{\lambda_i}$, where $\lambda_i \in \{1, 2, \ldots, k\}$.

A clusterer ensemble with size t comprises t clusterers, i.e. $\{\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(t)}\}$, which can also be regarded as a label vector $\lambda$, $\lambda \in \mathbb{N}^n$ and $\lambda = F(\{\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(t)}\})$, where $F(\cdot)$ is a function corresponding to the combining methods presented in Section 3.

2.2. k-means

The idea of the well-known k-means algorithm [9] is to iteratively update the mean value of the data items in a cluster, and to regard the stabilized value as the representative of the cluster. The basic algorithm is shown in Fig. 1.

Fig. 1. The basic k-means algorithm.

There exist many variants of the basic k-means algorithm based on different distance measures or representations of the centers. Strehl et al. [10] have shown that different distance measures have different impacts on the performance of the k-means algorithm. For convenience of discussion, in this paper the basic k-means algorithm employing the Euclidean distance is used.

A characteristic of the k-means algorithm is that it is quite sensitive to the choice of the initial points, i.e. the data items selected to be the initial centers of the clusters. In supervised learning, if an algorithm has several alternative parameter configurations, a simple strategy is to run the algorithm several times, each with a specific configuration, and then use a validation set to choose the best version. But in unsupervised learning, it is difficult to judge which version is the best since there are no training labels available. Fortunately, such a characteristic is not bad news for building ensembles of k-means, because it makes it easy to obtain diverse component clusterers by simply running the algorithm multiple times with different initial points.
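As an illustration of how the component clusterers might be generated, the following sketch (not from the paper; a minimal NumPy implementation assuming Euclidean distance, random initial centers, and 0-based labels) runs a basic k-means in the spirit of Fig. 1 several times with different initial points to obtain t diverse label vectors. The function names are illustrative only.

```python
import numpy as np

def basic_kmeans(X, k, rng, max_iter=100, tol=1e-5):
    """A minimal k-means: random initial centers, Euclidean distance,
    stop when the error improvement falls below a small threshold."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_error = np.inf
    for _ in range(max_iter):
        # assign every data item to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update each center to the mean of the items assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        error = d[np.arange(len(X)), labels].sum()
        if prev_error - error < tol:   # error improvement threshold
            break
        prev_error = error
    return labels

def generate_clusterers(X, k, t, seed=0):
    """Run k-means t times with different initial points (Section 2.2)."""
    rng = np.random.default_rng(seed)
    return [basic_kmeans(np.asarray(X, float), k, rng) for _ in range(t)]
```

Because only the initial centers change between runs, the resulting label vectors are diverse but not yet comparable label-by-label, which is why the align process of Section 3.1 is needed.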


3. Combine component clusterers

3.1. Align process

The component clusterers must be aligned before they are combined. This is because the component clusterers may assign similar clusters different labels. For example, suppose there are two clusterers whose corresponding label vectors are $[1, 2, 2, 1, 1, 3, 3]^T$ and $[2, 3, 3, 2, 2, 1, 1]^T$, respectively. Although the appearance of these label vectors is quite different, in fact they represent the same clustering result. Therefore, the label vectors must be aligned so that the same label denotes similar clusters.

In this paper, the clusterers are aligned based on the recognition that similar clusters should contain similar data items. In detail, suppose there are two clusterers whose corresponding label vectors are $\lambda^{(a)}$ and $\lambda^{(b)}$, respectively, and each clusterer divides the data set into k clusters, i.e. $\{C_1^{(a)}, C_2^{(a)}, \ldots, C_k^{(a)}\}$ and $\{C_1^{(b)}, C_2^{(b)}, \ldots, C_k^{(b)}\}$, respectively. For each pair of clusters from different clusterers, such as $C_i^{(a)}$ and $C_j^{(b)}$, the number of overlapped data items, i.e. data items appearing in both $C_i^{(a)}$ and $C_j^{(b)}$, is counted. Then, the pair of clusters whose number of overlapped data items is the largest are matched, in the sense that they are denoted by the same label. Such a process is repeated until all the clusters are matched. The pseudo-code of the align process is shown in Fig. 2.

Fig. 2. The pseudo-code of the align process.

When there are t (t > 2) clusterers, one clusterer can be regarded as the baseline to which the remaining clusterers are aligned. In this paper, the baseline clusterer is randomly selected from the component clusterers. Note that the align process requires only a one-pass scan of the data items, no matter how many clusterers are to be aligned, and it requires the storage of only $(t-1) \times k^2$ integers that are used to keep the numbers of overlapped data items. It is evident that such an align process is quite efficient.

It is worth noting that, according to the objective optimized by some clustering algorithms such as k-means, different clusterers are similar if they have a similar clustering quality, i.e. if the sum of distances from data items to their nearest centers is about the same. However, since the goal of the process presented in this section is to enable the clusters generated by different clusterers to be combined, regardless of how similar the clusterers themselves are, it is similar clusters, rather than similar clusterers, that are to be identified.

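To make the align step concrete, here is a small sketch (an illustrative reconstruction, not the paper's Fig. 2) that greedily matches the clusters of one clusterer to those of a baseline clusterer by their overlap counts and relabels accordingly. It assumes 0-based labels, whereas the paper's notation is 1-based.

```python
import numpy as np

def align_to_baseline(base_labels, labels, k):
    """Relabel `labels` so that each of its clusters gets the label of the
    baseline cluster it overlaps most (greedy largest-overlap matching)."""
    overlap = np.zeros((k, k), dtype=int)
    for b, l in zip(base_labels, labels):
        overlap[b, l] += 1                      # one-pass count of overlaps
    mapping, used_base, used_other = {}, set(), set()
    for _ in range(k):
        # match the pair of still-unmatched clusters with the largest overlap
        i, j = max(((i, j) for i in range(k) if i not in used_base
                           for j in range(k) if j not in used_other),
                   key=lambda pair: overlap[pair])
        mapping[j] = i
        used_base.add(i)
        used_other.add(j)
    return np.array([mapping[l] for l in labels])

# Align every clusterer to a randomly chosen baseline, e.g. the first one:
# aligned = [align_to_baseline(clusterers[0], lam, k) for lam in clusterers]
```

The only state kept while matching is the k-by-k overlap matrix per clusterer pair, which matches the storage argument made above.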
3.2. Combining methods

The simplest combining method is voting, where the ith component of the label vector corresponding to the ensemble, i.e. $\lambda_i$, is determined by the plurality voting result of $\lambda_i^{(1)}, \lambda_i^{(2)}, \ldots, \lambda_i^{(t)}$.

The second method, i.e. weighted-voting, employs the mutual information between a pair of clusterers [10] to compute the weight of each clusterer. For two label vectors $\lambda^{(a)}$ and $\lambda^{(b)}$, suppose there are n objects, where $n_i$ are in cluster $C_i^{(a)}$, $n_j$ are in cluster $C_j^{(b)}$, and $n_{ij}$ are in both $C_i^{(a)}$ and $C_j^{(b)}$. The [0, 1]-normalized mutual information $\Phi^{\mathrm{NMI}}$ can be defined as:

$$\Phi^{\mathrm{NMI}}(\lambda^{(a)}, \lambda^{(b)}) = \frac{2}{n}\sum_{i=1}^{k}\sum_{j=1}^{k} n_{ij}\,\log_{k^2}\!\left(\frac{n_{ij}\,n}{n_i\,n_j}\right) \qquad (1)$$

Then, for every clusterer, the average mutual information can be computed as:

$$\beta_m = \frac{1}{t-1}\sum_{l=1,\,l\neq m}^{t}\Phi^{\mathrm{NMI}}(\lambda^{(m)}, \lambda^{(l)}) \qquad (m = 1, 2, \ldots, t) \qquad (2)$$

The bigger the value of $\beta_m$, the less statistical information contained by the mth clusterer that is not already contained by the other clusterers. Therefore, the weights of the clusterers can be defined as

$$w_m = \frac{1}{\beta_m Z} \qquad (m = 1, 2, \ldots, t) \qquad (3)$$

where Z is used to normalize the weights so that

$$w_m > 0 \ (m = 1, 2, \ldots, t) \quad \text{and} \quad \sum_{m=1}^{t} w_m = 1 \qquad (4)$$

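The weights can be computed directly from the label vectors (NMI is invariant to label permutations, so the computation does not depend on the align step). The following sketch is my own illustration of Eqs. (1)-(3) as reconstructed above, assuming 0-based labels and non-degenerate clusterings (so that no $\beta_m$ is zero); the function names are illustrative.

```python
import numpy as np

def nmi(a, b, k):
    """[0, 1]-normalized mutual information of two label vectors, Eq. (1)."""
    n = len(a)
    n_ij = np.zeros((k, k))
    for x, y in zip(a, b):
        n_ij[x, y] += 1
    n_i, n_j = n_ij.sum(axis=1), n_ij.sum(axis=0)
    total = 0.0
    for i in range(k):
        for j in range(k):
            if n_ij[i, j] > 0:
                # logarithm with base k^2, as in Eq. (1)
                total += n_ij[i, j] * np.log(n * n_ij[i, j] / (n_i[i] * n_j[j])) / np.log(k * k)
    return 2.0 * total / n

def mutual_information_weights(label_vectors, k):
    """beta_m of Eq. (2) and the normalized weights w_m of Eqs. (3)-(4)."""
    t = len(label_vectors)
    beta = np.array([np.mean([nmi(label_vectors[m], label_vectors[l], k)
                              for l in range(t) if l != m])
                     for m in range(t)])
    w = 1.0 / beta               # weight inversely proportional to beta_m
    return beta, w / w.sum()     # normalized so that the weights sum to one
```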

It was shown that selective ensemble methods, which select a subset of learners to ensemble, may be superior to ensembling all the component learners [11]. The mutual information weights, i.e. $\{w_1, w_2, \ldots, w_t\}$, can be used to select the clusterers. This is realized by excluding from the ensemble the clusterers whose mutual information weight is smaller than a threshold. In this paper the threshold is set to 1/t.

The selected clusterers can be combined via voting, or via weighted-voting based on the re-normalized mutual information weights of the selected clusterers. Thus, another two combining methods, i.e. selective voting and selective weighted-voting, are obtained.

It is worth mentioning that the time costs of weighted-voting, selective voting, and selective weighted-voting are comparable, while that of voting is slightly less because it does not require the computation of the mutual information weights. However, the time cost of computing the mutual information weights is negligible compared with that of the k-means clustering process. Therefore, the time cost of building an ensemble of k-means by the proposed methods is roughly m times that of training a single k-means clusterer, where m is the number of clusterers that are trained to be considered for ensembling.

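The four combining methods can be expressed with two small helpers, sketched below under the same assumptions as the earlier snippets (0-based labels, aligned label vectors stacked as a NumPy array of shape (t, n)); this is an illustration, not the authors' implementation.

```python
import numpy as np

def select_clusterers(weights, threshold):
    """Indices of clusterers whose weight reaches the threshold (the paper
    uses 1/t), together with their re-normalized weights."""
    weights = np.asarray(weights, dtype=float)
    keep = np.flatnonzero(weights >= threshold)
    return keep, weights[keep] / weights[keep].sum()

def vote(aligned, k, weights=None):
    """(Weighted) plurality voting over aligned label vectors of shape (t, n)."""
    aligned = np.asarray(aligned)
    t, n = aligned.shape
    w = np.ones(t) if weights is None else np.asarray(weights, dtype=float)
    ballot = np.zeros((n, k))
    for labels, wm in zip(aligned, w):
        ballot[np.arange(n), labels] += wm     # each clusterer casts its vote
    return ballot.argmax(axis=1)               # combined label vector

# The four combining methods of Section 3.2 in terms of these helpers
# (aligned is a NumPy array, w the mutual information weights, t = len(w)):
#   voting:                     vote(aligned, k)
#   weighted-voting:            vote(aligned, k, weights=w)
#   selective voting:           idx, _  = select_clusterers(w, 1/t); vote(aligned[idx], k)
#   selective weighted-voting:  idx, wr = select_clusterers(w, 1/t); vote(aligned[idx], k, wr)
```
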
4. Experiments

4.1. Data sets

Ten data sets from the UCI Machine Learning Repository [12] are used, all of which contain only numerical attributes except the class attributes. For image segmentation, a constant attribute has been removed. The information about the data sets is tabulated in Table 1. Note that the class attributes of the data sets have not been used in the training of the clusterers and the clusterer ensembles.

Table 1
Data sets used in experiments

Data set              Attribute   Class   Size
Image segmentation    18          7       2310
Ionosphere            34          2       351
Iris                  4           3       150
Liver disorder        6           2       345
Page blocks           10          5       5473
Vehicle               18          4       846
Waveform21            21          3       5000
Waveform40            40          3       5000
Wine                  13          3       178
Wpbc                  33          2       198

4.2. Evaluation scheme

In general, it is difficult to evaluate a clusterer, because whether its clustering quality is good or not almost fully depends on the view of the user. However, when a class attribute that has not been used in the training process exists, the scheme proposed by Modha and Spangler [13] can provide a relatively objective evaluation, which assumes that the class attribute exposes some inherent property of the data set that should be captured by the clusterer.

In detail, the clusterers are converted into classifiers using the following simple rule: identify each cluster with the class that has the largest overlap with the cluster, and assign every data item in that cluster to the found class. The rule allows multiple clusters to be assigned to a single class, but never assigns a single cluster to multiple classes. Suppose there are c classes, i.e. $\{C_1, C_2, \ldots, C_c\}$, in the ground-truth classification. For a given clusterer, by using the above rule, let $a_h$ denote the number of data items that are correctly assigned to the class $C_h$. Then, the clustering performance of the clusterer can be measured by micro-precision, which can be computed as:

$$\text{micro-}p = \frac{1}{n}\sum_{h=1}^{c} a_h \qquad (5)$$

The bigger the value of micro-p, the better the clustering performance.

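As a concrete reading of this scheme, the sketch below (my own illustration, not the paper's code) maps every cluster to its majority class, counts the items that are thereby correctly assigned, and divides by n, as in Eq. (5).

```python
import numpy as np

def micro_precision(cluster_labels, class_labels):
    """Micro-precision of Eq. (5): identify each cluster with the class it
    overlaps most, count the correctly assigned items, divide by n."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()    # items matching the cluster's majority class
    return correct / len(cluster_labels)
```

Note that, as the text explains, this measure is only meaningful when the compared clusterers use the same number of clusters k.
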
Such a scheme can only be used to compare clusterers with a fixed number of clusters, i.e. clusterers with the same model complexity. Therefore, in our experiments, for a given data set, the value of k, i.e. the number of clusters to be discovered, is fixed to the number of classes conveyed by the class attribute. Note that the ensemble methods proposed in Section 3 do not guarantee that k will not be reduced after the combination process. In fact, in some cases such a reduction may be helpful because it may reveal that the number of actual clusters is smaller than was anticipated. But here the reduction would disable the above scheme from comparing the clustering performance. Fortunately, such a reduction never occurred in any of our experiments.

4.3. Results

In our experiments the number of iteration steps of the k-means algorithm is set to 100, and the error improvement threshold is set to $1\times 10^{-5}$. For each data set, each of the four ensemble methods proposed in Section 3 is used to build five clusterer ensembles comprising 5, 8, 13, 20, or 30 component clusterers, respectively. The process is repeated 10 times. Then, for each data set, each method, and each ensemble size, the average micro-p and its standard deviation are recorded. The average performance of a single k-means clusterer is also recorded for comparison. The detailed experimental results are presented in Appendix A.

The pairwise two-tailed t-test results under a significance level of 0.05 are summarized in Table 2, where 'win/loss' means that the ensemble method is significantly better/worse than the single k-means algorithm, and 'tie' means that there is no significant difference between the ensemble method and the single k-means algorithm.

Table 2
Summary of the pairwise two-tailed t-test results under significance level of 0.05

Ensemble size   Voting         w-voting       sel-voting     sel-w-voting
                Win/tie/loss   Win/tie/loss   Win/tie/loss   Win/tie/loss
5               1/7/2          2/6/2          3/4/3          4/4/2
8               0/8/2          3/6/1          4/5/1          5/4/1
13              2/6/2          2/6/2          2/7/1          4/6/0
20              1/6/3          3/4/3          5/3/2          5/3/2
30              1/5/4          3/4/3          5/3/2          5/3/2

w-voting denotes weighted-voting, sel-voting denotes selective voting, and sel-w-voting denotes selective weighted-voting.

Table 2 shows that the clustering performance of voting, weighted-voting, and selective voting is worse than, comparable to, and slightly better than that of the single k-means clusterer, respectively, while the performance of selective weighted-voting is significantly better than that of the single k-means clusterer.

It is impressive that when the ensemble size is 13, selective weighted-voting never loses to the single k-means clusterer. This observation shows that ensemble methods can improve the clustering performance. It also reveals that utilizing mutual information in the combination of the component clusterers is beneficial, and so is building selective ensembles.

Table 3 summarizes the best ensemble method for each data set. It confirms that selective weighted-voting is the best method, since it achieves the best performance on four data sets. Moreover, Table 3 shows that the performances of weighted-voting and selective voting are very close, because each of them achieves the best result on three data sets.

Table 3
The best ensemble method for the experimental data sets

Data set              Best method              Data set      Best method
Image segmentation    sel-w-voting (13/30)     Vehicle       sel-w-voting (1.4/5)
Ionosphere            sel-w-voting (4.5/30)    Waveform21    w-voting (13)
Iris                  w-voting (20)            Waveform40    sel-voting (3.5/8)
Liver disorder        sel-w-voting (3.3/13)    Wine          w-voting (8)
Page blocks           sel-voting (4.1/13)      Wpbc          sel-voting (2.0/30)

w-voting denotes weighted-voting, sel-voting denotes selective voting, and sel-w-voting denotes selective weighted-voting. The number in the bracket is the ensemble size with which the best clustering performance is obtained. For the selective ensemble methods, the ensemble size is shown as a ratio of the number of selected clusterers against the number of clusterers available.

It is worth mentioning that although utilizing mutual information and building selective ensembles are comparably effective from the aspect of improving clustering performance, we believe that the latter mechanism provides bigger profit because it uses fewer component clusterers to make up an ensemble. In fact, Table 4 shows that the selective ensemble methods, i.e. selective voting and selective weighted-voting, only keep about 28–40% of the available clusterers. In cases where the clusterers must be stored for future use, such an advantage of selective ensembles should not be neglected.

Table 4
The geometrical mean percentage of clusterers selected by selective voting and selective weighted-voting under different ensemble sizes

Ensemble size    Percentage of selected clusterers
5                39.2 (1.96/5)
8                38.1 (3.05/8)
13               33.7 (4.38/13)
20               31.8 (6.36/20)
30               28.0 (8.41/30)

The impacts of the change of ensemble size on the clustering performance are depicted in Fig. 3. It is interesting to see that for the methods that can improve the clustering performance, i.e. weighted-voting, selective voting, and selective weighted-voting, the performance increases as the ensemble size increases, but for the useless method voting, the performance keeps almost constant or even decreases as the ensemble size increases. Why the performance of voting is so poor is a problem to be explored in future work.

Fig. 3. The impacts of the change of ensemble size on the clustering performance (geometrical mean micro-p for ensemble sizes 5, 8, 13, 20, and 30). w-voting denotes weighted-voting, sel-voting denotes selective voting, and sel-w-voting denotes selective weighted-voting. Geometrical mean denotes the average across all the data sets. Here the ensemble size of selective voting and selective weighted-voting denotes the number of candidate clusterers instead of their real ensemble size.

5. Conclusion

In this paper, four methods are proposed for building ensembles of k-means clusterers. The component clusterers are generated by running the k-means algorithm multiple times with different initial points. An align process is applied to ensure that the same label used by different clusterers denotes similar clusters. The aligned clusterers are combined via voting or its variants. Experiments show that selective weighted-voting, which utilizes mutual information to select a subset of clusterers for weighted voting, is the best method and can significantly improve the clustering performance. It is also found that utilizing mutual information weights and building selective ensembles are both beneficial to clusterer ensembles, while the latter mechanism is more rewarding because it helps obtain ensembles with smaller sizes.

It is worth mentioning that although k-means is used as the base clusterer in this paper, this does not mean that the proposed methods can only be applied to k-means. Since these methods work by analyzing the clustering results instead of the internal mechanisms of the component clusterers, they are applicable to diverse kinds of clustering algorithms.

Table A1
The clustering performance when ensemble size is 5

Data set            Single        Voting        w-voting      sel-voting    sel-w-voting  Selected
Image segmentation  0.706±0.021   0.716±0.038   0.749±0.040   0.651±0.087   0.736±0.040   2.00±0.67
Ionosphere          0.722±0.024   0.711±0.001   0.768±0.121   0.768±0.121   0.768±0.121   1.40±0.52
Iris                0.878±0.006   0.887±0.000   0.887±0.000   0.862±0.014   0.862±0.014   1.60±0.52
Liver disorder      0.822±0.007   0.815±0.007   0.815±0.007   0.842±0.016   0.842±0.016   2.20±1.55
Page blocks         0.476±0.027   0.461±0.058   0.469±0.063   0.520±0.046   0.520±0.051   1.70±0.48
Vehicle             0.451±0.018   0.439±0.019   0.440±0.024   0.492±0.055   0.497±0.053   1.40±0.52
Waveform21          0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   1.70±0.48
Waveform40          0.548±0.000   0.548±0.000   0.548±0.000   0.548±0.000   0.548±0.000   1.90±0.74
Wine                0.948±0.006   0.949±0.005   0.949±0.005   0.941±0.031   0.941±0.031   1.50±0.71
Wpbc                0.599±0.002   0.598±0.000   0.598±0.000   0.602±0.009   0.602±0.009   4.20±1.69
Geometrical mean    0.670±0.011   0.668±0.013   0.678±0.026   0.678±0.038   0.687±0.034   1.96±0.79

Note that after truncating from the fourth decimal digit, the differences on waveform21 and waveform40 are concealed.

Table A2
The clustering performance when ensemble size is 8

Data set            Single        Voting        w-voting      sel-voting    sel-w-voting  Selected
Image segmentation  0.711±0.017   0.739±0.060   0.770±0.051   0.713±0.051   0.749±0.055   3.20±1.40
Ionosphere          0.721±0.024   0.711±0.002   0.768±0.121   0.768±0.120   0.768±0.121   2.70±0.95
Iris                0.879±0.004   0.891±0.031   0.904±0.037   0.850±0.043   0.862±0.014   2.40±0.70
Liver disorder      0.820±0.005   0.812±0.001   0.813±0.005   0.845±0.012   0.846±0.012   2.70±2.06
Page blocks         0.463±0.020   0.464±0.058   0.473±0.060   0.535±0.029   0.534±0.035   2.70±1.16
Vehicle             0.447±0.012   0.440±0.018   0.447±0.036   0.479±0.034   0.479±0.039   2.30±1.16
Waveform21          0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   2.70±0.67
Waveform40          0.551±0.009   0.548±0.000   0.548±0.000   0.571±0.071   0.571±0.071   3.50±1.78
Wine                0.947±0.004   0.949±0.002   0.953±0.008   0.929±0.040   0.929±0.040   2.30±0.95
Wpbc                0.599±0.002   0.598±0.000   0.598±0.000   0.604±0.010   0.604±0.010   6.00±3.23
Geometrical mean    0.669±0.010   0.670±0.017   0.683±0.032   0.685±0.041   0.689±0.040   3.05±1.41

Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.

Table A3
The clustering performance when ensemble size is 13

Data set            Single        Voting        w-voting      sel-voting    sel-w-voting  Selected
Image segmentation  0.704±0.013   0.737±0.023   0.769±0.038   0.726±0.047   0.756±0.058   5.30±1.57
Ionosphere          0.717±0.015   0.710±0.001   0.767±0.121   0.769±0.120   0.769±0.120   3.90±1.66
Iris                0.877±0.004   0.886±0.002   0.886±0.002   0.808±0.104   0.829±0.088   3.40±1.26
Liver disorder      0.820±0.004   0.812±0.000   0.812±0.000   0.847±0.006   0.849±0.002   3.30±1.49
Page blocks         0.461±0.014   0.461±0.056   0.471±0.061   0.557±0.024   0.545±0.004   4.10±1.20
Vehicle             0.447±0.013   0.433±0.002   0.434±0.006   0.473±0.030   0.472±0.037   3.20±1.40
Waveform21          0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   4.70±0.95
Waveform40          0.550±0.006   0.548±0.000   0.548±0.000   0.570±0.072   0.570±0.072   4.80±1.69
Wine                0.946±0.003   0.950±0.010   0.949±0.010   0.930±0.040   0.930±0.040   4.00±2.05
Wpbc                0.598±0.002   0.598±0.000   0.598±0.000   0.603±0.018   0.603±0.018   7.10±6.23
Geometrical mean    0.667±0.007   0.669±0.010   0.679±0.024   0.683±0.046   0.688±0.044   4.38±1.95

Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.

From the literature on ensemble learning, it can be found that voting is an effective combining method that is often used in building ensembles of supervised learning algorithms. However, this paper shows that voting performs quite poorly while its variants, such as selective weighted-voting, perform well in the unsupervised learning scenario. How to explain this phenomenon remains an open problem to be explored in future work.

Another interesting issue to be explored is to see whether successful supervised ensemble methods, such as Bagging [14] and AdaBoost [8], can be modified for unsupervised learning. Moreover, since different distance measures have different impacts on the clustering performance [10], and different clustering algorithms may favor different kinds of cluster architectures, i.e. different algorithms may be effective at detecting different kinds of clusters, it will be interesting to investigate heterogeneous clusterer ensembles, i.e. ensembles composed of different kinds of clusterers.

Table A4
The clustering performance when ensemble size is 20

Data set            Single        Voting        w-voting      sel-voting    sel-w-voting  Selected
Image segmentation  0.704±0.009   0.739±0.040   0.757±0.044   0.748±0.059   0.762±0.062   8.60±2.46
Ionosphere          0.719±0.010   0.710±0.001   0.852±0.150   0.852±0.148   0.853±0.149   4.40±3.63
Iris                0.876±0.002   0.879±0.048   0.912±0.042   0.829±0.078   0.826±0.086   5.10±1.20
Liver disorder      0.819±0.003   0.812±0.000   0.812±0.000   0.848±0.003   0.848±0.003   4.40±1.78
Page blocks         0.461±0.010   0.450±0.042   0.473±0.059   0.545±0.005   0.545±0.004   7.00±1.89
Vehicle             0.447±0.008   0.434±0.003   0.437±0.010   0.478±0.015   0.478±0.017   5.30±1.70
Waveform21          0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   8.10±1.45
Waveform40          0.549±0.004   0.548±0.000   0.548±0.000   0.570±0.072   0.570±0.072   7.70±2.95
Wine                0.946±0.003   0.947±0.007   0.947±0.008   0.919±0.046   0.921±0.043   5.80±3.26
Wpbc                0.599±0.002   0.598±0.000   0.598±0.000   0.607±0.019   0.607±0.019   7.20±8.88
Geometrical mean    0.667±0.005   0.667±0.014   0.689±0.031   0.695±0.045   0.696±0.046   6.36±2.92

Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.

Table A5
The clustering performance when ensemble size is 30

Data set            Single        Voting        w-voting      sel-voting    sel-w-voting  Selected
Image segmentation  0.704±0.008   0.743±0.048   0.757±0.055   0.746±0.076   0.784±0.067   13.00±3.68
Ionosphere          0.720±0.008   0.710±0.001   0.910±0.138   0.909±0.136   0.911±0.137   4.50±5.06
Iris                0.877±0.003   0.858±0.045   0.910±0.044   0.853±0.000   0.853±0.000   7.60±2.37
Liver disorder      0.819±0.003   0.812±0.000   0.812±0.000   0.848±0.003   0.848±0.003   6.40±2.67
Page blocks         0.461±0.009   0.464±0.057   0.465±0.054   0.545±0.005   0.543±0.002   10.30±2.31
Vehicle             0.446±0.007   0.433±0.003   0.435±0.005   0.481±0.010   0.483±0.013   7.60±2.95
Waveform21          0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   0.553±0.000   12.90±1.29
Waveform40          0.549±0.002   0.548±0.000   0.548±0.000   0.571±0.071   0.571±0.071   10.80±4.52
Wine                0.946±0.003   0.946±0.006   0.946±0.007   0.922±0.041   0.922±0.041   9.00±5.68
Wpbc                0.599±0.001   0.598±0.000   0.598±0.000   0.613±0.020   0.607±0.025   2.00±1.05
Geometrical mean    0.667±0.004   0.666±0.016   0.693±0.030   0.704±0.036   0.707±0.036   8.41±3.16

Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.

Acknowledgements

This work was supported by the National Science Fund for Distinguished Young Scholars of China under the Grant No. 60325207, the Fok Ying Tung Education Foundation under the Grant No. 91067, and the Excellent Young Teachers Program of MOE, China.

Appendix A

Tables A1–A5 present the detailed experimental results summarized in Section 4, where single denotes a single k-means clusterer, w-voting denotes weighted-voting, sel-voting denotes selective voting, sel-w-voting denotes selective weighted-voting, and geometrical mean denotes the average across all data sets. The 2nd to the 5th columns of the tables record the micro-p while the last column records how many clusterers have been selected by selective voting or selective weighted-voting. The values following '±' are the standard deviations.

References

[1] V. Estivill-Castro, Why so many clustering algorithms—a position paper, SIGKDD Explorations 4 (1) (2002) 65–75.
[2] T.G. Dietterich, Ensemble learning, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, second ed., MIT Press, Cambridge, MA, 2002.
[3] L.K. Hansen, P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10) (1990) 993–1001.
[4] F.J. Huang, Z.-H. Zhou, H.-J. Zhang, T. Chen, Pose invariant face recognition, Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 2000, pp. 245–250.
[5] H. Drucker, R. Schapire, P. Simard, Improving performance in neural networks using a boosting algorithm, in: S.J. Hanson, J.D. Cowan, C. Lee Giles (Eds.), Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, CA, 1993, pp. 42–49.
[6] K.J. Cherkauer, Human expert level performance on a scientific image analysis task by a system using combined artificial neural networks, Proceedings of the 13th AAAI Workshop on Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms, Portland, OR, 1996, pp. 15–21.
[7] Z.-H. Zhou, Y. Jiang, Y.-B. Yang, S.-F. Chen, Lung cancer cell identification based on artificial neural network ensembles, Artificial Intelligence in Medicine 24 (1) (2002) 25–36.
[8] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, Spain, 1995, pp. 23–37.
[9] J. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, vol. 1, 1967, pp. 281–297.
[10] A. Strehl, J. Ghosh, R.J. Mooney, Impact of similarity measures on web-page clustering, Proceedings of the AAAI 2000 Workshop on AI for Web Search, Austin, TX, 2000, pp. 58–64.
[11] Z.-H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artificial Intelligence 137 (1–2) (2002) 239–263.
[12] C. Blake, E. Keogh, C.J. Merz, UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.htm], Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[13] D.S. Modha, W.S. Spangler, Feature weighting in k-means clustering, Machine Learning 52 (3) (2003) 217–237.
[14] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.
