www.elsevier.com/locate/knosys
Clusterer ensemble
Zhi-Hua Zhou *, Wei Tang
National Laboratory for Novel Software Technology, Nanjing University, Hankou Road 22, Nanjing 210093, China
Received 7 October 2003; accepted 10 December 2005
Available online 13 December 2005
Abstract
Ensemble methods that train multiple learners and then combine their predictions have been shown to be very effective in supervised learning.
This paper explores ensemble methods for unsupervised learning. Here, an ensemble comprises multiple clusterers, each of which is trained by
the k-means algorithm with different initial points. The clusters discovered by different clusterers are aligned, i.e. similar clusters are assigned
the same label, by counting their overlapped data items. Then, four methods are developed to combine the aligned clusterers. Experiments show
that the clustering performance can be significantly improved by ensemble methods, and that utilizing mutual information to select a subset of
clusterers for weighted voting is a good choice. Since the proposed methods work by analyzing the clustering results instead of the internal
mechanisms of the component clusterers, they are applicable to diverse kinds of clustering algorithms.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Machine learning; Ensemble learning; Clustering; Unsupervised learning; Selective ensemble
aligned clusterers are proposed. They are voting, weighted-voting where the mutual information weights are used in voting, selective voting where the mutual information weights are used to select a subset of clusterers to vote, and selective weighted-voting where the mutual information weights are used not only in selecting but also in voting. Experimental results show that selective weighted-voting is the best method, whose performance is significantly better than that of a single clusterer. The experiments also reveal that profit is obtained by employing mutual information weights in voting, while greater profit is obtained by building selective ensembles.

The rest of this paper is organized as follows. Section 2 focuses on the generation of the component clusterers. Section 3 presents the align process and proposes methods for combining the aligned component clusterers. Section 4 reports on the experimental results. Finally, Section 5 summarizes the main contributions of this paper and raises several issues for future work.

Let X = {x_1, x_2, …, x_n} ⊂ R^d denote an unlabeled data set in a feature space of dimension d. The ith data item x_i is a d-dimensional feature vector [x_i1, x_i2, …, x_id]^T, where T denotes vector transpose. In order to simplify the discussion, here we assume that all the features are numerical, i.e. x_ij (i = 1, …, n; j = 1, …, d) is numerical.

A clusterer dividing X into k clusters can be regarded as a label vector l ∈ N^n, which assigns the data item x_i to the l_i-th cluster, i.e. C_{l_i}, where l_i ∈ {1, 2, …, k}.

A clusterer ensemble with size t comprises t clusterers, i.e. {l^(1), l^(2), …, l^(t)}, which can also be regarded as a label vector l, l ∈ N^n, with l = F({l^(1), l^(2), …, l^(t)}), where F(·) is a function corresponding to the combining methods presented in Section 3.

2.2. k-means

The idea of the well-known k-means algorithm [9] is to iteratively update the mean value of the data items in a cluster, and to regard the stabilized value as the representative of the cluster. The basic algorithm is shown in Fig. 1.

There exist many variants of the basic k-means algorithm based on different distance measures or representations of the centers. Strehl et al. [10] have shown that different distance measures have different impacts on the performance of the k-means algorithm. For convenience of discussion, in this paper the basic k-means algorithm employing the Euclidean distance is used.

A characteristic of the k-means algorithm is that it is quite sensitive to the choice of the initial points, i.e. the data items selected to be the initial centers of the clusters. In supervised learning, if an algorithm has several alternative parameter configurations, a simple strategy is to run the algorithm several times, each with a specific configuration, and then use a validation set to choose the best version. But in unsupervised learning, it is difficult to judge which version is the best since there are no training labels available. Fortunately, such a characteristic is not bad news for building ensembles of k-means, because it makes it easy to obtain diverse component clusterers by simply running the algorithm multiple times with different initial points.

The component clusterers must be aligned before they are combined, because the component clusterers may assign similar clusters different labels. For example, suppose there are two clusterers whose corresponding label vectors are [1, 2, 2, 1, 1, 3, 3]^T and [2, 3, 3, 2, 2, 1, 1]^T, respectively. Although these label vectors appear quite different, in fact they represent the same clustering result. Therefore, the label vectors must be aligned so that the same label denotes similar clusters.

In this paper, the clusterers are aligned based on the recognition that similar clusters should contain similar data items. In detail, suppose there are two clusterers whose corresponding label vectors are l^(a) and l^(b), respectively, and each clusterer divides the data set into k clusters, i.e. {C_1^(a), C_2^(a), …, C_k^(a)} and {C_1^(b), C_2^(b), …, C_k^(b)}, respectively. For each pair of clusters from different clusterers, such as C_i^(a) and C_j^(b), the number of overlapped data items, i.e. data items appearing in both C_i^(a) and C_j^(b), is counted. Then the pair of clusters with the largest number of overlapped data items is matched, in the sense that both clusters are denoted by the same label. This process is repeated until all the clusters are matched. The pseudo-code of the align process is shown in Fig. 2.

When there are t (t > 2) clusterers, one clusterer can be regarded as the baseline to which the remaining clusterers are aligned.
Table 1
Data sets used in experiments

Data set            Attribute  Class  Size
Image segmentation  18         7      2310
Ionosphere          34         2      351
Iris                4          3      150
Liver disorder      6          2      345
Page blocks         10         5      5473
Vehicle             18         4      846
Waveform21          21         3      5000
Waveform40          40         3      5000
Wine                13         3      178
Wpbc                33         2      198

Table 2
Summary of the pairwise two-tailed t-test results under significance level of 0.05

Ensemble  Voting        w-voting      sel-voting    sel-w-voting
size      Win/tie/loss  Win/tie/loss  Win/tie/loss  Win/tie/loss
5         1/7/2         2/6/2         3/4/3         4/4/2
8         0/8/2         3/6/1         4/5/1         5/4/1
13        2/6/2         2/6/2         2/7/1         4/6/0
20        1/6/3         3/4/3         5/3/2         5/3/2
30        1/5/4         3/4/3         5/3/2         5/3/2

w-voting denotes weighted-voting, sel-voting denotes selective voting, and sel-w-voting denotes selective weighted-voting.
sets have not been used in the training of the clusterers and the clusterer ensembles.

4.2. Evaluation scheme

In general, it is difficult to evaluate a clusterer, because whether its clustering quality is good or not depends almost fully on the view of the user. However, when a class attribute that has not been used in the training process exists, the scheme proposed by Modha and Spangler [13] can provide a relatively objective evaluation. It assumes that the class attribute exposes some inherent property of the data set that should be captured by the clusterer.

In detail, the clusterers are converted into classifiers using the following simple rule: identify each cluster with the class that has the largest overlap with the cluster, and assign every data item in that cluster to the found class. The rule allows multiple clusters to be assigned to a single class, but never assigns a single cluster to multiple classes. Suppose there are c classes, i.e. {C_1, C_2, …, C_c}, in the ground-truth classification. For a given clusterer, by using the above rule, let a_h denote the number of data items that are correctly assigned to the class C_h. Then the clustering performance of the clusterer can be measured by micro-precision, which can be computed as:

micro-p = (1/n) Σ_{h=1}^{c} a_h    (5)

The bigger the value of micro-p, the better the clustering performance.

Such a scheme can only be used to compare clusterers with a fixed number of clusters, i.e. clusterers with the same model complexity. Therefore, in our experiments, for a given data set, the value of k, i.e. the number of clusters to be discovered, is fixed to the number of classes conveyed by the class attribute. Note that the ensemble methods proposed in Section 3 do not guarantee that k will not be reduced after the combination process. In fact, in some cases such a reduction may be helpful, because it may reveal that the number of actual clusters is smaller than was anticipated. But here the reduction would disable the above scheme from comparing the clustering performance. Fortunately, such a reduction never occurs in any of our experiments.

4.3. Results

In our experiments the number of iteration steps of the k-means algorithm is set to 100, and the error improvement threshold is set to 1×10^−5. For each data set, each of the four ensemble methods proposed in Section 3 is used to build five clusterer ensembles comprising 5, 8, 13, 20, or 30 component clusterers, respectively. The process is repeated 10 times. Then, for each data set, each method, and each ensemble size, the average micro-p and its standard deviation are recorded. The average performance of single k-means is also recorded for comparison. The detailed experimental results are presented in Appendix A.

The pairwise two-tailed t-test results under significance level of 0.05 are summarized in Table 2, where 'win'/'loss' means that the ensemble method is significantly better/worse than the single k-means algorithm, and 'tie' means that there is no significant difference between the ensemble method and the single k-means algorithm.

Table 3
The best ensemble method for the experimental data sets

Data set            Best method             Data set     Best method
Image segmentation  sel-w-voting (13/30)    Vehicle      sel-w-voting (1.4/5)
Ionosphere          sel-w-voting (4.5/30)   Waveform21   w-voting (13)
Iris                w-voting (20)           Waveform40   sel-voting (3.5/8)
Liver disorder      sel-w-voting (3.3/13)   Wine         w-voting (8)
Page blocks         sel-voting (4.1/13)     Wpbc         sel-voting (2.0/30)

w-voting denotes weighted-voting, sel-voting denotes selective voting, and sel-w-voting denotes selective weighted-voting. The number in brackets is the ensemble size with which the best clustering performance is obtained. For selective ensemble methods, the ensemble size is shown as a ratio of the number of selected clusterers to the number of clusterers available.

Table 2 shows that the clustering performance of voting, weighted-voting, and selective voting is worse than, comparable to, and slightly better than that of the single k-means clusterer, respectively, while the performance of selective weighted-voting is significantly better than that of the single k-means clusterer. It is impressive that when the ensemble size
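The evaluation rule and equation (5) can be rendered compactly. This is our own sketch with a hypothetical function name; counting the majority class inside each cluster yields the same total Σ_h a_h, since each cluster contributes its overlap with exactly one class.

```python
import numpy as np

def micro_precision(labels, classes):
    """labels: cluster label vector; classes: ground-truth class vector
    (non-negative integers). Maps each cluster to the class it overlaps
    most and returns micro-p = (1/n) * sum_h a_h, a value in [0, 1]."""
    n = len(labels)
    correct = 0
    for c in np.unique(labels):
        # true classes of the items placed in cluster c
        members = classes[labels == c]
        # the cluster is identified with its majority class
        correct += np.bincount(members).max()
    return correct / n
```

For example, with cluster labels [0, 0, 0, 1, 1, 1] and true classes [0, 0, 1, 1, 1, 1], cluster 0 maps to class 0 (2 correct items) and cluster 1 maps to class 1 (3 correct items), giving micro-p = 5/6.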
Z.-H. Zhou, W. Tang / Knowledge-Based Systems 19 (2006) 77–83 81
[Figure: four curves (voting, w-voting, sel-voting, sel-w-voting); x-axis: ensemble size (5, 8, 13, 20, 30); y-axis: clustering performance, roughly 0.64 to 0.70.]

Fig. 3. The impacts of the change of ensemble size on the clustering performance. w-voting denotes weighted-voting, sel-voting denotes selective voting, and sel-w-voting denotes selective weighted-voting. Geometrical mean denotes the average across all the data sets. Here the ensemble size of selective voting and selective weighted-voting denotes the number of candidate clusterers instead of their real ensemble size.

is 13, selective weighted-voting never loses to single k-means. This observation shows that ensemble methods can improve the clustering performance. It also reveals that utilizing mutual information in the combination of the component clusterers is beneficial, and so is building selective ensembles.

Table 3 summarizes the best ensemble method for each data set. It confirms that selective weighted-voting is the best method, achieving the best performance on four data sets. Moreover, Table 3 shows that the performances of weighted-voting and selective voting are very close, because each of them achieves the best result on three data sets.

It is worth mentioning that although utilizing mutual information and building selective ensembles are comparably effective from the aspect of improving clustering performance, […] ensemble size increases, but for the useless method voting, the performance keeps almost constant or even decreases as the ensemble size increases. Why the performance of voting is so poor is a problem to be explored in future works.

5. Conclusion

In this paper, four methods are proposed for building ensembles of k-means clusterers. The component clusterers are generated by running the k-means algorithm multiple times with different initial points. An align process is applied to ensure that the same label used by different clusterers denotes similar clusters. The aligned clusterers are combined via voting or its variants. Experiments show that selective weighted-voting, which utilizes mutual information to select a subset of clusterers for weighted voting, is the best method, and it can significantly improve the clustering performance. It is also found that utilizing mutual information weights and building selective ensembles are both beneficial to clusterer ensembles, while the latter mechanism is more rewarding because it can help obtain ensembles with smaller sizes.

It is worth mentioning that although k-means is used as the base clusterer in this paper, this does not mean that the proposed methods can only be applied to k-means. Since these methods work by analyzing the clustering results instead of the internal mechanisms of the component clusterers, they are applicable to diverse kinds of clustering algorithms.

From the literature on ensemble learning, it can be found that voting is an effective combining method that is often used
Table A1
The clustering performance when ensemble size is 5
Note that after truncating from the fourth decimal digit, the differences on waveform21 and waveform40 are concealed.

Table A2
The clustering performance when ensemble size is 8
Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.

Table A3
The clustering performance when ensemble size is 13
Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.
in building ensembles of supervised learning algorithms. However, this paper shows that voting performs quite poorly while its variants such as selective weighted-voting perform well in the unsupervised learning scenario. How to explain this phenomenon remains an open problem to be explored in future works.

Another interesting issue to be explored is whether successful supervised ensemble methods, such as Bagging [14] and AdaBoost [8], can be modified for unsupervised learning. Moreover, since different distance measures have different impacts on the clustering performance [10], and different clustering algorithms may favor different kinds of cluster architectures, i.e. different algorithms may be effective at detecting different kinds of clusters, it will be interesting to investigate heterogeneous clusterer ensembles, i.e. ensembles composed of different kinds of clusterers.
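The voting variants compared throughout the paper can be sketched minimally as follows. This is our own rendering, not the authors' implementation: the paper's exact rules, including how the mutual information weights are computed, are given in Section 3 and are assumed available here, and the mean-weight selection threshold is our assumption.

```python
import numpy as np

def vote(aligned, weights=None):
    """(Weighted) voting over t aligned label vectors stacked as a t x n
    array: each item receives the label with the largest vote mass."""
    aligned = np.asarray(aligned)
    t, n = aligned.shape
    w = np.ones(t) if weights is None else np.asarray(weights, dtype=float)
    k = aligned.max() + 1
    scores = np.zeros((n, k))
    for wi, lab in zip(w, aligned):
        scores[np.arange(n), lab] += wi   # add this clusterer's (weighted) vote
    return scores.argmax(axis=1)

def selective_vote(aligned, weights, use_weights=False):
    """Selective (weighted) voting: keep only clusterers whose weight is
    at least the average weight, then vote, optionally reusing the
    weights (sel-w-voting) or not (sel-voting)."""
    weights = np.asarray(weights, dtype=float)
    keep = weights >= weights.mean()
    kept = np.asarray(aligned)[keep]
    return vote(kept, weights[keep] if use_weights else None)
```

Plain voting is `vote(aligned)`, weighted-voting is `vote(aligned, weights)`, and the two selective variants are `selective_vote(aligned, weights)` and `selective_vote(aligned, weights, use_weights=True)`.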
Table A4
The clustering performance when ensemble size is 20
Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.

Table A5
The clustering performance when ensemble size is 30
Note that after truncating from the fourth decimal digit, the differences on waveform21 are concealed.
Acknowledgements

This work was supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 60325207, the Fok Ying Tung Education Foundation under Grant No. 91067, and the Excellent Young Teachers Program of MOE, China.

Appendix A

Tables A1–A5 present the detailed experimental results summarized in Section 4, where single denotes a single k-means clusterer, w-voting denotes weighted-voting, sel-voting denotes selective voting, sel-w-voting denotes selective weighted-voting, and geometrical mean denotes the average across all data sets. The 2nd to the 5th columns of the tables record the micro-p, while the last column records how many clusterers have been selected by selective voting or selective weighted-voting. The values following '±' are the standard deviations.

References

[1] V. Estivill-Castro, Why so many clustering algorithms—a position paper, SIGKDD Explorations 4 (1) (2002) 65–75.
[2] T.G. Dietterich, Ensemble learning, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, second ed., MIT Press, Cambridge, MA, 2002.
[3] L.K. Hansen, P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10) (1990) 993–1001.
[4] F.J. Huang, Z.-H. Zhou, H.-J. Zhang, T. Chen, Pose invariant face recognition, in: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 2000, pp. 245–250.
[5] H. Drucker, R. Schapire, P. Simard, Improving performance in neural networks using a boosting algorithm, in: S.J. Hanson, J.D. Cowan, C. Lee Giles (Eds.), Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, CA, 1993, pp. 42–49.
[6] K.J. Cherkauer, Human expert level performance on a scientific image analysis task by a system using combined artificial neural networks, in: Proceedings of the 13th AAAI Workshop on Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms, Portland, OR, 1996, pp. 15–21.
[7] Z.-H. Zhou, Y. Jiang, Y.-B. Yang, S.-F. Chen, Lung cancer cell identification based on artificial neural network ensembles, Artificial Intelligence in Medicine 24 (1) (2002) 25–36.
[8] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, Spain, 1995, pp. 23–37.
[9] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, Berkeley, CA, 1967, pp. 281–297.
[10] A. Strehl, J. Ghosh, R.J. Mooney, Impact of similarity measures on web-page clustering, in: Proceedings of the AAAI2000 Workshop on AI for Web Search, Austin, TX, 2000, pp. 58–64.
[11] Z.-H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artificial Intelligence 137 (1–2) (2002) 239–263.
[12] C. Blake, E. Keogh, C.J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[13] D.S. Modha, W.S. Spangler, Feature weighting in k-means clustering, Machine Learning 52 (3) (2003) 217–237.
[14] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.