I. Introduction
Clustering is a fundamental problem that frequently arises in a broad variety of fields such as pattern recognition, image processing, machine learning and statistics[1,2]. It can be defined as a process of partitioning a given data set of multiple attributes into groups. K-means[3] is the most popular algorithm among clustering algorithms developed to date because of its effectiveness and efficiency in clustering large data sets[4]. However, the k-means clustering algorithm fails to handle data sets with categorical attributes because it can only minimize a numerical cost function. Huang[5] proposed the k-modes clustering method, which removes the numeric-only limitation of the k-means algorithm. Since then, major improvements have been made to k-modes algorithms, including new dissimilarity measures that improve the clustering performance[6-8] and a fuzzy-set-based k-modes algorithm[9]. Although the k-modes-type algorithms have demonstrated improved efficiency in processing large categorical data sets, like the k-means-type algorithms they still suffer from two major drawbacks: (1) inability to cover the global information effectively, i.e. they are only locally optimal methods[10]; (2) sensitivity to the selection of the initial cluster modes.
(1) K-modes algorithm (KM)
The k-modes algorithm minimizes the cost function

F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} \omega_{li} \, d(Z_l, X_i)    (1)

subject to

\omega_{li} \in \{0, 1\}, \quad 1 \le l \le k, \ 1 \le i \le n    (2)

\sum_{l=1}^{k} \omega_{li} = 1, \quad 1 \le i \le n    (3)

0 < \sum_{i=1}^{n} \omega_{li} < n, \quad 1 \le l \le k    (4)

where W = (\omega_{li}) is the k \times n partition matrix, Z = [Z_1, Z_2, \dots, Z_k] is the set of cluster modes, and d(\cdot, \cdot) is the simple matching dissimilarity between two categorical objects X and Y with m attributes:

d(X, Y) = \sum_{i=1}^{m} \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \begin{cases} 0, & x_i = y_i \\ 1, & x_i \ne y_i \end{cases}    (5)

(2) New k-modes algorithm (NKM)
The new k-modes algorithm replaces simple matching with a frequency-based dissimilarity:

d(Z_l, X_i) = \sum_{j=1}^{m} \phi(z_{l,j}, x_{i,j})    (6)

\phi(z_{l,j}, x_{i,j}) = \begin{cases} 1, & \text{if } z_{l,j} \ne x_{i,j} \\ 1 - |c_{l,j,r}| / |c_l|, & \text{otherwise} \end{cases}    (7)

where |c_l| is the number of objects in the lth cluster, and |c_{l,j,r}| is the number of objects with category a_j^{(r)} of the jth attribute in the lth cluster.
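For concreteness, the two dissimilarity measures can be sketched in Python as follows. This is a minimal illustration; the function names and the list-of-tuples data representation are our assumptions, not the paper's.

```python
def simple_matching(x, y):
    # Eq.(5): number of attributes on which the two objects disagree
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def frequency_based(mode, x, cluster):
    # Eqs.(6)-(7): a mismatch costs 1; a match still costs
    # 1 - |c_{l,j,r}| / |c_l|, so matches on rare categories are
    # penalized more heavily than matches on dominant ones.
    d = 0.0
    for j, (zj, xj) in enumerate(zip(mode, x)):
        if zj != xj:
            d += 1.0
        else:
            count = sum(1 for obj in cluster if obj[j] == xj)
            d += 1.0 - count / len(cluster)
    return d
```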
(3) Fuzzy k-modes algorithm (FKM)
Huang and Ng[9] proposed the fuzzy k-modes algorithm for clustering categorical objects based on extensions to the fuzzy k-means algorithm[17]. This method improves the k-modes algorithm by assigning membership degrees to data objects. It minimizes the cost function
F_c(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} \omega_{li}^{\alpha} \, d(Z_l, X_i)    (8)

subject to

\omega_{li} = \begin{cases} 1, & \text{if } X_i = Z_l \\ 0, & \text{if } X_i = Z_h, \ h \ne l \\ \dfrac{1}{\sum_{h=1}^{k} \left[ d(Z_l, X_i) / d(Z_h, X_i) \right]^{1/(\alpha - 1)}}, & \text{if } X_i \ne Z_l \text{ and } X_i \ne Z_h, \ 1 \le h \le k \end{cases}    (9)

and also subject to Eqs.(3) and (4), where \alpha is the weighting component and W = (\omega_{li}) is the k \times n fuzzy membership matrix.
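A direct transcription of the membership update, Eq.(9), might look as follows. This is a sketch under our own naming: d is any of the dissimilarity functions above and alpha is the weighting component α > 1.

```python
def fuzzy_membership(x, modes, d, alpha=1.5):
    # Eq.(9): membership degrees of object x in all k clusters
    dists = [d(z, x) for z in modes]
    if 0.0 in dists:
        # x coincides with some mode Z_l: full membership there, zero elsewhere
        return [1.0 if dl == 0.0 else 0.0 for dl in dists]
    return [1.0 / sum((dl / dh) ** (1.0 / (alpha - 1.0)) for dh in dists)
            for dl in dists]
```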
It is noted that in existing k-modes-type algorithms, the clustering process starts with a fixed number of modes, i.e. the number of clusters to be generated. The initialization or choice of these modes, depending on the nature of the data, may have a significant impact on the performance of the algorithms. In other words, different choices of the modes may produce inconsistent clustering results. The main cause of this problem is that these methods only guarantee a locally optimal solution, i.e. they do not guarantee a globally optimal result.
2. Global k-modes method
To overcome, or at least alleviate, the limitation of existing k-modes clustering methods due to their local-optimum property, a Global k-modes (GKM) algorithm is proposed. The new method consists of two key components: (1) random selection of a sufficiently large number of initial cluster modes; (2) progressive elimination of redundant clusters by using an elimination criterion function.
(1) Initialization of cluster modes
In most existing k-modes clustering methods, the number of initialized cluster modes equals the number of target clusters, so a true cluster may be missed for certain classes of large data sets due to the random selection of initial cluster modes. For those methods that optimize the cluster mode initialization, prior knowledge of the data distribution is usually required. Such knowledge, however, may not always be available or easily acquired, especially for data sets of very large scale or of low transparency due to security reasons. In addition, some kinds of data, like stream data, change dynamically and may be infinite, so such data can only be accessed in a single scan and the initial modes cannot be selected repeatedly. To overcome these limitations, the GKM method starts with a sufficiently large number of randomly selected cluster modes. The random initialization of a large number of initial clusters (modes) enables coverage of all target clusters in the overall data space.
Specifically, let k_ini and k_tar represent the number of initial clusters and the number of final target clusters respectively, with k_ini > k_tar. In our study, k_ini is conveniently determined by multiplying the number of target clusters by a positive integer (the multiplier). The appropriate choice of k_ini depends on the properties of the specific data set and should be determined by experiments; the convergence of k_ini will be discussed later.
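Under these assumptions, the initialization step is a single random draw. The following sketch uses our own names; multiplier stands for the positive integer described above.

```python
import random

def init_modes(data, k_tar, multiplier=4):
    # k_ini = multiplier * k_tar initial modes, selected at random
    # in a single draw, so no repeated selection is required
    k_ini = multiplier * k_tar
    return random.sample(data, k_ini)
```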
(2) Elimination criterion function
At each iteration, an elimination criterion function E_j is computed for every cluster:

E_j = \sum_{x_i \in C_j} \Big( \min_{Z_l \in Z, \, Z_l \ne Z_j} d(x_i, Z_l) - d(x_i, Z_j) \Big)    (10)

where C_j represents the jth cluster in C (C = [C_1, C_2, \dots, C_k] is the set of all k clusters at the current stage of iteration), Z = [Z_1, Z_2, \dots, Z_k] represents the set of all optimal cluster modes at the same stage, and d(x_i, Z_j) denotes the distance of data point x_i (x_i \in C_j) to the mode Z_j of its cluster C_j; the computation of this distance is graphically illustrated in Fig.1(a). The minimal distance of x_i to cluster modes other than Z_j is \min_{Z_l \in Z, Z_l \ne Z_j} d(x_i, Z_l), as shown in Fig.1(b). The cluster C_j whose mode Z_j has the smallest E_j is the current candidate to be removed. The cluster modes of the remaining k - 1 clusters will be used as the initial modes for the next iteration.
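In code, E_j as given above amounts to one pass over the points of cluster C_j. This sketch reuses a pluggable dissimilarity d and our own function names.

```python
def elimination_criterion(j, modes, clusters, d):
    # E_j, Eq.(10): the cost increase incurred if cluster j were removed
    # and each of its points reassigned to its nearest remaining mode
    E = 0.0
    for x in clusters[j]:
        nearest_other = min(d(modes[l], x)
                            for l in range(len(modes)) if l != j)
        E += nearest_other - d(modes[j], x)
    return E
```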
(3) Theoretical basis of the elimination criterion
function
Here we describe the theoretical basis of the elimination criterion function E, which represents a good estimate of the cost function F. When a cluster is removed, the value of F for the remaining clusters, which also contain all the data points of the removed cluster, will increase under convergence. For k clusters we obtain k values of F, which usually differ. The cluster whose removal results in the smallest F has the least impact on the remaining clusters, and its points are most likely to belong to the other cluster(s). This observation can also be seen intuitively in Fig.1: cluster C4 is close to cluster C3, so if C4 is removed most of its data points will be merged into C3, which makes C4 a natural candidate for removal. However, using F directly to determine which cluster to remove would be computationally very expensive, since we would need to actually test clustering on all combinations of clusters. Instead we use E to estimate the change in F, since E computes the potential increment of F at the beginning of the next iteration after a cluster is removed. The smaller the value of E, the smaller the increase in F; the smallest E indicates the smallest increase in F.
An obvious alternative to using an elimination criterion function is random elimination. It is, however, conceivable that random elimination is likely to produce inconsistent clustering performance. We will show later, in our experiments with both methods, that the method based on the elimination criterion function gives much better performance than the random method in terms of both accuracy and consistency.
(4) Algorithm
A pseudo-code-style summary of the algorithm is given as follows:
Algorithm GKM
Input:
  D: a set of n categorical objects
  k_ini: the number of initial cluster modes
  k_tar: the number of clusters desired
Output:
  Labeling or assignment of data points with k_tar clusters
Initialization:
  Randomly select k_ini initial modes from set D.
Iteration:
  Let k be the number of modes in the current iteration.
  for k = k_ini down to k_tar + 1
    Apply the k-modes algorithm to find the current optimal solution
    S(k) with k modes, i.e. the cluster assignment, where
    Z = [Z_1, Z_2, ..., Z_k] are the k modes for S(k).
    for i = 1 to k
      Compute the elimination criterion function E_i of
      cluster mode Z_i.
    end
    Remove the mode Z_j with the minimum elimination
    criterion function value E_j and use the remaining k - 1
    modes as initial modes for the next iteration.
  end
  Apply the k-modes algorithm with the remaining k_tar modes
  to obtain the final solution S(k_tar).
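Putting the pieces together, the following self-contained Python sketch mirrors the pseudo-code. The inner k_modes routine here is a minimal textbook k-modes using simple matching (Eq.(5)), not the authors' exact implementation, and data is assumed to be a list of equal-length tuples of categories.

```python
import random
from collections import Counter

def matching(x, y):
    # Eq.(5): simple matching dissimilarity
    return sum(a != b for a, b in zip(x, y))

def k_modes(data, modes):
    # plain k-modes: alternate assignment and mode update until stable
    while True:
        clusters = [[] for _ in modes]
        for x in data:
            j = min(range(len(modes)), key=lambda l: matching(modes[l], x))
            clusters[j].append(x)
        # new mode = most frequent category per attribute (keep old if empty)
        new_modes = [tuple(Counter(col).most_common(1)[0][0] for col in zip(*c))
                     if c else m
                     for c, m in zip(clusters, modes)]
        if new_modes == modes:
            return modes, clusters
        modes = new_modes

def gkm(data, k_tar, multiplier=4):
    modes = random.sample(data, multiplier * k_tar)   # k_ini initial modes
    while True:
        modes, clusters = k_modes(data, modes)        # current solution S(k)
        if len(modes) <= k_tar:
            return modes, clusters
        # E_j, Eq.(10): cost increase if cluster j were removed
        def E(j):
            return sum(min(matching(modes[l], x)
                           for l in range(len(modes)) if l != j)
                       - matching(modes[j], x)
                       for x in clusters[j])
        modes.pop(min(range(len(modes)), key=E))      # drop the cheapest cluster
```

For example, gkm(data, k_tar=4) starts from 16 random modes (multiplier 4) and eliminates one cluster per iteration until four remain.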
In each boxplot the median is marked by the center line. The range of the box runs from the 25th to the 75th percentiles.
Using an elimination criterion function shows an overall superior performance compared to the random method for all four data sets in both accuracy and consistency. For instance, for the soybean data set, GKM achieves a median clustering error of 0 with a standard deviation of 0.03 when the multiplier is at least 2, as shown in Fig.3(b), while random selection only achieves a median 27% clustering error (a) with a much larger standard deviation of 0.16 for all multiplier values. Similar observations can be made for the breast cancer data set ((d) and (c)) and the credit data set ((f) and (e)). For the zoo data set, both methods achieve a similar median clustering error, but GKM shows much better consistency.
(3) GKM vs. KM, NKM, FKM and initialization
methods
We run GKM with the overall optimal multiplier value of 4 shown above for each data set and compare its accuracy with that of KM, NKM, FKM and the initialization methods.
IV. Conclusions
In this paper, a Global k-modes (GKM) clustering algorithm is proposed. The method randomly selects a sufficiently larger number of initial modes than the target cluster number, and progressively removes redundant clusters using an elimination criterion function. The complexity of the method remains linear, with additional computation required only for the iterative elimination process. Experiments with four commonly referenced data sets from the UCI Machine Learning Repository[18] have shown that the method performs well with a larger number of initial modes, without the need for optimal mode initialization relying on prior knowledge of the data. Experiments with different numbers of initial modes show the effectiveness of the elimination criterion function. The comparative evaluation of the algorithm on the four diverse data sets demonstrates the superior performance and consistency of the proposed global k-modes algorithm in comparison with other well-known k-modes-type algorithms in terms of clustering accuracy.
Fig. 3. Boxplots of clustering errors over 100 runs for different values of the multiplier. (a), (c), (e) and (g) show errors using the random selection method; (b), (d), (f) and (h) show errors using the elimination criterion function, for the different data sets respectively

References
[1] A.K. Jain, M.N. Murty et al., "Data clustering: a review", ACM Computing Surveys, Vol.31, No.3, pp.264-323, 1999.
[2] Haixia Xu, Zheng Tian, "An optimal spectral clustering approach based on Cauchy-Schwarz divergence", Chinese Journal of Electronics, Vol.18, No.1, pp.105-108, 2009.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol.1, pp.281-297, 1967.
C.A. Kulikowski received the Ph.D. degree from the University of Hawaii in 1970. Currently he is Board of Governors Professor in the Computer Science Department of Rutgers University, the State University of New Jersey. His current research interests include artificial intelligence, biomedical informatics, and the societal impact of computers. Prof. Kulikowski is a member of the Institute of Medicine of the National Academy of Sciences (IOM-NAS), a fellow of the American Association for the Advancement of Science (AAAS), a fellow of the Institute of Electrical and Electronics Engineers (IEEE), a founding fellow of the American Association for Artificial Intelligence (AAAI), and a founding fellow of the American College of Medical Informatics (ACMI). He has been Vice-President of the International Medical Informatics Association (IMIA) since 2006. (Email: kulikows@cs.rutgers.edu)
GONG Leiguang is currently a Senior Research Staff Member at the IBM Watson Research Center, and a visiting professor and adjunct professor at Jilin University. Before joining IBM Research in 2000 he was a Member of Technical Staff at Bell Laboratories (1997-2000), an assistant professor of Computer Science at Rutgers University (1993-1997), and a faculty member of Computer Science at Jilin University (1977-1987). He received the Ph.D. degree in computer science from Rutgers University in 1992. His current research focuses on semantic modeling for multimedia content analysis and retrieval and high-performance image/video analytics. His past research was on high-performance medical image analytics, knowledge-based systems, and semantic and contextual modeling. (Email: gleiguang@yahoo.com)
YANG Bin was born in 1978. He is currently a lecturer in the College of Computer Science and Technology, Jilin University. He received the Ph.D. degree in computer application technology from the College of Computer Science and Technology, Jilin University, in 2010. His main research interests are machine learning and intelligent algorithms. (Email: yangbin@jlu.edu.cn)