I. Introduction
Clustering is a fundamental problem that frequently arises in a broad variety of fields such as pattern recognition, image processing, machine learning and statistics[1,2]. It can be defined as a process of partitioning a given data set of multiple attributes into groups. K-means[3] is the most popular algorithm among clustering algorithms developed to date because of its effectiveness and efficiency in clustering large data sets[4]. However, the k-means clustering algorithm fails to handle data sets with categorical attributes because it can only minimize a numerical cost function. Huang[5] proposed the k-modes clustering method, which removes the numeric-only limitation of the k-means algorithm. Since then, major improvements have been made to k-modes algorithms, including new dissimilarity measures that improve the clustering performance[6-8] and a fuzzy-set-based k-modes algorithm[9]. Although the k-modes-type algorithms have demonstrated improved efficiency in processing large categorical data sets, like the k-means-type algorithms they still suffer from two major drawbacks: (1) inability to cover the global information effectively, i.e. they are only locally optimal methods[10]; (2) sensitivity to the selection of the initial cluster modes.
(1) K-modes algorithm (KM)
The k-modes algorithm minimizes the cost function

F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} \omega_{li} \, d(Z_l, X_i)    (1)

subject to

\omega_{li} \in \{0, 1\}, \quad 1 \le l \le k, \ 1 \le i \le n    (2)

\sum_{l=1}^{k} \omega_{li} = 1, \quad 1 \le i \le n    (3)

0 < \sum_{i=1}^{n} \omega_{li} < n, \quad 1 \le l \le k    (4)

where W = (\omega_{li}) is the k \times n partition matrix, Z = [Z_1, Z_2, \dots, Z_k] is the set of cluster modes, and d(\cdot, \cdot) is the simple matching dissimilarity between two categorical objects X and Y with m attributes:

d(X, Y) = \sum_{i=1}^{m} \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \begin{cases} 0, & x_i = y_i \\ 1, & x_i \ne y_i \end{cases}    (5)

(2) New k-modes algorithm (NKM)
The new k-modes algorithm replaces simple matching with a frequency-based dissimilarity:

d(Z_l, X_i) = \sum_{j=1}^{m} \phi(z_{l,j}, x_{i,j})    (6)

\phi(z_{l,j}, x_{i,j}) = \begin{cases} 1, & \text{if } z_{l,j} \ne x_{i,j} \\ 1 - |c_{l,j,r}| / |c_l|, & \text{otherwise} \end{cases}    (7)

where |c_l| is the number of objects in the lth cluster, and |c_{l,j,r}| is the number of objects with category a_j^{(r)} of the jth attribute in the lth cluster.
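For concreteness, the two dissimilarity measures can be sketched in Python as follows. This is a minimal illustration; the function names and the list-of-tuples data representation are our assumptions, not the paper's.

```python
def simple_matching(x, y):
    # Eq.(5): number of attributes on which the two objects disagree
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def frequency_based(mode, x, cluster):
    # Eqs.(6)-(7): a mismatch costs 1; a match still costs
    # 1 - |c_{l,j,r}| / |c_l|, so matches on rare categories are
    # penalized more heavily than matches on dominant ones.
    d = 0.0
    for j, (zj, xj) in enumerate(zip(mode, x)):
        if zj != xj:
            d += 1.0
        else:
            count = sum(1 for obj in cluster if obj[j] == xj)
            d += 1.0 - count / len(cluster)
    return d
```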
(3) Fuzzy k-modes algorithm (FKM)
Huang and Ng[9] proposed the fuzzy k-modes algorithm for clustering categorical objects based on extensions to the fuzzy k-means algorithm[17]. This method improves the k-modes algorithm by assigning membership degrees to data objects. It minimizes the cost function
F_c(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} \omega_{li}^{\alpha} \, d(Z_l, X_i)    (8)

subject to

\omega_{li} = \begin{cases} 1, & \text{if } X_i = Z_l \\ 0, & \text{if } X_i = Z_h, \ h \ne l \\ \dfrac{1}{\sum_{h=1}^{k} \left[ d(Z_l, X_i) / d(Z_h, X_i) \right]^{1/(\alpha - 1)}}, & \text{if } X_i \ne Z_l \text{ and } X_i \ne Z_h, \ 1 \le h \le k \end{cases}    (9)

and also subject to Eqs.(3) and (4), where \alpha is the weighting component and W = (\omega_{li}) is the k \times n fuzzy membership matrix.
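A direct transcription of the membership update, Eq.(9), might look as follows. This is a sketch under our own naming: d is any of the dissimilarity functions above and alpha is the weighting component α > 1.

```python
def fuzzy_membership(x, modes, d, alpha=1.5):
    # Eq.(9): membership degrees of object x in all k clusters
    dists = [d(z, x) for z in modes]
    if 0.0 in dists:
        # x coincides with some mode Z_l: full membership there, zero elsewhere
        return [1.0 if dl == 0.0 else 0.0 for dl in dists]
    return [1.0 / sum((dl / dh) ** (1.0 / (alpha - 1.0)) for dh in dists)
            for dl in dists]
```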
It is noted that in existing k-modes-type algorithms, the clustering process starts with a fixed number of modes, i.e. the number of clusters to be generated. The initialization or choice of these modes, depending on the nature of the data, may have a significant impact on the performance of the algorithms. In other words, different choices of the modes may produce inconsistent clustering results. The main cause of this problem is that these methods only guarantee a locally optimal solution, i.e. they do not guarantee a globally optimal result.
2. Global k-modes method
To overcome, or at least alleviate, the limitation of existing k-modes clustering methods due to their local-optimum property, a Global k-modes (GKM) algorithm is proposed. The new method consists of two key components: (1) random selection of a sufficiently large number of initial cluster modes; (2) progressive elimination of redundant clusters by using an elimination criterion function.
(1) Initialization of cluster modes
In most existing k-modes clustering methods, the number of initialized cluster modes equals the number of target clusters, so a true cluster may be missed for certain classes of large data sets due to the random selection of initial cluster modes. For those methods that optimize the cluster mode initialization, prior knowledge of the data distribution is usually required. Such knowledge, however, may not always be available or easily acquired, especially for data sets of very large scale or of low transparency due to security reasons. In addition, some kinds of data, like stream data, change dynamically and may be infinite, so such data can only be accessed in a single scan and the initial modes cannot be selected repeatedly. To overcome these limitations, the GKM method starts with a sufficiently large number of randomly selected cluster modes. The random initialization of a large number of initial clusters (modes) enables coverage of all target clusters in the overall data space.
Specifically, let k_ini and k_tar represent the number of initial clusters and the number of final target clusters respectively, with k_ini > k_tar. In our study, k_ini is conveniently determined by multiplying the number of target clusters by a positive integer (the multiplier). The appropriate choice of k_ini depends on the properties of the specific data set and should be determined by experiments; the convergence of k_ini will be discussed later.
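Under these assumptions, the initialization step is a single random draw. The following sketch uses our own names; multiplier stands for the positive integer described above.

```python
import random

def init_modes(data, k_tar, multiplier=4):
    # k_ini = multiplier * k_tar initial modes, selected at random
    # in a single draw, so no repeated selection is required
    k_ini = multiplier * k_tar
    return random.sample(data, k_ini)
```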
(2) Elimination criterion function
At each iteration, an elimination criterion function E_j is computed for every cluster:

E_j = \sum_{x_i \in C_j} \Big( \min_{Z_l \in Z, \, Z_l \ne Z_j} d(x_i, Z_l) - d(x_i, Z_j) \Big)    (10)

where C_j represents the jth cluster in C (C = [C_1, C_2, \dots, C_k] is the set of all k clusters at the current stage of iteration), Z = [Z_1, Z_2, \dots, Z_k] represents the set of all optimal cluster modes at the same stage, and d(x_i, Z_j) denotes the distance of data point x_i (x_i \in C_j) to the mode Z_j of its cluster C_j; the computation of this distance is graphically illustrated in Fig.1(a). The minimal distance of x_i to cluster modes other than Z_j is \min_{Z_l \in Z, Z_l \ne Z_j} d(x_i, Z_l), as shown in Fig.1(b). The cluster C_j whose mode Z_j has the smallest E_j is the current candidate to be removed. The cluster modes of the remaining k - 1 clusters will be used as the initial modes for the next iteration.
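In code, E_j as given above amounts to one pass over the points of cluster C_j. This sketch reuses a pluggable dissimilarity d and our own function names.

```python
def elimination_criterion(j, modes, clusters, d):
    # E_j, Eq.(10): the cost increase incurred if cluster j were removed
    # and each of its points reassigned to its nearest remaining mode
    E = 0.0
    for x in clusters[j]:
        nearest_other = min(d(modes[l], x)
                            for l in range(len(modes)) if l != j)
        E += nearest_other - d(modes[j], x)
    return E
```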
(3) Theoretical basis of the elimination criterion
function
Here we describe the theoretical basis of the elimination criterion function E, which represents a good estimate of the cost function F. When a cluster is removed, the value of F for the remaining clusters, which also contain all the data points of the removed cluster, will increase under convergence. For k clusters we obtain k values of F, which usually differ. The cluster whose removal results in the smallest F has the least impact on the remaining clusters, and its points are most likely to belong to the other cluster(s). This observation can also be seen intuitively in Fig.1: cluster C4 is close to cluster C3, so if C4 is removed most of its data points will be merged into C3, which makes C4 a natural candidate for removal. However, using F directly to determine which cluster to remove would be computationally very expensive, since we would need to actually test clustering on all combinations of clusters. Instead we use E to estimate the change in F, since E computes the potential increment of F at the beginning of the next iteration after a cluster is removed. The smaller the value of E, the smaller the increase in F; the smallest E indicates the smallest increase in F.
An obvious alternative to using an elimination criterion function is random elimination. It is, however, conceivable that random elimination is likely to produce inconsistent clustering performance. We will show later, in our experiments with both methods, that the method based on the elimination criterion function gives much better performance than the random method in terms of both accuracy and consistency.
(4) Algorithm
A pseudo-code-style summary of the algorithm is given as follows:
Algorithm GKM
Input:
  D: a set of n categorical objects
  k_ini: the number of initial cluster modes
  k_tar: the number of clusters desired
Output:
  Labeling or assignment of data points with k_tar clusters
Initialization:
  Randomly select k_ini initial modes from set D.
Iteration:
  Let k be the number of modes in the current iteration.
  for k = k_ini down to k_tar + 1
    Apply the k-modes algorithm to find the current optimal solution
    S(k) with k modes, i.e. the cluster assignment, where
    Z = [Z_1, Z_2, ..., Z_k] are the k modes for S(k).
    for i = 1 to k
      Compute the elimination criterion function E_i of
      cluster mode Z_i.
    end
    Remove the mode Z_j with the minimum elimination
    criterion function value E_j and use the remaining k - 1
    modes as initial modes for the next iteration.
  end
  Apply the k-modes algorithm with the remaining k_tar modes
  to obtain the final solution S(k_tar).
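Putting the pieces together, the following self-contained Python sketch mirrors the pseudo-code. The inner k_modes routine here is a minimal textbook k-modes using simple matching (Eq.(5)), not the authors' exact implementation, and data is assumed to be a list of equal-length tuples of categories.

```python
import random
from collections import Counter

def matching(x, y):
    # Eq.(5): simple matching dissimilarity
    return sum(a != b for a, b in zip(x, y))

def k_modes(data, modes):
    # plain k-modes: alternate assignment and mode update until stable
    while True:
        clusters = [[] for _ in modes]
        for x in data:
            j = min(range(len(modes)), key=lambda l: matching(modes[l], x))
            clusters[j].append(x)
        # new mode = most frequent category per attribute (keep old if empty)
        new_modes = [tuple(Counter(col).most_common(1)[0][0] for col in zip(*c))
                     if c else m
                     for c, m in zip(clusters, modes)]
        if new_modes == modes:
            return modes, clusters
        modes = new_modes

def gkm(data, k_tar, multiplier=4):
    modes = random.sample(data, multiplier * k_tar)   # k_ini initial modes
    while True:
        modes, clusters = k_modes(data, modes)        # current solution S(k)
        if len(modes) <= k_tar:
            return modes, clusters
        # E_j, Eq.(10): cost increase if cluster j were removed
        def E(j):
            return sum(min(matching(modes[l], x)
                           for l in range(len(modes)) if l != j)
                       - matching(modes[j], x)
                       for x in clusters[j])
        modes.pop(min(range(len(modes)), key=E))      # drop the cheapest cluster
```

For example, gkm(data, k_tar=4) starts from 16 random modes (multiplier 4) and eliminates one cluster per iteration until four remain.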
In each boxplot the median is marked by the center line. The range of the box runs from the 25th to the 75th percentiles.
Using an elimination criterion function shows an overall superior performance compared to the random method for all four data sets in both accuracy and consistency. For instance, for the soybean data set, GKM achieves a median clustering error of 0 with a standard deviation of 0.03 when the multiplier is at least 2, as shown in Fig.3(b), while random selection only achieves a median 27% clustering error (a) with a much larger standard deviation of 0.16 for all multiplier values. Similar observations can be made for the breast cancer data set ((d) and (c)) and the credit data set ((f) and (e)). For the zoo data set, both methods achieve a similar median clustering error, but GKM shows much better consistency.
(3) GKM vs. KM, NKM, FKM and initialization
methods
We run GKM with the overall optimal multiplier value of 4 shown above for each data set and compare its accuracy with that of KM, NKM, FKM and the initialization methods.
IV. Conclusions
In this paper, a Global k-modes (GKM) clustering algorithm is proposed. The method randomly selects a sufficiently larger number of initial modes than the target cluster number, and progressively removes redundant clusters using an elimination criterion function. The complexity of the method remains linear, with additional computation required only for the iterative elimination process. Experiments with four commonly referenced data sets from the UCI Machine Learning Repository[18] have shown that the method performs well with a larger number of initial modes, without the need for optimal mode initialization relying on prior knowledge of the data. Experiments with different numbers of initial modes show the effectiveness of the elimination criterion function. The comparative evaluation of the algorithm on the four diverse data sets demonstrates the superior performance and consistency of the proposed global k-modes algorithm in comparison with other well-known k-modes-type algorithms in terms of clustering accuracy.
Fig. 3. Boxplots of clustering errors over 100 runs for different values of the multiplier. (a), (c), (e) and (g) show errors using the random selection method; (b), (d), (f) and (h) show errors using the elimination criterion function, for the different data sets respectively

References
[1] A.K. Jain, M.N. Murty et al., "Data clustering: a review", ACM Computing Surveys, Vol.31, No.3, pp.264-323, 1999.
[2] Haixia Xu, Zheng Tian, "An optimal spectral clustering approach based on Cauchy-Schwarz divergence", Chinese Journal of Electronics, Vol.18, No.1, pp.105-108, 2009.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol.1, pp.281-297, 1967.
C.A. Kulikowski received the Ph.D. degree from the University of Hawaii in 1970. Currently he is Board of Governors Professor in the Computer Science Department of Rutgers University, the State University of New Jersey. His current research interests include artificial intelligence, biomedical informatics, and the societal impact of computers. Prof. Kulikowski is a member of the Institute of Medicine of the National Academy of Sciences (IOM-NAS), a fellow of the American Association for the Advancement of Science (AAAS), a fellow of the Institute of Electrical and Electronics Engineers (IEEE), a founding fellow of the American Association for Artificial Intelligence (AAAI), and a founding fellow of the American College of Medical Informatics (ACMI). He has been Vice-President of the International Medical Informatics Association (IMIA) since 2006. (Email: kulikows@cs.rutgers.edu)
GONG Leiguang is currently a Senior Research Staff Member at the IBM Watson Research Center, and a visiting professor and adjunct professor at Jilin University. Before joining IBM Research in 2000 he was a Member of Technical Staff at Bell Laboratories (1997-2000), an assistant professor of Computer Science at Rutgers University (1993-1997), and a faculty member of Computer Science at Jilin University (1977-1987). He received the Ph.D. degree in computer science from Rutgers University in 1992. His current research focuses on semantic modeling for multimedia content analysis and retrieval and high-performance image/video analytics. His past research was on high-performance medical image analytics, knowledge-based systems, and semantic and contextual modeling. (Email: gleiguang@yahoo.com)
YANG Bin was born in 1978. He is currently a lecturer in the College of Computer Science and Technology, Jilin University. He received the Ph.D. degree in computer application technology from the College of Computer Science and Technology, Jilin University, in 2010. His main research interests are machine learning and intelligent algorithms. (Email: yangbin@jlu.edu.cn)