
Proceedings of the Third International Conference on Artificial Intelligence in Engineering & Technology

November 22-24, 2006, Kota Kinabalu, Sabah, Malaysia

Real Coded Genetic Clustering with Inter-cluster Mutation


Gautam Ramdurai
Department of Information Science and Engineering
Sri Jayachamarajendra College of Engineering, Mysore – 570 006, India
Tel: +91 98457 89983, E-mail:gautam.ramdurai@gmail.com

Abstract

An improved clustering method using genetic algorithms is proposed in this paper. Real encoding is used to render conceptual closeness to the problem domain. A new view of the chromosomes, which exploits the ability to directly map real solutions, is suggested. A centroid-level crossover technique that uses this new view is defined. Along with this, a novel inter-cluster mutation operator specific to clustering is proposed. The principles of the K-means algorithm embedded in a Genetic Algorithm, augmented by the new genetic operators, lead to globally optimal solutions to the clustering problem. The superiority of the proposed method over the commonly used K-means method and the current Genetic Algorithm based technique is demonstrated on real-life datasets.

Keywords:
Clustering, K-means, Genetic Algorithms, Real Encoding

Introduction

The need to extract hidden structures and discover groups in a dataset has made clustering an essential machine learning process in fields like data mining and pattern recognition [1]. Clustering [1,2] deals with finding structure in a collection of unlabeled data. It involves grouping data into clusters such that members of one cluster are similar in some given sense and members of different clusters are dissimilar in the same sense. The primary objectives of clustering can be stated as:

• The distance between the members of a given cluster should be minimized, as they represent similar data cases.

• The distance between any two clusters (represented by the cluster centroids) should be maximized, as they represent dissimilar classes.

The challenge is to devise a computational technique that can separate a dataset into its most natural groups or clusters. Well known methods like K-means and fuzzy c-means have been used and improved extensively; an extensive review of these techniques can be found in [1-3]. A brief overview of the standard K-means algorithm is given in the next section.

The failure of conventional hill-climbing methods to reach globally optimal solutions to the clustering problem has given rise to an increasing interest in stochastic methods like Genetic Algorithms (GAs) in this domain. GAs [4-6] are nondeterministic stochastic search and optimization methods that utilize the theories of evolution and natural selection to solve a problem within a complex search space. They include a string representation of points in the search space; a set of genetic operators for generating new search points; a fitness function to evaluate the search points; and a stochastic assignment to control the genetic operations.

Several methods based on GAs have been developed to aid clustering. A basic study of the use of GAs in clustering is provided in [7]. Advanced schemes that use sophisticated encodings and modified genetic operators are proposed in [8,9]. Different variants of GA based clustering with real encoding are proposed in [10-12].

None of the genetic clustering algorithms [10-12] in past literature taps the ability of direct mapping offered by real encoding schemes; they use the usual operators for recombination and mutation without any customization. In this paper, a variant of these algorithms [10] is proposed with significant improvements. A new way of viewing the chromosomes that represent the candidate solutions is suggested, one that is conceptually closer to the real world solution of the clustering problem. Keeping this view in mind, a new centroid-level crossover operator and an inter-cluster mutation operator are defined. The new operators are specific to the problem of clustering and exploit the ability of real encoding to map real solutions onto the candidate chromosomes. The Davies-Bouldin index [13] is used as the optimization metric. Different aspects of the proposed technique are elaborated in later sections. Its superiority is demonstrated by a series of tests on real-life datasets, carried out in two phases: the first illustrates the contribution of the new clustering-specific mutation and crossover operators, and the second compares the performance of the proposed method with that of the standard K-means algorithm and the method proposed in [10] on five real-life datasets.

K-means Clustering

The K-means [1,2] algorithm is one of the simplest and most popular clustering algorithms. The algorithm starts with a given set of initial cluster centers. Each point in the data space is assigned to one of them, and each cluster center is then replaced by the mean of the points in its cluster. These two steps are repeated until convergence. The Euclidean (straight-line) distance is used as the similarity metric; it is a simple and effective way to judge the distance between any two points in the search space.
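As a concrete illustration, the assign/update iteration described above can be sketched in plain Python (a minimal sketch, not the authors' implementation; the function and variable names are chosen here for clarity):

```python
import math
import random

def k_means(points, k, max_iter=100):
    """Basic K-means: assign each point to its nearest center,
    then move each center to the mean of its cluster."""
    centers = random.sample(points, k)  # choose k initial centers from the data
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest center (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: math.dist(x, centers[j]))
            clusters[j].append(x)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center)
        new_centers = [
            [sum(coords) / len(c) for coords in zip(*c)] if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        # Terminate when the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

As the paper notes next, the result of such a run depends heavily on the randomly chosen initial centers.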


The basic K-means algorithm is shown below:

Step 1: Choose k initial cluster centers c1, c2, ..., ck randomly from the n data points {x1, x2, ..., xn}.

Step 2: Assign point xi, i = 1, 2, ..., n to cluster Cj, j ∈ {1, 2, ..., k} if and only if

||xi - cj|| < ||xi - cp||, p = 1, 2, ..., k and j ≠ p   (1)

Step 3: Compute the new cluster centers c1*, c2*, ..., ck*, i = 1, 2, ..., k as follows:

ci* = (1/ni) Σ_{xj ∈ Ci} xj   (2)

where ni is the number of elements belonging to cluster Ci.

Step 4: If ci* = ci for all i = 1, 2, ..., k then terminate, else go to Step 2.

If normal termination does not occur, the algorithm is run for a predefined maximum number of iterations.

A major disadvantage of this method is that it gets stuck at local optima depending on the choice of initial clusters. It has been shown in [14] that the algorithm may converge to values that are not optimal.

Real Coded Genetic Clustering

One of the new avenues being explored to overcome this drawback of local optimality is the use of techniques like Genetic Algorithms, which are efficient in reaching the global optimum for most problems. One such method is proposed here. A real-coded genetic algorithm [15,16] is used, where a solution is directly represented as a vector of real-parameter decision variables. This way the representation of the solutions is very close to the natural formulation of many problems. The choice of a real encoding scheme for clustering seems natural, as it renders conceptual closeness to the problem domain.

Chromosome Encoding

Each chromosome is a candidate solution representing the cluster centroids to be chosen for an ideally clustered dataset. Each string is a sequence of k genes representing the k cluster centers. For an N-dimensional space, each gene in turn consists of N sub-genes, and each sub-gene value represents one ordinate of a cluster centroid in the N-dimensional space. The effective length of the chromosome is thus k*N. A chromosome is initialized by randomly assigning points from the given dataset or pattern set to its sub-genes. Figure 1 shows a sample chromosome containing four genes (cluster centroids), each containing three sub-genes (ordinates).

Figure 1 - Chromosome encoded as genes and sub-genes

Initial Population

Chromosomes are initialized using the above method and a population of candidate solutions is created. Each of these chromosomes represents a possible solution to the clustering problem. A given number of individual chromosomes initialized in this manner form the initial population.

Cluster Formation

The clusters are formed according to the cluster centers encoded in the chromosome. This is done by assigning each data point xi, i = 1, 2, ..., n to the cluster Cj whose centroid cj is nearest: the Euclidean distance from the data point to all the centroids is calculated, and the point is assigned to the cluster with the closest centroid, i.e. according to Equation (1). Once all points have been assigned, the new cluster centers c1*, c2*, ..., ck* are calculated according to Equation (2). If the algorithm has not converged, ci* replaces ci in the chromosome.

Fitness Calculation

A Genetic Algorithm requires a criterion whose optimization provides the final clusters. The criterion must be chosen to achieve the objectives of increasing homogeneity within a cluster and increasing heterogeneity between clusters, and an objective fitness function must be defined to measure the quality of a candidate solution. Many clustering metrics are available in the literature, ranging from simple to highly complex mathematical functions. One commonly used metric is the Total Within Cluster Variation (TWCV) [8], or simply the sum over all clusters of the squared Euclidean distances of all points from their respective cluster centers [10]. This metric does not take into account the proximity of two different clusters. To overcome this, a metric that considers both objectives of homogeneity and heterogeneity is to be used. The Davies-Bouldin index [13] (DB-index) has proved to be an effective metric for determining the quality of the clusters created. The proposed method uses the DB-index to determine the fitness of a given chromosome and hence to validate the clusters formed. The DB-index is a function of the ratio of within-cluster scatter to between-cluster separation.

The measure of scatter within the cluster Ci is calculated as:

Si,q = ( (1/|Ci|) Σ_{x ∈ Ci} ||x - ci||^q )^(1/q)   (3)

where ci is the centroid of Ci, defined as

ci = (1/ni) Σ_{x ∈ Ci} x   (4)

where ni is the cardinality of the cluster Ci. Si,q is the qth root of the qth moment of the points in cluster i


with respect to their mean, and is a measure of the dispersion of the points in cluster i. In this paper q = 1; hence Si,1 is the average Euclidean distance of the data points in class i from the centroid of class i.

dij,t denotes the Minkowski distance of order t between the clusters Ci and Cj, i.e. the distance between the clusters. Here t = 2:

dij,t = d(Ci, Cj) = ||ci - cj||t   (5)

Ri,qt denotes the maximal similarity index of Ci with respect to the other clusters:

Ri,qt = max_{j, j ≠ i} ( (Si,q + Sj,q) / dij,t )   (6)

The DB-index is then defined as follows:

DBr = (1/K) Σ_{i=1}^{K} Ri,qt   (7)

A smaller value of the DB-index indicates a better clustering result. Thus the fitness function can be defined as the inverse of the DB-index:

f(Chromr) = 1 / DBr   (8)

The fitness value is the parameter to be optimized by the algorithm; maximization of this fitness function ensures minimization of the DB-index. The DB-index has been used effectively for cluster validation in the past [9,11,12].

Selection

The fitness of an individual determines its probability of contributing to the mating pool: an individual with a higher fitness value has a higher probability of surviving, reproducing and contributing to the next generation. The probability for a chromosome to survive and mate is the ratio of its fitness to the total fitness of the population:

Prob(Chromr) = fr / Σ_{r=1}^{P} fr   (9)

The well known Roulette Wheel selection scheme is used to select the parents that create the next generation.

Centroid-level Crossover

Crossover [6] is a probabilistic process that exchanges information between two parent chromosomes to generate two child chromosomes, and is the main search operator in GAs. In previous implementations [10-12] the crossover operator used is single-point crossover. In [10] the chromosome is viewed as a single string of k*N real values; the cross-site is randomly picked from the range {1, 2, ..., k*N} and the split parts are swapped. This can split the chromosome such that the ordinates of a cluster centroid are divided, as shown in Figure 2. The potential solution to the problem is a set of complete cluster centroids, not individual ordinates; hence the commonly used single-point crossover is not satisfactory. A new Centroid-level Crossover operator is proposed instead.

Figure 2 - Normal single-point crossover

Real encoding eliminates the mapping from phenotype to genotype, so the candidate solution can be manipulated directly. This feature is exploited in the proposed technique, as the chromosomes are viewed in terms of genes (cluster centroids) and sub-genes (ordinates), which makes more sense than looking at the solution as a mere string of real numbers. Genetic information is exchanged only in terms of whole genes, never in terms of single ordinates or sub-genetic alleles: a random cross-site is selected and the chromosome is split only at gene boundaries, so that no gene is split in between. This helps preserve the integrity of the genetic information being exchanged. Figure 3 illustrates the working of the Centroid-level Crossover operator: there are four genes with three sub-genes each, the cross-site is three, and the genes after the third gene are swapped.

Figure 3 - Centroid-level Crossover (cross-site = 3: the genes of Parent1 and Parent2 after the third gene are swapped to produce Child1 and Child2)

Crossover occurs with a probability µc. Two chromosomes are selected out of the current population with probability proportional to their fitness. A random number in the range (0,1) is generated; if it is less than µc, crossover occurs and the offspring are copied onto the next generation, otherwise the chromosomes are copied directly without crossover.

Inter-cluster Mutation

The role of mutation [6] is to restore lost or unexplored genetic material into the population to prevent premature convergence to sub-optimal solutions. It ensures that the probability of reaching any point in the search space is never zero.
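As a concrete sketch of the centroid-level crossover described above, a chromosome can be held as a list of k genes, each gene a list of N ordinates, and recombined only at gene boundaries (an illustrative sketch; the function name and data layout are assumptions, not taken from the paper):

```python
import random

def centroid_level_crossover(parent1, parent2):
    """Swap whole genes (cluster centroids) after a random cross-site,
    so that no centroid's ordinates are ever split between children.
    Each parent is a list of k genes; each gene is a list of N ordinates."""
    k = len(parent1)
    site = random.randint(1, k - 1)  # cross-site always lies between whole genes
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2
```

With four genes of three sub-genes each and cross-site 3, this exchanges the genes after the third gene, exactly as in Figure 3.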


The mutation operator must guide the algorithm towards fulfilling the objectives of clustering as mentioned before; an appropriate clustering-specific mutation technique is proposed to this end. The chromosome is mutated such that the inter-cluster distance increases. Here again the mutation is not allelic, i.e. applied to single sub-genetic alleles, but operates in terms of genes. Each chromosome in the population is mutated with a probability µm, called the mutation probability: a random number is generated in the range (0,1), and if the value is less than µm, mutation occurs.

The Inter-cluster Mutation operator is based on inter-cluster distances. The inter-cluster distances among all genes of a chromosome are calculated according to Equation (5), and the pair of centroids (genes) that are closest to each other, say Ca and Cb, is selected. The ordinate (sub-gene) values of these two centroids are then mutated. A random number δ in the range [0,1] is generated with uniform distribution. Let via and vib be the ith ordinates (sub-genetic alleles) of the selected centroids Ca and Cb respectively. They are mutated as follows, with a, b ∈ {1, 2, ..., k}, a ≠ b and i ∈ {1, 2, ..., N}:

if via ≤ vib:
via = via - δ*via   (10)
vib = vib + δ*vib   (11)
else:
via = via + δ*via   (12)
vib = vib - δ*vib   (13)

If the value of via is less than vib, then reducing via and increasing vib pushes the centroids Ca and Cb farther from each other. If via is greater than vib, the subtraction and addition operations are reversed and the same distance-enhancing effect is achieved. This increase of the distance between the closest clusters promotes heterogeneity among clusters.

Termination Criterion

Fitness computation, selection, crossover and mutation are executed for a fixed number of iterations. The best string seen up to the last generation provides the solution to the clustering problem. Elitism is implemented at each generation by preserving the best string seen in that generation in a location outside the population, or by copying it onto the next generation. The resulting fittest chromosome represents the centroids of the final clusters.

Implementation and Results

To establish the superiority of the method, it was tested, compared and analyzed in two phases. The testing was done on datasets of varying sizes and dimensions; five real-life datasets, freely available at [17], are used. In all cases the value of k is known a priori. A brief review of the datasets is given below, and Table 1 summarizes them along with the number of instances, dimensions and classes.

Iris: A set of 150 data points in a four-dimensional space. The data represents different categories of irises having four feature values (sepal length, sepal width, petal length and petal width, in centimeters). It has three classes, Setosa, Versicolor and Virginica, with 50 samples each. Versicolor and Virginica are said to overlap, while the class Setosa is linearly separable.

Wine: The results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. It contains 178 data points.

New Thyroid: Lab tests are used to try to predict whether a patient's thyroid belongs to the class euthyroidism, hypothyroidism or hyperthyroidism. The diagnosis (class label) was based on a complete medical record, including anamnesis, scan etc. The number of instances is 215 and the number of attributes is 5.

Bupa Liver Disorder: Consists of 6 variables in each data point. The first 5 variables are blood tests thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each data point constitutes the record of a single male individual. The number of instances is 345 and the number of attributes is 6. The points can be divided into two classes.

Breast Cancer: The Wisconsin Breast Cancer dataset, having 683 points, is used. Each pattern has nine features (clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses). There are two classes in the data: Malignant and Benign.

Table 1 - Datasets used for testing

  Dataset          Instances   Dimensions   Classes
  Iris                 150          4           3
  Wine                 178         13           3
  New Thyroid          215          5           3
  Liver Disorder       345          6           2
  Breast Cancer        683          9           2

Parameter Testing

The parameter values that exhibited good optimization characteristics in test runs are summarized in Table 2. It was observed that when the number of generations was high, e.g. 100, the algorithm converged well before the last generation; the unnecessary computation of redundant generations can therefore be avoided by setting the number of generations to 50. It was also observed that lower mutation probabilities make the proposed technique converge slowly, while higher probabilities lead to oscillating behavior of the algorithm. These parameters were then used for further testing of the algorithm.
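The inter-cluster mutation of Equations (10)-(13) can be sketched as follows (an illustrative sketch assuming positive ordinate values, as the equations imply; the function name and data layout are assumptions, not taken from the paper):

```python
import math
import random

def inter_cluster_mutation(chromosome, delta=None):
    """Find the two closest centroids (genes) in the chromosome and push
    them apart by scaling their ordinates per Equations (10)-(13).
    The chromosome is a list of k genes; each gene is a list of N ordinates."""
    k = len(chromosome)
    # Select the pair of centroids with the minimal Euclidean distance
    a, b = min(
        ((i, j) for i in range(k) for j in range(i + 1, k)),
        key=lambda pair: math.dist(chromosome[pair[0]], chromosome[pair[1]]),
    )
    if delta is None:
        delta = random.uniform(0.0, 1.0)  # δ drawn uniformly from [0,1]
    ca, cb = chromosome[a], chromosome[b]
    for i in range(len(ca)):
        if ca[i] <= cb[i]:
            ca[i] -= delta * ca[i]   # Equation (10)
            cb[i] += delta * cb[i]   # Equation (11)
        else:
            ca[i] += delta * ca[i]   # Equation (12)
            cb[i] -= delta * cb[i]   # Equation (13)
    return chromosome
```

For positive ordinates this widens the gap between the closest pair of centroids, promoting heterogeneity among the clusters.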


Table 2 - Optimum parameters

  Parameter                    Value
  Crossover probability (µc)   0.8
  Mutation probability (µm)    0.001
  Population size (P)          50
  No. of generations (G)       50

Phase I: Contribution of the New Operators

The proposed method is compared with a previous implementation of a Genetic Algorithm-based Clustering Technique [10] (GACT). The GACT uses relatively simple crossover and mutation operators which are not specific to clustering. The parameters discussed in the previous section are kept constant for both methods, and the DB-index is used as the optimization metric for both techniques. Since both are stochastic algorithms, ten runs each of RCGC-IM (Real Coded Genetic Clustering with Inter-cluster Mutation) and GACT on the Iris dataset were considered for comparison. For each generation, the average DB-index value over the ten runs of each technique was recorded. By comparing these values, the contribution of the new mutation and crossover operators and their superiority over the commonly used operators can be judged. Figure 4 shows the comparison of the average DB-index values of RCGC-IM and GACT over 50 generations.

Figure 4 - Average DB-index values of RCGC-IM and GACT over 50 generations (DB-index plotted against the number of generations)

The graph shows the consistent behavior of the RCGC-IM and that, on average, the proposed method performs better than the GACT. It must be kept in mind that a lower DB-index indicates a better clustering result.

Phase II: Performance of RCGC-IM

The performance of the proposed method is tested on five different real-life datasets. The well known K-means algorithm and the previous implementation of Real Coded Genetic Algorithms in Clustering [10] are considered for comparison. On each dataset the K-means is run 2500 times and the average of all the Davies-Bouldin indices of the resulting clusters is calculated. Since each run of the genetic algorithm samples 50 cluster configurations, it is run 50 times, and is hence considered equivalent to the 2500 runs of the standard K-means. The average DB-index is likewise calculated for the GACT [10] and for the RCGC-IM proposed in this paper. The results for all datasets are summarized in Table 3.

Table 3 - Comparison of DB-index values

  Datasets         K-Means   GACT     RCGC-IM
  Iris             1.0593    0.6982   0.6648
  Wine             1.7523    0.4846   0.4558
  New Thyroid      1.2670    0.5732   0.5467
  Liver Disorder   1.7767    0.5447   0.5305
  Breast Cancer    1.2746    0.6337   0.6156

It is clearly seen that the RCGC-IM achieves lower, and hence better, DB-index values on average than either the K-means algorithm or the GACT. The performance of the K-means algorithm is highly variable and depends on the initial configuration, whereas this variation is much smaller for the RCGC-IM. The operators used in GACT are not customized for clustering applications, unlike those of the RCGC-IM; hence the difference in performance values.

Conclusion and Future Work

The method proposed in this work effectively combines the simplicity of the K-means algorithm and the searching capability of Genetic Algorithms. Previous implementations [10-12] do not exploit the conceptual closeness offered by real encoding. This method taps the ability of real encoding to map candidate solutions directly onto chromosomes by viewing the solution as genes representing a real world solution to the problem, rather than viewing the chromosomes as just a string of real numbers. This view is extended and exploited further by the crossover and mutation operators, which work on complete genes and not on single alleles. This incorporates a degree of integrity into the algorithm, helping it compute solutions with a real-world perspective. The superiority of the RCGC-IM along with the new genetic operators is emphasized by the results of the tests conducted on various real world datasets.

Future work in this field will be directed towards using stochastic techniques like GAs as much more than just black-box optimization techniques. Incorporation of valuable real world insights into the workings of the algorithm itself, in the form of appropriate fitness functions


and customized operators, will be an area of focus. Other improvements to this method can be explored by using newer indices [18] as the optimization metrics. The possibility of devising a similar method for clustering when the number of classes is not known a priori can also be explored.

Acknowledgments

The author would like to thank Dr. T.N. Nagabhushan, Professor and Head, Department of Information Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysore, India, for his guidance and support.

References

[1] Duda R et al, 2001. Pattern Classification, Wiley-Interscience Publishers, USA.
[2] Jain A et al, 1988. Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.
[3] Glenn F, 2001. A Comprehensive Overview of Basic Clustering Algorithms.
[4] Michalewicz Z, 1992. Genetic Algorithms + Data Structures = Evolution Programs, Springer, New York.
[5] Davis L (Ed.), 1991. Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York.
[6] Mitchell M, 1996. An Introduction to Genetic Algorithms, Complex Adaptive Systems, MIT Press, Cambridge.
[7] Krovi R, 1992. Genetic algorithms for clustering: A preliminary investigation. In Proceedings of the 25th Hawaii Intl. Conf. on System Sciences: 540-544.
[8] Krishna K et al, 1999. Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29: 433-439.
[9] Lin H et al, 2005. An effective GA-based clustering technique, Tamkang Journal of Science and Engineering, 8: 113-122.
[10] Maulik U et al, 2000. Genetic algorithm-based clustering technique, Pattern Recognition 33: 1455-1465.
[11] Maulik U et al, 2001. Nonparametric Genetic Clustering: Comparison of Validity Indices, IEEE Trans. on Systems, Man, and Cybernetics, Part C, 31: 120-125.
[12] Maulik U et al, 2002. Genetic Clustering for Automatic Evolution of Clusters and Application to Image Classification, Pattern Recognition, 35: 1197-1208.
[13] Davies D et al, 1979. A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intelligence, 1: 224-227.
[14] Selim S et al, 1984. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intelligence, 6: 81-87.
[15] Herrera F et al, 1998. Tackling real-coded genetic algorithms: operators and tools for behavioural analysis, Artificial Intelligence Review 12: 265-319.
[16] Raghuwanshi M et al, 2004. Survey on multi-objective evolutionary and real-coded genetic algorithms. In Proceedings of the 8th Asia Pacific Symposium on Intelligent and Evolutionary Systems: 150-161.
[17] The UCI machine learning database repository. URL http://www.ics.uci.edu/~mlearn/databases.html
[18] Bezdek J et al, 1998. Some New Indexes of Cluster Validity, IEEE Trans. on Systems, Man, and Cybernetics, Part B, 28: 301-315.