Proceedings of the Third International Conference on Artificial Intelligence in Engineering & Technology
November 22-24, 2006, Kota Kinabalu, Sabah, Malaysia
Initial Population

Chromosomes are initialized using the above method and a population of candidate solutions is created. Each of these chromosomes represents a possible solution to the clustering problem. A given number of individual chromosomes initialized in this manner form the initial population.

$$S_{i,q} = \left( \frac{1}{|C_i|} \sum_{x \in C_i} \| x - c_i \|_2^{\,q} \right)^{1/q} \qquad (3)$$

where $c_i$ is the centroid of $C_i$ and is defined as

$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x \qquad (4)$$

where $n_i$ is the cardinality of the cluster $C_i$.

$S_{i,q}$ is the $q$th root of the $q$th moment of the points in cluster $i$
with respect to their mean, and is a measure of the dispersion of the points in cluster $i$. In this paper $q = 1$, hence $S_{i,1}$ is the average Euclidean distance of the data points in class $i$ from the centroid of class $i$.

$d_{ij,t}$ denotes the Minkowski distance of order $t$ between the centroids of clusters $C_i$ and $C_j$, i.e. the distance between the clusters. Here $t = 2$.

$$d_{ij,t} = d(C_i, C_j) = \| c_i - c_j \|_t \qquad (5)$$

$R_{i,qt}$ denotes the maximal similarity index of $C_i$ to the other clusters.

$$R_{i,qt} = \max_{j,\; j \neq i} \frac{S_{i,q} + S_{j,q}}{d_{ij,t}} \qquad (6)$$

The DB index is then defined as follows:

$$DB_r = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt} \qquad (7)$$

A smaller value of the DB-index indicates a better clustering result. Thus the fitness function can be defined as the inverse of the DB-index:

$$f(Chrom_r) = \frac{1}{DB_r} \qquad (8)$$

The fitness value is the quantity optimized by the algorithm. Maximizing this fitness function minimizes the DB-index. The DB-index has been used effectively for cluster validation in the past [9,11,12].

Selection

The fitness of an individual determines its probability of contributing to the mating pool. An individual with a higher fitness value has a higher probability of surviving, reproducing and contributing to the next generation. The probability for a chromosome to survive and mate is given by the ratio of the individual fitness to the total fitness of the whole population:

$$Prob(Chrom_r) = \frac{f_r}{\sum_{r=1}^{P} f_r} \qquad (9)$$

The well-known Roulette Wheel selection scheme is used to select the parents to create the next generation.

Centroid-level Crossover

Crossover [6] is a probabilistic process that exchanges information between two parent chromosomes to generate two child chromosomes. It is the main search operator in GAs. In previous implementations [10-12] the crossover operator used is single-point crossover. In [10] the chromosome is viewed as a single string of k*N real values. The cross-site is randomly picked from the range {1,2,…, k*N} and the split parts are swapped. This might split the chromosome such that the ordinates of a single cluster centroid are separated, as shown in Fig. 2. The potential solution to the problem is a set of complete cluster centroids and not individual ordinates; hence the commonly used single-point crossover is not satisfactory. A new Centroid-level Crossover operator is proposed instead.

Figure 2 - Normal Single-point Crossover

Real encoding eliminates the mapping from phenotype to genotype; hence the candidate solution can be directly manipulated. This feature is exploited in the proposed technique, as the chromosomes are viewed in terms of genes (cluster centroids) and sub-genes (ordinates). This makes more sense than looking at the solution as a mere string of real numbers. The genetic information is exchanged only in terms of genes and not in terms of single ordinates or sub-genetic alleles. A random crossover point is selected in the range {1,2,…, k*N}. The chromosome is split in terms of whole genes and no gene is split in between. This helps preserve the integrity of the genetic information being exchanged. Figure 3 illustrates the working of the Centroid-level Crossover operator. In the example there are four genes with three sub-genes each. The cross-site is three and the genes after the third gene are swapped.

Figure 3 - Centroid-level Crossover (Parent1, Parent2, Child1, Child2; cross-site = 3)

Crossover occurs with a probability of $\mu_c$. Two chromosomes are selected out of the current population with a probability proportional to their fitness. A random number in the range (0,1) is generated; if it is less than $\mu_c$ then crossover occurs and the offspring are copied onto the next generation, otherwise the chromosomes are copied directly without crossover.

Inter-cluster Mutation

The role of mutation [6] is to restore lost or unexplored genetic material into the population to prevent premature convergence to sub-optimal solutions. It ensures that the probability of reaching any point in the search space is never zero.
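As a rough illustration of the selection and crossover steps described in these sections, the following Python sketch pairs a roulette-wheel pick (Equation (9)) with a crossover that swaps only whole genes. It is a minimal sketch under stated assumptions, not the author's implementation: the function names are hypothetical, chromosomes are assumed to be flat lists of k*N reals, and the cross-site is assumed to be a gene index drawn from 1..k-1, which matches the Figure 3 example (cross-site = 3 with four genes) rather than the {1,2,…, k*N} range quoted in the text.

```python
import random


def roulette_select(fitnesses, rng=random):
    """Return an index chosen with probability f_r / sum(f), as in Equation (9)."""
    total = sum(fitnesses)
    threshold = rng.random() * total
    running = 0.0
    for index, f in enumerate(fitnesses):
        running += f
        if threshold < running:
            return index
    return len(fitnesses) - 1  # guard against floating-point round-off


def swap_genes_after(parent1, parent2, site, n):
    """Swap every gene after gene number `site`; each gene holds n ordinates."""
    cut = site * n  # the cut always falls on a gene boundary
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2


def centroid_level_crossover(parent1, parent2, k, n, mu_c, rng=random):
    """Cross two flat chromosomes of k genes with probability mu_c."""
    if len(parent1) != k * n or len(parent2) != k * n:
        raise ValueError("each chromosome must hold k*n values")
    # With probability 1 - mu_c (or when there is only one gene) copy the parents.
    if k < 2 or rng.random() >= mu_c:
        return list(parent1), list(parent2)
    # Assumption: the cross-site is a gene index, so no centroid is ever split.
    site = rng.randint(1, k - 1)
    return swap_genes_after(list(parent1), list(parent2), site, n)


# Figure 3 style example: four genes, three sub-genes each, cross-site = 3.
p1 = list(range(12))
p2 = list(range(100, 112))
c1, c2 = swap_genes_after(p1, p2, site=3, n=3)
print(c1)  # first three genes from p1, last gene from p2
print(c2)  # first three genes from p2, last gene from p1
```

Drawing the cut on gene boundaries is what keeps each centroid intact; a plain single-point crossover would differ only in letting `cut` take any value in 1..k*N-1.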
The mutation operator must guide the algorithm towards fulfilling the objectives of clustering as mentioned before; an appropriate clustering-specific mutation technique is proposed to this end. The chromosome is mutated such that the inter-cluster distance increases. Here again the mutation is not allelic, i.e. applicable to single sub-genetic alleles, but works in terms of genes. Each chromosome in the population is mutated with a probability $\mu_m$ called the mutation probability. A random number is generated in the range (0,1); if the value is less than $\mu_m$, mutation occurs. The Inter-cluster Mutation is an operator based on intercluster distances. The intercluster distances among all genes of a chromosome are calculated according to Equation (5), and the pair of centroids (genes) that are closest to each other are selected, say $C_a$ and $C_b$. The ordinate (sub-gene) values of these two centroids are mutated. A random number $\delta$ in the range [0,1] is generated with uniform distribution. Let the ordinate (sub-genetic allele) values under consideration be $v_a^i$ and $v_b^i$, the $i$th ordinates of the selected centroids $C_a$ and $C_b$ respectively. They are mutated as follows:

if $v_a^i \le v_b^i$, with $a, b \in \{1,2,\ldots,k\}$, $a \neq b$, $i \in \{1,2,\ldots,N\}$:

$$v_a^i = v_a^i - \delta \cdot v_a^i \qquad (10)$$

$$v_b^i = v_b^i + \delta \cdot v_b^i \qquad (11)$$

else:

$$v_a^i = v_a^i + \delta \cdot v_a^i \qquad (12)$$

$$v_b^i = v_b^i - \delta \cdot v_b^i \qquad (13)$$

If the value of $v_a^i$ is less than $v_b^i$, then reducing the value of $v_a^i$ and increasing the value of $v_b^i$ will push the centroids $C_a$ and $C_b$ farther from each other. If the value of $v_a^i$ is greater than $v_b^i$, then the subtraction and addition operations are reversed and the desired distance-enhancing effect is achieved. This effect of increasing the distance between the clusters facilitates heterogeneity among clusters.

Termination Criterion

The methods of fitness computation, selection, crossover, and mutation are executed for a fixed number of iterations. The best string seen up to the last generation provides the solution to the clustering problem. Elitism can be implemented at each generation by preserving the best string seen in that generation in a location outside the population or by copying it onto the next generation. Thus the resulting fittest chromosome represents the centroids of the final clusters.

Implementation and Results

In order to demonstrate the superiority of the method, it was tested, compared and analyzed in two phases. The testing was done on datasets of varying sizes and dimensions. Five real-life datasets freely available at [17] are used. In all cases the value of k is known a priori. A brief review of the datasets is given below. Table 1 summarizes the different datasets along with the number of instances, dimensions and classes.

Iris: It is a set of 150 data points in a four-dimensional space. The data represents different categories of irises having four feature values (sepal length, sepal width, petal length, petal width in centimeters). It has three classes, Setosa, Versicolor and Virginica, with 50 samples each. Versicolor and Virginica are said to overlap while the class Setosa is linearly separable.

Wine: These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. It contains 178 data points.

New Thyroid: Lab tests are used to try to predict whether a patient's thyroid belongs to the class euthyroidism, hypothyroidism or hyperthyroidism. The diagnosis (class label) was based on a complete medical record, including anamnesis, scan etc. The number of instances is 215 and the number of attributes is 5.

Bupa Liver Disorder: It consists of 6 variables in each data point. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each data point constitutes the record of a single male individual. The number of instances is 345 and the number of attributes is 6. They can be divided into two classes.

Breast Cancer: The Wisconsin Breast Cancer data set, having 683 points, is used. Each pattern has nine features (clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses). There are two classes in the data: Malignant and Benign.

Table 1 - Datasets used for testing

Dataset         Instances   Dimensions   Classes
Iris            150         4            3
Wine            178         13           3
NewThyroid      215         5            3
LiverDisorder   345         6            2
BreastCancer    683         9            2

Parameter Testing

The values that exhibited characteristics of good optimization in their runs are summarized in Table 2. It was observed that when the number of generations was high, e.g. 100, the algorithm converged well before the last generation. Hence the unnecessary computation of redundant generations can be avoided by setting the number of generations to 50. It was also observed that lower mutation probabilities make the proposed technique converge slowly, while higher probabilities lead to oscillating behavior of the algorithm. These parameters were then used for further testing of the algorithm.
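Because the mutation probability is one of the parameters being tuned, it may help to see the Inter-cluster Mutation of Equations (10)-(13) written out. The Python sketch below is illustrative only and rests on several assumptions that the text does not state: the names `push_apart` and `inter_cluster_mutation` are hypothetical, a single δ is drawn per mutation and reused for every ordinate, the closest pair is found with the Euclidean distance (t = 2), and ordinates are taken to be non-negative (for negative values the multiplicative update would move the ordinates in the opposite direction).

```python
import math
import random


def push_apart(gene_a, gene_b, delta):
    """Apply Equations (10)-(13) ordinate by ordinate to two centroids."""
    new_a, new_b = [], []
    for va, vb in zip(gene_a, gene_b):
        if va <= vb:
            new_a.append(va - delta * va)  # Equation (10)
            new_b.append(vb + delta * vb)  # Equation (11)
        else:
            new_a.append(va + delta * va)  # Equation (12)
            new_b.append(vb - delta * vb)  # Equation (13)
    return new_a, new_b


def inter_cluster_mutation(chromosome, k, n, mu_m, rng=random):
    """Return a mutated copy of a flat chromosome of k centroids, n ordinates each."""
    genes = [list(chromosome[g * n:(g + 1) * n]) for g in range(k)]
    # Mutation happens only with probability mu_m and needs at least two genes.
    if k >= 2 and rng.random() < mu_m:
        # Find the closest pair of centroids (Euclidean distance, t = 2).
        a, b = min(
            ((p, q) for p in range(k) for q in range(p + 1, k)),
            key=lambda pair: math.dist(genes[pair[0]], genes[pair[1]]),
        )
        delta = rng.random()  # assumption: one delta shared by all ordinates
        genes[a], genes[b] = push_apart(genes[a], genes[b], delta)
    return [value for gene in genes for value in gene]


# The two nearest centroids (1,1) and (2,2) move apart; (10,10) is untouched.
print(inter_cluster_mutation([1.0, 1.0, 2.0, 2.0, 10.0, 10.0], k=3, n=2, mu_m=1.0))
```

Only the closest pair of genes is altered, so the remaining centroids carry over unchanged into the mutated chromosome.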
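Since the comparisons in this section are reported as DB-index values, a small worked computation of Equations (3)-(8) may be useful. The following Python sketch is an assumption-laden illustration rather than the code used in the paper: it fixes q = 1 and t = 2 as stated in the text, takes a hypothetical `labels` list giving the cluster of each point instead of deriving the assignment from a chromosome, and assumes that no cluster is empty and that no two centroids coincide (either case would need separate handling).

```python
import math


def db_index(points, labels, k):
    """Davies-Bouldin index with q = 1 and t = 2 (Equations (3)-(7))."""
    dim = len(points[0])
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    if any(not members for members in clusters):
        raise ValueError("every cluster needs at least one point")
    # Equation (4): each centroid is the mean of its cluster members.
    centroids = [
        [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for members in clusters
    ]
    # Equation (3) with q = 1: average Euclidean distance to the centroid.
    scatter = [
        sum(math.dist(p, c) for p in members) / len(members)
        for members, c in zip(clusters, centroids)
    ]
    total = 0.0
    for i in range(k):
        # Equations (5) and (6): worst-case similarity of cluster i to any other.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k  # Equation (7)


def fitness(points, labels, k):
    """Equation (8): the inverse of the DB-index."""
    return 1.0 / db_index(points, labels, k)


# Two tight, well-separated clusters: scatter 1 each, centroids 10 apart.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(db_index(pts, [0, 0, 1, 1], k=2))  # (1 + 1) / 10 for both clusters
```

In this toy example each cluster has scatter 1 and the centroids are 10 apart, so the index is about 0.2 and the fitness about 5; pulling the clusters closer together raises the index and lowers the fitness.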
Figure 3 - Average DB-index values of RCGC-IM and GACT for 50 generations (horizontal axis: No. of Generations; series: RCGC-IM, GACT)

The graph shows the consistent behavior of the RCGC-IM and that on average the proposed method performs better than the GACT. It must be kept in mind that a lower DB-index indicates a better clustering result.

Phase II: Performance of RCGC

The performance of the proposed method is tested over five different real-life datasets. The performance of the well

simplicity of the K-means algorithm and the searching capability of Genetic Algorithms. Previous implementations [10-12] do not exploit the conceptual closeness offered by real encoding. This method taps the ability of real encoding to map the candidate solutions directly onto the chromosomes by viewing the solution as genes representing a real-world solution for the problem rather than viewing the chromosomes as just a string of real numbers. This view is extended and exploited further by the crossover and mutation operators that work on complete genes and not on singular alleles. This incorporates a degree of integrity into the algorithm, helping it to compute solutions with a real-world perspective. The superiority of the RCGC-IM along with the new genetic operators is emphasized by the results of the tests conducted on various real-world datasets.

Future work in this field will be directed towards using stochastic techniques like GAs as much more than just black-box optimization techniques. Incorporation of valuable real-world insights into the workings of the algorithm itself in the form of appropriate fitness functions
and customized operators will be an area of focus. Other improvements to this method can be explored by using newer indices [18] as the optimization metrics. The possibility of devising a similar method for clustering when the number of classes is not known a priori can also be explored.

Acknowledgments

The author would like to thank Dr. T.N. Nagabhushan, Professor and Head, Department of Information Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysore, India, for his guidance and support.

References

[1] Duda R et al, 2001. Pattern Classification, Wiley-Interscience Publishers, USA.
[2] Jain A et al, 1988. Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.
[3] Glenn F. 2001. A Comprehensive Overview of Basic Clustering Algorithms.
[4] Michalewicz Z, 1992. Genetic Algorithms + Data Structures = Evolution Programs, Springer, New York.
[5] Davis L Ed. 1991. Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York.
[6] Mitchell M. 1996. An Introduction to Genetic Algorithms, Complex Adaptive Systems. MIT Press, Cambridge.
[7] Krovi R. 1992. Genetic Algorithms for clustering: A preliminary investigation. In Proceedings of the 25th Hawaii Intl. Conf. on System Sciences: 540-544.
[8] Krishna K et al, 1999. Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 29: 433-439.
[9] Lin H et al, 2005. An effective GA-based clustering technique, Tamkang Journal of Science and Engg., 8: 113-122.
[10] Maulik U et al, 2000. Genetic algorithm-based clustering technique, Pattern Recognition 33: 1455-1465.
[11] Maulik U et al, 2001. Nonparametric Genetic Clustering: Comparison of Validity Indices, IEEE Trans. on Systems, Man, and Cybernetics-Part C, 31: 120-125.
[12] Maulik U et al, 2002. Genetic Clustering for Automatic Evolution of Clusters and Application to Image Classification, Pattern Recognition, 35: 1197-1208.
[13] Davies D. et al, 1979. A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intelligence, 1: 224-227.
[14] Selim S. et al, 1984. K-means type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intelligence, 6: 81-87.
[15] Herrera F et al, 1998. Tackling real-coded genetic algorithms: operators and tools for behavioral analysis, Artificial Intelligence Review 12: 265-319.
[16] Raghuwanshi M. et al, 2004. Survey on multi-objective evolutionary and real-coded genetic algorithms. In Proceedings of the 8th Asia Pacific Symposium on Intelligent and Evolutionary Systems: 150-161.
[17] The UCI online machine learning database repository. URL http://www.ics.uci.edu/~mlearn/databases.html
[18] Bezdek J. et al, 1998. Some New Indexes of Cluster Validity. IEEE Trans. on Systems, Man, and Cybernetics-Part B, 28: 301-315.