$$D_{nc} = \min_{i=1,\dots,nc} \left\{ \min_{j=i+1,\dots,nc} \left( \frac{d(c_i, c_j)}{\max_{k=1,\dots,nc} \operatorname{diam}(c_k)} \right) \right\} \tag{4}$$
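As an illustration of Eq. (4), a minimal Python sketch of Dunn's index is given below. It is not the MATLAB code used in our experiments. It assumes that d(c_i, c_j) is the smallest distance between a point of c_i and a point of c_j, that diam(c_k) is the largest distance between two points of c_k, and that distances are Euclidean; the function name `dunn_index` and the sample clusters are hypothetical.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn's index for a list of clusters, each a list of coordinate tuples."""
    # Assumption: d(c_i, c_j) is the closest-pair (single-linkage) distance.
    def set_distance(a, b):
        return min(math.dist(p, q) for p in a for q in b)

    # Assumption: diam(c_k) is the largest pairwise distance inside c_k.
    def diameter(c):
        return max((math.dist(p, q) for p, q in combinations(c, 2)), default=0.0)

    max_diameter = max(diameter(c) for c in clusters)
    min_separation = min(set_distance(a, b) for a, b in combinations(clusters, 2))
    # Undefined (division by zero) when every cluster is a single point.
    return min_separation / max_diameter

# Hypothetical toy data: two compact, well separated clusters.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
dunn = dunn_index(clusters)
```

A larger value indicates clusters that are compact relative to their separation.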
IV. SCHEMATIC REPRESENTATION OF PROPOSED MODEL
Three gene expression datasets were collected from the UCI
repository [13], and the proposed model was implemented on them
for evaluation. First, each dataset is normalized, and principal
component analysis (PCA) is then applied to reduce it. After
reduction, the k-means and hierarchical clustering techniques are
applied to find the clusters. Finally, the cluster validity
indices, Silhouette and Dunn's index, are applied to the resulting
clusters to identify the optimum clustering.
International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 5, May 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 1467
Fig. 1 Proposed model
Fig. 1 shows the schematic representation of our cluster
validity model.
V. EXPERIMENTAL EVALUATION AND RESULT ANALYSIS
For the evaluation of the proposed work we took three datasets
from the UCI repository [13]. The breast cancer dataset consists
of 98 instances and 26 attributes, the hepatitis dataset of 155
instances and 20 attributes, and the dermatology dataset of 366
instances and 34 attributes. The experiments were carried out in
MATLAB 2010. Our experimental work consists of the following steps:
Step 1: Normalization of Datasets: The datasets are multivariate
and their attributes are real-valued. Each dataset has been
normalized with the min-max normalization technique. Fig. 2 shows
the normalization of the hepatitis dataset, Fig. 3 that of the
breast cancer dataset and Fig. 4 that of the dermatology dataset.
Fig. 2 Normalization of hepatitis dataset
Fig. 3 Normalization of breast cancer dataset
Fig. 4 Normalization of dermatology dataset
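The min-max normalization of Step 1 can be sketched in Python as follows. This is an illustration only, not the MATLAB code used in our experiments; it assumes the target range [0, 1], which the text does not state, and the function name `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) of a list-of-rows dataset to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            # A constant attribute has high == low; map it to 0.0
            # instead of dividing by zero.
            new_row.append((value - low) / (high - low) if high > low else 0.0)
        normalized.append(new_row)
    return normalized

# Hypothetical toy data: three instances with two attributes.
data = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled = min_max_normalize(data)
```

Each attribute is scaled independently, so attributes measured on different scales contribute comparably to the distance computations in the later steps.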
Step 2: Feature Reduction using PCA: PCA is applied to reduce each
dataset so that only the relevant genes are retained. Fig. 5 shows
the reduced hepatitis dataset, which is reduced from 155×20 to
8×2, Fig. 6 shows the reduced breast cancer dataset, which is
reduced from 98×26 to 26×3, and Fig. 7 shows the reduced
dermatology dataset, which is reduced from 366×34 to 366×11.
Fig. 5 Reduction of hepatitis dataset
[Fig. 1 block diagram: Gene Expression Dataset -> Min-Max Normalization -> PCA for Data Reduction -> Clustering Techniques (k-means clustering, Hierarchical clustering) -> Cluster Validity Indices (Silhouette Index, Dunn's Index) -> Comparison of Result]
Fig. 6 Reduction of breast cancer dataset
Fig. 7 Reduction of dermatology dataset
TABLE I
DESCRIPTION OF PCA REDUCED DATASETS
Dataset         Before reduction   After reduction
Hepatitis       155×20             8×2
Breast cancer   98×26              26×3
Dermatology     366×34             366×11
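The PCA reduction of Step 2 can be sketched in Python as follows. This is an illustration only, not the MATLAB code used in our experiments. It assumes PCA on the sample covariance matrix, with the leading eigenvectors found by power iteration and deflation; the function name `pca`, its parameters and the sample values are hypothetical. The sketch reduces only the number of attributes (columns) and keeps one row per instance.

```python
import math
import random

def pca(rows, n_components, iterations=500, seed=0):
    """Project rows onto their leading principal components."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the attributes (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _component in range(n_components):
        # Arbitrary non-zero start vector, scaled to unit length.
        v = [rng.random() + 0.1 for _ in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _step in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:      # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append(v)
        # Deflation: remove the direction just found so that the next
        # pass converges to the following component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]

# Hypothetical toy data: 4 instances, 3 attributes, the last one constant.
rows = [[2.0, 0.0, 5.0], [-2.0, 0.0, 5.0], [0.0, 1.0, 5.0], [0.0, -1.0, 5.0]]
reduced = pca(rows, 2)   # 4 x 2 result
```

The sign of each component is arbitrary, so projected coordinates may be mirrored from one implementation to another.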
Step 3: Clustering
A. Hierarchical Clustering: Hierarchical clustering has been
applied to the three datasets. Fig. 8 shows the clusters of the
hepatitis dataset, Fig. 9 those of the breast cancer dataset and
Fig. 10 those of the dermatology dataset after applying the
hierarchical clustering technique. The data are clustered on the
basis of distance, so that the data points are separated and
arranged into different clusters.
Fig. 8 Hierarchical clusters of hepatitis dataset
Fig. 9 Hierarchical clusters of breast cancer dataset
Fig. 10 Hierarchical clusters of dermatology dataset
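Distance-based hierarchical clustering of the kind described above can be sketched in Python as follows. This is an illustration only, not the MATLAB code used in our experiments. It assumes the agglomerative (bottom-up) variant with single linkage and Euclidean distance, none of which is specified in the text; the function name `agglomerative` and the sample points are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]   # start with one cluster per point

    def linkage(a, b):
        # Assumption: single linkage, i.e. the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

# Hypothetical toy data: two pairs of close points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
groups = agglomerative(points, 3)
```

Recording the merge distances instead of stopping at a fixed number of clusters would give the full hierarchy from which a dendrogram is drawn.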
B. k-means clustering: k-means clustering has been applied to the
three datasets. Fig. 11 shows the clusters of the hepatitis
dataset, Fig. 12 those of the breast cancer dataset and Fig. 13
those of the dermatology dataset after applying the k-means
technique. The data are clustered on the basis of distance, so
that the data points are separated and arranged into different
clusters.
Fig. 11 k-means clusters of hepatitis dataset
Fig. 12 k-means clusters of breast cancer dataset
Fig. 13 k-means clusters of dermatology dataset
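The k-means technique can be sketched in Python as follows. This is an illustration only, not the MATLAB code used in our experiments. It assumes Lloyd's algorithm with Euclidean distance and, to keep the example deterministic, takes the first k points as initial centroids; the function name `k_means` and the sample points are hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:   # assignments are stable: converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:            # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two groups of three points each.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels, centroids = k_means(points, 2)
```

The outcome of k-means depends on the initial centroids, so in practice the algorithm is usually run from several starting points.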
Step 4: Cluster Validation: After the two clustering techniques
(k-means and hierarchical) have been applied to the three datasets
(hepatitis, breast cancer and dermatology), the validity indices
are applied to the result of each clustering technique. The aim of
the validity indices is to provide adequate evidence of which
clustering is better. Hence the two validity indices (Silhouette
index and Dunn's index) have been applied to each clustered
dataset, and the results are compared in Table II.
The experimental results show that the Silhouette index gives the
better result on every dataset for both the k-means and the
hierarchical clustering technique.
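The Silhouette index used in Step 4 can be sketched in Python as follows. This is an illustration only, not the MATLAB code used in our experiments. It assumes the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)) averaged over all points, Euclidean distance, and at least two clusters; the function name `silhouette_index` and the sample points are hypothetical.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette width over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)   # convention: a singleton cluster scores 0
            continue
        # a: mean distance from p to the other members of its own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of another cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two compact, well separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
score = silhouette_index(points, [0, 0, 1, 1])
```

The value lies between -1 and 1; values close to 1 indicate that points lie much closer to their own cluster than to the nearest other cluster.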
TABLE II
RESULT COMPARISON OF VALIDITY INDICES BY USING
HIERARCHICAL AND K-MEANS CLUSTERING TECHNIQUE
Clustering technique   Dataset         Validity index     Result
Hierarchical           Hepatitis       Silhouette Index   70.6%
Hierarchical           Hepatitis       Dunn's Index       63.2%
Hierarchical           Breast Cancer   Silhouette Index   66.8%
Hierarchical           Breast Cancer   Dunn's Index       61.8%
Hierarchical           Hepatitis       Silhouette Index   70.6%
Hierarchical           Hepatitis       Dunn's Index       63.2%
k-means                Hepatitis       Silhouette Index   68.2%
k-means                Hepatitis       Dunn's Index       61.7%
k-means                Breast Cancer   Silhouette Index   62.0%
k-means                Breast Cancer   Dunn's Index       59.7%
k-means                Hepatitis       Silhouette Index   65.4%
k-means                Hepatitis       Dunn's Index       61.3%
VI. CONCLUSION
In this study, the performance of Dunn's index and the Silhouette
index in the validation of gene clustering was investigated on
gene expression datasets. The datasets were clustered by
hierarchical and k-means clustering, which produced different
clusters. The results are judged on cluster quality and therefore
do not depend on the number of clusters. Validity indices were
successfully applied, using knowledge-driven methods, to estimate
the quality of the clusters. Finally, the comparison of the
results of the two validity indices used in this paper shows that
the Silhouette index is the better method for cluster validity
with both clustering techniques considered. In future, other
validity indices can be applied to a larger number of clustering
techniques to provide more useful information about cluster
validation.
REFERENCES
[1] Mahesh Visvanathan, Adagarla B. Srinivas, Gerald H.
Lushington, Peter Smith, Cluster Validation: An Integrative
Method for Cluster Analysis, IEEE, pp. 238-242, 2009.
[2] Sanghoun Oh, Chang Wook Ahn, Moongu Jeon, An Evolutionary
Cluster Validation Index, IEEE, pp. 83-88, 2008.
[3] Morteza Jalalat-evakilkandi, Abdolreza Mirzaei, A New
Hierarchical-Clustering Combination Scheme Based on Scatter
Matrices and Nearest Neighbour Criterion, 5th International
Symposium on Telecommunications (IEEE), pp. 904-908, 2010.
[4] Juanying Xie, Shuai Jiang, A simple and fast algorithm for
global k-means clustering, 2nd International Workshop on Education
Technology and Computer Science (IEEE), pp. 36-40, 2010.
[5] Hui Xiong, Junjie Wu, Jian Chen, k-Means Clustering Versus
Validation Measures: A Data-Distribution Perspective, IEEE
Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, vol. 39, no. 2, pp. 318-331, April 2009.
[6] Danyang Cao, Bingru Yang, An improved k-medoids clustering
algorithm, IEEE, vol. 3, pp. 132-135, 2010.
[7] M. Srinivas, C. Krishna Mohan, Efficient Clustering Approach
using Incremental and Hierarchical Clustering Methods, IEEE,
2010.
[8] Chunmei Yang, Xuehong Zhao, Ning Li, Yan Wang, Arguing the
Validation of Dunn's Index in Gene Clustering, IEEE, 2009.
[9] Susmita Datta, Somnath Datta, Comparisons and validation of
statistical clustering techniques for microarray gene expression
data, Bioinformatics, vol. 19, no. 4, pp. 459-466, 2003.
[10] Li Zheng, Tao Li, Chris Ding, Hierarchical Ensemble
Clustering, IEEE International Conference on Data Mining, IEEE,
pp. 1199-1204, 2010.
[11] Rui Fa, Asoke K. Nandi, Li-Yun Gong, Clustering analysis for
Gene Expression Data: A Methodology Review, Proceedings of
the 5th International Symposium on Communications, Control and
Signal Processing, IEEE, 2012.
[12] Gabriela Serban, Alina Campan, A New Core-Based Method For
Hierarchical Incremental Clustering, Proceedings of the Seventh
International Symposium on Symbolic and Numeric Algorithms for
Scientific Computing (SYNASC05), IEEE.
[13] http://www.ics.uci.edu/~mlearn/MLRepository.html, 2011.