ABSTRACT
Microarray technology is an important biotechnological tool used to record the expression of thousands of genes simultaneously
across a number of different samples. Among the large number of genes present in microarray gene expression data, only a small
fraction is effective for performing a given diagnostic test. To find this effective group of genes, a supervised attribute
clustering (SAC) algorithm is introduced in this paper. In this regard, mutual information has been shown to be successful for
selecting a set of relevant and non-redundant genes from microarray data. To reduce the redundancy among the attributes, a new
quantitative measure based on mutual information is used, which incorporates the information of sample categories. SAC is
implemented to group coregulated genes with strong association to the class labels. SAC forms clusters based on this similarity
measure, which is more effective than the measures used by existing algorithms. The growth of a cluster is repeated until the
cluster stabilizes. Finally, the selected gene set is classified using a fuzzy classification algorithm. The results are more
accurate than those of conventional classification algorithms.
Keywords: Microarray, mutual information, supervised attribute clustering, fuzzy classification, gene selection.
1. INTRODUCTION
One application of gene expression data in functional genomics is to classify samples according to their gene expression
profiles, for example to classify cancer versus normal samples or to classify different types or subtypes of cancer. A microarray
gene expression dataset can be represented by an expression table, T = {wij | i = 1, . . . , m, j = 1, . . . , n}, where wij is the
measured expression level of gene Gi in the jth sample, and m and n represent the total number of genes and samples,
respectively. In this table, each row corresponds to a particular gene, each column corresponds to a sample, and each entry is
the measured expression level of a particular gene in a sample. However, for most gene expression data, the number of training
samples is still very small compared to the large number of genes involved in the experiment. When the number of genes is
greater than the number of samples, it is possible to find the relevancy of gene behavior with the sample categories or
response variables. Among the large number of genes present in microarray gene expression data, only a small fraction is
effective for performing a given diagnostic test. In this regard, mutual information has been shown to be successful for
selecting a set of relevant and non-redundant genes from microarray data.
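As a minimal illustration of the expression table T, the sketch below stores a toy dataset as a NumPy array with genes as rows and samples as columns. The values, the array name `T`, and the `labels` vector are hypothetical and serve only to show the indexing convention.

```python
import numpy as np

# Hypothetical toy data: m = 4 genes (rows), n = 3 samples (columns).
T = np.array([
    [2.1, 0.4, 1.7],
    [0.3, 0.2, 0.5],
    [5.0, 4.8, 0.1],
    [1.2, 3.3, 2.9],
])

# One class label per sample (assumed here: 0 = normal, 1 = cancer).
labels = np.array([0, 0, 1])

m, n = T.shape      # number of genes, number of samples
w_23 = T[1, 2]      # w_ij for i = 2, j = 3 (1-based in the text, 0-based in NumPy)
gene_2 = T[1, :]    # expression profile of gene G2 across all samples
```

Note that m is usually far larger than n in real microarray data, which is the dimension problem discussed in this paper.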
Figure 1
Clustering is an important topic in data mining research. Clustering algorithms are used to group tuples, each of which is
characterized by a set of attributes, into clusters based on a similarity measure. This paper describes a methodology to
group attributes that are interdependent or correlated with each other. When applied to gene expression data,
conventional clustering methods such as Bayesian clustering, hierarchical clustering, the k-means algorithm, self-organizing
maps, and principal component analysis group a subset of genes that are interdependent or correlated with each other. In this
sense, attributes in a cluster are more correlated with each other, whereas attributes in different clusters are less correlated.
Attribute clustering can reduce the search dimension of a data mining algorithm, so that the search for interesting
relationships or the construction of models takes place in a tightly correlated subset of attributes rather than in the entire
attribute space. After clustering the attributes, one can select a smaller number for further analysis. However, all these
algorithms group genes according to unsupervised similarity measures computed from the gene expressions, without using
any information about the sample categories or response variables. Supervised attribute clustering, in contrast, groups
genes or attributes under the control of the information of sample categories or response variables.
In this regard, a new supervised attribute clustering algorithm is proposed to find coregulated clusters of genes whose
collective expression is strongly associated with the sample categories or class labels. A new measure, based on mutual
information, is introduced to find the similarity between attributes. The proposed measure incorporates the
information of sample categories while measuring the similarity between attributes. In effect, it helps to identify
functional groups of genes that are of special interest in sample classification. The supervised attribute clustering method
uses this measure to reduce the redundancy among genes. In effect, the proposed supervised attribute clustering algorithm
yields biologically significant gene clusters, whose average expression levels allow perfect discrimination of the sample
categories. The proposed algorithm avoids the noise sensitivity problem of existing supervised gene clustering algorithms.
2. RELATED WORK
Classification and clustering are two major tasks in the analysis of gene expression data. Classification is mainly concerned
with assigning memberships to samples based on expression patterns, and clustering aims to find new biological classes
and refine existing ones. Clustering or recognizing patterns in gene expression datasets mainly runs into the
dimension problem. Typically, datasets consist of a large number of genes (attributes) but a small number of samples
(tuples). Many data mining algorithms, such as association rule mining, pattern discovery, linguistic summaries, and
context-sensitive fuzzy clustering, are developed and/or optimized to be scalable with respect to the number of tuples, so they
do not handle a large number of attributes well. Existing clustering algorithms have been applied to genes in various ways; a
well-known example is the k-means attribute clustering algorithm. With attribute clustering, the
search dimension is reduced and meaningful clusters of genes are discovered. Significant genes are then selected from each
group; these contain useful information for gene expression classification and identification.
To solve this problem, methods that can handle both the gene-class relevance and the gene-gene redundancy have been
proposed. These methods typically use some metric to measure the gene-class relevance (e.g., mutual information, the F-test
value, information gain, or symmetrical uncertainty) and employ the same or a different metric to measure the gene-gene
redundancy.
To find a subset of relevant genes without any redundant genes, they usually use a methodology called redundant cover to
eliminate genes that are redundant with respect to a subset of genes selected according to the metrics for measuring the
gene-class relevance and the gene-gene redundancy.
3. SYSTEM ARCHITECTURE
In this section, a new supervised attribute clustering algorithm is presented for grouping coregulated genes with strong
association to the class labels. It is based mainly on a supervised similarity measure, which is described next.
3.1 Supervised Similarity Measure
In real data analysis, one of the important issues is computing both the relevance and the redundancy of attributes by
discovering dependencies among them. Mutual information can be used to calculate both relevance and redundancy among
attributes.
The redundancy or similarity between two attributes Ai and Aj, in terms of mutual information, is calculated as follows:
S(Ai, Aj) = I(Ai, Aj)
However, the term S(Ai, Aj) does not incorporate the information of sample categories or class labels D while measuring
the similarity, and it is therefore considered an unsupervised similarity measure.
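The following is a minimal sketch of how mutual information could be estimated for both purposes, assuming the continuous expression values are first discretized by equal-width binning. The function names `discretize` and `mutual_information`, the binning scheme, and the toy values are assumptions made for illustration; the source does not specify how the probabilities are estimated.

```python
import numpy as np

def discretize(x, bins=3):
    """Equal-width binning of continuous values into integer codes 0..bins-1."""
    x = np.asarray(x, dtype=float).ravel()
    # Use interior bin edges only, so np.digitize returns codes in 0..bins-1.
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]
    return np.digitize(x, edges)

def mutual_information(a, b):
    """I(A; B) in bits, estimated from two equal-length discrete sequences."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)   # contingency table of co-occurrences
    joint /= joint.sum()              # joint probabilities p(a, b)
    # Product of the marginals p(a) * p(b), same shape as the joint table.
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0                    # skip empty cells (0 * log 0 is taken as 0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

# Hypothetical toy data: six samples, three per class.
labels = np.array([0, 0, 0, 1, 1, 1])
gene_i = discretize([0.1, 0.2, 0.3, 2.1, 2.2, 2.3], bins=2)
gene_j = discretize([5.0, 1.0, 5.0, 1.0, 5.0, 1.0], bins=2)

similarity = mutual_information(gene_i, gene_j)    # unsupervised S(Ai, Aj)
relevance_i = mutual_information(gene_i, labels)   # relevance of Ai to the class labels
```

In this toy case `gene_i` separates the two classes exactly, so its relevance equals the one bit of entropy in the labels, while `similarity` is computed without looking at the labels at all.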
Hence, a new quantitative measure, namely the supervised similarity measure, is defined here based on mutual information for
measuring the similarity between two random variables. While measuring the similarity between attributes, it incorporates
the information of sample categories or class labels.
These measures are used to select the representative attribute and form the coarse cluster. After generating the initial coarse
cluster, the cluster representative is merged and averaged with one single attribute such that the augmented cluster
representative increases the relevance value. The merging process is repeated until the relevance value can no longer be
improved. Instead of averaging all attributes, the augmented attribute is computed by considering only the subset of attributes
that increase the relevance value of the cluster representative. This set of attributes represents the finer cluster of the
attribute A. While the generation of the coarse cluster reduces the redundancy among the attributes of the set, that of the
finer cluster increases the relevance with respect to the class labels. After the generation of the augmented cluster
representative from the finer cluster, the process is repeated to find more clusters and augmented cluster representatives,
discarding the attributes already used from the whole set.
One of the most important properties of the proposed clustering approach is that the cluster is augmented by the attributes
that satisfy the following two conditions:
1. They suit best into the current cluster in terms of the supervised similarity measure defined above.
2. They improve the differential expression of the current cluster most, based on the relevance of the cluster representative
or prototype.
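The cluster growth described above can be sketched as a greedy loop: start from the most relevant attribute of a coarse cluster, then repeatedly average in the one attribute that raises the relevance of the representative most, and stop when no attribute improves it. This is only an illustrative sketch under stated assumptions, not the authors' exact algorithm: relevance is taken as mutual information with the class labels after two-bin discretization, the coarse cluster is passed in as a ready-made list of row indices, and the names `relevance` and `grow_cluster` are hypothetical.

```python
import numpy as np

def mutual_information(a, b):
    """I(A; B) in bits for two equal-length discrete sequences."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

def relevance(values, labels, bins=2):
    """Assumed relevance: MI between a binned expression profile and the labels."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), bins + 1)[1:-1]
    return mutual_information(np.digitize(values, edges), labels)

def grow_cluster(T, labels, coarse):
    """Greedily augment a cluster representative from a coarse cluster.

    T is a genes-by-samples array; coarse is a list of row indices.
    Returns (member indices, averaged representative, its relevance).
    """
    start = max(coarse, key=lambda i: relevance(T[i], labels))
    members = [start]
    rep = T[start].astype(float)
    best = relevance(rep, labels)
    remaining = [i for i in coarse if i != start]
    while remaining:
        k = len(members)
        # Relevance of the representative after averaging in each candidate.
        scored = [(relevance((rep * k + T[i]) / (k + 1), labels), i)
                  for i in remaining]
        score, pick = max(scored)
        if score <= best:   # no candidate improves the relevance: stop growing
            break
        rep = (rep * k + T[pick]) / (k + 1)
        members.append(pick)
        remaining.remove(pick)
        best = score
    return members, rep, best

# Hypothetical toy data: three genes, six samples (three per class).
T = np.array([
    [0.0, 0.0, 4.0, 4.0, 4.0, 4.0],
    [4.0, 4.0, 0.0, 4.0, 4.0, 4.0],
    [4.0, 0.0, 4.0, 0.0, 4.0, 0.0],
])
labels = np.array([0, 0, 0, 1, 1, 1])
members, rep, best = grow_cluster(T, labels, coarse=[0, 1, 2])
```

In this toy example neither of the first two genes separates the classes alone, but their average does, so both join the finer cluster while the third gene is left out.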
Fuzzy logic is a form of many-valued logic; it deals with reasoning that is approximate rather than fixed and exact.
In contrast with traditional logic, where binary sets have two-valued logic (true or false), fuzzy logic variables may have
a truth value that ranges from 0 to 1. Fuzzy logic has been extended to handle the concept of partial truth, where the truth
value may range between completely true and completely false. Furthermore, when linguistic variables are used, these
degrees may be managed by specific functions.
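As a small illustration of partial truth, the sketch below uses a triangular membership function to map a value to a degree between 0 and 1, and the common min, max, and complement operators to combine degrees. The triangular shape, the linguistic terms "medium" and "high", and their breakpoints are assumptions chosen for the example; the source does not state which membership functions the classifier uses.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside a and c, rising to 1 at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_and(u, v):
    """Conjunction of two truth degrees (minimum operator)."""
    return min(u, v)

def fuzzy_or(u, v):
    """Disjunction of two truth degrees (maximum operator)."""
    return max(u, v)

def fuzzy_not(u):
    """Complement of a truth degree."""
    return 1.0 - u

# Hypothetical linguistic terms for an expression level normalized to [0, 1].
x = 0.75
medium = triangular(x, 0.0, 0.5, 1.0)   # degree to which x is "medium"
high = triangular(x, 0.5, 1.0, 1.5)     # degree to which x is "high"
either = fuzzy_or(medium, high)
```

Here the value 0.75 is partly "medium" and partly "high" at the same time, which a two-valued logic cannot express.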
6. CONCLUSION
The three main contributions of this paper are:
1. Based on mutual information, a new quantitative measure is defined to calculate the similarity between two
genes or attributes, which incorporates the information of sample categories or class labels.
2. A new supervised attribute clustering algorithm is developed to find coregulated clusters of genes whose collective
expression is strongly associated with the sample categories.
3. The performance of the proposed method is compared with that of some existing methods using the class separability index
and the predictive accuracy of the support vector machine and naive Bayes classifiers.
The proposed method produces significantly better results than the existing methods.
AUTHOR
Gowthami S received the B.E. and M.E. degrees in Computer Science and Engineering from Info Institute of Technology in
2011 and from Shivani College of Engineering in 2013, respectively. She is currently working as an Assistant Professor at
Saranathan Engineering College, Trichy, Tamil Nadu.