IPASJ International Journal of Computer Science (IIJCS)

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm


Email: editoriijcs@ipasj.org
ISSN 2321-5992

A Publisher for Research Motivation ........

Volume 3, Issue 5, May 2015

An Encapsulated Approach for Microarray Sample Classification using Supervised Attribute Clustering and
Fuzzy Classification Algorithm
Gowthami S
Assistant Professor, Department of Computer Science and Engineering,
Saranathan College of Engineering Panjappur, Tiruchirappalli, Tamil Nadu 620012

ABSTRACT
Microarray technology is an important biotechnology tool used to record the expression levels of thousands of genes
simultaneously across a number of different samples. Among the large number of genes present in microarray gene
expression data, only a small fraction is effective for performing a given diagnostic test. To find this effective group
of genes, a supervised attribute clustering (SAC) algorithm is introduced in this paper. Mutual information has been shown
to be successful for selecting a set of relevant and non-redundant genes from microarray data. To reduce the redundancy
among the attributes, a new quantitative measure based on mutual information is used, which incorporates the information
of sample categories. SAC groups co-regulated genes that have a strong association with the class labels. It forms
clusters based on this similarity measure, which is more effective than the measures used by existing algorithms. The
growth of a cluster is repeated until the cluster stabilizes. Finally, the selected gene set is classified using a fuzzy
classification algorithm. The results are more accurate than those of conventional classification algorithms.

Keywords: Microarray, mutual information, supervised attribute clustering, fuzzy classification, gene selection.

1. INTRODUCTION
An important application of gene expression data in functional genomics is to classify samples according to their gene
expression profiles, for example to distinguish cancer from normal samples or to classify different types or subtypes of
cancer. A microarray gene expression dataset can be represented by an expression table,
T = {w_ij | i = 1, ..., m, j = 1, ..., n}, where w_ij is the measured expression level of gene G_i in the jth sample, and
m and n represent the total number of genes and samples, respectively. Each row of the table corresponds to a particular
gene, each column corresponds to a sample, and each entry is the measured expression level of a particular gene in a
particular sample. However, for most gene expression data, the number of training samples is still very small compared to
the large number of genes involved in the experiment. When the number of genes is greater than the number of samples, it
is possible to relate gene behavior to the sample categories or response variables. Among the large number of genes
present in microarray gene expression data, only a small fraction is effective for performing a given diagnostic test. In
this regard, mutual information has been shown to be successful for selecting a set of relevant and non-redundant genes
from microarray data.
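
The expression table described above can be illustrated with a minimal Python sketch. The values, the class labels, and the
helper names gene_profile and sample_profile are made up for this example and do not come from any dataset used in the
paper.

```python
# Toy expression table T: rows are genes, columns are samples (values are made up).
T = [
    [2.1, 0.3, 1.8, 0.2],   # gene G_1
    [0.5, 0.4, 0.6, 0.5],   # gene G_2
    [1.0, 3.2, 0.9, 3.0],   # gene G_3
]
labels = ["tumor", "normal", "tumor", "normal"]  # one class label per sample

m, n = len(T), len(T[0])  # m genes, n samples


def gene_profile(table, i):
    """Expression of gene i (0-based) across all samples: row i of T."""
    return table[i]


def sample_profile(table, j):
    """Expression of all genes in sample j (0-based): column j of T."""
    return [row[j] for row in table]
```

In real microarray data m is in the thousands while n is in the tens, which is the imbalance the rest of the paper
addresses.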

Figure 1

Clustering is an important topic in data mining research. Clustering algorithms are used to group tuples, each of which is
characterized by a set of attributes, into clusters based on a similarity measure. This paper describes a methodology to
group attributes that are interdependent or correlated with each other. When applied to gene expression data,
conventional clustering methods such as Bayesian clustering, hierarchical clustering, the k-means algorithm, the
self-organizing map, and principal component analysis group a subset of genes that are interdependent or correlated with
each other. In this sense, attributes in a cluster are more correlated with each other, whereas attributes in different
clusters are less correlated. Attribute clustering reduces the search dimension of a data mining algorithm, so that the
search for interesting relationships or the construction of models takes place in a tightly correlated subset of
attributes rather than in the entire attribute space. After clustering the attributes, one can select a smaller number for
further analysis. However, all these algorithms group genes according to unsupervised similarity measures computed from
the gene expressions, without using any information about the sample categories or response variables. Supervised
attribute clustering, in contrast, groups genes or attributes under the control of the information of sample categories
or response variables.
In this regard, a new supervised attribute clustering algorithm is proposed to find co-regulated clusters of genes whose
collective expression is strongly associated with the sample categories or class labels. A new measure, based on mutual
information, is introduced to compute the similarity between attributes. The proposed measure incorporates the
information of sample categories while measuring the similarity between attributes. In effect, it helps to identify
functional groups of genes that are of special interest in sample classification. The supervised attribute clustering
method uses this measure to reduce the redundancy among genes. The proposed algorithm yields biologically significant gene
clusters, whose average expression levels allow perfect discrimination of sample categories. The proposed algorithm also
avoids the noise sensitivity problem of existing supervised gene clustering algorithms.

2. RELATED WORK
Classification and clustering are two major tasks in the analysis of gene expression data. Classification is mainly
concerned with assigning memberships to samples based on expression patterns, whereas clustering aims to find new
biological classes and refine existing ones. Methods that cluster and/or recognize patterns in gene expression datasets
mainly encounter the dimensionality problem. Typically, datasets consist of a large number of genes (attributes) but a
small number of samples (tuples). Many data mining algorithms, such as association rule mining, pattern discovery,
linguistic summaries, and context-sensitive fuzzy clustering, are developed and/or optimized to be scalable with respect
to the number of tuples, so they do not handle a large number of attributes well. Existing clustering algorithms have been
applied to genes in various ways; a well-known example is the k-means attribute clustering algorithm. With attribute
clustering, the search dimension is reduced and meaningful clusters of genes are discovered. Significant genes selected
from each group then contain useful information for gene expression classification and identification.
To solve this problem, methods that handle both gene-class relevance and gene-gene redundancy have been proposed. These
methods typically use some metric to measure the gene-class relevance (e.g., mutual information, the F-test value,
information gain, or symmetrical uncertainty) and employ the same or a different metric to measure the gene-gene
redundancy.
To find a subset of relevant genes without redundant genes, they usually use a methodology called redundant cover to
eliminate genes that are redundant with respect to a subset of genes selected according to the metrics for measuring
gene-class relevance and gene-gene redundancy.

3. SYSTEM ARCHITECTURE
In this section, a new supervised attribute clustering algorithm is presented for grouping co-regulated genes with strong
association to the class labels. It is based on the supervised similarity measure described next.
3.1 Supervised Similarity Measure
In real data analysis, one of the important issues is computing both the relevance and the redundancy of attributes by
discovering dependencies among them. Mutual information can be used to calculate both relevance and redundancy among
attributes.
The redundancy or similarity between two attributes A_i and A_j, in terms of mutual information, is calculated as follows:
S(A_i, A_j) = I(A_i, A_j)
However, the term S(A_i, A_j) does not incorporate the information of sample categories or class labels D while measuring
the similarity, and it is therefore considered an unsupervised similarity measure.

Hence, a new quantitative measure, namely the supervised similarity measure, is defined here based on mutual information
for measuring the similarity between two random variables. It incorporates the information of sample categories or class
labels while measuring the similarity between attributes.
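
As an illustration of the quantities involved, the following minimal Python sketch computes entropy, conditional entropy,
and mutual information for discretized attributes. The function names and the toy vectors are assumptions made for this
example. It covers only the unsupervised measure I(A_i, A_j) and the relevance of an attribute with respect to the class
labels (written d in the code); it does not implement the proposed supervised similarity measure, whose formula is not
given in this paper.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)


def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y); symmetric in X and Y."""
    return entropy(xs) - conditional_entropy(xs, ys)


# Toy discretized attributes over four samples (values are made up).
a_i = [0, 0, 1, 1]
a_j = [0, 0, 1, 1]   # identical to a_i, so fully redundant with it
a_k = [0, 1, 0, 1]   # carries no information about a_i
d = [0, 0, 1, 1]     # class labels

redundancy_ij = mutual_information(a_i, a_j)   # unsupervised similarity of a_i and a_j
redundancy_ik = mutual_information(a_i, a_k)
relevance_i = mutual_information(a_i, d)       # relevance of a_i to the class labels
```

Continuous expression values would first have to be discretized before these functions apply; the discretization scheme
is a separate design choice that this sketch leaves open.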

Figure 2 System Architecture (flowchart blocks: Select All Attributes; Load Gene Data; Entropy; Conditional Entropy;
Coarse Cluster Formation: representative selection, form cluster; Finer Cluster Formation: merge and average
representative, form cluster; Collect Effective Attributes; Perform Classification and display result)
3.2 Supervised Attribute Clustering Algorithm
The proposed supervised attribute clustering (SAC) algorithm relies mainly on two factors, namely, determining the
relevance of each attribute and growing the cluster around each relevant attribute incrementally by adding one attribute
after another.

Figure 3 Representation of a supervised attribute cluster

The relevance values are used to select the representative attribute and form the coarse cluster. After generating the
initial coarse cluster, the cluster representative is merged and averaged with one single attribute such that the
augmented cluster representative increases the relevance value. The merging process is repeated until the relevance value
can no longer be improved. Instead of averaging all attributes, the augmented attribute is computed by considering only
the subset of attributes that increase the relevance value of the cluster representative. This set of attributes
represents the finer cluster of the attribute A. While the generation of the coarse cluster reduces the redundancy among
attributes, that of the finer cluster increases the relevance with respect to the class labels. After the generation of
the augmented cluster representative from the finer cluster, the process is repeated to find more clusters and augmented
cluster representatives by discarding this set of attributes from the whole set.
One of the most important properties of the proposed clustering approach is that the cluster is augmented by the
attributes that satisfy the following two conditions:
1. They suit best into the current cluster in terms of the supervised similarity measure defined above.
2. They improve the differential expression of the current cluster most, based on the relevance of the cluster
representative or prototype.
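
The coarse and finer cluster formation described above can be sketched in Python as follows. This is a simplified
illustration under stated assumptions, not the exact SAC procedure: the similarity function is passed in as a parameter,
and the stand-in used here (mi_similarity, plain mutual information) is not the supervised similarity measure of Section
3.1. The mean-threshold discretization in binarize, the threshold delta, the choice to discard the seed together with its
coarse cluster, and the toy data are likewise hypothetical.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def binarize(profile):
    """Assumed discretization: 1 if the value exceeds the profile mean, else 0."""
    mean = sum(profile) / len(profile)
    return [1 if v > mean else 0 for v in profile]


def relevance(profile, labels):
    """Relevance of a profile: mutual information with the class labels."""
    return mutual_information(binarize(profile), labels)


def mi_similarity(p, q):
    """Stand-in similarity (plain mutual information), not the paper's measure."""
    return mutual_information(binarize(p), binarize(q))


def average(profiles):
    """Element-wise mean of several expression profiles."""
    return [sum(col) / len(col) for col in zip(*profiles)]


def sac_sketch(genes, labels, similarity, delta, n_clusters):
    """Return up to n_clusters (member_names, representative_profile) pairs."""
    remaining = dict(genes)
    clusters = []
    while remaining and len(clusters) < n_clusters:
        # Seed: the most relevant attribute still available.
        seed = max(remaining, key=lambda g: relevance(remaining[g], labels))
        # Coarse cluster: attributes whose similarity to the seed is at least delta.
        coarse = [g for g in remaining if g != seed
                  and similarity(remaining[seed], remaining[g]) >= delta]
        # Finer cluster: merge and average one attribute at a time for as
        # long as the relevance of the representative keeps improving.
        members, rep = [seed], remaining[seed]
        best = relevance(rep, labels)
        candidates = list(coarse)
        while candidates:
            trials = [(relevance(average([remaining[m] for m in members + [g]]), labels), g)
                      for g in candidates]
            score, g = max(trials)
            if score <= best:
                break
            members.append(g)
            candidates.remove(g)
            best = score
            rep = average([remaining[m] for m in members])
        clusters.append((members, rep))
        # Discard the seed and its coarse cluster before searching for the next cluster.
        for g in [seed] + coarse:
            del remaining[g]
    return clusters


# Toy data (made up): gA and gB are noisy class markers, gN is unrelated noise.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genes = {
    "gA": [1, 1, 1, 5, 5, 5, 5, 5],
    "gB": [1, 1, 5, 1, 5, 5, 5, 5],
    "gN": [1, 5, 1, 5, 1, 5, 1, 5],
}
clusters = sac_sketch(genes, labels, mi_similarity, delta=0.1, n_clusters=2)
```

In this toy run, gA and gB each misplace one sample on their own, while their average separates the two classes exactly,
which is the effect the merge-and-average step is meant to capture. Swapping in a supervised similarity function only
requires passing a different callable as the similarity argument.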

4. CLASS PREDICTION METHODS


The proposed supervised attribute clustering (SAC) algorithm is extensively compared with some existing supervised
and unsupervised gene clustering and gene selection algorithms, namely, ACA (attribute clustering algorithm), MBBC
(model-based Bayesian clustering), SGCA (supervised gene clustering algorithm), GS (gene shaving), mRMR (minimum
redundancy-maximum relevance framework), and the method proposed.
4.1 Gene Expression Data Sets
In this paper, a publicly available breast cancer dataset is used. The dataset contains expression levels of 7129 genes
in 49 breast tumor samples. The samples are classified according to their estrogen receptor (ER) status: 25 samples are
ER positive while the other 24 samples are ER negative.
4.2 Classifier methods
NB Classifier: The naive Bayes (NB) classifier is one of the oldest classifiers. It is based on Bayes' rule and assumes
that the features (variables) are independent of each other given the class. That is, the presence (or absence) of a
particular feature of a class is assumed to be unrelated to the presence (or absence) of any other feature, given the
class variable.
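
The independence assumption can be made concrete with a small sketch. The version below assumes Gaussian class-conditional
densities for each feature, which is one common choice for continuous expression values; the paper does not state which
likelihood model was used. The function names, the smoothing constant eps, and the toy samples are hypothetical.

```python
from collections import defaultdict
from math import log, pi


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate, per class, the prior and the per-feature mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # eps keeps the variance strictly positive for constant features.
        variances = [sum((v - mu) ** 2 for v in col) / len(col) + eps
                     for col, mu in zip(columns, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class maximizing log prior + sum of per-feature log likelihoods."""
    def log_posterior(label):
        prior, means, variances = model[label]
        # Summing per-feature terms is exactly the conditional independence assumption.
        log_likelihood = sum(-0.5 * log(2 * pi * var) - (v - mu) ** 2 / (2 * var)
                             for v, mu, var in zip(x, means, variances))
        return log(prior) + log_likelihood
    return max(model, key=log_posterior)


# Toy two-feature samples (made up), labeled by ER status.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ["ER+", "ER+", "ER+", "ER-", "ER-", "ER-"]
model = fit_gaussian_nb(X, y)
```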
SVM: In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning
models with associated learning algorithms that analyze data and recognize patterns, and are used for classification and
regression analysis. Given a set of input data, an SVM predicts, for each input, which of two possible classes forms the
output.
The SVM is a margin classifier that draws an optimal hyperplane in the feature vector space, defining a boundary that
maximizes the margin between data samples of different classes, which leads to good generalization properties. A key
factor in the SVM is the use of kernels to construct nonlinear decision boundaries.
K-nearest neighbor rule: The K-nearest neighbor (K-NN) rule is used for evaluating the effectiveness of the reduced
feature set for classification. It classifies samples based on the closest training samples in the feature space: a sample
is classified by a majority vote of its K nearest neighbors and is assigned to the class most common among them (K is a
positive integer, typically small). The value of K chosen for the K-NN rule is the square root of the number of samples in
the training set. If K = 1, the sample is simply assigned to the class of its nearest neighbor. K-NN is an instance-based,
or lazy, learner, in which the function is only approximated locally and all computation is deferred until
classification. It is among the simplest of all machine learning algorithms. In the classification phase, K is a
user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label that is most
frequent among the K training samples nearest to that query point.
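
A minimal sketch of the K-NN rule is given below. Euclidean distance and the rounding of the square root to the nearest
integer are assumptions made for this example; the function name knn_predict and the toy points are hypothetical.

```python
from collections import Counter
from math import dist, sqrt


def knn_predict(train_X, train_y, query, k=None):
    """Classify `query` by majority vote among its k nearest training samples."""
    if k is None:
        # Rule from the text: k is the square root of the training set size.
        k = max(1, round(sqrt(len(train_X))))
    # Rank training samples by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set (made up): two well separated groups.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["ER-", "ER-", "ER-", "ER+", "ER+", "ER+"]
```

Because all work happens inside knn_predict at query time, there is no separate training step, which is what makes the
method a lazy learner.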

5. IMPLEMENTATION OF PROPOSED METHODOLOGY


In our implementation, a fuzzy classification algorithm is used to classify the given dataset. With this method, accuracy
is improved compared to existing algorithms.
5.1 Fuzzy Classification Algorithm
In this paper, fuzzy if-then rules are generated from the m selected attributes. The initial step sets K = 1 and P = 1,
and each gene y is then classified using these rules. If the gene is misclassified, the grades of certainty are adjusted.
The result of the fuzzy classifier is more accurate than that of existing algorithms for biological data analysis.
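
Since the rule format and the adjustment step are not spelled out here, the following Python sketch shows one possible
reading: each rule has fuzzy antecedents, a consequent class, and a grade of certainty, the rule with the largest product
of compatibility and certainty decides the class, and the winning rule's certainty is raised after a correct
classification and lowered after a misclassification. The membership functions low and high, the learning rate eta, the
two rules, and the toy samples are all assumptions for illustration. The roles of K and P are not specified in the text,
so they are not modeled.

```python
def low(x):
    """Membership grade of 'low' for a value scaled to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - x))


def high(x):
    """Membership grade of 'high' for a value scaled to [0, 1]."""
    return max(0.0, min(1.0, x))


# Hypothetical rule base over two selected attributes:
# IF x1 is low AND x2 is low THEN ER- with certainty cf, and so on.
rules = [
    {"if": (low, low), "then": "ER-", "cf": 0.5},
    {"if": (high, high), "then": "ER+", "cf": 0.5},
]


def compatibility(rule, x):
    """Product of the antecedent membership grades for sample x."""
    grade = 1.0
    for membership, value in zip(rule["if"], x):
        grade *= membership(value)
    return grade


def classify(rules, x):
    """Single-winner inference: the rule maximizing compatibility * certainty decides."""
    winner = max(rules, key=lambda r: compatibility(r, x) * r["cf"])
    return winner["then"], winner


def train(rules, samples, labels, eta=0.1, epochs=5):
    """Adjust certainty grades: reward the winner if correct, punish it otherwise."""
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            predicted, winner = classify(rules, x)
            if predicted == y:
                winner["cf"] += eta * (1.0 - winner["cf"])
            else:
                winner["cf"] -= eta * winner["cf"]


# Toy samples (made up), already scaled to [0, 1].
samples = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
labels = ["ER-", "ER+", "ER-", "ER+"]
```

Calling train(rules, samples, labels) updates the certainty grades in place, after which classify(rules, x) returns the
predicted class together with the rule that fired.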

Fuzzy logic is a form of many-valued logic or probabilistic logic; it deals with reasoning that is approximate rather than
fixed and exact. In contrast with traditional logic, where binary sets have two-valued logic (true or false), fuzzy logic
variables may have a truth value that ranges from 0 to 1. Fuzzy logic has been extended to handle the concept of partial
truth, where the truth value may range between completely true and completely false. Furthermore, when linguistic
variables are used, these degrees may be managed by specific functions.
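
As a small illustration of partial truth, the sketch below defines a triangular membership function for a linguistic
variable and the standard min, max, and complement operators on truth values in [0, 1]. The triangular shape, the
parameter names a, b, c, and the function names are choices made for this example; other membership functions and
operators are equally possible.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside a and c, 1 at the peak b (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_and(p, q):
    """Conjunction of two truth degrees as their minimum."""
    return min(p, q)


def fuzzy_or(p, q):
    """Disjunction of two truth degrees as their maximum."""
    return max(p, q)


def fuzzy_not(p):
    """Negation of a truth degree as its complement."""
    return 1.0 - p
```

For example, with a "medium" expression level modeled as triangular(x, 0, 5, 10), a value of 2.5 is medium to degree 0.5
rather than being simply true or false.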

6. CONCLUSION
The three main contributions of this paper are:
1. Based on mutual information, a new quantitative measure is defined to calculate the similarity between two genes or
attributes, which incorporates the information of sample categories or class labels.
2. A new supervised attribute clustering algorithm is developed to find co-regulated clusters of genes whose collective
expression is strongly associated with the sample categories.
3. The performance of the proposed method is compared with that of some existing methods using the class separability
index and the predictive accuracy of the support vector machine and naive Bayes classifiers.
The proposed method produces significantly better results when compared with existing methods.

AUTHOR
Gowthami S received the B.E. and M.E. degrees in Computer Science and Engineering from Info Institute of Technology in
2011 and from Shivani College of Engineering in 2013, respectively. She is currently working as an Assistant Professor at
Saranathan College of Engineering, Trichy, Tamil Nadu.
