
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856

A COMPARATIVE STUDY AND ANALYSIS FOR MICROARRAY GENE EXPRESSION DATA USING CLUSTERING TECHNIQUES
G. BASKAR¹, Dr. P. PONMUTHURAMALINGAM²

¹PhD Research Scholar, ²Associate Professor & Head

¹,²Department of Computer Science, Government Arts College (Autonomous), Coimbatore, Tamil Nadu, INDIA

Abstract: Clustering is an important problem in data mining research. The k-means algorithm, introduced by MacQueen in 1967, is one of the most basic and simplest partitioning clustering techniques. Microarray technology is widely used to identify candidate genes in various cancer studies and is a powerful tool for biological exploration. In this paper we apply the K-Means and Fuzzy C-Means clustering algorithms to the colon and leukemia cancer datasets. The experimental results suggest that the Fuzzy C-Means algorithm gives better results than K-Means.

2. K-MEANS CLUSTERING ALGORITHM


K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms and is used to solve many well-known clustering problems. K-means clustering is a method of grouping items into k groups, where k is the number of clusters. The grouping is done by minimizing the sum of squared Euclidean distances between the items and the corresponding centroid. In this work the algorithm is implemented in MATLAB, whose toolboxes provide functions for performing and interpreting k-means clustering analyses.

Algorithm steps

Given a set of unlabelled data:
1. Decide the number of clusters and the initial positions of their means.
2. Assign each point to the cluster represented by the mean it is nearest to.
3. Move each mean to the actual mean of the data points in its cluster.
4. Stop if the means do not move; otherwise go back to step 2.
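The four steps above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the MATLAB implementation used in the paper: the function name `kmeans`, its parameters (`init`, `max_iter`, `seed`) and the toy data are assumptions made for the example, and choosing random data points as initial means is one possible choice that the text does not specify.

```python
import numpy as np


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array; returns (labels, means)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose the initial means. If none are given, pick k distinct
    # data points at random (an assumption, not taken from the paper).
    if init is None:
        init = points[rng.choice(len(points), size=k, replace=False)]
    means = np.array(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest mean (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each mean to the mean of the points assigned to it;
        # a cluster with no points keeps its previous mean.
        new_means = means.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_means[j] = members.mean(axis=0)
        # Step 4: stop once the means no longer move, else repeat from step 2.
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means


# Toy data: two well-separated groups of 2-D points (not microarray data).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, means = kmeans(toy, k=2, init=toy[[0, 3]])
```

For gene expression data, each row of `points` would be one sample and each column one gene, so the colon dataset would form a 62 x 2000 array.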

Keywords: Clustering, Classification, K-Means, Fuzzy C-Means.

1. INTRODUCTION
Identifying biomarkers for cancer diagnosis is an important problem in cancer genomics, and gene expression microarrays are used to identify candidate genes in various cancer studies. Gene expression profiling, or microarray analysis, has enabled the measurement of thousands of genes in a single RNA sample. A variety of microarray platforms have been developed to accomplish this, and the basic idea behind each is simple: a glass slide or membrane is spotted or arrayed, and the array is used to find whether a gene is present or absent. Feature selection approaches have been applied to the identification of differentially expressed genes in microarray data. Microarray data are often classified by cell line or tumour type, and it is of interest to discover a set of genes that can be used as class predictors; the leukemia dataset of Golub et al. is one example of such data. Both supervised and unsupervised classifiers have been used to build classification models from microarray data. Cluster analysis is the grouping of objects into clusters such that objects in one cluster are very similar and objects in different clusters are quite distinct. Data mining helps to convert such data into useful information. On the chip, red indicates up-regulation and green indicates down-regulation [3].

Figure 2: K-Means algorithm process

Figure 1: Preparation of a microarray

In this paper we analyse clustering algorithms on cancer datasets and compare the results with respect to accuracy.

4. FUZZY C-MEANS ALGORITHM

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. The method was developed by Dunn in 1973 and improved by Bezdek in 1981. The algorithm works by assigning each data point a membership in each cluster, based on the distance between the cluster centre and the data point. The nearer a data point is to a cluster centre, the greater its membership towards that cluster centre. Clearly, the memberships of each data point must sum to one. After each iteration, the memberships and cluster centres are updated.

Steps: Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, v3, ..., vc} be the set of centres.

1) Randomly select c cluster centres.
2) Calculate the fuzzy membership $\mu_{ij}$ using:

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}}$$

where $d_{ij}$ is the distance between the i-th data point and the j-th cluster centre, and $m > 1$ is the fuzziness index.

3) Compute the fuzzy centres $v_j$ using:

The clustering algorithms K-Means and Fuzzy C-Means are tested on the colon and leukemia datasets.

7. CONCLUSION AND FUTURE WORK


The clustering algorithms were analysed using microarray cancer data. The overall accuracy of K-Means and Fuzzy C-Means is better on the leukemia dataset than on the colon dataset, and the execution time is also good. In future work, the performance may be improved by using other variants of clustering algorithms.

REFERENCES
[1] John, G.H., Kohavi, R., Pfleger, K.: Irrelevant Features and the Subset Selection Problem. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 121-129 (1994)
[2] Liu, H., Yu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491-502 (2005)
[3] Ding, C., Peng, H.: Minimum Redundancy Feature Selection from Microarray Gene Expression Data. In: Proceedings of the Computational Systems Bioinformatics Conference (CSB 2003), pp. 523-529 (2003)
[4] Yu, L., Liu, H.: Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research 5, 1205-1224 (2004)
[5] Pepe, M.S., Etzioni, R., Feng, Z., et al.: Phases of Biomarker Development for Early Detection of Cancer. J. Natl. Cancer Inst. 93, 1054-1060 (2001)
[6] T. Li, C. Zhang, and M. Ogihara, A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics, vol. 20, pp. 2429-2437, 2004.
[7] H. Liu and L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 17, no. 4, pp. 491-502, 2005.
[8] M. Wasikowski and X. Chen, Combating the small sample class imbalance problem using feature selection, IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1388-1400, 2010.
[9] C. A. Davis, F. Gerick, V. Hintermair, et al., Reliable gene signatures for microarray classification: assessment of stability and performance, Bioinformatics, vol. 22, pp. 2356-2363, 2006.
[10] C. A. Davis, F. Gerick, V. Hintermair, et al., Reliable gene signatures for microarray classification: assessment of stability and performance, Bioinformatics, vol. 22, pp. 2356-2363, 2006.
[11] J.A. Lozano, J.M. Pena, P. Larranaga, An empirical comparison of four initialization methods for the k-means algorithm, Lett. 20 (1999) 1027-1040.
[12] Nielsen T.O., West R.B., Linn S.C., et al. Molecular characterization of soft tissue tumours: a gene expression study. Lancet, 2002.
[13] G. Baskar, D. Napoleon, Message Passing between Data Point on Clustering Algorithm for Gene Leukemia Dataset, IJARCS, volume 1, number 4, Nov-Dec 2010.

$$v_j = \frac{\sum_{i=1}^{n} (\mu_{ij})^{m} \, x_i}{\sum_{i=1}^{n} (\mu_{ij})^{m}}, \quad j = 1, 2, \ldots, c$$

4) Repeat steps 2) and 3) until the minimum value of $J$ is achieved or $\| U^{(k+1)} - U^{(k)} \| < \beta$,

where $k$ is the iteration step, $\beta$ is the termination criterion between [0, 1], $U = (\mu_{ij})_{n \times c}$ is the fuzzy membership matrix, and $J$ is the objective function.
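The FCM iteration described in steps 1) to 4) can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fuzzy_c_means`, the parameters `init`, `tol` (the termination criterion) and `max_iter`, the default fuzziness index m = 2, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, init=None, tol=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means on an (n, d) array; returns (U, V).

    U is the n x c membership matrix and V holds the c cluster centres.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: select c initial centres (random data points unless given).
    if init is None:
        init = X[rng.choice(n, size=c, replace=False)]
    V = np.array(init, dtype=float)
    U = np.zeros((n, c))
    for _ in range(max_iter):
        # d[i, j] is the distance from point i to centre j; it is clipped
        # so that a point lying exactly on a centre does not divide by zero.
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Step 2: membership = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        ratio = d[:, :, None] / d[:, None, :]  # ratio[i, j, k] = d_ij / d_ik
        U_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        # Step 3: each centre is the mean of the points weighted by
        # membership raised to the power m.
        W = U_new ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 4: stop when the membership matrix changes by less than tol.
        change = np.linalg.norm(U_new - U)
        U = U_new
        if change < tol:
            break
    return U, V


# Toy data: two well-separated groups of 2-D points (not microarray data).
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
U, V = fuzzy_c_means(toy_X, c=2, init=toy_X[[0, 3]])
hard_labels = U.argmax(axis=1)  # most likely cluster for each point
```

Each row of `U` sums to one, as the text requires. Taking the largest membership per row gives a hard assignment that can be compared directly with a K-Means result.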

5. DATA SETS

Colon tumour is a disease in which cancerous growths (tumours) are found in the tissues of the colon. The colon dataset contains 62 samples: 40 biopsies are from tumours (labelled "negative") and 22 normal biopsies (labelled "positive") are from healthy parts of the colons of the same patients. The total number of genes to be tested is 2000 (Alon et al., 1999).

Leukemias are primary disorders of the bone marrow. They are malignant neoplasms of hematopoietic stem cells. The total number of genes to be tested is 7129 and the number of samples is 72, all of which are from acute leukemia patients (Golub et al., 1999).

6. RESULTS OVER THE DATA SET

Graph 1: Average accuracy of the algorithms
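The paper does not state how accuracy is computed from the cluster assignments. One common way, shown here purely as an assumption, is to compare the clusters with the known class labels under the best one-to-one relabelling of the clusters. The function name `clustering_accuracy` and the example labels below are hypothetical.

```python
from itertools import permutations

import numpy as np


def clustering_accuracy(true_labels, cluster_labels, k):
    """Fraction of samples matched under the best relabelling of k clusters."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    best = 0.0
    # Cluster indices are arbitrary, so try every mapping of cluster index
    # to class index and keep the one that agrees with the most samples.
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in cluster_labels])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best


# Made-up labels for six samples: one sample ends up in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(truth, found, k=2)
```

Trying every permutation is only practical for a small number of clusters, which is the case for the two-class colon and leukemia datasets.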

[14] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene
[15] G. Baskar, P. Ponmuthuramali Analysis of Gene Expression Microarray Dataset for Feature Selection: National Conference on Communication Technologies & its impact on Next Generation Computing CTNGC 2012, Proceedings published by International Journal of Computer Applications (IJCA)
[16] Y. Saeys, I. Inza, and P. Larranaga, A Review of Feature Selection Techniques in Bioinformatics, Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[17] I.H. Witten and E. Frank, Data Mining - Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2005.
[18] Pawan Lingras, Chad West. Interval set Clustering of Web users with Rough k-Means, submitted to the Journal of Intelligent Information System in 2002.
[19] Yeung K.Y., Haynor D.R., Ruzzo W.L. Validating clustering for gene expression data. Bioinformatics. 2001.
[20] Likas, A., Vlassis, M. & Verbeek, J. (2003), The global k-means clustering algorithm, Pattern Recognition, 36, 451-461.

Authors profiles
[1] G. Baskar received his Master's degree in Information Technology from K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India in 2008 and his M.Phil degree in Computer Science from Bharathiar University, Coimbatore, Tamil Nadu, India in 2010. Since 2011 he has been working towards the PhD degree in the Department of Computer Science, Government Arts College, Coimbatore, Tamil Nadu, India. His areas of interest include data mining and bioinformatics.

[2] P. Ponmuthuramalingam received his Master's degree in Computer Science from Alagappa University, Karaikudi in 1988 and his Ph.D. in Computer Science from Bharathiar University, Coimbatore. He is working as Associate Professor and Head in the Department of Computer Science, Government Arts College (Autonomous), Coimbatore. His research interests include Text Mining, Semantic Web, Network Security and Parallel Algorithms.
