
Cluster-Based Mining of Microarray Data in PHP/MYSQL Environment


E. Udoh, S. Bhuiyan
Department of Computer Science
Indiana University-Purdue University
2101 E. Coliseum Blvd, Fort Wayne, IN 46805 USA
Abstract - Extracting biological significance from a large microarray dataset using data mining clustering techniques is an important process in bioinformatics. In this paper, a microarray dataset (matrix 504 x 227) made available by the SAMSI institute was used as the base sample to develop a new demo web-based clustering system that exploits the improved efficiency and functionality of PHP/MYSQL technology. The clustering algorithms and the robustness of PHP/MYSQL produced categorized microarray data that can be associated with diseases, with improved visualizations.

Keywords - Data Mining, Microarray, Clustering Dendrogram, and PHP/MYSQL.

I. INTRODUCTION

Microarray technology generates extremely large gene expression datasets (hundreds of rows and columns) in a single biological experiment (Fig. 1). The gene expression patterns provide unprecedented information on human disease, aging, drugs, mental illness, diet and many other clinical matters, because they correlate strongly with function. In the medical world, microarrays have paved the way for a new era of genetic screening, testing and diagnostics [7]. However, the dataset often contains missing values and exhibits high-dimensional attributes, i.e. a large number of genes with a relatively small number of samples [1]. It is cumbersome to examine these large datasets manually for biological significance, hence the need for automation (e.g. using clustering techniques) to reduce the quantity of data to a manageable level [2, 3].

Clustering techniques can determine intrinsic grouping in a set of microarray data using distance or conceptual measures. To determine membership in a cluster, clustering algorithms evaluate the distance between a point and the cluster centroid. The output is basically a statistical description of the cluster centroid with the number of components in each cluster. However, domain knowledge is useful to formulate an appropriate measure in a clustering algorithm, which may be exclusive, overlapping, hierarchical or probabilistic.

There are several clustering algorithms to process and establish relationships in the large datasets generated by microarray experiments [4]. These algorithms can be used to determine what group a particular genetic sample belongs to and the tendency for certain clusters to be associated with certain characteristics. This helps to eliminate less relevant dimensions or genetic characteristics. In this paper, six algorithms were applied, e.g. un-weighted pair group centroid, weighted pair group centroid or Ward's method, to achieve clustering of the samples [5]. These different algorithms can be applied individually or compared, and a suitable one chosen for the investigation.

The microarray data in Fig. 1, made available by the SAMSI institute, were used for this project. The data consist of a series of genetic profiles for breast, prostate, liver, colon, kidney, lung, ovary and testis cancer (matrix 504 x 227). Some of the cell samples are malignant while others are normal. The data are log-transformed, with row and column means subtracted (positive values show a well-expressed gene, while negative values indicate non-expression of the gene). The data used have been subjected to singular value decomposition by the authors of [2] as an effective way to provide dimensionality reduction and eventual elimination of less relevant data [7, 10].

The main thrust of this project is to cluster microarray samples with the rich visualization features in the PHP/MYSQL development environment [6], as the displays in this project attest. PHP, an open source, server-side scripting language, allows the user to develop a robust and dynamically generated page quickly. PHP is cross-platform and easy to learn, with well balanced memory and server load.
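The centroid-based membership test described in the Introduction can be illustrated with a minimal Python sketch. It is not the authors' PHP implementation: the Euclidean distance, the function names, the sample values and the starting centroids are all assumptions made for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_nearest(sample, centroids):
    """Return the index of the centroid closest to the sample."""
    distances = [euclidean(sample, c) for c in centroids]
    return distances.index(min(distances))

def describe_clusters(samples, centroids):
    """Assign every sample, then summarize each cluster as (centroid, size)."""
    members = [[] for _ in centroids]
    for s in samples:
        members[assign_to_nearest(s, centroids)].append(s)
    # An empty cluster keeps its starting centroid.
    return [(centroid(m) if m else c, len(m)) for m, c in zip(members, centroids)]

# Hypothetical two-gene expression profiles (not taken from the SAMSI data).
samples = [[-0.6, -0.7], [-1.0, -0.8], [0.7, 0.8], [0.8, 0.6]]
start = [[-1.0, -1.0], [1.0, 1.0]]  # assumed starting centroids
summary = describe_clusters(samples, start)
```

Each entry of `summary` pairs a cluster centroid with the number of samples assigned to it, which corresponds to the statistical description of the output mentioned above.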


T. Sobh and K. Elleithy (eds.), Advances in Systems, Computing Sciences and Software Engineering, 197–200.
© 2006 Springer.

The improved visualization features are attractive to any bioinformatics programmer, since the representations are intuitive.

II. ANALYSIS and DESIGN

In the first phase of developing bioinformatics software in the PHP/MYSQL environment, this clustering system was developed in a Linux server environment hosting Tomcat 5.0, PHP 5.0, MySQL 4.1 and Ghostscript 8.15. The base technologies used for the analysis and design are the server-side scripting language (PHP) and the MySQL database for persistent storage. On execution, the PHP code retrieves the microarray data from the MySQL database. It converts the data into a usable format and then passes the output to the clustering software, which in turn sends the result to the dendrogram and Ghostscript programs for visualization. The clustering algorithms programmed include single link, complete link, group average, weighted average, weighted centroid and Ward's method (Fig. 2).

As can be gleaned from Fig. 2, the PHP code calls the mathematical algorithms to perform the clustering and provides an easy-to-use interface to the system. The user can specify a variety of parameters or algorithms using the web-based interface, while the results of the clustering are presented by showing which clusters are included in particular groups (Fig. 2). The program shows the percentages of each cluster with or without the cancer malignancy. For example, if only a few clusters are used, the percentages for cancer within those groups may be too mixed with normal samples (such as 30% cancerous and 70% normal within a cluster). In such a scenario, the number of clusters may be increased to allow a better level of granularity when clustering. Attribute grouping and associative clustering explain similar dependencies and also offer improved classification of such genes [8, 9]. A dendrogram produced for the analyzed sample is shown in Fig. 3.

Cancerous samples are indicated by arrows. These cancerous samples fall in the first group, together with a few normal samples. It is clear in this example that almost all of the cancer samples are present in the first group, with a very high probability that any sample in that group will be cancerous (Fig. 3). A variation of this dendrogram can be obtained if another algorithm is selected (Fig. 4).
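How the choice of algorithm changes the resulting tree can be sketched with a small agglomerative procedure in Python. This is an illustrative sketch only, not the clustering software used in the system: the function names and the toy distances are assumptions, and only four of the listed methods are covered. The centroid and Ward variants are omitted here because their update rules are usually stated on squared Euclidean distances.

```python
def updated_distance(method, d_ka, d_kb, n_a, n_b):
    """Distance from cluster k to the merger of clusters a and b."""
    if method == "single":
        return min(d_ka, d_kb)
    if method == "complete":
        return max(d_ka, d_kb)
    if method == "group_average":      # weighted by cluster sizes
        return (n_a * d_ka + n_b * d_kb) / (n_a + n_b)
    if method == "weighted_average":   # both clusters count equally
        return 0.5 * (d_ka + d_kb)
    raise ValueError("unsupported method: " + method)

def agglomerate(dist, method="single"):
    """Merge the two closest clusters repeatedly.

    dist is a symmetric matrix of pairwise distances. The result lists
    the merges in order as (members_a, members_b, height).
    """
    n = len(dist)
    clusters = {i: (i,) for i in range(n)}  # cluster id -> item indices
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[a], clusters[b], height))
        n_a, n_b = len(clusters[a]), len(clusters[b])
        new = {}
        for k in clusters:
            if k in (a, b):
                continue
            d_ka = d[tuple(sorted((k, a)))]
            d_kb = d[tuple(sorted((k, b)))]
            new[(k, next_id)] = updated_distance(method, d_ka, d_kb, n_a, n_b)
        # Discard distances to the two clusters that were just merged.
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        d.update(new)
        clusters[next_id] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        next_id += 1
    return merges

# Hypothetical one-dimensional samples, used only to build a distance matrix.
points = [0, 1, 5, 6]
dist = [[abs(p - q) for q in points] for p in points]
single = agglomerate(dist, "single")
complete = agglomerate(dist, "complete")
average = agglomerate(dist, "group_average")
```

On this toy input all three methods join the same pairs first, but the height of the final merge differs between them, which is the kind of variation a different algorithm choice produces in the dendrogram.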

Id   Sample_Tissue_Site  Sample_General_Pathologic_Category  34449_at (CASP2)  35150_at (TNFRS)
72   LIVER, NOS          NORMAL                              -0.64565          -0.71759
77   COLON, NOS          NORMAL                              0.38159           -0.32849
79   KIDNEY, NOS         MALIGNANT                           0.83026           -0.2525
83   LIVER, NOS          MALIGNANT                           0.047638          -0.81221
91   KIDNEY, NOS         MALIGNANT                           -0.22587          -0.45274
96   KIDNEY, NOS         NORMAL                              -0.0528           -0.21131
98   LUNG, NOS           MALIGNANT                           -0.02914          -0.2353
100  COLON, NOS          MALIGNANT                           0.65866           0.78431
101  LUNG, NOS           NORMAL                              0.23735           0.003307
109  LIVER, NOS          MALIGNANT                           -0.3751           -0.10789
117  LIVER, NOS          NORMAL                              -1.0638           -0.84062
118  COLON, NOS          NORMAL                              -0.88117          -0.7941
124  LIVER, NOS          NORMAL                              0.073264          -0.29181

Fig. 1: A cross-section of a microarray dataset used for the design of the system
(http://www.samsi.info/200304/dmml/web-internal/bio/data.html)
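The values in Fig. 1 are described as log-transformed with row and column means subtracted. A minimal Python sketch of such a preprocessing step is given here. The logarithm base, the order of the two subtractions, the function names and the raw intensities are assumptions, since the paper does not specify them.

```python
import math

def log_transform(matrix, base=2.0):
    """Take the logarithm of every (strictly positive) entry.

    Base 2 is an assumption; the source does not state the base.
    """
    return [[math.log(v, base) for v in row] for row in matrix]

def subtract_row_and_column_means(matrix):
    """Subtract each row mean, then each column mean of the result."""
    row_centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        row_centered.append([v - mean for v in row])
    n_rows = len(row_centered)
    n_cols = len(row_centered[0])
    col_means = [sum(r[j] for r in row_centered) / n_rows for j in range(n_cols)]
    return [[v - col_means[j] for j, v in enumerate(row)] for row in row_centered]

# Hypothetical raw intensities (2 samples x 3 genes), not SAMSI values.
raw = [[1.0, 2.0, 4.0], [8.0, 8.0, 2.0]]
centered = subtract_row_and_column_means(log_transform(raw))
```

After this step every row and every column of `centered` averages to zero, so positive entries mark expression above the baseline and negative entries mark expression below it.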

Fig. 2: Web interface to the cluster program.

Fig. 3: A dendrogram of microarray data (Arrow indicates cancerous sample).
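The per-cluster percentages of cancerous and normal samples discussed in Section II can be computed along the following lines. This Python sketch is illustrative only: the cluster labels, the sample counts and the 80% purity threshold are hypothetical, and only the category names NORMAL and MALIGNANT come from Fig. 1.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_ids, categories):
    """Percentage of each pathologic category within each cluster."""
    counts = defaultdict(Counter)
    for cid, cat in zip(cluster_ids, categories):
        counts[cid][cat] += 1
    result = {}
    for cid, tally in counts.items():
        total = sum(tally.values())
        result[cid] = {cat: 100.0 * k / total for cat, k in tally.items()}
    return result

def is_mixed(percentages, threshold=80.0):
    """True when no single category reaches the (assumed) purity threshold."""
    return max(percentages.values()) < threshold

# Hypothetical assignment: cluster 1 holds 3 malignant and 7 normal samples,
# cluster 2 holds 4 malignant samples.
cluster_ids = [1] * 10 + [2] * 4
categories = ["MALIGNANT"] * 3 + ["NORMAL"] * 7 + ["MALIGNANT"] * 4
composition = cluster_composition(cluster_ids, categories)
```

A cluster flagged by `is_mixed`, such as the 30% cancerous and 70% normal case, is the situation in which the text suggests increasing the number of clusters.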



III. CONCLUSION

The process of finding relevant genetic markers in a large microarray dataset is a difficult task. Reducing the amount of data with a data mining technique such as clustering makes the task much easier. This paper exploited the improved functionality of the PHP/MYSQL programming environment to design a new demo cluster program. It will be refined in the future to serve the bioinformatics community better. PHP/MYSQL programming is certainly attractive to the bioinformatics programming community.

REFERENCES

[1] A. D. Baxevanis & F. Ouellette, Bioinformatics: A practical guide to the analysis of genes and proteins (New York, NY: Wiley-Interscience, 2001).
[2] L. Liu, D. Hawkins, S. Ghosh & S. Young, Robust singular value decomposition analysis of microarray data, Proceedings of the National Academy of Sciences (PNAS), USA, 100(23), 2003, 13167-13172.
[3] G. Piatetsky-Shapiro & P. Tamayo, Microarray data mining: Facing the challenges, ACM SIGKDD Explorations, 5(2), 2003, 1-5.
[4] P. Glenisson, J. Mathys & B. De Moor, Meta-clustering of gene expression data and literature-based information, SIGKDD Explorations, 5(2), 2003, 101-112.
[5] M. Schena, Microarray analysis (New Jersey: Wiley, 2003).
[6] J. Meloni, PHP 5 (Boston, MA: Thomson Course Technology, 2004).
[7] N. Bolshakova, F. Azuaje & P. Cunningham, An integrated tool for microarray data clustering and cluster validity assessment, Proceedings of ACM Symposium on Applied Computing, Nicosia, Cyprus, 2004, 133-137.
[8] W. Au, K. C. Chan, A. Wong & Y. Wang, Attribute clustering for grouping, selection, and classification of gene expression data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(2), 2005, 83-101.
[9] S. Kaski, J. Nikkila, J. Sinkkonen, L. Lahti, J. Knuuttila & C. Roos, Associative clustering for explaining dependencies between functional genomics data sets, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(3), 2005, 203-216.
[10] L. Parsons, E. Haque & H. Liu, Subspace clustering for high dimensional data: a review, ACM SIGKDD Explorations Newsletter, 6(1), 2004, 90-105.
