
2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies

A GPU implementation of Fast Parallel Markov Clustering in Bioinformatics using ELLPACK-R Sparse Data Format
Alhadi Bustamam, Kevin Burrage and Nicholas A. Hamilton
Institute for Molecular Bioscience, The University of Queensland, Australia
Department of Mathematics, University of Indonesia
COMLAB, Oxford University, UK

Email: alhadi.bustamam@uqconnect.edu.au, kevin.burrage@comlab.ox.ac.uk, n.hamilton@imb.uq.edu.au

Abstract
Massively parallel computing using graphics processing units (GPUs), based on tens of thousands of parallel threads running on hundreds of GPU streaming processors, has gained broad popularity and attracted researchers in a wide range of application areas, from finance, computer aided engineering, computational fluid dynamics, game physics, numerics, science, medical imaging and life science through to molecular biology and bioinformatics. Meanwhile, the Markov clustering algorithm (MCL) has become one of the most effective and highly cited methods to detect and analyze the communities/clusters within an interaction network dataset in many real-world problems such as social, technological, or biological networks, including protein-protein interaction networks. However, as datasets grow larger, the computation time of the MCL algorithm becomes slower. Hence, GPU computing is an interesting and challenging alternative for improving MCL performance. In this poster paper we introduce our improvement of MCL performance based on the ELLPACK-R sparse data format using GPU computing with the Compute Unified Device Architecture (CUDA) tool from NVIDIA (called CUDA-MCL). The results show a significant improvement in CUDA-MCL performance, and with the low-cost and widely available GPU devices on the market today, this CUDA-MCL implementation allows large-scale parallel computation on off-the-shelf desktop machines. Moreover, GPU computing approaches may potentially contribute to significantly changing the way bioinformaticians and biologists compute and interact with their data.

1. Introduction

A new era of computing power is now arising due to advances in multi-core CPUs and many-core GPUs. With the advance of GPU architecture, several major graphics card manufacturers have developed language tools to make sophisticated parallel programs on many-core GPUs readily expressible with a few abstractions [1], [2]. In 2007, NVIDIA released a scalable parallel programming model using the C language on NVIDIA's GPU cards, called the Compute Unified Device Architecture (CUDA). CUDA provides a set of extensions to the standard ANSI C programming language which enable the programmer to perform heterogeneous computation using both the CPU and the GPU: the serial portions of an application run on the CPU (the host) while the parallel portions are executed on the GPU (the device/kernel) [1]. Since that release, commodity graphics hardware has become a cost-effective parallel platform for solving many general problems [3]. In particular, the economical manufacture of GPUs in large numbers, and their broad availability in the personal computer market today, make GPU accelerators attractive for both general and specific programming purposes [4], [5].

Recently, the Markov clustering algorithm (MCL) [6], which was originally developed for the general problem of graph clustering, has been adopted in a wide range of applications, including bioinformatics applications [7]-[9]. The algorithm has also been reviewed intensively [10]-[12] and has been shown to be robust and reliable compared to many other clustering algorithms. As applications of MCL expand and the size of datasets increases, there is a strong need for a fast and reliable implementation of MCL. Hence, a parallel implementation of the MCL algorithm is now an important challenge so that MCL performance may be improved [13], [14]. Previously, we developed a parallel MCL implementation in a multi-core Message Passing Interface (MPI) [15] environment, with preliminary results showing a good performance improvement [13]. However, MPI implementations often have limited scaling ability due to serialization and synchronization phases that increase with core count. Hence, the CUDA approach can be used to scale up with core count without the need to restructure the application architecture every time a new core count is targeted [2], [3], [16].

2. MCL and CUDA-MCL implementation

MCL uses two simple algebraic operations, expansion and inflation, on a matrix associated with a graph. The Markov matrix M associated with a graph G is defined by normalizing all columns of the adjacency matrix of G. The clustering process simulates random walks (or flow) within the graph using expansion operations, and then strengthens the flow where it is already strong and weakens it where it is weak using inflation operations. By continuously alternating these two processes, the underlying structure of the graph gradually becomes apparent, and there is convergence to a result with regions of strong internal flow (clusters) separated by boundaries across which flow is absent [6].

The most demanding computations in the original MCL algorithm are the matrix-matrix multiplication processes of the MCL Expansion module, and the vector reduction processes in both the MCL Inflation module (to compute column-vector sums for Markov matrix normalization) and the MCL Chaos module (to compute the local and global chaos for the MCL stopping criterion). So the key to improving the original MCL algorithm is to parallelize the MCL Expansion, Inflation and Chaos modules [13]. Hence, our CUDA-MCL implementation consists of three main massively parallel multi-threaded CUDA kernels: (1) an Expansion kernel to compute the parallel MCL expansion processes; (2) an Inflation kernel to compute the parallel MCL inflation processes; and (3) a Chaos kernel to compute the parallel local and global chaos. Sparse matrix-vector multiplication (SpMV) based on the ELLPACK-R sparse data format [17] is adopted to allow the GPU to perform fast, efficient and massively fine-grained parallel sparse matrix-matrix computations in the core of the MCL Expansion kernel. Meanwhile, the parallel reduction process type-5 (PRD-type5) from the NVIDIA CUDA SDK [1] is adopted for the parallel sparse Markov matrix normalizations and the parallel local and global matrix energy computations, which form the cores of the MCL Inflation and Chaos kernels, respectively. PRD-type5 allows us to use the on-chip shared memory on the GPU efficiently, lowering latency and thus circumventing a major issue in other parallel computing environments such as the Message Passing Interface (MPI) [1], [13].
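For reference, the two operations on the column-stochastic matrix M can be written compactly, following the standard formulation in [6]. The expansion power e and the inflation parameter r are not stated in this poster, so the common default e = r = 2 is only an assumption here:

\[
  \text{Expansion: } M \leftarrow M^{e},
  \qquad
  \text{Inflation: } (\Gamma_{r} M)_{ij} = \frac{(M_{ij})^{r}}{\sum_{k=1}^{n} (M_{kj})^{r}} .
\]

Expansion spreads flow along longer paths of the graph, while inflation raises each entry of a column to the power r and re-normalizes the column, boosting strong flows and suppressing weak ones.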

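As a concrete illustration of the fine-grained parallelism that ELLPACK-R exposes in the Expansion kernel, the sketch below shows a generic ELLPACK-R sparse matrix-vector product in CUDA, in the style of [17], with one thread per matrix row; the expansion M*M can be viewed as one such product per column of M. This is a minimal sketch under assumed conventions, not the CUDA-MCL source: the array names (val, col_idx, row_len), the use of single precision, and the launch configuration are illustrative only.

__global__ void spmv_ellpackr(const float *val,      /* nonzeros, column-major, n * max_nnz entries */
                              const int   *col_idx,  /* matching column indices, same layout as val */
                              const int   *row_len,  /* number of nonzeros stored for each row      */
                              const float *x,        /* dense input vector                          */
                              float       *y,        /* output vector, y = A * x                    */
                              int          n)        /* number of rows of A                         */
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per matrix row */
    if (row < n) {
        float dot = 0.0f;
        int len = row_len[row];                         /* ELLPACK-R: skip padded entries */
        for (int k = 0; k < len; ++k) {
            /* column-major storage gives coalesced loads across a warp */
            float a = val[row + k * n];
            int   j = col_idx[row + k * n];
            dot += a * x[j];
        }
        y[row] = dot;
    }
}

/* Host-side launch, e.g. with 256 threads per block (one of the TPB settings evaluated in Section 3):
     int tpb = 256;
     spmv_ellpackr<<<(n + tpb - 1) / tpb, tpb>>>(d_val, d_col_idx, d_row_len, d_x, d_y, n);
*/

In CUDA-MCL the same storage scheme underlies the sparse matrix-matrix product of the Expansion kernel; how the per-column products are batched there is not detailed in this poster.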
3. Performance comparison results

For performance testing, three datasets of increasing size (PPI1 (small), PPI2 (medium) and PPI3 (large)) were used, as shown in Table 1. These datasets were extracted from several protein-protein interaction datasets from public domain websites, including BioGRID [18] and the Human Protein Reference Database (HPRD) [19]. BioGRID is a freely available online curated biological interaction dataset, compiled comprehensively for protein-protein and genetic interactions for major organism species and available in a wide variety of standardized formats. HPRD is a protein database directed toward understanding human protein function. For instance, HPRD has been used to develop human protein interaction networks based on protein-protein interaction and subcellular localization data. The HPRD datasets were manually curated from the published literature by expert biologists using bioinformatics analysis of protein sequences. HPRD datasets are also available online in various standardized data formats.

Table 1. PPI datasets (from BioGRID [18], HPRD [19])

No   Name   Source    #nodes   #interactions
1.   PPI1   BioGRID    5,156      51,050
2.   PPI2   HPRD      19,599      58,450
3.   PPI3   BioGRID   23,175     137,104

In our performance analysis, the CUDA-MCL implementation was tested on an NVIDIA GTX285 GPU with 240 core processors and 2GB VRAM, compared to a quad-core AMD Phenom II 655 3.4GHz CPU with 4GB RAM. Three different numbers of threads per block (TPB) were used in the CUDA-MCL kernels: 128, 256 and 512 TPB. We wanted to test the behaviour of CUDA-MCL performance with increasing dataset sizes and various TPB settings. In Figure 1 it can be seen that with the datasets from BioGRID we achieved a speed-up by a factor of 4 on PPI1 and of 9 on PPI3. Meanwhile, a speed-up by a factor of 7 was achieved on the HPRD dataset, PPI2. Moreover, the 512 TPB configuration gave the highest speed-up in all cases. The sparseness of the networks affected performance, in that less speed-up was observed on sparser networks due to the overhead of loading data onto the GPU. Nevertheless, the speed-ups scale with increasing dataset size and show a significant improvement for all TPB settings. As an illustration, on the PPI3 dataset we are able to do the clustering with CUDA-MCL on the NVIDIA GTX285 GPU within 10 minutes, compared to 1 hour and 23 minutes with the original MCL algorithm on the quad-core AMD Phenom II 655 3.4GHz CPU.

Figure 1. Speed-up on the desktop QC-GTX285 machine for the PPI datasets

4. Conclusions and Future Work

In this poster paper, we proposed and evaluated a new approach to the Markov clustering algorithm using GPU computing with CUDA. Our implementation is based on SpMV using the ELLPACK-R sparse matrix format [17] to compute the parallel expansion processes. We also integrated the parallel reduction method type-5 from NVIDIA into our parallel inflation process. Our experiments on a wide range of dataset sizes show that acceleration factors of up to 9 may be obtained, with the sparseness of the networks being the principal factor affecting the speed-up. To conclude, the CUDA-MCL approach allows large-scale parallel computation on off-the-shelf desktop machines that was previously only possible on supercomputing architectures. Such approaches also have the strong potential to significantly change the way bioinformaticians and biologists compute and interact with their data.

Due to the relatively large memory usage of the CUDA-MCL implementation with the ELLPACK-R sparse data format, we plan to evaluate other approaches to parallel MCL implementation on GPUs, such as multi-GPU approaches, as a further extension of CUDA-MCL's capability. We also plan to consider hybrid CUDA and OpenMP implementations (hybrid CUDA/OpenMP), which enable the exploitation of multi-core CPUs and many-core GPUs in multi-GPU cards using OpenMP and CUDA, respectively.

Acknowledgment
This work has been supported by an AusAID scholarship and the ARC Centre of Excellence in Bioinformatics at the Institute for Molecular Bioscience, The University of Queensland.

References

[1] NVIDIA Corporation, NVIDIA CUDA Programming Guide, Version 2.3.1, August 2009.

[2] T. P. Chen and Y.-K. Chen, "Challenges and opportunities of obtaining performance from multi-core CPUs and many-core GPUs," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2009, pp. 613-616.

[3] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40-53, 2008.

[4] J. W. Pitera, "Current developments in and importance of high-performance computing in drug discovery," Current Opinion in Drug Discovery & Development, vol. 12, no. 3, pp. 388-396, 2009. [Online]. Available: http://www.biomedcentral.com/content/pdf/cd-1002727.pdf

[5] Y. Liu, B. Schmidt, and D. L. Maskell, "MSA-CUDA: Multiple sequence alignment on graphics processing units with CUDA," in ASAP. IEEE, 2009, pp. 121-128.

[6] S. van Dongen, "Graph clustering via a discrete uncoupling process," SIAM J. Matrix Anal. Appl., vol. 30, no. 1, pp. 121-141, 2008.

[7] A. Enright, S. van Dongen, and C. Ouzounis, "An efficient algorithm for large-scale detection of protein families," Nucleic Acids Research, vol. 30, pp. 1575-1584, 2002.

[8] T. Harlow, J. Gogarten, and M. Ragan, "A hybrid clustering approach to recognition of protein families in 114 microbial genomes," BMC Bioinformatics, vol. 5, p. 45, 2004.

[9] S. Wong and M. A. Ragan, "MACHOS: Markov clusters of homologous subsequences," Bioinformatics, vol. 24, no. 13, pp. i77-i85, 2008.

[10] S. Brohée and J. van Helden, "Evaluation of clustering algorithms for protein-protein interaction networks," BMC Bioinformatics, vol. 7, p. 488, 2006.

[11] R. Sharan, I. Ulitsky, and R. Shamir, "Network-based prediction of protein function," Molecular Systems Biology, vol. 3, no. 88, 2007.

[12] J. Vlasblom and S. J. Wodak, "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs," BMC Bioinformatics, vol. 10, no. 99, September 2009. [Online]. Available: http://www.biomedcentral.com/1471-2105/10/99

[13] A. Bustamam, M. S. Sehgal, N. Hamilton, S. Wong, M. A. Ragan, and K. Burrage, "An efficient parallel implementation of Markov clustering algorithm for large-scale protein-protein interaction networks that uses MPI," in Proceedings of the 5th IMT-GT International Conference on Mathematics, Statistics, and their Applications (ICMSA), ser. Computational Mathematics, June 2009, pp. 94-101.

[14] K. Burrage, L. Hood, and M. Ragan, "Advanced computing for systems biology," Briefings in Bioinformatics, vol. 7, pp. 390-398, 2006.

[15] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference, 1st ed. The MIT Press, 1996.

[16] C. Boyd, "Data-parallel computing," ACM Queue, vol. 6, no. 2, pp. 30-39, 2008.

[17] F. Vázquez, E. Garzón, J. Martínez, and J. Fernández, "Accelerating sparse matrix vector product with GPUs," in Proceedings of the 9th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), Gijón, Asturias, Spain, 2009.

[18] C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, "BioGRID: a general repository for interaction datasets," Nucleic Acids Research, vol. 34, p. D535, 2006.

[19] T. S. K. Prasad et al., "Human protein reference database 2009 update," Nucleic Acids Research, vol. 37, pp. D767-72, 2009.
