
International Journal of Computer Science and Information Security (IJCSIS),

Vol. 14, No. 12, December 2016

Database Search, Alignment Viewer and Genomics Analysis Tools: Big Data for Bioinformatics

Muhammad Atif Sarwar, Abbas Rehman, Javed Ferzund
Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, Pakistan
atifsarwar@ciitsahiwal.edu.pk, abbasrehman@ciitsahiwal.edu.pk, jferzund@ciitsahiwal.edu.pk

This paper was submitted for review on 14 December 2016. Muhammad Atif Sarwar, Abbas Rehman and Javed Ferzund are with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail: atifsarwar@ciitsahiwal.edu.pk, abbasrehman@ciitsahiwal.edu.pk, jferzund@ciitsahiwal.edu.pk).
ISSN 1947-5500, https://sites.google.com/site/ijcsis/

Abstract: Advancement in sequencing technology has resulted in the production of large amounts of omics data in a short time. Traditional bioinformatics tools cannot cope with the rate at which such data is produced, so new tools need to be developed and existing tools need to be improved. New researchers, developers and bioinformaticists face difficulty in selecting the appropriate tool for analyzing data or for making improvements to the tools. This paper presents a comprehensive survey of the available bioinformatics tools, the purpose of each tool, the programming languages used to develop them and the data formats they use. It also reports whether each tool has been enhanced for use on a Big Data platform.

I. INTRODUCTION

With the passage of time, new approaches and technologies have been developed because massive amounts of data have become available. Large amounts of data are available in many fields, such as Electrical, Mechanical, Electronics, Mathematics, Management Sciences, Computer Science and Bioinformatics. A stack of tools is available in every field to help analyze and store this data.

Recently, the term Big Data has been introduced to denote the huge amounts of data handled in the computer science field. Such data, for example that of Facebook, Yahoo, Twitter and Walmart, needs to be analyzed, stored and managed. Big Data is characterized by five Vs: Volume (amount of data), Velocity (speed at which data arrives and is processed), Variety (types of data), Veracity (trustworthiness of data) and potential Value (the value that can be extracted from data). Potential Value is very important for future planning. Big Data processing is of two types: real time (NewSQL) and batch (analytics based). Performance is the big challenge for Big Data. Many tools and systems are available for managing Big Data, for example Hadoop (a distributed data management system).

At the same time, bioinformatics data is also being produced in large amounts, in areas such as genomics, proteomics, RNA, DNA and motif finding. A lot of data is produced by sequence alignment (multiple and pairwise) of RNA, DNA and proteins. Some data describes relationships such as protein to protein, gene to disease, disease to gene, and gene to protein. Other important data is made available for database search. NGS (Next Generation Sequencing) has produced a lot of sequencing data.

In the coming era, this data will keep growing day by day. All of this bioinformatics data needs to be analyzed and stored in a well-organized way. For this purpose, the open source Apache Hadoop system was designed for distributed storage and exploration of large data. It offers fault tolerance, security and efficiency. Hadoop consists of HDFS (Hadoop Distributed File System), MapReduce (a programming paradigm) and many tools built on it, such as HBase, Hive, Pig and ZooKeeper.

HDFS is a distributed file system that stores data across the machines of a cluster, where programs can then process it. It consists of a name node and data nodes (a master/slave relationship). HBase provides random read/write data access with row-level ACID guarantees. Hive is a data warehouse system with an HQL (Hive Query Language) interface. It also provides the

facility of Hive data units such as partitions, buckets and tables.

Hadoop MapReduce is a system for parallel processing of large data using a simple programming model. It consists of Map and Reduce tasks coordinated by a job tracker and task trackers (a master/slave relationship). Apache also provides the Spark framework for analysis of large data with the help of RDDs (Resilient Distributed Datasets), transformations, actions and caching. It includes many built-in libraries for multiple purposes. Spark generally offers better performance than Hadoop MapReduce.

Many bioinformatics tools are designed for processing and analysis of bioinformatics data such as RNA, DNA, genomics and proteomics data. These tools do not scale when huge amounts of data are involved. To remove this bottleneck, such tools are being implemented on the Hadoop platform. Some tools are implemented in Hadoop for alignment viewing, some for database searching and some for genomic analysis. Most of these bioinformatics tools are implemented in the MapReduce or the Spark framework.

Hadoop modules support many languages, such as Java, Python and Scala. Each bioinformatics tool is implemented in a specific language on the MapReduce or Apache Spark framework. Some tools are implemented in Java, some in Python, some in Scala and some in C++/C#. Tools implemented on MapReduce mostly use Java. Tools implemented on Spark mostly use Scala, Java and Python.

The choice of data format for storing large bioinformatics data is an important decision, and it has to be made whenever a bioinformatics tool is implemented on the MapReduce or Spark framework. Some data formats perform well with small datasets but do not scale when the bioinformatics data is large. There are different data formats for storing the large data used by database search tools and alignment viewers/editors.

The objectives of this survey are:

- To explore the tools in the bioinformatics domain
- To explain the specific implementation platform of these bioinformatics tools
- To identify the specific implementation language of each tool
- To understand the data formats used for storage and analysis of large data in the MapReduce or Spark framework

The rest of this survey is organized as follows: Section 2 describes the related work in this field. Section 3 presents the available bioinformatics tools and their characteristics in terms of category of tool, implementation language, data format support and implementation for Big Data. Section 4 presents a brief discussion, and the conclusion is presented in Section 5.

II. RELATED WORK

Niemenmaa [1] analyzed large data sets with distributed computing on computer clusters. Rapidly growing data quantities, for example from DNA sequencing, require new methods of data analysis. Hadoop, a collection of software designed especially for Big Data, is used for the processing; it is built around the Hadoop Distributed File System and the scalable Hadoop MapReduce distributed computing platform, and is complemented by Apache Spark. That work describes how Hadoop MapReduce and Apache Spark operate on bioinformatics data. Bioinformatics data is growing so fast that storing and manipulating it calls for technologies such as Hadoop MapReduce, Apache Spark and HDFS. New file formats are therefore being developed to cope better with the needs of current and future Big Data sets. The present survey analyses the current state-of-the-art tools in bioinformatics and their implementation on Big Data platforms.

Schatz et al. [2] developed CloudBurst, which is used for genome mapping. CloudBurst provides a parallel short-read mapping method that improves the scalability of reading large sequencing data. The CloudBurst team has developed further tools to support the biomedical field, for example Crossbow, used for identifying single nucleotide polymorphisms (SNPs) from sequencing data, and Contrail, used for assembling large genomes.

Pandey et al. designed the DistMap toolkit [3] for distributed short-read mapping on a Hadoop cluster. DistMap aims to support many kinds of mappers in order to cover a wider variety of sequencing applications. The nine supported mappers are SOAP, STAR, GSNAP, BWA, Bowtie, Bowtie2, Bismark, BSMAP and TopHat. DistMap integrates the whole mapping workflow, which can be run with a simple command.

O'Connor et al. built SeqWare [4], a query engine designed on the Apache HBase database to help bioinformatics researchers access large-scale whole-genome datasets. The SeqWare team also created an interactive interface that integrates genome tools and a browser. In a prototyping analysis, the 1102GBM and U87MG tumor databases were loaded, and the team compared the HBase back end with Berkeley DB for loading data and exporting variants.

Lewis et al. developed Hydra [5], a scalable proteomic search engine built on the Hadoop distributed computing framework. Hydra provides


a software bundle for processing spectra against large peptide databases, implemented on a distributed computing framework that supports scalable search of massive quantities of spectrometry data. Hydra divides the proteomic search into two steps: (1) scoring the spectra and retrieving the data and (2) generating a peptide database.

Van der Auwera et al. presented the Genome Analysis Toolkit (GATK) [6], designed to support large-scale DNA sequence analysis based on the MapReduce programming framework. GATK supports several data formats, including binary alignment/map (BAM), SAM files, dbSNP and HapMap. The toolkit provides two kinds of modules, traversals and walkers. The traversal modules read sequencing data into the system and provide related references to the data. The walker modules produce analysis results from the data. GATK has been used for the 1000 Genomes Project and The Cancer Genome Atlas.

Langmead et al. introduced Myrna [7], a cloud-based computing pipeline used to calculate differences in gene expression in very large RNA sequencing datasets. RNA-seq data are sequencing reads derived from mRNA molecules. Myrna supports several functions for RNA-seq analysis, including read alignment, normalization and statistical modeling, in one integrated pipeline. Myrna reports the differential expression of genes as P-values and q-values, together with analytical plots for those genes. The system was tested on the Amazon Elastic Compute Cloud (Amazon EC2) using 1.1 billion RNA-seq reads; the results show that Myrna can process the data in less than two hours, and the cost of the test run was around $66.

Chung et al. proposed CloudDOE [8], a software tool set that offers a user-friendly interface for deploying a Hadoop cloud, because the Hadoop platform by itself is too complex for a user with no expertise in computer science or other technical skills. The tool is simple enough for a layman and uses MapReduce to run complex analyses such as the analysis of high-throughput sequencing data. It helps scholars and researchers configure a Hadoop cloud easily in order to study different aspects of bioinformatics data.

Robinson et al. presented SAMQA [9], a toolkit designed to find errors in genomic data and to ensure that sequencing data meets significant quality measures. The tool was developed for the National Institutes of Health Cancer Genome Atlas to detect errors automatically and write them to a log file. It uses technical tests to identify abnormalities in genomic data, such as an invalid CIGAR value or a sequence alignment/map (SAM) format error. For the same biological data set of approximately 23 GB, a comparison of SAMQA on a single-core server and on a Hadoop cluster showed that the Hadoop run was 80 times faster.

Angiuoli et al. designed CloVR [10], a sequence analysis tool distributed as a virtual machine. It supports cloud systems and personal desktop systems equally for processing sequencing data, thus lowering the barriers to analyzing large datasets. Several bioinformatics workflows are packaged with the virtual machine, including metagenomic, whole-genome and 16S rRNA sequencing analysis.

To check the portability of CloVR between a local host and a cloud platform, the tool was tested on a local machine (4 CPUs, 8 GB RAM) and on the Amazon EC2 cloud platform (80 CPUs). The results showed that CloVR is portable across both platforms.

Oehmen et al. proposed ScalaBLAST 2.0 [11], a tool that runs the five BLAST calculations (blastn, blastp, tblastn, tblastx and blastx) in parallel on nucleotide and protein sequences, on both small and large multiprocessor systems. Compared with ScalaBLAST 1.0 and mpiBLAST, ScalaBLAST 2.0 adds dynamic data partitioning, which gives it fault resilience, and does not require pre-formatting of datasets. It is implemented using the NCBI BLAST C toolkit and depends on an MPI library. Input files are in FASTA format, and the output formats are standard pairwise, tabular, and tabular with headers.

Michael C. Schatz introduced BlastReduce [12], a parallel seed-and-extend alignment algorithm built from three MapReduce cycles. It takes NGS short DNA reads and aligns (maps) them to a reference genome (the database for a specific species) using the Hadoop MapReduce paradigm, which supports parallel execution. Like BLAST, BlastReduce uses the seed-and-extend technique; unlike BLAST, it uses the Landau-Vishkin alignment algorithm for the extension step. BlastReduce is reported to be much faster than BLAST. It is implemented in Java with Hadoop and is compatible with cloud computing. The reported performance shows that BlastReduce scales to large sets of reads with a substantial speedup.

III. TOOLS FOR BIOINFORMATICS

There are many bioinformatics tools for the analysis of small and large datasets. Every tool performs a specific function. Different tools are used for alignment viewing and editing, database search and genome analysis. These tools require the data to be stored in a specific format for any kind of analysis. They are built using different programming languages, so it is important to know the specific language in order to customize a tool. Skill in the relevant programming language is even more helpful when extending these tools for the Hadoop MapReduce or Apache Spark framework.
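Because every tool expects its input in a specific format, checking that format is a natural first step before analysis. SAMQA (Section II) reports, among other things, invalid CIGAR values in SAM records. The Python sketch below illustrates what such a check can look like. It is not SAMQA's code: the function name cigar_errors is hypothetical, and only two rules are checked, namely that the string is a sequence of length/operator pairs and that the query-consuming operations add up to the read length.

```python
import re

# One CIGAR operation: a length followed by an operator letter
# (M, I, D, N, S, H, P, =, X are the operators used in SAM records).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that consume bases of the read (query) sequence.
_CONSUMES_QUERY = set("MIS=X")


def cigar_errors(cigar, seq_len):
    """Return a list of problems found in a CIGAR string (empty if none)."""
    if cigar == "*":  # SAM uses "*" when no alignment is available
        return []
    ops = _CIGAR_OP.findall(cigar)
    # Every character must belong to some <length><operator> pair.
    if not ops or "".join(n + op for n, op in ops) != cigar:
        return ["malformed CIGAR string"]
    errors = []
    if any(int(n) == 0 for n, _ in ops):
        errors.append("zero-length operation")
    query_len = sum(int(n) for n, op in ops if op in _CONSUMES_QUERY)
    if query_len != seq_len:
        errors.append("CIGAR covers %d bases, read has %d" % (query_len, seq_len))
    return errors
```

For a 10-base read, cigar_errors("5M2I3M", 10) returns an empty list, while a string containing an unknown operator is reported as malformed. On a Hadoop cluster a check of this kind would run independently on each record, which is why it parallelizes well.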


A. Alignment Viewers/Editors

In this category, a number of tools are available which provide quicker, easier and user-friendly interfaces for the sequence analysis of biological data such as nucleotide or protein sequences. Many programming languages have been used to implement the alignment viewers and editors.

Ale (emacs plugin) is implemented in Lisp and C, AliView in Java, Base-By-Base in Java, and BioEdit in Python. Java, C++ and Python are used by BioNumerics, C is used for BoxShade, and C++, Java and Perl are used for the implementation of CINEMA. CLC viewer is implemented in C, C++ and Python, and ClustalX viewer is implemented in Biopython. Table 1 presents the tools available for alignment viewing and editing.

Every tool has specific data formats for storage and analysis. Ale (emacs plugin) supports GenBank, EMBL, Fasta and Phylip. AliView supports Nexus, MSF, Fasta, Phylip and Clustal. Base-By-Base supports EMBL, GenBank, Base-by-base files and Clustal. BioNumerics uses the GenBank and Fasta formats for the storage and analysis of data.

Some of the alignment viewer/editor tools are implemented on the Apache Spark and Hadoop MapReduce frameworks for Big Data analysis of biological sequence data. DNASTAR Lasergene Molecular Biology Suite and Maestro are implemented on the Hadoop MapReduce framework.
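Fasta is the format that most of the tools discussed here have in common, so it is a convenient example of what "supporting a data format" means in practice. The following is a minimal Python sketch of a Fasta reader, written for illustration only; the function name read_fasta and the sample records are assumptions, not taken from any of the surveyed tools.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a Fasta-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example; real input would come from an open file.
example = io.StringIO(">seq1 demo record\nACGT\nACGT\n>seq2\nTTGA\n")
records = list(read_fasta(example))
```

Here records holds the two pairs ("seq1 demo record", "ACGTACGT") and ("seq2", "TTGA"). Because each record is self-contained, a Fasta file can be split at header lines and the pieces handed to separate workers, which is one reason the format adapts easily to MapReduce-style processing.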

Table 1: Overview of Alignment Viewer/Editor tools and their use of the MapReduce and Spark frameworks

Name | Language | Data Format | MapReduce | Spark
Ale (emacs plugin) | Lisp, C [13] | GenBank, EMBL, Fasta, Phylip | NO | NO
AliView [14] | Java [15] | Nexus, MSF, Fasta, Phylip, Clustal | NO | NO
Base-By-Base | Java [16] | EMBL, GenBank, Base-by-base files, Clustal | NO | NO
BioEdit [17] | Python, Java, C, C++, C# and Perl [18], [19] | NBRF/PIR, Phylip 3.2, Phylip 4, GenBank, Fasta | NO | NO
BioNumerics [20] | C++, Python [21] | GenBank, Fasta | NO | NO
BoxShade [22] | C [23] | READSEQ, ALN format as written by Clustal W, PILEUP | NO | NO
CINEMA [24] | C++, Python, Java, Perl [25] | Nexus, MSF, Clustal, Fasta, Phylip, PIR, PRINTS | NO | NO
CLC viewer [26] | C/C++ and Python [27] | Many | NO | NO
ClustalX viewer [28] | Biopython [29] | Nexus, MSF, Clustal, Fasta, Phylip | NO | NO
Cylindrical BLAST Viewer [30] | Java [31] | INSDSet, GFF3, ClustalW, Blast, XML, proprietary XML | NO | NO
DECIPHER | Perl, R [32] | Fasta, Fastq, GenBank | NO | NO
Discovery Studio [33] | C, C++, Java [34] | SPT, SEQ, PDB, EMBL, HELM, Clustal, GDE, Fasta, BSML, GB | NO | NO
DNASP [35] | Perl [36] | Fasta, Nexus, Mega, Phylip | NO | NO
DNASTAR Lasergene Molecular Biology Suite | C/C++/Python [37] | ABI, DNA Multi-Seq, Fasta, GCG Pileup, GenBank, Phred | YES [38] | NO
FLAK | Java [39] | Fasta | NO | NO
Genedoc [40] | Python, C++, C [41] | MSF | NO | NO
Geneious | Perl/Python/Java/C++/C [42] | Fasta | NO | NO
Integrated Genome Browser (IGB) [43] | Perl, C/C++ and Java [44] | BAM, Fasta, PSL | NO | NO
IVistMSA [45] | C++ and Java [46] | MSF, ClustalW, Nexus, PIR, Phylip, GDE | NO | NO
Jalview 2 [47] | Java [48] | BLC, PIR, Stockholm, MSF, Clustal, Fasta, PFAM | NO | NO
JEvTrace [49] | Java [50] | Fasta, MSF, Clustal, Phylip, Newick, PDB | NO | NO
JSAV [51] | Javascript [52] | An array of JavaScript objects | NO | NO
Maestro [53] | Java [54] | Clustal, Fasta, PDB | YES [55] | NO
MEGA [56] | C++ [57] | Fasta, Clustal, Nexus, Mega, etc. | NO | NO
MSAReveal.org [58] | Java [59] | Fasta | NO | NO
Multiseq (vmd plugin) [60] | C++ [61] | Fasta, PDB, ALN, Phylip, Nexus | NO | NO
MView [62] | Perl [63] | Clustal, HSSP, Fasta, PIR, MSF, Fasta search, Blast search | NO | NO
PFAAT [64] | Java [65] | Nexus, MSF, Clustal, Fasta, PFAAT | NO | NO
Ralee (emacs plugin for RNA) [66] | Perl [67] | Stockholm | NO | NO
S2S RNA editor | Java [68] | Fasta, RNAML | NO | NO
Seaview [26] | C, C++ [69] | Nexus, MSF, Clustal, Fasta, Phylip, MASE | NO | NO
Seqotron [27] | Objective C [70] | NBRF/PIR, Nexus, Clustal, GDE, Stockholm, flat, MEGA, Fasta | NO | NO
Sequilab [71] | C++ | Fasta | NO | NO
Sequlator [72] | C, C++, Perl [73] | MSF | NO | NO
SnipViz [74] | Javascript [75] | Fasta, Newick | NO | NO
Strap | C, C++ [76] | PDB, EMBL, Pfam, Nexus, Fasta, ClustalW, MSF | NO | NO
Tablet [77] | Java | Fastq, GFF3, SAM, BAM, ACE, AFG, Fasta, MAQ, SOAP2 | NO | NO
UGENE | C++, Qt | Fasta, Fastq, ClustalW, Stockholm, ABIF, SCF, Newick, GenBank, EMBL, PDB, MSF, GFF | NO | NO
VISSA sequence | C++ [78] | Clustal, Fasta | NO | NO
DNApy | Python [79] | Fasta, Fastq, GenBank | NO | NO
Alignment Annotator [80] | Javascript [81] | Fasta | NO | NO


B. Database Search

Many tools are available for searching protein and nucleotide sequence databases, and with the advancement in technology many new ones have been developed. They are implemented in different languages, such as C, C++, C#, Java, Perl, Python, CUDA C++ and PTX assembly language, Android, Objective-C and Ruby. These tools include BLAST, CS-BLAST, CUDASW++, DIAMOND, GGSearch/GLSearch, Genoogle and HMMER. A list of the available tools in this category is presented in Table 2.

The database search tools support different formats for querying the stored data. The supported formats include Fasta, bare sequences, identifiers, GenBank, Fastq, EMBL, Clustal, Stockholm, A2M, A3M, MEGA, GCG/MSF, PIR/NBRF and TREECON. Some of these tools have been implemented on Hadoop MapReduce and Apache Spark: BLAST, DIAMOND, HMMER, HHpred/HHsearch and ScalaBLAST are implemented on the Hadoop MapReduce or Apache Spark framework.
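Several of the search tools above rest on the seed-and-extend idea that Section II describes for BLAST and BlastReduce: short exact matches (seeds) between query and database are found through an index, and each seed is then extended into a full alignment. The Python sketch below shows the idea in its simplest form. It is an illustrative simplification, not the algorithm of any listed tool: the extension step merely counts mismatches over an ungapped window (neither BLAST scoring nor the Landau-Vishkin algorithm is used), and the function names and parameters are assumptions.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, k=4, max_mismatches=1):
    """Return sorted (reference_start, mismatches) hits for the read."""
    index = build_seed_index(reference, k)
    hits = {}
    for offset in range(len(read) - k + 1):
        # Seed step: look up an exact k-mer match in the index.
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            # Extend step: compare the full read with the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())
```

With the made-up reference "TTTTACGTGCATTTT", the read "ACGTGGA" is placed at position 4 with one mismatch. In a MapReduce setting the seed lookup and the extension map naturally onto separate phases, since reads can be processed independently of one another.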

Table 2: Overview of Database Search tools and their use of the MapReduce and Spark frameworks

Name | Sequence Type | Language | Data Format | MapReduce | Spark
BLAST [82] | Nucleotide or protein sequence | C, C++, C#, Java, Perl and Python | Fasta, bare sequences, identifiers, GenBank | YES [12] | NO
CS-BLAST | Protein sequence | C++ | Fasta | NO | NO
CUDASW++ [83] | Protein sequence | CUDA C++ and PTX assembly languages | Fasta | NO | NO
DIAMOND [84] | Protein sequence | Java, Android, Objective-C, Python and C++ | Fasta, Fastq | NO | NO
FASTA [85] | Nucleotide or protein sequence | Python, Ruby and Perl [86] | Fasta | NO | NO
GGSearch/GLSearch | Protein sequence | C/C++, BioPerl, BioPython | Fasta | NO | NO
Genoogle | Nucleotide or protein sequence | Java | Fasta | NO | NO
HMMER [87] | Nucleotide or protein sequence | C | Fasta, EMBL, GenBank | YES [88] | NO
HHpred/HHsearch | Protein | C++ | Fasta, A2M, GCG/MSF, A3M, PIR/NBRF, EMBL, MEGA, Clustal | YES [88] | NO
KLAST | Sequence similarity search tool | C++, Java | Fasta | NO | NO
USEARCH | Nucleotide or protein | C/C++ | Fasta | NO | NO
Parasail | Nucleotide or protein | Python or C | Fasta, Fastq | NO | NO
PSI-BLAST | Protein | C++, Java, R, Matlab, Python | Fasta | NO | NO
PSI-Search | Protein | Java [89] | PHYLIP, GCG, UniProtKB, PIR | NO | NO
ScalaBLAST | Nucleotide or protein | Java, C++, Python [90] | Fasta | YES [91] | YES [92]
Sequilab | Nucleotide/Peptide | Java | Fasta | NO | NO
SAM | Nucleotide or protein | Java | Fasta | YES | NO
SSEARCH | Nucleotide or protein | C++ | Fasta | NO | NO
SWAPHI | Protein | C++ | Fasta/Fastq | NO | NO
SWAPHI-LS | DNA | C/C++ | DNA sequence file | NO | NO
SWIMM | Protein | C | Fasta | NO | NO
SWIPE | Nucleotide or protein | C++ | Fasta | NO | NO

C. Genomics Analysis

Tools in this category are used for the analysis of nucleotide or peptide sequences. Different programming languages are used to implement genome analysis tools. Java is used by ACT, FLAK, Mauve, Mulan, Sequerome and Shuffle-LAGAN. Perl is used by BLAT, Sim4/SIBsim4 and SLAM. R is used by DECIPHER. FORTRAN is used by GMAP. C++ is used by GMAP, Splign, Mauve, Sequilab, Shuffle-LAGAN and Sim4. C is used by GMAP, PLAST-ncRNA, Sequilab, Shuffle-LAGAN, Sim4 and SLAM. Python is used by MGA, Multiz, PLAST-ncRNA and Sequilab. Ruby is used by SLAM. Table 3 presents the tools available for genomics analysis.

Every tool supports specific formats for efficient data storage. The EMBL format is used by ACT and Sim4. The GenBank format is used by ACT, Mauve, MGA, Mulan and Sim4. The GFF format is used by ACT. The Fasta format is used by ACT, BLAT, DECIPHER, GMAP, Splign, Mauve, Mulan, Multiz, PLAST-ncRNA, Sequilab, Shuffle-LAGAN, Sim4 and SLAM. The Multi-Fasta format is used by Mauve. The bare format is used by Mulan and Multiz. The Fastq format is used by PLAST-ncRNA.

Some of these tools, such as BLAT, GMAP and MGA, are implemented on the Hadoop MapReduce framework for genome analysis. The implementations of BLAT, GMAP and MGA use a specific format for input data in the Hadoop MapReduce or Apache Spark framework.
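To make concrete what implementing an analysis on MapReduce involves, the following Python sketch imitates the three phases (map, shuffle, reduce) in a single process for a simple genomics task, counting k-mers across reads. It is an in-memory illustration of the programming model only; it does not use the Hadoop API, and the function names and the two sample reads are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(record, k=3):
    """Map step: emit a (k-mer, 1) pair for each k-mer in one sequence."""
    _name, seq = record
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


# Two made-up reads stand in for a large set of input records.
records = [("read1", "ACGTAC"), ("read2", "GTACGT")]
mapped = chain.from_iterable(map_kmers(r) for r in records)
counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Each of the four 3-mers ACG, CGT, GTA and TAC ends up with a count of 2. On a real cluster the map calls run on different nodes over different blocks of the input, and the framework performs the grouping between the two phases; the (key, value) pairs are the same kind of intermediate format that the surveyed tools have to adopt.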

Table 3: Overview of Genomics Analysis tools and their use of the MapReduce and Spark frameworks

Name | Sequence Type | Language | Data Formats | MapReduce | Spark
ACT (Artemis Comparison Tool) | Nucleotide | Java [93] | EMBL, GenBank, GFF, Fasta | NO | NO
BLAT | Nucleotide | Perl | Fasta | YES [12] | NO
DECIPHER | Nucleotide | R | Fasta | NO | NO
FLAK [94] | Nucleotide | Java | Fasta | NO | NO
GMAP [95] | Nucleotide | Fortran, C and C++ | Fasta | YES [96] | NO
Splign [97] | Nucleotide | C++ | Fasta | NO | NO
Mauve [98] | Nucleotide | Java, C++ | Fasta, Multi-Fasta, GenBank | NO | NO
MGA [99] | Nucleotide | Python | GenBank | NO | NO
Mulan | Nucleotide | Java | Fasta, Bare, GenBank | NO | NO
Multiz | Nucleotide | Python | Bare, Fasta | NO | NO
PLAST-ncRNA | Nucleotide | C, Perl, Python | Fasta, Fastq | NO | NO
Sequerome | Nucleotide or peptide | Java | Fasta | NO | NO
Sequilab | Nucleotide or peptide | C, C++, Python, SIMD dynamic | Fasta | NO | NO
Shuffle-LAGAN | Nucleotide | C, C++, Java and Perl | Fasta | NO | NO
SIBsim4 / Sim4 | Nucleotide | C/C++, Perl | EMBL, Fasta, GenBank | NO | NO
SLAM | Nucleotide | C, Perl and Ruby | Fasta | NO | NO

IV. FUTURE OF BIOINFORMATICS TOOLS

Bioinformatics research data is very large in size, complex in nature and critical for results. Conventional research methods exhibit very high time complexity when analyzing results and very high space complexity when storing such massive datasets, and thus require systems with tremendously high processing capabilities. For example, the NCI's The Cancer Genome Atlas (TCGA) dataset alone is 2.5 petabytes, and merely downloading it would take 23 days even at industry-standard internet speeds. So instead of introducing such highly capable machines, cloud computing has taken their place. To handle the large, complex and distributed data of bioinformatics, it is economical to process datasets across the cloud. Cloud computing is beneficial for big data analytics because it distributes the data load across physically distant machines, thus improving efficiency, saving money and avoiding single points of failure. The traditional bioinformatics data analysis tools based on R, Perl or Python also do not meet the requirements for handling such huge datasets. There is therefore a need to use different tools according to the nature of the datasets involved and the type of queries or results needed to gain insight into the data. The emphasis has now shifted from data generation to data analysis.

The main goal of using statistics, machine learning algorithms and data mining techniques to identify, compile, analyze and visualize biological data is to build new models that may help in epidemic disease analysis, understanding evolution, matching genomes, suggesting medicines, providing health care information and predicting metabolomic processes. The main purpose is to introduce and implement standard protocols for analyzing huge bioinformatics data using modern computer science techniques that tackle the inherent complexity of the data and combine data across the cloud from distributed resources. The need now is to fuse robust, efficient, quantitative, accurate and precise data visualization algorithms into previously implemented tools of the biological and biomedical fields so as to target all four Vs of big data: volume of data, velocity of processing the data, variety of data sources, and veracity of the data quality. This will ultimately help biological scientists, pharmacists, patients, customers and companies.
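The TCGA download figure quoted in this section can be checked with simple arithmetic. The Python sketch below computes the sustained transfer rate implied by moving 2.5 petabytes in 23 days, and the time the same transfer would take at another rate. It assumes decimal petabytes (10^15 bytes) and a perfectly sustained link with no protocol overhead; both are simplifying assumptions, and the function names are illustrative.

```python
PETABYTE = 10 ** 15          # bytes, decimal (SI) convention assumed
SECONDS_PER_DAY = 24 * 60 * 60


def implied_rate_gbps(size_bytes, days):
    """Sustained rate in Gbit/s needed to move size_bytes in the given days."""
    bits = size_bytes * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


def download_days(size_bytes, rate_gbps):
    """Days needed to move size_bytes at a sustained rate in Gbit/s."""
    return size_bytes * 8 / (rate_gbps * 1e9) / SECONDS_PER_DAY


tcga_bytes = 2.5 * PETABYTE
rate = implied_rate_gbps(tcga_bytes, 23)        # roughly 10 Gbit/s
days_at_1gbps = download_days(tcga_bytes, 1.0)  # roughly 231 days
```

Under these assumptions the 23-day figure corresponds to a sustained rate of about 10 Gbit/s, and a 1 Gbit/s link would need about 231 days. This is the quantitative argument for moving the computation to where the data is stored rather than moving the data.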
V. CONCLUSION
NGS (Next Generation Sequencing) plays an Like Multiple Sequence Alignment, Pairwise
imperative role in the Bioinformatics field. There are Sequence Alignment tools implemented in Hadoop
many tools exists which are implemented in Hadoop MapReduce and Apache Spark for Bioinformatics
MapReduce and Apache Spark framework. These tools datasets. We can implement these tools such as ACANA,
are implemented using specific language like C, C++, AlignMe, CUDAlign, DNADot, DOTLET, FEAST, G-
Python, Scala and Java. Bioinformatics domain consists PAS, LALIGN, mAlign, MUMer, Needle, NW, Path,
of DNA, Protein, Genomics and RNA data. When this PyMOL, SEQALN, SIM, GAP, NAP, LAP, SSEARCH,
data is stored in BI (Bioinformatics) tools then specific Water and YASS in Hadoop MapReduce Platform. These
data format will be used like GenBank, EMBL, Fasta tools can also implement in Apache Spark framework.
Fastq and Phylip 4 etc. When Bioinformatics data are
store in Hadoop MapReduce or Spark framework, REFERENCES
specific data format will be used such as Avro, Sparse
vector, (key, value) pair and Sequence file etc.
Alignment viewers/editors Tools implemented in [1] M. Niemenmaa, "Analysing sequencing data in Hadoop:The
road to interactivity via SQL".
Hadoop MapReduce and Spark framework. We can
implement Alignment viewers/editor tools such as Ale, [2] M. C. Schatz, "CloudBurst: highly sensitive read mapping
AliView, Base by Base, BioEdit, BioNemerics, with MapReduce," Oxford Journal, 2009.
BoxShade, CLC Viewer, DnaSP, FLAK, Genodoc, [3] C. S. Ram Vinay Pandey, "DistMap: A Toolkit for Distributed
Jalview 2, JSAV, JEvTrace, Mega, PFAAT, Seaview, Short Read Mapping on a Hadoop Cluster," PLOS ONE,
Sequilab, Snipviz, Strap, Tablet, DNApy and UGENE in August 2013.
Hadoop MapReduce Platform. These tools can also [4] B. M. S. F. N. Brian D OConnor, "SeqWare Query Engine:
implement in Apache Spark framework. Alignment storing and searching sequence data in the cloud," in
viewer/editor tools such as DANASTAR and Maestro Open Source Conference, 2010.
have implemented in Hadoop MapReduce platform. [5] A. C. S. K. H. H. M. R. H. R. L. M. Steven Lewis, "Hydra: a
scalable proteomic search engine which utilizes the
Database search also play a vital role in Hadoop distributed computing framework," in BioMed
Bioinformatics. Like Alignment viewer/editor tools, Central, 5 December 2012.
database tools implemented in Hadoop MapReduce and [6] M. O. C. Geraldine A. Van der Auwera, "From FastQ Data to
Apache Spark. We can implement Alignment database High-Confidence Variant Calls: The Genome Analysis
tools such as CS-BLAST, FASTA GLSearch, Genoogle, Toolkit Best Practices Pipeline," in Current Protocols in
KLAST, Parasail, PSI-Search and Sequilab in Hadoop Bioinformatics, October 2013.
MapReduce Platform. These tools can also implement in [7] K. D. H. Ben Langmead, "Langmead B, Hansen KD, Leek JT.
Apache Spark framework. Database tools such as Cloud-scale RNA-sequencing differential expression
BLAST, DIAMOND, HMMER, HHSearch and analysis with Myrna. Genome Biol. 2010;11(8):R83.," in
BioMed Central , 11 August 2010.
ScalaBLAST have implemented in Hadoop MapReduce
platform. ScalaBLAST have implemented in Apache [8] C.-C. C. Wei-Chun Chung, "CloudDOE: A User-Friendly Tool
Spark framework. for Deploying Hadoop Clouds and Analyzing High-
Throughput Sequencing Data with MapReduce," PLOS
Genome Analysis covers gene and microarray data.
Gene-to-Disease relationships play a significant role
in Genome Analysis. Like database search tools,
Genome Analysis tools can be implemented on Hadoop
MapReduce and Apache Spark for Bioinformatics
datasets. We can implement Genome Analysis tools such
as ACT, DECIPHER, FLAK, Splign, Mauve, Mulan,
Multiz, Sequilab and SLAM on the Hadoop MapReduce
platform. These tools can also be implemented in the
Apache Spark framework. Genome Analysis tools such as
BLAT, GMAP and MGA have been implemented on the Hadoop
MapReduce platform.

Muhammad Atif Sarwar is a Lab Engineer at Department of
Computer Science, COMSATS Institute of Information
Technology Sahiwal, Pakistan. He received BS (CS) degree
from COMSATS Institute of Information Technology Sahiwal,
Pakistan in 2015. Currently, he is a scholar of MS (CS)
session 2015-2017 in COMSATS Institute of Information
Technology Sahiwal, Pakistan. His main research interests
include Big Data Analytics and Machine Learning.
Particularly, he is interested in applications of Big Data
in the Bioinformatics field. Currently, he is working with
the Big Data Analytics Research Group at COMSATS Institute
Sahiwal.

Abbas Rehman is a Lab Engineer at Department of Computer
Science, COMSATS Institute of Information Technology
Sahiwal, Pakistan. He received MCS degree from COMSATS
Institute of Information Technology Sahiwal, Pakistan in
2015. Currently, he is a scholar of MS (CS) session
2015-2017 in COMSATS Institute of Information Technology
Sahiwal, Pakistan. His main research interests include Big
Data Analytics and Machine Learning. Particularly, he is
interested in applications of Big Data in the
Bioinformatics field. Currently, he is working with the Big
Data Analytics Research Group at COMSATS Institute Sahiwal.

Dr. Javed Ferzund is an associate professor at Department
of Computer Science, COMSATS Institute of Information
Technology, Sahiwal, where he served as Head of Department
from 2013-2015. He received PhD degree from Graz University
of Technology, Austria in 2009. His main research interests
include Big Data Analytics, Internet of Things and Machine
Learning. Particularly, he is interested in applications of
IoT and Big Data in the Agro-Informatics and Bioinformatics
fields. Currently, he is leading the Big Data Analytics
Research Group at COMSATS Institute Sahiwal.
