
Department of Information Science and Engineering

M S Ramaiah Institute of Technology


Bangalore- 54

OPTIMIZING PHYLOGENETIC ANALYSIS USING MAP REDUCE PROGRAMMING MODEL

Abhinav Anurag
Darshil Shah
Eklavya Uppal
Ishank Mishra

Under the guidance of Mr. Siddesh G M, Assistant Professor,
Department of Information Science and Engineering, M S Ramaiah Institute of
Technology

Introduction
What Bioinformatics Is
Bioinformatics is the research, development, or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize, analyze, or visualize such data.
By developing techniques for analyzing sequence data and
related structures, we can attempt to understand the molecular
basis of life.

Phylogeny
Phylogeny is the study of evolutionary relationships among
groups of organisms (e.g. species, populations), which are
discovered through molecular sequencing data and
morphological data matrices.
Computational Phylogenetics is the application of
computational algorithms, methods and programs to
phylogenetic analyses. The goal is to assemble a phylogenetic
tree representing a hypothesis about the evolutionary ancestry
of a set of genes, species, or other taxa.

Phylogenetic Tree
A phylogenetic tree is a statement about the evolutionary
relationship between a set of homologous characters of one or
several organisms.
Homology is the relationship of two characters that have
descended, usually with divergence, from a common ancestral
character. The characters can be any genic (gene sequence,
protein sequence), structural (i.e. morphological) or
behavioural feature of an organism.
Scientists build phylogenetic trees in an attempt to understand
evolutionary relationships.

Figure: An evolutionary tree showing the divergence of raccoons and bears.
Despite their difference in size and shape, these two families are closely
related.

Building Phylogenetic Tree


Distance Matrix Method
Distance data is a matrix that holds a measure of the evolutionary
distance between each pair of sequences in the multiple
alignment.
        Seq1   Seq2   Seq3   Seq4
Seq1     -     0.8    0.4    0.6
Seq2    0.8     -     0.5    0.6
Seq3    0.2    0.5     -     0.9
Seq4    0.6    0.6    0.9     -

Needleman-Wunsch Algorithm
The NWA (Needleman-Wunsch algorithm) was proposed by
Saul Needleman and Christian Wunsch (1970).
The algorithm performs a Global Alignment on two sequences
and is commonly used in bioinformatics to align protein or
nucleotide sequences.
To find the alignment with the highest score, a two-dimensional
array (or matrix) is allocated. This matrix is often called the F
matrix, and its (i,j)th entry is often denoted by Fij.
There is one column for each character in sequence A, and
one row for each character in sequence B. Thus, if we
are aligning sequences of sizes n and m, the amount of
memory used by the algorithm is in O(nm).
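
The F matrix fill and the backtrace can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, and the function name `needleman_wunsch` is hypothetical. Here rows index the first sequence and columns the second, so the matrix has (n+1) x (m+1) entries.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b. Scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Backtrace from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GATTACA", "GCATGCU")
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score, so the backtrace may return a different but equally good alignment depending on how ties are broken.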

Calculating the F Matrix


Algorithm

The F matrix would look like this

Backtracing

This would produce an alignment like this


G-ATTACA
GCA-TGCU

Calculation of Score
For example, if the similarity matrix was
then the alignment:

with a gap penalty of -5, would have the following score

MapReduce
MapReduce is a programming model for processing large data
sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that
performs filtering and sorting and a Reduce() procedure that
performs a summary operation.
"Map" step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes.
Each worker node processes its sub-problem and passes
the answer back to the master node.
"Reduce" step: The master node then collects the answers to all
the sub-problems and combines them to form the output, which is
the answer to the problem it was originally trying to solve.

MapReduce example
mapper (filename, file-contents):
    // filename : document name
    // file-contents : document contents
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    // word : a word emitted by the mapper
    // values : a list of aggregated partial counts
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)

For example, if we had the files:


foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
We would expect the output to be (ignoring case and punctuation):
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
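
The word-count job can be simulated on a single machine to see the map, shuffle, and reduce phases in order. This is a sketch only: it does not use Hadoop, the helper name `run_word_count` is hypothetical, and lowercasing plus stripping punctuation is an assumption made so that the result matches the counts listed for foo.txt and bar.txt.

```python
import re
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for every word; words are lowercased (assumption)."""
    for word in re.findall(r"[a-z]+", contents.lower()):
        yield word, 1


def reducer(word, values):
    """Sum the partial counts collected for one word."""
    return word, sum(values)


def run_word_count(files):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)              # the "shuffle" step
    for name, contents in files.items():
        for word, count in mapper(name, contents):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


counts = run_word_count({
    "foo.txt": "Sweet, this is the foo file",
    "bar.txt": "This is the bar file",
})
for word, total in counts.items():
    print(word, total)
```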

Apache Hadoop
Apache Hadoop is an open-source software framework for storage
and large-scale processing of data sets on clusters of
commodity hardware.
MapReduce is the heart of Hadoop.
While Hadoop can be used on a single machine, its true power lies
in its ability to scale to multiple computers, each with several
processor cores.
Hadoop is also designed to efficiently distribute large amounts of
work across a set of machines.

Hadoop Distributed File System


In a Hadoop cluster, data is distributed to all the nodes of the
cluster as it is being loaded in. The Hadoop Distributed File
System (HDFS) will split large data files into chunks which are
managed by different nodes in the cluster. In addition,
each chunk is replicated across several machines, so that a
single machine failure does not result in any data being
unavailable.

Problem Statement
Distance-matrix methods of phylogenetic analysis explicitly
rely on a measure of "genetic distance" between the
sequences being classified, and therefore they require multiple
sequence alignments as an input.
Distance methods attempt to construct an all-to-all matrix
from the sequence query set describing the distance between
each sequence pair. Dynamic programming algorithms like the
Needleman-Wunsch algorithm (NWA) and the Smith-Waterman
algorithm (SWA) produce accurate alignments.
But these algorithms are computationally intensive and are
hence limited to a small number of short sequences.
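
To make the cost concrete, the sketch below counts how many pairwise alignments an all-to-all matrix needs and how many dynamic-programming cells those alignments fill. The function name `alignment_workload` and the sequence length of 1000 are assumptions for illustration; the slides do not state the lengths of the sequences used.

```python
from itertools import combinations


def alignment_workload(lengths):
    """Return (number of sequence pairs, total F-matrix cells to fill)."""
    pairs = 0
    cells = 0
    for n, m in combinations(lengths, 2):
        pairs += 1
        cells += (n + 1) * (m + 1)   # one (n+1) x (m+1) matrix per pair
    return pairs, cells


# 120 sequences, each assumed to be 1000 characters long.
pairs, cells = alignment_workload([1000] * 120)
print(pairs, cells)
```

The pair count grows as k(k-1)/2 for k sequences, and each pair costs O(nm), which is why the workload grows quickly with both the number and the length of sequences.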

Goal of the Project


The goal of this project is the design and
implementation of a parallel approach to
phylogenetic analysis using Hadoop data
clusters.
The proposed methodology should give a
performance improvement when the
sequence alignments are done in parallel
using the Hadoop framework.
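
One way such a job could be decomposed is sketched below in plain Python, without Hadoop. It is a hypothetical decomposition, not the staging used in this project: each map call handles one sequence pair independently (which is what allows pairs to run in parallel), and each reduce call assembles one row of the distance matrix. The stand-in distance `mismatch_fraction` is an assumption that only works for equal-length sequences; a real job would call a pairwise aligner such as Needleman-Wunsch at that point.

```python
from collections import defaultdict
from itertools import combinations


def mismatch_fraction(a, b):
    """Stand-in distance (assumption): fraction of positions that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def map_pair(record):
    """Map: compute one pairwise distance and emit it under both sequence ids."""
    (id_a, seq_a), (id_b, seq_b) = record
    d = mismatch_fraction(seq_a, seq_b)
    yield id_a, (id_b, d)
    yield id_b, (id_a, d)


def reduce_row(seq_id, values):
    """Reduce: collect all distances for one sequence into a matrix row."""
    row = dict(values)
    row[seq_id] = 0.0
    return seq_id, row


def run_job(sequences):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    records = combinations(sorted(sequences.items()), 2)
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_pair(record):
            grouped[key].append(value)
    return dict(reduce_row(key, values) for key, values in grouped.items())


matrix = run_job({"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"})
print(matrix["s1"])
```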

Proposed Methodology
Input format
This project takes its input data in FASTA format. FASTA is
a text-based format for representing either nucleotide
sequences or peptide sequences, in which nucleotides or
amino acids are represented using single-letter codes.
A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (">")
symbol in the first column.
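
Reading this format takes only a small parser. The sketch below is illustrative: the function name `parse_fasta` and the sample records are made up for the example and are not taken from the project's data set.

```python
def parse_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # start a new record
        elif header is not None:
            chunks.append(line)            # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


sample = """>seq1 example record
GATT
ACA
>seq2
GCATGCU
"""
for description, sequence in parse_fasta(sample.splitlines()):
    print(description, sequence)
```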

Sequence data set in FASTA format

Proposed Algorithm

Hierarchical clustering using UPGMA


UPGMA (Unweighted Pair Group Method with Arithmetic
Mean) is a simple agglomerative (bottom-up) hierarchical
clustering method.
The UPGMA algorithm constructs a rooted tree
(dendrogram) that reflects the structure present in a
pairwise similarity matrix.
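
The agglomerative procedure can be sketched as follows, here operating on pairwise distances (smaller means closer). This is a minimal illustration under assumptions: the function name `upgma`, the input layout (a dict keyed by `frozenset` pairs), and the sample distances are hypothetical, and branch lengths are omitted. At each step the two closest clusters are merged, and the distance from the merged cluster to every other cluster is the average of the two old distances weighted by cluster size.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a rooted tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between labels a and b.
    """
    clusters = {label: 1 for label in labels}   # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to every remaining cluster: size-weighted arithmetic mean.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
print(upgma(["A", "B", "C", "D"], distances))
```

In this sample, A and B merge first, C joins them next, and D joins last at the root.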

Algorithm

Various stages of MapReduce in the proposed system

User Interface
The external interface for this project was created using D3.js (D3
stands for Data-Driven Documents). D3.js is a JavaScript library that
uses digital data to drive the creation and control of dynamic
and interactive graphical forms which run in web browsers.

Performance Analysis
Figure: Running time. Number of sequences (40, 60, 80, 100, 120) versus
time taken for alignments, in milliseconds (axis range 0 to 800000), for
single-node and two-node runs.

Figure: Time for the three MapReduce stages. Comparison of Stage 1,
Stage 2, and Stage 3, in milliseconds (axis range 0 to 700000), for
sequence sets of 40, 60, 80, 100, and 120 sequences.

Figure: Throughput. Number of sequence alignments per second versus
number of sequences (40, 60, 80, 100, 120), for single-node and two-node
runs.

Conclusion
This project proposed a time-efficient approach to
phylogenetic analysis that produces a phylogram
(phylogenetic tree or evolutionary tree).
The proposed method improves computation time while
maintaining accuracy.
The dynamic programming approach of NWA, coupled with the data
and computational parallelism of Hadoop data grids, was found to
improve the accuracy and speed of sequence alignment.
