Presentation 4

Department of Information Science and Engineering
M S Ramaiah Institute of Technology

Bangalore- 54
OPTIMIZING PHYLOGENETIC ANALYSIS USING MAP REDUCE PROGRAMMING MODEL
Abhinav Anurag
Darshil Shah
Eklavya Uppal
Ishank Mishra
Under the guidance of Mr. Siddesh G M, Assistant Professor,
Department of Information Science and Engineering, M S Ramaiah Institute of
Technology
Introduction
What is Bioinformatics?
Bioinformatics is the research, development, or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize, analyze, or visualize such data.
By developing techniques for analyzing sequence data and
related structures, we can attempt to understand the molecular
basis of life.
Phylogeny
Phylogeny is the study of evolutionary relationships among
groups of organisms (e.g. species, populations), which are
discovered through molecular sequencing data and
morphological data matrices.
Computational Phylogenetics is the application of
computational algorithms, methods and programs to
phylogenetic analyses. The goal is to assemble a phylogenetic
tree representing a hypothesis about the evolutionary ancestry
of a set of genes, species, or other taxa.
Phylogenetic Tree
A phylogenetic tree is a statement about the evolutionary
relationship between a set of homologous characters of one or
several organisms.
Homology is the relationship of two characters that have
descended, usually with divergence, from a common ancestral
character. The characters can be any genic (gene sequence,
protein sequence), structural (i.e. morphological) or
behavioural feature of an organism.
Scientists build phylogenetic trees in an attempt to understand
evolutionary relationships.
An evolutionary tree showing the divergence of raccoons and bears. Despite
their difference in size and shape, these two families are closely related.
Building Phylogenetic Tree

Distance Matrix Method
Distance data is a matrix in which a measure of the evolutionary
distance between each pair of sequences in the multiple
alignment has been calculated.
        Seq1   Seq2   Seq3   Seq4
Seq1     -     0.8    0.4    0.6
Seq2    0.8     -     0.5    0.6
Seq3    0.2    0.5     -     0.9
Seq4    0.6    0.6    0.9     -
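As an illustration of how such a matrix can be filled, the sketch below uses the p-distance, the fraction of aligned positions at which two sequences differ. This is only one simple distance measure and is not necessarily the one used in the project; the function names `p_distance` and `distance_matrix` and the toy sequences are assumptions made for the example.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of positions that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Build a symmetric all-to-all distance table from {name: aligned sequence}."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = p_distance(seqs[n], seqs[m])
        dist[n][m] = dist[m][n] = d  # a distance is the same in both directions
    return dist

# Toy aligned sequences (hypothetical, for illustration only).
aligned = {
    "Seq1": "GATTACA",
    "Seq2": "GACTACA",
    "Seq3": "GATTGCA",
    "Seq4": "CATTACG",
}
matrix = distance_matrix(aligned)
```

Each off-diagonal cell is computed once and mirrored, so only n(n-1)/2 comparisons are needed for n sequences.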
Needleman-Wunsch Algorithm
The NWA (Needleman-Wunsch algorithm) was proposed by
Saul Needleman and Christian Wunsch (1970).
The algorithm performs a Global Alignment on two sequences
and is commonly used in bioinformatics to align protein or
nucleotide sequences.
To find the alignment with the highest score, a two-dimensional
array (or matrix) is allocated. This matrix is often called the F
matrix, and its (i,j)th entry is often denoted by Fij.
There is one column for each character in sequence A, and
one row for each character in sequence B. Thus, if we
are aligning sequences of sizes n and m, the amount of
memory used by the algorithm is in O(nm).
Calculating the F Matrix

Algorithm
The F matrix would look like this
Backtracing
This would produce an alignment like this

G-ATTACA
GCA-TGCU
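The two steps above, filling the F matrix and tracing back through it, can be sketched in Python as follows. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example and are not taken from the slides; when several alignments share the best score, the one returned depends on the tie-breaking order in the traceback.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    # Here rows follow sequence a and columns follow sequence b.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# The two sequences from the alignment shown above, with gaps removed.
score, top, bottom = needleman_wunsch("GATTACA", "GCATGCU")
```

Both the time and the memory used by this sketch grow as O(nm), matching the cost stated for the F matrix.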
Calculation of Score
For example, if the similarity matrix was
then the alignment:
with a gap penalty of -5, would have the following score
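In general, the score of an alignment is the sum of the similarity values of each aligned pair of characters, plus the gap penalty for every position where one sequence has a gap. The sketch below shows that calculation. The gap penalty of -5 is the one quoted above; the similarity values and the example alignment are hypothetical placeholders, not the matrix or alignment from the slide.

```python
GAP_PENALTY = -5  # gap penalty quoted in the text

# Hypothetical symmetric similarity values; NOT the matrix from the slide.
SIMILARITY = {
    ("A", "A"): 2, ("C", "C"): 2, ("G", "G"): 2, ("T", "T"): 2,
    ("A", "G"): -1, ("C", "T"): -1,                                   # transitions
    ("A", "C"): -2, ("A", "T"): -2, ("C", "G"): -2, ("G", "T"): -2,   # transversions
}

def similarity(x, y):
    """Look up a pair in either order, since the matrix is symmetric."""
    key = (x, y) if (x, y) in SIMILARITY else (y, x)
    return SIMILARITY[key]

def alignment_score(row_a, row_b, gap=GAP_PENALTY):
    """Sum similarity values over aligned pairs and add the penalty per gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += similarity(x, y)
    return total

# Hypothetical alignment: one mismatch, four matches, two gaps.
example_score = alignment_score("AGACT-A", "CGA-TGA")
```

With these placeholder values the example sums to -2 + 2 + 2 - 5 + 2 - 5 + 2 = -4.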
MapReduce
MapReduce is a programming model for processing large data
sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that
performs filtering and sorting and a Reduce() procedure that
performs a summary operation.
"Map" step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes.
The worker node processes the smaller problem, and passes
the answer back to its master node.
"Reduce" step: The master node then collects the answers to all
the sub-problems and combines them in some way to form the
output, which is the answer to the problem it was originally
trying to solve.
MapReduce example
mapper (filename, file-contents):
    // filename : document name
    // file-contents : document contents
    for each word in file-contents:
        emit (word, 1)

reducer (word, values):
    // word : a word emitted by the mappers
    // values : a list of aggregated partial counts
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)
For example, if we had the files:

foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
We would expect the output to be:
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
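The word count above can be reproduced on a single machine with a small Python sketch that imitates the map, shuffle, and reduce steps in memory. It does not use Hadoop; the helper `run_word_count` is a hypothetical name, and lowercasing words and dropping punctuation are assumptions inferred from the expected output.

```python
import re
from collections import defaultdict

def mapper(filename, contents):
    """Emit (word, 1) for every word in the document."""
    for word in re.findall(r"[a-z]+", contents.lower()):
        yield word, 1

def reducer(word, values):
    """Sum the partial counts collected for one word."""
    return word, sum(values)

def run_word_count(files):
    """Run map, group the emitted pairs by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            groups[word].append(count)
    return dict(reducer(word, values) for word, values in groups.items())

files = {
    "foo.txt": "Sweet, this is the foo file",
    "bar.txt": "This is the bar file",
}
counts = run_word_count(files)
for word, total in counts.items():
    print(word, total)
```

In a real cluster the grouping step is carried out by the framework between the map and reduce phases, and the mapper and reducer calls run on different worker nodes.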
Apache Hadoop
Apache Hadoop is an open-source software framework for the
storage and large-scale processing of data sets on clusters of
commodity hardware.
MapReduce is the heart of Hadoop.
While Hadoop can be used on a single machine, its true power
lies in its ability to scale to multiple computers, each with
several processor cores.
Hadoop is also designed to efficiently distribute large amounts of
work across a set of machines.
Hadoop Distributed File System

In a Hadoop cluster, data is distributed to all the nodes of the
cluster as it is being loaded in. The Hadoop Distributed File
System (HDFS) splits large data files into chunks, which are
managed by different nodes in the cluster. In addition, each
chunk is replicated across several machines, so that a single
machine failure does not result in any data being unavailable.
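The idea of chunking and replication can be illustrated with a toy model. This is a simplified sketch, not the HDFS API: real HDFS chooses replica locations with its own placement policy, whereas the hypothetical `place_replicas` below simply assigns replicas round-robin. The replication factor of 3 mirrors the HDFS default; the node names and chunk size are made up.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        c: [nodes[(c + r) % len(nodes)] for r in range(replication)]
        for c in range(num_chunks)
    }

def available_after_failure(placement, failed_node):
    """True if every chunk still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

data = b"ACGT" * 25                      # 100 bytes of toy sequence data
chunks = split_into_chunks(data, 32)     # chunks of 32, 32, 32 and 4 bytes
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(chunks), nodes, replication=3)
survives = all(available_after_failure(placement, n) for n in nodes)
```

Because every chunk lives on more than one node, removing any single node from the placement still leaves a copy of each chunk readable.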
Problem Statement
Distance-matrix methods of phylogenetic analysis explicitly
rely on a measure of "genetic distance" between the
sequences being classified, and therefore they require multiple
sequence alignments as an input.
Distance methods attempt to construct an all-to-all matrix
from the sequence query set describing the distance between
each sequence pair. Dynamic programming algorithms like the
Needleman-Wunsch algorithm (NWA) and the Smith-Waterman
algorithm (SWA) produce accurate alignments.
But these algorithms are computationally intensive and are
hence limited to a small number of short sequences.
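To see why the cost grows quickly, note that an all-to-all matrix over N sequences needs N(N-1)/2 pairwise alignments, each of which costs O(nm) with dynamic programming. The sketch below counts the pairs for the sequence-set sizes used in the performance analysis (40 to 120); the function names are hypothetical.

```python
from itertools import combinations

def pair_count(num_sequences):
    """Number of distinct sequence pairs in an all-to-all comparison."""
    return num_sequences * (num_sequences - 1) // 2

def alignment_tasks(names):
    """List every unordered pair of sequence names; each pair is one alignment."""
    return list(combinations(names, 2))

for n in (40, 60, 80, 100, 120):
    print(n, "sequences ->", pair_count(n), "pairwise alignments")
```

Tripling the input from 40 to 120 sequences raises the number of alignments from 780 to 7140. The pairs are independent of one another, which is what makes the workload suitable for parallel execution.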
Goal of the Project

The goal of this project is the design and
implementation of a parallel approach to
phylogenetic analysis using Hadoop data
clusters.
The proposed methodology should give a
performance improvement when the sequence
alignments are done in parallel using the
Hadoop framework.
Proposed Methodology
Input format
This project takes its input data in FASTA format. FASTA format is
a text-based format for representing either nucleotide
sequences or peptide sequences, in which nucleotides or
amino acids are represented using single-letter codes.
A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (">")
symbol in the first column.
Sequence data set in FASTA format
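A minimal reader for this format can be sketched as follows. It assumes the rules described above (a ">" description line followed by one or more lines of sequence data); the function name `parse_fasta` and the two records are hypothetical and are not the project's data set.

```python
def parse_fasta(text):
    """Parse FASTA text into {description: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A description line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Sequence data may be wrapped over several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

example = """>seq1 hypothetical record
GATTACA
GATTACA
>seq2 another hypothetical record
GCATGCU
"""
sequences = parse_fasta(example)
```

Wrapped sequence lines are joined, so each record ends up as a single string ready to be passed to an alignment routine.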
Proposed Algorithm
Hierarchical clustering using UPGMA

UPGMA (Unweighted Pair Group Method with Arithmetic
Mean) is a simple agglomerative (bottom-up) hierarchical
clustering method.
The UPGMA algorithm constructs a rooted tree
(dendrogram) that reflects the structure present in a
pairwise similarity matrix.
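The general UPGMA procedure can be sketched in Python as below. This is a textbook-style illustration, not the project's implementation: it works on pairwise distances (smaller means more similar), repeatedly merges the closest pair of clusters, and recomputes distances as size-weighted averages. The function name, the input layout, and the four-taxon distances are assumptions made for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """names: list of labels; dist: {frozenset({a, b}): distance}.
    Returns (tree, height): a nested tuple of labels and the root height."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[frozenset((names[i], names[j]))]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        pair = min(d, key=d.get)      # closest pair of clusters
        i, j = sorted(pair)
        height = d[pair] / 2          # the new node sits halfway between them
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to every other cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean.
        merged = {
            frozenset((next_id, k)):
                (size_i * d[frozenset((i, k))] + size_j * d[frozenset((j, k))])
                / (size_i + size_j)
            for k in clusters
        }
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height

# Hypothetical distances between four taxa.
names = ["A", "B", "C", "D"]
raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree, height = upgma(names, dist)
```

In this example A and B are joined first, then C, then D, giving a rooted tree whose root lies at height 5.0, half the distance between D and the other taxa.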
Algorithm
Various stages of MapReduce in the proposed system
User Interface
The external interface for this project was created using D3.js (D3
for Data-Driven Documents). D3.js is a JavaScript library that
uses digital data to drive the creation and control of dynamic
and interactive graphical forms which run in web browsers.
Performance Analysis
Running Time
[Figure: Running time, number of sequences versus time taken for alignments. X-axis: number of sequences (40, 60, 80, 100, 120). Y-axis: time in milliseconds. Series: Single Node, Two Node.]
Time for the three map reduce stages
[Figure: Comparison of the three stages of MapReduce for different sequence sets. X-axis: MapReduce stages (Stage 1, Stage 2, Stage 3). Y-axis: time in milliseconds. Series: 40, 60, 80, 100 and 120 sequences.]
Throughput or number of sequences aligned per second
[Figure: Throughput, number of sequences aligned per second. X-axis: number of sequences (40, 60, 80, 100, 120). Y-axis: number of sequence alignments per second. Series: Two Node, Single Node.]
Conclusion
The project work proposed a time-efficient approach to
phylogenetic analysis that produces a phylogram
(phylogenetic tree or evolutionary tree).
The proposed method improves the computation time of
phylogenetic analysis while maintaining accuracy.
The dynamic programming approach of NWA, coupled with the data
and computational parallelism of Hadoop data grids, was found to
improve the accuracy and speed of sequence alignment.
