Abhinav Anurag
Darshil Shah
Eklavya Uppal
Ishank Mishra
Introduction
What Bioinformatics is
Bioinformatics is the research, development, or application of
computational tools and approaches for expanding the use of
biological, medical, behavioral or health data, including those
to acquire, store, organize, analyze, or visualize such data.
By developing techniques for analyzing sequence data and
related structures, we can attempt to understand the molecular
basis of life.
Phylogeny
Phylogeny is the study of evolutionary relationships among
groups of organisms (e.g. species, populations), which are
discovered through molecular sequencing data and
morphological data matrices.
Computational Phylogenetics is the application of
computational algorithms, methods and programs to
phylogenetic analyses. The goal is to assemble a phylogenetic
tree representing a hypothesis about the evolutionary ancestry
of a set of genes, species, or other taxa.
Phylogenetic Tree
A phylogenetic tree is a statement about the evolutionary
relationship between a set of homologous characters of one or
several organisms.
Homology is the relationship of two characters that have
descended, usually with divergence, from a common ancestral
character. The characters can be any genic (gene sequence,
protein sequence), structural (i.e. morphological) or
behavioural feature of an organism.
Scientists build phylogenetic trees in an attempt to understand
evolutionary relationships.
Example pairwise matrix for four sequences (diagonal omitted):
        Seq1   Seq2   Seq3   Seq4
Seq1     -     0.8    0.4    0.6
Seq2    0.8     -     0.5    0.6
Seq3    0.2    0.5     -     0.9
Seq4    0.6    0.6    0.9     -
Needleman-Wunsch Algorithm
The NWA (Needleman-Wunsch algorithm) was proposed by
Saul Needleman and Christian Wunsch (1970).
The algorithm performs a Global Alignment on two sequences
and is commonly used in bioinformatics to align protein or
nucleotide sequences.
To find the alignment with the highest score, a two-dimensional
array (or matrix) is allocated. This matrix is often called the F
matrix, and its (i, j)th entry is often denoted by F_ij.
There is one column for each character in sequence A, and
one row for each character in sequence B. Thus, if we
are aligning sequences of sizes n and m, the amount of
memory used by the algorithm is in O(nm).
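The matrix fill and the traceback described here can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(a), len(b)
    # One column per character of a and one row per character of b,
    # plus a leading row and column for the empty prefix: O(nm) memory.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        F[i][0] = i * gap

    # Fill: each cell is the best of a (mis)match, a gap in a, or a gap in b.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[j - 1] == b[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = None
        if i > 0 and j > 0:
            s = match if a[j - 1] == b[i - 1] else mismatch
        if s is not None and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[j - 1])
            out_b.append(b[i - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append("-")          # gap in a, character of b consumed
            out_b.append(b[i - 1])
            i -= 1
        else:
            out_a.append(a[j - 1])     # gap in b, character of a consumed
            out_b.append("-")
            j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` against `A-GT`: three matches and one gap under the assumed scoring.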
Backtracing
Calculation of Score
For example, if the similarity matrix was
then the alignment:
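As a minimal sketch of the score calculation, the score of an alignment is the sum of its per-column values: a similarity-matrix entry for each aligned pair and a gap penalty for each gap. The similarity matrix below (+1 for identical bases, -1 otherwise) and the gap penalty of -1 are hypothetical values chosen for illustration, not the ones from the slides.

```python
ALPHABET = "ACGT"

# Hypothetical similarity matrix: +1 on the diagonal, -1 elsewhere.
SIMILARITY = {(x, y): (1 if x == y else -1)
              for x in ALPHABET for y in ALPHABET}


def alignment_score(aligned_a, aligned_b, similarity, gap_penalty):
    """Sum the per-column scores of two already aligned, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap_penalty          # column containing a gap
        else:
            total += similarity[(x, y)]   # column pairing two characters
    return total
```

With these assumed values, `alignment_score("ACGT", "A-GT", SIMILARITY, -1)` adds 1 - 1 + 1 + 1.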
MapReduce
MapReduce is a programming model for processing large data
sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that
performs filtering and sorting and a Reduce() procedure that
performs a summary operation.
"Map" step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes.
The worker node processes the smaller problem, and passes
the answer back to its master node.
"Reduce" step: The master node then collects the answers to all
the sub-problems and combines them in some way to form the
output, that is, the answer to the problem it was originally
trying to solve.
MapReduce example
mapper (filename, file-contents):
    // filename : document name
    // file-contents : document contents
    for each word in file-contents:
        emit (word, 1)
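The word-count mapper can be paired with a reducer that sums the emitted counts. The sketch below simulates the map, group, and reduce phases in plain Python on a single machine; it does not use Hadoop, and the `reducer` and `run_word_count` names are assumptions added for illustration.

```python
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield word, 1


def reducer(word, counts):
    """Combine all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Simulate map, group-by-key, and reduce over a dict of documents."""
    grouped = defaultdict(list)
    for filename, contents in documents.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)   # group values by key
    return dict(reducer(word, counts) for word, counts in grouped.items())
```

For example, `run_word_count({"a.txt": "to be or not to be", "b.txt": "to do"})` counts `to` three times across both documents.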
Apache Hadoop
Apache Hadoop is an open-source software framework for storage
and large-scale processing of data-sets on clusters of
commodity hardware.
MapReduce is the heart of Hadoop.
While Hadoop can be used on a single machine, its true power lies in
its ability to scale to multiple computers, each with several
processor cores.
Hadoop is also designed to efficiently distribute large amounts of
work across a set of machines.
Problem Statement
Distance-matrix methods of phylogenetic analysis explicitly
rely on a measure of "genetic distance" between the
sequences being classified, and therefore they require multiple
sequence alignments as an input.
Distance methods attempt to construct an all-to-all matrix
from the sequence query set describing the distance between
each sequence pair. Dynamic programming algorithms like the
Needleman-Wunsch algorithm (NWA) and the Smith-Waterman
algorithm (SWA) produce accurate alignments.
But these algorithms are computationally intensive and are
hence limited to a small number of short sequences.
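A rough way to see the cost: an all-to-all matrix over N sequences needs N(N-1)/2 pairwise alignments, and each alignment fills a matrix with one cell per pair of characters. The sketch below only counts that work; the function names and the use of matrix cells as a cost measure are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_jobs(sequences):
    """List (name_a, name_b, cells) for every unordered pair of sequences.

    `cells` is len(a) * len(b), the number of dynamic programming cells
    one pairwise alignment has to fill.
    """
    return [(x, y, len(sequences[x]) * len(sequences[y]))
            for x, y in combinations(sorted(sequences), 2)]


def total_cost(sequences):
    """Return (number of pairwise alignments, total cells to fill)."""
    jobs = pairwise_jobs(sequences)
    return len(jobs), sum(cells for _, _, cells in jobs)
```

For 120 sequences this gives 7140 pairwise alignments. Each pair is independent of the others, which is what makes the work easy to split across worker nodes.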
Proposed Methodology
Input format
This project uses the input data in FASTA format. FASTA format is
a text-based format for representing either nucleotide
sequences or peptide sequences, in which nucleotides or
amino acids are represented using single-letter codes.
A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (">")
symbol in the first column.
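A minimal FASTA reader following this description might look like the sketch below. The `parse_fasta` name and the choice to return a dictionary keyed by the full description line are assumptions for illustration; they are not taken from the project.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping description line to sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)           # sequence data may span many lines
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Sequence lines belonging to one record are concatenated, so a sequence wrapped over several lines is returned as a single string.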
Proposed Algorithm
Algorithm
User Interface
The external interface for this project was created using D3.js (D3
for Data-Driven Documents). D3.js is a JavaScript library that
uses digital data to drive the creation and control of dynamic
and interactive graphical forms which run in web browsers.
Performance Analysis
Running Time
[Figure: Running time in milliseconds versus number of sequences (40 to 120), single node versus two nodes]
[Figure: Time in milliseconds for Stage 1, Stage 2 and Stage 3 (40, 60, 80, 100 and 120 sequences)]
[Figure: Number of sequence alignments per second versus number of sequences (40 to 120), two nodes versus single node]
Conclusion
This project proposed a time-efficient approach to
phylogenetic analysis that produces a phylogram
(phylogenetic tree or evolutionary tree).
The proposed method of phylogenetic analysis was found to
improve computation time while maintaining accuracy.
The dynamic programming nature of NWA, coupled with the data and
computational parallelism of Hadoop data grids, was found to
improve the accuracy and speed of sequence alignment.