
Multiple Alignment

Stuart M. Brown
NYU School of Medicine
Pairwise Alignment

The alignment of two sequences
(DNA or protein) is a relatively
straightforward computational
problem.
The best solution seems to be an
approach called Dynamic
Programming.
Dynamic Programming
Dynamic Programming is a general
programming technique.
It is applicable when a large search space
can be structured into a succession of
stages, such that:
- the initial stage contains trivial solutions to
sub-problems
- each partial solution in a later stage can
be calculated by recurring on a fixed number
of partial solutions in an earlier stage
- the final stage contains the overall
solution
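The three stages above map directly onto pairwise global alignment. Below is a minimal Python sketch that computes only the best score, not the alignment itself. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not taken from these slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a dynamic programming matrix and return the best global score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap          # gap in b
            left = score[i][j - 1] + gap        # gap in a
            score[i][j] = max(diag, up, left)

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("GTA", "AGT"))
```

Each cell looks at a fixed number of neighbors (three), which is why the cost per cell stays constant no matter how long the sequences are.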
Multiple Alignments
Making an optimal alignment between
two sequences is computationally
straightforward, but aligning a large
number of sequences using the same
method is almost impossible.

The problem increases exponentially
with the number of sequences involved,
so it becomes computationally expensive
(and inefficient) for large numbers of
sequences.
Longer Sequences
      A   G   T              A   G   T   A

  G  -1  -1  -2          G  -1  -1  -2   ?

  T  -2  -2  -1          T  -2  -2  -1   ?

  A  -2  -3  -3          A  -2  -3  -3   ?

                         C   ?   ?   ?   ?

What happens to the number of cells in the matrix when we add another
base to one sequence?
How about to both?
# cells = L1 x L2, or L^2 if we use 2 sequences of the same length.
So the amount of computing grows with the square of the sequence length: bad, but
not terrible, because the compute time for each cell remains constant.
Align Three Sequences by Dynamic Programming

Georg Fullen, VSNS Biocomputing, Univ. Munster

So how many cells (that contain values that must be computed) do we add for each additional
sequence? It's a power function! For N sequences of length L: # of cells = 2^N x L^N
This is very bad for computing alignments of a lot of sequences!

If the calculation takes 1 nanosecond per cell, then for 6 sequences of length 100, we'll have a
running time of 2^6 x 100^6 x 10^-9 seconds (64,000 seconds, about 18 hours). Just add 2 more sequences, and
the running time is 2^8 x 100^8 x 10^-9 = 2.6 x 10^9 seconds (~81 years).
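The arithmetic is easy to check. This is a small Python sketch using the slide's cost model (2^N x L^N cells at 1 nanosecond each). The function name and the `SECONDS_PER_YEAR` constant are illustrative assumptions. It prints each estimate in seconds, hours, and years so the orders of magnitude can be compared directly.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def dp_cost_seconds(n_seqs, length, seconds_per_cell=1e-9):
    """Estimated running time under the 2^N x L^N cost model."""
    cells = (2 ** n_seqs) * (length ** n_seqs)
    return cells * seconds_per_cell


for n in (2, 6, 8):
    secs = dp_cost_seconds(n, 100)
    print(f"{n} sequences: {secs:.3g} s"
          f" = {secs / 3600:.3g} h"
          f" = {secs / SECONDS_PER_YEAR:.3g} years")
```

Changing `length` or `seconds_per_cell` shows that a faster computer only shifts the curve: the exponent in N still dominates.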
Global vs. Local Multiple Alignments
Global alignment algorithms start at the beginning
of two sequences and add gaps to each until the
end of one is reached.
Local alignment algorithms find the region (or
regions) of highest similarity between two
sequences and build the alignment outward from
there.
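For two sequences, the local variant can be sketched with one change to the dynamic programming matrix: a cell's score is never allowed to drop below zero, and the answer is the best cell anywhere rather than the last cell. The Python below is a minimal Smith-Waterman-style sketch. The scoring values (match +2, mismatch -1, gap -1) and the function name are assumptions for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the score of the best local alignment between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh alignment here
                score[i - 1][j - 1] + pair,  # extend along the diagonal
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            best = max(best, score[i][j])
    return best


# The shared core GATTACA is found despite unrelated flanking sequence.
print(local_alignment_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
```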
Optimal Alignment
For a given group of sequences,
there is no single "correct"
alignment, only an alignment that
is "optimal" according to some set
of calculations.
Determining what alignment is best
for a given set of sequences is
really up to the judgement of the
investigator.
Progressive Pairwise Methods
Most of the available multiple
alignment programs use some sort of
incremental or progressive method
that makes pairwise alignments,
averages them into a consensus
(actually a profile), then adds new
sequences one at a time to the
aligned set.
This is an approximate method!
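A profile, in the sense used here, records how often each residue occurs in each column of the sequences aligned so far. The Python sketch below shows only that one step. It assumes the input sequences are already aligned to equal length with "-" for gaps, and it omits the sequence weighting and guide-tree ordering that real progressive programs apply. The function name is hypothetical.

```python
from collections import Counter


def build_profile(aligned):
    """Per-column residue frequencies for equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Convert raw counts to fractions of the sequences in the set.
        profile.append({res: n / len(aligned) for res, n in counts.items()})
    return profile


profile = build_profile(["AC-T", "ACGT", "AG-T"])
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

A new sequence is then scored against these column frequencies rather than against any single earlier sequence, which is why early alignment errors are carried forward.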
CLUSTALW
CLUSTAL is the most popular multiple
alignment program
Gap penalties can be adjusted based on
specific amino acid residues, regions of
hydrophobicity, proximity to other gaps, or
secondary structure.
It can re-align just selected sequences or
selected regions in an existing alignment.
It can compute phylogenetic trees from a set
of aligned sequences.
Unix command line program
Website: http://www.ebi.ac.uk/Tools/clustalw2/index.html

There are also Mac and PC versions with a
nice graphical interface (CLUSTALX).
CLUSTALW2 at the EBI website
http://www.ebi.ac.uk/Tools/clustalw2/index.html
Other Multiple Alignment Tools

MUSCLE
http://www.ebi.ac.uk/Tools/muscle/index.html

T-COFFEE
http://www.ebi.ac.uk/Tools/t-coffee/

MSA
Editing Multiple Alignments
There are a variety of tools that can be used
to modify and display a multiple alignment.
These programs can be very useful in
formatting and annotating an alignment for
publication.
An editor can also be used to make
modifications by hand to improve
biologically significant regions in a multiple
alignment created by an alignment program.
Alignment editors
The MACAW and SeqVu programs for
Macintosh and the GeneDoc and DCSE programs for
PCs are free and provide excellent
editor functionality.
Many comprehensive molecular
biology programs include multiple
alignment functions:
Sequencher, MacVector, DS Gene, and Vector
NTI all include a built-in version of CLUSTAL.
SeqVu
JalView
Install on your machine or run as a Java WebStart application
Check out CINEMA (Colour
INteractive Editor for Multiple
Alignments)
It is an editor created completely
in JAVA (old browsers beware)
It includes a fully functional
version of CLUSTAL, BLAST, and
a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
Analysis of Alignments
Once you have a multiple alignment,
what can you do with it?
1) Identify regions of similarity and difference
- Conserved regions may be functionally important,
and/or sites for inclusive (cross species) primer
design
- Variable regions may be functionally important,
and/or sites for gene/allele-specific primer design
2) Create a sequence logo
3) Build a Phylogenetic Tree (next week)
Format a Multiple Alignment
The concept of a consensus sequence is implied by any
multiple alignment. There can be various rules for building
the consensus: simple majority rules, plurality by a
specific %, etc.
The alignment may look nicer if it shows how each letter
matches the consensus, highlighting the differences.
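Both ideas fit in a few lines. The Python sketch below builds a consensus with a simple plurality rule and then masks the residues that agree with it. The `min_fraction` parameter and both function names are hypothetical, and this rule is not the one used by PRETTY or the EMBOSS programs.

```python
from collections import Counter


def consensus(aligned, min_fraction=0.5, gap="-"):
    """Most common residue per column, or '-' if it is a gap or too rare."""
    out = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if residue != gap and count / len(aligned) >= min_fraction:
            out.append(residue)
        else:
            out.append("-")
    return "".join(out)


def show_match(seq, cons):
    """Replace residues that match the consensus with '.', keep differences."""
    return "".join("." if s == c and c != "-" else s
                   for s, c in zip(seq, cons))


seqs = ["GATTACA", "GACTACA", "GATTGCA", "CATTACA"]
cons = consensus(seqs, min_fraction=0.75)
print("Consensus", cons)
for s in seqs:
    print("         ", show_match(s, cons))
```

Raising `min_fraction` toward 1.0 leaves only the fully conserved columns in the consensus, which is one quick way to spot them.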

1) PLOTSIMILARITY (a graph of overall similarity
across the alignment) EMBOSS = plotcon

2) Show match to consensus = showalign

3) Shade by similarity = prettyplot/Boxshade
Plurality: 2.00 Threshold: 4
AveWeight 0.55 AveMatch 2.91 AvMisMatch -2.00

PRETTY of: @pretty.list October 7, 1998 10:35 ..

1 50
fa10.ugly .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly .......... .......... .......... ..TTsaGESA D.PvtTtVE.
e.ugly Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
p1m.ugly GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
p1s.ugly GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
p2s.ugly GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
p3s.ugly Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
cb3.ugly ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
r14.ugly GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
r2.ugly ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET

/////////////////////////////////////////////////////////////////

301 349
fa10.ugly aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
e.ugly krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
p1m.ugly irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
p1s.ugly irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
p2s.ugly VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
p3s.ugly VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
cb3.ugly VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
r14.ugly VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
r2.ugly VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------
Boxshade
Shade each letter of the alignment based on its match to the
consensus
- highlights conserved regions
- much more informative for protein alignments (shades
of grey for similar amino acids)

http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade

http://www.ch.embnet.org/software/BOX_form.html
Sequence Logos
http://weblogo.berkeley.edu/logo.cgi
http://weblogo.threeplusone.com/create.cgi
http://genome.tugraz.at/Logo/
T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display
consensus sequences. Nucleic Acids Research, Vol. 18, No. 20, p. 6097-6100 (1990).
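The height of each stack in a logo is the information content of that column, measured in bits. Below is a minimal Python sketch of that calculation for DNA (maximum 2 bits per column). It is a simplification: it ignores gaps and leaves out the small-sample correction described in the paper. The function name is hypothetical.

```python
import math
from collections import Counter


def column_information(aligned, alphabet_size=4):
    """Information content in bits for each alignment column.

    Uses R = log2(alphabet_size) - H, where H is the Shannon entropy of
    the column, with no small-sample correction.
    """
    bits = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            bits.append(0.0)  # an all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


# Column 1 is fully conserved; column 2 has all four bases equally often.
print(column_information(["AA", "AC", "AG", "AT"]))
```

Within a stack, each letter's height is its frequency in the column multiplied by the column's information content. For protein alignments, `alphabet_size=20` would be the matching assumption.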
Building on Alignments
Multiple Alignments are the starting
point for calculating phylogenetic
trees
Motifs and Profiles are calculated
from multiple alignments
