Alignment
Stuart M. Brown
NYU School of Medicine
Pairwise Alignment
[Figure: partially filled dynamic programming matrix for a pairwise alignment, with rows labelled G, T, A, C; cells marked ? are still to be computed]
What happens to the number of cells in the matrix when we add another
base to one sequence?
How about to both?
# cells = L1 x L2, or L^2 if the 2 sequences have the same length.
So the amount of computing grows with the square of the sequence length: bad, but
not terrible, because the compute time for each cell remains constant.
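The matrix fill described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used on the slide: the function name and the match, mismatch, and gap values are assumptions chosen for the example. Note that the code adds one border row and column for leading gaps, so the matrix holds (L1 + 1) x (L2 + 1) values, of which L1 x L2 are computed in the inner loop.

```python
def global_alignment_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill a global-alignment score matrix (scoring values are assumed)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Border cells: aligning a prefix against nothing costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each interior cell needs only its three neighbours, so the work per
    # cell is constant and the total work is proportional to L1 x L2.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    return score


if __name__ == "__main__":
    matrix = global_alignment_matrix("GTAC", "GTC")
    for row in matrix:
        print(row)
    print("alignment score:", matrix[-1][-1])
```

Adding one base to one sequence adds a row (or column) of cells; adding one base to both adds a row and a column.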
Align Three Sequences by
Dynamic programming
So how many cells (that contain values that must be computed) do we add for each additional
sequence? It's a power function! For N sequences of length L: # of cells = 2^N x L^N
This is very bad for computing alignments of a lot of sequences!
If the calculation takes 1 nanosecond per cell, then for 6 sequences of length 100, we'll have a
running time of 2^6 x 100^6 x 10^-9 seconds (64,000 seconds, about 18 hours). Just add 2 more
sequences, and the running time is 2^8 x 100^8 x 10^-9 = 2.6 x 10^9 seconds (~81 years)
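The running-time estimate above is easy to reproduce. The sketch below simply evaluates the slide's formula, 2^N x L^N cells at a fixed cost per cell; the function name and the unit conversions are illustrative additions.

```python
def msa_dp_seconds(n_sequences, length, seconds_per_cell=1e-9):
    """Estimated run time for full dynamic programming on N sequences.

    Uses the formula from the slide: cells = 2^N x L^N.
    """
    cells = (2 ** n_sequences) * (length ** n_sequences)
    return cells * seconds_per_cell


if __name__ == "__main__":
    for n in (2, 3, 6, 8):
        seconds = msa_dp_seconds(n, 100)
        days = seconds / 86_400
        years = days / 365.25
        print(f"N={n}: {seconds:.3g} s = {days:.3g} days = {years:.3g} years")
```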
Global vs. Local Multiple
Alignments
Global alignment algorithms start at the beginning
of two sequences and add gaps to each until the
end of one is reached.
Local alignment algorithms find the region (or
regions) of highest similarity between two
sequences and build the alignment outward from
there.
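One common way to implement the local idea is the Smith-Waterman variant of dynamic programming, which the slides do not name; it is used here only as an illustration. The sketch differs from a global fill in two ways: no cell may drop below zero, and the best score is taken from anywhere in the matrix rather than from the last cell. The function name and scoring values are assumptions.

```python
def local_alignment_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (best score, (i, j) end position) of the best local alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh local alignment
                score[i - 1][j - 1] + pair,  # extend along the diagonal
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos


if __name__ == "__main__":
    # The shared region GATTACA is found despite unrelated flanking bases.
    print(local_alignment_score("TTTGATTACATTT", "CCGATTACACC"))
```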
Optimal Alignment
For a given group of sequences,
there is no single "correct"
alignment, only an alignment that
is "optimal" according to some set
of calculations.
Determining what alignment is best
for a given set of sequences is
really up to the judgement of the
investigator.
Progressive Pairwise
Methods
Most of the available multiple
alignment programs use some sort of
incremental or progressive method
that makes pairwise alignments,
averages them into a consensus
(actually a profile), then adds new
sequences one at a time to the
aligned set.
This is an approximate method!
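The "profile" mentioned above can be pictured as a table of residue frequencies, one entry per alignment column. The sketch below builds such a table from sequences that are already aligned; it is a simplified illustration, not how any particular program stores its profiles, and the function name is hypothetical.

```python
from collections import Counter


def build_profile(aligned_seqs):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


if __name__ == "__main__":
    for position, column in enumerate(build_profile(["GA-C", "GATC", "GTTC"]), 1):
        print(position, column)
```

In a progressive method, the next sequence is aligned against these column frequencies rather than against a single sequence, and the profile is then rebuilt with the new sequence included.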
CLUSTALW
CLUSTAL is the most popular multiple
alignment program
Gap penalties can be adjusted based on
specific amino acid residues, regions of
hydrophobicity, proximity to other gaps, or
secondary structure.
It can re-align just selected sequences or
selected regions in an existing alignment
It can compute phylogenetic trees from a set
of aligned sequences.
Unix command line program
Website: http://www.ebi.ac.uk/Tools/clustalw2/index.html
MUSCLE
http://www.ebi.ac.uk/Tools/muscle/index.html
T-Coffee
http://www.ebi.ac.uk/Tools/t-coffee/
MSA
Editing Multiple
Alignments
There are a variety of tools that can be used
to modify and display a multiple alignment.
These programs can be very useful in
formatting and annotating an alignment for
publication.
An editor can also be used to make
modifications by hand to improve
biologically significant regions in a multiple
alignment created by an alignment program.
Alignment editors
The MACAW and SeqVu programs for
Macintosh, and GeneDoc and DCSE for
PCs, are free and provide excellent
editor functionality.
Many comprehensive molecular
biology programs include multiple
alignment functions:
Sequencher, MacVector, DS Gene, and Vector
NTI all include a built-in version of CLUSTAL
SeqVu
JalView
Install on your machine or run as a Java WebStart application
Check out CINEMA (Colour
INteractive Editor for Multiple
Alignments)
It is an editor created completely
in JAVA (old browsers beware)
It includes a fully functional
version of CLUSTAL, BLAST, and
a DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
Analysis of Alignments
Once you have a multiple alignment,
what can you do with it?
1) Identify regions of similarity and difference
- Conserved regions may be functionally important,
and/or sites for inclusive (cross-species) primer
design
- Variable regions may be functionally important,
and/or sites for gene/allele-specific primer design
2) Create a sequence logo
3) Build a Phylogenetic Tree (next week)
Format a Multiple Alignment
The concept of a consensus sequence is implied by any
multiple alignment. There can be various rules for building
the consensus: simple majority rules, plurality by a
specific %, etc.
The alignment may look nicer if it shows how each letter
matches the consensus and highlights the differences.
1 50
fa10.ugly .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly .......... .......... .......... ..TTsaGESA D.PvtTtVE.
e.ugly Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
p1m.ugly GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
p1s.ugly GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
p2s.ugly GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
p3s.ugly Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
cb3.ugly ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
r14.ugly GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
r2.ugly ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET
/////////////////////////////////////////////////////////////////
301 349
fa10.ugly aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
e.ugly krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
p1m.ugly irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
p1s.ugly irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
p2s.ugly VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
p3s.ugly VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
cb3.ugly VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
r14.ugly VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
r2.ugly VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------
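A consensus rule of the kind described above can be sketched as follows. The threshold, the gap characters, and the function names are assumptions made for the example; the program that produced the alignment shown here may use different rules. The `highlight` helper writes residues that agree with the consensus in upper case and all others in lower case, which is one simple way to make the differences stand out.

```python
from collections import Counter


def consensus(aligned_seqs, min_fraction=0.5, gap_chars=".-", placeholder="-"):
    """Majority-rule consensus: keep a residue only if it occurs in more
    than min_fraction of the sequences at that column."""
    result = []
    for column in zip(*aligned_seqs):
        residues = [c.upper() for c in column if c not in gap_chars]
        if not residues:
            result.append(placeholder)
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append(residue if count / len(column) > min_fraction else placeholder)
    return "".join(result)


def highlight(seq, cons):
    """Upper-case residues that match the consensus, lower-case the rest."""
    return "".join(
        c.upper() if c.upper() == k else c.lower() for c, k in zip(seq, cons)
    )


if __name__ == "__main__":
    seqs = ["GATC", "GATC", "GTTC", "G.TA"]
    cons = consensus(seqs)
    for seq in seqs:
        print(highlight(seq, cons))
    print(cons)
```

Raising `min_fraction` gives a stricter consensus with more placeholder positions; lowering it moves toward a plurality rule.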
Boxshade
Shade each letter of the alignment based on its match to the
consensus
highlights conserved regions
much more informative for protein alignments (shades
of grey for similar amino acids)
http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade
http://www.ch.embnet.org/software/BOX_form.html
Sequence Logos
http://weblogo.berkeley.edu/logo.cgi
http://weblogo.threeplusone.com/create.cgi
http://genome.tugraz.at/Logo/
T. D. Schneider and R. M. Stephens (1990). Sequence logos: a new way to display
consensus sequences. Nucleic Acids Research, Vol. 18, No. 20, p. 6097-6100.
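In a sequence logo, the total height of each column reflects its information content in bits, and each letter's height is proportional to its frequency in that column. The sketch below computes those heights for an alignment as log2(alphabet size) minus the column's Shannon entropy. It is a simplified illustration: it omits the small-sample correction used in the published method, ignores gap characters, and uses hypothetical function names.

```python
import math
from collections import Counter

GAP_CHARS = ".-"


def column_information(column, alphabet_size=4):
    """Information content of one column in bits (no small-sample correction)."""
    residues = [c for c in column if c not in GAP_CHARS]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(aligned_seqs, alphabet_size=4):
    """Per-column {residue: height in bits} for drawing a logo."""
    heights = []
    for column in zip(*aligned_seqs):
        residues = [c for c in column if c not in GAP_CHARS]
        info = column_information(column, alphabet_size)
        heights.append(
            {res: (n / len(residues)) * info for res, n in Counter(residues).items()}
        )
    return heights


if __name__ == "__main__":
    # alphabet_size=4 suits DNA; use 20 for protein alignments.
    for position, column in enumerate(logo_heights(["AAGT", "AACT", "AATT"]), 1):
        print(position, column)
```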
Building on Alignments
Multiple Alignments are the starting
point for calculating phylogenetic
trees
Motifs and Profiles are calculated
from multiple alignments