
Biochem 711 2008

L11: Alignments -> Evolution: MEGA


Table of Contents
Introduction
Acknowledgements

L11 Exercise A: Set up
  1. Launch MEGA
  2. Retrieve Sequence

L11 Exercise B: BLAST and Align within MEGA
  1. Launch MEGA web browser
  2. BLAST search within MEGA
    2.1. Paste sequence
    2.2. Select the database to be searched
    2.3. Optimization algorithm: blastn
    2.4. Press BLAST
    2.5. BLAST results
    2.6. Selecting results for alignment
  3. Preparing the Alignment within MEGA
    3.1. Add first sequence to alignment
    3.2. Add additional sequences to be aligned
    3.3. Save the current list
  4. Create the Alignment
    4.1. Algorithm: ClustalW
    4.2. Perform the alignment
    4.3. Adjustments to the Aligned Sequences
    4.4. Adjustments to the Alignment

L11 Exercise C: Calculate a Neighbor-Joining Tree
  1. Open alignment file
  2. Activate Neighbor-Joining

L11 Exercise D: Precision in Acquiring and Aligning Sequences
  1. Acquiring Query Sequence
  2. BLAST within MEGA
    2.1. Set-up
    2.2. BLAST results
  3. Build the alignment list
    3.1. Edit Sequence Names
    3.2. Edit START codons
  4. Translate to Protein
  5. Set parameters and calculate protein alignment
  6. Alignment adjustments
  7. Export alignment in DNA and protein forms
  8. Eliminate duplicate sequences
  9. Eliminate inadequate sequences
    9.1. Biology and Structure of the protein
    9.2. Remove sequence
  10. Estimate reliability of alignment with Average AA Identity

L11 Exercise E: Neighbor-Joining Phylogenetic Tree, Rooting
  1. Create Neighbor-Joining Tree
  2. Estimating the reliability of a tree: Bootstrapping
  3. Tree Rooting
    3.1. Finding an outgroup
    3.2. Rooting the tree

L11 Exercise F: End of laboratory


Introduction

! INFO

http://en.wikipedia.org/wiki/Evolution, http://en.wikipedia.org/wiki/Phylogenetics, http://en.wikipedia.org/wiki/Phylogenetic_tree

In biology, evolution refers to changes in the inherited traits of a population of organisms from one generation to the next. Genes that are passed on to an organism's offspring produce the inherited traits that are the basis of evolution. Phylogenetics is the study of evolutionary relatedness among various groups of organisms (e.g., species, populations). A phylogenetic tree, also called an evolutionary tree, is a tree showing the evolutionary relationships among various biological species that are believed to have a common ancestor. In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and the edge lengths in some trees correspond to time estimates. Each node is called a taxonomic unit. Taxonomy is the classification of organisms according to similarity. Although phylogenetic trees produced on the basis of sequenced genes or genomic data in different species can provide evolutionary insight, they have important limitations: they do not necessarily represent the species' evolutionary history accurately.

! INFO

http://en.wikipedia.org/wiki/Multiple_sequence_alignment

A multiple sequence alignment is a sequence alignment of three or more biological sequences. In general, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting multiple sequence alignment, sequence homology can be inferred (homology refers to any similarity between characters that is due to their shared ancestry) and phylogenetic analysis can be conducted to assess the shared evolutionary origins amongst the sequences.

In practical terms, a tree is constructed from a multiple alignment of homologous sequences. The quality of the alignment is the most influential factor for the calculated trees. In these exercises we will use the MEGA software, which can retrieve sequences, create a multiple sequence alignment with the Clustal algorithm and calculate a tree with various methods.

Quoting from the web site http://www.megasoftware.net/

MEGA 4: Molecular Evolutionary Genetics Analysis

MEGA is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.

References:
Kumar S, Dudley J, Nei M & Tamura K (2008) MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences. Briefings in Bioinformatics 9: 299-306.
Tamura K, Dudley J, Nei M & Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.

Acknowledgements
This laboratory is loosely inspired by Barry G. Hall's book Phylogenetic Trees Made Easy: A How-to Manual, Third Edition.

L11 Exercise A: Set up


MEGA 4 has been tested on the following Microsoft Windows operating systems: Windows 95/98, NT, 2000, XP, and Vista. If you are working from home you can download MEGA from
http://www.megasoftware.net/

Your DMC computer should be running in Windows mode. If not, ask for help.

1. Launch MEGA


! TASK
Launch MEGA with the menu cascade Start > All Programs > Mega 4

2. Retrieve Sequence

! INFO
The first complete E. coli genome was announced on September 5, 1997 in the journal Science [1] as a milestone in complete genome elucidation. In the following example we will use the nuoL gene, defined as "NADH:ubiquinone oxidoreductase, membrane subunit L" within the complete genome entry. Note: database searches now often find complete genomes. A particular gene is then referenced by the begin and end values that specify a sequence region. For example, the nuoL gene is defined within the complete genome of K12 substr. DH10B, which has the accession number CP000948:

LOCUS       CP000948    1842 bp    DNA    linear    BCT 05-JUN-2008
DEFINITION  Escherichia coli str. K12 substr. DH10B, complete genome.
ACCESSION   CP000948 REGION: 2482992..2484833
VERSION     CP000948.1 GI:169887498
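The begin and end values of the REGION field are 1-based and inclusive, so the region length can be checked against the 1842 bp reported on the LOCUS line. A minimal Python sketch (the function name region_length is ours, chosen for illustration):

```python
def region_length(region: str) -> int:
    """Length in bases of a 'begin..end' region (1-based, inclusive)."""
    begin_text, end_text = region.split("..")
    begin, end = int(begin_text), int(end_text)
    return end - begin + 1


# REGION value taken from the ACCESSION line of the entry above
print(region_length("2482992..2484833"))  # 1842
```

The "+ 1" is needed because both the first and the last base belong to the region.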

! TASK

The nuoL sequence will be provided to you either as a text file within the Classroom Scratch directory or in the resources section of virology.wisc.edu/acp.

Open the file, highlight and copy the sequence to the clipboard. This is the Query sequence.

[1] Blattner FR et al. The complete genome sequence of Escherichia coli K-12. Science. 1997 Sep 5; 277(5331):1453-74.


L11 Exercise B: BLAST and Align within MEGA

1. Launch MEGA web browser

! TASK

MEGA has a built-in web browser that we will use to find and retrieve sequences. Follow the menu Alignment > Do BLAST Search to launch the internal browser that goes directly to NCBI BLAST.

2. BLAST search within MEGA


2.1. Paste sequence

! TASK
Paste the query sequence from the clipboard into the Enter Query Sequence window.

2.2. Select the database to be searched

! TASK
The default database is the Human genome. Change to the non-redundant (nr) complete nucleotide database collection.

2.3. Optimization algorithm: blastn

Note: The default search should be blastn.

2.4. Press BLAST

Before pressing the BLAST button you can optionally choose to Show results in a new window. Verify that the parameters are: Search database nr using Blastn (Optimize for somewhat similar sequences)

! Press BLAST.
A new job page will appear and update within seconds. If a warning sign is presented, simply press OK.

2.5. BLAST results

! READ
The result page first shows a graphical overview of the hits. The length of each bar indicates the region of similarity with the query sequence. The bar is colored by alignment score according to the color key above it. For example, bars colored red indicate an alignment score of 200 or more. The graphical output is followed by a table showing the sequence accession number, description and scores. Databases are updated constantly, so your results will likely be somewhat different.

Shown below are the top and bottom portions of the output as of this writing. The Max score is the highest alignment score (in bits) found between the query and that database sequence. For a nucleotide search it is computed from match rewards, mismatch penalties and gap penalties; a comparison matrix such as BLOSUM62 applies only to protein searches. The E-value is defined at
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Note: the Max ident (maximum identity) can be 100% at the bottom of the table as well, but in these cases the Query coverage is much lower and the E-values are very high.
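To illustrate how the score and the E-value relate: BLAST statistics use the relation E = m x n x 2^(-S'), where S' is the score in bits and m x n is the effective search space (roughly, query length times database length). The Python sketch below is only an illustration under that relation. It estimates the search space from one weak hit of this search (41.1 bits, Expect = 3.0, quoted with the pair-wise alignments in this section) and uses it to predict the E-value of the 583-bit hit. The function names are ours, and the estimated search space is an assumption that holds only for this particular search and database version.

```python
def search_space_from_hit(bit_score: float, e_value: float) -> float:
    """Effective search space m*n implied by one hit, from E = m*n * 2**(-S')."""
    return e_value * 2.0 ** bit_score


def expected_hits(bit_score: float, search_space: float) -> float:
    """E-value for a bit score S' in a search space m*n: E = m*n * 2**(-S')."""
    return search_space * 2.0 ** (-bit_score)


# Weak hit from this search: 41.1 bits, Expect = 3.0
space = search_space_from_hit(41.1, 3.0)

# Predicted E-value for the best hit (583 bits); BLAST reported 2e-163
print(f"{expected_hits(583, space):.1e}")
```

Because the reported bit scores are rounded, the prediction can only be expected to agree with the BLAST report to about one significant digit.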

//

Finally, the pair-wise alignments are shown sorted by increasing E-value, from the most significant to the least significant. An example from the low-significance end of the list is shown below:

>gi|60650283|gb|BT021192.1| Bos taurus SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily b, member 1 (SMARCB1), mRNA, complete cds
Length=1491

GENE ID: 537412 SMARCB1 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily b, member 1 [Bos taurus] (10 or fewer PubMed links)

Score = 41.1 bits (21), Expect = 3.0
Identities = 29/33 (87%), Gaps = 0/33 (0%)
Strand=Plus/Plus

Query  77  CCGATACTCGCTTCTGCCGCCGCGAGGCTGATG  109
           ||||| ||||||||||||||||| | | |||||
Sbjct  20  CCGATCCTCGCTTCTGCCGCCGCAATGATGATG  52
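The Identities and Gaps figures reported by BLAST can be recomputed from the two aligned strings. The following Python sketch (function name ours, for illustration) counts identical positions and gapped positions in the Bos taurus alignment above and prints them in the same style as the BLAST report:

```python
def identity_stats(query: str, subject: str) -> tuple:
    """Return (identities, gaps, alignment length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(query, subject))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return identities, gaps, len(query)


# Aligned segments from the Bos taurus hit shown above
query = "CCGATACTCGCTTCTGCCGCCGCGAGGCTGATG"
sbjct = "CCGATCCTCGCTTCTGCCGCCGCAATGATGATG"

identities, gaps, length = identity_stats(query, sbjct)
# Integer division truncates the percentage, as in the BLAST report
print(f"Identities = {identities}/{length} ({100 * identities // length}%), "
      f"Gaps = {gaps}/{length}")
```

A high percentage over only 33 bases says little about homology, which is why the E-value and the query coverage must be checked as well.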

2.6. Selecting results for alignment

! READ

To create a phylogenetic tree we only want to include homologs: sequences that have a common ancestor. This is the general assumption in phylogenetics. We can select which sequences we want to include as judged by the pair-wise sequence alignment.

! TASK

Click on the Max Score for the first result in the list, here 583 for Escherichia coli str. K12 substr. DH10B, complete genome. This is a direct link to the alignment:

>gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. DH10B, complete genome
Length=4686137

Features in this part of subject sequence:
  NADH:ubiquinone oxidoreductase, membrane subunit L
  NADH:ubiquinone oxidoreductase, membrane subunit K

Score = 583 bits (303), Expect = 2e-163
Identities = 303/303 (100%), Gaps = 0/303 (0%)
Strand=Plus/Plus

Query  1        TCATCCGCGCATCTCACTTACTGAATCGATGTTCAGGTTCTGGCGACGACGGTGAAGTTG  60
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2484830  TCATCCGCGCATCTCACTTACTGAATCGATGTTCAGGTTCTGGCGACGACGGTGAAGTTG  2484889

Query  61       CAGCAGCAGCGCAAGGCCGATACTCGCTTCTGCCGCCGCGAGGCTGATGGCGAGAATGTA  120
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2484890  CAGCAGCAGCGCAAGGCCGATACTCGCTTCTGCCGCCGCGAGGCTGATGGCGAGAATGTA  2484949

Query  121      CATCACCTGACCGTCGGTCTGGCCCCAGTAGCTTCCGGCGACCACGAAGGCCAGCGCGGA  180
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2484950  CATCACCTGACCGTCGGTCTGGCCCCAGTAGCTTCCGGCGACCACGAAGGCCAGCGCGGA  2485009

Query  181      GGCGTTAATCATGATTTCCAGACCAATCAACATAAACAGCAGATTGCGACGGATAACCAG  240
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2485010  GGCGTTAATCATGATTTCCAGACCAATCAACATAAACAGCAGATTGCGACGGATAACCAG  2485069

Query  241      ACCGGTTAAGCCAAGAACGAATAAGATTGCCGCGAGGATCAGTCCATGTTGTAAGGGGAT  300
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  2485070  ACCGGTTAAGCCAAGAACGAATAAGATTGCCGCGAGGATCAGTCCATGTTGTAAGGGGAT  2485129

Query  301      CAT  303
                |||
Sbjct  2485130  CAT  2485132

This is a perfect match: 303 out of 303 and no gaps. It is in fact the query sequence that we will want to include in the tree.

With a right-mouse click, open the subunit L in a new browser window.


The resulting window will show the information relevant to the gene as taken out of the complete genome sequence thanks to the Range: from option which is filled automatically when opening the link:

LOCUS       CP000948                1842 bp    DNA     linear   BCT 05-JUN-2008
DEFINITION  Escherichia coli str. K12 substr. DH10B, complete genome.
ACCESSION   CP000948 REGION: 2482992..2484833
VERSION     CP000948.1  GI:169887498
PROJECT     GenomeProject:20079
KEYWORDS    .
SOURCE      Escherichia coli str. K12 substr. DH10B
  ORGANISM  Escherichia coli str. K12 substr. DH10B
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1842)
  AUTHORS   Durfee,T., Nelson,R., Baldwin,S., Plunkett,G. III, Burland,V.,
            Mau,B., Petrosino,J.F., Qin,X., Muzny,D.M., Ayele,M., Gibbs,R.A.,
            Csorgo,B., Posfai,G., Weinstock,G.M. and Blattner,F.R.
  TITLE     The complete genome sequence of Escherichia coli DH10B: insights
            into the biology of a laboratory workhorse
  JOURNAL   J. Bacteriol. 190 (7), 2597-2606 (2008)
   PUBMED   18245285
REFERENCE   2  (bases 1 to 1842)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-FEB-2008) Department of Genetics and Biotechnology,
            University of Wisconsin, 425G Henry Mall, Madison, WI 53706, USA
COMMENT     DH10B and DH10B-T1R are available from Invitrogen Corporation
            (http://www.invitrogen.com).
FEATURES             Location/Qualifiers
     source          1..1842
                     /organism="Escherichia coli str. K12 substr. DH10B"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /sub_strain="DH10B"
                     /db_xref="taxon:316385"
     gene            complement(1..1842)
                     /gene="nuoL"
                     /locus_tag="ECDH10B_2440"
     CDS             complement(1..1842)
                     /gene="nuoL"
                     /locus_tag="ECDH10B_2440"
                     /codon_start=1
                     /transl_table=11
                     /product="NADH:ubiquinone oxidoreductase, membrane
                     subunit L"
                     /protein_id="ACB03437.1"
                     /db_xref="GI:169889730"
                     /db_xref="ASAP:AEC-0002144"
                     /translation="MNMLALTIILPLIGFVLLAFSRGRWSENVSAIVGVGSVGLAALV
                     TAFIGVDFFANGEQTYSQPLWTWMSVGDFNIGFNLVLDGLSLTMLSVVTGVGFLIHMY
                     ASWYMRGEEGYSRFFAYTNLFIASMVVLVLADNLLLMYLGWEGVGLCSYLLIGFYYTD
                     PKNGAAAMKAFVVTRVGDVFLAFALFILYNELGTLNFREMVELAPAHFADGNNMLMWA
                     TLMLLGGAVGKSAQLPLQTWLADAMAGPTPVSALIHAATMVTAGVYLIARTHGLFLMT
                     PEVLHLVGIVGAVTLLLAGFAALVQTDIKRVLAYSTMSQIGYMFLALGVQAWDAAIFH
                     LMTHAFFKALLFLASGSVILACHHEQNIFKMGGLRKSIPLVYLCFLVGGAALSALPLV
                     TAGFFSKDEILAGAMANGHINLMVAGLVGAFMTSLYTFRMIFIVFHGKEQIHAHAVKG
                     VTHSLPLIVLLILSTFVGALIVPPLQGVLPQTTELAHGSMLTLEITSGVVAVVGILLA
                     AWLWLGKRTLVTSIANSAPGRLLGTWWYNAWGFDWLYDKVFVKPFLGIAWLLKRDPLN
                     SMMNIPAVLSRFAGKGLLLSENGYLRWYVASMSIGAVVVLALLMVLR"
     gene            complement(1839..>1842)
                     /gene="nuoK"
                     /locus_tag="ECDH10B_2441"
     CDS             complement(1839..>1842)
                     /gene="nuoK"
                     /locus_tag="ECDH10B_2441"
                     /codon_start=1
                     /transl_table=11
                     /product="NADH:ubiquinone oxidoreductase, membrane
                     subunit K"
                     /protein_id="ACB03438.1"
                     /db_xref="GI:169889731"
                     /db_xref="ASAP:AEC-0002145"
                     /translation="MIPLQHGLILAAILFVLGLTGLVIRRNLLFMLIGLEIMINASAL
                     AFVVAGSYWGQTDGQVMYILAISLAAAEASIGLALLLQLHRRRQNLNIDSVSEMRG"
ORIGIN
        1 tcaacgcagt accatcaaca gtgccagcac cacgaccgca ccgatgctca tggatgccac
       61 ataccagcgc agatagccgt tctcacttaa cagcagacct ttacctgcaa agcgggaaag
      121 gacagccggg atgttcatca ttgagttcag cggatcgcgt ttcagcaacc aggcaatacc
      181 caggaacggc ttgacgaaca ctttgtcata cagccagtca aatccccagg cgttgtacca
      241 ccaggtgccc agcagacggc ccggcgcact gttggcgatg gaggtcacca gagtacgttt
      301 acccagccac agccaggctg ccagcagaat gccgaccacc gcgaccacgc cagaggtaat
      361 ttccagggtc aacatgctgc cgtgcgccag ttccgtcgtt tgcggaagca cgccctgcag
      421 cggcggtaca atcagtgcgc caacgaaggt ggaaaggatc agcagcacaa tcagcggcag
      481 gctgtgagtt acccctttca cggcgtgagc gtgaatttgt tcttttccgt ggaagacgat
      541 gaaaatcata cggaaggtgt agagcgaggt cataaacgca ccgaccagac ctgccaccat
      601 cagattgata tgaccattcg ccatcgcacc cgcgaggatc tcatccttac tgaagaagcc
      661 cgcagtgacc agcggtagtg ccgacagtgc tgcgccgccc accaggaagc agagataaac
      721 cagcggaata gatttacgca gaccgcccat cttgaagatg ttctgttcgt gatggcaggc
      781 cagaatgacg gaaccggatg ccaggaacag cagcgcttta aagaacgcgt gggtcatcaa
      841 gtggaaaatc gccgcatccc atgcctgcac gccaagcgcg aggaacatgt agccaatctg
      901 gctcatggta gagtaagcga gaacacgttt gatgtcggtc tgtaccagcg cggcaaaacc
      961 ggccagcagc agcgtaaccg ccccgacaat acccaccaga tgcagaactt ccggcgtcat
     1021 caggaacagg ccgtgggtac gggcgatcag gtagacaccc gcggttacca tggttgcggc
     1081 gtggatcagc gcggagacag gcgtcgggcc cgccatcgcg tcggcaagcc atgtctgcaa
     1141 cggcaactgc gcagatttac cgaccgcacc gcccagcagc atcagcgtcg cccacatcag
     1201 catgttattg ccgtcagcaa agtgcgctgg tgccagttcc accatttcgc ggaagttcag
     1261 ggtgcccagt tcgttgtaaa gaatgaacag tgcgaaagcg aggaacacgt cacccacacg
     1321 ggtcacgacg aacgctttca ttgccgctgc gccattcttc ggatcggtgt aatagaaccc
     1381 gatcagcaga taggagcaca ggcccacgcc ttcccagccg aggtacatca gcagcaggtt
     1441 gtcggcaagc accagaacca ccatgctggc gatgaacagg ttggtgtaag cgaagaagcg
     1501 agagtagccc tcttcaccgc gcatatacca ggaggcgtac atgtgaataa ggaaacccac
     1561 accagtgacc accgagagca tggtcagcga caggccgtcc agcaccaggt taaaaccgat
     1621 gttaaagtcg cctaccgaca tccacgtcca cagcggctgg ctgtatgtct gctcgccgtt
     1681 agcgaagaaa tcaacgccga taaaggcggt taccagcgcc gccaggccca cagagcctac
     1741 gccgacgatc gccgagacgt tttcagacca gcgcccacgg gagaatgcca gcaggacgaa
     1801 gccaatcaat ggcaaaataa tggttaaggc aagcatgttc at
//

3. Preparing the Alignment within MEGA


3.1. Add first sequence to alignment

! TASK
Click on the button with the red + sign (Add to Alignment). MEGA will acknowledge the addition, which starts the list of sequences to be aligned. Click OK.

The MEGA AlnExplorer will start, perhaps behind the current browser window. Close or move the browser window to reveal the M4 Alignment Explorer.

3.2. Add additional sequences to be aligned

! TASK

Add new sequences to the alignment list

In a similar manner, explore the alignments provided by BLAST. Check the E-value, and choose sequences that are at least 50% similar over the length of the query sequence. Choose one sequence per species: there would be no gain of evolutionary information from including many identical E. coli sequences or sequences that are 100% identical. Concentrate on chains named L. The nucleotide sequence should be in the 1800-base range.

If you make a mistake, you can remove a sequence already sent to the M4 Alignment Explorer by right-clicking on the sequence and selecting Cut from the pull-down menu.
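The selection rules above can be summarized as a small filter. The Python sketch below is only an illustration: the Hit record, the 90% coverage cut-off and the 200-base tolerance around 1800 bases are our assumptions, and the example records are invented. In the laboratory the selection is made by eye in the BLAST result page, and the rule about chains named L still has to be checked by reading the descriptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    species: str
    description: str
    identity_pct: float   # percent identity over the aligned region
    coverage_pct: float   # percent of the query covered by the alignment
    length: int           # length of the retrieved gene sequence, in bases


def select_hits(hits, min_identity=50.0, min_coverage=90.0,
                expected_length=1800, tolerance=200):
    """Keep at most one adequate hit per species, best identity first."""
    chosen = {}
    for hit in sorted(hits, key=lambda h: h.identity_pct, reverse=True):
        adequate = (hit.identity_pct >= min_identity
                    and hit.coverage_pct >= min_coverage
                    and abs(hit.length - expected_length) <= tolerance)
        if adequate and hit.species not in chosen:
            chosen[hit.species] = hit
    return list(chosen.values())


# Invented example records, for illustration only
hits = [
    Hit("Escherichia coli", "nuoL, strain 1", 100.0, 100.0, 1842),
    Hit("Escherichia coli", "nuoL, strain 2", 100.0, 100.0, 1842),
    Hit("Salmonella enterica", "nuoL", 85.0, 99.0, 1842),
    Hit("Bos taurus", "SMARCB1 mRNA", 87.0, 10.0, 1491),
]
for hit in select_hits(hits):
    print(hit.species, "|", hit.description)
```

The second E. coli record is dropped as a duplicate species, and the short Bos taurus match is dropped because it covers only a small part of the query.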


Note: instead of using Open in New Window you can also simply click on the score link and click the back arrow to go back to the list.

When you are done, your M4 Alignment Explorer should look similar to the following, which exhibits 15 sequences:

Note: The base background coloring can be toggled on/off from the Display menu.

3.3. Save the current list

! TASK

Save the alignment list into a file with the menu cascade:

Data > Export Alignment > MEGA Format

Enter a file name, e.g. nuoL; the .meg filename extension is already supplied by MEGA.

Save the file on the Desktop. When prompted, give a title in answer to the first question, and answer YES to confirm that these are protein-coding sequences.

! TASK

Close the MEGA web browser.

Note: if you prefer you can copy the file NuoL.meg from the Classes Scratch drive or download it from http://virology.wisc.edu/acp

4. Create the Alignment

! READ

Defining an alignment.

Excerpt from: http://en.wikipedia.org/wiki/Sequence_alignment


A sequence alignment is a way of arranging the primary sequences (here as DNA) to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Since we want to build a tree, we assume that the sequences are homologs and derive from a common ancestor. During the alignment gaps are inserted between the residues so that residues with identical or similar characters are aligned in successive columns. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps can be interpreted as indels (insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another.

4.1. Algorithm: ClustalW

MEGA uses an embedded version of ClustalW (command line interface) to perform the alignment. Clustal creates the alignment in 3 main steps (see http://en.wikipedia.org/wiki/Clustal):
1) Perform pair-wise alignments of all sequences.
2) Create a simple phylogenetic tree based on similarity distance.
3) Use the phylogenetic tree to carry out a multiple alignment.

The basis of similarity is a comparison table or distance matrix: BLOSUM62 for proteins and mostly identities for nucleic acids.

4.2. Perform the alignment

! TASK

From the Alignment menu choose Align by ClustalW

Confirm that you want to select all the sequences for the alignment: click OK

This will bring up the ClustalW parameters window. For now keep the suggested values and press OK.

The progress bar shows the advance of the Clustal algorithm from pairwise alignments to multiple alignment. When the alignment is done, the main window will flash and refresh with the new alignment, shown below at the 5' and 3' ends:

//

This default alignment is the best alignment that ClustalW can create given the default parameters. It is unlikely to be the best possible alignment, and manual adjustments have to be made to compensate for some of the flaws introduced by the algorithm, such as the splitting of codons. Below we will perform a few adjustments; a better alignment will be carried out in a later exercise.

4.3. Adjustments to the Aligned Sequences

! READ

The following adjustments are made ad hoc for this example and may or may not apply in exactly the same way if you have selected different sequences.

4.3.1. Remove 2 sequences

In this example 2 sequences (Klebsiella pneumoniae 342 and Erwinia tasmaniensis strain ET1/99) appear shorter (and indeed do not match until base 960) and will be removed with the right-click and Cut menu options. The example will continue with the remaining 13 sequences. If you have inadvertently incorporated sequences that are shorter, you may wish to remove them at this time.

Note: now that the 2 short sequences are removed, asterisks (*) are shown on some of the top squares, indicating columns where all bases are identical.

4.3.2. Reverse complement sequences

The sequences that are aligned are the reverse complement of the coding sequence. The sequences will therefore be changed within MEGA.

Hint: check the 3' end of your sequences; if they end with CAT (the reverse complement of ATG, or AUG in RNA, the methionine codon), then this is likely the case.
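The hint can be checked outside MEGA as well. The Python sketch below (function names ours, for illustration) reverse-complements a DNA string and tests whether a sequence ends with CAT. The example uses the final 12 bases of the 1842-base nuoL region from the GenBank entry in Exercise B.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def looks_reversed(dna: str) -> bool:
    """True if the sequence ends with CAT, i.e. ATG on the opposite strand."""
    return dna.upper().endswith("CAT")


# Final 12 bases of the nuoL region as stored in the genome entry
tail = "aagcatgttcat"
print(looks_reversed(tail))      # True
print(reverse_complement(tail))  # atgaacatgctt
```

The reverse complement starts with atg aac atg ctt, which encodes M N M L, the first four residues of the nuoL translation given in the GenBank entry.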

! TASK

Follow these directions if your sequences are reversed.

Edit > Select All

Data > Reverse Complement (result)

The ATG is highlighted at the 5' end of the converted sequences. Asterisks (*) mark columns where all bases are identical.

4.4. Adjustments to the Alignment

If you have reversed your sequences they should all start with ATG, and all ATGs should be aligned within the same column. If they are not, you will be able to adjust this part of the alignment in the same manner as we will adjust the 3' end below.

4.4.1. Adjusting the 3' end

These are coding sequences that end with a terminator codon. All three terminator codons are represented, in their DNA form: UAA (Ochre), UAG (Amber) and UGA (Opal), i.e. TAA, TAG and TGA. By definition these are the termination signals, and these 3 columns should be aligned.

! TASK
Select the terminator bases for as many adjacent sequences as possible. Then move the selected bases towards the 3' end of the sequence by repeatedly clicking on the toolbar button meant to move selected blocks rightward.

Repeat for the remaining terminator sequences until all 3' ends are aligned.
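As a complement to the visual check, the start and terminator codons of each aligned sequence can be verified with a few lines of Python. This is a minimal sketch: the function name is ours, gap characters are assumed to be "-", and the example sequences are invented.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA": "Ochre", "TAG": "Amber", "TGA": "Opal"}


def check_coding_ends(aligned: str) -> Tuple[bool, Optional[str]]:
    """Check an aligned coding sequence for an ATG start and a stop codon.

    Gap characters ('-') are removed first, so the check applies to the
    ungapped sequence. Returns (starts_with_ATG, stop_codon_name_or_None).
    """
    seq = aligned.replace("-", "").upper().replace("U", "T")
    starts_ok = seq.startswith("ATG")
    stop_name = STOP_CODONS.get(seq[-3:])
    return starts_ok, stop_name


# Invented aligned sequences, for illustration only
print(check_coding_ends("ATGAACATG---CTTTAA"))  # (True, 'Ochre')
print(check_coding_ends("ATGAACATGCTT---TGA"))  # (True, 'Opal')
print(check_coding_ends("ATGAACATGCTT"))        # (True, None)
```

A result of None for the stop codon suggests that the sequence is truncated or that its 3' end still needs attention.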


!
Save the file again (Data > Export Alignment), perhaps with a new name, e.g. NuoL_edited.meg. Answer YES to the question about coding proteins. We now have an alignment from which we can build a tree. See next Exercise.

! TASK
You can save the alignment in the .mas binary format with the following menu cascade: Data > Save Session

Call the new file e.g. NuoL_Edited.mas

The .mas format is binary, while the .meg format is plain text. If you are curious you can open them with WordPad.

For the next segment we will need the .meg file.

! TASK

Close the Alignment Explorer.


Note: if you prefer you can copy the file NuoL_Edited.meg from the Classes Scratch drive or download it from
http://virology.wisc.edu/acp

L11 Exercise C: Calculate a Neighbor-Joining Tree

! INFO

http://en.wikipedia.org/wiki/Neighbor_joining

The neighbor-joining iterative algorithm requires knowledge of the distance between each pair of sequences in the tree, constructed in a step-wise fashion. Each iteration consists of the following steps:
1) Based on the current distance matrix, calculate the matrix Q for each pair of sequences.
2) Find the pair of sequences in Q with the lowest value. Create a node on the tree that joins these two sequences (i.e. join the closest neighbors, as the algorithm name implies).
3) Calculate the distance of each of the sequences in the pair to this new node.
4) Calculate the distance of all sequences outside of this pair to the new node.
5) Start the algorithm again, considering the pair of joined neighbors as a single sequence (taxon) and using the distances calculated in the previous step.

Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm. This algorithm has been extensively tested and is statistically consistent under many models of evolution. Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high probability.
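One iteration of the steps listed above can be sketched in Python. This is a minimal illustration of the standard neighbor-joining formulas, not the code used by MEGA. The function names and the small five-taxon distance matrix are ours. For n taxa, Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k).

```python
def q_matrix(d):
    """Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    return [[(n - 2) * d[i][j] - row_sums[i] - row_sums[j] if i != j else 0
             for j in range(n)] for i in range(n)]


def closest_pair(q):
    """Indices (i, j), i < j, of the smallest off-diagonal entry of Q."""
    n = len(q)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: q[p[0]][p[1]])


def join(d, i, j):
    """Distances from i and j to the new node u, and from u to the other taxa."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    d_iu = d[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    d_ju = d[i][j] - d_iu
    d_uk = {k: (d[i][k] + d[j][k] - d[i][j]) / 2
            for k in range(n) if k not in (i, j)}
    return d_iu, d_ju, d_uk


# Small illustrative distance matrix for five taxa a, b, c, d, e
names = "abcde"
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
q = q_matrix(dist)
i, j = closest_pair(q)
print(names[i], names[j], q[i][j])  # step 2: pair with the lowest Q value
print(join(dist, i, j))             # steps 3 and 4: distances to the new node
```

Note that the pair with the lowest Q value (a and b here) is not the pair with the smallest raw distance (d and e): Q also accounts for how far each sequence is from all the others.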

1. Open alignment file

! TASK
From the main MEGA window click on Click me to activate a data file and open the previously saved or retrieved NuoL_Edited.meg file. The file will open within the MEGA Sequence Data Explorer as shown below.

Note that by default the sequence is shown only for positions that are different from the first listed sequence (E. coli str. K12). This can be toggled on and off with the special button.

Note that this window is separate from the main MEGA window, which might be behind this view.

2. Activate Neighbor-Joining

! TASK

From the main MEGA window follow the menu cascade: Phylogeny > Construct Phylogeny > Neighbor-Joining (NJ)...

This will open the M4: Analysis Preference window where current parameters and selections are summarized: we can see that we are working on a nucleotide file and that the method currently chosen is Neighbor-Joining. Green squares at the end of lines indicate parameters that could be altered and will be reviewed later.


! TASK
Click the button. This will bring up the M4: Tree Explorer window.

Save the current tree:

We have now created a valid tree based on the alignment. We will explore how to change tree viewing and drawing options in a further exercise.

! TASK
Close all MEGA windows: the MEGA main window and any associated MEGA Explorer window.

L11 Exercise D: Precision in Acquiring and Aligning Sequences

! READ

Our accomplishment so far is to have used MEGA to retrieve sequences, align them and make a tree. However, we have not worked with precision either in the selection of the files or in making the tree. This section will go over finding homologous sequences to the best of our judgment. There is no meaning in placing an unrelated sequence in the tree because the purpose of the tree is to depict the level of ancestry between the sequences.

Restating our purpose: why do we make a tree? A tree is a graphical representation of the relationship between sequences believed to derive from a common ancestor and serves as a tool to illustrate a concept or a hypothesis. The level of detail may relate to the problem we want to illustrate. For example, we may want to study variants of a protein within a single species, or, on the other hand, find the evolutionary relationship of a given protein amongst all known species.

In evolutionary biology, homology refers to any similarity between characters that is due to their shared ancestry. Homology among proteins and DNA is often concluded on the basis of sequence similarity. However, sequence similarity may arise from different ancestors: short sequences may be similar by chance, and sequences may be similar because both were selected to bind to a particular protein. When similarity is very high over a reasonable length there is little doubt that the sequences are homologs.

However, how do we find sequences that are distantly related? DNA sequences share roughly 25% similarity with a random sequence of the same composition simply because there are only 4 choices (A, C, G, T/U). On the other hand, with proteins there are 20 choices at each position along the sequence and homology can often be detected down to a similarity threshold of 5%.
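The 25% figure follows from a simple calculation: if residues occur independently with frequencies f_1 ... f_k, two random positions match with probability f_1^2 + ... + f_k^2. The Python sketch below applies this to equally frequent bases and, as a simplification that is our assumption, to 20 equally frequent amino acids.

```python
def chance_identity(frequencies):
    """Probability that two independent random residues are identical.

    With residue frequencies f_1..f_k, the chance of a match at one
    position is the sum of the squared frequencies.
    """
    return sum(f * f for f in frequencies)


dna_uniform = [0.25] * 4          # A, C, G, T equally frequent
protein_uniform = [1 / 20] * 20   # 20 amino acids equally frequent (a simplification)

print(chance_identity(dna_uniform))                 # 0.25
print(round(chance_identity(protein_uniform), 3))   # 0.05
```

Chance identity is thus about 25% for DNA but only about 5% for proteins under this simplified model, which is why protein comparisons leave more room to detect distant relationships.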

! READ

Therefore in the next section we will acquire sequences based on a protein search. However, a score of 100% between 2 (identical) protein sequences does not necessarily signify that the coding DNA would also score 100%. Because of the degeneracy of the genetic code there exist silent substitutions: the same amino acid can be encoded by a different codon, which alters the DNA/RNA sequence but not the protein. For this exercise we are interested in the finest tree structure possible, and we will retrieve the actual coding DNA sequences. In a first step we will allow different DNA sequences that encode identical proteins to be part of the list; true duplicates will be removed at a later stage.
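As an illustration of silent substitutions, the sketch below translates two short DNA sequences with the standard genetic code. The two sequences are invented for the example and the helper names are ours; MEGA performs the translation internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon (trailing bases are ignored)."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percent of identical positions between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical example: same protein, different codons for Leu and Ala.
dna_a = "atgctggcc"
dna_b = "atgttagcg"

print(translate(dna_a), translate(dna_b))
print(f"DNA identity: {percent_identity(dna_a, dna_b):.1f}%")
```

The two proteins are identical while the DNA sequences differ at 3 of 9 positions, which is the extra resolution we gain by building the tree from the coding DNA.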

! READ

For this exercise we will embark on finding sequences related to the bacterial KcsA potassium channel.

1. Acquiring Query Sequence

Potassium channels found in bacteria are amongst the most studied of ion channels in terms of their molecular structure. They have a tetrameric structure that has been solved by X-ray crystallography and NMR. Therefore we can retrieve the sequence from the PDB web site.

! TASK

Open a web browser.
Point the browser to the PDB web site: http://www.rcsb.org/pdb
In the search box enter 1F6G (1F6G is the PDB ID code for the bacterial KcsA potassium channel).
Click Site Search or press return.
On the left hand side panel under the Structure tab: click on FASTA Sequence.
The browser will then ask to save or open. Choose Save and direct the file to the desktop if asked (default with Firefox).
Accept the default name of 1F6G.fasta.txt
Navigate to the desktop.
Right-click on the file.
Choose Open With, then select WordPad (or Microsoft Office Word) to open the file.

Note: if you double-click on the file icon, it will open with Notepad and only one single long line will show as Notepad is not intelligent about end-of-line:

The structure is composed of four identical protein sequences labeled A to D. Copy the sequence of one of the chains, e.g. chain A, to the clipboard. This will be our query sequence to use with BLAST.
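If you prefer to extract the chain programmatically rather than by copy and paste, a FASTA file can be read with a few lines of Python. This is a minimal sketch: the header text and the shortened sequences in the example are hypothetical placeholders, not the exact content of 1F6G.fasta.txt.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    # splitlines() copes with Unix, Windows and old Mac line endings,
    # the same issue that makes Notepad show one single long line.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-chain example with shortened sequences.
example = """>chain_A (hypothetical header)
MPPMLSGLLARLVKLLLGRH
GSALHWAAAG
>chain_B (hypothetical header)
MPPMLSGLLARLVKLLLGRH
GSALHWAAAG
"""

records = parse_fasta(example)
query = records["chain_A (hypothetical header)"]
print(query)
```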

2. BLAST within MEGA


2.1. Set-up

! TASK
Launch MEGA (Windows: Start > All Programs)
Launch the MEGA BLAST web browser (MEGA: Alignment > Do BLAST Search)
(The default is blastn with nucleotides and we need to switch to proteins.)
Click Home
Scroll down the page.
Click protein blast
Paste the sequence in the appropriate window. (Right-click with the mouse or press Control and V together.)
Verify that the database is nr.
Verify that the algorithm is blastp (protein-protein BLAST).
Press the BLAST button at the bottom of the page.

! OPTIONAL

Before pressing the BLAST button you can open and explore the "Algorithm Parameters" section by clicking open the triangle.

2.2. BLAST results

As previously, the result page is organized with a graphical summary (max 100), a description list (up to the limit preset in Algorithm Parameters; the default is 100) and finally the alignments. Within the Description and Alignment lists, entries that are derived from the Protein Data Bank and are therefore solved structures are shown with the S symbol; entries with an Entrez Gene record (a searchable database of genes at NCBI from RefSeq genomes) are shown with the G symbol.

! INFO

http://www.ncbi.nlm.nih.gov/RefSeq/

G = Entrez Gene. RefSeq = The Reference Sequence collection. RefSeq aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences. Each RefSeq represents a single, naturally occurring molecule from one organism. Similar to a review article in the literature, a RefSeq is a synthesis of information representing the consolidation of information by a particular group at a particular time. RefSeq has been built using data from public archival databases only.

// (skip to bottom)

2.2.1. Explore first hit

! TASK
Click on the score (here 317) to jump to the first hit, or scroll down the page. Typically the very first hit is the query sequence itself.
The original sequence was the PDB sequence and that is what we found. Furthermore, since the channel has a tetrameric structure, the 4 identical chains are listed in the entry, followed by the S symbol to signify that it is a structural entry.

Right-click and open this entry into a new window. The text is shown below.

DO NOT CLICK the + Add to Alignment button at this point!

LOCUS       1F6G_A   160 aa   linear   BCT 24-SEP-2008
DEFINITION  Chain A, Potassium Channel (Kcsa) Full-Length Fold.
ACCESSION   1F6G_A
VERSION     1F6G_A  GI:13399712
DBSOURCE    pdb: molecule 1F6G, chain 65, release Aug 27, 2007;
            deposition: Jun 21, 2000;
            class: Proton Transport, Membrane Protein;
            source: Mol_id: 1; Organism_scientific: Streptomyces Lividans;
            Organism_common: Bacteria; Expression_system: Escherichia Coli;
            Expression_system_common: Bacteria;
            Expression_system_vector_type: Plasmid;
            Expression_system_plasmid: Pqe32;
            Exp. method: Nmr, 8 Structures.
KEYWORDS    .
SOURCE      Streptomyces lividans
  ORGANISM  Streptomyces lividans
            Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
            Streptomycineae; Streptomycetaceae; Streptomyces.
REFERENCE   1  (residues 1 to 160)
  AUTHORS   Cortes,D.M., Cuello,L.G. and Perozo,E.
  TITLE     Molecular architecture of full-length KcsA: role of cytoplasmic
            domains in ion permeation and activation gating
  JOURNAL   J. Gen. Physiol. 117 (2), 165-180 (2001)
   PUBMED   11158168
REFERENCE   2  (residues 1 to 160)
  AUTHORS   Cortes,D.M. and Perozo,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-JUN-2000)
COMMENT     SEQRES.
FEATURES             Location/Qualifiers
     source          1..160
                     /organism="Streptomyces lividans"
                     /db_xref="taxon:1916"
     Region          1..126
                     /region_name="Domain 1"
                     /note="NCBI Domains"
     SecStr          2..20
                     /sec_str_type="helix"
                     /note="helix 1"
     SecStr          22..45
                     /sec_str_type="helix"
                     /note="helix 2"
     SecStr          63..74
                     /sec_str_type="helix"
                     /note="helix 3"
     SecStr          86..113
                     /sec_str_type="helix"
                     /note="helix 4"
     SecStr          114..122
                     /sec_str_type="helix"
                     /note="helix 5"
     SecStr          130..147
                     /sec_str_type="helix"
                     /note="helix 6"
ORIGIN
        1 mppmlsglla rlvklllgrh gsalhwaaag aatvllvivl lagsylavla ergapgaqli
       61 typaalwwsv etattvgygd lypvtlwgrc vavvvmvagi tsfglvtaal atwfvgreqe
      121 rrghfvrhse kaaeeaytrt tralherfdr lermlddnrr
//

This particular sequence was expressed through a plasmid expression vector after cloning. The DNA sequence linked to this protein sequence is not directly available. In addition, since this sequence was cloned and expressed for structural study, we do not know at this point if any mutation (even minor) was introduced into the sequence to facilitate the structural study. Since structural sequences do not have a link to their DNA coding sequence, and since we do not know if these sequences are naturally occurring, as a general rule we will not use any of the structural protein sequences.

! TASK

Close this window (if you used the right-click method) or click the back button if you did not use the right-click method.

2.2.2. Explore second hit

! READ

Before opening the second hit, let's first remark that there are also multiple links for this sequence. Unlike for the first hit, this is not related to the tetrameric nature of the structure, but rather to the fact that the sequence is represented in multiple databases:

Query  1    MPPMLSGLLARLVKLLLGRHGSALHWAAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI  60
            MPPMLSGLLARLVKLLLGRHGSALHW.AAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI
Sbjct  1    MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI  60

Query  61   TYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVVMVAGITSFGLVTAALATWFVGREQE  120
            TYP.ALWWSVETATTVGYGDLYPVTLWGR.VAVVVMVAGITSFGLVTAALATWFVGREQE
Sbjct  61   TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE  120

Query  121  RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR  160
            RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR
Sbjct  121  RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR  160

This sequence differs at 3 amino acid positions (label . added above) and is 98% identical to the query sequence.
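The count of 3 differences and the 98% identity can be verified directly from the two sequences of the BLAST alignment. The sketch below uses the query and subject sequences shown in this handout; the function name is ours.

```python
QUERY = (
    "MPPMLSGLLARLVKLLLGRHGSALHWAAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI"
    "TYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVVMVAGITSFGLVTAALATWFVGREQE"
    "RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR"
)
SBJCT = (
    "MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI"
    "TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE"
    "RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR"
)


def differing_positions(a, b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


positions = differing_positions(QUERY, SBJCT)
identity = 100 * (len(QUERY) - len(positions)) / len(QUERY)

for pos in positions:
    print(f"position {pos}: {QUERY[pos - 1]} -> {SBJCT[pos - 1]}")
print(f"identity: {identity:.0f}%")
```

The differences fall at positions 27, 64 and 90 (A to R, A to R and C to L), giving 157 identical residues out of 160.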

! TASK

Using the right-click method, open the following 3 entries, each in a new window:
ref|NP_631700.1| G
emb|CAA86025.1|
emb|CAC16993.1| G

Scroll down to the FEATURES section.

ref|NP_631700.1| voltage-gated potassium channel [Streptomyces coelicolor A3(2)]
emb|CAC16993.1| voltage-gated potassium channel [Streptomyces coelicolor A3(2)]
emb|CAA86025.1| potassium channel protein [Streptomyces lividans]

Interestingly, these 3 entries are linked together but belong to 2 different species: Streptomyces coelicolor A3(2) and Streptomyces lividans. Furthermore, upon close inspection it can be seen that the sequences within the 3 entries are 100% identical to each other, which means that the query sequence might have had some errors, and therefore it is indeed best not to include structure sequences in our tree.

Both sequences from the emb database contain information on related PDB structures, and indeed emb|CAA86025.1| is shown linked to the PDB entry 1F6G (highlighted on the figure above), our original query sequence. This was the query sequence, yet 3 amino acid differences were detected. It is possible that the structure was determined on a slight variant which might have been obtained by cloning, and therefore these would not be natural mutations.

! TASK
Click on the CDS link highlighted on the RefSeq version. This action will open a new NCBI window with the DNA coding sequence. You can verify that it begins with the START codon (atg) and ends with the OPAL terminator codon (tga). This will be our first entry: //

! TASK

Click the top red cross button: + Add to Alignment

2.2.3. Checking for errors

Scroll down to the next entry that is not a structural entry (S):
>ref|YP_002206049.1| G voltage-gated potassium channel [Streptomyces sviceus ATCC 29083] gb|EDY57125.1| G voltage-gated potassium channel [Streptomyces sviceus ATCC 29083] Length=162

! TASK

Open ref|YP_002206049.1| in a new window
Click CDS
Scroll to bottom
Observe sequence and /translation

/translation="MAPCARPAPGLRGVSMLPGFLARMVELMRRRDGRSLHVKAAGGA
TAVLLVVMLTGSWAVLVAEEGARGASLTSYPKALWWSVETATTVGYGDFYPVTWWGRV
VGTVVMVVGITTYGMVTAALATWFVARQQKRAHPVGAETLHALHERFDRLEELLGAKK
KG"

ORIGIN
1 ctggccccgt gcgcccgccc cgcccccggc cttcgaggag tgagcatgct gcccggattc
[...]

! READ

The reported translation starts with Methionine (M) but the reported DNA sequence starts with a C rather than an A. Translating this sequence would in fact result in a Leucine being the first amino acid. Therefore there is a mistake in the DNA sequence. In the other link (gb|EDY57125.1|) the same mistake can be found. We can notice that the next entry is from the same strain (Streptomyces sviceus ATCC 29083), has a correct ATG start and is therefore a better sequence.

Ignore the sequence with the wrong start and click the top red cross button + Add to Alignment for ref|ZP_03195532.1|

2.2.4. Reverse Complement?

Some sequences are reported as the negative strand and need to be reverse complemented, either before adding them to MEGA or within MEGA. The next entry as we follow along the Description list, skipping S entries, is:
>ref|YP_909829.1| G voltage-gated potassium channel protein [Bifidobacterium adolescentis ATCC 15703] dbj|BAF39747.1| G possible voltage-gated potassium channel protein [Bifidobacterium adolescentis ATCC 15703] Length=228

! TASK
Open ref|YP_909829.1| in a new window
Observe the CDS record:

CDS  1..228
     /locus_tag="BAD_0966"
     /coded_by="complement(NC_008618.1:1203706..1204392)"
     /note="COG family: Kef-type K+ transport systems_predicted

Note that it is stated: coded by complement, which means that the coding sequence is the reverse complement.

Click CDS
Scroll to sequence at bottom
Observe sequence at 5' and 3' ends

ORIGIN
  1 ctatggtctg cgcagccgcg cgatgtcgtc acgtagctgg ttgacggtgt ccgtcagttc
[...]
661 cagaaacacc aacgccaatg cggtcat
//

It can easily be seen that the sequence cat at the 3' end is the reverse complement of atg, the START codon. Similarly, the 5' end sequence cta is the reverse complement of the Amber STOP codon uag (tag in DNA). There are 2 ways to reverse complement the DNA sequence: within the browser or within MEGA.

Within Browser:
Click the square next to Reverse complemented strand
Press the Refresh button.
Add sequence to MEGA: Press + Add to Alignment

Within MEGA:
Add sequence to MEGA: Press + Add to Alignment
Right-click on the sequence name
Select Reverse Complement

! TASK

Choose one method and add this sequence in the proper orientation
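For reference, the reverse complement operation itself is simple to write down. The sketch below is ours (it is not how MEGA or the NCBI browser implement it) and is applied to the 5' and 3' end fragments of the Bifidobacterium record shown above.

```python
# Complement table for DNA; U is accepted on input and complemented to A.
COMPLEMENT = str.maketrans("acgtuACGTU", "tgcaaTGCAA")


def reverse_complement(dna):
    """Complement every base, then read the strand in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


five_prime_end = "ctatggtctg"   # first bases of the reported (negative) strand
three_prime_end = "cggtcat"     # last bases of the reported strand

# The 3' end of the reported strand becomes the start of the coding strand.
print(reverse_complement(three_prime_end))  # begins with the START codon atg
# The 5' end of the reported strand becomes the end of the coding strand.
print(reverse_complement(five_prime_end))   # ends with the STOP codon tag
```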

2.2.5. Long sequences

In some cases clicking the CDS link will bring up the complete genome. In this case, press the back button and check, next to the CDS entry, the begin and end numbers along the sequence that contain the gene. Example for:
>ref|YP_832549.1| G Ion transport 2 domain-containing protein [Arthrobacter sp. FB24] gb|ABK04449.1| G Ion transport 2 domain protein [Arthrobacter sp. FB24] Length=244

! TASK
Open ref|YP_832549.1| in a new window
Observe the CDS record:

CDS  1..244
     /locus_tag="Arth_3070"
     /coded_by="NC_008541.1:3441675..3442409"
     /note="PFAM: Ion transport protein; Ion transport 2 domain protein,;

Note the coded by entry. 3441675..3442409 are the begin and end values for the gene within the NC_008541.1 record.

Click the CDS link again.
Replace the begin and end values with 3441675 and 3442409
Press Refresh
Note that a new button for showing the whole sequence is now being displayed.
Press + Add to Alignment
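The coded_by values can also be read programmatically, which gives a quick sanity check on the gene length. This is a sketch under the assumption that the value has the simple form shown in the two records above (optionally wrapped in complement(...)); joined or partial locations are not handled, and the function names are ours.

```python
import re

# Matches e.g. NC_008541.1:3441675..3442409 or complement(NC_008618.1:1203706..1204392)
CODED_BY = re.compile(r"^(complement\()?([\w.]+):(\d+)\.\.(\d+)\)?$")


def parse_coded_by(value):
    """Split a simple coded_by value into record, begin, end and strand."""
    match = CODED_BY.match(value)
    if match is None:
        raise ValueError(f"unsupported coded_by value: {value}")
    return {
        "record": match.group(2),
        "begin": int(match.group(3)),
        "end": int(match.group(4)),
        "complement": match.group(1) is not None,
    }


def cds_length(location):
    """Begin and end are 1-based and inclusive, hence the + 1."""
    return location["end"] - location["begin"] + 1


arthrobacter = parse_coded_by("NC_008541.1:3441675..3442409")
bifidobacterium = parse_coded_by("complement(NC_008618.1:1203706..1204392)")

# A complete CDS should be 3 x (number of amino acids + 1 stop codon) bases long.
print(cds_length(arthrobacter), 3 * (244 + 1))
print(cds_length(bifidobacterium), 3 * (228 + 1))
```

Both genes have the expected length (735 and 687 bases), which confirms that the begin and end values delimit the whole coding sequence including the stop codon.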

3. Build the alignment list

! READ

Go down the Description list and choose sequences to be added: for each hit with an acceptable E-value select one of the links (in the previous hits there were 5 links), giving preference to links from RefSeq if there is a choice. In any case, choose only those that have a CDS link. Click on the CDS link and only then click + Add to Alignment. Skip all structure entries (S).

Time & patience required!

Acceptable E-value is a cutoff decision. We can decide that anything higher than 1e-3 is too high, but that cut-off may vary and in the end is rather subjective. Also, we can eliminate sequences that come from the same strain, as well as proteins labeled hypothetical, putative, possible or predicted, and synthetic constructs. The general algorithm that we follow therefore is:

E-value below threshold?
Different strain?
Open in new window
CDS available?
Check sequence and translation
Reverse complement?
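The selection rules can be summarized as a small filter. The sketch below is only an illustration of the decision steps: the Hit fields, the accession names and the example hits are hypothetical, and the manual checks (comparing the sequence with its translation, reverse complementing) are not automated here.

```python
from dataclasses import dataclass

# Labels that lead us to skip a hit, as listed in the text.
EXCLUDED_WORDS = ("hypothetical", "putative", "possible", "predicted",
                  "synthetic construct")


@dataclass
class Hit:
    accession: str
    description: str
    strain: str
    e_value: float
    is_structure: bool = False
    has_cds: bool = True


def select_hits(hits, threshold=1e-3):
    """Return the accessions that pass the selection rules, in list order."""
    selected = []
    seen_strains = set()
    for hit in hits:
        if hit.is_structure or not hit.has_cds:
            continue                      # skip structure entries and hits without CDS
        if hit.e_value >= threshold:
            continue                      # E-value too high
        if hit.strain in seen_strains:
            continue                      # one sequence per strain
        if any(word in hit.description.lower() for word in EXCLUDED_WORDS):
            continue                      # uncertain annotation
        seen_strains.add(hit.strain)
        selected.append(hit.accession)
    return selected


# Hypothetical description list.
hits = [
    Hit("HIT_1", "Chain A, potassium channel", "strain X", 1e-80, is_structure=True),
    Hit("HIT_2", "voltage-gated potassium channel", "strain X", 1e-78),
    Hit("HIT_3", "voltage-gated potassium channel", "strain X", 1e-78),
    Hit("HIT_4", "putative ion channel", "strain Y", 1e-20),
    Hit("HIT_5", "ion transport protein", "strain Z", 1e-10),
    Hit("HIT_6", "ion transport protein", "strain W", 0.5),
    Hit("HIT_7", "potassium channel", "strain V", 1e-30, has_cds=False),
]
print(select_hits(hits))
```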

3.1. Edit Sequence Names

MEGA creates an entry name of up to 40 characters based on the database name. Long names are hard to read and make for untidy trees. Therefore we shall now rename the entries to better names within MEGA. Each software and each operating system has restrictions on the use of special characters. For example, Unix does not handle blanks or quotes well, and the Nexus file format crashes if dashes are part of the file name.

Guidelines:
Choose a name that is unique
Use underscore _ rather than spaces
Keep the name short (some software limits names to 10 characters)
Only use letters, numbers, underscore and period
Do not use quotes, and eliminate blank spaces and colons (:)
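The naming guidelines can be expressed as a small helper. This is a sketch, not a MEGA feature: the function name, the default limit of 10 characters and the numeric suffix used to keep names unique are our own choices.

```python
import re


def clean_name(name, max_length=10, taken=()):
    """Return a short name made only of letters, digits, underscore and period.

    Any run of other characters (spaces, colons, quotes, dashes...) becomes one
    underscore. If the result is already in `taken`, a numeric suffix is added.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_.]+", "_", name).strip("_")
    cleaned = cleaned[:max_length].rstrip("_")
    candidate, n = cleaned, 1
    while candidate in taken:
        n += 1
        suffix = f"_{n}"
        candidate = cleaned[: max_length - len(suffix)] + suffix
    return candidate


print(clean_name("Streptomyces lividans: KcsA", max_length=40))
print(clean_name("M. marinum"))
print(clean_name("S lividans", taken={"S_lividans"}))
```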

! TASK

Double-click a name within MEGA to edit it; repeat for each sequence.

Before: K+No1.mas
After: K+No1-renamed.mas

Click on the blank cell above the name list (marked above) to adjust the cell size for all names.

If you are unsure about your gathered sequences you can retrieve the file(s) K+No1.mas or K+No1-renamed.mas from the Classroom Scratch directory or from the resources section of virology.wisc.edu/acp.

3.2. Edit START codons

! READ

In the Checking for errors example (2.2.3) we eliminated a sequence because its START codon was not correct, and we noticed that it was a duplicate sequence. Four sequences were nevertheless introduced in the list with a START codon that has the wrong first base, although the translated protein was reported as starting with a Methionine (as it should). Because there are advantages to working with the protein sequence for creating the alignment, the DNA sequences will be used at a later stage to create a protein translation. If the START codon is wrong, the translated protein will not reflect the actual protein sequence. Therefore we need to edit the START codons that have an erroneous base: M_vanbaalenii, S_arenicola, M_sp_Mjls_4829, S_pneumoniae.

! TASK

Adjust the 4 sequences named above one by one

Right-click on the first base
Select Cut from the pull-down menu
Type the letter A to replace the cut base
The first 3 letters of every sequence should now read: ATG
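The effect of this edit on the translation can be illustrated with the Streptomyces sviceus sequence seen earlier, whose reported DNA starts with ctg while the reported protein starts with M. The sketch below uses the standard genetic code; the helper names are ours, and the fix mirrors the manual edit (replace the first base by A), nothing more.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon (trailing bases are ignored)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def fix_start_codon(dna):
    """Replace the first base by 'a' when the sequence starts with xtg, x != a."""
    if dna[1:3].lower() == "tg" and dna[0].lower() != "a":
        return "a" + dna[1:]
    return dna


# First 60 bases of the ORIGIN shown for ref|YP_002206049.1|
cds_start = "ctggccccgtgcgcccgccccgcccccggccttcgaggagtgagcatgctgcccggattc"

print(translate(cds_start))                   # starts with L
print(translate(fix_start_codon(cds_start)))  # starts with M, as reported
```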

4. Translate to Protein

! READ

MEGA works with either DNA or protein sequences. When MEGA aligns the translated protein sequences, the alignment is transferred back to the DNA sequences. Previously we discussed the fact that there are 4 choices at each position for DNA while there are 20 choices for amino acids in protein sequences. This results in more precise alignments if performed at the protein level. In addition, if the alignment was performed at the DNA level, Clustal would add gaps to maximize the alignment score and would undoubtedly split codons, resulting in an aberrant protein translation. The alignment transferred to the DNA will keep codons intact, create gaps as multiples of 3 and therefore eliminate frameshift artifacts. It has also been shown that trees created from such protein alignments are more accurate than trees created from the DNA sequences [2].
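The transfer of a protein alignment back to the DNA can be sketched as follows: every amino acid of the aligned protein consumes one codon, and every gap becomes a gap of 3 bases. This is our own illustration of the idea, with invented sequences, and not MEGA's actual code.

```python
def codon_align(aligned_protein, dna):
    """Thread the codons of `dna` onto a gapped protein sequence.

    `aligned_protein` uses '-' for gaps; `dna` is the ungapped coding sequence
    (without stop codon) for that protein.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = [aa for aa in aligned_protein if aa != "-"]
    if len(codons) != len(residues) or len(dna) % 3 != 0:
        raise ValueError("DNA length does not match the protein length")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)


# Hypothetical protein alignment of two short sequences.
protein_alignment = {"seq_1": "MPLA", "seq_2": "M-LA"}
coding_dna = {"seq_1": "atgccgctggcc", "seq_2": "atgttagcg"}

for name, protein in protein_alignment.items():
    print(name, codon_align(protein, coding_dna[name]))
```

The gap in seq_2 appears as exactly 3 dashes in the DNA alignment, so the reading frame downstream of the gap is preserved.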

! TASK
Click on the tab Translated Protein Sequences

5. Set parameters and calculate protein alignment

! TASK

With the protein sequences displayed, follow the menu cascades:
Edit > Select All
then Alignment > Align by ClustalW
This will bring up the protein parameters window. There are 2 sets of gap penalties corresponding to the 2-step process of the Clustal algorithm (pairwise alignment, then multiple alignment). Hall [3] reports that changing the Multiple Alignment parameters of Gap Opening and Gap Extension from their defaults of 10 and 0.2 to values of 3 and 1.8 respectively improves alignments significantly.

! TASK

Change the default values as explained above:
Gap Opening: default 10, change to 3
Gap Extension: default 0.2, change to 1.8

Click OK to activate the alignment.


[2] Hall BG. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol. 2005; 22(3):792-802. Erratum in: Mol Biol Evol. 2005; 22(4):1160.

[3] Hall BG. Phylogenetic Trees Made Easy. 3rd Edition. Sinauer Associates, Inc.; 2008.

MEGA will report the progress of the alignment status in the ClustalW Progress window showing the 2-step process.

6. Alignment adjustments

! READ

Since the alignment is the basis for the calculation of the tree, one would expect that adjusting the alignment as well as possible will yield the best tree. However, it was shown that as long as the alignment is more than 50% accurate, increasing the alignment accuracy (mostly by manual adjustment) has little effect on the resulting tree accuracy [4].

7. Export alignment in DNA and protein forms


Export the alignment as a .meg file from the Export Alignment menu
Name the file e.g. K+AlignAA_No1.meg
Now we will export the alignment in its DNA form.
Click DNA Sequences above the file names to return to the DNA form, which is now aligned according to the amino acids.
Export the alignment as a .meg file
Name the file e.g. K+AlignDNA_No1.meg
Answer YES when asked if the sequences are protein-coding

8. Eliminate duplicate sequences


During the selection process we do not know if any of the sequences had silent mutations. The following process will trace duplicate sequences if there are any present.

[4] Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006; 55(2):314-328.

! TASK
On the main MEGA window click the line: Click me to activate data file
Choose the newly saved DNA alignment K+AlignDNA_No1.meg
The alignment will open within the Sequence Data Explorer.
On the main MEGA window follow the menu cascade: Distances > Choose Model
In the opening M4: Analysis Preferences window verify that the first line indicates Data Type: Nucleotide (Coding).
On the line named -> Model click on the green square at right. The green square will change to a gray square with 3 dots.
Click on that gray square and follow the menu cascade: Nucleotide > No. of Differences
Click OK
The -> Model line will update to show the new model.
On the main MEGA window follow the menu cascade: Distances > Compute Pairwise...
The model we have just chosen should still be active for this calculation.
Click the Compute button

Alter the look of the computed pairwise table: change the number of decimal places with the down arrow (circled) and optionally shrink the cell size with the cursor (arrow). Identical sequences would show a difference of zero. In our example the closest pair, M. ulcerans and M. marinum, differs by 1 base, as shown in the enlargement of the calculated distance matrix. Therefore there are no duplicate sequences.
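The duplicate search amounts to counting differences for every pair of aligned sequences and looking for zeros. The sketch below makes that explicit on a toy alignment; the sequence names and sequences are invented, and skipping positions where either sequence has a gap is our assumption, not necessarily MEGA's setting.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of aligned positions that differ, ignoring positions with a gap."""
    return sum(1 for x, y in zip(a, b) if x != y and x != "-" and y != "-")


def pairwise_differences(alignment):
    """Return {(name_1, name_2): number of differences} for every pair."""
    return {
        (n1, n2): count_differences(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Hypothetical aligned coding sequences.
alignment = {
    "seq_A": "ATGGCCAAA",
    "seq_B": "ATGGCCAAA",   # true duplicate of seq_A
    "seq_C": "ATGGCTAAA",   # one silent difference (GCC -> GCT, both Ala)
}

table = pairwise_differences(alignment)
duplicates = [pair for pair, n in table.items() if n == 0]
print(table)
print("duplicates:", duplicates)
```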

9. Eliminate inadequate sequences

! TASK

Activate the DNA alignment from within the main MEGA window and press the Translated Protein Sequences button.

9.1. Biology and Structure of the protein

! READ

Potassium ion channels remove the hydration shell from the ion when it enters the selectivity filter. In prokaryotic species the selectivity filter is formed by the five residues TVGYG in the P loop from each of the 4 subunits, as illustrated below for KcsA:

Indeed this pattern showed in all BLAST results, for example in Thermoplasma volcanium (the TVGYG motif lies at query positions 75-79 below):

Query  36  LVIVLLAGSYLAVLAERGAPGAQLITYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVV  95
            +IV+L GSYL  L +R    +++  Y  A+W+++ET TTVGYGD+ PV+  GR VA+++
Sbjct  26  FIIVVLIGSYLEFLTQRNVKYSEIKNYFTAIWFTMETVTTVGYGDVVPVSNLGRVVAMLI  85
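Locating the filter motif in a sequence is a simple substring search. The sketch below uses the two fragments from the BLAST alignment above; the function name and the `first_residue` parameter (used to report positions in the numbering of the full sequence) are ours.

```python
def motif_positions(sequence, motif, first_residue=1):
    """Return the position of every occurrence of `motif` in `sequence`.

    Positions are reported relative to `first_residue`, the number of the
    first residue of the fragment within the full-length sequence.
    """
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(first_residue + start)
        start = sequence.find(motif, start + 1)
    return positions


# Fragments from the BLAST alignment: query starts at 36, subject at 26.
kcsa = "LVIVLLAGSYLAVLAERGAPGAQLITYPAALWWSVETATTVGYGDLYPVTLWGRCVAVVV"
t_volcanium = "FIIVVLIGSYLEFLTQRNVKYSEIKNYFTAIWFTMETVTTVGYGDVVPVSNLGRVVAMLI"

print(motif_positions(kcsa, "TVGYG", first_residue=36))
print(motif_positions(t_volcanium, "TVGYG", first_residue=26))
```

The motif starts at residue 75 of KcsA and at residue 65 of the Thermoplasma volcanium protein.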

Thermoplasma volcanium is the only sequence that does not seem in register with the rest of the alignment in the region of the selectivity filter (starting at alignment position 392), as shown here with colors removed for clarity. It can also be noted that the TVGYG sequence is split by 2 gaps by the misaligned sequence. The selectivity filter sequence of Thermoplasma volcanium appears much earlier in the alignment, at position 231, and is also misaligned. This sequence is defined as Kef-type K+ transporter NAD-binding component [Thermoplasma volcanium GSS1] from a paper titled Archaeal adaptation to higher temperatures revealed by genomic sequence of Thermoplasma volcanium (Kawashima T. et al., Proc Natl Acad Sci U S A. 2000; 97(26):14257-14262), an organism with an optimum growth temperature of 60°C. The length of the volcanium sequence (348) is more than twice that of the query sequence (160). In addition, the description of the volcanium protein as both K+ transporter and NAD-binding suggests that this protein contains 2 domains. A quick search of the Pfam database (http://pfam.sanger.ac.uk/, a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models) reveals that indeed the volcanium protein contains 2 domains: the Ion_trans_2 domain and the NAD-binding TrkA-N domain:

Sequence                 Pfam-A       Description    Entry type  Start  End  HMM From  HMM To  E-value
Query sequence KcsA      Ion_trans_2  Ion channel    Domain         34  116         1      81  2.6e-18
Thermoplasma volcanium   Ion_trans_2  Ion channel    Domain         25  106         1      81  3.4e-21
Thermoplasma volcanium   TrkA_N       TrkA-N domain  Domain        123  247         1     121  3.2e-15

! INFO
The Ion_trans_2 pattern at the Pfam database has an Isoleucine insertion within the filter sequence.
#HMM    *->ivlllvlifgtvyysl....epeeg.wewsfldalYFsfvTlTTiGYGDivPlstdaGRlftivyiliGiplfllllavlgrflte<-*
#MATCH  ++l++vl++g+ + l ++++p+ ++ al++s+ T TT+GYGD++P+ t +GR +++v+++ Gi+ f+l++a l+++++
#SEQ    VLLVIVLLAGSYLAVLaergAPGAQlI--TYPAALWWSVETATTVGYGDLYPV-TLWGRCVAVVVMVAGITSFGLVTAALATWFVG 116

In addition, it is interesting to note that the filter sequence is not registered as a pattern in the Prosite database: http://www.expasy.ch/prosite/

9.2. Remove sequence

! READ

At this point we can make the decision to either remove the NAD binding domain portion of the volcanium sequence that is obviously not homologous to our query sequence, or we can decide to remove that sequence altogether. For simplicity we will remove that sequence from the alignment and delete the orphaned gaps.

! TASK
Return to DNA alignment view: press the DNA Sequences button
Right-click the T_volcanium sequence
Select Delete from the pull-down menu
We will now remove the many orphaned gaps that remain:
Edit > Select All
Alignment > Delete Gap-Only Sites
This will remove columns with only gaps and no sequence.
Export the alignment in DNA form: Data > Export Alignment > MEGA format
Enter a title and answer YES to the protein-coding question.
Name the file K+AlignDNA_No2.meg
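What Delete Gap-Only Sites does can be sketched in a few lines: a column is kept only if at least one sequence has a residue there. The alignment below is a toy example and the function is our own illustration, not MEGA's code.

```python
def delete_gap_only_sites(alignment):
    """Remove every column in which all sequences have a gap ('-')."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(length) if any(seq[i] != "-" for seq in sequences)]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Toy alignment after deleting a sequence that carried a long insertion.
alignment = {"seq_1": "AT--G", "seq_2": "A---G"}
print(delete_gap_only_sites(alignment))
```

Columns 3 and 4 contain only gaps and are removed; column 2 is kept because seq_1 has a base there.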

10. Estimate reliability of alignment with Average AA Identity

! READ

This is the final stage before calculating the tree.

It has been shown that when the percent identity of amino acids falls below 20%, the resulting sequence alignment has less than 50% of the amino acids correctly aligned [5]. If the percent identity is between 20 and 30%, the proportion of correctly aligned amino acids rises to 80%, and for percent identities above 30% it rises to 90%. As discussed previously, the tree accuracy varies little if more than 50% of the amino acids are correctly aligned.

! TASK
Return to protein alignment view: press the Translated Protein Sequences button.
Save the protein alignment into a file named e.g. K+AlignAA_No2.meg
On the MEGA main window activate this file by clicking the line: click to activate a data file. The file will open in the M4: Sequence Data Explorer.
On the MEGA main window follow the menu cascade: Distances > Compute Overall Mean...
In the opening Analysis Preferences window click on the green square on the -> Model line and follow the menu cascade: Amino Acid > p-distance

[5] Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999; 27(13):2682-2690.

The p-distance is 1 - amino acid identity (with identity expressed as a fraction); therefore the 20% identity limit discussed above corresponds to a p-distance of 0.8. A p-distance lower than 0.8 is acceptable; if it is above 0.8 the alignment is unreliable. Here we find 0.589 and therefore we can use this alignment to compute a tree or a family of trees. Note: if the value is shown as 1, simply press the up or down arrow above to increase the number of decimal places.
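The overall mean p-distance is the average, over all pairs of sequences, of the proportion of aligned positions that differ. The sketch below computes it for a toy protein alignment; ignoring positions where either sequence has a gap is our assumption about gap handling, and the sequences are invented.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing positions, ignoring positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no position is shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


def overall_mean_distance(alignment):
    """Average p-distance over all pairs of sequences in the alignment."""
    distances = [p_distance(a, b) for a, b in combinations(alignment.values(), 2)]
    return sum(distances) / len(distances)


# Hypothetical aligned protein sequences.
alignment = {"s1": "MKTAY", "s2": "MKSAY", "s3": "MRSAF"}

mean = overall_mean_distance(alignment)
print(f"overall mean p-distance: {mean:.3f}")
print("alignment usable:", mean < 0.8)   # 0.8 corresponds to 20% identity
```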

! INFO

For non-coding DNA sequences we cannot use the protein alignment as a guide as we have done above. Instead of a threshold of 20% amino acid identity, the minimum is 66% DNA identity to reach the threshold of 50% alignment accuracy [6].

[6] Kumar S, Filipski A. (Review) Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 2007; 17(2):127-135.

L11 Exercise E: Neighbor-Joining Phylogenetic Tree, Rooting

! INFO

http://en.wikipedia.org/wiki/ /Phylogenetics, /Computational_phylogenetics /Phylogenetic_tree

Phylogenetics is the study of evolutionary relatedness among various groups of organisms. Computational phylogenetics is the application of computational algorithms to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes or sequences. Computational algorithms: Distance-matrix methods such as neighbor-joining or UPGMA, which calculate genetic distance from multiple sequence alignments, are simplest to implement, but do not invoke an evolutionary model. Maximum parsimony is another simple method of estimating phylogenetic trees, but implies an implicit model of evolution (i.e. parsimony). More advanced methods use the optimality criterion of maximum likelihood, often within a Bayesian Framework, and apply an explicit model of evolution to phylogenetic tree estimation.

1. Create Neighbor-Joining Tree


In a previous exercise we used the neighbor-joining distance method to calculate the tree and we will use that method again. As shown in the INFO table above there are many other methods, some of which require long set-ups, statistical modeling and long calculations even with a fast computer. We created a suitable alignment in the previous exercise and we assessed its suitability by computing the overall mean distance.

! Preliminary: This is a continuation of the previous exercise. If you are starting from here, click on the phrase click here to activate a data file on the MEGA main window and open the previously created DNA alignment K+AlignDNA_No2.meg

If you are continuing from above make sure you are looking at the DNA rather than the translated protein sequences.

! TASK
From the main MEGA window follow the menu cascade: Phylogeny > Construct Phylogeny > Neighbor-Joining (NJ)...
In the Analysis Preferences window adjust the -> Model line to read Maximum Composite Likelihood (click the green square, then choose Nucleotide > Maximum Composite Likelihood)
Click Compute to calculate the tree.

2. Estimating the reliability of a tree: Bootstrapping

! INFO

In computational phylogenetics, bootstrapping refers to creating multiple pseudo-alignments by randomly sampling columns, with replacement, from the original alignment until each pseudo-alignment has the same length as the original alignment. For each pseudo-alignment a tree is calculated with the same parameters as the tree calculated for the original alignment. The tree is then assessed for the presence (score of 1) or absence (score of 0) of each clade that was present on the original tree and the scores are recorded. The next bootstrap cycle is then initiated. The number of cycles may depend on the computing time or the desired precision. 100 to 2000 bootstrap replicates are typical, and the calculation time increases with the number of cycles. We can be most confident in clades with 90 to 100% bootstrap values; the confidence level decreases with the calculated bootstrap values.

Vocabulary: A clade is a taxonomic group comprising a single common ancestor and all the descendants of that ancestor. A strap is a long narrow strip of pliant material such as leather. The computer term bootstrap began as a 1950s metaphor derived from using a strap to pull on leather boots without outside help. In computing, bootstrapping ("to pull oneself up by one's bootstraps") refers to techniques that allow a simple system to activate a more complicated system.
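The resampling step of the phylogenetic bootstrap can be sketched as follows. Only the creation of one pseudo-alignment and the final support percentage are shown; building a tree from each pseudo-alignment is left to MEGA. The function names, the toy alignment and the fixed random seed are our own choices for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-alignment by sampling columns with replacement.

    The same sampled column indices are applied to every sequence so that
    each pseudo-column is a real column of the original alignment.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def clade_support(scores):
    """Bootstrap value in percent from per-replicate scores (1 present, 0 absent)."""
    return 100 * sum(scores) / len(scores)


# Hypothetical alignment of three sequences.
alignment = {"seq_1": "ATGGCCAAA", "seq_2": "ATGGCTAAA", "seq_3": "ATGACTAAG"}

rng = random.Random(0)   # fixed seed so that the example is reproducible
pseudo = bootstrap_alignment(alignment, rng)
for name, seq in pseudo.items():
    print(name, seq)

# Example: a clade found in 3 of 4 replicate trees.
print(clade_support([1, 1, 0, 1]))
```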

! TASK
Follow the menu cascade from the main MEGA window: Phylogeny > Boostrap Test of Phylogeny > Neighbor-joining... Click on the Test of Phylogeny tab. Keep the default number of cycles to 500. The random seed varies each time and can be accepted as-is. Click ! to close the window. Click Compute to calculate the tree. In the window where the tree is displayed follow the menu cascade: View > Topology Only This will change the display of the tree and show the branching more clearly and read the bootstrap values more clearly.

! OPTIONAL 1
Save Current Session in .mts format and / or Export Current Tree (Newick)

! OPTIONAL 2
Explore the various tree display options

3. Tree Rooting

! READ

In spite of its appearance, the Neighbor-Joining tree (with or without bootstrapping) is unrooted. An unrooted tree illustrates relatedness without making assumptions about common ancestry, while a rooted tree is a

Biochem 711 2008

46

directed tree with a unique node corresponding to the most recent common ancestor (usually calculated). The most common method for rooting trees is the use of an uncontroversial outgroup, close enough to allow inference from sequence or trait data, but far enough to be a clear outgroup.

3.1. Finding an outgroup

The outgroup will allow the rooting of the tree. The evolutionary assumption is that the outgroup branched from the parent group before the other groups branched from each other.

! READ

You might have noted that there are 2 longer sequences among the sequences selected from the BLAST results: P_physalis and P_penicillatus. Indeed, these 2 sequences are Eukaryotic sequences, as shown within the NCBI sequence text:
DEFINITION  voltage-gated potassium channel [Physalia physalis].
ACCESSION   ABD59027
VERSION     ABD59027.1  GI:88976032
DBSOURCE    accession DQ385496.1
ORGANISM    Physalia physalis
            Eukaryota; Metazoa; Cnidaria; Hydrozoa; Siphonophora;
            Cystonectae; Physaliidae; Physalia.

DEFINITION  Polyorchis penicillatus potassium channel homolog (jShak1) mRNA,
            complete cds.
ACCESSION   U32922
VERSION     U32922.1  GI:987508
KEYWORDS    .
SOURCE      Polyorchis penicillatus (penicillate jellyfish)
  ORGANISM  Polyorchis penicillatus
            Eukaryota; Metazoa; Cnidaria; Hydrozoa; Hydroida; Anthomedusae;
            Polyorchidae; Polyorchis.
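
The same check can be done programmatically by reading the ORGANISM field and its taxonomic lineage from the record text. The sketch below is a minimal illustration, assuming the record is available as plain text in the usual GenBank flat-file layout (keyword at the start of the line, lineage on indented continuation lines); the function name `organism_lineage` is hypothetical and is not part of MEGA or of any NCBI tool.

```python
def organism_lineage(record_text):
    """Return (organism, lineage list) from a GenBank-style record fragment.

    Assumes the lineage follows the ORGANISM line on indented continuation
    lines and ends with a period, as in the records quoted above.
    """
    lines = record_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("ORGANISM"):
            organism = line.strip()[len("ORGANISM"):].strip()
            parts = []
            for cont in lines[i + 1:]:
                # Stop at a blank line or at the next keyword in column 0.
                if not cont.strip() or not cont.startswith(" "):
                    break
                parts.append(cont.strip())
            lineage = " ".join(parts).rstrip(".")
            return organism, [t.strip() for t in lineage.split(";") if t.strip()]
    raise ValueError("no ORGANISM line found")


physalia_record = """\
DEFINITION  voltage-gated potassium channel [Physalia physalis].
ACCESSION   ABD59027
ORGANISM    Physalia physalis
            Eukaryota; Metazoa; Cnidaria; Hydrozoa; Siphonophora;
            Cystonectae; Physaliidae; Physalia.
"""

polyorchis_record = """\
ACCESSION   U32922
SOURCE      Polyorchis penicillatus (penicillate jellyfish)
  ORGANISM  Polyorchis penicillatus
            Eukaryota; Metazoa; Cnidaria; Hydrozoa; Hydroida; Anthomedusae;
            Polyorchidae; Polyorchis.
"""

for text in (physalia_record, polyorchis_record):
    name, lineage = organism_lineage(text)
    print(name, "->", lineage[0])
```

The first lineage term is the broadest taxonomic group, so comparing it across all sequences in the alignment is a quick way to spot candidates for an outgroup.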

Both sequences are reported as potassium channels, and we can assume that they are homologs. In addition, these sequences fit on the alignment, including within the selectivity filter region, and are grouped together with a bootstrap value of 100.

3.2. Rooting the tree

! TASK
In the tree window, click the rooting icon, as shown here, located at the top left of the window. Click on the node connecting the 2 Eukaryotic sequences to root the tree at this node.

The tree is now a rooted tree.
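
What the rooting click does can be sketched as follows: a root node is placed on the branch that separates the outgroup from the other sequences, and every branch is then directed away from that root. This is a minimal sketch, not MEGA's implementation. The tree topology, the internal node names (`n1`, `n2`, `n3`) and the three `bact_*` leaf names are hypothetical; only P_physalis and P_penicillatus come from the exercise.

```python
def build_adjacency(edges):
    """Unrooted tree as {node: set of neighbours}, built from a list of branches."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def root_on_branch(adjacency, u, v, root="root"):
    """Insert a root node on the branch u-v and return {parent: [children]}."""
    # Work on a copy so that the unrooted tree is left untouched.
    adj = {node: set(neighbours) for node, neighbours in adjacency.items()}
    adj[u].remove(v)
    adj[v].remove(u)
    adj[root] = {u, v}
    adj[u].add(root)
    adj[v].add(root)

    # Walk away from the root; each neighbour other than the parent is a child.
    children = {}
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        kids = sorted(n for n in adj[node] if n != parent)
        children[node] = kids
        stack.extend((kid, node) for kid in kids)
    return children


def newick(children, node):
    """Serialize the rooted tree below `node` in Newick format (topology only)."""
    kids = children[node]
    if not kids:
        return node
    return "(" + ",".join(newick(children, kid) for kid in kids) + ")"


# Hypothetical unrooted tree: n1 joins the two Eukaryotic sequences.
unrooted = build_adjacency([
    ("n1", "P_physalis"),
    ("n1", "P_penicillatus"),
    ("n1", "n2"),
    ("n2", "bact_1"),
    ("n2", "n3"),
    ("n3", "bact_2"),
    ("n3", "bact_3"),
])

rooted = root_on_branch(unrooted, "n1", "n2")
print(newick(rooted, "root") + ";")
```

In the resulting tree the outgroup clade (the two Eukaryotic sequences) is one child of the root and all remaining sequences form the other child, which matches the assumption that the outgroup branched off first.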

L11 Exercise F: End of laboratory


1) Save files that you wish to keep
2) Quit MEGA
3) Close all windows.

-eClass notes
