01.-Databases and Sequence Retrieval

1.-DATABASES AND SEQUENCE RETRIEVAL

Biological databases store biological information such as nucleic acid, amino acid and protein
data. Some examples are:
- Nucleotide and amino acid sequence databases, like GenBank and GenPept (managed by the NCBI, USA),
SwissProt, PIR, TrEMBL (managed by EBI and EMBL, respectively, in Europe), and DISC (managed
by the DNA Information and Stock Center in Japan).
- Structure databases like PDB, managed by the RCSB in the USA.
- Expression databases, for example GEO (managed by the NCBI in the USA), Bodymap
(Japan), SMD (belonging to Stanford University, USA) and the MPR datasets (from MIT, USA).
- Organism-specific databases, for example SGD, YPD and MIPS, which store information
related to Saccharomyces cerevisiae, or BDGP, FlyBase and GadFly, which store data about
Drosophila melanogaster. Other databases include TIGR, related to bacteria.
1.1.-SEQUENCE RETRIEVAL: NCBI

An mRNA sequence can be obtained in several formats, notably FASTA and GenBank.
- FASTA format is a text-based format for representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are represented using single-letter codes.
- GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence
section. The start of the annotation section is marked by a line beginning with the word "LOCUS".
The start of the sequence section is marked by a line beginning with the word "ORIGIN" and the end
of the section is marked by a line with only "//".
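The two-section layout described above can be sketched in a few lines of Python. This is a minimal illustration, not an official parser: the function name `split_genbank_record` and the toy record are invented for the example, and the sketch assumes the record follows the "ORIGIN" and "//" markers exactly as described.

```python
def split_genbank_record(text):
    """Split a GenBank flat file record into (annotation, sequence)."""
    annotation_lines = []
    sequence_chunks = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the sequence section and of the record
        if line.startswith("ORIGIN"):
            in_sequence = True  # sequence data begin on the next line
            continue
        if in_sequence:
            # Sequence lines look like "        1 acttagccgg gactaacgta";
            # drop the leading position number and keep the bases.
            sequence_chunks.extend(
                token for token in line.split() if not token.isdigit()
            )
        else:
            annotation_lines.append(line)
    return "\n".join(annotation_lines), "".join(sequence_chunks)


# Toy record, invented for illustration (not a real GenBank entry).
toy_record = """LOCUS       TOY00001       20 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Toy sequence for illustration.
ORIGIN
        1 acttagccgg gactaacgta
//
"""

annotation, sequence = split_genbank_record(toy_record)
print(sequence)  # acttagccgggactaacgta
```

Printing the sequence after a ">" header line would give the same data in FASTA format.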
1.1.1.-GENBANK FLAT FILE FORMAT

The LOCUS field contains a number of different data elements, including locus name, sequence length,
molecule type, GenBank division, and modification date.
- The locus name (SCU49845) was originally designed to help group entries with similar sequences:
the first three characters usually designated the organism; the fourth and fifth characters were
used to show other group designations, such as gene product; for segmented entries, the last
character was one of a series of sequential integers.
- The sequence length (5028 bp) is the number of nucleotide base pairs (or amino acid residues) in
the sequence record.
- The molecule type (DNA) is the type of molecule that was sequenced.
- The GenBank division (PLN) to which a record belongs is indicated with a three letter
abbreviation. Some of the divisions contain sequences from specific groups of organisms, whereas
others contain data generated by specific sequencing technologies from many different
organisms. The organismal divisions are historical and do not reflect the current NCBI Taxonomy.
Instead, they merely serve as a convenient way to divide GenBank into smaller pieces.
- The modification date in the LOCUS field is the date of last modification. The sample
record shown here was last modified on 21-JUN-1999.
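As an illustration, the data elements of the LOCUS line can be pulled apart by splitting on whitespace. The sketch below is an assumption about how such a line might be handled, not the official parsing rule (the flat file format defines fixed column positions); the function name `parse_locus_line` is hypothetical, and the example line is assembled from the values quoted above.

```python
def parse_locus_line(line):
    """Extract the main data elements from a LOCUS line (whitespace-split sketch)."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "unit": tokens[3],             # "bp" for nucleotides, "aa" for proteins
        "molecule_type": tokens[4],
        # Division and date are the last two fields; newer records may place
        # extra fields (e.g. topology) before them, so index from the end.
        "division": tokens[-2],
        "modification_date": tokens[-1],
    }


locus = parse_locus_line(
    "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
)
print(locus["name"], locus["length"], locus["division"])  # SCU49845 5028 PLN
```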
The DEFINITION field contains a brief description of the sequence. It includes information such as source
organism, gene name/protein name, or some description of the sequence's function (if the sequence is
non-coding). If the sequence has a coding region (CDS), the description may be followed by a completeness
qualifier, such as "complete cds".
The ACCESSION field contains a unique identifier for a sequence record. An accession number applies to
the complete record and is usually a combination of a letter(s) and numbers, such as a single letter
followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456). Some accessions
might be longer, depending on the type of sequence record. Accession numbers do not change, even if
information in the record is changed at the author's request. Sometimes, however, an original accession
number might become secondary to a newer accession number, if the authors make a new submission
that combines previous sequences, or if for some reason a new submission supersedes an earlier record.
The VERSION field contains a nucleotide sequence identification number that represents a single, specific
sequence in the GenBank database. This identification number uses the accession.version format. If there
is any change to the sequence data (even a single base), the version number will be increased, e.g.,
U12345.1 → U12345.2, but the accession portion will remain stable. The accession.version system of
sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence,
it receives a new GI number AND an increase to its version number.
- The GI (GenInfo Identifier) is a sequence identification number, in this case for the nucleotide
sequence. If a sequence changes in any way, a new GI number will be assigned. A separate GI number
is also assigned to each protein translation within a nucleotide sequence record, and a new GI is
assigned if the protein translation changes in any way (see below).
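The accession.version format lends itself to a small sketch: the accession part stays stable while the version counts sequence changes. The helper names below are hypothetical and only illustrate the idea.

```python
def split_accession_version(identifier):
    """Split 'U12345.2' into the stable accession and the integer version."""
    accession, dot, version = identifier.partition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def is_newer_version(old_id, new_id):
    """True if both identifiers share an accession and new_id has a higher version."""
    old_acc, old_ver = split_accession_version(old_id)
    new_acc, new_ver = split_accession_version(new_id)
    return old_acc == new_acc and new_ver > old_ver


print(split_accession_version("U12345.2"))     # ('U12345', 2)
print(is_newer_version("U12345.1", "U12345.2"))  # True
```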
The KEYWORDS field includes a word or phrase describing the sequence. If no keywords are included in
the entry, the field contains only a period.
The SOURCE field contains free-format information including an abbreviated form of the organism name,
sometimes followed by a molecule type.
- The organism is the formal scientific name for the source organism (genus and species, where
appropriate) and its lineage, based on the phylogenetic classification scheme used in the NCBI
Taxonomy Database.
The REFERENCE field contains information about publications by the authors of the sequence that discuss
the data reported in the record. References are automatically sorted within the record based on date of
publication, showing the oldest references first. Some sequences have not been reported in papers and
show a status of "unpublished" or "in press". The REFERENCE field consists of the following subfields:
- Authors, in the order in which they appear in the cited article.
- Title of the published work or tentative title of an unpublished work.
- Journal, given as the MEDLINE abbreviation of the journal name.
- PubMed Identifier (PMID).
- Direct Submission includes contact information of the submitter, such as institute/department
and postal address. This is always the last citation in the References field.
The FEATURES field includes information about genes and gene products, as well as regions of biological
significance reported in the sequence. The location of each feature is provided as well, and can be a single
base, a contiguous span of bases, a joining of sequence spans, and other representations. If a feature is
located on the complementary strand, the word "complement" will appear before the base span. If the
"<" symbol precedes a base span, the sequence is partial on the 5' end (e.g., CDS <1..206). If the ">"
symbol follows a base span, the sequence is partial on the 3' end (e.g., CDS 435..915>).
- The source subfield is a mandatory feature in each record that summarizes the length of the
sequence, scientific name of the source organism, and Taxon ID number (from the NCBI database).
It can also include other information such as map location, strain, clone, tissue type, etc., if provided
by the submitter.
- The CDS (coding sequence) is the region of nucleotides that corresponds with the sequence of
amino acids in a protein (location includes start and stop codons). The CDS feature includes an
amino acid translation. Authors can specify the nature of the CDS by using the qualifier
"/evidence=experimental" or "/evidence=not_experimental". Submitters are also encouraged to
annotate the mRNA feature, which includes the 5' untranslated region (5'UTR), coding sequences
(CDS, exon), and 3' untranslated region (3'UTR).
- The exon is the region of the genome that codes for a portion of the spliced mRNA; it may contain
the 5' UTR, all CDSs, and the 3' UTR.
- The intron is a segment of DNA that is transcribed, but removed from within the transcript by
splicing together the sequences (exons) on either side of it.
- The STS (Sequence Tagged Site) is a short, single-copy DNA sequence that characterizes a mapping
landmark on the genome and can be detected by PCR. A region of the genome can be mapped by
determining the order of a series of STSs.
- The base span gives the location of the biological feature indicated to its left. Features can be
complete, partial on the 5' end, partial on the 3' end, and/or on the complementary strand.
1. A complete feature is simply written as n..m.
2. < indicates partial on the 5' end.
3. > indicates partial on the 3' end.
4. complement(n..m) indicates that the feature is on the complementary strand.
- The gene subfield points to a region of biological interest identified as a gene and for which a
name has been assigned. The base span for the gene feature is dependent on the furthest 5' and
3' features.
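The base span conventions above can be illustrated with a small parser. This is a sketch under stated assumptions: the function name `parse_location` is hypothetical, only simple spans are handled (no single bases or joined spans), and the ">" symbol is accepted either before or after the end coordinate.

```python
import re

# Optional "<", start, "..", optional ">", end, optional trailing ">".
_SPAN = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)(>?)$")


def parse_location(location):
    """Parse a simple feature location such as '<1..206' or 'complement(3..9)'."""
    location = location.strip()
    on_complement = False
    if location.startswith("complement(") and location.endswith(")"):
        on_complement = True
        location = location[len("complement("):-1]
    match = _SPAN.match(location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    lt, start, gt_before, end, gt_after = match.groups()
    return {
        "start": int(start),
        "end": int(end),
        "partial_5prime": lt == "<",
        "partial_3prime": bool(gt_before or gt_after),
        "complement": on_complement,
    }


print(parse_location("<1..206"))
print(parse_location("complement(3300..4037)"))
```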
The ORIGIN field may be left blank, may appear as "Unreported," or may give a local pointer to the
sequence start, usually involving an experimentally determined restriction cleavage site or the genetic
locus (if available). This information is present only in older records. The sequence data begin on the line
immediately below ORIGIN. To view/save the sequence data only, display the record in FASTA format. A
description of FASTA format is accessible from the BLAST Web pages.
1.2.-OPEN READING FRAMES

A reading frame is one of three possible ways of reading a nucleotide sequence. Let’s say we have a
stretch of 15 DNA base pairs:
acttagccgggacta
The first possibility is to start reading from the first adenine; this is referred to as the first reading
frame. It is also possible to start reading from the first cytosine or the first thymine (second and
third reading frames, respectively). The reading frame affects the final translated polypeptide. In the
example below the uppercase letters represent the amino acids that are coded by the three possible
reading frames:
It is worth noting that there are, in fact, six reading frames, three per strand (on the positive and on the
negative).
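The six reading frames of the 15-base example can be reproduced with a short Python sketch. It assumes the standard genetic code (stop codons shown as "*"); the function names are chosen for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the negative strand, written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate seq starting at the given offset (0, 1 or 2); '*' marks a stop."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    frames = {}
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            frames[sign * (offset + 1)] = translate(strand, offset)
    return frames


frames = six_frames("acttagccgggacta")
for label in (1, 2, 3, -1, -2, -3):
    print(f"Frame {label:+d}: {frames[label]}")
# Frame +1: T*PGL
# Frame +2: LSRD
# Frame +3: LAGT
# Frame -1: *SRLS
# Frame -2: SPG*
# Frame -3: VPAK
```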
ORF finder is a bioinformatics tool that searches for open reading frames in a nucleotide sequence. The
algorithm takes the sequence as FASTA input. Note that a strand can only be read in the 5’→3’ direction,
so each strand is read in one direction only.
Once the FASTA sequence is loaded, the algorithm looks at the first reading frame. If the first codon is not
the initiation codon (ATG), it keeps looking along the strand. Once it finds an initiation codon, it
looks for a stop codon (TAA, TAG or TGA). It then jumps to the second reading frame and repeats this
process.
This is performed on both the positive strand and the negative strand, each read in its own 5’→3’
direction. The results are expressed as Frame ± n, where n refers to the reading frame (1, 2 or 3).
Note that a minimum length can be set for the reading frames, specified in nucleotides. The goal is,
usually, to find the longest reading frame.
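The search described above can be sketched as follows. This is a simplified illustration of the idea, not the actual ORF finder implementation: it only reports reading frames that run from an ATG to an in-frame stop codon, the names `find_orfs` and `longest_orf` are hypothetical, and coordinates are 0-based offsets on the strand being read.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the negative strand, written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_length=0):
    """Return (frame, start, end, length) tuples for ATG...stop reading frames.

    frame is +1..+3 on the positive strand and -1..-3 on the negative strand;
    start is inclusive, end is exclusive, length is in nucleotides.
    """
    orfs = []
    for sign, strand in ((1, seq.upper()), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # initiation codon found
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # include the stop codon
                    if end - start >= min_length:
                        orfs.append((sign * (offset + 1), start, end, end - start))
                    start = None  # keep scanning for further ORFs
    return orfs


def longest_orf(seq, min_length=0):
    """Return the longest ORF tuple, or None if no ORF meets min_length."""
    orfs = find_orfs(seq, min_length)
    return max(orfs, key=lambda orf: orf[3]) if orfs else None


# Toy sequence invented for illustration: ATG AAA TAG in the third frame.
print(find_orfs("CCATGAAATAGGG"))  # [(3, 2, 11, 9)]
```

Raising `min_length` filters out short reading frames, which mirrors the minimum length setting mentioned above.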
1.3.-MUTATION NOTATION AND GRAPHICS

Mutation notation depends on the NCBI entry level one is working with, as shown in the image on the
right. Let’s first define what a contig is. A sequence contig is a continuous (not contiguous) sequence
resulting from the reassembly of the small DNA fragments generated by bottom-up sequencing
strategies. YACs and BACs, on the other hand, are yeast and bacterial artificial chromosomes,
respectively.
Clicking on “Graphics” in an NCBI record leads to a graphical representation of that sequence, with all
the described features laid out along it. The following table illustrates the colors and bars
used to describe each of the features.
Take the Homo sapiens CCDC50 gene as an example, more concretely the entry with accession number
NM_174908.3.
The graphical view can be used to track single mutations, but a specific notation has to be used to
describe them. After the accession number a colon is placed, followed by the level of the mutation: “c”
refers to the CDS, “g” to the genomic level and “n” to the mRNA level. (Note that the CDS corresponds to
the mRNA without its untranslated regions.) After the level identifier we place a dot, specify the position
of the mutated base, and finally indicate the base change. As an example, NT_010783.15:c.957C>T and
NT_010783.15:g.6848C>T refer to the same mutation, the only difference being the specified level.
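The parts of this notation can be separated with a regular expression. The sketch below only covers single-base substitutions at the three levels mentioned in the text; the pattern and the name `parse_substitution` are assumptions made for illustration, not a full implementation of any nomenclature standard.

```python
import re

# accession.version : level . position REF > ALT
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{1,2}_?\d+)\.(?P<version>\d+):"
    r"(?P<level>[cgn])\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)
LEVEL_NAMES = {"c": "CDS", "g": "genomic", "n": "mRNA"}


def parse_substitution(notation):
    """Split a substitution such as 'NM_001613.3:c.957C>T' into its parts."""
    match = _SUBSTITUTION.match(notation)
    if match is None:
        raise ValueError(f"unsupported notation: {notation!r}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["position"] = int(parts["position"])
    parts["level_name"] = LEVEL_NAMES[parts["level"]]
    return parts


variant = parse_substitution("NM_001613.3:c.957C>T")
print(variant["accession"], variant["level_name"], variant["position"],
      variant["ref"], variant["alt"])  # NM_001613 CDS 957 C T
```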
Example: gene ACTG1 presents a mutation identified as NM_001613.3:c.957C>T. The exonic coordinates
are specified below, and the CDS spans (140…127) with respect to the mRNA transcript. At what position
and in which exon of the mRNA transcript will the mutation be located?
The mutation is specified at the CDS level, and the CDS starts in exon 2, at position 140 of the mRNA.
Position 1 of the CDS therefore corresponds to position 140 of the mRNA, so we need to add 139 bp to
the mutation site to obtain its position in the mRNA: 957 + 139 = 1096 bp. The mutation is therefore
located in exon 5.
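The coordinate conversion in this example is a simple offset, which the following sketch makes explicit. The function names are hypothetical, and the exon boundaries in the second part are toy values invented to show the lookup, not the real ACTG1 exon coordinates.

```python
def cds_to_mrna_position(cds_position, cds_start):
    """Convert a CDS-level position to an mRNA-level position.

    cds_start is the mRNA position of the first CDS base, so CDS position 1
    maps onto cds_start and the offset to add is cds_start - 1.
    """
    return cds_position + cds_start - 1


def exon_containing(position, exons):
    """Return the 1-based exon number whose (start, end) span contains position."""
    for number, (start, end) in enumerate(exons, start=1):
        if start <= position <= end:
            return number
    return None


# Worked example from the text: c.957 with the CDS starting at mRNA position 140.
print(cds_to_mrna_position(957, 140))  # 1096

# Toy exon spans in mRNA coordinates (placeholders, not real ACTG1 data).
toy_exons = [(1, 120), (121, 400), (401, 900)]
print(exon_containing(250, toy_exons))  # 2
```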
1.4.-MAP VIEWER

In order to access the human database in Map Viewer it is necessary to select the corresponding sequence
in the search box, then press “Go”. Once the queried gene results are displayed one may further filter the
results.
In the “Quick Filter” box on the right-hand side one may select the “Gene” option and apply the filter. This
will narrow the search to only a few entries, which are the actual genes. Furthermore, in the tab options one
can select “Genes_seq”, an option that will display the gene’s map in a new window. This will display a
region of the chromosome with several genes, at the center of which (highlighted in red) is the queried gene.
In order to display the contig to which the gene belongs, one has to click on the “Maps & Options” tab in
the left-hand side panel. A new window will open in which it is possible to add more maps to the map
view; clicking on “Contig” and pressing “Ok” adds the contig’s map to the map view.
An additional option in the left-hand side panel is “Data As Table View”. Clicking on it opens a new
window in which all the previous information is displayed as text.
- Under the Genes on Sequence tab there is a list of all the genes found on the fragment with their
respective coordinates, relative to the chromosome. Additional information, such as orientation
(+ strand or – strand), can also be found.
- On the Contig tab similar information can be found as in the tab above, however relative to the
contig.
- Under the Symbol column there are hyperlinks to the NCBI’s gene entry. (Note that from here we
can find every mRNA transcript by going to “Genomic regions, transcripts, and products” → Go to
nucleotide → GenBank → Features (and here select the wanted transcript)).
- The option Download/View Sequence/Evidence leads to a new window in which the
chromosome coordinates of the gene can be translated into the contig coordinates.
