Unit 6 - Bioinformatics

BIOINFORMATICS
INTRODUCTION
• Bioinformatics is an interdisciplinary field mainly involving
molecular biology and genetics, computer science, mathematics,
and statistics.
• It applies computational techniques to solve biological problems:
1. Data problems: representation (graphics), storage and retrieval (databases), and analysis (statistics, artificial intelligence, optimization, etc.)
2. Biology problems: sequence analysis, structure or function prediction, data mining, etc.; this area is also called computational biology
• The National Center for Biotechnology Information (NCBI, 2001) defines bioinformatics as the field of science in which biology, computer science, and IT merge into a single discipline
• There are three important sub-disciplines within bioinformatics:
1. the development of new algorithms and statistics which assess relationships among members of large data sets
2. the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures
3. the development and implementation of tools that enable efficient access and management of different types of information.
• Types of datasets: genome sequences, macromolecular structures, and functional genomics experiments (e.g., microarray data)
• Other data: phylogenetic and metabolic pathway analysis, the text of scientific papers, and plant varietal information and statistics.
• Analysis of biological data requires a large number of techniques, such as primary sequence alignment, protein 3D structure alignment, phylogenetic tree construction, prediction and classification of protein structure, prediction of RNA structure, prediction of protein function, and expression data clustering.
• Development of suitable algorithms is an important part of bioinformatics
• Some techniques and algorithms were developed specifically for the analysis of biological data; for instance, the dynamic programming algorithm for sequence alignment is one of the most popular methods among biologists
• The sequence information generated worldwide is stored systematically in different types of databases
• Hence, it is necessary to understand the databases and their different types
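The dynamic programming approach to sequence alignment mentioned above can be illustrated with a minimal sketch of global alignment scoring (the Needleman-Wunsch recurrence). The function name and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell depends only on its three neighbours, so the table is filled in time proportional to the product of the two sequence lengths. Recovering the alignment itself would need an additional traceback step, omitted here.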
Pattern recognition
• The initiation of translation or transcription is determined by the presence of specific patterns, or motifs, in DNA or RNA.
• Research on detecting specific patterns in DNA sequences, such as genes, protein-coding regions, and promoters, helps uncover functional aspects of cells.
• Patterns are used in database searching, e.g., BLOCKS among protein databases
• Pattern searching with BLAST and FASTA finds the closest matches
PATTERNS MOST EXAMINED IN DNA SEQUENCES
Gene features | DNA characteristics
Coding sequences | ORFs, GC rich, CpG content
Translational start and stop sites | Start: ATG; Stop: TAA, TAG, TGA
Splice sites (exon/intron borders) | Consensus sequences
Promoter regions | TATA, Shine-Dalgarno, Pribnow, Kozak consensus, CpG content
Poly A signals | Consensus sequence, 10-20 bases upstream of the poly A tail
Prokaryotic gene structure
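[Figure: prokaryotic gene structure. A TATA box lies upstream of an ORF (open reading frame) that runs from the start codon to the stop codon in the example sequence ATGACAGATTACAGATTACAGATTACAGGATAG; reading frames 1, 2, and 3 are marked.]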
Prokaryotes
• Advantages
– Simple gene structure
– Small genomes (0.5 to 10 million bp)
– No introns
– Genes are called Open Reading Frames (ORFs)
– High coding density (>90%)
• Disadvantages
– Some genes overlap (nested)
– Some genes are quite short (<60 bp)
Gene finding approaches
1) Rule-based (e.g., start & stop codons)
2) Content-based (e.g., codon bias, promoter sites)
3) Similarity-based (e.g., orthologs)
4) Pattern-based (e.g., machine learning)
5) Ab-initio methods (FFT)

Simple rule-based gene finding
• Look for a putative start codon (ATG)
• Staying in the same frame, scan in groups of three until a stop codon is found
• If the number of codons is >= 50, assume it's a gene
• At the end of the chromosome, repeat the process for the reverse complement
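The rule-based procedure above can be sketched in a few lines. This is a minimal illustration, not a production gene finder: the function names are hypothetical, coordinates are 0-based on the scanned strand, and codons are counted from the start codon up to (but not including) the stop codon, which is one possible reading of the 50-codon rule.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=50):
    """Return (start, end, strand) tuples for ORFs with >= min_codons codons.

    start/end are 0-based, end-exclusive positions on the scanned strand,
    and the span includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            i = frame
            while i + 3 <= len(s):
                if s[i:i + 3] != "ATG":
                    i += 3
                    continue
                # Putative start found: scan in-frame for a stop codon.
                j = i + 3
                while j + 3 <= len(s) and s[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 > len(s):
                    break  # no in-frame stop left in this frame
                if (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3, strand))
                i = j + 3  # resume scanning after the stop codon
    return orfs


# The short example sequence from the slides, with a lowered threshold.
print(find_orfs("ATGACAGATTACAGATTACAGATTACAGGATAG", min_codons=5))
```

With the default threshold of 50 codons the example sequence yields no ORF, which is why the usage line lowers `min_codons`.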
[Figure: example ORF]
Content based gene prediction method
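• RNA polymerase promoter sites (-10 and -35 sites; the -10 site is the Pribnow box, the bacterial counterpart of the TATA box)
• Shine-Dalgarno sequence (ribosome binding site, located a few bases upstream of the start codon) to initiate protein translation
• Codon biases
• High GC content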
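Two of the content signals listed above, GC content and codon bias, are simple to compute. The sketch below is illustrative only; the function names, window size, and step are assumptions, and a real content-based predictor would compare these statistics against values learned from known genes.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=100, step=50):
    """GC content of each full window, as (start, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def codon_usage(orf):
    """Count in-frame codons of an ORF; a trailing partial codon is ignored."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


print(sliding_gc("AAAAGGGG", window=4, step=4))
print(codon_usage("ATGAAAAAATAG"))
```

Windows with unusually high GC content, or ORFs whose codon counts resemble those of known genes in the same organism, would count as supporting evidence for a coding region.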
Similarity-based gene finding
• Take all known genes from a related genome and compare them to the query genome via BLAST
• Disadvantages:
– Orthologs (genes in different species that evolved from a common ancestral gene by speciation) and paralogs (genes related by duplication within a genome, which may evolve new functions) sometimes lose function and become pseudogenes
– Not all genes will always be known in the comparison genome
– The best species for comparison isn’t always obvious
• Similarity comparisons are good supporting evidence for prediction validity
Machine Learning Techniques
Hidden Markov Model
ANN based method
Bayes Networks
Ab-initio Methods
• Fast Fourier Transform (FFT) based methods
• Performance is generally poor
• Able to identify new genes, since no known homolog is required
• FTG method: FTG is a web server that analyzes nucleotide sequences to predict genes using Fourier transform techniques
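Fourier-based gene finders generally rely on the observation that protein-coding DNA tends to show a periodicity of three bases, which appears as a peak at frequency 1/3 in the Fourier spectrum. The sketch below evaluates the spectrum at that single frequency with a direct sum rather than a full FFT; the function name is hypothetical, and this is not a description of how the FTG server itself is implemented.

```python
import cmath


def period3_power(seq):
    """Total spectral power at frequency 1/3, summed over the four bases.

    Each base is turned into a 0/1 indicator sequence, and the discrete
    Fourier sum at frequency 1/3 is evaluated directly.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total


# A perfectly 3-periodic sequence scores high; a homopolymer scores ~0.
print(period3_power("ATG" * 20), period3_power("A" * 60))
```

In practice the power would be computed in a sliding window and compared with the background spectrum, so that windows with a pronounced period-3 peak are flagged as candidate coding regions.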
Eukaryotic genes
Eukaryotes
• Complex gene structure
• Large genomes (0.1 to 3 billion bases)
• Exons and Introns (interrupted)
• Low coding density (<30%)
– 3% in humans, 25% in Fugu, 60% in yeast
• Alternative splicing (40-60% of all genes)
• Considerable number of pseudogenes
Finding Eukaryotic Genes Computationally
• Rule-based
– Not as applicable – too many false positives
• Content-based Methods
– CpG islands, GC content, hexamer repeats, composition statistics, codon
frequencies
• Feature-based Methods
– donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals,
feature lengths
• Similarity-based Methods
– sequence homology, EST (expressed sequence tags) searches
• Pattern-based
– HMMs, Artificial Neural Networks
• Most effective is a combination of all the above
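As one example of a content-based signal from the list above, CpG islands are usually scored by the ratio of observed to expected CpG dinucleotides. The sketch below uses the common formula (number of CpG × sequence length) / (number of C × number of G). The function names are hypothetical, and the default thresholds (length >= 200 bp, GC content > 50%, ratio > 0.6) are widely used criteria that are assumed here, not taken from the slides.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio; 0.0 if the sequence has no C or no G."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply assumed length, GC-content, and CpG-ratio thresholds."""
    seq = seq.upper()
    if len(seq) < min_length:
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc > min_gc and cpg_observed_expected(seq) > min_ratio


print(cpg_observed_expected("CGCGCGCG"))
print(looks_like_cpg_island("CG" * 100))
```

A ratio near 1 means CpG occurs about as often as the base composition predicts; most vertebrate genomic DNA is well below that, which is what makes CpG-rich stretches near promoters stand out.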
Gene prediction programs
• Rule-based programs
– Use explicit set of rules to make decisions.
– Example: GeneFinder
• Neural Network-based programs
– Use data set to build rules.
– Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
– Use probabilities of states and transitions between
these states to predict features.
– Examples: Genscan, GenomeScan
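To make the idea of "states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. Every number in it is a made-up assumption for illustration (a GC-rich "coding" state and an AT-rich "noncoding" state); real programs such as Genscan use far richer state sets and trained parameters.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities below are hypothetical values chosen for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq):
    """Return the most probable state path for a DNA string (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for ch in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (
                best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][ch])
            )
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("AT" * 5 + "GC" * 5))
```

Working in log space avoids numerical underflow on long sequences. The decoded path labels each base with a state, which is how an HMM-based predictor turns probabilities into predicted features.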
Combined Methods
• GRAIL (http://compbio.ornl.gov/Grail-1.3/)
• FGENEH (http://www.bioscience.org/urllists/genefind.htm)
• HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• GenomeScan (http://genes.mit.edu/genomescan.html)
• Twinscan (http://ardor.wustl.edu/query.html)
Egpred: Prediction of Eukaryotic Genes
• Similarity Search
– First BLASTX against the RefSeq database
– Second BLASTX against sequences from the first BLAST search
– Detection of significant exons from BLASTX output
– BLASTN against introns to filter exons
• Prediction using ab-initio programs
– NNSPLICE used to compute splice sites
• Combined method
Biological databases
Introduction
• Biological databases: libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis
• They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics
• There are two main functions of biological databases:
1. To make biological data available to scientists.
2. To make biological data available in computer-readable form.

• Biological databases can be broadly classified into sequence and structure databases
• Sequence databases cover both nucleic acid sequences and protein sequences
• Structure databases mainly cover proteins.
• The first database was created within a short period after the insulin protein sequence was made available in 1956.
• Around the mid-1960s, the first nucleic acid sequence, that of yeast tRNA with 77 bases (individual units of nucleic acids), was determined. During this period, three-dimensional structures of proteins were studied, and the well-known Protein Data Bank was developed as the first protein structure database, with only 7 entries in 1971
• Databases in general can be classified into primary, secondary, or composite databases.
• A primary database contains information on the sequence or structure alone.
• Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
• Examples include
1. Swiss-Prot, PIR: protein sequences
2. EMBL, GenBank, and DDBJ: nucleotide sequences (the International Nucleotide Sequence Database Collaboration, INSDC)
3. PDB, SCOP: protein structures

International nucleotide data banks
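[Figure: the three international nucleotide data banks and their hosts. GenBank (USA; NCBI at NLM), EMBL (Europe; EBI), and DDBJ (Japan; CIB at NIG) are linked through an International Advisory Meeting and a Collaborative Meeting; TrEMBL and NRDB also appear in the diagram.]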
GenBank (NCBI)
• NCBI was created in 1988 as a part of the National Library of Medicine at NIH
– GenBank is an open-access, annotated collection of publicly available nucleotide sequences
– Produced and maintained by NCBI
– Accessed and searched through the Entrez system at NCBI
– NCBI also develops software tools for sequence analysis
EMBL
• European Molecular Biology Laboratory
• Supported by 20 European countries &
Australia
• Nucleotide sequence database
• Maintained by EBI (European
Bioinformatics Institute)
DDBJ
• DNA Data Bank of Japan
• Collaborates with EMBL and GenBank
• Run by National Institute of Genetics
• A secondary database contains information derived from the primary databases.
• Secondary databases are often referred to as curated databases, but this is a bit of a misnomer because primary databases are also curated to ensure that the data in them are consistent and accurate.
Aspect | Primary database | Secondary database
Synonyms | Archival database | Curated database; knowledgebase
Source of data | Direct submission of experimentally-derived data from researchers | Results of analysis, literature research and interpretation, often of data in primary databases
Examples | ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress Archive and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures) | InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences)
Composite protein Databases
• There are a number of "composite" databases of protein sequences.
• These compile their sequence data from the primary sequence databases and filter them to retain only the non-redundant sequences.
• The best-known are OWL and the non-redundant database (NRDB) at NCBI
• PIR (Protein Information Resource), SWISS-PROT, TrEMBL
• PROSITE, Pfam (motif databases)

Database searching
• Databases use a system in which an entry can be identified in two different ways:
1. Identifier
2. Accession code (or number)

1. Identifier:
– String of letters and digits
– Abbreviation of the full protein or gene name
– Called “locus” in GenBank and “entry name” in SWISS-PROT
– Changeable
E.g., KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens
2. Accession code (or number):
– A number (possibly with a few characters in front) that uniquely identifies an entry in its database
– Stable
E.g., the accession code for KRAF_HUMAN in SWISS-PROT is P04049
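The two kinds of label above can often be told apart by their shape. The sketch below classifies a string as an accession or an entry name. The accession pattern follows the format published in the UniProt help pages; the entry-name pattern (a mnemonic, an underscore, and a species code) is a simplified assumption, and the function name is hypothetical.

```python
import re

# UniProt accession format, as documented in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified assumption: MNEMONIC_SPECIES, e.g. KRAF_HUMAN.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_label(label):
    """Return 'accession', 'entry name', or 'unknown' for a database label."""
    label = label.strip().upper()
    if ACCESSION_RE.fullmatch(label):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(label):
        return "entry name"
    return "unknown"


for label in ("P04049", "KRAF_HUMAN", "not a label"):
    print(label, "->", classify_label(label))
```

Because entry names can change while accessions stay stable, scripts that store references to database entries are usually better off keeping the accession.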

• Software systems are needed to perform searches such as
– all entries with a keyword (e.g., “GTPase”)
– the entry with a given literature reference (by author or article)
– all proteins with a keyword (e.g., “ribosomal”)
• Two examples of such software systems:
– SRS (Sequence Retrieval System)
– ENTREZ
• SRS:
– Sequence Retrieval System
– Developed by EBI
– System for integrating heterogeneous databases
– Web-oriented system, accessed through HTML pages and Common Gateway Interface (CGI) scripts
• ENTREZ:
– Developed and accessible at the NCBI Entrez site
– Provides search facilities for a large number of databases and links between them
– Provides a well-defined web interface
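Besides its web pages, Entrez can be queried programmatically through NCBI's E-utilities, which the slides do not cover. As a minimal sketch, the code below only builds an ESearch request URL for a keyword query and does not contact the server; the function name is hypothetical, while `db`, `term`, and `retmax` are standard ESearch parameters.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build an ESearch URL that searches one Entrez database for a term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Keyword searches like the ones listed above.
print(build_esearch_url("protein", "GTPase", max_results=5))
print(build_esearch_url("protein", "ribosomal protein"))
```

Fetching the URL (for example with `urllib.request.urlopen`) returns a list of matching record IDs. NCBI asks automated clients to limit their request rate, so a real script should throttle its calls.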
