Data Mining in Bioinformatics

Bioinformatics

Bioinformatics is the discipline of quantitative analysis of information
relating to biological macromolecules with the aid of computers. The
development of bioinformatics as a field is the result of advances in both
molecular biology and computer science over the past 30–40 years.

• Bioinformatics is the science of storing, extracting, organizing, analyzing,
interpreting, and utilizing information from biological sequences and
molecules. It can also be defined as the application of computer technology
to the management of biological information.

The rapid developments in genomics and proteomics have generated a large
amount of biological data, which is stored in different biological databases.
Analyzing large biological data sets requires making sense of the data by
inferring structure or generalizations from the data. Drawing conclusions from
these data requires sophisticated computational analyses. Bioinformatics
consists of two subfields: the development of computational tools and
databases, and the application of these tools and databases in generating
biological knowledge to better understand living systems. These two subfields
are complementary to each other.

Database
A database is a computerized archive used to store and organize data in such a
way that information can be retrieved easily via a variety of search criteria.
Databases are composed of computer hardware and software for data
management. The chief objective of the development of a database is to
organize data in a set of structured records to enable easy retrieval of
information.

Types of biological databases: Biological databases can be divided into many
categories according to their organization or the type of molecules they
contain. Here they are divided primarily by the biological molecules they
contain.

1. Nucleic Acid Databases: These contain nucleic acid sequences, both DNA and
RNA, e.g. GenBank, EBI nucleotide, DDBJ.
2. Protein Databases: These contain information on protein molecules, such as
sequences, functional annotation, and post-translational modifications, e.g.
Protein (NCBI), Swiss-Prot, Pfam, PRINTS, CATH.
3. Structural Databases: These contain information about the structural
features of biomolecules, including their 3-D structure. The main archive is
PDB, which contains experimentally determined structures.
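
The idea of structured records that can be retrieved by several search criteria can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database. The table layout, the column names, and the three toy records (accessions TOY0001 to TOY0003) are hypothetical and do not reflect the schema of GenBank, Swiss-Prot, or any other real database.

```python
import sqlite3

# Hypothetical schema and toy records, for illustration only.
def build_toy_database():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "accession TEXT PRIMARY KEY, molecule TEXT, organism TEXT, sequence TEXT)"
    )
    rows = [
        ("TOY0001", "DNA", "Homo sapiens", "ATGGCCATTG"),
        ("TOY0002", "protein", "Homo sapiens", "MAIVKLT"),
        ("TOY0003", "DNA", "Mus musculus", "ATGGCGATTC"),
    ]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
    return conn

def search(conn, **criteria):
    """Return accessions of records matching every column=value criterion."""
    allowed = {"accession", "molecule", "organism"}
    unknown = set(criteria) - allowed
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    sql = "SELECT accession FROM records"
    if criteria:
        # Column names come from the allowed set; values are bound parameters.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in criteria)
    sql += " ORDER BY accession"
    return [row[0] for row in conn.execute(sql, tuple(criteria.values()))]

conn = build_toy_database()
dna_records = search(conn, molecule="DNA")
mouse_dna = search(conn, molecule="DNA", organism="Mus musculus")
```

Each keyword argument to search is one search criterion, so the same set of records can be queried by molecule type, by organism, or by both together.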

DATA MINING
Data mining refers to extracting or “mining” knowledge from large amounts of
data. Data mining (DM) is the science of finding new and interesting patterns
and relationships in huge amounts of data. It is defined as “the process of
discovering meaningful new correlations, patterns, and trends by digging into
large amounts of data stored in warehouses”. Data mining is also sometimes
called Knowledge Discovery in Databases (KDD). Data mining is not specific to
any industry. It requires intelligent technologies and the willingness to
explore the possibility of hidden knowledge that resides in the data. Data
mining approaches seem ideally suited to bioinformatics, since the field is
data-rich but lacks a comprehensive theory of life’s organization at the
molecular level. The extensive databases of biological information create both
challenges and opportunities for the development of novel KDD methods. Mining
biological data helps to extract useful knowledge from massive datasets
gathered in biology and in related life sciences such as medicine and
neuroscience.

In recent years, rapid developments in genomics and proteomics have generated
a large amount of biological data. Drawing conclusions from these data requires
sophisticated computational analyses. Analyzing large biological data sets
requires making sense of the data by inferring structure or generalizations
from the data. Examples of this type of analysis include protein structure
prediction, gene classification, cancer classification based on microarray
data, clustering of gene expression data, and statistical modeling of
protein-protein interaction. The primary goal of bioinformatics is to increase
the understanding of biological processes.

Bioinformatics Approaches
Analysis takes place at multiple levels, for both genes and proteins,
including:

1. Gene finding
2. Protein structure and function prediction
3. Comparative genomics
4. Gene expression analysis

And many more.

Existing Computational Tools in Bioinformatics

1. Sequence similarity
2. Multiple sequence alignments
3. Database searching
4. Evolutionary (phylogenetic) tree construction
5. Sequence assemblers
6. Gene finders
7. Gene expression analysis (microarray)

And many more.

Genome analysis tools
1. Gene finders and general genome annotation tools (GeneFinder, FGENESH, GeneID).
2. Cross-genome comparison tools and databases (GenBank, DDBJ, EBI-nucleotide,
BLAST, FASTA, GenomeBlast, KEGG, GenomeComp, MapViewer).
3. Large-scale sequence assembly and polymorphism identification tools (Phred, Phrap).
4. Genomic visualization tools (UCSC, NCBI, Ensembl).
5. Data cleansing tools for vector screening and repeat masking (VecScreen (NCBI),
RepeatMasker).

Protein analysis tools
1. Protein composition, isoelectric point, and molecular weight analysis tools
(MultiIdent-ExPASy).
2. Comparable alignment/searching tools for proteins (BLAST, FASTA, LALIGN, ClustalW).
3. Protein secondary structure prediction tools (GOR, Jpred, PredictProtein).
4. Protein structure modeling tools (Swiss-Model, Modeller).
5. Identification and characterization with peptide mass fingerprinting data (Mascot).
6. Pattern and profile searches (Pfam, Prosite, SMART).

Gene expression tools
1. EST clustering and differential expression analysis tools and databases.
2. SAGE analysis tools and databases (NCBI-SAGE).
3. Microarray data collection, calibration, and analysis tools and databases (GEO).
4. Gene clustering and visualization tools (Cluster).

BLAST (Basic Local Alignment Search Tool)


The BLAST program was developed by Stephen Altschul of NCBI in 1990 and has
since become one of the most popular programs for sequence analysis. BLAST
uses heuristics to align a query sequence with all sequences in a database. The
objective is to find high-scoring ungapped segments among related sequences.
The existence of such segments above a given threshold indicates pairwise
similarity beyond random chance, which helps to discriminate related sequences
from unrelated sequences in a database.

Purpose of BLAST:

1. Identifying orthologs and paralogs
2. Discovering new genes or proteins
3. Discovering variants of genes or proteins
4. Investigating expressed sequence tags (ESTs)
5. Exploring protein structure and function

BLAST performs sequence alignment through the following steps. The first step
is to create a list of words from the query sequence. Each word is typically
three residues for protein sequences and eleven residues for DNA sequences. The
list includes every possible word extracted from the query sequence. This step
is also called seeding. The second step is to search a sequence database for
the occurrence of these words, in order to identify database sequences
containing the matching words. In the third step, the matching of the words is
scored by a given substitution matrix, and a word is considered a match only if
its score is above a threshold. The fourth step involves pairwise alignment by
extending from the words in both directions while counting the alignment score
using the same substitution matrix. The extension continues until the score of
the alignment drops below a threshold due to mismatches (the drop threshold is
twenty-two for proteins and twenty for DNA). The resulting contiguous aligned
segment pair without gaps is called a high-scoring segment pair (HSP). In the
original version of BLAST, the highest-scoring HSPs are presented as the final
report. They are also called maximum scoring pairs.
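
The seeding, word search, and ungapped extension described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the BLAST implementation. It makes three assumptions: words must match exactly (instead of being scored against a threshold with a substitution matrix), scoring is a plain +1/-1 for match/mismatch, and a small drop threshold is used that suits this toy scoring. All function and parameter names are hypothetical.

```python
def make_words(query, w):
    """Seeding: list every overlapping word of length w with its query offset."""
    return [(query[i:i + w], i) for i in range(len(query) - w + 1)]

def find_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    index = {}
    for word, qpos in make_words(query, w):
        index.setdefault(word, []).append(qpos)
    hits = []
    for spos in range(len(subject) - w + 1):
        for qpos in index.get(subject[spos:spos + w], []):
            hits.append((qpos, spos))
    return hits

def score_pair(a, b, match=1, mismatch=-1):
    """Toy substitution score (assumption: not a real substitution matrix)."""
    return match if a == b else mismatch

def extend_one_way(query, subject, qpos, spos, step, x_drop):
    """Extend without gaps in one direction; return (best_gain, best_length)."""
    total = best = 0
    length = best_len = 0
    q, s = qpos, spos
    while 0 <= q < len(query) and 0 <= s < len(subject):
        total += score_pair(query[q], subject[s])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break  # score fell too far below the best seen so far
        q += step
        s += step
    return best, best_len

def extend_hit(query, subject, qpos, spos, w, x_drop):
    """Extend a word hit in both directions; return (score, q_start, s_start, length)."""
    seed = sum(score_pair(query[qpos + k], subject[spos + k]) for k in range(w))
    right_gain, right_len = extend_one_way(query, subject, qpos + w, spos + w, 1, x_drop)
    left_gain, left_len = extend_one_way(query, subject, qpos - 1, spos - 1, -1, x_drop)
    return (seed + left_gain + right_gain,
            qpos - left_len, spos - left_len, left_len + w + right_len)

def best_hsp(query, subject, w=3, x_drop=2):
    """Highest-scoring ungapped segment pair, or None if no word matches."""
    hsps = [extend_hit(query, subject, q, s, w, x_drop)
            for q, s in find_hits(query, subject, w)]
    return max(hsps, default=None)

hsp = best_hsp("TTACGTAGG", "CCACGTACC")
```

The word length of 3 is used here only to keep the toy DNA example short. The extension keeps the best-scoring prefix it has seen, so trailing mismatches scanned before the drop threshold is hit are not included in the reported segment.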

[Figure: Seeding in BLAST]

Statistical Significance in BLAST

BLAST searches report a statistical indicator known as the E-value
(expectation value). It indicates the number of alignments with an equal or
better score that are expected to occur by random chance in a database search;
for very small values it is approximately the probability that a match is due
to chance. BLAST compares a query sequence against all database sequences, so
the E-value is determined by the following formula:

$E = m \times n \times P$

where m is the total number of residues in a database, n is the number of
residues in the query sequence, and P is the probability that an HSP alignment
is a result of random chance. For example, aligning a query sequence of 100
residues to a database containing a total of $10^{12}$ residues results in a
P-value for the ungapped HSP region in one of the database matches of
$1 \times 10^{-20}$. The E-value, which is the product of the three values, is
$100 \times 10^{12} \times 10^{-20}$, which equals $10^{-6}$. It is expressed
as 1e-6 in BLAST output. This indicates that the probability of this database
sequence match occurring due to random chance is about $10^{-6}$.
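
The worked example can be checked with a short Python sketch that applies the formula exactly as stated in the text. The function name e_value and its parameter names are chosen here for illustration and are not part of any BLAST interface.

```python
def e_value(db_residues, query_residues, p_value):
    """E = m * n * P, using the simplified formula given in the text."""
    return db_residues * query_residues * p_value

# Values from the worked example: m = 10^12, n = 100, P = 10^-20.
e = e_value(db_residues=1e12, query_residues=100, p_value=1e-20)
print(f"{e:.0e}")  # prints 1e-06
```

Because E grows with both the database size m and the query length n, the same alignment gives a larger E-value when it is found in a larger database.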

PSI-BLAST
PSI-BLAST is a variant of BLAST. This program is used to find distant
relatives of a protein using a position-specific scoring matrix (PSSM). First,
a list of all closely related proteins is created. These proteins are combined
into a general "profile" sequence, which summarises significant features
present in these sequences. A query against the protein database is then run
using this profile, and a larger group of proteins is found. This larger group
is used to construct another profile, and the process is repeated.

By including related proteins in the search, PSI-BLAST is much more sensitive
in picking up distant evolutionary relationships than a standard
protein-protein BLAST.
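
To make the idea of a profile concrete, the following Python sketch builds a simple log-odds position-specific matrix from a set of already aligned sequences and uses it to score a candidate segment. It is a toy construction under stated assumptions, not the PSI-BLAST procedure: the sequences are ungapped and of equal length, the background frequencies are uniform, a fixed pseudocount is added to every residue, and no sequence weighting is applied. A four-letter alphabet is used only to keep the example short, and all names are hypothetical.

```python
import math

def build_pssm(aligned, alphabet, pseudocount=1.0):
    """Build a log2-odds profile from equal-length, ungapped aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    denominator = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in aligned]
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / denominator
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal profile length")
    return sum(pssm[i][residue] for i, residue in enumerate(segment))

# Toy alignment of three related sequences (hypothetical data).
profile = build_pssm(["ACGT", "ACGA", "ACCT"], alphabet="ACGT")
close_match = score_with_pssm(profile, "ACGT")
poor_match = score_with_pssm(profile, "TTTT")
```

In an iterative search, sequences that score well against the profile would be added to the alignment and the profile rebuilt, which corresponds to the repeated step described above.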

