Bioinformatics
Database
A database is a computerized archive used to store and organize data in such a
way that information can be retrieved easily via a variety of search criteria.
Databases are composed of computer hardware and software for data
management. The chief objective in developing a database is to organize data
into a set of structured records to enable easy retrieval of information.
Types of Biological Databases:
Biological databases can be divided into many categories, for example by how
they are organized or by the type of molecules they contain. Here we classify
databases primarily by the type of biological molecule they contain.
1. Nucleic Acid Databases: These contain nucleic acid sequences, both DNA and
RNA, e.g. GenBank, the EBI nucleotide database, DDBJ.
2. Protein Databases: These contain information on protein molecules, such as
sequences, functional annotation, post-translational modifications, etc.,
e.g. Protein (NCBI), Swiss-Prot, Pfam, PRINTS, CATH.
3. Structural Databases: These contain information about the structural
features of biomolecules, including their 3-D structure. The main
archive here is the PDB (Protein Data Bank), which contains experimentally
determined structures.
DATA MINING
Data mining refers to extracting or “mining” knowledge from large amounts of
data. Data Mining (DM) is the science of finding new, interesting patterns and
relationships in huge amounts of data. It is defined as “the process of discovering
meaningful new correlations, patterns, and trends by digging into large amounts
of data stored in warehouses”. Data mining is also sometimes called Knowledge
Discovery in Databases (KDD). Data mining is not specific to any industry. It
requires intelligent technologies and the willingness to explore the possibility of
hidden knowledge that resides in the data. Data mining approaches seem ideally
suited to bioinformatics, since the field is data-rich but lacks a comprehensive
theory of life’s organization at the molecular level. The extensive databases of biological
information create both challenges and opportunities for development of novel
KDD methods. Mining biological data helps to extract useful knowledge from
massive datasets gathered in biology, and in other related life sciences areas
such as medicine and neuroscience.
Bioinformatics Approaches
Multiple levels of analysis at both the gene and protein level, which include:
Purpose of BLAST:
BLAST performs sequence alignment through the following steps. The first step is
to create a list of words from the query sequence. Each word is typically three
residues for protein sequences and eleven residues for DNA sequences. The list
includes every possible word extracted from the query sequence. This step is
also called seeding. The second step is to search a sequence database for
occurrences of these words, which identifies the database sequences that
contain matching words. In the third step, the word matches are scored with a
given substitution matrix; a word is considered a match if its score is above a
threshold. The fourth step involves pairwise alignment by extending from the
matched words in both directions while accumulating the alignment score with
the same substitution matrix. The extension continues until the score of the
alignment drops below a threshold due to mismatches (the drop threshold is
twenty-two for proteins and twenty for DNA). The resulting contiguous aligned
segment pair without gaps is called a high-scoring segment pair (HSP). In the
original version of BLAST, the highest-scoring HSPs are presented as the final
report. They are also called maximum scoring pairs.
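The four steps above can be sketched in Python as follows. This is a minimal
illustration, not the actual BLAST implementation: the function names, the
simple match/mismatch scores (standing in for a real substitution matrix such
as BLOSUM62), the word threshold of 11, and the reading of the drop threshold
as "stop once the running score falls more than `drop` below the best score
seen so far" are all assumptions made for this example.

```python
def make_words(query, w=3):
    """Step 1 (seeding): list every overlapping word of length w with its offset."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def score_pair(a, b, match=5, mismatch=-4):
    """Toy substitution score (assumed values); real searches use a matrix."""
    return match if a == b else mismatch


def find_hits(query, subject, w=3, threshold=11):
    """Steps 2 and 3: scan the subject for words scoring at or above the threshold."""
    hits = []
    for qi, word in make_words(query, w):
        for si in range(len(subject) - w + 1):
            s = sum(score_pair(a, b) for a, b in zip(word, subject[si:si + w]))
            if s >= threshold:
                hits.append((qi, si))
    return hits


def extend_hit(query, subject, qi, si, w=3, drop=20):
    """Step 4: ungapped extension of one word hit in both directions."""
    best = sum(score_pair(a, b)
               for a, b in zip(query[qi:qi + w], subject[si:si + w]))
    q_start, s_start = qi, si
    q_end, s_end = qi + w, si + w  # exclusive ends of the best segment so far

    # Extend to the right, remembering where the best score was reached.
    run, i, j = best, qi + w, si + w
    while i < len(query) and j < len(subject):
        run += score_pair(query[i], subject[j])
        i += 1
        j += 1
        if run > best:
            best, q_end, s_end = run, i, j
        elif best - run > drop:
            break

    # Extend to the left, starting from the best score found so far.
    run, i, j = best, qi - 1, si - 1
    while i >= 0 and j >= 0:
        run += score_pair(query[i], subject[j])
        if run > best:
            best, q_start, s_start = run, i, j
        elif best - run > drop:
            break
        i -= 1
        j -= 1

    return best, query[q_start:q_end], subject[s_start:s_end]


query = "GATTACA"
subject = "CCGATTACACC"
for qi, si in find_hits(query, subject):
    print(qi, si, extend_hit(query, subject, qi, si))
```

Each word hit here extends to the same ungapped segment, which is why real
implementations merge or deduplicate overlapping HSPs before reporting them.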
Seeding in BLAST
The statistical significance of an HSP is reported as an E-value (expectation
value):
E = m × n × P
where m is the total number of residues in the database, n is the number of
residues in the query sequence, and P is the probability that an HSP alignment is
a result of random chance. For example, suppose a query sequence of 100
residues is aligned to a database containing a total of 10^12 residues, and the
P-value for the ungapped HSP region in one of the database matches is
1 × 10^-20. The E-value, which is the product of the three values, is
100 × 10^12 × 10^-20, which equals 10^-6. It is expressed as 1e-6 in BLAST
output. This indicates that about 10^-6 matches of this quality are expected to
occur by random chance in a search of this size; for values this small, the
E-value is approximately the probability that the match is due to chance.
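The arithmetic of this example can be checked with a few lines of Python. The
function name `e_value` is only an illustrative choice; the numbers are the
ones given in the text.

```python
def e_value(m, n, p):
    """E = m * n * P: database residues times query residues times chance probability."""
    return m * n * p


# Values from the worked example: a 100-residue query, a database of
# 10^12 residues, and an HSP P-value of 1e-20.
e = e_value(m=10**12, n=100, p=1e-20)
print(f"{e:.0e}")
```

The printed value uses the same exponent notation that BLAST reports.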
PSI-BLAST
PSI-BLAST (Position-Specific Iterated BLAST) is a variant of BLAST. This
program is used to find distant relatives of a protein using a
position-specific scoring matrix (PSSM). First, a list of all
closely related proteins is created. These proteins are combined into a
general "profile" sequence, which summarises significant features present
in these sequences. A query against the protein database is then run
using this profile, and a larger group of proteins is found. This larger group
is used to construct another profile, and the process is repeated.
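To make the idea of a profile concrete, the sketch below builds a simple PSSM
from a set of aligned sequences and uses it to score a candidate sequence. It
covers a single profile-building round only and is not PSI-BLAST's actual
procedure: the function names, the uniform background frequencies, the
pseudocount of 1, the log-odds scoring, and the assumption of equal-length
sequences without gaps are all choices made for this example.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_window(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[res] for column, res in zip(pssm, segment))


def best_match(pssm, sequence):
    """Slide the profile along a sequence; return (best score, start offset)."""
    w = len(pssm)
    return max((score_window(pssm, sequence[i:i + w]), i)
               for i in range(len(sequence) - w + 1))


# Toy DNA alignment standing in for the "closely related" sequences.
aligned = ["ACGT", "ACGA", "ACGT"]
pssm = build_pssm(aligned, alphabet="ACGT")
print(best_match(pssm, "GGACGTGG"))
```

In an iterative search, the sequences found with this profile would be added
to the alignment and a new PSSM built from the enlarged set.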
REFERENCES
1. Raza, K. Application of data mining in bioinformatics. Indian Journal of Computer Science and
Engineering, Vol. 1, No. 2, 114-118.
4. www.ncbi.nlm.nih.gov