Review Paper

What is bioinformatics? An introduction and overview

N.M. Luscombe, D. Greenbaum, M. Gerstein
Department of Molecular Biophysics and Biochemistry
Yale University
New Haven, USA

Abstract: A flood of data means that many of the challenges in biology are now challenges
in computing. Bioinformatics, the application of computational techniques to analyse the
information associated with biomolecules on a large scale, has now firmly established
itself as a discipline in molecular biology, and encompasses a wide range of subject areas
from structural biology and genomics to gene expression studies.
In this review we provide an introduction and overview of the current state of the field.
We discuss the main principles that underpin bioinformatics analyses, look at the types
of biological information and databases that are commonly used, and finally examine
some of the studies that are being conducted, particularly with reference to transcription
regulatory systems.

Introduction

Biological data are being produced at a phenomenal rate [1]. For example, as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries [2] and the SWISS-PROT database of protein sequences contained 88,166 [3]. On average, these databases are doubling in size every 15 months [2]. In addition, since the publication of the H. influenzae genome [4], complete sequences for over 40 organisms have been released, ranging from 450 genes to over 100,000. Add to this the data from the myriad of related projects that study gene expression, determine the protein structures encoded by the genes, and detail how these products interact with one another, and we can begin to imagine the enormous quantity and variety of information that is being produced.

Bioinformatics - a definition¹

(Molecular) bio-informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

¹ As submitted to the Oxford English Dictionary

As a result of this surge in data, computers have become indispensable to biological research. Such an approach is ideal because of the ease with which computers can handle large quantities of data and probe the complex dynamics observed in nature. Bioinformatics, the subject of the current review, is often defined as the application of computational techniques to understand and organise the information associated with biological macromolecules. This unexpected union between the two subjects is largely attributed to the fact that life itself is an information technology; an organism's physiology is largely determined by its genes, which at their most basic can be viewed as digital information. At the same time, there have been major advances in the technologies that supply the initial data; Anthony Kerlavage of Celera recently stated that an experimental laboratory can produce over 100 gigabytes of data a day with ease [5]. This incredible processing power has been matched by developments in computer technology; the most important areas of

Yearbook of Medical Informatics 2001 83



improvements have been in the CPU, disk storage and the Internet, allowing faster computations and better data storage, and revolutionising the methods for accessing and exchanging data.

Aims of bioinformatics
The aims of bioinformatics are three-fold. First, at its simplest bioinformatics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced, eg the Protein Data Bank for 3D macromolecular structures [6,7]. While data-curation is an essential task, the information stored in these databases is essentially useless until analysed. Thus the purpose of bioinformatics extends much further. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterised sequences. This needs more than just a simple text-based search, and programs such as FASTA [8] and PSI-BLAST [9] must consider what comprises a biologically significant match. Development of such resources demands expertise in computational theory as well as a thorough understanding of biology. The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared them with a few that are related. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.

In this review, we provide an introduction to bioinformatics. We focus on the first and third aims just described, with particular reference to the keywords underlined in the definition: information, informatics, organisation, understanding, large-scale and practical applications. Specifically, we discuss the range of data that are currently being examined, the databases into which they are organised, the types of analyses that are being conducted using transcription regulatory systems as an example, and finally some of the major practical applications of bioinformatics.

“…the INFORMATION associated with these molecules…”

Table 1 lists the types of data that are analysed in bioinformatics and the range of topics that we consider to fall within the field. Here we take a broad view and include subjects that may not normally be listed. We also give approximate values describing the sizes of data being discussed.

Table 1. Sources of data used in bioinformatics, the quantity of each type of data that is currently (August 2000) available, and bioinformatics subject areas that utilise this data.

Data source               Data size                     Bioinformatics topics
Raw DNA sequence          8.2 million sequences         Separating coding and non-coding regions
                          (9.5 billion bases)           Identification of introns and exons
                                                        Gene product prediction
                                                        Forensic analysis
Protein sequence          300,000 sequences             Sequence comparison algorithms
                          (~300 amino acids each)       Multiple sequence alignment algorithms
                                                        Identification of conserved sequence motifs
Macromolecular structure  13,000 structures             Secondary, tertiary structure prediction
                          (~1,000 atomic coordinates    3D structural alignment algorithms
                          each)                         Protein geometry measurements
                                                        Surface and volume shape calculations
                                                        Intermolecular interactions
                                                        Molecular simulations (force-field calculations,
                                                          molecular movements, docking predictions)
Genomes                   40 complete genomes           Characterisation of repeats
                          (1.6 million to 3 billion     Structural assignments to genes
                          bases each)                   Phylogenetic analysis
                                                        Genomic-scale censuses (characterisation of
                                                          protein content, metabolic pathways)
                                                        Linkage analysis relating specific genes to diseases
Gene expression           largest: ~20 time-point       Correlating expression patterns
                          measurements for ~6,000       Mapping expression data to sequence, structural
                          genes                           and biochemical data
Other data
  Literature              11 million citations          Digital libraries for automated bibliographical searches
                                                        Knowledge databases of data from literature
  Metabolic pathways                                    Pathway simulations

We start with an overview of the sources of information: these may be divided into raw DNA sequences, protein sequences, macromolecular structures, genome sequences, and other whole genome data. Raw DNA sequences are strings of the four base-letters comprising genes, each typically 1,000 bases long. The GenBank repository of nucleic acid sequences currently holds a total of 9.5 billion bases in 8.2 million entries (all database figures as of August 2000). At the next level are protein sequences comprising strings of 20 amino acid-letters. At present there are about 300,000 known protein sequences, with a typical
bacterial protein containing approximately 300 amino acids. Macromolecular structural data represent a more complex form of information. There are currently 13,000 entries in the Protein Data Bank, PDB, most of which are protein structures. A typical PDB file for a medium-sized protein contains the xyz coordinates of approximately 2,000 atoms.

Scientific euphoria has recently centred on whole genome sequencing. As with the raw DNA sequences, genomes consist of strings of base-letters, ranging from 1.6 million bases in Haemophilus influenzae to 3 billion in humans. An important aspect of complete genomes is the distinction between coding regions and non-coding regions; 'junk' repetitive sequences make up the bulk of base sequences, especially in eukaryotes. We can now measure expression levels of almost every gene in a given cell on a whole-genome level, although public availability of such data is still limited. Expression level measurements are made under different environmental conditions, different stages of the cell cycle and different cell types in multi-cellular organisms. Currently the largest dataset for yeast has made approximately 20 time-point measurements for 6,000 genes [10]. Other genomic-scale data include biochemical information on metabolic pathways, regulatory networks, protein-protein interaction data from two-hybrid experiments, and systematic knockouts of individual genes to test the viability of an organism.

What is apparent from this list is the diversity in the size and complexity of different datasets. There are invariably more sequence-based data than structural data because of the relative ease with which they can be produced. This is partly related to the greater complexity and information-content of individual structures compared to individual sequences. While more biological information can be derived from a single structure than a protein sequence, the lack of depth in the latter is remedied by analysing larger quantities of data.

“… ORGANISE the information on a LARGE SCALE …”

Redundancy and multiplicity of data
A concept that underpins most research methods in bioinformatics is that much of this data can be grouped together based on biologically meaningful similarities. For example, sequence segments are often repeated at different positions of genomic DNA [11]. Genes can be clustered into those with particular functions (eg enzymatic actions) or according to the metabolic pathway to which they belong [12], although here, single genes may actually possess several functions [13]. Going further, distinct proteins frequently have comparable sequences: organisms often have multiple copies of a particular gene through duplication, while different species have equivalent or similar proteins that were inherited when they diverged from each other in evolution. At a structural level, we predict there to be a finite number of different tertiary structures (estimates range between 1,000 and 10,000 folds [14,15]) and proteins adopt equivalent structures even when they differ greatly in sequence [16]. As a result, although the number of structures in the PDB has increased exponentially, the rate of discovery of novel folds has actually decreased.

There are common terms to describe the relationship between pairs of proteins or the genes from which they are derived: analogous proteins have related folds, but unrelated sequences, while homologous proteins are both sequentially and structurally similar. The two categories can sometimes be difficult to distinguish, especially if the relationship between the two proteins is remote [17, 18]. Among homologues, it is useful to distinguish between orthologues, proteins in different species that have evolved from a common ancestral gene, and paralogues, proteins that are related by gene duplication within a genome [19]. Normally, orthologues retain the same function while paralogues evolve distinct, but related functions [20].

An important concept that arises from these observations is that of a finite “parts list” for different organisms [21,22]: an inventory of proteins contained within an organism, arranged according to different properties such as gene sequence, protein fold or function. Taking protein folds as an example, we mentioned that with a few exceptions, the tertiary structures of proteins adopt one of a limited repertoire of folds. As the number of different fold families is considerably smaller than the number of gene families, categorising the proteins by fold provides a substantial simplification of the contents of a genome. Similar simplifications can be provided by other attributes such as protein function. As such, we expect this notion of a finite parts list to become increasingly common in future genomic analyses.

Clearly, an essential aspect of managing this large volume of data lies in developing methods for assessing similarities between different biomolecules and identifying those that are related. Below, we discuss the major databases that provide access to the primary sources of information, and also introduce some secondary databases that systematically group the data (Table 2). These classifications ease comparisons between genomes and their products, allowing the identification of common themes between those that are related and highlighting features that are unique to some.
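
As a rough illustration of what assessing similarity between two sequences can involve, the sketch below scores a global alignment by textbook dynamic programming (the Needleman-Wunsch approach). This is not the method used by FASTA or PSI-BLAST, which rely on faster heuristics and statistically derived scoring; the function name and the match, mismatch and gap values are assumptions chosen for readability, not biologically calibrated scores.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of two sequences.

    Textbook dynamic programming with a linear gap penalty.
    The default scoring values are illustrative assumptions only.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            gap_in_b = score[i - 1][j] + gap   # residue of seq_a against a gap
            gap_in_a = score[i][j - 1] + gap   # residue of seq_b against a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


if __name__ == "__main__":
    # Made-up example strings, not real gene sequences.
    print(global_alignment_score("ACGT", "ACGT"))
    print(global_alignment_score("ACGT", "AGT"))
```

Higher scores indicate more similar sequences under the chosen scheme. Deciding whether a score is biologically significant, as the text notes, needs more than this: realistic substitution matrices and a statistical model of chance matches.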

Table 2. List of URLs for the databases that are cited in the review.

Database                            URL
Protein sequence (primary)
  SWISS-PROT                        www.expasy.ch/sprot/sprot-top.html
  PIR-International                 www.mips.biochem.mpg.de/proj/protseqdb
Protein sequence (composite)
  OWL                               www.bioinf.man.ac.uk/dbbrowser/OWL
  NRDB                              www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Protein sequence (secondary)
  PROSITE                           www.expasy.ch/prosite
  PRINTS                            www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
  Pfam                              www.sanger.ac.uk/Pfam/
Macromolecular structures
  Protein Data Bank (PDB)           www.rcsb.org/pdb
  Nucleic Acids Database (NDB)      ndbserver.rutgers.edu/
  HIV Protease Database             www.ncifcrf.gov/CRYS/HIVdb/NEW_DATABASE
  ReLiBase                          www2.ebi.ac.uk:8081/home.html
  PDBsum                            www.biochem.ucl.ac.uk/bsm/pdbsum
  CATH                              www.biochem.ucl.ac.uk/bsm/cath
  SCOP                              scop.mrc-lmb.cam.ac.uk/scop
  FSSP                              www2.embl-ebi.ac.uk/dali/fssp
Nucleotide sequences
  GenBank                           www.ncbi.nlm.nih.gov/Genbank
  EMBL                              www.ebi.ac.uk/embl
  DDBJ                              www.ddbj.nig.ac.jp
Genome sequences
  Entrez genomes                    www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
  GeneCensus                        bioinfo.mbb.yale.edu/genome
  COGs                              www.ncbi.nlm.nih.gov/COG
Integrated databases
  InterPro                          www.ebi.ac.uk/interpro
  Sequence retrieval system (SRS)   www.expasy.ch/srs5
  Entrez                            www.ncbi.nlm.nih.gov/Entrez

Protein sequence databases
Protein sequence databases are categorised as primary, composite or secondary. Primary databases contain over 300,000 protein sequences and function as a repository for the raw data. Some of the more common repositories, such as SWISS-PROT [3] and PIR-International [23], annotate the sequences as well as describe the proteins' functions, domain structures and post-translational modifications. Composite databases such as OWL [24] and the NRDB [25] compile and filter sequence data from different primary databases to produce combined non-redundant sets that are more complete than the individual databases and also include protein sequence data from the translated coding regions in DNA sequence databases (see below). Secondary databases contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family. One of the most popular is PROSITE [26], a database of short sequence patterns and profiles that characterise biologically significant sites in proteins. PRINTS [27] expands on this concept and provides a compendium of protein fingerprints – groups of conserved motifs that characterise a protein family. Motifs are usually separated along a protein sequence, but may be contiguous in 3D-space when the protein is folded. By using multiple motifs, fingerprints can encode protein folds and functionalities more flexibly than PROSITE. Finally, Pfam [28] contains a large collection of multiple sequence alignments and profile Hidden Markov Models covering many common protein domains. Pfam-A comprises accurate manually compiled alignments while Pfam-B is an automated clustering of the whole SWISS-PROT database. These different secondary databases have recently been incorporated into a single resource named InterPro [29].

Structural databases
Next we look at databases of macromolecular structures. The Protein Data Bank, PDB [6,7], provides a primary archive of all 3D structures for macromolecules such as proteins, RNA, DNA and various complexes. Most of the ~13,000 structures (August 2000) are solved by x-ray crystallography and NMR, but some theoretical models are also included. As the information provided in individual PDB entries can be difficult to extract, PDBsum [30] provides a separate Web page for every structure in the PDB displaying detailed structural analyses, schematic diagrams and data on interactions between different molecules in a given entry. Three major databases classify proteins by structure in order to identify structural and evolutionary relationships: CATH [31], SCOP [32], and FSSP databases [33]. All comprise hierarchical structural taxonomies where groups of proteins increase in similarity at lower levels of the classification tree. In addition, numerous databases focus on particular types of macromolecules. These include the Nucleic Acids Database, NDB [34], for structures related to nucleic acids, the HIV protease database [35] for HIV-1, HIV-2 and SIV protease structures and their complexes, and ReLiBase [36] for receptor-ligand complexes.
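
Returning briefly to the short sequence patterns held in secondary databases such as PROSITE, the sketch below shows one way such a pattern could be turned into a regular expression and scanned against a protein sequence. It is a minimal illustration, not PROSITE's own software: the function names are hypothetical, only the basic pattern syntax is handled (residues, `x`, `[..]` sets, `{..}` exclusions, repeat counts and the `<` and `>` terminal anchors), and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]

        repeat = ""
        if element.endswith(")"):        # repeat count such as x(3) or x(2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"

        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (1-based position, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-{P}-[ST]-{P} is the widely quoted N-glycosylation site pattern;
    # the sequence below is made up for illustration.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "MKNASGNPTA"))
```

Because `re.finditer` reports only non-overlapping matches, a production scanner would need extra care where sites can overlap. Profiles and fingerprints, as used by PRINTS and Pfam, go beyond what a single regular expression can express.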

Nucleotide and Genome sequences
As described previously, the biggest excitement currently lies with the availability of complete genome sequences for different organisms. The GenBank [2], EMBL [37] and DDBJ [38] databases contain DNA sequences for individual genes that encode protein and RNA products. Much like the composite protein sequence database, the Entrez nucleotide database [39] compiles sequence data from these primary databases.

As whole-genome sequencing is often conducted through international collaborations, individual genomes are published at different sites. The Entrez genome database [40] brings together all complete and partial genomes in a single location and currently represents over 1,000 organisms (August 2000). In addition to providing the raw nucleotide sequence, information is presented at several levels of detail including: a list of completed genomes, all chromosomes in an organism, detailed views of single chromosomes marking coding and non-coding regions, and single genes. At each level there are graphical presentations, pre-computed analyses and links to other sections of Entrez. For example, annotations for single genes include the translated protein sequence, sequence alignments with similar genes in other genomes and summaries of the experimentally characterised or predicted function. GeneCensus [41] also provides an entry point for genome analysis with an interactive whole-genome comparison from an evolutionary perspective. The database allows building of phylogenetic trees based on different criteria such as ribosomal RNA or protein fold occurrence. The site also enables multiple genome comparisons, analysis of single genomes and retrieval of information for individual genes. The COGs database [20] classifies proteins encoded in 21 completed genomes on the basis of sequence similarity. Members of the same Cluster of Orthologous Groups, COG, are expected to have the same 3D domain architecture and often, similar functions. The most straightforward application of the database is to predict the function of uncharacterised proteins through their homology to characterised proteins, and also to identify phylogenetic patterns of protein occurrence – for example, whether a given COG is represented across most or all organisms or in just a few closely related species.

Gene expression data
The most recent source of genomic-scale data has been expression experiments, which quantify the expression levels of individual genes. These experiments measure the amount of mRNA or protein products that are produced by the cell. For the former, there are three main technologies: the cDNA microarray [42-44], Affymetrix GeneChip [45] and SAGE methods [46]. The first method measures relative levels of mRNA abundance between different samples, while the last two measure absolute levels. Most of the effort in gene expression analysis has concentrated on the yeast and human genomes and as yet, there is no central repository for this data. For yeast, the Young [10], Church [47] and Samson datasets [48] use the GeneChip method, while the Stanford cell cycle [49], diauxic shift [50] and deletion mutant datasets [51] use the microarray. Most measure mRNA levels throughout the whole yeast cell cycle, although some focus on a particular stage in the cycle. For humans, the main application has been to understand expression in tumour and cancer cells. The Molecular Portraits of Breast Tumours [52] and the Lymphoma and Leukaemia Molecular Profiling [53] projects provide data from microarray experiments on human cancer cells.

The technologies for measuring protein abundance are currently limited to 2D gel electrophoresis followed by mass spectrometry [54]. As gels can only routinely resolve about 1,000 proteins [55], only the most abundant can be visualised. At present, data from these experiments are only available from the literature [56,57].

Data integration
The most profitable research in bioinformatics often results from integrating multiple sources of data [58]. For instance, the 3D coordinates of a protein are more useful if combined with data about the protein’s function, occurrence in different genomes, and interactions with other molecules. In this way, individual pieces of information are put in context with respect to other data. Unfortunately, it is not always straightforward to access and cross-reference these sources of information because of differences in nomenclature and file formats.

At a basic level, this problem is frequently addressed by providing external links to other databases; for example, in PDBsum, web pages for individual structures direct the user towards corresponding entries in the PDB, NDB, CATH, SCOP and SWISS-PROT. At a more advanced level, there have been efforts to integrate access across several data sources. One is the Sequence Retrieval System, SRS [59], which allows flat-file databases to be indexed to each other; this allows the user to retrieve, link and access entries from nucleic acid, protein sequence, protein motif, protein structure and bibliographic databases. Another is the Entrez facility [39], which provides similar gateways to DNA and protein sequences, genome mapping data, 3D macromolecular structures and the PubMed bibliographic database [60]. A search for a particular gene in either database will allow smooth transitions to the
genome it comes from, the protein sequence it encodes, its structure, bibliographic reference and equivalent entries for all related genes.

“…UNDERSTAND and organise the information…”

Having examined the data, we can discuss the types of analyses that are conducted. As shown in Table 1, the broad subject areas in bioinformatics can be separated according to the sources of information that are used in the studies. For raw DNA sequences, investigations involve separating coding and non-coding regions, and identification of introns, exons and promoter regions for annotating genomic DNA [61,62]. For protein sequences, analyses include developing algorithms for sequence comparisons [63], methods for producing multiple sequence alignments [64], and searching for functional domains from conserved sequence motifs in such alignments. Investigations of structural data include prediction of secondary and tertiary protein structures, producing methods for 3D structural alignments [65,66], examining protein geometries using distance and angular measurements, calculations of surface and volume shapes and analysis of protein interactions with other subunits, DNA, RNA and smaller molecules. These studies have led to molecular simulation topics in which structural data are used to calculate the energetics involved in stabilising macromolecular structures, simulating movements within macromolecules, and computing the energies involved in molecular docking. The increasing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics – large-scale analyses of complete genomes and the proteins that they encode. Research includes characterisation of protein content and metabolic pathways between different genomes, identification of interacting proteins, assignment and prediction of gene products, and large-scale analyses of gene expression levels. Some of these research topics will be demonstrated in our example analysis of transcription regulatory systems.

Other subject areas we have included in Table 1 are development of digital libraries for automated bibliographical searches, knowledge bases of biological information from the literature, DNA analysis methods in forensics, prediction of nucleic acid structures, metabolic pathway simulations, and linkage analysis – linking specific genes to different disease traits.

In addition to finding relationships between different proteins, much of bioinformatics involves the analysis of one type of data to infer and understand the observations for another type of data. An example is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences [67]. These methods, especially the former, are often based on statistical rules derived from structures, such as the propensity for certain amino acid sequences to produce different secondary structural elements. Another example is the use of structural data to understand a protein’s function; here studies have investigated the relationship between different protein folds and their functions [68,69] and analysed similarities between different binding sites in the absence of homology [70]. Combined with similarity measurements, these studies provide us with an understanding of how much biological information can be accurately transferred between homologous proteins [71].

The bioinformatics spectrum
Figure 1 summarises the main points we raised in our discussions of organising and understanding biological data – the development of bioinformatics techniques has allowed an expansion of biological analysis in two dimensions, depth and breadth. The first is represented by the vertical axis in the figure and outlines a possible approach to the rational drug design process. The aim is to take a single gene and follow through an analysis that maximises our understanding of the protein it encodes. Starting with a gene sequence, we can determine the protein sequence with strong certainty. From there, prediction algorithms can be used to calculate the structure adopted by the protein. Geometry calculations can define the shape of the protein’s surface and molecular simulations can determine the force fields surrounding the molecule. Finally, using docking algorithms, one could identify or design ligands that may bind the protein, paving the way for designing a drug that specifically alters the protein’s function. In practice, the intermediate steps are still difficult to achieve accurately, and they are best combined with experimental methods to obtain some of the data, for example characterising the structure of the protein of interest.

The aim of the second dimension, the breadth in biological analysis, is to compare a gene with others. Initially, simple algorithms can be used to compare the sequences and structures of a pair of related proteins. With a larger number of proteins, improved algorithms can be used to produce multiple alignments, and extract sequence patterns or structural templates that define a family of proteins. Using this data, it is also possible to construct phylogenetic trees to trace the evolutionary path of proteins. Finally, with even more data, the information must be stored in large-scale databases. Comparisons become more complex, requiring multiple scoring schemes, and we are able to conduct genomic scale censuses that provide comprehensive statistical accounts of protein features, such as the abundance of particular structures or functions in different genomes. It also allows us to build phylogenetic trees that trace the evolution of whole organisms.
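
To make the idea of a genomic-scale census concrete, the sketch below tallies how often each fold occurs in each genome and reports the folds shared by all of them. It is a toy illustration under stated assumptions: the function names, the input layout (one genome, protein, fold triple per assigned protein) and the placeholder labels are hypothetical and are not taken from GeneCensus or any other resource cited here.

```python
from collections import Counter


def fold_census(assignments):
    """Count fold occurrences per genome.

    `assignments` is an iterable of (genome, protein, fold) tuples,
    an assumed layout chosen for this example.
    """
    census = {}
    for genome, _protein, fold in assignments:
        census.setdefault(genome, Counter())[fold] += 1
    return census


def shared_folds(census):
    """Return the set of folds that occur in every genome of the census."""
    fold_sets = [set(counts) for counts in census.values()]
    return set.intersection(*fold_sets) if fold_sets else set()


if __name__ == "__main__":
    # Placeholder labels, not real genomes, proteins or folds.
    example = [
        ("genome_A", "protein_1", "fold_1"),
        ("genome_A", "protein_2", "fold_1"),
        ("genome_A", "protein_3", "fold_2"),
        ("genome_B", "protein_4", "fold_1"),
        ("genome_B", "protein_5", "fold_3"),
    ]
    census = fold_census(example)
    for genome, counts in census.items():
        print(genome, dict(counts))
    print("shared:", shared_folds(census))
```

Keeping one counter per genome means the same table can answer both kinds of question raised in the text: which features are abundant within one genome, and which are common to many or unique to a few.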


Fig. 1. Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the
integration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. The
vertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab. Starting with a single gene sequence,
we can determine with strong certainty, the protein sequence. From there, we can determine the structure using structure prediction techniques.
With geometry calculations, we can further resolve the protein’s surface and through molecular simulation determine the force fields surrounding
the molecule. Finally docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for
the design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technology
have broadened the scope of biology. Initially with a pair of proteins, we can make comparisons between the sequences and structures
of evolutionarily related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiple
sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the deluge
of data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become more
complex, requiring sophisticated scoring schemes and there is enough data to compile a genome census – a genomic equivalent of a population
census – providing comprehensive statistical accounting of protein features in genomes.


“… applying INFORMATICS TECHNIQUES…”

The distinct subject areas we mention require different types of informatics techniques. Briefly, for data organisation, the first biological databases were simple flat files. However, with the increasing amount of information, relational database methods with Web-page interfaces have become increasingly popular. In sequence analysis, techniques include string comparison methods such as text search and 1-dimensional alignment algorithms. Motif and pattern identification for multiple sequences depends on machine learning, clustering and data-mining techniques. 3D structural analysis techniques include Euclidean geometry calculations combined with basic application of physical chemistry, graphical representations of surfaces and volumes, and structural comparison and 3D matching methods. For molecular simulations, Newtonian mechanics, quantum mechanics, molecular mechanics and electrostatic calculations are applied. In many of these areas, the computational methods must be combined with good statistical analyses in order to provide an objective measure for the significance of the results.

Transcription regulation – a case study in bioinformatics
DNA-binding proteins have a central role in all aspects of genetic activity within an organism, participating in processes such as transcription, packaging, rearrangement, replication and repair. In this section, we focus on the studies that have contributed to our understanding of transcription regulation in different organisms. Through this example, we demonstrate how bioinformatics has been used to increase our knowledge of biological systems and also illustrate the practical applications of the different subject areas that were briefly outlined earlier. We start by considering structural analyses of how DNA-binding proteins recognise particular base sequences. Later, we review several genomic studies that have characterised the nature of transcription factors in different organisms, and the methods that have been used to identify regulatory binding sites in the upstream regions. Finally, we provide an overview of gene expression analyses that have been recently conducted and suggest future uses of transcription regulatory analyses to rationalise the observations made in gene expression experiments. All the results that we describe have been found through computational studies.

Structural studies
As of August 2000, there were 379 structures of protein-DNA complexes in the PDB. Analyses of these structures have provided valuable insight into the stereochemical principles of binding, including how particular base sequences are recognised and how the DNA structure is quite often modified on binding.

A structural taxonomy of DNA-binding proteins, similar to that presented in SCOP and CATH, was first proposed by Harrison [72] and periodically updated to accommodate new structures as they are solved [73]. The classification consists of a two-tier system: the first level collects proteins into eight groups that share gross structural features for DNA-binding, and the second comprises 54 families of proteins that are structurally homologous to each other. Assembly of such a system simplifies the comparison of different binding methods; it highlights the diversity of protein-DNA complex geometries found in nature, but also underlines the importance of interactions between α-helices and the DNA major groove, the main mode of binding in over half the protein families. While the number of structures represented in the PDB does not necessarily reflect the relative importance of the different proteins in the cell, it is clear that helix-turn-helix, zinc-coordinating and leucine zipper motifs are used repeatedly. These motifs provide compact frameworks that present the α-helix on the surfaces of structurally diverse proteins. At a gross level, it is possible to highlight the differences between transcription factor domains that “just” bind DNA and those involved in catalysis [74]. Although there are exceptions, the former typically approach the DNA from a single face and slot into the grooves to interact with base edges. The latter commonly envelop the substrate, using complex networks of secondary structures and loops.

Focusing on proteins with α-helices, the structures show many variations, both in amino acid sequences and detailed geometry. They have clearly evolved independently in accordance with the requirements of the context in which they are found. While achieving a close fit between the α-helix and major groove, there is enough flexibility to allow both the protein and DNA to adopt distinct conformations. However, several studies that analysed the binding geometries of α-helices demonstrated that most adopt fairly uniform conformations regardless of protein family. They are commonly inserted in the major groove sideways, with their lengthwise axis roughly parallel to the slope outlined by the DNA backbone. Most start with the N-terminus in the groove and extend out, completing two to three turns within contacting distance of the nucleic acid [75,76].

Given the similar binding orientations, it is surprising to find that the interactions between each amino acid position along the α-helices and nucleotides on the DNA vary considerably between different protein families. However, by classifying the amino acids according
to the sizes of their side chains, we are able to rationalise the different interaction patterns. The rules of interactions are based on the simple premise that for a given residue position on α-helices in similar conformations, small amino acids interact with nucleotides that are close in distance and large amino acids with those that are further away [76,77]. Equivalent studies for binding by other structural motifs, like β-hairpins, have also been conducted [78]. When considering these interactions, it is important to remember that different regions of the protein surface also provide interfaces with the DNA.

This brings us to look at the atomic level interactions between individual amino acid-base pairs. Such analyses are based on the premise that a significant proportion of specific DNA-binding could be rationalised by a universal code of recognition between amino acids and bases, i.e. whether certain protein residues preferentially interact with particular nucleotides regardless of the type of protein-DNA complex [79]. Studies have considered hydrogen bonds, van der Waals contacts and water-mediated bonds [80-82]. Results showed that about 2/3 of all interactions are with the DNA backbone and that their main role is one of sequence-independent stabilisation. In contrast, interactions with bases display some strong preferences, including the interactions of arginine or lysine with guanine, asparagine or glutamine with adenine and threonine with thymine. Such preferences were explained through examination of the stereochemistry of the amino acid side chains and base edges. Also highlighted were more complex types of interactions where single amino acids contact more than one base-step simultaneously, thus recognising a short DNA sequence. These results suggested that universal specificity, one that is observed across all protein-DNA complexes, indeed exists. However, many interactions that are normally considered to be non-specific, such as those with the DNA backbone, can also provide specificity depending on the context in which they are made.

Armed with an understanding of protein structure, DNA-binding motifs and side chain stereochemistry, a major application has been the prediction of binding either by proteins known to contain a particular motif, or those with structures solved in the uncomplexed form. Most common are predictions for α-helix-major groove interactions: given the amino acid sequence, what DNA sequence would it recognise [77,83]? In a different approach, molecular simulation techniques have been used to dock whole proteins and DNAs on the basis of force-field calculations around the two molecules [84,85].

Both methods have met with only limited success because, even for apparently simple cases like α-helix-binding, there are many other factors that must be considered. Comparisons between bound and unbound nucleic acid structures show that DNA-bending is a common feature of complexes formed with transcription factors [74, 86]. This and other factors such as electrostatic and cation-mediated interactions assist indirect recognition of the nucleotide sequence, although they are not well understood yet. Therefore, it is now clear that detailed rules for specific DNA-binding will be family specific, but with underlying trends such as the arginine-guanine interactions.

Genomic studies
Due to the wealth of biochemical data that are available, genomic studies in bioinformatics have concentrated on model organisms, and the analysis of regulatory systems has been no exception. Identification of transcription factors in genomes invariably depends on similarity search strategies, which assume a functional and evolutionary relationship between homologous proteins. In E. coli, studies have so far estimated a total of 300 to 500 transcription regulators [87] and PEDANT [88], a database of automatically assigned gene functions, shows that typically 2-3% of prokaryotic and 6-7% of eukaryotic genomes comprise DNA-binding proteins. As assignments were only complete for 40-60% of genomes as of August 2000, these figures most likely underestimate the actual number. Nonetheless, they already represent a large quantity of proteins and it is clear that there are more transcription regulators in eukaryotes than in other species. This is unsurprising, considering that these organisms have developed a relatively sophisticated transcription mechanism.

From the conclusions of the structural studies, the best strategy for characterising DNA-binding of the putative transcription factors in each genome is to group them by homology and analyse the individual families. Such classifications are provided in the secondary sequence databases described earlier and also those that specialise in regulatory proteins such as RegulonDB [89] and TRANSFAC [90]. Of even greater use is the provision of structural assignments to the proteins; given a transcription factor, it is helpful to know the structural motif that it uses for binding, therefore providing us with a better understanding of how it recognises the target sequence. Structural genomics through bioinformatics assigns structures to the protein products of genomes by demonstrating similarity to proteins of known structure [91]. These studies have shown that prokaryotic transcription factors most frequently contain helix-turn-helix motifs [87,92] and eukaryotic factors contain homeodomain type helix-turn-
helix, zinc finger or leucine zipper motifs. From the protein classifications in each genome, it is clear that different types of regulatory proteins differ in abundance and families significantly differ in size. A study by Huynen and van Nimwegen [93] has shown that members of a single family have similar functions, but as the requirements of this function vary over time, so does the presence of each gene family in the genome.

Most recently, using a combination of sequence and structural data, we examined the conservation of amino acid sequences between related DNA-binding proteins, and the effect that mutations have on DNA sequence recognition. The structural families described above were expanded to include proteins that are related by sequence similarity, but whose structures remain unsolved. Again, members of the same family are homologous, and probably derive from a common ancestor.

Amino acid conservations were calculated for the multiple sequence alignments of each family [94]. Generally, alignment positions that interact with the DNA are better conserved than the rest of the protein surface, although the detailed patterns of conservation are quite complex. Residues that contact the DNA backbone are highly conserved in all protein families, providing a set of stabilising interactions that are common to all homologous proteins. The conservation of alignment positions that contact bases, and recognise the DNA sequence, is more complex and could be rationalised by defining a 3-class model for DNA-binding. First, protein families that bind non-specifically usually contain several conserved base-contacting residues; without exception, interactions are made in the minor groove where there is little discrimination between base types. The contacts are commonly used to stabilise deformations in the nucleic acid structure, particularly in widening the DNA minor groove. The second class comprises families whose members all target the same nucleotide sequence; here, base-contacting positions are absolutely or highly conserved, allowing related proteins to target the same sequence.

The third, and most interesting, class comprises families in which binding is also specific but different members bind distinct base sequences. Here protein residues undergo frequent mutations, and family members can be divided into subfamilies according to the amino acid sequences at base-contacting positions; those in the same subfamily are predicted to bind the same DNA sequence and those of different subfamilies to bind distinct sequences. On the whole, the subfamilies corresponded well with the proteins’ functions and members of the same subfamilies were found to regulate similar transcription pathways. The combined analysis of sequence and structural data described by this study provided an insight into how homologous DNA-binding scaffolds achieve different specificities by altering their amino acid sequences. In doing so, proteins evolved distinct functions, therefore allowing structurally related transcription factors to regulate expression of different genes. Therefore, the relative abundance of transcription regulatory families in a genome depends not only on the importance of a particular protein function, but also on the adaptability of the DNA-binding motifs to recognise distinct nucleotide sequences. This, in turn, appears to be best accommodated by simple binding motifs, such as the zinc fingers.

Given the knowledge of the transcription regulators that are contained in each organism, and an understanding of how they recognise DNA sequences, it is of interest to search for their potential binding sites within genome sequences [95]. For prokaryotes, most analyses have involved compiling data on experimentally known binding sites for particular proteins and building a consensus sequence that incorporates any variations in nucleotides. Additional sites are found by conducting word-matching searches over the entire genome and scoring candidate sites by similarity [96-99]. Unsurprisingly, most of the predicted sites are found in non-coding regions of the DNA [96] and the results of the studies are often presented in databases such as RegulonDB [89]. The consensus search approach is often complemented by comparative genomic studies searching upstream regions of orthologous genes in closely related organisms. Through such an approach, it was found that at least 27% of known E. coli DNA-regulatory motifs are conserved in one or more distantly related bacteria [100].

The detection of regulatory sites in eukaryotes poses a more difficult problem because consensus sequences tend to be much shorter, variable, and dispersed over very large distances. However, initial studies in S. cerevisiae provided an interesting observation for the GATA protein in nitrogen metabolism regulation. While the 5 base-pair GATA consensus sequence is found almost everywhere in the genome, a single isolated binding site is insufficient to exert the regulatory function [101]. Therefore, specificity of GATA activity comes from the repetition of the consensus sequence within the upstream regions of controlled genes in multiple copies. An initial study has used this observation to predict new regulatory sites by searching for over-represented oligonucleotides in non-coding regions of yeast and worm genomes [102,103].
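
The idea of searching for over-represented oligonucleotides in non-coding regions can be illustrated with a minimal sketch. It is not the method of the cited studies: the scoring (a frequency ratio with a pseudocount), the function names and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer across a collection of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented_kmers(upstream, background, k=5, pseudocount=1.0):
    """Rank k-mers by their frequency in upstream regions relative to a background set.

    The pseudocount avoids division by zero for k-mers absent from the background
    (an illustrative choice, not a published scoring scheme).
    """
    fg = count_kmers(upstream, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_possible = 4 ** k
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount * n_possible)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_possible)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy sequences; real analyses would use genomic upstream regions.
upstream = ["TTGATAAGCCGATAAGT", "ACGATAAGGGATAAGC"]
background = ["ACGTACGTTGCAACGT", "TTGCACGTACCGGTAA"]
ranked = overrepresented_kmers(upstream, background, k=5)
for kmer, score in ranked[:3]:
    print(kmer, round(score, 2))
```

In this toy input the repeated GATAA-containing words rise to the top because they occur several times upstream and never in the background; with real data, a proper statistical test would be needed to judge significance.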

Having detected the regulatory binding sites, there is the problem of defining the genes that are actually regulated, commonly termed regulons. Generally, binding sites are assumed to be located directly upstream of the regulons; however, there are different problems associated with this assumption depending on the organism. For prokaryotes, it is complicated by the presence of operons; it is difficult to locate the regulated gene within an operon since it can lie several genes downstream of the regulatory sequence. It is often difficult to predict the organisation of operons [104], especially to define the gene that is found at the head, and there is often a lack of long-range conservation in gene order between related organisms [105]. The problem in eukaryotes is even more severe; regulatory sites often act in both directions, binding sites are usually distant from regulons because of large intergenic regions, and transcription regulation is usually a result of combined action by multiple transcription factors in a combinatorial manner.

Despite these problems, these studies have succeeded in confirming the transcription regulatory pathways of well-characterised systems such as the heat shock response system [99]. In addition, it is feasible to experimentally verify any predictions, most notably using gene expression data.

Gene expression studies
Many expression studies have so far focused on devising methods to cluster genes by similarities in expression profiles. This is in order to determine the proteins that are expressed together under different cellular conditions. Briefly, the most common methods are hierarchical clustering, self-organising maps, and K-means clustering. Hierarchical methods originally derived from algorithms to construct phylogenetic trees, and group genes in a “bottom-up” fashion; genes with the most similar expression profiles are clustered first, and those with more diverse profiles are included iteratively [106-108]. In contrast, the self-organising map [109, 110] and K-means methods [111] employ a “top-down” approach in which the user pre-defines the number of clusters for the dataset. The clusters are initially assigned randomly, and the genes are regrouped iteratively until they are optimally clustered.

Given these methods, it is of interest to relate the expression data to other attributes such as structure, function and subcellular localisation of each gene product. Mapping these properties provides an insight into the characteristics of proteins that are expressed together, and also suggests some interesting conclusions about the overall biochemistry of the cell. In yeast, shorter proteins tend to be more highly expressed than longer proteins, probably because of the relative ease with which they are produced [112]. Looking at the amino acid content, highly expressed genes are generally enriched in alanine and glycine, and depleted in asparagine; these are thought to reflect the requirements of amino acid usage in the organism, where synthesis of alanine and glycine is energetically less expensive than that of asparagine. Turning to protein structure, expression levels of the TIM barrel and NTP hydrolase folds are highest, while those for the leucine zipper, zinc finger and transmembrane helix-containing folds are lowest. This relates to the functions associated with these folds; the former are commonly involved in metabolic pathways and the latter in signalling or transport processes [113]. This is also reflected in the relationship with subcellular localisations of proteins, where expression of cytoplasmic proteins is high, but nuclear and membrane proteins tend to be low [114,115].

More complex relationships have also been assessed. Conventional wisdom is that gene products that interact with each other are more likely to have similar expression profiles than if they do not [116,117]. However, a recent study showed that this relationship is not so simple [118]. While expression profiles are similar for gene products that are permanently associated, for example in the large ribosomal subunit, profiles differ significantly for products that are only associated transiently, including those belonging to the same metabolic pathway.

As described below, one of the main driving forces behind expression analysis has been to analyse cancerous cell lines [119]. In general, it has been shown that different cell lines (e.g. epithelial and ovarian cells) can be distinguished on the basis of their expression profiles, and that these profiles are maintained when cells are transferred from an in vivo to an in vitro environment [120]. The basis for their physiological differences was apparent in the expression of specific genes; for example, expression levels of gene products necessary for progression through the cell cycle, especially ribosomal genes, correlated well with variations in cell proliferation rate. Comparative analysis can be extended to tumour cells, in which the underlying causes of cancer can be uncovered by pinpointing areas of biological variations compared to normal cells. For example, in breast cancer, genes related to cell proliferation and the IFN-regulated signal transduction pathway were found to be upregulated [52,121]. One of the difficulties in cancer treatment has been to target specific therapies to pathogenetically distinct tumour types, in order to maximise efficacy and minimise toxicity. Thus, improvements in cancer classifications have been central to advances in cancer treatment. Although the distinction between
different forms of cancer, for example subclasses of acute leukaemia, has been well established, it is still not possible to establish a clinical diagnosis on the basis of a single test. In a recent study, acute myeloid leukaemia and acute lymphoblastic leukaemia were successfully distinguished based on the expression profiles of these cells [53]. As the approach does not require prior biological knowledge of the diseases, it may provide a generic strategy for classifying all types of cancer.

Clearly, an essential aspect of understanding expression data lies in understanding the basis of transcription regulation. However, analysis in this area is still limited to preliminary analyses of expression levels in yeast mutants lacking key components of the transcription initiation complex [10,122].

“… many PRACTICAL APPLICATIONS…”

Here, we describe some of the major uses of bioinformatics.

Finding Homologues
As described earlier, one of the driving forces behind bioinformatics is the search for similarities between different biomolecules. Apart from enabling systematic organisation of data, identification of protein homologues has some direct practical uses. The most obvious is transferring information between related proteins. For example, given a poorly characterised protein, it is possible to search for homologues that are better understood and, with caution, apply some of the knowledge of the latter to the former. Specifically with structural data, theoretical models of proteins are usually based on experimentally solved structures of close homologues [123]. Similar techniques are used in fold recognition in which tertiary structure predictions depend on finding structures of remote homologues and checking whether the prediction is energetically viable [124]. Where biochemical or structural data are lacking, studies could be made in low-level organisms like yeast and the results applied to homologues in higher-level organisms such as humans, where experiments are more demanding.

An equivalent approach is also employed in genomics. Homologue-finding is extensively used to confirm coding regions in newly sequenced genomes and functional data are frequently transferred to annotate individual genes. On a larger scale, it also simplifies the problem of understanding complex genomes by analysing simple organisms first and then applying the same principles to more complicated ones; this is one reason why early structural genomics projects focused on Mycoplasma genitalium [91].

Ironically, the same idea can be applied in reverse. Potential drug targets are quickly discovered by checking whether homologues of essential microbial proteins are missing in humans. On a smaller scale, structural differences between similar proteins may be harnessed to design drug molecules that specifically bind to one structure but not another.

Rational Drug Design
One of the earliest medical applications of bioinformatics has been in aiding rational drug design. Figure 2 outlines the commonly cited approach, taking the MLH1 gene product as an example drug target. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short arm of chromosome 3 [125]. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer [126]. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can then be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

Large-scale censuses
Although databases can efficiently store all the information related to genomes, structures and expression datasets, it is useful to condense all this information into trends and facts that users can readily understand. Broad generalisations help identify interesting subject areas for further detailed analysis, and place new observations in a proper context. This enables one to see whether they are unusual in any way.

Through these large-scale censuses, one can address a number of evolutionary, biochemical and biophysical questions. For example, are specific protein folds associated with certain phylogenetic groups? How common are different folds within particular organisms? And to what degree are folds shared between related organisms? Does this extent of sharing parallel measures of relatedness derived from traditional evolutionary trees? Initial studies show that the frequency of folds differs greatly between organisms and that the sharing of folds between organisms does in fact follow traditional phylogenetic classifications [21,41]. We can also integrate data on protein functions; given that particular protein folds are often related to specific biochemical functions [68, 69], these findings highlight the diversity of metabolic pathways in different organisms [20,105].
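
The census questions above (how common are folds within an organism, and to what degree are they shared between organisms) can be sketched in a few lines. The genome names, fold assignments and the choice of the Jaccard index as a sharing measure are all hypothetical, chosen only to make the idea concrete; they are not taken from the cited surveys.

```python
from collections import Counter
from itertools import combinations

def fold_census(assignments):
    """Tally fold occurrences per genome.

    assignments maps a genome name to a list with one fold label per protein domain.
    """
    return {genome: Counter(folds) for genome, folds in assignments.items()}

def shared_fold_fraction(census, a, b):
    """Fraction of folds found in both genomes out of all folds found in either (Jaccard index)."""
    folds_a, folds_b = set(census[a]), set(census[b])
    union = folds_a | folds_b
    return len(folds_a & folds_b) / len(union) if union else 0.0

# Hypothetical fold assignments for three made-up genomes.
assignments = {
    "genome_A": ["TIM barrel", "TIM barrel", "Rossmann", "ferredoxin-like"],
    "genome_B": ["TIM barrel", "Rossmann", "Rossmann", "zinc finger"],
    "genome_C": ["zinc finger", "zinc finger", "leucine zipper"],
}

census = fold_census(assignments)
for genome, counts in census.items():
    print(genome, counts.most_common(1))
for a, b in combinations(assignments, 2):
    print(a, b, shared_fold_fraction(census, a, b))
```

A matrix of such pairwise sharing values could then be compared against distances from a conventional phylogenetic tree, which is the kind of comparison the text describes.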

Fig.2. Above is a schematic outlining how scientists can use bioinformatics to aid rational drug discovery. MLH1 is a human gene encoding a mismatch
repair protein (mmr) situated on the short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has been
implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be
determined using translation software. Sequence search techniques can be used to find homologues in model organisms, and based on sequence
similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could
design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.
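
The translation step mentioned in the caption can be sketched as follows. This is a minimal illustration using the standard genetic code, not the software used by the authors; the function name and the short test sequences are assumptions, and no real MLH1 sequence is used.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=True):
    """Translate a coding DNA (or RNA) string in reading frame 0.

    Unknown codons become 'X'; translation ends at the first stop codon
    when to_stop is True, otherwise stops are written as '*'.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

# Short made-up sequences for illustration.
print(translate("ATGGCCTAA"))
print(translate("ATGAAATGGTGA"))
```

The resulting protein string is what would then be passed to a sequence search tool to look for homologues in model organisms.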

As we discussed earlier, one of the most exciting new sources of genomic information is the expression data. Combining expression information with structural and functional classifications of proteins, we can ask whether the high occurrence of a protein fold in a genome is indicative of high expression levels [112]. Further genomic scale data that we can consider in large-scale surveys include the subcellular localisations of proteins and their interactions with each other [127-129]. In conjunction with structural data, we can then begin to compile a map of all protein-protein interactions in an organism.

Further applications in medical sciences
Most recent applications in the medical sciences have centred on gene expression analysis [130]. This usually involves compiling expression data for cells affected by different diseases [131], e.g. cancer [53,132,133] and arteriosclerosis [134], and comparing the measurements against normal expression levels. Identification of genes that are expressed differently in affected cells provides a basis for explaining the causes of illnesses and highlights potential drug targets. Using the process described
in Figure 2, one would design conducted – with reference to trans- The Protein Data Bank. A computer-based
compounds that bind the expressed cription regulatory systems – and finally archival file for macromolecular structures.
Eur J Biochem 1977;80(2):319-24.
protein, or perhaps more importantly, looked at several practical applications 7. Berman HM, Westbrook J, Feng Z, Gilliland
the transcription regulator that has caused G, Bhat TN, Weissig H, et al. The Protein
the change in expression levels. Given Data Bank. Nucleic Acids Res
a lead compound, microarray experi- Two principal approaches underpin 2000;28(1):235-42.
8. Pearson WR, Lipman DJ. Improved tools
ments can then be used to evaluate all studies in bioinformatics. First is
for biological sequence comparison. Proc
responses to pharmacological inter- that of comparing and grouping the Natl Acad Sci U S A 1988;85(8):2444-2448.
vention, [135,136] and also provide data according to biologically meaning- 9. Altschul SF, Madden TL, Schaffer AA,
early tests to detect or predict the ful similarities and second, that of Zhang J, Zhang Z, Miller W, et al. Gapped
toxicity of trial drugs. analysing one type of data to infer and BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic
Further advances in bioinformatics understand the observations for another Acids Res. 1997;25(17):3389-3402.
combined with experimental genomics type of data. These approaches are 10. Holstege FC JE, Wyrick JJ, Lee TI,
for individuals are predicted to reflected in the main aims of the field, Hengartner CJ, Green MR, Golub TR,
revolutionise the future of healthcare. which are to understand and organise Lander ES, Young RA. Dissecting the
regulatory circuitry of a eukaryotic genome.
A typical scenario for a patient may the information associated with biological
Cell 1998;95(5):717-728.
start with post-natal genotyping to molecules on a large scale. As a 11. Pedersen AG, Jensen LJ,
assess susceptibility or immunity from result, bioinformatics has not only Brunak S, Staerfeldt HH, Ussery DW. A
specific diseases and pathogens. With provided greater depth to biological DNA structural atlas for Escherichia coli. J
this information, a unique combination investigations, but added the dimension Mol Biol 2000;299(4):907-930.
12. Kanehisa M, Goto S. KEGG: kyoto
of vaccines could be prescribed, mini- of breadth as well. In this way, we are encyclopedia of genes and genomes. Nucleic
mising the healthcare costs of unneces- able to examine individual systems in Acids Res 2000;28(1):27-30.
sary treatments and anticipating the detail and also compare them with 13. Jeffery CJ. Moonlighting proteins. TIBS
onslaught of diseases later in life. those that are related in order to 1999;24(1):8-11.
14. Chothia C. Proteins. One thousand families
Regular lifetime screenings could lead uncover common principles that apply for the molecular biologist [news]. Nature
to guidance for nutrition intake and across many systems and highlight 1992;357(6379):543-4.
early detections of any illnesses [137]. unusual features that are unique to 15. Orengo CA, Jones DT, Thornton JM.
In addition, drug-based treatments some. Protein superfamilies and domain
could be tailored specifically to the superfolds. Nature 1994;372(6507):631-4.
Acknowledgements 16. Lesk AM, Chothia C. How different amino
patient and disease, thus providing the acid sequences determine similar protein
most effective course of medication We thank Patrick McGarvey for comments structures: the structure and evolutionary
with minimal side-effects [138]. Given on the manuscript. dynamics of the globins. J Mol Biol
the present rate of development, such 1980;136(3):225-70.
17. Russell RB, Saqi MA, Sayle RA, Bates PA,
a scenario in healthcare appears to be
References Sternberg MJ. Recognition of analogous
possible in the not too distant future. and homologous protein folds: analysis of
1. Reichhardt T. It’s sink or swim as a tidal sequence and structure conservation. J Mol
Conclusions wave of data approaches. Nature Biol 1997;269(3):423-39.
1999;399(6736):517-20. 18. Russell RB, Saqi MA, Bates PA, Sayle RA,
With the current deluge of data,
2. Benson DA, Karsch-Mizrachi I, Lipman Sternberg MJ. Recognition of analogous
computational methods have become and homologous protein folds—assessment
DJ, Ostell J, Rapp BA, Wheeler DL.
indispensable to biological investiga- GenBank. Nucleic Acids Res 2000;28 of prediction success and associated
tions. Originally developed for the (1):15-8. alignment accuracy using empirical
analysis of biological sequences, bioin- 3. Bairoch A, Apweiler R. The SWISS-PROT substitution matrices. Protein Eng
protein sequence database and its 1998;11(1):1-9.
formatics now encompasses a wide 19. Fitch WM. Distinguishing homologous
supplement TrEMBL in 2000. Nucleic
range of subject areas including struc- Acids Res 2000;28(1):45-8. from analogous proteins. Syst Zool
tural biology, genomics and gene ex- 4. Fleischmann RD, Adams MD, White O, 1970;19:99-110.
pression studies. In this review, we Clayton RA, Kirkness EF, Kerlavage AR, 20. Tatusov RL, Koonin EV, Lipman DJ. A
provided an introduction and overview et al. Whole-genome random sequencing genomic perspective on protein families.
and assembly of Haemophilus influenzae Science 1997;278(5338):631-7.
of the current state of field. In 21. Gerstein M, Hegyi H. Comparing genomes
Rd. Science 1995;269 (5223):496-512.
particular, we discussed the types of 5. Drowning in data. The Economist 26 June in terms of protein structure: surveys of a
biological information and databases 1999. finite parts list. FEMS Microbiol Rev
that are commonly used, examined 6. Bernstein FC, Koetzle TF, Williams GJ, 1998;22(4):277-304.
Meyer EF, Jr., Brice MD, Rodgers JR, et al. 22. Skolnick J, Fetrow JS. From genes to protein
some of the studies that are being

96 Yearbook of Medical Informatics 2001



structure and function: novel applications of computational approaches in the genomic era. TIBtech 2000;18:34-39.
23. McGarvey PB, Huang H, Barker WC, Orcutt BC, Garavelli JS, Srinivasarao GY, et al. PIR: a new resource for bioinformatics. Bioinformatics 2000;16(3):290-291.
24. Bleasby AJ, Akrigg D, Attwood TK. OWL: a non-redundant composite protein sequence database. Nucleic Acids Res 1994;22(17):3574-3577.
25. Bleasby AJ, Wootton JC. Construction of validated, non-redundant composite protein sequence databases. Protein Eng 1990;3(3):153-159.
26. Hofmann K, Bucher P, Falquet L, Bairoch A. The PROSITE database, its status in 1999. Nucleic Acids Res 1999;27(1):215-219.
27. Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, Scordis P, et al. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res 2000;28(1):225-227.
28. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL. The Pfam protein families database. Nucleic Acids Res 2000;28(1):263-266.
29. Attwood TK, Flower DR, Lewis AP, Mabey JE, Morgan SR, Scordis P, et al. PRINTS prepares for the new millennium. Nucleic Acids Res 1999;27(1):220-225.
30. Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, Jones ML, Thornton JM. PDBsum: a Web-based database of summaries and analyses of all PDB structures. TIBS 1997;22(12):488-490.
31. Pearl FM, Lee D, Bray JE, Sillitoe I, Todd AE, Harrison AP, et al. Assigning genomic sequences to CATH. Nucleic Acids Res 2000;28(1):277-282.
32. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 2000;28(1):257-259.
33. Holm L, Sander C. Touring protein fold space with Dali/FSSP. Nucleic Acids Res 1998;26(1):316-319.
34. Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, et al. The Nucleic Acid Database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 1992;63(3):751-759.
35. Vondrasek J, Wlodawer A. Database of HIV proteinase structures. TIBS 1997;22(5):183.
36. Hendlich M. Databases for protein-ligand complexes. Acta Cryst D 1998;54(1):1178-1182.
37. Baker W, van den Broek A, Camon E, Hingamp P, Sterk P, Stoesser G, et al. The EMBL nucleotide sequence database. Nucleic Acids Res 2000;28(1):19-23.
38. Okayama T, Tamura T, Gojobori T, Tateno Y, Ikeo K, Miyazaki S, et al. Formal design and implementation of an improved DDBJ DNA database with a new schema and object-oriented library. Bioinformatics 1998;14(6):472-8.
39. Schuler GD, Epstein JA, Ohkawa H, Kans JA. Entrez: molecular biology database and retrieval system. Methods Enzymol 1996;266:141-62.
40. Tatusova TA, Karsch-Mizrachi I, Ostell JA. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 1999;15(7-8):536-43.
41. Lin J, Gerstein M. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res 2000;10(6):808-18.
42. Eisen MB, Brown PO. DNA arrays for analysis of gene expression. Methods Enzymol 1999;303:179-205.
43. Cheung VG, Morley M, Aguilar F, Massimi A, Kucherlapati R, Childs G. Making and reading microarrays. Nat Genet 1999;21(1 Suppl):15-9.
44. Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM. Expression profiling using cDNA microarrays. Nat Genet 1999;21(1 Suppl):10-4.
45. Lipshutz RJ FS, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat Gen 1999;21(1):20-24.
46. Velculescu VE ZL, Zhou, W Traverso, G St Croix, B Vogelstein B, Kinzler KW. Serial Analysis of Gene Expression Detailed Protocol. 1999.
47. Roth FP HJ, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 1998;16(10):939-45.
48. Jelinsky SA, Samson LD. Global response of Saccharomyces cerevisiae to an alkylating agent. Proc Natl Acad Sci U S A 1999;96(4):1486-91.
49. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998;2(1):65-73.
50. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997;278(5338):680-6.
51. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999;285(5429):901-6.
52. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747-52.
53. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531-7.
54. Celis JE, Gromov P. 2D protein electrophoresis: can it be perfected? Curr Opin Biotechnol 1999;10(1):16-21.
55. Pandey A, Mann M. Proteomics to study genes and genomes. Nature 2000;405(6788):837-46.
56. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI. A sampling of the yeast proteome. Mol Cell Biol 1999;19(11):7357-68.
57. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999;17(10):994-9.
58. Gerstein M. Integrative database analysis in structural genomics. Nature Struct Biol 2000;7:960-3.
59. Etzold T, Ulyanov A, Argos P. SRS: information retrieval system for molecular biology data banks. Methods Enzymol 1996;266:114-28.
60. Wade K. Searching Entrez PubMed and uncover on the internet [news]. Aviat Space Environ Med 2000;71(5):559.
61. Zhang MQ. Promoter analysis of co-regulated genes in the yeast genome. Comput Chem 1999;23(3-4):233-50.
62. Boguski MS. Biosequence exegesis. Science 1999;286(5439):453-5.
63. Miller C, Gurd J, Brass A. A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases. Bioinformatics 1999;15(2):111-21.
64. Gonnet GH, Korostensky C, Benner S. Evaluation measures of multiple sequence alignments [In Process Citation]. J Comput Biol 2000;7(1-2):261-76.
65. Orengo CA, Taylor WR. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996;266:617-35.
66. Orengo CA. CORA: topological fingerprints for protein structural families. Protein Sci 1999;8(4):699-715.
67. Russell RB, Sternberg MJ. Structure prediction. How good are we? Curr Biol 1995;5(5):488-90.
68. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, et al. Protein folds and functions. Structure 1998;6(7):875-84.
69. Hegyi H, Gerstein M. The relationship between protein structure and function: a


comprehensive survey with application to the yeast genome. J Mol Biol 1999;288(1):147-64.
70. Russell RB, Sasieni PD, Sternberg MJE. Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 1998;282(4):903-18.
71. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000;297(1):233-49.
72. Harrison SC. A structural taxonomy of DNA-binding domains. Nature 1991;353(6346):715-9.
73. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biology 2000;1(1):1-37.
74. Jones S, van Heyningen P, Berman HM, Thornton JM. Protein-DNA interactions: A structural analysis. J Mol Biol 1999;287(5):877-96.
75. Suzuki M, Gerstein M. Binding geometry of alpha-helices that recognize DNA. Proteins 1995;23(4):525-35.
76. Luscombe NM, Thornton JM. Protein-DNA interactions: a 3D analysis of alpha-helix-binding in the major groove. Manuscript in preparation.
77. Suzuki M, Brenner SE, Gerstein M, Yagi N. DNA recognition code of transcription factors. Protein Eng 1995;8(4):319-28.
78. Suzuki M. DNA recognition by a beta-sheet. Protein Eng 1995;8(1):1-4.
79. Seeman NC, Rosenberg JM, Rich A. Sequence specific recognition of double helical nucleic acids by proteins. Proc Natl Acad Sci U S A 1976;73:804-808.
80. Suzuki M. A framework for the DNA-protein recognition code of the probe helix in transcription factors: the chemical and stereochemical rules [see comments]. Structure 1994;2(4):317-26.
81. Mandel-Gutfreund Y, Schueler O, Margalit H. Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: in search of common principles. J Mol Biol 1995;253(2):370-82.
82. Luscombe NM, Laskowski RA, Thornton JM. Protein-DNA interactions: a 3D analysis of amino acid-base interactions. Manuscript in preparation.
83. Mandel-Gutfreund Y, Margalit H, Jernigan RL, Zhurkin VB. A role for CH...O interactions in protein-DNA recognition. J Mol Biol 1998;277(5):1129-40.
84. Sternberg MJ, Gabb HA, Jackson RM. Predictive docking of protein-protein and protein-DNA complexes. Curr Opin Struct Biol 1998;8(2):250-6.
85. Aloy P, Moont G, Gabb HA, Querol E, Aviles FX, Sternberg MJ. Modelling repressor proteins docking to DNA. Proteins 1998;33(4):535-49.
86. Dickerson RE. DNA bending: the prevalence of kinkiness and the virtues of normality. Nucleic Acids Res 1998;26(8):1906-26.
87. Perez-Rueda E, Collado-Vides J. The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res 2000;28(8):1838-47.
88. Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000;28(1):37-40.
89. Salgado H, Santos-Zavaleta A, Gama-Castro S, Millan-Zarate D, Blattner FR, Collado-Vides J. RegulonDB (version 3.0): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 2000;28(1):65-7.
90. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 2000;28(1):316-9.
91. Teichmann SA, Chothia C, Gerstein M. Advances in structural genomics. Curr Opin Struct Biol 1999;9(3):390-9.
92. Aravind L, Koonin EV. DNA-binding proteins and evolution of transcription regulation in the archaea. Nucleic Acids Res 1999;27(23):4658-70.
93. Huynen MA, van Nimwegen E. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 1998;15(5):583-9.
94. Luscombe NM, Thornton JM. Protein-DNA interactions: an analysis of amino acid conservation and the effect on binding specificity. Manuscript in preparation.
95. Gelfand MS. Prediction of function in DNA sequence analysis. J Comp Biol 1995;1:87-115.
96. Robison K, McGuire AM, Church GM. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J Mol Biol 1998;284(2):241-54.
97. Thieffry D, Salgado H, Huerta AM, Collado-Vides J. Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12. Bioinformatics 1998;14(5):391-400.
98. Mironov AA, Koonin EV, Roytberg MA, Gelfand MS. Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes. Nucleic Acids Res 1999;27(14):2981-9.
99. Gelfand MS, Koonin EV, Mironov AA. Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Res 2000;28(3):695-705.
100. McGuire AM, Hughes JD, Church GM. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes [In Process Citation]. Genome Res 2000;10(6):744-57.
101. Bysani N, Daugherty JR, Cooper TG. Saturation mutagenesis of the UASNTR (GATAA) responsible for nitrogen catabolite repression-sensitive transcriptional activation of the allantoin pathway genes in Saccharomyces cerevisiae. J Bacteriol 1991;173(16):4977-82.
102. Clarke ND, Berg JM. Zinc fingers in Caenorhabditis elegans: finding families and probing pathways. Science 1998;282(5396):2018-22.
103. van Helden J, Andre B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998;281(5):827-42.
104. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 2000;97(12):6652-7.
105. Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, et al. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr Biol 1996;6(3):279-91.
106. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998;95(25):14863-8.
107. Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, et al. Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci U S A 1998;95(1):334-9.
108. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A 1999;96(12):6745-50.
109. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999;96(6):2907-12.
110. Toronen P, Kolehmainen M, Wong G, Castren E. Analysis of gene expression data using self-organizing maps. FEBS Lett 1999;451(2):142-6.
111. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet 1999;22(3):281-5.


112. Jansen R, Gerstein M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res 2000;28(6):1481-8.
113. Gerstein M, Jansen R. The current excitement in bioinformatics, analysis of whole-genome expression data: how does it relate to protein structure and function. Current Opinion in Structural Biology 2000;10:574-84.
114. Drawid A, Gerstein M. A Bayesian System Integrating Expression Data with Sequence Patterns for Localizing Proteins: Comprehensive Application to the Yeast Genome. J Mol Biol 2000;301:1059-75.
115. Drawid A, Jansen R, Gerstein M. Genome-wide analysis relating expression level with protein subcellular localisation. TIGS 2000;16:426-30.
116. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science 1999;285(5428):751-3.
117. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO. Protein function in the post-genomic era. Nature 2000;405(6788):823-6.
118. Jansen R, Greenbaum D, Gerstein M. Relating whole-genome expression data with protein-protein interactions. Manuscript in preparation.
119. Marx J. Medicine. DNA arrays reveal cancer in its many forms. Science 2000;289(5485):1670-2.
120. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000;24(3):227-35.
121. Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci U S A 1999;96(16):9212-7.
122. Livesey FJ, Furukawa T, Steffen MA, Church GM, Cepko CL. Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Curr Biol 2000;10(6):301-10.
123. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234(3):779-815.
124. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358(6381):86-9.
125. Kok K, Naylor SL, Buys CH. Deletions of the short arm of chromosome 3 in solid tumors and the search for suppressor genes. Adv Cancer Res 1997;71:27-92.
126. Syngal S, Fox EA, Eng C, Kolodner RD, Garber JE. Sensitivity and specificity of clinical criteria for hereditary non-polyposis colorectal cancer associated mutations in MSH2 and MLH1. J Med Gen 2000;37(9):641-645.
127. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000;403(6770):623-7.
128. Ross-Macdonald P, Sheehan A, Friddle C, Roeder GS, Snyder M. Transposon mutagenesis for the analysis of protein production, function, and localization. Methods Enzymol 1999;303:512-32.
129. Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 1999;27(1):44-8.
130. Murray-Rust P. Bioinformatics and drug discovery. Curr Opin Biotechnol 1994;5(6):648-53.
131. Friend SH. How DNA microarrays and expression profiling will affect clinical practice. BMJ 1999;319(7220):1306-7.
132. Tamayo P SD, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999;96(6):2907-12.
133. Perou CM JS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JC, Lashkari D, Shalon D, Brown PO, Botstein D. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci 1999;96(16):9212-7.
134. Hiltunen MO, Niemi M, Yla-Herttuala S. Functional genomics and DNA array techniques in atherosclerosis research. Curr Opin Lipidol 1999;10(6):515-9.
135. Colantuoni C, Purcell AE, Bouton CM, Pevsner J. High throughput analysis of gene expression in the human brain. J Neurosci Res 2000;59(1):1-10.
136. Debouck C, Metcalf B. The impact of genomics on drug discovery. Annu Rev Pharmacol Toxicol 2000;40:193-207.
137. Sander C. Genomic medicine and the future of health care. Science 2000;287(5460):1977-8.
138. Ohlstein EH, Ruffolo RR, Jr., Elliott JD. Drug discovery in the next millennium. Annu Rev Pharmacol Toxicol 2000;40:177-91.

Address of the authors:
Nicholas M. Luscombe, Dov Greenbaum, Mark Gerstein*
Department of Molecular Biophysics and Biochemistry
Yale University
266 Whitney Avenue
PO Box 208 114
New Haven CT 06520-8114, USA
mark.gerstein@yale.edu

*corresponding author
