Module 1 (Structure Databases)

Module 1
Databases
Structures
Local Patterns
Data, Data Collection, Database
Contents:
• Molecular structures
• Spectra
• Patent information
• Molecular properties
• Scientific literature
• References
• Supplier information
• Prices
• …
Implementation:
• “Flat file”
• Local database
• Web (www) access
• …
Definition of a Database
Database =
management component +
storage component for persistent data
that serve a specific purpose.
“Molecule Databases”
[Diagram labels: Raw data, Source file, Filtering, Library file (Data 1, Data 2, Data 3), Index file, Application software, User interface, User]

What is this?
• How do we recognize a “tree”?
• Which “trees” are most similar to each other?
Molecular Similarity
Typical applications of the “similarity concept”
• Similarity searching in databases

• Pattern recognition in molecular structures
• Similarity searching in virtual compound libraries
• Data clustering / classification
• Single compound design, de novo design
• Compound library design
• “Diversity” analysis of compound collections
• SAR modeling & prediction
Analogy-based Feature Selection
“Find function-determining features of macromolecular receptors and their small molecule effectors”
Receptors   Ligands
A1          a1
A2          a2
B1          b1
B2          b2
B3          b3
C1          c1
C2          c2
Applications
• Drug Discovery
• Chemical Biology
• Functional Genomics
• Similarity Searching & Virtual Screening
• Identification of targets & ligands
• Design of compound libraries
The Early Drug Discovery Process
Stages: Target Identification, Target Validation, Hit Identification, Lead Identification, Lead Optimization, Preclinical Development, Drug
Bioinformatics
Cheminformatics
Primary Sequence Databases
Database    Version          No. of Sequences
SwissProt   54.3 (10/2007)   285,335 (02/05: 168,297)
EMBL        92 (09/2007)     105,696,243 (12/04: 46,105,397)
TrEMBL      37.3 (10/2007)   4,935,209 (02/05: 1,589,670)
MIPS
OWL
…
UniProt combines SwissProt, TrEMBL, PIR
UniProtKB/TrEMBL Release 33.7: 3 189 332 entries
http://www.ebi.ac.uk/Databases/index.html
http://pir.georgetown.edu/pirwww/
http://pir.georgetown.edu/pirwww/dbinfo/
Genome mRNA cDNA EST
Genome
Coding part (H.sapiens ~ 1%)
E1 E2 E3 E4 Eukaryotic gene
I1 I2 I3 with Intron/Exon structure
Splicing
E1 E2 E3 E4
5’-UTR 3’-UTR
Reverse Transcription
~7 × 10^6 (70%) EST in GenBank!

5’-EST 3’-EST
EST: C. Venter 1990s
(most common)
From Raw Data to Sequences
I) cDNA sequence fragments (ESTs)
II) Fragment matching (clustering) (>40 bp; >95% ident.)
III) Contig assembly
IV) “Contig” (contiguous clone map)
V) DNA complement (3’-5’ and 5’-3’ strands)
VI) ORF (open reading frame) prediction: six-frame translation
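Steps V and VI (DNA complement and six-frame translation) can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical and not taken from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement (step V: the other strand, read 5' to 3')."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Step VI: translate all three reading frames on both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

An ORF prediction would then look for long stretches without `*` in any of the six frames; that search is left out here.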

Sequences “mature” in a database
New sequence
DB-Entry
Unannotated Preliminary Unreviewed Standard
†
Some Numbers
Organism                   Genome Size (bp)   Genes
Epstein-Barr virus         0.172 × 10^6       80
Escherichia coli           4.6 × 10^6         4406
Saccharomyces cerevisiae   12.1 × 10^6        5885
Drosophila melanogaster    180 × 10^6         13601
Homo sapiens               3200 × 10^6        ~25000
Most human genes are “hypothetical”, “unclassified”, “unknown”
UniProt_SwissProt Line Types
ID - Identification
AC - Accession number(s)
DT - Date
DE - Description
GN - Gene name(s)
OS - Organism species
OG - Organelle
OC - Organism classification
RN - Reference number
RP - Reference position
RC - Reference comments
RX - Reference cross-references
RA - Reference authors
RL - Reference location
CC - Comments or notes
DR - Database cross-references
KW - Keywords
FT - Feature table data
SQ - Sequence header
(blanks) - sequence data
// - Termination line
A SwissProt Entry
ID LEP_ECOLI STANDARD; PRT; 324 AA.
AC P00803; P78098;
DT 21-JUL-1986 (REL. 01, CREATED)
DT 01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)
DT 01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
GN LEPB.
OS ESCHERICHIA COLI.
OC PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC ENTEROBACTERIACEAE.
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE; 84008229.
RA WOLFE P.B., WICKNER W., GOODMAN J.M.;
RL J. BIOL. CHEM. 258:12073-12080(1983).
CC -!- CATALYTIC ACTIVITY: CLEAVAGE OF N-TERMINAL LEADER SEQUENCES FROM
CC SECRETED AND PERIPLASMIC PROTEINS PRECURSOR.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. INNER MEMBRANE.
CC -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S26; ALSO KNOWN AS TYPE
CC I LEADER PEPTIDASE FAMILY.
DR EMBL; K00426; G146600; -.
DR PIR; A00998; ZPECS.
DR PROSITE; PS00501; SPASE_I_1; 1.
KW INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT MOD_RES 1 1 BLOCKED.
FT TRANSMEM 4 22
FT DOMAIN 23 58 CYTOPLASMIC.
FT TRANSMEM 59 77
FT DOMAIN 78 324 PERIPLASMIC.
FT ACT_SITE 91 91
FT ACT_SITE 146 146
FT MUTAGEN 62 62 E->V: INDIFFERENT.
LEP_ECOLI Length: 324 January 7, 1999 14:23 Type: P Check: 8977 ..
1 MANMFALILV IATLVTGILW CVDKFFFAPK RRERQAAAQA AAGDSLDKAT ..

//
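Because every line of such a flat-file entry starts with a two-letter line code, the entry can be read with very little code. The following is a minimal Python sketch, not an official parser: the function name and the shortened example entry (including its abbreviated SQ header) are assumptions for illustration, and sequence lines are assumed to start with blanks as listed in the line-type table.

```python
from collections import defaultdict

# Shortened, illustrative entry; the SQ header text is abbreviated.
EXAMPLE_ENTRY = """\
ID   LEP_ECOLI STANDARD; PRT; 324 AA.
AC   P00803; P78098;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)
GN   LEPB.
SQ   SEQUENCE
     MANMFALILV IATLVTGILW
//
"""


def parse_flat_file_entry(text):
    """Group the lines of one entry by line code; stop at the '//' line."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("//"):
            break  # termination line: end of this entry
        if line[0].isspace():
            # Lines starting with blanks carry sequence data.
            sequence.append("".join(line.split()))
            continue
        code, _, data = line.partition(" ")
        fields[code].append(data.strip())
    record = dict(fields)
    record["sequence"] = "".join(sequence)
    return record
```

Calling `parse_flat_file_entry(EXAMPLE_ENTRY)["AC"]` returns the accession line, and repeated codes such as DT are collected in order of appearance.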
SwissProt Feature Table
The feature table may indicate regions that
• perform or affect function

• interact with other molecules
• affect replication
• are involved in recombination
• are a repeat unit
• have secondary or tertiary structure
• are revised or corrected
• DB searching
• links between databases
Amino acid codes
A Ala Alanine.
C Cys Cysteine.
D Asp Aspartic acid.
E Glu Glutamic acid.
F Phe Phenylalanine.
G Gly Glycine.
H His Histidine.
I Ile Isoleucine.
K Lys Lysine.
L Leu Leucine.
M Met Methionine.
N Asn Asparagine.
P Pro Proline.
Q Gln Glutamine.
R Arg Arginine.
S Ser Serine.
T Thr Threonine.
V Val Valine.
W Trp Tryptophan.
Y Tyr Tyrosine.
B Asx Aspartic acid or Asparagine.
Z Glx Glutamine or Glutamic acid.
X Xaa Any amino acid.
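The code table above maps directly onto a lookup dictionary. A minimal Python sketch for converting a one-letter sequence into three-letter notation (the function name is hypothetical):

```python
# One-letter to three-letter amino acid codes, including the ambiguity codes.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    "B": "Asx", "Z": "Glx", "X": "Xaa",
}


def to_three_letter(sequence):
    """Convert e.g. 'MAN' to 'Met-Ala-Asn'; raises KeyError on unknown codes."""
    return "-".join(THREE_LETTER[residue] for residue in sequence.upper())
```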
Levels of Pattern Conservation
[Figure: hierarchy of levels, from top to bottom: active site; 3D protein structure; protein fold / domains; 2D protein structure; 1D protein structure (amino acid sequence); mRNA sequence; DNA sequence. Side labels: “Predictive”, “Conserved”, “Patterns”, “Alignment studies”.]
The 20 standard L-amino acids
[Figure: structural formulas of the 20 standard L-amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) and of the peptide backbone with side chains R1, R2, R3]
Stereochemistry of Amino Acids: Fischer Projection
[Figure: Fischer projections of an L-amino acid (COOH at the top, H2N on the left, H on the right, R at the bottom) and a D-amino acid (COOH at the top, H on the left, NH2 on the right, R at the bottom)]
Compounds with one or more chirality centers can be drawn using the Fischer projection (Emil Fischer):
• The main carbon chain is arranged vertically.
• The carbon atom with the highest oxidation state is written at the top.
• Vertical bonds point backward; horizontal bonds project forward out of the plane of the paper.
The 21st and 22nd proteinogenic amino acids
[Figure: structural formulas of selenocysteine and pyrrolysine]
Selenocysteine, Sec (UGA stop codon)
Pyrrolysine, Pyl (UAG stop codon)
Selenocysteine and pyrrolysine are encoded by codons that normally terminate protein synthesis: these codons must be redefined by a recoding process so that these amino acids can be incorporated into proteins.
http://www.biophys.uni-duesseldorf.de/~wilm/doc/ls_2003_01_secis_pp4.pdf
The Peptide Bond
Ramachandran Plot
3-10
trans
Peptide notation: N C
• white regions are disallowed except for glycine
Tutorial
http://www.cryst.bbk.ac.uk/PPS2/course/
The alpha-Helix
Right-handed α-Helix
i+8
i+4
5.4 Å pitch
• 3.6 residues per turn (36 residues = 10 turns)
Helical Structures
3-10 Helix
• 3 residues in a turn
• 10 atoms in ring formed
by a hydrogen-bond
The beta-Strand & beta-Sheet
Beta strand conformation
Antiparallel beta-Sheet
7 Å pitch
C-terminus
Beta-Sheets
Flavodoxin (PDB: 1AG9)

Reverse Turns (“Beta-Turns”)
• generally occur at the surface of the protein
• Hydrogen-bond between residues i and i+3 (Cα distance < 7 Å)
• nucleation centers during protein folding?
Type I Type II
Gly: no hindrance with C=O of (i+1)
• difference between type I and II: orientation of the peptide bond between i+1 and i+2
• account for approx. 50% of all turns
Beta-Hairpin Turns
• Beta-hairpin turns occur between two antiparallel beta-strands
= Supersecondary Structure
Type I’ Type II’
Residue 2: always Gly Residue 1: always Gly

Local Conformations are Context-Dependent
VDLLKN
identical sequence, different 3D structure

too short for homology assessment!
Global and local sequence features determine
protein structure and function
Ribonuclease T1 from Aspergillus oryzae
A Guanyl-specific hydrolase
ACDYTCGSNCYSSSDVSTAQAAGYQL
HEDGETVGSNSYPHKYNNYEGFDFSV
SSPYYEWPILSSGDVYSGGSPGADRV
VFNENNQLAGVITHTGASGNNFVECT
Amino acid sequence Structural model

PDB: 4RNT
Bioinformatics: Searching for Homologues
Homolog
Similar protein with a common ancestral sequence
• may have similar function or structure
• structural homology
• functional homology
• homology ≠ similarity !
• no “% homology” !
Ortholog
Homologous proteins in different species
Paralog
Homologous proteins in the same species
Secondary Databases (Patterns & Motifs)
Database   Primary Source    Stored Information
PROSITE    SwissProt         Regular expressions (patterns)
Profiles   SwissProt         Weighted matrices (profiles)
PRINTS     OWL (SwissProt)   Aligned motifs (fingerprints)
BLOCKS     PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY   BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
Pfam       SwissProt         Hidden Markov Models (HMM)
…
Databases integrating Genetic, Molecular, or Metabolic Data
Amaze: Biochemical pathways
www.ebi.ac.uk/research/pfmp/
Ecocyc / Metacyc: Metabolic pathways
http://biocyc.org
KEGG: Metabolic pathways
www.genome.ad.jp/kegg/
TransPath: Signal transduction pathways
http://transpath.gbf.de/
BIND: Protein interaction and complexes
www.bind.ca/
GeneNet: Gene networks
http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/
CSNDB: Cell-signaling networks
http://geo.nihs.go.jp/csndb/
Information Retrieval Systems
• SRS – Sequence Retrieval System (at EBI, UK)

http://www.srs.ebi.ac.uk
• Entrez (at NCBI, USA)

http://www.ncbi.nlm.nih.gov/Entrez
Homework: Practice!
• Amino acids – structures and codes
http://bioinf.man.ac.uk/aacids/amino_acid.htm
Amino Acid Classification: A Venn-Diagram
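[Figure: Venn diagram grouping the amino acids by the overlapping properties small, tiny, aliphatic, polar, charged, positive, negative, hydrophobic, and aromatic; proline (P) is labeled separately, and cysteine appears twice (labeled SS and SH)]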
Sliding-Window: The Helical-Wheel Plot
Alpha-helix
3.6 residues per turn
(100 degrees / residue)
http://cti.itc.virginia.edu/~cmg/Demo/wheel/wheelApp.html
Transmembrane helices of rhodopsin ( PDB)
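The helical-wheel idea follows directly from the 100 degrees per residue stated above: each successive residue is placed 100° further around the helix axis. A minimal Python sketch (the function name is hypothetical; plotting is left out):

```python
def helical_wheel_angles(sequence, degrees_per_residue=100.0):
    """Return (residue, angle in degrees) pairs for a helical-wheel projection.

    Residue i is placed at i * 100 degrees (modulo 360), which corresponds to
    3.6 residues per turn of an ideal alpha helix.
    """
    return [
        (residue, (index * degrees_per_residue) % 360.0)
        for index, residue in enumerate(sequence)
    ]
```

Residues whose angles fall on the same side of the wheel face the same side of the helix, which is how amphipathic helices become visible in such a plot.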
Sliding-Window: The Hydrophobicity Plot
Detect potential transmembrane segments
Hydrophobicity plot of human Rhodopsin (AC P08100 at ExPASy), ExPASy-Service ProtScale; window size = 9; Kyte & Doolittle hydrophobicity scale
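A sliding-window hydrophobicity profile of this kind can be sketched in a few lines of Python. The sketch below uses the Kyte & Doolittle scale and a window of 9 with a plain, unweighted average; it is not the ProtScale implementation, and the function name is hypothetical.

```python
# Kyte & Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Average hydropathy for every full window along the sequence.

    Element k of the result belongs to the window starting at residue k
    (0-based), i.e. it is centered on residue k + window // 2.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        return []
    scores = [KYTE_DOOLITTLE[residue] for residue in seq]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(seq) - window + 1)
    ]
```

Stretches of consecutive windows with clearly positive averages are candidates for transmembrane segments; any cutoff value would be an additional assumption.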
Sliding-Window: Secondary Structure
Chou-Fasman method
• based on analyzing the frequency of amino acids in different secondary structures
• A, E, L, and M are strong predictors of alpha helices
• P and G are predictors of breaks in a helix
• a table of predictive values is created for alpha helices, beta sheets, and loops
• the structure with the greatest overall prediction value is used to determine the structure (80% majority, α+β window size = 5, turn: 4 residues)
• The GOR method improves upon the Chou-Fasman method:
• it assumes that the amino acids surrounding the central amino acid influence the secondary structure the central amino acid is likely to adopt
Scoring matrices
Example of a multiple sequence alignment (ClustalW)
“Block”
SW_LEP_BACAM .......... ......MTEE Q..KPTSEKS VKRKSNTYWE WGKAIIIAVA
SW_LEPP_BACSU .......... .......... ....MTKEKV FKKKS.SILE WGKAIVIAVI
SW_LEP_ECOLI FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA
SW_LEP_SALTY FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA
SW_LEP_PSEFL FAPRRRSAIA SYQGSVSQP. D..AVVIEKL NKEPL..LVE YGKSFFPVLF
SW_LEPC_BACCL .......... .......... ....MTKQKE KRGRR..... WPWFVA..VC
SW_LEP_HAEIN VLPKRHRQVA RAEQRSGKT. ...LSEEEKA KIEPISEASE FLSSLFPVLA
SW_LEP_MYCTU AGQVFDAAPF DAAPDADSEG DSKAAKTDEP RPAKRSTLRE FAVLAVIAVV
SW_LEP_BACAM LALLIRHFLF EPYLVEGSSM YPTLH..... DGERLFVN.. ..........

SW_LEPP_BACSU LALLIRNFLF EPYVVEGKSM DPTLV..... DSERLFVN.. ..........
SW_LEP_ECOLI IVLIVRSFIY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_SALTY IVLIVRSFLY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_PSEFL IVLVLRSFLV EPFQIPSGSM KPTLD..... VGDFILVNKF SYGIRLPVID
SW_LEPC_BACCL VVATLRLFVF SNYVVEGKSM MPTLE..... SGNLLIVN.. ..........
SW_LEP_HAEIN VVFLVRSFLF EPFQIPSGSM ESTLR..... VGDFLVVNKY AYGVKDPIFQ
SW_LEP_MYCTU LYYVMLTFVA RPYLIPSESM EPTLHGCSTC VGDRIMVD.. ..........
[S,G]-x-S-M-x-[P,S] “Pattern”
Regular expression matching
Searching for Consensus Patterns in PROSITE
Query: E.coli leader peptidase

-Consensus pattern: [GS]-x-S-M-x-[PS]-[AT]-[LF]
[S is an active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 16.
-Consensus pattern: K-R-[LIVMSTA](2)-G-x-[PG]-G-[DE]-x-[LIVM]-x-[LIVMFY]

[K is an active site residue]
-Sequences known to belong to this class detected by the pattern: ALL SPases I
from prokaryotes as well as yeast IMP1, but not IMP2.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Consensus pattern: [LIVMFYW](2)-x(2)-G-D-[NH]-x(3)-[SND]-x(2)-[SG]

-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 10.
Spase_I_1 (G,S)xSMx(P,S)(A,T)(L,F)
(S)xSMx(P)(T)(L)
89: PFQIP SGSMMPTL LIGDF
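PROSITE-style consensus patterns map almost one-to-one onto regular expressions. The following minimal Python sketch converts the subset of the syntax used above (`x`, single letters, `[...]` sets, `{...}` exclusions, and `(n)` or `(n,m)` repeats) and matches it with the standard `re` module. The function name is hypothetical, and anchors or other PROSITE syntax are not handled.

```python
import re

# One pattern element: x, a letter, [set] or {excluded set}, optional (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


# Fragment of E. coli leader peptidase around residue 89, from the slide above.
FRAGMENT = "PFQIPSGSMMPTLLIGDF"
HIT = re.search(prosite_to_regex("[GS]-x-S-M-x-[PS]-[AT]-[LF]"), FRAGMENT)
```

Here `HIT.group()` is the matched segment `SGSMMPTL`, the same stretch highlighted in the slide.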
Amino Acid Composition
[Figure: bar chart of amino acid frequency in % (y-axis, 0 to 16) for each amino acid (x-axis: A C D E F G H I K L M N P Q R S T V W Y), comparing SwissProt V 40.30, an archaebacterium (Thermoplasma volcanium), E. coli K-12, P. falciparum, and Homo sapiens]
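The percentage composition shown in such a chart is a simple count per residue type. A minimal Python sketch for a single sequence (the function name is hypothetical; for a whole proteome the counts would be summed over all sequences first):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each of the 20 standard amino acids.

    Characters outside the 20 standard one-letter codes are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    counts = Counter(residues)
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}
```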
Protein Targeting Signals
Signal peptidase
mature protein e.g.

secreted proteins
mitochondrial matrix proteins
chloroplast stromal proteins
e.g.
mitochondrial IMS proteins
apicoplast proteins
Known exceptions:
e.g.
some mitochondrial proteins
( ) SKL
some peroxisomal proteins
http://www.rockefeller.edu/pubinfo/proteintarget.html
