
Module 1

Databases
Structures
Local Patterns
Data, Data Collection, Database

Contents                     Implementation

• Molecular structures       • “Flat file”
• Spectra                    • Local database
• Patent information         • WWW access
• Molecular properties       • …
• Scientific literature
• Cross-references
• Vendor information
• Prices
• …
Definition of a Database

Database =
management component +
storage component for persistent data
that serve a specific purpose.
“Molecule Databases”

[Figure: architecture of a molecule database. Components shown: raw data, source file, filtering, library file (Data 1, Data 2, Data 3), index file, application software, user interface, user]


What is this?

• How do we recognize a “tree”?
• Which “trees” are most similar to one another?
Molecular Similarity

Typical applications of the “similarity concept”

• Similarity searching in databases


• Pattern recognition in molecular structures
• Similarity searching in virtual compound libraries
• Data clustering / classification
• Single compound design, de novo design
• Compound library design
• “Diversity” analysis of compound collections
• SAR modeling & prediction
Analogy-based Feature Selection

“Find function-determining features of macromolecular receptors
and their small-molecule effectors”

Receptors   Ligands
A1          a1
A2          a2
B1          b1
B2          b2
B3          b3
C1          c1
C2          c2
Applications

• Drug Discovery
• Chemical Biology

• Functional Genomics
• Similarity Searching & Virtual Screening
• Identification of targets & ligands
• Design of compound libraries
The Early Drug Discovery Process

Target Identification → Target Validation → Hit Identification → Lead Identification → Lead Optimization → Preclinical Development → Drug

Bioinformatics
Cheminformatics
Primary Sequence Databases

Database    Version           No. of Sequences

SwissProt   54.3 (10/2007)    285,335 (02/05: 168,297)
EMBL        92 (09/2007)      105,696,243 (12/04: 46,105,397)
TrEMBL      37.3 (10/2007)    4,935,209 (02/05: 1,589,670)
MIPS
OWL

UniProt combines SwissProt, TrEMBL, and PIR.
UniProtKB/TrEMBL Release 33.7: 3,189,332 entries

http://www.ebi.ac.uk/Databases/index.html
http://pir.georgetown.edu/pirwww/
http://pir.georgetown.edu/pirwww/dbinfo/
Genome mRNA cDNA EST

Genome

Coding part (H.sapiens ~ 1%)

E1 E2 E3 E4 Eukaryotic gene
I1 I2 I3 with Intron/Exon structure
Splicing

E1 E2 E3 E4
5’-UTR 3’-UTR
Reverse Transcription

~7 × 10^6 (70%) ESTs in GenBank!


5’-EST 3’-EST
EST: C. Venter 1990s
(most common)
From Raw Data to Sequences

I) cDNA sequence fragments (ESTs)
II) Fragment matching (clustering) (>40 bp; >95% identity)
III) Contig assembly
IV) “Contig” (contiguous clone map)
V) DNA complement (both strands, 5’→3’ and 3’→5’)
VI) ORF (open reading frame) prediction by six-frame translation
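
Steps V and VI above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline of any particular database: the function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical, the standard genetic code is assumed, and stop codons are written as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# nuclear code, NCBI translation table 1). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate the three forward and the three reverse reading frames."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

An ORF finder would then scan each of the six translations for long stretches between a start codon (M) and the next `*`. The minimum length to accept is a design choice that the slides do not specify.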


Sequences “mature” in a database

New sequence → DB entry

Unannotated → Preliminary → Unreviewed → Standard


Some Numbers

Organism                   Genome Size (bp)   Genes

Epstein-Barr virus         0.172 × 10^6       80
Escherichia coli           4.6 × 10^6         4406
Saccharomyces cerevisiae   12.1 × 10^6        5885
Drosophila melanogaster    180 × 10^6         13601
Homo sapiens               3200 × 10^6        ~25000

Most human genes are “hypothetical”, “unclassified”, “unknown”
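
One way to read the numbers in this table is as gene density (genes per megabase). The short sketch below does this arithmetic; the helper name `genes_per_megabase` is hypothetical, and the human gene count is the approximate value given above.

```python
# Genome size (bp) and gene count, copied from the "Some Numbers" table.
GENOMES = {
    "Epstein-Barr virus": (0.172e6, 80),
    "Escherichia coli": (4.6e6, 4406),
    "Saccharomyces cerevisiae": (12.1e6, 5885),
    "Drosophila melanogaster": (180e6, 13601),
    "Homo sapiens": (3200e6, 25000),  # gene count is approximate (~25000)
}


def genes_per_megabase(genome_size_bp: float, gene_count: int) -> float:
    """Gene density: number of genes per 10^6 base pairs."""
    return gene_count / (genome_size_bp / 1e6)


if __name__ == "__main__":
    for organism, (size_bp, genes) in GENOMES.items():
        density = genes_per_megabase(size_bp, genes)
        print(f"{organism:26s} {density:8.1f} genes/Mb")
```

The density falls by roughly two orders of magnitude from E. coli to Homo sapiens, which fits the small coding fraction of the human genome noted earlier in the slides.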
UniProt_SwissProt Line Types

ID - Identification
AC - Accession number(s)
DT - Date
DE - Description
GN - Gene name(s)
OS - Organism species
OG - Organelle
OC - Organism classification
RN - Reference number
RP - Reference position
RC - Reference comments
RX - Reference cross-references
RA - Reference authors
RL - Reference location
CC - Comments or notes
DR - Database cross-references
KW - Keywords
FT - Feature table data
SQ - Sequence header
   - (blanks) sequence data
// - Termination line
A SwissProt Entry
ID LEP_ECOLI STANDARD; PRT; 324 AA.
AC P00803; P78098;
DT 21-JUL-1986 (REL. 01, CREATED)
DT 01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)
DT 01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
GN LEPB.
OS ESCHERICHIA COLI.
OC PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC ENTEROBACTERIACEAE.
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE; 84008229.
RA WOLFE P.B., WICKNER W., GOODMAN J.M.;
RL J. BIOL. CHEM. 258:12073-12080(1983).
CC -!- CATALYTIC ACTIVITY: CLEAVAGE OF N-TERMINAL LEADER SEQUENCES FROM
CC SECRETED AND PERIPLASMIC PROTEINS PRECURSOR.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. INNER MEMBRANE.
CC -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S26; ALSO KNOWN AS TYPE
CC I LEADER PEPTIDASE FAMILY.
DR EMBL; K00426; G146600; -.
DR PIR; A00998; ZPECS.
DR PROSITE; PS00501; SPASE_I_1; 1.
KW INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT MOD_RES 1 1 BLOCKED.
FT TRANSMEM 4 22
FT DOMAIN 23 58 CYTOPLASMIC.
FT TRANSMEM 59 77
FT DOMAIN 78 324 PERIPLASMIC.
FT ACT_SITE 91 91
FT ACT_SITE 146 146
FT MUTAGEN 62 62 E->V: INDIFFERENT.
LEP_ECOLI Length: 324 January 7, 1999 14:23 Type: P Check: 8977 ..

1 MANMFALILV IATLVTGILW CVDKFFFAPK RRERQAAAQA AAGDSLDKAT ..


//
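
Because every line of such an entry starts with a two-letter line code, a flat-file entry can be read with very little code. The following sketch groups the lines of one entry by code and splits the FT lines into fields. The function names `parse_entry` and `parse_features` are hypothetical, the sample is an abbreviated copy of the entry above, and sequence lines are deliberately skipped.

```python
from collections import defaultdict

# Abbreviated copy of the LEP_ECOLI entry shown above. The parser only
# relies on the two-letter line code at the start of each line.
ENTRY = """\
ID   LEP_ECOLI      STANDARD;      PRT;   324 AA.
AC   P00803; P78098;
DE   SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
OS   ESCHERICHIA COLI.
KW   INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT   TRANSMEM      4     22
FT   DOMAIN       23     58       CYTOPLASMIC.
FT   ACT_SITE     91     91
//
"""


def parse_entry(text: str) -> dict[str, list[str]]:
    """Group the data of one flat-file entry by its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # termination line ends the entry
            break
        code, data = line[:2], line[2:].strip()
        if code.strip():  # sequence lines start with blanks; skipped here
            fields[code].append(data)
    return dict(fields)


def parse_features(ft_lines: list[str]) -> list[tuple[str, int, int, str]]:
    """Split FT lines into (key, start, end, description)."""
    features = []
    for line in ft_lines:
        key, start, end, *description = line.split(None, 3)
        text = description[0] if description else ""
        features.append((key, int(start), int(end), text))
    return features


if __name__ == "__main__":
    entry = parse_entry(ENTRY)
    print(entry["ID"][0])
    for feature in parse_features(entry["FT"]):
        print(feature)
```

Keeping a list per line code preserves repeated lines (several DT, CC, or FT lines) in their original order.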
SwissProt Feature Table

The feature table may indicate regions that

• perform or affect function


• interact with other molecules
• affect replication
• are involved in recombination
• are a repeat unit
• have secondary or tertiary structure
• are revised or corrected

• DB searching
• links between databases
Amino acid codes
A Ala Alanine.
C Cys Cysteine.
D Asp Aspartic acid.
E Glu Glutamic acid.
F Phe Phenylalanine.
G Gly Glycine.
H His Histidine.
I Ile Isoleucine.
K Lys Lysine.
L Leu Leucine.
M Met Methionine.
N Asn Asparagine.
P Pro Proline.
Q Gln Glutamine.
R Arg Arginine.
S Ser Serine.
T Thr Threonine.
V Val Valine.
W Trp Tryptophan.
Y Tyr Tyrosine.

B Asx Aspartic acid or Asparagine.


Z Glx Glutamine or Glutamic acid.
X Xaa Any amino acid.
Levels of Pattern Conservation

Active site
3D protein structure
Protein fold / domains
2D protein structure
1D protein structure (amino acid sequence)
mRNA sequence
DNA sequence

Side labels in the figure: “Predictive”, “Conserved Patterns”, “Alignment studies”
The 20 standard L-amino acids

[Figure: structural formulas of the 20 standard L-amino acids, labeled by one-letter code (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y), and of the peptide backbone running from the N-terminus (N) to the C-terminus (C) with side chains R1, R2, R3]
Stereochemistry of Amino Acids: Fischer Projection

   COOH          COOH
H2N   H        H   NH2
    R             R

    L             D

Compounds with one or more chirality centers can be drawn using the
Fischer projection (Emil Fischer):
• The main carbon chain is arranged vertically.
• The C atom with the highest oxidation state is placed at the top.
• Vertical bonds point backward; horizontal bonds project forward out of
the plane of the paper.
The 21st and 22nd proteinogenic amino acids

[Figure: structural formulas of selenocysteine and pyrrolysine]

Selenocysteine, Sec
(UGA stop codon)

Pyrrolysine, Pyl
(UAG stop codon)

Selenocysteine and pyrrolysine are encoded by codons that normally terminate
protein synthesis. These codons must be redefined by a recoding process
before the two amino acids can be incorporated into proteins.
http://www.biophys.uni-duesseldorf.de/~wilm/doc/ls_2003_01_secis_pp4.pdf
The Peptide Bond

Ramachandran Plot

3-10

trans

Peptide notation: N-terminus → C-terminus

• white regions are disallowed except for glycine

Tutorial
http://www.cryst.bbk.ac.uk/PPS2/course/
The alpha-Helix
Right-handed α-Helix

i+8

i+4

5.4 Å pitch

• 3.6 residues in a turn


(36 residues = 10 turns)
Helical Structures

3-10 Helix

• 3 residues in a turn
• 10 atoms in ring formed
by a hydrogen-bond
The beta-Strand & beta-Sheet

Beta strand conformation

Antiparallel beta-Sheet

7 Å pitch

C-terminus
Beta-Sheets

Flavodoxin (PDB: 1AG9)


Reverse Turns (“Beta-Turns”)
• generally occur at the surface of the protein
• Hydrogen-bond between residues i and i+3 (Cα distance < 7 Å)
• nucleation centers during protein folding?

Type I Type II
Gly: no hindrance with C=O of (i+1)

• difference between type I and II:


orientation of the peptide bond between i+1 and i+2
• account for approx. 50% of all turns
Beta-Hairpin Turns
• Beta-hairpin turns occur between two antiparallel beta-strands

= Supersecondary Structure

Type I’ Type II’

Residue 2: always Gly Residue 1: always Gly


Local Conformations are Context-Dependent

VDLLKN

identical sequence, different 3D structure


too short for homology assessment!
Global and local sequence features determine
protein structure and function
Ribonuclease T1 from Aspergillus oryzae
A Guanyl-specific hydrolase

ACDYTCGSNCYSSSDVSTAQAAGYQL
HEDGETVGSNSYPHKYNNYEGFDFSV
SSPYYEWPILSSGDVYSGGSPGADRV
VFNENNQLAGVITHTGASGNNFVECT

Amino acid sequence Structural model


PDB: 4RNT
Bioinformatics: Searching for Homologues

Homolog
Similar protein with a common ancestral sequence
• may have similar function or structure
• structural homology
• functional homology
• homology ≠ similarity!
• no “% homology”!

Ortholog
Homologous proteins in different species

Paralog
Homologous proteins in the same species
Secondary Databases (Patterns & Motifs)

Database   Primary Source    Stored Information

PROSITE    SwissProt         Regular expressions (patterns)
Profiles   SwissProt         Weighted matrices (profiles)
PRINTS     OWL (SwissProt)   Aligned motifs (fingerprints)
BLOCKS     PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY   BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
Pfam       SwissProt         Hidden Markov Models (HMM)

Databases integrating Genetic, Molecular, or Metabolic Data

Amaze              Biochemical pathways                www.ebi.ac.uk/research/pfmp/
Ecocyc / Metacyc   Metabolic pathways                  http://biocyc.org
KEGG               Metabolic pathways                  www.genome.ad.jp/kegg/
TransPath          Signal transduction pathways        http://transpath.gbf.de/
BIND               Protein interactions and complexes  www.bind.ca/
GeneNet            Gene networks                       http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/
CSNDB              Cell-signaling networks             http://geo.nihs.go.jp/csndb/
Information Retrieval Systems

• SRS – Sequence Retrieval System (at EBI, UK)


http://www.srs.ebi.ac.uk

• Entrez (at NCBI, USA)


http://www.ncbi.nlm.nih.gov/Entrez

Homework: Practice!
• Amino acids – structures and codes
http://bioinf.man.ac.uk/aacids/amino_acid.htm
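Amino Acid Classification: A Venn Diagram

[Figure: Venn diagram grouping the amino acids by the properties small, tiny, aliphatic, polar, charged (positive, negative), hydrophobic, and aromatic. Proline (P) is marked separately, and cysteine appears twice, as disulfide (C S-S) and as free thiol (C SH)]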
Sliding-Window: The Helical-Wheel Plot

Alpha-helix
3.6 residues per turn
(100 degrees / residue)

http://cti.itc.virginia.edu/~cmg/Demo/wheel/wheelApp.html
Transmembrane helices of rhodopsin ( PDB)
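
The helical-wheel projection follows directly from the geometry stated above: 3.6 residues per turn means each residue is rotated by 360° / 3.6 = 100° relative to its predecessor. A minimal sketch of that calculation, with the hypothetical helper name `helical_wheel`:

```python
DEGREES_PER_RESIDUE = 100.0  # 360 degrees / 3.6 residues per turn


def helical_wheel(sequence: str, start_angle: float = 0.0) -> list[tuple[str, float]]:
    """Return (residue, angle in degrees) for each position on the wheel."""
    return [
        (residue, (start_angle + index * DEGREES_PER_RESIDUE) % 360.0)
        for index, residue in enumerate(sequence)
    ]


if __name__ == "__main__":
    for residue, angle in helical_wheel("VDLLKN"):
        print(f"{residue} {angle:5.1f}")
```

Residues i and i+18 land on the same angle (18 × 100° = 5 full turns), and positions i+3, i+4, and i+7 lie within 60° of position i. This is why hydrophobic residues spaced 3 to 4 positions apart form one face of an amphipathic helix.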
Sliding-Window: The Hydrophobicity Plot
Detect potential transmembrane segments

Hydrophobicity plot of human Rhodopsin (AC P08100 at ExPASy),


ExPASy-Service ProtScale; window size = 9; Kyte&Doolittle hydrophobicity scale
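
A hydrophobicity plot of this kind is a sliding-window average over a per-residue scale. The sketch below uses the Kyte & Doolittle hydropathy values and the window size of 9 mentioned above. It is not the ProtScale implementation: the function name `hydrophobicity_profile` is hypothetical, windows are weighted uniformly, and only full windows are scored (both assumptions).

```python
# Kyte & Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence: str, window: int = 9) -> list[float]:
    """Mean hydropathy of every full window along the sequence.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    if window < 1 or window > len(sequence):
        return []
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # Fragment of the E. coli leader peptidase alignment row shown in these slides.
    profile = hydrophobicity_profile("TGASVFPVLAIVLIVRSFIYEPFQIPSGSM")
    for position, value in enumerate(profile, start=1):
        print(position, round(value, 2))
```

Each value is usually plotted at the center of its window. Sustained stretches of high values are candidates for transmembrane segments, and the window size trades resolution against noise.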
Sliding-Window: Secondary Structure

Chou-Fasman method
• based on analyzing the frequency of amino acids in different secondary
structures
• A, E, L, and M are strong predictors of alpha helices
• P and G are predictors of a break in a helix
• a table of predictive values is created for alpha helices, beta sheets,
and loops
• the structure with the greatest overall prediction value is used to determine
the structure (80% majority, α+β window size = 5, turn: 4 residues)

GOR method
• improves upon the Chou-Fasman method
• assumes that the amino acids surrounding the central amino acid influence
the secondary structure that the central amino acid is likely to adopt
Scoring matrices
Example of a multiple sequence alignment (ClustalW)

“Block”
SW_LEP_BACAM .......... ......MTEE Q..KPTSEKS VKRKSNTYWE WGKAIIIAVA
SW_LEPP_BACSU .......... .......... ....MTKEKV FKKKS.SILE WGKAIVIAVI
SW_LEP_ECOLI FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA
SW_LEP_SALTY FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA
SW_LEP_PSEFL FAPRRRSAIA SYQGSVSQP. D..AVVIEKL NKEPL..LVE YGKSFFPVLF
SW_LEPC_BACCL .......... .......... ....MTKQKE KRGRR..... WPWFVA..VC
SW_LEP_HAEIN VLPKRHRQVA RAEQRSGKT. ...LSEEEKA KIEPISEASE FLSSLFPVLA
SW_LEP_MYCTU AGQVFDAAPF DAAPDADSEG DSKAAKTDEP RPAKRSTLRE FAVLAVIAVV

SW_LEP_BACAM LALLIRHFLF EPYLVEGSSM YPTLH..... DGERLFVN.. ..........
SW_LEPP_BACSU LALLIRNFLF EPYVVEGKSM DPTLV..... DSERLFVN.. ..........
SW_LEP_ECOLI IVLIVRSFIY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_SALTY IVLIVRSFLY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_PSEFL IVLVLRSFLV EPFQIPSGSM KPTLD..... VGDFILVNKF SYGIRLPVID
SW_LEPC_BACCL VVATLRLFVF SNYVVEGKSM MPTLE..... SGNLLIVN.. ..........
SW_LEP_HAEIN VVFLVRSFLF EPFQIPSGSM ESTLR..... VGDFLVVNKY AYGVKDPIFQ
SW_LEP_MYCTU LYYVMLTFVA RPYLIPSESM EPTLHGCSTC VGDRIMVD.. ..........

[S,G]-x-S-M-x-[P,S] “Pattern”
Regular expression matching
Searching for Consensus Patterns in PROSITE

Query: E.coli leader peptidase


-Consensus pattern: [GS]-x-S-M-x-[PS]-[AT]-[LF]
[S is an active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 16.

-Consensus pattern: K-R-[LIVMSTA](2)-G-x-[PG]-G-[DE]-x-[LIVM]-x-[LIVMFY]


[K is an active site residue]
-Sequences known to belong to this class detected by the pattern: ALL SPases I
from prokaryotes as well as yeast IMP1, but not IMP2.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Consensus pattern: [LIVMFYW](2)-x(2)-G-D-[NH]-x(3)-[SND]-x(2)-[SG]


-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 10.

Spase_I_1 (G,S)xSMx(P,S)(A,T)(L,F)
(S)xSMx(P)(T)(L)
89: PFQIP SGSMMPTL LIGDF
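
PROSITE consensus patterns like the ones above map almost one-to-one onto regular expressions: `x` becomes `.`, a repeat count `(n)` becomes `{n}`, and the `-` separators are dropped. The sketch below covers only this subset of the syntax. The function names `prosite_to_regex` and `find_pattern` are hypothetical, and exclusions such as `{P}` or the anchors `<` and `>` are not handled.

```python
import re

# One pattern element: 'x', a single residue, or a [set], optionally
# followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residues, low, high = match.groups()
        token = "." if residues == "x" else residues
        if high is not None:
            token += f"{{{low},{high}}}"
        elif low is not None:
            token += f"{{{low}}}"
        parts.append(token)
    return "".join(parts)


def find_pattern(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Fragment of E. coli leader peptidase around the match shown above.
    fragment = "PFQIPSGSMMPTLLIGDF"
    print(prosite_to_regex("[GS]-x-S-M-x-[PS]-[AT]-[LF]"))
    print(find_pattern("[GS]-x-S-M-x-[PS]-[AT]-[LF]", fragment))
```

Start positions are reported relative to the string that is searched. To recover the position in the full protein (89 in the example above), search the complete sequence or add the offset of the fragment.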
Amino Acid Composition

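[Figure: bar chart of amino acid composition. Y axis: % (0 to 16); x axis: the 20 amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. Data series: SwissProt V 40.30; archaebacterium (Thermoplasma volcanium); E. coli K-12; P. falciparum; Homo sapiens]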
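
An amino acid composition profile is a normalized residue count. The sketch below computes it for a single sequence. The function name `amino_acid_composition` is hypothetical, and letters outside the 20 standard amino acids (such as B, Z, X) are ignored, which is an assumption rather than a rule taken from the slides.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict[str, float]:
    """Percentage of each of the 20 standard amino acids in a sequence."""
    counts = Counter(
        residue for residue in sequence.upper() if residue in STANDARD_AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in STANDARD_AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


if __name__ == "__main__":
    # Ribonuclease T1 sequence as printed earlier in these slides.
    rnase_t1 = (
        "ACDYTCGSNCYSSSDVSTAQAAGYQL"
        "HEDGETVGSNSYPHKYNNYEGFDFSV"
        "SSPYYEWPILSSGDVYSGGSPGADRV"
        "VFNENNQLAGVITHTGASGNNFVECT"
    )
    for aa, percent in amino_acid_composition(rnase_t1).items():
        print(f"{aa} {percent:5.1f}")
```

For a whole proteome or database release, the counts from all sequences would be summed before normalizing, so that long proteins contribute in proportion to their length.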
Protein Targeting Signals

[Figure: schematic of protein targeting signals; a signal peptidase cleaves the signal from the mature protein]

• e.g. secreted proteins, mitochondrial matrix proteins, chloroplast stromal proteins
• e.g. mitochondrial IMS proteins, apicoplast proteins

Known exceptions:
• e.g. some mitochondrial proteins
• some peroxisomal proteins (SKL)

http://www.rockefeller.edu/pubinfo/proteintarget.html
