Protein Prediction

Protein Secondary Structure Prediction (PSSP)
Proteins
Protein: from the Greek word PROTEUO, which means "to be first (in rank or influence)".
Why are proteins important to us:
- Proteins make up about 15% of the mass of the average person and maintain the structural integrity of the cell
- Enzymes act as biological catalysts
- Storage and transport: haemoglobin
- Antibodies
- Hormones: insulin
Introduction to proteins
Peptide Bond
Four levels of protein structure
Conformational Parameters of secondary structure of a protein

Dihedral angles / torsion angles / rotation angles
φ (phi) = N-Cα, ψ (psi) = Cα-C, ω (omega) = C-N
These values can be calculated?
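A torsion angle can be computed from the coordinates of the four consecutive atoms that define it. The following is a minimal sketch in plain Python, assuming atom positions are given as (x, y, z) tuples; the function name `dihedral` and the helper names are illustrative and not from the slides.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Components of b0 and b2 perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

For φ the four atoms would be C(i-1), N(i), Cα(i), C(i); for ψ they would be N(i), Cα(i), C(i), N(i+1).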

Ramachandran Plot: introduced by G. N. Ramachandran.
The green region indicates the sterically permitted φ and ψ values for all residues except Gly and Pro. Yellow circles represent the conformational angles of several secondary structures: α-helix, parallel and antiparallel β-sheet.
Helices
H: α-helix, G: 3₁₀ helix, I: π-helix (extremely rare)
Hydrogen bonding: residue i to i+3 (3₁₀ helix), i to i+4 (α-helix), i to i+5 (π-helix)
Secondary Structure
8 different categories (DSSP):
- H: α-helix (pitch 5.4 Å, rise 1.5 Å per residue)
- G: 3₁₀ helix
- I: π-helix (extremely rare)
- E: β-strand
- B: β-bridge
- T: β-turn
- S: bend
- L: the rest
Three secondary structure states

- Prediction methods are normally trained and assessed for only 3 states (residues): H (helix), E (strand) and L (coil)
- There are many published 8-to-3 state reduction methods
- Standard reduction methods are defined by the programs DSSP (Dictionary of Secondary Structure of Proteins), STRIDE, and DEFINE
- The improvement in predictive accuracy of different SSP (Secondary Structure Prediction) programs depends on the choice of the reduction method
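As an illustration of an 8-to-3 reduction, the sketch below maps the eight DSSP categories to H, E and L. The particular mapping (H, G, I to H; E, B to E; everything else to L) is only one of the published choices and is an assumption here, as is the function name.

```python
# One possible 8-to-3 reduction (an assumption; other schemes treat G, I or B differently)
REDUCTION = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "L", "S": "L", "L": "L",   # turns, bends and the rest
}

def reduce_8_to_3(dssp_string):
    """Map a string of 8-state DSSP codes to 3 states; unknown codes become L."""
    return "".join(REDUCTION.get(code, "L") for code in dssp_string)
```

Because reported accuracy depends on the reduction, the same mapping should be used for both training and assessment.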
Protein Secondary Structure Prediction

Techniques for the prediction of protein secondary structure provide information that is useful both in ab initio structure prediction and as an additional constraint for fold-recognition algorithms. Knowledge of secondary structure alone can help the design of site-directed or deletion mutants that will not destroy the native protein structure. For all these applications it is essential that the secondary structure prediction be accurate, or at least that the reliability for each residue can be assessed.
If a protein sequence shows clear similarity to a protein of known three dimensional structure, then the most accurate method of predicting the secondary structure is to align the sequences by standard dynamic programming algorithms, as the homology modelling is much more accurate than secondary structure prediction for high levels of sequence identity. Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable. It is important that there is no detectable sequence similarity between sequences used to train and test secondary structure prediction methods.
Protein Secondary Structure

Secondary Structure
Regular Secondary Structure (α-helices, β-sheets)
Irregular Secondary Structure (tight turns, random coils, bulges)
PSSP Algorithms
There are three generations of PSSP algorithms:
- Early/First Generation: based on statistical/rule-based information on single amino acids
- Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids
- Third Generation: based on the use of windows on evolutionary information
PSSP: First Generation

First generation PSSP systems are based on statistical information on a single amino acid. The most relevant algorithms:
- Chou-Fasman, 1974 (statistics based)
- GOR, 1978 (rule based)

Both algorithms claimed 74-78% predictive accuracy, but when tested with better constructed datasets they were shown to have a predictive accuracy of ~50% (Nishikawa, 1983).
PSSP: Second Generation

Based on the information contained in a window of amino acids (11-21 aa.). Most systems use algorithms based on:
- Statistical information
- Physico-chemical properties
- Sequence patterns
- Multi-layered neural networks
- Graph theory
- Multivariate statistics
- Expert rules
- Nearest-neighbour algorithms
- No Bayesian networks
Main problems:
- Prediction accuracy <70%
- Prediction accuracy for β-strand 28-48%
- Predicted chains are usually too short, which makes the predictions difficult to use
PSSP: Third Generation

PHD: first algorithm in this generation (1994)
Evolutionary information improves the prediction accuracy to 72%
Use of evolutionary information:
1. Scan a database of known sequences with alignment methods to find similar sequences
2. Filter the resulting list with a threshold to identify the most significant sequences
3. Build amino acid exchange profiles based on the probable homologs (most significant sequences)
4. Use the profiles in the prediction, i.e. in building the classifier
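Step 3 can be pictured as counting, column by column, how often each amino acid occurs in an alignment of the probable homologs. The sketch below is a simplified illustration under that assumption: it builds plain frequency profiles without sequence weighting or pseudocounts, and the names `build_profile` and `AMINO_ACIDS` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment):
    """Return one {amino acid: frequency} dict per alignment column.

    `alignment` is a list of equal-length aligned sequences; gap characters
    and any other non-standard symbols are ignored when counting.
    """
    length = len(alignment[0])
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append({aa: (counts[aa] / total if total else 0.0)
                        for aa in AMINO_ACIDS})
    return profile
```

A third-generation predictor would feed these 20 frequencies per position into the classifier instead of a single residue identity.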
Many of the second generation algorithms have been updated to third generation.
The most important algorithms of today:
- Predator: nearest-neighbour
- PSI-Pred: neural networks
- SSPro: neural networks
- SAM-T02: homologs (Hidden Markov Models)
- PHD: neural networks
Due to the improvement of protein information in databases, i.e. better evolutionary information, today's predictive accuracy is ~80%.
It is believed that the maximum reachable accuracy is 88%.
First Generation PSSP

Two classical methods that use previously determined propensities:
- Chou-Fasman
- Garnier-Osguthorpe-Robson (GOR)
Chou-Fasman method
Uses a table of conformational parameters (propensities) determined primarily from measurements of secondary structure.
P = (frequency of amino acid X observed in element Y) / (frequency of element Y in database)
Designations: H = Strong Former, h = Former, I = Weak Former, i = Indifferent, B = Strong Breaker, b = Breaker; P = Conformational Parameter
The Chou-Fasman method
If you were asked to determine whether an amino acid in a protein of interest is part of an α-helix or β-sheet, you might think to look in a protein database and see which secondary structures amino acids in similar contexts belonged to.
The Chou-Fasman method (1974) is a combination of such statistics-based methods and rule-based methods.
Steps of the Chou-Fasman algorithm:

1. Calculate propensities from a set of solved structures. For all 20 amino acids i, calculate these propensities by:
P(i | Helix) / P(i)
P(i | Beta) / P(i)
P(i | Turn) / P(i)
Propensities > 1 mean that the residue type i is likely to be found in the corresponding secondary structure type.
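The ratio in step 1 can be computed directly from a set of sequences with known per-residue assignments. The sketch below assumes the data are given as parallel strings (sequence and secondary-structure string of equal length); the function name and the toy data in the test are illustrative only.

```python
from collections import Counter

def propensities(sequences, structures, state):
    """Return P(i | state) / P(i) for every residue type i seen in the data."""
    total = Counter()      # occurrences of each residue type overall
    in_state = Counter()   # occurrences of each residue type in `state`
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state %r does not occur in the data" % state)
    return {aa: (in_state[aa] / n_state) / (total[aa] / n_all) for aa in total}
```

In this toy convention a value of 2.0 means the residue is twice as frequent inside the element as in the database as a whole.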
Chou and Fasman

Amino Acid   α-Helix   β-Sheet   Turn
Ala          1.29      0.90      0.78
Cys          1.11      0.74      0.80
Leu          1.30      1.02      0.59
Met          1.47      0.97      0.39
Glu          1.44      0.75      1.00
Gln          1.27      0.80      0.97
His          1.22      1.08      0.69
Lys          1.23      0.77      0.96
Val          0.91      1.49      0.47
Ile          0.97      1.45      0.51
Phe          1.07      1.32      0.58
Tyr          0.72      1.25      1.05
Trp          0.99      1.14      0.75
Thr          0.82      1.21      1.03
Gly          0.56      0.92      1.64
Ser          0.82      0.95      1.33
Asp          1.04      0.72      1.41
Asn          0.90      0.76      1.23
Pro          0.52      0.64      1.91
Arg          0.96      0.99      0.88
Legend (colour coding in the original table): favors α-helix, favors β-strand, favors turn
Chou and Fasman

Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1
- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker)
- if average P(α) over whole region is > 1, it is predicted to be helical
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1
- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker)
- if average P(β) over whole region is > 1, it is predicted to be a strand
2. Once the propensities are calculated, each amino acid is categorized using the propensities as one of: helix-former, helix-breaker, or helix-indifferent. (That is, helix-formers have high helical propensities, helix-breakers have low helical propensities, and helix-indifferent residues have intermediate propensities.)
Each amino acid is also categorized as one of: sheet-former, sheet-breaker, or sheet-indifferent. For example, it was found (as expected) that glycine and proline are helix-breakers.
3. Find nucleation sites.

These are short subsequences with a high concentration of helix-formers (or sheet-formers).
These sites are found with some heuristic rule (e.g. "a sequence of 6 amino acids with at least 4 helix-formers, and no helix-breakers").
4. Extend the nucleation sites, adding residues at the ends, maintaining an average propensity greater than some threshold.
5. Step 4 may create overlaps; finally, we deal with these overlaps using some heuristic rules.
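The nucleation and extension steps for helices can be sketched as follows. This is a simplified illustration, not the published procedure: it uses the α-helix column of the Chou and Fasman table given earlier, applies the "4 out of 6" and "average of 4 residues" rules, and skips both the helix-breaker check and the overlap resolution of step 5. All names are hypothetical.

```python
# Helix propensities P(alpha), one-letter codes, taken from the table in this section
P_HELIX = {"A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44, "Q": 1.27,
           "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97, "F": 1.07, "Y": 0.72,
           "W": 0.99, "T": 0.82, "G": 0.56, "S": 0.82, "D": 1.04, "N": 0.90,
           "P": 0.52, "R": 0.96}

def mean(values):
    return sum(values) / len(values)

def predict_helices(seq, table=P_HELIX, nuc_len=6, nuc_min=4, ext_len=4):
    """Return a string with 'H' for predicted helical residues and '-' otherwise."""
    p = [table[aa] for aa in seq]
    n = len(p)
    helical = [False] * n
    for start in range(n - nuc_len + 1):
        # Nucleation: at least nuc_min of nuc_len residues with P(alpha) > 1
        if sum(v > 1.0 for v in p[start:start + nuc_len]) < nuc_min:
            continue
        left, right = start, start + nuc_len  # half-open region [left, right)
        # Extend to the right while the 4-residue window ending at the new residue averages >= 1
        while right < n and mean(p[right - ext_len + 1:right + 1]) >= 1.0:
            right += 1
        # Extend to the left while the 4-residue window starting at the new residue averages >= 1
        while left > 0 and mean(p[left - 1:left - 1 + ext_len]) >= 1.0:
            left -= 1
        # Accept the region only if its overall average exceeds 1
        if mean(p[left:right]) > 1.0:
            for k in range(left, right):
                helical[k] = True
    return "".join("H" if h else "-" for h in helical)
```

The same skeleton with P(β), a 5-residue nucleation window and a minimum of 3 formers would cover the strand rule.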
The GOR method (Garnier, Osguthorpe, Robson)

Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered.
A helix propensity table contains information about the propensity for residues at 17 positions when the conformation of residue j is helical. The helix propensity tables have 20 x 17 entries. Similar tables are built for strands and turns. GOR simplification: the predicted state of AAj is calculated as the sum of the position-dependent propensities of all residues around AAj.
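The GOR simplification amounts to summing, for each candidate state, the table entries of the 17 residues in the window and taking the state with the highest sum. The sketch below assumes each table is a dictionary keyed by (offset, amino acid), with missing entries counted as zero; the table layout, the function name and the toy values in the test are assumptions for illustration, not real GOR parameters.

```python
def gor_predict(seq, tables, half_window=8):
    """Predict one state per residue by summing position-dependent propensities.

    `tables[state][(offset, amino_acid)]` is the propensity contribution of
    `amino_acid` found at position j + offset to residue j being in `state`.
    """
    states = sorted(tables)
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                k = j + offset
                if 0 <= k < len(seq):  # positions beyond the chain ends contribute nothing
                    total += tables[state].get((offset, seq[k]), 0.0)
            scores[state] = total
        prediction.append(max(states, key=lambda s: scores[s]))
    return "".join(prediction)
```

With a half-window of 8, each state's table has 17 offsets times 20 amino acids, matching the 20 x 17 entries described above.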
GOR can be used at : http://abs.cit.nih.gov/gor/ (current version is GOR IV)
Suppose aj is the amino acid that we are trying to categorize. GOR looks at the residues a(j-8), ..., a(j+8), i.e. the eight residues on either side of aj.
Intuitively, it assigns a structure based on probabilities it has calculated from protein databases. These probabilities are of the form
Accuracy
Both Chou and Fasman and GOR have been assessed and their accuracy is estimated to be Q3=60-65%.
(initially, higher scores were reported, but the experiments set to measure Q3 were flawed, as the test cases included proteins used to derive the propensities!)
Nearest Neighbour Method

This method predicts the structure of the central residue in each window from knowledge of its neighbours. It is hence called a memory-based (homology-based) method.
Steps
- Computes MS Algorithm
- Computes the distance between homologous sections
Example: NNSSP
[Figure: Modelling scheme of the nearest-neighbour method. Labels: SDV, hyperplanes, DS manipulation; training data; NN tree (binary search tree); test data; validation; actual NN tree with data; new data; prediction]
Neural Networks
Single sequence methods - train network using sets of known proteins of certain types (all alpha, all beta, alpha+beta) then use to predict for query sequence
NNPREDICT (>70% accuracy)
NEURAL NETWORKS
- Inspired by the brain
- Traditional computers struggle to recognize and generalize patterns of the past for future actions
- The brain as an information processing system contains 10 billion nerve cells (neurons), and each neuron is connected to other neurons through about 10,000 synapses
Brain
Interconnected network of neurons that collect, process and disseminate electrical signals via synapses

Neural Network
Interconnected network of units (or nodes) that collect, process and disseminate values via links
Neurons correspond to nodes; synapses correspond to links.
Scheme for modeling Neural Network:

- Building a random network
- Training the net on a training set
Building a random network

- random selection of the type of node
- random selection of the parameters of the node
- random selection of the number of the inputs
- connecting the inputs and outputs until the net is larger
- running the training set over the net
- selecting the proper output
- removal of all nodes which do not contribute to the output
Training the network

The general idea is to run the net on the training set; every time it gives a right answer, leave it as it is or strengthen the path, and every time it makes a mistake, penalize it. This can be done in several ways that incrementally modify the net:
- alter the parameters
- add/delete connections
- add/delete the nodes with their connections
Typical methodology used to train a feed-forward network for secondary structure prediction is based on Qian and Sejnowski, 1988
Alanine: 100000000 Helix : 100000000
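The bit strings above are one-hot codes: each amino acid (and each output class) switches on exactly one unit. The sketch below encodes a sequence window in that style. It assumes 21 units per position (20 amino acids plus a spacer symbol for positions beyond the chain ends) and a window of 17; the alphabet order and function names are illustrative choices.

```python
# 20 amino acids plus "-" as a spacer for window positions outside the chain (assumed layout)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot(symbol, alphabet=ALPHABET):
    """Return a list with a single 1 at the index of `symbol` (all zeros if unknown)."""
    return [1 if symbol == a else 0 for a in alphabet]

def encode_window(seq, center, window=17):
    """Concatenate the one-hot codes of the residues in a window around `center`."""
    half = window // 2
    vector = []
    for k in range(center - half, center + half + 1):
        symbol = seq[k] if 0 <= k < len(seq) else "-"
        vector.extend(one_hot(symbol))
    return vector
```

The target for the central residue would be encoded the same way over the three classes, for example helix as 1 0 0.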
Different Network Topologies

Single layer feed-forward networks
Input layer projecting into the output layer

[Figure: single layer network with input layer and output layer]

Multi-layer feed-forward networks
One or more hidden layers. Input projects only from previous layers onto a layer.
2-layer (1-hidden-layer) fully connected network
[Figure: input layer, hidden layer, output layer]

Recurrent networks
A network with feedback, where some of its inputs are connected to some of its outputs (discrete time).
[Figure: recurrent network with input layer and output layer]
Features
A typical training set consists of 100 nonhomologous protein chains (15,000 training patterns)
A net with an input window of 17, five hidden nodes in a single hidden layer and three outputs will have 357 input nodes and 1,808 weights.
Predictions are made on a winner-takes-all basis
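The figures of 357 input nodes and 1,808 weights can be reproduced with a short calculation, assuming 21 input units per window position (20 amino acids plus a spacer) and counting one bias term per hidden and output node. Both assumptions are inferred from the numbers and are not stated on the slide.

```python
def count_parameters(window, units_per_position, hidden, outputs):
    """Return (input nodes, adjustable parameters) for a single-hidden-layer net."""
    inputs = window * units_per_position
    weights = inputs * hidden + hidden * outputs   # input->hidden and hidden->output links
    biases = hidden + outputs                      # one bias per hidden and output node
    return inputs, weights + biases
```

With a window of 17, five hidden nodes and three outputs this gives 357 inputs and 1,785 + 15 + 8 = 1,808 parameters.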
PHD (Profile network from HeiDelberg) > 70%

A program with several cascading neural networks. The method employs a two-layered feed-forward neural network
- Sequence to structure (first layer)
- Structure to structure (second layer)

The window length is 13, and for every position in the window the frequencies of the 20 amino acids are calculated (20 x 13). Based on this, the output is a probability for each of the three possible classes (H, E and C).
Evaluating Prediction Efficiency

- Jackknife test
- Percentage of correctly classified residues
- Correlation coefficient for each target class
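Two of these measures can be sketched in a few lines. The code assumes the percentage of correctly classified residues is the usual three-state accuracy Q3 and that the per-class correlation coefficient is the Matthews correlation; the slide does not give the formulas, so both are assumptions, as are the function names.

```python
import math

def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def matthews(predicted, observed, state):
    """Matthews correlation coefficient for one target class (e.g. 'H')."""
    tp = sum(p == state and o == state for p, o in zip(predicted, observed))
    tn = sum(p != state and o != state for p, o in zip(predicted, observed))
    fp = sum(p == state and o != state for p, o in zip(predicted, observed))
    fn = sum(p != state and o == state for p, o in zip(predicted, observed))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a jackknife test these measures would be computed on each protein while it is held out of the training set.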
Prediction of Transmembrane helices
Two Ways: One is solely based on the construction principles of proteins associated with physico-chemical properties of amino acids. No concept of training is involved.
The other is to collect data sets with known structures, extract features and use machine learning algorithms for predictions.
Outline
1. Importance of Transmembrane Proteins
2. General Topologies
3. Methods (and challenges) for Structural Studies of TM Proteins
Eukaryotic cells have many membranes
Transmembrane Proteins
Cellular roles include:
- Communication between cells
- Communication between organelles and cytosol
- Ion transport, nutrient transport
- Links to extracellular matrix
- Receptors for viruses
- Connections for cytoskeleton
Over 25% of proteins in complete genomes.
Key roles in diabetes, hypertension, depression, arthritis, cancer, and many other common diseases.
Targets for over 75% of pharmaceuticals.
However, very few TM protein structures have been solved!
Biological Membrane = Lipid Bilayer
- Approximately 30 Å thick
- Hydrophobic core + hydrophilic or charged headgroups
- Mixture of lipids that vary in type of head groups, lengths of acyl chains, number of double bonds
- (Some membranes also contain cholesterol)
Membrane Bilayer with Proteins
In order to be stable in this environment, a polypeptide chain needs to (1) contain a lot of amino acids with hydrophobic sidechains, and (2) fold up to satisfy backbone H-bond propensity - How?
Structure Solution #1: Hydrophobic alpha-helix

- Satisfies polypeptide backbone hydrogen bonding
- Hydrophobic sidechains face outward into lipids
Examples of Helix Bundle TM Proteins
PDB = 1QHJ
PDB = 1RRC
Single helix or helical bundles (> 90% of TM proteins)
Examples:
- Human growth hormone receptor, insulin receptor
- ATP binding cassette family: CFTR
- Multidrug resistance proteins
- 7TM receptors: G protein-linked receptors
Structure Solution #2: Beta-barrel

- Beta sheet satisfies backbone hydrogen bonds between strands
- Wrap sheet around into barrel shape
- Sidechains on the outside of the barrel are hydrophobic
Examples of Beta Barrel TM Proteins
PDB = 1EK9
PDB = 2POR
Beta barrels: found in the outer membrane of gram-negative bacteria, and in some nonconstitutive membrane-acting toxins
Examples: Porins
General Topologies of TM Proteins
- Single helix or helical bundles, and beta barrels
- Both topologies result in hydrophobic surfaces facing acyl chains of lipids
- The part protruding from the membrane can be a very short sequence (a few amino acids), a loop, or large, independently folding domains
Presence of Hydrophobic TM Domain can result in:

- Low levels of expression
- Difficulties in solubilization
- Difficulties in crystallization
Attempting crystallization and structure solution of transmembrane proteins is considered difficult and risky.
Difficult and risky, but still possible: TM Proteins of Known Structure

- Bacteriorhodopsin, Rhodopsin
- Photosynthetic reaction centers
- Porins
- Light harvesting complexes
- Potassium channels
- Chloride channels
- Aquaporin
- Transporters
- Etc.
Although few in number, each of these structures has been important for addressing key functions.
Great summary and resource: http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html
Protein Folding Problem

How does a one-dimensional amino acid sequence determine a specific three-dimensional structure? Or How can we read the sequence and predict that structure?
General Idea
We know what an alpha-helix or a beta strand looks like, so
(1) figure out which parts of the sequence are helices and which parts are strands
(2) figure out how they pack together
For soluble proteins, neither is well predicted. But for transmembrane proteins ...
TM Protein Structure Prediction, Step #1

For alpha-helical transmembrane proteins, hydropathy plot analysis provides a fairly accurate method to predict which amino acids form membrane-spanning helices
We can model the structure of an individual alpha helix fairly accurately.
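A hydropathy plot is a sliding-window average of per-residue hydrophobicity along the sequence. The sketch below assumes the Kyte-Doolittle scale, a 19-residue window and a cutoff of 1.6 for candidate membrane-spanning positions; the slides do not name a scale, window or threshold, so these are common choices stated here as assumptions, and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not specified in the slides)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydropathy_profile(seq, window=19):
    """Average hydropathy of each full window, indexed from the first window centre."""
    half = window // 2
    return [sum(KD[aa] for aa in seq[i - half:i + half + 1]) / window
            for i in range(half, len(seq) - half)]

def tm_candidates(seq, window=19, threshold=1.6):
    """Sequence positions whose window average exceeds the threshold."""
    half = window // 2
    profile = hydropathy_profile(seq, window)
    return [i for i, h in zip(range(half, len(seq) - half), profile) if h > threshold]
```

Runs of consecutive candidate positions roughly 20 residues long are the usual signature of a membrane-spanning helix.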
TM Protein Structure Prediction, Step #2

How do the helices pack in the membrane?
There are several labs studying known protein structures to identify factors involved in determining how transmembrane helices pack together (specificity of interaction and packing motifs):
- Hydrogen bonds
- Hydrophobicity
- Amino acids known to face the lumen of a channel
- Multiple sequence alignments
- Helix packing sequence motifs, etc.
These kinds of information are then combined with protein docking and energy minimization programs to predict how the helices pack together.
It is quite possible that studies of helical transmembrane proteins could lead to key information about the protein folding problem: how to predict protein structure from amino acid sequence.
Prediction using Machine learning techniques
Collect data sets with known structures, then build the model:
- Align using ClustalW/hmmalign
- hmmbuild
- hmmsearch
- Calculate threshold
- Score distribution and training data (NN)
- Validation
Summary
- Transmembrane proteins play many important roles in cellular processes in both health and disease
- Two general types of tertiary structure are found to cross the membranes: beta-barrels and alpha-helices
- Structural studies of TM proteins are impeded by difficulties in overexpression, purification and crystallization
- However, the few dozen structures that have been determined have provided key information about channels (gating, selectivity, etc.), energetics, transport, and other transmembrane processes
- Analysis of helical transmembrane protein structures may lead to accurate predictions of protein structure from amino acid sequence for this type of protein
Prediction of protein conformations from protein sequences (3D Prediction)
Protein Conformations
Predict protein 3D structure from (amino acid) sequence
Sequence → secondary structure → 3D structure → function
Protein 3D Structure Detection

X-ray crystallography, NMR
Expensive, slow
Protein Structure
- Protein 3D structure → biological function
- Lock & key model of enzyme function (docking)
- Folding problem: protein sequence → 3D structure
- Structure prediction and alignment
- Protein design, drug design, etc.
- The holy grail of bioinformatics
The Prediction Problem
Can we predict the final 3D protein structure knowing only its amino acid sequence?
- Studied for 4 decades
- Primary motivation for bioinformatics
- Based on this 1-to-1 mapping of sequence to structure
- Still very much an OPEN PROBLEM
Predicting Protein Structure

Goal: find the best fit of sequence to 3D structure
- Comparative (homology) modeling: construct a 3D model from alignment to protein sequences with known structure
- Threading (fold recognition): pick the best fit to sequences of known 2D / 3D structures (folds)
- Ab initio / de novo methods: attempt to calculate the 3D structure from scratch
  - Molecular dynamics
  - Energy minimization
  - Lattice models
PSP: Goals
- Accurate 3D structures. But not there yet.
- Good guesses: working models for researchers
- Understand the FOLDING PROCESS: get into the black box
- Only hope for some proteins: 25% won't crystallize, too big for NMR
- Best hope for novel protein engineering: drug design, etc.
PSP: Major Hurdles

- Energetics
  - We don't know all the forces involved in detail
  - Too computationally expensive BY FAR!
- Conformational search impossibly large
  - 100 a.a. protein, 2 moving dihedrals, 2 possible positions for each dihedral: 2^200 conformations!
- Levinthal's Paradox
  - Longer than time of universe to search
  - Proteins fold in a couple of seconds??
- Multiple-minima problem
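The size of the conformational search can be checked with a few lines of arithmetic. The sampling rate of 10^13 conformations per second used below is a hypothetical figure chosen only to illustrate the scale; it does not come from the slides.

```python
def conformations(residues, dihedrals_per_residue, states_per_dihedral):
    """Number of backbone conformations if every dihedral is chosen independently."""
    return states_per_dihedral ** (residues * dihedrals_per_residue)

def search_time_years(n_conformations, rate_per_second=1e13):
    """Years needed to visit every conformation at an assumed sampling rate."""
    seconds_per_year = 3600 * 24 * 365
    return n_conformations / rate_per_second / seconds_per_year
```

For 100 residues with 2 dihedrals and 2 positions each, the count is 2^200 (about 1.6 x 10^60), so an exhaustive search at the assumed rate would take far longer than the age of the universe (roughly 1.4 x 10^10 years).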
Tertiary Structure Prediction

Major Techniques
- Comparative Modeling
  - Homology Modeling
  - Threading
- Template-Free Modeling
  - De novo/ab initio Methods
    - Physics-Based
    - Knowledge-Based
Homology Modeling
Steps
- Template selection
- Target-template alignment
- Model building
- Evaluation
Repeated until a satisfactory model structure is achieved
Threading
- a library of protein folds (templates)
- a scoring function to measure the fitness of a sequence -> structure alignment
- a search technique for finding the best alignment between a fixed sequence and structure
- a means of choosing the best fold from among the best scoring alignments of a sequence to all possible folds
ab initio Methods
The ab initio approach (Figure 6.25) ignores sequence homology and attempts to predict the folded state from fundamental energetics or physicochemical properties associated with the constituent residues. This involves modelling physicochemical parameters in terms of force fields that direct the folding. These constraints will reflect the energetics associated with charge, hydrophobicity and polarity, with the aim being to find a single structure of low energy.
How to define the energy of a PROTEIN? How to find the conformations for which the energy is minimum?
This approach is based on the thermodynamic argument that the native structure of a protein is the global minimum in the free energy profile. Generally the results are expressed as rmsd (root mean square deviation), reflecting the difference in positions between corresponding atoms in the experimental and calculated (predicted) structures.
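The rmsd between two structures can be sketched as below, assuming both are given as equally long lists of (x, y, z) coordinates for corresponding atoms and that they have already been superimposed. Optimal superposition is a separate step that is not shown, and the function name is illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between corresponding atoms of two structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and have the same number of atoms")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

A value of 0 means the predicted and experimental atom positions coincide; larger values mean a poorer prediction.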
What makes the global minimum?

1) Semi-empirical potential function, which is calculated as the sum of all the possible pairwise interactions between the atoms in the molecule. E.g.: AMBER force field. It evaluates the N(N-1)/2 atom-atom interactions, which requires computational time of order N². Most of the computational methods look for the global minimum??
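The quadratic cost comes from visiting every pair of atoms once. The sketch below sums a Lennard-Jones term over all N(N-1)/2 pairs; the Lennard-Jones form and its unit parameters are a stand-in chosen for brevity and are not the AMBER functional form, which contains further bonded and electrostatic terms.

```python
import math
from itertools import combinations

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy at distance r (illustrative parameters)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def total_energy(coords):
    """Sum the pair energy over all N(N-1)/2 atom pairs; also return the pair count."""
    pairs = list(combinations(coords, 2))
    energy = sum(lennard_jones(math.dist(a, b)) for a, b in pairs)
    return energy, len(pairs)
```

Doubling the number of atoms roughly quadruples the number of pairs, which is why all-atom energy evaluation becomes expensive for large proteins.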
[Figure: Schematic diagram of different types of energy functions. Energy is plotted against structure, with the global minimum marked; semi-empirical potential function with N(N-1)/2 atom-atom interactions, e.g. AMBER force fields]
To overcome this, reduce the resolution at which the potential function is calculated.
Instead of an atom-atom potential, a united-atom potential would be an approximation. This approximation is also called a pseudo-atom.
2) UNRES (united-residue) force field

UNRES was originally designed and parameterized to locate native-like structures of proteins as the lowest in potential energy by unrestricted global optimization.
Propensities of amino acids are calculated. The backbone is represented as a sequence of α-carbon (Cα) atoms linked by virtual bonds designated as dC, with united peptide groups (ps) in their centers.
Molecular Dynamics
Computation of the dynamics or motion of a protein.
1) The physical forces which influence the folding process are well represented by the semi-empirical force fields. 2) The atoms of the protein move independently under the influence of the force fields. The motion is calculated with Newton's laws of motion: F = ma (evaluated at each femtosecond time step).
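Applying F = ma at every time step means integrating the equations of motion numerically. The sketch below uses the velocity Verlet scheme on a single particle in one dimension; the choice of integrator, the harmonic test force and the unit system are assumptions for illustration, since the slides only state Newton's law.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate F = m*a for one particle in 1D; return the trajectory and final velocity."""
    a = force(x) / mass
    trajectory = [x]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance the position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance the velocity with the mean acceleration
        a = a_new
        trajectory.append(x)
    return trajectory, v
```

In a protein simulation `x` and `v` would be arrays over all atoms, the force would come from the force field, and `dt` would be of the order of a femtosecond.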
Motion influenced by temperature??

Simulated annealing:
At higher temperatures the motion is greater in a shorter period of time.

1) An initial energy minimum of the structure is calculated first. 2) The temperature is raised to 3000 K for a few femtoseconds and then gradually lowered to room temperature (300 K). The idea is that even if the prediction starts from a wrong structure (at 3000 K), the final corrections will lead to the correct conformation.
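The effect of cooling can be illustrated on a toy energy function with two minima. The sketch below uses Metropolis Monte Carlo moves rather than molecular dynamics, cools linearly from 3000 to 300 as in the text, and uses an arbitrary constant `k` to convert temperature into energy units. The energy function, step size, number of steps and all names are hypothetical.

```python
import math
import random

def energy(x):
    """Toy 1D energy with a local minimum near x = 1.1 and a lower one near x = -1.3."""
    return x ** 4 - 3 * x ** 2 + x

def anneal(x0, t_start=3000.0, t_end=300.0, steps=5000, k=1e-3,
           step_size=0.5, seed=0):
    """Metropolis annealing with linear cooling; returns the best (x, energy) seen."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for i in range(steps):
        t = t_start + (t_end - t_start) * i / (steps - 1)   # linear cooling schedule
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / (k * t)):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

At high temperature uphill moves are accepted often, which lets the search leave a wrong starting minimum; as the temperature falls the search settles into a low-energy region.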
[Figure: Molecular dynamics as an optimization tool: trajectory on a plot of energy against conformational space]
Other such simulations are:

- Monte Carlo simulations
- Conformational space annealing
- Rosetta algorithm
- CASP (Critical Assessment of Structure Prediction)
- Levinthal paradox, etc.
ROSETTA ALGORITHM
- Break target sequence into fragments of 9 amino acids
- Create profile, X, for target
- Create profile, S, for similar PDB sequences
- Align profiles X, S to get best match fragment
- Use fragments as starting point for optimisation, using:
  - hydrophobic burial
  - polar side-chain interactions
  - hydrogen bonding between beta-strands
  - hard sphere repulsion (van der Waals)
- Create 1000 structures, and choose cluster centre as the best prediction
