
Protein Secondary Structure Prediction (PSSP)

Proteins
Protein: from the Greek word PROTEUO, which means "to be first (in rank or influence)"

Why are proteins important to us:
- Proteins make up about 15% of the mass of the average person and maintain the structural integrity of the cell
- Enzymes act as biological catalysts
- Storage and transport: haemoglobin
- Antibodies
- Hormones: insulin

Introduction to proteins

Peptide Bond

Four levels of protein structure

Conformational Parameters of secondary structure of a protein


Dihedral Angles/torsion angles/rotation angles
φ (phi) = rotation about the N-Cα bond
ψ (psi) = rotation about the Cα-C bond
ω (omega) = rotation about the C-N (peptide) bond

These values can be calculated?
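Yes: each torsion angle is the dihedral defined by four consecutive backbone atoms. The sketch below computes a signed dihedral from four 3D points in pure Python. The function name `dihedral` and the tuple-based coordinate format are assumptions made for illustration; for φ the four atoms would be C(i-1), N(i), Cα(i), C(i), for ψ they would be N(i), Cα(i), C(i), N(i+1), and for ω they would be Cα(i), C(i), N(i+1), Cα(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = _scale(b1, 1.0 / norm)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

With atom coordinates read from a structure file, calling `dihedral` once per residue for φ and once for ψ gives the points that are drawn on a Ramachandran plot.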


Ramachandran Plot: Introduced by G. N. Ramachandran.

Green region indicates the sterically permitted φ and ψ values for all residues except Gly and Pro. Yellow circles represent the conformational angles of several secondary structures: α-helix, parallel and antiparallel β-sheet.

Helices

Helices
H: α-helix
G: 3₁₀ helix
I: π-helix (extremely rare)

Hydrogen bonding: residue i to i+3 (3₁₀ helix), i to i+4 (α-helix), i to i+5 (π-helix)

Secondary Structure
8 different categories (DSSP):
- H: α-helix (pitch 5.4 Å, rise per residue 1.5 Å)
- G: 3₁₀ helix
- I: π-helix (extremely rare)
- E: β-strand
- B: β-bridge
- T: turn
- S: bend
- L: the rest

Three secondary structure states


- Prediction methods are normally trained and assessed for only 3 states per residue: H (helix), E (strand) and L (coil)
- There are many published 8-to-3 state reduction methods
- Standard reduction methods are defined by the programs DSSP (Dictionary of Secondary Structure of Proteins), STRIDE, and DEFINE
- The measured predictive accuracy of different SSP (Secondary Structure Prediction) programs depends on the choice of the reduction method
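As an illustration of an 8-to-3 reduction, the sketch below applies one commonly used mapping: H, G, I become H; E and B become E; everything else becomes L. This particular mapping is an assumption chosen for the example; other published schemes (for instance mapping only H to H and only E to E) give different three-state strings and therefore different accuracy figures.

```python
# One possible 8-to-3 reduction; states not listed fall through to coil (L).
REDUCTION = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(states):
    """Map a string of 8-state DSSP codes to the 3 states H, E, L."""
    return "".join(REDUCTION.get(s, "L") for s in states)
```

For example, `reduce_dssp("HGIEBTSL")` returns `"HHHEELLL"`.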

Protein Secondary Structure Prediction


Techniques for the prediction of protein secondary structure provide information that is useful both in ab initio structure prediction and as an additional constraint for fold-recognition algorithms. Knowledge of secondary structure alone can help the design of site-directed or deletion mutants that will not destroy the native protein structure. For all these applications it is essential that the secondary structure prediction be accurate, or at least that the reliability for each residue can be assessed.

Protein Secondary Structure Prediction


If a protein sequence shows clear similarity to a protein of known three-dimensional structure, then the most accurate method of predicting the secondary structure is to align the sequences by standard dynamic programming algorithms, because homology modelling is much more accurate than secondary structure prediction for high levels of sequence identity. Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable. It is important that there is no detectable sequence similarity between the sequences used to train and to test secondary structure prediction methods.

Protein Secondary Structure


Secondary Structure

Regular Secondary Structure (α-helices, β-sheets)

Irregular Secondary Structure (Tight turns, Random coils, bulges)

PSSP Algorithms
There are three generations of PSSP algorithms:
- Early/First Generation: based on statistical/rule-based information on single amino acids
- Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids
- Third Generation: based on the use of windows on evolutionary information

PSSP: First Generation


First generation PSSP systems are based on statistical information on single amino acids. The most relevant algorithms:

- Chou-Fasman, 1974 (statistics based)
- GOR, 1978 (rule based)

Both algorithms originally claimed 74-78% predictive accuracy, but when tested with better constructed datasets they were shown to have a predictive accuracy of ~50% (Nishikawa, 1983)

PSSP: Second Generation


Based on the information contained in a window of amino acids (11-21 aa). Most systems use algorithms based on:

- Statistical information
- Physico-chemical properties
- Sequence patterns
- Multi-layered neural networks
- Graph theory
- Multivariate statistics
- Expert rules
- Nearest-neighbour algorithms
- No Bayesian networks

PSSP: Second Generation


Main problems:

- Prediction accuracy < 70%
- Prediction accuracy for β-strands 28-48%
- Predicted segments are usually too short, which makes the predictions difficult to use

PSSP: Third Generation


PHD: First algorithm in this generation (1994)
Evolutionary information improves the prediction accuracy to 72%

Use of evolutionary information:
1. Scan a database of known sequences with alignment methods to find similar sequences
2. Filter the resulting list with a threshold to identify the most significant sequences
3. Build amino acid exchange profiles based on the probable homologs (most significant sequences)
4. The profiles are used in the prediction, i.e. in building the classifier
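Step 3 can be sketched as a per-column frequency count over the aligned homologs. The code below is a minimal illustration, not the profile construction used by PHD: the function name `build_profile`, the gap character `-`, and the unweighted counting are all assumptions (real methods typically weight sequences and add pseudocounts).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment):
    """Return one dict per alignment column with amino acid frequencies.

    `alignment` is a list of equal-length aligned sequences; gaps and
    unknown symbols are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append({aa: (counts[aa] / total if total else 0.0)
                        for aa in AMINO_ACIDS})
    return profile
```

Each column then supplies 20 frequencies instead of a single residue identity, which is the extra information a third-generation classifier sees for every window position.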

PSSP: Third Generation


- Many of the second generation algorithms have been updated to third generation
- The most important algorithms of today:
  - Predator: Nearest-neighbour
  - PSI-Pred: Neural networks
  - SSPro: Neural networks
  - SAM-T02: Homologs (Hidden Markov Models)
  - PHD: Neural networks
- Due to the improvement of protein information in databases, i.e. better evolutionary information, today's predictive accuracy is ~80%
- It is believed that the maximum reachable accuracy is 88%

First Generation PSSP


Two classical methods that use previously determined propensities:

- Chou-Fasman
- Garnier-Osguthorpe-Robson (GOR)

Chou-Fasman method
Uses table of conformational parameters (propensities) determined primarily from measurements of secondary structure.

$$P = \frac{\text{frequency of amino acid X observed in element Y}}{\text{frequency of element Y in database}}$$

Designations: H = Strong Former, h = Former, I = Weak Former, i = Indifferent, B = Strong Breaker, b = Breaker; P = Conformational Parameter

The Chou-Fasman method

If you were asked to determine whether an amino acid in a protein of interest is part of an α-helix or β-sheet, you might think to look in a protein database and see which secondary structures amino acids in similar contexts belonged to.
The Chou-Fasman method (1974) is a combination of such statistics-based methods and rule-based methods.

Steps of the Chou-Fasman algorithm:


1. Calculate propensities from a set of solved structures. For all 20 amino acids i, calculate these propensities by:

$$\frac{P(i \mid \text{Helix})}{P(i)}, \qquad \frac{P(i \mid \text{Beta})}{P(i)}, \qquad \frac{P(i \mid \text{Turn})}{P(i)}$$

Propensities > 1 mean that the residue type i is likely to be found in the corresponding secondary structure type.
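The ratio P(i | state) / P(i) can be computed directly from residue counts. The following is a minimal sketch under stated assumptions: the function name `propensities` is hypothetical, and the inputs are taken to be parallel lists of sequences and per-residue state strings (for example `"H"` for helix), which is not a format prescribed by the slides.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Return {amino acid: P(aa | state) / P(aa)} from labelled chains."""
    total = Counter()      # occurrences of each amino acid overall
    in_state = Counter()   # occurrences of each amino acid in `state`
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never observed in the data set")
    return {aa: (in_state[aa] / n_state) / (total[aa] / n_all) for aa in total}
```

On a toy chain `"AAGG"` labelled `"HHHL"`, alanine gets a helix propensity of about 1.33 and glycine about 0.67, so alanine is over-represented in helix relative to its overall frequency.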

Chou and Fasman


Amino Acid   α-Helix   β-Sheet   Turn
Ala          1.29      0.90      0.78
Cys          1.11      0.74      0.80
Leu          1.30      1.02      0.59
Met          1.47      0.97      0.39
Glu          1.44      0.75      1.00
Gln          1.27      0.80      0.97
His          1.22      1.08      0.69
Lys          1.23      0.77      0.96
Val          0.91      1.49      0.47
Ile          0.97      1.45      0.51
Phe          1.07      1.32      0.58
Tyr          0.72      1.25      1.05
Trp          0.99      1.14      0.75
Thr          0.82      1.21      1.03
Gly          0.56      0.92      1.64
Ser          0.82      0.95      1.33
Asp          1.04      0.72      1.41
Asn          0.90      0.76      1.23
Pro          0.52      0.64      1.91
Arg          0.96      0.99      0.88

Legend: favors α-helix; favors β-strand; favors turn

Chou and Fasman


Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1
- extension: extend helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker)
- if average P(α) over whole region is > 1, it is predicted to be helical

Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1
- extension: extend strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker)
- if average P(β) over whole region is > 1, it is predicted to be a strand

2. Once the propensities are calculated, each amino acid is categorized using the propensities as one of: helix-former, helix-breaker, or helix-indifferent. (That is, helix-formers have high helical propensities, helix-breakers have low helical propensities, and helix-indifferent residues have intermediate propensities.)

Each amino acid is also categorized as one of: sheet-former, sheet-breaker, or sheet-indifferent. For example, it was found (as expected) that glycine and proline are helix-breakers.

3.Find nucleation sites.


These are short subsequences with a high-concentration of helix-formers (or sheet-formers).

These sites are found with some heuristic rule (e.g. "a sequence of 6 amino acids with at least 4 helix-formers, and no helix-breakers").

4. Extend the nucleation sites, adding residues at the ends, maintaining an average propensity greater than some threshold.

5. Step 4 may create overlaps. Finally, we deal with these overlaps using some heuristic rules.
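The nucleation and extension steps for helices can be sketched as follows, using the α-helix propensities from the Chou and Fasman table in these slides (converted to one-letter codes) and the "4 out of 6" and "4 contiguous residues" rules. This is a simplified illustration rather than the published algorithm: the function name `predict_helices` is hypothetical, strands and turns are not handled, and the overlap resolution of step 5 is omitted.

```python
# Helix propensities from the table above, keyed by one-letter code.
P_HELIX = {
    "A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44, "Q": 1.27,
    "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97, "F": 1.07, "Y": 0.72,
    "W": 0.99, "T": 0.82, "G": 0.56, "S": 0.82, "D": 1.04, "N": 0.90,
    "P": 0.52, "R": 0.96,
}


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def predict_helices(seq, p=P_HELIX, window=6, min_formers=4, probe=4):
    """Mark residues as 'H' (helix) or '-' using nucleation + extension."""
    n = len(seq)
    helix = [False] * n
    for start in range(n - window + 1):
        # Nucleation: at least `min_formers` residues with P > 1 in the window.
        if sum(p[a] > 1.0 for a in seq[start:start + window]) < min_formers:
            continue
        left, right = start, start + window  # half-open region [left, right)
        # Extend right while the 4 residues ending at `right` average >= 1.
        while right < n and _mean(p[a] for a in seq[right - probe + 1:right + 1]) >= 1.0:
            right += 1
        # Extend left while the 4 residues starting at `left - 1` average >= 1.
        while left > 0 and _mean(p[a] for a in seq[left - 1:left - 1 + probe]) >= 1.0:
            left -= 1
        # Accept the region only if its overall average propensity is > 1.
        if _mean(p[a] for a in seq[left:right]) > 1.0:
            for i in range(left, right):
                helix[i] = True
    return "".join("H" if h else "-" for h in helix)
```

A sequence made only of strong formers such as `"AEELLKKA"` is marked helical throughout, while `"GPGPGPGP"` has no nucleation site and stays unassigned. Residues outside the 20 standard amino acids would raise a `KeyError` in this sketch.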

The GOR method (Garnier, Osguthorpe, Robson)


Position-dependent propensities for helix, sheet or turn are calculated for each amino acid. For each position j in the sequence, eight residues on either side are considered.

A helix propensity table contains information about propensity for residues at 17 positions when the conformation of residue j is helical. The helix propensity tables have 20 x 17 entries. Build similar tables for strands and turns.

GOR simplification: The predicted state of AAj is calculated as the sum of the position-dependent propensities of all residues around AAj.
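The GOR simplification amounts to summing table entries over the 17-residue window and picking the state with the highest total. The sketch below assumes a hypothetical table layout, `tables[state][offset][amino_acid]`, with offsets from -8 to +8; the actual GOR parameter values are not reproduced here, and positions that fall off the end of the sequence simply contribute nothing.

```python
def gor_predict(seq, tables, half_window=8):
    """Predict one state per residue by summing position-dependent scores.

    `tables[state][m][aa]` is the score contributed to `state` at position j
    by amino acid `aa` found at position j + m (m in -half_window..half_window).
    """
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state, table in tables.items():
            total = 0.0
            for m in range(-half_window, half_window + 1):
                k = j + m
                if 0 <= k < len(seq):
                    # Missing entries are treated as a neutral score of 0.
                    total += table[m].get(seq[k], 0.0)
            scores[state] = total
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)
```

With real tables, each of the three states would have its own 20 x 17 grid of values derived from a database of solved structures.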

GOR can be used at : http://abs.cit.nih.gov/gor/ (current version is GOR IV)

Suppose aj is the amino acid that we are trying to categorize. GOR looks at the residues a(j-8), ..., a(j+8), i.e. the eight residues on either side of aj.

Intuitively, it assigns a structure based on probabilities it has calculated from protein databases. These probabilities are of the form

Accuracy
Both Chou and Fasman and GOR have been assessed and their accuracy is estimated to be Q3=60-65%.
(initially, higher scores were reported, but the experiments set to measure Q3 were flawed, as the test cases included proteins used to derive the propensities!)

Nearest Neighbour Method


This method depends on the spatial structure of the central residue in each window having knowledge of its neighbours. This method is hence called a memory-based (homology-based) method.

Steps

- Computes MS Algorithm
- Computes the distance between homologous sections
- Example: NNSSP
- SDV Hyperplanes
- DS manipulation

[Figure: Modelling scheme of the nearest-neighbour method. Boxes: Training data, NN tree (binary search tree), Test data, Validation, Actual NN tree with data, New data, Prediction.]

Neural Networks
Single sequence methods - train network using sets of known proteins of certain types (all alpha, all beta, alpha+beta) then use to predict for query sequence

NNPREDICT (>70% accuracy)

NEURAL NETWORKS

- Inspired by the brain
- Traditional computers struggle to recognize and generalize patterns of the past for future actions
- The brain as an information processing system contains 10 billion nerve cells (neurons), and each neuron is connected to other neurons through about 10,000 synapses

Brain
Interconnected network of neurons that collect, process and disseminate electrical signals via synapses

Neural Network
Interconnected network of units (or nodes) that collect, process and disseminate values via links

Neurons Synapses

Nodes Links

Scheme for modeling Neural Network:


- Building a random network
- Training the net on a training set

Building a random network


- random selection of the type of node
- random selection of the parameters of the node
- random selection of the number of inputs
- connecting the inputs and outputs until the net is larger
- running the training set over the net
- selecting the proper output
- removal of all nodes which do not contribute to the output

Training the network


The general idea is to run the net on the training set; every time it gives a right answer, leave it as it is and/or strengthen the path, and every time it makes a mistake, penalize it. This can be done in several ways to incrementally modify the net:
- alter the parameters
- add/delete connections
- add/delete the nodes with their connections

Typical methodology used to train a feed-forward network for secondary structure prediction is based on Qian and Sejnowski, 1988

Alanine: 100000000 Helix : 100000000

Different Network Topologies


Single layer feed-forward networks

Input layer projecting into the output layer


Single layer network Input layer Output layer

Different Network Topologies


Multi-layer feed-forward networks

One or more hidden layers. Input projects only from previous layers onto a layer.
2-layer or 1-hidden layer fully connected network

Input layer

Hidden layer

Output layer

Different Network Topologies


Recurrent networks

A network with feedback, where some of its inputs are connected to some of its outputs (discrete time).
Recurrent network

Input layer

Output layer

Features
A typical training set consists of 100 nonhomologous protein chains (15,000 training patterns)
A net with an input window of 17, five hidden nodes in a single hidden layer and three outputs will have 357 input nodes and 1,808 weights.
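The figures of 357 input nodes and 1,808 weights can be reproduced with a short calculation. The breakdown below is an assumption, since the slide does not state it: 21 input units per window position (20 amino acids plus one spacer symbol for positions beyond the chain ends), full connectivity between layers, and one bias per hidden and output node counted as a weight. The names `network_size` and `encode_window` are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + spacer (assumed)


def network_size(window=17, symbols=21, hidden=5, outputs=3):
    """Return (input nodes, total weights including biases)."""
    inputs = window * symbols
    weights = inputs * hidden + hidden * outputs
    biases = hidden + outputs
    return inputs, weights + biases


def encode_window(seq, center, window=17):
    """One-hot encode the window around `center` as a flat 0/1 vector."""
    half = window // 2
    vector = []
    for k in range(center - half, center + half + 1):
        # Positions outside the chain are encoded with the spacer symbol.
        symbol = seq[k] if 0 <= k < len(seq) else "-"
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1  # raises ValueError for unknown symbols
        vector.extend(one_hot)
    return vector
```

Under these assumptions, 17 x 21 = 357 inputs, 357 x 5 + 5 x 3 = 1,800 connection weights, and 5 + 3 = 8 biases give 1,808 in total.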

Predictions are made on a winner-takes-all basis

PHD (Profile network from HeiDelberg) > 70%


A program with several cascading neural networks. The method employs a two-layered feed-forward neural network

Sequence to structure (First layer) Structure to Structure (Second layer)


Window length is 13, and for every position in the window the frequencies of the 20 amino acids are calculated (20 x 13). Based on this, the OUTPUT is a probability for each of the three possible classes (H, E and C)

Evaluating Prediction Efficiency


- Jackknife test
- Percentage of correctly classified residues (Q3)
- Correlation coefficient for each target class
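The formulas for these measures did not survive in the slide text. The sketch below uses the standard definitions, which are assumed to be what was intended: Q3 as the percentage of residues whose predicted state matches the observed state, and the Matthews correlation coefficient computed separately for each class. The function names are hypothetical.

```python
import math


def q3(predicted, observed):
    """Percentage of residues predicted in the correct state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def matthews(predicted, observed, state):
    """Matthews correlation coefficient for one class (e.g. 'H')."""
    pairs = list(zip(predicted, observed))
    tp = sum(p == state and o == state for p, o in pairs)
    tn = sum(p != state and o != state for p, o in pairs)
    fp = sum(p == state and o != state for p, o in pairs)
    fn = sum(p != state and o == state for p, o in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a jackknife test, each protein is left out of training in turn and these scores are computed on the left-out protein, so that no test case contributes to the parameters it is scored against.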

Prediction of Transmembrane helices

Two Ways: One is solely based on the construction principles of proteins associated with physico-chemical properties of amino acids. No concept of training is involved.

The other is to collect data sets with known structures, extract features and use machine learning algorithms for predictions.

Outline
1. Importance of Transmembrane Proteins
2. General Topologies
3. Methods (and challenges) for Structural Studies of TM Proteins

Eukaryotic cells have many membranes

Transmembrane Proteins
- Cellular roles include:
  - Communication between cells
  - Communication between organelles and cytosol
  - Ion transport, nutrient transport
  - Links to extracellular matrix
  - Receptors for viruses
  - Connections for cytoskeleton
- Over 25% of proteins in complete genomes.
- Key roles in diabetes, hypertension, depression, arthritis, cancer, and many other common diseases.
- Targets for over 75% of pharmaceuticals.


However, very few TM protein structures have been solved!

Biological Membrane = Lipid Bilayer

- Approximately 30 Å thick
- Hydrophobic core + hydrophilic or charged headgroups
- Mixture of lipids that vary in type of head groups, lengths of acyl chains, number of double bonds
- (Some membranes also contain cholesterol)

Membrane Bilayer with Proteins

In order to be stable in this environment, a polypeptide chain needs to (1) contain a lot of amino acids with hydrophobic sidechains, and (2) fold up to satisfy backbone H-bond propensity - How?

Structure Solution #1: Hydrophobic alpha-helix


- Satisfies polypeptide backbone hydrogen bonding
- Hydrophobic sidechains face outward into lipids

Examples of Helix Bundle TM Proteins

PDB = 1QHJ

PDB = 1RRC

Single helix or helical bundles (> 90% of TM proteins)
Examples:
- Human growth hormone receptor, insulin receptor
- ATP binding cassette family - CFTR
- Multidrug resistance proteins
- 7TM receptors - G protein-linked receptors

Structure solution #2 Beta-barrel


- Beta sheet satisfies backbone hydrogen bonds between strands
- Wrap sheet around into barrel shape
- Sidechains on the outside of the barrel are hydrophobic

Examples of Beta Barrel TM Proteins

PDB = 1EK9

PDB = 2POR

Beta barrels - in outer membrane of gram-negative bacteria, and some nonconstitutive membrane-acting toxins
Examples: Porins

General Topologies of TM Proteins

- Single helix or helical bundles, and beta barrels
- Both topologies result in hydrophobic surfaces facing acyl chains of lipids
- Part protruding from membrane can be a very short sequence (a few amino acids), a loop, or large, independently folding domains

Presence of Hydrophobic TM Domain can result in:


- Low levels of expression
- Difficulties in solubilization
- Difficulties in crystallization

Attempting crystallization and structure solution of transmembrane proteins is considered difficult and risky.

Difficult and risky, but still possible: TM Proteins of Known Structure


- Bacteriorhodopsin, Rhodopsin
- Photosynthetic reaction centers
- Porins
- Light harvesting complexes
- Potassium channels
- Chloride channels
- Aquaporin
- Transporters
- Etc.

Although few in number, each of these structures has been important for addressing key functions.

Great summary and resource: http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html

Protein Folding Problem


How does a one-dimensional amino acid sequence determine a specific three-dimensional structure? Or How can we read the sequence and predict that structure?

General Idea
We know what an alpha-helix or a beta strand looks like, so
(1) figure out which parts of the sequence are helices and which parts are strands
(2) figure out how they pack together

For soluble proteins, neither is well predicted. But for transmembrane proteins ...

TM Protein Structure Prediction, Step #1


For alpha-helical transmembrane proteins, hydropathy plot analysis provides a fairly accurate method to predict which amino acids form membrane-spanning helices
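A hydropathy plot is a sliding-window average of per-residue hydrophobicity. The sketch below uses the Kyte-Doolittle scale; the window of 19 residues and the cut-off of 1.6 are commonly quoted choices for membrane-spanning helices and are treated here as assumptions, not as values taken from these slides. The function names are hypothetical, and the window is assumed to be odd.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return (center index, mean hydropathy) for every full window."""
    half = window // 2
    profile = []
    for center in range(half, len(seq) - half):
        segment = seq[center - half:center + half + 1]
        profile.append((center, sum(KD[a] for a in segment) / window))
    return profile


def tm_candidate_centers(seq, window=19, threshold=1.6):
    """Window centers whose mean hydropathy exceeds the threshold."""
    return [c for c, score in hydropathy_profile(seq, window) if score > threshold]
```

Runs of consecutive candidate centers mark stretches that are hydrophobic enough to span the bilayer; plotting the second element of each profile entry against the first gives the hydropathy plot itself.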

We can model the structure of an individual alpha helix fairly accurately.

TM Protein Structure Prediction, Step #2


How do the helices pack in the membrane?
There are several labs studying known protein structures to identify factors involved in determining how transmembrane helices pack together (specificity of interaction and packing motifs):
- Hydrogen bonds
- Hydrophobicity
- Amino acids known to face the lumen of a channel
- Multiple sequence alignments
- Helix packing sequence motifs, etc.

These kinds of information are then combined with protein docking and energy minimization programs to predict how the helices pack together.

It is quite possible that studies of helical transmembrane proteins could lead to key information about the protein folding problem - how to predict protein structure from amino acid sequence

Prediction using Machine learning techniques

Collect data sets with known structures

Build model:
- Align using ClustalW/hmmalign
- hmmbuild
- hmmsearch
- Calculate threshold
- Score distribution and training data (NN)
- Validation

Summary
- Transmembrane proteins play important roles in many cellular processes in both health and disease
- Two general types of structure are found to cross the membranes: beta-barrels and alpha-helices
- Structural studies of TM proteins are impeded by difficulties in overexpression, purification and crystallization
- However, the few dozen structures that have been determined have provided key information about channels (gating, selectivity, etc.), energetics, transport, and other transmembrane processes
- Analysis of helical transmembrane protein structures may lead to accurate predictions of protein structure from amino acid sequence for this type of protein

Prediction of protein conformations from protein sequences (3D Prediction)

Protein Conformations

Predict protein 3D structure from (amino acid) sequence

Sequence → secondary structure → 3D structure → function


Protein 3D Structure Detection


- X-ray crystallography
- NMR

Expensive, slow


Protein Structure
- Protein 3D structure → biological function
- Lock & key model of enzyme function (docking)
- Folding problem: protein sequence → 3D structure
- Structure prediction and alignment
- Protein design, drug design, etc.
- The holy grail of bioinformatics


The Prediction Problem

Can we predict the final 3D protein structure knowing only its amino acid sequence?
- Studied for 4 decades
- Primary motivation for bioinformatics
- Based on this 1-to-1 mapping of sequence to structure
- Still very much an OPEN PROBLEM


Predicting Protein Structure


Goal: find best fit of sequence to 3D structure
- Comparative (homology) modeling: construct 3D model from alignment to protein sequences with known structure
- Threading (fold recognition): pick best fit to sequences of known 2D / 3D structures (folds)
- Ab initio / de novo methods: attempt to calculate 3D structure from scratch
  - Molecular dynamics
  - Energy minimization
  - Lattice models


PSP: Goals
- Accurate 3D structures. But not there yet.
- Good guesses: working models for researchers
- Understand the FOLDING PROCESS: get into the Black Box
- Only hope for some proteins: 25% won't crystallize, too big for NMR
- Best hope for novel protein engineering: drug design, etc.

PSP: Major Hurdles


- Energetics
  - We don't know all the forces involved in detail
  - Too computationally expensive BY FAR!
- Conformational search impossibly large
  - 100 a.a. protein, 2 moving dihedrals per residue, 2 possible positions for each dihedral: 2^200 conformations!
- Levinthal's Paradox
  - Longer than time of universe to search
  - Proteins fold in a couple of seconds??
- Multiple-minima problem
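The size of the search in Levinthal's paradox is easy to work out. The sketch below counts the conformations for the 100-residue example and converts them into a search time; the sampling rate of 10^13 conformations per second is an assumed, deliberately generous figure, and the function name is hypothetical.

```python
def levinthal_search_time(residues=100, dihedrals_per_residue=2,
                          states_per_dihedral=2, rate_per_second=1e13):
    """Return (number of conformations, exhaustive search time in years)."""
    conformations = states_per_dihedral ** (residues * dihedrals_per_residue)
    seconds = conformations / rate_per_second
    years = seconds / (365.25 * 24 * 3600)
    return conformations, years
```

2^200 is roughly 1.6 x 10^60 conformations. Even at the assumed rate, an exhaustive search would take on the order of 10^39 years, far longer than the age of the universe (about 1.4 x 10^10 years), so folding cannot proceed by random enumeration.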

Tertiary Structure Prediction


Major Techniques

Comparative Modeling
Homology Modeling Threading

Template-Free Modeling
De novo/ab initio Methods
Physics-Based Knowledge-Based


Homology Modeling


Steps

- Template selection
- Target-template alignment
- Model building
- Evaluation

Repeated until a satisfactory model structure is achieved


Threading
- a library of protein folds (templates)
- a scoring function to measure the fitness of a sequence -> structure alignment
- a search technique for finding the best alignment between a fixed sequence and structure
- a means of choosing the best fold from among the best scoring alignments of a sequence to all possible folds


ab initio Methods


The ab initio approach (Figure 6.25) ignores sequence homology and attempts to predict the folded state from fundamental energetics or physicochemical properties associated with constituent residues. This involves modelling the physicochemical parameters in terms of force fields that direct the folding. These constraints will reflect the energetics associated with charge, hydrophobicity and polarity, with the aim being to find a single structure of low energy.

How to define the energy of a PROTEIN? How to find the conformations for which the energy is minimum?

This approach is based on the thermodynamic argument that the native structure of a protein is the global minimum in the free energy profile. Generally the results are expressed as rmsd (root mean square deviation), reflecting the difference in positions between corresponding atoms in the experimental and calculated (predicted) structures.

What makes the global minimum?

1) Semi-empirical potential function
- Calculated as the sum of all the possible pairwise interactions between the atoms in the molecule
- E.g. AMBER force field
- Evaluates the N(N-1)/2 atom-atom interactions, which requires computation time proportional to N²

Most of the computational methods look for the global minimum?
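The N(N-1)/2 scaling can be seen in a direct pairwise sum. The sketch below evaluates only a Lennard-Jones term over all atom pairs as a stand-in; it is not the AMBER force field, which also includes bonded and electrostatic terms, and the parameters `epsilon` and `sigma` are placeholder values in arbitrary units.

```python
import math
from itertools import combinations


def lennard_jones_energy(coords, epsilon=1.0, sigma=1.0):
    """Return (total pairwise energy, number of pairs evaluated)."""
    energy = 0.0
    pairs = 0
    # combinations() visits each unordered pair once: N(N-1)/2 pairs.
    for a, b in combinations(coords, 2):
        r = math.dist(a, b)
        sr6 = (sigma / r) ** 6
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
        pairs += 1
    return energy, pairs
```

Doubling the number of atoms roughly quadruples the number of pairs, which is why all-atom energy evaluation becomes the bottleneck for large proteins.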

[Figure: Schematic diagram of different types of energy functions. Panels plot energy against structure and mark the global minimum; one panel shows a semi-empirical potential function with N(N-1)/2 atom-atom interactions (e.g. AMBER force fields).]

To overcome this, reduce the resolution at which the potential function is calculated.
Instead of the atom-atom potential, a united-atom potential can be used as an approximation. A united atom is also called a pseudo-atom.

2) UNRES (united-residue) force field

UNRES was originally designed and parameterized to locate native-like structures of proteins as the lowest in potential energy by unrestricted global optimization.
Propensities of amino acids are calculated. The backbone is represented as a sequence of α-carbon (Cα) atoms linked by virtual bonds designated as dC, with united peptide groups (p's) in their centers.


Molecular Dynamics
Computation of the dynamics or motion of a protein.

1) The physical forces which influence the folding process are well represented by the semi-empirical force fields.
2) Each atom of the protein moves under the forces given by the force field. The motion is calculated by Newton's laws of motion, F = ma (evaluated at each femtosecond time step).


Motion influenced by temperature??


Simulated annealing:

At higher temperatures the motion will be greater in a shorter period of time.

1) An initial energy minimum of the structure is calculated first.
2) The temperature is raised to 3000 K for a few femtoseconds and then gradually lowered to room temperature, 300 K.

The idea is that even if the prediction starts from a wrong structure (at 3000 K), the final corrections will lead to the right conformation.
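The heat-then-cool idea can be illustrated on a toy problem. Note the substitution: the sketch below uses Metropolis Monte Carlo moves on a one-dimensional energy function rather than molecular dynamics on a protein. The 3000 K to 300 K schedule follows the slide; the geometric cooling, the step size, the Boltzmann-like constant `k`, and the landscape `toy_energy` are all assumptions chosen for illustration.

```python
import math
import random


def toy_energy(x):
    """Hypothetical rugged landscape with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def simulated_annealing(energy, x0, t_start=3000.0, t_end=300.0,
                        steps=5000, step_size=0.5, k=0.001, seed=0):
    """Return (best position, best energy) found while cooling."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling from t_start down to t_end (requires steps >= 2).
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability, which shrinks as the temperature drops.
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / (k * t)):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

At high temperature uphill moves are accepted often, so the search can escape a poor starting basin; as the temperature falls, the walk settles into a low-energy region.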


Molecular dynamics as an optimization tool

[Figure: trajectory across an energy landscape. Axes: energy versus conformational space.]


Other related methods and topics:

- Monte Carlo simulations
- Conformational space annealing
- Rosetta algorithm
- CASP (Critical Assessment of Structure Prediction)
- Levinthal paradox, etc.


ROSETTA ALGORITHM
Break target sequence into fragments of 9 amino acids

Create profile , X, for target

Create profile, S, for similar PDB sequences

Align profiles X, S to get best match fragment

Use fragments as starting point for optimisation, using:
- hydrophobic burial
- polar side-chain interactions
- hydrogen bonding between beta-strands
- hard sphere repulsion (van der Waals)

Create 1000 structures, and choose cluster centre as the best prediction
