
GENOMICS AND PROTEOMICS ANALYSES

Dr Joshua Boateng 21/11/2011

Dr J Boateng BIOT 1011 Bioinformatics

Biotech and pharmaceutical companies spent $10 billion on hardware, software, and services in 2002.
Source: Gartner
The biotechnology/IT market will increase at a compound annual growth rate (CAGR) of 24% to nearly $38 billion by 2006. Source: IDC Research
Reference: Prof. A.S. Kolaskar Vice Chancellor, University of Pune

GENOMICS
Genetics: the science of genes, heredity, and the variation of organisms. In modern research, genetics provides tools for investigating the function of a particular gene, e.g. analysis of genetic interactions.
Genomics: the study of large-scale genetic patterns across the genome of a given species. It deals with the systematic use of genome information to provide answers in biology, medicine, and industry.

Genomics is the study of sequences, gene organization and mutations at the DNA level, i.e. the study of information flow within a cell.
Genomics has the potential to offer new therapeutic methods for the treatment of some diseases, as well as new diagnostic methods.
Major tools and methods related to genomics are bioinformatics, genetic analysis, measurement of gene expression, and determination of gene function.

GENOME COMPARISONS
Species            Chrom.   Genes            Base pairs
Humans             46       28,000-35,000    3.1 billion
Mouse              40       22,500-30,000    3.1 billion
Puffer fish        44       31,000           2.7 million
Malaria mosquito   6        14,000           365 million
Fruit fly          8        14,000           137 million
Roundworm          12       19,000           97 million
E. coli            1        5,000            4.1 million

GENOMIC ANALYSIS

Many diverse studies require the determination of the abundance of large numbers of specific DNA or RNA molecules in complex mixtures, including, for example, the determination of the changes in mRNA levels of many genes.
Genome analysis entails the prediction of genes in uncharacterized genomic sequences. The 21st century has seen the announcement of the draft version of the human genome sequence. Model organisms have been sequenced in both the plant and animal kingdoms.

GENOMIC ANALYSIS
However, the pace of genome annotation is not matching the pace of genome sequencing. Experimental genome annotation is slow and time consuming, so there is a demand for computational tools for gene prediction.
Computational gene prediction is relatively simple for prokaryotes, where each gene is transcribed into the corresponding mRNA and then translated into protein without interruption. The process is more complex for eukaryotic cells, where the coding DNA sequence is interrupted by non-coding sequences called introns.
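
To make the prokaryotic case concrete, here is a minimal sketch of an open reading frame (ORF) scan. It is an illustration only, not a real gene finder: the function name `find_orfs`, the `min_codons` cut-off and the choice to scan only the forward strand are assumptions made for this example, and the approach would not work on eukaryotic genes interrupted by introns.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not a real gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start/end are 0-based, end-exclusive positions in the input sequence;
    the stop codon is included in the reported span.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current candidate start codon, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next start codon in this frame
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"  # hypothetical toy sequence
    for start, end, frame in find_orfs(sequence):
        print(frame, start, end, sequence[start:end])
```

Real prokaryotic gene finders also scan the reverse strand and use statistical models of codon usage; this sketch only shows why uninterrupted coding sequences are easy to locate.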

BIOLOGICAL QUESTIONS
Some of the questions biologists want to answer today are:
What part of a DNA sequence codes for a protein and what part of it is junk DNA?
Classify the junk DNA as intron, untranslated region, transposon, dead gene or regulatory element.
Divide a newly sequenced genome into the genes (coding) and the non-coding regions.

Biological Research in 21st Century


"The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical." - Walter Gilbert

IMPORTANCE OF GENOME ANALYSIS


The importance of genome analysis can be understood by comparing the human and chimpanzee genomes. The chimp and human genomes vary by an average of just 2%, i.e. just about 160 enzymes. A complete analysis of the two genomes would give a strong insight into the various mechanisms responsible for the differences.

COMPLEXITY IS AN UNDERSTATEMENT?


GENOMIC ANALYSIS: Basics

Techniques used to estimate the relative abundance of two or more sets of mRNA:
differential screening of cDNA libraries
subtractive hybridization
differential display

However, more advanced methods have recently been developed.

GENOMIC ANALYSIS: Advances
Advanced methods are particularly amenable to organisms whose entire genome sequences are known, such as S. cerevisiae. It is now practicable to investigate changes in the mRNA levels of all yeast open reading frames (ORFs) in one experiment.

Advanced genomic analysis techniques


DNA sequencing
DNA microarray technology: analysis of gene expression profiles at the mRNA level
Bioinformatic tools to organize and analyze such data
Chip-based analysis of samples
Models of gene networks
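
As a rough illustration of how expression profiles at the mRNA level are compared between two samples, the sketch below computes log2 fold changes from per-gene intensities. The function names, the pseudocount of 1.0 and the two-fold threshold are assumptions chosen for this example; real microarray analysis also involves normalization and statistical testing.

```python
import math


def log2_fold_changes(treated, control, pseudocount=1.0):
    """Map gene -> log2(treated / control) for genes measured in both samples.

    A pseudocount is added to both intensities to avoid division by zero.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in treated
        if gene in control
    }


def differentially_expressed(changes, threshold=1.0):
    """Return genes whose absolute log2 fold change meets the threshold, sorted by name."""
    return sorted(gene for gene, lfc in changes.items() if abs(lfc) >= threshold)


if __name__ == "__main__":
    # Hypothetical spot intensities for three genes
    treated = {"geneA": 799, "geneB": 99, "geneC": 24}
    control = {"geneA": 199, "geneB": 99, "geneC": 99}
    changes = log2_fold_changes(treated, control)
    print(changes)
    print(differentially_expressed(changes))
```

A log2 scale is used so that a doubling and a halving of expression are symmetric (+1 and -1).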

Microarray Technology


Post-genomic Era
Series of omics
Comparative genomics
Structural and functional genomics
Transcriptomics
Proteomics
Metabolomics

Bioinformatics tools needed for analysis of data from these omics



Data Mining
Development of new tools for data mining:
Sequence alignment
Genome sequencing
Genome comparison
Microarray data analysis
Proteomics data analysis
Small molecular array analysis
Aim: to derive information and gain knowledge from the data

COMPARATIVE GENOMICS
Analyzing and comparing genetic material from different species to study evolution, gene function, and inherited disease.
Understanding what is unique to each species.
Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.

When we BLAST a sequence, is that comparative genomics?

The difference is in scale and direction.

Other omics:
One or several genes compared against all other known genes. Use the genome to inform us about the entire organism.

Comparative genomics:
Entire genome compared to other entire genomes. Use information from many genomes to learn more about the individual genes.

Background on Comparative Genomic Analysis


Sequencing the genomes of the human, the mouse and a wide variety of other organisms, from yeast to chimpanzees, is the driving force for the development of a new field of biological research called comparative genomics.

BACKGROUND
Comparing the human genome with the genomes of different organisms helps to better understand gene structure and function and thereby develop new strategies in the battle against human disease. Comparative genomics also provides a powerful new tool for studying evolutionary changes among organisms.


This helps to identify the genes that are conserved among species along with the genes that give each organism its own unique characteristics. Using computer-based analysis to zero in on the genomic features that have been preserved in multiple organisms over millions of years, researchers will be able to pinpoint the signals that control gene function. This should in turn translate into innovative approaches for treating human disease and improving human health.

BACKGROUND
The evolutionary perspective may prove extremely helpful in understanding disease susceptibility. For example, chimpanzees do not suffer from some of the diseases that strike humans, such as malaria and AIDS. A comparison of the sequence of genes involved in disease susceptibility may reveal the reasons for this species barrier, thereby suggesting new pathways for prevention of human disease.

BACKGROUND
Although living creatures look and behave in many different ways, all of their genomes consist of DNA, the chemical chain that makes up the genes that code for thousands of different kinds of proteins. Precisely which protein is produced by a given gene is determined by the sequence in which four chemical building blocks - adenine (A), thymine (T), cytosine (C) and guanine (G) - are laid out along DNA's double-helix structure.

BACKGROUND
In order for researchers to most efficiently use an organism's genome in comparative studies, data about its DNA must be in large, contiguous segments, anchored to chromosomes and, ideally, fully sequenced. Furthermore, the data needs to be organized for easy access and high-speed analysis by sophisticated computer software. Organisms that have been completely sequenced include: mouse (Mus musculus), human (Homo sapiens), fruit fly (Drosophila melanogaster); and ....................

BACKGROUND
The fledgling field of comparative genomics has already yielded some dramatic results. For example, a March 2000 study comparing the fruit fly genome with the human genome discovered that about 60 percent of genes are conserved between fly and human. Simply put, the two organisms appear to share a core set of genes. Researchers have found that two-thirds of human cancer genes have counterparts in the fruit fly.

BACKGROUND
More surprisingly, when scientists inserted a human gene associated with early-onset Parkinson's disease into fruit flies, the flies displayed symptoms similar to those seen in humans with the disorder. This raises the possibility that the tiny insects could serve as a new model for testing therapies aimed at Parkinson's.

Comparative Genomics: What should one look for?

Genomes compared: Human, P. falciparum (P.f.), Mosquito

Proteins that are shared:
By all genomes
Exclusively by Human & P.f.
Exclusively by Human & Mosquito
Exclusively by P.f. & Mosquito

Unique proteins in:
Human
P.f. (targets for anti-malarial drugs)
Mosquito
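
The seven categories of a three-genome comparison (shared by all, shared by exactly two, unique to one) can be sketched with set operations. This is a simplified illustration: it assumes that proteins from the three organisms have already been grouped into comparable protein-family identifiers (for example by a similarity search), and the function name and dictionary keys are hypothetical.

```python
def partition_shared(human, pf, mosquito):
    """Split protein-family identifiers from three proteomes into the seven
    regions of a three-way Venn diagram."""
    human, pf, mosquito = set(human), set(pf), set(mosquito)
    return {
        "all": human & pf & mosquito,
        "human_pf_only": (human & pf) - mosquito,
        "human_mosquito_only": (human & mosquito) - pf,
        "pf_mosquito_only": (pf & mosquito) - human,
        "human_unique": human - pf - mosquito,
        # Families found only in the parasite are candidate anti-malarial targets
        "pf_unique": pf - human - mosquito,
        "mosquito_unique": mosquito - human - pf,
    }


if __name__ == "__main__":
    # Hypothetical family identifiers, for illustration only
    regions = partition_shared(
        human={"fam1", "fam2", "fam3", "fam4"},
        pf={"fam1", "fam2", "fam5", "fam6"},
        mosquito={"fam1", "fam3", "fam5", "fam7"},
    )
    for name, families in regions.items():
        print(name, sorted(families))
```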


Comparative Gene Prediction


GenScan: ab initio gene prediction.
GeneWise, Procrustes: homology guided.
Rosetta, SGP1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.
GenomeScan: ab initio modified by BLAST homologies.
SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring schema to incorporate similarity to known proteins.

Proteome by the dictionary


The term proteome was coined in 1994 as a linguistic equivalent to the concept of genome.

Proteome: the complete set of proteins that is expressed, and modified, by the entire genome in the lifetime of a cell.

Practical: the complement of proteins expressed by a cell at any one time.

Proteomics by the dictionary


Proteomics (practical): the study of the proteome using technologies of large-scale protein separation and identification.

Large-scale separation: 2DE, liquid chromatography
Identification: MALDI MS, tandem MS/MS, FT-MS ..

http://www.bio-itworld.com/archive/031704/horizons_horizons_comm.html

Proteomics according to Medline


Development of Proteomics
From 220 publications in the previous millennium (94-99)
To 21,350 (!!!) publications in this millennium (00-05)
[Chart: number of proteomics papers and reviews per year, 1997-2004; y-axis from 0 to 9000]

Proteomics by Google: THE REALISTIC TRUTH.

              2004              2005
Proteomics    886,000 hits      4,700,000 hits
Genomics      2,070,000 hits    16,000,000 hits

Comparing Proteomics & Genomics

              Genomics analysis (genome)        Proteome analysis (proteome)
Molecules     DNA, nc-RNA, mRNA, coding DNA     Proteins, peptides, glyco and other modifications
Structure     Linear                            3D
Behaviour     Dynamic up/down                   Dynamic up/down, variants
Archiving     Archived (EST, cDNA, GEO)         Poorly archived
Completion    Completion                        No notion of completion

Proteomics vs. Genomics: more differences

Gene/RNA (dynamic)
Handling: stable molecules; handling cheap/easy; minimal modification; works in isolation
Technology: sequencing (established); DNA array / genotyping / expression / CGH

Protein (dynamic)
Handling: fragile molecules; handling dependent; labile modification; protein interaction; localization dependent
High-throughput (HTP) technology: MS related (not yet); protein chip (not yet); antibody arrays (not yet)

Proteomics:
Original definition: study of the proteins encoded by the genome of a biological sample.
Current definition: study of the whole protein complement of a biological sample (cell, tissue, animal, biological fluid [urine, serum]).
Usually involves high-resolution separation of polypeptides at the front end, followed by mass spectrometry identification and analysis.

Challenges facing Proteomic Technologies


Limited/variable sample material
Sample degradation (occurs rapidly, even during sample preparation)
Vast dynamic range required
Post-translational modifications (often skew results)
Specificity among tissue, developmental and temporal stages
Perturbations by environmental (disease/drugs) conditions

Researchers have deemed sequencing the genome easy by comparison, as PCR was able to assist in overcoming many of these issues in genomics.

The Proteomics Tool Kit


technologies for separating and visualizing proteins and peptides
technologies for assessing protein-protein interactions
technologies for identifying proteins*
technologies for quantifying protein expression*
bioinformatic tools for assessment and communication

Proteomic Technologies
Amino Acid Composition
Array-based Proteomics
2D PAGE
Mass Spectrometry
Structural Proteomics
Informatics (and the challenges facing the Human Proteome Project)

Amino Acid Composition (Edman)


Pioneering method of obtaining information from proteins.
Cumbersome and tedious by today's standards.
Requires the use of terrible-smelling mercaptoethanol.
Not high-throughput by today's standards; hence, amino acid composition analysis is no longer the most widely used technique.

Protein Sequencing
step 1, fragmenting into peptides


Protein Sequencing
step 2, sequencing the peptides by Edman degradation.

Separate by HPLC and detect by absorbance at 269 nm.

Array-based Proteomics
Employ two-hybrid assays
Use GFP, FRET, and GST
GFP = green fluorescent protein
FRET = fluorescence resonance energy transfer
GST = glutathione S-transferase, a well-characterized protein used as a marker protein.

Array-based Proteomics


Array-based Proteomics
Offer a high-throughput technique for proteome analysis. These small plates are able to hold many different samples at a time. Current research is ongoing in an attempt to interface array methodologies with Mass Spectrometry at ORNL.

2D PAGE
2-D gel electrophoresis is a multi-step procedure that can be used to separate hundreds to thousands of proteins with extremely high resolution. It works by separating proteins by their isoelectric points (pI) in the first dimension, using an immobilized pH gradient (isoelectric focusing), and then by their molecular weights (MW) in the second dimension.
The core technology of proteomics is 2-DE. At present, there is no other technique that is capable of simultaneously resolving thousands of proteins in one separation procedure (cited in 2000).
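
The two quantities that 2-DE separates on, MW and pI, can be estimated from a protein sequence. The sketch below is only an approximation under stated assumptions: it uses average residue masses, one of several published pKa sets (chosen here for illustration), ignores post-translational modifications, and is intended for short to medium unmodified sequences. The function names are hypothetical.

```python
# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Assumed pKa values; other published sets give slightly different pI estimates.
PKA_POSITIVE = {"N_TERM": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_TERM": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def molecular_weight(seq):
    """Approximate average molecular weight (Da) of an unmodified sequence."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


def net_charge(seq, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    seq = seq.upper()
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["N_TERM"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["C_TERM"] - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq, low=0.0, high=14.0, iterations=60):
    """Estimate pI by bisection: net charge decreases as pH increases."""
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid   # still positively charged, so pI is higher
        else:
            high = mid  # negatively charged (or neutral), so pI is lower
    return (low + high) / 2.0


if __name__ == "__main__":
    peptide = "MRCKTETGAR"  # toy sequence
    print(round(molecular_weight(peptide), 1), round(isoelectric_point(peptide), 2))
```

On a gel, the estimated pI would correspond to the horizontal position after isoelectric focusing and the MW to the vertical position after SDS-PAGE.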

Evolution of 2-DE methodology


Traditional IEF procedure (in the past):
Isoelectric focusing (IEF) is run in thin polyacrylamide gel rods in glass or plastic tubes.
Gel rods contain: 1. urea, 2. detergent, 3. reductant, and 4. carrier ampholytes (which form the pH gradient).
Problems: 1. tedious. 2. not reproducible.

Evolution of 2-DE methodology


SDS-PAGE gel size:
The O'Farrell technique has been used for 20 years without major modification.
20 x 20 cm gels have become a standard for 2-DE.
Assumption: 100 bands can be resolved by a 20 cm long 1-DE gel. Therefore, a 20 x 20 cm gel can resolve 100 x 100 = 10,000 proteins, in theory.

Evolution of 2-DE methodology


Problems with traditional 1st dimension IEF
Works well for native proteins, but is less suitable for denatured proteins, because:
1. It takes a longer time to run.
2. Techniques are cumbersome (the soft, thin, long gel rods need excellent experimental technique).
3. Batch-to-batch variation of carrier ampholytes.
4. Patterns are not reproducible enough.
5. Loss of most basic proteins and some acidic proteins.

OPERATOR DEPENDENT

2D PAGE
The 2-D gel electrophoresis process consists of these steps:
1. Sample preparation
2. First dimension: isoelectric focusing
3. Second dimension: gel electrophoresis
4. Staining
5. Imaging analysis via software

Challenges for 2-DE


1. Spot number:
10,000-150,000 gene products in a cell.
PTMs make it difficult to predict the real number.
The sensitivity and dynamic range of 2-DE must be adequate.
It is impossible to display all proteins in a single gel.

Challenges for 2-DE


2. Isoelectric point spectrum:
pI of proteins: ranges from pH 3 to 13 (from in vitro translated ORFs).
PTMs would not alter the pI outside this range.
A pH gradient spanning 3-13 does not exist.
Proteins with pI > 11.5 need to be handled separately.

Challenges for 2-DE


3. molecular weights:
Small proteins or peptides can be analysed by modifying the gel and buffer conditions of SDS-PAGE.
Proteins > 250 kDa do not enter the second-dimension SDS-PAGE properly.
1-DE (SDS-PAGE) can be run in a lane at the side of the 2-DE gel.

Challenges for 2-DE


4. hydrophobic proteins:
Some very hydrophobic proteins do not go into solution.
Some hydrophobic proteins are lost during sample preparation and isoelectric focusing (IEF).
More chemical developments are required.

Challenges for 2-DE


5. Sensitivity of detection:
Low copy number proteins are very difficult to detect, even when employing the most sensitive staining methods.
Sensitivity of staining methods:
1. Silver staining
2. Fluorescent staining
3. Dye binding staining (CBR)

Challenges for 2-DE


6. Loading capacity:
For detection of low-abundance proteins, more sample needs to be loaded.
A wide dynamic range of the SDS-PAGE is required to prevent merging of highly abundant protein spots.
Loading capacity: IEF > SDS-PAGE.

Challenges for 2-DE


7. Quantitation:
The detection method must give reliable quantitative information. Silver staining does not give reliable quantitative data.


Challenges for 2-DE


8. Reproducibility:
Of the highest importance in a 2-DE experiment.
Immobilized pH gradient strips have greatly improved first-dimension consistency.
Variation mostly comes from sample preparation.

A good-looking spot pattern, free of streaks and smears, is not a guarantee of the best 2-DE protocol.

Technologies for identifying proteins


Western blotting
Chemical (Edman) sequencing of proteins
Mass spectrometry: peptide mass fingerprint, mass spec decay, databases and search engines
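
The idea behind a peptide mass fingerprint can be sketched as an in-silico digest: cut the protein where trypsin would, compute the mass of each peptide, and compare the resulting list with measured masses. The sketch below assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P), monoisotopic residue masses, no missed cleavages and no modifications. The function names are hypothetical.

```python
# Monoisotopic residue masses in Da (amino acid minus one water).
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MONO = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    protein = protein.upper()
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass (Da) of an unmodified peptide."""
    return sum(MONO_MASS[aa] for aa in peptide.upper()) + WATER_MONO


def mass_fingerprint(protein):
    """Sorted list of (mass, peptide) pairs for a complete tryptic digest."""
    return sorted((peptide_mass(p), p) for p in tryptic_digest(protein))


if __name__ == "__main__":
    for mass, peptide in mass_fingerprint("MKRPAGKTLR"):  # toy sequence
        print(f"{peptide:8s} {mass:10.4f}")
```

In practice, search engines compare such theoretical mass lists for every protein in a database against the measured spectrum and score the matches.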

Mass Spectrometry
Mass spectrometry is another tool to analyze the proteome. In general, a mass spectrometer consists of:
Ion source
Mass analyzer
Detector

Mass spectrometers are used to measure the mass-to-charge (m/z) ratios of ionized substances. From this measurement, a mass is determined, proteins are identified, and further analysis is performed.
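
The step from a measured m/z ratio to a mass can be written out for the common case of positive ions formed by adding protons. For a neutral mass M and charge z, m/z = (M + z * m_proton) / z. The sketch below assumes protonated ions only; the function names are hypothetical.

```python
PROTON_MASS = 1.007276  # Da


def mass_to_charge(neutral_mass, charge):
    """m/z of a molecule of the given neutral mass carrying `charge` extra protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz, charge):
    """Invert mass_to_charge: recover the neutral mass from an observed m/z."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


if __name__ == "__main__":
    mass = 1000.0  # hypothetical peptide mass in Da
    for z in (1, 2, 3):
        print(z, round(mass_to_charge(mass, z), 4))
```

The same peptide therefore appears at different m/z values depending on its charge state, which is why the charge must be assigned before a mass can be reported.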

MASS SPECTROMETRY
MORE DETAILED MASS SPECTROMETRY APPLICATIONS IN MORNING LECTURE ON 28TH NOVEMBER 2011


Application of bioinformatics in the fields of genomics and proteomics

What is Bioinformatics?
Conceptualizing biology in terms of molecules and then applying informatics techniques from math, computer science, and statistics to understand and organize the information associated with these molecules on a large scale.

How do we use Bioinformatics?


Store/retrieve biological information (databases)
Retrieve/compare gene sequences
Predict function of unknown genes/proteins
Search for previously known functions of a gene
Compare data with other researchers
Compile/distribute data for other researchers

Sequence retrieval:
National Center for Biotechnology Information
GenBank and other genome databases

Sequence comparison programs:
BLAST
GCG
MacVector

Protein structure:
3D modeling programs: RasMol, Protein Explorer

Similarity Search: BLAST


A tool for searching gene or protein sequence databases for related genes of interest.
Alignments between the query sequence and any given database sequence, allowing for mismatches and gaps, indicate their degree of similarity.
The structure, function, and evolution of a gene may be inferred from such comparisons.
http://www.ncbi.nlm.nih.gov/BLAST/

% identity

CATTATGATA
GTTTATGATT
70%

MRCKTETGAR
MRCGTETGAR
90%
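
The percent identity values on this slide can be reproduced with a few lines of code. This is a minimal sketch that assumes the two sequences are already aligned and of equal length, with no gaps; it is not how BLAST itself computes alignments, and the function name is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues in two aligned,
    equal-length, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    print(percent_identity("CATTATGATA", "GTTTATGATT"))  # DNA pair from the slide
    print(percent_identity("MRCKTETGAR", "MRCGTETGAR"))  # protein pair from the slide
```

The DNA pair differs at three of ten positions and the protein pair at one of ten, which gives the 70% and 90% shown above.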

Strengths:
Accessibility
Growing rapidly
User friendly

Weaknesses:
Sometimes not up-to-date
Limited possibilities
Limited comparisons and information
Not accurate

Need for improved Bioinformatics


Genomics:
Human Genome Project
Gene array technology
Comparative genomics
Functional genomics

Proteomics:
Global view of protein function/interactions
Protein motifs
Structural databases

Data Mining
Handling enormous amounts of data
Sort through what is important and what is not
Manipulate and analyze data to find patterns and variations that correlate with biological function

Proteomics
Uses information determined by biochemical/crystal structure methods
Visualization of protein structure
Make protein-protein comparisons
Used to determine:
- conformation/folding
- antibody binding sites
- protein-protein interactions
- computer aided drug design

[Diagram: bioinformatics at the centre, linked to students, educators, researchers and institutions]
