LECTURE SERIES
Introduction (DG); Cancer Biology (MO); Cell Signaling Inference (MO); Genetic Variation (DG); Massive Testing (LY); Biomarker Discovery (LY); Phenotype Prediction (DG); Embedding Mechanism (DG)
OUTLINE
Biology and Statistical Learning; Predicting from Comparisons; Pathway De-regulation; Breast Cancer Prognosis; Metastatic Cancer
RECAP
Statistical methods for analyzing cancer data permeate the literature. Prominent examples examined in previous lectures include
Modeling the accumulation of driver mutations during tumorigenesis; Identifying perturbed signaling in tumor cells; Discovering risk-bearing DNA sequence variation; and Finding differentially expressed genes and gene products.
The final two lectures are about learning classifiers that can distinguish between cellular phenotypes from mRNA transcript levels collected from cells in assayed tissue.
BIOLOGICAL RATIONALE
In cancer, malignant phenotypes arise from the net effect of interactions among multiple genes and other molecular agents within biological networks. The resulting perturbations in signaling pathways can be detected and quantified with mRNA concentrations. Statistical learning can serve as a basis for:
Detecting disease (e.g., tumor vs normal); Discriminating among cancer sub-types (e.g., GIST vs LMS or BRCA1 mutation vs no BRCA1 mutation); Predicting outcomes (e.g., poor prognosis vs good prognosis).
X: High-throughput genomic data. The traditional approach (experimental and molecule-by-molecule) is not feasible at this scale. A principled approach is required to extract knowledge from X. Statistical learning has emerged as a core methodology for the analysis of X.
Standard Goals: Learn a predictor $f : \mathbb{R}^d \to \{1, \dots, K\}$ or class-conditional model $p(x \mid k)$ from the training data $\mathcal{L}$. Less Standard: Develop statistical metrics and models for regulation and mechanism.
BARRIERS (I)
Applications to biomedicine, specifically the implications for clinical practice, are widely acknowledged to remain limited. One major barrier is the study-to-study diversity in reported prediction accuracies and signatures (lists of discriminating genes). Some of this variation can be attributed to the over-fitting that results from the infamous small n, large d dilemma. Typically, the number of samples (chips, profiles, patients) per class is n = 10 to 1000 whereas the number of features (exons, transcripts, genes) is d = 1000 to 50,000.
BARRIERS (II)
However, complex decision rules are perhaps the central obstacle to mature applications. The methods applied were usually designed for other purposes and with little emphasis on transparency. Specifically, the rules generated by nearly all standard, off-the-shelf techniques applied to genomics data, such as boosting, neural networks, multiple decision trees, support vector machines, and linear discriminant analysis, usually involve nonlinear functions of hundreds or thousands of genes, and a great many parameters.
BARRIERS (III)
In contrast, follow-up studies, for instance independent validation or therapeutic development, are usually based on a relatively small number of biomarkers assayed with high-resolution methods such as RT-PCR. This usually also requires an understanding of the role of the genes and gene products in the context of molecular pathways. Ideally, the decision rules could be interpreted mechanistically, for instance in terms of transcriptional regulation, and be robust with respect to parameter settings.
BARRIERS (IV)
Consequently, standard decision rules are too complex to characterize biologically. Moreover, what is notably missing is a solid link with potential mechanism, which seems to be a necessary condition for translational medicine, i.e., drug development and clinical decision-making.
ACCURACY AND CONTEXT
Needless to say, accuracy is also necessary. But the accuracy of many of the methods mentioned above is already high enough to be of potential clinical value for many important phenotype distinctions. Also, it is now common to follow methodological development with a biological story about the genes appearing in the support (signature) of the classifier, e.g., an enrichment analysis. However, this does not substitute for providing a potential mechanistic characterization of the decision rules in terms of biochemical interactions or specific regulatory motifs.
PROPOSED FRAMEWORK
Translational objectives, and small-sample issues, argue for limiting the number of parameters and introducing strong biases. The two principal objectives for the family of classifiers described below are:
Use elementary and parameter-free building blocks to assemble a classifier which is determined by its support. Demonstrate that these can be as discriminating as those that emerge from the most powerful methods in statistical learning.
EXPRESSION ORDERING
The building blocks we choose are two-gene comparisons, regarded as biological switches related to regulatory motifs or other properties of transcriptional networks. The decision rules are then determined by expression orderings. However, explicitly connecting statistical classification and molecular mechanism for cancer is a major, largely open, challenge. A more modest goal is to propose a potential statistical framework.
STRATEGY
Use (within sample) ranks to enhance robustness. Adapt models to sample size. Introduce bias to control variance. Bias towards potential mechanism. Hypothesis-driven learning?
NOTATION (I)
$G$: list of $d$ genes. $X = (X_1, \dots, X_d)$: expression profile. $Y \in \{1, 2, \dots, K\}$: classes or phenotypes. Data: $d \times n$ matrix of mRNA counts. May restrict $G$ to a network $m$ with $d_m$ genes.
NOTATION (II)
Order the expression values: $x_{\pi_1} < \cdots < x_{\pi_d}$. Let $r_i$ be the rank of gene $i$ in the ordering. Then $r = (r_1, \dots, r_d) \in \Omega_d$, the set of permutations of $\{1, \dots, d\}$, and $r = \pi^{-1}$. Thus, $x_i < x_j$ for two genes $i, j$ if and only if $r_i < r_j$. Replace $x \in \mathbb{R}^d$ by $r \in \Omega_d$. Define binary variables $z_{ij} = \mathbf{1}\{r_i < r_j\}$.
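As a small illustration of this notation, the sketch below computes the within-sample ranks and the pairwise comparison bits for one toy profile. The function names `ranks` and `comparison_string` and the toy values are assumptions made for this example, and ties are assumed not to occur.

```python
from itertools import combinations

def ranks(x):
    """Within-sample ranks r_i (1 = lowest expression); assumes no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # genes from lowest to highest value
    r = [0] * len(x)
    for position, gene in enumerate(order, start=1):
        r[gene] = position
    return r

def comparison_string(x):
    """Comparison bits z_ij = 1 if x_i < x_j, for all gene pairs i < j."""
    r = ranks(x)
    return {(i, j): int(r[i] < r[j]) for i, j in combinations(range(len(x)), 2)}

x = [5.2, 0.7, 3.1, 9.4]      # toy expression profile, d = 4 (hypothetical values)
r = ranks(x)
z = comparison_string(x)      # C(4, 2) = 6 comparison bits
```

Because only the ordering is used, any strictly increasing transformation of the profile (for example, a log transform) leaves `r` and `z` unchanged, which is the robustness argument for working with ranks.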
NOTATION (III)
Since gene expression is inherently stochastic, consider $x, r, z$ as realizations of random variables $X, R, Z$. Clearly, $R$ determines $Z = \{Z_{ij}\}$ and vice versa. $Z \in \{0, 1\}^{\binom{d}{2}}$, with $d!$ legitimate comparison strings.
Write $p(r \mid k) = P(R = r \mid Y = k)$, $r \in \Omega_d$, and $p(z \mid k) = P(Z = z \mid Y = k)$.
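For each pair of genes $i, j$, define a score $|\Delta_{ij}|$, where $\Delta_{ij} = P(Z_{ij} = 1 \mid Y = 1) - P(Z_{ij} = 1 \mid Y = 0)$, estimated from $\mathcal{L}$. Unique TSP $(i^*, j^*)$: $\hat{Y} = Z_{i^* j^*}$ if $\Delta_{i^* j^*} > 0$, or $\hat{Y} = 1 - Z_{i^* j^*}$ if $\Delta_{i^* j^*} < 0$. Maximizing the score maximizes the average of sensitivity and specificity, because it minimizes the sum of the two error rates: $1 - |\Delta_{ij}| = P_{\mathcal{L}}(\hat{Y} = 1 \mid Y = 0) + P_{\mathcal{L}}(\hat{Y} = 0 \mid Y = 1)$. For multiple TSPs, vote.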
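The top-scoring pair rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `tsp_scores`, `fit_tsp` and `predict_tsp` are hypothetical, the pairwise score is called `delta`, labels are 0/1, and the six toy profiles are invented.

```python
from itertools import combinations

def tsp_scores(profiles, labels):
    """Estimate delta_ij = P(x_i < x_j | Y=1) - P(x_i < x_j | Y=0) for all i < j."""
    d = len(profiles[0])
    n1 = sum(labels)
    n0 = len(labels) - n1
    delta = {}
    for i, j in combinations(range(d), 2):
        p1 = sum(x[i] < x[j] for x, y in zip(profiles, labels) if y == 1) / n1
        p0 = sum(x[i] < x[j] for x, y in zip(profiles, labels) if y == 0) / n0
        delta[(i, j)] = p1 - p0
    return delta

def fit_tsp(profiles, labels):
    """Return (i, j, delta_ij) for the pair with the largest |delta_ij|."""
    delta = tsp_scores(profiles, labels)
    i, j = max(delta, key=lambda pair: abs(delta[pair]))
    return i, j, delta[(i, j)]

def predict_tsp(model, x):
    """Predict Z_ij if delta_ij > 0, otherwise 1 - Z_ij."""
    i, j, d_ij = model
    z = int(x[i] < x[j])
    return z if d_ij > 0 else 1 - z

# Toy data: gene 0 is above gene 1 in class 0 and below it in class 1.
profiles = [[5, 1, 3], [6, 2, 7], [4, 1, 0],
            [1, 5, 3], [2, 6, 7], [1, 4, 0]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_tsp(profiles, labels)
predictions = [predict_tsp(model, x) for x in profiles]
```

On this toy set the selected pair separates the classes perfectly, so the score equals 1 and the sum of the two resubstitution error rates, 1 minus the score, is 0.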
$\mathbf{1}\{X_i < X_j\}$
Varying the threshold allows for trading off sensitivity and specificity.
CHOOSING k
Only crude measures of the separation between $P(g_k(X) \mid Y = 0)$ and $P(g_k(X) \mid Y = 1)$ can resist over-fitting. In particular, resubstitution error is less effective than a simple mean-variance criterion:
$$T_k := \frac{E(g_k(X) \mid Y = 0) - E(g_k(X) \mid Y = 1)}{\left[\mathrm{var}(g_k(X) \mid Y = 0) + \mathrm{var}(g_k(X) \mid Y = 1)\right]^{1/2}}$$
Given any $\Theta_k = \{(i_1, j_1), \dots, (i_k, j_k)\}$, choose $k$ to maximize $T_k$. The numerator is just the sum of the pair scores, $\sum_{(i,j) \in \Theta_k} \Delta_{ij}$ (up to the sign convention used to orient each pair), evidently maximized at $\Theta_k^*$. Since the denominator varies more slowly, our choice of $k$ and the gene pairs is roughly equivalent to maximizing $T_k$.
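The mean-variance criterion can be computed directly once a form for the statistic is fixed. The sketch below assumes that g_k(X) counts the pairs (i, j) among the k best-ranked pairs with X_i < X_j, that each pair is oriented so this event is more frequent in class 0, and that population variances are used; none of these choices is stated in the text above, and the names `g`, `t_statistic` and `choose_k` and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def g(pairs, x):
    """Assumed form of g_k: number of pairs (i, j) with x_i < x_j."""
    return sum(x[i] < x[j] for i, j in pairs)

def t_statistic(pairs, profiles, labels):
    """Difference of class means of g_k over the root of the summed class variances."""
    g0 = [g(pairs, x) for x, y in zip(profiles, labels) if y == 0]
    g1 = [g(pairs, x) for x, y in zip(profiles, labels) if y == 1]
    denom = (pvariance(g0) + pvariance(g1)) ** 0.5
    return (mean(g0) - mean(g1)) / denom if denom > 0 else float("inf")

def choose_k(ranked_pairs, profiles, labels, k_max):
    """Return the k in 1..k_max maximizing T_k, plus all T_k values."""
    scores = {k: t_statistic(ranked_pairs[:k], profiles, labels)
              for k in range(1, k_max + 1)}
    return max(scores, key=scores.get), scores

# Toy data: four genes, two candidate pairs already ranked by score.
ranked_pairs = [(0, 1), (2, 3)]
profiles = [[1, 2, 1, 2], [1, 2, 2, 1], [1, 2, 1, 2], [2, 1, 2, 1],   # class 0
            [2, 1, 2, 1], [2, 1, 1, 2], [2, 1, 2, 1], [1, 2, 1, 2]]   # class 1
labels = [0, 0, 0, 0, 1, 1, 1, 1]
best_k, scores = choose_k(ranked_pairs, profiles, labels, k_max=2)
```

In this toy case adding the second, noisier pair leaves the numerator unchanged but inflates the variance, so the criterion prefers a single pair.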
Comparisons with discriminative methods (SVM, PAM, k-NN, RF, naive Bayes) on standard cancer datasets:
Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics, 21, 3896-3904, 2005.
Specialized to prostate cancer: Robust prostate cancer marker genes discovered from direct integration of inter-study microarray data, Bioinformatics, 21, 3905-3911, 2005.
EXTERNAL VALIDATION
Journal excerpts shown on these slides:
Price, Trent, El-Naggar, Cogdell, Taylor, Hunt, Pollock, Hood, Shmulevich and Zhang, Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. A TSP analysis of 68 tumor samples yields a two-gene relative expression classifier (OBSCN and C9orf65) that distinguishes GIST from LMS with an accuracy of 99.3% on the microarray samples and an estimated accuracy of 97.8% on future cases. RT-PCR validation on 20 samples in the microarray study and on 19 independent samples gave 100% accuracy.
Raponi et al., A 2-gene classifier for predicting response to the farnesyltransferase inhibitor tipifarnib in acute myeloid leukemia.
Zhao, Logothetis and Gorlov, Usefulness of the top-scoring pairs of genes for prediction of prostate cancer progression. Relative expression of the TPD52L2/SQLE and CEACAM1/BRCA1 gene pairs identified patients with systemic tumor progression after radical prostatectomy with more than 99% specificity but relatively low sensitivity (about 10%). The two pairs were validated in three independent data sets.
Weichselbaum et al., An interferon-related gene signature for DNA damage resistance is a predictive marker for chemotherapy and radiation for breast cancer.
$G_1, G_2$: Two disjoint sets of genes of size $m$, the context. $\bar{G}_1, \bar{G}_2$: The median expression in $G_1, G_2$. Classification rule: $f(X) = \mathbf{1}\{\bar{G}_1 < \bar{G}_2\}$. Choose the context by maximizing the (apparent) accuracy $P(f(X) = Y)$.
Let $s(G_1, G_2) = |P(\bar{G}_1 < \bar{G}_2 \mid Y = 0) - P(\bar{G}_1 < \bar{G}_2 \mid Y = 1)|$. Then choose the context to maximize $s(G_1, G_2)$.
FINDING THE CONTEXT (I)
Exact optimization (for $m > 1$) is computationally impossible and would lead to massive overfitting anyway. Let $\bar{G}_1 = R_{\sigma_1}$, $\bar{G}_2 = R_{\sigma_2}$ (ranks are computed in $G_1 \cup G_2$), where $\sigma_1, \sigma_2$ index the genes attaining the two medians. Suppose: (i) $\{X_i < X_j\} \perp \{\sigma_1 = i, \sigma_2 = j\} \mid Y$ for each $i \in G_1$, $j \in G_2$; (ii) $(\sigma_1, \sigma_2)$ is uniformly distributed given $Y$. Then
$$P(\bar{G}_1 < \bar{G}_2 \mid Y) = \frac{1}{m^2} \sum_{i \in G_1,\, j \in G_2} P(X_i < X_j \mid Y).$$
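FINDING THE CONTEXT (II)
Both assumptions are true in practice. Consequently, with $\Delta_{ij}$ the difference between the two classes in $P(X_i < X_j \mid Y)$, $s(G_1, G_2) \propto \left| \sum_{i \in G_1,\, j \in G_2} \Delta_{ij} \right|$. Finally,
$$(G_1^*, G_2^*) = \arg\max_{G_1, G_2} \sum_{i \in G_1,\, j \in G_2} \Delta_{ij}.$$
This search is feasible either (i) exactly, but with gene filtering, for $m \le 5$; or (ii) greedily, adding one gene at a time, without gene filtering.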
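One way the greedy option could look is sketched below. The seeding rule (start from the single best ordered pair) and the alternation between the two sets are assumptions of this sketch, since the text only says that genes are added one at a time. The function names and the antisymmetric toy matrix `delta` are hypothetical.

```python
def context_score(delta, g1, g2):
    """Sum of delta[i][j] over i in G1 and j in G2."""
    return sum(delta[i][j] for i in g1 for j in g2)

def greedy_context(delta, m):
    """Greedily grow disjoint gene sets G1, G2 of size m each (needs at least 2m genes)."""
    d = len(delta)
    # Seed with the best ordered pair (i, j).
    i0, j0 = max(((i, j) for i in range(d) for j in range(d) if i != j),
                 key=lambda p: delta[p[0]][p[1]])
    g1, g2 = [i0], [j0]
    while len(g1) < m or len(g2) < m:
        used = set(g1) | set(g2)
        free = [g for g in range(d) if g not in used]
        if len(g1) <= len(g2):
            # Add to G1 the free gene that most increases the score.
            g1.append(max(free, key=lambda g: sum(delta[g][j] for j in g2)))
        else:
            # Add to G2 the free gene that most increases the score.
            g2.append(max(free, key=lambda g: sum(delta[i][g] for i in g1)))
    return g1, g2

# Toy antisymmetric score matrix for four genes: delta[j][i] = -delta[i][j].
delta = [[0.0, 0.2, 0.9, 0.6],
         [-0.2, 0.0, 0.5, 0.7],
         [-0.9, -0.5, 0.0, 0.1],
         [-0.6, -0.7, -0.1, 0.0]]
g1, g2 = greedy_context(delta, m=2)
total = context_score(delta, g1, g2)
```

Each step costs one pass over the unused genes, so no prior gene filtering is needed, at the price of possibly missing the exact optimum.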
CLASSIFICATION RESULTS
PERTURBED NETWORKS
Diseased cells arise from aberrant activity in cellular signaling, and pathways are the fundamental scale of many cancer processes. These aberrations cannot be identified from phenotypic information typically measured in the clinic. Moreover, they are the net effect of interactions among multiple molecular agents. Generally, network analyses do not account for combinatorial (multi-way) interactions among genes or gene products, and do not quantify de-regulation.
BABY ILLUSTRATION
SWAP DISTANCE
A distance between permutations $\pi$ and $\pi'$ of $\{1, \dots, d\}$. $D(\pi, \pi')$: the minimum number of adjacent swaps needed to transform $\pi$ into $\pi'$. Example: $D((3, 1, 2, 4), (1, 2, 3, 4)) = 2$.
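The swap distance can be computed without simulating swaps, because the minimum number of adjacent swaps equals the number of item pairs that appear in opposite relative order in the two sequences. The sketch below uses that standard fact; the function name `swap_distance` is an assumption of this example.

```python
from itertools import combinations

def swap_distance(pi, sigma):
    """Minimum number of adjacent swaps turning sequence pi into sigma.

    Both arguments are orderings of the same items. The result is the number
    of item pairs that occur in opposite relative order (inversions).
    """
    position = {item: idx for idx, item in enumerate(sigma)}
    relabeled = [position[item] for item in pi]   # pi expressed in sigma's coordinates
    return sum(a > b for a, b in combinations(relabeled, 2))

example = swap_distance((3, 1, 2, 4), (1, 2, 3, 4))
```

For the example in the text, the discordant pairs are (3, 1) and (3, 2), which gives a distance of 2.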
PATHWAY VARIABLES
Consider a network $m$ with $d_m$ genes. Let $\pi = (\pi_1, \dots, \pi_{d_m})$ be the order statistics for $x = (x_1, \dots, x_{d_m})$: $x_{\pi_1} < x_{\pi_2} < \cdots < x_{\pi_{d_m}}$. Let $D(x, x')$ be the swap distance between $\pi(x)$ and $\pi(x')$. Then $D(x, x')$ is also the Hamming distance between $z(x)$ and $z(x')$, the corresponding comparison strings, that is, the number of gene pairs whose order differs between the two profiles.
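ORDER INDEX
Fix a phenotype $k$ and let $X$ and $X'$ be i.i.d. expression profiles under $p(x \mid k)$. Define the Order Index:
$$\rho(k, m) = 1 - \frac{1}{\binom{d_m}{2}} E[D(X, X')].$$
Then it is easy to show that
$$\rho(k, m) = \frac{1}{\binom{d_m}{2}} \sum_{i < j} \left[ P(X_i < X_j \mid k)^2 + P(X_i > X_j \mid k)^2 \right].$$
$.5 \le \rho \le 1$, but generally $\rho \gg .5$ since there are many gene pairs expressed on different scales. $\rho(k, m) \ll 1$: A highly disorganized system.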
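A plug-in estimate of the order index replaces the expectation by an average over all pairs of distinct samples from one phenotype. The sketch below follows that assumption; the function names and the toy profiles are hypothetical, and the swap distance is computed as the number of gene pairs ordered differently in the two profiles.

```python
from itertools import combinations
from math import comb

def profile_swap_distance(x, y):
    """Number of gene pairs (i, j) ordered differently in profiles x and y."""
    return sum((x[i] < x[j]) != (y[i] < y[j])
               for i, j in combinations(range(len(x)), 2))

def order_index(profiles):
    """Estimate 1 - E[D(X, X')] / C(d, 2) from all pairs of distinct samples."""
    d = len(profiles[0])
    dists = [profile_swap_distance(x, y) for x, y in combinations(profiles, 2)]
    return 1 - (sum(dists) / len(dists)) / comb(d, 2)

ordered = [[1, 2, 3], [2, 5, 9], [0, 1, 4]]            # same ordering in every sample
mixed = [[1, 2, 3], [1, 3, 2], [1, 2, 3], [1, 3, 2]]   # one gene pair flips between samples
rho_ordered = order_index(ordered)
rho_mixed = order_index(mixed)
```

A network whose genes keep the same ordering in every sample scores 1, and the estimate drops as more gene pairs change order from sample to sample.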
EXAMPLES
In the Death network, for prostate tissue, the order index is $\rho(\text{normal}) = 0.924$ and $\rho(\text{metastatic}) = 0.823$. The difference is highly significant ($p < .001$). Overall, 75 networks have significant differences in $\rho$, which is usually smaller in metastatic tumors.
DE-REGULATION IN DISEASE
A general trend emerges: when pairs of phenotypes represent gradations of disease, the order index is usually smaller in the more malignant one when there is a significant difference. In the following plots, each point represents a pair $(\rho(A, m), \rho(B, m))$ for a network $m$, where $A$ is more malignant than $B$.
Fix a context $G$ (set of genes). Let $D_G$ be the swap distance restricted to $G$. Classify by nearest-neighbor in $\mathcal{L}$. Choose $G$ so that the distance $D_G(X, X')$ between independent samples $X, X'$ is
Large if $X, X'$ are from different classes; Small if from the same class.
ON
Fix ten genes (e.g., the five top-scoring pairs). Let $x$ be the expression profile and $r \in \Omega_{10}$ the rank vector, a permutation of $\{1, \dots, 10\}$. Construct two distributions $p(r \mid \text{good})$ and $p(r \mid \text{poor})$ by maximizing entropy subject to fixing all $\binom{10}{2} = 45$ pairwise comparison probabilities. Use Iterative Projection to learn the parameters. With $d = 10$, everything can be computed, including normalizing constants and entropies.
MORE FORMALLY
Let $q$ be a probability distribution on $\Omega_{10}$, and let $p_{\mathcal{L}}$ be the empirical distribution on $\mathcal{L}$. For $k \in \{\text{poor}, \text{good}\}$:
$$p(\cdot \mid k) = \arg\max_q H(q) \quad \text{s.t.} \quad \forall\, i < j:\ q(r : r_i < r_j) = p_{\mathcal{L}}(r : r_i < r_j \mid k).$$
Classify sample $x$ as poor if
$$\frac{p(r(x) \mid \text{poor})}{p(r(x) \mid \text{good})} > \tau.$$
For $\tau = 1$, 70% sensitivity and 64% specificity (overall 66%). Varying $\tau$ trades off sensitivity and specificity. Entropies are $H = 14.22$ (good), $H = 17.45$ (poor), $H = 21.79$ (uniform).
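The reported uniform entropy can be checked by hand: a uniform distribution over the 10! orderings of ten genes has entropy log2(10!) bits, which suggests that the entropies above are measured in bits. The short check below also computes how far each fitted distribution sits below the uniform one; the variable names are assumptions of this example.

```python
from math import factorial, log2

# Entropy (in bits) of the uniform distribution over all orderings of ten genes.
uniform_entropy_bits = log2(factorial(10))

# Gap between the uniform entropy and the entropies reported for each class.
gap_good = uniform_entropy_bits - 14.22
gap_poor = uniform_entropy_bits - 17.45
```

The good-prognosis distribution is further from uniform than the poor-prognosis one, in line with the theme that more malignant phenotypes show less conserved orderings.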
METASTATIC CANCER
Cancer is an acquired genetic disorder due to the accumulation over time of DNA alterations that lead to uncontrolled cell growth and proliferation. Ninety percent of deaths result from metastasis, meaning that cancer cells break away and migrate to distant organs. By lodging in other organs they replace normal cells until the organ no longer functions.
TUMOR SITE OF ORIGIN
In approximately 4% of cancers, a metastatic tumor is found of unknown primary origin (Hillen, 2000). However, the appropriate treatment depends on the tissue of origin. The GEO or Gene Expression Omnibus (Barrett et al., 2006) contains 16,715 tumor samples from 20 sites of origin for the most popular platform. Objective: Build a classifier for distinguishing among the 20 sites of origin and validate it with cross-study error estimation.
Systematic variation across samples is highly correlated with date, lab, etc. Especially problematic when batch labels are confounded with class label. Affects not only the patterns of expression of individual genes, but in fact the entire dependency structure, including correlations.
BATCH EFFECTS
Samples from the same phenotype but different dates, labs, etc. display systematic differences in the distribution of individual genes and dependency structure.
Figure: The fraction of significantly correlated gene pairs for which the sign reverses between pairs of batches.
STUDY EFFECTS
Within class, but across studies, there are differences due to age, location, etc., as well as platform and mRNA storage/extraction methods. Combined with batch effects, samples from different studies are not even approximately identically distributed. Must take this into account in estimating generalization error. The consequences of confounding, batch effects and study effects make cross-study validation, as opposed to ordinary cross-validation, imperative.
Overall accuracy is a poor measure of utility with major class imbalance in training. Instead use Mean Class Conditional Accuracy (MCCA). Generalizes the average of sensitivity and specificity to multiclass. Take the average of $\{P(F(X) = y \mid Y = y)\}$ over the $L$ classes $y = 1, \dots, L$.
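MCCA is simply the unweighted mean of the per-class accuracies. The sketch below shows the computation and how it differs from overall accuracy under class imbalance; the function names and the toy labels are assumptions of this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcca(y_true, y_pred):
    """Mean class-conditional accuracy: average over classes of P(F(X) = y | Y = y)."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [t for t, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[t] == c for t in idx) / len(idx))
    return sum(per_class) / len(per_class)

# Imbalanced toy example: a classifier that always answers the majority class.
y_true = ["colon"] * 8 + ["lung"] * 2
y_pred = ["colon"] * 10
acc = accuracy(y_true, y_pred)
score = mcca(y_true, y_pred)
```

The majority-class classifier reaches 80% overall accuracy but only 50% MCCA, which is why MCCA is the more informative figure when sites of origin are unevenly represented.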
METHODS OF ESTIMATING ACCURACY
Resubstitution: Validate on $\mathcal{L}$, the training data. Strong optimistic bias.
Holdout: Randomly partition data into training and validation. Still optimistic because training and validation are identically distributed.
Cross-validation: Still optimistic for same reason.
Cross-Study Validation: Validate on a different study, done in a different lab than the training study. Higher bar, but the gold standard.
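The difference between ordinary cross-validation and cross-study validation lies entirely in how the samples are split. One possible scheme, assumed here for illustration, holds out one whole study at a time so that no study contributes to both training and validation. The function name `cross_study_splits` and the toy study labels are hypothetical.

```python
def cross_study_splits(study_ids):
    """Yield (held_out_study, train_idx, valid_idx), holding out one whole study at a time."""
    for study in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != study]
        valid = [i for i, s in enumerate(study_ids) if s == study]
        yield study, train, valid

# Toy example: six samples drawn from three studies.
study_ids = ["A", "A", "B", "B", "B", "C"]
splits = list(cross_study_splits(study_ids))
```

A random holdout or k-fold split would mix samples from the same study into both sides, which is the source of the optimistic bias described above.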
Figure: Schematic of cross-study validation across Study 1, Study 2 and Study 3, with separate studies used for training and for validation.
DECISION TREES OF COMPARISONS
Goal: Generalize kTSP and related algorithms to multiclass problems. Build decision trees with comparison questions: Is gene i more highly expressed than gene j? With the site of origin data, can build trees with depth up to fifteen queries.
TREE OF COMPARISONS
Is ABCD > EFGH?
...
Is GHKL > MNPQ?
...
Colon (35/40)
...
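A tree of comparisons like the one sketched above can be represented and evaluated in a few lines. In this illustration each internal node is a tuple (gene_a, gene_b, subtree_if_yes, subtree_if_no) and each leaf is a site-of-origin label. The gene names are the placeholders from the slide, while the leaves "Lung" and "Breast", the tree shape and the expression values are invented for the example.

```python
def classify(tree, expr):
    """Walk a tree of pairwise comparisons until a leaf label is reached."""
    while not isinstance(tree, str):
        gene_a, gene_b, if_yes, if_no = tree
        # Query: is gene_a more highly expressed than gene_b?
        tree = if_yes if expr[gene_a] > expr[gene_b] else if_no
    return tree

# Hypothetical two-level tree using the placeholder gene names from the slide.
toy_tree = ("ABCD", "EFGH",
            ("GHKL", "MNPQ", "Colon", "Lung"),
            "Breast")

sample = {"ABCD": 7.1, "EFGH": 2.3, "GHKL": 5.0, "MNPQ": 1.2}
site = classify(toy_tree, sample)
```

Every decision depends only on the ordering of two expression values within one sample, so the tree inherits the invariance of two-gene comparisons to monotone normalization.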
REDUCING DIVERSITY AND SAMPLE SIZE
Reducing Diversity: Train on largest study for each site. Test on the rest. Accuracy = 85.8%, MCCA = 74.0%. Reducing n: Keep only 10 samples per study-site of origin pair. Notice that n is smaller for every site of origin.
Goal: Quantify how much batch/study effects reduce accuracy, MCCA. Randomize study labels within each phenotype. After shuffling study labels: Accuracy = 98.6%, MCCA = 96.1%. 8 points of MCCA lost to batch/study effects.
CONCLUSIONS
Accuracy should be demonstrated cross-study. Sample diversity is more important than sample size.