
STATISTICAL LEARNING IN CANCER BIOLOGY: LECTURE 7

Donald Geman, Michael Ochs, Laurent Younes


Johns Hopkins University

ENS-Cachan February 27, 2013

LECTURE SERIES

Lecture 1: Introduction (DG)
Lecture 2: Cancer Biology (MO)
Lecture 3: Cell Signaling Inference (MO)
Lecture 4: Genetic Variation (DG)
Lecture 5: Massive Testing (LY)
Lecture 6: Biomarker Discovery (LY)
Lecture 7: Phenotype Prediction (DG)
Lecture 8: Embedding Mechanism (DG)

OUTLINE

Biology and Statistical Learning
Predicting from Comparisons
Pathway De-regulation
Breast Cancer Prognosis
Metastatic Cancer

RECAP

Statistical methods for analyzing cancer data permeate the literature. Prominent examples examined in previous lectures include:
Modeling the accumulation of driver mutations during tumorigenesis;
Identifying perturbed signaling in tumor cells;
Discovering risk-bearing DNA sequence variation; and
Finding differentially expressed genes and gene products.

The final two lectures are about learning classifiers that can distinguish between cellular phenotypes from mRNA transcript levels collected from cells in assayed tissue.

BIOLOGICAL RATIONALE

In cancer, malignant phenotypes arise from the net effect of interactions among multiple genes and other molecular agents within biological networks. The resulting perturbations in signaling pathways can be detected and quantified with mRNA concentrations. Statistical learning can serve as a basis for:
Detecting disease (e.g., tumor vs normal);
Discriminating among cancer sub-types (e.g., GIST vs LMS or BRCA1 mutation vs no BRCA1 mutation);
Predicting outcomes (e.g., poor prognosis vs good prognosis).

STATISTICAL LEARNING (I)

X: High-throughput genomic data.
The traditional approach (experimental and molecule-by-molecule) is not feasible at this scale.
A principled approach is required to extract knowledge from X.
Statistical learning has emerged as a core methodology for the analysis of X.

STATISTICAL LEARNING (II)

Training set: $L = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$.

$x^{(i)} \in \mathbb{R}^d$: mRNA expression profile for sample $i$;
$y^{(i)} \in \{1, 2, \dots, K\}$: cellular phenotype of sample $i$.

Standard Goals: Learn a predictor $f : \mathbb{R}^d \to \{1, \dots, K\}$ or class-conditional model $p(x|k)$ from $L$.
Less Standard: Develop statistical metrics and models for regulation and mechanism.

BARRIERS (I)

Applications to biomedicine, specifically the implications for clinical practice, are widely acknowledged to remain limited. One major barrier is the study-to-study diversity in reported prediction accuracies and signatures (lists of discriminating genes). Some of this variation can be attributed to the over-fitting that results from the infamous "small n, large d" dilemma. Typically, the number of samples (chips, profiles, patients) per class is n = 10 to 1000 whereas the number of features (exons, transcripts, genes) is d = 1000 to 50,000.

SOME PUBLIC MICROARRAY DATASETS

ID    Study        Class 0 (size)       Class 1 (size)           Probes d
D1    Colon        Normal (22)          Tumor (40)               2000
D2    BRCA1        non-BRCA1 (93)       BRCA1 (25)               1658
D3    CNS          Classic (25)         Desmoplastic (9)         7129
D4    DLBCL        DLBCL (58)           FL (19)                  7129
D5    Lung         Mesothelioma (150)   ADCS (31)                12533
D6    Marfan       Normal (41)          Marfan (60)              4123
D7    Crohn's      Normal (42)          Crohn's (59)             22283
D8    Sarcoma      GIST (37)            LMS (31)                 43931
D9    Squamous     Normal (22)          Head-Neck Cancer (22)    12625
D10   GCM          Normal (90)          Tumor (190)              16063
D11   Leukemia 1   ALL (25)             AML (47)                 7129
D12   Leukemia 2   AML1 (24)            AML2 (24)                12564
D13   Leukemia 3   ALL (710)            AML (501)                19896
D14   Leukemia 4   Normal (138)         AML (403)                19896
D15   Prostate 1   Normal (50)          Tumor (52)               12600
D16   Prostate 2   Normal (38)          Tumor (50)               12625
D17   Prostate 3   Normal (9)           Tumor (24)               12626
D18   Prostate 4   Normal (25)          Primary (65)             12619
D19   Prostate 5   Primary (25)         Metastatic (65)          12558
D20   Breast 1     ER-positive (61)     ER-negative (36)         16278
D21   Breast 2     ER-positive (127)    ER-negative (80)         9760

BARRIERS (II)

However, complex decision rules are perhaps the central obstacle to mature applications. The methods applied were usually designed for other purposes and with little emphasis on transparency. Specifically, the rules generated by nearly all standard, off-the-shelf techniques applied to genomics data, such as boosting, neural networks, multiple decision trees, support vector machines, and linear discriminant analysis, usually involve nonlinear functions of hundreds or thousands of genes, and a great many parameters.

BARRIERS (III)

In contrast, follow-up studies, for instance independent validation or therapeutic development, are usually based on a relatively small number of biomarkers assayed with high-resolution methods such as RT-PCR. This usually also requires an understanding of the role of the genes and gene products in the context of molecular pathways. Ideally, the decision rules could be interpreted mechanistically, for instance in terms of transcriptional regulation, and be robust with respect to parameter settings.

BARRIERS (IV)

Consequently, standard decision rules are too complex to characterize biologically. Moreover, what is notably missing is a solid link with potential mechanism, which seems to be a necessary condition for translational medicine, i.e., drug development and clinical decision-making.

ACCURACY AND CONTEXT

Needless to say, accuracy is also necessary. But the accuracy of many of the methods mentioned above is already high enough to be of potential clinical value for many important phenotype distinctions. Also, it is now common to follow methodological development with a biological story about the genes appearing in the support (signature) of the classifier, e.g., an enrichment analysis. However, this does not substitute for providing a potential mechanistic characterization of the decision rules in terms of biochemical interactions or specific regulatory motifs.

PROPOSED FRAMEWORK

Translational objectives, and small-sample issues, argue for limiting the number of parameters and introducing strong biases. The two principal objectives for the family of classifiers described below are:
Use elementary and parameter-free building blocks to assemble a classifier which is determined by its support.
Demonstrate that these can be as discriminating as those that emerge from the most powerful methods in statistical learning.

EXPRESSION ORDERING

The building blocks we choose are two-gene comparisons, regarded as biological switches related to regulatory motifs or other properties of transcriptional networks. The decision rules are then determined by expression orderings. However, explicitly connecting statistical classification and molecular mechanism for cancer is a major, largely open, challenge. A more modest goal is to propose a potential statistical framework.

STRATEGY

Use (within sample) ranks to enhance robustness.
Adapt models to sample size.
Introduce bias to control variance.
Bias towards potential mechanism.
Hypothesis-driven learning?

NOTATION (I)

$G$: list of $d$ genes.
$X = (X_1, \dots, X_d)$: expression profile.
$Y \in \{1, 2, \dots, K\}$: classes or phenotypes.
Data: $d \times n$ matrix of mRNA counts.
May restrict $G$ to a network $m$ with $d_m$ genes.

NOTATION (II)

Order the expression values: $x_{\pi_1} < \cdots < x_{\pi_d}$.
Let $r_i$ be the rank of gene $i$ in the ordering.
Then $r = (r_1, \dots, r_d) \in \Omega_d$, the set of permutations of $\{1, \dots, d\}$, and $r = \pi^{-1}$.
Thus, $x_i < x_j$ for two genes $i, j$ if and only if $r_i < r_j$.
Replace $x \in \mathbb{R}^d$ by $r \in \Omega_d$.
Define binary variables $z_{ij} = \mathbf{1}(r_i < r_j)$.
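As a minimal sketch of this notation, the following Python computes the rank vector $r$ and the comparison variables $z_{ij}$ for one profile. The helper names `ranks` and `comparison_string` and the toy numbers are illustrative assumptions, not part of the lecture, and ties are assumed absent. The last lines illustrate why ranks enhance robustness: a monotone transformation (here a logarithm) leaves both $r$ and $z$ unchanged.

```python
import math
from itertools import combinations

def ranks(x):
    """Return r with r[i] = 1-based rank of x[i] within the profile (no ties assumed)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for position, gene in enumerate(order, start=1):
        r[gene] = position
    return r

def comparison_string(x):
    """Return {(i, j): 1 if x[i] < x[j] else 0} for all gene pairs i < j."""
    return {(i, j): int(x[i] < x[j]) for i, j in combinations(range(len(x)), 2)}

# Hypothetical expression profile for d = 4 genes (0-based gene indices).
profile = [5.2, 0.3, 9.1, 2.4]
transformed = [math.log(v) for v in profile]  # monotone change of scale

same_ranks = ranks(profile) == ranks(transformed)
same_comparisons = comparison_string(profile) == comparison_string(transformed)
```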

NOTATION (III)

Since gene expression is inherently stochastic, consider $x, r, z$ as realizations of random variables $X, R, Z$.
Clearly, $R$ determines $Z = \{Z_{ij}\}$ and vice versa.
$Z \in \{0, 1\}^{\binom{d}{2}}$, with $d!$ legitimate comparison strings.
Write $p(r|k) = P(R = r \mid Y = k)$, $r \in \Omega_d$, and $p(z|k) = P(Z = z \mid Y = k)$.

EVEN ONE $Z_{ij}$ CAN BE DISCRIMINATING

TSP: Differentiate between two phenotypes by finding a pair of genes whose ordering typically reverses (Stat. Appl. in Genetics and Molecular Biology, 3, 2004).

For each pair of genes $i, j$, define a score $|\Delta_{ij}|$, where
$$\Delta_{ij} = P(Z_{ij} = 1 \mid Y = 1) - P(Z_{ij} = 1 \mid Y = 0),$$
estimated from $L$.
Unique TSP $(i^*, j^*)$: $\hat{Y} = Z_{i^* j^*}$ (if $\Delta_{i^* j^*} > 0$) or $\hat{Y} = 1 - Z_{i^* j^*}$ (if $\Delta_{i^* j^*} < 0$).
Maximizing the score maximizes the average of sensitivity and specificity, i.e., it minimizes the sum of the two error rates:
$$1 - |\Delta_{ij}| = P_L(\hat{Y} = 1 \mid Y = 0) + P_L(\hat{Y} = 0 \mid Y = 1).$$
For multiple TSPs, vote.
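The scoring step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names `tsp_scores` and `tsp_predict` are hypothetical, the toy training set is made up, probabilities are estimated by simple class-wise frequencies, and genes are indexed from 0.

```python
from itertools import combinations

def tsp_scores(profiles, labels):
    """Estimate Delta_ij = P(X_i < X_j | Y=1) - P(X_i < X_j | Y=0) for all i < j."""
    class0 = [x for x, y in zip(profiles, labels) if y == 0]
    class1 = [x for x, y in zip(profiles, labels) if y == 1]
    deltas = {}
    for i, j in combinations(range(len(profiles[0])), 2):
        p1 = sum(x[i] < x[j] for x in class1) / len(class1)
        p0 = sum(x[i] < x[j] for x in class0) / len(class0)
        deltas[(i, j)] = p1 - p0
    return deltas

def tsp_predict(x, pair, delta):
    """Predict Z_ij if Delta_ij > 0, and 1 - Z_ij otherwise."""
    i, j = pair
    z = int(x[i] < x[j])
    return z if delta > 0 else 1 - z

# Toy training set (made-up numbers): three samples per class, three genes.
profiles = [[1, 5, 3], [2, 6, 4], [1, 7, 0.5],   # class 0
            [9, 4, 5], [8, 3, 6], [7, 2, 1]]     # class 1
labels = [0, 0, 0, 1, 1, 1]

deltas = tsp_scores(profiles, labels)
best_pair = max(deltas, key=lambda pair: abs(deltas[pair]))  # top-scoring pair
predictions = [tsp_predict(x, best_pair, deltas[best_pair]) for x in profiles]
```

In this toy example genes 0 and 1 reverse their ordering perfectly between the two classes, so the score is 1 and the sum of the two apparent error rates is 0.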

A NO FREE LUNCH ERROR BOUND

$L = \{(x_1, y_1), \dots, (x_{n_1+n_2}, y_{n_1+n_2})\}$: training set.
$T_1 = \{(i_1, j_1), \dots, (i_M, j_M)\}$: TSPs for $L$.
$E_m = \{1 \le s \le n_1 + n_2 : \text{TSP } (i_m, j_m) \text{ errs on } s\}$.
$E = \bigcup_m E_m$: samples incorrectly classified by at least one TSP.
$e_{cv}$: LOOCV error rate.
$e_{app}(f)$: apparent error rate of the TSP classifier.
THEOREM: Any sample $s \in E$ is erroneously classified during LOOCV. In particular,
$$e_{app}(f) \le \frac{|E|}{n_1 + n_2} \le e_{cv}.$$

k TOP SCORING PAIRS

Base prediction on the $k$ highest scoring pairs: $\Theta_k^* = \{(i_1^*, j_1^*), \dots, (i_k^*, j_k^*)\}$.
More generally, the natural discriminant is
$$g_k(X; \Theta_k) = \sum_{(i,j) \in \Theta_k} \mathbf{1}\{X_i < X_j\}.$$
The k-TSP classifier is majority voting:
$$f(X) = \mathbf{1}\{g_k(X; \Theta_k) > k/2\}.$$
Varying the threshold allows for trading off sensitivity and specificity.
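The voting rule can be sketched as follows. The names `ktsp_discriminant` and `ktsp_predict` are illustrative, and the sketch assumes each pair $(i, j)$ in the context has already been oriented so that $X_i < X_j$ is a vote for class 1; the pairs and the sample are made up.

```python
def ktsp_discriminant(x, pairs):
    """g_k(x): number of context pairs (i, j) with x[i] < x[j]."""
    return sum(x[i] < x[j] for i, j in pairs)

def ktsp_predict(x, pairs, threshold=None):
    """Majority vote by default (threshold k/2); other thresholds shift the operating point."""
    if threshold is None:
        threshold = len(pairs) / 2
    return int(ktsp_discriminant(x, pairs) > threshold)

# Hypothetical context of k = 3 disjoint pairs and one hypothetical sample.
context = [(0, 1), (2, 3), (4, 5)]
sample = [1.0, 2.0, 5.0, 3.0, 0.5, 9.0]

votes = ktsp_discriminant(sample, context)           # two of three pairs vote for class 1
majority_label = ktsp_predict(sample, context)       # threshold k/2 = 1.5
strict_label = ktsp_predict(sample, context, 2.5)    # stricter threshold flips the call
```

Raising the threshold above $k/2$ makes class 1 harder to call, which is one way to trade sensitivity for specificity.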

CHOOSING k

Only crude measures of the separation between $P(g_k(X) \mid Y = 0)$ and $P(g_k(X) \mid Y = 1)$ can resist over-fitting. In particular, resubstitution error is less effective than a simple mean-variance criterion:
$$T_k := \frac{E(g_k(X) \mid Y = 0) - E(g_k(X) \mid Y = 1)}{[\mathrm{var}(g_k(X) \mid Y = 0) + \mathrm{var}(g_k(X) \mid Y = 1)]^{1/2}}.$$

Given any $\Theta_k = \{(i_1, j_1), \dots, (i_k, j_k)\}$, choose $\Theta_k$ to maximize $T_k$. The numerator is, up to the orientation of each pair, just $\sum_{(i,j) \in \Theta_k} \Delta_{ij}$, evidently maximized at $\Theta_k^*$. Since the denominator varies more slowly, our choice of $k$ and the gene pairs $\Theta_k^*$ is roughly equivalent to maximizing $T_k$.
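A small sketch of the mean-variance criterion, under assumptions that are mine rather than the lecture's: the function names are hypothetical, variances are population variances computed within each class, the numerator is taken in absolute value so that the result does not depend on how the pairs are oriented, and a zero denominator with a nonzero numerator is treated as infinite separation.

```python
import math
from statistics import mean, pvariance

def discriminant(x, pairs):
    """g_k(x): number of pairs (i, j) with x[i] < x[j]."""
    return sum(x[i] < x[j] for i, j in pairs)

def separation_score(profiles, labels, pairs):
    """Plug-in estimate of |T_k| for a given list of pairs."""
    g0 = [discriminant(x, pairs) for x, y in zip(profiles, labels) if y == 0]
    g1 = [discriminant(x, pairs) for x, y in zip(profiles, labels) if y == 1]
    numerator = abs(mean(g0) - mean(g1))
    denominator = math.sqrt(pvariance(g0) + pvariance(g1))
    if denominator == 0:
        return math.inf if numerator > 0 else 0.0
    return numerator / denominator

def choose_k(profiles, labels, ranked_pairs, candidates):
    """Pick k among the candidates by maximizing the score of the top-k pairs."""
    return max(candidates,
               key=lambda k: separation_score(profiles, labels, ranked_pairs[:k]))

# Toy data (made up): the pair (0, 1) separates the classes, the pair (2, 3) is noise.
profiles = [[1, 2, 3, 4], [1, 2, 4, 3], [2, 1, 4, 3], [2, 1, 3, 4]]
labels = [0, 0, 1, 1]
ranked_pairs = [(0, 1), (2, 3)]  # assumed already sorted by |Delta_ij|

score_two_pairs = separation_score(profiles, labels, ranked_pairs)
best_k = choose_k(profiles, labels, ranked_pairs, candidates=[1, 2])
```

Adding the uninformative second pair leaves the difference in means unchanged but inflates the variance, so the criterion prefers $k = 1$ here.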

FURTHER HOMEGROWN DEVELOPMENTS

Comparisons with discriminative methods (SVM, PAM, k-NN, RF, naive Bayes) on standard cancer datasets:
Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics, 21, 3896-3904, 2005.

Specialized to prostate cancer:
Robust prostate cancer marker genes discovered from direct integration of inter-study microarray data, Bioinformatics, 21, 3905-3911, 2005.

EXTERNAL VALIDATION

Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas
Nathan D. Price*, Jonathan Trent, Adel K. El-Naggar, David Cogdell, Ellen Taylor, Kelly K. Hunt, Raphael E. Pollock, Leroy Hood*, Ilya Shmulevich*, and Wei Zhang
*Institute for Systems Biology, Seattle, WA 98103; and Departments of Sarcoma Medical Oncology, Pathology, and Surgical Oncology, University of Texas M. D. Anderson Cancer Center, Houston, TX 77030. Contributed by Leroy Hood, December 28, 2006 (sent for review November 29, 2006)

Abstract: Gastrointestinal stromal tumor (GIST) has emerged as a clinically distinct type of sarcoma with frequent overexpression and mutation of the c-Kit oncogene and a favorable response to imatinib mesylate [also known as STI571 (Gleevec)] therapy. However, a significant diagnostic challenge remains in the differentiation of GIST from leiomyosarcomas (LMSs). To improve on the diagnostic evaluation and to complement the immunohistochemical evaluation of these tumors, we performed a whole-genome gene expression study on 68 well characterized tumor samples. Using bioinformatic approaches, we devised a two-gene relative expression classifier that distinguishes between GIST and LMS with an accuracy of 99.3% on the microarray samples and an estimated accuracy of 97.8% on future cases. We validated this classifier by using RT-PCR on 20 samples in the microarray study and on an additional 19 independent samples, with 100% accuracy. Thus, our two-gene relative expression classifier is a highly accurate diagnostic method to distinguish between GIST and LMS and has the potential to be rapidly implemented in a clinical setting.

Excerpt: We chose to use a supervised top scoring pair (TSP) analysis (13, 14), which finds pairs where the relative expression of a gene pair is reversed between the two cancers. This method is advantageous because it provides the simplest possible classifier that is independent of data normalization, helps to avoid overfitting, and results in a very simple experimental test that is easy to implement in the clinic. We identified a single gene set (OBSCN and C9orf65) that accurately classified GIST from LMS with an estimated accuracy by using leave-one-out cross-validation of 97.8% on future cases on the basis of the microarray data and 19 of 19 additional cases correctly diagnosed using RT-PCR. We conclude that this two-gene set provides a rapid, PCR-based assay that reliably distinguishes GIST from LMS and has the potential to aid in diagnosis and in the selection of appropriate therapies. The use of marker pairs based on relative expression reversals that are independent of data normalization holds tremendous promise as a method for the development of clinically relevant biomarkers.

A 2-gene classifier for predicting response to the farnesyltransferase inhibitor tipifarnib in acute myeloid leukemia (Clinical Trials and Observations)
Mitch Raponi, Jeffrey E. Lancet, Hongtao Fan, Lesley Dossey, Grace Lee, Ivana Gojo, Eric J. Feldman, Jason Gotlib, Lawrence E. Morris, Peter L. Greenberg, John J. Wright, Jean-Luc Harousseau, Bob Löwenberg, Richard M. Stone, Peter De Porre, Yixin Wang, and Judith E. Karp

EXTERNAL VALIDATION (CONT)

Usefulness of the top-scoring pairs of genes for prediction of prostate cancer progression
H Zhao, CJ Logothetis and IP Gorlov
Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
Prostate Cancer and Prostatic Diseases (2010), 1-8. © 2010 Nature Publishing Group. www.nature.com/pcan

Abstract: Prediction of cancer progression after radical prostatectomy is one of the most challenging problems in the management of prostate cancer. Gene-expression profiling is widely used to identify genes associated with such progression. Usually candidate genes are identified according to a gene-by-gene comparison of expression. Recent reports suggested that relative expression of a gene pair more efficiently predicts cancer progression than single-gene analysis does. The top-scoring pair (TSP) algorithm classifies phenotypes according to the relative expression of a pair of genes. We applied the TSP approach to predict which patients would experience systemic tumor progression after radical prostatectomy. Relative expression of TPD52L2/SQLE and CEACAM1/BRCA1 gene pairs identified those patients with more than 99% specificity but relatively low sensitivity (~10%). These two gene pairs were validated in three independent data sets. In addition, combining two pairs of genes improved sensitivity without compromising specificity. Functional annotation of the TSP genes showed that they cluster by a limited number of biological functions and pathways, suggesting that relatively lower expression of genes from specific pathways can predict cancer progression. In conclusion, comparative analysis of the expression of two genes may be a simple and effective classifier ...

An interferon-related gene signature for DNA damage resistance is a predictive marker for chemotherapy and radiation for breast cancer
Ralph R. Weichselbaum, Hemant Ishwaran, Taewon Yoon, Dimitry S. A. Nuyten, Samuel W. Baker, Nikolai Khodarev, Andy W. Su, Arif Y. Shaikh, Paul Roach, Bas Kreike, Bernard Roizman, Jonas Bergh, Yudi Pawitan, Marc J. van de Vijver, and Andy J. Minn

TOP-SCORING MEDIANS (TSM)

$G_1, G_2$: two disjoint sets of genes of size $m$, the context.
$\tilde{G}_1, \tilde{G}_2$: the median expression in $G_1$, $G_2$.
Classification rule: $f(X) = \mathbf{1}\{\tilde{G}_1 < \tilde{G}_2\}$.
Choose the context by maximizing the (apparent) accuracy $P(f(X) = Y)$.
Let $s(G_1, G_2) = |P(\tilde{G}_1 < \tilde{G}_2 \mid Y = 0) - P(\tilde{G}_1 < \tilde{G}_2 \mid Y = 1)|$. Then choose the context to maximize $s(G_1, G_2)$.
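For a fixed context, the TSM rule and its score can be sketched as below. The names `tsm_predict` and `tsm_score` and the toy data are assumptions made for illustration; medians are taken over raw expression values, and the score is the plug-in estimate of $s(G_1, G_2)$ from class-wise frequencies.

```python
from statistics import median

def tsm_predict(x, g1, g2):
    """Return 1 if the median expression over g1 is below the median over g2, else 0."""
    return int(median(x[i] for i in g1) < median(x[j] for j in g2))

def tsm_score(profiles, labels, g1, g2):
    """Plug-in estimate of s(G1, G2): difference in the rate of the event between classes."""
    def rate(cls):
        members = [x for x, y in zip(profiles, labels) if y == cls]
        return sum(tsm_predict(x, g1, g2) for x in members) / len(members)
    return abs(rate(0) - rate(1))

# Hypothetical context with m = 3 genes per set, and made-up profiles.
g1, g2 = [0, 1, 2], [3, 4, 5]
profiles = [[1, 2, 3, 4, 5, 6],    # class 1
            [1, 9, 2, 4, 5, 6],    # class 1, with one aberrant gene in g1
            [7, 8, 9, 1, 2, 3],    # class 0
            [7, 0, 9, 1, 2, 3]]    # class 0, with one aberrant gene in g1
labels = [1, 1, 0, 0]

predictions = [tsm_predict(x, g1, g2) for x in profiles]
score = tsm_score(profiles, labels, g1, g2)
```

The second and fourth samples each contain one aberrant gene, yet the medians, and hence the predictions, are unaffected.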

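FINDING THE CONTEXT (I)

Exact optimization (for $m > 1$) is computationally impossible and would lead to massive overfitting anyway.
Let $\tilde{G}_1 = R_{\tau_1}$, $\tilde{G}_2 = R_{\tau_2}$ (ranks are computed in $G_1 \cup G_2$), where $\tau_1, \tau_2$ denote the indices of the median genes in $G_1$, $G_2$.
Suppose:
(i) $\{X_i < X_j\} \perp \{\tau_1 = i, \tau_2 = j\} \mid Y$ for each $i \in G_1$, $j \in G_2$;
(ii) $(\tau_1, \tau_2)$ is uniformly distributed given $Y$.
Then
$$P(\tilde{G}_1 < \tilde{G}_2 \mid Y) = \frac{1}{m^2} \sum_{i \in G_1,\, j \in G_2} P(X_i < X_j \mid Y).$$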

FINDING THE CONTEXT (II)

Both assumptions are true in practice.
Consequently, $s(G_1, G_2) \propto \left| \sum_{i \in G_1,\, j \in G_2} \Delta_{ij} \right|$.
Finally,
$$(G_1^*, G_2^*) = \arg\max_{G_1, G_2} \sum_{i \in G_1,\, j \in G_2} \Delta_{ij}.$$

This search is feasible either (i) exactly, but with gene filtering, for $m \le 5$; or (ii) greedily, adding one gene at a time, without gene filtering.

CLASSIFICATION RESULTS

PERTURBED NETWORKS

Diseased cells arise from aberrant activity in cellular signaling, and pathways are the fundamental scale of many cancer processes. These aberrations cannot be identified from phenotypic information typically measured in the clinic. Moreover, they are the net effect of interactions among multiple molecular agents. Generally, network analyses do not account for combinatorial (multi-way) interactions among genes or gene products, and do not quantify de-regulation.

BABY ILLUSTRATION

SWAP DISTANCE

A distance between permutations $\pi$ and $\pi'$ of $\{1, \dots, d\}$.
$D(\pi, \pi')$: the minimum number of adjacent swaps needed to transform $\pi$ into $\pi'$.
Example: $D((3, 1, 2, 4), (1, 2, 3, 4)) = 2$.
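The swap distance can be computed by counting inversions, since each adjacent swap removes exactly one inversion. The sketch below (the function name `swap_distance` is an illustrative choice, and the quadratic loop is kept for clarity rather than speed) reproduces the example from the slide.

```python
def swap_distance(perm_a, perm_b):
    """Minimum number of adjacent swaps turning perm_a into perm_b (inversion count)."""
    position_in_b = {value: index for index, value in enumerate(perm_b)}
    relabeled = [position_in_b[value] for value in perm_a]
    inversions = 0
    for a in range(len(relabeled)):
        for b in range(a + 1, len(relabeled)):
            if relabeled[a] > relabeled[b]:
                inversions += 1
    return inversions

example = swap_distance((3, 1, 2, 4), (1, 2, 3, 4))
reversal = swap_distance((4, 3, 2, 1), (1, 2, 3, 4))  # worst case for d = 4
```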

PATHWAY VARIABLES

Consider a network $m$ with $d_m$ genes.
Let $\pi = (\pi_1, \dots, \pi_{d_m})$ be the order statistics for $x = (x_1, \dots, x_{d_m})$: $x_{\pi_1} < x_{\pi_2} < \cdots < x_{\pi_{d_m}}$.
Let $D(x, x')$ be the swap distance between $\pi(x)$ and $\pi(x')$.
Then $D(x, x')$ is also the (unnormalized) Hamming distance between $z(x)$ and $z(x')$, the corresponding comparison strings.

ORDER INDEX

Fix a phenotype $k$ and let $X$ and $X'$ be i.i.d. expression profiles under $p(x|k)$. Define the Order Index:
$$\mu(k, m) = 1 - \binom{d_m}{2}^{-1} E[D(X, X')].$$
Then it is easy to show that
$$\mu(k, m) = 1 - \binom{d_m}{2}^{-1} \sum_{i < j;\ i, j \in G_m} 2 P(Z_{ij} = 1 \mid k) P(Z_{ij} = 0 \mid k).$$
$.5 \le \mu \le 1$, but generally $\mu \gg .5$ since there are many gene pairs expressed on different scales.
$\mu(k, m) \ll 1$: a highly disorganized system.
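The pairwise formula suggests a direct plug-in estimate from a set of profiles of one phenotype restricted to one network. The sketch below is an assumption-laden illustration: the name `order_index` is hypothetical, each $P(Z_{ij} = 1 \mid k)$ is replaced by its sample frequency, and no correction is made for the bias this introduces in small samples.

```python
from itertools import combinations

def order_index(profiles):
    """Plug-in order index: 1 minus the average of 2 p_ij (1 - p_ij) over gene pairs i < j."""
    n = len(profiles)
    d = len(profiles[0])
    total = 0.0
    n_pairs = 0
    for i, j in combinations(range(d), 2):
        p = sum(x[i] < x[j] for x in profiles) / n  # estimate of P(Z_ij = 1 | k)
        total += 2 * p * (1 - p)
        n_pairs += 1
    return 1 - total / n_pairs

# Made-up profiles on a three-gene network.
conserved = order_index([[1, 2, 3], [2, 5, 9], [0.1, 0.2, 0.3]])  # same ordering in every sample
mixed = order_index([[1, 2, 3], [1, 3, 2]])                        # one pair flips half the time
disorganized = order_index([[1, 2, 3], [3, 2, 1]])                 # every pair flips half the time
```

The three toy cases span the range of the index: a perfectly conserved ordering gives 1, and a network in which every pair is a coin flip gives the lower bound 0.5.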

EXAMPLES

In the Death network, for prostate tissue, $\mu(\text{normal}) = 0.924$ and $\mu(\text{metastatic}) = 0.823$. The difference is highly significant ($p < .001$). Overall, 75 networks have significant differences in $\mu$, which is usually smaller in metastatic tumors.

DE-REGULATION IN DISEASE

A general trend emerges: when pairs of phenotypes represent gradations of disease, the order index is usually smaller in the more malignant one when there is a significant difference. In the following plots, each point represents a pair $(\mu(A, m), \mu(B, m))$ for a network $m$, where $A$ is more malignant than $B$.

GLOBAL PICTURE

Figure: Deregulation of network ranking in disease.

DISTANCE-BASED CLASSIFICATION

Fix a context $G$ (set of genes). Let $D_G$ be the swap distance restricted to $G$. Classify by nearest-neighbor in $L$. Choose $G$ so that the distance $D_G(X, X')$ between independent samples is
Large if $X, X'$ are from different classes;
Small if from the same class.

This can be done in a similar fashion to kTSP and TSM.
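Once a context is fixed, the nearest-neighbor rule is short to write down. In the sketch below the names, the toy training set, and the labels "A" and "B" are made up; the restricted swap distance is computed as the number of gene pairs in the context whose ordering differs between the two profiles, and ties in distance are broken by training order.

```python
from itertools import combinations

def swap_distance_profiles(x, y, context):
    """Number of gene pairs in the context whose ordering differs between x and y."""
    return sum((x[i] < x[j]) != (y[i] < y[j]) for i, j in combinations(context, 2))

def nearest_neighbor_label(x, training, context):
    """Label of the training profile closest to x in restricted swap distance."""
    _, label = min(training, key=lambda item: swap_distance_profiles(x, item[0], context))
    return label

# Hypothetical training set of (profile, label) pairs; gene 3 lies outside the context.
training = [([1, 2, 3, 9], "A"), ([3, 2, 1, 0], "B")]
context = [0, 1, 2]
query = [10, 20, 15, 5]

distance_to_a = swap_distance_profiles(query, training[0][0], context)
distance_to_b = swap_distance_profiles(query, training[1][0], context)
predicted = nearest_neighbor_label(query, training, context)
```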

BREAST CANCER PROGNOSIS

Objective: separate BC microarray samples into good vs poor prognosis determined by recurrence within five years.
MammaPrint Signature: List of 70 genes and corresponding (correlation-based) decision rule.
One of three signatures approved by the FDA for clinical use.
Learned from a training set $L$ with $n = 162$ samples (46 recurrent and 116 non-recurrent).
Achieves 89% sensitivity and 41% specificity on the Buyse test set of $n = 302$ samples (46 recurrent and 256 non-recurrent).

MAXIMUM ENTROPY MODELS ON PERMUTATIONS

Fix ten genes (e.g., the five top-scoring pairs).
Let $x$ be the expression profile and $r \in \Omega_{10}$ the rank vector.
Construct two distributions $p(r|\text{good})$ and $p(r|\text{poor})$ by maximizing entropy subject to fixing all $\binom{10}{2} = 45$ pairwise comparison probabilities.
Use Iterative Projection to learn the parameters.
With $d = 10$, everything can be computed, including normalizing constants and entropies.

MORE FORMALLY

Let $q$ be a probability distribution on $\Omega_{10}$, and let $p_L$ be the empirical distribution on $L$. For $k \in \{\text{poor}, \text{good}\}$:
$$p(r|k) = \arg\max_q H(q) \quad \text{s.t.} \quad \forall\, i < j:\ q(r : r_i < r_j) = p_L(r : r_i < r_j \mid k).$$
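To make the constrained maximization concrete, here is a small sketch on $d = 3$ genes rather than 10. It uses iterative proportional fitting, cycling through the pairwise constraints and rescaling the current distribution to satisfy each one in turn; this is one standard form of iterative projection, and whether the lecture used exactly this variant is an assumption. The function names and the reference weights are made up, and all target probabilities are assumed to lie strictly between 0 and 1.

```python
import math
from itertools import combinations, permutations

def pairwise_marginals(dist, d):
    """dist maps rank vectors r (tuples) to probabilities; return {(i, j): P(r_i < r_j)}."""
    return {(i, j): sum(p for r, p in dist.items() if r[i] < r[j])
            for i, j in combinations(range(d), 2)}

def fit_max_entropy(targets, d, sweeps=500):
    """Start from the uniform distribution and repeatedly rescale to match each constraint."""
    perms = list(permutations(range(1, d + 1)))
    q = {r: 1.0 / len(perms) for r in perms}
    for _ in range(sweeps):
        for (i, j), target in targets.items():
            current = sum(p for r, p in q.items() if r[i] < r[j])
            for r in perms:
                if r[i] < r[j]:
                    q[r] *= target / current
                else:
                    q[r] *= (1 - target) / (1 - current)
    return q

def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Made-up "empirical" distribution on the 6 rank vectors of 3 genes.
perms = list(permutations(range(1, 4)))
reference = dict(zip(perms, [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]))

targets = pairwise_marginals(reference, 3)
fitted = fit_max_entropy(targets, 3)
fitted_marginals = pairwise_marginals(fitted, 3)
```

The fitted distribution matches the three pairwise comparison probabilities of the reference while having entropy at least as large, which is the defining property of the maximum entropy solution.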

LIKELIHOOD RATIO TEST

Classify sample $x$ as poor if
$$\frac{p(r(x)|\text{poor})}{p(r(x)|\text{good})} > \lambda.$$
For $\lambda = 1$, 70% sensitivity and 64% specificity (overall 66%).
Varying $\lambda$ trades off sensitivity and specificity.
Entropies are $H = 14.22$ (good), $H = 17.45$ (poor), $H = 21.79$ (uniform).
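A quick arithmetic check helps in reading these entropies: the uniform distribution on the $10!$ rank vectors has entropy $\log_2(10!)$, which matches the quoted 21.79 if entropies are measured in bits (that unit is inferred here, not stated on the slide). The sketch also writes out the decision rule; the function name `classify_as_poor` and the toy probabilities are hypothetical.

```python
import math

# Entropy of the uniform distribution on the 10! rank vectors, in bits.
uniform_entropy_bits = math.log2(math.factorial(10))

# Entropy reduction relative to uniform, using the values quoted on the slide.
reduction_good = uniform_entropy_bits - 14.22
reduction_poor = uniform_entropy_bits - 17.45

def classify_as_poor(r, p_poor, p_good, threshold=1.0):
    """Likelihood ratio rule: call 'poor' when p(r | poor) / p(r | good) exceeds the threshold."""
    return p_poor[r] / p_good[r] > threshold

# Toy class-conditional probabilities for a single rank vector (made-up numbers).
p_poor = {(1, 2, 3): 0.3}
p_good = {(1, 2, 3): 0.1}
call_default = classify_as_poor((1, 2, 3), p_poor, p_good)
call_strict = classify_as_poor((1, 2, 3), p_poor, p_good, threshold=4.0)
```

On this reading the good-prognosis model is the more concentrated of the two, about 7.6 bits below uniform versus about 4.3 bits for the poor-prognosis model.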

METASTATIC CANCER

Cancer is an acquired genetic disorder due to the accumulation over time of DNA alterations that lead to uncontrolled cell growth and proliferation. Ninety percent of deaths result from metastasis, meaning that cancer cells break away and migrate to distant organs. By lodging in other organs they replace normal cells until the organ no longer functions.

TUMOR SITE OF ORIGIN

In approximately 4% of cancers, a metastatic tumor is found of unknown primary origin (Hillen, 2000).
However, the appropriate treatment depends on the tissue of origin.
The GEO or Gene Expression Omnibus (Barrett et al., 2006) contains 16,715 tumor samples from 20 sites of origin for the most popular platform.
Objective: Build a classifier for distinguishing among the 20 sites of origin and validate it with cross-study error estimation.

GENERIC PROBLEM: BATCH EFFECTS

Systematic variation across samples is highly correlated with date, lab, etc. Especially problematic when batch labels are confounded with class label. Affects not only the patterns of expression of individual genes, but in fact the entire dependency structure, including correlations.

BATCH EFFECTS
Samples from the same phenotype but different dates, labs, etc. display systematic differences in the distribution of individual genes and dependency structure.

BATCH EFFECTS: REVERSE CORRELATION

Figure: The fraction of significantly correlated gene pairs for which the sign reverses between pairs of batches.

STUDY EFFECTS

Within class, but across studies, there are differences due to age, location, etc., as well as platform and mRNA storage/extraction methods. Combined with batch effects, samples from different studies are not even approximately identically distributed. This must be taken into account in estimating generalization error. The consequences of confounding, batch effects, and study effects make cross-study validation, as opposed to ordinary cross-validation, imperative.

UNBIASED VALIDATION: ACCURACY

Overall accuracy is a poor measure of utility with major class imbalance in training.
Instead use Mean Class Conditional Accuracy (MCCA).
Generalizes the average of sensitivity and specificity to multiclass.
Take the average of $\{P(F(X) = y \mid Y = y)\}$ for $y = 1, \dots, L$.
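MCCA is straightforward to compute from true and predicted labels. In the sketch below the function name and the toy labels are illustrative assumptions; the example shows a classifier that always predicts the majority class, which scores well on overall accuracy but no better than chance on MCCA.

```python
def mean_class_conditional_accuracy(y_true, y_pred):
    """Average over classes of the fraction of that class's samples predicted correctly."""
    per_class = []
    for c in sorted(set(y_true)):
        indices = [t for t, y in enumerate(y_true) if y == c]
        correct = sum(y_pred[t] == c for t in indices)
        per_class.append(correct / len(indices))
    return sum(per_class) / len(per_class)

# Imbalanced toy example (made up): 8 colon samples, 2 lung samples.
y_true = ["colon"] * 8 + ["lung"] * 2
y_pred = ["colon"] * 10  # always predict the majority class

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcca = mean_class_conditional_accuracy(y_true, y_pred)
```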

METHODS OF ESTIMATING ACCURACY

Resubstitution: Validate on $L$, the training data. Strong optimistic bias.
Holdout: Randomly partition data into training and validation. Still optimistic because training and validation are identically distributed.
Cross-validation: Still optimistic for same reason.
Cross-Study Validation: Validate on a different study, done in a different lab than the training study. Higher bar, but the gold standard.

LEAVE-STUDY-OUT VALIDATION

Figure: Training on Study 1 and Study 2; validation on Study 3.
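The splitting scheme can be sketched as a generator that holds out one study at a time. The function name and the study identifiers below are hypothetical; in practice each held-out study would be scored with a classifier trained only on the remaining studies.

```python
def leave_study_out_splits(study_ids):
    """Yield (held_out_study, train_indices, test_indices), one split per study."""
    for held_out in sorted(set(study_ids)):
        train = [t for t, s in enumerate(study_ids) if s != held_out]
        test = [t for t, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical study label for each of five samples.
study_ids = ["S1", "S1", "S2", "S3", "S3"]
splits = list(leave_study_out_splits(study_ids))
```

Unlike a random holdout, no study contributes samples to both sides of a split, so batch and study effects cannot leak from training into validation.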

DECISION TREES OF COMPARISONS

Goal: Generalize kTSP and related algorithms to multiclass problems.
Build decision trees with comparison questions: Is gene $i$ more highly expressed than gene $j$?
With the site of origin data, can build trees with depth up to fifteen queries.

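TREE OF COMPARISONS

Figure: A decision tree whose internal nodes ask comparison questions (e.g., "Is ABCD > EFGH?", "Is GHKL > MNPQ?") and whose leaves are sites of origin (e.g., Colon (35/40)).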

TSP TREES: RESULTS

One decision tree: 91.4% accuracy, 75.4% MCCA.
Random Forest with 10 trees and 10k gene pairs chosen at random for each tree: 95.8% accuracy, 84.2% MCCA.
Three trees with no common genes: 94.4% accuracy, 79.9% MCCA.
Lack of independence is problematic for ensembles, even if disjoint.

                  Tree 2 Wrong   Tree 2 Correct
Tree 1 Wrong      741            690
Tree 1 Correct    868            14416
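The counts in the table make the lack of independence easy to quantify. The short calculation below assumes the table is read with Tree 1 on the rows and Tree 2 on the columns (variable names are illustrative), and compares the observed number of samples that both trees get wrong with the number expected if their errors were independent.

```python
# Counts from the 2x2 table (rows: Tree 1, columns: Tree 2).
both_wrong = 741
only_tree1_wrong = 690
only_tree2_wrong = 868
both_correct = 14416

total = both_wrong + only_tree1_wrong + only_tree2_wrong + both_correct
tree1_wrong = both_wrong + only_tree1_wrong
tree2_wrong = both_wrong + only_tree2_wrong

tree1_accuracy = 100 * (total - tree1_wrong) / total

# Expected joint errors if the two trees erred independently.
expected_both_wrong = tree1_wrong * tree2_wrong / total
excess_ratio = both_wrong / expected_both_wrong
```

The total recovers the 16,715 samples and Tree 1's accuracy recovers the quoted 91.4%. Under independence about 138 joint errors would be expected, against 741 observed, which is more than five times as many.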

REDUCING DIVERSITY AND SAMPLE SIZE

Reducing Diversity: Train on largest study for each site. Test on the rest. Accuracy = 85.8%, MCCA = 74.0%.
Reducing n: Keep only 10 samples per study-site of origin pair. Notice that n is smaller for every site of origin.

EFFECTS OF REDUCING DIVERSITY AND SAMPLE SIZE

BREAST VS NON-BREAST: CROSS-STUDY VS HOLDOUT

An experiment to compare the performance of cross-study validation and (randomized) cross-validation.
Breast vs all 19 other sites.
For non-breast samples, half are used for training and half for testing.
Randomly order the breast tumor studies. Let $n_k$ be the sample size of study $k$.
Cross-study: Train on studies 1 through $k$ and validate on study $k + 1$.
Cross-validation: Randomly choose $n_{k+1}$ breast samples from studies $1, \dots, k + 1$ for testing, train on the rest, repeat.

RESULTS OF CROSS-STUDY VS CROSS-VALIDATION

RANDOMIZING STUDY LABELS (I)

Goal: Quantify how much batch/study effects reduce accuracy, MCCA.
Randomize study labels within each phenotype.
After shuffling study labels: Accuracy = 98.6%, MCCA = 96.1%.
8 points of MCCA lost to batch/study effects.
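The randomization step can be sketched as a permutation of study labels carried out separately within each phenotype, so that each phenotype keeps the same number of samples per (pseudo-)study while the link between a sample and its true study is broken. The function name, the seed handling, and the toy labels are assumptions made for illustration.

```python
import random

def shuffle_study_labels(study_ids, phenotypes, seed=0):
    """Permute study labels among the samples of each phenotype, leaving phenotypes fixed."""
    rng = random.Random(seed)
    shuffled = list(study_ids)
    for phenotype in sorted(set(phenotypes)):
        indices = [t for t, p in enumerate(phenotypes) if p == phenotype]
        labels = [study_ids[t] for t in indices]
        rng.shuffle(labels)
        for t, label in zip(indices, labels):
            shuffled[t] = label
    return shuffled

# Hypothetical study and site-of-origin labels for six samples.
study_ids = ["S1", "S1", "S2", "S2", "S3", "S3"]
phenotypes = ["colon", "colon", "colon", "lung", "lung", "lung"]
shuffled = shuffle_study_labels(study_ids, phenotypes)
```

Running leave-study-out validation on the shuffled labels then gives an estimate in which training and validation samples are exchangeable within each phenotype, which is the comparison the slide draws.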

RANDOMIZING STUDY LABELS (II)

Figure: Study labels (Study 1, Study 2) within Site of Origin 1 and Site of Origin 2, before and after shuffling.

CONCLUSIONS

Accuracy should be demonstrated cross-study. Sample diversity is more important than sample size.
