STATISTICAL CHALLENGES IN MULTI-STUDY GENOMIC DATA ANALYSIS
Giovanni Parmigiani
Johns Hopkins University
http://astor.som.jhmi.edu/~gp
JSM Toronto, 2004
REPRODUCIBILITY AND VALIDATION IN GENOMICS
BACKGROUND
Studies of gene expression have small samples and use diverse technologies and differing populations. Results are often perceived to be discordant or not adequately reproducible. Can we use the collection of available studies to assess the reproducibility of molecular classification results? Can we develop simple tools for biologists to cumulatively integrate the knowledge embedded in these studies?
THREE RELATED QUESTIONS
• REPRODUCIBILITY: Which aspects of gene expression can be consistently measured across studies and platforms?
• VALIDATION: To what extent are the biological conclusions confirmed across studies?
• INTEGRATION: Are there viable approaches for integration?
A PROFILE FOR BRCA1-LINKED TUMORS?
• Studies:
  van’t Veer, Nature 2002 (Rosetta, Agilent long oligos)
  Hedenfalk, NEJM 2001 (NHGRI, cDNA)
• The overlap among the lists of BRCA1-related genes is meager, and reproducibility has been criticized.
• Does breast cancer in BRCA1 germline mutation carriers have a specific molecular profile?
STANDARDIZED EFFECT SIZES CAN BE REPRODUCIBLE
Standardized effect sizes (left), p-values (center), and q-values (right) for discriminating between adenocarcinomas and squamous carcinomas of the lung. Each point corresponds to a gene, each coordinate to one of the two studies. Dashed lines are empirical regression lines.
Clinical Cancer Research 2004
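To illustrate why a standardized effect size can be compared across platforms, here is a minimal sketch. It assumes one common definition (difference in group means divided by the pooled standard deviation); the slides do not state which exact statistic was used, and the expression values below are hypothetical. The second study is simply the first on a 100-fold larger measurement scale.

```python
import math
from statistics import mean, variance


def standardized_effect(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation.

    This is one common definition; treating it as the statistic behind the
    slide is an assumption.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical expression values for one gene, measured on two platforms
# whose raw scales differ by a factor of 100.
study1_adeno, study1_squamous = [2.1, 2.4, 2.0, 2.6], [1.0, 1.3, 0.9, 1.2]
study2_adeno, study2_squamous = [210.0, 240.0, 200.0, 260.0], [100.0, 130.0, 90.0, 120.0]

d1 = standardized_effect(study1_adeno, study1_squamous)
d2 = standardized_effect(study2_adeno, study2_squamous)
print(d1, d2)  # the two values agree up to floating-point error
```

Because the statistic is unit-free, a change of measurement scale between studies leaves it unchanged, whereas raw mean differences would differ by the scale factor.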
Human Molecular Genetics 2003
[Figure: two panels, HEDENFALK on VAN’T VEER and VAN’T VEER on HEDENFALK. x-axis: BRCA1 MUTANT (=1), with values 0 and 1; y-axis: SAM SCORE.]
EGFR probe sets on HU133
Courtesy of Jangwen Zhang
INTEGRATIVE CORRELATIONS
[Figure: two panels, STUDY A (samples A1 to A4) and STUDY B (samples B1 to B5). Axis: EXPRESSION. Each panel is annotated with PAIRWISE CORRELATIONS; values shown: −0.36, −0.95, 0.63, 0.046, −0.8, 0.52.]
INTEGRATIVE CORRELATIONS, by Gene
[Figure: five scatterplot panels, one per gene (Gene 1 to Gene 5). x-axis: Correlations of the gene in Study A; y-axis: Correlations of the gene in Study B. Panel annotations: r = −0.02, r = 0.04, r = −0.03, r = −0.67, r = 0.75.]
Five examples of integrative correlations of genes across the van’t Veer (Agilent) and Hedenfalk (cDNA) studies
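The "correlation of correlations" idea behind these panels can be sketched as follows. For each gene, compute its correlations with every other gene within Study A, do the same within Study B, and then correlate the two resulting vectors. This is a minimal illustration, not the MergeMaid implementation; the function names and the toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def integrative_correlations(study_a, study_b):
    """Map each shared gene to the correlation, across studies, of its
    within-study correlations with all other shared genes.

    Each study is a dict: gene -> expression values over that study's samples.
    The two studies may have different numbers of samples.
    """
    genes = sorted(set(study_a) & set(study_b))
    scores = {}
    for g in genes:
        others = [h for h in genes if h != g]
        corr_a = [pearson(study_a[g], study_a[h]) for h in others]
        corr_b = [pearson(study_b[g], study_b[h]) for h in others]
        scores[g] = pearson(corr_a, corr_b)
    return scores


# Hypothetical toy data: 4 genes, 5 samples in study A, 4 samples in study B.
study_a = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 5.0, 8.0, 10.0],
    "g3": [5.0, 4.0, 3.0, 2.0, 1.5],
    "g4": [3.0, 1.0, 4.0, 1.0, 5.0],
}
study_b = {
    "g1": [10.0, 20.0, 30.0, 40.0],
    "g2": [11.0, 19.0, 33.0, 41.0],
    "g3": [40.0, 32.0, 19.0, 10.0],
    "g4": [20.0, 25.0, 18.0, 30.0],
}

scores = integrative_correlations(study_a, study_b)
for gene, score in scores.items():
    print(gene, round(score, 3))
```

Note that the calculation never pairs samples across studies and uses no phenotype information, which is what makes it an unsupervised check of cross-platform reproducibility.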
FDR ON REPRODUCIBILITY SCORES
[Figure: density plot. x-axis: Reproducibility Score; y-axis: Density.]
Empirical and permutation distribution of reproducibility scores across the van’t Veer (Agilent) and Hedenfalk (cDNA) studies
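One way to turn the comparison of empirical and permutation distributions into an FDR estimate is sketched below. The estimator (expected number of null scores above a threshold per permutation, divided by the number of observed scores above it) is a generic choice and an assumption here; the slides do not specify the estimator. The helper names and the score values are hypothetical.

```python
import random


def permute_gene_labels(study, rng):
    """Return a copy of the study with expression profiles reassigned to
    randomly shuffled gene labels (breaks the gene matching across studies)."""
    genes = list(study)
    shuffled = genes[:]
    rng.shuffle(shuffled)
    return {g: study[s] for g, s in zip(genes, shuffled)}


def empirical_fdr(observed, null_scores, threshold, n_permutations):
    """Estimated false discovery rate for calling scores >= threshold.

    null_scores pools the scores from all permutations, so dividing by
    n_permutations gives the expected number of false calls per data set.
    """
    n_called = sum(s >= threshold for s in observed)
    if n_called == 0:
        return 0.0
    expected_false = sum(s >= threshold for s in null_scores) / n_permutations
    return min(1.0, expected_false / n_called)


# Hypothetical reproducibility scores for 5 genes, plus scores recomputed
# after 2 permutations of gene labels in one study (10 null scores in total).
observed = [0.6, 0.5, 0.4, 0.1, -0.2]
null_scores = [0.45, 0.1, 0.0, -0.1, -0.3, 0.2, 0.05, -0.15, 0.3, -0.25]

fdr = empirical_fdr(observed, null_scores, threshold=0.4, n_permutations=2)
print(fdr)

# Example of one label permutation on a tiny hypothetical study.
toy = {"g1": [1.0, 2.0], "g2": [3.0, 4.0], "g3": [5.0, 6.0]}
permuted = permute_gene_labels(toy, random.Random(0))
```

In practice the null scores would come from recomputing the reproducibility score after each call to `permute_gene_labels` on one of the two studies.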
FILTERING BY INTEGRATIVE CORRELATIONS
[Figure: density of the Reproducibility Score, with two scatterplot panels, REPRODUCIBLE GENES and UNREPRODUCIBLE GENES. Scatterplot axes: SAM statistics for BRCA1 versus Wildtype: HEDENFALK, and SAM statistics for BRCA1 versus Wildtype: VAN’T VEER. Panel annotations: CORR = −0.132; CORR = 0.712.]
FDR ON COMBINED ANALYSIS OF GENES RELATED TO GERMLINE BRCA1 MUTATIONS
[Figure: density plot. x-axis: SAM STATISTICS: observed (solid) and expected (dashed); y-axis: DENSITY.]
[Figure: heatmap with two panels, HEDENFALK and VAN’T VEER; color scale from −3 to 3. Genes shown:]
DSC2 TCEAL1 TPX2 SSBP1 TOPBP1 HDGF INDO CTSK CYB5 CKS2 MPL VRK2 RARRES1 GART TRIM29 WARS LPIN1 LYN KLK6 KIAA0232 USP10 DLG7 HMGN3 CALU BTG3 GTF2E2 TNFAIP1 CD58 DHCR24 MDH1 NOLC1 TFAM ILF2 PLOD NSEP1 CSTB GDI2 FBXL5 NCK1 SAS CDKN2C POLR2F TP53BP2 MFGE8 MTCP1 RFC4 KIAA0063 KIAA1223 UGDH TNFRSF1B CTPS TIMP3 EML2 C1GALT1 NUP155 SEC13L1 BCL2A1 MMP7 NUP160 GTPBP4 PPP1CB YES1 C20orf55 MYBL2 TOB1 MSN GATA3 NRIP1 INPP4B MTIF2 ITGB5 DYSF GALNT10 SH3GLB1 BIRC3 SF3B5 VCAM1 ZNF22 DKFZp762E1 ALCAM
BASIC ANALYSIS PLAN FOR CROSS-STUDY VALIDATION
• Work with the rawest data available. Redo the normalization if at all possible. Traps: background subtraction in cDNA; MAS 4.
• Merge on LocusLink, UniGene, or GenBank identifiers?
• Exclude genes that are not reproducible in the unsupervised correlation-of-correlations analysis. Compare against permutation of gene labels.
• Select genes that consistently predict phenotype. Compare against permutation of phenotype labels.
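The last step of the plan, comparing a gene's association with phenotype against permutations of the phenotype labels, could look like the sketch below. It uses a plain difference in group means as the test statistic for simplicity; the talk itself reports SAM statistics, so the choice of statistic, the function names, and the data here are all illustrative assumptions.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Mean expression in the label-1 group minus mean in the label-0 group."""
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(ones) - mean(zeros)


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling phenotype labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression of one gene in 4 mutant (1) and 4 wildtype (0) tumors.
values = [5.1, 5.3, 4.9, 5.2, 1.0, 1.2, 0.9, 1.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

p = permutation_p_value(values, labels)
print(p)
```

Running the same test separately in each study and keeping genes whose effects agree in sign and clear the permutation threshold in both is one simple reading of "consistently predict phenotype".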
[Figure 4: six scatterplot panels comparing the Harvard, Michigan, and Stanford studies pairwise. Panels A, B, C: ALL GENES. Panels D, E, F: REPRODUCIBLE GENES.]
Parmigiani, Garrett, Ramaswamy and Gabrielson, CCR 2004.
VALIDATION
• WITHIN-STUDY 'STATISTICAL' VALIDATION
  Cross-validation; training and testing samples
  “Validation” of clusters/classification using alternative data mining techniques
• WITHIN-STUDY 'BIOLOGICAL' VALIDATION
  “Validation” of clusters using phenotypes
  “Validation” of clusters/classifiers using functional annotation information
  Validation of expression changes using different assays
• ACROSS-STUDY VALIDATION
  Overlap of genes in signatures
CONCLUSIONS
• Cross-study analysis of molecular classification is necessary.
  Small sample/gene ratios
  Need for external validation
  Need for cumulating knowledge
• Combined analysis is possible.
  Use a carefully selected, but significant, subset of genes
  Combine in the signal-to-noise (effect-size) ratio scale
  Put in context of genomic distributions
• Toolbox in progress for:
  Selecting genes that reproduce well;
  Validating putative markers;
  Building combined analyses.
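As an illustration of combining studies on the effect-size scale, the sketch below pools per-study standardized effects for one gene with inverse-variance weights (a standard fixed-effect meta-analysis). This is one conventional option and not necessarily the method used in the talk; the variance formula is the usual large-sample approximation for a standardized mean difference, and the input numbers are hypothetical.

```python
import math


def effect_size_variance(d, n1, n2):
    """Large-sample approximate variance of a standardized mean difference d
    computed from groups of size n1 and n2."""
    return (n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2))


def combine_fixed_effect(effects):
    """Inverse-variance weighted average of per-study effect sizes.

    effects: list of (d, n1, n2) tuples, one per study.
    Returns (combined effect, standard error of the combined effect).
    """
    weights = [1.0 / effect_size_variance(d, n1, n2) for d, n1, n2 in effects]
    total = sum(weights)
    combined = sum(w * d for w, (d, _, _) in zip(weights, effects)) / total
    return combined, math.sqrt(1.0 / total)


# Hypothetical per-study results for a single gene: (effect size, n1, n2).
per_study = [(0.9, 7, 15), (0.6, 18, 78)]
combined, se = combine_fixed_effect(per_study)
print(combined, se)
```

Larger studies receive more weight, and the combined estimate always lies between the smallest and largest per-study effect.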
READINGS, CREDITS, ADS
CREDITS: Les Cope, Ed Gabrielson, Liz Garrett, Simens Zhong, Dongmei Liu, Rob Scharpf, Qiushan Tao, Jangwen Zhang
www.arraybook.org High quality freeware for microarray analysis
astor.som.jhmi.edu/MergeMaid MergeMaid: Merging and EDA for multi-study genomic data analysis.