
STATISTICAL CHALLENGES IN MULTI-STUDY GENOMIC DATA ANALYSIS
Giovanni Parmigiani
Johns Hopkins University
http://astor.som.jhmi.edu/~gp
[email protected]

JSM Toronto, 2004

REPRODUCIBILITY AND VALIDATION IN GENOMICS

BACKGROUND
Studies of gene expression have small samples and use diverse technologies and differing populations. Results are often perceived to be discordant or not adequately reproducible.
Can we use the collection of available studies to assess the reproducibility of molecular classification results?
Can we develop simple tools for biologists to cumulatively integrate the knowledge embedded in these studies?


THREE RELATED QUESTIONS
• REPRODUCIBILITY: Which aspects of gene expression can be consistently measured across studies and platforms?
• VALIDATION: To what extent are the biological conclusions confirmed across studies?
• INTEGRATION: Are there viable approaches for integration?


A PROFILE FOR BRCA1-LINKED TUMORS?
• Studies:
  van’t Veer, Nature 2002 (Rosetta, Agilent long oligos)
  Hedenfalk, NEJM 2001 (NHGRI, cDNA)
• The overlap among the lists of BRCA1-related genes is meager, and reproducibility has been criticized.
• Does breast cancer in BRCA1 germline mutation carriers have a specific molecular profile?


STANDARDIZED EFFECT SIZES CAN BE REPRODUCIBLE

Standardized effect sizes (left), p-values (center), and q-values (right) for discriminating between adenocarcinomas and squamous carcinomas of the lung. Each point corresponds to a gene, each coordinate to one of the two studies. Dashed lines are empirical regression lines.
Clinical Cancer Research 2004
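As a rough illustration of the comparison described in this caption, the sketch below computes a standardized effect size for each gene in two studies and then correlates the two vectors across genes. It assumes a pooled-standard-deviation effect size (Cohen's d style); the slide does not say which standardization was used. The function names and the simulated data are hypothetical.

```python
import numpy as np

def standardized_effect_sizes(expr, labels):
    """Per-gene difference in group means divided by the pooled SD.

    expr: genes x samples array; labels: 0/1 class label per sample.
    The pooled-SD form is an assumption made for this sketch.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    g1 = expr[:, labels == 1]
    g0 = expr[:, labels == 0]
    n1, n0 = g1.shape[1], g0.shape[1]
    pooled_var = ((n1 - 1) * g1.var(axis=1, ddof=1)
                  + (n0 - 1) * g0.var(axis=1, ddof=1)) / (n1 + n0 - 2)
    return (g1.mean(axis=1) - g0.mean(axis=1)) / np.sqrt(pooled_var)

def simulate_study(true_shift, n_per_group, scale, rng):
    """Hypothetical study: same underlying shifts, platform-specific scale."""
    labels = np.repeat([0, 1], n_per_group)
    noise = rng.normal(size=(true_shift.size, 2 * n_per_group))
    expr = scale * (noise + np.outer(true_shift, labels))
    return expr, labels

rng = np.random.default_rng(0)
true_shift = rng.normal(size=200)          # shared biology across studies
expr_a, labels_a = simulate_study(true_shift, 20, scale=1.0, rng=rng)
expr_b, labels_b = simulate_study(true_shift, 15, scale=50.0, rng=rng)

d_a = standardized_effect_sizes(expr_a, labels_a)
d_b = standardized_effect_sizes(expr_b, labels_b)

# One point per gene, one coordinate per study, as in the figure.
cross_study_r = np.corrcoef(d_a, d_b)[0, 1]
print(f"cross-study correlation of effect sizes: {cross_study_r:.2f}")
```

Because the effect size is a ratio of a mean difference to a standard deviation, it does not change when a platform rescales its measurements, which is one reason it can be compared across studies when raw intensities cannot.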


Human Molecular Genetics 2003


[Figure: two panels, HEDENFALK on VAN’T VEER and VAN’T VEER on HEDENFALK. Vertical axis: SAM SCORE. Horizontal axis: BRCA1 MUTANT (=1), with values 0 and 1.]


[Figure: EGFR probe sets on HU133. Courtesy of Jangwen Zhang.]


INTEGRATIVE CORRELATIONS
[Figure: EXPRESSION profiles in STUDY A (genes A1 to A4) and STUDY B (genes B1 to B5), each annotated with PAIRWISE CORRELATIONS. Values shown in the panels: −0.36, −0.95, 0.63 and 0.046, −0.8, 0.52.]


INTEGRATIVE CORRELATIONS, by Gene
[Figure: five panels, one per gene (Genes 1 to 5). Horizontal axis: correlations of the gene in Study A; vertical axis: correlations of the same gene in Study B; both axes run from −1.0 to 1.0. Values annotated in the panels: r = 0.75, r = −0.67, r = −0.03, r = 0.04, r = −0.02.]
Five examples of integrative correlations of genes across the van’t Veer (Agilent) and Hedenfalk (cDNA) studies.
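Each panel in this figure compares, for a single gene, its correlations with all other genes in Study A against the same correlations in Study B. A minimal sketch of that "correlation of correlations" calculation follows. It assumes both studies have already been reduced to the same genes in the same row order; the function name `integrative_correlations` and the simulated data are illustrative and are not taken from the talk or from MergeMaid.

```python
import numpy as np

def integrative_correlations(expr_a, expr_b):
    """One score per gene: the correlation, across all other genes, between
    that gene's within-study correlations in study A and in study B.

    Rows are genes (same order in both studies); columns are samples.
    """
    corr_a = np.corrcoef(expr_a)   # gene-by-gene correlations within study A
    corr_b = np.corrcoef(expr_b)   # gene-by-gene correlations within study B
    n_genes = corr_a.shape[0]
    scores = np.empty(n_genes)
    for g in range(n_genes):
        others = np.arange(n_genes) != g   # leave out the self-correlation
        scores[g] = np.corrcoef(corr_a[g, others], corr_b[g, others])[0, 1]
    return scores

# Hypothetical data: the first 30 genes share a latent factor in both
# studies; the remaining 30 are pure noise in both.
rng = np.random.default_rng(1)
n_genes = 60
loadings = np.zeros(n_genes)
loadings[:30] = rng.choice([-1.0, 1.0], size=30)

def simulate(n_samples):
    factor = rng.normal(size=n_samples)
    return np.outer(loadings, factor) + rng.normal(size=(n_genes, n_samples))

scores = integrative_correlations(simulate(40), simulate(30))
print("mean score, structured genes:", round(scores[:30].mean(), 2))
print("mean score, noise genes:     ", round(scores[30:].mean(), 2))
```

The calculation never compares expression values between studies directly, only correlation patterns within each study, so it does not require samples to be matched or platforms to share a measurement scale.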


FDR ON REPRODUCIBILITY SCORES
[Figure: density plot. Horizontal axis: Reproducibility Score (−0.4 to 0.6); vertical axis: Density (0 to 12).]
Empirical and permutation distribution of reproducibility scores across the van’t Veer (Agilent) and Hedenfalk (cDNA) studies.
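A hedged sketch of how a permutation reference distribution for reproducibility scores might be built: gene labels in one study are shuffled so that rows no longer correspond across studies, and the scores are recomputed. The scoring function is a correlation-of-correlations score written for this example. The cutoff rule (an upper quantile of the permutation scores) and all names are assumptions for illustration; this is not the FDR procedure used in the talk.

```python
import numpy as np

def reproducibility_scores(expr_a, expr_b):
    """Correlation of correlations, one score per gene (rows are genes)."""
    corr_a, corr_b = np.corrcoef(expr_a), np.corrcoef(expr_b)
    n_genes = corr_a.shape[0]
    off_diag = ~np.eye(n_genes, dtype=bool)   # mask out self-correlations
    return np.array([
        np.corrcoef(corr_a[g, off_diag[g]], corr_b[g, off_diag[g]])[0, 1]
        for g in range(n_genes)
    ])

def permutation_scores(expr_a, expr_b, n_perm, rng):
    """Scores after shuffling gene labels (row order) in study B."""
    null = []
    for _ in range(n_perm):
        shuffled_b = expr_b[rng.permutation(expr_b.shape[0])]
        null.append(reproducibility_scores(expr_a, shuffled_b))
    return np.concatenate(null)

# Hypothetical data: 30 genes with shared structure, 30 noise genes.
rng = np.random.default_rng(2)
n_genes = 60
loadings = np.zeros(n_genes)
loadings[:30] = rng.choice([-1.0, 1.0], size=30)

def simulate(n_samples):
    factor = rng.normal(size=n_samples)
    return np.outer(loadings, factor) + rng.normal(size=(n_genes, n_samples))

expr_a, expr_b = simulate(40), simulate(30)
observed = reproducibility_scores(expr_a, expr_b)
null = permutation_scores(expr_a, expr_b, n_perm=20, rng=rng)

# Assumed rule: keep genes scoring above the 95th percentile of the null.
threshold = np.quantile(null, 0.95)
selected = observed > threshold
print("threshold:", round(threshold, 2), "genes kept:", int(selected.sum()))
```

Shuffling rows of one study breaks the gene-to-gene correspondence while leaving each study's internal correlation structure intact, so the permutation scores show how large a score can arise from mismatched genes alone.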


FILTERING BY INTEGRATIVE CORRELATIONS
[Figure: a density plot of the Reproducibility Score, and two scatterplots titled REPRODUCIBLE GENES and UNREPRODUCIBLE GENES. Scatterplot axes: SAM statistics for BRCA1 versus Wildtype in HEDENFALK and in VAN’T VEER. Correlations annotated in the scatterplots: CORR = 0.712 and CORR = −0.132.]


FDR ON COMBINED ANALYSIS OF GENES RELATED TO GERMLINE BRCA1 MUTATIONS
[Figure: density plot. Horizontal axis: SAM STATISTICS (−5 to 5), observed (solid) and expected (dashed); vertical axis: DENSITY.]


[Figure: HEDENFALK and VAN’T VEER panels for the genes listed below.]

DSC2 TCEAL1 TPX2 SSBP1 TOPBP1 HDGF INDO CTSK CYB5 CKS2 MPL VRK2 RARRES1 GART TRIM29 WARS LPIN1 LYN KLK6 KIAA0232 USP10 DLG7 HMGN3 CALU BTG3 GTF2E2 TNFAIP1 CD58 DHCR24 MDH1 NOLC1 TFAM ILF2 PLOD NSEP1 CSTB GDI2 FBXL5 NCK1 SAS CDKN2C POLR2F TP53BP2 MFGE8 MTCP1 RFC4 KIAA0063 KIAA1223 UGDH TNFRSF1B CTPS TIMP3 EML2 C1GALT1 NUP155 SEC13L1 BCL2A1 MMP7 NUP160 GTPBP4 PPP1CB YES1 C20orf55 MYBL2 TOB1 MSN GATA3 NRIP1 INPP4B MTIF2 ITGB5 DYSF GALNT10 SH3GLB1 BIRC3 SF3B5 VCAM1 ZNF22 DKFZp762E1 ALCAM

[Scale: −3 to 3]


BASIC ANALYSIS PLAN FOR CROSS-STUDY VALIDATION
• Work with the rawest data available. Redo the normalization if at all possible. Traps: background subtraction in cDNA; MAS 4.
• Merge by LocusLink, UniGene, or GenBank?
• Exclude genes that are not reproducible in the unsupervised correlation of correlations analysis. Compare against permutation of gene labels.
• Select genes that consistently predict phenotype. Compare against permutation of phenotype labels.
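The last step of this plan, selecting genes that consistently predict phenotype and comparing against permutation of phenotype labels, could be sketched as follows. The concordance statistic used here (the product of per-study t-like statistics, which is large and positive only when both studies agree in sign and strength) is an assumption chosen for brevity; the talk itself reports SAM statistics. All names and the simulated data are hypothetical.

```python
import numpy as np

def t_like(expr, labels):
    """Welch-style statistic per gene; expr is genes x samples, labels 0/1."""
    g1, g0 = expr[:, labels == 1], expr[:, labels == 0]
    se = np.sqrt(g1.var(axis=1, ddof=1) / g1.shape[1]
                 + g0.var(axis=1, ddof=1) / g0.shape[1])
    return (g1.mean(axis=1) - g0.mean(axis=1)) / se

def concordance(expr_a, labels_a, expr_b, labels_b):
    """Assumed cross-study statistic: product of the two per-study scores."""
    return t_like(expr_a, labels_a) * t_like(expr_b, labels_b)

def permutation_p_values(expr_a, labels_a, expr_b, labels_b, n_perm, rng):
    """Per-gene p-values from shuffling phenotype labels within each study."""
    observed = concordance(expr_a, labels_a, expr_b, labels_b)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        permuted = concordance(expr_a, rng.permutation(labels_a),
                               expr_b, rng.permutation(labels_b))
        exceed += permuted >= observed
    return (exceed + 1) / (n_perm + 1)

# Hypothetical data: the first 10 of 100 genes are shifted in both studies.
rng = np.random.default_rng(3)
shift = np.zeros(100)
shift[:10] = 1.5
labels_a = np.repeat([0, 1], 15)
labels_b = np.repeat([0, 1], 12)
expr_a = rng.normal(size=(100, labels_a.size)) + np.outer(shift, labels_a)
expr_b = rng.normal(size=(100, labels_b.size)) + np.outer(shift, labels_b)

p = permutation_p_values(expr_a, labels_a, expr_b, labels_b, n_perm=200, rng=rng)
print("median p, shifted genes:", np.median(p[:10]))
print("median p, other genes:  ", np.median(p[10:]))
```

Permuting phenotype labels within each study keeps the gene-gene correlation structure of that study intact, so the reference distribution reflects only the loss of the link between expression and phenotype.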


[Figure 4. Panels A to C: ALL GENES; panels D to F: REPRODUCIBLE GENES. Axes are labeled Harvard, Michigan, and Stanford.]
Parmigiani, Garrett, Ramaswamy and Gabrielson, CCR 2004.


VALIDATION
• WITHIN-STUDY 'STATISTICAL' VALIDATION
  Cross-validation; training and testing samples
  “Validation” of clusters/classification using alternative data mining techniques
• WITHIN-STUDY 'BIOLOGICAL' VALIDATION
  “Validation” of clusters using phenotypes
  “Validation” of clusters/classifiers using functional annotation information
  Validation of expression changes using different assays
• ACROSS-STUDY VALIDATION
  Overlap of genes in signatures


CONCLUSIONS
• Cross-study analysis of molecular classification is necessary.
  Small sample/gene ratios
  Need for external validation
  Need for cumulating knowledge
• Combined analysis is possible.
  Use a carefully selected, but significant, subset of genes
  Combine in the signal-to-noise (effect-size) ratio scale
  Put in context of genomic distributions
• Toolbox in progress for:
  Selecting genes that reproduce well;
  Validating putative markers;
  Building combined analyses.


READINGS, CREDITS, ADS
CREDITS: Les Cope, Ed Gabrielson, Liz Garrett, Simens Zhong, Dongmei Liu, Rob Scharpf, Qiushan Tao, Jangwen Zhang

www.arraybook.org
High quality freeware for microarray analysis

astor.som.jhmi.edu/MergeMaid
MergeMaid: Merging and EDA for multi-study genomic data analysis.