Design of microarray experiments

Similar documents
Gene Expression Analysis

Statistical issues in the analysis of microarray data

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Cancer Biostatistics Workshop Science of Doing Science - Biostatistics

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

Fairfield Public Schools

The Wondrous World of fmri statistics

False Discovery Rates

Analysis of Illumina Gene Expression Microarray Data

Gene expression analysis. Ulf Leser and Karin Zimmermann

Principles of Hypothesis Testing for Public Health

Data Acquisition. DNA microarrays. The functional genomics pipeline. Experimental design affects outcome data analysis

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Tutorial for proteome data analysis using the Perseus software platform

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Simple Linear Regression Inference

Study Design and Statistical Analysis

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

Error Type, Power, Assumptions. Parametric Tests. Parametric vs. Nonparametric Tests

Basic Data Analysis. Stephen Turnbull Business Administration and Public Policy Lecture 12: June 22, Abstract. Review session.

Likelihood Approaches for Trial Designs in Early Phase Oncology

Randomized Block Analysis of Variance

Permutation Tests for Comparing Two Populations

ANOVA ANOVA. Two-Way ANOVA. One-Way ANOVA. When to use ANOVA ANOVA. Analysis of Variance. Chapter 16. A procedure for comparing more than two groups

Chapter 5 Analysis of variance SPSS Analysis of variance

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

Lecture Notes Module 1

Statistical analysis of modern sequencing data quality control, modelling and interpretation

Additional sources Compilation of sources:

HYPOTHESIS TESTING: POWER OF THE TEST

Multivariate Analysis of Variance. The general purpose of multivariate analysis of variance (MANOVA) is to determine

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

"Statistical methods are objective methods by which group trends are abstracted from observations on many separate individuals." 1

II. DISTRIBUTIONS distribution normal distribution. standard scores

individualdifferences

Redwood Building, Room T204, Stanford University School of Medicine, Stanford, CA

Concepts of Experimental Design

Test Positive True Positive False Positive. Test Negative False Negative True Negative. Figure 5-1: 2 x 2 Contingency Table

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

Descriptive Statistics

Section 13, Part 1 ANOVA. Analysis Of Variance

Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Introduction to SAGEnhaft

Introduction To Real Time Quantitative PCR (qpcr)

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Univariate Regression

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Measuring gene expression (Microarrays) Ulf Leser

1.5 Oneway Analysis of Variance

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Point Biserial Correlation Tests

Research Methods & Experimental Design

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

Tests for Two Proportions

Least Squares Estimation

Sample Size and Power in Clinical Trials

Statistiek II. John Nerbonne. October 1, Dept of Information Science

Statistics in Medicine Research Lecture Series CSMC Fall 2014

Detection of changes in variance using binary segmentation and optimal partitioning

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Testing for Granger causality between stock prices and economic growth

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

12: Analysis of Variance. Introduction

Comparing Two Groups. Standard Error of ȳ 1 ȳ 2. Setting. Two Independent Samples

Statistics Graduate Courses

THE FIRST SET OF EXAMPLES USE SUMMARY DATA... EXAMPLE 7.2, PAGE 227 DESCRIBES A PROBLEM AND A HYPOTHESIS TEST IS PERFORMED IN EXAMPLE 7.

UNDERSTANDING THE DEPENDENT-SAMPLES t TEST

Guide to Biostatistics

17. SIMPLE LINEAR REGRESSION II

Obesity in America: A Growing Trend

Real-time PCR: Understanding C t

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Two-sample hypothesis testing, II /16/2004

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

Some Essential Statistics The Lure of Statistics

Multiple Optimization Using the JMP Statistical Software Kodak Research Conference May 9, 2005

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Table 1" Cancer Analysis No Yes Diff. S EE p-value OLS Design- Based

Random effects and nested models with SAS

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

Logistic Regression (1/24/13)

REAL TIME PCR USING SYBR GREEN

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

How To Run Statistical Tests in Excel

Introduction to Regression and Data Analysis

Biostatistics: Types of Data Analysis

Inclusion and Exclusion Criteria

Package ERP. December 14, 2015

Multivariate Analysis of Variance (MANOVA)

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

People have thought about, and defined, probability in different ways. important to note the consequences of the definition:

3.4 Statistical inference for 2 populations based on two samples

Classification and Regression by randomforest

Transcription:

Practical microarray analysis experimental design Design of microarray experiments Ulrich Mansmann mansmann@imbi.uni-heidelberg.de Practical microarray analysis October 2003 Heidelberg Heidelberg, October 2003

Practical microarray analysis experimental design Experiments Scientists deal mostly with experiments of the following form: A number of alternative conditions / treatments one of which is applied to each experimental unit an observation (or several observations) then being made on each unit. The objective is: Separate out differences between the conditions / treatments from the uncontrolled variation that is assumed to be present. Take steps towards understanding the phenomena under investigation. Heidelberg, October 2003 2

Practical microarray analysis experimental design Statistical thinking Uncertain knowledge Knowledge of the extent of uncertainty in it + = Useable knowledge Measurement model m = µ + e m measurement with error, µ - true but unknown value What is the mean of e? What is the variance of e? Is there dependence between e and µ? What is the distribution of e (and µ)? Typically but not always: e ~ N(0,σ²) Gaussian / Normal measurement model Decisions on the experimental design influence the measurement model. Heidelberg, October 2003 3

Practical microarray analysis experimental design Main requirements for experiments Once the conditions / treatments, experimental units, and the nature of the observations have been fixed, the main requirements are: Experimental units receiving different treatments should differ in no systematic way from one another Assumptions that certain sources of variation are absent or negligible should, as far as practical, be avoided; Random errors of estimation should be suitably small, and this should be achieved with as few experimental units as possible; The conclusions of the experiment should have a wide range of validity; The experiment should be simple in design and analysis; A proper statistical analysis of the results should be possible without making artificial assumptions. Taken from Cox DR (958) Planning of experiments, Wiley & Sons, New York (page 3) Heidelberg, October 2003 4

Information dilemma: too many or too few? # of variables of interest # of observational units Classical situation of a clinical research project: Statistical methods, principles of clinical epidemiology and principles of experimental design allow to give a confirmatory answer, if results of the study describe reality or are caused by random fluctuations. Working with micro-arrays 4

Information dilemma: too many or too few? # of variables of interest # of observational units The use of micro-array technology turns the classical situation upside down. There is the need for orientation how to perform microarray experiments. A new methodological consciousness is put to work: False detection rate Validation to avoid overfitting Working with micro-arrays 6

Biometrical practice Biological, medical framework Experimental design, clinical epidemiology Statistical methods Working with micro-arrays 7

Micro-array experiments Bioinformatics Biological, medical framework Technology Experimental design, clinical epidemiology Statistical methods Complex statistical methods Data collections Working with micro-arrays 8

Example LPS: The setting Problem: Differential reaction on LPS stimulation in peripheral blood of stroke patients and controls? Blood Gene expres. Patient Blood + LPS Gene expres. Pat Difference? Blood Gene expres. Control Blood + LPS Gene expres. Kon Sample size has to be chosen with respect to financial restrictions Peripheral blood is a special tissue, possible confounder Chosen technology: Affymetrix (22283 genes) PNAS, 00:896-90 Working with micro-arrays 5

Example LPS: Design - Pooling Assume a linear model for appropriately transformed gene expression: y Pat,Gen = transformed abundance + confounder effect + biol. var. + techn. var. Pat No LPS Gene expres. Pat 5 Pool LPS Gene expres. Pool Correction for confounding - if composition of pools is homogeneous over possible confounder Reduction of biological variability: σ biol No reduction of technological / array specific variability: σ tech Reduction of arrays is determined by Ψ = σ tech / σ biol. Working with micro-arrays 6

Example LPS: Design - Gene exclusion Array used codes for ~ 8000 genes Do we have good rules to reduce the set of interesting genes? How can we introduce a hierarchy into the gene list without manipulating the result of our analysis? Possible solutions: Bioinformatics: Integration of pathway information into the analysis Statistics: Use of genes with high inter-array variability - set cut-point Meta-genes (West et al.) - predefine # of meta-genes define cluster strategy Working with micro-arrays 7

Example LPS: Design - Gene exclusion Array used codes for ~ 8000 genes Do we have good rules to reduce the set of interesting genes? How can we introduce a hierarchy into the gene list without manipulating the result of our analysis? Possible solutions: Bioinformatics: Statistics: Integration of pathway information into the analysis Only possible for small problems Use of genes with high inter-array variability - set cut-point Meta-genes (West et al.) - predefine # of meta-genes define cluster strategy Working with micro-arrays 8

Example LPS: Design - Gene exclusion Array used codes for ~ 8000 genes Do we have good rules to reduce the set of interesting genes? How can we introduce a hierarchy into the gene list without manipulating the result of our analysis? Possible solutions: Bioinformatics: Statistics: Integration of pathway information into the analysis Only possible for small problems Use of genes with high inter-array variability - set cut-point Mostly heuristic procedures / Kropf et al. Meta-genes (West et al.) - predefine # of meta-genes define cluster strategy Working with micro-arrays 9

Example LPS: Design - Gene exclusion Array used codes for ~ 8000 genes Do we have good rules to reduce the set of interesting genes? How can we introduce a hierarchy into the gene list without manipulating the result of our analysis? Possible solutions: Bioinformatics: Statistics: Integration of pathway information into the analysis Only possible for small problems Use of genes with high inter-array variability - set cut-point Mostly heuristic procedures / Kropf et al. Meta-genes (West et al.) - predefine # of meta-genes define cluster strategy Not well evaluated Working with micro-arrays 20

Meta-genes Patient 2 Patient Working with micro-arrays 2

Meta-genes Patient 2 Patient Working with micro-arrays 22

Meta-genes Patient 2 Meta-gene 6 Meta-gene 5 Meta-gene Meta-gene 7 Meta-gene 4 Patient Meta-gene 3 Meta-gene 2 Working with micro-arrays 23

Meta-genes Patient 2 Meta-gene 6 Meta-gene 5 Meta-gene Meta-gene 7 Meta-gene 4 Patient Meta-gene 3 Meta-gene 2 Working with micro-arrays 24

Example LPS: Differential reaction DR LPS stimulated LPS stimulated FC = 3 FC = 0.25 Control Patient Differential reaction (DR): log(0.25 / 3) = log(0.25) - log(3) = - 3.8 DR = Pat - Kon Working with micro-arrays 29

Example LPS: Data Pool (5 subjects) Group Sex distribution (Male:Female) mean age Control :4 60.8 2 Control :4 65.4 3 Control 2:3 6.6 4 Patient 4: 64.4 5 Patient 5:0 66.2 6 Patient 3:2 74.4 Working with micro-arrays 30

Example LPS: Expression Summaries Quantification of expression MSA5: Tukey bi-weight signal of PM/MM, which is log-transformed RMA: VSN: linear additive model for log(pm), Median polish to aggregate over probes arsinh - transformation for PM values Rock - Blythe model for expression Working with micro-arrays 3

Example LPS: Expression Summaries MAS5 RMA Kontrollpool 2 - transf. Expression 0 2 4 6 8 0 Kontrollpool 2 - transf. Expression 4 6 8 0 2 4-2 0 2 4 6 8 0 Kontrollpool - transf. Expression 4 6 8 0 2 4 Kontrollpool - transf. Expression Working with micro-arrays 32

Example LPS: First look on the data All 22253 genes 000 meta-genes Differential reaction -.5 -.0-0.5 0.0 0.5.0.5 2.0 * * * * * * * * * ** * * * * * * * * * * * * * * * * * ** * * * * Mean diff. reaction -.0-0.5 0.0 0.5.0.5 69 352 80 343 64 604-2 0 2 4 6 8 Mean diff. response -2 0 2 4 6 Mean diff. expression Working with micro-arrays 33

Example LPS: Metagenes Mean diff. reaction -.0-0.5 0.0 0.5.0.5 69 604 352 80 343 64-2 0 2 4 6 > meta.gene.rma.summary $metagene.64 "2067_x_at" "204270_at" "23606_s_at "29273_at" "220557_s_at" "34478_at" $metagene.80 "207425_s_at" "26234_s_at" "26629_at" $metagene.343 "200935_at" "20556_s_at" "20579_s_at" "207824_s_at "2790_s_at "24792_x_at" "27793_at" "28600_at" $metagene.352 "20625_s_at" "20627_s_at" "207387_s_at" "20692_s_at "239_s_at "22206_at" $metagene.604 "204747_at" "205569_at" "205660_at" "2063_at" "20797_s_at" "222_s_at" "27502_at" $metagene.69 "AFFX-HUMRGE/M0098_3_at" "AFFX-HUMRGE/M0098_5_at "AFFX-HUMRGE/M0098_M_at" "AFFX-r2-Hs8SrRNA-3_s_at" "AFFX-r2-Hs8SrRNA-5_at" "AFFX-r2-Hs8SrRNA-M_x_at" Mean diff. expression Working with micro-arrays 34

Example LPS: Metagenes - multiple testing > round(meta.gene.mult.test.rma[[]][:30,],5) rawp Bonferroni Holm Hochberg SidakSS SidakSD BH BY [,] 0.0000 0.00932 0.00932 0.00932 0.00928 0.00928 0.00504 0.03772 [2,] 0.0000 0.0008 0.0007 0.0007 0.0003 0.0002 0.00504 0.03772 [3,] 0.00002 0.0573 0.0570 0.0570 0.0560 0.0557 0.00524 0.03924 [4,] 0.00004 0.0434 0.04328 0.04328 0.04248 0.04235 0.0094 0.07042 [5,] 0.00005 0.0482 0.04802 0.04802 0.04707 0.04689 0.0094 0.07042 [6,] 0.00006 0.05645 0.0567 0.0567 0.05489 0.05462 0.0094 0.07042 [7,] 0.00008 0.07559 0.0753 0.0753 0.07280 0.07238 0.0080 0.08083 [8,] 0.0002 0.2024 0.940 0.940 0.330 0.255 0.0236 0.09255 [9,] 0.0002 0.2398 0.2299 0.2299 0.66 0.573 0.0236 0.09255 [0,] 0.0003 0.2696 0.258 0.258 0.924 0.823 0.0236 0.09255 [,] 0.0004 0.3600 0.3464 0.3464 0.276 0.2598 0.0236 0.09255 Working with micro-arrays 35

Example LPS: Predictive analysis 0.0 0. 0.2 0.3 0.4 observed t-statistics Mixture components - without DR - negative DR - positive DR -0-5 0 5 0 t Statistik Working with micro-arrays 36

Example LPS: Predictive analysis 0.00 0.05 0.0 0.5 0.20 0.25 0.30 0.35 observed t-statistics Mixture components - without DR - negative DR - positive DR Inference of mixture components by EMalgorithm or Gibbssampler -0-5 0 5 0 t Statistik Working with micro-arrays 37

Example LPS: Predictive analysis Prob(Diff. reaction t-value) 0.0 0.2 0.4 0.6 0.8.0 P negative = 0.008 P positive = 0.003-0 -5 0 5 0 Value of t statistics Working with micro-arrays 38

Example LPS: Adjusted p-values Predictive analysis is closely related with frequentist test-theory: Procedure by Benjamini Hochberg (Efron, Storey, Tibshirani, 200) i-ter p-wert 0.0000 0.0002 0.0004 0.0006 0.0008 0.000 0.002 0.004 Index * FDR / # of genes FDR: 0. BH: red area contains in mean at most 0% false positive decisions. 0 50 00 50 200 250 300 Index Working with micro-arrays 39

Example LPS: Interpretation Genes with differential reaction (RMA) BH-procedure: # Genes: 22 PA with PPV>0.99: # Genes: 98 + 3 Genes with differential reaction (MAS5) BH-procedure: # Genes: 42 PA with PPV>0.99: # Genes: 62 Set of genes common to RMA and MAS5 result: 27 Working with micro-arrays 40

Example LPS: Interpretation Confounder, Covariates, Population variability - Pools are unbalanced between cases and controls Sample size calculation + Setting allows with high chance to detect absolute effects above 2. cdna arrays may have resulted in a more efficient analysis. Generality +/- Few pools may not give a representative sample of the patient group of interest. Interpretability - Inhomogeneities with respect to sex and age make it difficult to interpret DR as related to the disease. Artificial assumptions - Assumption of a linear model for confounder effects allows to assume an effect measurement fully attributable to the disease. Use of cdna arrays would have automatically eliminated the confounder effects. Working with micro-arrays 4

Practical microarray analysis experimental design The most simple measurement model in microarray experiments Situation: m arrays (Affimetrix) from control population n arrays (Affimetrix) from population with special condition /treatment Observation of interest: Mean difference of log-transformed gene expression ( logfc) logfc obs = logfc true + e e ~ N(0, σ² [/n+/m]) In an experiment with 5 arrays per population and the same variance for the expression of a gene of interest, the above formula implies that the variance of the logfc is only 40% (/5+/5 = 2/5 = 0.4) of the variability of a single measurement taming of uncertainty. Heidelberg, October 2003 5

Practical microarray analysis experimental design Separate out differences between the conditions / treatments from the uncontrolled variation that is assumed to be present. Is logfc true 0? How to decide? Special Decision rules: Statistical Tests When the probability model for the mechanism generating the observed data is known, hypotheses about the model can be tested. This involves the question: Could the presented data reasonable have come from the model if the hypothesis is correct? Usually a decision must be made on the basis of the available data, and some degree of uncertainty is tolerated about the correctness of that decision. These four components: data, model, hypothesis, and decision are basic to the statistical problem of hypothesis testing. Heidelberg, October 2003 6

Practical microarray analysis experimental design Quality of decision True state of gene Decision Gene is diff. expr. Gene is not diff. expr. Gene is diff. expr. OK false positive decision happens with probability α Gene is not diff. expr. OK false negative decision happens with probability β Two sources of error: Power of a test: False positive rate α False negative rate β Ability to detect a difference if there is a true difference Power true positive rate or Power = - β Heidelberg, October 2003 7

Practical microarray analysis experimental design Question of interest (Alterative): The Statistical test Is the gene G differentially expressed between two cell populations? Answer the question via a proof by contradiction: Show that there is no evidence to support the logical contrary of the alternative. The logical contrary of the alternative is called null hypothesis. Null hypothesis: The gene G is not differentially expressed between two cell populations of interest. A test statistic T is introduced which measures the fit of the observed data to the null hypothesis. The test statistics T implies a prob. distribution P to quantify its variability when the null hypothesis is true. It will be checked if the test statistic evaluated at the observed data t obs behaves typically (not extreme) with respect to the test distribution. The p-value is the probability under the null hypothesis of an observation which is more extreme as the observation given by the data: P( T t obs ) = p. A criteria is needed to asses extreme behaviour of the test statistic via the p value which is called the level of the test: a. The observed data does not fit to the null hypothesis if p < a or t obs > t* where t* is the -α or -α/2 quantile of the prob. distribution P. t* is also called the critical value. The conditions p < a and t obs > t* are equivalent. If p < a or t obs > t* the null hypothesis will be rejected. If p a or t obs t* the null hypothesis can not be rejected this does not mean that it is true Absence of evidence for a difference is no evidence for an absence of difference. Heidelberg, October 2003 8

Practical microarray analysis experimental design Controlling the power sample size calculations The test should produce a significant result (level α) with a power of -β if logfc true = δ null hypothesis: logfc true = 0 0 alternative: logfc true = δ δ z -α/2 σ n,m z -β σ n,m σ 2 n,m = σ 2 (/n+/m) The above requirement is fulfilled if: δ = (z -α/2 + z -β ) σ n,m or 2 n m (z α / 2 + z β) = n + m δ 2 σ 2 Heidelberg, October 2003 9

Practical microarray analysis experimental design Controlling the power sample size calculations 2 n m (z α / 2 + z β) = n + m δ 2 σ 2 n = N γ and m = N (-γ) with M total size of experiment and γ ]0,[ 2 (z α / 2 + z β ) N = γ ( γ) 2 δ 2 σ The size of the experiment is minimal if g = ½. 0 5 0 5 20 0.0 0.2 0.4 0.6 0.8.0 Gamma Heidelberg, October 2003 0

Practical microarray analysis experimental design Sample size calculation for a microarray experiment I Truth Test result diff. expr. (H ) not diff. expr. (H 0 ) diff. expr. D D 0 D not diff. expr. U U 0 U Number of genes on array G G 0 G α 0 = E[D 0 ]/G 0 β = E[U ]/G FDR=E[D 0 /D] E: expectation / mean number family type I error probability: α F = P[D 0 >0] family type II error probability: β F = P[U >0] Heidelberg, October 2003

Practical microarray analysis experimental design Sample size calculation for a microarray experiment II Independent genes Dependent Genes P[D 0 =0] = (-α 0 ) Go = -α F D 0 ~ Binomial(G 0, α 0 ) E[D 0 ] = G 0 α 0 Poissonapprox.: E[D 0 ] ~ -ln(-α F ) P[U =0] = (-β ) G = -β F E[U ] = G (-β ) Bonferroni: α 0 = α F / G 0 No direct link between the probability for D 0 and α F. -β F max{0,- G β } No direct link between the probability for U and β F. Heidelberg, October 2003 2

Practical microarray analysis experimental design Sample size calculation for a microarray experiment III for an array with 33000 independent genes What are useful α 0 and β? α F = 0.8 E[D 0 ] = - ln(-0.8) =.6 = λ P(exactly k false pos.) = exp(-λ) λ k / (k!) false pos. 0 2 3 4 5 Prob. 0.200 0.322 0.259 0.39 0.056 0.08 P(at least six false positives) = 0.0062 32500 unexpressed genes: α 0 =.6/32500 = 0.0000495 500 expressed genes, set E[D ] = 450 -β = 450/500 = 0.9 β = 0. -β F = (-β ) G < 0-23 E[FDR] = 0.0035 95% quantile of FDR: 0.0089 (calculated by simulation) Heidelberg, October 2003 3

Practical microarray analysis experimental design Sample size calculation for a microarray experiment IV In order to complete the sample size calculation for a microarray experiment, information on σ 2 is needed. The size of the experiment, N, needed to detect a logfc true of δ on a significance level α and with power -β is: 2 2 (z α / 2 + z β ) σ N = 4 2 δ In a similar set of experiments σ 2 for a set of 20 VSN transformed arrays was between.55 and.85. One may choose the value σ 2 = 2. δ log(.5) log(2) log(3) log(5) log(0) N (σ 2 = 2) 388 476 90 88 44 N (σ 2 = ) 694 238 96 44 22 Sample size with α = 0.0000495, β = 0. Heidelberg, October 2003 4

Practical microarray analysis experimental design Sample size formula for a one group test The test should produce a significant result (level α) with a power of -β if T = δ null hypothesis: T = 0 0 alternative: T = δ δ z -α/2 σ n z -β σ n σ 2 n = σ 2 /n The above requirement is fulfilled if: δ = (z -α/2 + z -β ) σ n or 2 2 (z α / 2 + z β ) σ n = 2 δ Heidelberg, October 2003 5

Practical microarray analysis experimental design Measurement model for cdna arrays Gene expression under condition A intensity of red colour, Gene expression under condition B intensity of green colour Measurement: m A/B = I red,a 2 I green,b Log = γ A/B + δ + e γ A/B log-transformed true fold change of gene of condition A with respect to condition B δ - dye effect, e measurement error with E[e] = 0 and Var(e) = σ 2 Measurement m A/B is used to estimate unknown γ A/B Vertices Edges mrna samples hybridization Direction Dye assignment Green Red B A Heidelberg, October 2003 6

Practical microarray analysis experimental design Estimation of log fold change g A/B Reference Design Dye swap design B A B A R R B A/ Estimate of γ A/B DS g = m A/R m B/R g A/ B = (m A/B m B/A )/2 Variability of estimate g ) = 2 σ 2 Var( g DS A/ B) = 0.5 σ 2 Sample Size increases proportional to the variance of the measurement! Var( R A/ B Heidelberg, October 2003 7

Practical microarray analysis experimental design 2x2 factorial experiments I treatment / condition Wild type Mutation before treatment β β+µ after treatment β+τ β+τ+µ+ψ β - baseline effect; τ - effect of treatment; µ - effect of mutation ψ - differential effect on treatment between WT and MUT treatment effect on gene expr. in WT cells: WT = (β + τ) - β = τ treatment effect on gene expr. in MUT cells: MUT = (β + τ + µ + ψ) (β + µ)= τ + ψ differential treatment effect: MUT WT or ψ 0 How many cdna arrays are needed to show ψ 0 with significance α and power -β if ψ > ln(5)? Heidelberg, October 2003 8

Practical microarray analysis experimental design 2x2 factorial experiments II Study the joint effect of two conditions / treatment, A and B, on the gene expression of a cell population of interest. There are four possible condition / treatment combinations: AB: treatment applied to MUT cells A: treatment applied to WT cells B: no treatment applied to MUT cells 0: no treatment applied to WT cells 0 A B AB Design with 2 slides Heidelberg, October 2003 9

Practical microarray analysis experimental design 2x2 factorial experiments III Array m A/0 m 0/A m B/0 m 0/B m AB/0 m 0/AB m AB/A m A/AB m AB/B m B/AB m A/B m B/A Measurement γ A/0 + δ + e = τ + δ + e -γ A/0 + δ + e = -τ + δ + e γ B/0 + δ + e = µ + δ + e -γ B/0 + δ + e = -µ + δ + e γ AB/0 + δ + e = µ + τ + ψ + δ + e -γ AB/0 + δ + e = - (µ + τ + ψ) + δ + e γ AB/A + δ + e = µ + ψ + δ + e -γ AB/A + δ + e = - (µ + ψ) + δ + e γ AB/B + δ + e = µ + ψ + δ + e -γ AB/B + δ + e = - (µ + ψ) + δ + e γ A/B + δ + e = τ - µ + δ + e -γ A/B + δ + e = - (τ - µ) + δ + e Each measurement has variance σ 2 Parameter β is confounded with the dye effect Heidelberg, October 2003 20

Practical microarray analysis experimental design Heidelberg, October 2003 2 Regression analysis ψ µ τ δ = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 M M M M M M M M M M M M E B A / A B / AB B/ B AB / AB A / A AB / AB 0/ 0 AB / B 0/ 0 B / A 0/ 0 A / AB B 0 A For parameter θ = (δ, τ, µ, ψ) define the design matrix X such that E(M) = Xθ. For each gene, compute least square estimate θ* = (X X) - X M (BLUE) Obtain measures of precision of estimated effects. Use all possibilities of the theory of linear models. Design problem: Each measurement M is made with variability σ 2. How precise can we estimate the components or contrasts of θ? Answer: Look at (X X) -

Practical microarray analysis experimental design 0 A 2 x 2 factorial designs IV B AB total.2.by.2.design.mat delta alpha beta psi > precision.2.by.2.rfc(x.mat) A/0 0 0 $inv.mat 0/A - 0 0 B/0 0 0 tau mu psi 0/B 0-0 tau 0.250 0.25-0.25 AB/0 mu 0.25 0.250-0.25 0/AB - - - psi -0.250-0.250 0.50 AB/A 0 A/AB 0 - - $effects AB/B 0 tau mu psi tau-mu B/AB - 0-0.25 0.25 0.50 0.25 B/A - 0 A/B - 0 Var(A-B) = Var(A) + Var(B) 2 Cov(A,B) Heidelberg, October 2003 22

Practical microarray analysis experimental design Sample size for differential treatment effect (DTE) in a 2 x 2 factorial designs I Array has 20.000 genes: 9500 without DTE, 500 with DTE α F = 0.9, using Bonferroni adjustment: α = 0.9/20.000 = 0.0000462 Mean number of correct positives is set to 450: -β = 0.9 σ 2 = 0.7, taken from similar experiments A total dye swap design (2 arrays) estimates ψ with precision σ 2 /2 = 0.35 N = [4.074 +.282] 2 0.35 / ln(5) 2 = 3.876 The experiment would need in total 4 x 2 = 48 arrays Is there a chance to get the same result cheaper? Heidelberg, October 2003 23

Practical microarray analysis experimental design 2 x 2 factorial designs V Design I Common ref. Design II Common ref. Design III Connected Design IV Connected AB AB AB AB A B A B A B A B 0 0 0 0 A Design V All-pairs AB B Scaled variances of estimated effects D.I D.II D.III D.IV D.V D.tot tau 2 0.75.00 0.5 0.25 mu 2 0.75 0.75 0.5 0.25 psi 3 3.00 2.00.0 0.50 0 # chips 3 3 4 4 6 2 Heidelberg, October 2003 24

Practical microarray analysis experimental design Sample size for differential treatment effect (DTE) in a 2 x 2 factorial designs II Is there a chance to get the same result cheaper? Using total dye swap design, the experiment would need in total 4 x 2 = 48 arrays Using Design III, the effect of interest is estimated with doubled variance (4 8) but by using a design which need only 4 arrays (2 4). This reduces the number of arrays needed from 48 to 32. Heidelberg, October 2003 25

Practical microarray analysis experimental design Designs for time course experiments Experimental Design - Conclusions In addition to experimental constraints, design decisions should be guided by knowledge of which effects are of greater interest to the investigator. The unrealistic planning based on independent genes may be put into a more realistic framework by using simulation studies speak to your bio statistician/informatician How to collect and present experience from performed microarray experiments on which to base assumptions for planing (σ 2 )? Further reading: Kerr MK, Churchill GA (200) Experimental design for gene expression microarrays, Biostatistics, 2:83-20 Lee MLT, Whitmore GA (2002), Power and sample size for DNA microarray studies, Stat. in Med., 2:3543-3570 Heidelberg, October 2003 26