Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data 23. Sep 09 1 / 30
Table of Contents
1 Outline
2 Experimental design
3 Statistical modelling
4 Hypotheses testing
5 Gene set enrichment analysis
6 Classification
Outline
The focus is on:
- Single-channel microarrays
- One sample per array
- Gene expression values for thousands of oligonucleotides
- Identifying genes that are differentially expressed due to a treatment
- Finding significantly differentially expressed genes with a given error probability
- (Predicting a treatment level given the gene expression data)
Controlled experiments
Independent replications:
- Multiple sources of variability are present: sample, array, environmental variability, ...
- Account for this variability in the experimental design by replicating arrays, samples, time points, ...
Randomisation:
- Needed to separate treatment effects from other factors that might influence gene expression
Experimental design
Planning an experiment:
- Multiple arrays per sample? This enables estimating the array variability, but a large amount of RNA is needed.
- More complex designs need a larger number of arrays and samples.
- Measure covariates that are not directly of interest but might influence gene expression.
Simple classic design:
- 2 treatments (control/treatment), multiple arrays/samples per treatment
Data structure
          Treatment A                     Treatment B                     ...
          Array 1   Array 2   Array 3     Array 4   Array 5   Array 6     ...
Gene 1    y_11      y_12      y_13        y_14      y_15      y_16        ...
Gene 2    y_21      y_22      y_23        y_24      y_25      y_26        ...
Gene 3    y_31      y_32      y_33        y_34      y_35      y_36        ...
...       ...       ...       ...         ...       ...       ...
Data example
Generating artificial data:
- 2 treatments (A, B)
- 20 arrays per treatment
- 5000 genes per array
- Normally distributed residuals and array effects: within-array sd = 1; between-array sd = 0.5
- 100 genes show an effect (δ = ±2)
- 2^x transformation
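The data-generating scheme listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_expression` and its parameters are hypothetical, the effect δ is assumed to act on treatment B only, and the first `n_effect` genes are assumed to be the affected ones with a randomly chosen sign.

```python
import random

def simulate_expression(n_genes=5000, n_per_group=20, n_effect=100,
                        delta=2.0, sd_within=1.0, sd_between=0.5, seed=1):
    """Return (data, labels, effects); data[g][a] is the value of gene g on array a."""
    rng = random.Random(seed)
    labels = ["A"] * n_per_group + ["B"] * n_per_group
    # One random shift per array (between-array variability).
    array_effect = [rng.gauss(0.0, sd_between) for _ in labels]
    # Assumption: the first n_effect genes carry an effect of +delta or -delta.
    effects = [rng.choice([-delta, delta]) if g < n_effect else 0.0
               for g in range(n_genes)]
    data = []
    for g in range(n_genes):
        row = []
        for a, lab in enumerate(labels):
            x = array_effect[a] + rng.gauss(0.0, sd_within)  # within-array residual
            if lab == "B":  # assumption: the effect acts on treatment B only
                x += effects[g]
            row.append(2.0 ** x)  # 2^x transformation to a positive intensity scale
        data.append(row)
    return data, labels, effects
```

Calling `simulate_expression()` with its defaults gives a 5000 x 40 table laid out as in the data structure slide; smaller values of `n_genes` are convenient for quick checks.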
Data example
[Figure: expression values by array; x-axis: Array, y-axis ranging from 0 to 70]
Data example
[Figure: density of the expression values x, for x between 0 and 10]
Normalisation
Preliminary data processing:
- Check for hybridisation errors
- Variability between arrays might bias the results
- Only a few genes are expected to show an effect
- Use all observations, or known expression values of reference genes, to standardise the arrays
- Transform the data towards a normal distribution (commonly by a log2 transformation)
Example data transformation
[Figure: two density panels; left: original data, x-axis x; right: transformed data, x-axis log2(x)]
Median normalisation
[Figure: two density panels; left: transformed data; right: normalised data; both with x-axis log2(x)]
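A minimal sketch of the two preprocessing steps, log2 transformation followed by median normalisation. It assumes that median normalisation means subtracting each array's median on the log2 scale so that every array is centred at 0; some implementations instead shift all arrays to a common overall median. The function name `log2_median_normalise` is hypothetical.

```python
import math
import statistics

def log2_median_normalise(data):
    """data[g][a]: raw positive intensity of gene g on array a."""
    logged = [[math.log2(v) for v in row] for row in data]
    n_arrays = len(logged[0])
    # Median of each array (column) on the log2 scale.
    medians = [statistics.median(row[a] for row in logged)
               for a in range(n_arrays)]
    # Assumption: centre every array at 0 by subtracting its own median.
    return [[row[a] - medians[a] for a in range(n_arrays)] for row in logged]
```

Because only a few genes are expected to show an effect, the array median is dominated by unaffected genes, which is why it is a reasonable reference point for aligning arrays.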
Estimating treatment effects
Statistical models:
- Explain the effects with only a few parameters in a statistical model
- Estimate the parameters, e.g. by minimising the residuals
- Because computing resources are limited, the models can be fitted separately for each gene
Two-sample design:
- For the simple treatment-control design, the difference between the arithmetic means and its standard error can be estimated for each gene.
- After applying the inverse of the log2 transformation, the fold change is the parameter of interest. On the original scale, the back-transformed difference of log2 means is a ratio of geometric means.
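For a single gene in the two-sample design, the estimates named on this slide can be computed as follows. This is a sketch, assuming log2-scale input and a Welch-type standard error that does not pool the two group variances; a pooled-variance version would be an equally valid choice. The name `gene_effect` is hypothetical.

```python
import math
import statistics

def gene_effect(log_a, log_b):
    """Return (difference of means B - A, its standard error, fold change)."""
    diff = statistics.mean(log_b) - statistics.mean(log_a)
    # Assumption: Welch-type standard error, group variances are not pooled.
    se = math.sqrt(statistics.variance(log_a) / len(log_a)
                   + statistics.variance(log_b) / len(log_b))
    # Inverse of the log2 transformation gives the fold change.
    return diff, se, 2.0 ** diff
```

For example, `gene_effect([0, 1, 2], [2, 3, 4])` gives a difference of 2 on the log2 scale, which corresponds to a fold change of 4.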
Parametric vs. non-parametric methods
Parametric methods:
- Assume a normal distribution after the log2 transformation
- Summarising the data by means and standard errors is adequate under the assumptions of a general linear model
Non-parametric methods:
- For skewed distributions, reporting only means and standard errors might be misleading
- Use medians, IQR, range, ... instead
- Apply rank transformations, resampling methods, ...
- Interpreting treatment comparisons might be more complicated in models with fewer assumptions
- Lack of power at small sample sizes
Independent observations?
No complete randomisation:
- Observations from non-randomised experimental units might be correlated, e.g.
  - Multiple arrays for the same sample
  - Samples of the same individual over time
  - Block structures
  - ...
- Assuming independence of correlated observations may lead to underestimation of the variability
- Introduce multiple error terms in the model
- This increases the complexity of the model and the sample size needed
Hypotheses testing
Test for a single gene:
- Set up the hypotheses of interest (e.g. H_0: the parameter of interest equals 0)
- Construct a test statistic for each gene
- Calculate p-values under the assumed null distribution of the test statistic
Borrowing information from multiple genes:
- At small sample sizes the gene-wise estimation of standard errors is difficult
- Add a fudge factor to the standard error to minimise the coefficient of variation
- Borrow information about the variability from all genes by empirical Bayes
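The sketch below illustrates a gene-wise test statistic with a fudge factor added to the standard error. The names `moderated_stat`, `permutation_p_value` and `s0` are hypothetical, and the value of `s0` is simply passed in rather than tuned to minimise the coefficient of variation. The p-value is obtained by permutation, which is a swapped-in resampling technique and not the t-distribution used for the slides' example.

```python
import math
import random
import statistics

def moderated_stat(a, b, s0=0.0):
    """Difference of means divided by (standard error + fudge factor s0)."""
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return diff / (se + s0)

def permutation_p_value(a, b, s0=0.0, n_perm=2000, seed=1):
    """Two-sided p-value from random relabelling of the arrays."""
    rng = random.Random(seed)
    observed = abs(moderated_stat(a, b, s0))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(moderated_stat(pooled[:len(a)], pooled[len(a):], s0))
        if stat >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)
```

A positive `s0` shrinks the statistic of genes with very small estimated standard errors, which is the purpose of the fudge factor at small sample sizes.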
t-test results for the example
[Figure: histogram of the p-values; x-axis: p-value (0 to 1), y-axis: Frequency]
Error rates
As multiple hypotheses are tested, different error rates can be controlled, and the type I error of an individual test might not be adequate.

                 # H_0 not rejected   # H_0 rejected   Total
# true H_0       U                    V                m_0
# false H_0      T                    S                m - m_0
Total (known)    m - R                R                m

PCER  Per-comparison error rate: E(V)/m
FWER  Family-wise error rate: P(V > 0)
FDR   False discovery rate: E(V/R)
...
FWER controlling procedures
Calculating adjusted p-values \tilde{p}_i (i = 1, ..., m):
- Bonferroni: \tilde{p}_i = \min\{1, m p_i\} (single-step)
- Holm: \tilde{p}_i = \min\{1, \max\{\tilde{p}_{i-1}, (m - i + 1) p_i\}\} (step-down, for p_1 \le ... \le p_i \le ... \le p_m)
- Utilising a multivariate distribution, resampling methods (single-step)
- ...
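The Bonferroni and Holm adjustments can be written compactly as below. This is a sketch with hypothetical function names; it returns the adjusted p-values in the original (unsorted) order of the input.

```python
def bonferroni(pvals):
    """Single-step adjustment: multiply every p-value by m, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Step-down adjustment, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        # Factor m - rank equals (m - i + 1) for the 1-based ordered index i.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

A hypothesis is rejected at level α when its adjusted p-value is at most α. Holm's adjusted p-values are never larger than Bonferroni's, which matches the slightly higher number of rejections reported for Holm in the data example.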
FWER control using data-driven weights
Weighted step-down procedure:
- Weight the m unadjusted p-values, p^*_i = p_i / w_i, and order them so that p^*_1 \le ... \le p^*_m
- Reject H_i as long as p^*_i \le \alpha / \sum_{k=i}^{m} w_k
Obtaining weights:
- Choose the weights independently of the significance of the test
- Gather information about the distribution of the hypotheses under the null or under the alternative
Examples:
- Weighting by the total variance of the entire sample, w_i = S_i
- Weighting by nondecreasing monotone functions of the total variance, w_i = f(S_i)
- Using principal components to define weights
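A sketch of the weighted step-down rule stated on this slide, assuming strictly positive weights. The function name `weighted_step_down` is hypothetical. With all weights equal to 1 the rule reduces to Holm's procedure.

```python
def weighted_step_down(pvals, weights, alpha=0.05):
    """Return a list of booleans: True where H_i is rejected."""
    m = len(pvals)
    weighted = [p / w for p, w in zip(pvals, weights)]
    order = sorted(range(m), key=lambda i: weighted[i])
    # Sum of the weights of all hypotheses not yet rejected,
    # i.e. the sum over k = i, ..., m in the ordered sequence.
    remaining = sum(weights)
    rejected = [False] * m
    for i in order:
        if weighted[i] <= alpha / remaining:
            rejected[i] = True
            remaining -= weights[i]
        else:
            break  # step-down: stop at the first non-rejection
    return rejected
```

The weights must not depend on the group labels. The total variance of a gene over the entire sample is one such choice, since it can be computed without knowing which arrays belong to which treatment.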
FDR controlling procedures
Calculating adjusted p-values \tilde{p}_j (j = 1, ..., m):
- Benjamini-Hochberg: \tilde{p}_j = \min_{i \ge j} \{ \frac{m}{i} p_i \} (step-up, for p_1 \le ... \le p_i \le ... \le p_m)
- Benjamini-Yekutieli: correction under dependence (step-up)
- Storey: pFDR (estimating m_0/m)
- ...
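The two step-up adjustments can be sketched as follows, with hypothetical function names. The Benjamini-Yekutieli version multiplies the Benjamini-Hochberg adjusted p-values by the harmonic sum 1 + 1/2 + ... + 1/m and caps the result at 1.

```python
def benjamini_hochberg(pvals):
    """Step-up FDR adjustment, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m / rank * pvals[i])
        adjusted[i] = running_min
    return adjusted

def benjamini_yekutieli(pvals):
    """BH adjustment with the harmonic correction factor for dependence."""
    m = len(pvals)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    return [min(1.0, c_m * p) for p in benjamini_hochberg(pvals)]
```

Because the correction factor is larger than 1 for m > 1, the Benjamini-Yekutieli adjustment is the more conservative of the two.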
Comparison of adjustment methods using adjusted p-values for the data example

Method         # H_0 rejected   # H_0 falsely rejected
unadjusted     334              234
Bonferroni     91               0
Holm           92               0
S_i-weighted   44               0
min-p          82               0
BH             102              3
BY             98               0
Storey         102              3
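The counts in this comparison can be turned into realised error proportions for the single simulated dataset. The sketch below uses only the numbers from the table and m = 5000 genes from the data example; the variable names are hypothetical. Note that these are realised proportions V/R and V/m in one dataset, not the expectations that define FDR and PCER.

```python
M_GENES = 5000  # number of hypotheses in the data example

# (number of rejected H_0, number of falsely rejected H_0) per method
results = {
    "unadjusted": (334, 234),
    "Bonferroni": (91, 0),
    "Holm": (92, 0),
    "S_i-weighted": (44, 0),
    "min-p": (82, 0),
    "BH": (102, 3),
    "BY": (98, 0),
    "Storey": (102, 3),
}

# Realised false discovery proportion V / R (0 if nothing is rejected).
fdp = {name: (v / r if r else 0.0) for name, (r, v) in results.items()}
# Realised per-comparison proportion V / m.
per_comparison = {name: v / M_GENES for name, (r, v) in results.items()}
# Correct rejections S = R - V.
true_positives = {name: r - v for name, (r, v) in results.items()}
```

For the unadjusted analysis, 234 of 334 rejections are false (about 0.70), whereas for BH the proportion is 3 of 102 (about 0.03) with 99 correct rejections.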
Volcano plot
[Figure: volcano plot; x-axis: log2 fold change, y-axis: -log10 p-value; legend: unadjusted, Bonferroni, BH, BY, min-p]
Gene set enrichment analysis
- Define multiple sets of genes
- Test differential expression for these gene sets
- Small effects of single genes are hard to detect
- Combination of multiple small effects to get the big picture
- Reduction of the dimensionality of the multiple testing problem
- Test effects for whole pathways, functional groups, etc.
Assigning genetic features to known classes
Classification:
- Reformulating the problem into a setting with p regressors to estimate the class membership probability (control/treatment) for each gene
- Finding a classification rule by e.g.
  - Logistic regression
  - Discriminant analysis
  - SVM
  - ...
Validation:
- Fitting the model to training data
- Validation of the model by test data
- Cross-validation to validate the model on training data
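Cross-validation on the training data can be organised by splitting the array indices into k folds, as in the sketch below. The function name `k_fold_indices` is hypothetical, and the random, non-stratified split is an assumption; with two treatments one might prefer folds that are balanced by treatment.

```python
import random

def k_fold_indices(n, k=5, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Every k-th shuffled index goes into the same fold.
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each array is used exactly once for testing. Any data-driven step, such as selecting genes, should be repeated within each training fold, otherwise the estimated error tends to be too optimistic.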
Problem of high dimensions: p >> n
Problem:
- Logistic regression and LDA require that the number of observations is larger than the number of variables
- Reducing the number of variables by feature selection
- Using penalized logistic regression, ...
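As one illustration of a penalized fit that remains usable when p > n, the sketch below implements logistic regression with an L2 (ridge) penalty by plain gradient descent. The choice of the ridge penalty, the optimiser, and all names and default values (`fit_penalised_logistic`, `lam`, `lr`, `n_iter`) are assumptions for illustration; the slides do not specify a particular penalty.

```python
import math

def _sigmoid(eta):
    # Clip the linear predictor to avoid overflow in exp().
    eta = max(-30.0, min(30.0, eta))
    return 1.0 / (1.0 + math.exp(-eta))

def fit_penalised_logistic(X, y, lam=1.0, lr=0.1, n_iter=500):
    """X: one feature row per array, y: 0/1 labels. Returns (intercept, coefs)."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = _sigmoid(b0 + sum(b * v for b, v in zip(beta, xi))) - yi
            grad0 += err
            for j in range(p):
                grad[j] += err * xi[j]
        b0 -= lr * grad0 / n
        # The L2 penalty shrinks the coefficients (not the intercept) towards 0.
        beta = [b - lr * (g / n + lam * b) for b, g in zip(beta, grad)]
    return b0, beta

def predict_prob(model, x):
    """Estimated probability of class 1 for one feature row x."""
    b0, beta = model
    return _sigmoid(b0 + sum(b * v for b, v in zip(beta, x)))
```

Without the penalty the coefficients are not uniquely determined when p > n; the penalty term makes the optimisation problem well-posed, at the price of having to choose `lam`, for example by cross-validation.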
Feature selection
Filtering genes:
- Multiple testing approaches can be used as a filter
- Select all variables corresponding to genes with a p-value p \le p_0
- Perform, for example, logistic regression to model the posterior probability of K classes:
  \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x, \quad k = 1, ..., K - 1
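The filtering step can be sketched as below. All function names are hypothetical. `filter_genes` applies the threshold p ≤ p_0, `top_k_genes` keeps a fixed number of genes with the smallest p-values, and `reduce_data` turns the gene-by-array table into one feature row per array for a subsequent classifier.

```python
def filter_genes(pvals, p0=0.05):
    """Indices of genes whose p-value is at most p0."""
    return [g for g, p in enumerate(pvals) if p <= p0]

def top_k_genes(pvals, k=10):
    """Indices of the k genes with the smallest p-values."""
    return sorted(range(len(pvals)), key=lambda g: pvals[g])[:k]

def reduce_data(data, genes):
    """data[g][a] -> one row per array containing only the selected genes."""
    n_arrays = len(data[0])
    return [[data[g][a] for g in genes] for a in range(n_arrays)]
```

The p-values passed in can be unadjusted or adjusted; the filter only needs a ranking or a threshold.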
A second example
Generating artificial training data:
- 2 treatments (A, B)
- 20 arrays per treatment
- 5000 genes per array
- Normally distributed residuals and array effects: within-array sd = 1; between-array sd = 0.5
- Genes show N(0, 0.25) distributed effects
- 2^x transformation
Generating test data:
- 10 arrays per treatment
- Same effects as in the training data
Both datasets are log2 transformed and median normalised.
Feature selection / Classification
- Choosing only the 10 genes with the best t-test results as covariates
- Performing LDA and logistic regression
- Validation by the test set

LDA:
     A   B
A    8   2
B    2   8

Logistic regression:
     A   B
A    7   2
B    3   8
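The two confusion tables can be summarised by the proportion of correctly classified test arrays. The sketch below uses only the counts shown on this slide; the function name `accuracy` is hypothetical. The result does not depend on whether rows or columns hold the true classes, although the column sums of 10 match the 10 test arrays per treatment, which suggests that the columns are the true treatments.

```python
def accuracy(confusion):
    """Proportion of observations on the diagonal of a square confusion table."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

lda = [[8, 2], [2, 8]]
logistic = [[7, 2], [3, 8]]

acc_lda = accuracy(lda)            # 16 of 20 arrays correct: 0.8
acc_logistic = accuracy(logistic)  # 15 of 20 arrays correct: 0.75
```

With only 20 test arrays, the difference of a single correctly classified array between the two methods should not be over-interpreted.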
References
Dudoit, S and van der Laan, M (2008): Multiple Testing Procedures with Applications to Genomics. Springer Series in Statistics.
Gentleman, R, Carey, VJ, Huber, W, Irizarry, RA, and Dudoit, S (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, Statistics for Biology and Health.
Hastie, T, Tibshirani, R, and Friedman, J (2001): The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer Series in Statistics.
Benjamini, Y and Hochberg, Y (1995): Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57:289-300.
Benjamini, Y and Yekutieli, D (2001): The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29:1165-1188.
Finos, L and Salmaso, L (2007): FDR- and FWE-controlling methods using data-driven weights. JSPI 137:3859-3870.
Kropf, S and Läuter, J (2002): Multiple tests for different sets of variables using a data-driven ordering of hypotheses, with an application to gene expression data. Biometrical Journal 44:789-800.
Saeys, Y, Inza, I, and Larrañaga, P (2007): A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507-2517.
Schwender, H, Ickstadt, K, and Rahnenführer, J (2008): Classification with High-Dimensional Genetic Data: Assigning Patients and Genetic Features to Known Classes. Biometrical Journal 50:911-926.
Storey, JD and Tibshirani, R (2003): Statistical significance for genomewide studies. PNAS 100:9440-9445.