Statistical issues in the analysis of microarray data


Statistical issues in the analysis of microarray data
Daniel Gerhard
Institute of Biostatistics, Leibniz University of Hannover
ESNATS Summerschool, Zermatt
D. Gerhard (LUH), Analysis of microarray data, 23. Sep 09

Table of Contents
1 Outline
2 Experimental design
3 Statistical modelling
4 Hypotheses testing
5 Gene set enrichment analysis
6 Classification

Outline
Focus is set on
- Single channel microarrays
- One sample per array
- Gene expressions for thousands of oligonucleotides
- Identifying genes that are differentially expressed due to a treatment
- Finding significantly differentially expressed genes with a given error probability
- (Predicting a treatment level given the gene expression data)

Controlled experiments
Independent replications
- Multiple sources of variability present: sample, array, environmental variability, ...
- Account for this variability in the experimental design by several replications of arrays, samples, multiple timepoints, ...
Randomisation
- Needed to separate treatment effects from other factors, which might influence gene expression

Experimental design
Planning an experiment
- Multiple arrays per sample? Enables estimating array variability, but a large amount of RNA is needed.
- With more complex designs a larger number of arrays and samples is needed
- Measuring covariates, which are not directly of interest, but might have an influence on gene expression
Simple classic design
- 2 treatments (Control/Treatment), multiple arrays/samples per treatment

Data structure

          Treatment A                  Treatment B                  ...
          Array 1   Array 2   Array 3   Array 4   Array 5   Array 6   ...
Gene 1    y_11      y_12      y_13      y_14      y_15      y_16      ...
Gene 2    y_21      y_22      y_23      y_24      y_25      y_26      ...
Gene 3    y_31      y_32      y_33      y_34      y_35      y_36      ...
...       ...       ...       ...       ...       ...       ...

Data example
Generating artificial data
- 2 treatments (A, B)
- 20 arrays per treatment
- 5000 genes per array
- Normally distributed residuals and array effects (within-array sd = 1; between-array sd = 0.5)
- 100 genes show an effect (δ = ±2)
- 2^x transformation
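The data-generating recipe on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions that the slide does not state: the effect is added to treatment B only, the first `n_effect` genes are the affected ones, and there is no gene-specific baseline. The function name `simulate_arrays` and its parameters are hypothetical.

```python
import random

def simulate_arrays(n_genes=5000, n_per_group=20, n_effect=100, delta=2.0,
                    sd_within=1.0, sd_between=0.5, seed=1):
    """Return (data, labels, effects); data[g][a] is the 2^x-transformed value."""
    rng = random.Random(seed)
    labels = ["A"] * n_per_group + ["B"] * n_per_group
    # One random shift per array (between-array variability).
    array_effect = [rng.gauss(0.0, sd_between) for _ in labels]
    # Assumption: the first n_effect genes get an effect of +delta or -delta in group B.
    effects = [rng.choice([-delta, delta]) if g < n_effect else 0.0
               for g in range(n_genes)]
    data = []
    for g in range(n_genes):
        row = []
        for a, lab in enumerate(labels):
            x = array_effect[a] + rng.gauss(0.0, sd_within)  # within-array residual
            if lab == "B":
                x += effects[g]
            row.append(2.0 ** x)  # 2^x transformation, so raw values are positive
        data.append(row)
    return data, labels, effects

data, labels, effects = simulate_arrays(n_genes=200, n_effect=10)
```

A smaller `n_genes` keeps the sketch fast; the slide's setting corresponds to the default arguments.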

Data example
[Figure: expression values plotted by array; x-axis: Array]

Data example
[Figure: density of the expression values; x-axis: x, y-axis: density]

Normalisation
Preliminary data processing
- Checking for hybridisation errors
- Variability between arrays might bias the results
- Only a few genes are expected to show an effect
- Using all observations or known expressions of reference genes to standardise arrays
- Trying to bring the data closer to a normal distribution (commonly by log2 transformation)

Example data transformation
[Figure, two panels: "original" (density against x) and "transformed" (density against log2(x))]

Median normalisation
[Figure, two panels: "transformed" and "normalised", each showing density against log2(x)]
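A minimal sketch of the two preprocessing steps, log2 transformation followed by median normalisation. It assumes that "median normalisation" means subtracting each array's median over all genes, so that every array ends up with median zero; the slides do not spell out the exact variant. The function name is hypothetical.

```python
import math
import statistics

def log2_median_normalise(data):
    """data[g][a]: positive raw value of gene g on array a."""
    logged = [[math.log2(v) for v in row] for row in data]
    n_arrays = len(logged[0])
    # Median of each array (column), taken over all genes.
    medians = [statistics.median(row[a] for row in logged)
               for a in range(n_arrays)]
    # Subtract the array median so that all arrays share median 0.
    return [[row[a] - medians[a] for a in range(n_arrays)] for row in logged]

# Three genes on two arrays; the second array is shifted by a factor of 8.
normalised = log2_median_normalise([[1, 8], [2, 16], [4, 32]])
```

After normalisation the two toy arrays are identical, which is the intended removal of a pure array shift.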

Estimating treatment effects
Statistical models
- Trying to explain the effects by only a few parameters in a statistical model
- Estimating parameters, e.g. by minimising residuals
- Due to limited calculation resources, models can be fitted separately for each gene
2 sample design
- For the simple treatment-control design, the difference between the arithmetic means and its standard error can be estimated for each gene.
- After applying the inverse of the log2 transformation to this difference, the fold change is the parameter of interest. It is a ratio of means on the original scale; strictly, the back-transformed difference of log-scale arithmetic means is a ratio of geometric means.
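For one gene, the estimates of the 2 sample design can be sketched as follows. The pooled standard error assumes equal variances in both groups, which is an assumption of this sketch and not something the slides prescribe; the function name is hypothetical.

```python
import math
import statistics

def two_sample_summary(a, b):
    """a, b: log2 expression values of one gene in control and treatment."""
    na, nb = len(a), len(b)
    diff = statistics.mean(b) - statistics.mean(a)
    # Pooled variance, assuming equal variances in both groups.
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    # Back-transforming the log2 difference gives the fold change
    # (a ratio of geometric means on the original scale).
    return diff, se, 2.0 ** diff

diff, se, fold_change = two_sample_summary([1, 2, 3], [3, 4, 5])
```

In the toy call the log2 difference is 2, so the fold change is 4.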

Parametric vs. non-parametric methods
Parametric methods
- Assuming normal distribution after log2 transformation
- Summarising the data by means and standard errors is adequate under the assumptions of a general linear model
Nonparametric methods
- With skewed distributions, providing only means and standard errors might be misleading
- Instead using medians, IQR, range, ...
- Applying rank transformation, resampling methods, ...
- Interpretation of treatment comparisons might be more complicated in models with fewer assumptions
- Lack of power at small sample sizes

Independent observations?
No complete randomisation
- Observations from non-randomised experimental units might be correlated, e.g.
  - Multiple arrays for the same sample
  - Samples of the same individual over time
  - Block structures
  - ...
- Assuming independence of correlated observations may lead to underestimation of variability
- Introducing multiple error terms in the model
- Increased complexity of the model, increase in sample size needed

Hypotheses Testing
Test for a single gene
- Setting up hypotheses of interest (e.g. H_0: parameter of interest equals 0)
- Constructing test statistics for each gene
- Calculating p-values under assumption of a null distribution for the test statistic
Borrowing information from multiple genes
- At small sample sizes the genewise estimation of standard errors is difficult
- Adding a fudge factor to the standard error to minimise the coefficient of variation
- Borrowing information about variability from all genes by empirical Bayes
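A hedged sketch of a single-gene test. The statistic is the usual pooled two-sample t statistic with an optional fudge factor `s0` added to the standard error. Two things are assumptions of this sketch: `s0` is passed in as a fixed number (choosing it to minimise the coefficient of variation across genes is not implemented), and the p-value comes from a Monte Carlo permutation null instead of the t-distribution, purely to stay within the Python standard library.

```python
import math
import random
import statistics

def t_statistic(a, b, s0=0.0):
    """Pooled two-sample t statistic; s0 is an optional fudge factor."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / (se + s0)

def permutation_p_value(a, b, n_perm=2000, s0=0.0, seed=1):
    """Two-sided p-value from random relabelling of the arrays."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b, s0))
    values = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if abs(t_statistic(values[:len(a)], values[len(a):], s0)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)

control = [1.0, 1.2, 0.9, 1.1]
treated = [3.0, 3.3, 2.9, 3.1]
p = permutation_p_value(control, treated)
```

With `s0 > 0` the statistic shrinks towards zero, most strongly for genes whose estimated standard error is small.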

t-test results for the example
Distribution of p-values:
[Figure: histogram of the p-values; x-axis: p value, y-axis: Frequency]

Error rates
As multiple hypotheses are tested, there is a choice of controlling different error rates, and the individual type I error might not be adequate.

               # H_0 not rejected   # H_0 rejected   total
# true H_0     U                    V                m_0
# false H_0    T                    S                m - m_0
total          m - R                R                m (known)

- PCER, per comparison error rate: $E(V)/m$
- FWER, family-wise error rate: $P(V > 0)$
- FDR, false discovery rate: $E(V/R)$
- ...

FWER controlling procedures
Calculating adjusted p-values $\tilde{p}_i$ ($i = 1, \dots, m$)
- Bonferroni: $\tilde{p}_i = \min\{1, m\,p_i\}$ (single-step)
- Holm: $\tilde{p}_i = \min\{1, \max\{\tilde{p}_{i-1}, (m - i + 1)\,p_i\}\}$ (step-down, for $p_1 \le \dots \le p_i \le \dots \le p_m$)
- Utilising a multivariate distribution, resampling methods (single-step)
- ...
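The Bonferroni and Holm adjustments can be sketched in plain Python. The function names are hypothetical; the results are returned in the original order of the input p-values.

```python
def bonferroni(pvals):
    """Single-step adjustment: multiply each p-value by m, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Step-down adjustment, returned in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        # (m - rank) equals (m - i + 1) for the 1-based position i.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

p = [0.01, 0.04, 0.03]
print(bonferroni(p))
print(holm(p))
```

The running maximum enforces that adjusted p-values never decrease along the ordered sequence, which is what makes the procedure step-down.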

FWER control using data-driven weights
Weighted step-down procedure
- Weight the m unadjusted p-values, $\tilde{p}_i = p_i / w_i$, and order them so that $\tilde{p}_1 \le \dots \le \tilde{p}_m$
- Reject $H_i$ as long as $\tilde{p}_i \le \alpha / \sum_{k=i}^{m} w_k$
Obtaining weights
- Choosing weights independently of the significance of the test
- Gather information about the distribution of hypotheses under the null or in the alternative
Examples
- Weighting by the total variance of the entire sample, $w_i = S_i$
- Weighting by nondecreasing monotone functions of the total variance, $w_i = f(S_i)$
- Using principal components to define weights
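A sketch of the weighted step-down rule as read from this slide: order the hypotheses by $p_i / w_i$ and keep rejecting while the weighted p-value stays below $\alpha$ divided by the sum of the weights of the hypotheses not yet rejected. The function name and the toy weights are hypothetical; in practice the weights would come from the data, for example the total variance.

```python
def weighted_step_down(pvals, weights, alpha=0.05):
    """Return a list of booleans, True where H_i is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i] / weights[i])
    rejected = [False] * m
    remaining = sum(weights)  # sum of w_k over hypotheses not yet rejected
    for i in order:
        if pvals[i] / weights[i] <= alpha / remaining:
            rejected[i] = True
            remaining -= weights[i]
        else:
            break  # step-down: stop at the first non-rejection
    return rejected

p = [0.01, 0.04, 0.03]
print(weighted_step_down(p, [1, 1, 1]))  # equal weights
print(weighted_step_down(p, [1, 1, 4]))  # third hypothesis up-weighted
```

With equal weights the threshold at step i is $\alpha / (m - i + 1)$, so the rule reduces to the Holm procedure.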

FDR controlling procedures
Calculating adjusted p-values $\tilde{p}_j$ ($j = 1, \dots, m$)
- Benjamini-Hochberg: $\tilde{p}_j = \min_{j \le i} \left\{ \frac{m}{i}\, p_i \right\}$ (step-up, for $p_1 \le \dots \le p_i \le \dots \le p_m$)
- Benjamini-Yekutieli: correction under dependence (step-up)
- Storey: pFDR (estimating $m_0/m$)
- ...
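The two step-up adjustments can be sketched as follows. The Benjamini-Yekutieli version multiplies the Benjamini-Hochberg values by $c(m) = \sum_{k=1}^{m} 1/k$ and caps them at 1. Storey's approach is not sketched here. The function names are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Step-up adjusted p-values, returned in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value downwards
        i = order[rank - 1]
        running_min = min(running_min, m / rank * pvals[i])
        adjusted[i] = running_min
    return adjusted

def benjamini_yekutieli(pvals):
    """BH values inflated by c(m) = 1 + 1/2 + ... + 1/m, capped at 1."""
    m = len(pvals)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    return [min(1.0, c_m * p) for p in benjamini_hochberg(pvals)]

p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p))
print(benjamini_yekutieli(p))
```

The running minimum implements the minimum over all positions $i \ge j$ in the ordered list.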

Comparison of adjustment methods using adjusted p-values for the data example

Method          # H_0 rejected   # H_0 falsely rejected
unadjusted      334              234
Bonferroni      91               0
Holm            92               0
S_i-weighted    44               0
min-p           82               0
BH              102              3
BY              98               0
Storey          102              3
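The counts in this comparison can be turned into an observed false discovery proportion (falsely rejected divided by rejected) and a number of correct rejections per method. The sketch below only does this arithmetic on the values copied from the comparison; the function name is hypothetical.

```python
# (rejected, falsely rejected) per method, copied from the comparison above.
results = {
    "unadjusted": (334, 234), "Bonferroni": (91, 0), "Holm": (92, 0),
    "S_i-weighted": (44, 0), "min-p": (82, 0), "BH": (102, 3),
    "BY": (98, 0), "Storey": (102, 3),
}

def summarise(results):
    """Map each method to (false discovery proportion, correct rejections)."""
    summary = {}
    for method, (rejected, false_rej) in results.items():
        fdp = false_rej / rejected if rejected else 0.0
        summary[method] = (fdp, rejected - false_rej)
    return summary

for method, (fdp, correct) in summarise(results).items():
    print(f"{method:13s} FDP = {fdp:.3f}  correct rejections = {correct}")
```

Without adjustment about 70% of the rejections are false (234 of 334), although all 100 genes with an effect are found; BH finds 99 of them at a false discovery proportion of about 3%.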

Volcano Plot
[Figure: -log10 p value against log2 fold change; legend: unadjusted, Bonferroni, BH, BY, min p]

Gene set enrichment analysis
- Define multiple sets of genes
- Test differential expression for these gene sets
- Small effects of single genes are hard to detect
- Combination of multiple small effects to get the big picture
- Reduction of the dimensionality of the multiple testing problem
- Test effects for whole pathways, functional groups, etc.
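The slide does not name a specific test for gene sets. One simple option, swapped in here purely as an illustration, is an over-representation test: given a list of selected genes, ask whether a gene set contains more of them than expected by chance, using the upper tail of a hypergeometric distribution. The function name and the numbers in the example call are hypothetical.

```python
from math import comb

def over_representation_p(n_total, set_size, n_selected, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(n_total, set_size, n_selected).

    n_total:    number of genes on the array
    set_size:   number of genes in the gene set
    n_selected: number of genes called differentially expressed
    overlap:    number of selected genes that belong to the gene set
    """
    upper = min(set_size, n_selected)
    total = comb(n_total, n_selected)
    tail = sum(comb(set_size, x) * comb(n_total - set_size, n_selected - x)
               for x in range(overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 5000 genes, a set of 50, 100 selected, 5 in common.
p = over_representation_p(5000, 50, 100, 5)
```

This treats genes as exchangeable and ignores correlation between them, so it is a rough screening tool, not a replacement for set-level tests on the expression data.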

Assigning genetic features to known classes
Classification
- Reformulating the problem into a setting with p regressors to estimate the class membership probability (control/treatment) for each gene
- Finding a classification rule by e.g.
  - Logistic regression
  - Discriminant analysis
  - SVM
  - ...
Validation
- Fitting the model to training data
- Validation of the model by test data
- Cross-validation to validate the model on training data
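Cross-validation on the training data amounts to repeatedly splitting the arrays into a part used for fitting and a held-out part used for checking. A minimal sketch of a k-fold split of array indices follows; the function name, the choice of k = 5 and the random shuffling are assumptions of this sketch.

```python
import random

def k_fold_indices(n, k=5, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(40, k=5):
    # Fit the classifier on `train`, evaluate it on `test`.
    # Any feature selection has to be repeated inside each fold.
    pass
```

Each array is held out exactly once, so the k error estimates can be averaged.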

Problem of high dimensions: p >> n
Problem
- Requirement for logistic regression or LDA is that the number of observations is larger than the number of variables
- Reducing the number of variables by feature selection
- Using penalized logistic regression, ...

Feature selection
Filtering genes
- Multiple testing approaches can be used as a filter
- Select all variables corresponding to genes with a p-value $p \le p_0$
- Perform for example logistic regression to model the posterior probability of K classes:
  $\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$
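Given coefficients for the classes $k = 1, \dots, K-1$ (class K acting as the reference), the log-odds model can be inverted to obtain the class probabilities. The sketch below does only this inversion; the coefficient values are made up and no model fitting is shown. The function name is hypothetical.

```python
import math

def class_probabilities(x, betas):
    """Posterior class probabilities from the log-odds model.

    x:     feature vector of one sample (expression of the selected genes)
    betas: list of (beta_k0, beta_k) for classes 1..K-1; class K is the reference
    """
    scores = [b0 + sum(bj * xj for bj, xj in zip(b, x)) for b0, b in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # reference class K
    return probs

# Two classes, two selected genes, made-up coefficients.
probs = class_probabilities([0.5, -1.0], [(0.1, [2.0, 1.0])])
```

By construction, the log of `probs[k] / probs[-1]` equals the linear predictor of class k and the probabilities sum to one.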

A second example
Generating artificial training data
- 2 treatments (A, B)
- 20 arrays per treatment
- 5000 genes per array
- Normally distributed residuals and array effects (within-array sd = 1; between-array sd = 0.5)
- Genes show N(0, 0.25) distributed effects
- 2^x transformation
Generating test data
- 10 arrays per treatment
- Same effects as in training data
Both datasets are log2 transformed and median normalised.

Feature selection / Classification
- Choosing only 10 genes with the best t-test results as covariates
- Performing LDA and logistic regression
- Validation by the test set

LDA:
       A    B
  A    8    2
  B    2    8

Logistic regression:
       A    B
  A    7    2
  B    3    8
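The two test-set tables can be summarised by the proportion of correctly classified arrays, which is the sum of the diagonal divided by the total. This does not depend on which axis holds the true class, a detail the slide leaves open. The function name is hypothetical.

```python
def accuracy(confusion):
    """Proportion of correctly classified arrays (diagonal over total)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

lda = [[8, 2], [2, 8]]       # counts copied from the LDA table
logistic = [[7, 2], [3, 8]]  # counts copied from the logistic regression table

print(accuracy(lda), accuracy(logistic))
```

On the 20 test arrays, LDA classifies 16 correctly and logistic regression 15.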

References
- Dudoit, S and Van der Laan, M (2008): Multiple Testing Procedures with Application to Genomics. Springer Series in Statistics.
- Gentleman, R, Carey, VJ, Huber, W, Irizarry, RA, and Dudoit, S (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, Statistics for Biology and Health.
- Hastie, T, Tibshirani, R, and Friedman, J (2001): The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer Series in Statistics.
- Benjamini, Y and Hochberg, Y (1995): Controlling the false discovery rate: a new and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57:289-300.
- Benjamini, Y and Yekutieli, D (2001): The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29:1165-1188.
- Finos, L and Salmaso, L (2007): FDR- and FWE-controlling methods using data-driven weights. JSPI 137:3859-3870.
- Kropf, S and Läuter, J (2002): Multiple tests for different sets of variables using a data-driven ordering of hypotheses, with an application to gene expression data. Biometrical Journal 44:789-800.
- Saeys, Y, Inza, I, and Larrañaga, P (2007): A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507-2517.
- Schwender, H, Ickstadt, K, and Rahnenführer, J (2008): Classification with High-Dimensional Genetic Data: Assigning Patients and Genetic Features to Known Classes. Biometrical Journal 50:911-926.
- Storey, JD and Tibshirani, R (2003): Statistical significance for genomewide studies. PNAS 100:9440-9445.