Hypothesis Testing. Wolfgang Huber, Bernd Klaus, EMBL


Karl Popper (1902-1994). Logical asymmetry between verification and falsifiability: no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive: it shows the theory is false.

The four steps of hypothesis testing. Step 1: Set up a model of reality: the null hypothesis, H0. Step 2: Do an experiment, collect data. Step 3: Compute the probability of the data under this model. Step 4: Make a decision: reject the model if the computed probability is deemed too small. H0 is a model of reality that lets us make specific predictions of what the data should look like; the model is stated using the mathematical theory of probability. Examples of null hypotheses: the coin is fair; the new drug is no better (or worse) than a placebo; the observed CellTitreGlo signal is no different from that of negative controls.

Binomial Distribution. H0 here: p = 0.5. Distribution of the number of heads in 12 tosses: P(Heads ≤ 2) = 0.0193, P(Heads ≥ 10) = 0.0193.
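These tail probabilities can be reproduced with scipy (a sketch, not from the slides; n = 12 tosses is an assumption, chosen because it reproduces 0.0193 exactly, as 79/4096):

```python
from scipy.stats import binom

n, p = 12, 0.5                  # 12 tosses of a fair coin (assumed n)
p_low = binom.cdf(2, n, p)      # P(Heads <= 2)
p_high = binom.sf(9, n, p)      # P(Heads >= 10), i.e. 1 - P(Heads <= 9)
print(round(p_low, 4), round(p_high, 4))   # 0.0193 0.0193
```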

Significance Level. If H0 is true and the coin is fair (p = 0.5), it is improbable to observe extreme events such as more than 9 heads: 0.0193 = P(Heads ≥ 10 | H0) = p-value (one-sided). If we observe 10 heads in a trial, the null hypothesis is likely to be false. An often used (but entirely arbitrary) cutoff is 0.05 (significance level α): if p < α, we reject H0. Two views: strength of evidence for a certain (negative) statement; rational decision support.

Statistical Testing Workflow 1. Set up hypothesis H 0 (that you want to reject) 2. Find a test statistic T that should be sensitive to (interesting) deviations from H 0 3. Figure out the null distribution of T, if H 0 holds 4. Compute the actual value of T for the data at hand 5. Compute p-value = the probability of seeing that value, or more extreme, in the null distribution. 6. Test Decision: Rejection of H 0 - yes / no?

Permutation tests. Instead of relying on approximations of the distribution of the test statistic under the null hypothesis, compute it by permuting the group labels!

Sampling distribution

Permutation test recipe
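A minimal version of the recipe in Python (a sketch, not the slides' own code; the test statistic here is the difference of group means, and the add-one in the p-value is a common convention to avoid p = 0):

```python
import numpy as np

def perm_test(x, y, n_perm=10000, seed=0):
    """Two-sample permutation test: difference of means as the test statistic."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    n_extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                     # switch the group labels
        stat = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(stat) >= abs(observed):                     # as extreme or more (two-sided)
            n_extreme += 1
    return (n_extreme + 1) / (n_perm + 1)                  # add-one avoids p = 0
```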

Errors in hypothesis testing

                           Decision: not rejected ("negative")    Decision: rejected ("positive")
  H0 true                  True negative (specificity)            False positive, type I error (α)
  H0 false                 False negative, type II error (β)      True positive (sensitivity)

t-statistic (1908, William Sealy Gosset, pen name "Student"). One-sample t-test: compare the sample mean x̄ to a fixed value μ0. Without the factor √n it is a z-score; with √n it is the t-statistic: t = √n · (x̄ − μ0) / s. If the data are normal, its null distribution can be computed: the t-distribution with a parameter called degrees of freedom, equal to n − 1.

One sample t-test example. Consider the following 10 data points: -0.01, 0.65, -0.17, 1.77, 0.76, -0.16, 0.88, 1.09, 0.96, 0.25. We are wondering if these values come from a distribution with a true mean of 0: one sample t-test. The 10 data points have a mean of 0.60 and a standard deviation of 0.62. From that, we calculate the t-statistic: t = 0.60 / 0.62 · √10 ≈ 3.0
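The same example in Python (a sketch using scipy's ttest_1samp; with the unrounded mean and standard deviation the statistic comes out near 3.05 rather than exactly 3.0):

```python
from scipy import stats

x = [-0.01, 0.65, -0.17, 1.77, 0.76, -0.16, 0.88, 1.09, 0.96, 0.25]
t, p = stats.ttest_1samp(x, popmean=0.0)   # H0: the true mean is 0
print(round(t, 2), p < 0.05)               # t ≈ 3.05; p is below 0.05
```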

p-value and test decision. With 10 observations, compare the observed t-statistic to the t-distribution with 9 degrees of freedom.

One-sided vs two-sided test. One-sided, e.g. HA: μ > 0: 5% in one tail. Two-sided, e.g. HA: μ ≠ 0: 2.5% in each tail.

Avoid fallacy. The p-value is the probability that the observed data (or more extreme data) could occur, under the condition that the null hypothesis is true. It is not the probability that the null hypothesis is true. Absence of evidence is not evidence of absence.

Two samples t-test. Do two different samples have the same mean? t = (x̄ − ȳ) / SE, where x̄ and ȳ are the averages of the observations in the two samples and SE is the standard error of the difference. If H0 is correct, the test statistic follows a t-distribution with n + m − 2 degrees of freedom (n, m: the number of observations in each sample).

Comments and pitfalls. The derivation of the t-distribution assumes that the observations are independent and that they follow a normal distribution. Deviation from normality with heavier tails: the test still maintains type I error control, but may no longer have optimal power. Options: Wilcoxon test, permutation tests. If the data are dependent, then p-values will likely be totally wrong (e.g., too optimistic in the case of positive correlation).

t-test and Wilcoxon test in R. x, y: data (only x needs to be specified for a one-group test; specify the target mu instead). paired: paired (e.g. repeated measurements on the same subjects) or unpaired. var.equal: can the variances in the two groups be assumed to be equal? alternative: one- or two-sided test? ... just like the t-test; exact: shall computations be performed using permutations? (slow for large samples)
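For readers not using R, the analogous calls in Python (a sketch with simulated data; scipy's ttest_ind and mannwhitneyu play roughly the roles of t.test and wilcox.test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 30)           # group 1
y = rng.normal(0.5, 1.0, 30)           # group 2, shifted mean

t_res = stats.ttest_ind(x, y, equal_var=True)              # ~ t.test(x, y, var.equal=TRUE)
w_res = stats.mannwhitneyu(x, y, alternative="two-sided")  # ~ wilcox.test(x, y)
print(t_res.pvalue, w_res.pvalue)
```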

different data distributions independent case

The t-test can be wrong if the independence assumption does not hold. Correlation between samples commonly leads to overestimation of significance.

Another typical case: batch effects or latent variables

  library(genefilter)   # provides rowttests
  n = 10000
  m = 20
  x = matrix(rnorm(n*m), nrow=n, ncol=m)
  fac = factor(c(rep(0, 10), rep(1, 10)))
  rt1 = rowttests(x, fac)     # null data: p-values roughly uniform
  # a batch effect overlapping the experimental groups
  # leads to overestimation of significance:
  x[, 6:15] = x[, 6:15] + 1
  rt2 = rowttests(x, fac)

sva package; Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007

Wilcoxon test. Both tests show no significant differences!

Tests for Categorical Data. So far we have only discussed tests for continuous data. Tests for categorical data can be roughly subdivided into tests for proportions and tests for tables; however, the two are closely related!

Comparing Proportions. Tests of single proportions are generally based on the binomial distribution with success probability p and n trials. Using the CLT, one can test p against a fixed value p0, i.e. H0: p = p0, via the approximately normally distributed statistic z = (p̂ − p0) / √(p0 (1 − p0) / n). A similar construction can be used to compare two proportions p1 and p2, with a slightly different standard error estimate. In practice a Yates continuity correction is often applied to improve the normal approximation.

Example: Genetic disease Imagine we have 250 individuals, some of them have a given disease others don t. We observe that a 20% of the individuals that are homozygous for the minor allele have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?

Example ctd. The proportions test gives no significant result. Alternatively, one can compare the genotype proportions between diseased and healthy individuals.
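A sketch of this comparison in Python. The slide does not give the group sizes, so the 2×2 table below (50 homozygous individuals, 10 of them diseased; 200 others, 20 of them diseased) is a hypothetical split consistent with the 20% vs. 10% figures:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical counts: rows = genotype group, columns = diseased / healthy
table = np.array([[10,  40],    # homozygous minor allele: 10/50  = 20% diseased
                  [20, 180]])   # all others:              20/200 = 10% diseased
chi2, p, dof, expected = chi2_contingency(table)  # Yates correction by default for 2x2
print(dof, round(p, 3))         # 1 degree of freedom; p is not significant here
```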

Chi-Squared Test. An r × c table looks like this: if there is no relation between rows and columns, then you would expect the cell values Eij = (row i total) × (column j total) / (grand total). This can be interpreted as distributing the grand total according to the products of the row and column proportions.

Chi-Squared Test. The test statistic X² = Σij (Oij − Eij)² / Eij has approximately a χ² distribution with (r − 1)(c − 1) degrees of freedom.

Fisher Test. Computes exact p-values for 2×2 tables, based on permutation tests. Uses the odds ratio: OR = (n11 · n22) / (n12 · n21). Flipping rows or columns of the table inverts the odds ratio; therefore the log odds ratio is also very often used as a measure of association in 2×2 tables. Unlike test statistics, it is not affected by the sample size.
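In Python, scipy's fisher_exact returns the sample odds ratio together with the exact p-value (a sketch with a hypothetical 2×2 table; it also illustrates the row-flipping property):

```python
from scipy.stats import fisher_exact

table = [[10, 40], [20, 180]]               # hypothetical 2x2 counts
odds, p = fisher_exact(table)               # odds = (10*180)/(40*20) = 2.25
flipped, _ = fisher_exact([table[1], table[0]])
print(round(odds, 4), round(flipped, 4))    # 2.25 and its inverse 0.4444
```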

xkcd

The Multiple Testing Problem. When performing a large number of tests, the type I error is inflated: for significance level α and n tests, the probability of no false positive result is (1 − α)^n. The larger the number of tests performed, the higher the probability of at least one false rejection!
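The inflation is easy to compute (a small sketch; with α = 0.05 the chance of at least one false positive already exceeds 99% at n = 100 true-null tests):

```python
alpha = 0.05
for n in (1, 10, 100, 1000):
    p_any = 1 - (1 - alpha) ** n    # P(at least one false positive) among n true nulls
    print(n, round(p_any, 3))       # grows from 0.05 toward 1
```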

Multiple Testing Examples Many data analysis approaches in genomics rely on item-by-item (i.e. multiple) testing: Microarray or RNA-Seq expression profiles of normal vs perturbed samples: gene-by-gene ChIP-chip: locus-by-locus RNAi and chemical compound screens Genome-wide association studies: marker-by-marker QTL analysis: marker-by-marker and trait-by-trait

False positive rate and false discovery rate FPR: fraction of FP among all genes (etc.) tested FDR: fraction of FP among hits called Example: 20,000 genes, 100 hits, 10 of them wrong. FPR: 0.05% FDR: 10%

Experiment-wide type I error rates

                           Not rejected    Rejected    Total
  True null hypotheses     U               V           m0
  False null hypotheses    T               S           m1
  Total                    m − R           R           m

Family-wise error rate: P(V > 0), the probability of one or more false positives. For large m0, this is difficult to keep small. False discovery rate: E[V / max{R, 1}], the expected fraction of false positives among all discoveries.

Diagnostic plot: the histogram of p-values Observed p-values are a mix of samples from a uniform distribution (from true nulls) and from distributions concentrated at 0 (from true alternatives)

Benjamini-Hochberg multiple testing adjustment: compare the sorted p-values to a straight line through the origin with slope α / #genes, and reject up to the largest rank at which the p-value still falls below the line.
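The adjustment can be sketched in a few lines of Python (mirroring R's p.adjust(p, "BH"); numpy only):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values, mirroring R's p.adjust(p, "BH")."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)        # p(i) * m / i
    # step-up: enforce monotonicity from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

print(bh_adjust([0.01, 0.02, 0.03, 0.04]))   # all become 0.04
```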

How to estimate the number (not: the identity) of differentially expressed genes. For a series of hypothesis tests H1, ..., Hm with p-values pi, plot (1 − pi, N(pi)) for all i, where N(p) is the number of p-values greater than p. Red line: (1 − p, (1 − p)·m), the expected count under the null. Schweder T, Spjøtvoll E (1982) Plots of P-values to evaluate many tests simultaneously. Biometrika 69:493-502.

Correct null distribution: use the p-value histogram for diagnosis. Reminder: under the null distribution, p-values follow a uniform distribution on the unit interval [0, 1]. Significant p-values thus become visible as an enrichment of p-values near zero in the histogram.

Misspecified null distributions

Correction of null distributions: empirical null modeling, CRAN packages fdrtool and locfdr. They estimate the parameters of the null model, e.g. its variance in the case of z-scores. Example: wrong null model, σ = 1; corrected null model after estimation, σ = 0.8.

Two introductory multiple testing reviews you should definitely have a look at: S. Dudoit, J. Shaffer, and J. Boldrick. Multiple hypothesis testing in microarray experiments. Statist. Science, 18:71-103, 2003. (BH, FDR and related methods applied to microarray data) B. Efron. Microarrays, empirical Bayes, and the two-groups model. Statist. Sci., 23:1-22, 2008. (How to actually estimate two-groups models and FDR from high-dimensional data)

parathyroid dataset

parathyroid dataset

Independent filtering. From all the tests to be done, first filter out those that seem to report negligible signal (say, 40%), then formally test for differential expression on the rest.

Increased detection rates. Stage 1 filter: filter based on the sum of counts across all samples (remove the fraction θ with the smallest sums). Stage 2: standard NB-GLM test on the rest.
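The two-stage idea can be sketched in Python on simulated normal data (an illustration only: here the stage 1 filter is the overall variance, which ignores the group labels, and stage 2 is a two-sample t-test; the sizes and the θ = 0.4 cutoff are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes, n_per_group = 5000, 10
x = rng.normal(size=(n_genes, 2 * n_per_group))
x[:100, n_per_group:] += 2.0           # 100 genes truly differential

# stage 1: filter on the overall variance (does not look at group labels)
overall_var = x.var(axis=1)
theta = 0.4                            # drop the 40% least variable genes
keep = overall_var >= np.quantile(overall_var, theta)

# stage 2: two-sample t-test on the genes that pass the filter
t, p = stats.ttest_ind(x[keep, :n_per_group], x[keep, n_per_group:], axis=1)
print(keep.sum(), "genes tested")
```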

Increased power? An increased detection rate implies increased power only if we are still controlling type I error at the same level as before.

What do we need for type I error control? I. For each individual (per-gene) test statistic, we need to know its correct null distribution. II. If and insofar as the multiple testing procedure relies on a certain (in)dependence structure between the different test statistics, our test statistics need to comply. For I., one (though not the only) solution is to make sure that filtering does not affect the null distribution, i.e. that it is the same before and after filtering. For II., see later.

Result: independence of stage 1 and stage 2 statistics under the null hypothesis. For genes for which the null hypothesis is true (X1, ..., Xn exchangeable), filter f and statistic g are statistically independent in each of the following cases. NB test (DESeq2): f (stage 1): overall variance (or mean); g (stage 2): the standard two-sample t-statistic, or any test statistic that is scale- and location-invariant. Normally distributed data: f (stage 1): overall variance (or mean); g (stage 2): the standard two-sample t-statistic, or any scale- and location-invariant test statistic. Non-parametrically: f: any function that does not depend on the order of the arguments, e.g. the overall variance or the IQR; g: the Wilcoxon rank-sum test statistic.

Conclusion Correct use of this two-stage approach can substantially increase power at same type I error.

References Bourgon R., Gentleman R. and Huber W. Independent filtering increases detection power for high-throughput experiments, PNAS (2010) Bioconductor package genefilter vignette DESeq2 vignette