Statistical Hypothesis Tests for NLP

Statistical Hypothesis Tests for NLP, or: Approximate Randomization for Fun and Profit
William Morgan <ruby@cs.stanford.edu>
Stanford NLP Group

You have two systems...
Motivation: two systems, A and B. They produce output o_A and o_B. We want to show that B is better than A.
The simple approach: using some evaluation metric e, show that e(o_B) > e(o_A). And that's enough...

Sadly, no:
...or is it? The difference between e(o_A) and e(o_B) could be due to sheer dumb luck. We want to show that's not the case.
Statistical significance tests give us a way of quantifying the probability that the difference between two systems is due to luck.
If it is low, we can believe the difference is real. If it is high, then either:
1. the systems are not different; or
2. the data are insufficient to show that the systems aren't different.

Quick question
Why not just compare confidence intervals? You sometimes see people determining the statistical significance of system differences by looking at whether the confidence intervals overlap. This approach is overly conservative:
[Figure: scatter plot of system B score against system A score, with the line y = x for reference.]

Statistical significance tests
1. Forget about "bigger" or "smaller". Let's just think about "difference" or "no difference". (A two-tailed test.)
2. Call the hypothesis that there's no difference between A and B the null hypothesis.
3. Pretend o_A and o_B are sampled independently from a population (so o_A and o_B are decomposable).
4. Call t(o_1, o_2) = e(o_1) - e(o_2) the test statistic (so t : system output × system output → R).
5. Find the distribution of t under the null hypothesis, i.e. assuming the null hypothesis is true.
6. See where t(o_A, o_B), the thing we actually observed, lies in this distribution. If it's somewhere weird (unlikely), that's evidence that the null hypothesis is false, i.e. the systems are different.
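
To make the test statistic concrete, here is a minimal sketch in Python. The metric `evaluate` and the per-output `correct` field are hypothetical stand-ins for whatever metric e you actually use; for a two-tailed test the absolute difference is the natural choice.

```python
# A minimal sketch of the test statistic t(o1, o2) = e(o1) - e(o2).
# `evaluate` is a hypothetical metric mapping a list of system outputs
# to a single score (accuracy here; it could be BLEU, WER, F-measure, ...).

def evaluate(outputs):
    # Placeholder metric: fraction of outputs flagged as correct.
    return sum(1 for o in outputs if o["correct"]) / len(outputs)

def test_statistic(o1, o2):
    # Two-tailed test: only the magnitude of the difference matters.
    return abs(evaluate(o1) - evaluate(o2))
```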

Pretty picture
[Figure: the distribution of the test statistic under the null hypothesis, with the value of the test statistic on the observed data marked and the tail area beyond it labeled as the significance level.]

The significance level
The area to the right of t(o_A, o_B) is the significance level: the probability that some t ≥ t(o_A, o_B) would be generated if the null hypothesis were true. Also called the p-value.
Small values suggest the null hypothesis is false, given the observation of t(o_A, o_B).
Corollary: all else being equal, a large difference between e(o_A) and e(o_B) yields a smaller significance level (as one would hope!).
Values below 0.05 are typically considered good enough.
So all we have to do is calculate the distribution of t.

Calculating the distribution
The classical approach: keep adding assumptions until we arrive at a known distribution which we can calculate analytically.
E.g., Student's t-test: assume that e(o_A) and e(o_B) are sample means from a bivariate Normal distribution with zero covariance. Then we know t is distributed according to Student's t-distribution if the null hypothesis is true.
Back in the stone age, computing with rocks and twigs, making those assumptions made the problem tractable. But the problem with this approach is that you may falsely reject the null hypothesis if one of the additional assumptions is violated. (Type I error.)
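
For reference only (this is the classical route the talk argues against relying on), the t-test above can be run in a couple of lines with SciPy; the per-example score lists below are hypothetical.

```python
# Classical approach, shown only for contrast: Student's t-test on
# per-example scores, valid only if the distributional assumptions hold.
# `scores_a` and `scores_b` are hypothetical per-sentence metric values.
from scipy import stats

scores_a = [0.61, 0.72, 0.58, 0.69, 0.66, 0.70]
scores_b = [0.70, 0.75, 0.64, 0.71, 0.73, 0.74]

# Two-sided unpaired t-test; scipy.stats.ttest_rel is the paired analogue.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(t_stat, p_value)
```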

What you SHOULD do
Simulate the distribution using a randomization test.
It's just as good as analytical approaches, even when the analytical assumptions are met! (Hoeffding 1952) And it's better when they're not. (Noreen 1989) Best of all: dirt simple.
Intuition: erase the labels "output of A" or "output of B" from all of the observations. Now consider the population of every possible labeling. (Order relevant.) If the systems are really different, the observed labeling should be unlikely under this distribution.

Basic approximate randomization
Exact randomization requires iterating through the entire set of possible labelings. (E.g. Fisher's exact test.) That's huge! Instead, sample from it.
Let o_A = {o_A^1, ..., o_A^n} and o_B = {o_B^1, ..., o_B^m} be the output of the two systems.
Repeat R times: randomly assign each of {o_A^1, ..., o_A^n, o_B^1, ..., o_B^m} to classes X (size n) and Y (size m). Calculate t(X, Y).
Let r be the number of times that t(X, Y) ≥ t(o_A, o_B). As R → ∞, r/R approaches the significance level.
Actually, we should use (r + 1)/(R + 1) for statistical reasons (not that it matters much for, say, R ≫ 19).
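
A minimal sketch of this procedure in Python, assuming you supply your own metric as the `evaluate` callable; the function and variable names here are illustrative, not from the talk.

```python
import random

def approximate_randomization(outputs_a, outputs_b, evaluate, R=1000):
    """Unpaired approximate randomization test.

    evaluate: callable mapping a list of system outputs to a single score.
    Returns the estimated significance level (p-value) as (r + 1) / (R + 1).
    """
    def t(x, y):
        return abs(evaluate(x) - evaluate(y))

    observed = t(outputs_a, outputs_b)
    pooled = list(outputs_a) + list(outputs_b)
    n = len(outputs_a)
    r = 0
    for _ in range(R):
        random.shuffle(pooled)                     # relabel without replacement
        if t(pooled[:n], pooled[n:]) >= observed:  # X gets n items, Y gets the rest
            r += 1
    return (r + 1) / (R + 1)
```

With a metric like the accuracy sketch shown earlier, `approximate_randomization(o_A, o_B, evaluate)` returns the estimated significance level.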

That was easy. Some comments
Random assignment is done without replacement, in contrast to the bootstrap.
Randomization tests are statistically valid, meaning the probability of falsely rejecting the null hypothesis is no greater than the rejection level of the test (i.e. a threshold for the significance level chosen a priori). That's important! (And it's where those +1's come from.)
R = 1000 is the typical case.
Now how do we use this for NLP applications?

Applying this to NLP
Our output bits o_A^1, ..., o_A^n, o_B^1, ..., o_B^m are going to be something like tagged/translated/parsed sentences. Our metric function e is going to be something like BLEU/WER/F-measure. We'll run two systems on the same input and see how their output differs.
Wait! Now we've violated an assumption of all statistical significance tests: that each bit of output is independently sampled. Each (o_A^i, o_B^i) pair is dependent. (The output of system A on input i will probably be similar to the output of system B on input i.)
A statistician would recognize this situation as requiring a paired test.

Approximate randomization for NLP
We can control for dependent variables by stratifying the output and only permuting within each stratum. (Yeh 2000, Noreen 1989) In this case, we'll stratify each pair (o_A^i, o_B^i).
Let o_A = {o_A^1, ..., o_A^n} and o_B = {o_B^1, ..., o_B^n} be the output of the two systems on the same input.
Start with X = o_A and Y = o_B. Repeat R times: randomly flip each o_A^i, o_B^i between X and Y with probability 1/2. Calculate t(X, Y).
Let r be the number of times that t(X, Y) ≥ t(o_A, o_B). As R → ∞, (r + 1)/(R + 1) approaches the significance level.
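
The same sketch, adapted to the stratified (paired) setting described above; again `evaluate` is whatever metric you use, and the function name is illustrative.

```python
import random

def paired_approximate_randomization(outputs_a, outputs_b, evaluate, R=1000):
    """Stratified (paired) approximate randomization: outputs_a[i] and
    outputs_b[i] come from the same input, so we only swap within each pair.

    evaluate: callable mapping a list of system outputs to a single score.
    Returns the estimated significance level (p-value) as (r + 1) / (R + 1).
    """
    assert len(outputs_a) == len(outputs_b)

    def t(x, y):
        return abs(evaluate(x) - evaluate(y))

    observed = t(outputs_a, outputs_b)
    r = 0
    for _ in range(R):
        x, y = [], []
        for a, b in zip(outputs_a, outputs_b):
            if random.random() < 0.5:   # flip each pair with probability 1/2
                a, b = b, a
            x.append(a)
            y.append(b)
        if t(x, y) >= observed:
            r += 1
    return (r + 1) / (R + 1)
```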

Randomization vs. Bootstrap
Q: How do randomization tests compare with bootstrap resampling, in which data are drawn with replacement? For example, Koehn (2004) proposes paired bootstrap resampling for MT comparisons, which is almost identical to AR except for the replacement issue.
A: Bootstrap resampling makes an additional assumption: that the (original) sample is close to the population of all possible outputs. Randomization tests do not require this assumption and thus are better. Riezler and Maxwell (2005) also give anecdotal evidence that bootstrap resampling is more prone to Type I errors than AR for SMT.
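
For contrast only (not the recommended test), here is a rough sketch of what paired bootstrap resampling in the style of Koehn (2004) looks like; the function and variable names are illustrative, and the details differ from the paper.

```python
import random

def paired_bootstrap(outputs_a, outputs_b, evaluate, R=1000):
    """Rough sketch of paired bootstrap resampling, shown for contrast.

    Resamples paired inputs WITH replacement and counts how often system B
    scores higher than system A on the resampled test sets.
    """
    n = len(outputs_a)
    wins_b = 0
    for _ in range(R):
        idx = [random.randrange(n) for _ in range(n)]  # sample indices with replacement
        sample_a = [outputs_a[i] for i in idx]
        sample_b = [outputs_b[i] for i in idx]
        if evaluate(sample_b) > evaluate(sample_a):
            wins_b += 1
    return wins_b / R   # proportion of resamples on which B beats A
```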

Comparing many systems
So that's how we compare two systems. If we compare many systems, there's a danger we need to be aware of.
In the binary comparison case, with threshold 0.05, validity tells us that we'll falsely reject the null hypothesis (make a Type I error) at most 5% of the time. But if we do 20 comparisons, the chance of making at least one Type I error can be as high as 1 - (1 - 0.05)^20 ≈ 0.64.
How do we prevent this?
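
A quick worked check of that figure, using nothing beyond the formula on the slide:

```python
# Family-wise error rate for k independent comparisons at threshold alpha:
# P(at least one false rejection) = 1 - (1 - alpha) ** k.
alpha = 0.05
for k in (1, 5, 20, 100):
    print(k, round(1 - (1 - alpha) ** k, 4))
# k = 20 gives ~0.6415, the ~0.64 quoted on the slide.
```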

Correcting for Multiple Tests
The Bonferroni correction is the most well-known solution: simply divide the threshold by n. In the above case, 1 - (1 - 0.05/20)^20 = 0.04884 ≤ 0.05. But Bonferroni is widely considered overly conservative (i.e. it sacrifices Type II error control for Type I) and is not often used in practice.
Another popular option is Fisher's Least Significant Difference (LSD) test. (But it is possibly too liberal.)
Or, consider Tukey's Honestly Significant Difference (HSD) test. (But it is possibly too conservative.)
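
A minimal sketch of applying the Bonferroni correction; the p-values below are hypothetical.

```python
# Bonferroni correction: run each of k tests at threshold alpha / k,
# which caps the family-wise error rate at roughly alpha
# (here 1 - (1 - 0.05/20) ** 20 ~= 0.0488 <= 0.05).
alpha, k = 0.05, 20
per_test_threshold = alpha / k            # 0.0025
p_values = [0.001, 0.03, 0.004, 0.0002]   # hypothetical p-values from 4 of the k tests
significant = [p <= per_test_threshold for p in p_values]
print(per_test_threshold, significant)    # [True, False, False, True]
```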

Which one should I use?
Probably none of them. Only indisputably called for when:
1. you're doing post-hoc (unplanned) comparisons; or
2. you have a global null hypothesis ("if any one of these components is different from the others, then we've improved the state of the art").
In other cases, you probably have sufficient philosophical currency to do nothing at all. But you should be aware of the issue, so you know what to say when Bob Moore yells at you at ACL.

Summary
Use approximate randomization to compare two systems.
Calculate confidence intervals, but don't read anything into overlap.
If you're comparing lots of things, think about (but don't use) some form of correction.

References
Wassily Hoeffding. 1952. The Large-Sample Power of Tests Based on Permutations of Observations. Annals of Mathematical Statistics, 23, 169-192.
Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. Proceedings of EMNLP.
Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. John Wiley & Sons.
Stefan Riezler and John T. Maxwell III. 2005. On Some Pitfalls in Automatic Evaluation and Significance Testing for MT. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. Proceedings of COLING 2000.