The Chi-square test when the expected frequencies are less than 5

Wai Wan Tsang and Kai Ho Cheng
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
{tsang, khcheng3}@cs.hku.hk

Summary. In the chi-square test, it is required that the expected frequency of each cell is at least 5. This condition ensures that the CDF of the test statistic ($\chi^2$) can be closely approximated by the chi-square distribution. This paper describes two methods to compute the CDF of $\chi^2$ directly. The first method computes the exact probabilities for all attainable values of $\chi^2$. It is effective when both the number of samples and the number of cells are small. The second method approximates the CDF with an empirical distribution function that has three digits of accuracy. The second method complements the first one when the number of cells is large. A C program that uses these two methods to compute the CDF of $\chi^2$ is implemented. With this program, one can carry out the chi-square test even when some or all expected frequencies are less than 5.

Key words: goodness-of-fit test, chi-square test

1 Introduction

The chi-square goodness-of-fit test is used to check whether a set of samples fits a purported discrete distribution. The null hypothesis is that the samples follow the distribution. Suppose that the possible outcomes of an experiment are $1, 2, \ldots, k$, with probabilities $p_1, p_2, \ldots, p_k$, respectively. The experiment is carried out $n$ times independently. Let $o_1, o_2, \ldots, o_k$ be the numbers of occurrences of $1, 2, \ldots, k$, respectively, among the $n$ outcomes. Note that $\sum o_i = n$ and $\sum p_i = 1$. The chi-square statistic is defined as

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - n p_i)^2}{n p_i} \tag{1}$$

Here $o_i$ is called the observed frequency of cell $i$ and $n p_i$ is the expected frequency. When the null hypothesis is true and all expected frequencies are at least 5, the CDF of $\chi^2$ is closely approximated by the chi-square distribution with $k - 1$ degrees of freedom, denoted as $\mathrm{Chisq}(x, k - 1)$. Let p-value $= \mathrm{Chisq}(\chi^2, k - 1)$. The p-value defined this way is a CDF value, so values close to 1 indicate a poor fit. If the p-value is greater than a pre-set threshold, say, 0.95, the null hypothesis is rejected. Otherwise, it is accepted.

The statistic $\chi^2$ takes discrete values, but the chi-square distribution is continuous. Figure 1a shows the true CDF of $\chi^2$ when all $n p_i$ are 5 (the staircase) and $\mathrm{Chisq}(x, 5)$ (the smooth curve). They are close to each other. Figure 1b shows the staircase and the curve again when all $n p_i$ are 2. In this graph, the curve deviates noticeably from the staircase. To ensure that the CDF can be closely approximated by the chi-square distribution, the chi-square test requires all expected frequencies to be at least 5.

[Fig. 1. The CDFs of $\chi^2$ and their approximation, $\mathrm{Chisq}(x, k - 1)$. (a) $k = 6$, $n = 30$ and all $p_i = 1/6$. (b) $k = 6$, $n = 12$ and all $p_i = 1/6$.]

The chi-square test was suggested by Karl Pearson in 1900 [PK00]. The approximation of the CDF of $\chi^2$ with the chi-square distribution was crucial before the computer era. With today's computing technology, we can actually compute the CDF of $\chi^2$ on the fly, at least when $n$ and $k$ are small. In doing so, we can relax the at-least-5 requirement on the expected frequencies. The relaxation is important in applications where testing samples are scarce or very expensive, e.g., in medical or genomic research.

This paper describes two methods for computing the CDF of $\chi^2$, one analytical and one empirical. The first method computes the exact CDF but is inefficient when $k$ or $n$ is large. The second method computes an empirical distribution function (EDF) of $\chi^2$ using 11 million trials. The resulting probabilities have at least three digits of accuracy. A C program that uses these two methods to compute the CDF of $\chi^2$ is implemented. With this program, one can carry out the chi-square test even when some or all expected frequencies are less than 5.
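As a minimal illustration of equation (1) and of the p-value defined in the introduction, the following Python sketch computes the statistic and evaluates the chi-square CDF through the power series of the regularized incomplete gamma function. It is not the authors' C program; the function names and the observed counts are hypothetical and chosen only for illustration.

```python
import math


def chi_square_statistic(observed, probs):
    """Equation (1): sum over cells of (o_i - n*p_i)^2 / (n*p_i)."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def chisq_cdf(x, df):
    """Chisq(x, df), evaluated as the regularized lower incomplete gamma
    function P(df/2, x/2) through its power series."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a          # n = 0 term of sum_n z^n / (a (a+1) ... (a+n))
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


# Hypothetical counts for k = 6 cells and n = 12 samples, all p_i = 1/6,
# i.e. every expected frequency is 2 (the setting of Figure 1b).
observed = [3, 1, 2, 2, 0, 4]
probs = [1 / 6] * 6
stat = chi_square_statistic(observed, probs)
p_value = chisq_cdf(stat, len(probs) - 1)   # a CDF value, as in the paper
print(stat, p_value)
```

Because every expected frequency in this example is below 5, the value returned by `chisq_cdf` is only the continuous approximation whose accuracy the paper questions.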

2 The analytical method

It is easy to see that when $k = 2$, a test instance, specified by $[o_1, o_2]$, follows the binomial distribution. That is, the probability that there are $o_1$ 1s and $o_2$ 2s is

$$\frac{n!}{o_1!\, o_2!}\, p_1^{o_1} p_2^{o_2} \tag{2}$$

When $k > 2$, $[o_1, o_2, \ldots, o_k]$ follows the multinomial distribution, a generalization of the binomial distribution. The probability, $p$, that $[o_1, o_2, \ldots, o_k]$ occurs is

$$p = \frac{n!}{o_1!\, o_2! \cdots o_k!}\, p_1^{o_1} p_2^{o_2} \cdots p_k^{o_k} \tag{3}$$

The following sketches a straightforward way to compute the CDF of $\chi^2$ using the above formula.

1. For each instance, $[o_1, o_2, \ldots, o_k]$, compute the $\chi^2$ value and $p$.
2. Sort the pairs $[\chi^2, p]$ in ascending order of the $\chi^2$ values.
3. Combine the pairs that have identical $\chi^2$ values. The $p$ in the new pair is the sum of the $p$'s in the pairs being combined. For example, [0.65, 0.01] and [0.65, 0.02] are combined into [0.65, 0.03]. The resulting list gives the probability mass function of $\chi^2$.
4. Accumulate the $p$'s in the probability mass function to form the CDF.

A C program that computes the CDF using this method has been implemented. The test instances, $[o_1, o_2, \ldots, o_k]$, are enumerated using recursion. For efficiency, the powers of the $p_i$ and the factorials in the formula of the multinomial distribution are pre-computed. To verify the correctness of our program, we plot the computed CDFs together with the corresponding chi-square distributions in Figure 2. As expected, they are very close to each other.

To demonstrate the effectiveness, we used the program to compute the CDFs for the chi-square test with $k = 2, 3, \ldots, 10$ cells. For each $k$, we found the largest $n$ such that the computation could end within 1 minute on a PC with a 2.26 GHz Pentium 4 processor. Table 1 shows the values of $n$ recorded.

k           2            3      4     5     6    7    8    9    10
Largest n   1.9 × 10^7   6560   500   150   75   50   35   28   23

Table 1. The largest n found for each k such that the program ends within 1 minute.
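The four steps of the analytical method can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the authors' C program: the function names are hypothetical, the multinomial coefficient is computed directly rather than from pre-computed tables, and $\chi^2$ values are rounded to nine decimals (an assumed tolerance) so that values differing only by floating-point noise are merged in step 3.

```python
import math
from collections import defaultdict


def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def exact_chisq_cdf(n, probs, ndigits=9):
    """Return the exact CDF of the statistic as a sorted list of
    (chi-square value, cumulative probability) pairs."""
    mass = defaultdict(float)
    for counts in compositions(n, len(probs)):
        # Step 1: multinomial probability, equation (3).
        coef = math.factorial(n)
        for o in counts:
            coef //= math.factorial(o)      # stays an integer at each step
        p = float(coef)
        for o, pi in zip(counts, probs):
            p *= pi ** o
        stat = sum((o - n * pi) ** 2 / (n * pi) for o, pi in zip(counts, probs))
        # Step 3: combine instances that share the same statistic.
        mass[round(stat, ndigits)] += p
    # Steps 2 and 4: sort by statistic and accumulate.
    cdf, running = [], 0.0
    for stat in sorted(mass):
        running += mass[stat]
        cdf.append((stat, running))
    return cdf


# The case k = 2, n = 10, p_1 = p_2 = 1/2.
for value, prob in exact_chisq_cdf(10, [0.5, 0.5]):
    print(value, prob)
```

For $k = 2$, $n = 10$ and $p_1 = p_2 = 1/2$ the statistic equals $2(o_1 - 5)^2/5$, so the sketch produces six attainable values. The number of enumerated instances grows as $\binom{n + k - 1}{k - 1}$, which is why this approach is limited to small $n$ and $k$.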
[Fig. 2. The CDFs of $\chi^2$ and their corresponding chi-square distributions. (a) $k = 4$, $n = 40$ and all $p_i = 1/4$. (b) $k = 8$, $n = 40$ and all $p_i = 1/8$.]

3 The empirical method

The analytical method is inefficient when $k$ is large. For such cases, we can estimate the CDF of $\chi^2$ with an EDF obtained by simulation. This approach was suggested by Professor G. Marsaglia in 2005 [MAR05]. We have implemented a C program for the task. In our program, random numbers are generated using a combination of the multiply-with-carry generator [MZ91] and the 3-shift generator [MAR03]. Discrete variates are obtained using the method suggested in [MTW04].

The maximum absolute error (MAE) in an EDF has the same distribution as the Kolmogorov statistic [TW04]. Suppose that an EDF is obtained using $m$ trials. Using the asymptotic distribution of the Kolmogorov statistic given in [KOL33], the mean and standard deviation of the MAE are $0.87/\sqrt{m}$ and $0.26/\sqrt{m}$, respectively. In our program, $m = 11{,}000{,}000$. The mean plus three standard deviations is $(0.87 + 3 \times 0.26)/\sqrt{m} \approx 0.0004975$. Therefore, it is very safe to claim that the EDF is accurate up to the third digit.

To verify the correctness of our program, we plot the estimated EDFs together with the true CDFs computed using the analytical method in Figure 3. The EDFs coincide with the CDFs in the graphs.

We use our program to estimate the EDFs of $\chi^2$ for different values of $n$ and different hypothetical distributions having 5 values ($k = 5$). The execution times are shown in Table 2. As expected, the execution time grows roughly linearly with $n$ but is insensitive to the distribution.

Hypothetical distribution (p_1, ..., p_5)    n = 20   n = 30   n = 40   n = 50
(1/5, 1/5, 1/5, 1/5, 1/5)                    18 s     24 s     31 s     38 s
(1/15, 2/15, 3/15, 4/15, 5/15)               19 s     25 s     32 s     38 s
(1/25, 2/25, 4/25, 7/25, 11/25)              20 s     25 s     32 s     38 s

Table 2. Execution times for computing the EDFs for various values of n and various distributions.
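The empirical method and its error bound can be sketched in Python as follows. This is an assumption-laden illustration rather than the authors' program: it uses Python's built-in Mersenne Twister instead of the combined multiply-with-carry and 3-shift generator, draws discrete variates with `random.choices` instead of the method of [MTW04], and runs far fewer than 11 million trials. All function names are hypothetical.

```python
import bisect
import math
import random


def simulate_statistics(n, probs, trials, seed=0):
    """Draw `trials` samples of size n under the null hypothesis and
    return the sorted chi-square statistics."""
    rng = random.Random(seed)
    k = len(probs)
    cells = range(k)
    expected = [n * p for p in probs]
    stats = []
    for _ in range(trials):
        counts = [0] * k
        for outcome in rng.choices(cells, weights=probs, k=n):
            counts[outcome] += 1
        stats.append(sum((o - e) ** 2 / e for o, e in zip(counts, expected)))
    stats.sort()
    return stats


def edf(sorted_stats, x, tol=1e-9):
    """Fraction of simulated statistics <= x. The small tolerance (an
    assumption) keeps floating-point noise from dropping ties at a jump."""
    return bisect.bisect_right(sorted_stats, x + tol) / len(sorted_stats)


def mae_bound(m):
    """Mean plus three standard deviations of the maximum absolute error
    of an EDF built from m trials: (0.87 + 3 * 0.26) / sqrt(m)."""
    return (0.87 + 3 * 0.26) / math.sqrt(m)


# k = 6, n = 12, all p_i = 1/6, with a deliberately small number of trials.
stats = simulate_statistics(12, [1 / 6] * 6, trials=20000, seed=1)
print(edf(stats, 5.0), mae_bound(20000), mae_bound(11_000_000))
```

With 20,000 trials the bound is roughly 0.012, so only the first decimal digit of such an EDF can be trusted; reaching three digits requires a trial count on the order of the 11 million used in the paper.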

[Fig. 3. The CDFs of $\chi^2$ and the EDFs obtained using our program. (a) $k = 6$, $n = 12$ and all $p_i = 1/6$. (b) $k = 8$, $n = 40$ and all $p_i = 1/8$.]

Table 3 shows the execution times of computing the EDFs for different values of $k$ when $n = 200$. In the experiment, the hypothetical distributions are uniform, i.e., all $p_i$ are equal. The results show that the execution time is fairly insensitive to $k$: it rises only from 137 s to 172 s as $k$ grows from 20 to 100.

n = 200    k = 20   k = 40   k = 60   k = 80   k = 100
Time       137 s    146 s    147 s    168 s    172 s

Table 3. Execution times for computing the EDF for n = 200 and k = 20, 40, 60, 80 and 100.

4 Discussion

A C program that evaluates the CDF of $\chi^2$ (the p-value) in the chi-square test has been developed. If all expected frequencies are at least 5, the p-value is computed from the chi-square distribution with $k - 1$ degrees of freedom as usual. Otherwise, if $k \le 10$ and $n$ is less than or equal to the value shown in Table 1 for that $k$, the p-value is computed using the analytical method; in all remaining cases the empirical method is used. If the empirical method is used, the estimated execution time is printed on the console. This program can be downloaded from the website at http://www.cs.hku.hk/ tsang/chisq.c.

We are still tuning the program for efficiency. A dynamic programming approach is being considered for computing the true CDF of $\chi^2$. For the empirical method, certain random number generators that are faster than the combined generator currently used are being tested for suitability.

[Fig. 4. Two CDFs with large quantum jumps. (a) $k = 2$, $n = 10$ and $p_1 = p_2 = 1/2$. (b) $k = 2$, $n = 6$, $p_1 = 1/4$ and $p_2 = 3/4$.]

$\chi^2$ is a discrete variable but is treated as a continuous variable in the chi-square test. How appropriate this treatment is depends on the sizes of $k$ and $n$. When $k$ is very small, the quantum jumps in the CDF of $\chi^2$ are obvious even when all expected frequencies are at least 5. Figure 4a shows an extreme case where $k = 2$, $n = 10$ and $p_1 = p_2 = 1/2$. The quantum jumps are bigger when the at-least-5 requirement is not satisfied or the $p_i$ are not equal, or both, as shown in Figure 4b where $k = 2$, $n = 6$, $p_1 = 1/4$ and $p_2 = 3/4$. The effects of the discreteness on the Type I error, the Type II error and the power of the chi-square test are worth further investigation.
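The rule for choosing among the three ways of computing the p-value, as described in the discussion, can be sketched in Python as follows. The function and dictionary names are hypothetical, the limits are read from Table 1 (with the $k = 2$ entry taken to be $1.9 \times 10^7$, which is an interpretation of the printed value), and the small tolerance in the comparison is an added assumption to absorb floating-point rounding of $n p_i$.

```python
# Largest n per k for which the analytical method finished within one
# minute on the authors' machine (Table 1). The k = 2 entry is read as
# 1.9e7, which is an interpretation of the printed table.
LARGEST_N = {2: 19_000_000, 3: 6560, 4: 500, 5: 150, 6: 75,
             7: 50, 8: 35, 9: 28, 10: 23}


def choose_method(n, probs, tol=1e-9):
    """Return which computation the described rule would select."""
    k = len(probs)
    # All expected frequencies at least 5: use Chisq(x, k - 1) as usual.
    if all(n * p >= 5 - tol for p in probs):
        return "chi-square approximation"
    # Small problems: enumerate all instances exactly.
    if k <= 10 and n <= LARGEST_N.get(k, 0):
        return "analytical"
    # Everything else: estimate the CDF by simulation.
    return "empirical"


print(choose_method(40, [0.25] * 4))        # expected frequencies are 10
print(choose_method(12, [1 / 6] * 6))       # expected frequencies are 2
print(choose_method(40, [1 / 20] * 20))     # k exceeds 10
```

The limits in `LARGEST_N` reflect a one-minute budget on a 2.26 GHz Pentium 4; on other hardware they would need to be re-measured.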

References

[KOL33] Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91 (1933)
[MAR03] Marsaglia, G.: Xorshift RNGs. Journal of Statistical Software, 8, Issue 14 (2003)
[MAR05] Marsaglia, G.: Monkeying with the Goodness-of-Fit Test. Journal of Statistical Software, 14, Issue 13 (2005)
[MTW04] Marsaglia, G., Tsang, W.W. and Wang, J.: Fast Generation of Discrete Random Variables. Journal of Statistical Software, 11, Issue 3 (2004)
[MZ91] Marsaglia, G. and Zaman, A.: A new class of random number generators. The Annals of Applied Probability, 1, 462–480 (1991)
[PK00] Pearson, K.: On the Criterion that a Given System of Deviations from the Probable in the Case of Correlated System of Variables is such that it can be Reasonably Supposed to have Arisen from Random Sampling. Philosophical Magazine, 50, Issue 5, 157–175 (1900)
[TW04] Tsang, W.W. and Wang, J.: Evaluating the CDF of the Kolmogorov statistic for normality testing. Proceedings of COMPSTAT 2004, 16th Symposium of IASC, Prague, 1893–1900, August 23-27 (2004)