Contingency tables, chi-square tests

Similar documents
CONTINGENCY TABLES ARE NOT ALL THE SAME David C. Howell University of Vermont

Topic 8. Chi Square Tests

Having a coin come up heads or tails is a variable on a nominal scale. Heads is a different category from tails.

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Chi-square test Fisher s Exact test

THE FIRST SET OF EXAMPLES USE SUMMARY DATA... EXAMPLE 7.2, PAGE 227 DESCRIBES A PROBLEM AND A HYPOTHESIS TEST IS PERFORMED IN EXAMPLE 7.

STATISTICS 8, FINAL EXAM. Last six digits of Student ID#: Circle your Discussion Section:

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Is it statistically significant? The chi-square test

4. Continuous Random Variables, the Pareto and Normal Distributions

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

STAT 35A HW2 Solutions

Section 12 Part 2. Chi-square test

Association Between Variables

Two Correlated Proportions (McNemar Test)

Chapter 23. Two Categorical Variables: The Chi-Square Test

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Using Excel for inferential statistics

Nonparametric Statistics

Study Guide for the Final Exam

Common Univariate and Bivariate Applications of the Chi-square Distribution

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

Chi Squared and Fisher's Exact Tests. Observed vs Expected Distributions

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Elementary Statistics

Types of Data, Descriptive Statistics, and Statistical Tests for Nominal Data. Patrick F. Smith, Pharm.D. University at Buffalo Buffalo, New York

Statistical tests for SPSS

Simulating Chi-Square Test Using Excel

First-year Statistics for Psychology Students Through Worked Examples

Statistical Functions in Excel

Binomial Probability Distribution

SAS Software to Fit the Generalized Linear Model

Chapter 19 The Chi-Square Test

Contemporary Mathematics- MAT 130. Probability. a) What is the probability of obtaining a number less than 4?

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Solutions to Homework 10 Statistics 302 Professor Larget

Week 4: Standard Error and Confidence Intervals

The Dummy s Guide to Data Analysis Using SPSS

ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R.

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Poisson Models for Count Data

Testing Research and Statistical Hypotheses

Introduction to Hypothesis Testing

STA 130 (Winter 2016): An Introduction to Statistical Reasoning and Data Science

Stat 5102 Notes: Nonparametric Tests and. confidence interval

Name: Date: Use the following to answer questions 3-4:

Test Positive True Positive False Positive. Test Negative False Negative True Negative. Figure 5-1: 2 x 2 Contingency Table

Odds ratio, Odds ratio test for independence, chi-squared statistic.

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as...

Data Analysis Tools. Tools for Summarizing Data

Analysis of categorical data: Course quiz instructions for SPSS

Experimental Design. Power and Sample Size Determination. Proportions. Proportions. Confidence Interval for p. The Binomial Test

Advanced Statistical Analysis of Mortality. Rhodes, Thomas E. and Freitas, Stephen A. MIB, Inc. 160 University Avenue. Westwood, MA 02090

Recall this chart that showed how most of our course would be organized:

One-Way ANOVA using SPSS SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

statistics Chi-square tests and nonparametric Summary sheet from last time: Hypothesis testing Summary sheet from last time: Confidence intervals

Statistics. One-two sided test, Parametric and non-parametric test statistics: one group, two groups, and more than two groups samples

Categorical Data Analysis

Testing differences in proportions

Chapter 13. Chi-Square. Crosstabs and Nonparametric Tests. Specifically, we demonstrate procedures for running two separate

3.4 Statistical inference for 2 populations based on two samples

Drawing a histogram using Excel

Chapter 3 RANDOM VARIATE GENERATION

OPRE504 Chapter Study Guide Chapter 7 Randomness and Probability. Terminology of Probability. Probability Rules:

Descriptive Statistics and Measurement Scales

6.4 Normal Distribution

Using Stata for Categorical Data Analysis

2. Discrete random variables

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Simple Random Sampling

Come scegliere un test statistico

Characteristics of Binomial Distributions

Descriptive Analysis

Midterm Review Problems

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS

Analysing Questionnaires using Minitab (for SPSS queries contact -)

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

Main Effects and Interactions

The overall size of these chance errors is measured by their RMS HALF THE NUMBER OF TOSSES NUMBER OF HEADS MINUS NUMBER OF TOSSES

Binomial Distribution n = 20, p = 0.3

The correlation coefficient

Tests for Two Proportions

Adverse Impact Ratio for Females (0/ 1) = 0 (5/ 17) = Adverse impact as defined by the 4/5ths rule was not found in the above data.

Normal distribution. ) 2 /2σ. 2π σ

TABLE OF CONTENTS. About Chi Squares What is a CHI SQUARE? Chi Squares Hypothesis Testing with Chi Squares... 2

When to use Excel. When NOT to use Excel 9/24/2014

Likelihood: Frequentist vs Bayesian Reasoning

American Journal Of Business Education July/August 2012 Volume 5, Number 4

WHERE DOES THE 10% CONDITION COME FROM?

MONT 107N Understanding Randomness Solutions For Final Examination May 11, 2010

Chapter 5. Discrete Probability Distributions

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

Stats Review Chapters 9-10

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217

DETERMINE whether the conditions for a binomial setting are met. COMPUTE and INTERPRET probabilities involving binomial random variables

A and B This represents the probability that both events A and B occur. This can be calculated using the multiplication rules of probability.

Standard Deviation Estimator

CHAPTER 13. Experimental Design and Analysis of Variance

Transcription:

Contingency tables, chi-square tests Joe Felsenstein Contingency tables, chi-square tests p.1/13

Confidence limits on a proportion When you observe the numbers of times two outcomes occur, in a binomial distribution (tossing coins, in effect), you can use the definition of confidence intervals to work out an interval estimate of the probability of Heads. Use the R function qbinom(0.05, n, p) or qbinom(0.95, n, p) where there are n trials and Heads probability p. Adjust p until you just barely reach the observed number of Heads. For example, if you made 100 tosses and got 43 Heads, the smallest p that reaches 43 Heads for the 0.025 lower tail probability is 0.347. The highest p that reaches 42 Heads for the 0.975 lower tail probability is 0.522. So the confidence interval is (0.332, 0.522). Note that these numbers are not symmetrical around the observed frequency of 0.43. This method is exact, but kind of a nuisance to compute! A test is easier. If we want to see whether 0.50 is too high a probability of Heads when we observe 43 Heads out of 100 tosses, we use pbinom(43, 100, 0.5) which gives probability of 43 or fewer Heads as 0.09667, so a one-tailed test does not exclude 0.50 as the Heads probability. Contingency tables, chi-square tests p.2/13

Probability The normal approximation Actually, the binomial distribution is fairly well-approximated by the Normal distribution: 0.15 0.10 0.05 0.00 0 5 10 15 20 Number of heads This shows the Binomial with 20 trials and Heads probability 0.3, the class probabilities are the open circles. For each number of Heads k, we approximate this by the area under a normal distribution with mean 20 0.3 and variance 20 0.3 0.7, between k + 1 2 and k 1 2 (these are shown as solid circles connected by lines). The chi-square distribution in effect uses that approximation. Contingency tables, chi-square tests p.3/13

Three cases There are really three different cases. It turns out that the same chi-square calculation tests them all! 1. We draw a given number of sample items (say 100 and 150) in each of two cases. We observe the number of an outcome in each. Thus one marginal (the numbers of the two cases investigated) is fixed. Males Females Total went to college 37 52 89 not 63 98 161 Total 100 150 250 > tb <- matrix(c(37,52,62,98),c(2,2)) > chisq.test(tb) Pearson s Chi-squared test with Yates continuity correction data: tb X-squared = 0.0589, df = 1, p-value = 0.8083 Contingency tables, chi-square tests p.4/13

Three cases 2. We draw individuals from a population and classify them as having or not having one trait, and also as having or not having another. Neither marginal is fixed. does twitters not Total flaky 53 28 81 not 69 50 119 Total 122 78 200 > tb <- matrix(c(53,28,69,50),c(2,2)) > chisq.test(tb) Pearson s Chi-squared test with Yates continuity correction data: tb X-squared = 0.8328, df = 1, p-value = 0.3615 Contingency tables, chi-square tests p.5/13

Three cases 3. The lady tasting tea example by R. A. Fisher. A taste test is done where 100 cups of tea have 50 in which the milk was added after the hot water, and 50 in which milk was added before the hot water. The taster tries to decide which cups are the 50 in which the milk was added afterward. milk first milk second Total Says milk first 31 19 50 Says milk second 19 31 50 Total 50 50 100 > tb <- c(31,19,19,31) > a <- matrix(tb,c(2,2)) > chisq.test(a) Pearson s Chi-squared test with Yates continuity correction data: a X-squared = 4.84, df = 1, p-value = 0.02781 Contingency tables, chi-square tests p.6/13

How to do a chi-square test First, figure out the expected numbers in each class. In a contingency table, in each cell). For an m n contingency table this is (row sum) (column sum) / (total for whole table). Sum over all classes: classes (Observed Expected) 2 Expected The number of degrees of freedom is the total number of classes, less one because the expected frequencies add up to 1, less the number of parameters you had to estimate. For a contingency table you in effect estimated n 1 column frequencies and m 1 row frequencies so the degrees of freedom are nm (n 1) (m 1) 1 which is (n 1)(m 1). Look the value up on a chi-square (χ 2 ) table, which is the distribution of sums of (various numbers of) squares of normally-distributed quantities. Contingency tables, chi-square tests p.7/13

The chi-square distribution Here are some critical values of the χ 2 distribution for different numbers of degrees of freedom: df Upper 95% point df Upper 95% point 1 3.841 15 24.996 2 5.991 20 31.410 3 7.815 25 37.652 4 9.488 30 43.773 5 11.070 35 49.802 6 12.592 40 55.758 7 14.067 45 61.656 8 15.507 50 67.505 9 16.919 60 79.082 10 18.307 70 90.531 Of course you can get the correct P values computed when you use R. Contingency tables, chi-square tests p.8/13

Fisher s exact test R. A. Fisher invented an exact test that can be used on 2 2 tables if the number of entries is small enough (say less than 100). We consider only tables that have the same marginals, both the row ones and the column ones. We can rank these by how extreme they are. For example in the Lady Tasting Tea example I gave, this table 31 19 19 31 can be shown to have probability of occurring (among all tables that have the same marginal totals, if there is no actual ability of the taster) of Prob ([ 31 19 19 31 ]) = 50! 50! 50! 50! 100! 31! 19! 19! 31! = 0.00916353 and these are summed for all tables as extreme or more extreme (having 31, 32, 33,... in the upper corner if a one-tailed test). P = 0.02731 (compare this with the chi-square values. Contingency tables, chi-square tests p.9/13

Power and lumping classes The chi-square test guards against all departures in any pattern. But this can lose power if you are really expecting some kinds of departure and not others. Intelligently collapsing classes can help. Example: put 2 checkers on each red square of a checkerboard, and none on the black squares. If we test this as an 8 8 contingency table, χ 2 = 32, with 31 degrees of freedom. Not significant. If we test this by grouping all red squares into one class, which has 64 checkers, and all black squares into one class, which has 0 checkers, χ 2 = 32 but the degrees of freedom is only 1. Wildly significant! Although we lose power to detect other patterns of departure by grouping, we gain greatly in ability to detect whether red and black squares have different numbers of checkers. Contingency tables, chi-square tests p.10/13

Goodness-of-fit tests If the classes represent different outcomes from a distribution (either binned continuous variables or discrete variables), we can use the usual chi-square calculation if we have expectations for each class. These can be computed by adjusting parameters. One way that is valid is to adjust them to minimize the chi-square. The number of parameters you adjusted has to be subtracted from the degrees of freedom. Classes with expectations less than 1 should be combined they may cause inaccuracy of the chi-square approximation. The degrees of freedom is correspondingly reduced. Contingency tables, chi-square tests p.11/13

Nonindependent points If we collect points that are not independent, this can cause trouble. Example: surveying proportion of smokers while collecting multiple members of each family, who may all tend to smoke or all not. If, for example, we count each point twice (for example), the χ 2 value is then twice what it should be. This will exaggerate significance, of courses. Contingency tables, chi-square tests p.12/13

An after-the-fact method A method (which I will not justify in detail, but works): We are allowed to collapse rows and columns of a contingency table, even after the fact, even to maximize our chi-square until we end up with a 2 2 table. We can conservatively correct for the collapsing and the after-the-factness by testing not with 1 degrees of freedom, but with rows + cols 3. Contingency tables, chi-square tests p.13/13