Chi Square for Contingency Tables

2 x 2 Case

A test for $p_1 = p_2$

We have learned a confidence interval for $p_1 - p_2$, the difference in the population proportions. We want a hypothesis testing procedure for this difference.

Definitions

A contingency table is a tabular arrangement of count data showing how the frequencies of the row factor relate to the column factor.
We call a contingency table with r rows and c columns an r x c contingency table.
Each category in a contingency table is called a cell.

Example

Consider a 2 x 2 contingency table with the row factor denoting a success versus a failure, and the column factor denoting Group 1 or Group 2, where the samples for Group 1 and Group 2 are independent of each other. Then the contingency table looks like this (with $n_1$ and $n_2$ denoting the sample sizes of the two groups):

             Group 1        Group 2
Success      $Y_1$          $Y_2$
Failure      $n_1 - Y_1$    $n_2 - Y_2$

Recall Example 10.37 regarding the effectiveness of Timolol on angina status. The contingency table would be as follows:

                    Timolol    Placebo
Angina free            44         19
Not angina free       116        128
We have already used these data to construct a 95% confidence interval for the difference in the proportion of angina-free patients under the Timolol versus the placebo conditions.

Let $p_1$ denote the probability (or population proportion) of success for Group 1.
Let $p_2$ denote the probability (or population proportion) of success for Group 2.

To test $H_O: p_1 = p_2$, we'll introduce Pearson's $\chi^2$ (chi-square) statistic.

Definition

Pearson's $\chi^2$ statistic is

$$X^2_s = \sum \frac{(O - E)^2}{E}$$

where the sum is over all the cells in the table, O denotes the observed value in each cell, and E denotes the value we'd expect to see in that cell (if $H_O$ were true).

Now, we have the observed values (the data we collected). What are the E's?

Remember, we conduct hypothesis tests under the assumption that the null hypothesis is true. If the null hypothesis were true, then $p_1 = p_2$. So $\hat{p}_1$ and $\hat{p}_2$ would be estimating a common $p$ (i.e., the probability of a success would be the same under Group 1 or Group 2 in our example). Then we could estimate this common $p$ by using a weighted ("pooled") estimator, where $n_1$ and $n_2$ are the sample sizes and $Y_1$ and $Y_2$ the numbers of successes in the two groups:

$$\hat{p}_{pool} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{n_1 \frac{Y_1}{n_1} + n_2 \frac{Y_2}{n_2}}{n_1 + n_2} = \frac{Y_1 + Y_2}{n_1 + n_2}$$

Little Sidebar

Suppose you are flipping an unfair coin, where the probability of a heads is 0.3 and the probability of a tails is 0.7. How many heads would you expect to see if you were to flip this unfair coin ten times?

Now, apply this thought process to get the expected successes for Group 1.

And compute the expected successes for Group 2.
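The sidebar reasoning can be sketched in a few lines of Python. Ten flips with P(heads) = 0.3 give 10 x 0.3 = 3 expected heads, and in the same way each expected count is a group size times the pooled probability. This is only an illustrative sketch using the Timolol counts from Example 10.37; the variable names (`y1`, `n1`, `p_pool`, `expected`) are chosen here and are not part of the notes.

```python
# Observed counts from the Timolol example (Example 10.37).
y1, n1 = 44, 44 + 116   # Timolol: angina-free count, group size
y2, n2 = 19, 19 + 128   # Placebo: angina-free count, group size

# Pooled estimate of the common success probability under H_O: p1 = p2.
p_pool = (y1 + y2) / (n1 + n2)

# Expected count = (group size) x (probability), as in the coin-flip sidebar.
# The counts are deliberately left unrounded.
expected = {
    "success": (n1 * p_pool, n2 * p_pool),
    "failure": (n1 * (1 - p_pool), n2 * (1 - p_pool)),
}

print(f"p_pool = {p_pool:.4f}")
for label, (e1, e2) in expected.items():
    print(f"{label:8s} {e1:10.4f} {e2:10.4f}")
```

Each column of expected counts adds back up to its group size, which is a quick check on the arithmetic.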
Fill out the Expected Table for the Group 1/Group 2 success/failure contingency table.

             Group 1    Group 2
Success
Failure

Things to remember

- The E's (expected counts) need not be integers, and we do not round them.
- The row and column totals are the same for the observed and expected tables (this is a good way to check your calculations!).
- For the chi-square test (which we'll begin implementing in just a moment) to be valid, we need each $E \ge 1$ and the average $E \ge 5$.
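The three reminders above can be checked mechanically. The sketch below is one possible way to do it, not a prescribed procedure: it computes each expected count as (row total x column total) / (grand total), which is algebraically the same as group size times the pooled proportion, and then verifies the margins and the rule of thumb. The function names `expected_counts` and `rule_of_thumb_ok` are hypothetical helpers introduced for this illustration.

```python
def expected_counts(observed):
    """Expected table under H_O: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def rule_of_thumb_ok(expected):
    """True when every E >= 1 and the average E >= 5."""
    cells = [e for row in expected for e in row]
    return min(cells) >= 1 and sum(cells) / len(cells) >= 5


# Timolol data: rows = angina free / not angina free, columns = Timolol / Placebo.
observed = [[44, 19], [116, 128]]
expected = expected_counts(observed)

# Row and column totals should agree between the observed and expected tables.
rows_match = all(abs(sum(o) - sum(e)) < 1e-9
                 for o, e in zip(observed, expected))
cols_match = all(abs(sum(o) - sum(e)) < 1e-9
                 for o, e in zip(zip(*observed), zip(*expected)))

print("expected table:", expected)
print("margins match:", rows_match and cols_match)
print("rule of thumb satisfied:", rule_of_thumb_ok(expected))
```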
Calculating P values under the $\chi^2$ distribution

- The $\chi^2$ distribution is a right-skewed distribution.
- The values of a $\chi^2$ random variable are greater than or equal to 0.
- The $\chi^2$ distribution has degrees of freedom.
- The degrees of freedom for a $\chi^2$ test with a contingency table are df = (# of rows - 1)(# of columns - 1).

For a non-directional alternative,

$$P = P\{\chi^2_{df} \ge X^2_s\}$$

If df = 1, we have the option of testing a directional alternative. In this case,

$$P = \frac{1}{2} P\{\chi^2_{df} \ge X^2_s\}$$

if the data deviate in the direction specified by $H_A$; otherwise, P > 0.5.

TI 83/84

MATRIX (2nd $x^{-1}$) > scroll over to EDIT > ENTER > enter your matrix
STAT > scroll over to TESTS > scroll down to $\chi^2$-Test > ENTER > make sure your observed values are in the matrix specified; the expected matrix will be calculated for you and stored in the matrix specified > Calculate > ENTER
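For readers without a calculator at hand, the df = 1 tail area can be sketched with Python's standard library. This relies on the fact that a $\chi^2$ variable with 1 degree of freedom is the square of a standard normal, so the upper tail equals `math.erfc(sqrt(x/2))`. The function names are hypothetical. One assumption goes beyond the notes: when the data deviate in the wrong direction, the sketch returns 1 minus half the non-directional P value (the matching one-sided normal value), whereas the notes only need the fact that P > 0.5.

```python
import math


def chi2_upper_tail_df1(x2):
    """P{chi^2_1 >= x2}, using chi^2_1 = Z^2 for a standard normal Z."""
    return math.erfc(math.sqrt(x2 / 2.0))


def p_value_df1(x2, directional=False, deviates_as_predicted=True):
    """P value for a 2 x 2 table (df = 1)."""
    p = chi2_upper_tail_df1(x2)
    if not directional:
        return p
    if deviates_as_predicted:
        return p / 2          # halve the non-directional P value
    return 1 - p / 2          # assumption: exact one-sided value, always > 0.5


# 3.841 is the familiar 5% critical value for 1 degree of freedom.
print(p_value_df1(3.841))
print(p_value_df1(3.841, directional=True))
print(p_value_df1(3.841, directional=True, deviates_as_predicted=False))
```

This closed form only covers df = 1; larger tables need a general $\chi^2$ tail routine.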
Example

Using the table below, conduct a test of hypothesis at the α = 0.01 significance level to determine whether there is a significant difference in the probability of being angina free under Timolol or placebo.

                    Timolol    Placebo
Angina free            44         19
Not angina free       116        128
What if the researchers wanted to know whether the probability of being angina free is greater under Timolol than under placebo?

What if the researchers wanted to detect whether the probability of being angina free under Timolol is less than under placebo?
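One way to check hand calculations for the Timolol example and its two directional variants is the following sketch. It assumes the df = 1 shortcut `math.erfc(sqrt(x/2))` for the $\chi^2$ tail, and it reports 1 minus half the non-directional P value when the data deviate in the wrong direction (an assumption of this sketch; the notes only require P > 0.5). Variable names are illustrative.

```python
import math

observed = [[44, 19],     # angina free:     Timolol, Placebo
            [116, 128]]   # not angina free: Timolol, Placebo
alpha = 0.01

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Pearson's statistic: sum over cells of (O - E)^2 / E.
x2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        x2 += (o - e) ** 2 / e

# Non-directional P value for df = (2 - 1)(2 - 1) = 1.
p_nondirectional = math.erfc(math.sqrt(x2 / 2))

# Sample proportions of angina-free patients, used to check the direction.
p1_hat = observed[0][0] / col_totals[0]   # Timolol
p2_hat = observed[0][1] / col_totals[1]   # Placebo

# H_A: p1 > p2 and H_A: p1 < p2.
half = p_nondirectional / 2
p_greater = half if p1_hat > p2_hat else 1 - half
p_less = half if p1_hat < p2_hat else 1 - half

print(f"X^2 = {x2:.3f}")
for name, p in [("p1 != p2", p_nondirectional),
                ("p1 > p2", p_greater),
                ("p1 < p2", p_less)]:
    decision = "reject H_O" if p < alpha else "fail to reject H_O"
    print(f"H_A: {name:9s} P = {p:.4f} -> {decision}")
```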
A Test for Association

The work-up of all the previous examples assumed we had two independent samples and were observing those two samples for the outcome of one variable. Many times, we are instead in the situation where we observe one sample for two explanatory factors.

                          Factor 1
                      Level 1    Level 2
Factor 2   Level 1    $Y_1$      $Y_2$
           Level 2

In the case where we have one sample and we're observing it for two explanatory factors, we'll test the hypothesis of association. The test for $H_O$: there is no association is numerically equivalent to that of $H_O: p_1 = p_2$, but the hypotheses and interpretations are different.
Example 10.21

To study the association of hair color and eye color in a German population, an anthropologist observed a sample of 6,800 men.

                        Hair Color
                     Dark       Light
Eye Color   Dark      726         131
            Light   3,129       2,814

Test, at the α = 0.05 significance level, whether hair color is associated with eye color in this population of German men.
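The association test uses the same arithmetic as the two-sample test, so the hair and eye color table can be checked with a short sketch like the one below. It also prints the share of dark-haired men within each eye color, and, as a purely hypothetical comparison, recomputes the statistic for a table with the same proportions but one tenth of the counts, to show how strongly $X^2_s$ depends on sample size. The helper name `pearson_x2` and the scaled-down table are illustrations, not part of the example.

```python
import math


def pearson_x2(observed):
    """Pearson's X^2 = sum over cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return sum((o - r * c / total) ** 2 / (r * c / total)
               for row, r in zip(observed, row_totals)
               for o, c in zip(row, col_totals))


# Rows: dark eyes, light eyes. Columns: dark hair, light hair.
observed = [[726, 131], [3129, 2814]]

x2 = pearson_x2(observed)
p = math.erfc(math.sqrt(x2 / 2))   # df = 1 upper tail

# Proportion of dark hair within each eye-color row.
share_dark_hair = [row[0] / sum(row) for row in observed]

# Hypothetical table: same proportions, one tenth of the sample size.
smaller = [[o / 10 for o in row] for row in observed]
x2_small = pearson_x2(smaller)

print(f"X^2 = {x2:.1f}, P = {p:.3g}")
print("share with dark hair (dark eyes, light eyes):", share_dark_hair)
print(f"X^2 with one tenth of the counts = {x2_small:.1f}")
```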
General r x c Case

The ideas presented in the 2 x 2 cases can be easily extended to general r x c contingency tables.

For the case where we have c different samples (your columns), and we're checking each sample for different levels of the row factor, the hypothesis will change slightly. Here, we'll test whether the distributions are the same for each sample. (Think about it: if we have more than a success and a failure, then for each column we'll have P(level 1), P(level 2), ..., P(level r). The null hypothesis would then be testing whether $p_{11} = p_{12} = \dots = p_{1c}$ and $p_{21} = p_{22} = \dots = p_{2c}$, etc. This is called a compound hypothesis.)

For the case where we have one sample and we're checking that one sample for different levels of two different factors, we'll still be testing association.
Example 10.31

The following table shows the observed distribution of A, B, AB, and O blood types in three samples of African Americans living in different locations.

        I (Florida)    II (Iowa)    III (Missouri)
A           122           1781            353
B           117           1351            269
AB           19            289             60
O           244           3301            713

Test, at the α = 0.05 level of significance, whether the distribution of blood type for African Americans is different across the three regions.
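A sketch of the r x c computation for this example follows. It prints the estimated blood-type distribution within each location (the quantities the compound null hypothesis says are equal across columns), then Pearson's statistic and its P value. Because df = (4 - 1)(3 - 1) = 6 is even, the sketch uses the closed-form tail for even degrees of freedom, exp(-x/2) times a finite sum; this shortcut is an assumption of the illustration and does not apply to odd df. Function names are hypothetical.

```python
import math


def pearson_chi2(observed):
    """Return (X^2, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    x2 = sum((o - r * c / total) ** 2 / (r * c / total)
             for row, r in zip(observed, row_totals)
             for o, c in zip(row, col_totals))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return x2, df


def chi2_upper_tail_even_df(x2, df):
    """P{chi^2_df >= x2} for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires positive even df")
    half = x2 / 2.0
    term, series = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


# Rows: A, B, AB, O. Columns: I (Florida), II (Iowa), III (Missouri).
observed = [[122, 1781, 353],
            [117, 1351, 269],
            [19, 289, 60],
            [244, 3301, 713]]

# Estimated blood-type distribution within each location (each column sums to 1).
for name, col in zip(["Florida", "Iowa", "Missouri"], zip(*observed)):
    print(name, [round(o / sum(col), 3) for o in col])

x2, df = pearson_chi2(observed)
p = chi2_upper_tail_even_df(x2, df)
print(f"X^2 = {x2:.2f}, df = {df}, P = {p:.3f}")
```

The same two functions apply unchanged to any table whose df is even; for other tables a general $\chi^2$ tail routine or the calculator is needed.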
Example 10.33

To study the association of hair color and eye color in a German population, an anthropologist observed a sample of 6,800 men (this is the same study as that of Example 10.21).

                              Hair Color
Eye Color          Brown    Black    Fair    Red
Brown                438      288     115     16
Grey or Green       1387      746     946     53
Blue                 807      189    1768     47

Test, at the α = 0.05 significance level, whether hair color is associated with eye color in this population of German men.
Final Notes on Chi Square for Contingency Tables

- Remember, your calculator gives P values for a non-directional alternative.
- We can have a directional alternative when we're in the 2 x 2 case. When $H_A$ is directional, one must check whether the data deviate in the direction specified by $H_A$:
    o If yes, cut the P value in half
    o If no, P > 0.5 and we fail to reject $H_O$
- Degrees of freedom for an r x c table are df = (# rows - 1)(# columns - 1).
- Pearson's $X^2$ statistic for contingency tables uses the approximation $X^2 \sim \chi^2_{df}$, so for the approximation to be valid, a standard rule of thumb is to require $E \ge 1$ for each cell and the average $E \ge 5$ (and observations independent of one another).
- If expected counts are small and the data form a 2 x 2 table, Fisher's exact test may be appropriate.
- By contrast, Example 10.21 illustrates that $X^2_s$ is very sensitive with large sample sizes.
- For r x c tables, we have the following two hypotheses:
    o c samples, each checked for r levels of a row factor: we're testing whether the distributions are the same (for the groups, i.e., your columns)
    o one sample, checked for r levels of a row factor and c levels of a column factor: we're testing for an association of the row and column factors