STAT -50 Introduction to Statistics

The Chi-Square Test

The Chi-square test is a nonparametric test that is used to compare experimental results with theoretical models. That is, we compare observed frequencies with expected frequencies. In a hypothesis test, the expected frequencies are those we would expect if the null hypothesis of our test is true.

The formula is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O represents the observed frequency and E represents the expected frequency. The value of df depends on the type of test you are performing.

The Chi-Square Distribution

The χ² distribution is nonnegative and not symmetrical; it is skewed to the right. It forms a family of distributions, with a separate distribution for each number of degrees of freedom.

The Chi-Square Test for Goodness of Fit

The goodness-of-fit test compares the distribution of observed outcomes for a single categorical variable to the expected outcomes predicted by a probability model. This test involves one sample and one variable.

Assumptions and Conditions:
Be sure that the data is counts, or frequencies
Independence assumption
Sample size assumption
Expected cell frequency condition: each expected frequency is at least 5

The Chi-square test is one-sided: the rejection region lies in the right tail, beyond the critical value χ²(df, α).

Automobile insurance is much more expensive for teenage drivers than for older drivers. To justify this cost difference, insurance companies claim that the younger drivers are much more likely to be involved in costly accidents. To test this claim, a researcher obtains information about registered drivers from the Department of Motor Vehicles and selects a sample of 300 accident reports from the police department. The DMV reports the percentage of registered drivers in each age category as reported below. The number of accident reports for each age group is also shown. Does this data indicate that accidents occur with the same distribution as the ages of the registered drivers?
H0: The distribution of the ages of drivers involved in accidents is the same as the distribution of the ages of registered drivers.
Ha: The distribution of the ages of drivers involved in accidents is not the same as the distribution of the ages of registered drivers.

This is the data:

Age          Percent of registered drivers   Observed accidents (O)
Under 20     16%                             68
20-29        28%                             92
30 or over   56%                             140

Expected cell frequency condition: each expected frequency is n times the percentage for that category.

Age          Percent   Observed (O)   Expected (E)
Under 20     16%       68             48
20-29        28%       92             84
30 or over   56%       140            168
Total                  n = 300        300

Note: Σ observed = Σ expected
Expected cell frequency condition: each expected frequency ≥ 5, so the condition is met.

Age          Percent   Observed (O)   Expected (E)   O - E   (O - E)^2   (O - E)^2 / E
Under 20     16%       68             48             20      400         8.33
20-29        28%       92             84             8       64          0.76
30 or over   56%       140            168            -28     784         4.67
Total                  n = 300        300            0                   13.76

Note: Σ(O - E) = 0

Specify the sampling distribution model and the test you will use.
Since the conditions are met, we will use a Chi-square model with 2 degrees of freedom, and do a Chi-square goodness-of-fit test.

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 13.76, \quad df = k - 1 = 3 - 1 = 2$$
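The goodness-of-fit arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `goodness_of_fit` is an illustrative choice, not something from the slides, and the P-value shortcut is valid only because df = 2 in this example.

```python
import math

# Data from the insurance example: share of registered drivers in each
# age group and the observed number of accident reports.
categories = ["Under 20", "20-29", "30 or over"]
proportions = [0.16, 0.28, 0.56]
observed = [68, 92, 140]


def goodness_of_fit(observed, proportions):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi_square, len(observed) - 1


expected, chi_square, df = goodness_of_fit(observed, proportions)

# For df = 2 the right-tail area of the chi-square distribution has the
# closed form exp(-x / 2). This shortcut does not hold for other df.
p_value = math.exp(-chi_square / 2)

for name, o, e in zip(categories, observed, expected):
    print(f"{name:<12} O = {o:>3}  E = {e:6.1f}  (O - E)^2 / E = {(o - e) ** 2 / e:5.2f}")
print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.4f}")
```

The statistic should agree with the hand calculation up to rounding, and the P-value should fall below .005.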
P-value: P < .005

Statistical conclusion: Since the p-value is small, reject the null hypothesis.

Conclusion in context: The data indicate that the distribution of ages of drivers involved in accidents is not the same as the distribution of ages of the registered drivers in the population.

Using SPSS for a Goodness of Fit Test

If you have the expected proportions:

1. Create a numeric variable with a width of 1 and no decimal places for the categories. Code the values of this variable as follows: In the Values column, click on the box with the three dots. You will then see the Value Labels dialog box. Since there are three categories of ages, enter the values 1, 2, and 3 as coding variables: Enter the value "1" and code it as "under 20". (You do not have to use quotation marks; they will be added by SPSS.) Then click on Add and you will see the results.
Continue adding all categories, one at a time, and then click on OK. You will see the results in the Values column in Variable View.

2. Create a numeric variable with no decimal places for the observed frequencies. Then, for each category, enter the coded value. As you enter each value you will see a drop-down box. If you click on it, you can choose from the list of labels. However, if you just move to the next column, you will see the category name associated with the coded value. You can then enter the observed frequency for that category. Repeat this until all observed frequencies have been entered.

3. Weight the cases using the observed frequencies.

4. Now select > Analyze > Nonparametric Tests > Legacy Dialogs > Chi-Square
5. Select the variable with the observed frequencies as the Test Variable. In the Expected Values box, select Values.

6. Enter the expected proportions (as decimals) one at a time, and click on Add until all have been entered.

7. After the last value has been entered, click on OK. You should see a table showing the observed and expected frequencies and a table with the results of the Chi-square test:

count
         Observed N   Expected N   Residual
68       68           48.0         20.0
92       92           84.0         8.0
140      140          168.0        -28.0
Total    300

Test Statistics
              count
Chi-Square    13.762 a
df            2
Asymp. Sig.   .001
a. 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 48.0.

These results show that χ² = 13.76, and p = .001.
(Note that you also have the option to choose All categories equal if that is appropriate.)

The Chi-Square Test for Homogeneity

In a test for homogeneity, we compare observed distributions for several groups to see if there are differences among the respective populations. The central issue is whether the category proportions are the same for all the populations. The test involves several samples but only one variable.

The article Relationship of Health Behaviors to Alcohol and Cigarette Use by College Students (J. of College Student Development (1992)) included data on drinking behavior for independently chosen random samples of male and female students similar to the data shown below. Does there appear to be a gender difference with respect to drinking behavior?

Drinking level       Male   Female
None                 140    186
Low (1-7)            478    661
Moderate (8-24)      300    173
High (25 or more)    63     16
The Chi-Square Test for Homogeneity

Assumptions and Conditions:
Be sure that the data is counts, or frequencies
Independence assumption
If you want to generalize from the data to a population.
Sample size assumption
Expected cell frequency condition: each expected frequency is at least 5

H0: The proportions of the four drinking levels are the same for males and for females.
Ha: The proportions of the four drinking levels are not the same for males and for females.

Expected cell frequency condition:

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

Specify the sampling distribution model and the test you will use.

df = (R - 1)(C - 1) = (4 - 1)(2 - 1) = (3)(1) = 3

The conditions are met, so we will use a Chi-square model with 3 degrees of freedom, and do a Chi-square test of homogeneity.

Fill in the row and column totals.
Drinking level       Male   Female   Total
None                 140    186      326
Low (1-7)            478    661      1139
Moderate (8-24)      300    173      473
High (25 or more)    63     16       79
Total                981    1036     2017

Calculate the expected frequencies for each cell, using

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

For the first cell, E = (326)(981)/2017 = 158.56. Observed counts are shown with expected frequencies in parentheses:

Drinking level       Male             Female           Total
None                 140 (158.56)     186 (167.44)     326
Low (1-7)            478 (553.97)     661 (585.03)     1139
Moderate (8-24)      300 (230.05)     173 (242.95)     473
High (25 or more)    63 (38.42)       16 (40.58)       79
Total                981              1036             2017
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 2.17 + 2.06 + 10.418 + 9.865 + 21.27 + 20.14 + 15.73 + 14.89 = 96.54$$

df = 3
P-value: p < .005

Statistical conclusion: p is small, so the null hypothesis is rejected.

Conclusion in context: The data does indicate a gender difference with respect to drinking behavior.
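The expected-frequency rule and the test statistic used in the homogeneity example can be sketched in Python as follows. This is an illustration under stated assumptions rather than the course's method: the helper names `expected_counts` and `chi_square_statistic` are hypothetical, and the table is entered with rows None, Low, Moderate, High and columns male, female.

```python
# Observed counts from the drinking-behavior example.
# Rows: None, Low, Moderate, High; columns: male, female.
observed = [
    [140, 186],
    [478, 661],
    [300, 173],
    [63, 16],
]


def expected_counts(table):
    """Expected count for each cell: (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Return the sum of (O - E)^2 / E over all cells and df = (R - 1)(C - 1)."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


expected = expected_counts(observed)
statistic, df = chi_square_statistic(observed)
print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {statistic:.2f}, df = {df}")
```

Because the code keeps full precision in the expected counts, its statistic can differ from the hand-computed 96.54 in the second decimal place. The same two functions apply unchanged to any two-way table of counts.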
Using SPSS for a Test for Homogeneity

1. Create a string variable for each of the categories, and a numeric variable for the observed frequencies. Be sure to make the columns wide enough ("columns" in Variable View). Then enter the values of these two variables.

2. Weight the cases using the observed frequencies. (> Data > Weight Cases)

3. Select > Analyze > Descriptive Statistics > Crosstabs
Select one variable as the row variable and the other as the column variable.
Click on Statistics and then on Chi-square.
Click on the Cells button, and select Observed and Expected in the Cell Display window. Then click on Continue.
Click on Display clustered bar charts to produce the graph shown in the results.
Click on Continue and then click on OK.

Your output should include a table showing the observed and expected frequencies:

gender * level Crosstabulation
                                  level
                                  high    low      moderate   none    Total
gender   female   Count           16      661      173        186     1036
                  Expected Count  40.6    585.0    242.9      167.4   1036.0
         male     Count           63      478      300        140     981
                  Expected Count  38.4    554.0    230.1      158.6   981.0
Total             Count           79      1139     473        326     2017
                  Expected Count  79.0    1139.0   473.0      326.0   2017.0

and a table with the results of your Chi-square test:

Chi-Square Tests
                     Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square   96.526 a   3    .000
Likelihood Ratio     98.966     3    .000
N of Valid Cases     2017
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 38.42.

These results show that χ² = 96.526, and p < .001 (SPSS reports p = .000).

Here is the graph that represents the results: [clustered bar chart of drinking level by gender]
The Chi-Square Test for Independence

In a test for independence, we investigate association between two categorical variables in a single population. There is one sample, but there are two variables.

Assumptions and Conditions:
If you want to generalize from the data to a population.
Expected cell frequency condition

The table shown below was constructed using data in the article Television Viewing and Physical Fitness in Adults (Research Quarterly for Exercise and Sport (1990)). The author hoped to determine whether time spent watching television is associated with cardiovascular fitness. Subjects were asked about their television viewing time (per day, rounded to the nearest hour) and were classified as physically fit if they scored in the excellent or very good category on a step test.

TV group (hours)   Fit   Not Fit
0                  35    147
1-2                101   629
3-4                28    222
5 or more          4     34

H0: Fitness and TV viewing are independent.
Ha: Fitness and TV viewing are not independent.

Specify the sampling distribution model and the test you will use.

df = (R - 1)(C - 1) = (4 - 1)(2 - 1) = (3)(1) = 3

Since the conditions are met, we will use a Chi-square model with 3 degrees of freedom, and do a Chi-square test for independence.

Find the row and column totals.
TV group (hours)   Fit   Not Fit   Total
0                  35    147       182
1-2                101   629       730
3-4                28    222       250
5 or more          4     34        38
Total              168   1032      1200

Calculate the expected frequencies for each cell, using

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

Observed counts are shown with expected frequencies in parentheses:

TV group (hours)   Fit             Not Fit          Total
0                  35 (25.48)      147 (156.52)     182
1-2                101 (102.20)    629 (627.80)     730
3-4                28 (35.00)      222 (215.00)     250
5 or more          4 (5.32)        34 (32.68)       38
Total              168             1032             1200

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 3.557 + .579 + .014 + .002 + 1.4 + .228 + .328 + .053 = 6.161$$
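A statistic like the one just computed is converted to a P-value by finding the right-tail area of the Chi-square distribution, here with df = (R - 1)(C - 1) = 3. The sketch below does this with closed-form expressions that hold for whole-number degrees of freedom, using only the Python standard library. The function name `chi_square_p_value` is an illustrative assumption, not part of the slides or of SPSS.

```python
import math


def chi_square_p_value(statistic, df):
    """Right-tail area of the chi-square distribution for integer df >= 1."""
    if df < 1 or statistic < 0:
        raise ValueError("df must be >= 1 and statistic must be >= 0")
    half = statistic / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k = 0 .. df/2 - 1 of (x/2)^k / k!
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: erfc(sqrt(x/2)) plus
    # exp(-x/2) * sum over k = 1 .. (df - 1)/2 of (x/2)^(k - 1/2) / Gamma(k + 1/2)
    terms = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


# Statistic from the television viewing example, with df = 3.
p = chi_square_p_value(6.161, 3)
print(f"p = {p:.3f}")
```

For this example the result should be a little above .10, which is why a table lookup reports p > .10. A statistical library would normally be used instead of hand-written formulas; the closed forms are shown here only to make the tail-area idea concrete.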
χ² = 6.161, df = 3

P-value: p > .10

Statistical conclusion: Since the p-value is large, we cannot reject the null hypothesis.

Conclusion in context: There is not enough evidence to conclude that time spent watching television is associated with cardiovascular fitness.

Using SPSS for a Test for Independence

Follow the instructions for a Chi-Square test for homogeneity. You may define two string variables for the categories and one numeric variable for the counts, or you may choose to use coding for one or both of the variables representing the categories. Then enter the frequencies as before.
Weight the cases by counts, and then use > Analyze > Descriptive Statistics > Crosstabs
Select one variable as the row variable and the other as the column variable.
Click on Statistics and then on Chi-square.
Click on the Cells button, and select Observed and Expected in the Cell Display window.
Click on Display clustered bar charts to produce the graph shown in the results.
Then click on Continue and on OK.

SPSS output:

TVGroup * Fitness Crosstabulation
                                     Fitness
                                     Fit      Not Fit   Total
TVGroup   0           Count          35       147       182
                      Expected Count 25.5     156.5     182.0
          1-2         Count          101      629       730
                      Expected Count 102.2    627.8     730.0
          3-4         Count          28       222       250
                      Expected Count 35.0     215.0     250.0
          5 or more   Count          4        34        38
                      Expected Count 5.3      32.7      38.0
Total                 Count          168      1032      1200
                      Expected Count 168.0    1032.0    1200.0

Chi-Square Tests
                     Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square   6.161 a   3    .104
Likelihood Ratio     5.930     3    .115
N of Valid Cases     1200
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.32.

These results show that χ² = 6.161 and p = .104.

Here is the graph that supports these results: [clustered bar chart of fitness by TV group]

1. A health professional selected a random sample of 100 patients from each of four major hospital emergency rooms to see if the major reasons for emergency room visits (accident, illegal activity, illness, other) are the same in all four hospitals. This is an example of
a. A goodness-of-fit test
b. A test for homogeneity
c. A test for independence
2. An urban economist wants to determine whether the region of the United States a resident lives in is related to his level of education. He randomly selects 1800 US residents and asks them to report their level of education and the region of the US in which they live. The economist is using
a. A goodness-of-fit test
b. A test for homogeneity
c. A test for independence

3. As part of a class project, a student asked a random sample of students about their preferred soft drink: Pepsi, Coke, or 7-Up, to determine whether these three drinks were equally preferred by students. The student should use
a. A goodness-of-fit test
b. A test for homogeneity
c. A test for independence
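The three tests differ only in how many samples are drawn and how many categorical variables are recorded, as stated in each section's summary. That rule can be written as a small lookup. This is a hypothetical helper for study purposes (the name `choose_chi_square_test` and its arguments are assumptions), illustrated with the three worked examples rather than the review questions.

```python
def choose_chi_square_test(samples, variables):
    """Map a study design to a chi-square test, following the section summaries.

    samples: number of independently drawn samples (1, or more than 1)
    variables: number of categorical variables recorded per subject (1 or 2)
    """
    if samples == 1 and variables == 1:
        return "goodness-of-fit"
    if samples > 1 and variables == 1:
        return "homogeneity"
    if samples == 1 and variables == 2:
        return "independence"
    raise ValueError("design not covered by these three tests")


# Accident reports: one sample, one variable (age group).
print(choose_chi_square_test(samples=1, variables=1))
# Drinking behavior: separate samples of males and females, one variable.
print(choose_chi_square_test(samples=2, variables=1))
# Television viewing: one sample, two variables (TV time and fitness).
print(choose_chi_square_test(samples=1, variables=2))
```

For each review question, counting the samples and the variables in the study description is enough to apply the same rule.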