Elementary Statistics

M. Ghamsary, Ph.D.

Chapter 10: Chi-Square Test for Goodness of Fit and Contingency Tables

Chi-Square Test

Generally speaking, the chi-square test is a statistical test used to examine differences among categorical variables. It is used in two similar but distinct circumstances:

1. to assess how closely an observed distribution matches an expected distribution; we will refer to this as the goodness-of-fit test.
2. to assess whether two random variables are independent (contingency tables).

Goodness of Fit Test

One of the more interesting goodness-of-fit applications of the chi-square test is to examine issues of fairness and cheating in games of chance, such as coins, cards, dice, and roulette. Since such games usually involve wagering, there is significant incentive for people to try to rig them, and allegations of missing cards, "loaded" dice, and "sticky" roulette wheels are all too common.

So how can the goodness-of-fit test be used to examine cheating in gambling? It is easier to describe the process through an example. Take the example of dice. Most dice used in wagering have six sides, with each side having a value of one, two, three, four, five, or six. If the die being used is fair, then the chance of any particular number coming up is the same: 1 in 6. However, if the die is loaded, then certain numbers will have a greater likelihood of appearing, while others will have a lower likelihood.

So we would like to test whether a given data set matches the hypothesized distribution. The test statistic used for this purpose is

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where O is the observed frequency and E is the expected frequency in each category, and the sum runs over all categories.
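As a minimal sketch of the statistic just defined, the sum of (O - E)^2 / E can be written in a few lines of Python. The function name `chi_square_statistic` and the two sets of die counts are illustrative assumptions, not data from the text.

```python
def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 60 rolls of a die (not data from the text).
expected = [10] * 6                  # a fair die: 60 / 6 rolls per face
close_fit = [9, 11, 10, 10, 12, 8]   # counts near the expected values
poor_fit = [5, 5, 5, 5, 5, 35]       # one face appears far too often

print(chi_square_statistic(close_fit, expected))  # small statistic
print(chi_square_statistic(poor_fit, expected))   # large statistic
```

Counts close to the expected values give a statistic near zero, while counts far from them give a large one. This is the behavior the test relies on.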

Clearly, if the data match the claimed distribution, the chi-square value will be small and we cannot reject the null hypothesis. Otherwise the value of $\chi^2$ will be large and we must reject $H_0$.

Example 1: The simplest example is to flip a coin 100 times and record the outcomes. Suppose we observed 40 heads. Test the claim that the coin is fair, which means the outcomes are equally likely. Use the 5% level of significance.

Solution: Let us write the outcomes in the following table. The expected number of heads is 0.50(100) = 50.

            Head    Tail
Observed     40      60
Expected     50      50

Step 1:
$H_0$: The coin is fair
$H_1$: The coin is not fair

df = 2 - 1 = 1, $\alpha = 0.05$; by using Table III, CV = 3.84.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(40 - 50)^2}{50} + \frac{(60 - 50)^2}{50} = 4$$

Step 3: Decision: Since 4 > 3.84, we reject $H_0$.

Conclusion: At the 5% level of significance, there is sufficient evidence that the coin is biased.

Example 2: The next simplest example is to roll a die 120 times and record the outcomes. Suppose we have observed 18 ones, 23 twos, 15 threes, 22 fours, 17 fives, and 25 sixes. Test the claim that the die is fair, which again means the outcomes are equally likely. Use the 5% level of significance.

Solution: Let us write the outcomes in the following table. Under the assumption of equal likelihood, the expected counts are all equal: E = 120/6 = 20.

Face         1     2     3     4     5     6
Observed    18    23    15    22    17    25
Expected    20    20    20    20    20    20

Step 1:
$H_0$: The die is fair
$H_1$: The die is not fair

df = 6 - 1 = 5, $\alpha = 0.05$; by using Table III, CV = 11.07.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(18 - 20)^2}{20} + \frac{(23 - 20)^2}{20} + \frac{(15 - 20)^2}{20} + \frac{(22 - 20)^2}{20} + \frac{(17 - 20)^2}{20} + \frac{(25 - 20)^2}{20} = 3.8$$

Step 3: Decision: Since 3.8 < 11.07, we fail to reject $H_0$.

Conclusion: There is not enough evidence to conclude that the die is biased.
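The arithmetic of Example 2 can be checked with a short Python sketch. It assumes the observed counts are 18, 23, 15, 22, 17, and 25 from 120 rolls, and it takes the critical value 11.07 from the chi-square table rather than computing it. The variable names are illustrative.

```python
# Observed counts for faces 1..6, as read from Example 2.
observed = [18, 23, 15, 22, 17, 25]

n = sum(observed)                               # total number of rolls
expected = [n / len(observed)] * len(observed)  # equal counts under H0

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Critical value copied from the chi-square table (df = 5, alpha = 0.05).
critical_value = 11.07
reject_h0 = chi_sq > critical_value

print(f"n = {n}, chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```

The decision rule is a single comparison: reject the null hypothesis only when the statistic exceeds the critical value.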

Example 3: An ice cream shop would like to know which flavor is preferred by its customers. Past records show that 50% prefer vanilla, 20% prefer chocolate, 10% prefer vanilla fudge, 15% prefer strawberry, and 5% prefer other kinds. A random sample of 500 customers revealed the following results. Test the claim that the observed counts match these percentages.

Flavor       Vanilla   Chocolate   Strawberry   Vanilla Fudge   Others
Customers      240        120          70             40           30

Solution: Let us calculate the expected values as follows:

Vanilla:        50% of 500 = 0.50 * 500 = 250
Chocolate:      20% of 500 = 0.20 * 500 = 100
Strawberry:     15% of 500 = 0.15 * 500 = 75
Vanilla Fudge:  10% of 500 = 0.10 * 500 = 50
Others:          5% of 500 = 0.05 * 500 = 25

Flavor       Vanilla   Chocolate   Strawberry   Vanilla Fudge   Others
Observed       240        120          70             40           30
Expected       250        100          75             50           25

Step 1:
$H_0$: The observed and expected counts match
$H_1$: The observed and expected counts do not match

df = 5 - 1 = 4, $\alpha = 0.05$, CV = 9.49.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(240 - 250)^2}{250} + \frac{(120 - 100)^2}{100} + \frac{(70 - 75)^2}{75} + \frac{(40 - 50)^2}{50} + \frac{(30 - 25)^2}{25} \approx 7.73$$

Step 3: Decision: Since 7.73 < 9.49, we fail to reject $H_0$.

Conclusion: There is not enough evidence to conclude that customer preferences differ from the past percentages.
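When the hypothesized distribution is given as percentages, the expected counts are the proportions times the sample size. The Python sketch below works through Example 3 under the assumption that the observed counts are 240, 120, 70, 40, and 30, which sum to the 500 customers sampled. The critical value 9.49 is copied from the chi-square table, and the variable names are illustrative.

```python
flavors = ["Vanilla", "Chocolate", "Strawberry", "Vanilla Fudge", "Others"]
proportions = [0.50, 0.20, 0.15, 0.10, 0.05]   # past record of preferences
observed = [240, 120, 70, 40, 30]              # assumed sample counts

n = sum(observed)
expected = [p * n for p in proportions]        # E = proportion * sample size

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(flavors) - 1

# Critical value copied from the chi-square table (df = 4, alpha = 0.05).
critical_value = 9.49
reject_h0 = chi_sq > critical_value

for flavor, o, e in zip(flavors, observed, expected):
    print(f"{flavor:14s} observed = {o:4d}  expected = {e:6.1f}")
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```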

Example 4: Affirmative Action Problem

A large organization in a city is accused of discriminating against one or more racial groups. The population of that city is 45% White, 15% Black, 20% Hispanic, 5% Asian, and the rest (15%) other. A random sample of 250 employees from the whole corporation is collected, with the following results. Test whether the observed frequencies match the percentages in the population.

Race        Observed     %     Expected
White         100       45%     112.5
Black          30       15%      37.5
Hispanic       40       20%      50
Asian          20        5%      12.5
Other          60       15%      37.5
Total         250      100%     250

Step 1:
$H_0$: The observed frequencies match the percentages of the population
$H_1$: The observed frequencies do not match the percentages of the population

$\alpha = 0.05$, df = 5 - 1 = 4, CV = 9.49.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(100 - 112.5)^2}{112.5} + \frac{(30 - 37.5)^2}{37.5} + \frac{(40 - 50)^2}{50} + \frac{(20 - 12.5)^2}{12.5} + \frac{(60 - 37.5)^2}{37.5} \approx 22.9$$

Step 3: Decision: Since 22.9 > 9.49, we reject $H_0$.

Conclusion: The observed frequencies do not match the percentages of the population.
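It can be informative to look at how much each category contributes to the statistic. The Python sketch below does this for Example 4, assuming observed counts of 100, 30, 40, 20, and 60 in a sample of 250. The names are illustrative, and the per-category breakdown is only descriptive; it is not a separate test.

```python
groups = ["White", "Black", "Hispanic", "Asian", "Other"]
population_share = [0.45, 0.15, 0.20, 0.05, 0.15]
observed = [100, 30, 40, 20, 60]   # assumed sample counts

n = sum(observed)

# Contribution of each group to the statistic: (O - E)^2 / E.
contributions = {}
for group, share, o in zip(groups, population_share, observed):
    e = share * n
    contributions[group] = (o - e) ** 2 / e

chi_sq = sum(contributions.values())
largest = max(contributions, key=contributions.get)

for group, value in contributions.items():
    print(f"{group:9s} {value:6.3f}")
print(f"chi-square = {chi_sq:.1f}; largest contribution: {largest}")
```

Most of the statistic here comes from a single category, which shows where the observed counts depart most from the expected ones.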

Example 5: In a study in Alameda County, California, researchers compared the demographic characteristics of members of grand juries to determine how closely these juries reflected the population of the county. If the juries were selected randomly or impartially, then the characteristics of the jurors should closely match those of the larger county; however, if attorneys were tilting the jury selection process, then the jurors' characteristics would be quite different from those of the county. (Figures taken from UCLA Law Review, vol. 20, 1973, as shown at http://www.stat.ucla.edu/cases/jury/)

Age       County-wide %    # of Jurors
21-40          42                5
41-50          23                9
51-60          16               19
>61            19               33

Questions

Based on the figures shown in the table above, use the chi-square test to evaluate whether there is evidence of jury fixing in terms of the age of jurors in Alameda County.

a. What is the null hypothesis? What is the alternative hypothesis?
b. What figures do you need to calculate for this test?
c. How many degrees of freedom are there?
d. What is the value of the chi-square statistic for this table? What is the p-value of this statistic?
e. From this value, what can you conclude about the age of jurors in Alameda County?

Test of Independence

The other primary use of the chi-square test is to examine whether two variables are independent or not. What does it mean to be independent, in this sense? It means that the two factors are not related. Typically in research fields such as epidemiology and social science, we are interested in finding factors that are related: education and income, occupation and prestige, age and voting behavior. In this case, the chi-square test can be used to assess whether two variables are independent or not. More generally, we say that factor A is "not correlated with" or "independent of" factor B if more of one is not associated with more of the other. If two categorical variables are correlated, their values tend to move together, either in the same direction or in opposite directions.

In practice, many data sets come as counts cross-classified by two factors. Such a layout is called a two-way frequency table; some textbooks call it a contingency table. The test of independence checks whether the row factor and the column factor are related.

Test statistic: It is the same as before, namely

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where O is the observed count in a cell and E is the expected count in that cell, which can be found from

$$E = \frac{(\text{Row Total})(\text{Column Total})}{\text{Grand Total}}.$$

Also, the degrees of freedom are df = (r - 1)(c - 1), where r is the number of rows and c is the number of columns.
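The expected-count rule and the degrees of freedom for a two-way table can be sketched in Python as follows. The function names `expected_counts` and `degrees_of_freedom` and the small 2 x 3 table are illustrative assumptions, not taken from the text.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def degrees_of_freedom(table):
    """(r - 1)(c - 1) for a table with r rows and c columns."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical 2 x 3 table of observed counts (not data from the text).
table = [
    [10, 20, 30],
    [20, 20, 20],
]

print(expected_counts(table))
print(degrees_of_freedom(table))
```

The table is passed as a list of rows without the totals; the margins are recomputed from the cells so they cannot disagree with the data.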

Example 6: Dr. Ghamsary and colleagues are testing whether smoking habits and gender are independent. They have collected a random sample of 250 people, classified as in the following table. Test their claim by using the 0.05 level of significance.

               Smoking
Sex         Yes      No     Total
Male         60      40      100
Female       80      70      150
Total       140     110      250

Solution:

Step 1:
$H_0$: Sex and smoking are independent
$H_1$: Sex and smoking are not independent

df = (2 - 1)(2 - 1) = 1, $\alpha = 0.05$, CV = 3.84.

The expected counts are

$$E_{11} = \frac{100 \times 140}{250} = 56, \qquad E_{12} = \frac{100 \times 110}{250} = 44$$

$$E_{21} = \frac{150 \times 140}{250} = 84, \qquad E_{22} = \frac{150 \times 110}{250} = 66$$

In the following table, the expected counts are shown in parentheses below the observed counts.

               Smoking
Sex         Yes      No     Total
Male         60      40      100
            (56)    (44)
Female       80      70      150
            (84)    (66)
Total       140     110      250

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(60 - 56)^2}{56} + \frac{(40 - 44)^2}{44} + \frac{(80 - 84)^2}{84} + \frac{(70 - 66)^2}{66} \approx 1.08$$

Step 3: Decision: Since 1.08 < 3.84, we fail to reject $H_0$.

Conclusion: There is not enough evidence to conclude that sex and smoking are related; the data are consistent with independence.
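The whole of Example 6 fits in a short Python sketch. It assumes the observed table is Male 60/40 and Female 80/70 (Yes/No), and it copies the critical value 3.84 from the chi-square table rather than computing it. The variable names are illustrative.

```python
# Rows: Male, Female. Columns: smokes Yes, smokes No.
observed = [
    [60, 40],
    [80, 70],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count per cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value copied from the chi-square table (df = 1, alpha = 0.05).
critical_value = 3.84
reject_h0 = chi_sq > critical_value

print(expected)
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```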

Cautionary Note

It is important to keep in mind that the chi-square test only tests whether two variables are independent. It cannot address questions of which is greater or less. Using the chi-square test, we cannot evaluate directly the hypothesis that men smoke more than women; rather, the test (strictly speaking) can only test whether the two variables are independent or not.

Example 7: Ghamsary and others have done some research (Kashan, Vol 1, 13. 1998) on income and level of education. They are interested to know if people with more education have higher income. They collected a random sample of 250 people from a large population and found the following results. Test the claim that education and income are independent factors. Use $\alpha = 0.01$.

Income\Education    None    High School    4 year college    Graduate School
Less than 30K        25         20               10                 6
30K-50K              10         28               40                12
Above 50K             5         12               50                32

Solution: In the following table, the expected counts are shown in parentheses below the observed counts.

Income\Education      None       HS       College    Graduate    Total
Less than 30K          25        20         10           6         61
                     (9.76)   (14.64)    (24.40)     (12.20)
30K-50K                10        28         40          12         90
                    (14.40)   (21.60)    (36.00)     (18.00)
Above 50K               5        12         50          32         99
                    (15.84)   (23.76)    (39.60)     (19.80)
Total                  40        60        100          50        250

The expected counts are computed as follows:

$$E_{11} = \frac{61 \times 40}{250} = 9.76, \quad E_{12} = \frac{61 \times 60}{250} = 14.64, \quad E_{13} = \frac{61 \times 100}{250} = 24.40, \quad E_{14} = \frac{61 \times 50}{250} = 12.20$$

$$E_{21} = \frac{90 \times 40}{250} = 14.40, \quad E_{22} = \frac{90 \times 60}{250} = 21.60, \quad E_{23} = \frac{90 \times 100}{250} = 36.00, \quad E_{24} = \frac{90 \times 50}{250} = 18.00$$

$$E_{31} = \frac{99 \times 40}{250} = 15.84, \quad E_{32} = \frac{99 \times 60}{250} = 23.76, \quad E_{33} = \frac{99 \times 100}{250} = 39.60, \quad E_{34} = \frac{99 \times 50}{250} = 19.80$$

Step 1:
$H_0$: Income and education are independent
$H_1$: Income and education are not independent

df = (3 - 1)(4 - 1) = 6; at $\alpha = 0.01$, CV = 16.81.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(25 - 9.76)^2}{9.76} + \frac{(20 - 14.64)^2}{14.64} + \frac{(10 - 24.40)^2}{24.40} + \frac{(6 - 12.20)^2}{12.20}$$

$$+ \frac{(10 - 14.40)^2}{14.40} + \frac{(28 - 21.60)^2}{21.60} + \frac{(40 - 36.00)^2}{36.00} + \frac{(12 - 18.00)^2}{18.00}$$

$$+ \frac{(5 - 15.84)^2}{15.84} + \frac{(12 - 23.76)^2}{23.76} + \frac{(50 - 39.60)^2}{39.60} + \frac{(32 - 19.80)^2}{19.80} \approx 66.58$$

MINITAB: Chi-Square Test

Expected counts are printed below observed counts

             C1       C2       C3       C4    Total
    1        25       20       10        6       61
           9.76    14.64    24.40    12.20

    2        10       28       40       12       90
          14.40    21.60    36.00    18.00

    3         5       12       50       32       99
          15.84    23.76    39.60    19.80

Total        40       60      100       50      250

Chi-Sq = 23.797 + 1.962 + 8.498 + 3.151 +
          1.344 + 1.896 + 0.444 + 2.000 +
          7.418 + 5.821 + 2.731 + 7.517 = 66.581
DF = 6, P-Value = 0.000

Step 3: Decision: Since 66.58 > 16.81, we reject $H_0$.

Conclusion: Income and education are not independent.
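The cell-by-cell contributions in the MINITAB output can be reproduced with a short Python sketch. It assumes the observed table is the 3 x 4 income-by-education table with rows (25, 20, 10, 6), (10, 28, 40, 12), and (5, 12, 50, 32). The critical value 16.81 is copied from the chi-square table, and the variable names are illustrative.

```python
# Rows: Less than 30K, 30K-50K, Above 50K.
# Columns: None, High School, 4 year college, Graduate School.
observed = [
    [25, 20, 10, 6],
    [10, 28, 40, 12],
    [5, 12, 50, 32],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# One (O - E)^2 / E term per cell, laid out like the table itself.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_sq = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value copied from the chi-square table (df = 6, alpha = 0.01).
critical_value = 16.81
reject_h0 = chi_sq > critical_value

for row in contributions:
    print("  ".join(f"{value:7.3f}" for value in row))
print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {reject_h0}")
```

The largest single term comes from the cell for income below 30K with no education listed, where 25 people were observed against about 9.76 expected.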

Example 8: In a recent study, a random sample of 500 students was classified by two factors: how often they study on time for tests, and school area. The results are shown in the following table. Is there an association between the type of school area and studying on time for tests?

                                School Area
Study on time for tests    Rural    Suburban    Urban    Total
Always                       80        90         60      230
Sometimes                    70        70         30      170
Never                        50        30         20      100
Total                       200       190        110      500

Example 9: A chocolate manufacturing company conducted a survey of 300 customers. The research question is: Is there a significant relationship between packaging preference (size of the bottle purchased) and economic status? There were four packaging sizes: small, medium, large, and jumbo. Economic status was lower, middle, or upper. The following data were collected. Test the claim that the size of the packaging and economic status are independent by using the 0.10 level of significance.

                 Economic Status
Size        Lower    Middle    Upper    Total
Small         30       22        18       70
Medium        23       28        19       70
Large         18       27        35       80
Jumbo         19       23        38       80
Total         90      100       110      300

Example 10: A random sample of 1800 persons is questioned regarding their political affiliation and opinion on the war in Iraq. Test whether political affiliation and opinion on the war in Iraq are dependent, using the 5% level of significance. The observed data are given in the following table.

                           War in Iraq
Party Affiliation    Favor    Indifferent    Opposed    Total
Democrat              120         50           580       750
Republican            600        150           100       850
Independent            50         30           120       200
Total                 770        230           800      1800