Elementary Statistics
M. Ghamsary, Ph.D.
Chapter 10: Chi-Square Test for Goodness of Fit and Contingency Tables
Chi-Square Test

Generally speaking, the chi-square test is a statistical test used to examine differences among categorical variables. It is used in two similar but distinct circumstances:

1. To assess how closely an observed distribution matches an expected distribution. We will refer to this as the goodness-of-fit test.
2. To assess whether two categorical variables are independent (contingency tables).

Goodness of Fit Test

One of the more interesting goodness-of-fit applications of the chi-square test is to examine issues of fairness and cheating in games of chance, such as coins, cards, dice, and roulette. Since such games usually involve wagering, there is significant incentive for people to try to rig the games, and allegations of missing cards, "loaded" dice, and "sticky" roulette wheels are all too common.

So how can the goodness-of-fit test be used to examine cheating in gambling? It is easier to describe the process through an example. Take the example of dice. Most dice used in wagering have six sides, with each side having a value of one, two, three, four, five, or six. If the die being used is fair, then the chance of any particular number coming up is the same: 1 in 6. However, if the die is loaded, then certain numbers will have a greater likelihood of appearing, while others will have a lower likelihood.

So we would like to test whether a given data set matches the hypothesized distribution. The test statistic used for this purpose is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed count in a category and E is the expected count in that category under the hypothesized distribution.
Clearly, if the data match the claimed distribution, this chi-square value will be small and we cannot reject the null hypothesis. Otherwise the value of χ² will be large and we must reject H0.

Example 1: The simplest example is to flip a coin 100 times and record the outcomes. Suppose we observed 40 heads. Test the claim that the coin is fair, which means the outcomes are equally likely. Use the 5% level of significance.

Solution: Let us write the outcomes in the following table. The expected number of heads is 0.50(100) = 50.

            Head   Tail
Observed     40     60
Expected     50     50

Step 1:
H0: The coin is fair
H1: The coin is not fair

df = 2 - 1 = 1, α = 0.05; by using Table III, CV = 3.841.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(40 - 50)^2}{50} + \frac{(60 - 50)^2}{50} = 2 + 2 = 4$$

Step 3: Decision: Since 4 > 3.841, we reject H0.

Conclusion: This means the coin is biased.
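The arithmetic of the coin example can be checked with a short Python sketch. The function name `chi_square_statistic` is only illustrative, and the critical value 3.841 is typed in from a chi-square table rather than computed. For df = 1 the p-value can also be obtained from the standard normal relationship, which is what the `math.erfc` line assumes.

```python
import math


def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Coin example: 100 flips, 40 heads, hence 60 tails.
observed = [40, 60]
expected = [0.50 * 100, 0.50 * 100]  # 50 heads and 50 tails under H0

chi2 = chi_square_statistic(observed, expected)
critical_value = 3.841  # table value for df = 1, alpha = 0.05

# For df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
print("reject H0" if chi2 > critical_value else "fail to reject H0")
```

The statistic exceeds the critical value and the p-value falls below 0.05, so both routes lead to the same decision.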
Example 2: The next simplest example is to roll a die 120 times and record the outcomes. Suppose we have observed 18 ones, 23 twos, 15 threes, 22 fours, 17 fives, and 25 sixes. Test the claim that the die is fair, which again means the outcomes are equally likely. Use the 5% level of significance.

Solution: Let us write the outcomes in the following table. Under the assumption of equal likelihood, the expected count is the same for every face: E = 120/6 = 20.

Face         1    2    3    4    5    6
Observed    18   23   15   22   17   25
Expected    20   20   20   20   20   20

Step 1:
H0: The die is fair
H1: The die is not fair

df = 6 - 1 = 5, α = 0.05; by using Table III, CV = 11.07.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(18-20)^2 + (23-20)^2 + (15-20)^2 + (22-20)^2 + (17-20)^2 + (25-20)^2}{20} = \frac{76}{20} = 3.8$$

Step 3: Decision: Since 3.8 < 11.07, we fail to reject H0.

Conclusion: There is not enough evidence to say the die is biased.
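The die example follows the same pattern with six categories. The sketch below reads the observed counts as 18, 23, 15, 22, 17, and 25, an assumption chosen because these are the values that sum to 120 rolls and give an expected count of 20 per face. The critical value 11.07 is taken from the table, not computed.

```python
# Observed counts per face (assumed reading: they must total 120 rolls).
counts = {1: 18, 2: 23, 3: 15, 4: 22, 5: 17, 6: 25}

n = sum(counts.values())       # total number of rolls
expected = n / len(counts)     # equal expected count per face under H0

# Chi-square statistic: every face shares the same expected count.
chi2 = sum((o - expected) ** 2 / expected for o in counts.values())

df = len(counts) - 1           # categories minus one
critical_value = 11.07         # table value for df = 5, alpha = 0.05

print(f"n = {n}, expected per face = {expected}")
print(f"chi-square = {chi2:.2f} with df = {df}")
print("reject H0" if chi2 > critical_value else "fail to reject H0")
```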
Example 3: An ice cream shop would like to know which flavor is preferred by its customers. Past records show that 50% prefer vanilla, 20% prefer chocolate, 10% prefer vanilla fudge, 15% prefer strawberry, and 5% prefer other kinds. A random sample of 500 customers revealed the following results. Test the claim that the observed counts match these percentages.

Flavor       Vanilla   Chocolate   Strawberry   Vanilla Fudge   Others
Customers      240       ...          70            ...           30

Solution: Let us calculate the expected values as follows:

Vanilla:        50% of 500 = 0.50 * 500 = 250
Chocolate:      20% of 500 = 0.20 * 500 = 100
Strawberry:     15% of 500 = 0.15 * 500 = 75
Vanilla Fudge:  10% of 500 = 0.10 * 500 = 50
Others:          5% of 500 = 0.05 * 500 = 25

Flavor       Vanilla   Chocolate   Strawberry   Vanilla Fudge   Others
Observed       240       ...          70            ...           30
Expected       250       100          75             50           25

Step 1:
H0: The observed and expected counts match
H1: The observed and expected counts do not match

df = 5 - 1 = 4, α = 0.05, CV = 9.49.

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(240-250)^2}{250} + \frac{(\ldots-100)^2}{100} + \frac{(70-75)^2}{75} + \frac{(\ldots-50)^2}{50} + \frac{(30-25)^2}{25}$$

Step 3: Decision: Fail to reject H0.

Conclusion: This means the observed flavor preferences are consistent with the past percentages.
Example 4: Affirmative Action Problem. A large organization in a city is accused of discriminating against one or more racial groups. The population of that city is 45% White, 15% Black, 20% Hispanic, 5% Asian, and the rest (15%) other. A random sample of 250 from the whole corporation is collected, with the following results. Test whether the observed frequencies match the percentages in the population.

Race        Observed    %      Expected
White         100       45%     112.5
Black          30       15%      37.5
Hispanic       40       20%      50
Asian          20        5%      12.5
Other          60       15%      37.5
Total         250      100%     250

Step 1:
H0: The observed frequencies match the percentages of the population
H1: The observed frequencies do not match the percentages of the population

df = 5 - 1 = 4, α = 0.05, CV = 9.49.

Step 2:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(100-112.5)^2}{112.5} + \frac{(30-37.5)^2}{37.5} + \frac{(40-50)^2}{50} + \frac{(20-12.5)^2}{12.5} + \frac{(60-37.5)^2}{37.5} \approx 22.89$$

Step 3: Decision: Since 22.89 > 9.49, we reject H0.

Conclusion: This means the observed frequencies and the percentages of the population do not match.
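When the hypothesized distribution is given as percentages, the expected counts come from multiplying each share by the sample size. The sketch below works through the affirmative action example under two assumptions that should be checked against the original table: the sample size is 250, and the White count is whatever remains after the other four groups (30, 40, 20, 60) are subtracted. It also prints each category's contribution to the statistic, which shows where the mismatch is concentrated.

```python
# Population shares from the example ("Other" is the remaining 15%).
shares = {"White": 0.45, "Black": 0.15, "Hispanic": 0.20,
          "Asian": 0.05, "Other": 0.15}
n = 250  # assumed sample size

observed = {"Black": 30, "Hispanic": 40, "Asian": 20, "Other": 60}
# Assumption: the White count is the remainder of the sample.
observed["White"] = n - sum(observed.values())

# Expected count = population share * sample size.
expected = {race: share * n for race, share in shares.items()}

# Per-category contribution (O - E)^2 / E to the chi-square statistic.
contributions = {
    race: (observed[race] - expected[race]) ** 2 / expected[race]
    for race in shares
}
chi2 = sum(contributions.values())
critical_value = 9.49  # table value for df = 4, alpha = 0.05
largest = max(contributions, key=contributions.get)

for race in shares:
    print(f"{race:9s} O={observed[race]:4d} E={expected[race]:6.1f} "
          f"contribution={contributions[race]:.3f}")
print(f"chi-square = {chi2:.2f}; largest contribution: {largest}")
print("reject H0" if chi2 > critical_value else "fail to reject H0")
```

Under these assumptions the "Other" category alone contributes more than the critical value, so the decision does not hinge on the assumed White count.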
Example 5: In a study in Alameda County, California, researchers compared the demographic characteristics of members of grand juries to determine how closely these juries reflected the population of the county. If the juries were selected randomly or impartially, then the characteristics of the jurors should closely match those of the larger county; however, if attorneys were tilting the jury selection process, then the jurors' characteristics would be quite different from those of the county. (Figures taken from UCLA Law Review, vol. 20, as shown at: )

Age      County-wide %      # of Jurors
...          ...                ...

Questions

Based on the figures shown in the table above, use the chi-square test to evaluate whether there is evidence of jury fixing in terms of the age of jurors in Alameda County.

a. What is the null hypothesis? What is the alternative hypothesis?
b. What figures do you need to calculate for this test?
c. How many degrees of freedom are there?
d. What is the value of the chi-square statistic for this table? What is the p-value of this statistic?
e. From this value, what can you conclude about the age of jurors in Alameda County?
Test of Independence

The other primary use of the chi-square test is to examine whether two variables are independent or not. What does it mean to be independent, in this sense? It means that the two factors are not related. Typically in research, such as epidemiology and social science research, we are interested in finding factors that are related: education and income, occupation and prestige, age and voting behavior. In this case, the chi-square test can be used to assess whether two variables are independent or not.

More generally, we say that factor A is "not correlated with" or "independent of" factor B if more of one is not associated with more of the other. If two categorical variables are correlated, their values tend to move together, either in the same direction or in opposite directions.

In practice, many data sets come in the format shown in the examples that follow. Such a layout is called a two-way frequency table; some textbooks call it a contingency table. The test of independence checks whether the row factor and the column factor are related.

Test statistic: the same as before, namely

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed count in a cell and E is the expected count in that cell, which can be found from

$$E = \frac{(\text{Row Total})(\text{Column Total})}{\text{Grand Total}}$$

The degrees of freedom are df = (r - 1)(c - 1), where r is the number of rows and c is the number of columns.
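The expected-count rule and the degrees of freedom translate directly into a few lines of Python. In this sketch the helper names `expected_counts` and `degrees_of_freedom` are illustrative, and the 2 x 3 table is hypothetical data chosen so that the rows are exactly proportional. That is the situation of perfect independence, where every expected count equals the observed count.

```python
def expected_counts(table):
    """E[i][j] = (row total i) * (column total j) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def degrees_of_freedom(table):
    """df = (number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical table: the second row is exactly twice the first,
# so the row factor and the column factor are unrelated.
table = [
    [10, 20, 30],
    [20, 40, 60],
]

expected = expected_counts(table)
df = degrees_of_freedom(table)
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(f"df = {df}, chi-square = {chi2}")
```

Any departure from proportional rows moves the observed counts away from these expected counts and makes the statistic grow.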
Example 6: Dr. Ghamsary and colleagues are testing whether smoking habits and sex are independent. They have collected a random sample of 250 people, summarized in the following table. Test their claim by using the 0.05 level of significance.

               Smoking
Sex         Yes     No     Total
Male         60     40      100
Female       80     70      150
Total       140    110      250

Solution:

Step 1:
H0: Sex and smoking are independent
H1: Sex and smoking are not independent

df = (2 - 1)(2 - 1) = 1, α = 0.05, CV = 3.841.

The expected counts are

$$E_{11} = \frac{100 \times 140}{250} = 56, \quad E_{12} = \frac{100 \times 110}{250} = 44, \quad E_{21} = \frac{150 \times 140}{250} = 84, \quad E_{22} = \frac{150 \times 110}{250} = 66$$

Expected       Smoking
Sex         Yes     No     Total
Male         56     44      100
Female       84     66      150
Total       140    110      250

Step 2: Calculate the test statistic as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(60-56)^2}{56} + \frac{(40-44)^2}{44} + \frac{(80-84)^2}{84} + \frac{(70-66)^2}{66} \approx 1.08$$

Step 3: Decision: Since 1.08 < 3.841, we fail to reject H0.

Conclusion: There is not enough evidence to say that sex and smoking are related.
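The smoking example can be reproduced from the four observed cell counts alone (60, 40, 80, 70), since the totals and expected counts follow from them. The sketch below is a plain-Python walk through the cells; the critical value 3.841 is a table lookup, not something the code computes.

```python
# Observed counts: rows are sex, columns are smoking (yes, no).
observed = [
    [60, 40],  # Male
    [80, 70],  # Female
]

row_totals = [sum(row) for row in observed]        # male, female
col_totals = [sum(col) for col in zip(*observed)]  # yes, no
grand_total = sum(row_totals)

chi2 = 0.0
expected = []
for i, row in enumerate(observed):
    expected_row = []
    for j, o in enumerate(row):
        # Expected count under independence for cell (i, j).
        e = row_totals[i] * col_totals[j] / grand_total
        expected_row.append(e)
        chi2 += (o - e) ** 2 / e
    expected.append(expected_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841  # table value for df = 1, alpha = 0.05

print("expected:", expected)
print(f"chi-square = {chi2:.4f} with df = {df}")
print("reject H0" if chi2 > critical_value else "fail to reject H0")
```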
Cautionary Note

It is important to keep in mind that the chi-square test only tests whether two variables are independent. It cannot address questions of which is greater or less. Using the chi-square test, we cannot directly evaluate the hypothesis that men smoke more than women; rather, the test (strictly speaking) can only test whether the two variables are independent or not.

Example 7: Ghamsary and others have done some research (Kashan, Vol 1, ) on income and level of education. They are interested to know whether people with more education have higher income. They collected a random sample of 250 people from a large population and found the following results. Test the claim that education and income are independent factors. Use α = 0.01.

Income \ Education    None    High School    4-year college    Graduate School
Less than 30K         ...        ...              ...               ...
30K to 50K            ...        ...              ...               ...
Above 50K             ...        ...              ...               ...

Solution:

Income \ Education     N      HS     College    Graduate    Total
Less than 30K         ...     ...      ...        ...         61
30K to 50K            ...     ...      ...        ...         90
Above 50K             ...     ...      ...        ...         99
Total                  40      60      100         50        250

Each expected count is (row total)(column total)/250:

Less than 30K:  61*40/250 = 9.76,   61*60/250 = 14.64,  61*100/250 = 24.40,  61*50/250 = 12.20
30K to 50K:     90*40/250 = 14.40,  90*60/250 = 21.60,  90*100/250 = 36.00,  90*50/250 = 18.00
Above 50K:      99*40/250 = 15.84,  99*60/250 = 23.76,  99*100/250 = 39.60,  99*50/250 = 19.80
Step 1:
H0: Income and education are independent
H1: Income and education are not independent

df = (3 - 1)(4 - 1) = 6; at α = 0.01, CV = 16.812.

Step 2: Calculate the test statistic by summing over all 12 cells of the table:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

MINITAB: Chi-Square Test
Expected counts are printed below observed counts

          C1     C2     C3     C4    Total
  ...
Total     ...

Chi-Sq = ..., DF = 6, P-Value = ...

Step 3: Decision: Reject H0, since the chi-square value exceeds the critical value.

Conclusion: This means that income and education are not independent.
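A p-value like the one MINITAB reports can be checked by hand when the degrees of freedom are even. In that case the upper-tail probability of the chi-square distribution has a closed form: with df = 2m, P(χ² > x) = e^(-x/2) Σ_{k=0}^{m-1} (x/2)^k / k!. The sketch below implements only that special case using the standard library. The function name is illustrative, and odd degrees of freedom are deliberately rejected rather than approximated.

```python
import math


def chi_square_p_value_even_df(chi2, df):
    """Upper-tail probability P(X > chi2) for a chi-square variable
    with an even number of degrees of freedom (closed form only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = chi2 / 2.0
    term = 1.0    # current term (half ** k) / k!, starting at k = 0
    total = 1.0   # running sum of the terms
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Sanity check against the table: the critical values used in this
# chapter should give back their significance levels.
p_df6 = chi_square_p_value_even_df(16.812, 6)  # alpha = 0.01, df = 6
p_df4 = chi_square_p_value_even_df(9.49, 4)    # alpha = 0.05, df = 4

print(f"P(chi-square > 16.812 | df = 6) = {p_df6:.4f}")
print(f"P(chi-square > 9.49   | df = 4) = {p_df4:.4f}")
```

Any statistic larger than 16.812 with 6 degrees of freedom therefore has a p-value below 0.01, which is the basis for rejecting H0 at that level.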
Example 8: A recent study of a random sample of 500 students is summarized in the following table by two factors: studying on time for tests, and school area. Is there an association between the type of school area and studying on time for tests?

                               School Area
Study on time for tests    Rural    Suburban    Urban    Total
Always                      ...       ...        ...      ...
Sometimes                   ...       ...        ...      ...
Never                       ...       ...        ...      ...
Total                       ...       ...        ...      ...
Example 9: A chocolate manufacturing company conducted a survey of 300 customers. The research question is: Is there a significant relationship between packaging preference (size of the package purchased) and economic status? There were four packaging sizes: small, medium, large, and jumbo. Economic status was lower, middle, or upper. The following data were collected. Test the claim that the size of the packaging and economic status are independent by using the 0.10 level of significance.

                 Economic Status
Size        Lower    Middle    Upper    Total
Small        ...      ...       ...      ...
Medium       ...      ...       ...      ...
Large        ...      ...       ...      ...
Jumbo        ...      ...       ...      ...
Total        ...      ...       ...      ...
Example 10: A random sample of 1500 persons is questioned regarding their political affiliation and opinion on the war in Iraq. Test whether political affiliation and opinion on the war in Iraq are dependent, using the 5% level of significance. The observed data are given in the following table.

                           War in Iraq
Party Affiliation    Favor    Indifferent    Opposed    Total
Democrat              ...        ...           ...       ...
Republican            ...        ...           ...       ...
Independent           ...        ...           ...       ...
Total                 ...        ...           ...       ...