CATEGORICAL DATA Chi-Square Tests for Univariate Data

Recall that a categorical variable is one whose possible values are categories or groupings. We've seen one such variable: the binary variable, which has only two possible outcomes, success or failure. In this topic we explore testing hypotheses about categorical variables with MORE than two outcomes.

EXAMPLE: Consider an experiment in which two different tomato phenotypes are crossed and the resulting offspring observed. The parent types are tall cut-leaf tomatoes and dwarf potato-leaf tomatoes.

Variable: Offspring Phenotype
Possible Values: 1) tall cut-leaf, 2) tall potato-leaf, 3) dwarf cut-leaf, and 4) dwarf potato-leaf.

If Mendel's laws of inheritance hold, the population proportions in the offspring would be 1) 9/16, 2) 3/16, 3) 3/16, and 4) 1/16. One might hypothesize that Mendel's laws don't hold for these genes. In an experiment to test that, the researcher observed the proportions 1) .575, 2) .179, 3) .182, and 4) .065 based on a sample of 1611 offspring.

EXAMPLE: Consider an observational study of the types of insects that feed on the nectar from a certain flower. The scientist randomly selects hours during the day over several days during the summer season and selects several different plants. She counts the number of insects of each kind that feed at the plant during the study.

Variable: Insect Family
Possible Values: 1) bees, 2) wasps, or 3) flies

One might hypothesize that this flower attracts the different insect families in unequal proportions.

Important Point: Testing procedures for hypotheses of this form are called Goodness-of-Fit tests. These tests compare the sample proportions to the hypothesized proportions to see how good the fit is.

Important Point: The categories must be mutually exclusive and exhaustive.

Notation: $k$ = number of possible categories that the variable of interest can have.

Category    True Population    Sample           Hypothesized Population
1           $\pi_1$            $\hat\pi_1$      $\pi_{10}$
2           $\pi_2$            $\hat\pi_2$      $\pi_{20}$
...         ...                ...              ...
k           $\pi_k$            $\hat\pi_k$      $\pi_{k0}$

Exhaustive means that $\sum_i \pi_i = 1$, $\sum_i \hat\pi_i = 1$, and $\sum_i \pi_{i0} = 1$.

EXAMPLE: tomatoes and Mendel's Laws. $k = 4$

Category             True Population    Sample                  Hypothesized Population
tall cut-leaf        $\pi_1$            $\hat\pi_1 = .575$      $\pi_{10} = 9/16$
tall potato-leaf     $\pi_2$            $\hat\pi_2 = .179$      $\pi_{20} = 3/16$
dwarf cut-leaf       $\pi_3$            $\hat\pi_3 = .182$      $\pi_{30} = 3/16$
dwarf potato-leaf    $\pi_4$            $\hat\pi_4 = .065$      $\pi_{40} = 1/16$

Now, for a sample of size $n$ and a set of hypothesized proportions under the null hypothesis, we can calculate how many sample units should fall in each category (if there were no sampling variability, of course). These numbers are called the EXPECTED CELL COUNTS under the null hypothesis and are calculated as $n \times$ the hypothesized value ($\pi_{i0}$) for that category (cell).

The OBSERVED CELL COUNTS are the actual counts seen in each category during the experiment.

Category    Expected Count    Observed Count
1           $n\pi_{10}$       $n\hat\pi_1$
2           $n\pi_{20}$       $n\hat\pi_2$
...         ...               ...
k           $n\pi_{k0}$       $n\hat\pi_k$

Important Point: This test procedure is valid only if the sample size and hypothesized proportions are such that virtually every cell has an expected count of 5 or more. If they aren't, you must use a different test procedure.

EXAMPLE: Tomatoes & Mendel's Laws. $n = 1611$

Category             Expected Count                       Observed Count
tall cut-leaf        $n\pi_{10} = 1611(9/16) = 906.2$     $n\hat\pi_1 = 926$
tall potato-leaf     $n\pi_{20} = 1611(3/16) = 302.1$     $n\hat\pi_2 = 288$
dwarf cut-leaf       $n\pi_{30} = 1611(3/16) = 302.1$     $n\hat\pi_3 = 293$
dwarf potato-leaf    $n\pi_{40} = 1611(1/16) = 100.7$     $n\hat\pi_4 = 104$

Hypotheses:
$H_0$: $\pi_1 = 9/16$, $\pi_2 = 3/16$, $\pi_3 = 3/16$, and $\pi_4 = 1/16$
$H_A$: not $H_0$ ($H_0$ is not true)

Important Point: Note how uninformative the alternative hypothesis is in a goodness-of-fit test. These tests compare the sample data against a specific set of hypothesized proportions. If the null hypothesis is rejected, one cannot tell what the true proportions are, only that they are not the ones listed in the null hypothesis.

Significance Level: let's choose $\alpha = .04$.
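
The expected counts in the tomato table, and the "expected count of 5 or more" condition, can be checked with a few lines of Python. This is only a sketch: the category labels and variable names are chosen here for illustration, and the hypothesized proportions are the Mendelian 9:3:3:1 ratios from the example.

```python
from fractions import Fraction

# Sample size and hypothesized (null) proportions from the tomato example.
# The dictionary keys are illustrative labels, not part of any library.
n = 1611
hypothesized = {
    "tall cut-leaf": Fraction(9, 16),
    "tall potato-leaf": Fraction(3, 16),
    "dwarf cut-leaf": Fraction(3, 16),
    "dwarf potato-leaf": Fraction(1, 16),
}

# Exhaustive categories: the hypothesized proportions must sum to exactly 1.
assert sum(hypothesized.values()) == 1

# Expected cell count = n * hypothesized proportion for that cell.
expected = {cat: float(n * p) for cat, p in hypothesized.items()}

# Validity condition from the notes: every expected count is 5 or more.
rule_ok = all(count >= 5 for count in expected.values())

for cat, count in expected.items():
    print(f"{cat:18s} expected = {count:.1f}")
print("all expected counts >= 5:", rule_ok)
```

Using exact fractions for the proportions avoids any rounding in the "sums to 1" check; the expected counts are converted to floats only for display.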

Test Statistic: a summary of the comparison of the observed and expected cell counts. The actual form is

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

This is called the CHI-SQUARE or GOODNESS-OF-FIT STATISTIC.

Important Point: the closer the expected and observed counts are to each other, the smaller the value of $X^2$. Small values of $X^2$ support the null hypothesis and large values support $H_A$.

EXAMPLE: tomatoes and Mendel's Laws.

Category             Expected Count         Observed Count        $(n\hat\pi_i - n\pi_{i0})^2 / n\pi_{i0}$
tall cut-leaf        $n\pi_{10} = 906.2$    $n\hat\pi_1 = 926$    .433
tall potato-leaf     $n\pi_{20} = 302.1$    $n\hat\pi_2 = 288$    .658
dwarf cut-leaf       $n\pi_{30} = 302.1$    $n\hat\pi_3 = 293$    .274
dwarf potato-leaf    $n\pi_{40} = 100.7$    $n\hat\pi_4 = 104$    .108

So, $X^2 = .433 + .658 + .274 + .108 = 1.473$.
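
As a rough check on the arithmetic, the statistic can be recomputed directly from the counts. The sketch below assumes the observed counts 926, 288, 293, and 104 (they total the stated sample size of 1611); the variable names are illustrative. Because it keeps the unrounded expected counts, it gives about 1.469 rather than the 1.473 obtained from expected counts rounded to one decimal place.

```python
# Observed counts for the four phenotypes (assumed values; they sum to 1611).
observed = [926, 288, 293, 104]

# Hypothesized proportions under Mendel's laws (9:3:3:1).
hypothesized = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

n = sum(observed)
expected = [n * p for p in hypothesized]

# Each cell contributes (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

for o, e, c in zip(observed, expected, contributions):
    print(f"observed = {o:4d}  expected = {e:8.2f}  contribution = {c:.3f}")
print(f"chi-square statistic = {chi_square:.3f}")
```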

P-value: under the null hypothesis, the test statistic $X^2$ has a sampling distribution known as the CHI-SQUARE DISTRIBUTION.

Like the t-distribution, the shape of the Chi-Square Distribution depends on the degrees of freedom. Here, df = $k - 1$.

Important Point: the degrees of freedom for the Chi-Square Goodness-of-Fit test are the number of categories ($k$) minus 1, NOT the sample size minus 1.

The P-value is the area under the Chi-Square distribution to the right of the test statistic value.

To find the P-value, first calculate $X^2$ and the df. Then go to Table 8 (page 686 of the text).

Find the row labeled with the df you have for your test. Go across the values in the row until you find the two values that bracket your $X^2$ value. Read the P-value bounds from the tops of the columns containing the two bracketing values.

EXAMPLE: tomatoes, df = 4 - 1 = 3 and $X^2 = 1.473$. So, on page 686, go to the row labeled df = 3 and find the closest values to 1.47. It's bracketed by the values .5844 to the left and 6.251 to the right. The column headers for these two values are .90 (left) and .10 (right). This says that the P-value falls between .10 and .90.

Conclusion: since the P-value > .10 >> $\alpha$ = .04, do not reject $H_0$. There is insufficient evidence to suggest that something other than Mendel's laws of inheritance is at work for the two tomato phenotypes that were crossed.
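
The table only brackets the P-value; software can give the area to the right directly. The following is a minimal sketch using only Python's standard library. The function name `chi_square_upper_tail` is made up for this illustration, and it relies on the standard closed-form expressions for the chi-square upper-tail area with whole-number degrees of freedom, not on anything in Table 8.

```python
import math

def chi_square_upper_tail(x, df):
    """Area to the right of x under a chi-square curve with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus terms exp(-x/2) * (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (j + 0.5)
    return total

# Tomato example: X^2 = 1.473 with df = 4 - 1 = 3.
p_value = chi_square_upper_tail(1.473, 3)
print(f"P-value = {p_value:.3f}")
```

For the tomato example this gives a P-value of roughly 0.69, which sits inside the bracket of .10 to .90 read from the table and leads to the same conclusion.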

GOODNESS-OF-FIT TEST PROCEDURE FOR UNIVARIATE CATEGORICAL DATA

Null Hypothesis: $H_0$: $\pi_1 = \pi_{10}$, $\pi_2 = \pi_{20}$, ..., $\pi_k = \pi_{k0}$, where $\pi_{i0}$ is the hypothesized population proportion of the $i$th category and $\sum_i \pi_{i0} = 1$.

Alternative Hypothesis: $H_A$: $H_0$ is not true.

Test Statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

where the expected count in the $i$th category is $n\pi_{i0}$.

P-value: area to the right of the observed $X^2$ value under the Chi-Square distribution with $k - 1$ degrees of freedom. Use Table 8 to get an approximate value for the P-value.

Assumptions:
1) The sample was random.
2) The sample size is sufficiently large and the hypothesized cell proportions are such that the expected cell counts are all 5 or more.

EXAMPLE: It is hypothesized that when homing pigeons are disoriented in a particular manner, they exhibit no preference in direction of flight after takeoff. To test this, 120 pigeons were disoriented, then let loose and their direction observed (as wedges, each representing an eighth of 360°). The results are given below. Use a significance level of .10 to test the hypothesis that every direction of takeoff is equally likely.

Hypotheses:
$H_0$: $\pi_1 = \pi_2 = \pi_3 = \dots = \pi_8 = 1/8$
$H_A$: not $H_0$

Direction (degrees)    Observed Frequency    Expected Frequency    $(n\hat\pi_i - n\pi_{i0})^2 / n\pi_{i0}$
0-45                   18                    120(.125) = 15        .6
46-90                  20                    15                    1.667
91-135                 20                    15                    1.667
136-180                15                    15                    0
181-225                13                    15                    .267
226-270                8                     15                    3.267
271-315                7                     15                    4.267
316-360                9                     15                    2.4

Test Statistic:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = (.6 + 1.667 + \dots + 2.4) = 14.135$$

P-value: df = $k - 1$ = 8 - 1 = 7. From the table, .025 < P-value < .05.

Conclusion: since the P-value < $\alpha$, reject $H_0$ and conclude that there is evidence that the pigeons show directional preferences when disoriented before being allowed to fly.

NOTE: This is an example of a test in which the null hypothesis is the claim, which goes against what we have learned this semester. Most goodness-of-fit tests are that way; that is, the null hypothesis is the distribution of interest. It is assumed to be true unless the data indicate otherwise. But we cannot show that the null hypothesis is true, only that there is insufficient evidence against it.
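
The pigeon test can be sketched end to end in the same way, this time mimicking the table lookup instead of computing an exact area. Several things here are assumptions: the observed counts are read from the table, with the two counts of 20 inferred from the printed contributions of 1.667; these counts total 110 rather than 120, so the sketch takes the expected count of 15 per wedge from the text instead of deriving it from the counts; the significance level is taken to be .10; and the critical values are the standard upper-tail chi-square values for 7 degrees of freedom, not copied from Table 8. The helper name `bracket_p_value` is hypothetical.

```python
# Observed counts per 45-degree wedge (assumed, as read from the table).
observed = [18, 20, 20, 15, 13, 8, 7, 9]

# Expected count under H0: 120 pigeons * 1/8 per wedge, as stated in the text.
expected_per_wedge = 120 * (1 / 8)

chi_square = sum((o - expected_per_wedge) ** 2 / expected_per_wedge for o in observed)
df = len(observed) - 1  # k - 1 = 7

# Standard upper-tail critical values for chi-square with 7 df:
# (area to the right, critical value).
critical_values_df7 = [(0.10, 12.017), (0.05, 14.067), (0.025, 16.013), (0.01, 18.475)]

def bracket_p_value(statistic, table_row):
    """Return (lower, upper) bounds on the P-value from one row of the table."""
    lower, upper = 0.0, 1.0
    for tail_area, critical in table_row:
        if statistic >= critical:
            upper = min(upper, tail_area)   # statistic is beyond this column
        else:
            lower = max(lower, tail_area)   # statistic falls short of this column
    return lower, upper

p_low, p_high = bracket_p_value(chi_square, critical_values_df7)
alpha = 0.10  # assumed significance level
reject = p_high <= alpha

print(f"chi-square = {chi_square:.3f}, df = {df}")
print(f"{p_low} < P-value < {p_high}; reject H0 at alpha = {alpha}: {reject}")
```

The statistic comes out at about 14.133 rather than 14.135 because the contributions are not rounded before summing; the bracket and the decision to reject are unchanged.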