Chapter 7. Categorical Data Analysis

In Chapter 5 we studied how to test hypotheses involving a single population, such as H0: µ = 5 vs. Ha: µ > 5. In Chapter 6 we studied how to test hypotheses involving two or more populations, such as H0: µ1 = µ2 vs. Ha: µ1 > µ2. In both of these chapters we were dealing with quantitative variables, such as the height of a person or the length of an alligator. In this chapter we will learn how to test hypotheses that involve qualitative, or categorical, variables.

Recall some examples of categorical variables: Gender, Religion, Race, Color and so on. They are considered categorical because their values are not numeric. Even if we represent the values of a categorical variable with numbers, it is still a categorical variable. For example, we can represent colors using numbers such as 1 for white, 2 for red, 3 for blue and so on, but that does not make the variable quantitative, because you couldn't add 2 and 3 (red and blue) and hope that 5 represents purple. The number 5 probably already represents some other color. The numbers in this color example have no inherent numerical properties; they are simply labels.

When dealing with categorical variables, the closest thing to something numerical is the frequency data. For example, let's say I observe the color of the cars passing by my window (assuming I can see a road from my window with cars passing by). Suppose I collect the following data on the first twelve cars that I see: White, Red, Red, Black, Blue, Green, Yellow, Red, White, Black, Blue and White. I can translate this data into a frequency table like this:

Color of the vehicle    Frequency
White                   3
Black                   2
Blue                    2
Yellow                  1
Red                     3
Green                   1

As you can see in the above example, it is relatively easy to obtain such frequency tables for data involving categorical variables. Using frequency data, we can test a variety of new types of hypotheses that we have not seen in previous chapters.
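The tally above can also be done programmatically. The following is only a small sketch in Python (the chapter itself works in Excel); the variable names `cars` and `frequency` are made up for this illustration.

```python
from collections import Counter

# The twelve car colors observed in the example above, in the order seen.
cars = ["White", "Red", "Red", "Black", "Blue", "Green",
        "Yellow", "Red", "White", "Black", "Blue", "White"]

# Counter maps each distinct color to the number of times it appears.
frequency = Counter(cars)

# Print a simple two-column frequency table, most frequent colors first.
for color, count in frequency.most_common():
    print(f"{color:<10}{count}")
```

The counts agree with the hand-built frequency table, and they add up to the twelve cars observed.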
For example, a favorite family game on a long drive on an interstate highway is for each contestant to pick a color, say white or red, and see whose color shows up on the most cars before you reach your destination. The person who picked the color with the most cars wins this extremely delightful and colorful game. Suppose on one long journey while playing this game my family collected the following data:

Color of the vehicle    Frequency
White                   44
Red                     36
All the Rest            120

Data such as in the above table can be used to test hypotheses about proportions. For example, say I have a hypothesis that 25% of all cars produced are White, another 25% are Red and the remaining 50% are all other colors combined. In symbols, this hypothesis can be written as:

H0: p_white = 0.25, p_red = 0.25, p_other = 0.50
Ha: at least one of the proportions is different from what is specified in the null hypothesis

Hypotheses such as this cannot be tested using any of the methods that we have studied so far. For example, we couldn't use the z test, the t test or the F test to test such a hypothesis. Testing this type of hypothesis requires a new type of test called the chi-square test, where "chi" is pronounced as in the words "kind" or "kite" and not as in "chime".

So the bad news is that we will have to learn a new type of Excel function (or, in the olden days, a new statistical table), but the good news is that the whole hypothesis testing procedure remains the same. We will still have a required or desired significance level (alpha), a test statistic, a critical value, a rejection region, a p-value, a decision and a conclusion. The rules of rejection remain the same.

To obtain the test statistic value, we use a formula which I will tell you shortly. But before I give you the formula, I must tell you that we will need another column of values. To the above table of data, we will add a column for the frequency we would expect if the null hypothesis were true.

Color of the Car    Observed Frequency (O)    Expected Frequency if H0 were true (E)
White               44                        50
Red                 36                        50
All the Rest        120                       100
Total               200                       200

Note that in this table we have changed the label of the second column to Observed Frequency. Please verify that the new column has values that represent the hypothesized proportions. For example, since the hypothesized proportion of White cars was 25%, the expected frequency is 50, which is 25% of 200.

Now let me give you the formula for the test statistic value for this type of hypothesis:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where O is the Observed Frequency, E is the Expected Frequency, and the sum runs over all the categories.

How to calculate the Chi-Square test statistic?
We will use the above example to illustrate how to compute the test statistic:

Color of the Car    O      E      O - E    (O - E)^2    (O - E)^2 / E
White               44     50     -6       36           0.72
Red                 36     50     -14      196          3.92
All the Rest        120    100    20       400          4.00
Total               200    200    0                     8.64

The chi-square test statistic is 8.64.

How to obtain the critical value?

If we had a chi-square table, we could obtain the critical value from the table, but since we are getting so good at using Excel, we will obtain it using the Excel function =CHIINV(). This function takes two parameters: probability and degrees of freedom. The probability is basically your alpha value (which is typically 0.05), and the degrees of freedom is the number of groups (in our example, three) minus one. So for our example, we obtain the critical value using the formula =CHIINV(0.05,2), which gives us 5.991465.

So what is the rejection region? Any chi-square value greater than 5.991465 falls in the rejection region.

How to get the p-value?

We get the p-value using the =CHIDIST() function. In our example, it is =CHIDIST(8.64,2) = 0.0133.

Decision Time: Since the chi-square test statistic value of 8.64 is greater than 5.99, it is in the rejection region, so we reject the null. The same decision is reached using the p-value of 0.0133, which is less than the alpha value of 0.05.

Conclusion: There is sufficient evidence, at significance level 0.05, that the proportions of white, red and other cars are not all equal to 0.25, 0.25 and 0.50 respectively.

What if alpha was 0.01?

If alpha was 0.01, the critical value would be =CHIINV(0.01,2) = 9.21034, so the rejection region would be χ² > 9.21034. The p-value would still be the same, 0.0133. Using the critical value approach, we fail to reject the null hypothesis since 8.64 is not greater than 9.21034. Using the p-value approach, we also fail to reject the null because 0.0133 is greater than 0.01. Note that the two approaches should always lead to the same decision.

The =CHITEST() function.

Excel provides a function called =CHITEST(). Once you have generated the column for the expected frequency, you can use the =CHITEST() function to get the p-value for the test directly. You do not have to compute the test statistic value first, which would require generating the columns needed for (O - E)^2 / E. For the above example, the following Excel screenshots illustrate the use of the =CHITEST() function:

[Excel screenshots of the =CHITEST() example]
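For readers who want to check the one-variable example outside Excel, here is a hedged sketch in Python using only the standard library. The helper names `chi2_sf` and `chi2_critical` are invented for this illustration and are not Excel or library functions; they are meant to play the roles of =CHIDIST() and =CHIINV(), and the closed-form tail probability they rely on is only valid for whole-number degrees of freedom.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X > stat) for a chi-square variable with
    whole-number degrees of freedom (the role played by =CHIDIST)."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)                # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))     # exact tail for df = 1
    # Step up two degrees of freedom at a time until df is reached.
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi2_critical(alpha, df):
    """Value c with P(X > c) = alpha, found by bisection
    (the role played by =CHIINV)."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Observed counts and hypothesized proportions from the car-color example.
observed = {"White": 44, "Red": 36, "All the Rest": 120}
hypothesized = {"White": 0.25, "Red": 0.25, "All the Rest": 0.50}

n = sum(observed.values())
expected = {color: n * p for color, p in hypothesized.items()}

# Test statistic: sum of (O - E)^2 / E over all categories.
stat = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1
p_value = chi2_sf(stat, df)

print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.4f}")
for alpha in (0.05, 0.01):
    critical = chi2_critical(alpha, df)
    decision = "reject H0" if stat > critical else "fail to reject H0"
    print(f"alpha = {alpha}: critical value = {critical:.5f}, {decision}")
```

Bisection is used for the critical value simply to avoid depending on a statistics package; with a package available, its built-in chi-square functions would be the more natural choice.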

Please note that the CHITEST function needs two ranges: the actual frequency range, which is the same as the observed frequency range, and the expected frequency range. Please also note that the value thus obtained (0.0133) is the same value that we obtained earlier using the =CHIDIST(8.64,2) function. Please run the above example in Excel yourself to get a better feel for how to use the CHITEST() function.

Two Categorical Variables

So far in this chapter, we have looked at hypotheses regarding the proportions of certain values of a categorical random variable. In the example that we discussed, the random variable was the color of a vehicle and the hypothesis was about the proportions of vehicles with certain colors. In such hypotheses, we are looking at frequency data on one categorical variable (color of vehicle in our example).

What if we have frequency data on two categorical variables? For example, let us look at the following data, which shows the number of wins at home and away for a certain university in various sports over the past five years:

Sport         Wins at Home    Wins Away
Football      23              17
Basketball    39              21
Baseball      29              31
Soccer        19              21

Figure 1: Data for Sports vs. Home Field Advantage

In the above data, there are two categorical variables: Wins (at home or away) and Sport. When we have data like this on two categorical variables, we can ask whether there is a relationship between the two variables or whether they are independent of each other. For example, we can ask whether home field advantage depends on the sport or not. This type of test is called the test of independence.

Null Hypothesis: H0: Home Field Advantage and Sport are independent of each other
Alternate Hypothesis: Ha: Home Field Advantage depends on the Sport

The chi-square test can be used to test for independence between two variables.

Test Statistic: The formula for the test statistic is the same for two variables as for one variable. It is

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where the sum now runs over all the cells of the table. Just as in the case of one variable, we will have to create expected frequencies (E). Generating expected frequencies for two categorical variables involves some extra work, which I will explain next.

How to obtain Expected Frequencies?

a. For each row, find the row sum.
b. For each column, find the column sum.
c. Find the grand sum, i.e. the sum of all the row sums (or the sum of all the column sums).
d. For the cell in the i-th row and j-th column, the expected frequency is (row sum of the i-th row) * (column sum of the j-th column) divided by the grand sum.

The row sums, column sums and the grand sum are shown in the table in Figure 2.

Sport          Wins at Home    Wins Away    Row Sums
Football       23              17           40
Basketball     39              21           60
Baseball       29              31           60
Soccer         19              21           40
Column Sums    110             90           200

Figure 2: Row Sums, Column Sums and Grand Sum

Please verify the row sums, the column sums and the grand sum in Figure 2. The expected frequencies, computed using the formula explained in step d above, are given in the table in Figure 3.

Sport          Wins at Home    Wins Away    Row Sums
Football       22              18           40
Basketball     33              27           60
Baseball       33              27           60
Soccer         22              18           40
Column Sums    110             90           200

Figure 3: Expected Frequencies (E)

I will explain a couple of the frequencies in Figure 3. You should verify all the rest. The expected frequency for the cell Football and Wins at Home is computed as 40*110/200 = 22. The expected frequency for the cell Baseball and Wins Away is computed as 60*90/200 = 27.

So now we have the observed frequencies and the expected frequencies in Figures 1 and 3 respectively. Next we calculate the chi-square value. For each cell we need to compute (O - E)^2 / E. The next table shows the values of (O - E)^2 / E for each cell.
Sport         Wins at Home    Wins Away
Football      0.05            0.06
Basketball    1.09            1.33
Baseball      0.48            0.59
Soccer        0.41            0.50

Figure 4: (O - E)^2 / E for each cell

The sum of all these values gives the chi-square test statistic value, 4.51.

So what should we compare this test value with? In other words, what is the critical value?

The critical value can be determined from the Excel function =CHIINV(alpha, degrees of freedom). Suppose our alpha is 0.05. For a test of independence, the degrees of freedom is given by (r - 1)*(c - 1), where r is the number of rows (4 in our example) and c is the number of columns (2 in our example). So the degrees of freedom is (4 - 1)*(2 - 1) = 3*1 = 3.

Critical Value: =CHIINV(0.05,3) = 7.81473
Rejection region: χ² > 7.81473
p-value: given by the Excel function =CHIDIST(4.51,3) = 0.2114

Decision using the critical value approach: We fail to reject the null hypothesis because 4.51 is less than 7.81473.

Decision using the p-value approach: We fail to reject the null because the p-value of 0.2114 is higher than 0.05.

Conclusion: We did not find sufficient evidence, at a significance level of 0.05, that home field advantage depends on the sport.

Can we get the p-value directly using Excel?

Once we compute the expected frequencies (Figure 3), we can compute the p-value without having to calculate the numbers in Figure 4. So we can bypass the calculations of (O - E)^2 / E. How? Using the function =CHITEST(). In this function, we specify two ranges: the range of observed frequencies (Figure 1) and the range of expected frequencies (Figure 3). Suppose the range of data values from Figure 1 is C5:D8 and the range of expected frequencies from Figure 3 is C13:D16 (see Figure 5). Then =CHITEST(C5:D8,C13:D16) will give a p-value of about 0.211, which agrees with the value we got using =CHIDIST(4.51,3); any small difference in the later digits comes from rounding the test statistic to 4.51. Figure 6 shows the formulas used in Figure 5. Please try to recreate this example on your computer to get a better sense of how this chi-square test was performed.

Figure 5: Excel Calculations of expected frequencies
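The whole test of independence (steps a to d for the expected frequencies, then the test statistic, degrees of freedom and p-value) can also be sketched in Python. This is only an illustration under stated assumptions: the chapter itself uses Excel, and the helper `chi2_sf` is a made-up name for this sketch, standing in for =CHIDIST() and valid only for whole-number degrees of freedom.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability P(X > stat) for a chi-square variable with
    whole-number degrees of freedom (the role played by =CHIDIST)."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)                # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))     # exact tail for df = 1
    # Step up two degrees of freedom at a time until df is reached.
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Observed wins from Figure 1: one row per sport, columns are (home, away).
sports = ["Football", "Basketball", "Baseball", "Soccer"]
observed = [[23, 17], [39, 21], [29, 31], [19, 21]]

# Steps a to c: row sums, column sums and the grand sum.
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
grand_sum = sum(row_sums)

# Step d: expected frequency = row sum * column sum / grand sum.
expected = [[r * c / grand_sum for c in col_sums] for r in row_sums]

# Test statistic: sum of (O - E)^2 / E over every cell.
stat = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(col_sums) - 1)
p_value = chi2_sf(stat, df)

for sport, e_row in zip(sports, expected):
    print(f"{sport:<12}{e_row[0]:>6.1f}{e_row[1]:>6.1f}")
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.3f}")
```

Because the sketch keeps the unrounded test statistic, its p-value corresponds to what =CHITEST() reports rather than to =CHIDIST() applied to the rounded 4.51.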

Figure 6: Excel Formulas for the numbers in Figure 5.

Summary of the Chapter

When dealing with categorical variables, certain types of hypotheses can be made. One type of hypothesis involves a single categorical variable; the hypothesis is about the proportions in which the observations fall into the different categories. Another type of hypothesis involves two categorical variables; the hypothesis is about whether the two variables are independent of each other or not. Tests of hypotheses involving categorical variables use a chi-square test. The test statistic for a chi-square test is a measure of how far the observed frequencies are from the frequencies we would expect if the null hypothesis were true. The higher the value of the test statistic, the stronger the evidence in favor of the alternate hypothesis.