Bivariate Statistics Session 2: Measuring Associations

Chi-Square Test

Features Of The Chi-Square Statistic

The chi-square test is non-parametric: it makes no assumptions about the distribution of the variables. For this reason, it is typically used with data measured at the nominal or ordinal level. Pearson's chi-square (χ²) is the most popular of the non-parametric statistics.

The chi-square (χ²) test is used to assess the relationship between 2 nominal or ordinal variables. It is a very general statistical test that can be used whenever we wish to evaluate whether empirically obtained frequencies differ significantly from those that would be expected on the basis of chance or theoretical expectations. In other words, it is used when the researcher wishes to explore how the categories of the row variable are distributed across the categories of the column variable.

A statistically significant chi-square test indicates that the rows and columns of the contingency table are dependent, that is, that the differences between the cell frequencies (cells: the fields in the table) are substantial enough not to be attributed to chance or randomness. A non-significant chi-square test implies that differences in cell frequencies may be random.

The basic idea of the chi-square statistic is to compare the observed distribution of frequencies with the expected distribution of frequencies. The test thus shows whether the observed association between the variables could be due to chance. It relies on the basic assumption that there is no association between the variables in the contingency table (remember the null hypothesis: no association between the 2 variables).

Assumptions of the Chi-Square Test

Required level of measurement:
- The chi-square statistic requires 2 nominal (or ordinal) variables.
Postulates of the chi-square test:
1) Random sample
2) Mutually exclusive categories
3) All expected frequencies must be > 1
4) No more than 20% of cells in the contingency table should have an expected frequency < 5.

If these conditions are not satisfied, the chi-square test may be biased.

Contingency Tables

The basis of any chi-square test is always a table with frequency counts in the cells. Depending on the number of rows and columns, the table is usually referred to as an N (number of rows) x M (number of columns) table. The simplest version is a 2x2 contingency table.
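Postulates 3 and 4 can be checked mechanically once the expected frequencies are known. A minimal sketch in Python (the handout's own tooling is R; this helper, `chi_square_assumptions_ok`, is our own illustration, not a library function):

```python
def chi_square_assumptions_ok(expected):
    """Check postulates 3 and 4: every expected frequency > 1,
    and no more than 20% of cells with an expected frequency < 5."""
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)
    share_below_five = sum(e < 5 for e in cells) / len(cells)
    return all_above_one and share_below_five <= 0.20

# Expected frequencies from the 2x2 worked example in this handout:
print(chi_square_assumptions_ok([[430.32, 223.68], [654.68, 340.32]]))  # True
```

If the check fails, an alternative such as Fisher's exact test is more appropriate.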
Contingency table: the frequencies of 2 variables presented in one table. All categories of the first variable appear in the rows, and all categories of the second variable in the columns. You also obtain a joint frequency for each cell, and totals (for both rows and columns). Cells: the fields in the table.

Example 1: 2x2 Table

In the fear of crime survey we found that women are more likely than men to say that they go to certain areas only if accompanied by others.

"In certain areas, I only go in the company of others"

          Yes     No   Total
Male      290    364     654
Female    795    200     995
Total    1085    564    1649

Example 2: 2x3 Table

Suppose a survey asked whether people are in favour of introducing the Euro, and found the following answers by political preference. We might want to know: does preference for the Euro vary by political preference? How strong is the relationship?

Pro Euro             Political Preference
introduction    Labour   Tory   Liberal   Row total
yes                120     40        30         190
no                  60     60        20         140
Column total       180    100        50         330

Calculating A Chi-Square Test

How can we know that the differences above are systematic, and not due to chance? We compare our observed values to the values we would expect to see by chance alone if the null hypothesis were true.

Observed frequencies: the distribution of the variables in the sample.
Expected frequencies: the theoretical frequencies that would be obtained if there were no association between the variables (that is, if the null hypothesis were true).

Expected frequencies are computed as follows:

Expected cell frequency = (Row total * Column total) / Total number of observations
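The expected-frequency formula can be applied to every cell of the table at once. A short Python sketch of the arithmetic, using the 2x2 fear-of-crime table above:

```python
# Expected cell frequencies: E = (row total * column total) / grand total
observed = [[290, 364],   # Male:   yes, no
            [795, 200]]   # Female: yes, no

row_totals = [sum(row) for row in observed]        # [654, 995]
col_totals = [sum(col) for col in zip(*observed)]  # [1085, 564]
n = sum(row_totals)                                # 1649

expected = [[r * c / n for c in col_totals] for r in row_totals]
print([[round(e, 2) for e in row] for row in expected])
# [[430.32, 223.68], [654.68, 340.32]]
```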
The general formula for computing the chi-square statistic is:

χ² = Σ [ (Observed frequency − Expected frequency)² / Expected frequency ]

Degrees of Freedom (df)

Statisticians use the term degrees of freedom to describe the number of values in the final calculation of a statistic that are free to vary. Degrees of freedom are computed by multiplying the number of rows minus one by the number of columns minus one:

df = (# of rows − 1) * (# of columns − 1)

Basic steps underlying the computation of a chi-square statistic:
Step 1: Observe the distribution of frequencies in the cells of a table (the observed frequencies) and compute the sum of each row and column.
Step 2: Compute the frequencies one would expect in each cell by chance (the expected frequencies: row total * column total / total number of observations).
Step 3: Compare the observed to the expected frequencies (observed frequency − expected frequency).
Step 4: Compute the chi-square value (see formula above).

A Step-By-Step Guide To Computing A 2x2 Chi-Square Statistic

Starting point: observed distribution

Gender      Yes     No    Sum
Males       290    364    654
Females     795    200    995
Sum        1085    564   1649

Step 1: Compute Expected Cell Frequencies
Formula (for each cell): row total * column total / n

Gender       Yes      No    Sum
Males     430.32  223.68    654
Females   654.68  340.32    995
Sum         1085     564   1649

Computing expected cell frequencies:
The expected frequency in cell 1,1 (males, yes) is (654 * 1085) / 1649 = 430.32
The expected frequency in cell 2,1 (females, yes) is (995 * 1085) / 1649 = 654.68
The expected frequency in cell 1,2 (males, no) is (654 * 564) / 1649 = 223.68
The expected frequency in cell 2,2 (females, no) is (995 * 564) / 1649 = 340.32

Step 2: Compute the Difference Observed − Expected

Gender        Yes       No
Males     -140.32   140.32
Females    140.32  -140.32

Males, yes: 290 − 430.32 = −140.32
Females, yes: 795 − 654.68 = 140.32
Males, no: 364 − 223.68 = 140.32
Females, no: 200 − 340.32 = −140.32

Step 3: Compute (Difference Squared) / Expected

Gender      Yes     No
Males     45.76  88.02
Females   30.07  57.85

Males, yes: (−140.32)² / 430.32 = 45.76
Females, yes: (140.32)² / 654.68 = 30.07
Males, no: (140.32)² / 223.68 = 88.02
Females, no: (−140.32)² / 340.32 = 57.85

Step 4: Sum All Cell Chi-Squares

45.76 + 30.07 + 88.02 + 57.85 = 221.7
χ² = 221.7

Testing for Significance

Check in a standard table (the χ² distribution table) whether, for a given value of χ² and a given number of degrees of freedom, the association between the variables is statistically significant, i.e. whether the differences between observed and expected frequencies are substantial enough not to have been caused by chance. The standard level of significance (alpha) used is .05.

In the example above, with one degree of freedom ((Rows − 1) * (Columns − 1)), the critical value of χ² for α = .05 is 3.84. Since our χ² value (221.7) exceeds the critical value at α = .05, we reject the null hypothesis and conclude that there is a significant association between these two variables. However, if we look at the critical values at α = .01 and α = .001, we see that our χ² = 221.7 exceeds these values (6.63 and 10.83, respectively) too. This indicates that the association between these two variables is highly significant, and we report it as p < .001.
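The four steps above collapse into a short loop over the cells. A Python sketch reproducing the worked example with pure arithmetic (no statistics library):

```python
observed = [[290, 364],   # Males:   yes, no
            [795, 200]]   # Females: yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n  # Step 1: expected frequency
        chi2 += (obs - exp) ** 2 / exp           # Steps 2-4: sum of (O - E)^2 / E

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1) * (columns - 1)
print(round(chi2, 1), df)  # 221.7 1
```

With χ² = 221.7 at df = 1, the statistic far exceeds the critical value 10.83, so we would report p < .001.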
Just like other statistical tests, the chi-square test is sensitive to sample size. The larger the sample, the more likely you are to reject the null hypothesis. In other words, the chi-square test is more likely to be statistically significant with larger sample sizes, even if the association between the two variables is weak.

Measuring Strength Of Association

The chi-square test is not a measure of the strength of the association between two variables. Other statistics measure the strength of the association between nominal (or ordinal) variables, such as phi, Cramér's V, the contingency coefficient, and gamma.

1) Phi Coefficient (φ)
- Phi is a coefficient based on the value of χ².
- It measures the strength of the relationship between two dichotomous variables (i.e., a 2x2 table).
- Phi ranges between 0 and 1. The higher the value of phi, the stronger the association between the 2 variables.
- Phi is a symmetrical measure, that is, it does not distinguish between the IV and the DV. In other words, it does not indicate which variable is the cause of the other.
- Phi is computed as:

Phi = SQRT (χ² / N)

2) Cramér's V
- V is a coefficient based on the value of χ².
- It measures the strength of the relationship between two nominal variables, regardless of the size of the contingency table (e.g. 3x2, 4x3, 5x2, etc.).
- Same basic idea as phi, but not limited to 2x2 tables.
- V ranges between 0 and 1. The higher the value of V, the stronger the association between the 2 variables.
- Like phi, V is a symmetrical measure; it does not distinguish between the IV and the DV.
- In a 2x2 table, V and φ are identical.
- V is computed as:

V = SQRT (χ² / (n * (k − 1)))

where χ² = the value of the chi-square statistic, n = the sample size, and k = the smaller of the number of rows or columns in the table (e.g. if the table has 2 rows and 3 columns, then k = 2).
Chi-Square in R

1. Make the contingency table, e.g. with CrossTable() from library(gmodels).
2. Calculate the chi-square statistic and p-value, e.g. included in CrossTable(), or use chisq.test().
3. If significant, interpret the strength of the association with phi or Cramér's V: library(vcd), assocstats().

Alternative tests:
- Fisher's exact test (when expected frequencies are < 5)
- Yates's correction (2x2 tables)
- Likelihood ratio

IN SUMMARY, WHEN YOU WANT TO ASSESS THE RELATIONSHIP BETWEEN 2 NOMINAL (OR ORDINAL) VARIABLES:
1) Compute the value of χ².
2) If the χ² is statistically significant, measure the strength of the association. If the χ² is not statistically significant, the variables are independent (i.e., there is no association between the variables, and it is irrelevant to measure the strength of the association).
3) Offer an interpretation of the results.