Bivariate Statistics Session 2: Measuring Associations

Chi-Square Test

Features of the Chi-Square Statistic

The chi-square test is nonparametric. That is, it makes no assumptions about the distribution of the variables. For this reason, it is typically used with data measured at the nominal or ordinal level. Pearson's chi-square (χ²) is the most popular of the nonparametric statistics.

The chi-square (χ²) test is used to assess the relationship between 2 nominal or ordinal variables. It is a very general statistical test that can be used whenever we wish to evaluate whether frequencies that have been empirically obtained differ significantly from those that would be expected on the basis of chance or theoretical expectations. In other words, it is used when the researcher wishes to explore how the categories of the row variable are distributed across the categories of the column variable.

A statistically significant chi-square test indicates that the rows and columns of the contingency table are dependent, that is, that the differences between the observed and expected cell frequencies (cells: the fields in the table) are substantial enough not to be attributed to chance or randomness. A nonsignificant chi-square test implies that the differences in cell frequencies may be random.

The basic idea of the chi-square statistic is to compare the observed distribution of frequencies with the expected distribution of frequencies. The test starts from the null hypothesis that there is no association between the 2 variables in the contingency table, and assesses whether the observed association could plausibly be due to chance alone.

Assumptions of the Chi-Square Test

Required level of measurement: the chi-square statistic requires 2 nominal (or ordinal) variables.
Postulates of the chi-square test:
1) Random sample
2) Mutually exclusive categories
3) Expected frequencies must all be > 1
4) No more than 20% of the cells in the contingency table should have an expected frequency < 5
If these conditions are not satisfied, the chi-square test may be biased.

Contingency Tables

The basis of any chi-square test is always a table with frequency counts in the cells. Depending on the number of rows and columns, the table is usually referred to as an N x M table, where N is the number of rows and M is the number of columns. The simplest version is a 2x2 contingency table.
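As a minimal sketch of how such a table of counts can be built in R, the snippet below cross-tabulates two categorical vectors with base R's table(). The six respondents and the variable names gender and answer are invented purely for illustration; they are not survey data.

```r
# Hypothetical raw data: one entry per respondent (made up for illustration).
gender <- factor(c("Male", "Male", "Female", "Female", "Female", "Male"),
                 levels = c("Male", "Female"))
answer <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"),
                 levels = c("Yes", "No"))

# Cross-tabulate: categories of the first variable in rows,
# categories of the second variable in columns.
tab <- table(gender, answer)

# Append row and column totals (the margins of the contingency table).
tab_with_totals <- addmargins(tab)
print(tab_with_totals)
```

With two categories per variable this yields a 2x2 table; a variable with three categories would give a 2x3 or 3x2 table in the same way.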
Contingency table: the frequencies of 2 variables presented in one table. All categories of the first variable appear in rows, and all categories of the second variable in columns. The table also gives a joint frequency for each cell, and totals for both rows and columns.

Cells: the fields in the table.

Example 1: 2x2 Table

In the fear of crime survey we found that women are more likely than men to say that they go to certain areas only if accompanied by others.

"In certain areas, I only go in the company of others"

              Yes      No       Total
    Male
    Female
    Total

Example 2: 2x3 Table

Suppose a survey asked whether people are in favour of introducing the Euro, and recorded the answers by political party preference. We might want to know: Does preference for the Euro vary by political preference? How strong is the relationship?

Pro Euro introduction, by political preference

                    Labour   Tory   Liberal   Row total
    Yes
    No
    Column total

Calculating a Chi-Square Test

How can we know that the differences above are systematic, and not due to chance? We compare our observed values to the values we would expect to see by chance alone if the null hypothesis were true.

Observed frequencies: the distribution of the variables in the sample.

Expected frequencies: the theoretical frequencies that would be obtained if there were no association between the variables (that is, if the null hypothesis were true).

Expected frequencies are computed as follows:

\[ \text{Expected cell frequency} = \frac{\text{Row total} \times \text{Column total}}{\text{Total number of observations}} \]
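The expected-frequency formula can be sketched in R as follows. The counts in the 2x3 table are invented for illustration only (they are not the survey results), and the two checks at the end simply apply the rules of thumb from the postulates of the test.

```r
# Hypothetical observed counts (invented for illustration, not survey results).
observed <- matrix(c(30, 20, 10,
                     20, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(euro  = c("yes", "no"),
                                   party = c("Labour", "Tory", "Liberal")))

n <- sum(observed)  # total number of observations

# Expected cell frequency = row total * column total / total observations.
# outer() multiplies every row total by every column total.
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected, 2))

# Rules of thumb from the postulates of the test:
all_above_one   <- all(expected > 1)        # every expected frequency > 1
share_below_five <- mean(expected < 5)      # proportion of cells with E < 5
print(all_above_one)
print(share_below_five <= 0.20)
```

The expected table always has the same row and column totals as the observed table; only the distribution of counts inside the table changes.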
The general formula for computing the chi-square statistic is:

\[ \chi^2 = \sum \frac{(\text{Observed frequency} - \text{Expected frequency})^2}{\text{Expected frequency}} \]

Degrees of Freedom (df)

Statisticians use the term degrees of freedom to describe the number of values in the final calculation of a statistic that are free to vary. For a contingency table, the degrees of freedom are computed by multiplying the number of rows minus one by the number of columns minus one. The formula is:

\[ df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) \]

Basic steps underlying the computation of a chi-square statistic:

Step 1: Observe the distribution of frequencies in the cells of a table (the observed frequencies) and compute the sum of each row and column.
Step 2: Compute the frequencies one would expect in each cell by chance (the expected frequencies: row total * column total / total number of observations).
Step 3: Compare the observed to the expected frequencies (observed frequency - expected frequency).
Step 4: Compute the chi-square value (see formula above).

A Step-by-Step Guide to Computing a 2x2 Chi-Square Statistic

Starting point: observed distribution

              Males    Females   Sum
    Yes       290      795       1085
    No        364      200       564
    Sum       654      995       1649

Step 1: Compute Expected Cell Frequencies

Formula (for each cell): row total * column total / n

              Males    Females   Sum
    Yes       430.32   654.68    1085
    No        223.68   340.32    564
    Sum       654      995       1649

Computing expected cell frequencies:
The expected frequency in cell 1,1 (yes, males) is (654 * 1085) / 1649 = 430.32
The expected frequency in cell 1,2 (yes, females) is (995 * 1085) / 1649 = 654.68
The expected frequency in cell 2,1 (no, males) is (654 * 564) / 1649 = 223.68
The expected frequency in cell 2,2 (no, females) is (995 * 564) / 1649 = 340.32

Step 2: Compute the Difference Observed - Expected

Males yes: 290 - 430.32 = -140.32
Females yes: 795 - 654.68 = 140.32
Males no: 364 - 223.68 = 140.32
Females no: 200 - 340.32 = -140.32

Step 3: Compute (Difference Squared) / Expected

Males yes: (-140.32)² / 430.32 = 45.76
Females yes: (140.32)² / 654.68 = 30.08
Males no: (140.32)² / 223.68 = 88.03
Females no: (-140.32)² / 340.32 = 57.86

Step 4: Sum of all cell chi-squares = χ² = 221.7

Testing for Significance

To test for significance, we check in a standard table (the χ² distribution table) whether, for a given value of χ² and a given number of degrees of freedom, the association between the variables is statistically significant, i.e. whether the differences between observed and expected frequencies are substantial enough not to have been caused by chance. The standard level of significance (alpha) used is .05.

In the example above, with one degree of freedom ((rows - 1) * (columns - 1) = (2 - 1) * (2 - 1) = 1), the critical value of χ² for α = .05 is 3.84. Since our χ² value (221.7) exceeds the critical value at α = .05, we reject the null hypothesis and conclude that there is a significant association between these two variables. Moreover, if we look at the critical values for α = .01 and α = .001 (6.63 and 10.83, respectively), we see that our χ² = 221.7 exceeds these values, too. This indicates that the association between these two variables is highly significant, and we report it as p < .001.
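The whole calculation can be retraced in R as a rough check. This is a sketch, not the course's own script: the observed counts used here (290, 795, 364, 200) are the ones implied by the row and column totals and the observed-minus-expected differences of 140.32 in the worked example, and the object names are chosen for illustration. qchisq() and pchisq() from base R replace the lookup in a printed χ² table.

```r
# Observed counts implied by the margins (1085, 564, 654, 995) and the
# differences of 140.32 quoted in the worked example.
observed <- matrix(c(290, 795,
                     364, 200),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(answer = c("yes", "no"),
                                   gender = c("males", "females")))

n <- sum(observed)                                        # 1649

# Step 1: expected frequencies (row total * column total / n).
expected <- outer(rowSums(observed), colSums(observed)) / n

# Steps 2 and 3: (observed - expected)^2 / expected for every cell.
cell_chisq <- (observed - expected)^2 / expected

# Step 4: the chi-square statistic is the sum over all cells.
chisq <- sum(cell_chisq)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical values for alpha = .05, .01 and .001, and the p-value.
critical <- qchisq(1 - c(0.05, 0.01, 0.001), df = df)
p_value  <- pchisq(chisq, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(round(chisq, 1))
print(round(critical, 2))
print(p_value < 0.001)
```

Working with unrounded expected frequencies gives cell contributions that differ slightly in the second decimal from the hand calculation, but the total still rounds to 221.7.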
Just like other statistical tests, the chi-square test is sensitive to sample size. For a given strength of association, the larger the sample, the more likely it is that you will reject the null hypothesis. In other words, the chi-square test is more likely to be statistically significant with larger sample sizes, even if the association between the two variables is weak.

Measuring Strength of Association

The chi-square test is not a measure of the strength of the association between two variables. Other measures are needed to assess the strength of the association between nominal (or ordinal) variables, such as phi, Cramér's V, the contingency coefficient, and gamma.

1) Phi Coefficient (φ)
- Phi is a coefficient based on the value of χ².
- It measures the strength of the relationship between two dichotomous variables (i.e., a 2x2 table).
- Phi ranges between 0 and 1. The higher the value of phi, the stronger the association between the 2 variables.
- Phi is a symmetrical measure, that is, it does not distinguish between the independent variable (IV) and the dependent variable (DV). In other words, it does not indicate which variable is the cause of the other.
- Phi is computed as:

\[ \varphi = \sqrt{\frac{\chi^2}{N}} \]

2) Cramér's V
- V is a coefficient based on the value of χ².
- It measures the strength of the relationship between two nominal variables, regardless of the size of the contingency table (e.g. 3x2, 4x3, 5x2, etc.).
- It follows the same basic idea as phi, but is not limited to 2x2 tables.
- V ranges between 0 and 1. The higher the value of V, the stronger the association between the 2 variables.
- Like phi, V is a symmetrical measure; it does not distinguish between the IV and the DV.
- In a 2x2 table, V and φ are identical.
- V is computed as:

\[ V = \sqrt{\frac{\chi^2}{n\,(k - 1)}} \]

where χ² = value of the chi-square statistic, n = sample size, and k = the smaller of the number of rows and the number of columns in the table (e.g. if the table has 2 rows and 3 columns, then k = 2).
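Both coefficients are simple to compute once χ² is known. The sketch below applies the two formulas to the values from the 2x2 worked example (χ² = 221.7, n = 1649); the helper name cramers_v and its arguments are hypothetical and chosen only for this illustration.

```r
# Chi-square value and sample size from the 2x2 worked example.
chisq <- 221.7
n     <- 1649

# Phi: square root of chi-square divided by the sample size (2x2 tables only).
phi <- sqrt(chisq / n)

# Cramer's V for a table of any size; k is the smaller of the number of
# rows and the number of columns. (Hypothetical helper for illustration.)
cramers_v <- function(chisq, n, n_rows, n_cols) {
  k <- min(n_rows, n_cols)
  sqrt(chisq / (n * (k - 1)))
}

v <- cramers_v(chisq, n, n_rows = 2, n_cols = 2)

print(round(phi, 2))
print(round(v, 2))
```

For a 2x2 table k - 1 = 1, so V reduces to phi and the two printed values agree. Here both are about 0.37.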
Chi-Square in R

1. Make a contingency table, e.g. with CrossTable() from library(gmodels).
2. Calculate the chi-square value and p-value, e.g. as included in the output of CrossTable(), or use chisq.test().
3. If the test is significant, interpret the strength of the association with phi or Cramér's V, e.g. with assocstats() from library(vcd).

Alternative tests:
- Fisher's exact test (when expected frequencies are < 5)
- Yates's correction (2x2 table)
- Likelihood ratio

IN SUMMARY, WHEN YOU WANT TO ASSESS THE RELATIONSHIP BETWEEN 2 NOMINAL (OR ORDINAL) VARIABLES:
1) Compute the value of χ².
2) If the χ² is statistically significant, measure the strength of the association. If the χ² is not statistically significant, the null hypothesis of independence cannot be rejected (i.e., there is no evidence of an association between the variables, and it is not meaningful to measure the strength of the association).
3) Offer an interpretation of the results.
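As a rough illustration of this workflow, the sketch below uses only base R so that it runs without extra packages; CrossTable() and assocstats() need the gmodels and vcd packages and are left out here. The counts are the ones implied by the 2x2 worked example, and the likelihood ratio statistic is computed by hand from its usual definition rather than taken from a package.

```r
# Observed counts implied by the 2x2 worked example (rows: answer, cols: gender).
observed <- matrix(c(290, 795,
                     364, 200),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(answer = c("yes", "no"),
                                   gender = c("males", "females")))

# Pearson chi-square test without continuity correction.
pearson <- chisq.test(observed, correct = FALSE)

# For 2x2 tables chisq.test() applies Yates's correction by default.
yates <- chisq.test(observed, correct = TRUE)

# Fisher's exact test, an alternative when expected frequencies are small.
fisher <- fisher.test(observed)

# Likelihood ratio statistic: 2 * sum(observed * log(observed / expected)).
g2 <- 2 * sum(observed * log(observed / pearson$expected))

print(pearson$statistic)   # chi-square value
print(pearson$parameter)   # degrees of freedom
print(pearson$p.value)     # p-value
print(pearson$expected)    # expected frequencies, to check the postulates
print(yates$statistic)
print(fisher$p.value)
print(g2)
```

The uncorrected test reproduces the hand calculation, while the Yates-corrected statistic is slightly smaller because it shrinks each difference by 0.5 before squaring.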
