3. Analysis of Qualitative Data
Inferential Stats, CEC at RUPP
Poch Bunnak, Ph.D.

Content
1. Hypothesis tests about a population proportion: Binomial test
2. Chi-square test for goodness of fit
3. Chi-square test for independence
4. Notes on measures of association

2010 Poch Bunnak
1. Hypothesis tests about a population proportion: One-sample binomial test

1.1. Situations:
To test whether a sample proportion differs from a test value, such as the value for a given population or a past result.
Examples:
- In 2010, girls accounted for 25% of all students in all universities in Phnom Penh. The Rector wishes to know if the % of girls at RUPP is significantly higher or lower than the overall proportion.
- A president got 55% of the votes. After one year in office, he wants to know if the number of his supporters has increased or decreased.
- A company sold 2500 red and 2300 blue toys. Do the data provide evidence for a significant color preference?

1.2. Test statistics:
Large sample (n > 5/[min(p, 1-p)]): Z-test; small sample: Binomial test
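As a rough illustration of the rule of thumb in 1.2, the sketch below checks whether a sample counts as "large". It reads n > 5/min(p, 1-p) as the condition for using the Z-test, which is an interpretation of the slide rather than something it states outright. The function name `use_z_test` is hypothetical.

```python
def use_z_test(n, p):
    """Return True if n exceeds 5 / min(p, 1 - p), the rule of thumb in 1.2.

    Assumption: the rule marks the large-sample case (Z-test);
    otherwise the binomial test is used.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return n > 5 / min(p, 1 - p)


# With a test proportion of 25%, the threshold is 5 / 0.25 = 20.
print(use_z_test(50, 0.25))  # n = 50 is above the threshold
print(use_z_test(15, 0.25))  # n = 15 is below the threshold
```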
1.2.a Test statistics: z test
Assumptions: random sample, binary var, normal distribution of p (the closer p is to 0 or 1, for any sample size, the more skewed the distribution of p)
H0: p = p_h
Ha: p ≠ p_h, or Ha: p > p_h, or Ha: p < p_h
Two ways: Z = (p - p_h)/S.E.(p) or Z = sqrt(chi-square), then find the p value and make the decision
S.E.(p) = sqrt((p*(1-p))/n)

1.3. Example
Women today are getting more educated and working outside the home for cash. They are likely to marry later and have fewer children than before. Is this claim true? Use CDHS 2005 data to test if the mean age at marriage and the mean number of children in 2005 are different from those in 2000.

1.2.b Test statistics: Binomial test
Based on the binomial distribution (prob distribution of two outcomes only, binary var)
Assumptions: binary var., normality [n*p > 10 & n*(1-p) > 10]
Example: 15 girls and 35 boys were enrolled in one class. Does the gender composition of this class differ from the gender admission quota of 25% girls?
SPSS data: create two vars (gender with 1 = girls and 2 = boys, and n with 15 for girls and 35 for boys), weight by n.
SPSS analysis: Analyze -> Nonparametric test -> Binomial -> Move gender into the Test Variable List box -> Enter 0.25 in the Test Proportion box -> OK
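For readers without SPSS, the class example (15 girls out of 50, quota 25%) can be sketched in plain Python. This is a minimal sketch, not the SPSS procedure: the z statistic follows the formulas in 1.2.a, with S.E.(p) based on the sample proportion, and the exact tail probability sums binomial probabilities directly. The function names are hypothetical, and SPSS may apply a different approximation, so the numbers need not agree to the last digit.

```python
import math


def z_test_proportion(successes, n, p_h):
    """Z statistic and two-tailed p value, using the formulas in 1.2.a."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # S.E.(p) = sqrt(p*(1-p)/n)
    z = (p - p_h) / se
    # Two-tailed p value from the standard normal distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


def binomial_upper_tail(successes, n, p_h):
    """Exact P(X >= successes) when X ~ Binomial(n, p_h)."""
    return sum(
        math.comb(n, k) * p_h**k * (1 - p_h) ** (n - k)
        for k in range(successes, n + 1)
    )


z, p_z = z_test_proportion(15, 50, 0.25)
p_exact = binomial_upper_tail(15, 50, 0.25)
print(f"z = {z:.3f}, two-tailed p = {p_z:.3f}")
print(f"exact one-tailed p = {p_exact:.3f}")
```

The exact one-tailed value is the one to compare with the 1-tailed significance in the SPSS output.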
Binomial Test

               Category   N    Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
cat  Group 1   1.00       15   .30              .25          .252 (a)
     Group 2   2.00       35   .70
     Total                50   1.00
a. Based on Z Approximation.

Interpretation
Girls made up 30% of the class, 5 percentage points above the quota. However, the difference is not statistically significant based on the binomial test (n = 50, p (1-tailed) = 0.252).

Practice:
- Redo the test with 150 girls and 350 boys. What do you see?
- A survey of 200 voters showed that 120 voted for A and 80 voted for B. Is there enough evidence to predict the winner?

Other features of binomial test with SPSS
- Note that the binomial test is always a one-tailed test
- SPSS does not calculate the CI of the difference. You can do this using formulas
- You can use a cut point to split the data if you do not want to recode (values <= cut point value go into group 1)
- Three options for calculating p values:
  - Asymptotic distribution (z approximation)
  - Exact test (based on actual data w/o prob sampling calculation): when the conditions for the normal approximation are not met. You should use this test if your sample is small or p is small
  - Monte Carlo: when the sample size is too large
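The note above that the CI can be computed from formulas can be sketched as follows. The slides do not say which formula is meant, so this assumes the usual normal-approximation interval p +/- 1.96 * S.E.(p), with S.E.(p) as defined in 1.2.a and 1.96 as the 95% critical value. The function name is hypothetical.

```python
import math


def proportion_ci(successes, n, z_crit=1.96):
    """Normal-approximation CI for a proportion: p +/- z_crit * S.E.(p)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z_crit * se, p + z_crit * se


p_h = 0.25                         # admission quota for girls
low, high = proportion_ci(15, 50)  # 15 girls out of 50 students
print(f"95% CI for p:       ({low:.3f}, {high:.3f})")
print(f"95% CI for p - p_h: ({low - p_h:.3f}, {high - p_h:.3f})")
```

If the interval for p - p_h contains 0, the sample proportion is not significantly different from the test value at the chosen confidence level.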
2. Hypothesis tests about a population's proportions: Chi-square test for goodness of fit

2.1. Situations:
To compare a sample's freq distribution of a categ var with expected frequencies (all categories contain the same proportion of values) or with user-specified proportions of values
Examples:
- Is there a significant difference in the number of supporters among the three candidates?
- Is there any evidence showing that the departments have different numbers of first-year students?
- In 2000, 10% were extremely poor, 20% were just poor, 40% were just above the poverty line, 25% were rich, and 5% were very rich. Is there any change in the distribution of living standards 10 years later?

2.2. Test statistics: Chi-square test for goodness of fit
2.3. Chi-square test assumptions
- Nonparametric test: no distribution shape assumption
- Categorical data; data from a random sample
- The chi-square test is valid only if the expected freq f(e) is at least 5 for every category, or if no more than 20% of the categories have f(e) < 5
- H0: f(o) = f(e) (all categories); Ha: f(o) ≠ f(e) (at least 1 category)

2.3. Example
The distribution of foundation-year students by department is: 100 in English, 80 in math, and 110 in computer science. Is the difference statistically significant at 99% CL?
H0: the students are equally distributed across the departments
Enter data in SPSS (dept and n vars) and weight by n
SPSS: Analyze -> Nonparametric -> Chi-square -> Put Dept var in Test Variable List (be sure "All categories equal" is ticked) -> OK

2.4. Result and interpretation
Notes: f(e) = n of cases/n of cat; residual = f(o) - f(e); chi-square = Sum[(f(o) - f(e))^2/f(e)]; df = n of cat - 1
Compare the obtained chi-square with the critical chi-square, or use asymp sig to make a decision about the test
Interpretation: The test is not stat sig (chi-square = 4.8, df = 2, p = 0.089), meaning that H0 is not rejected. Thus, there is no evidence supporting the Ha that the distribution of students differs across the three departments.
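The formulas in the notes can be checked by hand with a short Python sketch. It uses the department counts from the example and equal expected frequencies; the function names and the optional `proportions` argument (for user-specified proportions) are hypothetical. The p value comes from a small stdlib-only routine for integer df, an implementation choice made here to avoid external libraries.

```python
import math


def chi2_sf(x, df):
    """P(chi-square with integer df > x), built up from df = 1 or df = 2."""
    if df % 2:  # odd df: start from df = 1
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:       # even df: start from df = 2
        q, k = math.exp(-x / 2), 2
    while k < df:  # recurrence from k to k + 2 degrees of freedom
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def goodness_of_fit(observed, proportions=None):
    """Chi-square = Sum[(f(o) - f(e))^2 / f(e)], df = n of cat - 1."""
    n, k = sum(observed), len(observed)
    if proportions is None:  # all categories equal
        proportions = [1 / k] * k
    expected = [n * p for p in proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, k - 1, chi2_sf(chi2, k - 1)


# English, math, computer science
chi2, df, p = goodness_of_fit([100, 80, 110])
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.3f}")
```

Since p is above 0.01, the decision at 99% CL agrees with the interpretation above.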
2.5. Other notes for Chi-square test with SPSS
- If you have many categories but wish to analyze only some of them, you can specify the range of values to be analyzed
- You can test the freq distribution against user-defined expected values (values are entered in the order of the cat value codes, and at least one of them must be different; otherwise f(e) is equal across categories, the same as before)
- Three options for calculating p values:
  - Asymptotic distribution (chi-square approximation)
  - Exact test: when the freq assumptions are not met (small n, f(e) too small, or too many categories with f(e) < 5)
  - Monte Carlo: when the sample size is too large

3. Hypothesis tests about different population proportions: Chi-square test for independence
3.1. Situations
You want to find out if two categorical vars from a single population are associated
2 vars are associated if they are dependent on one another (change in one var -> change in the other var)
Examples
- Is the proportion of female students in two departments (English and Computer) the same? [Gender and dept vars; if the % of females in both depts is the same -> no association b/w gender and field of study]
- Does the number of supporters of a president change 1 year after his election? [2 vars: support (yes-no) and time (when elected and 1 year later). If the % of supporters is the same -> no association]
If both vars are binary -> 2x2 table; in general, 2 categorical variables -> rxc table

3.3. Chi-square test assumptions
- Nonparametric test: no distribution shape assumption
- Categorical or nominal data; data from a random sample
- The chi-square test is valid only if f(e) > 0 and no more than 20% of the cells have f(e) < 5
- H0: There is no association b/w var 1 and var 2
- No association = independence = (% in urban ~ % in rural) = (f(o) = f(e))

3.4. Example 1: 2x2 table
The mean age at mar is 19.4 yrs. Is there any sig difference b/w urban and rural areas in the proportion married at an age below the average?
H0: no association b/w age at mar and residential location
SPSS, CDHS 2005 data:
Recode v511 into binary var v511_d with 1 = below 19.4 and 2 = 19.4 or above
Analyze -> Descriptive Statistics -> Crosstabs -> V025 as column and v511_d as row -> Click to open the Statistics box -> Click Chi-square -> Continue -> OK
Result

Interpretation
In total, 59% of women married at an age below the average age at mar. This proportion is higher in rural areas than in urban areas (59.9% versus 55.9%, respectively). A chi-square test was performed to see if the two vars are independent, and the result showed that the two vars are independent at 95% CL (chi-square = 3.4, 2-tailed p = 0.065). Although the proportion married at an age below the mean is higher in rural than in urban areas, the difference is not statistically significant.
Note that if Ha is one-tailed, then p = 0.065/2 = 0.033 -> sig at 95% -> the two vars are dependent!
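The p values quoted in the interpretation can be recovered from the reported chi-square alone. A minimal sketch, assuming df = 1 for the 2x2 table and using the rounded value 3.4 from the slide, so the last digit may differ slightly from SPSS:

```python
import math

chi2 = 3.4  # chi-square reported on the slide (rounded)

# For df = 1, chi-square = Z^2, so P(chi-square > x) = P(|Z| > sqrt(x)).
p_two_tailed = math.erfc(math.sqrt(chi2 / 2))
p_one_tailed = p_two_tailed / 2

print(f"2-tailed p = {p_two_tailed:.3f}")
print(f"1-tailed p = {p_one_tailed:.3f}")
```

The two values fall on opposite sides of 0.05, which is why the conclusion changes with the direction of Ha.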
Important!
The chi-square test does not tell us how strong the association is. To know this, we need to request measures of association:
- Contingency coefficient: C = sqrt(chi-square/(chi-square + n)), chi-square-based, value: 0 ~ 1 (reaches 1 only if there are many categories of vars)
- Phi: adjusted for n, phi = sqrt(chi-square/n), for 2x2 tables only, value: 0 - 1
- Cramer's V: adj for n, V = sqrt(chi-square/(n*min(r-1, c-1))), rxc tables, value: 0 - 1
- Lambda and Uncertainty Coefficient: PRE measures, value: 0 - 1; PRE interpretation: improvement in predicting one var given knowledge of the other var.

3.4. Example 2: rxc table
- Find out if the levels of education of husbands and wives are related (v106 and v701). Is this true for both urban and rural areas? (Use appropriate tests)
- The table below summarizes the data on religious preference and attitudes towards abortion in one country. Is respondents' attitude related to their religious preference?

                            Religious preference
Attitude toward abortion    Liberal protestant   Conservative protestant   Catholic   None   Total
Favor                       103                  182                       80         16     381
Oppose                      187                  238                       286        74     785
Total                       290                  420                       366        90     1166
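One way to work through the abortion-attitude table without SPSS is sketched below. It computes f(e) = row total * column total / n for each cell, sums (f(o) - f(e))^2/f(e), and then applies the contingency coefficient and Cramer's V formulas given above. The function names are hypothetical, and the p value routine is a stdlib-only helper for integer df, not the SPSS algorithm.

```python
import math


def chi2_sf(x, df):
    """P(chi-square with integer df > x), built up from df = 1 or df = 2."""
    if df % 2:  # odd df: start from df = 1
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:       # even df: start from df = 2
        q, k = math.exp(-x / 2), 2
    while k < df:  # recurrence from k to k + 2 degrees of freedom
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def chi2_independence(table):
    """Chi-square statistic, df and n for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected freq
            chi2 += (f_o - f_e) ** 2 / f_e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, n


# Rows: Favor, Oppose. Columns: Liberal protestant, Conservative
# protestant, Catholic, None.
table = [[103, 182, 80, 16], [187, 238, 286, 74]]
chi2, df, n = chi2_independence(table)
p = chi2_sf(chi2, df)

r, c = len(table), len(table[0])
contingency_c = math.sqrt(chi2 / (chi2 + n))
cramers_v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))

print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.2g}")
print(f"C = {contingency_c:.3f}, Cramer's V = {cramers_v:.3f}")
```

The test answers whether attitude and religious preference are associated, while C and V indicate how strong that association is.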