Chi Square & Correlation

Nonparametric Test of χ²
Used when too many assumptions of the t-test are violated:
  The sample size is too small to reflect the population.
  The data are not continuous and thus not appropriate for parametric tests based on normal distributions.
χ² is another way of showing that some pattern in the data is not created randomly by chance.
χ² can be one- or two-dimensional.
χ² deals with the question of whether what we observed is different from what is expected.

Calculating χ²
What would a contingency table look like if no relationship exists between gender and voting for Bush (i.e., statistical independence)?
                   Male   Female   Total
Voted for Bush      25      25       50
Voted for Kerry     25      25       50
Total               50      50      100
NOTE: Independent variables go on the columns and dependent variables on the rows.

Calculating χ²
What would a contingency table look like if a perfect relationship exists between gender and voting for Bush?
                   Male   Female
Voted for Bush      50       0
Voted for Kerry      0      50

Calculating the expected value
$$\hat{f}_{ij} = \frac{f_i \, f_j}{N}$$
\hat{f}_{ij} = the expected frequency of the cell in the ith row and jth column
f_i = the total in the ith row marginal
f_j = the total in the jth column marginal
N = the grand total, or sample size for the entire table
Expected count, Voted for Bush = (50 × 50) / 100 = 25
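The expected-frequency rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `expected_counts` is hypothetical, and the table is assumed to be a list of rows (vote) by columns (gender).

```python
def expected_counts(observed):
    """Expected cell counts under independence: (row total * column total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: voted for Bush, voted for Kerry; columns: male, female.
# Only the marginals matter here (50/50 by 50/50, N = 100).
observed = [[50, 0], [0, 50]]
expected = expected_counts(observed)
print(expected)  # [[25.0, 25.0], [25.0, 25.0]]
```

Every cell has the same expected count because both row totals and both column totals equal 50.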

Nonparametric Test of χ²
Again, the basic question is: was what you are observing in some given data created by chance or through some systematic process?
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
O = observed frequency
E = expected frequency

Nonparametric Test of χ²
The null hypothesis we are testing here is that the proportions of occurrences in each category are equal to each other (H0: B = K).
Our research hypothesis is that they are not equal (Ha: B ≠ K).
Given the sample size, how many cases could we expect in each category (n / number of categories)?
Comparing the obtained value with the critical value provides a coefficient and the probability (Pr.) of obtaining such results by chance alone.

Let's do a χ²
           Voted for Bush   Voted for Kerry
Male             50                 0
Female            0                50
(50 - 25)² / 25 = 25
(0 - 25)² / 25 = 25
(0 - 25)² / 25 = 25
(50 - 25)² / 25 = 25
χ² = 100
What would χ² be when there is statistical independence?
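The cell-by-cell arithmetic can be checked with a short Python sketch. The helper name `chi_square` is an illustrative assumption; the expected count of 25 per cell comes from the marginals in the slides. Running the same function on the independence table also answers the closing question.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over all cells, given flat lists of counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [25, 25, 25, 25]
perfect = chi_square([50, 0, 0, 50], expected)        # perfect relationship
independent = chi_square([25, 25, 25, 25], expected)  # statistical independence
print(perfect, independent)  # 100.0 0.0
```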

Let's corroborate with SPSS
Chi-Square Tests (statistical independence)
                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             .000 b     1    1.000
Continuity Correction a        .000       1    1.000
Likelihood Ratio               .000       1    1.000
Fisher's Exact Test                                                    1.000                  .579
Linear-by-Linear Association   .000       1    1.000
N of Valid Cases               100
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 25.00.
Chi-Square Tests (perfect relationship)
                               Value      df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             100.000 b  1    .000
Continuity Correction a        96.040     1    .000
Likelihood Ratio               138.629    1    .000
Fisher's Exact Test                                                    .000                   .000
Linear-by-Linear Association   99.000     1    .000
N of Valid Cases               100
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 25.00.

Testing for significance
                   Male   Female
Voted for Bush      20      30
Voted for Kerry     30      20
χ² = 4
How do we know if the relationship is statistically significant?
We need to know the degrees of freedom: df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1
We go to the χ² distribution to look for the critical value (CV = 3.84 at the .05 level).
Because 4 > 3.84, we conclude that the relationship between gender and voting is statistically significant.
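The decision rule can be sketched in Python using only the standard library. The function name `chi_square_p_value_df1` is hypothetical, and the closed form it uses is valid only for df = 1; the .05 significance level is the conventional one that corresponds to CV = 3.84.

```python
import math

def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # With df = 1, chi-square is a squared standard normal, so
    # P(X > x) = erfc(sqrt(x / 2)). This shortcut does not hold for other df.
    return math.erfc(math.sqrt(statistic / 2))

observed = [[20, 30], [30, 20]]  # rows: Bush, Kerry; columns: male, female
df = (len(observed) - 1) * (len(observed[0]) - 1)
statistic = 4.0        # chi-square value from the slide
critical_value = 3.84  # df = 1, alpha = .05
p_value = chi_square_p_value_df1(statistic)
print(df, statistic > critical_value, round(p_value, 4))  # 1 True 0.0455
```

Comparing the statistic with the critical value and comparing the p-value with .05 are two views of the same decision.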

When is χ² appropriate to use?
χ² is perhaps the most widely used statistical technique for analyzing nominal and ordinal data.
  Nominal × nominal (gender and voting preferences)
  Nominal × ordinal (gender and opinion of W)

χ² can also be used with larger tables
Opinion of Bush (each cell's contribution to χ² in parentheses)
           Favorable    Indifferent   Unfavorable   Total
MALE       40 (19.4)    10 (.88)      15 (8.6)        65
FEMALE      5 (15.8)    20 (.72)      55 (6.9)        80
Total      45           30            70             145
χ² = 52.3
Do we reject the null hypothesis?
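One way to check the larger table is a general r × c routine, sketched here in Python with the standard library only. The name `chi_square_test` is an assumption for illustration, and the p-value shortcut `exp(-x / 2)` applies only when df = 2, which is the case for this 2 × 3 table.

```python
import math

def chi_square_test(observed):
    """Return (statistic, df) for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for cell (i, j)
            statistic += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df

# Rows: male, female; columns: favorable, indifferent, unfavorable.
opinion = [[40, 10, 15], [5, 20, 55]]
statistic, df = chi_square_test(opinion)
# For df = 2 the upper-tail probability has the closed form exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(round(statistic, 1), df, p_value < 0.05)
```

Computed without intermediate rounding, the statistic comes out near 52.4; the slide's 52.3 is the sum of the rounded cell contributions. Either value is far above 5.99, the .05 critical value for df = 2.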

Correlation (does not mean causation)
We want to know how two variables are related to each other.
  Does eating doughnuts affect weight?
  Does spending more hours studying increase test scores?
Correlation means how much two variables overlap with each other.

Types of Correlations
X (cause)                 Y (effect)         Correlation    Values
Increases                 Increases          Positive       0 to 1
Decreases                 Decreases          Positive       0 to 1
Increases                 Decreases          Negative       -1 to 0
Decreases                 Increases          Negative       -1 to 0
Increases or decreases    Does not change    Independent    0

Conceptualizing Correlation
[Figure: Measuring Development, contrasting a weak and a strong correlation; labels: GDP, POP, WEIGHT, GDP, EDUCATION]
Correlation will be associated with what type of validity?

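Correlation Coefficient
$$r_{xy} = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}}$$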

Home Value & Square Footage
         Log value   Log sqft   value²     sqft²      val × sqft
         5.13        4.02       26.3169    16.1604    20.6226
         5.20        4.54       27.0400    20.6116    23.6080
         4.53        3.53       20.5209    12.4609    15.9909
         4.79        3.80       22.9441    14.4400    18.2020
         4.78        3.86       22.8484    14.8996    18.4508
         4.72        4.17       22.2784    17.3889    19.6824
Total    29.15       23.92      141.95     95.96      116.56

Correlation Coefficient
$$r_{xy} = \frac{(6 \times 116.56) - (29.15)(23.92)}{\sqrt{\left[ (141.95 \times 6) - (29.15)^2 \right] \left[ (95.96 \times 6) - (23.92)^2 \right]}} = \frac{2.09}{2.66} = .78$$
Correlations (SPSS output)
                                VALUE    SQFT
VALUE   Pearson Correlation     1        .778
        Sig. (2-tailed)         .        .068
        N                       6        6
SQFT    Pearson Correlation     .778     1
        Sig. (2-tailed)         .068     .
        N                       6        6
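The hand calculation can be reproduced with a small Python sketch that applies the raw-sums formula to the six observations from the home value table. The function name `pearson_r` and the variable names are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation using the computational (raw-sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

log_value = [5.13, 5.2, 4.53, 4.79, 4.78, 4.72]
log_sqft = [4.02, 4.54, 3.53, 3.8, 3.86, 4.17]
r = pearson_r(log_value, log_sqft)
print(round(r, 3))  # 0.778
```

Working from the unrounded sums gives the .778 reported by SPSS; the hand calculation uses sums rounded to two decimals, which is why it reports .78.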

Rules of Thumb
Size of correlation coefficient    General interpretation
.8 - 1.0                           Very strong
.6 - .8                            Strong
.4 - .6                            Moderate
.2 - .4                            Weak
.0 - .2                            Very weak or no relationship

Multiple Correlation Coefficients
Correlations (SPSS output)
                                VALUE     SQFT      BTH       BDR
VALUE   Pearson Correlation     1         .784**    .775**    .708**
        Sig. (2-tailed)         .         .000      .000      .000
        N                       46        46        46        46
SQFT    Pearson Correlation     .784**    1         .669**    .654**
        Sig. (2-tailed)         .000      .         .000      .000
        N                       46        46        46        46
BTH     Pearson Correlation     .775**    .669**    1         .895**
        Sig. (2-tailed)         .000      .000      .         .000
        N                       46        46        46        46
BDR     Pearson Correlation     .708**    .654**    .895**    1
        Sig. (2-tailed)         .000      .000      .000      .
        N                       46        46        46        46
**. Correlation is significant at the 0.01 level (2-tailed).

Limitations of correlation coefficients
They tell us how strongly two variables are related.
However, r coefficients are limited because they cannot tell us anything about:
1. Causation between X and Y
2. Marginal impact of X on Y
3. What percentage of the variation of Y is explained by X
4. Forecasting
Because of the above, Ordinary Least Squares (OLS) is most useful.

Do you have the BLUES?
B for Best (minimum error)
L for Linear (the form of the relationship)
U for Unbiased (does the parameter truly reflect the effect?)
E for Estimator

Home value and sq. feet
Y = α + βx + ε
[Figure: scatterplot with fitted line; y-axis VALUE (4.5 to 5.3), x-axis SQFT (3.4 to 4.6)]
Does the above line meet the BLUE criteria?
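To make the model Y = α + βx + ε concrete, here is a minimal least-squares sketch in Python. It assumes, without confirmation from the slides, that the plotted points are the same six log value and log sqft observations tabulated earlier; the function name `ols_fit` is hypothetical.

```python
def ols_fit(xs, ys):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Assumption: the six observations from the home value table are the plotted data.
log_sqft = [4.02, 4.54, 3.53, 3.8, 3.86, 4.17]
log_value = [5.13, 5.2, 4.53, 4.79, 4.78, 4.72]
alpha, beta = ols_fit(log_sqft, log_value)
# Residuals are the estimated error terms; with an intercept they sum to zero.
residuals = [y - (alpha + beta * x) for x, y in zip(log_sqft, log_value)]
print(round(alpha, 2), round(beta, 2))
```

Unlike r, the slope β gives the marginal impact of X on Y, and the fitted line can be used for forecasting.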