Data analysis for categorical variables and its application to happiness studies
Thanawit Bunsit, Department of Economics, University of Bath
The economics of happiness and wellbeing workshop: building on theory, method through practice
Universidade Federal Fluminense, Rio de Janeiro, Brazil, 20th May, 2011
Workshop objective
By the end of this workshop, the participants will be able to:
i) examine the correlation between two or more variables using correlation analysis techniques;
ii) conduct analyses using probit and logit models;
iii) use statistical software (SPSS and STATA) for categorical data analysis and reliability tests.
Outline
- Recap of categorical variables
- Correlation analysis
- Probit and logit models
- Data analysis using SPSS and STATA
- Questions and comments
Correlation analysis: Pearson's correlation coefficient
SPSS menu: Analyze > Correlate > Bivariate
The Pearson correlation coefficient is a measure of linear association between two variables. Its value ranges from -1 to 1. The sign indicates the direction of the relationship (positive or negative), and the absolute value indicates its strength, with larger absolute values indicating stronger relationships.
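As a rough illustration of what the SPSS bivariate procedure computes, the coefficient can be sketched in plain Python. The function name `pearson_r` and the two small data series are hypothetical and are not taken from the workshop files.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (not from the workshop)
income = [1.0, 2.0, 3.0, 4.0, 5.0]
happiness = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(income, happiness)
print(round(r, 3))  # positive sign: the two series tend to move together
```

A value close to 1 or -1 would indicate a strong linear relationship; a value near 0 would indicate little linear association.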
Correlation analysis: Chi-square (χ²)
SPSS menu: Analyze > Descriptive Statistics > Crosstabs
- Select variables (row = independent, column = dependent)
- Select Chi-square from the Statistics tab
- Select percentages from the Cells tab
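The statistic that the Crosstabs procedure reports can be sketched by hand as follows. This is a minimal illustration of the Pearson chi-square computation; the function name `chi_square` and the 2 x 2 table of counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    `table` is a list of rows; each row is a list of observed cell counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = independent variable, columns = dependent variable
counts = [[30, 10],
          [20, 40]]

stat, df = chi_square(counts)
print(round(stat, 2), df)
```

The statistic is then compared with the chi-square distribution with `df` degrees of freedom, which SPSS does automatically when it reports the significance level.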
Correlation analysis: Other correlation coefficients
Nominal -> Phi and Cramér's V
Ordinal -> Somers' D, Gamma, Kendall's tau
Nominal + Interval -> Eta
Activity 1: correlation
Find the association between these factors:
- Religion and the level of HIV/AIDS cases (Low = up to 10% of the population with HIV/AIDS, High = more than 10% of the population)
- Region and the level of HIV/AIDS cases
Which factors do you think are related to GDP per capita? Use correlation coefficients and regression analysis to answer this question.
Binary logit model
- Dichotomous dependent variable
- Categorical dependent variable: Yes or No, coded 1 or 0
- Logistic regression or logit model
Case 1: Training for new employees
Variables:
1) Score = Applicant's aptitude test score
2) Experience = Months of relevant prior experience that the applicant had before this job
3) Pass = Whether the applicant actually passed the test after the training period (Yes = 1, No = 0)
Normal linear regression
Problems can be seen from the slope and the scatter plot
[Figure: scatter plot with regions labelled "Likely to fail", "May pass" and "Almost definitely pass"]
The logit transformation: Data -> Probabilities
Score         1      2      3      4      5
Pass  N       7      5      6      4      2
Pass  Prob.   0.7    0.5    0.6    0.4    0.2
Fail  N       3      5      4      6      8
Fail  Prob.   0.3    0.5    0.4    0.6    0.8
Probabilities and odds
Score         1      2      3      4      5
Pass  N       7      5      6      4      2
Pass  Prob.   0.7    0.5    0.6    0.4    0.2
Fail  N       3      5      4      6      8
Fail  Prob.   0.3    0.5    0.4    0.6    0.8
Odds          2.33   1.00   1.50   0.67   0.25
Odds
Data -> Probabilities -> Odds
Odds(event) = P(event) / [1 - P(event)]
P(event) = the probability of a particular event occurring
1 - P(event) = the probability of the event not occurring
P(event) = odds(event) / [1 + odds(event)]
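The two conversions can be checked against the pass/fail counts from the table of scores. The sketch below uses hypothetical helper names (`odds_from_prob`, `prob_from_odds`); the counts are those shown on the slides.

```python
def odds_from_prob(p):
    """Odds of an event with probability p (p must be below 1)."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Probability of an event with the given odds."""
    return odds / (1 + odds)

# Pass and fail counts by aptitude score, as given in the slide table
pass_counts = {1: 7, 2: 5, 3: 6, 4: 4, 5: 2}
fail_counts = {1: 3, 2: 5, 3: 4, 4: 6, 5: 8}

for score in pass_counts:
    n = pass_counts[score] + fail_counts[score]
    p_pass = pass_counts[score] / n
    odds = odds_from_prob(p_pass)
    # Converting the odds back should recover the original probability
    print(score, round(p_pass, 2), round(odds, 2), round(prob_from_odds(odds), 2))
```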
The odds and logit
Score         1      2      3      4      5
Pass  N       7      5      6      4      2
Pass  Prob.   0.7    0.5    0.6    0.4    0.2
Fail  N       3      5      4      6      8
Fail  Prob.   0.3    0.5    0.4    0.6    0.8
Odds          2.33   1.00   1.50   0.67   0.25
Logit         0.37   0.00   0.18   -0.18  -0.60
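A quick check of the logit row: the tabulated values agree with the base-10 logarithm of the odds, whereas logistic regression software conventionally works with the natural logarithm (the inverse of exp). The sketch below computes both so the two scales can be compared; the variable names are illustrative only.

```python
import math

# Odds of passing at each score, from the pass/fail counts in the table
odds = {1: 7 / 3, 2: 5 / 5, 3: 6 / 4, 4: 4 / 6, 5: 2 / 8}

# Log-odds on both scales
log10_logit = {score: math.log10(o) for score, o in odds.items()}
ln_logit = {score: math.log(o) for score, o in odds.items()}

for score in odds:
    print(score, round(log10_logit[score], 2), round(ln_logit[score], 2))
```

The two scales differ only by the constant factor ln(10), so the ordering and the sign of the logits are the same on either scale.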
The logit curve
[Figure: probability (vertical axis) plotted against the logit (horizontal axis, from -2 to 2)]
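Probit model
[Figure: scale from 0 to 1 divided by the cut points ξ1 and ξ2; from bottom to top the regions are labelled Pr[H_i = 1] (Not too happy), Pr[H_i = 2] (Fairly happy) and Pr[H_i = 3] (Very happy)]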
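Probit model

$$H_i^* = X_i\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \quad i = 1, \dots, N. \tag{1}$$

Generally, an m-alternative ordered model is defined as

$$H_i = j \quad \text{if} \quad \alpha_{j-1} < H_i^* \le \alpha_j, \tag{2}$$

where $\alpha_0 = -\infty$ and $\alpha_m = \infty$, and here $j = 1, 2, 3$. Then

$$\begin{aligned}
\Pr[H_i = j] &= \Pr[\alpha_{j-1} < H_i^* \le \alpha_j] \\
&= \Pr[\alpha_{j-1} < X_i\beta + \varepsilon_i \le \alpha_j] \\
&= \Pr[\alpha_{j-1} - X_i\beta < \varepsilon_i \le \alpha_j - X_i\beta] \\
&= \Phi(\alpha_j - X_i\beta) - \Phi(\alpha_{j-1} - X_i\beta)
\end{aligned} \tag{3}$$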
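Equation (3) can be evaluated numerically with the standard normal CDF. The following is a minimal sketch, assuming three happiness categories; the function name `ordered_probit_probs`, the value of X_i β and the two cut points are hypothetical and are not estimates from any data.

```python
from statistics import NormalDist

def ordered_probit_probs(xb, cutpoints):
    """Return [Pr(H = 1), ..., Pr(H = m)] for an ordered probit model.

    xb        -- the linear index X_i * beta for one individual
    cutpoints -- the m - 1 interior thresholds alpha_1 < ... < alpha_{m-1}
    """
    phi = NormalDist().cdf  # standard normal CDF
    # alpha_0 = -infinity and alpha_m = +infinity give CDF values 0 and 1
    cdf_values = [0.0] + [phi(a - xb) for a in cutpoints] + [1.0]
    # Pr(H = j) = Phi(alpha_j - xb) - Phi(alpha_{j-1} - xb)
    return [cdf_values[j] - cdf_values[j - 1] for j in range(1, len(cdf_values))]

# Hypothetical values: index 0.5, cut points -0.5 and 1.0
probs = ordered_probit_probs(xb=0.5, cutpoints=[-0.5, 1.0])
labels = ["Not too happy", "Fairly happy", "Very happy"]
for label, p in zip(labels, probs):
    print(label, round(p, 3))
```

Because the outer thresholds are minus and plus infinity, the category probabilities always sum to one.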
Logit model results
- -2 Log Likelihood
- Cox & Snell R²
- Nagelkerke R²
- Classification table
SPSS command for logit
Analyze > Regression > Binary Logistic
Define the dependent and independent variables
Logit model

$$\chi^2 = (-2LL_0) - (-2LL_M)$$

$-2LL_0$ = $-2LL$ for the baseline or null model
$-2LL_M$ = $-2LL$ for the model after the variable(s) were entered
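To make the model chi-square concrete, the sketch below computes -2LL for a null model and for a comparison model using the pass/fail counts from the score table. As a simplifying assumption, the comparison model uses the observed pass rate at each score as its predicted probability (a saturated model), not the fitted logit model from the workshop output, so the numbers illustrate the formula only.

```python
import math

# Pass and fail counts for scores 1 to 5, from the slide table
pass_counts = [7, 5, 6, 4, 2]
fail_counts = [3, 5, 4, 6, 8]

def neg2ll(pass_n, fail_n, probs):
    """-2 log-likelihood of grouped binary outcomes given predicted pass probabilities."""
    ll = 0.0
    for s, f, p in zip(pass_n, fail_n, probs):
        ll += s * math.log(p) + f * math.log(1 - p)
    return -2 * ll

total_pass = sum(pass_counts)
total = total_pass + sum(fail_counts)

# Null model: the same overall pass probability for everyone
null_probs = [total_pass / total] * len(pass_counts)
# Stand-in comparison model (assumption): observed pass rate at each score
group_probs = [s / (s + f) for s, f in zip(pass_counts, fail_counts)]

neg2ll_0 = neg2ll(pass_counts, fail_counts, null_probs)
neg2ll_m = neg2ll(pass_counts, fail_counts, group_probs)
chi2 = neg2ll_0 - neg2ll_m

print(round(neg2ll_0, 2), round(neg2ll_m, 2), round(chi2, 2))
```

A larger drop in -2LL relative to the null model means the added variable(s) improve the fit more.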
Logit model
Logit model
Now we can calculate logit(pass), the odds and the probability of passing for a person who scores 1 or 5.
Odds = exp(logit(pass))
For a score of 1:
Logit(pass) = a + Bx = -1.314 + 0.467(1) = -0.847
Odds = exp(-0.847) = 0.429
Prob. = 0.429/(0.429 + 1) = 0.30
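The same three steps can be repeated for any score. The sketch below uses the coefficients quoted on the slide (a = -1.314, B = 0.467); the function name `predict_pass` is hypothetical.

```python
import math

# Coefficients quoted on the slide
A = -1.314  # constant
B = 0.467   # coefficient on Score

def predict_pass(score):
    """Return (logit, odds, probability) of passing for a given score."""
    logit = A + B * score
    odds = math.exp(logit)      # odds = exp(logit)
    prob = odds / (1 + odds)    # probability = odds / (1 + odds)
    return logit, odds, prob

for score in (1, 5):
    logit, odds, prob = predict_pass(score)
    print(score, round(logit, 3), round(odds, 3), round(prob, 2))
```

For a score of 1 this reproduces the slide's figures; for a score of 5 the logit is positive, so the odds exceed 1 and the probability of passing is above 0.5.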
Activity 2
1. Analyse the previous model with two independent variables (Score and Experience)
2. Formulate a logit model
3. Interpret the results
Multinomial logit model
- Nominal or categorical dependent variable
- Dependent variable cannot be ordered
- More than two categories
Probit model
Analyze > Regression > Probit
Use the example file Libongdata
Thank you very much
Thanawit Bunsit, Department of Economics, University of Bath. Presented at Universidade Federal Fluminense, RJ, Brazil, 20th May, 2011.