Data Analysis for categorical variables and its application to happiness studies

Thanawit Bunsit
Department of Economics, University of Bath
The economics of happiness and wellbeing workshop: building on theory, method through practice
Universidade Federal Fluminense, Rio de Janeiro, Brazil
20th May, 2011

Workshop objective
By the end of this workshop, the participants will be able to:
i) examine the correlation between two or more variables using correlation analysis techniques;
ii) conduct an analysis using probit and logit models;
iii) use statistical software (SPSS and STATA) for categorical data analysis and reliability tests.

Outline
- Recap: categorical variables
- Correlation analysis
- Probit and logit models
- Data analysis using SPSS and STATA
- Questions and comments

Correlation analysis: Pearson's correlation coefficient
Analyze---Correlate---Bivariate
The Pearson correlation coefficient is a measure of linear association between two variables. Its values range from -1 to 1. The sign indicates the direction of the relationship (positive or negative). The absolute value indicates the strength, with larger absolute values indicating stronger relationships.
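As a cross-check on the SPSS output, the coefficient can also be computed by hand. The following is a minimal sketch in plain Python; the function name `pearson_r` and the income and satisfaction figures are hypothetical and serve only as an illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: income (thousands) and a 1-10 life-satisfaction score
income = [12, 18, 25, 31, 40, 52]
satisfaction = [4, 5, 5, 7, 6, 8]
r = pearson_r(income, satisfaction)
print(round(r, 3))  # the sign gives the direction, the magnitude the strength
```

A value close to 1 or -1 would indicate a strong linear relationship; a value near 0 says only that there is little linear association, not that the variables are unrelated.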

Correlation analysis: Chi-square (χ²)
Analyze---Descriptive Statistics---Crosstabs
- Select variables (row = independent, column = dependent)
- Select Chi-square under Statistics
- Select percentages under Cells
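To see what the Crosstabs procedure computes, the Pearson chi-square statistic can be sketched in a few lines of Python. The function `chi_square` and the counts below are hypothetical illustrations, not data from the workshop files.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical counts: rows = two regions, columns = low / high level of HIV/AIDS cases
counts = [[30, 10],
          [15, 25]]
stat, df = chi_square(counts)
print(round(stat, 2), df)
```

The statistic is then compared with the chi-square distribution with `df` degrees of freedom; a large value relative to that distribution suggests the two variables are associated.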

Correlation analysis: Other correlation coefficients
- Nominal ------> Phi and Cramer's V
- Ordinal ------> Somers' D, Gamma, Kendall's tau
- Nominal + Interval ------> Eta
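As one illustration of an ordinal measure, Gamma (Goodman and Kruskal's gamma) compares concordant and discordant pairs in a table whose rows and columns are both ordered. The sketch below is a plain-Python approximation of that idea; the function name and the counts are hypothetical.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a table with ordered rows and ordered columns."""
    concordant = 0
    discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare cell (i, j) with every cell in a later (higher) row
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical counts: rows = low / middle / high income,
# columns = not too happy / fairly happy / very happy
counts = [[20, 10, 5],
          [10, 20, 10],
          [5, 10, 20]]
gamma = goodman_kruskal_gamma(counts)
print(round(gamma, 3))
```

Pairs tied on either variable are ignored, which is why Gamma tends to be larger in absolute value than Kendall's tau on the same table.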

Activity 1: correlation
Find the association between these factors:
- Religion and the level of HIV/AIDS cases (Low = up to 10% of the population with HIV/AIDS, High = more than 10% of the population)
- Region and the level of HIV/AIDS cases
Which factors do you think are related to GDP per capita? Use correlation coefficients and regression analysis to answer this question.

Binary logit model
- Dichotomous dependent variable
- Categorical dependent variable: Yes or No
- Value 1 or 0
- Logistic regression or logit model

Case 1: Training for new employees
Variables:
1) Score = Applicant's aptitude test score
2) Experience = Months of relevant prior experience that the applicant had before this job
3) Pass = Whether the applicant actually passed the test after the training period (Yes = 1, No = 0)

Normal linear regression

Problems can be seen from the slope
[Figure: scatter plot with regions labelled "Likely to fail", "May pass" and "Almost definitely pass"]

The logit transformation
Data -----> Probabilities

Score          1      2      3      4      5
Pass   N       7      5      6      4      2
       Prob.   0.7    0.5    0.6    0.4    0.2
Fail   N       3      5      4      6      8
       Prob.   0.3    0.5    0.4    0.6    0.8

Probabilities and odds

Score          1      2      3      4      5
Pass   N       7      5      6      4      2
       Prob.   0.7    0.5    0.6    0.4    0.2
Fail   N       3      5      4      6      8
       Prob.   0.3    0.5    0.4    0.6    0.8
Odds           2.33   1.00   1.50   0.67   0.25

Odds
Data ----> Probabilities ----> Odds
Odds(event) = P(event) / [1 - P(event)]
P(event) = The probability of a particular event occurring
1 - P(event) = The probability of the event not occurring
P(event) = odds(event)/[1 + odds(event)]
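The two conversions can be checked with a short Python sketch. It uses the pass probabilities from the table on the previous slide; the function names are hypothetical.

```python
def odds_from_prob(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Probability of an event with the given odds."""
    return odds / (1 + odds)

# Pass probabilities for scores 1 to 5, taken from the slide's table
p_pass = [0.7, 0.5, 0.6, 0.4, 0.2]
odds = [round(odds_from_prob(p), 2) for p in p_pass]
print(odds)  # [2.33, 1.0, 1.5, 0.67, 0.25]

# Converting back recovers the original probabilities
recovered = [prob_from_odds(odds_from_prob(p)) for p in p_pass]
```

Odds of 1 correspond to a probability of 0.5; odds above 1 mean the event is more likely than not.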

The odds and logit
Logit = ln(odds)

Score          1      2      3      4      5
Pass   N       7      5      6      4      2
       Prob.   0.7    0.5    0.6    0.4    0.2
Fail   N       3      5      4      6      8
       Prob.   0.3    0.5    0.4    0.6    0.8
Odds           2.33   1.00   1.50   0.67   0.25
Logit          0.85   0.00   0.41  -0.41  -1.39
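A minimal Python sketch of the logit transformation and its inverse follows. It takes the natural logarithm of the odds, which is the base that matches the exp() back-transformation used later in these slides; the function names are hypothetical.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a logit back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Pass probabilities for scores 1 to 5, taken from the slide's table
p_pass = [0.7, 0.5, 0.6, 0.4, 0.2]
logits = [logit(p) for p in p_pass]
for p, z in zip(p_pass, logits):
    print(p, round(z, 2), round(inverse_logit(z), 2))
```

A probability of 0.5 maps to a logit of 0, and probabilities p and 1 - p map to logits of equal size and opposite sign, which is what makes the logit scale symmetric around zero.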

The logit curve
[Figure: the logit curve; axes labelled "Prob." and "Logit", with logit values from -2 to 2]

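Probit model
[Figure: the interval from 0 to 1 divided by the cut points ξ1 and ξ2 into three regions]
- From 0 to ξ1: Pr[H_i = 1], Not too happy
- From ξ1 to ξ2: Pr[H_i = 2], Fairly happy
- From ξ2 to 1: Pr[H_i = 3], Very happy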

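Probit model
\[ H_i^* = X_i \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \quad i = 1, \dots, N. \tag{1} \]
Generally, an m-alternative ordered model is defined as
\[ H_i = j \quad \text{if} \quad \alpha_{j-1} < H_i^* \le \alpha_j, \tag{2} \]
where \( \alpha_0 = -\infty \) and \( \alpha_m = \infty \), and here j = 1, 2, 3. Then
\[
\begin{aligned}
\Pr[H_i = j] &= \Pr[\alpha_{j-1} < H_i^* \le \alpha_j] \\
&= \Pr[\alpha_{j-1} < X_i \beta + \varepsilon_i \le \alpha_j] \\
&= \Pr[\alpha_{j-1} - X_i \beta < \varepsilon_i \le \alpha_j - X_i \beta] \\
&= \Phi(\alpha_j - X_i \beta) - \Phi(\alpha_{j-1} - X_i \beta)
\end{aligned}
\tag{3}
\]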
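Equation (3) can be evaluated numerically once the index and the cut points are known. The sketch below is a plain-Python illustration; the function names, the index value 0.3 and the cut points -0.5 and 0.8 are hypothetical and are not estimates from any data set.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ordered_probit_probs(xb, cutpoints):
    """Pr[H = j] for j = 1..m, given the index xb and m - 1 increasing interior cut points."""
    # Pad with alpha_0 = -infinity and alpha_m = +infinity, as in equation (2)
    alphas = [-math.inf] + list(cutpoints) + [math.inf]
    cdf = [std_normal_cdf(a - xb) for a in alphas]
    # Equation (3): difference of the CDF at adjacent cut points
    return [cdf[j] - cdf[j - 1] for j in range(1, len(alphas))]

# Hypothetical values: index X_i * beta = 0.3, cut points alpha_1 = -0.5, alpha_2 = 0.8
probs = ordered_probit_probs(0.3, [-0.5, 0.8])
for label, p in zip(["Not too happy", "Fairly happy", "Very happy"], probs):
    print(label, round(p, 3))
```

Because the outer cut points are infinite, the three probabilities always sum to 1; raising the index shifts probability mass from the lowest category towards the highest.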

Logit model results
- -2 Log Likelihood
- Cox & Snell R²
- Nagelkerke R²
- Classification table

SPSS command for logit
Analyze---Regression---Binary Logistic
Define the dependent and independent variables

Logit model
χ² = (-2LL_0) - (-2LL_M)
-2LL_0 = -2LL for the baseline or null model
-2LL_M = -2LL for the model after the variable(s) were entered
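The model chi-square is a simple difference, and for a single added variable its p-value has a closed form. The sketch below assumes exactly one variable was entered (1 degree of freedom); the function name and the two -2LL values are hypothetical and do not come from the workshop output.

```python
import math

def likelihood_ratio_test(neg2ll_null, neg2ll_model):
    """Model chi-square and its p-value, assuming 1 degree of freedom."""
    chi2 = neg2ll_null - neg2ll_model
    if chi2 < 0:
        raise ValueError("the fitted model should not have a larger -2LL than the null model")
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical values of -2LL for the null model and the fitted model
chi2, p_value = likelihood_ratio_test(69.2, 64.2)
print(round(chi2, 2), round(p_value, 4))
```

With more than one variable entered, the degrees of freedom equal the number of added parameters and the p-value has to be taken from the corresponding chi-square distribution instead.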

Logit model

Logit model
Now we can calculate logit(pass), the odds and the probability of passing for a person who scores 1 or 5. For a score of 1:
Logit(pass) = a + Bx = -1.314 + 0.467(1) = -0.847
Odds = exp(logit(pass)) = exp(-0.847) = 0.429
Prob. = odds/(1 + odds) = 0.429/(0.429 + 1) = 0.30
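The same three steps can be repeated for any score. The sketch below uses the intercept and coefficient quoted on the slide (-1.314 and 0.467); the function name `predict_pass` is hypothetical.

```python
import math

A = -1.314  # intercept, as quoted on the slide
B = 0.467   # coefficient on Score, as quoted on the slide

def predict_pass(score, a=A, b=B):
    """Return (logit, odds, probability) of passing for a given score."""
    logit = a + b * score
    odds = math.exp(logit)
    prob = odds / (1 + odds)
    return logit, odds, prob

for score in (1, 5):
    logit, odds, prob = predict_pass(score)
    print(score, round(logit, 3), round(odds, 3), round(prob, 2))
```

For a score of 1 this reproduces the figures above; for a score of 5 the logit is positive, so the odds exceed 1 and the predicted probability of passing is above 0.5.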

Activity 2
1. Analyse the previous model with two independent variables (Score and Experience)
2. Formulate a logit model
3. Interpret the results

Multinomial logit model
- Nominal or categorical dependent variable
- Dependent variable cannot be ordered
- More than two categories
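With more than two unordered categories, the usual multinomial logit formulation gives each category its own linear index and converts the indices to probabilities that sum to 1, with one category serving as the base. The sketch below illustrates that calculation for a single explanatory variable; the function name, the category labels and all coefficients are hypothetical.

```python
import math

def multinomial_logit_probs(x, coefs):
    """Category probabilities for one explanatory variable x.

    coefs maps each category to (intercept, slope); the base category uses (0, 0).
    """
    index = {k: a + b * x for k, (a, b) in coefs.items()}
    # Subtract the largest index before exponentiating, for numerical stability
    largest = max(index.values())
    exps = {k: math.exp(v - largest) for k, v in index.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical coefficients for three unordered categories; "A" is the base category
coefs = {"A": (0.0, 0.0), "B": (0.5, -0.2), "C": (-1.0, 0.4)}
probs = multinomial_logit_probs(2.0, coefs)
for category, p in probs.items():
    print(category, round(p, 3))
```

Each non-base coefficient is interpreted relative to the base category, so changing the base changes the coefficients but not the predicted probabilities.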

Probit model
Analyze---Regression---Probit
Use example file Libongdata

Thank you very much (Muito obrigado)
Thanawit Bunsit, Department of Economics, University of Bath. Presented at Universidade Federal Fluminense, RJ, Brazil, 20th May, 2011.