Advanced data analysis


1 Advanced data analysis M.Gerolimetto Dip. di Statistica Università Ca Foscari Venezia, margherita

2 PART 2: LOGISTIC REGRESSION

3 Dichotomous dependent variables
When the dependent variable of a multiple regression model is qualitative and must be expressed by a dummy variable, special estimation problems arise. An example is explaining whether or not an individual will buy a car: the dependent variable Y takes value 1 if he/she buys the car and 0 if he/she does not. The explanatory variables can be qualitative variables represented by dummies (for example the characteristics of the car) as well as quantitative variables (for example the price of the car and the income of the person). The predicted values of the dependent variable will fall mainly in the interval [0,1], so they can be interpreted as the probability that the individual will buy the car, given the characteristics of the car, the income, etc. However, a straight line is unbounded, so some predictions fall outside [0,1], and approximating this relationship by a line produces a very bad fit.

4 In order to avoid probabilities outside the [0,1] range, a NONLINEAR MODEL is used instead of a linear one: it squeezes the probabilities into the (0,1) range. So, in place of a linear function (as in linear multiple regression) a nonlinear function is used. The most common nonlinear functions, both very appropriate for this framework, are the LOGISTIC FUNCTION and the CUMULATIVE NORMAL FUNCTION.
Linear function > usual multiple regression model, Y = Xβ + u
Logistic function > logistic regression
Cumulative normal function > probit regression
In the last two cases the model is P(Y = 1) = f(Xβ). It is called logistic or probit regression depending on the choice of f(), here called the link function (in GLM terminology the link is, strictly, the inverse of f).
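As a rough illustration of the three choices of f listed on this slide, the following Python sketch evaluates a linear, a logistic and a cumulative normal function on the same index values. The function names and the grid of z values are our own illustrative choices, not part of the slides.

```python
import math

def linear(z):
    """Identity function: the fitted value is the index itself (unbounded)."""
    return z

def logistic(z):
    """Logistic function e^z / (1 + e^z), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Standard normal cumulative distribution function, used by probit."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical index values z = x*beta, chosen only for illustration.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  linear={linear(z):+.3f}  "
          f"logistic={logistic(z):.3f}  normal_cdf={normal_cdf(z):.3f}")
```

The linear column leaves the [0,1] range for large |z|, while the other two columns never do.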

5 A hidden error term
In logistic (probit) regression the error term does not appear explicitly in the model; it is implicit. The formulation comes from the random utility model, which states that an individual buys a certain product (Y = 1) if the utility connected to it exceeds a threshold, or is the highest among the products in the same consideration set. This utility is expressed by a linear function that includes an error term (!!). So P(Y = 1) is equivalent to P(U > 0), where U = Xβ + u. Hence P(Y = 1) = P(Xβ + u > 0), and this probability is modeled by f(Xβ), where f is determined by the distribution assumed for u (logistic for the logit model, normal for the probit model). The error term is implicit in writing P(Y = 1), so it does not appear explicitly in the logit or probit model. NB The error term is always in the model even though it is not evident!!!
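The role of the hidden error term can be checked with a small simulation. The sketch below assumes a logistic error u and a hypothetical index value Xβ = 0.8; it draws many utilities U = Xβ + u and compares the share of cases with U > 0 to the logistic function evaluated at Xβ. All names and numbers are illustrative assumptions.

```python
import math
import random

def logistic(z):
    """Logistic function e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_purchase_share(xb, n_draws=100_000, seed=0):
    """Share of simulated individuals with utility U = xb + u above zero."""
    rng = random.Random(seed)
    buys = 0
    for _ in range(n_draws):
        # Draw a standard logistic error by inverting its distribution function.
        v = rng.uniform(1e-12, 1.0 - 1e-12)
        u = math.log(v / (1.0 - v))
        if xb + u > 0:
            buys += 1
    return buys / n_draws

xb = 0.8  # hypothetical value of the index X*beta
share = simulate_purchase_share(xb)
print(f"simulated P(U > 0) = {share:.3f}, logistic(xb) = {logistic(xb):.3f}")
```

Up to simulation noise the two numbers agree, which is why the error term can stay out of sight in the formula P(Y = 1) = f(Xβ).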

6 Violating the assumptions of multiple regression
The binary nature of the dependent variable leads to the violation of some assumptions of the multiple regression model:
1. The Ys are not normal; they follow a binomial (Bernoulli) distribution instead. The inference of the multiple regression model, based on the assumption of normality, loses its validity here.
2. The variance of a binomial variable is not constant: it depends on the probability P(Y = 1), which varies with X. So the hypothesis of homoskedasticity is also violated.
3. The conditional expectation E(Y | X) is fitted by a function that is nonlinear.
Logistic regression (and the probit model as well) deals with these problems by introducing a different approach to estimating the coefficients, interpreting them and assessing the goodness of fit of the model (= no more ordinary least squares estimation, no more R^2).

7 Inference
When estimating the coefficients in the logistic regression context, standard OLS inference cannot be used. One possibility is to estimate the coefficients by maximum likelihood (ML). The ML procedure requires maximizing the likelihood function, which amounts to choosing the coefficient estimates that best agree with the empirical evidence given by the sample. Note that maximum likelihood estimates are often obtained with an efficient numerical algorithm called Iteratively Reweighted Least Squares.
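To give an idea of how such an algorithm works, here is a minimal pure-Python sketch for a model with an intercept and a single regressor. It applies Newton-type updates, which for the logistic model coincide with the Iteratively Reweighted Least Squares steps. The data, the function name and the stopping rule are illustrative assumptions, not the routine used by any particular software.

```python
import math

def fit_logistic_irls(x, y, n_iter=25, tol=1e-10):
    """ML estimates (b0, b1) for P(Y=1) = logistic(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Fitted probabilities and the weights p*(1-p) at the current estimates.
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector X'(y - p).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Weighted cross-product matrix X'WX (2x2, symmetric).
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Solve the 2x2 system (X'WX) d = X'(y - p) for the update d.
        d0 = (c * g0 - b * g1) / det
        d1 = (a * g1 - b * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical toy sample (not linearly separable, so the ML estimate exists).
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic_irls(x_demo, y_demo)
print(f"b0 = {b0_hat:.4f}, b1 = {b1_hat:.4f}")
```

At convergence the score vector is zero, which is the first-order condition for a maximum of the likelihood.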

8 The logistic function
Suppose we have a sample of n units with observations on a dichotomous variable Y and a quantitative variable X, and we want to explain Y with X using a logistic regression. The link function is then the logistic function (usually indicated by g), which for a generic variable z is:
\[ g(z) = \frac{e^{z}}{1 + e^{z}} \]
In the logistic regression context, this function is exploited as follows:
\[ P(Y = 1) = \frac{e^{X\beta}}{1 + e^{X\beta}}, \qquad P(Y = 0) = 1 - P(Y = 1) = \frac{1}{1 + e^{X\beta}} \]
Once the estimate of the parameter β is obtained, call it b, the fitted values are
\[ \hat{P}(Y = 1) = \frac{e^{Xb}}{1 + e^{Xb}}, \qquad \hat{P}(Y = 0) = \frac{1}{1 + e^{Xb}} \]

9 The coefficients
Because of the nonlinear functional form, the marginal effect of an explanatory variable on the dependent variable is not given by that variable's coefficient, but by a suitable function of the coefficient. One way to interpret the coefficient is to consider the expression:
\[ \log\frac{P(Y = 1)}{P(Y = 0)} = \log\frac{\dfrac{e^{X\beta}}{1 + e^{X\beta}}}{\dfrac{1}{1 + e^{X\beta}}} = X\beta \]
The logarithm of the ratio between the probability that the event occurs and the probability that it does not occur (this ratio is the ODDS) is called the LOGIT transform. Hence each coefficient β is to be read as the change in the logit following a unit change in the corresponding explanatory variable.
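The difference between the effect on the logit and the effect on the probability can be made concrete. The sketch below assumes one regressor with hypothetical coefficients b0 = -2 and b1 = 0.5. It checks that the logit changes by exactly b1 for a unit change in x, and computes the marginal effect on the probability, which for the logistic model is b1 * p * (1 - p) and therefore depends on x.

```python
import math

def logistic(z):
    """Logistic function e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Logit transform: log of p / (1 - p)."""
    return math.log(p / (1.0 - p))

def marginal_effect(b0, b1, x):
    """Derivative of P(Y=1) with respect to x in a one-regressor logit model."""
    p = logistic(b0 + b1 * x)
    return b1 * p * (1.0 - p)

b0, b1 = -2.0, 0.5  # hypothetical coefficients, for illustration only
for x in (0.0, 4.0, 8.0):
    p_x = logistic(b0 + b1 * x)
    p_next = logistic(b0 + b1 * (x + 1.0))
    print(f"x={x:.0f}  P(Y=1)={p_x:.3f}  "
          f"change in logit={logit(p_next) - logit(p_x):.3f}  "
          f"marginal effect={marginal_effect(b0, b1, x):.3f}")
```

The change in the logit is the same at every x, while the marginal effect on the probability is largest where P(Y = 1) is close to 0.5.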

10 Odds ratio
Another way to interpret the coefficients is to consider only the ratio between the probability that the event occurs and the probability that it does not occur, i.e. the odds:
\[ \frac{P(Y = 1)}{P(Y = 0)} = \frac{\dfrac{e^{X\beta}}{1 + e^{X\beta}}}{\dfrac{1}{1 + e^{X\beta}}} = e^{X\beta} \]
In this case the exponentiated coefficients reflect changes in the odds following a unit change in the explanatory variable: a unit increase multiplies the odds by e^β, which is the odds ratio.
Coefficients (β) are particularly useful to determine the sign of the relationship: a positive coefficient indicates that a unit increase in X is connected with an increase in the predicted probability, and vice versa.
Exponentiated coefficients (e^β) are particularly useful to express the magnitude of the relationship: the impact is multiplicative, which means that they tell us how many times bigger (or smaller) P(Y = 1) becomes relative to P(Y = 0).
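The multiplicative reading can be written out for the simple case of an intercept and a single regressor x (the symbols β_0 and β_1 are our notation for that special case). Comparing the odds at x + 1 with the odds at x:

```latex
\[
\frac{\text{odds}(x + 1)}{\text{odds}(x)}
  = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}}
  = e^{\beta_1}
\]
```

The ratio does not depend on x. For instance, with a hypothetical β_1 = 0.5 each unit increase in x multiplies the odds by e^{0.5}, roughly 1.65.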

11 Significance of the coefficients
In logistic regression the null hypothesis that a coefficient is not significant is tested similarly to what is done in linear regression. If we consider the β coefficients (i.e. the logit is taken as the dependent variable), a coefficient equal to zero means that the variable has no impact. To confirm this, think of the e^β coefficients (i.e. the odds are taken as the dependent variable). When β is zero, e^β = 1, so a unit change in the variable multiplies the odds P(Y = 1)/P(Y = 0) by 1 and leaves them unchanged: the variable is of no use in making predictions! The hypothesis testing is done with the Wald test (instead of the t test), because the estimation method is not standard OLS.
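In its simplest form the Wald test for a single coefficient compares the ratio between the estimate and its standard error with a standard normal distribution. The sketch below assumes this one-coefficient version and uses hypothetical values for the estimate and its standard error.

```python
import math

def wald_test(b, se):
    """Wald z statistic and two-sided p-value for H0: beta = 0."""
    z = b / se
    # Two-sided tail probability of a standard normal variable beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimate and standard error, for illustration only.
z_stat, p_value = wald_test(0.50, 0.20)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```

A small p-value leads to rejecting the hypothesis that the coefficient is zero.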

12 Goodness of fit 1
One possibility to assess the fit of the estimated model is to use Pseudo R^2 values, which work similarly to the R^2 described for multiple regression analysis. In multiple linear regression, the R^2 is built on the sum of squared residuals, which is also the quantity minimized to obtain the coefficient estimates. Similarly, in logistic regression the Pseudo R^2 is built on the likelihood value. In particular, the fit is measured by the quantity -2LL (LL is the log-likelihood), which is positive and takes value zero in case of perfect fit (log 1 = 0); hence the closer -2LL is to zero, the better the fit.

13 The -2LL value can be used to compare different models. The idea is to compute -2LL for the rival (nested!) models and then choose the model with the lowest -2LL value. It is also possible to test the significance of the difference between the -2LL values of rival models (χ^2 test). To produce an index based on -2LL that is readable as a Pseudo R^2 (something in the (0,1) interval), -2LL can be normalized by comparing the value obtained for the examined model with the value for a hypothetical null model (a very bad one, with only the intercept):
\[ R^{2}_{\text{logit}} = \frac{-2LL_{\text{null}} - (-2LL_{\text{model}})}{-2LL_{\text{null}}} \]
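These quantities are easy to compute once fitted probabilities are available. The sketch below uses hypothetical outcomes and hypothetical fitted probabilities (they are not the result of an actual ML fit) to compute -2LL and the Pseudo R^2. It also shows a χ^2 test for the difference in -2LL between two nested models differing by one coefficient (1 degree of freedom is our assumption), applied to made-up -2LL values.

```python
import math

def minus_two_ll(y, p):
    """-2 times the Bernoulli log-likelihood of outcomes y given probabilities p."""
    ll = sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
             for yi, pi in zip(y, p))
    return -2.0 * ll

def pseudo_r2(y, p_model):
    """Normalized reduction in -2LL relative to the intercept-only model."""
    p_null = sum(y) / len(y)  # the null model fits the sample share of ones
    d_null = minus_two_ll(y, [p_null] * len(y))
    d_model = minus_two_ll(y, p_model)
    return (d_null - d_model) / d_null

def lr_test_1df(d_small, d_large):
    """Chi-square test (1 df) on the difference in -2LL of two nested models."""
    stat = d_small - d_large
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical outcomes and fitted probabilities, for illustration only.
y_obs = [0, 0, 1, 0, 1, 0, 1, 1]
p_fit = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4, 0.8, 0.9]
print(f"-2LL = {minus_two_ll(y_obs, p_fit):.3f}")
print(f"Pseudo R2 = {pseudo_r2(y_obs, p_fit):.3f}")

# Hypothetical -2LL values of a smaller and a larger nested model.
stat, p_val = lr_test_1df(10.0, 10.0 - 3.8415)
print(f"chi-square = {stat:.4f}, p-value = {p_val:.4f}")
```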

14 The Akaike index is another useful tool to compare different models. The formula is:
\[ AIC = -2LL + 2p \]
where p is the number of estimated coefficients. The preferred model is the one with the lowest AIC. Hence AIC not only considers the goodness of fit, but also penalizes overfitting, as it is an increasing function of the number of estimated parameters. The objective is to find the model that best explains the data with a minimum of free parameters. AIC can be used for all models estimated by maximum likelihood, not only for the logistic!
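A tiny numerical example shows the trade-off. The two models and their -2LL values below are hypothetical: model B fits slightly better but uses two more coefficients.

```python
def aic(minus_two_ll, n_params):
    """Akaike index: -2LL plus twice the number of estimated coefficients."""
    return minus_two_ll + 2 * n_params

# Hypothetical -2LL values and numbers of coefficients, for illustration only.
aic_a = aic(100.0, 3)
aic_b = aic(99.0, 5)
preferred = "A" if aic_a < aic_b else "B"
print(f"AIC A = {aic_a}, AIC B = {aic_b}, preferred model: {preferred}")
```

Here the small gain in fit of model B does not compensate for its extra parameters, so model A is preferred.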

15 Goodness of fit 2
Another possibility to assess the goodness of fit is through the concept of predictive accuracy. The idea is to compare actual and fitted values of Y, interpreted in terms of membership of a certain class. A good indicator of the capability of the model to discriminate between groups (for example those who buy the car / those who do not buy the car) is the fraction of zeros correctly predicted plus the fraction of ones correctly predicted. If the resulting sum exceeds 1, the model is of value.
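The indicator can be computed as follows. The sketch assumes that a unit is classified as 1 when its fitted probability is at least 0.5; this cutoff is our assumption, since the slides do not state one, and the data are hypothetical.

```python
def hit_rate_sum(y, p, cutoff=0.5):
    """Fraction of zeros correctly predicted plus fraction of ones correctly predicted."""
    pred = [1 if pi >= cutoff else 0 for pi in p]
    pred_for_zeros = [pr for yi, pr in zip(y, pred) if yi == 0]
    pred_for_ones = [pr for yi, pr in zip(y, pred) if yi == 1]
    frac_zeros_ok = pred_for_zeros.count(0) / len(pred_for_zeros)
    frac_ones_ok = pred_for_ones.count(1) / len(pred_for_ones)
    return frac_zeros_ok + frac_ones_ok

# Hypothetical outcomes and fitted probabilities, for illustration only.
y_obs = [0, 0, 1, 0, 1, 0, 1, 1]
p_fit = [0.1, 0.6, 0.6, 0.3, 0.4, 0.4, 0.8, 0.9]
print(hit_rate_sum(y_obs, p_fit))

# A useless model that predicts 0 for everybody scores exactly 1.
print(hit_rate_sum(y_obs, [0.1] * len(y_obs)))
```

This explains the benchmark of 1: a model that always predicts the same class gets all of one group right and all of the other wrong.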

16 Dummy variables
In logistic regression, too, dummy variables can be among the explanatory variables. Their exponentiated coefficients represent the relative level of the odds for the represented group, compared to the omitted group.
EXAMPLE: Suppose that Y = 1 corresponds to an improvement in professional position. Among the explanatory variables there is gender (male-female), to which a dummy variable (female = 1, male = 0) is associated; only this one dummy is included in the model (Remember multicollinearity!). Suppose 0.78 is the estimated exponentiated coefficient for that dummy variable. The analysis is done for females compared to males (the omitted dummy), hence the odds of improvement for the female group are 0.78 times the odds for the male group. This means that the probability decreases if the person is female rather than male. In multiplicative terms, the odds are reduced by a factor of 0.78.
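Strictly speaking, an exponentiated logit coefficient multiplies the odds rather than the probability itself. The sketch below converts the 0.78 of the example into a probability, assuming a hypothetical probability of improvement of 0.40 for the male (reference) group; that baseline is our assumption and does not come from the slides.

```python
def apply_odds_ratio(p_reference, odds_ratio):
    """Probability in the comparison group, given the reference probability and the odds ratio."""
    odds_ref = p_reference / (1.0 - p_reference)
    odds_new = odds_ratio * odds_ref
    return odds_new / (1.0 + odds_new)

p_male = 0.40  # hypothetical baseline probability for the omitted group
p_female = apply_odds_ratio(p_male, 0.78)
print(f"P(improvement | male)   = {p_male:.3f}")
print(f"P(improvement | female) = {p_female:.3f}")
print(f"ratio of probabilities  = {p_female / p_male:.3f}")
```

The probability for the female group is lower, as the example states, but the ratio of the two probabilities is not 0.78 and changes with the baseline.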

17 If the qualitative dependent variable has more than two categories (not binary, but polychotomous) the model presented in the previous slides has to be generalized. In this case the model is called multinomial logit or multinomial probit. For example, a commuter can choose among private car, bus and train. The result is a model composed of 3 equations (one for each alternative), where the probability of choosing each alternative is expressed by means of a logit model. The disadvantage of the multinomial logit is that it assumes the independence of irrelevant alternatives property: the odds between any two alternatives stay constant even when another alternative is added to the consideration set. This is unrealistic, especially if two or more alternatives are close substitutes.
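The independence of irrelevant alternatives property can be seen directly in the multinomial logit formula, where the probability of an alternative is its exponentiated utility divided by the sum over all alternatives. In the sketch below the utility values and the added alternative "tram" are hypothetical.

```python
import math

def choice_probabilities(utilities):
    """Multinomial logit probabilities from a dict of systematic utilities."""
    exp_u = {name: math.exp(value) for name, value in utilities.items()}
    total = sum(exp_u.values())
    return {name: value / total for name, value in exp_u.items()}

# Hypothetical systematic utilities of each alternative for one commuter.
three = choice_probabilities({"car": 1.0, "bus": 0.5, "train": 0.2})
four = choice_probabilities({"car": 1.0, "bus": 0.5, "train": 0.2, "tram": 0.5})

print(f"car/bus odds with 3 alternatives: {three['car'] / three['bus']:.4f}")
print(f"car/bus odds with 4 alternatives: {four['car'] / four['bus']:.4f}")
```

The car/bus odds are identical in the two cases even though the new alternative has the same utility as the bus and would, realistically, take passengers mainly from it.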

18 GLM models
Generalized Linear Models are a class of statistical models that generalize the classical linear model and include logistic regression. Special cases:
linear regression > when the Ys are Gaussian variables.
logistic regression > when the Ys are binary variables (binomial distribution).
GLMs are very useful to treat nonlinear models that can be expressed in a linear form! NB: standard OLS inference does not hold. In the GLM context inference is mainly based on the likelihood function.

19 The main characteristics of GLM models are:
1. The Ys belong to an extremely general family of distributions called the exponential family, which includes, among others, the well-known Gaussian, binomial, gamma, chi-square and exponential variables.
2. The model can be rendered linear by applying a suitable function (the link function) to the expected value of Y.
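As a worked instance of the two points above, the distribution of a binary Y with p = P(Y = 1) can be rewritten in exponential form (the symbol p is our shorthand):

```latex
\[
P(Y = y) = p^{y} (1 - p)^{1 - y}
         = \exp\left\{ y \log\frac{p}{1 - p} + \log(1 - p) \right\},
\qquad y \in \{0, 1\}
\]
```

The term multiplying y is the logit of p. Setting it equal to the linear index, log(p / (1 - p)) = Xβ, gives exactly the logistic regression model, which is how the logit link renders the model linear.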

20 A real data analysis: Poverty in Italy in year 2006
Poverty is an extremely complex concept. In economics, many different definitions have been proposed: one of the most used is the relative poverty definition. Relative poverty means that a person is considered poor if his/her income is smaller than a certain threshold defined relative to the population (for example the average income of the population). Note that poverty is treated as a binary attribute (poor-not poor) that derives from the dichotomization of a quantitative variable (income). Indeed, poor (Y = 1) are those whose income (I) is less than the fixed threshold. Not poor (Y = 0) are those whose income is bigger than the fixed threshold. To model the POVERTY RISK it is possible to use a logistic regression.

21 The data set we use comes from the Italian Households Expenditure Survey (conducted by ISTAT). The sample (a probabilistic sample) includes more than households that give a representative picture of the Italian population. Information on income is available only in terms of the income quartile to which the household belongs. Hence, total expenditure is used as a proxy for income. The poverty threshold is 60 percent of the median Italian expenditure (across all households). Let us call this threshold K. The generic household i is poor if E_i <= K (E is the expenditure, as a proxy for I), and not poor if E_i > K. So a dummy variable Y is created by dichotomizing the expenditure to describe the relative poverty phenomenon.
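The construction of the dependent variable can be sketched in a few lines. The expenditure figures below are invented for illustration and have nothing to do with the ISTAT survey; only the rule (60 percent of the median, poor if E_i <= K) comes from the slide.

```python
import statistics

def poverty_dummy(expenditures, share_of_median=0.60):
    """Return the threshold K and the dummy Y (1 = poor) for each household."""
    k = share_of_median * statistics.median(expenditures)
    y = [1 if e <= k else 0 for e in expenditures]
    return k, y

# Hypothetical monthly household expenditures, for illustration only.
expenditures = [900, 1100, 1500, 2000, 2500, 3100, 4000]
k, y = poverty_dummy(expenditures)
print(f"threshold K = {k:.1f}")
print(f"poverty dummy Y = {y}")
```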

22 The poverty risk (the probability of being poor, i.e. the probability that Y = 1) has been modeled using a logistic regression where the explanatory variables are:
1. Gender: male, female
2. Education: primary, middle, high, graduation
3. Family dimension: small (1-2), medium (3-4), big (more than 5)
4. Employment status: employed, unemployed, other (e.g. students, retired...)
5. Presence of old people: Yes, No (old means age > 65)
6. Presence of children: Yes, No (children means age < 14)
7. Geographic location: North, Center, South

23 Every qualitative variable has been translated into a set of dummy variables.
1. Gender:
D_m = 1 if the household head is a man, D_m = 0 otherwise
D_f = 1 if the household head is a woman, D_f = 0 otherwise
2. Education:
D_ep = 1 if the household head has a primary school certificate, D_ep = 0 otherwise
D_em = 1 if the household head has a middle school certificate, D_em = 0 otherwise
D_eh = 1 if the household head has a high school certificate, D_eh = 0 otherwise
D_eg = 1 if the household head has graduated, D_eg = 0 otherwise

24 3. Family dimension:
D_ds = 1 if the household has small dimension, D_ds = 0 otherwise
D_dm = 1 if the household has medium dimension, D_dm = 0 otherwise
D_db = 1 if the household has big dimension, D_db = 0 otherwise
4. Employment status:
D_ee = 1 if the household head is employed, D_ee = 0 otherwise
D_eu = 1 if the household head is unemployed, D_eu = 0 otherwise
D_eo = 1 if the household head is classified as other, D_eo = 0 otherwise

25 5. Presence of old people:
D_oy = 1 if there is an old person in the household, D_oy = 0 otherwise
D_on = 1 if there are no old people in the household, D_on = 0 otherwise
6. Presence of children:
D_cy = 1 if there is a child in the household, D_cy = 0 otherwise
D_cn = 1 if there are no children in the household, D_cn = 0 otherwise

26 7. Geographic location:
D_gn = 1 if the family lives in the North of Italy, D_gn = 0 otherwise
D_gc = 1 if the family lives in the Center of Italy, D_gc = 0 otherwise
D_gs = 1 if the family lives in the South of Italy, D_gs = 0 otherwise

27 In order to avoid multicollinearity, one variable has to be excluded from each dummy set. We exclude the variable connected to the category that is most frequent in the population, in order to have a sort of benchmark family to which the comparative analysis refers. The benchmark family is characterized by a male household head who has a primary school certificate and is employed. The household is small, without old people and without children. Can you write the model?
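The exclusion of one dummy per set can be sketched as follows for the education variable, with primary school as the omitted (benchmark) category as stated above. The function name, the category labels and the three example households are illustrative assumptions.

```python
def dummy_code(values, categories, reference):
    """One 0/1 dummy per category, leaving out the reference category."""
    kept = [c for c in categories if c != reference]
    return [{f"D_{c}": 1 if v == c else 0 for c in kept} for v in values]

education_levels = ["primary", "middle", "high", "graduate"]

# Hypothetical education levels of three household heads.
rows = dummy_code(["primary", "high", "graduate"], education_levels,
                  reference="primary")
for row in rows:
    print(row)
```

A household in the benchmark category has all the dummies of the set equal to zero, so its effect is absorbed by the intercept and each remaining coefficient is read as a comparison with the benchmark.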