Economics 140A: Qualitative Dependent Variables

With each of the classic assumptions covered, we turn our attention to extensions of the classic linear regression model. Our first extension is to data in which the dependent variable is limited. Dependent variables that are limited in some way are common in economics, but not all require special treatment. We have seen examples with wages, income or consumption as the dependent variables; all of these must be positive. As these strictly positive variables take numerous values, we found the log transform to be sufficient. Yet not all restrictions on the dependent variable can be handled so easily. If we model individual choice, the optimal behavior of individuals often results in a sizable fraction of the population at a corner solution. For example, a sizable fraction of working age adults do not work outside the home, so the distribution of hours worked has a sizable pile up at zero. If we fit a linear conditional mean, we will likely predict negative hours worked for some individuals. The log transform used for wages will not work, as the log of zero is undefined. Another issue arises with sample selection. It may well be the case that E(Y|X) is linear, but nonrandom sampling requires more detailed inference. Finally, a host of other data issues may arise: linear conditional mean functions that switch over regimes, data recorded as counts, or analysis of durations between events. As we will see, even if only a finite number of values are possible, a linear model for E(Y|X) may still be appropriate. While all these issues may arise, we focus on perhaps the most common restriction, in which the dependent variable is qualitative in nature and so takes discrete values. For this reason, such models are also termed discrete dependent variable models or (less frequently) dummy dependent variable models.
As we recall from our discussion of qualitative regressors, qualitative variables capture the presence or absence of some non-numeric quantity. For example, in studying home ownership the dependent variable is often

Y_t = 1 if household t owns their home; Y_t = 0 otherwise.

Many qualitative variables take more than two values. For example, in studies of employment dynamics the dependent variable can take three values:

Y_t = 1 if individual t is employed; Y_t = 0 if individual t is unemployed but seeking employment; Y_t = −1 if individual t is not in the labor force (i.e. not seeking employment).
We focus attention on qualitative dependent variables that take only two values and for ease set these values to 0 and 1. In binary response models, interest is primarily in

p(X) ≡ P(Y = 1|X) = P(Y = 1|X_1, …, X_K),

for various values of X. For a continuous regressor X_j, the partial effect of X_j on the response probability is ∂P(Y = 1|X)/∂X_j. When multiplied by ΔX_j (for small ΔX_j), the partial effect yields the approximate change in P(Y = 1|X) when X_j increases by ΔX_j, holding all other regressors constant. For a discrete regressor X_K, the partial effect is

P(Y = 1|x_1, …, x_{K−1}, X_K = 1) − P(Y = 1|x_1, …, x_{K−1}, X_K = 0).

Perhaps the most natural extension of the classic linear regression model is to leave the structure of the population model unchanged, so that

Y_t = β_0 + β_1 X_t + U_t.

How does the presence of a qualitative dependent variable affect our analysis? In quite substantial ways. Consider the familiar relation

E(Y_t|X) = β_0 + β_1 X_t.

Because Y_t takes only the values 0 and 1,

E(Y_t|X) = P(Y_t = 1|X) ≡ p(X) = β_0 + β_1 X_t,

and so the conditional mean is a probability and the model is termed the linear probability model. The coefficient β_1 is interpreted as the effect of a one unit change in X_t on the probability that Y_t = 1. Similarly, if X_t is a binary regressor, then β_1 captures the effect of moving from X_t = 0 to X_t = 1. As there is no reason to believe that as X_t varies the conditional mean will remain between 0 and 1, equality between the conditional mean and P(Y_t = 1) is a substantial drawback of the linear probability model. As one can deduce, it is hard to fit points that are clustered at 0 and 1 (on the y-axis) with a straight line, so the R-squared measure is not reliable. (Draw a graph with the points clustered at 0 and 1 on the y-axis and a straight line attempting to fit them.)
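As a quick illustration of this drawback, the following sketch (simulated data, with illustrative parameter values of my choosing) fits a linear probability model by OLS and counts how many fitted probabilities fall outside the unit interval:

```python
import numpy as np

# Sketch of the linear probability model on simulated data; all names
# and parameter values here are illustrative, not from the lecture.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(0.0, 2.0, n)
p_true = np.clip(0.5 + 0.2 * x, 0.0, 1.0)   # true response probability
y = rng.binomial(1, p_true)                  # binary dependent variable

# OLS of y on a constant and x: the fitted line estimates P(Y = 1 | x).
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta

# The drawback described above: a straight line is not confined to [0, 1].
n_outside = int(np.sum((fitted < 0) | (fitted > 1)))
print(beta, n_outside)
```

With a regressor that ranges widely, a nontrivial share of fitted values typically escapes [0, 1], which is exactly the failure the graph exercise is meant to show.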
Several other features of the linear probability model are easily obtained. Because Y_t is a Bernoulli random variable, U_t is a two-point random variable:

U_t = 1 − (β_0 + β_1 X_t) with probability β_0 + β_1 X_t,
U_t = −(β_0 + β_1 X_t) with probability 1 − (β_0 + β_1 X_t).

From the definition of U_t it is clear that the error has mean 0 and variance

E U_t² = (β_0 + β_1 X_t)² [1 − (β_0 + β_1 X_t)] + [1 − (β_0 + β_1 X_t)]² (β_0 + β_1 X_t) = (β_0 + β_1 X_t)[1 − (β_0 + β_1 X_t)].

Thus the error term is heteroskedastic and two-valued rather than Gaussian, violating two of the classic assumptions. The OLSE is unbiased and consistent for the linear probability model, although robust standard errors are needed to account for the heteroskedasticity. As an aside, the test statistic for H_0: β_1 = … = β_K = 0 can be accurately constructed from the OLSE, as under H_0 the error is homoskedastic, with E U_t² = β_0(1 − β_0). To improve the efficiency of the OLSE, construct the weighted least squares estimator. Let Y_t^P denote the predicted value of the dependent variable constructed from the OLSE. If 0 < Y_t^P < 1 for all t, then form the estimate of the error standard deviation

S_t = [Y_t^P (1 − Y_t^P)]^{1/2}.

The weighted least squares estimator is obtained from the model

Y_t / S_t = β_0 (1/S_t) + β_1 (X_t/S_t) + U_t/S_t.

Again, the reported standard errors are valid, as follows from our earlier treatment of weighted least squares. If Y_t^P ∉ (0, 1) for some t, then WLS is infeasible without an ad hoc adjustment and should not be done. The linear probability model is a convenient approximation and generally gives good estimates of the partial effects on the response probability near the center of the regressor distribution. If one wishes to know the partial effect, averaged over the values of X, then the linear probability model may work well even if it gives poor estimates of the partial effects for extreme values of X.

Example (Married Women's Labor Force Participation). In a survey of 753 women, 428 report working more than zero hours. Also, 606 have no young children while 118 have exactly one young child. The variables in play are
inlf     binary; value 1 indicates non-zero working hours
nonwinc  non-wife income, in thousands of dollars
ed       education
ex       experience
k6       number of children less than 6 years old
k+       number of children between 6 and 18, inclusive

The estimated regression is

inlf^P = .586 − .0034 nonwinc + .038 ed + .039 ex − .0006 ex² − .016 age − .262 k6 + .013 k+
(OLS se)    (.154)  (.0014)  (.007)  (.006)  (.00018)  (.002)  (.034)  (.013)
[robust se] [.151]  [.0015]  [.007]  [.006]  [.00019]  [.002]  [.032]  [.013]

R² = .264.

Except for k+, all regressor coefficients have sensible signs and are statistically significant. The regressor k+ is neither statistically significant nor practically important. Also, the OLS and robust standard errors are almost identical. Interpretation: an increase in non-wife income of $10,000 reduces participation in the labor force by only .034 (3.4 percentage points). As the sample mean of non-wife income is only $20,129 with a standard deviation of $11,635, a $10,000 increase is quite substantial. Having one more small child is a first-order effect, reducing the probability of being in the labor force by 26.2 percentage points. Finally, of the 753 fitted values, 33 lie outside the unit interval (hence we do not construct WLS estimators). The case for linear probability models grows stronger if most regressors are discrete and take only a few values, so that there are no extreme values. To understand how to construct a discrete regressor from a continuous regressor, return to the preceding example. Partition the variable k6 into three indicator variables: I_0 = 1(0 young children), I_1 = 1(1 young child), and I_2 = 1(2 or more young children). We replace k6 with (I_1, I_2) to allow the impact of the first young child to differ and obtain the estimated coefficients −.263 for I_1 and −.274 for I_2. It appears that the key impact is having one young child; additional young children do not change labor force participation much. The use of discrete regressors is familiar to us from our discussion of regressor specification.
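The WLS construction described above, together with the feasibility check on the fitted values (the step that fails for the 33 out-of-range fits in the survey), can be sketched as follows. The data are simulated, not the survey; variable names and parameter values are illustrative.

```python
import numpy as np

# Sketch of the WLS step for the linear probability model, including the
# feasibility check on the fitted values; simulated, illustrative data.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1.0, 1.0, n)          # bounded regressor keeps fits in (0, 1)
y = rng.binomial(1, 0.5 + 0.3 * x)

X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ b_ols

# WLS is feasible only if every fitted value lies strictly inside (0, 1);
# otherwise stop (as in the survey example, where 33 fits are out of range).
if np.all((fitted > 0.0) & (fitted < 1.0)):
    s = np.sqrt(fitted * (1.0 - fitted))          # estimated std. dev. S_t
    # Divide the model (including the intercept column) by S_t, then run OLS.
    b_wls = np.linalg.lstsq(X / s[:, None], y / s, rcond=None)[0]
else:
    b_wls = None
print(b_ols, b_wls)
```

Because the regressor here is bounded, the fitted values stay inside the unit interval and the WLS step goes through; with a wide-ranging regressor, the check would typically fail and one would stay with OLS plus robust standard errors.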
For the model with union-gender interactions, the predicted values correspond to cell averages. When a model has enough indicator variables and interactions to fit cell averages,
the model is saturated. A new fact about saturated models emerges here: if a model is saturated, then the predicted probabilities must lie between 0 and 1 (because they mimic cell averages). For estimates of the partial effects for extreme values of the regressors, we must develop a new framework. To do so, observe that the linear probability model falls within a broader class of models, termed single index models. The term single index arises because the various regressors affect Y_t through the scalar X_t′β, which is the single index. The class of single index models is

Y_t = F(X_t′β) + U_t,

where the linear probability model is given by F(X_t′β) = X_t′β. To overcome the problem that the predictions for Y_t can lie outside the unit interval, we constrain F so that

0 < F(z) < 1 for all z.

Given this constraint, a natural choice for F is a cumulative distribution function (although it is not necessary to use a CDF). Index models in which F is a CDF are derived from a latent model. The latent model concerns a variable that underlies the decision and cannot be observed. If Y_t measures whether or not household t owns a home, then the latent variable Y*_t captures the desire of household t to own a home. If the desire is high enough, then household t owns a home. Or, put another way, Y*_t captures the difference in utility between the two options, namely owning and renting, in which case if Y*_t is positive, then the utility from owning exceeds that from renting and household t purchases a home. The (latent) population model that explains the latent variable is

Y*_t = β_0 + β_1 X_t + V_t,

where {V_t}_{t=1}^n is a sequence of i.i.d. random variables that are symmetric about their mean of 0, with variance σ² and distribution F. (Symmetry is convenient for the derivation below, but V_t does not have to be symmetric about zero.) The latent variable and the observed variable are linked as

Y_t = 1 if Y*_t > 0; Y_t = 0 otherwise.

From the measurement rule we can see that if we multiply Y*_t by any positive constant, then Y_t is unchanged.
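The measurement rule and its scale invariance can be illustrated directly. Everything below is simulated, with illustrative parameter values:

```python
import numpy as np

# Sketch of the latent-variable link Y = 1(Y* > 0), and of the scale
# invariance noted above: multiplying Y* by any positive constant
# leaves the observed Y unchanged. All values are illustrative.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
v = rng.normal(size=n)                 # latent error, symmetric about 0
y_star = -0.2 + 0.8 * x + v            # latent index Y*_t
y = (y_star > 0).astype(int)           # observed binary outcome Y_t

# Rescaling the latent variable by any c > 0 changes nothing observable,
# which is why the coefficients are identified only relative to scale.
for c in (0.5, 2.0, 10.0):
    assert np.array_equal(y, (c * y_star > 0).astype(int))
print(y[:10])
```

This is precisely the reason the normalization σ = 1 (introduced next) costs nothing: no choice of positive scale is distinguishable in the observed data.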
As a result, we can only estimate β_0 and β_1 up to a positive multiple (that is, relative to scale). To identify the coefficients, we
set σ = 1. We also see that if the threshold is c ≠ 0, then we return to a zero threshold simply by subtracting c from β_0. To identify the intercept, we set c = 0. To construct an estimator of the coefficients, note

p(X) = P(Y_t = 1|X) = P(Y*_t > 0|X) = P(β_0 + β_1 X_t + V_t > 0) = P(V_t > −(β_0 + β_1 X_t)) = P(V_t ≤ β_0 + β_1 X_t) = F(β_0 + β_1 X_t),

where the penultimate equality follows from symmetry about 0. The main criticism of the linear probability model has been addressed; because the distribution function is contained in [0, 1], so too is the probability that Y_t equals 1. The presence of the latent model can give one the impression that we are interested in the effect of X on Y*, which is given by β_1. Yet Y* rarely has sensible units of measurement (desire to own a home, or differences in utility), so the magnitude of β_1 is not generally important. Rather, our goal is to explain the effect of X on the response probability p(X). To understand the link, note that if X is a continuous regressor, then

∂p(X)/∂X = f(β_0 + β_1 X) β_1, where f(z) = dF(z)/dz.

If F is a strictly increasing function (as is true for Gaussian and logistic CDFs), then f(z) > 0 for all z and the sign of β_1 determines the direction of the effect on the response probability. Observe that the magnitude of the effect depends on the value of the regressor, through f(β_0 + β_1 X). If the underlying density is unimodal and symmetric about 0, the maximum value of f(z) occurs at z = 0. For the leading cases:

Probit: f(z) = (1/√(2π)) e^{−z²/2}, with f(0) = 1/√(2π) ≈ .399.
Logit: f(z) = e^{−z} / (1 + e^{−z})², with f(0) = .25.

For the multiple regressor model, relative effects do not vary with X:

[∂p(X)/∂X_j] / [∂p(X)/∂X_i] = β_j / β_i.
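The density values quoted for the two leading cases can be checked numerically, and the same densities scale the partial effects, which therefore shrink in the tails of the index. The slope value below is an illustrative assumption:

```python
import numpy as np
from scipy.stats import norm, logistic

# Check of the density values quoted above: the probit density at zero
# is 1/sqrt(2*pi) ≈ .399 and the logit density at zero is exactly .25.
probit_f0 = norm.pdf(0.0)
logit_f0 = logistic.pdf(0.0)
print(probit_f0, logit_f0)

# Partial effect of a continuous regressor at index value z is f(z)*beta_1,
# so the effect is largest at z = 0 and shrinks as |z| grows.
beta1 = 0.8                            # illustrative slope
for z in (0.0, 1.0, 2.0):
    print(z, norm.pdf(z) * beta1)
```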
If X is a discrete regressor, then the impact of an increase from c to c + 1 in X on the response probability is

F(β_0 + β_1(c + 1)) − F(β_0 + β_1 c).

If X is an indicator regressor, then c = 0.

Example (Effect of Job Training). Let p(X) be the probability of employment and X_K an indicator of participation in job training. The direction of the job training effect is the sign of β_K, while the magnitude of the effect differs depending on age, education, and experience (the other included regressors).

Finally, consider the model with index

β_0 + β_1 X_{1t} + β_2 X²_{1t} + β_3 ln X_{2t}.

The partial effect of X_1 on the response probability is f(·)(β_1 + 2β_2 X_{1t}), where f(·) is evaluated at the index, so the direction of the partial effect potentially changes at X_1 = −β_1/(2β_2). The partial effect of ln X_2 on the response probability is f(·)β_3. Because d ln X_2 = dX_2/X_2, a 1 percent change in X_2 is a .01 change in ln X_2. Therefore the partial effect of a 1 percent change in X_2 on the response probability is f(·)(.01 β_3).

Given the distributional assumptions on V_t, the maximum likelihood estimator arises naturally. The distribution of Y_1 is used to form the likelihood as

L(β_0, β_1 | Y_1 = y_1, X_1 = x_1) = [F(β_0 + β_1 x_1)]^{y_1} [1 − F(β_0 + β_1 x_1)]^{1−y_1}.

The likelihood for the sample is

L[β_0, β_1 | (Y_1, X_1) = (y_1, x_1), …, (Y_n, X_n) = (y_n, x_n)] = ∏_{t=1}^n [F(β_0 + β_1 x_t)]^{y_t} [1 − F(β_0 + β_1 x_t)]^{1−y_t}.
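Numerically, the sample likelihood above is maximized by minimizing the negative of its logarithm. The sketch below takes F to be the standard Gaussian CDF (the probit case discussed shortly); the data are simulated and the true parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hedged sketch of ML estimation for the binary response model with
# F = standard Gaussian CDF (probit); simulated, illustrative data.
rng = np.random.default_rng(4)
n = 800
x = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(0.3 + 1.0 * x))   # true (b0, b1) = (0.3, 1.0)
X = np.column_stack([np.ones(n), x])

def neg_loglik(b):
    # ln L = sum_t [ y_t ln F_t + (1 - y_t) ln(1 - F_t) ];
    # clip F away from 0 and 1 so the logs stay finite numerically.
    F = np.clip(norm.cdf(X @ b), 1e-12, 1.0 - 1e-12)
    return -np.sum(y * np.log(F) + (1.0 - y) * np.log(1.0 - F))

res = minimize(neg_loglik, np.zeros(2), method="BFGS")
print(res.x)
```

The optimizer stands in for the nonlinear solution technique mentioned later in the notes; any routine that drives the scores to zero yields the same estimates.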
The log-likelihood is

ln L(β_0, β_1 | ·) = Σ_t ( y_t ln F(β_0 + β_1 x_t) + (1 − y_t) ln[1 − F(β_0 + β_1 x_t)] ).

(We are able to construct ln L because of the strict inequality 0 < F(z) < 1.) The first-order condition for the estimator of the coefficients is the partial derivative of the log-likelihood with respect to each of the coefficients and is termed the score. The ML estimators B_0 and B_1 are the values that set the scores equal to zero:

∂ ln L(β_0, β_1)/∂β_i |_{β_i = B_i} = Σ_t [ (y_t − F(B_0 + B_1 x_t)) / ( F(B_0 + B_1 x_t)[1 − F(B_0 + B_1 x_t)] ) ] f(B_0 + B_1 x_t) x_{i,t} = 0,

for i = 0, 1 (with x_{0,t} = 1 and x_{1,t} = x_t). The MLE is consistent and asymptotically Gaussian. To determine the covariance matrix for the estimators, we construct the expected value of the Hessian conditional on X. Because E(U|X) = 0, the terms involving the error drop out in expectation and

−E[ ∂² ln L(β_0, β_1)/∂β ∂β′ | X ] = Σ_t [ f(β_0 + β_1 x_t)² / ( F(β_0 + β_1 x_t)[1 − F(β_0 + β_1 x_t)] ) ] x_t x_t′,

which is a positive semi-definite matrix. The estimator of the asymptotic variance of B is

V = [ Σ_t f̂_t² x_t x_t′ / ( F̂_t (1 − F̂_t) ) ]^{−1}.

If the inverse exists, the matrix is positive definite. If the inverse does not exist, the problem is likely multicollinear regressors. It does not make sense to compute robust standard errors. The reason: in the latent model we specify all conditional
moments of Y|X. Therefore, if we believe the variance is misspecified, then the conditional mean must be misspecified as well. If we follow the classic regression model and assume that V_t is Gaussian, then F(·) is the distribution function Φ of a standard Gaussian random variable. The resulting ML estimators are termed probit estimators (because Φ is termed the probit function in statistics) and are obtained through nonlinear optimization (as the score function is not a linear function of the coefficient estimators). Because the Gaussian distribution function cannot be expressed in closed form (that is, an integral must be used), many researchers assume that V_t has a logistic distribution. While the logistic density function is similar to the Gaussian and differs only in the tails, the logistic distribution function can be expressed in closed form as

F(β_0 + β_1 x_t) = exp(β_0 + β_1 x_t) / [1 + exp(β_0 + β_1 x_t)].

The closed form expression for the logistic distribution delivers a simplified likelihood as well:

L(β_0, β_1 | ·) = ∏_{t=1}^n [ exp(β_0 + β_1 x_t) / (1 + exp(β_0 + β_1 x_t)) ]^{y_t} [ 1 / (1 + exp(β_0 + β_1 x_t)) ]^{1−y_t}
               = exp( β_0 Σ_t y_t + β_1 Σ_t x_t y_t ) / ∏_{t=1}^n [1 + exp(β_0 + β_1 x_t)].

Thus

ln L(β_0, β_1 | ·) = β_0 Σ_t y_t + β_1 Σ_t x_t y_t − Σ_t ln[1 + exp(β_0 + β_1 x_t)].

The ML estimators, which are termed logit estimators, are the values B_0 and B_1 that solve

∂ ln L(β_0, β_1)/∂β_i |_{β_i = B_i} = Σ_t [ y_t − exp(B_0 + B_1 x_t) / (1 + exp(B_0 + B_1 x_t)) ] x_{i,t} = 0,

for i = 0, 1 with x_{0,t} = 1 and x_{1,t} = x_t. (As for the probit estimators, a nonlinear solution technique must be used.) An immediate consequence of the score for β_0 is that

Σ_t y_t = Σ_t P̂(y_t = 1),
that is, the observed frequency of y_t = 1 equals the predicted frequency (here P̂(y_t = 1) = exp(B_0 + B_1 x_t)/[1 + exp(B_0 + B_1 x_t)]). (Note, the same feature holds for the linear probability model, because the OLS coefficient estimators satisfy the relation that the sum of observed values of the dependent variable, which yields the observed frequencies, equals the sum of predicted values of the dependent variable, which yields the predicted frequencies.) One additional advantage of the logistic assumption is that

ln[ F(β_0 + β_1 x_t) / (1 − F(β_0 + β_1 x_t)) ] = β_0 + β_1 x_t.

The slope coefficient is interpreted as the effect of a one unit change in the regressor on the logarithm of the odds ratio, where the odds ratio yields the probability of a success (Y = 1) divided by the probability of a failure (Y = 0). For the special case in which we have a number of observations for each value of the regressor, a simpler estimator can be constructed. Suppose that the regressor takes K distinct values and that there are n_k observations on each value. For each of the distinct regressor values calculate the observed frequency of success, that is, construct p̂_k = (1/n_k) Σ y_t. The estimator of the coefficients is then obtained as the OLS coefficient estimator for the regression model

ln[ p̂_k / (1 − p̂_k) ] = β_0 + β_1 x_k + u_k,

for k = 1, …, K. The method is sensible if n_k is large for each k. If n_k is not constant across k, then the error is heteroskedastic and weighted least squares should be used. To perform hypothesis tests, any of the three test statistics can be used. As the tests are asymptotically equivalent (and the finite sample comparisons are specific to the model), simply choose the statistic that is easiest to compute. We begin with tests of exclusion restrictions, of which the leading example is the need to include additional regressors Z (perhaps indicators for region or industry). Set

Y_t = F(X_t β + Z_t γ) + U_t.

(If Z_t consists only of functions of X_t, we have a pure functional form test.)
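Returning to the grouped-data estimator described above, a minimal sketch follows: each distinct regressor value defines a cell, the empirical log odds are computed per cell, and OLS on the K cell-level observations recovers the coefficients. The cell values and true parameters are illustrative assumptions.

```python
import numpy as np

# Sketch of the grouped-data (log-odds regression) estimator: one
# observation per distinct regressor value; simulated, illustrative data.
rng = np.random.default_rng(5)
x_vals = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # K = 5 distinct values
n_k = 2000                                        # observations per cell
b0, b1 = 0.2, 0.9                                 # true logit coefficients

p_hat = np.empty_like(x_vals)
for k, xk in enumerate(x_vals):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * xk)))     # true success probability
    p_hat[k] = rng.binomial(1, p, n_k).mean()     # cell frequency of success

# OLS of the empirical log odds on the regressor values.
log_odds = np.log(p_hat / (1.0 - p_hat))
X = np.column_stack([np.ones(len(x_vals)), x_vals])
b_est = np.linalg.lstsq(X, log_odds, rcond=None)[0]
print(b_est)
```

With n_k large in every cell, the estimates land close to the true coefficients; with unequal n_k one would weight the K observations, as the text prescribes.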
The Wald test is computed directly in Stata. To construct an LR test, first estimate the full model via probit and obtain the estimated value of the log-likelihood, ln L̂_U. Next construct the probit estimate for the restricted model, in which the
conditional mean is assumed to be F(X_t β), and obtain the estimated value of the log-likelihood, ln L̂_R. The LR test statistic is

2( ln L̂_U − ln L̂_R ) →_d χ²(Q), with Q = dim(γ).

If Q is large, then the probit estimate of the full model can be difficult to construct. For large Q, the LM statistic is preferred, as only the restricted model is estimated. First, construct the probit estimate for the restricted model, B, and form F̂ = F(XB), f̂ = f(XB), and Û = Y − F̂. We then regress the residuals on both the included and excluded regressors, where we weight for efficiency:

Û_t / w_t = (f̂_t / w_t) X_t δ_1 + (f̂_t / w_t) Z_t δ_2 + V_t, with w_t = [F̂_t (1 − F̂_t)]^{1/2}.

Because the residuals sum to zero, there is no need for an intercept. The explained sum of squares from the regression is identical to the LM statistic. Alternatively, nR² can be used, as it is an asymptotically equivalent (although numerically distinct) statistic. Both are distributed as χ²(Q) random variables. Although less common, there are more general restrictions of interest to test. To test for heteroskedastic errors, the latent model becomes

Y*_t = X_t β + U_t, with U_t | X ~ N(0, e^{2 Z_t γ}).

We analyze the leading case, in which Z_t consists of all varying regressors (all but the intercept). With heteroskedastic errors, our calculations become

P(Y_t = 1|X) = P(U_t > −X_t β | X) = P( e^{−Z_t γ} U_t > −e^{−Z_t γ} X_t β | X ) = Φ( e^{−Z_t γ} X_t β ).

As noted before, if the error in the latent model is heteroskedastic, the specification of the conditional mean is altered (hence we do not construct robust standard errors for the original specification). In particular, the conditional mean is no longer a single index model, as the regressors affect the response probability in two ways. To indicate the absence of a single index model, the response probability is often written as

p(X) = m(X, Xβ, γ),
where the last two arguments emphasize the fact that the regressors affect the response probability through more than the single index Xβ. The natural null hypothesis is H_0: γ = 0, under which the latent model is a standard probit model. As the restricted model is clearly the easiest to estimate, we again use the LM statistic. Again, construct the probit estimate for the restricted model, B, and form F̂ = F(XB), f̂ = f(XB), and Û = Y − F̂. We then regress the residuals on both the included regressors and the score for γ (recall, f̂_t multiplied by the excluded regressors forms the score for γ in the test of exclusion restrictions), where we weight for efficiency:

Û_t / w_t = (f̂_t / w_t) X_t δ_1 + (1/w_t) [ ∇_γ m(X_t, X_t β, γ) |_{γ=0} ] δ_2 + V_t, with w_t = [F̂_t (1 − F̂_t)]^{1/2}.

For the heteroskedasticity example (in which γ_0 = 0),

∇_γ m(X_t, X_t β, γ) |_{γ=0} = ∂/∂γ Φ( e^{−Z_t γ} X_t β ) |_{γ=0} = φ(X_t β)(X_t β)(−Z_t).

The explained sum of squares from the regression is identical to the LM statistic. Alternatively, nR² can be used, as it is an asymptotically equivalent (although numerically distinct) statistic. Again, both statistics are χ²(Q) random variables.
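As a hedged sketch of the LM computation (the exclusion-restriction version, not the heteroskedasticity variant), the following takes the restricted model to be a probit with only an intercept, so its MLE has the closed form B_0 = Φ⁻¹(ȳ), and tests whether a regressor z was wrongly excluded. The data are simulated under the null and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm, chi2

# Hedged sketch of the LM test for excluding a regressor z from a probit
# model; the restricted model has only an intercept, so the restricted
# MLE is B0 = Phi^{-1}(ybar). Simulated data, generated under the null.
rng = np.random.default_rng(6)
n = 1000
z = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(0.4) * np.ones(n))   # z truly irrelevant

b0 = norm.ppf(y.mean())                 # restricted probit estimate
F = norm.cdf(b0) * np.ones(n)
f = norm.pdf(b0) * np.ones(n)
u = y - F                               # restricted residuals
w = np.sqrt(F * (1.0 - F))              # efficiency weights

# Weighted auxiliary regression of u/w on the weighted scores for the
# included intercept and the excluded regressor z (no intercept needed,
# since the residuals sum to zero).
R = np.column_stack([f / w, (f / w) * z])
d = np.linalg.lstsq(R, u / w, rcond=None)[0]
lm = np.sum((R @ d) ** 2)               # explained sum of squares = LM
p_value = chi2.sf(lm, df=1)             # one excluded regressor: Q = 1
print(lm, p_value)
```

Only the restricted model is ever estimated, which is the advantage the text emphasizes when the number of restrictions Q is large.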
More informationThe Cobb-Douglas Production Function
171 10 The Cobb-Douglas Production Function This chapter describes in detail the most famous of all production functions used to represent production processes both in and out of agriculture. First used
More informationChapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationBias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes
Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Yong Bao a, Aman Ullah b, Yun Wang c, and Jun Yu d a Purdue University, IN, USA b University of California, Riverside, CA, USA
More informationOnline Appendix to Are Risk Preferences Stable Across Contexts? Evidence from Insurance Data
Online Appendix to Are Risk Preferences Stable Across Contexts? Evidence from Insurance Data By LEVON BARSEGHYAN, JEFFREY PRINCE, AND JOSHUA C. TEITELBAUM I. Empty Test Intervals Here we discuss the conditions
More informationSimple Regression Theory II 2010 Samuel L. Baker
SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the
More informationAn Introduction to Basic Statistics and Probability
An Introduction to Basic Statistics and Probability Shenek Heyward NCSU An Introduction to Basic Statistics and Probability p. 1/4 Outline Basic probability concepts Conditional probability Discrete Random
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationPlease follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software
STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used
More informationReview of Fundamental Mathematics
Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools
More information99.37, 99.38, 99.38, 99.39, 99.39, 99.39, 99.39, 99.40, 99.41, 99.42 cm
Error Analysis and the Gaussian Distribution In experimental science theory lives or dies based on the results of experimental evidence and thus the analysis of this evidence is a critical part of the
More informationThe correlation coefficient
The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative
More informationRedwood Building, Room T204, Stanford University School of Medicine, Stanford, CA 94305-5405.
W hittemoretxt050806.tex A Bayesian False Discovery Rate for Multiple Testing Alice S. Whittemore Department of Health Research and Policy Stanford University School of Medicine Correspondence Address:
More informationCHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS
Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships
More informationAuxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationUnit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
More informationStatistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
More informationCorrelation key concepts:
CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationE3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
More informationThe Probit Link Function in Generalized Linear Models for Data Mining Applications
Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications
More informationChapter 5. Random variables
Random variables random variable numerical variable whose value is the outcome of some probabilistic experiment; we use uppercase letters, like X, to denote such a variable and lowercase letters, like
More informationThis unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
More informationCommon sense, and the model that we have used, suggest that an increase in p means a decrease in demand, but this is not the only possibility.
Lecture 6: Income and Substitution E ects c 2009 Je rey A. Miron Outline 1. Introduction 2. The Substitution E ect 3. The Income E ect 4. The Sign of the Substitution E ect 5. The Total Change in Demand
More informationCHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES
Examples: Monte Carlo Simulation Studies CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Monte Carlo simulation studies are often used for methodological investigations of the performance of statistical
More informationRisk Aversion. Expected value as a criterion for making decisions makes sense provided that C H A P T E R 2. 2.1 Risk Attitude
C H A P T E R 2 Risk Aversion Expected value as a criterion for making decisions makes sense provided that the stakes at risk in the decision are small enough to \play the long run averages." The range
More informationPanel Data Econometrics
Panel Data Econometrics Master of Science in Economics - University of Geneva Christophe Hurlin, Université d Orléans University of Orléans January 2010 De nition A longitudinal, or panel, data set is
More informationLOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
More informationThe VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.
Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium
More informationPoisson Models for Count Data
Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the
More informationAlgebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.
Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.
More informationModels for Longitudinal and Clustered Data
Models for Longitudinal and Clustered Data Germán Rodríguez December 9, 2008, revised December 6, 2012 1 Introduction The most important assumption we have made in this course is that the observations
More informationLinear Programming. March 14, 2014
Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1
More informationComparing Features of Convenient Estimators for Binary Choice Models With Endogenous Regressors
Comparing Features of Convenient Estimators for Binary Choice Models With Endogenous Regressors Arthur Lewbel, Yingying Dong, and Thomas Tao Yang Boston College, University of California Irvine, and Boston
More informationWooldridge, Introductory Econometrics, 3d ed. Chapter 12: Serial correlation and heteroskedasticity in time series regressions
Wooldridge, Introductory Econometrics, 3d ed. Chapter 12: Serial correlation and heteroskedasticity in time series regressions What will happen if we violate the assumption that the errors are not serially
More informationCh5: Discrete Probability Distributions Section 5-1: Probability Distribution
Recall: Ch5: Discrete Probability Distributions Section 5-1: Probability Distribution A variable is a characteristic or attribute that can assume different values. o Various letters of the alphabet (e.g.
More informationThe Dynamics of UK and US In ation Expectations
The Dynamics of UK and US In ation Expectations Deborah Gefang Department of Economics University of Lancaster email: d.gefang@lancaster.ac.uk Simon M. Potter Gary Koop Department of Economics University
More informationA Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationProbability density function : An arbitrary continuous random variable X is similarly described by its probability density function f x = f X
Week 6 notes : Continuous random variables and their probability densities WEEK 6 page 1 uniform, normal, gamma, exponential,chi-squared distributions, normal approx'n to the binomial Uniform [,1] random
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
More informationNEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS
NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS TEST DESIGN AND FRAMEWORK September 2014 Authorized for Distribution by the New York State Education Department This test design and framework document
More informationCCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
More informationLOGNORMAL MODEL FOR STOCK PRICES
LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
More informationLecture 19: Conditional Logistic Regression
Lecture 19: Conditional Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina
More informationGLM I An Introduction to Generalized Linear Models
GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial
More informationModule 4 - Multiple Logistic Regression
Module 4 - Multiple Logistic Regression Objectives Understand the principles and theory underlying logistic regression Understand proportions, probabilities, odds, odds ratios, logits and exponents Be
More information4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4
4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression
More informationASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
More informationGeneralized Linear Models
Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the
More informationBasic Probability Concepts
page 1 Chapter 1 Basic Probability Concepts 1.1 Sample and Event Spaces 1.1.1 Sample Space A probabilistic (or statistical) experiment has the following characteristics: (a) the set of all possible outcomes
More informationNon Parametric Inference
Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable
More information