Instrumental Variables & 2SLS

Transcription

1 Instrumental Variables & 2SLS: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$, with $x_1 = \pi_0 + \pi_1 z + \pi_2 x_2 + \dots + \pi_k x_k + v$

2 Why Use Instrumental Variables? Instrumental variables (IV) estimation is used when your model has endogenous $x$'s, that is, whenever $\operatorname{Cov}(x,u) \neq 0$. Thus, IV can be used to address the problem of omitted variable bias. Additionally, IV can be used to solve the classic errors-in-variables problem.

3 What Is an Instrumental Variable? For a variable $z$ to serve as a valid instrument for $x$, the following must be true. The instrument must be exogenous, that is, $\operatorname{Cov}(z,u) = 0$. The instrument must be correlated with the endogenous variable $x$, that is, $\operatorname{Cov}(z,x) \neq 0$.

4 More on Valid Instruments: We have to use common sense and economic theory to decide whether it makes sense to assume $\operatorname{Cov}(z,u) = 0$. We can test whether $\operatorname{Cov}(z,x) \neq 0$ by testing $H_0: \pi_1 = 0$ in $x = \pi_0 + \pi_1 z + v$. This regression is sometimes referred to as the first-stage regression.

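5 IV Estimation in the Simple Regression Case: For $y = \beta_0 + \beta_1 x + u$, and given our assumptions, $\operatorname{Cov}(z,y) = \beta_1 \operatorname{Cov}(z,x) + \operatorname{Cov}(z,u)$, so $\beta_1 = \operatorname{Cov}(z,y) / \operatorname{Cov}(z,x)$. Then the IV estimator for $\beta_1$ is $\hat{\beta}_1 = \frac{\sum_i (z_i - \bar{z})(y_i - \bar{y})}{\sum_i (z_i - \bar{z})(x_i - \bar{x})}$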
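
The covariance-ratio estimator on this slide can be tried on simulated data. The sketch below is illustrative only: the data-generating process, the coefficient values and the helper names `iv_slope` and `ols_slope` are assumptions made for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated data (all numbers are illustrative assumptions).
z = rng.normal(size=n)                 # instrument: moves x, unrelated to u
a = rng.normal(size=n)                 # unobserved factor, e.g. ability
u = a + rng.normal(size=n)             # structural error contains the omitted factor
x = 0.5 * z + a + rng.normal(size=n)   # x is endogenous because it depends on a
y = 1.0 + 2.0 * x + u                  # true beta_1 = 2

def iv_slope(y, x, z):
    """Sample analogue of Cov(z, y) / Cov(z, x)."""
    zc = z - z.mean()
    return np.sum(zc * (y - y.mean())) / np.sum(zc * (x - x.mean()))

def ols_slope(y, x):
    """Sample analogue of Cov(x, y) / Var(x)."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc * xc)

b_iv = iv_slope(y, x, z)
b_ols = ols_slope(y, x)
print(f"IV slope:  {b_iv:.3f}")
print(f"OLS slope: {b_ols:.3f}")
```

In this setup the OLS slope should settle above the true value of 2 because x and u share the factor a, while the IV slope should be close to 2.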

6 Inference with IV Estimation: The homoskedasticity assumption in this case is $E(u^2 \mid z) = \sigma^2 = \operatorname{Var}(u)$. As in the OLS case, given the asymptotic variance, we can estimate the standard error: $\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{n \sigma_x^2 \rho_{x,z}^2}$ and $\operatorname{se}(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{SST_x R_{x,z}^2}}$
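
As a rough illustration of the standard-error formula above, the following sketch computes $\hat{\sigma}^2$, $SST_x$ and $R^2_{x,z}$ directly. The simulated data and the degrees-of-freedom choice (n - 2) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Illustrative simulated data (assumed, not from the lecture).
z = rng.normal(size=n)
a = rng.normal(size=n)
x = 0.5 * z + a + rng.normal(size=n)
u = a + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()

# IV point estimates.
b1 = np.sum(zc * yc) / np.sum(zc * xc)
b0 = y.mean() - b1 * x.mean()

# Error variance estimated from the IV residuals.
resid = y - b0 - b1 * x
sigma2_hat = np.sum(resid**2) / (n - 2)

# SST_x and the R^2 from regressing x on z (squared sample correlation).
sst_x = np.sum(xc**2)
r2_xz = np.sum(zc * xc) ** 2 / (np.sum(zc**2) * sst_x)

se_iv = np.sqrt(sigma2_hat / (sst_x * r2_xz))
print(f"beta_1 hat = {b1:.3f}, se = {se_iv:.4f}, R^2(x on z) = {r2_xz:.3f}")
```

Because $R^2_{x,z}$ sits in the denominator, a weaker first stage inflates the standard error directly.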

7 IV versus OLS estimation: The standard error in the IV case differs from OLS only in the $R^2$ from regressing $x$ on $z$. Since $R^2 < 1$, IV standard errors are larger. However, IV is consistent, while OLS is inconsistent, when $\operatorname{Cov}(x,u) \neq 0$. The stronger the correlation between $z$ and $x$, the smaller the IV standard errors.

8 IV versus OLS estimation: Let's think about a wage model that tries to explain how wages differ across individuals based on observable characteristics. Economic theory tells us that the wage is a function of the marginal product of the worker. So what determines this marginal product? Two factors seem to play a key role: innate ability and investment in human capital. The problem is that innate ability is not directly observable. What happens if we ignore it?

9 IV versus OLS estimation: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Least Squares; Included observations: 428 after adjustments; Variables: EDUC, C.] The problem with the current model is that educ is likely to be correlated with the error term $u$, since factors that are not controlled for will influence educ and these factors all end up in the error. This violates the assumption $E(x_i u) = 0$. Thus OLS estimates are biased. With ability left out of the regression, this is omitted variable bias, a form of endogeneity bias.

10 Finding a Good IV: For the log(wage) equation, an instrumental variable $z$ for educ must be (1) uncorrelated with ability (and any other unobserved factors affecting wage) and (2) correlated with education. Something such as the last digit of an individual's Social Security Number almost certainly satisfies the first requirement: it is uncorrelated with ability because it is determined randomly. However, it is precisely because of the randomness of the last digit of the SSN that it is not correlated with education, either; therefore it makes a poor instrumental variable for educ.

11 Finding a Good IV What we have called a proxy variable for the omitted variable makes a poor IV for the opposite reason. For example, in the log(wage) example with omitted ability, a proxy variable for abil must be as highly correlated as possible with abil. An instrumental variable must be uncorrelated with abil. Therefore, while IQ is a good candidate as a proxy variable for abil, it is not a good instrumental variable for educ.

12 IV versus OLS estimation: We need an instrument that can overcome this bias. The instrument needs to be correlated with educ but uncorrelated with $u$. One potential instrument is fatheduc. We can test the correlation between fatheduc and educ using a simple regression. [EViews output, coefficient values missing from the transcription. Dependent Variable: EDUC; Method: Least Squares; Included observations: 428; Variables: FATHEDUC, C.]

13 IV versus OLS estimation: Unfortunately we cannot test the second condition for IV estimation, i.e. that the instrument is uncorrelated with the error, $E(z_i u) = 0$. Why not? So we must use economic theory and basic intuitive arguments to justify this condition. Using fatheduc as an instrument results in the IV estimates shown. [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Two-Stage Least Squares; Included observations: 428 after adjustments; Instrument specification: FATHEDUC, constant added to instrument list; Variables: EDUC, C; J-statistic: 1.02E-42; Instrument rank: 2.]

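14 The Effect of Poor Instruments: What if our assumption that $\operatorname{Cov}(z,u) = 0$ is false? The IV estimator will be inconsistent, too. We can compare the asymptotic bias in OLS and IV, and prefer IV if $\operatorname{Corr}(z,u)/\operatorname{Corr}(z,x) < \operatorname{Corr}(x,u)$. IV: $\operatorname{plim} \hat{\beta}_1 = \beta_1 + \frac{\operatorname{Corr}(z,u)}{\operatorname{Corr}(z,x)} \cdot \frac{\sigma_u}{\sigma_x}$. OLS: $\operatorname{plim} \tilde{\beta}_1 = \beta_1 + \operatorname{Corr}(x,u) \cdot \frac{\sigma_u}{\sigma_x}$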
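
A small arithmetic sketch of the two asymptotic-bias expressions shows how a slightly invalid but weak instrument can do worse than OLS. The function names and the correlation values below are hypothetical, chosen only to illustrate the comparison.

```python
def asymptotic_bias_iv(corr_zu, corr_zx, sd_u, sd_x):
    """plim(beta_hat_IV) - beta_1 = [Corr(z,u) / Corr(z,x)] * (sd_u / sd_x)."""
    return (corr_zu / corr_zx) * (sd_u / sd_x)

def asymptotic_bias_ols(corr_xu, sd_u, sd_x):
    """plim(beta_tilde_OLS) - beta_1 = Corr(x,u) * (sd_u / sd_x)."""
    return corr_xu * (sd_u / sd_x)

# Hypothetical values: the instrument is only slightly correlated with u,
# but it is also only weakly correlated with x.
bias_iv = asymptotic_bias_iv(corr_zu=0.02, corr_zx=0.05, sd_u=1.0, sd_x=1.0)
bias_ols = asymptotic_bias_ols(corr_xu=0.2, sd_u=1.0, sd_x=1.0)

print(f"IV asymptotic bias:  {bias_iv:.2f}")
print(f"OLS asymptotic bias: {bias_ols:.2f}")
```

With these assumed numbers the ratio Corr(z,u)/Corr(z,x) is 0.4, which exceeds Corr(x,u) = 0.2, so the slide's rule would favor OLS.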

15 Weak Instruments: Here we estimate the effects of smoking on birth weight. The problem is that the number of packs smoked might be correlated with other health factors not included in the regression, and so it is probably correlated with the error term, $u$. What might be a suitable instrument? The price of cigarettes, cigprice, should be uncorrelated with the error and should be negatively correlated with consumption of packs. [EViews output, coefficient values missing from the transcription. Dependent Variable: LBWGHT; Method: Least Squares; Included observations: 1388; Variables: PACKS, C.]

16 Weak Instruments: [EViews output, coefficient values missing from the transcription. Dependent Variable: LBWGHT; Method: Two-Stage Least Squares; Included observations: 1388; Instrument specification: CIGPRICE, constant added to instrument list; Variables: PACKS, C; Instrument rank: 2.] The IV estimates do not look so good. The sign is wrong and the $R^2$ is negative.* What went wrong? It may be that cigprice is a poor instrument. It may be that it is correlated with $u$, or it may not be correlated with packs. (* Unlike in the case of OLS, the $R^2$ from IV estimation can be negative because SSR for IV can be larger than SST. Although it does not hurt to report the $R^2$ for IV estimation, it is not very useful, either.)

17 Weak Instruments: [EViews output, coefficient values missing from the transcription. Dependent Variable: PACKS; Method: Least Squares; Included observations: 1388; Variables: CIGPRICE, C.] It turns out that cigprice is not significantly correlated with packs. Why? This highlights the problem of weak instruments, i.e. where the correlation between the endogenous variable, in this example packs, and the instrument(s) is low.
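
The first-stage check described here amounts to a t test of $H_0: \pi_1 = 0$. A minimal sketch on simulated data follows; the helper `first_stage_t` and the two data-generating processes are assumptions for illustration and do not use the birth-weight data.

```python
import numpy as np

def first_stage_t(x, z):
    """t statistic for H0: pi_1 = 0 in x = pi_0 + pi_1 z + v."""
    n = len(x)
    Z = np.column_stack([np.ones(n), z])
    pi_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)
    resid = x - Z @ pi_hat
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return pi_hat[1] / se

rng = np.random.default_rng(2)
n = 1_000
z = rng.normal(size=n)

# Two hypothetical endogenous regressors: one strongly, one barely related to z.
x_strong = 0.5 * z + rng.normal(size=n)
x_weak = 0.01 * z + rng.normal(size=n)

t_strong = first_stage_t(x_strong, z)
t_weak = first_stage_t(x_weak, z)
print(f"first-stage t (strong instrument): {t_strong:.2f}")
print(f"first-stage t (weak instrument):   {t_weak:.2f}")
```

A first-stage t statistic near zero, as in the weak case, is the kind of result the slide reports for cigprice.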

18 IV Estimation in the Multiple Regression Case: IV estimation can be extended to the multiple regression case. Call the model we are interested in estimating the structural model. Our problem is that one or more of the variables are endogenous. We need an instrument for each endogenous variable.

19 Multiple Regression IV (cont): Write the structural model as $y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1$, where $y_2$ is endogenous and $z_1$ is exogenous. Let $z_2$ be the instrument, so $\operatorname{Cov}(z_2,u_1) = 0$ and $y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + v_2$, where $\pi_2 \neq 0$. This reduced form equation regresses the endogenous variable on all exogenous ones.

20 Two Stage Least Squares (2SLS): It's possible to have multiple instruments. Consider our original structural model, and let $y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + v_2$. Here we're assuming that both $z_2$ and $z_3$ are valid instruments: they do not appear in the structural model and are uncorrelated with the structural error term, $u_1$.

21 Best Instrument: We could use either $z_2$ or $z_3$ as an instrument. The best instrument is a linear combination of all of the exogenous variables, $y_2^* = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3$. We can estimate $y_2^*$ by regressing $y_2$ on $z_1$, $z_2$ and $z_3$; we can call this the first stage. If we then substitute $\hat{y}_2$ for $y_2$ in the structural model, we get the same coefficient as IV.

22 More on 2SLS: While the coefficients are the same, the standard errors from doing 2SLS by hand are incorrect: the second-stage residuals are computed with the fitted values $\hat{y}_2$ rather than the actual $y_2$, so they carry the first-stage regression error. The method extends to multiple endogenous variables; we need to be sure that we have at least as many excluded exogenous variables (instruments) as there are endogenous variables in the structural equation.
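
The point about by-hand standard errors can be sketched numerically. Below, 2SLS is run in two explicit stages on simulated data; the naive standard errors use residuals built from $\hat{y}_2$, the corrected ones use residuals built from the actual $y_2$. The data-generating process, the coefficient values and the variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000

# Illustrative simulated data: y2 is endogenous through the shared factor a.
z1, z2, z3 = rng.normal(size=(3, n))
a = rng.normal(size=n)
y2 = 0.5 * z1 + 0.6 * z2 + 0.4 * z3 + a + rng.normal(size=n)
u1 = a + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u1          # true coefficient on y2 is 2

const = np.ones(n)
X = np.column_stack([const, y2, z1])         # structural regressors
Z = np.column_stack([const, z1, z2, z3])     # all exogenous variables

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: fitted values of y2 from all exogenous variables.
y2_hat = Z @ ols(Z, y2)

# Second stage: replace y2 by its fitted values.
X_hat = np.column_stack([const, y2_hat, z1])
b_2sls = ols(X_hat, y1)

k = X.shape[1]
XhXh_inv = np.linalg.inv(X_hat.T @ X_hat)

# Naive standard errors: residuals computed with y2_hat (what a by-hand
# second-stage OLS would report).
res_naive = y1 - X_hat @ b_2sls
se_naive = np.sqrt(np.diag(XhXh_inv) * (res_naive @ res_naive) / (n - k))

# Corrected standard errors: residuals computed with the actual y2.
res_correct = y1 - X @ b_2sls
se_correct = np.sqrt(np.diag(XhXh_inv) * (res_correct @ res_correct) / (n - k))

print("2SLS coefficients:", np.round(b_2sls, 3))
print("naive se:         ", np.round(se_naive, 4))
print("corrected se:     ", np.round(se_correct, 4))
```

The coefficients are identical in both cases; only the residual variance, and hence the reported standard errors, changes.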

23 2SLS: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Least Squares; Included observations: 3010; Variables: EDUC, EXPER, EXPERSQ, BLACK, SMSA, SOUTH, C.] Here is a wage regression using the data in CARD.RAW. Again education is endogenous (innate ability) and so requires an IV estimator. Card uses nearc4 as an instrument. Since the sample is geographically random, the error should not be correlated with location, but being near a 4-year college may be correlated with educ.

24 2SLS: [EViews output, coefficient values missing from the transcription. Dependent Variable: EDUC; Method: Least Squares; Included observations: 3010; Variables: NEARC4, EXPER, EXPERSQ, BLACK, SMSA, SOUTH, C.] We can check for the correlation between nearc4 and educ by running the auxiliary regression of educ on nearc4 and the other exogenous variables.

25 2SLS: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Two-Stage Least Squares; Included observations: 3010; Instrument specification: NEARC4, EXPER, EXPERSQ, BLACK, SMSA, SOUTH, constant added to instrument list; Variables: EDUC, EXPER, EXPERSQ, BLACK, SMSA, SOUTH, C; J-statistic: 5.68E-35; Instrument rank: 7.] The IV estimator yields a return to educ nearly twice as large as the OLS estimator. But the standard error is 15 times larger! That is the price to be paid to get a consistent estimate of the return to educ when educ is endogenous.

26 2SLS w/ Multiple Instruments: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Least Squares; Included observations: 428 after adjustments; Variables: EDUC, EXPER, EXPERSQ, C.] Another wage equation. Again educ is endogenous. But now we have two instruments, motheduc and fatheduc.

27 2SLS w/ Multiple Instruments: [EViews output, coefficient values missing from the transcription. Dependent Variable: EDUC; Method: Least Squares; Included observations: 753; Variables: MOTHEDUC, FATHEDUC, EXPER, EXPERSQ, C.] We can check for the correlation between the instruments and educ, after controlling for the other exogenous factors exper and exper$^2$, by running the auxiliary regression.

28 2SLS w/ Multiple Instruments: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Two-Stage Least Squares; Included observations: 428 after adjustments; Instrument specification: MOTHEDUC, FATHEDUC, EXPER, EXPERSQ, constant added to instrument list; Variables: EDUC, EXPER, EXPERSQ, C; Instrument rank: 5.] Another wage equation. Again educ is endogenous. But now we have two instruments, motheduc and fatheduc.

29 Addressing Errors-in-Variables with IV Estimation: Remember the classical errors-in-variables problem, where we observe $x_1$ instead of $x_1^*$, with $x_1 = x_1^* + e_1$, and $e_1$ is uncorrelated with $x_1^*$ and $x_2$. If there is a $z$ such that $\operatorname{Corr}(z,u) = 0$ and $\operatorname{Corr}(z,x_1) \neq 0$, then IV will remove the attenuation bias.

30 Testing for Endogeneity: Since OLS is preferred to IV if we do not have an endogeneity problem, we'd like to be able to test for endogeneity. If we do not have endogeneity, both OLS and IV are consistent. The idea of the Hausman test is to see if the estimates from OLS and IV are different.

31 Testing for Endogeneity (cont): While it's a good idea to see if IV and OLS have different implications, it's easier to use a regression test for endogeneity. If $y_2$ is endogenous, then $v_2$ (from the reduced form equation) and $u_1$ from the structural model will be correlated. The test is based on this observation.

32 Testing for Endogeneity (cont): Save the residuals from the first stage. Include the residual in the structural equation (which of course has $y_2$ in it). If the coefficient on the residual is statistically different from zero, reject the null of exogeneity. If there are multiple endogenous variables, jointly test the residuals from each first stage.
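
The steps above can be sketched as a short regression-based test on simulated data. The data-generating process and the helper `ols_with_se` are assumptions for the example; the standard errors are the conventional homoskedastic ones.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Illustrative simulated data: y2 is endogenous through the shared factor a.
z1, z2, z3 = rng.normal(size=(3, n))
a = rng.normal(size=n)
y2 = 0.5 * z1 + 0.6 * z2 + 0.4 * z3 + a + rng.normal(size=n)
u1 = a + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u1

def ols_with_se(X, y):
    """OLS coefficients, conventional standard errors and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se, resid

const = np.ones(n)

# Step 1: first stage (reduced form) and its residuals.
Z = np.column_stack([const, z1, z2, z3])
_, _, v2_hat = ols_with_se(Z, y2)

# Step 2: structural equation augmented with the first-stage residual.
X_aug = np.column_stack([const, y2, z1, v2_hat])
b, se, _ = ols_with_se(X_aug, y1)

# Step 3: t test on the residual's coefficient (null: y2 is exogenous).
t_resid = b[3] / se[3]
print(f"coefficient on first-stage residual: {b[3]:.3f}, t = {t_resid:.2f}")
```

A large t statistic here rejects exogeneity, which is what the simulated design builds in.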

33 Testing for Endogeneity: [EViews output, coefficient values missing from the transcription. Dependent Variable: EDUC; Method: Least Squares; Included observations: 753; Variables: MOTHEDUC, FATHEDUC, EXPER, EXPERSQ, C.] Since OLS is efficient relative to 2SLS if there is no endogeneity, we should use OLS whenever possible. Test for endogeneity by saving the residuals from the reduced form regression (reproduced here) and including them in the structural equation.

34 Testing for Endogeneity: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Least Squares; Included observations: 428 after adjustments; Variables: RESID, EDUC, EXPER, EXPERSQ, C.] The test for endogeneity is simply the test of the null hypothesis that $\gamma_1 = 0$, where $\gamma_1$ is the coefficient on the reduced-form residual (RESID). If there are multiple endogenous variables, there will be a reduced form regression for each one. The residuals from each of these regressions would be included, and then the null is a joint exclusion hypothesis.

35 Testing Overidentifying Restrictions: If there is just one instrument for our endogenous variable, we can't test whether the instrument is uncorrelated with the error. We say the model is just identified. If we have multiple instruments, it is possible to test the overidentifying restrictions to see if some of the instruments are correlated with the error.

36 The OverID Test: Estimate the structural model using IV and obtain the residuals. Regress the residuals on all the exogenous variables and obtain the $R^2$ to form $nR^2$. Under the null that all instruments are uncorrelated with the error, $LM \sim \chi^2_q$, where $q$ is the number of extra instruments.
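
A rough sketch of the $nR^2$ procedure follows, run once with valid instruments and once with an instrument that enters the error term. The simulated design, the helper names `ols` and `overid_lm`, and the p-value shortcut (which is valid only for $q = 1$) are assumptions made for this example.

```python
import math
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def overid_lm(y1, X, Z):
    """n * R^2 from regressing the 2SLS residuals on all exogenous variables."""
    X_hat = Z @ ols(Z, X)              # first stage for every column of X
    b_2sls = ols(X_hat, y1)            # second stage
    resid = y1 - X @ b_2sls            # residuals use the actual regressors
    fitted = Z @ ols(Z, resid)
    sst = np.sum((resid - resid.mean()) ** 2)
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / sst
    return len(y1) * r2

rng = np.random.default_rng(5)
n = 2_000
z1, z2, z3 = rng.normal(size=(3, n))
a = rng.normal(size=n)
y2 = 0.5 * z1 + 0.6 * z2 + 0.4 * z3 + a + rng.normal(size=n)

u_valid = a + rng.normal(size=n)       # z2 and z3 both unrelated to the error
u_invalid = u_valid + 0.5 * z3         # z3 now enters the error: invalid

const = np.ones(n)
X = np.column_stack([const, y2, z1])
Z = np.column_stack([const, z1, z2, z3])
q = Z.shape[1] - X.shape[1]            # one overidentifying restriction

lm_valid = overid_lm(1.0 + 2.0 * y2 - z1 + u_valid, X, Z)
lm_invalid = overid_lm(1.0 + 2.0 * y2 - z1 + u_invalid, X, Z)

# For q = 1 only, the chi-square upper tail equals erfc(sqrt(LM / 2)).
p_valid = math.erfc(math.sqrt(max(lm_valid, 0.0) / 2.0))
p_invalid = math.erfc(math.sqrt(max(lm_invalid, 0.0) / 2.0))

print(f"q = {q}")
print(f"valid instruments:   LM = {lm_valid:.2f}, p = {p_valid:.3f}")
print(f"invalid instrument:  LM = {lm_invalid:.2f}, p = {p_invalid:.3g}")
```

A rejection says at least one instrument is correlated with the error, but not which one.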

37 Testing Overidentifying Restrictions: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Two-Stage Least Squares; Included observations: 428 after adjustments; Instrument specification: MOTHEDUC, FATHEDUC, EXPER, EXPERSQ, constant added to instrument list; Variables: EDUC, EXPER, EXPERSQ, C; Instrument rank: 5.] Here are the 2SLS estimates of the structural wage equation using motheduc and fatheduc as instruments. Since we have only one endogenous variable but two instruments, the model is overidentified. First, save the residuals from the structural equation. Second, regress the residuals on the exogenous variables and save the $R^2$. The test is $LM(p) = nR^2$, where $p$ is the number of over-identifying restrictions.

38 Testing Overidentifying Restrictions: [EViews output, most values missing from the transcription. Dependent Variable: RESID02; Method: Least Squares; Included observations: 428 after adjustments; Variables: EXPER (-1.83E...), EXPERSQ (7.34E...), MOTHEDUC, FATHEDUC, C; Mean dependent var: -4.07E-16.] With $n = 428$ and the $R^2$ from this regression, $LM = nR^2 \sim \chi^2(1)$.

39 Simultaneous Equations: $y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$ and $y_2 = \alpha_2 y_1 + \beta_2 z_2 + u_2$

40 Simultaneity: Simultaneity is a specific type of endogeneity problem in which the explanatory variable is jointly determined with the dependent variable. As with other types of endogeneity, IV estimation can solve the problem. There are some special issues to consider with simultaneous equations models (SEM).

41 Supply and Demand Example: Start with an equation you'd like to estimate, say a labor supply function $h_s = \alpha_1 w + \beta_1 z + u_1$, where $w$ is the wage and $z$ is a supply shifter. Call this a structural equation: it's derived from economic theory and has a causal interpretation where $w$ directly affects $h_s$.

42 Example (cont): The problem is that we can't just regress observed hours on wage, since observed hours are determined by the equilibrium of supply and demand. Consider a second structural equation, in this case the labor demand function $h_d = \alpha_2 w + u_2$. So hours are determined by a SEM.

43 Example (cont): Both $h$ and $w$ are endogenous because they are both determined by the equilibrium of supply and demand. $z$ is exogenous, and it's the availability of this exogenous supply shifter that allows us to identify the structural demand equation. With no observed demand shifters, supply is not identified and cannot be estimated.

44 Identification of Demand Equation: [Figure: one demand curve D and three supply curves S (z=z1), S (z=z2), S (z=z3), drawn on axes w and h.]

45 Using IV to Estimate Demand: So, we can estimate the structural demand equation, using $z$ as an instrument for $w$. The first stage equation is $w = \pi_0 + \pi_1 z + v_2$. The second stage equation is $h = \alpha_2 \hat{w} + u_2$. Thus, 2SLS provides a consistent estimator of $\alpha_2$, the slope of the demand curve. We cannot estimate $\alpha_1$, the slope of the supply curve.
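
To see the identification argument at work, one can simulate the labor market equilibrium and estimate the demand slope with $z$ as the instrument. The parameter values ($\alpha_1 = 1$, $\alpha_2 = -1$, $\beta_1 = 1$) and the mean-zero shocks are assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

# Assumed structural parameters (illustrative only).
alpha1, beta1 = 1.0, 1.0   # supply:  h_s = alpha1 * w + beta1 * z + u1
alpha2 = -1.0              # demand:  h_d = alpha2 * w + u2

z = rng.normal(size=n)     # observed supply shifter
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)

# Equilibrium: h_s = h_d  =>  w = (beta1 * z + u1 - u2) / (alpha2 - alpha1)
w = (beta1 * z + u1 - u2) / (alpha2 - alpha1)
h = alpha2 * w + u2

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

# OLS of h on w mixes supply and demand; IV with z isolates the demand slope.
alpha2_ols = cov(w, h) / cov(w, w)
alpha2_iv = cov(z, h) / cov(z, w)

print(f"true demand slope: {alpha2:.2f}")
print(f"OLS estimate:      {alpha2_ols:.3f}")
print(f"IV estimate:       {alpha2_iv:.3f}")
```

Shifts in $z$ move the supply curve along a fixed demand curve, so the IV ratio recovers $\alpha_2$, while OLS is pulled toward zero here because $w$ is correlated with $u_2$.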

46 The General SEM: Suppose you want to estimate the structural equation $y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$, where $y_2 = \alpha_2 y_1 + \beta_2 z_2 + u_2$. Thus, $y_2 = \alpha_2 (\alpha_1 y_2 + \beta_1 z_1 + u_1) + \beta_2 z_2 + u_2$. So, $(1 - \alpha_2 \alpha_1) y_2 = \alpha_2 \beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2$, which can be rewritten as $y_2 = \pi_1 z_1 + \pi_2 z_2 + v_2$
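
For reference, the reduced-form parameters implied by this substitution can be written out explicitly. This assumes $\alpha_2 \alpha_1 \neq 1$, which is needed to divide through by $(1 - \alpha_2 \alpha_1)$.

```latex
\begin{aligned}
(1 - \alpha_2 \alpha_1)\, y_2 &= \alpha_2 \beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2 \\
y_2 &= \underbrace{\frac{\alpha_2 \beta_1}{1 - \alpha_2 \alpha_1}}_{\pi_1} z_1
     + \underbrace{\frac{\beta_2}{1 - \alpha_2 \alpha_1}}_{\pi_2} z_2
     + \underbrace{\frac{\alpha_2 u_1 + u_2}{1 - \alpha_2 \alpha_1}}_{v_2}
\end{aligned}
```

The last term shows that $v_2$ depends on $u_1$ whenever $\alpha_2 \neq 0$, which is the source of the correlation between $y_2$ and the structural error.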

47 The General SEM (continued): By substituting this reduced form in for $y_2$, we can see that since $v_2$ is a linear function of $u_1$, $y_2$ is correlated with the error term and the OLS estimator of $\alpha_1$ is biased; call it simultaneity bias. The sign of the bias is complicated, but we can use the simple regression as a rule of thumb. In the simple regression case, the bias has the same sign as $\alpha_2 / (1 - \alpha_2 \alpha_1)$.

48 Identification of General SEM: Let $z_1$ be all the exogenous variables in the first equation, and $z_2$ be all the exogenous variables in the second equation. It's okay for there to be overlap in $z_1$ and $z_2$. To identify equation 1, there must be some variables in $z_2$ that are not in $z_1$. To identify equation 2, there must be some variables in $z_1$ that are not in $z_2$.

49 Rank and Order Conditions: We refer to this as the rank condition. Note that the exogenous variable excluded from the first equation must have a non-zero coefficient in the second equation for the rank condition to hold. Note that the order condition clearly holds if the rank condition does: there will be an exogenous variable for the endogenous one.

50 Estimation of the General SEM: Estimation of SEM is straightforward. The instruments for 2SLS are the exogenous variables from both equations. We can extend the idea to systems with more than 2 equations. For a given identified equation, the instruments are all of the exogenous variables in the whole system.

51 Estimation of the General SEM: The first equation is married women's labor supply. The second is the wage offer as a function of productivity measures.

52 Estimation of the General SEM: [EViews output, coefficient values missing from the transcription. Dependent Variable: HOURS; Method: Least Squares; Included observations: 428 after adjustments; Variables: LWAGE, AGE, EDUC, KIDSLT, NWIFEINC, C; Sum squared resid: 2.48E+08.] Here is the OLS estimate of the labor supply equation. Of particular interest is the coefficient on log(wage). It is negative (labor supply?) and very imprecisely estimated (large s.e.).

53 Estimation of the General SEM: [EViews output, coefficient values missing from the transcription. Dependent Variable: HOURS; Method: Two-Stage Least Squares; Included observations: 428 after adjustments; Instrument specification: EXPER, EXPERSQ, AGE, EDUC, KIDSLT, NWIFEINC, C; Variables: LWAGE, AGE, EDUC, KIDSLT, NWIFEINC, C; Sum squared resid: 7.74E+08; Second-Stage SSR: 2.26E+08; Instrument rank: 7.] Now the 2SLS estimates. Note that for this equation the instruments include the exogenous variables AGE, EDUC, KIDSLT and NWIFEINC as well as the two excluded variables EXPER and EXPERSQ. These last two are included in the labor demand (wage) equation but excluded from the labor supply specification. That is what makes them (over-)identifying. The test of the over-identifying restrictions is the J-statistic.

54 Estimation of the General SEM: [EViews output, coefficient values missing from the transcription. Dependent Variable: LWAGE; Method: Two-Stage Least Squares; Included observations: 428 after adjustments; Instrument specification: EXPER, EXPERSQ, AGE, EDUC, KIDSLT6, NWIFEINC, C; Variables: HOURS, EDUC, EXPER, EXPERSQ, C; Instrument rank: 7.] Here are the 2SLS estimates for the labor demand equation. This equation uses the same instruments as the supply equation. Since there are three variables excluded from this equation (AGE, KIDSLT6 and NWIFEINC), there are two over-id restrictions.