4 Data Issues

4.1 Truncated Regression

- population model: $y_i = x_i\beta + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$
- given a random sample $\{y_i, x_i\}_{i=1}^N$, OLS is consistent and efficient
- a problem arises when only a non-random sample is available
- specifically, $\{y_i, x_i\}$ is observed iff $y_i \le b_i$
- differs from the censored regression model in that $x_i$ is also unobserved for truncated observations
- examples:
  - only individuals with income below the poverty line are surveyed
  - only firms with fewer than 100 employees are surveyed
MLE

- the likelihood must account for the truncation:
  $\ln L(\theta) = \sum_i \ln \Pr(y_i \mid x_i, \theta, b_i, y_i \le b_i)$
- again, what is $\Pr(y_i \mid x_i, \theta, b_i, y_i \le b_i)$?
  - $\Pr(y_i \mid y_i \le b_i) = f(y_i)/F(b_i)$, where $f(\cdot)$ is the pdf of $y$ and $F(\cdot)$ is the cdf of $y$
  - division by $F(b_i)$ rescales the probabilities so they integrate to one
- the log-likelihood is therefore
  $\ln L(\theta) = \sum_i \ln \Pr(y_i \mid x_i, \theta, b_i, y_i \le b_i) = \sum_i \ln\left[\frac{(1/\sigma)\,\phi(\varepsilon_i/\sigma)}{\Phi\big((b_i - x_i\beta)/\sigma\big)}\right]$
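The truncated log-likelihood above can be maximized numerically. A minimal sketch on simulated data (all parameter values and variable names are invented for illustration; SciPy's generic optimizer stands in for a canned truncated-regression routine):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate y = beta*x + eps, then truncate from above: observe only y <= b.
n, beta, sigma, b = 5000, 2.0, 1.0, 3.0
x = rng.normal(1.0, 1.0, n)
y = beta * x + sigma * rng.normal(size=n)
keep = y <= b
x_t, y_t = x[keep], y[keep]

def neg_loglik(theta):
    bta, log_s = theta              # parameterize sigma as exp(log_s) > 0
    s = np.exp(log_s)
    eps = y_t - bta * x_t
    # ln[(1/s) phi(eps/s)] - ln Phi((b - x*beta)/s): density rescaled by Pr(y <= b)
    ll = stats.norm.logpdf(eps / s) - np.log(s) \
         - stats.norm.logcdf((b - bta * x_t) / s)
    return -ll.sum()

res = minimize(neg_loglik, x0=[1.0, 0.0], method="BFGS")
beta_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# For comparison: OLS on the truncated sample is biased toward zero.
beta_ols = np.polyfit(x_t, y_t, 1)[0]
```

The comparison with naive OLS on the truncated sample illustrates why the rescaled likelihood is needed.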
truncation from above and below

- population model: $y_i = x_i\beta + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$, where $\{y_i, x_i\}$ is observed iff $a_i \le y_i \le b_i$
- likelihood function: $\ln L(\theta) = \sum_i \ln \Pr(y_i \mid x_i, \theta, a_i, b_i, a_i \le y_i \le b_i)$
- again, what is $\Pr(y_i \mid x_i, \theta, a_i, b_i, a_i \le y_i \le b_i)$?
- the log-likelihood is
  $\ln L(\theta) = \sum_i \ln\left[\frac{(1/\sigma)\,\phi(\varepsilon_i/\sigma)}{\Phi\big((b_i - x_i\beta)/\sigma\big) - \Phi\big((a_i - x_i\beta)/\sigma\big)}\right]$
marginal effects

- truncated from above only:
  $\frac{\partial E[y_i \mid y_i \le b_i]}{\partial x_k} = \beta_k\left(1 - \lambda_i^2 - \alpha_i\lambda_i\right)$, where $\alpha_i = \frac{b_i - x_i\beta}{\sigma}$ and $\lambda_i = \frac{\phi(\alpha_i)}{\Phi(\alpha_i)}$
- truncated from above and below, with $\alpha_{i1} = \frac{a_i - x_i\beta}{\sigma}$ and $\alpha_{i2} = \frac{b_i - x_i\beta}{\sigma}$:
  $\frac{\partial E[y_i \mid a_i \le y_i \le b_i]}{\partial x_k} = \beta_k\left\{1 + \frac{\alpha_{i1}\phi(\alpha_{i1}) - \alpha_{i2}\phi(\alpha_{i2})}{\Phi(\alpha_{i2}) - \Phi(\alpha_{i1})} - \left[\frac{\phi(\alpha_{i1}) - \phi(\alpha_{i2})}{\Phi(\alpha_{i2}) - \Phi(\alpha_{i1})}\right]^2\right\}$
  (the one-sided formula is the limit as $a_i \to -\infty$)
- STATA: -truncreg-
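The one-sided marginal-effect formula can be checked against a numerical derivative of the truncated-normal mean $E[y \mid y \le b] = \mu - \sigma\,\phi(\alpha)/\Phi(\alpha)$. A small sketch (all numeric values are arbitrary):

```python
import numpy as np
from scipy import stats

def trunc_mean(mu, sigma, b):
    """E[y | y <= b] for y ~ N(mu, sigma^2): mu - sigma * phi(alpha) / Phi(alpha)."""
    alpha = (b - mu) / sigma
    return mu - sigma * stats.norm.pdf(alpha) / stats.norm.cdf(alpha)

mu, sigma, b, beta_k = 1.0, 2.0, 2.5, 0.7   # mu plays the role of x*beta
alpha = (b - mu) / sigma
lam = stats.norm.pdf(alpha) / stats.norm.cdf(alpha)
analytic = beta_k * (1 - lam**2 - alpha * lam)

# Chain rule: d E[y | y <= b] / d x_k = beta_k * d E / d mu
h = 1e-6
numeric = beta_k * (trunc_mean(mu + h, sigma, b) - trunc_mean(mu - h, sigma, b)) / (2 * h)
```

The scaling factor $1 - \lambda^2 - \alpha\lambda$ lies in $(0, 1)$, so the marginal effect on the truncated mean is attenuated relative to $\beta_k$.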
4.2 Sample Selection (Incidental Truncation)

- population model: $y_i = x_i\beta + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$
- given a random sample $\{y_i, x_i\}_{i=1}^N$, OLS is consistent and efficient
- a problem arises when data on $y$ are only available for a non-random sample
- let $S_i = 1$ if $y_i$ is observed; $S_i = 0$ if $y_i$ is unobserved
- differs from the truncated regression model in that $x_i$ is observed regardless of $S_i$
- differs from the censored regression model in that there is no clear censoring rule; i.e., $S_i = 0$ implies nothing is known about $y_i$, whereas in censored regression we know that $y_i \ge c_i$
- implies the following data structure: we have a random sample $\{y_i, x_i, S_i\}_{i=1}^N$, but $y_i = .$ if $S_i = 0$
- can only use the $M \equiv \sum_i S_i$ observations with $S_i = 1$ to estimate any model
- examples:
  - wages only observed for workers
  - firm profits only observed for firms that remain in business
  - SAT scores only observed for test takers
  - house prices only observed for houses on the market
- issue: is OLS still unbiased and consistent? answer: it depends
exogenous sample selection

- if $S_i$ depends only on (i) the exogenous observables $x_i$, or (ii) unobservables $u_i$ with $u_i \perp \varepsilon_i$, then OLS is unbiased and consistent, where estimation uses only the sub-sample of $M$ observations
- example: $w_i = \alpha + \beta\,educ_i + \gamma_1 age_i + \gamma_2 age_i^2 + \varepsilon_i$ with $\Pr(w_i \text{ observed}) = f(educ_i, age_i)$; then OLS using only workers is consistent
endogenous sample selection

- model the outcome and selection equations simultaneously:
  $y_i = x_i\beta + \varepsilon_i$
  $S_i^* = z_i\gamma + u_i$
  $S_i = 1$ if $S_i^* > 0$; $S_i = 0$ if $S_i^* \le 0$
  $y_i = .$ if $S_i = 0$
  $(\varepsilon_i, u_i) \sim N_2(0, 0, \sigma^2, 1, \rho)$
- $x$, $z$ are exogenous
- $z = [x \;\; w]$, where $w$ is the exclusion restriction(s)
problem

- $E[y \mid z] = x\beta$, but $E[y \mid z, S = 1] = x\beta + \rho\sigma\,\phi(z\gamma)/\Phi(z\gamma)$, where $\phi(z\gamma)/\Phi(z\gamma)$ is known as the Inverse Mills Ratio (IMR)
- implies that $E[y \mid z, S = 1] = x\beta$ iff $\rho = 0$
- OLS estimation of $y_i = x_i\beta + \varepsilon_i$ using only the $M$ selected observations omits the IMR term, which implies that $\tilde\varepsilon_i = \rho\sigma\,\phi(z_i\gamma)/\Phi(z_i\gamma) + \varepsilon_i$, which is not mean zero and is not independent of $x$

solution (two-step)

- estimate the IMR (using $i = 1, \ldots, N$): estimate a probit model with $S$ as the dependent variable and $z$ as the covariates $\Rightarrow \hat\gamma$; obtain $\widehat{IMR}_i = \phi(z_i\hat\gamma)/\Phi(z_i\hat\gamma)$
- regress $y_i$ on $x_i$ and $\widehat{IMR}_i$ via OLS (using $i = 1, \ldots, M$)
- test of endogenous selection: $H_0: \rho = 0$ vs. $H_a: \rho \ne 0$
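A minimal simulation of the two-step procedure (the probit is fit by crude numerical MLE; all parameter values and names are invented for illustration, and this ignores the generated-regressor correction to the standard errors):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, alpha, beta, rho = 20000, 2.0, 1.5, 0.6

x = rng.normal(size=n)
w = rng.normal(size=n)                    # exclusion restriction: in z, not in x
u = rng.normal(size=n)
eps = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)   # Corr(eps, u) = rho

s = (0.5 * x + 1.0 * w + u > 0)           # selection: S = 1[z'gamma + u > 0]
y = alpha + beta * x + eps                # outcome, observed only when s is True

# Step 1: probit of S on z = (1, x, w), using all N observations
def probit_nll(g):
    p = stats.norm.cdf(g[0] + g[1] * x + g[2] * w).clip(1e-12, 1 - 1e-12)
    return -np.where(s, np.log(p), np.log(1 - p)).sum()

ghat = minimize(probit_nll, x0=np.zeros(3), method="BFGS").x
idx = ghat[0] + ghat[1] * x + ghat[2] * w
imr = stats.norm.pdf(idx) / stats.norm.cdf(idx)   # inverse Mills ratio

# Step 2: OLS of y on (1, x, IMR) using only the M selected observations
X = np.column_stack([np.ones(s.sum()), x[s], imr[s]])
coef, *_ = np.linalg.lstsq(X, y[s], rcond=None)

beta_naive = np.polyfit(x[s], y[s], 1)[0]         # OLS that omits the IMR term
```

The coefficient on the IMR estimates $\rho\sigma_\varepsilon$ (here $0.6$), so its t-test is the test of $H_0{:}\ \rho = 0$; the naive OLS slope that omits the IMR is biased.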
notes

- the usual OLS standard errors are incorrect since the IMR is predicted; must account for the additional uncertainty due to estimation of $\gamma$
- need an exclusion restriction(s), i.e., a variable in $z$ not in $x$; otherwise the model is identified solely from the nonlinearity of the IMR, which arises from the assumption of joint normality
- STATA: -heckman-
4.3 Cov(x, ε) ≠ 0

- OLS requires $Cov(x_i, \varepsilon_i) = 0$; otherwise
  $E[\hat\beta_{OLS}] = \beta + \frac{Cov(x, \varepsilon)}{Var(x)} \ne \beta$
- this situation can arise for a number of reasons:
  - omitted variable bias (unobserved heterogeneity)
  - reverse causation
  - measurement error
- terminology: $x$ is exogenous if it is uncorrelated with $\varepsilon$; $x$ is endogenous if it is correlated with $\varepsilon$
4.3.1 Omitted Variable Bias

- a relevant regressor is excluded from the regression model and is correlated with $x$
- example:
  True model: $y_i = \alpha + \beta x_i + \gamma w_i + \varepsilon_i$
  Estimated model: $y_i = \alpha + \beta x_i + \tilde\varepsilon_i$, where $\tilde\varepsilon_i = \gamma w_i + \varepsilon_i$
- OLS on the estimated model yields
  $E[\hat\beta_{OLS}] = \beta + \frac{Cov(x, \tilde\varepsilon)}{Var(x)} = \beta + \frac{Cov(x, \gamma w) + Cov(x, \varepsilon)}{Var(x)} = \beta + \gamma\frac{Cov(x, w)}{Var(x)} + \frac{Cov(x, \varepsilon)}{Var(x)}$
- if $Cov(x, \varepsilon) = 0$ (i.e., $w$ is the only source of correlation between $x$ and $\tilde\varepsilon$), then
  $E[\hat\beta_{OLS}] = \beta + \gamma\frac{Cov(x, w)}{Var(x)} \gtrless \beta$
  depending on $sgn(\gamma)$ and the direction of correlation between $x$ and $w$
notes

- $w$ may represent an observed variable that is excluded by mistake, or an unobserved variable that the analyst does not have data on
- in the multiple regression model, the bias spills over across variables
- example:
  True model: $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \gamma w_i + \varepsilon_i$
  Estimated model: $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \tilde\varepsilon_i$, where $\tilde\varepsilon_i = \gamma w_i + \varepsilon_i$
- if $Cov(x_1, \tilde\varepsilon) = 0$ but $Cov(x_2, \tilde\varepsilon) \ne 0$, then not only is $\hat\beta_2$ biased, but $\hat\beta_1$ is also biased if $Cov(x_1, x_2) \ne 0$
4.3.2 Reverse Causation

- not only does $x$ have an effect on $y$, but $y$ also has an effect on $x$ (i.e., the two variables are jointly determined)
- example: wages of working women and the number of children... more children may reduce a woman's productivity at work, or increase her desire for a more flexible job (sacrificing pay), thus reducing her wage; a low-wage woman may opt for more children because the opportunity cost of her time is lower
- model:
  $y_i = \alpha + \beta x_i + \varepsilon_i$
  $x_i = \theta + \delta y_i + \mu_i$
  where the parameters represent structural parameters
- substituting for $y$ in the second equation reveals
  $x_i = \theta + \delta\alpha + \delta\beta x_i + \delta\varepsilon_i + \mu_i = \frac{1}{1 - \delta\beta}\left(\theta + \delta\alpha + \delta\varepsilon_i + \mu_i\right)$
  which implies that $Cov(x, \varepsilon) \ne 0$
- intuitively, an unobserved shock to $y$ (i.e., $\varepsilon$) must be correlated with $x$ since changes in $y$ lead to changes in $x$
4.3.3 Measurement Error

- problem: data are measured imprecisely
- examples:
  - recall error
  - coding errors
  - misinformation (e.g., overstated income, understated drug use)
  - rounding error (e.g., labor supply reported as 40 hrs/wk or rounded to the nearest 5; income rounded to the nearest $1000)
- two cases: (i) error in the dependent variable, or (ii) error(s) in the independent variable(s)
dependent variable

- true model: $y_i^* = \alpha + \beta x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$, where a $*$ on a variable indicates it is correctly measured
- given a random sample $\{y_i^*, x_i\}_{i=1}^N$, OLS is consistent and efficient
- with measurement error, we do not observe $y_i^*$; instead we observe
  $\underbrace{y_i}_{\text{observed}} = \underbrace{y_i^*}_{\text{true}} + \underbrace{\mu_i}_{\text{measurement error}}, \qquad \mu_i \sim N(0, \sigma_\mu^2)$
- reliability ratio: $RR = \frac{Var(y^*)}{Var(y)} \in [0, 1]$
- substitution implies that the estimated model is
  $y_i = \alpha + \beta x_i + (\mu_i + \varepsilon_i) = \alpha + \beta x_i + \tilde\varepsilon_i$
properties of OLS estimates

- $\hat\beta_{OLS}$ is unbiased and consistent iff $Cov(x, \tilde\varepsilon) = 0$, which is the case if
  $Cov(x, \tilde\varepsilon) = \underbrace{Cov(x, \varepsilon)}_{=0 \text{ by assumption}} + \underbrace{Cov(x, \mu)}_{=0 \text{ if no ME in } x} = 0$
- $\hat\alpha_{OLS}$ is unbiased and consistent iff $\hat\beta_{OLS}$ is unbiased and consistent and $E[\tilde\varepsilon] = 0$ (since $\hat\alpha_{OLS} = \bar y - \hat\beta_{OLS}\bar x$), which is the case if
  $E[\tilde\varepsilon] = \underbrace{E[\varepsilon]}_{=0 \text{ by assumption}} + \underbrace{E[\mu]}_{=0 \text{ under classical ME}} = 0$
- OLS standard errors are correct: $\mu_i$ normal implies $\tilde\varepsilon_i$ normal; this holds even if $Cov(\mu, \varepsilon) \ne 0$
- what is $\sigma_{\tilde\varepsilon}^2$?
  $Var(\tilde\varepsilon) = Var(\mu + \varepsilon) = Var(\mu) + Var(\varepsilon) + 2Cov(\mu, \varepsilon) = \sigma_\mu^2 + \sigma_\varepsilon^2 + 2\rho\sigma_\mu\sigma_\varepsilon$
  which is greater than $Var(\varepsilon)$ if $\rho = 0$
- since $Var(\tilde\varepsilon) \ge Var(\varepsilon)$, standard errors are larger
- summary: Classical Errors-in-Variables (CEV) model
  - assumptions: (i) $\mu_i \sim N(0, \sigma_\mu^2)$; (ii) $Cov(\mu, \varepsilon) = 0$; (iii) $Cov(x, \mu) = 0$
  - implications: (i) OLS is unbiased and consistent; (ii) standard errors are correct; (iii) $R^2$ falls and standard errors rise due to the extra noise in the data
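A quick simulation of classical measurement error in $y$ (all numeric values are invented): the slope is unaffected, but the residual variance grows from $\sigma_\varepsilon^2$ to $\sigma_\varepsilon^2 + \sigma_\mu^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 100000, 2.0
x = rng.normal(size=n)
y_star = 1.0 + beta * x + rng.normal(size=n)     # correctly measured y*, sigma_eps = 1
y = y_star + rng.normal(scale=1.5, size=n)       # observed y = y* + mu (classical ME)

slope_star = np.polyfit(x, y_star, 1)[0]
slope_obs = np.polyfit(x, y, 1)[0]               # still consistent for beta

# Residual variance: sigma_eps^2 + sigma_mu^2 = 1 + 2.25 under Cov(mu, eps) = 0
resid_var = np.var(y - np.polyval(np.polyfit(x, y, 1), x))
```

The fitted slope is essentially unchanged by the noise in $y$; only the error variance (and hence the standard errors and $R^2$) deteriorates.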
independent variable

- true model: $y_i = \alpha + \beta x_i^* + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$, where a $*$ on a variable indicates it is correctly measured
- given a random sample $\{y_i, x_i^*\}_{i=1}^N$, OLS is consistent and efficient
- with measurement error, we do not observe $x_i^*$; instead we observe
  $\underbrace{x_i}_{\text{observed}} = \underbrace{x_i^*}_{\text{true}} + \underbrace{\mu_i}_{\text{measurement error}}, \qquad \mu_i \sim N(0, \sigma_\mu^2)$
- reliability ratio: $RR = \frac{Var(x^*)}{Var(x)} \in [0, 1]$
- substitution implies that the estimated model is
  $y_i = \alpha + \beta x_i + (\varepsilon_i - \beta\mu_i) = \alpha + \beta x_i + \tilde\varepsilon_i$
properties of OLS estimates

- $\hat\beta_{OLS}$ is unbiased and consistent iff $Cov(x, \tilde\varepsilon) = 0$, which is not likely:
  $Cov(x, \tilde\varepsilon) = Cov(x, \varepsilon - \beta\mu) = \underbrace{Cov(x^*, \varepsilon)}_{=0 \text{ by assumption}} + \underbrace{Cov(\mu, \varepsilon)}_{?} - \underbrace{\beta\,Cov(x, \mu)}_{\ne 0}$
- $\Rightarrow$ $\hat\beta_{OLS}$ is unbiased and consistent only if (i) $\beta = 0$ and $Cov(\mu, \varepsilon) = 0$, or (ii) $Cov(\mu, \varepsilon) = \beta\,Cov(x, \mu)$
- $\hat\alpha_{OLS}$ is unbiased and consistent iff $\hat\beta_{OLS}$ is unbiased and consistent and $E[\tilde\varepsilon] = 0$ (since $\hat\alpha_{OLS} = \bar y - \hat\beta_{OLS}\bar x$), which is the case if
  $E[\tilde\varepsilon] = \underbrace{E[\varepsilon]}_{=0 \text{ by assumption}} - \beta\underbrace{E[\mu]}_{=0 \text{ under classical ME}} = 0$
summary: Classical Errors-in-Variables (CEV) model

- assumptions: (i) $\mu_i \sim N(0, \sigma_\mu^2)$; (ii) $Cov(\mu, \varepsilon) = 0$; (iii) $Cov(x^*, \mu) = 0$
- implications:
  (i) OLS is biased and inconsistent
  (ii) $\hat\beta_{OLS}$ is attenuated toward zero (i.e., biased toward zero, biased down in absolute value, correct sign):
  $plim(\hat\beta_{OLS}) = \beta + \frac{Cov(x, \tilde\varepsilon)}{Var(x)} = \beta + \frac{Cov(x, \varepsilon - \beta\mu)}{Var(x)} = \beta + \underbrace{\frac{Cov(x, \varepsilon)}{Var(x)}}_{=0} - \beta\underbrace{\frac{Cov(x, \mu)}{Var(x)}}_{=\sigma_\mu^2/\sigma_x^2} = \beta\left[\frac{\sigma_x^2 - \sigma_\mu^2}{\sigma_x^2}\right] = \beta\underbrace{\left[\frac{\sigma_{x^*}^2}{\sigma_x^2}\right]}_{\in [0,1]} = \beta \cdot RR$
  where $\sigma_x^2 = \sigma_{x^*}^2 + \sigma_\mu^2$; this is smaller than $\beta$ in absolute value but of the same sign as $\beta$
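The attenuation factor $\beta \cdot RR$ can be verified by simulation (parameter values invented): with $\sigma_{x^*}^2 = \sigma_\mu^2 = 1$, $RR = 0.5$, so the OLS slope converges to $\beta/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 200000, 2.0
x_star = rng.normal(size=n)                 # true regressor, Var = 1
mu = rng.normal(size=n)                     # classical ME, Var = 1  ->  RR = 0.5
x = x_star + mu                             # observed, mismeasured regressor
y = 1.0 + beta * x_star + rng.normal(size=n)

slope = np.polyfit(x, y, 1)[0]              # plim = beta * RR = 1.0
rr = np.var(x_star) / np.var(x)             # sample reliability ratio
```

The sign survives but the magnitude is cut by the reliability ratio, exactly as the derivation above predicts.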
- (iii) in the multiple regression $y_i = \alpha + \beta x_i + \sum_{k=1}^K \gamma_k x_{ki} + \varepsilon_i$, where $x$ is a mismeasured version of $x^*$ and the $x_k$, $k = 1, \ldots, K$, are correctly measured, $\hat\beta_{OLS}$ suffers from attenuation bias, and the $\hat\gamma_k$ are also biased in a complex way unless the $x_k$ are uncorrelated with $x^*$
4.3.4 The Solution: Instrumental Variables

- goal: devise an alternative estimation technique that yields consistent estimates when $x$ is endogenous
- solution: identify $\beta$ from exogenous variation in $x$
- suppose $x$ can be decomposed into two independent parts, $x = x^e + x^u$, so that $Cov(x, \varepsilon) = Cov(x^e, \varepsilon) + Cov(x^u, \varepsilon)$, where $Cov(x^u, \varepsilon) \ne 0$ but $Cov(x^e, \varepsilon) = 0$
- the idea is to use the variation in $x$ due to $x^e$ to identify $\beta$, and to ignore the variation in $x$ due to $x^u$, since the impact of that variation on $y$ confounds the effects of $x$ and $\varepsilon$
- to use only the variation arising from $x^e$, we need additional information
- we get this new information by adding data on a new variable, $z$, called an instrument, instrumental variable (IV), or exclusion restriction
- $z$ is an IV for $x$ iff:
  (i) $Cov(x, z) \ne 0$
  (ii) $Cov(\varepsilon, z) = 0$
  (iii) $E[y \mid x, z] = E[y \mid x]$ (i.e., $z$ has no direct effect on $y$; $z$ is excluded from the model for $y$)
- (i) and (ii) $\Rightarrow$ $z$ is correlated with $x$ only through the exogenous part of $x$
- estimation techniques:
  - IV
  - Two-Stage Least Squares (TSLS or 2SLS)
  - MLE
IV estimator

- model: $y_i = \alpha + \beta x_i + \varepsilon_i$
- implies
  $Cov(y, z) = Cov(\alpha, z) + Cov(\beta x, z) + Cov(\varepsilon, z) = \beta\,Cov(x, z)$
- the estimator, which is consistent, is
  $\hat\beta_{IV} = \frac{Cov(y, z)}{Cov(x, z)}$
- sample formula:
  $\hat\beta_{IV} = \frac{\frac{1}{N-1}\sum_i (y_i - \bar y)(z_i - \bar z)}{\frac{1}{N-1}\sum_i (x_i - \bar x)(z_i - \bar z)}$
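A sketch of the covariance-ratio IV estimator on simulated endogenous data (all parameter values invented): OLS is biased upward because $Cov(x, \varepsilon) > 0$, while the IV ratio recovers $\beta$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200000
z = rng.normal(size=n)                         # instrument: correlated with x, not eps
eps = rng.normal(size=n)
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)   # endogenous: Cov(x, eps) = 0.5
y = 1.0 + 2.0 * x + eps

beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
beta_ols = np.polyfit(x, y, 1)[0]              # plim = 2 + 0.5 / Var(x) > 2
```
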
properties of $\hat\beta_{IV}$

- $\hat\beta_{IV}$ is consistent:
  $plim\,\hat\beta_{IV} = plim\,\frac{\frac{1}{N-1}\sum_i y_i(z_i - \bar z)}{\frac{1}{N-1}\sum_i x_i(z_i - \bar z)} = plim\,\frac{\frac{1}{N-1}\sum_i (\alpha + \beta x_i + \varepsilon_i)(z_i - \bar z)}{\frac{1}{N-1}\sum_i x_i(z_i - \bar z)} = plim\,\frac{\frac{1}{N-1}\sum_i \beta x_i(z_i - \bar z)}{\frac{1}{N-1}\sum_i x_i(z_i - \bar z)} = \beta$
- $\hat\alpha_{IV}$ is consistent, since $\hat\alpha_{IV} = \bar y - \hat\beta_{IV}\bar x$
- $\widehat{Var}(\varepsilon) = \hat\sigma^2 = \frac{1}{N-2}\sum_i (y_i - \hat\alpha_{IV} - \hat\beta_{IV} x_i)^2$
- $Var(\hat\beta_{IV}) = \frac{\sigma^2}{N\,Var(x)\,\rho_{x,z}^2}$, with sample counterpart $\widehat{Var}(\hat\beta_{IV}) = \frac{\hat\sigma^2}{\sum_i (x_i - \bar x)^2\,R_{x,z}^2}$, where $R_{x,z}^2$ (the $R^2$ from a simple regression of $x$ on $z$) estimates $\rho_{x,z}^2$
- this variance is decreasing in $Var(x)$ and $\rho_{x,z}^2$
notes

- $Var(\hat\beta_{IV}) > Var(\hat\beta_{OLS})$ if $\rho_{x,z}^2 < 1$; recall $\widehat{Var}(\hat\beta_{OLS}) = \hat\sigma^2 / \sum_i (x_i - \bar x)^2$
- it is inefficient to use IV if $x$ is exogenous
- IV is algebraically equivalent to OLS when $x$ is used as an instrument for itself:
  $\hat\beta_{IV} = \frac{\frac{1}{N-1}\sum_i (y_i - \bar y)(x_i - \bar x)}{\frac{1}{N-1}\sum_i (x_i - \bar x)^2} = \hat\beta_{OLS}$
  and $\hat\alpha_{IV} = \bar y - \hat\beta_{IV}\bar x = \bar y - \hat\beta_{OLS}\bar x = \hat\alpha_{OLS}$
  and $\widehat{Var}(\hat\beta_{IV}) = \frac{\hat\sigma^2}{\sum_i (x_i - \bar x)^2\,R_{x,x}^2} = \frac{\hat\sigma^2}{\sum_i (x_i - \bar x)^2} = \widehat{Var}(\hat\beta_{OLS})$, since $R_{x,x}^2 = 1$
- multiple regression with only one endogenous variable:
  - the exogenous $x$'s serve as instruments for themselves
  - the solution is simple using matrix algebra
- multiple regression with more than one endogenous variable:
  - need a unique instrument for each endogenous variable
  - the exogenous $x$'s serve as instruments for themselves
  - the solution is simple using matrix algebra
TSLS

- estimation proceeds in two steps
- first stage: $x_i = \delta + \pi z_i + \mu_i$, estimable via OLS $\Rightarrow \hat x_i$
  - $Cov(x, \varepsilon) \ne 0 \Rightarrow Cov(\mu, \varepsilon) \ne 0$
  - $\hat x_i$ varies across $i$ due to variation in $z_i$ only (not $\mu_i$, since $\hat x_i$ does not depend on $\hat\mu_i$)
- second stage: $y_i = \alpha + \beta \hat x_i + \varepsilon_i$
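In the just-identified case, running the two stages by OLS reproduces the covariance-ratio IV estimator exactly. A sketch on simulated data (invented values):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50000
z = rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + eps

# First stage: regress x on z, form fitted values x_hat
pi, delta = np.polyfit(z, x, 1)                # slope, intercept
x_hat = delta + pi * z

# Second stage: regress y on x_hat
beta_tsls = np.polyfit(x_hat, y, 1)[0]

# Covariance-ratio IV estimator for comparison
beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
```

The two estimates agree up to floating-point error, since $\hat\beta_{TSLS} = Cov(y, \hat x)/Var(\hat x) = Cov(y, z)/Cov(x, z)$ when there is a single instrument.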
notes

- $\hat\beta_{TSLS}$ is consistent
- standard errors need to be adjusted since $\hat x_i$ is a predicted regressor
- if there are multiple endogenous variables, need a unique IV for each endogenous $x$
- if the second stage contains other exogenous variables, these must be included in the first stage
- a test of $\pi \ne 0$ is a test for $Cov(x, z) \ne 0$
- can test endogeneity using a Hausman test comparing $\hat\beta_{TSLS}$ with $\hat\beta_{OLS}$
- if there is more than one IV for an endogenous variable, the model is overidentified (as opposed to exactly identified)
  - a test of non-zero covariance between the set of IVs and $x$ is given by a test that the coefficients on all the IVs in the first stage are jointly equal to zero
  - overidentification also enables other tests of instrument validity
- GMM estimation is more efficient if $\varepsilon$ is heteroskedastic
MLE

- estimate the first and second stages simultaneously, but the second stage is replaced with the reduced form (i.e., $y$ is expressed solely as a function of the exogenous variables in the model)
- model:
  $x_i = \delta + \pi z_i + \mu_i$
  $y_i = \alpha + \beta x_i + \varepsilon_i$ (structural eqn)
  $\phantom{y_i} = (\alpha + \beta\delta) + \beta\pi z_i + (\varepsilon_i + \beta\mu_i) = (\alpha + \beta\delta) + \beta\pi z_i + \tilde\varepsilon_i$ (reduced form)
- where $(\varepsilon, \mu) \sim N_2(0, \Sigma)$, a bivariate normal distribution with
  $\Sigma = \begin{bmatrix} \sigma_\varepsilon^2 & \rho\sigma_\varepsilon\sigma_\mu \\ \rho\sigma_\varepsilon\sigma_\mu & \sigma_\mu^2 \end{bmatrix}$
  a $2 \times 2$ symmetric, positive definite matrix
- the joint distribution of the reduced-form errors is $(\tilde\varepsilon, \mu) \sim N_2(0, \tilde\Sigma)$, where
  $\tilde\Sigma = \begin{bmatrix} \sigma_\varepsilon^2 + \beta^2\sigma_\mu^2 + 2\beta\rho\sigma_\varepsilon\sigma_\mu & \rho\sigma_\varepsilon\sigma_\mu + \beta\sigma_\mu^2 \\ \rho\sigma_\varepsilon\sigma_\mu + \beta\sigma_\mu^2 & \sigma_\mu^2 \end{bmatrix}$
- derive $\ln L(\theta)$, where $\theta = \{\delta, \pi, \alpha, \beta, \sigma_\varepsilon, \sigma_\mu, \rho\}$:
  $\ln L(\theta) = \sum_i \ln \Pr(y_i, x_i \mid z_i, \theta) = \sum_i \ln \Pr(\tilde\varepsilon_i, \mu_i \mid z_i, \theta) = \sum_i \ln\left[|J|\,\phi_2\!\left(\frac{\tilde\varepsilon_i}{\sigma_{\tilde\varepsilon}}, \frac{\mu_i}{\sigma_\mu};\, \tilde\Sigma\right)\right]$
  where $|J|$ is the determinant of the Jacobian and $\phi_2$ is the bivariate standard normal pdf
- estimates are obtained as $\hat\theta = \arg\max_\theta \ln L(\theta)$
- a test of $H_0: \pi = 0$ is a test for $Cov(x, z) \ne 0$
- a test of endogeneity is given by $H_0: \rho = 0$
specification tests

- testing endogeneity may be relevant for economic reasons; it is also relevant statistically since OLS is more efficient if $x$ is exogenous
- Hausman test:
  - if $x$ is exogenous, then $\hat\beta_{IV} \approx \hat\beta_{OLS}$
  - if $x$ is endogenous, then $\hat\beta_{IV} \ne \hat\beta_{OLS}$
  - define a test statistic based on the difference $\hat\beta_{IV} - \hat\beta_{OLS}$:
    $H = \left(\hat\beta_{IV} - \hat\beta_{OLS}\right)'\left(\hat\Sigma_{IV} - \hat\Sigma_{OLS}\right)^{-1}\left(\hat\beta_{IV} - \hat\beta_{OLS}\right) \sim \chi_K^2$
    where $K$ = the number of $x$'s
Durbin-Wu-Hausman test

- model:
  $x_i = \delta + \pi z_i + \mu_i$
  $y_i = \alpha + \beta x_i + \varepsilon_i$
- $x$ is endogenous iff $Cov(\mu, \varepsilon) \ne 0$
- steps:
  (i) estimate the first stage via OLS and obtain $\hat\mu_i$
  (ii) estimate $y_i = \alpha + \beta x_i + \delta \hat\mu_i + \varepsilon_i$ via OLS
  (iii) test $H_0: \delta = 0$; rejection implies $x$ is endogenous
- if there are multiple endogenous variables, conduct the joint test $H_0: \delta_1 = \cdots = \delta_K = 0$ ($K$ = number of endogenous variables)
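The DWH steps amount to a control-function regression; a self-contained sketch on a simulated endogenous design (values invented):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100000
z = rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)   # Cov(mu, eps) = 0.5 -> x endogenous
y = 1.0 + 2.0 * x + eps

# (i) first-stage residuals mu_hat
pi, d0 = np.polyfit(z, x, 1)
mu_hat = x - (d0 + pi * z)

# (ii) regress y on (1, x, mu_hat); (iii) delta != 0 signals endogeneity
X = np.column_stack([np.ones(n), x, mu_hat])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_cf, delta_hat = coef[1], coef[2]
```

A convenient side effect of the control-function form: the coefficient on $x$ in step (ii) is the TSLS estimate of $\beta$, while $\hat\delta$ estimates $Cov(\mu, \varepsilon)/Var(\mu)$ (here $0.5/1.25 = 0.4$).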
testing overidentifying restrictions

- if the number of IVs exceeds the number of endogenous variables, we can test whether $Cov(z, \varepsilon) = 0$
- steps:
  (i) regress $y$ on $x$ via TSLS $\Rightarrow \hat\alpha_{TSLS}, \hat\beta_{TSLS} \Rightarrow \hat\varepsilon_i$
  (ii) regress $\hat\varepsilon_i$ on the $z$'s (all IVs) $\Rightarrow R^2$
  (iii) test statistic: $NR^2 \sim \chi_q^2$, where $q$ is the number of overidentifying restrictions
- intuition: if $Cov(z, \varepsilon) = 0$, then the explanatory power of the second regression should be small, $R^2 \approx 0$
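A sketch of the $NR^2$ statistic on simulated data with two instruments for one endogenous regressor (one overidentifying restriction; all values invented). With valid instruments the statistic stays near its $\chi^2_1$ range; deliberately contaminating one instrument with the structural error makes it explode.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z1 + 0.5 * z2 + 0.5 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

def overid_stat(y, x, instruments):
    """N * R^2 from regressing TSLS residuals on all instruments."""
    Z = np.column_stack([np.ones(len(y))] + list(instruments))
    xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]          # first stage
    Xh = np.column_stack([np.ones(len(y)), xhat])
    b = np.linalg.lstsq(Xh, y, rcond=None)[0]                # second stage
    e = y - b[0] - b[1] * x                                   # residuals use actual x
    ehat = Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    r2 = np.var(ehat) / np.var(e)
    return len(y) * r2

stat_valid = overid_stat(y, x, [z1, z2])      # ~ chi2(1) under H0: IVs valid
z_bad = z2 + 0.5 * eps                        # invalid: correlated with eps
stat_invalid = overid_stat(y, x, [z1, z_bad])
```
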
weak IV $\Rightarrow Cov(x, z) \approx 0$

- can show
  $plim\,\hat\beta_{IV} = \beta + \frac{\rho_{z,\varepsilon}}{\rho_{z,x}}\cdot\frac{\sigma_\varepsilon}{\sigma_x}$
- if $z$ is a valid IV, then $\rho_{z,x} \ne 0$ and $\rho_{z,\varepsilon} = 0$ $\Rightarrow plim\,\hat\beta_{IV} = \beta$
- but if $\rho_{z,x} \approx 0$ and/or $\rho_{z,\varepsilon} \ne 0$, then $plim\,\hat\beta_{IV} \ne \beta$
- compare with OLS: $plim\,\hat\beta_{OLS} = \beta + \rho_{x,\varepsilon}\dfrac{\sigma_\varepsilon}{\sigma_x}$, so the asymptotic bias of OLS is smaller than that of IV iff
  $\left|\dfrac{\rho_{z,\varepsilon}}{\rho_{z,x}}\right| > \left|\rho_{x,\varepsilon}\right|$
  which becomes more likely as $\rho_{z,x} \to 0$
- STATA: -ivreg2-
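The probability-limit expression is an exact identity in sample moments, which a simulation with a slightly invalid instrument makes concrete ($\varepsilon$ is observable here only because the data are simulated; all values are invented):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100000
eps = rng.normal(size=n)
z = rng.normal(size=n) + 0.2 * eps        # slightly invalid: Corr(z, eps) > 0
x = 0.7 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

beta_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]

# bias = (rho_{z,eps} / rho_{z,x}) * (sigma_eps / sigma_x), all sample moments
rho_ze = np.corrcoef(z, eps)[0, 1]
rho_zx = np.corrcoef(z, x)[0, 1]
bias = (rho_ze / rho_zx) * (eps.std(ddof=1) / x.std(ddof=1))
```

The identity holds because $(\rho_{z,\varepsilon}/\rho_{z,x})(\sigma_\varepsilon/\sigma_x) = Cov(z,\varepsilon)/Cov(z,x) = \hat\beta_{IV} - \beta$ when computed from the same sample; a small $\rho_{z,x}$ magnifies any instrument invalidity.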