
4 Data Issues

4.1 Truncated Regression

population model: y_i = x_iβ + ε_i, ε_i ~ N(0, σ²)

given a random sample {y_i, x_i}, i = 1, ..., N, OLS is consistent and efficient

problem arises when only a non-random sample is available; specifically, {y_i, x_i} is observed iff y_i ≤ b_i

differs from the censored regression model in that x_i is also unobserved for the missing observations

examples
- only individuals with income below the poverty line are surveyed
- only firms with fewer than 100 employees are surveyed

MLE

likelihood must account for truncation

likelihood function: ln L(θ) = Σ_i ln Pr(y_i | x_i, θ, b_i, y_i ≤ b_i)

again, what is Pr(y_i | x_i, θ, b_i, y_i ≤ b_i)?
- Pr(y_i | y_i ≤ b_i) = f(y_i)/F(b_i), where f(·) is the PDF of y and F(·) is the CDF of y
- division by F(b_i) rescales the density so that it integrates to one over the observed region

implies the likelihood function is

ln L(θ) = Σ_i ln Pr(y_i | x_i, θ, b_i, y_i ≤ b_i)
        = Σ_i ln[ (1/σ) φ(ε_i/σ) / Φ((b_i − x_iβ)/σ) ]

where φ(·) and Φ(·) are the standard normal PDF and CDF and ε_i = y_i − x_iβ

truncation from above and below

population model: y_i = x_iβ + ε_i, ε_i ~ N(0, σ²), where {y_i, x_i} is observed iff a_i ≤ y_i ≤ b_i

likelihood function: ln L(θ) = Σ_i ln Pr(y_i | x_i, θ, a_i, b_i, a_i ≤ y_i ≤ b_i)

again, what is Pr(y_i | x_i, θ, a_i, b_i, a_i ≤ y_i ≤ b_i)? the density rescaled by the probability of falling in the observation window; the likelihood function is

ln L(θ) = Σ_i ln Pr(y_i | x_i, θ, a_i, b_i, a_i ≤ y_i ≤ b_i)
        = Σ_i ln[ (1/σ) φ(ε_i/σ) / (Φ((b_i − x_iβ)/σ) − Φ((a_i − x_iβ)/σ)) ]

marginal effects

truncated from above only

∂E[y_i | y_i ≤ b_i]/∂x_k = β_k (1 − λ_i² − α_iλ_i)

where

α_i = (b_i − x_iβ)/σ,   λ_i = φ(α_i)/Φ(α_i)

truncated from above and below

∂E[y_i | a_i ≤ y_i ≤ b_i]/∂x_k = β_k { 1 − λ_i² − [α_i2 φ(α_i2) − α_i1 φ(α_i1)] / [Φ(α_i2) − Φ(α_i1)] }

where

α_i1 = (a_i − x_iβ)/σ,   α_i2 = (b_i − x_iβ)/σ,   λ_i = [φ(α_i1) − φ(α_i2)] / [Φ(α_i2) − Φ(α_i1)]

STATA: -truncreg-
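The truncated-normal likelihood above can be checked by simulation. The sketch below (parameter values, variable names, and the truncation point are illustrative assumptions, not from the notes) fits the MLE with scipy and compares it with OLS on the truncated sample, which is attenuated.

```python
# Sketch: truncated-regression MLE vs. OLS on a truncated sample.
# All parameter values and names are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, alpha, beta, sigma, b = 50_000, 1.0, 2.0, 1.0, 3.0

x = rng.normal(0.0, 1.0, N)
y = alpha + beta * x + rng.normal(0.0, sigma, N)
keep = y <= b                        # observe (y_i, x_i) iff y_i <= b
xs, ys = x[keep], y[keep]

# OLS on the truncated sample: slope is attenuated
X = np.column_stack([np.ones(xs.size), xs])
b_ols = np.linalg.lstsq(X, ys, rcond=None)[0][1]

# truncated-normal log-likelihood:
#   ln L = sum_i { ln[(1/s) phi(e_i/s)] - ln Phi((b - a0 - b0*x_i)/s) }
def negll(theta):
    a0, b0, log_s = theta
    s = np.exp(log_s)                # enforce s > 0
    e = ys - a0 - b0 * xs
    return -np.sum(norm.logpdf(e / s) - np.log(s)
                   - norm.logcdf((b - a0 - b0 * xs) / s))

res = minimize(negll, x0=[0.0, 1.0, 0.0], method="BFGS")
b_mle = res.x[1]
print(b_ols, b_mle)                  # OLS slope well below 2; MLE close to 2
```

The log-σ parameterization is just a convenience so the optimizer never tries a negative scale.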

4.2 Sample Selection (Incidental Truncation)

population model: y_i = x_iβ + ε_i, ε_i ~ N(0, σ²)

given a random sample {y_i, x_i}, i = 1, ..., N, OLS is consistent and efficient

problem arises when data on y is available only for a non-random sample

let S_i = 1 if y_i is observed; S_i = 0 if y_i is unobserved

differs from the truncated regression model in that x_i is observed regardless of S_i

differs from the censored regression model in that there is no clear censoring rule; i.e., S_i = 0 implies nothing is known about y_i, whereas in censored regression we know, e.g., that y_i ≥ c_i

implies the following data structure
- have data on a random sample {y_i, x_i, S_i}, i = 1, ..., N, but y_i = . if S_i = 0
- can only use the M = Σ_i S_i observations with S_i = 1 to estimate any model

examples
- wages only observed for workers
- firm profits only observed for firms that remain in business
- SAT scores only observed for test takers
- house prices only observed for houses on the market

issue: is OLS still unbiased and consistent? answer: it depends

exogenous sample selection

if S_i depends only on: (i) exogenous observables, x_i, or (ii) unobservables, u_i, where u_i ⊥ ε_i, then OLS is unbiased and consistent, where estimation uses only the sub-sample of M observations

example

w_i = α + β educ_i + γ_1 age_i + γ_2 age_i² + ε_i  and  Pr(w_i observed) = f(educ_i, age_i)

then OLS using only workers is consistent
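A quick simulation contrasts the two cases: selecting on an exogenous observable leaves OLS consistent, while selecting on the outcome (which depends on ε) attenuates the slope. The selection rules and parameter values below are illustrative assumptions.

```python
# Sketch: exogenous selection (on x) vs. selection on the outcome.
# All parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
N, beta = 200_000, 1.5

x = rng.normal(0.0, 1.0, N)
eps = rng.normal(0.0, 1.0, N)
y = 0.5 + beta * x + eps

def ols_slope(xv, yv):
    X = np.column_stack([np.ones(xv.size), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

# exogenous selection: S depends only on x -> OLS on the subsample is fine
sel_x = x > 0
b_exog = ols_slope(x[sel_x], y[sel_x])

# selection on the outcome: S depends on eps given x -> slope attenuated
sel_y = y > 0
b_endog = ols_slope(x[sel_y], y[sel_y])
print(b_exog, b_endog)     # first near 1.5, second noticeably smaller
```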

endogenous sample selection

model outcome and selection equations simultaneously

y_i = x_iβ + ε_i
S_i* = z_iγ + u_i
S_i = 1 if S_i* > 0;  S_i = 0 if S_i* ≤ 0
y_i = . if S_i = 0

(ε_i, u_i) ~ N_2(0, 0, σ², 1, ρ)

x, z are exogenous

z = [x w], where w denotes the exclusion restriction(s)

problem

E[y|z] = xβ, but

E[y|z, S = 1] = xβ + ρσ φ(zγ)/Φ(zγ)

where φ(zγ)/Φ(zγ) is known as the Inverse Mills Ratio (IMR)

implies that E[y|z, S = 1] = xβ iff ρ = 0

OLS estimation of y_i = x_iβ + ε_i using only the M observations omits the IMR term; in the selected sample the error is

ε̃_i = ρσ φ(z_iγ)/Φ(z_iγ) + ε_i

which is not mean zero, and is not independent of x

solution
- estimate the IMR (using i = 1, ..., N): estimate a probit model, where S is the dependent variable and z are the covariates ⇒ γ̂
- obtain IMR_i = φ(z_iγ̂)/Φ(z_iγ̂)
- regress y_i on x_i and IMR_i via OLS (using i = 1, ..., M)

test of endogenous selection: H_o: ρ = 0 vs. H_a: ρ ≠ 0

notes
- usual OLS standard errors are incorrect since the IMR is predicted; must account for the additional uncertainty due to estimation of γ
- need exclusion restriction(s), i.e., a variable in z not in x; otherwise the model is identified only from the nonlinearity of the IMR, which arises solely from the assumption of joint normality

STATA: -heckman-, -heckman2-
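The two-step procedure can be sketched end to end on simulated data (all parameter values and variable names are illustrative assumptions; the probit step is fit by MLE with scipy rather than a canned routine, and the naive second-step standard errors are not corrected):

```python
# Sketch of the Heckman two-step estimator on simulated data.
# All parameter values and names are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, beta, rho, sigma = 100_000, 1.0, 0.7, 1.0

x = rng.normal(0.0, 1.0, N)
w = rng.normal(0.0, 1.0, N)                  # exclusion restriction
u = rng.normal(0.0, 1.0, N)
eps = sigma * (rho * u + np.sqrt(1 - rho**2) * rng.normal(0.0, 1.0, N))

S = (0.2 + 0.8 * x + 1.0 * w + u > 0)        # selection equation
y = 0.5 + beta * x + eps                     # outcome, observed iff S

# step 1: probit of S on z = [1, x, w] by MLE -> gamma_hat
Z = np.column_stack([np.ones(N), x, w])
def probit_negll(g):
    q = 2 * S - 1
    return -np.sum(norm.logcdf(q * (Z @ g)))
g_hat = minimize(probit_negll, np.zeros(3), method="BFGS").x

# step 2: OLS of y on [1, x, IMR] over the selected sample only
zg = Z @ g_hat
imr = norm.pdf(zg) / norm.cdf(zg)
Xs = np.column_stack([np.ones(N), x, imr])[S]
coef = np.linalg.lstsq(Xs, y[S], rcond=None)[0]
b_heckit, b_imr = coef[1], coef[2]           # b_imr estimates rho*sigma
print(b_heckit, b_imr)                       # near 1.0 and 0.7
```

A t-test on `b_imr` (with corrected standard errors) is the test of H_o: ρ = 0.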

4.3 Cov(x, ε) ≠ 0

OLS requires Cov(x_i, ε_i) = 0; otherwise,

E[β̂_OLS] = β + Cov(x, ε)/Var(x) ≠ β

situation can arise for a number of reasons
- omitted variable bias (unobserved heterogeneity)
- reverse causation
- measurement error

terminology
- x is exogenous if it is uncorrelated with ε
- x is endogenous if it is correlated with ε

4.3.1 Omitted Variable Bias

a relevant regressor is excluded from the regression model and is correlated with x

example

y_i = α + βx_i + γw_i + ε_i   (True Model)
y_i = α + βx_i + ε̃_i          (Estimated Model)

where ε̃_i = γw_i + ε_i

OLS on the estimated model yields

E[β̂_OLS] = β + Cov(x, ε̃)/Var(x)
          = β + [Cov(x, γw) + Cov(x, ε)]/Var(x)
          = β + [γ Cov(x, w) + Cov(x, ε)]/Var(x)

if Cov(x, ε) = 0 (i.e., the only source of correlation between x and ε̃ is w), then

E[β̂_OLS] = β + γ Cov(x, w)/Var(x)

which may be above or below β depending on sgn(γ) and the direction of correlation between x and w

notes
- w may represent an observed variable that is excluded by mistake, or an unobserved variable that the analyst does not have data on
- in the multiple regression model, bias spills over across variables

example

y_i = α + β_1 x_1i + β_2 x_2i + γw_i + ε_i   (True Model)
y_i = α + β_1 x_1i + β_2 x_2i + ε̃_i          (Estimated Model)

where ε̃_i = γw_i + ε_i

if Cov(x_1, ε̃) = 0, but Cov(x_2, ε̃) ≠ 0, then not only is β̂_2 biased, but β̂_1 is also biased if Cov(x_1, x_2) ≠ 0
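The bias formula above is easy to verify numerically. In the sketch below (parameter values are illustrative assumptions), the "short" regression omitting w is off by exactly γ·Cov(x, w)/Var(x), while the "long" regression recovers β:

```python
# Sketch: omitted-variable bias equals gamma * Cov(x, w) / Var(x).
# All parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
N, beta, gamma = 200_000, 2.0, 1.0

w = rng.normal(0.0, 1.0, N)
x = 0.5 * w + rng.normal(0.0, 1.0, N)    # Cov(x, w) = 0.5, Var(x) = 1.25
y = 1.0 + beta * x + gamma * w + rng.normal(0.0, 1.0, N)

def ols(cols, yv):
    X = np.column_stack([np.ones(N)] + cols)
    return np.linalg.lstsq(X, yv, rcond=None)[0]

b_short = ols([x], y)[1]                 # omits w
b_long = ols([x, w], y)[1]               # includes w

# theoretical bias: gamma * Cov(x, w) / Var(x) = 1.0 * 0.5 / 1.25 = 0.4
print(b_short, b_long)                   # near 2.4 and 2.0
```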

4.3.2 Reverse Causation

not only does x have an effect on y, but y also has an effect on x (i.e., the two variables are jointly determined)

example: wages of working women and the number of children... more children may reduce a woman's productivity at work, or increase her desire for a more flexible job (sacrificing pay), thus reducing her wage; a low-wage woman may opt for more children because the opportunity cost of her time is lower

model

y_i = α + βx_i + ε_i
x_i = θ + δy_i + µ_i

where the parameters represent the structural parameters

substituting the first equation for y in the second equation reveals

x_i = θ + δα + δβx_i + δε_i + µ_i
    = [1/(1 − δβ)] (θ + δα + δε_i + µ_i)

which implies that Cov(x, ε) ≠ 0

intuitively, an unobserved shock to y (i.e., ε) must be correlated with x since changes in y lead to changes in x
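The reduced form above also gives the size of the resulting OLS bias, Cov(x, ε)/Var(x), which can be checked by simulation (parameter values are illustrative assumptions; with standard-normal ε and µ, Cov(x, ε) = δ/(1 − δβ) and Var(x) = (δ² + 1)/(1 − δβ)²):

```python
# Sketch: simultaneity bias; data generated from the two structural
# equations solved jointly. Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
N, alpha, beta, theta, delta = 200_000, 0.0, 0.5, 0.0, 0.4

eps = rng.normal(0.0, 1.0, N)
mu = rng.normal(0.0, 1.0, N)

# reduced form: x = (theta + delta*alpha + delta*eps + mu) / (1 - delta*beta)
x = (theta + delta * alpha + delta * eps + mu) / (1 - delta * beta)
y = alpha + beta * x + eps

X = np.column_stack([np.ones(N), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

# bias = Cov(x, eps)/Var(x) = [delta/k] / [(delta^2 + 1)/k^2], k = 1 - delta*beta
k = 1 - delta * beta
bias = (delta / k) / ((delta**2 + 1) / k**2)
print(b_ols, beta + bias)        # the two agree closely
```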

4.3.3 Measurement Error

problem: data are measured imprecisely

examples
- recall error
- coding errors
- mis-information (e.g., overstate income, understate drug use)
- rounding errors (e.g., labor supply = 40 hrs/wk, or rounded to nearest 5; income rounded to $1000s)

two cases: (i) error in the dependent variable, or (ii) error(s) in independent variable(s)

dependent variable

true model: y_i* = α + βx_i + ε_i, ε_i ~ N(0, σ_ε²), where a * on a variable indicates it is correctly measured

given a random sample {y_i*, x_i}, i = 1, ..., N, OLS is consistent and efficient

with measurement error, we do not observe y_i*; instead we observe

y_i = y_i* + µ_i   (observed = true + measurement error),   µ_i ~ N(0, σ_µ²)

reliability ratio: RR = Var(y*)/Var(y) ∈ [0, 1]

substitution implies that the estimated model is

y_i = α + βx_i + (µ_i + ε_i) = α + βx_i + ε̃_i

properties of OLS estimates

β̂_OLS is unbiased and consistent iff Cov(x, ε̃) = 0, which is the case since

Cov(x, ε̃) = Cov(x, ε) + Cov(x, µ) = 0 (by assumption) + 0 (no ME in x)

α̂_OLS is unbiased and consistent iff β̂_OLS is unbiased and consistent and E[ε̃] = 0, since α̂_OLS = ȳ − β̂_OLS x̄; E[ε̃] = 0 holds because

E[ε̃] = E[ε] + E[µ] = 0 (by assumption) + 0 (classical ME)

OLS standard errors are correct: µ_i normal implies ε̃ is normal, and this holds even if Cov(µ, ε) ≠ 0

what is σ_ε̃²?

Var(ε̃) = Var(µ + ε) = Var(µ) + Var(ε) + 2 Cov(µ, ε) = σ_µ² + σ_ε² + 2ρσ_µσ_ε

which is greater than Var(ε) if ρ = 0; since Var(ε̃) ≥ Var(ε), the standard errors are larger

summary: Classical Errors-in-Variables (CEV) model

assumptions
(i) µ_i ~ N(0, σ_µ²)
(ii) Cov(µ, ε) = 0
(iii) Cov(x, µ) = 0

implications
(i) OLS unbiased, consistent
(ii) standard errors are correct
(iii) R² falls and standard errors rise due to the extra noise in the data
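These implications can be seen directly in a simulation (parameter values are illustrative assumptions): the slope is unchanged in expectation, but the standard error inflates by the factor sqrt((σ_µ² + σ_ε²)/σ_ε²).

```python
# Sketch: classical measurement error in y leaves the OLS slope unbiased
# but inflates the standard errors. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(5)
N, beta, s_eps, s_mu = 200_000, 1.0, 1.0, 2.0

x = rng.normal(0.0, 1.0, N)
y_star = 0.5 + beta * x + rng.normal(0.0, s_eps, N)   # true y
y_obs = y_star + rng.normal(0.0, s_mu, N)             # observed y

def fit(yv):
    X = np.column_stack([np.ones(N), x])
    coef = np.linalg.lstsq(X, yv, rcond=None)[0]
    resid = yv - X @ coef
    se = np.sqrt(resid @ resid / (N - 2) / np.sum((x - x.mean())**2))
    return coef[1], se

b_true, se_true = fit(y_star)
b_obs, se_obs = fit(y_obs)
print(b_obs, se_obs / se_true)    # slope near 1; SE ratio near sqrt(5)
```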

independent variable

true model: y_i = α + βx_i* + ε_i, ε_i ~ N(0, σ_ε²), where a * on a variable indicates it is correctly measured

given a random sample {y_i, x_i*}, i = 1, ..., N, OLS is consistent and efficient

with measurement error, we do not observe x_i*; instead we observe

x_i = x_i* + µ_i   (observed = true + measurement error),   µ_i ~ N(0, σ_µ²)

reliability ratio: RR = Var(x*)/Var(x) ∈ [0, 1]

substitution implies that the estimated model is

y_i = α + βx_i + (ε_i − βµ_i) = α + βx_i + ε̃_i

properties of OLS estimates

β̂_OLS is unbiased and consistent iff Cov(x, ε̃) = 0, which is not likely since

Cov(x, ε̃) = Cov(x* + µ, ε − βµ)
           = Cov(x*, ε) − β Cov(x*, µ) + Cov(µ, ε) − β Var(µ)
           = 0 − 0 + Cov(µ, ε) − βσ_µ²

⇒ β̂_OLS is unbiased and consistent only if (i) β = 0 and Cov(µ, ε) = 0, or (ii) Cov(µ, ε) = βσ_µ²

α̂_OLS is unbiased and consistent iff β̂_OLS is unbiased and consistent, since α̂_OLS = ȳ − β̂_OLS x̄ and

E[ε̃] = E[ε] − β E[µ] = 0 (by assumption) − 0 (classical ME)

summary: Classical Errors-in-Variables (CEV) model

assumptions
(i) µ_i ~ N(0, σ_µ²)
(ii) Cov(µ, ε) = 0
(iii) Cov(x*, µ) = 0

implications
(i) OLS biased, inconsistent
(ii) β̂_OLS is attenuated toward zero (i.e., biased toward zero, biased down in absolute value, correct sign):

plim(β̂_OLS) = β + Cov(x, ε̃)/Var(x)
            = β + Cov(x, ε − βµ)/Var(x)
            = β + [Cov(x, ε) − β Cov(x, µ)]/Var(x)
            = β − β σ_µ²/σ_x²          (since Cov(x, ε) = 0 and Cov(x, µ) = σ_µ²)
            = β [(σ_x² − σ_µ²)/σ_x²]
            = β [σ_x*²/σ_x²]
            = β · RR,   RR ∈ [0, 1]

which is smaller than β in absolute value, but of the same sign as β
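The attenuation result plim(β̂_OLS) = β·RR can be confirmed numerically (parameter values are illustrative assumptions; with equal signal and noise variances, RR = 0.5 and the estimated slope is halved):

```python
# Sketch: attenuation bias, plim(b_OLS) = beta * RR.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
N, beta, s_x, s_mu = 200_000, 2.0, 1.0, 1.0

x_star = rng.normal(0.0, s_x, N)                 # true regressor
x_obs = x_star + rng.normal(0.0, s_mu, N)        # mismeasured regressor
y = 1.0 + beta * x_star + rng.normal(0.0, 1.0, N)

X = np.column_stack([np.ones(N), x_obs])
b_obs = np.linalg.lstsq(X, y, rcond=None)[0][1]

rr = s_x**2 / (s_x**2 + s_mu**2)                 # reliability ratio = 0.5
print(b_obs, beta * rr)                          # both near 1.0
```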

(iii) in the multiple regression

y_i = α + βx_i + Σ_{k=1}^K γ_k x_ki + ε_i

where x is a mismeasured version of x* and the x_k, k = 1, ..., K, are correctly measured, β̂_OLS suffers from attenuation bias, and the γ̂_k are also biased in a complex way unless the x_k are uncorrelated with x

4.3.4 The Solution: Instrumental Variables

goal: devise an alternative estimation technique to obtain consistent estimates when x is endogenous

solution: identify β from exogenous variation in x

suppose x can be decomposed into two independent parts:

x = x* + x̃

where Cov(x, ε) = Cov(x*, ε) + Cov(x̃, ε), and Cov(x̃, ε) ≠ 0, but Cov(x*, ε) = 0

idea is to use the variation in x due to x* to identify β; ignore the variation in x from x̃, since the impact of this variation on y confounds the effects of x and ε

to use only the variation arising from x*, we need additional information; we get this new information by adding data on a new variable, z, called an instrument or instrumental variable (IV) or exclusion restriction

z is an IV for x iff
(i) Cov(x, z) ≠ 0
(ii) Cov(ε, z) = 0
(iii) E[y|x, z] = E[y|x] (i.e., z has no direct effect on y; z is excluded from the model for y)

(i) and (ii) ⇒ z is correlated with x only through x*

estimation techniques
- IV
- Two-Stage Least Squares (TSLS or 2SLS)
- MLE

IV estimator

model: y_i = α + βx_i + ε_i

implies

Cov(y, z) = Cov(α, z) + Cov(βx, z) + Cov(ε, z) = β Cov(x, z)

estimator (consistent, though biased in finite samples)

β̂_IV = Cov(y, z)/Cov(x, z)

formula

β̂_IV = [ (1/(N−1)) Σ_i (y_i − ȳ)(z_i − z̄) ] / [ (1/(N−1)) Σ_i (x_i − x̄)(z_i − z̄) ]
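The ratio-of-covariances form is short enough to compute directly. In the sketch below (parameter values are illustrative assumptions), x is built to be correlated with ε, so OLS is biased while the IV estimate is close to the true β:

```python
# Sketch: the simple IV estimator Cov(y,z)/Cov(x,z) on simulated data
# with an endogenous x. Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
N, beta = 200_000, 1.0

z = rng.normal(0.0, 1.0, N)
eps = rng.normal(0.0, 1.0, N)
x = 0.8 * z + 0.5 * eps + rng.normal(0.0, 1.0, N)   # Cov(x, eps) = 0.5
y = 0.5 + beta * x + eps

b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]
print(b_ols, b_iv)     # OLS biased upward; IV near 1
```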

properties of β̂_IV

β̂_IV is consistent:

plim β̂_IV = plim [ Σ_i y_i(z_i − z̄) / Σ_i x_i(z_i − z̄) ]
          = plim [ Σ_i (α + βx_i + ε_i)(z_i − z̄) / Σ_i x_i(z_i − z̄) ]
          = plim [ Σ_i βx_i(z_i − z̄) / Σ_i x_i(z_i − z̄) ]     (since Σ_i α(z_i − z̄) = 0 and Cov(z, ε) = 0)
          = β

α̂_IV is consistent, since α̂_IV = ȳ − β̂_IV x̄

Var(ε) = σ² is estimated by σ̂² = (1/(N−2)) Σ_i (y_i − α̂_IV − β̂_IV x_i)²

Var(β̂_IV) = σ²/(N Var(x) ρ²_{x,z}) ≈ σ̂² / [Σ_i (x_i − x̄)² R²_{x,z}]   (sample counterpart; R²_{x,z} = ρ̂²_{x,z} in the simple OLS of x on z)

which is decreasing in Var(x) and ρ²_{x,z}

notes
- Var(β̂_IV) > Var(β̂_OLS) if ρ²_{x,z} < 1; recall, Var(β̂_OLS) = σ²/Σ_i (x_i − x̄)²
- inefficient to use IV if x is exogenous
- IV is algebraically equivalent to OLS when x is used as an instrument for itself:

β̂_IV = [Σ_i (y_i − ȳ)(x_i − x̄)] / [Σ_i (x_i − x̄)²] = β̂_OLS

α̂_IV = ȳ − β̂_IV x̄ = ȳ − β̂_OLS x̄ = α̂_OLS

Var(β̂_IV) = σ² / [Σ_i (x_i − x̄)² R²_{x,x}] = σ² / Σ_i (x_i − x̄)² = Var(β̂_OLS)

since R²_{x,x} = 1

multiple regression with only 1 endogenous variable
- exogenous x's serve as instruments for themselves
- solution is simple using matrix algebra

multiple regression with more than 1 endogenous variable
- need a unique instrument for each endogenous variable
- exogenous x's serve as instruments for themselves
- solution is simple using matrix algebra

TSLS

estimation proceeds in 2 steps

first stage: x_i = δ + πz_i + µ_i
- estimable via OLS ⇒ x̂_i
- Cov(x, ε) ≠ 0 ⇒ Cov(µ, ε) ≠ 0
- x̂_i varies across i due to variation in z_i (not µ_i, since x̂_i does not depend on µ̂_i)

second stage: y_i = α + βx̂_i + ε_i

notes
- β̂_TSLS is consistent
- standard errors need to be adjusted since x̂_i is a predicted regressor
- if multiple endogenous variables, need a unique IV for each endogenous x
- if the second stage contains other exogenous variables, these variables must be included in the first stage
- a test of π ≠ 0 is a test for Cov(x, z) ≠ 0
- can test endogeneity using a Hausman test comparing β̂_TSLS with β̂_OLS
- if more than 1 IV for an endogenous variable, then the model is overidentified (as opposed to exactly identified); a test of non-zero covariance between the set of IVs and x is given by a test that the coefficients on all IVs in the first stage are jointly equal to zero; overidentification also enables other tests of instrument validity
- GMM estimation is more efficient if ε is heteroskedastic
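The two stages can be run literally as two OLS regressions (parameter values are illustrative assumptions; coefficients only, since the naive second-stage standard errors would need the adjustment noted above):

```python
# Sketch: two-stage least squares as two OLS regressions.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(8)
N, beta = 200_000, 1.0

z = rng.normal(0.0, 1.0, N)
eps = rng.normal(0.0, 1.0, N)
x = 0.8 * z + 0.5 * eps + rng.normal(0.0, 1.0, N)   # endogenous regressor
y = 0.5 + beta * x + eps

def ols(X, yv):
    return np.linalg.lstsq(X, yv, rcond=None)[0]

# first stage: regress x on [1, z], form fitted values x_hat
Z = np.column_stack([np.ones(N), z])
x_hat = Z @ ols(Z, x)

# second stage: regress y on [1, x_hat]
b_tsls = ols(np.column_stack([np.ones(N), x_hat]), y)[1]
print(b_tsls)    # near 1, matching the simple IV estimator
```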

MLE

estimate the first and second stages simultaneously, but the second stage is replaced with the reduced form (i.e., y is expressed solely as a function of the exogenous variables in the model)

model

x_i = δ + πz_i + µ_i
y_i = α + βx_i + ε_i                          (structural eqn)
    = (α + βδ) + βπz_i + (ε_i + βµ_i)
    = (α + βδ) + βπz_i + ε̃_i                 (reduced form)

where (ε, µ) ~ N_2(0, Σ), a bivariate normal distribution, and

Σ = [ σ_ε²      ρσ_εσ_µ ]
    [ ρσ_εσ_µ   σ_µ²    ]

is a 2x2 symmetric, positive definite matrix

the joint distribution of the reduced-form errors is (ε̃, µ) ~ N_2(0, Σ̃), where

Σ̃ = [ σ_ε² + β²σ_µ² + 2βρσ_εσ_µ    ρσ_εσ_µ + βσ_µ² ]
    [ ρσ_εσ_µ + βσ_µ²              σ_µ²            ]

derive ln L(θ), where θ = {δ, π, α, β, σ_ε, σ_µ, ρ}:

ln L(θ) = Σ_i ln Pr(y_i, x_i | z_i, θ)
        = Σ_i ln Pr(ε̃_i, µ_i | z_i, θ)
        = Σ_i ln[ |J| φ_2(ε̃_i, µ_i; Σ̃) ]

where |J| is the determinant of the Jacobian of the transformation from (y_i, x_i) to (ε̃_i, µ_i) and φ_2(·, ·; Σ̃) is the bivariate normal PDF with covariance matrix Σ̃

estimates are obtained as arg max_θ ln L(θ)

a test of H_o: π = 0 is a test for Cov(x, z) ≠ 0

a test of endogeneity is given by H_o: ρ = 0

specification tests

testing endogeneity
- may be relevant for economic reasons
- relevant since OLS is more efficient if x is exogenous

Hausman test
- if x is exogenous, then β̂_IV ≈ β̂_OLS
- if x is endogenous, then β̂_IV ≠ β̂_OLS
- define a test statistic based on the difference β̂_IV − β̂_OLS:

H = (β̂_IV − β̂_OLS)' (Σ̂_IV − Σ̂_OLS)⁻¹ (β̂_IV − β̂_OLS) ~ χ²_K

where K = # of x's

Durbin-Wu-Hausman test

model

x_i = δ + πz_i + µ_i
y_i = α + βx_i + ε_i

x is endogenous iff Cov(µ, ε) ≠ 0

steps:
(i) estimate the first stage via OLS and obtain the residuals µ̂_i
(ii) estimate y_i = α + βx_i + τµ̂_i + ε_i via OLS
(iii) test H_o: τ = 0; rejection implies x is endogenous

if multiple endogenous variables, then conduct the joint test H_o: τ_1 = ... = τ_K = 0 (K = # of endogenous variables)
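The three steps can be sketched as a simulation (all parameter values and names are illustrative assumptions; with one endogenous regressor the test reduces to the t-test on the first-stage residual, and as a by-product the control-function regression also yields a consistent estimate of β):

```python
# Sketch: Durbin-Wu-Hausman control-function test for endogeneity.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(9)
N, beta = 50_000, 1.0

z = rng.normal(0.0, 1.0, N)
eps = rng.normal(0.0, 1.0, N)
x = 0.8 * z + 0.5 * eps + rng.normal(0.0, 1.0, N)   # endogenous
y = 0.5 + beta * x + eps

def ols(X, yv):
    coef = np.linalg.lstsq(X, yv, rcond=None)[0]
    resid = yv - X @ coef
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return coef, se, resid

# step (i): first-stage residuals
Z = np.column_stack([np.ones(N), z])
_, _, mu_hat = ols(Z, x)

# steps (ii)-(iii): add mu_hat to the outcome equation; t-test its coefficient
X2 = np.column_stack([np.ones(N), x, mu_hat])
coef, se, _ = ols(X2, y)
t_stat = coef[2] / se[2]
print(coef[1], coef[2], t_stat)   # beta consistent; tau near 0.4; t large
```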

testing overidentifying restrictions

if # IVs > # endogenous variables, can test whether Cov(z, ε) = 0

steps:
(i) regress y on x via TSLS ⇒ α̂_TSLS, β̂_TSLS ⇒ residuals ε̂_i
(ii) regress ε̂_i on the z's (all IVs) ⇒ R²
(iii) test statistic: NR² ~ χ²_q, where q is the # of overidentifying restrictions

intuition: if Cov(z, ε) = 0, then the explanatory power of the second regression should be small, R² ≈ 0
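With two valid instruments and one endogenous regressor, q = 1, and the statistic should fail to reject. A minimal sketch (instrument strengths and all other values are illustrative assumptions):

```python
# Sketch: overidentification (N*R^2) test with two valid instruments.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(10)
N, beta = 50_000, 1.0

z1 = rng.normal(0.0, 1.0, N)
z2 = rng.normal(0.0, 1.0, N)
eps = rng.normal(0.0, 1.0, N)
x = 0.6 * z1 + 0.6 * z2 + 0.5 * eps + rng.normal(0.0, 1.0, N)
y = 0.5 + beta * x + eps

def ols(X, yv):
    return np.linalg.lstsq(X, yv, rcond=None)[0]

# step (i): TSLS using both instruments; residuals use the actual x
Z = np.column_stack([np.ones(N), z1, z2])
x_hat = Z @ ols(Z, x)
ab = ols(np.column_stack([np.ones(N), x_hat]), y)
resid = y - ab[0] - ab[1] * x

# steps (ii)-(iii): regress residuals on all IVs, compute N*R^2 ~ chi2(q=1)
fit = Z @ ols(Z, resid)
r2 = 1 - np.sum((resid - fit)**2) / np.sum((resid - resid.mean())**2)
nr2 = N * r2
print(ab[1], nr2)   # beta near 1; nr2 small relative to chi2_1 critical values
```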

weak IV ⇒ Cov(x, z) ≈ 0

can show

plim β̂_IV = β + (ρ_z,ε / ρ_z,x)(σ_ε/σ_x)

if z is a valid IV, then ρ_z,x ≠ 0 and ρ_z,ε = 0 ⇒ plim β̂_IV = β

but, if ρ_z,x ≈ 0 and/or ρ_z,ε ≠ 0, then plim β̂_IV ≠ β

compare to OLS:

plim β̂_OLS = β + ρ_x,ε (σ_ε/σ_x)

and the asymptotic bias of OLS is smaller than that of IV iff

|ρ_z,ε| > |ρ_x,ε ρ_z,x|

which becomes more likely as ρ_z,x → 0

STATA: -ivreg2-
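The bias formula above implies that even a tiny violation of instrument validity is magnified when the instrument is weak. The sketch below (correlations, sample size, and the construction of z are illustrative assumptions) gives both a strong and a weak instrument the same small correlation with ε:

```python
# Sketch: a slightly invalid instrument is far more damaging when weak,
# since the IV bias scales with rho(z,eps)/rho(z,x).
# All parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(11)
N, beta = 500_000, 1.0

eps = rng.normal(0.0, 1.0, N)
common = rng.normal(0.0, 1.0, N)

def make_z(strength):
    # z is slightly correlated with eps (invalid) and related to x
    # through `common` with the given strength
    return strength * common + 0.02 * eps + rng.normal(0.0, 1.0, N)

x = common + 0.5 * eps + rng.normal(0.0, 1.0, N)
y = 0.5 + beta * x + eps

def iv(z):
    return np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]

b_strong = iv(make_z(1.0))    # strong but slightly invalid: bias is tiny
b_weak = iv(make_z(0.02))     # weak and slightly invalid: bias is large
print(b_strong, b_weak)
```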