Lecture 15. Endogeneity & Instrumental Variable Estimation

Similar documents
ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

Department of Economics Session 2012/2013. EC352 Econometric Methods. Solutions to Exercises from Week (0.052)

August 2012 EXAMINATIONS Solution Part I

MULTIPLE REGRESSION EXAMPLE

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Multicollinearity Richard Williams, University of Notre Dame, Last revised January 13, 2015

Nonlinear Regression Functions. SW Ch 8 1/54/

Correlation and Regression

ESTIMATING AVERAGE TREATMENT EFFECTS: IV AND CONTROL FUNCTIONS, II Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

Rockefeller College University at Albany

IMPACT EVALUATION: INSTRUMENTAL VARIABLE METHOD

Addressing Alternative. Multiple Regression Spring 2012

2. Linear regression with multiple regressors

Interaction effects between continuous variables (Optional)

MODEL I: DRINK REGRESSED ON GPA & MALE, WITHOUT CENTERING

problem arises when only a non-random sample is available differs from censored regression model in that x i is also unobserved

Regression Analysis (Spring, 2000)

Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, Last revised February 21, 2015

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

Handling missing data in Stata a whirlwind tour

Stata Walkthrough 4: Regression, Prediction, and Forecasting

Wooldridge, Introductory Econometrics, 3d ed. Chapter 12: Serial correlation and heteroskedasticity in time series regressions

Econometrics I: Econometric Methods

MULTIPLE REGRESSION WITH CATEGORICAL DATA

Correlated Random Effects Panel Data Models

Clustering in the Linear Model

Forecasting in STATA: Tools and Tricks

Standard errors of marginal effects in the heteroskedastic probit model

25 Working with categorical data and factor variables

Solución del Examen Tipo: 1

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:

Linear Regression Models with Logarithmic Transformations

xtmixed & denominator degrees of freedom: myth or magic

DETERMINANTS OF CAPITAL ADEQUACY RATIO IN SELECTED BOSNIAN BANKS

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.

I n d i a n a U n i v e r s i t y U n i v e r s i t y I n f o r m a t i o n T e c h n o l o g y S e r v i c e s

MODELING AUTO INSURANCE PREMIUMS

Introduction to structural equation modeling using the sem command

Nonlinear relationships Richard Williams, University of Notre Dame, Last revised February 20, 2015

Quick Stata Guide by Liz Foster

Using Stata 9 & Higher for OLS Regression Richard Williams, University of Notre Dame, Last revised January 8, 2015

The following postestimation commands for time series are available for regress:

Regression Analysis: A Complete Example

From the help desk: Swamy s random-coefficients model

Discussion Section 4 ECON 139/ Summer Term II

is paramount in advancing any economy. For developed countries such as

Chapter 4: Vector Autoregressive Models

5. Linear Regression

Chapter 1. Vector autoregressions. 1.1 VARs and the identi cation problem

Interaction effects and group comparisons Richard Williams, University of Notre Dame, Last revised February 20, 2015

SPSS Guide: Regression Analysis

Multiple Linear Regression in Data Mining

Study Guide for the Final Exam

Panel Data Analysis Josef Brüderl, University of Mannheim, March 2005

Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2)

Introduction to Time Series Regression and Forecasting

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Multiple Regression: What Is It?

Sample Size Calculation for Longitudinal Studies

An introduction to Value-at-Risk Learning Curve September 2003

Chapter 6: Multivariate Cointegration Analysis

Module 5: Multiple Regression Analysis

Correlation and Simple Linear Regression

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Review of Bivariate Regression

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

4. Multiple Regression in Practice

Chicago Booth BUSINESS STATISTICS Final Exam Fall 2011

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

S TAT E P LA N N IN G OR G A N IZAT IO N

A Simple Feasible Alternative Procedure to Estimate Models with High-Dimensional Fixed Effects

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Using instrumental variables techniques in economics and finance

False. Model 2 is not a special case of Model 1, because Model 2 includes X5, which is not part of Model 1. What she ought to do is estimate

Econometrics Simple Linear Regression

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MEASURING THE INVENTORY TURNOVER IN DISTRIBUTIVE TRADE

Introduction to Quantitative Methods

Data Analysis Methodology 1

A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing Sector

Getting Correct Results from PROC REG

The average hotel manager recognizes the criticality of forecasting. However, most

Basic Statistical and Modeling Procedures Using SAS

Sales forecasting # 1

The Wage Return to Education: What Hides Behind the Least Squares Bias?

Note 2 to Computer class: Standard mis-specification tests

Lecture 3: Differences-in-Differences

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

The Regression Calibration Method for Fitting Generalized Linear Models with Additive Measurement Error

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables.

Introduction to Regression and Data Analysis

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

An introduction to GMM estimation using Stata

Transcription:

Lecture 15. Endogeneity & Instrumental Variable Estimation Saw that measurement error (on right hand side) means that OLS will be biased (biased toward zero) Potential solution to endogeneity instrumental variable estimation - A variable that is correlated with the problem variable but which does not suffer from measurement error Tests for endogeneity Other sources of endogeneity Problems with weak instruments

Idea of Instrumental Variables attributed to Philip Wright 1861-1934 interested in working out whether price of butter was demand or supply driven

More formally, an instrument Z for the variable of concern X satisfies 1) X,Z) 0

More formally, an instrument Z for the variable of concern X satisfies 1) X,Z) 0 correlated with the problem variable

More formally, an instrument Z for the variable of concern X satisfies 1) X,Z) 0 correlated with the problem variable 2) Z,u) = 0

More formally, an instrument Z for the variable of concern X satisfies 1) X,Z) 0 correlated with the problem variable 2) Z,u) = 0 but uncorrelated with the residual (so does not suffer from measurement error and also is not correlated with any unobservable factors influencing the dependent variable)

Instrumental variable (IV) estimation proceeds as follows:

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1)

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply (1) by the instrument Z

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply by the instrument Z Zy = Zb0 + b1zx + Zu

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply by the instrument Z Zy = Zb0 + b1zx + Zu Follows that Z,y) = Cov[Zb0 + b1zx + Zu]

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply by the instrument Z Zy = Zb0 + b1zx + Zu Follows that Z,y) = Cov[Zb0 + b1zx + Zu] = Zb0) + b1z,x) + Z,u)

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply by the instrument Z Zy = Zb0 + b1zx + Zu Follows that Z,y) = Cov[Zb0 + b1zx + Zu] = Zb0) + b1z,x) + Z,u) since Zb0) = 0 (using rules on covariance of a constant)

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply by the instrument Z Zy = Zb0 + b1zx + Zu Follows that Z,y) = Cov[Zb0 + b1zx + Zu] = Zb0) + b1z,x) + Z,u) since Zb0) = 0 (using rules on covariance of a constant) and Z,u) = 0 (if assumption above about the properties of instruments is correct)

Instrumental variable (IV) estimation proceeds as follows: Given a model y = b0 + b1x + u (1) Multiply by the instrument Z Zy = Zb0 + b1zx + Zu Follows that Z,y) = Cov[Zb0 + b1zx + Zu] = Zb0) + b1z,x) + Z,u) since Zb0) = 0 (using rules on covariance of a constant) and Z,u) = 0 (if assumption above about the properties of instruments is correct)

then Z,y) = 0 + b1z,x) + 0

Solving Z,y) = 0 + b1z,x) + 0 for b1 gives the formula to calculate the instrumental variable estimator

Solving Z,y) = 0 + b1z,x) + 0 for b1 gives the formula to calculate the instrumental variable estimator So b1 IV = Z, y) Z, X )

Solving Z,y) = 0 + b1z,x) + 0 for b1 gives the formula to calculate the instrumental variable estimator So b1 IV = Z, y) Z, X ) (compare with b1 OLS = X, y) Var( X ) )

Solving Z,y) = 0 + b1z,x) + 0 for b1 gives the formula to calculate the instrumental variable estimator So b1 IV = Z, y) Z, X ) (compare with b1 OLS = X, y) Var( X ) ) In the presence of measurement error (or endogeneity in general) the IV estimate is unbiased in large samples (but may be biased in small samples) - technically the IV estimator is said to be consistent

Solving Z,y) = 0 + b1z,x) + 0 for b1 gives the formula to calculate the instrumental variable estimator So b1 IV = Z, y) Z, X ) (compare with b1 OLS = X, y) Var( X ) ) In the presence of measurement error (or endogeneity in general) the IV estimate is unbiased in large samples (but may be biased in small samples) - technically the IV estimator is said to be consistent while the OLS estimator is inconsistent IN THE PRESENCE OF ENDOGENEITY which makes IV a useful estimation technique to employ

However can show that (in the 2 variable case) the variance of the IV estimator is given by ^ Var ( β IV 1 ) = N s 2 1 * * Var( X ) r 2 X Z where rxz 2 is the square of the correlation coefficient between endogenous variable and instrument

However can show that (in the 2 variable case) the variance of the IV estimator is given by ^ Var ( β IV 1 ) = N s 2 1 * * Var( X ) r 2 X Z where rxz 2 is the square of the correlation coefficient between endogenous variable and instrument ^ s 2 (compared with OLS Var ( β OLS 1 ) = ) N * Var( X )

However can show that (in the 2 variable case) the variance of the IV estimator is given by ^ Var ( β IV 1 ) = N s 2 1 * * Var( X ) r 2 X Z where rxz 2 is the square of the correlation coefficient between endogenous variable and instrument ^ s 2 (compared with OLS Var ( β OLS 1 ) = ) N * Var( X ) Since r 2 >0 So IV estimation is less precise (efficient) than OLS estimation May sometimes want to trade off bias against efficiency

So why not ensure that the correlation between X and the instrument Z is as high as possible?

So why not ensure that the correlation between X and the instrument Z is as high as possible? - if X and Z are perfectly correlated then Z must also be correlated with u and so suffer the same problems as X the initial problem is not solved.

So why not ensure that the correlation between X and the instrument Z is as high as possible? - if X and Z are perfectly correlated then Z must also be correlated with u and so suffer the same problems as X the initial problem is not solved. Conversely if the correlation between the endogenous variable and the instrument is small there are also problems

Since can always write the IV estimator as b1 IV = Z, y) Z, X )

Since can always write the IV estimator as b1 IV = Z, y) Z, X ) sub. in for y = b0 + b1x + u

Since can always write the IV estimator as b1 IV = Z, y) Z, X ) sub. in for y = b0 + b1x + u Cov ( Z, b0 + b1 X + u) Z, X )

Since can always write the IV estimator as b1 IV = Z, y) Z, X ) sub. in for y = b0 + b1x + u b1 IV = Cov ( Z, b0 + b1 X + u) Z, X ) Cov ( Z, b (, ) (, ) = 0) + b1 Cov Z X + Cov Z u Z, X )

Since can always write the IV estimator as b1 IV = Z, y) Z, X ) sub. in for y = b0 + b1x + u b1 IV = Cov ( Z, b0 + b1 X + u) Z, X ) Cov ( Z, b (, ) (, ) = 0) + b1 Cov Z X + Cov Z u Z, X ) b1 IV = 0 + b 1 Z, X ) + Z, u) Z, X ) So b1 IV = b1 + Z, u) Z, X )

Since can always write the IV estimator as b1 IV = Z, y) Z, X ) sub. in for y = b0 + b1x + u b1 IV = Cov ( Z, b0 + b1 X + u) Z, X ) Cov ( Z, b (, ) (, ) = 0) + b1 Cov Z X + Cov Z u Z, X ) 0 + b (, ) (, ) b1 IV = 1Cov Z X + Cov Z u Z, X ) Z, u) So b1 IV = b1 + Z, X ) So if X,Z) is small then the IV estimate can be a long way from the true value b1

So: always check extent of correlation between X and Z before any IV estimation (see later)

So: always check extent of correlation between X and Z before any IV estimation (see later) In large samples you can have as many instruments as you like though finding good ones is a different matter. In small samples a minimum number of instruments is better (bias in small samples increases with no. of instruments).

Where to find good instruments?

Where to find good instruments? - difficult

Where to find good instruments? - difficult - The appropriate instrument will vary depending on the issue under study.

In the case of measurement error, could use the rank of X as an instrument (ie order the variable X by size and use the number of the order rather than the actual vale.

In the case of measurement error, could use the rank of X as an instrument (ie order the variable X by size and use the number of the order rather than the actual vale. Clearly correlated with the original value but because it is a rank should not be affected with measurement error

In the case of measurement error, could use the rank of X as an instrument (ie order the variable X by size and use the number of the order rather than the actual vale. Clearly correlated with the original value but because it is a rank should not be affected with measurement error - Though this assumes that the measurement error is not so large as to affect the (true) ordering of the X variable

egen rankx=rank(x_obs) /* stata command to create the ranking of x_observ */. list x_obs rankx x_observ rankx 1. 60 1 2. 80 2 3. 100 3 4. 120 4 5. 140 5 6. 200 6 7. 220 7 8. 240 8 9. 260 9 10. 280 10 ranks from smallest observed x to largest Now do instrumental variable estimates using rankx as the instrument for x_obs ivreg y_t (x_ob=rankx) Instrumental variables (2SLS) regression Source SS df MS Number of obs = 10 -------------+------------------------------ F( 1, 8) = 84.44 Model 11654.5184 1 11654.5184 Prob > F = 0.0000 Residual 1125.47895 8 140.684869 R-squared = 0.9119 -------------+------------------------------ Adj R-squared = 0.9009 Total 12779.9974 9 1419.99971 Root MSE = 11.861 ------------------------------------------------------------------------------ y_true Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- x_observ.460465.0501086 9.19 0.000.3449144.5760156 _cons 48.72095 9.307667 5.23 0.001 27.25743 70.18447 ------------------------------------------------------------------------------ Instrumented: x_observ Instruments: rankx ------------------------------------------------------------------------------ Can see both estimated coefficients are a little closer to their true values than estimates from regression with measurement error (but not much)in this case the rank of X is not a very good instrumentnote that standard error in

instrumented regression is larger than standard error in regression of y_true on x_observed as expected with IV estimation Testing for Endogeneity It is good practice to compare OLS and IV estimates. If estimates are very different this may be a sign that things are amiss.

Testing for Endogeneity It is good practice to compare OLS and IV estimates. If estimates are very different this may be a sign that things are amiss. Using the idea that IV estimation will always be (asymptotically) unbiased whereas OLS will only be unbiased if X,u) = 0 then can do the following: Wu-Hausman Test for Endogeneity

Testing for Endogeneity It is good practice to compare OLS and IV estimates. If estimates are very different this may be a sign that things are amiss. Using the idea that IV estimation will always be (asymptotically) unbiased whereas OLS will only be unbiased if X,u) = 0 then can do the following: Wu-Hausman Test for Endogeneity 1. Given y = b0 + b1x + u (A) Regress the endogenous variable X on the instrument(s) Z X = d0 + d1z + v (B)

Testing for Endogeneity It is good practice to compare OLS and IV estimates. If estimates are very different this may be a sign that things are amiss. Using the idea that IV estimation will always be (asymptotically) unbiased whereas OLS will only be unbiased if X,u) = 0 then can do the following: Wu-Hausman Test for Endogeneity 1. Given y = b0 + b1x + u (A) Regress the endogenous variable X on the instrument(s) Z X = d0 + d1z + v (B) Save the residuals ^ v

2. Include this residual as an extra term in the original model

Include this residual as an extra term in the original model ie given y = b0 + b1x + u

Include this residual as an extra term in the original model ie given y = b0 + b1x + u estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test)

Include this residual as an extra term in the original model ie given y = b0 + b1x + u estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u

Include this residual as an extra term in the original model ie given y = b0 + b1x + u estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u

Include this residual as an extra term in the original model ie given y = b0 + b1x + u estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why?

Include this residual as an extra term in the original model ie given y = b0 + b1x + u estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why? because X = d0 + d1z + v

Include this residual as an extra term in the original model ie given y = b0 + b1x + u estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why? because X = d0 + d1z + v Endogenous X = instrument + something else

Include this residual as an extra term in the original model ie given y = b0 + b1x + u (A) estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why? because X = d0 + d1z + v Endogenous X = instrument + something else and so only way X could be correlated with u in (A) is through v

Include this residual as an extra term in the original model ie given y = b0 + b1x + u (A) estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why? because X = d0 + d1z + v Endogenous X = instrument + something else and so only way X could be correlated with u in (A) is through v (since Z is not correlated with u by assumption) This means the residual u in (A) depends on v + some other residual

Include this residual as an extra term in the original model ie given y = b0 + b1x + u (A) estimate ^ y = b0 + b1x + b2v + e and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why? because X = d0 + d1z + v and so only way X could be correlated with u is through v This means the residual in (A) depends on v + some other residual u = b2v + e

Include this residual as an extra term in the original model ie given y = b0 + b1x + u (A) estimate ^ y = b0 + b1x + b2v + e (B) and test whether b2 = 0 (using a t test) If b2 = 0 conclude there is no correlation between X and u If b2 0 conclude there is correlation between X and u Why? because X = d0 + d1z + v and so only way X could be correlated with u is through v This means the residual in (A) depends on v + some residual u = b2v + e

So estimate (B) instead and test whether coefficient on v is significant ^ y = b0 + b1x + b2v + e (B) If it is, conclude that X and error term are indeed correlated; there is endogeneity N.B. This test is only as good as the instruments used and is only valid asymptotically. This may be a problem in small samples and so you should generally use this test only with sample sizes well above 100.

Example: The data set ivdat.dta contains information on the number of GCSE passes of a sample of 16 year olds and the total income of the household in which they live. Income tends to be measured with error. Individuals tend to mis-report incomes, particularly third-party incomes and nonlabour income. The following regression may therefore be subject to measurement error in one of the right hand side variables, (the gender dummy variable is less subject to error).. reg nqfede inc1 female Source SS df MS Number of obs = 252 -------------+------------------------------ F( 2, 249) = 14.55 Model 274.029395 2 137.014698 Prob > F = 0.0000 Residual 2344.9706 249 9.41755263 R-squared = 0.1046 -------------+------------------------------ Adj R-squared = 0.0974 Total 2619.00 251 10.4342629 Root MSE = 3.0688 ------------------------------------------------------------------------------ nqfede Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- inc1.0396859.0087786 4.52 0.000.022396.0569758 female 1.172351.387686 3.02 0.003.4087896 1.935913 _cons 4.929297.4028493 12.24 0.000 4.13587 5.722723 To test endogeneity first regress the suspect variable on the instrument and any exogenous variables in the original regression reg inc1 ranki female Source SS df MS Number of obs = 252 -------------+------------------------------ F( 2, 249) = 247.94 Model 81379.4112 2 40689.7056 Prob > F = 0.0000 Residual 40863.626 249 164.110948 R-squared = 0.6657 -------------+------------------------------ Adj R-squared = 0.6630 Total 122243.037 251 487.024053 Root MSE = 12.811 ------------------------------------------------------------------------------ inc1 Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- ranki.2470712.0110979 22.26 0.000.2252136.2689289 female.2342779 1.618777 0.14 0.885-2.953962 3.422518 _cons.7722511 1.855748 0.42 0.678-2.882712 4.427214

------------------------------------------------------------------------------ 1. save the residuals. predict uhat, resid 2. include residuals as additional regressor in the original equation. reg nqfede inc1 female uhat Source SS df MS Number of obs = 252 -------------+------------------------------ F( 3, 248) = 9.94 Model 281.121189 3 93.7070629 Prob > F = 0.0000 Residual 2337.87881 248 9.42693069 R-squared = 0.1073 -------------+------------------------------ Adj R-squared = 0.0965 Total 2619.00 251 10.4342629 Root MSE = 3.0703 ------------------------------------------------------------------------------ nqfede Coef. Std. Err. t P> t [95% Conf. Interval] -------------+---------------------------------------------------------------- inc1.0450854.0107655 4.19 0.000.0238819.0662888 female 1.176652.3879107 3.03 0.003.4126329 1.940672 uhat -.0161473.0186169-0.87 0.387 -.0528147.0205201 _cons 4.753386.4512015 10.53 0.000 3.864711 5.642062 ------------------------------------------------------------------------------ Now added residual is not statistically significantly different from zero, so conclude that there is no endogeneity bias in the OLS estimates. Hence no need to instrument. Note you can also get this result by typing the following command after the ivreg command ivendog Tests of endogeneity of: inc1 H0: Regressor is exogenous Wu-Hausman F test: 0.75229 F(1,248) P-value = 0.38659 Durbin-Wu-Hausman chi-sq test: 0.76211 Chi-sq(1) P-value = 0.38267 the first test is simply the square of the t value on uhat in the last regression (since t 2 = F)

N.B. This test is only as good as the instruments used and is only valid asymptotically. This may be a problem in small samples and so you should generally use this test only with sample sizes well above 100.

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates)

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2)

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2) This is a 2 equation simultaneous equation system. C and Y appear on both sides of respective equations and are interdependent since

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2) This is a 2 equation simultaneous equation system. C and Y appear on both sides of respective equations and are interdependent since Any shock, represented by Δe

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2) This is a 2 equation simultaneous equation system. C and Y appear on both sides of respective equations and are interdependent since Any shock, represented by Δe ΔC in (1)

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2) This is a 2 equation simultaneous equation system. C and Y appear on both sides of respective equations and are interdependent since Any shock, represented by Δe ΔC in (1) but then this ΔC Δ Y from (2)

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2) This is a 2 equation simultaneous equation system. C and Y appear on both sides of respective equations and are interdependent since Any shock, represented by Δe ΔC in (1) but then this ΔC Δ Y from (2) and then this ΔY Δ C from (1)

Endogeneity & Simultaneous Equation Models Often failure to establish a one-way causal relationship in an econometric model also leads to to endogeneity problems (again violates assumption that X,u) = 0 and so OLS will give biased estimates) Eg C = a + by + e (1) Y= C +I+G + v (2) This is a 2 equation simultaneous equation system. C and Y appear on both sides of respective equations and are interdependent since Any shock, represented by Δe ΔC in (1) but then this ΔC Δ Y from (2) and then this ΔY Δ C from (1)

so changes in C lead to changes in Y and changes in Y lead to changes in C but the fact that Δe ΔC ΔY

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) C = a + by + e (1)

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) which given OLS formula implies ^ X, Y ) b = Var( X ) C = a + by + e (1)

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) C = a + by + e (1) which given OLS formula implies ^ X, Y ) Y, C) b = = (in this example) Var( X ) Var( Y )

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) C = a + by + e (1) which given OLS formula implies ^ X, Y ) Y, C) Y, e) b = = = b + (sub in for C Var( X ) Var( Y ) Var( Y ) from (1))

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) C = a + by + e (1) which given OLS formula implies ^ X, Y ) Y, C) Y, e) b = = = b + (sub in for C Var( X ) Var( Y ) Var( Y ) from (1)) means E(^) b b

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) C = a + by + e (1) which given OLS formula implies ^ X, Y ) Y, C) b = = = b + Var( X ) Var( Y ) Y, e) Var( Y ) means E(^) b b So OLS in the presence of interdependent variables gives biased estimates.

but the fact that Δe ΔC ΔY means X,u) (or in this case Y,e) ) 0 in (1) C = a + by + e (1) which given OLS formula implies ^ X, Y ) Y, C) b = = = b + Var( X ) Var( Y ) Y, e) Var( Y ) means E(^) b b So OLS in the presence of interdependent variables gives biased estimates. Any right hand side variable which has the property Cov ( X, u) 0 is said to be endogenous

Solution: IV estimation (as with measurement error, since symptom, if not cause, is the same)

Solution: IV estimation (as with measurement error, since symptom, if not cause, is the same) ^ Z, y ) b IV = (A) Z, X ) Again, problem is where to find instruments. In a simultaneous equation model, the answer may often be in the system itself

Solution: IV estimation (as with measurement error, since symptom, if not cause, is the same) ^ Z, y ) b IV = (A) Z, X ) Again, problem is where to find instruments. In a simultaneous equation model, the answer may often be in the system itself Example Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2)

Solution: IV estimation (as with measurement error, since symptom, if not cause, is the same) ^ Z, y ) b IV = (A) Z, X ) Again, problem is where to find instruments. In a simultaneous equation model, the answer may often be in the system itself Example Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) This time wages and prices are interdependent so OLS on either (1) or (2) will give biased estimates.. but

Solution: IV estimation (as with measurement error, since symptom, if not cause, is the same) ^ Z, y ) b IV = (A) Z, X ) Again, problem is where to find instruments. In a simultaneous equation model, the answer may often be in the system itself Example Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) This time wages and prices are interdependent so OLS on either (1) or (2) will give biased estimates.. but unemployment does not appear in (1) by assumption

(can this be justified?) but is correlated with wages through (2).

This means unemployment can be used as an instrument for wages in (1) since Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) a) Unemployment, e) = 0 (by assumption it doesn t appear in (1) ) so uncorrelated with residual, which is one requirement of an instrument

This means unemployment can be used as an instrument for wages in (1) since Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) a) Unemployment, e) = 0 (by assumption it doesn t appear in (1) ) so uncorrelated with residual, which is one requirement of an instrument and b) Unemployment, Wage) 0 so correlated with endogenous RHS variable, which is the other requirement of an instrument

This means unemployment can be used as an instrument for wages in (1) since Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) a) Unemployment, e) = 0 (by assumption it doesn t appear in (1) ) so uncorrelated with residual, which is one requirement of an instrument and b) Unemployment, Wage) 0 so correlated with endogenous RHS variable, which is the other requirement of an instrument This process of using extra exogenous variables as instruments for endogenous RHS variables is known as identification

This means unemployment can be used as an instrument for wages in (1) since Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) a) Unemployment, e) = 0 (by assumption the variable doesn t appear in (1) ) so uncorrelated with residual, which is one requirement of an instrument and b) Unemployment, Wage) 0 so correlated with endogenous RHS variable, which is the other requirement of an instrument This process of using extra exogenous variables as instruments for endogenous RHS variables is known as identification If there are no additional exogenous variables outside the original equation that can be used as instruments for the endogenous RHS variables then the equation is said to be unidentified

This means unemployment can be used as an instrument for wages in (1) since Price = b0 + b1wage + e (1) Wage = d0 + d1price + d2unemployment + v (2) a) Unemployment, e) = 0 (by assumption it doesn t appear in (1) ) so uncorrelated with residual, which is one requirement of an instrument and b) Unemployment, Wage) 0 so correlated with endogenous RHS variable, which is the other requirement of an instrument This process of using extra exogenous variables as instruments for endogenous RHS variables is known as identification If there are no additional exogenous variables outside the original equation that can be used as instruments for the endogenous RHS variables then the equation is said to be unidentified (In the example above (2) is unidentified because despite Price being endogenous, there are no other exogenous variables not already in (2) that can be used as instruments for Price).