Discussion of Omitted Variable Bias versus Multicollinearity


1. OVERVIEW

In the lectures on simple linear regression and multiple linear regression we spent a lot of time talking about taking repeat draws of random samples from a population, estimating a regression based on each sample, and calculating a β̂. In addition, we spent a lot of time talking about (i) the expected value of β̂, which we would ideally like to equal the true population parameter β, and (ii) the variance of β̂, which we would ideally like to be low (i.e., a tight distribution of β̂s). Omitted variable bias and multicollinearity are problems to the extent that they can thwart these ideals.

1.1 Expected value of β̂ and Omitted Variable Bias: When talking about the expected value of β̂ (E[β̂]) we discussed the desirable quality of unbiasedness, which says that the mean value of β̂ over many repeat random samples should equal the true population beta (that E[β̂] = β is satisfied). Omitted variable bias affects the expected value E[β̂]. In particular, if you exclude (omit) a variable z from your regression model that is correlated with both your explanatory variable of interest x and your outcome variable y, then the expected value of β̂ will be biased (E[β̂] ≠ β). We call this problem omitted variable bias (OVB).

1.2 Variance of β̂ and Multicollinearity: When talking about the variance of β̂ we discussed the desirable quality of low variance (i.e., a tight/narrow distribution of β̂s), which means that the estimated βs (the β̂s) over many repeat random samples will be tightly centered. Further, low variance for the random variable β̂ corresponds to a small standard error for β̂. Multicollinearity affects var(β̂), also written as σ²(β̂).
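The repeat-sampling thought experiment above can be made concrete with a short simulation (a sketch in Python/NumPy with made-up data, rather than the Stata used in the handouts): draw many random samples from a known population, estimate β̂ by OLS each time, and inspect the mean of the β̂s (unbiasedness) and their spread (sampling variance).

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 1.0, 2.0          # true population parameters (assumed for the sketch)
n, n_samples = 200, 2000         # sample size, number of repeat random draws

estimates = []
for _ in range(n_samples):
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)   # population model with noise
    X = np.column_stack([np.ones(n), x])
    bhat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS estimates for this sample
    estimates.append(bhat[1])

estimates = np.array(estimates)
print("mean of beta1-hat:", estimates.mean())    # close to the true 2.0 (unbiased)
print("sd of beta1-hat:  ", estimates.std())     # the sampling variability of beta1-hat
```

The mean of the β̂₁s sits on top of the true β₁ = 2, while their standard deviation is exactly the "standard error" that the later sections are concerned with.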
In particular, if you include a variable z in your regression model that is correlated with the explanatory variable of interest x, then this acts to increase the variance of β̂₁ (where β̂₁ is the regression coefficient on x) whenever z explains little of the variation in the outcome variable. And, of course, a larger variance of β̂₁ corresponds to a larger standard error for β̂₁. We call this problem multicollinearity.

The problem results from the fact that in multiple linear regression we only use residual variation in the explanatory variables to estimate the regression coefficients. If there is very little residual variation in our explanatory variable of interest, then this is equivalent to having only a little total sample variation in our explanatory variable. Recall that the total sample variation in our explanatory variable of interest (SST_x) lies in the denominator of the variance formula,

  var(β̂₁) = σ̂² / [SST_x (1 − R²_x)],

where σ̂² = SSR/(n − k − 1) and R²_x is the R-squared from regressing x on the other explanatory variables. So when this denominator goes down, our standard error goes up. However, it is worth pointing out that if z explains a great deal of the variation in the outcome variable, then it reduces the sum of squared residuals (SSR). As the SSR lies in the numerator of the variance formula (through σ̂²), this acts to reduce the variance of β̂₁.
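The variance decomposition discussed above is an exact algebraic identity, which a short check can confirm (a Python/NumPy sketch with simulated data and made-up coefficients; the Stata handouts are not reproduced here): the standard error computed from σ̂²(X′X)⁻¹ equals the one computed from σ̂² / [SST_x (1 − R²_x)].

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)                  # x correlated with z
y = 1.0 + 0.5 * x + 0.5 * z + rng.normal(size=n)  # assumed population model

X = np.column_stack([np.ones(n), x, z])
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ bhat
k = X.shape[1] - 1
sigma2 = resid @ resid / (n - k - 1)              # sigma-hat^2 = SSR / (n - k - 1)

# Conventional OLS variance: sigma2 * (X'X)^-1, diagonal entry for x
se_x_matrix = np.sqrt((sigma2 * np.linalg.inv(X.T @ X))[1, 1])

# Textbook decomposition: var(b_x) = sigma2 / (SST_x * (1 - R2_x))
sst_x = ((x - x.mean()) ** 2).sum()
Xother = np.column_stack([np.ones(n), z])         # regress x on the other regressors
g = np.linalg.lstsq(Xother, x, rcond=None)[0]
r2_x = 1 - ((x - Xother @ g) ** 2).sum() / sst_x
se_x_formula = np.sqrt(sigma2 / (sst_x * (1 - r2_x)))

print(se_x_matrix, se_x_formula)                  # the two computations agree
```

The agreement holds for any data set; the decomposition is just a rearrangement of the usual OLS variance, which is why SST_x and R²_x can be discussed separately.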

2. EXAMPLES

2.1 Omitted Variable Bias Example: Once again, β̂ will be biased if we exclude (omit) a variable z that is correlated with both the explanatory variable of interest x and the outcome variable y. The second page of Handout #7b provides a practical demonstration of what can happen to β̂ when you exclude a variable z correlated with both x and y; I reproduce the results here with some additional commentary.

From Handout #7b: Omitting an important variable correlated with the other independent variables: Omitted variable bias

  lwage-hat = 1.17 + .106 educ + .011 exper - .26 female + .012 profocc    R² = .28
             (.08)  (.005)      (.0009)      (.02)        (.03)            n = 2000

  lwage-hat = 2.57 +             .011 exper - .26 female + .358 profocc    R² = .16
             (.03)              (.0009)      (.02)        (.03)            n = 2000

Additional commentary on Handout #7b: The difference between these regression models is that the second model excludes educ. As indicated, in the second equation we have excluded (omitted) the variable educ, which is an important variable in that it determines the outcome (i.e., education affects log(wages)); educ is also correlated with other explanatory variables, in particular the indicator for whether your employment type is a professional occupation (profocc). The correlation between educ and profocc is 0.4276 (positive), as indicated in the correlation matrix below. As a consequence, in the second equation the regression coefficient on profocc is measuring the effect of both having higher education and having a professional occupation; that is, our estimator for the regression coefficient on profocc exhibits a bias relative to the first equation. Consistent with demonstrations from class, the bias present in the estimator for the regression coefficient on profocc in the second equation (0.358) is positive relative to the first equation (0.012). Said differently, due to the exclusion of educ, the estimator for the regression coefficient on profocc in the second equation is positively/upwardly biased relative to the first equation.
. correlate lwage educ exper female profocc nonwhite
(obs=2000)

             |    lwage     educ    exper   female  profocc nonwhite
-------------+------------------------------------------------------
       lwage |   1.0000
        educ |   0.4097   1.0000
       exper |   0.2358   0.0010   1.0000
      female |  -0.1935   0.0489   0.0210   1.0000
     profocc |   0.2181   0.4276  -0.0383   0.1077   1.0000
    nonwhite |  -0.0379  -0.0051  -0.0200   0.0368  -0.0143   1.0000
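The mechanics of the bias can be demonstrated with a small simulation (a Python/NumPy sketch on made-up data; the variable roles loosely mirror the educ/profocc story above but none of these numbers come from the handout): generate a confounder z that affects both x and y, then compare the coefficient on x with z included versus omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)                    # confounder (think "educ")
x = 0.6 * z + rng.normal(size=n)          # x correlated with z (think "profocc")
y = 1.0 + 0.3 * x + 0.5 * z + rng.normal(size=n)   # z also affects y; true effect of x is 0.3

# Full model: includes z
Xfull = np.column_stack([np.ones(n), x, z])
b_full = np.linalg.lstsq(Xfull, y, rcond=None)[0]

# Short model: omits z, so the coefficient on x absorbs part of z's effect
Xshort = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(Xshort, y, rcond=None)[0]

print("coef on x with z included:", b_full[1])   # near the true 0.3
print("coef on x with z omitted: ", b_short[1])  # upward biased, as in the profocc example
```

Because cov(x, z) > 0 and z enters y with a positive coefficient, the omitted-variable bias is positive here, matching the direction of the profocc bias in the handout.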

2.2 Multicollinearity Example: As we add variables to our regression model that are correlated with the explanatory variable(s) of interest, the standard errors for the β̂s on the explanatory variable(s) of interest will tend to increase, particularly when the added variables do not explain variation in the outcome variable (i.e., when the added variables do not reduce the sum of squared residuals). Handout #8 provides a practical demonstration of what happens to the standard errors for your β̂s when you include a variable that is highly correlated with the explanatory variables already in the model but does not explain much variation in y; I reproduce the relevant results from Handout #8 here with some additional commentary.

From Handout #8:

(1) None.

. reg lwage educ exper female

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  3,  1996) =  247.50
       Model |   182.35726     3  60.7857535           Prob > F      =  0.0000
    Residual |  490.219607  1996  .245601005           R-squared     =  0.2711
-------------+------------------------------           Adj R-squared =  0.2700
       Total |  672.576867  1999  .336456662           Root MSE      =  .49558

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1167441   .0053157    21.96   0.000     .1063191    .127169
       exper |   .0109089    .000869    12.55   0.000     .0092046   .0126132
      female |  -.2543189   .0222067   -11.45   0.000    -.2978696  -.2107682
       _cons |   1.055792   .0757381    13.94   0.000     .9072576   1.204326
------------------------------------------------------------------------------

(2) Almost collinear.

. reg lwage educ exper female age

      Source |       SS       df       MS              Number of obs =    2000
-------------+------------------------------           F(  4,  1995) =  185.69
       Model |  182.468262     4  45.6170655           Prob > F      =  0.0000
    Residual |  490.108605  1995  .245668474           R-squared     =  0.2713
-------------+------------------------------           Adj R-squared =  0.2698
       Total |  672.576867  1999  .336456662           Root MSE      =  .49565

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1692465   .1115687     1.52   0.129    -.0495568   .3880498
       exper |   .0633711   .1113346     0.57   0.569    -.1549732   .2817154
      female |  -.2545469   .0222135   -11.46   0.000    -.298111   -.2109827
         age |  -.0524796   .1113744    -0.47   0.638    -.270902    .1659428
       _cons |   1.370917   .6728026     2.04   0.042     .0514472   2.690386
------------------------------------------------------------------------------

Commentary on example from Handout #8: The difference between these regression models is that the second model includes the variable age. In the first set of regression results we see relatively small standard errors for the β̂s on educ and exper, as indicated by the numbers reported in the column under "Std. Err.". In the second set of regression results we see that the standard errors for these two β̂s are

considerably larger. Why does this happen? If the variable added to the regression equation (e.g., age) is highly correlated with variables already in the model (e.g., educ and exper) and does not explain lwage, then the standard errors for the associated β̂s will get very large. This is what is meant by multicollinearity. We are concerned about multicollinearity because large standard errors for the β̂s produce large confidence intervals and make it likely that you will fail to reject the null hypothesis even when the magnitude of the estimated regression coefficient β̂ is much different from the null hypothesis.

Note: The heading "None" on the first set of regression results reflects the fact that the standard errors are small, which suggests that the variables in the model are not close to being collinear (i.e., they are not close to being perfectly correlated). The heading "Almost collinear" for the second set of regression results reflects the fact that the standard errors are very large, which suggests that some of the variables are very highly correlated (i.e., they are almost collinear/perfectly correlated).

3. SUMMARY OF OVB & MULTICOLLINEARITY

Consider the SLR model:

  y = β̂₀ + β̂₁x + û.    (1)

Now suppose we run the MLR model by adding a control variable z:

  y = β̃₀ + β̃₁x + β̃₂z + ũ.    (2)

I use the tilde above the regression coefficients in eq. 2 to distinguish them from the OLS estimators in eq. 1, which have the caret hats. The β̃s still represent OLS estimators; they are just different OLS estimators than those presented in eq. 1.

1) If z is related to both x and y, then we want to include z if we have data on it, because by including z we reduce bias due to its omission. The extent to which z is correlated with x will also affect how much multicollinearity acts to increase the standard error on the regression coefficient for x. In this scenario, if OVB is a real concern, then we need to add z and then live with the consequence of multicollinearity and larger standard errors.
It is important to note, though, that the standard errors still might get smaller with the inclusion of z. Why? Because z is also related to y, adding it in eq. 2 will reduce the SSR, which acts to reduce the variance, and therefore the standard errors, of the beta estimators.

2) If z is unrelated to x but related to y, then we want to include z if we have data on it, because its inclusion will reduce the SSR, which acts to reduce the standard errors of the regression coefficients in the model. So even if a variable doesn't reduce bias, there can be an advantage to including it in the multiple linear regression model.

3) If z is related to x but unrelated to y, then this is multicollinearity at its worst, so we want to exclude z. Adding z to the regression won't reduce bias and won't reduce the sum of squared residuals, but it will reduce the residual variation in x; that is, the variance inflation factor will scale the denominator toward zero, which blows up the variance of the regression coefficient on x.

Note: The variance inflation factor (VIF) for explanatory variable j is:

  VIF_j = 1 / (1 − R²_j),

where R²_j is the R-squared from regressing x_j on the other explanatory variables.

4) If z is unrelated to both x and y, then it should not matter much whether you include it or exclude it. Although if you add a lot of variables unrelated to both x and y, you will start to eat into your degrees of freedom, which also enter into the variance equation for β̂.

Detecting OVB & Multicollinearity: It can sometimes be hard to detect OVB and multicollinearity. We cannot detect OVB if we don't have data measures on the omitted variable; we can only argue that the regression equation and its estimators for the regression coefficients are likely vulnerable to omitted variable bias. If you do have measures of some additional variable, then you can assess the importance of including it in your regression estimation.

If you add a variable z to the regression equation and the estimated regression coefficient on the x of interest changes a lot, then this suggests that z is related to both x and y, so z should be included to avoid OVB. This is the case regardless of what happens to your standard error on the regression coefficient of interest. This corresponds to summary point (1) above.

However, if you add a variable z to the regression equation and the estimated regression coefficient on the x of interest does not change, then this suggests that z is unrelated to x, or unrelated to y, or both. Thus, excluding z does not introduce omitted variable bias. Given this, let's consider what happens to the standard error for the regression coefficient on x when we add z.

- If including z increases the standard error for the regression coefficient on x, then this suggests that z and x are related; including z is the type of multicollinearity we want to avoid. Given that excluding z does not introduce OVB, let's exclude z to avoid larger standard errors.

- If including z decreases the standard error for the regression coefficient on x, then this suggests that z and y are related.
Multicollinearity is not a concern, and in fact adding z to our estimation reduces the standard error for the regression coefficient on x. Even though excluding z does not generate OVB, let's include z to reduce the standard errors.

- If including z does not affect the standard error on x, then this suggests that z is unrelated to x and unrelated to y. Multicollinearity is not a concern, and OVB is not a problem if we omit z. We often exclude these types of variables from our regression, though we sometimes include them to appease our audience or a reviewer (i.e., to assure them that we aren't introducing OVB by excluding a given variable).
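The "Almost collinear" pattern from Handout #8, and the VIF formula above, can both be reproduced in a simulation (a Python/NumPy sketch on made-up data; the names educ/exper/age echo the handout but none of the numbers do): adding a variable that is nearly a linear combination of the existing regressors, and unrelated to y on its own, barely moves the coefficient on educ but inflates its standard error dramatically.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
educ = rng.normal(12, 2, size=n)
exper = rng.normal(10, 3, size=n)
# "age" is almost collinear with educ + exper (tiny independent variation)
# and has no effect of its own on y
age = educ + exper + 6 + rng.normal(scale=0.05, size=n)
y = 1.0 + 0.1 * educ + 0.01 * exper + rng.normal(size=n)

def ols_se(X, y):
    """OLS coefficients and their standard errors."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sigma2 = resid @ resid / (n - X.shape[1])
    return b, np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

X1 = np.column_stack([np.ones(n), educ, exper])        # without age
b1, se1 = ols_se(X1, y)
X2 = np.column_stack([np.ones(n), educ, exper, age])   # with the near-collinear age
b2, se2 = ols_se(X2, y)

# VIF for educ in the second model: 1 / (1 - R^2) from regressing educ
# on the other explanatory variables
Xother = np.column_stack([np.ones(n), exper, age])
g = np.linalg.lstsq(Xother, educ, rcond=None)[0]
r2 = 1 - ((educ - Xother @ g) ** 2).sum() / ((educ - educ.mean()) ** 2).sum()
vif_educ = 1 / (1 - r2)

print("SE on educ without age:", se1[1])
print("SE on educ with age:   ", se2[1])   # much larger, as in Handout #8
print("VIF for educ:          ", vif_educ)
```

Because age adds essentially no residual variation to educ, R²_educ is close to 1, the VIF is enormous, and the standard error on educ blows up even though the underlying relationship between educ and y has not changed. This is summary point (3) in action.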