2. Linear regression with multiple regressors

Aim of this section:
- Introduction of the multiple regression model
- OLS estimation in multiple regression
- Measures-of-fit in multiple regression
- Assumptions in the multiple regression model
- Violations of the assumptions (omitted-variable bias, multicollinearity, heteroskedasticity, autocorrelation)

5

2.1. The multiple regression model

Intuition:
- A regression model specifies a functional (parametric) relationship between a dependent (endogenous) variable $Y$ and a set of $k$ independent (exogenous) regressors $X_1, X_2, \ldots, X_k$
- In a first step, we consider the linear multiple regression model

6

Definition 2.1: (Multiple linear regression model)
The multiple (linear) regression model is given by

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n, \qquad (2.1)$$

where $Y_i$ is the $i$th observation on the dependent variable, $X_{1i}, X_{2i}, \ldots, X_{ki}$ are the $i$th observations on each of the $k$ regressors, and $u_i$ is the stochastic error term.

The population regression line is the relationship that holds between $Y$ and the $X$'s on average:

$$E(Y_i \mid X_{1i} = x_1, X_{2i} = x_2, \ldots, X_{ki} = x_k) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k.$$

7

Meaning of the coefficients:
- The intercept $\beta_0$ is the expected value of $Y_i$ (for all $i = 1, \ldots, n$) when all $X$-regressors equal 0
- $\beta_1, \ldots, \beta_k$ are the slope coefficients on the respective regressors $X_1, \ldots, X_k$
- $\beta_1$, for example, is the expected change in $Y_i$ resulting from changing $X_{1i}$ by one unit, holding $X_{2i}, \ldots, X_{ki}$ constant (and analogously for $\beta_2, \ldots, \beta_k$)

Definition 2.2: (Homoskedasticity, Heteroskedasticity)
The error term $u_i$ is called homoskedastic if the conditional variance of $u_i$ given $X_{1i}, \ldots, X_{ki}$, $\mathrm{Var}(u_i \mid X_{1i}, \ldots, X_{ki})$, is constant for $i = 1, \ldots, n$ and does not depend on the values of $X_{1i}, \ldots, X_{ki}$. Otherwise, the error term is called heteroskedastic.

8
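
A small simulation can make Definition 2.2 concrete. The following Python sketch is not part of the original slides; the regressor, error scales, and functional forms are assumptions chosen only for illustration. In the first case the conditional error variance is constant, in the second it grows with the regressor.

```python
# Minimal illustration (assumed setup): homoskedastic vs. heteroskedastic errors.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(1, 10, n)                 # a single regressor

u_homo = rng.normal(0.0, 2.0, n)          # Var(u_i | x_i) = 4 for every x_i
u_hetero = rng.normal(0.0, 0.5 * x)       # Var(u_i | x_i) = (0.5 x_i)^2 depends on x_i

# Compare the error spread for small and large values of x:
for label, u in [("homoskedastic", u_homo), ("heteroskedastic", u_hetero)]:
    print(label, round(u[x < 5.5].std(), 2), round(u[x >= 5.5].std(), 2))
```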

Example 1: (Student performance)
- Regression of student performance ($Y$) in $n = 420$ US districts on distinct school characteristics (factors)
- $Y_i$: average test score in the $i$th district (TEST_SCORE)
- $X_{1i}$: average class size in the $i$th district (measured by the student-teacher ratio, STR)
- $X_{2i}$: percentage of English learners in the $i$th district (PCTEL)
- Expected signs of the coefficients: $\beta_1 < 0$, $\beta_2 < 0$

9

Example 2: (House prices)
- Regression of house prices ($Y$) recorded for $n = 546$ houses sold in Windsor (Canada) on distinct housing characteristics
- $Y_i$: sale price (in Canadian dollars) of the $i$th house (SALEPRICE)
- $X_{1i}$: lot size (in square feet) of the $i$th property (LOTSIZE)
- $X_{2i}$: number of bedrooms in the $i$th house (BEDROOMS)
- $X_{3i}$: number of bathrooms in the $i$th house (BATHROOMS)
- $X_{4i}$: number of storeys (excluding the basement) in the $i$th house (STOREYS)
- Expected signs of the coefficients: $\beta_1, \beta_2, \beta_3, \beta_4 > 0$

10

2.2. The OLS estimator in multiple regression

Now: Estimation of the coefficients $\beta_0, \beta_1, \ldots, \beta_k$ in the multiple regression model on the basis of $n$ observations by applying the Ordinary Least Squares (OLS) technique

Idea:
- Let $b_0, b_1, \ldots, b_k$ be estimators of $\beta_0, \beta_1, \ldots, \beta_k$
- We can predict $Y_i$ by $b_0 + b_1 X_{1i} + \ldots + b_k X_{ki}$
- The prediction error is $Y_i - b_0 - b_1 X_{1i} - \ldots - b_k X_{ki}$

11

Idea: [continued]
The sum of the squared prediction errors over all $n$ observations is

$$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \ldots - b_k X_{ki})^2 \qquad (2.2)$$

Definition 2.3: (OLS estimators, predicted values, residuals)
The OLS estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are the values of $b_0, b_1, \ldots, b_k$ that minimize the sum of squared prediction errors (2.2). The OLS predicted values $\hat Y_i$ and residuals $\hat u_i$ (for $i = 1, \ldots, n$) are

$$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \ldots + \hat\beta_k X_{ki} \qquad (2.3)$$

and

$$\hat u_i = Y_i - \hat Y_i. \qquad (2.4)$$

12

Remarks:
- The OLS estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ and the residuals $\hat u_i$ are computed from a sample of $n$ observations of $(X_{1i}, \ldots, X_{ki}, Y_i)$ for $i = 1, \ldots, n$
- They are estimators of the unknown true population coefficients $\beta_0, \beta_1, \ldots, \beta_k$ and of the error terms $u_i$
- There are closed-form formulas for calculating the OLS estimates from the data (see the lectures Econometrics I+II)
- In this lecture, we use the software package EViews

13
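
Although the lecture relies on EViews, the closed-form solution can be written compactly in matrix notation as $\hat\beta = (X'X)^{-1}X'y$, where $X$ contains a column of ones for the intercept. Below is a minimal Python sketch of this formula on simulated data; the variable names and "true" coefficient values are assumptions chosen to loosely mimic the student-performance example, not output from the actual dataset.

```python
# Minimal sketch (simulated data, assumed coefficients): OLS via the closed form
# beta_hat = (X'X)^{-1} X'y, plus predicted values (2.3) and residuals (2.4).
import numpy as np

rng = np.random.default_rng(42)
n = 420
str_ = rng.uniform(14, 26, n)                    # hypothetical student-teacher ratio
pctel = rng.uniform(0, 80, n)                    # hypothetical share of English learners
u = rng.normal(0, 14, n)                         # error term
y = 686.0 - 1.1 * str_ - 0.65 * pctel + u        # assumed "true" relationship

X = np.column_stack([np.ones(n), str_, pctel])   # regressor matrix with constant
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimates beta0_hat, beta1_hat, beta2_hat

y_hat = X @ beta_hat                             # predicted values Y_hat_i
u_hat = y - y_hat                                # residuals u_hat_i
print(beta_hat)                                  # close to the assumed (686, -1.1, -0.65)
```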

Regression estimation results (EViews) for the student-performance dataset

Dependent Variable: TEST_SCORE
Method: Least Squares
Date: 07/02/12   Time: 16:29
Sample: 1 420
Included observations: 420

Variable    Coefficient   Std. Error   t-statistic   Prob.
C           686.0322      7.411312      92.56555     0.0000
STR         -1.101296     0.380278      -2.896026    0.0040
PCTEL       -0.649777     0.039343      -16.51588    0.0000

R-squared            0.426431    Mean dependent var      654.1565
Adjusted R-squared   0.423680    S.D. dependent var      19.05335
S.E. of regression   14.46448    Akaike info criterion   8.188387
Sum squared resid    87245.29    Schwarz criterion       8.217246
Log likelihood       -1716.561   Hannan-Quinn criter.    8.199793
F-statistic          155.0136    Durbin-Watson stat      0.685575
Prob(F-statistic)    0.000000

14

[Figure: Predicted values $\hat Y_i$ and residuals $\hat u_i$ for the student-performance dataset; EViews plot of the Residual, Actual, and Fitted series over the 420 observations] 15

Regression estimation results (EViews) for the house-prices dataset

Dependent Variable: SALEPRICE
Method: Least Squares
Date: 07/02/12   Time: 16:50
Sample: 1 546
Included observations: 546

Variable    Coefficient   Std. Error   t-statistic   Prob.
C           -4009.550     3603.109     -1.112803     0.2663
LOTSIZE     5.429174      0.369250     14.70325      0.0000
BEDROOMS    2824.614      1214.808     2.325153      0.0204
BATHROOMS   17105.17      1734.434     9.862107      0.0000
STOREYS     7634.897      1007.974     7.574494      0.0000

R-squared            0.535547    Mean dependent var      68121.60
Adjusted R-squared   0.532113    S.D. dependent var      26702.67
S.E. of regression   18265.23    Akaike info criterion   22.47250
Sum squared resid    1.80E+11    Schwarz criterion       22.51190
Log likelihood       -6129.993   Hannan-Quinn criter.    22.48790
F-statistic          155.9529    Durbin-Watson stat      1.482942
Prob(F-statistic)    0.000000

16

[Figure: Predicted values $\hat Y_i$ and residuals $\hat u_i$ for the house-prices dataset; EViews plot of the Residual, Actual, and Fitted series over the 546 observations] 17

OLS assumptions in the multiple regression model (2.1):
1. $u_i$ has conditional mean zero given $X_{1i}, X_{2i}, \ldots, X_{ki}$: $E(u_i \mid X_{1i}, X_{2i}, \ldots, X_{ki}) = 0$
2. $(X_{1i}, X_{2i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, are independently and identically distributed (i.i.d.) draws from their joint distribution
3. Large outliers are unlikely: $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $Y_i$ have nonzero finite fourth moments
4. There is no perfect multicollinearity

Remarks:
- Note that we do not assume any specific parametric distribution for the $u_i$
- The OLS assumptions nevertheless imply specific distributional results

18

Theorem 2.4: (Unbiasedness, consistency, normality)
Given the OLS assumptions, the following properties of the OLS estimators $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ hold:
1. $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are unbiased estimators of $\beta_0, \ldots, \beta_k$.
2. $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are consistent estimators of $\beta_0, \ldots, \beta_k$ (convergence in probability).
3. In large samples $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are jointly normally distributed and each single OLS estimator $\hat\beta_j$, $j = 0, \ldots, k$, is normally distributed with mean $\beta_j$ and variance $\sigma^2_{\hat\beta_j}$, that is, $\hat\beta_j \sim N(\beta_j, \sigma^2_{\hat\beta_j})$.

19

Remarks:
- In general, the OLS estimators are correlated
- This correlation among $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ arises from the correlation among the regressors $X_1, \ldots, X_k$
- The sampling distribution of the OLS estimators will become relevant in Section 3 (hypothesis testing, confidence intervals)

20
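
Theorem 2.4(3) can be illustrated by simulation. The sketch below is not from the lecture; the design and coefficient values are assumptions. It repeatedly draws samples with deliberately non-normal errors and collects the OLS slope estimates; their distribution is centered at the true coefficient and looks approximately normal.

```python
# Minimal Monte Carlo sketch (assumed design): sampling distribution of the OLS slope.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 2000
beta0, beta1 = 2.0, -0.5
slopes = np.empty(reps)

for r in range(reps):
    x = rng.normal(0, 1, n)
    u = rng.uniform(-3, 3, n)                    # non-normal errors with conditional mean zero
    y = beta0 + beta1 * x + u
    X = np.column_stack([np.ones(n), x])
    slopes[r] = np.linalg.solve(X.T @ X, X.T @ y)[1]

print(slopes.mean())   # close to beta1 = -0.5 (unbiasedness)
print(slopes.std())    # simulated sigma of beta1_hat; a histogram of `slopes` looks normal
```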

2.3. Measures-of-fit in multiple regression

Now: Three well-known summary statistics that measure how well the OLS estimates fit the data

Standard error of regression (SER):
The SER estimates the standard deviation of the error term $u_i$ (under the assumption of homoskedasticity):

$$SER = \sqrt{\frac{1}{n - k - 1} \sum_{i=1}^{n} \hat u_i^2}$$

21

Standard error of regression: [continued]
- We denote the sum of squared residuals by $SSR = \sum_{i=1}^{n} \hat u_i^2$, so that $SER = \sqrt{SSR / (n - k - 1)}$
- Given the OLS assumptions and homoskedasticity, the squared SER, $(SER)^2$, is an unbiased estimator of the unknown constant variance of the $u_i$
- The SER is a measure of the spread of the distribution of $Y_i$ around the population regression line
- Both measures, SER and SSR, are reported in the EViews regression output

22
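
As a quick check of the formula, the SER can be recomputed from the values reported in the student-performance output (Slide 14); this tiny worked example uses only those reported numbers.

```python
# Check of SER = sqrt(SSR / (n - k - 1)) with SSR = 87245.29, n = 420, k = 2,
# as reported in the student-performance EViews output above.
import math

ssr = 87245.29
n, k = 420, 2
ser = math.sqrt(ssr / (n - k - 1))
print(ser)   # approximately 14.46, matching "S.E. of regression" in the output
```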

$R^2$:
- The $R^2$ is the fraction of the sample variance of the $Y_i$ explained by the regressors
- Equivalently, the $R^2$ is 1 minus the fraction of the variance of the $Y_i$ not explained by the regressors (i.e. explained by the residuals)
- Denoting the explained sum of squares (ESS) and the total sum of squares (TSS) by

$$ESS = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2 \quad \text{and} \quad TSS = \sum_{i=1}^{n} (Y_i - \bar Y)^2,$$

  respectively, we define the $R^2$ as

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$

23

$R^2$: [continued]
- In multiple regression, the $R^2$ increases whenever an additional regressor $X_{k+1}$ is added to the regression model, unless the estimated coefficient $\hat\beta_{k+1}$ is exactly equal to zero
- Since in practice it is extremely unusual to have exactly $\hat\beta_{k+1} = 0$, the $R^2$ generally increases (and never decreases) when a new regressor is added to the regression model
- An increase in the $R^2$ due to the inclusion of a new regressor therefore does not necessarily indicate an actually improved fit of the model

24

Adjusted $R^2$:
- The adjusted $R^2$ (in symbols: $\bar R^2$) deflates the conventional $R^2$:

$$\bar R^2 = 1 - \frac{n - 1}{n - k - 1} \cdot \frac{SSR}{TSS}$$

- It is always true that $\bar R^2 < R^2$ (why?)
- When adding a new regressor $X_{k+1}$ to the model, the $\bar R^2$ can increase or decrease (why?)
- The $\bar R^2$ can be negative (why?)

25
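
Both definitions can be verified against the student-performance output (Slide 14), recovering TSS from the reported sample standard deviation of the dependent variable; the short worked example below uses only those reported numbers.

```python
# Check of R^2 = 1 - SSR/TSS and adjusted R^2 = 1 - (n-1)/(n-k-1) * SSR/TSS
# using the student-performance EViews output above.
ssr = 87245.29
n, k = 420, 2
sd_y = 19.05335                    # "S.D. dependent var"
tss = (n - 1) * sd_y ** 2          # TSS = (n - 1) * sample variance of Y

r2 = 1 - ssr / tss
adj_r2 = 1 - (n - 1) / (n - k - 1) * ssr / tss
print(round(r2, 6), round(adj_r2, 6))   # approximately 0.426431 and 0.423680
```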

2.4. Omitted-variable bias

Now: Discussion of a phenomenon that implies a violation of the first OLS assumption on Slide 18
- This issue is known as omitted-variable bias and is extremely relevant in practice
- Although theoretically easy to grasp, avoiding this specification problem turns out to be a nontrivial task in many empirical applications

26

Definition 2.5: (Omitted-variable bias)
Consider the multiple regression model in Definition 2.1 on Slide 7. Omitted-variable bias is the bias in the OLS estimator $\hat\beta_j$ of the coefficient $\beta_j$ (for $j = 1, \ldots, k$) that arises when the associated regressor $X_j$ is correlated with an omitted variable. More precisely, for omitted-variable bias to occur, the following two conditions must hold:
1. $X_j$ is correlated with the omitted variable.
2. The omitted variable is a determinant of the dependent variable $Y$.

27

Example:
- Consider the house-prices dataset (Slides 16, 17)
- Using the entire set of regressors, we obtain the OLS estimate $\hat\beta_2 = 2824.61$ for the BEDROOMS coefficient
- The correlation coefficients between the regressors are as follows:

             BEDROOMS   BATHROOMS   LOTSIZE    STOREYS
BEDROOMS     1.000000   0.373769    0.151851   0.407974
BATHROOMS    0.373769   1.000000    0.193833   0.324066
LOTSIZE      0.151851   0.193833    1.000000   0.083675
STOREYS      0.407974   0.324066    0.083675   1.000000

28

Example: [continued]
- There is positive (significant) correlation between the variable BEDROOMS and all other regressors
- Excluding the other variables from the regression yields the following OLS estimates:

Dependent Variable: SALEPRICE
Method: Least Squares
Date: 14/02/12   Time: 16:10
Sample: 1 546
Included observations: 546

Variable    Coefficient   Std. Error   t-statistic   Prob.
C           28773.43      4413.753     6.519040      0.0000
BEDROOMS    13269.98      1444.598     9.185932      0.0000

R-squared            0.134284    Mean dependent var      68121.60
Adjusted R-squared   0.132692    S.D. dependent var      26702.67
S.E. of regression   24868.03    Akaike info criterion   23.08421
Sum squared resid    3.36E+11    Schwarz criterion       23.09997
Log likelihood       -6299.989   Hannan-Quinn criter.    23.09037
F-statistic          84.38135    Durbin-Watson stat      0.811875
Prob(F-statistic)    0.000000

- The alternative OLS estimates of the BEDROOMS coefficient differ substantially (13269.98 versus 2824.61)

29

Intuitive explanation of the omitted-variable bias:
- Consider the variable LOTSIZE as omitted
- LOTSIZE is an important variable for explaining SALEPRICE
- If we omit LOTSIZE from the regression, it will try to enter in the only way it can, namely through its positive correlation with the included variable BEDROOMS
- The coefficient on BEDROOMS will then confound the effects of BEDROOMS and LOTSIZE on SALEPRICE

30

More formal explanation:
Omitted-variable bias means that the first OLS assumption on Slide 18 is violated. Reasoning:
- In the multiple regression model, the error term $u_i$ represents all factors other than the included regressors $X_1, \ldots, X_k$ that are determinants of $Y_i$
- If an omitted variable is correlated with at least one of the included regressors $X_1, \ldots, X_k$, then $u_i$ (which contains this factor) is correlated with the set of regressors
- This implies that $E(u_i \mid X_{1i}, \ldots, X_{ki}) \neq 0$

31

Important result:
In the case of omitted-variable bias,
- the OLS estimators on the corresponding included regressors are biased in finite samples
- this bias does not vanish in large samples
- the OLS estimators are inconsistent

Solutions to omitted-variable bias: To be discussed in Section 5

32
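
The mechanics can be reproduced in a short simulation; the sketch below is not from the lecture, and the data-generating process and all coefficient values are assumptions for illustration. An omitted regressor that matters for $Y$ and is correlated with an included regressor biases the short-regression coefficient, and the bias persists no matter how large the sample is.

```python
# Minimal sketch (assumed data-generating process): omitted-variable bias.
import numpy as np

rng = np.random.default_rng(8)
n = 100_000                                   # large n: what remains is bias, not noise
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # omitted variable, correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(y, X):
    return np.linalg.solve(X.T @ X, X.T @ y)

full  = ols(y, np.column_stack([np.ones(n), x1, x2]))   # x2 included
short = ols(y, np.column_stack([np.ones(n), x1]))       # x2 omitted
print(full[1])    # close to the true coefficient 2.0
print(short[1])   # close to 2.0 + 3.0 * 0.7 = 4.1: biased and inconsistent
```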

2.5. Multicollinearity

Definition 2.6: (Perfect multicollinearity)
Consider the multiple regression model in Definition 2.1 on Slide 7. The regressors $X_1, \ldots, X_k$ are said to be perfectly multicollinear if one of the regressors is a perfect linear function of the other regressors.

Remarks:
- Under perfect multicollinearity the OLS estimates cannot be calculated, due to division by zero in the OLS formulas
- Perfect multicollinearity often reflects a logical mistake in choosing the regressors or some unrecognized feature in the data set

33

Example: (Dummy variable trap)
- Consider the student-performance dataset
- Suppose we partition the school districts into the 3 categories (1) rural, (2) suburban, (3) urban
- We represent the categories by the dummy regressors

$$\text{RURAL}_i = \begin{cases} 1 & \text{if district } i \text{ is rural} \\ 0 & \text{otherwise} \end{cases}$$

  and by $\text{SUBURBAN}_i$ and $\text{URBAN}_i$, analogously defined
- Since each district belongs to one and only one category, we have for each district $i$:

$$\text{RURAL}_i + \text{SUBURBAN}_i + \text{URBAN}_i = 1$$

34

Example: [continued]
- Now, let us define the constant regressor $X_0$ associated with the intercept coefficient $\beta_0$ in the multiple regression model on Slide 7 by $X_{0i} \equiv 1$ for $i = 1, \ldots, n$
- Then, for $i = 1, \ldots, n$, the following relationship holds among the regressors:

$$X_{0i} = \text{RURAL}_i + \text{SUBURBAN}_i + \text{URBAN}_i$$

  that is, perfect multicollinearity
- To estimate the regression we must exclude either one of the dummy regressors or the constant regressor $X_0$ (the intercept $\beta_0$) from the regression

35

Theorem 2.7: (Dummy variable trap)
Let there be $G$ different categories in the data set, represented by $G$ dummy regressors. If
1. each observation $i$ falls into one and only one category,
2. there is an intercept (constant regressor) in the regression,
3. all $G$ dummy regressors are included as regressors,
then regression estimation fails because of perfect multicollinearity.

Usual remedy: Exclude one of the dummy regressors ($G - 1$ dummy regressors are sufficient)

36
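
Theorem 2.7 can be seen directly in the regressor matrix. The sketch below uses hypothetical rural/suburban/urban dummies rather than the actual student-performance data: including a constant plus all $G = 3$ dummies makes the matrix rank-deficient, so $X'X$ cannot be inverted, while dropping one dummy restores full column rank.

```python
# Minimal sketch (hypothetical categories): the dummy variable trap as a rank problem.
import numpy as np

n = 12
category = np.tile([0, 1, 2], n // 3)            # 0 = rural, 1 = suburban, 2 = urban
dummies = np.eye(3)[category]                    # columns: RURAL_i, SUBURBAN_i, URBAN_i
const = np.ones((n, 1))                          # constant regressor X_0

X_trap = np.hstack([const, dummies])             # constant + all three dummies
X_ok   = np.hstack([const, dummies[:, :2]])      # constant + G - 1 = 2 dummies

print(np.linalg.matrix_rank(X_trap), X_trap.shape[1])  # rank 3 < 4 columns: X'X singular
print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])      # rank 3 = 3 columns: OLS well defined
```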

Definition 2.8: (Imperfect multicollinearity)
Consider the multiple regression model in Definition 2.1 on Slide 7. The regressors $X_1, \ldots, X_k$ are said to be imperfectly multicollinear if two or more of the regressors are highly correlated in the sense that there is a linear function of the regressors that is highly correlated with another regressor.

Remarks:
- Imperfect multicollinearity does not pose any (numerical) problems in calculating OLS estimates
- However, if regressors are imperfectly multicollinear, then the coefficients on at least one individual regressor will be imprecisely estimated

37

Remarks: [continued]
- Techniques for identifying and mitigating imperfect multicollinearity are presented in econometric textbooks (e.g. Hill et al., 2010, pp. 155-156)

38
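
One standard diagnostic described in such textbooks, although not covered in the slides above, is the variance inflation factor $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing $X_j$ on the remaining regressors. The Python sketch below computes it on simulated regressors (all names and numbers are assumptions); as a common rule of thumb, large values, often taken as above about 10, signal imprecisely estimated coefficients.

```python
# Minimal sketch (simulated regressors): variance inflation factors as a diagnostic
# for imperfect multicollinearity, VIF_j = 1 / (1 - R_j^2).
import numpy as np

def vif(X):
    """VIF for each column of X (X contains the regressors, without the constant)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])   # other regressors + constant
        beta = np.linalg.lstsq(Z, target, rcond=None)[0]
        ssr = np.sum((target - Z @ beta) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ssr / tss
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

rng = np.random.default_rng(9)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)      # nearly a linear function of x1
x3 = rng.normal(size=n)
print(vif(np.column_stack([x1, x2, x3])))       # large VIFs for x1 and x2, near 1 for x3
```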