2. Linear regression with multiple regressors

Aim of this section:
- Introduction of the multiple regression model
- OLS estimation in multiple regression
- Measures of fit in multiple regression
- Assumptions in the multiple regression model
- Violations of the assumptions (omitted-variable bias, multicollinearity, heteroskedasticity, autocorrelation)

5
2.1. The multiple regression model

Intuition:
- A regression model specifies a functional (parametric) relationship between a dependent (endogenous) variable Y and a set of k independent (exogenous) regressors X_1, X_2, ..., X_k
- In a first step, we consider the linear multiple regression model

6
Definition 2.1: (Multiple linear regression model)
The multiple (linear) regression model is given by

Y_i = β_0 + β_1 X_1i + β_2 X_2i + ... + β_k X_ki + u_i,   (2.1)

i = 1, ..., n, where
- Y_i is the i-th observation on the dependent variable,
- X_1i, X_2i, ..., X_ki are the i-th observations on each of the k regressors,
- u_i is the stochastic error term.

The population regression line is the relationship that holds between Y and the X's on average:

E(Y_i | X_1i = x_1, X_2i = x_2, ..., X_ki = x_k) = β_0 + β_1 x_1 + ... + β_k x_k.

7
Meaning of the coefficients:
- The intercept β_0 is the expected value of Y_i (for all i = 1, ..., n) when all X-regressors equal 0
- β_1, ..., β_k are the slope coefficients on the respective regressors X_1, ..., X_k
- β_1, for example, is the expected change in Y_i resulting from changing X_1i by one unit, holding X_2i, ..., X_ki constant (and analogously for β_2, ..., β_k)

Definition 2.2: (Homoskedasticity, heteroskedasticity)
The error term u_i is called homoskedastic if the conditional variance of u_i given X_1i, ..., X_ki, Var(u_i | X_1i, ..., X_ki), is constant for i = 1, ..., n and does not depend on the values of X_1i, ..., X_ki. Otherwise, the error term is called heteroskedastic.

8
Example 1: (Student performance)
Regression of student performance (Y) in n = 420 US districts on distinct school characteristics (factors):
- Y_i: average test score in the i-th district (TEST_SCORE)
- X_1i: average class size in the i-th district (measured by the student-teacher ratio, STR)
- X_2i: percentage of English learners in the i-th district (PCTEL)

Expected signs of the coefficients: β_1 < 0, β_2 < 0

9
Example 2: (House prices)
Regression of house prices (Y) recorded for n = 546 houses sold in Windsor (Canada) on distinct housing characteristics:
- Y_i: sale price (in Canadian dollars) of the i-th house (SALEPRICE)
- X_1i: lot size (in square feet) of the i-th property (LOTSIZE)
- X_2i: number of bedrooms in the i-th house (BEDROOMS)
- X_3i: number of bathrooms in the i-th house (BATHROOMS)
- X_4i: number of storeys (excluding the basement) in the i-th house (STOREYS)

Expected signs of the coefficients: β_1, β_2, β_3, β_4 > 0

10
2.2. The OLS estimator in multiple regression

Now:
Estimation of the coefficients β_0, β_1, ..., β_k in the multiple regression model on the basis of n observations by applying the Ordinary Least Squares (OLS) technique

Idea:
- Let b_0, b_1, ..., b_k be estimators of β_0, β_1, ..., β_k
- We can predict Y_i by b_0 + b_1 X_1i + ... + b_k X_ki
- The prediction error is Y_i - b_0 - b_1 X_1i - ... - b_k X_ki

11
Idea: [continued]
The sum of the squared prediction errors over all n observations is

Σ_{i=1}^n (Y_i - b_0 - b_1 X_1i - ... - b_k X_ki)^2   (2.2)

Definition 2.3: (OLS estimators, predicted values, residuals)
The OLS estimators β̂_0, β̂_1, ..., β̂_k are the values of b_0, b_1, ..., b_k that minimize the sum of squared prediction errors (2.2). The OLS predicted values Ŷ_i and residuals û_i (for i = 1, ..., n) are

Ŷ_i = β̂_0 + β̂_1 X_1i + ... + β̂_k X_ki   (2.3)

and

û_i = Y_i - Ŷ_i.   (2.4)

12
Remarks:
- The OLS estimators β̂_0, β̂_1, ..., β̂_k and the residuals û_i are computed from a sample of n observations of (X_1i, ..., X_ki, Y_i) for i = 1, ..., n
- The β̂_j are estimators of the unknown true population coefficients β_0, β_1, ..., β_k, and the û_i are estimators of the unobserved error terms u_i
- There are closed-form formulas for calculating the OLS estimates from the data (see the lectures Econometrics I+II)
- In this lecture, we use the software package EViews

13
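The closed-form solution can be sketched in a few lines of NumPy (a minimal illustration with made-up numbers, not the EViews routine): stacking the regressors into a design matrix X with a leading column of ones, the OLS estimator solves the normal equations (X'X)β̂ = X'Y.

```python
import numpy as np

# Made-up data: n = 6 observations, k = 2 regressors (purely illustrative).
X1 = np.array([20.0, 18.5, 22.1, 19.3, 21.0, 17.8])
X2 = np.array([30.0, 12.5, 45.2, 8.1, 25.7, 5.9])
Y  = np.array([650.0, 670.2, 631.4, 678.9, 645.3, 685.1])

# Design matrix with a column of ones for the intercept beta_0.
X = np.column_stack([np.ones(len(Y)), X1, X2])

# Closed-form OLS solution via the normal equations: (X'X) beta_hat = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = Y - (Y - X @ beta_hat)   # predicted values, eq. (2.3)
u_hat = Y - X @ beta_hat         # residuals, eq. (2.4)

# By construction, the residuals are orthogonal to every column of X.
print(np.max(np.abs(X.T @ u_hat)) < 1e-6)
```

The orthogonality check is exactly the first-order condition of minimizing (2.2) with respect to b_0, ..., b_k.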
Regression estimation results (EViews) for the student-performance dataset:

Dependent Variable: TEST_SCORE
Method: Least Squares
Date: 07/02/12   Time: 16:29
Sample: 1 420
Included observations: 420

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           686.0322      7.411312     92.56555      0.0000
STR         -1.101296     0.380278     -2.896026     0.0040
PCTEL       -0.649777     0.039343     -16.51588     0.0000

R-squared            0.426431    Mean dependent var     654.1565
Adjusted R-squared   0.423680    S.D. dependent var     19.05335
S.E. of regression   14.46448    Akaike info criterion  8.188387
Sum squared resid    87245.29    Schwarz criterion      8.217246
Log likelihood       -1716.561   Hannan-Quinn criter.   8.199793
F-statistic          155.0136    Durbin-Watson stat     0.685575
Prob(F-statistic)    0.000000

14
Predicted values Ŷ_i and residuals û_i for the student-performance dataset:

[Figure: actual and fitted values of TEST_SCORE (left scale, ca. 600-720) and residuals (right scale, ca. -60 to 60) plotted against the observation index 1, ..., 420; legend: Residual, Actual, Fitted]

15
Regression estimation results (EViews) for the house-prices dataset:

Dependent Variable: SALEPRICE
Method: Least Squares
Date: 07/02/12   Time: 16:50
Sample: 1 546
Included observations: 546

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            -4009.550     3603.109     -1.112803     0.2663
LOTSIZE      5.429174      0.369250     14.70325      0.0000
BEDROOMS     2824.614      1214.808     2.325153      0.0204
BATHROOMS    17105.17      1734.434     9.862107      0.0000
STOREYS      7634.897      1007.974     7.574494      0.0000

R-squared            0.535547    Mean dependent var     68121.60
Adjusted R-squared   0.532113    S.D. dependent var     26702.67
S.E. of regression   18265.23    Akaike info criterion  22.47250
Sum squared resid    1.80E+11    Schwarz criterion      22.51190
Log likelihood       -6129.993   Hannan-Quinn criter.   22.48790
F-statistic          155.9529    Durbin-Watson stat     1.482942
Prob(F-statistic)    0.000000

16
Predicted values Ŷ_i and residuals û_i for the house-prices dataset:

[Figure: actual and fitted values of SALEPRICE (left scale, ca. 0-200,000) and residuals (right scale, ca. -80,000 to 120,000) plotted against the observation index 1, ..., 546; legend: Residual, Actual, Fitted]

17
OLS assumptions in the multiple regression model (2.1):
1. u_i has conditional mean zero given X_1i, X_2i, ..., X_ki: E(u_i | X_1i, X_2i, ..., X_ki) = 0
2. (X_1i, X_2i, ..., X_ki, Y_i), i = 1, ..., n, are independently and identically distributed (i.i.d.) draws from their joint distribution
3. Large outliers are unlikely: X_1i, X_2i, ..., X_ki and Y_i have nonzero finite fourth moments
4. There is no perfect multicollinearity

Remarks:
- Note that we do not assume any specific parametric distribution for the u_i
- The OLS assumptions imply specific distribution results

18
Theorem 2.4: (Unbiasedness, consistency, normality)
Given the OLS assumptions, the following properties of the OLS estimators β̂_0, β̂_1, ..., β̂_k hold:
1. β̂_0, β̂_1, ..., β̂_k are unbiased estimators of β_0, ..., β_k.
2. β̂_0, β̂_1, ..., β̂_k are consistent estimators of β_0, ..., β_k. (Convergence in probability)
3. In large samples, β̂_0, β̂_1, ..., β̂_k are jointly normally distributed, and each single OLS estimator β̂_j, j = 0, ..., k, is normally distributed with mean β_j and variance σ^2_β̂_j, that is,

   β̂_j ~ N(β_j, σ^2_β̂_j).

19
Remarks:
- In general, the OLS estimators are correlated
- This correlation among β̂_0, β̂_1, ..., β̂_k arises from the correlation among the regressors X_1, ..., X_k
- The sampling distribution of the OLS estimators will become relevant in Section 3 (hypothesis testing, confidence intervals)

20
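A small Monte Carlo sketch (simulated data; the true coefficient values and the correlation structure are assumptions chosen for illustration) makes Theorem 2.4 and the correlation between the estimators tangible:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, -0.5])          # assumed true (beta_0, beta_1, beta_2)
n, reps = 200, 2000

estimates = np.empty((reps, 3))
for r in range(reps):
    X1 = rng.normal(size=n)
    X2 = 0.6 * X1 + rng.normal(size=n)     # X1 and X2 positively correlated
    u = rng.normal(size=n)                 # errors with conditional mean zero
    Y = beta[0] + beta[1] * X1 + beta[2] * X2 + u
    X = np.column_stack([np.ones(n), X1, X2])
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

# Unbiasedness: the average of the estimates is close to the true coefficients.
print(np.abs(estimates.mean(axis=0) - beta).max() < 0.05)
# Correlated regressors induce (here: negative) correlation between the slopes.
print(np.corrcoef(estimates[:, 1], estimates[:, 2])[0, 1] < 0)
```

Increasing n shrinks the spread of the 2000 estimates around the true values, which is the consistency statement of the theorem.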
2.3. Measures of fit in multiple regression

Now:
Three well-known summary statistics that measure how well the OLS estimates fit the data

Standard error of regression (SER):
The SER estimates the standard deviation of the error term u_i (under the assumption of homoskedasticity):

SER = sqrt( (1 / (n - k - 1)) · Σ_{i=1}^n û_i^2 )

21
Standard error of regression: [continued]
- We denote the sum of squared residuals by SSR ≡ Σ_{i=1}^n û_i^2, so that

  SER = sqrt( SSR / (n - k - 1) )

- Given the OLS assumptions and homoskedasticity, the squared SER, (SER)^2, is an unbiased estimator of the unknown constant variance of the u_i
- The SER is a measure of the spread of the distribution of Y_i around the population regression line
- Both measures, SER and SSR, are reported in the EViews regression output

22
R^2:
- The R^2 is the fraction of the sample variance of the Y_i explained by the regressors
- Equivalently, the R^2 is 1 minus the fraction of the variance of the Y_i not explained by the regressors (i.e. the fraction attributable to the residuals)
- Denoting the explained sum of squares (ESS) and the total sum of squares (TSS) by

  ESS = Σ_{i=1}^n (Ŷ_i - Ȳ)^2   and   TSS = Σ_{i=1}^n (Y_i - Ȳ)^2,

  respectively, we define the R^2 as

  R^2 = ESS / TSS = 1 - SSR / TSS

23
R^2: [continued]
- In multiple regression, the R^2 increases whenever an additional regressor X_k+1 is added to the regression model, unless the estimated coefficient β̂_k+1 is exactly equal to zero
- Since in practice it is extremely unusual to have exactly β̂_k+1 = 0, the R^2 generally increases (and never decreases) when a new regressor is added to the regression model
- An increase in the R^2 due to the inclusion of a new regressor does not necessarily indicate an actually improved fit of the model

24
Adjusted R^2:
- The adjusted R^2 (in symbols: R̄^2) deflates the conventional R^2:

  R̄^2 = 1 - [(n - 1) / (n - k - 1)] · (SSR / TSS)

- It is always true that R̄^2 < R^2 (why?)
- When adding a new regressor X_k+1 to the model, the R̄^2 can increase or decrease (why?)
- The R̄^2 can be negative (why?)

25
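All three statistics are simple functions of the residuals. The following sketch (simulated data; the coefficient values are assumptions) computes SSR, TSS, ESS, SER, R^2 and the adjusted R^2 directly from their definitions:

```python
import numpy as np

# Simulated data set; true coefficients 3.0, 1.5, -2.0 are illustrative choices.
rng = np.random.default_rng(1)
n, k = 50, 2
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 3.0 + 1.5 * X1 - 2.0 * X2 + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), X1, X2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat

SSR = np.sum(u_hat**2)                       # sum of squared residuals
TSS = np.sum((Y - Y.mean())**2)              # total sum of squares
ESS = np.sum((X @ beta_hat - Y.mean())**2)   # explained sum of squares

SER = np.sqrt(SSR / (n - k - 1))             # standard error of regression
R2 = 1 - SSR / TSS
R2_adj = 1 - (n - 1) / (n - k - 1) * SSR / TSS

# With an intercept, TSS = ESS + SSR, and the adjusted R^2 lies below R^2.
print(np.isclose(TSS, ESS + SSR))
print(R2_adj < R2 < 1)
```

The decomposition TSS = ESS + SSR is what makes the two expressions for R^2 equivalent; it relies on the regression containing an intercept.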
2.4. Omitted-variable bias

Now:
- Discussion of a phenomenon that implies a violation of the first OLS assumption on Slide 18
- This issue is known as omitted-variable bias and is extremely relevant in practice
- Although theoretically easy to grasp, avoiding this specification problem turns out to be a nontrivial task in many empirical applications

26
Definition 2.5: (Omitted-variable bias)
Consider the multiple regression model in Definition 2.1 on Slide 7. Omitted-variable bias is the bias in the OLS estimator β̂_j of the coefficient β_j (for j = 1, ..., k) that arises when the associated regressor X_j is correlated with an omitted variable. More precisely, for omitted-variable bias to occur, the following two conditions must hold:
1. X_j is correlated with the omitted variable.
2. The omitted variable is a determinant of the dependent variable Y.

27
Example:
- Consider the house-prices dataset (Slides 16, 17)
- Using the entire set of regressors, we obtain the OLS estimate β̂_2 = 2824.61 for the BEDROOMS coefficient
- The correlation coefficients between the regressors are as follows:

            BEDROOMS   BATHROOMS  LOTSIZE    STOREYS
BEDROOMS    1.000000   0.373769   0.151851   0.407974
BATHROOMS   0.373769   1.000000   0.193833   0.324066
LOTSIZE     0.151851   0.193833   1.000000   0.083675
STOREYS     0.407974   0.324066   0.083675   1.000000

28
Example: [continued]
- There is positive (significant) correlation between the variable BEDROOMS and all other regressors
- Excluding the other variables from the regression yields the following OLS estimates:

Dependent Variable: SALEPRICE
Method: Least Squares
Date: 14/02/12   Time: 16:10
Sample: 1 546
Included observations: 546

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           28773.43      4413.753     6.519040      0.0000
BEDROOMS    13269.98      1444.598     9.185932      0.0000

R-squared            0.134284    Mean dependent var     68121.60
Adjusted R-squared   0.132692    S.D. dependent var     26702.67
S.E. of regression   24868.03    Akaike info criterion  23.08421
Sum squared resid    3.36E+11    Schwarz criterion      23.09997
Log likelihood       -6299.989   Hannan-Quinn criter.   23.09037
F-statistic          84.38135    Durbin-Watson stat     0.811875
Prob(F-statistic)    0.000000

- The alternative OLS estimates of the BEDROOMS coefficient differ substantially

29
Intuitive explanation of the omitted-variable bias:
- Consider the variable LOTSIZE as omitted
- LOTSIZE is an important variable for explaining SALEPRICE
- If we omit LOTSIZE from the regression, it will try to enter in the only way it can, namely through its positive correlation with the included variable BEDROOMS
- The coefficient on BEDROOMS will confound the effects of BEDROOMS and LOTSIZE on SALEPRICE

30
More formal explanation:
- Omitted-variable bias means that the first OLS assumption on Slide 18 is violated
- Reasoning: In the multiple regression model the error term u_i represents all factors other than the included regressors X_1, ..., X_k that are determinants of Y_i
- If an omitted variable is correlated with at least one of the included regressors X_1, ..., X_k, then u_i (which contains this factor) is correlated with the set of regressors
- This implies that E(u_i | X_1i, ..., X_ki) ≠ 0

31
Important result:
In the case of omitted-variable bias,
- the OLS estimators of the coefficients on the corresponding included regressors are biased in finite samples,
- this bias does not vanish in large samples,
- the OLS estimators are inconsistent.

Solutions to omitted-variable bias:
To be discussed in Section 5

32
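The bias and its persistence in large samples can be demonstrated by simulation. In this stylized sketch the variable names allude to the house-price example, but the numbers (true coefficients 1.0 and 2.0, the correlation structure) are assumptions: omitting the correlated "lotsize" pushes the estimated "bedrooms" coefficient well above its true value.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 1000
short, full = np.empty(reps), np.empty(reps)

for r in range(reps):
    lotsize = rng.normal(size=n)
    bedrooms = 0.5 * lotsize + rng.normal(size=n)   # correlated with lotsize
    price = 1.0 * bedrooms + 2.0 * lotsize + rng.normal(size=n)

    # Long regression: both determinants included
    X = np.column_stack([np.ones(n), bedrooms, lotsize])
    full[r] = np.linalg.solve(X.T @ X, X.T @ price)[1]

    # Short regression: lotsize omitted, so it loads onto bedrooms
    Xs = np.column_stack([np.ones(n), bedrooms])
    short[r] = np.linalg.solve(Xs.T @ Xs, Xs.T @ price)[1]

print(abs(full.mean() - 1.0) < 0.05)   # long regression: centered on the true 1.0
print(short.mean() > 1.5)              # short regression: biased far above 1.0
```

Raising n tightens both sampling distributions but does not move the short-regression estimates back toward 1.0, which is the inconsistency statement above.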
2.5. Multicollinearity

Definition 2.6: (Perfect multicollinearity)
Consider the multiple regression model in Definition 2.1 on Slide 7. The regressors X_1, ..., X_k are said to be perfectly multicollinear if one of the regressors is a perfect linear function of the other regressors.

Remarks:
- Under perfect multicollinearity the OLS estimates cannot be calculated due to division by zero in the OLS formulas
- Perfect multicollinearity often reflects a logical mistake in choosing the regressors or some unrecognized feature in the data set

33
Example: (Dummy variable trap)
- Consider the student-performance dataset
- Suppose we partition the school districts into the 3 categories (1) rural, (2) suburban, (3) urban
- We represent the categories by the dummy regressors

  RURAL_i = 1 if district i is rural, 0 otherwise,

  with SUBURBAN_i and URBAN_i defined analogously
- Since each district belongs to one and only one category, we have for each district i:

  RURAL_i + SUBURBAN_i + URBAN_i = 1

34
Example: [continued]
- Now, let us define the constant regressor X_0 associated with the intercept coefficient β_0 in the multiple regression model on Slide 7 by X_0i ≡ 1 for i = 1, ..., n
- Then, for i = 1, ..., n, the following relationship holds among the regressors:

  X_0i = RURAL_i + SUBURBAN_i + URBAN_i

  → perfect multicollinearity
- To estimate the regression, we must exclude either one of the dummy regressors or the constant regressor X_0 (the intercept β_0) from the regression

35
Theorem 2.7: (Dummy variable trap)
Let there be G different categories in the data set, represented by G dummy regressors. If
1. each observation i falls into one and only one category,
2. there is an intercept (constant regressor) in the regression,
3. all G dummy regressors are included as regressors,
then regression estimation fails because of perfect multicollinearity.

Usual remedy:
Exclude one of the dummy regressors (G - 1 dummy regressors are sufficient)

36
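Theorem 2.7 can be checked numerically: with all G dummies plus the constant, the design matrix loses full column rank, so X'X is singular and the OLS formulas break down. A minimal NumPy sketch with hypothetical category assignments:

```python
import numpy as np

# Hypothetical category assignment for 9 observations (3 categories: 0, 1, 2).
category = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
n = len(category)

rural    = (category == 0).astype(float)
suburban = (category == 1).astype(float)
urban    = (category == 2).astype(float)
const    = np.ones(n)

# Trap: the constant column equals the sum of all G = 3 dummy columns,
# so the design matrix does not have full column rank (4).
X_trap = np.column_stack([const, rural, suburban, urban])
print(np.linalg.matrix_rank(X_trap))   # prints 3, not 4

# Remedy: keep the intercept and only G - 1 = 2 dummies.
X_ok = np.column_stack([const, suburban, urban])
print(np.linalg.matrix_rank(X_ok))     # prints 3: full column rank
```

In X_ok the excluded category (rural) becomes the baseline absorbed by the intercept, and the two dummy coefficients measure deviations from it.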
Definition 2.8: (Imperfect multicollinearity)
Consider the multiple regression model in Definition 2.1 on Slide 7. The regressors X_1, ..., X_k are said to be imperfectly multicollinear if two or more of the regressors are highly correlated in the sense that there is a linear function of the regressors that is highly correlated with another regressor.

Remarks:
- Imperfect multicollinearity does not pose any (numeric) problems in calculating OLS estimates
- However, if regressors are imperfectly multicollinear, then the coefficient on at least one individual regressor will be imprecisely estimated

37
Remarks: [continued] Techniques for identifying and mitigating imperfect multicollinearity are presented in econometric textbooks (e.g. Hill et al., 2010, pp. 155-156) 38
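The imprecision caused by highly correlated regressors can be seen in a short simulation (assumed data-generating process with all true coefficients equal to 1): the sampling standard deviation of β̂_1 grows sharply as the correlation between X_1 and X_2 approaches 1, in theory by the factor sqrt(1 / (1 - corr^2)).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 1000

def slope_sd(corr):
    """Monte Carlo sampling std. dev. of beta_1_hat for a given regressor correlation."""
    est = np.empty(reps)
    for r in range(reps):
        X1 = rng.normal(size=n)
        X2 = corr * X1 + np.sqrt(1 - corr**2) * rng.normal(size=n)
        Y = 1.0 + X1 + X2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), X1, X2])
        est[r] = np.linalg.solve(X.T @ X, X.T @ Y)[1]
    return est.std()

# Near-collinear regressors inflate the spread of the individual slope estimate.
print(slope_sd(0.95) > 2 * slope_sd(0.1))
```

The OLS estimates remain computable for any corr < 1; only their precision deteriorates, which is exactly the distinction between imperfect and perfect multicollinearity.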