Multiple Regression, Mar 3


Patrick Sutton

Multiple Regression

Example: Food expenditure and family income

Data: Sample of 20 households
    Food expenditure (response variable)
    Family income and family size

    . regress food income

[Stata output: coefficient table for food with rows income, _cons; columns Coef., Std. Err., t, P>|t|, 95% Conf. Interval]

    . regress food number

[Stata output: coefficient table for food with rows number, _cons; columns Coef., Std. Err., t, P>|t|, 95% Conf. Interval]

[Figure: scatterplots of food expenditure against income and against family size]

Multiple Regression

Multiple regression model

\[ Y_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \dots + b_p x_{p,i} + \varepsilon_i, \qquad i = 1, \dots, n \]

where
    $Y_i$                          response variable
    $x_{1,i}, \dots, x_{p,i}$      predictor variables (fixed, nonrandom)
    $b_0, \dots, b_p$              regression coefficients
    $\varepsilon_i$                error variable, iid $N(0, \sigma^2)$

Example: Food expenditure and family income

Fitting multiple regression models in STATA:

    . regress food income number

[Stata output: ANOVA table (Source, SS, df, MS; rows Model, Residual, Total) with Number of obs, F(2, 17), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient table for food with rows income, number, _cons]

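Multiple Regression

Example: Food expenditure and family income

Data: $(\mathrm{Food}_i, \mathrm{Income}_i, \mathrm{Number}_i)$, $i = 1, \dots, 20$

Fitted regression model:

\[ \widehat{\mathrm{Food}} = \hat b_0 + \hat b_1\,\mathrm{Income} + \hat b_2\,\mathrm{Number} \]

[Figure: fitted regression plane with observed values $Y_i$ and fitted values $\hat Y_i$]

The fitted model is a two-dimensional plane, which is difficult to visualize.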

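Inference for Multiple Regression

Multiple regression model (matrix notation)

\[ Y = X b + \varepsilon \]

where
    $Y$               $n$-dimensional vector
    $X$               $n \times (1 + p)$-dimensional matrix
    $b$               $(1 + p)$-dimensional vector
    $\varepsilon$     $n$-dimensional vector

Thus the model can be written as

\[
\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{p,1} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,n} & \cdots & x_{p,n}
\end{pmatrix}
\begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
\]

Least squares approach: Minimize

\[ \|Y - \hat Y\|^2 = \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 \]

Results (with $I$ the $n \times n$ identity matrix):

\[ \hat b = (X^T X)^{-1} X^T Y \sim N\big(b,\ \sigma^2 (X^T X)^{-1}\big) \]

\[ \hat Y = X (X^T X)^{-1} X^T Y \sim N\big(X b,\ \sigma^2 X (X^T X)^{-1} X^T\big) \]

\[ \hat\varepsilon = Y - \hat Y = \big(I - X (X^T X)^{-1} X^T\big) Y \sim N\big(0,\ \sigma^2 (I - X (X^T X)^{-1} X^T)\big) \]

\[ \hat\sigma^2 = s_e^2 = \frac{\|Y - \hat Y\|^2}{n - p - 1} = \frac{1}{n - p - 1} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 \]

Details: see a course in regression analysis (STAT 22200) or econometrics.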
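The least squares results can be checked numerically. The following is a minimal sketch in plain Python, not the course's method (the slides use STATA): it forms $X^T X$ and $X^T Y$, solves the normal equations, and returns $\hat b$, $\hat Y$, $\hat\varepsilon$ and $s_e^2$. The function names and the toy data are assumptions made for illustration.

```python
def solve_linear_system(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(predictors, y):
    """Regress y on the predictor rows plus an intercept.

    Returns (b_hat, fitted, residuals, s2), where s2 divides the residual
    sum of squares by n - p - 1.
    """
    X = [[1.0] + list(row) for row in predictors]  # prepend the column of ones
    n, k = len(X), len(X[0])                        # k = 1 + p
    xtx = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
           for r in range(k)]
    xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]
    b_hat = solve_linear_system(xtx, xty)           # solves (X^T X) b = X^T Y
    fitted = [sum(bj * xj for bj, xj in zip(b_hat, row)) for row in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    s2 = sum(e * e for e in residuals) / (n - k)
    return b_hat, fitted, residuals, s2


# Toy data (made up): y is an exact linear function of two predictors.
rows = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
b_hat, fitted, residuals, s2 = least_squares(rows, y)
print(b_hat)  # close to [1.0, 2.0, -3.0], up to rounding
```

Solving the normal equations directly keeps the sketch close to the formula $\hat b = (X^T X)^{-1} X^T Y$; numerical software typically uses a more stable factorization instead.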

Inference for Multiple Regression

Example: Food expenditure and family income

Interpretation of regression coefficients

    . quietly regress food income
    . predict e_food1, residuals
    . quietly regress number income
    . predict e_num, residuals
    . regress e_food1 e_num

[Stata output: coefficient table for e_food1 with row e_num; columns Coef., Std. Err., t, P>|t|, 95% Conf. Interval]

    . quietly regress food number
    . predict e_food2, residuals
    . quietly regress income number
    . predict e_inc, residuals
    . regress e_food2 e_inc

[Stata output: coefficient table for e_food2 with row e_inc; columns Coef., Std. Err., t, P>|t|, 95% Conf. Interval]

Result:
    $b_j$ measures the dependence of $Y$ on $x_j$ after removing the linear effects of all other predictors $x_k$, $k \neq j$.
    $b_j = 0$ if $x_j$ provides no information for the prediction of $Y$ beyond the information given by the other predictor variables.
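The residual-on-residual regressions can be reproduced in a few lines. The sketch below, in plain Python with made-up household data (the original 20 observations are not available here), checks that the slope from regressing the residuals of food on the residuals of number, both taken after removing income, equals the coefficient of number in the two-predictor regression. All function names are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Centered cross-product: sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def simple_slope(y, x):
    """Slope of the simple regression of y on x (with intercept)."""
    return cross(x, y) / cross(x, x)


def simple_residuals(y, x):
    """Residuals of the simple regression of y on x (with intercept)."""
    slope = simple_slope(y, x)
    mx, my = mean(x), mean(y)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def coef_of_second(y, x1, x2):
    """Coefficient of x2 in the regression of y on x1 and x2 (with intercept)."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    return (s11 * s2y - s12 * s1y) / (s11 * s22 - s12 ** 2)


# Made-up data for eight households (illustration only).
income = [30, 45, 52, 60, 38, 75, 66, 41]
number = [2, 3, 4, 3, 1, 5, 4, 2]
food = [5.1, 7.0, 8.4, 7.9, 4.6, 10.8, 9.1, 5.9]

e_food1 = simple_residuals(food, income)   # food with income removed
e_num = simple_residuals(number, income)   # number with income removed
partial_slope = simple_slope(e_food1, e_num)
direct_coef = coef_of_second(food, income, number)
print(partial_slope, direct_coef)          # the two values agree up to rounding
```

The agreement holds for any data set in which the predictors are not perfectly collinear, which is why $b_j$ can be read as the effect of $x_j$ after the other predictors have been removed.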

Multiple Regression

Example: Heart catheterization

Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or artery at the femoral region and pushed up into the heart to obtain information about the heart's physiology and functional ability. The length of the catheter is typically determined by a physician's educated guess.

Data: Study with 12 children with congenital heart defects
    Exact required catheter length was measured using a fluoroscope
    Patient's height and weight were recorded

Question: How accurately can catheter length be determined by height and weight?

[Figure: scatterplots of distance (cm) against height (in) and against weight (lb)]

Multiple Regression

Example: Heart catheterization (contd)

Regression model:

\[ Y = b_0 + b_1 x_1 + b_2 x_2 + \varepsilon \]

where
    $Y$      distance to pulmonary artery
    $x_1$    height
    $x_2$    weight

STATA regression output:

    . regress distance height weight

[Stata output: ANOVA table (Source, SS, df, MS; rows Model, Residual, Total) with Number of obs, F(2, 9), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient table for distance with rows height, weight, _cons]

Note: Neither height nor weight appears to be significant for predicting the distance to the pulmonary artery, yet the regression on both variables explains about 80% of the variation of the response (length of catheter).

Multiple Regression

Example: Heart catheterization (contd)

Consider predicting the length by height alone and by weight alone:

    . regress distance height

[Stata output: R-squared; coefficient table for distance with rows height, _cons]

    . regress distance weight

[Stata output: R-squared; coefficient table for distance with rows weight, _cons]

Note:
    In a simple regression of $Y$ on either height or weight, the explanatory variable is highly significant for predicting $Y$.
    In a multiple regression of $Y$ on height and weight, the coefficients for both height and weight are not significantly different from zero.

Problem: The explanatory variables are highly linearly dependent (collinear).

[Figure: scatterplot of weight (lb) against height (in)]
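One way to see why collinearity hides significance: with two predictors, the variance of each estimated coefficient is multiplied by $1 / (1 - r^2)$ relative to the uncorrelated case, where $r$ is the correlation between the predictors. This factor is not mentioned on the slides; it is added here as a supplementary illustration. The sketch below computes it in plain Python for made-up data, not the catheter data.

```python
import math


def correlation(x, z):
    """Pearson correlation between two equally long sequences."""
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)


def variance_inflation(x1, x2):
    """Factor 1 / (1 - r^2) by which collinearity inflates Var(b_hat_j)."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up predictors: the second is almost an exact multiple of the first.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
print(variance_inflation(x1, x2))  # very large: standard errors blow up

# Uncorrelated predictors give a factor of 1 (no inflation).
print(variance_inflation([1, 2, 3, 4], [1, -1, -1, 1]))
```

A large factor means large standard errors for both coefficients, so each t statistic can be small even when the predictors jointly explain most of the variation.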

Analysis of Variance

Decomposition of variation:

    $SS_{Total} = \sum_i (Y_i - \bar Y)^2$: total variation
    $SS_{Residual} = \sum_i (Y_i - \hat Y_i)^2$: residual variation about the regression model
    $SS_{Model} = SS_{Total} - SS_{Residual} = \sum_i (\hat Y_i - \bar Y)^2$: variation explained by the regression

Coefficient of determination: The ratio

\[ R^2 = \frac{SS_{Model}}{SS_{Total}} \]

indicates how well the regression model predicts the response. $R^2$ is also the squared multiple correlation coefficient; in a simple linear regression we have $R^2 = \rho_{XY}^2$.

Example: Heart catheterization

[Stata output: ANOVA table (Source, SS, df, MS; rows Model, Residual, Total) with Number of obs, F(2, 9), Prob > F, R-squared, Adj R-squared, Root MSE]

The coefficient of determination for these data is $R^2 = SS_{Model} / SS_{Total} \approx 0.81$: regression on height and weight explains 81% of the variation of distance.
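The decomposition can be verified on a small example. The sketch below, in plain Python with made-up data, fits a simple linear regression, computes the three sums of squares, and checks two statements from the text: $SS_{Model}$ equals $\sum_i (\hat Y_i - \bar Y)^2$, and $R^2$ equals the squared correlation between $X$ and $Y$. The function and variable names are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least squares intercept and slope for the simple regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def sums_of_squares(y, fitted):
    """Return (SS_Total, SS_Residual, SS_Model, R^2)."""
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_model = ss_total - ss_residual
    return ss_total, ss_residual, ss_model, ss_model / ss_total


# Made-up data (illustration only).
x = [20, 25, 30, 35, 40, 45, 50, 60]
y = [21, 27, 30, 36, 35, 41, 44, 50]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
ss_total, ss_residual, ss_model, r_squared = sums_of_squares(y, fitted)

# Second expression for SS_Model: variation of the fitted values around the mean.
ss_model_direct = sum((fi - mean(y)) ** 2 for fi in fitted)

# Squared sample correlation between x and y.
sxy = sum((a - mean(x)) * (b - mean(y)) for a, b in zip(x, y))
sxx = sum((a - mean(x)) ** 2 for a in x)
rho_squared = sxy ** 2 / (sxx * ss_total)

print(r_squared, rho_squared)  # equal up to rounding
```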

Analysis of Variance

Question: Is the improvement in prediction (decrease in variation) significant?

Our null hypothesis is that none of the explanatory variables helps to predict the response, that is,

\[ H_0: b_1 = \dots = b_p = 0 \quad \text{versus} \quad H_a: b_j \neq 0 \text{ for at least one } j \in \{1, \dots, p\}. \]

Under the null hypothesis $H_0$ the F statistic

\[ F = \frac{n - p - 1}{p} \cdot \frac{SS_{Model}}{SS_{Residual}} = \frac{n - p - 1}{p} \cdot \frac{SS_{Total} - SS_{Residual}}{SS_{Residual}} \]

is F distributed with $p$ and $n - p - 1$ degrees of freedom. The null hypothesis $H_0$ is rejected at level $\alpha$ if $F > F_{p,\,n-p-1,\,\alpha}$.

Example: Heart catheterization

[Stata output: ANOVA table (Source, SS, df, MS; rows Model, Residual, Total) with Number of obs, F(2, 9), Prob > F, R-squared, Adj R-squared, Root MSE]

The value of the F statistic is reported as F(2, 9) in the output. The critical value for rejecting $H_0: b_1 = b_2 = 0$ is $F_{2,9,0.05}$. The F statistic exceeds it, so the null hypothesis $H_0$ that both coefficients $b_1$ and $b_2$ are zero is rejected at significance level $\alpha = 0.05$.
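The F statistic is simple arithmetic on the ANOVA table. The sketch below is plain Python with hypothetical sums of squares (the numeric values of the catheter output are not reproduced here). It also shows the equivalent form in terms of $R^2$, which follows by dividing numerator and denominator by $SS_{Total}$. The function names are assumptions for illustration.

```python
def overall_f(ss_total, ss_residual, n, p):
    """F statistic for H0: b_1 = ... = b_p = 0, with p and n - p - 1 df."""
    ss_model = ss_total - ss_residual
    return (ss_model / p) / (ss_residual / (n - p - 1))


def overall_f_from_r2(r_squared, n, p):
    """Same statistic written in terms of R^2 = SS_Model / SS_Total."""
    return (r_squared / p) / ((1.0 - r_squared) / (n - p - 1))


# Hypothetical values: n = 12 observations, p = 2 predictors.
n, p = 12, 2
ss_total, ss_residual = 100.0, 20.0

f_value = overall_f(ss_total, ss_residual, n, p)
f_from_r2 = overall_f_from_r2(1.0 - ss_residual / ss_total, n, p)
print(f_value, f_from_r2)  # both approximately 18
```

With these hypothetical numbers the statistic would be compared with $F_{2,9,\alpha}$, taken from an F table or statistical software.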

Comparing Models

Example: Cobb-Douglas production function

\[ Y = t\, K^a L^b M^c \]

where
    $Y$    output
    $K$    capital
    $L$    labour
    $M$    materials

Regression model:

\[ \log Y = \log t + a \log K + b \log L + c \log M \]

[Figure: plots of Y against K, L, and M]

Comparing Models

Example: Cobb-Douglas production function (contd)

Regression model $M_0$ for the Cobb-Douglas function:

\[ \log Y = \log t + a \log K + b \log L + c \log M \]

    . regress LY LK LM LL

[Stata output: ANOVA table (Source, SS, df, MS; rows Model, Residual, Total) with Number of obs, F(3, 21), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient table for LY with rows LK, LM, LL, _cons]

Two variables, $\log K$ and $\log L$, do not improve prediction of $\log Y$.

Alternative model $M_1$:

\[ \log Y = \log t + c \log M \]

    . regress LY LM

[Stata output: ANOVA table (Source, SS, df, MS; rows Model, Residual, Total) with Number of obs, F(1, 23), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient table for LY with rows LM, _cons]

Question: Is model $M_0$ significantly better than model $M_1$?

Comparing Models

Consider the multiple regression model with $p$ explanatory variables

\[ Y_i = b_0 + b_1 x_{1,i} + \dots + b_p x_{p,i} + \varepsilon_i. \]

Problem: Test the null hypothesis

    $H_0$: $q$ specific explanatory variables all have zero coefficients

versus

    $H_a$: at least one of these $q$ explanatory variables has a nonzero coefficient.

Solution:
    Regress $Y$ on all $p$ explanatory variables and read $SS^{(1)}_{Residual}$ from the output.
    Regress $Y$ on just the $p - q$ explanatory variables that remain after you remove the $q$ variables from the model. Read $SS^{(2)}_{Residual}$ from the output.
    The test statistic is

\[ F = \frac{n - p - 1}{q} \cdot \frac{SS^{(2)}_{Residual} - SS^{(1)}_{Residual}}{SS^{(1)}_{Residual}} \]

    Under the null hypothesis, $F$ is F distributed with $q$ and $n - p - 1$ degrees of freedom.
    Reject if $F > F_{q,\,n-p-1,\,\alpha}$.
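The test statistic needs only the two residual sums of squares and the counts $n$, $p$, $q$. The sketch below is plain Python with hypothetical sums of squares; the function name and the numbers are assumptions, chosen so that $n - p - 1 = 21$ and $q = 2$. The critical value is passed in by the caller, since it has to come from an F table or statistical software.

```python
def partial_f(ss_res_full, ss_res_reduced, n, p, q):
    """F statistic for H0: q given coefficients of the full model are zero.

    ss_res_full:    residual sum of squares with all p explanatory variables
    ss_res_reduced: residual sum of squares with the remaining p - q variables
    """
    numerator = (ss_res_reduced - ss_res_full) / q
    denominator = ss_res_full / (n - p - 1)
    return numerator / denominator


# Hypothetical numbers: n = 25, p = 3, q = 2, so n - p - 1 = 21.
n, p, q = 25, 3, 2
f_value = partial_f(ss_res_full=2.0, ss_res_reduced=2.05, n=n, p=p, q=q)

critical_value = 3.47  # F_{2,21,0.05}
reject = f_value > critical_value
print(f_value, reject)  # approximately 0.2625, False
```

Setting $q = p$ and taking the intercept-only model as the reduced model (so that its residual sum of squares is $SS_{Total}$) gives the overall F test as a special case.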

Comparing Models

Example: Cobb-Douglas production function

Comparison of models $M_0$ and $M_1$:

    $M_0$: read $SS^{(0)}_{Residual}$ from the output; $n - p - 1 = 21$.
    $M_1$: read $SS^{(1)}_{Residual}$ from the output; $q = 2$.

\[ F = \frac{21}{2} \cdot \frac{SS^{(1)}_{Residual} - SS^{(0)}_{Residual}}{SS^{(0)}_{Residual}} \]

Since $F < F_{2,21,0.05} = 3.47$ we cannot reject $H_0: a = b = 0$.

Using STATA:

    . test LK LL
     ( 1)  LK = 0
     ( 2)  LL = 0
           F(2, 21) = 0.25

    . test LK LL _cons
     ( 1)  LK = 0
     ( 2)  LL = 0
     ( 3)  _cons = 0
           F(3, 21) = 2.43
