BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy/teaching
Beyond SLR Many problems involve more than one independent variable or factor that affects the dependent or response variable: multi-factor asset pricing models (beyond CAPM); demand for a product given prices of competing brands, advertising, household attributes, etc.; more than just size to predict house price! In SLR, the conditional mean of Y depends on a single X. The multiple linear regression (MLR) model extends this idea to include more than one independent variable. 1
Categorical effects/dummy variables Week 1 already introduced one type of multiple regression: regression of Y onto a categorical X: E[Y | group = r] = β_0 + β_r, for r = 1, ..., R − 1. I.e., ANOVA for grouped data. To represent these qualitative factors in multiple regression, we use dummy, binary, or indicator variables. 2
Dummy variables allow the mean (intercept) to shift by taking on the value 0 or 1. Examples: temporal effects (1 if Holiday season, 0 if not); spatial effects (1 if in Midwest, 0 if not). If a factor X takes R possible levels, we can represent X through R − 1 dummy variables: E[Y | X] = β_0 + β_1 1[X=2] + β_2 1[X=3] + ... + β_{R-1} 1[X=R], where 1[X=r] = 1 if X = r and 0 if X ≠ r. What is E[Y | X = 1]? 3
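A minimal sketch of how R builds these dummies, using the pickup data from the next slide (assuming pickup.csv has a multi-level make column, as shown there):

# how R encodes an R-level factor as R-1 dummy columns plus an intercept
pickup <- read.csv("pickup.csv")
head(model.matrix(~ make, data=pickup))  # the first (baseline) level is absorbed into the intercept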
Recall the pick-up truck data example: > pickup <- read.csv("pickup.csv") > t1 <- lm(log(price) ~ make, data=pickup) > summary(t1) ## abbreviated output Call: lm(formula = log(price) ~ make, data = pickup) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.71499 0.23109 37.713 <2e-16 *** makeford 0.17725 0.31289 0.566 0.574 makegmc -0.05124 0.27505 -0.186 0.853 The coefficient values correspond to our dummy variables. 4
What if you also want to include mileage? No problem. > t2 <- lm(log(price) ~ make + log(miles), data=pickup) > summary(t2) ## abbreviated output Call: lm(formula = log(price) ~ make + log(miles), data = pickup) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 13.83117 1.08608 12.735 5.17e-16 *** makeford 0.32419 0.25658 1.264 0.213 makegmc 0.13265 0.22720 0.584 0.562 log(miles) -0.46618 0.09747 -4.783 2.15e-05 *** 5
The MLR Model The MLR model is the same as always, but with more covariates: Y | X_1, ..., X_d ~ind N(β_0 + β_1 X_1 + ... + β_d X_d, σ²). Recall the key assumptions of our linear regression model: (i) The conditional mean of Y is linear in the X_j variables. (ii) The additive errors (deviations from the line) are normally distributed, independent of each other, and identically distributed (i.e., they have constant variance). 6
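A quick sketch of what the model says, simulating data that satisfies (i) and (ii); all values below are made up for illustration:

# simulate from an MLR model with d = 2 covariates and recover the coefficients
set.seed(1)
n <- 100
X1 <- runif(n); X2 <- runif(n)
beta <- c(1, -1, 1.1); sigma <- 0.05           # illustrative true values
Y <- beta[1] + beta[2]*X1 + beta[3]*X2 + rnorm(n, sd=sigma)
coef(lm(Y ~ X1 + X2))                          # should be close to beta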
Our interpretation of regression coefficients can be extended from the simple single-covariate regression case: β_j = ∂E[Y | X_1, ..., X_d] / ∂X_j. Holding all other variables constant, β_j is the average change in Y per unit change in X_j. (∂ is from calculus and means "change in".) 7
If d = 2, we can plot the regression surface in 3D. Consider sales of a product as predicted by price of this product (P1) and the price of a competing product (P2). Everything measured on log-scale, of course. 8
The data and least squares The data in multiple regression is a set of points with values for the output Y and for each input variable. Data: Y_i and x_i = [X_{1i}, X_{2i}, ..., X_{di}], for i = 1, ..., n. Or, as a data array (i.e., data.frame):

Data = [ Y_1  X_11  X_21  ...  X_d1
         ...
         Y_n  X_1n  X_2n  ...  X_dn ]   9
So the model is Y_i = β_0 + β_1 X_{1i} + ... + β_d X_{di} + ε_i, with ε_i ~iid N(0, σ²). How do we estimate the MLR model parameters? The principle of least squares is unchanged; define: fitted values Ŷ_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + ... + b_d X_{di}; residuals e_i = Y_i − Ŷ_i; standard error s = sqrt( Σ_{i=1}^n e_i² / (n − p) ), where p = d + 1. Then find the best-fitting plane, i.e., coefficients b_0, b_1, b_2, ..., b_d, by minimizing the sum of squared residuals (equivalently, s²). 10
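A minimal sketch of these quantities computed by hand from the pickup fit t2 above, lining them up with what lm reports:

# fitted values, residuals, and the residual standard error s by hand
Yhat <- fitted(t2)                          # b0 + b1*X_1i + ... + bd*X_di
e <- log(pickup$price) - Yhat               # residuals e_i = Y_i - Yhat_i
n <- nrow(pickup); p <- length(coef(t2))    # p = d + 1
sqrt(sum(e^2) / (n - p))                    # should match summary(t2)$sigma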
What are the LS coefficient values? Stack the data as Y = [Y_1, ..., Y_n]' and the design matrix

X = [ 1  X_11  X_12  ...  X_1d
      ...
      1  X_n1  X_n2  ...  X_nd ]

Then the estimates are [b_0 ... b_d]' = b = (X'X)^{-1} X'Y. Same intuition as for SLR: b captures the covariance between the X_j's and Y (X'Y), normalized by the input sum of squares (X'X). 11
Obtaining these estimates in R is very easy: > salesdata <- read.csv("sales.csv") > attach(salesdata) > salesmlr <- lm(Sales ~ P1 + P2) > salesmlr Call: lm(formula = Sales ~ P1 + P2) Coefficients: (Intercept) P1 P2 1.003 -1.006 1.098 12
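To connect this back to the matrix formula on the previous slide, here is a quick sketch computing b by hand (using only the variables already defined above):

# verify b = (X'X)^{-1} X'Y against lm's coefficients
X <- cbind(1, P1, P2)              # design matrix with an intercept column
solve(t(X) %*% X, t(X) %*% Sales)  # should match coef(salesmlr)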
Multiple vs simple linear regression Basic concepts and techniques translate directly from SLR. Individual parameter inference and estimation is the same, conditional on the rest of the model. ANOVA is exactly the same, and the F -test still applies. Our diagnostics and transformations apply directly. We still use lm, summary, rstudent, predict, etc. We have been plotting HD data since Week 1. The hardest part would be moving to matrix algebra to translate all of our equations. Luckily, R does all that for you. 13
Residual standard error First off, the calculation for s² = var(e) is exactly the same: s² = Σ_{i=1}^n (Y_i − Ŷ_i)² / (n − p), where Ŷ_i = b_0 + Σ_j b_j X_{ji} and p = d + 1. The residual standard error is σ̂ = s = sqrt(s²). 14
Residuals in MLR As in the SLR model, the residuals in multiple regression are purged of any relationship to the independent variables. We decompose Y into the part predicted by X and the part due to idiosyncratic error: Y = Ŷ + e, with corr(X_j, e) = 0 for every j and corr(Ŷ, e) = 0. 15
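A quick sketch checking these zero correlations numerically on the sales fit:

# residuals are uncorrelated with each input and with the fitted values
e <- resid(salesmlr)
round(c(cor(P1, e), cor(P2, e), cor(fitted(salesmlr), e)), 10)  # all essentially 0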
Residual diagnostics for MLR Consider the residuals from the sales data: [Figure: residuals plotted against the fitted values, P1, and P2.] We use the same residual diagnostics (scatterplots, QQ, etc). Plot residuals (raw or studentized) against Ŷ to see overall fit. Compare e or r against each X to identify problems. 16
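These panels can be reproduced with something like the following sketch (the layout choices are illustrative):

# residuals against the fitted values and against each input
par(mfrow=c(1,3))
plot(fitted(salesmlr), resid(salesmlr), xlab="fitted", ylab="residuals")
plot(P1, resid(salesmlr), xlab="P1", ylab="residuals")
plot(P2, resid(salesmlr), xlab="P2", ylab="residuals")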
Another great plot for MLR problems is to look at Y (true values) against Ŷ (fitted values). > plot(salesmlr$fitted, Sales, + main="Fitted vs True Response for Sales data", + pch=20, col=4, ylab="Y", xlab="Y.hat") > abline(0, 1, lty=2, col=8) [Figure: Fitted vs True Response for Sales data; Y against Y.hat, with the 45-degree line.] If things are working, these values should form a nice straight line. 17
Transforming variables in MLR The transformation techniques are also the same as in SLR. For Y nonlinear in X_j, consider adding X_j² and other polynomial terms. For nonconstant variance, use log(Y), √Y, or another transformation, such as Y/X_j, that moves the data to a linear scale. Use the log-log model for price/sales data and other multiplicative relationships. Again, diagnosing the problem and finding a solution involves looking at lots of residual plots (against different X_j's). 18
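For example, a quadratic term can be added directly inside the lm formula; a sketch with hypothetical variables Y and X1:

# add a polynomial term: I() protects the arithmetic inside the formula
quadfit <- lm(Y ~ X1 + I(X1^2))
# or, equivalently in fit, use an orthogonal polynomial basis
quadfit2 <- lm(Y ~ poly(X1, 2))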
For example, the Sales, P1, and P2 variables were pre-transformed from raw values to a log scale. On the original scale, things don't look so good: > expsalesmlr <- lm(exp(Sales) ~ exp(P1) + exp(P2)) [Figure: residuals from this raw-scale fit plotted against the fitted values, exp(P1), and exp(P2).] 19
In particular, the studentized residuals are heavily right skewed. ("Studentizing" is the same as before, but leverage is now a distance in d dimensions.) > hist(rstudent(expsalesmlr), col=7, + xlab="Studentized Residuals", main="") [Figure: histogram of the studentized residuals, heavily right skewed.] Our log-log transform fixes this problem. 20
Inference for coefficients As before in SLR, the LS coefficients are random (different for each sample) and correlated with each other. The LS estimators are unbiased: E[b_j] = β_j for j = 0, ..., d. In particular, the sampling distribution of b is multivariate normal, with mean β = [β_0 ... β_d]' and covariance matrix S_b: b ~ N_p(β, S_b). 21
Coefficient covariance matrix var(b): the p × p covariance matrix of the random vector b is

S_b = [ var(b_0)       cov(b_0, b_1)  ...  cov(b_0, b_d)
        cov(b_1, b_0)  var(b_1)       ...  cov(b_1, b_d)
        ...            ...            ...  ...
        cov(b_d, b_0)  cov(b_d, b_1)  ...  var(b_d)      ]

Standard errors are the square roots of the diagonal of S_b. 22
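In R this matrix and the standard errors come straight out of the fitted object; a quick sketch with the sales fit:

# estimated covariance matrix of b, and the standard errors
vcov(salesmlr)              # S_b
sqrt(diag(vcov(salesmlr)))  # matches the Std. Error column of summary(salesmlr)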
Calculating the covariance is easy: S_b = s² (X'X)^{-1}. > X <- cbind(1, P1, P2) > cov.b <- summary(salesmlr)$sigma^2*solve(t(X)%*%X) > print(cov.b) P1 P2 5.542812e-05 -5.393036e-05 -3.451517e-05 P1 -5.393036e-05 8.807396e-05 9.718699e-06 P2 -3.451517e-05 9.718699e-06 4.127672e-05 > se.b <- sqrt(diag(cov.b)) > se.b P1 P2 0.007445007 0.009384773 0.006424696 Variance decreases with n and var(X); increases with s². 23
Standard errors Conveniently, R's summary gives you all the standard errors. > summary(salesmlr) ## abbreviated output Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.002688 0.007445 134.7 <2e-16 *** P1 -1.005900 0.009385 -107.2 <2e-16 *** P2 1.097872 0.006425 170.9 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.01453 on 97 degrees of freedom Multiple R-squared: 0.998, Adjusted R-squared: 0.9979 F-statistic: 2.392e+04 on 2 and 97 DF, p-value: < 2.2e-16 24
Inference for individual coefficients Intervals and t-statistics are exactly the same as in SLR. A (1 − α)100% C.I. for β_j is b_j ± t_{α/2, n−p} s_{b_j}. z_{b_j} = (b_j − β_j^0)/s_{b_j} ~ t_{n−p}(0, 1) is the number of standard errors between the LS estimate and the null value. Intervals/testing via b_j and s_{b_j} are one-at-a-time procedures: you are evaluating the j-th coefficient conditional on the other X's being in the model, but regardless of the values you've estimated for the other b's. 25
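A minimal sketch of these one-at-a-time intervals in R, alongside the by-hand version for a single coefficient:

# 95% C.I.s for every coefficient at once
confint(salesmlr, level=0.95)
# by hand for the P1 coefficient (second entry)
b  <- coef(salesmlr)[2]
sb <- sqrt(diag(vcov(salesmlr)))[2]
b + c(-1, 1) * qt(0.975, df=salesmlr$df.residual) * sb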
R² for multiple regression Recall that we already view R² as a measure of fit: R² = SSR/SST = Σ_{i=1}^n (Ŷ_i − Ȳ)² / Σ_{i=1}^n (Y_i − Ȳ)². And the correlation interpretation is very similar: R² = corr²(Ŷ, Y) = r²_ŷy. (r_ŷy = r_xy in SLR, since corr(X, Ŷ) = 1.) 26
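A quick sketch checking both versions on the sales fit:

# R^2 two ways: from summary(), and as the squared correlation of Yhat with Y
summary(salesmlr)$r.squared
cor(fitted(salesmlr), Sales)^2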
Consider our marketing model: > summary(salesmlr)$r.square [1] 0.9979764 P1 and P2 explain most of the variability in log volume. Consider the pickup regressions (1: make, 2: make + miles). > summary(t1)$r.square [1] 0.01806738 > summary(t2)$r.square [1] 0.3643183 Make is pretty useless, but miles gets us up to R² ≈ 36%. 27
Forecasting in MLR Prediction follows exactly the same methodology as in SLR. For new data x_f = [X_{1,f}, ..., X_{d,f}]:

E[Y_f | x_f] = Ŷ_f = b_0 + b_1 X_{1,f} + ... + b_d X_{d,f}
var[Y_f | x_f] = var(Ŷ_f) + var(e_f) = s²_fit + s² = s²_pred.

With X our design matrix (slide 9) and x̂_f = [1, X_{1,f}, ..., X_{d,f}]',

s²_fit = s² x̂_f' (X'X)^{-1} x̂_f.

A (1 − α) level prediction interval is still Ŷ_f ± t_{α/2, n−p} s_pred. 28
The syntax in R is also exactly the same as before: > predict(salesmlr, data.frame(P1=1, P2=1), + interval="prediction", level=0.95) fit lwr upr 1 1.094661 1.064015 1.125306 And we can get s_fit using our equation, or from R. > xf <- matrix(c(1,1,1), ncol=1) > X <- cbind(1, P1, P2) > s <- summary(salesmlr)$sigma > sqrt(drop(t(xf)%*%solve(t(X)%*%X)%*%xf))*s [1] 0.005227347 > predict(salesmlr, data.frame(P1=1, P2=1), + se.fit=TRUE)$se.fit [1] 0.005227347 29
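To tie this back to the formulas on the previous slide, here is a sketch of the prediction interval by hand, reusing xf, X, and s from above:

# s_pred combines fit uncertainty with residual noise; 95% interval by hand
sfit  <- sqrt(drop(t(xf) %*% solve(t(X) %*% X) %*% xf)) * s
spred <- sqrt(sfit^2 + s^2)
yhat  <- drop(coef(salesmlr) %*% xf)
yhat + c(-1, 1) * qt(0.975, df=salesmlr$df.residual) * spred  # should match predict()'s lwr/upr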
Glossary and equations MLR updates to the LS equations:

b = (X'X)^{-1} X'Y
var(b) = S_b = s² (X'X)^{-1}
s²_fit = s² x̂_f' (X'X)^{-1} x̂_f
R² = SSR/SST = corr²(Ŷ, Y) = r²_ŷy   30