IAPRI Quantitative Analysis Capacity Building Series
Multiple regression analysis & interpreting results
How important is R-squared?
[Figure: R-squared values of articles published in Agricultural Economics. Articles with R-squared values as low as roughly 0.2 to 0.45 were named best article of the year in 2008, 2009, and 2010.]
Session 3 Topics
- Multiple regression analysis
  - What does it mean?
  - Why is it important?
  - How is it done and how are results interpreted?
  - What are the hazards?
Multiple Regression Analysis
- What does it mean?
  - Multivariate analysis/statistics
  - Ceteris paribus
  - "All else equal"
  - "Controlling for"
Multiple Regression Analysis
- Why does it matter?
- Suppose y = α + β1x1 + u. If E(u|x1) = E(u) = 0, implying Corr(u, x1) = 0, then OLS is unbiased.
- What if the true model is y = α + β1x1 + β2x2 + ε? Then u = β2x2 + ε, and if Corr(x1, x2) ≠ 0, then Corr(u, x1) ≠ 0: results are biased.
- If E(u|x1, x2) = 0 (and other conditions), we can estimate w/ multiple regressors without bias.
Multiple Regression Analysis
- Consider maize yield (mzyield) and basal fertilizer (basaprate), both kg/ha: mzyield = α + β1·basaprate + u

. reg mzyield basaprate

      Source |       SS       df       MS              Number of obs =    8648
-------------+------------------------------           F(  1,  8646) = 1526.38
       Model |  2.1590e+09     1  2.1590e+09           Prob > F      =  0.0000
    Residual |  1.2229e+10  8646   1414446.5           R-squared     =  0.1501
-------------+------------------------------           Adj R-squared =  0.1500
       Total |  1.4388e+10  8647  1663962.69           Root MSE      =  1189.3

------------------------------------------------------------------------------
     mzyield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   basaprate |   5.254685   .1344979    39.07   0.000     4.991037    5.518333
       _cons |     335.84    14.5786    23.03   0.000      307.262     364.417
------------------------------------------------------------------------------
Multiple Regression Analysis
- Top dressing (topaprate) determines yield and is correlated with basaprate, both kg/ha: mzyield = α + β1·basaprate + β2·topaprate + ε

. reg mzyield basaprate topaprate

      Source |       SS       df       MS              Number of obs =    8647
-------------+------------------------------           F(  2,  8644) =  840.22
       Model |  2.3418e+09     2  1.1709e+09           Prob > F      =  0.0000
    Residual |  1.2046e+10  8644  1393535.34           R-squared     =  0.1628
-------------+------------------------------           Adj R-squared =  0.1626
       Total |  1.4387e+10  8646  1664016.58           Root MSE      =  1180.5

------------------------------------------------------------------------------
     mzyield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   basaprate |   1.897807    .321747     5.90   0.000     1.267106    2.528508
   topaprate |    3.62044   .3157663    11.47   0.000     3.001463    4.239418
       _cons |     314.93    14.5870    21.59   0.000      286.336     343.524
------------------------------------------------------------------------------
Multiple Regression Analysis
y = α + β1x1 + β2x2 + ... + βkxk + u
- α is the intercept
- β1 ... βk are slope parameters (usually)
[Figure: a regression line in the (x, y) plane, with intercept α and slope β]
Multiple Regression Analysis
y = α + β1x1 + β2x2 + ... + βkxk + u
- α is the intercept
- β1 ... βk are slope parameters (usually)
- u is the unobserved error or disturbance term
- y is the dependent, explained, response or predicted variable
- x1 ... xk are the independent, explanatory, control or predictor variables, or regressors
How is it done?
- OLS finds the parameters that minimize the sum of squared residuals:
  Σ(i = 1 to n) (yi − α − β1xi1 − β2xi2 − ... − βkxik)²
- Minimize the noise
- Squared, so residuals don't offset
- Gives us β̂ and predicted values ŷ
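A minimal do-file sketch of what OLS is doing, using simulated data (all variable names and parameter values below are invented for illustration):

* simulate data from a known model, then let OLS recover the parameters
clear
set seed 12345
set obs 1000
generate x1 = runiform()*100           // first regressor
generate x2 = runiform()*100           // second regressor
generate u  = rnormal(0, 50)           // unobserved error
generate y  = 100 + 2*x1 + 3*x2 + u    // "true" model: α = 100, β1 = 2, β2 = 3
regress y x1 x2                        // OLS minimizes the sum of squared residuals
predict yhat, xb                       // fitted values ŷ
predict uhat, residuals                // residuals y − ŷ

The estimated coefficients should land near 2 and 3, and the residuals uhat are exactly what the minimization made as small as possible in the squared sense.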
Ceteris Paribus Interpretation
y = α + β1x1 + β2x2 + ... + βkxk + u
- β̂1 is the partial effect or ceteris paribus effect of x1
- Change x1 only: Δŷ = β̂1Δx1
- Change x2 only: Δŷ = β̂2Δx2
- Change both: Δŷ = β̂1Δx1 + β̂2Δx2
- Share of total change attributable to x1: β̂1Δx1 / Δŷ
Ceteris Paribus Interpretation
- Now, how do we interpret the coefficient estimate for basaprate? mzyield = α + β1·basaprate + β2·topaprate + u

. reg mzyield basaprate topaprate

      Source |       SS       df       MS              Number of obs =    8647
-------------+------------------------------           F(  2,  8644) =  840.22
       Model |  2.3418e+09     2  1.1709e+09           Prob > F      =  0.0000
    Residual |  1.2046e+10  8644  1393535.34           R-squared     =  0.1628
-------------+------------------------------           Adj R-squared =  0.1626
       Total |  1.4387e+10  8646  1664016.58           Root MSE      =  1180.5

------------------------------------------------------------------------------
     mzyield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   basaprate |   1.897807    .321747     5.90   0.000     1.267106    2.528508
   topaprate |    3.62044   .3157663    11.47   0.000     3.001463    4.239418
       _cons |     314.93    14.5870    21.59   0.000      286.336     343.524
------------------------------------------------------------------------------
Ceteris Paribus Interpretation
- According to these results, a one-unit change in x1 will result in a β̂1-unit change in y, all else equal.
- The ceteris paribus effect of a one-unit change in x1 is a β̂1-unit change in y.
- Holding x2 constant, a one-unit change in x1 results in a β̂1-unit change in y.
- Here: holding topaprate constant, applying one more kg/ha of basal fertilizer is associated with about 1.9 kg/ha more maize yield.
Key Assumptions
- Linear in parameters
- Random sample
- Zero conditional mean
- No perfect collinearity (variation in data)
- Homoskedastic errors
Perfect Collinearity
- Variable is a linear function of one or more others.
- No variation in one variable (collinear w/ intercept)
[Figure: scatter plot in which x takes only one value; the slope parameter can't be estimated if there is no variation in x. Source: Wooldridge (2002)]
Perfect Collinearity
- Variable is a linear function of one or more others.
- No variation in one variable (collinear w/ intercept)
- Perfect correlation between 2 binary variables
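A quick do-file sketch (simulated data, invented names) of how Stata reacts to perfect collinearity: rather than fail, it drops one of the offending variables.

* x3 is an exact linear function of x1, so the two are perfectly collinear
clear
set seed 1
set obs 100
generate x1 = runiform()
generate x3 = 2*x1 + 5
generate y  = 1 + 2*x1 + rnormal()
regress y x1 x3    // Stata notes "omitted because of collinearity" and drops a regressor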
Other hazards
- Multi-collinearity
- Including irrelevant variables
- Omitting relevant variables
Multi-Collinearity
- Highly correlated variables
- Variable is a nonlinear function of others
- What's the problem?
  - Efficiency losses
  - Schmidt rule of thumb
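One common diagnostic is the variance inflation factor. A do-file sketch with simulated, deliberately correlated regressors (all names invented; the VIF > 10 cutoff is a common rule of thumb, not necessarily the Schmidt rule named above):

* x2 is highly (but not perfectly) correlated with x1
clear
set seed 2
set obs 500
generate x1 = rnormal()
generate x2 = x1 + 0.1*rnormal()
generate y  = 1 + x1 + x2 + rnormal()
regress y x1 x2    // estimable and unbiased, but standard errors are inflated
estat vif          // large VIFs (rule of thumb: > 10) flag the problem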
Including Irrelevant Variables
y = α + β1x1 + β2x2 + β3x3 + u
- Suppose x3 has no effect on y, but the key assumptions are satisfied (overspecified)
- OLS is an unbiased estimator of β3, even if β3 is zero
- Estimates of β1 and β2 will be less efficient
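A do-file sketch of the efficiency cost (simulated data, invented names): the estimate of β1 stays centered on the truth, but its standard error grows when the irrelevant x3 is included.

* x3 is correlated with x1 but has no effect on y (β3 = 0)
clear
set seed 3
set obs 200
generate x1 = rnormal()
generate x3 = x1 + 0.5*rnormal()
generate y  = 1 + 2*x1 + rnormal()
regress y x1       // correctly specified
regress y x1 x3    // overspecified: x1 coefficient still near 2, std. err. larger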
Omitting Relevant Variables
y = α + β1x1 + β2x2 + u
- Suppose we omit x2 (underspecifying)
- OLS is generally biased
Omitting Relevant Variables
True model: y = α + β1x1 + β2x2 + u
- Estimate instead: ỹ = α̃ + β̃1x1
- And let x̃2 = δ̃0 + δ̃1x1 (the fitted regression of x2 on x1)
- It can be shown that: E(β̃1) = β1 + β2δ̃1
- The term β2δ̃1 is the Omitted Variable Bias
Multiple Regression Analysis

Direction of omitted variable bias:

         | Corr(x1, x2) > 0 | Corr(x1, x2) < 0
  β2 > 0 | Positive bias    | Negative bias
  β2 < 0 | Negative bias    | Positive bias

Source: Wooldridge, 2002, page 92
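Both the formula and the sign table can be checked by simulation. A do-file sketch (simulated data, invented names) with β2 > 0 and Corr(x1, x2) > 0, so the table above predicts positive bias:

* true model: y = 1 + 2*x1 + 3*x2 + u, with δ1 = 0.5
clear
set seed 4
set obs 10000
generate x1 = rnormal()
generate x2 = 0.5*x1 + rnormal()
generate y  = 1 + 2*x1 + 3*x2 + rnormal()
regress y x1 x2    // long regression: coefficient on x1 near β1 = 2
regress y x1       // short regression: near β1 + β2*δ1 = 2 + 3*0.5 = 3.5
regress x2 x1      // auxiliary regression: recovers δ1, near 0.5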
Omitting Relevant Variables
- More generally, all OLS estimates will be biased, even if just one explanatory variable is correlated with the omitted variables
- Direction of bias is less clear
Multiple Regression Analysis
- Goodness of fit
  - R² is the share of explained variance
  - R² never decreases when we add variables
  - Usually, it will increase regardless of relevance
- Adjusted R² accounts for this
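A do-file sketch (simulated data, invented names) showing the contrast: adding a pure-noise regressor nudges R² up, but typically pulls adjusted R² down.

clear
set seed 5
set obs 200
generate x1 = rnormal()
generate noise = rnormal()    // irrelevant regressor
generate y = 1 + 2*x1 + rnormal()
regress y x1          // note R-squared and Adj R-squared
regress y x1 noise    // R-squared never falls; Adj R-squared usually does here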
Next time: Interpreting results
- Binary regressors
- Other categorical regressors
- Categorical regressors as a series of binary regressors
- Quadratic terms
- Other interactions
- Average Partial Effects
Session materials developed by Bill Burke with input from Nicole Mason. January 2012. burkewi2@stanford.edu