Multiple Regression

Cautions About Simple Linear Regression

Correlation and regression are powerful tools for describing the relationship between two variables, but be aware of their limitations:
- Correlation and regression describe only linear relationships
- Correlation and the least-squares regression line are not resistant to outliers
- Predictions outside the range of the observed data are often inaccurate
- The relationship between two variables is often influenced by lurking variables not included in our model
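To make the caution about outliers concrete, here is a minimal Python sketch. The data points are made up purely for illustration (they are not from the lecture), and `pearson_r` is a helper name introduced here. It shows how a single unusual point can pull a perfect correlation down to a weak one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five points lying exactly on a line.
x_clean = [1, 2, 3, 4, 5]
y_clean = [1, 2, 3, 4, 5]

# The same points plus one outlier at (6, 0).
x_outlier = x_clean + [6]
y_outlier = y_clean + [0]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_outlier:.3f}")
```

With these made-up points, one added observation is enough to change the correlation from perfect to weak, which is why plotting the data first matters.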
Least-Squares Regression of Heart Disease and Wine Consumption

[Figure: scatterplot with least-squares regression line. x-axis: Alcohol from wine (liters per person per year), 0 to 10. y-axis: Heart disease deaths per 100,000 people, 0 to 300.]
Does this regression provide strong evidence that increased wine consumption lowers the risk of heart disease? No.
- Lurking variables: for example, wealth may be related to both wine consumption and heart disease
- Ecological fallacy: we can't make inferences about what individuals do based on aggregate data. Are the individuals who drink more wine the ones suffering less heart disease?

General Principles of Data Analysis

- Plot your data: to understand the data, always start with a series of graphs
- Interpret what you see: look for the overall pattern and for deviations from that pattern
- Numerical summary? Choose an appropriate measure to describe the pattern and the deviations
- Mathematical model? If the pattern is regular, summarize the data in a compact mathematical model
Analysis of Two Quantitative Variables

- Plot your data: for two quantitative variables, use a scatterplot
- Interpret what you see: describe the direction, form, and strength of the relationship
- Numerical summary? If the pattern is roughly linear, summarize with the correlation, means, and standard deviations
- Mathematical model? Regression gives a compact model of the overall pattern, if the relationship is roughly linear

Analysis of Three or More Quantitative Variables

- Plot your data: to examine relationships among all possible pairs, use a scatterplot matrix
- Interpret what you see: describe the direction, form, and strength of the relationships
- Numerical summary? If the patterns are roughly linear, summarize with correlations, means, and standard deviations
- Mathematical model? Multiple regression gives a compact model of the relationship between a response variable and a set of predictors
[Figure: scatterplot matrix of Blood alcohol content, Number of 12 ounce beers consumed, and Weight (lbs).]

In Stata, obtain this graph with
    graph matrix bac beers weight
Correlation Matrix in Stata

corr uses only cases with no missing values on any variable (like regress).

    . corr bac beers weight
    (obs=16)

                 |      bac    beers   weight
    -------------+---------------------------
             bac |   1.0000
           beers |   0.8943   1.0000
          weight |  -0.1550   0.2489   1.0000

- Because the matrix is symmetric, only half is shown
- Weak, negative correlation between weight and BAC
- Weak, positive correlation between weight and number of beers consumed

Correlation Matrix in Stata

pwcorr uses all cases with no missing values for each pair.

    . pwcorr bac beers weight, sig sidak obs

                 |      bac    beers   weight
    -------------+---------------------------
             bac |   1.0000
                 |
                 |       16
                 |
           beers |   0.8943   1.0000
                 |   0.0000
                 |       16       16
                 |
          weight |  -0.1550   0.2489   1.0000
                 |   0.9186   0.7287
                 |       16       16       16

- The sig option gives p-values for the hypothesis that r is indistinguishable from 0
- The sidak option corrects the p-values for multiple comparisons
- The obs option shows the number of cases used for each pair
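As a rough illustration of what the significance test and the multiple-comparison correction involve, the Python sketch below computes the t statistic for a correlation, t = r·sqrt(n − 2)/sqrt(1 − r²), and applies the standard Šidák formula, adjusted p = 1 − (1 − p)^m. It assumes that m is the number of pairwise correlations (3 for three variables) and that Stata's sidak option uses this formula; the function names are introduced here for illustration only.

```python
import math

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing that a correlation is 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def sidak_adjust(p, m):
    """Sidak-adjusted p-value for m comparisons (assumed formula), capped at 1."""
    return min(1.0, 1 - (1 - p) ** m)

def sidak_unadjust(p_adj, m):
    """Invert the Sidak formula; only meaningful when p_adj < 1."""
    return 1 - (1 - p_adj) ** (1 / m)

n_obs = 16
n_variables = 3
n_pairs = n_variables * (n_variables - 1) // 2  # 3 pairwise correlations

# Correlations and Sidak-adjusted p-values copied from the pwcorr output.
t_weight_bac = t_from_r(-0.1550, n_obs)
t_weight_beers = t_from_r(0.2489, n_obs)

# Unadjusted p-values implied by the reported adjusted ones, under the assumed formula.
p_weight_bac = sidak_unadjust(0.9186, n_pairs)
p_weight_beers = sidak_unadjust(0.7287, n_pairs)

print(f"t (weight, bac)   = {t_weight_bac:.3f}")
print(f"t (weight, beers) = {t_weight_beers:.3f}")
print(f"implied unadjusted p (weight, bac)   = {p_weight_bac:.4f}")
print(f"implied unadjusted p (weight, beers) = {p_weight_beers:.4f}")
```

Under these assumptions, the adjusted p-values are always at least as large as the unadjusted ones, so the correction makes it harder to declare any single correlation significant.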
Multiple Regression in Stata

    . regress bac beers weight

          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  2,    13) =  128.33
           Model |  .027816116     2  .013908058           Prob > F      =  0.0000
        Residual |  .001408883    13  .000108376           R-squared     =  0.9518
    -------------+------------------------------           Adj R-squared =  0.9444
           Total |     .029225    15  .001948333           Root MSE      =  .01041

    ------------------------------------------------------------------------------
             bac |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           beers |   .0199757   .0012629    15.82   0.000     .0172474     .022704
          weight |  -.0003628   .0000567    -6.40   0.000    -.0004853   -.0002404
           _cons |   .0398634   .0104333     3.82   0.002     .0173236    .0624031
    ------------------------------------------------------------------------------

- F( 2, 13) and Prob > F: overall F-test of the model
- R-squared: R²
- beers: slope, b₁
- weight: slope, b₂
- _cons: y-intercept, a

ŷ = a + b₁x₁ + b₂x₂

Estimated BAC = .0399 + (.0200)(Beers consumed) − (.0004)(Weight)

In Stata, obtain added-variable plots with
    avplots

[Figure: added-variable plot of e( bac | X ) against e( beers | X ). coef = .01997571, se = .0012629, t = 15.82]
[Figure: added-variable plot of e( bac | X ) against e( weight | X ). coef = -.00036282, se = .00005668, t = -6.4]
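The summary statistics in the regress output can be reproduced from the sums of squares and coefficients it reports. The Python sketch below does that arithmetic using only numbers copied from the output. The `predict_bac` helper and its example inputs (5 beers, 150 lbs) are hypothetical and are included only to show how the fitted equation is used.

```python
import math

# Values copied from the regress output for bac on beers and weight.
n_obs = 16
n_predictors = 2
ss_model = 0.027816116
ss_residual = 0.001408883
ss_total = 0.029225

df_model = n_predictors
df_residual = n_obs - n_predictors - 1  # 13

ms_model = ss_model / df_model
ms_residual = ss_residual / df_residual

r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual
f_statistic = ms_model / ms_residual
root_mse = math.sqrt(ms_residual)

# (coefficient, standard error) pairs from the coefficient table.
coefs = {
    "beers": (0.0199757, 0.0012629),
    "weight": (-0.0003628, 0.0000567),
    "_cons": (0.0398634, 0.0104333),
}
# Each t statistic is the coefficient divided by its standard error.
t_stats = {name: b / se for name, (b, se) in coefs.items()}

def predict_bac(beers, weight_lbs):
    """Fitted BAC from y-hat = a + b1*x1 + b2*x2 (hypothetical helper)."""
    a = coefs["_cons"][0]
    b1 = coefs["beers"][0]
    b2 = coefs["weight"][0]
    return a + b1 * beers + b2 * weight_lbs

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F(2, 13)      = {f_statistic:.2f}")
print(f"Root MSE      = {root_mse:.5f}")
for name, t in t_stats.items():
    print(f"t({name}) = {t:.2f}")

# Hypothetical example: 5 beers, 150 lbs (inside the range shown in the plots).
print(f"Predicted BAC = {predict_bac(5, 150):.4f}")
```

Using the unrounded coefficients for prediction matters here: the weight slope is small, but it is multiplied by weights in the hundreds, so rounding it to one significant digit shifts the prediction noticeably.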
Residuals-versus-Fitted Plot

[Figure: Residuals (-.02 to .02) plotted against Fitted values (0 to .2), with a horizontal line at 0.]

In Stata, obtain this plot after regress with
    rvfplot, yline(0)

Residuals-versus-Predictor Plot

[Figure: Residuals (-.02 to .02) plotted against Weight (lbs) (100 to 300), with a horizontal line at 0.]

In Stata, obtain this plot after regress with
    rvpplot weight, yline(0)