Sociology 6Z03 Topic 6: Least-Squares Regression


John Fox, McMaster University, Fall 2016

John Fox (McMaster University) Soc 6Z03: Least-Squares Regression Fall 2016

Outline: Least-Squares Regression

Introduction
Review of the Equation of a Straight Line
The Least-Squares Regression Line
Regression vs. Correlation
Detecting Problems in Least-Squares Linear Regression
Interpreting Correlation and Regression

Introduction

When the relationship between a response variable ($y$) and an explanatory variable ($x$) is linear, it is reasonable to try to summarize the relationship with a straight line.

This lecture describes the most common method for fitting a straight line to a scatter of points, called linear least-squares regression.

Review of the Equation of a Straight Line

A straight line can be represented by the equation

$$y = a + bx$$

where:

$a$, called the y-intercept of the line, represents the $y$-value corresponding to an $x$-value of 0.

$b$, called the slope of the line, indicates how much $y$ changes when $x$ is increased by 1.

If $b$ is positive, then the value of $y$ increases as $x$ increases; if $b$ is negative, then the value of $y$ decreases as $x$ increases; if $b = 0$, then the line is horizontal: the value of $y$ does not change as $x$ changes.

Review of the Equation of a Straight Line

[Figure: two panels, Positive Slope ($b > 0$) and Negative Slope ($b < 0$), each showing the line $y = a + bx$ with its intercept $a$ and the change $b$ in $y$ for a one-unit increase in $x$.]

The Least-Squares Regression Line: How Should We Fit a Line to a Scatterplot?

Unless the linear relationship between $y$ and $x$ is perfect, which is never the case for real data, no line will go through all of the points in a scatterplot.

When the linear relationship between $y$ and $x$ is very strong, it is easy to fit a line by eye to the scatterplot of the data. This is not the case when the relationship between the variables is weaker, as is usually true for data in the social sciences.

We therefore need a method of fitting a line to the scatter of points that doesn't depend upon subjective judgment.

We want a line that comes as close to the points as possible. A line that comes close to the data allows us to predict values of $y$ for specific values of $x$.

The Least-Squares Regression Line: How Should We Fit a Line to a Scatterplot?

Consider, for example, the relationship between prestige and education for the Canadian occupational prestige data.

[Figure: scatterplot of Prestige against Education with a fitted line; the predicted prestige for an occupation with 11 years of education is marked.]

(To find the predicted or fitted value of $y$ for an occupation with 11 years of education, go up to the line above $x = 11$, and then over to the y-axis to find the corresponding value of $y$, that is, $\hat{y} \approx 48$.)

The Least-Squares Regression Line: Thought Question

What is the approximate predicted value of prestige $\hat{y}$ for an occupation with $x = 15$ years of education?

A 25.
B 50.
C 70.
D 90.
E I don't know.

The Least-Squares Regression Line: How Should We Fit a Line to a Scatterplot?

The predicted value of $y$ is represented by $\hat{y}$ (called "y-hat"), because the predicted and observed $y$-values will generally differ.

In the Canadian prestige data, for example, there are a few occupations with about 11 years of education. Some have observed prestige values a bit above the line, and some have observed values a bit below the line.

For each observation, the difference between the observed and predicted $y$-value, representing the error in prediction for that observation, is called the residual (literally, what is left over):

$$\text{residual} = \text{observed value} - \text{predicted value} = y - \hat{y}$$

Notice that the residuals are the vertical distances between the points and the line. We want a line that makes the residuals as small as possible.

[Figure: scatterplot of Prestige against Education with the fitted line, marking for one point the observed $y$, the predicted $\hat{y}$, and the residual $y - \hat{y}$.]
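As a minimal sketch of the residual calculation, the following Python snippet computes fitted values and residuals for a handful of observations. Both the data and the line (intercept `a = -12`, slope `b = 5.5`) are hypothetical, invented for illustration; they are not the Canadian prestige data or its fitted line.

```python
# Hypothetical data: years of education (x) and prestige scores (y).
education = [8, 10, 11, 11, 13, 15]
prestige = [30, 45, 44, 52, 60, 68]

# A hypothetical line y-hat = a + b*x, as if drawn by eye.
a, b = -12.0, 5.5

# Fitted (predicted) values: the height of the line above each x.
fitted = [a + b * x for x in education]

# Residual = observed value - predicted value (a vertical distance).
residuals = [y - y_hat for y, y_hat in zip(prestige, fitted)]

for x, y, y_hat, e in zip(education, prestige, fitted, residuals):
    print(f"x={x:2d}  observed={y:5.1f}  predicted={y_hat:5.1f}  residual={e:+5.1f}")
```

The two made-up occupations with 11 years of education share the same predicted value but have residuals of opposite sign, one lying below the line and one above it.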

The Least-Squares Regression Line: The Method of Least Squares (LS)

Making the residuals small would be easy if there were just two points: we could simply pass a line through the two points.

When there are many points, there are several different ways to proceed.

The most common method of fitting a line is called the method of least squares (developed independently by Gauss and the French mathematician Legendre at the end of the 18th century), which finds the line with the smallest possible sum of squared residuals:

$$\text{Choose } a \text{ and } b \text{ to minimize } \sum \text{residual}_i^2$$

The residuals are squared before adding them up to prevent positive residuals (corresponding to points above the line) from canceling out negative residuals (points below the line). Squaring makes all of the residuals positive.

The Least-Squares Regression Line: Least-Absolute-Values (LAV) Regression

Another way to make all of the residuals positive is to take their absolute values:

$$\text{Choose } a \text{ and } b \text{ to minimize } \sum |\text{residual}_i|$$

This approach has its strong points (for example, it produces values of $a$ and $b$ that are more resistant to outliers than those produced by least-squares regression), but it is more difficult mathematically.

Finding $a$ and $b$ to minimize the sum of squared residuals is analogous to using the mean to represent the centre of a distribution, while finding $a$ and $b$ to minimize the sum of absolute residuals is analogous to using the median.
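The mean/median analogy can be checked numerically. The sketch below uses a small set of hypothetical values with one outlier and a simple grid search (an illustrative device, not how either criterion is minimized in practice) to find the centre that minimizes each criterion.

```python
from statistics import mean, median

# Hypothetical values with one outlier (43).
values = [2, 3, 5, 7, 43]

def sum_squared(c):
    """Sum of squared deviations of the values from a candidate centre c."""
    return sum((v - c) ** 2 for v in values)

def sum_absolute(c):
    """Sum of absolute deviations of the values from a candidate centre c."""
    return sum(abs(v - c) for v in values)

# Candidate centres from 0.0 to 50.0 in steps of 0.1.
candidates = [i / 10 for i in range(501)]

best_squared = min(candidates, key=sum_squared)
best_absolute = min(candidates, key=sum_absolute)

print("minimizes squared deviations: ", best_squared, " mean   =", mean(values))
print("minimizes absolute deviations:", best_absolute, " median =", median(values))
```

For these values the squared criterion is minimized at the mean (12), which the outlier pulls upward, while the absolute criterion is minimized at the median (5), which the outlier does not move.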

The Least-Squares Regression Line: Finding the LS Coefficients

The least-squares line has the equation $\hat{y} = a + bx$, with slope and intercept

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r\,\frac{s_y}{s_x}$$

$$a = \bar{y} - b\bar{x}$$

where $r$ is the correlation between $y$ and $x$; $s_y$ is the standard deviation of the response variable $y$; $s_x$ is the standard deviation of the explanatory variable $x$; and $\bar{y}$ and $\bar{x}$ are the means of the two variables.

Calculating the least-squares coefficients $a$ and $b$ according to these formulas is a lot of work, even when the number of observations $n$ is not very large. But we can leave the work to the computer.
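Leaving the work to the computer might look like the following sketch, which evaluates both forms of the slope formula and the intercept using only the Python standard library. The data are hypothetical, invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: education in years (x) and prestige score (y).
x = [6, 8, 10, 11, 12, 14, 16]
y = [20, 35, 40, 48, 55, 62, 75]

x_bar, y_bar = mean(x), mean(y)

# Sums of cross-products and squares of deviations from the means.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Slope from the sum-of-products formula.
b_direct = sxy / sxx

# Slope from the correlation and the two standard deviations.
r = sxy / sqrt(sxx * syy)
b_from_r = r * stdev(y) / stdev(x)

# Intercept: a = y-bar - b * x-bar.
a = y_bar - b_direct * x_bar

print(f"b (sum of products) = {b_direct:.4f}")
print(f"b (r * s_y / s_x)   = {b_from_r:.4f}")
print(f"a                   = {a:.4f}")
```

The two slope calculations agree up to floating-point rounding, because the factors of $n - 1$ in the standard deviations cancel.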

The Least-Squares Regression Line: Finding the LS Coefficients

Here, for example, is the calculation of the least-squares line for the regression of prestige ($y$) on education ($x$). Starting with the correlation, standard deviations, and means of the two variables, we get

$r = .85018$, $s_x =$ […], $s_y =$ […], $\bar{x} = 10.738$, $\bar{y} =$ […]

$b = r\,\dfrac{s_y}{s_x} = 5.361$

$a = \bar{y} - b\bar{x} =$ […]

The fitted regression equation is therefore $\hat{y} =$ […]

Note that the origin (0, 0) does not appear in the scatterplot, and that we cannot see the intercept $a =$ […] in the graph.

[Figure: scatterplot of Prestige against Education with the least-squares line $\hat{y} =$ […].]

The Least-Squares Regression Line: Interpreting the LS Intercept

$a =$ […] is the predicted prestige score for an occupation with 0 years of average education. In this instance, we should not interpret the value of $a$ literally, because

1. none of the 102 occupations in the dataset has less than 6 years of average education; and
2. the prestige scores cannot be negative.

As mentioned, because the axes in the scatterplot do not start at the origin [the point (0, 0)], the intercept does not appear on this graph (but see the next slide).

In general, even when a linear regression does a good job of summarizing the relationship between $y$ and $x$ within the observed range of the data, it is dangerous to extrapolate this relationship beyond the range of the data.

The regression intercept $a =$ […] extrapolates the least-squares line far below the range of the data on education.

[Figure: scatterplot of Prestige against Education with the least-squares line extended to the intercept.]

The Least-Squares Regression Line: Interpreting the LS Slope

$b = 5.361$: Each additional year of education is accompanied on average by an increase of a bit more than 5 prestige points.

This is a descriptive statement about the association between prestige and education. We may or may not be willing to give the slope coefficient a causal interpretation ("increasing average education by one year causes the prestige of the occupation to rise by more than 5 points").

Because it tells us how $y$ changes with $x$, we are usually more interested in the slope $b$ than in the intercept $a$.

The Least-Squares Regression Line: Interpreting the LS Intercept and Slope

Thought Question

Imagine that in a least-squares regression of individuals' annual income in dollars on their years of education, we obtain the following regression equation:

income = 10,[…] education

Suppose that the regression equation is a reasonable summary of the relationship between income and education, and that we have data on individuals with 0 to 20 years of education.

The Least-Squares Regression Line: Interpreting the LS Intercept and Slope

Thought Question

Which of the following statements is correct?

A The predicted value of income for an individual with 0 years of education is $10,000.
B Each additional year of education is associated on average with an increase of $5000 in annual income.
C Both of the above.
D Neither of the above.
E I don't know.

The Least-Squares Regression Line: Graphing the LS Line

To plot the regression line on the scatterplot, find two points on the line. Any two points will do, but we can plot the line more accurately if the points are widely separated.

For example, for the regression of prestige on education, we can find the $\hat{y}$ values corresponding to $x$-values of 6 and 16:

for $x = 6$: $\hat{y} =$ […]
for $x = 16$: $\hat{y} =$ […]

Connecting the points (6, […]) and (16, […]) locates the least-squares line (as shown on the next slide).

Two points that are always on the least-squares line are $(0, a)$ and $(\bar{x}, \bar{y})$. For the example regression, $(0, a) = (0,$ […]$)$ and $(\bar{x}, \bar{y}) = (10.738,$ […]$)$.
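The two-point recipe for graphing the line can be sketched in Python as follows. The data are hypothetical, and `least_squares` and `predict` are illustrative helper names rather than functions from any library; the evaluation points 6 and 16 simply mirror the slide's choice of widely separated $x$-values.

```python
from statistics import mean

def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: education in years (x) and prestige score (y).
x = [6, 8, 10, 11, 12, 14, 16]
y = [20, 35, 40, 48, 55, 62, 75]

a, b = least_squares(x, y)

def predict(x_value):
    """Height of the fitted line at x_value."""
    return a + b * x_value

# Two widely separated points on the line; joining them draws the line.
point_low = (6, predict(6))
point_high = (16, predict(16))
print("plot a segment from", point_low, "to", point_high)

# The line always passes through (0, a) and (x-bar, y-bar).
print("(0, a)         =", (0, predict(0)))
print("(x-bar, y-bar) =", (mean(x), mean(y)), " line at x-bar:", predict(mean(x)))
```

Any plotting tool can then draw the segment between the two computed points on top of the scatterplot.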

The Least-Squares Regression Line: Graphing the LS Line

Graphing the least-squares line by connecting the points (6, […]) and (16, […]):

[Figure: scatterplot of Prestige against Education with the least-squares line drawn through the two computed points.]

Regression vs. Correlation

The slope $b$ of the least-squares regression line and the correlation $r$ are related by the equation

$$b = r\,\frac{s_y}{s_x}$$

The correlation and slope are similar in certain respects and different in others:

When $r = 0$, indicating that there is no linear relationship between $y$ and $x$, then $b = 0$ as well.

If $x$ and $y$ are standardized variables (so that $s_x = s_y = 1$), then $b = r$.

Regression vs. Correlation: The Two LS Lines

The correlation coefficient $r$ doesn't depend upon which variable is treated as the response and which as the explanatory variable.

The slope $b$ does depend upon which variable is treated as the response. If $x$ is regressed on $y$ rather than vice-versa (i.e., if $x$ is treated as the response variable), then

$$b_{x \text{ on } y} = r\,\frac{s_x}{s_y}$$

which is usually different from $b_{y \text{ on } x}$.

There are two least-squares regression lines: one for the regression of $y$ on $x$, and the other for the regression of $x$ on $y$. Unless $r = \pm 1$, these two regression lines are different.

For example, for prestige and education:

[Figure: scatterplot of Prestige ($y$) against Education ($x$) showing both least-squares lines, the regression of $x$ on $y$ and the regression of $y$ on $x$.]
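A short sketch makes the asymmetry concrete. Using hypothetical data, it computes the slope of $y$ on $x$, the slope of $x$ on $y$, and the correlation; since $b_{y \text{ on } x} = r s_y / s_x$ and $b_{x \text{ on } y} = r s_x / s_y$, their product should equal $r^2$.

```python
from math import sqrt
from statistics import mean

def slope(x, y):
    """Least-squares slope for the regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def correlation(x, y):
    """Pearson correlation; symmetric in x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: education in years (x) and prestige score (y).
x = [6, 8, 10, 11, 12, 14, 16]
y = [20, 35, 40, 48, 55, 62, 75]

b_y_on_x = slope(x, y)   # y treated as the response
b_x_on_y = slope(y, x)   # x treated as the response
r = correlation(x, y)

print(f"b (y on x) = {b_y_on_x:.4f}")
print(f"b (x on y) = {b_x_on_y:.4f}")
print(f"r          = {r:.4f},  r^2 = {r**2:.4f},  product of slopes = {b_y_on_x * b_x_on_y:.4f}")
```

Swapping the roles of the variables changes the slope but leaves the correlation unchanged.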

Regression vs. Correlation: Interpreting the Correlation Coefficient

The square of the correlation coefficient ($r^2$) has a special interpretation in least-squares regression:

Recall the regression residuals, which give the differences between observed and predicted response values,

$$\text{residual}_i = y_i - \hat{y}_i$$

The sum of squared residuals represents the variation of $y$ around the regression line,

$$\sum \text{residual}_i^2 = \sum (y_i - \hat{y}_i)^2$$

The total variation of $y$ around its mean (ignoring the regression line) is

$$\sum (y_i - \bar{y})^2$$

The difference between the two measures of variation is the amount of variation accounted for by the regression of $y$ on $x$:

$$\text{explained variation} = \text{total variation} - \text{residual variation} = \sum (y_i - \bar{y})^2 - \sum \text{residual}_i^2 = \sum (\hat{y}_i - \bar{y})^2$$

The squared correlation expresses the explained variation as a fraction of the total variation of $y$,

$$r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\sum \text{residual}_i^2}{\sum (y_i - \bar{y})^2}$$
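The decomposition can be verified numerically. The following sketch, on hypothetical data, computes the total, residual, and explained variation from a least-squares fit and compares the explained fraction with the squared correlation.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: education in years (x) and prestige score (y).
x = [6, 8, 10, 11, 12, 14, 16]
y = [20, 35, 40, 48, 55, 62, 75]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares coefficients and fitted values.
b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# The three measures of variation.
total = sum((yi - y_bar) ** 2 for yi in y)
residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
explained = sum((fi - y_bar) ** 2 for fi in fitted)

r = sxy / sqrt(sxx * syy)

print(f"total     = {total:.3f}")
print(f"residual  = {residual:.3f}")
print(f"explained = {explained:.3f}  (total - residual = {total - residual:.3f})")
print(f"explained / total = {explained / total:.4f},  r^2 = {r**2:.4f}")
```

The identity total = explained + residual holds for the least-squares line specifically; it generally fails for a line chosen some other way.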

Regression vs. Correlation: Interpreting the Correlation Coefficient

When there is a perfect linear relationship between $y$ and $x$, the residuals are all zero; the sum of squared residuals is zero; and $r^2 = 1$.

When there is no linear relationship between $y$ and $x$, the explained variation is zero, and $r^2 = 0$.

For the regression of occupational prestige on education, $r = .85018$, and thus the regression accounts for $r^2 = .85018^2 = .7228$, or about 72 percent, of the variation in prestige scores.

Detecting Problems in Least-Squares Linear Regression: Influential Data

The least-squares line is a good summary of the relationship between $y$ and $x$ when the relationship is in fact linear and when the data are well behaved.

But the least-squares line can sometimes be markedly affected by outlying data.

In regression analysis, an outlier is a point far away from the general pattern of the data. It is a point whose $y$-value is unusual compared to other points with similar $x$-values.

Points with unusual $x$-values, when they are out of line with the rest of the data, can be influential, in the sense that their inclusion in the dataset can markedly alter the regression line.

Like the mean, standard deviation, and correlation, therefore, the least-squares regression line is not resistant to unusual data.
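To illustrate how a single influential point can alter the fit, the sketch below fits a least-squares line with and without one point that has an unusual $x$-value and is out of line with the rest. All numbers are hypothetical, invented for this illustration.

```python
from statistics import mean

def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical well-behaved data in which y is close to x.
x_clean = [50, 55, 60, 65, 70, 75]
y_clean = [51, 54, 60, 66, 69, 75]

# The same data plus one hypothetical point far to the right and far below the pattern.
x_all = x_clean + [160]
y_all = y_clean + [56]

a_clean, b_clean = least_squares(x_clean, y_clean)
a_all, b_all = least_squares(x_all, y_all)

print(f"without the outlying point: y-hat = {a_clean:.2f} + {b_clean:.3f} x")
print(f"with the outlying point:    y-hat = {a_all:.2f} + {b_all:.3f} x")
```

Without the extra point the slope is close to 1; with it, the line flattens almost completely, even though six of the seven points are unchanged.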

Detecting Problems in Least-Squares Linear Regression: Influential Data

The following scatterplot, showing reported and measured weight in kg, is for 101 women engaged in regular exercise. The data were collected by Caroline Davis, a psychologist at York University who studies eating disorders.

If the women are unbiased reporters of their weight, then the regression line should be approximately $\hat{y} = x$ (that is, an intercept of 0 and a slope of 1).

When the outlying point at the right is omitted, the least-squares line is close to the line of unbiased reporting (the lighter solid line).

In this case, the influential outlier represents an error in recording the data.

[Figure: scatterplot of Reported Weight (kg) against Measured Weight (kg).]

Important Point

Outliers, influential data, and other problems in regression analysis can be detected in the scatterplot of $y$ against $x$. It is therefore important always to plot regression data.

Detecting Problems in Least-Squares Linear Regression: Nonlinearity

Sometimes problems appear even more clearly in plots of residuals against $x$:

In the following graph there is a nonlinear relationship between $y$ and $x$. Note that the average LS residual is 0, and that the residuals and $x$-values are uncorrelated.

[Figure: scatterplot of $y$ against $x$ and the corresponding plot of Residuals against $x$.]

Detecting Problems in Least-Squares Linear Regression: Changing Spread

In the following graph, the spread of $y$ around the regression line (the spread of the residuals) increases with $x$.

Predictions at large values of $x$ will be less accurate than at small values of $x$.

Least-squares regression may not be the best method for fitting a line to the scatterplot.

[Figure: scatterplot of $y$ against $x$ and the corresponding plot of Residuals against $x$.]
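The two properties of least-squares residuals noted for the nonlinear example (an average of 0 and no correlation with $x$) hold even when a straight line is the wrong summary. As a minimal sketch, the code below fits a line to hypothetical data that follow $y = x^2$ exactly and inspects the residuals.

```python
from statistics import mean

# Hypothetical nonlinear data: y is exactly x squared.
x = list(range(1, 10))
y = [xi ** 2 for xi in x]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least-squares line and its residuals.
b = sxy / sxx
a = y_bar - b * x_bar
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# The residuals sum to zero, so their average is zero.
residual_sum = sum(residuals)

# Zero cross-product with x means zero correlation with x.
cross_product = sum((xi - x_bar) * e for xi, e in zip(x, residuals))

print(f"sum of residuals          = {residual_sum:.2e}")
print(f"cross-product with x      = {cross_product:.2e}")
print("residuals by x:", [round(e, 2) for e in residuals])
```

The summary numbers look unremarkable, but the residuals are positive at both ends and negative in the middle, which is the curved pattern a residual plot would reveal.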

Detecting Problems in Least-Squares Linear Regression

What we want to see in a residual plot are unpatterned residuals, unrelated to $x$:

[Figure: scatterplot of $y$ against $x$ and the corresponding plot of Residuals against $x$.]

Detecting Problems in Least-Squares Linear Regression: Anscombe's Quartet

The following examples (due to Anscombe, and called "Anscombe's Quartet" by Tufte) are particularly instructive and cautionary:

[Table: the numerical $x$ and $y$ values for Dataset 1, Dataset 2, Dataset 3, and Dataset 4.]

Detecting Problems in Least-Squares Linear Regression: Anscombe's Quartet

Anscombe's four datasets are cleverly constructed to have exactly the same regression of $y$ on $x$ and the same correlation:

$\hat{y} =$ […], $r = .82$

As well, $\bar{x}$, $\bar{y}$, $s_x$, and $s_y$ are all the same in the four datasets.

[Figure: four scatterplots of $Y$ against $X$, panels (a), (b), (c), and (d), one for each dataset.]
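The near-identical summaries can be checked directly. The sketch below uses the quartet's values as commonly published from Anscombe (1973); they are supplied here as an assumption, since the numbers do not survive in this transcript. It computes the least-squares coefficients and correlation for each dataset.

```python
from math import sqrt
from statistics import mean

def fit(x, y):
    """Return (a, b, r): LS intercept, LS slope, and correlation."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)

# Anscombe's quartet as commonly published; datasets 1 to 3 share the same x-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {k: fit(x, y) for k, (x, y) in quartet.items()}
for k, (a, b, r) in results.items():
    print(f"dataset {k}: y-hat = {a:.2f} + {b:.3f} x,  r = {r:.3f}")
```

The fitted coefficients and correlations agree across the four datasets to about two decimal places, which is why the differences only show up in the scatterplots.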

Detecting Problems in Least-Squares Linear Regression: Anscombe's Quartet

The linear least-squares regression is a good summary of the relationship between $x$ and $y$ only for the first dataset.

In the second dataset, the relationship is nonlinear.

In the third dataset, there is an outlier.

In the fourth dataset, the least-squares line "chases" the influential observation.

None of these problems is clear from the fitted regression equation and correlation, and none (but the last) is clear from looking at the numerical data.

Interpreting Correlation and Regression: Cautions: Extrapolation, Lurking Variables

Extrapolation: It is not safe to use a regression line for prediction outside of the range of $x$-values observed in the data.

Lurking variables: A lurking variable is an explanatory variable that has been omitted from the analysis and that has an important effect on the relationship between $x$ and $y$.

Imagine, for example, that $x$ is education and $y$ is income, measured for each of a number of individuals (see the following graph, with contrived data). The filled dots represent men and the hollow dots represent women.

If $y$ is regressed on $x$ using the data both for women and for men, the relationship between income and education appears to be very weak, with $r = .03$. But when $y$ is regressed on $x$ separately for women and men, the relationships are much stronger; $r = .94$ for each group. Here, the lurking variable is gender.
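A lurking variable of this kind is easy to reproduce with contrived numbers. In the sketch below, the data are hypothetical and are not the values behind the slide's graph, so the correlations differ from the .03 and .94 quoted above; the qualitative pattern is the same.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Pearson correlation between x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Contrived (hypothetical) data: education in years, income in $1000s.
educ_women = [12, 13, 14, 15, 16]
income_women = [20.0, 22.5, 23.5, 26.5, 28.0]
educ_men = [8, 9, 10, 11, 12]
income_men = [24.0, 26.5, 27.5, 30.5, 32.0]

# Within each group, income rises steadily with education.
r_women = correlation(educ_women, income_women)
r_men = correlation(educ_men, income_men)

# Pooling the groups ignores gender, the lurking variable.
r_pooled = correlation(educ_women + educ_men, income_women + income_men)

print(f"women only: r = {r_women:.3f}")
print(f"men only:   r = {r_men:.3f}")
print(f"pooled:     r = {r_pooled:.3f}")
```

In these contrived numbers one group has more education but lower income at each level, so the between-group difference cancels the within-group trend when the data are pooled.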

Interpreting Correlation and Regression: Further Cautions: Lurking Variables

[Figure: scatterplot of Income ($1000s) against Education (years), with filled dots for men and hollow dots for women.]

The opposite effect can also occur: an apparent relationship between two variables can be induced by the omission of an important third variable.

Freedman was interested in the relationship between the population density of cities and their crime rates. He found that the association between these two variables was due to other factors that are related both to density and to crime: For example, large cities tend to be denser and to have higher crime rates. If we look separately at cities of similar size, density and crime are not related.

Interpreting Correlation and Regression: Further Cautions: Association vs. Causation

Important Point

Association is not causation: Because an observed association can be due to a lurking variable, mere statistical association between variables does not imply that one variable causes the other.

Causal inferences are much more certain in experimental research than in observational research. In a randomized experiment, the values of the explanatory variable are assigned at random to individuals and therefore cannot (except by very bad luck) be related to lurking variables.

Most interesting sociological research questions are not amenable to experimental investigation, however.

Thought Question

Among the many contributions to statistics of the great British statistician Sir R. A. Fisher was his invention of the randomized comparative experiment. In the 1950s, Fisher maintained that there was no convincing evidence that smoking causes lung cancer, because the association between these two variables was at the time based solely on observational data. Fisher's argument implies that

A there may be one or more lurking variables that are related both to smoking and to lung cancer.
B lung cancer causes smoking rather than vice-versa.
C there is no observed relationship between smoking and lung cancer.
D Fisher's argument just doesn't make sense: it is simply a stupid argument.