Notation and Equations for Regression, Lecture 11/4


Notation:

m: The number of predictor variables in a regression
Xi: One of multiple predictor variables. The subscript i represents any number from 1 through m. So, the predictor variables are X1, X2, ..., Xm.
Y: The outcome variable that is being predicted or explained in a regression
Ŷ: The estimated outcome value as predicted by the regression equation
bi: The regression coefficient for predictor Xi
b0: The intercept in the regression equation
σbi: The standard error of a regression coefficient
SSY: The total sum of squares for Y, which represents the uncertainty in the outcome
SSresidual: The residual sum of squares, representing the uncertainty left over after we use regression to generate the best possible predictions for the outcome
SSregression: The sum of squares for the regression, representing the amount of uncertainty or variability that the predictors can explain
R²: The proportion of variability explained by the regression
MSresidual: The residual mean square, which is the mean squared error of the regression prediction, Ŷ; also used as an estimate of the population variance, σ²Y
MSregression: The mean square for the regression
F: F statistic, which is the test statistic for deciding whether a regression explains meaningful variability in the outcome

Regression equation. A regression equation is a formula that uses a set of predictor variables (Xi) to make a prediction for some outcome variable (Y). The prediction is called Ŷ. The purpose of a regression equation is to find the best possible way to do this, that is, to combine the predictors in a way that gives estimates (Ŷ) that are as close as possible to the right answer (Y). When there's only one predictor (m = 1), regression is essentially the same as correlation, and the regression equation is the same as the regression line we would draw on a scatterplot.
The equation for a line always has the form Ŷ = b0 + b1X, where b0 and b1 are numbers representing the intercept and the slope. Finding the line that's closest to the data is the same as choosing b0 and b1 so that the predicted Ŷ values are as close as possible to the true Y values. When there is more than one predictor, we take the basic equation for a line and extend it in a natural way.

$$
\begin{aligned}
\hat{Y} &= b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m \\
        &= b_0 + \sum_{i=1}^{m} b_i X_i
\end{aligned}
\tag{1}
$$

The two lines of Equation 1 mean the same thing. The first line writes everything out, and the second line combines things into a sum (you can choose to remember either one). In the sum, the index i takes on all values from 1 to m. When i = 1, the summand is b1X1; when i = 2, the summand is b2X2; and so on until bmXm.

Notice that the contribution of each predictor (biXi) is the same as in the equation for a simple line with one predictor. This is why we say that each predictor has a linear effect on the predicted outcome. If we held all of the predictors except one fixed, then the relationship between the remaining predictor and the prediction would be a line.

The goal of regression is to find the values of b0 through bm that lead to the best predictions. Therefore, the bs are a main focus of regression. Each bi is called the regression coefficient for its corresponding predictor (e.g., b1 is the regression coefficient for X1).

The regression coefficient tells what kind of influence each predictor has on the predicted outcome. When a regression coefficient is positive, that means the predictor has a positive effect, just like with a positive correlation. When a regression coefficient is negative, the predictor has a negative effect, just like with a negative correlation. The magnitude or absolute value of bi tells how strong the effect is. If bi is near zero, then Xi has a weak effect on the outcome, just like with a correlation near zero. If bi is large (either positive or negative), then Xi has a strong effect on the outcome, just like with a correlation near ±1.

The difference between regression coefficients and correlations is that regression coefficients aren't standardized, meaning they're not restricted to lie between −1 and 1. Therefore, the strength of a regression coefficient needs to be interpreted in terms of the units of the predictor and outcome variables.
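As a minimal sketch of Equation 1 in Python, the function below adds the intercept to the sum of each coefficient times its predictor value. The function name `predict` and every number here are made up for illustration; they do not come from a real regression.

```python
def predict(b0, coefs, xs):
    """Return Y-hat = b0 + sum of b_i * X_i for one subject."""
    return b0 + sum(b * x for b, x in zip(coefs, xs))


# Hypothetical intercept and coefficients for m = 2 predictors
b0 = 10.0
b = [7.0, -5.0]

# Hypothetical predictor values (X1, X2) for one subject
subject = [2.0, 3.0]
y_hat = predict(b0, b, subject)  # 10 + 7*2 - 5*3

# Raising X1 by one unit while holding X2 fixed changes Y-hat by b1
y_hat_plus_one = predict(b0, b, [3.0, 3.0])

print(y_hat, y_hat_plus_one - y_hat)  # prints 9.0 7.0
```

The second call illustrates the linear effect described above: with the other predictor held fixed, a one-unit change in X1 shifts the prediction by exactly b1.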
In general, bi tells how many units Ŷ increases by for every unit increase in Xi. For example, if Xi is height (in inches) and Y is how long it takes a person to run a mile (in seconds), bi = 7 would mean that for every extra inch of height, a person tends to take 7 seconds longer to run a mile. A negative coefficient means a decrease; e.g., bi = −5 would mean that for every extra inch of height, a person tends to take 5 seconds less to run a mile.

The b0 variable is a special regression coefficient called the intercept. Just like the intercept of a simple line, b0 tells what the value of Ŷ is when all the Xi are zero (i.e., where the line intersects the Y axis). If zero isn't a sensible value of any of the predictors (e.g., no one is 0 inches tall), then the intercept taken by itself won't be a very sensible value for the outcome (e.g., it could be −50 seconds). Therefore, we generally don't think too hard about what the value of the intercept indicates; we just know that it needs to be in the regression equation so that the overall pattern of predictions can be shifted up or down as needed to match the true values of the outcome.

Partitioning sums of squares. One important question with any regression equation is how well it does predicting the outcome. We answer this question in terms of how much variability in Y is explained by the regression, meaning how much uncertainty goes away when we use the regression equation to predict the outcome. Variability or uncertainty in this case is measured in sums of squares (SS), which are very similar to variance and mean squared error, except that we don't divide by degrees of freedom (yet). The total variability in Y is called SSY (sum of squares for Y) and is defined as

$$
SS_Y = \sum (Y - M_Y)^2 \tag{2}
$$

As usual, MY represents the sample mean of Y. Notice that if we divided SSY by n − 1, we would have the sample variance of Y. So, the sum of squares for Y is just like the variance for Y except that we don't divide by n − 1 (i.e., it's a sum instead of an average). As with variance, we can think of the sum of squares as a measure of uncertainty, or how much error we would expect to make if we had to guess the value of Y blindly. If we have no information about a subject, our best guess of their Y score is the mean, MY. Therefore (Y − MY)² is our squared error, and SSY is the sum of the squared error over all subjects.

If we do know something about a subject, then we can make a better prediction of their Y score than by blindly guessing the mean. This is what the regression equation does for us: it uses all the predictors, Xi, to come up with the best possible prediction, Ŷ. Once we have Ŷ, we can ask how well it does as an estimate of Y, again by adding up the squared error over all subjects. The result is called the residual sum of squares, because it represents the uncertainty in Y that's left over after we do the regression. Notice that the residual sum of squares is the same as mean squared error except that once again it's a sum instead of a mean, because we don't divide by degrees of freedom.

$$
SS_{\text{residual}} = \sum (Y - \hat{Y})^2 \tag{3}
$$

As stated above, the goal of regression is to find the values of the regression coefficients (bi) that lead to the best predictions. By best predictions, we mean minimizing the squared difference between Y and Ŷ. In other words, the goal is to minimize SSresidual. This is how we determine the regression coefficients (or, usually, how a computer determines them for us).
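For a single predictor (m = 1), the coefficients that minimize SSresidual have a standard closed form: b1 = Σ(X − MX)(Y − MY) / Σ(X − MX)² and b0 = MY − b1·MX. These notes do not state that formula, so treat it as an assumption brought in for illustration. The Python sketch below uses it on a made-up sample (the function names `fit_line` and `ss_residual` are hypothetical) and then checks that moving either coefficient away from the fitted value makes SSresidual larger.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor (assumed closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1


def ss_residual(b0, b1, xs, ys):
    """Sum of squared differences between Y and Y-hat (Equation 3)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample of five subjects
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
best = ss_residual(b0, b1, xs, ys)

# Nudging either coefficient away from its fitted value increases the error
worse_slope = ss_residual(b0, b1 + 0.1, xs, ys)
worse_intercept = ss_residual(b0 + 0.1, b1, xs, ys)

print(round(b0, 3), round(b1, 3), round(best, 3))  # prints 2.2 0.6 2.4
```

With more than one predictor the same minimization is normally left to statistical software, as the notes say.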
The question now is how well the regression did. If the predictors do a really good job of explaining the outcome, SSresidual will be close to zero. If the predictors tell us little or nothing about the outcome, then SSresidual will be close to the original, total sum of squares, SSY. SSresidual is always less than or equal to SSY (because SSY is the error of the naive prediction, MY, and SSresidual is the error of the best possible prediction, Ŷ), but the question is how much less. The reduction in uncertainty from SSY to SSresidual is called SSregression, because it's the amount of variability in the outcome (Y) that the regression can explain.

$$
\begin{aligned}
SS_{\text{regression}} &= SS_Y - SS_{\text{residual}} \\
SS_Y &= SS_{\text{regression}} + SS_{\text{residual}}
\end{aligned}
\tag{4}
$$

The two versions of Equation 4 say the same thing. The first version shows how we get SSregression by subtracting the residual sum of squares from the total sum of squares. The second version shows how the total sum of squares (i.e., the original variability in the data) can be broken into two parts: the portion that we can explain using the predictors (SSregression), and the portion that we cannot explain (SSresidual).
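The partition in Equation 4 can be worked through numerically. The Python sketch below uses a made-up sample of five subjects with one predictor; the coefficients b0 = 2.2 and b1 = 0.6 are assumed to be the least-squares values for that sample, and all variable names are hypothetical.

```python
# Made-up sample and its (assumed) least-squares coefficients
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

n = len(ys)
m_y = sum(ys) / n                   # M_Y, the naive prediction
y_hat = [b0 + b1 * x for x in xs]   # regression predictions

# Equation 2: error of guessing the mean for every subject
ss_y = sum((y - m_y) ** 2 for y in ys)

# Equation 3: error left over after using the regression
ss_residual = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

# Equation 4: the reduction in error is what the regression explains
ss_regression = ss_y - ss_residual

print(round(ss_y, 3), round(ss_residual, 3), round(ss_regression, 3))
# prints 6.0 2.4 3.6
```

Here the total of 6.0 splits into 3.6 that the predictor explains and 2.4 that it does not.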

Explained variability. Once we've worked out the total variability (SSY) and the portion explained by the predictors (SSregression), we can calculate their ratio. The ratio is the fraction of the total variability explained by the predictors, and it's our final measure of how useful the regression was. The fraction of explained variability is called R², because it extends the idea of the squared correlation, r². Recall that when we're doing a simple correlation (i.e., there's only one predictor), r² is the fraction of the variance in Y that can be explained by X. In other words, when there's only one predictor, R² and r² are equal. R² is just a more general concept that works when there are multiple predictors. If R² is close to 1, then the regression explains most of the variability in Y, meaning that if we know the values of the predictors for some subject then we can confidently predict the outcome for that subject. If R² is close to 0, then the predictors don't give us much information about the outcome. (R² is always between 0 and 1.)

$$
R^2 = \frac{SS_{\text{regression}}}{SS_Y} \tag{5}
$$

Hypothesis testing: The effect of one predictor. Once we've run a regression to find the regression coefficients for a set of predictors, we can ask how reliable the regression coefficients are. The regression coefficients are statistics, meaning they're computed from a sample (for each subject in our sample, we have measured all the Xi and Y). We can use the regression coefficients as estimates of the corresponding population values, but as with all estimators, they are imperfect. If we gathered a new sample and ran the regression on the new data, we'd obtain somewhat different values for the regression coefficients. Therefore, each bi has a sampling distribution, which represents the probabilities of all possible values we could get for bi if we replicated the experiment. Each bi also has a standard error, which as usual is the standard deviation of the sampling distribution.
The standard error tells us how reliable our estimate is, meaning how far we can expect it to be from the true population value. We won't go into detail about how the standard errors of the regression coefficients are calculated, because it's much easier to use a computer for this. However, it's important to understand what the standard errors can tell us.

First, the standard error can be used to create a confidence interval for each bi, in the same way we create confidence intervals for means. I won't describe the math, but the idea is the same as before: The confidence interval is centered on the actual value of bi obtained from the sample, and the width of the confidence interval is determined by the size of the standard error.

The second thing we can use standard errors for is hypothesis testing. In regression, the most common hypotheses to test regard whether the predictors have reliable influences on the population. This corresponds to asking whether the true values of the regression coefficients are different from zero. For each predictor, Xi, the null hypothesis bi = 0 states that Xi has no reliable effect on Y (the alternative hypothesis is bi ≠ 0). Notice that there's a separate null hypothesis for each predictor, and we can test each one individually. To test whether bi = 0, we calculate a t statistic equal to bi (the actual regression coefficient we obtained from our sample) divided by its standard error.

$$
t = \frac{b_i}{\sigma_{b_i}} \tag{6}
$$
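As a small worked instance of Equation 6 in Python, suppose software reported a coefficient of 7 seconds per inch with a standard error of 2.5. Both numbers are invented for illustration, and the function name `t_statistic` is hypothetical.

```python
def t_statistic(b_i, se_b_i):
    """Equation 6: a regression coefficient divided by its standard error."""
    return b_i / se_b_i


b_i = 7.0      # hypothetical coefficient (seconds per inch of height)
se_b_i = 2.5   # hypothetical standard error of that coefficient

t = t_statistic(b_i, se_b_i)
print(t)  # prints 2.8
```

A coefficient of the same size with a larger standard error would give a smaller t, which is why the standard error matters as much as the coefficient itself.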

Just as with the t statistic for a t-test, t for a regression coefficient tells us how large that coefficient is relative to how large it would be expected to be by chance (i.e., according to the null hypothesis that its true value is zero). If t is large (either positive or negative), then bi is larger than would be expected by chance, so chance is not a good explanation for the result that we got. In this case, we reject the null hypothesis and adopt the alternative hypothesis that bi ≠ 0 (i.e., that Xi has a real effect on Y). If t is close to zero, then bi fits with what we'd expect by chance, so we retain the null hypothesis that bi = 0 (i.e., Xi has no real effect on Y).

The t statistic from a regression is used in the same way as in a t-test. If |t| is greater than tcrit, then we reject the null hypothesis. The alternative (and equivalent) approach is to compute a p-value, which is the probability of a result as or more extreme than t: p(|tdf| > |t|). This is the formula for a two-tailed test, but we can also compute a one-tailed p-value if the direction of the effect (i.e., the sign of bi) was predicted in advance. In either case, we reject the null hypothesis if p < α.

The only remaining information needed to find tcrit or p is the degrees of freedom. As usual, the degrees of freedom for t equals the degrees of freedom for the standard error used to compute t. The standard error for a regression coefficient, σbi, comes from SSresidual, which, as is explained below, has n − m − 1 degrees of freedom. Therefore, a t-test for a regression coefficient uses df = n − m − 1 (you don't need to memorize this).

Hypothesis testing of multiple predictors. Another way to test whether predictors have reliable effects on the outcome is to test whether they explain more variability than would be expected by chance.
This can be done with multiple predictors, by comparing how much variability the regression explains with those predictors to how much it explains with those predictors left out. We'll focus on the most common situation, where we want to test whether the full set of predictors, X1 through Xm, collectively tell us anything meaningful about the outcome, Y.

As explained above, SSregression represents the amount of variability in Y explained by all of the predictors. We want to test whether SSregression is larger than would be expected by chance. Our null hypothesis is that none of the predictors has any effect on the outcome, meaning that the true values of the regression coefficients (except perhaps the intercept, b0) are all zero. Notice that this is the same null hypothesis used above for testing predictors one at a time, except that now we're testing them all at once, to see whether any predictor gives reliable information about the outcome.

Even if all the bi are zero in the population, the regression coefficients we get from our sample will deviate from zero because of sampling variability. This leads the variability explained by the regression, SSregression, to be greater than zero, even though this explained variability is meaningless random error. Therefore, SSregression has a sampling distribution that, as usual, tells us how large it can be expected to be just by chance. Comparing the actual value of SSregression obtained from the sample to the amount expected by chance allows us to test whether chance is a good explanation for the results (H0) or the predictors are explaining something real about the outcome (H1).

To test whether SSregression is larger than would be expected by chance, we first divide it by its degrees of freedom to get the mean square, MSregression.

$$
MS_{\text{regression}} = \frac{SS_{\text{regression}}}{df_{\text{regression}}} \tag{7}
$$

According to the null hypothesis that the regression doesn't explain anything real, MSregression has a (modified) chi-square distribution, multiplied by σ²Y, the variance of Y in the population. If we knew σ²Y then we would know the likelihood function for MSregression exactly. This is the same situation that came up with t-tests, where we knew the likelihood function for M except for not knowing σ². Once again, we divide by an estimate of σ²Y to get our final test statistic, and once again, we estimate σ²Y using the residual mean square. In this case, the residual mean square is the residual sum of squares divided by its degrees of freedom (see Eq. 3).

$$
MS_{\text{residual}} = \frac{SS_{\text{residual}}}{df_{\text{residual}}} \tag{8}
$$

When we divide MSregression by MSresidual, σ²Y cancels out, and we end up with a test statistic that doesn't depend on any population parameters. That is, we have a test statistic with a likelihood function that we know exactly, which is what's required for hypothesis testing. The test statistic is called F.

$$
F = \frac{MS_{\text{regression}}}{MS_{\text{residual}}} \tag{9}
$$

According to the null hypothesis, the F statistic has what's called an F distribution. F distributions arise any time you divide one chi-square variable by another, such as MSregression and MSresidual. Because the distribution of a chi-square variable depends on its degrees of freedom, an F distribution depends on the degrees of freedom of both chi-square variables that it's based on. That is, an F distribution is defined by two degrees of freedom.

So, the last things we need to know are the degrees of freedom for MSregression and MSresidual. The total degrees of freedom in SSY equals n − 1, and these are divided up between SSregression and SSresidual.
SSresidual loses m degrees of freedom for the m regression coefficients (b1 through bm) that go into defining Ŷ, and these degrees of freedom end up in SSregression (essentially because SSregression = SSY − SSresidual). The degrees of freedom for the sums of squares carry over to the mean squares, so MSregression has m degrees of freedom and MSresidual has n − m − 1 degrees of freedom. (You don't need to memorize the degrees of freedom, but you should understand the basic idea of where they come from.)

Once we have F and both dfs, we can compare F to its sampling distribution to get a p-value. The goal here is to decide whether F is bigger than would be expected by chance. If so, then SSregression is bigger than would be expected by chance, which means the regression is explaining something meaningful. Therefore, we want to know the probability of an F value as big as or bigger than the one we got from the data. We always compute a one-tailed p-value (using the upper tail), because an F that is unusually small doesn't tell us anything interesting. The p-value isn't something that can be calculated by hand; instead we use a computer (e.g., the pf() function in R). As always, if the p-value is less than our alpha level (e.g., .05) then we reject the null hypothesis and conclude that the predictors are explaining something meaningful about the outcome.

$$
p = p\left(F_{df_{\text{regression}},\, df_{\text{residual}}} \ge F\right) \tag{10}
$$

There may seem to be a lot to this hypothesis test, but conceptually there are four simple steps. These same steps are used in other statistical tests, such as ANOVA, which we will learn next. Remembering and understanding these steps will not only help you understand hypothesis testing with regression, but it will make understanding ANOVA a lot easier as well.

1. Break the total sum of squares into an explainable sum of squares and an unexplainable sum of squares.
2. Divide each sum of squares by its degrees of freedom to get a mean square.
3. Divide the explainable mean square by the unexplainable mean square to get the test statistic, F.
4. Compare F to an F distribution to get the p-value, or find the critical value and compare F directly to Fcrit.
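The four steps can be traced on a small example. The Python sketch below uses a made-up sample of n = 5 subjects with m = 1 predictor; the coefficients b0 = 2.2 and b1 = 0.6 are assumed to be the least-squares values for that sample, and all variable names are hypothetical. It stops at the F statistic, since the p-value in step 4 requires the F distribution, which plain Python does not provide.

```python
# Made-up sample and its (assumed) least-squares coefficients
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

n = len(ys)  # number of subjects
m = 1        # number of predictors
m_y = sum(ys) / n
y_hat = [b0 + b1 * x for x in xs]

# Step 1: split the total sum of squares into explained and unexplained parts
ss_y = sum((y - m_y) ** 2 for y in ys)
ss_residual = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
ss_regression = ss_y - ss_residual

# Step 2: divide each sum of squares by its degrees of freedom
df_regression = m
df_residual = n - m - 1
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# Step 3: the ratio of the two mean squares is the test statistic
f_stat = ms_regression / ms_residual

# Step 4 (not computed here): compare f_stat to an F distribution
# with df_regression and df_residual degrees of freedom
print(df_regression, df_residual, round(f_stat, 3))  # prints 1 3 4.5
```

In R, the upper-tail p-value for step 4 would come from the pf() function the notes mention, for example pf(4.5, 1, 3, lower.tail = FALSE).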