Chapter 5. Regression Models

Regression analysis is probably the most used tool in statistics. Regression deals with modeling how one variable (called a response) is related to one or more other variables (called predictors or regressors). Before introducing regression models involving two or more variables, we first return to the very simple model introduced in Chapter 1 to set up the basic ideas and notation.

1 A Simple Model

Consider once again the fill-weights in the cup-a-soup example. For the sake of illustration, consider the first 10 observations from the data set. Note that although the filling machine is set to fill each cup to a specified weight, the actual weights vary from cup to cup. Let y_1, y_2, ..., y_n denote the fill-weights for our sample (so that n = 10). The model we introduced in Chapter 1 that incorporates this variability is

    y_i = μ + ε_i    (1)

where ε_i is a random error representing the deviation of the ith fill-weight from the average fill-weight of all cups (μ). Equation (1) is a very simple example of a statistical model. It involves a random component (ε_i) and a deterministic component (μ). The population mean μ is a parameter of the model, and the other parameter in (1) is the variance of the random error ε, which we shall denote by σ² ("sigma-squared").

Let us now consider the problem of estimating the population mean μ in (1). The technique we will use for (1) is called least-squares, and it is easy to generalize to more complicated regression models. A natural and intuitive way of estimating the true value of the population mean μ is to simply take the average of the measurements:

    ȳ = (1/n) Σ_{i=1}^n y_i.

Why should we use ȳ to estimate μ? There are many reasons why ȳ is a good estimator of μ, but the reason we shall focus on is that ȳ is the best estimator of μ in terms of having the smallest mean squared error. That is, given the 10 measurements above, we can ask: which value of μ makes the sum of squared deviations

    Σ_{i=1}^n (y_i − μ)²    (2)

the smallest? That is, what is the least-squares estimator of μ? The answer to this question can be found by doing some simple calculus. Consider the following function of μ:

    f(μ) = Σ_{i=1}^n (y_i − μ)².

From calculus, we know that to find the extrema of a function, we can take the derivative of the function, set it equal to zero, and solve for the argument of the function. Thus,

    (d/dμ) f(μ) = −2 Σ_{i=1}^n (y_i − μ) = 0.

Using a little algebra, we can solve this equation for μ to get

    μ̂ = ȳ.

(One can check that the 2nd derivative of this function is positive, so that setting the first derivative to zero determines a value of μ that minimizes the sum of squares.) The hat notation (i.e. μ̂) is used to denote an estimator of a parameter. This is a standard notational practice in statistics. Thus, we use μ̂ = ȳ to estimate the unknown population mean μ. Note that μ̂ is not the true value of μ but simply an estimator based on 10 data points.

Now we shall re-do the computation using matrix notation. This will seem unnecessarily complicated, but once we have a solution worked out, we can re-apply it to many other much more complicated models very easily. Data usually comes to us in the form of arrays of numbers, typically in computer files. Therefore, a natural and easy way to handle data (particularly large sets of data) is to use the power of matrix computations. Take the fill-weight measurements y_1, y_2, ..., y_n and stack them into a vector, denoted by a boldfaced y:

    y = (y_1, y_2, ..., y_n)′.

Now let X denote a column vector of ones and let ε denote the error terms ε_i stacked into a vector:

    X = (1, 1, ..., 1)′   and   ε = (ε_1, ε_2, ..., ε_n)′.

Then we can re-write our very simple model (1) in matrix/vector form as:

    (y_1, y_2, ..., y_n)′ = (1, 1, ..., 1)′ μ + (ε_1, ε_2, ..., ε_n)′.

More compactly, we can write:

    y = Xμ + ε.    (3)

The sum of squares in equation (2) can be written

    (y − Xμ)′(y − Xμ).

Multiplying this out, we find the sum of squares to be

    y′y − 2X′y μ + μ² X′X.

Taking the derivative of this with respect to μ and setting the derivative equal to zero gives

    −2X′y + 2μ X′X = 0.

Solving for μ gives

    μ̂ = (X′X)⁻¹X′y.    (4)

The solution given by equation (4) is the least-squares solution, and this formula holds for a wide variety of models as we shall see.

2 The Simple Linear Regression Model

Now we will define a slightly more complicated statistical model that turns out to be extremely useful in practice. The model is a simple extension of our first model y_i = μ + ε_i and, using the matrix notation, all we have to do is add another column to the vector X and change it into a matrix with two columns. To illustrate ideas, consider the data in the following table, which was collected in Consumer Reports and reported on in Henderson and Velleman (1981). The table gives the make (column 1), the miles per gallon (MPG) (column 2), and the weight (column 3), in thousands of pounds, of n = 6 Japanese cars.

    Make            MPG    Weight
    Toyota Corona   27.5   2.560
    Datsun          27.2   2.300
    Mazda GLC       34.1   1.975
    Dodge Colt      35.1   1.915
    Datsun          31.8   2.020
    Datsun          22.0   2.815

Figure 1: Scatterplot of miles per gallon versus weight of n = 6 Japanese cars.

It seems reasonable that the miles per gallon of a car is related to the weight of the car. Our goal is to model the relationship between these two variables. A scatterplot of the data is shown in Figure 1. As can be seen from the figure, there appears to be a linear relationship between the MPG (y) and the weight of the car (x). Heavier cars tend to have lower gas mileage. A deterministic model for these data is given by

    y_i = β_0 + β_1 x_i

where y_i is the MPG for the ith car and x_i is the corresponding weight of the car. The two parameters are β_0, which is the y-intercept, and β_1, which is the slope of the line. However, this model is inadequate because it forces all the points to lie exactly on a line. From Figure 1, we clearly see that the points do follow a linear pattern, but the points do not all fall exactly on a line. Thus, a better model will include a random component for the error, which allows the points to scatter about the line. The following model is called a simple linear regression model:

    y_i = β_0 + β_1 x_i + ε_i    (5)

for i = 1, 2, ..., n. The random variable y_i is called the response (it is sometimes also called the dependent variable). The x_i is called the ith value of the regressor variable (sometimes known as the independent or predictor variable). The random error ε_i is assumed to have a mean of 0 and variance σ². We typically assume the ε_i's are independent of each other. The slope β_1 and intercept β_0 are the two parameters of primary importance, and the question arises as to how they should be estimated. The least squares solution

Figure 2: The least-squares regression line is determined by minimizing the sum of squared vertical differences between the observed MPG's and the corresponding point on the line.

is found by determining the values of β_0 and β_1 that minimize the sum of squared errors:

    Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².

Graphically, this corresponds to finding the line minimizing the sum of squared vertical differences between the observed MPG's and the corresponding values on the line, as shown in Figure 2. Returning to our matrix and vector notation, we can write

    y = (y_1, y_2, ..., y_n)′   and   X = [1 x_1; 1 x_2; ...; 1 x_n],

where the first column of X is a column of ones and the second column contains the car weights. Let β = (β_0, β_1)′ and ε = (ε_1, ε_2, ..., ε_n)′. Then we can rewrite (5) in matrix form as

    y = Xβ + ε.    (6)

In order to find the least-squares estimators of β, we need to find the values of β_0 and β_1 that minimize

    (y − Xβ)′(y − Xβ) = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)²

or, since ε_i = y_i − β_0 − β_1 x_i, we need to find the β_0 and β_1 that minimize ε′ε. Matrix differentiation can be used to solve this problem, but instead we will use a geometric argument. First, some additional notation. Let β̂_0 and β̂_1 denote the least squares estimators of β_0 and β_1. Then, given a value of the predictor x_i, we can compute the predicted value of y given x_i as

    ŷ_i = β̂_0 + β̂_1 x_i.

The residual r_i is defined to be the difference between the response y_i and the predicted value ŷ_i:

    r_i = y_i − ŷ_i.

Let r = (r_1, r_2, ..., r_n)′ and ŷ = (ŷ_1, ŷ_2, ..., ŷ_n)′. Note that ŷ = Xβ̂ where β̂ = (β̂_0, β̂_1)′. The least squares estimators β̂_0 and β̂_1 are chosen to make r′r as small as possible. Geometrically, ŷ is the projection of y onto the plane spanned by the columns of the matrix X. This is illustrated in Figure 3. To make r′r as small as possible, r should be orthogonal to the plane spanned by the columns of X. Algebraically, this means that X′r = 0. Writing this out, we get

    X′r = X′(y − ŷ) = X′(y − Xβ̂) = 0.

Thus, β̂ should satisfy

    X′y = X′X β̂.    (7)

This equation is known as the normal equation. Assuming X′X is an invertible matrix, we can multiply both sides of (7) on the left by (X′X)⁻¹ to get the least-squares solution:

    β̂ = (X′X)⁻¹X′y.    (8)

This is one of the most important equations of this course. This formula provides the least-squares solution for a wide variety of models. Note that we have already seen this solution in (4).

Example. Returning to the MPG example for Japanese cars, we now illustrate the computation of the least-squares estimators of the slope β_1 and y-intercept β_0. From the data, we can compute the matrix X′X and then its inverse (X′X)⁻¹.

Figure 3: The geometry of least-squares. The vector y is projected onto the space spanned by the columns of the design matrix X, denoted by X1 and X2 in the figure. The projected value is the vector of fitted values ŷ (denoted by yhat in the figure). The difference between y and ŷ is the vector of residuals r.

Also, we can compute X′y. So, the least squares estimators of the intercept and slope are:

    β̂ = (X′X)⁻¹X′y ≈ (59.42, −13.16)′.

From this computation, we find that the least squares estimate of the y-intercept is β̂_0 ≈ 59.42, the estimated slope is β̂_1 ≈ −13.16, and the prediction equation is given by

    ŷ = 59.42 − 13.16 x.

Note that the estimated y-intercept β̂_0 does not have any meaningful interpretation in this example. The y-intercept corresponds to the average y value when x = 0, i.e. the MPG for a car that weighs zero pounds. It makes no sense to estimate the mileage of a car with a weight of zero. Typically in regression examples the intercept will not be meaningful unless data is collected for values of x near zero. Since there is no such thing as a car weighing zero pounds, the intercept has no meaningful interpretation in this example.
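Before turning to the interpretation of the slope, a small numerical check may help make the geometry concrete. The following Matlab sketch (an illustration, using the same six observations that appear in the Matlab listing at the end of this section) solves the normal equations (7) and verifies that the residual vector is orthogonal to the columns of X, i.e. X′r = 0:

% Numerical check of the normal equation (7) for the car mileage data.
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];        % response y
wt  = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];  % regressor x (weight)
X = [ones(6,1) wt];        % design matrix: a column of ones and the weights
bhat = (X'*X) \ (X'*mpg);  % solve the normal equations X'X * bhat = X'y
yhat = X*bhat;             % fitted values
r = mpg - yhat;            % residuals
disp(X'*r)                 % numerically zero: r is orthogonal to the columns of X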

The slope β_1 is generally the parameter of primary interest in a simple linear regression. The slope represents the average change in the response for a unit change in the regressor. In the car example, the estimated slope of β̂_1 ≈ −13.16 indicates that for each additional thousand pounds of weight of a car we would expect to see a reduction of about 13 miles per gallon, on average.

Multiplying out the matrices in (8), we get the following formulas for the least squares estimates in simple linear regression:

    β̂_0 = ȳ − β̂_1 x̄
    β̂_1 = SS_xy / SS_xx

where

    SS_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

and

    SS_xx = Σ_{i=1}^n (x_i − x̄)².

That is, the estimator of the slope is the covariance between the x's and y's divided by the variance of the x's. In multiple regression, when there is more than one regressor variable, the formulas for the least squares estimators become extremely complicated unless you stick with the matrix notation. The matrix notation also allows us to compute quite easily the standard errors of the least squares estimators, as well as the covariance between the estimators.

First, let us show that the least squares estimators are unbiased for the corresponding model parameters. Before doing so, note that in a designed experiment the values of the regressor are typically fixed by the experimenter and therefore are not considered random. On the other hand, because y_i = β_0 + β_1 x_i + ε_i and ε_i is a random variable, y_i is also a random variable. Computing, we get

    E[β̂] = E[(X′X)⁻¹X′y]
         = (X′X)⁻¹X′E[y]
         = (X′X)⁻¹X′E[Xβ + ε]
         = (X′X)⁻¹X′(Xβ + E[ε])
         = (X′X)⁻¹X′Xβ + 0
         = β

since E[ε] = 0. Therefore, the least squares estimators β̂ are unbiased for the population parameters β.

Many statistical software packages have built-in functions that will perform regression analysis. We can also use software to do the matrix calculations directly. Below is Matlab code that produces some of the output generated above for the car mileage example:

% Motor Trend car data
% Illustration of simple linear regression.
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];
% Car's weight
wt = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];
% Compute means and standard deviations:
mean(wt)
std(wt)
mean(mpg)
std(mpg)
n = length(mpg);          % n = sample size
X = [ones(6,1) wt];       % Compute the design matrix X
bhat = inv(X'*X)*X'*mpg;  % bhat = estimated regression coefficients
yhat = X*bhat;            % yhat = fitted values
r = mpg - yhat;           % r = residuals
plot(wt, mpg, 'o', wt, yhat)
title('Motor Trend Car Data')
axis([1.5, 3, 20, 40])
ylabel('Miles per Gallon (mpg)')
xlabel('Weight of the Car')
% Make a plot of residuals versus fitted values:
plot(yhat, r, 'o', linspace(20,40,n), zeros(n,1))
xlabel('Fitted Values')
ylabel('Residuals')
title('Residual Plot')
% Here's a built-in Matlab function that will fit a
% polynomial to the data -- the last number indicates the degree of the polynomial.
polyfit(wt, mpg, 1)

3 Covariance Matrices for Least-Squares Estimators

Now β̂ is a random vector (since it is a function of the random y_i's). We have shown that it is unbiased for β_0 and β_1. In order to determine how stable the parameter estimates are, we need an estimate of the variability of the estimators. This can be obtained by determining the covariance matrix of β̂ as follows:

    Cov(β̂) = E[(β̂ − β)(β̂ − β)′]
            = E[((X′X)⁻¹X′y − (X′X)⁻¹X′E[y])((X′X)⁻¹X′y − (X′X)⁻¹X′E[y])′]
            = E[((X′X)⁻¹X′ε)((X′X)⁻¹X′ε)′]
            = (X′X)⁻¹X′E[εε′]X(X′X)⁻¹
            = (X′X)⁻¹X′(σ²I)X(X′X)⁻¹    (where I is the identity matrix)
            = σ²(X′X)⁻¹.

The main point of this derivation is that the covariance matrix of the least-squares estimators is

    σ²(X′X)⁻¹    (9)

where σ² is the variance of the error term ε in the simple linear regression model. Formula (9) holds for a wide variety of regression models, including polynomial regression, analysis of variance, and analysis of covariance. The only assumption needed for (9) to hold is that the errors are uncorrelated and all have the same variance.

Formula (9) indicates that we need an estimate for the last remaining parameter of the simple linear regression model (5), and that is the error variance σ². Since ε_i = y_i − β_0 − β_1 x_i and the ith residual is r_i = y_i − β̂_0 − β̂_1 x_i, a natural estimate of the error variance is

    σ̂² = MS_res = SS_res/(n − 2)

where

    SS_res = Σ_{i=1}^n r_i²

is the Sum of Squares for the Residuals and MS_res stands for the Mean Squared Residual (or mean squared error (MSE)). We divide by n − 2 in the mean squared residual so as to make it an unbiased estimator of σ²: E[MS_res] = σ². We lose two degrees of freedom for estimating the slope β_1 and the intercept β_0. Therefore, the degrees of freedom associated with the mean squared residual is n − 2.

Returning to the car example, we can compute the vector of fitted values ŷ and the vector of residuals r = y − ŷ (note that the residuals sum to zero, analogously with E[ε_i] = 0). Computing, we get MS_res and the estimated covariance matrix for β̂,

    σ̂²(X′X)⁻¹.

The numbers on the diagonal of the covariance matrix give the estimated variances of β̂_0 and β̂_1. Therefore, the slope of the regression line is estimated to be β̂_1 ≈ −13.16 with estimated variance σ̂²_{β̂_1} given by the second diagonal element. Taking the square root of this variance gives the estimated standard error of the slope σ̂_{β̂_1}, which will be used for making inferential statements about the slope. Note that the estimated covariance between the estimated intercept and the estimated slope is negative. Does it seem intuitive that the estimated slope and intercept will be negatively correlated when the regressor values (the x_i's) are all positive?
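A brief Matlab sketch of these variance calculations, along the same lines as the listing in Section 2 and reusing the same six observations, is given below; the square roots of the diagonal entries of the estimated covariance matrix are the standard errors used for the tests in the next section:

% Estimated error variance and covariance matrix of bhat for the car data.
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];
wt  = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];
n = length(mpg);
X = [ones(n,1) wt];
bhat = (X'*X) \ (X'*mpg);
r = mpg - X*bhat;             % residuals
SSres = r'*r;                 % residual sum of squares
MSres = SSres/(n-2);          % estimate of sigma^2 with n-2 degrees of freedom
Covbhat = MSres * inv(X'*X);  % estimated covariance matrix of bhat, formula (9)
se = sqrt(diag(Covbhat));     % standard errors of the intercept and slope
disp([bhat se])               % estimates alongside their standard errors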

4 Hypothesis Tests for Regression Coefficients

Regression models are used in a wide variety of applications. Interest often lies in testing whether the slope parameter β_1 takes a particular value, say β_1 = β_10. We can test hypotheses of the form:

    H_0: β_1 = β_10  versus  H_a: β_1 > β_10, or H_a: β_1 < β_10, or H_a: β_1 ≠ β_10.

A suitable test statistic for these tests is the standardized difference between the estimated slope and the hypothesized slope:

    t = (β̂_1 − β_10)/σ̂_{β̂_1},

and we reject H_0 when this standardized difference is large (away from the null hypothesis). Assuming the error terms ε_i are independent with a normal distribution, this test statistic has a t-distribution on n − 2 degrees of freedom when the null hypothesis is true. If we are performing a test using a significance level α, then we would reject H_0 at significance level α if

    t > t_α    when H_a: β_1 > β_10,
    t < −t_α   when H_a: β_1 < β_10,
    t > t_{α/2} or t < −t_{α/2}    when H_a: β_1 ≠ β_10.

A common hypothesis of interest is whether the slope differs significantly from zero. If the slope β_1 is zero, then the response does not depend on the regressor. The test statistic in this case reduces to t = β̂_1/σ̂_{β̂_1}.

Car Example continued... We can test if the mileage of a car is related (linearly) to the weight of the car. In other words, we want to test H_0: β_1 = 0 versus H_a: β_1 ≠ 0. Let us test this hypothesis using significance level α = 0.05. Since there are n = 6 observations, we will reject H_0 if the test statistic is larger in absolute value than t_{α/2} = t_{.05/2} = t_{.025} = 2.776, which can be found in the t-table under n − 2 = 6 − 2 = 4 degrees of freedom. Recall that β̂_1 ≈ −13.16 with estimated standard error σ̂_{β̂_1} obtained from the covariance matrix in the previous section. Computing the test statistic t = β̂_1/σ̂_{β̂_1}, we find that |t| > t_{α/2} = 2.776, so we reject H_0 and conclude that the slope differs from zero using a significance level α = 0.05. In other words, the MPG of a car depends on the weight of the car. We can also compute a p-value for this test as

    p-value = 2P(T > |t|)    (2-tailed p-value)

where T represents a t random variable on n − 2 degrees of freedom and t represents the observed value of the test statistic. The factor 2 is needed because this is a two-sided

test: we reject H_0 for large values of β̂_1 in either the positive or negative direction. The computed p-value in this example (using degrees of freedom equal to 4) is 2P(T > |t|) = 2(0.0001) = 0.0002. Thus, we have very strong evidence that the slope differs from zero.

Hypothesis tests can be performed for the intercept β_0 as well, but this is not as common. The test statistic for testing H_0: β_0 = β_00 is

    t = (β̂_0 − β_00)/σ̂_{β̂_0},

which follows a t-distribution on n − 2 degrees of freedom when the null hypothesis is true.

5 Confidence Intervals for Regression Coefficients

We can also form confidence intervals for regression coefficients. The next example illustrates such an application.

Example (data compliments of Brian Jones). Experiments were conducted at Wright State University to measure the stiffness of external fixators. An external fixator is designed to hold a broken bone in place so it can heal. The stiffness is an important characteristic of the fixator since it indicates how well the fixator protects the broken bone. In the experiment, the vertical force (in Newtons) on the fixator is measured along with the amount the fixator extends (in millimeters). The stiffness is defined to be the force per millimeter of extension. A natural way to estimate the stiffness of the fixator is to use the slope from an estimated simple linear regression model. The data from the experiment are given in the following table:

    [Table: extension (in mm) and force (in Newtons) measurements from the fixator experiment.]

Figure 4 shows a scatterplot of the raw data. The relation appears to be linear. Figure 5 shows the raw data again in the left panel along with the fitted regression

Figure 4: Scatterplot of force (in Newtons) versus extension (in mm) for an external fixator used to hold a broken bone in place.

line ŷ = β̂_0 + β̂_1 x. The points in the plot are tightly clustered about the regression line, indicating that almost all the variability in y is accounted for by the regression relation (see the discussion of R² below). A residual plot is shown in the right panel of Figure 5. The residuals should not exhibit any structure, and a plot of residuals is useful for assessing whether the specified model is adequate for the data.

The slope is estimated to be β̂_1 ≈ 64.531 and the estimated standard error of the slope is found to be σ̂_{β̂_1} = 0.465. A (1 − α)100% confidence interval for the slope is given by

    Confidence Interval for the Slope:  β̂_1 ± t_{α/2} σ̂_{β̂_1},

where the degrees of freedom for the t-critical value is given by n − 2. The estimated standard error of the slope can be found as before by taking the square root of the second diagonal element of the covariance matrix σ̂²(X′X)⁻¹.

For the fixator experiment, let us compute a 95% confidence interval for the stiffness (β_1). The sample size is n = 11 and the critical value is t_{α/2} = t_{.025} = 2.262 for n − 2 = 11 − 2 = 9 degrees of freedom. The 95% confidence interval for the stiffness is

    β̂_1 ± t_{α/2} σ̂_{β̂_1} = 64.531 ± 2.262(0.465) = 64.531 ± 1.052,

which gives an interval of [63.479, 65.583]. With 95% confidence we estimate that the stiffness of the external fixator lies between 63.479 and 65.583 Newtons/mm.

Problems

1. Box, Hunter, & Hunter (1978) report on an experiment looking at how y, the dispersion of an aerosol (measured as the reciprocal of the number of particles

Figure 5: The left panel shows the scatterplot of the fixator data along with the least-squares regression line. The right panel shows a plot of the residuals versus the fitted values ŷ_i to evaluate the fit of the model.

per unit volume), depends on x, the age of the aerosol (in minutes). The data are given in the following table:

    [Table: dispersion (y) and age in minutes (x) for the aerosol experiment.]

Fit a simple linear regression model to these data by performing the following steps:

a) Write out the design matrix X for this data and the vector y of responses.
b) Compute X′X.
c) Compute (X′X)⁻¹.
d) Compute the least squares estimates of the y-intercept and slope, β̂ = (X′X)⁻¹X′y.
e) Plot the data along with the fitted regression line.
f) Compute the mean squared error from the least-squares regression line: σ̂² = MSE = (y − ŷ)′(y − ŷ)/(n − 2).
g) Compute the estimated covariance matrix for the estimated regression coefficients: σ̂²(X′X)⁻¹.
h) Does the age of the aerosol affect the dispersion of the aerosol? Perform a hypothesis test using significance level α = 0.05 to answer this question. Set up the null and alternative hypotheses in terms of the parameter of interest, determine the critical region, compute the test statistic, and state your decision. In plain English, write out the conclusion of the test.
i) Find a 95% confidence interval for the slope of the regression line.

2. Consider the crystal growth data in the notes. In this example, x = time the crystal grew and y = weight of the crystal (in grams). It seems reasonable that at time zero the crystal would weigh zero grams, since it has not started growing yet. In fact, the estimated regression line has a y-intercept near zero. Find the least squares estimator of β_1 in the no-intercept model

    y_i = β_1 x_i + ε_i

in two different ways:

a) Find the value of β_1 that minimizes Σ_{i=1}^n (y_i − β_1 x_i)². Note: Solve this algebraically without using the data from the actual experiment.

b) Write out the design matrix for the no-intercept model and compute b_1 = (X′X)⁻¹X′y. Does this give the same solution as part (a)?

6 Estimating a Mean Response and Predicting a New Response

Regression models are often used to predict a new response or to estimate a mean response for a given value of the predictor x. We have seen how to compute a predicted value ŷ as ŷ = β̂_0 + β̂_1 x. However, as with parameter estimates, we need a measure of reliability associated with ŷ. In order to illustrate the ideas, we consider a new example.

Example. An experiment was conducted to study how the weight (in grams) of a crystal varies according to how long (in hours) the crystal grows (Graybill and Iyer, 1994). The data are given in the following table:

    [Table: weight (in grams) and growth time (in hours) for n = 14 crystals.]

Clearly, as the crystal grows the weight increases. We can use the slope of the estimated least squares regression line as an estimate of the linear growth rate. A direct computation gives X′X and the least squares estimates β̂ = (X′X)⁻¹X′y. The raw data along with the fitted regression line are shown in Figure 6. From the estimated slope, we can state that the crystals grow at an estimated rate of β̂_1 grams per hour.
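Before discussing the uncertainty attached to ŷ, here is a minimal Matlab sketch of the point prediction itself. The hours and weight values below are hypothetical stand-ins chosen only for illustration (the actual crystal data appear in Graybill and Iyer, 1994); the sketch simply fits the line and evaluates ŷ = β̂_0 + β̂_1 x at a chosen growth time x_0:

% Point estimate yhat at a new x0 from a fitted simple linear regression.
% Hypothetical hours/weight values, not the actual crystal growth data.
hours  = [2; 4; 6; 8; 10; 12; 14; 16; 18; 20; 22; 24; 26; 28];
weight = [0.9; 2.1; 2.8; 4.2; 4.9; 6.1; 7.2; 7.9; 9.3; 10.1; 11.0; 12.2; 12.9; 14.1];
X = [ones(length(hours),1) hours];
bhat = (X'*X) \ (X'*weight);  % least squares estimates (8)
x0 = 15;                      % growth time (in hours) of interest
yhat0 = [1 x0]*bhat           % estimated mean weight / predicted weight at x0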

Figure 6: Crystal growth data with the estimated regression line.

We now turn to the question of using the estimated regression model to estimate the mean response at a given value of x, or to predict a new value of y for a given value of x. Note that estimating a mean response and predicting a new response are different goals. Suppose we want to estimate the mean weight of a crystal that has grown for x = 15 hours. The question is: what is the average weight of all crystals that have grown for x = 15 hours? Note this is a hypothetical population. If we were to set up a production process where we grow crystals for 15 hours, what would be the average weight of the resulting crystals? In order to estimate the mean response at x = 15 hours, we use ŷ = β̂_0 + β̂_1 x, plugging in x = 15. On the other hand, if we want to predict the weight of a single crystal that has grown for x = 15 hours, we would also use ŷ = β̂_0 + β̂_1 x with x = 15, just as we did for estimating a mean response. Note that although estimating a mean response and predicting a new response are two different goals, we use ŷ in each case.

The difference statistically between estimating a mean response and predicting a new response lies in the uncertainty associated with each. A confidence interval for a mean response will be narrower than a prediction interval for a new response. The reason why is that a mean response for a given x value is a fixed quantity: it is an expected value of the response for a given x value, known as a conditional mean. A 95% prediction interval for a new response must be wide enough to contain 95% of the future responses at a given x value. The confidence interval for a mean response only needs to contain the mean of all responses for a given x with 95% confidence. The following two formulas give the confidence interval for a mean response and a prediction interval for a new response at a given value x_0 of the predictor:

    ŷ ± t_{α/2} √( MS_res (1, x_0)(X′X)⁻¹(1, x_0)′ )    Confidence Interval for Mean Response    (10)

and

    ŷ ± t_{α/2} √( MS_res (1 + (1, x_0)(X′X)⁻¹(1, x_0)′) )    Prediction Interval for New Response    (11)

where the t-critical value t_{α/2} is based on n − 2 degrees of freedom. Note that in both formulas, (1, x_0)(X′X)⁻¹(1, x_0)′ corresponds to a 1×2 vector (1, x_0) times a 2×2 matrix (X′X)⁻¹ times the 2×1 transpose of (1, x_0). The prediction interval is wider than the confidence interval due to the added 1 underneath the radical in the prediction interval. Formulas (10) and (11) generalize easily to the multiple regression setting when there is more than one predictor variable. The confidence interval for the mean response can be rewritten, after multiplying out the terms, as

    ŷ ± t_{α/2} √( MS_res (1/n + (x_0 − x̄)²/SS_xx) ).

From this formula, one can see that the confidence interval for a mean response (and also the prediction interval) is narrowest when x_0 = x̄. Figure 7 shows both the confidence intervals for mean responses and the prediction intervals for new responses at each x value. The lower and upper ends of these intervals, plotted for all x values, form the upper and lower bands shown in Figure 7. The solid curves correspond to the confidence band, which is narrower than the prediction band plotted by the dashed curves. Both bands are narrowest at the point (x̄, ȳ) (the least squares regression line always passes through the point (x̄, ȳ)). Note that in this example, all of the actual weight measurements (the y_i's) lie inside the 95% prediction bands, as seen in Figure 7.

A note of caution is in order when using regression models for prediction. Using an estimated model to extrapolate outside the range where data was collected to fit the model is very dangerous. Often a straight line is a reasonable model relating a response y to a predictor (or regressor) x over a short interval of x values. However, over a broader range of x values the response may be markedly nonlinear, and the straight line fit over the small interval, when extrapolated over a larger interval, can give very poor or even downright nonsensical predictions. It is not unusual, for instance, that as the regressor variable gets larger (or smaller), the response levels off and approaches an asymptote. One such example is illustrated in Figure 8, which shows a scatterplot of the winning times in the Boston Marathon for men (open circles) and women (solid circles) each year. Also plotted are the least squares regression lines fitted to the data for men and women. If we were to extrapolate into the future using the straight line fits, then we would eventually predict that the fastest female runner would beat the fastest male runner. Not only that, the predicted times in the future for both men and women would eventually become negative, which is clearly impossible. It may be that the female champion will eventually beat the male champion at some point in the future, but we cannot use these models to predict this because these models were fit using data from the past. We do not know for sure what sort of model is applicable

Figure 7: Crystal growth data with the estimated regression line, along with the 95% confidence band for estimated mean responses (solid curves) and the 95% prediction band for predicted responses (dashed curves).

for future winning times. In fact, the straight line models plotted in Figure 8 are not even valid for the data shown. For instance, the data for the women show a rapid improvement in winning times over the first several years women were allowed to run the race, but then the winning times flatten out, indicating that a threshold is being reached for the fastest possible time in which the race can be run. This horizontal asymptote effect is evident for both males and females.

Problems

3. A calibration experiment with nuclear tanks was performed in an attempt to determine the volume of fluid in the tank based on the reading from a pressure gauge. The following data were derived from such an experiment, where y is the volume and x is the pressure:

    [Table: volume (y) and pressure (x) readings from the calibration experiment.]

a) Write out a simple linear regression model for this experiment.
b) Write down the design matrix X and the vector of responses y.
c) Find the least-squares estimates of the y-intercept and slope of the regression line. Plot the data and draw the estimated regression line in the plot.

Figure 8: Winning times (in seconds) in the Boston Marathon versus year for men (open circles) and women (solid circles). Also plotted are the least-squares regression lines for the men and women champions.

d) Find the estimated covariance matrix of the least-squares estimates.
e) Test if the slope of the regression line differs from zero using α = 0.05.
f) Find a 95% confidence interval for the slope of the regression line.
g) Estimate the mean volume for a pressure reading of x = 50 using a 95% confidence interval.
h) Predict the volume in the tank from a pressure reading of x = 50 using a 95% prediction interval.

7 Coefficient of Determination R²

The quantity SS_res is a measure of the variability in the response y after factoring out the dependence on the regressor x. A measure of total variability in the response, measured without regard to x, is

    SS_yy = Σ_{i=1}^n (y_i − ȳ)².

A useful statistic for measuring the proportion of variability in the y's accounted for by the regressor x is the coefficient of determination R², sometimes known simply as the R-squared:

    R² = 1 − SS_res/SS_yy.    (12)

In the car mileage example, SS_yy = 123.27 and SS_res = 9.38, so

    R² = 1 − 9.38/123.27 = 0.924.
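To tie formulas (10), (11), and (12) together, here is a short Matlab sketch that computes R² for the car mileage fit and then the 95% confidence and prediction intervals at an arbitrarily chosen weight x_0 = 2.5 (i.e. 2500 lbs); the t critical value assumes the Statistics Toolbox function tinv is available:

% R-squared and 95% interval estimates at x0 for the car mileage data (a sketch).
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];
wt  = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];
n = length(mpg);
X = [ones(n,1) wt];
bhat = (X'*X) \ (X'*mpg);
r = mpg - X*bhat;
SSres = r'*r;
SSyy = sum((mpg - mean(mpg)).^2);
R2 = 1 - SSres/SSyy             % coefficient of determination, formula (12)
MSres = SSres/(n-2);
x0 = 2.5;                       % weight (in 1000s of lbs) at which to estimate
v = [1 x0];
yhat0 = v*bhat;                 % point estimate of the mean response at x0
tcrit = tinv(0.975, n-2);       % t critical value on n-2 degrees of freedom
h = v*inv(X'*X)*v';             % the quadratic form appearing in (10) and (11)
ci   = yhat0 + [-1 1]*tcrit*sqrt(MSres*h)      % CI for the mean response, (10)
pred = yhat0 + [-1 1]*tcrit*sqrt(MSres*(1+h))  % prediction interval, (11)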

In the fixator example, the points are more tightly clustered about the regression line, and the corresponding coefficient of determination is higher than for the car mileage example (compare the plots in Figure 1 with Figure 4). By definition, R² is always between zero and one: 0 ≤ R² ≤ 1. If R² is close to one, then most of the variability in y is explained by the regression model. R² is often reported when summarizing a regression model. R² can also be computed in multiple regression (when there is more than one regressor variable) using the same formula above.

Many times a high R² is considered an indication that one has a good model, since most of the variability in the response is explained by the regressor variables. In fact, some experimenters use R² to compare various models. However, this can be problematic. R² always increases (or at least does not decrease) when you add regressors to a model. Thus, choosing a model based on the largest R² can lead to models with too many regressors. Another note of caution regarding R² is that a large value of R² does not necessarily mean that the fitted model is correct. It is not unusual to obtain a large R² when there is a fairly strong non-linear trend in the data. In simple linear regression, the coefficient of determination R² turns out to be the square of the sample correlation r.

8 Residual Analysis

The regression models considered so far are simple linear regression models, where it is assumed that the mean response y is a linear function of the regressor x. This is a very simple model and appears to work quite well in many examples. Even if the actual relation of y to x is non-linear, fitting a straight line model may provide a good approximation if we restrict the range of x to a small interval. In practice, one should not assume that a simple linear model will be sufficient for fitting data (except in special cases where there is a theoretical justification for a straight line model). Part of the problem in regression analysis is to determine an appropriate model relating the response y to the predictor x.

Recall that the simple linear regression model is y_i = β_0 + β_1 x_i + ε_i, where ε_i is a mean-zero random error. After fitting the model, the residuals r_i = y_i − ŷ_i mimic the random error. A useful diagnostic to assess how well a model fits the data is to plot the residuals versus the fitted values (ŷ_i). Such plots should show no structure. If there is evidence of structure in the residual plot, then it is likely that the fitted regression model is not the correct model. In such cases, a more complicated model may need to be fitted to the data, such as a polynomial model (see below) or a nonlinear regression model (not covered here). It is customary to plot the residuals versus the fitted values instead of residuals versus the actual y_i values. The reason is that the residuals are uncorrelated with the fitted

Figure 9: Left panel: Scatterplot of the full fixator data set and fitted regression line. Right panel: The corresponding residual plot.

values. Recall from the geometric derivation of the least squares estimators that the vector of residuals is orthogonal to the vector of fitted values (see Figure 3).

A word of caution is needed here. Humans are very adept at picking out patterns. Sometimes a scatterplot of randomly generated variates (i.e. noise) will show what appears to be a pattern. However, if the plot was generated by just random noise, then the patterns are superficial. The same problem can occur when examining a residual plot. One must be careful about finding structure in a residual plot when there really is no structure. Analyzing residual plots is an art that improves with lots of practice.

Example (Fixator example continued). When the external fixator example was introduced earlier, only a subset of the full data set was used to estimate the stiffness of the fixator. Figure 9 shows (in the left panel) a scatterplot of the full data set for values of force (x) near zero, when the machine was first turned on. Also plotted is the least squares regression line. From this picture, it appears as if a straight line model would fit the data well. However, the right panel shows the corresponding residual plot, which reveals a fairly strong structure indicating that a straight line does not fit the full data set well.

Example. Fuel efficiency data was obtained on 32 automobiles from Motor Trend US Magazine. The response of interest is the miles per gallon (mpg) of the automobiles. Figure 10 shows a scatterplot of mpg versus horsepower. Figure 10 shows that increasing horsepower corresponds to lower fuel efficiency. A simple linear regression model was fit to the data and the fitted line is shown in the left panel of Figure 11. The coefficient of determination for this fit is R² ≈ 0.60. A closer look at the data indicates a slight non-linear trend. The right panel of Figure 11

Figure 10: Scatterplot of Motor Trend car data: miles per gallon (mpg) versus gross horsepower for 32 different brands of cars.

shows a residual plot versus fitted values. The residual plot indicates that there may be a problem with the straight line fit: the residuals to the left and right are positive, and the residuals in the middle are mostly negative. This U-shaped pattern is indicative of a poor fit. To solve the problem, a different type of model needs to be considered, or perhaps a transformation of one or both variables may work.

Example: Anscombe's Regression Data. Anscombe (1973) simulated 4 very different data sets that produce identical least-squares regression lines. One of the benefits of this example is to illustrate the importance of plotting your data. Figure 12 shows scatterplots of the 4 data sets along with the fitted regression line. The top-left panel shows a nice scatter of points with a linear trend, and the regression line provides a nice fit to the data. The data in the top-right panel show a very distinct non-linear pattern. Although one can fit a straight line to such data, the straight line model is clearly wrong. Instead one could try to fit a quadratic curve (see polynomial regression). The bottom-left panel demonstrates how a single point can be very influential when a least-squares line is fit to the data. The points in this plot all lie on a line except a single point. The least squares regression line is pulled towards this single influential point. In a simple linear regression it is fairly easy to detect a highly influential point, as in this plot. However, in multiple regression (see next section) with several regressor variables, it can be difficult to detect influential points graphically. There exist many diagnostic tools for assessing how influential individual points are when fitting a model. There also exist robust regression techniques that prevent the fit of the line from being unduly influenced by a small number of observations. The bottom-right panel shows data from a very poorly designed experiment where all but one observation was obtained at one level of the x variable. The single point

Figure 11: Left panel: Scatterplot of MPG versus horsepower along with the fitted regression line. Right panel: Residual plot versus the fitted values ŷ_i.

on the right determines the slope of the fitted regression line.

9 Multiple Linear Regression

Often a response of interest may depend on several different factors, and consequently many regression applications involve more than a single regressor variable. A regression model with more than one regressor variable is known as a multiple regression model. The simple linear regression model can be generalized in a straightforward manner to incorporate other regressor variables, and the previous equations for simple linear regression continue to hold for multiple regression models. For illustration, consider the results of an experiment to study the effect of the mole contents of cobalt and the calcination temperature on the surface area of an iron-cobalt hydroxide catalyst (Said et al., 1994). The response variable in this experiment is y = surface area, and there are two regressor variables:

    x_1 = cobalt content
    x_2 = temperature.

The data from this experiment are given in the following table:

Figure 12: Anscombe's simple linear regression data. Four very different data sets yielding exactly the same least squares regression line.

    [Table: cobalt contents (x_1), temperature (x_2), and surface area (y) for the catalyst experiment.]

A general model relating the surface area y to the cobalt contents x_1 and temperature x_2

is

    y_i = f(x_i1, x_i2) + ε_i

where ε_i is a random error and f is some unknown function. Here x_i1 is the cobalt content for the ith unit in the data and x_i2 is the corresponding ith temperature measurement, for i = 1, ..., n. We can try approximating f by a first-order Taylor series approximation to get the following multiple regression model:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i.    (13)

If (13) is not adequate to model the response y, then we could try a higher-order Taylor series approximation such as:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_11 x_i1² + β_22 x_i2² + β_12 x_i1 x_i2 + ε_i.    (14)

The work required for finding the least squares estimators of the coefficients and the variances and covariances of these estimated parameters has already been done. The form of the least-squares solution from the simple linear regression model holds for the multiple regression model. This is where the matrix approach to the problem really pays off, because working out the details without using matrix algebra is very tedious. Consider a multiple linear regression model with k regressors x_1, ..., x_k:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i.    (15)

The least squares estimators are given by

    β̂ = (β̂_0, β̂_1, ..., β̂_k)′ = (X′X)⁻¹X′y    (16)

just as in equation (8), where

    X = [1 x_11 x_12 ... x_1k;  1 x_21 x_22 ... x_2k;  ... ;  1 x_n1 x_n2 ... x_nk].    (17)

Note that the design matrix X has a column of ones in its first column for the intercept term β_0, just like in simple linear regression. The covariance matrix of the estimated coefficients is given by σ²(X′X)⁻¹, just as in equation (9), where σ² is the error variance, which is again estimated by the mean squared residual

    MS_res = σ̂² = Σ_{i=1}^n (y_i − ŷ_i)²/(n − k − 1).

Note that the degrees of freedom associated with the mean squared residual is n − k − 1, since we lose k + 1 degrees of freedom estimating the intercept and the k coefficients for the k regressors.
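A minimal Matlab sketch of the multiple regression computations in (16) follows; the x_1, x_2, and y values below are hypothetical illustrative numbers (not the cobalt catalyst data), used only to show how the design matrix, the estimates, and the mean squared residual are formed:

% Minimal multiple regression sketch for model (15) with k = 2 regressors.
% Hypothetical values, not the cobalt catalyst data.
x1 = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0];          % first regressor (hypothetical)
x2 = [200; 250; 300; 200; 250; 300];          % second regressor (hypothetical)
y  = [12.1; 15.3; 17.9; 13.0; 16.2; 19.1];    % response (hypothetical)
n = length(y);
k = 2;
X = [ones(n,1) x1 x2];                  % design matrix as in (17)
bhat = (X'*X) \ (X'*y);                 % least squares estimates, equation (16)
yhat = X*bhat;                          % fitted values
MSres = sum((y - yhat).^2)/(n - k - 1); % error variance estimate on n-k-1 df
Covbhat = MSres * inv(X'*X);            % estimated covariance matrix of bhat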

10 Coefficient Interpretation

In the simple linear regression model y_i = β_0 + β_1 x_i + ε_i, the slope represents the expected change in y given a unit change in x. In multiple regression the regression coefficients have a similar interpretation: β_j represents the expected change in the response for a unit change in x_j, given that all the other regressors are held constant. The problem in multiple regression is that the regressor variables are often correlated with one another; if you change one regressor, then the others may tend to change as well, and this makes the interpretation of regression coefficients very difficult in multiple regression. This is particularly true of observational studies, where there is no control over the conditions that generate the data. The next example illustrates the problem.

Heart Catheter Example. A study was conducted and data collected to fit a regression model to predict the length of a catheter needed to pass from a major artery at the femoral region into the heart for children (Weisberg, 1980). For 12 children, the proper catheter length was determined by checking with a fluoroscope that the catheter tip had reached the right position. It was hoped that the child's height and weight could be used to predict the proper catheter length. The data are given in the following table:

    [Table: height, weight, and catheter length for the n = 12 children.]

When the data are fit to a multiple regression model using height and weight as regressors, how do we interpret the resulting coefficients? The coefficient for height tells us how much longer the catheter needs to be for each additional inch of height of the child, provided the weight of the child stays constant. But the taller the child, the heavier the child tends to be. Figure 13 shows a scatterplot of weight versus height for the n = 12 children from this experiment. The plot shows a very strong linear relationship between height and weight, so the correlation between height and weight is very large. This large correlation complicates the interpretation of the regression coefficients. The problem of correlated regressor variables is known as collinearity (or multicollinearity), and it is an important problem that one needs to be aware of when fitting multiple regression models. We return to this example in the collinearity discussion below.
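A quick way to screen for this problem is to compute the correlation between candidate regressors before fitting the model. The following minimal sketch uses hypothetical height and weight values (not the Weisberg catheter data) purely to illustrate the check:

% Screening for collinearity between two regressors before fitting.
% Hypothetical height (inches) and weight (pounds) values for illustration only.
height = [42; 45; 48; 50; 53; 57; 60; 63];
weight = [38; 43; 50; 54; 60; 70; 77; 85];
R = corrcoef(height, weight);   % sample correlation matrix of the two regressors
r = R(1,2)                      % a value near 1 signals strong collinearity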

Figure 13: A scatterplot of weight versus height for the n = 12 children in an experiment used to predict the required length of a catheter to the heart based on the child's height and weight.

Fortunately, in designed experiments where the engineer has complete control over the regressor variables, data can often be collected in such a way that the estimated regression coefficients are uncorrelated. In such situations, the estimated coefficients can then be easily interpreted. To make this happen, one needs to choose the values of the regressors so that the off-diagonal terms of the (X′X)⁻¹ matrix (from (9)) that correspond to covariances between estimated coefficients of the regressors are all zero.

Cobalt Example Continued. Let us return to model (13) and estimate the parameters of the model. Using the data in the cobalt example, we can construct the

design matrix X in (17) and the response vector y. The least squares estimates of the parameters are given by β̂ = (β̂_0, β̂_1, β̂_2)′, where

    β̂ = (X′X)⁻¹X′y.

The matrix computations are tedious, but software packages like Matlab can perform them for us. Nonetheless, it is a good idea to understand exactly what the computer software packages are computing for us when we feed data into them. The mean squared residual for these data is σ̂² = MS_res, and the estimated covariance matrix of the estimated coefficients is σ̂²(X′X)⁻¹. Note that this was a well-designed experiment because the estimated coefficients for cobalt contents (β̂_1) and temperature (β̂_2) are uncorrelated: the covariance between them is zero, as can be seen in the covariance matrix.
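The zero covariance is a consequence of how the regressor levels were chosen. A small sketch (with a hypothetical balanced design using coded -1/+1 levels, not the actual cobalt and temperature settings) illustrates the idea: when the two regressor columns are centered and orthogonal, the corresponding off-diagonal entry of (X′X)⁻¹ is zero, so by (9) the estimated coefficients are uncorrelated.

% Illustration: a balanced design makes the coefficient estimates uncorrelated.
% Hypothetical coded levels, not the actual cobalt/temperature settings.
x1 = [-1; -1; +1; +1; -1; -1; +1; +1];   % coded level of regressor 1
x2 = [-1; +1; -1; +1; -1; +1; -1; +1];   % coded level of regressor 2
n = length(x1);
X = [ones(n,1) x1 x2];
C = inv(X'*X);   % proportional to Cov(bhat); see equation (9)
disp(C)          % the off-diagonal entries involving x1 and x2 are zero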

11 Analysis of Variance (ANOVA) for Multiple Regression

Because there are several regression parameters in the multiple regression model (15), a formal test can be conducted to see if the response depends on any of the regressor variables. That is, we conduct a single test of the hypothesis:

    H_0: β_1 = β_2 = ... = β_k = 0  versus  H_a: not all of the β_j's are zero.

The basic idea behind the testing procedure is to partition all the variability in the response into two pieces: variability due to the regression relation and variability due to the random error term. This is why the procedure is called Analysis of Variance (ANOVA). The total variability in the response is represented by the total sum of squares (SS_yy):

    SS_yy = Σ_{i=1}^n (y_i − ȳ)².

We have already defined the residual sum of squares:

    SS_res = Σ_{i=1}^n (y_i − ŷ_i)².

We can also define the Regression Sum of Squares (SS_reg):

    SS_reg = Σ_{i=1}^n (ŷ_i − ȳ)²,

which represents the variability in the y_i's explained by the multiple regression model. Note that for an individual measurement,

    (y_i − ȳ) = (y_i − ŷ_i) + (ŷ_i − ȳ).

If we square both sides of this equation and sum over all n observations, we will get SS_yy on the left-hand side. On the right-hand side (after doing some algebra) we will get SS_res + SS_reg only, because the cross-product terms sum to zero. This gives us the well-known variance decomposition formula:

    SS_yy = SS_reg + SS_res.    (18)

If all the β_j's are zero, then the regression model will explain very little of the variability in the response, in which case SS_reg will be small and SS_res will be large, relatively speaking. The ANOVA test then simply compares these components of variance with each other. However, in order to make the sums of squares comparable, we need to first divide each sum of squares by its respective degrees of freedom. The degrees of freedom associated with SS_reg and SS_res are k and n − k − 1, respectively.
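As a small numerical check of the decomposition (18), the following Matlab sketch (reusing the hypothetical two-regressor values from the earlier multiple regression sketch) computes the three sums of squares and verifies that they add up:

% Numerical check of the variance decomposition (18) for a multiple
% regression fit (hypothetical two-regressor values, as in the earlier sketch).
x1 = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0];
x2 = [200; 250; 300; 200; 250; 300];
y  = [12.1; 15.3; 17.9; 13.0; 16.2; 19.1];
n = length(y);
X = [ones(n,1) x1 x2];
bhat = (X'*X) \ (X'*y);
yhat = X*bhat;
SSyy  = sum((y - mean(y)).^2);     % total sum of squares
SSreg = sum((yhat - mean(y)).^2);  % regression sum of squares
SSres = sum((y - yhat).^2);        % residual sum of squares
disp([SSyy  SSreg + SSres])        % the two numbers agree, as in (18)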


This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

Using R for Linear Regression

Using R for Linear Regression Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

More information

Relationships Between Two Variables: Scatterplots and Correlation

Relationships Between Two Variables: Scatterplots and Correlation Relationships Between Two Variables: Scatterplots and Correlation Example: Consider the population of cars manufactured in the U.S. What is the relationship (1) between engine size and horsepower? (2)

More information

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

1.5 Oneway Analysis of Variance

1.5 Oneway Analysis of Variance Statistics: Rosie Cornish. 200. 1.5 Oneway Analysis of Variance 1 Introduction Oneway analysis of variance (ANOVA) is used to compare several means. This method is often used in scientific or medical experiments

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared. jn2@ecs.soton.ac.uk

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared. jn2@ecs.soton.ac.uk COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared jn2@ecs.soton.ac.uk Relationships between variables So far we have looked at ways of characterizing the distribution

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Describing Relationships between Two Variables

Describing Relationships between Two Variables Describing Relationships between Two Variables Up until now, we have dealt, for the most part, with just one variable at a time. This variable, when measured on many different subjects or objects, took

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

GRADES 7, 8, AND 9 BIG IDEAS

GRADES 7, 8, AND 9 BIG IDEAS Table 1: Strand A: BIG IDEAS: MATH: NUMBER Introduce perfect squares, square roots, and all applications Introduce rational numbers (positive and negative) Introduce the meaning of negative exponents for

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

17. SIMPLE LINEAR REGRESSION II

17. SIMPLE LINEAR REGRESSION II 17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Nonlinear Regression Functions. SW Ch 8 1/54/

Nonlinear Regression Functions. SW Ch 8 1/54/ Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 13 Introduction to Linear Regression and Correlation Analysis Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing

More information

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance 2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Correlation key concepts:

Correlation key concepts: CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Part II. Multiple Linear Regression

Part II. Multiple Linear Regression Part II Multiple Linear Regression 86 Chapter 7 Multiple Regression A multiple linear regression model is a linear model that describes how a y-variable relates to two or more xvariables (or transformations

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b.

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b. PRIMARY CONTENT MODULE Algebra - Linear Equations & Inequalities T-37/H-37 What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

3.1 Least squares in matrix form

3.1 Least squares in matrix form 118 3 Multiple Regression 3.1 Least squares in matrix form E Uses Appendix A.2 A.4, A.6, A.7. 3.1.1 Introduction More than one explanatory variable In the foregoing chapter we considered the simple regression

More information

Pre-Algebra 2008. Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems

Pre-Algebra 2008. Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems Academic Content Standards Grade Eight Ohio Pre-Algebra 2008 STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express large numbers and small

More information

Solving Mass Balances using Matrix Algebra

Solving Mass Balances using Matrix Algebra Page: 1 Alex Doll, P.Eng, Alex G Doll Consulting Ltd. http://www.agdconsulting.ca Abstract Matrix Algebra, also known as linear algebra, is well suited to solving material balance problems encountered

More information

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Descriptive statistics consist of methods for organizing and summarizing data. It includes the construction of graphs, charts and tables, as well various descriptive measures such

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

What are the place values to the left of the decimal point and their associated powers of ten?

What are the place values to the left of the decimal point and their associated powers of ten? The verbal answers to all of the following questions should be memorized before completion of algebra. Answers that are not memorized will hinder your ability to succeed in geometry and algebra. (Everything

More information

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

More information

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions

More information

3.4 Statistical inference for 2 populations based on two samples

3.4 Statistical inference for 2 populations based on two samples 3.4 Statistical inference for 2 populations based on two samples Tests for a difference between two population means The first sample will be denoted as X 1, X 2,..., X m. The second sample will be denoted

More information

Week 4: Standard Error and Confidence Intervals

Week 4: Standard Error and Confidence Intervals Health Sciences M.Sc. Programme Applied Biostatistics Week 4: Standard Error and Confidence Intervals Sampling Most research data come from subjects we think of as samples drawn from a larger population.

More information

Section 1.1. Introduction to R n

Section 1.1. Introduction to R n The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Stata Example (See appendices for full example).. use http://www.nd.edu/~rwilliam/stats2/statafiles/multicoll.dta,

More information

Copyright 2007 by Laura Schultz. All rights reserved. Page 1 of 5

Copyright 2007 by Laura Schultz. All rights reserved. Page 1 of 5 Using Your TI-83/84 Calculator: Linear Correlation and Regression Elementary Statistics Dr. Laura Schultz This handout describes how to use your calculator for various linear correlation and regression

More information

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1) Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

More information

Unit 1 Equations, Inequalities, Functions

Unit 1 Equations, Inequalities, Functions Unit 1 Equations, Inequalities, Functions Algebra 2, Pages 1-100 Overview: This unit models real-world situations by using one- and two-variable linear equations. This unit will further expand upon pervious

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

Normal distribution. ) 2 /2σ. 2π σ

Normal distribution. ) 2 /2σ. 2π σ Normal distribution The normal distribution is the most widely known and used of all distributions. Because the normal distribution approximates many natural phenomena so well, it has developed into a

More information

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles. Math 1530-017 Exam 1 February 19, 2009 Name Student Number E There are five possible responses to each of the following multiple choice questions. There is only on BEST answer. Be sure to read all possible

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Display and Summarize Correlation for Direction and Strength Properties of Correlation Regression Line Cengage

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

An introduction to Value-at-Risk Learning Curve September 2003

An introduction to Value-at-Risk Learning Curve September 2003 An introduction to Value-at-Risk Learning Curve September 2003 Value-at-Risk The introduction of Value-at-Risk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk

More information

Unit 26 Estimation with Confidence Intervals

Unit 26 Estimation with Confidence Intervals Unit 26 Estimation with Confidence Intervals Objectives: To see how confidence intervals are used to estimate a population proportion, a population mean, a difference in population proportions, or a difference

More information

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Review MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) All but one of these statements contain a mistake. Which could be true? A) There is a correlation

More information

An analysis method for a quantitative outcome and two categorical explanatory variables.

An analysis method for a quantitative outcome and two categorical explanatory variables. Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

More information

Regression and Correlation

Regression and Correlation Regression and Correlation Topics Covered: Dependent and independent variables. Scatter diagram. Correlation coefficient. Linear Regression line. by Dr.I.Namestnikova 1 Introduction Regression analysis

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Elasticity. I. What is Elasticity?

Elasticity. I. What is Elasticity? Elasticity I. What is Elasticity? The purpose of this section is to develop some general rules about elasticity, which may them be applied to the four different specific types of elasticity discussed in

More information

Summary: Transformations. Lecture 14 Parameter Estimation Readings T&V Sec 5.1-5.3. Parameter Estimation: Fitting Geometric Models

Summary: Transformations. Lecture 14 Parameter Estimation Readings T&V Sec 5.1-5.3. Parameter Estimation: Fitting Geometric Models Summary: Transformations Lecture 14 Parameter Estimation eadings T&V Sec 5.1-5.3 Euclidean similarity affine projective Parameter Estimation We will talk about estimating parameters of 1) Geometric models

More information