Chapter 5. Regression Models

Regression analysis is probably the most used tool in statistics. Regression deals with modeling how one variable (called a response) is related to one or more other variables (called predictors or regressors). Before introducing regression models involving two or more variables, we first return to the very simple model introduced in Chapter 1 to set up the basic ideas and notation.

1 A Simple Model

Consider once again the fill-weights in the cup-a-soup example. For the sake of illustration, consider the first 10 observations from the data set. Note that although the filling machine is set to fill each cup to a specified weight, the actual weights vary from cup to cup. Let y_1, y_2, ..., y_n denote the fill-weights for our sample (so that n = 10). The model we introduced in Chapter 1 that incorporates this variability is

    y_i = μ + ε_i    (1)

where ε_i is a random error representing the deviation of the ith fill-weight from the average fill-weight of all cups (μ). Equation (1) is a very simple example of a statistical model. It involves a random component (ε_i) and a deterministic component (μ). The population mean μ is a parameter of the model, and the other parameter in (1) is the variance of the random error ε, which we shall denote by σ² ("sigma-squared").

Let us now consider the problem of estimating the population mean μ in (1). The technique we will use for (1) is called least-squares, and it is easy to generalize to more complicated regression models. A natural and intuitive way of estimating the true value of the population mean μ is to simply take the average of the measurements:

    ȳ = (1/n) Σ_{i=1}^n y_i.

Why should we use ȳ to estimate μ? There are many reasons why ȳ is a good estimator of μ, but the reason we shall focus on is that ȳ is the best estimator of μ in terms of having the smallest mean squared error. That is, given the 10 measurements above, we can ask: which value of μ makes the sum of squared deviations

    Σ_{i=1}^n (y_i − μ)²    (2)

the smallest? That is, what is the least-squares estimator of μ? The answer to this question can be found by doing some simple calculus. Consider the following function of μ:

    f(μ) = Σ_{i=1}^n (y_i − μ)².

From calculus, we know that to find the extrema of a function, we can take the derivative of the function, set it equal to zero, and solve for the argument of the function. Thus,

    (d/dμ) f(μ) = −2 Σ_{i=1}^n (y_i − μ) = 0.

Using a little algebra, we can solve this equation for μ to get

    μ̂ = ȳ.

(One can check that the 2nd derivative of this function is positive, so that setting the first derivative to zero determines a value of μ that minimizes the sum of squares.) The hat notation (i.e. μ̂) is used to denote an estimator of a parameter. This is a standard notational practice in statistics. Thus, we use μ̂ = ȳ to estimate the unknown population mean μ. Note that μ̂ is not the true value of μ but simply an estimator based on 10 data points.

Now we shall re-do the computation using matrix notation. This will seem unnecessarily complicated, but once we have a solution worked out, we can re-apply it to many other much more complicated models very easily. Data usually comes to us in the form of arrays of numbers, typically in computer files. Therefore, a natural and easy way to handle data (particularly large sets of data) is to use the power of matrix computations. Take the fill-weight measurements y_1, y_2, ..., y_n and stack them into a vector, denoted by a boldfaced y:

    y = (y_1, y_2, ..., y_n)′.

Now let X denote a column vector of ones and let ε denote the error terms ε_i stacked into a vector:

    X = (1, 1, ..., 1)′   and   ε = (ε_1, ε_2, ..., ε_n)′.

Then we can re-write our very simple model (1) in matrix/vector form as:

    (y_1, y_2, ..., y_n)′ = (1, 1, ..., 1)′ μ + (ε_1, ε_2, ..., ε_n)′.

More compactly, we can write:

    y = Xμ + ε.    (3)

The sum of squares in equation (2) can be written

    (y − Xμ)′(y − Xμ).

Multiplying this out, we find the sum of squares to be

    y′y − 2X′y μ + μ² X′X.

Taking the derivative of this with respect to μ and setting the derivative equal to zero gives

    −2X′y + 2μ X′X = 0.

Solving for μ gives

    μ̂ = (X′X)⁻¹X′y.    (4)

The solution given by equation (4) is the least-squares solution, and this formula holds for a wide variety of models as we shall see.

2 The Simple Linear Regression Model

Now we will define a slightly more complicated statistical model that turns out to be extremely useful in practice. The model is a simple extension of our first model y_i = μ + ε_i and, using the matrix notation, all we have to do is add another column to the vector X and change it into a matrix with two columns. To illustrate ideas, consider the data in the following table, which was collected in Consumer Reports and reported on in Henderson and Velleman (1981). The table gives the make (column 1), the miles per gallon (MPG) (column 2), and the weight (column 3), in thousands of pounds, of n = 6 Japanese cars.

    Make            MPG    Weight
    Toyota Corona   27.5   2.560
    Datsun          27.2   2.300
    Mazda GLC       34.1   1.975
    Dodge Colt      35.1   1.915
    Datsun          31.8   2.020
    Datsun          22.0   2.815

Figure 1: Scatterplot of miles per gallon versus weight of n = 6 Japanese cars.

It seems reasonable that the miles per gallon of a car is related to the weight of the car. Our goal is to model the relationship between these two variables. A scatterplot of the data is shown in Figure 1. As can be seen from the figure, there appears to be a linear relationship between the MPG (y) and the weight of the car (x). Heavier cars tend to have lower gas mileage. A deterministic model for these data is given by

    y_i = β_0 + β_1 x_i

where y_i is the MPG for the ith car and x_i is the corresponding weight of the car. The two parameters are β_0, which is the y-intercept, and β_1, which is the slope of the line. However, this model is inadequate because it forces all the points to lie exactly on a line. From Figure 1, we clearly see that the points do follow a linear pattern, but the points do not all fall exactly on a line. Thus, a better model will include a random component for the error, which allows the points to scatter about the line. The following model is called a simple linear regression model:

    y_i = β_0 + β_1 x_i + ε_i    (5)

for i = 1, 2, ..., n. The random variable y_i is called the response (it is sometimes also called the dependent variable). The x_i is called the ith value of the regressor variable (sometimes known as the independent or predictor variable). The random error ε_i is assumed to have a mean of 0 and variance σ². We typically assume the ε_i's are independent of each other. The slope β_1 and intercept β_0 are the two parameters of primary importance, and the question arises as to how they should be estimated. The least squares solution

Figure 2: The least-squares regression line is determined by minimizing the sum of squared vertical differences between the observed MPG's and the corresponding point on the line.

is found by determining the values of β_0 and β_1 that minimize the sum of squared errors:

    Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².

Graphically, this corresponds to finding the line minimizing the sum of squared vertical differences between the observed MPG's and the corresponding values on the line, as shown in Figure 2. Returning to our matrix and vector notation, we can write

    y = (y_1, y_2, ..., y_n)′   and   X = [1 x_1; 1 x_2; ...; 1 x_n],

where the first column of X is a column of ones and the second column contains the car weights. Let β = (β_0, β_1)′ and ε = (ε_1, ε_2, ..., ε_n)′. Then we can rewrite (5) in matrix form as

    y = Xβ + ε.    (6)

In order to find the least-squares estimators of β, we need to find the values of β_0 and β_1 that minimize

    (y − Xβ)′(y − Xβ) = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)²

or, since ε_i = y_i − β_0 − β_1 x_i, we need to find the β_0 and β_1 that minimize ε′ε. Matrix differentiation can be used to solve this problem, but instead we will use a geometric argument. First, some additional notation. Let β̂_0 and β̂_1 denote the least squares estimators of β_0 and β_1. Then, given a value of the predictor x_i, we can compute the predicted value of y given x_i as

    ŷ_i = β̂_0 + β̂_1 x_i.

The residual r_i is defined to be the difference between the response y_i and the predicted value ŷ_i:

    r_i = y_i − ŷ_i.

Let r = (r_1, r_2, ..., r_n)′ and ŷ = (ŷ_1, ŷ_2, ..., ŷ_n)′. Note that ŷ = Xβ̂ where β̂ = (β̂_0, β̂_1)′. The least squares estimators β̂_0 and β̂_1 are chosen to make r′r as small as possible. Geometrically, ŷ is the projection of y onto the plane spanned by the columns of the matrix X. This is illustrated in Figure 3. To make r′r as small as possible, r should be orthogonal to the plane spanned by the columns of X. Algebraically, this means that X′r = 0. Writing this out, we get

    X′r = X′(y − ŷ) = X′(y − Xβ̂) = 0.

Thus, β̂ should satisfy

    X′y = X′X β̂.    (7)

This equation is known as the normal equation. Assuming X′X is an invertible matrix, we can multiply both sides of (7) on the left by (X′X)⁻¹ to get the least-squares solution:

    β̂ = (X′X)⁻¹X′y.    (8)

This is one of the most important equations of this course. This formula provides the least-squares solution for a wide variety of models. Note that we have already seen this solution in (4).

Example. Returning to the MPG example for Japanese cars, we now illustrate the computation of the least-squares estimators of the slope β_1 and y-intercept β_0. From the data, we can compute the matrix X′X and then its inverse (X′X)⁻¹.

Figure 3: The geometry of least-squares. The vector y is projected onto the space spanned by the columns of the design matrix X, denoted by X1 and X2 in the figure. The projected value is the vector of fitted values ŷ (denoted by yhat in the figure). The difference between y and ŷ is the vector of residuals r.

Also, we can compute X′y. So, the least squares estimators of the intercept and slope are:

    β̂ = (X′X)⁻¹X′y ≈ (59.42, −13.16)′.

From this computation, we find that the least squares estimate of the y-intercept is β̂_0 ≈ 59.42, the estimated slope is β̂_1 ≈ −13.16, and the prediction equation is given by

    ŷ = 59.42 − 13.16 x.

Note that the estimated y-intercept β̂_0 does not have any meaningful interpretation in this example. The y-intercept corresponds to the average y value when x = 0, i.e. the MPG for a car that weighs zero pounds. It makes no sense to estimate the mileage of a car with a weight of zero. Typically in regression examples the intercept will not be meaningful unless data is collected for values of x near zero. Since there is no such thing as a car weighing zero pounds, the intercept has no meaningful interpretation in this example.
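Before turning to the interpretation of the slope, a small numerical check may help make the geometry concrete. The following Matlab sketch (an illustration, using the same six observations that appear in the Matlab listing at the end of this section) solves the normal equations (7) and verifies that the residual vector is orthogonal to the columns of X, i.e. X′r = 0:

% Numerical check of the normal equation (7) for the car mileage data.
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];        % response y
wt  = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];  % regressor x (weight)
X = [ones(6,1) wt];        % design matrix: a column of ones and the weights
bhat = (X'*X) \ (X'*mpg);  % solve the normal equations X'X * bhat = X'y
yhat = X*bhat;             % fitted values
r = mpg - yhat;            % residuals
disp(X'*r)                 % numerically zero: r is orthogonal to the columns of X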

The slope β_1 is generally the parameter of primary interest in a simple linear regression. The slope represents the average change in the response for a unit change in the regressor. In the car example, the estimated slope of β̂_1 ≈ −13.16 indicates that for each additional thousand pounds of weight of a car we would expect to see a reduction of about 13 miles per gallon, on average.

Multiplying out the matrices in (8), we get the following formulas for the least squares estimates in simple linear regression:

    β̂_0 = ȳ − β̂_1 x̄
    β̂_1 = SS_xy / SS_xx

where

    SS_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

and

    SS_xx = Σ_{i=1}^n (x_i − x̄)².

That is, the estimator of the slope is the covariance between the x's and y's divided by the variance of the x's. In multiple regression, when there is more than one regressor variable, the formulas for the least squares estimators become extremely complicated unless you stick with the matrix notation. The matrix notation also allows us to compute quite easily the standard errors of the least squares estimators, as well as the covariance between the estimators.

First, let us show that the least squares estimators are unbiased for the corresponding model parameters. Before doing so, note that in a designed experiment the values of the regressor are typically fixed by the experimenter and therefore are not considered random. On the other hand, because y_i = β_0 + β_1 x_i + ε_i and ε_i is a random variable, y_i is also a random variable. Computing, we get

    E[β̂] = E[(X′X)⁻¹X′y]
         = (X′X)⁻¹X′E[y]
         = (X′X)⁻¹X′E[Xβ + ε]
         = (X′X)⁻¹X′(Xβ + E[ε])
         = (X′X)⁻¹X′Xβ + 0
         = β

since E[ε] = 0. Therefore, the least squares estimators β̂ are unbiased for the population parameters β.

Many statistical software packages have built-in functions that will perform regression analysis. We can also use software to do the matrix calculations directly. Below is Matlab code that produces some of the output generated above for the car mileage example:

% Motor Trend car data
% Illustration of simple linear regression.
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];
% Car's weight
wt = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];
% Compute means and standard deviations:
mean(wt)
std(wt)
mean(mpg)
std(mpg)
n = length(mpg);          % n = sample size
X = [ones(6,1) wt];       % Compute the design matrix X
bhat = inv(X'*X)*X'*mpg;  % bhat = estimated regression coefficients
yhat = X*bhat;            % yhat = fitted values
r = mpg - yhat;           % r = residuals
plot(wt, mpg, 'o', wt, yhat)
title('Motor Trend Car Data')
axis([1.5, 3, 20, 40])
ylabel('Miles per Gallon (mpg)')
xlabel('Weight of the Car')
% Make a plot of residuals versus fitted values:
plot(yhat, r, 'o', linspace(20,40,n), zeros(n,1))
xlabel('Fitted Values')
ylabel('Residuals')
title('Residual Plot')
% Here's a built-in Matlab function that will fit a
% polynomial to the data -- the last number indicates the degree of the polynomial.
polyfit(wt, mpg, 1)

3 Covariance Matrices for Least-Squares Estimators

Now β̂ is a random vector (since it is a function of the random y_i's). We have shown that it is unbiased for β_0 and β_1. In order to determine how stable the parameter estimates are, we need an estimate of the variability of the estimators. This can be obtained by determining the covariance matrix of β̂ as follows:

    Cov(β̂) = E[(β̂ − β)(β̂ − β)′]
            = E[((X′X)⁻¹X′y − (X′X)⁻¹X′E[y])((X′X)⁻¹X′y − (X′X)⁻¹X′E[y])′]
            = E[((X′X)⁻¹X′ε)((X′X)⁻¹X′ε)′]
            = (X′X)⁻¹X′E[εε′]X(X′X)⁻¹
            = (X′X)⁻¹X′(σ²I)X(X′X)⁻¹    (where I is the identity matrix)
            = σ²(X′X)⁻¹.

The main point of this derivation is that the covariance matrix of the least-squares estimators is

    σ²(X′X)⁻¹    (9)

where σ² is the variance of the error term ε in the simple linear regression model. Formula (9) holds for a wide variety of regression models, including polynomial regression, analysis of variance, and analysis of covariance. The only assumption needed for (9) to hold is that the errors are uncorrelated and all have the same variance.

Formula (9) indicates that we need an estimate for the last remaining parameter of the simple linear regression model (5), and that is the error variance σ². Since ε_i = y_i − β_0 − β_1 x_i and the ith residual is r_i = y_i − β̂_0 − β̂_1 x_i, a natural estimate of the error variance is

    σ̂² = MS_res = SS_res/(n − 2)

where

    SS_res = Σ_{i=1}^n r_i²

is the Sum of Squares for the Residuals and MS_res stands for the Mean Squared Residual (or mean squared error (MSE)). We divide by n − 2 in the mean squared residual so as to make it an unbiased estimator of σ²: E[MS_res] = σ². We lose two degrees of freedom for estimating the slope β_1 and the intercept β_0. Therefore, the degrees of freedom associated with the mean squared residual is n − 2.

Returning to the car example, we can compute the vector of fitted values ŷ and the vector of residuals r = y − ŷ (note that the residuals sum to zero, analogously with E[ε_i] = 0). Computing, we get MS_res and the estimated covariance matrix for β̂,

    σ̂²(X′X)⁻¹.

The numbers on the diagonal of the covariance matrix give the estimated variances of β̂_0 and β̂_1. Therefore, the slope of the regression line is estimated to be β̂_1 ≈ −13.16 with estimated variance σ̂²_{β̂_1} given by the second diagonal element. Taking the square root of this variance gives the estimated standard error of the slope σ̂_{β̂_1}, which will be used for making inferential statements about the slope. Note that the estimated covariance between the estimated intercept and the estimated slope is negative. Does it seem intuitive that the estimated slope and intercept will be negatively correlated when the regressor values (the x_i's) are all positive?
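A brief Matlab sketch of these variance calculations, along the same lines as the listing in Section 2 and reusing the same six observations, is given below; the square roots of the diagonal entries of the estimated covariance matrix are the standard errors used for the tests in the next section:

% Estimated error variance and covariance matrix of bhat for the car data.
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];
wt  = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];
n = length(mpg);
X = [ones(n,1) wt];
bhat = (X'*X) \ (X'*mpg);
r = mpg - X*bhat;             % residuals
SSres = r'*r;                 % residual sum of squares
MSres = SSres/(n-2);          % estimate of sigma^2 with n-2 degrees of freedom
Covbhat = MSres * inv(X'*X);  % estimated covariance matrix of bhat, formula (9)
se = sqrt(diag(Covbhat));     % standard errors of the intercept and slope
disp([bhat se])               % estimates alongside their standard errors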

4 Hypothesis Tests for Regression Coefficients

Regression models are used in a wide variety of applications. Interest often lies in testing whether the slope parameter β_1 takes a particular value, say β_1 = β_10. We can test hypotheses of the form:

    H_0: β_1 = β_10  versus  H_a: β_1 > β_10, or H_a: β_1 < β_10, or H_a: β_1 ≠ β_10.

A suitable test statistic for these tests is the standardized difference between the estimated slope and the hypothesized slope:

    t = (β̂_1 − β_10)/σ̂_{β̂_1},

and we reject H_0 when this standardized difference is large (away from the null hypothesis). Assuming the error terms ε_i are independent with a normal distribution, this test statistic has a t-distribution on n − 2 degrees of freedom when the null hypothesis is true. If we are performing a test using a significance level α, then we would reject H_0 at significance level α if

    t > t_α    when H_a: β_1 > β_10,
    t < −t_α   when H_a: β_1 < β_10,
    t > t_{α/2} or t < −t_{α/2}    when H_a: β_1 ≠ β_10.

A common hypothesis of interest is whether the slope differs significantly from zero. If the slope β_1 is zero, then the response does not depend on the regressor. The test statistic in this case reduces to t = β̂_1/σ̂_{β̂_1}.

Car Example continued... We can test if the mileage of a car is related (linearly) to the weight of the car. In other words, we want to test H_0: β_1 = 0 versus H_a: β_1 ≠ 0. Let us test this hypothesis using significance level α = 0.05. Since there are n = 6 observations, we will reject H_0 if the test statistic is larger in absolute value than t_{α/2} = t_{.05/2} = t_{.025} = 2.776, which can be found in the t-table under n − 2 = 6 − 2 = 4 degrees of freedom. Recall that β̂_1 ≈ −13.16 with estimated standard error σ̂_{β̂_1} obtained from the covariance matrix in the previous section. Computing the test statistic t = β̂_1/σ̂_{β̂_1}, we find that |t| > t_{α/2} = 2.776, so we reject H_0 and conclude that the slope differs from zero using a significance level α = 0.05. In other words, the MPG of a car depends on the weight of the car. We can also compute a p-value for this test as

    p-value = 2P(T > |t|)    (2-tailed p-value)

where T represents a t random variable on n − 2 degrees of freedom and t represents the observed value of the test statistic. The factor 2 is needed because this is a two-sided

test: we reject H_0 for large values of β̂_1 in either the positive or negative direction. The computed p-value in this example (using degrees of freedom equal to 4) is 2P(T > |t|) = 2(0.0001) = 0.0002. Thus, we have very strong evidence that the slope differs from zero.

Hypothesis tests can be performed for the intercept β_0 as well, but this is not as common. The test statistic for testing H_0: β_0 = β_00 is

    t = (β̂_0 − β_00)/σ̂_{β̂_0},

which follows a t-distribution on n − 2 degrees of freedom when the null hypothesis is true.

5 Confidence Intervals for Regression Coefficients

We can also form confidence intervals for regression coefficients. The next example illustrates such an application.

Example (data compliments of Brian Jones). Experiments were conducted at Wright State University to measure the stiffness of external fixators. An external fixator is designed to hold a broken bone in place so it can heal. The stiffness is an important characteristic of the fixator since it indicates how well the fixator protects the broken bone. In the experiment, the vertical force (in Newtons) on the fixator is measured along with the amount the fixator extends (in millimeters). The stiffness is defined to be the force per millimeter of extension. A natural way to estimate the stiffness of the fixator is to use the slope from an estimated simple linear regression model. The data from the experiment are given in the following table:

    [Table: extension (in mm) and force (in Newtons) measurements from the fixator experiment.]

Figure 4 shows a scatterplot of the raw data. The relation appears to be linear. Figure 5 shows the raw data again in the left panel along with the fitted regression

Figure 4: Scatterplot of force (in Newtons) versus extension (in mm) for an external fixator used to hold a broken bone in place.

line ŷ = β̂_0 + β̂_1 x. The points in the plot are tightly clustered about the regression line, indicating that almost all the variability in y is accounted for by the regression relation (see the discussion of R² below). A residual plot is shown in the right panel of Figure 5. The residuals should not exhibit any structure, and a plot of residuals is useful for assessing whether the specified model is adequate for the data.

The slope is estimated to be β̂_1 ≈ 64.531 and the estimated standard error of the slope is found to be σ̂_{β̂_1} = 0.465. A (1 − α)100% confidence interval for the slope is given by

    Confidence Interval for the Slope:  β̂_1 ± t_{α/2} σ̂_{β̂_1},

where the degrees of freedom for the t-critical value is given by n − 2. The estimated standard error of the slope can be found as before by taking the square root of the second diagonal element of the covariance matrix σ̂²(X′X)⁻¹.

For the fixator experiment, let us compute a 95% confidence interval for the stiffness (β_1). The sample size is n = 11 and the critical value is t_{α/2} = t_{.025} = 2.262 for n − 2 = 11 − 2 = 9 degrees of freedom. The 95% confidence interval for the stiffness is

    β̂_1 ± t_{α/2} σ̂_{β̂_1} = 64.531 ± 2.262(0.465) = 64.531 ± 1.052,

which gives an interval of [63.479, 65.583]. With 95% confidence we estimate that the stiffness of the external fixator lies between 63.479 and 65.583 Newtons/mm.

Problems

1. Box, Hunter, & Hunter (1978) report on an experiment looking at how y, the dispersion of an aerosol (measured as the reciprocal of the number of particles

Figure 5: The left panel shows the scatterplot of the fixator data along with the least-squares regression line. The right panel shows a plot of the residuals versus the fitted values ŷ_i to evaluate the fit of the model.

per unit volume), depends on x, the age of the aerosol (in minutes). The data are given in the following table:

    [Table: dispersion (y) and age in minutes (x) for the aerosol experiment.]

Fit a simple linear regression model to these data by performing the following steps:

a) Write out the design matrix X for this data and the vector y of responses.
b) Compute X′X.
c) Compute (X′X)⁻¹.
d) Compute the least squares estimates of the y-intercept and slope, β̂ = (X′X)⁻¹X′y.
e) Plot the data along with the fitted regression line.
f) Compute the mean squared error from the least-squares regression line: σ̂² = MSE = (y − ŷ)′(y − ŷ)/(n − 2).
g) Compute the estimated covariance matrix for the estimated regression coefficients: σ̂²(X′X)⁻¹.
h) Does the age of the aerosol affect the dispersion of the aerosol? Perform a hypothesis test using significance level α = 0.05 to answer this question. Set up the null and alternative hypotheses in terms of the parameter of interest, determine the critical region, compute the test statistic, and state your decision. In plain English, write out the conclusion of the test.
i) Find a 95% confidence interval for the slope of the regression line.

2. Consider the crystal growth data in the notes. In this example, x = time the crystal grew and y = weight of the crystal (in grams). It seems reasonable that at time zero the crystal would weigh zero grams, since it has not started growing yet. In fact, the estimated regression line has a y-intercept near zero. Find the least squares estimator of β_1 in the no-intercept model

    y_i = β_1 x_i + ε_i

in two different ways:

a) Find the value of β_1 that minimizes Σ_{i=1}^n (y_i − β_1 x_i)². Note: Solve this algebraically without using the data from the actual experiment.

b) Write out the design matrix for the no-intercept model and compute b_1 = (X′X)⁻¹X′y. Does this give the same solution as part (a)?

6 Estimating a Mean Response and Predicting a New Response

Regression models are often used to predict a new response or to estimate a mean response for a given value of the predictor x. We have seen how to compute a predicted value ŷ as ŷ = β̂_0 + β̂_1 x. However, as with parameter estimates, we need a measure of reliability associated with ŷ. In order to illustrate the ideas, we consider a new example.

Example. An experiment was conducted to study how the weight (in grams) of a crystal varies according to how long (in hours) the crystal grows (Graybill and Iyer, 1994). The data are given in the following table:

    [Table: weight (in grams) and growth time (in hours) for n = 14 crystals.]

Clearly, as the crystal grows the weight increases. We can use the slope of the estimated least squares regression line as an estimate of the linear growth rate. A direct computation gives X′X and the least squares estimates β̂ = (X′X)⁻¹X′y. The raw data along with the fitted regression line are shown in Figure 6. From the estimated slope, we can state that the crystals grow at an estimated rate of β̂_1 grams per hour.
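Before discussing the uncertainty attached to ŷ, here is a minimal Matlab sketch of the point prediction itself. The hours and weight values below are hypothetical stand-ins chosen only for illustration (the actual crystal data appear in Graybill and Iyer, 1994); the sketch simply fits the line and evaluates ŷ = β̂_0 + β̂_1 x at a chosen growth time x_0:

% Point estimate yhat at a new x0 from a fitted simple linear regression.
% Hypothetical hours/weight values, not the actual crystal growth data.
hours  = [2; 4; 6; 8; 10; 12; 14; 16; 18; 20; 22; 24; 26; 28];
weight = [0.9; 2.1; 2.8; 4.2; 4.9; 6.1; 7.2; 7.9; 9.3; 10.1; 11.0; 12.2; 12.9; 14.1];
X = [ones(length(hours),1) hours];
bhat = (X'*X) \ (X'*weight);  % least squares estimates (8)
x0 = 15;                      % growth time (in hours) of interest
yhat0 = [1 x0]*bhat           % estimated mean weight / predicted weight at x0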

Figure 6: Crystal growth data with the estimated regression line.

We now turn to the question of using the estimated regression model to estimate the mean response at a given value of x, or to predict a new value of y for a given value of x. Note that estimating a mean response and predicting a new response are different goals. Suppose we want to estimate the mean weight of a crystal that has grown for x = 15 hours. The question is: what is the average weight of all crystals that have grown for x = 15 hours? Note this is a hypothetical population. If we were to set up a production process where we grow crystals for 15 hours, what would be the average weight of the resulting crystals? In order to estimate the mean response at x = 15 hours, we use ŷ = β̂_0 + β̂_1 x, plugging in x = 15. On the other hand, if we want to predict the weight of a single crystal that has grown for x = 15 hours, we would also use ŷ = β̂_0 + β̂_1 x with x = 15, just as we did for estimating a mean response. Note that although estimating a mean response and predicting a new response are two different goals, we use ŷ in each case.

The difference statistically between estimating a mean response and predicting a new response lies in the uncertainty associated with each. A confidence interval for a mean response will be narrower than a prediction interval for a new response. The reason why is that a mean response for a given x value is a fixed quantity: it is an expected value of the response for a given x value, known as a conditional mean. A 95% prediction interval for a new response must be wide enough to contain 95% of the future responses at a given x value. The confidence interval for a mean response only needs to contain the mean of all responses for a given x with 95% confidence. The following two formulas give the confidence interval for a mean response and a prediction interval for a new response at a given value x_0 of the predictor:

    ŷ ± t_{α/2} √( MS_res (1, x_0)(X′X)⁻¹(1, x_0)′ )    Confidence Interval for Mean Response    (10)

and

    ŷ ± t_{α/2} √( MS_res (1 + (1, x_0)(X′X)⁻¹(1, x_0)′) )    Prediction Interval for New Response    (11)

where the t-critical value t_{α/2} is based on n − 2 degrees of freedom. Note that in both formulas, (1, x_0)(X′X)⁻¹(1, x_0)′ corresponds to a 1×2 vector (1, x_0) times a 2×2 matrix (X′X)⁻¹ times the 2×1 transpose of (1, x_0). The prediction interval is wider than the confidence interval due to the added 1 underneath the radical in the prediction interval. Formulas (10) and (11) generalize easily to the multiple regression setting when there is more than one predictor variable. The confidence interval for the mean response can be rewritten, after multiplying out the terms, as

    ŷ ± t_{α/2} √( MS_res (1/n + (x_0 − x̄)²/SS_xx) ).

From this formula, one can see that the confidence interval for a mean response (and also the prediction interval) is narrowest when x_0 = x̄. Figure 7 shows both the confidence intervals for mean responses and the prediction intervals for new responses at each x value. The lower and upper ends of these intervals, plotted for all x values, form the upper and lower bands shown in Figure 7. The solid curves correspond to the confidence band, which is narrower than the prediction band plotted by the dashed curves. Both bands are narrowest at the point (x̄, ȳ) (the least squares regression line always passes through the point (x̄, ȳ)). Note that in this example, all of the actual weight measurements (the y_i's) lie inside the 95% prediction bands, as seen in Figure 7.

A note of caution is in order when using regression models for prediction. Using an estimated model to extrapolate outside the range where data was collected to fit the model is very dangerous. Often a straight line is a reasonable model relating a response y to a predictor (or regressor) x over a short interval of x values. However, over a broader range of x values the response may be markedly nonlinear, and the straight line fit over the small interval, when extrapolated over a larger interval, can give very poor or even downright nonsensical predictions. It is not unusual, for instance, that as the regressor variable gets larger (or smaller), the response levels off and approaches an asymptote. One such example is illustrated in Figure 8, which shows a scatterplot of the winning times in the Boston Marathon for men (open circles) and women (solid circles) each year. Also plotted are the least squares regression lines fitted to the data for men and women. If we were to extrapolate into the future using the straight line fits, then we would eventually predict that the fastest female runner would beat the fastest male runner. Not only that, the predicted times in the future for both men and women would eventually become negative, which is clearly impossible. It may be that the female champion will eventually beat the male champion at some point in the future, but we cannot use these models to predict this because these models were fit using data from the past. We do not know for sure what sort of model is applicable

Figure 7: Crystal growth data with the estimated regression line, along with the 95% confidence band for estimated mean responses (solid curves) and the 95% prediction band for predicted responses (dashed curves).

for future winning times. In fact, the straight line models plotted in Figure 8 are not even valid for the data shown. For instance, the data for the women show a rapid improvement in winning times over the first several years women were allowed to run the race, but then the winning times flatten out, indicating that a threshold is being reached for the fastest possible time in which the race can be run. This horizontal asymptote effect is evident for both males and females.

Problems

3. A calibration experiment with nuclear tanks was performed in an attempt to determine the volume of fluid in the tank based on the reading from a pressure gauge. The following data were derived from such an experiment, where y is the volume and x is the pressure:

    [Table: volume (y) and pressure (x) readings from the calibration experiment.]

a) Write out a simple linear regression model for this experiment.
b) Write down the design matrix X and the vector of responses y.
c) Find the least-squares estimates of the y-intercept and slope of the regression line. Plot the data and draw the estimated regression line in the plot.

Figure 8: Winning times (in seconds) in the Boston Marathon versus year for men (open circles) and women (solid circles). Also plotted are the least-squares regression lines for the men and women champions.

d) Find the estimated covariance matrix of the least-squares estimates.
e) Test if the slope of the regression line differs from zero using α = 0.05.
f) Find a 95% confidence interval for the slope of the regression line.
g) Estimate the mean volume for a pressure reading of x = 50 using a 95% confidence interval.
h) Predict the volume in the tank from a pressure reading of x = 50 using a 95% prediction interval.

7 Coefficient of Determination R²

The quantity SS_res is a measure of the variability in the response y after factoring out the dependence on the regressor x. A measure of total variability in the response, measured without regard to x, is

    SS_yy = Σ_{i=1}^n (y_i − ȳ)².

A useful statistic for measuring the proportion of variability in the y's accounted for by the regressor x is the coefficient of determination R², sometimes known simply as the R-squared:

    R² = 1 − SS_res/SS_yy.    (12)

In the car mileage example, SS_yy = 123.27 and SS_res = 9.38, so

    R² = 1 − 9.38/123.27 = 0.924.
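To tie formulas (10), (11), and (12) together, here is a short Matlab sketch that computes R² for the car mileage fit and then the 95% confidence and prediction intervals at an arbitrarily chosen weight x_0 = 2.5 (i.e. 2500 lbs); the t critical value assumes the Statistics Toolbox function tinv is available:

% R-squared and 95% interval estimates at x0 for the car mileage data (a sketch).
mpg = [27.5; 27.2; 34.1; 35.1; 31.8; 22.0];
wt  = [2.560; 2.300; 1.975; 1.915; 2.020; 2.815];
n = length(mpg);
X = [ones(n,1) wt];
bhat = (X'*X) \ (X'*mpg);
r = mpg - X*bhat;
SSres = r'*r;
SSyy = sum((mpg - mean(mpg)).^2);
R2 = 1 - SSres/SSyy             % coefficient of determination, formula (12)
MSres = SSres/(n-2);
x0 = 2.5;                       % weight (in 1000s of lbs) at which to estimate
v = [1 x0];
yhat0 = v*bhat;                 % point estimate of the mean response at x0
tcrit = tinv(0.975, n-2);       % t critical value on n-2 degrees of freedom
h = v*inv(X'*X)*v';             % the quadratic form appearing in (10) and (11)
ci   = yhat0 + [-1 1]*tcrit*sqrt(MSres*h)      % CI for the mean response, (10)
pred = yhat0 + [-1 1]*tcrit*sqrt(MSres*(1+h))  % prediction interval, (11)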

In the fixator example, the points are more tightly clustered about the regression line, and the corresponding coefficient of determination is higher than for the car mileage example (compare the plots in Figure 1 with Figure 4). By definition, R² is always between zero and one: 0 ≤ R² ≤ 1. If R² is close to one, then most of the variability in y is explained by the regression model. R² is often reported when summarizing a regression model. R² can also be computed in multiple regression (when there is more than one regressor variable) using the same formula above.

Many times a high R² is considered an indication that one has a good model, since most of the variability in the response is explained by the regressor variables. In fact, some experimenters use R² to compare various models. However, this can be problematic. R² always increases (or at least does not decrease) when you add regressors to a model. Thus, choosing a model based on the largest R² can lead to models with too many regressors. Another note of caution regarding R² is that a large value of R² does not necessarily mean that the fitted model is correct. It is not unusual to obtain a large R² when there is a fairly strong non-linear trend in the data. In simple linear regression, the coefficient of determination R² turns out to be the square of the sample correlation r.

8 Residual Analysis

The regression models considered so far are simple linear regression models, where it is assumed that the mean response y is a linear function of the regressor x. This is a very simple model and appears to work quite well in many examples. Even if the actual relation of y to x is non-linear, fitting a straight line model may provide a good approximation if we restrict the range of x to a small interval. In practice, one should not assume that a simple linear model will be sufficient for fitting data (except in special cases where there is a theoretical justification for a straight line model). Part of the problem in regression analysis is to determine an appropriate model relating the response y to the predictor x.

Recall that the simple linear regression model is y_i = β_0 + β_1 x_i + ε_i, where ε_i is a mean-zero random error. After fitting the model, the residuals r_i = y_i − ŷ_i mimic the random error. A useful diagnostic to assess how well a model fits the data is to plot the residuals versus the fitted values (ŷ_i). Such plots should show no structure. If there is evidence of structure in the residual plot, then it is likely that the fitted regression model is not the correct model. In such cases, a more complicated model may need to be fitted to the data, such as a polynomial model (see below) or a nonlinear regression model (not covered here). It is customary to plot the residuals versus the fitted values instead of residuals versus the actual y_i values. The reason is that the residuals are uncorrelated with the fitted

Figure 9: Left panel: Scatterplot of the full fixator data set and fitted regression line. Right panel: The corresponding residual plot.

values. Recall from the geometric derivation of the least squares estimators that the vector of residuals is orthogonal to the vector of fitted values (see Figure 3).

A word of caution is needed here. Humans are very adept at picking out patterns. Sometimes a scatterplot of randomly generated variates (i.e. noise) will show what appears to be a pattern. However, if the plot was generated by just random noise, then the patterns are superficial. The same problem can occur when examining a residual plot. One must be careful about finding structure in a residual plot when there really is no structure. Analyzing residual plots is an art that improves with lots of practice.

Example (Fixator example continued). When the external fixator example was introduced earlier, only a subset of the full data set was used to estimate the stiffness of the fixator. Figure 9 shows (in the left panel) a scatterplot of the full data set for values of force (x) near zero, when the machine was first turned on. Also plotted is the least squares regression line. From this picture, it appears as if a straight line model would fit the data well. However, the right panel shows the corresponding residual plot, which reveals a fairly strong structure indicating that a straight line does not fit the full data set well.

Example. Fuel efficiency data was obtained on 32 automobiles from Motor Trend US Magazine. The response of interest is the miles per gallon (mpg) of the automobiles. Figure 10 shows a scatterplot of mpg versus horsepower. Figure 10 shows that increasing horsepower corresponds to lower fuel efficiency. A simple linear regression model was fit to the data and the fitted line is shown in the left panel of Figure 11. The coefficient of determination for this fit is R² ≈ 0.60. A closer look at the data indicates a slight non-linear trend. The right panel of Figure 11

Figure 10: Scatterplot of Motor Trend car data: miles per gallon (mpg) versus gross horsepower for 32 different brands of cars.

shows a residual plot versus fitted values. The residual plot indicates that there may be a problem with the straight line fit: the residuals to the left and right are positive, and the residuals in the middle are mostly negative. This U-shaped pattern is indicative of a poor fit. To solve the problem, a different type of model needs to be considered, or perhaps a transformation of one or both variables may work.

Example: Anscombe's Regression Data. Anscombe (1973) simulated 4 very different data sets that produce identical least-squares regression lines. One of the benefits of this example is to illustrate the importance of plotting your data. Figure 12 shows scatterplots of the 4 data sets along with the fitted regression line. The top-left panel shows a nice scatter of points with a linear trend, and the regression line provides a nice fit to the data. The data in the top-right panel show a very distinct non-linear pattern. Although one can fit a straight line to such data, the straight line model is clearly wrong. Instead one could try to fit a quadratic curve (see polynomial regression). The bottom-left panel demonstrates how a single point can be very influential when a least-squares line is fit to the data. The points in this plot all lie on a line except a single point. The least squares regression line is pulled towards this single influential point. In a simple linear regression it is fairly easy to detect a highly influential point, as in this plot. However, in multiple regression (see next section) with several regressor variables, it can be difficult to detect influential points graphically. There exist many diagnostic tools for assessing how influential individual points are when fitting a model. There also exist robust regression techniques that prevent the fit of the line from being unduly influenced by a small number of observations. The bottom-right panel shows data from a very poorly designed experiment where all but one observation was obtained at one level of the x variable. The single point

Figure 11: Left panel: Scatterplot of MPG versus horsepower along with the fitted regression line. Right panel: Residual plot versus the fitted values ŷ_i.

on the right determines the slope of the fitted regression line.

9 Multiple Linear Regression

Often a response of interest may depend on several different factors, and consequently many regression applications involve more than a single regressor variable. A regression model with more than one regressor variable is known as a multiple regression model. The simple linear regression model can be generalized in a straightforward manner to incorporate other regressor variables, and the previous equations for simple linear regression continue to hold for multiple regression models. For illustration, consider the results of an experiment to study the effect of the mole contents of cobalt and the calcination temperature on the surface area of an iron-cobalt hydroxide catalyst (Said et al., 1994). The response variable in this experiment is y = surface area, and there are two regressor variables:

    x_1 = cobalt content
    x_2 = temperature.

The data from this experiment are given in the following table:

Figure 12: Anscombe's simple linear regression data. Four very different data sets yielding exactly the same least squares regression line.

    [Table: cobalt contents (x_1), temperature (x_2), and surface area (y) for the catalyst experiment.]

A general model relating the surface area y to the cobalt contents x_1 and temperature x_2

is

    y_i = f(x_i1, x_i2) + ε_i

where ε_i is a random error and f is some unknown function. Here x_i1 is the cobalt content for the ith unit in the data and x_i2 is the corresponding ith temperature measurement, for i = 1, ..., n. We can try approximating f by a first-order Taylor series approximation to get the following multiple regression model:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i.    (13)

If (13) is not adequate to model the response y, then we could try a higher-order Taylor series approximation such as:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_11 x_i1² + β_22 x_i2² + β_12 x_i1 x_i2 + ε_i.    (14)

The work required for finding the least squares estimators of the coefficients and the variances and covariances of these estimated parameters has already been done. The form of the least-squares solution from the simple linear regression model holds for the multiple regression model. This is where the matrix approach to the problem really pays off, because working out the details without using matrix algebra is very tedious. Consider a multiple linear regression model with k regressors x_1, ..., x_k:

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i.    (15)

The least squares estimators are given by

    β̂ = (β̂_0, β̂_1, ..., β̂_k)′ = (X′X)⁻¹X′y    (16)

just as in equation (8), where

    X = [1 x_11 x_12 ... x_1k;  1 x_21 x_22 ... x_2k;  ... ;  1 x_n1 x_n2 ... x_nk].    (17)

Note that the design matrix X has a column of ones in its first column for the intercept term β_0, just like in simple linear regression. The covariance matrix of the estimated coefficients is given by σ²(X′X)⁻¹, just as in equation (9), where σ² is the error variance, which is again estimated by the mean squared residual

    MS_res = σ̂² = Σ_{i=1}^n (y_i − ŷ_i)²/(n − k − 1).

Note that the degrees of freedom associated with the mean squared residual is n − k − 1, since we lose k + 1 degrees of freedom estimating the intercept and the k coefficients for the k regressors.
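A minimal Matlab sketch of the multiple regression computations in (16) follows; the x_1, x_2, and y values below are hypothetical illustrative numbers (not the cobalt catalyst data), used only to show how the design matrix, the estimates, and the mean squared residual are formed:

% Minimal multiple regression sketch for model (15) with k = 2 regressors.
% Hypothetical values, not the cobalt catalyst data.
x1 = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0];          % first regressor (hypothetical)
x2 = [200; 250; 300; 200; 250; 300];          % second regressor (hypothetical)
y  = [12.1; 15.3; 17.9; 13.0; 16.2; 19.1];    % response (hypothetical)
n = length(y);
k = 2;
X = [ones(n,1) x1 x2];                  % design matrix as in (17)
bhat = (X'*X) \ (X'*y);                 % least squares estimates, equation (16)
yhat = X*bhat;                          % fitted values
MSres = sum((y - yhat).^2)/(n - k - 1); % error variance estimate on n-k-1 df
Covbhat = MSres * inv(X'*X);            % estimated covariance matrix of bhat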

10 Coefficient Interpretation

In the simple linear regression model y_i = β_0 + β_1 x_i + ε_i, the slope represents the expected change in y given a unit change in x. In multiple regression the regression coefficients have a similar interpretation: β_j represents the expected change in the response for a unit change in x_j, given that all the other regressors are held constant. The problem in multiple regression is that the regressor variables are often correlated with one another; if you change one regressor, then the others may tend to change as well, and this makes the interpretation of regression coefficients very difficult in multiple regression. This is particularly true of observational studies, where there is no control over the conditions that generate the data. The next example illustrates the problem.

Heart Catheter Example. A study was conducted and data collected to fit a regression model to predict the length of a catheter needed to pass from a major artery at the femoral region into the heart for children (Weisberg, 1980). For 12 children, the proper catheter length was determined by checking with a fluoroscope that the catheter tip had reached the right position. It was hoped that the child's height and weight could be used to predict the proper catheter length. The data are given in the following table:

    [Table: height, weight, and catheter length for the n = 12 children.]

When the data are fit to a multiple regression model using height and weight as regressors, how do we interpret the resulting coefficients? The coefficient for height tells us how much longer the catheter needs to be for each additional inch of height of the child, provided the weight of the child stays constant. But the taller the child, the heavier the child tends to be. Figure 13 shows a scatterplot of weight versus height for the n = 12 children from this experiment. The plot shows a very strong linear relationship between height and weight, so the correlation between height and weight is very large. This large correlation complicates the interpretation of the regression coefficients. The problem of correlated regressor variables is known as collinearity (or multicollinearity), and it is an important problem that one needs to be aware of when fitting multiple regression models. We return to this example in the collinearity discussion below.
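A quick way to screen for this problem is to compute the correlation between candidate regressors before fitting the model. The following minimal sketch uses hypothetical height and weight values (not the Weisberg catheter data) purely to illustrate the check:

% Screening for collinearity between two regressors before fitting.
% Hypothetical height (inches) and weight (pounds) values for illustration only.
height = [42; 45; 48; 50; 53; 57; 60; 63];
weight = [38; 43; 50; 54; 60; 70; 77; 85];
R = corrcoef(height, weight);   % sample correlation matrix of the two regressors
r = R(1,2)                      % a value near 1 signals strong collinearity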

Figure 13: A scatterplot of weight versus height for the n = 12 children in an experiment used to predict the required length of a catheter to the heart based on the child's height and weight.

Fortunately, in designed experiments where the engineer has complete control over the regressor variables, data can often be collected in such a way that the estimated regression coefficients are uncorrelated. In such situations, the estimated coefficients can then be easily interpreted. To make this happen, one needs to choose the values of the regressors so that the off-diagonal terms of the (X′X)⁻¹ matrix (from (9)) that correspond to covariances between estimated coefficients of the regressors are all zero.

Cobalt Example Continued. Let us return to model (13) and estimate the parameters of the model. Using the data in the cobalt example, we can construct the

design matrix X in (17) and the response vector y. The least squares estimates of the parameters are given by β̂ = (β̂_0, β̂_1, β̂_2)′, where

    β̂ = (X′X)⁻¹X′y.

The matrix computations are tedious, but software packages like Matlab can perform them for us. Nonetheless, it is a good idea to understand exactly what the computer software packages are computing for us when we feed data into them. The mean squared residual for these data is σ̂² = MS_res, and the estimated covariance matrix of the estimated coefficients is σ̂²(X′X)⁻¹. Note that this was a well-designed experiment because the estimated coefficients for cobalt contents (β̂_1) and temperature (β̂_2) are uncorrelated: the covariance between them is zero, as can be seen in the covariance matrix.
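The zero covariance is a consequence of how the regressor levels were chosen. A small sketch (with a hypothetical balanced design using coded -1/+1 levels, not the actual cobalt and temperature settings) illustrates the idea: when the two regressor columns are centered and orthogonal, the corresponding off-diagonal entry of (X′X)⁻¹ is zero, so by (9) the estimated coefficients are uncorrelated.

% Illustration: a balanced design makes the coefficient estimates uncorrelated.
% Hypothetical coded levels, not the actual cobalt/temperature settings.
x1 = [-1; -1; +1; +1; -1; -1; +1; +1];   % coded level of regressor 1
x2 = [-1; +1; -1; +1; -1; +1; -1; +1];   % coded level of regressor 2
n = length(x1);
X = [ones(n,1) x1 x2];
C = inv(X'*X);   % proportional to Cov(bhat); see equation (9)
disp(C)          % the off-diagonal entries involving x1 and x2 are zero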

11 Analysis of Variance (ANOVA) for Multiple Regression

Because there are several regression parameters in the multiple regression model (15), a formal test can be conducted to see if the response depends on any of the regressor variables. That is, we conduct a single test of the hypothesis:

    H_0: β_1 = β_2 = ... = β_k = 0  versus  H_a: not all of the β_j's are zero.

The basic idea behind the testing procedure is to partition all the variability in the response into two pieces: variability due to the regression relation and variability due to the random error term. This is why the procedure is called Analysis of Variance (ANOVA). The total variability in the response is represented by the total sum of squares (SS_yy):

    SS_yy = Σ_{i=1}^n (y_i − ȳ)².

We have already defined the residual sum of squares:

    SS_res = Σ_{i=1}^n (y_i − ŷ_i)².

We can also define the Regression Sum of Squares (SS_reg):

    SS_reg = Σ_{i=1}^n (ŷ_i − ȳ)²,

which represents the variability in the y_i's explained by the multiple regression model. Note that for an individual measurement,

    (y_i − ȳ) = (y_i − ŷ_i) + (ŷ_i − ȳ).

If we square both sides of this equation and sum over all n observations, we will get SS_yy on the left-hand side. On the right-hand side (after doing some algebra) we will get SS_res + SS_reg only, because the cross-product terms sum to zero. This gives us the well-known variance decomposition formula:

    SS_yy = SS_reg + SS_res.    (18)

If all the β_j's are zero, then the regression model will explain very little of the variability in the response, in which case SS_reg will be small and SS_res will be large, relatively speaking. The ANOVA test then simply compares these components of variance with each other. However, in order to make the sums of squares comparable, we need to first divide each sum of squares by its respective degrees of freedom. The degrees of freedom associated with SS_reg and SS_res are k and n − k − 1, respectively.
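As a small numerical check of the decomposition (18), the following Matlab sketch (reusing the hypothetical two-regressor values from the earlier multiple regression sketch) computes the three sums of squares and verifies that they add up:

% Numerical check of the variance decomposition (18) for a multiple
% regression fit (hypothetical two-regressor values, as in the earlier sketch).
x1 = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0];
x2 = [200; 250; 300; 200; 250; 300];
y  = [12.1; 15.3; 17.9; 13.0; 16.2; 19.1];
n = length(y);
X = [ones(n,1) x1 x2];
bhat = (X'*X) \ (X'*y);
yhat = X*bhat;
SSyy  = sum((y - mean(y)).^2);     % total sum of squares
SSreg = sum((yhat - mean(y)).^2);  % regression sum of squares
SSres = sum((y - yhat).^2);        % residual sum of squares
disp([SSyy  SSreg + SSres])        % the two numbers agree, as in (18)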


This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

Using R for Linear Regression

Using R for Linear Regression Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

More information

Relationships Between Two Variables: Scatterplots and Correlation

Relationships Between Two Variables: Scatterplots and Correlation Relationships Between Two Variables: Scatterplots and Correlation Example: Consider the population of cars manufactured in the U.S. What is the relationship (1) between engine size and horsepower? (2)

More information

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

1.5 Oneway Analysis of Variance

1.5 Oneway Analysis of Variance Statistics: Rosie Cornish. 200. 1.5 Oneway Analysis of Variance 1 Introduction Oneway analysis of variance (ANOVA) is used to compare several means. This method is often used in scientific or medical experiments

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared. jn2@ecs.soton.ac.uk

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared. jn2@ecs.soton.ac.uk COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared jn2@ecs.soton.ac.uk Relationships between variables So far we have looked at ways of characterizing the distribution

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Describing Relationships between Two Variables

Describing Relationships between Two Variables Describing Relationships between Two Variables Up until now, we have dealt, for the most part, with just one variable at a time. This variable, when measured on many different subjects or objects, took

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

GRADES 7, 8, AND 9 BIG IDEAS

GRADES 7, 8, AND 9 BIG IDEAS Table 1: Strand A: BIG IDEAS: MATH: NUMBER Introduce perfect squares, square roots, and all applications Introduce rational numbers (positive and negative) Introduce the meaning of negative exponents for

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

17. SIMPLE LINEAR REGRESSION II

17. SIMPLE LINEAR REGRESSION II 17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Nonlinear Regression Functions. SW Ch 8 1/54/

Nonlinear Regression Functions. SW Ch 8 1/54/ Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 13 Introduction to Linear Regression and Correlation Analysis Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing

More information

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance 2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Correlation key concepts:

Correlation key concepts: CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Part II. Multiple Linear Regression

Part II. Multiple Linear Regression Part II Multiple Linear Regression 86 Chapter 7 Multiple Regression A multiple linear regression model is a linear model that describes how a y-variable relates to two or more xvariables (or transformations

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b.

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b. PRIMARY CONTENT MODULE Algebra - Linear Equations & Inequalities T-37/H-37 What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

3.1 Least squares in matrix form

3.1 Least squares in matrix form 118 3 Multiple Regression 3.1 Least squares in matrix form E Uses Appendix A.2 A.4, A.6, A.7. 3.1.1 Introduction More than one explanatory variable In the foregoing chapter we considered the simple regression

More information

Pre-Algebra 2008. Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems

Pre-Algebra 2008. Academic Content Standards Grade Eight Ohio. Number, Number Sense and Operations Standard. Number and Number Systems Academic Content Standards Grade Eight Ohio Pre-Algebra 2008 STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express large numbers and small

More information

Solving Mass Balances using Matrix Algebra

Solving Mass Balances using Matrix Algebra Page: 1 Alex Doll, P.Eng, Alex G Doll Consulting Ltd. http://www.agdconsulting.ca Abstract Matrix Algebra, also known as linear algebra, is well suited to solving material balance problems encountered

More information

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Descriptive statistics consist of methods for organizing and summarizing data. It includes the construction of graphs, charts and tables, as well various descriptive measures such

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

What are the place values to the left of the decimal point and their associated powers of ten?

What are the place values to the left of the decimal point and their associated powers of ten? The verbal answers to all of the following questions should be memorized before completion of algebra. Answers that are not memorized will hinder your ability to succeed in geometry and algebra. (Everything

More information

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

More information

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions

More information

3.4 Statistical inference for 2 populations based on two samples

3.4 Statistical inference for 2 populations based on two samples 3.4 Statistical inference for 2 populations based on two samples Tests for a difference between two population means The first sample will be denoted as X 1, X 2,..., X m. The second sample will be denoted

More information

Week 4: Standard Error and Confidence Intervals

Week 4: Standard Error and Confidence Intervals Health Sciences M.Sc. Programme Applied Biostatistics Week 4: Standard Error and Confidence Intervals Sampling Most research data come from subjects we think of as samples drawn from a larger population.

More information

Section 1.1. Introduction to R n

Section 1.1. Introduction to R n The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015

Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Multicollinearity Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Stata Example (See appendices for full example).. use http://www.nd.edu/~rwilliam/stats2/statafiles/multicoll.dta,

More information

Copyright 2007 by Laura Schultz. All rights reserved. Page 1 of 5

Copyright 2007 by Laura Schultz. All rights reserved. Page 1 of 5 Using Your TI-83/84 Calculator: Linear Correlation and Regression Elementary Statistics Dr. Laura Schultz This handout describes how to use your calculator for various linear correlation and regression

More information

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1) Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

More information

Unit 1 Equations, Inequalities, Functions

Unit 1 Equations, Inequalities, Functions Unit 1 Equations, Inequalities, Functions Algebra 2, Pages 1-100 Overview: This unit models real-world situations by using one- and two-variable linear equations. This unit will further expand upon pervious

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

Normal distribution. ) 2 /2σ. 2π σ

Normal distribution. ) 2 /2σ. 2π σ Normal distribution The normal distribution is the most widely known and used of all distributions. Because the normal distribution approximates many natural phenomena so well, it has developed into a

More information

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles. Math 1530-017 Exam 1 February 19, 2009 Name Student Number E There are five possible responses to each of the following multiple choice questions. There is only on BEST answer. Be sure to read all possible

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Display and Summarize Correlation for Direction and Strength Properties of Correlation Regression Line Cengage

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

An introduction to Value-at-Risk Learning Curve September 2003

An introduction to Value-at-Risk Learning Curve September 2003 An introduction to Value-at-Risk Learning Curve September 2003 Value-at-Risk The introduction of Value-at-Risk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk

More information

Unit 26 Estimation with Confidence Intervals

Unit 26 Estimation with Confidence Intervals Unit 26 Estimation with Confidence Intervals Objectives: To see how confidence intervals are used to estimate a population proportion, a population mean, a difference in population proportions, or a difference

More information

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Review MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) All but one of these statements contain a mistake. Which could be true? A) There is a correlation

More information

An analysis method for a quantitative outcome and two categorical explanatory variables.

An analysis method for a quantitative outcome and two categorical explanatory variables. Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

More information

Regression and Correlation

Regression and Correlation Regression and Correlation Topics Covered: Dependent and independent variables. Scatter diagram. Correlation coefficient. Linear Regression line. by Dr.I.Namestnikova 1 Introduction Regression analysis

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Elasticity. I. What is Elasticity?

Elasticity. I. What is Elasticity? Elasticity I. What is Elasticity? The purpose of this section is to develop some general rules about elasticity, which may them be applied to the four different specific types of elasticity discussed in

More information

Summary: Transformations. Lecture 14 Parameter Estimation Readings T&V Sec 5.1-5.3. Parameter Estimation: Fitting Geometric Models

Summary: Transformations. Lecture 14 Parameter Estimation Readings T&V Sec 5.1-5.3. Parameter Estimation: Fitting Geometric Models Summary: Transformations Lecture 14 Parameter Estimation eadings T&V Sec 5.1-5.3 Euclidean similarity affine projective Parameter Estimation We will talk about estimating parameters of 1) Geometric models

More information