Part II. Multiple Linear Regression


Chapter 7: Multiple Regression

A multiple linear regression model is a linear model that describes how a y-variable relates to two or more x-variables (or transformations of x-variables). For example, suppose that a researcher is studying factors that might affect systolic blood pressure for women aged 45 to 65 years old. The response variable is systolic blood pressure (Y). Suppose that two predictor variables of interest are age (X_1) and body mass index (X_2). The general structure of a multiple linear regression model for this situation would be

Y = β_0 + β_1 X_1 + β_2 X_2 + ε.

The equation β_0 + β_1 X_1 + β_2 X_2 describes the mean value of blood pressure for specific values of age and BMI. The error term (ε) describes the differences between individual values of blood pressure and their expected values.

One note concerning terminology: a linear model is one that is linear in the beta coefficients, meaning that each beta coefficient simply multiplies an x-variable or a transformation of an x-variable. For instance, y = β_0 + β_1 x + β_2 x^2 + ε is called a multiple linear regression model even though it describes a quadratic, curved relationship between y and a single x-variable.
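A model of this form can be fit in R with lm(). The sketch below uses simulated data purely to illustrate the syntax; the variable names (sbp, age, bmi) and the generating values are assumptions for this illustration, not data from these notes.

##########
set.seed(501)
n   <- 100
age <- runif(n, 45, 65)                              # ages between 45 and 65
bmi <- rnorm(n, mean = 27, sd = 4)                   # body mass index
sbp <- 85 + 0.6*age + 1.1*bmi + rnorm(n, sd = 8)     # simulated blood pressures
bp.data <- data.frame(sbp, age, bmi)

fit <- lm(sbp ~ age + bmi, data = bp.data)           # Y = beta0 + beta1*age + beta2*bmi + error
summary(fit)                                         # b0, b1, b2, standard errors, t-tests
##########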

7.1 About the Model

Notation for the Population Model

A population model for a multiple regression model that relates a y-variable to p - 1 predictor variables is written as

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ... + β_{p-1} x_{i,p-1} + ε_i.   (7.1)

We assume that the ε_i have a normal distribution with mean 0 and constant variance σ^2. These are the same assumptions that we used in simple regression with one x-variable. The subscript i refers to the i-th individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.

Estimates of the Model Parameters

The estimates of the β coefficients are the values that minimize the sum of squared errors for the sample. The exact formula will be given in the next chapter, when we introduce matrix notation. The letter b is used to represent a sample estimate of a β coefficient. Thus b_0 is the sample estimate of β_0, b_1 is the sample estimate of β_1, and so on.

MSE = SSE / (n - p) estimates σ^2, the variance of the errors. In the formula, n = sample size, p = number of β coefficients in the model, and SSE = sum of squared errors. Notice that for simple linear regression p = 2; thus we get the formula for MSE that we introduced in the context of one predictor.

In the case of two predictors, the estimated regression equation yields a plane (as opposed to a line in the simple linear regression setting). For more than two predictors, the estimated regression equation yields a hyperplane.
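Continuing the simulated sketch above, the pieces of the MSE formula can be pulled directly from a fitted model, which is one way to connect the formula to software output.

##########
e   <- resid(fit)            # residuals e_i
SSE <- sum(e^2)              # sum of squared errors
n   <- nrow(bp.data)
p   <- length(coef(fit))     # number of beta coefficients, including the intercept
MSE <- SSE / (n - p)         # estimates sigma^2
sqrt(MSE)                    # the "residual standard error" reported by summary(fit)
##########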

Predicted Values and Residuals

A predicted value is calculated as

ŷ_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + ... + b_{p-1} x_{i,p-1},

where the b values come from statistical software and the x-values are specified by us. A residual (error) term is calculated as e_i = y_i - ŷ_i, the difference between an actual and a predicted value of y.

A plot of residuals versus predicted values ideally should resemble a horizontal random band. Departures from this form indicate difficulties with the model and/or data. Other residual analyses can be done exactly as we did in simple regression. For instance, we might wish to examine a normal probability plot (NPP) of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. These might help us identify sources of curvature or nonconstant variance.

Interaction Terms

An interaction term is present when there is a coupling or combined effect of two or more independent variables. Suppose we have a response variable (Y) and two predictors (X_1 and X_2). Then the regression model with an interaction term is written as

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2 + ε.

Suppose you also have a third predictor (X_3). Then the regression model with all interaction terms is written as

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_1 X_2 + β_5 X_1 X_3 + β_6 X_2 X_3 + β_7 X_1 X_2 X_3 + ε.

In a model with more predictors, you can imagine how much the model grows by adding interactions. Just make sure that you have enough observations to cover the degrees of freedom used in estimating the corresponding regression coefficients!
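The residual checks and the interaction term described above could be carried out along the following lines, again using the simulated fit from earlier; the plots and the interaction formula are illustrative only.

##########
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # should look like a horizontal random band
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))            # normal probability plot of the residuals
plot(bp.data$age, resid(fit))                     # residuals versus each x-variable separately
plot(bp.data$bmi, resid(fit))

# An interaction term is the product of the two predictors:
fit.int <- lm(sbp ~ age + bmi + age:bmi, data = bp.data)   # equivalently sbp ~ age*bmi
summary(fit.int)
##########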

For each observation, the value of the interaction is found by multiplying the recorded values of the predictor variables in the interaction. In models with interaction terms, the significance of the interaction term should always be assessed first, before proceeding with significance testing of the main variables. If one of the main variables is removed from the model, then the model should not include any interaction terms involving that variable.

7.2 Significance Testing of Each Variable

Within a multiple regression model, we may want to know whether a particular x-variable is making a useful contribution to the model. That is, given the presence of the other x-variables in the model, does a particular x-variable help us predict or explain the y-variable? For instance, suppose that we have three x-variables in the model. The general structure of the model could be

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ε.   (7.2)

As an example, to determine whether variable X_1 is a useful predictor variable in this model, we could test

H_0: β_1 = 0
H_A: β_1 ≠ 0.

If the null hypothesis were true, then a change in the value of X_1 would not change Y, so Y and X_1 are not related; we would still be left with variables X_2 and X_3 being present in the model. When we cannot reject the null hypothesis above, we should say that we do not need variable X_1 in the model given that variables X_2 and X_3 will remain in the model. In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.

To carry out the test, statistical software will report p-values for all coefficients in the model. Each p-value is based on a t-statistic calculated as

t = (sample coefficient - hypothesized value) / (standard error of coefficient).

For our example above, the t-statistic is

t = (b_1 - 0) / s.e.(b_1) = b_1 / s.e.(b_1).

Note that the hypothesized value is usually just 0, so this portion of the formula is often omitted.

7.3 Examples

Example 1: Heat Flux Data Set

The data are from n = 29 homes used to test solar thermal energy. The variables of interest for our model are y = total heat flux, and x_1, x_2, and x_3, which are the focal points for the east, north, and south directions, respectively. There are two other measurements in this data set: another measurement of the focal points and the time of day. We will not utilize these predictors at this time. Table 7.1 gives the data used for this analysis. The regression model of interest is

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + ε_i.

Figure 7.1(a) gives a histogram of the residuals. While the shape is not completely bell-shaped, it is not suggestive of any severe departures from normality. Figure 7.1(b) gives a plot of the residuals versus the fitted values. The values appear to be randomly scattered about 0, suggesting constant variance. The following provides the t-tests for the individual regression coefficients:

##########
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  e-06 ***
east
north                                        e-12 ***
south                                        e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:         on 25 degrees of freedom
Multiple R-Squared:        , Adjusted R-squared:
F-statistic:        on 3 and 25 DF, p-value: 2.167e-11
##########
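Output of the kind shown above could be produced with code along the following lines. The data frame and column names (heat.flux with flux, east, north, south) are assumed here for illustration; the actual values are those listed in Table 7.1.

##########
fit.full <- lm(flux ~ east + north + south, data = heat.flux)
summary(fit.full)                         # t-tests for the individual coefficients

hist(resid(fit.full))                     # compare with Figure 7.1(a)
plot(fitted(fit.full), resid(fit.full))   # compare with Figure 7.1(b)

fit.reduced <- lm(flux ~ north + south, data = heat.flux)   # drop east and refit
summary(fit.reduced)
##########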

Figure 7.1: (a) Histogram of the residuals for the heat flux data set. (b) Plot of the residuals versus the fitted values.

At the α = 0.05 significance level, both north and south appear to be statistically significant predictors of heat flux. However, east is not. While we could claim east is a marginally significant predictor, we will rerun the analysis dropping the east predictor. The following provides the t-tests for the individual regression coefficients for the newly suggested model:

##########
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  e-12 ***
north                                        e-12 ***
south                                        e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:         on 26 degrees of freedom
Multiple R-Squared:        , Adjusted R-squared:
F-statistic:        on 2 and 26 DF, p-value: 8.938e-12
##########

The residual plots still appear okay (they are not included here) and we obtain new estimates for our model (in the above). Some things to note from this final analysis are:

- The final sample multiple regression equation is ŷ_i = b_0 + b_2 x_{i,2} + b_3 x_{i,3}, with the estimates taken from the output above. To use this equation for prediction, we substitute specified values for the two directions (i.e., north and south).
- We can interpret the slopes in the same way that we do for a straight-line model, but we have to add the constraint that the values of the other variables remain constant. When the south position is held constant, the average heat flux for a home decreases by |b_2| degrees for each 1-unit increase in the north position. When the north position is held constant, the average heat flux for a home increases by 4.80 degrees for each 1-unit increase in the south position.
- The value R^2 = 0.8587 means that the model (the two x-variables) explains 85.87% of the observed variation in a home's heat flux.
- The value √MSE is the estimated standard deviation of the residuals. Roughly, it is the average absolute size of a residual.

Example 2: Kola Project Data Set

The Kola Project involved extensive geological surveys of Finland, Norway, and Russia. The entire published data set consists of over 600 observations measured on 111 variables. Table 7.2 provides merely a subset of this data for three variables. The data are subsetted on the LITO variable for counts of 1. The sample size of this subset is n = 131. The investigators are interested in modeling the geological composition variable Cr_INAA as a function of Cr and Co. A scatterplot of this data with the least squares plane is provided in Figure 7.2.

In this 3D plot, observations above the plane (i.e., observations with positive residuals) are given by green points and observations below the plane (i.e., observations with negative residuals) are given by red points. The output for fitting a multiple linear regression model to this data is below:

##########
Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  e-05 ***
Cr                                           e-13 ***
Co
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:         on 128 degrees of freedom
Multiple R-squared: 0.544, Adjusted R-squared:
F-statistic:        on 2 and 128 DF, p-value: < 2.2e-16
##########

Note that Co is found to be not statistically significant. However, the scatterplot in Figure 7.2 clearly shows that the data are skewed to the right for each of the variables (i.e., the bulk of the data is clustered near the lower end of values for each variable, while there are fewer values as you increase along a given axis). In fact, a plot of the standardized residuals against the fitted values (Figure 7.3) indicates that a transformation is needed. Since the data appear skewed to the right for each of the variables, a log transformation of Cr_INAA, Cr, and Co will be taken. The scatterplot in Figure 7.4 shows the results of these transformations along with the new least squares plane. Clearly, the transformation has done a better job of linearizing the relationship.
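The raw and log-transformed fits could be obtained along these lines; the data frame and column names (kola with Cr_INAA, Cr, Co) are assumed for illustration.

##########
fit.raw <- lm(Cr_INAA ~ Cr + Co, data = kola)
plot(fitted(fit.raw), rstandard(fit.raw))        # standardized residuals vs. fitted (Figure 7.3)

kola$ln_cr_inaa <- log(kola$Cr_INAA)             # log transformations of all three variables
kola$ln_cr      <- log(kola$Cr)
kola$ln_co      <- log(kola$Co)

fit.log <- lm(ln_cr_inaa ~ ln_cr + ln_co, data = kola)
summary(fit.log)                                 # ln_co turns out not to be significant
fit.slr <- lm(ln_cr_inaa ~ ln_cr, data = kola)   # simple linear regression without ln_co
summary(fit.slr)
##########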

Figure 7.2: 3D scatterplot of the Kola data set with the least squares plane.

Figure 7.3: The standardized residuals versus the fitted values for the raw Kola data set.

Figure 7.4: 3D scatterplot of the Kola data set where the logarithm of each variable has been taken.

Figure 7.5: The standardized residuals versus the fitted values for the log-transformed Kola data set.

The output for fitting a multiple linear regression model to the transformed data is below:

##########
Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
ln_cr                                        e-10 ***
ln_co
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:         on 128 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 2 and 128 DF, p-value: < 2.2e-16
##########

There is also a noted improvement in the plot of the standardized residuals versus the fitted values (Figure 7.5). Notice that the log transformation of Co is not statistically significant, as it has a high p-value (0.375). After omitting the log transformation of Co from our analysis, a simple linear regression model is fit to the data. Figure 7.6 provides a scatterplot of the data and a plot of the standardized residuals against the fitted values. These plots, combined with the following simple linear regression output, indicate a highly statistically significant relationship between the log transformation of Cr_INAA and the log transformation of Cr.

##########
Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
ln_cr                                        < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:         on 129 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 1 and 129 DF, p-value: < 2.2e-16
##########

Figure 7.6: (a) Scatterplot of the Kola data set where the logarithm of Cr_INAA has been regressed on the logarithm of Cr. (b) Plot of the standardized residuals versus the fitted values for this simple linear regression fit.

Table 7.1: The heat flux for homes data, with columns i, Flux, Insolation, East, South, North, and Time.

Table 7.2: The subset of the Kola data. Here X_1, X_2, and Y are the variables Cr, Co, and Cr_INAA, respectively.

Chapter 8: Matrix Notation in Regression

There are two main reasons for using matrices in regression. First, the notation simplifies the writing of the model. Second, and most importantly, the matrix formulas provide the means by which statistical software calculates the estimated coefficients and their standard errors, as well as the set of predicted values for the observed sample. If necessary, a review of matrices and some of their basic properties can be found in Appendix B.

8.1 Matrices and Regression

In matrix notation, the theoretical regression model for the population is written as

Y = Xβ + ε.

The four different items in the equation are:

1. Y is an n-dimensional column vector that vertically lists the y-values:

Y = (Y_1, Y_2, ..., Y_n)^T.

2. The X matrix is a matrix in which each row gives the x-variable data for a different observation. The first column equals 1 for all observations (unless doing a regression through the origin), and each column after the first gives the data for a different variable.

There is a column for each variable, including any added interactions, transformations, indicators, and so on. The abstract formulation is

X = [ 1  X_{1,1}  ...  X_{1,p-1}
      1  X_{2,1}  ...  X_{2,p-1}
      .  .             .
      1  X_{n,1}  ...  X_{n,p-1} ].

In the subscripting, the first value is the observation number and the second is the variable number. The first column is always a column of 1's. The X matrix has n rows and p columns.

3. β is a p-dimensional column vector listing the coefficients:

β = (β_0, β_1, ..., β_{p-1})^T.

Notice the subscripts used in numbering the β's. As an example, for simple linear regression, β = (β_0  β_1)^T. The β vector will contain symbols, not numbers, as it gives the population parameters.

4. ε is an n-dimensional column vector listing the errors:

ε = (ε_1, ε_2, ..., ε_n)^T.

Again, we will not have numerical values for the ε vector.

As an example, suppose that data for a y-variable and two x-variables are as given in Table 8.1. For the model

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,1} x_{i,2} + ε_i,

the matrices Y, X, β, and ε are as follows:

Table 8.1: A sample data set of six observations, listing y_i, x_{i,1}, and x_{i,2}.

Y is the 6 x 1 column vector of the observed y_i values; X is the 6 x 4 matrix whose columns are a column of 1's, the x_{i,1} values, the x_{i,2} values, and the interaction values x_{i,1} x_{i,2}; β = (β_0, β_1, β_2, β_3)^T; and ε = (ε_1, ε_2, ..., ε_6)^T.

1. Notice that the first column of the X matrix equals 1 for all rows (observations), the second column gives the values of x_{i,1}, the third column lists the values of x_{i,2}, and the fourth column gives the interaction values x_{i,1} x_{i,2}.

2. For the theoretical model, we do not know the values of the beta coefficients or the errors. In those two matrices (column vectors), we can only list the symbols for these items.

3. There is a slight abuse of notation here, which often happens when writing regression models in matrix notation. It was stated earlier that capital letters are reserved for random variables and lower-case letters for realizations. In this example, capital letters have been used for the realizations. There should be no misunderstanding, as it will usually be clear from context whether we are discussing random variables or their realizations.

Finally, using calculus rules for matrices, it can be derived that the ordinary least squares estimates of the β coefficients are calculated using the matrix formula

b = (X^T X)^{-1} X^T y.

This formula minimizes the sum of squared errors

Σ_i e_i^2 = e^T e = (Y - Ŷ)^T (Y - Ŷ) = (Y - Xb)^T (Y - Xb),

where b = (b_0, b_1, ..., b_{p-1})^T. As in the simple linear regression case, these regression coefficient estimators are unbiased (i.e., E(b) = β). The formula above is used by statistical software to calculate the values of the sample coefficients.

An important theorem in regression analysis (and in statistics in general) is the Gauss-Markov Theorem, which we alluded to earlier. Since we now have the proper matrix notation in place, we can formalize this very important result.

Theorem 1 (Gauss-Markov Theorem) Suppose that we have the linear regression model Y = Xβ + ε, where E(ε_i | X) = 0 and Var(ε_i | X) = σ^2 for all i = 1, ..., n. Then β̂ = b = (X^T X)^{-1} X^T Y is an unbiased estimator of β and has the smallest variance among all linear unbiased estimators of β.

An estimator which is unbiased and has smaller variance than any other linear unbiased estimator is called a best linear unbiased estimator, or BLUE.

An important note regarding the matrix expressions introduced above is that

Ŷ = Xb = X(X^T X)^{-1} X^T Y = HY

and

e = Y - Ŷ = Y - HY = (I_{n x n} - H)Y,

where H = X(X^T X)^{-1} X^T is the n x n hat matrix. H is important for several reasons, as it appears often in regression formulas. One important property of H is that it is a projection matrix: it projects the response vector Y onto the column space of X (i.e., it expresses the fitted values as a linear combination of the columns of the X matrix) in order to obtain the vector of fitted values Ŷ. Also, the diagonal of this matrix contains the h_{j,j} values we introduced earlier in the context of Studentized residuals, which is important when discussing leverage.

8.2 Variance-Covariance Matrix of b

Two important characteristics of the sample multiple regression coefficients are their standard errors and their correlations with each other. The variance-covariance matrix of the sample coefficients b is a symmetric p x p square matrix. Remember that p is the number of beta coefficients in the model (including the intercept). The rows and the columns of the variance-covariance matrix are in coefficient order (the first row is information about b_0, the second is about b_1, and so on).

The diagonal values (from top left to bottom right) are the variances of the sample coefficients (written as Var(b_i)). The standard error of a coefficient is the square root of its variance. An off-diagonal value is the covariance between two coefficient estimates (written as Cov(b_i, b_j)). The correlation between two coefficient estimates can be determined using the relationship correlation = covariance divided by the product of the standard deviations (written as Corr(b_i, b_j)).

In regression, the theoretical variance-covariance matrix of the sample coefficients is

V(b) = σ^2 (X^T X)^{-1}.

Recall that the MSE estimates σ^2, so the estimated variance-covariance matrix of the sample beta coefficients is calculated as

V̂(b) = MSE (X^T X)^{-1}.
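These matrix formulas can be checked numerically against lm(). The sketch below builds X with model.matrix() and reproduces b, the hat matrix, the fitted values, and V̂(b); it reuses the simulated blood-pressure fit from the Chapter 7 sketch purely as an illustration.

##########
X <- model.matrix(fit)                  # n x p design matrix (first column all 1's)
y <- bp.data$sbp
XtX.inv <- solve(t(X) %*% X)            # (X'X)^{-1}

b    <- XtX.inv %*% t(X) %*% y          # least squares estimates; matches coef(fit)
H    <- X %*% XtX.inv %*% t(X)          # hat matrix
yhat <- H %*% y                         # fitted values; matches fitted(fit)

n <- nrow(X); p <- ncol(X)
MSE <- sum((y - yhat)^2) / (n - p)
V.b <- MSE * XtX.inv                    # estimated variance-covariance matrix of b
sqrt(diag(V.b))                         # standard errors; match summary(fit)
vcov(fit)                               # lm's own version of the same matrix
##########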

100(1 - α)% confidence intervals are also readily available for the β's:

b_j ± t_{n-p; 1-α/2} √(V̂(b)_{j,j}),

where V̂(b)_{j,j} is the j-th diagonal element of the estimated variance-covariance matrix of the sample beta coefficients (so its square root is the estimated standard error of b_j). Furthermore, the Bonferroni joint 100(1 - α)% confidence intervals are

b_j ± t_{n-p; 1-α/(2p)} √(V̂(b)_{j,j}),

for j = 0, 1, 2, ..., p - 1.

8.3 Statistical Intervals

The statistical intervals for estimating the mean or predicting new observations in the simple linear regression case extend easily to the multiple regression case. Here it is only necessary to present the formulas. First, let us define the vector of given predictor values as

X_h = (1, X_{h,1}, X_{h,2}, ..., X_{h,p-1})^T.

We are interested in either intervals for E(Y | X = X_h) or intervals for the value of a new response y given that the observation has the particular value X_h. First we define the standard error of the fit at X_h, given by

s.e.(ŷ_h) = √(MSE · X_h^T (X^T X)^{-1} X_h).

Now we can give the formulas for the various intervals:

100(1 - α)% Confidence Interval:

ŷ_h ± t_{n-p; 1-α/2} s.e.(ŷ_h).

Bonferroni Joint 100(1 - α)% Confidence Intervals:

ŷ_{h_i} ± t_{n-p; 1-α/(2q)} s.e.(ŷ_{h_i}),

for i = 1, 2, ..., q.

100(1 - α)% Working-Hotelling Confidence Band:

ŷ_h ± √(p F_{p, n-p; 1-α}) s.e.(ŷ_h).

100(1 - α)% Prediction Interval:

ŷ_h ± t_{n-p; 1-α/2} √(MSE/m + [s.e.(ŷ_h)]^2),

where m = 1 corresponds to a prediction interval for a new observation at a given X_h and m > 1 corresponds to the mean of m new observations calculated at the same X_h.

Bonferroni Joint 100(1 - α)% Prediction Intervals:

ŷ_{h_i} ± t_{n-p; 1-α/(2q)} √(MSE + [s.e.(ŷ_{h_i})]^2),

for i = 1, 2, ..., q.

Scheffé Joint 100(1 - α)% Prediction Intervals:

ŷ_{h_i} ± √(q F_{q, n-p; 1-α} (MSE + [s.e.(ŷ_{h_i})]^2)),

for i = 1, 2, ..., q.

[100(1 - α)%]/[100 P%] Tolerance Intervals:

One-Sided Intervals: (-∞, ŷ_h + K_{α,P} √MSE) and (ŷ_h - K_{α,P} √MSE, +∞) are the upper and lower one-sided tolerance intervals, respectively, where K_{α,P} is found similarly as in the simple linear regression setting, but with n = (X_h^T (X^T X)^{-1} X_h)^{-1}.

Two-Sided Interval: ŷ_h ± K_{α/2, P/2} √MSE, where K_{α/2, P/2} is found similarly as in the simple linear regression setting, but with n as given above and f = n - p, where p is the dimension of X_h.

8.4 Example

Example: Heat Flux Data Set (continued)

Refer back to the heat flux data set, where only north and south were used as predictors of heat flux. The MSE for this model was given with the regression output in Chapter 7. However, if we are interested in the full variance-covariance matrix and correlation matrix of the coefficient estimates, then these must be calculated by hand by finding (X^T X)^{-1} and forming

V̂(b) = MSE (X^T X)^{-1}.

Taking the square roots of the diagonal terms of this matrix gives the values of s.e.(b_0), s.e.(b_1), and s.e.(b_2). We can also calculate the correlation matrix of b (denoted by r_b) for this data set; its (i, j) entry is

Corr(b_i, b_j) = Cov(b_i, b_j) / √(Var(b_i) Var(b_j)),

so its diagonal entries all equal 1.
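In practice these matrices are not computed by hand. A sketch, assuming the reduced heat flux fit and the column names from the Chapter 7 sketch:

##########
V.b <- vcov(fit.reduced)     # MSE * (X'X)^{-1}, the estimated variance-covariance matrix of b
sqrt(diag(V.b))              # s.e.(b0), s.e.(b1), s.e.(b2)
r.b <- cov2cor(V.b)          # correlation matrix of the coefficient estimates, r_b

cor(heat.flux[, c("flux", "north", "south")])   # the variable correlation matrix r, by contrast
##########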

r_b is an estimate of the population correlation matrix ρ_b. For example, Corr(b_1, b_2) is a fairly low, negative correlation between the average change in flux for each unit increase in the south position and the average change for each unit increase in the north position. Therefore, the presence of the north position only slightly affects the estimate of the south's beta coefficient. The consequence is that it is fairly easy to separate the individual effects of these two variables. Note that we usually do not care about correlations involving the intercept b_0, since we usually wish to provide an interpretation concerning the x-variables.

If all x-variables are uncorrelated with each other, then all covariances between pairs of sample coefficients that multiply x-variables will equal 0. This means that the estimate of one beta is not affected by the presence of the other x-variables. Many experiments are designed to achieve this property, but achieving it with real data is often a different story.

The correlation matrix presented above should NOT be confused with the correlation matrix, r, constructed for each pairwise combination of the variables Y, X_1, X_2, ..., X_{p-1}; namely

r = [ 1                 Corr(Y, X_1)        ...  Corr(Y, X_{p-1})
      Corr(X_1, Y)      1                   ...  Corr(X_1, X_{p-1})
      .                 .                        .
      Corr(X_{p-1}, Y)  Corr(X_{p-1}, X_1)  ...  1 ].

Note that all of the diagonal entries are 1 because the correlation between a variable and itself is a perfect (positive) association. This correlation matrix is what most statistical software reports; software does not always report r_b. The interpretation of each entry in r is identical to the Pearson correlation coefficient interpretation presented earlier. Specifically, it provides the strength and direction of the association between the variables corresponding to the row and column of the respective entry. For this example, the correlation matrix r can be computed directly from the sample values of flux, north, and south.

We can also calculate the 95% confidence intervals for the regression coefficients. First note that t_{26; 0.975} = 2.056. The 95% confidence interval for β_1 is calculated as b_1 ± 2.056 · s.e.(b_1), and the interval for β_2 is calculated analogously as b_2 ± 2.056 · s.e.(b_2).

Thus, we are 95% confident that the true population regression coefficients for the north and south focal points lie in the resulting intervals (the interval for the south coefficient, for instance, has lower endpoint 2.8413).
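Confidence intervals for the coefficients and the intervals of Section 8.3 are available directly from a fitted model; a sketch under the same assumed names, with made-up values for X_h:

##########
confint(fit.reduced, level = 0.95)       # b_j +/- t_{n-p; 0.975} * s.e.(b_j)

new.home <- data.frame(north = 16, south = 37)    # hypothetical north/south focal points
predict(fit.reduced, newdata = new.home,
        interval = "confidence", level = 0.95)    # CI for E(Y | X = X_h)
predict(fit.reduced, newdata = new.home,
        interval = "prediction", level = 0.95)    # PI for a single new response at X_h
##########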

Chapter 9: Indicator Variables

We next discuss how to include categorical predictor variables in a regression model. A categorical variable is a variable for which the possible outcomes are nameable characteristics, groups, or treatments. Some examples are gender (male or female), highest educational degree attained (secondary school, college undergraduate, college graduate), and blood pressure medication used (drug 1, drug 2, drug 3).

We use indicator variables to incorporate a categorical x-variable into a regression model. An indicator variable equals 1 when an observation is in a particular group and equals 0 when an observation is not in that group. An interaction between an indicator variable and a quantitative variable exists if the slope between the response and the quantitative variable depends upon the specific value present for the indicator variable.

9.1 The Leave One Out Method

When a categorical predictor variable has k categories, it is possible to define k indicator variables. However, as explained later, we should only use k - 1 of them as predictor variables in the regression model.

Let us consider an example where we are analyzing data from a clinical trial done to compare the effectiveness of three different medications used to treat high blood pressure. The n = 90 participants are randomly divided into three groups of 30 patients, and each group is assigned a different medication. The response variable is the reduction in diastolic blood pressure over a 3-month period.

In addition to the treatment variable, two other predictor variables will be X_1 = age and X_2 = body mass index. We are examining three different treatments, so we can define the following three indicator variables for the treatment:

X_3 = 1 if the patient used treatment 1, 0 otherwise;
X_4 = 1 if the patient used treatment 2, 0 otherwise;
X_5 = 1 if the patient used treatment 3, 0 otherwise.

On the surface, it seems that our model should be the following overparameterized model, a model that requires us to make a modification in order to estimate coefficients:

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + β_5 x_{i,5} + ε_i.   (9.1)

The difficulty with this model is that the X matrix has a linear dependency, so we cannot estimate the individual coefficients (technically, this is because there would be an infinite number of solutions for the betas). The dependency stems from the fact that X_{i,3} + X_{i,4} + X_{i,5} = 1 for all observations, because each patient uses one (and only one) of the treatments. In the X matrix, the linear dependency is that the sum of the last three columns equals the first column (all 1's). This scenario leads to what is called collinearity, and we investigate it in the next chapter.

One solution (there are others) for avoiding this difficulty is the leave one out method. The leave one out method follows the general rule that whenever a categorical predictor variable has k categories, it is possible to define k indicator variables, but we should only use k - 1 of them to describe the differences among the k categories. For the overall fit of the model, it does not matter which set of k - 1 indicators we use. The choice of which k - 1 indicator variables we use, however, does affect the interpretation of the coefficients that multiply the specific indicators in the model. In our example with three treatments (and three possible indicator variables), we might leave out the third indicator, giving us this model:

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + ε_i.   (9.2)

For the overall fit of the model, it would work equally well to leave out the first indicator and include the other two, or to leave out the second and include the first and third.
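In R, the leave one out coding happens automatically when the treatment is stored as a factor; the sketch below uses hypothetical names (a data frame bp.trial with columns reduction, age, bmi, and treatment).

##########
bp.trial$treatment <- factor(bp.trial$treatment)   # levels, e.g., "trt1", "trt2", "trt3"
fit.ind <- lm(reduction ~ age + bmi + treatment, data = bp.trial)

head(model.matrix(fit.ind))   # shows the k - 1 = 2 indicator columns R creates;
                              # by default the first factor level is the one left out
bp.trial$treatment <- relevel(bp.trial$treatment, ref = "trt3")   # leave out treatment 3 instead
##########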

9.2 Coefficient Interpretations

The interpretation of the coefficients that multiply indicator variables is tricky. With the leave one out method, a coefficient multiplying an indicator in the model measures the difference between the group defined by that indicator and the group defined by the indicator that was left out. Usually, a control or placebo group is the one that is left out.

Let us consider our example again. We are predicting decreases in blood pressure in response to X_1 = age, X_2 = body mass, and which of three different treatments a person used. The variables X_3 and X_4 are indicators of the treatment, as defined above. The model we will examine is

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + ε_i.

To see what is going on, look at each treatment separately by substituting the appropriately defined values of the two indicators into the equation.

For treatment 1, by definition X_3 = 1 and X_4 = 0, leading to

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3(1) + β_4(0) + ε_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 + ε_i.

For treatment 2, by definition X_3 = 0 and X_4 = 1, leading to

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3(0) + β_4(1) + ε_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_4 + ε_i.

For treatment 3, by definition X_3 = 0 and X_4 = 0, leading to

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3(0) + β_4(0) + ε_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ε_i.

Now compare the three equations to each other. The only difference between the equations for treatments 1 and 3 is the coefficient β_3. The only difference between the equations for treatments 2 and 3 is the coefficient β_4. This leads to the following meanings for the coefficients:

β_3 = difference in mean response for treatment 1 versus treatment 3, assuming the same age and body mass.

β_4 = difference in mean response for treatment 2 versus treatment 3, assuming the same age and body mass.

Here the coefficients are measuring differences from the third treatment. With the leave one out method, a coefficient multiplying an indicator in the model measures the difference between the group defined by the indicator in the model and the group defined by the indicator that was left out.

IMPORTANT CAUTIONS: Notice that the coefficient that multiplies an indicator variable in the model does not retain the meaning implied by the definition of the indicator. It is common for students to wrongly state that a coefficient measures the difference between that group and the other groups. That is WRONG! It is also incorrect to say only that a coefficient multiplying an indicator measures the effect of being in that group. An effect has to involve a comparison; with the leave one out method, it is a comparison to the group associated with the indicator that was left out.

One application where many indicator variables (or binary predictors) are used is conjoint analysis, a marketing tool that attempts to capture a respondent's preference given the presence or absence of various attribute levels. The X matrix is called a dummy matrix, as it consists of only 1's and 0's. The response is then regressed on the indicators using ordinary least squares, and researchers attempt to quantify items like the identification of different market segments, predicted profitability, or the predicted impact of a new competitor.

One additional note is that, in theory, with a linear dependence there are an infinite number of suitable solutions for the betas (as will be seen with multicollinearity). With the leave one out method, we are picking one with a particular meaning, and the resulting coefficients measure differences from the specified group. A method often used in courses focused strictly on ANOVA or design of experiments offers a different meaning for what we estimate: there it is more common to parameterize in a way such that a coefficient measures how a group differs from an overall average.

9.3 Testing Overall Group Differences

To test the overall significance of a categorical predictor variable, we use a general linear F-test procedure (which is developed in detail later). We form the reduced model by dropping the indicator variables from the model.

More technically, the null hypothesis is that the coefficients multiplying the indicators all equal 0. For our example with three treatments of high blood pressure and additional x-variables age and body mass, the details for doing an overall test of treatment differences are:

Full model: y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + ε_i.
Null hypothesis: H_0: β_3 = β_4 = 0.
Reduced model: y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ε_i.

9.4 Interactions

To examine a possible interaction between a categorical predictor and a quantitative predictor, include product variables between each indicator and the quantitative variable. As an example, suppose we thought there could be an interaction between the body mass variable (X_2) and the treatment variable. This would mean that we think treatment differences in blood pressure reduction depend on the specific value of body mass. The model we would use is

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + β_5 x_{i,2} x_{i,3} + β_6 x_{i,2} x_{i,4} + ε_i.

To test whether there is an interaction, the null hypothesis is H_0: β_5 = β_6 = 0. We would use the general linear F-test procedure to carry out the test (a sketch of both tests follows). The full model is the interaction model given above. The reduced model is now

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} + ε_i.
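Both tests are nested-model (general linear F-test) comparisons, which anova() carries out when given the reduced and full fits; the sketch continues with the hypothetical names used above.

##########
fit.none  <- lm(reduction ~ age + bmi, data = bp.trial)                     # no treatment terms
fit.main  <- lm(reduction ~ age + bmi + treatment, data = bp.trial)         # full model for 9.3
fit.inter <- lm(reduction ~ age + bmi + treatment + bmi:treatment,
                data = bp.trial)                                            # full model for 9.4

anova(fit.none, fit.main)    # F test of H0: beta3 = beta4 = 0 (overall treatment effect)
anova(fit.main, fit.inter)   # F test of H0: beta5 = beta6 = 0 (no interaction)
##########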

A visual way to assess whether there is an interaction is an interaction plot. An interaction plot is created by plotting the response versus the quantitative predictor and connecting the successive values according to the grouping of the observations. Recall that an interaction between factors occurs when the change in response from lower levels to higher levels of one factor is not quite the same as the change from lower levels to higher levels of another factor. Interaction plots allow us to compare the relative strength of the effects across factors. What results is one of three possible trends:

- The lines could be (nearly) parallel, which indicates no interaction. This means that the change in the response from lower levels to higher levels is roughly the same for each group.
- The lines intersect within the scope of the study, which indicates an interaction. This means that the change in the response from lower levels to higher levels of one factor is noticeably different for different levels of the other factor. This type of interaction is called a disordinal interaction.
- The lines do not intersect within the scope of the study, but the trends indicate that if we were to extend the levels of our factors, then we might see an interaction. This type of interaction is called an ordinal interaction.

Figure 9.1 illustrates each type of interaction plot using a mock data set pertaining to the mean tensile strength measured at three different speeds for 3 different processes. The upper left plot illustrates the case where no interaction is present, because the change in mean tensile strength is similar for each process as you increase the speed (i.e., the lines are parallel). The upper right plot illustrates an interaction, because as the speeds increase, the change in mean tensile strength is noticeably different depending on which process is being used (i.e., the lines cross). The bottom plot illustrates an ordinal interaction: no interaction is present within the scope of the range of speeds studied, but if these trends continued for higher speeds, then we might see an interaction (i.e., the lines may cross).

It should also be noted that just because lines cross, it does not necessarily imply that the interaction is statistically significant. Lines which appear nearly parallel, yet cross at some point, may not yield a statistically significant interaction term. If two lines cross, the more different the slopes appear and the more data that are available, the more likely the interaction term will be significant.

Figure 9.1: (a) A plot of no interaction amongst the groups (notice how the lines are nearly parallel). (b) A plot of a disordinal interaction amongst the groups (notice how the lines intersect). (c) A plot of an ordinal interaction amongst the groups (notice how the lines don't intersect, but if we were to extrapolate beyond the predictor limits, then the lines would likely cross).
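Plots in the style of Figure 9.1 can be drawn with interaction.plot() when the quantitative predictor takes only a few distinct values, as with the three speeds in the tensile strength illustration. The numbers below are invented solely to show the call.

##########
speed    <- rep(c(1, 2, 3), times = 3)
process  <- factor(rep(c("A", "B", "C"), each = 3))
strength <- c(40, 45, 50,  42, 47, 52,  44, 49, 54)   # parallel profiles: no interaction

interaction.plot(x.factor = speed, trace.factor = process, response = strength,
                 xlab = "Speed", ylab = "Mean tensile strength", trace.label = "Process")
##########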

9.5 Relationship to ANCOVA

When dealing with categorical predictors in regression analysis, we often say that we are performing a regression with indicator variables or a regression with interactions (if we are interested in testing for interactions between indicator variables and other variables). However, in the design and analysis of experiments literature, this model is also used, but with a slightly different motivation. Various experimental layouts using ANOVA tables are commonly used in the design and analysis of experiments. These ANOVA tables are constructed to compare the means of several levels of one or more treatments. For example, a one-way ANOVA can be used to compare six different dosages of blood pressure pills and the mean blood pressure of individuals who are taking one of those six dosages. In this case, there is one factor with six different levels. Suppose further that there are four different races represented in this study. Then a two-way ANOVA can be used, since we have two factors: the dosage of the pill and the race of the individual taking the pill. Furthermore, an interaction term can be included if we suspect that the dosage a person is taking and the race of the individual have a combined effect on the response. As you can see, you can extend to the more general n-way ANOVA (with or without interactions) for the setting with n treatment factors. However, dealing with n > 2 can often lead to difficulty in interpreting the results.

One other important thing to point out is that, while ANOVA models use least squares for estimation, they differ from how categorical variables are handled in a regression model. In an ANOVA model, a parameter is estimated for each factor level mean, and these are used in the linear model of the ANOVA. This differs slightly from a regression model, which estimates a regression coefficient for, say, n - 1 indicator variables (assuming there are n levels of the categorical variable and we are using the leave one out method). Also, ANOVA models utilize ANOVA tables, which are broken down by each factor (i.e., you would look at the sums of squares for each factor present); ANOVA tables for regression models simply test whether the regression model has at least one variable which is a significant predictor of the response. More details on these differences are better left to a course on design of experiments.

When there is also a continuous variable measured with each response, the n-way ANOVA model needs to reflect the continuous variable. This model is then referred to as an Analysis of Covariance (or ANCOVA) model. The continuous variable in an ANCOVA model is usually called the covariate or sometimes the concomitant variable. One difference in how an ANCOVA model is approached is that an interaction between the covariate and each factor is always tested first.

The reason is that an ANCOVA is conducted to investigate the overall relationship between the response and the covariate while assuming this relationship is true for all groups (i.e., for all treatment levels). If this relationship does differ across the groups, then the overall regression model is inaccurate. This assumption is called the assumption of homogeneity of slopes. It is assessed by testing for parallel slopes, which involves testing the interaction term between the covariate and each factor in the ANCOVA table. If the interaction is not statistically significant, then you can claim parallel slopes and proceed to build the ANCOVA model. If the interaction is statistically significant, then the regression model used is not appropriate and an ANCOVA model should not be used.

As an example of how to write ANCOVA models, first consider the one-way ANCOVA setting. Suppose we have i = 1, ..., I treatments and each treatment has j = 1, ..., J_i pairs of continuous variables measured (i.e., (x_{i,1}, y_{i,1}), ..., (x_{i,J_i}, y_{i,J_i})). Then the one-way ANCOVA model is written as

y_{i,j} = α_i + β x_{i,j} + ε_{i,j},

where α_i is the mean of the i-th treatment level, β is the common regression slope, and the ε_{i,j} are iid normal with mean 0 and variance σ^2. Note that the test of parallel slopes concerns testing whether β is the same for all treatment levels versus not the same for all levels. A high p-value indicates that we have parallel slopes (homogeneity of slopes) and can therefore use an ANCOVA model.

9.6 Coded Variables

In the early days, when computing power was limited, coding of the variables simplified the linear algebra and thus allowed least squares solutions to be computed manually. Many methods exist for coding data, such as:

- Converting variables to two values (e.g., {-1, 1} or {0, 1}).
- Converting variables to three values (e.g., {-1, 0, 1}).
- Coding continuous variables to reflect only the important digits (e.g., if the costs of various nuclear programs range from $100,000 to $150,000, coding can be done by dividing through by $100,000, resulting in the range being from 1 to 1.5; a small sketch of this appears below).
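A minimal sketch of the last coding idea, using the dollar figures from the bullet above:

##########
cost       <- c(100000, 125000, 150000)   # original scale
cost.coded <- cost / 100000               # coded scale: 1.00, 1.25, 1.50
# Regressing on cost.coded gives the same fitted values as regressing on cost;
# only the magnitudes of the intercept and slope change.
##########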

The purpose of coding is to simplify the calculation of (X^T X)^{-1} in the various regression equations, which was especially important when this had to be done by hand. It is important to note that the above methods are just a few possibilities and that there are no specific guidelines or rules of thumb for when to code data. Today, when (X^T X)^{-1} is calculated with computers, there may be significant rounding error in the linear algebra manipulations if the difference in the magnitudes of the predictors is large. Good statistical programs assess the probability of such errors, which would warrant using coded variables.

When coding variables, one should be aware of the different magnitudes of the parameter estimates compared to those for the original data. The intercept term can change dramatically, but we are more concerned with any drastic changes in the slope estimates. In order to protect against additional errors due to the varying magnitudes of the regression parameters, you can compare plots of the actual data and the coded data and see if they appear similar.

9.7 Examples

Example 1: Software Development Data Set

Suppose that data from n = 20 institutions are collected on similar software development projects. The data set includes Y = number of man-years required for each project, X_1 = number of application subprograms developed for the project, and X_2 = 1 if an academic institution developed the program or 0 if a private firm developed the program. The data are given in Table 9.1. Suppose we wish to estimate the number of man-years necessary for developing this type of software for the purpose of contract bidding. We also suspect a possible interaction between the number of application subprograms developed and the type of institution. Thus, we consider the multiple regression model

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,1} x_{i,2} + ε_i.

So first, we fit the above model and assess the significance of the interaction term.
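The model for this example could be fit as follows, assuming the Table 9.1 data are in a data frame software with columns manyears, subprograms, and academic (1 = academic institution, 0 = private firm).

##########
fit.sw <- lm(manyears ~ subprograms + academic + subprograms:academic,
             data = software)       # equivalently: manyears ~ subprograms * academic
summary(fit.sw)                     # look at the interaction coefficient (beta3) first
# If the interaction is not significant, drop it and refit:
fit.sw2 <- lm(manyears ~ subprograms + academic, data = software)
##########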


Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance 2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Final Exam Practice Problem Answers

Final Exam Practice Problem Answers Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

More information

MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 13 Introduction to Linear Regression and Correlation Analysis Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

17. SIMPLE LINEAR REGRESSION II

17. SIMPLE LINEAR REGRESSION II 17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

We extended the additive model in two variables to the interaction model by adding a third term to the equation.

We extended the additive model in two variables to the interaction model by adding a third term to the equation. Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic

More information

ANOVA. February 12, 2015

ANOVA. February 12, 2015 ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Interaction between quantitative predictors

Interaction between quantitative predictors Interaction between quantitative predictors In a first-order model like the ones we have discussed, the association between E(y) and a predictor x j does not depend on the value of the other predictors

More information

Principles of Hypothesis Testing for Public Health

Principles of Hypothesis Testing for Public Health Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine johnslau@mail.nih.gov Fall 2011 Answers to Questions

More information

Comparing Means in Two Populations

Comparing Means in Two Populations Comparing Means in Two Populations Overview The previous section discussed hypothesis testing when sampling from a single population (either a single mean or two means from the same population). Now we

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences. 1 Commands in JMP and Statcrunch Below are a set of commands in JMP and Statcrunch which facilitate a basic statistical analysis. The first part concerns commands in JMP, the second part is for analysis

More information

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

More information

Statistiek II. John Nerbonne. October 1, 2010. Dept of Information Science j.nerbonne@rug.nl

Statistiek II. John Nerbonne. October 1, 2010. Dept of Information Science j.nerbonne@rug.nl Dept of Information Science j.nerbonne@rug.nl October 1, 2010 Course outline 1 One-way ANOVA. 2 Factorial ANOVA. 3 Repeated measures ANOVA. 4 Correlation and regression. 5 Multiple regression. 6 Logistic

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

CURVE FITTING LEAST SQUARES APPROXIMATION

CURVE FITTING LEAST SQUARES APPROXIMATION CURVE FITTING LEAST SQUARES APPROXIMATION Data analysis and curve fitting: Imagine that we are studying a physical system involving two quantities: x and y Also suppose that we expect a linear relationship

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1) Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

More information

An analysis method for a quantitative outcome and two categorical explanatory variables.

An analysis method for a quantitative outcome and two categorical explanatory variables. Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data.

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data. Chapter 15 Mixed Models A flexible approach to correlated data. 15.1 Overview Correlated data arise frequently in statistical analyses. This may be due to grouping of subjects, e.g., students within classrooms,

More information

Comparing Two Groups. Standard Error of ȳ 1 ȳ 2. Setting. Two Independent Samples

Comparing Two Groups. Standard Error of ȳ 1 ȳ 2. Setting. Two Independent Samples Comparing Two Groups Chapter 7 describes two ways to compare two populations on the basis of independent samples: a confidence interval for the difference in population means and a hypothesis test. The

More information

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means Lesson : Comparison of Population Means Part c: Comparison of Two- Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Variance (MANOVA) Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various

More information

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of

More information

Statistics 2014 Scoring Guidelines

Statistics 2014 Scoring Guidelines AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home

More information

TRINITY COLLEGE. Faculty of Engineering, Mathematics and Science. School of Computer Science & Statistics

TRINITY COLLEGE. Faculty of Engineering, Mathematics and Science. School of Computer Science & Statistics UNIVERSITY OF DUBLIN TRINITY COLLEGE Faculty of Engineering, Mathematics and Science School of Computer Science & Statistics BA (Mod) Enter Course Title Trinity Term 2013 Junior/Senior Sophister ST7002

More information

Section 13, Part 1 ANOVA. Analysis Of Variance

Section 13, Part 1 ANOVA. Analysis Of Variance Section 13, Part 1 ANOVA Analysis Of Variance Course Overview So far in this course we ve covered: Descriptive statistics Summary statistics Tables and Graphs Probability Probability Rules Probability

More information

DATA INTERPRETATION AND STATISTICS

DATA INTERPRETATION AND STATISTICS PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

More information

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing! MATH BOOK OF PROBLEMS SERIES New from Pearson Custom Publishing! The Math Book of Problems Series is a database of math problems for the following courses: Pre-algebra Algebra Pre-calculus Calculus Statistics

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables.

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables. SIMPLE LINEAR CORRELATION Simple linear correlation is a measure of the degree to which two variables vary together, or a measure of the intensity of the association between two variables. Correlation

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

Concepts of Experimental Design

Concepts of Experimental Design Design Institute for Six Sigma A SAS White Paper Table of Contents Introduction...1 Basic Concepts... 1 Designing an Experiment... 2 Write Down Research Problem and Questions... 2 Define Population...

More information

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Two-Sample T-Tests Assuming Equal Variance (Enter Means) Chapter 4 Two-Sample T-Tests Assuming Equal Variance (Enter Means) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when the variances of

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

ANALYSIS OF TREND CHAPTER 5

ANALYSIS OF TREND CHAPTER 5 ANALYSIS OF TREND CHAPTER 5 ERSH 8310 Lecture 7 September 13, 2007 Today s Class Analysis of trends Using contrasts to do something a bit more practical. Linear trends. Quadratic trends. Trends in SPSS.

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Introduction to Matrix Algebra

Introduction to Matrix Algebra Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

More information

CHI-SQUARE: TESTING FOR GOODNESS OF FIT

CHI-SQUARE: TESTING FOR GOODNESS OF FIT CHI-SQUARE: TESTING FOR GOODNESS OF FIT In the previous chapter we discussed procedures for fitting a hypothesized function to a set of experimental data points. Such procedures involve minimizing a quantity

More information