Appendix D Methods for Multiple Linear Regression Analysis


D.1. Introduction and Notation.

The EPOLLS model equations were derived from a multiple linear regression (MLR) analysis as described in Chapters 9 through 11. This analysis was performed using the SAS System (version 6.08) statistical software from the SAS Institute Inc. MLR methods and associated statistical tests used to develop the EPOLLS model are documented in this appendix. This brief discussion of statistical regression methods is based on more thorough treatments given by Montgomery and Peck (1992) and Freund and Littell (1991).

Notation and basic definitions associated with regression analyses include the following:

    i        = index denoting different observations or case studies in a data set
    j        = index denoting different regressor variables in a model
    n        = number of observations in a data sample
    k        = number of regressor variables in a model
    p        = k + 1 = number of coefficients in a regression model
    u        = a random variable
    µ_u      = mean of the population of u
    ū        = Σ u_i / n = mean of a sample of u, giving an estimate of the population mean (^µ_u)
    σ²_u     = Σ (u_i - µ_u)² / n = variance of the population of u, indicating variability about the mean
    s²_u     = Σ (u_i - ū)² / (n - 1) = variance of a sample of u, giving an estimate of the population variance (^σ²_u)
    y_i      = system response (dependent variable), which is random and observable, from the ith observation in the sample
    x_ij     = jth regressor (independent) variable, which is constant and known, from the ith observation in the sample
    ε_i      = error (random) between a regression model and the ith observation in the sample
    β_j      = coefficients (parameters) of a regression model
    ^θ       = an estimator of the true value of the parameter θ
    ^θ_(i)   = a parameter calculated with a model fit to all available data except the ith observation
    ^θ_(j)   = a parameter calculated with a model using all regressor variables except the jth regressor

D.2. Simple and Multiple Linear Regression.

The familiar linear regression analysis involves two coefficients, which define the slope and intercept of a line. A simple linear regression (SLR) model can be written as:

    y = β_0 + β_1 x + ε                                        (D.1)

In fitting a line to a sample data set using SLR, the goal is to find estimates of the coefficients (^β_0 and ^β_1) that minimize the error ε. Mathematically, this is done by minimizing the sum of the squared errors between the observed and predicted values of y (hence the name least squares regression). With the fitted model, the predicted value of y for a given x is:

    ^y = ^β_0 + ^β_1 x                                         (D.2)

Multiple linear regression (MLR) is used to fit a model with more than one regressor variable. Analogous to Equation D.1 for SLR, an MLR model can be written using matrix notation:

    {y} = [X]{β} + {ε}                                         (D.3)

where the matrix of the regressor variables [X] is called the design matrix. The first model coefficient (β_0) in the vector {β} is the intercept of the model. Hence, the design matrix must be specified with a leading column of ones:

            | 1  x_11  x_12  ...  x_1k |
    [X] =   | 1  x_21  x_22  ...  x_2k |                       (D.4)
            | :   :     :          :   |
            | 1  x_n1  x_n2  ...  x_nk |

In fitting an MLR model, the goal is to find the "best" estimates of the coefficients {β} that minimize the differences between all of the observed responses (y_i) and the corresponding model predictions (^y_i). In the same manner as for SLR, the coefficients are found by minimizing the sum of the squares of the errors (that is, minimizing Σ ε_i² = Σ (y_i - ^y_i)²) in a least squares regression analysis. The solution for the least squares estimators of the model coefficients can be written:

    {^β} = ([X]^T [X])^-1 [X]^T {y}                            (D.5)
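The least squares solution of Equation D.5 amounts to a few lines of linear algebra. As an illustration (the EPOLLS analysis itself was performed in SAS; this and the later sketches are in Python with NumPy, using made-up data throughout):

    import numpy as np

    # Made-up sample: n = 5 observations of k = 2 regressors.
    x = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 2.0], [5.0, 4.0]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1])
    n, k = x.shape

    X = np.column_stack([np.ones(n), x])          # design matrix with leading column of ones (Eq. D.4)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least squares estimators (Eq. D.5)
    y_hat = X @ beta_hat                          # fitted responses

In practice, np.linalg.lstsq (or a QR decomposition) is numerically preferable to forming ([X]^T [X])^-1 explicitly, but the normal-equations form above mirrors Equation D.5 directly.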

A related calculation defines the "hat" matrix [H]:

    [H] = [X] ([X]^T [X])^-1 [X]^T                             (D.6)

The [H] matrix is used in several model diagnostics described later in this appendix.

A multiple linear regression model is called "linear" because only linear coefficients {β} are used. Transforms of the regressor variables (such as 1/x, x², x^0.5, e^x, ln(x), etc.) are all permitted in an MLR model. A basic assumption in a regression analysis is that the functional form of the model is appropriate and includes all important variables. In addition, four assumptions about the errors {ε} are fundamental to a regression analysis: (1) the mean of the errors is zero, (2) the variances of the errors for all observations are constant, (3) the errors are independent of each other (uncorrelated), and (4) for some statistical tests, the errors are normally distributed. Gross violations of these basic assumptions will yield a poor or biased model. However, if the variances of the errors {ε} are unequal and can be estimated, weighted regression schemes can sometimes be used to obtain a better model.

D.3. Category Variables.

The regressors in an MLR model are usually quantitative variables that can take on any real value. Frequently, however, the sample data can be sorted into categories. For example, data on the performance of a group of students can be sorted by gender. To consider the effect of this division in an MLR model, a category variable can be introduced into the design matrix [X]. The category variable x_1 might be defined as x_1 = 0 for females and x_1 = 1 for males. Instead of developing separate models for each group, a category variable allows for the fitting of a single MLR model that is simpler to use and applicable to both groups. Moreover, because all of the model coefficients are determined using the entire data set, a single model with a category variable is much better overall than separate models fit to subsets of the data.

However, adding just the category variable to the design matrix means, in essence, that only the intercept of the model will change with the category. That is, all other regressors will have the same effect on the system response regardless of the category to which the observation belongs. To consider the effect of a category on other regressors, product or interaction terms should be added to the model. For example, to add the category variable x_1 to a model with the quantitative regressor x_j, the regression model should be:

    y = β_0 + β_1 x_1 + β_2 x_j + β_3 x_1 x_j + ε              (D.7)

More than one category variable can be included in the design matrix. To represent a category with three levels (like eye color of brown, blue, or green), two category variables are needed. For example: x_1 = 0, x_2 = 0 for brown eyes; x_1 = 0, x_2 = 1 for blue eyes; and x_1 = 1, x_2 = 1 for green eyes. A four-level category would require three category variables, and so forth. However, to get a valid model, the design matrix must include all possible combinations of these category variables.
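As a sketch of how the interaction model of Equation D.7 is assembled (the 0/1 coding and the data are illustrative assumptions):

    import numpy as np

    # Two-level category variable x1 (0 or 1) and quantitative regressor xj.
    x1 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
    xj = np.array([1.2, 2.0, 2.8, 3.5, 4.1, 5.0])
    y  = np.array([2.1, 4.0, 3.9, 6.2, 5.5, 8.1])

    # Columns of Equation D.7: intercept, category shift, regressor, and the
    # interaction term that lets the slope on xj differ between the two categories.
    X = np.column_stack([np.ones(len(y)), x1, xj, x1 * xj])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)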

D.4. Model Quality and Significance of Regressors.

The squares of the errors between the observed and predicted observations are used to evaluate how well a regression model fits the sample data. A basic identity in regression analyses is obtained by partitioning the sum of the squares of the variation in the sample data:

    SST = SSE + SSR                                            (D.8)

In words, Equation D.8 means that the total variability in the observed responses [SST = Σ(y_i - ȳ)²] is equal to the random variability not explained by the model, or model error [SSE = Σ(y_i - ^y_i)²], plus the systematic variability that is explained by the regression model [SSR = Σ(^y_i - ȳ)²]. When adjusted for the associated degrees of freedom, the three terms in Equation D.8 lead to the following definitions:

    MST = SST / (n-1) = Σ(y_i - ȳ)² / (n-1)    = total mean square of variation in observations
    MSE = SSE / (n-p) = Σ(y_i - ^y_i)² / (n-p) = error mean square
    MSR = SSR / k     = Σ(^y_i - ȳ)² / k       = regression mean square

Significantly, the MSE value gives an unbiased estimate of the variance of the errors ε_i.

To express the quality of fit between a regression model and the sample data, the coefficient of multiple determination (R²) is typically used. Ranging in value from 0.0 to 1.0, the coefficient of multiple determination is defined as:

    R² = SSR / SST = 1 - SSE / SST                             (D.9)

Higher values of R² (from a smaller SSE) indicate a better fit of the model to the sample observations. However, adding any regressor variable to an MLR model, even an irrelevant regressor, yields a smaller SSE and a greater R². For this reason, R² by itself is not a good measure of the quality of fit.
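The partition in Equation D.8 and the R² of Equation D.9 follow directly from the fitted values. A short sketch with made-up data:

    import numpy as np

    X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0],
                         [2.0, 1.5, 3.5, 2.0, 4.0]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1])
    n, p = X.shape                       # p = k + 1 model coefficients
    k = p - 1

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat

    SST = np.sum((y - y.mean())**2)      # total variability
    SSE = np.sum((y - y_hat)**2)         # variability not explained by the model
    SSR = np.sum((y_hat - y.mean())**2)  # variability explained by the model
    assert np.isclose(SST, SSE + SSR)    # the identity of Equation D.8

    MSE = SSE / (n - p)                  # unbiased estimate of the error variance
    MSR = SSR / k
    R2  = SSR / SST                      # Equation D.9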

To overcome this deficiency in R², an adjusted value can be used. The adjusted coefficient of multiple determination (R²adj) is defined as:

    R²adj = 1 - (SSE / (n-p)) / (SST / (n-1)) = 1 - [(n-1)/(n-p)] (1 - R²)        (D.10)

Because the number of model coefficients (p) is used in computing R²adj, the value will not necessarily increase with the addition of any regressor. Hence, R²adj is a more reliable indicator of model quality.

The global F-test is used to assess the overall ability of a model to explain at least some of the observed variability in the sample responses. Giving a statistical test for the significance of the regression, the global F-test is performed in the following steps:

(1) State the null hypothesis: β_1 = β_2 = ... = β_k = 0.
(2) Compute the test statistic F_0 with an analysis-of-variance table:

    Source       Degrees of Freedom   Sum of Squares   Mean Square   F_0
    Regression   k                    SSR              MSR           F_0,k,n-p = MSR / MSE
    Error        n - p                SSE              MSE
    Total        n - 1                SST

(3) From the F distribution, find F_α,k,n-p corresponding to the desired level of significance (α).
(4) If F_0,k,n-p > F_α,k,n-p, reject the null hypothesis and conclude that at least one β_j ≠ 0, so at least one model regressor explains some of the response variation.

A global F-test only indicates that at least one regressor is significant; it does not indicate which regressors are significant and which are not. To study the significance of each regressor variable, partial F-tests are used. These tests look at the significance of a given regressor in the presence of all other regressor variables in the model. Hence, partial F-tests can be performed for each of the k regressors:

(1) State the null hypothesis: β_j = 0.
(2) Compute the test statistic F_j with:

    F_j,1,n-p = [ SSR(full model) - SSR(full model without x_j) ] / MSE = (SSR - SSR_(j)) / MSE

(3) From the F distribution, find F_α,1,n-p corresponding to the desired level of significance (α).
(4) If F_j,1,n-p > F_α,1,n-p, reject the null hypothesis and conclude that β_j ≠ 0 and that the variable x_j is a significant regressor in the full model.
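Both tests reduce to a few sums of squares. A sketch with made-up data, assuming SciPy is available for the F distribution:

    import numpy as np
    from scipy import stats

    X = np.column_stack([np.ones(8), [1., 2., 3., 4., 5., 6., 7., 8.],
                         [2., 1.5, 3.5, 2., 4., 3., 5., 4.5]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1, 10.2, 14.0, 13.1])
    n, p = X.shape
    k = p - 1

    def ssr(Xd, yd):
        """Regression sum of squares from a least squares fit of yd on Xd."""
        b, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
        return float(np.sum((Xd @ b - yd.mean())**2))

    SSR_full = ssr(X, y)
    SSE = np.sum((y - y.mean())**2) - SSR_full
    MSE = SSE / (n - p)

    # Global F-test: H0 is beta_1 = ... = beta_k = 0.
    F0 = (SSR_full / k) / MSE
    F_crit = stats.f.ppf(0.95, k, n - p)          # alpha = 0.05

    # Partial F-test for one regressor: drop its column and refit.
    j = 2                                          # column index of x_j in X
    SSR_without_j = ssr(np.delete(X, j, axis=1), y)
    Fj = (SSR_full - SSR_without_j) / MSE
    Fj_crit = stats.f.ppf(0.95, 1, n - p)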

If a regression model is poorly specified, either because important regressor variables are missing or because unnecessary variables have been included, the fitted model may be biased. An indication of the model bias is given by the Mallows statistic (C_p), which attempts to measure the overall bias or mean square error in the estimated model parameters. The Mallows statistic is defined with:

    C_p = SSE(p) / MSE - n + 2p                                (D.11)

Here, MSE is computed using all available regressors and SSE(p) is computed from a model with only p coefficients. Note that when p includes all available regressors, SSE(p) = SSE and C_p = p. A low C_p value indicates good model predictions but, more importantly, a p-term model with little bias will yield C_p ≈ p. The C_p statistic is most useful in evaluating candidate regression models, as discussed in the next section.

D.5. Selection of Regressors for Candidate Models.

The real challenge in performing a multiple linear regression analysis is to find the "best" set of regressor variables that explain the variation in the observed system responses. A model is desired that not only fits well to the observations, but also yields good predictions of future responses and includes only regressor variables that contribute significantly to the model. This process of finding the best set of regressors for an MLR model is known as variable selection or model building.

Given M regressor variables that could be included in an MLR model, there are 2^M possible equations that could be written with different combinations of the regressors. With a typically large pool of potential regressor variables, examination of all possible models is simply not practical. Instead, systematic methods are employed to find a subset of the regressor variables that will form an appropriate, "best" model. In practice, different variable selection procedures usually suggest different "best" models, so additional analyses and judgment are required to obtain a good, reliable MLR model. Final selection of the best model can be based on the evaluations discussed in Section D.8.

Perhaps the simplest variable selection procedure involves an attempt to find a model with maximum R² or R²adj. This procedure does require fitting all possible models, but the results are ranked to easily identify the best model. All possible models with k regressors are evaluated, and the one model giving the greatest R² or R²adj is tabulated. The maximum R² or R²adj found from models with one, two, three, etc. regressors are then plotted as shown in Figure D.1. Recall that R² always increases with the addition of more regressor variables, while R²adj may eventually decrease, as shown in Figure D.1. A good candidate model can be selected where R²adj reaches a maximum or where the R² curve begins to flatten out.
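A sketch of Equation D.11 for one subset model (made-up data; the subset keeps only the intercept and the first regressor):

    import numpy as np

    X = np.column_stack([np.ones(8), [1., 2., 3., 4., 5., 6., 7., 8.],
                         [2., 1.5, 3.5, 2., 4., 3., 5., 4.5],
                         [0.5, 1.0, 0.8, 1.2, 0.9, 1.1, 1.0, 1.3]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1, 10.2, 14.0, 13.1])
    n, p_full = X.shape

    def sse(Xd, yd):
        """Error sum of squares from a least squares fit of yd on Xd."""
        b, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
        return float(np.sum((yd - Xd @ b)**2))

    MSE_full = sse(X, y) / (n - p_full)     # MSE from the model with all regressors

    subset = [0, 1]                         # intercept plus the first regressor
    p = len(subset)
    Cp = sse(X[:, subset], y) / MSE_full - n + 2 * p   # Equation D.11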

A stepwise selection procedure relies on partial F-tests to find a group of significant regressor variables. The "best" model is found by adding or eliminating regressors in steps. First, partial F-statistics are computed for every potential regressor, and the one variable giving the highest F_j is inserted into the model. Next, partial F-statistics are computed for all of the remaining regressors, and the one yielding the highest F_j, in the presence of the first-selected regressor, is added to the model. However, no regressor is added to the model at this step unless its F_j exceeds a specified threshold value. Next, all variables in the model are evaluated with partial F-tests to see if each one is still significant. In this step, any regressor that is no longer significant, according to the specified threshold value of F_j, is dropped from the model. The selection procedure continues in steps as new regressors are added to the model and any variables that are no longer significant are dropped. The stepwise selection procedure stops when no other potential regressor yields a partial F greater than the threshold and all regressors in the model remain significant. The threshold or cut-off partial F values, for addition to or elimination from the model, are specified in terms of a level of significance (α). One disadvantage of the stepwise selection procedure is that not all possible combinations of regressor variables are considered for the model. Also, since a stepwise procedure produces one final equation, other equally good models may go unrecognized.

The Mallows statistic, defined in the previous section, can also be used to find a good set of regressors for an MLR model. As in the selection procedure based on maximum R² or R²adj, all possible models with one, two, three, etc. regressors are evaluated, and the one model giving the lowest C_p is tabulated. The results are then plotted as shown in Figure D.2. Recall that C_p = p when all available regressor variables are used in the model; therefore, the trend line in Figure D.2 converges to the C_p = p line. A good candidate model can be selected from the C_p plot by remembering that good model predictions are indicated by a low C_p value and low model bias is indicated by C_p ≈ p.
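A minimal sketch of the forward (addition) portion of the stepwise procedure; the elimination step described above is omitted for brevity, and the threshold F value of 4.0 is an arbitrary illustration rather than a value from the EPOLLS analysis:

    import numpy as np

    def sse(Xd, yd):
        """Error sum of squares from a least squares fit of yd on Xd."""
        b, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
        return float(np.sum((yd - Xd @ b)**2))

    def forward_select(X, y, F_threshold=4.0):
        """Forward selection: at each step, add the candidate regressor with the
        largest partial F-statistic, stopping when none exceeds the threshold.
        X holds the candidate regressors (no intercept column)."""
        n, M = X.shape
        selected = []
        while len(selected) < M:
            base = np.column_stack([np.ones(n)] + [X[:, [j]] for j in selected])
            sse_base = sse(base, y)
            best_j, best_F = None, F_threshold
            for j in range(M):
                if j in selected:
                    continue
                trial = np.column_stack([base, X[:, [j]]])
                mse = sse(trial, y) / (n - trial.shape[1])
                F_j = (sse_base - sse(trial, y)) / mse   # partial F for x_j
                if F_j > best_F:
                    best_j, best_F = j, F_j
            if best_j is None:
                break                 # no remaining regressor is significant
            selected.append(best_j)
        return selected

A full stepwise implementation would also re-test the selected regressors after each addition and drop any whose partial F has fallen below the elimination threshold.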

D.6. Tests for Multicollinearity.

The problem of multicollinearity exists when two or more model regressors are strongly correlated or linearly dependent. In building a regression model, the regressor variables in the design matrix are assumed to be independent of one another. However, when the system behavior is poorly understood, the selected MLR model might include several regressors that each measure, to some degree, similar components of the system response. In this situation, columns of the design matrix may be linearly correlated to a sufficient degree as to create a multicollinearity problem. When significant multicollinearity exists in the data, the mathematical solution used to fit the regression model is unstable. That is, small changes in the regressor values can cause large changes in the parameter estimates and yield an unrealistic model. Multicollinearity is especially problematic if the fitted MLR model is then used to make future predictions. Note that the regressor variables do not have to be totally independent of one another, and some degree of correlation within the design matrix is tolerable. Techniques for ascertaining potential problems with multicollinearity are outlined in this section. When serious multicollinearity is detected, the problem can often be eliminated by discarding one or more regressor variables from the model.

Possible problems due to multicollinearity may be detected during fitting of a regression model. Common indications of multicollinearity include:

(1) Parameter estimates (^β_j) with signs that defy prior knowledge (i.e., a model coefficient with a negative sign when a positive sign is expected).
(2) Models with a large R², or high significance in a global F-test, but in which none of the model variables are significant in partial F-tests.
(3) Different model selection procedures yielding very different models.
(4) Standard errors of the regression coefficients that are large with respect to the parameter estimates, indicating poor precision in the estimates. The standard error of a coefficient is calculated as se(^β_j) = (MSE C_jj)^0.5, where C_jj is the jth diagonal element of ([X]^T [X])^-1.

A simple form of multicollinearity is caused by pair-wise correlation between any two regressor variables. This can be detected by inspection of the correlation matrix [r] for the regressor values in the design matrix. The empirical correlation between regressors x_j and x_m, giving one element of [r], is computed with:

    r_jm = Σ (x_ij - x̄_j)(x_im - x̄_m) / [ Σ (x_ij - x̄_j)² Σ (x_im - x̄_m)² ]^0.5        (D.12)

The greater the linear dependence between x_j and x_m, the closer r_jm will be to one (obviously, the diagonals of the correlation matrix, r_jj, are equal to one). As a general rule, multicollinearity may be a problem if, for the off-diagonal terms of the correlation matrix, r_jm ≥ 0.9. However, the pair-wise correlation coefficients will not indicate multicollinearity problems arising from linear dependencies between combinations of more than two regressors.
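Equation D.12 is the ordinary pair-wise correlation. A sketch of building [r] and flagging suspect pairs (made-up regressor columns; np.corrcoef(X, rowvar=False) gives the same matrix):

    import numpy as np

    # Regressor columns only (no intercept column).
    X = np.array([[1.0, 2.0, 0.5], [2.0, 1.5, 1.0], [3.0, 3.5, 0.8],
                  [4.0, 2.0, 1.2], [5.0, 4.0, 0.9], [6.0, 3.0, 1.1]])

    Xc = X - X.mean(axis=0)                    # center each regressor
    scale = np.sqrt(np.sum(Xc**2, axis=0))
    r = (Xc.T @ Xc) / np.outer(scale, scale)   # correlation matrix [r] (Eq. D.12)

    # Flag off-diagonal pairs at or above the 0.9 rule of thumb.
    k = r.shape[0]
    suspect = [(j, m, r[j, m]) for j in range(k) for m in range(j + 1, k)
               if abs(r[j, m]) >= 0.9]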

Multicollinearity can also be detected from the eigenvalues, or characteristic roots, of the correlation matrix [r]. For a model with k regressors, there will be k eigenvalues, λ_j. The ratio of the maximum to the minimum eigenvalue of [r] defines the model condition number:

    κ = λ_max / λ_min                                          (D.13)

As a general rule, κ < 100 indicates no serious problem, 100 < κ < 1000 indicates moderate to strong multicollinearity, and κ > 1000 indicates a severe problem with multicollinearity. In addition, the condition index associated with each regressor variable x_j is defined as:

    κ_j = λ_max / λ_j                                          (D.14)

The number of κ_j > 1000 indicates the number of linear dependencies in the design matrix.

Another indication of potential multicollinearity is obtained from variance inflation factors (VIFs). The VIF associated with regressor x_j is computed with:

    VIF_j = 1 / (1 - R²_(j))                                   (D.15)

where R²_(j) is the coefficient of multiple determination (R²) from a regression of x_j on all other k-1 regressors in the model. Hence, as more of the variation in x_j can be explained by the other regressor variables, R²_(j) will approach one and VIF_j will increase. Large values of VIF_j indicate possible multicollinearity associated with regressor x_j. In general, VIF_j > 5 indicates a possible multicollinearity problem, while VIF_j > 10 indicates that multicollinearity is almost certainly a problem.
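The condition number, condition indices, and VIFs can all be computed from the same regressor columns. A sketch (same made-up data as above):

    import numpy as np

    X = np.array([[1.0, 2.0, 0.5], [2.0, 1.5, 1.0], [3.0, 3.5, 0.8],
                  [4.0, 2.0, 1.2], [5.0, 4.0, 0.9], [6.0, 3.0, 1.1]])
    r = np.corrcoef(X, rowvar=False)           # correlation matrix of the regressors

    eigvals = np.linalg.eigvalsh(r)            # eigenvalues of the symmetric matrix [r]
    kappa = eigvals.max() / eigvals.min()      # condition number (Eq. D.13)
    kappa_j = eigvals.max() / eigvals          # condition indices (Eq. D.14)

    # VIF_j = 1 / (1 - R2_(j)), from regressing x_j on the other regressors (Eq. D.15).
    n, k = X.shape
    vif = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        R2_j = 1.0 - resid.var() / X[:, j].var()
        vif.append(1.0 / (1.0 - R2_j))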

D.7. Tests for Influential Observations.

In addition to multicollinearity, another common problem in regression analyses is a model adversely affected by influential observations. Three types of influential observations are illustrated in Figure D.3 for a simple linear regression model. Outliers, defined as observations outside the general trend of the data, are a familiar type of influential observation. When an observation does fall within the trend of the data, but is found beyond the range of the other regressors, the resulting influential observation is called a high leverage point. When a high leverage point is also an outlier, that single data point can have a large impact on the regression model and is called a highly influential observation. In this section, methods are presented for detecting influential observations.

The presence of influential observations can be detected by computing the PRESS (prediction error sum of squares) statistic, defined as:

    PRESS = Σ ( y_i - ^y_(i) )²                                (D.16)

where ^y_(i) is a prediction of the ith observed response made from a model regressed on all of the available data except the ith observation. The PRESS statistic is then compared with the sum of the squares of the errors, SSE = Σ(y_i - ^y_i)². If PRESS is much larger than SSE, influential observations may exist.

Outliers can be detected from studentized residuals, defined for the ith observation with:

    r_i = (y_i - ^y_i) / [ MSE (1 - h_ii) ]^0.5                (D.17)

where h_ii is the ith diagonal element of the hat matrix [H] defined in Equation D.6. When working with a sufficient number of observations, such that (n-p-1) > 20, an |r_i| > 2.0 indicates that the ith observation might be an outlier. Similarly, an |r_i| > 2.5 is a strong indicator of a likely outlier. In addition, high leverage points can be detected directly from the diagonals of the hat matrix, h_ii. As a general rule, an h_ii > 2p/n indicates that the ith observation is a possible high leverage point.

One diagnostic test for highly influential observations uses the DFFITS statistic. For the ith observation, this statistic is defined as:

    DFFITS_i = ( ^y_i - ^y_(i) ) / [ MSE_(i) h_ii ]^0.5        (D.18)

where ^y_(i) and MSE_(i) are based on a model regressed on all of the available data except the ith observation. A possible highly influential observation is indicated by |DFFITS_i| > 2(p/n)^0.5.
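These diagnostics do not require refitting the model n times: the standard leave-one-out identities for a linear model give e_(i) = e_i / (1 - h_ii) and a corresponding closed-form update for MSE_(i). A sketch using those identities (made-up data):

    import numpy as np

    X = np.column_stack([np.ones(6), [1., 2., 3., 4., 5., 6.],
                         [2., 1.5, 3.5, 2., 4., 3.]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1, 9.8])
    n, p = X.shape

    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix (Eq. D.6)
    h = np.diag(H)                             # leverages h_ii
    e = y - H @ y                              # ordinary residuals
    MSE = np.sum(e**2) / (n - p)

    PRESS = np.sum((e / (1.0 - h))**2)         # Eq. D.16 via the leave-one-out identity

    r_stud = e / np.sqrt(MSE * (1.0 - h))      # studentized residuals (Eq. D.17)
    leverage_flag = h > 2.0 * p / n            # possible high leverage points

    # DFFITS (Eq. D.18), with MSE_(i) from the exact leave-one-out update.
    MSE_i = ((n - p) * MSE - e**2 / (1.0 - h)) / (n - p - 1)
    dffits = (e * np.sqrt(h) / (1.0 - h)) / np.sqrt(MSE_i)
    dffits_flag = np.abs(dffits) > 2.0 * np.sqrt(p / n)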

Another test for highly influential observations is based on the effects of the ith observation on the estimated model coefficients. The DFBETAS statistic, which is calculated for each jth regressor variable and each ith observation, is defined as:

    DFBETAS_ij = ( ^β_j - ^β_j(i) ) / [ MSE_(i) C_jj ]^0.5     (D.19)

where ^β_j(i) and MSE_(i) are computed from a regression on all available data except the ith observation, and C_jj is the jth diagonal element of ([X]^T [X])^-1. As a general rule, a possible highly influential observation is indicated by |DFBETAS_ij| > 2/n^0.5. In practice, a highly influential observation will cause consistently high DFBETAS for most of the regressor variables.

A third test for highly influential observations is based on Cook's Distance, defined as:

    D_i = (r_i² / p) [ h_ii / (1 - h_ii) ]                     (D.20)

where r_i is the studentized residual defined in Equation D.17. Values of D_i much larger than all others indicate that the ith observation may be highly influential.

All influential observations should be investigated for correctness and accuracy. When using the statistical tests described here, the cutoff values for indicating an influential observation should be used only as guidelines. In practice, those observations giving the strongest indications in these tests should be investigated first. Moreover, a data point should not be discarded simply because it is an influential observation. A single influential data point can sometimes illuminate an important trend in the system response.
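A sketch of DFBETAS and Cook's Distance, again using exact leave-one-out identities (the rank-one update for the coefficient change is a standard linear-model result; data are made up):

    import numpy as np

    X = np.column_stack([np.ones(6), [1., 2., 3., 4., 5., 6.],
                         [2., 1.5, 3.5, 2., 4., 3.]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1, 9.8])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)
    e = y - H @ y
    MSE = np.sum(e**2) / (n - p)
    MSE_i = ((n - p) * MSE - e**2 / (1.0 - h)) / (n - p - 1)

    # DFBETAS (Eq. D.19): beta_j - beta_j(i) = (XtX_inv X^T)[j, i] * e_i / (1 - h_ii).
    A = XtX_inv @ X.T                           # p x n
    C_jj = np.diag(XtX_inv)
    dfbetas = (A * (e / (1.0 - h))) / np.sqrt(np.outer(C_jj, MSE_i))
    dfbetas_flag = np.abs(dfbetas) > 2.0 / np.sqrt(n)

    # Cook's Distance (Eq. D.20), from the studentized residuals.
    r_stud = e / np.sqrt(MSE * (1.0 - h))
    D = (r_stud**2 / p) * (h / (1.0 - h))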

D.8. Evaluation of Final Model.

In addition to the diagnostics for multicollinearity and influential data, graphical methods are useful for evaluating the performance of a regression model. In a simple linear regression model, the adequacy of a linear equation can be easily visualized with a scatter plot of the observed data (x_i, y_i). However, with a multiple linear regression model, plots of the observed responses (y_i) versus each regressor variable (x_ij) are of little value. Because the system response is a function of multiple regressors, plots of the observed response versus individual variables often fail to indicate a linear relationship and can be very misleading in evaluating an otherwise good MLR model. Other plots that are more useful in visualizing the performance of an MLR model are described in this section.

A scatter plot of the predicted response versus the observed response (^y versus y), as shown in Figure D.4, gives a simple indication of model performance. Any model that can explain most of the variation in the observed responses will produce a plot with points clustered around a 45° line. Better models yield less scatter about this ^y = y line. Moreover, the scatter of points about the ^y = y line should remain roughly constant with magnitude; a poor model that is less accurate at larger values of ^y will produce increasing scatter with larger values of y.

A scatter plot of the residuals, as shown in Figure D.5, is also useful in evaluating a regression model. Here, the model residuals or errors (e = y - ^y) are plotted against the model predictions (^y). Residual plots are used to visually verify some of the basic assumptions underlying an MLR analysis, as discussed previously in Section D.2. Namely, the residuals (errors) between the model predictions and observed responses should have a mean of zero and a constant variance. Hence, the scatter in the residuals should be fairly uniform and centered about e = 0. A good regression model will produce a scatter in the residuals that is roughly constant with ^y, as shown in Figure D.5a. Unsatisfactory models yield a scatter in the residuals that changes with ^y; three common examples are shown in (b), (c), and (d) of Figure D.5. Models producing an unsatisfactory scatter in the residuals can often be improved by transforming y to stabilize the variance in the residuals. For example, a model might be re-defined in terms of ln(y), y^0.5, y^-0.5, or 1/y. However, such transformations necessitate bias-reducing adjustments when the model predictions are de-transformed, as discussed further in the next section.

A third graphical method for evaluating a multiple linear regression model is based on the idea of partial residuals. Designated e_y,i(j), the partial residual of y for x_j is defined as:

    e_y,i(j) = y_i - ^y_i(j)                                   (D.21)

where ^y_i(j) is a prediction of y_i from a regression model using all of the regressors except x_j. Similarly, the partial residual of x_j is designated e_x,i(j) and defined as:

    e_x,i(j) = x_ij - ^x_i(j)                                  (D.22)

where ^x_i(j) is a prediction of the regressor x_ij from a regression of x_j on all the other regressor variables. Hence, the partial residual e_y,i(j) represents the variation in y_i not explained by a model that excludes the regressor x_j, and the partial residual e_x,i(j) represents the variation in x_j that cannot be explained by the other regressor variables. Plotting e_y,i(j) against e_x,i(j) in a partial regression plot more clearly shows the influence of x_j on y in the presence of all other regressors in the model.
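The two partial residuals come from a pair of auxiliary regressions that each exclude x_j. A sketch (made-up data; j indexes the column of x_j in the design matrix):

    import numpy as np

    X = np.column_stack([np.ones(6), [1., 2., 3., 4., 5., 6.],
                         [2., 1.5, 3.5, 2., 4., 3.]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1, 9.8])
    j = 2                                   # column of the regressor x_j under study

    def fit_resid(Xd, t):
        """Residuals from a least squares fit of target t on Xd."""
        b, *_ = np.linalg.lstsq(Xd, t, rcond=None)
        return t - Xd @ b

    X_others = np.delete(X, j, axis=1)      # all regressors except x_j
    e_y = fit_resid(X_others, y)            # partial residual of y for x_j (Eq. D.21)
    e_x = fit_resid(X_others, X[:, j])      # partial residual of x_j (Eq. D.22)
    # Plotting e_y against e_x gives the partial regression plot; the points
    # should cluster about a line through the origin with slope beta_j.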

Partial regression plots, shown in Figure D.6, can be generated for every regressor variable in a model. If the regressor x_j is linearly related to the model response, a plot of the partial residuals e_y,i(j) and e_x,i(j) will cluster about a line that passes through the origin with a slope equal to β_j (the coefficient corresponding to x_j in the full model). Moreover, a stronger linear relationship between y and x_j will be evidenced by a narrower clustering of the partial residuals. For example, less scatter is seen in the partial regression plot in Figure D.6a than in the plot in Figure D.6b; this indicates a stronger relationship with the regressor x_1 than with the regressor x_2. Influential data, and their effect on the fit of the model, can also be spotted on a partial regression plot, as shown in Figure D.6c. Even more useful, the need for a transformation of a given regressor variable (such as ln(x), 1/x, x^0.5, x², etc.) may be suggested by a partial regression plot like that in Figure D.6d.

Finally, it is often important to evaluate the ability of a fitted regression model to predict future events. The best way to do this is to gather additional, new data and compare these observed responses with predictions from the model. However, when this is not possible, the data used to fit the model can be split and a cross-validation performed. A good regression model can be fit to part of the original data set and still accurately predict the other observations. To perform a double cross-validation:

(1) Partition the data into two subsets (say, A and B) with an equal number of observations in each. Individual observations must be assigned to subset A or B randomly.
(2) Using the same model form, fit the model using the data from subset A. Use this model to predict the observations in subset B.
(3) Compute the prediction R²p,A for the model fit to subset A, as defined in Equation D.23 below.
(4) Similarly, fit the model to subset B and use this model to predict the observations in subset A.
(5) Compute the prediction R²p,B for the model fit to subset B.

A good model will produce high values of R²p for both subsets, and these values will be approximately equal (R²p,A ≈ R²p,B). The prediction R²p for a model fit to subset A is computed with:

    R²p,A = 1 - Σ (y_iB - ^y_iA)² / Σ (y_iB - ȳ_B)²            (D.23)

where the sums run over the n_B observed responses (y_iB) in the random subset B, ȳ_B is their mean, and ^y_iA are predictions of the observations in subset B made with the model fit to subset A. The prediction R²p for a model fit to subset B (R²p,B) is computed in the same way, with the roles of subsets A and B reversed.
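A sketch of the double cross-validation, with synthetic data standing in for the real sample:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(12), rng.uniform(0, 5, 12), rng.uniform(0, 3, 12)])
    y = 1.0 + 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.3, 12)

    # (1) Randomly partition the observations into two equal subsets A and B.
    idx = rng.permutation(len(y))
    A, B = idx[:len(y) // 2], idx[len(y) // 2:]

    def prediction_r2(train, test):
        """Fit on one subset, predict the other, and compute prediction R2 (Eq. D.23)."""
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[test] @ b
        return 1.0 - np.sum((y[test] - pred)**2) / np.sum((y[test] - y[test].mean())**2)

    R2p_A = prediction_r2(A, B)   # model fit to subset A, predicting subset B
    R2p_B = prediction_r2(B, A)   # model fit to subset B, predicting subset A
    # A good model gives high and roughly equal values of R2p_A and R2p_B.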

D.9. Predictions from a Regression Model.

Often, the purpose of developing a regression model is to allow predictions of future events. In equation form, the predicted system response (^y), for a given set of regressor values written as the row matrix [x] = [ 1  x_1  x_2 ... x_k ], is computed with:

    ^y = [x] {^β}                                              (D.24)

where {^β} denotes the fitted regression coefficients. As mentioned in the previous section, the observed system responses are sometimes transformed to stabilize the variance of the model errors, based on an examination of a residuals plot. For example, suppose a model for predicting some response θ is desired, and the regression analysis indicates the need to transform θ by taking the square root and fitting the model to θ^0.5. Predictions of ^θ would then be obtained by squaring the model prediction. Unfortunately, the de-transformed prediction is substantially biased and will consistently underpredict the value of θ. To alleviate this problem, Miller (1984) suggests bias-reducing adjustment factors for logarithmic, square root, and inverse transformations of the dependent variable. For a square-root transformation, where a regression model is fit to y_i^0.5, the low-bias prediction of ^y given by Miller can be written as:

    ^y = ( [x] {^β} )² + MSE                                   (D.25)

where MSE is used as an estimate of the variance of the errors in the fitted model.

Because a regression model is simply an equation fit to a database of observed responses, a regression model should not be trusted to make predictions outside the range of the regressor variables used in fitting the model. Hence, the first step in using a regression model should be to verify that the prediction does not require extrapolation beyond the range of the regressor variables in the original data set. This can be done most simply by referring to histograms of the original regressor variables. However, with more than two or three model variables, it is possible to be within the range of each regressor yet still extrapolate beyond the combined range of the variables. Referred to as hidden extrapolation, this problem is illustrated in Figure D.7 for a model with two regressor variables. Hidden extrapolation is especially problematic if the regression model is unstable due to problems of multicollinearity. An indication of possible hidden extrapolation can be made by computing h_0:

    h_0 = [x] ([X]^T [X])^-1 [x]^T                             (D.26)

where [X] is the design matrix used to fit the model and [x] is the row matrix of regressor values at which the model prediction is to be made.
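A sketch of making a prediction from a square-root-transformed model, with the hidden-extrapolation check of Equation D.26 and the bias-reducing adjustment of Equation D.25 (made-up data; the prediction point x_new is an arbitrary illustration):

    import numpy as np

    X = np.column_stack([np.ones(6), [1., 2., 3., 4., 5., 6.],
                         [2., 1.5, 3.5, 2., 4., 3.]])
    # Model fit to the square root of the response, as in the example above.
    y_sqrt = np.sqrt(np.array([3.1, 4.2, 7.9, 7.0, 11.1, 9.8]))
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y_sqrt
    MSE = np.sum((y_sqrt - X @ beta)**2) / (n - p)

    x_new = np.array([1.0, 3.5, 2.5])          # row matrix [1  x_1  x_2]

    # Hidden-extrapolation check (Eq. D.26): compare h_0 against h_max.
    h0 = x_new @ XtX_inv @ x_new
    h_max = np.max(np.diag(X @ XtX_inv @ X.T))
    extrapolating = h0 > h_max

    # De-transformed prediction with Miller's bias-reducing adjustment (Eq. D.25).
    y_pred = (x_new @ beta)**2 + MSE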

The computed value of h_0 is then compared with h_max, the maximum of the diagonal elements of the hat matrix [H] from Equation D.6. As a general rule, extrapolation is indicated when h_0 > h_max.

Finally, based on how well the model fits the available data, it is possible to construct a prediction interval. For a given set of regressor values, the actual response is believed to fall within the prediction interval 100(1-α)% of the time. The upper and lower bounds of the prediction interval are defined by:

    ^y' ± t_α/2,n-p [ MSE (1 + h_0) ]^0.5                      (D.27)

where t_α/2,n-p is the value of the t-distribution with a tail area of α/2 and n-p degrees of freedom. The value ^y' is the predicted response at [x], but without the bias-reducing adjustment used when de-transforming y (see Equation D.25); for a model specified with a square-root transformation, ^y' = ( [x]{^β} )². Note that the prediction interval given in Equation D.27 applies to the model prediction of a single system response. The confidence interval for the predicted mean of multiple responses, for a given set of regressor values, is narrower.
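A sketch of the prediction interval of Equation D.27 for an untransformed response, assuming SciPy is available for the t-distribution (made-up data):

    import numpy as np
    from scipy import stats

    X = np.column_stack([np.ones(6), [1., 2., 3., 4., 5., 6.],
                         [2., 1.5, 3.5, 2., 4., 3.]])
    y = np.array([3.1, 4.2, 7.9, 7.0, 11.1, 9.8])
    n, p = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    MSE = np.sum((y - X @ beta)**2) / (n - p)

    x_new = np.array([1.0, 3.5, 2.5])
    y_new = x_new @ beta                        # predicted response (Eq. D.24)
    h0 = x_new @ XtX_inv @ x_new                # Eq. D.26

    alpha = 0.05
    t = stats.t.ppf(1.0 - alpha / 2.0, n - p)   # t value with tail area alpha/2
    half_width = t * np.sqrt(MSE * (1.0 + h0))
    interval = (y_new - half_width, y_new + half_width)   # 95% prediction interval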

Figure D.1. Selection of a candidate model based on maximum R² or R²adj. [Figure not reproduced: curves of the maximum R² and maximum R²adj for any k-regressor model, plotted against k = number of regressors in the model, with the candidate model marked where R²adj peaks.]

Figure D.2. Selection of a candidate model from a C_p plot. [Figure not reproduced: the minimum C_p for any p-term model plotted against p = k + 1 = number of model coefficients; the trend line converges to the C_p = p line at the full model, and the candidate model has a low C_p value approximately equal to p.]

Figure D.3. Definition of influential observations in a simple linear regression model. [Figure not reproduced: a plot of response variable (y) versus regressor variable (x) marking an outlier, a high leverage point, and a highly influential observation, and contrasting the correct model with a model fit with the influence data.]

Figure D.4. Evaluation of final model with a y-^y scatter plot. [Figure not reproduced: predicted response (^y) versus observed response (y), with perfect predictions falling on the 45° line ^y = y.]

Figure D.5. Common patterns in residual plots used to evaluate MLR models (after Montgomery and Peck 1992). [Figure not reproduced: four panels of residuals e = y - ^y plotted against ^y, showing (a) satisfactory, (b) funnel, (c) double bow, and (d) nonlinear patterns.]

Figure D.6. Partial regression plots used to evaluate MLR models. [Figure not reproduced: four panels plotting the partial residual of y for x_j against the partial residual of x_j, each clustering about a line through the origin with slope β_j; panels (a) and (b) contrast tighter clustering for x_1 with looser clustering for x_2, panel (c) shows an outlier for x_3, and panel (d) shows a nonlinear pattern for x_4.]

Figure D.7. Illustration of hidden extrapolation in a model with two regressor variables (after Montgomery and Peck 1992). [Figure not reproduced: the rectangle bounded by the ranges of regressors x_1 and x_2 used in fitting the model contains the smaller region of data actually used to fit the model; a prediction point inside the rectangle but outside that region involves hidden extrapolation.]


More information

MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

More information

POLYNOMIAL AND MULTIPLE REGRESSION. Polynomial regression used to fit nonlinear (e.g. curvilinear) data into a least squares linear regression model.

POLYNOMIAL AND MULTIPLE REGRESSION. Polynomial regression used to fit nonlinear (e.g. curvilinear) data into a least squares linear regression model. Polynomial Regression POLYNOMIAL AND MULTIPLE REGRESSION Polynomial regression used to fit nonlinear (e.g. curvilinear) data into a least squares linear regression model. It is a form of linear regression

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

International Statistical Institute, 56th Session, 2007: Phil Everson

International Statistical Institute, 56th Session, 2007: Phil Everson Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries?

We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries? Statistics: Correlation Richard Buxton. 2008. 1 Introduction We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries? Do

More information

AP STATISTICS REVIEW (YMS Chapters 1-8)

AP STATISTICS REVIEW (YMS Chapters 1-8) AP STATISTICS REVIEW (YMS Chapters 1-8) Exploring Data (Chapter 1) Categorical Data nominal scale, names e.g. male/female or eye color or breeds of dogs Quantitative Data rational scale (can +,,, with

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test Introduction to Analysis of Variance (ANOVA) The Structural Model, The Summary Table, and the One- Way ANOVA Limitations of the t-test Although the t-test is commonly used, it has limitations Can only

More information

Causal Forecasting Models

Causal Forecasting Models CTL.SC1x -Supply Chain & Logistics Fundamentals Causal Forecasting Models MIT Center for Transportation & Logistics Causal Models Used when demand is correlated with some known and measurable environmental

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

Standard Deviation Estimator

Standard Deviation Estimator CSS.com Chapter 905 Standard Deviation Estimator Introduction Even though it is not of primary interest, an estimate of the standard deviation (SD) is needed when calculating the power or sample size of

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results IAPRI Quantitative Analysis Capacity Building Series Multiple regression analysis & interpreting results How important is R-squared? R-squared Published in Agricultural Economics 0.45 Best article of the

More information

Homework 8 Solutions

Homework 8 Solutions Math 17, Section 2 Spring 2011 Homework 8 Solutions Assignment Chapter 7: 7.36, 7.40 Chapter 8: 8.14, 8.16, 8.28, 8.36 (a-d), 8.38, 8.62 Chapter 9: 9.4, 9.14 Chapter 7 7.36] a) A scatterplot is given below.

More information

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Using R for Linear Regression

Using R for Linear Regression Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information