# Multiple Linear Regression in Data Mining


## Contents

- 2.1. A Review of Multiple Linear Regression
- 2.2. Illustration of the Regression Process
- 2.3. Subset Selection in Linear Regression

Perhaps the most popular mathematical model for making predictions is the multiple linear regression model. You have already studied multiple regression models in the Data, Models, and Decisions (DMD) course. In this note we build on that knowledge to examine the use of multiple linear regression models in data mining applications. Multiple linear regression is applicable to numerous data mining situations. Examples are:

- predicting customer activity on credit cards from demographics and historical activity patterns;
- predicting the time to failure of equipment based on utilization and environmental conditions;
- predicting expenditures on vacation travel based on historical frequent-flier data;
- predicting staffing requirements at help desks based on historical data and product and sales information;
- predicting sales from cross-selling of products based on historical information;
- predicting the impact of discounts on sales in retail outlets.

In this note, we review the process of multiple linear regression. In this context we emphasize (a) the need to split the data into two categories, a training data set and a validation data set, in order to validate the multiple linear regression model, and (b) the need to relax the assumption that errors follow a Normal distribution. After this review, we introduce methods for identifying subsets of the independent variables that improve predictions.

## 2.1 A Review of Multiple Linear Regression

In this section, we briefly review the multiple regression model that you encountered in the DMD course. There is a continuous random variable called the dependent variable, $Y$, and a number of independent variables, $x_1, x_2, \ldots, x_p$. Our purpose is to predict the value of the dependent variable (also referred to as the response variable) using a linear function of the independent variables.
The values of the independent variables (also referred to as predictor variables, regressors, or covariates) are known quantities for purposes of prediction. The model is

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \tag{2.1}$$

where $\varepsilon$, the noise variable, is a Normally distributed random variable with mean zero and standard deviation $\sigma$, whose value we do not know. We also do not know the values of the coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$. We estimate all of these $(p+2)$ unknown values from the available data.

The data consist of $n$ rows of observations, also called cases, which give us the values $y_i, x_{i1}, x_{i2}, \ldots, x_{ip}$ for $i = 1, 2, \ldots, n$. The estimates for the $\beta$ coefficients are computed so as to minimize the sum of squares of differences between the fitted (predicted) values and the observed values in the data. The sum of squared differences is given by

$$\sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip}\right)^2.$$

Let us denote the values of the coefficients that minimize this expression by $\hat\beta_0, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$. These are our estimates for the unknown values and are called ordinary least squares (OLS) estimates in the literature. Once we have computed the estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$ we can calculate an unbiased estimate $\hat\sigma^2$ of $\sigma^2$ using the formula

$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip}\right)^2 = \frac{\text{sum of squared residuals}}{\#\text{observations} - \#\text{coefficients}}.$$

We plug the values $\hat\beta_0, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$ into the linear regression model (2.1) to predict the value of the dependent variable from known values of the independent variables $x_1, x_2, \ldots, x_p$. The predicted value, $\hat Y$, is computed from the equation

$$\hat Y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_p x_p.$$

Predictions based on this equation are the best predictions possible, in the sense that they are unbiased (equal to the true values on average) and have the smallest expected squared error compared to any unbiased estimates, if we make the following assumptions:

1. **Linearity.** The expected value of the dependent variable is a linear function of the independent variables, i.e., $E(Y \mid x_1, x_2, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.
2. **Independence.** The noise random variables $\varepsilon_i$ are independent across all the rows. Here $\varepsilon_i$ is the noise random variable in observation $i$, for $i = 1, \ldots, n$.
3. **Unbiasedness.** The noise random variable $\varepsilon_i$ has zero mean, i.e., $E(\varepsilon_i) = 0$ for $i = 1, 2, \ldots, n$.
4. **Homoskedasticity.** The standard deviation of $\varepsilon_i$ equals the same (unknown) value, $\sigma$, for $i = 1, 2, \ldots, n$.

5. **Normality.** The noise random variables $\varepsilon_i$ are Normally distributed.

An important and interesting fact for our purposes is that even if we drop the assumption of Normality (Assumption 5) and allow the noise variables to follow arbitrary distributions, these estimates are very good for prediction. We can show that predictions based on these estimates are the best linear predictions, in that they minimize the expected squared error. In other words, amongst all linear models, as defined by equation (2.1) above, the model using the least squares estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$ will give the smallest value of squared error on average. We elaborate on this idea in the next section. The Normal distribution assumption was required only to derive confidence intervals for predictions.

In data mining applications we have two distinct sets of data: the training data set and the validation data set, both of which are representative of the relationship between the dependent and independent variables. The training data set is used to estimate the regression coefficients $\hat\beta_0, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$. The validation data set constitutes a hold-out sample and is not used in computing the coefficient estimates. This enables us to estimate the error in our predictions without having to assume that the noise variables follow the Normal distribution. We use the training data to fit the model and to estimate the coefficients. These coefficient estimates are used to make predictions for each case in the validation data. The prediction for each case is then compared to the value of the dependent variable that was actually observed in the validation data. The average of the squares of these errors enables us to compare different models and to assess the accuracy of the model in making predictions.
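To make this workflow concrete, here is a minimal NumPy sketch (the data are simulated for illustration; nothing here comes from the chapter's supervisor example): the coefficients are estimated by least squares on the training rows only, and prediction error is then measured on the held-out validation rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a known true relationship: y = 2 + 3*x1 - x2 + eps
# (the coefficient values are illustrative assumptions, not from the text)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 2 + 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)

X1 = np.column_stack([np.ones(n), X])       # column of ones for beta_0
train, valid = slice(0, 150), slice(150, n)

# OLS estimates: minimize the sum of squared residuals on the TRAINING rows
beta_hat, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)

# Unbiased estimate of sigma^2: SSR / (#observations - #coefficients)
resid = y[train] - X1[train] @ beta_hat
sigma2_hat = np.sum(resid**2) / (150 - (p + 1))

# Accuracy is judged on the hold-out VALIDATION rows, so no Normality
# assumption is needed to estimate the prediction error
pred = X1[valid] @ beta_hat
mse_valid = np.mean((y[valid] - pred) ** 2)
```

The average squared validation error (`mse_valid`) is the quantity used below to compare competing models.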
## 2.2 Illustration of the Regression Process

We illustrate the process of multiple linear regression using an example, adapted from Chatterjee, Hadi and Price, on estimating the performance of supervisors in a large financial organization. The data shown in Table 2.1 are from a survey of clerical employees in a sample of departments in the organization. The dependent variable is a performance measure of effectiveness for supervisors heading departments in the organization. Both the dependent and the independent variables are totals of ratings on different aspects of the supervisor's job, on a scale of 1 to 5, by 25 clerks reporting to the supervisor. As a result, the minimum value for

each variable is 25 and the maximum value is 125. These ratings are answers to survey questions given to a sample of 25 clerks in each of 30 departments. The purpose of the analysis was to explore the feasibility of using a questionnaire to predict the effectiveness of departments, thus saving the considerable effort required to measure effectiveness directly. The variables are answers to questions on the survey and are described below.

- Y: Measure of effectiveness of supervisor.
- X1: Handles employee complaints.
- X2: Does not allow special privileges.
- X3: Opportunity to learn new things.
- X4: Raises based on performance.
- X5: Too critical of poor performance.
- X6: Rate of advancing to better jobs.

The multiple linear regression estimates, as computed by the StatCalc add-in to Excel, are reported in Table 2.2; they give the prediction equation $\hat Y = \hat\beta_0 + \hat\beta_1 X1 + \cdots + \hat\beta_6 X6$ with the coefficient values shown in the table. In Table 2.3 we use ten more cases as the validation data. Applying the estimated equation to the validation data gives the predictions and errors shown in Table 2.3. The last column, entitled "error", is simply the predicted rating minus the actual rating; for Case 21, for example, the error equals $-5.54$. We note that the average error in the predictions is small ($-0.52$), and so the predictions are unbiased. Further, the errors are roughly Normal, so this model gives prediction errors that are approximately 95% of the time within $\pm 14.34$ (two standard deviations) of the true value.

## 2.3 Subset Selection in Linear Regression

A frequent problem in data mining is that of using a regression equation to predict the value of a dependent variable when we have many variables available to choose from as independent variables in our model. Given the high speed of modern algorithms for multiple linear regression calculations, it is tempting

Table 2.1: Training Data (20 departments).

in such a situation to take a kitchen-sink approach: why bother to select a subset; just use all the variables in the model. There are several reasons why this could be undesirable:

- It may be expensive to collect the full complement of variables for future predictions.
- We may be able to measure fewer variables more accurately (for example, in surveys).
- Parsimony is an important property of good models. We obtain more insight into the influence of regressors in models with few parameters.

Table 2.2: Output of StatCalc (multiple R-squared, residual SS, standard deviation estimate, and, for each term, the coefficient, standard error, t-statistic, and p-value).

- Estimates of regression coefficients are likely to be unstable due to multicollinearity in models with many variables. We get better insight into the influence of regressors from models with fewer variables, as the coefficients are more stable for parsimonious models.
- It can be shown that using independent variables that are uncorrelated with the dependent variable increases the variance of predictions.
- It can be shown that dropping independent variables that have small (non-zero) coefficients can reduce the average error of predictions.

Let us illustrate the last two points using the simple case of two independent variables. The reasoning remains valid in the general situation of more than two independent variables.

### Dropping Irrelevant Variables

Suppose that the true equation for Y, the dependent variable, is

$$Y = \beta_1 X_1 + \varepsilon \tag{2.2}$$

Table 2.3: Predictions on the validation data (for each case: Y, X1 through X6, the prediction, and the error, together with the averages and standard deviations of the errors).

and suppose that we estimate Y (using an additional variable $X_2$ that is actually irrelevant) with the equation

$$Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \tag{2.3}$$

We use data $y_i, x_{i1}, x_{i2}$, for $i = 1, 2, \ldots, n$. We can show that in this situation the least squares estimates $\hat\beta_1$ and $\hat\beta_2$ have the following expected values and variances:

$$E(\hat\beta_1) = \beta_1, \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{(1 - R_{12}^2)\sum_{i=1}^{n} x_{i1}^2},$$

$$E(\hat\beta_2) = 0, \qquad \mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{(1 - R_{12}^2)\sum_{i=1}^{n} x_{i2}^2},$$

where $R_{12}$ is the correlation coefficient between $X_1$ and $X_2$. We notice that $\hat\beta_1$ is an unbiased estimator of $\beta_1$, and that $\hat\beta_2$ is an unbiased estimator of $\beta_2$, since it has an expected value of zero. If we use Model (2.2) we obtain

$$E(\hat\beta_1) = \beta_1, \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_{i1}^2}.$$

Note that in this case the variance of $\hat\beta_1$ is lower.

The variance is the expected value of the squared error for an unbiased estimator. So we are worse off using the irrelevant variable in making predictions. Even if $X_2$ happens to be uncorrelated with $X_1$, so that $R_{12}^2 = 0$ and the variance of $\hat\beta_1$ is the same in both models, we can show that the variance of a prediction based on Model (2.3) will be worse than one based on Model (2.2), due to the added variability introduced by the estimation of $\beta_2$. Although our analysis has been based on one useful independent variable and one irrelevant independent variable, the result holds true in general: it is always better to make predictions with models that do not include irrelevant variables.

### Dropping Independent Variables with Small Coefficient Values

Suppose that the situation is the reverse of what we have discussed above, namely that Model (2.3) is the correct equation, but we use Model (2.2) for our estimates and predictions, ignoring the variable $X_2$ in our model. To keep our results simple, let us suppose that we have scaled the values of $X_1$, $X_2$, and $Y$ so that their variances are equal to 1. In this case the least squares estimate $\hat\beta_1$ has the following expected value and variance:

$$E(\hat\beta_1) = \beta_1 + R_{12}\beta_2, \qquad \mathrm{Var}(\hat\beta_1) = \sigma^2.$$

Notice that $\hat\beta_1$ is a biased estimator of $\beta_1$, with bias equal to $R_{12}\beta_2$, and its mean square error is given by:

$$\begin{aligned} \mathrm{MSE}(\hat\beta_1) &= E[(\hat\beta_1 - \beta_1)^2] \\ &= E[\{\hat\beta_1 - E(\hat\beta_1) + E(\hat\beta_1) - \beta_1\}^2] \\ &= [\mathrm{Bias}(\hat\beta_1)]^2 + \mathrm{Var}(\hat\beta_1) \\ &= (R_{12}\beta_2)^2 + \sigma^2. \end{aligned}$$

If we use Model (2.3), the least squares estimates have the following expected values and variances:

$$E(\hat\beta_1) = \beta_1, \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{1 - R_{12}^2},$$

$$E(\hat\beta_2) = \beta_2, \qquad \mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{1 - R_{12}^2}.$$

Now let us compare the mean square errors for predicting $Y$ at $X_1 = u_1, X_2 = u_2$.

For Model (2.2), the mean square error is:

$$\begin{aligned} \mathrm{MSE}_2(\hat Y) &= E[(\hat Y - Y)^2] \\ &= E[(u_1\hat\beta_1 - u_1\beta_1 - \varepsilon)^2] \\ &= u_1^2\, \mathrm{MSE}_2(\hat\beta_1) + \sigma^2 \\ &= u_1^2 (R_{12}\beta_2)^2 + u_1^2\sigma^2 + \sigma^2. \end{aligned}$$

For Model (2.3), the mean square error is:

$$\begin{aligned} \mathrm{MSE}_3(\hat Y) &= E[(\hat Y - Y)^2] \\ &= E[(u_1\hat\beta_1 + u_2\hat\beta_2 - u_1\beta_1 - u_2\beta_2 - \varepsilon)^2] \\ &= \mathrm{Var}(u_1\hat\beta_1 + u_2\hat\beta_2) + \sigma^2 \quad \text{(because now } \hat Y \text{ is unbiased)} \\ &= u_1^2 \mathrm{Var}(\hat\beta_1) + u_2^2 \mathrm{Var}(\hat\beta_2) + 2u_1u_2 \mathrm{Cov}(\hat\beta_1, \hat\beta_2) + \sigma^2 \\ &= \frac{u_1^2 + u_2^2 - 2u_1u_2R_{12}}{1 - R_{12}^2}\,\sigma^2 + \sigma^2. \end{aligned}$$

Model (2.2) can lead to lower mean square error for many combinations of values of $u_1$, $u_2$, $R_{12}$, and $(\beta_2/\sigma)^2$. For example, if $u_1 = 1$ and $u_2 = 0$, then $\mathrm{MSE}_2(\hat Y) < \mathrm{MSE}_3(\hat Y)$ when

$$(R_{12}\beta_2)^2 + \sigma^2 < \frac{\sigma^2}{1 - R_{12}^2},$$

i.e., when

$$\left|\frac{\beta_2}{\sigma}\right| < \frac{1}{\sqrt{1 - R_{12}^2}}.$$

If $|\beta_2/\sigma| < 1$, this holds for all values of $R_{12}^2$; if, however, say, $R_{12}^2 > 0.9$, then it holds whenever $|\beta_2/\sigma| < 3$. In general, accepting some bias can reduce the MSE. This bias-variance trade-off generalizes to models with several independent variables, and is particularly important for large values of the number $p$ of independent variables, since in that case it is very likely that there are variables in the model that have small coefficients relative to the standard deviation of the noise term and that also exhibit at least moderate correlation with other variables. Dropping such variables will improve the predictions, as it reduces the MSE. This type of bias-variance trade-off is a basic aspect of most data mining procedures for prediction and classification.
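A small Monte Carlo check of this trade-off (a sketch with hypothetical parameter values, not part of the original text): with a small $\beta_2$ relative to $\sigma$ and strongly correlated predictors, the biased one-variable model predicts better at $u_1 = 1, u_2 = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters: small beta2 relative to sigma, high correlation
b1, b2, sigma, r12 = 1.0, 0.1, 1.0, 0.9
n, reps = 30, 4000

err_full, err_drop = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = r12 * x1 + np.sqrt(1 - r12**2) * rng.normal(size=n)  # corr ~ r12
    y = b1 * x1 + b2 * x2 + rng.normal(scale=sigma, size=n)
    X = np.column_stack([x1, x2])

    # Both models are fitted without an intercept, as in (2.2) and (2.3)
    b_full, *_ = np.linalg.lstsq(X, y, rcond=None)         # Model (2.3)
    b_drop, *_ = np.linalg.lstsq(X[:, :1], y, rcond=None)  # Model (2.2)

    true_mean = b1 * 1.0 + b2 * 0.0        # E[Y] at u1 = 1, u2 = 0
    err_full.append(b_full[0] * 1.0 + b_full[1] * 0.0 - true_mean)
    err_drop.append(b_drop[0] * 1.0 - true_mean)

mse_full = np.mean(np.square(err_full))
# The dropped model is biased, but its bias^2 is tiny here, so its MSE wins
mse_drop = np.mean(np.square(err_drop))
```

With these settings $|\beta_2/\sigma| = 0.1 < 1$, so the condition derived above predicts `mse_drop < mse_full`, which the simulation confirms.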

### Algorithms for Subset Selection

Selecting subsets to improve MSE is a difficult computational problem for a large number $p$ of independent variables. The most common procedure for $p$ greater than about 20 is to use heuristics to select good subsets rather than to look for the best subset for a given criterion. The heuristics most often used, and available in statistics software, are step-wise procedures. There are three common procedures: forward selection, backward elimination, and step-wise regression.

**Forward Selection** Here we keep adding variables one at a time to construct what we hope is a reasonably good subset. The steps are as follows:

1. Start with only the constant term in the subset S.
2. Compute the reduction in the sum of squared residuals (SSR) obtained by including each variable that is not presently in S. We denote by $SSR(S)$ the sum of squared residuals given that the model consists of the set S of variables, and let $\hat\sigma^2(S)$ be an unbiased estimate of $\sigma^2$ for that model. For the variable, say $i$, that gives the largest reduction in SSR, compute
$$F_i = \max_{i \notin S} \frac{SSR(S) - SSR(S \cup \{i\})}{\hat\sigma^2(S \cup \{i\})}.$$
If $F_i > F_{in}$, where $F_{in}$ is a threshold (typically between 2 and 4), add $i$ to S.
3. Repeat step 2 until no variable can be added.

**Backward Elimination**

1. Start with all variables in S.
2. Compute the increase in the sum of squared residuals (SSR) obtained by excluding each variable that is presently in S. For the variable, say $i$, that gives the smallest increase in SSR, compute
$$F_i = \min_{i \in S} \frac{SSR(S \setminus \{i\}) - SSR(S)}{\hat\sigma^2(S)}.$$
If $F_i < F_{out}$, where $F_{out}$ is a threshold (typically between 2 and 4), drop $i$ from S.
3. Repeat step 2 until no variable can be dropped.
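The forward-selection loop above can be sketched in a few lines of NumPy (a sketch; the helper names are mine, and the partial-F threshold follows the text):

```python
import numpy as np

def ssr(cols, y):
    """Sum of squared residuals for an OLS fit of y on a constant plus cols."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_selection(X, y, f_in=4.0):
    """Add variables one at a time while the partial-F statistic exceeds f_in."""
    n, p = X.shape
    S = []                                     # indices of selected variables
    while len(S) < p:
        ssr_S = ssr([X[:, j] for j in S], y)
        # candidate variable giving the largest SSR reduction
        best = max((j for j in range(p) if j not in S),
                   key=lambda j: ssr_S - ssr([X[:, i] for i in S + [j]], y))
        ssr_best = ssr([X[:, i] for i in S + [best]], y)
        sigma2 = ssr_best / (n - len(S) - 2)   # unbiased sigma^2 for S + {best}
        if (ssr_S - ssr_best) / sigma2 <= f_in:
            break                              # no candidate clears F_in
        S.append(best)
    return S

# Example: X0 carries a strong signal, X1..X3 are pure noise
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 5 * X[:, 0] + 0.5 * rng.normal(size=100)
selected = forward_selection(X, y)
```

Backward elimination is symmetric: start from the full set and repeatedly drop the variable with the smallest SSR increase while its F statistic stays below $F_{out}$.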

Backward elimination has the advantage that all variables are included in S at some stage. This addresses a problem of forward selection, which will never select a variable that is better than a previously selected variable that is strongly correlated with it. The disadvantage is that the full model with all variables is required at the start, and this can be time-consuming and numerically unstable.

**Step-wise Regression** This procedure is like forward selection except that at each step we also consider dropping variables, as in backward elimination. Convergence is guaranteed if the thresholds $F_{out}$ and $F_{in}$ satisfy $F_{out} < F_{in}$. It is possible, however, for a variable to enter S, then leave S at a subsequent step, and even rejoin S at a yet later step.

As stated above, these methods each pick one best subset. There are straightforward variations of the methods that do identify several close-to-best choices for different sizes of independent-variable subsets. None of the above methods guarantees that it yields the best subset for a criterion such as adjusted $R^2$ (defined later in this note). They are reasonable methods for situations with large numbers of independent variables, but for moderate numbers of independent variables the method discussed next is preferable.

**All Subsets Regression** The idea here is to evaluate all subsets. Efficient implementations use branch-and-bound algorithms of the type you have seen in DMD for integer programming, to avoid explicitly enumerating all subsets. (In fact, the subset selection problem can be set up as a quadratic integer program.) We compute a criterion such as $R^2_{adj}$, the adjusted $R^2$, for all subsets in order to choose the best one. (This is only feasible if $p$ is less than about 20.)

### Identifying Subsets of Variables to Improve Predictions

All subsets regression (as well as the modifications of the heuristic algorithms) will produce a number of subsets.
Since the number of subsets for even moderate values of $p$ is very large, we need some way to examine the most promising subsets and to select from them. An intuitive metric for comparing subsets is $R^2$. However, since

$$R^2 = 1 - \frac{SSR}{SST},$$

where SST, the total sum of squares, is the sum of squared residuals for the model with just the constant term, using $R^2$ as a criterion will always pick the full model with all $p$ variables. One approach is therefore to select the subset with the largest $R^2$ for each possible size $k$, $k = 2, \ldots, p+1$. (The size is the number of coefficients in the model, and is therefore one more than the number of variables in the subset, to account for the constant term.) We then examine the increase in $R^2$ as a function of $k$ amongst these subsets, and choose a subset beyond which larger subsets give only insignificant increases in $R^2$.

Another, more automatic, approach is to choose the subset that maximizes $R^2_{adj}$, a modification of $R^2$ that makes an adjustment to account for size. The formula for $R^2_{adj}$ is

$$R^2_{adj} = 1 - \frac{n-1}{n-k}\,(1 - R^2).$$

It can be shown that using $R^2_{adj}$ to choose a subset is equivalent to picking the subset that minimizes $\hat\sigma^2$.

Table 2.4 gives the results of the subset selection procedures applied to the training data of the supervisor example in Section 2.2. Notice that the step-wise method fails to find the best subset for sizes of 4, 5, and 6 variables. The forward and backward methods do find the best subsets of all sizes, and so give results identical to the all subsets algorithm. The best subset of size 3, consisting of {X1, X3}, maximizes $R^2_{adj}$ for all the algorithms. This suggests that we may be better off, in terms of the MSE of predictions, using this subset rather than the full model of size 7 with all six variables in the model. Using this model on the validation data gives a slightly higher standard deviation of error (7.3) than the full model (7.1), but this may be a small price to pay if the cost of the survey can be reduced substantially by asking 2 questions instead of 6. This example also underscores the fact that we are basing our analysis on small (tiny by data mining standards!) training and validation data sets; small data sets make our estimates of $R^2$ unreliable.

A criterion that is often used for subset selection is known as Mallows' $C_p$.
This criterion assumes that the full model is unbiased, although it may contain variables that, if dropped, would improve the MSE. Under this assumption we can show that if a subset model is unbiased, then $E(C_p) = k$, the size of the subset. Thus a reasonable approach to identifying subset models with small bias is to examine those with values of $C_p$ near $k$. $C_p$ is also an estimate of the sum of the MSE (standardized by dividing by $\sigma^2$) of the predictions (the fitted values) at the x-values observed in the training set. Thus good models are those with values of $C_p$ near $k$ and with small $k$ (i.e., of small size). $C_p$ is computed from the formula

$$C_p = \frac{SSR}{\hat\sigma^2_{Full}} + 2k - n,$$

where $\hat\sigma^2_{Full}$ is the estimated value of $\sigma^2$ in the full model that includes all the variables. It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of $\sigma^2$ for the full model. This requires that the training set contain a large number of observations relative to the number of variables. We note that for our example only the subsets of size 6 and 7 seem to be unbiased, as for the other models $C_p$ differs substantially from $k$. This is a consequence of having too few observations to estimate $\sigma^2$ accurately in the full model.

Table 2.4: Subset selection for the example in Section 2.2 (for each size: the model, SSR, $R^2$, $R^2_{adj}$, and $C_p$, listed for the forward, backward, and all subsets selections and for step-wise selection).
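The two criteria of this section can be computed directly from a subset's SSR. A minimal sketch follows (the helper names are mine, and the numeric inputs are illustrative, not values from Table 2.4):

```python
import numpy as np

def r2_adj(ssr, sst, n, k):
    """Adjusted R^2 for a model with k coefficients (variables + constant)."""
    r2 = 1 - ssr / sst
    return 1 - (n - 1) / (n - k) * (1 - r2)

def mallows_cp(ssr, sigma2_full, n, k):
    """Mallows' C_p = SSR / sigma_hat^2_full + 2k - n."""
    return ssr / sigma2_full + 2 * k - n

# Illustrative numbers only (NOT the chapter's table values)
n, k_full, ssr_full, sst = 20, 7, 738.9, 4296.0

# For the full model, sigma_hat^2_full = SSR_full / (n - k_full), so its C_p
# works out to exactly k_full, consistent with E(C_p) = k for unbiased models
sigma2_full = ssr_full / (n - k_full)
cp_full = mallows_cp(ssr_full, sigma2_full, n, k_full)
```

Subsets are then ranked by $R^2_{adj}$ (larger is better) or by how close $C_p$ is to $k$ at small $k$.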


University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

### Stata Walkthrough 4: Regression, Prediction, and Forecasting

Stata Walkthrough 4: Regression, Prediction, and Forecasting Over drinks the other evening, my neighbor told me about his 25-year-old nephew, who is dating a 35-year-old woman. God, I can t see them getting

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

### Recall this chart that showed how most of our course would be organized:

Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

### FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies Lecture 6. Portfolio Optimization: Basic Theory and Practice Steve Yang Stevens Institute of Technology 10/03/2013 Outline 1 Mean-Variance Analysis: Overview 2 Classical

### 1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand

### Using Minitab for Regression Analysis: An extended example

Using Minitab for Regression Analysis: An extended example The following example uses data from another text on fertilizer application and crop yield, and is intended to show how Minitab can be used to

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Forecasting in supply chains

1 Forecasting in supply chains Role of demand forecasting Effective transportation system or supply chain design is predicated on the availability of accurate inputs to the modeling process. One of the

### 1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

### GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

### Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

### ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

### SYSTEMS OF REGRESSION EQUATIONS

SYSTEMS OF REGRESSION EQUATIONS 1. MULTIPLE EQUATIONS y nt = x nt n + u nt, n = 1,...,N, t = 1,...,T, x nt is 1 k, and n is k 1. This is a version of the standard regression model where the observations

### 1 Simple Linear Regression I Least Squares Estimation

Simple Linear Regression I Least Squares Estimation Textbook Sections: 8. 8.3 Previously, we have worked with a random variable x that comes from a population that is normally distributed with mean µ and

### One-Way Analysis of Variance (ANOVA) Example Problem

One-Way Analysis of Variance (ANOVA) Example Problem Introduction Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test the equality of two or more population (or treatment) means

### Review of basic statistics and the simplest forecasting model: the sample mean

Review of basic statistics and the simplest forecasting model: the sample mean Robert Nau Fuqua School of Business, Duke University August 2014 Most of what you need to remember about basic statistics

### Statistics 112 Regression Cheatsheet Section 1B - Ryan Rosario

Statistics 112 Regression Cheatsheet Section 1B - Ryan Rosario I have found that the best way to practice regression is by brute force That is, given nothing but a dataset and your mind, compute everything

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### CHAPTER 5. Exercise Solutions

CHAPTER 5 Exercise Solutions 91 Chapter 5, Exercise Solutions, Principles of Econometrics, e 9 EXERCISE 5.1 (a) y = 1, x =, x = x * * i x i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 y * i (b) (c) yx = 1, x = 16, yx

### 2. What is the general linear model to be used to model linear trend? (Write out the model) = + + + or

Simple and Multiple Regression Analysis Example: Explore the relationships among Month, Adv.\$ and Sales \$: 1. Prepare a scatter plot of these data. The scatter plots for Adv.\$ versus Sales, and Month versus

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate

### CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,

### 1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

### Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements

### Regression. Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Class: Date: Regression Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Given the least squares regression line y8 = 5 2x: a. the relationship between

### Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

### A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study

A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study But I will offer a review, with a focus on issues which arise in finance 1 TYPES OF FINANCIAL

### Lectures 8, 9 & 10. Multiple Regression Analysis

Lectures 8, 9 & 0. Multiple Regression Analysis In which you learn how to apply the principles and tests outlined in earlier lectures to more realistic models involving more than explanatory variable and

### Example G Cost of construction of nuclear power plants

1 Example G Cost of construction of nuclear power plants Description of data Table G.1 gives data, reproduced by permission of the Rand Corporation, from a report (Mooz, 1978) on 32 light water reactor

### One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups

One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups In analysis of variance, the main research question is whether the sample means are from different populations. The

### Applying Statistics Recommended by Regulatory Documents

Applying Statistics Recommended by Regulatory Documents Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 301-325 325-31293129 About the Speaker Mr. Steven

### Robust procedures for Canadian Test Day Model final report for the Holstein breed

Robust procedures for Canadian Test Day Model final report for the Holstein breed J. Jamrozik, J. Fatehi and L.R. Schaeffer Centre for Genetic Improvement of Livestock, University of Guelph Introduction

### Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

### August 2012 EXAMINATIONS Solution Part I

August 01 EXAMINATIONS Solution Part I (1) In a random sample of 600 eligible voters, the probability that less than 38% will be in favour of this policy is closest to (B) () In a large random sample,

### Lecture 5 Hypothesis Testing in Multiple Linear Regression

Lecture 5 Hypothesis Testing in Multiple Linear Regression BIOST 515 January 20, 2004 Types of tests 1 Overall test Test for addition of a single variable Test for addition of a group of variables Overall

### Chapter 6: Point Estimation. Fall 2011. - Probability & Statistics

STAT355 Chapter 6: Point Estimation Fall 2011 Chapter Fall 2011 6: Point1 Estimat / 18 Chap 6 - Point Estimation 1 6.1 Some general Concepts of Point Estimation Point Estimate Unbiasedness Principle of

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

### Simple Methods and Procedures Used in Forecasting

Simple Methods and Procedures Used in Forecasting The project prepared by : Sven Gingelmaier Michael Richter Under direction of the Maria Jadamus-Hacura What Is Forecasting? Prediction of future events

### ANOVA. February 12, 2015

ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R

### Simple linear regression

Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

### The average hotel manager recognizes the criticality of forecasting. However, most

Introduction The average hotel manager recognizes the criticality of forecasting. However, most managers are either frustrated by complex models researchers constructed or appalled by the amount of time

### REGRESSION LINES IN STATA

REGRESSION LINES IN STATA THOMAS ELLIOTT 1. Introduction to Regression Regression analysis is about eploring linear relationships between a dependent variable and one or more independent variables. Regression

### STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp. 238-239

STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. by John C. Davis Clarificationof zonationprocedure described onpp. 38-39 Because the notation used in this section (Eqs. 4.8 through 4.84) is inconsistent

### Multiple Optimization Using the JMP Statistical Software Kodak Research Conference May 9, 2005

Multiple Optimization Using the JMP Statistical Software Kodak Research Conference May 9, 2005 Philip J. Ramsey, Ph.D., Mia L. Stephens, MS, Marie Gaudard, Ph.D. North Haven Group, http://www.northhavengroup.com/

### Yiming Peng, Department of Statistics. February 12, 2013

Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop

### We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic

### The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.

Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium

### 3.1 Least squares in matrix form

118 3 Multiple Regression 3.1 Least squares in matrix form E Uses Appendix A.2 A.4, A.6, A.7. 3.1.1 Introduction More than one explanatory variable In the foregoing chapter we considered the simple regression

### 10. Analysis of Longitudinal Studies Repeat-measures analysis

Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.

### MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal

MGT 267 PROJECT Forecasting the United States Retail Sales of the Pharmacies and Drug Stores Done by: Shunwei Wang & Mohammad Zainal Dec. 2002 The retail sale (Million) ABSTRACT The present study aims

### Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

### Regression step-by-step using Microsoft Excel

Step 1: Regression step-by-step using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Coefficient of Determination

Coefficient of Determination The coefficient of determination R 2 (or sometimes r 2 ) is another measure of how well the least squares equation ŷ = b 0 + b 1 x performs as a predictor of y. R 2 is computed