# USING SAS/STAT SOFTWARE'S REG PROCEDURE TO DEVELOP SALES TAX AUDIT SELECTION MODELS


with tax assessments. These problems are multicollinearity among the values of the independent variables and influence data points.

Multicollinearity

Multicollinearity is present when an independent variable is nearly a linear combination of other independent variables in the model. Multicollinearity affects regression analysis in the following ways:

A. It produces large variances of coefficients.
B. It results in unstable coefficients.
C. It produces regression coefficients that are too large in magnitude.
D. It can result in poor prediction.

Given that prediction is our main goal, the potential presence of multicollinearity among the independent variables used in a model should be carefully investigated. An example of multicollinearity would be a business type where gross sales and exempt sales were highly correlated. In this case, the analyst may want to consider removing one of the variables from the model.

The VIF and COLLIN options are collinearity diagnostics provided by SAS. The VIF option reports the variance inflation factor, which can be interpreted as follows: for a given variable, the variance inflation factor measures how much larger the variance of the parameter estimate is than it would be if no multicollinearity were present. As a rule of thumb, a VIF greater than ten (10) can be used as an indicator of a potential collinearity problem. The COLLIN option produces a table of eigenvalues, condition indices, and variance proportions, which can be used to examine which terms are causing the problem. The number of eigenvalues near zero indicates the number of near linear dependencies. Large values for the condition number also indicate collinearity. High loadings on the variance proportions indicate which terms are causing the problem.

Example 1 (Continued)

In the above example, we are concerned about possible collinearity between T GROSS and T BALDUE and between GROSS2 and BALDUE2.
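The definition of the variance inflation factor can be illustrated outside of SAS. The sketch below, in plain Python with hypothetical sales figures, computes the VIF for one predictor when there is a single other regressor, in which case the R² from regressing one on the other is just the squared correlation; the function name and data are our own illustration, not the paper's SAS output.

```python
import math

def vif_two_predictors(x1, x2):
    """VIF for x1 when the only other predictor is x2.

    With a single other regressor, the R-squared from regressing x1 on x2
    is the squared Pearson correlation, so VIF = 1 / (1 - r^2).
    """
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r = cov / math.sqrt(v1 * v2)
    return 1.0 / (1.0 - r * r)

# Hypothetical data: "exempt sales" tracking "gross sales" closely.
gross = [100, 200, 300, 400, 500]
exempt = [12, 21, 33, 39, 52]  # nearly proportional to gross
print(round(vif_two_predictors(gross, exempt), 1))
```

Because the two series are nearly proportional, the VIF comes out far above the rule-of-thumb cutoff of ten, which is exactly the situation where dropping one of the pair deserves consideration.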
As the variance inflation factors reported below indicate, the seven-variable model selected by STEPWISE in the above example would appear to have multicollinearity problems:

| Variable | VIF |
|----------|-----|
| T GROSS  | 936 |
| T BALDUE | 887 |
| T EXEMPT | 18  |
| GROSS2   |     |
| BALDUE2  | 271 |
| STRUCF   | 1   |
| EXEMPT2  | 8   |

The small eigenvalue and large condition number associated with the eighth principal component are indications of a collinearity problem, and the variables with the highest loadings on the variance proportions for the eighth component were T GROSS, T BALDUE, GROSS2, and BALDUE2. Since the variable T GROSS has the highest variance inflation factor and the highest variance proportion for the eighth component, the decision was made to drop it from the model. This resulted in only a slight drop in adjusted R², whereas PRESS and Mallow's Cp for the six-variable model are slightly better. Moreover, as the table below indicates, the variance inflation factors showed marked improvement, although they still indicate the presence of collinearity in the model:

| Variable | VIF |
|----------|-----|
| T BALDUE | 18  |
| T EXEMPT | 10  |
| GROSS2   | 42  |
| BALDUE2  | 28  |
| STRUCF   | 1   |
| EXEMPT2  |     |
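The two prediction-oriented criteria used to compare these candidate models, Mallow's Cp and PRESS, can be sketched in plain Python. The data and function names below are hypothetical illustrations (the paper's computations were done in SAS); PRESS is shown for simple linear regression, where each leave-one-out prediction error equals the ordinary residual divided by one minus that point's leverage.

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallow's Cp for a submodel with p parameters (including the
    intercept), using the full model's MSE as the error estimate:
    Cp = SSE_p / MSE_full - (n - 2p)."""
    return sse_p / mse_full - (n - 2 * p)

def press_simple(x, y):
    """PRESS for simple linear regression: the sum of squared
    leave-one-out prediction errors, computed without refitting as
    sum((e_i / (1 - h_i))**2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx                        # slope
    b0 = my - b1 * mx                     # intercept
    total = 0.0
    for a, b in zip(x, y):
        resid = b - (b0 + b1 * a)
        h = 1.0 / n + (a - mx) ** 2 / sxx  # leverage of this observation
        total += (resid / (1.0 - h)) ** 2
    return total

# A submodel whose Cp is close to p shows little bias.
print(mallows_cp(sse_p=540.0, mse_full=10.0, n=60, p=6))   # 6.0
# A perfect fit gives PRESS of zero; one extreme y-value inflates it.
print(press_simple([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))     # 0.0
print(press_simple([1, 2, 3, 4, 5], [2, 4, 6, 8, 100]))
```

A Cp near p indicates a nearly unbiased submodel, while a large PRESS flags poor out-of-sample prediction even when the in-sample fit looks acceptable, which is why both are tracked in the comparisons here.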

As the table below indicates, with the exception of dropping T EXEMPT, which is discussed below, efforts to improve the model by dropping additional variables resulted in diminishing predictive capability based on PRESS and Mallow's Cp (Mallow's Cp statistic was calculated using the full-model MSE). The candidate models, compared on adjusted R², PRESS, Mallow's Cp, and p+1, were:

- T GROSS, T BALDUE, T EXEMPT, GROSS2, BALDUE2, STRUCF, EXEMPT2
- T BALDUE, T EXEMPT, GROSS2, BALDUE2, STRUCF, EXEMPT2
- T BALDUE, GROSS2, BALDUE2, STRUCF, EXEMPT2
- T BALDUE, T EXEMPT, BALDUE2, STRUCF, EXEMPT2
- T BALDUE, T EXEMPT, GROSS2, STRUCF, EXEMPT2
- GROSS2, T EXEMPT, BALDUE2, STRUCF, EXEMPT2
- GROSS2, BALDUE2, STRUCF, EXEMPT2
- GROSS2, STRUCF, BALDUE2
- GROSS2, BALDUE2

What we seem to have here is a situation where two variables, GROSS2 and BALDUE2, are collinear, but both must be included for the model to have an acceptable adjusted R², PRESS, and Mallow's Cp. Reported below are the parameter estimates associated with the six-variable model:

| Variable | Parameter Estimate | Standard Error | Prob>\|T\| |
|----------|--------------------|----------------|------------|
| INTERCEP |                    |                |            |
| T BALDUE |                    |                |            |
| T EXEMPT |                    |                |            |
| GROSS2   |                    |                |            |
| BALDUE2  |                    |                |            |
| STRUCF   |                    |                |            |
| EXEMPT2  |                    |                |            |

The presence of a variable, T EXEMPT, in the model which is not significant at the 0.05 level is also of concern. As the table above indicated, by dropping this variable, the model improves slightly in adjusted R², PRESS, and Mallow's Cp. As reported below, the variance inflation factors are either the same as or slightly better than those of the six-variable model:

| Variable | VIF |
|----------|-----|
| T BALDUE | 16  |
| GROSS2   | 43  |
| BALDUE2  | 28  |
| STRUCF   | 1   |
| EXEMPT2  | 1   |

Thus, the decision was made to use the five-variable model. The parameter estimates are reported below:

| Variable | Parameter Estimate | Standard Error | Prob>\|T\| |
|----------|--------------------|----------------|------------|
| INTERCEP |                    |                |            |
| T BALDUE |                    |                |            |
| GROSS2   |                    |                |            |
| BALDUE2  |                    |                |            |
| STRUCF   |                    |                |            |
| EXEMPT2  |                    |                |            |

Influence Data Points

Influence data points are points which exert an undue influence on the regression equation. This may be the result, for example, of an outlying observation.
If a set of data for a given business type included one extremely large per-hour field audit assessment, this data point could possibly exert an undue influence on the regression equation for that business type. It is important to note that the mere presence of such a data point does not necessarily mean that it does exert an undue influence, only that it may do so. If it does, the data point would be termed an outlier. Because of the nature of our data, influence data points are a serious problem for both the dependent and independent variables. The presence of large per-hour assessments may produce outliers among the values of the dependent variables for some business types. The presence of large values for some independent variables (particularly large gross sales, large exempt sales, large use taxable, and large tax balances due) may produce high-leverage data points. The detection of influence data points is not always straightforward. Moreover, the issue of the remedy is a source of some controversy. While some statisticians may recommend removing outliers from the data, others do not. If the data point is valid, that is to say, the data for that observation is correctly measured, then we feel that there should be a compelling reason for removing it from the data set.

Example 2

We have a group of manufacturers for which 53 sales tax audits have been performed, with an average per-hour assessment of $18,241. This extremely high average per-hour assessment leads us to suspect that there might be one or more outliers in the data, that is to say, observations which exert an undue influence on the regression equation.
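The high-leverage phenomenon described above can be made concrete with a small sketch. The pure-Python function below (our own illustration, with hypothetical sales figures) computes the hat diagonals for simple linear regression; one firm far larger than the rest captures nearly all of the leverage, which is exactly how a large gross-sales value can dominate a fit before any residual is even examined.

```python
def leverages(x):
    """Hat-diagonal (leverage) values for simple linear regression on x:
    h_i = 1/n + (x_i - mean)^2 / Sxx. They always sum to the number of
    parameters (2 here: slope and intercept)."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return [1.0 / n + (a - mx) ** 2 / sxx for a in x]

# Hypothetical gross-sales values: one firm is far larger than the rest.
x = [120, 150, 170, 200, 5000]
h = leverages(x)
print([round(v, 3) for v in h])
```

The extreme point's leverage is close to its maximum possible value of one and far above the 2p/n rule of thumb (0.8 here), while the ordinary points sit near 1/n; a high-leverage point is only a *potential* influence point until diagnostics such as DFFITS confirm it actually moves the fit.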

Following the methodology discussed above, the STEPWISE option was used to select an initial model for analysis. This model is presented below:

| Step | Variable Entered | Model R² | Prob>\|T\| | Mallow's Cp |
|------|------------------|----------|------------|-------------|
| 1    | USE2             |          |            |             |
| 2    | T USE            |          |            |             |
| 3    | BALDUE2          |          |            |             |
| 4    | STRUCD           |          |            |             |
| 5    | DIRPAY           |          |            |             |
| 6    | STRUCA           |          |            |             |

where USE2 = use taxable squared, T USE = total use taxable, BALDUE2 = total tax due squared, STRUCD = a dummy variable indicating whether the taxpayer registered as a domestic corporation, DIRPAY = a dummy variable indicating whether the taxpayer has a direct pay permit, and STRUCA = a dummy variable indicating whether the taxpayer registered as a sole proprietor.

The dominance of USE2 further alerted us to the possibility of a problem with the data. Even though it produced a high R², the large Mallow's Cp statistic indicated that the model also has considerable bias. In addition, the PRESS statistic for this model was extremely large, indicating poor prediction capability. The INFLUENCE option is used to produce statistics which measure the influence of each observation on the estimates. These statistics include the following: RSTUDENT (the studentized residuals), HAT DIAG H (the hat diagonals), COV RATIO (the covariance ratio), DFFITS (a scaled measure of the change in the predicted value for the ith observation), and DFBETAS (scaled measures of the change in each parameter estimate for each variable included in the model).
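The cutoff values quoted below for these diagnostics follow standard rules of thumb, and they are reproduced exactly by n = 53 audits and p = 7 parameters (six regressors plus the intercept). The sketch below is our own Python illustration of those formulas, not part of the paper's SAS run:

```python
import math

def influence_cutoffs(n, p):
    """Rule-of-thumb cutoffs for the influence diagnostics produced by
    PROC REG's INFLUENCE option, for n observations and p parameters."""
    return {
        "RSTUDENT": 2.0,                   # flag |studentized residual| > 2
        "HAT": 2.0 * p / n,                # flag leverage > 2p/n
        "COVRATIO_LO": 1.0 - 3.0 * p / n,  # flag COVRATIO outside 1 +/- 3p/n
        "COVRATIO_HI": 1.0 + 3.0 * p / n,
        "DFFITS": 2.0 * math.sqrt(p / n),  # flag |DFFITS| > 2*sqrt(p/n)
        "DFBETAS": 2.0 / math.sqrt(n),     # flag |DFBETAS| > 2/sqrt(n)
    }

# n = 53 audits, p = 7 parameters (six regressors plus the intercept)
cuts = influence_cutoffs(53, 7)
for name, value in cuts.items():
    print(name, round(value, 4))
```

With these inputs the HAT cutoff is .2642, the lower COVRATIO bound is .6038, the DFFITS cutoff is .7268, and the DFBETAS cutoff is .2747, matching the values reported for this data set.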
For the data set and model under consideration, the table below presents the values which would be considered as indicators of potential influence points:

| Statistic  | Indicator of a potential influence point |
|------------|------------------------------------------|
| RSTUDENT   | absolute value greater than 2 |
| HAT DIAG H | value greater than .2642 (2p/n, where p = number of parameters and n = sample size) |
| COV RATIO  | value less than .6038 or greater than 1.3962 (1 plus or minus 3(p/n)) |
| DFFITS     | value greater than .7268 (2 times the square root of p/n) |
| DFBETAS    | value greater than .2747 (2 over the square root of n) |

We found that a number of observations had values on one or more of the above statistics indicating that they may exert a large influence on the parameter estimates. One observation (Observation 11 in the data set) seemed to stand out from the others, however, across its RSTUDENT, HAT DIAG H, COV RATIO, and DFFITS values and its DFBETAS for INTERCEP, DIRPAY, T USE, STRUCA, STRUCD, and BALDUE2. The values of these statistics led us to investigate this observation. We discovered that although the data for the observation were correct, the assessment per hour for this observation was so large that it almost completely dominated the regression equation. We felt that we were justified in considering this data point to be an atypical value and therefore removing it from the data set.

We removed this observation from the data set and ran PROC REG with the STEPWISE option again. The selected model is presented below:

| Step | Variable Entered | Model R² | Prob>\|T\| | Mallow's Cp |
|------|------------------|----------|------------|-------------|
| 1    | BALDUE2          |          |            |             |
| 2    | USECODEO         |          |            |             |
| 3    | USE2             |          |            |             |
| 4    | DIRPAY           |          |            |             |
| 5    | PERBALGR         |          |            |             |

where BALDUE2 = total tax due squared, USECODEO = a dummy variable indicating whether the taxpayer registered as a peddler, USE2 = use taxable squared, DIRPAY = a dummy variable indicating whether the taxpayer has a direct pay permit, and PERBALGR = a derived

variable measuring the percent of total tax due to gross sales. Even though the R² is considerably lower, the PRESS statistic for the model fitted to the data set with the atypical observation is much worse than for the model fitted to the data set without the outlier. The PRESS statistic for the former model was 755,576,780,357, whereas for the latter model it was 5,636,307. Similarly, Mallow's Cp for the former model was 30; for the latter model it was 15. The mean square error for the former model was 841,432, while for the latter model it was 64,658. Thus, we feel justified in removing the data point from the data set.

We ran the INFLUENCE option against this new model to identify any additional influence data points. Using the same criteria discussed above, several data points still had values on the diagnostics which were of concern. Three data points particularly stood out. Two observations had studentized residuals well above the absolute value of two. The third observation had a covariance ratio of 74. Two of these observations had large values for the dependent variable (that is, large assessments per hour), whereas the other observation was the result of a no-change audit (i.e., assessment per hour = 0). We did not feel at this point that any of these observations were sufficiently atypical of audits performed by the TDR to justify removing them from the data set.

We were concerned with the presence in the model of a term which was not significant at the .05 level. Therefore, we chose to run the model again without the variable PERBALGR. This resulted in a model with a slightly worse adjusted R² and PRESS but, as the table below indicates, all terms in the model are now significant at the .05 level:

| Variable | Parameter Estimate | Standard Error | Prob>\|T\| |
|----------|--------------------|----------------|------------|
| INTERCEP |                    |                |            |
| BALDUE2  |                    |                |            |
| USECODEO |                    |                |            |
| USE2     |                    |                |            |
| DIRPAY   |                    |                |            |

Finally, we ran the VIF option to get the variance inflation factors for the above model.
The VIFs reported for BALDUE2, USECODEO, USE2, and DIRPAY indicated that the model did not have a collinearity problem.

CONCLUDING REMARKS

In conclusion, we would like to make some remarks on the SAS diagnostic procedures. SAS offers an impressive array of diagnostics; for the novice, the biggest problem may be deciding which diagnostics to use. Moreover, it is extremely easy to invoke most of the diagnostics: all the diagnostics discussed in this paper are options to the MODEL statement. We were also impressed with the enhancements in Version 6.03, such as the CP and ADJRSQ model selection options, which produce a printout of the models ranked according to the best Mallow's Cp and adjusted R² statistics, respectively. An option like this for the PRESS statistic would also be useful. We have not had an opportunity, however, to fully evaluate these enhancements.

We were disappointed with some shortcomings, however, particularly in some of the output. For example, the PARTIAL option, which is used to produce partial regression residual plots, does not offer a convenient way of identifying the points. Moreover, an option which would plot the regression line for the partial X residual on the partial Y residual would also be useful (the slope of this line is equal to the parameter estimate of the independent variable for that plot). Since we are running SAS/STAT on a system with 640K RAM, invoking some of these options on the full model caused an out-of-memory error message. We were not able, for example, to run the CP model selection option for the full model.

In conclusion, for the type of analysis we are interested in performing, we found SAS/STAT to be a very powerful and useful statistical package, and we would recommend its use in similar types of data analysis applications.


### Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

### Robust procedures for Canadian Test Day Model final report for the Holstein breed

Robust procedures for Canadian Test Day Model final report for the Holstein breed J. Jamrozik, J. Fatehi and L.R. Schaeffer Centre for Genetic Improvement of Livestock, University of Guelph Introduction

### PRINCIPAL COMPONENT ANALYSIS

1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2

### Moderator and Mediator Analysis

Moderator and Mediator Analysis Seminar General Statistics Marijtje van Duijn October 8, Overview What is moderation and mediation? What is their relation to statistical concepts? Example(s) October 8,

### Introduction to Linear Regression

14. Regression A. Introduction to Simple Linear Regression B. Partitioning Sums of Squares C. Standard Error of the Estimate D. Inferential Statistics for b and r E. Influential Observations F. Regression

### Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I

Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA - Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting

### Week 5: Multiple Linear Regression

BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School

### Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

### NCSS Statistical Software. Multiple Regression

Chapter 305 Introduction Analysis refers to a set of techniques for studying the straight-line relationships among two or more variables. Multiple regression estimates the β s in the equation y = β 0 +

### ORTHOGONAL POLYNOMIAL CONTRASTS INDIVIDUAL DF COMPARISONS: EQUALLY SPACED TREATMENTS

ORTHOGONAL POLYNOMIAL CONTRASTS INDIVIDUAL DF COMPARISONS: EQUALLY SPACED TREATMENTS Many treatments are equally spaced (incremented). This provides us with the opportunity to look at the response curve

### Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements

### This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

### Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including

### Indices of Model Fit STRUCTURAL EQUATION MODELING 2013

Indices of Model Fit STRUCTURAL EQUATION MODELING 2013 Indices of Model Fit A recommended minimal set of fit indices that should be reported and interpreted when reporting the results of SEM analyses:

### Stock Price Forecasting Using Information from Yahoo Finance and Google Trend

Stock Price Forecasting Using Information from Yahoo Finance and Google Trend Selene Yue Xu (UC Berkeley) Abstract: Stock price forecasting is a popular and important topic in financial and academic studies.

### Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

### SPSS-Applications (Data Analysis)

CORTEX fellows training course, University of Zurich, October 2006 Slide 1 SPSS-Applications (Data Analysis) Dr. Jürg Schwarz, juerg.schwarz@schwarzpartners.ch Program 19. October 2006: Morning Lessons

### Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

### c 2015, Jeffrey S. Simonoff 1

Modeling Lowe s sales Forecasting sales is obviously of crucial importance to businesses. Revenue streams are random, of course, but in some industries general economic factors would be expected to have

### Forecasting Geographic Data Michael Leonard and Renee Samy, SAS Institute Inc. Cary, NC, USA

Forecasting Geographic Data Michael Leonard and Renee Samy, SAS Institute Inc. Cary, NC, USA Abstract Virtually all businesses collect and use data that are associated with geographic locations, whether

### Chicago Insurance Redlining - a complete example

Chapter 12 Chicago Insurance Redlining - a complete example In a study of insurance availability in Chicago, the U.S. Commission on Civil Rights attempted to examine charges by several community organizations

### USE OF ARIMA TIME SERIES AND REGRESSORS TO FORECAST THE SALE OF ELECTRICITY

Paper PO10 USE OF ARIMA TIME SERIES AND REGRESSORS TO FORECAST THE SALE OF ELECTRICITY Beatrice Ugiliweneza, University of Louisville, Louisville, KY ABSTRACT Objectives: To forecast the sales made by

### Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship

### DEGREES OF FREEDOM - SIMPLIFIED

1 Aust. J. Geod. Photogram. Surv. Nos 46 & 47 December, 1987. pp 57-68 In 009 I retyped this paper and changed symbols eg ˆo σ to VF for my students benefit DEGREES OF FREEDOM - SIMPLIFIED Bruce R. Harvey

### SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011

SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 Statistical techniques to be covered Explore relationships among variables Correlation Regression/Multiple regression Logistic regression Factor analysis