USING SAS/STAT SOFTWARE'S REG PROCEDURE TO DEVELOP SALES TAX AUDIT SELECTION MODELS


Kirk L. Johnson, Tennessee Department of Revenue
Richard W. Kulp, David Lipscomb College

INTRODUCTION

The Tennessee Department of Revenue (TDR) uses the SAS/STAT REG procedure to develop statistical models that predict which sales and use tax field audits will yield the highest return per hour spent on the audit. To perform the analysis, the TDR uses the SAS System, which runs on both the state's mainframe computer and on personal computers in the Department. The process involves running SAS programs against taxpayer files on the state's mainframe computer and downloading subsets of data, based on taxpayers' business types, to a personal computer. The downloaded data are analyzed using PROC REG.

This paper reports on our use of SAS diagnostics to compare competing models and to analyze potential problems in the data. Since the formulas used to calculate the statistics discussed in this paper are readily available in SAS documentation, we have chosen, for the most part, not to include them here. We relied very heavily upon the SAS/STAT Guide for Personal Computers, Version 6 Edition (Cary, NC: SAS Institute Inc., 1985) for our descriptions of the REG procedure and tried to conform to SAS terminology in describing these procedures. In addition, some of the PROC REG options discussed below produce a large amount of printed output. Therefore, the statistics reported in this paper were extracted from SAS output. We will be glad to make the full output available upon request.

USING REGRESSION ANALYSIS TO PREDICT ASSESSMENTS

Regression analysis can be used to do the following:

- to explain how the independent variables account for variation in the dependent variable
- to estimate the magnitude and signs of the parameters
- to screen variables and rank them in order of importance
- to predict, forecast, or estimate the dependent variable.
As noted above, we are primarily interested in using regression analysis to predict the hourly return from sales tax field audits. It is important to state clearly the purpose for which a regression model is to be used, since a model that predicts well may not necessarily be the best model for estimating parameters or performing some other task.

Model Selection

We have chosen to develop a different model for each business type for which there is sufficient audit history to justify the analysis. By a different model, we mean that the independent variables used in the models will differ from one business type to another. This is based upon our experience, as well as the experience of other states, which indicates that the variables that are useful for predicting assessments for one business type may not be useful for predicting assessments for another.

Several exploratory techniques are available to assist in identifying which variables to include in the models. These include forward selection, backward selection, and stepwise selection. In Version 6.03, these are invoked using the SELECTION option of the MODEL statement of PROC REG. The syntax of the option is as follows:

   PROC REG DATA=SASdataset;
      MODEL dependents=regressors / SELECTION=name P VIF COLLIN INFLUENCE PARTIAL;

where name can be FORWARD (or F), BACKWARD (or B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP, or NONE (the full model). The default is NONE. P, VIF, COLLIN, INFLUENCE, and PARTIAL invoke the diagnostic procedures discussed below.
Because of the large number of variables being considered (43) and the large number of models produced (95), it is necessary to develop a set of procedures to reduce the number of models that must be considered for each business group. The following outlines these procedures:

- Use SELECTION=STEPWISE to reduce the number of variables. The default significance levels for entry into the model (0.15) and for staying in the model (0.15) were used.
- Use SELECTION=ADJRSQ and SELECTION=CP with the variables selected by STEPWISE to find the models with the best adjusted R² and Mallow's Cp.
- Use the P option to calculate the PRESS statistic for competing models.
- Use the VIF option to calculate variance inflation factors and the COLLIN option for collinearity diagnostics.
- Use the INFLUENCE and PARTIAL options to produce influence diagnostics and partial regression residual plots.

Example 1

We have a business type in the retail trade sector for which 67 sales tax audits have been performed, yielding an average per hour assessment of $621. The STEPWISE option produced the following model:

   Step   Variable Entered   Partial R²   Model R²   Mallow's Cp
   1      T GROSS
   2      T BALDUE
   3      T EXEMPT
   4      GROSS2
   5      BALDUE2
   6      STRUCF
   7      EXEMPT2

where T GROSS=total gross sales, T BALDUE=total tax due, T EXEMPT=total exempt sales, GROSS2=total gross sales squared, BALDUE2=total tax due squared, EXEMPT2=total exempt sales squared, and STRUCF=a dummy variable indicating whether the taxpayer registered as a foreign corporation (i.e., corporate headquarters located outside Tennessee).

Mallow's Cp, reported in the table above, is a prediction-oriented statistic which indicates the presence of bias in a model. A Cp greater than p+1 (where p = the number of independent variables in the model, so that p+1 counts the intercept as well) is an indicator of an incompletely specified model. A Cp less than p+1 indicates that the model is overspecified (i.e., the model contains too many variables). The recommended model is the one where Cp first approaches p+1 (starting from the full model).
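The Cp comparison described above can be sketched outside SAS. The following Python fragment is a minimal illustration, assuming the usual definition Cp = SSE_subset / s² - (n - 2k), where s² is the mean square error of the full model and k counts the intercept (so k plays the role of p+1 in the text). The function names and all numbers are hypothetical and are not taken from the audit data.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows' Cp for a subset model.

    n_params counts the intercept, i.e. it is p+1 in the paper's notation.
    mse_full is the mean square error of the full (all-variables) model.
    """
    return sse_subset / mse_full - (n_obs - 2 * n_params)


def cp_gap(cp, n_params):
    """Cp minus the parameter count.

    Positive: Cp is above p+1 (possible bias from omitted variables).
    Negative: Cp is below p+1 (possible overfitting).
    """
    return cp - n_params


# Hypothetical illustration: 67 observations, full model with 8 parameters.
n_obs = 67
mse_full = 2.0
full_params = 8
sse_full = mse_full * (n_obs - full_params)

# By construction the full model's Cp equals its own parameter count.
cp_full = mallows_cp(sse_full, mse_full, n_obs, full_params)
print(cp_full)  # 8.0

# A hypothetical 5-parameter subset whose error sum of squares has grown.
cp_subset = mallows_cp(150.0, mse_full, n_obs, 5)
print(cp_subset, cp_gap(cp_subset, 5))
```

Because the full model always returns a Cp equal to its parameter count, the statistic is only informative for comparing subsets against that full-model baseline.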
As the above table indicates, since Mallow's Cp for the last variable entered into the model is less than p+1 (which would be eight in this case), it is possible that this model is overfitted. As the table below indicates, this model compares well with the models produced using other selection methods:

   Selection Option   R²   Adjusted R²   PRESS   MSE (x 10³)
   NONE
   FORWARD
   BACKWARD
   STEPWISE

As variables are added to a model, R² will always increase or, in the worst case, remain the same. Thus, the model with the highest R² is not necessarily the best model. Adjusted R² takes into consideration the number of independent variables in the model. An adjusted R² which is substantially less than R² indicates that the model is overfitted. That is to say, the increase in R² due to the additional variables included in the model does not make up for the loss of degrees of freedom. None of the adjusted R² values reported above are causes for concern.

Since we are most interested in predicting assessments per hour, we have relied heavily upon the SAS prediction diagnostics. The PRESS statistic is the sum of squares of the predicted residual errors, where the predicted residual for observation i is defined as the residual for the ith observation that results from dropping that observation when the parameters are estimated. In evaluating competing models, a lower PRESS indicates better prediction capability. The model produced by STEPWISE has a much lower PRESS than the other models.

PROBLEMS IN REGRESSION ANALYSIS

Two well-known problems in the data used in regression analysis are particularly endemic to data dealing
with tax assessments. These problems are multicollinearity among the values of the independent variables and influence data points.

Multicollinearity

Multicollinearity is present when an independent variable is nearly a linear combination of other independent variables in the model. Multicollinearity affects regression analysis in the following ways:

A. produces large variances of coefficients.
B. results in unstable coefficients.
C. produces regression coefficients that are too large in magnitude.
D. can result in poor prediction.

Given that prediction is our main goal, the potential presence of multicollinearity among the independent variables used in a model should be carefully investigated. An example of multicollinearity would be a business type where gross sales and exempt sales were highly correlated. In this case, the analyst may want to consider removing one of the variables from the model.

The VIF and COLLIN options are collinearity diagnostics provided by SAS. The VIF option reports the variance inflation factor, which can be interpreted as follows: for a given variable, the variance inflation factor measures how much larger the variance of the parameter estimate is than it would be if no multicollinearity were present. As a rule of thumb, a VIF greater than ten (10) can be used as an indicator of a potential collinearity problem.

The COLLIN option produces a table which includes eigenvalues, condition indices, and variance proportions, which can be used to examine which terms are causing the problem. The number of eigenvalues near zero indicates the number of near linear dependencies. Large values for the condition number also indicate collinearity. High loadings on the variance proportions indicate which terms are causing the problem.

Example 1 (Continued)

In the above example, we are concerned about possible collinearity between T GROSS and T BALDUE and between GROSS2 and BALDUE2.
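To make the two diagnostics concrete, the following Python fragment is a minimal sketch of the arithmetic behind them. It assumes the standard definitions: the variance inflation factor of a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on all the others, and each condition index is the square root of the largest eigenvalue divided by the eigenvalue in question. The function names and the eigenvalues are hypothetical illustrations, not SAS output.

```python
import math


def vif_from_r_squared(r_squared_j):
    """VIF for regressor j, given the R-squared obtained by regressing
    regressor j on all of the other regressors in the model."""
    return 1.0 / (1.0 - r_squared_j)


def condition_indices(eigenvalues):
    """Condition index for each eigenvalue: sqrt(largest / eigenvalue)."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / ev) for ev in eigenvalues]


# The rule-of-thumb VIF of 10 corresponds to an auxiliary R-squared of 0.9.
for r2 in (0.0, 0.5, 0.9, 0.99):
    print(r2, round(vif_from_r_squared(r2), 1))

# Hypothetical eigenvalues: the near-zero one yields a large condition index.
hypothetical_eigenvalues = [4.0, 1.0, 0.01]
print([round(ci, 1) for ci in condition_indices(hypothetical_eigenvalues)])
```

Read this way, a VIF in the hundreds means that almost all of the variation in that regressor is already explained by the other regressors.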
As the variance inflation factors reported below indicate, the seven variable model selected by STEPWISE in the above example would appear to have multicollinearity problems:

   Variable    VIF
   T GROSS     936
   T BALDUE    887
   T EXEMPT     18
   GROSS2
   BALDUE2     271
   STRUCF        1
   EXEMPT2       8

The table below reports the eigenvalues and condition numbers associated with this model:

   Number   Eigenvalue   Condition Number

The small eigenvalue and large condition number associated with the eighth principal component reported above are indications of a collinearity problem. The table below reports the variance proportions for the variables with the highest loadings on the eighth component:

   Variance Proportions
   Number   T GROSS   T BALDUE   GROSS2   BALDUE2

Since the variable T GROSS has the highest variance inflation factor and the highest variance proportion for the eighth component, the decision was made to drop it from the model. This resulted in only a slight drop in adjusted R², whereas PRESS and Mallow's Cp for the six variable model are slightly better. Moreover, the variance inflation factors, as the table below indicates, showed marked improvement, although they still indicate the presence of collinearity in the model:

   Variable    VIF
   T BALDUE     18
   T EXEMPT     10
   GROSS2       42
   BALDUE2      28
   STRUCF        1
   EXEMPT2
As the table below indicates, with the exception of dropping T EXEMPT, which is discussed below, efforts to improve the model by dropping additional variables resulted in diminishing predictive capability based on PRESS and Mallow's Cp (Mallow's Cp statistic was calculated using the full model MSE):

   Model                                                            Adjusted R²   PRESS   Cp   p+1
   T GROSS, T BALDUE, T EXEMPT, GROSS2, BALDUE2, STRUCF, EXEMPT2
   T BALDUE, T EXEMPT, GROSS2, BALDUE2, STRUCF, EXEMPT2
   T BALDUE, GROSS2, BALDUE2, STRUCF, EXEMPT2
   T BALDUE, T EXEMPT, BALDUE2, STRUCF, EXEMPT2
   T BALDUE, T EXEMPT, GROSS2, STRUCF, EXEMPT2
   GROSS2, T EXEMPT, BALDUE2, STRUCF, EXEMPT2
   GROSS2, BALDUE2, STRUCF, EXEMPT2
   GROSS2, STRUCF, BALDUE2
   GROSS2, BALDUE2

What we seem to have here is a situation where two variables, GROSS2 and BALDUE2, are collinear, but both must be included for the model to have an acceptable adjusted R², PRESS, and Mallow's Cp. Reported below are the parameter estimates associated with the six variable model:

   Variable    Parameter Estimate   Standard Error   Prob > |T|
   INTERCEP
   T BALDUE
   T EXEMPT
   GROSS2
   BALDUE2
   STRUCF
   EXEMPT2

The presence in the model of a variable, T EXEMPT, which is not significant at the 0.05 level is also of concern. As the table of competing models above indicated, by dropping this variable, the model improves slightly in adjusted R², PRESS, and Mallow's Cp. As reported below, the variance inflation factors are either the same as or slightly better than those of the six variable model:

   Variable    VIF
   T BALDUE     16
   GROSS2       43
   BALDUE2      28
   STRUCF        1
   EXEMPT2       1

Thus, the decision was made to use the five variable model. The parameter estimates are reported below:

   Variable    Parameter Estimate   Standard Error   Prob > |T|
   INTERCEP
   T BALDUE
   GROSS2
   BALDUE2
   STRUCF
   EXEMPT2

Influence Data Points

Influence data points are points which exert an undue influence on the regression equation. This may be the result, for example, of an outlying observation.
If a set of data for a given business type included one extremely large per hour field audit assessment, this data point could possibly exert an undue influence on the regression equation for that business type. It is important to note that the mere presence of such a data point does not necessarily mean that it does exert an undue influence, only that it may do so. If it does, the data point would be termed an outlier.

Because of the nature of our data, influence data points are a serious problem for both the dependent and independent variables. The presence of large per hour assessments may produce outliers among the values of the dependent variable for some business types. The presence of large values for some independent variables (particularly large gross sales, large exempt sales, large use taxable, and large tax balances due) may produce high leverage data points.

Influence data points are not always readily apparent. Moreover, the issue of the remedy is a source of some controversy. While some statisticians may recommend removing outliers from the data, others do not. If the data point is valid, that is to say, the data for that observation are correctly measured, then we feel that there should be a compelling reason for removing it from the data set.

Example 2

We have a group of manufacturers for which 53 sales tax audits have been performed, with an average per hour assessment of $18,241. This extremely high average per hour assessment leads us to suspect that there might be one or more outliers in the data, that is to say, observations which exert an undue influence on the regression equation.
Following the methodology discussed above, the STEPWISE option was used to select an initial model for analysis. This model is presented below:

   Step   Variable Entered   Model R²   Prob > |T|   Mallow's Cp
   1      USE2
   2      T USE
   3      BALDUE2
   4      STRUCD
   5      DIRPAY
   6      STRUCA

where USE2=use taxable squared, T USE=total use taxable, BALDUE2=total tax due squared, STRUCD=a dummy variable indicating whether the taxpayer registered as a domestic corporation, DIRPAY=a dummy variable indicating whether the taxpayer has a direct pay permit, and STRUCA=a dummy variable indicating whether the taxpayer registered as a sole proprietor.

The dominance of USE2 further alerted us to the possibility of a problem with the data. Even though it had a high R², the large Mallow's Cp statistic indicated that the variable has considerable bias also. In addition, the PRESS statistic for this model was extremely large, indicating poor prediction capability.

The INFLUENCE option is used to produce statistics which measure the influence of each observation on the estimates. These statistics include the following: RSTUDENT (the studentized residuals), HAT DIAG H (the hat diagonals), COV RATIO (the covariance ratio), DFFITS (a scaled measure of the change in the predicted value for the ith observation), and DFBETAS (scaled measures of the change in each parameter estimate for each variable included in the model).
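The size-adjusted rules of thumb conventionally applied to these statistics depend only on the sample size n and the number of parameters p. The Python fragment below is a minimal sketch of that arithmetic. It assumes n = 53 (the audits in Example 2) and p = 7 (the six variables selected above plus the intercept); the function name and dictionary keys are hypothetical.

```python
import math


def influence_cutoffs(n_obs, n_params):
    """Conventional rule-of-thumb cutoffs for influence diagnostics.

    n_params counts the intercept. An observation is flagged when the
    absolute statistic exceeds the cutoff, or when COVRATIO falls outside
    the (low, high) band.
    """
    ratio = n_params / n_obs
    return {
        "rstudent_abs": 2.0,
        "hat_diag": 2.0 * ratio,               # 2p/n
        "covratio_low": 1.0 - 3.0 * ratio,     # 1 - 3p/n
        "covratio_high": 1.0 + 3.0 * ratio,    # 1 + 3p/n
        "dffits_abs": 2.0 * math.sqrt(ratio),  # 2 * sqrt(p/n)
        "dfbetas_abs": 2.0 / math.sqrt(n_obs),  # 2 / sqrt(n)
    }


# Assumed inputs: 53 audits, six regressors plus the intercept.
cutoffs = influence_cutoffs(n_obs=53, n_params=7)
for name, value in cutoffs.items():
    print(f"{name}: {value:.4f}")
```

These cutoffs are screening devices only: an observation that exceeds one of them warrants a closer look, not automatic removal.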
For the data set and model under consideration, the table below presents the values which would be considered as indicators of potential influence points:

   Statistic     Value
   RSTUDENT      If absolute value is greater than 2
   HAT DIAG H    If value is greater than .2642 (2p/n, where p=number of parameters and n=sample size)
   COV RATIO     If value is less than .6038 or greater than 1.3962 (1 plus or minus 3(p/n))
   DFFITS        If value is greater than .7268 (2 times the square root of the quantity p/n)
   DFBETAS       If value is greater than .2747 (2 over the square root of n)

We found that a number of observations had values on one or more of the above statistics indicating that they may exert a large influence on the parameter estimates. One observation (Observation 11 in the data set) seemed to stand out from the others, however. The table below reports the influence diagnostic statistics for this observation:

   Statistic            Value
   RSTUDENT
   HAT DIAG H
   COV RATIO
   DFFITS
   INTERCEP DFBETAS
   DIRPAY DFBETAS
   T USE DFBETAS
   STRUCA DFBETAS
   STRUCD DFBETAS
   BALDUE2 DFBETAS

The values of the above statistics led us to investigate this observation. We discovered that, although the data for the observation were correct, the assessment per hour for this observation was so large that it almost completely dominated the regression equation. We felt that we were justified in considering this data point to be an atypical value and therefore removing it from the data set.

We removed this observation from the data set and ran PROC REG with the STEPWISE option again. The R² for the data set without the observation was [...]. This model is presented below:

   Step   Variable Entered   Model R²   Prob > |T|   Mallow's Cp
   1      BALDUE2
   2      USECODEO
   3      USE2
   4      DIRPAY
   5      PERBALGR

where BALDUE2=total tax due squared, USECODEO=a dummy variable indicating whether the taxpayer registered as a peddler, USE2=use taxable squared, DIRPAY=a dummy variable indicating whether the taxpayer has a direct pay permit, and PERBALGR=a derived
variable measuring the percent of total tax due to gross sales.

Even though the R² is considerably lower, the PRESS statistic for the model fitted to the data set with the atypical observation is much worse than for the model fitted to the data set without it. The PRESS statistic for the former model was 755,576,780,357, whereas for the latter model it was 5,636,307,... Similarly, Mallow's Cp for the former model was 30; for the latter model it was 15. The mean square error for the former model was 841,432, while for the latter model it was 64,658. Thus, we feel justified in removing the data point from the data set.

We ran the INFLUENCE option against this new model to identify any additional influence data points. Using the same criteria discussed above, several data points still had values on the diagnostics which were of concern. Three data points particularly stood out. Two observations had studentized residuals well above the absolute value of two. The third observation had a covariance ratio of 74. Two of these observations had large values for the dependent variable (that is, large assessments per hour), whereas the other observation was the result of a no change audit (i.e., assessment per hour=0). We did not feel at this point that any of these observations were sufficiently atypical of audits performed by the TDR to justify removing them from the data set.

We were concerned with the presence in the model of a term which was not significant at the .05 level. Therefore, we chose to run the model again without the variable PERBALGR. This resulted in a model with a slightly worse adjusted R² and PRESS but, as the table below indicates, all terms in the model are now significant at the .05 level.

   Variable    Parameter Estimate   Standard Error   Prob > |T|
   INTERCEP
   BALDUE2
   USECODEO
   USE2
   DIRPAY

Finally, we ran the VIF option to get the variance inflation factors for the above model.
The VIF's, reported below, indicated that the model did not have a collinearity problem:

   Variable    VIF
   BALDUE2
   USECODEO
   USE2
   DIRPAY

CONCLUDING REMARKS

In conclusion, we would like to make some remarks on the SAS diagnostic procedures. SAS offers an impressive array of diagnostics. For the novice, the biggest problem may be deciding which diagnostics to use. Moreover, it is extremely easy to invoke most of the diagnostics. All the diagnostics discussed in this paper are options to the MODEL statement. We were also impressed with the enhancements to Version 6.03, such as the CP and ADJRSQ model selection options, which produce a printout of the models ranked according to the best Mallow's Cp and adjusted R² statistics, respectively. An option like this for the PRESS statistic would also be useful. We have not had an opportunity, however, to fully evaluate the enhancements to Version 6.03.

We were disappointed with some shortcomings, however, particularly in some of the output. For example, the PARTIAL option, which is used to produce partial regression residual plots, does not offer a convenient way of identifying the points. Moreover, an option which would plot the regression line for the partial Y residual on the partial X residual would also be useful (the slope of this line is equal to the parameter estimate of the independent variable for that plot). Since we are running SAS/STAT on a system with 640K RAM, invoking some of these options on the full model caused an out-of-memory error message. We were not able, for example, to run the CP model selection option for the full model.

In conclusion, for the type of analysis we are interested in performing, we found SAS/STAT to be a very powerful and useful statistical package and would recommend its use in similar types of data analysis applications.
1 Final Review 2 Review 2.1 CI 1propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
More informationX X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)
CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.
More informationForecasting in supply chains
1 Forecasting in supply chains Role of demand forecasting Effective transportation system or supply chain design is predicated on the availability of accurate inputs to the modeling process. One of the
More informationData Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationMULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
More informationModule 5: Multiple Regression Analysis
Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College
More informationPlease follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software
STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used
More informationChapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
More information1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number
1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x  x) B. x 3 x C. 3x  x D. x  3x 2) Write the following as an algebraic expression
More informationFactor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models
Factor Analysis Principal components factor analysis Use of extracted factors in multivariate dependency models 2 KEY CONCEPTS ***** Factor Analysis Interdependency technique Assumptions of factor analysis
More informationComparing return to work outcomes between vocational rehabilitation providers after adjusting for case mix using statistical models
Comparing return to work outcomes between vocational rehabilitation providers after adjusting for case mix using statistical models Prepared by Jim Gaetjens Presented to the Institute of Actuaries of Australia
More informationModeling Lifetime Value in the Insurance Industry
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
More informationLecture 5: Model Checking. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II
Lecture 5: Model Checking Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II Regression Diagnostics Unusual and Influential Data Outliers Leverage Influence Heterosckedasticity Nonconstant
More informationIntroduction to Linear Regression
14. Regression A. Introduction to Simple Linear Regression B. Partitioning Sums of Squares C. Standard Error of the Estimate D. Inferential Statistics for b and r E. Influential Observations F. Regression
More informationDimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
More informationChapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) 
More information1 Theory: The General Linear Model
QMIN GLM Theory  1.1 1 Theory: The General Linear Model 1.1 Introduction Before digital computers, statistics textbooks spoke of three procedures regression, the analysis of variance (ANOVA), and the
More information4. Multiple Regression in Practice
30 Multiple Regression in Practice 4. Multiple Regression in Practice The preceding chapters have helped define the broad principles on which regression analysis is based. What features one should look
More informationModule 3: Correlation and Covariance
Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis
More informationSimple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
More informationReview Jeopardy. Blue vs. Orange. Review Jeopardy
Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 03 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?
More informationDEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
More informationMultiple Regression Using SPSS
Multiple Regression Using SPSS The following sections have been adapted from Field (2009) Chapter 7. These sections have been edited down considerably and I suggest (especially if you re confused) that
More informationStat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb 2 2015
Stat 412/512 CASE INFLUENCE STATISTICS Feb 2 2015 Charlotte Wickham stat512.cwick.co.nz Regression in your field See website. You may complete this assignment in pairs. Find a journal article in your field
More informationDirections for using SPSS
Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...
More informationIntroduction to proc glm
Lab 7: Proc GLM and oneway ANOVA STT 422: Summer, 2004 Vince Melfi SAS has several procedures for analysis of variance models, including proc anova, proc glm, proc varcomp, and proc mixed. We mainly will
More informationExample: Boats and Manatees
Figure 96 Example: Boats and Manatees Slide 1 Given the sample data in Table 91, find the value of the linear correlation coefficient r, then refer to Table A6 to determine whether there is a significant
More informationChicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011
Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this
More informationRobust procedures for Canadian Test Day Model final report for the Holstein breed
Robust procedures for Canadian Test Day Model final report for the Holstein breed J. Jamrozik, J. Fatehi and L.R. Schaeffer Centre for Genetic Improvement of Livestock, University of Guelph Introduction
More informationPRINCIPAL COMPONENT ANALYSIS
1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2
More informationModerator and Mediator Analysis
Moderator and Mediator Analysis Seminar General Statistics Marijtje van Duijn October 8, Overview What is moderation and mediation? What is their relation to statistical concepts? Example(s) October 8,
More informationIntroduction to Linear Regression
14. Regression A. Introduction to Simple Linear Regression B. Partitioning Sums of Squares C. Standard Error of the Estimate D. Inferential Statistics for b and r E. Influential Observations F. Regression
More informationSection A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA  Part I
Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA  Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting
More informationWeek 5: Multiple Linear Regression
BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School
More informationStatistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 16233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
More informationNCSS Statistical Software. Multiple Regression
Chapter 305 Introduction Analysis refers to a set of techniques for studying the straightline relationships among two or more variables. Multiple regression estimates the β s in the equation y = β 0 +
More informationORTHOGONAL POLYNOMIAL CONTRASTS INDIVIDUAL DF COMPARISONS: EQUALLY SPACED TREATMENTS
ORTHOGONAL POLYNOMIAL CONTRASTS INDIVIDUAL DF COMPARISONS: EQUALLY SPACED TREATMENTS Many treatments are equally spaced (incremented). This provides us with the opportunity to look at the response curve
More informationAnswer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade
Statistics Quiz Correlation and Regression  ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements
More informationThis chapter will demonstrate how to perform multiple linear regression with IBM SPSS
CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the
More informationLinear Models and Conjoint Analysis with Nonlinear Spline Transformations
Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including
More informationIndices of Model Fit STRUCTURAL EQUATION MODELING 2013
Indices of Model Fit STRUCTURAL EQUATION MODELING 2013 Indices of Model Fit A recommended minimal set of fit indices that should be reported and interpreted when reporting the results of SEM analyses:
More informationStock Price Forecasting Using Information from Yahoo Finance and Google Trend
Stock Price Forecasting Using Information from Yahoo Finance and Google Trend Selene Yue Xu (UC Berkeley) Abstract: Stock price forecasting is a popular and important topic in financial and academic studies.
More informationChapter 10. Key Ideas Correlation, Correlation Coefficient (r),
Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables
More informationSPSSApplications (Data Analysis)
CORTEX fellows training course, University of Zurich, October 2006 Slide 1 SPSSApplications (Data Analysis) Dr. Jürg Schwarz, juerg.schwarz@schwarzpartners.ch Program 19. October 2006: Morning Lessons
More informationPredictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 RSq = 0.0% RSq(adj) = 0.
Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged
More informationc 2015, Jeffrey S. Simonoff 1
Modeling Lowe s sales Forecasting sales is obviously of crucial importance to businesses. Revenue streams are random, of course, but in some industries general economic factors would be expected to have
More informationForecasting Geographic Data Michael Leonard and Renee Samy, SAS Institute Inc. Cary, NC, USA
Forecasting Geographic Data Michael Leonard and Renee Samy, SAS Institute Inc. Cary, NC, USA Abstract Virtually all businesses collect and use data that are associated with geographic locations, whether
More informationChicago Insurance Redlining  a complete example
Chapter 12 Chicago Insurance Redlining  a complete example In a study of insurance availability in Chicago, the U.S. Commission on Civil Rights attempted to examine charges by several community organizations
More informationUSE OF ARIMA TIME SERIES AND REGRESSORS TO FORECAST THE SALE OF ELECTRICITY
Paper PO10 USE OF ARIMA TIME SERIES AND REGRESSORS TO FORECAST THE SALE OF ELECTRICITY Beatrice Ugiliweneza, University of Louisville, Louisville, KY ABSTRACT Objectives: To forecast the sales made by
More informationStatistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY
Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship
More informationDEGREES OF FREEDOM  SIMPLIFIED
1 Aust. J. Geod. Photogram. Surv. Nos 46 & 47 December, 1987. pp 5768 In 009 I retyped this paper and changed symbols eg ˆo σ to VF for my students benefit DEGREES OF FREEDOM  SIMPLIFIED Bruce R. Harvey
More informationSPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011
SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 Statistical techniques to be covered Explore relationships among variables Correlation Regression/Multiple regression Logistic regression Factor analysis
More informationPrinciple Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate
More informationInteraction effects and group comparisons Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 20, 2015
Interaction effects and group comparisons Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Note: This handout assumes you understand factor variables,
More information