USING SAS/STAT SOFTWARE'S REG PROCEDURE TO DEVELOP SALES TAX AUDIT SELECTION MODELS

Kirk L. Johnson, Tennessee Department of Revenue
Richard W. Kulp, David Lipscomb College

INTRODUCTION

The Tennessee Department of Revenue (TDR) uses the SAS/STAT REG procedure to develop statistical models that predict which sales and use tax field audits will yield the highest return per hour spent on the audit. To perform the analysis, the TDR uses the SAS System, which runs on both the state's mainframe computer and personal computers in the Department. The process involves running SAS programs against taxpayer files on the mainframe and downloading subsets of data, grouped by taxpayers' business types, to a personal computer. The downloaded data are analyzed using PROC REG.

This paper reports on our use of SAS diagnostics to compare competing models and to analyze potential problems in the data. Since the formulas used to calculate the statistics discussed in this paper are readily available in SAS documentation, we have chosen, for the most part, not to include them. We relied heavily upon the SAS/STAT Guide for Personal Computers, Version 6 Edition (Cary, NC: SAS Institute Inc., 1985) for our descriptions of the REG procedure and tried to conform to SAS terminology in describing these procedures. In addition, some of the PROC REG options discussed below produce a large amount of printed output. Therefore, the statistics reported in this paper were extracted from SAS output. We will be glad to make the full output available upon request.

USING REGRESSION ANALYSIS TO PREDICT ASSESSMENTS

Regression analysis can be used to do the following:

- to explain how the independent variables account for variation in the dependent variable
- to estimate the magnitude and signs of the parameters
- to screen variables and rank them in order of importance
- to predict, forecast, or estimate the dependent variable.
As noted above, we are primarily interested in using regression analysis to predict the hourly return from sales tax field audits. It is important to state clearly the purpose for which a regression model is to be used, since a model that predicts well may not be the best model for estimating parameters or performing some other task.

Model Selection

We have chosen to develop a different model for each business type for which there is sufficient audit history to justify the analysis. By a different model, we mean that the independent variables used in the models will differ from one business type to another. This choice is based upon our experience, as well as that of other states, which indicates that variables useful for predicting assessments for one business type may not be useful for another.

Several exploratory techniques are available to assist in identifying which variables to include in the models. These include forward selection, backward selection, and stepwise selection. In Version 6.03, these are invoked using the SELECTION option of the MODEL statement of PROC REG. The syntax of the option is as follows:

   PROC REG DATA=SASdataset;
      MODEL dependents=regressors / SELECTION=name P VIF COLLIN INFLUENCE PARTIAL;

where name can be FORWARD (or F), BACKWARD (or B), STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP, or NONE (the full model). The default is NONE. P, VIF, COLLIN, INFLUENCE, and PARTIAL invoke the diagnostic procedures discussed below.

Because of the large number of variables being considered (43) and the large number of models produced (95), it is necessary to develop a set of procedures to reduce the number of models that must be considered for each business group. The following outlines these procedures:

- Use SELECTION=STEPWISE to reduce the number of variables. The default significance levels for entry into the model (0.15) and for staying in the model (0.15) were used.
- Use SELECTION=ADJRSQ and SELECTION=CP with the variables selected by STEPWISE to find the models with the best adjusted R2 and Mallows' Cp.
- Use the P option to calculate the PRESS statistic for competing models.
- Use the VIF option to calculate variance inflation factors and the COLLIN option for collinearity diagnostics.
- Use the INFLUENCE and PARTIAL options to produce influence diagnostics and partial regression residual plots.

Example 1

We have a business type in the retail trade sector for which 67 sales tax audits have been performed, yielding an average per hour assessment of $621. The STEPWISE option produced the following model:

   Step   Variable Entered   Partial R2   Model R2   Mallows' Cp
   1      T_GROSS
   2      T_BALDUE
   3      T_EXEMPT
   4      GROSS2
   5      BALDUE2
   6      STRUCF
   7      EXEMPT2

where T_GROSS=total gross sales, T_BALDUE=total tax due, T_EXEMPT=total exempt sales, GROSS2=total gross sales squared, BALDUE2=total tax due squared, EXEMPT2=total exempt sales squared, and STRUCF=a dummy variable indicating whether the taxpayer registered as a foreign corporation (i.e., corporate headquarters located outside Tennessee).

Mallows' Cp, reported in the table above, is a prediction-oriented statistic which indicates the presence of bias in a model. A Cp greater than p+1 (where p = the number of parameters in the model, not counting the intercept) is an indicator of an incompletely specified model. A Cp less than p+1 indicates that the model is overspecified (i.e., the model contains too many variables). The recommended model is the one where Cp first approaches p+1 (starting from the full model).
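The two selection criteria named in the procedure above can be sketched in a few lines of Python. This is only an illustration of the standard textbook formulas, not SAS output: the function names and every number below (the 10-regressor full model and both error sums of squares) are hypothetical, and k counts regressors without the intercept so that k+1 matches the p+1 used in the text.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def mallows_cp(sse_subset, mse_full, n, k):
    """Mallows' Cp for a subset model with k regressors plus an intercept.

    mse_full is the mean square error of the full model.
    """
    return sse_subset / mse_full - (n - 2 * (k + 1))


# Hypothetical values chosen only to exercise the formulas.
n = 67                 # number of audits, as in Example 1
k_full = 10            # regressors in an assumed full model
sse_full = 5600.0      # assumed error sum of squares, full model
mse_full = sse_full / (n - k_full - 1)

k_subset = 7           # regressors in an assumed subset model
sse_subset = 5900.0    # assumed error sum of squares, subset model

cp_full = mallows_cp(sse_full, mse_full, n, k_full)
cp_subset = mallows_cp(sse_subset, mse_full, n, k_subset)
adj = adjusted_r2(0.90, n, k_subset)

print("Cp, full model:  ", cp_full, "vs p+1 =", k_full + 1)
print("Cp, subset model:", cp_subset, "vs p+1 =", k_subset + 1)
print("Adjusted R2 for R2 = 0.90:", round(adj, 4))
```

By construction the full model always has Cp equal to its own p+1, which is why the comparison is informative only for subset models.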
As the above table indicates, since Mallows' Cp for the last variable entered in the model is less than p+1 (which would be eight in this case), it is possible that this model is overfitted. As the table below indicates, this model compares well with the models produced using other selection methods:

   Selection             Adjusted   PRESS      MSE
   Option       R2       R2         (x 10 )    (x 10^3)
   NONE
   FORWARD
   BACKWARD
   STEPWISE

As variables are added to a model, the R2 will always increase or, in the worst case, remain the same. Thus, the model with the highest R2 is not necessarily the best model. Adjusted R2 takes into consideration the number of independent variables in the model. An adjusted R2 which is substantially less than R2 indicates that the model is overfitted. That is to say, the increase in R2 due to the additional variables included in the model does not make up for the loss of degrees of freedom. None of the adjusted R2 values reported above is a cause for concern.

Since we are most interested in predicting assessments per hour, we have relied heavily upon the SAS prediction diagnostics. The PRESS statistic is the sum of squares of the predicted residual errors, where the predicted residual for observation i is the residual that results when the ith observation is dropped from the estimation of the parameters. In evaluating competing models, a lower PRESS indicates better prediction capability. The model produced by STEPWISE has a much lower PRESS than the other models.

PROBLEMS IN REGRESSION ANALYSIS

Two well-known problems in the data used in regression analysis are particularly endemic to data dealing with tax assessments. These problems are multicollinearity among the values of the independent variables and influence data points.

Multicollinearity

Multicollinearity is present when an independent variable is nearly a linear combination of other independent variables in the model. Multicollinearity affects regression analysis in the following ways:

A. It produces large variances of the coefficients.
B. It results in unstable coefficients.
C. It produces regression coefficients that are too large in magnitude.
D. It can result in poor prediction.

Given that prediction is our main goal, the potential presence of multicollinearity among the independent variables used in a model should be carefully investigated. An example of multicollinearity would be a business type where gross sales and exempt sales are highly correlated. In this case, the analyst may want to consider removing one of the variables from the model.

The VIF and COLLIN options are collinearity diagnostics provided by SAS. The VIF option reports the variance inflation factor, which can be interpreted as follows: for a given variable, the variance inflation factor measures how much larger the variance of the parameter estimate is than it would be if no multicollinearity were present. As a rule of thumb, a VIF greater than ten (10) can be used as an indicator of a potential collinearity problem.

The COLLIN option produces a table which includes eigenvalues, condition indices, and variance proportions, which can be used to examine which terms are causing the problem. The number of eigenvalues near zero indicates the number of near linear dependencies. Large values for the condition number also indicate collinearity. High loadings on the variance proportions indicate which terms are causing the problem.

Example 1 (Continued)

In the above example, we are concerned about possible collinearity between T_GROSS and T_BALDUE and between GROSS2 and BALDUE2.
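To make the variance inflation factor concrete, the sketch below computes it for the simplest case of two regressors, where VIF = 1/(1 - r^2) and r is the correlation between the two. The general definition replaces r^2 with the R2 from regressing one variable on all the others. The data are invented for illustration (a tax-due series that is roughly 7 percent of gross sales), and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_regressors(x1, x2):
    """VIF shared by both regressors in a two-regressor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Invented data: tax due tracks gross sales closely, store age does not.
gross = [100.0, 250.0, 400.0, 800.0, 1200.0, 2000.0]
tax_due = [7.0, 17.0, 29.0, 55.0, 85.0, 139.0]
store_age = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

vif_collinear = vif_two_regressors(gross, tax_due)
vif_unrelated = vif_two_regressors(gross, store_age)

print("VIF, gross vs tax due:  ", round(vif_collinear, 1))
print("VIF, gross vs store age:", round(vif_unrelated, 2))
```

Under the rule of thumb quoted above, the first pair would be flagged (VIF well above 10) while the second would not.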
As the variance inflation factors reported below indicate, the seven variable model selected by STEPWISE in the above example would appear to have multicollinearity problems:

   Variable    VIF
   T_GROSS     936
   T_BALDUE    887
   T_EXEMPT     18
   GROSS2
   BALDUE2     271
   STRUCF        1
   EXEMPT2       8

The table below reports the eigenvalues and condition numbers associated with this model:

   Number   Eigenvalue   Condition Number

The small eigenvalue and large condition number associated with the eighth principal component reported above are indications of a collinearity problem. The table below reports the variance proportions for the variables with the highest loadings on the eighth component:

            Variance Proportions
   Number   T_GROSS   T_BALDUE   GROSS2   BALDUE2

Since the variable T_GROSS has the highest variance inflation factor and the highest variance proportion for the eighth component, the decision was made to drop it from the model. This resulted in only a slight drop in adjusted R2, whereas PRESS and Mallows' Cp for the six variable model are slightly better. Moreover, the variance inflation factors, as the table below indicates, showed marked improvement, although they still indicate the presence of collinearity in the model:

   Variable    VIF
   T_BALDUE     18
   T_EXEMPT     10
   GROSS2       42
   BALDUE2      28
   STRUCF        1
   EXEMPT2
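The condition numbers discussed in this example are derived from the eigenvalues: each one is the square root of the largest eigenvalue divided by the eigenvalue in question. A minimal sketch follows. The eigenvalues are hypothetical and are not the values from the lost table, and the threshold of 30 is a commonly cited rule of thumb rather than something stated in this paper.

```python
from math import sqrt


def condition_indices(eigenvalues):
    """Condition index for each eigenvalue: sqrt(largest / eigenvalue)."""
    largest = max(eigenvalues)
    return [sqrt(largest / ev) for ev in eigenvalues]


# Hypothetical eigenvalues; the last one is near zero,
# which signals one near linear dependency.
eigenvalues = [4.0, 1.0, 0.25, 0.0004]
indices = condition_indices(eigenvalues)

for number, (ev, ci) in enumerate(zip(eigenvalues, indices), start=1):
    flag = "  <-- possible collinearity" if ci > 30 else ""
    print(f"{number}  eigenvalue={ev:<8} condition index={ci:8.2f}{flag}")
```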

As the table below indicates, with the exception of dropping T_EXEMPT, which is discussed below, efforts to improve the model by dropping additional variables resulted in diminishing predictive capability based on PRESS and Mallows' Cp (Mallows' Cp was calculated using the full model MSE):

   Columns: Adjusted R2, PRESS (x 10 ), Cp, p+1, Model

   Model
   T_GROSS, T_BALDUE, T_EXEMPT, GROSS2, BALDUE2, STRUCF, EXEMPT2
   T_BALDUE, T_EXEMPT, GROSS2, BALDUE2, STRUCF, EXEMPT2
   T_BALDUE, GROSS2, BALDUE2, STRUCF, EXEMPT2
   T_BALDUE, T_EXEMPT, BALDUE2, STRUCF, EXEMPT2
   T_BALDUE, T_EXEMPT, GROSS2, STRUCF, EXEMPT2
   GROSS2, T_EXEMPT, BALDUE2, STRUCF, EXEMPT2
   GROSS2, BALDUE2, STRUCF, EXEMPT2
   GROSS2, STRUCF, BALDUE2
   GROSS2, BALDUE2

What we seem to have here is a situation where two variables, GROSS2 and BALDUE2, are collinear, but both must be included for the model to have an acceptable adjusted R2, PRESS, and Mallows' Cp. Reported below are the parameter estimates associated with the six variable model:

   Variable    Parameter Estimate   Standard Error   Prob>|T|
   INTERCEP
   T_BALDUE
   T_EXEMPT
   GROSS2
   BALDUE2
   STRUCF
   EXEMPT2

The presence in the model of a variable, T_EXEMPT, which is not significant at the 0.05 level is also of concern. As the table above indicated, dropping this variable slightly improves the model's adjusted R2, PRESS, and Mallows' Cp. As reported below, the variance inflation factors are either the same as or slightly better than those of the six variable model:

   Variable    VIF
   T_BALDUE     16
   GROSS2       43
   BALDUE2      28
   STRUCF        1
   EXEMPT2       1

Thus, the decision was made to use the five variable model. The parameter estimates are reported below:

   Variable    Parameter Estimate   Standard Error   Prob>|T|
   INTERCEP
   T_BALDUE
   GROSS2
   BALDUE2
   STRUCF
   EXEMPT2

Influence Data Points

Influence data points are points which exert an undue influence on the regression equation. This may be the result, for example, of an outlying observation.
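A small numerical sketch may help show what "undue influence" means and how it connects to the PRESS statistic used earlier. The code below fits a one-regressor line with and without a single extreme point, computes each point's leverage (the hat diagonal), and computes PRESS two ways: from the leverage shortcut e_i/(1 - h_i) and by explicitly refitting with each observation dropped. The data and function names are invented for illustration, and the closed-form leverage holds only for the one-regressor case.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def leverages(x):
    """Hat diagonals for simple regression: 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    return [1.0 / n + (a - mean_x) ** 2 / sxx for a in x]


def press_shortcut(x, y):
    """PRESS from ordinary residuals and leverages, with no refitting."""
    b0, b1 = fit_line(x, y)
    return sum(((yi - (b0 + b1 * xi)) / (1.0 - hi)) ** 2
               for xi, yi, hi in zip(x, y, leverages(x)))


def press_by_refitting(x, y):
    """PRESS computed literally: drop each point, refit, predict it."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Invented data: six ordinary points plus one extreme point (the last).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 30.0]
y = [12.0, 9.0, 14.0, 10.0, 13.0, 11.0, 90.0]

slope_all = fit_line(x, y)[1]
slope_without = fit_line(x[:-1], y[:-1])[1]
hat = leverages(x)

print("slope with extreme point:   ", round(slope_all, 3))
print("slope without extreme point:", round(slope_without, 3))
print("leverage of extreme point:  ", round(hat[-1], 3))
print("PRESS (shortcut):           ", round(press_shortcut(x, y), 2))
print("PRESS (refitting):          ", round(press_by_refitting(x, y), 2))
```

The two PRESS values agree because the leave-one-out residual equals the ordinary residual divided by one minus the leverage, so no refitting is actually needed.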
If a set of data for a given business type included one extremely large per hour field audit assessment, this data point could exert an undue influence on the regression equation for that business type. It is important to note that the mere presence of such a data point does not necessarily mean that it does exert an undue influence, only that it may do so. If it does, the data point would be termed an outlier.

Because of the nature of our data, influence data points are a serious problem for both the dependent and independent variables. The presence of large per hour assessments may produce outliers among the values of the dependent variable for some business types. The presence of large values for some independent variables (particularly large gross sales, large exempt sales, large use taxable, and large tax balances due) may produce high leverage data points.

Influence data points are not always readily apparent. Moreover, the remedy is a source of some controversy. While some statisticians recommend removing outliers from the data, others do not. If the data point is valid, that is to say, the data for that observation are correctly measured, then we feel that there should be a compelling reason for removing it from the data set.

Example 2

We have a group of manufacturers for which 53 sales tax audits have been performed, with an average per hour assessment of $18,241. This extremely high average per hour assessment leads us to suspect that there might be one or more outliers in the data, that is to say, observations which exert an undue influence on the regression equation.

Following the methodology discussed above, the STEPWISE option was used to select an initial model for analysis. This model is presented below:

   Step   Variable Entered   Model R2   Prob>|T|   Mallows' Cp
   1      USE2
   2      T_USE
   3      BALDUE2
   4      STRUCD
   5      DIRPAY
   6      STRUCA

where USE2=use taxable squared, T_USE=total use taxable, BALDUE2=total tax due squared, STRUCD=a dummy variable indicating whether the taxpayer registered as a domestic corporation, DIRPAY=a dummy variable indicating whether the taxpayer has a direct pay permit, and STRUCA=a dummy variable indicating whether the taxpayer registered as a sole proprietor.

The dominance of USE2 further alerted us to the possibility of a problem with the data. Even though it had a high R2, the large Mallows' Cp statistic indicated that the variable also has considerable bias. In addition, the PRESS statistic for this model was extremely large, indicating poor prediction capability.

The INFLUENCE option is used to produce statistics which measure the influence of each observation on the estimates. These statistics include the following: RSTUDENT (the studentized residuals), HAT DIAG H (the hat diagonals), COV RATIO (the covariance ratio), DFFITS (a scaled measure of the change in the predicted value for the ith observation), and DFBETAS (scaled measures of the change in each parameter estimate for each variable included in the model).
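Size-adjusted cutoffs are commonly used to decide which values of these statistics deserve a closer look, and they depend only on the sample size n and the number of parameters p. The sketch below evaluates the conventional formulas for this example, assuming n=53 audits and p=7 (the six variables in the model above plus the intercept). The function name and dictionary keys are hypothetical.

```python
from math import sqrt


def influence_cutoffs(n, p):
    """Conventional screening cutoffs for n observations and p parameters
    (p counts the intercept)."""
    return {
        "RSTUDENT (absolute value)": 2.0,
        "HAT DIAG H": 2.0 * p / n,
        "COV RATIO lower": 1.0 - 3.0 * p / n,
        "COV RATIO upper": 1.0 + 3.0 * p / n,
        "DFFITS": 2.0 * sqrt(p / n),
        "DFBETAS": 2.0 / sqrt(n),
    }


# Assumed for Example 2: 53 audits, six regressors plus an intercept.
cutoffs = influence_cutoffs(n=53, p=7)

for name, value in cutoffs.items():
    print(f"{name:<28}{value:.4f}")
```

An observation exceeding a cutoff is a candidate for inspection, not an automatic deletion.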
For the data set and model under consideration, the table below presents the values which would be considered indicators of potential influence points:

   Statistic     Value
   RSTUDENT      If absolute value is greater than 2
   HAT DIAG H    If value is greater than .2642 (2p/n, where p=number of parameters and n=sample size)
   COV RATIO     If value is less than .6038 or greater than 1.3962 (1 plus or minus 3(p/n))
   DFFITS        If value is greater than .7268 (2 times the square root of the quantity p/n)
   DFBETAS       If value is greater than .2747 (2 over the square root of n)

We found that a number of observations had values on one or more of the above statistics indicating that they may exert a large influence on the parameter estimates. One observation (Observation 11 in the data set) seemed to stand out from the others, however. The table below reports the influence diagnostic statistics for this observation:

   Statistic           Value
   RSTUDENT
   HAT DIAG H
   COV RATIO
   DFFITS
   INTERCEP DFBETAS
   DIRPAY DFBETAS
   T_USE DFBETAS
   STRUCA DFBETAS
   STRUCD DFBETAS
   BALDUE2 DFBETAS

The values of the above statistics led us to investigate this observation. We discovered that, although the data for the observation were correct, the assessment per hour for this observation was so large that it almost completely dominated the regression equation. We felt that we were justified in considering this data point to be an atypical value and therefore in removing it from the data set.

We removed this observation from the data set and ran PROC REG with the STEPWISE option again. The R2 for the data set without the observation was [...]. This model is presented below:

   Step   Variable Entered   Model R2   Prob>|T|   Mallows' Cp
   1      BALDUE2
   2      USECODEO
   3      USE2
   4      DIRPAY
   5      PERBALGR

where BALDUE2=total tax due squared, USECODEO=a dummy variable indicating whether the taxpayer registered as a peddler, USE2=use taxable squared, DIRPAY=a dummy variable indicating whether the taxpayer has a direct pay permit, and PERBALGR=a derived variable measuring total tax due as a percent of gross sales.

Even though the R2 is considerably lower, the PRESS statistic for the model for the data set with the atypical observation is much worse than it was for the model for the data set without the outlier. The PRESS statistic for the former model was 755,576,780,357 whereas for the latter model it was 5,636,307,[...]. Similarly, Mallows' Cp for the former model was 30; for the latter model it was 15. The mean square error for the former model was 841,432 while for the latter model it was 64,658. Thus, we feel justified in removing the data point from the data set.

We ran the INFLUENCE option against this new model to identify any additional influence data points. Using the same criteria discussed above, several data points still had values on the diagnostics which were of concern. Three data points particularly stood out. Two observations had studentized residuals well above two in absolute value. The third observation had a covariance ratio of 74. Two of these observations had large values for the dependent variable (that is, large assessments per hour), whereas the other observation was the result of a no change audit (i.e., assessment per hour=0). We did not feel at this point that any of these observations were sufficiently atypical of audits performed by the TDR to justify removing them from the data set.

We were concerned with the presence in the model of a term which was not significant at the .05 level. Therefore, we chose to run the model again without the variable PERBALGR. This resulted in a model with a slightly worse adjusted R2 and PRESS but, as the table below indicates, all terms in the model are now significant at the .05 level.

   Variable    Parameter Estimate   Standard Error   Prob>|T|
   INTERCEP
   BALDUE2
   USECODEO
   USE2
   DIRPAY

Finally, we ran the VIF option to get the variance inflation factors for the above model.
The VIFs, reported below, indicated that the model did not have a collinearity problem:

   Variable    VIF
   BALDUE2
   USECODEO
   USE2
   DIRPAY

CONCLUDING REMARKS

In conclusion, we would like to make some remarks on the SAS diagnostic procedures. SAS offers an impressive array of diagnostics. For the novice, the biggest problem may be deciding which diagnostics to use. Moreover, it is extremely easy to invoke most of the diagnostics. All the diagnostics discussed in this paper are options of the MODEL statement.

We were also impressed with the enhancements to Version 6.03, such as the CP and ADJRSQ model selection options, which produce a printout of the models ranked according to the best Mallows' Cp and adjusted R2 statistics, respectively. An option like this for the PRESS statistic would also be useful. We have not had an opportunity, however, to fully evaluate the enhancements to Version [...].

We were disappointed with some shortcomings, however, particularly in some of the output. For example, the PARTIAL option, which is used to produce partial regression residual plots, does not offer a convenient way of identifying the points. Moreover, an option which would plot the regression line of the partial Y residual on the partial X residual would also be useful (the slope of this line is equal to the parameter estimate of the independent variable for that plot). Since we are running SAS/STAT on a system with 640K RAM, invoking some of these options on the full model caused an out-of-memory error message. We were not able, for example, to run the CP model selection option for the full model.

In conclusion, for the type of analysis we are interested in performing, we found SAS/STAT to be a very powerful and useful statistical package and would recommend its use in similar types of data analysis applications.