Variable Selection and Transformation of Variables in SAS Enterprise Miner 5.2

Kattamuri S. Sarma, Ph.D.
Ecostat Research Corp., White Plains NY

Introduction

In predictive modeling and data mining one is often confronted with a large number of inputs (explanatory variables). The number of potential inputs to choose from may be 2000 or more. Some of these inputs may have no relation to the target. An initial screening is therefore necessary to eliminate irrelevant variables and keep the number of inputs to a manageable size. The Variable Selection node of SAS Enterprise Miner provides alternative methods for eliminating irrelevant variables and selecting variables that have predictive power.

In the process of variable selection, the Variable Selection node creates binned variables from interval-scaled inputs and grouped variables from nominal inputs. Sometimes a binned input is more strongly correlated with the target variable than the original input, indicating a non-linear relationship between the input and the target. The grouped variables are created by collapsing or grouping the categories of a nominal input. With fewer categories, the grouped variables are easier to use in modeling than the original ungrouped variables.

The predictive power of the inputs can sometimes be enhanced by making suitable transformations. One can use the Transform Variables node to select the best mathematical transformation for any given input, based on such criteria as maximizing normality or maximizing correlation with the target. The Transform Variables node can also be used for optimally binning the interval inputs and creating dummy variables from categorical inputs.

Variable selection and transformation can also be done by the Decision Tree node. The inputs that give significant splits in creating a decision tree are selected by the Decision Tree node and passed to the next node, which may be a Regression or Neural Network node. In addition to variable selection, the Decision Tree node creates a special categorical variable which indicates the leaf node to which a given record is assigned.

This paper discusses the details of the variable selection methods, the transformations, and the options available in these three nodes.

The Variable Selection node

There are two methods of variable selection available in the Variable Selection node: the R-Square method and the Chi-Square method.

R-Square Method

The R-Square method can be used with a binary target as well as with an interval-scaled target.

In the R-Square method, variable selection is performed in two steps. In the first step, the R-Square between each input and the target is calculated, and all variables with an R-Square above a specified threshold are selected. Only the variables selected in the first step enter the second step of variable selection.

Step 1: In this step, a preliminary selection is made based on the Minimum R-Square property of the Variable Selection node, which the user can specify (see Display 7). For each interval-scaled input, the Variable Selection node calculates two measures of correlation between the input and the target. One is the R-Square between the target and the original input. The other is the R-Square between the target and the binned version of the input variable. The binned variable is a categorical variable created by the Variable Selection node from each continuous (interval-scaled) input. The levels of this categorical variable are the bins. In Enterprise Miner, this binned variable is referred to as an AOV16 variable. The number of levels or categories of the binned variable (AOV16) is at most 16, corresponding to 16 intervals that are equal in width.

In the case of nominal-scaled categorical inputs with a continuous target, R-Square is calculated using one-way ANOVA. Here you have the option of using either the original or the grouped variables. Grouped variables are the new variables created by collapsing the levels of categorical variables. For example, suppose there is a categorical (nominal) variable called LIFESTYLE, which indicates the lifestyle of the customer. It may take on values such as Foreign Traveler, Urban Dweller, etc. If the variable LIFESTYLE has 100 levels or categories, it can be collapsed to fewer levels or categories by setting the Group Variables property to Yes, as shown in Display 7.

Step 2: In the second step, a sequential forward selection process is used. This process starts by selecting the input variable that has the highest correlation coefficient with the target. A regression equation (model) is estimated with the selected input. At each successive step of the sequence, the input variable that provides the largest incremental contribution to the Model R-Square is added to the regression. The selection process stops when no remaining input can contribute more than a specified lower bound to the Model R-Square. This lower bound can be specified by setting the Stop R-Square property (see Display 7) to the desired value.

Chi-Square Method

This criterion can be used when the target is binary. When this criterion is selected, the selection process does not have two distinct steps, as in the case of the R-Square criterion. Instead, a tree is constructed. The inputs selected in the construction of the tree are passed to the next node with the assigned role of Input.
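The two steps of the R-Square method can be illustrated with a short Python sketch. It is not Enterprise Miner's implementation: the function names, the sample data, the default stopping value, and the restriction of Step 2 to ungrouped interval inputs are all assumptions made for illustration. `r2_binned` imitates an AOV16-style variable by cutting the input into 16 equal-width bins and computing a one-way ANOVA R-Square, and `forward_select` keeps adding the input with the largest incremental R-Square until the gain falls below a value that plays the role of Stop R-Square.

```python
import math

def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def r2_linear(x, y):
    """Step 1: R-Square between the target and the original input."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) ** 2 / (dot(xc, xc) * dot(yc, yc))

def r2_binned(x, y, n_bins=16):
    """Step 1: one-way ANOVA R-Square using equal-width bins of x."""
    lo, width = min(x), (max(x) - min(x)) / n_bins
    groups = {}
    for xi, yi in zip(x, y):
        bin_id = min(int((xi - lo) / width), n_bins - 1)  # maximum goes in last bin
        groups.setdefault(bin_id, []).append(yi)
    grand_mean = sum(y) / len(y)
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    return ss_between / ss_total

def forward_select(inputs, target, stop_r2=0.0005):
    """Step 2: forward selection on incremental model R-Square.

    stop_r2 is an illustrative value, not a documented default.
    """
    y = center(target)
    ss_total = dot(y, y)
    remaining = {name: center(col) for name, col in inputs.items()}
    selected = []
    while remaining:
        # Gain in model R-Square from adding each remaining input, computed on
        # values already adjusted for the inputs selected so far.
        gains = {name: dot(x, y) ** 2 / (dot(x, x) * ss_total)
                 for name, x in remaining.items() if dot(x, x) > 1e-12}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] < stop_r2:
            break
        selected.append((best, gains[best]))
        xb = remaining.pop(best)
        norm = dot(xb, xb)
        # Remove the part explained by the chosen input from the target and
        # from every remaining input (Gram-Schmidt orthogonalization).
        coef_y = dot(xb, y) / norm
        y = [yi - coef_y * xi for xi, yi in zip(xb, y)]
        for name, x in remaining.items():
            coef = dot(xb, x) / norm
            remaining[name] = [v - coef * xi for xi, v in zip(xb, x)]
    return selected

# Invented data: a U-shaped relationship that a straight line cannot capture.
x = [i / 10 for i in range(101)]
u_shaped = [(xi - 5) ** 2 for xi in x]
print("original:", r2_linear(x, u_shaped), "binned:", r2_binned(x, u_shaped))

# Invented data: the target depends on a and b only; c is irrelevant.
a = [math.sin(i) for i in range(50)]
b = [math.cos(1.7 * i) for i in range(50)]
c = [float((i * 37) % 11) for i in range(50)]
target = [3 * ai + bi for ai, bi in zip(a, b)]
print(forward_select({"a": a, "b": b, "c": c}, target))
```

In the first example the binned version of the input has a much higher R-Square than the original input, which is the situation in which the binned variable signals a non-linear relationship. In the second example the irrelevant input never enters the model because its incremental contribution stays below the stopping value.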

Using Decision Tree node for Variable Selection

The Decision Tree node of Enterprise Miner can also be used for variable selection and transformation. The inputs which create significant splits in the development of the tree are passed to the next node with the role of Input. These are the variables selected by the Decision Tree node, and they can be used as inputs in the Regression node or in the Neural Network node. In addition to selecting variables, the Decision Tree node also creates a special categorical variable called _NODE_ and optionally passes it to the next node as an input. The variable _NODE_ can be used as a class input in the Regression node.

The Transform Variables node

Transformations for Interval Inputs

Simple Transformations

The available simple transformations are Log, Square Root, Inverse, Square, Exponential, and Standardize. They can be applied to any interval-scaled input. These simple transformations can be used irrespective of whether the target is categorical or continuous.

Binning Transformations

In Enterprise Miner, there are three ways of binning an interval-scaled variable. To use these as default transformations, select the Transform Variables node, and set the value of the Interval Inputs property to Bucket, Quantile, or Optimal in the Default Methods section.

Bucket: The Bucket option creates buckets by dividing the range of the input into n equal-width intervals and grouping the observations into the n buckets. The resulting number of observations may differ from bucket to bucket. For example, if AGE is divided into the four intervals 0-25, 25-50, 50-75, and 75-100, then the number of observations may be 100 in the interval 0-25 (bin 1), 2000 in the interval 25-50 (bin 2), 1000 in the interval 50-75 (bin 3), and 200 in the interval 75-100 (bin 4).

Quantile: This option groups the observations into quantiles (bins) with an equal number of observations in each. If there are 20 quantiles, then each quantile consists of 5% of the observations.
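The difference between the Bucket and Quantile options can be seen in a minimal Python sketch. The function names and the AGE values below are invented for illustration, and the handling of ties in `quantile_bins` (tied values may land in different bins) is a simplifying assumption rather than a description of what Enterprise Miner does.

```python
from collections import Counter

def bucket_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so it is put in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Assign each value to one of n_bins groups of (nearly) equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Invented, right-skewed AGE values between 18 and 78.
ages = [18 + 60 * (i / 199) ** 3 for i in range(200)]

print("bucket counts:  ", sorted(Counter(bucket_bins(ages, 4)).items()))
print("quantile counts:", sorted(Counter(quantile_bins(ages, 4)).items()))
```

With skewed data such as this, the bucket counts are very uneven while every quantile bin holds the same number of observations.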

Optimal Binning for Relationship to Target: This transformation is available for binary targets only. The input is split into a number of bins, and the splits are placed so as to make the distribution of the target levels (for example, response and non-response) in each bin significantly different from the distribution in the other bins.

Best Power Transformations

The Transform Variables node selects the best power transformation from among $X$, $\log(X)$, $\sqrt{X}$, $e^{X}$, $X^{1/4}$, $X^{2}$, and $X^{4}$, where $X$ is the input. Four criteria for "best" are available:

Maximum Normal: To find the transformation that maximizes normality, sample quantiles from each of the transformations listed above are compared with the theoretical quantiles of a normal distribution. The transformation that yields quantiles that are closest to the normal distribution is chosen. Suppose Y is obtained by applying one of the above transformations to X. The 0.75-sample quantile of the transformed variable Y, for example, is that value of Y at or below which 75% of the observations in the data set fall. The 0.75-quantile of a standard normal distribution is the value $z_{0.75}$ that satisfies $P(Z \le z_{0.75}) = 0.75$, where Z is a normal random variable with mean 0 and standard deviation 1. The 0.75-sample quantile for Y is compared with $z_{0.75}$, and similarly the other quantiles are compared with the corresponding quantiles of the standard normal distribution.

Maximum Correlation: This is available only for continuous targets. The transformation that yields the highest linear correlation with the target is chosen.

Equalize Spread with Target Levels: This method requires a class target. The method first calculates the variance of a given transformed variable within each target class. Then, for each transformation, it calculates the variance of these variances. It chooses the transformation that yields the smallest variance of the variances.

Optimal Maximum Equalize Spread with Target Level: This method requires a class target. It chooses the method that equalizes spread with the target.

Transformations of Class Inputs

For class inputs, two types of transformations are available.

Group Rare Levels transformation: This transformation combines the rare levels into a separate group, _OTHER_. To define a rare level, you define a cutoff value.

Dummy Indicators transformation: This transformation creates dummy variables from the levels of a class input.

To choose one of these available transformations, select the Transform Variables node and set the value of the Class Inputs property to the desired transformation.
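The Maximum Normal criterion for interval inputs, described earlier in this section, can be sketched in Python as follows. The paper does not say how "closest" is measured or how Y is put on the scale of the standard normal distribution, so the sketch assumes that Y is standardized and that closeness is the mean squared difference between sample and theoretical quantiles. The plotting positions (i + 0.5)/n, the function names, the sample data, and the choice to skip transformations that are undefined for the data are likewise assumptions.

```python
import math
from statistics import NormalDist, mean, pstdev

# Candidate transformations named in the text.
TRANSFORMS = {
    "x": lambda v: v,
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
    "exp(x)": math.exp,
    "x^(1/4)": lambda v: math.pow(v, 0.25),
    "x^2": lambda v: v ** 2,
    "x^4": lambda v: v ** 4,
}

def normality_distance(values):
    """Mean squared gap between standardized sample quantiles and
    standard normal quantiles (assumed measure of closeness)."""
    n = len(values)
    m, s = mean(values), pstdev(values)
    z_sample = sorted((v - m) / s for v in values)
    z_theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return sum((a - b) ** 2 for a, b in zip(z_sample, z_theory)) / n

def best_normal_transform(x):
    """Return the name of the transformation closest to normal, plus all scores."""
    scores = {}
    for name, func in TRANSFORMS.items():
        try:
            scores[name] = normality_distance([func(v) for v in x])
        except (ValueError, OverflowError, ZeroDivisionError):
            continue  # transformation not defined for this data; skip it
    return min(scores, key=scores.get), scores

# Invented input whose logarithm follows the normal quantiles exactly.
n = 200
x = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]
best, scores = best_normal_transform(x)
print("best transformation:", best)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {score:.4f}")
```

Because the sample input was built as the exponential of normal quantiles, the log transformation comes out with the smallest distance.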

Transformation after Variable Selection

If you have a large number of inputs, you can make an initial variable selection, then transform the selected variables and use them in Regression or another modeling tool. This scenario is shown in Display 1.

Display 1

Transformation before Variable Selection

If you have only a small number of inputs (a hundred or fewer), you can transform the variables first, and then select the best variables from the transformed and original variables. This scenario is shown in Display 2.

Display 2

Variable Selection and Transformation of variables using the Decision Tree

As described before, the Decision Tree node selects variables which produce significant splits, and passes them to the next node. In addition, the Decision Tree node creates a categorical variable called _NODE_. For any given record, the value of this variable is the leaf node to which the record is assigned. Display 3 shows the process flow diagram for using the Decision Tree node for variable selection and transformation.
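To make the role of _NODE_ concrete, the following Python sketch walks records through a small hand-written tree. The tree, its node numbers, split variables, and cutoffs are entirely hypothetical, and no tree is actually fitted here. The sketch only shows how a leaf identifier becomes a categorical value for each record and how the split variables form the selected inputs.

```python
# Hypothetical tree: node ids, split variables, and cutoffs are made up.
TREE = {
    "node": 1, "var": "AGE", "cut": 40,
    "left": {"node": 2, "var": "INCOME", "cut": 50000,
             "left": {"node": 4}, "right": {"node": 5}},
    "right": {"node": 3},
}

def leaf_node(record, tree=TREE):
    """Return the id of the leaf a record falls into (the _NODE_ value)."""
    while "var" in tree:  # internal nodes carry a split variable
        branch = "left" if record[tree["var"]] < tree["cut"] else "right"
        tree = tree[branch]
    return tree["node"]

def split_variables(tree=TREE):
    """Return the inputs used in at least one split (the selected variables)."""
    if "var" not in tree:
        return set()
    return ({tree["var"]}
            | split_variables(tree["left"])
            | split_variables(tree["right"]))

records = [
    {"AGE": 30, "INCOME": 40000, "TENURE": 2},
    {"AGE": 30, "INCOME": 60000, "TENURE": 9},
    {"AGE": 55, "INCOME": 45000, "TENURE": 5},
]
for rec in records:
    rec["_NODE_"] = leaf_node(rec)
    print(rec)

print("selected inputs:", sorted(split_variables()))
```

TENURE never appears in a split, so it is not among the selected inputs, while every record receives a leaf id that a downstream model could treat as a class input.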

Display 3

Display 4 shows the property settings of the Decision Tree node for variable selection and variable transformation.

Display 4: Decision Tree node

In order to use the Decision Tree node for variable selection and transformation, you should set the Variable Selection property to YES, the Leaf Variable property to YES, and the Leaf Role property to Input, as shown in Display 4. For a detailed discussion of the Decision Tree node, see Predictive Modeling with SAS Enterprise Miner by the author of this paper.

Property Settings of the nodes

In any process flow diagram the first node is the Input Data node, which makes the data set available for the project. The property panel of the Input Data node is shown in Display 5.

Display 5: Input Data node

To make the data available for the project, one first has to create a data source. Creation of a data source is illustrated step by step in the book Predictive Modeling with SAS Enterprise Miner. From the property panel shown in Display 5, it can be seen that the name of the data set is NESUG2007 and that it is in the library assigned to T1. Display 6 shows the Data Partition node.

Display 6: Data Partition node

From the property panel shown in Display 6, it can be seen that 40% of the records are allocated for training, 30% for validation, and 30% for test, and that the data is split by the default method. For binary targets the default method is stratified sampling. Display 7 shows the properties panel for the Variable Selection node.
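The 40/30/30 stratified split set in the Data Partition node can be sketched in a few lines of Python. This is a generic illustration of stratified sampling on a binary target, not the node's actual algorithm. The function name, the seed, the target name RESP, and the invented records are assumptions.

```python
import random
from collections import defaultdict

def stratified_partition(records, target, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split records into training, validation, and test sets so that each
    set keeps roughly the same mix of target levels as the full data."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec[target]].append(rec)
    train, valid, test = [], [], []
    for rows in by_level.values():
        rng.shuffle(rows)  # random order within each target level
        n_train = round(fractions[0] * len(rows))
        n_valid = round(fractions[1] * len(rows))
        train += rows[:n_train]
        valid += rows[n_train:n_train + n_valid]
        test += rows[n_train + n_valid:]  # the remainder goes to test
    return train, valid, test

# Invented data: 1000 records with a 10% response rate.
records = [{"id": i, "RESP": 1 if i < 100 else 0} for i in range(1000)]
train, valid, test = stratified_partition(records, "RESP")
for name, part in (("train", train), ("validation", valid), ("test", test)):
    rate = sum(r["RESP"] for r in part) / len(part)
    print(f"{name:10s} n={len(part):4d} response rate={rate:.2f}")
```

Splitting within each target level is what keeps the response rate the same across the three partitions, which matters when responders are rare.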

Display 7: Variable Selection node

Display 8 shows the property panel of the Transform Variables node.

Display 8: Transform Variables node

The transformations chosen in Display 8 are Maximum Normal for interval inputs and Dummy Indicators for class inputs. These are the default methods. However, one can open the Variables window of the Transform Variables node and specify different transformations for different inputs. Display 9 shows the transformations available for interval inputs in Enterprise Miner, and Display 10 shows the transformations available for class inputs.

Display 9: Transformations for Interval Inputs

Display 10: Transformations for Class Inputs

Reference

Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications. Cary, NC: SAS Institute Inc.,