Variable Selection and Transformation of Variables in SAS Enterprise Miner 5.2


Kattamuri S. Sarma, Ph.D., Ecostat Research Corp., White Plains, NY

Introduction

In predictive modeling and data mining one is often confronted with a large number of inputs (explanatory variables). The number of potential inputs to choose from may be as large as 2000 or higher. Some of these inputs may have no relation to the target. An initial screening is therefore necessary to eliminate irrelevant variables and keep the number of inputs at a manageable size. The Variable Selection node of SAS Enterprise Miner provides alternative methods for eliminating irrelevant variables and selecting variables that have predictive power.

In the process of variable selection, the Variable Selection node creates binned variables from interval-scaled inputs and grouped variables from nominal inputs. Sometimes a binned input is more strongly correlated with the target variable than the original input, indicating a non-linear relationship between the input and the target. The grouped variables are created by collapsing, or grouping, the categories of a nominal input. With fewer categories, the grouped variables are easier to use in modeling than the original ungrouped variables.

The predictive power of the inputs can sometimes be enhanced by making suitable transformations. One can use the Transform Variables node to select the best mathematical transformation for any given input, based on such criteria as maximizing normality or maximizing correlation with the target. The Transform Variables node can also be used for optimally binning the interval inputs and creating dummy variables from categorical inputs.

Variable selection and transformation is also done by the Decision Tree node. The inputs that give significant splits in creating a decision tree are selected by the Decision Tree node and passed to the next node, which may be a Regression or Neural Network node.
In addition to variable selection, the Decision Tree node creates a special categorical variable which indicates the leaf node to which a given record is assigned.

This paper discusses the details of the variable selection methods, the transformations, and the options available in these three nodes.

The Variable Selection node

There are two methods of variable selection available in the Variable Selection node: the R-Square and Chi-Square methods of selection.

R-Square Method

The R-Square method can be used with a binary as well as an interval-scaled target.

In the R-Square method, variable selection is performed in two steps. In the first step, the R-Square between each input and the target is calculated, and all variables with a correlation above a specified threshold are selected. The variables selected in the first step enter the second step of variable selection.

Step 1: In this step a preliminary selection is made, based on the Minimum R-Square property of the Variable Selection node, which the user can specify (see Display 7). For each interval-scaled input, the Variable Selection node calculates two measures of correlation between the input and the target. One is the R-Square between the target and the original input. The other is the R-Square between the target and a binned version of the input. The binned variable is a categorical variable created by the Variable Selection node from each continuous (interval-scaled) input; the levels of this categorical variable are the bins. In Enterprise Miner, this binned variable is referred to as an AOV16 variable. The number of levels or categories of the binned (AOV16) variable is at most 16, corresponding to 16 intervals of equal width.

In the case of nominal-scaled categorical inputs with a continuous target, the R-Square is calculated using one-way ANOVA. Here you have the option of using either the original or the grouped variables. Grouped variables are new variables created by collapsing the levels of the categorical variables. For example, suppose there is a categorical (nominal) variable called LIFESTYLE, which indicates the lifestyle of the customer. It may take on values such as Foreign Traveler, Urban Dweller, etc. If the variable LIFESTYLE has 100 levels or categories, it can be collapsed to fewer levels by setting the Group Variables property to Yes, as shown in Display 7.

Step 2: In the second step, a sequential forward selection process is used.
This process starts by selecting the input variable that has the highest correlation coefficient with the target. A regression equation (model) is estimated with the selected input. At each successive step of the sequence, the input variable that provides the largest incremental contribution to the Model R-Square is added to the regression. The selection process stops when the incremental contribution to the Model R-Square falls below a lower bound, which can be specified by setting the Stop R-Square property (see Display 7) to the desired value.

Chi-Square Method

This criterion can be used when the target is binary. When this criterion is selected, the selection process does not have two distinct steps, as in the case of the R-Square criterion. Instead, a tree is constructed, and the inputs selected in the construction of the tree are passed to the next node with the assigned role of Input.
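The two-step R-Square method described above can be sketched in plain NumPy. This is a minimal illustration of the idea (equal-width AOV16 binning, an ANOVA-style R-Square for the binned input, and forward selection with a stopping threshold), not Enterprise Miner's internal implementation; all function names are illustrative, and only the property names Minimum R-Square and Stop R-Square come from the paper.

```python
import numpy as np

def r_square(x, y):
    """Simple R-Square: squared Pearson correlation of one input with the target."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

def aov16_bins(x, n_bins=16):
    """Bin an interval input into at most 16 equal-width intervals (AOV16-style)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def binned_r_square(x, y, n_bins=16):
    """One-way ANOVA R-Square of the target on the binned input.

    A binned R-Square that is much larger than the simple R-Square hints at
    a non-linear relationship between input and target."""
    bins = aov16_bins(x, n_bins)
    overall = y.mean()
    ss_total = ((y - overall) ** 2).sum()
    ss_between = sum(len(y[bins == b]) * (y[bins == b].mean() - overall) ** 2
                     for b in np.unique(bins))
    return ss_between / ss_total

def model_r2(X, y, cols):
    """R-Square of an OLS fit of y on the chosen columns plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, min_r2=0.005, stop_r2=0.0005):
    """Step 1: keep inputs whose simple R-Square meets the Minimum R-Square.
    Step 2: add inputs while the gain in Model R-Square exceeds Stop R-Square."""
    candidates = [j for j in range(X.shape[1]) if r_square(X[:, j], y) >= min_r2]
    selected, best_r2 = [], 0.0
    while candidates:
        gains = [model_r2(X, y, selected + [j]) for j in candidates]
        top = int(np.argmax(gains))
        if gains[top] - best_r2 < stop_r2:
            break
        best_r2 = gains[top]
        selected.append(candidates.pop(top))
    return selected
```

On data where one input drives the target and a second is pure noise, `forward_select` keeps only the first; on a quadratic relationship, `binned_r_square` is far higher than the simple `r_square`, which is exactly the signal the AOV16 variable is meant to capture.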

Using the Decision Tree node for Variable Selection

The Decision Tree node of Enterprise Miner can also be used for variable selection and transformation. The inputs which create significant splits in the development of the tree are passed to the next node with the role of Input. These are the variables selected by the Decision Tree node, and they can be used as inputs in the Regression node or the Neural Network node. In addition to selecting variables, the Decision Tree node creates a special categorical variable called _NODE_ and optionally passes it to the next node as an input. The variable _NODE_ can be used as a class input in the Regression node.

The Transform Variables node

Transformations for Interval Inputs

Simple Transformations: The available simple transformations are Log, Square Root, Inverse, Square, Exponential, and Standardize. They can be applied to any interval-scaled input, irrespective of whether the target is categorical or continuous.

Binning Transformations: In Enterprise Miner, there are three ways of binning an interval-scaled variable. To use one of these as the default transformation, select the Transform Variables node and set the value of the Interval Inputs property to Bucket, Quantile, or Optimal in the Default Methods section.

Bucket: The Bucket option divides the range of the input into n equal-width intervals and groups the observations into the resulting n buckets. The number of observations may therefore differ from bucket to bucket. For example, if AGE is divided into the four intervals 0–25, 25–50, 50–75, and 75–100, then the number of observations may be 100 in the interval 0–25 (bin 1), 2000 in the interval 25–50 (bin 2), 1000 in the interval 50–75 (bin 3), and 200 in the interval 75–100 (bin 4).
Quantile: This option groups the observations into quantiles (bins) with an equal number of observations in each. If there are 20 quantiles, then each quantile contains 5% of the observations.
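The contrast between the Bucket and Quantile options can be sketched as follows. This is a plain NumPy illustration of the two binning rules, not the Transform Variables node's own code; the function names are made up for this example.

```python
import numpy as np

def bucket_bin(x, n=4):
    """Bucket: equal-width intervals; counts may differ from bucket to bucket."""
    edges = np.linspace(x.min(), x.max(), n + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n - 1)

def quantile_bin(x, n=4):
    """Quantile: (nearly) equal counts per bin; interval widths may differ."""
    edges = np.quantile(x, np.linspace(0, 1, n + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, n - 1)
```

On a right-skewed variable such as the AGE example above, `bucket_bin` piles most observations into the first bucket, while `quantile_bin` spreads them evenly across the four bins.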

Optimal Binning for Relationship to Target: This transformation is available for binary targets only. The input is split into a number of bins, and the splits are placed so as to make the distribution of the target levels (for example, response and non-response) in each bin significantly different from the distribution in the other bins.

Best Power Transformations

The Transform Variables node selects the best power transformation from among X, log(X), sqrt(X), e^X, X^(1/4), X^2, and X^4, where X is the input. There are four criteria of "best" available:

Maximum Normal: To find the transformation that maximizes normality, sample quantiles from each of the transformations listed above are compared with the theoretical quantiles of a normal distribution. The transformation that yields quantiles closest to those of the normal distribution is chosen. Suppose Y is obtained by applying one of the above transformations to X. The 0.75 sample quantile of the transformed variable Y, for example, is that value of Y at or below which 75% of the observations in the data set fall. The 0.75 quantile of the standard normal distribution is 0.6745, given by P(Z ≤ 0.6745) = 0.75, where Z is a normal random variable with mean 0 and standard deviation 1. The 0.75 sample quantile of Y is compared with 0.6745, and similarly the other sample quantiles are compared with the corresponding quantiles of the standard normal distribution.

Maximum Correlation: This criterion is available only for continuous targets. The transformation that yields the highest linear correlation with the target is chosen.

Equalize Spread with Target Levels: This method requires a class target. The method first calculates the variance of a given transformed variable within each target class, and then, for each transformation, the variance of these within-class variances. It chooses the transformation that yields the smallest variance of the variances.

Optimal Maximum Equalize Spread with Target Level: This method requires a class target.
It chooses the transformation that best equalizes spread with the target.

Transformations of Class Inputs

For class inputs, two types of transformations are available.

Group Rare Levels: This transformation combines the rare levels into a separate group, _OTHER_. A level is considered rare if its frequency falls below a cutoff value that you define.

Dummy Indicators: This transformation creates a dummy (indicator) variable for each level of the class input.

To choose one of these transformations, select the Transform Variables node and set the value of the Class Inputs property to the desired transformation.
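The Maximum Normal criterion described above can be sketched in Python: apply each candidate power transformation, compare standardized sample quantiles with the corresponding standard normal quantiles, and keep the transformation with the smallest discrepancy. This is an illustrative sketch of the quantile comparison the paper describes, not the node's internal algorithm; the helper names are made up for this example.

```python
import numpy as np
from statistics import NormalDist

# The candidate transformations named in the paper:
# X, log(X), sqrt(X), e^X, X^(1/4), X^2, X^4.
CANDIDATES = {
    "x":     lambda x: x,
    "log":   lambda x: np.log(x),
    "sqrt":  lambda x: np.sqrt(x),
    "exp":   lambda x: np.exp(x),
    "x^1/4": lambda x: x ** 0.25,
    "x^2":   lambda x: x ** 2,
    "x^4":   lambda x: x ** 4,
}

def distance_from_normal(v):
    """Largest gap between standardized sample quantiles and the
    corresponding quantiles of the standard normal distribution."""
    v = (v - v.mean()) / v.std()
    probs = np.linspace(0.05, 0.95, 19)
    sample_q = np.quantile(v, probs)
    theo_q = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.abs(sample_q - theo_q).max())

def best_max_normal(x):
    """Pick the candidate whose quantiles are closest to the normal's."""
    with np.errstate(all="ignore"):
        scores = {name: distance_from_normal(f(x))
                  for name, f in CANDIDATES.items()
                  if np.all(np.isfinite(f(x)))}
    return min(scores, key=scores.get)
```

For a lognormally distributed input, for example, `best_max_normal` picks the log transformation, since log(X) is then exactly normal.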

Transformation after Variable Selection

If you have a large number of inputs, you can make an initial variable selection, then transform the selected variables and use them in the Regression node or another modeling tool. This scenario is shown in Display 1.

Display 1

Transformation before Variable Selection

If you have only a small number of inputs (a hundred or fewer), you can transform the variables first and then select the best variables from the transformed and original variables. This scenario is shown in Display 2.

Display 2

Variable Selection and Transformation of Variables Using the Decision Tree

As described before, the Decision Tree node selects the variables which produce significant splits and passes them to the next node. In addition, the Decision Tree node creates a categorical variable called _NODE_. For any given record, the value of this variable is the leaf node to which the record is assigned. Display 3 shows the process flow diagram for using the Decision Tree node for variable selection and transformation.

Display 3

Display 4 shows the property settings of the Decision Tree node for variable selection and variable transformation.

Display 4: Decision Tree node

In order to use the Decision Tree node for variable selection and transformation, set the Variable Selection property to Yes, the Leaf Variable property to Yes, and the Leaf Role property to Input, as shown in Display 4. For a detailed discussion of the Decision Tree node, see Predictive Modeling with SAS Enterprise Miner by the author of this paper.
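The mechanics of the _NODE_ leaf variable can be illustrated with a toy regression stump. This is not the Decision Tree node's actual splitting algorithm (which is far richer); it only shows the idea that each record is mapped to a leaf id, which can then serve as a class input downstream. Function names and the leaf numbering (root = 1, children = 2 and 3) are assumptions for this sketch.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on x that minimizes the total squared error
    around the two leaf means (a one-split regression stump)."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t

def node_variable(x, y):
    """Return a _NODE_-style categorical variable: the leaf id of each
    record (2 = left child, 3 = right child of the root)."""
    t = best_split(x, y)
    return np.where(x <= t, 2, 3)
```

The resulting leaf-id column is categorical, so in a downstream regression it would be used as a class input, just as the paper recommends for _NODE_.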

Property Settings of the Nodes

In any process flow diagram the first node is the Input Data node, which makes the data set available to the project. The property panel of the Input Data node is shown in Display 5.

Display 5: Input Data node

For the data to be available to the project, one has to first create a data source. Creation of a data source is illustrated step by step in the book Predictive Modeling with SAS Enterprise Miner. From the property panel shown in Display 5, it can be seen that the name of the data set is NESUG2007 and that it is in the library assigned to T1. Display 6 shows the Data Partition node.

Display 6: Data Partition node

From the property panel shown in Display 6, it can be seen that 40% of the records are allocated for training, 30% for validation, and 30% for test, and that the data is split by the default method. For binary targets the default method is stratified sampling. Display 7 shows the property panel of the Variable Selection node.
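A stratified 40/30/30 split like the one above can be sketched in plain NumPy: within each target class, shuffle the records and allocate them to the three partitions in the given fractions, so each partition preserves the target's class proportions. This is an illustrative sketch of stratified sampling, not the Data Partition node's own procedure.

```python
import numpy as np

def stratified_partition(y, fractions=(0.4, 0.3, 0.3), seed=0):
    """Label each record 'train', 'validate', or 'test', stratifying on the
    (binary or class) target y so class proportions match in each partition."""
    rng = np.random.default_rng(seed)
    labels = np.empty(len(y), dtype=object)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)       # records belonging to this class
        rng.shuffle(idx)                     # random order within the class
        n_train = int(round(fractions[0] * len(idx)))
        n_valid = int(round(fractions[1] * len(idx)))
        labels[idx[:n_train]] = "train"
        labels[idx[n_train:n_train + n_valid]] = "validate"
        labels[idx[n_train + n_valid:]] = "test"
    return labels
```

With a 70/30 binary target, each of the three partitions comes out with the same 70/30 class mix, which is the point of stratifying rather than drawing a simple random split.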

Display 7: Variable Selection node

Display 8 shows the property panel of the Transform Variables node.

Display 8: Transform Variables node

In Display 8, the transformation chosen is Maximum Normal for interval inputs and Dummy Indicators for class inputs. These are the default methods. However, one can open the Variables window of the Transform Variables node and specify different transformations for different inputs. Display 9 shows the transformations available for interval inputs in Enterprise Miner, and Display 10 shows the transformations available for class inputs.

Display 9: Transformations for Interval Inputs

Display 10: Transformations for Class Inputs

Reference

Sarma, Kattamuri S. (2007), Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications, Cary, NC: SAS Institute Inc.