Predictive Modeling Using SAS Enterprise Miner 5.1. Course Notes


Predictive Modeling Using SAS Enterprise Miner 5.1 Course Notes was developed by Jim Georges. Additional contributions were made by Dan Kelly and Bob Lucas. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Predictive Modeling Using SAS Enterprise Miner 5.1 Course Notes

Copyright 2004 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code 59927, course code PMEM5, prepared date 07Sep04.

Table of Contents

Course Description
Prerequisites
General Conventions

Chapter 1 Basic Predictive Modeling
   Starting the Analysis
   Preparing the Tools
   Constructing a Predictive Model
   Adjusting Predictions
   Making Optimal Decisions
   Parametric Prediction
   Tuning a Parametric Model
   Comparing Predictive Models
   Deploying a Predictive Model
   Summarizing the Analysis

Chapter 2 Flexible Parametric Modeling
   Defining Flexible Parametric Models
   Constructing Neural Networks
   Deconstructing Neural Networks

Chapter 3 Predictive Algorithms
   Growing Trees
   Constructing Trees
   3.3 Applying Decision Trees

Appendix A Exercises
   A.1 Introduction to Predictive Modeling
   A.2 Flexible Parametric Models
   A.3 Predictive Algorithms

Course Description

This course illustrates methods for overcoming common data mining challenges on actual business data. Course topics include optimizing predictive decisions, comparing predictive models, deploying predictive models, constructing and tuning multi-layer perceptrons (neural network models), and constructing and adjusting tree models.

To learn more

A full curriculum of general and statistical instructor-based training is available at any of the Institute's training facilities. Institute instructors can also provide on-site training. For information on other courses in the curriculum, contact the SAS Education Division or send e-mail to training@sas.com. You can also find this information on the Web at support.sas.com/training as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact the SAS Publishing Department or send e-mail to sasbook@sas.com. Customers outside the USA, please contact your local SAS office. Also, see the Publications Catalog on the Web for a complete list of books and a convenient order form.

Prerequisites

Before attending this course, you should
   be familiar with simple regression modeling concepts
   have some experience with creating and managing SAS data sets, which you can gain from the Introduction to Programming Concepts Using SAS Software or SAS Programming I: Essentials course.

General Conventions

This section explains the various conventions used in presenting text, SAS language syntax, and examples in this book.

Typographical Conventions

You will see several type styles in this book. This list explains the meaning of each style:

   UPPERCASE ROMAN   is used for SAS statements and other SAS language elements when they appear in the text.
   italic            identifies terms or concepts that are defined in text. Italic is also used for book titles when they are referenced in text, as well as for various syntax and mathematical elements.
   bold              is used for emphasis within text.
   monospace         is used for examples of SAS programming statements and for SAS character strings. Monospace is also used to refer to variable and data set names, field names in windows, information in fields, and user-supplied information.
   select            indicates selectable items in windows and menus. This book also uses icons to represent selectable items.

Syntax Conventions

The general forms of SAS statements and commands shown in this book include only that part of the syntax actually taught in the course. For complete syntax, see the appropriate SAS reference guide.

   PROC CHART DATA = SAS-data-set;
      HBAR | VBAR chart-variables </ options>;
   RUN;

This is an example of how SAS syntax is shown in text:

   PROC and CHART are in uppercase bold because they are SAS keywords.
   DATA= is in uppercase to indicate that it must be spelled as shown.
   SAS-data-set is in italic because it represents a value that you supply. In this case, the value must be the name of a SAS data set.
   HBAR and VBAR are in uppercase bold because they are SAS keywords. They are separated by a vertical bar to indicate they are mutually exclusive; you can choose one or the other.
   chart-variables is in italic because it represents a value or values that you supply.
   </ options> represents optional syntax specific to the HBAR and VBAR statements. The angle brackets enclose the slash as well as options because if no options are specified you do not include the slash.
   RUN is in uppercase bold because it is a SAS keyword.


Chapter 1 Basic Predictive Modeling

   1.1 Starting the Analysis
   Preparing the Tools
   Constructing a Predictive Model
   Adjusting Predictions
   Making Optimal Decisions
   Parametric Prediction
   Tuning a Parametric Model
   Comparing Predictive Models
   Deploying a Predictive Model
   Summarizing the Analysis


1.1 Starting the Analysis

[Slide: Starting the Analysis. Analytic Objective, Data Preparation, Predictive Modeling, Results Integration]

The task of predictive modeling does not stand by itself. To build a successful predictive model you must first, and unambiguously, define an analytic objective. The predictive model serves as a means of fulfilling the analytic objective.

The predictive modeling effort is surrounded by two other tasks. Before modeling begins, data must be assembled, often from a variety of sources, and arranged in a format suitable for model building. After the modeling is complete, the resulting model (and the modeling results) must be integrated into the business environment that originally motivated the modeling. These tasks often require more effort than the modeling itself.

This chapter focuses on the middle of the trilogy. The data preparation tasks are deferred to Chapter 3. The integration task is relegated to another course.

[Slide: Analytic Objective Examples. Response; Up-sell and cross-sell; Risk assessment; Attrition; Lifetime value]

Analytic objectives that involve predictive modeling are limited only by your imagination. In SAS Enterprise Miner's intended domain, however, many fall into one of a limited number of categories (Parr Rud, 2001).

   Response models attempt to identify individuals likely to respond to an offer or solicitation.
   Up-sell and cross-sell models are used to predict the likelihood of existing customers wanting additional products from the same company.
   Risk assessment models quantify the likelihood of events that will adversely affect a business.
   Attrition models gauge the probability of a customer taking his or her business elsewhere.
   Lifetime value models evaluate the overall profitability of a customer over a predetermined length of time.

To achieve a particular objective, it is often necessary to combine several predictive models. For example, a lifetime value model must account not only for the money a customer will spend but also for the length of time the customer will continue spending.

In this course, attention centers on a single analytic objective category: response modeling. Many of the concepts learned translate directly to other types of problems.

[Slide: Modeling Example. Business: National veterans organization. Objective: From population of lapsing donors, identify individuals worth continued solicitation. Source: 1998 KDD-Cup Competition via UCI KDD Archive]

A national veterans organization seeks to better target its solicitations for donation. By soliciting only the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns.

Solicitations involve sending a small gift to an individual together with a request for donation. Gifts include mailing labels and greeting cards.

The organization has more than 3.5 million individuals in its mailing database. These individuals have been classified by their response behavior to previous solicitation efforts. Of particular interest is the class of individuals identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago. The organization has found that by predicting the response behavior of this group, it can use the model to rank all 3.5 million individuals in its database. With this ranking, a decision can be made to either solicit or ignore an individual in the current solicitation campaign.

The current campaign refers to a greeting card mailing sent in June of 1997. It is identified in the raw data as the 97NK campaign.

The source of this data is the Association for Computing Machinery's (ACM) 1998 KDD-Cup competition. The data set and other details of the competition are publicly available at the UCI KDD Archive.

[Slide: Data Preparation. Donor Master, Demographics, and Transaction Detail feed the Raw Analysis Data (95,412 records, 481 fields)]

Before a predictive model can be built to address the organization's analytic objective, an analysis data set must be assembled. Usually, an analysis data set is assembled from multiple source data sets. Examples of the source data sets include a donor master table containing facts about individual donors, demographic overlay data from external data vendors or public sources (such as the U.S. Census Bureau), and transaction detail tables that capture the flow of information and money to and from the organization.

Using a variety of summarization and transformation techniques, these data sets were combined to form a raw analysis data set. The defining characteristic of the analysis data set is the presence of a single record for each individual in the analysis population.

The KDD-Cup supplied data, called cup98lrn.txt on the UCI website, is an example of a raw analysis data set. It contains more than 95,000 records and almost 500 fields. Each field provides a fact about an individual in the veterans organization's donor population.

[Slide: Additional Data Preparation. Raw Analysis Data (95,412 records, 481 fields) is reduced to Final Analysis Data (19,372 records, 50 fields)]
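The idea of collapsing several source tables into one record per individual can be sketched in a few lines of Python. This is only an illustration of the principle, not the procedure used to build the KDD-Cup data; the table layouts, the field names (donor_id, in_house, amount), and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical source tables; every name and value here is made up.
donor_master = [
    {"donor_id": 1, "in_house": 0},
    {"donor_id": 2, "in_house": 1},
]
transactions = [
    {"donor_id": 1, "amount": 10.0},
    {"donor_id": 1, "amount": 25.0},
    {"donor_id": 2, "amount": 5.0},
]


def build_analysis_data(master, detail):
    """Return one summary record per donor in the master table."""
    # Group the transaction amounts by donor.
    gifts = defaultdict(list)
    for row in detail:
        gifts[row["donor_id"]].append(row["amount"])

    analysis = []
    for donor in master:
        amounts = gifts.get(donor["donor_id"], [])
        # Many transaction rows are summarized into a few fields,
        # so the result keeps exactly one record per individual.
        analysis.append({
            **donor,
            "gift_count": len(amounts),
            "gift_amount": sum(amounts),
        })
    return analysis


analysis_data = build_analysis_data(donor_master, transactions)
```

The result has as many records as the master table, regardless of how many transaction rows each donor has.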

The raw analysis data has been reduced for the purpose of this course. A subset of just over 19,000 records has been selected for modeling. As will be seen, this subset was not chosen arbitrarily. In addition, the 481 fields have been reduced to 50. Some fields were eliminated after considering their potential association with the analysis objective (for example, it is doubtful that CD player ownership is strongly correlated with donation potential). Other fields were combined to form summaries of a particular customer behavior.

It is important to obtain some understanding of the composition of the data before modeling. The following describes the origin and source of the newly created variables.

Analysis Data Definition: Donor master data

   CONTROL_NUMBER        Unique Donor ID
   MONTHS_SINCE_ORIGIN   Elapsed time since first donation
   IN_HOUSE              1=Given to In House program, 0=Not In House donor

The donor master data contributes three fields to the final analysis data. The control number uniquely identifies each member of the analysis population. The number of months since origin is a field derived from the first donation date. A final field identifies donors who are part of the organization's In House program.

Analysis Data Definition: Demographic and other overlay data

   OVERLAY_SOURCE    M=Metromail, P=Polk, B=both
   DONOR_AGE         Age as of June 1997
   DONOR_GENDER      Actual or inferred gender
   PUBLISHED_PHONE   Published telephone listing
   HOME_OWNER        H=homeowner, U=unknown
   MOR_HIT           Mail order response hit rate

The next fields come from demographic and external vendor overlays. By matching on the donor's name and address (found in the donor master file), information about the donor can be obtained from commercial data vendors such as Metromail and Polk. Most of these fields are self-explanatory with the exception of the mail order response hit rate. This field counts the number of known responses to mail order solicitations from all known sources.

Analysis Data Definition: Demographic and other overlay data

   CLUSTER_CODE           54 Socio-economic cluster codes
   SES                    5 Socio-economic cluster codes
   INCOME_GROUP           7 income group levels
   MED_HOUSEHOLD_INCOME   Median income in $100's
   PER_CAPITA_INCOME      Income per capita in dollars
   WEALTH_RATING          10 wealth rating groups

Intuitively, an association should exist between affluence and largesse. Based on this intuition, there are six separate fields in the final analysis data set that capture some aspect of wealth. The socio-economic field SES is a roll-up of the socio-economic field CLUSTER_CODE. Income group divides individuals into seven income brackets. Median household income and per-capita income are from U.S. Census data aggregated to the census block level. Wealth rating is a field that measures the wealth of an individual relative to others in his or her state.

Analysis Data Definition: Demographic and other overlay data

   MED_HOME_VALUE       Median home value in $100's
   PCT_OWNER_OCCUPIED   Percent owner occupied housing
   URBANICITY           U=urban, C=city, S=suburban, T=town, R=rural, ?=unknown

Another potential discriminator of donation potential is captured in facts about an individual's domicile. Median home value and percent owner occupied data are taken from U.S. Census data. Urbanicity classifies an individual address into one of five urbanization categories.

Analysis Data Definition: Census overlay data

   PCT_MALE_MILITARY      Percent male military in block
   PCT_MALE_VETERANS      Percent male veterans in block
   PCT_VIETNAM_VETERANS   Percent Vietnam veterans in block
   PCT_WWII_VETERANS      Percent WWII veterans in block

The raw modeling data contains almost 300 fields taken from the 1990 U.S. Census. These fields describe the demographic composition of an individual's neighborhood. While vast, the predictive potential of seven-year-old U.S. Census data is limited. Thus, the final analysis data includes only four of these fields.

Analysis Data Definition: Transaction detail data

   NUMBER_PROM_12   Number promotions last 12 mos.
   CARD_PROM_12     Number card promotions last 12 mos.

[Timeline: 1994 to 1998, with the 97NK campaign marked]

All of the individuals included in the modeling population have donated to the veterans organization before. The transaction detail data captures these donations. While most of the fields described thus far are aggregate measures applied to the individual, the information captured in the transaction detail file speaks directly to the behavior of the individual. Therefore, it is perhaps the richest source of information about future donation potential.

The transaction detail data is aggregated over various time spans. The more recent data is shown here. According to the data's documentation, these fields refer to the total number of promotions and card promotions received between March 1996 and March 1997. Because 97NK is itself a card promotion or mailing, separating this count from the overall total will distinguish individuals more responsive to card promotions.

Analysis Data Definition: Transaction detail data

   FREQ_STATUS_97NK      Frequency status, June '97
   RECENCY_STATUS_96NK   Recency status, June '96
   MONTHS_SINCE_LAST     Months since last donation
   LAST_GIFT_AMT         Amount of most recent donation

[Timeline: 1994 to 1998, with the 96NK and 97NK campaigns marked]

The frequency status for the 97NK campaign is defined to be the number of donations received between June of 1995 and June of [year missing in source]. It is coded as 1, 2, 3, or 4, where 4 implies four or more donations.

Recency status as of June 1996 classifies individuals into one of six categories. The categories are defined as follows:

   F   First time donor. Anyone who has made his or her first donation in the last six months and has made only one donation.
   N   New donor. Anyone who has made his or her first donation in the last 12 months and is not a first time donor.
   A   Active donor. Anyone who has made his or her first donation more than 12 months ago and has made a donation in the last 12 months.
   L   Lapsing donor. Anyone who has made his or her last donation between 12 and 24 months ago.
   E   Inactive donor. Anyone who has made his or her last donation more than 24 months ago.
   S   STAR donor. Anyone who has given to three consecutive card mailings.

Months since last donation and last gift amount describe the most recent donation.

In theory, all individuals in the modeling population are lapsing donors as of the 97NK mailing. This implies that no one has made a donation between June 1996 and June 1997. However, for a limited number of cases, the number of months since last gift is fewer than 12. This contradiction is not resolved in the data's documentation, nor will it be resolved here.
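The status codes described above can be read as simple rules. The Python sketch below is one possible reading, not the organization's actual coding logic. The function and argument names are hypothetical, the order in which overlapping rules are checked (STAR first) is an assumption, and so is the treatment of the 12-month and 24-month boundaries as inclusive.

```python
def recency_status(months_since_first, months_since_last, gift_count, star=False):
    """Assign a recency code using one reading of the category definitions."""
    if star:
        return "S"  # STAR donor (assumed to take precedence)
    if months_since_first <= 6 and gift_count == 1:
        return "F"  # first time donor
    if months_since_first <= 12:
        return "N"  # new donor, not a first time donor
    if months_since_last <= 12:
        return "A"  # active donor
    if months_since_last <= 24:
        return "L"  # lapsing donor
    return "E"      # inactive donor


def frequency_status(donation_count):
    """Cap the donation count at 4, so 4 means four or more donations.

    The text lists only the codes 1 through 4, so a count of at least 1
    is assumed here.
    """
    return min(donation_count, 4)
```

For example, a donor whose first gift was 40 months ago and whose last gift was 18 months ago would be coded L under these assumptions.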

Analysis Data Definition: RECENT transaction detail data

   RESPONSE_PROP        Response proportion since June '94
   RESPONSE_COUNT       Response count since June '94
   AVG_GIFT_AMT         Average gift amount since June '94
   RECENT_STAR_STATUS   STAR (1, 0) status since June '94

[Timeline: 1994 to 1998, with the 94NK and 96NK campaigns marked]

Moving further back in time, the next fields describe the donation behavior between June 1994 and June [year missing in source]. Recent response proportion measures the ratio of donations to solicitations. Recent response count counts the total number of responses (donations) in the time frame. Recent average gift amount takes the total dollars donated in the time frame and divides by the number of donations. Recent STAR status indicates whether an individual achieved STAR status between June 1994 and June [year missing in source].

Analysis Data Definition: RECENT transaction detail data

   CARD_RESPONSE_PROP    Response proportion since June '94
   CARD_RESPONSE_COUNT   Response count since June '94
   CARD_AVG_GIFT_AMT     Average gift amount since June '94

[Timeline: 1994 to 1998, with the 94NK and 96NK campaigns marked]

These fields are similar to those on the previous slide, but they describe only card mailings. They are included to distinguish individuals who are more responsive to card promotions.
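As a worked illustration of these summaries, the following Python sketch computes the three recent fields from a solicitation count and a list of gift amounts. It assumes that the response count is the number of donations in the time frame; the function name, argument names, and sample values are hypothetical.

```python
def recent_summary(solicitation_count, gift_amounts):
    """Summarize donations made in response to recent solicitations."""
    response_count = len(gift_amounts)
    # Ratio of donations to solicitations; 0.0 if nothing was mailed.
    response_prop = response_count / solicitation_count if solicitation_count else 0.0
    # Total dollars donated divided by the number of donations.
    avg_gift_amt = sum(gift_amounts) / response_count if response_count else None
    return {
        "RESPONSE_PROP": response_prop,
        "RESPONSE_COUNT": response_count,
        "AVG_GIFT_AMT": avg_gift_amt,
    }


# Eight solicitations, two of which produced a gift.
summary = recent_summary(8, [10.0, 20.0])
```

Applying the same function to card mailings only would give the card-specific versions of the fields.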

Analysis Data Definition: LIFETIME transaction detail data

   PROM           Total number promotions ever
   GIFT_COUNT     Total number donations ever
   AVG_GIFT_AMT   Overall average gift amount
   PEP_STAR       STAR status ever (1=yes, 0=no)

[Timeline: 1994 to 1998, with the 94NK and 96NK campaigns marked]

Analysis Data Definition: LIFETIME transaction detail data

   GIFT_AMOUNT   Total gift amount ever
   GIFT_COUNT    Total number donations ever
   MAX_GIFT      Maximum gift amount
   GIFT_RANGE    Maximum less minimum gift amount

[Timeline: 1994 to 1998, with the 94NK and 96NK campaigns marked]

These variables summarize behavior over the lifetime of the individual's association with the veterans organization. Most are self-explanatory with the exception of the STAR status ever indicator. Analysis shows that there are individuals with recent STAR status who do not have PEP_STAR=1. This could indicate an error in the data or some change in the definition of STAR status in the past.
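The lifetime summaries and the STAR inconsistency mentioned above can be illustrated with a short Python sketch. The function names and the sample gift history are hypothetical, and the check only flags the inconsistency; it does not say which field is correct.

```python
def lifetime_summary(gift_amounts):
    """Compute lifetime gift summaries from a donor's full gift history."""
    if not gift_amounts:
        # Everyone in the modeling population has donated at least once.
        raise ValueError("expected at least one gift")
    return {
        "GIFT_COUNT": len(gift_amounts),
        "GIFT_AMOUNT": sum(gift_amounts),
        "AVG_GIFT_AMT": sum(gift_amounts) / len(gift_amounts),
        "MAX_GIFT": max(gift_amounts),
        "GIFT_RANGE": max(gift_amounts) - min(gift_amounts),
    }


def star_inconsistent(recent_star_status, pep_star):
    """Flag donors with recent STAR status but no STAR status ever."""
    return recent_star_status == 1 and pep_star != 1


lifetime = lifetime_summary([5.0, 20.0, 10.0])
```

A donor flagged by star_inconsistent is one of the cases described in the paragraph above.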

Analysis Data Definition: KDD supplied LIFETIME transaction detail data

   FILE_AVG_GIFT        Average gift from raw data
   FILE_CARD_GIFT       Average card gift raw data
   MONTHS_SINCE_FIRST   First donation date from June '97
   MONTHS_SINCE_LAST    Last donation date from June '97

[Timeline: 1994 to 1998, with the 94NK and 96NK campaigns marked]

Several fields in the raw data were derivable from other fields in the data but were nevertheless included. Curiously, the derived values and the provided values do not always agree. Because it is impossible to determine which are correct, these supplied values were also included in the final analysis data.

Analysis Data Definition: Transaction detail data target definition

   TARGET_B   Response to 97NK solicitation (1=yes, 0=no)
   TARGET_D   Response amount to 97NK solicitation (missing if no response)

[Timeline: 1994 to 1998, with the 97NK campaign marked]

The final two fields are the two most important in the entire analysis data set. They describe the response behavior to the 97NK campaign. The models to be built will attempt to predict their value in the presence of all the other information in the analysis data set.

1.2 Preparing the Tools

In this course, you build predictive models to help decide which lapsing donors are likely to reactivate. Much of this work will be accomplished via SAS Enterprise Miner 5.1, the newest data mining tool from SAS.

[Slide: SAS Enterprise Miner 5.1 Interface. Toolbar, Toolbar Shortcut Buttons, Project Panel, Properties Panel, Diagram Workspace, Help Panel, Status Bar]

The SAS Enterprise Miner 5.1 interface simplifies many common tasks associated with the construction of predictive models. The interface is divided into six interface components:

   Toolbar             The toolbar in SAS Enterprise Miner is a graphical set of node icons and tools that you use to build process flow diagrams in the Diagram Workspace. To display the text name of any node or tool icon, position your mouse pointer over the icon.
   Project panel       Use the Project panel to manage and view data sources, diagrams, results, and project users.
   Properties panel    Use the Properties panel to view and edit the settings of data sources, diagrams, nodes, results, and users.
   Diagram Workspace   Use the Diagram Workspace to build, edit, run, and save process flow diagrams. This is where you graphically build, order, and sequence the nodes that you use to mine your data and generate reports.
   Help panel          The Help panel displays a short description of the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu.
   Status bar          The status bar is a single pane at the bottom of the window that indicates the execution status of an Enterprise Miner task.

[Slide: Enterprise Miner Analytic Processing. The EM 5.1 client PC connects through the EM Java middleware server to the SAS servers (SAS System, SAS Metadata Server) and data sources; the EM 4.3 client PC connects to the SAS servers directly]

All operations of the SAS Enterprise Miner 5.1 Java client pass through the SAS Enterprise Miner Java middleware server. This middleware server connects to two SAS servers: the SAS analytic server and the SAS metadata server. The SAS analytic server is configured in advance to access predefined data sources. In this way, all analytics in SAS Enterprise Miner are conducted within an Internet browser environment.

The non-Java client in SAS Enterprise Miner (Release 4.3) skips the middleware server and connects directly to the analytic and metadata servers. Unlike the 5.1 Java client, the 4.3 client requires a complete installation of SAS on each client machine.

Creating a Java Client Project

Start the Java client in a manner appropriate to your installation of SAS Enterprise Miner. For example, select Start Programs SAS Enterprise Miner EM 5.1 Client. The Start Enterprise Miner window opens.

When starting SAS Enterprise Miner, you can choose either the Personal Workstation or Enterprise Client configuration. Either configuration provides complete SAS Enterprise Miner capabilities. Select Personal Workstation when you are using SAS services that run on your personal computer or laptop computer. Select Enterprise Client when you connect to remote SAS servers using the SAS Enterprise Miner Shared Platform middleware service. This course assumes the Personal Workstation configuration.

Select Start in the Start Enterprise Miner window. After a brief pause, the Welcome to Enterprise Miner startup page opens. This window enables you to create a new project or open an existing project.

Follow these steps to create a new project:

1. Select New Project from the Welcome to Enterprise Miner startup page. As an alternative, you can select File New Project from the main menu.

2. The Create New Project window opens. Specify the project name in the Name field. The default path is typically configured through the SAS Management Console during installation. The administrator determines the paths to which you can have access. If a default project is not provided, type in the name and path location where you want to store the project. For example, type
   a. PVA for the name
   b. c:\workshop\winsas\pmem for the path.

3. Select the Start-Up Code tab and define a SAS LIBNAME statement that points to where the donor SAS training data sets reside.

      libname mydata 'c:\workshop\winsas\pmem';

4. Select the OK button to create the project. The Enterprise Miner - PVA window opens.

Other Common Useful Tasks

Select the Help Topics item from the Help menu or, alternatively, press F1. You are encouraged to read through the help topics, which cover many of the remaining tasks in more detail.

Many of the sample data sources used in the online help can be created by selecting Generate Sample Data Sources from the Help menu. The metadata for these tables is already predefined.

Select Preferences from the Options drop-down menu to set the GUI appearance and specify model results package options.

You can also use the View drop-down menu items to open the Program Editor, Log, Output, and Graph windows as well as open SAS tables and set the Property Sheet to show the advanced options.

Creating a SAS Data Source

Predictive models are built from previously observed data. SAS Enterprise Miner accesses this data using a SAS data source. It is important to note that data sources are not the actual training data, but instead are the metadata that defines the source data. The source data itself must reside in an allocated library. You have already allocated a libname to the donor data source as part of the start-up code for the project.

1. Right-click on the Data Sources folder in the Project Navigator (or select File New Data Sources) to open the Data Source Wizard.

2. Because the donor table resides in a SAS table, select the Next button in the Metadata Source window. Note that other products such as SAS/ETL Studio can be used to build and, in turn, register tables to the metadata server for retrieval in SAS Enterprise Miner.

3. Select the Browse button to open the Select a SAS Table window.

4. Select the SAS library you assigned in the start-up code, and then select the PVA_RAW_DATA SAS table.

5. Click OK and then select the Next button. The Data Table Properties sheet opens. There are 50 variables and 19,372 observations.

1.3 Constructing a Predictive Model

[Slide: Predicting the Unknown. Input Measurements lead to an Expected Target Value]

[Slide: Measurement Scales. Interval: 20.00, 12.50, 5.00, ... Ordinal: Lower, Middle, Upper. Nominal: CA, GA, NY, TX, ... Binary: F, M]

The fundamental problem in prediction is the correct determination of an unknown quantity in the presence of supplementary facts. In this course, the unknown quantity is called a target and the supplementary facts are called inputs. Variables in a data set assume one of these two model roles.

The inputs and target typically represent measurements of an observable phenomenon. The measurements found in the input and target variables are recorded on one of several measurement scales. SAS Enterprise Miner recognizes the following measurement scales for the purposes of model construction [1]:

   Interval measurements are quantitative values permitting certain simple arithmetic or logarithmic transformations (for example, monetary amounts).
   Ordinal measurements are qualitative attributes having an inherent order (for example, income group).
   Nominal measurements are qualitative attributes lacking an inherent order (for example, state or province).
   Binary measurements are qualitative attributes with only two levels (for example, gender).

To solve the fundamental problem in prediction, a mathematical relationship between the inputs and the target is constructed. This mathematical relationship is known as a predictive model. After it is established, the predictive model can be used to produce an estimate of an unknown target value given a set of input measurements.

The model role and measurement scale are examples of metadata. Variables in a data set must have metadata defined for use by SAS Enterprise Miner.

[1] Additional measurement scale categories are commonly found in the scientific literature. See the SAS Enterprise Miner online help under the heading Predictive Modeling.

Defining Modeling Metadata

Metadata is data about data sets. Some metadata, such as field name, are stored with the data. Other metadata, such as how a particular variable in a data set should be used in a predictive model, must be manually specified. Defining modeling metadata is the process of establishing relevant facts about the data set prior to model construction.

1. Click the Next button to apply advisor options. Two options are available:
   a. Basic: Use the Basic option when you already know the variable roles and measurement levels. The initial role and level are based on the variable type and format values.
   b. Advanced: Use the Advanced option when you want SAS Enterprise Miner to automatically set the variable roles and measurement levels. Automatic initial roles and level values are based on the variable type, the variable format, and the number of distinct values contained in the variable.
   Select Advanced.

2. Select the Customize button to view additional variable rules you can impose.

   Selecting each rule provides a description of the rule. For example, Missing Percentage Threshold specifies the percentage of missing values required for a variable's modeling role to be set to Rejected.

3. Select OK to use the defaults for this example.

4. Select Next in the Apply Advisor window to generate the metadata and open the columns metadata.

5. Click on the Names column header to sort the variables alphabetically. Note that CLUSTER_CODE and CONTROL_NUMBER are set to Rejected because they exceed the maximum class count threshold.

6. Redefine these variable roles.
   a. Set the CONTROL_NUMBER Role to ID.
   b. Select the Level column header to sort the variables by level.

   c. Change the level of the following nominal variables to Interval:
         INCOME_GROUP
         CARD_PROM_12
         FREQUENCY_STATUS_97NK
         RECENT_RESPONSE_COUNT
         RECENT_CARD_RESPONSE_COUNT
         WEALTH_RATING
      You can use the Show code button to write SAS code to conditionally assign variable attributes. This is especially useful when you want to apply a metadata rule to several variables. For example:

         if role eq "INPUT" and level eq "NOMINAL" and type eq "N" then level = "INTERVAL";

   d. Set the TARGET_D Role to Rejected. Note that Enterprise Miner correctly identified TARGET_D and TARGET_B as targets because their names start with the prefix TARGET.

7. View the distribution of TARGET_B.
   a. Select the TARGET_B variable and then select Explore. By default, 10,000 observations are used to generate exploratory plots. In the Sample Properties window, set the fetch size to Max and then click Apply. The plot is now generated from all 19,372 observations in the donor data source.

   b. Select the bar for the donors (1's). The donors are highlighted in the data table. To display a tool tip indicating the number of donors, place your cursor over this bar.
   c. Close the Explore window.

You now finalize your metadata assignments.

1. Select Next to open the Decision Processing window. For now, forgo decision processing.
2. Select Next to open the Data Source Attributes window. You can use the Role drop-down menu to set other roles such as TRAIN and SCORE, but for this example leave the role as RAW.

3. Select Finish to add the donor table to the Data Sources folder of the Project Navigator.

The data source can be used in other diagrams. You can also define global data sources that can be used across multiple projects. (See the online documentation for instructions on how to do this.)

Expand the Data Sources folder. Select the PVA_RAW_DATA data source and notice that the property sheet now shows properties for this data source.

[Slide: The Fundamental Problem of Prediction. Training data: previously observed cases.]

Construction of predictive models requires training data, a set of previously observed input and target measurements, or cases. The cases of the training data are assumed to be representative of future (unobserved) input and target measurements.[2]

An extremely simplistic predictive model assumes all possible input and target combinations are recorded in the training data. Given a set of input measurements, you need only to scan the training data for identical measurements and note the corresponding target measurement.

Often in a real set of training data, a particular set of inputs corresponds to a range of target measurements. Because of this noise, predictive models usually provide the expected (average) value of the target for a given set of input measurements. With a qualitative target (ordinal, nominal, or binary), the expected target value may be interpreted as the probability of each qualitative level. Both situations suggest that there are limits to the accuracy achievable by any predictive model.

Usually, a given set of input measurements does not yield an exact match in the training data. How you compensate for this fact distinguishes various predictive modeling methods. Perhaps the most intuitive way to predict cases lacking an exact match in the training data is to look for a nearly matching case and note the corresponding target measurement. This is the philosophy behind nearest-neighbor prediction and other local smoothing methods.

[2] In statistical terms, all cases in the training data set are assumed to be independent (that is, the measurements in one case were not affected by the measurements of one or more other cases) and the underlying distribution of the inputs and targets is stationary (not changing in time). The failure of either assumption results in poor predictive performance.

[Slide: Nearest Neighbor Prediction. Axes: Input 1, Input 2. Shows the training data and the nearest neighbor decision boundary.]

Nearest neighbor prediction (classification) has a long history in the statistical literature, starting in the early 1950s. However, you could argue that its philosophical roots date back (at least) to the taxonomists of the 19th century. In its simplest form, the predicted target value equals the target value of the nearest training data case. You can envision this process as partitioning the input space, the set of all possible input measurements, into cells of distinct target values. The edge of these cells, where the predicted value changes, is known as the decision boundary. A nearest neighbor model has a very complex decision boundary.

[Slide: Generalization. Training data: accuracy = 100%. Validation data: accuracy = 63%.]

A model is only as good as its ability to generalize to new cases. While nearest neighbor prediction perfectly predicts training data cases, performance on new cases (validation data) can be substantially worse. This is especially apparent when the data are noisy (every small region of the input space contains cases with several distinct target values). In the slide above, the true value of a validation data case is indicated by dot color. Any case whose nearest neighbor has a different color is incorrectly predicted, indicated by a red circle surrounding the case.
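The gap between training and validation accuracy can be reproduced with a few lines of code. The following is a minimal sketch in Python, not SAS Enterprise Miner code; the function names and the toy data are hypothetical and chosen only to illustrate the idea.

```python
import math

def nearest_neighbor_predict(train, query):
    """Return the target of the training case closest to `query`."""
    nearest = min(train, key=lambda case: math.dist(case[0], query))
    return nearest[1]

def accuracy(train, cases):
    """Fraction of `cases` whose nearest training neighbor has the same target."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in cases)
    return hits / len(cases)

# Hypothetical toy data: ((input 1, input 2), target)
train = [((0.1, 0.2), 0), ((0.2, 0.8), 0), ((0.4, 0.5), 1),
         ((0.7, 0.3), 1), ((0.8, 0.9), 1), ((0.9, 0.1), 0)]
validation = [((0.15, 0.30), 0), ((0.45, 0.45), 0),
              ((0.75, 0.80), 1), ((0.85, 0.20), 1)]

# Every training case is its own nearest neighbor, so training accuracy is perfect.
train_accuracy = accuracy(train, train)
# New cases can fall next to a training case with a different target.
validation_accuracy = accuracy(train, validation)
```

On this toy data the training accuracy is perfect by construction, while two of the four validation cases sit closest to a training case with the opposite target.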

[Slide: Tuning a Predictive Model. Training and validation accuracy (60% to 100%) plotted against neighborhood size.]

Most predictive modeling methods possess tuning mechanisms to improve generalization. One way to tune a nearest neighbor model is to change the number of training data cases used to make a prediction. Instead of using the target value of the nearest training case, the predicted target is taken to be the average of the target values of the k nearest training cases. This interpolation makes the model much less sensitive to noise and typically improves generalization.

In general, models are tuned to match the specific signal and noise characteristics of a given prediction problem. When there is a strong signal and little noise, highly sensitive models can be built with complex decision boundaries. Where there is a weak signal and high noise, less sensitive models with simple decision boundaries are appropriate. In SAS Enterprise Miner, monitoring model performance on validation data usually determines the appropriate tuning value.

[Slide: Curse of Dimensionality. Training data with an additional extraneous input.]
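Tuning the neighborhood size on validation data can be sketched as follows. This is an illustrative Python example under stated assumptions: the one-input toy data (with two deliberately noisy training cases), the 0.5 cutoff, and the function names are all hypothetical, and the code does not reflect how SAS Enterprise Miner implements the method.

```python
import math

def knn_predict(train, query, k):
    """Average the targets of the k training cases nearest to `query`."""
    neighbors = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    return sum(target for _, target in neighbors) / k

def accuracy(train, cases, k):
    """Share of cases whose averaged prediction, cut at 0.5, matches the target."""
    hits = sum((knn_predict(train, x, k) > 0.5) == bool(y) for x, y in cases)
    return hits / len(cases)

def best_k(train, validation, candidates):
    """Pick the neighborhood size with the highest validation accuracy."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))

# Hypothetical data: low inputs are mostly 0, high inputs mostly 1,
# with one noisy case on each side (0.2 -> 1 and 0.8 -> 0).
train = [((0.0,), 0), ((0.1,), 0), ((0.2,), 1), ((0.3,), 0), ((0.4,), 0),
         ((0.6,), 1), ((0.7,), 1), ((0.8,), 0), ((0.9,), 1), ((1.0,), 1)]
validation = [((0.18,), 0), ((0.22,), 0), ((0.05,), 0),
              ((0.78,), 1), ((0.82,), 1), ((0.95,), 1)]

train_accuracy_k1 = accuracy(train, train, 1)
validation_accuracy_k1 = accuracy(train, validation, 1)
validation_accuracy_k3 = accuracy(train, validation, 3)
chosen_k = best_k(train, validation, [1, 3, 5])
```

With k = 1 the model memorizes the two noisy training cases and mispredicts the validation cases next to them; averaging over three neighbors smooths the noise away, so validation accuracy selects the larger neighborhood.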

Another way to tune predictive models is by choosing appropriate inputs for the model. This choice is critical. Including extraneous inputs (that is, those unrelated to the target) can devastate model performance. This phenomenon, known as the curse of dimensionality, is the general observation that the complexity of a data set increases with dimension. Cases that are nearest neighbors in two dimensions need not be nearest neighbors in three dimensions. When only two of the three dimensions are related to the target, this can degrade the performance of the nearest neighbor model.

[Slide: Nearest Neighbors? Training data plotted with extraneous inputs.]

As the number of extraneous inputs increases, the problem becomes worse. Indeed, in high dimensions, the concept of nearest becomes quite distorted. Suppose there are 1000 cases scattered randomly but uniformly on the range 0 to 1 of 10 independent inputs. On any one input, an interval of length ½ centered at ½ contains about 500 cases. Now take any pair of inputs. How many of the cases are in the center half of both inputs? If the inputs are independent, as assumed, the answer is about 250. For three inputs, there are about 125, and so on. Perhaps surprisingly, with 10 inputs, only about 1 case out of 1000 is simultaneously in the center half of all inputs. Put another way, 99.9% of the cases are on the outer edges of this 10-dimensional input space.

To maintain some sense of nearness in high dimensions requires a tremendous increase in the number of training cases. For example, a square region containing about 10 of 1000 cases in two dimensions will have sides of length 1/10. To get about 10 cases in a region with sides of length 1/10 in 10 dimensions (assuming a uniform distribution) requires, on the average, more than 1,000,000,000 cases!
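The arithmetic behind these counts is a repeated multiplication, which a short Python sketch makes explicit. The function names are hypothetical; the inputs are assumed independent and uniform on 0 to 1, as in the text.

```python
def expected_cases_in_center(n_cases, n_inputs, width=0.5):
    """Expected number of cases inside a centered interval of `width` on every input."""
    return n_cases * width ** n_inputs

def cases_needed(target_count, side, n_inputs):
    """Cases required so that a cube with the given side holds `target_count` on average."""
    return target_count / side ** n_inputs

# 1000 cases: center-half counts for 1, 2, 3, and 10 inputs.
center_counts = {d: expected_cases_in_center(1000, d) for d in (1, 2, 3, 10)}

# About 10 cases in a region with sides of length 1/10.
needed_2d = cases_needed(10, 0.1, 2)
needed_10d = cases_needed(10, 0.1, 10)
```

Under these assumptions the 10-input case works out to roughly 10 / (1/10)^10 = 10^11 cases, which is consistent with the "more than 1,000,000,000" stated above.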

[Slide: Falling Under the Curse. Training and validation accuracy (60% to 100%) plotted against neighborhood size.]

With two relevant and eight extraneous inputs, the accuracy of the nearest neighbor algorithm decreases, even on the training data. Matters are worse in a typical prediction problem: for every relevant input there may be dozens of extraneous ones. This devastates the performance of nearest neighbor methods and raises the question of how to proceed.

[Slide: Breaking the Curse. Predictive algorithms (examples: trees, MBRs, rule inductions) and parametric models (examples: regressions, neural networks, SVMs).]

To overcome the curse of dimensionality, you must use predictive modeling techniques that capture general trends in the data while ignoring extraneous information. To do this, the focus must shift from individual cases in the training data to the general pattern they create.

Two approaches are widely used to overcome the curse of dimensionality. Predictive algorithms employ simple heuristic rules to reduce dimension. Parametric models are constrained to limit overgeneralization. While this classification is used to group predictive models for the purposes of this course, the distinction is somewhat artificial. Predictive algorithms often utilize parametric models; parametric models often employ predictive algorithms.

[Slide: Decision Rule. Create models to extol the obvious and ignore the extraneous. Example: simple decision rule, accuracy = 73%.]

An example of a predictive algorithm is a simple decision rule. In the example above, a single partition of the input space can lead to a surprisingly accurate prediction. This partition takes advantage of the clustering of solid cases on the right half of the original input space. It isolates cases with like-valued targets in each part of the partition.

[Slide: Recursive Partitioning. Training and validation accuracy (50% to 100%) plotted against partition count.]

It is not hard to devise techniques to search for and isolate cases with like-valued targets (and many researchers from many independent disciplines have). The common element of these techniques is the recursive partitioning of the input space. Partitions of the training data, based on the values of a single input, are considered. The worth of a partition is measured by how well it isolates distinct groups of target values. The input/partition combination with the highest worth is selected and the training data is accordingly split. The process continues by further subdividing each resulting split group. Ultimately, the satisfaction of certain stopping conditions terminates the process.
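One step of this search can be sketched in Python. The text does not say which worth measure is used, so this example assumes the reduction in Gini impurity; that choice, the function names, and the toy data are illustrative assumptions rather than the method used by the Tree tool. Recursive partitioning would apply the same search again to each side of the selected split.

```python
def gini(targets):
    """Gini impurity of a list of 0/1 targets (0 means a pure group)."""
    if not targets:
        return 0.0
    p = sum(targets) / len(targets)
    return 2 * p * (1 - p)

def best_split(cases):
    """Search every input and cut point for the partition with the highest worth.

    Worth is assumed here to be the drop in Gini impurity from parent to children.
    Returns (worth, input index, cut point).
    """
    parent = gini([y for _, y in cases])
    best = None
    for j in range(len(cases[0][0])):
        values = sorted({x[j] for x, _ in cases})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [y for x, y in cases if x[j] <= cut]
            right = [y for x, y in cases if x[j] > cut]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(cases)
            worth = parent - weighted
            if best is None or worth > best[0]:
                best = (worth, j, cut)
    return best

# Hypothetical data: input 0 is related to the target, input 1 is extraneous.
cases = [((0.1, 0.7), 0), ((0.2, 0.1), 0), ((0.3, 0.9), 0), ((0.4, 0.3), 0),
         ((0.6, 0.8), 1), ((0.7, 0.2), 1), ((0.8, 0.6), 1), ((0.9, 0.4), 1)]

worth, input_index, cut = best_split(cases)
```

On this data the search selects input 0 with a cut at 0.5, because no partition on the extraneous input separates the two target levels as cleanly.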

A predictive model can be built from the partitioning process by averaging targets in each final partition group and assigning this average to every case in the group. The number of times the partitioning process repeats can be thought of as a tuning parameter for the model. Each iteration subdivides the training data further and increases training data accuracy. However, increasing the training data accuracy often diminishes generalization. As with nearest neighbor models, validation data can be used to pick the optimal tuning value.

Recursive partitioning techniques resist the curse of dimensionality by ignoring inputs not associated with the target. If every partition involving a particular input results in partition groups with similar average target values, the calculated worth of these partitions will be small. The particular input is not selected to partition the data, and it is effectively disregarded.

Because they can quickly identify inputs with strong target associations, recursive partitioning methods are ideally suited to the role of initial predictive modeling methodology.

The task that motivates predictive modeling in this course has been outlined in Section 1.1. Lapsing donors have been identified by basic business rules. Some of these donors will be subsequently ignored; some will continue to be solicited for donation. A data set describing the donation response to a mailing (identified as 97NK) will be used to make this decision.

The simplest approach to this problem involves estimating donation propensity from the 97NK data. Individuals with the highest probability of response are selected for continued solicitation. Those with the lowest probability of response are ignored in the future. For now, the amount of response enters into the solicitation decision after the propensity to donate is estimated.

The unknown target for this model is a binary variable, TARGET_B, that indicates donation response to the 97NK mailing. Other variables in the training data provide supplemental facts about each individual.

Building a Predictive Model

The metadata defines 47 inputs for predicting donation propensity. Not all of these inputs will be needed to build a successful predictive model. To build any predictive model, however, you must first create an analysis diagram.

Create a Diagram

1. Right-click the Diagrams folder of the Project Navigator and select the Create Diagram menu item.
2. Enter a diagram name (in this example, PVA) and select OK.
3. Expand the diagram folder to see the open diagram.
4. Select the diagram icon. The property sheet now shows properties for the diagram.
5. Drag and drop the PVA_RAW_DATA data source onto the Diagram Workspace.

Although you will focus on developing one process flow diagram, an advantage of SAS Enterprise Miner is that you can open multiple diagrams at one time. You can also disconnect from and reconnect to a diagram, provided that you have also configured the Java middle tier Enterprise Miner Application Server. Other users can also access the same project. However, only one user can open a diagram at a time. SAS Enterprise Miner for SAS 9.1 is well designed for multitasking and project sharing.

In the absence of prior experience with the donation data, your first modeling goal is to identify which of the 47 inputs are most strongly related to donation propensity. Given its ability to ignore extraneous inputs, a recursive partitioning model is ideally suited to this task.

The Tree tool is the primary recursive partitioning method for Enterprise Miner. It enables you to construct several types of decision tree models.

Grow a Tree

1. Select the Model tab on the toolbar.
2. Drag and drop the Tree tool into the Diagram Workspace, to the right of the Input Data Source node.

Two tasks have been defined in the PVA Analysis: reading the raw data and building a decision tree. For SAS Enterprise Miner to actually perform these tasks, you must specify the order in which to do them. Order is established by connecting the nodes with arrows to form process flow diagrams.

Draw an arrow from the Input Data Source node to the Tree node. This instructs SAS Enterprise Miner to perform the Input Data Source node task first and then perform the Tree node task. To draw the arrow:

1. Move the cursor next to the Input Data Source node until it changes to a pencil.
2. Click and drag a line to the Tree node.
3. Click the Diagram Workspace away from both nodes.

Now that SAS Enterprise Miner knows the order in which to process the nodes, the next task is to actually invoke the process.

1. Right-click the Tree node. A menu of node options appears.
2. Select Run, and then select Yes to run the diagram. SAS Enterprise Miner begins to build a predictive model. Progress through the process flow diagram is indicated by a green square. Upon successful completion of the tree model, a dialog box indicates completion of the task.
3. Select OK to close the dialog.
4. Right-click the Tree node and select Results. The Decision Tree results window opens.

The Results-Tree window summarizes the results of the recursive partitioning model fit by SAS Enterprise Miner. By default, it is partitioned into six subwindows.

1. The Score Rankings (upper left) present a lift chart for the Tree model. Lift charts are discussed in detail in a later section.
2. The Leaf Statistics (upper right) display the percentage of TARGET_B=1 in each leaf. The leaf with the highest percentage of TARGET_B=1 can be located by its leaf index.
3. The Tree Map plot (center left) displays a compact graphical display of the tree with the following properties:
   It is displayed in vertical orientation.
   The nodes are colored by the proportion of a categorical target value or the average of an interval target.
   The node width is proportional to the number of observations in the node.
4. The Fit Statistics (center right) provide a numerical summary of model performance.
5. The Tree window (lower left) shows the standard presentation of a recursive partitioning model. A decision tree contains the following items:
   Root node: the top node of a vertical tree, which contains all observations.
   Internal nodes: non-terminal nodes that contain the splitting rule. This includes the root node.

   Leaf nodes: terminal nodes that contain the final classification for a set of observations.
   A default decision tree has the following properties:
   It is displayed in vertical orientation.
   The nodes are colored by the proportion of a categorical target value or the average of an interval target.
6. The Output window (lower right) displays the SAS output of the Decision Tree run.

The Tree tool in SAS Enterprise Miner takes its name from the usual presentation of recursive partitioning models.

1. Maximize the Tree window.

The tree diagram summarizes the recursive partitioning of the data. Each box is called a node. The top box, or root node, shows non-donations (TARGET_B=0) in 75% of the cases. The first partition (of the entire training data) separates the more frequent donors from the rest. Individuals making two or fewer donations in the previous two years branch left; others branch right. Of those in the left branch, 78.6% are non-donors, compared with a 67.1% non-donor rate in the right branch.

Further partitioning into subgroups occurs. The tree structure presentation shows the inputs and values used to make the partitions as well as the proportion of donation within each subgroup.

To examine some of the extremes in donation propensity, minimize the Tree window and select the leftmost bar in the Leaf Statistics plot. The Tree window is scrolled to display the leaf with the highest response rate.

A node in a decision tree that is not partitioned is called a terminal node, or leaf. The proportion of cases in each target level in a terminal node provides the expected value of the target in a recursive partitioning model. Previous donors with two or fewer recent donations, who possess PEP_STAR status, and with two or fewer gifts over their lifetime have a 96% probability of a donation.

To see the complement to the most likely donors, you will need to search the tree diagram or display the remainder of the Leaf Statistics plot.

Scroll the Leaf Statistics chart to the right using the scrollbar at the top of the plot and select the bar on the extreme right. The Tree window is scrolled to show the leaf with the lowest donation rate.

Donors with more than two recent donations, who have received four or fewer card solicitations in the last 12 months, whose gift amount ranges by less than $10.50, and who have given more than 37 times in the past have a probability of not responding of about 90%.

To better see the overall tree structure, right-click and then select View Fit to Page. The tree is scaled to show all 17 leaf nodes.

Detailed inspection of the entire tree structure reveals subgroups with extremes of donation propensity (all donors or all non-donors). The color-coding of the nodes indicates node purity. A dark node contains cases that nearly all have TARGET_B=0 or nearly all have TARGET_B=1. A light node contains an equal mix of cases with TARGET_B=0 and TARGET_B=1.

It is tempting to concoct stories to explain these extremes based on the partition rules found in the tree. While these stories would be true, they would apply only to the training data used by SAS Enterprise Miner to build this particular Tree model. In general, they would not generalize to the entire population of potential donors.

Tuning a Predictive Model

The previous discussion reveals a flaw in the present modeling approach. Without a set of validation data, it is difficult to assess which of the partitions are meaningful and which are training-data-inspired fantasies. Fortunately, SAS Enterprise Miner provides a convenient way to generate a validation data set.

1. Close the Tree Results window.
2. Drag a Data Partition node (from the Sample tool group) onto the diagram and connect it between the data source and tree nodes.

The Data Partition node breaks a raw data set into components for training, tuning, comparing, and testing predictive models. For this modeling exercise, you need to adjust some of the node's default settings. The settings of any node may be read and changed by selecting the node and examining the Properties panel.

1. Select the Data Partition node. The Properties panel shows the node's settings.

The Data Set Percentage fields show the fraction of the raw data to be used for training, validation, and testing (final performance evaluation). A separate data set will be used later for testing. Therefore, only training and validation data sets are required. In this example, half the data is used for training and half for validation.

1. Type 50 in the Train and Validation fields.
2. Type 0 in the Test field.

Automatic Stratification

1. Select the button next to the Variables property. The Variables-Part window opens.
2. Scroll to the bottom of the list. Note that the Partition field for TARGET_B is set to Stratification. A stratification variable forces SAS Enterprise Miner to balance category levels across the training, validation, and test sets.
3. Close the Variables-Part window.

The diagram is ready to be run again. Instead of using the entire raw data set for training, SAS Enterprise Miner now reserves half the data for model tuning and assessment.

1. Right-click the Tree node and select Run. Because no changes were made to the Input Data Source node, processing will commence at the Data Partition node.
2. View the results when the run is complete.
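The effect of a stratified 50/50 partition can be illustrated outside the tool. The following Python sketch is an assumption-laden illustration, not the Data Partition node's actual algorithm: the function name, the seed, and the 400-case toy data (25% donors, mirroring the raw data's proportion) are hypothetical.

```python
import random

def stratified_partition(cases, target_of, train_fraction=0.5, seed=12345):
    """Split cases into training and validation sets, level by level,
    so that each target level appears in the same proportion in both sets."""
    rng = random.Random(seed)
    by_level = {}
    for case in cases:
        by_level.setdefault(target_of(case), []).append(case)
    train, validation = [], []
    for level_cases in by_level.values():
        shuffled = level_cases[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return train, validation

# Hypothetical data: 400 cases, one donor (TARGET_B=1) in every four.
cases = [{"id": i, "TARGET_B": 1 if i % 4 == 0 else 0} for i in range(400)]
train, validation = stratified_partition(cases, lambda c: c["TARGET_B"])

train_donors = sum(c["TARGET_B"] for c in train)
validation_donors = sum(c["TARGET_B"] for c in validation)
```

Because the split is made within each target level, both halves keep the same donor proportion regardless of how the shuffle falls.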

Several changes to the Results-Decision Tree window are apparent. First and most obvious, the tree is much simpler (6 leaves instead of 17). The Score Rankings, Leaf Statistics plots, and the Fit Statistics table now include information for the train and validation data sets. Note that the Score Rankings plot exhibits typical behavior: cumulative lift calculated on training data is higher than cumulative lift calculated on validation data.

Maximize the Tree window and fit the tree to the window.

Inspection of the six leaves reveals donation proportions ranging from less than 20% to more than 80% in the training data. The Leaf Statistics plot shows similar donation propensities are also observed in the validation data.

Examine the Fit Statistics window. The model isolates donors from non-donors and shows a slightly lower misclassification rate than no model.

The astute student, however, could pose two objections to the present state of affairs:

Even the least generous subgroup seems to have an unrealistically high donation propensity. With such a high donation probability, it might not be worth building a model; simply solicit everyone!

While it is true that the selected six-leaf model has a lower misclassification rate than no model, in absolute terms this improvement is minimal. From the Fit Statistics window, the validation misclassification rate is 24.86%. No model has a misclassification rate of 25.00%.

Both objections are reasonable. The first is simply an artifact of the commonly used predictive modeling practice called separate sampling. The second correctly observes that predictive accuracy is not necessarily the best measure of a model's worth.

1.4 Adjusting Predictions

[Slide: Separate Sampling. Benefits: helps detect rare target levels; speeds processing. Risks: biases predictions (correctable); increases prediction variability.]

In many predictive modeling problems, the target level of interest occurs rarely relative to other target levels. For example, in a data set of 100,000 cases, only 1,000 might contain an interesting value of the target.

A widespread predictive modeling practice in this situation creates a training data set with cases sampled separately from each target level. When the number of interesting and rare cases is fixed, all such cases are selected. Then, separately, a simple random sample of common cases is added. The size of the common sample should be at least as large as the rare sample and is frequently several times larger. The basic idea is that a model built from training data with three or four times as many common cases as rare cases is just as predictive as one built from training data with 30 to 40 times as many common cases.

Often this practice can help predictive models better detect the rare levels, especially when the total number of target levels exceeds two. When separately sampled, more weight is given to the rare cases during model construction, increasing a model's sensitivity to the rare level. It also places fewer demands on the computer used to build the models, speeding model processing.

Unfortunately, the predictions made by any model fit with separately sampled training data are biased. As seen on the next slide, this is easily corrected. More troublesome is the increased variability of the models built from the separately sampled data. The more flexible the model, the worse this problem can be.

On the other hand, it can be unwise to use highly flexible models in the presence of a rare target level; the effective sample size is much closer to the number of cases with the rare level than it is to the total number of available cases. Flexible models built from actually or effectively small training samples typically show poor generalization. Further discussion of the consequences of separate sampling is deferred to Chapter 3.

[Slide: Adjusting Predictions. Sample, population, and prediction.]

Model predictions are easily adjusted to compensate for separate sampling. The only requirement is prior knowledge of the proportions of the target levels in the population from which the cases are drawn.

Within a given target level, each case in the training data corresponds to a certain number of cases in the population. Predictions about the population can be obtained from predictions based on the training data by adjusting for this correspondence.

For example, consider a target with two levels, 0 and 1. The proportion of cases with target level 1 in the actual population can be obtained from the proportion of cases with target level 1 in the training sample using the formula

$$p_1 = \frac{\tilde{p}_1\,(\pi_1/\rho_1)}{\tilde{p}_0\,(\pi_0/\rho_0) + \tilde{p}_1\,(\pi_1/\rho_1)}$$

where

$p_1$ is the population proportion of level 1.
$\tilde{p}_0$ and $\tilde{p}_1$ are the training proportions of levels 0 and 1, respectively.
$\pi_0$ and $\pi_1$ are the overall proportions of levels 0 and 1 in the population (called the population prior).
$\rho_0$ and $\rho_1$ are the overall proportions of levels 0 and 1 in the training data.
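The adjustment is simple enough to check by hand. The following Python sketch applies the formula to two illustrative training proportions; the function name is hypothetical, and the 5% population prior and 25% training proportion are assumed values chosen for illustration.

```python
def adjust_for_priors(p1_train, pi1, rho1):
    """Convert a training-sample proportion of level 1 into a population proportion.

    p1_train: proportion of level 1 predicted from the separately sampled data
    pi1:      overall proportion of level 1 in the population (the prior)
    rho1:     overall proportion of level 1 in the training data
    """
    pi0, rho0 = 1 - pi1, 1 - rho1
    p0_train = 1 - p1_train
    numerator = p1_train * (pi1 / rho1)
    return numerator / (p0_train * (pi0 / rho0) + numerator)

# A group that matches the overall training proportion maps back to the prior.
overall = adjust_for_priors(0.25, pi1=0.05, rho1=0.25)

# A group with a 50% training proportion is far less extreme in the population.
leaf = adjust_for_priors(0.50, pi1=0.05, rho1=0.25)
```

Under these assumed values, a 25% training proportion adjusts to 5%, and a 50% training proportion adjusts to 3/22, or roughly 13.6%.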

Specifying Population Priors

The 97NK raw data has a 25% overall donor proportion. This was achieved by separately sampling the 95,412-case donor population data. First, all 4,843 cases (about 5%) with TARGET_B=1 were selected. Then, for each case with TARGET_B=1, three cases with TARGET_B=0 were randomly chosen from the population data. This resulted in a raw analysis data set with 19,372 cases.

The probability estimates in the decision tree were based on the separately sampled training data. Given prior knowledge of the overall population donor proportion, SAS Enterprise Miner can adjust these estimates to reflect the true population.

Decision Processing Window

Specification of this prior knowledge is made in the Decision Processing window.

1. Close any open results window.
2. Select Decisions in the PVA_RAW_DATA data source properties panel. The Decision Processing confirmation window opens.
3. Select Yes. The Decision Processing PVA_RAW_DATA window opens.

SAS Enterprise Miner uses target profiles to store target metadata, such as the population, or prior, target level proportions. A target profile is keyed to a data set and a target variable name.

Edit the prior information for TARGET_B in the PVA_RAW_DATA data source.

4. Select the Prior Probabilities tab. Currently no prior is selected.
5. Select Yes. The Adjusted Prior column is added to the Decision Processing window.
6. Set the adjusted prior probability for TARGET_B=1 to the population donor proportion (about 5%).
7. Set the adjusted prior probability for TARGET_B=0 to the complementary non-donor proportion.

8. Select OK to close the Decision Processing window.

SAS Enterprise Miner now adjusts all model predictions to conform to the specified prior. You should expect to see donation proportions on the order of 5% instead of 25%.

Reseeded Trees

1. Run the Tree node and view the results. The results report looks considerably different.

2. Maximize the Tree window.

Where is the tree? SAS Enterprise Miner has found that the tree that best describes the data is a tree with a single leaf. Each case is assigned the same probability of response (equal to 5%).

The flaw lies in the statistic used to evaluate the worth of a given model. By default, the Tree node and most other modeling nodes gauge a model's fit by its accuracy (adjusted for the specified prior probability). In this case, an accuracy of 95% may be achieved by not discriminating between the cases: simply assume everyone is a non-donor.

To more accurately segment the donor population, information describing the consequences of a solicitation decision is required. These consequences may be expressed in monetary terms by examining the value of TARGET_D, the donation amount. However, because the value of TARGET_D is not known until after solicitation, it is impossible to use directly in the solicitation decision. The obvious alternative is to create an estimate for TARGET_D before solicitation and make solicitation decisions based on a combination of the estimated donation probability and estimated donation amount. In this course, donation probability will be estimated using a predictive model and donation amount will be estimated by averaging TARGET_D across all donors.
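Why a single leaf wins on accuracy can be seen from a small calculation. In this Python sketch the function name and the 60%/70% detection rates of the competing model are hypothetical; only the 5% donor prior comes from the discussion above.

```python
def prior_weighted_accuracy(donor_hit_rate, non_donor_hit_rate, donor_prior):
    """Overall accuracy when each target level is weighted by its population prior."""
    return donor_prior * donor_hit_rate + (1 - donor_prior) * non_donor_hit_rate

# Single-leaf tree: calls everyone a non-donor, so it misses every donor
# and is right about every non-donor.
single_leaf = prior_weighted_accuracy(0.0, 1.0, donor_prior=0.05)

# Hypothetical model that finds 60% of donors but is right on only 70% of non-donors.
discriminating = prior_weighted_accuracy(0.60, 0.70, donor_prior=0.05)
```

Measured by accuracy alone, the model that actually separates donors from non-donors loses to the single leaf, which is why a profit-based criterion is needed.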

1.5 Making Optimal Decisions

Predictive models are most valuable when they are used to decide a course of action. This requires quantifying the decision consequences in terms of a profit or cost.

Given the true value of the target for a case, it is usually straightforward to evaluate the profit resulting from a decision. For example, soliciting a certain donor who always contributes $5 results in a net profit of $5 minus the package cost. Ignoring this individual results in zero profit. Similarly, soliciting a donor who always ignores such offers results in a net loss equal to the package cost. Because of the certainty of these two cases, deciding the best course of action is obvious.

Unfortunately, donation is almost never a certainty. (If it were, there would be no need to build a predictive model.) For this reason, the notion of statistical decision theory must be introduced.

Defining Decision Profits

To properly tune and assess the Decision Tree model (or any predictive model using the default setting of SAS Enterprise Miner), you must correctly define a decision profit matrix. In the PVA model, two decisions are possible: solicit for donation and ignore. The decision to solicit results in a profit equal to the donation amount less the package cost ($0.68) for TARGET_B=1 and a loss equal to $0.68 for TARGET_B=0. The decision to ignore results in zero profit, regardless of the true target value.

The variable TARGET_D records the donation amounts for those who responded to the 97NK campaign. Unfortunately, the value of TARGET_D will be unknown when deciding the appropriate action for a case. However, like donation propensity, it can be predicted from the training data. Because the current focus is the donation propensity model, construction of a sophisticated donation amount model is deferred to Chapter 2. For now, the simplest of donation amount models will suffice: the expected or average value of TARGET_D where TARGET_B=1.

From the training data, the average value of TARGET_D is approximately $15.30. This can be determined using the StatExplore node.

1. Drag a StatExplore node onto the diagram and connect it to the Data Partition node.
2. Select the StatExplore node and open the Variables window from the Properties panel.
3. Set the Use status of all variables to No.

4. Select the TARGET_D variable and set its Use status to Yes.
5. Select OK to close the Variables window.
6. Run the StatExplore node and open the Results window.

The StatExplore node provides basic statistics about selected variables. Here the mean of TARGET_D is seen to be $15.295, which, for simplicity, will be considered to be $15.30.

So given a $0.68 solicitation cost, the net profit for soliciting an actual donor is $14.62 ($15.30 less $0.68). The net profit for soliciting a non-donor is a loss of $0.68. This information can be incorporated into an analysis using the Decision Processing window.

1. Close the StatExplore Results window.
2. Select the PVA_RAW_DATA data source node.
3. Select Decisions. The Decision Processing window opens.
4. Select the Decisions tab.
5. Type 14.62 in the upper-left cell of the decision matrix.
6. Type -0.68 in the lower-left cell of the decision matrix.
7. Type 0 in the lower-right cell of the decision matrix. The completed changes should appear as shown below.
8. Select OK to close the Decision Processing window.

Recall that you were given the option to specify decision processing when defining the data source. Specifying priors and profits at the outset configures SAS Enterprise Miner to properly evaluate predictive models.

With an appropriate profit matrix defined, refit the tree model.

1. Right-click the Tree node and select Run.

2. View the modeling results and select the Tree window. The Tree is seen to have significantly more nodes.

Use of profit calculations within the Decision Tree node is better understood using Interactive training. Interactive training will be discussed in detail in Chapter 3.

1. Select the Decision Tree node.
2. Select the Interactive button on the Properties panel. The Interactive Tree Browser window opens.

3. Select Options > Node Statistics. The Node Statistics window opens.
4. Select the Predicted Profit check box.
5. Select the Profit/Loss tab.

6. Select Average profit for checked items.
7. Select OK.
8. Select Edit > Apply prior probabilities.

The Tree display is updated to show the expected profit associated with each decision. If Decision 1 (solicit) is the correct decision for the population segment in a node, the displayed profit will be positive. The amount of profit will equal 14.62 times the proportion of TARGET_B=1 in the node less 0.68 times the proportion of TARGET_B=0 in the node. If Decision 0 (ignore) is the correct decision for the segment, the displayed profit will be zero. As will be seen, segments with a sufficiently high response rate will be chosen for continued solicitation, and the others will be subsequently ignored.

Only integer profit amounts are displayed in the node text corresponding to each decision alternative.
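As a rough illustration of the node-level calculation described above, the following Python sketch compares the expected profit of soliciting with the zero profit of ignoring. The profit values follow from the $15.30 average donation and the $0.68 package cost used in this demonstration. The function name and the node response rates are hypothetical, and the sketch does not attempt the prior-probability adjustment applied in the Tree Browser.

```python
# Profit of soliciting a donor: 15.30 average donation less 0.68 package cost.
PROFIT_SOLICIT_DONOR = 14.62
# Profit of soliciting a non-donor: the package cost is lost.
PROFIT_SOLICIT_NONDONOR = -0.68


def node_expected_profit(p_donor):
    """Return (decision, expected profit) for a node whose proportion
    of TARGET_B=1 cases is p_donor. Ignoring always yields zero profit."""
    solicit = (PROFIT_SOLICIT_DONOR * p_donor
               + PROFIT_SOLICIT_NONDONOR * (1.0 - p_donor))
    ignore = 0.0
    if solicit > ignore:
        return "solicit", solicit
    return "ignore", ignore


# Hypothetical node response rates, for illustration only.
for p in (0.02, 0.05, 0.10):
    print(p, node_expected_profit(p))
```

A node is worth soliciting only when its response rate is high enough for the first term to outweigh the second.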

[Slide: Basic Decision Theory. A profit matrix with target levels as rows and decisions as columns. With target level probabilities of 1/4 and 3/4, the expected profit of each decision is computed and the larger is chosen.]

Predictions are typically far from certain. Deciding the best course of action for a case scored by a predictive model requires calculating the expected profit of each decision alternative. Assuming a constant profit \(P_{ld}\) for each target level \(l\), the expected profit for decision alternative \(d\) is given by

\[ E(\text{Profit}_d) = \sum_{l} p_l P_{ld}, \]

where \(p_l\) is the probability of target level \(l\). The optimal decision corresponds to the highest expected profit.

[Slide: Making Optimal Decisions. Plot of E(Profit) for each decision against the target level probability; the two lines cross at the decision threshold.]

With two target levels and the simple profit structure considered here, the expected profits vary linearly with target level probability. At some point on the range [0,1], the expected profits are equal. The probability associated with this point is known as the decision threshold.
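The expected-profit rule and the decision threshold can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, row 0 of the profit matrix is taken to be the primary target level, and the threshold formula covers only the two-level, two-decision case.

```python
def expected_profits(probs, profit):
    """probs[l] is the probability of target level l; profit[l][d] is the
    profit of decision d when the target has level l. Returns E(Profit_d)
    for every decision d."""
    n_decisions = len(profit[0])
    return [sum(p * row[d] for p, row in zip(probs, profit))
            for d in range(n_decisions)]


def best_decision(probs, profit):
    """Column index of the decision with the highest expected profit."""
    ep = expected_profits(probs, profit)
    return max(range(len(ep)), key=ep.__getitem__)


def decision_threshold(profit):
    """Probability of the first target level at which the two expected
    profits are equal (two levels, two decisions)."""
    (a, b), (c, d) = profit
    # p*a + (1-p)*c = p*b + (1-p)*d  =>  p = (d - c) / ((a - b) + (d - c))
    return (d - c) / ((a - b) + (d - c))


# Accuracy-style profit matrix: 1 for a correct decision, 0 otherwise.
accuracy = [[1, 0], [0, 1]]
print(expected_profits([0.25, 0.75], accuracy))
print(best_decision([0.25, 0.75], accuracy))
print(decision_threshold(accuracy))
```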

[Slide: Overall Average Profit. The profit matrix \(P_{ld}\) shown beside the matrix of case counts \(n_{ld}\).]

The worth of a predictive model can be appraised by calculating the overall average profit on a set of validation data. Using the profit structure defined above, the overall average profit equals

\[ \text{Average Profit} = \frac{1}{N} \sum_{l} \sum_{d} n_{ld} P_{ld}, \]

where \(n_{ld}\) is the number of cases subject to decision \(d\) and having level \(l\), and \(N\) is the total number of cases in the validation data.

[Slide: Example: Accuracy Rule. Profit matrix with 1 on the diagonal and 0 elsewhere; the expected profits are \(p_1\) and \(p_0 = 1 - p_1\), and the larger is chosen.]

Accuracy, the most obvious measure of a model's worth, is a special case of the general profit structure defined on the previous slide. Correct decisions are rewarded with a 1-unit profit. Incorrect decisions receive zero profit. The expected profit for a decision simply equals the probability of the corresponding target level. Thus, the best decision for each case corresponds to the most likely target value.

[Slide: Example: Accuracy Rule Profit. Average Profit = \((n_{11} + n_{00}) / N\).]

The overall average profit is the total number of correctly decided cases divided by the total number of cases. This is the definition of accuracy. Therefore, maximizing overall average profit with this simple profit structure is identical to maximizing accuracy.

[Slide: Example: Extreme Decision Rules. Average Profit = \(n_{00} / N = \pi_0\), shown against the distribution of predicted target values.]

Note that, for the accuracy profit structure, the decision threshold occurs at 0.5. As you saw in the previous demonstration, such a large threshold can result in the same decision for most cases, especially when the probability of one target level is small. This is an example of an extreme decision rule. The overall average profit for extreme decision rules is completely determined by the prior target level probabilities. Thus, the predictive model provides no information about the association between the inputs and target. This is true even if the model is correctly specified and correctly estimated. The problem lies not with the model, but with the decision rule. In short, when one target level is rare, predictive accuracy is an inappropriate model appraisal method.
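The overall average profit can be computed directly from a table of decision counts. The sketch below is illustrative only: the function name and all counts are hypothetical, with rows indexed by target level (level 1 first) and columns by decision (decision 1 first).

```python
def overall_average_profit(counts, profit):
    """counts[l][d] is the number of validation cases with target level l
    that received decision d; profit[l][d] is the matching profit."""
    total_cases = sum(sum(row) for row in counts)
    total_profit = sum(n * p
                       for count_row, profit_row in zip(counts, profit)
                       for n, p in zip(count_row, profit_row))
    return total_profit / total_cases


accuracy = [[1, 0], [0, 1]]          # 1 for a correct decision, 0 otherwise

# Hypothetical counts: some cases assigned to each decision.
mixed = [[30, 20], [100, 850]]
print(overall_average_profit(mixed, accuracy))      # share of correct decisions

# Hypothetical extreme rule: every case receives decision 0.
extreme = [[0, 50], [0, 950]]
print(overall_average_profit(extreme, accuracy))    # equals the share of level 0
```

In the second call the result depends only on how common the second target level is, which is the sense in which an extreme rule ignores the model.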

[Slide: Example: Conforming Decision Rules. One decision earns 3 for target level 1 and -1 for target level 0; the other earns 0. Choose the larger of \(3p_1 - p_0\) and 0. Plot of E(Profit) against \(p_1\) over the distribution of predicted target values.]

For predictive models to be interesting and useful, the decision threshold should be similar in value to the predicted probabilities for the primary target level. When this is the case, the profit structure defines a conforming decision rule. With a conforming decision rule, each decision alternative pertains to some cases.

When an accurate profit structure is not known, it is better to evaluate a model with a conforming decision rule than a potentially extreme decision rule like accuracy. A diagonal profit matrix with \(1/\pi_l\), the inverse of the prior proportion for each target level, on the main diagonal usually assigns some cases to each decision alternative. For a two-level target, this diagonal profit matrix yields a decision threshold equal to the population prior.

[Slide: Example: Conforming Rule Profit. Average Profit = \((3 n_{11} - n_{01}) / N\), with E(Profit) plotted against \(p_1\) over the distribution of predicted target values.]

For a conforming decision rule, some cases are assigned to each decision alternative. When one of the columns of the profit matrix is entirely zeros, only the cases with predicted probabilities in excess of the decision threshold contribute to the overall average profit. The remaining cases are effectively ignored.
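The claim that the inverse-prior diagonal profit matrix places the decision threshold at the population prior can be checked with a short derivation. It assumes two target levels with priors \(\pi_1\) and \(\pi_0 = 1 - \pi_1\) and predicted probabilities \(p_1\) and \(p_0 = 1 - p_1\).

```latex
% Expected profit of each decision under the diagonal profit matrix
% with 1/\pi_1 and 1/\pi_0 on the main diagonal:
\[
E(\text{Profit}_1) = \frac{p_1}{\pi_1}, \qquad
E(\text{Profit}_0) = \frac{1 - p_1}{\pi_0}.
\]
% The decision threshold is the value of p_1 at which the two are equal:
\[
\frac{p_1}{\pi_1} = \frac{1 - p_1}{\pi_0}
\;\Longrightarrow\; p_1 \pi_0 = \pi_1 - p_1 \pi_1
\;\Longrightarrow\; p_1 (\pi_0 + \pi_1) = \pi_1
\;\Longrightarrow\; p_1 = \pi_1 .
\]
```

The last step uses \(\pi_0 + \pi_1 = 1\), so cases with a predicted probability above the prior are assigned to one decision and the rest to the other.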

[Slide: Example: Extreme Decision Rules. One decision earns 99 for target level 1 and -1 for target level 0; the other earns 0. Choose the larger of \(99p_1 - p_0\) and 0. Plot of E(Profit) against \(p_1\) over the distribution of predicted target values.]

It is possible to have an extreme decision rule even when the decision profits are correctly specified. For example, if the average donation amount is much larger than the package cost, the resulting decision rule might also be extreme. In this situation, a single donation pays for many solicitations. The best decision is to solicit everyone.

[Slide: Example: Extreme Rule Profit. Average Profit = \((99 n_{11} - n_{01}) / N = 99\pi_1 - \pi_0\).]

With extreme decision rules, the utility of any predictive model is limited. The decision threshold is so small that all cases have a predicted probability in excess of the threshold. The overall average profit is determined entirely by the prior probabilities of the target.

Defining the profit structure for a model occurs at the definition of the analytic objective. By carefully considering the profit consequences of modeling decisions and comparing the resulting decision thresholds to the prior target level proportions, you can identify an extreme decision rule before any modeling occurs. A large discrepancy between the estimated decision threshold and the target level priors could suggest that there is little reason for building a model: the model will not affect the optimal decision for your entire modeling population.
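One way to screen for an extreme rule before modeling is to compute the decision threshold implied by the profit matrix and compare it with the probabilities a model is likely to produce. The Python sketch below assumes the profit structure used in these examples, in which one decision always yields zero profit. The function names and the predicted probabilities are hypothetical.

```python
def threshold(profit_if_target, cost_if_not):
    """Decision threshold when the alternative decision yields zero profit:
    p * profit_if_target - (1 - p) * cost_if_not = 0."""
    return cost_if_not / (profit_if_target + cost_if_not)


def share_selected(predicted, thr):
    """Fraction of cases whose predicted probability exceeds the threshold."""
    return sum(p > thr for p in predicted) / len(predicted)


# Hypothetical predicted probabilities for a rare target level.
predicted = [0.02, 0.03, 0.05, 0.08, 0.12]

extreme_thr = threshold(99, 1)          # the 99 / -1 example: 0.01
pva_thr = threshold(14.62, 0.68)        # the PVA profits: 0.68 / 15.30

print(extreme_thr, share_selected(predicted, extreme_thr))
print(pva_thr, share_selected(predicted, pva_thr))
```

When the share selected is 0 or 1, the rule is extreme for that population, and the model cannot change any decision.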

Determining Total Model Profit

Using the average response rates, the Tree node determines the profitability for all cases in a leaf. To gauge overall performance, you must return to the Decision Tree Results window.

1. Close the Tree Browser window, if necessary.
2. Open the Tree results window.
3. Maximize the Fit Statistics window.

By default, the performance of any predictive model in SAS Enterprise Miner should be gauged by the overall average profit calculated on validation data. For the current tree, this profit equals $0.135 per case in the target population. Therefore, for every 100,000 cases in the target population, the charity should expect (on the average) about $13,500 profit using the Tree model.

1.6 Parametric Prediction

[Slide: Parametric Models. \(E(Y \mid X = x) = g(x; w)\), fitted to the training data.]

[Slide: Generalized Linear Model. \(g^{-1}(E(Y \mid X = x)) = w_0 + w_1 x_1 + \cdots + w_p x_p\).]

Nearest neighbor and recursive partitioning models are both very general predictive modeling techniques that make few structural assumptions about the relationship between inputs and target. In contrast, parametric models, an alternative class of techniques, often make very strong assumptions about this relationship. This limits their susceptibility to the curse of dimensionality.

In a parametric model, the expected value of the target \(Y\) is related to the inputs \(x = (x_1, x_2, \ldots, x_p)\) via the relation \(E(Y \mid X = x) = g(x; w)\), where \(w = (w_0, w_1, \ldots, w_d)\) is a vector of parameters. Generally, the number and values of elements in the parameter vector modulate the model complexity.

A simple parametric modeling form restricts variation of the target's expected value to a single direction in the input space. This direction is defined by the vector \(w\). This modeling form, often written

\[ g^{-1}(E(Y \mid X = x)) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p, \]

is called a generalized linear model. Specifying a link function, \(g^{-1}\), a distribution for \(Y\), and likely values for \(w\) given the training data determines the entire model.

[Slide: Logistic Regression Models. \(\text{logit}(p) = \log(p / (1 - p)) = w_0 + w_1 x_1 + \cdots + w_p x_p\), with a plot of \(p\) against \(\text{logit}(p)\).]

The primary purpose of the link function is to match the expected target value range to the range of \(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p\), which is \((-\infty, \infty)\). For example, the range of the expected value of a binary target is [0,1]. An extremely useful link function for binary targets is the logit function, \(g^{-1}(p) = \log(p / (1 - p))\). Because the expected value of a binary target is \(P(Y=1)\), or simply \(p\), the ratio \(p / (1 - p)\) is simply the odds of \(Y=1\). In words, the logit equates the log odds of \(Y=1\) to a linear combination of the inputs.

A generalized linear model with a binary target and a logit link is called a logistic regression model. It assumes the odds change monotonically in the direction defined by \(w\). Because the odds change in a single direction over the entire input space, the decision boundary for standard logistic regression models is a (hyper-)plane perpendicular to \(w\).

[Slide: Changing the Odds. A unit increase in \(x_1\) adds \(w_1\) to the log odds and multiplies the odds by \(\exp(w_1)\), the odds ratio.]

The simple structure of the logistic regression model readily lends itself to interpretation. A unit change in an input \(x_i\) changes the log odds by an amount equal to the corresponding parameter \(w_i\). Exponentiating shows that the unit change in \(x_i\) changes the odds by a factor \(\exp(w_i)\). The factor \(\exp(w_i)\) is called the odds ratio because it equals the ratio of the new odds after a unit change in \(x_i\) to the original odds before a unit change in \(x_i\).
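The odds-ratio interpretation can be checked numerically. The following Python sketch uses hypothetical parameter values and inputs, and the helper names are illustrative. It confirms that raising one input by a single unit multiplies the odds by the exponential of that input's parameter.

```python
import math


def inv_logit(eta):
    """Map a linear predictor on (-inf, inf) to a probability on (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict(w0, w, x):
    """Logistic regression prediction: inverse logit of w0 + sum(w_i * x_i)."""
    return inv_logit(w0 + sum(wi * xi for wi, xi in zip(w, x)))


def odds(p):
    return p / (1.0 - p)


# Hypothetical parameters and inputs, for illustration only.
w0, w = -3.0, [0.5, -0.2]
x = [1.0, 2.0]
x_plus_one = [2.0, 2.0]            # first input raised by one unit

ratio = odds(predict(w0, w, x_plus_one)) / odds(predict(w0, w, x))
print(ratio, math.exp(w[0]))       # the two values agree up to rounding
```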

Building a Logistic Regression Model

The principles behind non-parametric and parametric predictive models are different. You will now see that the results provided by the models are also different.

1. Add a Regression node to the process flow diagram. Place it beneath the Tree node.
2. Draw an arrow from the Data Partition node to the Regression node. For simplicity, the StatExplore node has been removed from the diagram.
3. Run the diagram from the Regression node and view the results.
4. Maximize the Output window. Most of the useful modeling information is contained in this window.

5. Scroll the Output window to display the DMREG procedure output.

Model Information

Training Data Set             EMWS6.PART_TRAIN.DATA
DMDB Catalog                  WORK.REG_DMDB
Target Variable               TARGET_B (Target Variable Indicates for Response to 97NK Mailing)
Target Measurement Level      Ordinal
Number of Target Categories   2
Error                         MBernoulli
Link Function                 Logit
Number of Model Parameters    62
Number of Observations        3539

Target Profile

Ordered Value   TARGET_B   Total Frequency

The Output report provides complete information about the regression model just fit. The number of parameters and number of observations are the most conspicuous.

The Number of Model Parameters value is 62. A standard regression model contains one parameter for each input. The training data contains only 50 variables, including the two potential targets and the customer ID variable. Where are the extra parameters coming from?

While the training data contains only 47 inputs, some of the inputs are categorical. Encoding a categorical variable in a parametric model requires the creation of indicator, or dummy, variables. Fully encoding a categorical variable with L levels requires L - 1 indicator variables.

More surprising than the number of model parameters is the reported number of observations. The training data contains more than 9,000 cases, yet the model estimation process recognizes only 3,539 of this total. Where are the missing cases?
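To see why a categorical input with L levels contributes L - 1 parameters, consider the following Python sketch of simple 0/1 reference coding. The function name and the category codes are hypothetical, and the coding scheme is an assumption chosen for clarity; the Regression node may use a different parameterization internally.

```python
def indicator_columns(values, reference=None):
    """Encode a categorical input as 0/1 indicators, one per level except
    the reference level (by default, the last level in sorted order)."""
    levels = sorted(set(values))
    ref = levels[-1] if reference is None else reference
    kept = [level for level in levels if level != ref]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical category codes for a four-level input.
values = ["U", "R", "S", "U", "C"]
kept, rows = indicator_columns(values)

print(kept)             # three indicator columns for four levels
for v, row in zip(values, rows):
    print(v, row)       # the reference level is coded as all zeros
```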

Managing Missing Values

To solve the mystery of the missing values, select the View Table icon from the group of shortcut buttons on the toolbar. Browse the PVA_RAW_DATA data set.

Only about 75% of the rows for DONOR_AGE contain measurements. The rest contain missing values. The parametric models in SAS Enterprise Miner use a case for model estimation only if it is complete (that is, it has no missing values in the model inputs). Only 3,539 of the cases in the training data are complete.

There are several ways to proceed:

Do nothing. If there are very few cases with missing values, this is a viable option. The difficulty with this approach comes when the model must predict a new case containing a missing value. Omitting the missing term from the parametric equation usually produces an extremely biased prediction.

Impute a synthetic value for the missing value. For example, if an interval input contains a missing value, replace the missing value with the mean of the nonmissing values for the input. This eliminates the incomplete case problem but modifies the input's distribution. This can bias the model predictions. Making the missing value imputation process part of the modeling process allays the modified distribution concern. Any modifications made to the training data are also made to the validation data and the remainder of the modeling population. A model trained with the modified training data will not be biased if the same modifications are made to any other data set the model may encounter (and the data has a similar pattern of missing values).

Create a missing indicator for each input in the data set. Cases often contain missing values for a reason. If the reason for the missing value is in some way related to the target variable, useful predictive information is lost. The missing indicator is 1 when the corresponding input is missing and 0 otherwise. Each missing indicator becomes an input to the model. This allows modeling of the association between the target and a missing value on an input.

To address missing values in the 97NK model, impute synthetic data values and create missing value indicators.

1. Insert an Impute node between the Data Partition node and the Regression node.
2. Select the Impute node and examine the Properties panel.

By default, the Impute node replaces missing values of interval inputs with the mean of the nonmissing values and missing values of categorical inputs with the most frequent category.

3. Select Indicator Variable > Unique.
4. Select Indicator Variable Role > Input.

With these settings, each input with missing values generates a new input. The new input, named IMP_original_input_name, will have missing values replaced by a synthetic value and nonmissing values copied from the original input. In addition, new inputs, named M_original_input_name, will be added to the training data to indicate the synthetic data values.

Run the Impute node and review the Results window. In all, four inputs had missing values.

With all missing values imputed, the entire training data set is available for building the logistic regression model.

1. Run the Regression node and view the results.

2. Select the Output window and scroll to the DMREG procedure output.

Model Information

Training Data Set             EMWS6.IMPT_TRAIN.VIEW
DMDB Catalog                  WORK.REG_DMDB
Target Variable               TARGET_B (Target Variable Indicates for Response to 97NK Mailing)
Target Measurement Level      Ordinal
Number of Target Categories   2
Error                         MBernoulli
Link Function                 Logit
Number of Model Parameters    66
Number of Observations        9685

Target Profile

Ordered Value   TARGET_B   Total Frequency

The number of parameters has increased from 62 to 66. (Four missing indicators have been added to the model.) The number of observations has increased from 3,539 to 9,685, the total number of cases in the training data.

3. Select the Fit Statistics window and scroll to the bottom of the table.

The row labeled Average Profit for TARGET_B rates the performance of the model on the Training and Validation data. The average profit per case is $ on the training data and $ on the validation data. This is a change from $ and $0.1354, respectively, calculated for the Tree model.
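The mean imputation and missing-indicator scheme used in this demonstration can be sketched in Python as follows. The function name and the sample values are hypothetical; the sketch covers an interval input only and is not a description of the Impute node's internals.

```python
def impute_with_indicator(values):
    """Return (imputed, missing_flag) for an interval input.
    Missing values (None) are replaced by the mean of the nonmissing
    values; the flag is 1 where the original value was missing."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag


# Hypothetical ages with two missing measurements.
donor_age = [60, None, 40, None, 50]
imp_donor_age, m_donor_age = impute_with_indicator(donor_age)

print(imp_donor_age)   # plays the role of the IMP_ input
print(m_donor_age)     # plays the role of the M_ input
```

To keep predictions consistent, the mean computed on the training data would be reused when the same transformation is applied to validation or scoring data.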

4. Select the Output window again and scroll to the Type 3 Analysis of Effects.

Type 3 Analysis of Effects

Effect                             DF   Wald Chi-Square   Pr > ChiSq
CARD_PROM_
DONOR_GENDER
FILE_AVG_GIFT
FILE_CARD_GIFT
FREQUENCY_STATUS_97NK                                     <.0001
HOME_OWNER
IMP_DONOR_AGE
IMP_INCOME_GROUP                                          <.0001
IMP_MONTHS_SINCE_LAST_PROM_RESP
IMP_WEALTH_RATING
IN_HOUSE
LAST_GIFT_AMT
LIFETIME_AVG_GIFT_AMT
LIFETIME_CARD_PROM
LIFETIME_GIFT_AMOUNT
LIFETIME_GIFT_COUNT
LIFETIME_GIFT_RANGE
LIFETIME_MAX_GIFT_AMT
LIFETIME_MIN_GIFT_AMT
LIFETIME_PROM
MEDIAN_HOME_VALUE
MEDIAN_HOUSEHOLD_INCOME
MONTHS_SINCE_FIRST_GIFT
MONTHS_SINCE_LAST_GIFT
MONTHS_SINCE_ORIGIN
MOR_HIT_RATE
M_DONOR_AGE
M_INCOME_GROUP
M_MONTHS_SINCE_LAST_PROM_RESP
M_WEALTH_RATING
NUMBER_PROM_
OVERLAY_SOURCE
PCT_MALE_MILITARY
PCT_MALE_VETERANS
PCT_OWNER_OCCUPIED
PCT_VIETNAM_VETERANS
PCT_WWII_VETERANS
PEP_STAR
PER_CAPTITA_INCOME
PUBLISHED_PHONE
RECENCY_STATUS_96NK
RECENT_AVG_CARD_GIFT_AMT
RECENT_AVG_GIFT_AMT
RECENT_CARD_RESPONSE_COUNT
RECENT_CARD_RESPONSE_PROP
RECENT_RESPONSE_COUNT
RECENT_RESPONSE_PROP
RECENT_STAR_STATUS
SES
URBANICITY

The Type 3 Analysis tests the statistical significance of adding the indicated input to a model already containing the other listed inputs. Roughly speaking, a value near 0 in the Pr > ChiSq column indicates a significant input; a value near 1 indicates an extraneous input.

Many of the Pr > ChiSq values are closer to 1 than they are to 0. This is evidence of the model containing many extraneous inputs. Inclusion of extraneous inputs can lead to overfitting and reduced performance due to the curse of dimensionality. This could explain the large difference between training and validation profit.

In short, it is desirable to tune the model to include only relevant inputs.

1.7 Tuning a Parametric Model

[Slide: Forward Selection. Input p-values compared with the entry cutoff at each step, with training and validation profit plotted by step.]

Parametric models are tuned by varying the number and values of model parameters. For logistic regression models, choosing the number of parameters is equivalent to choosing the number of model inputs. Thus, to optimally tune a logistic regression model requires selecting an optimal subset of the available inputs and supplying reasonable estimates of their corresponding parameters.

One way to find the optimal set of inputs is to simply try every combination. Unfortunately, the number of models to consider using this approach increases exponentially in the number of available inputs. Such an exhaustive search is impractical for realistic prediction problems.

An alternative to the exhaustive search is to restrict the search to a sequence of improving models. While this might not find the single best model, it is commonly used to find models with good predictive performance. The Regression node in SAS Enterprise Miner provides three sequential selection methods.

Forward selection creates a sequence of models of increasing complexity. The sequence starts with the baseline model, a model predicting the overall average target value for all cases. The algorithm searches the set of one-input models and selects the model that most improves upon the baseline model. It then searches the set of two-input models that contain the input selected in the previous step and selects the model showing the most significant improvement. By adding a new input to those selected in the previous step, a nested sequence of increasingly complex models is generated. The sequence terminates when no significant improvement can be made.

Improvement is quantified by the usual statistical measure of significance, the p-value. Adding terms in this nested fashion always increases a model's overall fit statistic. By calculating the change in fit statistic and assuming the change conforms to a chi-squared distribution, a significance probability, or p-value, can be calculated. A large fit statistic change (corresponding to a large chi-squared value) is unlikely due to chance. Therefore, a small p-value indicates a significant improvement. When no p-value is below a predetermined entry cutoff, the forward selection procedure terminates.

Validation profit determines the best model in the forward selected sequence. For large training data sets, this is often different from the last model in the sequence.

[Slide: Backward Selection. Input p-values compared with the stay cutoff at each step, with training and validation profit plotted by step.]

In contrast to forward selection, backward selection creates a sequence of models of decreasing complexity. The sequence starts with a saturated model, a model that contains all available inputs and, therefore, has the highest possible fit statistic. Inputs are sequentially removed from the model. At each step, the input chosen for removal least reduces the overall model fit statistic. This is equivalent to removing the input with the highest p-value. The sequence terminates when each remaining input in the model has a p-value less than the predetermined stay cutoff.

As with the forward selection method, validation profit determines the best model in the backward selected sequence.

[Slide: Stepwise Selection. Input p-values compared with the entry and stay cutoffs at each step, with training and validation profit plotted by step.]

Stepwise selection combines elements from both the forward and backward selection procedures. The method begins like the forward procedure, sequentially adding inputs with the smallest p-value below the entry cutoff. However, after each input is added, the algorithm re-evaluates the statistical significance of all included inputs. If the p-value of any of the included inputs exceeds a stay cutoff, the input is removed from the model and re-entered into the pool of inputs available for inclusion in a subsequent step. The process terminates when all inputs available for inclusion in the model have p-values in excess of the entry cutoff and all inputs already included in the model have p-values below the stay cutoff. Again, validation profit determines the best model in the stepwise selected sequence.
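The forward part of these procedures can be sketched in Python. This is a simplified sketch, not the Regression node's algorithm: it assumes each candidate input adds exactly one parameter, it measures improvement by the drop in a fit statistic supplied by a caller-provided function (here a toy stand-in), and the names `forward_selection`, `toy_deviance`, and `GAINS` are hypothetical. A stepwise version would add a removal check against a stay cutoff after each entry.

```python
import math


def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of
    freedom, using the identity P(X > s) = erfc(sqrt(s / 2))."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def forward_selection(candidates, deviance, entry_cutoff=0.05):
    """Return the nested sequence of models as tuples of input names.
    deviance(inputs) must return -2 log-likelihood for those inputs."""
    selected = []
    sequence = [tuple(selected)]              # step 0: baseline model
    remaining = list(candidates)
    while remaining:
        base = deviance(selected)
        # p-value for adding each remaining input to the current model
        scored = [(chi2_pvalue_1df(base - deviance(selected + [c])), c)
                  for c in remaining]
        p_best, best = min(scored)
        if p_best >= entry_cutoff:            # no significant improvement
            break
        selected.append(best)
        remaining.remove(best)
        sequence.append(tuple(selected))
    return sequence


# Toy stand-in for model fitting: each input lowers the deviance by a
# fixed, hypothetical amount.
GAINS = {"a": 30.0, "b": 8.0, "c": 0.5}


def toy_deviance(inputs):
    return 1000.0 - sum(GAINS[name] for name in inputs)


print(forward_selection(["a", "b", "c"], toy_deviance))
```

Each model in the returned sequence would then be scored on validation data, and the one with the highest validation profit kept.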

Implementing Stepwise Selection

Implementing a sequential selection method in the Regression node requires a minor change to the Regression node settings.

1. Close the Regression results window.
2. Select Selection Method > Stepwise on the Regression node property sheet.

The Regression node is now configured to use stepwise selection to choose inputs for the model.

1. Run the Regression node and view the results.
2. Maximize the Fit Statistics window.

The average training profit for the stepwise-selected model is lower than it was for the default saturated model. The average validation profit, however, is higher. Better still, this is with a 9-parameter model instead of a 62-parameter model.

3. Select the Output tab and scroll past the general model information.

The stepwise procedure starts with Step 0, an intercept-only regression model. The value of the intercept parameter is chosen so the model predicts the overall target mean for every case. The parameter estimate and the training data target measurements are combined in an objective function. The objective function is determined by the link function and the error distribution of the target. The value of the objective function for the intercept-only model is compared to the values obtained in subsequent steps for more complex models. A large decrease in the objective function for the more complex model indicates a significantly better model.

Stepwise Selection Procedure

Step 0: Intercept entered.

The DMREG Procedure
Newton-Raphson Ridge Optimization
Without Parameter Scaling

Parameter Estimates 1

Optimization Start
Active Constraints 0
Objective Function
Max Abs Gradient Element E-12

Optimization Results
Iterations 0
Function Calls 3
Hessian Calls 1
Active Constraints 0
Objective Function
Max Abs Gradient Element E-12
Ridge 0
Actual Over Pred Change 0

Convergence criterion (ABSGCONV= ) satisfied.

Likelihood Ratio Test for Global Null Hypothesis: BETA=0

-2 Log Likelihood (Intercept Only)   -2 Log Likelihood (Intercept & Covariates)   Likelihood Ratio Chi-Square   DF   Pr > ChiSq

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
Intercept                                                      <.0001

Step 1 adds one input to the intercept-only model. The input and corresponding parameter are chosen to produce the largest decrease in the objective function. To estimate the values of the model parameters, the modeling algorithm makes an initial guess for their values. The initial guess is combined with the training data measurements in the objective function. Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters. The algorithm decides whether changing the values of the initial parameter estimates can decrease the value of the objective function. If so, the parameter estimates are changed to decrease the value of the objective function and the process iterates. The algorithm continues iterating until changes in the parameter estimates fail to substantially decrease the value of the objective function.
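The iterative estimation just described can be illustrated with the intercept-only model. The Python sketch below applies plain Newton-Raphson updates to the negative log-likelihood of a Bernoulli target with a logit link. It is a simplified stand-in under stated assumptions: it omits the ridge adjustment named in the output, the function name and starting value are hypothetical, and the target values are made up.

```python
import math


def fit_intercept(y, tol=1e-10, max_iter=50):
    """Estimate w0 for an intercept-only logistic model by minimizing the
    negative log-likelihood with Newton-Raphson steps."""
    n, successes = len(y), sum(y)
    w0 = 0.0                                   # initial guess
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-w0))
        gradient = n * p - successes           # derivative of the objective
        hessian = n * p * (1.0 - p)            # second derivative
        step = gradient / hessian
        w0 -= step
        if abs(step) < tol:                    # no substantial change left
            break
    return w0


# Hypothetical target values with a 5% response rate.
y = [1] * 5 + [0] * 95
w0 = fit_intercept(y)

print(w0)
print(1.0 / (1.0 + math.exp(-w0)))    # predicted probability for every case
```

At convergence the intercept equals the log odds of the overall target mean, so the model predicts that mean for every case.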

Step 1: Effect FREQUENCY_STATUS_97NK entered.

The DMREG Procedure
Newton-Raphson Ridge Optimization
Without Parameter Scaling

Parameter Estimates 2

Optimization Start
Active Constraints 0
Objective Function
Max Abs Gradient Element

Iteration history columns: Iter, Restarts, Function Calls, Active Constraints, Objective Function, Objective Function Change, Max Abs Gradient Element, Ridge, Ratio Between Actual and Predicted Change

Optimization Results
Iterations 3
Function Calls 6
Hessian Calls 4
Active Constraints 0
Objective Function
Max Abs Gradient Element E-6
Ridge 0
Actual Over Pred Change

Convergence criterion (GCONV=1E-6) satisfied.

The output next compares the model fit in step 1 with the model fit in step 0. The objective functions of both models are multiplied by two and differenced. The difference is assumed to have a chi-square distribution with 1 degree of freedom. The hypothesis that the two models are identical is tested. A large value for the chi-square statistic makes this hypothesis unlikely.

Likelihood Ratio Test for Global Null Hypothesis: BETA=0

-2 Log Likelihood (Intercept Only)   -2 Log Likelihood (Intercept & Covariates)   Likelihood Ratio Chi-Square   DF   Pr > ChiSq
                                                                                                                     <.0001

Next, the output summarizes an analysis of the statistical significance of individual model effects. For the one-input model, this is similar to the global significance test above.

Type 3 Analysis of Effects

Effect                  DF   Wald Chi-Square   Pr > ChiSq
FREQUENCY_STATUS_97NK                          <.0001

Finally, an analysis of individual parameter estimates is made. The standardized estimates and the odds ratios merit special attention.

Analysis of Maximum Likelihood Estimates

Parameter               DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
Intercept                                                                  <.0001
FREQUENCY_STATUS_97NK                                                      <.0001

Odds Ratio Estimates

Effect                  Point Estimate
FREQUENCY_STATUS_97NK

The standardized estimates present the effect of the input on the log-odds of donation. The values are standardized to be independent of the input's unit of measure. This provides a means of ranking the importance of inputs in the model.

The odds ratio estimates indicate by what factor the odds of donation increase for each unit change in the associated input. Combined with knowledge of the range of the input, this provides an excellent way to judge the practical (as opposed to the statistical) importance of an input in the model.

The stepwise selection process continues for 10 steps. After the tenth step, neither adding nor removing inputs from the model significantly changes the model fit statistic. At this point the Output window provides a summary of the stepwise procedure. The summary shows the step in which each input was added and the statistical significance of each input in the final 10-input model.

Summary of Stepwise Selection

Step   Effect Entered               DF   Number In   Score Chi-Square   Wald Chi-Square   Pr > ChiSq
1      FREQUENCY_STATUS_97NK                                                              <.0001
2      PEP_STAR                                                                           <.0001
3      IMP_INCOME_GROUP                                                                   <.0001
4      MONTHS_SINCE_LAST_GIFT                                                             <.0001
5      MEDIAN_HOME_VALUE                                                                  <.0001
6      MONTHS_SINCE_FIRST_GIFT
7      RECENT_CARD_RESPONSE_PROP
8      M_INCOME_GROUP
9      RECENT_AVG_GIFT_AMT
10     IMP_DONOR_AGE

Perhaps surprisingly, the last model listed is not the model selected by the Regression node as the best predictor of the target. The selected model, based on the CHOOSE=VDECDATA criterion, is the model trained in Step 9.
It consists of the following effects: Intercept FREQUENCY_STATUS_97NK IMP_INCOME_GROUP MEDIAN_HOME_VALUE MONTHS_SINCE_FIRST_GIFT MONTHS_SINCE_LAST_GIFT M_INCOME_GROUP PEP_STAR RECENT_AVG_GIFT_AMT RECENT_CARD_RESPONSE_PROP

For convenience, the output from step 9 is repeated. An excerpt from the analysis of individual parameter estimates is shown below.

Analysis of Maximum Likelihood Estimates
[Table columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq, Standardized Estimate, Exp(Est); rows: Intercept, FREQUENCY_STATUS_97NK, IMP_INCOME_GROUP, MEDIAN_HOME_VALUE, MONTHS_SINCE_FIRST_GIFT, MONTHS_SINCE_LAST_GIFT, M_INCOME_GROUP, PEP_STAR, RECENT_AVG_GIFT_AMT, RECENT_CARD_RESPONSE_PROP]

The parameter with the largest standardized estimate is the 97NK frequency status, followed by the income group, months since last gift, and months since first gift.

The odds ratio estimates show that a unit change in RECENT_CARD_RESPONSE_PROP produces the largest change in the donation odds. Yet, this input had the smallest standardized estimate. This occurs because the range of the input is [0,1], so a full unit change would have to span the input's entire range; realistic changes in this input are much smaller than one unit.

Odds Ratio Estimates
[Table columns: Effect, Point Estimate; rows: FREQUENCY_STATUS_97NK, IMP_INCOME_GROUP, MEDIAN_HOME_VALUE, MONTHS_SINCE_FIRST_GIFT, MONTHS_SINCE_LAST_GIFT, M_INCOME_GROUP, PEP_STAR, RECENT_AVG_GIFT_AMT, RECENT_CARD_RESPONSE_PROP]

The most important remaining piece of output is the Decision Table. This table summarizes the number of individuals affected by each decision and their actual donation disposition.
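To judge an odds ratio against the range of its input, it can help to rescale the per-unit odds ratio to a realistic change. Because the per-unit odds ratio is exp(estimate), the odds ratio for a change of delta units is the per-unit odds ratio raised to the power delta. The sketch below assumes a hypothetical odds ratio of 8.0; it is not a value from the course output.

```python
def odds_ratio_for_change(unit_odds_ratio, delta):
    """Odds multiplier for a change of `delta` units in an input.

    The per-unit odds ratio is exp(beta), so a change of delta units
    multiplies the odds by exp(beta * delta) = unit_odds_ratio ** delta.
    """
    return unit_odds_ratio ** delta

# Hypothetical per-unit odds ratio for an input whose range is [0, 1].
unit_or = 8.0
for delta in (1.0, 1 / 3, 0.1):
    print(f"change of {delta:.2f} units -> odds multiplied by "
          f"{odds_ratio_for_change(unit_or, delta):.3f}")
```

A large per-unit odds ratio on a proportion-valued input therefore translates into a much more modest effect for the small changes that actually occur.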

Decision Table

Data Role=TRAIN, Target Variable=TARGET_B
[Table columns: Target, Decision, Percent of Target, Percent of Decision, Count, Percent of Total, Adjusted Percent of Predict/Decision Variable]

Data Role=VALIDATE, Target Variable=TARGET_B
[Table columns: Target, Decision, Percent of Target, Percent of Decision, Count, Percent of Total, Adjusted Percent of Predict/Decision Variable]

Separate tables are created for training and validation data. Summing the Adjusted Percent of Predict/Decision Variable for Decision=1 (solicit), it appears that a little more than 50% of the population is selected for solicitation (in the next section this will be called depth). The response rate among the solicited cases can be calculated by dividing the Target=1, Decision=1 Adjusted Percent (3.2459%) by this depth (in the next section, this is called cumulative gain). From the Percent of Decision, it is found that almost 65% of all donors are selected for solicitation (in the next section, this is called sensitivity).
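The three quantities read from the Decision Table can be tied together with a short worked calculation. This is an approximate sketch: the depth is taken as 0.50 (the table shows a little more than 50%), and the population response rate is assumed to be the 5% prior used elsewhere in this course.

```latex
\begin{align*}
\text{depth} &\approx 0.50 \\[4pt]
\text{cumulative gain}
  &= \frac{\text{adjusted proportion with Target}=1,\ \text{Decision}=1}{\text{depth}}
   \approx \frac{0.032459}{0.50} \approx 0.065 \\[4pt]
\text{sensitivity}
  &= \frac{\text{adjusted proportion with Target}=1,\ \text{Decision}=1}{\pi_1}
   = \frac{0.032459}{0.05} \approx 0.649
\end{align*}
```

The sensitivity of about 0.649 agrees with the statement that almost 65% of all donors are selected for solicitation.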

1.8 Comparing Predictive Models

Gains Charts
[Slide: gains chart of validation $\hat{p}$ versus decile (10% to 100%), validation data]

Comparing average profit or loss on a validation data set provides one way of comparing predictive models. Another method, a commonly used graphical technique called a gains chart, examines predictive performance independent of profit considerations.

The technique partitions the validation data into deciles based on predicted probability. The average value of the target is plotted versus decile percentage (10% indicating the top 10% of predicted probabilities, 20% indicating the second highest 10% of predicted probabilities, and so on). The average target value in each decile is compared to the overall average target value. If the model is predicting well, the initial deciles (corresponding to the highest predicted probabilities) should show a high average target value, whereas the final deciles (corresponding to the lowest predicted probabilities) should show a low average target value.

Each decile corresponds to a range of predicted values. If the model is producing correct (unbiased) predictions, the average value of the target in each decile should, on the average, fall within this range of predicted values.
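The decile partitioning behind a gains chart can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAS Enterprise Miner computation: the function name and the toy data are invented, no adjustment for separate sampling is made, and five bins are used instead of ten only to keep the toy example small.

```python
def decile_gains(predicted, actual, n_bins=10):
    """Average actual target value within each bin of cases ranked by prediction.

    Cases are sorted from highest to lowest predicted value and split into
    n_bins groups of (nearly) equal size. Requires len(predicted) >= n_bins.
    """
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    gains = []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        bin_targets = [target for _, target in ranked[start:stop]]
        gains.append(sum(bin_targets) / len(bin_targets))
    return gains

# Toy validation data (hypothetical): 20 cases, 6 of them with target = 1.
predicted = [i / 20 for i in range(20)]
donors = {19, 18, 17, 15, 12, 8}
actual = [1 if i in donors else 0 for i in range(20)]

overall = sum(actual) / len(actual)
for rank, gain in enumerate(decile_gains(predicted, actual, n_bins=5), start=1):
    print(f"bin {rank}: average target {gain:.2f} (overall {overall:.2f})")
```

In this toy data the early bins sit above the overall average and the late bins below it, which is the pattern expected from a model that predicts well.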

Cumulative Gains Charts
[Slide: cumulative gains chart of validation $\hat{p}$ versus depth (0% to 100%), validation data]

In many applications, an action is taken on cases with the highest predicted target value. A cumulative gains chart plots the average target value in the validation data versus selection depth (the proportion of cases above a given threshold). Such a chart shows the expected gain in average target value obtained by selecting cases with high predicted target values.

For example, selecting the top 20% of the cases (based on predicted target value) results in a subset of data with more than 80% of the cases having the primary target level. This is 1.6 times the overall average proportion of cases with the primary target level. In general, this ratio of average target value at a given depth to overall average target value is known as lift. The vertical axis of a cumulative gains chart can be described in terms of lift instead of average target value. When this is the case, a cumulative gains chart is often called a lift chart.

For a fixed depth, a model with greater lift is preferred to one with lesser lift. On the average, increasing the depth decreases the lift. However, the rate of change for two models might be different. While one model might have a higher lift at one depth, a second model could have a higher lift at another depth. The cumulative gains chart illustrates this tradeoff.
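Cumulative gain and lift at a given depth can be computed directly from ranked cases. The sketch below is illustrative only: the function names and the ten toy cases are assumptions, and no adjustment for separate sampling is applied.

```python
def cumulative_gain(predicted, actual, depth):
    """Average target value among the top `depth` fraction of ranked cases."""
    ranked = [t for _, t in sorted(zip(predicted, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n_selected = max(1, round(depth * len(ranked)))
    return sum(ranked[:n_selected]) / n_selected

def lift(predicted, actual, depth):
    """Cumulative gain at `depth` divided by the overall average target value."""
    overall = sum(actual) / len(actual)
    return cumulative_gain(predicted, actual, depth) / overall

# Toy validation data (hypothetical): half of the cases have target = 1.
predicted = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
actual = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for depth in (0.2, 0.5, 1.0):
    print(f"depth {depth:.0%}: gain {cumulative_gain(predicted, actual, depth):.2f}, "
          f"lift {lift(predicted, actual, depth):.2f}")
```

At full depth the lift is always 1, because the selected cases are the entire data set.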

Sensitivity Charts
[Slide: sensitivity chart of sensitivity versus depth (0% to 100%), validation data]

While a cumulative gains chart shows the tradeoff of lift versus depth, it does not provide a complete description of model performance. In addition to knowing the gain or lift at some depth, an analyst might want to know the proportion of cases of a particular target level correctly decided at this depth. For the primary target level, this proportion is known as sensitivity.

A sensitivity chart plots sensitivity versus depth. In this case, it shows the proportion of cases with the primary target level whose predicted target value exceeds fixed thresholds. Assuming each fixed threshold represents a decision threshold, this proportion is equal to sensitivity.

For a fixed depth, a model with greater sensitivity is preferred to one with lesser sensitivity. By its definition, sensitivity always increases with depth. However, as with lift, the rate of increase of sensitivity for two models can be different. While one model might have a higher sensitivity at one depth, a second model could have a higher sensitivity at another depth. The sensitivity chart illustrates this tradeoff.

A model that provides no lift has, on the average, a sensitivity equal to selection depth. To summarize a model's overall performance, analysts can examine the total area under the sensitivity curve. Models with the greatest area under the sensitivity curve have the highest average sensitivity across all decision thresholds.
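Sensitivity at a given depth is the share of all primary-level cases that fall inside the selected group. A minimal sketch, with a hypothetical function name and toy data and without any adjustment for separate sampling:

```python
def sensitivity_at_depth(predicted, actual, depth):
    """Fraction of all target = 1 cases found in the top `depth` fraction."""
    ranked = [t for _, t in sorted(zip(predicted, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n_selected = round(depth * len(ranked))
    return sum(ranked[:n_selected]) / sum(actual)

# Toy validation data (hypothetical): 5 of the 10 cases have target = 1.
predicted = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
actual = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for depth in (0.0, 0.2, 0.5, 1.0):
    print(f"depth {depth:.0%}: sensitivity "
          f"{sensitivity_at_depth(predicted, actual, depth):.2f}")
```

For a model with no lift the printed sensitivities would track the depth itself (0.2 at 20%, 0.5 at 50%), so the amount by which a curve rises above that diagonal is what the area statistic summarizes.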

Adjusted Gains Charts
[Slide: gains chart of adjusted validation $\hat{p}$ versus adjusted depth (0% to 100%), validation data]

Adjusted Cumulative Gains Charts
[Slide: cumulative gains chart of adjusted validation $\hat{p}$ versus adjusted depth (0% to 100%), validation data]

Separate sampling complicates the construction of the preceding assessment charts. Adjustments must be made not only to the predicted probabilities, but also to the average target proportion and the depth calculations. SAS Enterprise Miner handles the adjustments automatically as long as a prior vector has been specified.

Adjusted Sensitivity Charts
[Slide: sensitivity chart of sensitivity versus adjusted depth (0% to 100%), validation data]

Because calculating sensitivity involves only the primary target level, its value at a given decision threshold is not affected by separate sampling. However, because a sensitivity chart plots depth on the horizontal axis, the chart's appearance is affected. For example, before adjustment, sensitivity at a depth of 20% is around 0.3. After adjustment, the sensitivity at the same depth is more than 0.5.

Because the sensitivity at a given depth is affected by separate sampling, so too is the total area under the sensitivity curve. It can be shown that the maximum possible area under the sensitivity curve (for large validation samples) is $1 - \pi_1/2$, where $\pi_1$ is the overall average proportion of cases with the primary target level. Smaller overall average proportions increase the area possible under the curve. This limits the utility of the area statistic as a universal measure of model performance. For example, having 68% of the plot area under a sensitivity curve indicates mediocre predictive performance for $\pi_1 = 0.05$, but it is close to the maximum possible for $\pi_1 = 0.5$.

ROC Charts
[Slide: ROC chart of sensitivity versus 1 - specificity, validation data]
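The bound on the area under the sensitivity curve quoted above can be evaluated for the two proportions in the example. This is a worked illustration of the stated large-sample bound, with the observed area of 0.68 taken from the text.

```latex
\begin{align*}
\text{maximum area} &= 1 - \frac{\pi_1}{2} \\[4pt]
\pi_1 = 0.05:&\quad 1 - \frac{0.05}{2} = 0.975,
  \qquad \frac{0.68}{0.975} \approx 0.70 \text{ of the maximum} \\[4pt]
\pi_1 = 0.5:&\quad 1 - \frac{0.5}{2} = 0.75,
  \qquad \frac{0.68}{0.75} \approx 0.91 \text{ of the maximum}
\end{align*}
```

The same observed area is therefore far from the ceiling when the primary target level is rare and close to it when the two levels are balanced.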

A slight modification to the sensitivity chart provides a chart whose appearance is invariant to separate sampling. A receiver operating characteristic (ROC) chart plots sensitivity versus 1 - specificity. Specificity is the proportion of secondary target level cases correctly decided at a given decision threshold. Thus, 1 - specificity is the proportion of secondary target level cases incorrectly decided at a given decision threshold. Because sensitivity and specificity are computed separately within each target level, the overall shape of the plot itself is unaffected by separate sampling methods.

The area under the ROC curve, like the area under the sensitivity curve, provides a way to assess overall model performance. Unlike the sensitivity curve, however, the total area does not depend on the overall average proportion of the primary target level. Thus, the area, sometimes referred to as a c-statistic, has become a universal measure of overall predictive model performance for binary target models. A perfect model that completely separates cases with distinct target levels has c=1. A model with no lift has c=0.5.

The area under a sensitivity curve can be obtained from the c-statistic via the formula

$\text{Area} = \pi_0\, c + \tfrac{1}{2}\left(\pi_1 + 1/N\right)$,

where $\pi_1$ is the overall average proportion of the primary target level, $\pi_0 = 1 - \pi_1$, and $N$ is the overall sample size of the validation data.

Cumulative Gains and Profit
[Slide: profit matrix (target by decision) with plots of adjusted validation $\hat{p}$ and average profit versus adjusted depth (0% to 100%); $\text{Average Profit} = [(P_{11} - P_{01})\,\hat{p}_1 + P_{01}] \cdot \text{depth}$]

Cumulative gains charts are often used to compare models separately from profit considerations. When one of the decision alternatives is to simply do nothing, a useful connection between profit and gains can be established. Consider the expression for overall average profit for the profit structure shown in the slide:

$\text{Average Profit} = (P_{11}\, n_{11} + P_{01}\, n_{01}) / N$.

Simple algebraic manipulation yields

$\text{Average Profit} = [(P_{11} - P_{01})\,\hat{p}_1 + P_{01}] \cdot \text{depth}$,

where $\hat{p}_1 = n_{11}/n_1$ is the gain or average value of the target at $\text{depth} = n_1/N$. This expression shows that, for a fixed depth, the model with the highest gain or lift also has the highest overall average profit.

It is common practice for the selection depth to be mandated prior to modeling. In such a case, the most profitable model at the mandated selection depth is identical to the model with the highest lift at that depth.

Cumulative Gains and Profit
[Slide: plots of adjusted validation $\hat{p}$ and average profit versus adjusted depth (0% to 100%)]

Although it is a common practice, mandating the selection depth prior to modeling is not necessarily a good practice. Inspection of the equation for overall average profit shows that it is proportional to the area spanned by two rectangles. One rectangle, with base equal to depth and height proportional to gain, corresponds to the profit realized by making a correct decision. The other rectangle, with base equal to depth and height constant, corresponds to the loss incurred by making an incorrect decision. Overall average profit is determined by the (nonlinear) trade-off in area between these two rectangles.

One danger in mandating a selection depth is seen in the overall average profit plot above. Profit is maximized at a depth of 30%. Mandating a selection depth of 20% reduces overall average profit by nearly 25%. On the other hand, mandating a selection depth of 60% adds a large number of cases with no profit increase. While this might not seem as dire as underestimating the depth, doubling the number of selected cases also increases the variability in the overall average profit.
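The algebraic step from the count-based expression to the gain-based expression can be written out explicitly. The derivation below assumes, as in the profit structure described here, that unselected cases contribute zero profit and that the $n_1$ selected cases consist of $n_{11}$ donors and $n_{01} = n_1 - n_{11}$ non-donors. The closing numerical line uses the profit values 14.62 and -0.68 that appear in the course program.

```latex
\begin{align*}
\text{Average Profit}
  &= \frac{P_{11}\, n_{11} + P_{01}\, n_{01}}{N}
   = \frac{P_{11}\, n_{11} + P_{01}\,(n_1 - n_{11})}{N} \\[4pt]
  &= \left[(P_{11} - P_{01})\,\frac{n_{11}}{n_1} + P_{01}\right]\frac{n_1}{N}
   = \left[(P_{11} - P_{01})\,\hat{p}_1 + P_{01}\right]\cdot \text{depth} \\[8pt]
\text{Average Profit} > 0
  &\iff \hat{p}_1 > \frac{-P_{01}}{P_{11} - P_{01}}
   = \frac{0.68}{14.62 + 0.68} \approx 0.044
\end{align*}
```

Under these assumptions, a selection is profitable overall only while its cumulative gain stays above roughly 4.4%.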

Comparing Predictive Models

In this demonstration, the two models built thus far are compared using the assessment charts discussed above.

1. Close the Results - Regression window if it is open.
2. Add a Model Comparison node to the diagram as shown.
3. Run the Model Comparison node and view the results.

The Results - Model Comparison window opens. The Results window contains four sub-windows that show (clockwise, from the upper left) ROC plots, Score Rankings, Output, and Fit statistics.

Overall, the Tree model appears competitive with the Regression model on the training data, but worse than the Regression model on the validation data.

Cumulative Gains Chart

1. Right-click the Score Rankings plot and select Data Options. The Data Options dialog opens.
2. Select Role Y for the RESPC variable. The Score Rankings window is updated to display cumulative gains charts.

The overall response rate appears to be 5%, which indicates that the chart has been adjusted for oversampling. As such, the plotted results reflect the population and not the training sample in response proportion.

3. Return to the Data Options window and select Role Y for the LIFTC variable. The vertical axis of the Score Rankings window returns to its original state.

This axis indicates the increase in response observed using the model at a particular decile versus the baseline response rate of 5%. For example, at a depth of 30% (the 30th percentile), the regression model gives a response rate about 1.5 times the baseline rate.

4. Select Role Y for the CAPC variable in the Data Options window. The Score Rankings plot is updated to show a sensitivity plot.

The sensitivity chart shows the percent of all donors expected to be captured in a solicitation up to the specified depth. For example, soliciting to a depth of 30% captures about half of all donors.

ROC Chart

An ROC chart appears to be very similar to the sensitivity chart. However, as discussed above, the horizontal axis is changed so that the chart is invariant to overall response rates (particularly the effects of oversampling).

1. Right-click the validation ROC chart and select Focus on Chart.

The chart shows the Regression model with uniformly higher sensitivity than the Tree model. In this way, you can be confident that the regression model separates the donors from the non-donors better than the tree model, independent of the depth of solicitation.

The area under the ROC curve, also known as the ROC index, is useful for comparing model performance. The Output window gives this and other model comparison statistics for all models connected to the Model Comparison node.

Selection Depth

One question left unanswered by these assessment charts is: to what depth should you solicit? None of the charts examined thus far directly answers this question. A clue might be found in the Score Rankings plot.

1. Return to the Score Rankings plot and again right-click and select Data Options.
2. Select Role Y for the PROFIT variable. The Score Rankings plot dramatically changes.

While the plot is related to the definition of profit used elsewhere in SAS Enterprise Miner, there is an important difference. The quantities plotted represent average profit per case within the percentile. Mathematically, this can be expressed as

$E(\text{Profit}_i) = (14.62\, n_{11i} - 0.68\, n_{01i}) / N_i$,

where $i$ indicates the percentile number, $n_{11i}$ and $n_{01i}$ indicate the number of donors and non-donors, respectively, in percentile $i$ (with predicted probability in excess of the decision threshold), and $N_i$ is the number of cases in percentile $i$. In short, this is an estimate of the average profit (per case) just for cases in the $i$th percentile.

Focusing on the Regression model, the profit-per-percentile is seen to go to zero at the 55th percentile. This means that there are no cases in that percentile with predicted probability in excess of the decision threshold ($n_{11i}$ and $n_{01i}$ are both zero). This implies an optimal selection depth of 50%. (This is consistent with the decision table presented in the Regression node's output window.)

What is lacking in the Score Rankings plot is a presentation of the consequences to overall average profit of choosing a sub-optimal depth for solicitation. The next (optional) section describes a simple modification to the Score Rankings data set that enables you to understand the behavior of models at non-optimal selection depths.
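The per-percentile profit expression can be checked with a few lines of code. The sketch below uses the profit values 14.62 and -0.68 from the formula; the function name and the case counts are hypothetical.

```python
def profit_per_percentile(n_donors, n_nondonors, n_cases, p11=14.62, p01=-0.68):
    """Average profit per case within one percentile.

    n_donors and n_nondonors count only cases whose predicted probability
    exceeds the decision threshold; n_cases is the size of the percentile.
    """
    if n_cases == 0:
        return 0.0
    return (p11 * n_donors + p01 * n_nondonors) / n_cases

# Hypothetical percentile of 100 cases, all above the decision threshold.
print(profit_per_percentile(n_donors=10, n_nondonors=90, n_cases=100))

# Hypothetical percentile in which no case exceeds the decision threshold.
print(profit_per_percentile(n_donors=0, n_nondonors=0, n_cases=100))
```

The second call illustrates why the plotted profit drops to zero beyond the optimal depth: once no case in a percentile clears the threshold, both counts in the numerator are zero.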

Enhancing the Score Rankings Plot (Optional)

The profits presented in the Model Comparison node do not match the average profit definitions used in the modeling nodes. Determining the profit consequence of making sub-optimal decisions requires some hand calculation. For example, what is the overall profit consequence of mandating solicitation to a particular depth? To calculate this quantity, refer to the formula presented in the last slide of this section:

$\text{Average Profit} = [(P_{11} - P_{01})\,\hat{p}_1 + P_{01}] \cdot \text{depth}$.

This section provides a program that modifies the Score Rankings data set using the SAS Code tool. The SAS Code tool enables you to modify and extend the functionality of SAS Enterprise Miner.

1. Add a SAS Code node to the diagram and connect it to the Model Comparison node as shown.
2. Select SAS Code from the SAS Code node's properties panel. The SAS Code window opens.

The SAS Code window has three tabs: SAS Code, Macro Variables, and Macros. The SAS Code tab is a mini program editor where you can write, load, and edit SAS programs within the SAS Enterprise Miner environment.

3. Select the Macro Variables tab.

The Macro Variables tab lists the macro variables that are used to encode single values, such as the names of the input data sets. The macro variables are arranged in several groups: General, Properties, Imports, Exports, Files, and Statements.

General macro variables
   Retrieve system information.

Properties macro variables
   Retrieve information about the nodes.

Imports macro variables
   Identify the SAS tables that are imported from predecessor nodes at run time. This example uses one Imports macro variable, EM_IMPORT_RANK. This macro variable facilitates access to the Score Rankings data set created by the Model Comparison node.

Exports macro variables
   Identify the SAS tables that are exported to successor nodes at run time.

Files macro variables
   Identify external files that are managed by SAS Enterprise Miner, such as log and output listings. Not all nodes create or manage all external files.

Statements macro variables
   Identify SAS program statements that are frequently used by SAS Enterprise Miner, such as the decision statement in the modeling procedures.

4. Select the Macros tab.

The Macros tab lists the SAS macros that are used to encode multiple values, such as a list of variables, and functions that are already programmed in SAS Enterprise Miner. The macros are arranged in two groups: Utility and Variables.

Utility macros
   Manage and format data. This example uses two utility macros: EM_REGISTER and EM_REPORT.

Variable macros
   Identify groups of variables at run time.

5. Select the SAS Code tab.
6. Right-click in the Program Editor window and select File > Open.
7. Select and open the Overall Average Profit Model Comparison (simple).sas program file. The program file should be located in the same directory as the PVA_RAW_DATA data set. The program is loaded into the SAS Code window.

There is a simple and a deluxe version of the Overall Average Profit Model Comparison program. The simple one, used here, assumes a profit matrix with zeroes in the second column. Moreover, this profit matrix must be hand-edited in the Code node to conform to that specified earlier in the Decision Processing specification. The deluxe version of the program accepts a general 2x2 profit matrix and automatically reads the Decision Processing-specified matrix from the metadata in SAS Enterprise Miner. Advanced students are invited to use and study the deluxe version. However, the details of its operation are omitted from the present discussion.

The first two lines of the program define the profit consequences of solicitation for donors and non-donors, respectively.

   %let PF11= 14.62;
   %let PF01= -0.68;

The next line registers a data set with SAS Enterprise Miner.

   %em_register(key=profit,type=data);

A registered data set is internally named and managed by SAS Enterprise Miner. You must refer to this registered data set in subsequent code by its macro variable name: &EM_USER_PROFIT. Such a reference occurs in the DATA step following the registration macro call.

   data &EM_USER_PROFIT;
      set &EM_IMPORT_RANK;
      OVERALL_AVG_PROFIT=((&PF11-&PF01)*(respC/100) + &PF01)*(decile/100);
   run;

The DATA step reads in observations from the Score Rankings data set created and plotted in the Model Comparison node. A new variable named OVERALL_AVG_PROFIT is added to the data set. Using the formula presented above, the variable combines the profit information defined above with the percentile-level cumulative gain statistics calculated for each model.

   %em_report(key=profit,description=Overall Average Profit by Decile,
              viewtype=lineplot,x=decile,y=overall_avg_profit,
              group=model,by=datarole,autodisplay=y);

The final block of code instructs SAS Enterprise Miner (with the %EM_REPORT macro) to create a plot of overall average profit by percentile.

Use this program to enhance your understanding of model performance.

1. Select the OK button to close the SAS Code node.
2. Run the diagram from the SAS Code node and view the results.
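The calculation performed by the DATA step can be mirrored outside SAS to see how the overall average profit varies with depth. This is an illustrative sketch only: the profit values come from the program above, while the function name and the (depth, cumulative response) rows are invented and do not come from the Score Rankings data set.

```python
PF11 = 14.62   # profit from soliciting a donor (value used in the course program)
PF01 = -0.68   # profit (a loss) from soliciting a non-donor

def overall_avg_profit(resp_c, decile, pf11=PF11, pf01=PF01):
    """Overall average profit when soliciting to a given cumulative depth.

    resp_c: cumulative response rate in percent (the role of RESPC).
    decile: selection depth in percent (the role of DECILE).
    """
    return ((pf11 - pf01) * (resp_c / 100) + pf01) * (decile / 100)

# Hypothetical (depth %, cumulative response %) rows, invented for illustration.
rows = [(10, 12.0), (20, 10.0), (30, 8.5), (40, 7.5), (50, 6.8), (100, 5.0)]

profits = {depth: overall_avg_profit(resp, depth) for depth, resp in rows}
best_depth = max(profits, key=profits.get)

for depth, profit in profits.items():
    print(f"depth {depth:3d}%: overall average profit {profit:.4f}")
print(f"most profitable depth among those listed: {best_depth}%")
```

With these invented rows the profit rises, peaks, and then declines as depth increases, and neighboring depths near the peak differ only slightly, which is the kind of behavior the line plot is meant to reveal.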

The SAS Code node creates a plot of overall average profit versus percentile. As is typical with conforming profit matrices, the overall average profit for both models increases with depth, reaches a maximum, and then diminishes.

The plot shows the regression model achieving a maximum profit at the theoretical cutoff found earlier at 50%. From a practical point of view, it seems soliciting anywhere from 30% to 60% of the cases will give similar profits using the regression model. A mandated cutoff at 20%, however, would reduce profits by about 25%.

Many of the intermediate deciles contain a mixture of both donors and non-donors that results in little additional profit (or loss). This results in maximum profit for the regression model across a wide range of values and offers some flexibility in deciding the final solicitation depth. If a secondary goal of the analysis, after maximizing profit, is to stay in contact with as many potential donors as possible, you can solicit more than the theoretical optimum with apparently little overall profit consequence. On the other hand, if a secondary goal is to minimize up-front solicitation costs, reducing the size of the solicitation somewhat from the theoretical optimum likewise has little overall profit consequence.

While this flat maximum may seem to be desirable, it is also consistent with a predictive model that is unable to sharply separate the donors from the non-donors. The lack of sharp separation can be due to either an inadequate model or a basic inseparability in the target level. In SAS Enterprise Miner, the best way to identify an inadequate model is to try many types of models and see which one has the best generalization.

Deluxe Overall Average Profit Model Comparison (Self-Study)

The deluxe version of the Overall Average Profit Model Comparison program does not require the user to type in the profit matrix. It uses the values of the profit matrix specified in the Decision Processing interface in the data source. In addition, it calculates the c statistic, the average squared error, and the average profit gained by using the specified model at the theoretically optimal cutoff. Finally, the statistics necessary to draw the ROC curve, the sensitivity chart, and the gains chart are all stored in the results of this program, so you can easily generate those plots as well. In this way, a single Results window gives all relevant fit information.

The SQL procedure call generates a macro variable, &DECDATA, that points to the name of the data set with the Decision Processing information.

   proc sql noprint;
      select DATA into :DECDATA
      from &EM_IMPORT_DATA_EMINFO
      where key='DECDATA';
   quit;

   proc print data=&EM_IMPORT_DATA_EMINFO;
   run;

For the PVA data example, the relevant portion of that data set looks like this:

[Listing of the decision data set (profit matrix) not reproduced]

The next DATA step creates macro variables &PF11, &PF10, &PF01, and &PF00, which correspond to the appropriate elements in the profit matrix above. The STRIP function eliminates leading and trailing blanks from a character value. The automatic variable _N_ is a counter that keeps track of which record in a data set is currently being processed. The CALL SYMPUT routine puts a value from a data set into a macro variable.

   data _null_;
      set &DECDATA;
      call symput("pf"||strip(2-_n_)||"1",DECISION1);
      call symput("pf"||strip(2-_n_)||"0",DECISION2);
   run;

The &EM_IMPORT_RANK macro variable identifies the score rankings data set from a predecessor node. Because this SAS Code node is attached to the Model Comparison node, this data set contains score ranking results for all models of interest. The MODEL variable contains the name of the model being evaluated (Tree, Reg, Reg2, and so on); the DATAROLE variable contains the name of the data set that particular model is being assessed on (TRAIN, VALIDATE); and the variable DECILE specifies what portion of the data set is of interest (top 5%, top 10%, top 15%, and so on). Sorting &EM_IMPORT_RANK by these three variables allows analysis by model type and by data role.

   proc sort data=&EM_IMPORT_RANK;
      by MODEL DATAROLE DECILE;
   run;

The SQL procedure is used to create a temporary data set, RANK_TOTALS, which contains the count of responders and nonresponders for each combination of model and data role. This data set is used in the following DATA step.

   proc sql;
      create table work.rank_totals as
      select MODEL, DATAROLE,
             sum(numberOfEvents) as N1,
             sum(n-numberOfEvents) as N0
      from &EM_IMPORT_RANK
      group by MODEL, DATAROLE;
   quit;

The %EM_REGISTER macro registers objects to SAS Enterprise Miner. You can register catalogs, data, files, folders, and graphs. This enables you to use the reporting capabilities of SAS Enterprise Miner on a data set that you create. The KEY=PROFIT argument specifies a name you can use in the code for this data set. Because the key is PROFIT, the macro variable that points to the data set object you are registering to SAS Enterprise Miner is &EM_USER_PROFIT.

   %em_register(key=profit,type=data);

The &EM_USER_PROFIT data set is the location for the enhanced profit figures. This is comparable to the simple version of the Overall Average Profit Model Comparison program, but here the c statistic is calculated in addition to the overall average profit. The CSTAT data set will contain the c statistic for each model, for each data role.

   data &EM_USER_PROFIT(keep=MODEL DATAROLE DEPTH _N11 _N10 _N01 _N00
                             PPV _PVN SENSITIVITY SPECIFICITY
                             _1minusSPECIFICITY OVERALL_AVG_PROFIT)
        work.cstat(keep=MODEL DATAROLE c);

The RETAIN statement both declares a variable C and initializes its value to 0.

      retain c 0;

Merging the &EM_IMPORT_RANK and RANK_TOTALS data sets by model and data role creates a data set that looks just like &EM_IMPORT_RANK, augmented by the responder and nonresponder counts.

      merge &EM_IMPORT_RANK work.rank_totals;
      by MODEL DATAROLE;

Using a BY statement when creating a data set creates internal flags that can be useful. In the program below, the statement that begins if first.datarole is true if the record being processed by the DATA step is the first record of a particular data role. That is to say, the statement will be true only for the first record of a model/data role combination. That means you can use that knowledge to initialize counters, calculate terms that only need to be calculated once, and so on.

Here, for each of these first instances, the depth, positive predictive value, sensitivity, specificity, and c are initialized to appropriate values. Also, temporary counters for true positives (_N11), true negatives (_N00), false positives (_N01), and false negatives (_N10) are initialized. This initial record corresponds to the decision of not soliciting any individuals. The OUTPUT &EM_USER_PROFIT statement writes this record to the data set registered to SAS Enterprise Miner above.

      if first.datarole then do;
         DEPTH=0;
         _N11=0; _N01=0; _N10=N1; _N00=N0;
         PPV=.;
         _PVN=_N00/(_N10+_N00);
         SENSITIVITY=0;
         SPECIFICITY=1;
         _1minusSPECIFICITY=0;
         OVERALL_AVG_PROFIT=(&PF11*_N11+&PF10*_N10+&PF01*_N01+&PF00*_N00)/(N1+N0);
         c=0;
         output &EM_USER_PROFIT;
      end;

After this initial record is written, the counters initialized above are incremented with the counts from the current record. The assessment statistics such as sensitivity, positive predictive value, and overall average profit are then updated as well. These results are output to the &EM_USER_PROFIT data set. The ATTRIB statement enables you to change certain attributes of the variables in a data set; here, it is used to attach labels to the variables.

      _N11+numberOfEvents;
      _N01+(n-numberOfEvents);
      _N10=N1-_N11;
      _N00=N0-_N01;
      DEPTH=Decile;
      if (_N11+_N01) > 0 then PPV=_N11/(_N11+_N01);
      if (_N10+_N00) > 0 then _PVN=_N00/(_N10+_N00);
      SENSITIVITY=_N11/N1;
      SPECIFICITY=_N00/N0;
      _1minusSPECIFICITY=1-SPECIFICITY;
      OVERALL_AVG_PROFIT=(&PF11*_N11+&PF10*_N10+&PF01*_N01+&PF00*_N00)/(N1+N0);
      output &EM_USER_PROFIT;
      attrib PPV label='Cumulative Gain (Pos. Predicted Value)';
      attrib _PVN label='Negative Predicted Value';
      attrib SENSITIVITY label='Sensitivity';
      attrib SPECIFICITY label='Specificity';
      attrib _1minusSPECIFICITY label='1-Specificity';
      attrib OVERALL_AVG_PROFIT label='Overall Avg. Profit';

To calculate the c statistic (the area under the ROC curve), you can calculate the areas of a series of trapezoids and sum them up. To do this, you will need the current values of the sensitivity and specificity as well as the last values of these statistics. The LAG function enables you to use the current and most recent values of these statistics in one DATA step:

      oldsens=lag(sensitivity);
      old1msp=lag(_1minusspecificity);
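The trapezoid summation used for the c statistic can be illustrated independently of the DATA step. The sketch below is not a translation of the SAS program: the function name and the ROC points are hypothetical, and the curve is assumed to start at the origin (no cases selected) and end at (1, 1), with points ordered by increasing 1 - specificity.

```python
def c_statistic(roc_points):
    """Area under an ROC curve by the trapezoid rule.

    roc_points: list of (one_minus_specificity, sensitivity) pairs ordered
    from the origin (0, 0) to the point (1, 1).
    """
    area = 0.0
    for (x_prev, y_prev), (x_curr, y_curr) in zip(roc_points, roc_points[1:]):
        # Trapezoid between consecutive points: width times average height.
        area += (x_curr - x_prev) * (y_curr + y_prev) / 2
    return area

# Hypothetical ROC points for illustration.
example = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]
print(f"c = {c_statistic(example):.2f}")

# Reference cases: perfect separation and no lift.
print(c_statistic([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))
print(c_statistic([(0.0, 0.0), (1.0, 1.0)]))
```

Each term pairs the current point with the previous one, which is the role the lagged sensitivity and lagged 1 - specificity values play in the DATA step.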

114 1-106 Chapter 1 Basic Predictive Modeling For the first record of each model/data role combination, the sensitivity is 0 and specificity is 1. Thus, that point will not add anything to the area under the curve. If the record being processed is not the first such record, then you can calculate the area of the appropriate piece under the ROC curve. Summing these consecutively gives you a running total of the c statistic calculations. if not first.datarole then do; c=sum(c,(_1minusspecificity-old1msp)*(sensitivity+oldsens)/2); end; Of course, these calculations are only of interest when you have added all of them together. At the last record for each model/data role combination, output the c statistic into the CSTAT data set. The RUN statement ends this DATA step. if last.datarole then output work.cstat; run; The %EM_REPORT macro is used to register and define reports generated by a code node. The KEY= key points to an object that has already been registered with SAS Enterprise Miner. The DESCRIPTION= setting controls the text at the top of the report window. VIEWTYPE= specifies what kind of graph you would like in your report. X= and Y= specify the category and response variables for your graph, respectively. GROUP= specifies how you would like to group your results. BY= names the variable you would like to use to separate out your results. AUTODISPLAY=Y makes sure that your graph is displayed when you open the node s results browser. %em_report(key=profit,description=model Assessment, viewtype=lineplot,x=depth,y=overall_avg_profit, group=model,by=datarole,autodisplay=y); This is all you need to generate the enhanced profit charts. To generate a table of models by data roles, filled in with comparison statistics, use the rest of the program, below. The SQL procedure creates a data set called FITSTAT that saves the average squared error (ASE) and the overall average profit for all models, for the training and validation data sets. 
(&EM_IMPORT_REPORTFIT points to a data set that has fit statistics for the models of interest, compiled by the Model Comparison node.)

    proc sql;
       create table work.fitstat as
       select MODEL, STAT, TRAIN, VALIDATE
       from &EM_IMPORT_REPORTFIT
       where STAT='_ASE_' or STAT='_APROF_'
       order by MODEL;
    quit;

In order to merge these statistics to the c statistics calculated earlier, you need to transpose the data. This is accomplished with the TRANSPOSE procedure.

    proc transpose data=work.fitstat
                   out=work.tfitstat (drop=_label_ rename=(_name_=datarole));
       by Model;
       var TRAIN VALIDATE;
       id STAT;
    run;

The SQL procedure merges the ASE and overall average profit with the c statistic, for each combination of model and data role.

    proc sql;
       create table work.statsummary as
       select a.*, b.c
       from work.tfitstat a, work.cstat b
       where a.model=b.model and a.datarole=b.datarole;
    quit;

The LINESIZE option controls the width of the output.

    options linesize=96;

The TABULATE procedure generates cross-tabulation style reports. The NOSEPS option eliminates extra horizontal separation lines. The CLASS variables are the groups of interest, each model by each data role. The VAR variables are the report variables. The TABLE statement controls the generation of the report and the format of the displayed results. The CONDENSE option forces large tables to print out logically. The KEYLABEL statement specifies what summary of the VAR variables is of interest; because you have only one record per model/data role combination, this is almost irrelevant. However, the KEYLABEL statement also suppresses the word Sum from the output, for aesthetic reasons.

    proc tabulate data=work.statsummary noseps;
       class DATAROLE MODEL;
       var c _ASE_ _APROF_;
       table MODEL,
             DATAROLE*(c='c'*f=8.2 _ASE_='ASE'*f=8.3
                       _APROF_='Average Profit (MAX)'*f=8.4) / condense;
       keylabel sum=' ';
    run;
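The reshaping performed by the TRANSPOSE and SQL steps can be pictured with a small Python sketch. This is an illustration only: the rows, the model name, and the helper names `transpose` and `merge` are hypothetical, and the data role values in the c statistic rows are assumed to match the TRAIN and VALIDATE column names.

```python
# Long-format fit statistics: one row per model and statistic
# (hypothetical values in the MODEL/STAT/TRAIN/VALIDATE layout)
fitstat = [
    {"MODEL": "Reg", "STAT": "_ASE_", "TRAIN": 0.24, "VALIDATE": 0.25},
    {"MODEL": "Reg", "STAT": "_APROF_", "TRAIN": 0.16, "VALIDATE": 0.15},
]
# One c statistic per model and data role (hypothetical values)
cstat = [
    {"MODEL": "Reg", "DATAROLE": "TRAIN", "c": 0.62},
    {"MODEL": "Reg", "DATAROLE": "VALIDATE", "c": 0.60},
]


def transpose(rows, roles=("TRAIN", "VALIDATE")):
    """One output row per (model, data role); each statistic becomes a column."""
    wide = {}
    for row in rows:
        for role in roles:
            key = (row["MODEL"], role)
            wide.setdefault(key, {"MODEL": row["MODEL"], "DATAROLE": role})
            wide[key][row["STAT"]] = row[role]
    return wide


def merge(wide, cstat_rows):
    """Inner join on (MODEL, DATAROLE) that adds the c statistic."""
    merged = []
    for row in cstat_rows:
        key = (row["MODEL"], row["DATAROLE"])
        if key in wide:
            merged.append({**wide[key], "c": row["c"]})
    return merged


summary = merge(transpose(fitstat), cstat)
for row in summary:
    print(row)
```

Each row of `summary` then holds the ASE, the average profit, and the c statistic for one model and data role, which is the shape the TABULATE step expects.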

1.9 Deploying a Predictive Model

Slide: Deploying a Predictive Model (Input Measurements; Scoring Code From Predictive Model; Expected Target Value)

After training and comparing predictive models, one model is selected to represent the association between the inputs and the target. After it has been selected, this model must be put to use. A scoring recipe generated from the fitted model and applied to suitable input measurements accomplishes this deployment.

SAS Enterprise Miner offers two options for model deployment: scoring code modules and scored data sets. Scoring code modules are used to generate predicted target values in environments outside of SAS Enterprise Miner. Release 5.1 of SAS Enterprise Miner can create scoring code in the SAS, C, PMML, and (experimentally) Java programming languages. The SAS language code can be embedded directly in a SAS application to generate predictions. The C and Java language code must be compiled. The C code should compile with any C compiler that supports the ISO/IEC 9899 International Standard for Programming Languages -- C. The PMML module can be utilized by any RDBMS adhering to the PMML 2.1 standard.

Invoking the SAS language scoring code from within SAS Enterprise Miner achieves the second deployment option: scored data sets. Using a Data Source node, you identify a data set with the required input variables. SAS Enterprise Miner generates predictions using the scoring recipe prescribed by the selected model. A copy of the scored data set is stored in the private intermediate data repository of SAS Enterprise Miner. If the data set to be scored is very large, you should consider scoring the data outside the SAS Enterprise Miner environment.
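As a rough picture of what a scoring code module does, the sketch below applies a fitted logistic model to one record of input measurements and derives a decision from expected profit. Everything specific here is an assumption made for illustration: the weights, the input names `x1` and `x2`, the profit values, and the function name `score` are hypothetical and are not generated by SAS Enterprise Miner.

```python
import math

# Hypothetical fitted weights for a two-input logistic model
WEIGHTS = {"intercept": -1.0, "x1": 0.8, "x2": -0.5}

# Hypothetical profit of acting on a case, by true outcome
PROFIT_IF_EVENT = 10.0
PROFIT_IF_NONEVENT = -1.0


def score(record):
    """Return (event probability, decision) for one input record."""
    logit = (WEIGHTS["intercept"]
             + WEIGHTS["x1"] * record["x1"]
             + WEIGHTS["x2"] * record["x2"])
    p = 1.0 / (1.0 + math.exp(-logit))          # inverse of the logit link
    expected_profit = p * PROFIT_IF_EVENT + (1 - p) * PROFIT_IF_NONEVENT
    decision = 1 if expected_profit > 0 else 0  # act only when profitable on average
    return p, decision


print(score({"x1": 0.0, "x2": 0.0}))
```

The point of the sketch is that scoring needs only the input measurements and the fitted recipe; the target itself is not required.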

Creating a Scored Data Set

SAS Enterprise Miner creates scored data sets using a combination of Data Source, Score, and SAS Code nodes.

1. Define a data source for the PVA_SCORE_DATA as in Section 1.2. In the last step of the data source definition, define the role of the data source as Score.

The PVA_SCORE_DATA is an independent sample from the population of lapsing donors. Unlike the PVA_RAW_DATA, it is not separately sampled. The data may also be used as an independent test set for performance evaluation.

2. Add the PVA_SCORE_DATA Data Source, Score, and SAS Code nodes as shown below. Connecting the Score node to the Regression node selects the Regression model for deployment.

3. Run the Score node and view the results. The Results-Score window opens.

The Output window provides facts about the model fit. Of most interest is the fact that the regression model selects just under half the population for solicitation. This proportion differs slightly from the training and validation proportions due to oversampling.

You can use the SAS Code node to place the scored data in the location of your choice.

1. Select the SAS Code node and select Update Path.

2. Select SAS Code in the properties sheet. The SAS Code window opens.

3. Select the Macro Variables tab.

These macro variables enable you to select objects in the private data repository in SAS Enterprise Miner using standardized names. For example, you can reference the scored PVA_SCORE_DATA using the generic macro variable &EM_IMPORT_SCORE. Entering the following program in the SAS Code tab places the scored data in the SAS data set MYDATA.SCORED_DATA.

    data MYDATA.SCORED_DATA;
       set &EM_IMPORT_SCORE;
    run;

Score Code Files

A reasonable alternative to scoring data within SAS Enterprise Miner is to score data outside of it with a set of score code. There are several options available to accomplish this. You can save the code as a SAS, C, PMML, or Java program.

1. Select the Results window of the Score node.

The Score node features two types of score code: flow score code and path score code. The flow score code includes variables that depend on the target (such as residuals), whereas the path score code does not. The main practical difference is that path score code, by virtue of its less complicated output, is intended for use with production scoring systems.

2. Select View > SAS Code > Path Flow Score Code. The Path Flow Score Code window opens.

This window contains all the DATA step code necessary to transform data structured identically to the training data into model predictions and model residuals. You must add DATA and SET statements to use this code.

3. Select File > Save As to save this code to a location of your choice.

If you do not want the model residuals to be included in the score code, use the following option in place of step 2, above: Select View > SAS Code > Path Publish Score Code.

The Score node appends two variables, EM_EVENTPROBABILITY and EM_DECISION, to all scored data sets. This allows the creation of standardized code for model deployment. EM_EVENTPROBABILITY can be used to refer to the probability of the primary target level, independent of target variable name and target level encoding.

EM_DECISION can be used to refer to the model decision (based on the profit matrix), once more independent of the target variable name.

C and Java Scoring Code

A similar process can be used to save C or Java scoring code.

1. Select View > SAS Code > C Score Code. The C Score Code window opens. The first element of the C Score window shows XML-based declarations for the variables used in subsequent C code.

2. Select the Next button. The C Score Code window updates to show the DB2 version of C code to be compiled.

3. Select the Next button. The C Score Code window updates to show a listing of the C score metadata files. These files can be useful for compiling the C source code.

4. Select the Next button. The C Score Code window updates to show a listing of the ANSI C code of the model.

To use the code, save the code to the desired location, modify as desired, and compile.

1.10 Summarizing the Analysis

Process Flow Summary: Data Source (Select training data; Define metadata; Define Decision Processing)

In this chapter, you created a diagram that can be used as a template for many predictive modeling tasks. As you built the diagram, many changes were made to the default settings of SAS Enterprise Miner tools. The slides in this section summarize the changes made to each node.

Section 1.2 describes the data selection process. Section 1.3 shows how to define a diagram's metadata. Decision processing is discussed in Section 1.4 (prior probabilities) and 1.5 (profit matrices).

Process Flow Summary: Data Partition (Set partition proportions; Automatic Stratification)

Setting the data partition node is described in Section 1.3. Note the automatic use of stratification.

Process Flow Summary: Impute (Indicate imputation; Set indicator role)

Imputation and adding imputation indicators are shown in Section 1.6. (Additional replacement methods are discussed in Chapter 3.)

Process Flow Summary: Regression (Variable selection)

Section 1.6 introduces the Regression tool. Tuning a regression model by input selection is detailed in Section 1.7. (Additional options for the Regression tool are discussed in Chapter 2.)

Process Flow Summary: Decision Tree (Interactive tree viewer)

Basic Decision Tree use is described in Section 1.3. The effects of priors and profits on the Tree algorithm are shown in Sections 1.4 and 1.5. (Details of the Tree tool algorithm and their use are discussed in Chapter 3.)

Process Flow Summary: Model Comparison (Chart Data Options)

Process Flow Summary: SAS Code (Import File; %EM_REGISTER; %EM_REPORT)

The Assessment node and modifications to profit data are described in Section 1.8.

Process Flow Summary: Score (Publish Path/Flow score code; C/Java score code)

Creating a scored data set and a SAS scoring module is discussed in Section 1.9.

Process Flow Summary: Diagram (Create SPK file; Export path as SAS Program)

Diagram level reports and batch processing are detailed in the following demonstrations.

Reporting Results

SAS Enterprise Miner provides a means to summarize your modeling efforts in a proprietary format called a SAS Package, or SPK file. SPK files can be read either within the SAS Enterprise Miner client (File Open Model Package) or via a separate application called the SAS Package Reader. For more information about the SAS Package Reader, see

A SAS Package can be created from any node in your diagram. The diagram paths leading to the selected node are summarized in the SPK file. Connecting all terminal nodes in your diagram to a Control Point node results in a complete diagram summary. However, each path leading to the Control Point is stored in a separate SPK file.

The major advantage of an SPK summary is that a single file contains all the information required to recreate an analysis path. All connections, node settings, and analysis results are stored in a relatively compact file.

Create an SPK File

1. Right-click the SAS Code node and select Create Model Package. An Input window opens.

2. Enter a name for the model package. The corresponding SPK file is stored in the Reports folder of the current project.

3. Select OK. Summary information is captured from each node in the path leading to the SAS Code node.

4. Open the Model Packages entry in the Project Panel. The newly created Model Package is listed.

5. Right-click the newly created Model Package and select Open. A Package viewer window opens showing the path leading to the SAS Code node. Clicking any of the nodes opens the Results window for the node.

You can reconstitute an actual SAS Enterprise Miner diagram from a model package by opening the package in the Enterprise Miner client and selecting Actions > Recreate Diagram.

When using the SAS Enterprise Miner client with an installed SAS solution, you can register the model package in the SAS Model Repository. This effectively publishes the model for use with other solutions. To register the package, right-click the package name and select Register. You can then identify the model repository where you want to place the model. The created model is then available for use in other SAS solutions.

Creating a Batch Analysis Program

SAS Enterprise Miner 5.1 batch processing is a macro-based interface to the SAS Enterprise Miner 5.1 client/server environment that operates without running the SAS Enterprise Miner GUI (Graphical User Interface). Batch processing supports the building, running, and reporting of the SAS Enterprise Miner 5.1 process flow diagrams. The same diagram can be run from either the SAS Enterprise Miner 5.1 GUI or from a batch job. The results can be viewed in the SAS Enterprise Miner 5.1 GUI or integrated into a reporting SAS program.

SAS Enterprise Miner 5.1 batch processing code is not designed to be submitted to SAS Enterprise Miner through the SAS Enterprise Miner GUI Program Editor. Instead, SAS Enterprise Miner 5.1 batch processing code should be submitted in a SAS batch job or submitted through the SAS Program Editor window.

All SAS Enterprise Miner 5.1 actions have batch interfaces. SAS Enterprise Miner 5.1 produces batch code for process flows built in the GUI, or process flow diagrams can be manually coded by experienced SAS Enterprise Miner users. The macro interface used for batch processing in SAS Enterprise Miner 5.1 is compatible with all SAS Enterprise Miner 5.1 file structures and SAS language capabilities. These are the tools users need to automate creation and execution of a data mining analysis.

With batch processing, you can
- schedule processor-intensive SAS Enterprise Miner process flow diagrams for off-peak processing hours
- automate daily, weekly, or monthly SAS Enterprise Miner process flow diagram runs and model training
- automate event-driven SAS Enterprise Miner process flow diagram runs and model training
- automate regular ETL (Extract, Transform, Load) data operations for SAS Enterprise Miner
- create data mining templates for analysts and business users.

The batch processing tool is intended for use by statisticians and programmers who have strong experience in writing SAS code and building SAS Enterprise Miner models. You can create a batch file from an existing diagram or from scratch. Creating a batch file from scratch is beyond the scope of this course. The process is summarized in the SAS Enterprise Miner online documentation under the heading Batch Processing. Creating a batch file from an existing diagram is similar to creating a model package. 1. Right-click the SAS Code node and select Export Path as SAS Program. The Export Path as SAS Program window opens.

2. Enter a location on the client machine for the batch file. The batch file with all analysis steps is saved to the specified location. You can edit this file and submit it to the SAS Enterprise Miner server independently of the SAS Enterprise Miner client interface.

Chapter 2 Flexible Parametric Modeling

2.1 Defining Flexible Parametric Models
2.2 Constructing Neural Networks
2.3 Deconstructing Neural Networks


2.1 Defining Flexible Parametric Models

Standard Logistic Regression Models

$$\log\left(\frac{p}{1-p}\right) = w_{0} + w_{01} x_1 + w_{02} x_2$$

A standard logistic regression model assumes that the logit(p) is a linear combination of the inputs. This causes the logit(p) to increase in a direction specified by the model weights. The decision boundary for such a model is a plane perpendicular to this direction of increase. This is an extremely restrictive assumption that works remarkably well in practice. Even when the assumption is wrong, such a model can give useful predictions.

Polynomial Logistic Regression Models

$$\log\left(\frac{p}{1-p}\right) = w_{0} + w_{01} x_1 + w_{02} x_2 + w_{11} x_1 x_1 + w_{22} x_2 x_2 + w_{12} x_1 x_2$$

However, an incorrectly specified model never generalizes as well as a correctly specified one. If the association between the inputs and the target is not a linear combination of the inputs, you want to reflect this in your predictive model. This is the goal of flexible parametric modeling. One of the simplest flexible parametric approaches involves enhancing a standard regression model with nonlinear and interaction terms to create a polynomial regression model.

In polynomial regression, a typical nonlinear modeling term is the square of an input, for example, $x_1 x_1$. A typical interaction term is a product of two inputs, $x_1 x_2$. Adding all two-way combinations of inputs yields a quadratic regression model. Quadratic regression models are much more flexible than standard regression models. The flexibility comes at a price: with p inputs in the model, there are p(p + 1)/2 two-way input combinations. This tends to rapidly overwhelm regression modeling procedures.

Defining Nonlinearities and Interactions

Standard logistic regression models assume a linear and additive relationship between inputs and the logit of the target. While this assumption suffices for many modeling scenarios, ignoring any existing nonlinearities and interactions reduces model performance. This demonstration shows how to modify a standard model to account for known nonlinearities.

1. Connect another Regression node to the Impute node.

2. Right-click the new Regression node and select Rename. Enter the new name Polynomial Regression.

The Score data source and associated nodes have been omitted for clarity.

To build a polynomial regression model, you must explicitly define the nonlinearities and interactions for use in the model. The Equation section of the Regression node Property panel enables you to specify a polynomial regression model by creating all two-factor (input) interactions and polynomial terms. Unfortunately, the Two-Factor Interactions setting is limited to class inputs only.

The Term Editor enables you to create interval input interactions by hand. However, given that the model has 37 interval inputs, this option will force the specification of 666 terms. Besides being extremely tedious and computationally demanding, such a complex model is certain to overfit the training data. A sequential selection process to avoid overfitting further increases the computation time. In short, even if you could easily specify the model, the curse of dimensionality often allows for too many input combinations in a polynomial regression model.
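The arithmetic behind these counts can be checked with a short Python sketch. The function names are hypothetical and chosen only for illustration. With 37 inputs there are 37 x 36 / 2 = 666 distinct pairwise interactions; adding the 37 squared terms gives 37 x 38 / 2 = 703 second-order terms in all.

```python
from itertools import combinations


def quadratic_term_counts(p):
    """Return (pairwise interactions, interactions plus squares) for p inputs."""
    interactions = p * (p - 1) // 2      # distinct products x_i * x_j with i < j
    squares = p                          # x_i * x_i for each input
    return interactions, interactions + squares


def quadratic_terms(names):
    """List every second-order term as a pair of input names."""
    squares = [(name, name) for name in names]
    interactions = list(combinations(names, 2))
    return squares + interactions


print(quadratic_term_counts(37))
print(quadratic_terms(["x1", "x2"]))
```

For the two-input model shown on the slides, the term list has three entries, matching the three second-order terms in the polynomial logistic regression equation.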

Situations like this are usually addressed by one of two approaches:
- Restrictions are placed on the model search to make it more tractable.
- The modeling method is abandoned in favor of another technique.

This demonstration illustrates one of many possible restrictions on the model search. The standard regression model fit in the previous chapter is expanded to include all second order polynomial terms. Another sequential selection method is applied to select useful polynomial terms. By reducing the number of inputs used to create the interactions, the process once more becomes tractable.

Unfortunately, this process uses only those inputs found important in the first regression. There is a good chance that many interactions will be missed. On the other hand, increasing the flexibility of a standard regression model often (marginally) improves predictive performance.

1. Delete the connection between the Impute node and the Polynomial Regression node.

2. Draw a connection from the Regression node to the Polynomial Regression node.

3. Select the Polynomial Regression node and select Variables from the Properties panel.

Note that most of the inputs have their role set to Rejected. This results from the stepwise selection used in the first Regression node. Only inputs selected by the procedure have their input roles preserved.

4. Select OK to close the Variables window.

5. Set the Two-Factor Interaction and Polynomial Terms options to Yes in the Properties panel.

Recall that the Two-Factor Interaction option only creates class variable interactions. Your instructor can show you how to use the Term Editor to interact all inputs.

6. Set the Selection Model to Stepwise.

7. Run the Polynomial Regression node and view the results.

Model Results

1. Select the Fit Statistics window.

The overall average profit on the validation data is slightly higher than that of the standard regression model. While not earth shattering, it is an improvement.

2. Maximize the Output window and scroll to the end of the report.

Here, two of the terms added to the model are polynomials. Apparently, with this set of inputs, the effects on the MONTHS_SINCE_FIRST_GIFT and the RECENT_AVG_GIFT_AMT dimensions are nonlinear. This somewhat complicates the interpretation of the model because the odds ratio of each of these inputs increases with the magnitude of the inputs. In general, increasing the capacity of a model to capture complex input/target associations reduces interpretability of the results.

Analysis of Maximum Likelihood Estimates

Columns: Parameter, DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq, Standardized Estimate, Exp(Est)

Parameters:
Intercept
FREQUENCY_STATUS_97NK
IMP_INCOME_GROUP
MEDIAN_HOME_VALUE
MONTHS_SINCE_FIRST_GIFT
MONTHS_SINCE_LAST_GIFT
M_INCOME_GROUP
PEP_STAR
RECENT_AVG_GIFT_AMT
RECENT_CARD_RESPONSE_PROP
MONTHS_SINCE_FIRST_GIFT*MONTHS_SINCE_FIRST_GIFT
RECENT_AVG_GIFT_AMT*RECENT_AVG_GIFT_AMT

1. Close the Results window and connect the Polynomial Regression node to the Model Comparison node.

2. Run the SAS Code node following the Model Comparison node and view the results.

The Overall Average Profit curves for the two regression models are virtually identical, with a slight bump in validation profit for the Polynomial model (Reg2) in the 50th percentile.

While flexible regression models offer the possibility of improved model performance, such an improvement is not guaranteed, especially if the association between inputs and the logit of the target is not well described by a second-degree polynomial. Perhaps an even more flexible parametric model (such as a neural network) will offer additional improvement.

2.2 Constructing Neural Networks

Neural Network Model

$$\log\left(\frac{p}{1-p}\right) = w_{00} + w_{01} H_1 + w_{02} H_2 + w_{03} H_3$$

$$\tanh^{-1}(H_1) = w_{10} + w_{11} x_1 + w_{12} x_2$$

$$\tanh^{-1}(H_2) = w_{20} + w_{21} x_1 + w_{22} x_2$$

$$\tanh^{-1}(H_3) = w_{30} + w_{31} x_1 + w_{32} x_2$$

With their exotic-sounding name, neural network models (formally multi-layer perceptrons) are often regarded as a mysterious and powerful predictive modeling technique. The most typical form of the model is, in fact, a natural extension of a regression model.

A neural network can be thought of as a generalized linear model on a set of derived inputs. These derived inputs are themselves a generalized linear model on the original inputs. The usual link for the derived input's model is inverse hyperbolic tangent, a shift and rescaling of the logit function.

What makes neural networks interesting is their ability to approximate virtually any continuous association between the inputs and the target. You simply need to specify the correct number of derived inputs.

Neural Network Diagram (inputs x_1 and x_2; hidden layer H_1, H_2, and H_3; target p)
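The network equations can be evaluated directly. The Python sketch below computes the three derived inputs with tanh and then the predicted probability from the logit. The weight values and the function name `predict` are hypothetical and serve only to make the equations concrete; they are not estimates from the course data.

```python
import math


def predict(x1, x2, hidden_weights, output_weights):
    """Evaluate a one-hidden-layer network with tanh hidden units.

    hidden_weights: one (bias, weight for x1, weight for x2) triple per hidden unit
    output_weights: [bias, weight for H1, weight for H2, ...]
    """
    # Derived inputs: H_j = tanh(w_j0 + w_j1 * x1 + w_j2 * x2)
    hidden = [math.tanh(w0 + w1 * x1 + w2 * x2)
              for (w0, w1, w2) in hidden_weights]
    # logit(p) = w_00 + w_01 * H_1 + w_02 * H_2 + ...
    logit = output_weights[0] + sum(
        w * h for w, h in zip(output_weights[1:], hidden))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical weights for three hidden units
hidden_weights = [(0.1, 0.5, -0.3), (-0.2, 0.4, 0.6), (0.0, -0.7, 0.2)]
output_weights = [-0.5, 1.0, -1.5, 0.8]
print(predict(1.0, 2.0, hidden_weights, output_weights))
```

Setting every hidden-unit weight to zero makes each H equal to zero, so the prediction collapses to the intercept-only value, which shows how the network extends an ordinary logistic regression.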

Multi-layer perceptron models were originally inspired by neurophysiology and the interconnections between neurons. The basic model form arranges neurons in layers. The first layer, called the input layer, connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the target, or output, layer.

The structure of a multi-layer perceptron lends itself to a graphical representation called a network diagram. Each element in the diagram has a counterpart in the network equation.

Neural Network Training

As with all parametric models, the fundamental task with a fixed model structure is to find a set of parameter estimates that approximate the association between the inputs and the expected value of the target. This is done iteratively. The model parameters are given random initial values, and predictions of the target are computed. These predictions are compared to the actual values of the target via an objective function.

The actual objective function depends on the assumed distribution of the target, but conceptually the goal is to minimize the difference between the actual and predicted values of the target. An easy-to-understand example of an objective function is the mean squared error (MSE) given by

$$\mathrm{MSE} = \frac{1}{N} \sum_{\text{training cases}} \left( y_i - \hat{y}_i(\hat{w}) \right)^2$$

where
$N$ is the number of training cases.
$y_i$ is the target value of the ith case.
$\hat{y}_i$ is the predicted target value.
$\hat{w}$ is the current estimate of the model parameters.

Training proceeds by updating the parameter estimates in a manner that decreases the value of the objective function.
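To make the idea of an update that decreases the objective concrete, the following Python sketch computes the MSE for a one-parameter model and takes a single gradient descent step. This is an assumption-laden illustration: the model (prediction equals w times x), the data, the learning rate, and the use of plain gradient descent are all chosen for simplicity and are not the optimization technique used by the Neural Network node.

```python
def mean_squared_error(targets, predictions):
    """Average squared difference between actual and predicted target values."""
    n = len(targets)
    return sum((y - yhat) ** 2 for y, yhat in zip(targets, predictions)) / n


def gradient_step(w, xs, ys, rate=0.1):
    """One gradient descent update for the toy model yhat = w * x."""
    n = len(xs)
    # Derivative of the MSE with respect to w
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / n
    return w - rate * grad


# Hypothetical training cases generated by y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_old = 0.0
w_new = gradient_step(w_old, xs, ys)
mse_old = mean_squared_error(ys, [w_old * x for x in xs])
mse_new = mean_squared_error(ys, [w_new * x for x in xs])
print(mse_old, mse_new)
```

The second value printed is smaller than the first, which is all that an iteration of training is required to achieve.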

Neural Network Training Convergence

Training concludes when small changes in the parameter values no longer decrease the value of the objective function. The network is said to have reached a local minimum in the objective.

Training Overgeneralization

A small value for the objective function, when calculated on training data, need not imply a small value for the function on validation data. Typically, improvement on the objective function is observed on both the training and the validation data over the first few iterations of the training process. At convergence, however, the model is likely to be highly overgeneralized and the values of the objective function computed on training and validation data might be quite different.

Neural Network Final Model

To compensate for overgeneralization, the overall average profit, computed on validation data, is examined. The final parameter estimates for the model are taken from the training iteration with the maximum validation profit.
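The selection rule described here amounts to keeping the parameter estimates from every iteration and choosing the iteration whose validation profit is largest. A minimal Python sketch follows; the function name `stopped_training`, the placeholder weight labels, and the profit values are hypothetical.

```python
def stopped_training(weight_history, validation_profit):
    """Return (iteration index, weights) with the highest validation profit.

    weight_history[i] holds the parameter estimates after iteration i and
    validation_profit[i] the overall average profit they earn on validation data.
    Ties go to the earliest iteration.
    """
    best = max(range(len(validation_profit)),
               key=validation_profit.__getitem__)
    return best, weight_history[best]


# Hypothetical history: profit rises, peaks, then falls as the fit degrades
weights = ["w_iter0", "w_iter1", "w_iter2", "w_iter3", "w_iter4"]
profit = [0.10, 0.14, 0.16, 0.15, 0.13]
print(stopped_training(weights, profit))
```

Note that the chosen iteration need not be the one with the smallest validation error, because profit and error are different criteria.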

Constructing Neural Networks

A small improvement in overall average profit was observed when shifting from a standard regression model to a second order polynomial regression model. Neural network models offer even more flexibility than polynomial regressions. Will they improve predicted profit even more?

1. Connect a Neural Network node to the Impute node.

The number of hidden neurons determines model complexity. A minimum of three hidden units is required before substantial differences from second order polynomial regression models are possible.

2. Select View > Property Sheet > Advanced. The Property panel expands to reveal the default number of hidden units (three, in this case).

The Advanced Property Sheet contains many useful options and features in SAS Enterprise Miner.

3. Run the Neural Network node with the default settings and view the results.

In addition to the usual Score Rankings, Fit Statistics, and Output windows, the Results window features a graph showing the average square error versus training iteration. The stopped training algorithm uses the fifth iteration for parameter weights. While subsequent iterations achieve lower validation error, this iteration has maximum validation profit (as can be confirmed by changing the response variable in the Graphs window). The Fit Statistics window shows a slightly higher validation profit than obtained for the standard regression model.

The window also shows that the model has more than 200 parameters. Stopped training limits the model's ability to overfit (this will be studied in detail in Section 2.3). Nevertheless, better results should be possible by eliminating irrelevant inputs. This can be achieved by once more incorporating only those inputs selected by the original regression.
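A parameter count of this size follows from the network equations: every hidden unit has one weight per input plus a bias, and the output has one weight per hidden unit plus a bias. The Python sketch below assumes a single hidden layer, one output, and no direct input-to-output connections; the function name and the input count of 70 are hypothetical, and a class input would contribute one input per dummy variable.

```python
def mlp_parameter_count(n_inputs, n_hidden):
    """Number of weights in a one-hidden-layer network with a single output."""
    hidden_params = n_hidden * (n_inputs + 1)   # weights plus bias per hidden unit
    output_params = n_hidden + 1                # weights plus bias for the output
    return hidden_params + output_params


# The two-input, three-hidden-unit network shown on the slides
print(mlp_parameter_count(2, 3))
# A hypothetical network with 70 inputs and three hidden units
print(mlp_parameter_count(70, 3))
```

Because the count grows with the product of inputs and hidden units, removing irrelevant inputs reduces the number of parameters even when the number of hidden units stays the same.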

Network Complexity and Flexibility

There are several ways to control the complexity and, therefore, the flexibility of a neural network model. As in regression, you must choose the correct inputs and how these inputs interact with respect to the target. Input selection is a problem for any model and is usually handled via some type of heuristic algorithm. The interaction of inputs with respect to the target for neural network models is handled by controlling the number of hidden units.

In regression models, the heuristic algorithm used for input selection is often one of the sequential procedures discussed in Chapter 1. While sequential selection procedures like stepwise are known for neural networks, the computational costs of their implementation tax even the fastest computers. These procedures, therefore, are not part of SAS Enterprise Miner.

Note that the input selection problem is no different than that for the polynomial regression model in the previous section. There, the problem was addressed by using the inputs selected by the standard regression model as inputs for the polynomial regression model. A similar approach will be tried here.

Other variable selection methods can be used to pick modeling inputs. Chapter 3 shows how to configure a decision tree to be an effective input selection tool.

1. Disconnect the Neural Network node from the Impute node.

2. Connect the Neural Network node to the Regression node as shown.

3. Right-click the Neural Network node and select Update Path. This refreshes the node's metadata.

4. Select Variables from the Properties panel. Only the inputs selected by the Regression node have a role of Input.

5. Run the Neural Network node and view the results.

The Fit Statistics window again shows a profit slightly higher than the standard regression model but with significantly fewer parameters than the full network model of the previous demonstration.

To this point, network complexity has been addressed only by reducing the size of the input space. By changing the number of hidden units, you can also directly control the amount of flexibility the neural network possesses.

1. Select View > Property Sheet > Advanced to confirm the advanced property sheet.

2. Change the number of hidden units to

3. Run the Neural Network node and view the results.

The model fit statistics show a validation profit higher than any model seen thus far. The training profit is also higher, a common sign of overfitting.

The possibility of overfitting is also seen in a plot of overall average profit.

1. Close the Results window.
2. Connect the Neural Network node to the Model Comparison node.
3. Run the SAS Code node and view the results.

The new neural model does indeed produce much higher training profit and slightly higher validation profit than the other models.

While improvement in model performance can be realized by trying several neural networks with varying numbers of hidden units, it might be simply a matter of chasing a stochastic specter. The next section proposes another strategy for fitting neural network models that avoids guessing the correct number of hidden units. Using this strategy, neural networks become less of a parametric modeling approach and more of a modeling algorithm. The demonstration will also help you understand how neural networks avoid overfitting even when highly overparameterized.

2.3 Deconstructing Neural Networks

Although neural networks have been introduced as a generalization of standard regression models, practitioners often use them as a type of predictive algorithm. As is discussed in Chapter 3, predictive algorithms, as opposed to predictive models, assume little about the underlying structure of the model. They are designed to be extremely flexible and, without proper tuning, always overfit the training data. Their success depends on restricting (tuning) their flexibility to match the prediction problem at hand.

This section illustrates how neural network models can be used as predictive algorithms and how you can guard against overfitting.

Building a High Capacity Neural Network

To start, construct a neural network with a seemingly unreasonable number of parameters. This should result in a highly overgeneralized model.

1. Connect another Neural Network node to the Regression node.
2. Change the name of the added node to Big Neural Network.
3. View the Advanced property sheet.
4. Set the number of hidden units to 40.

This, it would seem, should vastly increase the likelihood of overfitting. The next two changes, however, will help to reduce this vastly overparameterized network's overfitting propensity.

1. Type 0.1 in the Randomization Scale field. This predictive algorithm approach to neural networks requires small starting values for the model parameters, typically between 0.01 and 0.2. The reason is discussed in the next demonstration.
2. Select DBLDog for the Training Technique property. The Double Dogleg technique combines the default training method (for the current network architecture) with the gradient descent method. Most importantly, it tends to take more steps to converge to a minimum training error than the default. Again, the reason this is important will be seen shortly.
3. Select User for the Architecture property.

4. Run the Big Neural Network node and view the results.

The validation overall average profit of about $0.16 is between the two previous neural network models. Note that this model has over 400 parameters, whereas the standard Regression has around 10. You would expect with so many parameters that the model would badly overfit. This, however, has not happened. Why not?
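The parameter count quoted above can be checked with a short calculation. The sketch below is written in Python rather than SAS so that it runs without Enterprise Miner. It counts the weights and biases of a network with one hidden layer and a single output. The function name and the choice of 9 inputs are illustrative assumptions; the notes say only that the regression has around 10 parameters.

```python
def mlp_parameter_count(n_inputs: int, n_hidden: int) -> int:
    """Count weights and biases in a one-hidden-layer network with one output."""
    hidden_weights = n_inputs * n_hidden   # input-to-hidden weights
    hidden_biases = n_hidden               # one bias per hidden unit
    output_weights = n_hidden              # hidden-to-output weights
    output_bias = 1
    return hidden_weights + hidden_biases + output_weights + output_bias


# Assumed for illustration: 9 inputs were selected by the Regression node.
print(mlp_parameter_count(n_inputs=9, n_hidden=40))  # 441
```

Under that assumption, 40 hidden units give 441 parameters, which is consistent with "over 400", while a regression on the same 9 inputs has 10 parameters (9 slopes and an intercept).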

Taming Overgeneralizations

The large number of parameters in neural network models tends to make them prone to overfitting. SAS Enterprise Miner, by default, takes steps to guard against this problem. Sometimes, however, model fit can be improved by adjusting some of the defaults.

This demonstration illustrates how a neural network's overfitting remedies work. Therefore, the focus (temporarily) shifts from building the best possible model to understanding some of the details of network optimization.

Set up the demonstration as follows.

1. Connect a Metadata node to the Impute node. The Decision Tree node has been removed for diagram clarity. The Metadata node allows changes to the metadata away from the Data Source node. The new metadata settings may be used by all subsequent nodes.
2. Select the Metadata node and open the Variables window.
3. Select Rejected as the new role for all variables.

4. Select LIFETIME_PROM and LIFETIME_CARD_PROM, and then select Input for the new role.
5. Select TARGET_B and then select Target for the new role.
6. Select OK to close the Variables window.

These settings create models with exactly two inputs. This enables you to see how the predicted values change as a function of these inputs. The first model to investigate is yet another neural network.

1. Connect a Neural Network node to the Metadata node.
2. Rename the Neural Network node Two Input Network.
3. Use the Advanced Properties options to set the number of hidden units.
4. Run the Two Input Network node and view the results.

The Plot tab shows the value of the objective function versus training iteration. Parameter estimates were taken from the first training iteration.

5. Right-click on the Graphs window and select Data Options.
6. Use the Data Options window to create a plot of Average Profit for the training and validation data.

The Profit plot shows that the network attains maximum validation profit on the first iteration. Notice that the training profit continues to increase past this point.
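The selection rule behind this plot can be sketched in a few lines. The Python below is not SAS and is not Enterprise Miner's internal code. It simply keeps the parameter estimates from the iteration with the highest validation profit. The function name and the profit histories are made up for illustration.

```python
from typing import Sequence


def best_iteration(validation_profit: Sequence[float]) -> int:
    """Return the index of the iteration with the highest validation profit.

    Ties go to the earliest iteration, so the least-trained candidate wins.
    """
    best = 0
    for i, profit in enumerate(validation_profit):
        if profit > validation_profit[best]:
            best = i
    return best


# Hypothetical histories: training profit keeps rising, validation profit peaks early.
training = [0.150, 0.155, 0.160, 0.166, 0.171]
validation = [0.152, 0.158, 0.154, 0.151, 0.147]

chosen = best_iteration(validation)
print(chosen, training[chosen], validation[chosen])
```

Training profit plays no role in the choice. It is reported only to show the gap that opens up between the two curves once the network starts fitting noise.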

Using the first iteration for parameter estimates seems somewhat strange; however, this apparently is where the validation profit is maximized for this network. As discussed above, the technique of early stopping is used to avoid overfitting.

What is the consequence of not using the first iteration for model parameter estimates? SAS Enterprise Miner 5.1 must be forced into making this happen by making the following change to the properties.

1. Close the Results window.
2. Select Yes for Use Current Estimates. Using current estimates for subsequent training continues the parameter (weight) estimation process from the last iteration of the initial estimation process.
3. Run the Two Input Network node and view the results.

By restarting the parameter estimation process from the last iteration of the previous estimation process, the highest validation profit is seen to occur on step zero. In fact, this corresponds to the 20th step overall. With only two inputs, it is possible to actually see what the model predictions from this 20th step look like.

1. Close the Results window.
2. Connect a SAS Code node to the Two Input Network node. Rename the SAS Code node 3D Plotter.

You will use the 3D Plotter node to produce an ActiveX plot of the predictions produced by the Two Input Network. To view this plot, you must have a browser and ActiveX plug-in installed on your client computer.

3. Select SAS Code from the 3D Plotter properties panel. The SAS Code window opens.
4. Open the program Two Input Network Surface Plot.sas in the SAS Code editor.

The program uses the SAS Output Delivery System with the G3D procedure to create an ActiveX HTML file of the Two Input Network predictions versus the inputs LIFETIME_PROM and LIFETIME_CARD_PROM.

In brief, the code begins by registering an HTML file with SAS Enterprise Miner.

    %em_register(key=myplot,type=file,extension=htm);

Next, a call to the %EM_ODSLISTON utility macro is executed. This macro opens an HTML output destination (procedure output is captured to the specified HTML file) and closes the listing destination (procedure output is not placed in the Output window). The macro variable &EM_USER_MYPLOT references the file registered in the first line of code.

    %em_odsliston(file=&em_user_myplot);

A graphics options statement resets all current graph settings and specifies that the output device is an ACTIVEX plug-in. The BORDER option draws a border around the output.

    goptions reset=all device=activex border;

The G3D procedure is called. The scored training data (&EM_IMPORT_DATA) is selected for plotting. A surface plot of the inputs versus the model predictions is requested. A maximum of 0.10 is specified for the z-axis.

    proc g3d data=&em_import_data;
       plot LIFETIME_PROM*LIFETIME_CARD_PROM=P_TARGET_B1 / zmax=0.10;
    run;
    quit;

The HTML destination is closed and the listing destination is restored.

    %em_odslistoff;

A DATA step outputs the location of the surface plot file to the Output window. By copying this location to the clipboard, you can view and interact with the created surface plot.

    data _null_;
       file PRINT;
       put "Plot location: &EM_USER_MYPLOT";
    run;

Use this program to visualize the Two Input Network predictions.

1. Select OK to close the SAS Code window.
2. Run the 3D Plotter node and view the results.

3. Copy the plot location appearing in the Output window to the clipboard.
4. Select Start > Run from the Windows taskbar. The Run window opens.
5. Paste the plot location into the Open field.
6. Select OK. An attractive three-dimensional surface plot opens in a browser window.

You can rotate this plot in three dimensions by holding the ALT key while clicking and dragging the pointer within the plot window.

The plot shows a complex association between the inputs and the predicted target values, somewhat reminiscent of a mountain pass. It is informative to contrast the appearance of neural network predictions with those of a standard logistic regression model.

1. Close the rotating plot browser window and the SAS Code Results window.

2. Connect a Regression node to the Metadata node, renaming it Two Input Regression.
3. Disconnect the 3D Plotter node from the Two Input Network node and reconnect it to the Two Input Regression node.
4. Run the 3D Plotter node. You need not view the results.
5. Select Start > Run from the Windows taskbar. The Run window opens with the file location previously pasted.
6. Select OK. The browser window now displays the predictions of the Two Input Regression.

Which model is correct? While this question is impossible to answer definitively, you can compare the profit performance of the two models on a set of validation data.

1. Connect a Model Comparison node to both Two Input models.
2. Copy and paste the SAS Code node used to adjust the Model Comparison profit.

3. Connect the copied SAS Code node to the second Model Comparison node.
4. Run the SAS Code node and view the results.

The neural network, with its high flexibility, is seen to outperform the lowly regression model on the training data. This belies its performance on the validation data, where the overall profit calculations show that, in general, the neural model's overall average profit is lower than the regression model's.

5. Close the Results window.

You can correctly argue that the comparison was unfair. The parameter estimates from the neural network model were intentionally adjusted to correspond to the final step of training instead of the step where validation profit was maximized. This was done to illustrate the consequences of overtraining a network model.

1. Select the Two Input Network node.
2. Select No for Use Current Estimates in the Properties panel.
3. Run the Two Input Network node and view the results.

The parameter estimates for the neural network model now correspond to their original values after training. How has this affected the appearance of the model?

4. Close the Results window.

5. Disconnect the 3D Plotter node from the Two Input Regression node and again connect it to the Two Input Network node.
6. Run the 3D Plotter node and produce the predicted values plot as before.

The Two Input Network predictions look very similar to the Two Input Regression predictions.

7. Close the rotating plot window.
8. Run the SAS Code node attached to the second Model Comparison node and view the results.

The models produce nearly identical profit plots on both the training and validation data sets.

While the Two Input Network model found a complex input/target association, most of this association was an artifact of the training data. By monitoring the model performance on a set of validation data, it was possible to stop the training when overgeneralization appeared. In this way, the Two Input Network model correctly mimicked the behavior of the simple Two Input Regression model.

SAS Enterprise Miner automatically implements this technique, called stopped training, to restrict the flexibility of neural network models. But will this technique always work?

1. Select the Two Input Network node.
2. Change the number of hidden neurons.
3. Run the Two Input Network node and view the results.

Stopped training takes the values of the model parameters from the second iteration.

4. Run the 3D Plotter node and view the results as before.

The Two Input Network again assumes a mountain-pass-like appearance. However, this time, the appearance is seen after stopped training.

1. Close the 3D Plotter window.
2. Run the SAS Code node connected to the Model Comparison node and view the results.

For a majority of depths, the Two Input Network model is yielding lower validation overall average profits than the Two Input Regression model. Unfortunately, this is true even after stopped training.

The Two Input Network model appears to be hopelessly overparameterized. It tries to use 61 parameters to model an association adequately modeled by 3. Will overparameterized neural network models always do worse than simpler models? Surprisingly, the answer is not necessarily. The problem here is caused more by an over-ambitious optimization algorithm than by an overparameterized model. The rapid changes in predicted target values characteristic of overgeneralized predictive models result from large weight values multiplying the inputs (think partial derivatives).

1. Select the 3D Plotter node and select SAS Code.
2. Right-click in the SAS Code area and open the SAS program Two Input Network Weight Plot.sas. The program listing is appended to the existing surface plotting code.

The purpose of the program is to plot the value of the Two Input Network weights by training iteration. This detail is provided by the imported data set referenced by the macro variable &EM_IMPORT_ESTIMATE. The details of the program are somewhat irrelevant in this demonstration, so only a sketch of the program's operation follows.

The CONTENTS procedure is used to create a list of variable names in the &EM_IMPORT_ESTIMATE data set. The list is filtered to include only the names of variables with weight estimates. These variable names are placed in a macro variable, &vars, using the SQL procedure.

    /* List the variables in the estimates data set */
    proc contents data=&em_import_estimate out=work.cont noprint;
    run;

    /* Keep only the names of the weight and bias estimates */
    data work.b;
       set work.cont;
       if name=:"lifetime" or name=:"bias" or name=:"h";
       keep name;
    run;

    /* Store the names, separated by blanks, in &vars */
    proc sql noprint;
       select name into :vars separated by ' '
       from work.b;
    quit;

A temporary data set, WORK.A, is created from the rows of the &EM_IMPORT_ESTIMATE data set that contain actual parameter estimates. A data set, &EM_USER_WEIGHT, is registered with the SAS Enterprise Miner server. The TRANSPOSE procedure fills &EM_USER_WEIGHT with a transposed version of WORK.A. The &EM_USER_WEIGHT data set has a column for the iteration of the estimate, a column labeling the weight, and a column with the actual weight value. A subsequent DATA step groups the weights by the labeling column into four distinct groups: weights connecting LIFETIME_PROM to the hidden units, weights connecting LIFETIME_CARD_PROM to the hidden units, biases, and weights connecting the hidden units to TARGET_B.

    data work.a;
       set &EM_IMPORT_ESTIMATE;
       if _type_='parms' and _NAME_~='_LAST_';
    run;

    %em_register(key=weight,type=data);

    /* One row per iteration and weight */
    proc transpose data=work.a(keep=_iter_ &vars)
                   out=&em_user_weight(rename=(col1=weight));
       by _ITER_;
       var &vars;
    run;

    /* Assign each weight to one of the four groups */
    data &EM_USER_WEIGHT;
       attrib WEIGHT_GROUP length=$20;
       set &EM_USER_WEIGHT;
       if _name_=: "LIFETIME_P" then WEIGHT_GROUP="LIFETIME_PROM";
       if _name_=: "LIFETIME_C" then WEIGHT_GROUP="LIFETIME_CARD_PROM";
       if _name_=: "BIAS" then WEIGHT_GROUP="BIAS";
       if _name_=: "H" then WEIGHT_GROUP="TARGET_B";
    run;

Finally, a plot is generated showing the change in weight values by iteration in each of the four weight groups.

    %em_report(key=weight,description=weights by Iteration,
               viewtype=lineplot,x=_iter_,y=weight,
               group=_label_,by=weight_group,autodisplay=y);

1. Close the SAS Code window.
2. Run the 3D Plotter node and view the results.
3. Maximize the Weights by Iteration plot.

In general, the weight values start small and get larger as the iteration number increases. This is good for modeling fine-grained associations unique to the training data. It is not good for producing a model that captures broad trends and generalizes well.

You can remedy this problem by initializing the weights to small values and making small changes to their values as training progresses. This is precisely the tactic used to fit the Big Neural Network model earlier in this section.

1. Close the Results window.
2. Select the Two Input Network node.
3. Type 0.1 in the Randomization Scale field. This predictive algorithm approach to neural networks requires small starting values for the model parameters, typically between 0.01 and 0.2. The reason is discussed in the next demonstration.

4. Select DBLDog for the Training Technique property. The Double Dogleg technique combines the default training method (for the current network architecture) with the gradient descent method. Most importantly, it tends to take more steps to converge to a minimum training error than the default. The reason that this is important will be seen shortly.
5. Select User for the Architecture property.
6. Run the Two Input Network node and view the results. The maximum profit is achieved on the 12th iteration.
7. Close the Two Input Network results window.

8. Run the 3D Plotter node, view the results, and maximize the Weights by Iteration plot.

As before, the value of the weights increases by iteration; however, the scale of the chart has changed substantially. This should result in a more gradually changing model.

9. View the surface plot.

The predicted values from the Two Input Network model appear very much like those from the standard logistic regression.

1. Close the surface plot window.
2. Run the SAS Code node connected to the second Model Comparison node and view the results.

The models are now virtually indistinguishable. By using small starting values for the weights and choosing a slow optimization process, good results are achievable even for overparameterized neural networks.
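One way to see why small weights yield regression-like predictions is to look at a single hidden unit. The sketch below is Python, not SAS, and assumes a tanh activation with no bias term, which the notes do not state. It measures how far the unit's output departs from a straight line over a range of standardized input values. The weight values 0.1 and 3.0 are illustrative.

```python
import math


def hidden_unit(x: float, weight: float) -> float:
    """Output of one tanh hidden unit with no bias (tanh is an assumption here)."""
    return math.tanh(weight * x)


def max_departure_from_linear(weight: float, xs: list) -> float:
    """Largest gap between the unit's output and the straight line weight * x."""
    return max(abs(hidden_unit(x, weight) - weight * x) for x in xs)


xs = [i / 10 for i in range(-20, 21)]        # standardized inputs from -2 to 2

small = max_departure_from_linear(0.1, xs)   # weight near a Randomization Scale of 0.1
large = max_departure_from_linear(3.0, xs)   # weight after aggressive training

print(f"small weight: {small:.4f}, large weight: {large:.4f}")
```

With the small weight, the unit is almost exactly linear in its input, so a network of such units feeding a logistic output behaves much like a logistic regression. With the large weight, the unit saturates and can produce the abrupt changes seen in the mountain-pass surface.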

Chapter 3 Predictive Algorithms

3.1 Growing Trees
3.2 Constructing Trees
3.3 Applying Decision Trees


3.1 Growing Trees

Recursive partitioning models, commonly called decision trees after the form in which the results are presented, have become one of the most ubiquitous of predictive modeling tools. Tree models might not yield the largest generalization profit, but they are invaluable in improving performance and aiding understanding of other predictive models.

Unlike parametric models, decision trees do not assume a particular structure for the association between the inputs and the target. This allows them to detect complex input and target relationships missed by inflexible parametric models. It also allows them, if not carefully tuned, to overgeneralize from the training data and find complex input and target associations that do not really exist.

Trees are the primary example of a class of predictive modeling tools designated as predictive algorithms. Predictive algorithms are a motley assembly of often ad hoc techniques with intractable statistical properties. Their use is justified by their empirical success. In addition to decision trees, other common examples of predictive algorithms are nearest neighbor methods, naïve Bayes models, support vector machines, over-specified neural networks, and nonparametric smoothing methods.

[Slide: Tree Algorithm Parameters and Default Settings]
Maximum Branches: 2
Split Worth Criterion: Adjusted Chi-Sq. Logworth
Stopping Options: Logworth Threshold, Depth Adjustment, Max. Depth, Min. Leaf Size
Pruning Method: Average Profit
Missing Value Method: Best Leaf

The behavior of the tree algorithm in SAS Enterprise Miner is governed by many parameters that can be roughly divided into five groups: the number of subpartitions to create at each partitioning opportunity; the metric used to compare different partitions; the rules used to stop the partitioning process; the method used to tune the tree model; and the method used to treat missing values.

The defaults for these parameters generally yield good results for initial prediction. As discussed in later sections, varying some of the parameters can improve results for auxiliary uses of tree models.

[Slide: Tree Algorithm: Calculate Logworth]

Understanding the default algorithm in SAS Enterprise Miner for building trees enables you to better use the Tree tool and interpret your results. The description presented here assumes a binary target, but the algorithm for interval targets is similar. The algorithm for categorical targets with more than two levels is more complicated and is not discussed.

The first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each categorical input level. The averages serve the same role as the unique interval input values in the discussion that follows.

For a selected input and fixed split point, two groups are generated. Cases with input values less than the split point are said to branch left. Cases with input values greater than the split point are said to branch right. This, combined with the target levels, forms a 2x2 contingency table with columns specifying branch direction (left or right) and rows specifying target value (0 or 1). A Pearson chi-squared statistic is used to quantify the independence of counts in the table's columns. Large values for the chi-squared statistic suggest that the proportion of 0's and 1's in the left branch is different from the proportion in the right branch. A large difference in target level proportions indicates a good split.

Because the Pearson chi-squared statistic can be applied to the case of multi-way splits and multi-level targets, the statistic is converted to a probability value or p-value. The p-value is the probability of obtaining a statistic at least as large as the one observed, assuming identical target proportions in each branch direction. For large data sets, these p-values can be very close to 0. For this reason, the quality of a split is reported by

$\text{logworth} = -\log_{10}(\text{chi-squared } p\text{-value})$.

Logworth is calculated for every split point of an input. At least one logworth must exceed a threshold for a split to occur with that input. By default, this threshold corresponds to a chi-squared p-value of 0.20 or a logworth of approximately 0.7.
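The logworth calculation for a single binary split can be sketched as follows. This is Python rather than SAS and is a minimal illustration, not the Tree tool's implementation. It assumes a 2x2 table, so the chi-squared statistic has 1 degree of freedom and its upper-tail probability can be written with the complementary error function. The function names and the counts are made up.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    Rows are target values (0, 1); columns are branch directions (left, right).
    Every row and column total must be positive.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom


def logworth(chi2: float) -> float:
    """-log10 of the chi-squared p-value with 1 degree of freedom.

    For extremely large statistics the p-value underflows to 0 and log10 fails;
    this sketch does not guard against that.
    """
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail for 1 df
    return -math.log10(p_value)


chi2 = pearson_chi2_2x2(30, 10, 20, 40)          # made-up counts
print(round(chi2, 3), round(logworth(chi2), 2))
print(round(-math.log10(0.20), 3))               # default threshold: 0.699
```

The last line confirms the figure quoted in the text: a p-value of 0.20 corresponds to a logworth of about 0.7.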

[Slide: Tree Algorithm: Filter Partitions]

The Tree algorithm settings disallow certain partitions of the data. Settings, such as the minimum number of observations required for a split search and the minimum number of observations in a leaf, force a minimum number of cases in a split partition. This reduces the number of potential partitions for each input in the split search.

[Slide: Tree Algorithm: Adjust Logworth]

When calculating the independence of columns in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even when there are no differences in the target level proportions between split branches. As the number of possible split points increases, the likelihood of this occurring also increases. In this way, an input with a multitude of unique input values has a greater chance of accidentally having a large logworth than an input with only a few distinct input values.

Statisticians face a similar problem when combining the results from multiple statistical tests. As the number of tests increases, the chance of a false positive result likewise increases. To maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If each inflated p-value shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is known as a Bonferroni correction.

Because each split point corresponds to a statistical test, Bonferroni corrections are automatically applied to the logworth calculations for an input. These corrections, called Kass adjustments after the inventor of the default Tree algorithm used in SAS Enterprise Miner, penalize inputs with many split points. Multiplying p-values by a constant is equivalent to subtracting a constant from logworth. The constant relates to the number of split points generated by the input. The adjustment allows a fairer comparison of inputs with many and few levels later in the split search algorithm. The adjustment also increases the chances of an input's logworth not exceeding the threshold.

[Slide: Tree Algorithm: Partition Missings]

For inputs with missing values, two sets of adjusted logworths are actually generated. The two sets are calculated by including the missing values in the left branch and right branch, respectively.

[Slide: Tree Algorithm: Find Best Split for Input]
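The equivalence between multiplying p-values and subtracting from logworth, described above for the Kass adjustment, can be checked numerically. The Python sketch below is a simplified Bonferroni-style illustration, not the exact adjustment Enterprise Miner applies. The function name, the raw p-value, and the split-point counts are made up.

```python
import math


def kass_adjusted_logworth(p_value: float, n_split_points: int) -> float:
    """Logworth after a Bonferroni-style multiplication of the p-value."""
    adjusted_p = min(1.0, p_value * n_split_points)   # a probability cannot exceed 1
    return -math.log10(adjusted_p)


p = 0.001                      # made-up raw p-value for an input's best split point
for m in (1, 10, 100):         # number of candidate split points on the input
    print(m, round(kass_adjusted_logworth(p, m), 3))
```

The raw logworth of 3 drops to 2 with 10 split points and to 1 with 100, that is, by log10 of the number of split points. An input with many distinct values therefore needs a much stronger raw split to clear the 0.7 threshold.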

The best split for an input is the split that yields the highest logworth. Because the logworth calculations also account for missing input values, the tree algorithm optimally accounts for inputs with missing values.

[Slide: Tree Algorithm: Repeat for Other Inputs]

The partitioning process is repeated for every input in the training data. Inputs whose adjusted logworth fails to exceed the threshold are excluded from consideration.

[Slide: Tree Algorithm: Compare Best Splits]

After determining the best split for every input, the tree algorithm compares each best split's corresponding logworth. The split with the highest adjusted logworth is deemed best.
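The split search described above can be sketched for interval inputs and a binary target. This Python version is a simplified illustration under several assumptions: it omits the Kass adjustment, missing values, and the minimum leaf size rules, and it sends cases equal to the split point to the right branch. All function names and data values are hypothetical.

```python
import math


def logworth_for_split(values, targets, split_point):
    """Logworth of the 2x2 table formed by branching left (< split) or right."""
    counts = [[0, 0], [0, 0]]                      # counts[target][branch]
    for x, y in zip(values, targets):
        branch = 0 if x < split_point else 1
        counts[y][branch] += 1
    (a, b), (c, d) = counts
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                                 # an empty row or column: no usable split
        return 0.0
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))     # chi-squared upper tail, 1 df
    return -math.log10(p_value) if p_value > 0 else float("inf")


def best_split(values, targets):
    """Return (split_point, logworth) for the best candidate split on one input."""
    candidates = sorted(set(values))[1:]           # skip the minimum: its left branch is empty
    scored = [(logworth_for_split(values, targets, s), s) for s in candidates]
    if not scored:                                 # constant input: nothing to split on
        return None, 0.0
    worth, point = max(scored)
    return point, worth


def best_input(inputs, targets):
    """Compare each input's best split and return (name, split_point, logworth)."""
    results = [(best_split(v, targets), name) for name, v in inputs.items()]
    (point, worth), name = max(results, key=lambda r: r[0][1])
    return name, point, worth


# Made-up training data: x1 separates the target perfectly, x2 only partly.
inputs = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [5, 1, 4, 2, 8, 3, 7, 6]}
targets = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_input(inputs, targets))
```

In this toy data, the search settles on x1 with a split point of 5, because that split produces the largest logworth among all candidate splits on both inputs.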

[Slide: Tree Algorithm: Partition with Best Split]

The training data is partitioned using the best split. The expected values of the target and profit are calculated within each leaf. This determines the optimal decision for the leaf.

[Slide: Tree Algorithm: Repeat within Partitions]

[Slide: Tree Algorithm: Calculate Logworths]

The split search continues within each leaf. Logworths are calculated and adjusted as before.

[Slide: Tree Algorithm: Adjust for Split Depth]

Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm again faces a multiple comparison problem. To compensate for this, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by $\log_{10}(2) \cdot d \approx 0.3\,d$, where d is the depth of the split on the decision tree.

By increasing the threshold for each depth (or equivalently decreasing the logworths), the Tree algorithm makes it increasingly easy for an input's splits to be excluded from consideration.
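The depth adjustment for binary splits amounts to a one-line formula. The Python sketch below assumes that depth 0 refers to the root split, which matches the thresholds of 0.7 and 1.0 shown on the slides for the first and second levels. The constant and function names are illustrative.

```python
import math

BASE_THRESHOLD = -math.log10(0.20)   # default logworth threshold, about 0.7


def depth_adjusted_threshold(depth: int) -> float:
    """Logworth threshold for a binary split at the given depth (root = 0)."""
    return BASE_THRESHOLD + depth * math.log10(2)


for d in range(4):
    print(d, round(depth_adjusted_threshold(d), 2))
```

Each additional level raises the bar by about 0.3, which is the same as halving the p-value a split must achieve.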

[Slide: Tree Algorithm: Find Best Split for Input]

[Slide: Tree Algorithm: Repeat for Other Inputs]

[Slide: Tree Algorithm: Compare Best Splits]

The best split using each input is identified, and the splits are compared as before.

[Slide: Tree Algorithm: Partition with Best Split]

The data is partitioned according to the best split. The process repeats in each leaf until there are no more allowed splits whose adjusted logworth exceeds the depth-adjusted thresholds. This completes the split search portion of the tree algorithm.

[Slide: Tree Algorithm: Construct Maximal Tree]

The resulting partition of the input space is known as the maximal tree. Development of the maximal tree was based exclusively on statistical measures of split worth on the training data. It is likely that the maximal tree will fail to generalize well on an independent set of validation data.

The second part of the Tree algorithm, called pruning, attempts to improve generalization by removing unnecessary or poorly performing splits. Pruning generates a sequence of trees starting with the maximal tree and decreasing to the root tree (a tree with one leaf). Each step of the pruning sequence eliminates one split from the maximal tree.

[Slide: Tree Algorithm: Prune Maximal Tree (Option 1)]

[Slide: Tree Algorithm: Prune Maximal Tree (Option 2; splits to remove: 1)]

[Slide: Tree Algorithm: Prune Smallest Loss (Option 1; splits removed: 1)]

The first pruning step eliminates a single split from the maximal tree. The change in overall average profit caused by the removal of a given split is calculated. The split that least changes the overall average profit of the Tree model is removed.

[Slide: Tree Algorithm: Prune Maximal Tree (Option 1)]

[Slide: Tree Algorithm: Prune Maximal Tree (Option 2; splits to remove: 2)]

[Slide: Tree Algorithm: Prune Smallest Loss (Option 1; splits removed: 2)]

The second pruning step eliminates two splits from the maximal tree. Because splits are removed from the maximal tree, it is possible that the tree obtained from the second pruning step will not be a subtree of the tree obtained in the first pruning step. Again, the splits removed are those that change the overall average profit of the Tree model by the smallest amount.

[Slide: Tree Algorithm: Prune Maximal Tree (Option 1)]

[Slides: Tree Algorithm: Prune Maximal Tree, Option 2 (splits to remove: 3); Tree Algorithm: Prune Smallest Loss, Option 1 (splits removed: 3); Tree Algorithm: Prune Smallest Loss (further pruning step). Each slide plots the training data on x1 and x2 and reports a profit value.]

[Slide: Tree Algorithm: Prune Smallest Loss. Training data plotted on x1 and x2, with a profit value.]

The process continues until only the root of the tree remains. Because there is only one way to generate a two-leaf and a one-leaf tree from the maximal tree, no comparisons are necessary.

[Slide: Tree Algorithm: Select Optimal Tree. Training and validation profit plotted against the number of leaves.]

The smallest tree with the highest validation profit is chosen from the trees generated in the pruning process as the final tree model.
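The selection rule just described (smallest tree among those with the highest validation profit) can be sketched in a few lines of Python. This is an illustration only, not the code SAS Enterprise Miner runs; the function name `select_optimal_tree` and the profit values are hypothetical.

```python
def select_optimal_tree(pruning_sequence):
    """Pick the smallest tree among those with the highest validation profit.

    pruning_sequence: list of (leaves, validation_profit) pairs, one per
    tree in the pruning sequence.
    """
    best_profit = max(profit for _, profit in pruning_sequence)
    tied = [(leaves, profit) for leaves, profit in pruning_sequence
            if profit == best_profit]
    # Tuples compare on the leaf count first, so min() returns the smallest tree.
    return min(tied)


# Hypothetical validation profits, for illustration only.
sequence = [(1, 0.050), (2, 0.058), (3, 0.061), (4, 0.064), (5, 0.064), (6, 0.063)]
leaves, profit = select_optimal_tree(sequence)
print(leaves, profit)
```

In this made-up sequence the four-leaf and five-leaf trees tie on validation profit, so the four-leaf tree is selected.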

[Slide: Tree Variations: Multi-Way Splits. Trades height for width; complicates split search; uses heuristic shortcuts.]

SAS Enterprise Miner allows for a multitude of variations on the default Tree algorithm. The first involves the use of multi-way splits instead of binary splits. This option is invoked by changing the Maximum number of branches from a node field in the Basic tab of the Tree window.

Theoretically, there is no clear advantage in doing this. Any multi-way split can be obtained using a sequence of binary splits. The primary change is cosmetic. Trees with multi-way splits tend to be wider than trees with only binary splits.

The inclusion of multi-way splits complicates the split search algorithm. A simple linear search becomes a search whose complexity increases geometrically in the number of splits allowed from a leaf. To combat this complexity explosion, the Tree tool in SAS Enterprise Miner resorts to heuristic search strategies.

[Slide: Tree Variations: Multi-Way Splits. Exhaustive search size limit; maximum branches in split.]

Two fields in the Properties panel affect the number of splits in a tree. The Maximum Branches field sets an upper limit on the number of branches emanating from a node. When this number is greater than the default of two, the number of possible splits rapidly increases. To save computation time, a limit is set in the Exhaustive field as to how many possible splits will be explicitly examined. When this number is exceeded, a heuristic algorithm is used in place of the exhaustive search described above.

The heuristic algorithm alternately merges branches and reassigns consolidated groups of observations to different branches. The process stops when a binary split is reached. Among all candidate splits considered, the one with the best worth is chosen.

The heuristic algorithm initially assigns each consolidated group of observations to a different branch, even if the number of such branches is more than the limit allowed in the final split. At each merge step, the two branches whose merger degrades the worth of the partition the least are merged. After two branches are merged, the algorithm considers reassigning consolidated groups of observations to different branches. Each consolidated group is considered in turn, and the process stops when no group is reassigned.

[Slide: Tree Variations: Split Worth Criteria. Yields similar splits; grows enormous trees; favors inputs with many levels.]

In addition to changing the number of splits, you can also change how the splits are evaluated in the split search phase of the Tree algorithm. For categorical targets, SAS Enterprise Miner offers three separate split worth criteria. Changing from the default Chi-square criterion typically yields similar splits if the number of distinct levels in each input is similar. If not, the other split methods tend to favor inputs with more levels due to the multiple comparison problem discussed above. You can also cause the Chi-square method to favor inputs with more levels by turning off the Bonferroni adjustments.

Because the Gini reduction and Entropy reduction criteria lack the significance threshold feature of the Chi-square criterion, they tend to grow enormous trees. Pruning and selecting a tree complexity based on validation profit limits this problem to some extent.

[Slide: Tree Variations: Split Worth Criteria. Split Criterion options: Chi-sq logworth, Entropy, Gini, Variance, ProbF logworth. Logworth adjustments.]

There are a total of five choices in SAS Enterprise Miner to evaluate split worth. Three (Chi-square logworth, entropy, and Gini) are used for categorical targets, and the remaining two (variance and ProbF logworth) are reserved for interval targets. Both Chi-square and ProbF logworths are adjusted (by default) for multiple comparisons (as described above). It is possible to deactivate this adjustment.

The split worth for the entropy, Gini, and variance options is calculated as follows. Let a set of cases $S$ be partitioned into $p$ subsets $S_1, \ldots, S_p$ such that

$$S = \bigcup_{i=1}^{p} S_i$$

Let the number of cases in $S$ equal $N$ and the number of cases in each subset $S_i$ equal $n_i$. Then the worth of a particular partition of $S$ is given by

$$\text{worth} = I(S) - \sum_{i=1}^{p} w_i \, I(S_i)$$

where $w_i = n_i / N$ (the proportion of cases in subset $S_i$) and, for the specified split worth measure, $I(\cdot)$ has the following value:

$$I(\cdot) = -\sum_{\text{classes}} p_{\text{class}} \log_2 p_{\text{class}} \qquad \text{(entropy)}$$

$$I(\cdot) = 1 - \sum_{\text{classes}} p_{\text{class}}^2 \qquad \text{(Gini)}$$

$$I(\cdot) = \sum_{\text{cases}} \left( Y_{\text{case}} - \bar{Y} \right)^2 \qquad \text{(variance)}$$

Each worth statistic measures the change in $I(S)$ from node to branches. In the variance calculation, $\bar{Y}$ is the average of the target value in the node with $Y_{\text{case}}$ as a member.
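As a worked illustration of the worth formula above, the following Python sketch computes the entropy, Gini, and variance measures and the resulting split worth for a small made-up node. It is a minimal sketch under the definitions given in the text, not the SAS Enterprise Miner implementation; all function names and data values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """I(.) for the entropy criterion: -sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """I(.) for the Gini criterion: 1 - sum of p^2 over classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def squared_deviation(values):
    """I(.) for the variance criterion: sum of squared deviations from the node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def split_worth(parent, branches, impurity):
    """worth = I(S) - sum_i w_i * I(S_i), with w_i = n_i / N."""
    n = len(parent)
    return impurity(parent) - sum(len(b) / n * impurity(b) for b in branches)


# Hypothetical binary target: 8 cases split into two branches of 4.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
branches = [[1, 1, 1, 0], [1, 0, 0, 0]]
gini_worth = split_worth(parent, branches, gini)        # 0.5 - 0.375 = 0.125
entropy_worth = split_worth(parent, branches, entropy)  # about 0.189

# Hypothetical interval target.
variance_worth = split_worth([1, 2, 3, 4], [[1, 2], [3, 4]], squared_deviation)
```

A split that separates the classes perfectly yields the largest possible worth for the node (0.5 for Gini and 1.0 for entropy in this balanced two-class example).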

[Slide: Tree Variations: Stopping Rules. Avoids orphan nodes; controls sensitivity; grows large trees.]

The family of adjustments you will modify most often when building trees are the rules that limit growth of the tree. Changing the minimum number of observations required for a split search and the minimum number of observations in a leaf prevents the creation of leaves with only one or a handful of cases. Changing the significance level and the maximum depth allows for larger trees that may be more sensitive to complex input and target associations. The growth of the tree is still limited by the depth adjustment made to the threshold.

[Slide: Tree Variations: Stopping Rules. Logworth threshold; minimum leaf size; maximum tree depth; threshold depth adjustment.]

Changing the logworth threshold changes the minimum logworth required for a split to be considered by the tree algorithm (when using the Chi-square or ProbF measures). Increasing the minimum leaf size will avoid orphan nodes. For large data sets, you may want to increase the maximum leaf setting to obtain additional modeling resolution. If you want really big trees and insist on using the Chi-square split worth criterion, deactivate the Split Adjustment option.

[Slide: Tree Variations: Pruning and Missing Values. Controls sensitivity; helps input selection.]

The final set of Tree algorithm options pertains to the pruning and missing value methods. By changing the Model assessment measure to Average Square Error, you can construct what is known as a class probability tree. It can be shown that this minimizes the imprecision of the tree. Analysts sometimes use this model assessment measure to select inputs for a flexible predictive model such as a neural network. You can deactivate pruning entirely by changing the subtree to Largest. You can also specify that the tree be built with a fixed number of leaves.

The default method for missing values is to place them in the leaf maximizing the logworth of the selected split. Another way to handle missing values is the construction of surrogate splitting rules. At each split in the tree, inputs are evaluated on their ability to mimic the selected, or primary, split. The surrogate splits are rated on their agreement with the primary split. Agreement is the fraction of cases ending up in the same branch using the surrogate split as with the primary split. If a surrogate split has high agreement with the primary split, it can be used in place of the primary split when the input involved in the primary split is missing. You can specify how many surrogates you want the tree algorithm to keep. Surrogate rules can also aid in the task of input selection.
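The agreement measure for surrogate splits can be illustrated with a short Python sketch. It assumes two interval inputs with simple "less than or equal to" binary splits; the inputs `x1` and `x2`, the cut points, and the function names are all hypothetical, and the code is not taken from SAS Enterprise Miner.

```python
def goes_left(value, cut):
    """A binary split on an interval input: values at or below the cut branch left."""
    return value <= cut


def agreement(primary_values, surrogate_values, primary_cut, surrogate_cut):
    """Fraction of cases sent to the same branch by the primary and surrogate splits."""
    same = sum(
        goes_left(p, primary_cut) == goes_left(s, surrogate_cut)
        for p, s in zip(primary_values, surrogate_values)
    )
    return same / len(primary_values)


def route(primary_value, surrogate_value, primary_cut, surrogate_cut):
    """Use the primary split when its input is present, otherwise the surrogate."""
    if primary_value is not None:
        return "left" if goes_left(primary_value, primary_cut) else "right"
    return "left" if goes_left(surrogate_value, surrogate_cut) else "right"


# Hypothetical values for six cases.
x1 = [1, 2, 3, 4, 1, 5]       # input used by the primary split (cut at 2)
x2 = [5, 8, 12, 15, 11, 20]   # candidate surrogate input (cut at 10)
score = agreement(x1, x2, primary_cut=2, surrogate_cut=10)  # 5 of 6 cases agree
branch = route(None, 12, primary_cut=2, surrogate_cut=10)   # x1 missing, so x2 decides
```

Here five of the six cases land in the same branch under both rules, so the surrogate would be a reasonable stand-in when `x1` is missing.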

[Slide: Tree Variations: Pruning. Pruning options: Assessment, Do not prune, Prune to size. Pruning metrics: Decision (profit), Avg. Square Error, Misclassification, Lift.]

The pruning options are controlled by two fields in the properties sheet: Subtree and Measure. Misclassification and lift are specialized pruning metrics and should be used sparingly.

[Slide: Tree Variations: Missing Values. Surrogate rules.]

You can specify the number of surrogate rules kept by the tree algorithm as shown.

3.2 Constructing Trees

Thus far, using the decision trees has been a completely automated process. After defining model parameters, the tree models grew by themselves. SAS Enterprise Miner 5.1 also includes a powerful interactive tree-growing application to complement the automated approach. Interactive training enables you to manually specify the splits in a tree model. This enables you to customize the tree to explore effects of competing splits, incorporate business rules into your models, and prune away unwanted nodes.

Interactive Tree Construction

The Interactive setting in the Tree node enables you to change the structure of a batch-grown tree or grow a tree completely from scratch.

1. Connect a Tree node to the Data Partition node as shown. The Two Input Network nodes have been removed for simplicity.

2. Select View ⇒ Property Sheet ⇒ Advanced. Examine the Decision Tree's Properties panel. The Node Sample property indicates that a random sample of 5,000 observations will be used in split search. The training data contains more than 9,000 observations. To avoid ignoring nearly half of your training data, you may want to increase this number.

3. Set the Node Sample size to 10,000. The default settings for interactive training are taken from the Properties panel. A separate Properties panel is available in the SAS Enterprise Miner Tree Desktop Application to vary the settings during the interactive training session.

4. Right-click the Decision Tree node and select Update. The node's metadata must be updated in preparation for interactive training.

5. Select Interactive Training from the Tree node's Properties panel. The SAS Enterprise Miner Tree Desktop Application window opens.

As expected, a tree starts as a single root node. The goal is to grow this root node into a full tree model. The class proportions reflect the sample and not the population. Because you have defined prior information, it is sensible to display the class proportions that reflect the population instead.

1. Select Edit ⇒ Apply Prior Probabilities. The proportions in the root node are updated to reflect the population.

2. Right-click the single node representing the current state of the interactive decision tree and select Split Node. The Split Node 1 window opens.

The window shows the logworth of the top five split inputs. The best split appears to be FREQUENCY_STATUS_97NK.

3. Select Edit Rule to view the split details.

The cases are partitioned into two groups, with cases whose FREQUENCY_STATUS_97NK is less than or equal to two branching left and cases whose FREQUENCY_STATUS_97NK is greater than two branching right. In addition, missing values have been placed in the left branch. You have the ability to edit and add ranges to the suggested split.

1. Type 1.5 in the New split point field.

2. Select Add Branch. The FREQUENCY_STATUS_97NK Splitting Rule window is updated to reflect the additional split point.

3. Select Apply to add this split to the tree model.

4. Select OK to close the FREQUENCY_STATUS_97NK Splitting Rule window.

5. Select OK to close the Split Node 1 window. The tree is updated to have three leaves off the root node. The leftmost leaf contains the majority of the observations and has the lowest response rate. The middle leaf has about an average response rate. The rightmost leaf has the highest response rate. There is no indication of how this split will generalize to the validation sample.

6. Right-click in the Tree window and select View Results. The tree is updated to show both training and validation data performance. While in View Results mode, you are unable to grow the tree.

7. To continue tree growth, right-click in the Tree window and select Train Interactively.

You grow the decision tree one node at a time, exploring alternative splits at each node along the way. You also can automatically grow the tree from any node.

1. To automatically grow the tree, right-click a node and select Train. For example, training from the lower-left node on the above tree produces the following lopsided result. To view more of the tree, right-click in the Browse Tree window and select Zoom ⇒ Zoom Out. The tree is grown until all stopping rules are satisfied (see the earlier discussion). This can lead to a many-leafed specimen.

2. Balance the above tree by training from the remaining two uppermost leaves.

The obtained tree is quite large indeed. Surely not all splits in this tree generalize to validation data. You have two options: eliminate undesired splits manually by hand pruning or automatically by validation profit-based pruning.

Automatic validation profit-based pruning is aided by examining an Assessment plot. This plot shows the profit for the best tree of a given complexity (number of leaves) using the splits defined thus far. The best tree is determined as discussed above.

1. To view the Assessment plot, change the display mode to View Results and select View ⇒ Assessment Plot. Apparently the validation performance plateaus at about ten leaves, although there is in fact little validation profit improvement after five leaves.

2. To confirm this, select View ⇒ Assessment Table. The table shows that the simplest tree with the highest validation profit has ten leaves. Suppose this ten-leaf tree is suited to your purposes.

3. Select row 9 (corresponding to the ten-leaf tree) of the Assessment Table. The BROWSETREE window is updated to show the pruned tree corresponding to this row of the Assessment Table.

As discussed in Section 3.1, this ten-leaf tree, pruned from the initial 29-leaf maximal tree, has the highest training profit of any ten-leaf subtree. The validation profit is higher than that of the tree automatically grown by the SAS Enterprise Miner Decision Tree node. Suppose you would like to deploy this hand-grown tree in the process flow diagram.

1. Close the SAS Enterprise Miner Tree Desktop Application window. A prompt appears.

2. Select Yes. You are then warned not to run the diagram until after updating is complete.

3. Select OK. The interactively grown tree is now this node's model of choice.

1. Select the Results window for the Decision Tree node. The window opens, displaying information on the model just constructed.

2. Connect the Decision Tree node to the Model Comparison node.

3. Run the SAS Code node connected to the Model Comparison node and view the results.

Unfortunately, the tree model is still less competitive than the other models. This is a general property of trees, especially in the presence of interval inputs. Except in special circumstances (such as low-noise problems or highly nonlinear input and target associations), trees will perform worse than even simple regression models when the majority of the inputs are on an interval measurement scale.

However, there are other extremely important uses for trees in the modeling process. The next section discusses some of these applications.

3.3 Applying Decision Trees

[Slide: Tree Applications. Missing value imputation; categorical input consolidation; variable selection; surrogate models. Inputs shown: x1, x2, x3, x4.]

Tree models play important supporting roles in predictive modeling. They provide an excellent way to impute the missing value of an input conditioned on nonmissing inputs. Tree models can be used to
- group the levels of a categorical input based on the value of the target
- select inputs for other flexible modeling methods such as neural networks
- explain the predictions of other modeling techniques.

Imputing Missing Values with Trees

In the models built to this point, missing values have been replaced by the overall mean of the input. This approach fails to take advantage of correlations existing within the inputs themselves, which might provide a better estimate of a missing input's actual value. In this demonstration, you see how to use a tree model to predict the value of an input conditioned on other inputs in the data set.

Why not use another modeling technique to do imputation? The answer lies in another question: what do you do if, in order to predict the value of one input, you need the value of another input that also happens to be missing? For tree-based imputation schemes, this is not an issue. The tree models simply rely on their built-in missing value methods and produce an estimate for the original missing value.

If you have a data set with 50 inputs, the proposition of building 50 separate tree models (one to predict the missing value of each input) seems like a daunting task. Fortunately, SAS Enterprise Miner makes this process easy.

Tree-based imputation is handled through the Impute node.

1. Attach another Impute node to the diagram and rename it Tree Impute.

2. Change the default impute method to Tree for both the Class and Interval inputs.

3. Scroll the Decision Tree Properties panel down to show the tree settings. If necessary, activate the Advanced Properties option. The default settings are quite similar to those of the Decision Tree node. Although not shown, adjusted Chi-square or ProbF logworth is used to compare splits.

4. For consistency, select Unique for Indicator Variable and Input for Indicator variable role.

When run, each input in the training data takes a turn as a target variable for a tree model. When an input has a missing value, SAS Enterprise Miner uses the corresponding tree to replace the unknown value of the input.

1. Run the Tree Impute node and view the results.

2. Select View ⇒ Output Data Sets ⇒ Train. An Explore window opens, displaying the training data.

3. Scroll the Explore window to view the column labeled Imputed: DONOR_AGE. Scroll down to reveal a variety of values for missing DONOR_AGE, dependent on other inputs in the data set.

While it is not possible to view the decision rules used to impute the missing values for each input, you can inspect the scoring code that implements the decision rules. Select View ⇒ SAS Code ⇒ Flow Score Code. The recipe for tree imputation of each input is included as part of the scoring code for the entire model.

Does improved prediction of missing values improve prediction of the target?

1. Close the Results window.

2. Connect a Regression node to the Tree Impute node.

3. Select Stepwise as the selection method.

4. Run the Regression node and view the results.

The regression model using tree-imputed data has virtually the same overall average profit for the validation data (although it has a smaller number of inputs).

[Output: Odds Ratio Estimates table giving a point estimate for each effect: FREQUENCY_STATUS_97NK, IMP_DONOR_AGE, IMP_INCOME_GROUP, MEDIAN_HOME_VALUE, MONTHS_SINCE_LAST_GIFT, and PEP_STAR (level 0 versus its comparison level).]

In general, standard regression models are insensitive to changes in the imputation method. This might not be true for more flexible modeling methods.

Consolidating Categorical Inputs

Categorical inputs pose a major problem for parametric predictive models such as regressions and neural networks. Because each categorical level must be coded by an indicator variable, a single input can account for more model parameters than all other inputs combined.

Decision trees, on the other hand, thrive on categorical inputs. They can easily group the distinct levels of the categorical variable together and produce good predictions. This demonstration shows how to use a tree model to group categorical input levels and create useful inputs for regression and neural network models.

1. Connect a Decision Tree node to the Tree Impute node and rename the Decision Tree node Consolidation Tree.

The Consolidation Tree node will be used to group the levels of CLUSTER_CODE together. Here you will use a tree model to group these levels based on their association with TARGET_B. From this grouping, a new modeling input will be created. You can use this input in place of CLUSTER_CODE in a regression model or other model. In this way, the predictive prowess of CLUSTER_CODE will be incorporated into a model without the plethora of parameters needed to encode the original.

The grouping can be done automatically by simply running the Decision Tree tool, or manually by using the tool's interactive training features. You use the automatic method here.

2. Select Variables from the Consolidation Tree Properties panel.

3. Select CLUSTER_CODE ⇒ Explore. The CLUSTER_CODE input is seen to have more than 50 levels.

4. Select Use ⇒ No for all variables.

5. Select Use ⇒ Yes for CLUSTER_CODE and TARGET_B.

6. Close the Variables window.

SAS Enterprise Miner attempts to build a tree to predict TARGET_B with just the CLUSTER_CODE input. The resulting leaves of the tree are groupings of the CLUSTER_CODE levels that have a similar donation rate.

1. Run the Consolidation Tree node and examine the results.

The constructed tree consists only of a root node. What went wrong?

2. Close the Results window.

3. Examine the P-Value Adjustment section of the Consolidation Tree's properties.

The primary reason for the failure to find any splits is the Bonferroni adjustment to logworth discussed in the previous section. The adjustment penalizes the logworth of potential CLUSTER_CODE splits by an amount equal to the log of the number of partitions of the L CLUSTER_CODE levels into two groups, or $\log_{10}(2^{L-1} - 1)$. With 54 distinct levels, the penalty is quite large. It is also quite unnecessary. The penalty avoids favoring inputs with many possible splits. Here you are building a tree with only one input. It is impossible to favor this input over others because there are no other inputs.

4. Select No in the Bonferroni Adjustment field.

5. Rerun the Consolidation Tree node and view the results.

With the Bonferroni adjustment disabled, a tree is created with a single split. The left branch groups levels of CLUSTER_CODE with a lower-than-average donation rate, and the right branch groups levels of CLUSTER_CODE with a higher-than-average response rate. To view the actual levels in each branch or to refine the level groupings, use the interactive tree tool.

The last task is to make this CLUSTER_CODE grouping available to subsequent models. To do this, you need to make two further Consolidation Tree property changes and rerun the node one last time.

1. Select Variable Selection ⇒ No to avoid having the Consolidation Tree deactivate all modeling inputs for subsequent nodes.

2. Select Leaf Role ⇒ Input to add a variable called _NODE_ to the training data. _NODE_ indicates the leaf of the tree to which each case is assigned. It is based on the value of CLUSTER_CODE and the rules established in the Consolidation Tree model. By default, the _NODE_ variable has a role of Segment. Changing the Leaf Role to Input allows models later in the flow to take advantage of the consolidated CLUSTER_CODE.

3. Rerun the Consolidation Tree node one last time.

Now see whether the newly created input is useful enough to be selected in the regression model.

1. Connect a Regression node to the Consolidation Tree node. Rename the node Consolidation Regression.

2. Select Variables from the Consolidation Regression Properties panel. _NODE_ has been added to the variables list.

3. Close the Variables window.

4. Select Selection Method ⇒ Stepwise from the Properties panel.

5. Run the Consolidation Regression node and view the results. The overall average profit on the validation data is higher than that of the other standard regression model.

6. Select the Output window and scroll to the bottom of the report.

[Output: Odds Ratio Estimates table giving a point estimate for each effect: FREQUENCY_STATUS_97NK, IMP_DONOR_AGE, IMP_INCOME_GROUP, MEDIAN_HOME_VALUE, MONTHS_SINCE_FIRST_GIFT, MONTHS_SINCE_LAST_GIFT, PEP_STAR (level 0 versus its comparison level), RECENT_AVG_GIFT_AMT, RECENT_CARD_RESPONSE_COUNT, and _NODE_ (level 2 versus its comparison level).]

Not only is _NODE_ selected as an input, but cases in the left branch of the Consolidation Tree (node 2) are 20% less likely to respond than cases in the right branch (node 3).
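To see why the Bonferroni adjustment prevented any split on CLUSTER_CODE earlier in this demonstration, the penalty can be evaluated directly. The Python sketch below computes the base-10 log of the number of two-group partitions of L levels, as described in the text; the function name `bonferroni_penalty` is hypothetical, and the code is an arithmetic illustration rather than anything produced by SAS Enterprise Miner.

```python
import math


def bonferroni_penalty(levels):
    """Base-10 log of the number of ways to partition `levels` categories
    into two non-empty groups, that is, log10(2**(levels - 1) - 1)."""
    return math.log10(2 ** (levels - 1) - 1)


penalty = bonferroni_penalty(54)   # the 54 CLUSTER_CODE levels from the text
print(round(penalty, 2))
```

For 54 levels the penalty is close to 16, so a candidate split would need an unadjusted logworth roughly 16 units above the threshold before it could be accepted. For a two-level input the penalty is zero, because there is only one possible binary partition.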

Selecting Inputs with Tree Models

Trees can be used to select inputs for flexible predictive models. They have an advantage over using a standard regression model for the same task when the inputs' relationship to the target is nonlinear or nonadditive. While this is probably not the case in this demonstration, the selected inputs still provide the support required to build a reasonably good neural network model.

1. Connect a Tree node to the Consolidation Tree node. Rename the node Selection Tree.

You can use the Tree node with default settings to select inputs; however, this tends to select too few inputs for a subsequent model. Two changes to the Tree defaults result in more inputs being selected. Generally, when using trees to select inputs for neural network models, it is better to err on the side of too many inputs rather than too few. The changes to the defaults act independently. You can experiment to discover which method generalizes best with your data.

2. Type 1 in the Number of Surrogates field. This change allows inclusion of surrogate splits in the variable selection process. By definition, surrogate inputs are typically correlated with the selected split input. While it is usually bad practice to include redundant inputs in predictive models, neural networks can tolerate some degree of input redundancy. The advantage of including surrogates in the variable selection is to allow inclusion of inputs that do not appear in the tree explicitly but are still important predictors of the target.

3. Select Subtree ⇒ Largest. With this setting, the tree algorithm does not attempt to prune the tree. Like adding surrogate splits to the variable selection process, it tends to add (possibly irrelevant) inputs to the selection list. By limiting their flexibility (by stopped training, for example), neural networks can cope with some degree of irrelevancy in the input space.

4. Run the Selection Tree node and view the results.

5. Maximize the Output window and scroll to the variable importance section.

[Output: Variable Importance table with columns Obs, NAME, NRULES, NSURROGATES, IMPORTANCE, VIMPORTANCE, and RATIO. Inputs listed in order: RECENT_RESPONSE_COUNT, RECENT_RESPONSE_PROP, RECENT_AVG_GIFT_AMT, LAST_GIFT_AMT, NUMBER_PROM_, LIFETIME_GIFT_AMOUNT, LIFETIME_GIFT_COUNT, MONTHS_SINCE_LAST_GIFT, IMP_MONTHS_SINCE_LAST_PROM_RESP, MOR_HIT_RATE, _NODE_, IMP_INCOME_GROUP, FILE_AVG_GIFT, PEP_STAR, PER_CAPTITA_INCOME, PCT_VIETNAM_VETERANS, MEDIAN_HOME_VALUE, IN_HOUSE, FREQUENCY_STATUS_97NK, PUBLISHED_PHONE.]

Roughly speaking, the IMPORTANCE column quantifies how much of the overall variability in the target each input explains. The values are normalized by the amount of variability explained by the input with the highest importance. For example, the second most important input (RECENT_RESPONSE_PROP) accounts for about 97% as much variability as the most important input (RECENT_RESPONSE_COUNT). This makes sense because these inputs are highly correlated. The definition not only considers inputs selected as split variables, it also accounts for surrogate inputs (if a positive number of surrogate rules is selected in the Properties panel).
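The normalization described above (each input's explained variability divided by that of the top input) amounts to a one-line calculation. The Python sketch below illustrates it with made-up raw scores; the input names `input_a`, `input_b`, and `input_c`, their values, and the function name are hypothetical and are not taken from the Selection Tree output.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so that the most important input equals 1.0,
    returned in descending order of importance."""
    top = max(raw_scores.values())
    ranked = sorted(raw_scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score / top) for name, score in ranked]


# Hypothetical raw scores, for illustration only.
raw = {"input_b": 194.0, "input_a": 200.0, "input_c": 50.0}
ranking = relative_importance(raw)
for name, value in ranking:
    print(f"{name}: {value:.2f}")
```

With these values the second-ranked input has a relative importance of 0.97, which would be read as explaining about 97% as much variability as the top-ranked input.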

The list includes many more inputs than were originally selected by the Regression node. How well do these predictors work in a model?

1. Connect a Neural Network node to the Selection Tree node. Rename the node Tree Neural.

2. Run the Neural Network node and view the results.

The validation overall average profit is slightly lower than that of the standard regression with the original inputs, and much lower than that of the neural network model with inputs chosen using the Regression node. If you were to use the Decision Tree node to select inputs without the two modifications provided above, the validation overall average profit would be much lower.

In general, when linear associations dominate, the regression method will tend to give better inputs than the decision tree method.

1. For comparison, connect another Neural Network node to the Consolidation Regression node. Rename the new node Consolidation Network.

2. Run the Consolidation Network node and view the results.

In this case, the regression model picks the best variables. In general, both methods can be effective and both should be tried in any modeling scenario.

Decision Segment Definition

The usual criticism of neural networks and similar flexible models is the difficulty in understanding the predictions. This criticism stems from the complex parameterizations found in the model. While it is true that little insight can be gained by analyzing the actual parameters of the model, much can be gained by analyzing the resulting predictions and decisions.

In this demonstration, a decision tree is used to isolate cases with sufficiently high predicted probabilities (as calculated by a neural network model) to warrant solicitation. In other words, you build a decision tree to describe cases for which the decision is to solicit. In this way, the characteristics of likely donors can be understood even if the model estimating the likelihood of donation is inscrutable. While not intended as a substitute for the original model, the description supplies insight into characteristics associated with likely responders.

The first step is a technical one. The training and validation data sets were created by separate sampling on TARGET_B. In this analysis, TARGET_B is no longer the target variable, but the separate sampling remains an issue. To contend with this, the donors (with TARGET_B=1) need to be downweighted in the description tree, and the non-donors (with TARGET_B=0) need to be upweighted.

The easiest way to create the weighting variable is to use the SAS Code node. This time, however, instead of creating code to perform some modeling-related task (such as assessment), you create code that is directly inserted into the scoring recipe used to create the model.
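The effect of the reweighting can be checked with a little arithmetic. The Python sketch below assumes that each weight is the population proportion of a class divided by its sample proportion, with donors making up 5% of the population and 25% of the separately sampled data (an interpretation of the 0.05/0.25 and 0.95/0.75 ratios used in this demonstration). The function name and the case counts are hypothetical.

```python
def separate_sampling_weights(population_rate, sample_rate):
    """Return (donor_weight, non_donor_weight) that restore the population mix.

    Each weight is the population proportion of the class divided by its
    proportion in the separately sampled data.
    """
    donor_weight = population_rate / sample_rate
    non_donor_weight = (1 - population_rate) / (1 - sample_rate)
    return donor_weight, non_donor_weight


donor_w, non_donor_w = separate_sampling_weights(population_rate=0.05, sample_rate=0.25)

# Hypothetical sample of 10,000 cases with 25% donors.
donors, non_donors = 2500, 7500
weighted_donor_share = (donors * donor_w) / (donors * donor_w + non_donors * non_donor_w)
print(round(donor_w, 4), round(non_donor_w, 4), round(weighted_donor_share, 4))
```

Donors receive a weight of 0.2 and non-donors a weight of about 1.27, and the weighted donor share returns to 5%, which is why statistics computed with the weight as a frequency variable reflect the population rather than the sample.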

1. Connect a SAS Code node to a Neural Network node (for example, the Consolidation Network fit at the close of the previous section). Rename this SAS Code node Add Weight Variable.

2. Select Score Code from the Add Weight Variable node's Properties panel (it is in the Scoring section). The Score Code window opens.

3. Open the program file Weight Explanation Tree Cases.sas, or simply type the following code into the display. This short program adds a variable called WEIGHT to all modeling data sets (train, validate, test, and score).

    /* Reweight cases so that tree statistics reflect the population */
    if TARGET_B=1 then WEIGHT = 0.05/0.25;
    else WEIGHT = 0.95/0.75;

233 3.3 Applying Decision Trees 3-53

Cases with TARGET_B=1 are weighted by the ratio 0.05/0.25, and cases with TARGET_B=0 are weighted by the ratio 0.95/0.75. When this variable is used as a frequency variable in the Decision Tree node, the statistics reported by the tree reflect the population rather than the training and validation samples.

4. Select OK to close the Score Code window.

To incorporate WEIGHT and to define D_TARGET_B as the target of interest, use the Metadata node.

1. Connect a Metadata node to the Add Weight Variable node.
2. Select Variables from the Metadata node's Properties panel. The Variables window opens.
3. Select New Role → Rejected for TARGET_B.
4. Select New Role → Target for D_TARGET_B.

234 3-54 Chapter 3 Predictive Algorithms

5. Select New Role → Frequency for WEIGHT.
6. Select OK to close the Variables window.

With the preliminary activities complete, you can build the description tree.

1. Connect a Decision Tree node to the Metadata node. Rename the Decision Tree node Description Tree.
2. Run the Description Tree node.
3. When the run is finished, select Interactive Training from the Properties panel. The goal here is to use Interactive Training's superior model-viewing capabilities rather than to fit a decision tree interactively.
4. Right-click and select View Results.
5. Select View → Assessment Table. The Assessment Table shows that the tree has 33 leaves and an overall misclassification rate of about 11%. Because the target variable's value is deterministic (it comes from the equation that defines the neural network), the misclassification rates on the training and validation data are virtually identical.
6. Select row 5 of the Assessment Table and examine the tree.

235 3.3 Applying Decision Trees 3-55

The validation column of the tree diagram has been suppressed by selecting Options → Node Statistics and deselecting the Validation check box. The best 5-leaf tree gives a brief characterization of the Consolidation Neural Network with a misclassification rate of 20%. The color of each leaf represents the proportion of solicitations within the leaf: the lighter the node, the higher the solicitation rate. The three light purple segments (with solicitation rates ranging from 66% to 75%) receive solicitations. The other two segments do not.

7. Select row 10 of the Assessment Table. The best 10-leaf tree gives a higher-resolution characterization of the Consolidation Neural Network, with a misclassification rate now less than 15%. There are now five segments that receive solicitations. The actual solicitation rates in these nodes range from 66% to 89%. From this tree, you can obtain a general sense of who is being solicited and who is being ignored. The node counts indicate the size of each segment (out of about 9600 cases).

236 3-56 Chapter 3 Predictive Algorithms

The following change gives a better impression of the relative sizes of the segments.

8. Select View → Tree Map. The Tree Map window opens. The length of each segment represents the size of the corresponding node in the tree. Segments at the bottom of the diagram are the leaves of the tree. The color of a segment gives its solicitation rate. Click a segment to see where it occurs in the tree.
9. Select the large dark purple segment at the bottom of the Tree Map.

237 3.3 Applying Decision Trees 3-57

The corresponding node on the tree diagram is highlighted.

10. Select View → Path Rule. A window opens describing the rules that define the selected segment.

238 3-58 Chapter 3 Predictive Algorithms

11. Select View → Classification Matrix to obtain concrete numbers on the size of the solicitation segments. The rows represent the actual decisions (here made by the Neural Network model), whereas the columns represent the decisions made by the 15-leaf Description tree. Surprisingly, the percentage of cases selected by the tree is the same as the percentage of cases selected by the original neural network.

You should experiment with accuracy versus tree-size trade-offs (and other tree options) to achieve a description of the neural network model that is both understandable and accurate.
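One way to put numbers on the accuracy side of this trade-off is to summarize a classification matrix directly. The following Python sketch (an illustration outside Enterprise Miner, not part of the demonstration) computes the misclassification rate and the percentage of cases selected by each model. The function name is hypothetical and the counts are made-up placeholders, not values from the Description tree.

```python
def summarize_classification_matrix(matrix):
    """Summarize a 2x2 matrix of (weighted) case counts.

    matrix[i][j] counts cases where the network decision is i and the
    tree decision is j, with 0 = ignore and 1 = solicit.
    """
    total = sum(sum(row) for row in matrix)
    agreements = matrix[0][0] + matrix[1][1]
    return {
        "misclassification": 1 - agreements / total,
        "selected_by_network": sum(matrix[1]) / total,
        "selected_by_tree": (matrix[0][1] + matrix[1][1]) / total,
    }


# Hypothetical counts, chosen only to illustrate the calculation.
example_matrix = [
    [5200, 600],   # network ignores: tree ignores / tree solicits
    [600, 3200],   # network solicits: tree ignores / tree solicits
]

summary = summarize_classification_matrix(example_matrix)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this made-up matrix the two models select the same percentage of cases even though they disagree on some of them, because the two kinds of disagreement happen to balance. Matching selection percentages therefore do not by themselves show that the tree reproduces the network's decisions; the misclassification rate is the more direct measure.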