ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS




DATABASE MARKETING, Fall 2015, max 24 credits. Deadline 15.10.

PART A Gains chart with Excel

Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls. That file contains sorted respondent scores from the analysis file (30 customers) and the validation file (30 customers) of a modeling exercise, and it gives you the upper and lower bounds of the 10 % buckets you should use in your gains chart. For both the analysis and the validation file, calculate the response rate in the 10 buckets and the gain over total (in the same manner as in the slides). Remembering that this is only a small sample of 30+30, describe how the model is working. As part of your answer, also give a table similar to slide 9 in slide set 7 (incremental gains charts). A sketch of the bucket arithmetic is given below, after the PART B outline.

PART B Predicting and scoring using SAS EM

1 ASSIGNMENT OUTLINE AND DATA
2 QUESTIONS AND REPORT INSTRUCTIONS
3 COMPUTER INSTRUCTIONS
3.1 PROJECT-LIBRARY-DATA SOURCE-DIAGRAM
3.2 VARIABLE DEFINITIONS-SAMPLES-TRANSFORMATIONS
3.3 MODELING
3.4 ASSESSING THE MODELS
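Before Part B, here is a minimal sketch of the Part A bucket arithmetic in Python (the assignment itself asks for Excel). The response flags below are hypothetical and assumed sorted by descending score, NOT the ass4b.xls data, and "gain over total" is read here as the cumulative response rate divided by the overall rate:

    # Hypothetical response flags for 30 customers, sorted by descending score.
    responses = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
                 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

    n = len(responses)
    bucket_size = n // 10                      # 10 % buckets -> 3 customers each
    total_rate = sum(responses) / n            # overall response rate

    cum_responders = 0
    for b in range(10):
        bucket = responses[b * bucket_size:(b + 1) * bucket_size]
        cum_responders += sum(bucket)
        bucket_rate = sum(bucket) / bucket_size
        cum_rate = cum_responders / ((b + 1) * bucket_size)
        gain = cum_rate / total_rate           # "gain over total" read as cumulative lift
        print(f"bucket {b + 1}: rate {bucket_rate:.0%}, "
              f"cumulative {cum_rate:.0%}, gain {gain:.2f}")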

1 ASSIGNMENT OUTLINE AND DATA

The data used in this exercise is available in \\work\courses\e\27\e20100\cooking.sas. Please use your network drive or a USB stick for storage; this applies to both data and project files. So first of all, copy your data there. Throughout these instructions I use a directory called aaa on the Desktop for data files.

Last year Books-By-Mail test-promoted a new cookbook called Quick & Easy to 9,592 names selected randomly from their primary book buyer segment. The response rate received was 3.79 %. Names and all data were saved as of the time of the promotion. In preparation for an immediate roll-out, the product manager requests that you build response models to assist her in identifying those names in her primary book buyer segment most likely to order Quick & Easy. For the predictions we use three different models: regression analysis, trees, and neural networks. Identify the best model among them.

If a customer orders the promoted cookbook, the profit margin is 16 euros before promotion costs. Promotion costs are 0.65 euros per promotion. The data file contains 5 predictor variables and an order indicator denoting who in the sample ordered the cookbook (your dependent/target variable). Details of these variables can be found below.

VARIABLE DESCRIPTIONS OF DATA FILE cooking.sas

ORDER (Numeric): Indicates whether the customer ordered or not.

AGE50PL (Numeric): Indicates whether the customer is age 50+, based on purchased enhancement data. 1, if the customer is 50 years of age or older.

HISTORY (Numeric): Response to previous promotions. 0, if the customer did not respond to any of the four last promotions; 1, if the customer responded to one or more of the four last promotions.

GENDER (Numeric): Indicates the gender of the customer. 1 = no information available; 2 = male; 3 = female.

TPAID (Numeric): Indicates the total number of paid books. 1-15 = that many products paid; 16 = 16 or more products paid.

TSLBO (Numeric): Indicates the customer's elapsed time in months since their last book order, across all genres. 0 = no book orders placed; 1 = 0-6 months ago; 2 = 6-12; 3 = 12-18; 4 = 18-24; 5 = 24-30; 6 = 30-36; 7 = 36-42; 8 = 42-48; 9 = 48-54; 10 = 54-60; 11 = 60-66; 12 = 66+ months ago.

Note: although TSLBO and TPAID have open-ended top classes (16+ and 66+), we treat them as if they did not, so that we can handle them as continuous variables.

2 QUESTIONS AND REPORT INSTRUCTIONS

This assignment involves estimating several response models with SAS EM. The purpose of this assignment is not to make you understand every model in detail; instead, it is to briefly introduce a few models and show the basics of comparing different models. For this reason the assignment proceeds by following the computer instructions in detail. Your task is to follow the instructions and answer questions about what you have done. Despite the mechanical nature of this assignment, it is important to think about what is being done and why. Predictive response models play a big role in data-driven marketing, and this assignment gives you the opportunity to get acquainted with the relevant ingredients with relatively minor effort.

NOTICE: When you are asked to estimate a model, please use 60 % of the data for analysis and 40 % for validation.

QUESTION 1 (pen and paper, no need for SAS EM)

Assume we decide not to send a promotion to anyone whose predicted expected profit is less than zero. For this reason we need to know how likely ordering needs to be for a particular customer to break even.

a) Given the profit margin and cost information we have, calculate the break-even probability (i.e. the lowest probability for which the expected profit is non-negative).

b) Next assume that 5 per cent of the customers ordering the book do not pay for it. In such a case the cost to the firm is 7 euros (in addition to the postal cost). What is the minimum predicted probability for breaking even now?

QUESTION 2

First estimate a logit regression model and answer the questions related to it. Your logit regression model calculates coefficients for the explanatory variables (slide set 6, slide 18). You will thus get the coefficients a, b1, b2, ... in the output of your regression model.

Report which variables were statistically significant in the regression model. What are the null and alternative hypotheses in the t-test that we carry out? What are the values of

the regression coefficients? Comment on the signs: are they as you would expect? Is there any indication of multicollinearity?

Next, use the regression model to make a few sample predictions. What does this model predict to be the ordering probability for the example customers A and B?

              A    B
  AGE50PL     1    0
  HISTORY     0    1
  GENDER      2    3
  TPAID       2    1
  TSLBO       4    2

If our criterion for promoting is that the expected profit should be strictly positive, should A and B be promoted?

Tips: It is handy to use Excel to calculate the required probabilities; the exponential function in Excel is EXP. When estimating the model in SAS EM, TSLBO and TPAID need to be log-transformed to make their distributions closer to normal. You must then apply the same transformation to the observed TSLBO and TPAID values in your Excel formula. E.g. if TSLBO = 2 and your transformation is LOG(TSLBO), you apply LN(2) in Excel. Note that LOG in SAS means the same thing as LN in Excel (the natural logarithm).

Example: assume that your model has two significant terms, the transformed TSLBO (the transformation being log(TSLBO)) and the intercept. Then you calculate D = a + b1*log(TSLBO) and apply exp(D)/(1 + exp(D)) to get the probability. A Python version of the same calculation is sketched after Question 4 below.

QUESTION 4

Next we estimate a neural network model (again on the same data). After how many iterations did the neural network stop training? Why did it stop? Include the diagrams describing the training process for the average error rate and the misclassification rate.
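Referring back to the tip in Question 2: the same probability calculation as a minimal Python sketch. The model form and coefficient values are placeholders, not what your SAS EM run will produce:

    import math

    # Placeholder coefficients -- substitute the estimates from your own
    # SAS EM output: intercept a, plus coefficients for the log-transformed
    # TSLBO and TPAID (a hypothetical model form).
    a, b_tslbo, b_tpaid = -2.8, -0.2, 0.4

    def order_probability(tslbo, tpaid):
        # Apply the same log transformation used in SAS EM before scoring.
        d = a + b_tslbo * math.log(tslbo) + b_tpaid * math.log(tpaid)
        return math.exp(d) / (1 + math.exp(d))

    # Customers A and B from the table in Question 2 (TSLBO, TPAID):
    print(f"A: {order_probability(4, 2):.3f}")
    print(f"B: {order_probability(2, 1):.3f}")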

QUESTION 3

Next we estimate a decision tree on the same data. Describe all the splits of the decision tree. What is the final definition of the subgroups (segments) of customers, i.e. into which groups did the tree split the customers? Describe the two best groups the tree found in terms of expected response rate (the validation data response rate is the criterion, but report the analysis data response rate as well). Why are there differences between the two data sets? Also report the number of observations in each group (both in the analysis and in the validation sample).

If you want a response rate of at least 6 %, which segments do you promote? Take care to use all the available information to exclude customers whose expected response rate is below 6 %. What percentage of the whole list of names do you expect to promote if you use this 6 % rule? Also calculate the expected response rate (e.g. using a weighted average); a sketch of this kind of weighted-average calculation, with made-up numbers, is given after Question 7 below. Please write down the calculations clearly, not just the final figure.

QUESTION 5

Next we compare the three models we estimated, using a lift chart. Based on the lift chart, which of the three models turned out to be the best? Why do we use the validation data as the criterion of model goodness? Include the graph in your report to complement your answer. Describe the gains in response rate expected from using the model (in other words, interpret the lift chart).

QUESTION 6

Next look at the % Response chart, which shows the response rates (in the course slides we looked at incremental gains charts containing the same kind of information). Comment on the monotonicity of the chart: monotonicity in the validation file, i.e. descending response rates, indicates a good model. Now use the chart to answer the following: using the break-even probability you calculated in Question 1 a, what percentage of the list of target customers (used in the roll-out) will you promote if you use the best model?

Look next at the Cumulative % Response chart: what response rate do you expect to get if you define the percentage to be promoted as described above (all promoted customers must have a non-negative expected profit)?

QUESTION 7

Use SEMMA and list which steps of our analysis fall under each SEMMA step. What is the big picture of this exercise? You did a lot of modeling; why? What happens after the modeling? Where does our assignment stop?
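For the weighted-average calculation in Question 3, a minimal sketch with made-up segment shares and response rates (your tree's groups and validation rates will differ):

    # Hypothetical tree segments: (share of the list, validation response rate).
    segments = [(0.10, 0.12),
                (0.25, 0.08),
                (0.65, 0.03)]

    # Keep only the segments meeting the 6 % rule.
    promoted = [(share, rate) for share, rate in segments if rate >= 0.06]
    pct_promoted = sum(share for share, _ in promoted)
    expected_rate = sum(share * rate for share, rate in promoted) / pct_promoted

    print(f"promote {pct_promoted:.0%} of the list; "
          f"expected response rate {expected_rate:.1%}")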

3 COMPUTER INSTRUCTIONS FOR PART B

3.1 PROJECT-LIBRARY-DATA SOURCE-DIAGRAM

This first part covers the preliminary preparations for constructing the flow needed to build and assess the predictive models.

Open Enterprise Miner: Programs → SAS → SAS Enterprise Miner 13.1. Define a new project for yourself. A project may include different data files as well as different diagrams.

PROJECT
File → New → Project. Define your project path and name. DEFINE YOUR PROJECT ON THE DESKTOP. Otherwise you may not be able to access it later.

LIBRARY
EM wants all the data files you use to be in a library. A library is just a directory whose path you need to define.

You can define a new data source, project, diagram, or library by clicking the arrow to the right of the sun icon on the toolbar. I refer to this as a sun-click.

Define a new library, here called garden, on your desktop: Sun-click → Library → tick New Library → Next → define the path → Next → Finish.

DATA SOURCE
Define the data file we use: Sun-click → Data Source → Next → Browse → Garden → Cooking → OK → Next → Next → Finish.

Note that this file is now visible in the upper left corner.

DIAGRAM
Next we define the diagram. A project can contain several diagrams. Sun-click → Diagram → give it a name → OK.

When you click your diagram under Diagrams in the cooking project, a new diagram opens. That is the space where you build the analysis flow.

3.2 VARIABLE DEFINITIONS-SAMPLES-TRANSFORMATIONS

NOW WE START BUILDING THE DATA FLOW

Building the flow means dragging icons from the toolbar to serve as building blocks of the process flow: we add nodes and connect them with arcs. Note that the icons used as nodes are arranged in groups: Sample, Explore, Modify, Model, Assess, etc.

Our data source is cooking.sas7bdat. First we drag this file to the diagram space; the data source thus becomes the first node. The grey area on the left shows the menu for the selected node, with its default settings visible. Click the Cooking icon and then, on the left, Variables, and we see the variables the data file includes.

VARIABLE DEFINITIONS

The first column defines the role of the variable. We are predicting the order, so the important Target variable for us is ORDER. Replace the role Input in that column with Target by clicking the cell and selecting Target from the drop-down that appears. All the other variables keep the role Input.

The next column, Level, refers to the measurement level. The measurement level of GENDER should be Nominal (three alternative values); AGE50PL and HISTORY are Binary, and ORDER is Binary as well. Make these changes the same way as when changing the Role. TSLBO and TPAID are Interval (meaning they are treated as continuous).

Data manipulation should always start with getting acquainted with the data. As an example, view the distribution of TPAID (mark the row and click Explore). We see that it is far from normally distributed, whereas near-normality is the desired property in a regression model. Click OK to close.

Next, embed the information about expected profit margins and costs. On the left, click Decisions, then Build in the next window, and Decision Weights in the one after that. You see a matrix: the column Decision 1 means promoting, and Decision 2 means not promoting. The rows correspond to the response of the customer. If we send the promotion and it is successful, our profit is 15.35 euros. If we get no response, we lose 0.65 euros. If we do not promote, it costs nothing. Note that we maximize.
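Based on the figures above, the decision weight matrix should look like this:

                Decision 1 (promote)   Decision 2 (do not promote)
    ORDER = 1          15.35                       0
    ORDER = 0          -0.65                       0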

Click OK to exit the window. Next we append new nodes to the project flow. Drag the Data Partition node from under Sample and connect it to Cooking using your mouse. On the left, under Data Set Allocations, allocate 60 % for Analysis and 40 % for Validation.

At this stage we would normally deal with missing values, but to keep the assignment simpler, all missing values have already been handled. We would also deal with outliers and explore the distributions of the variables.
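For intuition, the Data Partition step amounts to a random 60/40 split of the rows; a minimal sketch (SAS EM does this for you, and its sampling details may differ):

    import random

    random.seed(42)
    # Randomly assign each of the 9,592 rows to analysis or validation.
    rows = list(range(9592))              # one index per name in the sample
    random.shuffle(rows)
    cut = int(0.6 * len(rows))
    analysis, validation = rows[:cut], rows[cut:]
    print(len(analysis), len(validation))   # 5755 and 3837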

VARIABLE TRANSFORMATIONS

For regression analysis, variable transformations are needed: the explanatory variables should be close to normally distributed, and the distributions of our continuous variables are not. Drag the Transform Variables node from under Modify and connect Data Partition to it. Click Transform Variables to see its menu; for the interval inputs, select Log. This is a common transformation used to make the distribution of a variable more normal.
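To see why the log helps, a small sketch comparing skewness before and after the transformation, on synthetic right-skewed data (not the cooking file):

    import math, random

    random.seed(1)
    # Synthetic right-skewed variable, similar in spirit to TPAID/TSLBO.
    x = [random.expovariate(0.5) + 1 for _ in range(1000)]
    log_x = [math.log(v) for v in x]

    def skewness(values):
        n = len(values)
        m = sum(values) / n
        s = (sum((v - m) ** 2 for v in values) / n) ** 0.5
        return sum((v - m) ** 3 for v in values) / (n * s ** 3)

    print(f"skewness raw: {skewness(x):.2f}, after log: {skewness(log_x):.2f}")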

Note that there is also an option Maximize Normality. Do NOT use it in this case (it will cause trouble with the interpretation). This is all we need here. Run the flow by right-clicking Transform Variables and choosing Run. You will see the transformations that were made.

3.3 MODELING

REGRESSION ANALYSIS

Choose the Regression node from under Model and connect it to Transform Variables. You can define the status of the variables, i.e. which variables you use in your regression model (you will now use all the available ones), by clicking Variables in the menu below.

You see that EM suggests a model with main effects; that is all we are going to use. The suggested regression type is Logistic Regression, which is the special form of regression analysis for a 0/1 dependent variable. Our choice! When we scroll down this menu we see the following.

Here we are especially interested in specifying the Model Selection type for the regression. You may choose Backward/Forward/Stepwise; we ask you to use Stepwise. When you run the flow (right-click the Regression node and Run) and choose Results in the pop-up window, the results window contains two parts of most interest to you. The set of independent/explanatory (input) variables presented here is only a subset of the variables you will use; the results are shown as an example to highlight the interface and the interpretation of the figures and graphs.

The Effects plot presents the final regression model as a histogram. Only the significant explanatory variables (those whose coefficients differ from zero after statistical testing) are present in the histogram. You see their names and estimated coefficient values when you move the cursor over the bars. In the example below there are two significant explanatory variables in the model, and the intercept is significant as well. Blue color means that the corresponding coefficient is negative. It is convenient to see the coefficient values in the graph itself, which you get via right-click → Graph Properties → tick Show Labels. You still need to keep track of which variable each coefficient belongs to.

If you are interested in the t measures (statistical measures), you can view them via View → Model → Estimate Selection Plot. In the example below you see that there were three steps in the selection process, and in the final, third step there are three variables in the model (plus the intercept). We then know that all the variable coefficients in the logit model are significantly different from zero.

FOR THOSE WITH MORE BACKGROUND IN STATISTICS: You may select the t value in the drop-down menu. A t-value always corresponds to a p-value, which is more familiar to us.

COEFFICIENT INTERPRETATION FOR CLASS VARIABLES

Regression analysis handles a class variable by producing a constant for each of its possible values except one, which is chosen to have the value zero. In your data, if HISTORY0 is included in the set of significant variables with coefficient -0.33, it means that if your HISTORY value is 1 it has a zero effect on your score, but if it is 0 the coefficient -0.33 gives the effect. Moreover, if GENDER1 and GENDER2 appear among your significant variables, it tells you that GENDER being 3 (meaning female) has a zero effect on the score. Assume that for GENDER2 (male) the parameter estimate is 0.02. This means that, compared with a female customer, the probability to order is higher for a male.

Take an example. We have four significant terms: the intercept (-2.8), GENDER (coefficient 0.02 when gender is 2), HISTORY (coefficient -0.5 when history is 0), and TSLBO (-0.2). Then, according to the logit regression formula (check your slides), a male customer with no previous responses to the promotions and TSLBO of 4 (remembering the log transformation we made: log(4) ≈ 1.39) has the score (probability to order)

D = -2.8 + 0.02 - 0.5 - 0.2*1.39 = -3.56, so exp(D)/(1 + exp(D)) = exp(-3.56)/(1 + exp(-3.56)) ≈ 0.028.

Now we could check how good the model is at predicting, i.e. how good the fit is; some measures are displayed for that as well. This time the usual measures of fit will NOT be used. Instead, we will assess the goodness of the models later, once all three models have been fitted. We note that the Score Rankings Overlay window with the option Lift tells us by how much we can multiply the expected response rate if we use the model to pick the top 10 per cent of customers. This time we may multiply the response rate by 3. Thus, if we assume the baseline response rate without a model was 0.037, we may expect a response rate of about 0.11 among the best 10 % of the respondents.

NOTE: it may seem odd that the intercept has been negative in our examples. However, if all the other regression coefficients are 0 and only the intercept is used, the predicted ORDER value is exp(-2.52)/(1 + exp(-2.52)) = 0.074 > 0.
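A quick Python check of the worked example and of the NOTE above (the coefficients are the example's, not your model's):

    import math

    # Worked example: intercept -2.8, GENDER2 0.02, HISTORY0 -0.5, and the
    # TSLBO coefficient -0.2 applied to log(4) (natural log, as LOG in SAS).
    d = -2.8 + 0.02 - 0.5 - 0.2 * math.log(4)
    p = math.exp(d) / (1 + math.exp(d))
    print(f"D = {d:.2f}, probability = {p:.3f}")       # about -3.56 and 0.028

    # Intercept-only prediction from the NOTE:
    p0 = math.exp(-2.52) / (1 + math.exp(-2.52))
    print(f"intercept only: {p0:.3f}")                 # about 0.074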

NEURAL NETWORK

Select the AutoNeural node from under Model and connect Data Partition to it. We are thus using the non-transformed variables in the neural network this time. We need not adjust anything in the AutoNeural node, since its purpose is to choose good settings itself, except that you should increase Maximum Iterations to 15. Note that the termination criterion, defined as overfitting, refers to the minimization of the average error or the misclassification rate (I assume the misclassification rate is the criterion used, though this is not reported anywhere). Right-click the node and select Run. When the program stops, a pop-up asks whether you wish to view the results. Click Results.

The average error does not decrease in the validation sample after 5 iterations, and the same holds for the misclassification rate. Thus the model reached after 5 iterations is selected as the model to be used. Remember that a neural network is a black box: it does not produce easily interpretable information about how it arrived at its results, which variables were essential, and so on.
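The stopping rule amounts to keeping the iteration with the lowest validation error; a minimal sketch with hypothetical per-iteration errors:

    # Hypothetical validation errors per training iteration (not your run's).
    val_errors = [0.40, 0.35, 0.33, 0.32, 0.315, 0.316, 0.318]
    # Keep the iteration where the validation error is lowest.
    best_iter = min(range(len(val_errors)), key=val_errors.__getitem__) + 1
    print(f"validation error stops improving; keep iteration {best_iter}")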

TREE

Add a Decision Tree node from under Model to your diagram and link it with Data Partition. Note that we will NOT use the transformed variables here, to make the tree output easier for you to interpret. Among the options on the left we use the default values, but change the significance level to 0.3 (to get a bigger tree for you to interpret). After that, simply right-click the node and Run. The tree is found through sequential splits into data groups that differ most in their response behavior.

Below we view a tree that is not from the same data as ours; it is in the Tree Map window. In this tree the splits are always into two. We see that if the number of paid products exceeds 7, this group will respond to the promotion with probability 49.4 % (note that you need to look at the validation file). However, only very few customers have more than 7 orders. CKBK082 indicates whether the respondent responded to another recent promotion. We see that if he/she did, then even if the total number of paid orders is 7 or smaller, the probability of responding to this offer in this group is 7 % (again, you need to look at the validation column). Furthermore, if the number of paid orders exceeds 3 but is below 8, and moreover the CKBK082 offer was received favorably, the predicted response rate is as high as 15.7 %.

Considering the tree above, my file had 3,355 observations in the analysis file and 1,441 in the validation file. In the rows, 0 refers to non-respondents and 1 to respondents. We see that the response rate is 5.0 % in both files. The splits are made to distinguish responders from non-responders.

ALWAYS CHECK THE NUMBER OF OBSERVATIONS IN EACH OF THE BOXES PRODUCED. WE MAY IDENTIFY A GROUP WITH A HIGH RESPONSE RATE, BUT IF ONLY VERY FEW OBSERVATIONS BELONG TO THAT GROUP, IT IS OF NO USE.

NOTE: In case no tree appears for you, first check that you provided the decision weights earlier. If that does not help, please send me an e-mail and you will receive

a tree; you then analyse it, and in the assessment node you assess only the regression and the neural network.

3.4 ASSESSING THE MODELS

This time we will unfortunately pass over the tree without further consideration and next assess the three models. Drag the Model Comparison node from under Assess and connect the three model nodes to it. Right-click Model Comparison and Run. Windows of the following types will be displayed.

The Score Rankings Overlay window below gives the cumulative lift of the models. You see the validation file if you click on the lower border of the window and drag down. In a lift chart (gains chart) the customers are ranked from best to worst based on their probability of responding. The idea of the lift chart is to compare the response models to a no-model scenario. For example, if the response rate in a dataset is 5 %, then a random sample of 10 % of this dataset would have a 5 % response rate as well. Now let's assume that a response model has a lift of, for example, 3.5. This means that with the model it is possible to handpick the best 10 % and, within this sample, achieve a response rate 3.5 times larger than that of the no-model random sample. This also means that the lift of the no-model scenario equals 1. Lift can also be expressed as gain: gain is the percentage increase in response rate the model can bring to a sample. Our example lift of 3.5 is therefore the same thing as a gain of 250 %. When looking at the lift charts and the other charts below, view the validation file!
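A quick check of the lift/gain relationship with the figures from the text:

    # Lift vs. gain, using the numbers from the paragraph above.
    baseline = 0.05                       # overall response rate of the dataset
    lift_top10 = 3.5                      # lift of the model's best 10 %
    top10_rate = baseline * lift_top10    # response rate within the best decile
    gain_pct = (lift_top10 - 1) * 100     # gain over the no-model sample
    print(f"top 10 % response rate: {top10_rate:.1%}, gain: {gain_pct:.0f} %")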

Using the drop-down menu you may choose alternative charts. Next we look at the % Response chart. It contains much the same information as the incremental gains charts in the slides (except for the gains column): specifically, the response % for the validation sample, ranked top-down. Now we simply have much smaller buckets. Notice that in the graph below the response rate is monotonically decreasing in the validation file, except at the very end (not so serious). Specifically, we can see below (always read the validation file) that if we promote everyone exceeding a break-even point of, say, 5 %, we promote only 16 per cent of the list when using the neural model. However, we may also use the regression or tree model, in which case we promote 30 per cent of the target population.

Now, which is better? To answer, we must look at the Cumulative % Response chart, shown below. If we promote 16 % of the target segment employing the AutoNeural model, we expect a response rate of 9.5 %. If instead we use the regression or tree model, we get a response rate of 6.5 % (reading the result from the graph). Now which model should we use?

If we use the neural network model, the size of the target population is X, and we do not take into account that some responders do not pay, we get the profit

(16 - 0.65)*0.095*0.16*X - 0.65*0.905*0.16*X = 0.139*X.

If we use the regression or the tree model, the expected profit is

(16 - 0.65)*0.065*0.3*X - 0.65*0.935*0.3*X = 0.117*X.

Thus we use the neural network model, for which the profit is greater. (The arithmetic is checked in the sketch below.)

Opening the saved project
Open EM 12.3: Start Enterprise Miner → File → Open Project → your project on the Desktop or your USB stick. It normally also offers you the most recent project. Open your diagram.
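A quick check of the profit arithmetic above (X cancels out of the comparison, so we compute expected profit per name on the list):

    # Profit per name = (margin - cost) * p_respond - cost * (1 - p_respond),
    # scaled by the share of the list that is promoted.
    margin, cost = 16.0, 0.65

    def profit_per_name(response_rate, share_promoted):
        per_promoted = (margin - cost) * response_rate - cost * (1 - response_rate)
        return per_promoted * share_promoted

    print(f"neural network:    {profit_per_name(0.095, 0.16):.4f} * X")
    print(f"regression / tree: {profit_per_name(0.065, 0.30):.4f} * X")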

WHAT HAPPENS NEXT?

Our assignment ends with assessing the three models we used to predict whether a customer orders. In the assessment node we could see how useful the models were in finding the customers with the highest likelihood of responding to our promotion. In real life, the results from the sample are applied to a segment in the database to score it and to identify those who will be promoted and those who will not.