Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets



Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems

Course Outline

Demonstration of two classification examples in SPM:
o Bank Marketing
o KDD Cup 2009

Predictive modeling package used for the examples:
o Core Statistics
o Logistic Regression
o CART decision tree (original, by Breiman, Friedman, Olshen and Stone)
o MARS spline regression (original, by Jerome Friedman)
o TreeNet gradient boosting machine (original, by Jerome Friedman)
o RandomForests (original, by Breiman and Cutler)
o Automation and model acceleration

Bank Marketing Data

Portuguese bank marketing data:
o 41,188 records
o 20 attributes, such as age, job, education, housing status
o The goal is to predict whether the client will subscribe to a term deposit
o Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes'/'no')

Dataset is publicly available at the UCI Machine Learning Repository:
o http://mlr.cs.umass.edu/ml/datasets/bank+marketing

Challenges:
o Missing values
o Mixed categorical and numerical variables
o Variable selection

Copyright Salford Systems 2013

Sample Data

AGE  JOB          MARITAL  DEF  HOUSING  LOAN  CONTACT    EMP_VAR_RATE  CPI     CCI    EURIBOR  NUM_EMP  Y
56   housemaid    married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
57   services     married  .    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
37   services     married  no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
40   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
56   services     married  no   no       yes   telephone  1.1           93.994  -36.4  4.857    5191     no
45   services     married  .    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
59   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
41   blue-collar  married  .    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
24   technician   single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
25   services     single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no

(. marks a missing value)

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.

Note: missing values, categorical and numeric variables

Open Raw Data: bank.csv
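As a rough sketch of the loading step behind this slide (SPM reads bank.csv directly; the code below instead uses Python's csv module on a tiny in-memory excerpt in the same shape, with hypothetical rows and an empty field standing for a missing value):

```python
import csv
import io

# Tiny in-memory sample shaped like bank.csv (hypothetical rows; the empty
# DEF field in row 2 represents a missing value).
RAW = """AGE,JOB,MARITAL,DEF,HOUSING,LOAN,Y
56,housemaid,married,no,no,no,no
57,services,married,,no,no,no
24,technician,single,no,yes,no,yes
"""

def load_rows(text):
    """Read CSV text, converting empty strings to None (missing)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (v if v != "" else None) for k, v in row.items()})
    return rows

rows = load_rows(RAW)
missing = sum(1 for r in rows for v in r.values() if v is None)
print(len(rows), missing)  # 3 1
```

This keeps missing values explicit (None) so later steps can count and report them, mirroring the missing-value checks on the following slides.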

Character Variables and Missing Values

Request Descriptive Statistics

All variables are included by default.

Brief Descriptive Stats

We always check for the prevalence of missing data. Always review the number of distinct values (too few? too many?). Does anything look wrong in the dataset?

Full Descriptive Stats

The output contains detailed descriptive statistics for every variable.

Frequency of Target Variable

0 means non-subscriber; 1 means subscriber. It is not surprising that only a small percentage of people subscribed to a term deposit.

Data Preparation

The records in this dataset are ordered by date (from May 2008 to November 2010). Note that the 2008 economic crisis complicated this dataset, because time has to be considered as a factor in the analysis. We partitioned the data in time order: 80% as learning data and the remaining 20% as testing data.

Note: pdays = 999 means the client has never been contacted before this phone call.

Build LOGIT Model
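A LOGIT model is a logistic regression, fit by iterative optimization of the log-likelihood. A minimal one-predictor sketch using plain gradient descent (illustrative only; this is not SPM's estimation routine, and the toy data is made up):

```python
import math

def fit_logit(xs, ys, lr=0.1, steps=2000):
    """Fit intercept b0 and slope b1 by full-batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted probability
            g0 += (p - y) / n                            # gradient w.r.t. b0
            g1 += (p - y) * x / n                        # gradient w.r.t. b1
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

def predict(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Toy data: larger x values subscribe more often.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logit(xs, ys)
print(predict(b0, b1, 0.0) < 0.5, predict(b0, b1, 5.0) > 0.5)  # True True
```

The fitted coefficients are log-odds effects, which is why the coefficient table on the next slides reads like a standard regression output.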

LOGIT Model Summary

The learn ROC value is 0.94, which should get your attention: examine whether it is too good to be true. The difference between the learn and test ROC tells us that time does have an impact.

LOGIT Model Coefficients

Partial coefficients are shown in the table above.

CART

Classification and Regression Trees:
o Separates relevant from irrelevant predictors
o Yields simple, easy-to-understand results
o Doesn't require variable transformations
o Impervious to outliers and missing values

The fastest, most versatile predictive modeling algorithm available to analysts. Provides the foundation for modern data mining techniques such as bagging and boosting.
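The core CART mechanic is choosing the split that most reduces node impurity. A one-split sketch using the Gini criterion on a hypothetical numeric predictor (illustrative only, not the SPM implementation):

```python
def gini(labels):
    """Gini impurity of a binary label set: 2p(1-p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best split x <= t."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # (3, 0.0): x <= 3 separates the classes perfectly
```

A full tree applies this search recursively in each child node; the "variable importance" scores on later slides accumulate these impurity improvements per predictor.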

Build CART Model

Testing Method

CART Model

The learn and test samples perform quite differently with this model, which means time does contribute as a factor influencing the outcome. The learning-sample performance also looks too good to be true.

Variable Importance

Duration: this attribute strongly affects the output target (e.g., if duration = 0 then y = 'no'). Yet the duration is not known before a call is performed.

Rerun CART model excluding Duration

Variable Importance Ranking

CART gives an initial look at which variables are important; this is useful when there are quite a few predictors in your dataset.

Root Node Split Very Effective

We can view node details by clicking Tree Details in the CART output window. The first splitter is month, which the variable importance ranking table also shows as the most influential predictor. The whole tree, with details, can be viewed as well.

MARS

Multivariate Adaptive Regression Splines:
o Uses knots to impose local linearities
o These knots create basis functions that decompose the information in each variable individually

[Figure: two plots of MV against LSTAT, showing a single linear fit versus a piecewise-linear fit with knots]
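At each knot, MARS creates a mirrored pair of hinge basis functions, max(0, x - knot) and max(0, knot - x); coefficients on the pair give different slopes on each side of the knot. A sketch with made-up numbers:

```python
def hinge_pair(x, knot):
    """The two basis functions MARS creates at a knot."""
    return max(0.0, x - knot), max(0.0, knot - x)

def piecewise(x, intercept, c_right, c_left, knot):
    """A MARS-style term: intercept plus coefficients on the hinge pair."""
    right, left = hinge_pair(x, knot)
    return intercept + c_right * right + c_left * left

# Knot at x = 10: slope -2 below it, slope +0.5 above it (hypothetical values).
ys = [piecewise(x, 5.0, 0.5, 2.0, 10.0) for x in (0, 10, 20)]
print(ys)  # [25.0, 5.0, 10.0]
```

Because each basis function is zero on one side of its knot, the fit is locally linear: changing the coefficient for one hinge reshapes only one region of the curve.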

Build MARS Model

MARS Model Setup

The max basis functions default setting is 15; the model often hits this limit and stops before reaching the optimal model. So we set it to 60 after a couple of runs.

MARS Output Window

This output window shows the number of basis functions in the model plotted against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values are still reported, but they can be ignored here.

Summary

This model improved the targeting of customers, with an ROC of 0.72.

MARS Basis Functions

Here the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described, and the final model is listed at the bottom. This form of output is especially desirable for those who are comfortable with standard regression.

MARS Plots

Note the presence of nonlinearity in this dataset.

TreeNet

Stochastic Gradient Boosting: small decision trees built in an error-correcting sequence.
1. Begin with a small tree as the initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
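The four steps above can be sketched with the simplest possible small tree, a one-split regression stump, fit stage after stage to the current residuals and added with a shrinkage (learning-rate) factor. Illustrative only; TreeNet's real base learners and loss handling are more elaborate:

```python
def stump_fit(xs, rs):
    """Fit a one-split regression stump to residuals rs; return a predictor."""
    best = None
    for t in sorted(set(xs))[:-1]:                 # candidate thresholds
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def boost(xs, ys, stages=100, lr=0.1):
    pred = [0.0] * len(ys)                         # 1. trivial initial model
    for _ in range(stages):
        resid = [y - p for y, p in zip(ys, pred)]  # 2. residuals for all records
        s = stump_fit(xs, resid)                   # 3. small tree on residuals
        pred = [p + lr * s(x) for x, p in zip(xs, pred)]  # 4. and so on
    return pred

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
pred = boost(xs, ys)                               # converges toward ys
```

The small learning rate is why many trees are needed; each stage corrects only a fraction of the remaining error, which is what the trees-vs-ROC curve in the TreeNet output window traces out.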

Build TreeNet Model

TreeNet Output Window

The output window shows a graph of the number of trees in the ensemble against the corresponding ROC value. The vertical green bar denotes the model with the optimal ROC: 9 trees at 0.69.

Partial Dependency Plots

Using TreeNet for targeted marketing has improved on random calling and gives you an idea of how the predictors affect subscription.

Random Forests

Ensemble of trees built on bootstrap samples.

Algorithm:
o Each tree is grown on a bootstrap sample from the learning data
o During tree growing, only P predictors are selected and tried at each node
o By default, P is the square root of the total number of predictors

The overall prediction is determined by averaging. The Law of Large Numbers ensures convergence. The key to accuracy is low correlation and low bias; to keep bias low, trees are grown to maximum depth.
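The three algorithm ingredients above can be sketched in miniature: bootstrap sampling of rows, a random sqrt(P) subset of predictors per node, and averaging of the trees' votes. Illustrative only; the tree-growing itself is omitted:

```python
import random

def bootstrap_sample(n, rng):
    """Sample n row indices with replacement (one tree's learning sample)."""
    return [rng.randrange(n) for _ in range(n)]

def random_predictors(p, rng):
    """Pick the sqrt(p) predictors tried at one node."""
    k = max(1, int(p ** 0.5))
    return rng.sample(range(p), k)

def forest_vote(tree_preds):
    """Overall prediction by averaging the trees' 0/1 votes."""
    return sum(tree_preds) / len(tree_preds)

rng = random.Random(0)
rows = bootstrap_sample(41188, rng)     # sized like the bank data
feats = random_predictors(15000, rng)   # sized like the KDD Cup data
print(len(rows), len(feats), forest_vote([1, 0, 1, 1]))  # 41188 122 0.75
```

Restricting each node to a random predictor subset is what decorrelates the trees; averaging many decorrelated, deep (low-bias) trees is the source of the ensemble's accuracy.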

Build RandomForests Model

RandomForests Output

The RandomForests optimal model is always the one with the most trees.

RandomForests Summary

Prediction Success Table 1

We want to minimize the false non-subscriber rate, so that we spend the least effort to reach the most subscribers.

Adjust Class Weights

The class weights default is BALANCED, which upweights small classes to equal the size of the largest target class. Now we manually upweight class 1, the small class, even more than the BALANCED setting does.

Prediction Success Table 2

Conclusion

CART, MARS, TreeNet and RandomForests:
o Handle missing values automatically
o Detect interactions and nonlinearity automatically
o Models can be translated into other programming languages
o Model performance usually exceeds traditional classification algorithms
o Advanced settings boost model performance

CART provides initial insights into the dataset. MARS gives equations in a linear regression format with transformations of the original predictors. TreeNet generates more accurate models. RandomForests excels on wide datasets.

KDD Cup 2009

The Knowledge Discovery and Data Mining competition is held once a year to challenge modelers with a task:
o http://www.kdd.org/kddcup/index.php - competitions from 1997-2010
o Includes tasks, data, rules, results, and FAQs

KDD Cup 2009 was about customer relationship prediction. The French telecom company Orange provided large marketing databases. The overall goal was to beat the in-house system implemented by Orange.

o 50,000 customers
o 15,000 predictors (e.g., demographic, geographic, behavioral)
o Three binary classification tasks:
  o Appetency: customer buys a new product or service
  o Churn: customer switches providers
  o Upselling: customer buys an upgrade offered to them
o Training and testing datasets
o Smaller subsets of data available for practice

Challenges

o Large database: 50,000 x 15,000
o Numerical and categorical variables
o Missing data
o Unbalanced class distributions: many more customers NOT doing these things
o Sanitized data - no intuition about variable meanings

Data Preparation

Combine multiple datasets:
o The large dataset was broken into 5 chunks of 53 MB each
o True target values needed to be appended

Delete or impute missing values:
o Not necessary in SPM

Handle categorical variables:
o Create dummy indicators
o Combine levels in variables with many categories
o Again, not necessary in SPM
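For tools that do require the categorical-variable steps above (SPM does not, per the slide), they amount to collapsing rare levels and expanding the column into 0/1 indicators. A sketch with hypothetical job values:

```python
from collections import Counter

def combine_levels(values, min_count=2, other="OTHER"):
    """Collapse levels seen fewer than min_count times into one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def dummy_indicators(values):
    """Expand a categorical column into one 0/1 column per level."""
    levels = sorted(set(values))
    return [{f"IS_{lv}": int(v == lv) for lv in levels} for v in values]

jobs = ["services", "services", "admin", "admin", "housemaid"]
collapsed = combine_levels(jobs)        # "housemaid" appears once -> OTHER
rows = dummy_indicators(collapsed)
print(rows[0])  # {'IS_OTHER': 0, 'IS_admin': 0, 'IS_services': 1}
```

With 15,000 predictors, this expansion can multiply the column count dramatically, which is one reason engines that consume categorical variables natively are convenient here.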

Open Prepared Data

View Data

Run Descriptive Statistics

Target Frequencies

Appetency

In this context, appetency is the propensity of the customer to buy a new product or service.

CART Model Setup

Choose CART as the analysis engine. Our target is coded -1/1, so we choose Classification/Logistic Binary as the target type. Appetency is our response variable, and VAR1-VAR15000 are our predictors.

Setting a Testing Method

A separate test dataset is provided in the competition, but true target values were not included. For model building, we use a 20% random partition of the training dataset to monitor performance.

Restricting Tree Size

We are interested in CART's ranking of important predictors. By forcing the tree to only one split, we can quickly create a tree to access this information.

Penalties

We know there are variables with many missing values and variables with a high number of categorical levels. Setting penalties on these makes it harder for such variables to enter the model.

Results - Single Split CART Tree

Variable Improvement Measures

TreeNet Model Setup

Results - TreeNet Ensemble

Variable Selection

Improvement measures are averaged across all trees in the ensemble. Only 185 of the original 15,000 predictors are flagged as important.

Recursive Feature Elimination (RFE)

Remove one variable at a time from the TOP of the variable importance list to eliminate too-good-to-be-true predictors.

RFE, Step 2

Remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors. Final ROC: 0.9048.
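The two shaving passes can be sketched as a single helper: given (name, importance) pairs, drop either the TOP variable (a suspected leaker) or the BOTTOM one (weak), refitting after each removal in the real workflow. The variable names and importance scores below are hypothetical:

```python
def shave(ranked, end):
    """Remove one variable from the 'top' or 'bottom' of an importance-ranked list."""
    ranked = sorted(ranked, key=lambda nv: nv[1], reverse=True)
    return ranked[1:] if end == "top" else ranked[:-1]

# Hypothetical importance scores for illustration.
vars_ = [("VAR126", 100.0), ("VAR28", 40.0), ("VAR211", 35.0), ("VAR9", 0.1)]
step1 = shave(vars_, "top")      # drop the too-good-to-be-true leader
step2 = shave(step1, "bottom")   # then drop the weakest survivor
print([n for n, _ in step2])     # ['VAR28', 'VAR211']
```

Shaving from the top guards against leakage (predictors like duration that encode the outcome), while shaving from the bottom trims noise; in SPM both passes are automated rather than hand-coded.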

Parameter Variation - Automates

Each TreeNet control parameter can be automatically varied over its values. A model is built at each step and summarized.

Stability of the Model

Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance.

Repeat on Churn

Churn is the propensity of the customer to switch providers. We repeat the same model-building steps to arrive at a final model. Final ROC: 0.7320.

Repeat on Upsell

Upsell is the propensity of the customer to buy an upgrade offered to them. We repeat the same model-building steps to arrive at a final model. Final ROC: 0.9059.

Summary of Results

Rank  Team                                     Appetency  Churn   Upselling  Score
1     IBM Research                             0.8830     0.7611  0.9038     0.8493
-     You!                                     0.9048     0.7320  0.9059     0.8476
2     ID Analytics, Inc.                       0.8724     0.7565  0.9056     0.8448
3     Old dogs with new tricks                 0.8740     0.7541  0.9050     0.8443
4     Crusaders                                0.8688     0.7569  0.9034     0.8430
5     Financial Engineering Group, Inc. Japan  0.8732     0.7498  0.9057     0.8429

We are unable to compare against true target values, because these were seen only by the competition judges. However, we are confident in our results (two of the above groups used SPM). Results can vary based on the optimal selection criterion, random number seed, etc.

Overall Conclusions

We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING. Of the original 15,000 predictors:
o Appetency: 167
o Churn: 249
o Upselling: 165

Handling of categorical variables and missing values was automatic and didn't cause any issues. Small rates in the class of interest didn't pose a problem: priors/costs and class weights can control for this in CART and TreeNet. We couldn't draw any insight as to the variables affecting appetency, churn, and upsell.