Tree Ensembles: The Power of Post-Processing
December 2012
Dan Steinberg, Mikhail Golovnya, Salford Systems


Course Outline
- Salford Systems: quick overview
- TreeNet: an ensemble of boosted trees
- GPS: modern, highly versatile regularized regression
- ISLE: an effective way to compress ensemble models
- RuleLearner: extracting the most interesting sets of rules
- Examples of compression

Brief Overview of Salford Systems
Salford Systems is principally well known for three reasons:
o Products (CART, MARS, TreeNet, RandomForests)
o Brains trust (J. Friedman, L. Breiman, R. Olshen, C. Stone)
o Awards (multiple data mining competitions from 2000 until today)
Marquee list of major customers in banking, insurance, retail trade (bricks-and-mortar and online), pharmaceuticals, etc.:
o NAB Bank
o American Express
o VISA USA (credit cards)
o Capital One (credit cards)
o Travelers Insurance (formerly Citigroup)
o Johnson & Johnson
Partners in China and Japan (Asian-language versions)

Salford Competitive Awards
- 2010 Direct Marketing Association: 1st place*
- 2009 KDDCup: IDAnalytics* and FEG Japan*, 1st runner-up
- 2008 DMA Direct Marketing Association: 1st runner-up
- 2007 PAKDD Pacific Asia KDD: Credit Card Cross-Sell, 1st place
- 2006 DMA Direct Marketing Association: Predictive Modeling*
- 2006 PAKDD Pacific Asia KDD: Telco Customer Type Profiling
- 2005 BI-Cup Latin America: Predictive Modeling E-commerce*, 1st place
- 2004 KDDCup: Predictive Modeling, Most Accurate*
- 2002 NCR/Teradata Duke University: Predictive Modeling - Churn
  o 1st place in all four separate predictive modeling challenges
- 2000 KDDCup: Predictive Modeling - Online Behavior, 1st place
- 2000 KDDCup: CRM Analysis, 1st place
*Won either by Salford or by a client using Salford tools

Salford Predictive Modeler (SPM)
- Download a current version from our website: http://www.salford-systems.com
- The version will run without a license key for 10 days
- Request a license key from unlock@salford-systems.com
- Request a configuration to meet your needs:
  o Data handling capacity
  o Data mining engines made available

Stochastic Gradient Boosting
A new approach to machine learning / function approximation developed by Jerome H. Friedman at Stanford University:
o Co-author of CART with Breiman, Olshen and Stone
o Author of MARS, PRIM, Projection Pursuit
Also known as TreeNet. Good for classification and regression problems.
Built on small trees, and thus:
o Fast and efficient
o Data driven
o Immune to outliers
o Invariant to monotone transformations of variables
o Resistant to overtraining: generalizes very well
Can be remarkably accurate with little effort, BUT the resulting model may be very complex.

The Algorithm
- Begin with a very small tree as the initial model:
  o Could be as small as ONE split generating 2 terminal nodes
  o A typical model will have 3-5 splits per tree, generating 4-6 terminal nodes
  o The model is intentionally weak
- Compute residuals (prediction errors) of this simple model for every record in the data
- Grow a second small tree to predict the residuals of the first tree
- Compute residuals from this new 2-tree model and grow a 3rd tree to predict the revised residuals
- Repeat this process to grow a sequence of trees: Tree 1 + Tree 2 + Tree 3 + more trees...
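The residual-fitting loop above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the TreeNet implementation: scikit-learn's DecisionTreeRegressor stands in for the small tree, and the learning rate and dataset are assumptions, not values from the slides.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, random_state=0)  # synthetic regression data

learn_rate = 0.1                      # assumed shrinkage; the slides do not give one
pred = np.full(len(y), y.mean())      # step 0: a constant initial model
trees = []
for _ in range(200):
    residuals = y - pred                             # errors of the current model
    tree = DecisionTreeRegressor(max_leaf_nodes=4)   # small tree: at most 4 terminal nodes
    tree.fit(X, residuals)                           # next tree predicts the residuals
    pred += learn_rate * tree.predict(X)             # add its (shrunken) contribution
    trees.append(tree)

mse = float(np.mean((y - pred) ** 2))  # training error shrinks as trees accumulate
```

Each iteration adds one intentionally weak tree fit to the current residuals, so the ensemble improves gradually rather than in one large step.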

Illustration: Saddle Function
- 500 {X1, X2} points randomly drawn from a [-3, +3] box to produce the XOR response surface Y = X1 * X2
- Will use 3-node trees to show the evolution of the TreeNet response surface
[Figure: response surface after 1, 2, 3, 4, 10, 20, 30, 40, 100, and 195 trees]
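Assuming scikit-learn as a stand-in for TreeNet, the saddle illustration can be reproduced with GradientBoostingRegressor restricted to 3-node trees; `staged_predict` exposes the intermediate models, showing the surface being learned gradually.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(500, 2))  # 500 points from the [-3, +3] box
y = X[:, 0] * X[:, 1]                      # the saddle response Y = X1 * X2

# 3-node trees (max_leaf_nodes=3), grown up to the 195 trees shown in the figure
gbm = GradientBoostingRegressor(n_estimators=195, max_leaf_nodes=3,
                                learning_rate=0.1, random_state=0).fit(X, y)

# Training MSE after each added tree: 1 tree, 2 trees, ..., 195 trees
errors = [float(np.mean((y - p) ** 2)) for p in gbm.staged_predict(X)]
```

Plotting the staged predictions over the {X1, X2} grid would recreate the panel sequence from the slide: a crude surface after a few trees, an increasingly faithful saddle as trees accumulate.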

Notes on the TreeNet Solution
- The solution evolves slowly and usually includes hundreds or even thousands of small trees
- The process is myopic: only the next best tree given the current set of conditions is added
- There is a high degree of similarity and overlap among the resulting trees
- Very large tree sequences make model scoring time- and resource-intensive
- Thus, the ever-present need to simplify (reduce) model complexity

Real-Life Dataset
- Dataset supplied by the 2006 DMC data mining competition
- 8,000 iPod auctions held at eBay from May 2005 to May 2006 in Germany
- Auction items were grouped into 15 mutually exclusive categories based on size, type (regular, mini, nano), and color
- Goal: predict whether the closing price will be above or below the category average

The Data: Available Fields

Variable                     Description
CATEGORY_NAME                product category
LISTING_DURTN_DAYS           duration of auction
LISTING_TYPE_CODE            type of auction (normal auction, multi auction, etc.)
QTY_AVAILABLE                amount of offered items for a multi auction
FEEDBACK_SCORE               feedback rating of the seller of this auction listing
START_PRICE                  start price in EUR
BUY_IT_NOW_PRICE             buy-it-now price in EUR
BUY_IT_NOW_FLAG              option for buy-it-now on this auction listing
BOLD_FEE_FLAG                option for bold font on this auction listing
CATEGORY_FEATURED_FEE_FLAG   show this auction listing on top of its category
GALLERY_FEE_FLAG             auction listing with picture gallery
RESERVE_FEE_FLAG             auction listing with reserve price
Plus a variety of additional fees

Salford Systems, Copyright 2011

A Tree is a Variable Transformation

Node 1 (Avg = -0.105, N = 506)
  X1 <= 1.47 -> Node 2 (Avg = -0.068, N = 476)
    X2 <= -1.83 -> Terminal Node 1 (Avg = -0.250, N = 15)
    X2 >  -1.83 -> Terminal Node 2 (Avg = -0.062, N = 461)
  X1 >  1.47 -> Terminal Node 3 (Avg = -0.695, N = 30)

TREE_1 = F(X1, X2)
Any tree in a TreeNet model can be represented by a derived continuous variable as a function of its inputs.

ISLE Compression of a TreeNet Model

TN model coefficients:   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
ISLE model coefficients: 1.3 0.0 0.4 0.0 0.0 1.8 1.2 0.3 0.0 0.0 0.0 0.1 0.0

- The original TreeNet model combines all trees with equal coefficients
- ISLE accomplishes model compression by removing redundant trees and changing the relative contribution of the remaining trees by adjusting the coefficients
- Regularized regression methodology provides the required machinery to accomplish this task!
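The ISLE idea can be sketched under the assumption that scikit-learn's gradient boosting stands in for TreeNet and LassoCV stands in for GPS: each tree's output becomes a derived column (cf. "A Tree is a Variable Transformation"), and an L1-regularized regression re-weights the trees, driving many coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

X, y = make_friedman1(n_samples=1000, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=300, max_leaf_nodes=6,
                                learning_rate=0.1, random_state=0).fit(X, y)

# One derived continuous variable per tree: column j holds tree j's predictions.
T = np.column_stack([t[0].predict(X) for t in gbm.estimators_])

# Regularized regression re-weights the trees; the L1 penalty zeroes the
# coefficients of redundant trees, removing them from the ensemble.
lasso = LassoCV(cv=5, random_state=0).fit(T, y)
kept = int(np.sum(lasso.coef_ != 0))   # number of trees surviving compression
```

The surviving trees with their new coefficients form the compressed model; in practice the selection would be judged on a held-out test sample rather than the training data used here for brevity.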

OLS versus GPS

OLS regression: a sequence of linear models
- 1-variable model
- 2-variable model
- 3-variable model
- ...

GPS regression: a large collection of linear models (paths)
- 1-variable models, varying coefficients
- 2-variable models, varying coefficients
- 3-variable models, varying coefficients
- ...

[Diagram: candidate variables X1, ..., X6; data split into learn sample and test sample]

- GPS (Generalized Path Seeker) was introduced by Jerome Friedman in 2008
- It dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of the coefficients
- The optimal model of any desirable size can then be selected based on its performance on the TEST sample

Path Building Process

Zero-coefficient model -> (a variable is selected) -> sequence of 1-variable models -> (a variable is added) -> sequence of 2-variable models -> (a variable is added) -> sequence of 3-variable models -> ... -> OLS solution

Variable Selection Strategy
The Elasticity parameter controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive:
o Elasticity = 2: fast approximation of Ridge regression; introduces variables as quickly as possible and then jointly varies the magnitude of the coefficients. Lowest degree of compression.
o Elasticity = 1: fast approximation of Lasso regression; introduces variables sparingly, letting the current active variables develop their coefficients. Good trade-off of compression versus accuracy.
o Elasticity = 0: fast approximation of Stepwise regression; introduces new variables only after the current active variables are fully developed. Excellent degree of compression, but may lose accuracy.
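GPS itself is proprietary, but scikit-learn's ElasticNet exposes an analogous knob, `l1_ratio`: values near 0 behave ridge-like (many active variables, little compression) while 1.0 is the Lasso (sparse solutions). Note the scales run in opposite directions: GPS elasticity 2 is ridge-like and 1 is Lasso-like. A hypothetical comparison on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data: 30 candidate predictors, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

def n_active(l1_ratio):
    """Number of nonzero coefficients at a fixed penalty strength."""
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
    return int(np.sum(model.coef_ != 0))

ridge_like = n_active(0.01)  # keeps most variables: least compression
lasso_like = n_active(1.0)   # sparse solution: most compression
```

Sweeping `l1_ratio` (or alpha along a path) mirrors the elasticity trade-off described above: the sparser the penalty, the fewer variables survive.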

ISLE Compression on eBay Data: Test Performance
- ISLE managed to halve the model complexity while slightly improving the TEST sample performance
- The Lasso variable selection strategy turned out to be the winning compression approach here
- Note that even higher compression ratios can be achieved if one is willing to sacrifice a little bit of accuracy
- All ISLE models show uniformly better performance than the corresponding TN models

A Node is a Variable Transformation

Node 1 (Avg = -0.105, N = 506)
  X1 <= 1.47 -> Node 2 (Avg = -0.068, N = 476)
    X2 <= -1.83 -> Terminal Node 1 (Avg = -0.250, N = 15)
    X2 >  -1.83 -> Terminal Node 2 (Avg = -0.062, N = 461)
  X1 >  1.47 -> Terminal Node 3 (Avg = -0.695, N = 30)

NODE_X = F(X1, X2)
Any node in a TreeNet model can be represented by a derived dummy variable as a function of its inputs.

RuleLearner Compression

TN model nodes: T1_N1 + T1_T1 + T1_T2 + T1_T3 + T2_N1 + T2_T1 + T2_T2 + T2_T3 + T3_N1 + T3_T1 + ...
GPS regression: Coefficient 1 x Rule-set 1 + Coefficient 2 x Rule-set 2 + Coefficient 3 x Rule-set 3 + Coefficient 4 x Rule-set 4

- Create an exhaustive set of dummy variables for every node (internal and terminal) and every tree in a TN model
- Run GPS regularized regression to extract an informative subset of node dummies along with the modeling coefficients
- Thus, model compression can be achieved by eliminating redundant nodes
- Each selected node dummy represents a specific rule-set which can be interpreted directly for further insights
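Under the same stand-in assumptions (scikit-learn for TreeNet, LassoCV for GPS), the node-dummy construction can be sketched with `decision_path`, which flags every node, internal and terminal, that a record passes through:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

X, y = make_friedman1(n_samples=500, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=50, max_leaf_nodes=4,
                                learning_rate=0.1, random_state=0).fit(X, y)

# One 0/1 dummy per node of every tree: a record's dummy is 1 when it passes
# through that node, i.e. satisfies the conjunction of splits leading to it.
D = hstack([t[0].decision_path(X) for t in gbm.estimators_]).toarray()

# Regularized regression keeps an informative subset of node dummies; each
# surviving dummy is a directly interpretable rule-set.
lasso = LassoCV(cv=5, random_state=0).fit(D, y)
n_rules = int(np.sum(lasso.coef_ != 0))
```

Each retained column corresponds to a readable rule such as "X1 <= 1.47 AND X2 > -1.83", which is what makes the compressed model interpretable.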

ISLE Compression in Ad Serving
- A web portal leverages hundreds of TreeNet models to optimize the user experience
- Optimal models may each have hundreds or thousands of trees
- There is sufficient time to score no more than 30 trees (each with 6 nodes) and still execute the page quickly enough
- Originally, they simply took the first 30 trees from the model
  o May have reworked the model by manipulating the learn rate
- With ISLE, much better accuracy is possible because we can now work with an optimal subset of 30 trees instead of the first 30

RuleLearner Compression on eBay Data: Test Performance
- RuleLearner reduced model complexity to 1/10th while improving its performance
- A large number of HotSpots were identified
- All RuleLearner models show uniformly better performance than the corresponding TreeNet models

RuleLearner Compression in Banking
- Loan default dataset from a major bank: 16,000 accounts, 43 predictors, 10% overall default rate
- 64% compression using RuleLearner, with no accuracy loss
- The top rules have a lift of 6 and provide significant insights in terms of hot spots in loan defaults

Big Data: The Future
- In model compression we can think of the tree building as simply a search engine for interesting transforms of the raw data
- Ideal transforms:
  o Handle missing values
  o Are impervious to predictor outliers
  o Are invariant to how the raw data is expressed (e.g., scale)
- Transforms can be discovered on very small samples of the data (ideally stratified samples)
- The second-stage regression is much more easily run as a MapReduce job on the transforms
  o The challenge is that there can be many transforms
  o Results of such experiments are very promising