Marrying Prediction and Segmentation to Identify Sales Leads

Similar documents
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Package trimtrees. February 20, 2015

Classification and Regression by randomforest

Using multiple models: Bagging, Boosting, Ensembles, Forests

How is the Net Promoter score calculated?

Package bigrf. February 19, 2015

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

IBM SPSS Direct Marketing 23

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

IBM SPSS Direct Marketing 22

Gerry Hobbs, Department of Statistics, West Virginia University

Beating the MLB Moneyline

Chapter 12 Bagging and Random Forests

How To Identify A Churner

What really drives customer satisfaction during the insurance claims process?

Note: A WebFOCUS Developer Studio license is required for each developer.

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Predicting borrowers chance of defaulting on credit loans

Fast Analytics on Big Data with H20

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

DATA MINING SPECIES DISTRIBUTION AND LANDCOVER. Dawn Magness Kenai National Wildife Refuge

Project SailFin: Building and Hosting Your Own Communication Server.

Data Science with R Ensemble of Decision Trees

Environmental Remote Sensing GEOG 2021

Chapter 12 Discovering New Knowledge Data Mining

Data Science with R. Introducing Data Mining with Rattle and R.

Fraud Detection for Online Retail using Random Forests

Adobe Analytics Premium Customer 360

Predicting Defaults of Loans using Lending Club s Loan Data

R software tutorial: Random Forest Clustering Applied to Renal Cell Carcinoma Steve Horvath and Tao Shi

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Framing Business Problems as Data Mining Problems

Marketing Operations Cookbook

An Oracle White Paper April How to Install the Oracle Solaris 10 Operating System on x86 Systems

The Operational Value of Social Media Information. Social Media and Customer Interaction

How To Make A Credit Risk Model For A Bank Account

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

The Data Mining Process

Data Mining Applications in Higher Education

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Data Mining. Nonlinear Classification

Using R for Customer Analytics

Google Analytics Guide

USING THE PREDICTIVE ANALYTICS FOR EFFECTIVE CROSS-SELLING

CART 6.0 Feature Matrix

Structural Health Monitoring Tools (SHMTools)

Microsoft Azure Machine learning Algorithms

Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation

M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Advanced analytics at your hands

Predicting Flight Delays

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

1. Classification problems

Virtualization Life, The Universe and Everything. Tony Lock Freeform Dynamics Ltd

Model Selection. Introduction. Model Selection

Predictive modelling around the world

IBM SPSS Direct Marketing 19

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Distributed forests for MapReduce-based machine learning

Project OpenSolaris TM Dynamic Service Containers featuring Nimsoft SLM

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Oracle. Human Capital Management Cloud Using Workforce Reputation Management. Release 11. This guide also applies to on-premise implementations

What Is NetBeans? Free and open-source based > Open source since June, 2000 > Large community of users and developers

A Guide to Cover Letter Writing

The 5 Minute WordPress Setup Guide

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

DBTech Pro Workshop. Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining. Georgios Evangelidis

Comparison of Data Mining Techniques used for Financial Data Analysis

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Predictive Data modeling for health care: Comparative performance study of different prediction models

Digital Segmentation. Basic principles of effective customer segmentation

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Implementation of Breiman s Random Forest Machine Learning Algorithm

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data mining on the EPIRARE survey data

OCS Virtual image. User guide. Version: Viking Edition

xvm Server and xvm VirtualBox Christopher Beal Principal Engineer

Downloading and Using Mozilla Thunderbird By Brazos Price Spring 2005

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

Program Guide. Module LifeStylized.com

Configuring Sun StorageTek SL500 tape library for Amanda Enterprise backup software

Active Learning SVM for Blogs recommendation

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

Beating the NCAA Football Point Spread

Netbeans 6.0. José Maria Silveira Neto. Sun Campus Ambassador

Data Warehousing and Data Mining for improvement of Customs Administration in India. Lessons learnt overseas for implementation in India

Marketing Strategies for Retail Customers Based on Predictive Behavior Models

Transcription:

Marrying Prediction and Segmentation to Identify Sales Leads Jim Porzak, The Generations Network Alex Kriney, Sun Microsystems February 19, 2009 1

Business Challenge In 2005 Sun Microsystems began the process of open-sourcing and making its software stack freely available Many millions of downloads per month* Heterogeneous registration practices Multichannel (web, email, phone, inperson) and multitouch marketing strategy 15%+ response rates; low contact-to-lead ratio Challenge: Identify the sales leads * Solaris, MySQL, GlassFish, NetBeans, OpenOffice, OpenSolaris, xvm Virtualbox, JavaFX, Java, etc. Sun Confidential: Internal Only 2

The Project Focus on Solaris 10 (x86 version) Data sources: download registration, email subscriptions, other demographics databases, product x purchases What are the characteristics of someone with a propensity to purchase? How can we become more efficient at identifying potential leads? Apply learnings to marketing strategy for all products and track results Sun Confidential: Internal Only 3

Predictive Models Employed Building a purchase model using random forests Creating prospect persona segmentation using cluster analysis Sun Confidential: Internal Only 4

Random Forest Purchase Model Sun Confidential: Internal Only 5

Random Forests Developed by Leo Breiman of Cal Berkeley, one of the four developers of CART, and Adele Cutler, now at Utah State University. Accuracy comparable with modern machine learning methods. (SVMs, neural nets, Adaboost) Built in cross-validation using Out of Bag data. (Prediction error estimate is a by product) Large number candidate predictors are automatically selected. (Resistant to over training) Continuous and/or categorical predicting & response variables. (Easy to set up.) Can be run in unsupervised for cluster discovery. (Useful for market segmentation, etc.) Free Prediction and Scoring engines run on PC s, Unix/Linux & Mac s. (R version) See Appendix for links to more details. Sun Confidential: Internal Only 6

Data for Model Data cleanup issues > Categorical variables with many categories limited to top 30, with rest grouped into Other2 > Silly answers in number of licenses questions > Email domains checked & simplified Predict purchase of product x; yes or no. Types of predictors > Demographic: Job & Organization > Intended Use by application area > Engagement with Sun: Newsletter list combinations Solaris product registration not used in model concern that registration was intrinsic in purchase process Sun Confidential: Internal Only 7

randomforest Run Training set: random sample of 8000 records Tuned & repeated to verify stability. Final model: > train2.rf Call: randomforest(x = train2[, -1], y = train2[, 1], classwt = c(5, 1), importance = TRUE, proximity = FALSE) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 5 OOB estimate of error rate: 27.3% Confusion matrix: Has Paid Not Paid class.error Has Paid 2884 1116 0.279 Not Paid 1068 2932 0.267 Sun Confidential: Internal Only 8

Diagnostics Error Rate by # Trees Purchase model Error Rate by # Trees Sun Confidential: Internal Only 9

Variable Importance Plot Purchase model Variable Importance Sun Confidential: Internal Only 10

Partial Dependence Plot Example Sun Confidential: Internal Only 11

Model worked well! Purchase model Validation Trained on 8,000 records, validated on entire database. % Cumulative % Purchased depth of file 20% 3.4% 40% 7% 60% 13% 80% 30% Sun Confidential: Internal Only 12

Predictors by Importance Purchase model Variable Importance Tier 1 Tier 2 Tier 3 Sun Confidential: Internal Only 13

Variables that Predict Purchase On the whole those who raise their hands and identify themselves with more detailed company-related information tend to purchase Specificity is the key to all the variables 3 tiers of variables in order of importance Tier 1 - Email domain - List membership - Company Tier 2 - Organization (dept.) - Job Role - Number of Lists Tier 3 - Industry - Everything else Sun Confidential: Internal Only 14

A Closer Look at Tier 1 Users that put in a corporate domain tend to purchase. Users that use free web-based email such as Yahoo or Gmail do not People that enter a valid company name tend to purchase Opting in for most lists has a positive association with purchase, while a few lists and not opting in at all have a negative association Some lists are better than others. Developers and students do not purchase. SysAdmins and generalists do when they are subscribed to multiple lists Sun Confidential: Internal Only 15

A Closer Look at Tier 2 Organization (Department) IT, MIS, SYS Admin positions have a high tendency to purchase Job Role (more of a functional attribute) The more specific the role that is entered the higher the association with purchase, i.e. Systems Administrator The more email lists an individual signs up for increases the association with purchase Sun Confidential: Internal Only 16

Digital Persona Cluster Model Sun Confidential: Internal Only 17

Clustering Data Source 1 of 2 Surveys from 20k respondents > All within same time frame > All requesting Solaris download This survey started after modeling data pull > Totally independent data used here! Question 1 free form inputs coded as 1 for a reasonable answer, 0 for none or silly. Rest code 1 if ticked, else 0. > Other coded 1 if used Sun Confidential: Internal Only 18

Clustering Data Source 2 of 2 Variables: Q1. Licenses? Q2. Roles? Q3. Hardware systems? Q4. Why most interested? Q5. What applications? Sun Confidential: Internal Only 19

Choice of Clustering Method In this kind of check box survey, checking a box is much more significant than not checking it. > If I check Database, that says a whole lot more about me than if I don't check Database. We need an appropriate distance measure. > Expectation based Jaccard Fritz Leisch's flexclust package for R handles this quite nicely. > See Appendix for links & details We tell flexclust the number of clusters to find. Starting from a random point, it does so. Our challenge is picking a stable & meaningful solution. Sun Confidential: Internal Only 20

One Example 4-cluster Solution Sun Confidential: Internal Only 21

Another Example 4-cluster Solution Sun Confidential: Internal Only 22

Best Solution was 5 Clusters SPARC Loyalists Other Brand Responders Other Brand Non-Responders Sun x86 Loyalists Students Sun Confidential: Internal Only 23

Sparc Loyalists Digital Persona Segments Other Brand Responders 0.0 0.2 0.4 0.6 0.8 1.0 X86 Loyalists Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sy s_sunsparc Sy s_sun6486 Sy s_dell Sy s_hp Sy s_ibm Sy s_fujitsu Sy s_other Int_Multiplatf orm Int_Opensource Int_Pred_Lif e Int_Pricing Int_Support Int_64_bit Int_Containers Int_Perf ormance Int_DTrace Int_ZFS Int_Other Ap_Web_Serv ing Ap_Inf rastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Dev elopment Ap_High_Perf Ap_Other Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sy s_sunsparc Sy s_sun6486 Sy s_dell Sy s_hp Sy s_ibm Sy s_fujitsu Sy s_other Int_Multiplatf orm Int_Opensource Int_Pred_Lif e Int_Pricing Int_Support Int_64_bit Int_Containers Int_Perf ormance Int_DTrace Int_ZFS Int_Other Ap_Web_Serv ing Ap_Inf rastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Dev elopment Ap_High_Perf Ap_Other #1: 2313 (11.56%) #4: 2699 (13.49%) #5: 3858 (19.29%) 0.0 0.2 0.4 0.6 0.8 1.0 Other Brand Non-Responders #2: 6472 (32.36%) #3: 4658 (23.29%) Students Red dots represent the average response for the entire population Height of the bar represents that clusters response and can be viewed in context of the average response 24 Sun Confidential: Internal Only 24

Cluster 1: 2313 points (11.56%) 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sys_SUNSPARC Sys_SUN6486 Sys_DELL Sys_HP Sys_IBM Sys_FUJITSU Sys_Other Int_Multiplatform Int_Opensource Int_Pred_Life Int_Pricing Int_Support Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sys_SUNSPARC Sys_SUN6486 Sys_DELL Sys_HP Sys_IBM Sys_FUJITSU Sys_Other Int_Multiplatform Int_Opensource Int_Pred_Life Int_Pricing Int_Support Int_64_bit Int_Containers Int_Performance Int_DTrace Sparc Loyalists Cluster 2: 6472 points (32.36%) Responders Int_64_bit Int_Containers Int_Performance Int_DTrace Int_ZFS Int_Other Ap_Web_Serving Ap_Infrastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Development Ap_High_Perf Ap_Other Int_ZFS Int_Other Ap_Web_Serving Ap_Infrastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Development Ap_High_Perf Ap_Other Cluster 1: SPARC Loyalists Heavy SPARC users/licensees Tend to be Sys Admins Do not show interest in open source Cluster 2: Other Brand Responders Sys admins Use multi-platform(non-sun) hardware in their data centers Very Interested in a variety of applications for Solaris 10 Sun Confidential: Internal Only 25

0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sys_SUNSPARC Sys_SUN6486 Sys_DELL Sys_HP Sys_IBM Sys_FUJITSU Sys_Other Int_Multiplatform Int_Opensource Int_Pred_Life Cluster 3: 4658 points (23.29%) X86 Loyalists Cluster 4: 2699 points (13.49%) Non-responders Int_Pricing Int_Support Int_64_bit Int_Containers Int_Performance Int_DTrace Int_ZFS Int_Other Ap_Web_Serving Ap_Infrastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Development Ap_High_Perf Ap_Other Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sys_SUNSPARC Sys_SUN6486 Sys_DELL Sys_HP Sys_IBM Sys_FUJITSU Sys_Other Int_Multiplatform Int_Opensource Int_Pred_Life Int_Pricing Int_Support Int_64_bit Int_Containers Int_Performance Int_DTrace Int_ZFS Int_Other Ap_Web_Serving Ap_Infrastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Development Ap_High_Perf Ap_Other Cluster 3 x86 Loyalists Role is not distinguished from the general population Plan to run Solaris 10 on Sun x86 hardware. Not on competitor hardware Less apt to respond to interest and application questions Cluster 4 Other Brand Non-responders Less likely to respond to questions across the board More likely to run Solaris on competitor hardware Sun Confidential: Internal Only 26

0.0 0.2 0.4 0.6 0.8 Cluster 5: 3858 points (19.29%) Students Lic_SPARC Lic_X86 Role_Dev Role_SA Role_IT_Mngr Role_IT_Arch Role_Student Role_Other Sys_SUNSPARC Sys_SUN6486 Sys_DELL Sys_HP Sys_IBM Sys_FUJITSU Sys_Other Int_Multiplatform Int_Opensource Int_Pred_Life Int_Pricing Int_Support Int_64_bit Int_Containers Int_Performance Int_DTrace Int_ZFS Int_Other Ap_Web_Serving Ap_Infrastructrue Ap_Collaboration Ap_Database Ap_J2EE Ap_Desk_Lap_Top Ap_Development Ap_High_Perf Ap_Other Cluster 5 Students Identify as students Interested in open source Desk/Laptop productivity is application of choice Sun Confidential: Internal Only 27

Cluster SPARC Loyalists Other Brand Responders Number of Contacts Number of Purchases Percent of contacts Percent Purchases 2,548 910 Low 11.6% High 33.7% 7,116 806 High 32.4% High 29.8% X86 Loyalists 5,103 619 High 23.2% High 22.9% Other Brand Non-Responders 2,973 250 13.5% Low Low 9.2% Students 4,230 118 19.3% Low Low 4.4% - SPARC loyalists are a small % of x86 downloaders but have a high propensity to purchase - People who identify as loyal to other hardware providers and responded to the questions are a high % of downloaders and are just as likely to purchase as those that self-identify as loyal Sun x86 hardware - Students and people that do not fill out the survey completely do not have a high propensity to purchase Sun Confidential: Internal Only 28

Summary of Findings Specificity of certain collected registration field values and subscription to multiple specific email lists are highly predictive of purchase propensity (Tier 1 & Tier 2) Incompleteness of registration response is predictive of low purchase propensity Stated preferences for competitor products does not negatively impact purchase propensity Students have more time than money Sun Confidential: Internal Only 29

Shifts in Marketing Strategy Help purchasers self-identify Focus on registration standardization Targeted content development Segmented outbound calling lists Incorporate learnings into our lead scoring algorithm in progress Sun Confidential: Internal Only 30

Early Results 2x increase in ratio of leads to outbound calls 33% increase in per-lead pipeline $ value 3 4x improvement in outbound calling efficiency FY '09 goal for standardized registration Sun Confidential: Internal Only 31

Questions? Now would be the time! Keep in contact! > Alex's email: alex.kriney@sun.com > Jim's email: JPorzak@tgn.com Sun Confidential: Internal Only 32

Appendix Sun Confidential: Internal Only 33

Sun Microsystems Links Product/Solution OpenStorage Sun Systems for MySQL Sun x64 Systems Java NetBeans OpenSolaris MySQL GlassFish OpenSSO JavaFX OpenOffice xvm VirtualBox OpenSPARC Solaris URL sun.com/storagetek/open.jsp sun.com/systems/solutions/mysql/ sun.com/x64/index.jsp java.com netbeans.org opensolaris.org mysql.com sun.com/glassfish opensso.dev.java.net/ javafx.com openoffice.org virtualbox.org opensparc.org sun.com/software/solaris/ Sun Confidential: Internal Only 34

R Links R Homepage: www.r-project.org > The official site of R R Foundation: www.r-project.org/foundation > Central reference point for R development community > Holds copyright of R software and documentation Local CRAN mirror site: > We use: cran.cnr.berkeley.edu > Find yours at: cran.r-project.org/mirrors.html > Current Binaries, Documentation & FAQs, & more! Your local user! Group: > www.meetup.com/r-users/ Sun Confidential: Internal Only 35

Random Forest Links Original Fortran 77 source code freely available from Breiman & Cutler: > http://www.stat.berkeley.edu/users/breiman/randomforests/cc_home.h & http://www.math.usu.edu/~adele/forests/ Commercialization by Salford Systems: > http://www.salford-systems.com/randomforests.php R package, randomforest. An adaptation by Andy Liaw & Matthew Wiener of Merck: > http://cran.cnr.berkeley.edu/web/packages/randomforest/index.htm See Jim's randomforest tutorial deck: > http://www.porzak.com/jimarchive/jimporzak_rfwithr_dmaac_jan07_webin Sun Confidential: Internal Only 36

flexclust Links The flexclust package on CRAN: > http://cran.cnr.berkeley.edu/web/packages/flexclust/index.html Fritz Leisch's page: > http://www.stat.uni-muenchen.de/~leisch/ (despite picture, he is really a jolly guy!) > See: http://www.stat.uni-muenchen.de/~leisch/papers/leisch-2006.pdf See second half of Jim's user! 2008 tutorial: > http://www.porzak.com/jimarchive/user2008/user08_jimp_custseg.p Sun Confidential: Internal Only 37