An Overview of Predictive Analytics for Practitioners Dean Abbott, Abbott Analytics
Dean Abbott
- Co-founder and Chief Data Scientist at SmarterHQ, based in Indianapolis, Indiana
- President of Abbott Analytics in San Diego, California
- Internationally recognized data mining and predictive analytics expert with over two decades of experience
- Author of Applied Predictive Analytics (Wiley, 2014); co-author of IBM SPSS Modeler Cookbook (Packt Publishing, 2013)
- Advisory board member and instructor for the UC Irvine Predictive Analytics Certificate Program and the UC San Diego Data Mining Certificate Program
Speaker Social Media
- @deanabb
- abbottanalytics.blogspot.com/
- http://www.linkedin.com/in/deanabbott/
- www.abbottanalytics.com/
What do Predictive Modelers do? The CRISP-DM Process Model
- CRoss-Industry Standard Process Model for Data Mining
- Describes the components of the complete data mining cycle from the project manager's perspective
- Shows the iterative nature of data mining
[Figure: CRISP-DM cycle around the data: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
CRISP-DM: Business Understanding
Steps
- Ask relevant business questions
- Determine data requirements to answer the business question
- Translate the business question into an appropriate data mining approach
- Determine the project plan for the data mining approach
Tasks and outputs
- Define Business Objectives: Background; Business Objectives; Business Success Criteria
- Assess Situation: Inventory of Resources; Requirements, Assumptions, Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Objectives: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools & Techniques
Objectives
- Business objective: a random test mailing to the NRA's house file achieved an 11% response rate; to be profitable, the model must find a population with a response rate of at least 13.5%
- Modeling objective: develop a binary outcome model that rank-orders the current database by propensity to respond to a traditional mailing, optimizing at a cumulative average response rate of >= 13.5%
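The modeling objective above can be illustrated with a small sketch. It is not the model from the talk: the function names and the toy scores are hypothetical, and it only shows how a cumulative response rate is read off a rank-ordered list and compared against the 13.5% threshold.

```python
def cumulative_response_rates(scores, responses):
    """Return the cumulative response rate at each depth of the rank-ordered list."""
    # Rank records from highest to lowest model score.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    rates = []
    responders = 0
    for depth, (_, responded) in enumerate(ranked, start=1):
        responders += responded
        rates.append(responders / depth)
    return rates


def deepest_profitable_depth(scores, responses, min_rate=0.135):
    """Return the largest depth whose cumulative rate is >= min_rate (0 if none)."""
    best = 0
    rates = cumulative_response_rates(scores, responses)
    for depth, rate in enumerate(rates, start=1):
        if rate >= min_rate:
            best = depth
    return best


# Hypothetical toy data: 10 scored records, 1 = responded, 0 = did not respond.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_responses = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_depth = deepest_profitable_depth(toy_scores, toy_responses)
```

In this toy list, mailing deeper than the returned depth would pull the cumulative response rate below the 13.5% threshold.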
CRISP-DM Step 2: Data Understanding
Tasks and outputs
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report
Steps
- Collect initial data
  - Internal data: historical customer behavior, results from previous experiments
  - External data: demographics and census, other studies and government research
  - Extract the superset of data (rows and columns) to be used in modeling
  - Identify the form of the data repository: multiple vs. single table, flat file vs. database, local copy vs. data mart
- Perform preliminary analysis
  - Characterize data (describe, explore, verify)
  - Condition data
Source Data
- The business partner provided data summarizing transactional data for every active NRA member: 49 independent variables
- TN Marketing enhanced the database with demographic data: 18 appended variables
- I-Miner was used to derive new variable features and transformations of pre-existing data points: 79 derived variables
CRISP-DM Step 3: Data Preparation (Conditioning)
Steps
- Fix data problems
- Create features
Tasks and outputs
- Select Data: Rationale for Inclusion/Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes; Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data
Data Preparation: Key Transformations
- Date features
- Filling missing data
  - Use the distribution when possible for numeric fields
  - Use a constant for categoricals
  - For numeric data with both in-house and third-party versions, use in-house when available; otherwise use third party
- Binning and binarization
  - Reduce the number of values for nominal variables with many poorly populated values
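The transformations listed above might look like the following sketch. It rests on assumptions that the slides do not confirm: "use the distribution" is read here as drawing replacements from the observed values, the constant "MISSING" and the label "OTHER" are placeholders, and all function names are hypothetical.

```python
import random
from collections import Counter


def fill_numeric_from_distribution(values, rng):
    """Replace None with a random draw from the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to draw from; leave the gaps in place
    return [v if v is not None else rng.choice(observed) for v in values]


def fill_categorical_constant(values, constant="MISSING"):
    """Replace None in a categorical field with a single constant level."""
    return [v if v is not None else constant for v in values]


def prefer_in_house(in_house, third_party):
    """Use the in-house value when available, otherwise the third-party value."""
    return [a if a is not None else b for a, b in zip(in_house, third_party)]


def collapse_rare_levels(values, min_count, other="OTHER"):
    """Merge poorly populated nominal levels into one catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical toy columns.
ages = fill_numeric_from_distribution([34.0, None, 51.0], random.Random(0))
states = collapse_rare_levels(["CA", "CA", "WY", "VT", "CA"], min_count=2)
income = prefer_in_house([50000, None, 72000], [48000, 61000, None])
```

Drawing from the observed values keeps the shape of the numeric distribution, whereas filling with a single mean would create a spike at one value.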
Data Size
- Original data
- Data after data cleanup and feature creation
- Data after further cleanup and adding interaction terms
CRISP-DM Step 4: Modeling
Steps
- Algorithm selection
- Sampling
- Algorithms
- Model ranking
Tasks and outputs
- Select Modeling Techniques: Modeling Techniques; Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings; Models; Model Description
- Assess Model: Model Assessment; Revised Parameter Settings
Sampling
- Randomly split the 21,557 records into two data sets, training and validation
- Build the response model on the training data set: 10,778 records
- Validate the model by scoring the validation data set: 10,779 records
- Ideally, hold out a third data set to provide a final assessment of the models
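A random half split like the one described can be sketched as follows. The seed and the function name are assumptions made for illustration; the talk does not say how the split was actually drawn.

```python
import random


def split_in_half(record_ids, seed=42):
    """Shuffle the record ids and cut them into training and validation halves."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split repeatable
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


# 21,557 records, identified here by their row number.
training_ids, validation_ids = split_in_half(range(21557))
```

Because 21,557 is odd, integer division puts 10,778 records in training and the remaining 10,779 in validation, matching the counts on the slide.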
Classifiers Find Different Decision Boundaries
[Figure: decision boundaries found on the same data; panels: Actual Data, 11-Nearest Neighbor, Neural Network, Naïve Bayes, Logistic Regression, Decision Tree]
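To make the point concrete, here is a toy sketch with two very simple classifiers, a k-nearest-neighbor vote and a nearest-centroid rule. The second is a stand-in chosen for brevity, not one of the algorithms in the figure, and the training points are hypothetical. Trained on the same data, the two disagree on the same query point because they draw different boundaries.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def nearest_centroid_predict(train, query):
    """Assign the class whose mean point (centroid) is closest to the query."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    centroids = {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}
    return min(centroids, key=lambda label: math.dist(centroids[label], query))


# Hypothetical training data: class 0 has one outlying point at (3, 3).
toy_train = [
    ((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((3, 3), 0),
    ((4, 4), 1), ((5, 4), 1), ((4, 5), 1), ((5, 5), 1),
]
toy_query = (3.1, 3.1)
knn_label = knn_predict(toy_train, toy_query, k=1)
centroid_label = nearest_centroid_predict(toy_train, toy_query)
```

The nearest-neighbor rule follows the local outlier and predicts class 0, while the centroid rule sees only the class means and predicts class 1.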
Assess Models: ROC Curves
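For readers unfamiliar with ROC curves, the sketch below shows one common way to compute the curve's points and the area under it from model scores. It assumes both classes are present in the labels, and the function names and toy data are hypothetical rather than taken from the talk.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, threshold swept downward."""
    positives = sum(labels)
    negatives = len(labels) - positives  # assumes both counts are nonzero
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    previous_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so tied scores share a threshold.
        if previous_score is not None and score != previous_score:
            points.append((fp / negatives, tp / positives))
        if label == 1:
            tp += 1
        else:
            fp += 1
        previous_score = score
    points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy scores: 1 = responder, 0 = non-responder.
toy_roc = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
toy_auc = area_under_curve(toy_roc)
```

An area of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so curves closer to the upper-left corner indicate better rank-ordering of responders.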
CRISP-DM Step 6: Deployment
Steps
- How to deploy the model? Software, source code, in database
- How often, and when, to update the model
- Report results
- Lessons learned
Tasks and outputs
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring & Maintenance Plan
- Produce Final Report: Final Report; Final Presentation
- Review Project: Experience Documentation
Model Results after Deployment
- Scored over 2,100,000 prospects
- Actual results from the rollout: average response rate = 13.67%
- Significant gross revenue generated for the business partner
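As a quick worked check, the rollout figure can be set against the numbers in the objectives. The variable names are illustrative; the three rates are the ones quoted in the slides.

```python
baseline_rate = 0.11    # random test mailing to the house file
required_rate = 0.135   # minimum response rate needed to be profitable
actual_rate = 0.1367    # average response rate observed in the rollout

# Lift of the modeled mailing over the random baseline (about 1.24).
lift_over_baseline = actual_rate / baseline_rate

# Margin above the profitability threshold (about 0.0017, i.e. 0.17 percentage points).
margin_over_required = actual_rate - required_rate
```

The rollout therefore cleared the 13.5% threshold by a narrow margin while responding roughly 24% better than the random test mailing.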
What do We Call What We Do?
What is Predictive Analytics? Simple Definitions
- Data-driven analysis for [large] data sets
  - Data-driven to discover input combinations
  - Data-driven to validate models
OR
- Discovering interesting patterns in data automatically from the data
  - Input variables are selected automatically
  - Input combinations are discovered automatically
Customer Analytics: BI vs. PA
Customer Analytics: Business Intelligence
- What were the e-mail open, click-through, and response rates?
- Which regions/states/zips had the highest response rates?
- Which products had the highest/lowest click-through rates?
- How many repeat purchasers were there last month?
- How many new subscriptions to the loyalty program were there?
- What is the average spend of those who belong to the loyalty program? Of those who aren't part of the loyalty program? Is this a significant difference?
- How many visits to the store/website did a person have?
Customer Analytics: Predictive Analytics
- What is the likelihood an e-mail will be opened?
- What is the likelihood a customer will click through a link in an e-mail?
- Which product is a customer most likely to purchase if given the choice?
- How many e-mails should the customer receive to maximize the likelihood of a purchase?
- What is the best product to up-sell to the customer after they purchase a product?
- What is the visit volume expected on the website next week?
- What is the likelihood a product will sell out if it is put on sale?
- What is the estimated customer lifetime value (CLV) of each customer?
Predictive Analytics vs. Data Science
Predictive Analytics and Data Mining have always covered the same ground except for
- Big data-centricity
- Advanced database technology (to handle big data)
  - Hadoop
  - Other NoSQL (MongoDB, Cassandra)
- Programming language-centricity (not listed)
  - R, Python
What Degree Does it Take to Be a Predictive Modeler?
Highest degree
- PhD: 7
- Masters: 1
- Bachelors: 2
You don't need an advanced degree to be a great practitioner!
Field of highest degree (count)
- Math: 2
- Computer Science: 2
- Social Science: 2
- Statistics: 1
- Economics: 1
- Machine Learning: 1
- Engineering: 1
Source: http://www.deep-data-mining.com/2013/05/the-10-most-influential-peoplein-data-analytics.html
Questions?