Virtual Site Event. Predictive Analytics: What Managers Need to Know. Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015



Virtual Site Ground Rules
- The PMI Code of Conduct applies to this virtual presentation.
- Virtual attendees are expected to:
  - Participate for a minimum of 40 minutes. Login information will be verified.
  - Answer the survey question about the presentation correctly in order to obtain the PDU credit (1).
  - Respond to the survey within 48 hours of participation (by Friday, February 13, 2015) in order to obtain the PDU credit.


Predictive Analytics: A NEW ENVIRONMENT

Definition
- Predictive analytics: techniques that quantify potential outcomes or events based on past data
- Not descriptive analysis or descriptive statistics
- Not techniques that enable end users to perform individual data discovery or to customize reports

Convergence
- Once restricted to specialized statistics organizations, advanced modeling techniques are moving into the IT mainstream
[Diagram labels: Stat/Analytics Shop, IT]

Concepts/Buzzwords
- Machine learning
- Supervised learning
- Unsupervised learning
- Response variable, target variable, dependent variable, left-hand-side variable
- Explanatory variable, independent variable, right-hand-side variable
- Logistic regression
- Random forest, etc.
- Sensitivity
- Specificity

Tool independence
- Predictive techniques use mathematical algorithms that are independent of particular tools
  - SAS, R, Stata, SPSS, many more
- Use specialized tools for model development
- It is possible to implement models using general software tools, e.g., Java, .NET

Don't be intimidated
- Your stat/analysis package is programmed to do the heavy math
- You'll discover that most internal stat shops are using a small set of models and techniques over and over again
- Most of the work:
  - Understanding what you want to accomplish
  - Understanding the data
  - Organizing the data

Understand the results
- Predictive analytics produce a probability of a characteristic or behavior based on a detailed analysis of past characteristics or behaviors
- Probability ≠ 100% certainty
- Model accuracy depends on how closely past conditions resemble present ones

Predictive Analytics: HOW IT WORKS AND WHAT TO EXPECT

Logistic regression
- Workhorse procedure for predictive analytics
- Supervised technique

Step 1
- Identify a known population that exhibits the characteristic you want to predict (the dependent, target, or response variable), plus a known population that does not
- You may take the whole population ("big data") or a sample
- Use 80% or 90% of the sample as the training data set
- Withhold the remainder for validation

Step 2
- Construct a hypothesis ("null hypothesis")
- Select variables expected to distinguish the target population (the independent or explanatory variables)

Step 3
- Run a logistic regression against the variables
- Logistic regression will calculate the likelihood (predictive odds) that the independent variables are associated with the dependent variable

Step 4
- Test the hypothesis on the withheld sample and the broader population
- Caution: It's critical to identify the target characteristics accurately

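Logistic regression: targets
Target: Workers' Compensation Fraudsters

| Name      | Target | High Incidence Organization | Dr on CMS Ineligible List | High Risk Occupation | Psychological Impairment | Imperceptible Physical Impairment |
| Linda     | 1 | 1 | 1 | 1 | 1 | 1 |
| Rebecca   | 1 | 1 | 1 | 1 | 0 | 1 |
| Samuel    | 1 | 1 | 0 | 1 | 1 | 0 |
| Stephen   | 1 | 0 | 0 | 0 | 1 | 1 |
| Amanda    | 1 | 1 | 0 | 0 | 1 | 0 |
| Hugh      | 1 | 0 | 1 | 0 | 0 | 1 |
| Francesco | 1 | 0 | 1 | 1 | 0 | 1 |
| Allen     | 1 | 1 | 0 | 0 | 1 | 0 |
| Eric      | 1 | 1 | 0 | 0 | 1 | 1 |
| Gail      | 1 | 0 | 1 | 0 | 0 | 1 |
| Joseph    | 1 | 1 | 1 | 1 | 0 | 0 |
| Derek     | 1 | 1 | 1 | 0 | 1 | 0 |
| Kevin     | 1 | 1 | 0 | 1 | 1 | 1 |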

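Logistic regression: general
General population of covered workers

| Name      | Target | High Incidence Organization | Dr on CMS Ineligible List | High Risk Occupation | Psychological Impairment | Imperceptible Physical Impairment |
| Linda     | 0 | 1 | 1 | 1 | 1 | 1 |
| Rebecca   | 0 | 0 | 0 | 1 | 0 | 1 |
| Samuel    | 0 | 0 | 0 | 0 | 0 | 0 |
| Stephen   | 0 | 0 | 0 | 0 | 0 | 1 |
| Amanda    | 0 | 1 | 0 | 0 | 1 | 0 |
| Hugh      | 0 | 0 | 1 | 0 | 0 | 1 |
| Francesco | 0 | 0 | 0 | 0 | 0 | 0 |
| Allen     | 0 | 0 | 0 | 0 | 1 | 0 |
| Eric      | 0 | 0 | 0 | 0 | 1 | 1 |
| Gail      | 0 | 0 | 1 | 0 | 0 | 1 |
| Joseph    | 0 | 0 | 0 | 1 | 1 | 0 |
| Derek     | 0 | 0 | 1 | 0 | 0 | 0 |
| Kevin     | 0 | 1 | 0 | 1 | 1 | 1 |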
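
Steps 1 to 4 can be sketched on the 26 rows from the two slides above. This is a minimal illustration in plain Python, not the presenter's tooling: the gradient-ascent fitter, the every-fifth-row holdout (roughly an 80/20 split), the step count, and the learning rate are all assumptions chosen for brevity. The fitted coefficients are not expected to match the Results slide, which lists different variables.

```python
import math

# Rows transcribed from the slides: (target, high-incidence organization,
# doctor on CMS ineligible list, high-risk occupation,
# psychological impairment, imperceptible physical impairment).
FRAUD = [
    (1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 0, 1), (1, 1, 0, 1, 1, 0),
    (1, 0, 0, 0, 1, 1), (1, 1, 0, 0, 1, 0), (1, 0, 1, 0, 0, 1),
    (1, 0, 1, 1, 0, 1), (1, 1, 0, 0, 1, 0), (1, 1, 0, 0, 1, 1),
    (1, 0, 1, 0, 0, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1, 0, 1, 0),
    (1, 1, 0, 1, 1, 1),
]
GENERAL = [
    (0, 1, 1, 1, 1, 1), (0, 0, 0, 1, 0, 1), (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 1), (0, 1, 0, 0, 1, 0), (0, 0, 1, 0, 0, 1),
    (0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 1),
    (0, 0, 1, 0, 0, 1), (0, 0, 0, 1, 1, 0), (0, 0, 1, 0, 0, 0),
    (0, 1, 0, 1, 1, 1),
]


def predict(weights, features):
    """Predicted probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(weights, rows):
    """Sum of log-probabilities the model assigns to the observed targets."""
    total = 0.0
    for y, *xs in rows:
        p = min(max(predict(weights, xs), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit(rows, steps=2000, rate=0.1):
    """Maximum likelihood by plain gradient ascent (assumed settings)."""
    # One weight per column: the target column's slot holds the intercept.
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for y, *xs in rows:
            err = y - predict(weights, xs)
            grad[0] += err
            for j, x in enumerate(xs, start=1):
                grad[j] += err * x
        weights = [w + rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


rows = FRAUD + GENERAL
holdout = rows[::5]                                        # Step 1: withhold
training = [r for i, r in enumerate(rows) if i % 5 != 0]   # Step 1: train
weights = fit(training)                                    # Step 3: fit
correct = sum((predict(weights, xs) >= 0.5) == (y == 1) for y, *xs in holdout)
print("intercept and coefficients:", [round(w, 3) for w in weights])
print(f"holdout accuracy (Step 4): {correct}/{len(holdout)}")
```

With only 26 rows the holdout result is noisy; a statistics package would normally do the fitting and also report standard errors.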

Results
Maximum likelihood estimates:
Fraud likelihood (logit) = -1.9884 (intercept) + 2.1370 (multiple cases) + 1.2356 (CMS ineligible) + 0.3784 (rep disciplined) + 0.1877 (psychological) + 0.4805 (imperceptible physical)

Interpretation
- Positive coefficients mean all factors contribute to the likelihood of fraud
- Coefficients reflect the actual weight the model places on each factor
- Intercept (-1.9884) means this model predicts a 12% likelihood of fraud if no modeled factors are present

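Test of model accuracy
- C-statistic (probability that the model scores a randomly chosen positive case above a randomly chosen negative case; 0.5 is chance) = 0.814
- 0.70 or higher indicates an acceptable model
- 0.80 or higher indicates a strong model
- The closer to 1, the better
- Visually represented as the ROC curve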
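
The C-statistic can be computed directly from its pairwise definition: compare every positive case with every negative case and count how often the positive case gets the higher score, with ties counting as half. The sketch below assumes that definition; the scores and labels are hypothetical and are not the data behind the 0.814 on the slide.

```python
def c_statistic(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for three fraud (1) and three non-fraud (0) claims.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
c = c_statistic(scores, labels)
print(f"C-statistic: {c:.3f}")
```

The same number is the area under the ROC curve, which is why the two are reported interchangeably.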

Considerations
- Accuracy is only as good as the target population sample
- Sum of the terms = logit of the model's predicted probability, which translates into the odds that a claim is fraudulent
- Conversion of the target variable's logit, logit(p), to a probability: $p = \frac{1}{1 + e^{-\mathrm{logit}(p)}}$

Logit transformation
- If all factors are present, logit(p) = -1.9884 + 2.1370 + 1.2356 + 0.3784 + 0.1877 + 0.4805 = 2.4308, i.e., a 92% probability of fraud

| p    | logit(p) | p    | logit(p) | p    | logit(p) | p    | logit(p) |
| 0.01 | -4.5951  | 0.26 | -1.0460  | 0.51 | 0.0400   | 0.76 | 1.1527   |
| 0.02 | -3.8918  | 0.27 | -0.9946  | 0.52 | 0.0800   | 0.77 | 1.2083   |
| 0.03 | -3.4761  | 0.28 | -0.9445  | 0.53 | 0.1201   | 0.78 | 1.2657   |
| 0.04 | -3.1781  | 0.29 | -0.8954  | 0.54 | 0.1603   | 0.79 | 1.3249   |
| 0.05 | -2.9444  | 0.30 | -0.8473  | 0.55 | 0.2007   | 0.80 | 1.3863   |
| 0.06 | -2.7515  | 0.31 | -0.8001  | 0.56 | 0.2412   | 0.81 | 1.4500   |
| 0.07 | -2.5867  | 0.32 | -0.7538  | 0.57 | 0.2819   | 0.82 | 1.5163   |
| 0.08 | -2.4423  | 0.33 | -0.7082  | 0.58 | 0.3228   | 0.83 | 1.5856   |
| 0.09 | -2.3136  | 0.34 | -0.6633  | 0.59 | 0.3640   | 0.84 | 1.6582   |
| 0.10 | -2.1972  | 0.35 | -0.6190  | 0.60 | 0.4055   | 0.85 | 1.7346   |
| 0.11 | -2.0907  | 0.36 | -0.5754  | 0.61 | 0.4473   | 0.86 | 1.8153   |
| 0.12 | -1.9924  | 0.37 | -0.5322  | 0.62 | 0.4895   | 0.87 | 1.9010   |
| 0.13 | -1.9010  | 0.38 | -0.4895  | 0.63 | 0.5322   | 0.88 | 1.9924   |
| 0.14 | -1.8153  | 0.39 | -0.4473  | 0.64 | 0.5754   | 0.89 | 2.0907   |
| 0.15 | -1.7346  | 0.40 | -0.4055  | 0.65 | 0.6190   | 0.90 | 2.1972   |
| 0.16 | -1.6582  | 0.41 | -0.3640  | 0.66 | 0.6633   | 0.91 | 2.3136   |
| 0.17 | -1.5856  | 0.42 | -0.3228  | 0.67 | 0.7082   | 0.92 | 2.4423   |
| 0.18 | -1.5163  | 0.43 | -0.2819  | 0.68 | 0.7538   | 0.93 | 2.5867   |
| 0.19 | -1.4500  | 0.44 | -0.2412  | 0.69 | 0.8001   | 0.94 | 2.7515   |
| 0.20 | -1.3863  | 0.45 | -0.2007  | 0.70 | 0.8473   | 0.95 | 2.9444   |
| 0.21 | -1.3249  | 0.46 | -0.1603  | 0.71 | 0.8954   | 0.96 | 3.1781   |
| 0.22 | -1.2657  | 0.47 | -0.1201  | 0.72 | 0.9445   | 0.97 | 3.4761   |
| 0.23 | -1.2083  | 0.48 | -0.0800  | 0.73 | 0.9946   | 0.98 | 3.8918   |
| 0.24 | -1.1527  | 0.49 | -0.0400  | 0.74 | 1.0460   | 0.99 | 4.5951   |
| 0.25 | -1.0986  | 0.50 | 0.0000   | 0.75 | 1.0986   |      |          |
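
The lookup table can be replaced by two small functions. The sketch below takes the coefficients from the Results slide, with the intercept as -1.9884 (the sign that the slide's 12% and 2.4308 figures require), and converts the summed logit to a probability. The function names are illustrative, not from any particular package.

```python
import math

INTERCEPT = -1.9884
COEFFICIENTS = {
    "multiple cases": 2.1370,
    "CMS ineligible": 1.2356,
    "rep disciplined": 0.3784,
    "psychological": 0.1877,
    "imperceptible physical": 0.4805,
}


def logit_to_probability(logit):
    """Inverse of the logit: p = 1 / (1 + e^(-logit))."""
    return 1.0 / (1.0 + math.exp(-logit))


def probability_to_logit(p):
    """Log-odds of p, the quantity tabulated on the slide."""
    return math.log(p / (1.0 - p))


def fraud_probability(factors_present):
    """Sum the intercept and the coefficients of the factors that are present."""
    logit = INTERCEPT + sum(COEFFICIENTS[name] for name in factors_present)
    return logit_to_probability(logit)


print(f"no factors present:  {fraud_probability([]):.0%}")
print(f"all factors present: {fraud_probability(COEFFICIENTS):.0%}")
print(f"logit(0.92) = {probability_to_logit(0.92):.4f}")
```

Any subset of factors can be passed in, for example `fraud_probability(["multiple cases", "psychological"])`.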

LR weaknesses
- All potential fraud factors are combined into a single equation
- With many independent predictor variables, characteristics can cancel each other out
- Logistic regression has a hard time weighting interactions between individual variables
  - Must be programmed explicitly
  - Requires additional data manipulation

LR weaknesses (ctd)
- In rare-event modeling with a large number of predictive variables, logistic regression can produce many false positives
- Difficult to differentiate rare events from normal events when the rare events occur with extremely low frequency
- Bad solution is to boost the sensitivity of the model

Other supervised methods
- Decision tree mitigates the problem of numerous weak predictors overwhelming a strong predictor (a weakness of logistic regression)
- Sorts observations of the dependent variable into buckets corresponding to its available classification values
- Conditional selection into paths ("branches")
- Priority determined by frequency of characteristics

Decision tree example
[Figure: decision tree for the 26 sample workers. Root split: High Incidence Organization (absent: 4F/10N; present: 9F/3N). Lower splits: Imperceptible Physical Impairment, Psychological Impairment, Doctor on CMS Ineligible List, High Risk Occupation. Left-facing arrows: characteristic is absent; right-facing arrows: characteristic is present. Leaves are marked Purity, Imperfect Purity, or Tie. 0 = No Fraud, 1 = Fraud. Misclassification rate = 23.08%.]
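
The first split of the tree can be reproduced from the 26 sample rows. The sketch below tries each characteristic as a single yes/no split and counts how many workers a majority vote in each branch would misclassify. That criterion is an assumption made for illustration; tree software typically uses a purity measure such as Gini or entropy.

```python
FEATURES = [
    "High Incidence Organization",
    "Dr on CMS Ineligible List",
    "High Risk Occupation",
    "Psychological Impairment",
    "Imperceptible Physical Impairment",
]

# (target, then one 0/1 flag per feature above), transcribed from the slides.
ROWS = [
    # Workers' compensation fraudsters (target = 1)
    (1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 0, 1), (1, 1, 0, 1, 1, 0),
    (1, 0, 0, 0, 1, 1), (1, 1, 0, 0, 1, 0), (1, 0, 1, 0, 0, 1),
    (1, 0, 1, 1, 0, 1), (1, 1, 0, 0, 1, 0), (1, 1, 0, 0, 1, 1),
    (1, 0, 1, 0, 0, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1, 0, 1, 0),
    (1, 1, 0, 1, 1, 1),
    # General population of covered workers (target = 0)
    (0, 1, 1, 1, 1, 1), (0, 0, 0, 1, 0, 1), (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 1), (0, 1, 0, 0, 1, 0), (0, 0, 1, 0, 0, 1),
    (0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 1, 0), (0, 0, 0, 0, 1, 1),
    (0, 0, 1, 0, 0, 1), (0, 0, 0, 1, 1, 0), (0, 0, 1, 0, 0, 0),
    (0, 1, 0, 1, 1, 1),
]


def split_counts(rows, feature_index):
    """Return {flag value: [fraud count, non-fraud count]} for one feature."""
    counts = {0: [0, 0], 1: [0, 0]}
    for target, *flags in rows:
        counts[flags[feature_index]][0 if target == 1 else 1] += 1
    return counts


def misclassified(counts):
    """Errors if each branch predicts its majority class."""
    return sum(min(fraud, non_fraud) for fraud, non_fraud in counts.values())


for j, name in enumerate(FEATURES):
    counts = split_counts(ROWS, j)
    print(f"{name}: absent {counts[0][0]}F/{counts[0][1]}N, "
          f"present {counts[1][0]}F/{counts[1][1]}N, "
          f"errors {misclassified(counts)}")

best = min(range(len(FEATURES)),
           key=lambda j: misclassified(split_counts(ROWS, j)))
print("best first split:", FEATURES[best])
```

The best single split matches the root of the figure: High Incidence Organization, with 4F/10N on the absent branch and 9F/3N on the present branch.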

Beyond decision tree
- Decision tree may overweight high-frequency but insignificant characteristics
- Boosted decision tree and random forest are techniques to improve on the results of the basic algorithm based on misclassification rates
- Neural networks model all possible combinations and select the best ones based on misclassification rates

Unsupervised methods
- K-means cluster
  - Think of it as the unsupervised counterpart to logistic regression: the same kind of variable selection, but no target variable
  - Identify a set of independent variables
  - Transformations likely required, as above
  - Procedure tries to identify a set of statistically significant clusters based on the selected variables
  - Can tease out meaningful characteristics
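
The core of k-means is a two-step loop: assign each observation to its nearest cluster center, then move each center to the mean of its members. The sketch below is a bare-bones version with hypothetical two-variable data and hand-picked starting centers; statistical packages add smarter initialization and stopping rules.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps for a fixed number of rounds."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(v) / len(members)
                                     for v in zip(*members)))
            else:
                updated.append(old)   # keep an empty cluster where it was
        centroids = updated
    return labels, centroids


# Hypothetical observations described by two already-scaled variables.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print("cluster labels:", labels)
print("cluster centers:", centers)
```

Because no target variable is involved, it is up to the analyst to decide whether the resulting clusters mean anything.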

Predictive Analytics: SOME BEST PRACTICES IN DATA MANAGEMENT

Data best practices
- Understand your data
  - What does it represent?
  - How does it enter your data warehouse?
- Check data for suitability
  - Missing values?
  - Do target and individual predictors correlate?
- Ensure that data cleansing and transformation steps are documented and repeatable for model re-estimation
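
Both suitability checks are easy to script so that they can be rerun whenever the model is re-estimated. The sketch below counts missing values per field and computes the Pearson correlation between the target and one predictor, skipping records where either value is missing. The records and field names are hypothetical.

```python
import math


def missing_counts(records, fields):
    """Number of records with no value (None) for each field."""
    return {f: sum(1 for r in records if r.get(f) is None) for f in fields}


def correlation(records, field_a, field_b):
    """Pearson correlation over records where both fields are present."""
    pairs = [(r[field_a], r[field_b]) for r in records
             if r.get(field_a) is not None and r.get(field_b) is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical claim records; None marks a missing value.
records = [
    {"fraud": 1, "high_incidence_org": 1, "high_risk_occupation": 1},
    {"fraud": 1, "high_incidence_org": 1, "high_risk_occupation": None},
    {"fraud": 0, "high_incidence_org": 0, "high_risk_occupation": 1},
    {"fraud": 0, "high_incidence_org": 0, "high_risk_occupation": 0},
    {"fraud": 1, "high_incidence_org": 0, "high_risk_occupation": 0},
    {"fraud": 0, "high_incidence_org": 1, "high_risk_occupation": None},
]
fields = ["fraud", "high_incidence_org", "high_risk_occupation"]
missing = missing_counts(records, fields)
r = correlation(records, "fraud", "high_incidence_org")
print("missing values per field:", missing)
print(f"correlation of fraud with high_incidence_org: {r:.3f}")
```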

Counterintuitive-ness
- The more independent variables, the less predictive value each individual variable, or characteristic, has, on average

Counterintuitive-ness (ctd)
- In rare-event modeling, even a very accurate model can produce a disproportionately large number of false positives
- Example: the target population is 1% of a population of 1,000,000 (10,000 targets). If the predictive model has a 10% false positive rate (90% accurate):

| Target                 | General population      |
| 10,000                 | 990,000                 |
| True positives: 9,000  | True negatives: 891,000 |
| False negatives: 1,000 | False positives: 99,000 |
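
The arithmetic behind this example is worth running with your own numbers. The sketch below reproduces the four counts and adds one derived figure, the share of flagged cases that are real targets. It assumes, as the slide's counts imply, that the model also catches 90% of true targets; the function name is illustrative.

```python
def confusion_counts(population, target_rate, sensitivity,
                     false_positive_rate):
    """Expected confusion-matrix counts for a rare-event model."""
    targets = round(population * target_rate)
    non_targets = population - targets
    true_pos = round(targets * sensitivity)
    false_pos = round(non_targets * false_positive_rate)
    return {
        "TP": true_pos,
        "FN": targets - true_pos,
        "FP": false_pos,
        "TN": non_targets - false_pos,
    }


counts = confusion_counts(
    population=1_000_000,
    target_rate=0.01,          # 1% of the population are targets
    sensitivity=0.90,          # assumed: 90% of targets are caught
    false_positive_rate=0.10,  # 10% of non-targets are flagged
)
flagged = counts["TP"] + counts["FP"]
precision = counts["TP"] / flagged
print(counts)
print(f"flagged cases: {flagged:,}; real targets among them: {precision:.1%}")
```

Lowering `false_positive_rate` or raising `target_rate` shows how quickly the share of real targets among flagged cases changes.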

Takeaways for success
1. Clearly identify the target variable
2. Limit predictor variables
3. Know the model data and manage it: data management is most of the work
4. Know how to measure model performance
5. Set goals and expectations for the model
6. Monitor model performance and adjust/re-estimate as necessary

Thank you/questions
Paul Arnest
parnest@pmibaltimore.org