Virtual Site Event. Predictive Analytics: What Managers Need to Know. Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015


Virtual Site Ground Rules
- The PMI Code of Conduct applies to this virtual presentation.
- Virtual attendees are expected to:
  - Participate for a minimum of 40 minutes. Login information will be verified.
  - Answer the question pertaining to the presentation correctly in the survey in order to obtain the PDU credit (1).
  - Respond to the survey within 48 hours of participation (by Friday, February 13, 2015) in order to obtain the PDU credit.

Predictive Analytics: What Managers Need to Know

Predictive Analytics: A NEW ENVIRONMENT

Definition
- Predictive analytics: techniques that quantify potential outcomes or events based on past data
- Not descriptive analysis or descriptive statistics
- Not techniques that enable end users to perform individual data discovery or to customize reports

Convergence
- Once restricted to specialized statistics organizations, advanced modeling techniques are moving into the IT mainstream
[Figure: Stat/Analytics Shop converging with IT]

Concepts/Buzzwords
- Machine learning
- Supervised learning
- Unsupervised learning
- Response variable, target variable, dependent variable, left-hand-side variable
- Explanatory variable, independent variable, right-hand-side variable
- Logistic regression, random forest, etc.
- Sensitivity
- Specificity

Tool independence
- Predictive techniques use mathematical algorithms that are independent of particular tools
  - SAS, R, Stata, SPSS, and many more
- Use specialized tools for model development
- It is possible to implement models using general software tools, e.g., Java or .NET

Don't be intimidated
- Your stat/analysis package is programmed to do the heavy math
- You'll discover that most internal stat shops are using a small set of models and techniques over and over again
- Most of the work:
  - Understanding what you want to accomplish
  - Understanding the data
  - Organizing the data

Understand the results
- Predictive analytics produce a probability of a characteristic or behavior based on a detailed analysis of past characteristics or behaviors
- Probability is not 100% certainty
- Model accuracy depends on the similarity of past conditions to the present

Predictive Analytics: HOW IT WORKS AND WHAT TO EXPECT

Logistic regression
- Workhorse procedure for predictive analytics
- Supervised technique

Step 1
- Identify a known population that exhibits the characteristic you want to predict (the dependent, target, or response variable), plus a known population that does not
- You may take the whole population ("big data") or a sample
- Use 80% or 90% of the sample as the training data set
- Withhold the remainder for validation
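
A minimal sketch of the training/validation split described in Step 1, assuming an 80% training share. The record IDs, the function name `split_sample`, and the fixed seed are illustrative assumptions, not part of the presentation.

```python
import random

def split_sample(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split them into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical sample of 100 record IDs
records = list(range(100))
training, validation = split_sample(records)
print(len(training), len(validation))  # 80 20
```

Fixing the seed is one way to keep the split repeatable when the model is re-estimated later.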

Step 2
- Construct a hypothesis ("null hypothesis")
- Select variables expected to distinguish the target population (the independent or explanatory variables)

Step 3
- Run a logistic regression against the variables
- Logistic regression will calculate the likelihood (predictive odds) that the independent variables are associated with the dependent variable

Step 4
- Test the hypothesis on the withheld sample and the broader population
- Caution: It's critical to identify the target characteristics accurately

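Logistic regression: targets
Target: Workers' Compensation Fraudsters

| Name | Target | High Incidence Organization | Dr on CMS Ineligible List | High Risk Occupation | Psychological Impairment | Imperceptible Physical Impairment |
|---|---|---|---|---|---|---|
| Linda | 1 | 1 | 1 | 1 | 1 | 1 |
| Rebecca | 1 | 1 | 1 | 1 | 0 | 1 |
| Samuel | 1 | 1 | 0 | 1 | 1 | 0 |
| Stephen | 1 | 0 | 0 | 0 | 1 | 1 |
| Amanda | 1 | 1 | 0 | 0 | 1 | 0 |
| Hugh | 1 | 0 | 1 | 0 | 0 | 1 |
| Francesco | 1 | 0 | 1 | 1 | 0 | 1 |
| Allen | 1 | 1 | 0 | 0 | 1 | 0 |
| Eric | 1 | 1 | 0 | 0 | 1 | 1 |
| Gail | 1 | 0 | 1 | 0 | 0 | 1 |
| Joseph | 1 | 1 | 1 | 1 | 0 | 0 |
| Derek | 1 | 1 | 1 | 0 | 1 | 0 |
| Kevin | 1 | 1 | 0 | 1 | 1 | 1 |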

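Logistic regression: general
General population of covered workers

| Name | Target | High Incidence Organization | Dr on CMS Ineligible List | High Risk Occupation | Psychological Impairment | Imperceptible Physical Impairment |
|---|---|---|---|---|---|---|
| Linda | 0 | 1 | 1 | 1 | 1 | 1 |
| Rebecca | 0 | 0 | 0 | 1 | 0 | 1 |
| Samuel | 0 | 0 | 0 | 0 | 0 | 0 |
| Stephen | 0 | 0 | 0 | 0 | 0 | 1 |
| Amanda | 0 | 1 | 0 | 0 | 1 | 0 |
| Hugh | 0 | 0 | 1 | 0 | 0 | 1 |
| Francesco | 0 | 0 | 0 | 0 | 0 | 0 |
| Allen | 0 | 0 | 0 | 0 | 1 | 0 |
| Eric | 0 | 0 | 0 | 0 | 1 | 1 |
| Gail | 0 | 0 | 1 | 0 | 0 | 1 |
| Joseph | 0 | 0 | 0 | 1 | 1 | 0 |
| Derek | 0 | 0 | 1 | 0 | 0 | 0 |
| Kevin | 0 | 1 | 0 | 1 | 1 | 1 |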
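
To make Step 3 concrete, the sketch below fits a logistic regression to the 26 rows of the two sample tables using plain gradient ascent on the log-likelihood. This is an illustrative stand-in: a statistical package would use its own estimation routine, and the learning rate and step count are assumptions. The fitted coefficients are not expected to match the maximum likelihood estimates reported in the presentation, which name a different set of factors.

```python
import math

# Each string holds the five characteristics of one row, in table order:
# high-incidence org, CMS-ineligible doctor, high-risk occupation,
# psychological impairment, imperceptible physical impairment.
FRAUD = ["11111", "11101", "10110", "00011", "10010", "01001", "01101",
         "10010", "10011", "01001", "11100", "11010", "10111"]
GENERAL = ["11111", "00101", "00000", "00001", "10010", "01001", "00000",
           "00010", "00011", "01001", "00110", "01000", "10111"]

rows = [[int(c) for c in r] for r in FRAUD + GENERAL]
labels = [1] * len(FRAUD) + [0] * len(GENERAL)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Predicted probability of fraud; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def log_likelihood(weights):
    return sum(math.log(predict(weights, x)) if y == 1
               else math.log(1.0 - predict(weights, x))
               for x, y in zip(rows, labels))

def fit(learning_rate=0.1, steps=5000):
    """Gradient ascent on the mean log-likelihood, starting from all-zero weights."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - predict(weights, x)
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

weights = fit()
print([round(w, 3) for w in weights])  # intercept first, then one weight per column
print(round(log_likelihood([0.0] * 6), 3), round(log_likelihood(weights), 3))
```

The second print compares the log-likelihood of a model with no information (all weights zero) against the fitted model; the fitted value should be higher.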

Results
Maximum likelihood estimates:
Fraud likelihood = -1.9884 (intercept) + 2.1370 (multiple cases) + 1.2356 (CMS ineligible) + 0.3784 (rep disciplined) + 0.1877 (psychological) + 0.4805 (imperceptible physical)

Interpretation
- Positive coefficients mean all factors contribute to the likelihood of fraud
- Coefficients reflect the actual weight the model places on each factor
- Intercept (-1.9884) means this model predicts a 12% likelihood of fraud if no modeled factors are present
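
As a worked check of the 12% baseline, assuming the intercept is negative (-1.9884), which is the sign a 12% probability implies, the intercept can be converted from a logit to a probability:

$$p = \frac{1}{1 + e^{-(-1.9884)}} = \frac{1}{1 + e^{1.9884}} \approx \frac{1}{1 + 7.30} \approx 0.12$$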

Test of model accuracy
- C-statistic (probability outcome is better than chance) = 0.814
  - 0.70 or above indicates an acceptable model
  - 0.80 or above indicates a strong model
  - The closer to 1, the better
- Visually represented as the ROC curve
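
One common way to read the C-statistic is as the share of (fraud, non-fraud) pairs in which the fraud case receives the higher predicted probability. The sketch below computes it that way on a small set of hypothetical validation scores; the scores and labels are invented for illustration and are not the data behind the 0.814 figure.

```python
def c_statistic(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))

# Hypothetical validation scores: model probabilities and known outcomes (1 = fraud)
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(round(c_statistic(scores, labels), 3))  # 8 of 9 pairs are concordant -> 0.889
```

A value of 0.5 would mean the model ranks cases no better than chance.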

Considerations
- Accuracy is only as good as the target population sample
- Sum of the terms = logit of the predictive probability of the model; this translates into the odds that a claim is fraudulent
- Conversion of the logit of the target variable, logit(p), to a probability: $p = \frac{1}{1 + e^{-\mathrm{logit}(p)}}$

Logit transformation
If all factors are present, logit(p) = -1.9884 + 2.1370 + 1.2356 + 0.3784 + 0.1877 + 0.4805 = 2.4308, which corresponds to a 92% probability of fraud.

| p | logit(p) | p | logit(p) | p | logit(p) | p | logit(p) |
|---|---|---|---|---|---|---|---|
| 0.01 | -4.5951 | 0.26 | -1.0460 | 0.51 | 0.0400 | 0.76 | 1.1527 |
| 0.02 | -3.8918 | 0.27 | -0.9946 | 0.52 | 0.0800 | 0.77 | 1.2083 |
| 0.03 | -3.4761 | 0.28 | -0.9445 | 0.53 | 0.1201 | 0.78 | 1.2657 |
| 0.04 | -3.1781 | 0.29 | -0.8954 | 0.54 | 0.1603 | 0.79 | 1.3249 |
| 0.05 | -2.9444 | 0.30 | -0.8473 | 0.55 | 0.2007 | 0.80 | 1.3863 |
| 0.06 | -2.7515 | 0.31 | -0.8001 | 0.56 | 0.2412 | 0.81 | 1.4500 |
| 0.07 | -2.5867 | 0.32 | -0.7538 | 0.57 | 0.2819 | 0.82 | 1.5163 |
| 0.08 | -2.4423 | 0.33 | -0.7082 | 0.58 | 0.3228 | 0.83 | 1.5856 |
| 0.09 | -2.3136 | 0.34 | -0.6633 | 0.59 | 0.3640 | 0.84 | 1.6582 |
| 0.10 | -2.1972 | 0.35 | -0.6190 | 0.60 | 0.4055 | 0.85 | 1.7346 |
| 0.11 | -2.0907 | 0.36 | -0.5754 | 0.61 | 0.4473 | 0.86 | 1.8153 |
| 0.12 | -1.9924 | 0.37 | -0.5322 | 0.62 | 0.4895 | 0.87 | 1.9010 |
| 0.13 | -1.9010 | 0.38 | -0.4895 | 0.63 | 0.5322 | 0.88 | 1.9924 |
| 0.14 | -1.8153 | 0.39 | -0.4473 | 0.64 | 0.5754 | 0.89 | 2.0907 |
| 0.15 | -1.7346 | 0.40 | -0.4055 | 0.65 | 0.6190 | 0.90 | 2.1972 |
| 0.16 | -1.6582 | 0.41 | -0.3640 | 0.66 | 0.6633 | 0.91 | 2.3136 |
| 0.17 | -1.5856 | 0.42 | -0.3228 | 0.67 | 0.7082 | 0.92 | 2.4423 |
| 0.18 | -1.5163 | 0.43 | -0.2819 | 0.68 | 0.7538 | 0.93 | 2.5867 |
| 0.19 | -1.4500 | 0.44 | -0.2412 | 0.69 | 0.8001 | 0.94 | 2.7515 |
| 0.20 | -1.3863 | 0.45 | -0.2007 | 0.70 | 0.8473 | 0.95 | 2.9444 |
| 0.21 | -1.3249 | 0.46 | -0.1603 | 0.71 | 0.8954 | 0.96 | 3.1781 |
| 0.22 | -1.2657 | 0.47 | -0.1201 | 0.72 | 0.9445 | 0.97 | 3.4761 |
| 0.23 | -1.2083 | 0.48 | -0.0800 | 0.73 | 0.9946 | 0.98 | 3.8918 |
| 0.24 | -1.1527 | 0.49 | -0.0400 | 0.74 | 1.0460 | 0.99 | 4.5951 |
| 0.25 | -1.0986 | 0.50 | 0.0000 | 0.75 | 1.0986 | | |
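
The conversion between logit and probability can be reproduced in a few lines of general-purpose code. The sketch below uses the coefficients reported in the presentation and takes the intercept as negative, which is the sign implied by the 12% baseline and the 2.4308 total; the function and variable names are illustrative.

```python
import math

# Coefficients as reported in the presentation; the intercept is assumed negative.
INTERCEPT = -1.9884
COEFFICIENTS = {
    "multiple cases": 2.1370,
    "CMS ineligible": 1.2356,
    "rep disciplined": 0.3784,
    "psychological": 0.1877,
    "imperceptible physical": 0.4805,
}

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Convert a logit score back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fraud_probability(present_factors):
    """Sum the intercept and the coefficients of the factors present, then convert."""
    score = INTERCEPT + sum(COEFFICIENTS[name] for name in present_factors)
    return inverse_logit(score)

print(round(fraud_probability(COEFFICIENTS), 2))  # all factors present -> 0.92
print(round(fraud_probability([]), 2))            # no factors present  -> 0.12
print(round(logit(0.92), 4))                      # table entry for 0.92 -> 2.4423
```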

LR weaknesses
- All potential fraud factors are combined into a single equation
- With many independent predictor variables, characteristics can cancel each other out
- Logistic regression has a hard time weighting interactions between individual variables
  - Must be programmed explicitly
  - Requires additional data manipulation

LR weaknesses (ctd)
- In rare-event modeling with a large number of predictive variables, logistic regression can produce many false positives
- It is difficult to differentiate rare events from normal events when the rare events occur with extremely low frequency
- A bad solution is to boost the sensitivity of the model

Other supervised methods
- Decision tree
  - Mitigates the problem, seen in logistic regression, of numerous weak predictors overwhelming a strong predictor
  - Sorts observations of the dependent variable into buckets corresponding to its available classification values
  - Conditional selection into paths ("branches")
  - Priority determined by frequency of characteristics

Decision tree example
[Figure: decision tree over the workers' compensation sample. The root splits on High Incidence Organization: 4F/10N when absent, 9F/3N when present. The absent branch then splits on Imperceptible Physical Impairment and the present branch on Psychological Impairment; deeper levels split on Doctor on CMS Ineligible List and High Risk Occupation. Leaves are labeled Purity, Imperfect Purity, or Tie.]
- Left-facing arrows: characteristic is absent
- Right-facing arrows: characteristic is present
- 0 = no fraud, 1 = fraud
- Misclassification rate = 23.08%
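
The first splits of the tree can be checked against the two sample tables (13 fraud cases and 13 general-population cases). The sketch below only counts fraud and non-fraud cases on each side of a split; it is not a full tree-building algorithm, and the compact string encoding of the rows is an assumption made to keep the example short.

```python
FEATURES = ["High Incidence Organization", "Dr on CMS Ineligible List",
            "High Risk Occupation", "Psychological Impairment",
            "Imperceptible Physical Impairment"]

# One string per row of the sample tables, one character per feature above.
FRAUD = ["11111", "11101", "10110", "00011", "10010", "01001", "01101",
         "10010", "10011", "01001", "11100", "11010", "10111"]
GENERAL = ["11111", "00101", "00000", "00001", "10010", "01001", "00000",
           "00010", "00011", "01001", "00110", "01000", "10111"]
ROWS = [(row, 1) for row in FRAUD] + [(row, 0) for row in GENERAL]

def split_counts(rows, feature):
    """Return {0: (fraud, non_fraud), 1: (fraud, non_fraud)} for one characteristic."""
    index = FEATURES.index(feature)
    counts = {0: [0, 0], 1: [0, 0]}
    for row, target in rows:
        branch = int(row[index])  # 0 = characteristic absent, 1 = present
        counts[branch][0 if target == 1 else 1] += 1
    return {branch: tuple(c) for branch, c in counts.items()}

root = split_counts(ROWS, "High Incidence Organization")
print(root)  # {0: (4, 10), 1: (9, 3)}, i.e. 4F/10N absent and 9F/3N present

# Second-level splits on each side of the root
absent = [(r, t) for r, t in ROWS if r[0] == "0"]
present = [(r, t) for r, t in ROWS if r[0] == "1"]
print(split_counts(absent, "Imperceptible Physical Impairment"))  # {0: (0, 5), 1: (4, 5)}
print(split_counts(present, "Psychological Impairment"))          # {0: (2, 0), 1: (7, 3)}

# Error rate if the tree stopped after the root split and used majority vote
errors = sum(min(fraud, non_fraud) for fraud, non_fraud in root.values())
print(round(errors / len(ROWS), 4))  # 7 of 26 -> 0.2692
```

Stopping at the root split misclassifies 7 of 26 cases; the deeper splits in the figure bring the reported rate down to 23.08%.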

Beyond decision tree
- A decision tree may overweight high-frequency but insignificant characteristics
- Boosted decision tree and random forest are techniques that improve on the results of the basic algorithm based on misclassification rates
- Neural networks model all possible combinations and select the best ones based on misclassification rates

Unsupervised methods
- K-means cluster
  - Consider it a generalization of logistic regression
  - Identify a set of independent variables
  - Transformations likely required, as above
  - Procedure tries to identify a set of statistically significant clusters based on the selected variables
  - Can tease out meaningful characteristics
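
A minimal sketch of how k-means groups observations without a target variable: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. The two-dimensional points and the choice of starting centroids are hypothetical; real use would start from the selected (and transformed) independent variables.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points]

def k_means(points, centroids, max_iterations=100):
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        # Move each centroid to the mean of its members (keep it if it has none)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)
        centroids = new_centroids
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return centroids, labels

# Hypothetical observations forming two visibly separate groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = k_means(points, [points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```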

Predictive Analytics: SOME BEST PRACTICES IN DATA MANAGEMENT

Data best practices
- Understand your data
  - What does it represent?
  - How does it enter your data warehouse?
- Check data for suitability
  - Missing values?
  - Do target and individual predictors correlate?
- Ensure that data cleansing and transformation steps are documented and repeatable for model re-estimation

Counterintuitive-ness
- The more independent variables, the less predictive value each individual variable, or characteristic, has on average

Counterintuitive-ness (ctd)
- In rare-event modeling, even a very accurate model can produce a disproportionately large number of false positives
- Example: target population is 1% of a population of 1,000,000 (10,000 targets). If the predictive model has a 10% false positive rate (90% accurate):
  - Target (10,000): true positives 9,000; false negatives 1,000
  - General population (990,000): true negatives 891,000; false positives 99,000
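
The arithmetic behind this example can be reproduced directly. The sketch below assumes, as the counts in the example imply, that "90% accurate" applies to both groups (90% of targets flagged, 90% of non-targets cleared); the variable names are illustrative.

```python
population = 1_000_000
target_rate = 0.01          # 1% of the population are true targets
sensitivity = 0.90          # share of targets the model flags
false_positive_rate = 0.10  # share of non-targets the model flags

targets = round(population * target_rate)
non_targets = population - targets

true_positives = round(targets * sensitivity)
false_negatives = targets - true_positives
false_positives = round(non_targets * false_positive_rate)
true_negatives = non_targets - false_positives

# Of everyone the model flags, what share are real targets?
precision = true_positives / (true_positives + false_positives)
specificity = true_negatives / non_targets

print(true_positives, false_negatives, false_positives, true_negatives)
# 9000 1000 99000 891000
print(round(precision, 4))  # 0.0833
```

Only about 8% of flagged cases are real targets: false positives outnumber true positives 11 to 1, even though sensitivity and specificity are both 90%.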

Takeaways for success
1. Clearly identify the target variable
2. Limit predictor variables
3. Know the model data and manage it; data management is most of the work
4. Know how to measure model performance
5. Set goals and expectations for the model
6. Monitor model performance and adjust/re-estimate as necessary

Thank you/questions
Paul Arnest
parnest@pmibaltimore.org