Fraud Detection in Insurance Claims
Bob Biermann, Bob_Biermann@Yahoo.com
April 15, 2013




Background
Fraud is common and costly for the insurance industry. According to the Federal Bureau of Investigation, insurance fraud is the second most costly white-collar crime in America and accounts for more than $30 billion in losses every year.* SIGI (Selective Insurance Group) had an active SIU group within the claims department. However, based on a perceived opportunity and a proof-of-concept project, in June of 2010 Selective Insurance began its Automated Fraud Referral project. All work described herein was done with Statistica v9-11; Statistica has been deployed within Selective as the enterprise modeling platform.
* Allstate Insurance website: http://www.allstate.com/claims/fight-fraud.aspx

The Bigger Picture
There are several parts to a comprehensive fraud detection strategy. These parts are complementary and should be used simultaneously.
- Rules: Look at a single claim. Based on the intuition and experience of the claim handler, maybe even a simple score. Historically used across the industry.
- Models (Logit, GLM, GLZ, NN, MARS, PrinComp): Look at a single claim but simultaneously consider hundreds of pieces of information. Based on history (find more of what we currently find). Can also be used to find anomalous claims (find claims when you are not sure what you are looking for).
- Networks, link and association analysis: Look at multiple claims simultaneously (find patterns and networks across claims). Not really a statistical tool or model, more of a database analysis tool.
The automated fraud referral project and this presentation focus on the Models approach.

Project Goals
The company historically used the rules approach. As such, the goals for the project were simple: take advantage of automation and advanced analytic capabilities to increase the economic impact of SIU activity by:
- Increasing the number of fraudulent claims that are found.
- Decreasing the proportion of claims investigated by SIU that are not fraudulent.

A Picture May Help
Let the large oval represent the universe of all claims. The smaller oval on the left represents the fraudulent claims; the smaller oval on the right represents the claims SIU investigates. The intersection represents claims that SIU investigated and determined to be fraud. Based on analytics, we are trying to shift the claims investigated so as to make the intersection bigger.
[Venn diagram: All Claims, Fraud, Investigated]

It's a Team Effort
Three departments were involved:
- Claims: SIU is the ultimate customer.
- IT: Built a claims database specifically for analytics and scoring claims; manages the project's execution.
- Actuarial (Sr. Econometrician): Responsible for all quantitative work (design, build and supervise deployment).

Now What?
It's easy: just clean the data, build a model and wait for the credit... The only problem is, it's not that easy! A significant number of issues and challenges exist. Those issues are the subject of this presentation.

Data Issues
Issues always exist within fraud data, and they MUST BE DEALT WITH.
- Fraud and its detection are dynamic: the type of fraud and its environment change, so what you investigate and determine to be fraud changes as well. Identifying a stable data window can be a challenge.
- Sample sizes are small. By the time you slice your data by line of business, geography, time window, etc., the sample size gets very small very fast. By LOB we had only a few hundred proven fraud claims in our time window.
- Fraud data is biased and censored. We only know whether a claim was fraudulent if SIU investigated it, SIU investigated only if the claim was referred, and the claims referred are a biased sample.

The First Question
The first big question is: what are we trying to predict? Our project was straightforward: can we identify claims that may be fraudulent early in the claim's life-cycle? That may not be specific enough. Are you looking for a specific type of fraud, e.g., injury fraud vs. property fraud vs. generic fraud? Do dollar amounts play a role? The questions boil down to how your dependent variable is specified. Is the fraud variable binary (1/0), ordinal (1/2/3/4), or cardinal and continuous? Given the limited sample sizes available, it was determined that the data couldn't support a more complex specification of fraud, i.e., we defined fraud as a binary variable.

Claim Data: We Are Data Rich!
Claim data sources include:
- FNOL claim data
- Claims history
- Medical billing, doctors, ICD-9 codes
- Log notes
- ISO reports
- Policy data, current and history
- D&B
- Census
- Others? Capstone list
Before text and ISO we had more than 350 raw data fields. Time must play an explicit role in your independent variables.

Claims Data
Most claims data is straightforward in interpretation and structure for analysis and modeling. The handling of these data elements, e.g., policyholder age, sex, etc., is intuitively obvious. ISO reports are a semi-structured, dynamic data source, and with sophisticated SQL queries we were able to reconstruct this information into a structure appropriate for our modeling and analytics. The unstructured data, the claim reps' log notes, was a very different story. Log notes are not English. They represent the handler's notes (probably entered in real time) from a call or interaction at any point in the claim process. They contain sentence fragments, abbreviations, acronyms, copied-and-pasted e-mails, medical notes, supervisors' notes, machine-generated notes, typos, misspellings, etc.

Text Mining
Limit the time window for log notes. It is important to limit the biases created by the fact that more complex claims are open longer, have more log notes and are more likely to be referred to SIU. Likewise, post-SIU-referral log notes will contaminate your data if included. Traditional, top-down sentiment assessment or text mining is not appropriate and will produce nonsensical results when applied to the log notes. The inclusion approach to text mining resolves many of the issues that the log notes create. This approach requires manually building an inclusion list of words we are searching for in the log notes. From the inclusion list, a text mining dictionary can be manually created. This dictionary is then used to transform the unstructured log notes into the structured variables used in modeling and analysis.
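The project's text mining was done in Statistica; the Python sketch below only illustrates the idea of an inclusion dictionary that turns raw log notes into structured term counts. The dictionary entries, concept names and sample notes are invented for illustration.

```python
import re
import pandas as pd

# Hypothetical inclusion dictionary: each search term maps to the concept it
# represents (terms and concepts here are illustrative, not Selective's list).
INCLUSION_DICT = {
    "fired": "termination",
    "laid off": "termination",
    "prior injury": "prior_injury",
    "no witness": "no_witness",
}

def term_counts(note: str) -> dict:
    """Count occurrences of each inclusion-list term in one log note."""
    text = note.lower()
    return {term: len(re.findall(r"\b" + re.escape(term) + r"\b", text))
            for term in INCLUSION_DICT}

# Two made-up log notes, in the fragmented style the slide describes.
notes = ["clmt states he was fired prior to dol",
         "no witness to incident, prior injury noted in med recs"]
structured = pd.DataFrame([term_counts(n) for n in notes])
print(structured)
```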

Text Mining
The inclusion list can be based on reps' training materials, NICB, ISO, etc. However, everybody uses the language differently. There is no substitute for reading the log notes of historically proven fraud claims; you learn a lot. Five issues need to be confronted in the development of your inclusion list/dictionary:
- Words vs. concepts
- Contextualization
- Stemming
- Word count specifications
- Changing characters

Inclusion List / Dictionary Issues
Of the five issues listed, the resolution of two is clear:
- Word count specification should be a simple count; the more complex metrics appear to add little predictive power. This count should then be transformed into a binary flag to avoid biasing your data by claim duration (see the sketch below).
- Changing (adding or deleting) characters should only be done at the beginning of the dictionary's creation, and with great care. The impact of changing characters can be unpredictable.
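A minimal sketch of the count-then-flag transformation, assuming a per-claim table of raw term counts like the one produced above (the numbers are made up):

```python
import pandas as pd

# Illustrative per-claim term counts, e.g. the output of the earlier sketch.
counts = pd.DataFrame({"fired": [3, 0, 1], "prior injury": [1, 2, 0]})

# Simple count first, then a presence flag, so claims that stay open longer
# (and accumulate more log notes) are not over-weighted by raw frequency.
flags = (counts > 0).astype(int).add_suffix("_flag")
print(flags)
```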

Inclusion List / Dictionary Issues
The remaining three issues (words vs. concepts, contextualization and stemming) are intimately related. Consider a simple workers' comp example: according to the NICB, if a worker was about to be fired and then filed a WC claim, we should be suspicious of potential fraud.

Stemming
Stemming allows the text mining tool to reduce a word's various cases, tenses and endings to a common stem. In the case of the word "fired", the words "fire", "firing", etc., are stemmed to the same root, "fire". Stemming is powerful and can deal with many of the usage and typo issues. However, stemming can lead to false positives in your text mining. You'll be surprised how many WC claims involve a fireplace, a fire truck, a firebox or a machine firing. Stem with caution.
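The false-positive risk is easy to reproduce. The sketch below uses NLTK's Porter stemmer (an assumption; Statistica's stemmer differs) to show that a stem-level match on "fire" hits both a termination-related note and a note about a fire truck.

```python
from nltk.stem import PorterStemmer  # third-party; assumes nltk is installed

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["fired", "firing", "fire"]])  # all become 'fire'

# A stem-level match treats these two (made-up) notes identically, even though
# only the first hints at a termination motive.
notes = ["clmt was fired the day before the incident",
         "fire truck responded, clmt burned hand while machine was firing"]
for note in notes:
    hit = any(stemmer.stem(tok) == "fire" for tok in note.split())
    print(hit, "|", note)
```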

Contextualization
As shown, stemming the word "fired" has issues. A solution lies in the tradeoff of contextualization. For example, if we look for the phrases "was fired", "to be fired" or "on list for firing", we avoid many of the false positives. However, this solution is not cost free:
- The benefits of stemming (typos, tenses, etc.) are lost.
- Every possible alternative of the contextualized phrase must be included in the inclusion list; it has to be an exact match.
- The count of times the phrase is found falls the more contextualized the phrase becomes, and at some point the phrase is so rare as to be of no value in predictive modeling.
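A short illustration of the exact-match tradeoff; the phrase list is invented and would need to enumerate every wording variant found in real log notes:

```python
# Contextualized phrases must match exactly, so every variant is listed.
TERMINATION_PHRASES = ["was fired", "to be fired", "on list for firing"]

def termination_context(note: str) -> int:
    text = note.lower()
    return int(any(phrase in text for phrase in TERMINATION_PHRASES))

print(termination_context("clmt states he was fired friday"))    # 1
print(termination_context("fire truck responded to the scene"))  # 0, no false positive
print(termination_context("clmt had been fired last week"))      # 0, variant not listed
```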

Words vs. Concepts
We just saw how the contextualized phrases "was fired" or "to be fired" can help deal with some stemming issues. However, all of the "fired" phrases get at the same concept: that the claimant was about to be terminated. I strongly recommend that all of the words representing a single concept be rolled up into a variable for that specific concept. This strategy:
- Rolls the different phrases ("to be fired", "was fired") into one concept variable.
- Allows the flexibility to include "let go", "layoff" and "terminate" words.
- Avoids the various "fired" words having different parameters and potential overspecification in modeling.
- Helps increase the text data occurrence rates so that rare phrases can be useful for predictive modeling.
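One simple way to do the roll-up, assuming per-claim phrase flags like those above (column and concept names are hypothetical):

```python
import pandas as pd

# Hypothetical per-claim phrase flags produced by the inclusion list.
flags = pd.DataFrame({
    "was_fired_flag":   [1, 0, 0],
    "to_be_fired_flag": [0, 1, 0],
    "layoff_flag":      [0, 0, 1],
})

# One concept variable replaces the individual phrase flags in the modeling
# data, so the model estimates a single parameter for "termination".
flags["termination_concept"] = flags[
    ["was_fired_flag", "to_be_fired_flag", "layoff_flag"]
].max(axis=1)
print(flags)
```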

Final Modeling Data
A stable data window, 2005 through mid-2010, was identified. The 300+ raw data fields were combined with about 200 word variables to create the source data file. The source data was transformed into a modeling data file of about 75k closed claims with 500 variables, including about 200 word variables and 25 text mining concepts. Of those closed claims, about 2,300 had been investigated by SIU, with around 450 having been proven fraud.

Build the Model
At this point we had jumped over numerous data hurdles and we were ready. We built a simple GLZ model on the investigated population, and the various statistical results of the model were very good. We then used the model to score all the closed claims that were not investigated and sorted the claims by score from high to low. By the normal measures of lift, the results looked good. So we gave SIU a list of the highest-scoring claims for their review... and the results were not great!
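A minimal sketch of this single-model step, using a logistic GLM from statsmodels as a stand-in for the Statistica GLZ; the file name, column names and X_COLS are placeholders, not Selective's schema:

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder modeling file: one row per closed claim, with an `investigated`
# flag, a `proven_fraud` outcome and the predictor columns in X_COLS.
claims = pd.read_csv("modeling_data.csv")
X_COLS = [c for c in claims.columns if c not in ("investigated", "proven_fraud")]

# Fit the fraud model on the investigated population only.
inv = claims[claims["investigated"] == 1]
fraud_model = sm.GLM(inv["proven_fraud"], sm.add_constant(inv[X_COLS]),
                     family=sm.families.Binomial()).fit()

# Score the claims SIU never looked at and rank them for review.
rest = claims[claims["investigated"] == 0].copy()
rest["fraud_score"] = fraud_model.predict(sm.add_constant(rest[X_COLS]))
ranked = rest.sort_values("fraud_score", ascending=False)
```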

The Solution
We had ignored the biased nature of the investigated population. We needed a bridge between the general population of claims and those investigated by SIU. Approaches to building this bridge include:
- Segmentations (clustering, machine learning)
- Models (GLZ, NN)
We developed the two-model solution. These GLZ models are then used to generate a prediction that serves as the referral criterion.
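Continuing the sketch above, a second GLZ can model the probability that a claim gets investigated at all, fit on the full closed-claim population. The slides do not spell out how the two predictions are combined into the referral criterion, so the product below is only one plausible choice:

```python
import statsmodels.api as sm

# Second model: probability a claim is investigated, fit on all closed claims.
X_all = sm.add_constant(claims[X_COLS])
investigate_model = sm.GLM(claims["investigated"], X_all,
                           family=sm.families.Binomial()).fit()

# Combine the two scores into a referral score (assumed combination rule).
claims["p_investigate"] = investigate_model.predict(X_all)
claims["p_fraud"] = fraud_model.predict(X_all)
claims["referral_score"] = claims["p_investigate"] * claims["p_fraud"]
```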

An Example
The population of 75k closed claims was scored by the respective models. The claims were then sorted by score and broken into percentiles. The report on the following slide presents the performance of the respective models.

Results
For each model (the Investigate model alone, the Fraud model alone, and the two models combined), each score percentile reports four values: Not Looked At / Investigated / Injury Fraud / % of Known Fraud in Percentile.

Percentile | Investigate model       | Fraud model             | Both models
100        | 342 / 389 / 122 / 17.0% | 527 / 204 / 124 / 29.0% | 403 / 328 / 171 / 40.0%
99         | 438 / 288 / 68 / 12.6%  | 582 / 144 / 60 / 14.1%  | 527 / 199 / 72 / 16.9%
98         | 522 / 204 / 46 / 8.9%   | 629 / 97 / 43 / 10.1%   | 582 / 144 / 26 / 6.1%
97         | 573 / 153 / 28 / 6.7%   | 629 / 97 / 28 / 6.6%    | 607 / 119 / 26 / 6.1%
96         | 593 / 133 / 26 / 5.8%   | 646 / 80 / 17 / 4.0%    | 610 / 116 / 21 / 4.9%
95         | 605 / 121 / 17 / 5.3%   | 655 / 71 / 20 / 4.7%    | 610 / 116 / 17 / 4.0%
94         | 654 / 72 / 16 / 3.1%    | 666 / 60 / 11 / 2.6%    | 644 / 82 / 9 / 2.1%
93         | 663 / 63 / 10 / 2.7%    | 660 / 66 / 10 / 2.3%    | 668 / 58 / 9 / 2.1%
92         | 661 / 65 / 9 / 2.8%     | 670 / 56 / 11 / 2.6%    | 642 / 84 / 12 / 2.8%
91         | 671 / 55 / 7 / 2.4%     | 678 / 48 / 9 / 2.1%     | 672 / 54 / 6 / 1.4%
90         | 676 / 50 / 8 / 2.2%     | 698 / 28 / 3 / 0.7%     | 678 / 48 / 3 / 0.7%

The table shows the percentile results and the gain in lift from using the two-model approach: in the top percentile, the percent of known fraud captured jumps from 29% to 40%.
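For reference, a report of this shape can be produced from the scored claims. The sketch continues the earlier placeholders and assumes enough distinct scores to form 100 quantile bins:

```python
import pandas as pd

# Bucket all closed claims into score percentiles, then count investigated
# and proven-fraud claims per bucket, mirroring the report layout above.
claims["percentile"] = pd.qcut(claims["referral_score"], 100,
                               labels=range(1, 101)).astype(int)
report = claims.groupby("percentile").agg(
    investigated=("investigated", "sum"),
    injury_fraud=("proven_fraud", "sum"),
    total=("proven_fraud", "size"),
)
report["not_looked_at"] = report["total"] - report["investigated"]
report["pct_known_fraud"] = report["injury_fraud"] / claims["proven_fraud"].sum()
print(report.sort_index(ascending=False).head(11))  # percentiles 100 down to 90
```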

A Final Test
As a final test, the Fraud and Investigate models were used to score the open claims in our claims DB. The top-scoring claims were reviewed by SIU, and 50+% were accepted for investigation. This served as the final acceptance check.

After Acceptance
Model deployment was done in Statistica Live Score. The models' PMML and variable transformation code are generated in a Statistica workspace and passed directly into system integration. Deployment means nightly scoring of all claims that had any changes in the claim file that day. After the claims are scored, a report listing all claims that exceed the determined threshold is sent to SIU via e-mail. Then we wait...
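Purely as an illustration of the nightly workflow (the production deployment used Statistica Live Score, not Python), the sketch below re-scores claims touched that day and e-mails those above the SIU threshold. The threshold, column names, addresses and mail host are all placeholders.

```python
import smtplib
from email.message import EmailMessage

THRESHOLD = 0.80                                    # placeholder referral threshold
todays = claims[claims["changed_today"] == 1]       # claims with file changes today
referrals = todays[todays["referral_score"] >= THRESHOLD]

# Send the referral list to SIU (addresses and mail host are placeholders).
msg = EmailMessage()
msg["Subject"] = "Automated fraud referrals"
msg["From"], msg["To"] = "scoring@example.com", "siu@example.com"
msg.set_content(referrals[["claim_id", "referral_score"]].to_string(index=False))
with smtplib.SMTP("mailhost.example.com") as smtp:
    smtp.send_message(msg)
```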

Project Status
To date, 32 models are deployed across 6 lines of business, and hundreds of investigations across all 6 lines have been opened from model referrals.
- Between 50% and 60% of model referrals are accepted for investigation.
- Somewhere between 25% and 50% of these investigated claims are ultimately determined to be fraudulent.
- Claims referred by the modeling project have an economic impact that averages roughly 35+% larger than claim handler referrals.
- The fraud modeling project paid for itself while still in development. After 2 years, the project's ROI is estimated at between 350% and 500%.
- At current rates, the automated fraud referral project has increased SIU cost avoidance by 35-50%.

Future Opportunities
Opportunities exist to improve our SIU models' performance. These include:
- SIU referral feedback
- Text mining enhancements
- Threshold optimization
- Re-referrals
- Rules that utilize text mining results that are not viable in modeling
The claims modeling infrastructure can be used for a variety of other claims modeling projects, including recovery, severity, litigation, etc.