Enhancing Compliance with Predictive Analytics



Similar documents
A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property and Casualty Insurance Predictive Modeling Process in SAS

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

CART 6.0 Feature Matrix

2015 Workshops for Professors

Azure Machine Learning, SQL Data Mining and R

Data Mining Applications in Higher Education

Data Mining Techniques Chapter 6: Decision Trees

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

A fast, powerful data mining workbench designed for small to midsize organizations

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Data Mining: Overview. What is Data Mining?

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Data Mining Methods: Applications for Institutional Research

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Data Mining Algorithms Part 1. Dejan Sarka

Predictive Modeling of Titanic Survivors: a Learning Competition

How To Make A Credit Risk Model For A Bank Account

Foundations of Business Intelligence: Databases and Information Management

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Chapter 7: Data Mining

Data Mining Solutions for the Business Environment

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses

not possible or was possible at a high cost for collecting the data.

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

The Data Mining Process

Get to Know the IBM SPSS Product Portfolio

Virtual Site Event. Predictive Analytics: What Managers Need to Know. Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015

The Predictive Data Mining Revolution in Scorecards:

Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK

Course Syllabus. Purposes of Course:

Data Mining with SAS. Mathias Lanner Copyright 2010 SAS Institute Inc. All rights reserved.

Gerry Hobbs, Department of Statistics, West Virginia University

A Review of Data Mining Techniques

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Introduction to Data Mining

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

An Overview and Evaluation of Decision Tree Methodology

Data mining and statistical models in marketing campaigns of BT Retail

Predictive Models for Enhanced Audit Selection: The Texas Audit Scoring System

Easily Identify Your Best Customers

Using Tableau Software with Hortonworks Data Platform

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?

Database Marketing, Business Intelligence and Knowledge Discovery

Importance or the Role of Data Warehousing and Data Mining in Business Applications

KnowledgeSEEKER Marketing Edition

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Pentaho Data Mining Last Modified on January 22, 2007

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Prediction of Stock Performance Using Analytical Techniques

Data Mining Applications in Fund Raising

Data Mining - Evaluation of Classifiers

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

Cleaned Data. Recommendations

Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Warehousing and Data Mining in Business Applications

Data Mining for Fun and Profit

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

Using Adaptive Random Trees (ART) for optimal scorecard segmentation

Make Better Decisions Through Predictive Intelligence

Hexaware E-book on Predictive Analytics

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry

A New Approach for Evaluation of Data Mining Techniques

Introduction. A. Bellaachia Page: 1

Data Mining and Predictive Modeling with Excel 2007

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Data Mining: An Introduction

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Customer and Business Analytic

Data Mining Practical Machine Learning Tools and Techniques

DATA MINING TECHNIQUES AND APPLICATIONS

Transcription:

Enhancing Compliance with Predictive Analytics FTA 2007 Revenue Estimation and Research Conference Reid Linn Tennessee Department of Revenue reid.linn@state.tn.us

Sifting through a Gold Mine of Tax Data Discovering the patterns, trends, and anomalies in massive data is one of the grand challenges of the information age. [Kantardzic, M. and Zurada, J., Next Generation of Data Mining Applications, John Wiley & Sons, New York, 2005] FTA 2007 September 2

Outline Introduction What, Why, How Caveats Currently in Development Stage...stay tuned for results and, Where Does this Fit into a Research Division? Predictive Modeling and Data Mining Definitions, Examples and Methods Logistics & (preliminary) Lessons Learned Going from Modeling to Selection Bridging the Gap to Other Research Responsibilities Discussion, Questions, Feedback FTA 2007 September 3

Enhancing Audit Selection Strategies and Goals Tax administration exists to ensure compliance with the tax laws Audit selection is one of the major enforcement mechanisms for Sustaining and enforcing compliance Maintaining the stream of revenue Project Goal: to help achieve maximum taxpayer compliance What is the best combination of strategies for achieving this goal? How do we efficiently use limited resources in this effort? The Department already profiles potential audit candidates Utilizing data in our transactional revenue integrated tax system Rule-based selection, Leads, Relationships (e.g., subsidiary audits) Predictive modeling provides an additional advanced method in a systematic & automated way, and on a very big scale FTA 2007 September 4

Predictive Platform Leveraging Existing Technologies & Acquiring Additional Tools Existing Platform Long-time SAS shop (from mainframe to mouse) Client-server setup SAS BASE, STAT, ETS and others for complete data manipulation and statistical analysis Raw transactional data extracted from mainframe (data warehouse in future) and is stored, manipulated & processed on a single dedicated Windows server (primarily using SAS/CONNECT) ESRI ArcGIS ArcView sophisticated geocoding & mapping (client side) Additional Technologies Additional CPU (2-CPUs total), RAID technology with multiple controllers, and a lot more disk storage for new data & new SAS temp files SAS Enterprise Miner full, heavyweight data mining solution but integrated with existing modules SAS Text Miner additional module to incorporate case notes containing significant & otherwise uncoded predictor information SAS DataFlux data quality and cleansing tool for record linkage, address verification, geocoding FTA 2007 September 5

Predictive Platform New Tools: What Fits versus Other Options Toolbox Enhancement We chose to ramp up our capabilities with additional SAS modules/products Obvious benefits of seamless and straightforward integration, smaller learning curve, easier pursuit of advanced techniques (getting into the guts of the software) that scale to large sets of data Perl regular expressions (PRX), Hash Data Step Component Objects, Pipes Scalable Performance Data Engine (SPDE) Macros, Macros & more Macros please (and of course, PROC SQL) Predictive analytics doesn t require SAS EM PROC LOGISTIC can do the trick to get us started...but, on a limited and much smaller scale without other major EM benefits Alternative Solutions Other data mining packages available, from enterprise level to standalone and even freeware Business intelligence tools, but data mining as described in BI tools is different (OLAP, queries, cubes, etc.) Contractors offering solutions from full compliance software deployments to single audit selection efforts FTA 2007 September 6

Predictive Modeling Explanation and Examples What is data mining? Many definitions that aren t all synonymous Advanced methods for analyzing large data sets in search of consistent patterns not readily apparent Predictive models are the most frequently used form of data mining Allows organizations to predict the outcome of a given process Produces a numeric score indicating the most likely outcome In audit selection the score ranks the expected outcome of an audit Countless examples exist in industry Used commonly in finance and insurance industries for assessing credit risk and identifying fraudulent activities Over 200 data mining initiatives ongoing in the federal government IRS has at least 9 planned and 2 operational efforts (based on 2005 review) Including, its Electronic Fraud Detection System (EFDS) FTA 2007 September 7

Data Mining Overview Not a black box Integrated process with several different phases Data identification and preparation Exploration Model building and validation Deployment Preparing the data and combining different sources into a high quality data set is critical The software technology will allow advanced exploration and visualization to identify patterns and trends on a very large scale Different models provide diverse answers Choose the set of models producing the most accurate and stable results Ongoing process with constant fine-tuning and adjustments It is important to assess the predictive model s success and how it improves audit selection FTA 2007 September 8

Data Preparation Obtaining, Cleansing, Standardizing, Gleaning Easy to be fooled into thinking it s simple & won t take long Triple or quadruple the allotted time (even if a data warehouse is available) Translating intended structure into efficient computer code is daunting Choosing, obtaining and processing multiple data sources DOR tax data, other state agencies, private vendors, census Pre-modeling Goal: to create a single, whole picture of the taxpayer with as much information as possible Then, to aggregate and condense into the best possible structure Requires record linkage techniques for the best possible match Address standardization, name cleansing, phonetic matching Geocoding for mapping and adding demographic information Errors, missing data, outliers, undercoverage, high dimensionality Complexities can be overcome SAS EM is customized to correct for many of the problems FTA 2007 September 9

Data Preparation Aggregating into a Single Useful Picture Selecting the best structure depends on analytic strategy Most predictive modeling condenses multiple units into one observation Two necessary tasks: choosing meaningful aggregates and then developing (coding) appropriate transformations Entity versus Account level data Bank analogy: multiple household members with multiple accounts/products Entities are wholly audited, but individual accounts submit tax returns Can hierarchical modeling work? Deciphering existing audit results Target window for analysis Significant assessments, tax types, reasons, adjustments, settlements Dual analytic sets: Target data versus Scoring data With audit selection, a target set is small (rare observations) Massive scoring set requires special coding for scaling to millions of records FTA 2007 September 10

Predictive Modeling Wait, more preparation? Oversampling (or, stratified sampling on the target) Non-compliance is rare relative to millions of returns Create a biased model to over-represent the target class, then adjust prediction results using prior probabilities Categorical inputs Collapse by Recoding, or even better by Clustering Missing values more common as data & dimensionality increase Disregard, Impute, or Create new unknown levels Imputation techniques: simple (e.g., medians) vs. complex (e.g., tree-based) Variable redundancy and irrelevancy reducing dimensionality Principal Components or Variable Clustering Clustering groups most correlated variables and selects representative Variable screening eliminate irrelevant variables, attentive to interactions Transformations nonlinearities, skewed distributions, outliers Almost there Split data into Training and Validation sets Treat validation as unknown data but stratify for equal proportions FTA 2007 September 11

Predictive Modeling Fitting Different Models Prediction SAS EM provides multiple statistical methods For audit selection, we ll focus on Logistic Regression, Decision Trees, and Neural Networks Test and possibly incorporate various combinations, or Ensembles Boosting model averaging with error correction weights (future capability?) Two-Stage Models model both outcome probability & expected profit Identifying candidates with highest propensity for non-compliance is valuable But, also would like to maximize productivity (and ROI) by pursuing those with greatest assessment potential Model Comparison Let the best fitted model(s) win Connect all models to a comparison node for sophisticated analysis and visual representations, and select one having the most predictive power Score! ROC & Lift charts provide comparisons of each model s classification accuracy Summary statistics provide rankings, misclassification, profit/loss, error statistics Attach best model to a score node for internal or external scoring FTA 2007 September 12

Predictive Modeling Logistic Regression A binary shift from regular revenue estimation of linear dependent variables Modeling probability of non-compliance and ranking highest pˆ log( ) = 1 pˆ wˆ 0 + ˆ 1 1 + ˆ 2 w Estimated probabilities obtained by taking the log of the odds to restrict the outcome between 0 and 1 Odds Ratios obtained by exponentiating the coefficients Measures odds of being non-compliant relative to predictors e.g., a $1,000 change in exemptions increases odds of non-compliance by 5% Assessing predictive ability (no R-squareds) x w Want to maximize the Percent Concordant, or the estimate of correctly predicted probabilities (also, referred to as a C-stat) Subset selection Stepwise versus Best-Subsets (or All-Subsets) Assessing classifiers on the validation data with multiple measures ROC curves, profit, sensitivity, mean squared error (MSE) x 2 FTA 2007 September 13

Predictive Modeling Decision Trees Exemptions: <$10k Industry: Mfg Target: 0, 1 Exemptions: >=$10k Industry: Other Multiple Variable Analysis Technique that Creates a Set of Decision Rules Starts at the top, or Root node, and autonomously or interactively creates Branches that split, or segment, input variables The Splits and corresponding decision rules are based on traditional statistical tests of significance and other machine learning algorithms (CHAID, CART, etc.) Decision Tree Split Search Algorithm segments each input variable according to a chi-square test to obtain the greatest independence between the two branches After the best split on each input is determined, the input split having the highest logworth score (or lowest adjusted p-value) is chosen as the next partition Process continues until a Maximal Tree is developed This tree predicts the training data well, but must be adjusted, or Pruned, to generalize well on the independent validation data Different assessments available: decisions (accuracy), rankings (correct ordering), estimates (low average squared error) Multiple Benefits: ease of development and interpretation Powerful assessment of most important variables Selects inputs, accommodates nonlinearities, detects interactions, handles missing FTA 2007 September 14

Predictive Modeling Other Model Types Neural Networks a real black box Powerful and flexible nonlinear models used for supervised prediction Is like a regression model on a set of derived inputs, called Hidden units, that are regressions on various linear combinations of the original inputs Not necessarily better tendency to overfit, difficult to understand, and extreme lack of interpretability, but May provide good predictions by itself or as part of an Ensemble Decision trees or surrogate models on results can demystify complexities Ensembles combination of different model techniques Combines predictions from multiple models (multiple samples or methods) Two-Stage Models Estimate noncompliance propensity & assessment amount Prediction from the binary outcome model becomes an input for the expected profit (assessment) model Can be constructed separately or SAS EM contains a two-stage node FTA 2007 September 15

Predictive Modeling Model Assessment and Scoring After tuning each individual model with the best fit statistics, it s time to choose the model with best predictive performance SAS EM major benefit provides extensive summary statistics and associated graphics for easy comparison of all models ROC charts & Lift charts can show best models for separating outcomes Default best model is one with the smallest validation misclassification rate But, can choose from a multitude of other measures Profit (assessment) optimization is an alternative selection method Useful if decent assessment/profit information available Best if we can model the expected value of the profit random variable for each outcome Scoring can occur inside or outside of SAS EM on data structured similar to training & validation (SAS, C, Java) Better to score outside of SAS EM because of magnitude of tax returns Score code modules run all prediction code and transformations FTA 2007 September 16

Data Mining Future Techniques and Research Possibilities Text Mining and Cluster Analysis SAS Text Miner runs as a node within SAS EM and allows for analysis and pattern searching of unstructured text Case notes are a source of important predictor information describing assessment reasons and audit outcomes (the real scoop) Must incorporate other pattern discovery techniques because obviously no notes for non-audited returns Cluster analysis unsupervised classification technique that groups data based on similarities in input variables Attempts to find segments with similar attributes that can be applied to new data for classifying potential new targets Other Possibilities: Overall Secondary benefits great, extensive set of taxpayer data Time series hope to discover trends/patterns in corporate taxes Other predictive or pattern discovery type questions sure to be many FTA 2007 September 17

Audit Selection Process Logistics & Lessons Learned Domain versus analytical and data expertise Not feigning audit expertise, but trying to assist in enhancement of audit selection (& hopefully incur consequential benefits for other research tasks) Collaboration with the audit division is an integral and continuous part Crucial to have the domain experts in audit identify relevant processes, business questions and data necessary to start modeling Predictive model results and audit selection scores will be provided to audit division management and departmental executives for ultimate decisions, implementation and deployment Realism about effort and time involved Difficult to envision the programming difficulty of seemingly simple concepts related to the targeted business problem and its requisite data Learning curve Steep, but not setting out to reinvent the wheel Software training, other state efforts, existing credit scoring models Not an automatic process, but instead a formidable discipline FTA 2007 September 18

Discussion, Questions & Feedback FTA 2007 September 19