Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study



Similar documents
Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Tax Fraud in Increasing

Oracle Advanced Analytics Oracle R Enterprise & Oracle Data Mining

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

The Data Mining Process

IBM SPSS Modeler Professional

Make Better Decisions Through Predictive Intelligence

Data Mining Solutions for the Business Environment

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Hexaware E-book on Predictive Analytics

Anomaly and Fraud Detection with Oracle Data Mining 11g Release 2

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

Advanced In-Database Analytics

EMC Greenplum Driving the Future of Data Warehousing and Analytics. Tools and Technologies for Big Data

Easily Identify Your Best Customers

2015 Workshops for Professors

IBM SPSS Modeler 14.2 In-Database Mining Guide

Azure Machine Learning, SQL Data Mining and R

IBM SPSS Modeler 15 In-Database Mining Guide

Fraud and Anomaly Detection Using Oracle Advanced Analytic Option 12c

Big Data and Predictive Analytics: Fiserv Data Mining Case Study [CON8631] Data Warehouse and Big Data

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Hurwitz ValuePoint: Predixion

KnowledgeSEEKER Marketing Edition

Oracle Data Miner (Extension of SQL Developer 4.0)

SEIZE THE DATA SEIZE THE DATA. 2015

Using Predictive Analytics to Detect Contract Fraud, Waste, and Abuse Case Study from U.S. Postal Service OIG

Solve your toughest challenges with data mining

Using Predictions to Power the Business. Wayne Eckerson Director of Research and Services, TDWI February 18, 2009

Ramesh Bhashyam Teradata Fellow Teradata Corporation

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Oracle Data Miner (Extension of SQL Developer 4.0)

IBM SPSS Modeler Premium

Qi Liu Rutgers Business School ISACA New York 2013

IBM SPSS Modeler Professional

Data Integration and ETL with Oracle Warehouse Builder NEW

Solve your toughest challenges with data mining

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

IBM's Fraud and Abuse, Analytics and Management Solution

Discovering, Not Finding. Practical Data Mining for Practitioners: Level II. Advanced Data Mining for Researchers : Level III

SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SESSION CODE: 603

Advanced Analytics for Call Center Operations

TIM 50 - Business Information Systems

The basic data mining algorithms introduced may be enhanced in a number of ways.

Data Mining. Dr. Saed Sayad. University of Toronto

Predictive modelling around the world

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

Data Mining from A to Z: Better Insights, New Opportunities WHITE PAPER

Importance or the Role of Data Warehousing and Data Mining in Business Applications

Automated Predictive Analysis. Tomer Steinberg

Sybase IQ Supercharges Predictive Analytics

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Starting Smart with Oracle Advanced Analytics

Why is Internal Audit so Hard?

IBM WebSphere DataStage Online training from Yes-M Systems

extreme Datamining mit Oracle R Enterprise

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

DATA MINING AND WAREHOUSING CONCEPTS

How to Optimize Your Data Mining Environment

SAS Fraud Framework for Banking

In-Database Analytics

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Oracle Database 11g: SQL Tuning Workshop

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

Data Mining Algorithms Part 1. Dejan Sarka

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Confidently Anticipate and Drive Better Business Outcomes

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

ANALYTICS CENTER LEARNING PROGRAM

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

ORACLE TAX ANALYTICS. The Solution. Oracle Tax Data Model KEY FEATURES

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE

Solve Your Toughest Challenges with Data Mining

Using Data Mining to Detect Insurance Fraud

IAF Business Intelligence Solutions Make the Most of Your Business Intelligence. White Paper November 2002

Advanced Big Data Analytics with R and Hadoop

Getting Started with Oracle Data Miner 11g R2. Brendan Tierney

How To Make A Credit Risk Model For A Bank Account

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Oracle Database 11g: SQL Tuning Workshop Release 2

High-Performance Analytics

In-Memory Analytics for Big Data

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

SAS In-Database Processing

Transcription:

Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study Jay Zhou, Ph.D. Business Data Miners, LLC 978-726-3182 jzhou@businessdataminers.com Web Site: www.businessdataminers.com LinkedIn: www.linkedin.com/in/jiangzhou Copyright Business Data Miners. 1

Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study Four Components Business: Check Fraud Scoring Model: Logistic Regression Modeling Tool: Oracle Data Mining Data Storage/Management: Oracle Database Copyright Business Data Miners. 2

Business Fraud Churn LTV Credit Marketing Tree Gradient Boosting Data Mining/Predictive Analytics Methods Logistic reg ANN SVM Clustering Regression SAS DM/PA Tools R Splus SPSS Statsoft ODM Weka Oracle DB2 Data Storage/Management SQL Server MY SQL Copyright 2009 Business Data Miners. SAS File R File 3

Fraud Churn Business Problems LTV Credit Marketing Tree Gradient Boosting Data Mining/Predictive Analytics Methods Logistic reg ANN SVM Clustering Regression SAS DM/PA Tools R Splus SPSS Statsoft ODM Weka Oracle DB2 Data Storage/Management SQL Server MY SQL Copyright 2009 Business Data Miners. SAS File R File 4

Fraud Churn Business Problems LTV Credit Marketing Tree Data Mining/Predictive Analytics Methods Anomaly Logistic reg SVM Clustering Regression DM/PA Tools/Database ODM Oracle 11g Copyright 2009 Business Data Miners. Copyright Business Data Miners. 5

About Check Deposit When a check is deposited, two banks are involved: Depositor s bank and payer s bank Usually the depositor s bank makes the funds available for the depositor to withdraw before the payer s bank actually transfers the funds to the depositor s bank. Copyright Business Data Miners. 6

About Check Deposit Fraud Fraudsters take advantage of this gap. They deposit fraudulent checks to inflate their account balances and then withdraw funds quickly. The depositor s bank suffers losses when fraudulent checks returned unpaid by the payer s bank. Copyright Business Data Miners. 7

About Rule Based Deposit Fraud Detection Systems Advantages Understandable by analysts. Make use of analysts experience and intuition Examples Rule 1: # of Check Deposits In Last 7 Days More than 35 Rule 2: Total Amount of Check Deposits In Last 5 Days Higher Than $15,000 Copyright Business Data Miners. 8

Statistical Predictive Model Data driven. Built based on a large number of historical cases. Advantages Handle many variables simultaneously and normally detect more frauds with lower false positives. This is illustrated in the next slide. Copyright Business Data Miners. 9

Rule 1. High Dollar Amount Dollar Amount of Transactions in Last 7 Days Predictive Model that uses two variables simultaneously Rule 2. High Frequency Lowering Threshold Increases False Positive Number of Transactions in Last 7 Days Copyright Business Data Miners. 10

The Goal of the Project Build a statistical model to predict returned items (RDI). The work is experimental and the final model has not been implemented operationally. The work was done when Dr. Jay Zhou worked for Citizens Bank as a VP. For business questions, please contact Josh Ablett at josh.ablett@citizensbank.com. Copyright Business Data Miners. 11

Major Steps 1. Data Gathering 2. Data Validation 3. Data Preparation 4. Feature Variable Calculation 5. Predictive Model Building/Testing 6. Predictive Model Deployment Copyright Business Data Miners. 12

Data Used Account Data Account open date Account open branch Account type Check Deposit (54 millions) Deposit Date Deposit Amount Paying Bank Number Receiving Bank Number Check Serial Number Third Party Known Fraud Account Hits A service provider has a repository of known fraudulent checks. A hit is generated when a check s paying bank number is found in the known fraudulent check list. Copyright Business Data Miners. 13

Data Challenges Security: account numbers, SSN, etc. Large Volume: 54 million check items Refresh Data Daily Various Data Sources Text Files SQL Server Oracle DB Copyright Business Data Miners. 14

RDBMS Solution Oracle DB Access is controlled: security External table: query or load text file heterogeneous services: load SQL Server table Materialized view : refresh data based on predefined schedules Table Partition, Index: handle large data Copyright Business Data Miners. 15

2. Data Validation SQL functions Check the number of null values Minimum, Maximum, Median values Number of distinctive values Number of records by date () width_bucket - Histogram: using - And More Copyright Business Data Miners. 16

( continued ) 2. Data Validation Metadata table Store variable statistics. Easier to handle hundreds of variables. We can query it to get the variables with desired characteristics. Unique keys: Variables with names containing the word credit and percentage of missing values less than 15% Copyright Business Data Miners. 17

3. Data Preparation Merge transaction data with performance data Most of the time, there are no common keys to uniquely link transactions and performance data. Unmatched cases Duplicates Iterative SQL joining Try many iterations of matching methods, from stronger ones to weaker ones. Successfully matched cases are removed from the pools. Copyright Business Data Miners. 18

3. Data Preparation (continued) Sampling Randomly Sample the data based on depositing account numbers. Creating Training and Testing Sets Randomly split data into training and testing sets. There is no account number overlap between two sets. Copyright Business Data Miners. 19

4. Feature Variable Calculation Based on the original variables existing in the data, we are able to create more salient variables that capture interesting transaction patterns. Copyright Business Data Miners. 20

( Continued ) 4. Feature Variable Calculation Risk Group Variables We calculate historical fraud rate for each account type. Then we group account types into a small number of risk groups based on their historical fraud rates. The group numbers are used as one of input variables to the model. Use SQL group by to calculate fraud ratio. Copyright Business Data Miners. 21

( Continued ) 4. Feature Variable Calculation Transaction Time Series Variable Life Time Maximum Check Amount From Paying Account max(amount) over (partition by ABA_ROUTING_NUMBER,PAYING_ACCOUNT order by TRANSACTION_DATE rows between unbounded preceding and current row) life_time_max_amount Number of Checks Deposited in Last Two Days count(1) over (partition by ACCOUNT_NUMBER order by TRANSACTION_DATE range between 1 preceding and current row) num_2d Copyright Business Data Miners. 22

Calculate Feature Variables (Continued) It is helpful to create as many feature variables as possible. Then we can use statistical methods to select a small number of variables that are most relevant to the our target variable, i.e., returned item. Copyright Business Data Miners. 23

Data Preparation/Feature Variable Calculation Considerations Use views or materialized views when possible. Challenge caused by separation of data and processing logic With a view or materialized view, the logic that is used to produce the view can be retrieved using the following query: select text from user_views where view_name='data_view'; Copyright Business Data Miners. 24

5.Predictive Model Building/Testing We used logistic regression models in Oracle 11g ODM. Variables used in the model including: Account Type Risk Group Days Since Account Opened Check Serial Number Out of Sequence Third Party Negative File Hit Number of Checks Deposited In Past Two Days Account Open Branch Risk Item Amount vs. Maximum Check Amount from the Paying Bank Account Copyright Business Data Miners. 25

5.Predictive Model Building/Testing ( Continued ) The final model is stored as an Oracle mining object. Prediction_Probability function: remove the traditional bottleneck of model deployment. select a.id, prediction_probability(model_object,'1' using *) score from table_data a; Copyright Business Data Miners. 26

6. Model Deployment All of the steps are implemented as materialized views. Score calculation for new data is simply refreshing a series of materialized views. Copyright Business Data Miners. 27

7. Reporting The checks are rank ordered based on the risk scores. Analysts can view the report using a web browser. Copyright Business Data Miners. 28

Model Score Performance If we maintain the same fraud prevention staff level, it is forecasted that the total fraud loss prevented will be up 20%. Copyright Business Data Miners. 29

Drawbacks of In-Database Modeling Data miners are not self-sufficient Need support from database administrator Copyright Business Data Miners. 30

Summary Predictive scoring model Significantly improve the check deposit fraud detection performance. From data miner perspective Powerful integration of SQL and advanced modeling functions Single environment Almost all the processes executed within a database environment Increased security, productivity, manageability and scalability. Copyright Business Data Miners. 31

Questions? Copyright Business Data Miners.