Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information
Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi
15/07/2015, IEEE IJCNN 2015, Killarney, Ireland
INTRODUCTION
Fraud detection is a notably challenging problem because of:
- concept drift (customers' habits evolve);
- class unbalance (genuine transactions far outnumber frauds);
- uncertain class labels (some frauds are not reported, or are reported with a large delay, and only a few transactions can be timely investigated).
INTRODUCTION II
Fraud-detection systems (FDSs) differ from standard classification tasks:
- only a small set of supervised samples is provided by human investigators (they can check only a few alerts);
- the labels of the majority of transactions become available only several days later (after customers have reported unauthorized transactions).
PROBLEM FORMULATION
We formalise fraud detection as a classification problem:
- At day t, the classifier K_{t-1} (trained up to day t-1) associates to each feature vector x in R^n a score P_{K_{t-1}}(+|x).
- The k transactions with the largest P_{K_{t-1}}(+|x) define the alerts A_t reported to the investigators.
- Investigators provide feedbacks F_t about the alerts in A_t, defining a set of k supervised couples (x, y):
  F_t = {(x, y), x in A_t}   (1)
- F_t are the only immediately available supervised samples.
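The alert-generation step above can be sketched as follows (a minimal sketch; the function name and toy scores are hypothetical, not from the paper):

```python
import numpy as np

def select_alerts(scores, k):
    """Return the indices of the k transactions with the highest
    fraud score P_{K_{t-1}}(+|x); these form the alert set A_t."""
    # argsort ascending, then reverse so the highest score comes first
    return np.argsort(scores)[::-1][:k]

# toy scores for 6 transactions of day t
scores = np.array([0.10, 0.95, 0.40, 0.05, 0.80, 0.60])
alerts = select_alerts(scores, k=3)
print(alerts.tolist())  # [1, 4, 5]
```

Only these k alerted transactions receive an immediate label from the investigators, which is what makes F_t both small and non-randomly sampled.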
PROBLEM FORMULATION II
At day t, the delayed supervised couples D_{t-δ} are transactions that have not been checked by investigators, but whose labels are assumed to be correct after δ days have elapsed.
Figure: The supervised samples available at day t include: i) feedbacks of the first δ days and ii) delayed couples occurred before the δ-th day.
F_t is a small set of transactions that the FDS considers risky, while D_{t-δ} contains all the transactions occurred in a day (~99% genuine transactions).
Figure: Every day we receive a new set of feedbacks (F_t, F_{t-1}, ..., F_{t-(δ-1)}) from the first δ days and a new set of delayed transactions occurred on the δ-th day (D_{t-δ}). In this figure we assume δ = 7.
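The daily update of the two supervised pools can be sketched as follows (a minimal sketch under the δ = 7 assumption; the bookkeeping names are hypothetical):

```python
from collections import deque

DELTA = 7  # days after which all labels are assumed correct

feedback_window = deque(maxlen=DELTA)  # F_t, F_{t-1}, ..., F_{t-(DELTA-1)}
delayed_sets = []                      # D_{t-DELTA}, D_{t-DELTA-1}, ...

def end_of_day(day_transactions, feedbacks):
    """Store the day's feedbacks; when the oldest day leaves the
    delta-day window, its full transaction set becomes a delayed
    supervised set D_{t-delta}."""
    if len(feedback_window) == DELTA:
        old_day, _old_feedbacks = feedback_window[0]
        delayed_sets.append(old_day)  # now fully labelled
    feedback_window.append((day_transactions, feedbacks))

# simulate 10 days of the transaction stream
for day in range(10):
    end_of_day(day_transactions=f"D_day{day}", feedbacks=f"F_day{day}")

print(len(feedback_window), len(delayed_sets))  # 7 3
```

After 10 simulated days the window holds the 7 most recent feedback sets, and 3 full days have graduated into delayed supervised sets.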
ACCURACY MEASURE FOR A FDS
The goal of a FDS is to return accurate alerts, i.e. the highest precision within A_t. This precision can be measured by the quantity
  p_k(t) = #{(x, y) in F_t s.t. y = +} / k   (2)
where p_k(t) is the proportion of frauds among the top k transactions with the highest fraud likelihood [1].
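Equation (2) translates directly into code (a minimal sketch; the toy feedback set is hypothetical):

```python
def precision_pk(feedbacks, k):
    """p_k(t): fraction of frauds among the k alerted transactions.
    `feedbacks` is the set F_t of (x, y) couples with y in {'+', '-'}."""
    assert len(feedbacks) == k  # one feedback per alert
    return sum(1 for _, y in feedbacks if y == '+') / k

# toy feedback set for k = 5 alerts: 3 confirmed frauds, 2 false alerts
F_t = [((0.9,), '+'), ((0.8,), '+'), ((0.7,), '-'), ((0.6,), '+'), ((0.5,), '-')]
print(precision_pk(F_t, k=5))  # 0.6
```

Note that p_k can be computed immediately at day t because it only needs the investigators' feedbacks, not the delayed labels.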
LEARNING STRATEGY
Learning from the feedbacks F_t is a different problem than learning from the delayed samples in D_{t-δ}:
- F_t provides recent, up-to-date information, while D_{t-δ} might already be obsolete once it becomes available.
- The percentage of frauds in F_t and D_{t-δ} is different.
- Supervised couples in F_t are not independently drawn, but are instead selected by K_{t-1}. A classifier trained on F_t learns how to label the transactions that are most likely to be fraudulent.
Feedbacks and delayed transactions therefore have to be treated separately.
CONCEPT DRIFT ADAPTATION
Two conventional solutions for CD adaptation are W_t and E_t [6, 5]. To learn separately from feedbacks and delayed transactions we propose F_t, W^D_t and E^D_t.
Figure: Supervised information used by the different classifiers in the ensemble and sliding-window approaches.
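The two adaptation schemes on delayed samples can be contrasted in a short sketch (hypothetical names; `fit` is a stand-in for an actual training routine, not the paper's implementation):

```python
def fit(samples):
    """Stand-in for training one classifier (e.g. a balanced RF);
    here it only records the training-set size."""
    return {"n_train": len(samples)}

def train_sliding_window(delayed_days, alpha):
    """W^D_t: a single model retrained on the last alpha days of
    delayed samples, pooled into one window."""
    window = [s for day in delayed_days[-alpha:] for s in day]
    return fit(window)

def train_ensemble(delayed_days, alpha):
    """E^D_t: one model per day; their posteriors are averaged at
    prediction time."""
    return [fit(day) for day in delayed_days[-alpha:]]

# toy stream: 5 days with 3 delayed samples each
days = [[("x", "+")] * 3 for _ in range(5)]
wd = train_sliding_window(days, alpha=2)  # one model, 6 samples
ed = train_ensemble(days, alpha=2)        # 2 single-day models
print(wd["n_train"], len(ed))  # 6 2
```

The sliding window forgets old concepts by dropping whole days from the pool; the ensemble keeps per-day models and can reweight or average them.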
CLASSIFIER AGGREGATIONS
W^D_t and E^D_t have to be aggregated with F_t to exploit the information provided by feedbacks. We combine these classifiers by averaging their posterior probabilities.
- Sliding window: P_{A^W_t}(+|x) = ( P_{F_t}(+|x) + P_{W^D_t}(+|x) ) / 2
- Ensemble: P_{A^E_t}(+|x) = ( P_{F_t}(+|x) + P_{E^D_t}(+|x) ) / 2
A^E_t and A^W_t give feedbacks a larger influence on the probability estimates w.r.t. E_t and W_t.
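The aggregation itself is a plain average of the two posteriors (a minimal sketch; the numeric inputs are hypothetical):

```python
def aggregate(p_feedback, p_delayed):
    """Average the posteriors of the feedback classifier F_t and a
    delayed-sample classifier (W^D_t or E^D_t):
    P_{A_t}(+|x) = (P_{F_t}(+|x) + P_{D}(+|x)) / 2."""
    return 0.5 * (p_feedback + p_delayed)

# feedback model is confident, delayed model is not
print(aggregate(0.75, 0.25))  # 0.5
```

Because each component contributes with weight 1/2 regardless of how many days of delayed data it was trained on, the (small) feedback set gets a much larger per-sample influence than in W_t or E_t, where feedbacks are simply pooled with the delayed samples.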
TWO RANDOM FORESTS
We used two different Random Forest (RF) classifiers depending on the fraud prevalence in the training set:
- for the classifiers on delayed samples we used a Balanced RF [3] (undersampling before training each tree);
- for F_t we adopted a standard RF [2] (no undersampling).
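The per-tree undersampling of a Balanced RF can be sketched as follows (a minimal sketch of the sampling step only, with hypothetical names; a real implementation would then fit one decision tree per balanced set):

```python
import random

def undersample(samples, seed):
    """Balance a training set by keeping all frauds ('+') and an
    equal-size random subset of the genuine transactions ('-')."""
    rng = random.Random(seed)
    frauds = [s for s in samples if s[1] == '+']
    genuine = [s for s in samples if s[1] == '-']
    return frauds + rng.sample(genuine, min(len(frauds), len(genuine)))

def balanced_rf_training_sets(samples, n_trees):
    """One balanced training set per tree, as in a Balanced RF [3]."""
    return [undersample(samples, seed=i) for i in range(n_trees)]

# toy set: 2 frauds, 8 genuine (the real data is ~99% genuine)
data = [((i,), '+') for i in range(2)] + [((i,), '-') for i in range(8)]
sets = balanced_rf_training_sets(data, n_trees=3)
print([len(s) for s in sets])  # [4, 4, 4]
```

Each tree sees a 50/50 class mix drawn from a different random subset of the genuine transactions, so the forest as a whole still exploits most of the majority class.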
DATASETS
We considered two datasets of credit card transactions:

Table: Datasets
Id   | Start day  | End day    | # Days | # Instances | # Features | % Fraud
2013 | 2013-09-05 | 2014-01-18 | 136    | 21,830,330  | 51         | 0.19%
2014 | 2014-08-05 | 2014-10-09 | 44     | 7,619,452   | 51         | 0.22%

In the 2013 dataset there is an average of 160k transactions per day and about 304 frauds per day, while in the 2014 dataset there is a daily average of 173k transactions and 380 frauds.
EXPERIMENTS
Settings:
- We assume that after δ = 7 days all the transaction labels are provided (delayed supervised information).
- A budget of k = 100 alerts can be checked by the investigators (F_t is trained on a window of 700 feedbacks).
- A window of α = 16 days is used to train W^D_t (16 models in E^D_t).
- Each experiment is repeated 10 times and the performance is assessed using p_k.
In both the 2013 and 2014 datasets, the aggregations A^W_t and A^E_t outperform the other FDSs in terms of p_k.

Table: Average p_k over all batches for the sliding window
            Dataset 2013       Dataset 2014
classifier  mean    sd         mean    sd
F           0.609   0.250      0.596   0.249
W^D         0.540   0.227      0.549   0.253
W           0.563   0.233      0.559   0.256
A^W         0.697   0.212      0.657   0.236

Table: Average p_k over all batches for the ensemble
            Dataset 2013       Dataset 2014
classifier  mean    sd         mean    sd
F           0.603   0.258      0.596   0.271
E^D         0.459   0.237      0.443   0.242
E           0.555   0.239      0.516   0.252
A^E         0.683   0.220      0.634   0.239
Figure: Sum of ranks from the Friedman test [4] for (a) sliding window 2013, (b) sliding window 2014, (c) ensemble 2013, (d) ensemble 2014. Classifiers having the same letter are not significantly different (paired t-test based upon the ranks).
EXPERIMENTS ON ARTIFICIAL DATASETS WITH CD
In the second part we artificially introduce CD on specific days by juxtaposing transactions acquired at different times of the year.

Table: Datasets with artificially introduced CD
Id  | Start 2013 | End 2013   | Start 2014 | End 2014
CD1 | 2013-09-05 | 2013-09-30 | 2014-08-05 | 2014-08-31
CD2 | 2013-10-01 | 2013-10-31 | 2014-09-01 | 2014-09-30
CD3 | 2013-11-01 | 2013-11-30 | 2014-08-05 | 2014-08-31
Table: Average p_k in the month before and after CD for the sliding-window approach

(a) Before CD
            CD1               CD2               CD3
classifier  mean    sd        mean    sd        mean    sd
F           0.411   0.142     0.754   0.270     0.690   0.252
W^D         0.291   0.129     0.757   0.265     0.622   0.228
W           0.332   0.215     0.758   0.261     0.640   0.227
A^W         0.598   0.192     0.788   0.261     0.768   0.221

(b) After CD
            CD1               CD2               CD3
classifier  mean    sd        mean    sd        mean    sd
F           0.635   0.279     0.511   0.224     0.599   0.271
W^D         0.536   0.335     0.374   0.218     0.515   0.331
W           0.570   0.309     0.391   0.213     0.546   0.319
A^W         0.714   0.250     0.594   0.210     0.675   0.244
Figure: Average p_k per day (the higher the better) on the datasets with artificial concept drift, smoothed using a moving average of 15 days, for (e) sliding-window strategies on CD1, (f) sliding-window strategies on CD2, (g) sliding-window strategies on CD3, (h) ensemble strategies on CD3. The vertical bar denotes the date of the concept drift.
CONCLUDING REMARKS
We notice that:
- F_t outperforms the classifiers trained on delayed samples (which rely on obsolete couples).
- F_t outperforms the classifiers trained on the entire supervised dataset (which is dominated by delayed samples).
- Aggregation gives a larger influence to feedbacks.
CONCLUSION
- We formalise a real-world FDS framework that meets realistic working conditions.
- In a real-world scenario, there is a strong alert-feedback interaction that has to be explicitly considered.
- Feedbacks and delayed samples should be handled separately when training a FDS.
- Aggregating two distinct classifiers is an effective strategy and enables prompter adaptation in concept-drifting environments.
FUTURE WORK
Future work will focus on:
- Adaptive aggregation of F_t and the classifier trained on delayed samples.
- Studying the sample selection bias in F_t introduced by the alert-feedback interaction.
BIBLIOGRAPHY
[1] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland. Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3):602-613, 2011.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[3] C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced data. Technical report, University of California, Berkeley, 2004.
[4] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675-701, 1937.
[5] J. Gao, B. Ding, W. Fan, J. Han, and P. S. Yu. Classifying data streams with skewed class distributions and concept drifts. IEEE Internet Computing, 12(6):37-49, 2008.
[6] D. K. Tasoulis, N. M. Adams, and D. J. Hand. Unsupervised clustering in streaming data. In ICDM Workshops, pages 638-642, 2006.