Fraud Detection for Online Retail using Random Forests



Similar documents
Using Random Forest to Learn Imbalanced Data

Data Mining. Nonlinear Classification

Classification of Bad Accounts in Credit Card Industry

Random Forest Based Imbalanced Data Cleaning and Classification

Predicting borrowers chance of defaulting on credit loans

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit

The Artificial Prediction Market

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Leveraging Ensemble Models in SAS Enterprise Miner

CART 6.0 Feature Matrix

Homework Assignment 7

Using multiple models: Bagging, Boosting, Ensembles, Forests

Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms

Data Mining Practical Machine Learning Tools and Techniques

Gerry Hobbs, Department of Statistics, West Virginia University

Model Combination. 24 Novembre 2009

Insurance Analytics - data analysis and predictive modeling in the insurance industry. Pavel Kříž. Seminar on Actuarial Sciences, MFF 4.

Classification and Regression by randomForest

Knowledge Discovery and Data Mining

Easily Identify Your Best Customers

STATISTICA. Financial Institutions. Case Study: Credit Scoring.

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

Beating the MLB Moneyline

Machine Learning Final Project Spam Filtering

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Decision Support Systems

Decision Trees from large Databases: SLIQ

Comparison of Data Mining Techniques used for Financial Data Analysis

Benchmarking of different classes of models used for credit scoring

Selecting Data Mining Model for Web Advertising in Virtual Communities

Knowledge Discovery and Data Mining

Distributed forests for MapReduce-based machine learning

Example: Credit card default; we may be more interested in predicting the probability of a default than classifying individuals as default or not.

Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance

Azure Machine Learning, SQL Data Mining and R

Performance Measures for Machine Learning

E-commerce Transaction Anomaly Classification

Statistics in Retail Finance. Chapter 6: Behavioural models

Applied Multivariate Analysis - Big data analytics

Machine Learning Logistic Regression

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

The Predictive Data Mining Revolution in Scorecards:

An Autonomous Agent for Supply Chain Management

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

Random forest algorithm in big data environment

An effective approach to preventing application fraud. Experian Fraud Analytics

Chapter 6. The stacking ensemble approach

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

How To Make A Credit Risk Model For A Bank Account

Better credit models benefit us all

Event driven trading: new studies on an innovative way of trading in the Forex market. Michał Osmoła, INIME live, 23 February 2016

A Novel Classification Approach for C2C E-Commerce Fraud Detection

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

not possible or was possible at a high cost for collecting the data.

Multiple Linear Regression in Data Mining

Data Mining - Evaluation of Classifiers

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

arxiv: v1 [stat.ml] 21 Jan 2013

II. RELATED WORK. Sentiment Mining

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

Virtual Site Event. Predictive Analytics: What Managers Need to Know. Presented by: Paul Arnest, MS, MBA, PMP February 11, 2015

Pearson's Correlation Tests

Electronic Payment Fraud Detection Techniques

S The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY

Stock Market Forecasting Using Machine Learning Algorithms

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Numerical Algorithms Group

Advanced Ensemble Strategies for Polynomial Models

Data Mining Techniques Chapter 6: Decision Trees

Mining Life Insurance Data for Customer Attrition Analysis

Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems

Predictive modelling around the world

Evaluation & Validation: Credibility: Evaluating what has been learned

Dynamic Predictive Modeling in Claims Management - Is it a Game Changer?

Fast Analytics on Big Data with H2O

Lecture 10: Regression Trees

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Package trimtrees. February 20, 2015

Mining the Software Change Repository of a Legacy Telephony System

Data Mining Methods: Applications for Institutional Research

A Novel Ensemble Learning-based Approach for Click Fraud Detection in Mobile Advertising

Chapter 12 Bagging and Random Forests

IST 557 Final Project

Predictive Data modeling for health care: Comparative performance study of different prediction models

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ

Getting Even More Out of Ensemble Selection

Data Mining Classification: Decision Trees

Datamining. Gabriel Bacq CNAMTS

Social Media Mining. Data Mining Essentials

Hospital Billing Optimizer: Advanced Analytics Solution to Minimize Hospital Systems Revenue Leakage

Server Load Prediction

Beating the NCAA Football Point Spread

Transcription:

Fraud Detection for Online Retail using Random Forests

Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard

Abstract

As online commerce becomes more common, fraud is an increasingly important concern. Current anti-fraud techniques include blacklists, hand-crafted rules, and review of individual transactions by human experts. These methods have achieved good results, but a significant number of costly fraudulent transactions still occur, and the fraud detection process is expensive due to its heavy reliance on human experts. Our goal is to improve the quality of fraud detection and decrease the need for costly human analysis of individual transactions. We propose two modifications to the approval process for an online retailer, based on a Random Forests classifier.

1. Introduction

We have obtained data from two sources: an online retailer of computer equipment and peripherals, and Quova, a provider of geolocation data. The retail data consists of about 250,000 records of orders placed online over a 3-month span. Currently, when the retailer receives an order request, it decides whether to approve the order and ship the goods using the decision process illustrated in Figure 1. The flow consists of a series of decision nodes:

1. Pre-filter: rejects orders based on a blacklist of bad customers, bank rejections, and AVS (card-address mismatch); roughly 35,000 orders are rejected.

2. Send to human?: uses simple rules (such as "contains iPod") to decide whether to send an order to a human analyst for review; roughly 30,000 orders go to the Analyst and roughly 185,000 directly to Accept.

3. Analyst: an expert reviewer manually analyzes orders and rejects them based on the likelihood of fraud, credit default, or product return; roughly 4,500 orders are rejected.

All orders that are not rejected by the pre-filter or the human expert are approved. Of these, 444 were later reported as fraudulent by the credit card holder.

The current process contains two sources of expense for the retailer that could be decreased with the use of intelligent classifiers. First, at the Analyst node, human labor is used to review individual orders. This labor is both slow and expensive relative to automated systems. Second, of the orders that are passed to the Accept node, a certain fraction are fraudulent. Potentially, an intelligent classifier at this stage could significantly reduce the number of fraudulent orders approved while only slightly increasing the number of valid orders rejected. In Section 3, we discuss the tradeoff between a high rate of fraud detection and mislabeling non-fraud instances.

We therefore propose a new order review process for the retailer, illustrated in Figure 2. Essentially, we augment the existing system with two additional decision nodes powered by intelligent classifiers (a simplified routing sketch follows this list):

1. Analyst simulator: either rejects an order or sends it to a real human for review. As explained in Section 5, the analyst simulator does not accept orders outright. To build a prototype of this classifier, we train on the 28,729 items that were reviewed by a human.

2. Fraud classifier: as a final step before approval, each order passes through our custom fraud classifier. We build our model by training on a set of approximately 220,000 orders (containing 444 fraudulent transactions) that were approved by the current system. (In Section 4, we deal with the issue of balancing false positives and true positives.)
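To make the proposed flow concrete, the following is a minimal sketch of the routing logic of Figure 2. Every function name, threshold, and return convention here is a hypothetical placeholder of ours, not part of the retailer's actual system.

```python
# Minimal sketch of the proposed review flow (Figure 2).
# All callables and thresholds are hypothetical placeholders.

def review_order(order, pre_filter, needs_review, simulator, human, fraud_model,
                 simulator_reject_thresh=0.9, fraud_thresh=0.5):
    """Route one order through the augmented approval pipeline."""
    if pre_filter(order):
        # Blacklisted customer, bank rejection, or AVS card-address mismatch.
        return "reject"
    if needs_review(order):
        # Simple rules (e.g., "contains iPod") flag the order for review.
        if simulator(order) >= simulator_reject_thresh:
            # The analyst simulator only pre-rejects near-certain cases;
            # it never accepts outright (Section 5).
            return "reject"
        if human(order) == "reject":
            return "reject"
    if fraud_model(order) >= fraud_thresh:
        # Final automated fraud screen before approval (Section 4).
        return "reject"
    return "accept"
```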

[Figure 1. Current Infrastructure: Orders → Pre-filter → "Send to human?" (85.2% to Accept, 14.8% to the Human analyst) → Accept or Reject.]

[Figure 2. Proposed Infrastructure: the same flow, with a Human simulator inserted ahead of the human analyst and a Fraud predictor screening the orders that would otherwise be accepted (444 fraudulent, or 0.2%, versus 99.8% legitimate).]

2. Data sources and feature engineering

We obtained data from two primary sources. First, a major online retailer of computer equipment and peripherals supplied us with approximately 250,000 records of orders placed over a 3-month span in 2004, which identify the customer, the time and date of the order, the items ordered, and credit card and shipping information. These records also include the IP address from which the order was placed, allowing us to supplement the order data with geolocation and network information obtained from Quova, Inc. This allows us to determine, for each order, the continent, country, state/province, city, and postal code of the computer from which the order was placed. It also provides information about the local network connection (modem, cable, ISDN, DSL, mobile wireless, etc.) and the IP routing method (anonymizers, proxies, mobile gateways, satellites). Finally, it lets us determine information about the network to which the user is attached: the carrier's name (such as Stanford or Comcast), whether or not the user is on the AOL network (which geolocates all users to specific AOL office locations), and the domain name of the user's IP.

Much of our time investment was in data preparation and intelligent feature engineering, a process guided by our collaborators in the finance department at the retail company. They helped us identify characteristics typical of users placing fraudulent orders, who:

- Tend to be transient geographically
- Use non-standard Internet connections
- Exhibit unusual patterns in purchase timing
- Purchase items that are easily resellable and small in size

Based on these ideas, we extracted a total of 72 features (a few are sketched in code below), including:

- Billing/shipping address match
- Distances between the location of the network connection and the billing and shipping addresses
- Indicator variables for specific products
- Indicator variables for both "hot" (e.g., iPod, iPAQ, handheld computer) and "cold" (e.g., printers, cables) keywords occurring in item descriptions
- Item cost per weight
- Indicator variables for time-of-day and day-of-week
- Credit card history: number of recent orders, cumulative recent charges

In addition, we have some international orders, but probably too few to learn a separate probability of fraud for each country. Therefore, we omit country identity and instead substitute demographic and socioeconomic statistics obtained from the CIA World Factbook, such as the number of internet connections and telephones, average income, and unemployment rate.
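The sketch below illustrates how a handful of these features might be computed from a raw order record. The field names, keyword lists, and the distance helper are our own illustrative assumptions, not the retailer's actual schema.

```python
import math

HOT_WORDS = {"ipod", "ipaq", "handheld"}   # "hot" keywords from the paper
COLD_WORDS = {"printer", "cable"}          # "cold" keywords

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def extract_features(order):
    """Map one raw order record (hypothetical field names) to a feature dict.
    `order["order_time"]` is assumed to be a datetime.datetime."""
    desc = order["item_description"].lower()
    return {
        "addr_match": int(order["billing_zip"] == order["shipping_zip"]),
        "ip_to_billing_miles": haversine_miles(order["ip_lat"], order["ip_lon"],
                                               order["bill_lat"], order["bill_lon"]),
        "has_hot_item": int(any(w in desc for w in HOT_WORDS)),
        "has_cold_item": int(any(w in desc for w in COLD_WORDS)),
        "cost_per_weight": order["price"] / max(order["weight_lbs"], 0.1),
        "hour_of_day": order["order_time"].hour,
        "day_of_week": order["order_time"].weekday(),
    }
```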

3. Random Forests

Random Forests is an algorithm for classification and regression (Breiman 2001). Briefly, it is an ensemble of decision tree classifiers; the output of the Random Forest classifier is the majority vote among the set of tree classifiers. To train each tree, a subset of the full training set is sampled randomly. A decision tree is then built in the normal way, except that no pruning is done and each node splits on a feature selected from a random subset of the full feature set. Training is fast, even for large data sets with many features and data instances, because each tree is trained independently of the others. The Random Forest algorithm has been found to be resistant to overfitting, and it provides a good estimate of the generalization error (without having to do cross-validation) through the out-of-bag error rate that it returns.

Our data sets are quite imbalanced, which in general can lead to problems during the learning process. Several approaches have been proposed to deal with imbalance in the context of Random Forests, including resampling techniques and cost-based optimization (Chen et al. 2004). We take a different approach, using regression forests and classifying based on an adjustable threshold. By changing the threshold level, we create a set of classifiers, H, each of which has a different false positive (FP) and true positive (TP) rate. The tradeoff between the FP and TP rates is captured in the standard ROC curve (see, for example, Figures 5 and 6).

4. Experiments and results

For our experiments we worked with the OpenML (machine learning) library developed by Intel Corp. We specified the maximum number of features to be considered at each tree node as 4 and the out-of-bag sampling rate as 0.6. For both the analyst and fraud problems, we trained the Random Forest classifier on the first 80% of our dataset and used the remaining 20% for validation; a reconstruction of this setup with a modern library is sketched below. For each validation sample, the regression model returns a response y between 0 and 1, one extreme indicating the positive class and the other the negative class. In Figure 3, we show a histogram of these responses for the analyst problem. Note that about half of the rejected samples are easy to classify.

[Figure 3: Response distribution for the analyst problem. Annotation: "Easy to classify."]

[Figure 4: Response distribution for the fraud problem. Annotation: "Almost all non-fraud cases are easily separated."]

For the fraud problem, the response histogram (Figure 4) looks quite different: non-fraud cases appear to be highly distinguishable.
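As a rough reconstruction of this setup, the sketch below uses scikit-learn as a stand-in for the Intel library the authors used, and synthetic data in place of the proprietary order records. The hyperparameters mirror the values quoted above (4 features per split, 0.6 per-tree sampling rate); everything else is our assumption.

```python
# Regression forest + adjustable threshold, traced into an ROC curve.
# scikit-learn stands in for the original library; data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
n, n_features = 20_000, 72                     # 72 features, as in Section 2
X = rng.normal(size=(n, n_features))
y = np.zeros(n)
pos = rng.choice(n, size=40, replace=False)    # ~0.2% positive (fraud) rate
y[pos] = 1.0
X[pos] += 0.5                                  # give fraud cases a weak signal

split = int(0.8 * n)                           # first 80% train, last 20% validate
forest = RandomForestRegressor(
    n_estimators=100,
    max_features=4,      # max features considered per split, as in Section 4
    max_samples=0.6,     # per-tree sampling rate, as in Section 4
    bootstrap=True,
    random_state=0,
).fit(X[:split], y[:split])

# The forest returns a continuous response in [0, 1]; thresholding it at
# different levels yields the family of classifiers H from Section 3.
scores = forest.predict(X[split:])
fpr, tpr, thresholds = roc_curve(y[split:], scores)
```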

Since fraud cases are very rare (only 444 out of roughly 220,000), we have to be careful about picking operating points on the ROC curve (Figure 6). If we zoom in near FP = 0, we see that the false positive rate is not actually zero, and this has consequences.

5. Optimizing the decision threshold

To select an appropriate threshold value and, in turn, a classifier, we must specify some objective for the problem. Here, our success can be measured by the net change in expected profit. In general, if we can assign a cost to each false positive and each false negative, it is possible to choose among the set H of classifiers described above. Letting p_0 and p_1 be the prior probabilities for classes 0 and 1, and c_0 and c_1 the respective misclassification costs, our objective is the expected misclassification cost

    f(FP) = p_0 c_0 FP + p_1 c_1 (1 - TP) = p_0 c_0 FP + p_1 c_1 (1 - g(FP)),

where g(·) specifies the ROC curve, i.e., TP = g(FP). Differentiating,

    f'(FP) = p_0 c_0 - p_1 c_1 g'(FP),

and setting this to zero, we get

    g'(FP) = (p_0 c_0) / (p_1 c_1).

The optimal classifier corresponds to the point on the ROC curve where the slope is equal to a ratio involving the prior probabilities of the two classes and the two misclassification costs.

The analyst simulator problem

For the analyst simulator problem, the difficulty lies in assigning costs to the two types of classification error. Since this classification problem occurs far upstream in the retailer's approval process, the costs and benefits of each type of error can only be accurately determined by a controlled experiment; that is, by approving certain orders that would normally be rejected and observing the outcome. If we knew the cost of human analysis (per order reviewed) and the cost of a false positive (the profit we would have made on the transaction), it would be straightforward to trade off the threshold parameter and find the policy that maximizes profit, as described above. Notice, however (see Figure 5), that a moderate TP rate can be achieved while keeping the FP rate close to zero. This means we can easily choose a decision boundary that reliably pre-rejects a sizeable portion of the samples without the intervention of a human analyst.

[Figure 5: ROC for the analyst problem. Annotation: "Safe threshold."]

We propose, as a conservative policy, to pre-reject only those cases for which we are virtually certain there will be no false positives. From the ROC plot, this corresponds to about 0.4 on the TP axis. Taking the prior probability of rejection (14.8%) into account, we can expect to pre-filter 0.4 × 0.148 ≈ 6% of the samples sent to the analyst.

The fraud prediction problem

Suppose we wanted to detect 60% of the fraudulent cases. This would incur a 0.2% FP rate. Taking the prior probability into account, we can expect to reject 267 fraudulent transactions at the expense of also rejecting 426 legitimate ones (over the entire data set).
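To make the selection rule concrete, here is a small sketch that picks the cost-minimizing operating point directly from an empirical ROC curve rather than via the slope condition (the two agree for a concave ROC). The priors and dollar costs are invented placeholders, not the retailer's figures.

```python
import numpy as np

def optimal_operating_point(fpr, tpr, thresholds, p1, c0, c1):
    """Minimize f = p0*c0*FP + p1*c1*(1 - TP) over the ROC points.

    p1: prior probability of fraud (class 1)
    c0: cost of rejecting a legitimate order (lost profit margin)
    c1: cost of approving a fraudulent order (full item price)
    """
    p0 = 1.0 - p1
    cost = p0 * c0 * fpr + p1 * c1 * (1.0 - tpr)
    i = int(np.argmin(cost))
    return thresholds[i], fpr[i], tpr[i]

# Toy ROC points (FP, TP) at decreasing thresholds, with placeholder
# economics: 0.2% fraud prior, $40 margin lost per false rejection,
# $400 average loss per approved fraud.
thresholds = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
fpr = np.array([0.000, 0.002, 0.010, 0.050, 0.200])
tpr = np.array([0.30, 0.60, 0.75, 0.85, 0.95])
print(optimal_operating_point(fpr, tpr, thresholds, p1=0.002, c0=40.0, c1=400.0))
# -> (0.7, 0.002, 0.6): the (FP = 0.2%, TP = 60%) point discussed above.
```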

[Figure 6: ROC for the fraud problem, with the operating point (FP = 0.2%, TP = 60%) marked.]

Generally, allowing a fraudulent purchase costs the retailer the full price of the item, while denying a legitimate purchase only results in a loss equal to the profit margin. So in the example above, our classifier will save the retailer money under the reasonable assumption that the profit margin on items is less than 50%. As in the analyst case, knowing the exact profit margin would allow us to design a more aggressive policy. The retailer simply needs to pinpoint the average cost of a fraudulent order and the average profit margin on items sold. Then, using the procedure described above, we pick the point on the ROC curve that optimizes the expected misclassification cost.

6. Discussion

Cascades of classifiers are a popular method for rare event detection (see, for instance, Viola and Jones 2001; Wu et al. 2005). The main assumption is that we can train n independent classifiers, each with a very high (say, 0.999) TP rate and a moderate (say, 0.5) FP rate. Then, by requiring all n classifiers to agree on a positive detection, we can reduce the FP rate to 0.5^n while retaining a TP rate of 0.999^n. For n = 10, say, we would obtain a TP rate of 0.999^10 ≈ 0.99 and an FP rate of 0.5^10 ≈ 0.001.

One popular application of cascades is the face detection problem in image processing. In face detection, non-face images are quite diverse. This makes it easy to enforce independence of the cascade nodes: all we have to do is design a long chain of classifiers, each of which specializes in rejecting a particular type of non-face object with a near-perfect TP rate.

The question is: can we always enforce the independence of the classifiers? For the fraud problem, we suspect not. Although it is a rare event detection problem, it is fundamentally different from face detection. As we have seen from the ROC, most non-fraud cases look similar and are easily identified. Therefore, training on subsets of the non-fraud data would not generate independent classifiers, and we would not reap the benefit of exponentially decreasing errors.

A standard cascade will not work for the analyst problem either; the ROC plot shows it is not possible to obtain the needed high TP and moderate FP rates. We do, however, have points on the ROC with very low FP rates. We believe it should be possible to build an inverted cascade that exploits this ROC shape by requiring all classifiers in the cascade to agree on a negative detection.

We would like to thank Dr. Gary Bradski at Intel Research for his valuable help and advice.

7. References

Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32.

Chen, Chao, Andy Liaw, and Leo Breiman (2004). "Using Random Forest to Learn Imbalanced Data".

Viola, Paul A., and Michael J. Jones (2001). "Rapid Object Detection using a Boosted Cascade of Simple Features". CVPR, 1:511-518.

Wu, Jianxin, Matthew D. Mullin, and James M. Rehg (2005). "Linear Asymmetric Classifier for Cascade Detectors". ICML, pages 993-1000.