Fraud Detection for Online Retail using Random Forests
Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard

Abstract

As online commerce becomes more common, fraud is an increasingly important concern. Current anti-fraud techniques include blacklists, hand-crafted rules, and review of individual transactions by human experts. These methods have achieved good results, but a significant number of costly fraudulent transactions still occur, and the fraud detection process is expensive due to its heavy reliance on human experts. Our goal is to improve the quality of fraud detection and decrease the need for costly human analysis of individual transactions. We propose two modifications to the approval process of an online retailer, both based on a Random Forests classifier.

1. Introduction

We have obtained data from two sources: an online retailer of computer equipment and peripherals, and Quova, a provider of geolocation data. The retail data consist of records of orders placed online over a three-month span. Currently, when the retailer receives an order request, it decides whether to approve the order and ship the goods using the decision process illustrated in Figure 1. The flow consists of a series of decision nodes:

1. Pre-filter: rejects orders that match a blacklist of bad customers, bank rejections, or AVS (card-address mismatch) failures.
2. Send to human: uses simple rules (such as "contains ipod") to decide whether to send the order to a human analyst for review or to accept it directly.
3. Analyst: an expert reviewer manually analyzes each order and rejects it based on the likelihood of fraud, credit default, or product return.

All orders that are not rejected by the pre-filter or the human expert are approved. Of these, 444 were later reported as fraudulent by the credit card holder.

The current process contains two sources of expense for the retailer that could be reduced with intelligent classifiers. First, at the Analyst node, human labor is used to review individual orders; this labor is both slow and expensive relative to automated systems. Second, a fraction of the orders passed to the Accept node are fraudulent. An intelligent classifier at this stage could significantly reduce the number of fraudulent orders approved while only slightly increasing the number of valid orders rejected. In Section 3, we discuss the tradeoff between a high rate of fraud detection and mislabeling non-fraud instances.

We therefore propose a new order review process for the retailer, illustrated in Figure 2. Essentially, we augment the existing system with two additional decision nodes powered by intelligent classifiers (a minimal sketch of the combined flow appears after this list):

1. Analyst simulator: either rejects an order or sends it to a real human for review. As explained in Section 5, the analyst simulator does not accept orders outright. To build a prototype for this classifier, we train on the 28,729 items that were reviewed by a human.
2. Fraud classifier: as a final step before approval, each order passes through our custom fraud classifier. We build this model by training on the set of orders approved by the current system, which contains 444 fraudulent transactions. (In Section 4, we deal with the issue of balancing the FP and TP rates.)
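To make the proposed flow in Figure 2 concrete, here is a minimal sketch of the augmented approval logic, assuming each automated node exposes a reject-likelihood score in [0, 1]. All function names, parameters, and thresholds (prefilter_rejects, analyst_simulator, simulator_reject_threshold, and so on) are hypothetical illustrations, not the retailer's actual system.

```python
# Sketch of the proposed approval flow (Figure 2). All names and thresholds
# are illustrative; the classifiers are passed in as callables.

def review_order(order, prefilter_rejects, needs_human_review,
                 analyst_simulator, human_analyst, fraud_classifier,
                 simulator_reject_threshold=0.9, fraud_reject_threshold=0.5):
    """Return 'reject' or 'accept' for a single order."""
    # 1. Existing pre-filter: blacklist, bank rejections, AVS mismatch.
    if prefilter_rejects(order):
        return "reject"

    # 2. Existing rule-based routing to human review.
    if needs_human_review(order):
        # 2a. New node: the analyst simulator pre-rejects only the cases it is
        # virtually certain about; everything else still goes to a real analyst.
        if analyst_simulator(order) >= simulator_reject_threshold:
            return "reject"
        if human_analyst(order) == "reject":
            return "reject"

    # 3. New node: a final automated fraud check before acceptance.
    if fraud_classifier(order) >= fraud_reject_threshold:
        return "reject"
    return "accept"

# Toy demonstration with stub components.
print(review_order(
    order={"total": 999.0},
    prefilter_rejects=lambda o: False,
    needs_human_review=lambda o: True,
    analyst_simulator=lambda o: 0.95,
    human_analyst=lambda o: "accept",
    fraud_classifier=lambda o: 0.1,
))  # -> reject (pre-rejected by the analyst simulator)
```

In practice, the two thresholds would be chosen with the ROC-based analysis of Sections 4 and 5.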
[Figure 1: Current infrastructure (Orders → Pre-filter → Send to human? → Human analyst → Accept/Reject). Figure 2: Proposed infrastructure (the same flow augmented with a human-analyst simulator and a final fraud predictor before Accept).]

2. Data sources and feature engineering

We obtained data from two primary sources. First, a major online retailer of computer equipment and peripherals supplied us with records of orders placed over a three-month span in 2004, which identify the customer, the time and date of the order, the items ordered, and credit card and shipping information. These records also include the IP address from which the order was placed, allowing us to supplement the order data with geolocation and network information obtained from Quova, Inc. This allows us to determine, for each order, the continent, country, state/province, city, and postal code of the computer from which the order was placed. It also provides information about the local network connection (modem, cable, ISDN, DSL, mobile wireless, etc.) and the IP routing method (anonymizers, proxies, mobile gateways, satellites). Finally, it lets us determine information about the network to which the user is attached: the carrier's name (such as Stanford or Comcast), whether or not the user is on the AOL network (which geolocates all users to specific AOL office locations), and the domain name of the user's IP.

Much of our time investment went into data preparation and intelligent feature engineering, a process guided by our collaborators in the finance department at the retail company. They helped us identify characteristics typical of users placing fraudulent orders, who:

- Tend to be transient geographically
- Use non-standard Internet connections
- Exhibit unique patterns in purchase timing
- Purchase items that are easily resellable and small in size

Based on these ideas, we extracted a total of 72 features (a sketch of a few of them follows this section), including:

- Billing/shipping address match
- Distances between the location of the network connection and the billing and shipping addresses
- Indicator variables for specific products
- Indicator variables for both "hot" (e.g., ipod, ipaq, handheld computer) and "cold" (e.g., printers, cables) keywords occurring in item descriptions
- Item cost per weight
- Indicator variables for time-of-day and day-of-week
- Credit card history: number of recent orders, cumulative recent charges

In addition, we have some international orders, but probably too few to learn a reliable per-country probability of fraud. Therefore, we omit country identity and instead substitute demographic and socioeconomic statistics obtained from the CIA World Factbook, such as the number of internet connections and telephones, average income, and unemployment rate.
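As an illustration of the kind of feature engineering described above, the sketch below computes a handful of the 72 features from a single order record. The field names (billing_zip, ip_lat, item_descriptions, and so on) and the keyword lists are assumptions made for the example; the actual retailer and Quova records may be structured differently.

```python
import math

# Illustrative feature extraction for a single order record.

HOT_KEYWORDS = {"ipod", "ipaq", "handheld"}      # easily resold, small items
COLD_KEYWORDS = {"printer", "cable"}             # rarely targeted by fraudsters

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def extract_features(order):
    text = " ".join(order["item_descriptions"]).lower()
    return {
        "billing_shipping_match": int(order["billing_zip"] == order["shipping_zip"]),
        "ip_to_billing_km": haversine_km(order["ip_lat"], order["ip_lon"],
                                         order["billing_lat"], order["billing_lon"]),
        "has_hot_keyword": int(any(k in text for k in HOT_KEYWORDS)),
        "has_cold_keyword": int(any(k in text for k in COLD_KEYWORDS)),
        "cost_per_weight": order["total_cost"] / max(order["total_weight_kg"], 1e-6),
        "ordered_at_night": int(order["hour_of_day"] < 6),
    }

example_order = {
    "billing_zip": "94305", "shipping_zip": "10001",
    "ip_lat": 37.42, "ip_lon": -122.17,
    "billing_lat": 40.75, "billing_lon": -73.99,
    "item_descriptions": ["Apple iPod 20GB"],
    "total_cost": 299.0, "total_weight_kg": 0.2,
    "hour_of_day": 3,
}
print(extract_features(example_order))
```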
3. Random Forests

Random Forests is an algorithm for classification and regression (Breiman 2001). Briefly, it is an ensemble of decision tree classifiers; the output of the Random Forest is the majority vote among the set of tree classifiers. To train each tree, a subset of the full training set is sampled randomly. A decision tree is then built in the normal way, except that no pruning is done and each node splits on a feature selected from a random subset of the full feature set. Training is fast, even for large data sets with many features and instances, because each tree is trained independently of the others. The Random Forest algorithm has been found to be resistant to overfitting, and it provides a good estimate of the generalization error (without requiring cross-validation) through the out-of-bag error rate that it returns.

Our data sets are quite imbalanced, which in general can lead to problems during the learning process. Several approaches have been proposed to deal with imbalance in the context of Random Forests, including resampling techniques and cost-based optimization (Chen et al. 2004). We take a different approach: we use regression forests and classify based on an adjustable threshold. By changing the threshold level, we create a set of classifiers, H, each of which has a different false positive (FP) and true positive (TP) rate. The tradeoff between the FP and TP rates is captured in the standard ROC curve (see, for example, Figures 5 and 6).

4. Experiments and results

For our experiments we worked with the OpenML (Machine Learning) library developed by Intel Corp. We set the maximum number of features considered at each tree node to 4 and the out-of-bag sampling rate to 0.6. For both the analyst and fraud problems, we trained the Random Forest classifier on the first 80% of our dataset and used the remaining 20% for validation. For each validation sample, the regression model returns a response y between 1 (indicating the positive class) and 0 (the negative class). A code sketch of this regression-plus-threshold setup appears at the end of this section.

In Figure 3, we show a histogram of these responses for the analyst problem. Note that about half of the rejected samples are easy to classify.

[Figure 3: Response distribution for the analyst problem; a large group of rejected samples is easy to classify.]

For the fraud problem, the response histogram (Figure 4) looks quite different: non-fraud cases appear to be highly distinguishable, and almost all of them are easily separated.

[Figure 4: Response distribution for the fraud problem; almost all non-fraud cases are easily separated.]
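The experiments above used Intel's OpenML library; as a stand-in, the sketch below reproduces the same setup with scikit-learn (our substitution, not the authors' tool): a regression forest with 4 candidate features per split and roughly 60% of the data per tree, trained on the first 80% of the data, with the held-out responses converted into an ROC curve by sweeping a threshold. The feature matrix and labels here are synthetic placeholders standing in for the engineered features of Section 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_curve

# Synthetic placeholder data: replace with the engineered feature matrix X
# and binary labels y (1 = positive class, 0 = negative class).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 72))             # 72 features, as in Section 2
y = (rng.random(2000) < 0.05).astype(int)   # rare positive class

# Chronological 80% / 20% split, as described in Section 4.
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Regression forest: 4 candidate features per split, ~60% of the data per tree.
forest = RandomForestRegressor(
    n_estimators=200,
    max_features=4,
    max_samples=0.6,     # per-tree bootstrap fraction (scikit-learn >= 0.22)
    oob_score=True,
    random_state=0,
)
forest.fit(X_train, y_train)

# Continuous responses in [0, 1]; thresholding them yields the family of
# classifiers H whose FP/TP tradeoffs trace out the ROC curve.
responses = forest.predict(X_val)
fpr, tpr, thresholds = roc_curve(y_val, responses)
print("out-of-bag score:", forest.oob_score_)
```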
Since fraud cases are very rare (only 444 in the entire set of approved orders), we have to be careful about picking operating points on the ROC curve (Figure 6). If we zoom in near FP = 0, we see that the false positive probability is not actually zero, and this has consequences.

5. Optimizing the decision threshold

To select an appropriate threshold value and, in turn, a classifier, we must specify an objective for the problem. Here, our success can be measured by the net change in expected profit. In general, if we can assign a cost to each false positive and to each missed detection (false negative), it is possible to choose among the set H of classifiers described above. Let p_0 and p_1 be the prior probabilities of the negative and positive classes, let c_0 and c_1 be the respective misclassification costs, and let g(·) describe the ROC curve, so that TP = g(FP). The expected misclassification cost is then

    f(FP) = p_0 c_0 FP + p_1 c_1 (1 - g(FP)).

Differentiating with respect to FP and setting the derivative to zero,

    f'(FP) = p_0 c_0 - p_1 c_1 g'(FP) = 0,

gives

    g'(FP) = (p_0 c_0) / (p_1 c_1).

The optimal classifier corresponds to the point on the ROC curve where the slope equals a ratio involving the prior probabilities of the two classes and the two misclassification costs.

The analyst simulator problem

For the analyst simulator problem, the difficulty lies in assigning costs to the two types of classification error. Since this classification problem occurs far upstream in the retailer's approval process, the costs and benefits of each type of error can only be accurately determined by doing a controlled experiment; that is, by approving certain orders that would normally be rejected and observing the outcome. If we knew the cost of human analysis (per order reviewed) and the cost of a false positive (the profit we would have made from the transaction), it would be straightforward to tune the threshold parameter and find the policy that maximizes profit, as described above. Notice, however (see Figure 5), that a moderate TP rate can be achieved while keeping the FP rate close to zero. This means that we can easily choose a decision boundary that reliably pre-rejects a sizeable portion of the samples without the intervention of a human analyst.

[Figure 5: ROC curve for the analyst problem, with a safe threshold marked in the near-zero FP region.]

We propose, as a conservative policy, to pre-reject only those cases for which we are virtually certain there will be no false positives. From the ROC plot, this corresponds to roughly 0.4 on the TP axis. Taking the prior probability of rejection into account, we can expect to pre-filter a meaningful fraction of the samples currently sent to the analyst.

The fraud prediction problem

Suppose we wanted to detect 60% of the fraudulent cases. This would cause a 0.2% FP rate. Taking the prior probabilities into account, we can expect to reject 267 fraudulent transactions at the expense of also rejecting 426 legitimate ones (over the entire data set). A short numerical sketch of this threshold selection follows below.

[Figure 6: ROC curve for the fraud problem, with the operating point (0.2, 0.6) marked.]
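To illustrate the threshold-selection rule derived above, the sketch below picks the ROC operating point that minimizes the expected misclassification cost f(FP) = p_0 c_0 FP + p_1 c_1 (1 - TP), which is equivalent to finding the point where the ROC slope equals (p_0 c_0)/(p_1 c_1). The priors, costs, and toy ROC points are illustrative placeholders loosely patterned on the fraud example, not the retailer's actual numbers.

```python
import numpy as np

def optimal_operating_point(fpr, tpr, thresholds, p0, p1, c0, c1):
    """Pick the ROC point minimizing expected cost p0*c0*FP + p1*c1*(1 - TP).

    fpr, tpr, thresholds: arrays describing the ROC curve (e.g., from roc_curve).
    p0, p1: prior probabilities of the negative and positive classes.
    c0, c1: cost of a false positive and of a missed positive, respectively.
    """
    expected_cost = p0 * c0 * fpr + p1 * c1 * (1.0 - tpr)
    best = int(np.argmin(expected_cost))
    return thresholds[best], fpr[best], tpr[best], expected_cost[best]

# Illustrative numbers only: ~2% fraud prior; a missed fraud costs the full item
# price (normalized to 1.0); a rejected legitimate order costs the profit margin.
p1, p0 = 0.02, 0.98
c1, c0 = 1.0, 0.3

# Toy ROC points standing in for the output of the Section 4 sketch.
fpr = np.array([0.0, 0.002, 0.02, 0.1, 1.0])
tpr = np.array([0.0, 0.6, 0.8, 0.95, 1.0])
thresholds = np.array([1.0, 0.8, 0.5, 0.2, 0.0])

print(optimal_operating_point(fpr, tpr, thresholds, p0, p1, c0, c1))
# -> selects the point near (FP = 0.2%, TP = 60%) for these example costs
```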
Generally, allowing a fraudulent purchase costs the retailer the full price of the item, while denying a legitimate purchase only results in a loss equal to the profit margin. So in the example above, our classifier will save the retailer money under the reasonable assumption that the profit margin on items is less than 50%. As in the analyst case, knowing the exact profit margin would allow us to design a more aggressive policy. The retailer simply needs to pinpoint the average cost of a fraudulent order and the average profit margin on items sold; then, using the procedure described above, we pick the point on the ROC curve that optimizes the expected misclassification cost.

6. Discussion

Cascades of classifiers are a popular method for dealing with rare event detection (see, for instance, Viola and Jones 2001; Wu et al. 2005). The main assumption is that we can train n independent classifiers, all with a very high (say, 0.999) TP rate and a medium (say, 0.5) FP rate. Then, by requiring all n classifiers to agree on a positive detection, we can reduce the FP rate to 0.5^n while retaining a TP rate of 0.999^n. For n = 10, say, we would obtain very good rates: a TP rate of about 0.99 and an FP rate of about 0.001. (A short numerical sketch of this cascade arithmetic appears at the end of this report.)

One popular application of cascades is the face detection problem in image processing. In face detection, non-face images are quite diverse, which makes it easy to enforce independence of the cascade nodes: all we have to do is design a long chain of classifiers, each of which specializes in rejecting a particular type of non-face object with a near-perfect TP rate.

The question is: can we always enforce the independence of the classifiers? For the fraud problem, we suspect not. Although it is a rare event detection problem, it is fundamentally different from the face detection problem. As we have seen from the ROC curve, most non-fraud cases look similar and are easily identified. Therefore, training on subsets of the non-fraud data would not generate independent classifiers, and we would not reap the benefit of exponentially decreasing errors.

A standard cascade will not work for the analyst problem either: the ROC plot shows that it is not possible to obtain the required combination of a high TP rate and a moderate FP rate. We do, however, have points on the ROC curve with very low FP rates. We believe it should be possible to build an inverted cascade that exploits this ROC shape by requiring all classifiers in the cascade to agree on a negative detection.

We would like to thank Dr. Gary Bradski at Intel Research for his valuable help and advice.

7. References

Breiman, Leo (2001). "Random Forests." Machine Learning 45(1), 5-32.

Chen, Chao, Andy Liaw, and Leo Breiman (2004). "Using Random Forest to Learn Imbalanced Data." Technical report, Department of Statistics, University of California, Berkeley.

Viola, Paul A., and Michael J. Jones (2001). "Rapid Object Detection using a Boosted Cascade of Simple Features." CVPR, vol. 1, 511-518.

Wu, Jianxin, Matthew D. Mullin, and James M. Rehg (2005). "Linear Asymmetric Classifier for Cascade Detectors." ICML.
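As a closing illustration of the cascade arithmetic discussed in Section 6, the sketch below computes the combined TP and FP rates of a standard cascade, in which all n independent stages must agree on a positive detection, and of the proposed inverted cascade, in which all stages must agree on a negative detection. The 0.999/0.5 per-stage rates for the standard cascade come from the text; the per-stage rates for the inverted cascade, and the stage-independence assumption itself, are assumptions made for the example.

```python
def cascade_rates(stage_tp, stage_fp, n):
    """All n independent stages must agree on a POSITIVE detection."""
    return stage_tp ** n, stage_fp ** n

def inverted_cascade_rates(stage_tn, stage_fn, n):
    """All n independent stages must agree on a NEGATIVE detection.

    A sample is called positive unless every stage says negative, so the
    combined false negative rate is stage_fn**n and the combined true
    negative rate is stage_tn**n.
    """
    return 1.0 - stage_fn ** n, 1.0 - stage_tn ** n   # (TP rate, FP rate)

tp, fp = cascade_rates(stage_tp=0.999, stage_fp=0.5, n=10)
print(f"standard cascade, n=10: TP ~ {tp:.3f}, FP ~ {fp:.4f}")
# -> TP ~ 0.990, FP ~ 0.0010

tp_i, fp_i = inverted_cascade_rates(stage_tn=0.999, stage_fn=0.5, n=10)
print(f"inverted cascade, n=10: TP ~ {tp_i:.3f}, FP ~ {fp_i:.3f}")
# -> TP ~ 0.999, FP ~ 0.010
```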