Random Forest Based Imbalanced Data Cleaning and Classification

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Random Forest Based Imbalanced Data Cleaning and Classification"

Transcription

1 Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem of learning from extremely imbalanced data set. In this paper, we propose a combination of random forest based techniques and sampling methods to identify the potential buyers. Our methods is mainly composed of two phases: data cleaning and classification, both based on random forest. Firstly, the data set is cleaned by the elimination of dangerous negative instances. The data cleaning process is supervised by a negative biased random forests, where the negative instances make a major proportion of the training data in each of the tree in the forest. Secondly, we train a variant of random forest in which each tree is biased towards the positive class to classify the data set, where a major vote is made for prediction. We compared our methods with many other existing methods and showed its favorable performance improvement in terms of the area under the ROC. At last, we provide discussion on what business insights can be interpreted from the scoring model results. 1 Introduction to the Task The company currently has a customer base of credit card customers as well as a customer base of home loan (mortgage) customers. The company would like to make use of this opportunity to cross-sell home loans to its credit card customers, but the small size of the overlap presents a challenge when trying to develop a effective scoring model to predict potential cross-sell take-ups. A modeling dataset of 40, 700 customers with 40 modeling variables (as of the point of application for the company s credit card), plus a target variable, will be provided to the participants. This is a sample of customers who opened a new credit card with the company within a specific 2-year period and who did not have an existing home loan with the company. The target categorical variable Target Flag will have a value of 1 if the customer then opened a home loan with the company within 12 months after opening the credit card (700 random samples), and will have a value of 0 if otherwise (40,000 random samples). A prediction dataset (8, 000 sampled cases) will also be provided to the participants with similar variables but withholding the target variable. The data mining task is a to produce a score for each customer in the prediction dataset, indicating a credit card customer s propensity to take up a home loan with the company (the higher the score, the higher the propensity).

2 The following of this paper is organized as follows: in Section.2 we briefly review the problem of learning from imbalanced data and the method of random forest. The data cleaning process aimed to improve the prediction accuracy is described in Section.3.1. In Section.3.2, we develop a variant of the traditional random forest[1] to classify and ranking the potential buyers. In Section.4, we discuss about what business insights can be interpreted from the scoring model results. 2 Learning from Imbalanced Data and Random Forest Many real-world data sets exhibit skewed class distributions in which almost all cases are allotted to one or more larger classes and far fewer cases allotted for a smaller, usually more interesting class. In the given training data set of PAKDD 2007 data mining competition, only 1.7% instances correspond to the positive class, i.e, customers who opened a home loan with the company within 12 months after opening the credit card; the remaining majority of the instances belong to the negative class. A number of solutions to the class-imbalance problem were previously proposed both at the data and algorithmic levels[2]. At the data level, these solutions are based on many different forms of re-sampling techniques including under-sampling the majority or over-sampling the minority to balance the class distribution. At the algorithmic level, the frequently used methods includes costsensitive classification[3], threshold moving, recognition-based learning[4], etc. Random forest[1] is an ensemble of unpruned classification or regression trees, trained from bootstrap samples of the training data, using random feature selection in the tree induction process. The classification is made through a majority vote which take all the decision of the trees into consideration. Random forest shows important performance improvement over the single tree classifiers and many other machine learning techniques. However, random forest also suffers from the class imbalance when learning from the data set. In this paper, we made significant modification to the basic random forest algorithm to tackle the problem of learning form imbalanced data set. 3 Proposed Solutions Our method is mainly composed of two phases: data cleaning and classification(ranking). In the data cleaning step, the majority are preprocessed to eliminate the instances which may cause degradation of prediction performance. In the classification(ranking) step, we build the model to produce a score for each of the instances in the given prediction dataset. 3.1 Data Cleaning The degradation of performance in many standard classifiers is not only due to the imbalance of class distribution, but also the class overlapping caused by class

3 (a) Class overlapping exists before (b) Well-defined class boundary Data Cleaning emerges after Data Cleaning Fig. 1. Effect of Data Cleaning imbalance. To better understand this problem, imagine the situation illustrated in Figure.1. Figure.1(a) represents the original data set without any preprocess. The circle in red and black represent the positive class and the negative class respectively, where an obvious class imbalance exists. Note that in Figure.1(a) there are several negative instances in the region dominated by the positive instances, which presents some degree of class overlapping. These negative instances are considered to be dangerous since it is quite possible for any model trained from this data set to misclassify many positive instances as negative. For cost-sensitive related problems, such as the task described in Section.1, this issue is even more detrimental. Then a natural requirement is to eliminate the dangerous negative instances, i.e., the black circles in the red region in Figure.1(a). The data set after cleaning, where a well-defined class boundary exists, is represented in Figure.1(b). There are many existing techniques designed for this data cleaning task, including Tomek links[7], Condensed Nearest Neighbor Rule[6], etc. The main defect of these methods lies in their strong dependency on a distance function defined for the data set. However, the most effective form of the distance function can only be expressed in the context of a particular data domain. It is also often a challenging and non-trivial task to find the most effective form of the distance function for a particular data set[8]. The data cleaning method proposed in this paper does not employ the utilization of any distance functions and is more straightforward. In this method, the data are divided to be the minor instances set P and the major instances

4 set N. Then N is further divided into n subsets {N 1, N 1,..., N n } with each N i having approximately the same size. For each N i, we train a random forest RF from the rest instances in N and the entire P. The trick is that for every classification tree in the forest, the class distribution in the corresponding training data is not balanced, i.e., more negative instances than positive instances. Then remove all instances in N i that are incorrectly classified by RF. The rationale behind this data cleaning method is quite simple. Most standard learning algorithms assume that maximizing accuracy on a full range of cases is the goal and, therefore, these systems exhibit accurate prediction for the majority class cases, but very poor performance for minority. For such negative-biased classification model, if a negative instance is misclassified as positive, it is reasonable to consider it as dangerous since it must be highly similar with some certain positive instances and thus is responsible for class overlapping. Elimination of these dangerous negative instances from the training data will potentially reduce the false negative rate of the model trained from it. Details of this method is described in Algorithm.1. Algorithm 1 Data Cleaning: Eliminate Dangerous Negative Instances Input: majority instance set N, minority instance set P, number of subset n, number of trees in the forest l tree, threshold ɛ (0, 1) Output:clean data set D where dangerous negative instances are eliminated from D according to the given parameter ɛ 1: Divide N randomly into n subsets {N 1, N 2,..., N n}, i, j [1, 2,..., n], N i = N j, N i N j=ø if i j. 2: i 0 3: repeat 4: i i + 1 5: N S = n N j N i j=1 6: for m = 1 to l tree do 7: Randomly sample a subset P sub from P and a subset N sub from N, where P sub N sub 8: Learning a classification tree T m form N sub P sub 9: end for 10: for each instance X n N i do 11: if more than ɛ l tree tree classifiers classify X n as positive then 12: Eliminate X n from N i 13: end if 14: end for 15: until i=n S 16: N n out= N j j=1 17: Return N out as the output

5 3.2 Classification and Ranking In this subsection, we introduce the method developed to rank the given customers in the prediction data set to indicate their possibility of opening the home loan. The basic model is similar with the random forest used in the data cleaning process. The only difference is that the class distribution of the training data for each tree in the forest has been reversed, i.e., more positive instances than negative instances. Several random forests will be obtained in this way. Instances in the prediction data set will be classified by the trained random forests in an iterative way. For each instance, the number of trees in the forest that classify it as positive are assigned to it as its score. Any instance that receive a score lower than a given threshold will be excluded from the next iteration. At last, all the instances are ranked according to the sum of their score received in all the iterations. Algorithm 2 Classification and Ranking Input: cleaned training data set D t, prediction data set D p, number of iterations n, number of trees in each random forest l tree, threshold ɛ (0, 1) Output: D t with each instance being assigned with a score 1: Divide D t into positive subset D tp and negative subset D tn 2: initiate array SCORE[ ] of length D p 3: i 0 4: repeat 5: i i + 1 6: for m = 1 to l tree do 7: Randomly sample a subset P sub from D tp and a subset N sub from D tn, where P sub N sub 8: Learning a classification tree T m form N sub P sub 9: end for 10: for each instance X n D p do 11: for each classification tree T m in the random forest do 12: if T m classify X n as positive then 13: SCORE[n] SCORE[n] : end if 15: end for 16: if less than ɛ l tree tree classifiers classify X n as positive then 17: Eliminate X n from future iterations 18: end if 19: end for 20: until i=n 21: Return SCORE as the output The idea behind this method is that again we exploit the bias of the classifiers trained from imbalanced data set as in the data cleaning process. Since the positive instances dominate the training data set, each tree is biased towards correctly classifying the positive instance. Then there is a strong possibility for

6 a positive instance to receive a higher score and a negative instance to receive a lower one. By excluding the instances that received a score lower than a specified threshold, we restrict the model to focus on the hard to classify instances. 3.3 Evaluation of the Proposed Method The proposed methods are compared with 5 other popular techniques used to handle with imbalanced learning, including SMOTE, under-sampling(under), over-sampling(over), Tomek links(tomek), Condensed Nearest Neighbor Rule(CNN). RF refers to our proposed random forest based methods. 8 UCI data sets 1 are used and the obtained AUC score are listed in Table.1. Table 1. AUC Scores on 8 UCI Data Sets Data Set SMOTE Under Over Tomek CNN RF balance flag haberman letter nursery pima sat vehicle RF achieves the best performance on 4 of the data sets. For the other 4 data sets, it is almost as good as the method with the highest AUC score. 4 Discussion 4.1 Missing Value is Useful One characteristic of the given training data set is that it is abound in missing value. The most frequently used strategy to handle with missing value is imputation, i.e., replacing the missing value with an artificial value. However, there may be a good reason why the value is missing perhaps the customer want to hidden some personal privacy. This piece of information may providing new insights into consumer behaviors. Then for one who want to conduct machine learning task on this data set, it should be especially careful to employ any imputation strategy. One feasible practice is to treat missing as a specific value of the data since it conveys more information simply than that the value is missing. 1 /MLRepository.html

7 4.2 Taking Temporal Information into Consideration There are several pairs of instances in the data set that have nearly the same values in every attribute, but with different level. One reasonable explanation is that at least one of them is dirty data and should be eliminated as what we have done in the data cleaning process. In the context of cross-selling, other possible interpretation also exists. For a card holder who cherish the will of opening a home loan, it is not clear when he or she will really make that happen. The training data only tells that the customers in the positive instance opened the home loan within 12 months after they have opened the card. If the exact time interval between the time of opening the card and the home loan is taken into consideration, it may be possible find those who are without home loan but will purchase it in the coming future. Acknowledgment References 1. Breiman.L. Random Forest. In Machine Learning(2001), 45, Nitesh V.Chawla, Nathalie Japkowicz, and Aleksander Kolcz. Editorial: Special Issue on Learning from Imbalanced Data Sets. In Sigkdd Explorations(2004), Volume 6, Issue 1, Page K. M. Ting. A comparative study of cost-sensitive boosting algorithms. In Proceedings of Seventeenth In- ternational Conference on Machine Learning, pages , Stanford, CA, P. Juszczak and R. P. W. Duin. Uncertainty sampling methods for one-class classifiers. In Proceedings of the ICML 03 Workshop on Learning from Imbalanced Data Sets, Nitesh V.Chawla, Kevin W.Bowyer, Lawrence O.Hall, W.Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling TEchnique. In Journal of Artificial Intelligence Research(2002), Hart, P.E. The Condensed Nearest Neighbor Rule. In IEEE Transactions on Information Theory IT-14 (1968), 515C Tomek, I. Two Modifications of CNN. In IEEE Transactions on Systems Man and Communications SMC-6 (1976), 769C Charu C.Aggarwal. Towards Systematic Design of Distance Functions for Data Mining Applications. In Proceedings of the International Conference on Knowledge Discovery and Data Mining(2003).

ClusterOSS: a new undersampling method for imbalanced learning

ClusterOSS: a new undersampling method for imbalanced learning 1 ClusterOSS: a new undersampling method for imbalanced learning Victor H Barella, Eduardo P Costa, and André C P L F Carvalho, Abstract A dataset is said to be imbalanced when its classes are disproportionately

More information

Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

More information

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu

More information

CLASS distribution, i.e., the proportion of instances belonging

CLASS distribution, i.e., the proportion of instances belonging IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS 1 A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches Mikel Galar,

More information

Learning with Skewed Class Distributions

Learning with Skewed Class Distributions CADERNOS DE COMPUTAÇÃO XX (2003) Learning with Skewed Class Distributions Maria Carolina Monard and Gustavo E.A.P.A. Batista Laboratory of Computational Intelligence LABIC Department of Computer Science

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW

DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW Chapter 40 DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW Nitesh V. Chawla Department of Computer Science and Engineering University of Notre Dame IN 46530, USA Abstract Keywords: A dataset is imbalanced

More information

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System

A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System Ronaldo C. Prati, Gustavo E. A. P. A Batista, and Maria Carolina Monard Abstract Sampling methods are a direct approach

More information

A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms

A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using

More information

Selecting Data Mining Model for Web Advertising in Virtual Communities

Selecting Data Mining Model for Web Advertising in Virtual Communities Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński

More information

Handling imbalanced datasets: A review

Handling imbalanced datasets: A review Handling imbalanced datasets: A review Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas Educational Software Development Laboratory Department of Mathematics, University of Patras, Greece

More information

Editorial: Special Issue on Learning from Imbalanced Data Sets

Editorial: Special Issue on Learning from Imbalanced Data Sets Editorial: Special Issue on Learning from Imbalanced Data Sets Nitesh V. Chawla Nathalie Japkowicz Retail Risk Management,CIBC School of Information 21 Melinda Street Technology and Engineering Toronto,

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Supervised Learning with Unsupervised Output Separation

Supervised Learning with Unsupervised Output Separation Supervised Learning with Unsupervised Output Separation Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa 150 Louis Pasteur, P.O. Box 450 Stn. A Ottawa, Ontario,

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Evaluation and Credibility. How much should we believe in what was learned?

Evaluation and Credibility. How much should we believe in what was learned? Evaluation and Credibility How much should we believe in what was learned? Outline Introduction Classification with Train, Test, and Validation sets Handling Unbalanced Data; Parameter Tuning Cross-validation

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

CLASSIFICATION JELENA JOVANOVIĆ. Web:

CLASSIFICATION JELENA JOVANOVIĆ.   Web: CLASSIFICATION JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is classification? Binary and multiclass classification Classification algorithms Performance measures

More information

On the application of multi-class classification in physical therapy recommendation

On the application of multi-class classification in physical therapy recommendation RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation

More information

On the application of multi-class classification in physical therapy recommendation

On the application of multi-class classification in physical therapy recommendation On the application of multi-class classification in physical therapy recommendation Jing Zhang 1, Douglas Gross 2, and Osmar R. Zaiane 1 1 Department of Computing Science, 2 Department of Physical Therapy,

More information

Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

More information

Predicting School Failure Using Data Mining

Predicting School Failure Using Data Mining Predicting School Failure Using Data Mining C. MÁRQUEZ-VERA Autonomous University of Zacatecas, Mexico C. ROMERO AND S. VENTURA University of Cordoba, Spain This paper proposes to apply data mining techniques

More information

Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

More information

Class Imbalance Learning in Software Defect Prediction

Class Imbalance Learning in Software Defect Prediction Class Imbalance Learning in Software Defect Prediction Dr. Shuo Wang s.wang@cs.bham.ac.uk University of Birmingham Research keywords: ensemble learning, class imbalance learning, online learning Shuo Wang

More information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi 15/07/2015 IEEE IJCNN

More information

FOUNDATIONS OF IMBALANCED LEARNING

FOUNDATIONS OF IMBALANCED LEARNING CHAPTER 2 FOUNDATIONS OF IMBALANCED LEARNING Gary M. Weiss Fordham University Abstract Many important learning problems, from a wide variety of domains, involve learning from imbalanced data. Because this

More information

Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

More information

Mining Caravan Insurance Database Technical Report

Mining Caravan Insurance Database Technical Report Mining Caravan Insurance Database Technical Report Tarek Amr (@gr33ndata) Abstract Caravan Insurance collected a set of 5000 records for their customers. The dataset is composed of 85 attributes, as well

More information

Fraud Detection Using Reputation Features, SVMs, and Random Forests

Fraud Detection Using Reputation Features, SVMs, and Random Forests Detection Using Reputation Features, SVMs, and Random Forests Dave DeBarr, and Harry Wechsler, Fellow, IEEE Computer Science Department George Mason University Fairfax, Virginia, 030, United States {ddebarr,

More information

The Predictive Data Mining Revolution in Scorecards:

The Predictive Data Mining Revolution in Scorecards: January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

More information

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013. Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing

More information

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for

More information

Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics

Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics Journal of Computational Information Systems 8: (22) 37 378 Available at http://www.jofcis.com Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics Yuanfang DONG, 2, Xiongfei LI,, Jun

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Preprocessing Imbalanced Dataset Using Oversampling Approach

Preprocessing Imbalanced Dataset Using Oversampling Approach Journal of Recent Research in Engineering and Technology, 2(11), 2015, pp 10-15 Article ID J111503 ISSN (Online): 2349 2252, ISSN (Print):2349 2260 Bonfay Publications Research article Preprocessing Imbalanced

More information

The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner

The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Paper 3361-2015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,

More information

Review of Ensemble Based Classification Algorithms for Nonstationary and Imbalanced Data

Review of Ensemble Based Classification Algorithms for Nonstationary and Imbalanced Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 1, Ver. IX (Feb. 2014), PP 103-107 Review of Ensemble Based Classification Algorithms for Nonstationary

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection.

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise

More information

Drug Store Sales Prediction

Drug Store Sales Prediction Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

More information

COMP9417: Machine Learning and Data Mining

COMP9417: Machine Learning and Data Mining COMP9417: Machine Learning and Data Mining A note on the two-class confusion matrix, lift charts and ROC curves Session 1, 2003 Acknowledgement The material in this supplementary note is based in part

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance

Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz, Ibai Gurrutxaga, and José I. Martín Dept. of Computer

More information

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART A: SYSTEMS AND HUMANS, VOL. 40, NO. 1, JANUARY

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART A: SYSTEMS AND HUMANS, VOL. 40, NO. 1, JANUARY IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART A: SYSTEMS AND HUMANS, VOL. 40, NO. 1, JANUARY 2010 185 RUSBoost: A Hybrid Approach to Alleviating Class Imbalance Chris Seiffert, Taghi M. Khoshgoftaar,

More information

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Roulette Sampling for Cost-Sensitive Learning

Roulette Sampling for Cost-Sensitive Learning Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Expert Systems with Applications

Expert Systems with Applications Expert Systems with Applications 36 (2009) 5445 5449 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Customer churn prediction

More information

A Comparative Assessment of the Performance of Ensemble Learning in Customer Churn Prediction

A Comparative Assessment of the Performance of Ensemble Learning in Customer Churn Prediction The International Arab Journal of Information Technology, Vol. 11, No. 6, November 2014 599 A Comparative Assessment of the Performance of Ensemble Learning in Customer Churn Prediction Hossein Abbasimehr,

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio Machine Learning model evaluation Luigi Cerulo Department of Science and Technology University of Sannio Accuracy To measure classification performance the most intuitive measure of accuracy divides the

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

A Novel Classification Approach for C2C E-Commerce Fraud Detection

A Novel Classification Approach for C2C E-Commerce Fraud Detection A Novel Classification Approach for C2C E-Commerce Fraud Detection *1 Haitao Xiong, 2 Yufeng Ren, 2 Pan Jia *1 School of Computer and Information Engineering, Beijing Technology and Business University,

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

CLASS imbalance learning refers to a type of classification

CLASS imbalance learning refers to a type of classification IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS, PART B Multi-Class Imbalance Problems: Analysis and Potential Solutions Shuo Wang, Member, IEEE, and Xin Yao, Fellow, IEEE Abstract Class imbalance problems

More information

Machine Learning (CS 567)

Machine Learning (CS 567) Machine Learning (CS 567) Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol Han (cheolhan@usc.edu)

More information

Guido Sciavicco. 11 Novembre 2015

Guido Sciavicco. 11 Novembre 2015 classical and new techniques Università degli Studi di Ferrara 11 Novembre 2015 in collaboration with dr. Enrico Marzano, CIO Gap srl Active Contact System Project 1/27 Contents What is? Embedded Wrapper

More information

IMPROVING IDENTIFICATION OF DIFFICULT SMALL CLASSES BY BALANCING CLASS DISTRIBUTION

IMPROVING IDENTIFICATION OF DIFFICULT SMALL CLASSES BY BALANCING CLASS DISTRIBUTION IMPROVING IDENTIFICATION OF DIFFICULT SMALL CLASSES BY BALANCING CLASS DISTRIBUTION Jorma Laurikkala DEPARTMENT OF COMPUTER AND INFORMATION SCIENCES UNIVERSITY OF TAMPERE REPORT A-2001-2 UNIVERSITY OF

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Mining Life Insurance Data for Customer Attrition Analysis

Mining Life Insurance Data for Customer Attrition Analysis Mining Life Insurance Data for Customer Attrition Analysis T. L. Oshini Goonetilleke Informatics Institute of Technology/Department of Computing, Colombo, Sri Lanka Email: oshini.g@iit.ac.lk H. A. Caldera

More information

ROC Graphs: Notes and Practical Considerations for Data Mining Researchers

ROC Graphs: Notes and Practical Considerations for Data Mining Researchers ROC Graphs: Notes and Practical Considerations for Data Mining Researchers Tom Fawcett 1 1 Intelligent Enterprise Technologies Laboratory, HP Laboratories Palo Alto Alexandre Savio (GIC) ROC Graphs GIC

More information

Sample subset optimization for classifying imbalanced biological data

Sample subset optimization for classifying imbalanced biological data Sample subset optimization for classifying imbalanced biological data Pengyi Yang 1,2,3, Zili Zhang 4,5, Bing B. Zhou 1,3 and Albert Y. Zomaya 1,3 1 School of Information Technologies, University of Sydney,

More information

Learning on the Border: Active Learning in Imbalanced Data Classification

Learning on the Border: Active Learning in Imbalanced Data Classification Learning on the Border: Active Learning in Imbalanced Data Classification Şeyda Ertekin 1, Jian Huang 2, Léon Bottou 3, C. Lee Giles 2,1 1 Department of Computer Science and Engineering 2 College of Information

More information

CHAPTER 7. The problem of imbalanced data sets

CHAPTER 7. The problem of imbalanced data sets CHAPTER 7 The problem of imbalanced data sets A general goal of classifier learning is to learn a model on the basis of training data which makes as few errors as possible when classifying previously unseen

More information

On the Class Imbalance Problem *

On the Class Imbalance Problem * Fourth International Conference on Natural Computation On the Class Imbalance Problem * Xinjian Guo, Yilong Yin 1, Cailing Dong, Gongping Yang, Guangtong Zhou School of Computer Science and Technology,

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Counting the cost Data Mining Practical Machine Learning Tools and Techniques Slides for Section 5.7 In practice, different types of classification errors often incur different costs Examples: Loan decisions

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

Class Imbalance Problem in Data Mining: Review

Class Imbalance Problem in Data Mining: Review Class Imbalance Problem in Data Mining: Review 1 Mr.Rushi Longadge, 2 Ms. Snehlata S. Dongre, 3 Dr. Latesh Malik 1 Department of Computer Science and Engineering G. H. Raisoni College of Engineering Nagpur,

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Maximum Profit Mining and Its Application in Software Development

Maximum Profit Mining and Its Application in Software Development Maximum Profit Mining and Its Application in Software Development Charles X. Ling 1, Victor S. Sheng 1, Tilmann Bruckhaus 2, Nazim H. Madhavji 1 1 Department of Computer Science, The University of Western

More information

T3: A Classification Algorithm for Data Mining

T3: A Classification Algorithm for Data Mining T3: A Classification Algorithm for Data Mining Christos Tjortjis and John Keane Department of Computation, UMIST, P.O. Box 88, Manchester, M60 1QD, UK {christos, jak}@co.umist.ac.uk Abstract. This paper

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598. Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598. Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication

A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown

Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown Marcus A. Maloof maloof@cs.georgetown.edu Department of Computer Science, Georgetown University, Washington, DC 257-232, USA

More information

Expert Systems with Applications

Expert Systems with Applications Expert Systems with Applications 36 (2009) 4626 4636 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Handling class imbalance in

More information

Direct Marketing When There Are Voluntary Buyers

Direct Marketing When There Are Voluntary Buyers Direct Marketing When There Are Voluntary Buyers Yi-Ting Lai and Ke Wang Simon Fraser University {llai2, wangk}@cs.sfu.ca Daymond Ling, Hua Shi, and Jason Zhang Canadian Imperial Bank of Commerce {Daymond.Ling,

More information

Learning Decision Trees for Unbalanced Data

Learning Decision Trees for Unbalanced Data Learning Decision Trees for Unbalanced Data David A. Cieslak and Nitesh V. Chawla {dcieslak,nchawla}@cse.nd.edu University of Notre Dame, Notre Dame IN 46556, USA Abstract. Learning from unbalanced datasets

More information

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

More information

A Review of Missing Data Treatment Methods

A Review of Missing Data Treatment Methods A Review of Missing Data Treatment Methods Liu Peng, Lei Lei Department of Information Systems, Shanghai University of Finance and Economics, Shanghai, 200433, P.R. China ABSTRACT Missing data is a common

More information

OUTLIER ANALYSIS. Data Mining 1

OUTLIER ANALYSIS. Data Mining 1 OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information