Random Forest Based Imbalanced Data Cleaning and Classification




Jie Gu
Software School of Tsinghua University, China

Abstract. The task of the PAKDD 2007 data mining competition is a typical problem of learning from an extremely imbalanced data set. In this paper, we propose a combination of random forest based techniques and sampling methods to identify the potential buyers. Our method is composed of two phases, data cleaning and classification, both based on random forest. First, the data set is cleaned by eliminating dangerous negative instances. The data cleaning process is supervised by a negative-biased random forest, in which negative instances make up the major proportion of the training data of each tree in the forest. Second, we train a variant of random forest in which each tree is biased towards the positive class to classify the data set, and a majority vote is taken for prediction. We compared our method with many other existing methods and showed a favorable performance improvement in terms of the area under the ROC curve. Finally, we discuss what business insights can be interpreted from the scoring model results.

1 Introduction to the Task

The company currently has a customer base of credit card customers as well as a customer base of home loan (mortgage) customers. The company would like to use this opportunity to cross-sell home loans to its credit card customers, but the small size of the overlap presents a challenge when trying to develop an effective scoring model to predict potential cross-sell take-ups. A modeling data set of 40,700 customers with 40 modeling variables (as of the point of application for the company's credit card), plus a target variable, is provided to the participants. This is a sample of customers who opened a new credit card with the company within a specific 2-year period and who did not have an existing home loan with the company.
The target categorical variable, Target Flag, has a value of 1 if the customer then opened a home loan with the company within 12 months after opening the credit card (700 random samples), and a value of 0 otherwise (40,000 random samples). A prediction data set (8,000 sampled cases) is also provided to the participants, with similar variables but withholding the target variable. The data mining task is to produce a score for each customer in the prediction data set, indicating a credit card customer's propensity to take up a home loan with the company (the higher the score, the higher the propensity).

The rest of this paper is organized as follows: in Section 2 we briefly review the problem of learning from imbalanced data and the random forest method. The data cleaning process, aimed at improving prediction accuracy, is described in Section 3.1. In Section 3.2, we develop a variant of the traditional random forest [1] to classify and rank the potential buyers. In Section 4, we discuss what business insights can be interpreted from the scoring model results.

2 Learning from Imbalanced Data and Random Forest

Many real-world data sets exhibit skewed class distributions in which almost all cases belong to one or more larger classes and far fewer cases belong to a smaller, usually more interesting, class. In the given training data set of the PAKDD 2007 data mining competition, only 1.7% of the instances correspond to the positive class, i.e., customers who opened a home loan with the company within 12 months after opening the credit card; the remaining majority of the instances belong to the negative class. A number of solutions to the class-imbalance problem have previously been proposed at both the data and algorithmic levels [2]. At the data level, these solutions are based on many different forms of re-sampling, including under-sampling the majority class or over-sampling the minority class to balance the class distribution. At the algorithmic level, frequently used methods include cost-sensitive classification [3], threshold moving, recognition-based learning [4], etc.

Random forest [1] is an ensemble of unpruned classification or regression trees, trained from bootstrap samples of the training data using random feature selection in the tree induction process. The classification is made through a majority vote that takes the decisions of all the trees into consideration. Random forest shows important performance improvements over single tree classifiers and many other machine learning techniques.
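As an illustration of the bootstrap-and-vote mechanism just described (a minimal sketch, not the authors' code), the following Python snippet builds an ensemble of one-threshold "stumps" on a hypothetical skewed toy data set; the data and all names are invented for the example:

```python
import random
from collections import Counter

random.seed(0)

# Toy 1-D imbalanced data set: ~90% negatives near 0, ~10% positives near 2.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(90)] + \
       [(random.gauss(2.0, 1.0), 1) for _ in range(10)]

def train_stump(sample):
    """Fit a one-threshold 'tree': predict 1 when x > t."""
    best_t, best_err = 0.0, len(sample) + 1
    for t, _ in sample:  # candidate thresholds are the sample values
        err = sum((x > t) != (y == 1) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Each 'tree' is trained on a bootstrap sample, as in a random forest.
thresholds = [train_stump(random.choices(data, k=len(data)))
              for _ in range(25)]

def forest_vote(x):
    """Classify by majority vote over all trees."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

preds = [forest_vote(x) for x, _ in data]
accuracy = sum(p == y for p, (_, y) in zip(preds, data)) / len(data)
print("training accuracy:", accuracy)
```

With such a skewed class ratio, overall accuracy can stay high even when most minority instances are voted down, which is the failure mode the paper's biased forests are designed to counter.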
However, random forest also suffers from class imbalance when learning from such a data set. In this paper, we make significant modifications to the basic random forest algorithm to tackle the problem of learning from an imbalanced data set.

3 Proposed Solutions

Our method is composed of two phases: data cleaning and classification (ranking). In the data cleaning step, the majority class is preprocessed to eliminate instances that may degrade prediction performance. In the classification (ranking) step, we build a model to produce a score for each of the instances in the given prediction data set.

3.1 Data Cleaning

The degradation of performance in many standard classifiers is due not only to the imbalance of the class distribution, but also to the class overlapping caused by class

imbalance.

Fig. 1. Effect of data cleaning: (a) class overlapping exists before data cleaning; (b) a well-defined class boundary emerges after data cleaning.

To better understand this problem, imagine the situation illustrated in Fig. 1. Fig. 1(a) represents the original data set without any preprocessing. The circles in red and black represent the positive class and the negative class respectively, and an obvious class imbalance exists. Note that in Fig. 1(a) there are several negative instances in the region dominated by the positive instances, which presents some degree of class overlapping. These negative instances are considered dangerous, since any model trained from this data set is quite likely to misclassify many positive instances as negative. For cost-sensitive problems, such as the task described in Section 1, this issue is even more detrimental. A natural requirement is therefore to eliminate the dangerous negative instances, i.e., the black circles in the red region of Fig. 1(a). The data set after cleaning, where a well-defined class boundary exists, is represented in Fig. 1(b). There are many existing techniques designed for this data cleaning task, including Tomek links [7], the Condensed Nearest Neighbor Rule [6], etc. The main defect of these methods lies in their strong dependency on a distance function defined over the data set. However, the most effective form of the distance function can only be expressed in the context of a particular data domain, and finding it for a particular data set is often a challenging and non-trivial task [8]. The data cleaning method proposed in this paper does not employ any distance function and is more straightforward. In this method, the data are divided into the minority instance set P and the majority instance

set N. Then N is further divided into n subsets {N_1, N_2, ..., N_n}, with each N_i having approximately the same size. For each N_i, we train a random forest RF from the remaining instances in N and the entire P. The trick is that for every classification tree in the forest, the class distribution in the corresponding training data is not balanced, i.e., there are more negative instances than positive instances. We then remove all instances in N_i that are incorrectly classified by RF.

The rationale behind this data cleaning method is quite simple. Most standard learning algorithms assume that maximizing accuracy over the full range of cases is the goal; therefore, these systems exhibit accurate prediction for the majority class but very poor performance for the minority class. For such a negative-biased classification model, if a negative instance is misclassified as positive, it is reasonable to consider it dangerous, since it must be highly similar to certain positive instances and is thus responsible for class overlapping. Eliminating these dangerous negative instances from the training data will potentially reduce the false negative rate of a model trained from it. Details of this method are described in Algorithm 1.

Algorithm 1 Data Cleaning: Eliminate Dangerous Negative Instances
Input: majority instance set N, minority instance set P, number of subsets n, number of trees in each forest l_tree, threshold eps in (0, 1)
Output: clean data set from which dangerous negative instances have been eliminated according to the given parameter eps
 1: Divide N randomly into n subsets {N_1, N_2, ..., N_n} with |N_i| = |N_j| and N_i ∩ N_j = ∅ for all i ≠ j
 2: i ← 0
 3: repeat
 4:   i ← i + 1
 5:   N_S ← (∪_{j=1..n} N_j) \ N_i
 6:   for m = 1 to l_tree do
 7:     Randomly sample a subset P_sub from P and a subset N_sub from N_S, where |P_sub| < |N_sub|
 8:     Learn a classification tree T_m from N_sub ∪ P_sub
 9:   end for
10:   for each instance x ∈ N_i do
11:     if more than eps * l_tree tree classifiers classify x as positive then
12:       Eliminate x from N_i
13:     end if
14:   end for
15: until i = n
16: N_out ← ∪_{j=1..n} N_j
17: Return N_out as the output
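The cleaning loop of Algorithm 1 can be sketched in plain Python. This is an illustrative toy version under simplifying assumptions, not the authors' implementation: a nearest-centroid rule on 1-D data stands in for the classification trees, and the data, parameter values, and helper names are all hypothetical:

```python
import random

random.seed(1)

def fit_centroids(pos_sample, neg_sample):
    """Stand-in base learner (the paper uses classification trees)."""
    return (sum(pos_sample) / len(pos_sample),
            sum(neg_sample) / len(neg_sample))

def predict(model, x):
    """Predict positive when x is closer to the positive centroid."""
    c_pos, c_neg = model
    return 1 if abs(x - c_pos) < abs(x - c_neg) else 0

def clean(N, P, n_subsets=5, l_tree=15, eps=0.6):
    """Algorithm 1: drop negatives that a negative-biased ensemble,
    trained on the other subsets, still labels as positive."""
    N = N[:]
    random.shuffle(N)
    folds = [N[i::n_subsets] for i in range(n_subsets)]
    kept = []
    for i, fold in enumerate(folds):
        rest = [x for j, f in enumerate(folds) if j != i for x in f]
        models = []
        for _ in range(l_tree):
            n_sub = random.choices(rest, k=40)  # negatives dominate,
            p_sub = random.choices(P, k=10)     # i.e. |N_sub| > |P_sub|
            models.append(fit_centroids(p_sub, n_sub))
        for x in fold:
            pos_votes = sum(predict(m, x) for m in models)
            if pos_votes <= eps * l_tree:       # not "dangerous": keep it
                kept.append(x)
    return kept

# Toy data: negatives cluster near 0; a few "dangerous" negatives sit in
# the positive cluster around 3 and should be eliminated.
N = [random.gauss(0.0, 1.0) for _ in range(200)] + \
    [random.gauss(3.0, 0.3) for _ in range(5)]
P = [random.gauss(3.0, 1.0) for _ in range(20)]
cleaned = clean(N, P)
print(len(N), "->", len(cleaned))
```

The negatives planted near the positive cluster draw positive votes from nearly every biased model and are removed, which is exactly the overlap-elimination effect illustrated in Fig. 1.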

3.2 Classification and Ranking

In this subsection, we introduce the method developed to rank the customers in the prediction data set according to their probability of opening a home loan. The basic model is similar to the random forest used in the data cleaning process. The only difference is that the class distribution of the training data for each tree in the forest is reversed, i.e., there are more positive instances than negative instances. Several random forests are obtained in this way. Instances in the prediction data set are classified by the trained random forests in an iterative way. For each instance, the number of trees in the forest that classify it as positive is assigned to it as its score. Any instance that receives a score lower than a given threshold is excluded from the next iteration. Finally, all the instances are ranked according to the sum of the scores received over all the iterations.

Algorithm 2 Classification and Ranking
Input: cleaned training data set D_t, prediction data set D_p, number of iterations n, number of trees in each random forest l_tree, threshold eps in (0, 1)
Output: each instance in D_p assigned a score
 1: Divide D_t into positive subset D_tp and negative subset D_tn
 2: Initialize array SCORE[] of length |D_p|
 3: i ← 0
 4: repeat
 5:   i ← i + 1
 6:   for m = 1 to l_tree do
 7:     Randomly sample a subset P_sub from D_tp and a subset N_sub from D_tn, where |P_sub| > |N_sub|
 8:     Learn a classification tree T_m from N_sub ∪ P_sub
 9:   end for
10:   for each instance x_k ∈ D_p do
11:     for each classification tree T_m in the random forest do
12:       if T_m classifies x_k as positive then
13:         SCORE[k] ← SCORE[k] + 1
14:       end if
15:     end for
16:     if fewer than eps * l_tree tree classifiers classify x_k as positive then
17:       Eliminate x_k from future iterations
18:     end if
19:   end for
20: until i = n
21: Return SCORE as the output

The idea behind this method is that we again exploit the bias of classifiers trained from an imbalanced data set, as in the data cleaning process. Since the positive instances dominate the training data, each tree is biased towards correctly classifying the positive instances. There is then a strong possibility for

a positive instance to receive a higher score and a negative instance to receive a lower one. By excluding the instances that received a score lower than a specified threshold, we restrict the model to focus on the hard-to-classify instances.

3.3 Evaluation of the Proposed Method

The proposed method is compared with 5 other popular techniques for handling imbalanced learning: SMOTE, under-sampling (Under), over-sampling (Over), Tomek links (Tomek), and the Condensed Nearest Neighbor Rule (CNN). RF refers to our proposed random forest based method. 8 UCI data sets¹ are used, and the obtained AUC scores are listed in Table 1.

Table 1. AUC Scores on 8 UCI Data Sets

Data Set   SMOTE   Under   Over    Tomek   CNN     RF
balance    0.6113  0.6217  0.5732  0.6087  0.6312  0.6543
flag       0.7617  0.7233  0.7412  0.7098  0.7322  0.7691
haberman   0.6312  0.6412  0.6377  0.6521  0.6709  0.6852
letter     0.9978  0.9993  0.9879  0.9817  0.9921  0.9973
nursery    0.9815  0.9713  0.9551  0.9866  0.9735  0.9783
pima       0.8211  0.7820  0.8078  0.7988  0.8091  0.8102
sat        0.8971  0.9210  0.9100  0.9078  0.9311  0.9517
vehicle    0.9633  0.9712  0.9891  0.9533  0.9756  0.9690

RF achieves the best performance on 4 of the data sets. On the other 4 data sets, it is almost as good as the method with the highest AUC score.

4 Discussion

4.1 Missing Values Are Useful

One characteristic of the given training data set is that it abounds in missing values. The most frequently used strategy for handling missing values is imputation, i.e., replacing the missing value with an artificial value. However, there may be a good reason why a value is missing: perhaps the customer wants to protect some personal privacy. This piece of information may provide new insight into consumer behavior. Anyone who wants to conduct a machine learning task on this data set should therefore be especially careful about employing any imputation strategy.
One feasible practice is to treat "missing" as a specific value of the attribute, since it conveys more information than simply noting that the value is absent.

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html
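The practice described above can be sketched in a few lines of Python; the records, field names, and sentinel value here are hypothetical illustrations, not taken from the competition data:

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"income": 52000, "home_owner": "Y"},
    {"income": None,  "home_owner": None},
    {"income": 38000, "home_owner": "N"},
]

MISSING = "__missing__"  # assumed sentinel category

def encode(record):
    """Treat 'missing' as a category in its own right instead of imputing,
    and add an explicit indicator so the signal survives later binning."""
    out = {}
    for key, value in record.items():
        out[key] = MISSING if value is None else value
        out[key + "_was_missing"] = int(value is None)
    return out

encoded = [encode(r) for r in records]
print(encoded[1])
```

Either the sentinel category or the indicator column lets a downstream model learn from the fact of missingness itself, rather than from an imputed guess.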

4.2 Taking Temporal Information into Consideration

There are several pairs of instances in the data set that have nearly the same values in every attribute but different target labels. One reasonable explanation is that at least one of them is dirty data and should be eliminated, as we have done in the data cleaning process. In the context of cross-selling, another interpretation also exists. For a card holder who cherishes the wish of opening a home loan, it is not clear when he or she will actually make that happen. The training data only tell us that the customers in the positive class opened the home loan within 12 months after opening the card. If the exact time interval between opening the card and opening the home loan were taken into consideration, it might be possible to find those who are currently without a home loan but will purchase one in the near future.

References

1. Breiman, L. Random Forests. Machine Learning (2001), 45, 5-32.
2. Chawla, N.V., Japkowicz, N., and Kolcz, A. Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations (2004), Volume 6, Issue 1, Page 1.
3. Ting, K.M. A comparative study of cost-sensitive boosting algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 983-990, Stanford, CA, 2000.
4. Juszczak, P. and Duin, R.P.W. Uncertainty sampling methods for one-class classifiers. In Proceedings of the ICML '03 Workshop on Learning from Imbalanced Data Sets, 2003.
5. Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research (2002), 321-357.
6. Hart, P.E. The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory IT-14 (1968), 515-516.
7. Tomek, I. Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976), 769-772.
8. Aggarwal, C.C. Towards Systematic Design of Distance Functions for Data Mining Applications. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (2003).