A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms

Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms Claudia Regina Milaré 1, Gustavo E.A.P.A. Batista 1 and André C.P.L.F. Carvalho 1 1 Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo emails: cmilare@gmail.com, gbatista@icmc.usp.br, andre@icmc.usp.br Abstract There is an increasing interest in application of Evolutionary Algorithms to induce classification rules. This hybrid approach can aid in areas that classical methods to rule induction have not been completely successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when some classes heavily outnumbers other classes. Frequently, classical Machine Learning classifiers are not able to learn in the presence of imbalanced data sets, outputting classifiers that always predict the most numerous classes. In this work we propose a novel hybrid approach to deal with this problem. We create several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are given to classical Machine Learning systems that output rule sets. The rule sets are combined in a pool of rules and an evolutionary algorithm is used to build a classifier from this pool of rules. This hybrid approach has some advantages over under-sampling since it reduces the amount of discarded information, and some advantages over over-sampling since it avoids overfitting. This approach was experimentally analyzed and our experiments show the proposed approach can improve classification measured as the area under the ROC curve. Key words: Evolutionary Algorithms, Rule Subset Selection, Treatment of Imbalanced Classes.

1 Introduction A Hybrid Approach to Learn with Imbalanced Classes There is an increasing interest in application of evolutionary algorithms to induce classification rules. This hybrid approach can aid in areas that classical methods for rule induction have not been completely successful. One example is the induction of classification rules in class imbalanced domains. Imbalanced data occur when some classes heavily outnumber other classes. Frequently, classical Machine Learning inducers are not able to learn in the presence of imbalanced data sets. These inducers usually output classifiers that always predict the most numerous classes. However, in several domains, minority classes are the classes of interest, which are associated with highest costs. For instance, in detection of fraudulent transactions in credit cards and phone calls, diagnosis of rare diseases and prediction of weather events such as hoar-frosts, misclassifying minority class examples is more expensive than majority class examples, and a classifier that simply outputs a majority class is completely useless. In this work we propose a novel hybrid approach to deal with this problem. We believe the class imbalance problem is also a search problem. Therefore, an evolutionary algorithm is used to better search the hypothesis space. Our approach creates several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are given to classical Machine Learning systems that output rule sets. The rule sets are combined in a pool of rules and an evolutionary algorithm is used to select some of the rules from the pool in order to build a classifier. Our method uses under-sampling to create several balanced data sets. Undersampling eliminates majority class cases in order to create balanced data sets. A major drawback of under-sampling is that it can discard potentially useful data that could be important to the induction process. Our method overcomes this limitation by creating several data samples and, therefore, increasing the probability that all data is used for learning. Another method frequently used to treat imbalanced data sets is over-sampling. Over-sampling artificially increases the number of minority class cases. A major drawback of this method is that it supposedly increases the likelihood of occurring overfitting, since it makes exact copies of the minority class examples. Our method avoids overfitting by limiting the number of rules in each classifier. More precisely, the classifiers created by our method have the same number of rules as the individual classifiers induced from balanced data. In this work, we deal, without loss of generality, with two-class problems, where one of the classes significantly outnumbers the other class. Conceptlearning problems allow us to use the area under the ROC curve (AUC) [11] as the main metric to assess our results. The use of AUC is highly recommended in experiments with imbalanced classes since metrics that are dependent of costs

Claudia Milaré, Gustavo Batista, André Carvalho and class distribution, such as accuracy and error rate, are misleading in imbalanced domains. Our approach was experimentally analyzed in three application domains. Our experiments with C4.5Rules and Ripper, two well-know rule set induction algorithm, show the proposed approach can significantly improve classification measured as the area under the ROC curve. This paper is organized as follows: Section 2 presents some related work; Section 3 describes the proposed approach; Section 4 empirically evaluates the proposed approach on some application domains; and finally, in Section 5 concludes this paper and presents some directions for future work. 2 Related Work This section reviews some related work on class imbalance and methods to combine classifiers. 2.1 Class Imbalance The predictive accuracy of several Machine Learning algorithms is largely affected by the distribution of examples among the classes. Since most learning systems are designed with the assumption of well-balanced data sets, they fail to induce a classifier able to predict the minority class. These algorithms tend to assign new examples to the most frequent class. The investigation of alternatives to efficiently deal with class imbalance problem is a very important research issue, since imbalanced data sets can be found in several domains. For example, in fraud detection in telephonic calls [12] and credit card transactions [29], the number of legitimate transactions is usually much higher than the number of that fraudulent transactions; in analysis of insurance risk [25], only few clients ask for the insurance premium in a given time window; and in direct marketing [20], the number of respondents is usually very small (about 1%) for the majority of marketing campaigns. Several papers have analyzed the problem of learning from imbalance data sets (for instance [24, 20, 19, 12, 18, 17, 4]). Among the strategies followed by these works, three main approaches have been frequently used: Assigning different costs to misclassifications: higher costs are assigned to minority classes; Random under-sampling: artificially balancing the training data by eliminating examples from the majority class;

A Hybrid Approach to Learn with Imbalanced Classes Random over-sampling: artificially balancing the training data by replicating examples from the minority class. Our approach uses under-sampling as an intermediate step to create several balanced data sets. Other methods use a similar approach to deal with class imbalances. For instance, [7] split the examples from the majority class into several non-overlapping subsets with approximately the same number of examples as the minority class. Each subset is combined with all minority class examples, forming balanced data sets that are given to a learning algorithm. The classifiers obtained are integrated using stacking [30]. A similar approach is proposed in [21], in which Adaboost [14] integrates the outputs of several classifiers induced from balanced data sets obtained with under-sampling. Our approach differs from previously published work, since we are interested in creating symbolic classifiers, i.e., we are interested in creating classifiers that can be easily interpreted by humans. Although, ensembles can be build over several individual symbolic classifiers, the final classifier cannot be considered a symbolic classifier, since this classifier cannot be easily interpreted. 2.2 Combining Classifiers There are several papers in literature describing different alternatives for the combination of knowledge bases. A straightforward approach to combine knowledge bases is to use ensembles [23, 10]. An ensemble is composed by a set of individual classifiers which predictions are combined to determine the label of a new instance. Frequently, an ensemble is more precise than its individual classifiers. Despite the gain in performance that generally is obtained by ensembles, the combination of symbolic classifiers usually results in a non-symbolic final classifier. Two well-known ensemble methods are Bagging and Boosting. Bagging [6] is one of the oldest and simplest techniques for creating an ensemble of classifiers. This technique uses majority vote to combine predictions from individual classifiers, assigning the most frequently predicted class to the final classification. Differently from bagging, in Boosting [28], each training example is associated with a weight. This weight is related to the accuracy of the induced hypothesis for that particular example. A hypothesis is induced per iteration and the weights associated with each example might be modified. A second approach to combine knowledge bases is to integrated the knowledge generated by different classifiers into a unique knowledge base and, then, to use a rule selection method to create a classifier. In [26], the authors proposed an algorithm named ROCCER to select rules based on the performance of the rules on the ROC space. Another technique that uses a similar approach is the

Claudia Milaré, Gustavo Batista, André Carvalho GARSS algorithm [3]. GARSS uses an Evolutionary Algorithm to select rules that maximize AUC measure. Both ROCCER and GARSS were used in context of associative classification. Other works using Evolutionary Algorithm to select rules that combine knowledge from a large knowledge base can be found in [15, 5]. The general methodology of inducing several classifiers with sampling and integrating the knowledge of these classifiers in a final classifier was initially proposed by Fayyad, Djorgovski, and Weir [13]. This methodology was implemented in the RULLER system and used in the SKICAT project, which objective was to catalog and analyze objects in digitized sky images. According to the authors, this methodology was able to generate a robust set of rules. Moreover, the induced classifier was more precise than astronomers in the classification of cosmic objects from photographs. The methodology adopted in the RULLER system was extended in the XRULLER system (extended RULER) [2] to use knowledge induced from different algorithms. 3 Proposed Approach The approach proposed in this work uses an Evolutionary Algorithm to select rules in order to maximize the AUC metric for problems with imbalanced classes. Evolutionary Algorithms are search algorithms based on natural selection and genetics [16]. They evolve sets of potential solutions (population), usually encoded as a sequence of bits (chromosomes). The evolution is performed by applying a set of transformations (usually, the genetic operations crossover and mutation), and by evaluating the quality (fitness) of the solutions. As previously mentioned, our approach is based on under-sampling. However, in order to reduce the probability of information loss due to discarded data, our approach creates several under-sampled training sets. Given a training set T = T + T, where T + is the set of positive (minority) examples, and T is the set of negative (majority) examples, n random samples T1,..., Tn are created from T. Each random sample has the same number of examples as the positive set, i.e., T i = T +, 1 i n. In total, n balanced training sets T i are created by joining T + with each T i, i.e., T i = T + T i, 1 i n. The parameter n was set to 100 in our experiments. Rule sets are induced from each training set T i, the rules from all rule sets are integrated into a unique pool of rules, and the repeated rules are discarded. In our experiments, the Machine Learning algorithms C4.5Rules [27] and Ripper [9] were used to induce the rule sets. The pool of rules is given as input to an Evolutionary Algorithm. A primary key, represented as a natural number, is associated to each rule in the pool. Therefore, each rule can be accessed independently by its key. Our method uses

A Hybrid Approach to Learn with Imbalanced Classes the Pittsburgh approach to encode classifiers as chromosomes. Thus, an array of keys is used to represent a chromosome, i.e., a rule set to be interpreted as a classifier. In our implementation, the initial population is randomly composed by 40 chromosomes. The evaluation function used is the AUC metric measured over the training examples. The selection method is the fitness-proportionate selection, in which the number of times a chromosome is expected to reproduce is proportional to its fitness. A simple crossover operator was applied with probability 0.4. The mutation operator alters the value of randomly selected elements in the chromosomes. Thus, it exchanges a random selected rule by other rule. The mutation operator was applied with probability 0.1. Mutation and crossover rates were chosen based on our previous experience with Evolutionary Algorithms [22]. The number of generations is limited to 20. Finally, our implementation uses an elitism operator for population replacement. According to this method, the best chromosome from each population is preserved to the next generation. As previously mentioned, each chromosome represents a classifier. We have fixed the number of rules in each chromosome, aiming to avoid overfitting. As the fitness function is the AUC measured over the training set, allowing the chromosomes to grow up causes the Evolutionary Algorithm to create chromosomes with many rules that do not generalize well. In our experiments, we set the number of rules of each chromosome differently to each data set. This number is approximately the mean number of rules obtained by the classifiers induced over each T i, and varies from 6 up to 20, depending on the data set. Therefore, the classifiers induced by our approach are as complex, in terms of number of rules, as the individual classifiers. 4 Experimental Evaluation Experiments were carried out in order to evaluate the proposed approach. The experiments were performed using three benchmark data sets, collected from the UCI repository [1], and one real-world data set used in [8]. These data sets are related to classification problems, covering different application domains. They are all imbalanced data sets. Table 1 summarizes the main features of these data sets, which are: Identifier identification of the data set used in the text; #Examples the total number of examples; #Attributes (quanti., quali.) the total number of attributes, as well as the number of continuous and nominal attributes; #Classes the number of classes; Missing Data indicates if there are missing values; and, Majority Class Error the expected error of a classifier that always outputs the majority class. Each of the three data sets was divided into 10 pairs of training and test sets using random subsampling. The training examples are 70% of the original

Claudia Milaré, Gustavo Batista, André Carvalho Table 1: Data sets summary description. Identifier #Examples #Attributes #Classes Missing Majority Class (quanti., quali.) Data Error Abalone 4176 8 (7, 1) 2 No 2,717% Mammography 11182 6 (6,0) 2 No 2,316% Page-blocks 5472 10 (10,0) 2 No 10,216% data set and the test set 30%. Within each fold, 100 balanced data sets were created with all minority class examples and a random sample of the majority class examples, as previously described. These balanced data sets were given to symbolic Machine Learning algorithms. The symbolic algorithms used in the experiments were C4.5Rules [27] and Ripper [9]. The C4.5Rules and Ripper were run with their default parameters. The classification rules generated for each training set and by each algorithm were combined into a pool of rules. An Evolutionary Algorithm was used to build a classifier from this pool of rules. As described in Section 3, the chromosome size was defined according to the average size of the classifiers generated by C4.5Rules and Ripper. Table 2 shows the size of chromosomes used by the Evolutionary Algorithm for each combination of data set and inducer. Table 2: Chromosomes sizes. Data Set C4.5Rules Ripper Abalone 20 6 Mammography 10 6 Page-blocks 15 10 Table 3 presents the results obtained with the C4.5Rules inducer. All results are mean AUC values calculated over the 10 pairs of training and test sets. The results are divided into three columns. Column C4.5Rules presents the results obtained by the C4.5Rules inducer over imbalanced data; column Under-sampling presents the results obtained by C4.5Rules over a balanced data set obtained by under-sampling the training set; and, column EA-C4.5Rules presents the results obtained by the proposed approach. As can be observed, EA-C4.5Rules presents the higher mean AUC value for all data sets. Table 4 presents the results obtained with the Ripper inducer. Similarly with the previous table, column Ripper presents the results obtained by the Ripper inducer over imbalanced data; column Under-sampling presents the results obtained by Ripper over a balanced data set obtained by under-sampling the training set; and, column EA-Ripper presents the results obtained by the proposed approach.

A Hybrid Approach to Learn with Imbalanced Classes Table 3: Average AUC score for C4.5Rules inducer. Data Set C4.5Rules Under-sampling EA-C4.5Rules Abalone 77.93 (02.23) 78.31 (02.27) 80.95 (00.91) Mammography 87.20 (03.70) 89.43 (02.55) 90.30 (02.87) Page-blocks 97.26 (01.42) 96.97 (00.84) 97.56 (01.09) As can be observed, EA-Ripper also presents the highest mean AUC values for all data sets. Table 4: Average AUC score for Ripper inducer. Data Set Ripper Under-sampling EA-Ripper Abalone 59.91 (02.49) 75.30 (03.88) 81.06 (02.84) Mammography 78.20 (02.68) 88.73 (01.82) 91.97 (02.68) Page-blocks 92.41 (01.31) 95.10 (00.80) 96.71 (00.47) 5 Conclusion and Future Work This work presents a novel hybrid approach to deal with class imbalances that combines symbolic Machine Learning inducers and Evolutionary Algorithms. Our approach uses Evolutionary Algorithms to perform a more extensive search over the hypothesis space. In addition, our approach creates classifiers with the same number of rules as the classifiers induced from balanced data sets obtained by under-sampling. Therefore, our method is able to create more precise classifiers, in terms of area under the ROC curve, and with the same complexity as the individual classifiers. As future work, we plan to investigate new metrics to compose rules into a classifier. These metrics indicate which rule should be fired in cases that multiple rules cover an example. These metrics have a direct influence over the performance of classifiers and novel metrics, designed to imbalanced classes, can potentially improve classification on imbalanced data sets. Acknowledgements. This research was partially supported by the Brazilian research agencies FAPESP and CNPq. References [1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

Claudia Milaré, Gustavo Batista, André Carvalho [2] J. A. Baranauskas and M. C. Monard. Combining symbolic classifiers from multiple inducers. Knowledge Based Systems, 16(3):129 136, 2003. Elsevier Science. [3] G. E. A. P. A. Batista, C. R. Milaré, R. C. Prati, and M. C. Monard. A comparison of methods for rule subset selection applied to associative classification. Inteligencia Artificial, (32):29 35, 2006. [4] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter, 6(1):20 29, 2004. Special issue on Learning from Imbalanced Datasets. [5] F. C. Bernardini, R. C. Prati, and M. C. Monard. Evolving sets of symbolic classifiers into a single symbolic classifier using genetic algorithms. In International Conference on Hybrid Intelligent Systems, pages 525 530, Washington, DC, USA, 2008. IEEE Computer Society. [6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123 140, 1996. [7] P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In International Conference on Knowledge Discovery and Data Mining, pages 164 168, 1998. [8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321 357, 2002. [9] W. Cohen. Fast effective rule induction. In International Conference on Machine Learning, pages 115 123, 1995. [10] T. G. Dietterich. Machine Learning Research: Four Current Directions. Technical report, Oregon State University, 1997. [11] T. Fawcett. Roc graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Labs, 2004. [12] T. Fawcett and F. J. Provost. Adaptive fraud detection. Journal of Data Mining and Knowledge Discovery, 1(3):291 316, 1997. [13] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The kdd process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27 34, 1996.

A Hybrid Approach to Learn with Imbalanced Classes [14] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997. [15] A. Ghosh and B. Nath. Multi-objective rule mining using genetic algorithms. Information Sciences, 163(1-3):123 133, 2004. [16] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989. [17] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429 449, 2002. [18] M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30:195 215, 1998. [19] M. Kubat and S. Matwin. Addressing the course of imbalanced training sets: One-sided selection. In International Conference in Machine Learning, pages 179 186. Morgan Kaufmann, 1997. [20] C. X. Ling and C. Li. Data mining for direct mining: Problems and solutions. In International Conference on Knownledge Discovery and Data Mining, pages 73 79, 1998. [21] X. Y. Liu, J. Wu, and Z. H. Zhou. Exploratory undersampling for classimbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(2):539 550, 2009. [22] C. R. Milaré, G. E. A. P. A. Batista, A. C. P. L. F. Carvalho, and M. C. Monard. Applying genetic and symbolic learning algorithms to extract rules from artificial neural neworks. In Proc. Mexican International Conference on Artificial Intelligence, volume 2972 of LNAI, pages 833 843. Springer-Verlag, 2004. [23] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169 198, 1999. [24] M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, and C. Brunk. Reducing misclassification costs. In International Conference in Machine Learning, pages 217 225, 1994. [25] E. P. D. Pednault, B. K. Rosen, and C. Apte. Handling imbalanced data sets in insurance risk modeling. Technical Report RC-21731, IBM Research Report, March 2000.

Claudia Milaré, Gustavo Batista, André Carvalho [26] R. C. Prati and P. A. Flach. Roccer: An algorithm for rule learning based on ROC analysis. In International Joint Conference on Artificial Intelligence (IJCAI 2005), pages 823 828, 2005. [27] J. R. Quinlan. C4.5 Programs for Machine Learning. Morgan Kaufmann Publishers, CA, 1988. [28] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197 227, 1990. [29] S. J. Stolfo, D. W. Fan, W. Lee, A. L. Prodromidis, and P. K. Chan. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI- 97 Workshop on AI Methods in Fraud and Risk Management, pages 83 90, 1997. [30] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241 259, 1992.