A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms

Size: px
Start display at page:

Download "A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms"

Transcription

1 Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE June, 1 3 July A Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms Claudia Regina Milaré 1, Gustavo E.A.P.A. Batista 1 and André C.P.L.F. Carvalho 1 1 Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo s: cmilare@gmail.com, gbatista@icmc.usp.br, andre@icmc.usp.br Abstract There is an increasing interest in application of Evolutionary Algorithms to induce classification rules. This hybrid approach can aid in areas that classical methods to rule induction have not been completely successful. One example is the induction of classification rules in imbalanced domains. Imbalanced data occur when some classes heavily outnumbers other classes. Frequently, classical Machine Learning classifiers are not able to learn in the presence of imbalanced data sets, outputting classifiers that always predict the most numerous classes. In this work we propose a novel hybrid approach to deal with this problem. We create several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are given to classical Machine Learning systems that output rule sets. The rule sets are combined in a pool of rules and an evolutionary algorithm is used to build a classifier from this pool of rules. This hybrid approach has some advantages over under-sampling since it reduces the amount of discarded information, and some advantages over over-sampling since it avoids overfitting. This approach was experimentally analyzed and our experiments show the proposed approach can improve classification measured as the area under the ROC curve. Key words: Evolutionary Algorithms, Rule Subset Selection, Treatment of Imbalanced Classes.

2 1 Introduction A Hybrid Approach to Learn with Imbalanced Classes There is an increasing interest in application of evolutionary algorithms to induce classification rules. This hybrid approach can aid in areas that classical methods for rule induction have not been completely successful. One example is the induction of classification rules in class imbalanced domains. Imbalanced data occur when some classes heavily outnumber other classes. Frequently, classical Machine Learning inducers are not able to learn in the presence of imbalanced data sets. These inducers usually output classifiers that always predict the most numerous classes. However, in several domains, minority classes are the classes of interest, which are associated with highest costs. For instance, in detection of fraudulent transactions in credit cards and phone calls, diagnosis of rare diseases and prediction of weather events such as hoar-frosts, misclassifying minority class examples is more expensive than majority class examples, and a classifier that simply outputs a majority class is completely useless. In this work we propose a novel hybrid approach to deal with this problem. We believe the class imbalance problem is also a search problem. Therefore, an evolutionary algorithm is used to better search the hypothesis space. Our approach creates several balanced data sets with all minority class cases and a random sample of majority class cases. These balanced data sets are given to classical Machine Learning systems that output rule sets. The rule sets are combined in a pool of rules and an evolutionary algorithm is used to select some of the rules from the pool in order to build a classifier. Our method uses under-sampling to create several balanced data sets. Undersampling eliminates majority class cases in order to create balanced data sets. A major drawback of under-sampling is that it can discard potentially useful data that could be important to the induction process. Our method overcomes this limitation by creating several data samples and, therefore, increasing the probability that all data is used for learning. Another method frequently used to treat imbalanced data sets is over-sampling. Over-sampling artificially increases the number of minority class cases. A major drawback of this method is that it supposedly increases the likelihood of occurring overfitting, since it makes exact copies of the minority class examples. Our method avoids overfitting by limiting the number of rules in each classifier. More precisely, the classifiers created by our method have the same number of rules as the individual classifiers induced from balanced data. In this work, we deal, without loss of generality, with two-class problems, where one of the classes significantly outnumbers the other class. Conceptlearning problems allow us to use the area under the ROC curve (AUC) [11] as the main metric to assess our results. The use of AUC is highly recommended in experiments with imbalanced classes since metrics that are dependent of costs

3 Claudia Milaré, Gustavo Batista, André Carvalho and class distribution, such as accuracy and error rate, are misleading in imbalanced domains. Our approach was experimentally analyzed in three application domains. Our experiments with C4.5Rules and Ripper, two well-know rule set induction algorithm, show the proposed approach can significantly improve classification measured as the area under the ROC curve. This paper is organized as follows: Section 2 presents some related work; Section 3 describes the proposed approach; Section 4 empirically evaluates the proposed approach on some application domains; and finally, in Section 5 concludes this paper and presents some directions for future work. 2 Related Work This section reviews some related work on class imbalance and methods to combine classifiers. 2.1 Class Imbalance The predictive accuracy of several Machine Learning algorithms is largely affected by the distribution of examples among the classes. Since most learning systems are designed with the assumption of well-balanced data sets, they fail to induce a classifier able to predict the minority class. These algorithms tend to assign new examples to the most frequent class. The investigation of alternatives to efficiently deal with class imbalance problem is a very important research issue, since imbalanced data sets can be found in several domains. For example, in fraud detection in telephonic calls [12] and credit card transactions [29], the number of legitimate transactions is usually much higher than the number of that fraudulent transactions; in analysis of insurance risk [25], only few clients ask for the insurance premium in a given time window; and in direct marketing [20], the number of respondents is usually very small (about 1%) for the majority of marketing campaigns. Several papers have analyzed the problem of learning from imbalance data sets (for instance [24, 20, 19, 12, 18, 17, 4]). Among the strategies followed by these works, three main approaches have been frequently used: Assigning different costs to misclassifications: higher costs are assigned to minority classes; Random under-sampling: artificially balancing the training data by eliminating examples from the majority class;

4 A Hybrid Approach to Learn with Imbalanced Classes Random over-sampling: artificially balancing the training data by replicating examples from the minority class. Our approach uses under-sampling as an intermediate step to create several balanced data sets. Other methods use a similar approach to deal with class imbalances. For instance, [7] split the examples from the majority class into several non-overlapping subsets with approximately the same number of examples as the minority class. Each subset is combined with all minority class examples, forming balanced data sets that are given to a learning algorithm. The classifiers obtained are integrated using stacking [30]. A similar approach is proposed in [21], in which Adaboost [14] integrates the outputs of several classifiers induced from balanced data sets obtained with under-sampling. Our approach differs from previously published work, since we are interested in creating symbolic classifiers, i.e., we are interested in creating classifiers that can be easily interpreted by humans. Although, ensembles can be build over several individual symbolic classifiers, the final classifier cannot be considered a symbolic classifier, since this classifier cannot be easily interpreted. 2.2 Combining Classifiers There are several papers in literature describing different alternatives for the combination of knowledge bases. A straightforward approach to combine knowledge bases is to use ensembles [23, 10]. An ensemble is composed by a set of individual classifiers which predictions are combined to determine the label of a new instance. Frequently, an ensemble is more precise than its individual classifiers. Despite the gain in performance that generally is obtained by ensembles, the combination of symbolic classifiers usually results in a non-symbolic final classifier. Two well-known ensemble methods are Bagging and Boosting. Bagging [6] is one of the oldest and simplest techniques for creating an ensemble of classifiers. This technique uses majority vote to combine predictions from individual classifiers, assigning the most frequently predicted class to the final classification. Differently from bagging, in Boosting [28], each training example is associated with a weight. This weight is related to the accuracy of the induced hypothesis for that particular example. A hypothesis is induced per iteration and the weights associated with each example might be modified. A second approach to combine knowledge bases is to integrated the knowledge generated by different classifiers into a unique knowledge base and, then, to use a rule selection method to create a classifier. In [26], the authors proposed an algorithm named ROCCER to select rules based on the performance of the rules on the ROC space. Another technique that uses a similar approach is the

5 Claudia Milaré, Gustavo Batista, André Carvalho GARSS algorithm [3]. GARSS uses an Evolutionary Algorithm to select rules that maximize AUC measure. Both ROCCER and GARSS were used in context of associative classification. Other works using Evolutionary Algorithm to select rules that combine knowledge from a large knowledge base can be found in [15, 5]. The general methodology of inducing several classifiers with sampling and integrating the knowledge of these classifiers in a final classifier was initially proposed by Fayyad, Djorgovski, and Weir [13]. This methodology was implemented in the RULLER system and used in the SKICAT project, which objective was to catalog and analyze objects in digitized sky images. According to the authors, this methodology was able to generate a robust set of rules. Moreover, the induced classifier was more precise than astronomers in the classification of cosmic objects from photographs. The methodology adopted in the RULLER system was extended in the XRULLER system (extended RULER) [2] to use knowledge induced from different algorithms. 3 Proposed Approach The approach proposed in this work uses an Evolutionary Algorithm to select rules in order to maximize the AUC metric for problems with imbalanced classes. Evolutionary Algorithms are search algorithms based on natural selection and genetics [16]. They evolve sets of potential solutions (population), usually encoded as a sequence of bits (chromosomes). The evolution is performed by applying a set of transformations (usually, the genetic operations crossover and mutation), and by evaluating the quality (fitness) of the solutions. As previously mentioned, our approach is based on under-sampling. However, in order to reduce the probability of information loss due to discarded data, our approach creates several under-sampled training sets. Given a training set T = T + T, where T + is the set of positive (minority) examples, and T is the set of negative (majority) examples, n random samples T1,..., Tn are created from T. Each random sample has the same number of examples as the positive set, i.e., T i = T +, 1 i n. In total, n balanced training sets T i are created by joining T + with each T i, i.e., T i = T + T i, 1 i n. The parameter n was set to 100 in our experiments. Rule sets are induced from each training set T i, the rules from all rule sets are integrated into a unique pool of rules, and the repeated rules are discarded. In our experiments, the Machine Learning algorithms C4.5Rules [27] and Ripper [9] were used to induce the rule sets. The pool of rules is given as input to an Evolutionary Algorithm. A primary key, represented as a natural number, is associated to each rule in the pool. Therefore, each rule can be accessed independently by its key. Our method uses

6 A Hybrid Approach to Learn with Imbalanced Classes the Pittsburgh approach to encode classifiers as chromosomes. Thus, an array of keys is used to represent a chromosome, i.e., a rule set to be interpreted as a classifier. In our implementation, the initial population is randomly composed by 40 chromosomes. The evaluation function used is the AUC metric measured over the training examples. The selection method is the fitness-proportionate selection, in which the number of times a chromosome is expected to reproduce is proportional to its fitness. A simple crossover operator was applied with probability 0.4. The mutation operator alters the value of randomly selected elements in the chromosomes. Thus, it exchanges a random selected rule by other rule. The mutation operator was applied with probability 0.1. Mutation and crossover rates were chosen based on our previous experience with Evolutionary Algorithms [22]. The number of generations is limited to 20. Finally, our implementation uses an elitism operator for population replacement. According to this method, the best chromosome from each population is preserved to the next generation. As previously mentioned, each chromosome represents a classifier. We have fixed the number of rules in each chromosome, aiming to avoid overfitting. As the fitness function is the AUC measured over the training set, allowing the chromosomes to grow up causes the Evolutionary Algorithm to create chromosomes with many rules that do not generalize well. In our experiments, we set the number of rules of each chromosome differently to each data set. This number is approximately the mean number of rules obtained by the classifiers induced over each T i, and varies from 6 up to 20, depending on the data set. Therefore, the classifiers induced by our approach are as complex, in terms of number of rules, as the individual classifiers. 4 Experimental Evaluation Experiments were carried out in order to evaluate the proposed approach. The experiments were performed using three benchmark data sets, collected from the UCI repository [1], and one real-world data set used in [8]. These data sets are related to classification problems, covering different application domains. They are all imbalanced data sets. Table 1 summarizes the main features of these data sets, which are: Identifier identification of the data set used in the text; #Examples the total number of examples; #Attributes (quanti., quali.) the total number of attributes, as well as the number of continuous and nominal attributes; #Classes the number of classes; Missing Data indicates if there are missing values; and, Majority Class Error the expected error of a classifier that always outputs the majority class. Each of the three data sets was divided into 10 pairs of training and test sets using random subsampling. The training examples are 70% of the original

7 Claudia Milaré, Gustavo Batista, André Carvalho Table 1: Data sets summary description. Identifier #Examples #Attributes #Classes Missing Majority Class (quanti., quali.) Data Error Abalone (7, 1) 2 No 2,717% Mammography (6,0) 2 No 2,316% Page-blocks (10,0) 2 No 10,216% data set and the test set 30%. Within each fold, 100 balanced data sets were created with all minority class examples and a random sample of the majority class examples, as previously described. These balanced data sets were given to symbolic Machine Learning algorithms. The symbolic algorithms used in the experiments were C4.5Rules [27] and Ripper [9]. The C4.5Rules and Ripper were run with their default parameters. The classification rules generated for each training set and by each algorithm were combined into a pool of rules. An Evolutionary Algorithm was used to build a classifier from this pool of rules. As described in Section 3, the chromosome size was defined according to the average size of the classifiers generated by C4.5Rules and Ripper. Table 2 shows the size of chromosomes used by the Evolutionary Algorithm for each combination of data set and inducer. Table 2: Chromosomes sizes. Data Set C4.5Rules Ripper Abalone 20 6 Mammography 10 6 Page-blocks Table 3 presents the results obtained with the C4.5Rules inducer. All results are mean AUC values calculated over the 10 pairs of training and test sets. The results are divided into three columns. Column C4.5Rules presents the results obtained by the C4.5Rules inducer over imbalanced data; column Under-sampling presents the results obtained by C4.5Rules over a balanced data set obtained by under-sampling the training set; and, column EA-C4.5Rules presents the results obtained by the proposed approach. As can be observed, EA-C4.5Rules presents the higher mean AUC value for all data sets. Table 4 presents the results obtained with the Ripper inducer. Similarly with the previous table, column Ripper presents the results obtained by the Ripper inducer over imbalanced data; column Under-sampling presents the results obtained by Ripper over a balanced data set obtained by under-sampling the training set; and, column EA-Ripper presents the results obtained by the proposed approach.

8 A Hybrid Approach to Learn with Imbalanced Classes Table 3: Average AUC score for C4.5Rules inducer. Data Set C4.5Rules Under-sampling EA-C4.5Rules Abalone (02.23) (02.27) (00.91) Mammography (03.70) (02.55) (02.87) Page-blocks (01.42) (00.84) (01.09) As can be observed, EA-Ripper also presents the highest mean AUC values for all data sets. Table 4: Average AUC score for Ripper inducer. Data Set Ripper Under-sampling EA-Ripper Abalone (02.49) (03.88) (02.84) Mammography (02.68) (01.82) (02.68) Page-blocks (01.31) (00.80) (00.47) 5 Conclusion and Future Work This work presents a novel hybrid approach to deal with class imbalances that combines symbolic Machine Learning inducers and Evolutionary Algorithms. Our approach uses Evolutionary Algorithms to perform a more extensive search over the hypothesis space. In addition, our approach creates classifiers with the same number of rules as the classifiers induced from balanced data sets obtained by under-sampling. Therefore, our method is able to create more precise classifiers, in terms of area under the ROC curve, and with the same complexity as the individual classifiers. As future work, we plan to investigate new metrics to compose rules into a classifier. These metrics indicate which rule should be fired in cases that multiple rules cover an example. These metrics have a direct influence over the performance of classifiers and novel metrics, designed to imbalanced classes, can potentially improve classification on imbalanced data sets. Acknowledgements. This research was partially supported by the Brazilian research agencies FAPESP and CNPq. References [1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

9 Claudia Milaré, Gustavo Batista, André Carvalho [2] J. A. Baranauskas and M. C. Monard. Combining symbolic classifiers from multiple inducers. Knowledge Based Systems, 16(3): , Elsevier Science. [3] G. E. A. P. A. Batista, C. R. Milaré, R. C. Prati, and M. C. Monard. A comparison of methods for rule subset selection applied to associative classification. Inteligencia Artificial, (32):29 35, [4] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations Newsletter, 6(1):20 29, Special issue on Learning from Imbalanced Datasets. [5] F. C. Bernardini, R. C. Prati, and M. C. Monard. Evolving sets of symbolic classifiers into a single symbolic classifier using genetic algorithms. In International Conference on Hybrid Intelligent Systems, pages , Washington, DC, USA, IEEE Computer Society. [6] L. Breiman. Bagging predictors. Machine Learning, 24(2): , [7] P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In International Conference on Knowledge Discovery and Data Mining, pages , [8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16: , [9] W. Cohen. Fast effective rule induction. In International Conference on Machine Learning, pages , [10] T. G. Dietterich. Machine Learning Research: Four Current Directions. Technical report, Oregon State University, [11] T. Fawcett. Roc graphs: Notes and practical considerations for researchers. Technical Report HPL , HP Labs, [12] T. Fawcett and F. J. Provost. Adaptive fraud detection. Journal of Data Mining and Knowledge Discovery, 1(3): , [13] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The kdd process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27 34, 1996.

10 A Hybrid Approach to Learn with Imbalanced Classes [14] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): , [15] A. Ghosh and B. Nath. Multi-objective rule mining using genetic algorithms. Information Sciences, 163(1-3): , [16] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, [17] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5): , [18] M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30: , [19] M. Kubat and S. Matwin. Addressing the course of imbalanced training sets: One-sided selection. In International Conference in Machine Learning, pages Morgan Kaufmann, [20] C. X. Ling and C. Li. Data mining for direct mining: Problems and solutions. In International Conference on Knownledge Discovery and Data Mining, pages 73 79, [21] X. Y. Liu, J. Wu, and Z. H. Zhou. Exploratory undersampling for classimbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(2): , [22] C. R. Milaré, G. E. A. P. A. Batista, A. C. P. L. F. Carvalho, and M. C. Monard. Applying genetic and symbolic learning algorithms to extract rules from artificial neural neworks. In Proc. Mexican International Conference on Artificial Intelligence, volume 2972 of LNAI, pages Springer-Verlag, [23] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11: , [24] M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, and C. Brunk. Reducing misclassification costs. In International Conference in Machine Learning, pages , [25] E. P. D. Pednault, B. K. Rosen, and C. Apte. Handling imbalanced data sets in insurance risk modeling. Technical Report RC-21731, IBM Research Report, March 2000.

11 Claudia Milaré, Gustavo Batista, André Carvalho [26] R. C. Prati and P. A. Flach. Roccer: An algorithm for rule learning based on ROC analysis. In International Joint Conference on Artificial Intelligence (IJCAI 2005), pages , [27] J. R. Quinlan. C4.5 Programs for Machine Learning. Morgan Kaufmann Publishers, CA, [28] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2): , [29] S. J. Stolfo, D. W. Fan, W. Lee, A. L. Prodromidis, and P. K. Chan. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI- 97 Workshop on AI Methods in Fraud and Risk Management, pages 83 90, [30] D. H. Wolpert. Stacked generalization. Neural Networks, 5: , 1992.

Learning with Skewed Class Distributions

Learning with Skewed Class Distributions CADERNOS DE COMPUTAÇÃO XX (2003) Learning with Skewed Class Distributions Maria Carolina Monard and Gustavo E.A.P.A. Batista Laboratory of Computational Intelligence LABIC Department of Computer Science

More information

Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

More information

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu

More information

ClusterOSS: a new undersampling method for imbalanced learning

ClusterOSS: a new undersampling method for imbalanced learning 1 ClusterOSS: a new undersampling method for imbalanced learning Victor H Barella, Eduardo P Costa, and André C P L F Carvalho, Abstract A dataset is said to be imbalanced when its classes are disproportionately

More information

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Random Forest Based Imbalanced Data Cleaning and Classification

Random Forest Based Imbalanced Data Cleaning and Classification Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem

More information

DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW

DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW Chapter 40 DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW Nitesh V. Chawla Department of Computer Science and Engineering University of Notre Dame IN 46530, USA Abstract Keywords: A dataset is imbalanced

More information

Data Mining for Direct Marketing: Problems and

Data Mining for Direct Marketing: Problems and Data Mining for Direct Marketing: Problems and Solutions Charles X. Ling and Chenghui Li Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 Tel: 519-661-3341;

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Editorial: Special Issue on Learning from Imbalanced Data Sets

Editorial: Special Issue on Learning from Imbalanced Data Sets Editorial: Special Issue on Learning from Imbalanced Data Sets Nitesh V. Chawla Nathalie Japkowicz Retail Risk Management,CIBC School of Information 21 Melinda Street Technology and Engineering Toronto,

More information

Handling imbalanced datasets: A review

Handling imbalanced datasets: A review Handling imbalanced datasets: A review Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas Educational Software Development Laboratory Department of Mathematics, University of Patras, Greece

More information

Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance

Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz, Ibai Gurrutxaga, and José I. Martín Dept. of Computer

More information

Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection

Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection Philip K. Chan Computer Science Florida Institute of Technolog7 Melbourne, FL 32901 pkc~cs,

More information

Roulette Sampling for Cost-Sensitive Learning

Roulette Sampling for Cost-Sensitive Learning Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca

More information

Maximum Profit Mining and Its Application in Software Development

Maximum Profit Mining and Its Application in Software Development Maximum Profit Mining and Its Application in Software Development Charles X. Ling 1, Victor S. Sheng 1, Tilmann Bruckhaus 2, Nazim H. Madhavji 1 1 Department of Computer Science, The University of Western

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Learning on the Border: Active Learning in Imbalanced Data Classification

Learning on the Border: Active Learning in Imbalanced Data Classification Learning on the Border: Active Learning in Imbalanced Data Classification Şeyda Ertekin 1, Jian Huang 2, Léon Bottou 3, C. Lee Giles 2,1 1 Department of Computer Science and Engineering 2 College of Information

More information

Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

More information

Inductive Learning in Less Than One Sequential Data Scan

Inductive Learning in Less Than One Sequential Data Scan Inductive Learning in Less Than One Sequential Data Scan Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Research Hawthorne, NY 10532 {weifan,haixun,psyu}@us.ibm.com Shaw-Hwa Lo Statistics Department,

More information

Credit Card Fraud Detection Using Meta-Learning: Issues 1 and Initial Results

Credit Card Fraud Detection Using Meta-Learning: Issues 1 and Initial Results From: AAAI Technical Report WS-97-07. Compilation copyright 1997, AAAI (www.aaai.org). All rights reserved. Credit Card Fraud Detection Using Meta-Learning: Issues 1 and Initial Results Salvatore 2 J.

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

A Novel Classification Approach for C2C E-Commerce Fraud Detection

A Novel Classification Approach for C2C E-Commerce Fraud Detection A Novel Classification Approach for C2C E-Commerce Fraud Detection *1 Haitao Xiong, 2 Yufeng Ren, 2 Pan Jia *1 School of Computer and Information Engineering, Beijing Technology and Business University,

More information

Selecting Data Mining Model for Web Advertising in Virtual Communities

Selecting Data Mining Model for Web Advertising in Virtual Communities Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński

More information

BUILDING CLASSIFICATION MODELS FROM IMBALANCED FRAUD DETECTION DATA

BUILDING CLASSIFICATION MODELS FROM IMBALANCED FRAUD DETECTION DATA BUILDING CLASSIFICATION MODELS FROM IMBALANCED FRAUD DETECTION DATA Terence Yong Koon Beh 1, Swee Chuan Tan 2, Hwee Theng Yeo 3 1 School of Business, SIM University 1 yky2k@yahoo.com, 2 jamestansc@unisim.edu.sg,

More information

REVIEW OF ENSEMBLE CLASSIFICATION

REVIEW OF ENSEMBLE CLASSIFICATION Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

How To Solve The Class Imbalance Problem In Data Mining

How To Solve The Class Imbalance Problem In Data Mining IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS 1 A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches Mikel Galar,

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results 1

Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results 1 Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results 1 Salvatore J. Stolfo, David W. Fan, Wenke Lee and Andreas L. Prodromidis Department of Computer Science Columbia University

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

A COMPARATIVE STUDY OF DECISION TREE ALGORITHMS FOR CLASS IMBALANCED LEARNING IN CREDIT CARD FRAUD DETECTION

A COMPARATIVE STUDY OF DECISION TREE ALGORITHMS FOR CLASS IMBALANCED LEARNING IN CREDIT CARD FRAUD DETECTION International Journal of Economics, Commerce and Management United Kingdom Vol. III, Issue 12, December 2015 http://ijecm.co.uk/ ISSN 2348 0386 A COMPARATIVE STUDY OF DECISION TREE ALGORITHMS FOR CLASS

More information

Multiple Classifiers -Integration and Selection

Multiple Classifiers -Integration and Selection 1 A Dynamic Integration Algorithm with Ensemble of Classifiers Seppo Puuronen 1, Vagan Terziyan 2, Alexey Tsymbal 2 1 University of Jyvaskyla, P.O.Box 35, FIN-40351 Jyvaskyla, Finland sepi@jytko.jyu.fi

More information

On the Class Imbalance Problem *

On the Class Imbalance Problem * Fourth International Conference on Natural Computation On the Class Imbalance Problem * Xinjian Guo, Yilong Yin 1, Cailing Dong, Gongping Yang, Guangtong Zhou School of Computer Science and Technology,

More information

Combining Multiple Models Across Algorithms and Samples for Improved Results

Combining Multiple Models Across Algorithms and Samples for Improved Results Combining Multiple Models Across Algorithms and Samples for Improved Results Haleh Vafaie, PhD. Federal Data Corporation Bethesda, MD Hvafaie@feddata.com Dean Abbott Abbott Consulting San Diego, CA dean@abbottconsulting.com

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Sharing Classifiers among Ensembles from Related Problem Domains

Sharing Classifiers among Ensembles from Related Problem Domains Sharing Classifiers among Ensembles from Related Problem Domains Yi Zhang yi-zhang-2@uiowa.edu W. Nick Street nick-street@uiowa.edu Samuel Burer samuel-burer@uiowa.edu Department of Management Sciences,

More information

Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

Interactive Exploration of Decision Tree Results

Interactive Exploration of Decision Tree Results Interactive Exploration of Decision Tree Results 1 IRISA Campus de Beaulieu F35042 Rennes Cedex, France (email: pnguyenk,amorin@irisa.fr) 2 INRIA Futurs L.R.I., University Paris-Sud F91405 ORSAY Cedex,

More information

Preprocessing Imbalanced Dataset Using Oversampling Approach

Preprocessing Imbalanced Dataset Using Oversampling Approach Journal of Recent Research in Engineering and Technology, 2(11), 2015, pp 10-15 Article ID J111503 ISSN (Online): 2349 2252, ISSN (Print):2349 2260 Bonfay Publications Research article Preprocessing Imbalanced

More information

Meta Learning Algorithms for Credit Card Fraud Detection

Meta Learning Algorithms for Credit Card Fraud Detection International Journal of Engineering Research and Development e-issn: 2278-67X, p-issn: 2278-8X, www.ijerd.com Volume 6, Issue 6 (March 213), PP. 16-2 Meta Learning Algorithms for Credit Card Fraud Detection

More information

Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal

Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal Magdalena Graczyk 1, Tadeusz Lasota 2, Bogdan Trawiński 1, Krzysztof Trawiński 3 1 Wrocław University of Technology,

More information

Expert Systems with Applications

Expert Systems with Applications Expert Systems with Applications 36 (2009) 4626 4636 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa Handling class imbalance in

More information

Sample subset optimization for classifying imbalanced biological data

Sample subset optimization for classifying imbalanced biological data Sample subset optimization for classifying imbalanced biological data Pengyi Yang 1,2,3, Zili Zhang 4,5, Bing B. Zhou 1,3 and Albert Y. Zomaya 1,3 1 School of Information Technologies, University of Sydney,

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

An Analysis of Four Missing Data Treatment Methods for Supervised Learning

An Analysis of Four Missing Data Treatment Methods for Supervised Learning An Analysis of Four Missing Data Treatment Methods for Supervised Learning Gustavo E. A. P. A. Batista and Maria Carolina Monard University of São Paulo - USP Institute of Mathematics and Computer Science

More information

Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms

Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms Johan Perols Assistant Professor University of San Diego, San Diego, CA 92110 jperols@sandiego.edu April

More information

Situational Awareness at Internet Scale: Detection of Extremely Rare Crisis Periods

Situational Awareness at Internet Scale: Detection of Extremely Rare Crisis Periods Situational Awareness at Internet Scale: Detection of Extremely Rare Crisis Periods 2008 Sandia Workshop on Data Mining and Data Analysis David Cieslak, dcieslak@cse.nd.edu, http://www.nd.edu/~dcieslak/,

More information

Learning Decision Trees for Unbalanced Data

Learning Decision Trees for Unbalanced Data Learning Decision Trees for Unbalanced Data David A. Cieslak and Nitesh V. Chawla {dcieslak,nchawla}@cse.nd.edu University of Notre Dame, Notre Dame IN 46556, USA Abstract. Learning from unbalanced datasets

More information

A Comparative Assessment of the Performance of Ensemble Learning in Customer Churn Prediction

A Comparative Assessment of the Performance of Ensemble Learning in Customer Churn Prediction The International Arab Journal of Information Technology, Vol. 11, No. 6, November 2014 599 A Comparative Assessment of the Performance of Ensemble Learning in Customer Churn Prediction Hossein Abbasimehr,

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects (DMCOMO)

Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects (DMCOMO) Cost Drivers of a Parametric Cost Estimation Model for Mining Projects (DMCOMO) Oscar Marbán, Antonio de Amescua, Juan J. Cuadrado, Luis García Universidad Carlos III de Madrid (UC3M) Abstract Mining is

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

Minority Report in Fraud Detection: Classification of Skewed Data

Minority Report in Fraud Detection: Classification of Skewed Data ABSTRACT Minority Report in Fraud Detection: Classification of Skewed Data This paper proposes an innovative fraud detection method, built upon existing fraud detection research and Minority Report, to

More information

Mining Life Insurance Data for Customer Attrition Analysis

Mining Life Insurance Data for Customer Attrition Analysis Mining Life Insurance Data for Customer Attrition Analysis T. L. Oshini Goonetilleke Informatics Institute of Technology/Department of Computing, Colombo, Sri Lanka Email: oshini.g@iit.ac.lk H. A. Caldera

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

USING DATA MINING FOR BANK DIRECT MARKETING: AN APPLICATION OF THE CRISP-DM METHODOLOGY

USING DATA MINING FOR BANK DIRECT MARKETING: AN APPLICATION OF THE CRISP-DM METHODOLOGY USING DATA MINING FOR BANK DIRECT MARKETING: AN APPLICATION OF THE CRISP-DM METHODOLOGY Sérgio Moro and Raul M. S. Laureano Instituto Universitário de Lisboa (ISCTE IUL) Av.ª das Forças Armadas 1649-026

More information

Review of Ensemble Based Classification Algorithms for Nonstationary and Imbalanced Data

Review of Ensemble Based Classification Algorithms for Nonstationary and Imbalanced Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 1, Ver. IX (Feb. 2014), PP 103-107 Review of Ensemble Based Classification Algorithms for Nonstationary

More information

Fraud Detection for Online Retail using Random Forests

Fraud Detection for Online Retail using Random Forests Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

II. RELATED WORK. Sentiment Mining

II. RELATED WORK. Sentiment Mining Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

More information

On the application of multi-class classification in physical therapy recommendation

On the application of multi-class classification in physical therapy recommendation RESEARCH Open Access On the application of multi-class classification in physical therapy recommendation Jing Zhang 1,PengCao 1,DouglasPGross 2 and Osmar R Zaiane 1* Abstract Recommending optimal rehabilitation

More information

Colon cancer survival prediction using ensemble data mining on SEER data

Colon cancer survival prediction using ensemble data mining on SEER data Colon cancer survival prediction using ensemble data mining on SEER data Reda Al-Bahrani, Ankit Agrawal, Alok Choudhary Dept. of Electrical Engg. and Computer Science Northwestern University Evanston,

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble Dr. Hongwei Patrick Yang Educational Policy Studies & Evaluation College of Education University of Kentucky

More information

A Comparative Evaluation of Meta-Learning Strategies over Large and Distributed Data Sets

A Comparative Evaluation of Meta-Learning Strategies over Large and Distributed Data Sets A Comparative Evaluation of Meta-Learning Strategies over Large and Distributed Data Sets Andreas L. Prodromidis and Salvatore J. Stolfo Columbia University, Computer Science Dept., New York, NY 10027,

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

The class imbalance problem in pattern classification and learning. V. García J.S. Sánchez R.A. Mollineda R. Alejo J.M. Sotoca

The class imbalance problem in pattern classification and learning. V. García J.S. Sánchez R.A. Mollineda R. Alejo J.M. Sotoca The class imbalance problem in pattern classification and learning V. García J.S. Sánchez R.A. Mollineda R. Alejo J.M. Sotoca Pattern Analysis and Learning Group Dept.de Llenguatjes i Sistemes Informàtics

More information

Electronic Payment Fraud Detection Techniques

Electronic Payment Fraud Detection Techniques World of Computer Science and Information Technology Journal (WCSIT) ISSN: 2221-0741 Vol. 2, No. 4, 137-141, 2012 Electronic Payment Fraud Detection Techniques Adnan M. Al-Khatib CIS Dept. Faculty of Information

More information

A Robust Method for Solving Transcendental Equations

A Robust Method for Solving Transcendental Equations www.ijcsi.org 413 A Robust Method for Solving Transcendental Equations Md. Golam Moazzam, Amita Chakraborty and Md. Al-Amin Bhuiyan Department of Computer Science and Engineering, Jahangirnagar University,

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Classification for Fraud Detection with Social Network Analysis

Classification for Fraud Detection with Social Network Analysis Classification for Fraud Detection with Social Network Analysis Miguel Pironet 1, Claudia Antunes 1, Pedro Moura 2, João Gomes 2 1 Instituto Superior Técnico - Alameda, Departamento de Engenharia Informática,

More information

Performance Analysis of Decision Trees

Performance Analysis of Decision Trees Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India

More information

Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction

Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Huanjing Wang Western Kentucky University huanjing.wang@wku.edu Taghi M. Khoshgoftaar

More information

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Patricia E.N. Lutu Department of Computer Science, University of Pretoria, South Africa Patricia.Lutu@up.ac.za

More information

How To Detect Fraud With Reputation Features And Other Features

How To Detect Fraud With Reputation Features And Other Features Detection Using Reputation Features, SVMs, and Random Forests Dave DeBarr, and Harry Wechsler, Fellow, IEEE Computer Science Department George Mason University Fairfax, Virginia, 030, United States {ddebarr,

More information

A Novel Ensemble Learning-based Approach for Click Fraud Detection in Mobile Advertising

A Novel Ensemble Learning-based Approach for Click Fraud Detection in Mobile Advertising A Novel Ensemble Learning-based Approach for Click Fraud Detection in Mobile Advertising Kasun S. Perera 1, Bijay Neupane 1, Mustafa Amir Faisal 2, Zeyar Aung 1, and Wei Lee Woon 1 1 Institute Center for

More information

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

More information