Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ
David Cieslak and Nitesh Chawla
University of Notre Dame, Notre Dame IN 46556, USA

Abstract. Many machine learning applications like finance, medicine, and risk management suffer from class imbalance: cases of interest occur rarely. Further complicating these applications is that the training and testing samples might differ significantly in their respective class distributions. Sampling has been shown to be a strong solution to imbalance and additionally offers a rich parameter space from which to select classifiers. This paper is concerned with the interaction between Probability Estimation Trees (PETs) [1], sampling, and performance metrics as testing distributions fluctuate substantially. A set of comprehensive analyses is presented, which anticipate classifier performance through a set of widely varying testing distributions.

1 Introduction

Finance, medicine, and risk management form the basis for many machine learning applications. A compelling aspect of these applications is that they present several challenges to the machine learning community. The common thread among these challenges remains class imbalance and cost-sensitive application, which has been a focus of significant recent work [2, 3]. However, the common assumption behind most of the related work is that the testing data carries the same class distribution as the training data. This assumption becomes limiting for classifiers learned on imbalanced datasets, as the learning usually follows a prior sampling stage to mitigate the effect of observed imbalance. This is, effectively, guided by the premise of improving the prediction on the minority class as measured by some evaluation function. Thus, it becomes important to understand the interaction between sampling methods, classifier learning, and evaluation functions when the class distributions change.

To illustrate, a disease may occur naturally in 5% of a North American population. However, an epidemic condition may drastically increase the rate of infection to 45%, instigating differences in P(disease) between the training and testing datasets. Thus, the class distribution between negative and positive classes changes significantly. Scalar evaluations of a classifier learned on the original population will not offer a reasonable expectation for performance during the epidemic. A separate but related problem occurs when a model trained on a segment of the North American population is then applied to a European population where the distribution of measured features can potentially differ significantly, even if the disease base-rate remains at the original 5%. This issue becomes critical as the learned classifiers are optimized on the sampling distributions spelled out during training to increase performance on the minority or positive
class, as measured by some evaluation function. If sampling is the strong governing factor for performance on imbalanced datasets, can we guide the sampling to have more effective generalization?

Contributions: We present a comprehensive empirical study investigating the effects of changing distributions on a combination of sampling methods and classifier learning. In addition, we also study the robustness of certain evaluation measures. We consider two popular sampling methods for countering class imbalance: undersampling and SMOTE [2, 4]. To determine the optimal levels of sampling (under and/or SMOTE), we use a brute-force wrapper method with cross-validation that optimizes on different evaluation measures, namely Negative Cross Entropy (NCE), Brier Score (Brier), and Area Under the ROC Curve (AUROC), on the original training distribution. The first two focus on the quality of probability estimates; the last focuses on rank-ordering. The guiding question here is: which is more effective, improved quality of estimates or improved rank-ordering, if the eventual testing distribution changes? We use the wrapper to empirically discover the potentially best sampling amounts for the given classifier and evaluation measure. This allows us to draw observations on the suitability of popular sampling methods, in conjunction with the evaluation measures, on evolving testing distributions. We restrict our study to PETs [1] given their popularity in the literature. This also allows for a more focused analysis. Essentially, we used unpruned C4.5 decision trees [5] and considered both leaf-frequency based probability estimates and Laplace smoothed estimates. We also present an analysis of the interaction between measures used for parameter discovery and evaluation. Is a single evaluation measure more universal than the others, especially under changing distributions?

2 Sampling Methods

Resampling is a prevalent, highly parameterizable treatment of the class imbalance problem with a large search space. Typically resampling improves positive class accuracy and rank-order [6, 7, 8, 2]. To our knowledge, there is no empirical literature detailing the effects of sampling on the quality of probability estimates; however, it is established that sampling improves rank-order. This study examines two sampling methods: random undersampling and SMOTE [9]. While seemingly primitive, randomly removing majority class examples has been shown to improve performance in class imbalance problems. Some training information is lost, but this is counterbalanced by the improvement in minority class accuracy and rank-order. SMOTE is an advanced oversampling method which generates synthetic examples at random intervals between known positive examples. [2] provides the most comprehensive survey and comparison of current sampling methods.

We search a large sampling space via a wrapper [10], using a heuristic to limit the search. This strategy first removes excess negative examples by undersampling in fixed decrements and then synthetically adds positive examples in fixed increments using SMOTE. Each phase ceases when the wrapper's objective function no longer improves after three successive samplings. We use Brier, NCE, and AUROC [11, 12] both as objective functions to guide the wrapper and as final evaluation metrics. Figure 1 shows the Wrapper and Evaluation framework.
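To make the wrapper concrete, the following is a minimal sketch in Python, assuming scikit-learn and imbalanced-learn, binary labels coded 0/1, and illustrative step sizes (the paper's exact sampling percentages are not reproduced here); the names wrapper_search and cv_score are hypothetical, not the authors' code.

```python
# Minimal sketch of the two-phase sampling wrapper, under stated assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def objective(metric, y_true, p_pos):
    """Return a 'larger is better' value for AUROC, Brier, or NCE."""
    if metric == "auroc":
        return roc_auc_score(y_true, p_pos)
    p = np.clip(p_pos, 1e-6, 1 - 1e-6)   # guard raw leaf frequencies of 0/1
    if metric == "brier":
        return -brier_score_loss(y_true, p)
    return -log_loss(y_true, p)           # NCE as mean negative log-likelihood

def cv_score(X, y, keep_neg, smote_amt, metric, seed=0):
    """Cross-validated objective for one (undersample, SMOTE) setting."""
    vals = []
    for tr, va in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
        Xt, yt = X[tr], y[tr]
        # Phase 1 effect: keep only a fraction of the majority (negative) class.
        n_neg = max(int((yt == 0).sum() * keep_neg), 1)
        Xt, yt = RandomUnderSampler(
            sampling_strategy={0: n_neg, 1: int((yt == 1).sum())},
            random_state=seed).fit_resample(Xt, yt)
        # Phase 2 effect: synthesize smote_amt x 100% extra positives.
        if smote_amt > 0:
            n_pos = int((yt == 1).sum() * (1 + smote_amt))
            Xt, yt = SMOTE(sampling_strategy={1: n_pos},
                           random_state=seed).fit_resample(Xt, yt)
        pet = DecisionTreeClassifier(random_state=seed).fit(Xt, yt)  # unpruned tree
        vals.append(objective(metric, y[va], pet.predict_proba(X[va])[:, 1]))
    return float(np.mean(vals))

def wrapper_search(X, y, metric="nce"):
    """Greedy search: undersampling levels first, then SMOTE levels. Each
    phase stops once the objective fails to improve for three steps."""
    best_u, best = 1.0, cv_score(X, y, 1.0, 0.0, metric)
    stall = 0
    for u in np.arange(0.9, 0.0, -0.1):    # illustrative decrements
        v = cv_score(X, y, u, 0.0, metric)
        best_u, best, stall = (u, v, 0) if v > best else (best_u, best, stall + 1)
        if stall >= 3:
            break
    best_s, stall = 0.0, 0
    for s in np.arange(0.5, 10.0, 0.5):    # illustrative SMOTE increments
        v = cv_score(X, y, best_u, s, metric)
        best_s, best, stall = (s, v, 0) if v > best else (best_s, best, stall + 1)
        if stall >= 3:
            break
    return best_u, best_s   # fraction of negatives kept, SMOTE amount
```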
Fig. 1. Wrapper and Evaluation Framework

Table 1. Dataset distributions, ordered in increasing order of class imbalance.

Dataset            Examples  Class Balance
Adult [13]         48,842    76:24
E-State [9]        5,322     88:12
Pendigits [13]     10,992    90:10
Satimage [13]      6,435     90:10
Forest Cover [13]  38,500    93:7
Oil [14]           937       96:4
Compustat [10]     13,657    96:4
Mammography [9]    11,183    98:2

3 Experiments and Results

We consider performance on different samplings of the testing set to explore the range of potential distributions, evaluating at P(+) = {0.02, 0.05, 0.1, 0.2, 0.3, ..., 0.9, 0.95, 0.98}. For example, evaluating at P(+) = 0.05 on a testing set whose natural positive proportion is higher means randomly removing positive class examples from the evaluation set until that proportion is reached. We experimented on eight different datasets, summarized in Table 1. We explore visualizations of the trends in NCE and AUROC as P(+) is varied. Each plot contains several different classifiers: the baseline PET [1]; sampling guided by Brier (with separate variants for frequency based and Laplace based estimates); sampling guided by NCE; and finally sampling guided by AUROC (the latter two using the same naming convention as Brier). In Figures 2 to 9, NCE and AUROC are depicted as a function of increasing class distribution, ranging from fully negative on the left to fully positive on the right. A vertical line indicates the location of the original class distribution. Brier trends are omitted as they mirror those of NCE.
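As a concrete illustration of this evaluation protocol, here is a minimal sketch, assuming numpy and binary labels coded 0/1; the function name resample_to_prior is hypothetical.

```python
# Sketch: thin a testing set to a target positive-class prior P(+), mirroring
# the evaluation sweep over P(+) described above. Names are illustrative.
import numpy as np

def resample_to_prior(y, p_pos, seed=0):
    """Return indices of a test subset whose class prior is roughly p_pos;
    examples of the over-represented class are randomly removed."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    if len(pos) / len(y) > p_pos:
        # Too many positives for the target: keep all negatives, thin positives.
        n_pos = int(round(p_pos / (1.0 - p_pos) * len(neg)))
        pos = rng.choice(pos, size=min(n_pos, len(pos)), replace=False)
    else:
        # Too few positives for the target: keep all positives, thin negatives.
        n_neg = int(round((1.0 - p_pos) / p_pos * len(pos)))
        neg = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    return np.sort(np.concatenate([pos, neg]))

# Usage: score one fitted PET across the whole sweep of priors.
# priors = [0.02, 0.05] + [round(0.1 * k, 1) for k in range(1, 10)] + [0.95, 0.98]
# for p in priors:
#     idx = resample_to_prior(y_test, p)
#     evaluate(y_test[idx], proba_test[idx])
```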
Fig. 2. Adult

Fig. 3. E-State

Figures 2 through 9 show the experimental NCE and AUROC trends as the class distribution varies. Despite the variety of datasets evaluated, some compelling general trends emerge. Throughout, we note that wrappers guided by losses generally improve NCE at and below the natural distribution of P(+) as compared to AUROC guided wrappers. This implies that loss does well in optimizing NCE when the testing distribution resembles the training conditions. It is notable that in some cases, such as Figures 2, 3, 5, 6, 7, 8, & 9, the baseline classifier actually produces better NCE scores than at least the frequency wrapper, if not both wrappers. The frequency wrapper selected extreme levels of sampling. The reduction in NCE at low P(+) indicates that using loss measures within the wrapper lowers the loss estimates for the negative class examples. That is, while the loss from the positive class may actually increase, the lower overall losses are driven by better calibrated estimates on the predominantly occurring majority class. On the other hand, classifiers learned from the AUROC guided wrappers do not result in as well-calibrated estimates. AUROC favors the positive class rank-order, which in turn selects extreme sampling levels, while Brier and NCE tend to treat both classes equally. Thus, if NCE optimization is desired and the positive class is anticipated to occur as rarely or more rarely than in the training data, sampling should be selected according to either Brier or NCE.

However, the environment producing the data may be quite dynamic, creating a shift in the class ratio and causing the minority class to become much more prevalent. In a complete paradigm shift, the former minority class might become larger than the former majority class, such as in an epidemic. Invariably, there is a cross-over point in
Fig. 4. Pendigits

Fig. 5. Satimage

each dataset after which one of the AUROC wrappers optimizes NCE values. This is logical, as AUROC measures the quality of rank-order in terms of the positive class: extra emphasis is placed on correctly classifying positive examples, which is reflected in the higher selected sampling levels. As the positive examples eventually form the majority of the evaluation set, classifiers producing on average higher quality positive class probability estimates will produce the best NCE. Therefore, if a practitioner anticipates an epidemic-like influx of positive examples, sampling methods guided by AUROC are favored.

Improvement to AUROC under varied testing distributions is not as uniform. We observe that at least one loss function wrapper generally produces better AUROC values in Figures 2, 3, & 4, but that an AUROC wrapper is optimal in Figures 6, 7, & 9. It is difficult to declare a champion in Figures 5 & 8. It is of note that datasets with naturally larger positive classes tend to benefit (in terms of AUROC) from a loss wrapper, while those with naturally smaller positive classes benefit more from the AUROC wrapper. As seen before, AUROC guides a wrapper to higher sampling levels than Brier or NCE. In the cases of relatively few positive examples (such as Forest Cover, Oil, and Mammography), a heavy emphasis during training on these few examples produces better AUROC values. For the datasets with a larger set of positive examples (as in Adult, E-State, and Pendigits) from which to naturally draw, this over-emphasis does not produce as favorable a result. Therefore, in cases where there are very few positive examples, a practitioner should optimize sampling according to AUROC. Otherwise, Brier or NCE optimization is sufficient.
Fig. 6. Forest Cover

Fig. 7. Oil

The difference in characteristics between the trends in NCE and AUROC is noteworthy. The NCE trends appear stable and linear. By calculating the loss on each class at the base distribution, it appears that one is able to project the NCE at any class distribution using a weighted average. AUROC trends are much more volatile, likely owing to the highly perturbable nature of the measure: adding or removing a few examples can heavily impact the produced ranking. As a measure, AUROC is characteristically less predictable than a loss function.

We also note that sampling mitigates the need for Laplace smoothing at the leaves. We can see that the baseline classifier benefits from smoothing, as also noted by other works. However, by treating the dataset for class imbalance first, we are able to counter the bias and variance in estimates arising from small leaf sizes. The wrapper essentially searches for the ideal training distribution by undersampling and/or injecting synthetic minority class instances that lead to a reduction in loss or improvement in ranking.

Throughout Figures 2 to 9, we also note that Brier and NCE loss wrappers tend to perform similarly across measures and datasets. This is not surprising, as the shapes of the Brier and NCE curves are similar. We observe that the optimal sampling levels found by Brier and NCE are similar, certainly more similar than those selected by AUROC. In general, NCE maintains a slight performance edge. If in the interests of time a practitioner may only experiment using one loss measure, then this study recommends using NCE, although the results found here may not apply to all domains and performance metrics.
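The weighted-average projection noted above is simple to state in code. A minimal sketch, assuming binary labels coded 0/1 and positive-class probabilities; the helper names are illustrative, and the last function shows the standard Laplace correction for a binary leaf, (k+1)/(n+2), for reference.

```python
# Sketch: project NCE under a shifted class prior from per-class losses
# measured once at the base distribution. Names here are illustrative.
import numpy as np

def per_class_nce(y, p_pos, eps=1e-6):
    """Mean negative log-likelihood on the positive and negative class."""
    p = np.clip(p_pos, eps, 1 - eps)
    nll = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll[y == 1].mean(), nll[y == 0].mean()

def project_nce(nce_pos, nce_neg, prior_pos):
    # NCE(P) = P * NCE_+ + (1 - P) * NCE_- : linear in the class prior,
    # matching the stable, near-linear NCE curves observed in the figures.
    return prior_pos * nce_pos + (1 - prior_pos) * nce_neg

def laplace_leaf(n_pos, n_total):
    # Laplace-smoothed leaf estimate (k + 1) / (n + 2) for a binary PET,
    # versus the raw leaf frequency k / n.
    return (n_pos + 1) / (n_total + 2)
```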
Fig. 8. Compustat

Fig. 9. Mammography

4 Conclusions

The main focus of our paper was to empirically explore and evaluate the interaction between techniques for countering class imbalance, PETs, and corresponding evaluation measures under circumstances where training and testing samples differ. In light of the questions posited in the Introduction, we make the following key observations.

We demonstrated that it is possible to identify potentially optimal quantities of sampling by optimizing on the quality of estimates or on rank-order as calculated by AUROC. Almost all the wrappers demonstrated significant improvements in AUROC and reductions in losses over the baseline classifier, irrespective of the dataset.

As an evaluation measure, NCE is much more stable and predictable than AUROC. We observe NCE to change almost linearly as a function of P(+), while AUROC changes far less predictably as P(+) changes.

There is a strong inter-play between undersampling and SMOTE. The wrapper determines an interaction between the two approaches by searching undersampling parameters before oversampling via SMOTE.

It is much more difficult to anticipate the effects of a class distribution shift on AUROC than it is on probability loss functions. When a dataset is highly imbalanced, we recommend guiding sampling through AUROC, as this places the necessary emphasis on the minority class. When class imbalance is much more moderate, NCE tends to produce an improved AUROC.
While Laplace smoothing has a profound effect in improving both the quality of estimates and ranking for the baseline classifier, the advantage diminishes with sampling methods. The combination of SMOTE and undersampling improves the calibration at the leaves, and thus we observed that wrapper based sampling methods are able to improve performance, with lower losses and higher ranking, irrespective of smoothing at the leaves.

References

1. Provost, F., Domingos, P.: Tree Induction for Probability-Based Ranking. Machine Learning 52(3), 199-215 (2003)
2. Batista, G., Prati, R., Monard, M.: A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explorations 6(1), 20-29 (2004)
3. Chawla, N., Japkowicz, N., Kolcz, A.: Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explorations 6(1), 1-6 (2004)
4. Estabrooks, A., Jo, T., Japkowicz, N.: A Multiple Resampling Method for Learning from Imbalanced Data Sets. Computational Intelligence 20(1), 18-36 (2004)
5. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1992)
6. Japkowicz, N.: The Class Imbalance Problem: Significance and Strategies. In: ICAI 2000 (2000)
7. Ling, C.X., Li, C.: Data Mining for Direct Marketing: Problems and Solutions. In: KDD 1998, pp. 73-79 (1998)
8. Solberg, A., Solberg, R.: A Large-Scale Evaluation of Features for Automatic Detection of Oil Spills in ERS SAR Images. In: IEEE Symp. Geosci. Rem. Sens., vol. 3 (1996)
9. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. JAIR 16, 321-357 (2002)
10. Chawla, N.V., Cieslak, D.A., Hall, L.O., Joshi, A.: Automatically Countering Imbalance and its Empirical Relationship to Cost. Utility-Based Data Mining: A Special Issue of the International Journal Data Mining and Knowledge Discovery (2008)
11. Buja, A., Stuetzle, W., Shen, Y.: Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications (under submission, 2006)
12. Caruana, R., Niculescu-Mizil, A.: An Empirical Comparison of Supervised Learning Algorithms. In: ICML 2006, pp. 161-168 (2006)
13. Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007)
14. Kubat, M., Holte, R., Matwin, S.: Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning 30, 195-215 (1998)