Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics
Journal of Computational Information Systems 8 (2012)

Yuanfang DONG 1,2, Xiongfei LI 1, Jun LI 3, Haiying ZHAO 4

1 Key Laboratory of Symbolic Computation and Knowledge Engineering for Ministry of Education, Jilin University, Changchun, China
2 School of Economics and Management, Changchun University of Science and Technology, Changchun, China
3 Department of Mathematics, Changchun University of Science and Technology, Changchun, China
4 School of Computer Science and Technology, Xinjiang Normal University, Wulumuqi, China

Abstract

A performance evaluation model, weighted AUC (wAUC), is proposed as a better way to evaluate classifiers learned from imbalanced data. When computing the weighted area under the ROC curve, the weight varies with the value of the true positive rate (TPrate) across regions, so as to emphasize the accuracy on the minority class, which is usually the more important one. AUC is a special case of wAUC, and wAUC is compared with other common performance evaluation measures through isometric analysis. The experimental results show that wAUC can distinguish classifiers that have the same AUC value.

Keywords: Machine Learning; Classification; Imbalanced Data; Performance Evaluation; Isometric

1 Introduction

Most learning algorithms assume that the data have a balanced class distribution. In the real world, however, imbalanced classes and skewed class distributions are common [1]. In the imbalanced two-class case, the minority class is usually called the positive class and the majority class the negative class. When dealing with data with a skewed class distribution, the majority class dominates and the classification boundary is biased toward it, so the ability of classic classification algorithms to predict the minority class decreases and the overall prediction performance suffers.
Project supported by the National Science and Technology Support Program (No. 26BAKA33), the Technology Development Program of Jilin Province (No. 2974), and the Natural Science Foundation of Jilin Province (No. 252). Corresponding author. Email address: xiongfei@jlu.edu.cn (Xiongfei LI). Copyright 2012 Binary Information Press, January 2012.
It is easy to draw wrong conclusions if the common Accuracy or Error rate is adopted as the performance evaluation measure under an imbalanced class distribution, because neither considers misclassification cost. A number of measures have been proposed for evaluating classifiers on imbalanced data. These measures fall into two categories: numerical and graphical. Numerical measures, including Accuracy, Precision, Recall, F-measure, Gmean and AUC, give a single value characterizing a classifier's performance. Graphical measures, including ROC curves, precision-recall curves, cost curves, and so on [2, 3], draw images, especially two- and three-dimensional images that are easy to inspect. The ROC curve (Receiver Operating Characteristic curve) is the most widely used and most deeply studied performance evaluation method [4]. It is intuitive, easy to understand, and simple to use. The ROC curve first appeared in signal detection studies, where it captured the trade-off between the true positive rate and the false positive rate [5]. Spackman introduced ROC analysis to the field of machine learning [6] for evaluating and comparing algorithms.

2 Related Work

The confusion matrix shown in Table 1 describes the distribution of sample classifications and is the basis for computing several classifier performance measures.

Table 1: Confusion matrix

                   predicted positive      predicted negative
actual positive    True Positives (TP)     False Negatives (FN)
actual negative    False Positives (FP)    True Negatives (TN)

For two-class problems, the Accuracy (Acc) and the Error rate (Err) are easily derived from the confusion matrix in Table 1.

Acc = (TP + TN) / (TP + FN + TN + FP).    (1)

Err = (FP + FN) / (TP + FN + TN + FP).    (2)

These two measures are sensitive to class imbalance and are overly biased toward the majority class.
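As a sketch of Eqs. (1)-(2) and their majority-class bias (in Python; the counts are illustrative, not from the paper's experiments):

```python
# Accuracy and Error rate from the confusion matrix, Eqs. (1)-(2).
def acc_err(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    err = (fp + fn) / total
    return acc, err

# Imbalanced example: 10 positives vs. 990 negatives. A classifier that
# predicts "negative" for every sample still reaches 99% accuracy.
acc, err = acc_err(tp=0, fn=10, fp=0, tn=990)
print(acc, err)  # 0.99 0.01
```

The trivial all-negative classifier scores well on both measures while missing every minority-class sample, which is exactly the failure mode discussed above.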
So, using Acc or Err leads to erroneous performance comparisons on imbalanced data [7, 8]. The true positive rate is denoted by TPrate and the true negative rate by TNrate; the false positive rate and the false negative rate are denoted by FPrate and FNrate, respectively:

TPrate = TP / (TP + FN),  FPrate = FP / (TN + FP),  TNrate = TN / (TN + FP),  FNrate = FN / (TP + FN).

The true positive rate, also known as Recall, is the percentage of positive-class samples that are correctly classified. Precision is the fraction of samples predicted as positive that actually belong to the positive class, namely:

Precision = TP / (TP + FP).    (3)
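These definitions can likewise be sketched in code; the counts below are hypothetical, chosen only to show that a high TPrate can coexist with a low Precision under imbalance:

```python
# The four rates and Precision (Eq. (3)) from confusion-matrix counts.
def rates(tp, fn, fp, tn):
    tprate = tp / (tp + fn)       # Recall
    fprate = fp / (tn + fp)
    tnrate = tn / (tn + fp)
    fnrate = fn / (tp + fn)
    precision = tp / (tp + fp)
    return tprate, fprate, tnrate, fnrate, precision

# Hypothetical counts: 8 of 10 positives found, but 90 false alarms
# among 990 negatives, so Precision = 8/98 although TPrate = 0.8.
tpr, fpr, tnr, fnr, prec = rates(tp=8, fn=2, fp=90, tn=900)
```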
F-measure combines Precision with Recall; the higher the F-measure, the better the model performs on the positive class [9].

F-measure = ((1 + β^2) · recall · precision) / (β^2 · recall + precision).    (4)

Gmean, proposed by Kubat et al., is the geometric mean of the prediction accuracies on the positive class and the negative class, and an important measure for avoiding over-fitting to the negative class [10].

Gmean = sqrt(TPrate · TNrate).    (5)

On the ROC curve, the y-axis represents TPrate and the x-axis represents FPrate. Each point on the ROC curve corresponds to a classifier model. If a classifier is a ranker or a scoring classifier, a threshold on the score transforms it into a discrete classifier, and the ROC curve shows what happens to the corresponding confusion matrix at each possible threshold [4]. The area under the ROC curve, AUC, is the quantitative expression of the ROC curve and can be calculated by Eq. (6).

AUC = ∫₀¹ y dx.    (6)

3 A Weighted Classifier Performance Measure: wAUC

3.1 The choice of weights

The shortcoming of traditional AUC is that it does not consider cost bias: it adopts the same weight (namely 1, assuming equal cost) for every region when calculating AUC. For two-class imbalanced data, the accuracy on the positive class (TPrate) deserves more attention than the accuracy on the negative class (TNrate), but AUC does not identify each class's contribution to the overall performance. This means that different combinations of TPrate and TNrate can lead to the same AUC value. For example, a classifier A with a low TPrate and a classifier B with (FPrate, TPrate) = (0.4, 0.9) can share the same AUC value of 0.75. But for the imbalanced data classification problem, misclassifying positive-class samples is more expensive.
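For a single discrete classifier, the ROC "curve" is the polyline (0,0)-(FPrate,TPrate)-(1,1), so Eq. (6) reduces to AUC = (1 + TPrate - FPrate)/2. The following sketch (operating points chosen for illustration) shows two classifiers sharing AUC = 0.75, which is exactly the ambiguity wAUC is designed to resolve:

```python
import math

def auc_single(fpr, tpr):
    # Area under the polyline (0,0)-(fpr,tpr)-(1,1): a triangle plus a
    # trapezoid, which simplifies to (1 + tpr - fpr) / 2.
    return 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)

def gmean(tprate, tnrate):
    return math.sqrt(tprate * tnrate)   # Eq. (5)

# Two different trade-offs with identical AUC: AUC alone cannot rank them.
print(round(auc_single(0.25, 0.75), 10))   # 0.75
print(round(auc_single(0.45, 0.95), 10))   # 0.75
```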
The accuracy on the positive class should contribute more to the overall performance than the accuracy on the negative class, and so classifier B performs better than classifier A. To reflect this, different weights should be applied to different regions under the ROC curve, with greater weight on the regions of higher TPrate. Let the weighting function be g(y); it should satisfy the following conditions:

(1) Non-negativity: for all y ∈ [0, 1], g(y) ≥ 0.
(2) Normativity: ∫₀¹ g(y) dy = 1, as g(y) is a weight density function.
(3) Monotonicity: g(y) is monotonically increasing, which reflects that regions with higher TPrate receive greater weight.
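The three conditions can be checked numerically for a candidate weighting function. The sketch below uses a midpoint-rule integral and, as an example, the linear weight g(y) = y + 1/2, which satisfies all three conditions:

```python
def check_weight(g, n=100_000):
    # Check non-negativity, unit integral (midpoint rule) and
    # monotonic increase of g on [0, 1].
    ys = [(i + 0.5) / n for i in range(n)]
    vals = [g(y) for y in ys]
    nonneg = all(v >= 0 for v in vals)
    integral = sum(vals) / n
    monotone = all(a <= b for a, b in zip(vals, vals[1:]))
    return nonneg, integral, monotone

nonneg, integral, monotone = check_weight(lambda y: y + 0.5)
print(nonneg, monotone, round(integral, 6))  # True True 1.0
```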
The weighting function g(y) can be as simple as a line, or any other curve satisfying the above conditions. The curve formed by mapping the ROC curve is called the weighted ROC curve, wROC, and the space formed by mapping ROC space is called weighted ROC space. If a classifier corresponds to a point (x, y) in ROC space, it corresponds to the following point in weighted ROC space:

Φ : (x, y) → (1 - (1 - x)·g(y), y).    (7)

The point (0, 1) is mapped to (1 - g(1), 1), while the point (1, 1) is mapped to (1, 1).

3.2 Weighted AUC

AUC can also be understood as the area to the right of the ROC curve in ROC space. That is, taking y as the integration variable and x = f(y) as the equation of the ROC curve, the following formula can be used to calculate AUC:

AUC = ∫₀¹ (1 - f(y)) dy = 1 - ∫₀¹ f(y) dy.    (8)

Definition 1. Let x = f(y) be the ROC curve equation and g(y) the weighting function. The weighted AUC is defined as

wAUC = ∫₀¹ g(y)·(1 - f(y)) dy.    (9)

By this definition, wAUC is geometrically the region enclosed by three curves: the wROC curve, the x-axis and the line x = 1. By the normativity of the weighting function g(y), it is easy to see that

wAUC = ∫₀¹ g(y)·(1 - f(y)) dy = ∫₀¹ g(y) dy - ∫₀¹ g(y)·f(y) dy = 1 - ∫₀¹ g(y)·f(y) dy.    (10)

Eq. (10) can be used in the calculation; however, Eq. (9) better expresses the aim of weighting the different regions. For a single classifier (x, y), the wROC curve is the polyline connecting the images of (0, 0), (x, y) and (1, 1) under Φ, and the region enclosed by the wROC curve, the x-axis and the line x = 1 is the classifier's wAUC:

wAUC(x, y) = (1/2)·[(1 - x)·g(y) + g(0)]·y + (1/2)·(1 - x)·g(y)·(1 - y)
           = (1/2)·g(0)·y + (1/2)·(1 - x)·g(y).

With g(y) = y + 1/2, the classifiers A and B above obtain different wAUC values, with classifier B's the larger; this shows that classifier B has the better performance.

4 Experimental Analysis
4.1 Isometric analysis

ROC isometrics are collections of points in ROC space that share the same value of a performance metric. Flach and Fürnkranz et al. investigated isometrics to understand metrics [11, 12]. Moreover,
isometrics can also be used for the task of classifier selection, to construct reliable classifiers. Provost and Fawcett [13] use isometrics to determine the optimal point on an ROC convex hull. The term isometric seems to originate with Vilalta and Oblinger [14]; they give an isometric plot for information gain, but their analysis is quantitative in nature and the connection to ROC analysis is not made. Recall that isometrics are collections of points with the same value for the metric. Generally speaking, in 2D ROC space isometrics are lines or curves. Below we consider three types of isometric landscapes: (a) parallel linear isometrics (Accuracy); (b) non-parallel linear isometrics (Precision, F-measure); and (c) non-linear isometrics (decision tree splitting criteria). Type (a) means that the metric applies the same positive/negative trade-off throughout ROC space; type (b) means that the trade-off varies with the value of the metric; and type (c) means that the trade-off varies even for a single value of the metric. Fig. 1 illustrates the isometrics of Accuracy, which are linear and parallel.

Fig. 1: Accuracy isometrics

For a single classifier, the ROC curve is the polyline through the three points (0, 0), (FPrate, TPrate) and (1, 1), and AUC is the area under this curve. In this case AUC = (TPrate + 1 - FPrate)/2, which coincides with Accuracy when the class distribution is balanced. The AUC isometrics are linear and parallel, as shown in Fig. 2.

Fig. 2: AUC isometrics

F-measure is commonly used in the field of information retrieval. This measure is insensitive to how the incorrect predictions are distributed over the classes. The isometrics of F-measure are linear but non-parallel, as shown in Fig. 3.
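The type (a) behaviour of Accuracy can be verified numerically. This is a sketch; the positive-class fraction pos_frac is an assumed parameter, since Accuracy in ROC coordinates depends on the class ratio:

```python
def accuracy(fpr, tpr, pos_frac):
    # Accuracy at ROC point (fpr, tpr) when a fraction pos_frac of the
    # data is positive: Acc = pos_frac*TPrate + (1 - pos_frac)*TNrate.
    return tpr * pos_frac + (1 - fpr) * (1 - pos_frac)

# Points along a line of slope (1 - pos_frac)/pos_frac in ROC space all
# share the same accuracy, so accuracy isometrics are parallel lines.
pos_frac = 0.2
slope = (1 - pos_frac) / pos_frac
line = [(f, 0.5 + slope * (f - 0.1)) for f in (0.10, 0.15, 0.20)]
print([round(accuracy(f, t, pos_frac), 6) for f, t in line])  # [0.82, 0.82, 0.82]
```

Changing pos_frac changes the common slope but not the parallelism, which is why the trade-off is the same everywhere in ROC space for this metric.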
Fig. 3: F-measure isometrics

By the definition of Gmean, Gmean expresses both the accuracy on the positive class and the accuracy on the negative class. The Gmean isometrics are non-linear and non-parallel, as shown in Fig. 4. It is interesting to note that versions of precision with a shifted rotation point occur frequently in machine learning [15]. The precision isometrics are linear and non-parallel, as shown in Fig. 5.

Fig. 4: Gmean isometrics
Fig. 5: Precision isometrics

Fig. 6 illustrates the isometrics of wAUC in the case that the weighting function is a linear function g(y) = ay + b. We note that the wAUC isometrics are non-linear. Besides, from the normal
directions of the wAUC isometrics, we find that classifiers with higher accuracy on the positive class receive more attention when wAUC is adopted. If the weighting function is set to the exponential function g(y) = e^y / (e - 1), the wAUC isometrics, shown in Fig. 7, are also non-linear. The normal directions indicate that the exponentially weighted wAUC pays even more attention to classifiers with high positive-class accuracy than the linearly weighted wAUC does.

Fig. 6: wAUC isometrics (linear weighting function)
Fig. 7: wAUC isometrics (exponential weighting function)

4.2 Simulation analysis

Linear case: let g(y) = ay + b. Since g(y) is monotonically increasing, a ≥ 0. By normativity, ∫₀¹ (ay + b) dy = a/2 + b = 1, so b = 1 - a/2 and g(y) = ay + 1 - a/2. Then:

wAUC(x, y) = (1/2)·g(0)·y + (1/2)·(1 - x)·g(y)
           = (1/2)·(1 - a/2)·y + (1/2)·(1 - x)·(ay + 1 - a/2)
           = (1/2 + a/4)·y + (1/2 - a/4)·(1 - x) - (a/2)·xy.

Exponential case: let g(y) = e^y / (e - 1), which is monotonically increasing, and normativity holds since ∫₀¹ e^y dy = e - 1. Then:

wAUC(x, y) = (1/2)·g(0)·y + (1/2)·(1 - x)·g(y) = (y + (1 - x)·e^y) / (2·(e - 1)).

The main purpose of wAUC is to weight a measure so that it suits performance evaluation in imbalanced domains; the weighting factor favors results with better classification rates on the minority class. Table 2 describes the values of the different measures in the case AUC = 0.75. Note that θ5 produces a perfect balance between TPrate and TNrate, θ1 is biased toward the accuracy on the negative class, and θ9 toward the accuracy on the positive class. All θi correspond to the same AUC value of 0.75. Accuracy would select the biased θ1 because it strongly depends on the majority-class rate. Gmean suggests the most balanced configurations, ignoring the fact that the minority class is usually the most important.
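The two closed forms above can be evaluated directly. The sketch below assumes, for illustration, the operating points θi = (FPrate, TPrate) = (0.05i, 0.5 + 0.05i), i = 1, ..., 9, which all satisfy AUC = (1 + TPrate - FPrate)/2 = 0.75:

```python
import math

def wauc_linear(x, y, a):
    # Linear case: g(y) = a*y + 1 - a/2, with a >= 0 (closed form above).
    return (0.5 + a / 4) * y + (0.5 - a / 4) * (1 - x) - (a / 2) * x * y

def wauc_exp(x, y):
    # Exponential case: g(y) = e^y / (e - 1) (closed form above).
    return (y + (1 - x) * math.exp(y)) / (2 * (math.e - 1))

# Nine assumed operating points sharing AUC = 0.75.
thetas = {i: (0.05 * i, 0.5 + 0.05 * i) for i in range(1, 10)}

# With a = 0 the weight is uniform (g = 1) and wAUC reduces to AUC.
assert all(abs(wauc_linear(x, y, 0.0) - 0.75) < 1e-9 for x, y in thetas.values())

# The exponentially weighted wAUC ranks the equal-AUC configurations.
best = max(thetas, key=lambda i: wauc_exp(*thetas[i]))
print(best)  # 8
```

The a = 0 check confirms the claim in the abstract that AUC is the special case of wAUC with a uniform weight, and on this assumed grid the exponentially weighted wAUC prefers θ8, in line with the discussion of Table 2 below.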
Moreover, Gmean does not distinguish between θ3 and θ5; this drawback can be overcome by using wAUC with an appropriately chosen
Table 2: The values of the different measures in the case AUC = 0.75. (The columns are Acc, Precision, F-measure, Gmean, wAUC (linear) and wAUC (exponential); the rows are θ1 through θ9.)

weighting function. One can see that the exponentially weighted wAUC selects θ8, which corresponds to a moderate configuration with a high TPrate. All in all, θ8 is biased toward the minority class without considering the minority class only, whereas θ9 is biased toward the positive class alone; when the exponentially weighted wAUC is used as the model selection measure, θ8 is selected. Fig. 8 shows the superposition of the wAUC isometrics and the isometric for AUC = 0.75. Fig. 9 shows the case in which the isometrics of θ8 and θ9 intersect the isometric of AUC = 0.75.

Fig. 8: The superposition of wAUC isometrics and the AUC isometric
Fig. 9: Intersections of the isometrics of θ8 and θ9 with the isometric of AUC = 0.75, respectively

5 Conclusion

In this paper, we have introduced a new method to evaluate the performance of classification systems on two-class problems with skewed data distributions. Theoretical and empirical studies have shown the robustness and advantages of wAUC with respect to several other well-known performance measures. For the imbalanced data classification problem, the performance evaluation measure should distinguish between the accuracies on the different classes when evaluating
and selecting classifiers. To focus on the accuracy on the positive class, different weights are applied to different regions under the ROC curve, and the weighted AUC (wAUC) is defined to assign larger weight to the regions with higher true positive rate (TPrate), making wAUC favor classifiers with higher accuracy on the positive class. Theoretical analysis and the discussion of the wAUC isometrics show that wAUC outperforms AUC.

References

[1] N. V. Chawla, Data mining for imbalanced datasets: An overview, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, Springer, Heidelberg, 2010.
[2] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis, 6 (2002).
[3] N. V. Chawla, N. Japkowicz, A. Kotcz, Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter, (2004) 1-6.
[4] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27 (2006) 861-874.
[5] J. P. Egan, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception, Academic Press, 1975.
[6] K. A. Spackman, Signal detection theory: Valuable tools for evaluating inductive learning, in: Proc. of the Sixth International Workshop on Machine Learning, 1989.
[7] W. Elazmeh, N. Japkowicz, S. Matwin, Evaluating misclassifications in imbalanced data, in: Proc. of the 17th European Conference on Machine Learning, 2006.
[8] J. Huang, C. X. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering, 17 (2005).
[9] S. Daskalaki, I. Kopanas, N. Avouris, Evaluation of classifiers for an uneven class distribution problem, Applied Artificial Intelligence, 20 (2006).
[10] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, in: Proc. of the 14th International Conference on Machine Learning, Nashville, 1997.
[11] P. Flach, The geometry of ROC space: Understanding machine learning metrics through ROC isometrics, in: Proc. of the 20th International Conference on Machine Learning, 2003.
[12] J. Fürnkranz, P. Flach, ROC 'n' rule learning: towards a better understanding of covering algorithms, Machine Learning, 58 (2005).
[13] F. Provost, T. Fawcett, Robust classification for imprecise environments, Machine Learning, 42 (2001).
[14] R. Vilalta, D. Oblinger, A quantification of distance-bias between evaluation metrics in classification, in: Proc. of the 17th International Conference on Machine Learning, 2000.
[15] J. Fürnkranz, P. Flach, An analysis of rule evaluation metrics, in: Proc. of the 20th International Conference on Machine Learning, 2003.
More informationA DECISION TREE BASED PEDOMETER AND ITS IMPLEMENTATION ON THE ANDROID PLATFORM
A DECISION TREE BASED PEDOMETER AND ITS IMPLEMENTATION ON THE ANDROID PLATFORM ABSTRACT Juanying Lin, Leanne Chan and Hong Yan Department of Electronic Engineering, City University of Hong Kong, Hong Kong,
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationMaximum Profit Mining and Its Application in Software Development
Maximum Profit Mining and Its Application in Software Development Charles X. Ling 1, Victor S. Sheng 1, Tilmann Bruckhaus 2, Nazim H. Madhavji 1 1 Department of Computer Science, The University of Western
More informationBig Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning
Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning By: Shan Suthaharan Suthaharan, S. (2014). Big data classification: Problems and challenges in network
More informationVisualizing High-Dimensional Predictive Model Quality
Visualizing High-Dimensional Predictive Model Quality Penny Rheingans University of Maryland, Baltimore County Department of Computer Science and Electrical Engineering rheingan@cs.umbc.edu Marie desjardins
More information1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
More informationThe Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
More informationData Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product
Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationRoulette Sampling for Cost-Sensitive Learning
Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca
More informationConsistent Binary Classification with Generalized Performance Metrics
Consistent Binary Classification with Generalized Performance Metrics Nagarajan Natarajan Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon UT Austin Nov 4, 2014 Problem and Motivation
More informationAn Evaluation of Calibration Methods for Data Mining Models in Simulation Problems
Universidad Politécnica de Valencia Departamento de Sistemas Informáticos y Computación Master de Ingeniería de Software, Métodos Formales y Sistemas de Información Master Thesis An Evaluation of Calibration
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationPerformance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com
More informationTOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
More informationApplied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
More informationThe class imbalance problem in pattern classification and learning. V. García J.S. Sánchez R.A. Mollineda R. Alejo J.M. Sotoca
The class imbalance problem in pattern classification and learning V. García J.S. Sánchez R.A. Mollineda R. Alejo J.M. Sotoca Pattern Analysis and Learning Group Dept.de Llenguatjes i Sistemes Informàtics
More informationFeature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
More informationSample subset optimization for classifying imbalanced biological data
Sample subset optimization for classifying imbalanced biological data Pengyi Yang 1,2,3, Zili Zhang 4,5, Bing B. Zhou 1,3 and Albert Y. Zomaya 1,3 1 School of Information Technologies, University of Sydney,
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationMeasuring Lift Quality in Database Marketing
Measuring Lift Quality in Database Marketing Gregory Piatetsky-Shapiro Xchange Inc. One Lincoln Plaza, 89 South Street Boston, MA 2111 gps@xchange.com Sam Steingold Xchange Inc. One Lincoln Plaza, 89 South
More informationMHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
More informationCLASS imbalance learning refers to a type of classification
IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS, PART B Multi-Class Imbalance Problems: Analysis and Potential Solutions Shuo Wang, Member, IEEE, and Xin Yao, Fellow, IEEE Abstract Class imbalance problems
More informationDepth and Excluded Courses
Depth and Excluded Courses Depth Courses for Communication, Control, and Signal Processing EECE 5576 Wireless Communication Systems 4 SH EECE 5580 Classical Control Systems 4 SH EECE 5610 Digital Control
More informationLearning on the Border: Active Learning in Imbalanced Data Classification
Learning on the Border: Active Learning in Imbalanced Data Classification Şeyda Ertekin 1, Jian Huang 2, Léon Bottou 3, C. Lee Giles 2,1 1 Department of Computer Science and Engineering 2 College of Information
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationDiscovering process models from empirical data
Discovering process models from empirical data Laura Măruşter (l.maruster@tm.tue.nl), Ton Weijters (a.j.m.m.weijters@tm.tue.nl) and Wil van der Aalst (w.m.p.aalst@tm.tue.nl) Eindhoven University of Technology,
More informationDirect Marketing When There Are Voluntary Buyers
Direct Marketing When There Are Voluntary Buyers Yi-Ting Lai and Ke Wang Simon Fraser University {llai2, wangk}@cs.sfu.ca Daymond Ling, Hua Shi, and Jason Zhang Canadian Imperial Bank of Commerce {Daymond.Ling,
More informationAn Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,
More informationHow To Understand The Impact Of A Computer On Organization
International Journal of Research in Engineering & Technology (IJRET) Vol. 1, Issue 1, June 2013, 1-6 Impact Journals IMPACT OF COMPUTER ON ORGANIZATION A. D. BHOSALE 1 & MARATHE DAGADU MITHARAM 2 1 Department
More informationENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
More informationA NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data
More informationData Mining for Direct Marketing: Problems and
Data Mining for Direct Marketing: Problems and Solutions Charles X. Ling and Chenghui Li Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 Tel: 519-661-3341;
More informationKeywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
More informationHow To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
More informationCrowdsourcing Fraud Detection Algorithm Based on Psychological Behavior Analysis
, pp.138-142 http://dx.doi.org/10.14257/astl.2013.31.31 Crowdsourcing Fraud Detection Algorithm Based on Psychological Behavior Analysis Li Peng 1,2, Yu Xiao-yang 1, Liu Yang 2, Bi Ting-ting 2 1 Higher
More informationDiscovering Criminal Behavior by Ranking Intelligence Data
UNIVERSITY OF AMSTERDAM Faculty of Science Discovering Criminal Behavior by Ranking Intelligence Data by 5889081 A thesis submitted in partial fulfillment for the degree of Master of Science in the field
More informationKnowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
More informationFraud Detection for Online Retail using Random Forests
Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.
More informationDecision Support Systems
Decision Support Systems 50 (2011) 602 613 Contents lists available at ScienceDirect Decision Support Systems journal homepage: www.elsevier.com/locate/dss Data mining for credit card fraud: A comparative
More informationPreprocessing Imbalanced Dataset Using Oversampling Approach
Journal of Recent Research in Engineering and Technology, 2(11), 2015, pp 10-15 Article ID J111503 ISSN (Online): 2349 2252, ISSN (Print):2349 2260 Bonfay Publications Research article Preprocessing Imbalanced
More informationFRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,
More informationIntroduction to Engineering System Dynamics
CHAPTER 0 Introduction to Engineering System Dynamics 0.1 INTRODUCTION The objective of an engineering analysis of a dynamic system is prediction of its behaviour or performance. Real dynamic systems are
More informationCurrent Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary
Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:
More information