Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics

Journal of Computational Information Systems 8: 1 (2012) 371-378
Available at http://www.jofcis.com

Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics

Yuanfang DONG 1,2, Xiongfei LI 1,*, Jun LI 3, Haiying ZHAO 4

1 Key Laboratory of Symbolic Computation and Knowledge Engineering for Ministry of Education, Jilin University, Changchun, China
2 School of Economics and Management, Changchun University of Science and Technology, Changchun, China
3 Department of Mathematics, Changchun University of Science and Technology, Changchun, China
4 School of Computer Science and Technology, Xinjiang Normal University, Wulumuqi, China

Abstract

A performance evaluation model, weighted AUC (wAUC), is proposed to provide a better way to evaluate classifiers learned from imbalanced data. When computing the weighted area under the ROC curve, the weights vary with the value of the true positive rate (TPrate) across regions, in order to focus on the accuracy of the minority class, which is commonly the more important one. AUC is a special case of wAUC, and wAUC is compared with other common performance evaluation measures by isometric analysis. The experimental results show that wAUC can distinguish classifiers that have the same AUC value.

Keywords: Machine Learning; Classification; Imbalanced Data; Performance Evaluation; Isometric

1 Introduction

Most learning algorithms assume that the data have a balanced class distribution, but real-world data are often imbalanced, with skewed class distributions [1]. In the imbalanced two-class case, the minority class is usually called the positive class and the majority class the negative class. When dealing with data with a skewed class distribution, the majority class dominates and the classification boundary is biased toward it, so the ability of classic classification algorithms to predict the minority class decreases and the overall prediction performance suffers.

Project supported by the National Science and Technology Support Program (No. 26BAKA33), the Technology Development Program of Jilin Province (No. 2974), and the Natural Science Foundation of Jilin Province (No. 252).
* Corresponding author. Email address: xiongfei@jlu.edu.cn (Xiongfei LI).

ISSN 1553-9105 / Copyright 2012 Binary Information Press, January 2012

Adopting the common Accuracy or Error rate as the performance evaluation measure tends to give misleading results in the case of imbalanced class distributions, because Accuracy and Error rate do not consider misclassification costs. A number of measures have been proposed for evaluating classifiers on imbalanced data. These measures can be divided into two categories: numerical and graphical. Numerical measures, including Accuracy, Precision, Recall, F-measure, Gmean and AUC, give a single value to characterize a classifier's performance. Graphical measures, including ROC curves, precision-recall curves, cost curves, and so on [2, 3], draw images, especially two- and three-dimensional images that are easy to inspect.

The ROC curve (Receiver Operating Characteristic curve) is the performance evaluation method that has been most widely used and most deeply studied [4]. The method has many advantages: it is intuitive, easy to understand, and simple to use. ROC curves first appeared in signal detection studies, aimed at the trade-off between the true positive rate and the false positive rate [5]. Spackman introduced ROC analysis to the field of machine learning [6] to evaluate and compare algorithms.

2 Related Work

The confusion matrix shown in Table 1 describes the distribution of sample classifications and is the basis for calculating many classifier performance measures.

Table 1: Confusion matrix

                     predicted positive      predicted negative
actual positive      True Positives (TP)     False Negatives (FN)
actual negative      False Positives (FP)    True Negatives (TN)

For two-class problems, the Accuracy (Acc) and the Error rate (Err) can easily be derived from the confusion matrix in Table 1:

    Acc = (TP + TN) / (TP + FN + TN + FP).    (1)

    Err = (FP + FN) / (TP + FN + TN + FP).    (2)

These two measures are sensitive to class imbalance and are overly biased toward the majority class, so using Acc or Err leads to erroneous performance comparisons on imbalanced data [7, 8].

The true positive rate is denoted by TPrate and the true negative rate by TNrate; the false positive rate and false negative rate are denoted by FPrate and FNrate, respectively:

    TPrate = TP / (TP + FN),    FPrate = FP / (TN + FP),
    TNrate = TN / (TN + FP),    FNrate = FN / (TP + FN).

The true positive rate, also known as Recall, is the percentage of positive-class samples that have been correctly classified. Precision is defined as the percentage of samples labeled positive that truly belong to the positive class, namely:

    Precision = TP / (TP + FP).    (3)
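As a concrete illustration, Eqs. (1)-(3) and the four rates can be computed directly from the counts in Table 1. The following minimal Python sketch uses hypothetical counts for an imbalanced test set; the function name and the numbers are illustrative, not from the paper.

```python
# Minimal sketch of Eqs. (1)-(3) and the four rates from Table 1.

def basic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Measures derived from a two-class confusion matrix."""
    n = tp + fn + fp + tn
    return {
        "Acc": (tp + tn) / n,          # Eq. (1)
        "Err": (fp + fn) / n,          # Eq. (2)
        "TPrate": tp / (tp + fn),      # Recall on the positive class
        "FPrate": fp / (fp + tn),
        "TNrate": tn / (tn + fp),
        "FNrate": fn / (tp + fn),
        "Precision": tp / (tp + fp),   # Eq. (3)
    }

# Hypothetical counts for a 1:9 imbalanced test set of 1000 samples.
print(basic_metrics(tp=80, fn=20, fp=90, tn=810))
```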

F-measure combines Precision with Recall; the higher the F-measure, the better the model performs on the positive class [9]:

    F-measure = ((1 + β^2) · recall · precision) / (β^2 · recall + precision).    (4)

The Gmean proposed by Kubat et al. is the geometric mean of the prediction accuracies on the positive class and the negative class, and an important measure for avoiding over-fitting to the negative class [10]:

    Gmean = sqrt(TPrate · TNrate).    (5)

On the ROC curve, the y-axis represents TPrate while the x-axis represents FPrate. Each point on the ROC curve corresponds to a classifier model. If a classifier is a ranker or a scoring classifier, a threshold on the score transforms it into a discrete classifier, and the ROC curve shows what happens to the corresponding confusion matrix at each possible threshold [4]. The area under the ROC curve, AUC, is the quantitative expression of the ROC curve and can be calculated by Eq. (6):

    AUC = ∫_0^1 y dx.    (6)

3 A Weighted Classifier Performance Measure: wAUC

3.1 The choice of weights

The shortcoming of traditional AUC is that it does not consider cost bias: it adopts the same weight (a uniform cost of 1) for every region when calculating AUC. For two-class imbalanced data, traditional AUC pays no more attention to the accuracy on the positive class (TPrate) than to the accuracy on the negative class (TNrate); it does not identify each class's contribution to the overall performance. This means that different combinations of TPrate and TNrate can lead to the same AUC value. For example, classifier A with (FPrate, TPrate) = (0.1, 0.6) and classifier B with (FPrate, TPrate) = (0.4, 0.9) have the same AUC value of 0.75. But for the imbalanced data classification problem, misclassifying positive-class samples is more expensive; the accuracy on the positive class should contribute more to the overall performance than the accuracy on the negative class, and so classifier B has the better performance. To reflect this, different weights should be applied to different regions under the ROC curve, making the regions with higher TPrate correspond to greater weight.

Let the weight function be g(y). The function g(y) should satisfy the following conditions (checked numerically in the sketch after this list):

(1) non-negativity: g(y) >= 0 for all y in [0, 1];
(2) normativity: ∫_0^1 g(y) dy = 1, since g(y) is a weight density function;
(3) monotonicity: g(y) is monotonically increasing, which reflects that higher TPrate corresponds to regions with greater weight.
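Eqs. (4)-(5) and the three weight-function conditions are easy to check numerically. A small sketch follows; the helper names are hypothetical, and the two candidate weights are the linear and exponential functions used later in Section 4.2.

```python
import math

def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Eq. (4): beta-weighted combination of precision and recall."""
    return (1 + beta**2) * recall * precision / (beta**2 * recall + precision)

def gmean(tp_rate: float, tn_rate: float) -> float:
    """Eq. (5): geometric mean of the per-class accuracies."""
    return math.sqrt(tp_rate * tn_rate)

def is_valid_weight(g, n: int = 100_000) -> bool:
    """Check non-negativity, unit integral (midpoint rule) and
    monotonicity of a candidate weight function g on [0, 1]."""
    vals = [g((i + 0.5) / n) for i in range(n)]
    nonneg = all(v >= 0 for v in vals)
    unit = abs(sum(vals) / n - 1.0) < 1e-6
    monotone = all(a <= b for a, b in zip(vals, vals[1:]))
    return nonneg and unit and monotone

print(f_measure(precision=0.5238, recall=0.55))             # ~0.5366
print(gmean(0.55, 0.95))                                    # ~0.7228
print(is_valid_weight(lambda y: y + 0.5))                   # linear: True
print(is_valid_weight(lambda y: math.e**y / (math.e - 1)))  # exponential: True
```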

The weight function g(y) can be as simple a curve as a line, or any other curve satisfying the above conditions. The curve that the ROC curve forms after the mapping is called the weighted ROC curve (wROC), and the space that ROC space forms after the mapping is called weighted ROC space. If a classifier corresponds to a point (x, y) in ROC space, that point corresponds to the following point in weighted ROC space:

    Φ : (x, y) → ((1 - x) g(y), y).    (7)

The point (0, 0) is mapped to (g(0), 0), while the point (1, 1) is mapped to (0, 1).

3.2 Weighted AUC

AUC can also be understood as the area to the right of the ROC curve in ROC space. That is, with y as the integration variable and x = f(y) as the equation of the ROC curve, AUC can be calculated as

    AUC = ∫_0^1 (1 - f(y)) dy = 1 - ∫_0^1 f(y) dy.    (8)

Definition 1. Let x = f(y) be the equation of the ROC curve and let g(y) be the weight function. The weighted AUC is defined as

    wAUC = ∫_0^1 g(y) (1 - f(y)) dy.    (9)

Geometrically, wAUC is the area of the region enclosed by three curves: the wROC curve, the x-axis and the line x = 0. By the normativity of the weight function g(y),

    wAUC = ∫_0^1 g(y) (1 - f(y)) dy = ∫_0^1 g(y) dy - ∫_0^1 g(y) f(y) dy = 1 - ∫_0^1 g(y) f(y) dy.    (10)

That is, wAUC = 1 - ∫_0^1 g(y) f(y) dy can be used in the calculation. However, Eq. (9) is preferable because it expresses the aim of weighting the different regions.

In the case of a single classifier, the wROC curve is the polyline connecting the points (g(0), 0), ((1 - x) g(y), y) and (0, 1), and the wAUC of the classifier is the area enclosed by the wROC curve, the line y = 0 and the line x = 0:

    wAUC(x, y) = (1/2)[(1 - x) g(y) + g(0)] y + (1/2)(1 - x) g(y)(1 - y)
               = (1/2) g(0) y + (1/2)(1 - x) g(y).

Let g(y) = y + 1/2, for example; this linear function satisfies all three weight-function conditions. For classifier A (FPrate, TPrate) = (0.1, 0.6) and classifier B (FPrate, TPrate) = (0.4, 0.9) of Section 3.1, a suitably chosen weight function breaks the tie that AUC produces: under the exponential weighting g(y) = e^y / (e - 1) examined in Section 4.2, the two classifiers obtain different wAUC values, 0.6518 and 0.6913 respectively (see the sketch below). This shows that classifier B has the better performance.
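The single-classifier formula above takes only a few lines of code. A minimal sketch (helper names hypothetical), applied to classifiers A and B with both the linear weight g(y) = y + 1/2 and the exponential weight of Section 4.2; note that this particular pair is separated by the exponential weighting but not by the linear one:

```python
import math

def wauc_point(fp_rate: float, tp_rate: float, g) -> float:
    """wAUC of a single discrete classifier whose wROC polyline connects
    (g(0), 0), ((1 - FPrate) g(TPrate), TPrate) and (0, 1):
    wAUC(x, y) = g(0) y / 2 + (1 - x) g(y) / 2."""
    return 0.5 * g(0.0) * tp_rate + 0.5 * (1.0 - fp_rate) * g(tp_rate)

g_lin = lambda y: y + 0.5                      # linear weight g(y) = y + 1/2
g_exp = lambda y: math.e**y / (math.e - 1.0)   # exponential weight (Section 4.2)

# Classifiers A and B from Section 3.1; both have AUC = 0.75.
for name, (x, y) in [("A", (0.1, 0.6)), ("B", (0.4, 0.9))]:
    auc = (y + 1.0 - x) / 2.0
    print(name, f"AUC={auc:.2f}",
          f"wAUC_lin={wauc_point(x, y, g_lin):.4f}",
          f"wAUC_exp={wauc_point(x, y, g_exp):.4f}")
# A AUC=0.75 wAUC_lin=0.6450 wAUC_exp=0.6518
# B AUC=0.75 wAUC_lin=0.6450 wAUC_exp=0.6913
```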

4 Experimental Analysis

4.1 Isometric analysis

ROC isometrics are collections of points in ROC space with the same value of a performance metric. Flach and Fürnkranz investigated isometrics to understand metrics [11, 12]. However, isometrics can also be used for the task of classifier selection, to construct reliable classifiers: Provost and Fawcett [13] use isometrics to determine the optimal point on an ROC convex hull. The term isometric seems to originate with Vilalta and Oblinger [14], who give an isometric plot for information gain, but their analysis is quantitative in nature and the connection to ROC analysis is not made.

We recall that isometrics are collections of points with the same value of a metric. Generally speaking, in two-dimensional ROC space isometrics are lines or curves. Below we will see three types of isometric landscapes: (a) parallel linear isometrics (Accuracy); (b) non-parallel linear isometrics (Precision, F-measure); and (c) non-linear isometrics (decision-tree splitting criteria). Type (a) means that the metric applies the same positive/negative trade-off throughout ROC space; type (b) means that the trade-off varies with the value of the metric; and type (c) means that the trade-off varies even for a single value of the metric.

Fig. 1 illustrates the isometrics of Accuracy, which are linear and parallel. In the case of a single classifier, the ROC curve is the polyline connecting the 3 points (0, 0), (FPrate, TPrate) and (1, 1), and AUC is the area under this curve. In this case AUC = (TPrate + 1 - FPrate)/2, which equals the Accuracy when the two classes are of equal size. The AUC isometrics are likewise linear and parallel, as shown in Fig. 2.

Fig. 1: Accuracy isometrics
Fig. 2: AUC isometrics

F-measure is commonly used in the field of information retrieval. This measure is insensitive to how the incorrect predictions are distributed over the classes. The isometrics of F-measure are linear but non-parallel, as shown in Fig. 3.

Fig. 3: F-measure isometrics

By its definition, Gmean expresses both the accuracy on the positive class and the accuracy on the negative class. The isometrics of Gmean are non-linear, as shown in Fig. 4. It is interesting to note that versions of precision with a shifted rotation point occur quite often in machine learning [15]. The isometrics of Precision are linear and non-parallel, as shown in Fig. 5.

Fig. 4: Gmean isometrics
Fig. 5: Precision isometrics

Fig. 6 illustrates the isometrics of wAUC in the case that the weight function is a linear function g(y) = ay + b. We note that the wAUC isometrics are non-linear. Moreover, from the normal directions of the wAUC isometrics we find that classifiers with higher accuracy on the positive class tend to receive more attention when wAUC is adopted.
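The isometric landscapes of Figs. 1-6 can be reproduced by evaluating each metric on a grid over ROC space and drawing its level curves. A sketch with matplotlib follows; it assumes equal class priors for Accuracy and Precision and uses the single-point wAUC form with the linear weight g(y) = y + 1/2 (all plotting choices are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over ROC space: x = FPrate, y = TPrate.
x, y = np.meshgrid(np.linspace(0.01, 0.99, 200), np.linspace(0.01, 0.99, 200))

metrics = {
    "Accuracy": (y + (1 - x)) / 2,       # equal priors: parallel lines
    "Precision": y / (y + x),            # equal priors: non-parallel lines
    "Gmean": np.sqrt(y * (1 - x)),       # non-linear curves
    "wAUC (linear)": 0.25 * y + 0.5 * (1 - x) * (y + 0.5),  # g(y) = y + 1/2
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, z) in zip(axes, metrics.items()):
    ax.contour(x, y, z, levels=10)       # each contour line is one isometric
    ax.set(title=name, xlabel="FPrate", ylabel="TPrate")
plt.tight_layout()
plt.show()
```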

If the weight function is set to the exponential function g(y) = e^y / (e - 1), the wAUC isometrics are, as shown in Fig. 7, also non-linear. The normal directions of the wAUC isometrics indicate that the exponentially weighted wAUC pays even more attention to classifiers with higher accuracy on the positive class than the linearly weighted wAUC does.

Fig. 6: wAUC isometrics (linear weight function)
Fig. 7: wAUC isometrics (exponential weight function)

4.2 Simulation analysis

1. Linear case: Let g(y) = ay + b. Monotonicity requires a >= 0, and non-negativity at y = 0 requires a <= 2. From the normativity condition ∫_0^1 g(y) dy = ∫_0^1 (ay + b) dy = a/2 + b = 1, we get b = 1 - a/2, so g(y) = ay + 1 - a/2. Then

    wAUC(x, y) = (1/2) g(0) y + (1/2)(1 - x) g(y)
               = (1/2 + a/4) y - (a/2) xy - (1/2 - a/4) x + (1/2 - a/4).

2. Exponential case: Let g(y) = e^y / (e - 1), which is monotonically increasing and satisfies ∫_0^1 g(y) dy = ∫_0^1 e^y / (e - 1) dy = 1. Then

    wAUC(x, y) = (1/2) g(0) y + (1/2)(1 - x) g(y)
               = (y + (1 - x) e^y) / (2(e - 1)).

The main purpose of wAUC is to provide a weighted measure suitable for evaluating performance in imbalanced domains. The weighting factor aims at favoring those results with better classification rates on the minority class.
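Both closed forms can be sanity-checked against the generic single-classifier formula of Section 3.2. A brief sketch, assuming the trapezoidal single-point formula (the assertions compare the expanded and generic forms at an arbitrary point):

```python
import math

def wauc_point(x: float, y: float, g) -> float:
    """Generic single-classifier formula: g(0) y / 2 + (1 - x) g(y) / 2."""
    return 0.5 * g(0.0) * y + 0.5 * (1.0 - x) * g(y)

def wauc_linear(x: float, y: float, a: float) -> float:
    """Closed form for g(y) = a y + 1 - a/2 with 0 <= a <= 2."""
    return (0.5 + a / 4) * y - (a / 2) * x * y - (0.5 - a / 4) * x + (0.5 - a / 4)

def wauc_exponential(x: float, y: float) -> float:
    """Closed form for g(y) = e^y / (e - 1)."""
    return (y + (1 - x) * math.exp(y)) / (2 * (math.e - 1))

a = 1.0                # recovers the linear weight g(y) = y + 1/2
x, y = 0.4, 0.9        # classifier B of Section 3.1
assert abs(wauc_linear(x, y, a)
           - wauc_point(x, y, lambda t: a * t + 1 - a / 2)) < 1e-12
assert abs(wauc_exponential(x, y)
           - wauc_point(x, y, lambda t: math.exp(t) / (math.e - 1))) < 1e-12
print(wauc_linear(x, y, a), wauc_exponential(x, y))   # 0.645 0.6913...
```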

Table 2 describes the values of the different measures in the case of AUC = 0.75.

Table 2: The values of different measures in the case of AUC = 0.75

      TPrate  FPrate  Acc     Precision  F-measure  Gmean   wAUC Linear  wAUC Exp.
θ1    0.55    0.05    0.9136  0.5238     0.5366     0.7228  0.6362       0.6392
θ2    0.60    0.10    0.8727  0.3750     0.4615     0.7348  0.6450       0.6518
θ3    0.65    0.15    0.8318  0.3023     0.4127     0.7433  0.6513       0.6629
θ4    0.70    0.20    0.7909  0.2593     0.3784     0.7483  0.6550       0.6725
θ5    0.75    0.25    0.7500  0.2308     0.3529     0.7500  0.6563       0.6803
θ6    0.80    0.30    0.7091  0.2105     0.3333     0.7483  0.6550       0.6861
θ7    0.85    0.35    0.6682  0.1954     0.3178     0.7433  0.6513       0.6899
θ8    0.90    0.40    0.6273  0.1837     0.3051     0.7348  0.6450       0.6913
θ9    0.95    0.45    0.5864  0.1743     0.2946     0.7228  0.6362       0.6903

Note that θ5 produces a perfect balance between TPrate and TNrate, θ1 biases toward the accuracy on the negative class, and θ9 biases toward the accuracy on the positive class. All of the θi correspond to the same AUC value of 0.75. Accuracy would select the biased θ1 because it strongly depends on the majority class rate. Gmean suggests the most balanced configuration, ignoring the fact that the minority class is usually the most important; moreover, Gmean does not distinguish between θ3 and θ7. This drawback can be overcome by using wAUC with an appropriately chosen weight function. One can see that the exponential wAUC selects θ8, which corresponds to a moderate case with a high TPrate. All in all, θ8 biases toward the minority class without considering the minority class only, while θ9 biases toward the positive class alone; when the exponential wAUC is used as the model selection measure, θ8 is selected.

Fig. 8 shows the superposition of the wAUC isometrics and the AUC = 0.75 isometric. Fig. 9 shows the intersections of the isometrics of θ8 and θ9 with the AUC = 0.75 isometric.

Fig. 8: The superposition of wAUC isometrics and the AUC isometric
Fig. 9: Intersections of the isometrics of θ8 and θ9 with the isometric of AUC = 0.75

5 Conclusion

In this paper, we have introduced a new method to evaluate the performance of classification systems in two-class problems with skewed data distributions. Theoretical and empirical studies have shown the robustness and advantages of wAUC with respect to some other well-known performance measures. For the imbalanced data classification problem, the performance evaluation measure should distinguish between the accuracies on the different classes when evaluating
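The columns of Table 2 can be regenerated from the formulas above. The sketch below assumes a 1:10 positive-to-negative class ratio (the ratio consistent with the printed Acc and Precision columns) and β = 1, and reports which θ each measure would select:

```python
import math

def measures(tp_rate: float, fp_rate: float, pos_frac: float = 1 / 11) -> dict:
    """Recompute one row of Table 2, assuming a 1:10 class ratio."""
    tn_rate = 1.0 - fp_rate
    p_tp = pos_frac * tp_rate            # mass of true positives
    p_fp = (1 - pos_frac) * fp_rate      # mass of false positives
    precision = p_tp / (p_tp + p_fp)
    return {
        "Acc": pos_frac * tp_rate + (1 - pos_frac) * tn_rate,
        "Precision": precision,
        "F-measure": 2 * tp_rate * precision / (tp_rate + precision),
        "Gmean": math.sqrt(tp_rate * tn_rate),
        "wAUC_lin": 0.25 * tp_rate + 0.5 * (1 - fp_rate) * (tp_rate + 0.5),
        "wAUC_exp": (tp_rate + (1 - fp_rate) * math.exp(tp_rate))
                    / (2 * (math.e - 1)),
    }

# theta_1 .. theta_9: TPrate 0.55 .. 0.95, FPrate = TPrate - 0.5 (AUC = 0.75).
thetas = {i: measures(0.5 + 0.05 * i, 0.05 * i) for i in range(1, 10)}

for m in ["Acc", "Gmean", "wAUC_lin", "wAUC_exp"]:
    best = max(thetas, key=lambda i: thetas[i][m])
    print(f"{m} selects theta_{best} ({thetas[best][m]:.4f})")
# Acc selects theta_1, Gmean and the linear wAUC select theta_5,
# and the exponential wAUC selects theta_8.
```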

and selecting classifiers. In order to focus on the accuracy on the positive class, different weights are assigned to different regions under the ROC curve: the weighted AUC (wAUC) is defined so that regions with a higher true positive rate (TPrate) receive larger weight, which makes the performance evaluation measure favor classifiers with higher accuracy on the positive class. Theoretical analysis and the discussion of the isometrics of wAUC show that wAUC outperforms AUC.

References

[1] N. V. Chawla, Data mining for imbalanced datasets: An overview, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, Springer, Heidelberg, 2005, pp. 853-867.
[2] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis, 6 (2002) 429-449.
[3] N. V. Chawla, N. Japkowicz, A. Kotcz, Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter, 6 (2004) 1-6.
[4] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27 (2006) 861-874.
[5] J. P. Egan, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception, Academic Press, 1975.
[6] K. A. Spackman, Signal detection theory: Valuable tools for evaluating inductive learning, in: Proc. of the Sixth International Workshop on Machine Learning, 1989, pp. 160-163.
[7] W. Elazmeh, N. Japkowicz, S. Matwin, Evaluating misclassifications in imbalanced data, in: Proc. of the 17th European Conference on Machine Learning, 2006, pp. 126-137.
[8] J. Huang, C. X. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Trans. on Knowledge and Data Engineering, 17 (2005) 299-310.
[9] S. Daskalaki, I. Kopanas, N. Avouris, Evaluation of classifiers for an uneven class distribution problem, Applied Artificial Intelligence, 20 (2006) 381-417.
[10] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, in: Proc. of the 14th Intl. Conf. on Machine Learning, Nashville, 1997, pp. 179-186.
[11] P. Flach, The geometry of ROC space: Understanding machine learning metrics through ROC isometrics, in: Proc. of the 20th International Conference on Machine Learning, 2003, pp. 194-201.
[12] J. Fürnkranz, P. Flach, ROC 'n' rule learning - towards a better understanding of covering algorithms, Machine Learning, 58 (2005) 39-77.
[13] F. Provost, T. Fawcett, Robust classification for imprecise environments, Machine Learning, 42 (2001) 203-231.
[14] R. Vilalta, D. Oblinger, A quantification of distance-bias between evaluation metrics in classification, in: Proc. of the 17th International Conference on Machine Learning, 2000, pp. 1087-1094.
[15] J. Fürnkranz, P. Flach, An analysis of rule evaluation metrics, in: Proc. of the 20th International Conference on Machine Learning, 2003, pp. 202-209.