Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics

Journal of Computational Information Systems 8 (2012)

Analysis on Weighted AUC for Imbalanced Data Learning Through Isometrics

Yuanfang DONG 1,2, Xiongfei LI 1,*, Jun LI 3, Haiying ZHAO 4

1 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
2 School of Economics and Management, Changchun University of Science and Technology, Changchun, China
3 Department of Mathematics, Changchun University of Science and Technology, Changchun, China
4 School of Computer Science and Technology, Xinjiang Normal University, Wulumuqi, China

Project supported by the National Science and Technology Support Program (No. 26BAKA33), the Technology Development Program of Jilin Province (No. 2974), and the Natural Science Foundation of Jilin Province (No. 252). * Corresponding author. E-mail address: xiongfei@jlu.edu.cn (Xiongfei LI). Copyright 2012 Binary Information Press, January 2012.

Abstract

A performance evaluation model, the weighted AUC (wAUC), is proposed as a better way to evaluate classifiers learned from imbalanced data. When computing the weighted area under the ROC curve, the weight varies with the true positive rate (TPrate) across regions, so that the measure focuses on the accuracy of the minority class, which is usually the more important one. Since AUC is a special case of wAUC, wAUC is compared with other common performance evaluation measures by isometric analysis. The experimental results show that wAUC can distinguish classifiers that have the same AUC value.

Keywords: Machine Learning; Classification; Imbalanced Data; Performance Evaluation; Isometric

1 Introduction

Most learning algorithms assume that the data have a balanced class distribution. In the real world, however, imbalanced classes and skewed class distributions are common [1]. In the imbalanced two-class case, the minority class is usually called the positive class and the majority class the negative class. When dealing with data with a skewed class distribution, the majority class dominates and the classification boundary is biased towards it, so the ability of the classic classification algorithms to predict the minority class decreases and the overall prediction performance suffers.

If the common Accuracy or Error rate is adopted as the performance evaluation measure in the case of an imbalanced class distribution, mistakes tend to be made, because Accuracy and Error rate do not consider misclassification cost. A number of measures have been proposed for evaluating classifiers on imbalanced data. They can be divided into two categories: numerical and graphical measures. Numerical measures, including Accuracy, Precision, Recall, F-measure, Gmean and AUC, give a single value to characterize a classifier's performance. Graphical measures, including ROC curves, precision-recall curves, cost curves, and so on [2, 3], draw images, especially two- and three-dimensional images that are easy to inspect.

The ROC curve (Receiver Operating Characteristic curve) is the performance evaluation method that has been used most widely and studied in most depth [4]. The method has many advantages: it is intuitive, easy to understand, and simple to use. The ROC curve first appeared in signal detection studies, where it described the trade-off between the true positive rate and the false positive rate [5]. Spackman introduced ROC analysis to the field of machine learning to evaluate and compare algorithms [6].

2 Related Work

The confusion matrix shown in Table 1 describes the distribution of the sample classifications and is the basis for calculating several classifier performance measures.

Table 1: Confusion matrix

                    predicted positive      predicted negative
actual positive     True Positives (TP)     False Negatives (FN)
actual negative     False Positives (FP)    True Negatives (TN)

For two-class problems, the Accuracy (Acc) and the Error rate (Err) can easily be derived from the confusion matrix in Table 1:

Acc = (TP + TN) / (TP + FN + TN + FP).    (1)

Err = (FP + FN) / (TP + FN + TN + FP).    (2)

These two measures are sensitive to class imbalance and are overly biased towards the majority class, so using Acc or Err leads to erroneous performance comparisons on imbalanced data [7, 8]. The true positive rate is denoted TPrate and the true negative rate TNrate; the false positive rate and false negative rate are denoted FPrate and FNrate, respectively:

TPrate = TP / (TP + FN),  FPrate = FP / (TN + FP),  TNrate = TN / (TN + FP),  FNrate = FN / (TP + FN).

The true positive rate, also known as Recall, is the percentage of positive-class samples that are correctly classified. Precision is the percentage of the samples labeled as positive that actually are positive:

Precision = TP / (TP + FP).    (3)
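As a small illustration of Eqs. (1)-(3) (a sketch, not code from the paper), the following Python function derives these rates from the confusion-matrix counts; the example counts for an imbalanced test set are hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive the measures of Eqs. (1)-(3) from raw confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total                 # Eq. (1)
    err = (fp + fn) / total                 # Eq. (2)
    tp_rate = tp / (tp + fn)                # Recall / true positive rate
    fp_rate = fp / (fp + tn)
    tn_rate = tn / (fp + tn)
    precision = tp / (tp + fp)              # Eq. (3)
    return {"Acc": acc, "Err": err, "TPrate": tp_rate,
            "FPrate": fp_rate, "TNrate": tn_rate, "Precision": precision}

# Hypothetical imbalanced test set: 50 positives, 950 negatives.
# Acc is high (0.94) even though TPrate is only 0.6, illustrating the bias.
print(confusion_metrics(tp=30, fn=20, fp=40, tn=910))
```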

The F-measure combines Precision with Recall: the higher the F-measure, the better the model performs on the positive class [9].

F-measure = (1 + β²) · Recall · Precision / (β² · Recall + Precision).    (4)

The Gmean, proposed by Kubat et al., is the geometric mean of the prediction accuracies on the positive class and the negative class, and is an important measure for avoiding over-fitting to the negative class [10]:

Gmean = √(TPrate · TNrate).    (5)

On the ROC curve, the y-axis represents TPrate and the x-axis represents FPrate. Each point on the ROC curve corresponds to a classifier model. If a classifier is a ranker or a scoring classifier, a threshold on the score transforms it into a discrete classifier, and the ROC curve shows the confusion matrix obtained for each possible threshold [4]. The area under the ROC curve, AUC, is the quantitative summary of the ROC curve and can be calculated by Eq. (6):

AUC = ∫_0^1 y dx.    (6)

3 A Weighted Classifier Performance Measure: wAUC

3.1 The choice of weights

The shortcoming of the traditional AUC is that it does not consider cost bias: it applies the same weight (i.e., a cost of 1) to every region when AUC is computed. For two-class imbalanced data, the accuracy on the positive class (TPrate) deserves more attention than the accuracy on the negative class (TNrate), yet AUC does not identify each class's contribution to the overall performance. This means that different combinations of TPrate and TNrate can lead to the same AUC value. For example, a classifier A that is more accurate on the negative class and a classifier B that is more accurate on the positive class can both have an AUC value of 0.75. For imbalanced data, however, misclassifying positive-class samples is the more expensive mistake; the accuracy on the positive class should contribute more to the overall performance, and so classifier B has the better performance. To reflect this, different weights should be applied to different regions under the ROC curve, with larger weights for the regions with higher TPrate.

Let the weight function be g(y). It should satisfy the following conditions:

(1) Non-negativity: g(y) ≥ 0 for all y ∈ [0, 1].
(2) Normalization: ∫_0^1 g(y) dy = 1, so that g(y) is a weight density function.
(3) Monotonicity: g(y) is monotonically increasing, so that regions with higher TPrate receive larger weights.
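The weight functions used later in the paper, g(y) = y + 1/2 (linear) and g(y) = e^y / (e − 1) (exponential), both satisfy these three conditions. The short sketch below (an illustration, not code from the paper) checks the conditions numerically.

```python
import numpy as np

E = np.e

def g_linear(y):
    # Linear weight used later in the paper: g(y) = y + 1/2.
    return y + 0.5

def g_exp(y):
    # Exponential weight used later in the paper: g(y) = e^y / (e - 1).
    return np.exp(y) / (E - 1.0)

def check_weight(g, n=100_001):
    """Check non-negativity, unit integral and monotonicity of g on [0, 1]."""
    y = np.linspace(0.0, 1.0, n)
    v = g(y)
    nonneg = bool(np.all(v >= 0.0))
    # Trapezoidal rule; should be close to 1 by the normalization condition.
    integral = float(np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(y)))
    increasing = bool(np.all(np.diff(v) >= 0.0))
    return nonneg, integral, increasing

print(check_weight(g_linear))  # (True, ~1.0, True)
print(check_weight(g_exp))     # (True, ~1.0, True)
```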

The weight function g(y) can be as simple as a straight line, or any other curve satisfying the above conditions. The curve obtained by mapping the ROC curve is called the weighted ROC curve (wROC), and the space obtained by mapping ROC space is called the weighted ROC space. If a classifier corresponds to a point (x, y) in ROC space, it corresponds to the following point in weighted ROC space:

Φ : (x, y) → ((1 − x) g(y), y).    (7)

The point (0, 0) is mapped to (g(0), 0), while the point (1, 1) is mapped to (0, 1).

3.2 Weighted AUC

AUC can also be understood as the area to the right of the ROC curve in ROC space. That is, taking y as the integration variable and writing the ROC curve as x = f(y), AUC can be calculated as

AUC = ∫_0^1 (1 − f(y)) dy = 1 − ∫_0^1 f(y) dy.    (8)

Definition 1. Let x = f(y) be the ROC curve equation and g(y) the weight function. The weighted AUC is defined as

wAUC = ∫_0^1 g(y)(1 − f(y)) dy.    (9)

Geometrically, wAUC is the area of the region enclosed by three curves: the wROC curve, the x-axis, and the line x = 0. By the normalization of the weight function g(y), it is easy to see that

wAUC = ∫_0^1 g(y)(1 − f(y)) dy = ∫_0^1 g(y) dy − ∫_0^1 g(y) f(y) dy = 1 − ∫_0^1 g(y) f(y) dy.    (10)

That is, wAUC = 1 − ∫_0^1 g(y) f(y) dy can be used in the calculation; however, Eq. (9) is preferable because it expresses the aim of weighting the different regions.

For a single classifier, the wROC curve is the polyline connecting the points (g(0), 0), ((1 − x) g(y), y) and (0, 1), and the region enclosed by the wROC curve, the x-axis and the line x = 0 is the classifier's wAUC:

wAUC(x, y) = (1/2)[(1 − x) g(y) + g(0)] y + (1/2)(1 − x) g(y)(1 − y) = (1/2) g(0) y + (1/2)(1 − x) g(y).

Let g(y) = y + 1/2. The classifiers A and B of Section 3.1 then obtain different wAUC values, and the larger value belongs to classifier B, the one with the higher accuracy on the positive class, which shows that classifier B has the better performance.

4 Experimental Analysis

4.1 Isometric analysis

ROC isometrics are collections of points in ROC space with the same value of a performance metric. Flach and Fürnkranz investigated isometrics as a way of understanding metrics [11, 12].
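The closed form wAUC(x, y) = (1/2) g(0) y + (1/2)(1 − x) g(y) is straightforward to evaluate. The sketch below (an illustration, not the authors' code; the two operating points are hypothetical) shows that two classifiers with identical AUC can receive different wAUC values, with the larger value going to the one with the higher TPrate.

```python
import math

def auc_point(x, y):
    """AUC of the two-segment ROC polyline through (0,0), (x,y), (1,1)."""
    return 0.5 * (y + 1.0 - x)

def wauc_point(x, y, g):
    """Single-classifier closed form: wAUC = 1/2*g(0)*y + 1/2*(1-x)*g(y)."""
    return 0.5 * g(0.0) * y + 0.5 * (1.0 - x) * g(y)

g_lin = lambda t: t + 0.5                        # linear weight g(y) = y + 1/2
g_exp = lambda t: math.exp(t) / (math.e - 1.0)   # exponential weight g(y) = e^y/(e-1)

# Two hypothetical operating points with the same AUC of 0.75.
points = {"A": (0.05, 0.55), "B": (0.25, 0.75)}
for name, (x, y) in points.items():
    print(name, "AUC =", auc_point(x, y),
          "wAUC(linear) =", round(wauc_point(x, y, g_lin), 4),
          "wAUC(exp) =", round(wauc_point(x, y, g_exp), 4))
# Both points have AUC = 0.75, but B, the point with the higher TPrate,
# gets the larger wAUC under either weight function.
```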

Isometrics can also be used for the task of classifier selection, to construct reliable classifiers. Provost and Fawcett [13] use isometrics to determine the optimal point on an ROC convex hull. The term isometric seems to originate with Vilalta and Oblinger [14], who give an isometric plot for information gain, but their analysis is quantitative in nature and the connection to ROC analysis is not made.

We recall that isometrics are collections of points with the same value of the metric. In 2D ROC space, isometrics are generally lines or curves. Below we consider three types of isometric landscape: (a) parallel linear isometrics (Accuracy); (b) non-parallel linear isometrics (Precision, F-measure); and (c) non-linear isometrics (decision tree splitting criteria). Type (a) means that the metric applies the same positive/negative trade-off throughout ROC space; type (b) means that the trade-off varies with the value of the metric; and type (c) means that the trade-off varies even for a fixed value of the metric.

Fig. 1 illustrates the Accuracy isometrics, which are linear and parallel. For a single classifier, the ROC curve is the polyline connecting the three points (0, 0), (FPrate, TPrate) and (1, 1), and AUC is the area under this curve. In this case AUC = (TPrate + 1 − FPrate)/2, which coincides with Accuracy for a balanced class distribution. The AUC isometrics are therefore also linear and parallel, as shown in Fig. 2.

The F-measure is commonly used in the field of information retrieval. This measure is insensitive to how the incorrect predictions are distributed over the classes. The F-measure isometrics are linear but non-parallel, as shown in Fig. 3.

Fig. 1: Accuracy isometrics.  Fig. 2: AUC isometrics.  Fig. 3: F-measure isometrics.

By its definition, Gmean reflects both the accuracy on the positive class and the accuracy on the negative class. The Gmean isometrics are non-linear and non-parallel, as shown in Fig. 4. It is interesting to note that versions of Precision with a shifted rotation point occur frequently in machine learning [15]. The Precision isometrics are linear and non-parallel, as shown in Fig. 5.

Fig. 4: Gmean isometrics.  Fig. 5: Precision isometrics.

Fig. 6 illustrates the wAUC isometrics when the weight function is a linear function g(y) = ay + b. The wAUC isometrics are non-linear. Moreover, the normal directions of the wAUC isometrics show that classifiers with higher accuracy on the positive class receive more attention when wAUC is adopted.
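Isometric plots like Figs. 1-7 can be reproduced by contouring each metric over a grid of (FPrate, TPrate) points. The sketch below (an illustration assuming a matplotlib environment, not the authors' plotting code) draws the isometrics of the single-point AUC, Gmean, and the linearly weighted wAUC.

```python
import numpy as np
import matplotlib.pyplot as plt

fpr, tpr = np.meshgrid(np.linspace(0.0, 1.0, 201), np.linspace(0.0, 1.0, 201))

metrics = {
    "AUC (single point)": 0.5 * (tpr + 1.0 - fpr),
    "Gmean": np.sqrt(tpr * (1.0 - fpr)),
    "wAUC, g(y) = y + 1/2": 0.25 * tpr + 0.5 * (1.0 - fpr) * (tpr + 0.5),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, z) in zip(axes, metrics.items()):
    contours = ax.contour(fpr, tpr, z, levels=10)  # each contour line is an isometric
    ax.clabel(contours, inline=True, fontsize=7)
    ax.set_xlabel("FPrate")
    ax.set_ylabel("TPrate")
    ax.set_title(name)
plt.tight_layout()
plt.show()
```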

If the weight function is set to the exponential function g(y) = e^y / (e − 1), the wAUC isometrics, shown in Fig. 7, are also non-linear. The normal directions indicate that the exponentially weighted wAUC pays even more attention to classifiers with high accuracy on the positive class than the linearly weighted wAUC does.

Fig. 6: wAUC isometrics (linear weight function).  Fig. 7: wAUC isometrics (exponential weight function).

4.2 Simulation analysis

Linear case. Let g(y) = ay + b. Since g(y) is monotonically increasing, a ≥ 0. From ∫_0^1 g(y) dy = ∫_0^1 (ay + b) dy = a/2 + b = 1 we get b = 1 − a/2, so g(y) = ay + 1 − a/2. Then

wAUC(x, y) = (1/2) g(0) y + (1/2)(1 − x) g(y)
           = (1/2)(1 − a/2) y + (1/2)(1 − x)(ay + 1 − a/2)
           = (1/2 + a/4) y − (a/2) xy − (1/2 − a/4) x + (1/2 − a/4).

Exponential case. Let g(y) = e^y / (e − 1), which is monotonically increasing and satisfies ∫_0^1 g(y) dy = ∫_0^1 e^y / (e − 1) dy = 1. Then

wAUC(x, y) = (1/2) g(0) y + (1/2)(1 − x) g(y)
           = (1/2) · y / (e − 1) + (1/2)(1 − x) · e^y / (e − 1)
           = (y + (1 − x) e^y) / (2(e − 1)).

The main purpose of wAUC is to weight the evaluation so that it suits imbalanced domains: the weighting favours results with better classification rates on the minority class. Table 2 reports the values of the different measures for classifiers θ1, ..., θ9, all of which have AUC = 0.75. Note that θ5 achieves a perfect balance between TPrate and TNrate, θ1 is biased towards the accuracy on the negative class, and θ9 is biased towards the accuracy on the positive class. Accuracy would select the biased θ1 because it strongly depends on the majority-class rate. Gmean suggests the most balanced configurations, ignoring the fact that the minority class is usually the most important; moreover, Gmean does not distinguish between θ3 and θ5. This drawback can be overcome by using wAUC with an appropriately chosen weight function.
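A quick numerical check (not from the paper) confirms that the expanded closed forms above agree with the generic single-classifier formula wAUC(x, y) = (1/2) g(0) y + (1/2)(1 − x) g(y).

```python
import math
import random

def wauc_generic(x, y, g):
    # Generic single-classifier formula from Section 3.2.
    return 0.5 * g(0.0) * y + 0.5 * (1.0 - x) * g(y)

def wauc_linear_closed(x, y, a):
    # Expanded closed form for g(y) = a*y + 1 - a/2.
    return (0.5 + a / 4.0) * y - (a / 2.0) * x * y - (0.5 - a / 4.0) * x + (0.5 - a / 4.0)

def wauc_exp_closed(x, y):
    # Expanded closed form for g(y) = e^y / (e - 1).
    return (y + (1.0 - x) * math.exp(y)) / (2.0 * (math.e - 1.0))

random.seed(0)
for _ in range(5):
    x, y = random.random(), random.random()
    a = 2.0 * random.random()  # a in [0, 2] keeps g(y) = a*y + 1 - a/2 non-negative on [0, 1]
    g_lin = lambda t, a=a: a * t + 1.0 - a / 2.0
    g_exp = lambda t: math.exp(t) / (math.e - 1.0)
    assert abs(wauc_linear_closed(x, y, a) - wauc_generic(x, y, g_lin)) < 1e-12
    assert abs(wauc_exp_closed(x, y) - wauc_generic(x, y, g_exp)) < 1e-12
print("closed forms agree with the generic wAUC formula")
```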

Table 2: The values of the different measures in the case of AUC = 0.75, for classifiers θ1 through θ9 (columns: Acc, Precision, F-measure, Gmean, wAUC with linear weight, wAUC with exponential weight).

One can see that the exponentially weighted wAUC selects θ8, which corresponds to a moderate configuration with one of the highest TPrate values. All in all, θ8 is biased towards the minority class but does not consider the minority class alone, whereas θ9 is biased towards the positive class only; when the exponentially weighted wAUC is used as the model selection measure, θ8 is selected.

Fig. 8 shows the superposition of the wAUC isometrics (exponential weight function) and the isometric of AUC = 0.75. Fig. 9 shows the intersections of the isometrics of θ8 and θ9 with the isometric of AUC = 0.75.

Fig. 8: The superposition of the wAUC isometrics and the AUC isometric.  Fig. 9: Intersections of the isometrics of θ8 and θ9 with the isometric of AUC = 0.75.

5 Conclusion

In this paper, we have introduced a new method to evaluate the performance of classification systems on two-class problems with skewed data distributions. Theoretical and empirical studies have shown the robustness and advantages of wAUC with respect to other well-known performance measures. For the imbalanced data classification problem, the performance evaluation measure should distinguish between the accuracies on the different classes when evaluating and selecting classifiers.
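The selection behaviour just described can be reproduced in a few lines. The sketch below is an illustration only: the nine operating points are hypothetical, evenly spaced along the AUC = 0.75 line rather than the exact configurations of Table 2, and it reports which point each measure would pick.

```python
import math

def gmean(x, y):
    return math.sqrt(y * (1.0 - x))

def wauc(x, y, g):
    return 0.5 * g(0.0) * y + 0.5 * (1.0 - x) * g(y)

g_lin = lambda t: t + 0.5
g_exp = lambda t: math.exp(t) / (math.e - 1.0)

# Hypothetical operating points theta_1..theta_9 on the AUC = 0.75 line
# (TPrate - FPrate = 0.5), from negative-class-biased to positive-class-biased.
thetas = [(0.0625 * i, 0.5 + 0.0625 * i) for i in range(9)]

measures = {
    "Gmean": gmean,
    "wAUC (linear)": lambda x, y: wauc(x, y, g_lin),
    "wAUC (exponential)": lambda x, y: wauc(x, y, g_exp),
}
for name, m in measures.items():
    best = max(range(len(thetas)), key=lambda i: m(*thetas[i]))
    print(f"{name:20s} selects theta_{best + 1} at (FPrate, TPrate) = {thetas[best]}")
# Gmean peaks at the balanced theta_5, while the exponentially weighted wAUC
# prefers a configuration with a clearly higher TPrate.
```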

In order to focus on the accuracy on the positive class, different weights are applied to different regions under the ROC curve, and the weighted AUC (wAUC) is defined so that regions with a higher true positive rate (TPrate) receive larger weights, which makes the performance evaluation measure favour classifiers with higher accuracy on the positive class. Theoretical analysis and the discussion of the isometrics of wAUC show that wAUC outperforms AUC.

References

[1] N. V. Chawla, Data mining for imbalanced datasets: An overview, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, Springer, Heidelberg, 2010.
[2] N. Japkowicz, S. Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis, 6 (2002) 429-449.
[3] N. V. Chawla, N. Japkowicz, A. Kotcz, Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter, 6 (2004) 1-6.
[4] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27 (2006) 861-874.
[5] J. P. Egan, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception, Academic Press, 1975.
[6] K. A. Spackman, Signal detection theory: Valuable tools for evaluating inductive learning, in: Proc. of the Sixth International Workshop on Machine Learning, 1989.
[7] W. Elazmeh, N. Japkowicz, S. Matwin, Evaluating misclassifications in imbalanced data, in: Proc. of the 17th European Conference on Machine Learning, 2006.
[8] J. Huang, C. X. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering, 17 (2005).
[9] S. Daskalaki, I. Kopanas, N. Avouris, Evaluation of classifiers for an uneven class distribution problem, Applied Artificial Intelligence, 20 (2006).
[10] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, in: Proc. of the 14th International Conference on Machine Learning, Nashville, 1997.
[11] P. Flach, The geometry of ROC space: Understanding machine learning metrics through ROC isometrics, in: Proc. of the 20th International Conference on Machine Learning, 2003.
[12] J. Fürnkranz, P. Flach, ROC 'n' rule learning - towards a better understanding of covering algorithms, Machine Learning, 58 (2005).
[13] F. Provost, T. Fawcett, Robust classification for imprecise environments, Machine Learning, 42 (2001).
[14] R. Vilalta, D. Oblinger, A quantification of distance-bias between evaluation metrics in classification, in: Proc. of the 17th International Conference on Machine Learning, 2000.
[15] J. Fürnkranz, P. Flach, An analysis of rule evaluation metrics, in: Proc. of the 20th International Conference on Machine Learning, 2003.
