Performance Measures in Data Mining


Performance Measures in Data Mining
Common Performance Measures used in Data Mining and Machine Learning Approaches
L. Richter, J. M. Cejuela
Department of Computer Science, Technische Universität München
Master Lab Course Data Mining, SS 2015, Jul 1st

Outline
Item Set and Association Rule Weights: Simple Measures, Complex Measures
Basic Performance Measures
Complex Measures: Performance Curves
Regression
Assessment Strategies

Item Set and Association Rule Weights: Simple Measures
Measures for Item Sets
Various algorithms can yield frequent item sets. From the frequent item sets c and c ∪ {i} you can derive the rule "if c then {i}". Typically there is only one item in the RHS (right-hand side) of the rule.
Support (of an item set): $\mathrm{sup}(X \Rightarrow Y) = \mathrm{sup}(Y \Rightarrow X) = P(X \cup Y)$, i.e. how many times the item set is found in the database.
Confidence (of a rule): $\mathrm{conf}(X \Rightarrow Y) = P(Y \mid X) = P(X \text{ and } Y)/P(X) = \mathrm{sup}(X \cup Y)/\mathrm{sup}(X)$
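
To make the two definitions concrete, here is a minimal Python sketch (not part of the original slides) that computes relative support and confidence over a toy transaction database; the transactions and the helper names `support` and `confidence` are illustrative assumptions.

```python
# Toy transaction database; each transaction is a set of items (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing the whole itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """conf(lhs => rhs) = sup(lhs u rhs) / sup(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))   # 0.75
```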

Item Set and Association Rule Weights: Complex Measures
Measures for Item Sets cont'd
Measures of how frequent an item set is / how interesting a rule is in comparison to its expected occurrence under independence (interestingness):
Leverage (of an item set): $\mathrm{lev}(X \Rightarrow Y) = P(X \text{ and } Y) - P(X)P(Y)$
Lift (of a rule): $\mathrm{lift}(X \Rightarrow Y) = \mathrm{lift}(Y \Rightarrow X) = P(X \cup Y)/(P(X)P(Y)) = \mathrm{conf}(X \Rightarrow Y)/\mathrm{sup}(Y) = \mathrm{conf}(Y \Rightarrow X)/\mathrm{sup}(X)$
Conviction (of a rule): similar to lift, but directed. It compares the probability that X appears without Y, if they were independent, with the observed frequency of X without Y:
$\mathrm{conviction}(X \Rightarrow Y) = P(X)P(\neg Y)/P(X \wedge \neg Y) = (1 - \mathrm{sup}(Y))/(1 - \mathrm{conf}(X \Rightarrow Y))$
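
The derived measures follow directly from the supports. The sketch below (an illustration, not part of the slides) uses the relative supports of the toy transactions above; the function names and the handling of exact rules in conviction are assumptions.

```python
def lift(sup_xy, sup_x, sup_y):
    """lift(X => Y) = P(X u Y) / (P(X) * P(Y)) = conf(X => Y) / sup(Y)."""
    return sup_xy / (sup_x * sup_y)

def leverage(sup_xy, sup_x, sup_y):
    """lev(X => Y) = P(X and Y) - P(X) * P(Y)."""
    return sup_xy - sup_x * sup_y

def conviction(sup_xy, sup_x, sup_y):
    """conviction(X => Y) = (1 - sup(Y)) / (1 - conf(X => Y)); infinite for exact rules."""
    conf = sup_xy / sup_x
    if conf == 1.0:
        return float("inf")
    return (1.0 - sup_y) / (1.0 - conf)

# Supports taken from the toy transactions above:
# sup(bread) = 0.8, sup(milk) = 0.8, sup({bread, milk}) = 0.6
print(lift(0.6, 0.8, 0.8))        # 0.9375, slightly below independence
print(leverage(0.6, 0.8, 0.8))    # -0.04
print(conviction(0.6, 0.8, 0.8))  # 0.8
```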

Item Set and Association Rule Weights: Complex Measures
Supplementary Material: J-Measure
The J-measure weights the empirically observed accuracy of a rule by the probability of its antecedent: it multiplies $p(\theta)$ with the cross-entropy (measuring how well one distribution approximates another) between the distribution of the binary variable $\varphi$ with vs. without conditioning on the event $\theta$:
$$J(\theta \Rightarrow \varphi) = p(\theta)\left( p(\varphi \mid \theta)\log\frac{p(\varphi \mid \theta)}{p(\varphi)} + \bigl(1 - p(\varphi \mid \theta)\bigr)\log\frac{1 - p(\varphi \mid \theta)}{1 - p(\varphi)} \right)$$
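
A small sketch of how the J-measure could be evaluated once the three probabilities are estimated from data; it is illustrative only, and the natural logarithm together with the 0 · log 0 = 0 convention are assumptions, since the slide does not fix a log base.

```python
import math

def j_measure(p_theta, p_phi, p_phi_given_theta):
    """J(theta => phi) as on the slide.

    p_theta           -- P(theta), probability of the rule body
    p_phi             -- P(phi), unconditional probability of the rule head
    p_phi_given_theta -- P(phi | theta), confidence of the rule
    """
    def term(p, q):
        # p * log(p / q), with the usual 0 * log 0 = 0 convention
        return 0.0 if p == 0.0 else p * math.log(p / q)

    cross_entropy = term(p_phi_given_theta, p_phi) + term(1.0 - p_phi_given_theta, 1.0 - p_phi)
    return p_theta * cross_entropy

# Rule with body probability 0.4, head probability 0.5, confidence 0.9:
print(j_measure(0.4, 0.5, 0.9))   # about 0.147
```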

Basic Performance Measures
Basic Building Blocks for Performance Measures
True Positive (TP): positive instances predicted as positive
True Negative (TN): negative instances predicted as negative
False Positive (FP): negative instances predicted as positive
False Negative (FN): positive instances predicted as negative
Confusion Matrix:
              Predicted a   Predicted b
Real a            TP            FN
Real b            FP            TN
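
A minimal sketch of tallying the four counts from parallel lists of true and predicted labels; the 0/1 label encoding and the example data are assumptions for illustration.

```python
# True and predicted class labels (1 = positive, 0 = negative); illustrative data.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

# Confusion matrix in the layout of the slide: rows = real class, columns = predicted class.
print("            pred a  pred b")
print(f"real a (1)   {TP:4d}    {FN:4d}")
print(f"real b (0)   {FP:4d}    {TN:4d}")
```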

Basic Performance Measures
Performance Measures
Accuracy: $\mathrm{acc} = \frac{TP + TN}{TP + FN + FP + TN} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$
Error rate: $\mathrm{err} = \frac{FN + FP}{TP + FN + FP + TN} = 1 - \mathrm{acc} = \frac{\text{number of wrong predictions}}{\text{total number of predictions}}$
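
From such counts, accuracy and error rate follow directly; the concrete numbers below are the illustrative counts produced by the previous sketch.

```python
# Counts from a confusion matrix (the illustrative values tallied above).
TP, FN, FP, TN = 3, 2, 1, 4

total = TP + FN + FP + TN
acc = (TP + TN) / total   # fraction of correct predictions
err = (FN + FP) / total   # fraction of wrong predictions, equals 1 - acc

print(f"accuracy = {acc:.2f}, error rate = {err:.2f}")   # 0.70, 0.30
```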

Basic Performance Measures
Performance Measures cont'd
True Positive Rate (TPR), Sensitivity: $\mathrm{TPR} = \frac{TP}{TP + FN}$
True Negative Rate (TNR), Specificity: $\mathrm{TNR} = \frac{TN}{TN + FP}$
False Positive Rate: $\mathrm{FPR} = \frac{FP}{TN + FP}$
False Negative Rate: $\mathrm{FNR} = \frac{FN}{TP + FN}$
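
The four rates, computed from the same illustrative counts as in the sketches above; purely an illustration of the formulas.

```python
# Counts from a confusion matrix (same illustrative values as above).
TP, FN, FP, TN = 3, 2, 1, 4

TPR = TP / (TP + FN)   # sensitivity
TNR = TN / (TN + FP)   # specificity
FPR = FP / (TN + FP)   # = 1 - specificity
FNR = FN / (TP + FN)   # = 1 - sensitivity

print(TPR, TNR, FPR, FNR)   # 0.6 0.8 0.2 0.4
```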

Basic Performance Measures
Performance Measures cont'd
Precision: $p = \frac{TP}{TP + FP}$
Recall: $r = \frac{TP}{TP + FN}$
$F_1$ measure: $F_1 = \frac{2rp}{r + p} = \frac{2\,TP}{2\,TP + FP + FN}$
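
Precision, recall and F1 from the same illustrative counts; a sketch of the formulas only.

```python
# Counts from a confusion matrix (same illustrative values as above).
TP, FN, FP, TN = 3, 2, 1, 4

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # same as 2*TP / (2*TP + FP + FN)

print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.3f}")
```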

Complex Measures: Performance Curves
Performance Curves
the "costs" of different error types differ
prediction behaviour changes over the test set
performance is displayed in 2D
different domains prefer different chart types

Complex Measures: Performance Curves
Lift Charts
Taken from the marketing area to evaluate mailing success.
y-axis: number or percentage of responders
x-axis: sample size
red diagonal: random lift
green line: optimum lift
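
A sketch of how the points of such a chart can be derived from classifier scores: rank the test instances by predicted score and accumulate the responders against the sample size. The scores, response flags, and plotting-free printout are illustrative assumptions.

```python
# Predicted scores and true responder flags for a small test set (illustrative data).
scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2, 0.15, 0.1]
responded = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

# Rank instances by score (highest first) and accumulate responders.
ranked = sorted(zip(scores, responded), key=lambda sr: sr[0], reverse=True)
total_responders = sum(responded)

cum = 0
for i, (_, r) in enumerate(ranked, start=1):
    cum += r
    # x: fraction of the sample contacted, y: fraction of all responders reached
    print(f"{i / len(ranked):.1f}  {cum / total_responders:.2f}")
```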

Complex Measures: Performance Curves
ROC Curves (Receiver Operating Characteristic)
y-axis: TPR
x-axis: FPR
diagonal: random guessing (TPR = FPR)
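
A sketch of how an ROC curve and its area can be computed by sweeping the decision threshold over sorted scores. The test data are the same illustrative values as above; assuming distinct score values and using the trapezoidal rule for the area are choices of this sketch, not prescribed by the slide.

```python
# Predicted scores and true labels (1 = positive); illustrative data.
scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2, 0.15, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

P = sum(labels)
N = len(labels) - P

# Sweep the threshold from high to low; each instance adds one ROC point.
points = [(0.0, 0.0)]
tp = fp = 0
for _, lab in sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True):
    if lab == 1:
        tp += 1
    else:
        fp += 1
    points.append((fp / N, tp / P))   # (FPR, TPR)

print(points)

# Area under the curve via the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC = {auc:.3f}")
```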

Complex Measures: Performance Curves
Sensitivity vs. Specificity
preferred in medicine
y-axis: TPR (sensitivity)
x-axis: TNR (specificity)
also frequently drawn as a ROC curve with 1 - specificity on the x-axis

Complex Measures: Performance Curves
Recall-Precision Curves
preferred in information retrieval
positives are the documents retrieved in response to a query
true positives are documents really relevant to the query
y-axis: precision
x-axis: recall

Regression
Error Measures for Regression
(with $p_i$ the predicted and $a_i$ the actual value of instance $i$)
Mean squared error, $\mathrm{MSE} = \frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}$
Root mean squared error, $\mathrm{RMSE} = \sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{n}}$
Mean absolute error, $\mathrm{MAE} = \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{n}$
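
The three error measures written out for small lists of predicted and actual values; the numbers are illustrative only.

```python
import math

# Predicted values p_i and actual values a_i (illustrative data).
p = [2.5, 0.0, 2.1, 7.8]
a = [3.0, -0.5, 2.0, 7.0]
n = len(p)

mse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
rmse = math.sqrt(mse)
mae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```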

Regression
Relative Error Measures
Relative squared error $= \frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{(a_1 - \bar{a})^2 + \dots + (a_n - \bar{a})^2}$
Root relative squared error $= \sqrt{\frac{(p_1 - a_1)^2 + \dots + (p_n - a_n)^2}{(a_1 - \bar{a})^2 + \dots + (a_n - \bar{a})^2}}$
Relative absolute error $= \frac{|p_1 - a_1| + \dots + |p_n - a_n|}{|a_1 - \bar{a}| + \dots + |a_n - \bar{a}|}$
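
The relative variants normalise by the spread of the actual values around their mean; a sketch with the same illustrative data as above.

```python
import math

# Predicted values p_i and actual values a_i (same illustrative data as above).
p = [2.5, 0.0, 2.1, 7.8]
a = [3.0, -0.5, 2.0, 7.0]
a_mean = sum(a) / len(a)

sq_err = sum((pi - ai) ** 2 for pi, ai in zip(p, a))
sq_dev = sum((ai - a_mean) ** 2 for ai in a)
abs_err = sum(abs(pi - ai) for pi, ai in zip(p, a))
abs_dev = sum(abs(ai - a_mean) for ai in a)

rse = sq_err / sq_dev    # relative squared error
rrse = math.sqrt(rse)    # root relative squared error
rae = abs_err / abs_dev  # relative absolute error

print(f"RSE = {rse:.3f}, RRSE = {rrse:.3f}, RAE = {rae:.3f}")
```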

Assessment Strategies
General Problem
Each algorithm abstracts from the observations (instances).
The aspects kept and the aspects discarded differ between learning schemes (inductive bias).
This also means: information about individual instances is contained in the model, too.
Individual instance information leads to overfitting.

Assessment Strategies
Solution Strategies
Use fresh data, i.e. instances not used for training.
For very large numbers of instances: a simple split into training and test set.
Most common: 10-fold cross-validation.
LOOCV: leave-one-out cross-validation.
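
A minimal sketch of a k-fold split in plain Python; the fold construction, the dummy `train_and_evaluate` placeholder and the instance count are assumptions for illustration, not the lab course's actual setup.

```python
import random

def k_fold_indices(n_instances, k, seed=0):
    """Shuffle the instance indices and split them into k roughly equal folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_and_evaluate(train_idx, test_idx):
    # Placeholder for fitting a model on train_idx and scoring it on test_idx.
    return len(test_idx) / (len(train_idx) + len(test_idx))   # dummy score

# 10-fold cross-validation skeleton over 50 instances.
folds = k_fold_indices(50, k=10)
scores = []
for test_idx in folds:
    train_idx = [j for fold in folds if fold is not test_idx for j in fold]
    scores.append(train_and_evaluate(train_idx, test_idx))

print(sum(scores) / len(scores))   # average performance over the 10 folds

# LOOCV is the special case where k equals the number of instances:
loocv_folds = k_fold_indices(50, k=50)
```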
