# Performance Measures in Data Mining

1 Performance Measures in Data Mining — Common Performance Measures Used in Data Mining and Machine Learning Approaches. L. Richter, J. M. Cejuela, Department of Computer Science, Technische Universität München. Master Lab Course Data Mining, SS 2015, July 1st.

2 Outline:

- Item Set and Association Rule Weights: Simple Measures, Complex Measures
- Basic Performance Measures
- Complex Measures: Performance Curves
- Regression
- Assessment Strategies

3 Item Set and Association Rule Weights — Simple Measures. Measures for Item Sets. Various algorithms can yield frequent item sets. From frequent item sets c and c ∪ {i} you can derive the rule "if c then {i}". Typically there is only one item in the RHS (right-hand side of the rule).

- Support (of an item set): sup(X → Y) = sup(Y → X) = P(X ∪ Y) — how often the item set is found in the database.
- Confidence (of a rule): conf(X → Y) = P(Y | X) = P(X and Y) / P(X) = sup(X → Y) / sup(X).
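
Support and confidence can be sketched in plain Python over a toy transaction database (the transactions and function names here are illustrative, not from the slides):

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """conf(X -> Y) = sup(X u Y) / sup(X)."""
    return support(lhs | rhs, db) / support(lhs, db)
```

For example, {bread} → {butter} holds in 2 of the 3 transactions containing bread, so its confidence is 2/3 while its support is 2/4.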

6 Item Set and Association Rule Weights — Complex Measures. Measures for Item Sets (cont'd). These measure how frequent an item set is / how interesting a rule is compared to its expected occurrence under independence:

- Leverage (of an item set): lev(X → Y) = P(X and Y) − P(X)P(Y)
- Lift (of a rule): lift(X → Y) = lift(Y → X) = P(X ∪ Y) / (P(X)P(Y)) = conf(X → Y) / sup(Y) = conf(Y → X) / sup(X)
- Conviction (of a rule): similar to lift, but directed. Compares the probability that X appears without Y, if they were independent, with the observed frequency of X and Y: conviction(X → Y) = P(X)P(¬Y) / P(X and ¬Y) = (1 − sup(Y)) / (1 − conf(X → Y))
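
The three interestingness measures above can be sketched directly from their definitions (the toy database and function names are illustrative):

```python
def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def lift(lhs, rhs, db):
    # lift(X -> Y) = P(X u Y) / (P(X) P(Y))
    return support(lhs | rhs, db) / (support(lhs, db) * support(rhs, db))

def leverage(lhs, rhs, db):
    # lev(X -> Y) = P(X and Y) - P(X) P(Y)
    return support(lhs | rhs, db) - support(lhs, db) * support(rhs, db)

def conviction(lhs, rhs, db):
    # conviction(X -> Y) = (1 - sup(Y)) / (1 - conf(X -> Y));
    # infinite when the rule always holds (conf = 1).
    conf = support(lhs | rhs, db) / support(lhs, db)
    return float("inf") if conf == 1 else (1 - support(rhs, db)) / (1 - conf)

db = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
```

Here lift({a} → {b}) is below 1 and leverage is negative, i.e. a and b co-occur slightly less often than independence would predict.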

9 Item Set and Association Rule Weights — Complex Measures. Supplementary Material: J-Measure. Empirically observed accuracy of a rule. Cross-entropy (measuring how well one distribution approximates another) between the binary variables φ and θ, with vs. without conditioning on the event θ:

J(θ → φ) = p(θ) ( p(φ|θ) log( p(φ|θ) / p(φ) ) + (1 − p(φ|θ)) log( (1 − p(φ|θ)) / (1 − p(φ)) ) )
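
A minimal sketch of this formula (the function name is illustrative; the `0 log 0 = 0` convention is assumed, as usual for entropy-style measures):

```python
import math

def j_measure(p_theta, p_phi, p_phi_given_theta):
    """J(theta -> phi): p(theta) times the KL divergence between the
    conditional and the prior distribution of the binary variable phi."""
    def term(p, q):
        # Convention: 0 * log(0/q) = 0.
        return 0.0 if p == 0 else p * math.log(p / q)
    return p_theta * (term(p_phi_given_theta, p_phi)
                      + term(1 - p_phi_given_theta, 1 - p_phi))
```

The measure is zero when conditioning on θ tells us nothing about φ, and grows as the conditional distribution departs from the prior.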

10 Basic Performance Measures — Basic Building Blocks for Performance Measures:

- True Positive (TP): positive instances predicted as positive
- True Negative (TN): negative instances predicted as negative
- False Positive (FP): negative instances predicted as positive
- False Negative (FN): positive instances predicted as negative

Confusion matrix:

|        | Predicted a | Predicted b |
|--------|-------------|-------------|
| Real a | TP          | FN          |
| Real b | FP          | TN          |
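
The four building blocks can be counted from paired label lists (a minimal sketch; the function name and label values are illustrative):

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN) in confusion-matrix row order,
    treating `positive` as class a and everything else as class b."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn
```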

11 Basic Performance Measures — Performance Measures:

- Accuracy: acc = (TP + TN) / (TP + FN + FP + TN) = number of correct predictions / total number of predictions
- Error rate: err = (FN + FP) / (TP + FN + FP + TN) = 1 − acc = number of wrong predictions / total number of predictions
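
Expressed over the four confusion-matrix counts (function names illustrative):

```python
def accuracy(tp, fn, fp, tn):
    """acc = correct predictions / all predictions."""
    return (tp + tn) / (tp + fn + fp + tn)

def error_rate(tp, fn, fp, tn):
    """err = 1 - acc."""
    return 1 - accuracy(tp, fn, fp, tn)
```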

12 Basic Performance Measures — Performance Measures (cont'd):

- True Positive Rate (TPR), Sensitivity = TP / (TP + FN)
- True Negative Rate (TNR), Specificity = TN / (TN + FP)
- False Positive Rate (FPR) = FP / (TN + FP)
- False Negative Rate (FNR) = FN / (TP + FN)
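
The four rates in one sketch (function name illustrative); note that TPR + FNR = 1 and TNR + FPR = 1, which makes a handy sanity check:

```python
def rates(tp, fn, fp, tn):
    """TPR/TNR/FPR/FNR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }
```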

16 Basic Performance Measures — Performance Measures (cont'd):

- Precision: p = TP / (TP + FP)
- Recall: r = TP / (TP + FN)
- F1 measure = 2rp / (r + p) = 2·TP / (2·TP + FP + FN)
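
Both forms of the F1 formula agree, as a quick sketch shows (function names illustrative):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```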

19 Complex Measures — Performance Curves:

- the "costs" of different error types are different
- prediction behaviour changes over the test set
- performance is displayed in 2D
- different domains prefer different chart types

20 Complex Measures — Performance Curves: Lift Charts. Taken from the marketing area to evaluate mailing success.

- y-axis: number or percentage of responders
- x-axis: sample
- red diagonal: random lift
- green line: optimum lift
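
A minimal sketch of the y-values of such a chart: rank instances by model score and record the fraction of all responders captured in the top k (function name illustrative; score ties are broken naively):

```python
def cumulative_gains(scores, labels):
    """Fraction of all positives (responders) captured in the top-k
    scored instances, for k = 1..n.  labels: 1 = responder, 0 = not."""
    total_pos = sum(labels)
    ordered = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    captured, curve = 0, []
    for y in ordered:
        captured += y
        curve.append(captured / total_pos)
    return curve
```

A model whose curve climbs quickly above the diagonal reaches most responders with a small mailing sample.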

21 Complex Measures — Performance Curves: ROC Curves (Receiver Operating Characteristic).

- y-axis: TPR
- x-axis: FPR
- diagonal: random guessing (TPR = FPR)
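
An ROC curve is traced by sweeping a decision threshold over the model scores, a (FPR, TPR) point per threshold. A minimal sketch (function name illustrative; score ties are handled naively, one point per instance):

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the threshold one
    instance at a time.  labels: 1 = positive, 0 = negative."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts
```

A perfect ranking rises straight to (0, 1) before moving right; random scores hug the diagonal.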

22 Complex Measures — Performance Curves: Sensitivity vs. Specificity. Preferred in medicine.

- y-axis: TPR (sensitivity)
- x-axis: TNR (specificity)
- also frequently shown as a ROC curve with 1 − specificity on the x-axis

23 Complex Measures — Performance Curves: Recall-Precision Curves. Preferred in information retrieval. Positives are the documents retrieved in response to a query; true positives are the documents really relevant to the query.

- y-axis: precision
- x-axis: recall

24 Regression — Error Measures for Regression (predictions p₁, …, pₙ; actual values a₁, …, aₙ):

- Mean squared error: MSE = ((p₁ − a₁)² + … + (pₙ − aₙ)²) / n
- Root mean squared error: RMSE = √MSE
- Mean absolute error: MAE = (|p₁ − a₁| + … + |pₙ − aₙ|) / n
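
The three regression errors as a direct sketch of the formulas (function names illustrative):

```python
import math

def mse(pred, actual):
    """Mean squared error."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

def rmse(pred, actual):
    """Root mean squared error: same units as the target variable."""
    return math.sqrt(mse(pred, actual))

def mae(pred, actual):
    """Mean absolute error: less sensitive to outliers than MSE."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)
```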

27 Regression — Relative Error Measures (ā is the mean of the actual values):

- Relative squared error: RSE = ((p₁ − a₁)² + … + (pₙ − aₙ)²) / ((a₁ − ā)² + … + (aₙ − ā)²)
- Root relative squared error: RRSE = √RSE
- Relative absolute error: RAE = (|p₁ − a₁| + … + |pₙ − aₙ|) / (|a₁ − ā| + … + |aₙ − ā|)
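
These normalise the error by that of the trivial predictor that always outputs the mean ā, so a value of 1 means "no better than predicting the mean". A minimal sketch (function names illustrative):

```python
def relative_squared_error(pred, actual):
    mean = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for p, a in zip(pred, actual))
    den = sum((a - mean) ** 2 for a in actual)
    return num / den

def root_relative_squared_error(pred, actual):
    return relative_squared_error(pred, actual) ** 0.5

def relative_absolute_error(pred, actual):
    mean = sum(actual) / len(actual)
    num = sum(abs(p - a) for p, a in zip(pred, actual))
    den = sum(abs(a - mean) for a in actual)
    return num / den
```

Predicting the mean itself gives exactly 1.0 for all three measures.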

28 Assessment Strategies — General Problem:

- each algorithm abstracts from the observations (instances)
- which aspects are kept and which are discarded differs between learning schemes (inductive bias)
- this also means: information about individual instances is contained in the model, too
- relying on individual-instance information leads to overfitting

29 Assessment Strategies — Solution Strategies:

- use fresh data, i.e. instances not used for training
- for very large numbers of instances: a simple split into training and test set
- most common: 10-fold cross-validation
- LOOCV: leave-one-out cross-validation
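
The k-fold idea can be sketched as index bookkeeping: split the n instances into k folds and let each fold serve once as the test set; k = n gives LOOCV. (A minimal sketch with contiguous folds; real use would shuffle or stratify the indices first.)

```python
def kfold_indices(n, k):
    """Return k (test_indices, train_indices) pairs covering 0..n-1.
    Each index appears in exactly one test fold; k == n is LOOCV."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for fold in folds:
        test = set(fold)
        train = [j for j in range(n) if j not in test]
        splits.append((fold, train))
    return splits
```

The per-fold test errors are then averaged to estimate performance on fresh data.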


### Selecting Data Mining Model for Web Advertising in Virtual Communities

Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński

### Is there ethics in algorithms? Searching for ethics in contemporary technology. Martin Peterson

Is there ethics in algorithms? Searching for ethics in contemporary technology Martin Peterson www.martinpeterson.org m.peterson@tue.nl What is an algorithm? An algorithm is a finite it sequence of well-defined

### Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance

Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz, Ibai Gurrutxaga, and José I. Martín Dept. of Computer

### MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

### Link Prediction Analysis in the Wikipedia Collaboration Graph

Link Prediction Analysis in the Wikipedia Collaboration Graph Ferenc Molnár Department of Physics, Applied Physics, and Astronomy Rensselaer Polytechnic Institute Troy, New York, 121 Email: molnaf@rpi.edu

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C