Data Mining: Practical Machine Learning Tools and Techniques




Counting the cost
Data Mining: Practical Machine Learning Tools and Techniques (slides for Section 5.7)

Counting the cost
- In practice, different types of classification errors often incur different costs.
- Examples: loan decisions, oil-slick detection, fault diagnosis, promotional mailing.

Counting the cost
- The confusion matrix:

                  Predicted +        Predicted -
    Actual +      true positive      false negative
    Actual -      false positive     true negative

- There are many other types of cost! E.g.: the cost of collecting training data.

Aside: the kappa statistic
- Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right).
- Number of successes: the sum of the entries on the diagonal (D).
- Kappa statistic: kappa = (D_observed - D_random) / (D_perfect - D_random).
- It measures the relative improvement over a random predictor.
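
The slides' two 3-class matrices are not reproduced in this transcription, but the computation is easy to sketch in Python with NumPy; the confusion matrix below is a hypothetical example, and the random predictor's diagonal is derived from the marginals:

    import numpy as np

    # Hypothetical 3-class confusion matrix: rows = actual class, columns = predicted.
    confusion = np.array([[88, 10,  2],
                          [14, 40,  6],
                          [18, 10, 12]])

    d_perfect = confusion.sum()        # all 200 instances classified correctly
    d_observed = np.trace(confusion)   # successes: sum of the diagonal entries

    # Expected diagonal sum of a random predictor with the same row/column totals:
    d_random = (confusion.sum(axis=1) * confusion.sum(axis=0)).sum() / d_perfect

    kappa = (d_observed - d_random) / (d_perfect - d_random)
    print(round(kappa, 3))             # 0.492: about half the way from random to perfect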

Classification with costs
- Two cost matrices: the success rate is replaced by the average cost per prediction.
- The cost is given by the appropriate entry in the cost matrix.

Cost-sensitive classification
- We can take costs into account when making predictions. Basic idea: only predict the high-cost class when very confident about the prediction.
- Given: predicted class probabilities. Normally we just predict the most likely class; here, we should make the prediction that minimizes the expected cost.
- Expected cost: the dot product of the vector of class probabilities and the appropriate column of the cost matrix.
- Choose the column (class) that minimizes the expected cost.
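
A minimal sketch of this rule, assuming a hypothetical two-class cost matrix (rows = actual class, columns = predicted class; all numbers are illustrative):

    import numpy as np

    # Hypothetical cost matrix: cost[i, j] = cost of predicting class j
    # when the actual class is i. False positives are expensive here.
    cost = np.array([[ 0.0,  1.0],   # actual class: positive
                     [10.0,  0.0]])  # actual class: negative

    probs = np.array([0.7, 0.3])     # predicted class probabilities for one instance

    # Dot product of the probability vector with each column of the cost matrix.
    expected_costs = probs @ cost    # array([3.0, 0.7])
    prediction = int(np.argmin(expected_costs))

Even though the positive class is the more likely one (0.7), predicting it risks a costly false positive, so the expected-cost rule picks the other class.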

Cost-sensitive learning
- So far we haven't taken costs into account at training time.
- Most learning schemes do not perform cost-sensitive learning: they generate the same classifier no matter what costs are assigned to the different classes. Example: the standard decision tree learner.
- Simple methods for cost-sensitive learning: resampling of instances according to costs, or weighting of instances according to costs.
- Some schemes can take costs into account by varying a parameter, e.g. naive Bayes.
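
The resampling variant is easy to sketch; the costs and sizes below are hypothetical, and weighting is the same idea with the weights handed directly to a learner that accepts instance weights:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical training labels; class 1 is the expensive one to misclassify.
    y = rng.integers(0, 2, size=1000)
    class_cost = np.array([1.0, 5.0])   # misclassification cost of class 0 / class 1

    # Resample so an instance's chance of inclusion is proportional to the
    # cost of getting its class wrong.
    weights = class_cost[y]
    idx = rng.choice(len(y), size=len(y), replace=True, p=weights / weights.sum())
    # Train any cost-insensitive learner on X[idx], y[idx].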

Lift charts
- In practice, costs are rarely known; decisions are usually made by comparing possible scenarios.
- Example: a promotional mailout to 1,000,000 households. Mail to all: 0.1% respond (1000). A data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400); 40% of the responses for 10% of the cost may pay off. Identifying the 400,000 most promising, 0.2% respond (800).
- A lift chart allows a visual comparison.

Generating a lift chart
- Sort the instances in decreasing order of their predicted probability of being positive (0.95, 0.93, 0.93, 0.88, ...), keeping each instance's actual class alongside.

A hypothetical lift chart
- The x axis is the sample size; the y axis is the number of true positives.
- Example points: 40% of the responses for 10% of the cost; 80% of the responses for 40% of the cost.
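
A minimal sketch of the bookkeeping behind such a chart, with hypothetical probabilities and labels (1 = respondent):

    import numpy as np

    # Hypothetical predicted probabilities and actual classes.
    p = np.array([0.95, 0.93, 0.93, 0.88, 0.55, 0.42, 0.30, 0.20])
    y = np.array([1, 1, 0, 1, 0, 1, 0, 0])

    order = np.argsort(-p)        # decreasing predicted probability
    y_sorted = y[order]

    # One lift-chart point per prefix of the ranking:
    sample_size = np.arange(1, len(y) + 1) / len(y)   # x axis: fraction contacted
    true_positives = np.cumsum(y_sorted)              # y axis: responses captured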

ROC curves
- ROC curves are similar to lift charts; "ROC" stands for "receiver operating characteristic".
- The term comes from signal detection, where such curves show the tradeoff between hit rate and false alarm rate over a noisy channel.
- Differences from lift charts: the y axis shows the percentage of true positives in the sample rather than their absolute number, and the x axis shows the percentage of false positives rather than the sample size.

A sample ROC curve
- A jagged curve comes from using a single set of test data.
- For a smoother curve, use cross-validation.
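
Continuing the hypothetical arrays from the lift-chart sketch, the ROC points differ only in that both axes become rates:

    import numpy as np

    p = np.array([0.95, 0.93, 0.93, 0.88, 0.55, 0.42, 0.30, 0.20])
    y = np.array([1, 1, 0, 1, 0, 1, 0, 0])   # hypothetical labels, as before

    y_sorted = y[np.argsort(-p)]
    tp = np.cumsum(y_sorted)                  # true positives among the top k
    fp = np.cumsum(1 - y_sorted)              # false positives among the top k

    tp_rate = tp / y.sum()                    # y axis: true positive rate
    fp_rate = fp / (len(y) - y.sum())         # x axis: false positive rate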

Cross-validation and ROC curves
- A simple method of getting an ROC curve using cross-validation: collect the predicted probabilities for the instances in the test folds, then sort the instances according to these probabilities.
- This method is implemented in WEKA.
- However, this is just one possibility; another is to generate an ROC curve for each fold and average them.

ROC curves for two schemes
- For a small, focused sample, use method A.
- For a larger one, use method B.
- In between, choose between A and B with appropriate probabilities.
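
WEKA performs this pooling internally; as a rough Python analogue (a sketch assuming scikit-learn, with synthetic data standing in for a real test set):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Probability of the positive class for each instance, taken from the
    # cross-validation fold in which that instance was in the test set.
    probs = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y,
                              cv=10, method="predict_proba")[:, 1]

    order = np.argsort(-probs)   # one pooled ranking -> one ROC curve
    y_sorted = y[order]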

The convex hull
- Given two learning schemes, we can achieve any point on their convex hull!
- Let the TP and FP rates of scheme 1 be t1 and f1, and those of scheme 2 be t2 and f2.
- If scheme 1 is used to predict 100q % of the cases and scheme 2 for the rest, the combined scheme has:
  TP rate = q · t1 + (1 - q) · t2
  FP rate = q · f1 + (1 - q) · f2

More measures...
- Percentage of retrieved documents that are relevant: precision = TP/(TP+FP).
- Percentage of relevant documents that are returned: recall = TP/(TP+FN).
- Precision/recall curves have a hyperbolic shape.
- Summary measures: average precision at 20%, 50%, and 80% recall (three-point average recall).
- F-measure = (2 · recall · precision)/(recall + precision).
- sensitivity · specificity = (TP/(TP+FN)) · (TN/(FP+TN)).
- Area under the ROC curve (AUC): the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one.
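
Both the interpolation and the retrieval measures are one-liners; a minimal sketch with hypothetical counts:

    # Hypothetical confusion-matrix counts.
    TP, FP, FN, TN = 40, 10, 20, 130

    precision = TP / (TP + FP)      # 0.8
    recall = TP / (TP + FN)         # 0.667
    f_measure = 2 * recall * precision / (recall + precision)

    # Convex hull: run scheme 1 on a random fraction q of the cases,
    # scheme 2 on the rest.
    def combine(t1, f1, t2, f2, q):
        """TP and FP rates of the randomized combination of two schemes."""
        return q * t1 + (1 - q) * t2, q * f1 + (1 - q) * f2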

Summary of some measures
- Lift chart (domain: marketing): plots TP against subset size = (TP+FP)/(TP+FP+TN+FN).
- ROC curve (domain: communications): plots the TP rate = TP/(TP+FN) against the FP rate = FP/(FP+TN).
- Recall-precision curve (domain: information retrieval): plots recall = TP/(TP+FN) against precision = TP/(TP+FP).

Cost curves
- Cost curves plot expected costs directly.
- Example: the case with uniform costs, i.e. the expected cost is simply the error rate.