# COMP9417: Machine Learning and Data Mining


**A note on the two-class confusion matrix, lift charts and ROC curves**

Session 1, 2003

**Acknowledgement.** The material in this supplementary note is based in part on Chapter 5, Section 7 of Witten and Frank [2] and the paper by Fawcett [1].

## 1 Two-class confusion matrix

A confusion matrix may be used to summarise the predictive performance of a classifier on test data. It is most commonly encountered in a two-class format, but can be generated for any number of classes. Suppose we have a two-class problem with classes referred to as positive and negative. A single prediction by a classifier then has four possible outcomes, which are displayed in the confusion matrix of Table 1.

|                     | Predicted positive | Predicted negative |
|---------------------|--------------------|--------------------|
| **Actual positive** | true positive      | false negative     |
| **Actual negative** | false positive     | true negative      |

Table 1: Two-class confusion matrix (also known as a 2 × 2 contingency table).

- **true positive**: the actual class of the test instance is positive and the classifier correctly predicts the class as positive
- **false negative**: the actual class of the test instance is positive but the classifier incorrectly predicts the class as negative
- **false positive**: the actual class of the test instance is negative but the classifier incorrectly predicts the class as positive
- **true negative**: the actual class of the test instance is negative and the classifier correctly predicts the class as negative

For a particular data set the confusion matrix will contain in its cells the number of instances for each of the four possible classification outcomes. Note that the correct predictions lie on the main diagonal of the matrix while the incorrect predictions lie in the off-diagonal cells.

[Figure 1: A lift chart. The diagonal indicates the expected success rate (number of true positives predicted for a sample) which would be obtained for random samples. The two cases discussed in the text are marked on the curve from our fictional data set: 10% of the recipients contain 400 respondents, and 40% of the recipients contain 800 respondents.]

We can obtain the familiar measure of predictive accuracy of a classifier from

    accuracy = (tp + tn) / (tp + fn + fp + tn) × 100%

where tp stands for true positives, fn for false negatives, fp for false positives and tn for true negatives. The predictive accuracy of a two-class classifier is therefore maximised by maximising the sum of the true positives and true negatives as a proportion of the total number of instances.

### 1.1 Costs

Real-world applications usually have several kinds of attached cost. For example, simply collecting example data may be expensive. Alternatively, sample instances may be easily obtained whereas applying classification labels for some target function to this data may be much more costly. These costs are up-front, since they apply before any machine learning method can be applied. However, there are also costs attached to using a hypothesis constructed by machine learning in order to do something; for example, the runtime efficiency of a classifier in making predictions may be critical in a real-time system. Another common situation in many applications is that the misclassification costs differ for false positives and false negatives. It may also happen that the classification benefits differ for true positives and true negatives.
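As a concrete illustration, predictive accuracy follows directly from the four confusion matrix counts. A minimal sketch in Python, using made-up counts:

```python
def accuracy(tp, fn, fp, tn):
    """Correct predictions (main diagonal) as a percentage of all predictions."""
    return 100 * (tp + tn) / (tp + fn + fp + tn)

# Hypothetical counts for a two-class test set of 100 instances.
print(accuracy(tp=40, fn=10, fp=5, tn=45))  # 85.0
```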

Bear in mind, however, that it may not always be possible to quantify accurately the actual costs for each outcome in the confusion matrix.

## 2 Lift charts

This is a method of evaluating performance based on identifying subsets of the test data on which a classifier can predict with better accuracy relative to its performance on the entire test set. It is usually associated with problems from marketing. The lift factor is defined as the success rate in predicting positives on a subset of the data as a proportion of the success rate in predicting positives on all of the data:

    Lift factor = (tp_S / N_S) / (tp_T / N_T)

where tp_S and N_S are the number of true positives in and the size of the sample S, and tp_T and N_T are the number of true positives in and the size of the test set T.

For example, suppose we have a data set from a mail-out marketing campaign in which there were 1000 respondents among all of the recipients. Now we learn a classifier which can identify a subset containing 10% of the recipients in which there are 400 respondents (the first case marked in Figure 1):

    Lift factor = (400 / 0.1 N_T) / (1000 / N_T) = 4.0

In marketing terminology this lift factor indicates the potential increase in response rate. If costs are known, it can be calculated whether it is better to mail the complete list of recipients or only the subset identified. Now suppose that another classifier is learned, with slightly different parameters, identifying a subset containing 40% of the recipients of whom 800 were respondents; the lift factor here is 2.0. In fact there may be many subsets of the entire test set, and it would be of interest to know the lift factor for each of them. An efficient way of calculating the lift factor for different subsets of the data is possible when a classifier can produce not only a predicted class label for a test instance but also a measure of confidence, such as a probability, in its prediction.
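The lift factor calculation can be sketched in a few lines of Python. The absolute recipient counts below are invented for illustration; only the 10% proportion and the respondent counts come from the example above:

```python
def lift_factor(tp_sample, n_sample, tp_total, n_total):
    """Success rate on a sample as a proportion of the success rate overall."""
    return (tp_sample / n_sample) / (tp_total / n_total)

# Mail-out example: assume 1,000,000 recipients overall (hypothetical),
# 1000 respondents in total, and 400 respondents in a 10% sample.
lift = lift_factor(tp_sample=400, n_sample=100_000,
                   tp_total=1000, n_total=1_000_000)
print(lift)  # a lift of about 4
```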
In such cases we simply sort the test set in decreasing order of the confidence of prediction for the class of interest, say the positive class. Then we can find a subset of a given size N with the greatest possible proportion of positive instances by simply taking the first N instances in the sorted list. The lift factor can then be calculated as above from the number of true positives in the sample of size N. By repeating this process for each sample size k, from zero up to the total number of instances, we can obtain the success rate (number of true positives) for each such sample. Graphing the number of true positives on the y-axis against the proportion of the sample size to the total number of instances on the x-axis gives a lift chart. A hypothetical lift chart is shown in Figure 1. By inspection it is clear that results at or close to the top left corner of a lift chart are the most desirable, since these indicate close to a maximum success rate from those instances ranked most likely to be true positives by our learned classifier.

[Table 2: A set of predictions for a hypothetical two-class test set, ranked in decreasing order of the confidence of prediction for the positive class, also showing the actual class for each instance.]

## 3 ROC curves

The method of ROC (Receiver Operating Characteristic) analysis is closely related to lift charts. It provides techniques for visualizing classifier performance, and for evaluating and comparing algorithms. However, in this note we consider only the first of these applications; more details can be found in [1]. ROC analysis originated in signal detection theory to depict the trade-off between the so-called hit rate and false alarm rate. For the purposes of machine learning these correspond to the true positive rate and the false positive rate, respectively, of classifiers, which we define as follows:

    true positive rate  = true positives / total no. of positives
    false positive rate = false positives / total no. of negatives

As for lift charts, we approach the analysis of classifier performance by first sorting the classifier's test set predictions in decreasing order of confidence (or probability). Note that if we have a classifier which outputs only class predictions, such as a decision tree, it is usually possible to extract a confidence measure as well (in the case of decision trees, we could use the proportion of true positives out of all of the instances at a leaf node). Also, for a two-class case we can find the confidence of a prediction of the positive class given a confidence for the negative class by subtracting the latter from the maximum confidence. Table 2 contains a hypothetical set of ranked predictions; here the confidence is a probability.
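The ranking step shared by lift charts and ROC analysis is easy to express in code. A minimal sketch, using an invented list of (confidence, actual class) predictions rather than the data of Table 2:

```python
# Cumulative true-positive counts over a confidence-ranked test set:
# exactly the points plotted on a lift chart.
preds = [(0.95, "pos"), (0.9, "pos"), (0.8, "neg"), (0.7, "pos"),
         (0.6, "neg"), (0.5, "pos"), (0.4, "neg"), (0.3, "neg")]

preds.sort(key=lambda p: p[0], reverse=True)  # decreasing confidence
points, tp = [], 0
for k, (conf, actual) in enumerate(preds, start=1):
    tp += actual == "pos"
    points.append((k / len(preds), tp))  # (sample proportion, true positives)
print(points)
```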
Note that in Table 2 we can see the distribution of true positives throughout the ranked list of predictions.

**Algorithm: Graph of ROC curve.** Input: a list of test examples T sorted in descending order of confidence.

    begin
        Let K = |T| be the number of test examples
        Let P be the number of positive examples in T
        Let N be the number of negative examples in T
        TP := 0
        FP := 0
        for i = 1 to K do
            if example T[i] is positive then
                TP := TP + 1
            else
                FP := FP + 1
            endif
            output the point (FP/N, TP/P) on the ROC curve
        endfor
    end

[Figure 2: A ROC curve for the sample in Table 2. See text for details.]
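The pseudocode above translates almost line for line into Python. A sketch, where the test set is assumed to be given as (confidence, is_positive) pairs and the four pairs at the end are invented data:

```python
def roc_points(examples):
    """Generate (false positive rate, true positive rate) points
    from a test set of (confidence, is_positive) pairs."""
    ranked = sorted(examples, key=lambda e: e[0], reverse=True)
    p = sum(1 for _, pos in ranked if pos)   # total positives
    n = len(ranked) - p                      # total negatives
    tp = fp = 0
    points = []
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))      # one ROC point per instance
    return points

curve = roc_points([(0.9, True), (0.8, True), (0.7, False), (0.6, True)])
print(curve)  # the final point is always (1.0, 1.0)
```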

The first two predictions are true positives (the actual class is positive) whereas the third and ninth predictions are false positives (the actual class is negative). Taken together, we can develop an algorithm, shown above, to generate a ROC curve for a two-class learning problem. The algorithm computes a set of (x, y) points on a two-dimensional graph, where the y-axis denotes the true positive rate and the x-axis denotes the false positive rate. A ROC curve corresponding to the data in Table 2 is shown in Figure 2. The jagged line corresponds to the single data set in the table. The smooth curve is obtained by cross-validation, as follows. For each fold of the cross-validation, evaluate the classifier's predictions on the test set. For each different position on the x-axis, i.e. each possible number of false positives up to the total number of negatives in the test set, take just enough of the ranked test set examples to include that number of false positives and find the corresponding number of true positives. Taking the mean of these true positive counts over all folds gives the (x, y) points needed to draw the smooth curve.

## 4 Summary

ROC analysis is a very useful tool for assessing classifier performance when there are variable misclassification costs or highly skewed class distributions (e.g. when test set examples are nearly all of one class, so that an apparently high predictive accuracy is easy to obtain). Both lift charts and ROC curves give a graphical view of classification performance and can easily be calculated whenever a ranking on predictions is available. The two-class confusion matrix, or contingency table, is a fundamental method which is the starting point for many statistical analysis techniques.

## References

[1] T. Fawcett. ROC Graphs: Notes and Practical Considerations for Data Mining Researchers. Technical Report HPL-2003-4, HP Laboratories, 2003.

[2] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, CA, 2000.


### Identifying Potentially Useful Email Header Features for Email Spam Filtering

Identifying Potentially Useful Email Header Features for Email Spam Filtering Omar Al-Jarrah, Ismail Khater and Basheer Al-Duwairi Department of Computer Engineering Department of Network Engineering &

### Guido Sciavicco. 11 Novembre 2015

classical and new techniques Università degli Studi di Ferrara 11 Novembre 2015 in collaboration with dr. Enrico Marzano, CIO Gap srl Active Contact System Project 1/27 Contents What is? Embedded Wrapper

### An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION. Palaiseau cedex, France; 2 FFI, P.O. Box 25, N-2027 Kjeller, Norway.

A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION Frédéric Pichon 1, Florence Aligne 1, Gilles Feugnet 1 and Janet Martha Blatny 2 1 Thales Research & Technology, Campus Polytechnique,

### Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

### Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction

Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Huanjing Wang Western Kentucky University huanjing.wang@wku.edu Taghi M. Khoshgoftaar

### 2) The three categories of forecasting models are time series, quantitative, and qualitative. 2)

Exam Name TRUE/FALSE. Write 'T' if the statement is true and 'F' if the statement is false. 1) Regression is always a superior forecasting method to exponential smoothing, so regression should be used

### !"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Temperature Scales. The metric system that we are now using includes a unit that is specific for the representation of measured temperatures.

Temperature Scales INTRODUCTION The metric system that we are now using includes a unit that is specific for the representation of measured temperatures. The unit of temperature in the metric system is

### A Complete Data Mining process to Manage the QoS of ADSL Services

A Complete Data Mining process to Manage the QoS of ADSL Services Françoise Fessant and Vincent Lemaire 1 Abstract. 1 In this paper we explore the interest of computational intelligence tools in the management

### Chapter 2 - Graphical Summaries of Data

Chapter 2 - Graphical Summaries of Data Data recorded in the sequence in which they are collected and before they are processed or ranked are called raw data. Raw data is often difficult to make sense

### Discrete Random Variables and their Probability Distributions

CHAPTER 5 Discrete Random Variables and their Probability Distributions CHAPTER OUTLINE 5.1 Probability Distribution of a Discrete Random Variable 5.2 Mean and Standard Deviation of a Discrete Random Variable