# Performance Metrics. number of mistakes total number of observations. err = p.1/1

Save this PDF as:

Size: px
Start display at page:

Download "Performance Metrics. number of mistakes total number of observations. err = p.1/1"

## Transcription

1 p.1/1 Performance Metrics The simplest performance metric is the model error defined as the number of mistakes the model makes on a data set divided by the number of observations in the data set, err = number of mistakes total number of observations.

2 p.2/1 The Model Error In order to define the model error formally we introduce the 0-1 loss function. This function compares the output of a model for a particular observation with the label of this observation. If the model commits a prediction error on this observation then the loss function returns a 1, otherwise it returns a0. Formally, let (x, y) D be an observation where D R n {+1, 1} and let ˆf : R n {+1, 1} be a model, then we define the 0-1 loss function L : {+1, 1} {+1, 1} {0, 1} as, L y, ˆf(x) 8 < 0 if y = ˆf(x), = : 1 if y ˆf(x). Let D = {(x 1,y 1 ),...,(x l,y l )} R n {+1, 1}, let the model ˆf and the loss function L be as defined above, then we can write the model error as, err D [ ˆf] = 1 l lx L i=1 y i, ˆf(x i ), where (x i,y i ) D. The model error is the average loss of a model over a data set.

3 p.3/1 Model Accuracy We can also characterize the performance of a model in terms of its accuracy, acc = number of correct predictions total number of observations. Again, we can use the 0-1 loss function to define this metric more concisely, acc D [ ˆf] = 1 l =1 1 l l lx L i=1 lx L i=1 =1 err D [ ˆf]. y i, ˆf(x! i ) y i, ˆf(x i ) acc D [ ˆf] =1 err D [ ˆf]

4 p.4/1 Example As an example of the above metrics, consider a model ĝ which commits 5 prediction errors when applied to a data set Q of length 100. We can compute the error as, err Q [ĝ] = 1 (5) = We can compute the accuracy of the model as, acc Q [ĝ] =1 err Q [ĝ] = = 0.95.

5 p.5/1 Model Errors Let (x, y) R n {+1, 1} be an observation and let ˆf : R n {+1, 1} be a model, then we have the following four possibilities when the model is applied to the observation, ˆf(x) = 8 +1 if y =+1, called the true positive >< 1 if y =+1, called the false negative +1 if y = 1, called the false positive >: 1 if y = 1, called the true negative This means that models can commit two types of model errors. Under certain circumstances it is important to distinguish these types of errors when evaluating a model. We use a confusion matrix to report these errors in an effective manner.

6 p.6/1 The Confusion Matrix A confusion matrix for a binary classification model is a 2 2 table that displays the observed labels against the predicted labels of a data set. Observed (y) \ Predicted (ŷ) True Positive (TP) False Negative (FN) -1 False Positive (FP) True Negative (TN) One way to visualize the confusion matrix is to consider that applying a model ˆf to an observation (x, y) will give us two labels. The first label y is due to the observation and the second label ŷ = ˆf(x) is due to the prediction of the model. Therefore, an observation with the label pair (y, ŷ) will be mapped onto a confusion matrix as follows, (+1,+1) TP (-1,+1) FP (+1,-1) FN (-1,-1) TN

7 p.7/1 The Confusion Matrix Example: Observed \ Predicted A confusion matrix of a model applied to a set of 200 observations. On this set of observations the model commits 7 false negative errors and 4 false positive errors in addition to the 95 true positive and 94 true negative predictions.

8 p.8/1 Example Wisconsin Breast Cancer Dataset: > library(e1071) > wdbc.df <- read.csv("wdbc.csv") > svm.model <- svm(diagnosis., data=wdbc.df, type="c-classification", kernel="linear", cost=1) > predict <- fitted(svm.model) > cm <- table(wdbc.df\$diagnosis,predict) > cm predict B M B M > err <- (cm[1,2] + cm[2,1])/length(predict) * 100 > err [1]

9 p.9/1 Model Evaluation Model evaluation is the process of finding an optimal model for the problem at hand. You guessed, we are talking about optimization! Let, D = {(x 1,y 1 ),...,(x l,y l )} R n {+1, 1} be our training data, then we can write our parameterized model as, where (x i,y i ) D. ˆf D [k, λ, C](x) =sign! lx α C,i y i k[λ](x i, x) b, i=1 With this we can write our model error as, h i err D ˆfD [k, λ, C] = 1 l lx L i=1 y i, ˆf D [k, λ, C](x i ), where (x i,y i ) D.

10 p. 10/1 Training Error The optimal training error is defined as, h min err D ˆfD [k, λ, C]i k,λ,c = min k,λ,c 1 l lx L i=1 y i, ˆf D [k, λ, C](x i ).

11 p. 11/1 Training Error Observation: The problem here is that we can always find a set of model parameters that make the model complex enough to drive the training error down to zero. The training error as a model evaluation criterion is overly optimistic.

12 p. 12/1 The Hold-Out Method Here we split the training data D into a training set and a testing set P and Q, respectively, such that, D = P Q and P Q =. The optimal training error is then computed as the optimization problem, h h i min err P ˆfP [k, λ, C]i = err P ˆfP [k,λ,c ]. k,λ,c Here, the optimal training error is obtained with model ˆfP [k,λ,c ]. The optimal test error is computed as an optimization using Q as the test set, h h i min err Q ˆfP [k, λ, C]i = err Q ˆfP [k,λ,c ]. k,λ,c The optimal test error is achieved by some model ˆfP [k,λ,c ].

13 p. 13/1 The Hold-Out Method high Test Error Training Error Error high Model Complexity

### Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

### Performance Measures in Data Mining

Performance Measures in Data Mining Common Performance Measures used in Data Mining and Machine Learning Approaches L. Richter J.M. Cejuela Department of Computer Science Technische Universität München

### Consistent Binary Classification with Generalized Performance Metrics

Consistent Binary Classification with Generalized Performance Metrics Nagarajan Natarajan Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon UT Austin Nov 4, 2014 Problem and Motivation

### Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

### W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

### T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

### Data Mining Practical Machine Learning Tools and Techniques

Credibility: Evaluating what s been learned Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Issues: training, testing,

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### Performance Measures for Machine Learning

Performance Measures for Machine Learning 1 Performance Measures Accuracy Weighted (Cost-Sensitive) Accuracy Lift Precision/Recall F Break Even Point ROC ROC Area 2 Accuracy Target: 0/1, -1/+1, True/False,

### Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status

Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of

### Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware Cumhur Doruk Bozagac Bilkent University, Computer Science and Engineering Department, 06532 Ankara, Turkey

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### The validation of Credit Rating and Scoring Models

The validation of Credit Rating and Scoring Models raffaella.calabrese1@unimib.it University of Milano-Bicocca Swiss Statistics Meeting Geneva, Switzerland October 29th, 2009 Outline The validation process

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

### Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### 1. Classification problems

Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

### Online Performance Anomaly Detection with

ΘPAD: Online Performance Anomaly Detection with Tillmann Bielefeld 1 1 empuxa GmbH, Kiel KoSSE-Symposium Application Performance Management (Kieker Days 2012) November 29, 2012 @ Wissenschaftszentrum Kiel

### Validation parameters: An introduction to measures of

Validation parameters: An introduction to measures of test accuracy Types of tests All tests are fundamentally quantitative Sometimes we use the quantitative result directly However, it is often necessary

### A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION. Palaiseau cedex, France; 2 FFI, P.O. Box 25, N-2027 Kjeller, Norway.

A CLASSIFIER FUSION-BASED APPROACH TO IMPROVE BIOLOGICAL THREAT DETECTION Frédéric Pichon 1, Florence Aligne 1, Gilles Feugnet 1 and Janet Martha Blatny 2 1 Thales Research & Technology, Campus Polytechnique,

### Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

### Inferring Probabilistic Models of cis-regulatory Modules. BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.

Inferring Probabilistic Models of cis-regulatory Modules MI/S 776 www.biostat.wisc.edu/bmi776/ Spring 2015 olin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### Anomaly detection. Problem motivation. Machine Learning

Anomaly detection Problem motivation Machine Learning Anomaly detection example Aircraft engine features: = heat generated = vibration intensity Dataset: New engine: (vibration) (heat) Density estimation

### CLASS distribution, i.e., the proportion of instances belonging

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS 1 A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches Mikel Galar,

### Using Static Code Analysis Tools for Detection of Security Vulnerabilities

Using Static Code Analysis Tools for Detection of Security Vulnerabilities Katerina Goseva-Popstajanova & Andrei Perhinschi Lane Deptartment of Computer Science and Electrical Engineering West Virginia

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

### An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

### Exercise 2: Pose Estimation and Step Counting from IMU Data

Exercise 2: Pose Estimation and Step Counting from IMU Data Due date: 11:59pm, Oct 21st, 2015 1 Introduction In this exercise, you will implement pose estimation and step counting. We strongly encourage

### SVM Ensemble Model for Investment Prediction

19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### CSC574 - Computer and Network Security Module: Intrusion Detection

CSC574 - Computer and Network Security Module: Intrusion Detection Prof. William Enck Spring 2013 1 Intrusion An authorized action... that exploits a vulnerability... that causes a compromise... and thus

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

### Nino Pellegrino October the 20th, 2015

Learning Behavioral Fingerprints from NetFlows... using Timed Automata Nino Pellegrino October the 20th, 2015 Nino Pellegrino Learning Behavioral Fingerprints October the 20th, 2015 1 / 32 Use case Nino

### Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

### Link Prediction Analysis in the Wikipedia Collaboration Graph

Link Prediction Analysis in the Wikipedia Collaboration Graph Ferenc Molnár Department of Physics, Applied Physics, and Astronomy Rensselaer Polytechnic Institute Troy, New York, 121 Email: molnaf@rpi.edu

### International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers

Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers Manjit Kaur Department of Computer Science Punjabi University Patiala, India manjit8718@gmail.com Dr. Kawaljeet Singh University

### Classifiers & Classification

Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin

### Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### 2.5.3 Use basic database skills to enter information in a database

2.5 Filling System Documentation and Databases 2.5.3 Use basic database skills to enter information in a database Be able to enter accurate and relevant data in an existing database system (LO018) Developed

### Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### A Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing

A Hybrid Data Mining Model to Improve Customer Response Modeling in Direct Marketing Maryam Daneshmandi mdaneshmandi82@yahoo.com School of Information Technology Shiraz Electronics University Shiraz, Iran

### 2/2/2011. Data source. Define a new data source. Connect to the Iris database in SQL Server Impersonation Information. Default option D B M G

Data Mining in SQL Server 2005 SQL Server 2005 Data Mining structures SQL Server 2005 mining structures created and managed using Visual Studio 2005 Business Intelligence Projects Analysis Services Project

### Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

### Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

### Methodologies for Evaluation of Standalone CAD System Performance

Methodologies for Evaluation of Standalone CAD System Performance DB DSFM DCMS OSEL DESE DP DIAM Berkman Sahiner, PhD USFDA/CDRH/OSEL/DIAM AAPM CAD Subcommittee in Diagnostic Imaging CAD: CADe and CADx

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

### Predicting air pollution level in a specific city

Predicting air pollution level in a specific city Dan Wei dan1991@stanford.edu 1. INTRODUCTION The regulation of air pollutant levels is rapidly becoming one of the most important tasks for the governments

### Data Mining Lab 5: Introduction to Neural Networks

Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese

### Binary and Ranked Retrieval

Binary and Ranked Retrieval Binary Retrieval RSV(d i,q j ) {0,1} Does not allow the user to control the magnitude of the output. In fact, for a given query, the system may return under-dimensioned output

### STANDARDISATION AND CLASSIFICATION OF ALERTS GENERATED BY INTRUSION DETECTION SYSTEMS

STANDARDISATION AND CLASSIFICATION OF ALERTS GENERATED BY INTRUSION DETECTION SYSTEMS Athira A B 1 and Vinod Pathari 2 1 Department of Computer Engineering,National Institute Of Technology Calicut, India

### ROC Graphs: Notes and Practical Considerations for Data Mining Researchers

ROC Graphs: Notes and Practical Considerations for Data Mining Researchers Tom Fawcett Intelligent Enterprise Technologies Laboratory HP Laboratories Palo Alto HPL-23-4 January 7 th, 23* E-mail: tom_fawcett@hp.com

### A Systematic Review of Fault Prediction approaches used in Software Engineering

A Systematic Review of Fault Prediction approaches used in Software Engineering Sarah Beecham Lero The Irish Software Engineering Research Centre University of Limerick, Ireland Tracy Hall Brunel University

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### 19th International Conference on Mechatronics and Machine Vision in Practice (M2VIP12), 28-30th Nov 2012, Auckland, New-Zealand

Monitoring and Remote Sensing of the Street Lighting System Using Computer Vision and Image Processing Techniques for the Purpose of Mechanized Blackouts Pooya Najafi Zanjani 1,2, Vahid Ghods 2, Morteza

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### More Data Mining with Weka

More Data Mining with Weka Class 5 Lesson 1 Simple neural networks Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz Lesson 5.1: Simple neural networks Class

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

### Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

### Health Care and Life Sciences

Sensitivity, Specificity, Accuracy, Associated Confidence Interval and ROC Analysis with Practical SAS Implementations Wen Zhu 1, Nancy Zeng 2, Ning Wang 2 1 K&L consulting services, Inc, Fort Washington,

### Data Mining Practical Machine Learning Tools and Techniques

Counting the cost Data Mining Practical Machine Learning Tools and Techniques Slides for Section 5.7 In practice, different types of classification errors often incur different costs Examples: Loan decisions

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### Selecting Data Mining Model for Web Advertising in Virtual Communities

Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: jerzy.surma@gmail.com Mariusz Łapczyński

### Section 6: Model Selection, Logistic Regression and more...

Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building

### A Decision Tree Classification Model for University Admission System

A Decision Tree Classification Model for University Admission ystem Abdul Fattah Mashat Faculty of Computing and Information Technology Jeddah, audi Arabia Mohammed M. Fouad Faculty of Informatics and

### FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,

### Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

### Quality and Complexity Measures for Data Linkage and Deduplication

Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au

### Maximum Profit Mining and Its Application in Software Development

Maximum Profit Mining and Its Application in Software Development Charles X. Ling 1, Victor S. Sheng 1, Tilmann Bruckhaus 2, Nazim H. Madhavji 1 1 Department of Computer Science, The University of Western

### Keywords Phishing Attack, phishing Email, Fraud, Identity Theft

Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Detection Phishing

### Predicting & Preventing Banking Customer Churn By Unlocking Big Data.

Predicting & Preventing Banking Customer Churn By Unlocking Big Data. Case Study on a Bank. Light Years Ahead. CUSTOMERS CHURN. A key performance Indicator for Banks. Confidence in the banking industry

### The Relationship Between Precision-Recall and ROC Curves

Jesse Davis jdavis@cs.wisc.edu Mark Goadrich richm@cs.wisc.edu Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 2 West Dayton Street,

### ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA Andreas Christmann Department of Mathematics homepages.vub.ac.be/ achristm Talk: ULB, Sciences Actuarielles, 17/NOV/2006 Contents 1. Project: Motor vehicle

### Data Mining - The Next Mining Boom?

Howard Ong Principal Consultant Aurora Consulting Pty Ltd Abstract This paper introduces Data Mining to its audience by explaining Data Mining in the context of Corporate and Business Intelligence Reporting.

### A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data

### Leak Detection Theory: Optimizing Performance with MLOG

Itron White Paper Water Loss Management Leak Detection Theory: Optimizing Performance with MLOG Rich Christensen Vice President, Research & Development 2009, Itron Inc. All rights reserved. Introduction

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### Network Intrusion Detection Systems

Network Intrusion Detection Systems False Positive Reduction Through Anomaly Detection Joint research by Emmanuele Zambon & Damiano Bolzoni 7/1/06 NIDS - False Positive reduction through Anomaly Detection

### Machine Learning Approach for Estimating Sensor Deployment Regions on Satellite Images

Machine Learning Approach for Estimating Sensor Deployment Regions on Satellite Images 1 Enes ATEŞ, 2 *Tahir Emre KALAYCI, 1 Aybars UĞUR 1 Faculty of Engineering, Department of Computer Engineering Ege

### Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations

ISSN: 2278-181 Vol. 2 Issue 9, September - 213 Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations Author :Sushama Chouhan Author Affiliation: MTech Scholar

### Roulette Sampling for Cost-Sensitive Learning

Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca

### Computer Vision. License Plate Recognition Klaas Dijkstra - Jaap van de Loosdrecht

Computer Vision License Plate Recognition Klaas Dijkstra - Jaap van de Loosdrecht 25 August 2014 Copyright 2001 2014 by NHL University of Applied Sciences and Van de Loosdrecht Machine Vision BV All rights

### Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### A Short Introduction to the caret Package

A Short Introduction to the caret Package Max Kuhn max.kuhn@pfizer.com August 6, 2015 The caret package (short for classification and regression training) contains functions to streamline the model training

### CSCI-B 565 DATA MINING Project Report for K-means Clustering algorithm Computer Science Core Fall 2012 Indiana University

CSCI-B 565 DATA MINING Project Report for K-means Clustering algorithm Computer Science Core Fall 2012 Indiana University Jayesh Kawli jkawli@indiana.edu 09/17/2012 1. Examining Wolberg s breast cancer

### Predicting Defaults of Loans using Lending Club s Loan Data

Predicting Defaults of Loans using Lending Club s Loan Data Oleh Dubno Fall 2014 General Assembly Data Science Link to my Developer Notebook (ipynb) - http://nbviewer.ipython.org/gist/odubno/0b767a47f75adb382246