Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio


Accuracy To measure classification performance, the most intuitive measure is accuracy: the number of correct predictions divided by the total number of predictions. Suppose a classifier correctly identified, for 99,990 out of 100,000 newborn babies, whether or not they are carriers of a treatable but potentially fatal genetic defect. ACC = (TP + TN) / (TP + TN + FP + FN), ERR = 1 - ACC. In this example, ACC = 99.99% and ERR = 0.01%.

Accuracy An accuracy of 99.99% appears to indicate an extremely accurate classifier. But what if the genetic defect is found in only 10 out of every 100,000 babies? A naive classifier that always predicts "no defect" will still be correct for 99.99 percent of all cases. In this case, even though the predictions are correct for the large majority of the data, the classifier is not very useful for its intended purpose, which is to identify children with birth defects.

Utility measure The best measure of classifier performance is whether the classifier is successful at its intended purpose. For this reason, it is crucial to have measures of model performance that capture utility rather than raw accuracy.

Necessary data to evaluate a classifier: actual class values, predicted class values, and the estimated probability of each prediction.

A closer look at confusion matrices A confusion matrix is a table that categorizes predictions according to whether they match the actual value in the data. One of the table's dimensions indicates the possible categories of predicted values, while the other dimension indicates the same for actual values. Although we have only seen 2 x 2 confusion matrices so far, a matrix can be created for a model predicting any number of classes.

Positive and Negative class The most common performance measures consider the model's ability to discern one class versus all others. The class of interest is known as the positive class, while all others are known as negative. The relationship between positive class and negative class predictions can be depicted as a 2 x 2 confusion matrix that tabulates whether predictions fall into one of four categories: True Positive (TP): Correctly classified as the class of interest True Negative (TN): Correctly classified as not the class of interest False Positive (FP): Incorrectly classified as the class of interest False Negative (FN): Incorrectly classified as not the class of interest

Beyond accuracy: other measures of performance Different measures have been developed and used for specific purposes in disciplines as diverse as medicine, information retrieval, marketing, and signal detection theory, among others. The Classification and Regression Training (caret) R package includes functions for computing many such performance measures. This package provides a large number of tools for preparing, training, evaluating, and visualizing machine learning models and data.

Confusion matrix computed with caret function
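
A minimal sketch of how such a confusion matrix could be computed, assuming two hypothetical factor vectors, predicted and actual, holding the predicted and true classes of an SMS spam filter (illustrative data, not the original slide's):

library(caret)

# hypothetical predicted and actual class labels (illustrative only)
predicted <- factor(c("spam", "ham", "ham",  "spam", "ham"), levels = c("ham", "spam"))
actual    <- factor(c("spam", "ham", "spam", "spam", "ham"), levels = c("ham", "spam"))

# confusionMatrix() cross-tabulates predictions against actual values and
# reports accuracy, kappa, sensitivity, specificity, and related measures
confusionMatrix(predicted, actual, positive = "spam")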

Kappa statistic The kappa statistic adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. Accuracy = P(A=no, P=no) + P(A=yes, P=yes) is the probability of a correct prediction due to the classifier. P(expected) = P(A=no)P(P=no) + P(A=yes)P(P=yes) is the probability of a correct prediction due to chance (independence assumption). Kappa = (Accuracy - P(expected)) / (1 - P(expected)).

Kappa statistic Example: a classifier that always predicts "no" on 1000 cases, of which 998 are actually "no" and 2 are actually "yes" (confusion matrix cells: 998, 2, 0, 0). ACC = 998/1000 = 99.8%. P(expected) = (998/1000)(1000/1000) + (2/1000)(0/1000) = 99.8%. Kappa = (0.998 - 0.998) / (1 - 0.998) = 0.
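
A minimal sketch of the same computation in R, assuming the 2 x 2 confusion matrix of the example (a classifier that always predicts "no"):

# rows = predicted class, columns = actual class
cm <- matrix(c(998, 0, 2, 0), nrow = 2,
             dimnames = list(predicted = c("no", "yes"), actual = c("no", "yes")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # 0.998
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement, 0.998
kappa    <- (accuracy - expected) / (1 - expected) # 0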

Sensitivity and Specificity Classification often involves a balance between being overly conservative and overly aggressive in decision making. For example, an e-mail filter could guarantee to eliminate every spam message by aggressively eliminating nearly every ham message at the same time. On the other hand, a guarantee that no ham messages will be inadvertently filtered might allow an unacceptable amount of spam to pass through the filter. This tradeoff is captured by a pair of measures: sensitivity and specificity.

Sensitivity and Specificity The sensitivity of a model (aka true positive rate) measures the proportion of positive examples that were correctly classified: sensitivity = TP / (TP + FN). The specificity of a model (aka true negative rate) measures the proportion of negative examples that were correctly classified: specificity = TN / (TN + FP).

Diagnosis test

Sensitivity and Specificity Sensitivity and specificity range from 0 to 1, with values close to 1 being more desirable. It is important to find an appropriate balance between the two. For example: 87.2% of spam messages were correctly classified as spam, 12.8% of spam was wrongly kept as ham, 99.9% of ham messages were correctly classified as ham, and 0.1% of ham was wrongly rejected as spam.
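
A minimal sketch of these two measures in R, assuming hypothetical counts chosen only to be consistent with the rates quoted above (the actual confusion matrix from the slides is not reproduced here):

TP <- 872;  FN <- 128    # spam messages: correctly caught vs. missed
TN <- 999;  FP <- 1      # ham messages:  correctly kept  vs. wrongly rejected

sensitivity <- TP / (TP + FN)   # 0.872 (true positive rate)
specificity <- TN / (TN + FP)   # 0.999 (true negative rate)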

Precision and Recall Closely related to sensitivity and specificity are two other performance measures, related to compromises made in classification: precision and recall. Used primarily in the context of information retrieval, these statistics are intended to provide an indication of how interesting and relevant a model's results are, or whether the predictions are diluted by meaningless noise.

Precision and Recall The precision (aka positive predictive value) is defined as the proportion of positive predictions that are truly positive; in other words, when a model predicts the positive class, how often is it correct? precision = TP / (TP + FP). The recall is a measure of how complete the results are. It is defined as the number of true positives over the total number of positives (it is the same as sensitivity): recall = TP / (TP + FN).

F-measure A measure of model performance that combines precision and recall into a single number is known as the F-measure (aka F1 score or F-score). It combines precision and recall using the harmonic mean. The harmonic mean is used rather than the more common arithmetic mean since both precision and recall are expressed as proportions between zero and one. F-measure = (2 × precision × recall) / (recall + precision) = 2TP / (2TP + FP + FN).
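
A minimal sketch of the three measures in R, reusing the same hypothetical counts as above (illustrative values only):

TP <- 872;  FP <- 1;  FN <- 128

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f_measure <- (2 * precision * recall) / (precision + recall)
# equivalently: (2 * TP) / (2 * TP + FP + FN)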

Estimated probability of prediction Even though the classifier makes a single prediction about each example, it may be more confident about some decisions than others. For instance, a classifier may be 99% certain that an SMS with the words "free" and "ringtones" is spam, but only 51% certain that an SMS with the word "tonight" is spam. In both cases the classifier predicts spam, but it is far more certain about one decision than the other. Studying these internal prediction probabilities is useful to evaluate model performance. If two models make the same number of mistakes, but one is better able to assess its uncertainty, then it is a smarter model.

SVM model with probabilities
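
A minimal sketch of obtaining class probabilities from an SVM, assuming the e1071 package and using the built-in iris data as a stand-in for the original SMS spam data (the slide's own code is not reproduced here):

library(e1071)

# probability = TRUE fits an additional model so that class probabilities
# can be estimated alongside the class predictions
model <- svm(Species ~ ., data = iris, probability = TRUE)

pred  <- predict(model, iris, probability = TRUE)
probs <- attr(pred, "probabilities")   # one column of probabilities per class
head(probs)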

Performance tradeoffs Visualizations are often helpful for understanding how the performance of machine learning algorithms varies from situation to situation. Rather than thinking about a single pair of statistics such as sensitivity and specificity, or precision and recall, visualizations allow you to examine how measures vary across a wide range of values.

Performance tradeoffs The ROC curve (Receiver Operating Characteristic) is commonly used to examine the tradeoff between detecting true positives and avoiding false positives. ROC curves were developed by engineers in the field of communications around the time of World War II; receivers of radar and radio signals needed a method to discriminate between true signals and false alarms.

ROC curve

ROC curve The points comprising ROC curves indicate the true positive rate at varying false positive rates. To create the curves, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first. Beginning at the origin, each prediction's impact on the true positive rate and false positive rate results in the curve tracing vertically (for a correct prediction) or horizontally (for an incorrect prediction).
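
A minimal sketch of this procedure, assuming hypothetical vectors score (estimated probability of the positive class) and label (1 = positive, 0 = negative):

score <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10)   # illustrative
label <- c(1,    1,    0,    1,    0,    0,    1,    0)

ord <- order(score, decreasing = TRUE)            # largest probabilities first
tpr <- cumsum(label[ord] == 1) / sum(label == 1)  # vertical moves
fpr <- cumsum(label[ord] == 0) / sum(label == 0)  # horizontal moves

plot(c(0, fpr), c(0, tpr), type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # diagonal: classifier with no discrimination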

ROC curve

Area under the ROC curve (AUC) The closer the curve is to the perfect classifier, the better it is at identifying positive values. This can be measured using a statistic known as the area under the ROC curve (AUC). The AUC, as you might expect, treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve. A common interpretation scale: 0.9 to 1.0 = A (outstanding), 0.8 to 0.9 = B (excellent/good), 0.7 to 0.8 = C (acceptable/fair), 0.6 to 0.7 = D (poor), 0.5 to 0.6 = F (no discrimination).

ROC curve with ROCR package
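
A minimal sketch with the ROCR package, using the same illustrative score and label vectors as in the previous sketch:

library(ROCR)

score <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10)
label <- c(1,    1,    0,    1,    0,    0,    1,    0)

pred <- prediction(score, label)                              # ROCR prediction object
perf <- performance(pred, measure = "tpr", x.measure = "fpr") # ROC coordinates
plot(perf, col = "blue", lwd = 2, main = "ROC curve")
abline(0, 1, lty = 2)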

AUC with ROCR package
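
A minimal sketch, again with the same illustrative vectors:

library(ROCR)

score <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10)
label <- c(1,    1,    0,    1,    0,    0,    1,    0)

pred <- prediction(score, label)
performance(pred, measure = "auc")@y.values[[1]]   # area under the ROC curve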

PR curve ROCR package
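
A minimal sketch of a precision-recall curve with ROCR, still using the same illustrative vectors:

library(ROCR)

score <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10)
label <- c(1,    1,    0,    1,    0,    0,    1,    0)

pred <- prediction(score, label)
perf <- performance(pred, measure = "prec", x.measure = "rec")   # precision vs. recall
plot(perf, col = "blue", lwd = 2, main = "Precision-recall curve")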

Average curves The ROCR package is also able to plot average curves over multiple runs (for example, cross-validation folds). In that case, the multiple prediction and label vectors must be organized as columns of matrices passed to the prediction() function.
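
A minimal sketch, assuming hypothetical matrices in which each column holds the scores and labels of one cross-validation fold (randomly generated here purely for illustration):

library(ROCR)

set.seed(1)
score_matrix <- matrix(runif(300), ncol = 3)           # 3 folds of scores
label_matrix <- matrix(rbinom(300, 1, 0.5), ncol = 3)  # 3 folds of labels

pred <- prediction(score_matrix, label_matrix)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

# avg = "vertical" averages the per-fold ROC curves along the y-axis
plot(perf, avg = "vertical", lwd = 2, main = "Average ROC curve")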

Estimating future performance To estimate future performance the model trained on a training set should be tested on a test set that has not been seen by the model. When prediction is performed on training data the only performance that can be estimated is the resubstitution error, which occurs when the training data is incorrectly predicted in spite of the model being built directly from this data. This information is intended to be used as a rough diagnostic, particularly to identify obviously poor performers. The resubstitution error is not a very useful marker of future performance.

The hold out method Partitioning data into training and test datasets. The training dataset is used to generate the model, which is then applied to the test dataset to generate predictions for evaluation. Typically, 2/3 of the data is used for training and 1/3 for testing.
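
A minimal sketch of such a split, using the built-in iris data as a stand-in for the original dataset:

set.seed(123)
n          <- nrow(iris)
train_idx  <- sample(n, size = round(2/3 * n))   # random 2/3 of the rows
train_data <- iris[train_idx, ]                  # used to build the model
test_data  <- iris[-train_idx, ]                 # used only for evaluation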

Train, Validation, Test To avoid choosing a best model based upon the results of repeated testing, it is better to divide the original data so that, in addition to the training and test datasets, a third validation dataset is available. The validation dataset would be used for iterating and refining the model or models chosen, leaving the test dataset to be used only once, as a final step, to report an estimated error rate for future predictions. A typical split is 1/2 for training, 1/4 for validation, and 1/4 for testing.
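
A minimal sketch of a 1/2 - 1/4 - 1/4 split, again using iris as a stand-in:

set.seed(123)
n   <- nrow(iris)
idx <- sample(n)                 # shuffle the row indices

n_train <- floor(0.50 * n)
n_valid <- floor(0.25 * n)

train_data <- iris[idx[1:n_train], ]
valid_data <- iris[idx[(n_train + 1):(n_train + n_valid)], ]   # model selection
test_data  <- iris[idx[(n_train + n_valid + 1):n], ]           # final error estimate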

Class representation One problem with holdout sampling is that each partition may have a larger or smaller proportion of some classes. In certain cases, particularly those in which a class is a very small proportion of the dataset, this can lead a class to be omitted from the training dataset. In such cases, the model cannot learn this class.

Stratification In order to reduce the chance that a class is over- or underrepresented, a technique called stratified random sampling can be used. Although, on average, a random sample will contain roughly the same proportion of class values as the full dataset, stratified random sampling ensures that the generated random partitions have approximately the same proportion of each class as the full dataset. The caret package provides the createDataPartition() function to do this.

Stratification
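
A minimal sketch of stratified partitioning with caret's createDataPartition(), using iris as a stand-in for the original data:

library(caret)

set.seed(123)
in_train   <- createDataPartition(iris$Species, p = 2/3, list = FALSE)
train_data <- iris[in_train, ]
test_data  <- iris[-in_train, ]

# both partitions keep roughly the class proportions of the full dataset
prop.table(table(train_data$Species))
prop.table(table(test_data$Species))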

Bias Although stratification distributes the classes evenly, stratified sampling does not guarantee other types of representativeness. Some samples may have too many or too few difficult cases, easy-to-predict cases, or outliers. This is especially true for smaller datasets, which may not have a large enough portion of such cases to divide between training and test sets.

Repeated holdout Repeated holdout is sometimes used to mitigate the problems of randomly composed training datasets. The repeated holdout method uses the average result from several random holdout samples to evaluate a model's performance. As multiple holdout samples are used, it is less likely that the model is trained or tested on non-representative data. But repeated random samples could potentially use the same record more than once.

k-fold Cross Validation k-fold CV randomly divides the data into k completely separate random partitions called folds. Each fold is used in turn as the test set while the model is trained on the remaining k-1 folds, and the k performance estimates are averaged. Empirical evidence suggests that using more than 10 folds provides no added benefit.

Cross Validation in R
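
A minimal sketch of 10-fold cross-validation with caret, using iris and a k-nearest-neighbour model as stand-ins for the original SMS data and classifier:

library(caret)

set.seed(123)
ctrl  <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
model <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)
model   # accuracy and kappa averaged over the 10 folds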

Average ROC curves

Exercises 1. On the SMS spam detection dataset, evaluate kNN, Naive Bayes, and SVM with 10-fold cross-validation and find which is the best performing algorithm in terms of AUC and F-measure.