Evaluation in Machine Learning (1), 14s1: COMP9417 Machine Learning and Data Mining

School of Computer Science and Engineering, University of New South Wales
March 19, 2014

Acknowledgements:
- Material derived from slides for the book Machine Learning by T. Mitchell, McGraw-Hill (1997): http://www-2.cs.cmu.edu/~tom/mlbook.html
- Material derived from slides by Eibe Frank: http://www.cs.waikato.ac.nz/ml/weka
- Material derived from slides for the book Machine Learning by P. Flach, Cambridge University Press (2012): http://cs.bris.ac.uk/~flach/mlbook

1. Aims

2. Introduction: Evaluation in machine learning

This lecture will enable you to apply standard measures and graphical methods for the evaluation of machine learning algorithms. Following it you should be able to:
- define sample error and true error
- describe the problem of estimating hypothesis accuracy or error
- define and calculate typically-used evaluation measures, such as accuracy, precision, recall, etc.
- define and calculate typically-used compound measures and their visualisations, such as lift charts, coverage plots, ROC curves, F-measure, AUC, etc.

"In theory, there is no difference between theory and practice. But in practice, there is."

Estimating Hypothesis Accuracy

Key questions:
- how well does a hypothesis generalize beyond the training set? We need to estimate the off-training-set error.
- what is the probable error in this estimate?
- if one hypothesis is more accurate than another on a data set, how probable is this difference in general?

Two Definitions of Error

The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

$\mathrm{error}_S(h) \equiv \frac{1}{n}\sum_{x \in S}\delta(f(x) \neq h(x))$

where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$, and 0 otherwise (cf. 0-1 loss).

The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

$\mathrm{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$

Question: How well does error_S(h) estimate error_D(h)?

Estimators

Experiment:
1. choose a sample S of size n according to distribution D
2. measure error_S(h)

error_S(h) is a random variable (i.e., the result of an experiment), and it is an unbiased estimator for error_D(h). Given an observed error_S(h), what can we conclude about error_D(h)?

Problems Estimating Error

1. Bias: if S is the training set, error_S(h) is optimistically biased:

$\mathrm{bias} \equiv E[\mathrm{error}_S(h)] - \mathrm{error}_D(h)$

For an unbiased estimate, h and S must be chosen independently.

2. Variance: even with a selection of S that gives an unbiased estimate, error_S(h) may still vary from error_D(h).
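A minimal sketch (not from the slides) of estimating error_S(h) as the proportion of misclassified examples; the hypothesis h and the labelled samples below are made-up placeholders for illustration.

```python
def sample_error(h, S):
    """error_S(h) = (1/n) * sum over (x, f_x) in S of [h(x) != f_x]."""
    n = len(S)
    return sum(1 for x, f_x in S if h(x) != f_x) / n

# Toy hypothesis: predict positive when the single feature exceeds a threshold.
h = lambda x: int(x > 0.5)

# Toy labelled samples (x, f(x)); in practice S should be held out from training
# so that error_S(h) is an unbiased estimate of error_D(h).
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test  = [(0.2, 0), (0.45, 1), (0.7, 1), (0.55, 0)]

print("training-set error:", sample_error(h, train))  # optimistically biased
print("held-out sample error:", sample_error(h, test))
```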

Problems Estimating Error (continued)

Note: estimation bias is not to be confused with inductive bias. The former is a numerical quantity (it comes from statistics); the latter is a set of assertions (it comes from concept learning). More on this in a later lecture.

Making the most of the data

- Once evaluation is complete, all the data can be used to build the final classifier.
- Generally, the larger the training data the better the classifier (but returns diminish).
- The larger the test data, the more accurate the error estimate.
- Holdout procedure: a method of splitting the original data into a training and a test set.
- Dilemma: ideally we want both a large training set and a large test set.

Loss functions

- Most common performance measure: predictive accuracy (cf. sample error).
- Also called the 0-1 loss function: 0 if the prediction is correct, 1 if the prediction is incorrect.
- Classifiers can produce class probabilities. What is the accuracy of the probability estimates? 0-1 loss is not appropriate.

Quadratic loss function

- p_1, ..., p_k are probability estimates of all possible outcomes for an instance
- c is the index of the instance's actual class, i.e. a_1, ..., a_k are zero, except for a_c which is 1
- the quadratic loss is

$\sum_j (p_j - a_j)^2 = \sum_{j \neq c} p_j^2 + (1 - p_c)^2$

- it leads to a preference for predictors giving the best guess at the true probabilities.
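A minimal sketch (illustrative, not from the slides) comparing 0-1 loss with quadratic loss for a classifier that outputs class probabilities; the two probability vectors are made-up examples.

```python
def zero_one_loss(p, c):
    """0 if the highest-probability class equals the actual class index c, else 1."""
    return 0 if max(range(len(p)), key=lambda j: p[j]) == c else 1

def quadratic_loss(p, c):
    """sum_j (p_j - a_j)^2 where a_j = 1 for j == c and 0 otherwise."""
    return sum((p_j - (1 if j == c else 0)) ** 2 for j, p_j in enumerate(p))

# Two probability estimates for a 3-class instance whose actual class index is 0.
confident = [0.9, 0.05, 0.05]
hesitant  = [0.4, 0.35, 0.25]

for p in (confident, hesitant):
    print(p, zero_one_loss(p, 0), round(quadratic_loss(p, 0), 3))
# Both predictions get a 0-1 loss of 0, but quadratic loss rewards the estimate
# that is closer to the true class distribution.
```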

Informational loss function

- the informational loss function is $-\log_2(p_c)$, where c is the index of the actual class of an instance: the number of bits required to communicate the actual class
- if $p_1^*, \ldots, p_k^*$ are the true class probabilities, then the expected value of the informational loss function is

$-p_1^* \log_2(p_1) - \ldots - p_k^* \log_2(p_k)$

- which is minimized for $p_j = p_j^*$, giving the entropy of the true distribution:

$-p_1^* \log_2(p_1^*) - \ldots - p_k^* \log_2(p_k^*)$

Which loss function?

- quadratic loss takes into account all the class probability estimates for an instance
- informational loss focuses only on the probability estimate for the actual class
- quadratic loss is bounded by $1 + \sum_j p_j^2$ and can never exceed 2
- informational loss can be infinite
- informational loss is related to the MDL principle (bits can be used for complexity as well as accuracy)

Costs of predictions

In practice, different types of classification errors often incur different costs. Examples:
- Medical diagnosis (has cancer vs. not)
- Loan decisions
- Fault diagnosis
- Promotional mailing

Confusion matrix

Two-class prediction case:

                Predicted Yes          Predicted No
Actual Yes      True Positive (TP)     False Negative (FN)
Actual No       False Positive (FP)    True Negative (TN)

- Two kinds of error: False Positives and False Negatives may have different costs.
- Two kinds of correct prediction: True Positives and True Negatives may have different benefits.
- Note: total number of test set examples N = TP + FN + FP + TN.
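A minimal sketch (illustrative) of building the two-class confusion matrix counts TP, FN, FP, TN from actual and predicted labels; the label lists are made-up.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, FN, FP, TN) for a binary classification test set."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = ["Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)                      # 2 1 1 2
assert tp + fn + fp + tn == len(actual)    # N = TP + FN + FP + TN
```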

Basic evaluation measures

Accuracy: (TP + TN) / N

Error rate: (FP + FN) / N, equivalent to 1 - Accuracy

Precision: TP / (TP + FP) (also called: Correctness, Positive Predictive Value)

Recall: TP / (TP + FN) (also called: TP rate, Hit rate, Sensitivity, Completeness)

Sensitivity (True Positive (TP) rate): TP / (TP + FN)

Specificity (also called: TN rate): TN / (TN + FP), equivalent to 1 - FP rate

False Positive (FP) rate (also called: False alarm rate): FP / (FP + TN), equivalent to 1 - Specificity
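A minimal sketch (illustrative) computing the measures above from the confusion-matrix counts; the counts used here are those of the contingency table discussed later in the lecture (Table 2.2, left).

```python
def basic_measures(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / n,
        "error rate":  (fp + fn) / n,    # = 1 - accuracy
        "precision":   tp / (tp + fp),   # positive predictive value
        "recall":      tp / (tp + fn),   # sensitivity, TP rate
        "specificity": tn / (tn + fp),   # TN rate
        "fp rate":     fp / (fp + tn),   # = 1 - specificity
    }

print(basic_measures(tp=30, fn=20, fp=10, tn=40))
```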

Negative Predictive Value: TN / (TN + FN)

Coverage

                Predicted Yes   Predicted No
Actual Yes      TP              FN
Actual No       FP              TN

Coverage: (TP + FP) / N

Note: this is not an exhaustive list; the same measures are used under different names in different disciplines. E.g., in concept learning, the number of instances in a sample predicted to be in (resp. not in) the concept is the sum of the first (resp. second) column, and the number of positive (resp. negative) examples of the concept in a sample is the sum of the first (resp. second) row:

N_pred = TP + FP      N_not_pred = FN + TN
N_pos = TP + FN       N_neg = FP + TN

We can treat the evaluation measures as conditional probabilities:

P(pred | pos) = TP / (TP + FN)       (Sensitivity)
P(pred | neg) = FP / (FP + TN)       (FP rate)
P(not_pred | pos) = FN / (TP + FN)   (FN rate)
P(not_pred | neg) = TN / (FP + TN)   (Specificity)
P(pos | pred) = TP / (TP + FP)       (Pos. Pred. Value)
P(neg | pred) = FP / (TP + FP)
P(pos | not_pred) = FN / (FN + TN)
P(neg | not_pred) = TN / (FN + TN)   (Neg. Pred. Value)

Trade-off

- good coverage of positive examples: increase TP at the risk of increasing FP, i.e. increase generality
- good proportion of positive examples: decrease FP at the risk of decreasing TP, i.e. decrease generality, i.e. increase specificity
- Different techniques give different trade-offs and can be plotted as different lines on any of the graphical charts: lift, ROC or recall-precision curves (a threshold-sweep sketch follows below).
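A minimal sketch (with made-up scores and labels) of the generality trade-off described above: lowering the decision threshold of a scoring classifier increases TP, but at the cost of more FP.

```python
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,   0,   1,   1,   0,   0,   1,   0,   0  ]  # 1 = positive

for threshold in (0.85, 0.65, 0.45, 0.25):
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    print(f"threshold={threshold:.2f}  TP={tp}  FP={fp}")
# As the threshold drops, both TP and FP grow: more generality, less specificity.
```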

F-measure: combining precision and recall

From Information Retrieval, K. van Rijsbergen (1979):

$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

- Combines precision and recall in a single measure, giving equal weight to both (there are variants that weight each component differently).
- Inverse relation of precision and recall: a classification rule can be changed to increase precision (positive predictive value) by reducing the number of data points classified, hence reducing recall (sensitivity), and vice versa.
- Note: both measures ignore TN.

Lift charts

- In practice, costs are rarely known precisely; instead, decisions are often made by comparing possible scenarios.
- Lift comes from market research, where a typical goal is to identify a profitable target sub-group out of the total population.

Example: a promotional mailout to a population of 1,000,000 potential respondents.
- Baseline: 0.1% of all households in the total population will respond (1,000).
- Situation 1: classifier 1 identifies a target sub-group of the 100,000 most promising households, of which 0.4% will respond (400).
- Situation 2: classifier 2 identifies a target sub-group of the 400,000 most promising households, of which 0.2% will respond (800).

$\mathrm{Lift} = \frac{\text{response rate of target sub-group}}{\text{response rate of total population}}$

- Situation 1 gives a lift of 0.4 / 0.1 = 4.
- Situation 2 gives a lift of 0.2 / 0.1 = 2.
- Which situation is more profitable depends on cost estimates; a lift chart allows for a visual comparison.
- Typically, lift is used to see how well a classifier is doing compared to a base level (e.g., guessing the most common class, as in ZeroR).

Hypothetical lift chart

[Figure omitted: a hypothetical lift chart.]

(A worked F-measure and lift calculation follows below.)
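A minimal sketch (illustrative) computing the F-measure for the Table 2.2 counts and the lift values for the mailout example above.

```python
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def lift(subgroup_responders, subgroup_size, total_responders, total_size):
    return (subgroup_responders / subgroup_size) / (total_responders / total_size)

print(round(f_measure(tp=30, fp=10, fn=20), 3))   # harmonic mean of 0.75 and 0.6
print(lift(400, 100_000, 1_000, 1_000_000))       # 4.0 (situation 1)
print(lift(800, 400_000, 1_000, 1_000_000))       # 2.0 (situation 2)
```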

Generating a lift chart

Instances are sorted according to their predicted probability of being a true positive:

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
...    ...                     ...

In a lift chart, the x axis is the sample size and the y axis is the number of true positives.

ROC curves

- ROC curves are related to lift charts; ROC stands for "receiver operating characteristic".
- Used in signal detection to show the trade-off between hit rate and false alarm rate over a noisy channel.
- Differences to a lift chart:
  - the y axis shows the percentage of true positives in the sample (rather than the absolute number)
  - the x axis shows the percentage of false positives in the sample (rather than the sample size)

A sample ROC curve

[Figure omitted: a sample ROC curve.]

Applications: an ROC curve can be drawn
- if the classifier predicts the positive class with an attached value denoting the strength of the prediction, for example:
  - the probability that the instance is in the target class, e.g., logistic regression, Bayesian classifiers
  - the margin by which the instance is in the target class, e.g., ensemble learners, support vector machines
- if classifiers are learned by the same method with a range of different parameter settings; each outcome (e.g., accuracy) is a point on the ROC curve.

(A sketch tracing out ROC points from ranked predictions follows below.)
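A minimal sketch (illustrative) of tracing out ROC points from instances ranked by predicted probability: lower the threshold one instance at a time and record (FP rate, TP rate). The ranked data is made-up, extending the four rows shown above.

```python
ranked = [  # (predicted probability, actual class), already sorted descending
    (0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"),
    (0.80, "No"),  (0.75, "Yes"), (0.60, "No"), (0.40, "No"),
]

pos = sum(1 for _, c in ranked if c == "Yes")
neg = len(ranked) - pos

tp = fp = 0
roc_points = [(0.0, 0.0)]
for _, actual in ranked:          # each step: also predict this instance positive
    if actual == "Yes":
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / neg, tp / pos))   # (FP rate, TP rate)
    # (for simplicity, tied scores are processed one at a time here)

print(roc_points)   # ends at (1.0, 1.0); plotting these points gives the ROC curve
```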

Area under the ROC Curve (AUC)

Applicable for any learning method that can generate an ROC curve. Method, for a test set of P positive and N negative examples:
- sort the examples in descending order of strength of being in the positive class, as predicted by the classifier
- for each positive example $p_i$, $i = 1, \ldots, P$, count the number of negative examples ranked strictly below it (count 1/2 if a negative example is tied with a positive), and call this $c_i$
- sum all these counts, $C = \sum_{i=1}^{P} c_i$, then

$\mathrm{AUC} = \frac{C}{P \cdot N}$

How should we understand AUC? A natural probabilistic interpretation of this measure:
- the probability that the classifier will rank a randomly-drawn positive example higher than a randomly-drawn negative one
- the AUC will be 1 if all positive examples are ranked above all negative examples (compare with the ROC curve diagram)
- this is a version of a non-parametric statistical procedure (see the Mann-Whitney U-test, or the Wilcoxon rank-sum test)

(A counting-based sketch follows after the numeric prediction measures below.)

Numeric prediction evaluation measures

Based on the differences between predicted ($p_i$) and actual ($a_i$) values on a test set of n examples:

Mean squared error: $\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}$

Root mean squared error: $\sqrt{\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}}$

Mean absolute error: $\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{n}$

Relative absolute error: $\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{|a_1 - \bar{a}| + \ldots + |a_n - \bar{a}|}$, where $\bar{a} = \frac{1}{n}\sum_i a_i$

plus others; see, e.g., Weka.
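A minimal sketch (illustrative) of the counting method for AUC described above, reusing the same toy (score, class) data as the ROC sketch.

```python
def auc_by_counting(scored):
    """scored: list of (score, is_positive). Returns C / (P * N)."""
    pos = [s for s, y in scored if y]
    neg = [s for s, y in scored if not y]
    c = 0.0
    for p in pos:              # for each positive example ...
        for q in neg:          # ... count negatives ranked strictly below it
            if p > q:
                c += 1.0
            elif p == q:       # tie with a negative counts 1/2
                c += 0.5
    return c / (len(pos) * len(neg))

scored = [(0.95, True), (0.93, True), (0.93, False), (0.88, True),
          (0.80, False), (0.75, True), (0.60, False), (0.40, False)]
print(auc_by_counting(scored))   # 0.78125
```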

Table 2.1, p.52: Predictive machine learning scenarios

- Classification: label space L = C, output space Y = C; learn an approximation ĉ : X → C to the true labelling function c.
- Scoring and ranking: label space L = C, output space Y = R^|C|; learn a model that outputs a score vector over classes.
- Probability estimation: label space L = C, output space Y = [0,1]^|C|; learn a model that outputs a probability vector over classes.
- Regression: label space L = R, output space Y = R; learn an approximation f̂ : X → R to the true labelling function f.

Classification

A classifier is a mapping ĉ : X → C, where C = {C_1, C_2, ..., C_k} is a finite and usually small set of class labels. We will sometimes also use C_i to indicate the set of examples of that class. We use the hat to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance (sometimes contaminated by noise). Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space X).

Figure 2.1, p.53: A decision tree

[Figure omitted.] (left) A feature tree with training-set class distributions in the leaves: splitting on 'viagra' and then on 'lottery' gives leaves with spam:20/ham:40 ('viagra'=0, 'lottery'=0), spam:10/ham:5 ('viagra'=0, 'lottery'=1) and spam:20/ham:5 ('viagra'=1). (right) The decision tree obtained using the majority class decision rule: ĉ(x) = ham for the first leaf and ĉ(x) = spam for the other two.

Table 2.2, p.54: Contingency table

(left)           Predicted +   Predicted -
Actual +         30            20            50
Actual -         10            40            50
                 40            60            100

(right)          Predicted +   Predicted -
Actual +         20            30            50
Actual -         20            30            50
                 40            60            100

(left) A two-class contingency table or confusion matrix depicting the performance of the decision tree in Figure 2.1. Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns prediction errors. (right) A contingency table with the same marginals but independent rows and columns.
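A minimal sketch (illustrative) deriving the contingency table in Table 2.2 (left) from the leaf class distributions of the feature tree in Figure 2.1, using the majority class decision rule with spam as the positive class.

```python
leaves = [  # (spam count, ham count) in each leaf of the feature tree
    (20, 40),   # 'viagra' = 0, 'lottery' = 0
    (10, 5),    # 'viagra' = 0, 'lottery' = 1
    (20, 5),    # 'viagra' = 1
]

tp = fn = fp = tn = 0
for spam, ham in leaves:
    if spam >= ham:        # majority class rule: label the whole leaf spam
        tp += spam
        fp += ham
    else:                  # otherwise label the whole leaf ham
        fn += spam
        tn += ham

print([[tp, fn], [fp, tn]])   # [[30, 20], [10, 40]], as in Table 2.2 (left)
```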

Example 2.1, p.56: Accuracy as a weighted average

Suppose a classifier's predictions on a test set are as in the following table:

                 Predicted +   Predicted -
Actual +         60            15            75
Actual -         10            15            25
                 70            30            100

From this table we see that the true positive rate is tpr = 60/75 = 0.80 and the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is acc = (60 + 15)/100 = 0.75, which is no longer the average of the true positive and true negative rates. However, taking into account the proportion of positives pos = 0.75 and the proportion of negatives neg = 1 - pos = 0.25, we see that

$\mathrm{acc} = \mathrm{pos} \cdot \mathrm{tpr} + \mathrm{neg} \cdot \mathrm{tnr}$

(A spelled-out check of this identity follows after the tables and figure below.)

Table 2.3, p.57: Performance measures I

number of positives: Pos = Σ_{x∈Te} I[c(x) = ⊕]
number of negatives: Neg = Σ_{x∈Te} I[c(x) = ⊖] = |Te| - Pos
number of true positives: TP = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕]
number of true negatives: TN = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊖]
number of false positives: FP = Σ_{x∈Te} I[ĉ(x) = ⊕, c(x) = ⊖] = Neg - TN
number of false negatives: FN = Σ_{x∈Te} I[ĉ(x) = ⊖, c(x) = ⊕] = Pos - TP
proportion of positives: pos = (1/|Te|) Σ_{x∈Te} I[c(x) = ⊕] = Pos/|Te|; estimates P(c(x) = ⊕)
proportion of negatives: neg = (1/|Te|) Σ_{x∈Te} I[c(x) = ⊖] = 1 - pos; estimates P(c(x) = ⊖)
class ratio: clr = pos/neg = Pos/Neg
(*) accuracy: acc = (1/|Te|) Σ_{x∈Te} I[ĉ(x) = c(x)]; estimates P(ĉ(x) = c(x))
(*) error rate: err = (1/|Te|) Σ_{x∈Te} I[ĉ(x) ≠ c(x)] = 1 - acc; estimates P(ĉ(x) ≠ c(x))

Table 2.3, p.57: Performance measures II

true positive rate, sensitivity, recall: tpr = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕] / Σ_{x∈Te} I[c(x) = ⊕] = TP/Pos; estimates P(ĉ(x) = ⊕ | c(x) = ⊕)
true negative rate, specificity: tnr = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊖] / Σ_{x∈Te} I[c(x) = ⊖] = TN/Neg; estimates P(ĉ(x) = ⊖ | c(x) = ⊖)
false positive rate, false alarm rate: fpr = Σ_{x∈Te} I[ĉ(x) = ⊕, c(x) = ⊖] / Σ_{x∈Te} I[c(x) = ⊖] = FP/Neg = 1 - tnr; estimates P(ĉ(x) = ⊕ | c(x) = ⊖)
false negative rate: fnr = Σ_{x∈Te} I[ĉ(x) = ⊖, c(x) = ⊕] / Σ_{x∈Te} I[c(x) = ⊕] = FN/Pos = 1 - tpr; estimates P(ĉ(x) = ⊖ | c(x) = ⊕)
precision, confidence: prec = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕] / Σ_{x∈Te} I[ĉ(x) = ⊕] = TP/(TP + FP); estimates P(c(x) = ⊕ | ĉ(x) = ⊕)

Table: A summary of different quantities and evaluation measures for classifiers on a test set Te. Symbols starting with a capital letter denote absolute frequencies (counts), while lower-case symbols denote relative frequencies or ratios. All except those indicated with (*) are defined only for binary classification.

Figure 2.2, p.58: A coverage plot

[Figure omitted: two coverage plots with Negatives on the x axis and Positives on the y axis.] (left) A coverage plot depicting the two contingency tables in Table 2.2. The plot is square because the class distribution is uniform. (right) Coverage plot for Example 2.1, with a class ratio clr = 3.
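A spelled-out check of the weighted-average identity from Example 2.1, with the intermediate step written out (this derivation is not on the slide itself):

$\mathrm{acc} = \frac{TP + TN}{|Te|} = \frac{Pos}{|Te|}\cdot\frac{TP}{Pos} + \frac{Neg}{|Te|}\cdot\frac{TN}{Neg} = \mathrm{pos}\cdot\mathrm{tpr} + \mathrm{neg}\cdot\mathrm{tnr}$

With the Example 2.1 numbers: $0.75 \cdot 0.80 + 0.25 \cdot 0.60 = 0.60 + 0.15 = 0.75$, matching the overall accuracy computed directly from the table.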

Figure 2.3, p.59: An ROC plot

[Figure omitted: (left) a coverage plot with Negatives/Positives axes; (right) the same plot with normalised axes, i.e. False positive rate/True positive rate.] (left) C1 and C3 both dominate C2, but neither dominates the other. The diagonal line indicates that C1 and C3 achieve equal accuracy. (right) The same plot with normalised axes. We can interpret this plot as a merger of the two coverage plots in Figure 2.2, employing normalisation to deal with the different class distributions. The diagonal line now indicates that C1 and C3 have the same average recall.

Important points to remember

- In a coverage plot, classifiers with the same accuracy are connected by line segments with slope 1.
- In an ROC plot (i.e., a normalised coverage plot), line segments with slope 1 connect classifiers with the same average recall.

Figure 2.4, p.61: Comparing coverage and ROC plots

[Figure omitted: a coverage plot and the corresponding ROC plot for classifiers C1, C2, C3.] (left) In a coverage plot, accuracy isometrics have a slope of 1, and average recall isometrics are parallel to the ascending diagonal. (right) In the corresponding ROC plot, average recall isometrics have a slope of 1; the accuracy isometric here has a slope of 3, corresponding to the ratio of negatives to positives in the data set.

5. Summary

- Evaluation for machine learning and data mining is a complex issue.
- Some commonly used methods have been found to work well in practice: 10 x 10-fold cross-validation; the corrected resampled t-test (Weka).
- Issues to be aware of:
  - multiple testing: from a large set of hypotheses, some will appear good at random
  - does the off-training-set distribution match that of the training set?

A cautionary tale...

In a WW2 survey carried out by RAF Bomber Command, all bombers returning from bombing raids over Germany over a particular period were inspected. All damage inflicted by German air defences was noted, and an initial recommendation was given that armour be added in the most heavily damaged areas. Further analysis instead made the surprising and counter-intuitive recommendation that the armour be placed in the areas which were completely untouched by damage. The reasoning was that the survey was biased, since it only included aircraft that successfully came back from Germany. The untouched areas were probably vital areas which, if hit, would result in the loss of the aircraft.