ROC Graphs: Notes and Practical Considerations for Data Mining Researchers

Save this PDF as:

Size: px
Start display at page:

Transcription

1 ROC Graphs: Notes and Practical Considerations for Data Mining Researchers Tom Fawcett 1 1 Intelligent Enterprise Technologies Laboratory, HP Laboratories Palo Alto Alexandre Savio (GIC) ROC Graphs GIC 09 1 / 56

2 Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC 09 2 / 56

3 Introduction ROC Graphs Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classiers and visualizing their performance. An ROC graph is a technique for visualizing, organizing and selecting classiers based on their performance. ROC graphs have long been used in signal detection theory to depict the trade-o between hit rates and false alarm rates of classiers. Alexandre Savio (GIC) ROC Graphs GIC 09 3 / 56

4 Introduction ROC Graphs Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. One of the earliest adopters of ROC graphs in machine learning was Spackman (1989), who demonstrated the value of ROC curves in evaluating and comparing algorithms. ROC graphs are conceptually simple, but there are some non-obvious complexities that arise when they are used in research. Alexandre Savio (GIC) ROC Graphs GIC 09 4 / 56

5 Introduction Confusion matrix and performance metrics Figure: A confusion matrix and several common performance metrics that can be calculated from it. Alexandre Savio (GIC) ROC Graphs GIC 09 5 / 56

6 Classier Performance Classication model Considering a two-class classication problem: each instance I is mapped to one element of the set {p, n} of positive and negative class labels. A classication model is a mapping from instances to predicted classes. To distinguish between the actual class and the predicted class we use the labels {Y, N} for the class predictions produced by a model. Alexandre Savio (GIC) ROC Graphs GIC 09 6 / 56

7 Classier Performance TP, FP, TN and FN Given a classier and an instance, there are four possible outcomes: TP: if a positive instance is classied as positive. FN: if a positive instance is classied as negative. TN: if a negative instance is classied as negative. FN: if a negative instance is classied as positive. A two-by-two confusion matrix (or contingency matrix) can be constructed representing the dispositions of the set of instances. Alexandre Savio (GIC) ROC Graphs GIC 09 7 / 56

8 Classier Performance Metrics from the confusion matrix True Positive rate (also called hit rate or recall): positives correctly classied TP rate total positives False Positive rate (also called false alaram): negatives incorrectly classied FP rate total negatives Additional terms associated with ROC curves: Sensitivity = Recall Specicity = TN FP + TN = 1 FP rate Positive predictive value = Precision Alexandre Savio (GIC) ROC Graphs GIC 09 8 / 56

9 ROC Space ROC Graphs ROC graphs are two-dimensional graphs in which TP rate is plotted on the Y axis and FP rate is plotted on the X axis. A ROC graph depicts relative trade-os between benets (true positives) and costs (false positives). A discrete classier is one that outputs only a class label. Each discrete classier produces an (FP rate, TP rate) pair, which corresponds to a single point in ROC space. Alexandre Savio (GIC) ROC Graphs GIC 09 9 / 56

10 ROC Space An example of a ROC graph Figure: A basic ROC graph showing ve discrete classiers. Alexandre Savio (GIC) ROC Graphs GIC / 56

11 ROC Space Important Points in ROC space The lower left point (0,0) represents the strategy of never issuing a positive classication. The opposite strategy is represented by the upper right point (1,1). The point (0,1) represents perfect classication. Alexandre Savio (GIC) ROC Graphs GIC / 56

12 ROC Space Important Points in ROC space Informally, one point in ROC space is better than another if it is to the northwest (TP rate is higher, FP rate is lower, or both) of the rst. Classiers appearing on the left hand-side of an ROC graph, near the X axis, may be thought of as conservative: they make positive classications only with strong evidence so they make few false positive errors, but they often have low true positive rates as well. Classiers on the upper right-hand side of an ROC graph may be thought of as liberal. Alexandre Savio (GIC) ROC Graphs GIC / 56

13 ROC Space Random Performance Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

14 ROC Space Random Performance Random Performance The diagonal line y = x represents the strategy of randomly guessing a class. A random classier will produce an ROC point that slides back and forth on the diagonal based on the frequency with which it guesses the positive class. In order to get away from this diagonal into the upper triangular region, the classier must exploit some information in the data. Any classier that appears in the lower right triangle performs worse than random guessing. This triangle is therefore usually empty in ROC graphs. Alexandre Savio (GIC) ROC Graphs GIC / 56

15 Curves in ROC space Curves Some classiers yield an instance probability or score, a numeric value that represents the degree to which an instance is a member of a class. Such a ranking or scoring classier can be used with a threshold to produce a discrete (binary) classier, we may imagine varying a threshold from to + and tracing a curve through ROC space. But this is a poor way of generating a ROC curve. Alexandre Savio (GIC) ROC Graphs GIC / 56

16 Curves in ROC space Relative versus absolute scores Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

17 Curves in ROC space Relative versus absolute scores Relative versus absolute scores A consequence of relative scoring is that classier scores should not be compared across model classes. One model class may be designed to produce scores in the range [0,1] while another produces scores in [-1,+1] or [1,100]. Comparing model performance at a common threshold will be meaningless. An important point about ROC graphs is that they measure the ability of a classier to produce good relative instance scores. Alexandre Savio (GIC) ROC Graphs GIC / 56

18 Curves in ROC space Class skew Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

19 Curves in ROC space Class skew Class skew ROC curves have an attractive property: they are insensitive to changes in class distribution. Any performance metric that uses values from both columns of the confusion matrix will be inherently sensitive to class skews. Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix. ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so do not depend on class distributions. Alexandre Savio (GIC) ROC Graphs GIC / 56

20 Curves in ROC space Creating scoring classiers Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

21 Curves in ROC space Creating scoring classiers Creating scoring classiers Even if a classier only produces a class label, an aggregation of them may be used to generate a score. MetaCost (Domingos, 1999) employs bagging to generate an ensemble of discrete classiers, each of which produces a vote. The set of votes could be used to generate a score. Finally, some combination of scoring and voting can be employed. For example, rules can be provide basic probability estimates, which may then be used in weighted voting. Alexandre Savio (GIC) ROC Graphs GIC / 56

22 Ecient generation of ROC curves Ecient generation of ROC curves Given a test set, we often want to generate an ROC curve eciently from it. Exploiting the monotonicity of thresholded classications a much better algorithm can be created: any instance that is classied positive with respect to a given threshold will be classied positive for all lower thresholds as well. Therefore, we can simply sort the test instances by f scores, from highest to lowest, and move down the list, processing one instance at a time and updating TP and FP as we go. In this way we can create an ROC graph from a linear scan. Let n be the number of points in the test set. This algorithm requires an O(n log n) sort followed by an O(n) scan down the list, resulting in O(n log n) total complexity. Alexandre Savio (GIC) ROC Graphs GIC / 56

23 Ecient generation of ROC curves Practical method for calculating an ROC curve from a test set Figure: Calculating a ROC curve. Alexandre Savio (GIC) ROC Graphs GIC / 56

24 Ecient generation of ROC curves Equally scored instances Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

25 Ecient generation of ROC curves Equally scored instances Equally scored instances Steps 7-9 of algorithm 2 are necessary in order to correctly handle sequences of equally scored instances. The sort in line 1 of algorithm 2 does not impose any specic ordering on these instances since their f scores are equal. We want the ROC curve to represent the expected performance of the classier, which, lacking any other information, is the average of the pessimistic and optimistic. Alexandre Savio (GIC) ROC Graphs GIC / 56

26 Ecient generation of ROC curves Creating convex ROC curves Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

27 Ecient generation of ROC curves Creating convex ROC curves Creating convex ROC curves Figure: Modications to algorithm 2 to avoid introducing concavities Alexandre Savio (GIC) ROC Graphs GIC / 56

28 Area under an ROC Curve (AUC) Area under an ROC Curve (AUC) An ROC curve is a 2D depiction of classier performance. To compare classiers we may want to reduce ROC performance to a single scalar value representing expected performance. A common method is to calculate the area under the ROC curve (AUC). Since AUC is a portion of the area of the unit square, its value will always be between 0 and 1.0. No realistic classier should have an AUC less than 0.5. Alexandre Savio (GIC) ROC Graphs GIC / 56

29 Area under an ROC Curve (AUC) Area under an ROC Curve (AUC) IMPORTANT: AUC of a classier is equivalent to the probability that the classier will rank a randomly chose positive instance higher than a randomly chosen negative instance. The AUC is also closely related to the Gini index: Gini + 1 = 2 AUC In practice the AUC performs very well and is often used when a general measure of predictiveness is desired. Alexandre Savio (GIC) ROC Graphs GIC / 56

30 Area under an ROC Curve (AUC) Calculating the AUC Figure: Algorithm to calculate the AUC Alexandre Savio (GIC) ROC Graphs GIC / 56

31 Averaging ROC Curves Averaging ROC Curves Although ROC curves may be used to evaluate classiers, care should be taken when using them to make conclusions about classier superiority. Some researchers have assumed that an ROC graph may be used to select the best classiers simply by graphing them in ROC space and seeing which ones dominate. This is misleading; it is analogous to taking the maximum of a set of accuracy gures from a single test set. Without a measure of variance we cannot easily compare the classiers. Averaging ROC curves is easy if the original instances are available. This section presents two methods for averaging ROC curves: vertical and threshold averaging. Alexandre Savio (GIC) ROC Graphs GIC / 56

32 Averaging ROC Curves Vertical Averaging Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

33 Averaging ROC Curves Vertical Averaging Vertical Averaging Vertical Averaging takes vertical samples of the ROC curves for xed FP rates and averages the corresponding TP rates. Such averaging is appropriate when the FP rate can indeed be xed or when a 1D measure of variation is desired. Alexandre Savio (GIC) ROC Graphs GIC / 56

34 Averaging ROC Curves Vertical Averaging Method of Vertical Averaging Each ROC curve is treated as a function, R i, such that TP = R i (FP). This is done by choosing the maximum TP for each FP and interpolating between points when necessary. The averaged ROC curve is the function ˆR(FP) = mean[ri (FP)]. To plot an average ROC curve we can sample from ˆR at points regularly spaced along the FP -axis. Condence intervals of the mean of TP are computed using the common assumption of a binomial distribution. Alexandre Savio (GIC) ROC Graphs GIC / 56

35 Averaging ROC Curves Vertical Averaging Algorithm for Vertical Averaging Figure: Vertical averaging of ROC curves Alexandre Savio (GIC) ROC Graphs GIC / 56

36 Averaging ROC Curves Threshold Averaging Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

37 Averaging ROC Curves Threshold Averaging Threshold Averaging Vertical averaging has the advantage that averages are made of a single dependent varaible (FP-rate). This simplies computing condence intervals. However FP-rate is often not under the direct control of the researcher. Threshold averaging, instead of sampling points based on their positions in ROC space, as vertical averaging does, it samples based on the thresholds that produced these points. It generates an array T of classier scores which are sorted from largest to smallest and used as the set of thresholds. These thresholds are sampled at xed intervals determined by the number of samples desired. For a given threshold, the algorithm selects from each ROC curve the the point of greatest score less than or equal to the threshold. These points are then averaged separately along their X and Y axes, with the center point returned in the Avg array. Alexandre Savio (GIC) ROC Graphs GIC / 56

38 Averaging ROC Curves Threshold Averaging Algorithm for Threshold Averaging Figure: Threshold averaging or ROC curves Alexandre Savio (GIC) ROC Graphs GIC / 56

39 Additional Topics The ROC convex hull Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

40 Additional Topics The ROC convex hull The ROC convex hull One advantage of ROC graphs is that they enable visualizing and organizing classier performance without regard to class distributions or error costs. A researcher can graph the performance of a set of classiers, and that graph will remain invariant with respect to the operating conditions (class skews and error costs). A set of operating conditions may be transformed into a so-called iso-performance line in ROC space. All classiers corresponding to points on a line of slope m have the same expected cost. Generally a classier is potentially optimal if and only if it lies on the convex hull of the set points in ROC space, the ROC convex hull (ROCCH). The operating conditions of the classier may be translated into an iso-performance line, which in turn may be used to identify a portion of the ROCCH. As conditions change, the hull itself does not change; Alexandre Savio (GIC) ROC Graphs GIC / 56

41 Additional Topics Decision problems with more than two classes Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

42 Additional Topics Decision problems with more than two classes When we have more than two classes The two axes represent trade-os between errors (false positives) and benets (true positives) that a classier makes between two classes. Much of the analysis is straight-forward because of the symmetry that exists in the two-class problem. Alexandre Savio (GIC) ROC Graphs GIC / 56

43 Additional Topics Decision problems with more than two classes Multi-class ROC graphs With n classes the confusion matrix becomes an n n matrix containing the n correct classications and n 2 n possible errors. Instead of managing trade-os between TP and FP, we have n benets and n 2 n errors. One method is to produce n dierent ROC graphs, one for each class: the class reference formulation. Specically, if C is the set of all classes, ROC graph i plots the classication performance using class c i as the positive class and all other classes as the negative class. This formulation compromises one of the attractions of ROC graphs, namely that they are insensitive to class skew (see section 4.2). Because each N i comprises the union of n 1 classes, changes in prevalence within these classes may alter the c i 's ROC graph. However, this method can work in practice and provide reasonable exibility in evaluation. Alexandre Savio (GIC) ROC Graphs GIC / 56

44 Additional Topics Decision problems with more than two classes Multi-class AUC The AUC is a measure of discriminability of a pair of classes. In a two-class problem AUC is a single scalar value, but a multi-class problem introduces the issue of combining multiple pairwise discriminability values. Alexandre Savio (GIC) ROC Graphs GIC / 56

45 Additional Topics Decision problems with more than two classes Provost and Domingo's Multi-class AUC approach One approach is to generate each class reference ROC curve in turn, measure the AUC, then assuming the AUCs weighted by the reference class's prevalence in the data. More precisely: AUC total = AUC(c i ) p(c i ) c i C The disadvantage is that the class reference ROC is sensitive to class distributions and error costs, so this formulation of AUC total is as well. Alexandre Savio (GIC) ROC Graphs GIC / 56

46 Additional Topics Decision problems with more than two classes Hand and Till Multi-class AUC approach They desired a measure that is insensitive to class distribution and error costs. The derivation is too detailed to summarize here, but it is based upon the fact that the AUC is equivalent to the probability that the classier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. From this probabilistic form, they derive a formulation that measures the unweighted pairwise discriminability of classes. While Hand and Till's formulation is well justied and is insensitive to changes in class distribution, there is no easy way to visualize the surface whose area is being calculated. Alexandre Savio (GIC) ROC Graphs GIC / 56

47 Additional Topics Combining classiers Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

48 Additional Topics Combining classiers Interpolating classiers Alexandre Savio (GIC) ROC Graphs GIC / 56

49 Additional Topics Combining classiers Logically combining classiers Alexandre Savio (GIC) ROC Graphs GIC / 56

50 Additional Topics Combining classiers Chaining classiers Alexandre Savio (GIC) ROC Graphs GIC / 56

51 Additional Topics Alternatives to ROC graphs Outline 1 Introduction 2 Classier Performance 3 ROC Space Random Performance 4 Curves in ROC space Relative versus absolute scores Class skew Creating scoring classiers 5 Ecient generation of ROC curves Equally scored instances Creating convex ROC curves 6 Area under an ROC Curve (AUC) 7 Averaging ROC Curves Vertical Averaging Threshold Averaging 8 Additional Topics The ROC convex hull Alexandre Savio (GIC) ROC Graphs GIC / 56

52 Additional Topics Alternatives to ROC graphs DET curves DET graphs are not so much an alternative to ROC curves as an alternative way of presenting them. There are two dierences: DET graphs plot false negatives on the Y axis instead of true positives, so they plot one kind of error against another. DET graphs are log scaled on both axes so that the area of the lower left part of the curve (which corresponds to the upper left portion of an ROC graph) is expanded. The log scaling of a DET graph gives the lower left region of the graph greater surface area and allows these classiers with low false positive rates and/or low false negative rates to be compared more easily. Alexandre Savio (GIC) ROC Graphs GIC / 56

53 Additional Topics Alternatives to ROC graphs Cost curves On a cost curve, the X axis ranges from 0 to 1 and measures the proportion of positives in the distribution. The Y axis, also from 0 to 1, is the relative expected misclassication cost. A perfect classier is a horizontal line from (0; 0) to (0; 1). Cost curves are a point-line dual of ROC curves: a point (i.e., a discrete classier) in ROC space is represented by a line in cost space, with the line designating the relative expected cost of the classier. For any X point, the corresponding Y points represent the expected costs of the classiers. Thus, while in ROC space the convex hull contains the set of lowest-cost classiers, in cost space the lower envelope represents this set. Alexandre Savio (GIC) ROC Graphs GIC / 56

54 Additional Topics Alternatives to ROC graphs Relative superiority graphs and the LC index LC index is a transformation of ROC curves that facilitates comparing classiers by cost. Adams and Hand's method maps the ratio of error costs onto the interval (0,1). It then transforms a set of ROC curves into a set of parallel lines showing which classier dominates at which region in the interval. An expert provides a sub-range of (0,1) within which the ratio is expected to fall, as well as a most likely value for the ratio. This serves to focus attention on the interval of interest. The relative superiority graphs may be seen as a binary version of cost curves, in which we are only interested in which classier is superior. The LC index (for loss comparison) is thus a measure of condence of superiority rather than of cost dierence. Alexandre Savio (GIC) ROC Graphs GIC / 56

55 Conclusion Conclusion ROC graphs are a very useful tool for visualizing and evaluating classiers. They are able to provide a richer measure of classication performance than accuracy or error rate can, and they have advantages over other evaluation measures such as precision-recall graphs and lift curves. Alexandre Savio (GIC) ROC Graphs GIC / 56

56 Questions? Thank you for your attention.

ROC Graphs: Notes and Practical Considerations for Data Mining Researchers

ROC Graphs: Notes and Practical Considerations for Data Mining Researchers Tom Fawcett Intelligent Enterprise Technologies Laboratory HP Laboratories Palo Alto HPL-23-4 January 7 th, 23* E-mail: tom_fawcett@hp.com

Data Mining Practical Machine Learning Tools and Techniques

Counting the cost Data Mining Practical Machine Learning Tools and Techniques Slides for Section 5.7 In practice, different types of classification errors often incur different costs Examples: Loan decisions

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

Performance Measures for Machine Learning

Performance Measures for Machine Learning 1 Performance Measures Accuracy Weighted (Cost-Sensitive) Accuracy Lift Precision/Recall F Break Even Point ROC ROC Area 2 Accuracy Target: 0/1, -1/+1, True/False,

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status

Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of

ROC Curve, Lift Chart and Calibration Plot

Metodološki zvezki, Vol. 3, No. 1, 26, 89-18 ROC Curve, Lift Chart and Calibration Plot Miha Vuk 1, Tomaž Curk 2 Abstract This paper presents ROC curve, lift chart and calibration plot, three well known

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

Introduction. Agents have preferences over the two goods which are determined by a utility function. Speci cally, type 1 agents utility is given by

Introduction General equilibrium analysis looks at how multiple markets come into equilibrium simultaneously. With many markets, equilibrium analysis must take explicit account of the fact that changes

Unit 9 Describing Relationships in Scatter Plots and Line Graphs

Unit 9 Describing Relationships in Scatter Plots and Line Graphs Objectives: To construct and interpret a scatter plot or line graph for two quantitative variables To recognize linear relationships, non-linear

Mining Life Insurance Data for Customer Attrition Analysis

Mining Life Insurance Data for Customer Attrition Analysis T. L. Oshini Goonetilleke Informatics Institute of Technology/Department of Computing, Colombo, Sri Lanka Email: oshini.g@iit.ac.lk H. A. Caldera

ALGEBRA I / ALGEBRA I SUPPORT

Suggested Sequence: CONCEPT MAP ALGEBRA I / ALGEBRA I SUPPORT August 2011 1. Foundations for Algebra 2. Solving Equations 3. Solving Inequalities 4. An Introduction to Functions 5. Linear Functions 6.

Validating a Credit Score Model in Conjunction with Additional Underwriting Criteria September 2012

Validating a Credit Score Model in Conjunction with Additional Underwriting Criteria September 2012 INTRODUCTION Model validation is a critical activity to verify that credit scorecards are working as

Controlling the Sensitivity of Support Vector Machines. K. Veropoulos, C. Campbell, N. Cristianini. United Kingdom. maximising the Lagrangian:

Controlling the Sensitivity of Support Vector Machines K. Veropoulos, C. Campbell, N. Cristianini Department of Engineering Mathematics, Bristol University, Bristol BS8 1TR, United Kingdom Abstract For

15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

Performance Measures in Data Mining

Performance Measures in Data Mining Common Performance Measures used in Data Mining and Machine Learning Approaches L. Richter J.M. Cejuela Department of Computer Science Technische Universität München

Appendix E: Graphing Data

You will often make scatter diagrams and line graphs to illustrate the data that you collect. Scatter diagrams are often used to show the relationship between two variables. For example, in an absorbance

Review of Fundamental Mathematics

Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools

Measuring Lift Quality in Database Marketing

Measuring Lift Quality in Database Marketing Gregory Piatetsky-Shapiro Xchange Inc. One Lincoln Plaza, 89 South Street Boston, MA 2111 gps@xchange.com Sam Steingold Xchange Inc. One Lincoln Plaza, 89 South

Performance Metrics. number of mistakes total number of observations. err = p.1/1

p.1/1 Performance Metrics The simplest performance metric is the model error defined as the number of mistakes the model makes on a data set divided by the number of observations in the data set, err =

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

There are some general common sense recommendations to follow when presenting

Presentation of Data The presentation of data in the form of tables, graphs and charts is an important part of the process of data analysis and report writing. Although results can be expressed within

Module 3: Correlation and Covariance

Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

Quality and Complexity Measures for Data Linkage and Deduplication

Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au

Data Mining Practical Machine Learning Tools and Techniques

Credibility: Evaluating what s been learned Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Issues: training, testing,

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

Grade 6 Mathematics Performance Level Descriptors

Limited Grade 6 Mathematics Performance Level Descriptors A student performing at the Limited Level demonstrates a minimal command of Ohio s Learning Standards for Grade 6 Mathematics. A student at this

Lean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY

TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY Before we begin: Turn on the sound on your computer. There is audio to accompany this presentation. Audio will accompany most of the online

c 2008 Je rey A. Miron We have described the constraints that a consumer faces, i.e., discussed the budget constraint.

Lecture 2b: Utility c 2008 Je rey A. Miron Outline: 1. Introduction 2. Utility: A De nition 3. Monotonic Transformations 4. Cardinal Utility 5. Constructing a Utility Function 6. Examples of Utility Functions

Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

PREDICTING SUCCESS IN THE COMPUTER SCIENCE DEGREE USING ROC ANALYSIS

PREDICTING SUCCESS IN THE COMPUTER SCIENCE DEGREE USING ROC ANALYSIS Arturo Fornés arforser@fiv.upv.es, José A. Conejero aconejero@mat.upv.es 1, Antonio Molina amolina@dsic.upv.es, Antonio Pérez aperez@upvnet.upv.es,

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

CSC574 - Computer and Network Security Module: Intrusion Detection

CSC574 - Computer and Network Security Module: Intrusion Detection Prof. William Enck Spring 2013 1 Intrusion An authorized action... that exploits a vulnerability... that causes a compromise... and thus

1 Example of Time Series Analysis by SSA 1

1 Example of Time Series Analysis by SSA 1 Let us illustrate the 'Caterpillar'-SSA technique [1] by the example of time series analysis. Consider the time series FORT (monthly volumes of fortied wine sales

Random Forest Based Imbalanced Data Cleaning and Classification

Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem

6.4 Normal Distribution

Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

Arrangements And Duality

Arrangements And Duality 3.1 Introduction 3 Point configurations are tbe most basic structure we study in computational geometry. But what about configurations of more complicated shapes? For example,

The Relationship Between Precision-Recall and ROC Curves

Jesse Davis jdavis@cs.wisc.edu Mark Goadrich richm@cs.wisc.edu Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 2 West Dayton Street,

Tastes and Indifference Curves

C H A P T E R 4 Tastes and Indifference Curves This chapter begins a 2-chapter treatment of tastes or what we also call preferences. In the first of these chapters, we simply investigate the basic logic

Chapter 12: Cost Curves

Chapter 12: Cost Curves 12.1: Introduction In chapter 11 we found how to minimise the cost of producing any given level of output. This enables us to find the cheapest cost of producing any given level

Spearman s correlation

Spearman s correlation Introduction Before learning about Spearman s correllation it is important to understand Pearson s correlation which is a statistical measure of the strength of a linear relationship

Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.

Inferential Statistics

Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are

TYPES OF NUMBERS. Example 2. Example 1. Problems. Answers

TYPES OF NUMBERS When two or more integers are multiplied together, each number is a factor of the product. Nonnegative integers that have exactly two factors, namely, one and itself, are called prime

Centroid: The point of intersection of the three medians of a triangle. Centroid

Vocabulary Words Acute Triangles: A triangle with all acute angles. Examples 80 50 50 Angle: A figure formed by two noncollinear rays that have a common endpoint and are not opposite rays. Angle Bisector:

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

Chapter 8 Graphs and Functions:

Chapter 8 Graphs and Functions: Cartesian axes, coordinates and points 8.1 Pictorially we plot points and graphs in a plane (flat space) using a set of Cartesian axes traditionally called the x and y axes

EQUATIONS and INEQUALITIES

EQUATIONS and INEQUALITIES Linear Equations and Slope 1. Slope a. Calculate the slope of a line given two points b. Calculate the slope of a line parallel to a given line. c. Calculate the slope of a line

SPSS: Descriptive and Inferential Statistics. For Windows

For Windows August 2012 Table of Contents Section 1: Summarizing Data...3 1.1 Descriptive Statistics...3 Section 2: Inferential Statistics... 10 2.1 Chi-Square Test... 10 2.2 T tests... 11 2.3 Correlation...

10-3 Measures of Central Tendency and Variation

10-3 Measures of Central Tendency and Variation So far, we have discussed some graphical methods of data description. Now, we will investigate how statements of central tendency and variation can be used.

1. Classification problems

Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

Chapter 3 RANDOM VARIATE GENERATION

Chapter 3 RANDOM VARIATE GENERATION In order to do a Monte Carlo simulation either by hand or by computer, techniques must be developed for generating values of random variables having known distributions.

Chapter 15 Multiple Choice Questions (The answers are provided after the last question.)

Chapter 15 Multiple Choice Questions (The answers are provided after the last question.) 1. What is the median of the following set of scores? 18, 6, 12, 10, 14? a. 10 b. 14 c. 18 d. 12 2. Approximately

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

Module 5: Measuring (step 3) Inequality Measures

Module 5: Measuring (step 3) Inequality Measures Topics 1. Why measure inequality? 2. Basic dispersion measures 1. Charting inequality for basic dispersion measures 2. Basic dispersion measures (dispersion

Measurement with Ratios

Grade 6 Mathematics, Quarter 2, Unit 2.1 Measurement with Ratios Overview Number of instructional days: 15 (1 day = 45 minutes) Content to be learned Use ratio reasoning to solve real-world and mathematical

Data Analysis: Displaying Data - Graphs

Accountability Modules WHAT IT IS Return to Table of Contents WHEN TO USE IT TYPES OF GRAPHS Bar Graphs Data Analysis: Displaying Data - Graphs Graphs are pictorial representations of the relationships

Using Random Forest to Learn Imbalanced Data

Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,

Lecture 6. Inverse of Matrix

Lecture 6 Inverse of Matrix Recall that any linear system can be written as a matrix equation In one dimension case, ie, A is 1 1, then can be easily solved as A x b Ax b x b A 1 A b A 1 b provided that

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Graphical Presentation of Data

Graphical Presentation of Data Guidelines for Making Graphs Titles should tell the reader exactly what is graphed Remove stray lines, legends, points, and any other unintended additions by the computer

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

Lecture #2 Overview. Basic IRT Concepts, Models, and Assumptions. Lecture #2 ICPSR Item Response Theory Workshop

Basic IRT Concepts, Models, and Assumptions Lecture #2 ICPSR Item Response Theory Workshop Lecture #2: 1of 64 Lecture #2 Overview Background of IRT and how it differs from CFA Creating a scale An introduction

Operation Count; Numerical Linear Algebra

10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

Reflection and Refraction

Equipment Reflection and Refraction Acrylic block set, plane-concave-convex universal mirror, cork board, cork board stand, pins, flashlight, protractor, ruler, mirror worksheet, rectangular block worksheet,

Flow and Activity Analysis

Facility Location, Layout, and Flow and Activity Analysis Primary activity relationships Organizational relationships» Span of control and reporting hierarchy Flow relationships» Flow of materials, people,

1. I have 4 sides. My opposite sides are equal. I have 4 right angles. Which shape am I?

Which Shape? This problem gives you the chance to: identify and describe shapes use clues to solve riddles Use shapes A, B, or C to solve the riddles. A B C 1. I have 4 sides. My opposite sides are equal.

Supervised Learning with Unsupervised Output Separation

Supervised Learning with Unsupervised Output Separation Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa 150 Louis Pasteur, P.O. Box 450 Stn. A Ottawa, Ontario,

Descriptive Statistics

Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

Survival Analysis Using SPSS. By Hui Bian Office for Faculty Excellence

Survival Analysis Using SPSS By Hui Bian Office for Faculty Excellence Survival analysis What is survival analysis Event history analysis Time series analysis When use survival analysis Research interest

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

Beating the MLB Moneyline

Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

Data Analysis: Describing Data - Descriptive Statistics

WHAT IT IS Return to Table of ontents Descriptive statistics include the numbers, tables, charts, and graphs used to describe, organize, summarize, and present raw data. Descriptive statistics are most

32 Measures of Central Tendency and Dispersion

32 Measures of Central Tendency and Dispersion In this section we discuss two important aspects of data which are its center and its spread. The mean, median, and the mode are measures of central tendency

STRUTS: Statistical Rules of Thumb. Seattle, WA

STRUTS: Statistical Rules of Thumb Gerald van Belle Departments of Environmental Health and Biostatistics University ofwashington Seattle, WA 98195-4691 Steven P. Millard Probability, Statistics and Information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

Simple linear regression

Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

KEANSBURG SCHOOL DISTRICT KEANSBURG HIGH SCHOOL Mathematics Department. HSPA 10 Curriculum. September 2007

KEANSBURG HIGH SCHOOL Mathematics Department HSPA 10 Curriculum September 2007 Written by: Karen Egan Mathematics Supervisor: Ann Gagliardi 7 days Sample and Display Data (Chapter 1 pp. 4-47) Surveys and

Common sense, and the model that we have used, suggest that an increase in p means a decrease in demand, but this is not the only possibility.

Lecture 6: Income and Substitution E ects c 2009 Je rey A. Miron Outline 1. Introduction 2. The Substitution E ect 3. The Income E ect 4. The Sign of the Substitution E ect 5. The Total Change in Demand

Descriptive Statistics. Understanding Data: Categorical Variables. Descriptive Statistics. Dataset: Shellfish Contamination

Descriptive Statistics Understanding Data: Dataset: Shellfish Contamination Location Year Species Species2 Method Metals Cadmium (mg kg - ) Chromium (mg kg - ) Copper (mg kg - ) Lead (mg kg - ) Mercury

In this chapter, you will learn improvement curve concepts and their application to cost and price analysis.

7.0 - Chapter Introduction In this chapter, you will learn improvement curve concepts and their application to cost and price analysis. Basic Improvement Curve Concept. You may have learned about improvement

THE SECOND DERIVATIVE

THE SECOND DERIVATIVE Summary 1. Curvature Concavity and convexity... 2 2. Determining the nature of a static point using the second derivative... 6 3. Absolute Optima... 8 The previous section allowed

Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

Advances in Predictive Models for Data Mining. Se June Hong and Sholom Weiss. To appear in Pattern Recognition Letters Journal

Advances in Predictive Models for Data Mining Se June Hong and Sholom Weiss To appear in Pattern Recognition Letters Journal Advances in Predictive Models for Data Mining Se June Hong and Sholom M. Weiss

MATH 65 NOTEBOOK CERTIFICATIONS

MATH 65 NOTEBOOK CERTIFICATIONS Review Material from Math 60 2.5 4.3 4.4a Chapter #8: Systems of Linear Equations 8.1 8.2 8.3 Chapter #5: Exponents and Polynomials 5.1 5.2a 5.2b 5.3 5.4 5.5 5.6a 5.7a 1

W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer

The Big 50 Revision Guidelines for S1

The Big 50 Revision Guidelines for S1 If you can understand all of these you ll do very well 1. Know what is meant by a statistical model and the Modelling cycle of continuous refinement 2. Understand

San Jose State University Engineering 10 1

KY San Jose State University Engineering 10 1 Select Insert from the main menu Plotting in Excel Select All Chart Types San Jose State University Engineering 10 2 Definition: A chart that consists of multiple

Performance Level Descriptors Grade 6 Mathematics

Performance Level Descriptors Grade 6 Mathematics Multiplying and Dividing with Fractions 6.NS.1-2 Grade 6 Math : Sub-Claim A The student solves problems involving the Major Content for grade/course with