Evaluation in Machine Learning (1)
14s1: COMP9417 Machine Learning and Data Mining
School of Computer Science and Engineering, University of New South Wales
March 19, 2014

Acknowledgements

Material derived from slides for the book "Machine Learning" by T. Mitchell, McGraw-Hill (1997); from slides by Eibe Frank; and from slides for the book "Machine Learning" by P. Flach, Cambridge University Press (2012).

1. Aims

This lecture will enable you to apply standard measures and graphical methods for the evaluation of machine learning algorithms. Following it you should be able to:

- define sample error and true error
- describe the problem of estimating hypothesis accuracy or error
- define and calculate typically-used evaluation measures, such as accuracy, precision, recall, etc.
- define and calculate typically-used compound measures and their visualisations, such as lift charts, coverage plots, ROC curves, F-measure, AUC, etc.

"In theory, there is no difference between theory and practice. But in practice, there is."

2. Introduction

Evaluation in machine learning

- How well does a hypothesis generalize beyond the training set?
- We need to estimate off-training-set error.
- What is the probable error in this estimate?
- If one hypothesis is more accurate than another on a data set, how probable is this difference in general?

Estimating Hypothesis Accuracy: Two Definitions of Error

The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:

    \mathrm{error}_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))

where \delta(f(x) \neq h(x)) is 1 if f(x) \neq h(x), and 0 otherwise (cf. 0-1 loss).

The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    \mathrm{error}_D(h) \equiv \Pr_{x \sim D}[f(x) \neq h(x)]

Question: how well does error_S(h) estimate error_D(h)?

Estimators

Experiment:
1. choose sample S of size n according to distribution D
2. measure error_S(h)

error_S(h) is a random variable (i.e., the result of an experiment), and it is an unbiased estimator for error_D(h). Given an observed error_S(h), what can we conclude about error_D(h)?

Problems Estimating Error

1. Bias: if S is the training set, error_S(h) is optimistically biased:

    \mathrm{bias} \equiv E[\mathrm{error}_S(h)] - \mathrm{error}_D(h)

For an unbiased estimate, h and S must be chosen independently.

2. Variance: even with a selection of S that gives an unbiased estimate, error_S(h) may still vary from error_D(h).
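As a concrete illustration of the sample error definition above, here is a minimal Python sketch of error_S(h) as an average 0-1 loss; the hypothesis, target function and sample are hypothetical, not from the lecture:

```python
# Sample error: the proportion of examples in S that hypothesis h
# misclassifies relative to the target function f (average 0-1 loss).
def sample_error(h, f, S):
    return sum(1 for x in S if h(x) != f(x)) / len(S)

# Hypothetical example: h disagrees with f on 1 of 4 instances.
f = lambda x: x >= 0   # true labelling function
h = lambda x: x >= 1   # learned hypothesis
S = [-2, 0, 1, 3]
print(sample_error(h, f, S))  # 0.25
```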

Problems Estimating Error

Note: estimation bias should not be confused with inductive bias: the former is a numerical quantity [from statistics], the latter is a set of assertions [from concept learning]. More on this in a later lecture.

Making the most of the data

- Once evaluation is complete, all the data can be used to build the final classifier.
- Generally, the larger the training data the better the classifier (but returns diminish).
- The larger the test data, the more accurate the error estimate.
- Holdout procedure: a method of splitting the original data into a training and a test set.
- Dilemma: ideally we want both a large training set and a large test set.

Loss functions

- The most common performance measure is predictive accuracy (cf. sample error), based on the 0-1 loss function: 0 if the prediction is correct, 1 if it is incorrect.
- But classifiers can also produce class probabilities. What is the accuracy of the probability estimates? 0-1 loss is not appropriate here.

Quadratic loss function

Let p_1, ..., p_k be the probability estimates of all possible outcomes (classes) for an instance, and let c be the index of the instance's actual class; i.e., a_1, ..., a_k are zero, except for a_c, which is 1. The quadratic loss is

    \sum_j (p_j - a_j)^2 = \sum_{j \neq c} p_j^2 + (1 - p_c)^2

Minimising the expected quadratic loss E[\sum_j (p_j - a_j)^2] leads to a preference for predictors giving the best guess at the true probabilities.
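A minimal Python sketch of the quadratic loss for a single instance; the probability vectors are made up for illustration:

```python
# Quadratic loss for one instance: p is the vector of predicted class
# probabilities, c is the index of the actual class (one-hot target a).
def quadratic_loss(p, c):
    a = [1.0 if j == c else 0.0 for j in range(len(p))]
    return sum((pj - aj) ** 2 for pj, aj in zip(p, a))

print(quadratic_loss([0.7, 0.2, 0.1], 0))  # confident and correct: ~0.14
print(quadratic_loss([0.1, 0.2, 0.7], 0))  # confident and wrong:   ~1.34
```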

Informational loss function

The informational loss function is -\log_2(p_c), where c is the index of the actual class of an instance: the number of bits required to communicate the actual class. If p^*_1, ..., p^*_k are the true class probabilities, then the expected value of the informational loss function is

    -p^*_1 \log_2(p_1) - \ldots - p^*_k \log_2(p_k)

which is minimised for p_j = p^*_j, giving the entropy of the true distribution:

    -p^*_1 \log_2(p^*_1) - \ldots - p^*_k \log_2(p^*_k)

Which loss function?

- Quadratic loss takes into account all the class probability estimates for an instance; informational loss focuses only on the probability estimate for the actual class.
- Quadratic loss is bounded by 1 + \sum_j p_j^2, so it can never exceed 2; informational loss can be infinite.
- Informational loss is related to the MDL principle (bits can be used to measure complexity as well as accuracy).

Costs of predictions

In practice, different types of classification errors often incur different costs. Examples:

- Medical diagnosis (has cancer vs. not)
- Loan decisions
- Fault diagnosis
- Promotional mailing

Confusion matrix

Two-class prediction case:

                Predicted Yes         Predicted No
    Actual Yes  True Positive (TP)    False Negative (FN)
    Actual No   False Positive (FP)   True Negative (TN)

Two kinds of error: False Positives and False Negatives may have different costs. Two kinds of correct prediction: True Positives and True Negatives may have different benefits. Note: the total number of test set examples is N = TP + FN + FP + TN.
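A minimal Python sketch of the informational loss for one instance; again, the probabilities are illustrative:

```python
import math

# Informational loss: -log2 of the probability assigned to the actual
# class c; it grows without bound as p[c] approaches 0.
def informational_loss(p, c):
    return -math.log2(p[c])

print(informational_loss([0.5, 0.25, 0.25], 0))  # 1.0 bit
print(informational_loss([0.01, 0.99], 0))       # ~6.64 bits: heavy penalty
```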

Accuracy

    Accuracy = (TP + TN) / N

Error rate, equivalent to 1 - Accuracy:

    Error rate = (FP + FN) / N

Precision (also called: Correctness, Positive Predictive Value):

    Precision = TP / (TP + FP)

Recall (also called: TP rate, Hit rate, Sensitivity, Completeness):

    Recall = TP / (TP + FN)

Sensitivity, the True Positive (TP) Rate:

    Sensitivity = TP / (TP + FN)

Specificity (also called: TN rate), equivalent to 1 - FP rate:

    Specificity = TN / (TN + FP)

False Positive (FP) Rate (also called: False alarm rate), equivalent to 1 - Specificity:

    FP rate = FP / (FP + TN)
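All of these follow directly from the confusion matrix counts; a minimal Python sketch with hypothetical counts (the same counts reappear in Example 2.1 below):

```python
# Evaluation measures from a two-class confusion matrix.
# Hypothetical counts: TP=60, FN=15, FP=10, TN=15.
TP, FN, FP, TN = 60, 15, 10, 15
N = TP + FN + FP + TN

accuracy    = (TP + TN) / N    # 0.75
error_rate  = (FP + FN) / N    # 0.25 = 1 - accuracy
precision   = TP / (TP + FP)   # positive predictive value, ~0.857
recall      = TP / (TP + FN)   # sensitivity / TP rate, 0.80
specificity = TN / (TN + FP)   # TN rate, 0.60
fp_rate     = FP / (FP + TN)   # false alarm rate, 0.40 = 1 - specificity
```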

Negative Predictive Value

    NPV = TN / (TN + FN)

Coverage

    Coverage = (TP + FP) / N

Note: this is not an exhaustive list; the same measures are used under different names in different disciplines. E.g., in concept learning, with the contingency table

                Predicted Yes   Predicted No
    Actual Yes  TP              FN
    Actual No   FP              TN

the number of instances in a sample predicted to be in (resp. not in) the concept is the sum of the first (resp. second) column, and the number of positive (resp. negative) examples of the concept in a sample is the sum of the first (resp. second) row:

    N_pred = TP + FP      N_not_pred = FN + TN
    N_pos  = TP + FN      N_neg      = FP + TN

We can treat the evaluation measures as conditional probabilities:

    P(pred | pos)     = TP / (TP + FN)   (Sensitivity)
    P(pred | neg)     = FP / (FP + TN)   (FP rate)
    P(not_pred | pos) = FN / (TP + FN)   (FN rate)
    P(not_pred | neg) = TN / (FP + TN)   (Specificity)
    P(pos | pred)     = TP / (TP + FP)   (Pos. Pred. Value)
    P(neg | pred)     = FP / (TP + FP)
    P(pos | not_pred) = FN / (FN + TN)
    P(neg | not_pred) = TN / (FN + TN)   (Neg. Pred. Value)

Trade-off

- Good coverage of positive examples: increase TP at the risk of increasing FP, i.e., increase generality.
- Good proportion of positive examples: decrease FP at the risk of decreasing TP, i.e., decrease generality (increase specificity).

Different techniques give different trade-offs and can be plotted as different lines on any of the graphical charts: lift, ROC or recall-precision curves. The sketch below illustrates the trade-off on a single classifier by varying a score threshold.
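A minimal Python sketch, with made-up scores and labels, of how moving a decision threshold trades TP against FP (the threshold mechanism is our illustration; the lecture discusses the trade-off across techniques generally):

```python
# Lowering the score threshold increases generality: TP rises,
# but FP tends to rise with it. Scores and labels are hypothetical.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]   # 1 = pos, 0 = neg

for t in (0.75, 0.5, 0.25):
    preds = [s >= t for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p and y == 0)
    print(f"threshold={t}: TP={tp}, FP={fp}")
# threshold=0.75: TP=2, FP=0
# threshold=0.5:  TP=3, FP=1
# threshold=0.25: TP=4, FP=2
```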

F-measure combines precision and recall

From Information Retrieval: K. van Rijsbergen (1979).

    F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

This combines precision and recall in a single measure, giving equal weight to both (there are variants that weight each component differently). Note the inverse relation of precision and recall: one can change a classification rule to increase precision (positive predictive value) by reducing the number of data points classified, hence reducing recall (sensitivity), and vice versa. Note also that both measures ignore TN.

Lift charts

In practice, costs are rarely known precisely; instead, decisions are often made by comparing possible scenarios. Lift comes from market research, where a typical goal is to identify a profitable target sub-group out of the total population.

Example: a promotional mailout to a population of 1,000,000 potential respondents.

- Baseline: 0.1% of all households in the total population will respond (1000).
- Situation 1: classifier 1 identifies a target sub-group of the 100,000 most promising households, of which 0.4% will respond (400).
- Situation 2: classifier 2 identifies a target sub-group of the 400,000 most promising households, of which 0.2% will respond (800).

    \mathrm{Lift} = \frac{\text{response rate of target sub-group}}{\text{response rate of total population}}

Situation 1 gives a lift of 0.4/0.1 = 4; situation 2 gives a lift of 0.2/0.1 = 2. Note that which situation is more profitable depends on cost estimates. A lift chart allows for a visual comparison. Typically, lift is used to see how well a classifier is doing compared to a base level (e.g., guessing the most common class, as in ZeroR).

[Figure: Hypothetical Lift Chart]
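The lift calculation itself is just a ratio of response rates; a minimal Python sketch using the mailout numbers above (the function name is ours, not from the lecture):

```python
# Lift of a targeted sub-group over the baseline response rate.
def lift(subgroup_responders, subgroup_size,
         population_responders, population_size):
    subgroup_rate = subgroup_responders / subgroup_size
    baseline_rate = population_responders / population_size
    return subgroup_rate / baseline_rate

print(lift(400, 100_000, 1000, 1_000_000))  # classifier 1: ~4.0
print(lift(800, 400_000, 1000, 1_000_000))  # classifier 2: ~2.0
```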

Generating a lift chart

Instances are sorted according to their predicted probability of being a true positive:

    Rank   Predicted probability   Actual class
    1      ...                     Yes
    2      ...                     Yes
    3      ...                     No
    4      ...                     Yes

In a lift chart, the x axis is the sample size and the y axis is the number of true positives.

ROC curves

ROC curves are related to lift charts. ROC stands for "receiver operating characteristic", a term from signal detection, where such curves show the trade-off between hit rate and false alarm rate over a noisy channel. Differences to a lift chart:

- the y axis shows the percentage of true positives in the sample (rather than the absolute number);
- the x axis shows the percentage of false positives in the sample (rather than the sample size).

[Figure: A sample ROC curve]

Applications:

- if the classifier predicts the positive class with an attached value denoting the strength of the prediction, for example:
  - the probability that the instance is in the target class, e.g., logistic regression, Bayesian classifiers;
  - the margin by which the instance is in the target class, e.g., ensemble learners, support vector machines;
- if classifiers are learned by the same method with a range of different parameter settings; each outcome (e.g., accuracy) is a point on the ROC curve.
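A minimal Python sketch of how ROC points can be generated from ranked predictions: sort by decreasing score, then sweep the threshold down one instance at a time (score ties are ignored for simplicity; the scores and labels are made up):

```python
# Build (fpr, tpr) points by sweeping a threshold over ranked predictions.
def roc_points(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    P = sum(labels)        # number of positives
    N = len(labels) - P    # number of negatives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:    # admit one more instance per step
        if y: tp += 1
        else: fp += 1
        points.append((fp / N, tp / P))
    return points

print(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))
# points, abbreviated: (0,0) -> (0,.33) -> (0,.67) -> (1,.67) -> (1,1)
```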

Area under the ROC Curve (AUC)

Applicable for any learning method that can generate an ROC curve. Method:

- take a test set of P positive and N negative examples;
- sort the examples in descending order of strength of being in the positive class, as predicted by the classifier;
- for each positive example p_i, count the number of negative examples ranked strictly below it (count 1/2 if a negative example is tied with a positive); call this c_i;
- sum all these counts, C = \sum_{i=1}^{P} c_i; then

    \mathrm{AUC} = \frac{C}{P \cdot N}

How should we understand AUC? There is a natural probabilistic interpretation of this measure: the probability that the classifier will rank a randomly-drawn positive example higher than a randomly-drawn negative one. The AUC will be 1 if all positive examples are ranked above all negative examples (compare with the ROC curve diagram). This is a version of a non-parametric statistical procedure (see the Mann-Whitney U-test, or the Wilcoxon rank-sum test).

Numeric prediction evaluation measures

Based on the differences between predicted (p_i) and actual (a_i) values on a test set of n examples:

Mean squared error:

    \frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}

Root mean squared error:

    \sqrt{\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}}

Mean absolute error:

    \frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{n}

Relative absolute error:

    \frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{|a_1 - \bar{a}| + \ldots + |a_n - \bar{a}|}, \quad \text{where } \bar{a} = \frac{1}{n} \sum_i a_i

plus others; see, e.g., Weka.
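The pair-counting method for AUC translates directly into code; a minimal Python sketch with hypothetical scores and labels:

```python
# AUC by counting, for each positive, the negatives ranked strictly
# below it (1/2 credit for a score tie), then dividing by P * N.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    C = sum(1.0 if p > n else 0.5 if p == n else 0.0
            for p in pos for n in neg)
    return C / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))  # 2 of 3 pairs correct: ~0.67
```

The numeric prediction measures are equally direct; another sketch under the same caveats:

```python
import math

# MSE, RMSE, MAE and RAE from predicted values p and actual values a.
def regression_errors(p, a):
    n = len(a)
    mse   = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
    rmse  = math.sqrt(mse)
    mae   = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n
    a_bar = sum(a) / n
    rae   = (sum(abs(pi - ai) for pi, ai in zip(p, a))
             / sum(abs(ai - a_bar) for ai in a))
    return mse, rmse, mae, rae
```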

Table 2.1, p.52: Predictive machine learning scenarios

- Classification: label space L = C, output space Y = C; learn an approximation ĉ : X → C to the true labelling function c.
- Scoring and ranking: label space L = C, output space Y = R^|C|; learn a model that outputs a score vector over classes.
- Probability estimation: label space L = C, output space Y = [0,1]^|C|; learn a model that outputs a probability vector over classes.
- Regression: label space L = R, output space Y = R; learn an approximation f̂ : X → R to the true labelling function f.

Classification

A classifier is a mapping ĉ : X → C, where C = {C_1, C_2, ..., C_k} is a finite and usually small set of class labels. We will sometimes also use C_i to indicate the set of examples of that class. We use the "hat" to indicate that ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance (sometimes contaminated by noise). Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible, and not just on the training set, but ideally on the entire instance space X.

Figure 2.1, p.53: A decision tree

[Figure: (left) a feature tree over the features 'Viagra' and 'lottery', with the training set class distribution (spam/ham counts) in the leaves; (right) the decision tree obtained using the majority class decision rule.]

Table 2.2, p.54: Contingency table

[Table: (left) a two-class contingency table or confusion matrix depicting the performance of the decision tree in Figure 2.1; numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns prediction errors. (right) a contingency table with the same marginals but independent rows and columns.]

Example 2.1, p.56: Accuracy as a weighted average

Suppose a classifier's predictions on a test set of 100 examples are as in the following contingency table:

                Predicted +   Predicted -   Total
    Actual +    60            15            75
    Actual -    10            15            25
    Total       70            30            100

From this table, we see that the true positive rate is tpr = 60/75 = 0.80 and the true negative rate is tnr = 15/25 = 0.60. The overall accuracy is acc = (60 + 15)/100 = 0.75, which is no longer the average of the true positive and negative rates. However, taking into account the proportion of positives pos = 0.75 and the proportion of negatives neg = 1 - pos = 0.25, we see that

    acc = pos · tpr + neg · tnr

(A numeric check of this identity appears after the tables below.)

Table 2.3, p.57: Performance measures I

- number of positives: Pos = Σ_{x∈Te} I[c(x) = ⊕]
- number of negatives: Neg = Σ_{x∈Te} I[c(x) = ⊖], equal to |Te| - Pos
- number of true positives: TP = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕]
- number of true negatives: TN = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊖]
- number of false positives: FP = Σ_{x∈Te} I[ĉ(x) = ⊕, c(x) = ⊖], equal to Neg - TN
- number of false negatives: FN = Σ_{x∈Te} I[ĉ(x) = ⊖, c(x) = ⊕], equal to Pos - TP
- proportion of positives: pos = (1/|Te|) Σ_{x∈Te} I[c(x) = ⊕], equal to Pos/|Te|; estimates P(c(x) = ⊕)
- proportion of negatives: neg = (1/|Te|) Σ_{x∈Te} I[c(x) = ⊖], equal to 1 - pos; estimates P(c(x) = ⊖)
- class ratio: clr = pos/neg, equal to Pos/Neg
- accuracy (*): acc = (1/|Te|) Σ_{x∈Te} I[ĉ(x) = c(x)]; estimates P(ĉ(x) = c(x))
- error rate (*): err = (1/|Te|) Σ_{x∈Te} I[ĉ(x) ≠ c(x)], equal to 1 - acc; estimates P(ĉ(x) ≠ c(x))

Table 2.3, p.57: Performance measures II

- true positive rate (sensitivity, recall): tpr = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕] / Σ_{x∈Te} I[c(x) = ⊕], equal to TP/Pos; estimates P(ĉ(x) = ⊕ | c(x) = ⊕)
- true negative rate (specificity): tnr = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊖] / Σ_{x∈Te} I[c(x) = ⊖], equal to TN/Neg; estimates P(ĉ(x) = ⊖ | c(x) = ⊖)
- false positive rate (false alarm rate): fpr = Σ_{x∈Te} I[ĉ(x) = ⊕, c(x) = ⊖] / Σ_{x∈Te} I[c(x) = ⊖], equal to FP/Neg = 1 - tnr; estimates P(ĉ(x) = ⊕ | c(x) = ⊖)
- false negative rate: fnr = Σ_{x∈Te} I[ĉ(x) = ⊖, c(x) = ⊕] / Σ_{x∈Te} I[c(x) = ⊕], equal to FN/Pos = 1 - tpr; estimates P(ĉ(x) = ⊖ | c(x) = ⊕)
- precision (confidence): prec = Σ_{x∈Te} I[ĉ(x) = c(x) = ⊕] / Σ_{x∈Te} I[ĉ(x) = ⊕], equal to TP/(TP + FP); estimates P(c(x) = ⊕ | ĉ(x) = ⊕)

A summary of different quantities and evaluation measures for classifiers on a test set Te. Symbols starting with a capital letter denote absolute frequencies (counts), while lower-case symbols denote relative frequencies or ratios. All except those indicated with (*) are defined only for binary classification.

Figure 2.2, p.58: A coverage plot

[Figure: (left) a coverage plot (true positive counts against false positive counts) depicting the two contingency tables in Table 2.2; the plot is square because the class distribution is uniform. (right) a coverage plot for Example 2.1, with a class ratio clr = 3.]
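A quick numeric check of the weighted-average identity from Example 2.1, as a Python sketch:

```python
# Verify acc = pos * tpr + neg * tnr on the counts of Example 2.1.
TP, FN, FP, TN = 60, 15, 10, 15
Pos, Neg = TP + FN, FP + TN
tpr, tnr = TP / Pos, TN / Neg    # 0.80, 0.60
pos = Pos / (Pos + Neg)          # 0.75
acc = (TP + TN) / (Pos + Neg)    # 0.75
assert abs(acc - (pos * tpr + (1 - pos) * tnr)) < 1e-12
```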

Figure 2.3, p.59: An ROC plot

[Figure: (left) a coverage plot in which C1 and C3 both dominate C2, but neither dominates the other; the diagonal line indicates that C1 and C3 achieve equal accuracy. (right) the same plot with normalised axes (true positive rate against false positive rate); it can be interpreted as a merger of the two coverage plots in Figure 2.2, employing normalisation to deal with the different class distributions. The diagonal line now indicates that C1 and C3 have the same average recall.]

Important points to remember

- In a coverage plot, classifiers with the same accuracy are connected by line segments with slope 1.
- In an ROC plot (i.e., a normalised coverage plot), line segments with slope 1 connect classifiers with the same average recall.

Figure 2.4, p.61: Comparing coverage and ROC plots

[Figure: (left) in a coverage plot, accuracy isometrics have a slope of 1, and average recall isometrics are parallel to the ascending diagonal. (right) in the corresponding ROC plot, average recall isometrics have a slope of 1; the accuracy isometric here has a slope of 3, corresponding to the ratio of negatives to positives in the data set.]

5. Summary

- Evaluation for machine learning and data mining is a complex issue.
- Some commonly used methods have been found to work well in practice: 10-fold cross-validation and the corrected resampled t-test (Weka).
- Issues to be aware of: multiple testing (from a large set of hypotheses, some will appear good at random); does the off-training-set distribution match that of the training set?

A cautionary tale...

In a WW2 survey carried out by RAF Bomber Command, all bombers returning from bombing raids over Germany over a particular period were inspected. All damage inflicted by German air defences was noted, and an initial recommendation was given that armour be added in the most heavily damaged areas. Further analysis instead made the surprising and counter-intuitive recommendation that the armour be placed in the areas which were completely untouched by damage. The reasoning was that the survey was biased, since it only included aircraft that successfully came back from Germany. The untouched areas were probably vital areas which, if hit, would result in the loss of the aircraft.
