Evaluation: Connectionist and Statistical Language Processing




Overview

Evaluation
Connectionist and Statistical Language Processing
Frank Keller keller@coli.uni-sb.de
Computerlinguistik, Universität des Saarlandes

- training set, validation set, test set
- holdout, stratification
- crossvalidation, leave-one-out
- comparing machine learning algorithms
- comparing against a baseline
- precision and recall
- evaluating numeric prediction

Literature: Witten and Frank (2000: ch. 6), Mitchell (1997: ch. 5).

Training and Test Set

For classification problems, we measure the performance of a model in terms of its error rate: the percentage of incorrectly classified instances in the data set. We build a model because we want to use it to classify new data, so we are chiefly interested in model performance on new (unseen) data.

The resubstitution error (the error rate on the training set) is a bad predictor of performance on new data. The model was built to account for the training data, so it might overfit it, i.e., fail to generalize to unseen data.

We therefore use two data sets: the training set (seen data) to build the model (determine its parameters), and the test set (unseen data) to measure its performance (holding the parameters constant).

Test and Validation Set

Sometimes we also need a validation set to tune the model (e.g., for pruning a decision tree). The validation set can't be used for testing (as it's not unseen). All three data sets have to be representative samples of the data that the model will be applied to.
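Computing the error rate is straightforward; the following is a minimal Python sketch (our own illustration, not code from the slides), using invented toy labels:

```python
def error_rate(predicted, actual):
    """Percentage of incorrectly classified instances."""
    mistakes = sum(1 for p, a in zip(predicted, actual) if p != a)
    return mistakes / len(actual)

gold = ["A", "B", "A", "A", "B"]
pred = ["A", "B", "B", "A", "A"]
print(error_rate(pred, gold))  # 2 mistakes out of 5 -> 0.4
```

Computed on the training set itself, this same quantity is the resubstitution error; computed on a held-out test set, it estimates performance on unseen data.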

Holdout

If a lot of data are available, simply take two independent samples and use one for training and one for testing. The more training data, the better the model; the more test data, the more accurate the error estimate. Problem: obtaining data is often expensive and time consuming. Example: a corpus annotated with word senses, or experimental data on subcat preferences. Solution: obtain a limited data set and use a holdout procedure. The most straightforward approach is a random split into test and training set, typically with between 1/3 and 1/10 of the data held out for testing.

Stratification

Problem: the split into training and test set might be unrepresentative, e.g., a certain class is not represented in the training set, so the model will not learn to classify it. Solution: use stratified holdout, i.e., sample in such a way that each class is represented in both sets. Example: a data set with two classes A and B; to construct a 10% test set, take a 10% sample of all instances of class A plus a 10% sample of all instances of class B. However, this procedure doesn't work well on small data sets.

Crossvalidation

Solution: k-fold crossvalidation maximizes the use of the data:

- Divide the data randomly into k folds (subsets) of equal size.
- Train the model on k - 1 folds, use one fold for testing.
- Repeat this process k times so that all folds are used for testing.
- Compute the average performance on the k test sets.

This effectively uses all the data for both training and testing. Typically k = 10 is used. Sometimes stratified k-fold crossvalidation is used.

Example: a data set with 20 instances, 5-fold crossvalidation:

    i1,  i2,  i3,  i4     (fold 1)
    i5,  i6,  i7,  i8     (fold 2)
    i9,  i10, i11, i12    (fold 3)
    i13, i14, i15, i16    (fold 4)
    i17, i18, i19, i20    (fold 5)

Each fold serves as the test set once, with the remaining folds as the training set; compute the error rate for each fold, then compute the average error rate.
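The k-fold procedure can be sketched in a few lines of Python (a minimal, unstratified illustration; the function names are our own, and `train_and_test` stands in for whatever learner is being evaluated):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly divide instance indices 0..n-1 into k folds of
    (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    return [idx[i::k] for i in range(k)]

def cross_validate(n, train_and_test, k=10):
    """Average error rate over k runs: each fold serves as the test set
    exactly once, the remaining k-1 folds as the training set.
    train_and_test(train_idx, test_idx) must return an error rate."""
    folds = k_fold_indices(n, k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i
                     for j in fold]
        errors.append(train_and_test(train_idx, test_idx))
    return sum(errors) / k
```

With n = 20 and k = 5 as in the slide's example, each test fold has 4 instances and each training set 16. A stratified variant would shuffle and split the indices of each class separately before merging the folds.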

Leave-one-out

Leave-one-out crossvalidation is simply k-fold crossvalidation with k set to n, the number of instances in the data set. This means that the test set consists of only a single instance, which will be classified either correctly or incorrectly.

Advantages: maximal use of training data, i.e., training on n - 1 instances; the procedure is deterministic, with no sampling involved.

Disadvantages: infeasible for large data sets, as a large number of training runs is required, at high computational cost; cannot be stratified (there is only one class in the test set).

Comparing Algorithms

Assume you want to compare the performance of two machine learning algorithms A and B on the same data set. You could use crossvalidation to determine the error rates of A and B and then compute the difference. Problem: sampling is involved (in getting the data and in crossvalidation), hence there is variance in the error rates. Solution: determine if the difference between the error rates is statistically significant. If the crossvalidations of A and B use the same random division of the data (the same folds), then a paired t-test is appropriate.

Let k samples of the error rate of algorithm A be denoted by x_1, ..., x_k and k samples of the error rate of B by y_1, ..., y_k. Then the t value for a paired t-test is:

(1) t = d̄ / sqrt(σ_d² / k)

where d̄ is the mean of the differences x_n − y_n, and σ_d is the standard deviation of these differences.

Comparing against a Baseline

An error rate in itself is not very meaningful: we have to take into account how hard the problem is. This means comparing against a baseline model and showing that our model performs significantly better than the baseline. The simplest model is the chance baseline, which assigns a classification randomly. Example: if we have two classes A and B in our data, and we classify each instance randomly as either A or B, then we will get 50% right just by chance (in the limit). In the general case, the error rate for the chance baseline is 1 − 1/n for an n-way classification.
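The paired t-test statistic for two matched series of error rates can be computed directly (a minimal sketch with invented example values; we use the sample standard deviation of the differences):

```python
import math

def paired_t(xs, ys):
    """t statistic for a paired t-test on k matched error rates:
    t = mean(d) / sqrt(var(d) / k), with d_n = x_n - y_n."""
    k = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    d_bar = sum(d) / k
    # sample variance of the differences (divide by k - 1)
    var = sum((di - d_bar) ** 2 for di in d) / (k - 1)
    return d_bar / math.sqrt(var / k)

# error rates of A and B on the same four folds (invented numbers)
t = paired_t([0.20, 0.25, 0.30, 0.35], [0.15, 0.25, 0.20, 0.30])
```

The resulting t value is then compared against the t distribution with k − 1 degrees of freedom to decide significance.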

Comparing against a Baseline

Problem: a chance baseline is not useful if the distribution of the data is skewed. We need to compare against a frequency baseline instead. A frequency baseline always assigns the most frequent class; its error rate is 1 − f_max, where f_max is the percentage of instances in the data that belong to the most frequent class. Example: determining when a cow is in oestrus (classes: yes, no). Chance baseline: 50%; frequency baseline: 3% (1 out of 30 days). A more realistic example: assigning part-of-speech tags to English text. A tagger that assigns to each word its most frequent tag gets a 20% error rate; the current state of the art is 4%.

Confusion Matrix

Assume a two-way classification. Four classification outcomes are possible, which can be displayed in a confusion matrix:

                         predicted class
                         yes               no
    actual class   yes   true positive     false negative
                   no    false positive    true negative

True positives (TP): class members classified as class members
True negatives (TN): class non-members classified as non-members
False positives (FP): class non-members classified as class members
False negatives (FN): class members classified as class non-members

Precision and Recall

The error rate is an inadequate measure of the performance of an algorithm: it doesn't take into account the cost of making wrong decisions. Example: based on a chemical analysis of the water, try to detect an oil slick in the sea. A false positive means wrongly identifying an oil slick where there is none; a false negative means failing to identify an oil slick where there is one. Here, false negatives (environmental disasters) are much more costly than false positives (false alarms). We have to take that into account when we evaluate our model.

Measures commonly used in information retrieval, based on true positives, false positives, and false negatives:

Precision: the number of class members classified correctly over the total number of instances classified as class members.

(2) Precision = TP / (TP + FP)

Recall: the number of class members classified correctly over the total number of class members.

(3) Recall = TP / (TP + FN)
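The confusion-matrix counts translate directly into these measures (a short sketch with invented counts; the F-measure combining precision and recall is included for completeness):

```python
def precision(tp, fp):
    """Correctly classified class members / all instances classified
    as class members."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correctly classified class members / all actual class members."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# invented counts: 8 hits, 2 false alarms, 4 misses
print(precision(8, 2))     # -> 0.8
print(recall(8, 4))        # -> 0.666...
print(f_measure(8, 2, 4))  # -> 0.727...
```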

Precision and Recall

Precision and recall can be combined in the F-measure:

(4) F = (2 · recall · precision) / (recall + precision) = 2TP / (2TP + FP + FN)

Example: for the oil slick scenario, we want to maximize recall (avoiding environmental disasters); maximizing precision (avoiding false alarms) is less important. The F-measure can be used if precision and recall are equally important.

Evaluating Numeric Prediction

Error rate, precision, recall, and F-measure are mainly used for classification tasks. They are less suitable for tasks where a numeric quantity has to be predicted. Example: predicting the frequency of subcategorization frames, where we want to measure how close the model is to the actual frequencies.

Let p_1, ..., p_n be the predicted values for instances 1, ..., n, and a_1, ..., a_n the actual values for instances 1, ..., n. There are several measures that compare the a_i and p_i.

Mean Squared Error

Mean squared error measures the mean squared difference between actual and predicted values. It is used, e.g., for connectionist nets:

(5) MSE = (1/n) Σ_{i=1}^{n} (a_i − p_i)²

Often the root mean squared error is used instead:

(6) RMSE = sqrt((1/n) Σ_{i=1}^{n} (a_i − p_i)²)

Advantage: it has the same dimension as the original quantity.

Summary

No matter how the performance of the model is measured (precision, recall, MSE, correlation), we always need to measure on the test set, not on the training set. Performance on the training set only tells us that the model learns what it's supposed to learn; it is not a good indicator of performance on unseen data. The test set can be obtained using an independent sample or holdout techniques (crossvalidation, leave-one-out). To meaningfully compare the performance of two algorithms on a given type of data, we need to determine whether a difference in performance is significant. We also need to compare performance against a baseline (chance or frequency).
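As a closing illustration, the numeric-prediction measures from equations (5) and (6) fit in a few lines (our own sketch with invented values):

```python
import math

def mse(actual, predicted):
    """Mean squared error: (1/n) * sum of (a_i - p_i)^2."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

def rmse(actual, predicted):
    """Root mean squared error; same dimension as the original quantity."""
    return math.sqrt(mse(actual, predicted))

# invented frequencies: actual vs. predicted
print(mse([3, 5, 2], [2, 5, 4]))   # squared errors 1, 0, 4 -> 5/3
print(rmse([3, 5, 2], [2, 5, 4]))  # sqrt(5/3)
```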

References

Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.

Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Morgan Kaufmann.