Lecture 13: Validation




Lecture 13: Validation
- Motivation
- The holdout method
- Re-sampling techniques
- Three-way data splits

Motivation
- Validation techniques are motivated by two fundamental problems in pattern recognition: model selection and performance estimation
- Model selection
  - Almost invariably, all pattern recognition techniques have one or more free parameters
    - The number of neighbors in a kNN classification rule
    - The network size, learning parameters and weights in MLPs
  - How do we select the optimal parameter(s) or model for a given classification problem?
- Performance estimation
  - Once we have chosen a model, how do we estimate its performance?
  - Performance is typically measured by the TRUE ERROR RATE, the classifier's error rate on the ENTIRE POPULATION

Motivation
- If we had access to an unlimited number of examples, these questions would have a straightforward answer
  - Choose the model that provides the lowest error rate on the entire population; of course, that error rate is the true error rate
- In real applications we only have access to a finite set of examples, usually smaller than we would like
  - One approach is to use the entire training data to select our classifier and estimate the error rate
  - This naïve approach has two fundamental problems
    - The final model will normally overfit the training data; this problem is more pronounced with models that have a large number of parameters
    - The error rate estimate will be overly optimistic (lower than the true error rate); in fact, it is not uncommon to reach 100% correct classification on the training data
  - A much better approach is to split the training data into disjoint subsets: the holdout method

The holdout method
- Split the dataset into two groups
  - Training set: used to train the classifier
  - Test set: used to estimate the error rate of the trained classifier
  [Figure: the total number of examples is divided into a training set and a test set]
- A typical application of the holdout method is determining a stopping point for the backpropagation algorithm
  [Figure: MSE vs. epochs; the test-set error reaches a minimum (the stopping point) while the training-set error keeps decreasing]
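The split above can be sketched in a few lines of Python; the function name and the 30% test fraction are illustrative choices, not part of the lecture:

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Split a dataset into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled examples form the test set, the rest train
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train, test = holdout_split(data)
assert set(train) | set(test) == set(data)   # every example is used once
assert not (set(train) & set(test))          # the two sets are disjoint
```

Shuffling before splitting matters: if the data is ordered (e.g. by class), a naive head/tail split would give an unrepresentative test set.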

The holdout method
- The holdout method has two basic drawbacks
  - In problems where we have a sparse dataset, we may not be able to afford the luxury of setting aside a portion of the dataset for testing
  - Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an unfortunate split
- The limitations of the holdout can be overcome, at the expense of more computation, with a family of resampling methods
  - Cross-validation
    - Random subsampling
    - K-fold cross-validation
    - Leave-one-out cross-validation
  - Bootstrap

Random subsampling
- Random subsampling performs K data splits of the dataset
  - Each split randomly selects a (fixed) number of test examples without replacement
  - For each data split we retrain the classifier from scratch with the training examples and estimate the error E_i with the test examples
  [Figure: K experiments; in each, a different random subset of the examples is held out as the test examples]
- The true error estimate is obtained as the average of the separate estimates E_i
  - This estimate is significantly better than the holdout estimate

    E = (1/K) · Σ_{i=1..K} E_i
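A minimal sketch of random subsampling, assuming two hypothetical callables not defined in the lecture: `train_classifier(train)` returns a fitted model, and `error_rate(model, test)` returns its error on the held-out examples:

```python
import random

def random_subsampling_error(data, train_classifier, error_rate,
                             n_splits=3, test_fraction=0.3, seed=0):
    """Average the test-set error over K independent random splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)                  # a fresh random split each time
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = train_classifier(train)        # retrain from scratch
        errors.append(error_rate(model, test)) # one estimate E_i per split
    return sum(errors) / len(errors)           # E = (1/K) * sum of E_i
```

Note that, unlike K-fold cross-validation below, the K test sets here may overlap, and some examples may never appear in any test set.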

K-fold cross-validation
- Create a K-fold partition of the dataset
  - For each of K experiments, use K−1 folds for training and the remaining one for testing
  [Figure: K experiments; in each, a different fold serves as the test examples]
- K-fold cross-validation is similar to random subsampling
  - The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing
- As before, the true error is estimated as the average error rate

    E = (1/K) · Σ_{i=1..K} E_i
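The fold construction and the averaging loop can be sketched as follows, with the same hypothetical `train_classifier`/`error_rate` callables as before:

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def kfold_cv_error(data, train_classifier, error_rate, k=4):
    """Each fold serves once as the test set; average the K estimates E_i."""
    errors = []
    for test_idx in kfold_indices(len(data), k):
        held_out = set(test_idx)
        test = [data[i] for i in test_idx]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        errors.append(error_rate(train_classifier(train), test))
    return sum(errors) / k                       # E = (1/K) * sum of E_i
```

In practice the data should be shuffled once before building the folds; that step is omitted here to keep the partition logic visible.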

Leave-one-out cross-validation
- Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples
  - For a dataset with N examples, perform N experiments
  - For each experiment, use N−1 examples for training and the remaining single example for testing
  [Figure: N experiments; in each, a single example serves as the test example]
- As usual, the true error is estimated as the average error rate on the test examples

    E = (1/N) · Σ_{i=1..N} E_i
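Because each test set is a single example, leave-one-out needs no fold bookkeeping at all; a sketch, again with hypothetical `train_classifier`/`error_rate` callables:

```python
def loocv_error(data, train_classifier, error_rate):
    """Leave-one-out: N experiments, each holding out a single example."""
    n = len(data)
    errors = []
    for i in range(n):
        train = data[:i] + data[i + 1:]   # all examples except the i-th
        test = [data[i]]                  # the single held-out example
        errors.append(error_rate(train_classifier(train), test))
    return sum(errors) / n                # E = (1/N) * sum of E_i
```

Note the cost: the classifier is retrained N times, which is why leave-one-out is usually reserved for small datasets.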

How many folds are needed?
- With a large number of folds
  (+) The bias of the true error rate estimator will be small (the estimator will be very accurate)
  (-) The variance of the true error rate estimator will be large
  (-) The computation time will be very large as well (many experiments)
- With a small number of folds
  (+) The number of experiments and, therefore, the computation time are reduced
  (+) The variance of the estimator will be small
  (-) The bias of the estimator will be large (conservative, i.e. higher than the true error rate)
- In practice, the choice of the number of folds depends on the size of the dataset
  - For large datasets, even 3-fold cross-validation will be quite accurate
  - For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible
- A common choice for K-fold cross-validation is K = 10

Three-way data splits
- If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets
  - Training set: a set of examples used for learning, i.e. to fit the parameters of the classifier
    - In the MLP case, we would use the training set to find the optimal weights with the backprop rule
  - Validation set: a set of examples used to tune the parameters of a classifier
    - In the MLP case, we would use the validation set to find the optimal number of hidden units or determine a stopping point for the backpropagation algorithm
  - Test set: a set of examples used only to assess the performance of a fully-trained classifier
    - In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights)
- After assessing the final model with the test set, YOU MUST NOT further tune the model

Three-way data splits
- Why separate test and validation sets?
  - The error rate estimate of the final model on validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model
  - After assessing the final model with the test set, YOU MUST NOT tune the model any further
- Procedure outline
  1. Divide the available data into training, validation and test sets
  2. Select an architecture and training parameters
  3. Train the model using the training set
  4. Evaluate the model using the validation set
  5. Repeat steps 2 through 4 using different architectures and training parameters
  6. Select the best model and train it using data from the training and validation sets
  7. Assess this final model using the test set
- This outline assumes a holdout method
  - If cross-validation or bootstrap are used, steps 3 and 4 have to be repeated for each of the K folds
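The seven-step outline can be sketched end to end; the split fractions, the `fit(model_spec, data)` helper, and the representation of candidate models are all illustrative assumptions, not part of the lecture:

```python
import random

def three_way_split(data, frac_val=0.2, frac_test=0.2, seed=0):
    """Step 1: divide the data into disjoint training, validation, test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * frac_test)
    n_val = int(len(shuffled) * frac_val)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def select_and_assess(data, candidate_models, fit, error_rate):
    """Steps 2-7: pick the candidate with the lowest validation error,
    retrain it on training + validation data, assess once on the test set."""
    train, val, test = three_way_split(data)
    best = min(candidate_models,                      # steps 2-5: the search
               key=lambda m: error_rate(fit(m, train), val))
    final = fit(best, train + val)                    # step 6: retrain
    return best, error_rate(final, test)              # step 7: final estimate
```

The key discipline is that `test` is touched exactly once, on the last line; everything before it may only see the training and validation sets.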

Three-way data splits
[Figure: model selection diagram. Each candidate model (Model 1 ... Model 4) is trained on the training set and its error (Error 1 ... Error 4) is measured on the validation set; the model with the minimum validation error is selected as the final model, whose final error is then measured on the test set]