L13: cross-validation




- Resampling methods
  - Cross-validation
  - The bootstrap
- Bias and variance estimation with the bootstrap
- Three-way data partitioning

CSCE 666 Pattern Analysis, Ricardo Gutierrez-Osuna, CSE@TAMU

Introduction
- Almost invariably, the pattern recognition techniques we have introduced have one or more free parameters:
  - The number of neighbors in a kNN classifier
  - The bandwidth of the kernel function in kernel density estimation
  - The number of features to preserve in a subset selection problem
- Two issues arise at this point:
  - Model selection: how do we select the optimal parameter(s) for a given classification problem?
  - Validation: once we have chosen a model, how do we estimate its true error rate?
    - The true error rate is the classifier's error rate when tested on the ENTIRE POPULATION
- If we had access to an unlimited number of examples, these questions would have a straightforward answer:
  - Choose the model that provides the lowest error rate on the entire population; that error rate is, of course, the true error rate
- In real applications, however, only a finite set of examples is available, and this number is usually smaller than we would hope for. Why? Data collection is a very expensive process

- One may be tempted to use the entire training data to select the optimal classifier and then estimate the error rate
- This naive approach has two fundamental problems:
  - The final model will normally overfit the training data: it will not be able to generalize to new data. The problem of overfitting is more pronounced with models that have a large number of parameters
  - The error rate estimate will be overly optimistic (lower than the true error rate). In fact, it is not uncommon to achieve 100% correct classification on training data
- The techniques presented in this lecture will allow you to make the best use of your (limited) data for training, model selection, and performance estimation

The holdout method
- Split the dataset into two groups:
  - Training set: used to train the classifier
  - Test set: used to estimate the error rate of the trained classifier
  [Figure: the total number of examples is split into a training set and a test set]
- The holdout method has two basic drawbacks:
  - In problems where we have a sparse dataset, we may not be able to afford the luxury of setting aside a portion of the dataset for testing
  - Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an unfortunate split
- These limitations of the holdout method can be overcome, at the expense of higher computational cost, with a family of resampling methods:
  - Cross-validation
    - Random subsampling
    - K-fold cross-validation
    - Leave-one-out cross-validation
  - Bootstrap
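A minimal sketch of the holdout estimate, assuming a NumPy dataset (X, y) and a user-supplied train_fn that returns a fitted model with a predict method (these names are illustrative assumptions, not part of the lecture):

    import numpy as np

    def holdout_error(X, y, train_fn, test_frac=0.3, seed=0):
        """Single train/test split; returns the error rate on the held-out set."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))                  # shuffle the examples
        n_test = int(len(y) * test_frac)               # size of the test set
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = train_fn(X[train_idx], y[train_idx])   # train on the training set only
        y_hat = model.predict(X[test_idx])             # predict on the held-out examples
        return np.mean(y_hat != y[test_idx])           # misclassification rate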

Random subsampling
- Random subsampling performs K data splits of the entire dataset
  - Each data split randomly selects a (fixed) number of examples without replacement
  - For each data split we retrain the classifier from scratch with the training examples and then estimate E_i with the test examples
  [Figure: K experiments, each holding out a different randomly chosen set of test examples]
- The true error estimate is obtained as the average of the separate estimates E_i; this estimate is significantly better than the holdout estimate

  $E = \frac{1}{K}\sum_{i=1}^{K} E_i$
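A sketch of random subsampling under the same assumptions (train_fn is the hypothetical train-and-return-model callable from the holdout sketch):

    import numpy as np

    def random_subsampling_error(X, y, train_fn, n_test, K=10, seed=0):
        """K independent splits, each holding out n_test examples chosen without replacement."""
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(K):
            idx = rng.permutation(len(y))                    # a fresh random split each time
            test_idx, train_idx = idx[:n_test], idx[n_test:]
            model = train_fn(X[train_idx], y[train_idx])     # retrain from scratch
            errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errors)                               # E = (1/K) * sum(E_i)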

K-fold cross-validation
- Create a K-fold partition of the dataset
  - For each of K experiments, use K-1 folds for training and a different fold for testing
  [Figure: illustration for K = 4; each experiment uses a different fold as the test examples]
- K-fold cross-validation is similar to random subsampling
  - The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing
- As before, the true error is estimated as the average error rate on the test examples

  $E = \frac{1}{K}\sum_{i=1}^{K} E_i$
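A sketch of K-fold cross-validation under the same assumptions:

    import numpy as np

    def kfold_error(X, y, train_fn, K=10, seed=0):
        """Every example is used exactly once for testing across the K folds."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), K)   # K (nearly) equal-sized folds
        errors = []
        for i in range(K):
            test_idx = folds[i]                              # fold i is the test set
            train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
            model = train_fn(X[train_idx], y[train_idx])
            errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errors)                               # E = (1/K) * sum(E_i)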

Leave-one-out cross-validation
- LOO is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples
  - For a dataset with N examples, perform N experiments
  - For each experiment, use N-1 examples for training and the remaining example for testing
  [Figure: N experiments, each with a single test example]
- As usual, the true error is estimated as the average error rate on the test examples

  $E = \frac{1}{N}\sum_{i=1}^{N} E_i$
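With the hypothetical kfold_error sketch above, leave-one-out is simply the special case K = N:

    # N experiments, each testing on a single held-out example
    loo_error = kfold_error(X, y, train_fn, K=len(y))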

How many folds are needed?
- With a large number of folds:
  + The bias of the true error rate estimator will be small (the estimator will be very accurate)
  - The variance of the true error rate estimator will be large
  - The computational time will be very large as well (many experiments)
- With a small number of folds:
  + The number of experiments and, therefore, the computation time are reduced
  + The variance of the estimator will be small
  - The bias of the estimator will be large (conservative, i.e., larger than the true error rate)
- In practice, the choice of K depends on the size of the dataset:
  - For large datasets, even 3-fold cross-validation will be quite accurate
  - For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible
- A common choice is K = 10

The bootstrap
- The bootstrap is a resampling technique with replacement
  - From a dataset with N examples:
    - Randomly select (with replacement) N examples and use this set for training
    - The remaining examples that were not selected for training are used for testing; the number of test examples is therefore likely to change from fold to fold
  - Repeat this process for a specified number of folds (K)
  - As before, the true error is estimated as the average error rate on the test data
- Example with the complete dataset {X1, X2, X3, X4, X5}:

  Experiment      Training set (sampled with replacement)   Test set (never selected)
  Experiment 1    X3 X1 X3 X3 X5                             X2 X4
  Experiment 2    X5 X5 X3 X1 X2                             X4
  Experiment 3    X5 X5 X1 X2 X1                             X3 X4
  ...
  Experiment K    X4 X4 X4 X4 X1                             X2 X3 X5
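A sketch of the bootstrap error estimate under the same assumptions; the test set of each fold consists of the examples never drawn into the training set, so its size varies from fold to fold:

    import numpy as np

    def bootstrap_error(X, y, train_fn, K=100, seed=0):
        """Train on N examples drawn with replacement, test on the examples never drawn."""
        rng = np.random.default_rng(seed)
        N = len(y)
        errors = []
        for _ in range(K):
            train_idx = rng.integers(0, N, size=N)               # N indices WITH replacement
            test_idx = np.setdiff1d(np.arange(N), train_idx)     # examples never selected
            if len(test_idx) == 0:                               # rare, but guard against an empty test set
                continue
            model = train_fn(X[train_idx], y[train_idx])
            errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errors)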

- Compared to basic cross-validation, the bootstrap increases the variance that can occur in each fold [Efron and Tibshirani, 1993]
  - This is a desirable property since it is a more realistic simulation of the real-life experiment from which our dataset was obtained
- Consider a classification problem with C classes, a total of N examples, and N_i examples for each class ω_i
  - The a priori probability of choosing an example from class ω_i is N_i/N
  - Once we choose an example from class ω_i, if we do not replace it for the next selection, the a priori probabilities will have changed: the probability of choosing an example from class ω_i will now be (N_i - 1)/(N - 1)
  - Thus, sampling with replacement preserves the a priori probabilities of the classes throughout the random selection process
- An additional benefit is that the bootstrap can provide accurate measures of BOTH the bias and variance of the true error estimate

Bias and variance of a statistical estimate
- Problem: estimate parameter α of an unknown distribution G
  - To emphasize the fact that α concerns G, we write α(G)
- Solution
  - We collect N examples X = {x_1, x_2, ..., x_N} from G
  - X defines a discrete distribution G' with mass 1/N at each example
  - We compute the statistic α' = α(G') as an estimator of α(G)
    - e.g., α(G') may be the estimate of the true error rate for a classifier
- How good is this estimator?
  - Bias: how much it deviates from the true value

    $\text{Bias} = E_G[\alpha'] - \alpha(G), \quad \text{where} \quad E_G[x] = \int x\, g(x)\, dx$

  - Variance: how much variability it shows for different samples

    $\text{Var} = E_G\!\left[\left(\alpha' - E_G[\alpha']\right)^2\right]$

Example: bias and variance of the sample mean
- Bias: the sample mean is known to be an unbiased estimator
- How about its variance? From statistics, the standard deviation of the sample mean is equal to

  $\text{std}(\bar{x}) = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$

  - This term is also known in statistics as the STANDARD ERROR
- Unfortunately, there is no such neat algebraic formula for almost any estimate other than the sample mean
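A quick numerical check of the standard-error formula (the small dataset is the one used in the later bootstrap example; the choice is arbitrary):

    import numpy as np

    x = np.array([3.0, 5.0, 2.0, 1.0, 7.0])
    N = len(x)

    # std(x_bar) = sqrt( 1/(N(N-1)) * sum_i (x_i - x_bar)^2 )
    std_err = np.sqrt(np.sum((x - x.mean()) ** 2) / (N * (N - 1)))

    # equivalent form: sample standard deviation divided by sqrt(N)
    assert np.isclose(std_err, x.std(ddof=1) / np.sqrt(N))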

Bias and variance estimates with the bootstrap
- The bootstrap allows us to estimate bias and variance for practically any statistical estimate, be it a scalar or a vector (matrix)
  - Here we only describe the estimation procedure; for more details refer to "Advanced Algorithms for Neural Networks" [Masters, 1995], which provides an excellent introduction
- Approach
  - Consider a dataset of N examples X = {x_1, x_2, ..., x_N} from distribution G; this dataset defines a discrete distribution G'
  - Compute α' = α(G') as our initial estimate of α(G)
  - Let X* = {x*_1, x*_2, ..., x*_N} be a bootstrap dataset drawn (with replacement) from X = {x_1, x_2, ..., x_N}, and estimate the parameter using this bootstrap dataset: α*(G')
  - Generate K bootstrap datasets and obtain K estimates {α*_1(G'), α*_2(G'), ..., α*_K(G')}
  - The bias and variance estimates of α' are

    $\text{Bias}(\alpha') = \bar{\alpha}^* - \alpha', \quad \text{where} \quad \bar{\alpha}^* = \frac{1}{K}\sum_{i=1}^{K}\alpha_i^*$

    $\text{Var}(\alpha') = \frac{1}{K-1}\sum_{i=1}^{K}\left(\alpha_i^* - \bar{\alpha}^*\right)^2$
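A generic sketch of this procedure for a scalar statistic (np.mean, np.median, etc.); the function name and interface are assumptions for illustration:

    import numpy as np

    def bootstrap_bias_var(x, statistic, K=500, seed=0):
        """Bootstrap estimates of the bias and variance of `statistic` on sample x."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x)
        alpha_hat = statistic(x)                       # initial estimate alpha'
        boot = np.array([statistic(rng.choice(x, size=len(x), replace=True))
                         for _ in range(K)])           # K bootstrap estimates alpha*_i
        bias = boot.mean() - alpha_hat                 # Bias(alpha') = mean(alpha*) - alpha'
        var = boot.var(ddof=1)                         # Var(alpha'), 1/(K-1) normalization
        return bias, var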

Rationale
- The effect of generating a bootstrap dataset from the (discrete) distribution G' is similar to the effect of obtaining the dataset X = {x_1, x_2, ..., x_N} from the original distribution G
- In other words, the bootstrap estimates {α*_1(G'), α*_2(G'), ..., α*_K(G')} are related to the initial estimate α' in the same fashion that multiple estimates α' (obtained from different datasets drawn from G) are related to the true value α

Example
- Assume a small dataset x = {3, 5, 2, 1, 7}, and we want to compute the bias and variance of the sample mean, α' = 3.6
- We generate a number of bootstrap samples (three in this case)
  - Assume the first bootstrap sample yields the dataset {7, 3, 2, 3, 1}; its sample mean is α*_1 = 3.2
  - The second bootstrap sample yields the dataset {5, 1, 1, 3, 7}; its sample mean is α*_2 = 3.4
  - The third bootstrap sample yields the dataset {2, 2, 7, 1, 3}; its sample mean is α*_3 = 3.0
  - Averaging these estimates gives ᾱ* = 3.2
- What are the bias and variance of the sample mean α'?

  $\text{Bias}(\alpha') = 3.2 - 3.6 = -0.4$

  - We may conclude that resampling introduced a downward bias on the mean, so we would be inclined to use 3.6 + 0.4 = 4.0 as an unbiased estimate of α

  $\text{Var}(\alpha') = \frac{1}{2}\left[(3.2-3.2)^2 + (3.4-3.2)^2 + (3.0-3.2)^2\right] = 0.04$

- NOTES
  - We have done this exercise for the sample mean (so you could trace the computations), but α could be any other statistical operator!
  - How many bootstrap samples should we use? As a rule of thumb, several hundred resamples will suffice for most problems [Masters, 1995]
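The numbers above can be traced with a few lines of code (using the three bootstrap samples given in the example rather than random ones):

    import numpy as np

    alpha_hat = np.mean([3, 5, 2, 1, 7])                 # initial estimate: 3.6
    boot_means = np.array([np.mean([7, 3, 2, 3, 1]),     # 3.2
                           np.mean([5, 1, 1, 3, 7]),     # 3.4
                           np.mean([2, 2, 7, 1, 3])])    # 3.0

    bias = boot_means.mean() - alpha_hat                 # 3.2 - 3.6 = -0.4
    var = boot_means.var(ddof=1)                         # (1/2) * [0 + 0.04 + 0.04] = 0.04
    print(bias, var)                                     # -0.4 0.04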

Three-way data splits
- If model selection and true error estimates are to be computed simultaneously, the data should be divided into three disjoint sets [Ripley, 1996]
  - Training set: used for learning, e.g., to fit the parameters of the classifier
    - In an MLP, we would use the training set to find the optimal weights with the back-prop rule
  - Validation set: used to select among several trained classifiers
    - In an MLP, we would use the validation set to find the optimal number of hidden units or determine a stopping point for the back-propagation algorithm
  - Test set: used only to assess the performance of a fully trained classifier
    - In an MLP, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights)
- Why separate test and validation sets?
  - The error rate of the final model on validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model
  - After assessing the final model on the test set, YOU MUST NOT tune the model any further!

Procedure outline
1. Divide the available data into training, validation and test sets
2. Select an architecture and training parameters
3. Train the model using the training set
4. Evaluate the model using the validation set
5. Repeat steps 2 through 4 using different architectures and training parameters
6. Select the best model and train it using data from the training and validation sets
7. Assess this final model using the test set

- This outline assumes the holdout method; if cross-validation or the bootstrap is used, steps 3 and 4 have to be repeated for each of the K folds
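A hedged sketch of this outline with a single holdout three-way split; candidate_train_fns is an assumed list of callables, one per candidate architecture / parameter setting, each returning a fitted model with a predict method:

    import numpy as np

    def select_and_assess(X, y, candidate_train_fns, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        idx = rng.permutation(n)
        train_idx, val_idx, test_idx = np.split(idx, [int(0.6 * n), int(0.8 * n)])  # 60/20/20 split (arbitrary)

        # Steps 2-5: train each candidate on the training set, score it on the validation set
        val_errors = [np.mean(fn(X[train_idx], y[train_idx]).predict(X[val_idx]) != y[val_idx])
                      for fn in candidate_train_fns]

        # Step 6: retrain the best candidate on training + validation data
        best_fn = candidate_train_fns[int(np.argmin(val_errors))]
        final_idx = np.concatenate([train_idx, val_idx])
        final_model = best_fn(X[final_idx], y[final_idx])

        # Step 7: assess the final model (once!) on the untouched test set
        test_error = np.mean(final_model.predict(X[test_idx]) != y[test_idx])
        return final_model, test_error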

[Figure: three-way data splits. Each candidate model (Model 1 ... Model 4) is trained on the training set and evaluated on the validation set (Error 1 ... Error 4); the model with the minimum validation error is selected as the final model, whose final error rate is then estimated on the test set.]