Cross-validation for detecting and preventing overfitting


Cross-validation for detecting and preventing overfitting

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599. Copyright Andrew W. Moore. Slide 1

A Regression Problem

y = f(x) + noise. Can we learn f from this data? Let's consider three methods.

Linear Regression

Univariate linear regression with a constant term. The data are a column of inputs X = (3, 1, ...) and a column of outputs y = (7, 3, ...), so x_1 = 3, y_1 = 7, and so on. (Originally discussed in the previous Andrew lecture: Neural Nets.)

Prepend a constant 1 to each input to form the matrix Z, whose k-th row is z_k = (1, x_k), so z_1 = (1, 3):

$$Z = \begin{pmatrix} 1 & 3 \\ 1 & 1 \\ \vdots & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}$$

$$\beta = (Z^T Z)^{-1} (Z^T y), \qquad y^{est} = \beta_0 + \beta_1 x$$
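As a minimal sketch of the normal-equation formula above, the univariate case can be solved by hand because Z^T Z is only 2x2. The function name `fit_linear` is hypothetical. The first two data points are the ones shown on the slide; the other two are made up so that all four lie exactly on y = 1 + 2x.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = b0 + b1*x by solving the 2x2 normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Z^T Z = [[n, sx], [sx, sxx]] and Z^T y = [sy, sxy]; invert the 2x2 matrix directly.
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# (3, 7) and (1, 3) come from the slide; (2, 5) and (4, 9) are assumed extra points.
xs = [3.0, 1.0, 2.0, 4.0]
ys = [7.0, 3.0, 5.0, 9.0]
b0, b1 = fit_linear(xs, ys)
print(b0, b1)
```

Because the toy data are noise-free, the fit recovers the intercept 1 and slope 2.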

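Quadratic Regression

The data are the same: inputs X = (3, 1, ...) and outputs y = (7, 3, ...), so x_1 = 3, y_1 = 7, and so on. Each row of Z now also carries the squared input, z_k = (1, x_k, x_k^2):

$$Z = \begin{pmatrix} 1 & 3 & 9 \\ 1 & 1 & 1 \\ \vdots & \vdots & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}$$

$$\beta = (Z^T Z)^{-1} (Z^T y), \qquad y^{est} = \beta_0 + \beta_1 x + \beta_2 x^2$$

Much more about this in the future Andrew lecture: Favorite Regression Algorithms.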
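The same formula with z_k = (1, x_k, x_k^2) can be sketched in plain Python. This is an illustration only: the helper names `solve` and `fit_poly` are hypothetical, the data are made up, and the linear system is solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_poly(xs, ys, degree):
    """Return beta solving (Z^T Z) beta = Z^T y, where row k of Z is (1, x_k, ..., x_k^degree)."""
    z = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    ztz = [[sum(row[i] * row[j] for row in z) for j in range(k)] for i in range(k)]
    zty = [sum(row[i] * y for row, y in zip(z, ys)) for i in range(k)]
    return solve(ztz, zty)

# Made-up, noise-free data on the curve y = 1 + 2x - 0.5x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x - 0.5 * x * x for x in xs]
beta = fit_poly(xs, ys, 2)
print(beta)
```

Calling `fit_poly(xs, ys, 1)` gives the linear fit, so the one helper covers both regression slides.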

Join-the-dots

Also known as piecewise linear nonparametric regression, if that makes you feel better.
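A join-the-dots predictor can be sketched as linear interpolation between neighboring training points. The function name `join_the_dots` is hypothetical, the data are made up, and two choices here are assumptions not stated on the slide: training x-values are distinct, and queries outside the training range return the nearest endpoint's value.

```python
from bisect import bisect_right

def join_the_dots(xs, ys):
    """Return a predictor that linearly interpolates between the training points."""
    pts = sorted(zip(xs, ys))
    px = [p[0] for p in pts]
    py = [p[1] for p in pts]

    def predict(x):
        # Assumption: clamp to the end values outside the training range.
        if x <= px[0]:
            return py[0]
        if x >= px[-1]:
            return py[-1]
        i = bisect_right(px, x)  # now px[i - 1] <= x < px[i]
        t = (x - px[i - 1]) / (px[i] - px[i - 1])
        return py[i - 1] + t * (py[i] - py[i - 1])

    return predict

predict = join_the_dots([1.0, 3.0, 2.0], [3.0, 7.0, 4.0])
print(predict(2.5))  # halfway between (2, 4) and (3, 7)
```

The predictor passes through every training point, so its training error is always zero.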

Which is best?

Why not choose the method with the best fit to the data?

What do we really want? How well are you going to predict future data drawn from the same distribution?

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

Test-set mean squared error for the three methods:
Linear regression: 2.4
Quadratic regression: 0.9
Join-the-dots: 2.2
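The four steps can be sketched as follows for the linear-regression case. The data are synthetic (a noisy line, generated with a fixed seed) and the helper names `fit_linear` and `mse` are hypothetical; the numbers it prints are unrelated to the MSE values quoted on the slides.

```python
import random

def fit_linear(points):
    """Least-squares line through (x, y) pairs; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return mean_y - slope * mean_x, slope

def mse(model, points):
    """Mean squared error of a fitted line on the given points."""
    b0, b1 = model
    return sum((b0 + b1 * x - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
# Assumed toy data: y = 1 + 2x plus Gaussian noise.
data = [(x / 2, 1 + 2 * (x / 2) + rng.gauss(0, 0.5)) for x in range(20)]

rng.shuffle(data)                                       # step 1: random choice
n_test = round(0.3 * len(data))                         # 30% of the records
test_set, training_set = data[:n_test], data[n_test:]   # step 2
model = fit_linear(training_set)                        # step 3
test_mse = mse(model, test_set)                         # step 4
train_mse = mse(model, training_set)
print(train_mse, test_mse)
```

Re-running with a different seed changes which records land in the test set, and with only six test records the estimate can move noticeably from split to split.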

The test set method: good news and bad news

Good news:
Very, very simple.
Can then simply choose the method with the best test-set score.

Bad news:
Wastes data: we get an estimate of the best method to apply to 30% less data.
If we don't have much data, our test set might just be lucky or unlucky.
We say the test-set estimator of performance has high variance.

LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.

LOOCV mean squared error for the three methods:
Linear regression: 2.12
Quadratic regression: 0.962
Join-the-dots: 3.33
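The loop above translates almost line for line into code. In this sketch the learner is passed in as a pair of functions, which is a design assumption rather than anything from the slides; all function names are hypothetical and the six data points are made up (they lie exactly on y = 1 + 2x).

```python
def loocv_mse(points, fit, predict):
    """Mean squared leave-one-out error of the learner defined by fit/predict."""
    squared_errors = []
    for k in range(len(points)):
        x_k, y_k = points[k]                      # 1. the k-th record
        remaining = points[:k] + points[k + 1:]   # 2. temporarily remove it
        model = fit(remaining)                    # 3. train on the other R-1 points
        squared_errors.append((predict(model, x_k) - y_k) ** 2)  # 4. note the error
    return sum(squared_errors) / len(squared_errors)

def fit_mean(points):
    """A deliberately weak learner: always predict the mean training output."""
    return sum(y for _, y in points) / len(points)

def predict_mean(model, x):
    return model

def fit_line(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return my - slope * mx, slope

def predict_line(model, x):
    return model[0] + model[1] * x

points = [(float(x), 1.0 + 2.0 * x) for x in range(6)]
loocv_constant = loocv_mse(points, fit_mean, predict_mean)
loocv_line = loocv_mse(points, fit_line, predict_line)
print(loocv_constant, loocv_line)
```

On this noise-free data the line scores essentially zero while the constant predictor scores about 16.8, so LOOCV picks the line.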

Which kind of Cross Validation?

Test-set
  Downside: Variance: unreliable estimate of future performance.
  Upside: Cheap.
Leave-one-out
  Downside: Expensive. Has some weird behavior.
  Upside: Doesn't waste data.

Can we get the best of both worlds?

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored red, green and blue).
For the red partition: train on all the points not in the red partition. Find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition. Find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.
Then report the mean error.

3-fold mean squared error for the three methods:
Linear regression: 2.05
Quadratic regression: 1.11
Join-the-dots: 2.93
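A sketch of the k-fold procedure, under a few assumptions: the learner is passed in as `fit`/`predict` functions, the random partition is made by shuffling and dealing records round-robin, and the learner used in the example is a simple predict-the-mean model on made-up data. All names are hypothetical.

```python
import random

def k_fold_mse(points, k, fit, predict, seed=0):
    """Mean squared error over all points, each predicted by a model that never saw its fold."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k disjoint partitions
    total_sq_err = 0.0
    for i, fold in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        model = fit(train)
        total_sq_err += sum((predict(model, x) - y) ** 2 for x, y in fold)
    return total_sq_err / len(points)

def fit_mean(points):
    """Toy learner: predict the mean training output."""
    return sum(y for _, y in points) / len(points)

def predict_mean(model, x):
    return model

points = [(float(x), 1.0 + 2.0 * x) for x in range(6)]
cv3 = k_fold_mse(points, 3, fit_mean, predict_mean)
cv_r = k_fold_mse(points, len(points), fit_mean, predict_mean)
print(cv3, cv_r)
```

With k equal to the number of records every fold holds a single point, so `cv_r` is the leave-one-out score and does not depend on the shuffle.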

Which kind of Cross Validation?

Test-set
  Downside: Variance: unreliable estimate of future performance.
  Upside: Cheap.
Leave-one-out
  Downside: Expensive. Has some weird behavior.
  Upside: Doesn't waste data.
10-fold
  Downside: Wastes 10% of the data. 10 times more expensive than test set.
  Upside: Only wastes 10%. Only 10 times more expensive instead of R times.
3-fold
  Downside: Wastier than 10-fold. Expensivier than test set.
  Upside: Slightly better than test set.
R-fold
  Identical to leave-one-out.

But note: one of Andrew's joys in life is algorithmic tricks for making these cheap.

CV-based Model Selection

We're trying to decide which algorithm to use. We train each machine and make a table with columns i, f_i, TRAINERR, 10-FOLD-CV-ERR and Choice, and one row for each candidate f_1, f_2, ..., f_6.

CV-based Model Selection

Example: Choosing the number of hidden units in a one-hidden-layer neural net.
Step 1: Compute 10-fold CV error for six different model classes. The table has columns Algorithm, TRAINERR, 10-FOLD-CV-ERR and Choice, and one row each for 0, 1, 2, 3, 4 and 5 hidden units.
Step 2: Whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.

CV-based Model Selection

Example: Choosing k for a k-nearest-neighbor regression.
Step 1: Compute LOOCV error for six different model classes. The table has columns Algorithm, TRAINERR, LOOCV-ERR and Choice, and one row each for K=1, K=2, K=3, K=4, K=5 and K=6.
Step 2: Whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.

Why did we use 10-fold-CV for neural nets and LOOCV for k-nearest neighbor?
The reason is computational. For k-NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions.

And why stop at K=6?
No good reason, except it looked like things were getting worse as K was increasing.

Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum?
Sadly, no. And in fact, the relationship can be very bumpy.

What should we do if we are depressed at the expense of doing LOOCV for K=1 through 1000?
Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64, ..., K=1024.
Idea Two: Hillclimbing from an initial guess at K.
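Idea One can be sketched as a LOOCV sweep over a doubling grid of K values. The data are synthetic (a noisy parabola, fixed seed), the function names are hypothetical, and the k-NN regressor is a plain one-dimensional version that averages the K nearest outputs. Note that leaving a point out of k-NN needs no retraining, only skipping that record in the neighbor search.

```python
import random

def knn_predict(train, x, k):
    """Average the outputs of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def loocv_knn(points, k):
    """Leave-one-out mean squared error of k-NN regression."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        rest = points[:i] + points[i + 1:]   # "training" is just dropping the record
        total += (knn_predict(rest, x, k) - y) ** 2
    return total / len(points)

rng = random.Random(0)
# Assumed toy data: y = x^2 plus unit Gaussian noise, 40 records.
points = [(x / 4, (x / 4) ** 2 + rng.gauss(0, 1)) for x in range(40)]

candidates = [1, 2, 4, 8, 16, 32]            # exponentially increasing gaps
scores = {k: loocv_knn(points, k) for k in candidates}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Since the K-versus-LOOCV curve can be bumpy, a natural follow-up is to search more finely around `best_k`.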

CV-based Model Selection

Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
Degree of polynomial in polynomial regression
Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
The kernel width in Kernel Regression
The kernel width in Locally Weighted Regression
The Bayesian prior in Bayesian Regression
Also: the scale factors of a nonparametric distance metric

Several of these (the kernel widths and the Bayesian prior) involve choosing the value of a real-valued parameter. What should we do?
Idea One: Consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the K-NN example).
Idea Two: Compute $\partial\,\mathrm{LOOCV} / \partial\,\mathrm{Parameter}$ and then do gradient descent.

CV-based Algorithm Choice

Example: Choosing which regression algorithm to use.
Step 1: Compute 10-fold-CV error for six different model classes. The table has columns Algorithm, TRAINERR, 10-fold-CV-ERR and Choice, and one row each for 1-NN, 10-NN, Linear Reg'n, Quad Reg'n, LWR with KW=0.1 and LWR with KW=0.5.
Step 2: Whichever algorithm gave the best CV score: train it with all the data, and that's the predictive model you'll use.

Alternatives to CV-based model selection

Model selection methods:
1. Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. VC-dimension (Vapnik-Chervonenkis Dimension). Only directly applicable to choosing classifiers.

AIC, BIC and VC-dimension are described in a future lecture.

Which model selection method is best?

1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension

AIC, BIC and SRMVC advantage: you only need the training error.
CV error might have more variance.
SRMVC is wildly conservative.
Asymptotically AIC and leave-one-out CV should be the same.
Asymptotically BIC and carefully chosen k-fold should be the same.
You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding).
Many alternatives, including proper Bayesian approaches.
It's an emotional issue.
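For reference, one common textbook form of the two criteria is given below. It is not taken from these slides, which defer the topic to a later lecture, and other sign and scaling conventions exist. Here d is the number of free parameters (a symbol chosen to avoid clashing with the k of k-fold), R is the number of training records, and \hat{L} is the maximized likelihood on the training data. Smaller values are preferred in this convention.

```latex
\mathrm{AIC} = 2d - 2\ln\hat{L}, \qquad \mathrm{BIC} = d\,\ln R - 2\ln\hat{L}
```

Both depend only on the training fit and the parameter count, which is why no held-out data is needed to compute them.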

Other Cross-validation issues

Can do "leave all pairs out" or "leave-all-n-tuples-out" if feeling resourceful.
Some folks do k-folds in which each fold is an independently chosen subset of the data.
Do you know what AIC and BIC are?
If so: LOOCV behaves like AIC asymptotically, and k-fold behaves like BIC if you choose k carefully.
If not: Nyardely nyardely nyoo nyoo.

Cross-Validation for regression

Choosing the number of hidden units in a neural net
Feature selection (see later)
Choosing a polynomial degree
Choosing which regressor to use

Supervising Gradient Descent

This is a weird but common use of test-set validation.
Suppose you have a neural net with too many hidden units. It will overfit.
As gradient descent progresses, maintain a graph of MSE-testset-error vs. iteration.

(Figure: mean squared error against iteration of gradient descent, with one curve for the training set and one for the test set. An annotation marks one iteration: "Use the weights you found on this iteration.")

Relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters.
Works pretty well in practice, apparently.
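The idea can be sketched without a neural net. The stand-in model below is an assumption made to keep the example short: a degree-9 polynomial fitted by batch gradient descent to 15 noisy points, which has plenty of room to overfit. The held-out error is checked after every step, and the weights from the iteration where it was lowest are kept. All names and data are hypothetical.

```python
import random

def features(x, degree=9):
    """Polynomial features (1, x, ..., x^degree); an over-flexible stand-in for a big net."""
    return [x ** p for p in range(degree + 1)]

def mse(w, data):
    return sum((sum(wi * fi for wi, fi in zip(w, features(x))) - y) ** 2
               for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    """Assumed toy data: y = x^2 plus Gaussian noise, x uniform in [-1, 1]."""
    out = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        out.append((x, x * x + rng.gauss(0, 0.3)))
    return out

train, held_out = sample(15), sample(15)

w = [0.0] * 10
best_w, best_err, best_iter = w[:], mse(w, held_out), 0
learning_rate = 0.05
for it in range(1, 2001):
    grad = [0.0] * len(w)
    for x, y in train:
        f = features(x)
        resid = sum(wi * fi for wi, fi in zip(w, f)) - y
        for j, fj in enumerate(f):
            grad[j] += 2 * resid * fj / len(train)
    w = [wi - learning_rate * gj for wi, gj in zip(w, grad)]
    err = mse(w, held_out)            # one point on the held-out error curve
    if err < best_err:                # remember the best iteration so far
        best_w, best_err, best_iter = w[:], err, it

final_err = mse(w, held_out)
print(best_iter, best_err, final_err)
```

`best_w` holds the weights to use. In practice the loop is often stopped once the held-out error has failed to improve for some number of iterations, rather than run to a fixed count.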

Cross-validation for classification

Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on a test set.

Questions posed on the slide's example dataset: What's LOOCV of 1-NN? What's LOOCV of 3-NN? What's LOOCV of 22-NN?

But there's a more sensitive alternative: compute log P(all test outputs | all test inputs, your model).
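The difference between the two scores shows up in a small made-up example. Assume a binary classifier that outputs P(label = 1) for each test input, strictly between 0 and 1; the function names and the probabilities below are hypothetical.

```python
import math

def misclassifications(probs, labels):
    """Count test points whose thresholded prediction (p >= 0.5) disagrees with the label."""
    return sum((p >= 0.5) != bool(y) for p, y in zip(probs, labels))

def log_likelihood(probs, labels):
    """log P(all test outputs | all test inputs, model), assuming independent records."""
    return sum(math.log(p if y else 1.0 - p) for p, y in zip(probs, labels))

labels = [1, 0, 1, 1]
model_a = [0.9, 0.2, 0.8, 0.6]    # confident and correct
model_b = [0.6, 0.4, 0.6, 0.55]   # correct, but only just

errors_a = misclassifications(model_a, labels)
errors_b = misclassifications(model_b, labels)
ll_a = log_likelihood(model_a, labels)
ll_b = log_likelihood(model_b, labels)
print(errors_a, errors_b, ll_a, ll_b)
```

Both models make zero misclassifications, so the error count cannot separate them, while the log-likelihood prefers model A.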

Cross-Validation for classification

Choosing the pruning parameter for decision trees
Feature selection (see later)
What kind of Gaussian to use in a Gaussian-based Bayes Classifier
Choosing which classifier to use

Cross-Validation for density estimation

Compute the sum of log-likelihoods of test points.
Example uses:
Choosing what kind of Gaussian assumption to use
Choosing the density estimator
NOT feature selection (test-set density will almost always look better with fewer features)

Feature Selection

Suppose you have a learning algorithm LA and a set of input attributes {X_1, X_2, ..., X_m}. You expect that LA will only find some subset of the attributes useful.
Question: How can we use cross-validation to find a useful subset?
Four ideas:
Forward selection
Backward elimination
Hill climbing
Stochastic search (Simulated Annealing or GAs)

(Another fun area in which Andrew has spent a lot of his wild youth.)
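Forward selection, the first of the four ideas, can be sketched as a greedy loop that keeps adding whichever attribute most improves the cross-validation score and stops when nothing helps. In this sketch the learning algorithm LA is assumed to be 1-nearest-neighbor regression scored by LOOCV, purely because it fits in a few lines; the function names and the data (30 records, 4 attributes, output depending only on attribute 0) are made up.

```python
import random

def loocv_error(rows, ys, feats):
    """LOOCV squared error of 1-NN regression that only looks at the attributes in feats."""
    total = 0.0
    for i, (row, y) in enumerate(zip(rows, ys)):
        best_d, best_y = None, None
        for j, (other, other_y) in enumerate(zip(rows, ys)):
            if j == i:
                continue  # leave record i out
            d = sum((row[f] - other[f]) ** 2 for f in feats)
            if best_d is None or d < best_d:
                best_d, best_y = d, other_y
        total += (best_y - y) ** 2
    return total / len(rows)

def forward_selection(rows, ys):
    """Greedily add the attribute that lowers the LOOCV error most; stop when none does."""
    remaining = set(range(len(rows[0])))
    chosen, best = [], float("inf")
    while remaining:
        score, feat = min((loocv_error(rows, ys, chosen + [f]), f) for f in remaining)
        if score >= best:
            break
        chosen.append(feat)
        remaining.remove(feat)
        best = score
    return chosen, best

rng = random.Random(0)
rows = [[rng.random() for _ in range(4)] for _ in range(30)]
ys = [3.0 * row[0] for row in rows]   # only attribute 0 carries signal
chosen, score = forward_selection(rows, ys)
print(chosen, score)
```

Backward elimination is the mirror image: start with every attribute and repeatedly drop the one whose removal most improves the score.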

Very serious warning

Intensive use of cross validation can overfit.

How?
Imagine a dataset with 50 records and 1000 attributes.
You try 1000 linear regression models, each one using one of the attributes.
The best of those 1000 looks good!
But you realize it would have looked good even if the output had been purely random!

What can be done about it?
Hold out an additional test set before doing any model selection. Check the best model performs well even on the additional test set.
Or: Randomization Testing.
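The scenario can be simulated directly. In this sketch every attribute and the output are independent Gaussian noise, so no attribute carries any real signal. LOOCV for each one-attribute linear regression is computed from running sums (an implementation shortcut, not something from the slides), and the winning model is then checked on fresh records, which plays the role of the additional held-out test set. All names are hypothetical.

```python
import random

def fit_from_sums(n, sx, sy, sxx, sxy):
    """Intercept and slope of the least-squares line, from sufficient statistics."""
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b1 * sx) / n, b1

def loocv_mse(xs, ys):
    """Leave-one-out MSE of univariate linear regression, via subtracting one record's sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    total = 0.0
    for x, y in zip(xs, ys):
        b0, b1 = fit_from_sums(n - 1, sx - x, sy - y, sxx - x * x, sxy - x * y)
        total += (b0 + b1 * x - y) ** 2
    return total / n

rng = random.Random(0)
R, M = 50, 1000
ys = [rng.gauss(0, 1) for _ in range(R)]                         # purely random output
cols = [[rng.gauss(0, 1) for _ in range(R)] for _ in range(M)]   # unrelated attributes

scores = [loocv_mse(col, ys) for col in cols]
best = min(scores)                 # the "winner" of the model search
typical = sorted(scores)[M // 2]   # median attribute, for comparison

# Refit the winner on all 50 records, then test on fresh data from the same process.
xs = cols[scores.index(best)]
b0, b1 = fit_from_sums(R, sum(xs), sum(ys), sum(x * x for x in xs),
                       sum(x * y for x, y in zip(xs, ys)))
fresh = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
fresh_mse = sum((b0 + b1 * x - y) ** 2 for x, y in fresh) / len(fresh)
print(best, typical, fresh_mse)
```

Comparing `best` with `typical` shows how much of the winner's apparent quality comes from having searched over 1000 candidates; `fresh_mse` is the honest estimate.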

What you should know

Why you can't use training-set error to estimate the quality of your learning algorithm on your data.
Why you can't use training-set error to choose the learning algorithm.
Test-set cross-validation
Leave-one-out cross-validation
k-fold cross-validation
Feature selection methods
CV for classification, regression and densities