LECTURE 13: Cross-validation




Resampling methods:
- Cross-validation
- Bootstrap
- Bias and variance estimation with the bootstrap
- Three-way data partitioning

Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University

Introduction (1)

Almost invariably, all the pattern recognition techniques that we have introduced have one or more free parameters:
- The number of neighbors in a kNN classification rule
- The bandwidth of the kernel function in kernel density estimation
- The number of features to preserve in a subset selection problem

Two issues arise at this point:
- Model selection: how do we select the optimal parameter(s) for a given classification problem?
- Validation: once we have chosen a model, how do we estimate its true error rate? The true error rate is the classifier's error rate when tested on the ENTIRE POPULATION.

If we had access to an unlimited number of examples, these questions would have a straightforward answer: choose the model that provides the lowest error rate on the entire population, and, of course, that error rate is the true error rate. However, in real applications only a finite set of examples is available, and this number is usually smaller than we would hope for. Why? Data collection is a very expensive process.

Introduction (2)

One may be tempted to use the entire training data to select the optimal classifier, then estimate the error rate. This naive approach has two fundamental problems:
- The final model will normally overfit the training data: it will not be able to generalize to new data. The problem of overfitting is more pronounced with models that have a large number of parameters.
- The error rate estimate will be overly optimistic (lower than the true error rate). In fact, it is not uncommon to have 100% correct classification on training data.

The techniques presented in this lecture will allow you to make the best use of your (limited) data for training, model selection, and performance estimation.

The holdout method

Split the dataset into two groups:
- Training set: used to train the classifier
- Test set: used to estimate the error rate of the trained classifier

The holdout method has two basic drawbacks:
- In problems where we have a sparse dataset, we may not be able to afford the luxury of setting aside a portion of the dataset for testing.
- Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an unfortunate split.

The limitations of the holdout can be overcome with a family of resampling methods, at the expense of higher computational cost:
- Cross-validation: random subsampling, K-fold cross-validation, leave-one-out cross-validation
- Bootstrap
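As a concrete illustration, the holdout split can be sketched in a few lines of Python. This is not from the original slides; the function name and the 30% test fraction are our own choices.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.3, seed=0):
    """Randomly partition (X, y) into a training set and a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                    # shuffle example indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

A classifier would then be trained on the first pair of arrays and its error rate measured on the second.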

Random subsampling

Random subsampling performs K data splits of the entire dataset:
- Each data split randomly selects a (fixed) number of examples without replacement.
- For each data split we retrain the classifier from scratch with the training examples and then estimate E_i with the test examples.

(Diagram: for each of the K experiments, a different randomly chosen subset of the examples serves as the test set.)

The true error estimate is obtained as the average of the separate estimates E_i; this estimate is significantly better than the holdout estimate:

    E = (1/K) * Sum_{i=1..K} E_i
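A minimal sketch of random subsampling, assuming a user-supplied `train_and_test` callback (a placeholder of ours, not part of the slides) that retrains a classifier from scratch and returns its test error E_i:

```python
import numpy as np

def random_subsampling_error(X, y, train_and_test, K=3, test_fraction=0.3, seed=0):
    """Average the test error over K independent random train/test splits.

    train_and_test(X_tr, y_tr, X_te, y_te) -> error rate E_i on the test split.
    """
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(X)))
    errors = []
    for _ in range(K):
        idx = rng.permutation(len(X))        # a fresh random split each experiment
        te, tr = idx[:n_test], idx[n_test:]
        errors.append(train_and_test(X[tr], y[tr], X[te], y[te]))
    return sum(errors) / K                   # E = (1/K) * sum of the E_i
```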

K-fold cross-validation

Create a K-fold partition of the dataset. For each of K experiments, use K-1 folds for training and a different fold for testing. (Diagram: for K=4, each of the four experiments holds out a different fold as test examples.)

K-fold cross-validation is similar to random subsampling. The advantage of K-fold cross-validation is that all the examples in the dataset are eventually used for both training and testing. As before, the true error is estimated as the average error rate on test examples:

    E = (1/K) * Sum_{i=1..K} E_i
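The fold bookkeeping can be sketched as follows; `train_and_test` is again a placeholder callback of ours into which any classifier could be plugged:

```python
import numpy as np

def kfold_error(X, y, train_and_test, K=4, seed=0):
    """Estimate the true error as the average test error over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)   # K disjoint index sets
    errors = []
    for i in range(K):
        test_idx = folds[i]                              # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        errors.append(train_and_test(X[train_idx], y[train_idx],
                                     X[test_idx], y[test_idx]))
    return sum(errors) / K
```

Note that across the K experiments every example appears in a test set exactly once.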

Leave-one-out cross-validation

Leave-one-out is the degenerate case of K-fold cross-validation, where K is chosen as the total number of examples. For a dataset with N examples, perform N experiments: for each experiment, use N-1 examples for training and the remaining example for testing. (Diagram: each of the N experiments holds out a single test example.)

As usual, the true error is estimated as the average error rate on test examples:

    E = (1/N) * Sum_{i=1..N} E_i
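Because K equals N, leave-one-out needs no random partitioning at all; a sketch (with the same hypothetical `train_and_test` callback as before):

```python
import numpy as np

def leave_one_out_error(X, y, train_and_test):
    """K-fold cross-validation with K = N: each example is the test set once."""
    N = len(X)
    errors = []
    for i in range(N):
        mask = np.ones(N, dtype=bool)
        mask[i] = False                  # hold out example i for testing
        errors.append(train_and_test(X[mask], y[mask], X[~mask], y[~mask]))
    return sum(errors) / N
```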

How many folds are needed?

With a large number of folds:
+ The bias of the true error rate estimator will be small (the estimator will be very accurate).
- The variance of the true error rate estimator will be large.
- The computational time will be very large as well (many experiments).

With a small number of folds:
+ The number of experiments and, therefore, computation time are reduced.
+ The variance of the estimator will be small.
- The bias of the estimator will be large (conservative, i.e., larger than the true error rate).

In practice, the choice of the number of folds depends on the size of the dataset:
- For large datasets, even 3-fold cross-validation will be quite accurate.
- For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible.
- A common choice for K-fold cross-validation is K=10.

The bootstrap (1)

The bootstrap is a resampling technique with replacement. From a dataset with N examples:
- Randomly select (with replacement) N examples and use this set for training.
- The remaining examples that were not selected for training are used for testing. This number is likely to change from fold to fold.
- Repeat this process for a specified number of folds (K).

As before, the true error is estimated as the average error rate on test examples.

(Diagram: each of the K experiments draws N training examples with replacement from the complete dataset; the examples never drawn form that experiment's test set.)
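A sketch of the bootstrap fold generation described above (the function name is our own; the rest follows the procedure on the slide):

```python
import numpy as np

def bootstrap_splits(n, K, seed=0):
    """Yield (train_idx, test_idx) pairs for K bootstrap folds.

    Training indices are n draws WITH replacement; the test set is the
    examples never drawn, so its size varies from fold to fold.
    """
    rng = np.random.default_rng(seed)
    for _ in range(K):
        train = rng.integers(0, n, size=n)          # sample with replacement
        test = np.setdiff1d(np.arange(n), train)    # out-of-sample examples
        yield train, test
```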

The bootstrap (2)

Compared to basic cross-validation, the bootstrap increases the variance that can occur in each fold [Efron and Tibshirani, 1993]. This is a desirable property, since it is a more realistic simulation of the real-life experiment from which our dataset was obtained.

Consider a classification problem with C classes, a total of N examples, and N_i examples for each class ω_i:
- The a priori probability of choosing an example from class ω_i is N_i/N.
- Once we choose an example from class ω_i, if we do not replace it for the next selection, then the a priori probabilities will have changed, since the probability of choosing an example from class ω_i will now be (N_i - 1)/(N - 1).
- Thus, sampling with replacement preserves the a priori probabilities of the classes throughout the random selection process.

An additional benefit of the bootstrap is its ability to obtain accurate measures of BOTH the bias and variance of the true error estimate.
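The prior-preservation argument can be checked empirically; the following simulation (our own illustration, not from the slides) draws bootstrap label sets from a 70/30 class mix and confirms the class-1 fraction stays near the prior:

```python
import numpy as np

# Labels: 70 examples of class 0 and 30 of class 1, so P(omega_1) = 0.3.
rng = np.random.default_rng(0)
y = np.array([0] * 70 + [1] * 30)

# 2000 bootstrap label sets, each of size N=100, sampled WITH replacement.
draws = rng.choice(y, size=(2000, len(y)), replace=True)
print(draws.mean())   # overall fraction of class-1 draws, close to the prior 0.3
```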

Bias and variance of a statistical estimate

Consider the problem of estimating a parameter α of an unknown distribution G. To emphasize the fact that α concerns G, we will refer to it as α(G). We collect N examples X={x_1, x_2, ..., x_N} from the distribution G; these examples define a discrete distribution G' with mass 1/N at each of the examples. We compute the statistic α'=α(G') as an estimator of α(G). In the context of this lecture, α(G') is the estimate of the true error rate for our classifier.

How good is this estimator? The goodness of a statistical estimator is measured by:
- BIAS: how much it deviates from the true value

    Bias = E_G'[α'(G')] - α(G),   where E_G[x] = Integral of x g(x) dx

- VARIANCE: how much variability it shows for different samples X={x_1, x_2, ..., x_N} of the population G

    Var = E_G'[(α' - E_G'[α'])^2]

Example: if we try to estimate the mean of the population with the sample mean:
- The bias of the sample mean is known to be ZERO.
- From elementary statistics, the standard deviation of the sample mean is

    std(x̄) = sqrt( (1/(N(N-1))) * Sum_{i=1..N} (x_i - x̄)^2 )

  This term is also known in statistics as the STANDARD ERROR.
- Unfortunately, there is no such neat algebraic formula for almost any estimate other than the sample mean.
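The standard-error formula above translates directly to code (our own helper, added for illustration):

```python
import numpy as np

def standard_error(x):
    """Standard deviation of the sample mean:
    sqrt( sum((x_i - mean)^2) / (N * (N - 1)) )."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (N * (N - 1)))
```

This is algebraically the same as the sample standard deviation (with the N-1 denominator) divided by sqrt(N).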

Bias and variance estimates with the bootstrap

The bootstrap, with its elegant simplicity, allows us to estimate bias and variance for practically any statistical estimate, be it a scalar or vector (matrix). Here we will only describe the estimation procedure; if you are interested in more details, the textbook "Advanced Algorithms for Neural Networks" [Masters, 1995] has an excellent introduction to the bootstrap.

The bootstrap estimate of bias and variance:
- Consider a dataset of N examples X={x_1, x_2, ..., x_N} from the distribution G. This dataset defines a discrete distribution G'.
- Compute α'=α(G') as our initial estimate of α(G).
- Let {x_1*, x_2*, ..., x_N*} be a bootstrap dataset drawn from X={x_1, x_2, ..., x_N}; estimate the parameter α using this bootstrap dataset: α*(G*).
- Generate K bootstrap datasets and obtain K estimates {α*_1(G*), α*_2(G*), ..., α*_K(G*)}.

The rationale in the bootstrap method is that the effect of generating a bootstrap dataset from the distribution G' is similar to the effect of obtaining the dataset X={x_1, x_2, ..., x_N} from the original distribution G. In other words, the distribution {α*_1(G*), α*_2(G*), ..., α*_K(G*)} is related to the initial estimate α' in the same fashion as multiple estimates α' are related to the true value α, so the bias and variance estimates of α' are:

    Bias(α') = ᾱ* - α'
    Var(α') = (1/(K-1)) * Sum_{i=1..K} (α*_i - ᾱ*)^2,   where ᾱ* = (1/K) * Sum_{i=1..K} α*_i
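The procedure can be sketched for any statistic passed in as a function (function name and defaults are our own choices):

```python
import numpy as np

def bootstrap_bias_variance(x, statistic, K=300, seed=0):
    """Bootstrap estimates of the bias and variance of statistic(x).

    Bias(a') = mean(a*_i) - a'
    Var(a')  = sum((a*_i - mean(a*))^2) / (K - 1)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    alpha = statistic(x)                         # initial estimate a' = a(G')
    boot = np.array([statistic(rng.choice(x, size=len(x), replace=True))
                     for _ in range(K)])         # K bootstrap estimates a*_i
    bias = boot.mean() - alpha
    var = np.sum((boot - boot.mean()) ** 2) / (K - 1)
    return bias, var
```

Because `statistic` is an arbitrary callable, the same code estimates bias and variance for the median, a trimmed mean, or an error rate just as easily as for the sample mean.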

Example: estimating bias and variance

Assume a small dataset x={3, 5, 2, 1, 7}. We want to compute the bias and variance of the sample mean α'=3.6. We generate a number of bootstrap samples (three in this case):
- Assume that the first bootstrap yields the dataset {7, 3, 2, 3, 1}. We compute the sample mean α*_1=3.2.
- The second bootstrap sample yields the dataset {5, 1, 1, 3, 7}. We compute the sample mean α*_2=3.4.
- The third bootstrap sample yields the dataset {2, 2, 7, 1, 3}. We compute the sample mean α*_3=3.0.
- We average these estimates and obtain ᾱ*=3.2.

What are the bias and variance of the sample mean α'?
- Bias(α') = 3.2 - 3.6 = -0.4. Therefore, we conclude that the resampling process introduces a downward bias on the mean, so we would be inclined to use 3.6 + 0.4 = 4.0 as an unbiased estimate of α.
- Var(α') = (1/2) * [(3.2-3.2)^2 + (3.4-3.2)^2 + (3.0-3.2)^2] = 0.04

NOTES:
- We have done this exercise for the sample mean (so you could trace the computations), but α could be any other statistical operator. Here lies the real power of this procedure!
- How many bootstrap samples should we use? As a rule of thumb, several hundred resamples will be sufficient for most problems.

Adapted from [Masters, 1995]
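The arithmetic above can be traced in a few lines; the three bootstrap samples are taken verbatim from the example:

```python
# Reproduce the worked example for the dataset x = {3, 5, 2, 1, 7}.
x = [3, 5, 2, 1, 7]
alpha = sum(x) / len(x)                               # sample mean a' = 3.6

boot_samples = [[7, 3, 2, 3, 1], [5, 1, 1, 3, 7], [2, 2, 7, 1, 3]]
boot_means = [sum(s) / len(s) for s in boot_samples]  # 3.2, 3.4, 3.0
mean_boot = sum(boot_means) / len(boot_means)         # 3.2

bias = mean_boot - alpha                              # -0.4 (downward bias)
var = sum((m - mean_boot) ** 2 for m in boot_means) / (len(boot_means) - 1)
# var = 0.04, matching the (1/(K-1)) formula with K = 3
```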

Three-way data splits (1)

If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets [Ripley, 1996]:
- Training set: a set of examples used for learning, i.e., to fit the parameters of the classifier. In the MLP case, we would use the training set to find the optimal weights with the backprop rule.
- Validation set: a set of examples used to tune the parameters of a classifier. In the MLP case, we would use the validation set to find the optimal number of hidden units or determine a stopping point for the backpropagation algorithm.
- Test set: a set of examples used only to assess the performance of a fully-trained classifier. In the MLP case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further!

Why separate test and validation sets? The error rate estimate of the final model on validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model.

Procedure outline:
1. Divide the available data into training, validation and test sets.
2. Select architecture and training parameters.
3. Train the model using the training set.
4. Evaluate the model using the validation set.
5. Repeat steps 2 through 4 using different architectures and training parameters.
6. Select the best model and train it using data from the training and validation sets.
7. Assess this final model using the test set.

This outline assumes the holdout method; if cross-validation or the bootstrap are used, steps 3 and 4 have to be repeated for each of the K folds.
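The procedure outline can be sketched as follows, assuming placeholder hooks of our own: a list of candidate configurations plus `fit(config, X, y) -> model` and `error(model, X, y) -> error rate` functions that would wrap the actual classifier:

```python
import numpy as np

def select_and_assess(splits, candidates, fit, error):
    """Model selection plus error estimation with a three-way split (holdout version)."""
    X_tr, y_tr, X_val, y_val, X_te, y_te = splits
    # Steps 2-5: train each candidate on the training set, score it on validation.
    best = min(candidates,
               key=lambda c: error(fit(c, X_tr, y_tr), X_val, y_val))
    # Step 6: retrain the winning configuration on training + validation data.
    final = fit(best, np.concatenate([X_tr, X_val]), np.concatenate([y_tr, y_val]))
    # Step 7: assess once on the untouched test set; do not tune any further.
    return final, error(final, X_te, y_te)
```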

Three-way data splits (2)

(Diagram: each candidate model is trained on the training set and evaluated on the validation set; the model with the minimum validation error is selected as the final model, which is then assessed once on the test set to obtain the final error rate.)