Cross-Validation and Other Out-of-Sample Testing Strategies


1 Cross-Validation and Other Out-of-Sample Testing Strategies Simon J. Mason International Research Institute for Climate Prediction, Earth Institute of Columbia University AMS Short Course on Significance Testing, Model Evaluation and Alternatives Seattle, January 11, 2004 Linking Science to Society

2 What is cross-validation? Livezey's talk on resampling highlighted two questions: Is the observed result good? Is the observed result right? Resampling techniques can be used to address these questions by, respectively, determining: The empirical distribution of the estimator under the null hypothesis (by permutation or bootstrap methods); Sampling errors in the observed value of the estimator (by bootstrap or jackknife methods). Cross-validation addresses the second question (is the observed result right?).

3 What is cross-validation? Resampling techniques are frequently used in the context of forecast performance estimators. Cross-validation is a specialized resampling procedure that is designed specifically for application in model validation problems. Cross-validation is often, but incorrectly, used as a synonym for jackknifing. The two methods are distinct.

4 Why cross-validate? Model fitting does not provide a good estimate of actual forecast skill: the model-fit statistics tell us how well the model describes the data, not how well it predicts the data. Because the model knows in advance the relationships between the data, model-fit statistics almost inevitably over-estimate predictive skill. In effect, therefore, cross-validation is intended to eliminate errors in performance estimates.

5 Predictive v descriptive skill To estimate true predictive skill we need a set of forecasts that are independent of the data used to train the model. Specifically, the verification sample should be completely distinct from the training/calibration sample. Any leakage of information from the training sample to the verification sample will bias the predictive skill estimate.

6 Model validation designs 1. Leave at least one year out of the training sample. 2. Reconstruct the model using the new smaller training sample. 3. Forecast at least one of the years omitted. 4. Repeat at least step 3. Objective: mimic the complete lack of knowledge of future values in operational forecasting when hindcasting.

7 Leave-one-out cross-validation 1. Leave the first year out of the training sample. 2. Reconstruct the model using the new smaller training sample. 3. Forecast the omitted year. 4. Repeat, omitting subsequent years until a forecast has been made for each year.
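The four steps above can be sketched in a few lines. This is a minimal illustration, assuming a simple linear regression with one predictor; the helper names (`fit_line`, `leave_one_out_hindcasts`) and the synthetic data are hypothetical and are not taken from the slides.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def leave_one_out_hindcasts(x, y):
    """Refit the model once per year and forecast only the omitted year."""
    hindcasts = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]   # step 1: leave year i out
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)  # step 2: rebuild the model
        hindcasts.append(intercept + slope * x[i])     # step 3: forecast year i
    return hindcasts                                   # step 4: repeated for every year

# Synthetic predictor and predictand (assumed for illustration only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.5 * a + rng.gauss(0, 1) for a in x]
hindcasts = leave_one_out_hindcasts(x, y)
```

Each hindcast comes from a model that never saw the predictand value for the year being forecast, which is the property the procedure is meant to guarantee.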

8 What is cross-validation? Jackknife: Re-calculate the estimator using all possible subsets of the data in which one observation is missing. The jackknife is used to indicate the distribution of sampling errors in the estimator. Cross-validation: Re-calculate the estimated values using (all possible) subsets of the data (in which one observation is missing), and then recalculate the estimator. Cross-validation is used to obtain an unbiased value for the performance estimator.
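The distinction can be made concrete with a short sketch: the jackknife recomputes the correlation itself on each reduced sample, whereas cross-validation recomputes the predictions on each reduced sample and then computes one correlation from those predictions. The function names and the synthetic data below are assumptions for illustration; the regression model is a simple one-predictor least-squares fit.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def jackknife_correlations(x, y):
    """One correlation per omitted observation: n estimates of the estimator."""
    return [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]

def cross_validated_correlation(x, y):
    """Predict each omitted observation, then compute a single correlation."""
    hindcasts = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        hindcasts.append(intercept + slope * x[i])
    return pearson(hindcasts, y)

# Synthetic data (assumed for illustration only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(40)]
y = [0.5 * a + rng.gauss(0, 1) for a in x]
jack = jackknife_correlations(x, y)
cv = cross_validated_correlation(x, y)
```

The jackknife returns a distribution of values (`jack`), useful for judging sampling error; cross-validation returns a single value (`cv`) intended as an estimate of predictive skill.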

9 Jackknife and leave-one-out cross-validation [Diagram, years 1951 to 1955. Jackknife: correlate all but 1951, all but 1952, all but 1953, all but 1954 and all but 1955, then compare the correlations. Leave-one-out cross-validation: predict 1951, 1952, 1953, 1954 and 1955 in turn from the remaining period, then correlate.]

10 Example Jackknife correlations between JFM temperature for San Diego (CD93) and Eastern North Dakota (CD16).

11 Example Cross-validated correlation between JFM temperature for San Diego (CD93) and Eastern North Dakota (CD16).

12 Example The jackknife correlations suggest that the true correlation is between [value missing] and [value missing]. The cross-validated correlation suggests that the true correlation is 0.228, less than the sample correlation. Cross-validated correlations typically are in the lower percentiles of jackknife correlations (in this case ~25th percentile). Why?

13 Assumptions The only assumption with cross-validation is that there is no leakage of information about the omitted data to the training sample. The most difficult problem is to ensure that NONE of the information in the training sample is known to the verification sample. Unfortunately, there is usually some leakage:

14 Leakage Recalculate model parameters. Reselect predictors. Why?

15 Leakage Note that each year might involve a completely new model. The model to be used to produce a real forecast is not actually verified itself. Verification procedures verify the entire forecast process NOT the specific model.

16 Leakage Recalculate model parameters. Reselect predictors. Why? Recalculate the optimal number of modes. Why? Redefine climatology. Why? Avoid autocorrelation. Why? How?

17 Leave-k-out cross-validation [Diagram, years 1951 to 1955: each year is predicted in turn, with neighbouring years also omitted from the training period.] Ensure that cross-validation window length is at least twice the decorrelation time.
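A leave-k-out scheme with a window centred on the forecast year might look like the sketch below. It assumes a one-predictor linear regression and a window that is simply truncated at the ends of the record; other implementations may wrap the window around instead. The names `half_width` and `leave_k_out_hindcasts` are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def leave_k_out_hindcasts(x, y, half_width):
    """Omit a window of k = 2 * half_width + 1 years; forecast the middle year.

    Assumption: the window is truncated at the start and end of the record,
    so fewer than k years are omitted near the edges.
    """
    n = len(x)
    hindcasts = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        # Train only on years outside the omitted window.
        slope, intercept = fit_line(x[:lo] + x[hi:], y[:lo] + y[hi:])
        hindcasts.append(intercept + slope * x[i])
    return hindcasts

# Synthetic data (assumed for illustration only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.5 * a + rng.gauss(0, 1) for a in x]
loo = leave_k_out_hindcasts(x, y, half_width=0)  # k = 1: leave-one-out
l5o = leave_k_out_hindcasts(x, y, half_width=2)  # k = 5
```

With autocorrelated data, `half_width` would be chosen so that the full window is at least twice the decorrelation time, as the slide recommends.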

18 Assumptions There were no obvious sources of leakage in the example. The cross-validated correlation was less than the sample correlation, and less than about 75% of the jackknife correlations. So which gives us the best estimate of predictive skill? Is the difference between descriptive and predictive skill real, and if so, why?

19 Cross-validation and sampling errors Use the jackknifed averages of JFM temperature for San Diego to provide cross-validated hindcasts of JFM temperature for the omitted year, and then correlate these hindcasts with the observed values
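This experiment is easy to reproduce with any series. If the hindcast for year i is the mean of all other years, then hindcast_i = (S - x_i) / (n - 1), where S is the sum of the whole series: an exactly linear, decreasing function of the observation itself. The sketch below uses synthetic values rather than the San Diego data (which are not reproduced in the slides).

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Synthetic "temperatures" (assumed for illustration only).
rng = random.Random(1)
obs = [rng.gauss(15.0, 1.0) for _ in range(50)]

n = len(obs)
total = sum(obs)
# Leave-one-out climatology: the mean of every year except the one forecast.
hindcasts = [(total - value) / (n - 1) for value in obs]

# Because hindcast_i = (total - obs_i) / (n - 1), a warm omitted year always
# gets a cool hindcast, whatever the data are.
r = pearson(hindcasts, obs)
```

The correlation is perfectly negative (up to rounding) for any sample, so the result says nothing about the data and everything about the validation design.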

20 Cross-validation San Diego JFM temperature r =

21 Cross-validation Bias in cross-validated estimates of forecast skill, for simple linear regression given different sample sizes and population correlations. [Figure: bias plotted against sample size.]

22 Predictive v descriptive skill Why should there be a difference between predictive and descriptive skill? 1. Model parameters are optimized for the data in the training sample, not for future data. But the best estimate of the population correlation is obtained using the largest sample possible so the fitted model parameters should be unbiased. 2. Model variables are optimized for the data in the training sample, not for future data.

23 Cross-validation and bias Withholding ALL the information in the omitted year from the verification sample is effectively impossible: it is known a priori that the model climatology will shift away from the omitted value, and that modelled relationships will shift away from the direction implicit in the omitted observational pair. So even omitting information does not necessarily mean that this information is not known by the verification sample.

24 Cross-validation and bias Leave-one-out cross-validation can give pathological estimates of forecast performance. But the underestimation of performance may be most serious for the reference forecast strategy, and so it is possible that some estimates of skill may be overestimated. With leave-3-out cross-validation, the correlation between the observed and predicted JFM temperatures increases from -1.0 to about -0.6, and continues to increase (on average) with an increasing cross-validation window length, up to the point at which sampling errors become large.

25 Cross-validation and bias If the model is correct, leave-k-out cross-validation will always underestimate forecast performance, but the underestimation usually decreases as k increases (up to a point). It is therefore desirable to make k large.

26 Cross-validation Cross-validation is poorly designed for determining whether the model parameters are correctly estimated. But cross-validation (appropriately implemented) is well designed for determining whether the right model has been selected.

27 Multiplicity A multiplicity of candidate predictors results in positive biases in performance measures.
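The effect can be illustrated with a small simulation in which no predictor has any real skill. This is a sketch under stated assumptions: the predictand and all candidate predictors are independent Gaussian noise, and the sample size, number of candidates and number of trials are arbitrary choices.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def single_and_best(rng, n_years, n_candidates):
    """Return |r| for one arbitrary predictor and for the best of many."""
    y = [rng.gauss(0, 1) for _ in range(n_years)]
    abs_r = []
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n_years)]  # pure noise, no true skill
        abs_r.append(abs(pearson(x, y)))
    return abs_r[0], max(abs_r)

rng = random.Random(2)
trials = [single_and_best(rng, n_years=30, n_candidates=20) for _ in range(200)]
mean_single = sum(t[0] for t in trials) / len(trials)
mean_best = sum(t[1] for t in trials) / len(trials)
print(f"mean |r|, one predictor:  {mean_single:.2f}")
print(f"mean |r|, best of twenty: {mean_best:.2f}")
```

Picking the best of many candidates inflates the apparent correlation even though the true correlation is zero in every case.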

28 Cross-validation and multiplicity Model selection criteria indicate improved model selection for leave-k-out over leave-1-out cross-validation.

29 Cross-validation and multiplicity although if the number of candidate predictors becomes too large it becomes impossible to avoid overfitting.

30 Cross-validation If cross-validation can be used effectively to determine whether the right model has been selected, why not use cross-validation as a selection procedure rather than verification procedure? Cross-validation as a model selection procedure involves selecting the model with the best predictive capability rather than the best descriptive capability.

31 Cross-validation But how can we now verify the forecasts? If we are picking the model that gives the best cross-validated skill estimates, we cannot then use cross-validation to estimate its predictive skill. We again are subject to the problem of picking the best results, and assuming that those results will be a good indication of true performance.

32 Retroactive forecasting [Diagram: 1983, 1984 and 1985 are predicted in turn from a training period beginning in 1981 (year ranges lost in transcription), with later years omitted.]
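Retroactive forecasting can be sketched as an expanding training window: each target year is forecast using only the years before it, as would be the case operationally. The code below assumes a one-predictor linear regression; `first_target` (the index of the first year to be forecast) and the synthetic data are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def retroactive_hindcasts(x, y, first_target):
    """Forecast each year from first_target onwards using only earlier years."""
    hindcasts = []
    for t in range(first_target, len(x)):
        # The training sample grows by one year at each step; years after t
        # are never used.
        slope, intercept = fit_line(x[:t], y[:t])
        hindcasts.append(intercept + slope * x[t])
    return hindcasts

# Synthetic data (assumed for illustration only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(40)]
y = [0.5 * a + rng.gauss(0, 1) for a in x]
hindcasts = retroactive_hindcasts(x, y, first_target=20)
```

The cost is that only the later part of the record is verified, which is why the approach needs a reasonably long sample.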

33 But what if we do not have enough data?

34 Two-deep cross-validation [Diagram, years 1951 to 1955: in each step some years are predicted, some are omitted and the rest form the training period.] Use the orange years for model selection. Use the green years for model verification.
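One way to read the two-deep idea is as a nested loop: an outer loop withholds a year for verification, and an inner cross-validation, run only on the remaining years, selects the model. The sketch below is an assumption about how this could be implemented, not the slides' procedure: it uses leave-one-out at both levels for brevity (the slides recommend larger windows), chooses among single-predictor regressions by inner squared error, and all names and data are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def loo_squared_error(x, y):
    """Sum of squared leave-one-out hindcast errors for one predictor."""
    total = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total

def two_deep_hindcasts(candidates, y):
    """Outer loop verifies; inner loop selects the predictor."""
    hindcasts, chosen = [], []
    for i in range(len(y)):
        y_train = y[:i] + y[i + 1:]
        trains = [c[:i] + c[i + 1:] for c in candidates]
        # Inner cross-validation: year i plays no part in model selection.
        best = min(range(len(candidates)),
                   key=lambda j: loo_squared_error(trains[j], y_train))
        slope, intercept = fit_line(trains[best], y_train)
        hindcasts.append(intercept + slope * candidates[best][i])
        chosen.append(best)
    return hindcasts, chosen

# Synthetic data: one informative predictor and three noise predictors
# (assumed for illustration only).
rng = random.Random(3)
n = 25
signal = [rng.gauss(0, 1) for _ in range(n)]
candidates = [signal] + [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
y = [0.8 * s + rng.gauss(0, 0.5) for s in signal]
hindcasts, chosen = two_deep_hindcasts(candidates, y)
```

Because selection is repeated inside every outer step, the outer hindcasts verify the whole forecast process, including the selection, rather than one fixed model.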

35 Summary Cross-validation is designed to give an indication of the predictive skill of a model, rather than its descriptive skill. For reliable estimates of predictive skill, leakage needs to be avoided. Completely withholding information from the training sample is not a guarantee against leakage, although using large cross-validation windows helps. In practice, most leakage results from only considering the first of two sources of uncertainty: 1. Are the model parameters correct? 2. Is the model correct?

36 Summary Cross-validation will underestimate the performance of a correct model because it is not well-designed for measuring sampling errors in parameter estimates. It is better designed for identifying biases in performance measures that result from incorrect model specification. As such, cross-validation can be used as an effective method of model selection.

37 Recommendations Cross-validation should only be performed if the model variables and format are not known a priori. Otherwise jackknifing or bootstrapping would be more appropriate. Leave-one-out cross-validation should almost always be avoided, and large cross-validation windows should be used, if possible. Retroactive validation is preferable to cross-validation where sample sizes are sufficiently large. Otherwise two-deep cross-validation is an attractive option that has not been widely used.

38 Recommended readings Barnston, A. G., and H. M. van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6, Browne, M. W., 2000: Cross-validation methods. J. Math. Psych., 44, Elsner, J. B., and C. P. Schmertmann, 1994: Assessing forecast skill through cross validation. Wea. Forecasting, 9, Livezey, R. E., A. G. Barnston, and B. K. Neumeister, 1990: Mixed analog/persistence prediction of United States seasonal mean temperatures. Int. J. Climatol., 10, Michaelsen, J., 1987: Cross-validation in statistical climate forecast models. J. Clim. Appl. Meteorol., 26,