1 In-Sample Overfitting
There is a serious danger in using the simple $R^2$ (or SSR or MSE) to select among competing forecast models. The $R^2$ can only increase, or at worst stay the same (and the SSR and MSE can only decrease or stay the same), each time you add another variable to the right-hand side of the regression. So, for example, if we mechanically apply these criteria to select the trend model from among models of the form
$$T_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p$$
these criteria will direct us to choose p as large as possible.
For the hepi data, I fit the hepi to a trend model for p = 1, 2, 3, 4. Here are the results.
[Table of SSR by p; the values did not survive in this transcript.]
So? What's the problem?
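The mechanical drop in SSR as p grows can be seen in a minimal sketch. It uses a synthetic series (an assumption; this is not the hepi data), and the helper name `poly_trend_ssr` is hypothetical.

```python
import numpy as np

# Synthetic series (an assumption, NOT the hepi data): quadratic trend plus noise.
rng = np.random.default_rng(0)
T = 40
t = np.arange(1, T + 1) / T   # rescaled time index keeps the powers of t well conditioned
y = 100 + 50 * t + 80 * t**2 + rng.normal(0, 5, size=T)

def poly_trend_ssr(y, t, p):
    """OLS fit of y on 1, t, ..., t^p; returns the sum of squared residuals."""
    X = np.vander(t, p + 1, increasing=True)      # columns: 1, t, ..., t^p
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

sst = float(((y - y.mean()) ** 2).sum())          # total sum of squares
ssr_by_p = {p: poly_trend_ssr(y, t, p) for p in range(1, 6)}
r2_by_p = {p: 1 - ssr / sst for p, ssr in ssr_by_p.items()}

for p in ssr_by_p:
    print(f"p={p}  SSR={ssr_by_p[p]:.2f}  R2={r2_by_p[p]:.4f}")
```

Because each model nests the previous one, the SSR column cannot rise as p increases, even though the true trend in this sketch is only quadratic.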

2 The problem is that improving the fit of the model over the sample period by adding variables can easily lead to poorer out-of-sample forecasts! The more variables you add to the right-hand side of the forecasting model, the higher the variance of the forecast error!
So there is a tradeoff that needs to be addressed:
On the one hand, adding variables to the regression improves the fit of the model.
On the other hand, adding variables to the regression increases the variance of the forecast error.
The simple $R^2$, SSR, and MSE criteria ignore the negative side effects of adding variables to the regression.

3 We need criteria that account for both the benefits and costs of adding variables to the regression:
Adjusted $R^2$: select the model that maximizes the adjusted $R^2$.
Akaike Information Criterion (AIC): select the model that minimizes the AIC.
Schwarz Information Criterion (SIC): select the model that minimizes the SIC.
Let's first look at each of these measures.

4 Adjusted $R^2$
$$\bar{R}^2 = 1 - \frac{s^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2 / (T-1)}$$
where $s^2 = \mathrm{SSR}/(T-k)$, $s$ is the standard error of the regression, and $k$ is the number of regression parameters.
Note that the denominator depends only on $y_1, \ldots, y_T$ (not on the model that the y's are fit to). So the only thing that changes as we fit different models is $s^2$. In effect, maximizing the adjusted $R^2$ amounts to minimizing the standard error of the regression (equivalently, $s^2 = \mathrm{SSR}/(T-k)$), whereas maximizing the simple $R^2$ amounts to minimizing the MSE of the regression, $\mathrm{SSR}/T$.
As we increase the number of variables in the model, SSR will decrease, $T-k$ will decrease, and $s^2$ will increase or decrease depending on whether SSR decreases less than or more than proportionally to the (linear) decrease in $T-k$.
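As a minimal sketch of the formula, the function below computes the adjusted $R^2$ from a series and a vector of regression residuals. The function name and the small arrays are illustrative assumptions, not output from any fitted model in these notes.

```python
import numpy as np

def adjusted_r2(y, resid, k):
    """Adjusted R^2 = 1 - s^2 / [sum (y_t - ybar)^2 / (T - 1)], with s^2 = SSR / (T - k)."""
    T = len(y)
    ssr = float(resid @ resid)
    s2 = ssr / (T - k)                                     # squared s.e. of the regression
    var_y = float(((y - y.mean()) ** 2).sum()) / (T - 1)   # does not depend on the model
    return 1 - s2 / var_y

# Made-up numbers purely for illustration.
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
resid_demo = np.array([0.5, -0.5, 0.5, -0.5, 0.0])
r2_demo = 1 - float(resid_demo @ resid_demo) / float(((y_demo - y_demo.mean()) ** 2).sum())
adj_demo = adjusted_r2(y_demo, resid_demo, k=2)
print(r2_demo, adj_demo)
```

With k > 1 the adjusted value sits below the simple $R^2$, because SSR is divided by T-k rather than T.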

5 AIC and SIC
$$\log(\mathrm{AIC}) = \log(\mathrm{SSR}/T) + 2k/T$$
$$\log(\mathrm{SIC}) = \log(\mathrm{SSR}/T) + k \log(T)/T$$
(Note: the AIC and SIC reported by EViews are log(AIC) and log(SIC), but they do not use exactly the same formulas as those given above, which are the ones used in the text, pp. … In fact, there are a number of variations of the AIC and SIC that are used by different books and programs. They all imply the same results with regard to ordering models according to these criteria.)
Note that the AIC and SIC values decrease through the first term as additional variables are added to the regression, but increase through the second term, i.e., the penalty term. The preferred model is the one that minimizes the AIC (or SIC).
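The two log criteria can be sketched directly from the formulas used in the text, with an AIC penalty of 2k/T and an SIC penalty of k log(T)/T. The function names and the example inputs are assumptions chosen for illustration.

```python
import math

def log_aic(ssr, T, k):
    """log(AIC) = log(SSR/T) + 2k/T."""
    return math.log(ssr / T) + 2 * k / T

def log_sic(ssr, T, k):
    """log(SIC) = log(SSR/T) + k*log(T)/T."""
    return math.log(ssr / T) + k * math.log(T) / T

# Illustrative inputs only (not estimates from any real series).
ssr_demo, T_demo = 100.0, 50
for k in (2, 3, 4):
    print(k, round(log_aic(ssr_demo, T_demo, k), 4), round(log_sic(ssr_demo, T_demo, k), 4))
```

Holding SSR fixed, both criteria rise with k, and the SIC rises faster because log(50) is larger than 2. In practice SSR falls as k grows, so the minimum balances the two terms.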

6 Unfortunately, these three criteria (adjusted $R^2$, AIC, SIC) will not always select the same model! That is because of the differences in their penalty functions, as illustrated by Figure 4.13 in your text. The SIC imposes the strongest penalty for additional variables, followed by the AIC and then the adjusted $R^2$. So, when they select different models, the SIC will choose a more parsimonious model than the AIC, which, in turn, will choose a more parsimonious model than the adjusted $R^2$.
The AIC and SIC are more commonly used than the adjusted $R^2$. Each of the two has certain (but different) theoretical properties that make it appealing. Which of the two to use in practice when they give different answers is somewhat arbitrary. If we maintain the KISS principle and select the simpler model when there is no compelling reason to do otherwise, then we should use the SIC.
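The ordering of the penalties can be made concrete by writing each criterion as SSR/T times a penalty factor. The AIC and SIC lines simply exponentiate the log formulas used in the text (with AIC penalty 2k/T). The size comparison that follows is a sketch under the stated conditions, not a general proof.

```latex
\begin{align*}
s^2 &= \frac{T}{T-k} \cdot \frac{\mathrm{SSR}}{T}, \\
\mathrm{AIC} &= e^{2k/T} \cdot \frac{\mathrm{SSR}}{T}, \\
\mathrm{SIC} &= T^{k/T} \cdot \frac{\mathrm{SSR}}{T} = e^{k \log(T)/T} \cdot \frac{\mathrm{SSR}}{T}.
\end{align*}
```

The SIC factor exceeds the AIC factor whenever $\log(T) > 2$, i.e., for $T \geq 8$. For k small relative to T, $e^{2k/T} \approx 1 + 2k/T$ while $T/(T-k) \approx 1 + k/T$, so the AIC penalty in turn exceeds the penalty implicit in the adjusted $R^2$.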

7 Compute the adjusted $R^2$, AIC, and SIC for the polynomial trends for hepi.
[Table with columns p, R-Bar-Square, AIC, SIC; the values did not survive in this transcript.]
[Table with columns Year, Forecast (p=2), Forecast (p=5), (Actual), with growth rates in parentheses; only fragments survive, some with dropped digits: (3.4%) 44.1 (5.0%) | (3.4%) 57.8 (5.5%) | (3.3%) 73.7 (6.0%) | (3.3%) 9. (6.5%) | (3.%) (7.1%) | (3.%) 338.5 (7.6%)]
The AIC and SIC select a 5th-order polynomial in t to represent the trend component of the HEPI.

8 There are, however, a couple of reasons why I might still end up selecting the quadratic trend:
1. The AIC and SIC are in-sample fit criteria. Although they account for the costs of overfitting through the inclusion of a penalty term, I am still concerned that extrapolating such a high-order polynomial into the future will be misleading.
2. What might be going on with this series is that the actual trend is a linear or quadratic function of time but the parameters of that function have changed during the sample period. E.g., perhaps
$$y_t = \beta_{0,1} + \beta_{1,1} t \quad \text{for } t = 1, \ldots, T_0$$
$$y_t = \beta_{0,2} + \beta_{1,2} t \quad \text{for } t = T_0+1, \ldots, T, T+1, \ldots$$
There are a number of things that I can do to pursue these possibilities.

9 Out-of-Sample Fitting
What I am really interested in is the question: Having fit the model over the sample period, how well does it forecast outside of that sample? The in-sample fit criteria that we discussed do not directly answer this question.
Consider the following exercise. Suppose we have a data sample $y_1, \ldots, y_T$.
1. Break it up into two parts, where $n \ll T$:
   $y_1, \ldots, y_{T-n}$ (first T-n observations)
   $y_{T-n+1}, \ldots, y_T$ (last n observations)
2. Fit the shortened sample, $y_1, \ldots, y_{T-n}$, to various trend models that may seem like plausible choices based on time series plots, in-sample fit criteria, ...: linear, quadratic, the one selected by AIC/SIC, log linear, ...

10
3. For each estimated trend model, forecast $y_{T-n+1}, \ldots, y_T$ and compute the forecast errors: $e_1, \ldots, e_n$.
4. Compare the errors across the various models using:
   time series plots (of the forecasts and actual values of $y_{T-n+1}, \ldots, y_T$; of the forecast errors)
   tables of the forecasts, actuals, and errors
   mean squared prediction errors (MSPE)
$$\mathrm{MSPE} = \frac{1}{n} \sum_{i=1}^{n} e_i^2$$
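The four steps can be sketched as a single holdout routine for polynomial trends. The data here are synthetic (an assumption), and `holdout_mspe` is a hypothetical helper name.

```python
import numpy as np

def holdout_mspe(y, n, p):
    """Fit a degree-p polynomial trend to the first T-n observations,
    forecast the last n, and return the mean squared prediction error."""
    T = len(y)
    t = np.arange(1, T + 1) / T                   # rescaled time index (numerical stability only)
    X = np.vander(t, p + 1, increasing=True)      # columns: 1, t, ..., t^p
    beta, *_ = np.linalg.lstsq(X[:T - n], y[:T - n], rcond=None)
    errors = y[T - n:] - X[T - n:] @ beta         # e_1, ..., e_n
    return float(np.mean(errors ** 2))

# Synthetic example (an assumption): linear trend plus noise, last 8 points held out.
rng = np.random.default_rng(1)
T = 60
y = 20 + 0.8 * np.arange(1, T + 1) + rng.normal(0, 2, size=T)
mspe_by_p = {p: holdout_mspe(y, n=8, p=p) for p in (1, 2, 5)}
for p, mspe in mspe_by_p.items():
    print(f"p={p}  MSPE={mspe:.2f}")
```

Once a model is chosen this way, it would be refit on all T observations before forecasting beyond the sample.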

11 The advantage of this approach is that we are actually comparing the trend models in terms of their out-of-sample forecasting performance.
A disadvantage is that the comparison is based on models fit over T-n observations rather than the T observations we have available. (Note that if you do use this approach and, for example, settle on the quadratic model, then when you proceed to construct your forecasts for T+1, ..., you should use the quadratic model fit to the full T observations in your sample.)
Will the fact that, for example, the quadratic trend model outperformed other models in forecasting out of sample based on the short sample mean that it will perform best in forecasting beyond the full sample? No; there is no guarantee.

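12 Structural Breaks in the Trend
Suppose that the trend in $y_t$ can be modeled as
$$T_t = \beta_{0,t} + \beta_{1,t} t$$
where
$$\beta_{0,t} = \begin{cases} \beta_{0,1} & \text{if } t \le T_0 \\ \beta_{0,2} & \text{if } t > T_0 \end{cases}
\qquad \text{and} \qquad
\beta_{1,t} = \begin{cases} \beta_{1,1} & \text{if } t \le T_0 \\ \beta_{1,2} & \text{if } t > T_0 \end{cases}$$
In this case,
$$T_{T+h} = \beta_{0,2} + \beta_{1,2} (T+h)$$
Problem: How to estimate $\beta_{0,2}$ and $\beta_{1,2}$?
A bad approach: Regress $y_t$ on 1, t for $t = 1, \ldots, T$.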

13 Better approaches
Regress $y_t$ on 1, t for $t = T_0+1, \ldots, T$.
Problems with this approach:
Not an ideal approach if you want to force either the intercept or slope coefficient to be fixed over the full sample, $t = 1, \ldots, T$, allowing only one of the coefficients to change at $T_0$.
Does not allow you to test whether the intercept and/or slope changed at $T_0$.
Does not provide us with estimated deviations from trend for $t = 1, \ldots, T_0$, which we will want to use to estimate the seasonal and cyclical components of the series to help us forecast those components of the series.

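14 Introduce dummy variables into the regression to jointly estimate $\beta_{0,1}$, $\beta_{0,2}$, $\beta_{1,1}$, $\beta_{1,2}$.
Let
$$D_t = \begin{cases} 0 & \text{if } t = 1, \ldots, T_0 \\ 1 & \text{if } t > T_0 \end{cases}$$
Run the regression
$$y_t = \alpha_0 + \alpha_1 D_t + \alpha_2 t + \alpha_3 (D_t t) + \varepsilon_t$$
over the full sample, $t = 1, \ldots, T$. Then
$$\hat{\beta}_{0,1} = \hat{\alpha}_0, \quad \hat{\beta}_{0,2} = \hat{\alpha}_0 + \hat{\alpha}_1, \quad \hat{\beta}_{1,1} = \hat{\alpha}_2, \quad \hat{\beta}_{1,2} = \hat{\alpha}_2 + \hat{\alpha}_3$$
Suppose we want to allow $\beta_0$ to change at $T_0$ but we want to force $\beta_1$ to remain fixed (i.e., a shift in the intercept of the trend line). Run the regression of $y_t$ on 1, $D_t$, and t to estimate $\alpha_0$, $\alpha_1$, and $\alpha_2$ ($= \beta_1$).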
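A minimal sketch of the dummy-variable regression follows, assuming a known break date $T_0$. The function name `fit_broken_trend` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_broken_trend(y, T0):
    """OLS of y_t on 1, D_t, t, D_t*t, where D_t = 1 for t > T0.
    Returns the implied intercepts and slopes for the two regimes."""
    T = len(y)
    t = np.arange(1, T + 1, dtype=float)
    D = (t > T0).astype(float)
    X = np.column_stack([np.ones(T), D, t, D * t])
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    a0, a1, a2, a3 = alpha
    return {
        "beta0_1": a0, "beta0_2": a0 + a1,   # intercept before / after the break
        "beta1_1": a2, "beta1_2": a2 + a3,   # slope before / after the break
    }

# Synthetic series (an assumption): the trend changes at T0 = 20.
rng = np.random.default_rng(2)
t = np.arange(1, 41, dtype=float)
trend = np.where(t <= 20, 10 + 1.0 * t, 0 + 1.5 * t)
y = trend + rng.normal(0, 0.5, size=t.size)
est = fit_broken_trend(y, T0=20)
print({name: round(float(value), 3) for name, value in est.items()})
```

Dropping the `D * t` column gives the intercept-shift-only version described above, in which the slope is forced to be the same in both regimes.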

15 Notes This approach extends to higher order polynomials in a straightforward way, allowing one or more parameters to change at one or more points in time. This approach can be extended to allow for breaks at unknown time(s).

16 Exponential (or Log Linear) Trends
Recall that an alternative to the polynomial trend is the exponential trend model
$$T_t = e^{\beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p}$$
whose log is a polynomial in t, since $\log(e^x) = x$.
Assuming that $y_t = T_t + \varepsilon_t$, we can estimate the β's by applying nonlinear least squares: choose $\beta_0, \beta_1, \ldots, \beta_p$ to minimize
$$\sum_{t=1}^{T} \left( y_t - e^{\beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p} \right)^2$$
This minimization problem must be solved numerically (vs. analytically), but most modern regression software (including EViews) is well equipped to solve this problem.
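To make "solved numerically" concrete, here is a minimal sketch of nonlinear least squares for the p = 1 case using plain Gauss-Newton iterations. This is one simple numerical method chosen for illustration; it is not a description of what EViews does internally. The starting values come from an OLS regression of log(y) on 1 and t, which assumes y > 0, and the data are synthetic.

```python
import numpy as np

def fit_exp_trend_nls(y, t, n_iter=50):
    """NLS fit of y_t = exp(b0 + b1*t) + e_t by Gauss-Newton (p = 1 only)."""
    X = np.column_stack([np.ones_like(t), t])
    # Starting values: OLS of log(y) on 1, t (requires y > 0).
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    for _ in range(n_iter):
        fitted = np.exp(X @ beta)
        resid = y - fitted
        J = fitted[:, None] * X                   # Jacobian of the fitted values w.r.t. beta
        step, *_ = np.linalg.lstsq(J, resid, rcond=None)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-12:          # stop once the update is negligible
            break
    return beta

# Synthetic series (an assumption): exponential trend with additive noise.
rng = np.random.default_rng(3)
t = np.arange(1, 31, dtype=float)
y = np.exp(0.5 + 0.03 * t) + rng.normal(0, 0.05, size=t.size)
beta_hat = fit_exp_trend_nls(y, t)
print(beta_hat)
```

Plain Gauss-Newton has no step-size control, so a production routine would normally add damping or rely on a library optimizer.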

17 To select p in this case, use the NLS residuals, $y_t - T_t(\hat{\beta})$, to compute the AIC and/or the SIC, then select the model that minimizes the AIC and/or the SIC.
We can also compare the fit of these exponential trend models to the polynomial trend models by comparing AICs and SICs.
If we do select an estimated exponential trend model, the forecast $\hat{y}_{T+h,T}$ of $y_{T+h}$ is
$$\hat{y}_{T+h,T} = T_{T+h}(\hat{\beta}) + \hat{\varepsilon}_{T+h,T}$$

18 A related approach that is commonly used in practice
Assume that
$$\log(y_t) = T_t + \varepsilon_t, \qquad T_t = \beta_0 + \beta_1 t + \cdots + \beta_p t^p$$
(The ε's are deviations of log(y) from its trend, vs. deviations of y from its trend.)
In this case,
we can fit $\log(y_t)$ to $1, t, \ldots, t^p$ by OLS to estimate the β's;
we can select p by minimizing the AIC and/or SIC across these regressions;
the forecast of the log is
$$\widehat{\log(y)}_{T+h,T} = T_{T+h}(\hat{\beta}) + \hat{\varepsilon}_{T+h,T} = T_{T+h}(\hat{\beta}) \quad \text{if the } \varepsilon\text{'s are i.i.d.}$$
and the forecast of the level is
$$\hat{y}_{T+h,T} = e^{\widehat{\log(y)}_{T+h,T}}$$
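The log-regression route can be sketched in a few lines: OLS on log(y), extrapolate the fitted trend, then exponentiate. The function name and the synthetic series are assumptions, and the sketch sets the forecast of the future ε to zero, as in the i.i.d. case.

```python
import numpy as np

def loglinear_forecast(y, h, p=1):
    """OLS of log(y_t) on 1, t, ..., t^p; returns exp of the forecast of log(y_{T+h})."""
    T = len(y)
    t = np.arange(1, T + 1, dtype=float)
    X = np.vander(t, p + 1, increasing=True)              # columns: 1, t, ..., t^p
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    x_future = np.vander(np.array([T + h], dtype=float), p + 1, increasing=True)
    log_forecast = (x_future @ beta).item()               # trend in logs at T+h
    return float(np.exp(log_forecast))

# Synthetic series (an assumption): log-linear trend with multiplicative noise.
rng = np.random.default_rng(4)
t = np.arange(1, 41, dtype=float)
y = np.exp(0.1 + 0.05 * t + rng.normal(0, 0.1, size=t.size))
forecast_h3 = loglinear_forecast(y, h=3)
print(forecast_h3)
```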

19 This approach has the advantage of relying on OLS vs. NLS, but although it produces an unbiased forecast of $\log(y_{T+h})$, it produces a biased forecast of $y_{T+h}$. [$E(f(x)) \neq f(E(x))$ if f is nonlinear.]
There has been some work done on ways to adjust the forecast to reduce this bias.
NLS is not particularly difficult or unreliable, especially in this setting.
We should also note that the AICs and SICs from the log(y) regressions cannot be meaningfully compared to the AICs and SICs from the y regressions, so it is difficult to choose between the log linear models and the polynomial trend models based on in-sample fits.
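The source of the bias can be illustrated with a short simulation of $E(f(x)) \neq f(E(x))$ for $f = \exp$. It assumes normally distributed log-scale errors with mean zero (an assumption made for this sketch only), in which case $E(e^{\varepsilon}) = e^{\sigma^2/2} > 1$.

```python
import math
import random

rng = random.Random(0)
sigma = 0.5                                   # assumed s.d. of the log-scale errors
eps = [rng.gauss(0.0, sigma) for _ in range(100_000)]

mean_of_exp = sum(math.exp(e) for e in eps) / len(eps)   # sample estimate of E[exp(eps)]
exp_of_mean = math.exp(sum(eps) / len(eps))              # exp of the sample mean of eps

print(mean_of_exp, exp_of_mean, math.exp(sigma ** 2 / 2))
```

Under these assumptions, exponentiating an unbiased forecast of log(y) understates the mean of y by roughly the factor $e^{\sigma^2/2}$, which is the kind of gap the bias adjustments aim to close.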

20 Although the exponential trend model is a nonlinear model (in the β's), its natural log is a linear model, and so we also call it a log linear trend model:
$$\log(T_t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p$$