Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts?


QUARTERLY JOURNAL OF THE ROYAL METEOROLOGICAL SOCIETY
Q. J. R. Meteorol. Soc. 134 (2008). Published online in Wiley InterScience.

Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts?

A. P. Weigel,* M. A. Liniger and C. Appenzeller
Federal Office of Meteorology and Climatology (MeteoSwiss), Zürich, Switzerland

* Correspondence to: A. P. Weigel, MeteoSwiss, Krähbühlstrasse 58, PO Box 514, CH-8044 Zürich, Switzerland. E-mail: andreas.weigel@meteoswiss.ch

ABSTRACT: The success of multi-model ensemble combination has been demonstrated in many studies. Given that a multi-model contains information from all participating models, including the less skilful ones, the question remains as to why, and under what conditions, a multi-model can outperform the best participating single model. It is the aim of this paper to resolve this apparent paradox. The study is based on a synthetic forecast generator, allowing the generation of perfectly-calibrated single-model ensembles of any size and skill. Additionally, the degree of ensemble under-dispersion (or overconfidence) can be prescribed. Multi-model ensembles are then constructed from both weighted and unweighted averages of these single-model ensembles. Applying this toy model, we carry out systematic model-combination experiments. We evaluate how multi-model performance depends on the skill and overconfidence of the participating single models. It turns out that multi-model ensembles can indeed locally outperform a best-model approach, but only if the single-model ensembles are overconfident. The reason is that multi-model combination reduces overconfidence, i.e. ensemble spread is widened while average ensemble-mean error is reduced. This implies a net gain in prediction skill, because probabilistic skill scores penalize overconfidence. Under these conditions, even the addition of an objectively-poor model can improve multi-model skill. It seems that simple ensemble inflation methods cannot yield the same skill improvement. Using seasonal near-surface temperature forecasts from the DEMETER dataset, we show that the conclusions drawn from the toy-model experiments hold equally in a real multi-model ensemble prediction system. Copyright 2008 Royal Meteorological Society

KEY WORDS: DEMETER; inflation; probabilistic verification; seasonal predictions; toy model; under-dispersion

Received 6 April 2007; Revised 27 November 2007; Accepted 10 December 2007

1. Introduction

Weather and climate predictions are subject to many uncertainties and sources of forecast error. These can be grouped into two families: uncertainties in model initialization, for example due to incomplete data coverage, measurement errors, or inappropriate data-assimilation procedures; and uncertainties and errors in the model itself, for example due to the parametrization of physical processes, the effect of unresolved scales, or imperfect boundary conditions (Buizza et al., 2005; Schwierz et al., 2006; Weigel et al., 2007a). For dynamical models, the uncertainties in model initialization can be addressed by applying ensemble techniques, i.e. by repeatedly integrating the model forward in time from slightly-perturbed initial conditions (e.g. Kalnay, 2003), with the perturbations being designed so that they capture as much as possible of the underlying uncertainty. Sophisticated methods of ensemble generation have been developed and successfully implemented for operational numerical weather and short-range climate forecasts (e.g. Tracton and Kalnay, 1993; Toth et al., 1997; Molteni et al., 1996; Buizza, 1997; Pellerin et al., 2003), and they have found a wide range of application in weather and climate risk management. As far as model uncertainties are concerned, however, there is currently no theoretical concept that would provide accurate estimates of the corresponding probability distributions (Palmer, 2001). Three pragmatic approaches have been pursued to obtain at least a first crude estimate of the range of uncertainties induced by model error: the introduction of stochastic physics, i.e. the random perturbation of parametrized physical processes (Buizza et al., 1999); the perturbed-parameter approach, whereby a model is run several times with different settings of physical parameters (Pellerin et al., 2003); and the combination of several ensemble prediction systems to form a multi-model super-ensemble (Krishnamurti et al., 1999; Palmer et al., 2004).

The latter technique is often referred to as the multi-model ensemble approach, and is the focus of this study. Simple multi-model ensembles (MMEs) can be constructed by combining the individual ensemble forecasts with equal weights (Hagedorn et al., 2005). In more sophisticated approaches, the participating single-model ensembles (SMEs) are weighted according to their prior performance (e.g. Rajagopalan et al., 2002; Robertson et al., 2004; Doblas-Reyes et al., 2005; Stephenson et al., 2005). Regardless of which combination method has been applied, all these studies have shown that multi-model ensemble combination (MMEC) does increase prediction skill.

Hagedorn et al. (2005) have investigated the rationale behind the success of multi-model ensembles. Among other questions, the authors address the question of how it is possible that MMEs on average outperform single models in their skill, given that they take the information from all participating single models, including the less skilful ones. They conclude that the success of MMEC is mainly due to error cancellation and the nonlinearity of the skill metrics applied, i.e. to the fact that multi-model skill generally differs from the average skill of the participating SMEs. But why should one not simply use the best participating SME alone, rather than considering a compound of several SMEs, including the poorer ones? Or, to turn the question around: how is it possible that the addition of poorer models can enhance prediction skill? Hagedorn et al. (2005) argue that this question is wrongly posed, since it is not usually possible to identify a best or a poorest model from a set of models, as their individual strengths and flaws typically vary with forecasting context (location, predictand, initialization time, etc.). Indeed, for a given specific forecast at a given location, the multi-model can be expected to be outperformed by some of the participating SMEs. It is only in the long run, i.e. averaged over a sufficient number of grid points and forecast realizations, that the multi-model would outperform any single-model strategy. But what if we were able to identify, say, a poor model, i.e. a model that consistently performs worse than average over the whole range of prediction aspects? Hagedorn et al. (2005) state that such a model could not contribute to prediction skill. In other words, if it were always known which model was best, MMEC would not be able to outperform a simple best-model approach, i.e. a strategy that always selects the best SME available for any given forecast context.

It is the aim of the present study to revisit this question, and to demonstrate that there are conditions under which a multi-model consistently outperforms a best-model approach, and so really does enhance prediction skill, even under the assumptions that the best model can be identified and that the forecasts are perfectly calibrated in a climatological sense. This appears to be a paradox, because it implies that the skill of an SME may be enhanced by adding a consistently-poorer model to the ensemble forecasts. We seek to resolve this apparent paradox, and to identify the underlying mechanisms, both by applying a simple synthetic Gaussian forecast ensemble generator (a 'toy model') and by evaluating a real seasonal MME prediction system.

The paper is structured as follows. Section 2 provides a detailed description of the Gaussian toy model, as well as a summary of the model combination methods and the verification measure applied.
In Section 3, systematic toy-model experiments are presented, identifying the conditions under which MMEC really enhances prediction skill. In Section 4, the findings are substantiated with a real seasonal MME prediction system. Concluding remarks are given in Section 5.

2. Methods

2.1. The synthetic toy model

2.1.1. Formulation

The core of this study is based on a synthetic Gaussian generator of forecast–observation pairs. This toy model is designed in such a way that, for a given observation x, it generates an M-member ensemble forecast f(x) = (f_1, f_2, ..., f_M), fulfilling preset conditions with respect to forecast skill and ensemble properties. These conditions are controlled by two parameters, α and β, as described below. The toy model has the following properties:

1. Synthetic observations x are randomly sampled from a normally-distributed climate.
2. The corresponding M-member ensemble forecasts f(x) are generated in such a way that the forecasts have the same climatology as the verifying observations, i.e. the model climatology is unbiased and perfectly calibrated.
3. The average correlation coefficient between the forecast ensemble members and the observations is prescribed by a model parameter α.
4. Since in real operational prediction systems the ensembles are often overconfident (i.e. under-dispersive), meaning that the ensemble spread is too narrow while being centred at the wrong value, a second parameter β is introduced, prescribing the degree of ensemble overconfidence. A value of β = 0 corresponds to well-dispersed ensembles covering the entire range of uncertainties inherent to the predictions of a given correlation α. As β increases, ensemble under-dispersion increases.

A forecast generator fulfilling the above conditions is given by:

\[
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_M \end{pmatrix}
= \alpha x + \epsilon_\beta +
\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_M \end{pmatrix},
\qquad (1)
\]

with x ∼ N(0, 1), ε_β ∼ N(0, β) and ε_1, ..., ε_M ∼ N(0, √(1−α²−β²)). Here −1 ≤ α ≤ 1 and 0 ≤ β ≤ √(1−α²); the notation N(µ, σ) refers to a random number drawn from a normal distribution with mean µ and variance σ². For a given observation x and control parameters α and β, an ensemble forecast f is generated by multiplying the value of x by α and adding a vector of perturbations ε_1, ..., ε_M, as well as a scalar perturbation term ε_β. The perturbations are randomly sampled from the normal distributions specified above. The expression αx + ε_β determines the centre of the ensemble distribution. Ensemble spread is controlled by √(1−α²−β²), which is the standard deviation of the parent distribution from which the perturbations ε_1, ..., ε_M are sampled.

It is trivial that property 1 is fulfilled: the observations x are, by construction, sampled from a standardized normal distribution. The fulfilment of conditions 2 and 3 is shown in Appendix A. This leaves property 4, i.e. the meaning of the overconfidence parameter β. We begin by considering the case β = 0. For the toy model of Equation (1), this implies that ε_β = 0 and that the forecast ensembles are sampled from a normal distribution centred at αx and with standard deviation σ = √(1−α²). Such a situation is illustrated in Figure 1(a–c). For a given observation x and correlation coefficient α, the shaded curves show the distributions from which the forecast ensembles are sampled (henceforth referred to as ensemble parent distributions (EPDs)). If α = 0 (Figure 1(a)), i.e. in the case of zero correlation between forecasts and observations, the EPD is identical to climatology. As α increases (Figure 1(b, c)), the centre of the EPD shifts toward the value of the observation, while its spread becomes narrower. In the case of perfect correlation (α = 1, not shown), the EPD would be given by a delta function peaking at x. Note that, for given α and x, the EPD is uniquely determined, i.e. there is no uncertainty in its shape or location. Consequently, for β = 0, the EPDs quantify the full range of uncertainty inherent to ensemble predictions for a given correlation α. Ensembles that are randomly sampled from these EPDs can therefore be regarded as perfect (i.e. well-dispersed) ensembles. Note that α is often interpreted as a measure of potential predictability (Kharin and Zwiers, 2003).

Now consider the case of non-zero β. For given x and α, the EPDs (standard deviation σ = √(1−α²−β²)) become sharper as β increases. At the same time, there is now some uncertainty in the location of the EPD. The range of this uncertainty is controlled by ε_β, and grows as β increases. Thus, for a given α, the ensemble spread is too small and no longer represents the full amount of uncertainty inherent to the predictions: the ensembles are overconfident. The effect of the overconfidence parameter β is illustrated in Figure 1(d–f).

Figure 1. Illustration of the toy model. Upper panels: the effect of enhancing the correlation between the toy-model ensemble forecasts and the observations (parameter α). The shaded curves are ensemble parent distributions (EPDs) for a given observation x in the case of well-dispersed (i.e. β = 0) ensemble forecasts, with (a) α = 0, (b) α = 0.3 and (c) α = 0.65. As α increases, the EPDs become narrower, and their centre shifts toward x. Lower panels: the effect of adding overconfidence (parameter β). Panel (d) is the same as (c). Panels (e) and (f) display two realizations of highly overconfident ensemble forecasts with the same α as in (d), but with β = 0.7. The overconfident EPDs have reduced ensemble spread, but some uncertainty in their location, imposed by the randomly-sampled perturbation term ε_β. The solid black line is the climatological distribution. To clarify the illustration, the EPDs are drawn to different scales.
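The generator of Equation (1) is straightforward to implement. The following minimal NumPy sketch (our own illustrative code, not the authors'; the function name and settings are assumptions) generates toy-model ensembles and numerically checks the calibration and correlation properties listed above.

```python
import numpy as np

def toy_ensemble(x, alpha, beta, M=20, rng=None):
    """M-member toy-model forecast for one observation x, following Equation (1).

    alpha: correlation parameter; beta: overconfidence parameter.
    Requires alpha**2 + beta**2 <= 1 so that the member perturbations are real.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps_beta = rng.normal(0.0, beta)                             # random shift of the EPD centre
    eps = rng.normal(0.0, np.sqrt(1.0 - alpha**2 - beta**2), M)  # within-ensemble perturbations
    return alpha * x + eps_beta + eps

# Numerical check of properties 1-3 for an overconfident configuration.
rng = np.random.default_rng(0)
obs = rng.normal(size=50_000)                                    # property 1: observations ~ N(0, 1)
ens = np.array([toy_ensemble(x, alpha=0.65, beta=0.7, rng=rng) for x in obs])
print(ens.mean(), ens.var())                                     # ~0 and ~1: calibrated model climatology (property 2)
print(np.corrcoef(ens[:, 0], obs)[0, 1])                         # ~0.65: prescribed correlation (property 3)
```

Setting beta = 0 in the same function yields well-dispersed ensembles, so the sketch can also be used to reproduce the EPD behaviour illustrated in Figure 1.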

It is important to stress again that the two aspects characterizing overconfidence in this toy model, namely too-narrow ensemble spread and random location errors in the ensemble mean, are both controlled by the same parameter β, and are interdependent. Because of the requirement of a well-calibrated model climatology, it is not possible to generate forecast ensembles that are always at the right location but have the wrong ensemble spread. Equally, ensembles cannot be well-dispersed while being randomly displaced. Note also that, for a given correlation α, the ensemble spread cannot be increased beyond the perfect value of √(1−α²) without violating the constraint of a perfect model climatology. This means that over-dispersive (i.e. underconfident) forecast ensembles are not possible with this toy model.

2.1.2. Realism

Obviously, this toy model is based on very simplifying assumptions, and cannot represent the complexity that is characteristic of real ensemble predictions. When evaluating and discussing results obtained from it, it is important to be aware of its limitations with respect to reality. The main idealizations are discussed below.

Normality. The climatology and the ensemble distributions are assumed to be normally distributed. The toy model does not allow skewed distributions, which are typical for precipitation, for example; nor does it allow bimodal or multimodal distributions, which are also commonly found (Wilks, 2002), and whose detection actually represents one of the main hopes of ensemble forecasting. These limitations will not be evaluated here and are beyond the scope of this paper. However, since skewed unimodal distributions can easily be normalized, for example via Box–Cox transformations, as applied by Tippett et al. (2007), we believe that the toy-model results presented here can, at least in principle, be generalized to skewed unimodal distributions.

Stationary climatology. Both the observed and the model climatology are stationary. In reality, however, climatology reveals fluctuations and trends on many timescales (e.g. the seasonal cycle, the global warming trend), allowing the occurrence of record events that would have been considered impossible or highly unlikely in the past, such as the 2003 heatwave in Europe, as discussed by Schär et al. (2004). It is therefore a limitation of our toy model that it is not able to generate and predict events that are inconsistent with the prescribed, i.e. observed, climatology. On the other hand, an appropriate verification of forecasts and observations sampled from a non-stationary climatology is highly problematic. Indeed, Hamill and Juras (2006) have shown that non-stationarities can produce false prediction skill when there is actually no skill. Since a robust verification is essential in the present study, we only consider a stationary climatology here.

Well-calibrated model climatology. The toy-model climatology is well-calibrated, in the sense that each ensemble member has the same climatology as the observations. This is in contrast to real prediction systems, which usually reveal systematic biases in mean and variance. Often it is not possible to determine a model climatology at all, because there are not enough independent historical forecast samples. The impact of these uncertainties on the success of MMEC is therefore not covered in this study.
Note that, thanks to the well-calibrated model climatology, the ensemble means are distributed as N(0, √(α²+β²)), so that they have a lower variance than the observations, and for small α and β are very unlikely to be located on the tails of the climatology. Consequently, in deterministic forecast strategies that are based on ensemble-mean predictions, it might actually be desirable to have a model climatology that is extended beyond the known climatology, so that the ensemble means cover the true climatology and can also predict climatologically-rare events.

Stationary skill. As will be described in Section 3, the parameters α and β are kept constant for all samples that feed into one verification experiment. In other words, spread and correlation do not vary from sample to sample, nor do they depend on the value of x. This contrasts with reality, where spread does vary from case to case and where prediction skill is often found to be conditioned on the magnitude of the anomaly observed. Seasonal predictability in the Pacific region, for example, depends strongly on the strength of the ENSO forcing (Shukla et al., 2000). However, we do not think that this idealization has an impact on the general conclusions to be drawn in this study, since, at least in principle, any sufficiently-large set of verification samples can be stratified into subsets of constant skill and spread.

Predictable signal and observational errors. The central tendency of a well-dispersed ensemble distribution is often considered as the potentially-predictable signal µ (Kharin and Zwiers, 2003). It is a major simplifying assumption of the present toy model that it requires the signal to be given by αx, and thus to be causally determined by the verifying observation. In a more advanced setting, this simplification can be avoided by first sampling µ and then constructing a forecast–observation pair from µ: forecast ensembles would be constructed by analogy with Equation (1), i.e. f_n = µ + ε_β + ε_n, while a verifying observation would be obtained by adding an unpredictable observational-noise term ε_x to µ, i.e. x = µ + ε_x, with ε_x drawn from a normal distribution whose variance is chosen such that the observations retain the standardized climatology N(0, 1). By this approach, which in essence is equivalent to the statistical model presented by Kharin and Zwiers (2003), observational errors can also be accounted for. As will be pointed out later, this signal-based model leads to conclusions that are qualitatively equivalent to those of the simpler toy model applied in the present study.

2.2. Combination

Using the toy model, we can issue probabilistic categorical ensemble forecasts with respect to predefined forecast categories, by taking the proportion of ensemble members

falling into the respective forecast categories. In the following, we explain how we combine such single-model ensemble forecasts to form a multi-model. Consider N independent ensemble prediction systems, with ensemble sizes M_1, ..., M_N, which are to be combined. Assume that probabilistic forecasts are issued with respect to K forecast categories. Let p_k be the climatological probability of the event falling into category k, with k ∈ {1, ..., K}. Further, let m_{n,k} denote the number of ensemble members of the nth model that forecast the kth category; thus

\[ \sum_{k=1}^{K} m_{n,k} = M_n. \]

Finally, let m_{n,k}/M_n be the corresponding probability forecast issued by model n for category k. Two methods of combining such SME forecasts into an MME forecast y = (y_1, ..., y_K) are examined in this study. These methods are henceforth referred to as POOL and IGN. Both methods are well established and applied operationally: method POOL, for example, in the European Multimodel Seasonal to Interannual Prediction System (EUROSIP) (Vitart et al., 2007), and method IGN at the International Research Institute for Climate and Society (IRI) (Barnston et al., 2003).

2.2.1. Method POOL

In method POOL, MME forecasts are generated by simply pooling together the participating SMEs, with all ensemble members having equal weight (e.g. Hagedorn et al., 2005). The probabilistic multi-model forecast for the event to fall into the kth category, y_k^POOL, is then the proportion of all ensemble members of all participating SMEs that predict the kth category:

\[ y_k^{\mathrm{POOL}} = \frac{\sum_{n=1}^{N} m_{n,k}}{\sum_{n=1}^{N} M_n}. \qquad (2) \]

2.2.2. Method IGN

The second method is a more sophisticated approach, in that the participating SMEs are weighted according to their prior performance, and the climatological forecast (p_1, ..., p_K) is included as an additional model. Using this method, weighted probabilistic multi-model forecasts for the kth category, y_k^IGN, are formally constructed by:

\[ y_k^{\mathrm{IGN}} = w_0 p_k + \sum_{n=1}^{N} w_n \frac{m_{n,k}}{M_n}. \qquad (3) \]

Here w_0, ..., w_N are predetermined weights with \( \sum_{n=0}^{N} w_n = 1 \) and w_n ∈ [0, 1]. Rajagopalan et al. (2002) have derived a conceptually Bayesian justification for this approach, with climatology being the prior. To determine a set of optimum weights, we follow their suggestion and maximize the posterior likelihood function L(w_0, ..., w_N) defined over a common record of T multi-model forecasts and verifying observations. The function to be maximized is given by:

\[ L(w_0, w_1, ..., w_N) = \prod_{t=1}^{T} y^{\mathrm{IGN}}_{k^*(t)}(t), \qquad (4) \]

with k*(t) representing the category of the verifying observation at time t. Rajagopalan et al. (2002) and Robertson et al. (2004) have shown that this Bayesian methodology yields multi-model forecasts that generally outperform equal-weight multi-models constructed with the POOL method. As an optimization algorithm, we here apply the method of Byrd et al. (1995), a quasi-Newton method that allows box constraints, in order to fulfil the conditions \( \sum_{n=0}^{N} w_n = 1 \) and w_n ∈ [0, 1]. The reason for the name IGN is that the negative dual logarithm of Equation (4), divided by the record length T, is equivalent to the average ignorance score IGN, as introduced by Roulston and Smith (2002):

\[ -\frac{1}{T} \log_2 L = -\frac{1}{T} \sum_{t=1}^{T} \log_2 y^{\mathrm{IGN}}_{k^*(t)}(t) = \mathrm{IGN}. \qquad (5) \]

This means that maximizing the quantity in Equation (4) is equivalent to minimizing the ignorance of the multi-model forecasting system. The weights are thus determined so as to minimize the average information deficit ('ignorance') of a user who is in possession of the multi-model forecasts but does not know the true outcome.
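To illustrate the two combination methods, the following sketch (again our own illustrative code; function names and the grid step are assumptions) forms POOL and IGN forecasts from the member counts m_{n,k} of two models. In place of the quasi-Newton optimizer of Byrd et al. (1995) used in the study, the weights are fitted here by a simple brute-force search over the weight simplex, minimizing the average ignorance of Equation (5).

```python
import numpy as np
from itertools import product

def combine_pool(counts):
    """Equation (2): pool all members of all models with equal weight.
    counts: array of shape (N_models, K) holding the member numbers m_{n,k}."""
    m = np.asarray(counts, dtype=float)
    return m.sum(axis=0) / m.sum()

def combine_ign(counts, weights, clim):
    """Equation (3): weighted combination, with weights[0] attached to climatology."""
    m = np.asarray(counts, dtype=float)
    single = m / m.sum(axis=1, keepdims=True)          # single-model forecasts m_{n,k} / M_n
    return weights[0] * clim + weights[1:] @ single

def fit_weights(train_counts, obs_cat, clim, step=0.05):
    """Maximize the likelihood of Equation (4), i.e. minimize the ignorance of
    Equation (5), by a brute-force search over the weight simplex (two models)."""
    best_w, best_ign = None, np.inf
    for w0, w1 in product(np.arange(0.0, 1.0 + 1e-9, step), repeat=2):
        if w0 + w1 > 1.0:
            continue                                   # weights must sum to one
        w = np.array([w0, w1, 1.0 - w0 - w1])
        p_obs = [combine_ign(c, w, clim)[k] for c, k in zip(train_counts, obs_cat)]
        ign = -np.mean(np.log2(np.maximum(p_obs, 1e-12)))   # average ignorance score
        if ign < best_ign:
            best_w, best_ign = w, ign
    return best_w

# Tiny example: two 20-member models, K = 3 equiprobable categories, two training cases
clim = np.full(3, 1.0 / 3.0)
train_counts = [np.array([[12, 5, 3], [9, 7, 4]]),     # m_{n,k} at t = 1 (toy numbers)
                np.array([[2, 6, 12], [4, 5, 11]])]    # m_{n,k} at t = 2
obs_cat = [0, 2]                                       # categories of the verifying observations
w = fit_weights(train_counts, obs_cat, clim)
print(w)
print(combine_pool(train_counts[0]))                   # POOL forecast for t = 1
print(combine_ign(train_counts[0], w, clim))           # IGN forecast for t = 1
```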
2.3. Verification

Our verification is based on a modified version of the widely-used ranked probability skill score (RPSS) (Epstein, 1969; Murphy, 1969; Murphy, 1971). The classical RPSS is a squared measure comparing the cumulative probabilities of categorical forecast and observation vectors relative to a climatological forecast strategy. It is defined by:

\[ \mathrm{RPSS} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS_{Cl}} \rangle}, \qquad (6) \]

where

\[ \mathrm{RPS} = \sum_{k=1}^{K} (Y_k - O_k)^2 \quad \text{and} \quad \mathrm{RPS_{Cl}} = \sum_{k=1}^{K} (P_k - O_k)^2. \]

The angle brackets denote the average of the RPS and RPS_Cl values over a given number of forecast–observation pairs; Y_k is the kth component of a cumulative categorical forecast vector Y (whether from

an SME or an MME), and O_k is the kth component of the corresponding cumulative observation vector O against which the forecast is verified. That is, Y_k = \( \sum_{i=1}^{k} y_i \), with y_i being the probabilistic forecast for the event to fall in category i; and O_k = \( \sum_{i=1}^{k} o_i \), with o_i = 1 if the observation is in category i and o_i = 0 if the observation is in a category j ≠ i. Analogously, P_k is the cumulative climatological probability of the kth category. A more detailed description of the RPSS is provided in Wilks (2006). The RPSS is a favourable probabilistic skill score, in that it is sensitive to distance, i.e. a forecast is increasingly penalized the more its cumulative probability differs from the actual outcome. Moreover, the RPSS is strictly proper, meaning that it cannot be optimized by hedging the probabilistic forecasts toward other values against the forecaster's true belief.

A big caveat of the RPSS is its strong negative bias for small ensemble sizes (e.g. Buizza and Palmer, 1998; Richardson, 2001; Kumar et al., 2001; Mason, 2004). The reason for this bias is the intrinsic unreliability (Weigel et al., 2007b) of small ensembles, leading to inconsistencies in the formulation of the RPSS. However, particularly when the performance of multi-model prediction systems is to be assessed, it is important to know whether increases in prediction skill are due to a true gain in potentially-usable information, or whether they are simply an artefact of a negative bias that decreases as the ensemble grows. In other words, a bias-free skill score, i.e. one that is insensitive to ensemble size, is required. Müller et al. (2005) and Weigel et al. (2007b, 2007c) have derived a debiased version of the RPSS, the so-called RPSS_D, which lacks the RPSS's strong dependence on ensemble size while retaining its favourable properties, in particular strict propriety, making it the skill score of choice for the present study. If K equiprobable forecast categories are considered (as is the case in this study), the RPSS_D assumes a relatively simple analytic form, given by:

\[ \mathrm{RPSS_D} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS_{Cl}} \rangle + D_0 / M_{\mathrm{eff}}}, \qquad (7) \]

where D_0 = (K² − 1)/(6K), and M_eff is the effective ensemble size of the prediction system. If only a single model of ensemble size M is considered, then M_eff = M. If N models with ensemble sizes M_1, ..., M_N are combined, with all ensemble members having equal weight (method POOL), then M_eff is given by

\[ M_{\mathrm{eff}}^{\mathrm{POOL}} = \sum_{n=1}^{N} M_n. \]

For weighted MMEs constructed by method IGN, Weigel et al. (2007c) have shown that:

\[ M_{\mathrm{eff}}^{\mathrm{IGN}} = \left( \sum_{n=1}^{N} \frac{w_n^2}{M_n} \right)^{-1}. \]

3. Toy-model experiments

3.1. Methodology

The toy model described in Section 2.1 is now applied in order to study systematically how multi-model skill depends on correlation, overconfidence and the combination method applied. For all our experiments, the procedure applied is as follows. First, observations are randomly sampled from the normally-distributed climatology. Then, for each of these randomly-sampled observations, the toy model is applied with a given correlation coefficient α1 and overconfidence parameter β1, generating corresponding forecast ensembles. The ensemble size of the toy-model forecasts has been (arbitrarily) set to 20. The ensemble forecasts obtained are binned into three equiprobable, mutually-exclusive and exhaustive forecast categories. Probability forecasts for each of the three categories are calculated by taking the proportion of ensemble members falling into the respective bins.
This procedure is then repeated with (different or equal) toy-model parameters α2 and β2, yielding another set of probabilistic forecasts. From these two sets of SME forecasts, MME forecasts are finally constructed by applying methods POOL and IGN. The high sample size used in these experiments makes an analysis of statistical significance unnecessary. In reality, of course, verification sets are much smaller, introducing additional uncertainty (see also the discussion in Section 4.2).

3.2. RPSS_D as a function of α and β

Before looking at multi-models, we evaluate how the RPSS_D of a single toy model depends on correlation and overconfidence. Being a probabilistic skill score, the RPSS_D is sensitive to both the shape and the location of the ensemble distribution. This means that, for a given correlation coefficient α, the RPSS_D should favour a forecast that correctly estimates the underlying uncertainties, rather than one that has a narrow spread but is located at the wrong place: in other words, the RPSS_D should penalize overconfidence, with skill decreasing as β grows. Meanwhile, for a given value of β, the RPSS_D should increase as α increases: increasing the correlation coefficient between forecasts and observations should also increase prediction skill. Figure 2 shows the degree to which these favourable characteristics hold for the RPSS_D. Toy-model SME forecasts have been used to calculate the RPSS_D as a function of α for several values of β. Note that as β grows, the range of possible α values decreases, owing to the condition α² + β² ≤ 1. The figure shows that the RPSS_D does generally increase as α increases and β decreases. However, inconsistencies can arise for overconfident ensembles if the ensemble spread is extremely small or even zero, i.e. if √(1−α²−β²) ≈ 0 (open circles in Figure 2). Under these conditions, skill can drop despite growing correlation. Note that the same behaviour is observed if the classical RPSS rather than the

debiased RPSS_D is applied (not shown here). A further evaluation of this behaviour is left to future research. In our analyses, we avoid these extreme cases of vanishing ensemble spread and do not consider the parameter combinations indicated as open circles in Figure 2.

Figure 2. RPSS_D of the toy model as a function of potential predictability (correlation coefficient) α and overconfidence β. The open circles indicate parameter combinations that are excluded from the following analysis (see text in Section 3.2 for explanation).

3.3. Multi-models constructed from well-dispersed ensembles

In a first set of multi-model experiments, we combine forecasts that are generated with β = 0, and are therefore well-dispersed. Probabilistic SME forecasts from two models (henceforth referred to as Model 1 and Model 2) with parameters α1 and α2 are generated, with α1 and α2 each taking a set of prescribed values up to 1, including values for which the single models have no skill. As described above, these two models are then combined to form MMEs in such a way that all possible combinations of α1 and α2 are considered, using methods POOL and IGN. The skill of the probabilistic SME and MME forecasts is visualized on skill matrices (e.g. Figure 3), which display the RPSS_D as a function of α1 and α2, with α1 varying along the horizontal axis and α2 along the vertical axis. We introduce the following notation:

- MOD_1^{β=0}(α1, α2) = MOD_1^{β=0}(α1) is the skill matrix of probabilistic SME forecasts obtained from Model 1. It is independent of α2 by construction.
- MOD_2^{β=0}(α1, α2) = MOD_2^{β=0}(α2) is the skill matrix of probabilistic SME forecasts obtained from Model 2. It is independent of α1 by construction.
- BEST^{β=0}(α1, α2) = max[MOD_1^{β=0}(α1), MOD_2^{β=0}(α2)] is the matrix that, for given α1 and α2, selects the better of the two participating SME prediction systems.
- MEAN^{β=0}(α1, α2) = (1/2)[MOD_1^{β=0}(α1) + MOD_2^{β=0}(α2)] is the matrix that, for given α1 and α2, represents the average skill of Models 1 and 2.
- POOL^{β=0}(α1, α2) is the skill matrix for MME ensemble forecasts constructed from Models 1 and 2 by method POOL.
- IGN^{β=0}(α1, α2) is the skill matrix for MME ensemble forecasts constructed from Models 1 and 2 by method IGN.

The skill matrices MOD_1^{β=0}, MOD_2^{β=0} and POOL^{β=0} are shown in Figures 3(a), 3(b) and 4, respectively. The latter reveals a plausible structure, in that the skill is highest when both α1 and α2 are close to 1, and lowest when none of the participating SMEs has skill. We are interested firstly in whether there are combinations of α1 and α2 such that the multi-model has higher skill than any of the participating SMEs alone. We therefore calculate, for each combination of α1 and α2, the difference between POOL^{β=0} and BEST^{β=0}. The results are displayed in Figure 5(a). The matrix is zero along the diagonal (where α1 = α2) and negative elsewhere. Thus there are no combinations of α1 and α2 that would enhance MME prediction skill beyond the maximum of the two participating SMEs. This result is also evident in Figure 5(b), where all matrix elements of POOL^{β=0} are plotted against the corresponding elements of BEST^{β=0}. The points consistently lie underneath or on the bisecting line, indicating that POOL^{β=0} ≤ BEST^{β=0} always.

Figure 3. Toy-model experiments with well-dispersed model ensembles: (a) MOD_1^{β=0}, the skill matrix of Model 1; (b) MOD_2^{β=0}, the skill matrix of Model 2. The size of the dots is proportional to the skill (RPSS_D), as shown in the legend above.
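A single element of these skill matrices can be reproduced with a short Monte Carlo script. The sketch below (illustrative code under the assumptions of Sections 2.1 and 2.3; the tercile boundaries of the standard normal climatology are hard-coded) evaluates single-model and POOL skill for α1 = α2 = 0.5, once with well-dispersed ensembles (β = 0) and once, anticipating Section 3.4, with highly overconfident ensembles (β = 0.7).

```python
import numpy as np

rng = np.random.default_rng(1)
K, M, T = 3, 20, 100_000                       # categories, members per model, samples
edges = [-0.4307, 0.4307]                      # terciles of N(0, 1): equiprobable categories

def forecasts(x, alpha, beta):
    """Toy-model ensembles (Equation (1)) for a vector of observations x."""
    centre = alpha * x + rng.normal(0.0, beta, x.shape)
    return centre[:, None] + rng.normal(0.0, np.sqrt(1 - alpha**2 - beta**2), (x.size, M))

def cat_probs(ens):
    """Probability forecasts: proportion of members per category (Section 3.1)."""
    c = np.searchsorted(edges, ens)
    return np.stack([(c == k).mean(axis=1) for k in range(K)], axis=1)

def rpss_d(y, x, m_eff):
    """Debiased RPSS of Equation (7) for K equiprobable categories."""
    o = np.eye(K)[np.searchsorted(edges, x)]
    rps = ((np.cumsum(y, 1) - np.cumsum(o, 1)) ** 2).sum(1).mean()
    rps_cl = ((np.cumsum(np.full(K, 1 / K)) - np.cumsum(o, 1)) ** 2).sum(1).mean()
    return 1 - rps / (rps_cl + (K**2 - 1) / (6 * K) / m_eff)

a1 = a2 = 0.5                                   # one point on the matrix diagonal
for beta in (0.0, 0.7):                         # well-dispersed vs highly overconfident
    x = rng.normal(size=T)
    y1, y2 = cat_probs(forecasts(x, a1, beta)), cat_probs(forecasts(x, a2, beta))
    y_pool = 0.5 * (y1 + y2)                    # Equation (2) for equal ensemble sizes
    print(f"beta={beta}:",
          f"MOD1={rpss_d(y1, x, M):.3f}",
          f"MOD2={rpss_d(y2, x, M):.3f}",
          f"POOL={rpss_d(y_pool, x, 2 * M):.3f}")
```

For β = 0 the pooled forecast should score no better than the single models, whereas for β = 0.7 it should exceed both, in line with the results discussed below and in Section 3.4.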

It is interesting to note that for any combination of α1 and α2, MME skill is greater than or equal to the mean skill of the two participating SMEs, i.e. POOL^{β=0}(α1, α2) ≥ MEAN^{β=0}(α1, α2). This is evident from Figure 6, which shows the difference between POOL^{β=0} and MEAN^{β=0}. Thus, considering the skill average over all combination experiments carried out and displayed on the matrices, POOL MMEs on average outperform both Model 1 and Model 2, even though there is not a single parameter combination (α1, α2) where the multi-model would be better than both participating SMEs. So the frequently-reported observation that MMEs on average outperform SMEs, regardless of whether temporal or spatial skill averages are considered, does not imply that MMEC really enhances prediction skill, i.e. that the multi-model is also best for the individual forecasts feeding into the verification. The skill gain may simply be an averaging effect due to the nonlinearity of the skill metric applied (Hagedorn et al., 2005).

Figure 4. Toy-model experiments with well-dispersed model ensembles: POOL^{β=0}, the skill matrix of the multi-model constructed with method POOL. The size of the dots is proportional to the skill (RPSS_D), as shown in the legend above.

Figure 5. Toy-model experiments with well-dispersed model ensembles. (a) POOL^{β=0} − BEST^{β=0}, the pixel-wise difference between multi-model skill (as shown in Figure 4) and the maximum of the two single-model skill matrices shown in Figure 3. The size of the dots is proportional to the skill (RPSS_D), as shown in the legend above. (b) All matrix elements of POOL^{β=0}, plotted as a function of the corresponding matrix elements of BEST^{β=0}.

Figure 6. Toy-model experiments with well-dispersed model ensembles: POOL^{β=0} − MEAN^{β=0}, the difference between multi-model skill (as shown in Figure 4) and the average of the two single-model skill matrices shown in Figure 3. The size of the dots is proportional to the skill (RPSS_D), as shown in the legend above.

We proceed with the IGN method of MMEC to investigate whether multi-models become more successful if a more sophisticated combination algorithm is applied. The results are displayed analogously to Figures 4 and 5: Figure 7 shows the IGN^{β=0} skill matrix, Figure 8(a) shows the difference matrix IGN^{β=0} − BEST^{β=0}, and Figure 8(b) shows all matrix elements of IGN^{β=0} plotted against the corresponding matrix elements of BEST^{β=0}. If both α1 and α2 are negative, i.e. if both MOD_1^{β=0} and MOD_2^{β=0} are negative, the IGN multi-model consistently has zero skill (matrix elements in the lower-left corner

of IGN^{β=0} in Figure 7). This means that, in contrast to the POOL method, there are combinations of α1 and α2 where the multi-model appears to outperform both Model 1 and Model 2 (black dots in the lower-left corner of Figure 8(a)). However, it can be shown that this is simply a consequence of the fact that method IGN is a Bayesian method starting with a climatological prior. In other words, if neither Model 1 nor Model 2 has prediction skill, then the combination algorithm assigns zero weight to the two models and simply issues the climatological forecast, i.e. w_1 = w_2 = 0 and w_0 = 1 in Equation (3). Thus, the gain in prediction skill observed for negative α1 and α2 in Figure 8(a) is not an effect of MMEC per se, but rather of the IGN algorithm's ability to fully ignore Model 1 and Model 2 in this case. If, on the other hand, at least one of α1 and α2 is positive, then the matrix elements of IGN^{β=0} are almost identical to those of BEST^{β=0} (see Figure 8). Indeed, a weight analysis (not shown here) reveals that the combination algorithm assigns almost all weight to the better of Model 1 and Model 2. Thus, the IGN method clearly outperforms the POOL method, and does not spoil good forecasts by adding poor ones. However, the central conclusion drawn from the POOL experiments also holds for the IGN method: namely, that for well-dispersed ensemble forecasts there are no combinations of α1 and α2 for which the multi-model has higher skill than the best participating model alone.

Figure 7. Toy-model experiments with well-dispersed model ensembles: IGN^{β=0}, the skill matrix of the multi-model constructed with method IGN. The size of the dots is proportional to the skill (RPSS_D), as shown in the legend above.

Figure 8. As Figure 5, but applying method IGN to construct the multi-model: (a) IGN^{β=0} − BEST^{β=0}; (b) all matrix elements of IGN^{β=0}, plotted as a function of the corresponding matrix elements of BEST^{β=0}.

3.4. Multi-models constructed from highly overconfident model ensembles

What changes if highly overconfident SMEs are combined, rather than well-dispersed ones as before? To answer this question, the combination experiments described above are repeated, but with a positive overconfidence parameter of β = 0.7. By analogy with the notation used above, the resulting skill matrices are denoted MOD_1^{β=0.7}, MOD_2^{β=0.7}, BEST^{β=0.7}, POOL^{β=0.7} and IGN^{β=0.7}. Note that the range of possible values of α is now limited because of the condition α² + β² ≤ 1 in Equation (1). We choose α1 and α2 from a correspondingly reduced set of values; the combination α1 = α2 = 0.7 is omitted for the reason mentioned in Section 3.2. The skill matrices of overconfident SME forecasts, MOD_1^{β=0.7} and MOD_2^{β=0.7}, are shown in Figure 9, revealing that for given values of α1 and α2 the skill is significantly reduced with respect to the well-dispersed ensemble forecasts (β = 0) of Figure 3. This is consistent with the conclusions drawn from Figure 2, where it is revealed that the RPSS_D penalizes overconfidence. The change in skill due to multi-model combination, obtained both with the POOL method and with the IGN method, is displayed in the same way as before: Figure 10(a) shows the difference matrix POOL^{β=0.7} − BEST^{β=0.7}, while in Figure 10(b) all matrix elements of POOL^{β=0.7} are plotted against the corresponding elements of BEST^{β=0.7}. Analogously, Figure 11(a) shows

the difference matrix IGN^{β=0.7} − BEST^{β=0.7}, and Figure 11(b) displays the matrix elements of IGN^{β=0.7} as a function of the elements of BEST^{β=0.7}. The outcome is fundamentally different from what was shown above in the case of well-dispersed model ensembles: for both the POOL and the IGN multi-models, there now are combinations of α1 and α2 such that the multi-model really has higher skill than any of the participating models alone. This is evident from the black regions in the difference matrices of Figures 10(a) and 11(a), and from the fact that the scatter plots of Figures 10(b) and 11(b) exhibit points well above the bisecting line. If the POOL method is applied, the biggest skill improvement is located along the matrix diagonal, i.e. where α1 ≈ α2. However, there are also parameter combinations (α1, α2) such that the addition of a consistently-poor model of no skill to a clearly-better model of positive skill can further improve prediction skill (visible, for example, in the off-diagonal regions of Figure 10(a) where one of the two correlation parameters is small and the other is around 0.6). If the IGN method is applied, all matrix elements have higher skill than the single models alone (again with the constraint that for negative α1 and α2 the gain in skill occurs because the climatological forecast receives all the weight).

3.5. Discussion

In summary, the toy-model experiments described above show that MMEC can enhance prediction skill beyond the best participating single model, but only if the single models are overconfident. This conclusion holds regardless of which combination algorithm is applied. How can this be understood? To illustrate the underlying mechanism as simply as possible, without loss of generality we here consider only the equally-weighted POOL combination method, and assume that α1 = α2, i.e. that the two models to be combined have the same potential predictability.

Figure 9. As Figure 3, but with overconfident model ensembles: (a) MOD_1^{β=0.7}; (b) MOD_2^{β=0.7}.

Figure 10. As Figure 5, but with overconfident model ensembles: (a) POOL^{β=0.7} − BEST^{β=0.7}; (b) all matrix elements of POOL^{β=0.7} plotted as a function of the corresponding matrix elements of BEST^{β=0.7}.

Figure 11. As Figure 8, but with overconfident model ensembles: (a) IGN^{β=0.7} − BEST^{β=0.7}; (b) all matrix elements of IGN^{β=0.7} plotted as a function of the corresponding matrix elements of BEST^{β=0.7}.

We start with well-dispersed SMEs (i.e. β = 0). For any toy-model forecast, the EPD from which the ensemble members are sampled is then uniquely determined (Equation (1)), since the perturbation term ε_β is zero (see Figure 1(a–c)). Given that here the two participating single models have the same α and refer to the same observation x, the two SMEs to be combined are sampled from the same EPD. In other words, the only effect of MMEC in this case is to enhance the sample size, and thus to provide a better estimate of the EPD: an effect that is not captured by the RPSS_D, which is insensitive to ensemble size. In essence, the EPD of the MME is identical to the EPD of the participating SMEs.

Now consider overconfident SMEs (i.e. β > 0). Again, for simplicity assume that α1 = α2. In contrast to the case of the well-dispersed ensemble, the two EPDs of the participating single models are not uniquely determined. They have the same well-defined spread (with standard deviation √(1−α²−β²)), but their locations (i.e. their mean values) are randomly perturbed by ε_β (see Equation (1) and Figure 1(e, f)). Thus, in the overconfident case, MMEC does modify the shape of the resulting multi-model EPD, since the two participating single-model EPDs are generally not identical. This is illustrated in Figure 12(a, b) for a randomly-sampled observation x and α = 0.65 and β = 0.7. What happens if more and more overconfident single models with the same α and β are added to the multi-model, thus providing more and more random samples of ε_β? Following the line of argument presented in Appendix A, it is easy to show that for an observation x, the multi-model EPD then approaches the distribution N(αx, √(1−α²)). This is identical to the EPD of a well-dispersed single model with β = 0. In other words, the combination of independent overconfident models widens the MME spread while reducing the error in the ensemble location. The larger the number of overconfident models contributing to the MME, the more the multi-model EPD loses its overconfidence characteristics in favour of the characteristics of well-dispersed ensembles (illustrated in Figure 12(c) and (d) for multi-models consisting of 3 and of a larger number of single models, respectively). Note that the same effect can be observed if the signal-based toy model discussed at the end of Section 2.1.2 is applied: in this case also, the EPDs of well-dispersed ensembles are all centred at the same value (namely at µ) and are therefore uniquely determined, while the EPDs of overconfident forecasts reveal random displacements ε_β around µ that average out as more and more models are combined.

Given that probabilistic skill scores penalize overconfidence, the paradox of skill improvement due to MMEC can now be understood: MMEC reduces the errors in ensemble location while widening ensemble spread, thus reducing the overconfidence penalty of the RPSS_D, and thus improving net prediction skill. The correlation α, however, is not improved. From this it follows that, at least in this simple toy-model framework, the theoretical upper limit of skill achievable by MMEC is given by the skill of a well-dispersed single model that has the same correlation coefficient α but lacks the overconfidence penalty.
This is confirmed in Figure 13, which shows how the average RPSS D of MME predictions changes as the number of contributing single models increases: for all participating single models being well-dispersed (grey line, all SMEs have α = 0.65 and β = 0); and for all participating SMEs being overconfident (black line, all SMEs have α = 0.65 and β = 0.7). As expected, for well-dispersed model ensembles, skill is independent of the number of models involved, while for overconfident model ensembles, skill grows monotonically, and asymptotically approaches that of a well-dispersed ensemble. In this sense, it appears meaningful to regard α as a measure of potential predictability, as suggested by Kharin and Zwiers (2003). Note that this conclusion has been reached under the assumption that the contributing SMEs have the same α. While this assumption is not too unrealistic, since

in real multi-model prediction systems the correlation skills of the participating single models are relatively similar (Yoo and Kang, 2005), the situation of course changes if two overconfident models of different α (with α1 > α2, say) are combined. In that case, the average correlation coefficient of the MME members is lower than α1, meaning that the gain in skill due to reduced overconfidence is accompanied by a loss of skill due to reduced correlation. Thus, the greater the difference between α1 and α2, the more difficult it is for the multi-model to beat the best participating single model (see, for example, the upper-left and lower-right corners in Figure 10(a)).

Figure 12. The effect of multi-model combination of overconfident (β = 0.7) ensemble forecasts. For a given observation x and a correlation coefficient α = 0.65, the panels show, from left to right, how a typical SME parent distribution (shaded) becomes modified as 2, 3 and a larger number of equally overconfident models are included in the forecast. The multi-model EPD eventually assumes the shape of a well-dispersed ensemble forecast, as shown in Figure 1(c). The solid black line is the climatological distribution. The EPDs are scaled differently for illustrative purposes.

Figure 13. RPSS_D of multi-model ensemble forecasts as a function of the number of participating single models, for overconfident ensembles (β = 0.7, black) and well-dispersed ensemble forecasts (β = 0, grey). All participating models have the same correlation coefficient α = 0.65. The skill values plotted are obtained from a large number of random observations and corresponding multi-model ensemble forecasts.

3.6. Deterministic interpretation

As was stated in Section 2.1, overconfident toy-model ensembles are characterized by too-narrow spread and a random additive displacement error of the ensemble mean. Both these features are equivalent manifestations of overconfidence, and affect each other. Consequently, a deterministic forecast evaluation, i.e. an evaluation that only considers ensemble-mean errors, should lead to conclusions concerning MMEC similar to those of the fully probabilistic verification presented above, which considers both ensemble spread and mean errors. We generally prefer the probabilistic view, since it is physically more meaningful and considers all available forecast information, but in the present simple toy-model context both views are equivalent. Indeed, one of the reviewers has provided a convincing deterministic argument on the effects of MMEC. It is based on the mean squared errors (MSEs) of ensemble means, and is outlined below.

For simplicity, assume that the ensemble size is infinite. In that case, the ensemble mean f̄ of an SME is identical to the centre of the EPD, and is given by:

\[ \bar{f} = \alpha x + \epsilon_\beta. \qquad (8) \]

The average MSE of the SME means, i.e. their squared distance from the verifying observations x, is then given by:

\[ \mathrm{MSE} = (\alpha - 1)^2 + \beta^2, \]

assuming that x and ε_β are not correlated with each other. Now assume that an MME is constructed from N such SMEs by applying method POOL. Let the participating SMEs have the same overconfidence parameter β, but different correlation coefficients α_1, ..., α_N. Assume further that the ε_β values of the N SMEs, ε_β(1), ..., ε_β(N), are sampled independently.
The mean f̄_MME of the MME is then given by:

\[ \bar{f}_{\mathrm{MME}} = \frac{1}{N} \sum_{n=1}^{N} \left( \alpha_n x + \epsilon_\beta(n) \right), \qquad (9) \]

and the average multi-model MSE becomes:

\[ \mathrm{MSE_{MME}} = (\bar{\alpha} - 1)^2 + \frac{\beta^2}{N}, \qquad (10) \]

where

\[ \bar{\alpha} = \frac{1}{N} \sum_{n=1}^{N} \alpha_n \]

is the mean of the values of α of the single models. From Equation (10) one can directly draw conclusions

analogous to those presented above for the probabilistic view:

- If β = 0, the best single model (with α = α_max) always has an MSE less than or equal to that of the multi-model, and therefore accuracy greater than or equal to that of the multi-model, because ᾱ ≤ α_max. This means that MMEC does not enhance prediction skill beyond the best participating SME.
- If β ≠ 0, then the MME can have a lower MSE, and therefore a higher accuracy, than any participating SME alone, especially if the values of α are of similar magnitude for all SMEs. In that case, the first term on the right-hand side of Equation (10) is hardly affected by MMEC, while the second term decreases in inverse proportion to the number of participating SMEs, thus reducing the MSE.

In this deterministic view, the addition of more and more overconfident models, i.e. models with random additive errors of their ensemble means, continuously improves the forecast accuracy, which asymptotically approaches the MSE one would obtain for well-dispersed SMEs (with β = 0).

3.7. Can ensemble inflation have the same effect as multi-model combination?

Given that the enhanced prediction skill of multi-models is due to a reduction in the overconfidence penalty rather than an improvement in potential predictability, the question arises as to whether the same effect could be obtained by inflating overconfident single-model forecasts, i.e. by an appropriate recalibration. To address this question, we return to the probabilistic view, and apply an inflation method described by Doblas-Reyes et al. (2005) and Kharin and Zwiers (2003). In essence, this method widens the ensemble spread from its actual value of √(1−α²−β²) to a value of √(1−α²), i.e. from the observed spread to the spread that would be assumed if the ensemble forecasts were not overconfident. The ensembles are thus inflated by a factor

\[ \gamma = \frac{\sqrt{1 - \alpha^2}}{\sigma_{\mathrm{ens}}}, \]

where σ_ens is the average observed ensemble spread. Simultaneously, the ensemble means are shifted toward the climatological mean by a factor δ in such a way that the forecast climatology remains well-calibrated. By equating the variances of the raw and inflated forecasts, one obtains δ = α/σ_em, where σ_em is the standard deviation of the ensemble-mean forecasts. The calibrated version of the toy model is then:

\[
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_M \end{pmatrix}
= \delta (\alpha x + \epsilon_\beta) + \gamma
\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_M \end{pmatrix},
\qquad (11)
\]

where

\[ \gamma = \sqrt{\frac{1 - \alpha^2}{1 - \alpha^2 - \beta^2}} \quad \text{and} \quad \delta = \frac{\alpha}{\sqrt{\alpha^2 + \beta^2}}. \]

As described above, the factor δ < 1 reduces the inter-ensemble variance so that the forecast climatology remains identical to the observation climatology. However, this implies that ensemble inflation is coupled with a reduction in the effective correlation between the individual forecast-ensemble members and the observations, from α to δα. In other words, while one may expect some beneficial effect on skill due to a reduction in overconfidence, one would at the same time experience a decrease in skill due to reduced correlation. Thus, the net skill improvement due to inflation depends on the respective magnitudes of these two contrary effects. If α is very small, one would expect a strong net gain in skill, since a priori there is hardly any correlation that could be further reduced. As α grows, however, the absolute loss in effective correlation increases, thus limiting the benefit of inflation. MMEC is fundamentally different, in that it gradually widens ensemble spread and moves the ensemble mean toward truth without reducing the correlation (assuming that all participating models have similar correlation).
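The contrast between inflation and multi-model combination can also be checked numerically. The following sketch (illustrative code; the printed diagnostics are simple deterministic proxies for the probabilistic comparison of Figure 14) applies the inflation factors γ and δ of Equation (11) to an overconfident toy-model ensemble and compares it with a two-model POOL combination: inflation restores the climatological variance but lowers the member–observation correlation from α to δα, whereas pooling keeps the correlation at α and reduces the β² contribution to the ensemble-mean error by a factor of two (Equation (10)).

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, M, T = 0.65, 0.7, 20, 100_000

x = rng.normal(size=T)                              # observations ~ N(0, 1)

def raw_ensembles(x, a, b):
    """Overconfident toy-model ensembles (Equation (1)) for all observations."""
    return ((a * x + rng.normal(0.0, b, x.size))[:, None]
            + rng.normal(0.0, np.sqrt(1 - a**2 - b**2), (x.size, M)))

f1, f2 = raw_ensembles(x, alpha, beta), raw_ensembles(x, alpha, beta)

# Inflation, Equation (11): shrink the ensemble-mean part by delta, widen the spread by gamma
delta = alpha / np.sqrt(alpha**2 + beta**2)
gamma = np.sqrt((1 - alpha**2) / (1 - alpha**2 - beta**2))
mean1 = f1.mean(axis=1, keepdims=True)              # finite-ensemble estimate of alpha*x + eps_beta
f_infl = delta * mean1 + gamma * (f1 - mean1)

f_pool = np.concatenate([f1, f2], axis=1)           # two-model POOL combination

for name, f in [("raw SME", f1), ("inflated SME", f_infl), ("2-model POOL", f_pool)]:
    em = f.mean(axis=1)
    print(f"{name:13s}"
          f"  member-obs corr ~ {np.corrcoef(f[:, 0], x)[0, 1]:.2f}"
          f"  ens-mean MSE ~ {np.mean((em - x) ** 2):.2f}"
          f"  climatological var ~ {f.var():.2f}")
```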
Again, note that the same conclusion can be obtained from the signal-based model described at the end of Section 2.1.2, where any ensemble inflation must be accompanied by a reduction in the signal amplitude µ to keep the model climatology well-calibrated, thus reducing the correlation between forecasts and observations. MMEC, on the other hand, would also leave µ unchanged here.

This conclusion is illustrated in Figure 14. For an overconfidence parameter of β = 0.7, the graph shows the RPSS_D as a function of α for raw non-inflated SME forecasts, inflated SME forecasts, MME forecasts (POOL method) based on two overconfident SMEs of equal α, and MME forecasts (POOL method) based on an infinite number of overconfident SMEs of equal α. The latter are equivalent to well-dispersed SMEs with β = 0 (see above).

Figure 14. Effect of ensemble inflation, in comparison with multi-model combination, for overconfident (β = 0.7) ensemble forecasts: RPSS_D as a function of correlation coefficient α, for: raw SME forecasts; inflated SME forecasts; MME forecasts (POOL method) based on two overconfident SMEs; and MME forecasts (POOL method) based on an infinite number of overconfident SMEs. The latter are equivalent to well-dispersed SMEs with β = 0.

As expected, the plot shows that inflation does enhance skill with respect to raw SMEs; however, the net gain in skill drops as α grows. Multi-models, on the other hand, reveal a uniform skill improvement over the whole range of α values, with the skill improvement growing as the number of participating SMEs grows. Even a simple MME consisting of only two models outperforms the inflated SME for α > 0.15. On the other hand, ensemble inflation inhibits negative skill scores. These results suggest that, for climatologically well-calibrated models, ensemble inflation is most effective when the potential predictability is low and when only SME forecasts are available. This conclusion is still not general enough to be transferred to real forecasts, first because it is based on only one exemplary calibration method (there are other common methods that have not been evaluated here, such as the reliability correction described in Toth et al. (2003)), and secondly because it is based on a very idealized toy model. Indeed, in the present setting, the success of MMEC is closely linked to the independence of the model-error terms ε_β. In real prediction systems, model errors tend to be correlated (e.g. Yoo and Kang, 2005), raising the question as to whether MMEC would still outperform simple calibration approaches under real conditions. The study of Doblas-Reyes et al. (2005) suggests that it does, at least in the context of seasonal forecasts, but a more general investigation of this issue is left to future research.

4. Application to real seasonal forecast data

So far, all results have been obtained on the basis of a simple Gaussian-type toy model. It is the aim of this section to show that the conclusions drawn also hold for real (seasonal) multi-model ensemble predictions.

4.1. Models applied

In the following, ensemble forecasts of two operational seasonal prediction systems are evaluated and combined: ECMWF's System 2 (S2) (Anderson et al., 2003), and the UK Met Office's GloSea (GS) (Gordon et al., 2000). Hindcast data for these two models are obtained from the DEMETER database (Palmer et al., 2004). Although this database comprises hindcasts of seven different models, we have restricted ourselves to two models, in order to be consistent with the toy-model simulations and to keep the interpretation of the results as simple as possible. Moreover, the robustness of the weight estimations for method IGN drops drastically as the number of models to be combined increases (Robertson et al., 2004). We consider hindcasts of mean summer near-surface (2 m) temperature, averaged over the months June, July and August. All hindcasts have been started from 1 May initial conditions. The hindcast period comprises 40 years. Data are verified grid-point-wise against the corresponding observations from the 40-year ECMWF re-analysis (ERA-40) dataset (Uppala et al., 2005). Both the forecasts and the verifying observations are interpolated onto a common grid. As in the toy-model experiments above, three equiprobable categories are considered. The terciles separating the three categories are determined from the hindcast and observation data separately.

4.2. Seasonal prediction skill

Ensemble forecasts of the two models are combined into MMEs by applying both the methods POOL and IGN. To reduce the sampling variability of the model weights in the IGN method, a nine-point binomial spatial smoother is applied for the optimization procedure, as described by Robertson et al. (2004).
All calculations are carried out in a one-year-out cross-validation mode (Wilks, 2006). This means that, for each year to be verified, the target year is eliminated from the computation of observation terciles, model terciles and optimum model weights. Skill is calculated grid-point-wise for the two single models, as well as for the POOL and IGN multi-model ensembles, using the debiased RPSS_D. The resulting skill maps are shown in Figure 15, and the optimum weights obtained from the IGN combination are displayed in Figure 16.

Most notably, both S2 (Figure 15(a)) and GS (Figure 15(b)) reveal large areas with negative skill, i.e. areas where the models perform worse than if the forecast ensembles were simply randomly sampled from climatology, particularly in the extratropics. While the extent and magnitude of these areas are reduced for POOL MME forecasts (Figure 15(c)), they are almost totally eliminated for IGN MME forecasts (Figure 15(d)). This is because of the IGN method's ability to ignore poor ensemble forecasts and to issue the climatological forecast instead (see Figure 8(a), lower-left corner). Indeed, Figure 16(c) shows that in the regions of poor single-model skill, climatology receives almost all the weight.

For completeness, spatially-averaged values of single-model and multi-model skill are provided in Table I for a number of regions, applying both the debiased RPSS_D and the classical RPSS. With only 40 hindcast years available, the sample size is very small compared to the toy-model experiments described above, giving rise to uncertainties in the skill estimates obtained. Indeed, a comprehensive quantitative verification requires confidence intervals over the skill estimates (Joliffe, 2007), a task that is particularly complicated in the present case, where not only temporal averages over scores but also spatial averages over statistically-dependent grid points are considered, so that the number of independent forecast–observation pairs is not known (see also the discussion in Weigel et al. (2007c)). Since a quantitative comparison of methods IGN and POOL has already been provided by Rajagopalan et al. (2002) and is not central to this paper, we omit an estimate of confidence intervals. However, the figures in Table I appear to confirm the conclusions of Rajagopalan et al. (2002): that on regionally-averaged areas multi-models have higher skill than the participating SMEs alone, and that IGN MMEs always outperform POOL MMEs. These conclusions appear to hold for both the classical RPSS and the debiased RPSS_D.
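For illustration, the regional averages of Table I could be computed from a grid-point skill map along the following lines (a hypothetical sketch only; the averaging scheme is not specified in the text, so simple area weighting with the cosine of latitude is assumed here):

import numpy as np

def regional_mean_skill(skill, lats, lons, lat_range, lon_range):
    # skill: 2-D array (n_lat, n_lon) of grid-point skill values;
    # lats, lons: 1-D coordinate vectors in degrees.
    lat_mask = (lats >= lat_range[0]) & (lats <= lat_range[1])
    lon_mask = (lons >= lon_range[0]) & (lons <= lon_range[1])
    box = skill[np.ix_(lat_mask, lon_mask)]
    # Assumed area weighting: cos(latitude), identical for all longitudes.
    weights = np.repeat(np.cos(np.deg2rad(lats[lat_mask]))[:, None],
                        lon_mask.sum(), axis=1)
    return np.average(box, weights=weights)

# Hypothetical usage for the 'Europe' box of Table I (15W-45E, 35N-70N),
# assuming longitudes run from -180 to 180:
# regional_mean_skill(skill_map, lats, lons, (35, 70), (-15, 45))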

Figure 15. Skill maps (RPSS_D) for real seasonal forecasts (June, July, August) of 2 m temperature, with a lead time of one month, obtained from the DEMETER database for the period … . Skill is evaluated for: (a) S2; (b) GS; (c) the multi-model constructed with method POOL (equal weights); and (d) the multi-model constructed with method IGN (optimum weights).

Figure 16. Optimum weights obtained from method IGN for the forecasting context described in Figure 15: (a) weights of the S2 model; (b) weights of the GS model; (c) weights attributed to climatology.

4.3. Local skill improvement and overconfidence

Given the results of Section 3.3 (in particular Figure 6), the success of MMEC on regionally-averaged areas, as suggested by Table I, does not imply that the multi-model also outperforms every participating single model locally, i.e. at each individual grid point. We therefore proceed with an evaluation on a grid-point basis. In particular, we want to investigate whether the link between skill improvement and overconfidence observed with the toy model is also visible with real seasonal ensemble predictions.
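Such a grid-point comparison amounts to a simple mask over the skill maps; a minimal sketch of the criterion applied below (cf. Figure 17(a)) is given here, with hypothetical array names and Python/NumPy assumed:

import numpy as np

def local_improvement_mask(skill_s2, skill_gs, skill_mme):
    # Grid points where the IGN multi-model has positive skill AND beats the
    # better of the two participating single models.
    best_single = np.maximum(skill_s2, skill_gs)
    return (skill_mme > 0.0) & (skill_mme > best_single)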

Figures 10 and 11 from the toy-model experiments have shown that skill improvement due to MMEC is more clearly visible for IGN multi-models than for POOL multi-models. While the latter enhance prediction skill only if α1 ≈ α2 (a prerequisite that is not fulfilled in the general case), the IGN multi-models yield skill improvement over the whole range of combinations (α1, α2). For this reason, we henceforth consider IGN MMEs only.

Figure 17(a) shows those grid points where the multi-model skill (IGN) of the seasonal forecasts is positive and exceeds that of the best single model (i.e. both S2 and GS). Thus, it excludes those grid points where skill improvement is only due to the IGN method's ability to issue the climatological forecast in the case of poor ensemble forecasts. Despite the considerable scatter, which is not too surprising given the comparatively short verification period of 42 years (i.e. 42 samples), the grid points where IGN locally outperforms all SMEs are clearly organized in larger-scale structures, predominantly along the tropical belt. Apart from tropical Africa and South America, these grid points are mainly found over the oceans. It should be mentioned here that a similar analysis based on POOL ensembles reveals a much noisier picture, which makes it difficult to identify these regions (not shown).

How does this pattern of skill improvement relate to overconfidence? To obtain an estimate of the overconfidence characteristics of the two participating SMEs, we assume Gaussian behaviour of the observed and simulated seasonal averages of near-surface temperature, and fit the toy-model parameters α and β to the joint series of model hindcasts and verifications. The Gaussian assumption is admittedly a very simplifying one, but it can be justified as a first rough approximation, given the variable considered (Wilks, 2002, 2006) and the small number of verification samples (Scherrer et al., 2006). The procedure is as follows. First, the ERA40 observations are linearly and grid-point-wise scaled so that the climatology of the scaled observations has zero mean and unit variance. This means that, if x̄ is the mean value of all observations x at a given grid point and σ²_x is the corresponding variance, the scaled observations x' are determined from:

x' = \frac{x - \bar{x}}{\sigma_x}.   (12)

Table I. Average skill of single- and multi-model ensembles in various regions, for seasonal forecasts (June, July, August) of 2 m temperature, with a lead time of one month, obtained from the DEMETER database for the period … . The table shows skill measured with the debiased RPSS_D for ECMWF's System 2 (S2), the Met Office's GloSea (GS), and multi-models constructed with methods POOL (equal weights) and IGN (optimum weights). The classical RPSS is given in parentheses. Skill values are given as percentages. For all regions considered, the highest skill occurs with method IGN.

Region       Longitude      Latitude     S2 (%)        GS (%)         POOL (%)      IGN (%)
Global       180°W–180°E    85°S–85°N    9.5 (−0.6)    5.9 (−4.5)     1 (5.6)       11.1 (9.4)
Tropics      180°W–180°E    20°S–20°N    2 (12.3)      18.3 (9.3)     23.8 (19.6)   24.9 (22.7)
Europe       15°W–45°E      35°N–70°N    5.4 (−5.1)    3.5 (−7.2)     5.7 (…)       7.0 (5.3)
Russia       40°E–180°E     40°N–80°N    2.0 (−8.8)    3.4 (−14.8)    1.5 (−0.4)    2.3 (1.4)
Asia         60°E–140°E     0°–40°N      10.9 (…)      6.4 (−3.9)     11.6 (6.7)    12.9 (11.1)
Australia    100°E–180°E    50°S–10°N    12.6 (1.9)    8.8 (−3.3)     14.1 (9.3)    16.2 (12.8)
Africa       20°W–50°E      40°S–35°N    10.4 (…)      8.2 (−2.0)     11.9 (7.0)    12.7 (10.7)
N. America   140°W–60°W     15°N–75°N    7.1 (−3.2)    3.6 (−7.1)     7.9 (2.7)     8.4 (6.9)
S. America   90°W–30°W      60°S–15°N    13.2 (3.6)    13.1 (3.4)     16.4 (11.8)   17.6 (15.6)

Figure 17. (a) Regions where the multi-model locally outperforms both participating single models, and (b) regions where the participating single models are overconfident, using the seasonal forecasting context described in Figure 15. The black pixels in (a) indicate grid points where the multi-model (using method IGN, i.e. optimum weights) locally outperforms both participating single models. The black pixels in (b) indicate grid points where both participating single models are highly overconfident (β > 0.63). The grey areas are regions of zero or negative multi-model skill: at these grid points, skill improvement is dominated by the IGN method's ability to fully ignore poor ensemble forecasts and to issue the climatological forecast instead.

An analogous linear standardization is carried out for the S2 and GS model climatologies. At each grid point (lon, lat), the average correlation coefficients of the two models with the observations, α_S2(lon, lat) and α_GS(lon, lat), are calculated. The corresponding overconfidence values β_S2(lon, lat) and β_GS(lon, lat) are estimated grid-point-wise from the variance of the respective (scaled) ensemble means, as shown below. Since the DEMETER hindcasts have a small ensemble size, sampling errors of the ensemble means need to be considered. For a given standardized forecast ensemble f, let σ̂_ens be the ensemble spread, let êm be the observed ensemble mean, and let em_∞ be the population mean, i.e. the ensemble mean that would be obtained if there were infinitely many ensemble members. The sampling error ε_S of êm has a variance of var(ε_S) = σ̂²_ens/M (e.g. Bronstein and Semendjajew, 1985), where M is the ensemble size (in the present case M = 9). An evaluation of many forecast–observation pairs then yields, for the variance of the observed ensemble means (noting that ⟨êm⟩ = 0 after the standardization, and that em_∞ and ε_S are uncorrelated):

\mathrm{var}(\hat{em}) = \langle (em_\infty + \epsilon_S - \langle \hat{em} \rangle)^2 \rangle = \langle em_\infty^2 \rangle + \langle \epsilon_S^2 \rangle = \mathrm{var}(em_\infty) + \frac{1}{M}\,\hat{\sigma}_{ens}^2.   (13)

In the toy-model context of Equation (1), we have var(em_∞) = var(αx + ε_β) = α² + β². From this and Equation (13), we obtain:

\beta = \sqrt{ \mathrm{var}(\hat{em}) - \frac{1}{M}\,\hat{\sigma}_{ens}^2 - \alpha^2 }.   (14)

Figure 18 shows maps of the estimated values for β_S2 and β_GS. Those grid points of Figure 18 in which both participating single models simultaneously reveal high overconfidence are displayed in Figure 17(b). As a threshold defining high overconfidence, a subjective value of β = 0.63 is chosen, so as to obtain a proportion of grid points similar to that in Figure 17(a). Figures 17(b) and 18 show that there actually are regions where both models are highly overconfident, and that these regions are organized in structures very similar to those where MMEC locally outperforms both SMEs (Figure 17(a)). Indeed, both local skill improvement and high overconfidence are predominantly observed over the tropical oceans, tropical Africa and South America. More detailed patterns, such as the oval structure over Brazil or the three-band structure over the equatorial Atlantic, are also visible in both maps.

This apparent link between skill improvement and overconfidence is reinforced by Figure 19, which is the real-case version of Figures 8(b) and 11(b): multi-model skill values of all grid points are plotted against the respective maximum skill values of the two participating single models. The grey points in Figure 19(a) represent those grid points where both S2 and GS are highly overconfident, exceeding a threshold of 0.63, while the grey points in Figure 19(b) represent those grid points where both S2 and GS are almost well-dispersed, with a β value of less than 0.2. Most of the highly-overconfident points are greater than zero and lie above the bisection line, indicating a true skill improvement over the best SME due to model combination, while the well-dispersed ensembles are consistently below the bisection line, indicating that model combination cannot locally enhance prediction skill if the forecasts are not overconfident. Thus, despite the comparatively short verification record available, and despite the very simplifying assumptions concerning the Gaussian behaviour of observations and forecasts, the results support the conclusions drawn from the toy-model experiments described above, and indicate a clear link between overconfidence and the success of MMEC.
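The fit of α and β just described (Equations (12)–(14)) can be sketched as follows. This is a hedged illustration rather than the authors' code: Python/NumPy and the array names are assumptions, and reading the "average correlation" as the member-wise correlation averaged over members is one plausible interpretation.

import numpy as np

def fit_alpha_beta(obs, ens):
    # obs: (n_years,)   verifying ERA40 values at one grid point
    # ens: (n_years, M) hindcast ensemble members at the same grid point
    M = ens.shape[1]                                   # ensemble size (M = 9 here)

    # Equation (12): standardize observations and model values with their
    # respective climatological means and standard deviations.
    obs_s = (obs - obs.mean()) / obs.std()
    ens_s = (ens - ens.mean()) / ens.std()

    # Correlation with the observations (in the toy model, corr(f_m, x) = alpha).
    alpha = np.mean([np.corrcoef(obs_s, ens_s[:, m])[0, 1] for m in range(M)])

    em_hat = ens_s.mean(axis=1)                        # observed ensemble means
    var_em = em_hat.var()                              # var(em_hat) in Equation (13)
    sig2_ens = ens_s.var(axis=1, ddof=1).mean()        # mean intra-ensemble variance

    # Equation (14): beta^2 = var(em_hat) - sigma_ens^2 / M - alpha^2
    beta2 = var_em - sig2_ens / M - alpha ** 2
    return alpha, np.sqrt(max(beta2, 0.0))             # clipped at zero (sampling noise)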
This is even more remarkable since, in the real forecasting context, the sources of overconfidence may overlap, resulting in ε_β values for the participating SMEs that are not fully independent. Finally, note that Yoo and Kang (2005) have carried out a related study on seasonal predictions of precipitation, using a multi-model composed of nine Asian and American models and applying the ensemble-mean correlation as a skill metric. They came to a similar result: they also showed that the success of multi-model combination is closely linked to random model error (which in our framework is equivalent to overconfidence), and they also identified the Tropics as the region where model errors are largest and where MMEC therefore appears to be most beneficial.

Figure 18. Overconfidence of real seasonal prediction systems. Shown are the fitted β values for (a) the S2 model and (b) the GS model, for the forecasting context described in Figure 15.

Figure 19. As Figure 8(b), but with real data, for the seasonal forecasting context described in Figure 15. Multi-model skill (RPSS_D, combined from S2 and GS using method IGN) is plotted against the corresponding maximum single-model skill. Each point represents one grid point of the global map shown in Figure 18. The grey points in (a) indicate those grid points where both participating single models are highly overconfident (β > 0.63), while the grey points in (b) indicate that both participating models are almost well-dispersed (β < 0.2).

5. Conclusions

In this study we have revisited the frequently-raised question whether MMEC can really enhance prediction skill, i.e. whether a multi-model can perform better than the best single model available, assuming that there is a best model and that it can be identified. We have approached this question by applying a climatologically well-calibrated Gaussian generator of forecast–observation pairs, with two free parameters controlling the characteristics of the forecast ensembles: the correlation with the observations, which controls the potential predictability of the forecasting system; and the degree of overconfidence (i.e. ensemble under-dispersion), which controls the ensemble's ability to capture the full amount of uncertainty inherent to the prediction. Because of the design of the toy model, overconfidence is always coupled with a random additive error in ensemble location. The toy model has been used to systematically generate forecast ensembles of varying characteristics, and to combine them into MMEs, applying two combination methods (weighted and unweighted). The gain in prediction skill due to MMEC has been measured using the RPSS_D, a skill score that is insensitive to ensemble size. Being a probabilistic skill score, the discrete RPSS rewards good correlation and penalizes overconfidence.

The central conclusion of this study is that MMEC can indeed enhance prediction skill, so as to outperform the best participating SME, but only if the SMEs are overconfident, i.e. the ensembles are too sharp while being centred at the wrong value. This observation does not depend on the combination algorithm applied. The effect of MMEC is to gradually widen the ensemble spread and to move the ensemble mean toward truth without reducing the potential predictability. This is in contrast to simple inflation methods, which destroy potential predictability. In other words, MMEC can remedy overconfidence while retaining the correlation between forecasts and observations. This leads to a decrease in the overconfidence penalty imposed by the probabilistic skill metric, and thus to an improvement in overall skill. It is under these conditions that even the addition of a consistently poorer model can enhance multi-model skill, as long as the model's poor performance is due to high overconfidence rather than low potential predictability.
On the other hand, the combination of well-dispersed SMEs does not enhance skill. Indeed, the toy-model simulations suggest that well-dispersed SME forecasts can be considered as an upper limit of the skill that is potentially achievable by MMEC. One can also put it the other way round: MMEC can enhance prediction skill, but only as long as the participating SMEs fail to capture the full amount of forecast uncertainty, i.e. fail to have the correct ensemble spread and to be centred at the right value. As soon as the models provide correct probability estimates, MMEs will no longer be able to beat the best participating SME. An analysis of the mean squared errors of the ensemble means leads to an equivalent conclusion.

While these conclusions have primarily been drawn on the basis of an idealized toy model, an evaluation of real dynamical ensemble forecast systems has confirmed the link between overconfidence and the success of MMEC. Indeed, by diagnosing the overconfidence of the participating models, it appears to be possible to identify a priori those regions that can benefit most from MMEC.

In a real operational forecast setting, it may be difficult or impossible to know which model is the best, since model performance typically varies with region, season, predictand, or even lead time.

In addition, the best model is usually not the same for all variables. In such a setting, MMEC can provide a pragmatic method for issuing an optimum forecast. Given that most operational centres produce ensemble forecasts anyway, the multi-model approach is inexpensive in an operational setting.

Acknowledgements

This study was supported by the Swiss National Science Foundation through the National Centre for Competence in Research (NCCR) Climate, and by the ENSEMBLES project (EU FP6 contract GOCE-CT-…). We would like to thank three anonymous reviewers for helpful comments on an earlier version of this paper.

Appendix A: Properties of the toy model

Here we show that the toy model of Equation (1) satisfies properties 2 and 3 stated in Section 2.1.1, namely that the model climatology is identical to the observation climatology and that the correlation coefficient between forecasts and observations is α.

Consider three statistically-independent random variables (Z_1, Z_2, Z_3) with Z_i ~ N(0, σ_i). Let Z_tot = Z_1 + Z_2 + Z_3. It can then be shown (e.g. Bronstein and Semendjajew, 1985) that

Z_{tot} \sim N\!\left(0, \sqrt{\sigma_1^2 + \sigma_2^2 + \sigma_3^2}\right).   (A.1)

Formally, for any m ∈ {1, ..., M}, the toy-model forecasts f_m can be considered as a sum of three such random variables, f_m = Z_1 + Z_2 + Z_3, with Z_1 = αx, Z_2 = ε_β and Z_3 = ε_m. From this and Equation (A.1), it follows that

f_m \sim N\!\left(0, \sqrt{\sigma^2(Z_1) + \sigma^2(Z_2) + \sigma^2(Z_3)}\right) = N\!\left(0, \sqrt{\alpha^2 + \beta^2 + (1 - \alpha^2 - \beta^2)}\right) = N(0, 1).

In other words, the toy-model climatology is identical to the climatology of the observations. It can then be shown that the correlation coefficient ρ between f_m and x is given by

\rho(f_m, x) = \frac{\mathrm{cov}(f_m, x)}{\sqrt{\sigma^2(f_m)\,\sigma^2(x)}} = \mathrm{cov}(\alpha x, x) = \alpha.
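These two properties are easy to check numerically. The following sketch is an illustration only: it assumes the toy-model form f_m = αx + ε_β + ε_m with var(ε_β) = β² and var(ε_m) = 1 − α² − β², as reconstructed above, and it assumes Python/NumPy. It draws a large sample and verifies that the forecast climatology is standard normal and that the member–observation correlation equals α.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, M, n = 0.5, 0.4, 9, 200000

x     = rng.standard_normal(n)                            # observations ~ N(0, 1)
eps_b = beta * rng.standard_normal(n)                     # shared model error, var = beta^2
eps_m = np.sqrt(1.0 - alpha**2 - beta**2) * rng.standard_normal((n, M))
f     = alpha * x[:, None] + eps_b[:, None] + eps_m       # toy-model ensemble members

print(f.std())                        # close to 1: forecast climatology is N(0, 1)
print(np.corrcoef(x, f[:, 0])[0, 1])  # close to alpha: member-observation correlation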
References

Anderson DLT, Stockdale T, Balmaseda MA, Ferranti L, Vitart F, Doblas-Reyes FJ, Hagedorn R, Jung T, Vitart A, Troccoli A, Palmer T. Comparison of the ECMWF seasonal forecast systems 1 and 2, including the relative performance for the 1997/8 El Niño. ECMWF Tech. Memo 404, 93 pp.
Barnston AG, Mason SJ, Goddard L, DeWitt DG, Zebiak SE. Multimodel ensembling in seasonal climate forecasting at IRI. Bull. Am. Meteorol. Soc. 84:
Bronstein IN, Semendjajew KA. Taschenbuch der Mathematik, 22nd edition. Verlag Harri Deutsch: Thun and Frankfurt (Main), 840 pp.
Buizza R. Potential forecast skill of ensemble prediction, and spread and skill distributions of the ECMWF Ensemble Prediction System. Mon. Weather Rev. 125:
Buizza R, Palmer TN. Impact of ensemble size on ensemble prediction. Mon. Weather Rev. 126:
Buizza R, Miller M, Palmer TN. Stochastic representation of model uncertainties in the ECMWF Ensemble Prediction System. Q. J. R. Meteorol. Soc. 125:
Buizza R, Houtekamer PL, Toth Z, Pellerin G, Wei M, Zhu Y. A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Mon. Weather Rev. 133:
Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16:
Doblas-Reyes FJ, Hagedorn R, Palmer TN. The rationale behind the success of multi-model ensembles in seasonal forecasting. Part II: Calibration and combination. Tellus A57:
Epstein ES. A scoring system for probability forecasts of ranked categories. J. Appl. Meteorol. 8:
Gordon C, Cooper C, Senior CA, Banks H, Gregory JM, Johns TC, Mitchell JFB, Wood RA. The simulation of SST, sea ice extents and ocean heat transports in a version of the Hadley Centre coupled model without flux adjustments. Clim. Dyn. 16:
Hagedorn R, Doblas-Reyes FJ, Palmer TN. The rationale behind the success of multi-model ensembles in seasonal forecasting. Part I: Basic concept. Tellus A57:
Hamill TM, Juras J. Measuring forecast skill: is it real skill or is it the varying climatology? Q. J. R. Meteorol. Soc. 132:
Joliffe IT. Uncertainty and inference for verification measures. Weather and Forecasting 22:
Kalnay E. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press: 341 pp.
Kharin VV, Zwiers FW. Improved seasonal probability forecasts. J. Climate 16:
Krishnamurti TN, Kishtawal CM, LaRow TE, Bachiochi DR, Zhang Z, Williford CE, Gadgil S, Surendran S. Improved weather and seasonal climate forecasts from multimodel superensemble. Science 285:
Kumar A, Barnston AG, Hoerling MP. Seasonal predictions, probabilistic verifications, and ensemble size. J. Climate 14:
Mason SJ. On using climatology as a reference strategy in the Brier and ranked probability skill scores. Mon. Weather Rev. 132:
Molteni F, Buizza R, Palmer TN, Petroliagis T. The new ECMWF Ensemble Prediction System: methodology and validation. Q. J. R. Meteorol. Soc. 122:
Müller WA, Appenzeller C, Doblas-Reyes FJ, Liniger MA. A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate 18:
Murphy AH. On the ranked probability skill score. J. Appl. Meteorol. 8:
Murphy AH. A note on the ranked probability skill score. J. Appl. Meteorol. 10:
Palmer TN. A nonlinear dynamical perspective on model error: A proposal for non-local stochastic-dynamic parameterization in weather and climate prediction models. Q. J. R. Meteorol. Soc. 127:
Palmer TN, Alessandri A, Anderson U, Cantelaube P, Davey M, Délécluse P, Déqué M, Díez E, Doblas-Reyes FJ, Feddersen H, Graham R, Gualdi S, Guérémy J-F, Hagedorn R, Hoshen M, Keenlyside N, Latif M, Lazar A, Maisonnave E, Marletto V, Morse AP, Orfila B, Rogel P, Terres J-M, Thomson MC. Development of a European multimodel ensemble system for seasonal-to-interannual prediction (DEMETER). Bull. Am. Meteorol. Soc. 85:
Pellerin G, Lefaivre L, Houtekamer P, Girard C. Increasing the horizontal resolution of ensemble forecasts at CMC. Nonlinear Processes in Geophys. 10:
Rajagopalan B, Lall U, Zebiak SE. Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. Mon. Weather Rev. 130:
Richardson DS. Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Q. J. R. Meteorol. Soc. 127:
Robertson AW, Lall U, Zebiak SE, Goddard L. Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. Mon. Weather Rev. 132:
Roulston MS, Smith LA. Evaluating probabilistic forecasts using information theory. Mon. Weather Rev. 130:
Schär C, Vidale PL, Lüthi D, Frei C, Häberli C, Liniger MA, Appenzeller C. The role of increasing temperature variability in European summer heatwaves. Nature 427:
Scherrer SC, Appenzeller C, Liniger MA. Temperature trends in Switzerland and Europe: implications for climate normals. Int. J. Climatol. 26:
Schwierz C, Appenzeller C, Davies HC, Liniger MA, Müller W, Stocker TF, Yoshimore M. Challenges posed by and approaches to the study of seasonal-to-decadal climate variability. Clim. Change 79:
Shukla J, Anderson J, Baumhefner D, Brankovic C, Chang Y, Kalnay E, Marx L, Palmer T, Paolino D, Ploshay J, Schubert S, Straus D, Suarez M, Tribbia J. Dynamical seasonal prediction. Bull. Am. Meteorol. Soc. 81:
Stephenson DB, Coelho CAS, Doblas-Reyes FJ, Balmaseda M. Forecast assimilation: a unified framework for the combination of multi-model weather and climate predictions. Tellus 57A:
Tippett MK, Barnston AG, Robertson AW. Estimation of seasonal precipitation tercile-based categorical probabilities from ensembles. J. Climate 20:
Toth Z, Kalnay E, Tracton S, Wobus R, Irwin J. A synoptic evaluation of the NCEP ensemble. Weather and Forecasting 12:
Toth Z, Talagrand O, Candille G, Zhu Y. Probability and ensemble forecasts. Pp. … in Forecast Verification: A Practitioner's Guide in Atmospheric Science, Joliffe IT, Stephenson DB (eds). John Wiley & Sons.
Tracton MS, Kalnay E. Ensemble forecasting at NMC: Practical aspects. Weather and Forecasting 8:
Uppala SM, Kållberg PW, Simmons AJ, Andrae U, da Costa Bechtold V, Fiorino M, Gibson JK, Haseler J, Hernandez A, Kelly GA, Li X, Onogi K, Saarinen S, Sokka N, Allan RP, Andersson E, Arpe K, Balmaseda MA, Beljaars ACM, van de Berg L, Bidlot J, Bormann N, Caires S, Chevallier F, Dethof A, Dragosavac M, Fisher M, Fuentes M, Hagemann S, Hólm E, Hoskins BJ, Isaksen L, Janssen PAEM, Jenne R, McNally AP, Mahfouf J-F, Morcrette J-J, Rayner NA, Saunders RW, Simon P, Sterl A, Trenberth KE, Untch A, Vasiljevic D, Viterbo P, Woollen J. The ERA-40 re-analysis. Q. J. R. Meteorol. Soc. 131:
Vitart F, Huddleston MR, Déqué M, Peake D, Palmer TN, Stockdale TN, Davey MK, Ineson S, Weisheimer A. Dynamically-based seasonal forecasts of Atlantic tropical storm activity issued in June by EUROSIP. Geophys. Res. Lett. 34: L
Weigel AP, Chow FK, Rotach MW. 2007a. The effect of mountainous topography on moisture exchange between the surface and the free atmosphere. Boundary-Layer Meteorol. 125:
Weigel AP, Liniger MA, Appenzeller C. 2007b. The discrete Brier and ranked probability skill scores. Mon. Weather Rev. 135:
Weigel AP, Liniger MA, Appenzeller C. 2007c. Generalization of the discrete Brier and ranked probability skill scores for weighted multimodel ensemble forecasts. Mon. Weather Rev. 135:
Wilks DS. Smoothing forecast ensembles with fitted probability distributions. Q. J. R. Meteorol. Soc. 128:
Wilks DS. Statistical Methods in the Atmospheric Sciences, 2nd edition. International Geophysics Series, volume 91. Academic Press: 627 pp.
Yoo JH, Kang I-S. Theoretical examination of a multi-model composite for seasonal prediction. Geophys. Res. Lett. 32: L18707.


1 Prior Probability and Posterior Probability Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

More information