Limitations of Indicator Kriging for Predicting Data with Trend


Andreas Papritz
ETH Zurich, Department of Environmental Sciences, Zurich, Switzerland

Abstract. Goovaerts and Journel [8] proposed simple indicator kriging with varying local means (siklm) as a way to extend the indicator kriging methodology to variates with an apparent spatial trend. However, contrary to the authors' implications, the detrended indicators, i.e., the indicator residuals, are not stationary, and their covariance structure cannot be estimated without bias from a single realization of a random process. Ignoring the non-stationary nature of the covariance of the indicators ruins the usual mean square optimality of kriging. Therefore, siklm is an ad-hoc procedure, which lacks optimality, and its use should be discouraged.

1 INTRODUCTION

According to ISI Web of Science®, about 20 journal articles and 0 contributions to conference proceedings have been published about indicator kriging (IK for short) to date. Many of these studies deal with mapping the probability that a spatial variable exceeds a threshold [e.g., 3, 11]. This is an important problem in environmental surveillance and monitoring. Some studies apply IK to data with an apparent trend, following advice by Goovaerts and Journel [8] and Goovaerts [7], sec.

Unfortunately, Goovaerts' simple IK with varying local means (siklm for short) is not feasible in practice, as it asks for the modelling of non-stationary covariances. By "in practice" I mean the case where we consider our measurements as a sample from a single realisation of a random process. The same problem arises if IK is used for data that show unbounded variograms.

To substantiate my contention, I highlight and discuss here the limitations of siklm that arise from basic probability theory. Notwithstanding their elementary nature, these limitations seem to have been frequently ignored.
I further demonstrate by a simulation that siklm lacks the usual mean square optimality of kriging, which leads me to discourage the use of siklm.

2 COVARIANCES OF INDICATOR TRANSFORMS OF NON-STATIONARY VARIATES

Let $Z(s)$ denote a real-valued random variable used for modelling an attribute $z$ measured at location $s$, and let $I(s; z_c)$ for a specific cut-off $z_c$ be the indicator transform: $I(s; z_c) = 1$ if $Z(s) \le z_c$ and $I(s; z_c) = 0$ otherwise. Then

$$\mathrm{E}[I(s; z_c)] = \mathrm{Prob}[Z(s) \le z_c] = F(s; z_c), \tag{1}$$

$$\mathrm{Var}[I(s; z_c)] = F(s; z_c)\,\bigl(1 - F(s; z_c)\bigr), \tag{2}$$

where $F(s; z)$ is the cumulative distribution function (cdf) of $Z(s)$, $\mathrm{E}[\cdot]$ and $\mathrm{Var}[\cdot]$ are the expectation and variance operators, and $\mathrm{Prob}[A]$ denotes the probability of

event $A$. Further, let $\mathrm{Cov}[\cdot]$ and $\mathrm{Cor}[\cdot]$ denote the covariance and correlation operators. The (cross-)covariance function of the indicators for two cut-offs $z_c$ and $z_{c'}$, $C_I(s, s+h; z_c, z_{c'}) = \mathrm{Cov}[I(s; z_c), I(s+h; z_{c'})]$, is related to the bivariate cumulative distribution function, $F(s, s+h; z_c, z_{c'}) = \mathrm{Prob}[Z(s) \le z_c, Z(s+h) \le z_{c'}]$, of $Z(s)$ and $Z(s+h)$ by [e.g., 10]

$$C_I(s, s+h; z_c, z_{c'}) = F(s, s+h; z_c, z_{c'}) - F(s; z_c)\,F(s+h; z_{c'}). \tag{3}$$

For a random process with stationary bivariate distributions, equations (1)–(3) simplify to

$$\mathrm{E}[I(s; z_c)] = F(z_c), \tag{4}$$

$$\mathrm{Var}[I(s; z_c)] = F(z_c)\,\bigl(1 - F(z_c)\bigr), \tag{5}$$

$$C_I(h; z_c, z_{c'}) = F(h; z_c, z_{c'}) - F(z_c)\,F(z_{c'}). \tag{6}$$

Clearly, the right-hand sides of (4)–(6) do not depend on $s$, and $C_I(\cdot)$ is a function of the lag $h$ only. Notice that equation (4) means that the expectations of the random variables, say $\mathrm{E}[Z(s)] = \mu(s)$, must not vary in space. Otherwise, the cdfs would not be constant. Furthermore, equations (4)–(6) show that we may (at least hope to) infer the first two moments of the indicators when we have data from only one realisation of $\{Z(s)\}$. To estimate the expectations and (cross-)covariances of the indicators, we replace the averaging over multiple realisations by averaging over space. Spatial averaging, however, is inappropriate in the general case of non-stationary distributions, i.e., for models with moments given by equations (1)–(3).

In spite of the above, Goovaerts and Journel [8] proposed to extend the IK methodology to random processes with spatially varying $\mu(s)$. They called their method simple IK with varying local means. The terms simple IK with local prior means [7], soft IK [9] and IK with external drift [2] have since also been used to denote the approach. Apparently, the authors realized that the indicators have non-stationary (co-)variances if $\mu(s)$ varies spatially.
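To make the dependence of equations (1) and (2) on $s$ concrete, the following minimal sketch evaluates both moments for a Gaussian $Z(s)$ with unit variance. The piecewise constant mean function, its two values, the break at $s = 100$ and the function names are illustrative assumptions, not taken from the paper.

```python
import math


def normal_cdf(x):
    """Standard normal cdf, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def piecewise_mean(s):
    """Hypothetical piecewise constant mean mu(s); the values are illustrative."""
    return 0.0 if s < 100 else 1.5


def indicator_moments(s, cutoff=0.0, sd=1.0):
    """Return (E[I(s; cutoff)], Var[I(s; cutoff)]) as in equations (1) and (2)."""
    f = normal_cdf((cutoff - piecewise_mean(s)) / sd)
    return f, f * (1.0 - f)


for s in (50, 150):
    mean_i, var_i = indicator_moments(s)
    print(f"s={s}: E[I]={mean_i:.4f}, Var[I]={var_i:.4f}")
```

Under these assumptions the two locations have different indicator means and variances, so pooling indicator data from both sides of the break mixes quantities that are not comparable.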
Given an estimate, $\hat{F}(s; z_c)$, of the cdf, they proposed to estimate the variogram of $I(s; z_c)$ by fitting model functions to the sample variogram,

$$\hat{\gamma}_R(s_i; h_k; z_c, z_c) = \frac{1}{2\,N(h_k)} \sum_{i=1}^{N(h_k)} \{r(s_i; z_c) - r(s_i + h_k; z_c)\}^2, \tag{7}$$

of the indicator residuals $r(s; z_c) = i(s; z_c) - \hat{F}(s; z_c)$, where $i(s; z_c)$ is the indicator transform of a measurement and $N(h_k)$ is the number of data pairs in lag class $h_k$. Unfortunately, they failed to recognize that half the expected squared difference of the indicator residuals, i.e., their semivariance, is not independent of $s$, even if (unrealistically) the true cdf is assumed to be known, i.e., if $\hat{F}(s; z_c) = F(s; z_c)$:

$$\begin{aligned}
\tfrac{1}{2}\,\mathrm{E}\bigl[\{R(s; z_c) - R(s+h; z_c)\}^2\bigr] &= \tfrac{1}{2}\,\mathrm{Var}[R(s; z_c) - R(s+h; z_c)] \\
&= \tfrac{1}{2}\,\bigl\{F(s; z_c)\,(1 - F(s; z_c)) + F(s+h; z_c)\,(1 - F(s+h; z_c))\bigr\} \\
&\quad - \bigl\{F(s, s+h; z_c, z_c) - F(s; z_c)\,F(s+h; z_c)\bigr\}.
\end{aligned} \tag{8}$$

As above, $F(s; z_c)$ and $F(s, s+h; z_c, z_c)$ are functions of $s$ in the non-stationary case. Hence, the right-hand side of equation (8) still depends on $s$. Grouping the observed

indicator residuals into lag classes and computing a sample variogram by the customary method-of-moments estimator render it meaningless in this instance.

[Figure 1 (panel titles: piecewise constant trend, nugget 0.1; left panel: attribute $Z(s)$, $\mathrm{E}[Z(s)]$ and cut-off against location $s$; right panel: indicator $I(s; 0)$, $\mathrm{E}[I(s; 0)]$ and $\mathrm{Var}[I(s; 0)]$ against location $s$)]
Figure 1: Two realisations, shown in red and blue, of a Gaussian random process with a piecewise constant mean function and a cubic variogram with nugget (left panel) and the corresponding indicator transforms of the simulated data for the cut-off $z_c = 0$ (right panel) (solid lines: expectations of the random variables; dotted lines: cut-off [left] and variances of indicator random variables [right]).

The indicator transforms of $\{Z(s)\}$ with constant $\mu(s)$ but unbounded variogram have non-stationary covariances, too. To see this, we consider a Gaussian, zero-order intrinsic process $\{Z(s)\}$, $s \in \mathbb{R}$, with a linear variogram, $\gamma(h) = h$. Two increments, say $\Delta Z(s) = Z(s) - Z(0)$ and $\Delta Z(t) = Z(t) - Z(0)$, are then normally distributed with variances $\mathrm{Var}[\Delta Z(s)] = 2s$, $\mathrm{Var}[\Delta Z(t)] = 2t$ and correlation

$$\rho = \mathrm{Cor}[\Delta Z(s), \Delta Z(t)] = \frac{\min(s, t)}{\sqrt{s\,t}}. \tag{9}$$

Thus, their bivariate density function is equal to [1, p. 936]

$$g(z_s, z_t; s, t, \rho) = \frac{1}{4\pi \sqrt{s\,t\,(1 - \rho^2)}} \exp\left(-\frac{z_s^2/s - 2\rho\, z_s z_t/\sqrt{s\,t} + z_t^2/t}{4\,(1 - \rho^2)}\right). \tag{10}$$

The covariance of the indicator transforms of the increments is related to $g(z_s, z_t; s, t, \rho)$ by [4, p. 400]

$$C_I(s, t; z_c, z_{c'}) = \int_0^{\min(s,t)/\sqrt{s\,t}} g(z_c, z_{c'}; s, t, \rho)\, \mathrm{d}\rho. \tag{11}$$

Clearly, $C_I(s, t; z_c, z_{c'})$ depends on $s$ and $t$ not only through the lag $h = s - t$, and the covariance is non-stationary.

3 SIMULATION STUDY

I used simulation to illustrate how large the bias between the non-stationary variograms of the indicators and an estimate based on equation (7) can be and to demonstrate that the

bias leads to a loss of efficiency in simple IK. To this end, I simulated $10^5$ realisations of a Gaussian random process at the locations $s_0 = 0, s_1 = 1, \ldots, s_{300} = 300$ on a line. The process had a piecewise constant mean function and a cubic variogram with range 66, unit total sill and nugget 0.1. Piecewise constant mean functions were used by Goovaerts and Journel [8], van Meirvenne and Goovaerts [11] and Brus et al. [3]. Figure 1 shows two realisations and the corresponding indicators for the cut-off $z_c = 0$. The right panel also shows the estimated expectations of the indicators,

$$\hat{F}(s_i; 0) = 10^{-5} \sum_{j=1}^{10^5} I(s_i; 0)_j,$$

and their variances. The subscript $j$ denotes here the $j$th realisation. For each $s_i$: 0, 1, ..., 200, I estimated the non-stationary covariances of the indicators for the lag distances $h_k$: 0, 1, 2, ..., 100 by

$$\hat{C}_I(s_i, s_i + h_k; 0, 0) = 10^{-5} \sum_{j=1}^{10^5} R(s_i; 0)_j\, R(s_i + h_k; 0)_j,$$

where $R(s; 0)_j = I(s; 0)_j - \hat{F}(s; 0)$, and from those estimates I computed the non-stationary semivariances of the indicators by

$$\hat{\gamma}_I(s_i, s_i + h_k; 0, 0) = \tfrac{1}{2}\bigl\{\hat{C}_I(s_i, s_i; 0, 0) + \hat{C}_I(s_i + h_k, s_i + h_k; 0, 0)\bigr\} - \hat{C}_I(s_i, s_i + h_k; 0, 0).$$

These estimates were then compared with the estimated expectation of the sample variograms of the indicator residuals, computed for each realisation by equation (7):

$$\mathrm{E}[\hat{\gamma}_R(s_0, s_1, \ldots; h_k)] = 10^{-5} \sum_{j=1}^{10^5} \frac{1}{2\,N(h_k)} \sum_{i=0}^{N(h_k)-1} \{R(s_i; 0)_j - R(s_i + h_k; 0)_j\}^2.$$

[Figure 2 (panel titles: piecewise constant trend, nugget 0.1; left panel: location $s$ against lag distance $h$; right panel: semivariance $\gamma(s, h)$ and expectation of equation (7) against lag distance $h$)]
Figure 2: Non-stationary indicator semivariances, $\hat{\gamma}_I(s_i, s_i + h_k; 0, 0)$, for the simulations shown in Fig. 1. The left panel shows $\hat{\gamma}_I(\cdot)$ as a function of $s$ and $h$, and the right panel shows the variograms $\hat{\gamma}_I(s_i, s_i + h_k; 0, 0)$ for six locations $s_i$: 0, 40, ..., 200 as a function of $h$, together with the expectation, $\mathrm{E}[\hat{\gamma}_R(s_0, s_1, \ldots; h_k)]$, of the estimator given in equation (7).
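The estimation steps above can be sketched in a few lines of Python. This is a scaled-down illustration under loud assumptions, not the simulation reported here: it uses a Gaussian AR(1) process (exponential-type correlation) instead of a cubic variogram, 40 locations, 4000 realisations, and a hypothetical mean function with values 0 and 1.5. All function names are made up for the example.

```python
import random


def simulate_indicator_residuals(n_real=4000, n_loc=40, phi=0.8, cutoff=0.0, seed=1):
    """Simulate indicators of a Gaussian AR(1) process with a piecewise constant
    mean and return the residuals R(s)_j = I(s)_j - F_hat(s), one list per realisation."""
    rng = random.Random(seed)
    # Assumed mean function: 0 on the first half of the line, 1.5 on the second.
    mean = [0.0 if s < n_loc // 2 else 1.5 for s in range(n_loc)]
    innov_sd = (1.0 - phi * phi) ** 0.5
    indicators = []
    for _ in range(n_real):
        y = rng.gauss(0.0, 1.0)  # stationary start, unit variance
        row = []
        for s in range(n_loc):
            if s > 0:
                y = phi * y + innov_sd * rng.gauss(0.0, 1.0)
            row.append(1.0 if mean[s] + y <= cutoff else 0.0)
        indicators.append(row)
    f_hat = [sum(row[s] for row in indicators) / n_real for s in range(n_loc)]
    return [[row[s] - f_hat[s] for s in range(n_loc)] for row in indicators]


def cov_hat(res, a, b):
    """Covariance of the residuals at locations a and b, averaged over realisations."""
    return sum(r[a] * r[b] for r in res) / len(res)


def local_semivariance(res, s, h):
    """Non-stationary semivariance of the indicators for the pair (s, s + h)."""
    return 0.5 * (cov_hat(res, s, s) + cov_hat(res, s + h, s + h)) - cov_hat(res, s, s + h)


def pooled_semivariance(res, h):
    """Average over realisations of the sample variogram (7), pooled over space."""
    n_pairs = len(res[0]) - h
    total = sum((r[i] - r[i + h]) ** 2 for r in res for i in range(n_pairs))
    return total / (2.0 * n_pairs * len(res))


res = simulate_indicator_residuals()
for s in (5, 19, 30):
    print(f"s={s}: local semivariance at lag 1 = {local_semivariance(res, s, 1):.3f}")
print(f"pooled estimate at lag 1 = {pooled_semivariance(res, 1):.3f}")
```

By construction the pooled value is the spatial average of the local semivariances, so it cannot reproduce their variation with $s$; in this toy setting the lag-1 semivariance differs clearly between the two mean regimes.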
The left panel of Figure 2 shows $\hat{\gamma}_I(s_i, s_i + h_k; 0, 0)$ as a function of $s_i$ and $h_k$. We see abrupt changes of the semivariance for a given $h_k$ along the ordinate from $s_0 = 0$ to

$s_{200} = 200$. The right panel of the figure shows the change of the semivariance with $h_k$ for 6 selected locations. The semivariance does not increase monotonically with $h_k$: there are abrupt changes because of the non-constant variances of the indicators. If we ignore the non-stationary nature of the problem and use equation (7), then these jumps are lost.

[Figure 3 (left axis: simple kriging weights; legend: SK computed with non-stationary covariances, SK computed with covariances estimated by equation (7), relative efficiency of SK computed with covariances estimated by equation (7); horizontal axis: location of prediction point $s_0$)]
Figure 3: Simple IK weights of 6 measurements at locations 50 (black), 70 (red), 90 (green), 110 (blue), 130 (cyan) and 150 (magenta) as a function of the position of the prediction point $s_0$. The solid dots are the optimal weights computed from the non-stationary semivariances ($\hat{\gamma}_I(s_i, s_i + h_k; 0, 0)$); the open squares are the weights computed from the expectation ($\mathrm{E}[\hat{\gamma}_R(s_0, s_1, \ldots; h_k)]$) of the estimator given in equation (7). The solid line is the relative efficiency of siklm. Tickmarks without labels show the boundaries of the subregions with constant means (cf. Fig. 1).

The discrepancies between $\mathrm{E}[\hat{\gamma}_R(s_0, s_1, \ldots; h_k)]$ and $\hat{\gamma}_I(s_i, s_i + h_k; 0, 0)$ may not seem very significant. However, Figure 3 shows that they matter if we predict the indicators by simple kriging at the locations $s_0$: 50, 51, 52, ..., 150 from 6 measurements at $s_i$: 50, 70, 90, 110, 130, 150. Close to the boundaries of the subregions with constant mean we see abrupt changes in the optimal simple IK weights, which are lost when we compute them from $\mathrm{E}[\hat{\gamma}_R(s_0, s_1, \ldots; h_k)]$. A loss of efficiency of up to 20% results when we use Goovaerts and Journel's suggestion to estimate the variogram. Thus, the example shows that kriging loses its mean square optimality if we ignore the non-stationary nature of the problem.
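The comparison of kriging weights can be illustrated with a toy calculation. The sketch below solves the simple kriging system for two data points under a "true" non-stationary covariance and under a stationary model, then evaluates both predictors under the true covariance. All covariance values are hypothetical, and relative efficiency is taken here, as an assumption, to be the ratio of the optimal mean square error to that of the model-based weights.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def mse(weights, cov_data, cov_target, var_target):
    """Mean square error of a linear predictor of a zero-mean indicator residual."""
    n = len(weights)
    quad = sum(weights[i] * weights[j] * cov_data[i][j] for i in range(n) for j in range(n))
    cross = sum(w * c for w, c in zip(weights, cov_target))
    return var_target - 2.0 * cross + quad


# Hypothetical "true" non-stationary covariances: two data points and one target.
true_cov = [[0.25, 0.03], [0.03, 0.06]]
true_c0 = [0.12, 0.02]
true_var0 = 0.25
# Hypothetical stationary model, as obtained by pooling residuals over space.
model_cov = [[0.16, 0.05], [0.05, 0.16]]
model_c0 = [0.08, 0.05]

w_opt = solve(true_cov, true_c0)      # optimal simple kriging weights
w_model = solve(model_cov, model_c0)  # weights implied by the stationary model
efficiency = mse(w_opt, true_cov, true_c0, true_var0) / mse(w_model, true_cov, true_c0, true_var0)
print("optimal weights:", w_opt)
print("model weights:  ", w_model)
print(f"relative efficiency: {efficiency:.3f}")
```

Because the optimal weights minimise the mean square error under the true covariance, this ratio cannot exceed one; how far it falls below one depends on how strongly the fitted stationary model departs from the true covariance structure.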
We can then merely hope that kriging provides better predictions than other ad-hoc procedures such as inverse distance weighting of the indicators.

4 CONCLUSIONS

I conclude by stating that any attempt to use IK for data with an apparent trend, either explicitly (siklm) or implicitly by using ordinary IK within a local neighbourhood of support points, requires the modelling of non-stationary indicator variograms to preserve

the mean square optimality of kriging. The same problem arises for random processes with constant means but unbounded variograms, although the loss of efficiency of siklm was smaller in simulations that I also ran but do not report here. As we cannot estimate non-stationary variograms from only one realization of $\{Z(s)\}$, IK is in practice limited to geostatistical analyses of data without an apparent trend and with a bounded variogram, i.e., to models with stationary bivariate distributions. This is a serious limitation because in many instances we have full-coverage ancillary information that could (and should!) be exploited when predicting $Z(s)$ or any non-linear transform thereof. But fortunately, there is life beyond IK: Diggle et al. [6] showed how to extend geostatistical methodology to non-normal response variates, and related approaches also exist for lattice models [5, chap.], so there is no harm in giving up the IK methodology altogether.

REFERENCES

[1] Abramowitz, M. and Stegun, I. A. (1965). Handbook of Mathematical Functions. Dover, New York.
[2] Bárdossy, A. and Lehmann, W. (1998). Spatial distribution of soil moisture in a small catchment. Part 1: Geostatistical analysis. Journal of Hydrology, 206, 1–15.
[3] Brus, D. J., de Gruijter, J. J., Walvoort, D. J. J., de Vries, F., Bronswijk, J. J. B., Römkens, P. F. A. M., and de Vries, W. (2002). Mapping the probability of exceeding critical thresholds for cadmium concentrations in soils in the Netherlands. Journal of Environmental Quality, 31.
[4] Chilès, J.-P. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. John Wiley & Sons, New York.
[5] Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley & Sons, New York, revised edition.
[6] Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics (with discussions). Applied Statistics, 47(3).
[7] Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press, New York.
[8] Goovaerts, P. and Journel, A. G. (1995). Integrating soil map information in modelling the spatial variation of continuous soil properties. European Journal of Soil Science, 46.
[9] Grunwald, S., Goovaerts, P., Bliss, C. M., Comerford, N. B., and Lamsal, S. (2006). Incorporation of auxiliary information in the geostatistical simulation of soil nitrate nitrogen. Vadose Zone Journal, 5.
[10] Journel, A. G. and Posa, D. (1990). Characteristic behavior and order relations for indicator variograms. Mathematical Geology, 22(8).
[11] van Meirvenne, M. and Goovaerts, P. (2001). Evaluating the probability of exceeding a site-specific soil cadmium contamination threshold. Geoderma, 102.