Limitations of Indicator Kriging for Predicting Data with Trend


Andreas Papritz
ETH Zurich, Department of Environmental Sciences, Zurich, Switzerland
papritz@env.ethz.ch

Abstract. Goovaerts and Journel [8] proposed simple indicator kriging with varying local means (siklm) as a way to extend the indicator kriging methodology to variates with an apparent spatial trend. However, contrary to what the authors imply, the detrended indicators, i.e., the indicator residuals, are not stationary, and their covariance structure cannot be estimated without bias from a single realization of a random process. Ignoring the non-stationary nature of the covariance of the indicators ruins the usual mean square optimality of kriging. Therefore, siklm is an ad-hoc procedure that lacks optimality, and its use should be discouraged.

1 INTRODUCTION

According to ISI Web of Science®, about 20 journal articles and 0 contributions to conference proceedings have been published about indicator kriging (IK for short) to date. Many of these studies deal with mapping the probability that a spatial variable exceeds a threshold [e.g., 3, 11]. This is an important problem in environmental surveillance and monitoring. Some studies apply IK to data with an apparent trend, following advice by Goovaerts and Journel [8] and Goovaerts [7, sec. 7.3.3]. Unfortunately, Goovaerts' simple IK with varying local means (siklm for short) is not feasible in practice, as it requires the modelling of non-stationary covariances. By "in practice" I mean the case where we consider our measurements as a sample from a single realisation of a random process. The same problem arises if IK is used for data that show unbounded variograms. To substantiate my contention, I highlight and discuss here the limitations of siklm that arise from basic probability theory. Notwithstanding their elementary nature, these limitations seem to have been frequently ignored.
I further demonstrate by a simulation that siklm lacks the usual mean square optimality of kriging, which leads me to discourage the use of siklm.

2 COVARIANCES OF INDICATOR TRANSFORMS OF NON-STATIONARY VARIATES

Let $Z(s)$ denote a real-valued random variable used for modelling an attribute $z$ measured at location $s$, and let $I(s; z_c)$ for a specific cut-off $z_c$ be the indicator transform: $I(s; z_c) = 1$ if $Z(s) \le z_c$ and $I(s; z_c) = 0$ otherwise. Then

$$E[I(s; z_c)] = \mathrm{Prob}[Z(s) \le z_c] = F(s; z_c), \tag{1}$$

$$\mathrm{Var}[I(s; z_c)] = F(s; z_c)\,\bigl(1 - F(s; z_c)\bigr), \tag{2}$$

where $F(s; z)$ is the cumulative distribution function (cdf) of $Z(s)$, $E[\cdot]$ and $\mathrm{Var}[\cdot]$ are the expectation and variance operators, and $\mathrm{Prob}[A]$ denotes the probability of event $A$. Further, let $\mathrm{Cov}[\cdot]$ and $\mathrm{Cor}[\cdot]$ denote the covariance and correlation operators. The (cross-)covariance function of the indicators for two cut-offs $z_c$ and $z_{c'}$, $C_I(s, s+h; z_c, z_{c'}) = \mathrm{Cov}[I(s; z_c), I(s+h; z_{c'})]$, is related to the bivariate cumulative distribution function, $F(s, s+h; z_c, z_{c'}) = \mathrm{Prob}[Z(s) \le z_c, Z(s+h) \le z_{c'}]$, of $Z(s)$ and $Z(s+h)$ by [e.g., 10]

$$C_I(s, s+h; z_c, z_{c'}) = F(s, s+h; z_c, z_{c'}) - F(s; z_c)\,F(s+h; z_{c'}). \tag{3}$$

For a random process with stationary bivariate distributions, equations (1)-(3) simplify to

$$E[I(s; z_c)] = F(z_c), \tag{4}$$

$$\mathrm{Var}[I(s; z_c)] = F(z_c)\,\bigl(1 - F(z_c)\bigr), \tag{5}$$

$$C_I(h; z_c, z_{c'}) = F(h; z_c, z_{c'}) - F(z_c)\,F(z_{c'}). \tag{6}$$

Clearly, the right-hand sides of (4)-(6) do not depend on $s$, and $C_I(\cdot)$ is a function of the lag $h$ only. Notice that equation (4) means that the expectations of the random variables, say $E[Z(s)] = \mu(s)$, must not vary in space. Otherwise, the cdfs would not be constant. Furthermore, equations (4)-(6) show that we may (at least hope to) infer the first two moments of the indicators when we have data from only one realisation of $\{Z(s)\}$. To estimate the expectations and (cross-)covariances of the indicators, we replace the averaging over multiple realisations by averaging over space. Spatial averaging, however, is inappropriate in the general case of non-stationary distributions, i.e., for models with moments given by equations (1)-(3).

In spite of the above, Goovaerts and Journel [8] proposed to extend the IK methodology to random processes with spatially varying $\mu(s)$. They called their method simple IK with varying local means. The terms simple IK with local prior means [7], soft IK [9] and IK with external drift [2] have since also been used to denote the approach. Apparently, the authors realized that the indicators have non-stationary (co-)variances if $\mu(s)$ varies spatially.
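As a minimal numerical illustration of equations (1) and (2), the following Python sketch assumes a Gaussian $Z(s)$ with unit variance, so that $F(s; z_c) = \Phi(z_c - \mu(s))$. The three mean values and the function names are hypothetical and are not taken from the paper; they only show that the expectation and the variance of the indicator change whenever $\mu(s)$ changes.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def indicator_moments(mu: float, cutoff: float, sigma: float = 1.0):
    """Mean and variance of I(s; z_c) for Gaussian Z(s) with mean mu.

    Implements equations (1) and (2) under the Gaussian assumption.
    """
    f = normal_cdf((cutoff - mu) / sigma)  # F(s; z_c) = Prob[Z(s) <= z_c]
    return f, f * (1.0 - f)


# Hypothetical piecewise constant means of three subregions (illustrative only).
for mu in (-1.0, 0.0, 1.5):
    mean_i, var_i = indicator_moments(mu, cutoff=0.0)
    print(f"mu = {mu:5.2f}   E[I] = {mean_i:.3f}   Var[I] = {var_i:.3f}")
```

Because the indicator variance equals $F(1 - F)$, it is largest where the cut-off coincides with the local mean and shrinks as the local mean moves away from the cut-off.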
Given an estimate, $\hat F(s; z_c)$, of the cdf, they proposed to estimate the variogram of $I(s; z_c)$ by fitting model functions to the sample variogram,

$$\hat\gamma_R(s_i; h_k; z_c, z_c) = \frac{1}{2N(h_k)} \sum_{i=1}^{N(h_k)} \{r(s_i; z_c) - r(s_i + h_k; z_c)\}^2, \tag{7}$$

of the indicator residuals $r(s; z_c) = i(s; z_c) - \hat F(s; z_c)$ (where $i(s; z_c)$ is the indicator transform of a measurement and $N(h_k)$ is the number of data pairs in lag class $h_k$). Unfortunately, they failed to recognize that half the expected squared difference of the indicator residuals, i.e., their semivariance, is not independent of $s$, even if (unrealistically) the true cdf is assumed to be known, i.e., if $\hat F(s; z_c) = F(s; z_c)$:

$$\tfrac{1}{2} E[\{R(s; z_c) - R(s+h; z_c)\}^2] = \tfrac{1}{2} \mathrm{Var}[R(s; z_c) - R(s+h; z_c)] = \tfrac{1}{2}\bigl\{F(s; z_c)(1 - F(s; z_c)) + F(s+h; z_c)(1 - F(s+h; z_c))\bigr\} - \bigl\{F(s, s+h; z_c, z_c) - F(s; z_c)\,F(s+h; z_c)\bigr\}. \tag{8}$$

As above, $F(s; z_c)$ and $F(s, s+h; z_c, z_c)$ are functions of $s$ in the non-stationary case. Hence, the right-hand side of equation (8) still depends on $s$. Grouping the observed indicator residuals into lag classes and computing a sample variogram by the customary method-of-moments estimator is therefore meaningless in this instance.

Figure 1: Two realisations, shown in red and blue, of a Gaussian random process with a piecewise constant mean function and a cubic variogram with nugget 0.1 (left panel: attribute $Z(s)$ against location $s$) and the corresponding indicator transforms of the simulated data for the cut-off $z_c = 0$ (right panel: indicator $I(s; 0)$ against location $s$). Solid lines: expectations of the random variables; dotted lines: cut-off (left) and variances of the indicator random variables (right).

The indicator transforms of $\{Z(s)\}$ with constant $\mu(s)$ but unbounded variogram have non-stationary covariances, too. To see this, we consider a Gaussian, zero-order intrinsic $\{Z(s)\}$, $s \in \mathbb{R}$, with a linear variogram, $\gamma(h) = |h|$. Two increments, say $\Delta Z(s) = Z(s) - Z(0)$ and $\Delta Z(t) = Z(t) - Z(0)$ with $s, t > 0$, are then normally distributed with variances $\mathrm{Var}[\Delta Z(s)] = 2s$, $\mathrm{Var}[\Delta Z(t)] = 2t$ and correlation

$$\rho = \mathrm{Cor}[\Delta Z(s), \Delta Z(t)] = \frac{\min(s, t)}{\sqrt{s\,t}}. \tag{9}$$

Thus, their bivariate density function is equal to [1, p. 936]

$$g(z_s, z_t; s, t, \rho) = \frac{1}{4\pi\sqrt{s\,t\,(1 - \rho^2)}} \exp\left(-\frac{z_s^2/s - 2\rho\, z_s z_t/\sqrt{s\,t} + z_t^2/t}{4(1 - \rho^2)}\right). \tag{10}$$

The covariance of the indicator transforms of the increments is related to $g(z_s, z_t; s, t, \rho)$ by [4, p. 400]

$$C_I(s, t; z_c, z_{c'}) = \int_0^{\min(s,t)/\sqrt{s\,t}} g(z_c, z_{c'}; s, t, \rho)\, d\rho. \tag{11}$$

Clearly, $C_I(s, t; z_c, z_{c'})$ depends on $s$ and $t$ not only through the lag $h = s - t$, and the covariance is non-stationary.

3 SIMULATION STUDY

I used simulation to illustrate how large the bias between the non-stationary variograms of the indicators and an estimate based on equation (7) can be, and to demonstrate that the bias leads to a loss of efficiency in simple IK. To this end, I simulated $10^5$ realisations of a Gaussian random process at the locations $s_0 = 0, s_1 = 1, \ldots, s_{300} = 300$ on a line. The process had a piecewise constant mean function and a cubic variogram with range 66, unit total sill and nugget 0.1. Piecewise constant mean functions were used by Goovaerts and Journel [8], van Meirvenne and Goovaerts [11] and Brus et al. [3]. Figure 1 shows two realisations and the corresponding indicators for the cut-off $z_c = 0$. The right panel also shows the estimated expectations of the indicators,

$$\hat F(s_i; 0) = 10^{-5} \sum_{j=1}^{10^5} I(s_i; 0)_j,$$

and their variances. The subscript $j$ denotes here the $j$th realisation.

Figure 2: Non-stationary indicator semivariances, $\hat\gamma_I(s_i, s_i + h_k; 0, 0)$, for the simulations shown in Fig. 1. The left panel shows $\hat\gamma_I(\cdot)$ as a function of location $s$ and lag distance $h$, and the right panel shows the variograms $\hat\gamma_I(s_i, s_i + h_k; 0, 0)$ for six locations $s_i$: 0, 40, ..., 200 as a function of $h$, together with the expectation, $E[\hat\gamma_R(s_0, s_1, \ldots; h_k)]$, of the estimator given in equation (7).

For each $s_i$: 0, 1, ..., 200, I estimated the non-stationary covariances of the indicators for the lag distances $h_k$: 0, 1, 2, ..., 100 by

$$\hat C_I(s_i, s_i + h_k; 0, 0) = \frac{1}{99\,999} \sum_{j=1}^{10^5} R(s_i; 0)_j\, R(s_i + h_k; 0)_j,$$

where $R(s; 0)_j = I(s; 0)_j - \hat F(s; 0)$, and from those estimates I computed the non-stationary semivariances of the indicators by

$$\hat\gamma_I(s_i, s_i + h_k; 0, 0) = \tfrac{1}{2}\bigl\{\hat C_I(s_i, s_i; 0, 0) + \hat C_I(s_i + h_k, s_i + h_k; 0, 0)\bigr\} - \hat C_I(s_i, s_i + h_k; 0, 0).$$
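The Monte Carlo estimators above can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions, not the code used for the paper: the grid is shortened to 0, ..., 120, only 5000 realisations are drawn, the three mean levels are hypothetical, and the cubic model is written in its usual textbook form, which I assume is the one meant here.

```python
import numpy as np


def cubic_covariance(h, total_sill=1.0, nugget=0.1, a=66.0):
    """Covariance implied by a cubic variogram with nugget and range a."""
    r = np.minimum(np.abs(h) / a, 1.0)
    structured = 1.0 - (7.0 * r**2 - 8.75 * r**3 + 3.5 * r**5 - 0.75 * r**7)
    cov = (total_sill - nugget) * structured
    return np.where(h == 0, total_sill, cov)  # the nugget acts only at lag 0


rng = np.random.default_rng(0)
s = np.arange(0, 121)                                    # shortened grid (assumption)
mu = np.where(s < 40, -1.0, np.where(s < 80, 1.0, 0.0))  # hypothetical mean levels
n_real = 5000                                            # far fewer than 10^5

# Simulate Gaussian realisations with a Cholesky factor of the covariance matrix.
cov_z = cubic_covariance(s[:, None] - s[None, :])
chol = np.linalg.cholesky(cov_z)
z = mu + rng.standard_normal((n_real, s.size)) @ chol.T

ind = (z <= 0.0).astype(float)            # indicators for the cut-off 0
f_hat = ind.mean(axis=0)                  # estimate of F(s_i; 0)
resid = ind - f_hat                       # indicator residuals R(s_i; 0)_j
c_hat = resid.T @ resid / (n_real - 1)    # non-stationary covariances

# Semivariance for every pair of locations from variances and covariances.
var_hat = np.diag(c_hat)
gamma_hat = 0.5 * (var_hat[:, None] + var_hat[None, :]) - c_hat

# Same lag (10), two different starting locations; the second pair straddles
# the jump of the mean at s = 40.
print(gamma_hat[10, 20], gamma_hat[35, 45])
```

The matrix `gamma_hat` holds one semivariance per pair of locations rather than one per lag, which is the point of the comparison with the pooled estimator of equation (7).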
These estimates were then compared with the estimated expectation of the sample variograms of the indicator residuals, computed for each realisation by equation (7):

$$E[\hat\gamma_R(s_0, s_1, \ldots; h_k)] = 10^{-5} \sum_{j=1}^{10^5} \frac{1}{2 \cdot 201} \sum_{i=0}^{200} \{R(s_i; 0)_j - R(s_i + h_k; 0)_j\}^2.$$

The left panel of Figure 2 shows $\hat\gamma_I(s_i, s_i + h_k; 0, 0)$ as a function of $s_i$ and $h_k$. We see abrupt changes of the semivariance for a given $h_k$ along the ordinate from $s_0 = 0$ to $s_{200} = 200$. The right panel of the figure shows the change of the semivariance with $h_k$ for 6 selected locations. The semivariance does not increase monotonically with $h_k$: there are abrupt changes because of the non-constant variances of the indicators. If we ignore the non-stationary nature of the problem and use equation (7), then these jumps are lost.

Figure 3: Simple IK weights of 6 measurements at locations 50 (black), 70 (red), 90 (green), 110 (blue), 130 (cyan) and 150 (magenta) as a function of the position of the prediction point $s_0$. The solid dots are the optimal weights computed from the non-stationary semivariances, $\hat\gamma_I(s_i, s_i + h_k; 0, 0)$; the open squares are the weights computed from the expectation, $E[\hat\gamma_R(s_0, s_1, \ldots; h_k)]$, of the estimator given in equation (7). The solid line is the relative efficiency of siklm. Tick marks without labels show the boundaries of the subregions with constant means (cf. Fig. 1).

The discrepancies between $E[\hat\gamma_R(s_0, s_1, \ldots; h_k)]$ and $\hat\gamma_I(s_i, s_i + h_k; 0, 0)$ may not seem very significant. However, Figure 3 shows that they matter if we predict the indicators by simple kriging at the locations $s_0$: 50, 51, 52, ..., 150 from 6 measurements at $s_i$: 50, 70, 90, 110, 130, 150. Close to the boundaries of the subregions with constant mean we see abrupt changes in the optimal simple IK weights, which are lost when we compute them from $E[\hat\gamma_R(s_0, s_1, \ldots; h_k)]$. A loss of efficiency of up to 20% results when we use Goovaerts and Journel's suggestion to estimate the variogram. Thus, the example shows that kriging loses its mean square optimality if we ignore the non-stationary nature of the problem.
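The mechanism behind this loss of efficiency can be sketched with a toy calculation in Python. Everything numerical here is an assumption made for illustration: the standard deviations of the two subregions, the exponential correlation function and the prediction location are hypothetical, and relative efficiency is taken to be the ratio of the optimal mean square error to the mean square error of the weights derived from a stationary covariance model.

```python
import numpy as np


def sk_weights(cov_data, cov_target):
    """Simple kriging weights: solve C w = c0."""
    return np.linalg.solve(cov_data, cov_target)


def prediction_mse(weights, cov_data, cov_target, var_target):
    """Mean square error of a linear predictor under the true covariances."""
    return var_target - 2.0 * weights @ cov_target + weights @ cov_data @ weights


def local_sd(s):
    """Hypothetical standard deviation of the residuals, jumping at s = 100."""
    return np.where(np.asarray(s) < 100.0, 0.5, 0.2)


def correlation(h):
    """Hypothetical stationary correlation function (exponential)."""
    return np.exp(-np.abs(h) / 30.0)


s_data = np.array([50.0, 70.0, 90.0, 110.0, 130.0, 150.0])
s_pred = 95.0  # prediction point close to the boundary (assumption)

sd_data = local_sd(s_data)
sd_pred = float(local_sd(s_pred))
corr_data = correlation(s_data[:, None] - s_data[None, :])
corr_pred = correlation(s_data - s_pred)

# True, non-stationary covariances: the variance changes across the boundary.
cov_data = sd_data[:, None] * sd_data[None, :] * corr_data
cov_pred = sd_data * sd_pred * corr_pred
var_pred = sd_pred**2

w_opt = sk_weights(cov_data, cov_pred)
# A stationary model is a pooled variance times the correlation; the pooled
# variance cancels in the kriging system, so the correlations suffice.
w_stat = sk_weights(corr_data, corr_pred)

mse_opt = prediction_mse(w_opt, cov_data, cov_pred, var_pred)
mse_stat = prediction_mse(w_stat, cov_data, cov_pred, var_pred)
rel_eff = mse_opt / mse_stat

print("optimal weights:   ", np.round(w_opt, 3))
print("stationary weights:", np.round(w_stat, 3))
print("relative efficiency:", round(float(rel_eff), 3))
```

In this toy setting the optimal weight of each datum is the stationary weight rescaled by the ratio of the standard deviations at the prediction point and at the datum, so the two sets of weights differ most for data on the far side of a boundary.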
If the non-stationarity is ignored, we can merely hope that kriging provides better predictions than other ad-hoc procedures such as inverse distance weighting of the indicators.

4 CONCLUSIONS

I conclude by stating that any attempt to use IK for data with an apparent trend, either explicitly (siklm) or implicitly by using ordinary IK within a local neighbourhood of support points, requires the modelling of non-stationary indicator variograms to preserve the mean square optimality of kriging. The same problem arises for random processes with constant means but unbounded variograms, although the loss of efficiency was smaller in simulations that I ran for this case but do not report here. As we cannot estimate non-stationary variograms from only one realization of $\{Z(s)\}$, IK is in practice limited to geostatistical analyses of data without an apparent trend and with a bounded variogram, i.e., to models with stationary bivariate distributions. This is a serious limitation because in many instances we have full-coverage ancillary information that could (and should!) be exploited when predicting $Z(s)$ or any non-linear transform thereof. But fortunately, there is life beyond IK: Diggle et al. [6] showed how to extend geostatistical methodology to non-normal response variates, and related approaches also exist for lattice models [5, chap. 6.5.2], so there is no harm in giving up the IK methodology altogether.

REFERENCES

[1] Abramowitz, M. and Stegun, I. A. (1965). Handbook of Mathematical Functions. Dover, New York.
[2] Bárdossy, A. and Lehmann, W. (1998). Spatial distribution of soil moisture in a small catchment. Part 1: Geostatistical analysis. Journal of Hydrology, 206, 1-15.
[3] Brus, D. J., de Gruijter, J. J., Walvoort, D. J. J., de Vries, F., Bronswijk, J. J. B., Römkens, P. F. A. M., and de Vries, W. (2002). Mapping the probability of exceeding critical thresholds for cadmium concentrations in soils in the Netherlands. Journal of Environmental Quality, 31, 1875-1884.
[4] Chilès, J.-P. and Delfiner, P. (1999). Geostatistics: Modeling Spatial Uncertainty. John Wiley & Sons, New York.
[5] Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley & Sons, New York, revised edition.
[6] Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics (with discussions). Applied Statistics, 47(3), 299-350.
[7] Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press, New York.
[8] Goovaerts, P. and Journel, A. G. (1995). Integrating soil map information in modelling the spatial variation of continuous soil properties. European Journal of Soil Science, 46, 397-414.
[9] Grunwald, S., Goovaerts, P., Bliss, C. M., Comerford, N. B., and Lamsal, S. (2006). Incorporation of auxiliary information in the geostatistical simulation of soil nitrate nitrogen. Vadose Zone Journal, 5, 391-404.
[10] Journel, A. G. and Posa, D. (1990). Characteristic behavior and order relations for indicator variograms. Mathematical Geology, 22(8), 1011-1025.
[11] van Meirvenne, M. and Goovaerts, P. (2001). Evaluating the probability of exceeding a site-specific soil cadmium contamination threshold. Geoderma, 102, 75-100.