Approximate Likelihoods for Spatial Processes

Petruţa C. Caragea, Richard L. Smith
Department of Statistics, University of North Carolina at Chapel Hill

KEY WORDS: Maximum likelihood for spatial processes, dimensionality reduction, relative efficiency, time series

1 Introduction

Many applications of spatial statistics involve evaluating a likelihood function over a sample with an increasing number of data locations. For example, Holland et al. (2000) analyzed the Clean Air Status and Trends Network (CASTNet) data set, which was developed by the U.S. Environmental Protection Agency (EPA) in conjunction with the National Oceanic and Atmospheric Administration (NOAA) to monitor air quality and supporting meteorological measurements across the United States. Established in 1987, CASTNet almost doubled its number of site locations, from 38 in 1989 (35 of them in the eastern part of the U.S.) to 70 sites across the U.S. in 2001. Holland et al. focused on establishing a spatial map of trends in two pollutants, SO2 and SO4; that is, the paper estimated the spatial parameters associated with these trends. The authors developed an algorithm based on maximum likelihood estimation: the underlying field was assumed to be Gaussian with a spatial covariance function in some given family. Evaluating the likelihood function requires the inverse and determinant of the covariance matrix. Although this analysis is computationally feasible for CASTNet, it would not be so for a much larger network. Experience shows that once the number of locations reaches the hundreds, the cost of computing the inverse and determinant of the covariance matrix makes maximum likelihood estimation intractable. Moreover, data sets that encompass hundreds if not thousands of sites are becoming more prevalent. For example, the Historical Climate Network (HCN), developed and maintained by NOAA, now has several thousand sites. To retain the benefits of maximum likelihood estimation in such high-dimensional settings, it is necessary to establish efficient approximations to the likelihood.

We consider here several approximate likelihoods based on grouping the observations into clusters and building an estimating function that accounts for variability both between and within clusters. Theoretical results derived for an analogous time series problem allow us to compare the three approximation schemes. These results rest on the general idea that the variance of an alternative estimator can be computed by the information sandwich principle, after expressing the derivatives of the quasi-likelihood function as quadratic forms of independent normal random variables. We conclude by illustrating the new methods with simulations.

This paper is made up of three parts. The first consists of the general theoretical methodology used to calculate the asymptotic variances of the proposed estimators. The second describes the general spatial model and the practical details of the proposed approximation schemes. Since it is practically intractable to characterize theoretically the performance of these general estimators, we analyze in detail the analogous one-dimensional problem. The third part therefore consists of a thorough theoretical analysis of the following three estimators. The first is named Big Blocks: we begin by computing the mean value of each block.
The proposed estimation function in this case is simply the likelihood of the block averages; maximizing it yields the proposed estimator. We call the second estimator Small Blocks. Here the theoretical derivations are performed under the assumption that the blocks are independent. We calculate the likelihood function for each block, which is readily available since the original covariance structure is known, and the function to be maximized is the product of the individual block likelihoods. The last estimator is based on a combination of the two preceding schemes. Naturally, we expect the Big Blocks estimator to exhibit some loss of efficiency because each block is represented only by its mean, while the assumption of independence between blocks in the second case will also reduce efficiency, although not as severely as in the first instance. We construct the Hybrid estimation function in a few steps. First we compute the block means and consider their likelihood. We then assume that, given the block means, the blocks are independent. Although this assumption cannot always be verified in practice, it is a reasonable working assumption. The estimating function is therefore the product of the likelihood of the block means and the individual conditional block likelihoods. Clearly this is not an exact likelihood, because of the conditional independence assumption.

As a measure of performance, we define the relative efficiency as the ratio of the asymptotic variance of the classical MLE to that of the alternative estimator. To compute the asymptotic variances of the various estimators, we use the expansion method. As a check, we compare the theoretical values obtained through the expansion method with the relative efficiencies of the three estimators on simulated data sets. We conclude by presenting a possible theoretical extension of the time series problem to its spatial equivalent and describe a promising approach for the more general case.

2 The Expansion Method

The main novelty in this work is the approximation to the likelihood, which leads to a certain dimensionality reduction. Another nonstandard element is the calculation of the asymptotic variance of the alternative estimators; throughout this paper we refer to this technique as the expansion method. Since it lies at the basis of our theoretical calculations, we outline its main principles here. The most serious complication in calculating the asymptotic variance of any of the proposed estimators comes from the fact that they are derived from quasi-likelihood functions.
Therefore the standard Fisher information approach does not lead to correct results. Liang and Zeger (1986) proposed a solution to this problem, a technique known as the information sandwich approach. Suppose we have a statistical model indexed by a finite-dimensional parameter θ, and suppose an estimate θ̂_n is constructed by minimizing a criterion function S_n(θ). We assume the true parameter value is θ_0 and that θ̂_n is a consistent estimator. We also assume that S_n(θ) is twice continuously differentiable in θ, and that its underlying distribution is sufficiently smooth that the function H(θ), defined below, is continuous in a neighborhood of θ_0. For any function f, let f'(θ) denote the vector of first-order partial derivatives of f with respect to the components of θ, and f''(θ) the matrix of second-order partial derivatives. We assume:

(SA1) n^{-1} S''_n(θ) →_p H(θ) as n → ∞, uniformly on some neighborhood of θ_0, where H(·) is a matrix-valued function, continuous near θ_0, with H(θ_0) invertible;

(SA2) n^{-1/2} S'_n(θ_0) →_d N(0, V(θ_0)) for some covariance matrix V(θ_0).

With all the above conditions satisfied, we can apply the Slutsky lemma together with a Taylor expansion to conclude that

√n (θ̂_n − θ_0) →_d N(0, H(θ_0)^{-1} V(θ_0) H(θ_0)^{-1}).

Therefore, we need to be able to compute the variance of the first derivative of the minimizing criterion, as well as the expected value of the second derivative. We solve this problem by employing a corollary to the Martingale Central Limit Theorem (MCLT), namely an application to quadratic forms of independent normal random variables. Consider the sequence

S_n = Σ_{i ≤ j} a_{n,i,j} ξ_i ξ_j,

where the {ξ_i} are independent N[0, 1] and the coefficients {a_{n,i,j}} are defined for each n. Define

m_n = E{S_n} = Σ_i a_{n,i,i}   and   v_n = Var{S_n} = 2 Σ_i a_{n,i,i}^2 + Σ_{i < j} a_{n,i,j}^2.

Theorem (see Billingsley, 1995, page 476): Suppose

(A1) max_i a_{n,i,i}^2 / v_n → 0 as n → ∞,

(A2) max_k Σ_{i < k} a_{n,i,k}^2 / v_n → 0 as n → ∞.

Then (S_n − m_n)/√v_n →_d N[0, 1].

To summarize, the phrase expansion technique comes from the fact that the functions of interest, the derivatives of S_n, are typically quadratic forms in the spatial process, which in turn can be expanded into quadratic forms of the independent normal random variables ξ_i. This enables us to use the MCLT to derive their variance and hence apply the information sandwich principle to obtain the asymptotic distribution of the alternative estimator.

3 Spatial Setting

3.1 Spatial Models

The basic object we consider is a stochastic process {Z(s), s ∈ D}, where D is a subset of R^d (d-dimensional Euclidean space), usually though not necessarily with d = 2. For example, Z(s) may represent the daily quantity of SO2 measured at a specific location s. Let µ(s) = E[Z(s)], s ∈ D, denote the mean value at location s. We also assume that the variance of Z(s) exists for all s ∈ D. In this work we analyze Gaussian processes. We usually assume second-order stationarity, though many of the results hold under the weaker intrinsic stationarity assumption. Since we assume that we are sampling from a Gaussian process, we can write down the exact likelihood function, which we subsequently maximize numerically with respect to the unknown parameters.
Without any major change in the methodology, we can also incorporate linear regression terms in the model, which becomes

Z ∼ N(Xβ, Σ),   (1)

where Z is an n-dimensional vector of observations, X an n × q matrix of known regressors, β a q-vector of unknown regression parameters and Σ the covariance matrix of the observations. In many applications we may assume

Σ = α V(θ),   (2)

where α is an unknown scale parameter and V(θ) is a matrix of standardized covariances determined by the unknown parameter vector θ. With Z defined by (1), the negative log likelihood is given by

l(β, α, θ) = (n/2) log(2π) + (n/2) log α + (1/2) log|V(θ)| + (1/(2α)) (Z − Xβ)^T V(θ)^{-1} (Z − Xβ).   (3)

The traditional approach to this optimization problem is a two-stage process based on the Cholesky decomposition of the covariance matrix: the first stage computes the least squares estimator of β on the Cholesky-transformed data, while the second stage uses this estimator in a numerical maximization with respect to the remaining parameters. Since the number of operations needed to calculate the inverse and determinant of an n × n covariance matrix is of order n^3, serious delays are to be expected for large data sets. With the growing interest in monitoring and analyzing ozone and particulate matter over the U.S., in scenarios where data are collected at as many as 800-900 sites a few times daily, these computational problems become more and more acute. As the exact maximum likelihood function becomes intractable in such instances, we shall consider the three alternative approximations to the estimating function mentioned before: Big Blocks, Small Blocks and Hybrid. Each is based on clustering the sampling sites into a given number of groups (say b) of approximately equal size (say k). For the Big Blocks estimator, we first compute the cluster means and then take their likelihood as the optimization criterion. We expect that summarizing the correlation of an entire cluster in a single component, the cluster mean, will lead to a loss of efficiency in some cases, especially for large cluster sizes.
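To make the cost of the full likelihood concrete, the following is a minimal sketch of a single evaluation of the negative log likelihood (3), with β estimated by generalized least squares and α profiled out analytically; the exponential correlation family used for V(θ) and the function name are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def neg_loglik_profile(theta, Z, X, coords):
        """Profile negative log likelihood of (3) in theta, with beta and alpha
        profiled out; assumes the exponential model V_ij = exp(-d_ij / theta)."""
        n = len(Z)
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        V = np.exp(-d / theta)
        L = cholesky(V, lower=True)                 # O(n^3): the expensive step
        Zt = solve_triangular(L, Z, lower=True)     # whitened observations
        Xt = solve_triangular(L, X, lower=True)     # whitened regressors
        beta, *_ = np.linalg.lstsq(Xt, Zt, rcond=None)   # GLS estimate of beta
        rss = np.sum((Zt - Xt @ beta) ** 2)
        alpha = rss / n                             # analytic maximizer over alpha
        logdetV = 2.0 * np.sum(np.log(np.diag(L)))
        return 0.5 * (n * np.log(2 * np.pi) + n * np.log(alpha) + logdetV + n)

Each evaluation repeats the O(n^3) Cholesky factorization; this is precisely the bottleneck the block schemes below are designed to avoid.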
For the Small Blocks estimator, we compute the quasi-likelihood function as the product of the individual cluster likelihoods. We assume the within-cluster correlation structure is known, belonging to some parametric family. The underlying assumption here is that the clusters are independent, which will induce some efficiency loss (especially for small cluster sizes), although we expect it to be less serious than in the previous case.

To give a general idea of the computational efficiency of the Hybrid estimator, we describe not only the algorithm we follow but also the approximate number of calculations needed to carry it out. This estimation technique accounts for both within- and between-cluster correlation, so we expect it to be superior to both of the preceding methods. We proceed as follows:

1. Calculate the cluster means and evaluate their joint likelihood. To do so, we need to compute the b × b covariance matrix of the cluster means, each entry of which requires approximately k^2 steps, followed by the Cholesky decomposition of a b × b matrix, which requires O(b^3) steps. In summary, the number of evaluation steps required here is O(b^2 k^2 + b^3). If b = n^{2/3}, this is of order O(n^2), compared with O(n^3) for the full likelihood calculations, and this is the best possible rate for an estimator of this form.

2. Conditionally on the cluster means, compute the individual cluster joint likelihoods. Each is an O(k^3) operation, repeated b times, hence we perform O(b k^3) evaluations. This is of the same or smaller order than the first step provided b is at least of order n^{1/2}.

3. Finally, compute the quasi-likelihood function by multiplying the b + 1 likelihood components above. This is the function that needs to be maximized in the estimation process.

3.2 Modified Algorithm: Practical Issues

This section describes some of the practical details involved in the strategy above. The first step is to cluster the data into a number of blocks with approximately the same number of elements. We employ a classical clustering procedure based on the latitude and longitude of each sampling location, and denote by b the number of clusters and by k the cluster size. Most of the details that follow are specific to the Big Blocks and Hybrid estimators, but one proceeds in a similar fashion for Small Blocks.

The first step is to compute the block averages. Define Z̄ = {Z̄_1, …, Z̄_b} as the vector of cluster means, where Z̄_i denotes the mean of cluster i. The new process Z̄ is Gaussian, with mean µ* and covariance matrix Σ*. Thus the negative log likelihood for the cluster means is of the form

l_means(β, θ) = (b/2) log(2π) + (1/2) log|Σ*(θ)| + (1/2) (Z̄ − µ*)^T Σ*(θ)^{-1} (Z̄ − µ*).   (4)

Note that for any i with 1 ≤ i ≤ b we can express

µ*_i = E[Z̄_i] = (1/k) Σ_{j=1}^{k} X_{(i-1)k+j} β,   (5)

where X_l denotes the l-th row of the design matrix, and we let µ* = {µ*_1, …, µ*_b} be the mean vector. Next we compute the covariance matrix of the cluster means process: for any i and j, compute

σ*_ij = Cov[Z̄_i, Z̄_j] = (1/k^2) Σ_{l=1}^{k} Σ_{l'=1}^{k} σ_{(i-1)k+l, (j-1)k+l'}

and define Σ* = (σ*_ij)_{1 ≤ i,j ≤ b}.

Next we need to calculate the conditional likelihood of each block given its mean. We first consider the joint density of the first k − 1 observations in each block and the corresponding block mean. For all 1 ≤ i ≤ b, the vector (Z_{(i-1)k+1}, …, Z_{ik-1}, Z̄_i)^T is normally distributed with mean vector (µ_i, µ*_i) and covariance matrix

( Σ_i     τ_i   )
( τ_i^T   σ*_ii ),

where µ*_i is given by (5) and

µ_i^T = {µ_{(i-1)k+1}, …, µ_{ik-1}},
σ_{j j'} = Cov[Z_{(i-1)k+j}, Z_{(i-1)k+j'}],   Σ_i = (σ_{j j'})_{1 ≤ j, j' ≤ k-1},
τ_i = {τ_{(i-1)k+1}, …, τ_{ik-1}},   where τ_{(i-1)k+j} = Cov[Z̄_i, Z_{(i-1)k+j}].

Standard multivariate normal results give the conditional joint density of Z_{(i-1)k+1}, …, Z_{ik-1} given Z̄_i as

N( µ_i + τ_i (σ*_ii)^{-1} (Z̄_i − µ*_i),  Σ_i − τ_i (σ*_ii)^{-1} τ_i^T ).

Denote by µ_ci and Σ_ci(θ) the conditional mean and conditional covariance matrix given the block mean, for the i-th block. Then, in the form of equation (3), we obtain the conditional negative log likelihoods

l_ci(β, θ) = ((k−1)/2) log(2π) + (1/2) log|Σ_ci(θ)| + (1/2) (Z_i − µ_ci)^T Σ_ci(θ)^{-1} (Z_i − µ_ci),   (6)

where Z_i denotes the vector of the first k − 1 observations in block i.

The last step of the algorithm is to multiply the individual likelihoods (4) and (6) or, equivalently, to sum the b + 1 individual negative log likelihoods. Thus the estimating function has the form

l_full = (1/2) [ bk log(2π) + log|Σ*(θ)| + Σ_{i=1}^{b} log|Σ_ci(θ)| + (Z̄ − µ*)^T Σ*(θ)^{-1} (Z̄ − µ*) + Σ_{i=1}^{b} (Z_i − µ_ci)^T Σ_ci(θ)^{-1} (Z_i − µ_ci) ].   (7)

We now give a rough sketch of the Hybrid estimation scheme.

MODIFIED ALGORITHM

1. Note first that we can assume, without any loss of generality, that Σ = α V(θ), and hence Σ* = α V*(θ) and Σ_ci = α V_ci(θ). For the current value of θ, compute V* = V*(θ) and V_ci = V_ci(θ). Next, perform the Cholesky decompositions V* = L* L*^T and V_ci = L_ci L_ci^T for all i = 1, …, b.

2. Calculate L*^{-1} and L_ci^{-1} for all i = 1, …, b (which is straightforward, since these are lower triangular matrices).

3. Calculate |L*| and |L_ci|, which are simply the products of the diagonal entries of L* and L_ci respectively, for all i = 1, …, b; thus |V*| = |L*|^2 and |V_ci| = |L_ci|^2.

4. Compute Z̃* = L*^{-1} Z̄ and X̃* = L*^{-1} X*, where X* is the matrix of cluster-averaged regressors. Also compute Z̃_ci = L_ci^{-1} Z_i and hence the transformed conditional mean, for all i = 1, …, b.
5. This step calculates the GLS estimator of β. The problem that arises here is that one would need to compute the inverse of the original covariance matrix V, which is a prohibitive operation if n is too large. One possible reduction of this problem would be to consider instead the covariance matrix of the joint distribution of all clusters, under the working assumption that they are independent given the cluster means. Since an efficient way to invert this approximating matrix is not yet known, another current option is to consider the approximate estimator

β̃ = (β̂_mean + Σ_j β̂_block_j) / (b + 1),

where β̂_mean and β̂_block_j are the least squares estimators from the means process and the conditional block processes, respectively. In some simple cases, one can analytically minimize the estimating function with respect to α by defining

α̂(θ) = (G*^2(θ) + Σ_{i=1}^{b} G_ci^2(θ)) / n,

where G^2 denotes the corresponding sum of squares.

6. Define the profile negative log likelihood

g(θ) = (n/2) log(2π) + (n/2) log[ (G*^2(θ) + Σ_{i=1}^{b} G_ci^2(θ)) / n ] + (1/2) [ log|V*(θ)| + Σ_{i=1}^{b} log|V_ci(θ)| ] + n/2,

so that g is the function to be minimized.

7. Repeat steps 1-6 for each θ at which g has to be evaluated. The minimum is eventually achieved at a point θ̂, and this defines the estimator.

4 Time Series Setting

It quickly becomes clear that obtaining a theoretical approximation to the asymptotic variance of the alternative estimators in this generality is rather tedious. To avoid some of the complications due to the generality of that approach, we consider a simpler case that preserves the characteristics of the original problem. We carry out the complete calculations for the first-order autoregressive time series, AR(1),

X_n = φ X_{n-1} + ε_n,

where the {ε_n} are independent N[0, σ_ε^2]. This case is particularly appealing since it allows us to rewrite the quasi-likelihood function as a quadratic form of independent normal random variables. Although apparently simple, carrying out the complete calculations turns out to be rather involved. In this section we present the general ideas and a number of important computational details for the time series problem.

4.1 Big Blocks Method

For the first method, divide the original time series into b blocks of k observations each. Compute the mean of each block and denote by {X̄_m} the means time series; in other words, X̄_m is the average of the observations in the m-th block. Let γ*_{m-1} = Cov[X̄_1, X̄_m]. Through routine algebraic manipulations we obtain that {X̄_m} has the following covariance structure:

γ*_m = σ_ε^2 (k − kφ^2 − 2φ + 2φ^{k+1}) / (k^2 (1−φ)^2 (1−φ^2))   if m = 0,
γ*_m = σ_ε^2 φ (1−φ^k)^2 / (k^2 (1−φ)^2 (1−φ^2))                  if m = 1,   (8)
γ*_m = (φ^k)^{m-1} γ*_1                                            if m ≥ 2.

It is interesting to remark here that this covariance structure corresponds to an ARMA(1,1) process. First we calculate the likelihood function for the means time series, using the covariance structure derived in equation (8), and compute the variance of the estimator using the information sandwich technique. Denoting by V_ave the covariance matrix of the means process, the likelihood function is

L_ave = (2π)^{-b/2} |V_ave|^{-1/2} exp{ −(1/2) X̄^T V_ave^{-1} X̄ }.   (9)

Define

V' = (∂/∂φ) V_ave^{-1} = (v'_ij)_{1 ≤ i,j ≤ b}

and assume that σ_ε^2 is known (this assumption has little bearing on the final result and considerably simplifies the computations). Thus the first derivative of the negative log likelihood function, modulo fixed constants, is given by

(∂/∂φ) l(φ) = (1/2) (∂/∂φ) log|V_ave| + (1/(2k^2)) Σ_{i=1}^{b} Σ_{j=1}^{b} v'_ij Σ_{l=1}^{k} Σ_{m=1}^{k} x_{(i-1)k+l} x_{(j-1)k+m}.

There is no apparent closed-form solution obtained by equating this first derivative to zero.
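Since no closed form is available, the criterion can be minimized numerically. The following is a minimal sketch of this scheme, assuming a mean-zero AR(1) with known σ_ε^2; the function names and the values of b, k and φ are illustrative.

    import numpy as np
    from scipy.linalg import toeplitz
    from scipy.optimize import minimize_scalar

    def block_mean_acvf(phi, k, b, sig2_eps=1.0):
        # gamma*_0, ..., gamma*_{b-1} of the block means process, from (8)
        c = sig2_eps / (k**2 * (1 - phi)**2 * (1 - phi**2))
        g = np.empty(b)
        g[0] = c * (k - k * phi**2 - 2 * phi + 2 * phi**(k + 1))
        if b > 1:
            g1 = c * phi * (1 - phi**k)**2
            g[1:] = g1 * (phi**k) ** np.arange(b - 1)
        return g

    def big_blocks_nll(phi, xbar, k):
        # negative log of (9), modulo the constant (b/2) log(2 pi)
        V_ave = toeplitz(block_mean_acvf(phi, k, len(xbar)))
        _, logdet = np.linalg.slogdet(V_ave)
        return 0.5 * (logdet + xbar @ np.linalg.solve(V_ave, xbar))

    # usage: simulate an AR(1) series, form block means, estimate phi
    rng = np.random.default_rng(0)
    b, k, phi0 = 50, 20, 0.5
    x = np.empty(b * k)
    x[0] = rng.standard_normal() / np.sqrt(1 - phi0**2)   # stationary start
    for t in range(1, b * k):
        x[t] = phi0 * x[t - 1] + rng.standard_normal()
    xbar = x.reshape(b, k).mean(axis=1)
    fit = minimize_scalar(lambda p: big_blocks_nll(p, xbar, k),
                          bounds=(-0.99, 0.99), method="bounded")
    print(fit.x)   # Big Blocks estimate of phi

Note that each evaluation only factorizes a b × b matrix, which is the source of the dimensionality reduction.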
We compute the variance of the first derivative of the negative log likelihood function and the expected value of the second derivative. To be able to compute them, we rewrite the derivative using the representation of x_i as an AR(1) process,

x_i = σ_ε Σ_{r=-∞}^{i} φ^{i-r} ξ_r,

and it follows that

S_n = Σ_{i=1}^{b} Σ_{j=1}^{b} v'_ij Σ_{l=1}^{k} Σ_{m=1}^{k} x_{(i-1)k+l} x_{(j-1)k+m}
    = σ_ε^2 Σ_{i=1}^{b} Σ_{j=1}^{b} Σ_{l=1}^{k} Σ_{m=1}^{k} v'_ij Σ_{r=-∞}^{(i-1)k+l} Σ_{s=-∞}^{(j-1)k+m} φ^{(i+j-2)k+l+m-r-s} ξ_r ξ_s.

Carefully rearranging and combining the above coefficients, S_n is equivalent to

S_n = Σ_r a_rr ξ_r^2 + 2 Σ_{r<s} a_rs ξ_r ξ_s.

Next we apply the Martingale Central Limit Theorem to S_n, which is a quadratic form of independent normal random variables, and obtain

m_n = E[S_n] = Σ_r a_rr = 0

(the full score, including the determinant term, has mean zero at the true parameter value) and

v_n = Var[S_n] = 2 Σ_r a_rr^2 + 4 Σ_{r<s} a_rs^2.
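These moment formulas are easy to verify by simulation. A minimal sketch, using the symmetric-matrix convention (a_rs = a_sr, so the coefficient of ξ_r ξ_s for r < s is 2 a_rs):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 40
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2.0                      # symmetric coefficients a_rs = a_sr
    xi = rng.standard_normal((100000, n))    # rows are independent N[0,1] vectors
    S = np.einsum("ti,ij,tj->t", xi, A, xi)  # S = sum_r a_rr xi_r^2 + 2 sum_{r<s} a_rs xi_r xi_s

    m_n = np.trace(A)                                            # sum_r a_rr
    v_n = 2 * np.sum(np.diag(A) ** 2) + 4 * np.sum(np.triu(A, 1) ** 2)
    print(m_n, S.mean())   # the two means should agree
    print(v_n, S.var())    # the two variances should agree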
In a similar fashion we calculate the expected value of the second derivative of the quasi-likelihood function and apply the information sandwich formula to obtain an approximation to the variance of the estimator φ̂_1.

The covariance matrix in this case has a very complex structure, and finding its inverse is a nontrivial exercise. One could use numerical methods to calculate the inverse, but since that would introduce additional error into the calculations, we prefer a strategy that takes advantage of the fact that it is a particular case of a Toeplitz matrix. Trench (1964) developed an algorithm that calculates the exact inverse of any Toeplitz matrix. His algorithm is faster than the traditional approach via the Cholesky decomposition: it is of order b^2 when the matrix has b rows and columns. We then use numerical methods to calculate the first and second derivatives of the quasi-likelihood function.

Given the intractable analytical structure of the relative efficiency, we analyze its values numerically for a few particular cases. We proceed in the following manner. After computing the elements of the matrix V' we evaluate each of the coefficients a_rs. Each coefficient consists of finite sums only, so its evaluation is routine algebra. The next step is to calculate the sum of these coefficients over all values of r and s, taking advantage of the fact that both indices are bounded above. Then, taking a closer look at the structure of the coefficients, we distinguish two cases. For r, s ≥ 2 we need to evaluate only finite sums. In the other cases, when r or s is less than or equal to 1, we exploit the fact that we can separate the sums containing r and s from the other sums, and simply compute these infinite sums (over r and s) as geometric series. In the end we combine all these sums to obtain the final result.

The following table illustrates the performance of the Big Blocks estimator (relative to the classical MLE) for different values of b, k and φ. It also provides a check on the validity of the theoretical results, by comparing them with the analogous results obtained through simulations.

             b = 25, k = 100       b = 250, k = 10
    φ        Theory     Sim        Theory     Sim
  −0.750     0.0014     0.00       0.00549    0.005
  −0.250     0.01166    0.013      0.0898     0.080
  −0.010     0.0195     0.018      0.1599     0.158
   0.010     0.0003     0.019      0.1670     0.165
   0.250     0.0380     0.03       0.780      0.69
   0.750     0.13367    0.13       0.73897    0.74

It is clear at this point that this approximation to the likelihood does not lead to an efficient estimator. Although appealing for its simplicity and considerable dimension reduction, it is inefficient even for moderate block sizes, a caveat that makes it unfit for realistic problems. We therefore need to alter the way we compute the minimizing criterion and account more adequately for the underlying correlation structure, which leads us to the next technique.

4.2 Small Blocks Method

This method ignores the correlation between blocks but takes into account the true dependence structure within blocks. We construct the quasi-likelihood in this case as the product of the b individual block likelihoods. Since the original process is AR(1), it is immediate that the quasi-likelihood function has the form

L_blk = (2π)^{-bk/2} |V_bl|^{-b/2} exp{ −(1/2) Σ_{j=1}^{b} X_blj^T V_bl^{-1} X_blj },

where X_blj is the vector of observations in block j and V_bl their common covariance matrix. To calculate the asymptotic variance, we follow the expansion method: we rewrite the first derivative of the negative log likelihood as a quadratic form of normal random variables.
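For AR(1) blocks, the block likelihood factorizes through the innovations, so the Small Blocks criterion requires no matrix inversion at all. A minimal sketch, assuming a mean-zero process with known innovation variance:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def small_blocks_nll(phi, x, b, k, sig2_eps=1.0):
        """Sum of exact AR(1) block negative log likelihoods,
        treating the blocks as independent."""
        blocks = x.reshape(b, k)
        # stationary start of each block: x_1 ~ N(0, sig2 / (1 - phi^2))
        nll = 0.5 * b * np.log(2 * np.pi * sig2_eps / (1 - phi**2))
        nll += 0.5 * (1 - phi**2) / sig2_eps * np.sum(blocks[:, 0] ** 2)
        # transitions within each block: x_t | x_{t-1} ~ N(phi x_{t-1}, sig2)
        resid = blocks[:, 1:] - phi * blocks[:, :-1]
        nll += 0.5 * b * (k - 1) * np.log(2 * np.pi * sig2_eps)
        nll += 0.5 * np.sum(resid ** 2) / sig2_eps
        return nll

    # usage, reusing the simulated series x, b, k from the Big Blocks sketch:
    # fit = minimize_scalar(lambda p: small_blocks_nll(p, x, b, k),
    #                       bounds=(-0.99, 0.99), method="bounded")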
Modulo fixed constants, this equals

S_n = σ_ε^2 ( Σ_{m=0}^{b-1} Σ_{s=mk+2}^{(m+1)k} Σ_{r=-∞}^{s-1} φ^{s-r-1} ξ_r ξ_s + Σ_{m=0}^{b-1} Σ_{r=-∞}^{mk+1} Σ_{s=-∞}^{mk+1} φ^{2mk+3-r-s} ξ_r ξ_s ),

where the first term collects the within-block transition products and the second the contribution of the initial observation of each block. To compute the asymptotic variance we use the information sandwich technique. Unfortunately, even in this simple case the calculations are far from trivial. The final expression for the relative efficiency is not simple enough to allow us to study its limiting behavior analytically, so we compute it for a few particular cases. The following table compares values of the relative efficiency derived by the theoretical approach described above with values obtained through simulations.

             b = 25, k = 100       b = 250, k = 10
    φ        Theory     Sim        Theory     Sim
  −0.750     0.98998    0.999      0.9595     0.934
  −0.250     0.999      0.991      0.9139     0.898
  −0.010     0.99199    0.990      0.9018     0.891
   0.010     0.99199    0.990      0.9018     0.89
   0.250     0.999      0.99       0.9139     0.91
   0.750     0.98998    0.993      0.9595     0.94

From this table we note that the performance of the Small Blocks estimator is quite good, clearly improved compared to the previous case. However, one can think of instances where the assumption of independence between blocks is too strong. Next we relax this assumption and construct the third approximation to the likelihood.

4.3 Hybrid Method

As mentioned earlier, the assumption here is that, given the block means, the blocks are independent. We use the likelihood of the means, already developed for the Big Blocks method (see equation (9)), and the conditional likelihoods of each block given its block mean. The only difficulty here is in computing the conditional means and covariances. Taking advantage of the special structure of the AR(1) process and applying standard multivariate normal results, one obtains a rather simple form for this matrix and its inverse. Since the calculation of the quasi-likelihood function involves both the likelihood of the means and the conditional block likelihoods, we again need the expansion method. The estimating function in this case is

l_hyb = (1/2) bk log(2π) + (1/2) [ log|V_ave(φ)| − b log|W(φ)| + X̄^T V_ave(φ)^{-1} X̄ + Σ_{j=1}^{b} (X_j^{k-1} − µ_j(φ))^T W(φ) (X_j^{k-1} − µ_j(φ)) ],

where W denotes the inverse of the block conditional covariance matrix, µ_j the conditional mean of block j, and X_j^{k-1} the vector of the first k − 1 observations in block j. Rearranging the sums, one can rewrite the gradient and Hessian of the quasi-likelihood function as quadratic forms of independent normal random variables.
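A minimal numerical sketch of this estimating function, again assuming a mean-zero AR(1) with known σ_ε^2; for clarity, the means covariance is built by brute force through the block-averaging operator rather than from the closed form (8), which one would use in practice.

    import numpy as np

    def ar1_cov(m, phi, sig2_eps=1.0):
        # stationary AR(1) covariance matrix: gamma_h = sig2 phi^|h| / (1 - phi^2)
        idx = np.arange(m)
        return sig2_eps * phi ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - phi**2)

    def hybrid_nll(phi, x, b, k):
        blocks = x.reshape(b, k)
        xbar = blocks.mean(axis=1)
        # negative log likelihood of the block means, cf. (9); 2*pi constants omitted
        A = np.kron(np.eye(b), np.full((1, k), 1.0 / k))   # block-averaging operator
        V_ave = A @ ar1_cov(b * k, phi) @ A.T
        nll = 0.5 * (np.linalg.slogdet(V_ave)[1] + xbar @ np.linalg.solve(V_ave, xbar))
        # conditional law of the first k-1 observations in a block given its mean
        Vk = ar1_cov(k, phi)
        tau = Vk[:-1, :].mean(axis=1)     # Cov[x_j, xbar] within a block
        s2 = Vk.mean()                    # Var[xbar]
        Sig_c = Vk[:-1, :-1] - np.outer(tau, tau) / s2
        _, logdet_c = np.linalg.slogdet(Sig_c)
        W = np.linalg.inv(Sig_c)          # the matrix W of the estimating function
        for i in range(b):
            r = blocks[i, :-1] - tau / s2 * xbar[i]   # residual from conditional mean
            nll += 0.5 * (logdet_c + r @ W @ r)
        return nll

Only b × b and (k−1) × (k−1) systems are factorized, and the conditional pieces are identical across blocks, so they need to be computed once per value of φ.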
The following table presents the theoretical and simulation-based values of the asymptotic efficiency:

             b = 25, k = 100       b = 250, k = 10
    φ        Theory     Sim        Theory     Sim
  −0.750     0.99953    0.999      0.967      0.943
  −0.250     0.9980     0.998      0.91373    0.91
  −0.010     0.99495    0.995      0.89989    0.898
   0.010     0.99457    0.995      0.89739    0.897
   0.250     0.9903     0.99       0.91409    0.903
   0.750     0.99134    0.991      0.91800    0.9

Note that both the Hybrid and the Small Blocks estimators perform very well, much better than the Big Blocks estimator. At this point it is not clear whether the Hybrid estimator always performs better than the Small Blocks estimator.

5 Extension to a Lattice Sampled Process

The theoretical computations required to derive the asymptotic variance of the proposed estimators in the spatial case are extremely involved. We therefore illustrate here how one would extend the strategy used for the one-dimensional time series problem to the analogous spatial process. Consider a spatial process on an integer lattice, denoted x_ij, where i and j are integers. Since we model the process by its covariance structure, one of the simplest forms to consider is the Kronecker product form

Cov[x_ij, x_kl] = γ^(1)_{ik} γ^(2)_{jl},   (10)

where γ^(1) and γ^(2) are the covariances of one-dimensional time series in the horizontal and vertical directions. If we assume that these are both of AR(1) form, with the same autoregressive parameter, then we deduce

Cov[x_ij, x_kl] = σ_x^2 φ^{|i-k| + |j-l|},   (11)

where |φ| < 1 for stationarity. An equivalent definition, which represents (11) as a function of an array of independent N[0, 1] random variables {ξ_ij}, is the formula

x_ij = σ_x (1 − φ^2) Σ_{r=0}^{∞} Σ_{s=0}^{∞} φ^{r+s} ξ_{i+r, j+s}.   (12)

We may also represent the process equivalently by

x_ij − φ (x_{i+1,j} + x_{i,j+1}) + φ^2 x_{i+1,j+1} = ε_ij,   (13)

where the ε_ij = σ_x (1 − φ^2) ξ_ij are independent N[0, σ_ε^2], with σ_ε^2 = σ_x^2 (1 − φ^2)^2. In the Kronecker product notation, the covariance matrix of the process is U ⊗ U, and the inverse covariance matrix is U^{-1} ⊗ U^{-1}, where U is just the AR(1) covariance matrix. Note that the processes we have defined here lie within the general class of spatial processes on lattices first defined by Whittle (1954).

We now consider maximum likelihood estimation of φ. The model is that the observations {x_ij, 1 ≤ i ≤ m, 1 ≤ j ≤ n} have a joint normal distribution with mean 0 and covariances given by (11). We also assume that σ_ε^2 is known. The negative log likelihood for φ is then, modulo some fixed constants,

(1/2) Σ_i Σ_j Σ_k Σ_l x_ij x_kl v_{ijkl},

where v_{ijkl} is the component of the inverse covariance matrix at the (i, j) × (k, l) position. Note that the analytical form of the above estimating function is completely specified and can be rewritten as a quadratic form of independent normal random variables, taking advantage of the representations of x_ij in (12) and (13). We could therefore follow the expansion technique and compare the results with the classical MLE case, where the Fisher information technique leads to valid results. We also plan to work on the practical implementation of all the methods in the spatial setting, using as a guide the conclusions derived from the time series setting. The next step is to analyze a classical data set of large dimensions using the proposed methodology and compare our results with the already established ones. The final aim is to analyze a particulate matter or ozone data set consisting of too many observations for the classical MLE technique, thus demonstrating the appeal of this new methodology.
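The separable structure makes simulation from (11) inexpensive: because the covariance is U ⊗ U, the field can be generated with two triangular matrix multiplications instead of an mn × mn factorization. A minimal sketch (square grid assumed), with a Monte Carlo check of (11):

    import numpy as np

    rng = np.random.default_rng(2)
    m = n = 30
    phi, sig_x = 0.6, 1.0

    # AR(1) correlation matrix U, so that Cov(vec X) = sig_x^2 (U kron U)
    U = phi ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    L = np.linalg.cholesky(U)

    def sample():
        # X = sig_x * L Xi L^T has Cov[x_ij, x_kl] = sig_x^2 U_ik U_jl, as in (11)
        return sig_x * L @ rng.standard_normal((m, n)) @ L.T

    # empirical covariance at lag (1, 2) versus sig_x^2 phi^{1+2}
    reps = [sample() for _ in range(2000)]
    emp = np.mean([Z[5, 5] * Z[6, 7] for Z in reps])
    print(emp, sig_x**2 * phi**3)   # should agree to Monte Carlo error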
References

[1] Billingsley, P. (1995), Probability and Measure, Third Edition. Wiley, New York.

[2] Brockwell, P.J. and Davis, R.A. (1991), Time Series: Theory and Methods, Second Edition. Springer-Verlag, New York.

[3] Cressie, N. (1993), Statistics for Spatial Data, Revised Edition. John Wiley, New York.

[4] Holland, D.M., Caragea, P.C. and Smith, R.L. (2001), Trends in rural sulfur concentrations. Preprint.

[5] Holland, D.M., De Oliveira, V., Cox, L.H. and Smith, R.L. (2000), Estimation of regional trends in sulfur dioxide over the eastern United States. Environmetrics, to appear.

[6] Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979), Multivariate Analysis. Academic Press, New York.

[7] Liang, K.Y. and Zeger, S.L. (1986), Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.

[8] Mardia, K.V. and Marshall, R.J. (1984), Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika 71, 135-146.

[9] Smith, R.L. (1996), Estimating nonstationary spatial correlations. Preprint, University of North Carolina.

[10] Smith, R.L. (2001), CBMS Course in Environmental Statistics. University of Washington, June 2001.

[11] Stein, M.L. (1999), Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag, New York.

[12] Trench, W.F. (1964), An algorithm for the inversion of finite Toeplitz matrices. J. Soc. Indust. Appl. Math. 12, 515-522.

[13] Vecchia, A.V. (1988), Estimation and identification for continuous spatial processes. J. Roy. Statist. Soc. B 50, 297-312.

[14] Whittle, P. (1954), On stationary processes in the plane. Biometrika 41, 434-449.