Spatial Statistics Chapter 3 Basics of areal data and areal data modeling


1 Spatial Statistics Chapter 3 Basics of areal data and areal data modeling. Recall that areal data, also known as lattice data, are data $Y(s)$, $s \in D$, where $D$ is a discrete index set. This usually corresponds to data $Y_1, \ldots, Y_n$ observed on a set of geographical units (over a map), the pixels of an image, or a regular arrangement of points on a lattice.

2 Models for areal data are also sometimes employed for irregularly arranged point-referenced data sets when the number of spatial units is very large, for computational reasons.

3 As we shall see in Chapter 5, certain types of areal models are computationally easier to work with and ideal for use with the Gibbs sampler. In this setting, unlike the geostatistical one, we are typically not interested in prediction and have observed data at all spatial sites. What is of interest in this setting? Is a spatial pattern evident? Are there clusters of high/low values?

4 Smoothing: Filter out some of the noise in the data to help elucidate the spatial pattern. Deciding how much to smooth the data is not always clear. Smoother maps are easier to interpret but will generally not represent the data well, and vice versa. Example: No smoothing at all is equivalent to presenting a raw map of the data. Extreme smoothing would involve associating the same value $\bar{Y}$ with all units. Optimal smoothing lies somewhere between these two extremes.

5 Also of interest in this setting is relating the response to covariates through regression models; we need to account for spatial dependence in such regression models. Also in the regression setting, we would be interested in examining the residual spatial structure after accounting for covariates. Exploratory methods for areal data: Recall that the primary source of spatial information in the areal setting consists of adjacencies: knowing, for each region, all the neighboring regions (for some appropriate definition of neighbor), i.e., the arrangement of the regions across the map.

6 This adjacency structure is quantified through the neighborhood (or proximity) matrix $W$:

$W_{ij} = 0$ if $i = j$; $\quad W_{ij} = 0$ if $i$ and $j$ are not neighbors; $\quad W_{ij} = c_{ij} > 0$ if $i$ and $j$ are neighbors,

where $c_{ij}$ quantifies the strength of the neighbor relationship. Most often $c_{ij} = 1$ for all neighbor pairs, and two regions are considered neighbors if they share a common boundary. It is instructive to think of this spatial structure as a graph, where nodes correspond to regions and two nodes on the graph are connected if the associated regions are neighbors.
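As a minimal sketch of this construction, the following builds a binary first-order $W$ from a neighbor list for a hypothetical four-region map (the regions and adjacencies are made up for illustration):

```python
import numpy as np

# Hypothetical 4-region map: region 0 borders 1 and 2, region 3 borders 1 and 2.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

n = len(neighbors)
W = np.zeros((n, n))
for i, nbrs in neighbors.items():
    for j in nbrs:
        W[i, j] = 1.0          # c_ij = 1 for every neighbor pair

assert np.allclose(W, W.T)      # the neighbor relation is symmetric
assert np.all(np.diag(W) == 0)  # W_ii = 0 by convention
print(W.sum(axis=1))            # w_{i+}: the number of neighbors of each region
```

Viewed as a graph, each row of $W$ simply lists the edges incident to that node.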

7 The neighborhood matrix $W$ can be used for exploratory analysis and will also be used when we discuss models for areal data. Note that it is also possible to define 2nd-order neighbors and to have a corresponding 2nd-order neighborhood matrix. After simply plotting the data (usually on a map in this case), an exploratory analysis usually proceeds with an attempt to quantify the strength of spatial association in the data.

8 For this, two statistics can be employed: 1. Moran's I:

$I = \dfrac{n \sum_i \sum_j w_{ij} (Y_i - \bar{Y})(Y_j - \bar{Y})}{\left(\sum_i \sum_j w_{ij}\right) \sum_i (Y_i - \bar{Y})^2}$

where $I \approx 0$ indicates no spatial dependence, $I > 0$ positive spatial dependence, and $I < 0$ negative spatial dependence. Moran's I can be thought of as an areal correlation coefficient.

9 2. Geary's C:

$C = \dfrac{(n-1) \sum_i \sum_j w_{ij} (Y_i - Y_j)^2}{2 \left(\sum_i \sum_j w_{ij}\right) \sum_i (Y_i - \bar{Y})^2}$

where $C \geq 0$; $C \approx 1$ indicates no spatial dependence, $C < 1$ positive spatial dependence, and $C > 1$ negative spatial dependence. Under the hypothesis that the $Y_i$'s are iid, one can show that the asymptotic distributions of both statistics are normal and that $E[I] = 0$ and $E[C] = 1$.
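A direct translation of the two formulas, applied to a toy one-dimensional "map" of five regions in a row (an assumed example, not data from the notes):

```python
import numpy as np

def morans_i(y, W):
    """Moran's I: an areal analogue of a correlation coefficient."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    num = n * (d @ W @ d)            # n * sum_ij w_ij (Y_i - Ybar)(Y_j - Ybar)
    den = W.sum() * (d @ d)          # (sum_ij w_ij) * sum_i (Y_i - Ybar)^2
    return num / den

def gearys_c(y, W):
    """Geary's C: values near 1 indicate no spatial dependence."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()
    diff2 = (y[:, None] - y[None, :]) ** 2   # (Y_i - Y_j)^2 for every pair
    num = (n - 1) * (W * diff2).sum()
    den = 2 * W.sum() * (d @ d)
    return num / den

# Path-graph W: adjacent regions in the row are neighbors.
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
y_trend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # a strong spatial trend
print(morans_i(y_trend, W))   # -> 0.5  (positive spatial dependence)
print(gearys_c(y_trend, W))   # -> 0.2  (< 1: positive spatial dependence)
```

Both statistics agree here: neighbors carry similar values, so $I > 0$ and $C < 1$.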

10 Using these asymptotic distributions one can easily construct a hypothesis test of $H_0: E[I] = 0$ against either a one- or two-sided alternative. Another, perhaps preferable, way to test for association is to use a Monte Carlo test for independence. Idea: Under the assumption that the $Y_i$'s are iid, the distribution of $I$ (and $C$) is invariant to permutations of the $Y_i$'s. What does this mean?

11 The distribution of $I$ clearly depends on $W$; however, if the spatial structure has no role to play, then permuting the rows of $W$ will not change the distribution of $I$. So $[I \mid W] \sim [I \mid W^*]$ where $W^*$ is any row permutation of $W$. To carry out a Monte Carlo test for spatial association, we randomly permute the data vector $Y$ (equivalent to permuting the rows of $W$) and calculate the new value of the statistic, say $I^{(1)}$. Repeat this procedure many times, say 999 times: $I^{(1)}, I^{(2)}, \ldots, I^{(999)}$, and plot the histogram of these values. We then locate the original observed value $I^{(\text{obs})}$ on this histogram.

12 Under the assumption that the $Y_i$'s are iid, the observed value $I^{(\text{obs})}$ comes from the same distribution as $I^{(1)}, I^{(2)}, \ldots, I^{(999)}$, so $I^{(\text{obs})}$ should lie somewhere in the main body of the histogram. If $I^{(\text{obs})}$ lies in the tails of the histogram, we have evidence against the hypothesis that the $Y_i$'s are iid. We can quantify this by calculating an empirical p-value. If associated with each $Y_i$ is a vector of covariates $x_i$, then even in the absence of spatial dependence the $Y_i$'s may not be identically distributed.
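The permutation test above can be sketched in a few lines. The map (a path of 10 regions), the trend in $y$, and the number of permutations are all assumed for illustration:

```python
import numpy as np

def morans_i(y, W):
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return len(y) * (d @ W @ d) / (W.sum() * (d @ d))

rng = np.random.default_rng(0)

# Toy 1-d map of 10 regions; a monotone trend gives genuine spatial association.
n = 10
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
y = np.arange(n, dtype=float)

i_obs = morans_i(y, W)
# Null distribution: under iid Y's, relabelling the regions leaves I unchanged
# in distribution, so each permutation of y gives one draw from the null.
i_perm = np.array([morans_i(rng.permutation(y), W) for _ in range(999)])

# Empirical one-sided p-value: rank of the observed value among the permutations.
p = (1 + np.sum(i_perm >= i_obs)) / (1 + len(i_perm))
print(i_obs, p)
```

Because the trend makes neighbors similar, $I^{(\text{obs})}$ lands far in the upper tail of the permutation histogram and the empirical p-value is small.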

13 As in the point-referenced setting, this suggests applying these techniques to the estimated residuals from standard regression models. Simple Smoothing: To filter out noise in the data and produce a smooth map, we can use the $W$ matrix and replace each $Y_i$ with

$\hat{Y}_i = \sum_j \dfrac{w_{ij}}{w_{i+}} Y_j, \qquad w_{i+} = \sum_j w_{ij},$

a weighted average that will encourage the smoothed $Y_i$ to be similar to its neighbors. Problems with this? A possible remedy is

$\hat{Y}_i^* = (1 - \alpha) Y_i + \alpha \hat{Y}_i \quad \text{for } \alpha \in [0, 1].$

14 Here, $\alpha = 0$ yields the raw data and $\alpha = 1$ yields a very smooth map. Try different values of $\alpha$ in an exploratory fashion. In Chapter 5 we will discuss hierarchical models for smoothing, which will incorporate covariate information and spatial random effects. In that setting our smoothed $Y_i$'s will be posterior means $E[Y_i \mid \text{Data}]$.
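The two smoothing formulas above can be sketched directly; the five-region map and the noisy spike are assumed for illustration:

```python
import numpy as np

def smooth(y, W, alpha):
    """Shrink each Y_i toward the average of its neighbors by a factor alpha."""
    y = np.asarray(y, dtype=float)
    w_plus = W.sum(axis=1)                 # w_{i+} = sum_j w_ij
    y_hat = (W @ y) / w_plus               # neighborhood averages
    return (1 - alpha) * y + alpha * y_hat

# Toy 1-d map of 5 regions; region 2 is a noisy spike.
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
y = np.array([1.0, 1.0, 5.0, 1.0, 1.0])

print(smooth(y, W, 0.0))   # alpha = 0: the raw data, [1. 1. 5. 1. 1.]
print(smooth(y, W, 0.5))   # alpha = 0.5: the spike is pulled in, [1. 2. 3. 2. 1.]
```

Varying $\alpha$ between 0 and 1 traces out the full range between the raw map and a heavily smoothed one.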

15 Markov Random Fields: In the point-referenced data setting we specified the joint distribution of the observed data $Y_1, \ldots, Y_n$ directly. In the areal setting, where we have $Y_1, \ldots, Y_n$ and a neighborhood matrix $W$, we will take a different approach and build the required joint distribution $f(y_1, \ldots, y_n)$ through the specification of a set of simpler full conditional distributions $f(y_i \mid y_j, j \neq i)$, $i = 1, \ldots, n$. For a given joint distribution $f(y_1, \ldots, y_n)$ we can always obtain unique and well-defined conditional distributions:

$f(y_i \mid y_j, j \neq i) = \dfrac{f(y_1, \ldots, y_n)}{\int f(y_1, \ldots, y_n)\, dy_i}$

16 But note that the converse is not always true! We cannot simply write down a set of full conditional distributions $f(y_i \mid y_j, j \neq i)$, $i = 1, \ldots, n$, and claim that these determine a unique $f(y_1, \ldots, y_n)$. Consider two random variables with $Y_1 \mid Y_2 \sim N(\alpha_0 + \alpha_1 Y_2, \sigma_1^2)$ and $Y_2 \mid Y_1 \sim N(\beta_0 + \beta_1 Y_1^3, \sigma_2^2)$.

17 In this case

$E[Y_1] = E[E[Y_1 \mid Y_2]] = E[\alpha_0 + \alpha_1 Y_2] = \alpha_0 + \alpha_1 E[Y_2],$

so $E[Y_1]$ is a linear function of $E[Y_2]$. But we also have

$E[Y_2] = E[E[Y_2 \mid Y_1]] = E[\beta_0 + \beta_1 Y_1^3] = \beta_0 + \beta_1 E[Y_1^3],$

which is not a linear function of $E[Y_1]$. Both conditions cannot hold (except in trivial cases), and so here the two conditional distributions do not determine a valid and unique joint distribution.

18 In general, when a set of full conditional distributions determines a unique and valid joint distribution, we say that the set of conditional distributions is compatible. Improper distribution: An improper distribution is a distribution with a non-integrable density. That is, if $S$ is the sample space of $Y$, then $\int_S f(y)\, dy = \infty$. When would such an object be useful in statistics? Clearly, an improper distribution is not useful as a model for data. In Bayesian statistics, where parameters are assigned probability distributions, improper distributions may be employed as priors. How?

19 Even though the prior density $\pi(\theta)$ is such that $\int \pi(\theta)\, d\theta = \infty$, having observed data $y$ (assumed to arise from a proper distribution), the corresponding posterior may be proper, $\int \pi(\theta \mid y)\, d\theta < \infty$, and so inference based on this posterior is valid. Such distributions have their uses in Bayesian statistics and in fact are used, as we shall see later, as models for random effects in an areal data setting.

20 Given a set of compatible and proper full conditional distributions $f(y_i \mid y_j, j \neq i)$, $i = 1, \ldots, n$, the resulting joint distribution can be improper! Example: consider the bivariate joint distribution with

$f(y_1, y_2) \propto \exp[-\tfrac{1}{2}(y_1 - y_2)^2], \quad (y_1, y_2) \in \mathbb{R}^2.$

This density has no valid normalizing constant since

$\int \int \exp[-\tfrac{1}{2}(y_1 - y_2)^2]\, dy_1\, dy_2 = \infty,$

and so the distribution is improper. What about the corresponding full conditional distributions?

21 Clearly $[Y_1 \mid Y_2 = y_2] \sim N(y_2, 1)$ and $[Y_2 \mid Y_1 = y_1] \sim N(y_1, 1)$, so here we have an example of two compatible and proper full conditional distributions that yield an improper joint distribution. If we have a set of compatible full conditional distributions $f(y_i \mid y_j, j \neq i)$, $i = 1, \ldots, n$, how can we determine the form of the resulting joint distribution $f(y_1, \ldots, y_n)$? Brook's Lemma.

22 Brook's Lemma notes that if $\{f(y_i \mid y_j, j \neq i), i = 1, \ldots, n\}$ is a set of compatible full conditional distributions and $y_0 = (y_{10}, \ldots, y_{n0})$ is any fixed point in the support of $f(y_1, \ldots, y_n)$, then

$f(y_1, \ldots, y_n) = \dfrac{f(y_1 \mid y_2, \ldots, y_n)}{f(y_{10} \mid y_2, \ldots, y_n)} \cdot \dfrac{f(y_2 \mid y_{10}, y_3, \ldots, y_n)}{f(y_{20} \mid y_{10}, y_3, \ldots, y_n)} \cdots \dfrac{f(y_n \mid y_{10}, \ldots, y_{n-1,0})}{f(y_{n0} \mid y_{10}, \ldots, y_{n-1,0})} \cdot f(y_{10}, \ldots, y_{n0}).$

This gives us the joint distribution up to a normalizing constant. If $f(y_1, \ldots, y_n)$ is proper, then the fact that it integrates to 1 determines the normalizing constant. How should we specify the full conditional distributions so that (1) they are compatible and (2) they are simple enough and yet yield useful spatial structure?

23 We will not worry about (1). To address (2) we will assume that the full conditional distribution of $Y_i$ depends only on its neighbors. That is, the full conditional distribution of $Y_i$ will depend only on those $Y_j$'s that have $W_{ij} \neq 0$. Letting $\partial_i = \{j : W_{ij} \neq 0\}$ denote the set of neighbors of region $i$, this implies

$f(y_i \mid y_j, j \neq i) = f(y_i \mid y_j, j \in \partial_i), \quad i = 1, \ldots, n.$

24 This sort of specification for the full conditional distributions, when compatible, is referred to as a Markov random field (MRF) due to the obvious Markovian structure of the full conditional distributions. The idea behind such models is the development of a complicated spatial dependence structure through a set of simple local specifications that depend only on lattice (or map) adjacencies. We will develop and employ these sorts of models as models for areal data or as models for random effects in an areal setting. Clique: A clique is a set of cells (or indices) such that each element in the set is a neighbor of every other element in the set.

25 Think of the graph representation of the neighborhood structure mentioned earlier. A clique represents a set of nodes $M$ on the graph such that each pair of indices $(i, j)$ with both $i$ and $j$ in $M$ represents an edge of the graph. With $n$ spatial units, we can have cliques of size $1, \ldots, n$. Potential function: A potential of order $k$ is a function of $k$ arguments that is exchangeable in its arguments. A potential function of order $k$ typically operates on the variable values $y_{s_1}, \ldots, y_{s_k}$ associated with a clique $\{s_1, \ldots, s_k\}$ of size $k$.

26 Examples for $k = 2$: 1. $y_i y_j$; 2. $(y_i - y_j)^2$; 3. $y_i y_j + (1 - y_i)(1 - y_j)$ for binary data. Gibbs Distribution: A joint distribution for $Y_1, \ldots, Y_n$ is a Gibbs distribution if the joint density/pmf $f(y_1, \ldots, y_n)$ takes the following form:

$f(y_1, \ldots, y_n) \propto \exp\{\gamma \sum_k \sum_{\alpha \in M_k} \phi^{(k)}(y_{\alpha_1}, \ldots, y_{\alpha_k})\},$

where $\phi^{(k)}(\cdot)$ is a potential of order $k$, $M_k$ is the collection of all cliques of size $k$, and $\gamma > 0$ is a parameter.

27 The joint distribution $f(y_1, \ldots, y_n)$ depends on $y_1, \ldots, y_n$ only through potential functions evaluated over the cliques induced by the neighborhood (graph) structure. Note that such a distribution may have more than one parameter: the potential functions may themselves depend on unknown parameters.

28 Hammersley-Clifford Theorem: If we have an MRF, then the corresponding joint distribution is a Gibbs distribution. Having only cliques of order 1 corresponds to independence; consider the form of the corresponding Gibbs distribution. Distributions having cliques of order 2 are most common. An example is the pairwise difference form

$f(y_1, \ldots, y_n) \propto \exp\{-\dfrac{1}{2\tau^2} \sum_{i \sim j} (y_i - y_j)^2\},$

based on quadratic potential functions, where the sum is over neighbor pairs $i \sim j$.

29 Conditionally autoregressive (CAR) models: A particularly popular class of MRF models introduced by J. Besag in 1974. These models have become very popular within the last decade, particularly since the advent of Gibbs sampling. Gibbs sampling is a procedure for simulating realizations from a joint distribution $f(y_1, \ldots, y_n)$ using only the full conditional distributions $\{f(y_i \mid y_j, j \neq i), i = 1, \ldots, n\}$.

30 Useful in Bayesian statistics when we want to draw samples from a posterior distribution of interest. MRF models are ideal in this setting since they are specified in terms of full conditional distributions. More on this later...

31 Autonormal (Gaussian) CAR models: Here we begin with the full conditionals

$[Y_i \mid y_j, j \neq i] \sim N(\sum_j b_{ij} y_j, \tau_i^2), \quad i = 1, \ldots, n.$

For appropriately chosen $b_{ij}$ these full conditionals are compatible, so using Brook's lemma we can obtain the joint distribution as

$f(y_1, \ldots, y_n) \propto \exp\{-\tfrac{1}{2} y' D^{-1}(I - B) y\},$

where $B = (b_{ij})$ and $D = \text{diag}\{\tau_1^2, \ldots, \tau_n^2\}$. This looks like a multivariate normal distribution with $\mu = 0$ and $\Sigma_y^{-1} = D^{-1}(I - B)$.

32 This is of course only true if $D^{-1}(I - B)$ is symmetric. We must choose the $b_{ij}$ in the conditional Gaussian distributions to ensure this symmetry. In particular, choosing the $b_{ij}$ so that $b_{ij}/\tau_i^2 = b_{ji}/\tau_j^2$ for all $i, j$ will ensure symmetry (and compatibility). Notice that if $\tau_i^2 \neq \tau_j^2$ then we cannot have $b_{ij} = b_{ji}$. How do we choose the $b_{ij}$'s subject to the above constraints, and also to yield a reasonable joint spatial distribution?

33 We will take the $b_{ij}$'s to be functions of the neighborhood matrix $W$:

$b_{ij} = \dfrac{w_{ij}}{w_{i+}}, \qquad \tau_i^2 = \dfrac{\tau^2}{w_{i+}}.$

Does this specification satisfy the symmetry condition? With these choices the full conditional distributions are

$[Y_i \mid y_j, j \neq i] \sim N(\sum_j \dfrac{w_{ij}}{w_{i+}} y_j, \dfrac{\tau^2}{w_{i+}}), \quad i = 1, \ldots, n.$

Interpretation?

34 The joint distribution for these choices of $b_{ij}$ and $\tau_i^2$ is

$f(y_1, \ldots, y_n) \propto \exp\{-\dfrac{1}{2\tau^2} y'(D_W - W) y\},$

where $D_W = \text{diag}\{w_{1+}, \ldots, w_{n+}\}$. This is again MVN with $\mu = 0$ and $\Sigma_y^{-1} = \tau^{-2}(D_W - W)$. Note here that $(D_W - W)\mathbf{1} = 0$, so $\Sigma_y^{-1}$ is singular! This is a singular MVN distribution: an improper distribution with no valid normalizing constant.
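The singularity is easy to check numerically. The sketch below builds $D_W - W$ for a hypothetical 3x3 grid map with rook (shared-edge) adjacency and verifies that the all-ones vector lies in its null space:

```python
import numpy as np

# W for a 3x3 grid with rook (shared-edge) adjacency; the map is assumed.
n_side = 3
n = n_side * n_side
W = np.zeros((n, n))
for i in range(n_side):
    for j in range(n_side):
        k = i * n_side + j
        if j + 1 < n_side: W[k, k + 1] = W[k + 1, k] = 1.0   # east neighbor
        if i + 1 < n_side: W[k, k + n_side] = W[k + n_side, k] = 1.0  # south

D_W = np.diag(W.sum(axis=1))
Q = D_W - W                      # the IAR precision matrix (up to 1/tau^2)

print(Q @ np.ones(n))            # (D_W - W) 1 = 0: a vector of zeros
print(np.linalg.matrix_rank(Q))  # rank n - 1 < n, so Q is singular
```

For a connected map the rank deficiency is exactly one, corresponding to the unidentified overall level discussed next.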

35 Such a distribution is often referred to as a Gaussian intrinsic autoregression (IAR). To further investigate this impropriety we can rewrite the joint distribution as

$f(y_1, \ldots, y_n) \propto \exp\{-\dfrac{1}{2\tau^2} \sum_{i \sim j} w_{ij} (y_i - y_j)^2\},$

a pairwise-difference Gibbs distribution with quadratic potentials, where the sum is over neighbor pairs. What happens to this distribution if we add a constant $\mu$ to all the $Y_i$? Nothing: the $Y_i$'s are not centered. This distribution does not identify an overall mean.

36 To provide the required centering we can impose the constraint $\sum_i Y_i = 0$. Problems with this as a model for data? We cannot expect our data to respect this constraint. This constrained improper distribution cannot be used as a model for data, but can be used as a model for spatial random effects (a prior for parameters that vary spatially). Perhaps explain this in the context of a map...

37 If we want to use the autonormal model as a distribution for data (as opposed to a prior for spatial random effects), we need an alternative solution to the impropriety problem. We have $(D_W - W)\mathbf{1} = 0$, which causes these unfortunate results. An obvious remedy is to incorporate a constant $\rho$ so that $\Sigma_y^{-1} = \tau^{-2}(D_W - \rho W)$ is non-singular. Such models are often referred to as proper CAR models.

38 How do we choose $\rho$ to ensure non-singularity? Non-singularity is guaranteed provided $\rho \in (1/\lambda_{(1)}, 1/\lambda_{(n)})$, where $\lambda_{(1)} < \lambda_{(2)} < \cdots < \lambda_{(n)}$ are the ordered eigenvalues of $D_W^{-1/2} W D_W^{-1/2}$. It is also possible to show that $\lambda_{(1)} < 0$ and $\lambda_{(n)} > 0$, so that the interval $(1/\lambda_{(1)}, 1/\lambda_{(n)})$ contains 0. How to choose $\rho$?
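The admissible interval for $\rho$ can be computed directly. A minimal sketch, using the same hypothetical 3x3 rook-adjacency grid as before:

```python
import numpy as np

# Hypothetical 3x3 rook-adjacency grid map.
n_side = 3
n = n_side * n_side
W = np.zeros((n, n))
for i in range(n_side):
    for j in range(n_side):
        k = i * n_side + j
        if j + 1 < n_side: W[k, k + 1] = W[k + 1, k] = 1.0
        if i + 1 < n_side: W[k, k + n_side] = W[k + n_side, k] = 1.0

d = W.sum(axis=1)
M = W / np.sqrt(np.outer(d, d))            # D_W^{-1/2} W D_W^{-1/2}
lam = np.linalg.eigvalsh(M)                # eigenvalues in ascending order

lo, hi = 1 / lam[0], 1 / lam[-1]
print(lam[0], lam[-1])                     # lam_(1) < 0 and lam_(n) = 1
print(lo, hi)                              # the admissible interval contains 0

# Sanity check: D_W - rho*W is positive definite for rho inside the interval.
rho = 0.9
assert np.all(np.linalg.eigvalsh(np.diag(d) - rho * W) > 0)
```

The upper endpoint is 1 here, which is why the convenient range $\rho \in [0, 1)$ mentioned next is available.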

39 Leave $\rho \in (1/\lambda_{(1)}, 1/\lambda_{(n)})$ unspecified, as a parameter in our model. One usually adopts the simple choice $\rho \in [0, 1)$, since $\lambda_{(n)} = 1$. Here $\rho = 0$ corresponds to conditional distributions $[Y_i \mid y_j, j \neq i] \sim N(0, \tau^2/w_{i+})$, $i = 1, \ldots, n$, i.e., spatial independence. Further, $\rho \to 1$ corresponds to the IAR model, and larger values of $\rho$ imply a greater degree of spatial dependence.

40 Note that with the IAR model ($\rho = 1$) we have only one parameter, $\tau^2$, the variance component. This variance component does not quantify spatial dependence in any way. With the IAR model, much of the spatial structure imposed by the model is predetermined by the chosen $W$. Note also that independence does not arise as a special case of this model.

41 Of course one could, in principle, allow the neighborhood structure $W$ itself to be a parameter in the model, but this is fairly complicated. When the more general CAR model incorporating $\rho$ is employed, how does one interpret $\rho$? Very carefully. In particular, $\rho$ does not represent a correlation. Rather, $\rho$ is some measure of dependence in the sense that $\rho = 0$ corresponds to independence and spatial dependence increases with $\rho$. The maximum allowable spatial dependence corresponds to the IAR model when $\rho = 1$.

42 To calibrate $\rho$ for a given neighborhood structure and map, one could simulate realizations from the CAR model for different values of $\rho$. For each realization we could compute Moran's I to get a sense of the strength of the spatial dependence implied by a particular $\rho$ value.
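A sketch of this calibration exercise, on an assumed 6x6 rook-adjacency grid with $\tau^2 = 1$ (the grid size, number of draws, and $\rho$ values are all arbitrary choices for illustration):

```python
import numpy as np

def morans_i(y, W):
    d = y - y.mean()
    return len(y) * (d @ W @ d) / (W.sum() * (d @ d))

rng = np.random.default_rng(1)

# Hypothetical 6x6 rook-adjacency grid map.
m = 6
n = m * m
W = np.zeros((n, n))
for i in range(m):
    for j in range(m):
        k = i * m + j
        if j + 1 < m: W[k, k + 1] = W[k + 1, k] = 1.0
        if i + 1 < m: W[k, k + m] = W[k + m, k] = 1.0
D_W = np.diag(W.sum(axis=1))

results = {}
for rho in [0.0, 0.5, 0.9, 0.99]:
    cov = np.linalg.inv(D_W - rho * W)             # proper CAR with tau^2 = 1
    draws = rng.multivariate_normal(np.zeros(n), cov, size=200)
    results[rho] = np.mean([morans_i(y, W) for y in draws])
    print(f"rho = {rho:4.2f}  average Moran's I = {results[rho]:5.2f}")
```

Running this shows the point made below: Moran's I creeps up slowly, and only values of $\rho$ quite close to 1 imply visibly strong spatial association.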

43 In general, even moderate amounts of spatial dependence will require $\rho > 0.9$, and usually estimates of $\rho$ are close to the upper bound. When modeling random effects in an areal data setting, I usually fit models based on the proper CAR model as well as the IAR model and then compare the two using some model selection tool. Usually, at least in my experience, the IAR model ends up being the preferred model.

44 I note again that in the framework of this model we specify a joint normal distribution for the data and specify the inverse covariance matrix $\Sigma_y^{-1} = \tau^{-2}(D_W - \rho W)$, but in general we have no simple form for the covariance matrix itself. The elements of $\Sigma_y$ give us, of course, information on the marginal covariance structure of $Y$. The elements of $\Sigma_y^{-1}$ give us information on the conditional covariance structure of $Y$. For example, using standard results associated with the MVN distribution, we can show that $1/(\Sigma_y^{-1})_{ii}$ gives us $\text{Var}(Y_i \mid y_j, j \neq i)$.

45 Moreover, if $(\Sigma_y^{-1})_{ij} = 0$ then $Y_i$ and $Y_j$ are conditionally independent given $\{y_k, k \neq i, j\}$. We see that $W_{ij} = 0$ implies conditional independence between $Y_i$ and $Y_j$ (given all other $Y$'s). From this we see that the specification of a neighborhood structure $W$ is essentially a set of conditional independence assumptions. Regression: If the proper CAR model is used as a distribution for data, we can accommodate covariates $x_i$ by modifying the conditional distributions to

$N(x_i'\beta + \rho \sum_j \dfrac{w_{ij}}{w_{i+}} (y_j - x_j'\beta), \dfrac{\tau^2}{w_{i+}}), \quad i = 1, \ldots, n.$

46 With these conditional specifications, the marginal distribution for $Y$ is MVN with $\mu = X\beta$ and $\Sigma_y^{-1} = \tau^{-2}(D_W - \rho W)$. We will mostly be concerned with the $\mu = 0$ case, when CAR models are applied as a (prior) distribution for random effects. Multivariate spatial data: Suppose that, associated with each areal unit, we observe several, say $p$, dependent observations $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{ip})'$. Models for these sorts of data must account for the spatial dependence across areal units and also the dependence within each $Y_i$.

47 Multivariate conditionally autoregressive (MCAR) models have been developed for such data. The idea is a straightforward extension of the univariate case, where we specify the joint distribution of all $np$ random variables $Y = (Y_1', \ldots, Y_n')'$ through a set of full conditional distributions. These full conditional distributions will be $p$-variate normal instead of univariate normal. Note also that a CAR model can, in principle, be adopted to model point-referenced data by allowing the elements of $W$ to depend on the distance between points.

48 This may be useful for very large datasets since CAR models, as we shall see in Chapter 5, are numerically less demanding to fit within a Gibbs sampling framework. When prediction is not of interest, this is a perfectly acceptable way of building a joint distribution. Whether or not such an approach yields an adequate representation of the underlying spatial structure in a given application is a model assessment issue - and a critical one at that.

49 Non-Gaussian CAR models: When dealing with non-Gaussian areal data, our preferred approach will be based on generalized linear mixed models, where we incorporate Gaussian CAR random effects into models for non-Gaussian data (Chapter 5). An alternative to this approach, which we consider now, is to adopt an MRF-type specification for the data $Y_1, \ldots, Y_n$ and determine a joint distribution through the specification of a set of compatible non-Gaussian full conditional distributions.

50 For example, we can allow the full conditional distributions $f(y_i \mid y_j, j \neq i)$ to take Poisson, binomial, gamma, or in fact any form from the exponential family. When these are compatible, the result is a joint spatial distribution for non-Gaussian data. See Cressie (1993) for a full development of CAR models in a general framework. I will present two examples of such non-Gaussian CAR models and discuss the computational problems associated with them.

51 Binary Data: For binary $Y_1, \ldots, Y_n$, an autologistic (binary MRF) model specifies the full conditional distributions through $p_i = P(Y_i = 1 \mid y_j, j \neq i) = P(Y_i = 1 \mid y_j, j \in \partial_i)$ and

$\log\left(\dfrac{p_i}{1 - p_i}\right) = x_i'\beta + \psi \sum_j w_{ij} y_j,$

where $\beta$ is a vector of regression parameters and $\psi \in \mathbb{R}$ is a spatial dependence parameter. These full conditional distributions are compatible, and Brook's lemma yields the form of the joint pmf:

$f(y_1, \ldots, y_n) \propto \exp\{\beta'(\sum_i y_i x_i) + \psi \sum_{i,j} w_{ij} y_i y_j\},$

a Gibbs distribution with potentials on cliques of order 2.

52 We can, in principle, use this form to fit the model and obtain, for example, MLEs of $\beta$ and $\psi$. Unfortunately, a computational problem arises. The normalizing constant in $f(y_1, \ldots, y_n)$ depends on the model parameters,

$f(y_1, \ldots, y_n) = C(\beta, \psi) \exp\{\beta'(\sum_i y_i x_i) + \psi \sum_{i,j} w_{ij} y_i y_j\},$

and so it would need to be evaluated at each iteration of the maximization procedure. Note that

$C(\beta, \psi)^{-1} = \sum_{y_1 = 0}^{1} \cdots \sum_{y_n = 0}^{1} \exp\{\beta'(\sum_i y_i x_i) + \psi \sum_{i,j} w_{ij} y_i y_j\}.$

53 Evaluating this constant for any particular value of $\beta$ and $\psi$ requires summing $2^n$ terms, which is not feasible even for moderate $n$, particularly since we would have to do this iteratively. Evaluating the normalizing constant is also required for Bayesian inference. Pseudo-likelihood, a somewhat ad hoc inferential scheme, can be employed to avoid calculation of the normalizing constant. The autologistic model can be generalized to the case where each $Y_i$ is categorical and takes values in the set $\{0, 1, \ldots, L-1\}$ for some $L \geq 2$.
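For a tiny map the $2^n$ sum is actually doable, which makes the source of the problem concrete. A sketch with $n = 4$ regions on a path, no covariates (a constant $b_0$ standing in for $x_i'\beta$), and arbitrary parameter values:

```python
import itertools
import numpy as np

# Tiny autologistic model on a path of n = 4 regions; b0 and psi are made up.
n = 4
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
b0, psi = -0.5, 0.8

def unnorm(y):
    """exp{ b0 * sum_i y_i + psi * sum_{i,j} w_ij y_i y_j } (ordered pairs)."""
    y = np.asarray(y, dtype=float)
    return np.exp(b0 * y.sum() + psi * (y @ W @ y))

# C(b0, psi)^{-1}: brute-force sum over all 2^n binary configurations.
inv_C = sum(unnorm(y) for y in itertools.product([0, 1], repeat=n))

# With the constant in hand the pmf is usable, e.g. P(all regions equal 1):
p_all_ones = unnorm(np.ones(n)) / inv_C
print(inv_C, p_all_ones)
```

The enumeration over `itertools.product` is exactly the sum that must be repeated at every step of an ML or MCMC fit; its $2^n$ cost is what motivates pseudo-likelihood.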

54 In this case the full conditional distributions are defined by

$P(Y_i = l \mid y_j, j \neq i) \propto \exp(\psi \sum_{j \in \partial_i} w_{ij} I(y_j = l)),$

where $\psi \in \mathbb{R}$ is again a spatial dependence parameter. Covariates can be added to this model just as in the autologistic case. This model, referred to as the Potts model, can be used to model allocations in finite mixture models, providing a robust alternative to the usual Gaussian spatial random effects models. As before, the model contains a normalizing constant $C(\psi)$ that causes computational problems when fitting the model.

55 Simultaneous autoregressive (SAR) models MRF models such as the CAR models we have discussed are by far the most popular sorts of models for areal data. An alternative class of models for areal data can be based on an autoregressive structure similar to that adopted in time series modeling. As before we have data Y 1,..., Y n and spatial information W. Unlike the MRF approach, we do not focus on full conditionals in this framework.

56 Instead, we start with a vector of independent errors or innovations $e \sim MVN(0, D)$ with $D = \text{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$, or more simply $D = \sigma^2 I$. We then construct a simple functional relationship between $Y$ and $e$, and this relationship induces a distribution for $Y$. Consider the relationship

$Y_i = \sum_j b_{ij} Y_j + e_i, \quad i = 1, \ldots, n,$

for some constants $b_{ij}$ with $b_{ii} = 0$.

57 In matrix form this is $Y = BY + e$, where $B = (b_{ij})$. From this we can obtain the relationship between $Y$ and $e$:

$Y = (I - B)^{-1} e,$

assuming $I - B$ is invertible. The simple distribution assigned to $e$ then induces the following for $Y$:

$Y \sim MVN(0, (I - B)^{-1} D [(I - B)^{-1}]'),$

and when $D = \sigma^2 I$ this is just $Y \sim MVN(0, \sigma^2 (I - B)^{-1} [(I - B)^{-1}]')$.

58 To ensure that $I - B$ is invertible, we can take $B = \rho W$ and restrict $\rho$ to an appropriate range. Invertibility is ensured when $\rho \in (1/\lambda_{(1)}, 1/\lambda_{(n)})$, where $\lambda_{(1)}$ and $\lambda_{(n)}$ are the smallest and largest eigenvalues of $W$. The SAR model is then based on $\Sigma_y = \sigma^2 [(I - \rho W)(I - \rho W)']^{-1}$, where $\rho$ is referred to as the autoregression parameter, with $\rho = 0$ corresponding to $\Sigma_y = \sigma^2 I$, an independence model.
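The induced SAR covariance can be assembled directly from $W$ and $\rho$. A sketch on an assumed path of 5 regions, with $\sigma^2$ and $\rho$ chosen arbitrarily inside the admissible range:

```python
import numpy as np

# SAR model Y = rho*W*Y + e, e ~ N(0, sigma^2 I), on a hypothetical 5-region path.
n = 5
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
sigma2, rho = 1.0, 0.3

lam = np.linalg.eigvalsh(W)
assert 1 / lam[0] < rho < 1 / lam[-1]          # rho in the invertibility range

A = np.eye(n) - rho * W
Ainv = np.linalg.inv(A)
Sigma_y = sigma2 * Ainv @ Ainv.T               # sigma^2 (I-rho W)^{-1} [(I-rho W)^{-1}]'

# The equivalent form from the text: sigma^2 [(I-rho W)(I-rho W)']^{-1}
Sigma_alt = sigma2 * np.linalg.inv(A @ A.T)
print(np.allclose(Sigma_y, Sigma_alt))
```

Note that nothing here involves full conditional distributions: the joint covariance falls straight out of the transformation of the independent innovations, which is the structural contrast with the CAR approach drawn below.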

59 Regression: When covariates are present, the SAR model can be adopted as a model for the residuals. In this case we define $U = Y - X\beta$ and assume $U$ follows a SAR model, so that

$(I - \rho W) U = e \;\Rightarrow\; (I - \rho W)(Y - X\beta) = e \;\Rightarrow\; Y = \rho W Y + (I - \rho W) X\beta + e.$

Note here that if $W = 0$ this is the standard linear model. Note also that the spatial covariance structure implied by the SAR model, just as with the CAR model, is not entirely intuitive.

60 In addition, SAR models, unlike CAR models, are not based on a set of full conditional distributions. These of course exist, but they do not have a computationally convenient form. As a result, SAR models are not well suited to model fitting using the Gibbs sampler. Finally, Cressie (1993) shows that any SAR model can be represented as a CAR model; however, the converse is not true: there exist CAR models that do not have a representation as a SAR model. Given the above, we will not consider SAR models further in this course.

61 I note, however, that the general approach of building spatial distributions using transformations of independent RVs is a simple, intuitive, and appealing approach. Other similar approaches could (and should) be explored further...


4. Joint Distributions of Two Random Variables

4. Joint Distributions of Two Random Variables 4.1 Joint Distributions of Two Discrete Random Variables Suppose the discrete random variables X and Y have supports S X and S Y, respectively. The joint

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

Maximum Likelihood Estimation

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

Generalized Linear Models. Today: definition of GLM, maximum likelihood estimation. Involves choice of a link function (systematic component)

Generalized Linear Models Last time: definition of exponential family, derivation of mean and variance (memorize) Today: definition of GLM, maximum likelihood estimation Include predictors x i through

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

3 Random vectors and multivariate normal distribution

3 Random vectors and multivariate normal distribution As we saw in Chapter 1, a natural way to think about repeated measurement data is as a series of random vectors, one vector corresponding to each unit.

EC 6310: Advanced Econometric Theory

EC 6310: Advanced Econometric Theory July 2008 Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model Gary Koop, University of Strathclyde 1 Summary Readings: Chapter 5 of textbook.

CS395T Computational Statistics with Application to Bioinformatics

CS395T Computational Statistics with Application to Bioinformatics Prof. William H. Press Spring Term, 2010 The University of Texas at Austin Unit 6: Multivariate Normal Distributions and Chi Square The

3. Regression & Exponential Smoothing

3. Regression & Exponential Smoothing 3.1 Forecasting a Single Time Series Two main approaches are traditionally used to model a single time series z 1, z 2,..., z n 1. Models the observation z t as a

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

1 Portfolio mean and variance

Copyright c 2005 by Karl Sigman Portfolio mean and variance Here we study the performance of a one-period investment X 0 > 0 (dollars) shared among several different assets. Our criterion for measuring

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

Credit Risk Models: An Overview

Credit Risk Models: An Overview Paul Embrechts, Rüdiger Frey, Alexander McNeil ETH Zürich c 2003 (Embrechts, Frey, McNeil) A. Multivariate Models for Portfolio Credit Risk 1. Modelling Dependent Defaults:

Dongfeng Li. Autumn 2010

Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation

Average Redistributional Effects IFAI/IZA Conference on Labor Market Policy Evaluation Geert Ridder, Department of Economics, University of Southern California. October 10, 2006 1 Motivation Most papers

3. The Multivariate Normal Distribution

3. The Multivariate Normal Distribution 3.1 Introduction A generalization of the familiar bell shaped normal density to several dimensions plays a fundamental role in multivariate analysis While real data

Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

1. χ 2 minimization 2. Fits in case of of systematic errors

Data fitting Volker Blobel University of Hamburg March 2005 1. χ 2 minimization 2. Fits in case of of systematic errors Keys during display: enter = next page; = next page; = previous page; home = first

Lab 8: Introduction to WinBUGS

40.656 Lab 8 008 Lab 8: Introduction to WinBUGS Goals:. Introduce the concepts of Bayesian data analysis.. Learn the basic syntax of WinBUGS. 3. Learn the basics of using WinBUGS in a simple example. Next

Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

Monte Carlo Simulation

1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging

Models for Count Data With Overdispersion

Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extra-poisson variation and the negative binomial model, with brief appearances

Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

MS&E 226: Small Data

MS&E 226: Small Data Lecture 16: Bayesian inference (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 35 Priors 2 / 35 Frequentist vs. Bayesian inference Frequentists treat the parameters as fixed (deterministic).

Centre for Central Banking Studies

Centre for Central Banking Studies Technical Handbook No. 4 Applied Bayesian econometrics for central bankers Andrew Blake and Haroon Mumtaz CCBS Technical Handbook No. 4 Applied Bayesian econometrics

Numerical Summarization of Data OPRE 6301

Numerical Summarization of Data OPRE 6301 Motivation... In the previous session, we used graphical techniques to describe data. For example: While this histogram provides useful insight, other interesting

Generalized Linear Model Theory

Appendix B Generalized Linear Model Theory We describe the generalized linear model as formulated by Nelder and Wedderburn (1972), and discuss estimation of the parameters and tests of hypotheses. B.1

The zero-adjusted Inverse Gaussian distribution as a model for insurance claims

The zero-adjusted Inverse Gaussian distribution as a model for insurance claims Gillian Heller 1, Mikis Stasinopoulos 2 and Bob Rigby 2 1 Dept of Statistics, Macquarie University, Sydney, Australia. email:

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February

Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing

Communications for Statistical Applications and Methods 2013, Vol 20, No 4, 321 327 DOI: http://dxdoiorg/105351/csam2013204321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance

Lecture 14: GLM Estimation and Logistic Regression

Lecture 14: GLM Estimation and Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

MATH2740: Environmental Statistics

MATH2740: Environmental Statistics Lecture 6: Distance Methods I February 10, 2016 Table of contents 1 Introduction Problem with quadrat data Distance methods 2 Point-object distances Poisson process case

Extracting correlation structure from large random matrices

Extracting correlation structure from large random matrices Alfred Hero University of Michigan - Ann Arbor Feb. 17, 2012 1 / 46 1 Background 2 Graphical models 3 Screening for hubs in graphical model 4

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

How to Conduct a Hypothesis Test

How to Conduct a Hypothesis Test The idea of hypothesis testing is relatively straightforward. In various studies we observe certain events. We must ask, is the event due to chance alone, or is there some

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

Chapter 7 Pearson s chi-square test 7. Null hypothesis asymptotics Let X, X 2, be independent from a multinomial(, p) distribution, where p is a k-vector with nonnegative entries that sum to one. That

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

Normality Testing in Excel

Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

7 Hypothesis testing - one sample tests

7 Hypothesis testing - one sample tests 7.1 Introduction Definition 7.1 A hypothesis is a statement about a population parameter. Example A hypothesis might be that the mean age of students taking MAS113X

Lecture 3 : Hypothesis testing and model-fitting

Lecture 3 : Hypothesis testing and model-fitting These dark lectures energy puzzle Lecture 1 : basic descriptive statistics Lecture 2 : searching for correlations Lecture 3 : hypothesis testing and model-fitting

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

Markov random fields and Gibbs measures

Chapter Markov random fields and Gibbs measures 1. Conditional independence Suppose X i is a random element of (X i, B i ), for i = 1, 2, 3, with all X i defined on the same probability space (.F, P).

Chapter 3 RANDOM VARIATE GENERATION

Chapter 3 RANDOM VARIATE GENERATION In order to do a Monte Carlo simulation either by hand or by computer, techniques must be developed for generating values of random variables having known distributions.

Economic Order Quantity and Economic Production Quantity Models for Inventory Management

Economic Order Quantity and Economic Production Quantity Models for Inventory Management Inventory control is concerned with minimizing the total cost of inventory. In the U.K. the term often used is stock

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

Stochastic Inventory Control

Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the

The Variability of P-Values. Summary

The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

How to assess the risk of a large portfolio? How to estimate a large covariance matrix?

Chapter 3 Sparse Portfolio Allocation This chapter touches some practical aspects of portfolio allocation and risk assessment from a large pool of financial assets (e.g. stocks) How to assess the risk

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a

An Internal Model for Operational Risk Computation

An Internal Model for Operational Risk Computation Seminarios de Matemática Financiera Instituto MEFF-RiskLab, Madrid http://www.risklab-madrid.uam.es/ Nicolas Baud, Antoine Frachot & Thierry Roncalli

Linear Programming I

Linear Programming I November 30, 2003 1 Introduction In the VCR/guns/nuclear bombs/napkins/star wars/professors/butter/mice problem, the benevolent dictator, Bigus Piguinus, of south Antarctica penguins