MATH2740: Environmental Statistics Lecture 6: Distance Methods I February 10, 2016
Table of contents
1. Introduction: problem with quadrat data; distance methods
2. Point-object distances: Poisson process case; Rayleigh distribution; distribution of object-object distances
3. Clark-Evans test: Clark-Evans test of randomness; problems with the Clark-Evans test; examples of Clark and Evans test; problems with Clark-Evans test II
Problems with quadrat data
Quadrat methods can be inefficient to use in some circumstances:
- Time and cost to lay out and search all quadrats.
- The choice of quadrat size can influence the conclusions.
- Quadrat counts do not uniquely determine the underlying point pattern: plots can have the same quadrat counts but quite different spatial patterns.
Distance methods
Distance methods try to overcome some of the problems associated with quadrat counting methods.
Types of distance measurement
Distance measurements involve measuring:
- distances from randomly selected points to the nearest neighbouring object, giving a point-object distance;
- distances from a randomly selected object to the nearest neighbouring object, giving an object-object distance.
The second procedure requires the locations of all objects within the study area to be known, so that objects can be selected at random.
Example: Types of distance measurement Have a Poisson process with 30 objects within a unit square. Left: distances from four randomly selected objects in the study area to their nearest object. Gives object-object distances. Right: distances from four randomly located points in the study area to their nearest object. Gives point-object distances.
Other types of distance measurement I (NOT examined) Other types of distance measurement can be considered: Random object to the nth nearest neighbour. Random point to the nth nearest neighbour. Besag and Gleaves (1973) T-square sampling.
Other types of distance measurement II (NOT examined)
Besag and Gleaves (1973) T-square sampling:
- Find the distance from a random point O to the nearest object P.
- Find the distance from P to its nearest object Q, where Q is restricted to the half-plane on the far side of P from O.
This gives a point-object distance and an object-object distance.
[Figure: T-square sampling with points O, P and Q.]
Point-object distances I
Suppose object locations occur as a Poisson process with intensity λ (the mean number of objects per unit area is λ). The number X(A) of objects in a region A with area |A| has a Poisson distribution with mean µ = λ|A|, so
$$\Pr\{X(A) = x\} = \frac{\mu^x e^{-\mu}}{x!} = \frac{(\lambda|A|)^x e^{-\lambda|A|}}{x!}, \quad x = 0, 1, 2, \ldots.$$
In particular, $\Pr\{X(A) = 0\} = e^{-\lambda|A|}$.
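As a quick numerical check (a sketch, not part of the lecture; the values λ = 2 and |A| = 1.5 are arbitrary), the Poisson probabilities above can be evaluated directly:

```python
import math

def poisson_pmf(x, lam, area):
    """P{X(A) = x} for a Poisson process of intensity lam over a region of area |A|."""
    mu = lam * area  # mean count in the region
    return mu**x * math.exp(-mu) / math.factorial(x)

lam, area = 2.0, 1.5
# the probability the region is empty equals exp(-lam*|A|)
print(poisson_pmf(0, lam, area), math.exp(-lam * area))
```

The empty-region probability agrees with $e^{-\lambda|A|}$, which is the fact the nearest-neighbour argument below relies on.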
Point-object distances II
Let R denote the distance from a random point to the nearest object, and consider a circle of radius r centred on the random point. The distance R is greater than r exactly when the circle of radius r, with area πr², contains no objects.
[Figure: circle of radius r centred on the random point.]
Point-object distances III
The distance R from a random point to the nearest object satisfies
$$\Pr\{R > r\} = \Pr\{\text{no objects inside circle of radius } r\} = \Pr\{X(A) = 0\},$$
where $X(A) \sim \text{Poisson}(\mu = \lambda|A|)$ with $|A| = \pi r^2$. Hence
$$\Pr\{R > r\} = \exp(-\lambda\pi r^2).$$
The cumulative distribution function of R is
$$F_R(r) = \Pr\{R \le r\} = 1 - \Pr\{R > r\} = 1 - \exp(-\lambda\pi r^2), \quad r > 0.$$
The probability density function $f_R(r)$ of R is
$$f_R(r) = \frac{dF_R(r)}{dr} = 2\lambda\pi r \exp(-\lambda\pi r^2), \quad r > 0.$$
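The survival probability $\Pr\{R > r\} = \exp(-\lambda\pi r^2)$ can be checked by simulation. This sketch is not from the lecture: the parameters (λ = 50 on a 4×4 square, r₀ = 0.05) are arbitrary, the Poisson count is approximated by conditioning on its mean (a binomial process, which is close for this purpose), and the random point is kept well inside the square to avoid edge effects.

```python
import math
import random

random.seed(42)
lam, side = 50.0, 4.0        # intensity; simulate on a 4x4 square
n = int(lam * side * side)   # condition on the expected count (binomial approximation)
r0 = 0.05
trials, exceed = 1000, 0

for _ in range(trials):
    pts = [(random.uniform(0, side), random.uniform(0, side)) for _ in range(n)]
    # random point well inside the square, so the circle of radius r0 avoids the edges
    px, py = random.uniform(1.5, 2.5), random.uniform(1.5, 2.5)
    nearest2 = min((x - px) ** 2 + (y - py) ** 2 for x, y in pts)
    if nearest2 > r0 * r0:   # no object within distance r0 of the random point
        exceed += 1

# empirical Pr{R > r0} versus exp(-lam*pi*r0^2)
print(exceed / trials, math.exp(-lam * math.pi * r0**2))
```

The empirical proportion should sit within Monte Carlo error of the theoretical value (about 0.675 for these parameters).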
Rayleigh distribution I
$$f_R(r) = \frac{dF_R(r)}{dr} = 2\lambda\pi r \exp(-\lambda\pi r^2), \quad r > 0.$$
This is the probability density function of a Rayleigh distribution. It is a special case of the Weibull distribution, which has probability density function
$$f_X(x) = a b x^{b-1} \exp(-a x^b), \quad x > 0,$$
where a > 0 and b > 0; here $a = \lambda\pi$ and $b = 2$.
[Figure: Rayleigh pdfs for λ = 0.1 (left), λ = 0.2 (centre) and λ = 0.4 (right), plotted over 0 ≤ r ≤ 5.]
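The closed-form CDF makes the Rayleigh distribution easy to sample by inversion: solving $u = 1 - \exp(-\lambda\pi r^2)$ for r gives $R = \sqrt{-\ln(1-u)/(\lambda\pi)}$. A small sketch (λ = 0.4 chosen arbitrarily; the mean it checks against, $1/(2\sqrt{\lambda})$, is derived on the next slide):

```python
import math
import random

random.seed(0)
lam = 0.4
# Inverse-CDF sampling: R = sqrt(-ln(1-U)/(lam*pi)); use 1 - random() so the
# argument of log lies in (0, 1] and log(0) cannot occur.
samples = [math.sqrt(-math.log(1.0 - random.random()) / (lam * math.pi))
           for _ in range(200_000)]
mean = sum(samples) / len(samples)
print(mean, 1 / (2 * math.sqrt(lam)))  # sample mean vs E[R] = 1/(2*sqrt(lam))
```

With 200,000 draws the sample mean agrees with $1/(2\sqrt{\lambda}) \approx 0.7906$ to about three decimal places.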
Rayleigh distribution II
$$E[R] = \int_0^\infty r f_R(r)\,dr = \int_0^\infty 2\lambda\pi r^2 \exp(-\lambda\pi r^2)\,dr.$$
Let $y = \lambda\pi r^2$, so $dy = 2\lambda\pi r\,dr$ and $dr = dy/(2\sqrt{\lambda\pi y})$. Then
$$E[R] = \int_0^\infty \frac{2 y e^{-y}}{2\sqrt{\lambda\pi y}}\,dy = \frac{1}{\sqrt{\lambda\pi}} \int_0^\infty y^{1/2} e^{-y}\,dy = \frac{\Gamma(3/2)}{\sqrt{\lambda\pi}} = \frac{1}{2\sqrt{\lambda}},$$
since $\Gamma(3/2) = \tfrac12\Gamma(\tfrac12) = \tfrac12\sqrt{\pi}$ and the area under a gamma(α = 3/2, 1) density integrates to one, so that $\int_0^\infty y^{1/2} e^{-y}/\Gamma(3/2)\,dy = 1$.
Rayleigh distribution III: revision of the gamma distribution
A gamma(α, λ) distribution has probability density function
$$f_Y(y) = \frac{\lambda^\alpha y^{\alpha-1} e^{-\lambda y}}{\Gamma(\alpha)}, \quad y > 0,$$
where the gamma function satisfies $\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)$ with $\Gamma(1) = 1$ and $\Gamma(\tfrac12) = \sqrt{\pi}$.
Rayleigh distribution IV
$$E[R^2] = \int_0^\infty r^2 f_R(r)\,dr = \int_0^\infty 2\lambda\pi r^3 \exp(-\lambda\pi r^2)\,dr.$$
Putting $y = \lambda\pi r^2$ and $dy = 2\lambda\pi r\,dr$ gives
$$E[R^2] = \int_0^\infty \frac{2 y e^{-y}}{2\lambda\pi}\,dy = \frac{1}{\lambda\pi}\int_0^\infty y e^{-y}\,dy = \frac{1}{\lambda\pi},$$
as the area under a gamma(α = 2, 1) density integrates to one so, with $\Gamma(2) = 1$, $\int_0^\infty y e^{-y}/\Gamma(2)\,dy = 1$. (Or recall that for $Y \sim$ exponential(1), $E[Y] = \int_0^\infty y e^{-y}\,dy = 1$.)
Hence
$$\text{Var}[R] = E[R^2] - \{E[R]\}^2 = \frac{1}{\lambda\pi} - \frac{1}{4\lambda} = \frac{4-\pi}{4\lambda\pi}.$$
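Both moments can be verified by numerical integration of the Rayleigh density. This is an illustrative check only (λ = 0.4 and the truncation point 20 are arbitrary choices; the tail beyond 20 is negligible for this λ):

```python
import math

lam = 0.4
f = lambda r: 2 * lam * math.pi * r * math.exp(-lam * math.pi * r**2)

# midpoint rule on [0, 20]
n, b = 100_000, 20.0
h = b / n
mids = [(i + 0.5) * h for i in range(n)]
m1 = h * sum(r * f(r) for r in mids)      # numeric E[R]
m2 = h * sum(r * r * f(r) for r in mids)  # numeric E[R^2]

print(m1, 1 / (2 * math.sqrt(lam)))                        # E[R] vs closed form
print(m2 - m1**2, (4 - math.pi) / (4 * lam * math.pi))     # Var[R] vs closed form
```

Both numeric values agree with the closed forms $1/(2\sqrt{\lambda})$ and $(4-\pi)/(4\lambda\pi)$ to high accuracy.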
Object-object distances
Given a large number N of objects in the study area A, the distribution of the distance from a random object to its nearest neighbouring object is the same as the point-object distance. Suppose A contains N objects randomly positioned within A. The probability that any given object is located in a small region $a \subset A$ is $|a|/|A|$, and the probability that it is not located in a is $1 - |a|/|A|$. If $|a| = \pi r^2$, the probability that none of the remaining N − 1 objects lies within distance r of a randomly chosen object is $(1 - \pi r^2/|A|)^{N-1}$, by independence. Writing $\lambda = N/|A|$ gives
$$\Pr\{R \le r\} \approx 1 - (1 - \lambda\pi r^2/N)^{N-1}.$$
As $N \to \infty$ this tends to the point-object distribution function $1 - \exp(-\lambda\pi r^2)$.
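The convergence of $(1 - \lambda\pi r^2/N)^{N-1}$ to $\exp(-\lambda\pi r^2)$ can be seen numerically; a small sketch (λ = 1 and r = 0.5 chosen arbitrarily):

```python
import math

lam, r = 1.0, 0.5
limit = math.exp(-lam * math.pi * r**2)   # point-object survival probability
for n in (10, 100, 1000, 100_000):
    # object-object CDF approximation for N = n objects
    approx = 1 - (1 - lam * math.pi * r**2 / n) ** (n - 1)
    print(n, approx, 1 - limit)
```

Already at N = 100 the two agree to two decimal places, supporting the use of the Rayleigh results for object-object distances.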
Clark-Evans test I
Have N object-object nearest neighbour distances $r_i$, $i = 1, 2, \ldots, N$, with sample mean $\bar{r}$. If the randomness (Poisson process) assumption is true, then for large N, Clark and Evans (1954) assume
$$\bar{r} \sim N\!\left(\frac{1}{2\sqrt{\lambda}},\; \frac{4-\pi}{4\lambda\pi N}\right),$$
where $E[R] = 1/(2\sqrt{\lambda})$ and $\text{Var}[R] = (4-\pi)/(4\lambda\pi)$. Hence
$$Z = \frac{\bar{r} - \frac{1}{2\sqrt{\lambda}}}{\sqrt{\frac{4-\pi}{4\lambda\pi N}}} \sim N(0, 1).$$
Reject the randomness hypothesis at the 5% level if $|Z| > 1.96$.
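The statistic is straightforward to compute; a minimal sketch (the function name is my own, not a standard API):

```python
import math

def clark_evans_z(dists, lam):
    """Clark-Evans Z for a list of object-object nearest-neighbour distances."""
    n = len(dists)
    rbar = sum(dists) / n
    mean = 1 / (2 * math.sqrt(lam))                           # E[R] under randomness
    se = math.sqrt((4 - math.pi) / (4 * lam * math.pi * n))   # sd of rbar
    return (rbar - mean) / se

# if every distance equals E[R] = 1/(2*sqrt(lam)), the statistic is exactly zero
print(clark_evans_z([0.5] * 20, 1.0))  # → 0.0
```

Compare $|Z|$ with 1.96 for a two-sided 5% test.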
Clark-Evans test II
For small N, Clark and Evans suggest using a suitable gamma distribution as an approximation to the distribution of $\bar{r}$.
Clark-Evans measure of randomness
Clark and Evans use¹
$$\phi_R = \frac{\bar{r}}{E[R]} = 2\sqrt{\lambda}\,\bar{r}$$
as a measure of randomness. We have φ_R ≈ 1 for a random process, φ_R < 1 for a clustered (aggregated) process, and φ_R > 1 for a regularly located process.²

¹ Clark and Evans used the symbol R for their randomness measure, but to avoid confusion with the random variable R the symbol φ_R is used here.
² The most extreme case has objects on a hexagonal grid, each object the same distance r from six others. Each hexagon has area $3\sqrt{3}r^2/2$ and is associated with 3 data points (the central point plus weight one third for each of the six surrounding points), so $\lambda = 3/(3\sqrt{3}r^2/2) = 2/(\sqrt{3}r^2)$. Thus $r = 1.0746/\sqrt{\lambda}$, so $\phi_R = 2.149$.
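The hexagonal upper bound in the footnote can be checked numerically, using the density $\lambda = 2/(\sqrt{3}r^2)$ derived there (a quick sketch; r = 1 without loss of generality):

```python
import math

# Hexagonal grid: each object is at distance r from its six nearest neighbours,
# so rbar = r, and lam = 2/(sqrt(3)*r^2) objects per unit area.
r = 1.0
lam = 2 / (math.sqrt(3) * r**2)
phi = 2 * math.sqrt(lam) * r      # phi_R = 2*sqrt(lam)*rbar
print(round(phi, 4))              # → 2.1491
```

This reproduces the stated maximum φ_R = 2.149 for a perfectly regular pattern.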
Problems with Clark-Evans test I
- The intensity λ should be known to carry out the test. It can be estimated by the mean number of objects per unit area in the study region.
- The Clark-Evans test uses all N object-object distances. These distances are not independent, but Diggle (1976) and Donnelly (1978) showed that the correlations are small.³
- The correlations between the object-object distances mean the central limit theorem does NOT apply. However, Z is still approximately N(0, 1), as shown by Donnelly (1978).

³ Donnelly (1978) obtained better approximations for the mean and variance of the object-object distances, but for large N these give the values of E[R] and Var[R] obtained by assuming the object-object distances are independent.
Using a border region I
Clark and Evans (1954) advise having a border around the study region to avoid bias. For points near the edge of the study region, the calculated object-object distance, being restricted to objects within the study region, will tend to be larger than it should be. This biases the test statistic Z upwards, towards rejecting the randomness hypothesis and suggesting regularity of the data points.
Using a border region II Object-object distances are measured for all objects within the inner region and can be to points within the border.
Using a border region III
Donnelly (1978) presented approximations for E[R] and Var[R] when a border is ignored. For a study region with perimeter P,
$$E[R] \approx \frac{1}{2\sqrt{\lambda}} + \frac{P}{N}\left(0.0514 + \frac{0.0412}{\sqrt{N}}\right),$$
$$\text{Var}[R] \approx \frac{0.070}{\lambda N} + \frac{0.037 P}{N^2\sqrt{\lambda}}.$$
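These two approximations transcribe directly into code; a sketch (the function name is my own, and the illustrative inputs λ = 0.61, N = 61, P = 40 m are those of the ant-nest example later in the lecture):

```python
import math

def donnelly_corrected(lam, n, perimeter):
    """Donnelly (1978) edge-corrected approximations to E[R] and Var[R]."""
    e = 1 / (2 * math.sqrt(lam)) + (perimeter / n) * (0.0514 + 0.0412 / math.sqrt(n))
    v = 0.070 / (lam * n) + 0.037 * perimeter / (n**2 * math.sqrt(lam))
    return e, v

e, v = donnelly_corrected(0.61, 61, 40.0)
print(e, v)
```

The corrected mean here is about 0.6774, noticeably above the uncorrected $1/(2\sqrt{\lambda}) \approx 0.6402$, showing how much ignoring the border inflates nearest-neighbour distances.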
Using a toroidal correction
If the study region is rectangular, an alternative is to assume the region lies on a torus, so that opposite edges are adjacent to each other. The study region (centre below) is surrounded by a grid of identical regions. Object-object distances are measured for all objects within the central region and can be to points outside the centre.
Example 1: Simulated data I The object-object nearest neighbour distances for the N = 11 objects within the inner study region below are: 0.201 0.201 0.327 0.327 0.350 0.350 0.500 0.500 0.657 0.826 1.278
Example 1: Simulated data II
Data are: 0.201 0.201 0.327 0.327 0.350 0.350 0.500 0.500 0.657 0.826 1.278. These have mean $\bar{r} = 0.5015$. The inner region has area 9 m², so λ can be estimated by $\hat\lambda = 11/9 = 1.222$. The test statistic is thus
$$z = \frac{\bar{r} - \frac{1}{2\sqrt{\hat\lambda}}}{\sqrt{\frac{4-\pi}{4\hat\lambda\pi N}}} = \frac{0.5015 - 0.4523}{0.07128} = 0.690.$$
Here $|z| < 1.96$, so accept the randomness hypothesis at the 5% level. Notice many of the object-object distances are the same: equal pairs arise when two objects are each other's nearest neighbour.
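The calculation above can be reproduced directly from the listed distances (a check, not part of the lecture):

```python
import math

d = [0.201, 0.201, 0.327, 0.327, 0.350, 0.350, 0.500, 0.500, 0.657, 0.826, 1.278]
n = len(d)
lam = 11 / 9                           # 11 objects in a 9 m^2 inner region
rbar = sum(d) / n
er = 1 / (2 * math.sqrt(lam))          # E[R] under randomness
se = math.sqrt((4 - math.pi) / (4 * lam * math.pi * n))
z = (rbar - er) / se
print(round(rbar, 4), round(er, 4), round(se, 5), round(z, 3))
```

This reproduces $\bar{r} = 0.5015$, $E[R] = 0.4523$, standard error 0.07128 and z = 0.69 (differing from 0.690 only through rounding of the intermediate values).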
Example 2: Ground ant nests in Panama I
Levings and Franks (1982) present data for the number of ground ant nests in various study regions on Barro Colorado Island, in Gatun Lake, Panama. For one 100 m² square study region the number of nests of Ectatomma ruidum per m² was given as 0.61 with φ_R = 1.16. This suggests $\hat\lambda = 0.61$, $N = 100\hat\lambda = 61$ and $\bar{r} = \phi_R/(2\sqrt{\hat\lambda}) = 0.7426$. The Clark-Evans test statistic is then
$$z = \frac{\bar{r} - \frac{1}{2\sqrt{\hat\lambda}}}{\sqrt{\frac{4-\pi}{4\hat\lambda\pi N}}} = \frac{0.7426 - 0.6402}{0.04285} = 2.390.$$
As a two-sided test, the P-value is $P = \Pr\{|Z| > 2.390\} = 0.0168$, so reject the randomness hypothesis. Since φ_R > 1, this suggests the ant nests are distributed regularly.
Example 2: Ground ant nests in Panama II
Unfortunately Levings and Franks did not appear to use a border, so their results are suspect. Using the corrected values for E[R] and Var[R] obtained by Donnelly (1978), with perimeter P = 40 m,
$$E[R] \approx \frac{1}{2\sqrt{\lambda}} + \frac{P}{N}\left(0.0514 + \frac{0.0412}{\sqrt{N}}\right) = 0.6774,$$
$$\text{Var}[R] \approx \frac{0.070}{\lambda N} + \frac{0.037 P}{N^2\sqrt{\lambda}} = 0.00239,$$
and the test statistic becomes
$$z = \frac{0.7426 - 0.6774}{\sqrt{0.00239}} = 1.33,$$
which is not significant at the 5% level. There is thus no evidence to reject the randomness hypothesis.
Intensive sampling
If all the nearest neighbour distances in a region are used, the values are not independent. Cressie (1993, pp. 609-610) refers to this as intensive sampling. The consequence is that the true variance of $\bar{r}$ is greater than that assumed (because of the correlations), so the test statistic Z tends to be larger than it should be, resulting in clustering being suggested more often than it should be. One solution is to use Monte Carlo tests for inference: simulate independent realizations of the data assuming the null hypothesis is true, calculate the test statistic $Z_i$ for each, and compare the observed test statistic Z with the simulated values. The test rejects the null hypothesis if the observed Z is too large or too small relative to the simulated $Z_i$.
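A Monte Carlo test of this kind can be sketched as follows. This is an illustration under stated assumptions (function names are my own; the null model simulates N points uniformly in a square of the given area, and the mean nearest-neighbour distance is used directly as the test statistic, which is equivalent to ranking Z):

```python
import math
import random

def mean_nn_distance(pts):
    """Mean nearest-neighbour distance over all objects (intensive sampling)."""
    total = 0.0
    for i, (x, y) in enumerate(pts):
        d2 = min((x - a) ** 2 + (y - b) ** 2
                 for j, (a, b) in enumerate(pts) if j != i)
        total += math.sqrt(d2)
    return total / len(pts)

def monte_carlo_p(rbar_obs, n, area, sims=999, seed=1):
    """Two-sided Monte Carlo p-value: simulate CSR in a square of the given area."""
    rng = random.Random(seed)
    side = math.sqrt(area)
    sim = [mean_nn_distance([(rng.uniform(0, side), rng.uniform(0, side))
                             for _ in range(n)])
           for _ in range(sims)]
    lo = sum(s <= rbar_obs for s in sim)   # simulations at or below the observed mean
    hi = sum(s >= rbar_obs for s in sim)   # simulations at or above it
    return min(1.0, 2 * (min(lo, hi) + 1) / (sims + 1))

# illustrative call with the Example 1 values (rbar = 0.5015, N = 11, area = 9 m^2)
print(monte_carlo_p(0.5015, 11, 9.0, sims=199))
```

A convenient side effect is that simulating on the actual study region builds the edge effects into the null distribution, so no separate border correction is needed.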