Issues related to handling of spatial data

In: J. McKenzie (ed) Proceedings of the epidemiology and state veterinary programmes. New Zealand Veterinary Association / Australian Veterinary Association Second Pan Pacific Veterinary Conference, Christchurch, June 1996.

D.U. PFEIFFER
Department of Veterinary Clinical Sciences, Massey University, Palmerston North, New Zealand

Introduction

Epidemiological analyses are mainly conducted using data which does not include or take account of spatial relationships between the observations studied. More recently, the need for spatial data analysis has been pointed out as it may provide additional insight when attempting to reveal epidemiological cause-effect relationships (Rothman 1990). While the theory of spatial analysis has been an area of research interest for many years, only the advent of personal computers made the techniques more easily accessible to epidemiologists. Still, even recent text books on medical or veterinary epidemiology do not provide more than basic introductions to the subject area of spatial data analysis. This could seem surprising, as place has always been seen as part of the classic epidemiological triad of time, person and place.

Spatial Data

The distinction between spatial and non-spatial data can easily become the subject of extensive discussions. In general, observations for which absolute location and/or relative positioning (spatial arrangement) is taken into account can be referred to as spatial data (Anselin 1992). Spatial data can be subdivided into two major categories representing discrete and continuous phenomena. Under the former, which has also been called the entity view, spatial phenomena are described using zero dimensional objects such as points, one dimensional objects such as lines or two dimensional objects such as areas. If space is described using continuous phenomena, such as in the case of temperature or topography, this has been described as the field view.
In practice, the latter is usually measured by sampling discrete entities such as locations in space. The entity view allows spatial objects to have attributes. Spatial analysis is typically aimed at the spatial arrangement of the observational units, but can also take into account attribute information. An analysis conducted only on the basis of the attributes of the observational units, ignoring the spatial relationships, is not considered a spatial data analysis.

Spatial Data Analysis

The methods used in spatial data analysis can be broadly categorized into those concerned with visualizing data, those for exploratory data analysis and methods for development of statistical models (Bailey and Gatrell 1995). During most analyses, a combination of techniques will be used, with the data first being displayed visually, followed by exploration of possible patterns and possibly modeling.

Data visualization

One of the first steps in any data analysis should be an inspection of the data. Visual displays of information using plots or maps will provide the epidemiologist with the basis for generating hypotheses and, if required, an assessment of the fit or predictive ability of models. Over the last couple of years interactive computer packages have been developed which allow dynamic displays of the data. Geographic information systems can be used to produce maps and they allow the exploration of spatial patterns in an interactive fashion.

Exploratory data analysis

Data exploration is aimed at developing hypotheses and makes extensive use of graphical views of the data such as maps or scatter plots. Exploratory data analysis makes few assumptions about the data and should be robust to extreme data values. Simple analytical models can also be used in this analysis phase.

Models of spatial data

For this type of spatial data analysis specific hypotheses are formally tested or predictions are made using statistical models of the data. Modeling of spatial phenomena has to incorporate the possibility of spatial dependence in order to provide a true representation of the existing effects. Such spatial effects can be either large scale trends or local effects. The first is also called a first order effect and it describes overall variation in the mean value of a parameter such as rainfall. The second, which is named a second order effect, is produced by spatial dependence and represents the tendency of neighboring values to follow each other in terms of their deviation from the mean. This can for example be the case with the incidence of an infectious animal disease affecting animals on farm properties. First order effects can be readily modeled by standard regression models. The presence of second order effects violates the independence assumption of standard statistical analysis techniques, and appropriate analysis techniques will have to take account of the covariance structure in the data giving rise to these local effects. Often spatial data are modeled as stationary spatial processes, which assumes that while there may be dependence between neighboring observations, it is independent of absolute location.
A stationary spatial process is isotropic if the covariance between observations at different locations depends only on the distance between them and not on direction. Nonstationary data is almost impossible to model as most locations will require different parameter sets. Therefore, most spatial modeling procedures begin by identifying a trend in mean value and then model the residuals from this trend as a stationary process. With any of these models it has to be kept in mind that they are abstractions of reality, and first or second order effects are artifacts of the modeler. Bailey and Gatrell (1995) conclude that models can be at best 'not wrong', rather than 'right'. They add that the analyst should always involve judgment and intuition in statistical modeling.

Problems in Spatial Data Analysis

A major factor influencing spatial data analysis is the geographical scale at which the data is being analyzed. It may be possible to identify specific non-random patterns at a local level which, when looked at from a national level, turn into random variations. Another problem can be that many spatial data sets are based on irregularly shaped area units or there may be directional effects. Proximity or neighborhood also may be more difficult to define clearly than, for example, in time-series analysis. Any type of spatial analysis will be subject to some degree of edge effect, where area units on the map boundary have neighbors only in one direction. Many data analyses have to be conducted with observations based on information summarized at a particular spatial aggregation level such as the veterinary district. Inferences from such analyses may only be correct if used at the same level of aggregation. This situation has also been called the modifiable areal unit problem.

Methods of Spatial Data Analysis

Methods used in spatial data analysis can be divided according to the three main categories of data to be analyzed. They are point patterns, spatially continuous data and area data.

Point patterns

Spatial point patterns are based on the coordinates of events such as the locations of outbreaks of a disease. It is also possible that they include attribute information such as the time of outbreak occurrence. Data on point patterns can be based on a complete map of all point events or a sampled point pattern. The basic interest of a spatial point pattern analysis will be to detect whether the pattern is distributed at random or represents a clustered or regular pattern. It is important to recognize that the stochastic process studied relates to the locations where events are occurring. A spatial point pattern can be quantified in terms of the intensity of the process using its first order properties, measured as the mean number of events per unit area. Second order properties or spatial dependency are analyzed on the basis of the relationship between pairs of points or areas. The latter is typically interpreted as analysis for clustering.

Visualization of spatial point patterns

The method used most frequently to present spatial point patterns is a dot map. It is generally difficult to assess randomness of a pattern from visual inspection of such a map. It becomes important to take account of the population at risk when, for example, inspecting a dot map of disease outbreaks. One method for representing this difference in population at risk is to use a cartogram, where the size of the areas is geometrically transformed in proportion to the corresponding population value. In a case-control study of tuberculosis breakdown in cattle herds from the Waikato region of New Zealand, all cattle herds which had broken down with tuberculosis infection were compared with a random sample of cattle herds free from infection.
Figure 1 presents a series of dot maps showing the locations of cases and random controls. Inspection of the map with locations of the case herds could give the observer the impression that they are clustered. Without inspection of the distribution of random control herds it becomes difficult to differentiate whether clustering only occurred in case herds or if the distribution of all cattle herds in this area is inherently non-random. In this situation it clearly is the case that cattle herds are not randomly distributed throughout the study region. But there is also some clustering of tuberculosis breakdowns.

Figure 1: Dot maps of the locations of herds from a case-control study of tuberculosis breakdown in New Zealand cattle herds (panels: Random Control Herds; Case Herds; Case and Control Herds)

Exploratory analysis of spatial point patterns

Techniques for exploratory spatial analysis of point patterns are aimed at deriving summary statistics or plots of the observed distribution to investigate specific hypotheses. The methods used examine first or second order effects.

First order effects for point patterns can be examined with two techniques: quadrat counts and kernel estimation. Quadrat methods involve dividing the area into sub-regions of equal size (quadrats) and producing a summary statistic on the basis of the number of events per quadrat. The counts are then divided by the area of the quadrat. These techniques give an indication of the variation of the intensity of the underlying process in space. The disadvantage of the techniques is that they aggregate the information into area type data, which can result in loss of information. Kernel estimation is a technique which uses the original point locations to produce a smooth bivariate histogram of intensity. It has been used for example for home range estimation in wildlife ecology (Izenman 1991).

Second order properties of point patterns can be investigated using the distances between the points, particularly nearest-neighbor distances. The latter can be estimated using two techniques: either the distance between a randomly selected event and the nearest neighboring event, or the distance between a randomly selected location in space and the nearest event. Spatial dependence can be investigated by visual examination of the probability distributions of the observed nearest-neighbor distances. Clustered events would show a steep rise of the distribution function at small distances, whereas regularity would be indicated by steepness of the curve at larger distances. The K function takes into account not just the nearest events. It depends on the assumption of an underlying isotropic process and is problematic to use in the presence of significant first order effects.
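The quadrat and nearest-neighbor summaries described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the rectangular study area and the four example events are assumptions made here, and no edge correction is applied.

```python
import math

def quadrat_intensity(points, xmin, ymin, xmax, ymax, nx, ny):
    """Return (counts, intensities) for an nx-by-ny grid of equal quadrats."""
    width = (xmax - xmin) / nx
    height = (ymax - ymin) / ny
    counts = [0] * (nx * ny)
    for x, y in points:
        # Events on the upper or right boundary go into the last quadrat.
        i = min(int((x - xmin) / width), nx - 1)
        j = min(int((y - ymin) / height), ny - 1)
        counts[i * ny + j] += 1
    quadrat_area = width * height
    return counts, [c / quadrat_area for c in counts]

def nearest_neighbor_distances(points):
    """Distance from each event to its nearest neighboring event."""
    return [
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]

def empirical_g(nn_distances, r):
    """Proportion of events whose nearest neighbor lies within distance r."""
    return sum(d <= r for d in nn_distances) / len(nn_distances)

# Hypothetical events in a unit square, divided into 2 x 2 quadrats.
events = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.6, 0.1)]
counts, intensity = quadrat_intensity(events, 0.0, 0.0, 1.0, 1.0, 2, 2)
nn = nearest_neighbor_distances(events)
g_at_02 = empirical_g(nn, 0.2)
```

Evaluating `empirical_g` over a range of distances gives the empirical distribution function whose steepness at small or large distances is interpreted in the text.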
Modeling of spatial point patterns

Spatial point modeling techniques are aimed at explaining an observed point pattern, and typically involve comparison with the model of complete spatial randomness (CSR). A point pattern generated by a random spatial process should follow a homogeneous Poisson process. This implies that every event has an equal probability of occurring at any position in the study area and occurrence is independent of the location of any other event, hence the absence of first order and second order effects. It is against this basic model that the analysis will assess whether the point process is regular, clustered or random. There is a range of methods available to test for CSR. Some are based on quadrat counts, such as the index of dispersion tests; others use nearest-neighbor distances, such as the Clark-Evans test, or the K function. Comparison of an observed pattern with CSR has its limitations in epidemiology as it does not allow definition of the type of point process other than whether it is completely random in space or not. It also cannot take account of issues such as a clustered underlying population at risk. Alternative models which could be used include the heterogeneous Poisson process, the Cox process, the Poisson cluster process or the Markov point process (Bailey and Gatrell 1995). Getis and Ord (1992) describe the use of a distance statistic G which can be used to assess spatial autocorrelation for point patterns as well as for area data. It can be used to detect local pockets of dependence which might not show up when using a global statistic.

The analysis of point patterns is important in veterinary epidemiology as it allows inferences on the occurrence of spatial clustering. The presence of clustering would suggest infectiousness or the presence of specific environmental risk factors. Second order effects in a spatial process can be the result of disease clustering.
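Two of the CSR tests named above, the index of dispersion and the Clark-Evans test, can be sketched as follows. This is a hedged illustration rather than a reference implementation: the function names and example data are assumptions, and the Clark-Evans version shown ignores edge effects. Under CSR the index of dispersion is expected to be close to 1 and the Clark-Evans ratio R close to 1; R below 1 points towards clustering and R above 1 towards regularity.

```python
import math

def index_of_dispersion(counts):
    """Variance-to-mean ratio of quadrat counts (about 1 under CSR)."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

def clark_evans(points, area):
    """Return (R, z) for the Clark-Evans test without edge correction."""
    n = len(points)
    nn = [
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]
    observed = sum(nn) / n
    intensity = n / area
    # Expected mean nearest-neighbor distance under a homogeneous Poisson process.
    expected = 1.0 / (2.0 * math.sqrt(intensity))
    standard_error = 0.26136 / math.sqrt(n * intensity)
    return observed / expected, (observed - expected) / standard_error

# Hypothetical examples: a regular 4 x 4 lattice and a tight cluster.
lattice = [(i, j) for i in range(4) for j in range(4)]
r_regular, z_regular = clark_evans(lattice, area=16.0)
cluster = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01)]
r_clustered, z_clustered = clark_evans(cluster, area=16.0)
```

As the text notes, rejecting CSR with either statistic says nothing about the distribution of the population at risk, so in epidemiological use these tests are a starting point only.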
Disease clustering can be assessed using a number of methods and they can be categorized into general and focused tests (Waller and Lawson 1995). The latter tests relate to the clustering of events around fixed point locations such as, for example, a nuclear power plant. Wartenberg and Greenberg (1990) describe techniques for detection of hot spot clusters and clinal clusters. A tool which can be effectively used for the analysis of clustering effects is the K function (Kingham, Gatrell, and Rowlingson 1995). In this context, two classes of point processes, such as cases of disease and random controls without the disease, are compared. The principle is that both point processes are pooled and then the point process describing the cases is compared with the pooled process. Cuzick and Edwards (1990) developed a method which is also based on nearest-neighbor distances. The test statistic simply counts the number of case-case pairs for a given number of nearest neighbors. Applying this technique to the case-control data mentioned above, it appears that there is significant clustering of cases compared with the control population (see Figure 2). The software Stat! (BioMedware, Ann Arbor, Michigan, U.S.A.) was used to run the analysis and produce the graph.

Figure 2: Cuzick and Edwards method applied to tuberculosis breakdown case control study data (+ = cases, = controls, arrows identify nearest neighbors)

Kingham, Gatrell, and Rowlingson (1995) describe a method combining the Diggle and Chetwynd method based on bivariate K functions with results of a logistic regression analysis, allowing them to make use of information about additional covariates to test for clustering. Methods aimed at the detection of hot spots or small areas which might represent clusters of disease include the Geographical Analysis Machine developed by Openshaw (1990). This technique is based on comparing the observed intensity of cases in circles of varying radius. The result of the analysis is a map with circles indicating the areas where case incidence was higher than expected under the assumption of spatial randomness.
Alexander and Cuzick (1992) reviewed methods for assessment of disease clusters. Wartenberg and Greenberg (1993) discuss problems associated with detection of disease clusters.
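The nearest-neighbor statistic of Cuzick and Edwards described above can be sketched as follows. This is an illustrative reading of the description in the text (count, for every case, how many of its k nearest neighbors in the pooled case-control data are also cases), with significance assessed here by random relabelling of cases and controls. The function names, the permutation approach and the example coordinates are assumptions, not the procedure implemented in the software used for Figure 2.

```python
import math
import random

def cuzick_edwards(points, is_case, k):
    """T_k: over all cases, the number of their k nearest neighbors that are cases."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        if not is_case[i]:
            continue
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.hypot(xi - points[j][0], yi - points[j][1]),
        )
        total += sum(1 for j in others[:k] if is_case[j])
    return total

def relabelling_p_value(points, is_case, k, n_perm=999, seed=0):
    """Monte Carlo p-value obtained by shuffling the case/control labels."""
    rng = random.Random(seed)
    observed = cuzick_edwards(points, is_case, k)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cuzick_edwards(points, labels, k) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical herds: three cases close together, three controls elsewhere.
herds = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
case_flags = [True, True, True, False, False, False]
t1 = cuzick_edwards(herds, case_flags, k=1)
t2, p2 = relabelling_p_value(herds, case_flags, k=2)
```

Because the controls stand in for the population at risk, a large T_k indicates clustering of cases beyond the clustering of herds themselves, which is the distinction drawn in the discussion of Figure 1.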

Other analyses using point pattern information

Point events in space may have time of occurrence as one particular attribute. Specific methods are available to investigate clustering in space and time. Interaction is present if pairs of cases are near in space as well as in time. Contagious diseases requiring direct contact will produce space-time clustering between cases. The techniques described below are not considered useful for non-contagious diseases. There are three main methods available which allow assessment of a space-time relationship between cases: Knox's method, Mantel's method and the K-nearest neighbor method. All three techniques require production of distance matrices of the spatial as well as the temporal relationship between cases.

Data from a longitudinal study of Mycobacterium bovis infection in a wild possum population in New Zealand will be used to demonstrate the usage of these techniques (Pfeiffer 1994). As a first step a histogram of the geographical nearest neighbor distances has to be produced as shown in Figure 3. The distribution of nearest neighbor distances expected under spatial randomness is shown in the histogram. It assumes uniform population density across the study area. The excess of shorter distances compared with the Poisson probability density function suggests that there may be spatial clustering in this data set. The map in the same figure indicates the locations of the 26 cases used in the analysis, and their nearest neighbors are indicated by the arrows. Figure 4 presents the temporal distance distribution and a map connecting the cases according to their sequence of occurrence. The map does suggest that the disease has shifted its spatial focus over time and that there is some degree of temporal clustering.

Figure 3: Map and histogram of geographical distances between cases of tuberculosis infection in wild possums

Figure 4: Temporal distance map and histogram for cases of tuberculosis infection in wild possums

As a next step a formal statistical test has to be conducted to assess the statistical significance of a potential space-time interaction process. When using Knox's method, a critical distance in time as well as in space defining closeness has to be set, and pairs of cases are tabulated into a 2×2 contingency table with spatial and temporal closeness/farness defining the rows and columns (Knox 1964). Knox saw the critical distance as defining the latency period. In most situations determining the critical distance requires a subjective decision. Approximate randomization permutation techniques are used to construct a null distribution for Knox's test statistic. Figure 5 shows the results of Knox's test applied to the tuberculous possum data using a critical distance of 100 m in space and 3 months in time. The result of 30 for Knox's test statistic X, which is significant at a p-value of 0.02, suggests that, given the selected critical distances, time-space interaction is present in this data set.
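Knox's method as described above can be sketched as follows: tabulate all case pairs by closeness in space and time, take the number of pairs close in both as the statistic X, and build a null distribution by randomly reassigning the times of occurrence to the locations. The function names and the six example cases are hypothetical; only the critical distances of 100 m and 3 months are taken from the text.

```python
import math
import random

def knox_table(locations, times, space_crit, time_crit):
    """2x2 table of case pairs: rows = close/far in space, columns = close/far in time."""
    table = [[0, 0], [0, 0]]
    n = len(locations)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(locations[i][0] - locations[j][0],
                           locations[i][1] - locations[j][1])
            row = 0 if d <= space_crit else 1
            col = 0 if abs(times[i] - times[j]) <= time_crit else 1
            table[row][col] += 1
    return table

def knox_test(locations, times, space_crit, time_crit, n_perm=999, seed=0):
    """Return Knox's X and a Monte Carlo p-value from permuting the times."""
    rng = random.Random(seed)
    observed = knox_table(locations, times, space_crit, time_crit)[0][0]
    shuffled = list(times)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if knox_table(locations, shuffled, space_crit, time_crit)[0][0] >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical cases: coordinates in metres, times in months.
case_xy = [(0, 0), (10, 0), (0, 20), (500, 500), (510, 500), (1000, 0)]
case_month = [0, 1, 2, 50, 51, 100]
table = knox_table(case_xy, case_month, space_crit=100, time_crit=3)
x_stat, p_value = knox_test(case_xy, case_month, space_crit=100, time_crit=3)
```

Rerunning the test over a grid of critical distances is one way to see how strongly the conclusion depends on the subjective choice mentioned in the text.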

Figure 5: Results of Knox's method applied to cases of tuberculosis infection in wild possums

Another approach to investigate time-space interaction could be the use of the Mantel method (Mantel 1967). Mantel's approach does not require selection of critical distances. It uses both the time and the space distance matrices between all cases. But it should be kept in mind that the Mantel test can be insensitive to non-linear associations between time and space distances. Distance measures can be transformed in a number of ways, such as the reciprocal transformation which reduces the effect of large time and space distances. The null hypothesis is that the time distances are independent of the space distances. Randomization permutation techniques can be used to generate a null distribution of the test statistic for the Mantel test. Figure 6 presents the results of this analysis when applied to the data on tuberculous possums. The scatter plot of space distances against temporal distances seems to suggest that, while the points are scattered throughout the plot, there are some denser accumulations of cases present. The frequency distribution of the test statistic under the null hypothesis on the basis of 500 random permutations is presented in the left window of Figure 6. It can be concluded that there is significant space-time interaction.
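The Mantel approach described above can be sketched as follows. The sketch uses the correlation between the space and time distances over all case pairs as the test statistic (a standardized form of the Mantel product statistic) and permutes the times of occurrence to obtain the null distribution. The function names, the optional `transform` argument and the example data are assumptions for illustration.

```python
import math
import random

def pair_distances(locations, times, transform=None):
    """Space and time distances for all case pairs, optionally transformed."""
    space, time = [], []
    n = len(locations)
    for i in range(n):
        for j in range(i + 1, n):
            s = math.hypot(locations[i][0] - locations[j][0],
                           locations[i][1] - locations[j][1])
            t = abs(times[i] - times[j])
            if transform is not None:
                s, t = transform(s), transform(t)
            space.append(s)
            time.append(t)
    return space, time

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mantel_test(locations, times, n_perm=999, seed=0, transform=None):
    """Return the observed correlation and a one-sided Monte Carlo p-value."""
    rng = random.Random(seed)
    observed = correlation(*pair_distances(locations, times, transform))
    shuffled = list(times)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        r = correlation(*pair_distances(locations, shuffled, transform))
        if r >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical cases spreading along a line, one step per time unit.
line_xy = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
line_t = [0, 1, 2, 3, 4]
r_obs, p_mantel = mantel_test(line_xy, line_t)
```

A reciprocal transformation as mentioned in the text could be passed in as, for example, `transform=lambda d: 1.0 / (1.0 + d)`; the added constant is an assumption made here to avoid division by zero for coincident cases.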

Figure 6: Results from applying the Mantel method to test for time-space interaction between cases of tuberculosis infection in wild possums

A third approach available is the K-nearest neighbor test of space-time interaction in point data. The test statistic indicates the number of case pairs which are K nearest neighbors in both time and space. The statistic is based on an approximate randomization of the Mantel product statistic. Figure 7 presents the results from applying the K-nearest neighbor method to the possum tuberculosis data. The map shows the locations of the cases and the arrows indicate k=2 nearest neighbors. The test statistic produced on the basis of 1000 random permutations suggests that only the cumulative statistic Jk is statistically significant, whereas ΔJk is not. The latter parameter measures the statistical significance from increasing K by 1. The test statistic supports the presence of space-time interaction, and suggests that the first 5 nearest neighbors are involved in space-time interaction.
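The K-nearest neighbor statistic described above can be sketched as follows, reading the cumulative statistic as the number of ordered case pairs in which one case is among the k nearest neighbors of the other in both space and time, and the incremental statistic as the change when k is increased by 1. The function names, the handling of ties (by case order) and the example data are assumptions; the permutation step follows the same pattern as for the Knox and Mantel tests and is omitted for brevity.

```python
import math

def k_nearest(distances, k):
    """For each case, the set of indices of its k nearest other cases."""
    n = len(distances)
    result = []
    for i in range(n):
        # Ties are broken by case order because Python's sort is stable.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: distances[i][j])
        result.append(set(order[:k]))
    return result

def knn_space_time(locations, times, k):
    """Cumulative statistic: pairs that are k nearest neighbors in space and time."""
    space = [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in locations]
             for a in locations]
    time = [[abs(s - t) for t in times] for s in times]
    near_space = k_nearest(space, k)
    near_time = k_nearest(time, k)
    return sum(len(near_space[i] & near_time[i]) for i in range(len(locations)))

def knn_increment(locations, times, k):
    """Change in the cumulative statistic when k is increased from k - 1 to k."""
    previous = knn_space_time(locations, times, k - 1) if k > 1 else 0
    return knn_space_time(locations, times, k) - previous

# Hypothetical cases in which spatial order and temporal order coincide.
xy = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
t = [0, 1, 2, 3, 4]
j1 = knn_space_time(xy, t, 1)
j2 = knn_space_time(xy, t, 2)
delta_j2 = knn_increment(xy, t, 2)
```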

Figure 7: Results from applying the K-nearest neighbor method to test for time-space interaction between cases of tuberculosis infection in wild possums

Spatially continuous data

The point pattern analyses assessed characteristics of the spatial distributions of points, but made only limited use of attribute information. With spatially continuous and also area data the analysis focus shifts towards use of the attribute information, in order to describe their pattern in space. Spatially continuous data is also often referred to as geostatistical data. The data is usually collected by sampling at fixed points in space. The main objective of the analysis will be to describe the spatial variation in an attribute value, using the data collected at the sampled points. The spatial variation can be modeled as first and second order spatial processes.

Visualization of spatially continuous data

The data values obtained from the sampled locations can be mapped using proportionally scaled symbols or columns for each sampling point. Column charts will in fact allow presentation of multiple attribute values at the same point. Overlapping symbols can cause a problem with interpretation of such maps. But none of these approaches will be able to represent the underlying continuity of the process studied.

Exploratory analysis of spatially continuous data

The techniques used in this area can be categorized into methods for describing first order effects and those for second order effects.

Exploratory analysis of first order effects in spatially continuous data

The main techniques in this area are spatial moving averages, tessellation methods and kernel estimation techniques. A spatial moving average interpolates values between a given number of neighboring sampling points. The more points are used, the smoother the created surface will be. It will be possible to describe global trends. A weighting mechanism can be introduced to account for varying distances between sampling points.

Alternatively a tessellation of the observed sample points can be used. This is most commonly done using Delaunay triangulation, which produces a triangulated irregular network (TIN). The triangulation is the dual of a partition that assigns to each sampling point the territory within which every location is closer to this sampling point than to any other. The resulting polygon map is called a Dirichlet tessellation and the tiles are known as Voronoi or Thiessen polygons. A TIN can be used to construct a contour map or a digital terrain model (DTM). Figure 8 shows a triangulated irregular network based on height information from a 50 m grid plus break lines. The TIN was then used to create a contour map and a digital terrain model of the possum tuberculosis longitudinal study area. Pfeiffer (1994) used this technique to generate polygons from point locations describing the areas where particular Mycobacterium bovis strains occurred. The technique can be appropriate for presence/absence type information, where a smooth surface is not the objective of the interpolation.

Figure 8: TIN and derived contour maps and digital terrain model for the possum tuberculosis longitudinal study site (panels: TIN; Contour map; DTM)

As with point patterns it is also possible to use kernel estimation to convert the attribute data from the sampling points into a surface, this time using not the number of events per unit area but rather the value of the attribute.
This technique has been used in geographical epidemiology to model the relative risk function, measuring local risk relative to the regional mean (Bithell 1990).

Exploratory analysis of second order effects in spatially continuous data

Spatial dependence between attribute values measured at sampled locations is described using the covariance function or covariogram. The presence of second order effects would result in positive covariance between observations a small distance apart and lower covariance or correlation if they are further apart. The covariogram describes the covariance as a function of the distance h between sample points, and the correlogram the corresponding correlation. The semi-variogram is a graphical representation of the variation between sampling points separated by a given distance and direction. For a stationary spatial process all three describe similar information. Estimates of the semi-variogram are considered to be more robust to departures from stationarity represented as a general trend in the spatial process. A continuous process without spatial dependence will result in a horizontal line. A stationary process will reach an upper bound, referred to as the sill, at a distance h called the range. Theoretically, the intercept with the y-axis should be at a value of 0 variation. In reality, sampling error and small scale variation will result in variability at small distances and the variogram will not meet the y-axis at the origin. This intercept with the y-axis is called the nugget effect. Variograms which do not reach an upper bound suggest non-stationarity. Figure 9 shows an isotropic sample semi-variogram for the proportion of tuberculous possums captured at trap sites during the longitudinal study. The shape of the variogram suggests that the process is non-stationary, but given the relatively small nugget value there is also likely to be spatial dependence.

Figure 9: Isotropic semi-variogram for the proportion of tuberculous possums at individual trap sites in the longitudinal study (panels: Parameters defining a variogram; Example of a variogram)

Modeling of spatially continuous data

A number of approaches can be used to model or predict spatially continuous data. For first order processes, trend surface analysis can be used based on ordinary polynomial least squares regression. Results have to be treated with caution, because the standard regression assumptions of independent random errors and homoscedasticity are likely to be violated. Lessard et al. (1990) used an inverse distance-weighted mathematical algorithm to interpolate climatic measurements between sample points. Most trend surface models may be able to describe an overall trend, but are not useful for local prediction. In the presence of weak first order but strong second order effects it is more appropriate to use models fitted to variograms. Such models can be defined by eye and are most commonly based on the spherical, exponential or Gaussian model.
The fit of a particular model can be assessed through cross-validation. Figure 10 shows an omni-directional exponential variogram model for the possum tuberculosis prevalence data. For the model to be valid it would be necessary to remove the non-stationarity through trend regression.
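The sample semi-variogram and an exponential variogram model of the kind discussed above can be sketched as follows. The estimator shown is the usual average of half the squared differences between values for pairs of points falling into each distance class; the function names, the lag binning and the parameterization of the exponential model (with the factor 3, so that `range_` is the distance at which about 95% of the sill is reached) are assumptions made for this illustration.

```python
import math

def sample_semivariogram(locations, values, lag_width, n_lags):
    """Omnidirectional estimate: list of (mean distance, semi-variance, pairs) per lag."""
    dist_sum = [0.0] * n_lags
    sq_sum = [0.0] * n_lags
    pairs = [0] * n_lags
    n = len(locations)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.hypot(locations[i][0] - locations[j][0],
                           locations[i][1] - locations[j][1])
            lag = int(d / lag_width)
            if lag < n_lags:
                dist_sum[lag] += d
                sq_sum[lag] += (values[i] - values[j]) ** 2
                pairs[lag] += 1
    return [
        (dist_sum[b] / pairs[b], sq_sum[b] / (2 * pairs[b]), pairs[b])
        for b in range(n_lags) if pairs[b] > 0
    ]

def exponential_model(h, nugget, partial_sill, range_):
    """Exponential variogram model; the sill is nugget + partial_sill."""
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / range_))

# Hypothetical transect with a linear trend: the semi-variance keeps increasing
# with distance and never levels off at a sill.
sites = [(0, 0), (1, 0), (2, 0), (3, 0)]
prevalence = [0.0, 1.0, 2.0, 3.0]
gamma = sample_semivariogram(sites, prevalence, lag_width=1.0, n_lags=4)
```

The example transect illustrates the point made in the text: an unbounded variogram is the signature of a trend, which should be removed before a bounded model such as the exponential is fitted to the residuals.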

Figure 10: Omnidirectional exponential variogram model for possum tuberculosis prevalence data

The variogram model itself does not allow prediction of values. This can be achieved with Kriging. This is a weighted moving average technique for estimating the value of a spatially distributed variable from adjacent values while considering interdependence expressed in a variogram. It allows the interpolation error to be mapped and from a statistical viewpoint is considered to be the most satisfactory method for interpolation (Oliver and Webster 1990). Pfeiffer (1994) used ordinary Kriging to produce a surface of possum population density based on possum capture data at sample points (see Figure 11). The omnidirectional variogram suggests that this data is more stationary than the tuberculosis prevalence data, but it also shows strong spatial dependence. An exponential model was fitted and used as the basis for Kriging. The distribution of Kriging errors shows that there are some reasonably high errors and according to the map they are located in one particular area of the study.
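Ordinary Kriging as described above can be sketched for a single prediction location as follows. The weights are obtained by solving the ordinary Kriging system built from an exponential variogram model without nugget, with a Lagrange multiplier forcing the weights to sum to 1. The function names, the variogram parameters and the four sample points are hypothetical, and the small linear solver is included only to keep the sketch self-contained; this is not the software or the model used for Figure 11.

```python
import math

def exp_variogram(h, partial_sill, range_):
    """Exponential variogram model without nugget (assumed parameterization)."""
    return partial_sill * (1.0 - math.exp(-3.0 * h / range_))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(locations, values, target, partial_sill, range_):
    """Return (estimate, Kriging variance, weights) at one target location."""
    def gamma(p, q):
        return exp_variogram(math.hypot(p[0] - q[0], p[1] - q[1]),
                             partial_sill, range_)
    n = len(locations)
    # Kriging system: variogram values between samples, bordered by ones.
    a = [[gamma(locations[i], locations[j]) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(p, target) for p in locations] + [1.0]
    solution = solve(a, b)
    weights, multiplier = solution[:n], solution[n]
    estimate = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + multiplier
    return estimate, variance, weights

# Hypothetical trap sites (metres) with possum density values.
traps = [(0, 0), (10, 0), (0, 10), (10, 10)]
density = [1.0, 2.0, 3.0, 4.0]
est_centre, var_centre, w_centre = ordinary_kriging(traps, density, (5, 5), 1.0, 30.0)
est_at_site, var_at_site, _ = ordinary_kriging(traps, density, (10, 0), 1.0, 30.0)
```

Evaluating the variance over a grid of target locations gives a map of interpolation error of the kind described in the text; without a nugget the prediction at a sampled location reproduces the observed value with zero Kriging variance.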

Figure 11: Variogram model, Kriging errors and estimates for possum density in the longitudinal study on possum tuberculosis epidemiology (panels: Omnidirectional variogram model; Histogram of Kriging errors; Map of Kriging errors; Contour map based on Kriging estimates)

A number of multivariate methods can be used for modeling of spatially continuous data. Principal components can be used to combine the information from multiple variables into a small number of components, each of them representing a particular combination of variables and explaining a particular proportion of the variation in the data. Eastman and Fulk (1993) used the technique to analyze the information contained in a time series of NDVI maps for Africa, thereby conducting a space-time analysis. Cliff et al. (1995) discuss the application of multidimensional scaling (MDS) to spatial epidemiological data. They use the technique to map geographical information about measles mortality in Australia and New Zealand as disease space, where points with similar disease risks are closer to each other on the MDS map even though they are far removed geographically. Bailey and Gatrell (1995) discuss a range of other multivariate analysis techniques for spatially continuous data.

Area data

Attribute data which has values within fixed polygonal zones within a study area is referred to as area data or lattice data. The areal units can constitute a regular lattice or grid or consist of irregular units. It is usually not required to estimate values as they should be present for all areas. The main emphasis with area data is on detection and explanation of spatial patterns or trends, possibly extended to take account of covariates.

Visualization of area data

Area data can be visualized using a wide range of techniques. The choropleth map is probably the most commonly used tool. Appropriate use of class intervals and colors to represent values in a choropleth map is essential. Cartograms or density equalized maps can be used to express the importance of particular areas. The analyst has to be aware of the problems which can be caused by the modifiable areal unit problem. It is also possible to display several attributes at the same time, by adding scaled columns or symbols to a choropleth map. Area information can be presented together with spatially continuous data, for example by draping a choropleth map over a DTM.

Figure 12: Examples of choropleth maps (panels: choropleth map of cattle density in Switzerland, with legend entries Cattle per Hectare, Lakes and Canton Boundaries; Thiessen polygons representing areas used by possums infected with four different strains of Mycobacterium bovis, draped over a DTM)

Exploration of area data

Informal investigations of hypotheses can be aimed at first order or second order spatial processes. Most of these techniques require a methodology for measuring proximity. Possible approaches include various distance measures between polygon centroids as well as presence or length of a shared boundary. For further analyses the proximity information can be described through generation of contiguity or spatial weights matrices. This is quite difficult to achieve in currently available GIS software and there are only a few specialized spatial statistics software packages (e.g. SpaceStat, Regional Research Institute, West Virginia University, Morgantown, West Virginia, U.S.A.) which can perform such operations. For investigation of first order effects, a spatial weights matrix can be used to estimate spatial moving averages. If the data is available on the basis of a regular grid, the median polish may be appropriate.
Kernel estimation can also be applied for investigations of first order effects in area data. In the case of second order spatial processes the objective is to explore the spatial dependence of deviations in attribute values from their mean. In the context of area data, this effect is referred to as spatial autocorrelation, and it quantifies the correlation between values of the same attribute at different locations. The most commonly used measures of spatial autocorrelation are Moran's I and Geary's C. The first is closely related to the covariogram and the second to the variogram used for spatially continuous data. Power analyses of these statistics have been conducted by Walter (1993), who concluded that the power of Moran's I was highest. A correlogram can be used to graphically display the correlation between values at different spatial lags. If the autocorrelation does not decline after a number of lags, this indicates the presence of non-stationarity. The correlogram has similar applications in spatial analysis as it has in time-series analysis for describing patterns. Hungerford (1991) analyzed the spatial distribution of cattle anaplasmosis between counties within the state of Illinois using second-order analysis and detected significant spatial clustering within the state.

The above mentioned methods do not provide local indicators of spatial association, which would be useful for identifying so-called hot spots. The Moran scatterplot and spatial lag pies described in Anselin (1994) can be used to describe local patterns of variation visually. Quantitative estimates can be obtained using the G statistic by Getis and Ord (1992) or the local indicators of spatial association (LISA) by Anselin (1995). The latter can be used as an indicator of local pockets of non-stationarity (hot spots), similar to the G statistic, and also to assess the influence of individual data points on the global statistic and to identify outliers. Anselin, Dodson, and Hudak (1993) describe how these different techniques can be combined to form an exploratory spatial analysis system. Figure 13 shows a number of examples used by these authors to display local variation (from Anselin's world wide web site). The spatial lag pie map superimposes a pie on each area, with the top half of the pie representing the local value and the bottom half the neighboring values for this particular variable. It gives the observer an appreciation of the ratio between the local value and the values in the surrounding spatial units. The Moran scatterplot shows the original value of each observation on the x-axis and the value of its spatial lag on the y-axis. The plot can be used to identify outliers or even to conduct local regression to further describe the spatial association. These outliers can then be mapped as shown in Figure 13. The map of the areas with a significant LISA statistic indicates the areas where there appears to be spatial autocorrelation.
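As a rough illustration of the global and local statistics discussed above, the following Python sketch computes Moran's I and a local Moran statistic for a small set of areas. The data are hypothetical (six areas arranged in a row, each adjacent to its immediate neighbors), the function names are invented for this example, and the weights are assumed to be row-standardized. No significance test is included; in practice the statistics would be assessed against a reference distribution, for example by permutation.

```python
def deviations(values):
    """Deviations of each value from the overall mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n weights matrix."""
    n = len(values)
    z = deviations(values)
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


def local_morans_i(values, weights):
    """Local Moran statistic for each area: its deviation times its spatial lag."""
    n = len(values)
    z = deviations(values)
    m2 = sum(zi * zi for zi in z) / n  # second moment of the deviations
    return [
        z[i] / m2 * sum(weights[i][j] * z[j] for j in range(n))
        for i in range(n)
    ]


# Hypothetical data: low values at one end of the row, high values at the other.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
n = len(values)
binary = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
weights = [[w / sum(row) for w in row] for row in binary]  # row-standardized

global_i = morans_i(values, weights)
local_i = local_morans_i(values, weights)
print(round(global_i, 3), [round(v, 3) for v in local_i])
```

With row-standardized weights the mean of the local statistics equals the global statistic, which is the sense in which the local values show how much each area contributes to the overall measure.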

Figure 13: Example of an exploratory spatial data analysis approach (panels: spatial lag pie map; Moran scatterplot; Moran scatterplot outliers; map of areas with significant LISA or G statistic)

In landscape ecology, approaches have been developed to describe the interactions among patches within a landscape mosaic, referred to as landscape pattern. Most biological processes, and that of course includes diseases, are influenced by a multitude of factors which together may form a particular pattern. Spatial patterns are particularly difficult to quantify. Ecologists use the term landscape structure to describe the spatial relationships between habitat patches within a landscape (Dunning, Danielson, and Leck 1992). The software FRAGSTATS (McGarigal and Marks, Oregon State University, Corvallis, Oregon, U.S.A.) allows calculation of a wide range of indices and parameters describing landscape structure which could be used for further analyses.

Modeling of area data

Modeling techniques are aimed at establishing explanatory relationships between the attribute values of a dependent variable and other values associated with each area unit, taking account of the relative spatial arrangement of the areas. Again, it is possible to focus the analysis on first order or second order effects. Multiple ordinary least squares regression can be used for preliminary exploratory analyses only, because in the presence of spatial dependence the errors are not independent and the variance is unlikely to be constant. The presence of spatial dependence can be assessed readily using a spatial correlogram. A range of spatial regression models have been described by Haining (1990), and they can be implemented using the SpaceStat software mentioned above. Hungerford (1991) analyzed the relationship between cattle density and anaplasmosis prevalence on a county basis in Illinois using measures of spatial correlation. Perry et al. (1991) used a GIS to investigate the occurrence of Rhipicephalus appendiculatus in Africa in order to identify the factors controlling the distribution of this vector tick, which transmits the parasite Theileria parva causing East Coast fever, Corridor disease and January disease in cattle.

A number of authors have included spatial data in multivariate analyses as independent variables. Clifton-Hadley (1993) used spatial descriptive measures, spatial autocorrelation and distance to particular features of interest to analyse patterns of occurrence of badger-related tuberculosis breakdowns of cattle herds in south-west England. Pfeiffer (1994) used a GIS to derive, for point locations (cases of disease), specific geographical variables such as height above sea level, aspect, slope and distance to features of interest, which were then used as explanatory variables in multivariate statistical analysis. In the field of epidemiology, parameters of interest are very often counts or proportions, which can be modeled using generalised linear modeling techniques rather than ordinary least-squares regression. It should be noted, though, that spatial forms of these models are not yet well developed. Bailey and Gatrell (1995) suggest introducing covariates into the regression model, such as the spatial coordinates or a variable representing regions categorized broadly by location, to remove the effect of spatial dependence. Glass et al. (1995) developed a risk density map for Lyme disease based on a multiple logistic regression model, but they did not attempt to remove spatial dependence from the data. A number of different predictive modeling approaches for spatial data were compared by Williams et al. (1994). They used linear and non-linear discriminant analysis, tree-based induction and neural networks to map tsetse distributions in Zimbabwe and concluded that while the simpler methods (linear discriminant analysis and tree-based induction) were less precise, they were easier to interpret.

Figure 14 presents some preliminary results of a logistic regression analysis for prediction of Theileria parva presence in an African country (this analysis was conducted by Perry, B.D., Kruska, R.L., Pfeiffer, D.U. and others at ILRI, Nairobi, Kenya). The regression model includes eight different environmental and land use variables and is based on information collected at random sample locations throughout the country. The model was used to generate a risk map representing the probability of T. parva presence at a particular location given the risk factors included in the model. This map is presented as a DTM and as a raster map. In addition, two raster maps are shown which display the lower and upper 95% confidence limits of the probability of T. parva presence as predicted by the regression model. The receiver operating characteristic (ROC) curve characterizing the predictive accuracy of the model could be used to adjust the decision making cut-off for the predicted probability, balancing sensitivity and specificity as required. In this analysis the possible presence of spatial dependence was not taken into account.
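The use of an ROC curve to choose a cut-off for the predicted probability can be sketched as follows. The predicted probabilities and observed presence values are hypothetical and are not taken from the analysis shown in Figure 14, and the function names are invented for this example. The criterion used here, maximizing sensitivity plus specificity minus one (Youden's index), is an assumption chosen for illustration; a different balance may be preferable depending on the relative cost of false positives and false negatives.

```python
def sensitivity_specificity(probabilities, observed, cutoff):
    """Classify each location as positive when its probability >= cutoff."""
    tp = fn = tn = fp = 0
    for p, present in zip(probabilities, observed):
        predicted = p >= cutoff
        if present and predicted:
            tp += 1
        elif present:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(probabilities, observed):
    """(cutoff, sensitivity, specificity) for every distinct predicted probability."""
    return [
        (c, *sensitivity_specificity(probabilities, observed, c))
        for c in sorted(set(probabilities))
    ]


def best_cutoff(points):
    """Point maximizing sensitivity + specificity - 1 (an assumed criterion)."""
    return max(points, key=lambda point: point[1] + point[2] - 1.0)


# Hypothetical model output for ten sample locations (1 = parasite present).
probabilities = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
observed = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

points = roc_points(probabilities, observed)
cutoff, sensitivity, specificity = best_cutoff(points)
print(cutoff, sensitivity, specificity)
```

Plotting sensitivity against one minus specificity for all entries in `points` gives the ROC curve itself.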

Figure 14: Results of a multiple logistic regression analysis for prediction of Theileria parva presence (panels: ROC curve for the logistic regression model, with sensitivity plotted against percent of false positives; DTM of predicted probability of Theileria parva presence; raster map of predicted probability of T. parva presence; raster map of lower 95% confidence limit of probability of T. parva presence; raster map of upper 95% confidence limit of probability of T. parva presence; legend entry: sampling location)

The appropriate way to map probabilities or rates is as a measure of relative risk, which can be estimated by dividing the observed risk by an estimate of the expected risk. The standardized mortality ratio has been used widely to represent spatial variation of disease risk (Elliott, Martuzzi, and Shaddick 1995). Small numbers of observations may result in extreme values. More recently this problem has become less important through the adoption of empirical Bayes estimation. These techniques require an estimate of a prior probability distribution, which can for example be based on the overall probabilities across all areas. Bayesian techniques are then used to convert the observed estimates into posterior probability estimates. The approach can be made spatial by using the probabilities in neighboring areas to derive the prior probability distribution.

Decision making and spatial data

Spatial data together with other non-spatial information is used for decision making purposes. This has become more difficult because the amount of information available has increased substantially. Specific decision making tools have been developed in an attempt to simplify the process of making the right choice. A recent area of activity has been the adaptation of multicriteria and multi-objective evaluation techniques to spatial problems. Such systems can take account of uncertainty in the data as well as of the risk of making the wrong decision (Eastman et al. 1995). Spatial data has also become an essential component of disease information systems, which are beginning to replace the largely manual systems used by decision makers for the control of endemic and epidemic diseases. The ability to process large amounts of data easily, objectivity and speed of response are some of the advantages of computerized animal disease information systems. GIS provides an essential component of such systems. An example of an animal disease information system is EpiMAN, which was developed in New Zealand for the management of an outbreak of foot-and-mouth disease (Morris et al. 1992).

Epidemiological simulation

GIS can provide geographical data which allows computer simulations of the dynamics of infectious diseases for specific geographical locations. Spatial heterogeneity can be represented in simulation models, resulting in more realistic models. There are only a few examples where this approach has been used in veterinary epidemiology. Sanson (1993) described a model of foot-and-mouth disease which represents inter-farm spread of the disease in a true geographical area, using various transmission mechanisms. Pfeiffer (1994) developed a geographic simulation model of the dynamics of bovine tuberculosis infection in wild possum populations. The geographical component is a major feature of this model, which uses vegetation maps to represent the ecological conditions of particular environments.

Conclusion

Spatial data has become an important component of disease investigations. The availability of geographic information systems in combination with fast and relatively inexpensive computer hardware leaves the epidemiologist with the responsibility of making effective use of the information. While descriptive techniques for spatial data have been available for a long time, exploratory and modeling techniques are still a very active area of development, and they are not yet accessible enough to the analyst to have become a routine component of epidemiological analysis.

References

Alexander, F.E. and J. Cuzick. Methods for the assessment of disease clusters. Geographical and environmental Epidemiology: Methods for small-area Studies. Editors P. Elliott, J. Cuzick, D. English, and R. Stern, Oxford: Oxford University Press.

Anselin, L. Exploratory spatial data analysis and geographic information systems. New Tools for spatial Analysis. Editor M. Painho, Luxembourg: Eurostat.

Anselin, L. Local indicators of spatial association - LISA. Geographical Analysis 27, no. 2:

Anselin, L. Spatial data analysis with GIS: An introduction to application in the social Sciences. 75. Technical Report Series. Santa Barbara, California: National Center for Geographic Information and Analysis.

Anselin, L., R. F. Dodson, and S. Hudak. Linking GIS and spatial data analysis in practice. Geographical Analysis 1:

Bailey, T.C. and A. C. Gatrell. Interactive spatial data analysis. Harlow, Essex, England: Longman Group. 413pp

Bithell, J. F. An application of density estimation to geographical epidemiology. Statistics in Medicine 9:

Cliff, A. D., P. Haggett, M. R. Smallman-Raynor, D. F. Stroup, and G. D. Williamson. The application of multidimensional scaling methods to epidemiological data. Statistical Methods in Medical Research 4:

Clifton-Hadley, R. S. The use of a geographical information system (GIS) in the control and epidemiology of bovine tuberculosis in south-west England. Proceedings of the Society for Veterinary Epidemiology and Preventive Medicine, editor M. V. Thrusfield, Society for Veterinary Epidemiology and Preventive Medicine.

Cuzick, J., and R. Edwards. Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society B 52, no. 1:

Dunning, J. B., B. J. Danielson, and C. F. Leck. Ecological processes that affect populations in complex landscapes. Oikos 65:

Eastman, J. R., and M. Fulk. Long sequence time series evaluation using standardized principal components. Photogrammetric Engineering and Remote Sensing 59, no. 6:

Eastman, J. R., W. Jin, P. A. K. Kyem, and J. Toledano. Raster procedures for multicriteria/multi-objective decisions. Photogrammetric Engineering and Remote Sensing 61, no. 5:

Elliott, P., M. Martuzzi, and G. Shaddick. Spatial statistical methods in environmental epidemiology: a critique. Statistical Methods in Medical Research 4:

Getis, A., and J. K. Ord. The analysis of spatial association by use of distance statistics. Geographical Analysis 24 (3):

Glass, G. E., B. S. Schwartz, J. M. Morgan, D. T. Johnson, P. M. Noy, and E. Israel. Environmental risk factors for Lyme disease identified with geographic information systems. American Journal of Public Health 85, no. 7:

Haining, R. Spatial Data Analysis in the social and environmental Sciences. Cambridge: Cambridge University Press.

Hungerford, L. L. Use of spatial statistics to identify and test significance in geographic disease patterns. Preventive Veterinary Medicine 11:

Izenman, A. J. Recent developments in nonparametric density estimation. Journal of the American Statistical Association 86 (413):

Kingham, S. P., A. C. Gatrell, and B. Rowlingson. Testing for clustering of health events within a geographical information system framework. Environment and Planning A 27:

Knox, E. G. The detection of space-time interaction. Applied Statistics 13:

Lessard, P., R. L'Eplattenier, R. A. I. Norval, K. Kundert, T. T. Dolan, H. Croze, and others. Geographical information systems for studying the epidemiology of cattle diseases caused by Theileria parva. Veterinary Record 126:

Mantel, N. The detection of disease clustering and a generalized regression approach. Cancer Research 27 (2):

Morris, R.S., Sanson, R.L. and Stern, M.W. 1992: EPIMAN - A Decision Support System for Managing a Foot-and-Mouth Disease Epidemic. Proceedings Fifth Annual Meeting of the Dutch Society for Veterinary Epidemiology and Economy, Wageningen,

Oliver, M. A., and R. Webster. Kriging: a method of interpolation for geographical information systems. International Journal of Geographical Information Systems 4 (3):

Openshaw, S. Automating the search for cancer clusters: a review of problems, progress, and opportunities. Spatial epidemiology. Editor R.W. Thomas, Pion Publications.

Perry, B. D., R. Kruska, P. Lessard, R. A. I. Norval, and K. Kundert. Estimating the distribution and abundance of Rhipicephalus appendiculatus in Africa. Preventive Veterinary Medicine 11:

Pfeiffer, D. U. The role of a wildlife reservoir in the epidemiology of bovine tuberculosis. Unpublished PhD Thesis, Massey University, Palmerston North, New Zealand.

Rothman, K. J. A sobering start for the cluster busters' conference. American Journal of Epidemiology 132 Sup 1: S6-S13.

Sanson, R.L. The development of a decision support system for an animal disease emergency. Unpublished PhD Thesis. Massey University, Palmerston North, New Zealand.

Waller, L. A., and A. B. Lawson. The power of focused tests to detect disease clustering. Statistics in Medicine 14:

Walter, S. D. Assessing spatial patterns in disease rates. Statistics in Medicine 12:

Wartenberg, D., and M. Greenberg. Detecting disease clusters: the importance of statistical power. American Journal of Epidemiology 132 Sup 1: S156-S166.

Wartenberg, D., and M. Greenberg. Solving the cluster puzzle: Clues to follow and pitfalls to avoid. Statistics in Medicine 12:

Williams, B., D. Rogers, G. Staton, B. Ripley, and T. Booth. Statistical modelling of georeferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetation data. Modelling vector-borne and other parasitic Diseases. Editors B. D. Perry, and J. W. Hansen, Nairobi, Kenya: The International Laboratory for Research on Animal Diseases.