Issues related to handling of spatial data


In: J. McKenzie (ed) Proceedings of the epidemiology and state veterinary programmes. New Zealand Veterinary Association / Australian Veterinary Association Second Pan Pacific Veterinary Conference, Christchurch, June 1996.

D.U. PFEIFFER
Department of Veterinary Clinical Sciences, Massey University, Palmerston North, New Zealand

Introduction

Epidemiological analyses are mainly conducted using data which does not include or take account of spatial relationships between the observations studied. More recently, the need for spatial data analysis has been pointed out as it may provide additional insight when attempting to reveal epidemiological cause-effect relationships (Rothman 1990). While the theory of spatial analysis has been an area of research interest for many years, only the advent of personal computers made the techniques more easily accessible to epidemiologists. Still, even recent textbooks on medical or veterinary epidemiology do not provide more than basic introductions to the subject area of spatial data analysis. This could seem surprising, as place has always been seen as part of the classic epidemiological triad of time, person and place.

Spatial Data

The distinction between spatial and non-spatial data can easily become the subject of extensive discussions. In general, observations for which absolute location and/or relative positioning (spatial arrangement) is taken into account can be referred to as spatial data (Anselin 1992). Spatial data can be subdivided into two major categories representing discrete and continuous phenomena. Under the former, which has also been called the entity view, spatial phenomena are described using zero-dimensional objects such as points, one-dimensional objects such as lines or two-dimensional objects such as areas. If space is described using continuous phenomena, such as temperature or topography, this has been described as the field view.
In practice, the latter is usually measured by sampling discrete entities such as locations in space. The entity view allows spatial objects to have attributes. Spatial analysis is typically aimed at the spatial arrangement of the observational units, but can also take into account attribute information. An analysis conducted only on the basis of the attributes of the observational units, ignoring the spatial relationships, is not considered a spatial data analysis.

Spatial Data Analysis

The methods used in spatial data analysis can be broadly categorized into those concerned with visualizing data, those for exploratory data analysis and methods for the development of statistical models (Bailey and Gatrell 1995). During most analyses, a combination of techniques will be used, with the data first being displayed visually, followed by exploration of possible patterns and possibly modeling.

Data visualization

One of the first steps in any data analysis should be an inspection of the data. Visual displays of information using plots or maps will provide the epidemiologist with the basis for generating hypotheses and, if required, an assessment of the fit or predictive ability of models. Over the last couple of years interactive computer packages have been developed which allow dynamic displays of the data. Geographic information systems can be used to produce maps and they allow the exploration of spatial patterns in an interactive fashion.

Exploratory data analysis

Data exploration is aimed at developing hypotheses and makes extensive use of graphical views of the data such as maps or scatter plots. Exploratory data analysis makes few assumptions about the data and should be robust to extreme data values. Simple analytical models can also be used in this analysis phase.

Models of spatial data

For this type of spatial data analysis specific hypotheses are formally tested or predictions are made using statistical models of the data. Modeling of spatial phenomena has to incorporate the possibility of spatial dependence in order to provide a true representation of the existing effects. Such spatial effects can be either large scale trends or local effects. The first is also called a first order effect and it describes overall variation in the mean value of a parameter such as rainfall. The second, called a second order effect, is produced by spatial dependence and represents the tendency of neighboring values to follow each other in terms of their deviation from the mean. This can for example be the case with the incidence of an infectious animal disease affecting animals on farm properties. First order effects can be readily modeled by standard regression models. The presence of second order effects violates the independence assumption of standard statistical analysis techniques, and appropriate analysis techniques will have to take account of the covariance structure in the data giving rise to these local effects. Often spatial data are modeled as stationary spatial processes, which assumes that while there may be dependence between neighboring observations, it is independent of absolute location.
A stationary spatial process is isotropic if the covariance between observations at different locations depends only on the distance between them and not on direction. Nonstationary data is almost impossible to model as most locations will require different parameter sets. Therefore, most spatial modeling procedures begin by identifying a trend in the mean value and then model the residuals from this trend as a stationary process. With any of these models it has to be kept in mind that they are abstractions of reality, and first or second order effects are artifacts of the modeler. Bailey and Gatrell (1995) conclude that models can be at best 'not wrong', rather than 'right'. They add that the analyst should always involve judgment and intuition in statistical modeling.

Problems in Spatial Data Analysis

A major factor influencing spatial data analysis is the geographical scale at which the data is being analyzed. It may be possible to identify specific non-random patterns at a local level which, when looked at from a national level, turn into random variations. Another problem can be that many spatial data sets are based on irregularly shaped area units, or there may be directional effects. Proximity or neighborhood may also be more difficult to define clearly than, for example, in time-series analysis. Any type of spatial analysis will be subject to some degree of edge effect, where area units on the map boundary have neighbors only in one direction. Many data analyses have to be conducted with observations based on information summarized at a particular spatial aggregation level such as the veterinary district. Inferences from such analyses may only be correct if used at the same level of aggregation. This situation has also been called the modifiable areal unit problem.

Methods of Spatial Data Analysis

Methods used in spatial data analysis can be divided according to the three main categories of data to be analyzed. These are point patterns, spatially continuous data and area data.

Point patterns

Spatial point patterns are based on the coordinates of events such as the locations of outbreaks of a disease. It is also possible that they include attribute information such as the time of outbreak occurrence. Data on point patterns can be based on a complete map of all point events or a sampled point pattern. The basic interest of a spatial point pattern analysis will be to detect whether the pattern is distributed at random or represents a clustered or regular pattern. It is important to recognize that the stochastic process studied relates to the locations where events are occurring. A spatial point pattern can be quantified in terms of the intensity of the process using its first order properties, measured as the mean number of events per unit area. Second order properties or spatial dependency are analyzed on the basis of the relationship between pairs of points or areas. The latter is typically interpreted as analysis for clustering.

Visualization of spatial point patterns

The method used most frequently to present spatial point patterns is a dot map. It is generally difficult to assess randomness of a pattern from visual inspection of such a map. It becomes important to take account of the population at risk when, for example, inspecting a dot map of disease outbreaks. One method for representing this difference in population at risk is to use a cartogram, where the size of the areas is geometrically transformed in proportion to the corresponding population value. In a case-control study of tuberculosis breakdown in cattle herds from the Waikato region of New Zealand, all cattle herds which had broken down with tuberculosis infection were compared with a random sample of cattle herds free from infection.
Figure 1 presents a series of dot maps showing the locations of cases and random controls. Inspection of the map with locations of the case herds could give the observer the impression that they are clustered. Without inspection of the distribution of random control herds it becomes difficult to differentiate whether clustering only occurred in case herds or if the distribution of all cattle herds in this area is inherently non-random. In this situation it clearly is the case that cattle herds are not randomly distributed throughout the study region. But there is also some clustering of tuberculosis breakdowns.

Figure 1: Dot maps of the locations of herds from a case-control study of tuberculosis breakdown in New Zealand cattle herds (panels: random control herds; case herds; case and control herds)

Exploratory analysis of spatial point patterns

Techniques for exploratory spatial analysis of point patterns are aimed at deriving summary statistics or plots of the observed distribution to investigate specific hypotheses. The methods used examine first or second order effects. First order effects for point patterns can be examined with two techniques: quadrat counts and kernel estimation. The quadrat methods involve dividing the area into sub-regions of equal size (quadrats) and produce a summary statistic on the basis of the number of events per quadrat. The counts are then divided by the area of the quadrat. These techniques give an indication of the variation of the intensity of the underlying process in space. The disadvantage of the techniques is that they aggregate the information into area type data, which can result in loss of information. Kernel estimation is a technique which uses the original point locations to produce a smooth bivariate histogram of intensity. It has been used, for example, for home range estimation in wildlife ecology (Izenman 1991).

Second order properties of point patterns can be investigated using the distances between the points, particularly nearest-neighbor distances. The latter can be estimated using two techniques: either the distance between a randomly selected event and the nearest neighboring event, or the distance between a randomly selected location in space and the nearest event. Spatial dependence can be investigated by visual examination of the probability distributions of the observed nearest-neighbor distances. Clustered events would show a steep part of the distribution function at short distances, whereas regularity would be indicated by steepness of the curve at longer distances. The K function takes into account not just the nearest events. It depends on the assumption of an underlying isotropic process and is problematic to use in the presence of significant first order effects.
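The quadrat and nearest-neighbor summaries described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the study area is taken to be a rectangle, no edge correction is applied, and all function names are hypothetical rather than taken from any package mentioned in this paper.

```python
import math


def quadrat_counts(points, width, height, nx, ny):
    """Count events in each cell of an nx-by-ny grid over a width x height rectangle."""
    counts = [[0] * nx for _ in range(ny)]
    for x, y in points:
        # Events on the upper boundary are assigned to the last cell.
        col = min(int(x / width * nx), nx - 1)
        row = min(int(y / height * ny), ny - 1)
        counts[row][col] += 1
    return counts


def quadrat_intensity(counts, width, height):
    """Divide each count by the quadrat area to estimate events per unit area."""
    ny, nx = len(counts), len(counts[0])
    cell_area = (width / nx) * (height / ny)
    return [[c / cell_area for c in row] for row in counts]


def nearest_neighbor_distances(points):
    """Distance from each event to its nearest other event (no edge correction)."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        distances.append(min(math.hypot(xi - xj, yi - yj)
                             for j, (xj, yj) in enumerate(points) if j != i))
    return distances


def empirical_cdf(values, w):
    """Proportion of nearest-neighbor distances that are <= w."""
    return sum(v <= w for v in values) / len(values)
```

Plotting `empirical_cdf` over a range of distances gives the distribution function whose steepness at short or long distances is interpreted in the text as clustering or regularity.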
Modeling of spatial point patterns

Spatial point modeling techniques are aimed at explaining an observed point pattern, and typically involve comparison with the model of complete spatial randomness (CSR). A point pattern generated by a random spatial process should follow a homogeneous Poisson process. This implies that every event has an equal probability of occurring at any position in the study area and that occurrence is independent of the location of any other event, hence the absence of first order and second order effects. It is against this basic model that the analysis will assess whether the point process is regular, clustered or random. There is a range of methods available to test for CSR. Some are based on quadrat counts, such as the index of dispersion tests; others use nearest-neighbor distances, such as the Clark-Evans test, or the K function. Comparison of an observed pattern with CSR has its limitations in epidemiology, as it does not allow definition of the type of point process other than whether it is completely random in space or not. It also cannot take account of issues such as a clustered underlying population at risk. Alternative models which could be used include the heterogeneous Poisson process, the Cox process, the Poisson cluster process or the Markov point process (Bailey and Gatrell 1995). Getis and Ord (1992) describe the use of a distance statistic G which can be used to assess spatial autocorrelation for point patterns as well as for area data. It can be used to detect local pockets of dependence which might not show up when using a global statistic.

The analysis of point patterns is important in veterinary epidemiology as it allows inferences on the occurrence of spatial clustering. The presence of clustering would suggest infectiousness or the presence of specific environmental risk factors. Second order effects in a spatial process can be the result of disease clustering.
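Two of the CSR tests named above can be sketched as follows. The code assumes the textbook forms of the statistics: the index of dispersion as the variance-to-mean ratio of quadrat counts, and the Clark-Evans ratio with its usual normal approximation and no edge correction. The function names are hypothetical.

```python
import math


def index_of_dispersion(counts):
    """Variance-to-mean ratio of quadrat counts (flat list).

    Values near 1 are expected under CSR, values above 1 suggest clustering
    and values below 1 suggest regularity. Under CSR, (m - 1) times this
    ratio is approximately chi-square distributed with m - 1 degrees of
    freedom, where m is the number of quadrats.
    """
    m = len(counts)
    mean = sum(counts) / m
    variance = sum((c - mean) ** 2 for c in counts) / (m - 1)
    return variance / mean


def clark_evans(points, area):
    """Clark-Evans ratio R and z-score, without edge correction (an assumption).

    R < 1 suggests clustering, R > 1 suggests regularity.
    """
    n = len(points)
    nn = [min(math.hypot(xi - xj, yi - yj)
              for j, (xj, yj) in enumerate(points) if j != i)
          for i, (xi, yi) in enumerate(points)]
    observed = sum(nn) / n
    intensity = n / area
    expected = 1.0 / (2.0 * math.sqrt(intensity))      # mean NN distance under CSR
    std_error = 0.26136 / math.sqrt(n * intensity)
    return observed / expected, (observed - expected) / std_error
```

Both statistics compare the data with a homogeneous Poisson process only, so they share the limitation noted in the text: they cannot account for a clustered population at risk.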
Disease clustering can be assessed using a number of methods, which can be categorized into general and focused tests (Waller and Lawson 1995). The latter tests relate to the clustering of events around fixed point locations such as, for example, a nuclear power plant. Wartenberg and Greenberg (1990) describe techniques for detection of hot spot clusters and clinal clusters. A tool which can be effectively used for the analysis of clustering effects is the K function (Kingham, Gatrell, and Rowlingson 1995). In this context, two classes of point processes, such as cases of disease and random controls without the disease, are compared. The principle is that both point processes are pooled and then the point process describing the cases is compared with the pooled process.

Cuzick and Edwards (1990) developed a method which is also based on nearest-neighbor distances. The test statistic simply counts the number of case-case pairs for a given number of nearest neighbors. Applying this technique to the case-control data mentioned above, it appears that there is significant clustering of cases compared with the control population (see Figure 2). The software Stat! (BioMedware, Ann Arbor, Michigan, U.S.A.) was used to run the analysis and produce the graph.

Figure 2: Cuzick and Edwards method applied to tuberculosis breakdown case control study data (+ = cases, = controls, arrows identify nearest neighbors)

Kingham, Gatrell, and Rowlingson (1995) describe a method combining the Diggle and Chetwynd method based on bivariate K functions with results of a logistic regression analysis, allowing them to make use of information about additional covariates to test for clustering. Methods aimed at the detection of hot spots or small areas which might represent clusters of disease include the Geographical Analysis Machine developed by Openshaw (1990). This technique is based on comparing the observed intensity of cases in circles of varying radius. The result of the analysis is a map with circles indicating the areas where case incidence was higher than expected under the assumption of spatial randomness.
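The Cuzick and Edwards statistic described above lends itself to a short sketch. The version below assumes that significance is assessed by randomly relabeling cases and controls (a Monte Carlo approximation; the analysis reported here used the Stat! software, whose internals are not reproduced). Function names and defaults are hypothetical.

```python
import math
import random


def knn_indices(points, k):
    """For each point, the indices of its k nearest other points (cases and controls pooled)."""
    result = []
    for i, (xi, yi) in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.hypot(xi - points[j][0], yi - points[j][1]))
        result.append(others[:k])
    return result


def cuzick_edwards_t(is_case, neighbors):
    """Number of (case, neighbor) pairs in which the neighbor is also a case."""
    return sum(is_case[j]
               for i, nbrs in enumerate(neighbors) if is_case[i]
               for j in nbrs)


def cuzick_edwards_test(points, is_case, k=1, n_perm=999, seed=0):
    """Observed statistic and one-sided Monte Carlo p-value under random labeling."""
    rng = random.Random(seed)
    neighbors = knn_indices(points, k)          # geometry is fixed; only labels change
    observed = cuzick_edwards_t(is_case, neighbors)
    labels = list(is_case)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cuzick_edwards_t(labels, neighbors) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)
```

Because the controls stand in for the population at risk, this comparison remains meaningful when herds themselves are not randomly distributed, which is the situation illustrated in Figure 1.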
Alexander and Cuzick (1992) reviewed methods for assessment of disease clusters. Wartenberg and Greenberg (1993) discuss problems associated with detection of disease clusters.

Other analyses using point pattern information

Point events in space may have time of occurrence as one particular attribute. Specific methods are available to investigate clustering in space and time. Interaction is present if pairs of cases are near in space as well as in time. Contagious diseases requiring direct contact will produce space-time clustering between cases. The techniques described below are not considered useful for non-contagious diseases. There are three main methods available which allow assessment of a space-time relationship between cases: Knox's method, Mantel's method and the K-nearest neighbor method. All three techniques require production of distance matrices of the spatial as well as the temporal relationship between cases.

Data from a longitudinal study of Mycobacterium bovis infection in a wild possum population in New Zealand will be used to demonstrate the usage of these techniques (Pfeiffer 1994). As a first step, a histogram of the geographical nearest-neighbor distances has to be produced, as shown in Figure 3. The distribution of nearest-neighbor distances expected under spatial randomness is shown in the histogram. It assumes uniform population density across the study area. The excess of shorter distances compared with the Poisson probability density function suggests that there may be spatial clustering in this data set. The map in the same figure indicates the locations of the 26 cases used in the analysis, with arrows pointing to their nearest neighbors. Figure 4 presents the temporal distance distribution and a map connecting the cases according to their sequence of occurrence. The map suggests that the disease has shifted its spatial focus over time and that there is some degree of temporal clustering.

Figure 3: Map and histogram of geographical distances between cases of tuberculosis infection in wild possums

Figure 4: Temporal distance map and histogram for cases of tuberculosis infection in wild possums

As a next step, a formal statistical test has to be conducted to assess the statistical significance of a potential space-time interaction process. When using Knox's method, a critical distance in time as well as in space defining closeness has to be set, and pairs of cases are tabulated into a 2 x 2 contingency table with spatial and temporal closeness/farness defining the rows and columns (Knox 1964). Knox saw the critical distance as defining the latency period. In most situations determining the critical distance requires a subjective decision. Approximate randomization permutation techniques are used to construct a null distribution for Knox's test statistic. Figure 5 shows the results of Knox's test applied to the tuberculous possum data using a critical distance of 100 m in space and 3 months in time. The value of 30 for Knox's test statistic X, which is significant at a p-value of 0.02, suggests that, given the selected critical distances, time-space interaction is present in this data set.
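A minimal sketch of Knox's test follows. It assumes that the statistic X is the number of case pairs that are close in both space and time (the near-near cell of the 2 x 2 table) and that the null distribution is built by randomly permuting the times of occurrence among the case locations. The function names, defaults and units are illustrative only.

```python
import math
import random
from itertools import combinations


def knox_statistic(coords, times, crit_space, crit_time):
    """Number of case pairs that are close in both space and time."""
    x = 0
    for i, j in combinations(range(len(coords)), 2):
        d_space = math.hypot(coords[i][0] - coords[j][0],
                             coords[i][1] - coords[j][1])
        d_time = abs(times[i] - times[j])
        if d_space <= crit_space and d_time <= crit_time:
            x += 1
    return x


def knox_test(coords, times, crit_space, crit_time, n_perm=999, seed=0):
    """Observed X and one-sided Monte Carlo p-value from permuting times."""
    rng = random.Random(seed)
    observed = knox_statistic(coords, times, crit_space, crit_time)
    shuffled = list(times)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)       # breaks any link between place and time
        if knox_statistic(coords, shuffled, crit_space, crit_time) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)
```

With coordinates in meters and times in months, a call such as `knox_test(coords, times, 100, 3)` would correspond to the critical distances quoted in the text. The result depends directly on those two subjective choices.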

Figure 5: Results of Knox's method applied to cases of tuberculosis infection in wild possums

Another approach to investigating time-space interaction is the Mantel method (Mantel 1967). Mantel's approach does not require selection of critical distances. It uses both the time and the space distance matrices between all cases. It should be kept in mind, however, that the Mantel test can be insensitive to non-linear associations between time and space distances. Distance measures can be transformed in a number of ways, such as the reciprocal transformation, which reduces the effect of large time and space distances. The null hypothesis is that the time distances are independent of the space distances. Randomization permutation techniques can be used to generate a null distribution for the Mantel test statistic. Figure 6 presents the results of this analysis when applied to the data on tuberculous possums. The scatter plot of space distances against temporal distances seems to suggest that, while the points are scattered throughout the plot, there are some denser accumulations of cases present. The frequency distribution of the test statistic under the null hypothesis on the basis of 500 random permutations is presented in the left window of Figure 6. It can be concluded that there is significant space-time interaction.
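The Mantel procedure can be sketched as below. The sketch assumes the unstandardized product statistic (the sum over case pairs of space distance times time distance) and a one-sided permutation test in which case labels are permuted in the time matrix. The optional `transform` argument is a hypothetical hook for transformations such as the reciprocal one mentioned in the text.

```python
import math
import random


def mantel_product(space, time, order):
    """Sum over pairs i < j of space[i][j] * time[order[i]][order[j]]."""
    n = len(order)
    return sum(space[i][j] * time[order[i]][order[j]]
               for i in range(n) for j in range(i + 1, n))


def mantel_test(coords, times, n_perm=999, seed=0, transform=None):
    """Observed Mantel product and one-sided Monte Carlo p-value."""
    f = transform if transform is not None else (lambda d: d)
    n = len(coords)
    space = [[f(math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1]))
              for j in range(n)] for i in range(n)]
    time = [[f(abs(times[i] - times[j])) for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    order = list(range(n))
    observed = mantel_product(space, time, order)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(order)          # relabel cases in the time matrix only
        if mantel_product(space, time, order) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)
```

Passing, for example, `transform=lambda d: 1.0 / (1.0 + d)` would apply a reciprocal transformation with an arbitrary constant of 1. In practice the constants for space and time would be chosen separately, which this simplified sketch does not support.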

Figure 6: Results from applying the Mantel method to test for time-space interaction between cases of tuberculosis infection in wild possums

A third approach available is the K-nearest neighbor test of space-time interaction in point data. The test statistic indicates the number of case pairs which are K nearest neighbors in both time and space. The statistic is based on an approximate randomization of the Mantel product statistic. Figure 7 presents the results from applying the K-nearest neighbor method to the possum tuberculosis data. The map shows the locations of the cases and the arrows indicate k=2 nearest neighbors. The test statistic produced on the basis of 1000 random permutations suggests that only the cumulative statistic J_k is statistically significant, whereas ΔJ_k is not. The latter parameter measures the statistical significance of increasing K by 1. The test statistic supports the presence of space-time interaction, and suggests that the first 5 nearest neighbors are involved in space-time interaction.

Figure 7: Results from applying the K-nearest neighbor method to test for time-space interaction between cases of tuberculosis infection in wild possums

Spatially continuous data

The point pattern analyses assessed characteristics of the spatial distributions of points, but made only limited use of attribute information. With spatially continuous and also area data the analysis focus shifts towards use of the attribute information, in order to describe their pattern in space. Spatially continuous data is also often referred to as geostatistical data. The data is usually collected by sampling at fixed points in space. The main objective of the analysis will be to describe the spatial variation in an attribute value, using the data collected at the sampled points. The spatial variation can be modeled as first and second order spatial processes.

Visualization of spatially continuous data

The data values obtained from the sampled locations can be mapped using proportionally scaled symbols or columns for each sampling point. Column charts will in fact allow presentation of multiple attribute values at the same point. Overlapping symbols can cause a problem with interpretation of such maps. But none of these approaches will be able to represent the underlying continuity of the process studied.

Exploratory analysis of spatially continuous data

The techniques used in this area can be categorized into methods for describing first order effects and those for second order effects.

Exploratory analysis of first order effects in spatially continuous data

The main techniques in this area are spatial moving averages, tessellation methods and kernel estimation techniques. A spatial moving average interpolates values between a given number of neighboring sampling points. The more points are used, the smoother the created surface will be, making it possible to describe global trends. A weighting mechanism can be introduced to account for varying distances between sampling points.

Alternatively, a tessellation of the observed sample points can be used. This is most commonly done using Delaunay triangulation, which produces a triangulated irregular network (TIN). The dual construction assigns to each sampling point a territory within which every location is closer to this sampling point than to any other. The resulting polygon map is called a Dirichlet tessellation and the tiles are known as Voronoi or Thiessen polygons. A TIN can be used to construct a contour map or a digital terrain model (DTM). Figure 8 shows a triangulated irregular network based on height information from a 50 m grid plus break lines. The TIN was then used to create a contour map and a digital terrain model of the possum tuberculosis longitudinal study area. Pfeiffer (1994) used this technique to generate polygons from point locations describing the areas where particular Mycobacterium bovis strains occurred. The technique can be appropriate for presence/absence type information, where a smooth surface is not the objective of the interpolation.

Figure 8: TIN and derived contour maps and digital terrain model for the possum tuberculosis longitudinal study site (panels: TIN; contour map; DTM)

As with point patterns, it is also possible to use kernel estimation to convert the attribute data from the sampling points into a surface, this time using the value of the attribute rather than the number of events per unit area.
This technique has been used in geographical epidemiology to model the relative risk function, measuring local risk relative to the regional mean (Bithell 1990).

Exploratory analysis of second order effects in spatially continuous data

Spatial dependence between attribute values measured at sampled locations is described using the covariance function or covariogram. The presence of second order effects would result in positive covariance between observations a small distance apart and lower covariance or correlation if they are further apart. The covariogram describes the covariance as a function of the distance h between sample points, and the correlogram the corresponding correlation. The semi-variogram is a graphical representation of the variation between sampling points separated by a given distance and direction. For a stationary spatial process all three describe similar information. Estimates of the semi-variogram are considered to be more robust to departures from stationarity represented as a general trend in the spatial process. A continuous process without spatial dependence will result in a horizontal line. A stationary process will reach an upper bound, referred to as the sill, at a distance h called the range. Theoretically, the intercept with the y-axis should be at a value of 0 variation. In reality, sampling error and small scale variation will result in variability at small distances, and the variogram will not meet the y-axis at the origin. This intercept with the y-axis is called the nugget effect. Variograms which do not reach an upper bound suggest non-stationarity. Figure 9 shows an isotropic sample semi-variogram for the proportion of tuberculous possums captured at trap sites during the longitudinal study. The shape of the variogram suggests that the process is non-stationary, but given the relatively small nugget value there is also likely to be spatial dependence.

Figure 9: Isotropic semi-variogram for the proportion of tuberculous possums at individual trap sites in the longitudinal study (panels: parameters defining a variogram; example of a variogram)

Modeling of spatially continuous data

A number of approaches can be used to model or predict spatially continuous data. For first order processes, trend surface analysis based on ordinary polynomial least squares regression can be used. Results have to be treated with caution, because the standard regression assumptions of independent random errors and homoscedasticity are likely to be violated. Lessard et al. (1990) used an inverse distance-weighted mathematical algorithm to interpolate climatic measurements between sample points. Most trend surface models may be able to describe an overall trend, but are not useful for local prediction. In the presence of weak first order but strong second order effects it is more appropriate to use models fitted to variograms. Such models can be fitted by eye and are most commonly based on the spherical, exponential or Gaussian model.
The fit of a particular model can be assessed through cross-validation. Figure 10 shows an omni-directional exponential variogram model for the possum tuberculosis prevalence data. For the model to be valid it would be necessary to remove the non-stationarity through trend regression.
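To make the variogram terminology concrete, the sketch below computes an omnidirectional sample semi-variogram, using the usual estimator of half the mean squared difference between values for all pairs of points whose separation falls in a distance bin, together with an exponential model. The parameterization of the model (the "practical range" convention with a factor of 3) is one common choice and an assumption here, as are the function names.

```python
import math


def sample_semivariogram(coords, values, bin_width, n_bins):
    """Omnidirectional sample semi-variogram as (bin midpoint, gamma) pairs.

    gamma for a bin is half the mean squared difference between values of all
    pairs of points whose separation falls into that distance bin.
    """
    sums = [0.0] * n_bins
    pairs = [0] * n_bins
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            b = int(h / bin_width)
            if b < n_bins:
                sums[b] += (values[i] - values[j]) ** 2
                pairs[b] += 1
    return [((b + 0.5) * bin_width, sums[b] / (2 * pairs[b]))
            for b in range(n_bins) if pairs[b] > 0]


def exponential_model(h, nugget, sill, practical_range):
    """Exponential variogram model; reaches about 95% of the way from the
    nugget to the sill at h = practical_range."""
    return nugget + (sill - nugget) * (1.0 - math.exp(-3.0 * h / practical_range))
```

Overlaying `exponential_model` on the sample points for trial values of nugget, sill and range corresponds to the fitting by eye mentioned in the text. Any trend should be removed first, as noted for the tuberculosis prevalence data.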

Figure 10: Omnidirectional exponential variogram model for possum tuberculosis prevalence data

The variogram model itself does not allow prediction of values. This can be achieved with Kriging. This is a weighted moving average technique for estimating the value of a spatially distributed variable from adjacent values while considering interdependence expressed in a variogram. It allows the interpolation error to be mapped and from a statistical viewpoint is considered to be the most satisfactory method for interpolation (Oliver and Webster 1990). Pfeiffer (1994) used ordinary Kriging to produce a surface of possum population density based on possum capture data at sample points (see Figure 11). The omnidirectional variogram suggests that this data is more stationary than the tuberculosis prevalence data, but it also shows strong spatial dependence. An exponential model was fitted and used as the basis for Kriging. The distribution of Kriging errors shows that there are some reasonably high errors and according to the map they are located in one particular area of the study.
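A bare-bones sketch of ordinary Kriging at a single target location is given below. It assumes an isotropic variogram supplied as a function of distance, uses all sample points rather than a local neighborhood, and solves the Kriging system with plain Gaussian elimination. It illustrates the principle only and is not the procedure used to produce Figure 11. All names are hypothetical.

```python
import math


def exp_variogram(h, nugget, sill, practical_range):
    """Exponential variogram with gamma(0) = 0 (assumed parameterization)."""
    if h == 0:
        return 0.0
    return nugget + (sill - nugget) * (1.0 - math.exp(-3.0 * h / practical_range))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ordinary_kriging(coords, values, target, variogram):
    """Return (estimate, kriging variance) at target from all sample points."""
    n = len(coords)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Kriging system: variogram values between samples, bordered by ones
    # for the constraint that the weights sum to 1 (Lagrange multiplier mu).
    a = [[variogram(dist(coords[i], coords[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(dist(c, target)) for c in coords] + [1.0]
    solution = solve(a, b)
    weights, mu = solution[:n], solution[n]
    estimate = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + mu
    return estimate, variance
```

Evaluating `ordinary_kriging` over a grid of target locations yields both the interpolated surface and the map of Kriging errors referred to in the text.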

Figure 11: Variogram model, Kriging errors and estimates for possum density in the longitudinal study on possum tuberculosis epidemiology (panels: omnidirectional variogram model; histogram of Kriging errors; map of Kriging errors; contour map based on Kriging estimates)

A number of multivariate methods can be used for modeling of spatially continuous data. Principal components can be used to combine the information from multiple variables into a small number of components, each of them representing a particular combination of variables and explaining a particular proportion of the variation in the data. Eastman and Fulk (1993) used the technique to analyze the information contained in a time series of NDVI maps for Africa, thereby conducting a space-time analysis. Cliff et al. (1995) discuss the application of multidimensional scaling (MDS) to spatial epidemiological data. They use the technique to map geographical information about measles mortality in Australia and New Zealand as disease space, where points with similar disease risks are closer to each other on the MDS map even though they are far removed geographically. Bailey and Gatrell (1995) discuss a range of other multivariate analysis techniques for spatially continuous data.

Area data

Attribute data which has values for fixed polygonal zones within a study area is referred to as area data or lattice data. The areal units can constitute a regular lattice or grid, or consist of irregular units. It is usually not required to estimate values as they should be present for all areas. The main emphasis with area data is on detection and explanation of spatial patterns or trends, possibly extended to take account of covariates.

Visualization of area data

Area data can be visualized using a wide range of techniques. The choropleth map is probably the most commonly used tool. Appropriate use of class intervals and colors to represent values in a choropleth map is essential. Cartograms or density equalized maps can be used to express the importance of particular areas. The analyst has to be aware of the problems which can be caused by the modifiable areal unit problem. It is also possible to display several attributes at the same time by adding scaled columns or symbols to a choropleth map. Area information can be presented together with spatially continuous data, for example by draping a choropleth map over a DTM.

Figure 12: Examples of choropleth maps (panels: choropleth map of cattle density in Switzerland, with legend for cattle per hectare, lakes and canton boundaries; Thiessen polygons representing areas used by possums infected with four different strains of Mycobacterium bovis, draped over a DTM)

Exploration of area data

Informal investigations of hypotheses can be aimed at first order or second order spatial processes. Most of these techniques require a methodology for measuring proximity. Possible approaches include various distance measures between polygon centroids as well as presence or length of a shared boundary. For further analyses the proximity information can be described through generation of contiguity or spatial weights matrices. This is quite difficult to achieve in currently available GIS software and there are only a few specialized spatial statistics software packages (e.g. SpaceStat, Regional Research Institute, West Virginia University, Morgantown, West Virginia, U.S.A.) which can perform such operations. For investigation of first order effects, a spatial weights matrix can be used to estimate spatial moving averages. If the data is available on the basis of a regular grid, the median polish may be appropriate.
Kernel estimation can also be applied for investigations of first order effects in area data. In the case of second order spatial processes the objective is to explore the spatial dependence of deviations in attribute values from their mean. In the context of area data, this effect is referred to as spatial autocorrelation and it quantifies the correlation between values of the same attribute at different locations. The most commonly used measures of spatial autocorrelation are Moran's I and Geary's C. The first is closely related to the covariogram and the second to the variogram used for spatially continuous data. Power analyses of these statistics have been conducted by Walter (1993), who concluded that the power of Moran's I was highest. A correlogram can be used to graphically display the correlation between values at different spatial lags. If the autocorrelation does not decline after a number of lags, it indicates the presence of non-stationarity. The correlogram has similar applications in spatial analysis as it has in time-series analysis for describing patterns. Hungerford (1991) analyzed the spatial distribution of cattle anaplasmosis between counties within the state of Illinois using second-order analysis and detected significant spatial clustering within the state.

The above mentioned methods do not provide local indicators of spatial association which would be useful for identifying so-called hot spots. The Moran scatterplot and spatial lag pies described in Anselin (1994) can be used to describe local patterns of variation visually. Quantitative estimates can be obtained using the G statistic by Getis and Ord (1992) or the local indicators of spatial association (LISA) by Anselin (1995). The latter can be used as an indicator of local pockets of non-stationarity (hot spots), similar to the G statistic, and also to assess the influence of individual data points on the global statistic and to identify outliers. Anselin, Dodson, and Hudak (1993) describe how these different techniques can be combined to form an exploratory spatial analysis system. Figure 13 shows a number of examples used by these authors to display local variation (from Anselin's world wide web site). The spatial lag pie map superimposes a pie on each area, with the top half of the pie representing the local value and the bottom half the neighboring values for this particular variable. It gives the observer an appreciation of the ratio between the local value and the surrounding spatial units. The Moran scatterplot shows the original value of each observation on the x-axis and the value of its spatial lag on the y-axis. The plot can be used to identify outliers or even to conduct local regression to further describe the spatial association. These outliers can then be mapped as shown in Figure 13.
The map of the areas with a significant LISA statistic indicates the areas where there appears to be spatial autocorrelation.
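The global and local statistics discussed above can be written down compactly. The following Python sketch uses the textbook definitions of Moran's I, Geary's C and the local Moran statistic; the grid layout, the edge-sharing neighbour rule, the function names and the example values are assumptions made for illustration, and no significance testing is attempted.

```python
import numpy as np

def grid_weights(n_rows, n_cols):
    """Binary weights for a regular grid; cells sharing an edge are neighbours."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, n_cols)
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                w[i, rr * n_cols + cc] = 1.0
    return w

def morans_i(x, w):
    """Global Moran's I: cross-products of deviations from the mean."""
    z = x - x.mean()
    return len(x) / w.sum() * (z @ w @ z) / (z @ z)

def gearys_c(x, w):
    """Global Geary's C: squared differences between neighbouring values."""
    z = x - x.mean()
    sq_diff = (x[:, None] - x[None, :]) ** 2
    return (len(x) - 1) / (2.0 * w.sum()) * (w * sq_diff).sum() / (z @ z)

def local_morans_i(x, w_row_std):
    """Local Moran statistic per area; expects row-standardized weights."""
    z = x - x.mean()
    m2 = (z @ z) / len(x)
    return z * (w_row_std @ z) / m2

# Hypothetical 4 x 4 checkerboard: every neighbour has the opposite value.
x = np.array([(r + c) % 2 for r in range(4) for c in range(4)], dtype=float)
w = grid_weights(4, 4)
w_std = w / w.sum(axis=1, keepdims=True)

print("Moran's I:", morans_i(x, w))
print("Geary's C:", gearys_c(x, w))
print("Local Moran:", local_morans_i(x, w_std))
```

With row-standardized weights the local values sum to n times the global Moran's I, which is the property that allows the local statistic to be read as each area's contribution to the global one. The spatial lag `w_std @ z` computed inside `local_morans_i` is also the y-axis of the Moran scatterplot.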

Figure 13: Example of an exploratory spatial data analysis approach (panels: spatial lag pie map; Moran scatterplot; Moran scatterplot outliers; map of areas with significant LISA or G statistic)

In landscape ecology, approaches have been developed to describe the interactions among patches within a landscape mosaic, referred to as landscape pattern. Most biological processes, and that of course includes diseases, are influenced by a multitude of factors which together may form a particular pattern. Spatial patterns are particularly difficult to quantify. Ecologists use the term landscape structure, which describes the spatial relationships between habitat patches within a landscape (Dunning, Danielson, and Leck 1992). The software FRAGSTATS (McGarigal and Marks, Oregon State University, Corvallis, Oregon, U.S.A.) allows calculation of a wide range of indices and parameters describing landscape structure which could be used for further analyses.

Modeling of area data

Modeling techniques are aimed at establishing explanatory relationships for the attribute values of a dependent variable, taking account of the relative spatial arrangement of the areas and of other values associated with each area unit. Again, it is possible to focus the analysis on first order or second order effects. Multiple ordinary least squares regression can be used for preliminary exploratory analyses only, as it suffers from the problem that in the presence of spatial dependence the errors are not independent and the variance is unlikely to be constant. The presence of spatial dependence can be assessed readily using a spatial correlogram. A range of spatial regression models have been described by Haining (1990) and they can be implemented using the SpaceStat software mentioned above. Hungerford (1991) analyzed the relationship between cattle density and anaplasmosis prevalence on a county basis in Illinois using measures of spatial correlation. Perry et al. (1991) used a GIS to investigate the occurrence of Rhipicephalus appendiculatus in Africa to identify the factors controlling the distribution of the vector tick which transmits the parasite Theileria parva, causing East Coast fever, Corridor disease and January disease in cattle.

A number of authors have included spatial data in multivariate analyses as independent variables. Clifton-Hadley (1993) used spatial descriptive measures, spatial autocorrelation and distance to particular features of interest to analyse patterns of occurrence of badger-related tuberculosis breakdowns of cattle herds in south-west England. Pfeiffer (1994) used a GIS to derive, for point locations (cases of disease), specific geographical variables such as height above sea level, aspect, slope and distance to features of interest, which were then used as explanatory variables in multivariate statistical analysis. In the field of epidemiology, parameters of interest are very often counts or proportions which can be modeled using generalised linear modeling techniques rather than ordinary least-squares regression. It should be noted though that spatial forms of these models are not well developed yet. Bailey and Gatrell (1995) suggest introducing covariates into the regression model, such as the spatial coordinates or a variable representing regions categorized broadly by location, to remove the effect of spatial dependence. Glass et al. (1995) developed a risk density map for Lyme disease based on a multiple logistic regression model, but they did not attempt to remove spatial dependence from the data. A number of different predictive modeling approaches for spatial data were compared by Williams et al. (1994).
They used linear and non-linear discriminant analysis, tree-based induction and neural networks to map tsetse distributions in Zimbabwe and concluded that while the simpler methods (linear discriminant analysis and tree-based induction) were less precise, they were easier to interpret. Figure 14 presents some preliminary results of a logistic regression analysis for prediction of Theileria parva presence in an African country (this analysis was conducted by B. D. Perry, R. L. Kruska, D. U. Pfeiffer and others at ILRI, Nairobi, Kenya). The regression model includes eight different environmental and land use variables and is based on information collected at random sample locations throughout the country. The model was used to generate a risk map representing the probability of T. parva presence at a particular location given the risk factors included in the model. This map is presented as a DTM and as a raster map. Two further raster maps display the lower and upper 95% confidence limits of the probability of T. parva presence as predicted by the regression model. The receiver operating characteristic (ROC) curve characterizing the predictive accuracy of the model could be used to adjust the decision making cut-off for the predicted probability, balancing sensitivity and specificity as required. In this analysis the possible presence of spatial dependence was not taken into account.
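One way to check whether a fitted regression has left spatial dependence in its errors is to compute Moran's I on the residuals, with and without spatial coordinates as covariates. The sketch below is purely illustrative: the data are simulated on a hypothetical 8 x 8 grid with an east-west trend, ordinary least squares is used instead of logistic regression to keep the example short, and all names are assumptions. It is not a reanalysis of any study cited here, and Moran's I on regression residuals does not follow the same null distribution as on raw data.

```python
import numpy as np

def grid_weights(n_rows, n_cols):
    """Binary weights for a regular grid; cells sharing an edge are neighbours."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, n_cols)
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                w[i, rr * n_cols + cc] = 1.0
    return w

def morans_i(x, w):
    """Global Moran's I of the values in x under weights w."""
    z = x - x.mean()
    return len(x) / w.sum() * (z @ w @ z) / (z @ z)

def ols_residuals(design, y):
    """Residuals of an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Simulated data: one covariate plus an unmodelled east-west trend.
rng = np.random.default_rng(0)
n_rows = n_cols = 8
rows, cols = np.divmod(np.arange(n_rows * n_cols), n_cols)
covariate = rng.normal(size=rows.size)
y = 2.0 + 0.5 * covariate + 1.0 * cols + rng.normal(scale=0.5, size=rows.size)

w = grid_weights(n_rows, n_cols)
i_plain = morans_i(ols_residuals(covariate, y), w)
i_coords = morans_i(ols_residuals(np.column_stack([covariate, rows, cols]), y), w)
print("Residual Moran's I without coordinates:", round(i_plain, 2))
print("Residual Moran's I with coordinates:   ", round(i_coords, 2))
```

In this simulated case the coordinates absorb the trend, in line with the suggestion by Bailey and Gatrell (1995); dependence that is not a smooth trend would not be removed this way.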

Figure 14: Results of a multiple logistic regression analysis for prediction of Theileria parva presence (panels: ROC curve for the logistic regression model, plotting sensitivity against percent of false positives; DTM of predicted probability of Theileria parva presence; raster map of predicted probability of T. parva presence; raster map of lower 95% confidence limit of probability of T. parva presence; raster map of upper 95% confidence limit of probability of T. parva presence; legend entry: sampling location)
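Choosing a cut-off from an ROC curve can be sketched as follows. The example assumes that presence is predicted whenever the model probability is at or above the cut-off and that the preferred cut-off maximizes a weighted sum of sensitivity and specificity; that criterion, the function names and the data are hypothetical and are not taken from the analysis shown in Figure 14.

```python
import numpy as np

def roc_points(present, prob):
    """Sensitivity and specificity at every observed probability used as cut-off."""
    present = np.asarray(present, dtype=bool)
    prob = np.asarray(prob, dtype=float)
    cutoffs = np.unique(prob)
    sens = np.array([(prob[present] >= c).mean() for c in cutoffs])
    spec = np.array([(prob[~present] < c).mean() for c in cutoffs])
    return cutoffs, sens, spec

def best_cutoff(present, prob, sens_weight=1.0):
    """Cut-off maximizing sens_weight * sensitivity + specificity (an assumed criterion)."""
    cutoffs, sens, spec = roc_points(present, prob)
    k = np.argmax(sens_weight * sens + spec)
    return cutoffs[k], sens[k], spec[k]

# Hypothetical sample locations: observed presence and predicted probability.
present = [0, 0, 0, 0, 1, 1, 1, 1, 1]
prob = [0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.7, 0.8, 0.9]

balanced = best_cutoff(present, prob)
cautious = best_cutoff(present, prob, sens_weight=2.0)
print("Equal weights:", balanced)
print("Sensitivity weighted twice:", cautious)
```

Raising `sens_weight` moves the cut-off down, trading false positives for fewer missed locations, which is the kind of adjustment a decision maker might want when missing the parasite is more costly than a false alarm.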

Probabilities or rates are appropriately mapped as a measure of relative risk, which can be estimated by dividing the observed risk by an estimate of the expected risk. The standardized mortality ratio has been used widely to represent spatial variation of disease risk (Elliott, Martuzzi, and Shaddick 1995). Small numbers of observations may result in extreme values. More recently this problem has become less important through the adoption of empirical Bayes estimation. These techniques require an estimate of a prior probability distribution, which can for example be based on the overall probabilities across all areas. Bayesian techniques are then used to convert these estimates into posterior probability estimates. The estimates can be made spatial by using neighborhood probabilities to derive the prior probability distribution.

Decision making and spatial data

Spatial data together with other non-spatial information is used for decision making purposes. This has become more difficult because the amount of information available has increased substantially. Specific decision making tools have been developed in an attempt to simplify the process of making the right choice. A recent area of activity has been the adaptation of multicriteria and multi-objective evaluation techniques to spatial problems. Such systems can take account of uncertainty in the data as well as of the risk of making the wrong decision (Eastman et al. 1995). Spatial data has also become an essential component of disease information systems, which are beginning to replace the largely manual systems used by decision makers for the control of endemic and epidemic diseases. The large amounts of data which can be processed easily, their objectivity and the speed of response are some of the advantages of computerized animal disease information systems. GIS provides an essential component of such systems.
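To make the relative risk mapping described above concrete, the following Python sketch computes a crude ratio of observed to expected cases per area and a globally smoothed rate. The shrinkage formula is a method-of-moments estimator of the kind commonly presented for empirical Bayes rate smoothing; treating it as the approach meant in the text is an assumption, as are the function names and the counts. The sketch assumes at least one case overall and uses the overall rate as the prior mean, without age standardization.

```python
import numpy as np

def relative_risk(observed, population):
    """Observed cases divided by the cases expected under the overall rate."""
    overall = observed.sum() / population.sum()
    return observed / (population * overall)

def empirical_bayes_rates(observed, population):
    """Shrink raw rates towards the overall rate; small areas are shrunk most."""
    raw = observed / population
    m = observed.sum() / population.sum()                 # prior mean
    s2 = np.sum(population * (raw - m) ** 2) / population.sum()
    a = max(s2 - m / population.mean(), 0.0)              # prior variance, floored at 0
    shrink = a / (a + m / population)                     # weight given to the raw rate
    return m + shrink * (raw - m)

# Hypothetical counts for four areas of very different size.
observed = np.array([8.0, 10.0, 50.0, 0.0])
population = np.array([20.0, 200.0, 1000.0, 30.0])

raw = observed / population
rr = relative_risk(observed, population)
eb = empirical_bayes_rates(observed, population)
print("Raw rates:     ", raw.round(4))
print("Relative risk: ", rr.round(2))
print("Smoothed rates:", eb.round(4))
```

The first and last areas have small populations, so their extreme raw rates are pulled strongly towards the overall rate, while the largest area is left almost unchanged. A spatial variant would replace the overall rate by the rate in each area's neighbourhood.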
An example of an animal disease information system is EpiMAN, which was developed in New Zealand for the management of an outbreak of foot-and-mouth disease (Morris et al. 1992).

Epidemiological simulation

GIS can provide geographical data which allows computer simulations of the dynamics of infectious diseases for specific geographical locations. Spatial heterogeneity can be represented in simulation models, resulting in more realistic representations of reality. There are only a few examples where this approach has been used in veterinary epidemiology. Sanson (1993) described a model of foot-and-mouth disease which represents inter-farm spread of the disease in a true geographical area, using various transmission mechanisms. Pfeiffer (1994) developed a geographic simulation model of the dynamics of bovine tuberculosis infection in wild possum populations. The geographical component is a major feature of this model. The model uses vegetation maps to represent the ecological conditions of particular environments.

Conclusion

Spatial data has become an important component of disease investigations. The availability of geographic information systems in combination with fast and relatively inexpensive computer hardware leaves the epidemiologist with the responsibility of making effective use of the information. While descriptive techniques for spatial data have been available for a long time, exploratory and modeling techniques are still a very active area of development, and they are not yet accessible enough to the analyst to become a routine component of epidemiological analysis.

References

Alexander, F. E., and J. Cuzick. Methods for the assessment of disease clusters. Geographical and environmental Epidemiology: Methods for small-area Studies. Editors P. Elliott, J. Cuzick, D. English, and R. Stern, Oxford: Oxford University Press.
Anselin, L. Exploratory spatial data analysis and geographic information systems. New Tools for spatial Analysis. Editor M. Painho, Luxembourg: Eurostat.
Anselin, L. Local indicators of spatial association - LISA. Geographical Analysis 27, no. 2:
Anselin, L. Spatial data analysis with GIS: An introduction to application in the social Sciences. 75. Technical Report Series. Santa Barbara, California: National Center for Geographic Information and Analysis.
Anselin, L., R. F. Dodson, and S. Hudak. Linking GIS and spatial data analysis in practice. Geographical Analysis 1:
Bailey, T. C., and A. C. Gatrell. Interactive spatial data analysis. Harlow, Essex, England: Longman Group. 413pp
Bithell, J. F. An application of density estimation to geographical epidemiology. Statistics in Medicine 9:
Cliff, A. D., P. Haggett, M. R. Smallman-Raynor, D. F. Stroup, and G. D. Williamson. The application of multidimensional scaling methods to epidemiological data. Statistical Methods in Medical Research 4:
Clifton-Hadley, R. S. The use of a geographical information system (GIS) in the control and epidemiology of bovine tuberculosis in south-west England. Proceedings of the Society for Veterinary Epidemiology and Preventive Medicine, editor M. V. Thrusfield, Society for Veterinary Epidemiology and Preventive Medicine.
Cuzick, J., and R. Edwards. Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society B 52, no. 1:
Dunning, J. B., B. J. Danielson, and C. F. Leck. Ecological processes that affect populations in complex landscapes. Oikos 65:
Eastman, J. R., and M. Fulk. Long sequence time series evaluation using standardized principal components. Photogrammetric Engineering and Remote Sensing 59, no. 6:
Eastman, J. R., W. Jin, P. A. K. Kyem, and J. Toledano. Raster procedures for multicriteria/multi-objective decisions. Photogrammetric Engineering and Remote Sensing 61, no. 5:
Elliott, P., M. Martuzzi, and G. Shaddick. Spatial statistical methods in environmental epidemiology: a critique. Statistical Methods in Medical Research 4:
Getis, A., and J. K. Ord. The analysis of spatial association by use of distance statistics. Geographical Analysis 24 (3):
Glass, G. E., B. S. Schwartz, J. M. Morgan, D. T. Johnson, P. M. Noy, and E. Israel. Environmental risk factors for Lyme disease identified with geographic information systems. American Journal of Public Health 85, no. 7:
Haining, R. Spatial Data Analysis in the social and environmental Sciences. Cambridge: Cambridge University Press.
Hungerford, L. L. Use of spatial statistics to identify and test significance in geographic disease patterns. Preventive Veterinary Medicine 11:
Izenman, A. J. Recent developments in nonparametric density estimation. Journal of the American Statistical Association 86 (413):
Kingham, S. P., A. C. Gatrell, and B. Rowlingson. Testing for clustering of health events within a geographical information system framework. Environment and Planning A 27:
Knox, E. G. The detection of space-time interaction. Applied Statistics 13:
Lessard, P., R. L'Eplattenier, R. A. I. Norval, K. Kundert, T. T. Dolan, H. Croze, and others. Geographical information systems for studying the epidemiology of cattle diseases caused by Theileria parva. Veterinary Record 126:
Mantel, N. The detection of disease clustering and a generalized regression approach. Cancer Research 27 (2):
Morris, R. S., R. L. Sanson, and M. W. Stern. 1992. EPIMAN - A Decision Support System for Managing a Foot-and-Mouth Disease Epidemic. Proceedings Fifth Annual Meeting of the Dutch Society for Veterinary Epidemiology and Economy, Wageningen,
Oliver, M. A., and R. Webster. Kriging: a method of interpolation for geographical information systems. International Journal of Geographical Information Systems 4 (3):
Openshaw, S. Automating the search for cancer clusters: a review of problems, progress, and opportunities. Spatial epidemiology. Editor R. W. Thomas, Pion Publications.
Perry, B. D., R. Kruska, P. Lessard, R. A. I. Norval, and K. Kundert. Estimating the distribution and abundance of Rhipicephalus appendiculatus in Africa. Preventive Veterinary Medicine 11:
Pfeiffer, D. U. The role of a wildlife reservoir in the epidemiology of bovine tuberculosis. Unpublished PhD Thesis, Massey University, Palmerston North, New Zealand.
Rothman, K. J. A sobering start for the cluster busters' conference. American Journal of Epidemiology 132 Sup 1: S6-S13.
Sanson, R. L. The development of a decision support system for an animal disease emergency. Unpublished PhD Thesis, Massey University, Palmerston North, New Zealand.
Waller, L. A., and A. B. Lawson. The power of focused tests to detect disease clustering. Statistics in Medicine 14:
Walter, S. D. Assessing spatial patterns in disease rates. Statistics in Medicine 12:
Wartenberg, D., and M. Greenberg. Detecting disease clusters: the importance of statistical power. American Journal of Epidemiology 132 Sup 1: S156-S166.
Wartenberg, D., and M. Greenberg. Solving the cluster puzzle: Clues to follow and pitfalls to avoid. Statistics in Medicine 12:
Williams, B., D. Rogers, G. Staton, B. Ripley, and T. Booth. Statistical modelling of georeferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetation data. Modelling vector-borne and other parasitic Diseases. Editors B. D. Perry and J. W. Hansen, Nairobi, Kenya: The International Laboratory for Research on Animal Diseases.


More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

Competency 1 Describe the role of epidemiology in public health

Competency 1 Describe the role of epidemiology in public health The Northwest Center for Public Health Practice (NWCPHP) has developed competency-based epidemiology training materials for public health professionals in practice. Epidemiology is broadly accepted as

More information

3. Data Analysis, Statistics, and Probability

3. Data Analysis, Statistics, and Probability 3. Data Analysis, Statistics, and Probability Data and probability sense provides students with tools to understand information and uncertainty. Students ask questions and gather and use data to answer

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information
