Use R!
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani



Use R!

Albert: Bayesian Computation with R
Bivand/Pebesma/Gómez-Rubio: Applied Spatial Data Analysis with R
Cook/Swayne: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi
Hahne/Huber/Gentleman/Falcon: Bioconductor Case Studies
Paradis: Analysis of Phylogenetics and Evolution with R
Pfaff: Analysis of Integrated and Cointegrated Time Series with R
Sarkar: Lattice: Multivariate Data Visualization with R
Spector: Data Manipulation with R

Roger S. Bivand, Edzer J. Pebesma, Virgilio Gómez-Rubio

Applied Spatial Data Analysis with R

Roger S. Bivand, Norwegian School of Economics and Business Administration, Breiviksveien, Bergen, Norway
Edzer J. Pebesma, University of Utrecht, Department of Physical Geography, 3508 TC Utrecht, Netherlands
Virgilio Gómez-Rubio, Department of Epidemiology and Public Health, Imperial College London, St. Mary's Campus, Norfolk Place, London W2 1PG, United Kingdom

Series Editors:
Robert Gentleman, Program in Computational Biology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N, M2-B876, Seattle, Washington, USA
Kurt Hornik, Department für Statistik und Mathematik, Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Wien, Austria
Giovanni Parmigiani, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University, 550 North Broadway, Baltimore, MD, USA

ISBN e-isbn DOI / Library of Congress Control Number:

© 2008 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

springer.com

Ewie
For Ellen, Ulla and Mandus
To my parents, Victorina and Virgilio Benigno

Preface

We began writing this book in parallel with developing software for handling and analysing spatial data with R (R Development Core Team, 2008). Although the book is now complete, software development will continue, in the R community fashion, of rich and satisfying interaction with users around the world, of rapid releases to resolve problems, and of the usual joys and frustrations of getting things done. There is little doubt that without pressure from users, the development of R would not have reached its present scale, and the same applies to spatial data analysis with R.

It would, however, not be sufficient to describe the development of the R project mainly in terms of narrowly defined utility. In addition to being a community project concerned with the development of world-class data analysis software implementations, it promotes specific choices with regard to how data analysis is carried out. R is open source not only because open source software development, including the dynamics of broad and inclusive user and developer communities, is arguably an attractive and successful development model. R is also, or perhaps chiefly, open source because the analysis of empirical and simulated data in science should be reproducible. As working researchers, we are all too aware of the possibility of reaching inappropriate conclusions in good faith because of user error or misjudgement. When the results of research really matter, as in public health, in climate change, and in many other fields involving spatial data, good research practice dictates that someone else should be, at least in principle, able to check the results. Open source software means that the methods used can, if required, be audited, and journalling working sessions can ensure that we have a record of what we actually did, not what we thought we did. Further, using Sweave, a tool that permits the embedding of R code for complete data analyses in documents, throughout this book has provided crucial support (Leisch, 2002; Leisch and Rossini, 2003).

We acknowledge our debt to the members of R-core for their continuing commitment to the R project. In particular, the leadership and example of Professor Brian Ripley has been important to us, although our admitted muddling through contrasts with his peerless attention to detail. His interested support at the Distributed Statistical Computing conference in Vienna in 2003 helped us to see that encouraging spatial data analysis in R was a project worth pursuing. Kurt Hornik's dedication to keeping the Comprehensive R Archive Network running smoothly, providing package maintainers with superb, almost 24/7, service, and his dry humour when we blunder, have meant that the user community is provided with contributed software in an unequalled fashion. We are also grateful to Martin Mächler for his help in setting up and hosting the R-Sig-Geo mailing list, without which we would not have had a channel for fostering the R spatial community.

We also owe a great debt to users participating in discussions on the mailing list, sometimes for specific suggestions, often for fruitful questions, and occasionally for perceptive bug reports or contributions. Other users contact us directly, again with valuable input that leads both to a better understanding on our part of their research realities and to the improvement of the software involved. Finally, participants at R spatial courses, workshops, and tutorials have been patient and constructive.

We are also indebted to colleagues who have contributed to improving the final manuscript by commenting on earlier drafts and pointing out better procedures to follow in some examples. In particular, we would like to mention Juanjo Abellán, Nicky Best, Peter J. Diggle, Paul Hiemstra, Rebeca Ramis, Paulo J. Ribeiro Jr., Barry Rowlingson, and Jon O. Skøien. We are also grateful to colleagues for agreeing to our use of their data sets. Support from Luc Anselin has been important over a long period, including a very fruitful CSISS workshop in Santa Barbara. Work by colleagues, such as the first book known to us on using R for spatial data analysis (Kopczewska, 2006), provided further incentives both to simplify the software and complete its description.

Without John Kimmel's patient encouragement, it is unlikely that we would have finished this book. Even though we have benefitted from the help and advice of so many people, there are bound to be things we have not yet grasped, so any remaining mistakes and omissions are our sole responsibility. We would be grateful for messages pointing out errors in this book; errata will be posted on the book website.

Bergen, Münster, London
April 2008

Roger S. Bivand
Edzer J. Pebesma
Virgilio Gómez-Rubio

Contents

Preface

1 Hello World: Introducing Spatial Data
  Applied Spatial Data Analysis
  Why Do We Use R
  In General?
  for Spatial Data Analysis?
  R and GIS
  What is GIS?
  Service-Oriented Architectures
  Further Reading on GIS
  Types of Spatial Data
  Storage and Display
  Applied Spatial Data Analysis
  R Spatial Resources
  Online Resources
  Layout of the Book

Part I Handling Spatial Data in R

2 Classes for Spatial Data in R
  Introduction
  Classes and Methods in R
  Spatial Objects
  SpatialPoints
  Methods
  Data Frames for Spatial Point Data
  SpatialLines
  SpatialPolygons
  SpatialPolygonsDataFrame Objects
  Holes and Ring Direction
  SpatialGrid and SpatialPixel Objects

3 Visualising Spatial Data
  The Traditional Plot System
  Plotting Points, Lines, Polygons, and Grids
  Axes and Layout Elements
  Degrees in Axes Labels and Reference Grid
  Plot Size, Plotting Area, Map Scale, and Multiple Plots
  Plotting Attributes and Map Legends
  Trellis/Lattice Plots with spplot
  A Straight Trellis Example
  Plotting Points, Lines, Polygons, and Grids
  Adding Reference and Layout Elements to Plots
  Arranging Panel Layout
  Interacting with Plots
  Interacting with Base Graphics
  Interacting with spplot and Lattice Plots
  Colour Palettes and Class Intervals
  Colour Palettes
  Class Intervals

4 Spatial Data Import and Export
  Coordinate Reference Systems
  Using the EPSG List
  PROJ.4 CRS Specification
  Projection and Transformation
  Degrees, Minutes, and Seconds
  Vector File Formats
  Using OGR Drivers in rgdal
  Other Import/Export Functions
  Raster File Formats
  Using GDAL Drivers in rgdal
  Writing a Google Earth Image Overlay
  Other Import/Export Functions
  Grass
  Broad Street Cholera Data
  Other Import/Export Interfaces
  Analysis and Visualisation Applications
  TerraLib and aRT
  Other GIS and Web Mapping Systems
  Installing rgdal

5 Further Methods for Handling Spatial Data
  Support
  Overlay
  Spatial Sampling
  Checking Topologies
  Dissolving Polygons
  Checking Hole Status
  Combining Spatial Data
  Combining Positional Data
  Combining Attribute Data
  Auxiliary Functions

6 Customising Spatial Data Classes and Methods
  Programming with Classes and Methods
  S3-Style Classes and Methods
  S4-Style Classes and Methods
  Animal Track Data in Package Trip
  Generic and Constructor Functions
  Methods for Trip Objects
  Multi-Point Data: SpatialMultiPoints
  Hexagonal Grids
  Spatio-Temporal Grids
  Analysing Spatial Monte Carlo Simulations
  Processing Massive Grids

Part II Analysing Spatial Data

7 Spatial Point Pattern Analysis
  Introduction
  Packages for the Analysis of Spatial Point Patterns
  Preliminary Analysis of a Point Pattern
  Complete Spatial Randomness
  G Function: Distance to the Nearest Event
  F Function: Distance from a Point to the Nearest Event
  Statistical Analysis of Spatial Point Processes
  Homogeneous Poisson Processes
  Inhomogeneous Poisson Processes
  Estimation of the Intensity
  Likelihood of an Inhomogeneous Poisson Process
  Second-Order Properties
  Some Applications in Spatial Epidemiology
  Case Control Studies
  Binary Regression Estimator
  Binary Regression Using Generalised Additive Models
  Point Source Pollution
  Accounting for Confounding and Covariates
  Further Methods for the Analysis of Point Patterns

8 Interpolation and Geostatistics
  Introduction
  Exploratory Data Analysis
  Non-Geostatistical Interpolation Methods
  Inverse Distance Weighted Interpolation
  Linear Regression
  Estimating Spatial Correlation: The Variogram
  Exploratory Variogram Analysis
  Cutoff, Lag Width, Direction Dependence
  Variogram Modelling
  Anisotropy
  Multivariable Variogram Modelling
  Residual Variogram Modelling
  Spatial Prediction
  Universal, Ordinary, and Simple Kriging
  Multivariable Prediction: Cokriging
  Collocated Cokriging
  Cokriging Contrasts
  Kriging in a Local Neighbourhood
  Change of Support: Block Kriging
  Stratifying the Domain
  Trend Functions and their Coefficients
  Non-Linear Transforms of the Response Variable
  Singular Matrix Errors
  Model Diagnostics
  Cross Validation Residuals
  Cross Validation z-scores
  Multivariable Cross Validation
  Limitations to Cross Validation
  Geostatistical Simulation
  Sequential Simulation
  Non-Linear Spatial Aggregation and Block Averages
  Multivariable and Indicator Simulation
  Model-Based Geostatistics and Bayesian Approaches
  Monitoring Network Optimization
  Other R Packages for Interpolation and Geostatistics
  Non-Geostatistical Interpolation
  spatial
  RandomFields
  geoR and geoRglm
  fields

9 Areal Data and Spatial Autocorrelation
  Introduction
  Spatial Neighbours
  Neighbour Objects
  Creating Contiguity Neighbours
  Creating Graph-Based Neighbours
  Distance-Based Neighbours
  Higher-Order Neighbours
  Grid Neighbours
  Spatial Weights
  Spatial Weights Styles
  General Spatial Weights
  Importing, Converting, and Exporting Spatial Neighbours and Weights
  Using Weights to Simulate Spatial Autocorrelation
  Manipulating Spatial Weights
  Spatial Autocorrelation: Tests
  Global Tests
  Local Tests

10 Modelling Areal Data
  Introduction
  Spatial Statistics Approaches
  Simultaneous Autoregressive Models
  Conditional Autoregressive Models
  Fitting Spatial Regression Models
  Mixed-Effects Models
  Spatial Econometrics Approaches
  Other Methods
  GAM, GEE, GLMM
  Moran Eigenvectors
  Geographically Weighted Regression

11 Disease Mapping
  Introduction
  Statistical Models
  Poisson-Gamma Model
  Log-Normal Model
  Marshall's Global EB Estimator
  Spatially Structured Statistical Models
  Bayesian Hierarchical Models
  The Poisson-Gamma Model Revisited
  Spatial Models
  Detection of Clusters of Disease
  Testing the Homogeneity of the Relative Risks
  Moran's I Test of Spatial Autocorrelation
  Tango's Test of General Clustering
  Detection of the Location of a Cluster
  Geographical Analysis Machine
  Kulldorff's Statistic
  Stone's Test for Localised Clusters
  Other Topics in Disease Mapping

Afterword
  R and Package Versions Used
  Data Sets Used

References
Subject Index
Functions Index

Part I Handling Spatial Data in R

Handling Spatial Data

The key intuition underlying the development of the classes and methods in the sp package, and its closer dependent packages, is that users approaching R with experience of GIS will want to see layers, coverages, rasters, or geometries. Seen from this point of view, sp classes should be reasonably familiar, appearing to be well-known data models. On the other hand, for statistician users of R, everything is a data.frame, a rectangular table with rows of observations on columns of variables. To permit the two disparate groups of users to play together happily, classes have grown that look like GIS data models to GIS and other spatial data people, and look and behave like data frames from the point of view of applied statisticians and other data analysts.

This part of the book describes the classes and methods of the sp package, and in doing so also provides a practical guide to the internal structure of many GIS data models, as R permits the user to get as close as desired to the data. However, users will not often need to know more than the material in Chap. 4 to read in their data and start work. Visualisation is covered in Chap. 3, and so a statistician receiving a well-organised set of data from a collaborator may even be able to start making maps in two lines of code, one to read the data and one to plot the variable of interest using lattice graphics. Note that coloured versions of figures may be found on the book website together with complete code examples, data sets, and other support material.

If life were always so convenient, this part of the book could be much shorter than it is. But combining spatial data from different sources often means that much more insight is needed into the data models involved. The data models themselves are described in Chap. 2, and methods for handling and combining them are covered in Chap. 5. Keeping track of which observation belongs to which geometry is also discussed here, seen from the GIS side as feature identifiers, and row names from the data frame side. In addition to data import and export, Chap. 4 also describes the use and transformation of coordinate reference systems for sp classes, and integration of the open source GRASS GIS and R. Finally, Chap. 6 explains how the methods and classes introduced in Chap. 2 can be extended to suit one's own needs.
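The two views described above, a GIS-style geometry and a statistician's data frame, can be illustrated with a minimal sketch. It assumes that the sp package is installed and uses the `meuse` sample data set bundled with sp; the choice of this data set and of the `zinc` variable is an assumption made for illustration, not something taken from this passage. Because the data come with the package, no file is read here, so this stands in for the "read" step mentioned above rather than reproducing it.

```r
# Minimal sketch, assuming the sp package is installed.
# The meuse data set and the zinc variable are illustrative choices.
library(sp)

data(meuse)                    # an ordinary data.frame with columns x and y
class(meuse)                   # "data.frame"

# Declare which columns hold the coordinates; this promotes the
# data.frame to a spatial object with point geometry.
coordinates(meuse) <- ~x + y
class(meuse)                   # now a SpatialPointsDataFrame

# GIS-style view: extent of the geometry and a spatial summary.
bbox(meuse)
summary(meuse)

# Data-frame-style view: columns and row subsetting still work.
names(meuse)
mean(meuse$zinc)
high_zinc <- meuse[meuse$zinc > 1000, ]   # still a spatial object

# One line to map the variable of interest using lattice graphics.
print(spplot(meuse, "zinc"))
```

After `coordinates<-` is applied, the coordinate columns are stored in the geometry rather than among the attributes, which is why `names(meuse)` no longer lists `x` and `y`.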

1 Hello World: Introducing Spatial Data

1.1 Applied Spatial Data Analysis

Spatial data are everywhere. Besides those we collect ourselves ("is it raining?"), they confront us on television, in newspapers, on route planners, on computer screens, and on plain paper maps. Making a map that is suited to its purpose and does not distort the underlying data unnecessarily is not easy. Beyond creating and viewing maps, spatial data analysis is concerned with questions not directly answered by looking at the data themselves. These questions refer to hypothetical processes that generate the observed data. Statistical inference for such spatial processes is often challenging, but is necessary when we try to draw conclusions about questions that interest us.

Possible questions that may arise include the following:

- Does the spatial patterning of disease incidences give rise to the conclusion that they are clustered, and if so, are the clusters found related to factors such as age, relative poverty, or pollution sources?
- Given a number of observed soil samples, which part of a study area is polluted?
- Given scattered air quality measurements, how many people are exposed to high levels of black smoke or particulate matter (e.g. PM10) [1], and where do they live?
- Do governments tend to compare their policies with those of their neighbours, or do they behave independently?

In this book we will be concerned with applied spatial data analysis, meaning that we will deal with data sets, explain the problems they confront us with, and show how we can attempt to reach a conclusion. This book will refer to the theoretical background of methods and models for data analysis, but will emphasise hands-on, do-it-yourself examples using R; readers needing this background should consult the references. All data sets used in this book and all examples given are available, and interested readers will be able to reproduce them.

[1] Particulate matter smaller than about 10 µm.

In this chapter we discuss the following:

(i) Why we use R for analysing spatial data
(ii) The relation between R and geographical information systems (GIS)
(iii) What spatial data are, and the types of spatial data we distinguish
(iv) The challenges posed by their storage and display
(v) The analysis of observed spatial data in relation to processes thought to have generated them
(vi) Sources of information about the use of R for spatial data analysis and the structure of the book.

1.2 Why Do We Use R

In General?

The R system (R Development Core Team, 2008) is a free software environment for statistical computing and graphics. It is an implementation of the S language for statistical computing and graphics (Becker et al., 1988). For data analysis, it can be highly efficient to use a special-purpose language like S, compared to using a general-purpose language. For new R users without earlier scripting or programming experience, meeting a programming language may be unsettling, but the investment [3] will quickly pay off. The user soon discovers how analysis components written or copied from examples can easily be stored, replayed, modified for another data set, or extended. R can be extended easily with new dedicated components, and can be used to develop and exchange data sets and data analysis approaches. It is often much harder to achieve this with programs that require long series of mouse clicks to operate.

[3] A steep learning curve: the user learns a lot per unit time.

R provides many standard and innovative statistical analysis methods. New users may find access to both well-tried and trusted methods, and speculative and novel approaches, worrying. This can, however, be a major strength, because if required, innovations can be tested in a robust environment against legacy techniques. Many methods for analysing spatial data are less frequently used than the most common statistical techniques, and thus benefit proportionally more from the nearness to both the data and the methods that R permits. R uses well-known libraries for numerical analysis, and can easily be extended by or linked to code written in S, C, C++, Fortran, or Java. Links to various relational data base systems and geographical information systems exist, and many well-known data formats can be read and/or written.

The level of voluntary support and the development speed of R are high, and experience has shown R to be an environment suitable for developing professional, mission-critical software applications, both for the public and the private sector. The S language can not only be used for low-level computation on numbers, vectors, or matrices but can also be easily extended with classes for new data types and analysis methods for these classes, such as methods for summarising, plotting, printing, performing tests, or model fitting (Chambers, 1998).

In addition to the core R software system, R is also a social movement, with many participants on a continuum from users just beginning to analyse data with R to developers contributing packages to the Comprehensive R Archive Network [4] (CRAN) for others to download and employ. Just as R itself benefits from the open source development model, contributed package authors benefit from a world-class infrastructure, allowing their work to be published and revised with improbable speed and reliability, including the publication of source packages and binary packages for many popular platforms. Contributed add-on packages are very much part of the R community, and most core developers also write and maintain contributed packages. A contributed package contains R functions, optional sample data sets, and documentation including examples of how to use the functions.

[4] CRAN mirrors are linked from

for Spatial Data Analysis?

For over 10 years, R has had an increasing number of contributed packages for handling and analysing spatial data. All these packages used to make different assumptions about how spatial data were organised, and R itself had no capabilities for distinguishing coordinates from other numbers. In addition, methods for plotting spatial data and other tasks were scattered, made different assumptions on the organisation of the data, and were rudimentary. This was not unlike the situation for time series data at the time. After some joint effort and wider discussion, a group [5] of R developers have written the R package sp to extend R with classes and methods for spatial data (Pebesma and Bivand, 2005). Classes specify a structure and define how spatial data are organised and stored. Methods are instances of functions specialised for a particular data class. For example, the summary method for all spatial data classes may tell the range spanned by the spatial coordinates, and show which coordinate reference system is used (such as degrees longitude/latitude, or the UTM zone). It may in addition show some more details for objects of a specific spatial class. A plot method may, for example, create a map of the spatial data. The sp package provides classes and methods for points, lines, polygons, and grids (Sect. 1.4, Chap. 2).

[5] Mostly the authors of this book with help from Barry Rowlingson and Paulo J. Ribeiro Jr.

Adopting a single set of classes for spatial data offers a number of important advantages:

(i) It is much easier to move data across spatial statistics packages. The classes are either supported directly by the packages, reading and writing data in the new spatial classes, or indirectly, for example by supplying data conversion between the sp classes and the package's classes in an interface package. This last option requires one-to-many links between the packages, which are easier to provide and maintain than many-to-many links.
(ii) The new classes come with a well-tested set of methods (functions) for plotting, printing, subsetting, and summarising spatial objects, or combining (overlaying) spatial data types.
(iii) Packages with interfaces to geographical information systems (GIS), for reading and writing GIS file formats, and for coordinate (re)projection code support the new classes.
(iv) The new methods include Lattice plots, conditioning plots, plot methods that combine points, lines, polygons, and grids with map elements (reference grids, scale bars, north arrows), degree symbols (as in 52°N) in axis labels, etc.

Chapter 2 introduces the classes and methods provided by sp, and discusses some of the implementation details. Further chapters will show the degree of integration of sp classes and methods and the packages used for statistical analysis of spatial data. Figure 1.1 shows how the reception of sp classes has already influenced the landscape of contributed packages; interfacing other packages for handling and analysing spatial data is usually simple, as we see in Part II. The shaded nodes of the dependency graph are packages (co)-written and/or maintained by the authors of this book, and will be used extensively in the following chapters.

1.3 R and GIS

What is GIS?

Storage and analysis of spatial data is traditionally done in Geographical Information Systems (GIS). According to the toolbox-based definition of Burrough and McDonnell (1998, p. 11), a GIS is "...a powerful set of tools for collecting, storing, retrieving at will, transforming, and displaying spatial data from the real world for a particular set of purposes". Another definition mentioned in the same source refers to "...checking, manipulating, and analysing data, which are spatially referenced to the Earth".

Its capacity to analyse and visualise data makes R a good choice for spatial data analysis. For some spatial analysis projects, using only R may be sufficient for the job. In many cases, however, R will be used in conjunction with GIS software and possibly a GIS data base as well. Chapter 4 will show how spatial data are imported from and exported to GIS file formats. As is often the case in applied data analysis, the real issue is not whether a given problem can be

Fig. 1.1. Tree of R contributed packages on CRAN depending on or importing sp directly or indirectly; others suggest sp or use it without declaration in their package descriptions (status as of ). Packages shown: sp, maptools, rgdal, splancs, geoR, gstat, spsurvey, trip, aspace, spdep, spgwr, surveillance, GeoXp, spgrass6, GEOmap, ecespa, StatDA, geoRglm, simba, DCluster, svcR, BARD, RTOMO, VIM