EnviroInfo 2004 (Geneva)
Sh@ring EnviroInfo 2004

The Virtual Database
A Tool for Integrated Data Processing in a Distributed Environment

Marcel Frehner 1, Martin Brändli 2, Jürg Schenker 3

1 Swiss Federal Research Institute WSL, Landscape Inventories, Zürcherstrasse 111, CH-8903 Birmensdorf, Switzerland, email: marcel.frehner@wsl.ch
2 Swiss Federal Research Institute WSL, Landscape Inventories, Zürcherstrasse 111, CH-8903 Birmensdorf, Switzerland, email: martin.braendli@wsl.ch
3 Swiss Agency for the Environment, Forests and Landscape (SAEFL), Division of Nature, CH-3003 Bern, Switzerland, email: juerg.schenker@buwal.admin.ch

Abstract

Traditional desktop GIS are expensive to buy and require much experience and know-how in order to be used reasonably. Web mapping systems have recently become a cheap and easy-to-use alternative, but they offer only restricted access to existing spatial data and limited spatial data handling capabilities. This paper presents an architecture called Virtual Database which makes spatial environmental information as well as advanced geoprocessing functionality available to any user who has access to the Internet. A particular feature of the Virtual Database is that it serves as a platform for the integration of distributed data repositories consisting of environmental data. Data access and integration conform to the OpenGIS specifications for Web Feature Services (WFS) and the Geography Markup Language (GML). Analysis functionality, in particular methods of spatial overlay, is provided by a spatial analysis engine software component. First experiences with the Virtual Database show its high flexibility concerning the integration of heterogeneous data repositories. High scalability of the system is achieved by a caching mechanism based on data replication. The potential of the Virtual Database is illustrated by sketching an application scenario from the field of environmental data handling.

1. Introduction

One of the medium-term goals of the Landscape Inventory Division at the Swiss Federal Research Institute WSL is the development and establishment of an integrated environmental and landscape information system. The system aims at offering a comprehensive solution for enabling the sharing of spatial data, methods, and computing resources. An information system in general consists of appropriately compiled data, methods and models to process the data, and usually of the possibility to access external data sources. An integrated information system should offer a unifying platform facilitating the application of a diversity of technologies and methods (database, GIS, remote sensing, statistics, etc.) and permitting the potentially arbitrary combination of any available data and methods, located either in-house or externally. Basing such a system on Internet/Intranet technology promises general access to the various data and computing resources.

This paper presents the Virtual Database, an architecture for the integration of distributed data repositories. Integration of data will enable combined visualization and analysis of distributed data and build the basis for a comprehensive environmental and landscape information system.

The Virtual Database currently aims at integrating different distributed databases of the Swiss Agency for the Environment, Forests and Landscape (SAEFL). SAEFL is responsible for collecting and storing nature protection data (protected areas, fauna and flora) on the national level. Because this task takes place at various decentralized institutions, the goal of the Virtual Database is to offer a unifying platform for the combination of these data. Databases (called data components) from the Centre Suisse de Cartographie de la Faune, the Institute for Systematic Botany at University of Zurich, and the Swiss Federal Research Institute WSL are brought together in order to build an integrated data federation. Data are provided by an easily accessible environment enabling comprehensive exploration and analysis.

Initially, the Virtual Database was designed for the purpose of integrating data for visualization and simple querying (Brändli/Sparenborg 2002).
However, since the Virtual Database is based on the distributed geographic information services paradigm (discussed in the next section) and since it takes advantage of open standardization initiatives and open source software development efforts, it is now being extended to include advanced spatial data handling capabilities, which are the particular focus of this paper.

The paper proceeds as follows: Section 2 discusses related work on exchanging, sharing and analysis of spatial data using the Internet. The design of the Virtual Database is sketched in section 3, followed by the description of implementation details, particularly on integrating analysis functionality, in section 4. The benefit of the Virtual Database is presented in section 5 by illustrating a specific application scenario. The paper ends with some conclusions and an outlook for planned work.

2. Related work

The establishment of this integrated environmental and landscape information system benefits from research efforts in the IT and in particular the GIS community. Traditionally, geographic information systems provide spatial data handling capabilities for data input, storage, retrieval, management, manipulation, analysis, and output (Aronoff 1989, Burrough/McDonnell 1998). Geoprocessing functionality is usually supplied by a single, monolithic system, and the data are normally stored in a single database. Due to the widespread use of the Internet this closed architecture paradigm of GIS is shifting towards a distributed geographic information services paradigm (Tsou/Buttenfield 2002). Distribution includes both the storage of data in spatially distributed database systems and dispersed geoprocessing providers offering so-called geo-services (Peng/Tsou 2003) consisting of spatial data handling functionality. How geo-services might be located on the Internet is discussed in Tsou (2002). The price of this paradigm shift is the requirement for enhanced interoperability, reusability and flexibility of both data and geo-services.

Today, however, most spatial data handling applications on the Internet concern Web mapping or Web cartography, offering functionality for the use, distribution and production of maps by means of the Internet (Kraak 2001, Orthofer/Loibl 2004). Additionally, they allow for visualizing spatial data and submitting simple queries. Current standardization efforts such as the initiative by the Open GIS Consortium (OGC) support this type of geospatial data handling. OGC released the Web Map Service Implementation Specification (WMS), which standardizes the way map images, service-level metadata, and information about particular map features contained in a map are requested (OGC 2001). But an OGC-compliant Web map server does not necessarily include any further tools for spatial analysis and modeling. As data are returned in an image format, they cannot be accessed for additional processing.

The need for exchanging and sharing spatial data on the Internet that goes beyond the transfer of query results as Web maps is well recognized.
Standards like OGC's Web Feature Service (WFS) Implementation Specification (OGC 2002) and OGC's Geography Markup Language (GML) Implementation Specification (OGC 2003) focus on the exchange of geographic data in a format that enables further client-side processing. Software producers such as ESRI with the recently released ArcGIS Server are extending their Web map server products by processing functions that consist of advanced spatial data handling operations.

Besides commercial developments, research efforts concerning Web-based GIS applications also aim at offering advanced spatial data handling capabilities. Tsou (2004) describes an application based on a set of Java applets that integrates GIS and remote sensing tools for address matching, network analysis, reselection, change detection in raster images, and image classification. The system is said to be particularly useful for non-GIS professionals who have so far hesitated to use GIS software for reasons of cost, complicated software installation and insufficient software training. Focusing on GIS tools, Anderson and Moreno-Sanchez (2002) demonstrate the implementation of spatial analysis capabilities around open specifications and open source software. Results show that both open specifications and open source software libraries have become powerful and mature enough to be applied in Web-GIS projects. Maximal interoperability is achieved by strictly conforming to the guidelines of open specifications.

3. Design of the Virtual Database

The design and implementation of the Virtual Database follow the trend of Internet-based data exchange and take advantage of open standards, open interfaces and open source software development. Design requirements for the Virtual Database are in particular:

1. Integration of distributed data repositories which are stored using different database management systems. The autonomy of the individual components must not be restricted by the data federation.
2. Database functionality is limited to distributed queries. Inserts and updates are handled by applications of the individual components.
3. Uniform interfaces are defined for data access.
4. Retrieval, query, analysis and display of data from distributed database systems should be open to a wide audience and therefore take place by means of a Web browser-based client.

Figure 1: Architecture of the Virtual Database
The design and implementation of the Virtual Database follow the principle of loose coupling of the individual data components (databases) and are structured into clearly separated but interrelated tiers. Figure 1 presents data components and necessary software modules as elements of three separate tiers. These tiers are as follows:

1. Enterprise Information System Tier (EIS Tier): The EIS Tier consists of distributed data repositories that have to be integrated. Data are either stored in database management systems or simply as files.
2. Middle Tier: The Middle Tier contains interfaces, so-called access layers, that enable access to the data repositories available from the EIS tier. The interfaces specify the way data must be served on the one hand and accessed on the other hand. In addition, interfaces to descriptive metadata must be provided in order to enable an assessment of served data. The integration layer controls the access to the distributed data repositories and integrates the data retrieved from the access layers in order to provide a transparent data view. The spatial analysis engine performs any desired analysis operations. Map server software is responsible for the rendering of maps of integrated and analyzed data.
3. Client Tier: Data retrieved from the map server are displayed by a user-friendly thin Web client facilitating user interaction and display.

4. Implementation of the Virtual Database

4.1 Enterprise information system tier

The EIS tier of the Virtual Database consists of heterogeneous data repositories at different locations. Heterogeneity concerns the data structures and the storage and database management systems (DBMS) of the involved data components. Currently data repositories from three different institutions are available from the EIS tier.

The first repository is installed at WSL and is called Data Center for Nature and Landscape (DNL). It is a database mainly storing inventory data of protected biotopes in Switzerland (Baltensweiler/Brändli 2004). Oracle is used as DBMS in combination with ESRI's Spatial Database Engine (SDE) for handling and processing of spatial data types.

The second database is located at the Centre Suisse de Cartographie de la Faune (CSCF) and stores data on endangered animal species using an Oracle DBMS. In contrast to the DNL database, spatial information (i.e. coordinates) on discovered animals is stored as standard columns in regular database tables.

The third data component, installed at the Institute of Systematic Botany, University of Zurich, contains discovered locations of endangered and rare moss species. Again, attribute data are stored in an Oracle database. Location data are, however, stored in ESRI's shapefile format.
4.2 Middle tier

As outlined above, the middle tier contains various functional components, i.e. several access layers, an integration layer, a map server, and a spatial analysis engine. Due to the complexity of the middle tier, Jakarta Struts (http://struts.apache.org) has been chosen as a framework for implementation. The Struts framework is based on the Java 2 Platform (http://java.sun.com) and makes use of Java Servlets, JavaServer Pages (JSP), JavaBeans, and XML, as well as various other open source software components provided by the Jakarta Project (http://jakarta.apache.org).

Struts encourages the design and implementation of application architectures based on the Model-View-Controller (MVC) paradigm. The MVC paradigm suggests the organization of interactive applications into three separate modules: one for the application model with its data representation and business logic, the second for views that provide data presentation and user input, and the third for a controller which dispatches user requests and controls application flow (Singh et al. 2004). Inside the Virtual Database the access layer, the spatial analysis engine, the map server, and the integration layer make up the model part, the view consists of dynamic JSP pages for information presentation, and the Struts ActionServlet as well as Struts Action classes build the controller.

4.2.1 Access layer

Each EIS component requires an appropriate access layer that accounts for its individual data storage system. Implementation of individual access layers follows OGC's Web Feature Service (WFS) implementation specification (OGC 2002). The specification defines interfaces for the manipulation of spatial features, i.e. querying, inserting, updating and deleting data, and bases the communication between the distributed computing platforms on HTTP.
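To make the HTTP-based communication more concrete, the following minimal sketch builds key-value-pair GET requests of the kind WFS 1.0.0 defines for the read-only operations GetCapabilities, DescribeFeatureType and GetFeature. It is written in Python for brevity, not in the Java used by the Virtual Database, and the endpoint URL, the feature type name `dnl:biotopes` and the bounding box are hypothetical placeholders rather than names taken from the actual system.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of one access layer; the real service URLs are not given here.
ACCESS_LAYER_URL = "http://example.org/dnl/wfs"


def wfs_request_url(base_url, request, type_name=None, bbox=None, version="1.0.0"):
    """Build a WFS key-value-pair GET URL for the given operation."""
    params = {"SERVICE": "WFS", "VERSION": version, "REQUEST": request}
    if type_name is not None:
        params["TYPENAME"] = type_name
    if bbox is not None:
        # BBOX is passed as minx,miny,maxx,maxy in the layer's coordinate system.
        params["BBOX"] = ",".join(str(value) for value in bbox)
    return base_url + "?" + urlencode(params)


# One URL per read-only operation (feature type and extent are made-up examples).
capabilities_url = wfs_request_url(ACCESS_LAYER_URL, "GetCapabilities")
schema_url = wfs_request_url(
    ACCESS_LAYER_URL, "DescribeFeatureType", type_name="dnl:biotopes"
)
features_url = wfs_request_url(
    ACCESS_LAYER_URL,
    "GetFeature",
    type_name="dnl:biotopes",
    bbox=(600000, 200000, 610000, 210000),
)

print(capabilities_url)
print(features_url)
```

Because every access layer answers the same three request types, a caller only needs the base URL of a repository; how the repository stores its data stays hidden behind the interface.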
Access layers of the Virtual Database implement the following interfaces that are required in any basic read-only Web Feature Service (for a more detailed description see Brändli/Sparenborg 2002):

1. GetCapabilities: Returns details about service capabilities like available data and functionality.
2. DescribeFeatureType: Returns a description of the data structure of available data.
3. GetFeature: Returns geospatial data encoded according to the Geography Markup Language (GML), which is based on an XML schema tailored for the exchange of spatial data.

Since the three data components of the EIS tier use different DBMS and file formats (ESRI's shapefile for moss data, for instance), the interfaces must be implemented accordingly. For example, access to the DNL database takes advantage of the ESRI SDE API for retrieval of spatial data types. In contrast, access to regular Oracle database tables is supplied by using the particular JDBC (Java Database Connectivity) implementation for Oracle.

Integrating distributed data repositories using an interface-based approach for data access is a highly flexible solution. The advantage of using interfaces is that neither existing database schemas nor file structures have to be changed or adapted. Conformity to the specified interfaces is achieved only by adapting the access software of the access layers. In some cases, though, the database schemas have additionally been extended with database views that join related tables in order to simplify access to the data.

4.2.2 Integration layer

The integration layer sends requests to each access layer of the involved data repositories. When the GML data are returned, the integration layer parses and merges the data according to their XML schemas. The current implementation does not consider any data heterogeneities such as differences in scale or different data accuracies. Handling of such heterogeneities and data uncertainties is postponed to future developments of the Virtual Database. The integration layer provides a transparent view on the distributed data repositories ready for use by the map server component and the spatial analysis engine described below.

During development, scalability problems caused by the growing size of the accessed datasets had to be handled. Spatial data, such as polygons describing the boundaries of administrative or environmental protection areas, are quite large in comparison to corresponding attribute data. Additionally, the necessary conversions from local data formats to GML substantially increase dataset size. A considerable growth of the amount of spatial data results in data transfer times that are unacceptable from a user point of view. For this reason, a caching mechanism based on data replication was implemented.
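The merging and replication steps can be illustrated with a small sketch in Python. It assumes that each access layer returns a WFS 1.0.0 style `wfs:FeatureCollection` whose features are wrapped in `gml:featureMember` elements. The `ReplicaCache` class, its version tokens, the `fetch` function and the sample `Biotope` and `Moss` features are hypothetical; they only stand in for whatever change detection and data the real integration layer uses.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"
WFS_NS = "http://www.opengis.net/wfs"
ET.register_namespace("gml", GML_NS)
ET.register_namespace("wfs", WFS_NS)


def merge_feature_collections(gml_documents):
    """Collect the gml:featureMember elements of several responses in one collection."""
    merged = ET.Element(f"{{{WFS_NS}}}FeatureCollection")
    for document in gml_documents:
        root = ET.fromstring(document)
        merged.extend(root.findall(f"{{{GML_NS}}}featureMember"))
    return merged


class ReplicaCache:
    """Keeps one local replica per repository; refetches only when the version changes."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable(repository) -> GML string (hypothetical)
        self._replicas = {}    # repository -> (version, GML string)

    def get(self, repository, version):
        cached = self._replicas.get(repository)
        if cached is None or cached[0] != version:
            cached = (version, self._fetch(repository))
            self._replicas[repository] = cached
        return cached[1]


def sample_response(feature_name):
    """Stand-in for the GetFeature response of one access layer."""
    return (
        f'<wfs:FeatureCollection xmlns:wfs="{WFS_NS}" xmlns:gml="{GML_NS}">'
        f"<gml:featureMember><{feature_name}/></gml:featureMember>"
        "</wfs:FeatureCollection>"
    )


fetch_log = []


def fetch(repository):
    """Pretend to transfer data from a remote repository and record the transfer."""
    fetch_log.append(repository)
    return sample_response(repository)


cache = ReplicaCache(fetch)
documents = [cache.get(name, version=1) for name in ("Biotope", "Moss")]
cache.get("Moss", version=1)   # served from the replica, no second transfer
cache.get("Moss", version=2)   # data changed, so the replica is refreshed
merged = merge_feature_collections(documents)
print(len(merged), fetch_log)
```

The sketch deliberately ignores differences in scale or accuracy between the merged features, in line with the current state of the integration layer.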
Replicated spatial data accessed from the distributed components are stored as part of the integration layer and updated as soon as data changes occur.

4.2.3 Map server

Visualization and query of spatial data and corresponding attribute data are based on out-of-the-box components of ESRI's Internet mapping software ArcIMS. ArcIMS offers an XML-based query language and various connectors, among them a Java connector with a corresponding JSP tag library which facilitates the composition of requests for visualization options, spatial queries, attribute queries, buffering, and other simple GIS operations (http://www.esri.com/software/arcgis/arcims/index.html). ArcIMS is mainly applied for historical and institutional reasons. Open Source products like MapServer (http://mapserver.gis.umn.edu/) would be equally suitable. A comprehensive list of map server software can be found at http://gislounge.com/ll/webgis.shtml. Peng (2003, 379) gives a detailed insight into a few popular commercial map servers.

4.2.4 Spatial analysis engine and overlay computation

A full-featured GIS is expected to provide functionality that can be categorized into five areas: (1) data acquisition; (2) preliminary data processing; (3) data storage and retrieval; (4) spatial search and analysis; (5) graphical display and interaction (Jones 1997, 38). The Virtual Database already supports many of these features. For instance, thematic layers can be selected in a Web form for display. Data can then be queried by attributes or by spatial selection, spatial objects can be buffered, attribute tables can be displayed, and object and layer metadata may be accessed. However, the Virtual Database does not claim to be a full-featured GIS but is intended to satisfy the specialized needs of some particular research and public administration groups in an optimal way. Once new users begin showing interest, further functionality may be included.

As mentioned above, the map server component is already able to perform spatial queries and buffer operations. A dedicated spatial analysis engine extends this functionality with advanced spatial data handling capabilities, which we expect to improve the usefulness of the Virtual Database. The spatial analysis engine currently consists of a tool for computing vector overlays (described below). Spatial overlays strengthen the Virtual Database since they form the basis for many further GIS modeling and analysis tasks. Overlaying vector data requires the geometric intersection of lines and polygons as well as feature selection by either Boolean or set operations on the input layers (Jones 1997, 48-54).

The current implementation of the overlay tool of the analysis engine makes use of ESRI's MapObjects Java Edition 2.0.
MapObjects provides methods for the computation of intersections on the level of single geometric objects like polylines and polygons. The overlay of two or more entire layers consisting of a great number of spatial objects is a complex task and requires high-performance software. First experiences with MapObjects show that its algorithm performance is not very promising for this task. Future versions of the analysis engine will therefore be based on alternative existing libraries, in particular the free Open Source libraries Geotools (http://www.geotools.org) and the JTS Topology Suite (http://www.vividsolutions.com/jts), published under the LGPL license (http://www.gnu.org/copyleft/lesser.html).

The user interface of the analysis engine is implemented as a JSP page allowing for the selection of two polygon layers. Figure 2 shows that the desired type of overlay can be selected from four Boolean operations, i.e. AND, OR, NOT and XOR, each illustrated by an intuitive graphic symbol.
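The meaning of the four Boolean overlay types can be illustrated independently of any GIS library. The Python sketch below is a deliberately simplified raster approximation: each polygon layer is reduced to the set of unit grid cells it covers, so that AND, OR, NOT and XOR become plain set operations. It is not the vector intersection algorithm of MapObjects or JTS. The rectangular layers are hypothetical example data, and reading NOT as "in the first layer but not in the second" is an assumption.

```python
def rectangle_cells(min_x, min_y, max_x, max_y):
    """Return the set of unit grid cells covered by an axis-aligned rectangle."""
    return {(x, y) for x in range(min_x, max_x) for y in range(min_y, max_y)}


# Each overlay type maps to one set operation on the covered cells.
OVERLAY_OPERATIONS = {
    "AND": lambda a, b: a & b,   # area covered by both layers
    "OR": lambda a, b: a | b,    # area covered by at least one layer
    "NOT": lambda a, b: a - b,   # area of the first layer outside the second (assumed reading)
    "XOR": lambda a, b: a ^ b,   # area covered by exactly one layer
}


def overlay(layer_a, layer_b, operation):
    """Apply the named Boolean overlay to two layers given as cell sets."""
    return OVERLAY_OPERATIONS[operation](layer_a, layer_b)


# Hypothetical example: two states of a protected area that was extended eastwards.
old_state = rectangle_cells(0, 0, 4, 4)   # 4 x 4 cells
new_state = rectangle_cells(0, 0, 6, 4)   # 6 x 4 cells

for name in OVERLAY_OPERATIONS:
    print(name, len(overlay(old_state, new_state, name)))

# Cells that were added between the two states.
added_cells = overlay(new_state, old_state, "NOT")
print(sorted(added_cells))
```

A vector implementation has to compute the same four results from polygon boundaries, which is where the intersection performance of the underlying geometry library becomes decisive.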
Figure 2: User interface for spatial overlay analysis.

4.3 Client tier

The browser is designed as a thin client handling user input and display. Data assembly, graphical rendering, and spatial analysis are accomplished by the corresponding software components of the middle tier. Client-side code is thus limited to HTML and JavaScript, which promises optimal availability to potential users. Maps are published as JPEG raster images. Vector formats such as Vector Markup Language (VML) and Scalable Vector Graphics (SVG) could provide more comprehensive display functions like rapid zoom-in/out, customizable map symbols, and layer stacking order (Tsou 2004), but are not implemented in the current version of the Virtual Database. The main problems related to providing data as VML or SVG are data protection issues and the necessary plug-in installation.
5. An application scenario

Brändli and Höppner (2004) show the potential of the Virtual Database in regional planning: it improves data management, increases the availability of data and GIS functionality, and simplifies data exploration and analysis. The following scenario illustrates how information retrieval may take place in the field of environmental data handling.

Suppose Tom, a regional commissioner for nature conservation, is interested in a particular nature reserve. The Virtual Database gives him access to the DNL, where various inventories on nature and landscape are stored. By browsing through the list of datasets he finds all available states of the respective inventory object. Tom selects them all and the system displays them as layers in a map. By exploring the map Tom has the impression that the area of the reserve has been extended significantly between the last two states. He requests an overlay analysis from the server and gets a new dataset returned displaying all geometry and attribute changes. Currently, Tom is particularly interested in moss data. Therefore he uses the Virtual Database again to select some rare moss species from the data repository at the Institute for Systematic Botany at University of Zurich and adds them to the map. Tom is pleased to find out that many of them can be found in the recently added areas of the nature reserve.

The scenario described involves overlay operations and spatial searches of multiple heterogeneous datasets. Given access to the Virtual Database, Tom does not need to tediously collect and pre-process the potentially heterogeneous data layers himself, but can directly search the Virtual Database and rely on the data being served by the system. The necessary spatial operations can be performed on-line and the results are returned as a map for further exploration and analysis. Because the necessary GIS functionality is accessible on-line, Tom can perform the required analysis without any locally installed GIS software.
A common Web-browser-based thin client is sufficient for using the Virtual Database including its spatial analysis capabilities.

6. Conclusions and outlook

We presented the design and implementation of a software architecture for a Web-based spatial data handling application that offers access to distributed spatial data repositories. The advantage of the Virtual Database is that anybody with a Web browser and access to the Internet can use the provided spatial data to perform comprehensive spatial data exploration and advanced analysis operations. The chosen approach of accessing distributed data through standardized interfaces proved to be highly flexible since no changes or adaptations of the involved data repositories are necessary. High scalability for the handling of large datasets is achieved by replicating the data and storing them as part of the integration layer. A current bottleneck of the application is the weak performance of the spatial analysis engine concerning overlay operations. We expect a significant increase in algorithm performance from the inclusion of alternative open source libraries.

A fundamental problem related to the integration of datasets from distributed repositories is not yet solved, however. Existing data heterogeneities, data errors and data uncertainties are currently not considered. On the one hand, this concerns the integration layer, which aims at providing a transparent view on the data for exploration and analysis. The implementation of homogenization methods is necessary in order to completely satisfy this goal. On the other hand, exploration and analysis operations must take into account data errors and uncertainties in order to enable reliable interpretation and assessment of datasets and analysis results. A first step towards the handling of error and uncertainty characteristics of available datasets will be taken by considering existing metadata, in particular metadata on data quality (provided that these metadata are available). Automatic interpretation of metadata for data integration, exploration and subsequent analysis as well as metadata propagation in case of data analysis will be the key research topics in the near future.

Bibliography

Aronoff, S. (1989): Geographic Information Systems: A Management Perspective. WDL Publications, Ottawa.
Anderson, G., Moreno-Sanchez, R. (2002): Building Web-Based Spatial Information Solutions around Open Specifications and Open Source Software. Transactions in GIS, 7(4):447-466. Blackwell Publishing Ltd., Oxford.
Baltensweiler, A., Brändli, M. (2004): Web-based Exploration of Environmental Data and Corresponding Metadata, in Particular Lineage Information. In: Scharl, A. (ed.): Environmental Online Communication. Advanced Information and Knowledge Processing Series: 127-132, Springer, London.
Brändli, M., Höppner, C. (2004): Die Virtuelle Datenbank: Technologie zur Unterstützung in der Regionalplanung. Proceedings CORP 2004.
Brändli, M., Sparenborg, J. (2002): SVG as graphical metadata for distributed spatial data processing. SVG Open / Carto.net Developers Conference, Zurich, Switzerland, July 15-17. URL: http://www.svgopen.org/2002/papers/braendli_sparenborg svg_for_metadat a/index.html, accessed on: 12/08/2004.
Burrough, P. A., McDonnell, R. A. (1998): Principles of Geographical Information Systems. Oxford University Press, Oxford.
Duckham, M., McCreadie, J. E. (2002): Error-aware GIS Development. In: Shi et al. (2002): Spatial Data Quality. Taylor and Francis, London.
Heuvelink, G. B. M. (1998): Error Propagation in Environmental Modeling with GIS. Taylor and Francis, London.
Kraak, M.-J. (2001): Settings and needs for web cartography. In: Kraak, M.-J., and Brown, A. (eds.): Web Cartography. Developments and prospects. Taylor and Francis, London and New York.
ISO (2003): ISO 19115:2003. Geographic Information Metadata. URL: http://www.iso.org.
Jones, Ch. (1997): Geographical Information Systems and Computer Cartography. Addison Wesley Longman Ltd, England.
OGC (2001): Web Map Service Implementation Specification. Version: 1.1.1. Open GIS Consortium, Inc. URL: http://www.opengis.org/docs/01-068r2.pdf, accessed on: 08/05/2004.
OGC (2002): Web Feature Service Implementation Specification. Version: 1.0.0. Open GIS Consortium, Inc. URL: http://www.opengis.org/docs/02-058.pdf, accessed on: 08/05/2004.
OGC (2003): OpenGIS Geography Markup Language (GML) Implementation Specification. Version: 3.00. Open GIS Consortium, Inc. URL: http://www.opengis.org/docs/02-023r4.pdf, accessed on: 08/05/2004.
Orthofer, R., Loibl, W. (2004): Sharing Environmental Maps on the Web: The Austrian EnviroMap System. In: Scharl, A. (ed.): Environmental Online Communication. Advanced Information and Knowledge Processing Series: 133-144, Springer, London.
Singh, I., Stearns, B., Johnson, M. (2002): Designing Enterprise Applications with the J2EE Platform, Second Edition. URL: http://java.sun.com/blueprints/guidelines/designing_enterprise_applications _2e/, accessed on: 08/05/2004.
Tsou, M.-H. (2002): An Operational Metadata Framework for Searching, Indexing, and Retrieving Distributed Geographic Information Services on the Internet. In: Geographic Information Science (GIScience 2002). Lecture Notes in Computer Science Vol. 2478: 313-332, Springer-Verlag, Berlin.
Tsou, M.-H., Buttenfield, B. P. (2002): A Dynamic Architecture for Distributing Geographic Information Services. Transactions in GIS, 6(4):355-381, Blackwell Publishing Ltd, Oxford.
Tsou, M.-H. (2004): Integrating Web-based GIS and image processing tools for environmental monitoring and natural resource management. Journal of Geographical Systems, 6:155-174, Springer-Verlag, Berlin.