Enabling geospatial Business Intelligence

Dr. Thierry Badard & Mr. Etienne Dubé
GeoSOA research group
Laval University, Department of geomatics sciences
1055, avenue du Séminaire
Quebec (Quebec) G1V 0A6, Canada
Email: {Thierry.Badard; Etienne.Dube}@scg.ulaval.ca
Web: http://geosoa.scg.ulaval.ca

About eighty percent of all data stored in corporate databases has a spatial component [Franklin 1992]. This commonly recognised fact has recently stirred marked interest in the huge potential of Geospatial BI, which aims at combining GIS and Business Intelligence (BI) technologies, i.e. combining spatial analysis and map visualization with proven BI tools, in order to better support the corporate data analysis process and to help companies make more informed decisions.

Business intelligence (BI) is a business management term which refers to applications and technologies that are used to gather, provide access to, and analyze data and information about company operations. BI applications are usually used to better understand historical, current and future aspects of business operations. They typically offer ways to mine database- and spreadsheet-centric data, and produce graphical, table-based and other types of analytics regarding business operations. Business intelligence systems can thus give companies a more comprehensive knowledge of the factors affecting their business, such as metrics on sales, production and internal operations, and they can help companies make better business decisions.

This is surely something your boss or client is interested in and may have asked you to investigate, but be aware that BI is a different and complex world!
Please forget what you know about classical databases!

This paper first provides a rapid introduction to some important BI concepts, then highlights the need for geospatial BI software and deals with the integration of the spatial component in a BI software stack in order to consistently enable geo-analytical tools. Finally, the work performed and the tools designed by the GeoSOA research group in this domain are presented.

A rapid introduction to BI

Business Intelligence applications rely on a complex software architecture that is usually composed of:

- An ETL tool, which Extracts data from different heterogeneous sources (transactional databases, web resources, XML or flat files, Excel spreadsheets, LDAP, sensors, etc.), Transforms these data (integration, data cleansing, restructuring, updating, etc.) according to a target schema/data structure and Loads them into a data warehouse.
- A data warehouse, which stores the organization's historical data for analysis purposes.
- An On-Line Analytical Processing (OLAP) server, which enables the rapid and flexible exploration and analysis of the large amounts of data stored in the data warehouse.
- On the client side, reporting tools, dashboards and/or different OLAP clients, which display information in a graphical and summarized form (tables, charts, etc.) to decision makers and managers. These tools offer capabilities to explore data interactively and thus support the analysis process.
- Optionally, data mining tools to automatically retrieve trends, patterns and phenomena in the data.

Figure 1 illustrates the typical infrastructure on which BI applications rely.

Figure 1: Classical architecture for deploying BI applications

The data warehouse plays a central and crucial role in this architecture. It is the repository of an organization's historical data. It is separate from the operational (OLTP, OnLine Transaction Processing) systems that act as data sources, but it is often stored in a relational DBMS such as Oracle, Microsoft SQL Server, PostgreSQL or MySQL. Data warehouses are optimized for handling large volumes of data (up to terabytes) and for providing fast responses to complex analytical queries (less than 10 s, so as not to interrupt the user's train of thought during the analysis process), whereas transactional databases are optimized for update speed. To this end, they rely on denormalized data schemas (e.g. star or snowflake schemas), which introduce some redundancy in order to avoid most of the time-consuming JOIN operations that usual analytical requests would otherwise involve. Data warehouses adopt a (multi)dimensional model, with one dimension per analysis axis.

As an example, Figure 2 illustrates the star schema of a data warehouse dealing with population statistics. It presents five dimensions:

- d_annee (the different population census periods);
- d_sexe (sex of the population, i.e. male or female);
- d_age (the different age classes of the population);
- d_statut_pop (status of the population);
- d_unite_geo (spatial dimension organised according to a country > region > province > economic region > division hierarchy). This last dimension table involves a lot of redundancy since, for each division, the values of the country, region, province and economic region fields are repeated.

The fact table, f_pop, stores the values of the three defined measures for each combination of dimension members: population (total number of persons), nb_naissance (number of births) and nb_deces (number of deaths).
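As a minimal sketch of the star schema described above, the following Python script builds a reduced version of it in an in-memory SQLite database and runs a typical analytical query against it. Only two of the five dimensions are kept, and everything except the table names and the three measure names is an assumption made for illustration: the column names of the dimension tables, the reduced geographic hierarchy and all the values are invented.

```python
import sqlite3

# Reduced, hypothetical version of the star schema (two dimensions only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_annee (id INTEGER PRIMARY KEY, annee INTEGER);
-- Denormalized dimension: the parent levels are repeated on every row.
CREATE TABLE d_unite_geo (
    id INTEGER PRIMARY KEY,
    pays TEXT, province TEXT, division TEXT
);
-- Fact table: one row per combination of dimension members.
CREATE TABLE f_pop (
    annee_id INTEGER REFERENCES d_annee(id),
    unite_geo_id INTEGER REFERENCES d_unite_geo(id),
    population INTEGER, nb_naissance INTEGER, nb_deces INTEGER
);
""")

# Invented sample data.
conn.executemany("INSERT INTO d_annee VALUES (?, ?)", [(1, 2001), (2, 2006)])
conn.executemany("INSERT INTO d_unite_geo VALUES (?, ?, ?, ?)", [
    (1, "Canada", "Quebec", "Division A"),
    (2, "Canada", "Quebec", "Division B"),
    (3, "Canada", "Ontario", "Division C"),
])
conn.executemany("INSERT INTO f_pop VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 100, 10, 8), (1, 2, 200, 20, 15), (1, 3, 300, 30, 25),
    (2, 1, 110, 11, 9), (2, 2, 210, 21, 16), (2, 3, 310, 31, 26),
])


def population_by_province(connection):
    """Aggregate the population measure from divisions up to provinces."""
    query = """
        SELECT a.annee, g.province, SUM(f.population)
        FROM f_pop AS f
        JOIN d_annee AS a ON a.id = f.annee_id
        JOIN d_unite_geo AS g ON g.id = f.unite_geo_id
        GROUP BY a.annee, g.province
        ORDER BY a.annee, g.province
    """
    return connection.execute(query).fetchall()


rows = population_by_province(conn)
for row in rows:
    print(row)
```

Because the province name is repeated on every division row of d_unite_geo, the query needs only one join per dimension, whatever the level of the hierarchy at which the data are aggregated.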
Figure 2: A typical data warehouse star schema

Figure 3 illustrates an excerpt of the f_pop fact table. All dimension members are referenced by means of foreign key values.

Figure 3: Excerpt of the f_pop fact table

Figure 4 shows how the different members of the spatial dimension are stored in the data warehouse.
Figure 4: The d_unite_geo dimension table (sample)

All data are thus interrelated according to the analysis axes, and the values of the measures are stored for each combination of dimension members. This refers to the OLAP data cube paradigm: each dimension can be seen as an axis of a cube (in an n-dimensional space). All data are stored in the data warehouse across time: existing records are not corrected. Summary (aggregate) data at different levels of detail and/or time scales are computed on the fly by the OLAP server or stored in the data warehouse.

So, a data warehouse focuses more on the analysis and correlation of large amounts of data than on retrieving or updating a precise set of data! This is a fundamental difference from the operational (OLTP) systems used in the day-to-day activities of a company, which log information without redundancy in systems such as transactional databases.

The contents of the data warehouse are often presented in a summarized form (e.g. key performance indicators, dashboards, OLAP client applications, reports). The data warehouse is thus primarily intended for analysts and decision makers. Figure 5 illustrates different tools used to present, explore and analyse data.
Figure 5: Dashboards, reporting and data mining tools provide different ways to represent, explore and analyse data (source: Pentaho web site, http://www.pentaho.com)

To query the data warehouse, these tools generally use the MDX query language implemented by the OLAP server. MDX stands for MultiDimensional eXpressions and defines a multidimensional query language. It is a de facto standard, introduced by Microsoft for SQL Server OLAP Services (now known as Analysis Services), and it is also implemented by other OLAP servers (Essbase, Mondrian) and clients (ProClarity, Excel PivotTables, Cognos, JPivot, etc.). MDX is to OLAP data cubes what SQL is to relational databases. An MDX query looks like a SQL query but relies on a different model (close to the one used in spreadsheets). Figure 6 provides an example of an MDX query and the representation of the results returned by such a query as a crosstab.

Figure 6: An MDX sample query and results displayed in a crosstab
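To make the relation between a multidimensional query and its crosstab more concrete, here is a small Python sketch. It is not the query of Figure 6: the MDX text, the cube name [Population], the cube cells and the helper functions are all hypothetical. The script places one dimension on the rows and one on the columns, and sums the measure over every dimension that is not placed on an axis, which is roughly what an OLAP server does when it evaluates such a query.

```python
from collections import defaultdict

# Hypothetical MDX query that the crosstab below roughly mimics. The cube
# and member names are assumptions, not taken from the paper.
MDX_QUERY = """
SELECT {[d_annee].Members} ON COLUMNS,
       {[d_sexe].Members} ON ROWS
FROM [Population]
WHERE ([Measures].[population])
"""

# Invented cube cells: (year, sex, age class) -> population.
CELLS = {
    (2001, "F", "0-14"): 50, (2001, "F", "15+"): 250,
    (2001, "M", "0-14"): 55, (2001, "M", "15+"): 240,
    (2006, "F", "0-14"): 48, (2006, "F", "15+"): 270,
    (2006, "M", "0-14"): 52, (2006, "M", "15+"): 260,
}


def crosstab(cells, row_axis, column_axis):
    """Sum the measure over every dimension not placed on an axis."""
    table = defaultdict(int)
    for key, value in cells.items():
        table[(key[row_axis], key[column_axis])] += value
    return dict(table)


def render(table):
    """Format the crosstab as tab-separated text, one line per row member."""
    rows = sorted({row for row, _ in table})
    columns = sorted({column for _, column in table})
    lines = ["\t".join([""] + [str(column) for column in columns])]
    for row in rows:
        cells = [str(table[(row, column)]) for column in columns]
        lines.append("\t".join([str(row)] + cells))
    return "\n".join(lines)


# Sex on the rows (index 1 of the key), year on the columns (index 0).
table = crosstab(CELLS, row_axis=1, column_axis=0)
print(render(table))
```

Swapping the two axis indices pivots the table, and adding the age class to one of the axes would correspond to a drill down.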
OLAP client software (analytical dashboards, reporting tools, etc.) offers alternate representation modes (pie charts, diagrams, etc.) and different tools to refine queries and explore data: drill down, roll up, pivot, etc. These tools are based on operators provided by the MDX query language and on complex logic implemented on the client side.

As it is commonly recognised that about 80% of data has a spatial component (Franklin, 1992), this spatial component can be used to enhance the BI user experience with map displays and spatial analysis tools, and thus to better support the analysis and decision processes.

Merging BI and GIS software

Let us imagine you are a decision maker in public health policy. With current BI solutions or GIS software, you will certainly have difficulty answering complex questions such as:

- Where are the urban spots that are more sensitive to heat waves, intense rain, flooding or droughts in a specific geographic area?
- How many people with cardiovascular, respiratory, neurological and psychological diseases will there be in 2025 and 2050 in a specific geographic area?
- How many people with low income live alone in a building requiring major repairs in a specific geographic area?

To answer such questions, you can use:

- GIS software, but this implies writing very complex SQL queries. This is sometimes a long and hard job which requires dedicated human resources. Moreover, the job needs to be done anew every time the data change or new analyses are needed.
- Classical BI tools (OLAP clients, reporting tools), but they are often unable to handle the spatial dimension of data or only provide very basic support for it.

Yet some phenomena can only be adequately observed and interpreted by representing them on a map! This is especially true when you want to observe the spatial distribution of a phenomenon or its spatiotemporal evolution.
Geospatial BI, which combines GIS and Business Intelligence (BI) technologies, has thus recently stirred marked interest for the huge potential of coupling spatial analysis and map visualization with proven BI tools and techniques such as data warehousing, On-Line Analytical Processing (OLAP), reporting tools, dashboards and data mining.

Tools recently made available on the market rely on a loose coupling between existing GIS software and proven BI components (e.g. ESRI with SAP, MapInfo with IBM/Cognos, or Microsoft with Analysis Services and Virtual Earth). They provide initial solutions for displaying maps with summarized/aggregated information stemming from the BI infrastructure, while the GIS data have to be stored and managed in a separate, transactional geospatial DBMS or GIS. These solutions therefore require geospatial and corporate data to be managed in different systems, which demands additional effort, resources and costs to feed and maintain them consistently. They also do not take full advantage of the powerful analytical capabilities of a classical BI infrastructure and hence are usually unable to handle very large data volumes such as those currently encountered in BI applications. Finally, this loose coupling often requires the development of dedicated applications each time a new analytical need emerges in the company.

The geometry data type on which geospatial data rely is not handled like any other data type in the BI infrastructure, and connections with the GIS have to be carefully initiated and maintained. Drill-down and roll-up capabilities on the analytical data (to observe data at different levels of detail, time or scale) are often not supported by the map display because they are not intrinsic operators available in GIS. This is mainly due to the transactional structure of geospatial data in the underlying GIS software. Dimensional data structures are more efficient at answering complex analytical queries quickly, where a transactional system would have required numerous time-consuming join queries. These dedicated data structures make it possible to answer complex analytical queries within a limit of 5 to 10 seconds, which does not interrupt the train of thought of decision makers while they explore and analyse the data in an analytical dashboard or in a report generated on the fly.

It is thus necessary to consistently integrate the geospatial component into all parts of the BI architecture. As Figure 7 illustrates, such an integration requires spatially enabling all components of the architecture.

Figure 7: Integrating the spatial component and its functionalities into a classical BI infrastructure

This requires injecting spatial capabilities into ETL tools (e.g. adding support for reading/writing GIS file formats or for coordinate transformations and spatial reference systems) so that they become actual Spatial ETL tools. OLAP servers should be extended to become actual SOLAP (Spatial On-Line Analytical Processing) servers. SOLAP is more a concept than a precise software product! SOLAP should bring consistent handling of geospatial features, map displays and spatial analysis capabilities.
SOLAP servers and clients should allow rapid and easy navigation within spatial data warehouses and offer many levels of information granularity, many themes, many epochs and many display modes that are synchronized or not: maps, tables and diagrams (adapted from Rivest et al., 2005).

In this perspective, and in order not to reinvent the wheel, the GeoSOA research group (http://geosoa.scg.ulaval.ca) at Laval University, Quebec, Canada has started to consistently and completely integrate geospatial functionalities into an existing, mature, efficient and reputed open source BI software stack. A complete open source BI software stack is indeed now offered by Pentaho (http://www.pentaho.org). It includes:
- An Extract, Transform and Load (ETL) tool (Kettle), used to integrate data from heterogeneous sources into a data warehouse;
- An OLAP server (Mondrian), which provides multidimensional query facilities on top of the data warehouse;
- Reporting and dashboard tools, used to present data to analysts in a user-friendly manner.

The integration of the Pentaho software suite with open source GIS components has thus been investigated in order to create a complete, spatially enabled BI solution. This work has led to the implementation of GeoKettle, GeoMondrian and SOLAPLayers (formerly known as Spatialytics).

The geospatial BI software suite

GeoKettle

GeoKettle is a "spatially enabled" version of Pentaho Data Integration (PDI, formerly known as Kettle). It is a powerful, metadata-driven spatial ETL tool dedicated to the integration of different spatial data sources for building and updating geospatial data warehouses. GeoKettle consistently integrates the geospatial component into PDI and thus enables the transparent handling of the geometry data type like any other classical data type (strings, numbers, dates, etc.). All transformations available in Kettle are thus able to deal with the geometry data type. Some dedicated geospatial steps have also been added. It is possible to access Geometry objects in JavaScript and to define custom transformation steps ("Modified JavaScript Value" step). Topological predicates (intersects, crosses, etc.) have all been implemented. GeoKettle has been released under the LGPL and is available at http://www.geokettle.org. Figure 8 illustrates the GeoKettle user interface.

Figure 8: The GeoKettle GUI showing a basic geospatial data transformation
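The idea of carrying a geometry through an ETL flow like any other field, and of filtering rows with a topological predicate, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use GeoKettle or its API, the row layout, the step functions and the data are hypothetical, and the "within" predicate is a simple ray-casting test restricted to points and a single polygon ring.

```python
# Hypothetical region of interest: a square ring given as (x, y) vertices.
REGION = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]


def extract():
    """Stand-in for reading rows from a spatial source (invented data).

    The geometry is just one more field of the row.
    """
    yield {"name": "Store A", "sales": 120, "geom": (2.0, 3.0)}
    yield {"name": "Store B", "sales": 80, "geom": (15.0, 5.0)}
    yield {"name": "Store C", "sales": 200, "geom": (9.0, 9.5)}


def within(point, ring):
    """Ray-casting test: is the point strictly inside the polygon ring?"""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_within(rows, ring):
    """Transformation step: keep the rows whose geometry lies in the ring."""
    for row in rows:
        if within(row["geom"], ring):
            yield row


def load(rows):
    """Stand-in for writing the rows into a data warehouse table."""
    return list(rows)


warehouse = load(filter_within(extract(), REGION))
print([row["name"] for row in warehouse])
```

In a real spatial ETL tool the geometries would be full vector features (points, lines, polygons) in a declared spatial reference system, and the predicate would be supplied by the tool rather than hand-written.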
At present, the Oracle Spatial, PostgreSQL/PostGIS and MySQL DBMS and ESRI shapefiles are natively supported in read and write modes. MS SQL Server 2008, Ingres and IBM DB2 can be used, but this requires some workarounds. It is thus possible to build and feed complex and very large geospatial data warehouses with GeoKettle in these different DBMS. Spatial reference system management and coordinate transformations have been fully implemented. Native support for other geospatial DBMS (e.g. IBM DB2, MS SQL Server 2008, Ingres) and data formats (raster and vector based) will be implemented in the near future, as an active and growing community has formed around the project.

In addition, GeoKettle releases are aligned with those of PDI, so GeoKettle benefits from all new features provided by PDI. For instance, Kettle is natively designed to be deployed in cluster and web service environments. This makes GeoKettle well suited to deployment as a service (SaaS) in cloud computing environments such as those provided by Amazon EC2. It thus enables the scalable, distributed and on-demand processing of large and complex volumes of geospatial data in minutes for critical applications, without requiring a company to invest in an expensive IT infrastructure of servers, networks and software.

Upcoming features to be implemented in GeoKettle deal with:

- Cartographic preview (work in progress)
- Implementation of data matching and conflation steps in order to allow geometric data cleansing and comparison of geospatial datasets
- Read/write support for other DBMS and GIS file formats:
  o MapInfo (.tab or MIF/MID), KML, GeoJSON, GeoRSS, ESRI Geodatabase, ArcSDE
  o Native support for MS SQL Server 2008 and Ingres
  o WFS, Sensor Web (TML, SensorML, SOS, ...)
- Implementation of a spatial analysis step with a GUI

GeoMondrian

GeoMondrian is a "spatially enabled" version of Pentaho Analysis Services (Mondrian). GeoMondrian brings to the Mondrian OLAP server what PostGIS brings to the PostgreSQL DBMS, or Oracle Spatial to the Oracle DBMS, i.e. consistent and powerful support for geospatial data. It has been released under the EPL and is available at http://www.geo-mondrian.org. As far as we know, it is the first implementation of a true Spatial OLAP (SOLAP) server... and it is an open source project!

It provides a consistent integration of spatial objects into the OLAP data cube structure, instead of fetching them from a separate spatial DBMS, web service or GIS file. It implements a native Geometry data type and provides initial spatial extensions to the MDX query language, which allow spatial analysis capabilities to be embedded into analytical queries. Figure 9 illustrates an example of such a geo-analytical query.
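The kind of geo-analytical query mentioned above, i.e. filtering the members of a spatial dimension by their distance from a feature and then aggregating a measure over the selected members, can be approximated in plain Python. The sketch below is not GeoMondrian code and does not reproduce its MDX extensions: the member names, the coordinates (assumed to be in metres in a projected reference system), the population values and the function name are all invented for illustration.

```python
import math

# Hypothetical members of a spatial dimension, reduced to point geometries.
GEO_MEMBERS = {
    "Division A": (0.0, 0.0),
    "Division B": (3000.0, 4000.0),
    "Division C": (30000.0, 40000.0),
}

# Invented measure values, one per member.
POPULATION = {"Division A": 100, "Division B": 200, "Division C": 300}


def members_within(members, feature, max_distance):
    """Return the names of the members located within max_distance of feature."""
    return [
        name
        for name, position in sorted(members.items())
        if math.dist(position, feature) <= max_distance
    ]


# Reference feature and search radius (10 km), both assumptions.
FEATURE = (0.0, 0.0)
selected = members_within(GEO_MEMBERS, FEATURE, 10000.0)
total = sum(POPULATION[name] for name in selected)
print(selected, total)
```

In a SOLAP server the filter is expressed inside the analytical query itself, so the spatial selection and the aggregation of the measure are evaluated together against the data cube.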
Figure 9: A sample query demonstrating the spatial extensions to the MDX language brought by GeoMondrian (filter spatial dimension members based on distance from a feature)

These geospatial extensions to the MDX query language provide many more possibilities, such as:

- in-line geometry constructors (from WKT-encoded geometry strings)
- member filters based on topological predicates (intersects, contains, within, etc.)
- spatial calculated members and measures (e.g. aggregates of spatial features, buffers)
- calculations based on scalar attributes derived from spatial features (area, length, distance, etc.)

At present, GeoMondrian only supports PostgreSQL/PostGIS data warehouses, but other DBMS should be supported soon.

SOLAPLayers

Formerly known as Spatialytics, SOLAPLayers is a lightweight web cartographic component which enables navigation in geospatial (Spatial OLAP or SOLAP) data cubes, such as those handled by GeoMondrian. It aims to be integrated into existing dashboard frameworks in order to produce interactive geo-analytical dashboards. Such dashboards support the decision-making process by including the geospatial dimension in the analysis of enterprise data. The first version of SOLAPLayers stems from a GSoC 2008 project performed under the umbrella of OSGeo. It has been released under the BSD (client part) and EPL (server part) licences. The source code can be downloaded from the project homepage at http://www.solaplayers.org.

SOLAPLayers is based on the OpenLayers (http://www.openlayers.org) web mapping client and uses olap4j (http://www.olap4j.org) to connect to OLAP data sources. For now, it requires GeoMondrian in order to display the members of a geospatial dimension on a map.
SOLAPLayers thus allows:

- the connection with a Spatial OLAP server such as GeoMondrian;
- the navigation in geospatial data cubes;
- the cartographic representation of some measures and members of a geospatial dimension as static or dynamic choropleth maps and proportional symbols (for now).

Figure 10 illustrates the web interface of a basic application that uses SOLAPLayers. This demo application is available online at http://geosoa.scg.ulaval.ca/spatialytics/. It demonstrates the interaction with GeoMondrian and how cartographic navigation in the geospatial data cube (drill down, roll up) is performed.
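To illustrate what representing a measure as a choropleth map involves, the following Python sketch assigns a fill colour to each member of a geospatial dimension according to the value of a measure. It is not SOLAPLayers code. The equal-interval classification is an assumption (the paper does not state which classification method the component uses), and the palette, member names and values are invented.

```python
# Hypothetical three-class colour ramp, from light to dark.
PALETTE = ["#fee8c8", "#fdbb84", "#e34a33"]

# Invented measure values for the members of a geospatial dimension.
POPULATION = {
    "Division A": 100,
    "Division B": 200,
    "Division C": 300,
    "Division D": 400,
}


def equal_interval_class(value, low, high, n_classes):
    """Return the index of the equal-width class that contains value."""
    if high == low:
        return 0
    index = int((value - low) * n_classes / (high - low))
    # The maximum value falls on the upper bound: keep it in the last class.
    return min(index, n_classes - 1)


def choropleth_styles(measure_by_member, palette):
    """Map each member to the fill colour of its class."""
    low = min(measure_by_member.values())
    high = max(measure_by_member.values())
    return {
        member: palette[equal_interval_class(value, low, high, len(palette))]
        for member, value in measure_by_member.items()
    }


styles = choropleth_styles(POPULATION, PALETTE)
for member, colour in sorted(styles.items()):
    print(member, colour)
```

A web mapping client would then apply each colour as the fill style of the corresponding map feature. After a drill down or a roll up, the styles are simply recomputed from the measure values of the new set of members.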
Figure 10: SOLAPLayers demo application

Features in development for SOLAPLayers deal with:

- More map-driven OLAP navigation operators (drill by position, by member, roll up to parent, etc.)
- Dimension member selection / navigation controls
- Legend display
- New thematic mapping styles:
  o Choropleth: quantiles, other statistical distributions
  o Graphics: histograms, pie charts, ...
  o Styles for other geometry types (lines and points)
  o Some styles or combinations of styles allowing the representation of multiple members/measures on a single map feature
  o Multi-maps: maps for different periods of time

Conclusion

This article has highlighted the need for geospatial BI software and has emphasized that spatially enabling a BI software stack requires consistently integrating the spatial component and its functionalities into each component of the BI infrastructure. The work performed by the GeoSOA research group, which has led to the release of three open source building blocks of a consistent and powerful geo-BI software stack (GeoKettle, GeoMondrian and SOLAPLayers), has been presented.
Based on these key software components, future work deals with the design of a geo-analytical dashboard framework. Indeed, in order to easily design and deliver dashboards which embed geospatial components and representations, a highly customisable and flexible geo-analytical dashboard framework is required. A first integration of SOLAPLayers with JasperServer (http://www.jaspersoft.com/jasperserver) and iReport (the graphical report designer for JasperReports, http://www.jaspersoft.com/ireport) has recently been performed in the GeoSOA research group. The result of this integration allows information to be displayed in different ways (maps, charts and tables) and keeps the different representations synchronised when the user drills down or rolls up on the map or the charts. Figure 11 illustrates the interface of a sample application based on this first integration work.

Figure 11: A sample web application integrating SOLAPLayers and JasperServer components

Even more recently, some experiments dealing with the integration of SOLAPLayers into the Pentaho CDF (Community Dashboard Framework, http://code.google.com/p/pentaho-cdf) have been performed in the context of a Google Summer of Code (GSoC) 2009 project, under the umbrella of OSGeo (http://www.osgeo.org) and mentored by Dr. Thierry Badard (http://geosoa.scg.ulaval.ca/en/index.php?module=pagemaster&page_user_op=view_page&page_id=20). The integration work performed by the student during the GSoC period allows the SOLAPLayers cartographic component to be displayed together with a pivot table component in a CDF dashboard. Synchronisation between the map and the pivot table has been implemented in both directions (e.g. a drill down on the map results in a drill-down operation in the table, and conversely).
Further work is required in order to integrate the SOLAPLayers component into CDF more properly and consistently, but this represents a good and promising first step towards the design of a highly customisable and flexible geo-analytical dashboard framework. A live demo of the integration work performed by the student will be available shortly on the GeoSOA website (http://geosoa.scg.ulaval.ca). The source code will also be available shortly in the GSoC 2009 repository.

For further reading about the research challenges dealing with the integration of the spatial component in BI tools and the design of intelligent mobile applications for better decision support, the reader is invited to consult the following presentation: http://geosoa.scg.ulaval.ca/~badard/ download.php?url=ogrs2009 towards_mobile_solap_infrastructure tbadard_et_edube final.pdf. These research challenges are currently part of the research agenda of the GeoSOA research group.

References

Franklin, C., 1992. An Introduction to Geographic Information Systems: Linking Maps to Databases. Database, April, pp. 13-21.

Rivest, S., Y. Bédard, M.-J. Proulx, M. Nadeau, F. Hubert & J. Pastor, 2005. SOLAP: Merging Business Intelligence with Geospatial Technology for Interactive Spatio-Temporal Exploration and Analysis of Data. Journal of International Society for Photogrammetry and Remote Sensing (ISPRS), "Advances in spatio-temporal analysis and representation", Vol. 60, No. 1, pp. 17-33.

Keywords: spatial data warehousing, spatial OLAP, SOLAP, geospatial business intelligence, geo-analytic dashboard

Summary

As it is commonly recognised that about 80% of all data stored in corporate databases has a spatial component [Franklin 1992], this spatial component can be used to enhance the BI user experience with map displays and spatial analysis tools. Moreover, some phenomena or trends in the data can be observed and adequately interpreted only if they are represented on a map (e.g. the spatial distribution or spatiotemporal evolution of a given phenomenon).
Geospatial BI, which combines GIS and Business Intelligence (BI) technologies, has thus recently stirred marked interest for the huge potential of coupling spatial analysis and map visualization with proven BI tools and techniques such as data warehousing, On-Line Analytical Processing (OLAP), reporting tools, dashboards and data mining. Such tools better support the corporate data analysis process and help companies make more informed decisions. This paper provides a rapid introduction to some important BI concepts, then highlights the need for geospatial BI software and deals with the integration of the spatial component in a BI software stack in order to consistently enable geo-analytical tools. Finally, the work performed and the tools designed by the GeoSOA research group in this domain are presented.

Biography

Dr. Thierry Badard is CTO of Spatialytics, a new company in geospatial BI. He is also a professor of geoinformatics at Laval University (Canada), where he heads the GeoSOA research group. He is a regular researcher of the CRG and of the GEOIDE NCE. He has more than 13 years of experience and has been involved in major national and international R&D projects. He acts as a chair, editor and reviewer for several international journals and scientific conferences. Dr. Thierry Badard is also actively involved in the geospatial free and open source community. He is an OSGeo charter member and a member of the OSGeo conference committee. A member of the board of the OSGeo Francophone chapter, he is also a founding co-chair of the OSGeo Quebec local chapter and of the ICA working group on open source geospatial technologies. For further details, please visit http://geosoa.scg.ulaval.ca.

Etienne Dubé is a research assistant in the GeoSOA research group, Laval University. He holds a Master's degree in Geomatic Science and a Bachelor's degree in Computer Engineering. He is the main developer of the GeoMondrian, SOLAPLayers and GeoKettle projects.