1 GeoKettle: A powerful spatial ETL tool for feeding your Spatial Data Infrastructure (SDI) Dr. Thierry Badard, CTO Spatialytics FOSS4G 2011 Workshop, Denver, CO, USA, September 12, 2011
2 Preamble
- These slides constitute the training material used for the GeoKettle workshop given by Spatialytics during the FOSS4G 2011 conference.
- They are available online in PDF format.
- They are released under the terms of the Creative Commons CC-BY-SA license.
3 Contents What is GeoKettle? Basic features of GeoKettle Installing GeoKettle Spatial features of GeoKettle Practical learning: Exercises Conclusion
4 What is GeoKettle?
- It is an open source spatial ETL tool.
- It is part of the geospatial BI software stack initially developed by the GeoSOA research group at Laval University in Quebec.
- The stack is now developed and supported by Spatialytics, which hosts the open source community and offers professional support, training and services, as well as Enterprise Editions that include support.
- The stack comprises:
  - GeoKettle
  - GeoMondrian
  - SOLAPLayers/GeoBIExt
5 What is Geospatial BI (GeoBI)?
- Want to know more about GeoBI and what this type of application can do for you? Please attend my presentation entitled "Building professional geo-analytical dashboards and reports with GeoBIExt".
  - Time slot: Friday, 11:00am - 11:30am
  - Room: Denver
- In this workshop, we will focus on GeoKettle capabilities and how they can make your everyday work easier when dealing with geospatial data, SDI, web services, GIS formats, spatial databases, ...
6 What is an ETL tool?
- A type of software used to populate databases or data warehouses from heterogeneous data sources
- ETL stands for:
  - Extract: extract data from data sources
  - Transform: transform data in order to correct errors, cleanse data, change the data structure, make it compliant with defined standards, etc.
  - Load: load transformed data into a target DBMS, service, file format, ...
- An ETL tool should manage the insertion of new data and the updating of existing data
- It should be able to perform transformations from:
  - an OLTP system to another OLTP system
  - an OLTP system to an analytical data warehouse
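As a rough illustration of the Extract / Transform / Load stages listed above, here is a minimal sketch using only the Python standard library. It does not show how GeoKettle works internally: the CSV content (with made-up figures), the `cities` table and the cleansing rule are all assumptions chosen for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a small CSV with inconsistent formatting
# and made-up population figures.
SOURCE_CSV = "name,population\n ottawa ,883391\nToronto,2615060\n,100\n"


def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse names, drop rows without a name, cast types."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        if not name:
            continue  # discard rows that fail the cleansing rule
        cleaned.append((name, int(row["population"])))
    return cleaned


def load(rows, connection):
    """Load: insert new rows and replace rows that already exist."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS cities "
        "(name TEXT PRIMARY KEY, population INTEGER)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO cities (name, population) VALUES (?, ?)", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
result = connection.execute(
    "SELECT name, population FROM cities ORDER BY name"
).fetchall()
```

In an ETL tool such as GeoKettle, each of these three functions would correspond to one or more steps configured in the GUI rather than to hand-written code.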
7 Why use an ETL tool?
- Automation of complex and repetitive data processing without writing any specific code
- Conversion between various data formats
- Migration of data from one DBMS to another
- Data feeding into various DBMS
- Population of analytical data warehouses for decision support purposes
- etc.
8 GeoKettle
- A "spatially-enabled" version of Pentaho Data Integration (Kettle)
- Kettle is a metadata-driven ETL tool with direct execution of transformations
  - No intermediate code generation!
- Kettle supports several DBMS and file formats
  - DBMS support: MySQL, PostgreSQL, Oracle, DB2, MS SQL Server, ... (total of 37)
  - Read/write support of various data file formats: text, Excel, Access, DBF, XML, ...
  - Various services/systems: LDAP, CRM, ...
- Numerous transformation steps
- A transformation is built in a GUI and can be seen as a chain of transformation steps
- Methods for the updating of databases and data warehouses
9 GeoKettle
- GeoKettle provides a true and consistent integration of the spatial component
  - All steps provided by Kettle are able to deal with geospatial data types
  - Some dedicated geospatial steps have been added (SRS, SOS, CSW, Spatial Analysis, ...)
  - This allows a powerful integration of corporate and spatial data
- First release in May 2008: Version on July 2009
- Current stable version: 2.0 stable (Sept. 2011)
- Released under LGPL
- Used in different organizations and countries: some ministries, public bodies, utilities, banks, insurance companies, integrators, ...
- A growing community of users and contributors
10 GeoKettle Online resources GeoKettle project page Shortcut: GeoKettle documentation (wiki) GeoKettle forum GeoKettle Trac GeoKettle plugins
11 Introduction to basic features of GeoKettle
12 Transformations (1/3)
- The ETL processes are named transformations
- Elements of a transformation are steps
- Links between steps are hops
- Parallel execution (threads) of steps
[Figure: a transformation with its steps and hops labelled]
14 Transformations (3/3)
- Hops link steps together and define the data flow
- To create a hop: drag and drop from one step to another with the middle mouse button pressed (or Shift + left button)
- In a hop:
  - data flows from the output of a step to the input of the next step, row by row
  - the field definition (number, names and types) is always the same from one row to another
- Different hop types:
  - copy
  - distribute
  - conditional output
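The difference between the copy and distribute hop types can be sketched in a few lines of Python. This is an illustration only, not GeoKettle code. It assumes that "copy" sends every row to every target step and that "distribute" deals rows out to the target steps in turn (round-robin); the function names are hypothetical.

```python
from itertools import cycle


def copy_rows(rows, n_targets):
    """Copy: every target step receives every row."""
    return [list(rows) for _ in range(n_targets)]


def distribute_rows(rows, n_targets):
    """Distribute: rows are dealt out to the target steps in turn."""
    targets = [[] for _ in range(n_targets)]
    for row, target in zip(rows, cycle(targets)):
        target.append(row)
    return targets


# Five rows sharing the same field definition, as required within a hop.
rows = [{"id": i} for i in range(1, 6)]
copied = copy_rows(rows, 2)
distributed = distribute_rows(rows, 2)
```

With two target steps, copying gives each step all five rows, while distributing gives the first step rows 1, 3 and 5 and the second step rows 2 and 4.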
16 The different GeoKettle tools
- Spoon: GUI for editing transformations and jobs
- Pan: command line interface for running transformations
- Kitchen: command line interface for running jobs
- Carte: web service for the remote execution of transformations and jobs
  - Allows transformations and data integration processes to be exposed and run as web services
  - Remote execution and running of transformations in a cluster environment (i.e. in the cloud)
17 Repository
- Transformations and jobs are usually saved in XML files (.ktr/.kjb)
- Alternatively, they can be saved in a database repository and hence be shared between users more easily
  - Transformations, jobs and connection parameters to DBMS are stored in a dedicated database
  - See the first pop-up window when running GeoKettle
- Enables the preservation/centralisation of knowledge about data integration processes inside the company
18 Installing & compiling GeoKettle
19 Compiling GeoKettle?
- To get all the latest features of GeoKettle, get the source code and compile GeoKettle!
- Requirements:
  - Subversion client (Eclipse Subversive or TortoiseSVN)
  - Java JDK version 5 or higher
  - Apache Ant (http://ant.apache.org)
- 3 steps:
  % svn co 2.0/trunk geokettle
  % cd geokettle
  % ant
- Optionally:
  - % ant zip to build a binary distribution archive of GeoKettle
  - % ant zip plugins to build a binary distribution archive of GeoKettle including selected plugins
20 Installation procedure
- Available (2.0-RC1) on the OSGeo Live DVD, but we will use the 2.0 stable version in the workshop
- Very simple installation procedure without the installer
  - See the documentation on the GeoKettle wiki
- Even simpler with the new installer!
- Prerequisites: all you need is a Java Runtime Environment, version 5.0 or higher
- Start the OSGeo Live Virtual Machine (if not already done)
- Download and start the installer inside the VM
- When done, double-click the GeoKettle icon on the desktop to run it
- Please wait for instructions when the first window (repository selection) pops up!
21 Spatial features of GeoKettle
22 Transparent spatial support
- Consistent and transparent integration of the geometry data types:
  - Vector geometry (based on the JTS point-line-polygon model)
- Transparent conversions between data types:
  - Geometry <-> String: from and to WKT
  - Geometry <-> Binary: from and to WKB
- Native I/O support for some spatial DBMS (via JDBC or through GDAL/OGR)
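To make the WKT and WKB conversions mentioned above more concrete, here is a minimal sketch for a single 2D point, written with the Python standard library only. GeoKettle itself relies on JTS for this; the function names below are hypothetical and the sketch handles Point geometries only.

```python
import struct


def point_to_wkt(x, y):
    """Serialise a 2D point as Well-Known Text."""
    return f"POINT ({x!r} {y!r})"


def point_to_wkb(x, y):
    """Serialise a 2D point as little-endian Well-Known Binary.

    Layout: 1 byte-order flag (1 = little endian), a uint32 geometry
    type (1 = Point), then the x and y coordinates as float64.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data):
    """Parse a WKB Point back into an (x, y) tuple."""
    endian = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError("only Point geometries are handled in this sketch")
    return x, y


wkt = point_to_wkt(-75.7, 45.4)
wkb = point_to_wkb(-75.7, 45.4)
round_trip = wkb_to_point(wkb)
```

The WKB form is 21 bytes long for a 2D point, whereas the WKT form is human-readable; both carry the same coordinates.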
23 Inputs / outputs
- Read/write support for spatial DBMS:
  - PostgreSQL/PostGIS (native)
  - MySQL spatial (native)
  - Oracle Spatial / Locator (native)
  - ESRI personal geodatabase*, Ingres*, Informix datablade*, ArcSDE*, SQLite/SpatiaLite (through GDAL/OGR)
    * requires valid licenses and GDAL/OGR re-compilation
  - MS SQL Server 2008, IBM DB2, ... (non native, requires hints)
- Read/write support for GIS file formats:
  - ESRI Shapefile, GML 3.1.1, KML 2.2
  - And all GIS file formats provided by GDAL/OGR: Arc/Info, GeoJSON, GeoConcept, GeoRSS, GML 2.x, GPX, KML 2.0, ...
24 Inputs / outputs
- Read/write support for geospatial web services:
  - CSW
  - SOS (read only)
  - No dedicated steps yet but possible: WFS, WMS, WPS, ...
  - We will see how in this workshop! ;-)
- On-the-fly preview/geopreview
  - Lets you check, on a smaller dataset, whether a transformation produces the expected results
  - Offers different widgets: pan, zoom, get object attributes, symbolization (color, opacity, ...)
  - Can preview streams with more than one geometry column
26 SRS & coordinates transformation
- Native support of Spatial Reference Systems (SRS) in the metadata of Geometry fields (based on the GeoTools referencing library)
- Coordinates transformation / change of Spatial Reference System: SRS Transformation step
- Assign an SRS to a data flow: Set SRS step
- Reading and writing of SRS metadata
  - Read SRS from data source: databases and GIS file formats
  - Validation of SRS when inserting data into PostGIS and Oracle; other DBMS do not support this feature yet!
  - Add the SRS info when writing data into GIS file formats
[Figure: coordinates transformation from (φ,λ) to (x,y)]
27 Practical learning: Exercises!
28 Before beginning the exercises...
- Start the OSGeo Live Virtual Machine (if not already done) and log in
- Download the archive containing data and solutions to the different exercises of this workshop
- Unzip the archive on your Desktop
- It contains 3 subdirectories:
  - data
    - input
    - output
  - solutions
    - transformations
      - exercise_0 to exercise_9
  - transformations
- We are now ready!
29 Exercise 0
- We will do this first exercise all together, step by step, in order to discover GeoKettle
- The aim of this exercise is to learn how to load an ESRI shapefile into a PostGIS database and have it published properly in GeoServer
- In this exercise we will play with the following new steps:
  - Shapefile File Input
  - Set SRS
  - Select Values
  - Add sequence
  - Table Output
30 Exercise 0
Design a transformation that:
- Reads the Shapefile contained in the ontario_names_shp data directory. It is a set of points that locate geo names for the whole province of Ontario in Canada (source: Geobase)
- Assigns the EPSG 4326 SRS code (WGS 84) to the data
- Filters the stream in order to keep only the the_geom, GEONAME, FEATUREID, CONCISTERM, GENERITERM and REGIONNAME attributes
- Adds an identifier (numeric incremental id) to objects
- Stores data into a geonames table of a geokettle database on your PostgreSQL/PostGIS instance
Finally, publish it in GeoServer
31 Exercise 0 Solution
32 Exercises
- From this point on, do the exercises by yourself
- The exercises get progressively more difficult
- The aim is not to follow the step-by-step procedures mentioned in the exercises
  - We want you to become more and more efficient and autonomous, and aware of how to do some tasks in GeoKettle
  - That's why instructions will be less and less detailed as we progress through the exercises
34 Exercise 1
Based on the previous transformation, design a new one that:
- Reads the Shapefile contained in the ontario_mrc_shp data directory. It is a set of polygons that represent some counties in the province of Ontario in Canada (source: Geobase)
- Converts the coordinates of the data from WGS84 to NAD83 (CSRS) / UTM Zone 17N
- Computes the area of each polygon and adds the value in a new field named area_meters
- Converts, by scripting, the area_meters values from m2 to km2 and stores this value in a new field named area
- Filters the stream in order to keep only the the_geom, COMMONAME1, LEGALNAME1 and DESIGNATN attributes, renamed respectively as the_geom, name, county_name and designation
- Converts the coordinates back to WGS84
- Adds an identifier (numeric incremental id) to objects
- Stores data into a municipalities table of a geokettle database on your PostgreSQL/PostGIS instance
Finally, publish it in GeoServer
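The unit conversion requested in this exercise is a simple division: 1 km = 1000 m, so 1 km2 = 1000 x 1000 = 1,000,000 m2. The sketch below shows the arithmetic in Python for illustration only; in GeoKettle the conversion would be written in a scripting step, and the function name here is hypothetical.

```python
M2_PER_KM2 = 1_000_000  # 1 km2 = 1000 m * 1000 m


def area_m2_to_km2(area_meters):
    """Convert an area expressed in square metres to square kilometres."""
    return area_meters / M2_PER_KM2


# A polygon of 2,500,000 m2 covers 2.5 km2.
area = area_m2_to_km2(2_500_000)
```

Note that the area is only meaningful in m2 because the data has first been reprojected to a metric SRS (UTM); an area computed directly on WGS84 coordinates would be in square degrees.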
35 Exercise 1
- Run this transformation in Spoon in order to test it
- When finished, try to run it with the pan command line tool
36 Exercise 1 Solution
37 Exercise 1 - Solution
./pan.sh -file=/home/user/desktop/geokettle_workshop/solutions/transformations/exercise_1/ex_1.ktr
38 Exercise 2
- The aim of this exercise is to learn:
  - A way to perform a spatial selection over geospatial features in GeoKettle
  - How to aggregate data in order to compute statistics and export them to an MS Excel file
  - How to create a job that performs the two previous tasks sequentially
- In this exercise we will play with the following new steps/job entries:
  - Filter rows
  - Join rows (cartesian product)
  - OGR File Input
  - Sort rows
  - Group by
  - Excel Output
  - Transformation
39 Exercise 2 Part 1
Design a transformation that:
- Reads data from the previous municipalities table and extracts the the_geom and name fields as muni_geom and muni_name fields
- Filters rows in order to keep only the county of Durham
- In parallel, reads data from a MapInfo TAB file located in the ontario_rrn_tab directory. It is an extract of the national road network from Geobase.ca.
- Selects only the roads that intersect the Durham county
- Sets the SRS of the data to WGS84
- Filters the stream in order to keep only the the_geom, ROADSEGID, ROADCLASS, RTNUMBER1 and RTENAME1EN attributes, renamed respectively as the_geom, id, class, number and name
- Adds an identifier (numeric incremental id) to objects
- Stores data into a roads table of a geokettle database on your PostgreSQL/PostGIS instance
Finally, publish it in GeoServer
40 Exercise 2 Part 2
Design a transformation that:
- Reads data from the previously created roads table
- Converts the coordinates of the data from WGS84 to NAD83 (CSRS) / UTM Zone 17N
- Computes, by script only, the length in km of each road segment and adds the value in a new field named length
- Aggregates (sum) the values of length for all roads of the same class and stores the total value in a new field named total_length
- Finally, exports the aggregated data into an Excel file
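The aggregation asked for here amounts to a "group by class, sum of length" operation. The following Python sketch shows the logic on a few made-up road segments; it is not the GeoKettle solution (which uses the Sort rows and Group by steps), and the sample values and function name are assumptions.

```python
from collections import defaultdict

# Hypothetical road segments: (class, length in metres after reprojection).
segments = [
    ("Freeway", 1500.0),
    ("Local", 250.0),
    ("Freeway", 500.0),
    ("Local", 750.0),
]


def total_length_by_class(segments):
    """Sum segment lengths (converted from m to km) for each road class."""
    totals = defaultdict(float)
    for road_class, length_m in segments:
        totals[road_class] += length_m / 1000.0  # metres to kilometres
    return dict(sorted(totals.items()))


total_length = total_length_by_class(segments)
```

Each key of the resulting dictionary plays the role of one output row, with the sum corresponding to the total_length field.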
41 Exercise 2 Job
- Design a job that performs the two previous tasks sequentially
- Run it in Spoon
- Also try to run it with the Kitchen command line tool
42 Exercise 2 Part 1: Solution
43 Exercise 2 Part 2: Solution
44 Exercise 2 Job: Solution
45 Exercise 2 - Solution
./kitchen.sh -file=/home/user/desktop/geokettle_workshop/solutions/transformations/exercise_2/ex_2.kjb
46 Exercise 3
- The aim of this exercise is to learn how to:
  - retrieve data from a WFS service
  - perform some geo-processing operations with the Sextante plugin
  - export the result to two different file formats: KML and MapInfo
- In this exercise we will play with the following new steps/job entries:
  - Sextante plugin
  - OGR Output
  - KML Output
  - HTTP
47 Exercise 3 Job
Design a job that:
- Requests the municipalities data in GML 2 from the GeoServer WFS hosted on your VM. Use the layer preview in GeoServer in order to retrieve the GET request to send.
- Runs a transformation that we will define in the next slide
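For readers unfamiliar with WFS, the GET request mentioned above is a plain URL with key-value parameters. The sketch below builds such a GetFeature URL in Python. The parameter names follow the WFS 1.0.0 key-value encoding, but the base URL, port and the `geokettle:municipalities` layer name are assumptions; the actual request should be copied from the GeoServer layer preview as the slide says.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getfeature_url(base_url, type_name):
    """Build a WFS 1.0.0 GetFeature request asking for GML 2 output."""
    params = {
        "service": "WFS",
        "version": "1.0.0",
        "request": "GetFeature",
        "typeName": type_name,
        "outputFormat": "GML2",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, for illustration only.
url = build_getfeature_url(
    "http://localhost:8080/geoserver/wfs", "geokettle:municipalities"
)

# Reading the query string back shows the parameters the server will receive.
query = parse_qs(urlsplit(url).query)
```

In the job, this URL is what the HTTP job entry would fetch and save as a GML file for the transformation to read.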
48 Exercise 3 Transformation
Design a transformation that:
- Reads the GML file extracted from the WFS
- Removes holes from the polygons and stores the new geometry of objects in a result_geom field
- Filters the stream in order to keep only the gml_id, name, county_name, designation, area and result_geom fields
- Keeps only the rows that have a valid, non-null geometry
- Stores the resulting stream in a KML file and a MapInfo MIF/MID file
49 Exercise 3 Job: Solution
50 Exercise 3 Transform.: Solution
51 Exercise 4
- The aim of this exercise is to learn how to extract some POI from an OSM data file
- Listen to the instructor, who will explain how an OSM data file is structured
- In this exercise we will play with the following new steps:
  - Get data from XML
52 Exercise 4
Design a transformation that:
- Extracts POI data from the OSM data file located in the ottawa_osm directory
- Sets the SRS of the data to WGS84
- Exports the result as an ESRI shapefile
Finally, publish it in GeoServer
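To give an idea of what extracting POI from OSM XML involves, here is a small Python sketch. It assumes the usual OSM XML layout (`node` elements with `lat`/`lon` attributes and nested `tag` elements with `k`/`v` attributes) and treats any node carrying an `amenity` tag as a POI, which is only one possible definition. The sample document and function name are made up for the example; in GeoKettle the same work is done with the Get data from XML step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal OSM extract.
OSM_SAMPLE = """<osm version="0.6">
  <node id="1" lat="45.4215" lon="-75.6972">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="2" lat="45.4220" lon="-75.6980"/>
</osm>"""


def extract_pois(osm_xml, key="amenity"):
    """Return nodes that carry the given tag key, with a WKT geometry."""
    root = ET.fromstring(osm_xml)
    pois = []
    for node in root.iter("node"):
        tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
        if key not in tags:
            continue  # untagged nodes are usually just way vertices
        lon, lat = float(node.get("lon")), float(node.get("lat"))
        pois.append(
            {
                "id": node.get("id"),
                "name": tags.get("name"),
                "type": tags[key],
                "wkt": f"POINT ({lon} {lat})",  # WGS84: x = lon, y = lat
            }
        )
    return pois


pois = extract_pois(OSM_SAMPLE)
```

OSM coordinates are expressed in WGS84, which is why the exercise assigns that SRS before writing the shapefile.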
53 Exercise 4 Solution
54 Exercise 5
- The aim of this exercise is to learn how to:
  - Extract sensor data from an SOS
  - Perform some spatial computation with the Spatial Analysis step
  - Retrieve some metadata on the data stream
  - Push these metadata into a CSW
- Listen to the instructor, who will explain how to proceed with the SOS and CSW steps
- In this exercise we will play with the following new steps:
  - SOS Input
  - Spatial Analysis
  - CSW Output
55 Exercise 5
Design a transformation that:
- Retrieves GAUGE_HEIGHT measures from the SOS service given by the instructor
- Removes the rows whose measure value is <= 30
- Groups rows by procedure
- Computes the envelope of each resulting geometry
- Retrieves and sets some mandatory metadata (MD_METADATA profile)
- Finally, publishes the metadata in GeoNetwork
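The filter, group and envelope steps of this exercise can be sketched as follows in Python. This is an illustration of the logic only: the observations are made up, the sensor names are hypothetical, and the envelope is computed as a plain bounding box (min x, min y, max x, max y) of the point locations in each group.

```python
from collections import defaultdict

# Hypothetical observations: (procedure, x, y, gauge height).
observations = [
    ("sensor-a", 1.0, 2.0, 35.0),
    ("sensor-a", 3.0, 1.0, 40.0),
    ("sensor-a", 9.0, 9.0, 30.0),  # removed: value is not above 30
    ("sensor-b", 5.0, 5.0, 31.0),
]


def envelopes_by_procedure(observations, threshold=30):
    """Drop values <= threshold, group by procedure, compute bounding boxes."""
    groups = defaultdict(list)
    for procedure, x, y, value in observations:
        if value > threshold:
            groups[procedure].append((x, y))
    envelopes = {}
    for procedure, points in groups.items():
        xs, ys = zip(*points)
        envelopes[procedure] = (min(xs), min(ys), max(xs), max(ys))
    return envelopes


envelopes = envelopes_by_procedure(observations)
```

The resulting extent per procedure is the kind of information that can then be written into the metadata record published to the catalogue.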
56 Exercise 5 Solution
57 Exercise 6 The aim of this exercise is to know how to harvest metadata from a CSW compliant service In this exercise we will play with the following new steps: CSW Input Dummy
58 Exercise 6
Design a transformation that:
- Harvests metadata from the geocat.ch online catalog
- Filters the metadata that deal with datasets
- For each metadata row, computes by script the extent of the dataset
- Exports the BriefRecord_title, BriefRecord_type and the extent to a new PostGIS table named meta_extent
Finally, publish this new table in GeoServer
59 Exercise 6 Solution
60 Exercise 7
- The aim of this exercise is to learn how to call a process hosted in a WPS compliant service
- In this exercise, we will create a new layer from our polygon layer (municipalities) hosted in GeoServer by applying a Centroid WPS service to each polygon
- In this exercise we will play with the following new steps:
  - Add constants
  - HTTP Client
61 Exercise 7 Job
Design a job that:
- Requests the municipalities data in GML 2 from the GeoServer WFS hosted on your VM. Use the layer preview in GeoServer in order to retrieve the GET request to send.
- Runs a transformation that we will define in the next slide
62 Exercise 7 Transformation
Design a transformation that:
- Reads the GML file extracted from the WFS
- For each row, calls the Centroid service hosted in the Zoo WPS instance on your VM
- Stores the result in a new table named muninames in your PostGIS DBMS instance
Finally, publish it in GeoServer.
63 Exercise 7 Job: Solution
64 Exercise 7 Transform.: Solution
65 Exercise 8 Based on exercise 4, design a transformation that extracts the road network from the Ottawa OSM data file In this exercise we will play with the following new steps: Shapefile File Output
66 Exercise 8 Solution
67 Exercise 9
- The aim of this exercise is to learn how to:
  - Retrieve location information from some Twitter tweets
  - Call the geonames gazetteer service in order to retrieve lat/lon information for tweets that have no geo tag
- Listen to the instructor, who will explain how the Twitter and geonames services work
- In this exercise we will play with the following new steps:
  - Unique rows (HashSet)
  - Generate rows
68 Exercise 9
Design a transformation that:
- Retrieves tweets mentioning the #foss4g tag
- For each tweet, checks whether geo info is present
- If not, uses the location info and calls the geonames.org gazetteer in order to retrieve the lat/lon of this location
- Stores the result in a new table named tweets in your geokettle database in the PostGIS DBMS
Finally, publish it in GeoServer
69 Exercise 9 Solution
71 Upcoming features
- Versions 2.x will be the last versions of GeoKettle based on the Kettle 3.2 code base.
- Thanks to the tremendous work of the Kettle developers, future versions of GeoKettle will be more pluggable with Kettle.
- Hence, it will be possible to add the spatial extensions provided by GeoKettle to any Kettle/PDI 4.x installation.
- Taking advantage of this architecture switch, we want to re-engineer the Geometry data type. At present, it only supports 2D data. We want to add support for:
  - X,Y,Z,t and M data
  - LiDAR data
  - Linear referencing
  - Raster data
72 Upcoming features
- So many tasks can be automated with GeoKettle. We can think about many new steps in future releases...
- But, you know, the roadmap can be influenced by opportunities...
- So, we are open to your ideas, opportunities and possible sponsoring to have your required feature implemented
- Spatialytics can also:
  - Provide support (1st and 2nd line through partners)
  - Provide advanced training
  - Be your partner in tenders
  - ...
73 Upcoming features
Additional, non-exhaustive list of steps/jobs that could be envisaged:
- Additional geometric data cleansing and geo-processing capabilities: inclusion of some JCS/OpenJump conflation and topology checking/cleansing capabilities (GPL -> plugin)
  - Towards a geospatial data quality module to check and correct errors
- Read/write support for other DBMS, GIS file formats and services
  - NetCDF, SDMX, Linked Geodata, ...
  - Native support for MS SQL Server 2008, Netezza spatial, NoSQL dbs, ...
  - Native support for WFS-T, WPS, WMS, Table Joining Service (TJS), ...
- Dedicated steps: social media (Twitter, ...), OSM, cartographic generalisation, geocoding and reverse geocoding, ...
- Direct publishing into GeoServer and MapServer
  - But also, why not see GeoKettle as a possible data source for these web servers...
- Raster support: re-initiating the development of a plugin to integrate all raster capabilities provided by the Sextante library (BeETLe project)
74 To learn more about GeoKettle
Do not hesitate to:
- Visit our web sites
- Subscribe to the monthly Spatialytics e-newsletter
- Follow us on Twitter and Facebook
- Check the documentation on the wiki
- Post your questions on the forum
- Submit a bug report or feature request on the GeoKettle Trac
- Contact us
75 Questions Contact info: Dr. Thierry Badard, CTO Spatialytics inc. Quebec, Canada Web: Twitter: tbadard, spatialytics Twitter : geokettle Twitter : geomondrian Twitter : solaplayer Twitter : geobiext