Schema transformation of Administrative Data with GeoKettle

Similar documents
Crazy NoSQL Data Integration with Pentaho

GeoKettle: A powerful open source spatial ETL tool

GeoKettle: A powerful spatial ETL tool for feeding your Spatial Data Infrastructure (SDI)

Open source geospatial Business Intelligence (BI) in action!

Open Source Business Intelligence Intro

Open Source Business Intelligence

Performance and Scalability Overview

Cloud Computing. With MySQL and Pentaho Data Integration. Matt Casters Chief Data Integration at Pentaho Kettle project founder

SAP Predictive Analytics 2.3 Supported Platforms (PAM)

Performance and Scalability Overview

Building Geospatial Business Intelligence Solutions with Free and Open Source Components

Pentaho Data Integration 4 and MySQL. Matt Casters: Pentaho's Chief Data Integration Kettle Project Founder

SAP Lumira 1.28 Product Availability Matrix (PAM)

Integrating Salesforce Using Talend Integration Cloud

SAP Lumira 1.29 Product Availability Matrix (PAM)

Enabling geospatial Business Intelligence

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

dbext for Vim David Fishburn h5p://

Data Integration for ArcGIS Users Data Interoperability. Charmel Menzel, ESRI Don Murray, Safe Software

Big Data Analytics - Accelerated. stream-horizon.com

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Data Doesn t Communicate Itself Using Visualization to Tell Better Stories

DATA MINING USING PENTAHO / WEKA

Chapter 6: Data Acquisition Methods, Procedures, and Issues

Informatica Data Replication FAQs

Integrating data in the Information System An Open Source approach

Review of SwisSQL Data Migration Tool

Apatar Quick Start Guide

JBoss Data Services. Enabling Data as a Service with. Gnanaguru Sattanathan Twitter:@gnanagurus Website: bushorn.com

Insight for location-powered decision making.

TRANSFORM BIG DATA INTO ACTIONABLE INFORMATION

Real Life Performance of In-Memory Database Systems for BI

WHITE PAPER. Domo Advanced Architecture

Organisaties groot en klein, beginnen zich meer en meer te realiseren dat inzicht in (real-time) data helpt

Pentaho Data Integration User Guide

Web Mapping in Archaeology

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

OWB Users, Enter The New ODI World

Toad Data Modeler - Features Matrix

Integrating Ingres in the Information System: An Open Source Approach

A DATA WAREHOUSE SOLUTION FOR E-GOVERNMENT

Power*Architect User Guide. SQL Power Group Inc. [

How to Import Data into Microsoft Access

Technology Foundations. Conan C. Albrecht, Ph.D.

Open Source GIS The Future?

NEMUG Feb Create Your Own Web Data Mart with MySQL

Business Intelligence on a Budget: Open Source BI. Paul O Rorke

Sisense. Product Highlights.

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

Institute of Computational Modeling SB RAS

Using Pentaho Data Integration (PDI) with Oracle Nabil Juwale Al Lopez. RMOUG Training Days February 11-13, 2013

Palo Open Source BI Suite

Open Source Business Intelligence Tools: A Review

ENTERPRISE EDITION ORACLE DATA SHEET KEY FEATURES AND BENEFITS ORACLE DATA INTEGRATOR

High-Volume Data Warehousing in Centerprise. Product Datasheet

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Geodatabase Programming with SQL

System requirements. Java SE Runtime Environment(JRE) 7 (32bit) Java SE Runtime Environment(JRE) 6 (64bit) Java SE Runtime Environment(JRE) 7 (64bit)

BI Studio User Manual

A Survey of ETL Tools

Oracle Business Intelligence Publisher. 1 Oracle Business Intelligence Publisher Certification. Certification Information 10g Release 3 (

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Crystal Reports XI Release 2 for Windows Service Pack 3

GetLOD - Linked Open Data and Spatial Data Infrastructures

Oracle Data Integrator. Knowledge Modules Reference Guide 10g Release 3 (10.1.3)

HTSQL is a comprehensive navigational query language for relational databases.

How, What, and Where of Data Warehouses for MySQL

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Zend Server 4.0 Beta 2 Release Announcement What s new in Zend Server 4.0 Beta 2 Updates and Improvements Resolved Issues Installation Issues

ORACLE DATA INTEGRATOR ENTEPRISE EDITION FOR BUSINESS INTELLIGENCE

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

How to Improve Database Connectivity With the Data Tools Platform. John Graham (Sybase Data Tooling) Brian Payton (IBM Information Management)

ICAB4136B Use structured query language to create database structures and manipulate data

HP Vertica Integration with SAP Business Objects: Tips and Techniques. HP Vertica Analytic Database

Workday Big Data Analytics

VBA and Databases (see Chapter 14 )

Migration and Developer Productivity Solutions Retargeting IT for Emerging Business Needs

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

Crystal Reports XI Release 2 - Service Pack 6

DevForce WinClient. DevForce Installation Guide

Mercury Users Guide Version 1.3 February 14, 2006

Is ETL Becoming Obsolete?

BusinessObjects Enterprise XI 3.1 for Windows

PUBLIC Performance Optimization Guide

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features

Archival Challenges Associated with the Esri Personal Geodatabase and File Geodatabase Formats

Big Data and Its Impact on the Data Warehousing Architecture

Transforming Data Integration from "Create" to "Connect"

- 1 - Guidance for the use of the WEB-tool for UWWTD reporting

User Guide. Analytics Desktop Document Number:

Parco Nazionale della Silla, Calabria, Italia

Data Integration and ETL with Oracle Warehouse Builder: Part 1

Data processing goes big

Oracle Application Express MS Access on Steroids

Transcription:

8th october 2013 Schema transformation of Administrative Data with GeoKettle 1/36 08/10/2013 ign.fr

2/36 08/10/2013

ETL definition Generalities ETL definition : An ETL tool is a type of software generally used to populate databases or data warehouses from heterogeneous data sources. GeoKettle is an open source ETL developed in Java by Spatialistics (license GNU Lesser General Public : LGPL). It can be downloaded from the web site (V2.5 actually available) : http://www.spatialytics.org/fr/projets/geokettle/ 3/36 08/10/2013

ETL definition Principles Generalities Generalities : GeoKettle is a spatially-enabled version of the generic ETL tool Kettle (Pentaho Data Integration). GeoKettle benefits from geospatial capabilities from Open Source libraries like JTS, GeoTools, deegree, OGR and, via a plugin, Sextante. This tool requires a Java Runtime Environment (JRE) version 5 or newer. It can be installed in Windows, Linux or Macintosh systems. 4/36 08/10/2013

5/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Vocabulary : Steps Hops 6/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Record - Transformations and jobs are recorded in XML format files. - File.ktr for the transformations, - File.kjb for the jobs. This is a specific format including required steps and their parameters. - Variables can be defined to reuse easily existing transformations with other data : - Other data : same structure but different date or area, - Variable : name of the file. 7/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Data schema Data sources GeoKettle tool : - reads data schemas - transforms data Transformed data - GeoKettle reads data sources and produced transformed data. - GeoKettle reads automatically the data schemas. - Rows can be previewed with a field table or a cartographic view. 8/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Steps Input The transformation steps are classified into categories : Output Transform Flow Scripting Lookup Joins Statistics extract and load data, change the format, transform data 9/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Steps : INPUT and OUTPUT categories for files File types INPUT OUTPUT CSV x CSV S3 x CSW catalog x x ESRI x Excel x x Fixed file x fxbase x GML x x Kettle x x KML x x LDAP x LDIF x Mondrian x OGR x x Property file x x RSS x x Salesforce x Shapefile x x SOS x Text file x x WFS x Xbase (DBF) x XML x x GML x x GML geometry only 10/36 08/10/2013 MapInfo GML 2 DXF

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Steps : database table types for INPUT, OUTPUT and UPDATE Apache Derby AS/400 Borland Interbase dbase III, IV ou V ExtenDB Firebird SQL Generic database Greenplum Gupta SQL Base H2 Hypersonic IBM DB2 Infobright Informix Ingres Intersystems Cache KingbaseES LucidDB MaxDB (SAP DB) 11/36 08/10/2013 MonetDB MS Access MS SQL Serveur MySQL Neoview Netezza Oracle Oracle RDB Palo MOLAP Serveur PostgreSQL Remady Action Request System SAP R/3 System SQLite Sybase SybaseIQ SybaseIQ Teradata UniVerse database Vertica

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples GeoKettle steps allow various transformations like data transfer, spatial analysis, CRS transformation, filter, script execution, schema transformation For schema transformation, steps are particulary used to : select, modify (name, type) or remove fields aggregate values calculate new fields filter rows (conditions) join rows (rows sorted with the key) map values replace field value split a field switch according to conditions 12/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Example : select, modify or remove fields - Rename : Steps used for schema : examples - Modify the field type : - integer string, - date string 13/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Example : calculate new fields - Default values can be defined : Mathematic operations Date format modifications Steps used for schema : examples String modifications = FR_IGNF_BDUniGE_AdministrativeUnits = unpopulated Geometric operations - And there are many calculation types : 14/36 08/10/2013 Geometry type modification

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Examples : filter rows According to one ore more conditions on fields and values, rows are distributes in two flows. If the condition is fulfilled (true), rows are sent to a next step. If the result is false, rows are sent to another one. switch depending on values Rows are distributes in the following steps depending on field values. There are so many output flows as defined values. 15/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Example : join rows - Two data sets are input. - The common keys are identified in the tables. - Rows must be sorted by their key before a joining. - All the fields are joined in output data. Steps used for schema : examples 16/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Example : map value Source values are replaced by target values. Steps used for schema : examples 17/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Example : split a field If several values are delimited by a separator in a field, the field can be split. Steps used for schema : examples 18/36 08/10/2013

Vocabulary Record Data schema Steps File types Database types Transformation steps Steps used for schema : examples Example : aggregate values This step is used to aggregate informations in a field. For example, with the Inspire AU, it s necessary to find all the municipalities included in the upper level unit. This step allows to group the municipalities by the code of the upper level units and to list all the municipalities for each upper code. You can aggregate values, or add numerical values, or calculate an average, 19/36 08/10/2013

20/36 08/10/2013

Context BDUni GE Inspire Transformations Difficulties Context The schema transformation I have tested relates to administrative data. My purpose was to transform internal data to Inspire structured data. Input : BDUniGE / Administrative theme (large scale topographic data of IGN, like BDTopo ) Administrative data extracted from a french unit département (Vendée) Output : Inspire Administrative Units 21/36 08/10/2013

Context Demonstration BDUni GE Inspire Transformations Difficulties Demonstration : Use of Geokettle to build the Inspire Identifier InspireId 22/36 08/10/2013

c c Context Demonstration BDUni GE Inspire Transformations Difficulties BDUni GE 23/36 08/10/2013

Inspire Context Inspire BDUni GE Transformations Difficulties 24/36 08/10/2013

Context Demonstration Inspire BDUni GE Transformations Difficulties Transformations done with GeoKettle 25/36 08/10/2013

1) Transformation to AdministrativeBoundary Context Demonstration Inspire BDUni GE Transformations Difficulties Inspire 26/36 08/10/2013

Transformation to AdministrativeBoundary Context Inspire BDUni GE Transformations Difficulties 27/36 08/10/2013

Context Demonstration Inspire BDUni GE Transformations Difficulties 2) Transformation to determine the Boundary link It s to prepare the transformation to AdministrativeUnits. There no direct link between units and boundaries in IGN structure. 28/36 08/10/2013

Transformation to determine the Boundary link Context Inspire BDUni GE Transformations Difficulties 29/36 08/10/2013

3) Transformation to AdministrativeUnits Context Demonstration Inspire BDUni GE Transformations Difficulties 30/36 08/10/2013

Transformation to AdministrativeUnits Context Inspire BDUni GE Transformations Difficulties 31/36 08/10/2013

Job for schema transformation of administrative data Context Inspire BDUni GE Transformations Difficulties 32/36 08/10/2013

Context Inspire BDUni GE Transformations Difficulties Encountered difficulties The input data for AU is not close to Inspire schema. It was difficult to implement the associations. Using an ETL tool requires a good knowledge of the available steps and what they can do. Find the various components, constitute various fields and values, and establish links, is relatively easy with GeoKettle. Steps allow to transform data schema. For this particular test, the highest difficulty is to produce Inspire structure with complex attributes (data types) in output. Data types can t be defined. GeoKettle is so not able to produce a gml file with a so complex structure today (with the existing output steps). 33/36 08/10/2013

34/36 08/10/2013

Advantages of GeoKettle use for schema transformation It s a very intuitive tool, simple to use. It s powerful and performant. The diversity of the provided steps is important. It reads automatically the schema from the data. Available steps are adapted to the schema transformation. Drawbacks of GeoKettle use for schema transformation Jobs and transformations are stored in xml files. The code of the transformations is not generated. So the transformations can t be executed from another platform. The Inspire complex structure can t be specified and ouput (voidvaluereason, inspireid, geographicalname, ). There is no help in the software, and step description is very light in the documentation. However, you can get support in the forum. 35/36 08/10/2013

Thank you for your attention. 36/36 08/10/2013