Data Vault & Pentaho in Healthcare
Kasper de Graaf, Aly Hollander
St. Antonius Ziekenhuis
- Nieuwegein / Utrecht, 3 locations
- 5,000 employees
- 1,100 beds
- 33 specialties
- 250 specialists
- 150 junior doctors
Santeon
The Santeon hospitals:
1. Canisius-Wilhelmina Ziekenhuis
2. Catharina Ziekenhuis
3. Martini Ziekenhuis
4. Medisch Spectrum Twente
5. Onze Lieve Vrouwe Gasthuis (OLVG)
6. St. Antonius Ziekenhuis
Healthcare is all about data
- Patient files, diagnostics, R&D
- DOT, care activities, appointments, procurement
- Like an ordinary business: not more difficult, but complex
EPR (Electronic Patient Record) system for St. Antonius Ziekenhuis
- Maintains and improves the quality of care
- Developed and maintained by the ICT department
- Based on web technology, open standards, and open source software
- Supports the primary health care process
- Modular system that can be tuned for various user profiles: doctors, nurses, or other health professionals
- More info: www.intrazis.org
IntraZis, Data Warehouse & Pentaho
- 2010: growing demand for (management) information
- IntraZis is not suitable for extensive queries
- The ICT department started a DWH and BI project
- MySQL and Pentaho were chosen, in the tradition of in-house development and open source
Data Vault & Pentaho
Data Vault ETL issues
- Many objects to load (hubs, links, satellites)
- Automation is (almost) required
We did NOT want to:
- Get rid of ETL tooling
- Code the ETL ourselves
- Manage too many ETL objects (so no generation of ETL mappings or transformations)
Our Solution
Use metadata to drive generic ETL transformations.
Result: The Kettle Data Vault Framework
- A set of generic ETL transformations
- Driven by metadata (currently in XLS, loaded into a MySQL database)
- A couple of configuration files
- A couple of ETL jobs and transformations to tie it all together
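To make the idea concrete, here is a minimal Python sketch (not the framework itself) of metadata-driven loading: one row of mapping metadata per Data Vault table drives a single generic dispatcher, instead of one hand-built transformation per table. The metadata fields and loader names are hypothetical; in the framework the metadata lives in the XLS sheet loaded into MySQL and is injected into Kettle transformations as parameters.

```python
# Hypothetical loaders; in the framework these are generic Kettle transformations.
def load_hub(meta): return f"loaded hub {meta['target']}"
def load_link(meta): return f"loaded link {meta['target']}"
def load_satellite(meta): return f"loaded satellite {meta['target']}"

LOADERS = {"hub": load_hub, "link": load_link, "satellite": load_satellite}

# Illustrative metadata rows: one row per target Data Vault table.
metadata = [
    {"type": "hub", "target": "hub_patient"},
    {"type": "satellite", "target": "sat_patient"},
    {"type": "link", "target": "lnk_patient_appointment"},
]

# One generic run: iterate the metadata and dispatch to the matching loader.
for meta in metadata:
    print(LOADERS[meta["type"]](meta))
```

Adding a new table to the warehouse then means adding a metadata row, not building and testing a new transformation.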
The Architecture
[Architecture diagram] Sources (ERP, CSV files) → ETL → Staging Area (MySQL) → ETL: Kettle Data Vault Framework → Data Vault (MySQL) → ETL → Central DWH & Data Marts (Data Warehouse) → EUL
Data Vault Size
- Approx. 125 tables (excluding error and helper tables)
- 40 GB of data
- Largest table: 42 million rows
- Total rows: 160 million
- Refresh rate: twice a day
- Growth: approx. 100,000 rows daily
- New tables: varies strongly (new functionality is added on a project basis)
Advantages of Data Vault for us
- Full traceability of history (a DBC changes rapidly over time; we often see more than 30 versions)
- The data model is very extensible (incorporating new source systems)
- Business rules are moved downstream (and change often)
- The generic solution saves us a lot of testing
The Tooling
- Database: MySQL
- ETL: Pentaho Data Integration (Kettle)
- BI: Pentaho
Automation?
- Staging physical database & loading: can be automated, but currently not part of the framework
- Data Vault design: manual
- Data Vault physical database: manual
- Mapping from source to Data Vault (Excel sheet): manual
- Data Vault population: automated using the framework
- Data marts & BI: manual
So what does this framework do?
- Automatically populates the entire Data Vault data warehouse
- Generates logging
- Inserts error rows into special error tables
- Restartable (using the load_dts of the previous run)
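The restart behavior above can be sketched in a few lines of Python. This is an assumption about the mechanism, not the framework's actual code: after a failed run, only staging rows with a load_dts newer than the last successfully completed run are reprocessed.

```python
from datetime import datetime

def rows_to_process(staging_rows, last_successful_load_dts):
    """Select only rows that arrived after the previous completed run."""
    return [r for r in staging_rows if r["load_dts"] > last_successful_load_dts]

# Illustrative staging rows; the second arrived after a hypothetical failed run.
staging = [
    {"id": 1, "load_dts": datetime(2013, 1, 1, 6, 0)},
    {"id": 2, "load_dts": datetime(2013, 1, 1, 18, 0)},
]

# Restart from the load_dts of the last successful run (noon in this example).
pending = rows_to_process(staging, datetime(2013, 1, 1, 12, 0))
print([r["id"] for r in pending])  # → [2]
```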
Some design decisions
- Updateable views with generic column names
- Compare satellite attributes using string comparison (concatenate all columns, with '|' (pipe) as delimiter)
- 'Inject' the metadata using Kettle parameters
- Generate and use an error table for each Data Vault table
- Check for design errors (e.g. references to non-existent tables, connections, attributes)
- Parallel processing
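The satellite-comparison decision above can be illustrated as follows: all attribute values are concatenated with a pipe delimiter and compared as a single string, so one generic comparison works for any column list. The column names here are illustrative, not taken from the actual model.

```python
def row_fingerprint(row, columns):
    """Concatenate attribute values with a pipe delimiter for comparison."""
    return "|".join("" if row.get(c) is None else str(row[c]) for c in columns)

# Illustrative satellite attributes for a patient.
columns = ["name", "city", "phone"]
current = {"name": "Jansen", "city": "Utrecht", "phone": "030-1234"}
incoming = {"name": "Jansen", "city": "Nieuwegein", "phone": "030-1234"}

# If the fingerprints differ, a new satellite version must be inserted.
changed = row_fingerprint(incoming, columns) != row_fingerprint(current, columns)
print(changed)  # → True
```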
Supported constructs
- Hubs, links, satellites
- Multi-source hubs and links
- last_seen_dts (hubs and links)
- Link attributes (an attribute in a link that references a hub that is not modeled, like an order line)
- Link validity satellites (special satellites that, among other things, keep track of deleted link rows)
Not (yet) supported constructs
- Composite business keys in a hub (can be solved using concatenation)
- Link-to-link relationships
- Multi-active satellites
- CDC-like staging areas
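The concatenation workaround for composite business keys mentioned above amounts to collapsing the key parts into a single hub key column before loading. A tiny sketch, with illustrative key parts:

```python
def composite_key(parts, delimiter="|"):
    """Join the parts of a composite business key into one hub key."""
    return delimiter.join(str(p) for p in parts)

# e.g. a key made of a year and a case number (hypothetical values)
print(composite_key(["2013", "A-1042"]))  # → 2013|A-1042
```

The delimiter must be a character that cannot occur inside the key parts, otherwise two different composites could collide.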
Metadata tables
Example: metadata and Excel
Example: a complete run
Example: transformation for a hub
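The hub transformation shown on this slide is not reproduced here; as a rough substitute, below is a hedged Python sketch of the standard hub-load steps (all names illustrative): deduplicate staging business keys, filter out keys already in the hub, and insert the remainder with a load timestamp and record source.

```python
from datetime import datetime

def load_hub(staging_keys, hub_rows, record_source, load_dts):
    """Insert only new business keys into the hub, stamped with audit columns."""
    existing = {r["business_key"] for r in hub_rows}
    for key in sorted(set(staging_keys)):   # deduplicate staging keys
        if key not in existing:             # only genuinely new keys
            hub_rows.append({
                "business_key": key,
                "load_dts": load_dts,
                "record_source": record_source,
            })
    return hub_rows

# Illustrative run: the hub already knows P-001; staging delivers a duplicate
# of P-002 plus the existing P-001.
hub = [{"business_key": "P-001", "load_dts": None, "record_source": "IntraZis"}]
load_hub(["P-001", "P-002", "P-002"], hub, "IntraZis", datetime(2013, 1, 1))
print([r["business_key"] for r in hub])  # → ['P-001', 'P-002']
```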
Final remarks
- The PDI framework & Data Vault have now been operational for over 2 years, still growing and still going strong
- The generic solution saves an enormous amount of time (both development and testing)
- The generic solution is a bit harder to maintain and debug; luckily, maintenance is now close to zero
- Mistakes in the design sheet are easily made; we're considering a specialized tool (Talend Master Data)
Want to try?
The PDI framework is open source!
Download a fully operational Virtual Machine at: http://sourceforge.net/projects/pdidatavaultfw/
Developer: Edwin Weber (eacweber@gmail.com)