Data Vault Modeling The Next Generation DW Approach Presented by: Siva Govindarajan http://www.globytes.com siva02@globytes.com 1
Data Vault Modeling Agenda Introduction Data Vault Place in Evolution Data Warehousing Architecture Data Vault Components Case Study Typical 3NF / Star Schema Data Vault Approach Temporal Data + Auditability Implementation of Data Vault Conclusion 2
Introduction Data Vault concept originally conceived by Dan Linstedt Enterprise Data Warehouse Modeling approach. Hybrid approach - 3NF & Star Schema Adaptable to changes Auditable Timed right with Technology 3
Data Vault Place in Evolution 1960s - Codd, Date et. al Normal Forms 1980s - Normal Forms adapted to DWs 1985+ - Star Schema for OLAP 1990s - Data Vault concept developed Dan Linstedt 2000+ - Data Vault concept published by Dan Linstedt 4
Data Warehousing Architecture Bill Inmon s Data Warehouse Architecture OLTP Systems Staging & ODS Enterprise DW Staging (optional) Data Mart Data Mart Data Marts Data Marts Adaptive Marts Kimball s Data Warehouse Bus Architecture Data Mart Data Mart OLTP Systems Staging Data Mart Data Mart 5
Data Vault Components Hub Entities Candidate Keys + Load Time + Source Link Entities FKs from Hub + Load Time + Source Satellite Entities Descriptive Data + Load Time + Source + End Time Dimension = Hub+ Satellite Fact = Satellite + Link [+ Hub ] 6
Case Study Modeling Approaches Typical 3NF / Star Schema Data Vault approach Scenario Typical Fact and Dimension New Dimension data Reference Data change - 1:M to M:M Additional attributes to Dimension 7
Typical 3 F / Star Schema 1. Typical Fact and Dimension TR ORDER tr_order_id dim_product_id system order id system root order id global root order id order quantity total executed qty event timestamp DIM PRODUCT dim_product_id product_id product_code product desc product_name le ngth 8
Typical 3 F / Star Schema 2. ew Dimension Data TR ORDER tr_order_id dim_product_id system order id system root order id global root order id order quantity total executed qty event timestamp DIM PRODUCT dim_product_id product_id product_code product desc product_name product_category_id product_category_name product_category_code product_category_desc is_strategy REF PRODUCT CAT EGORY product_category_id product_category_name product_category_code product_category_desc is_strategy 9
Typical 3 F / Star Schema 2. ew Dimension Data TR ORDER tr_order_id dim_product_id system order id system root order id global root order id order quantity total executed qty event timestamp d im_p ro d uct_ca te g o ry_id DIM PRODUCT dim_product_id product_id product_code product desc product_name le ng th DIM PRODUCT CAT EGORY d im_p ro d uct_ca te g o ry_id product_category_code product_category_name product_category_desc is_strategy namespace REF PRODUCT CAT EGORY product_category_id product_category_name product_category_code product_category_desc is_strategy 10
Typical 3 F / Star Schema 3. Reference 1:M to M:M Change TR ORDER tr_order_id dim_product_id system order id system root order id global root order id order quantity total executed qty event timestamp d im_p ro d uct_ca te g o ry_id order owner firm id? DIM PRODUCT dim_product_id product_id product_code product desc product_name le ng th DIM PRODUCT CAT EGORY d im_p ro d uct_ca te g o ry_id product_category_code product_category_name product_category_desc is_strategy namespace? REF PRODUCT CAT EGORY product_category_id product_category_name product_category_code product_category_d esc is_strategy 11
Typical 3 F / Star Schema 4. Additional attributes to Dimension TR ORDER tr_order_id dim_product_id system order id system root order id global root order id order quantity total executed qty event timestamp dim_p rod uct_ca teg ory_id order owner firm id DIM PRODUCT dim_product_id product_id product_code product desc product_name le ng th he ig ht we ig ht p a cka ging co lo r size 12
Data Vault Approach 1. Typical Hub, Satellite and Link (Replacing Fact/Dimension) LNK_ORDER lnk_o rd e r_id lnk_lo a d _d a te lnk_re co rd _so urce hub _p ro d uct_id hub _o rd e r_id HUB_PRODUCT hub_p ro duct_id hub_re co rd _so urce hub_lo a d _d a te product_id product_code HUB_ORDER hub _o rd e r_id hub _reco rd_so urce hub _lo a d _da te tr_order_id system order id system root order id global root order id SAT _PRODUCT 01 sa t_pro d uct01_id hub _p ro d uct_id sa t_lo a d _d a te sa t_end _da te sa t_re co rd _so urce product desc product_name symbol SAT ORDER01 sa t_o rd er01_id hub _o rd e r_id sa t_lo a d _d a te sa t_e nd_d a te sa t_re co rd _so urce order quantity total executed qty event timestamp 13
Data Vault Approach 2. ew Dimension Data LNK_ORDER lnk_order_id lnk_load_d ate lnk_reco rd_source hub_pro duct_id hub_ord er_id LNK_PRODUCT _X_CAT EGORY lnk_p rod uct_x_ca te gory_id lnk_loa d_date lnk_record _source hub_product_id hub_ca te gory_id HUB_PRODUCT hub_product_id hub_reco rd_source hub_load_d ate product_id product_code HUB_ORDER hub _o rde r_id hub _record_source hub _loa d_date tr_order_id system order id system root order id global root order id HUB_CAT EGORY hub_categ ory_id hub_record_source hub_lo ad_date product_category_id product_category_code SAT_PRODUCT01 sat_product01_id hub_pro duct_id sat_load_da te sat_end_date sat_re cord_so urce product desc product_name symbol SAT ORDER01 sat_o rd er01_id hub _o rde r_id sat_lo ad_date sat_e nd _d ate sat_record_source order quantity total executed qty event timestamp SAT _CAT EGORY01 sa t_category01_id hub_ca te gory_id sa t_load_da te sa t_end_date sa t_re co rd_source product_category_name product_category_desc is_strategy 14
LNK_ORDER lnk_order_id lnk_load_d ate lnk_reco rd_source hub_pro duct_id hub_ord er_id Data Vault Approach 3. Reference 1:M to M:M Change HUB_PRODUCT hub_product_id hub_reco rd_source hub_load_d ate product_id product_code HUB_ORDER hub _o rde r_id hub _record_source hub _loa d_date tr_order_id system order id system root order id global root order id SAT_PRODUCT01 sat_product01_id hub_pro duct_id sat_load_da te sat_end_date sat_re cord_so urce product desc product_name symbol SAT ORDER01 sat_o rd er01_id hub _o rde r_id sat_lo ad_date sat_e nd _d ate sat_record_source order quantity total executed qty event timestamp LNK_PRODUCT _X_CAT EGORY lnk_p rod uct_x_ca te gory_id lnk_loa d_date lnk_record _source hub_product_id hub_ca te gory_id HUB_CAT EGORY hub_categ ory_id hub_record_source hub_lo ad_date product_category_id product_category_code SAT _CAT EGORY01 sa t_category01_id hub_ca te gory_id sa t_load_da te sa t_end_date sa t_re co rd_source product_category_name product_category_desc is_strategy 15
Data Vault Approach 4. Additional attributes to Dimension SAT_PRODUCT01 sat_product01_id HUB_PRODUCT hub _p roduct_id hub _re co rd _source hub _lo ad_date product_id product_code hub _product_id sat_load_date sat_end _d ate sat_re cord_source product desc product_name SAT _PRODUCT 02 sat_pro duct02_id hub_product_id sat_load_d ate sat_end _date sat_record_so urce length height weight packaging color 16
Temporal Data + Auditability SAT _PRODUCT01 sa t_p ro duct01_id HUB_PRODUCT hub_prod uct_id sa t_loa d_da te sa t_e nd_d ate sa t_record_source product desc product_name hub_p ro duct_id hub_record_source hub_loa d_da te product_id product_code SAT _PRODUCT 02 sa t_p ro duct02_id hub_prod uct_id sa t_loa d_da te sa t_e nd_d ate sa t_record_source length height weight packaging color LNK_PRODUCT_X_CATEGORY lnk_prod uct_x_catego ry_id lnk_load_date lnk_record_source hub _product_id hub _catego ry_id LNK_ORDER lnk_o rd er_id lnk_load _date lnk_reco rd _source hub_orde r_id hub_prod uct_id hub_categ ory_id HUB_ORDER hub_orde r_id hub_reco rd _source hub_load _date tr_order_id system order id system root order id global root order id SAT ORDER01 sa t_o rd er01_id hub_orde r_id sa t_loa d_da te sa t_e nd_d ate sa t_record_source order quantity total executed qty event timestamp source_system SAT _CAT EGORY01 HUB_CATEGORY sat_catego ry01_id hub_cate gory_id hub_record_source hub_loa d_da te product_category_id product_category_code hub_cate gory_id sat_lo ad_d ate sat_end _date sat_record_source product_category_name product_category_desc is_strategy 17
Implementation of Data Vault Make it Simple View for Dimension View For Fact Data Loading Take advantage of Technology DW Appliances High-throughput storage devices RDBMS Features 18
Make it Simple Create views for Dimension From our Demo model, Join Product related Hubs and Satellites to build View for Product Dimension as shown below : HUB_PRODUCT SAT_PRODUCT01 SAT_PRODUCT02 19
Make it Simple Create Views For Fact Join Order related Hubs, Satellites and Links to build View for ORDER Fact using ORDER related Hubs/ Satellites and Links. HUB_ORDER SAT_ORDER01 LNK_ORDER 20
Data Loading Typically data loading for a Data Vault is in the following sequence. Hubs for Dimensions Links for Dimensions Satellites for Dimensions Hubs for Fact (if any ) Links for Fact Satellites for Fact Refer to Data Vault Series article 5 of Linstedt for further details 21
DW Appliances Next wave of IT Data Warehouse hardware solutions are the Data Warehouse Appliances. Take advantage of the technology where possible. Few DW Appliances in the market : Oracle Exadata Netezza Teradata RedBrick GreenPlum 22
High-throughput storage devices Utilize Storage devices designed for DW / OLAP applications. Following are some examples only and would change over time. Please do your research based on your sizing requirements. EMC CLARiiON Storage family Texas Memory systems RamSan Solid State devices Hitachi USP / VSP Storage family Other Hybrid solutions with High performance storage and Solid State devices 23
RDBMS Features Some Oracle Features catering DW / OLAP applications: Exadata Smart Scans OLAP Based Materialized views Partitioning Reference partition Advanced data compression Automatic degree of parallelism (ADOP). Star query optimization Oracle OLAP/DWH Features. 24
Conclusion In this presentation we have addressed high level overview of Data Vault Architecture. Topics discussed are Data Vault concept and architecture Data Vault Components such as Hubs, Satellites and Link tables Typical modeling challenges with traditional modeling approaches How those challenges could be handled using Data Vault Modeling Approach. Auditing and Temporal data capture using DV Approach. And finally, some implementation details If interested in learning more, try Dan Linstedt's special coaching area at: http://danlinstedt.com/my-coach. 25
Questions?? 26
References Data Vault Series articles by Dan E. Linstedt -- http://www.tdan.com Referred few articles from Genesee Academy -- http://geneseeacademy.com Articles and books by Ralph Kimball and Bill Inmon from multiple sources. Technical Documents from http://www.oracle.com/technetwork 27
Special Thanks to Dan Linstedt for review and corrections to the article. My colleague Mark Bruscke for his assistance in preparing the article. 28
The End Contact Email : siva02@globytes.com 29