Architecting Real-Time Data Warehouses with SQL Server
Mark Murphy, President, Infinity Analytics Inc.

Presenter: Mark Murphy
- NYC-based independent consultant
- https://www.linkedin.com/in/markmurphynyc
- http://www.infinityanalytics.com/

Traditional DW: Nightly / Weekly Data Load
[Diagram: the sources (Oracle CRM, SQL 2008 Inventory, AdvWorks 2014, and Supplier Shipping Schedules as CSV/XML) feed a nightly ETL built on SSIS / stored procs, which reloads changes into AdventureWorks DW 2014.]

Real-Time DW: Continuous Data Load
[Diagram: the same sources feed a constant ETL built on stored procs / CDC, which merges changes into AdventureWorks DW 2014.]

Why?
- Zero data latency
- Top customers *today*
- New customer signups *today*
- RT analytics: predictive analytics, recommender systems, RT promotions
- Cool factor of hitting refresh

4 Architectural Components
1. ...
2. ...
3. ... (CDC) ...
4. ...

Real-Time DW: Continuous Data Load
[Diagram: Oracle CRM, SQL 2008 Inventory, AdvWorks 2014, and Supplier Shipping Schedules (CSV/XML) flow through ETL stored procs, which merge changes into AdventureWorks DW 2014.]

End Goal: Dimensional Model
- RT ETL transforms the 3NF / flat data structures of the source systems into a dimensional model in the data warehouse.

Caveats to RTDW
- If you don't need it, don't do it.
- Higher cost in ETL development and testing.
- More moving parts, so more to go wrong.
- It is easier to TRUNCATE and INSERT a full table than to implement real-time update logic.

Let's go!

ODS: Operational Data Store
[Diagram: an ODS layer sits between the source systems and AdventureWorks DW 2014. Oracle CRM feeds CRM_ODS, SQL 2008 Inventory feeds INVENTORY_ODS, AdvWorks 2014 feeds ADV_WORKS_ODS, and the shipping schedules feed SHIP_SCHED_ODS.]

Why ODS?
- Doesn't touch the OLTP production source systems
- Overcomes the lack of CDC support in source databases
- Divides and conquers the work and complexity
- Improves performance, adds flexibility

ODS Layer: Other Considerations
- One SQL Server database per source database / subject area (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS)
- Mirror the source systems exactly (except possibly for indexes)
- Do not build reports off the ODS or give end users direct access to it.
- Even in a traditional, non-RT DW, this is a best practice.

4 Architectural Components
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. ...
3. ...
4. ...

Now What?
[Diagram: the sources now flow into the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS), but the path from the ODS layer into AdventureWorks DW 2014 is still a question mark.]

Source to Target Mapping
- Source: AdventureWorks 2014 (OLTP, third normal form)
- Target: AdventureWorksDW 2014 (DW, dimensional)
- A translation view maps source to target.

Source View - dimgeography
- rtdemo_src.dimgeography

Source View - factinternetsales
- rtdemo_src.factinternetsales

Re-Init Procedures

    MERGE INTO <DESTINATION> AS TGT
    USING <SOURCE VIEW> AS SRC
        ON SRC.BusinessKey = TGT.BusinessKey
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (<columns>) VALUES (<SRC columns>)
    WHEN MATCHED AND ( <some difference> ) THEN
        UPDATE SET <column> = SRC.<column>;

Re-inits are needed when:
- The system is initialized
- A source system changes and data must be reprocessed
- The system needs troubleshooting (failsafe)

Theoretically, with just re-inits, you could load your data warehouse in a traditional, non-RT manner.
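
A minimal sketch of what one re-init procedure could look like, following the MERGE template above. The procedure name, the target table layout, and the business key column GeographyAlternateKey are assumptions made for illustration; only the source view name rtdemo_src.dimgeography comes from the slides.

```sql
-- Hypothetical re-init procedure; object and column names are illustrative only.
CREATE PROCEDURE dbo.reinit_dimgeography
AS
BEGIN
    SET NOCOUNT ON;

    MERGE INTO dbo.DimGeography AS TGT
    USING rtdemo_src.dimgeography AS SRC
        ON SRC.GeographyAlternateKey = TGT.GeographyAlternateKey   -- business key (assumed)
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (GeographyAlternateKey, City, StateProvinceCode, PostalCode)
        VALUES (SRC.GeographyAlternateKey, SRC.City, SRC.StateProvinceCode, SRC.PostalCode)
    -- EXCEPT treats two NULLs as equal, so the update fires only on a real difference.
    WHEN MATCHED AND EXISTS (
            SELECT SRC.City, SRC.StateProvinceCode, SRC.PostalCode
            EXCEPT
            SELECT TGT.City, TGT.StateProvinceCode, TGT.PostalCode)
        THEN UPDATE SET
            City              = SRC.City,
            StateProvinceCode = SRC.StateProvinceCode,
            PostalCode        = SRC.PostalCode;

    -- Rows inserted or updated by the MERGE.
    SELECT @@ROWCOUNT AS rows_changed;
END;
```

Returning the affected row count makes the procedure easy to validate: a second run against unchanged source data should report zero.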

Problem
- 1:01 run dimgeography re-init
- 1:02 run dimcustomer re-init
- 1:05 run factinternetsales re-init

What if a customer was added at 1:03 and placed an order at 1:04? The 1:05 fact load sees an order whose customer was never loaded into dimcustomer: missing key!

Database Snapshots
- Database snapshots are created almost instantly, as a shadow copy (ADV_WORKS_ODS -> ADV_WORKS_ODS_SNAP).
- They store no data at creation. Instead, they store the before image of data as changes are made to the original.
- You can query either the snapshot or the original.
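
For reference, creating the snapshot described above might look like the following sketch. The logical file name and the snapshot file path are assumptions; the real logical names can be read from sys.database_files in the ODS, and every data file of the source database has to be listed.

```sql
-- Logical file name and path are assumptions; adjust to the actual ODS files.
CREATE DATABASE ADV_WORKS_ODS_SNAP
ON ( NAME = ADV_WORKS_ODS_Data,
     FILENAME = 'D:\Snapshots\ADV_WORKS_ODS_SNAP.ss' )
AS SNAPSHOT OF ADV_WORKS_ODS;
GO

-- The snapshot is frozen at creation time; the original keeps changing.
SELECT COUNT(*) AS frozen_count FROM ADV_WORKS_ODS_SNAP.Sales.Customer;
SELECT COUNT(*) AS live_count   FROM ADV_WORKS_ODS.Sales.Customer;
```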

Re-inits in Practice
- Source views / re-inits should therefore point at the ODS snapshots.
- Re-init procedures re-sync the DW data against the frozen version of the source.
- This is a good code / validation exercise as well: a correct re-init procedure inserts or updates ZERO rows on a second run off the same snapshot.

[Diagram: each ODS database has a snapshot (CRM_ODS -> CRM_ODS_SNAP, INVENTORY_ODS -> INVENTORY_ODS_SNAP, ADV_WORKS_ODS -> ADV_WORKS_ODS_SNAP, SHIP_SCHED_ODS -> SHIP_SCHED_ODS_SNAP). The ETL re-init SPs read from the snapshots and load Adv Works DW 2014.]

LSNs
- An LSN (log sequence number) is a binary value that represents the exact transaction order of the database.
- Example: 0x0000002D000000480001

Reading & Writing LSNs
- The LSN of a snapshot can be queried with sys.sp_cdc_dbsnapshotLSN.
- Our REINIT_ALL procedure stores this LSN.

4 Architectural Components
2. Re-init processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
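
A sketch of how a REINIT_ALL procedure might capture and store the snapshot LSN. sys.sp_cdc_dbsnapshotLSN is not a documented procedure, so the parameter order used here (snapshot name, binary LSN output, string LSN output) is an assumption to verify with sp_helptext on your build. The high-water-mark table dbo.etl_hwm is hypothetical.

```sql
-- Hypothetical high-water-mark (HWM) table in the DW database.
CREATE TABLE dbo.etl_hwm (
    source_db  sysname      NOT NULL PRIMARY KEY,
    last_lsn   binary(10)   NOT NULL,
    updated_at datetime2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

DECLARE @lsn binary(10), @lsn_str varchar(40);

-- Undocumented procedure: positional parameters assumed, check before relying on it.
EXEC ADV_WORKS_ODS.sys.sp_cdc_dbsnapshotLSN
     N'ADV_WORKS_ODS_SNAP', @lsn OUTPUT, @lsn_str OUTPUT;

-- Store the snapshot LSN as the starting point for incremental loads.
UPDATE dbo.etl_hwm
SET last_lsn = @lsn, updated_at = SYSUTCDATETIME()
WHERE source_db = N'ADV_WORKS_ODS';

IF @@ROWCOUNT = 0
    INSERT dbo.etl_hwm (source_db, last_lsn) VALUES (N'ADV_WORKS_ODS', @lsn);
```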

Now What (part 2)?
[Diagram: the ODS layer (CRM_ODS, INVENTORY_ODS, ADV_WORKS_ODS, SHIP_SCHED_ODS) is populated, but the ongoing path from the ODS layer into AdventureWorks DW 2014 is still a question mark.]

Incremental Algorithm
1. Look up the last LSN from the HWM (high-water mark) table (old).
2. Get the latest LSN from the ODS (new).
3. Begin transaction.
4. Process all dimensions incrementally (old, new).
5. Process all facts incrementally (old, new).
6. Update the HWM.
7. Commit transaction.
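
The seven steps above could be sketched as one driver procedure. This is an illustration under assumptions: the HWM table dbo.etl_hwm (source_db, last_lsn, updated_at) and the per-table procedures dbo.incr_* are hypothetical names, and the newest LSN is read from the ODS's cdc.lsn_time_mapping table, which is what sys.fn_cdc_get_max_lsn reads when run inside the ODS.

```sql
-- Hypothetical driver for one incremental cycle.
CREATE PROCEDURE dbo.incremental_all
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    DECLARE @old_lsn binary(10), @new_lsn binary(10), @from_lsn binary(10);

    -- Steps 1 and 2: old and new LSNs.
    SELECT @old_lsn = last_lsn FROM dbo.etl_hwm WHERE source_db = N'ADV_WORKS_ODS';
    SELECT @new_lsn = MAX(start_lsn) FROM ADV_WORKS_ODS.cdc.lsn_time_mapping;

    -- Nothing to do if there is no starting point or no new changes.
    IF @old_lsn IS NULL OR @new_lsn IS NULL OR @new_lsn <= @old_lsn
        RETURN;

    -- CDC ranges are inclusive, so start just after the stored HWM.
    SET @from_lsn = sys.fn_cdc_increment_lsn(@old_lsn);

    BEGIN TRY
        BEGIN TRANSACTION;                                        -- step 3

        EXEC dbo.incr_dimgeography      @from_lsn, @new_lsn;      -- step 4
        EXEC dbo.incr_dimcustomer       @from_lsn, @new_lsn;
        EXEC dbo.incr_factinternetsales @from_lsn, @new_lsn;      -- step 5

        UPDATE dbo.etl_hwm                                        -- step 6
        SET last_lsn = @new_lsn, updated_at = SYSUTCDATETIME()
        WHERE source_db = N'ADV_WORKS_ODS';

        COMMIT TRANSACTION;                                       -- step 7
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END;
```

Because the HWM update sits inside the same transaction as the dimension and fact loads, a failed cycle leaves both the data and the HWM where they were, and the next run simply retries the same range.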

CDC
- Use Change Data Capture (CDC) to pull all changes to a table for a given LSN range.
- Requires SQL Server Enterprise Edition.
- CDC tutorial (Pinal Dave): https://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-%28cdc%29-in-sql-server-2008/

CDC Primer
- Source table: Sales.SalesOrderHeader
- Change table: cdc.Sales_SalesOrderHeader_CT
- ...and query functions
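
As a reminder of how the change table and functions in the primer come into existence, here is a minimal sketch of enabling CDC on the ODS copy of Sales.SalesOrderHeader. The database name follows the slides; passing NULL for the gating role is an assumption made to keep the example short.

```sql
USE ADV_WORKS_ODS;
GO

-- Enable CDC at the database level (creates the cdc schema and metadata tables).
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC for one table. The default capture instance is Sales_SalesOrderHeader,
-- which yields the change table cdc.Sales_SalesOrderHeader_CT.
EXEC sys.sp_cdc_enable_table
     @source_schema        = N'Sales',
     @source_name          = N'SalesOrderHeader',
     @role_name            = NULL,   -- no gating role (assumption for brevity)
     @supports_net_changes = 1;      -- needs a primary key or unique index
GO

-- Confirm what is being captured.
SELECT capture_instance, supports_net_changes
FROM cdc.change_tables;
```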

Reading from CDC
- cdc.fn_cdc_get_all_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, N'all')
- cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, N'all')
- SELECT ... FROM cdc.Sales_SalesOrderHeader_CT directly
- Add OPTION (OPTIMIZE FOR UNKNOWN) to CDC function queries if performance is poor.

Incremental Procs
- For each source table, read from the CDC functions to see what has changed in the requested LSN range. Store the results in temp tables.
- Join the temp tables together, mimicking the structure of the source views.
- Merge into the fact/dim, just like the re-inits.
- Appendix A covers joining two tables when they are updated in different LSN ranges.
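
The first step of an incremental proc might look like this sketch. The procedure name and the reduced column list are assumptions; the function name follows from the default capture instance Sales_SalesOrderHeader. The start of the range must not be lower than the capture instance's minimum LSN (sys.fn_cdc_get_min_lsn), or the function raises an error.

```sql
-- Hypothetical incremental procedure; only the CDC read step is shown in full.
CREATE PROCEDURE dbo.incr_factinternetsales
    @from_lsn binary(10),
    @to_lsn   binary(10)
AS
BEGIN
    SET NOCOUNT ON;

    -- Net changes: one row per key for the whole range.
    -- __$operation: 1 = delete, 2 = insert, 4 = update (with the 'all' filter).
    SELECT __$operation AS cdc_operation,
           SalesOrderID, CustomerID, OrderDate, TotalDue
    INTO #soh
    FROM ADV_WORKS_ODS.cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(
             @from_lsn, @to_lsn, N'all')
    OPTION (OPTIMIZE FOR UNKNOWN);   -- avoid plans tuned to one LSN range

    -- Next steps (not shown): load the other source tables the same way,
    -- join the temp tables in the shape of rtdemo_src.factinternetsales,
    -- then MERGE into the fact table as the re-init does.
    SELECT cdc_operation, COUNT(*) AS row_count
    FROM #soh
    GROUP BY cdc_operation;
END;
```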

ASIDE: Catch-up Algorithm
- If the DW is behind by one hour, should it catch up in one transaction, or break the work into smaller pieces?
- The former is much easier.
- The latter is more difficult, but provides more accurate timestamps, such as for Type II dimensions.
- It requires looping through the ODS's cdc.lsn_time_mapping table, processing one time slice at a time (e.g. 5 minutes).

4 Architectural Components
3. Incremental processes: build an incremental stored proc for each dim and fact. Use CDC functions to populate temp tables that mimic the source views. PULL data incrementally on demand. Transactionally store the new HWM.
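
The sliced catch-up described above could be sketched as the loop below. It assumes a hypothetical procedure dbo.incremental_range that processes one LSN range and stores the new HWM in a single transaction, and a hypothetical HWM table dbo.etl_hwm (source_db, last_lsn). The 5-minute slice size is the example from the slide.

```sql
DECLARE @old_lsn   binary(10),
        @slice_lsn binary(10),
        @slice_end datetime,
        @last_time datetime;

SELECT @old_lsn = last_lsn FROM dbo.etl_hwm WHERE source_db = N'ADV_WORKS_ODS';

-- Time of the HWM; fall back to the oldest mapped time if the HWM predates the mapping.
SELECT @slice_end = MAX(tran_end_time)
FROM ADV_WORKS_ODS.cdc.lsn_time_mapping
WHERE start_lsn <= @old_lsn;

IF @slice_end IS NULL
    SELECT @slice_end = MIN(tran_end_time) FROM ADV_WORKS_ODS.cdc.lsn_time_mapping;

SELECT @last_time = MAX(tran_end_time) FROM ADV_WORKS_ODS.cdc.lsn_time_mapping;

WHILE @slice_end < @last_time
BEGIN
    SET @slice_end = DATEADD(MINUTE, 5, @slice_end);

    -- Last LSN committed at or before the end of this slice.
    SELECT @slice_lsn = MAX(start_lsn)
    FROM ADV_WORKS_ODS.cdc.lsn_time_mapping
    WHERE tran_end_time <= @slice_end;

    IF @slice_lsn > @old_lsn
    BEGIN
        -- Hypothetical: processes (old, new] and stores the new HWM transactionally.
        EXEC dbo.incremental_range @old_lsn, @slice_lsn;
        SET @old_lsn = @slice_lsn;
    END
END
```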

Agent Job

Why not SSIS?
- You could, but if the source and target are both SQL Server, it is easier and faster to work directly in T-SQL.
- If you do use SSIS, use transactions!
- The push process into the ODSs might be a perfect fit for SSIS.
- http://msbitips.blogspot.com/

Why not Change Tracking?
- It only stores key values, not data values.
- There is no way to recreate history, which might be needed.
- The mechanics of re-inits and incrementals to precise LSN values wouldn't be possible.

Real-Time Aggregates
- Create indexed views for RT aggregates on facts, with or without joins to dimensions.
- They are kept up to date automatically.
- SQL Server 2014 allows for updatable columnstore indexes as well.
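
A small sketch of a real-time aggregate as an indexed view. The view and index names are made up, and it assumes dbo.FactInternetSales has NOT NULL OrderDateKey and SalesAmount columns, since an indexed view cannot SUM a nullable expression.

```sql
-- Hypothetical daily sales aggregate, maintained automatically on every fact write.
CREATE VIEW dbo.vw_sales_by_day
WITH SCHEMABINDING
AS
SELECT OrderDateKey,
       SUM(SalesAmount) AS total_sales,
       COUNT_BIG(*)     AS row_count    -- required in an indexed view with GROUP BY
FROM dbo.FactInternetSales
GROUP BY OrderDateKey;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX cix_vw_sales_by_day
    ON dbo.vw_sales_by_day (OrderDateKey);
GO

-- NOEXPAND makes the query read the materialized rows directly.
SELECT OrderDateKey, total_sales
FROM dbo.vw_sales_by_day WITH (NOEXPAND)
WHERE OrderDateKey = 20140101;
```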

Indexed Views
- They degrade the performance of INSERTs/UPDATEs to the fact table, so make sure they are worthwhile to add.
- Be careful when updating referenced dimensions: indexed view foreign key references can deadlock.
- Use the sp_getapplock stored procedure to single-thread write access to the fact table and referenced dimensions.

OLAP
- SSAS cubes may also be able to be updated frequently.
- Use multiple partitions, sliced by time.
[Diagram: time-sliced partitions labeled ROLAP, MOLAP, and Current Day.]
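
For the single-threading advice on the Indexed Views slide, a minimal sp_getapplock sketch follows. The resource name and the 30-second timeout are arbitrary choices for illustration; every writer has to request the same resource name for the lock to serialize them.

```sql
SET XACT_ABORT ON;
BEGIN TRANSACTION;

DECLARE @rc int;

-- Take an exclusive application lock that lives as long as the transaction.
EXEC @rc = sys.sp_getapplock
     @Resource    = N'rtdw_factinternetsales_write',   -- arbitrary name (assumption)
     @LockMode    = 'Exclusive',
     @LockOwner   = 'Transaction',
     @LockTimeout = 30000;                              -- milliseconds

-- Return codes below zero mean the lock was not granted.
IF @rc < 0
BEGIN
    ROLLBACK TRANSACTION;
    THROW 50001, N'Could not acquire the ETL write lock.', 1;
END;

-- ... writes to the fact table and its referenced dimensions go here ...

COMMIT TRANSACTION;   -- committing releases the application lock
```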

Monitoring/Alerting
- All ETL operations should be logged and timed.
- The logger should commit even if the overall transaction is rolled back.
- If the incremental job fails 10 times, slow it down or turn it off.

Statistics
- Statistics won't ever be up to date for the latest data.
- On a fact table in SQL 2008/2012, a query gets a cardinality estimate of 1 if its date range is past the statistics HWM (the highest value in the histogram).
- Trace flags 2389, 2390, and 4139 in SQL 2012 deal with this "ascending key" problem.
- SQL 2014 is supposed to be better, but it has an issue where it may guess that 9% of overall rows were added since the statistics HWM.
- See milossql.wordpress.com ("beyond histogram" articles).
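
One common way to get a logger that survives a rollback, as the Monitoring/Alerting slide asks, is to buffer log rows in a table variable, because table variables are not undone by ROLLBACK. The sketch below is one possible approach, not necessarily the one used in the talk; dbo.etl_log and dbo.incr_dimcustomer are hypothetical names.

```sql
DECLARE @from_lsn binary(10), @to_lsn binary(10);   -- set by the caller in practice

-- Table variables keep their rows when the surrounding transaction rolls back.
DECLARE @log TABLE (
    logged_at datetime2(3)   NOT NULL DEFAULT SYSUTCDATETIME(),
    step_name sysname        NOT NULL,
    message   nvarchar(4000) NULL
);

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT @log (step_name, message) VALUES (N'incr_dimcustomer', N'started');
    EXEC dbo.incr_dimcustomer @from_lsn, @to_lsn;    -- hypothetical ETL step
    INSERT @log (step_name, message) VALUES (N'incr_dimcustomer', N'finished');

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Roll back first, then record the error; ERROR_MESSAGE() is still available here.
    IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;
    INSERT @log (step_name, message) VALUES (N'incremental_all', ERROR_MESSAGE());
END CATCH;

-- Persist the buffered rows outside the ETL transaction.
INSERT dbo.etl_log (logged_at, step_name, message)
SELECT logged_at, step_name, message FROM @log;
```

The difference between the started and finished timestamps also gives the per-step timing the slide asks for.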

Caching
- Turn off caching on the reporting server to always have live data.

Performance Considerations
- Tune the RT ETL so that it doesn't have any inefficiencies. Measure in milliseconds, not seconds.
- Use sp_WhoIsActive to see what's running.
- You MUST have Read Committed Snapshot Isolation (RCSI) enabled on the DW database.
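
Enabling RCSI is a one-line change; the sketch below assumes the DW database is named AdventureWorksDW2014. WITH ROLLBACK IMMEDIATE disconnects open transactions so the change does not wait on them, so run it in a maintenance window.

```sql
-- Database name is an assumption; substitute the real DW database.
ALTER DATABASE AdventureWorksDW2014
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;

-- Verify: 1 means readers no longer block behind the ETL's writes.
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = N'AdventureWorksDW2014';
```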

Process
- Have RT replication flowing into DEV, QA, and Prod.
- Keep the incremental process working in all three!
[Diagram: AW Prod replicates into AW_ODS DEV, AW_ODS QA, and AW_ODS Prod.]
- For a new RTDW, build in parallel to an existing DW, so you can reconcile the two: DW Prod (legacy) alongside DW Prod (new RT).

4 Architectural Components
4. Test early and test often. Make sure RT data is flowing into DEV and QA. Tune ETL, statistics, aggregates, and user queries against a live system with RCSI enabled.

4 Architectural Components - Review
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. Re-init processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
3. Incremental processes: build an incremental stored proc for each dim and fact. Use CDC functions to populate temp tables that mimic the source views. PULL data incrementally on demand. Transactionally store the new HWM.
4. Test early and test often. Make sure RT data is flowing into DEV and QA. Tune ETL, statistics, aggregates, and user queries against a live system with RCSI enabled.

More Information
- Code/slides at: http://www.infinityanalytics.com/
- https://www.linkedin.com/in/markmurphynyc
- mark@infinityanalytics.com

Appendix A: Joining Two Tables with CDC
1. Get net changes from A into #TEMP_A.
2. Get net changes from B into #TEMP_B.
3. Examine the join key between A and B. Search for any records in A whose matching record is missing from B. If there are missing records, then:
   a. Look to the CDC _CT change table for the *before* images of future modifications, BUT only where the record isn't inserted in the future, after this LSN range.
   b. Finally, look to the base table for records that are still missing, BUT only where the record isn't inserted in the future, after this LSN range.
4. Repeat step 3 in the other direction, looking for records in B whose matching record is missing from A.
5. Continue processing, using #TEMP_A and #TEMP_B.
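
A sketch of steps 1 through 3, with A as Sales.SalesOrderDetail and B as Sales.SalesOrderHeader joined on SalesOrderID. The table choice, the reduced column lists, and the procedure name are assumptions for illustration. "Not inserted in the future" is implemented by taking the earliest change after the range and accepting it only if it is a delete (1) or an update before-image (3), both of which carry the row as it stood at the end of the range.

```sql
CREATE PROCEDURE dbo.load_order_changes   -- hypothetical name
    @from_lsn binary(10),
    @to_lsn   binary(10)
AS
BEGIN
    SET NOCOUNT ON;

    -- Steps 1 and 2: net changes (deletes excluded) for A and B.
    SELECT SalesOrderID, SalesOrderDetailID, ProductID, OrderQty
    INTO #TEMP_A
    FROM ADV_WORKS_ODS.cdc.fn_cdc_get_net_changes_Sales_SalesOrderDetail(@from_lsn, @to_lsn, N'all')
    WHERE __$operation <> 1;

    SELECT SalesOrderID, CustomerID, OrderDate
    INTO #TEMP_B
    FROM ADV_WORKS_ODS.cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@from_lsn, @to_lsn, N'all')
    WHERE __$operation <> 1;

    -- Step 3: join keys present in A but missing from B.
    SELECT DISTINCT a.SalesOrderID
    INTO #missing
    FROM #TEMP_A AS a
    WHERE NOT EXISTS (SELECT 1 FROM #TEMP_B AS b WHERE b.SalesOrderID = a.SalesOrderID);

    -- Step 3a: earliest future change per key; keep it only if it is a before image.
    WITH future AS (
        SELECT ct.SalesOrderID, ct.CustomerID, ct.OrderDate, ct.__$operation AS op,
               ROW_NUMBER() OVER (PARTITION BY ct.SalesOrderID
                                  ORDER BY ct.__$start_lsn, ct.__$seqval, ct.__$operation) AS rn
        FROM ADV_WORKS_ODS.cdc.Sales_SalesOrderHeader_CT AS ct
        JOIN #missing AS m ON m.SalesOrderID = ct.SalesOrderID
        WHERE ct.__$start_lsn > @to_lsn
    )
    INSERT #TEMP_B (SalesOrderID, CustomerID, OrderDate)
    SELECT SalesOrderID, CustomerID, OrderDate
    FROM future
    WHERE rn = 1 AND op IN (1, 3);

    -- Step 3b: still missing and untouched after the range, so the base table is current.
    INSERT #TEMP_B (SalesOrderID, CustomerID, OrderDate)
    SELECT h.SalesOrderID, h.CustomerID, h.OrderDate
    FROM ADV_WORKS_ODS.Sales.SalesOrderHeader AS h
    JOIN #missing AS m ON m.SalesOrderID = h.SalesOrderID
    WHERE NOT EXISTS (SELECT 1 FROM #TEMP_B AS b WHERE b.SalesOrderID = h.SalesOrderID)
      AND NOT EXISTS (SELECT 1 FROM ADV_WORKS_ODS.cdc.Sales_SalesOrderHeader_CT AS ct
                      WHERE ct.SalesOrderID = h.SalesOrderID AND ct.__$start_lsn > @to_lsn);

    -- Step 4 would mirror step 3 for B rows missing their A partner;
    -- step 5 continues with the join of #TEMP_A and #TEMP_B.
    SELECT a.SalesOrderDetailID, b.CustomerID, b.OrderDate, a.ProductID, a.OrderQty
    FROM #TEMP_A AS a
    JOIN #TEMP_B AS b ON b.SalesOrderID = a.SalesOrderID;
END;
```

One caveat with this sketch: the base-table read in step 3b is only safe when the CDC capture job has caught up, since a change that has reached the base table but not yet the _CT table would leak a future value into the range.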