Architecting Real-Time Data Warehouses with SQL Server




Architecting Real-Time Data Warehouses with SQL Server
Presenter: Mark Murphy, President, Infinity Analytics Inc.
NYC-based independent consultant
https://www.linkedin.com/in/markmurphynyc
http://www.infinityanalytics.com/

[Diagram: Traditional DW, nightly/weekly data load. Sources (Oracle CRM, SQL 2008 Inventory, AdvWorks 2014, supplier shipping schedules as CSV/XML) feed AdventureWorks DW 2014 through nightly ETL built on SSIS and stored procedures; some tables are fully reloaded, others have changes applied.]

[Diagram: Real-Time DW, continuous data load. The same sources feed AdventureWorks DW 2014 through constant ETL built on stored procedures and CDC, merging changes as they happen.]

Why?
- Zero data latency: see top customers *today*, new customer signups *today*.
- Real-time analytics: predictive analytics, recommender systems, real-time promotions.
- The cool factor of hitting refresh and watching the numbers change.

4 Architectural Components
1. ???
2. ???
3. ??? (CDC) ???
4. ???
(Each component is revealed as the talk proceeds.)

[Diagram: Real-Time DW, continuous data load, as before.]

End Goal: a dimensional model. Real-time ETL transforms the 3NF / flat data structures of the source systems into a dimensional model in the data warehouse.

Caveats to RTDW
- If you don't need it, don't do it.
- Higher cost in ETL development and testing.
- More moving parts means more to go wrong.
- It is easier to TRUNCATE and re-INSERT a full table than to implement real-time update logic.
Let's go!

ODS: Operational Data Store

[Diagram: each source system feeds a dedicated ODS database: Oracle CRM into CRM_ODS, SQL 2008 Inventory into INVENTORY_ODS, AdvWorks 2014 into ADV_WORKS_ODS, supplier shipping schedules into SHIP_SCHED_ODS; the ODS layer then feeds AdventureWorks DW 2014.]

Why an ODS?
- It doesn't touch the production OLTP source systems.
- It overcomes a lack of CDC support in the source databases.
- It divides and conquers the work and the complexity.
- It improves performance and adds flexibility.

Other Considerations
- One SQL Server database per source database / subject area.
- Mirror the source systems exactly (except possibly for indexes).
- Do not build reports off the ODS or give end users direct access to it. Even in a traditional, non-RT DW, this is a best practice.

4 Architectural Components
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. ???
3. ???
4. ???

Now What?

[Diagram: the ODS layer is in place; a question mark sits between the ODS databases and AdventureWorks DW 2014.]

Source-to-target mapping: translation views map AdventureWorks 2014 (OLTP, third normal form) to AdventureWorksDW 2014 (dimensional).

[Screenshots: source view definitions rtdemo_src.dimgeography and rtdemo_src.factinternetsales, which reshape ODS data into the target dimensional structure.]

Re-Init Procedures

MERGE INTO <destination> AS TGT
USING <source view> AS SRC
    ON SRC.<business key> = TGT.<business key>
WHEN NOT MATCHED THEN
    INSERT (...) VALUES (...)
WHEN MATCHED AND ( <some difference> ) THEN
    UPDATE SET ...;

Re-inits are needed when:
- The system is initialized.
- A source system changes and data needs to be reprocessed.
- Troubleshooting the system (a failsafe).

Theoretically, with just re-inits, you could load your data warehouse in a traditional, non-RT manner.
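As a concrete illustration of the pattern (a sketch only: the table and column names are hypothetical, loosely borrowed from AdventureWorksDW), a re-init for a geography dimension might look like:

```sql
-- Hypothetical re-init sketch; object and column names are illustrative.
MERGE INTO dbo.DimGeography AS TGT
USING rtdemo_src.dimgeography AS SRC
    ON  SRC.City              = TGT.City              -- business key
    AND SRC.StateProvinceCode = TGT.StateProvinceCode
    AND SRC.CountryRegionCode = TGT.CountryRegionCode
WHEN NOT MATCHED THEN
    INSERT (City, StateProvinceCode, CountryRegionCode, PostalCode)
    VALUES (SRC.City, SRC.StateProvinceCode, SRC.CountryRegionCode, SRC.PostalCode)
WHEN MATCHED AND (ISNULL(TGT.PostalCode, '') <> ISNULL(SRC.PostalCode, '')) THEN
    UPDATE SET PostalCode = SRC.PostalCode;
```

A useful property of this shape: it is idempotent, so running it twice against the same frozen source should touch zero rows on the second run.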

Problem
1:01 run dimgeography re-init
1:02 run dimcustomer re-init
1:05 run factinternetsales re-init

What if a customer was added at 1:03 and placed an order at 1:04? Missing key!

Database Snapshots
- Database snapshots are created instantly, as a shadow copy (e.g. ADV_WORKS_ODS has snapshot ADV_WORKS_ODS_SNAP).
- They store no data at initial creation. Instead, they store the *before* image of pages as changes are made.
- You can query either the snapshot or the original.
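Creating a snapshot of an ODS is one statement (a sketch: the logical file name and snapshot path below are illustrative and must match your source database's files):

```sql
-- Hypothetical sketch: snapshot the ODS so re-inits see a frozen, consistent source.
CREATE DATABASE ADV_WORKS_ODS_SNAP
ON ( NAME     = ADV_WORKS_ODS_Data,                 -- logical file name of the source DB
     FILENAME = 'D:\Snapshots\ADV_WORKS_ODS.ss' )   -- sparse file; starts near-empty
AS SNAPSHOT OF ADV_WORKS_ODS;
```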

Re-Inits in Practice
- Source views and re-init procedures should therefore point at the ODS snapshots.
- Re-init procedures re-sync the DW data against the frozen version of the source.
- This is a good code-validation exercise as well: a correct re-init procedure will insert/update ZERO rows on a second run against the same snapshot.

[Diagram: each ODS database now has a snapshot (CRM_ODS_SNAP, INVENTORY_ODS_SNAP, ADV_WORKS_ODS_SNAP, SHIP_SCHED_ODS_SNAP); the re-init stored procedures read from the snapshots into AdventureWorks DW 2014.]

LSNs: a binary way of representing the exact transaction order of the database. Example: 0x0000002D000000480001

Reading & Writing LSNs
- The LSN of a snapshot can be queried: sys.sp_cdc_dbsnapshotlsn
- Our REINIT_ALL procedure will store this LSN.

4 Architectural Components
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. Re-Init processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
3. ???
4. ???
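For LSNs in a live (non-snapshot) CDC-enabled ODS, SQL Server's documented helper functions can be used; a sketch, to be run in the context of the ODS database:

```sql
-- Hedged sketch: read current CDC LSN positions in a CDC-enabled database.
DECLARE @max_lsn binary(10) = sys.fn_cdc_get_max_lsn();   -- newest LSN CDC has processed
DECLARE @as_of   binary(10) =
    sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());

SELECT @max_lsn AS MaxLsn, @as_of AS AsOfLsn;
```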

Now What (part 2)?

[Diagram: the ODS layer is in place; a question mark again sits between the ODS databases and AdventureWorks DW 2014.]

Incremental Algorithm
1. Look up the last LSN from the HWM (high-water mark) table (old).
2. Get the new latest LSN from the ODS (new).
3. Begin a transaction.
4. Process all dimensions incrementally (old, new).
5. Process all facts incrementally (old, new).
6. Update the HWM.
7. Commit the transaction.
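The algorithm above might be sketched in T-SQL like this (all object names, including the HWM table and the incremental procs, are hypothetical):

```sql
-- Hypothetical sketch of the incremental driver.
DECLARE @old binary(10), @new binary(10);

BEGIN TRANSACTION;
    SELECT @old = LastLsn
    FROM   dbo.EtlHighWaterMark WITH (UPDLOCK)   -- serializes concurrent runs
    WHERE  SourceName = 'ADV_WORKS_ODS';

    -- Executed against the ODS database (e.g. via a synonym or linked call).
    SET @new = sys.fn_cdc_get_max_lsn();

    EXEC dbo.Incremental_DimCustomer       @old, @new;   -- dimensions first
    EXEC dbo.Incremental_FactInternetSales @old, @new;   -- then facts

    UPDATE dbo.EtlHighWaterMark
    SET    LastLsn = @new
    WHERE  SourceName = 'ADV_WORKS_ODS';
COMMIT TRANSACTION;   -- the HWM advances atomically with the loaded data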

CDC
- Use Change Data Capture (CDC) to pull all changes to a table for a given LSN range.
- Requires SQL Server Enterprise Edition.
- CDC tutorial (Pinal Dave): https://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-%28cdc%29-in-sql-server-2008/

CDC primer: enabling CDC on a table such as Sales.SalesOrderHeader creates a change table (cdc.Sales_SalesOrderHeader_CT) and query functions.
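Enabling CDC is two system stored procedure calls; a sketch (the database name is hypothetical, and net-change support assumes the table has a primary key or unique index):

```sql
-- Hedged sketch: enable CDC on the ODS database and on one table.
USE ADV_WORKS_ODS;                        -- hypothetical ODS database name
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema        = N'Sales',
    @source_name          = N'SalesOrderHeader',
    @role_name            = NULL,         -- no gating role
    @supports_net_changes = 1;            -- requires a PK or unique index
```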

Reading from CDC
- cdc.fn_cdc_get_all_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, ...)
- cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@startlsn, @endlsn, ...)
- Or SELECT ... FROM cdc.Sales_SalesOrderHeader_CT directly.
- Add OPTION (OPTIMIZE FOR UNKNOWN) to CDC function queries if the performance is poor.

Incremental Procs
- For each source table, read from the CDC functions to see what's changed in the requested LSN range. Store the results in temp tables.
- Join the temp tables together, mimicking the structure of the source views.
- Merge into the fact/dim, just like the re-inits.
- See Appendix A for joining two tables when they're updated in different LSN ranges.
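Pulling one LSN range into a temp table might look like this sketch (the capture-instance name assumes the default naming for Sales.SalesOrderHeader):

```sql
-- Hedged sketch: stage net changes for one LSN range into a temp table.
DECLARE @from binary(10) = sys.fn_cdc_get_min_lsn('Sales_SalesOrderHeader');
DECLARE @to   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
INTO   #salesorderheader
FROM   cdc.fn_cdc_get_net_changes_Sales_SalesOrderHeader(@from, @to, N'all')
OPTION (OPTIMIZE FOR UNKNOWN);   -- guards against bad plans on the LSN parameters
```

In the real incremental procs, @from and @to would come from the HWM table rather than the min/max helpers.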

ASIDE: Catch-Up Algorithm
If the DW is behind by one hour, should it catch up in one transaction, or break the work into smaller pieces?
- The former is much easier.
- The latter is more difficult, but provides more accurate timestamps, such as for Type-II dimensions. It requires looping through the ODS's cdc.lsn_time_mapping table, processing one time slice at a time (e.g. 5 minutes).

4 Architectural Components
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. Re-Init processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
3. Incremental processes: build an incremental stored proc for each dim and fact. Use CDC functions to populate temp tables that mimic the source views. PULL data incrementally on demand. Transactionally store the new HWM.
4. ???
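The slice-at-a-time loop might be sketched as follows (dbo.ProcessIncrement and the HWM table are hypothetical; the time-to-LSN mapping uses the documented CDC helper functions):

```sql
-- Hypothetical sketch: catch up in 5-minute slices instead of one big transaction.
DECLARE @old    binary(10) = (SELECT LastLsn FROM dbo.EtlHighWaterMark
                              WHERE SourceName = 'ADV_WORKS_ODS');
DECLARE @target binary(10) = sys.fn_cdc_get_max_lsn();
DECLARE @new    binary(10);

WHILE @old < @target
BEGIN
    -- Map the current LSN to a time, step 5 minutes forward, map back to an LSN.
    SET @new = sys.fn_cdc_map_time_to_lsn(
                   'largest less than or equal',
                   DATEADD(MINUTE, 5, sys.fn_cdc_map_lsn_to_time(@old)));
    IF @new IS NULL OR @new <= @old OR @new > @target
        SET @new = @target;              -- final (or empty) slice

    EXEC dbo.ProcessIncrement @old, @new;  -- dims, facts and HWM in one transaction
    SET @old = @new;
END
```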

Agent Job: Why not SSIS?
- You could use SSIS, but if the source and target are both SQL Server, it is easier and faster to work directly in T-SQL from a SQL Server Agent job.
- If you do use SSIS, use transactions!
- The push process into the ODSs might be a perfect fit for SSIS.
http://msbitips.blogspot.com/

Why not Change Tracking?
- It only stores key values, not data values.
- There is no way to recreate history, which might be needed.
- The mechanics of re-inits and incrementals to precise LSN values wouldn't be possible.

Real-Time Aggregates
- Create indexed views for real-time aggregates on facts, with or without joins to dimensions.
- They are kept up to date automatically.
- SQL Server 2014 allows updateable columnstore indexes as well.
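A minimal indexed-view sketch (view and column names are illustrative; the engine maintains the materialized result as the fact table changes):

```sql
-- Hedged sketch: a daily sales aggregate the engine keeps current automatically.
CREATE VIEW dbo.vSalesByDay
WITH SCHEMABINDING
AS
SELECT  OrderDateKey,
        SUM(SalesAmount) AS SalesAmount,
        COUNT_BIG(*)     AS RowCnt      -- COUNT_BIG(*) is required in an indexed view
FROM    dbo.FactInternetSales
GROUP BY OrderDateKey;
GO
-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay ON dbo.vSalesByDay (OrderDateKey);
```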

Indexed Views
- They will degrade the performance of INSERTs/UPDATEs against the fact table, so make sure they are worthwhile to add.
- Be careful when updating referenced dimensions; indexed-view references across the fact and dimensions can deadlock.
- Use sp_getapplock to single-thread write access to the fact table and referenced dimensions.

OLAP
SSAS cubes may also be able to be updated frequently: use multiple partitions sliced by time, e.g. MOLAP for history and ROLAP for the current day.
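Single-threading the writers with an application lock might look like this sketch (the resource name is arbitrary; all writers must agree on it):

```sql
-- Hedged sketch: serialize writes to the fact and its dimensions via an app lock.
BEGIN TRANSACTION;

DECLARE @rc int;
EXEC @rc = sp_getapplock
        @Resource    = 'FactInternetSales_Writer',  -- agreed-on name, not a table
        @LockMode    = 'Exclusive',
        @LockTimeout = 30000;                       -- milliseconds

IF @rc >= 0
BEGIN
    -- ... MERGE into dimensions and the fact table here ...
    COMMIT TRANSACTION;    -- committing releases the transaction-owned lock
END
ELSE
    ROLLBACK TRANSACTION;  -- could not acquire the lock in time
```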

Monitoring/Alerting
- All ETL operations should be logged and timed.
- The logger should commit even if the overall transaction is rolled back.
- If the incremental job fails 10 times, slow it down or turn it off.

Statistics
- Statistics will never be up to date for the very latest data.
- In SQL 2008/2012, a query on the fact table will get a cardinality estimate of 1 if its date range is past the statistics high-water mark.
- Trace flags 2389, 2390 and 4139 in SQL 2012 deal with this "ascending key" problem.
- SQL 2014 is supposed to be better, but it has an issue where it may guess that 9% of overall rows lie beyond the statistics high-water mark. See milossql.wordpress.com (the "beyond histogram" articles).

Caching
Turn off caching on the reporting server so reports always show live data.

Performance Considerations
- Tune the real-time ETL so it has no inefficiencies. Measure in milliseconds, not seconds.
- Use sp_WhoIsActive to see what's running.
- You MUST have Read Committed Snapshot Isolation (RCSI) enabled on the DW database.
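Enabling RCSI is a single statement (database name is illustrative); with it, report readers see the last committed row versions instead of blocking behind the ETL's writes:

```sql
-- Hedged sketch: enable RCSI on the warehouse database.
ALTER DATABASE AdventureWorksDW2014
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;   -- disconnects other sessions so the change can apply
```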

Process
- Have real-time replication flowing into DEV, QA and Prod, and keep the incremental process working in all three.

[Diagram: AW Prod feeds AW_ODS DEV, AW_ODS QA and AW_ODS Prod.]

- For a new RTDW, build in parallel to an existing DW, so you can reconcile the two.

[Diagram: DW Prod (legacy) alongside DW Prod (new RT).]

4 Architectural Components
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. Re-Init processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
3. Incremental processes: build an incremental stored proc for each dim and fact. Use CDC functions to populate temp tables that mimic the source views. PULL data incrementally on demand. Transactionally store the new HWM.
4. Test early & test often. Make sure real-time data is flowing into DEV and QA. Tune ETL, statistics, aggregates and user queries against a live system with RCSI enabled.

4 Architectural Components - Review
1. Operational Data Store (ODS) databases: create one per source database or subject area. PUSH data into the ODSs as often as possible.
2. Re-Init processes: build a re-init stored proc for each dim and fact, sourced from ODS snapshots. PULL from the source views into the DW dims/facts. Store the snapshot LSNs as the starting point.
3. Incremental processes: build an incremental stored proc for each dim and fact. Use CDC functions to populate temp tables that mimic the source views. PULL data incrementally on demand. Transactionally store the new HWM.
4. Test early & test often. Make sure real-time data is flowing into DEV and QA. Tune ETL, statistics, aggregates and user queries against a live system with RCSI enabled.

More Information
Code/slides at: http://www.infinityanalytics.com/
https://www.linkedin.com/in/markmurphynyc
mark@infinityanalytics.com

Appendix A: Joining Two Tables with CDC
1. Get net changes from A into #TEMP_A.
2. Get net changes from B into #TEMP_B.
3. Examine the join key between A and B. Search for any records present in A but missing from B. If there are missing records:
   a. Look to the change (CDC _CT) table for the *before* images of future modifications, BUT only where the record isn't inserted in the future, after this LSN range.
   b. Finally, look to the base table for records that are still missing, BUT only where the record isn't inserted in the future, after this LSN range.
4. Repeat step 3, this time looking for records present in B but missing from A.
5. Now continue processing, using #TEMP_A and #TEMP_B.