Chapter 3 - Data Replication and Materialized Integration



Similar documents
Database Replication with Oracle 11g and MS SQL Server 2008

Data Distribution with SQL Server Replication

Data warehouse Architectures and processes

Week 3 lecture slides

SQL Server 2012 Business Intelligence Boot Camp

EII - ETL - EAI What, Why, and How!

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

Data Warehouse: Introduction

Introduction to Datawarehousing

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Data Warehousing Concepts

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

ETL-EXTRACT, TRANSFORM & LOAD TESTING

MDM and Data Warehousing Complement Each Other

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

An Oracle White Paper March Best Practices for Real-Time Data Warehousing

The Data Warehouse ETL Toolkit

HYPERION MASTER DATA MANAGEMENT SOLUTIONS FOR IT

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Implementing a Data Warehouse with Microsoft SQL Server 2012 MOC 10777

DATA WAREHOUSING AND OLAP TECHNOLOGY

Week 13: Data Warehousing. Warehousing

Chapter 5. Learning Objectives. DW Development and ETL

A Survey on Data Warehouse Architecture

Course Outline. Module 1: Introduction to Data Warehousing

Extraction Transformation Loading ETL Get data out of sources and load into the DW

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Implementing a Data Warehouse with Microsoft SQL Server

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Warehousing. Read chapter 13 of Riguzzi et al Sistemi Informativi. Slides derived from those by Hector Garcia-Molina

When to consider OLAP?

Course 10777A: Implementing a Data Warehouse with Microsoft SQL Server 2012

Increase productivity and safety data of warehouse systems using Shareplex for Oracle

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server 2012

Implement a Data Warehouse with Microsoft SQL Server 20463C; 5 days

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

Oracle Warehouse Builder 10g

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

Implementing a Data Warehouse with Microsoft SQL Server MOC 20463

COURSE OUTLINE MOC 20463: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Implementing a Data Warehouse with Microsoft SQL Server 2012

Outline. Data Warehousing. What is a Warehouse? What is a Warehouse?

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

Distributed Databases

Course Outline: Course: Implementing a Data Warehouse with Microsoft SQL Server 2012 Learning Method: Instructor-led Classroom Learning

Data Warehousing with Oracle

Beta: Implementing a Data Warehouse with Microsoft SQL Server 2012

IBM WebSphere DataStage Online training from Yes-M Systems

Data Warehousing Systems: Foundations and Architectures

Data Warehousing and Data Mining

ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 28

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise. Colin White Founder, BI Research TDWI Webcast October 2005

Implementing a Data Warehouse with Microsoft SQL Server 2012

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

Foundations of Business Intelligence: Databases and Information Management

When an organization is geographically dispersed, it. Distributed Databases. Chapter 13-1 LEARNING OBJECTIVES INTRODUCTION

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

Data Warehousing and OLAP Technology for Knowledge Discovery

SAS BI Course Content; Introduction to DWH / BI Concepts

A McKnight Associates, Inc. White Paper: Effective Data Warehouse Organizational Roles and Responsibilities

Advanced Data Management Technologies

Salesforce Certified Data Architecture and Management Designer. Study Guide. Summer 16 TRAINING & CERTIFICATION

Enterprise Data Quality

Basics Of Replication: SQL Server 2000

AV-005: Administering and Implementing a Data Warehouse with SQL Server 2014

Implementing a Data Warehouse with Microsoft SQL Server

Virtuoso Replication and Synchronization Services

Databases in Organizations

Cloud DBMS: An Overview. Shan-Hung Wu, NetDB CS, NTHU Spring, 2015

Microsoft. Course 20463C: Implementing a Data Warehouse with Microsoft SQL Server

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

Data Integration and ETL Process

The IBM Cognos Platform

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

Shadowbase Data Replication Solutions. William Holenstein Senior Manager of Product Delivery Shadowbase Products Group

Data Warehousing and OLAP

IT0457 Data Warehousing. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems

East Asia Network Sdn Bhd

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

By Makesh Kannaiyan 8/27/2011 1

70-467: Designing Business Intelligence Solutions with Microsoft SQL Server

Lection 3-4 WAREHOUSING

A Critical Review of Data Warehouse

Migrating a Discoverer System to Oracle Business Intelligence Enterprise Edition

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Management in the Cloud

IST722 Data Warehousing

Performance Counters. Microsoft SQL. Technical Data Sheet. Overview:

Service Oriented Data Management

Transcription:

Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 3 - Data Replication and Materialized Integration

Motivation Replication: controlled redundancy of data to improve performance (query response time) increase availability Replication is a common concept in (homogeneous) distributed DBMS centralized DBMS in the form of materialized views mobile DBMS environments data/information integration middleware Materialized integration: data warehouses synchronous/asynchronous replication of data from multiple sources into a central data warehouse avoid performance problems of virtual integration solutions diverts query load away from operational data sources enables complex and powerful data analysis (business intelligence) Major drawback potential for stale (out-of-date) data 2

Challenges Integration of replication with transaction processing Enforcing consistency of replica in the presence of updates possible model: update of replicated data becomes a distributed transaction that updates all copies Detecting and resolving conflicts reconciling "versions" of replicated data elements that have evolved independently (e.g., because of a communication failure) 3

Eager Replication Transaction synchronizes with copies of replicated elements before commit guarantees globally serializable execution locking avoids inconsistencies Potential problems deadlocks update overhead reduced update performance increased transaction response time lack of scalability cannot be used if nodes are disconnected (e.g., mobile databases) TA ER1 ER2 writea writea writea writeb writeb writeb writec writec writec commit commit commit 4

Lazy Replication Changes introduced at one site are propagated (as separate transactions) to other sites only after commit minimal update overhead (i.e., improved response times over eager replication) works also if sites are not connected (e.g., mobile environments) or temporarily unavailable Potential problems stale (out-of-date) data update of a completed transaction is "in-transit", i.e., has not been reflected in all replicas new transaction operating on a replica sees an old version of the data conflicting updates can cause inconsistencies between the copies concept for detecting inconsistencies reconcile conflicting transactions rollback of (committed) transactions not possible ð potential for "system delusion" (Gray, Reuter) inconsistent database, with no obvious way to repair it TA LR1 LR2 writea writeb writec commit writea writeb writec commit writea writeb writec commit 5

Replication Middleware Source table data is replicated to a target table Scenarios and uses data distribution data from one source table is replicated to more than one (read-only) target table data consolidation/integration data from more than one source table is replicated to a single target table (union view) bidirectional replication allows updates on target tables to be replicated back to the source table master/slave replication: all updates flow back to a designated master, are then distributed to other targets peer-to-peer replication: each location exchanges data with all other locations combination of the above Multi-tier replication introduction of staging area(s) changes are copied to another system then copied from staging area to multiple targets minimizes impact on source systems Lazy replication techniques are widely used 6

Replication methods Target table refresh at intervals, target table is replaced by a fresh copy of the source table no requirement to capture individual changes only makes sense if uni-directional replication is used (i.e., updates only occur on the source table) target table is small or replication occurs infrequently Change-capture replication committed changes to source table are captured and replicated to the target table replication activity continuous (near-real-time) interval-based e.g., during off-peak hours triggered by DB-events one-time snapshot need to compare snapshots of tables to determine the updates update capture committed change data apply 7

Capturing Changes Source table registration to define which parts of the table changes should be captured when replication should occur Capture logic responsible for detecting the changes to the source table make committed change data available to the apply logic realization approaches capture program analyzes the database log files, or use database triggers Committed Change Data needs to (at least) include type of change (insert, update, delete) new values of updated data items, plus data item identifier (keys) (optional) before-image information can be provided using (relational) staging table at the source location each change is reflected as a row in the staging table message-oriented middleware changes are provided as message on a queue 8

Applying Changes Apply logic is responsible for initializing target table from source table applying captured changes to the target table preserve order of dependent transactions Needs access to the captured changes stored in staging tables or change message queues the target table (the source table) Enhanced capabilities filtering, joins, aggregation, transformation of data for the target 9

Replication Conflicts Two nodes may concurrently update replicas of the same object "race" each other to propagate the update to all the other nodes potential for lost updates Detecting conflicts usually based on timestamps (or before-image data) object carries timestamp of most recent update replica update carries new value and old object timestamp each node compares old object timestamp of incoming replica updates with its own object timestamp if timestamps are the same, then the update is accepted if not, then the incoming update transaction is rejected, submitted for reconciliation rollback of transaction is not possible anymore, has been committed at the originating site wait situations in eager replication <-> reconciliation in lazy replication 10

Ownership of Replicas Group "update anywhere" update model any node/site with a replica can update it Master may cause conflicts, need for reconciliation "primary copy" update model each object has a master node only the master can update the primary copy other replicas are read-only other nodes request the master node to perform an update (e.g., using RPC) eliminates reconciliation with lazy replication results in waits, potential deadlocks for updates initiated by non-master node does not work for mobile, disconnected databases 11

Reconciliation Approaches automatically, based on rules site, time or value priority, merging of updates, manually conflict situations are reported in a conflicts table or queue Non-transactional replication schemes abandon serializability for convergence all nodes converge to the same replicated state, which may not correspond to a serial transaction execution tolerate lost updates 12

Alternatives for Conflict Detection, Avoidance Semantic synchronization permit commutative transactions requires capturing update semantics at a logical level performing the transaction update may yield different results, but still be semantically correct example: processing checks at a bank provide acceptance criteria for detecting conflicts incoming replica transaction updates are tentatively accepted and performed, but need to pass the acceptance test replaces/augments the generic conflict detection mechanisms Avoid conflicts by implementing update strategies in the application fragmentation by key a site can update only rows whose keys are in a fixed range no range overlaps fragmentation by time disjoint "time windows" for each site to perform updates 13

Data Warehousing Main goal: materialized integration of data from numerous heterogeneous sources to enable powerful strategic data analysis OLAP online analytical processing data mining often provided through (application-specific) data marts data derived from a data warehouse through pre-aggregation usually employ materialized views High-level architecture data source 1 : Data Warehouse data source n data mart : data mart 14

Multidimensional View of OLAP Data Facts central relation or collection of data in an OLAP application represents events or objects of interest e.g., a sales event, with information about the product sold, the store, the sales date and price Dimensions objects can often be thought of as arranged in a multi-dimensional space, or cube e.g., sales events have store, product, and time period dimensions each point is a single sales event, dimensions represent properties of the sale hierarchical nature of dimensions time: year, quarter, month, week, day store: country, state, region, city product store period 15

(Relational) OLAP Schema Typically uses a Star structure Dimension tables (linked to fact table) tend to be small Fact table tends to be huge Measures (dependent attributes) Snowflake schema "normalized" dimensions multiple tables to avoid redundancy requires additional joins for OLAP queries OLAP queries usually GROUP BY the dimensions compute aggregate values of measures store_id name city region sales_mgr Store dimension table period_desc description Per_year Per_quarter Per_month Per_day store_id product_id period_desc dollars units price sales_date Detailed_Sales fact table dimensions measures product_id brand size producer caselot Product dimension table Period dimension table 16

Data Warehousing Architecture metadata repository Data Warehouse Manager Analysis Reporting - Mining Tools data source 1 Monitor data mart : Extract Transform Load main data warehouse : data source n staging area data mart data flow control flow 17

Data Warehouse Manager Central component of the architecture Responsible for controlling and supervising the overall process initiate data preparation, loading implement error recovery routines manage ETL scripts or process descriptions schedule and control analysis actions Directs data warehouse refresh full load vs. incremental load periodic (e.g., every night, weekend), driven by source updates (e.g., after n changes), or on request Utilizes metadata repository Monitors the overall DW environment 18

Data Preparation Components Data preparation steps (ETL) are conducted in a staging area Monitor discovers and reports changes in data sources e.g., replication-based (staging tables may be directly used by extractors) Extractors select and transport data from data sources into the staging area DBMS or file system for managing the staging area Transformers perform standardization and integration of data responsible for "implementing" an integrated schema integrated data requires data quality! (see next chapter) data migration, data cleaning entity identification, duplicate elimination may happen SQL-based or based on external data processing operators Loaders insert the data from the staging area into the main warehouse usually employ bulk load utilities of DBMS for performance reasons 19

Monitoring and Change Data Capture Approaches log-based DBMS writes information about updates into its transaction log Logs as analyzed to extract the change data trigger-based DB triggers (user-defined or internal) are used to gather change data using replication middleware may again use the approaches above audit columns (application-based) application records information about changes in an additional audit column per table timestamp-based intermediate changes may be lost snapshot differentials requires a full snapshot taken at previous extraction step compares current state of data source with the snapshot expensive, but may be the only option for legacy data sources Some approaches have limitations example: audit columns don t support deletion, don t distinguish update and insert 20

Transformation Typical transformation operations transformation into (de-)normalized warehouse schema generation of global identifiers (surrogate keys) keys from original sources may carry semantic information, may not be globally unique data type conversion data encoding (adjustments) e.g., California à CA standardization of character-based representations e.g., <first name> <last name> (instead of <last name>, <first name>) date/time, unit of measure standardization combination/separation of attribute value(s) derived values aggregation Transformation Languages (E)SQL product-specific operators in data flow graphs 21

Sample ETL Processes (IBM DataStage) 22

Summary Replication middleware usages data distribution and consolidation improve performance, availability materialized information integration architecture capture and apply committed change data change propagation and ownership strategies eager vs. lazy group vs. master conflict detection and reconciliation approaches are required for lazy group replication! Data Warehousing materialized integration approach avoids problems and restrictions of virtual integration architectures integrated schema for mult-dimensional data analysis, OLAP facts, dimensions, (hyper-)cubes star and snowflake schema architectures central role of data warehouse manager extract/transform/load (ETL) for data preparation transformation implements schema and data integration logic data quality requirements potential problem: stale data requires real-time data warehousing 23