Extraction Transformation Loading ETL Get data out of sources and load into the DW



Similar documents
ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

Data Integration and ETL Process

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Integration and ETL Process

ETL and Advanced Modeling

IBM WebSphere DataStage Online training from Yes-M Systems

ETL PROCESS IN DATA WAREHOUSE

Data Warehousing and Data Mining

Lection 3-4 WAREHOUSING

Implementing a Data Warehouse with Microsoft SQL Server

LearnFromGuru Polish your knowledge

Implementing a Data Warehouse with Microsoft SQL Server 2012

Implementing a Data Warehouse with Microsoft SQL Server 2012 MOC 10777

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Implementing a Data Warehouse with Microsoft SQL Server

MDM and Data Warehousing Complement Each Other

Data Warehousing. Jens Teubner, TU Dortmund Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1

Real-time Data Replication

Implementing a Data Warehouse with Microsoft SQL Server

IT0457 Data Warehousing. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

ETL Process in Data Warehouse. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Course Outline. Module 1: Introduction to Data Warehousing

Implement a Data Warehouse with Microsoft SQL Server 20463C; 5 days

Course Outline: Course: Implementing a Data Warehouse with Microsoft SQL Server 2012 Learning Method: Instructor-led Classroom Learning

Chapter 5. Learning Objectives. DW Development and ETL

An Oracle White Paper March Best Practices for Real-Time Data Warehousing

Course 10777A: Implementing a Data Warehouse with Microsoft SQL Server 2012

Microsoft. Course 20463C: Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server 2012 (70-463)

Implementing a Data Warehouse with Microsoft SQL Server 2012

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

AV-005: Administering and Implementing a Data Warehouse with SQL Server 2014

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Implementing a Data Warehouse with Microsoft SQL Server 2012

Course 20463:Implementing a Data Warehouse with Microsoft SQL Server

SQL Server 2012 Business Intelligence Boot Camp

Chapter 3 - Data Replication and Materialized Integration

Cúram Business Intelligence Reporting Developer Guide

Data warehouse Architectures and processes

Introduction to Datawarehousing

Implementing a Data Warehouse with Microsoft SQL Server MOC 20463

COURSE OUTLINE MOC 20463: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

Would-be system and database administrators. PREREQUISITES: At least 6 months experience with a Windows operating system.

The Evolution of ETL

Implementing a Data Warehouse with Microsoft SQL Server 2014

IST722 Data Warehousing

Data-Warehouse-, Data-Mining- und OLAP-Technologien

A Design Technique: Data Integration Modeling

Course MIS. Foundations of Business Intelligence

An Oracle White Paper February Real-time Data Warehousing with ODI-EE Changed Data Capture

Implementing a SQL Data Warehouse 2016

Building a Data Warehouse

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

The Data Warehouse ETL Toolkit

Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC

Data Warehousing with Oracle

Foundations of Business Intelligence: Databases and Information Management

Implementing a Data Warehouse with Microsoft SQL Server

ETL-EXTRACT, TRANSFORM & LOAD TESTING

SQL Server Administrator Introduction - 3 Days Objectives

Jet Data Manager 2012 User Guide

Analysis and Design of ETL in Hospital Performance Appraisal System

Top 20 Data Quality Solutions for Data Science

A McKnight Associates, Inc. White Paper: Effective Data Warehouse Organizational Roles and Responsibilities

High-Volume Data Warehousing in Centerprise. Product Datasheet

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

Oracle Warehouse Builder 10g

HP Quality Center. Upgrade Preparation Guide

Framework for Data warehouse architectural components

Integrating Netezza into your existing IT landscape

ETL Tools. L. Libkin 1 Data Integration and Exchange

SAP Data Services 4.X. An Enterprise Information management Solution

news from Tom Bacon about Monday's lecture

Beta: Implementing a Data Warehouse with Microsoft SQL Server 2012

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

Service Oriented Data Management

SQL Server Introduction to SQL Server SQL Server 2005 basic tools. SQL Server Configuration Manager. SQL Server services management

The Data Quality Continuum*

Data Warehouse Center Administration Guide

Patterns of Information Management

SQL Server Replication Guide

Comprehensive Data Quality with Oracle Data Integrator. An Oracle White Paper Updated December 2007

SSIS Training: Introduction to SQL Server Integration Services Duration: 3 days

SAS BI Course Content; Introduction to DWH / BI Concepts

Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives

Dimodelo Solutions Data Warehousing and Business Intelligence Concepts

About the Tutorial. Audience. Prerequisites. Disclaimer & Copyright. ETL Testing

Extract, Transform, Load (ETL)

Foundations of Business Intelligence: Databases and Information Management

Enterprise Information Integration (EII) A Technical Ally of EAI and ETL Author Bipin Chandra Joshi Integration Architect Infosys Technologies Ltd

Tiber Solutions. Understanding the Current & Future Landscape of BI and Data Storage. Jim Hadley

Solution for Staging Area in Near Real-Time DWH Efficient in Refresh and Easy to Operate. Technical White Paper

Module 14: Scalability and High Availability

Topics. Database Essential Concepts. What s s a Good Database System? Using Database Software. Using Database Software. Types of Database Programs

Transcription:

Lection 5 ETL

Definition Extraction Transformation Loading ETL Get data out of sources and load into the DW Data is extracted from OLTP database, transformed to match the DW schema and loaded into the data warehouse database May also incorporate data from non-oltp systems such as text files, legacy systems, and spreadsheets 2

Introduction ETL is a complex combination of process and technology The most underestimated process in DW development The most time-consuming process in DW development Often, 80% of development time is spent on ETL 3

ETL for generic DW architecture L One, T companywide warehouse E Periodic extraction data is not completely fresh in warehouse 4

Independent data marts Data marts: Mini-warehouses, limited in scope L E T Separate ETL for each independent data mart Data access complexity due to multiple data marts 5

Dependent data marts ODS provides option for obtaining current data L E T Single ETL for enterprise data warehouse (EDW) Simpler data access Dependent data marts loaded from EDW 6

Data Staging Area (DSA) Transit storage for data underway in the ETL process Creates a logical and physical separation between the source systems and the data warehouse Transformations/cleansing done here No user queries (some do it) Sequential operations (few) on large data volumes Performed by central ETL logic Easily restarted RDBMS or flat files? (DBMS have become better) Allows centralized backup/recovery Often too time consuming to initial load all data marts by failure Thus, backup/recovery facilities needed 7

Extract Capture/Extract obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse Static extract = capturing a snapshot of the source data at a point in time Incremental extract = capturing changes that have occurred since the last static extract 8

Types of Data Sources Extraction is very dependent on the source types Non-cooperative sources Snapshot sources provides only full copy of source Specific sources each is different, e.g., legacy systems Logged sources writes change log (DB log) Queryable sources provides query interface, e.g., SQL Cooperative sources Replicated sources publish/subscribe mechanism Call back sources calls external code (ETL) when changes occur Internal action sources only internal actions when changes occur DB triggers is an example 9

Incremental extraction Information changes at the sources Need to update the DW Cannot apply Extract/ETL from scratch Extract from source systems can take a long time (days/weeks) Drain on the operational systems and on DW Incremental Extract/ETL Extract only changes since last load A number of methods can be used 10

Timestamps Put update timestamp on all rows Updated by DB trigger Extract only where timestamp > time for last extract Evaluation: + Reduces extract time + Less operational overhead - Source system must be changed 11

Messages Applications insert messages in a queue at updates Evaluation: + Works for all types of updates and systems - Operational applications must be changed and incurred operational overhead 12

DB triggers Triggers execute actions on INSERT, UPDATE, DELETE Evaluation: + Operational applications need not be changed (but need to set the triggers -) + Enables real-time update of DW - Operational overhead 13

DB log Replication based on DB log Find changes directly in DB log which is written anyway Evaluation: + Operational applications need not be changed + No operational overhead - Not possible in some DBMS (SQL Server, Oracle, DB2 can do it) 14

Transform Main step where the ETL adds value Changes data to be inserted in DW 3 major steps Data Cleansing Data Integration Other Transformations (includes replacement of codes, derived values, calculating aggregates) 15

Cleanse Scrub/Cleanse uses pattern recognition and AI techniques to upgrade data quality Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing 16 data

Examples of Dirty Data Dummy Values a clerk enters 999-99-9999 as a SSN rather than asking the customer, customers 235 years old Absence of Data NULL for customer age Cryptic Data 0 for male, 1 for female Contradicting Data customers with different addresses Violation of Business Rules NULL for customer name on an invoice Reused Primary Keys, Non-Unique Identifiers a branch bank is closed. Several years later, a new branch is opened, and the old identifier is used again Data Integration Problems Multipurpose Fields, Synonyms 17

What Causes Poor Data Quality? Factors contributing to poor data quality: Business rules do not exist or there are no standards for data capture. Standards may exist but are not enforced at the point of data capture. Inconsistent data entry (incorrect spelling, use of nicknames, middle names, or aliases) occurs. Data entry mistakes (character transposition, misspellings, and so on) happen. Integration of data from systems with different data standards is present. Data quality issues are perceived as time-consuming and expensive to fix. 18

Primary Sources of Data Quality Problems Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002 19

Actions for Cleansing Data stewards responsible for data quality A given steward has the responsibility for certain tables Includes manual inspections and corrections! Cleansing tools: Special-purpose cleansing (addresses, names, emails, etc) AI and data mining cleansing tools Rule-based cleansing (if-then style) Automatic rules Guess missing sales person based on customer and item Do not fix all problems with data quality Allow management to see weird data in their reports? 20

Typical Cleansing Tasks Standardization changing data elements to a common format defined by the data steward Gender Analysis returns a person s gender based on the elements of their name (returns M(ale), F(emale) or U(nknown)) Entity Analysis returns an entity type based on the elements in a name (i.e. customer names) (returns O(rganization), I(ndividual) or U(nknown)) De-duplication uses match coding to discard records that represent the same person, company or entity Matching and Merging matches records based on a given set of criteria and at a given sensitivity level. Used to link records across multiple data sources. 21

Typical Cleansing Tasks Address Verification verifies that if you send a piece of mail to a given address, it will be delivered. Does not verify occupants, just valid addresses. Householding/Consolidation used to identify links between records, such as customers living in the same household or companies belonging to the same parent company. Parsing identifies individual data elements and separates them into unique fields Casing provides the proper capitalization to data elements that have been entered in all lowercase or all uppercase. 22

Analysis and Standardization Example Who is the biggest supplier? Anderson Construction $ 2,333.50 Briggs,Inc $ 8,200.10 Brigs Inc. $12,900.79 Casper Corp. $27,191.05 Caspar Corp $ 6,000.00 Solomon Industries $43,150.00 The Casper Corp $11,500.00 23...

Analysis and Standardization Example 24

Analysis and Standardization Example 50,000 Casper Corp. 40,000 30,000 20,000 10,000 0 $ Spent Solomon Ind. Briggs Inc. Anderson Cons. 25

Data Matching Example Operational System of Records Data Warehouse Mark Carver SAS SAS Campus Drive Cary, N.C. Mark W. Craver Mark.Craver@sas.com 01 Mark Carver SAS SAS Campus Drive Cary, N.C. 02 Mark W. Craver Mark.Craver@sas.com 03 Mark Craver Systems Engineer SAS Mark Craver Systems Engineer SAS 26

Data Matching Example Operational System of Records Data Warehouse Mark Carver SAS SAS Campus Drive Cary, N.C. Mark W. Craver Mark.Craver@sas.com DQ 01 Mark Craver Systems Engineer SAS SAS Campus Drive Cary, N.C. 27513 Mark.Craver@sas.com Mark Craver Systems Engineer SAS 27

Transform Transform = convert data from format of operational system to format of data warehouse Record-level: Selection data partitioning Joining data combining Aggregation data summarization Field-level: level: single-field from one field to one field multi-field from many fields to one, or one field to many 28

Data Transformation Transforms the data in accordance with the business rules and standards that have been established Example include: splitting up fields, replacement of codes, derived values, and aggregates 29

Single-field transformation In general some transformation function translates data from old form to new form Algorithmic transformation uses a formula or logical expression Table lookup another approach, uses a separate table keyed by source record code 30

Multifield transformation M:1 from many source fields to one target field 1:M from one source field to many target fields 31

Loading Load/Index= place transformed data into the warehouse and create indexes Refresh mode: bulk rewriting of target data at periodic intervals Update mode: only changes in source data are written to data warehouse 32

Loading Data are physically moved to the data warehouse Most critical operation in any warehouse affects the DW itself (not the stage area) Loading can be broken down into 2 different types: Initial Load Continuous Load (loading over time) 33

Loading tools Backup and Disaster Recovery Restart logic, Error detection Continuous loads during a fixed batch window (usually overnight) Maximize system resources to load data efficiently in allotted time Parallelization Several tenths GB per hour 34

ETL tools Tools vs. program-from-scratch No tools: Start Immediately, flexibility Complexity, Maintenance Tools: Reduces development costs, easy maintenance Limited flexibility, financial cost 35

ETL Tools ETL tools from the big vendors Oracle Warehouse Builder IBM DB2 Warehouse Manager Microsoft Integration Services 100 s others: http://en.wikipedia.org/wiki/extract,_transform,_load#tools 36

MS Integration Services 37