The Data Warehouse ETL Toolkit



Similar documents
THE DATA WAREHOUSE ETL TOOLKIT CDT803 Three Days

Dimensional Data Modeling for the Data Warehouse

Mastering Data Warehouse Aggregates. Solutions for Star Schema Performance

Lection 3-4 WAREHOUSING

AV-005: Administering and Implementing a Data Warehouse with SQL Server 2014

COURSE OUTLINE. Track 1 Advanced Data Modeling, Analysis and Design

High-Volume Data Warehousing in Centerprise. Product Datasheet

Implementing a Data Warehouse with Microsoft SQL Server MOC 20463

COURSE OUTLINE MOC 20463: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Implementing a Data Warehouse with Microsoft SQL Server

SQL Server 2012 Business Intelligence Boot Camp

Building a Data Warehouse

ETL Process in Data Warehouse. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Microsoft Data Warehouse in Depth

East Asia Network Sdn Bhd

ETL-EXTRACT, TRANSFORM & LOAD TESTING

Microsoft. Course 20463C: Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server

Course Outline. Module 1: Introduction to Data Warehousing

Implementing a Data Warehouse with Microsoft SQL Server 2012

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

Implement a Data Warehouse with Microsoft SQL Server 20463C; 5 days

Data Warehousing Systems: Foundations and Architectures

LearnFromGuru Polish your knowledge

70-467: Designing Business Intelligence Solutions with Microsoft SQL Server

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

SAP Data Services 4.X. An Enterprise Information management Solution

Oracle Warehouse Builder 10g

Implementing a Data Warehouse with Microsoft SQL Server 2012 (70-463)

Course 10777A: Implementing a Data Warehouse with Microsoft SQL Server 2012

Implementing a Data Warehouse with Microsoft SQL Server 2012 MOC 10777

Evaluation Checklist Data Warehouse Automation

Data Warehousing and Data Mining

Implementing a Data Warehouse with Microsoft SQL Server 2012

Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server 2012

Establish and maintain Center of Excellence (CoE) around Data Architecture

Sizing Logical Data in a Data Warehouse A Consistent and Auditable Approach

Course 20463:Implementing a Data Warehouse with Microsoft SQL Server

IBM WebSphere DataStage Online training from Yes-M Systems

Chapter 3 - Data Replication and Materialized Integration

Implementing a Data Warehouse with Microsoft SQL Server

ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

ETL PROCESS IN DATA WAREHOUSE

ENTERPRISE EDITION ORACLE DATA SHEET KEY FEATURES AND BENEFITS ORACLE DATA INTEGRATOR

Course Outline: Course: Implementing a Data Warehouse with Microsoft SQL Server 2012 Learning Method: Instructor-led Classroom Learning

SAS BI Course Content; Introduction to DWH / BI Concepts

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Warehouse (DW) Maturity Assessment Questionnaire

Data Integration and ETL with Oracle Warehouse Builder NEW

Data Integration and ETL Process

THE QUALITY OF DATA AND METADATA IN A DATAWAREHOUSE

The Evolution of ETL

Data Warehouse design

Designing a Dimensional Model

HYPERION MASTER DATA MANAGEMENT SOLUTIONS FOR IT

Oracle Database 11g: Data Warehousing Fundamentals

Chapter 5. Learning Objectives. DW Development and ETL

MIS636 AWS Data Warehousing and Business Intelligence Course Syllabus

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Extraction Transformation Loading ETL Get data out of sources and load into the DW

POLAR IT SERVICES. Business Intelligence Project Methodology

An Oracle White Paper March Best Practices for Real-Time Data Warehousing

Class News. Basic Elements of the Data Warehouse" 1/22/13. CSPP 53017: Data Warehousing Winter 2013" Lecture 2" Svetlozar Nestorov" "

Oracle Data Integrator Technical Overview. An Oracle White Paper Updated December 2006

SSIS Training: Introduction to SQL Server Integration Services Duration: 3 days

BENEFITS OF AUTOMATING DATA WAREHOUSING

BUSINESSOBJECTS DATA INTEGRATOR

University Data Warehouse Design Issues: A Case Study

Implementing a Data Warehouse with Microsoft SQL Server 2014

Data Warehousing with Oracle

Speeding ETL Processing in Data Warehouses White Paper

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Oracle Warehouse Builder

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

CERULIUM TERADATA COURSE CATALOG

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

How To Write A Diagram

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

ETL Tools. L. Libkin 1 Data Integration and Exchange

Managing Data in Motion

Table of Contents Chapter 1 - Getting Started with Oracle Data Relationship Management (DRM) 1

EAI vs. ETL: Drawing Boundaries for Data Integration

Business Intelligence for SUPRA. WHITE PAPER Cincom In-depth Analysis and Review

Data Integration and ETL with Oracle Warehouse Builder: Part 1

DATA WAREHOUSING AND OLAP TECHNOLOGY

Implementing a Data Warehouse with Microsoft SQL Server 2012

SimCorp Solution Guide

Technology-Driven Demand and e- Customer Relationship Management e-crm

Salesforce Certified Data Architecture and Management Designer. Study Guide. Summer 16 TRAINING & CERTIFICATION

Efficient and Real Time Data Integration With Change Data Capture

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

Transcription:

2008 AGI-Information Management Consultants May be used for personal purporses only or by libraries associated to dandelon.com network. The Data Warehouse ETL Toolkit Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data Ralph Kimball Joe Caserta WILEY Wiley Publishing, Inc.

Acknowledgments About the Authors Introduction Part Requirements, Realities, and Architecture Chapter 1 Surrounding the Requirements Requirements Business Needs Compliance Requirements Data Profiling Security Requirements Data Integration Data Latency Archiving and Lineage End User Delivery Interfaces Available Skills Legacy Licenses Architecture ETL Tool versus Hand Coding (Buy a Tool Suite or Roll Your Own?) The Back Room - Preparing the Data The Front Room - Data Access The Mission of the Data Warehouse What the Data Warehouse Is What the Data Warehouse Is Not Industry Terms Not Used Consistently

viii Contents Chapter 2 Part II Chapter 3 Resolving Architectural Conflict: A Hybrid Approach How the Data Warehouse Is Changing The Mission of the ETL Team ETL Data Structures To Stage or Not to Stage Designing the Staging Area Data Structures in the ETL System Flat Files XML Data Sets Relational Tables Independent DBMS Working Tables Third Normal Form Entity/Relation Models Nonrelational Data Sources Dimensional Data Models: The Handoff from the Back Room to the Front Room Fact Tables Dimension Tables Atomic and Aggregate Fact Tables Surrogate Key Mapping Tables Planning and Design Standards Impact Analysis Metadata Capture Naming Conventions Auditing Data Transformation Steps Data Flow Extracting Part 1: The Logical Data Map Designing Logical Before Physical Inside the Logical Data Map Components of the Logical Data Map Using Tools for the Logical Data Map Building the Logical Data Map Data Discovery Phase Data Content Analysis Collecting Business Rules in the ETL Process Inteeratine Heteroeeneous Data Sources Part 2: The Challenge of Extracting from Disparate Platforms Connecting to Diverse Sources through ODBC Mainframe Sources Working with COBOL Copybooks EBCDIC Character Set Converting EBCDIC to ASCII 27 27 28 29 29 31 35 35 38 40 41 42 42 45 45 46 47 48 48 49 49 51 51 52 53 55 56 56 58 58 62 62 63 71 73 73

Chapter 4 Transferring Data between Platforms Handling Mainframe Numeric Data Using Pictures Unpacking Packed Decimals Working with Redefined Fields Multiple OCCURS Managing Multiple Mainframe Record Type Files Handling Mainframe Variable Record Lengths Flat Files Processing Fixed Length Flat Files Processing Delimited Flat Files XML Sources Character Sets XML Meta Data Web Log Sources W3C Common and Extended Formats Name Value Pairs in Web Logs ERP System Sources Part 3: Extracting Changed Data Detecting Changes Extraction Tips 80 81 81 83 84 85 87 89 90 91 93 93 94 94 97 98 100 102 105 106 109 Detecting Deleted or Overwritten Fact Records at the Source 111 111 Cleaning and Conforming Defining Data Quality Assumptions Part 1: Design Objectives Understand Your Key Constituencies Competing Factors Balancing Conflicting Priorities Formulate a Policy Part 2: Cleaning Deliverables Data Profiling Deliverable Cleaning Deliverable #1: Error Event Table Cleaning Deliverable #2: Audit Dimension Audit Dimension Fine Points Part 3: Screens and Their Measurements Anomaly Detection Phase Types of Enforcement Column Property Enforcement Structure Enforcement Data and Value Rule Enforcement Measurements Driving Screen Design Overall Process Flow The Show Must Go On Usually Screens

Known Table Row Counts 140 Column Nullity 140 Column Numeric and Date Ranges 141 Column Length Restriction 143 Column Explicit Valid Values 143 Column Explicit Invalid Values 144 Checking Table Row Count Reasonability 144 Checking Column Distribution Reasonability 146 General Data and Value Rule Reasonability 147 Part 4: Conforming Deliverables 148 Conformed Dimensions 148 Designing the Conformed Dimensions 150 Taking the Pledge 150 Permissible Variations of Conformed Dimensions 150 Conformed Facts 151 The Fact Table Provider 152 The Dimension Manager: Publishing Conformed Dimensions to Affected Fact Tables 152 Detailed Delivery Steps for Conformed Dimensions 153 Implementing the Conforming Modules 155 Matching Drives Deduplication 156 Surviving: Final Step of Conforming 158 Delivering 159 160 Chapter 5 Delivering Dimension Tables 161 The Basic Structure of a Dimension 162 The Grain of a Dimension 165 The Basic Load Plan for a Dimension 166 Flat Dimensions and Snowflaked Dimensions 167 Date and Time Dimensions 170 Big Dimensions 174 Small Dimensions 176 One Dimension or Two 176 Dimensional Roles 178 Dimensions as Subdimensions of Another Dimension 180 Degenerate Dimensions 182 Slowly Changing Dimensions 183 Type 1 Slowly Changing Dimension (Overwrite) 183 Type 2 Slowly Changing Dimension (Partitioning History) 185 Precise Time Stamping of a Type 2 Slowly Changing Dimension 190 Type 3 Slowly Changing Dimension (Alternate Realities) 192 Hybrid Slowly Changing Dimensions 193 Late-Arriving Dimension Records and Correcting Bad Data 194 Multivalued Dimensions and Bridge Tables 196 Ragged Hierarchies and Bridge Tables 199 Technical Note: POPULATING HIERARCHY BRIDGE TABLES 201

Using Positional Attributes in a Dimension to Represent Text Facts 204 207 Chapter 6 Delivering Fact Tables The Basic Structure of a Fact Table Guaranteeing Referential Integrity Surrogate Key Pipeline Using the Dimension Instead of a Lookup Table Fundamental Grains Transaction Grain Fact Tables Periodic Snapshot Fact Tables Accumulating Snapshot Fact Tables Preparing for Loading Fact Tables Managing Indexes Managing Partitions Outwitting the Rollback Log Loading the Data Incremental Loading Inserting Facts Updating and Correcting Facts Negating Facts Updating Facts Deleting Facts Physically Deleting Facts Logically Deleting Facts Factless Fact Tables Augmenting a Type 1 Fact Table with Type 2 History Graceful Modifications Multiple Units of Measure in a Fact Table Collecting Revenue in Multiple Currencies Late Arriving Facts Aggregations Design Requirement #1 Design Requirement #2 Design Requirement #3 Design Requirement #4 Administering Aggregations, Including Materialized Views Delivering Dimensional Data to OLAP Cubes Cube Data Sources Processing Dimensions Changes in Dimension Data Processing Facts Integrating OLAP Processing into the ETL System OLAP Wrap-up 209 210 212 214 217 217 218 220 222 224 224 224 226 226 228 228 228 229 230 230 230 232 232 234 235 237 238 239 241 243 244 245 246 246 247 248 248 249 250 252 253 253

xii Contents Part III Chapter 7 Chapter 8 Implementation and operations Development Current Marketplace ETL Tool Suite Offerings Current Scripting Languages Time Is of the Essence Push Me or Pull Me Ensuring Transfers with Sentinels Sorting Data during Preload Sorting on Mainframe Systems Sorting on Unix and Windows Systems Trimming the Fat (Filtering) Extracting a Subset of the Source File Records on Mainframe Systems Extracting a Subset of the Source File Fields Extracting a Subset of the Source File Records on Unix and Windows Systems Extracting a Subset of the Source File Fields Creating Aggregated Extracts on Mainframe Systems Creating Aggregated Extracts on UNIX and Windows Systems Using Database Bulk Loader Utilities to Speed Inserts Preparing for Bulk Load Managing Database Features to Improve Performance The Order of Things The Effect of Aggregates and Group Bys on Performance Performance Impact of Using Scalar Functions Avoiding Triggers Overcoming ODBC the Bottleneck Benefiting from Parallel Processing Troubleshooting Performance Problems Increasing ETL Throughput Reducing Input/Output Contention Eliminating Database Reads/Writes Filtering as Soon as Possible Partitioning and Parallelizing Updating Aggregates Incrementally Taking Only What You Need Bulk Loading/Eliminating Logging Dropping Databases Constraints and Indexes Eliminating Network Traffic Letting the ETL Engine Do the Work Operations Scheduling and Support Reliability, Availability, Manageability Analysis for ETL ETL Scheduling 101 255 257 258 260 260 261 262 263 264 266 269 269 270 271 273 274 274 276 278 280 282 286 287 287 288 288 292 294 296 296 297 297 298 299 299 299 300 300 300 301 302 302 303

Chapter 9 Scheduling Tools Load Dependencies Metadata Migrating to Production Operational Support for the Data Warehouse Bundling Version Releases Supporting the ETL System in Production Achieving Optimal ETL Performance Estimating Load Time Vulnerabilities of Long-Running ETL processes Minimizing the Risk of Load Failures Purging Historic Data Monitoring the ETL System Measuring ETL Specific Performance Indicators Measuring Infrastructure Performance Indicators Measuring Data Warehouse Usage to Help Manage ETL Processes Tuning ETL Processes Explaining Database Overhead ETL System Security Securing the Development Environment Securing the Production Environment Short-Term Archiving and Recovery Long-Term Archiving and Recovery Media, Formats, Software, and Hardware Obsolete Formats and Archaic Formats Hard Copy, Standards, and Museums Refreshing, Migrating, Emulating, and Encapsulating Metadata Defining Metadata Metadata What Is It? Source System Metadata Data-Staging Metadata DBMS Metadata Front Room Metadata Business Metadata Business Definitions Source System Information Data Warehouse Data Dictionary Logical Data Maps Technical Metadata System Inventory Data Models Data Definitions Business Rules ETL-Generated Metadata 304 314 314 315 316 316 319 320 321 324 330 330 331 331 332

xiv Contents ETL Job Metadata Transformation Metadata Batch Metadata Data Quality Error Event Metadata Process Execution Metadata Metadata Standards and Practices Establishing Rudimentary Standards Naming Conventions Impact Analysis Chapter 10 Responsibilities Planning and Leadership Having Dedicated Leadership Planning Large, Building Small Hiring Qualified Developers Building Teams with Database Expertise Don't Try to Save the World Enforcing Standardization Monitoring, Auditing, and Publishing Statistics Maintaining Documentation Providing and Utilizing Metadata Keeping It Simple Optimizing Throughput Managing the Project Responsibility of the ETL Team Defining the Project Planning the Project Determining the Tool Set Staffing Your Project Project Plan Guidelines Managing Scope Part IV Real Time Streaming ETL Systems Chapter 11 Real-Time ETL Systems Why Real-Time ETL? Defining Real-Time ETL Challenges and Opportunities of Real-Time Data Warehousing Real-Time Data Warehousing Review Generation 1 The Operational Data Store Generation 2 The Real-Time Partition Recent CRM Trends The Strategic Role of the Dimension Manager Categorizing the Requirement 368 370 373 374 375 377 378 379 380 380 383 383 384 385 387 387 388 388 389 389 390 390 390 391 391 392 393 393 394 401 412 416 419 421 422 424

Chapter 12 Index Data Freshness and Historical Needs Reporting Only or Integration, Too? Just the Facts or Dimension Changes, Too? Alerts, Continuous Polling, or Nonevents? Data Integration or Application Integration? Point-to-Point versus Hub-and-Spoke Customer Data Cleanup Considerations Real-Time ETL Approaches Microbatch ETL Enterprise Application Integration Capture, Transform, and Flow Enterprise Information Integration The Real-Time Dimension Manager Microbatch Processing Choosing an Approach A Decision Guide Conclusions Deepening the Definition of ETL The Future of Data Warehousing and ETL in Particular Ongoing Evolution of ETL Systems 430 432 432 433 434 434 436 437 437 441 444 446 447 452 456 459 461 461 463 464 467