ETL PROCESS IN DATA WAREHOUSE




OUTLINE
ETL: Extraction, Transformation, Loading
- Capture/Extract
- Scrub or data cleansing
- Transform
- Load and Index

ETL OVERVIEW
Extraction, Transformation, Loading (ETL)
- ETL is a process, not a physical implementation: a process of copying data from one database to another
- Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database
- Data from non-OLTP systems (text files, legacy systems, spreadsheets) also requires extraction, transformation, and loading

ETL OVERVIEW
ETL is often a complex combination of process and technology:
- It consumes a significant portion of the data warehouse development effort
- It requires the skills of business analysts, database designers, and application developers
New data is added to the data warehouse periodically (monthly, daily, or hourly). Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
- Automated
- Well documented
- Easily changeable

ETL STAGING DATABASE
ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database. This:
- Creates a logical and physical separation between the source systems and the data warehouse
- Minimizes the impact of the intense periodic ETL activity on the source and data warehouse databases

EXTRACTION

EXTRACTION
The integration of all of the disparate systems across the enterprise is the real challenge in getting the data warehouse to a state where it is usable.
- Data is extracted from heterogeneous data sources
- Each data source has its own distinct set of characteristics that need to be managed and integrated into the ETL system in order to extract data effectively

EXTRACTION
The ETL process needs to effectively integrate systems that have different:
- DBMS
- Operating systems
- Hardware
- Communication protocols
A logical data map is needed before the physical data can be transformed. The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, and is usually presented in a table or spreadsheet.

The logical data map is laid out with the following columns:

Target (Table Name, Column Name, Data Type) | Source (Table Name, Column Name, Data Type) | Transformation

The content of the logical data mapping document has proven to be the critical element required to efficiently plan ETL processes. Work through dimensions first, then facts. The map provides a clear-cut blueprint of exactly what is expected from the ETL process: the table must depict, without question, the course of action involved in the transformation process. Most often, the transformation can be expressed in SQL. A minimal sketch of one mapping entry follows.
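A minimal sketch of one logical data mapping entry, expressed here as plain Python data rather than a spreadsheet; the table, column, and transformation names are hypothetical examples, not taken from any particular source system.

# One illustrative logical data map entry kept as plain Python data.
# Table, column, and rule names below are hypothetical.
logical_data_map = [
    {
        "target_table": "DimCustomer",          # target: data warehouse dimension
        "target_column": "CustomerName",
        "target_type": "VARCHAR(100)",
        "source_table": "CRM.Customers",        # source: operational system
        "source_column": "cust_nm",
        "source_type": "CHAR(80)",
        # Most often the transformation can be expressed in SQL.
        "transformation": "UPPER(TRIM(cust_nm))",
    },
]

for row in logical_data_map:
    print(f"{row['source_table']}.{row['source_column']} -> "
          f"{row['target_table']}.{row['target_column']}: {row['transformation']}")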

The analysis of the source system is usually broken into two major phases:
- The data discovery phase
- The anomaly detection phase

EXTRACTION - DATA DISCOVERY PHASE
A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it. Once you understand what the target needs to look like, you need to identify and examine the data sources.

DATA DISCOVERY PHASE
Determine each and every source system, table, and attribute required to load the data warehouse:
- Collecting and documenting source systems
- Keeping track of source systems
- Determining the system of record: the point of origin of the data
The definition of the system of record is important because in most enterprises data is stored redundantly across many different systems. It is very common for the same piece of data to be copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data.

DATA CONTENT ANALYSIS - EXTRACTION
Understanding the content of the data is crucial for determining the best approach for retrieval.
- NULL values: an unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns; joining two or more tables based on a column that contains NULL values will cause data loss (see the sketch below)
- Dates in non-date fields: dates appear in various formats, literally containing different values yet having the exact same meaning. Most database systems support most of the various formats for display purposes but store them in a single standard format
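To make the NULL foreign key risk concrete, a small illustrative sketch using an in-memory SQLite database with hypothetical orders/customers tables: the inner join silently drops the row whose foreign key is NULL.

import sqlite3

# Hypothetical tables to illustrate the point: an inner join on a foreign key
# column that contains NULLs silently drops those rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10), (2, None), (3, 10)])      # order 2 has a NULL foreign key
con.execute("INSERT INTO customers VALUES (10, 'Acme')")

joined = con.execute("""
    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()

print(len(joined))   # 2 -- order 2 disappeared because its key is NULL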

DATA CHANGES
During the initial load, capturing changes to data content in the source data is unimportant. Later, the ability to capture data changes in the source system instantly becomes a priority. The ETL team is responsible for capturing data-content changes during the incremental load.

DETERMINING CHANGED DATA
Audit columns: used by the DBMS and updated by triggers. Audit columns are appended to the end of each table to store the date and time a record was added or modified.
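A hedged sketch of an audit-column-driven incremental extract, assuming a hypothetical src_customer table with a last_modified column and a timestamp of the previous run persisted by the ETL job.

import sqlite3

# Sketch: pull only the rows whose audit column changed since the last ETL run.
# Table and column names (src_customer, last_modified) are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_customer (id INTEGER, name TEXT, last_modified TEXT)")
con.executemany("INSERT INTO src_customer VALUES (?, ?, ?)", [
    (1, "Ann", "2023-01-05 09:00:00"),
    (2, "Bob", "2023-02-10 14:30:00"),
])

last_run = "2023-02-01 00:00:00"       # persisted by the ETL job between runs
changed = con.execute(
    "SELECT id, name FROM src_customer WHERE last_modified > ?",
    (last_run,),
).fetchall()
print(changed)                          # only Bob's row is extracted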

DETERMINING CHANGED DATA
Comparison against the previous extraction: preserve exactly one copy of each previous extraction in the staging area for future use. During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process. Only the differences are sent to the data warehouse (previous/current load tables). This is not the most efficient technique, but it is the most reliable for capturing changed data.
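A minimal sketch of the compare-against-previous-extraction approach, using hypothetical key/value rows for the previous and current snapshots; only the differences would be forwarded to the warehouse.

# Sketch: compare the current full extract against the snapshot retained from
# the previous run and forward only the differences. Keys and rows are made up.
previous = {1: ("Ann", "London"), 2: ("Bob", "Paris")}
current  = {1: ("Ann", "London"), 2: ("Bob", "Berlin"), 3: ("Cara", "Rome")}

inserts = {k: v for k, v in current.items() if k not in previous}
updates = {k: v for k, v in current.items()
           if k in previous and previous[k] != v}
deletes = {k: v for k, v in previous.items() if k not in current}

print("insert:", inserts)   # {3: ('Cara', 'Rome')}
print("update:", updates)   # {2: ('Bob', 'Berlin')}
print("delete:", deletes)   # {}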

TRANSFORMATION

TRANSFORMATION
The main step where the ETL process adds value: it actually changes data and provides guidance on whether data can be used for its intended purposes. Transformation is performed in the staging area.

TRANSFORMATION
Data quality paradigm: data should be
- Correct
- Unambiguous
- Consistent
- Complete
Data quality checks are run at two places: after extraction, and after cleaning and confirming (additional checks are run at this point).

TRANSFORMATION - CLEANING DATA
Anomaly detection:
- Data sampling, e.g. a count(*) of the rows for a department column
Column property enforcement (see the sketch below):
- NULL values in required columns
- Numeric values that fall outside of expected highs and lows
- Columns whose lengths are exceptionally short or long
- Columns with values outside of discrete valid value sets
- Adherence to a required pattern or membership in a set of patterns
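An illustrative Python sketch of column property enforcement; the rows, valid value set, numeric range, and identifier pattern are assumptions made up for the example.

import re

# Sketch of simple column-property checks on a batch of staged rows.
# Thresholds, valid sets, and the pattern are illustrative assumptions.
rows = [
    {"emp_id": "E001", "dept": "IT",  "salary": 52000},
    {"emp_id": "E002", "dept": None,  "salary": 51000},
    {"emp_id": "BAD",  "dept": "HR",  "salary": 9_000_000},
]

VALID_DEPTS = {"IT", "HR", "FIN"}
ID_PATTERN = re.compile(r"^E\d{3}$")

def check(row):
    errors = []
    if row["dept"] is None:
        errors.append("NULL in required column dept")
    elif row["dept"] not in VALID_DEPTS:
        errors.append("dept outside valid value set")
    if not (10_000 <= row["salary"] <= 500_000):
        errors.append("salary outside expected high/low range")
    if not ID_PATTERN.match(row["emp_id"]):
        errors.append("emp_id does not match required pattern")
    return errors

for row in rows:
    print(row["emp_id"], check(row))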

TRANSFORMATION - CONFIRMING
Structure enforcement:
- Tables have proper primary and foreign keys
- Tables obey referential integrity
Data and rule value enforcement:
- Simple business rules
- Logical data checks

[Flow diagram: Staged Data -> Cleaning and Confirming -> Fatal Errors? -> Yes: Stop / No: Loading]

ANOMALY DETECTION PHASE
A data anomaly is a piece of data that does not fit into the domain of the rest of the data it is stored with. Exposure of unspecified data anomalies after the ETL process has been created is the leading cause of ETL deployment delays.

MEASURING THE CENTRAL TENDENCY
Mean (algebraic measure), sample vs. population: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values before averaging
Median (a holistic measure): the middle value if there is an odd number of values, or the average of the middle two values otherwise
Mode: the value that occurs most frequently in the data (unimodal, bimodal, trimodal)
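The measures above computed with Python's standard library on a small made-up sample; the trimmed mean here simply chops one extreme value at each end.

import statistics

data = [4, 8, 8, 15, 16, 23, 42]        # illustrative sample

mean = sum(data) / len(data)             # arithmetic mean
median = statistics.median(data)         # middle value of the sorted sample
mode = statistics.mode(data)             # most frequent value (8)

weights = [1, 2, 2, 1, 1, 1, 1]
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

trimmed = sorted(data)[1:-1]             # trimmed mean: chop one extreme at each end
trimmed_mean = sum(trimmed) / len(trimmed)

print(mean, median, mode, weighted_mean, trimmed_mean)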

MEASURING THE DISPERSION OF DATA
Quartiles, outliers and boxplots:
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 - Q1
- Five-number summary: min, Q1, median, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 x IQR below Q1 or above Q3, i.e. value < Q1 - 1.5 x IQR or value > Q3 + 1.5 x IQR
Variance and standard deviation (sample: s, population: σ):
- Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
- Standard deviation: σ is the square root of the variance σ²

BOXPLOT ANALYSIS
Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
Boxplot:
- Data is represented with a box
- The ends of the box are at the first and third quartiles, i.e. the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to the Minimum and Maximum

a. Which ones are outliers and why?
b. The weight of those pupils was measured in kg and the results are as follows. Draw the box-plot for weight.
37 40 39 40.5 42 51 41.5 39 41 30 40 42 40.5 39.5 41 40.5 37 39.5 40 41 38.5 39.5 40 41 39 40.5 40 38.5 39.5 41.5
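A worked computation for part (b) using Python's standard library; note that quartile conventions differ slightly between textbooks, so Q1 and Q3 may vary a little depending on the method, but the 1.5 x IQR rule flags the same outliers here.

import statistics

weights = [37, 40, 39, 40.5, 42, 51, 41.5, 39, 41, 30,
           40, 42, 40.5, 39.5, 41, 40.5, 37, 39.5, 40, 41,
           38.5, 39.5, 40, 41, 39, 40.5, 40, 38.5, 39.5, 41.5]

q1, med, q3 = statistics.quantiles(weights, n=4)   # one common quartile convention
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("five-number summary:", min(weights), q1, med, q3, max(weights))
print("outliers:", sorted(x for x in weights if x < low or x > high))  # 30 and 51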

MISSING DATA
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to:
- Equipment malfunction
- Data being inconsistent with other recorded data and thus deleted
- Data not entered due to misunderstanding
- Certain data not being considered important at the time of entry
- Failure to register history or changes of the data
Missing data may need to be inferred.

HOW TO HANDLE MISSING DATA?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Fill it in automatically with:
  - a global constant, e.g. "unknown" (which may effectively become a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class (smarter)
  - the most probable value: inference-based methods such as a Bayesian formula or a decision tree
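A small sketch of three automatic fill strategies (global constant, attribute mean, class-conditional mean) on made-up records with a missing income value.

from statistics import mean

# Sketch: three simple fill strategies for a missing numeric attribute (income).
# The records and the class labels are illustrative.
records = [
    {"class": "A", "income": 50_000},
    {"class": "A", "income": None},
    {"class": "B", "income": 80_000},
    {"class": "B", "income": 90_000},
]

known = [r["income"] for r in records if r["income"] is not None]
overall_mean = mean(known)

def class_mean(label):
    vals = [r["income"] for r in records
            if r["class"] == label and r["income"] is not None]
    return mean(vals)

for r in records:
    if r["income"] is None:
        filled_constant = "unknown"                 # global constant
        filled_mean = overall_mean                  # attribute mean
        filled_class_mean = class_mean(r["class"])  # mean within the same class
        print(filled_constant, filled_mean, filled_class_mean)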

REGRESSION
[Figure: scatter plot of Y1 versus X1 with fitted regression line y = x + 1]

CLUSTER ANALYSIS
[Figure: cluster analysis illustration]

K-NEAREST NEIGHBOR
Euclidean distance: $dist(X_1, X_2)$
Normalization for each attribute: $v' = \frac{v - \min_A}{\max_A - \min_A}$
To assign a label to a new item q, k-NN finds the k items in the sample that are nearest to q.
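A minimal sketch of min-max normalization per attribute followed by a Euclidean distance, on made-up items.

import math

# Min-max normalization per attribute, then Euclidean distance between two items.
# The sample values are illustrative.
items = [[3.0, 700.0], [5.0, 200.0], [9.0, 400.0]]

def minmax_normalize(rows):
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

norm = minmax_normalize(items)
print(euclidean(norm[0], norm[1]))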

THE K-NEAREST NEIGHBOR ALGORITHM (II)
Categorical attribute: if $x_{1j} = x_{2j}$ then $dist(x_{1j}, x_{2j}) = 0$, otherwise $dist(x_{1j}, x_{2j}) = 1$
Incomplete values:
- If the value of attribute $A_j$ is missing in $X_1$ and/or $X_2$, take the difference to be as large as possible; in the simplest case $dist(x_{1j}, x_{2j}) = 1$
- If v is the (normalized) attribute value of one item and the other item's value is missing: $dist(x_{1j}, x_{2j}) = |1 - v|$ or $|0 - v|$, whichever is greater

Worked example: four training samples with two attributes and a class label.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

For a new paper tissue that has X1 = 3 and X2 = 7, what is the classification of this new tissue?

Compute the square distance of each training sample to the query instance (3, 7), rank the distances, and take the 3 nearest neighbors:

X1 | X2 | Square distance to query (3, 7) | Rank (minimum distance) | Included in 3 nearest neighbors? | Y = Category of nearest neighbor
7 | 7 | (7-3)^2 + (7-7)^2 = 16 | 3 | Yes | Bad
7 | 4 | (7-3)^2 + (4-7)^2 = 25 | 4 | No | -
3 | 4 | (3-3)^2 + (4-7)^2 = 9 | 1 | Yes | Good
1 | 4 | (1-3)^2 + (4-7)^2 = 13 | 2 | Yes | Good

Two of the three nearest neighbors are Good, so the new paper tissue (3, 7) is classified as Good.
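The same 3-NN classification reproduced as a short Python sketch (squared Euclidean distance plus majority vote); it prints Good for the query (3, 7).

from collections import Counter

# Classify the new tissue (X1=3, X2=7) with 3-NN using squared Euclidean
# distance and a majority vote over the nearest neighbors' labels.
training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query, k = (3, 7), 3

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

neighbors = sorted(training, key=lambda t: sq_dist(t[0], query))[:k]
label = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
print(label)   # Good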

Another example, with the following training set and K = 5. What is the label of the item (8, 4)?

X1 | X2 | Y
7 | 4 | +
4 | 3 | +
4 | 6 | +
7 | 3 | +
7 | 3 | +
6 | 1 | +
4 | 7 | +
8 | 6 | -
8 | 9 | -
9 | 4 | -
8 | 4 | ?

Square distance to the query (8, 4) and the 5 nearest neighbors:

X1 | X2 | Y | Square distance to (8, 4) | Among 5 nearest neighbors?
7 | 4 | + | 1 | Yes
4 | 3 | + | 17 | No
4 | 6 | + | 20 | No
7 | 3 | + | 2 | Yes
7 | 3 | + | 2 | Yes
6 | 1 | + | 13 | No
4 | 7 | + | 25 | No
8 | 6 | - | 4 | Yes
8 | 9 | - | 25 | No
9 | 4 | - | 1 | Yes

With three + labels and two - labels among the five nearest neighbors, the item (8, 4) is labelled +.

LOADING Loading Dimensions Loading Facts

LOADING DIMENSIONS
Dimension tables are physically built to have the minimal set of components:
- The primary key is a single field containing a meaningless unique integer (a surrogate key); the data warehouse owns these keys and never allows any other entity to assign them
- Dimensions are de-normalized flat tables: all attributes in a dimension must take on a single value in the presence of the dimension primary key
- A dimension should possess one or more other fields that compose the natural key of the dimension

The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table with correct primary keys, correct natural keys, and final descriptive attributes. Creating and assigning the surrogate keys occurs in this module. The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.

LOADING DIMENSIONS
When the data warehouse receives notification that an existing row in a dimension has changed, it gives one of three types of responses:
- Type 1: overwrite the changed attribute
- Type 2: add a new dimension row
- Type 3: add a new column that preserves the prior value
(A sketch of the three responses follows the Type 3 slide.)

TYPE 1 DIMENSION

TYPE 2 DIMENSION

TYPE 3 DIMENSION
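Since the Type 1, 2, and 3 slides are diagrams in the original, here is a hedged sketch of the three responses to a changed dimension attribute; the row layout and column names are illustrative, not a specific product's schema.

from datetime import date

# Illustrative dimension row; column names are hypothetical.
dim_row = {"customer_key": 101, "customer_id": "C-9", "city": "London",
           "prior_city": None, "row_current": True, "row_effective": date(2020, 1, 1)}

def scd_type1(row, new_city):
    # Type 1: overwrite the attribute; history is lost.
    row["city"] = new_city
    return [row]

def scd_type2(row, new_city, next_key):
    # Type 2: expire the current row and add a new row with a new surrogate key.
    row["row_current"] = False
    new_row = dict(row, customer_key=next_key, city=new_city,
                   row_current=True, row_effective=date.today())
    return [row, new_row]

def scd_type3(row, new_city):
    # Type 3: keep the previous value in a dedicated "prior" column.
    row["prior_city"], row["city"] = row["city"], new_city
    return [row]

print(scd_type2(dict(dim_row), "Paris", next_key=102))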

LOADING FACTS
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple: if a measurement exists, it can be modeled as a fact table row; if a fact table row exists, it is a measurement.

KEY BUILDING PROCESS - FACTS
When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys. The ETL system maintains a special surrogate key lookup table for each dimension; this table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity. All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys.
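An illustrative sketch of the in-memory surrogate key lookup: natural keys arriving on fact records are exchanged for the current surrogate keys. Dimension names and key values are made up.

# Sketch: in-memory surrogate key lookup tables used while loading fact records.
customer_lookup = {"C-9": 101, "C-12": 102}      # natural key -> current surrogate key
product_lookup = {"P-1": 501, "P-2": 502}

incoming_facts = [
    {"customer_nk": "C-9", "product_nk": "P-2", "amount": 40.0},
    {"customer_nk": "C-12", "product_nk": "P-1", "amount": 12.5},
]

fact_rows = [
    {"customer_key": customer_lookup[f["customer_nk"]],
     "product_key": product_lookup[f["product_nk"]],
     "amount": f["amount"]}
    for f in incoming_facts
]
print(fact_rows)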

KEY BUILDING PROCESS

LOADING FACT TABLES
Managing indexes: indexes are performance killers at load time.
- Drop all indexes in the pre-load step
- Segregate updates from inserts
- Load the updates
- Rebuild the indexes

MANAGING PARTITIONS
Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance. The most common partitioning strategy on fact tables is to partition the table by the date key; the fact table needs to be partitioned on the key that joins to the date dimension for the optimizer to recognize the constraint. The ETL team must be advised of any table partitions that need to be maintained.

OUTWITTING THE ROLLBACK LOG
In a data warehouse, the rollback log is a superfluous feature that must be dealt with to achieve optimal load performance. Reasons why the data warehouse does not need rollback logging:
- All data is entered by a managed process: the ETL system
- Data is loaded in bulk
- Data can easily be reloaded if a load process fails
Each database management system has different logging features and manages its rollback log differently.