Data Integration and ETL Process



Krzysztof Dembczyński
Institute of Computing Science, Laboratory of Intelligent Decision Support Systems
Politechnika Poznańska (Poznań University of Technology)
Software Development Technologies, Master studies, first semester
Academic year 2013/14 (winter course)

Review of the Previous Lecture
The logical design of a data warehouse has three goals: simplicity, expressiveness, and performance. The most popular conceptual schema is the star schema. Designing data warehouses is not an easy task.

1 Motivation
2 Data Extraction
3 Transformation and Integration of Data
4 Load of Data
5 Data Warehouse Refreshment
6 Summary

Motivation
OLAP queries are usually performed in a separate system, i.e., in a data warehouse. Data warehouses combine data from multiple sources, so the data must be translated into a consistent format. Data integration represents 80% of the effort in a typical data warehouse project!

ETL Process
ETL = Extraction, Transformation, and Load: extraction of data from source systems; transformation and integration of data into a format useful for analysis; load of data into the warehouse and building of additional structures. Refreshment of the data warehouse is closely related to the ETL process. The ETL process is described by metadata stored in the data warehouse.
Architecture of data warehousing: data sources, data staging area, data warehouse.
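The three phases can be sketched end to end as follows. This is only a minimal illustration, not the tooling used in the lecture: the CSV source, the sales table, its columns, and the index are all assumptions invented for the example, and the index stands in for the "additional structures" built after the load.

```python
import csv
import io
import sqlite3

# Hypothetical source data with inconsistent whitespace and letter case.
SOURCE_CSV = "id,amount,currency\n1, 10.50 ,usd\n2,7.00,USD\n"

def extract(text):
    """Extraction: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transformation: convert types and bring values to one format."""
    return [
        (int(r["id"]), float(r["amount"]), r["currency"].strip().upper())
        for r in records
    ]

def load(conn, rows):
    """Load: insert the rows, then build an additional structure (an index)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_currency ON sales (currency)")

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT id, amount, currency FROM sales ORDER BY id").fetchall()
print(loaded)
```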

ETL Tools
Data extraction from heterogeneous data sources.
Data transformation, integration, and cleansing.
Data quality analysis and control.
Data loading.
High-speed data transfer.
Data refreshment.
Managing and analyzing metadata.

Data Extraction
A data warehouse needs to extract data from different external data sources: operational databases (relational, hierarchical, network, etc.); files of standard applications (Excel, COBOL applications); additional databases (direct marketing databases) and data services (stock data); and other documents (.doc, XML, WWW).

Data Sources
Data sources are often operational systems, providing the lowest level of data. They are designed for operational use, not for decision support, and the data reflect this fact. Multiple data sources often come from different systems that run on a wide range of hardware, and much of the software is built in-house or highly customized. Data sources can be designed using different logical structures.

Data Extraction (continued)
Identifying concepts and objects is not necessarily easy. Example: extract information about sales from the source system. What is meant by the term "sale"? A sale may be said to have occurred when the order has been received from a customer, when the order is sent to the customer, or when the invoice has been raised against the order. A common problem is that there is no SALES table in the operational databases; instead, other tables may exist, such as ORDER with an attribute ORDER STATUS.
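A small sketch of how such a definition might be pinned down during extraction. The schema, the status codes, and the rule "a sale exists once the invoice has been raised" are assumptions chosen for illustration; the table is named orders here because ORDER is a reserved word in SQL.

```python
import sqlite3

# Hypothetical operational schema: no SALES table, only orders with a status.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, order_status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "RECEIVED"), (2, 80.0, "SHIPPED"),
     (3, 200.0, "INVOICED"), (4, 50.0, "INVOICED")],
)

# Assumed business rule: an order counts as a sale once it has been invoiced.
SALE_STATUSES = ("INVOICED",)

def extract_sales(connection, statuses=SALE_STATUSES):
    """Return (order_id, amount) for orders whose status counts as a sale."""
    placeholders = ", ".join("?" for _ in statuses)
    query = (
        "SELECT order_id, amount FROM orders "
        f"WHERE order_status IN ({placeholders}) ORDER BY order_id"
    )
    return connection.execute(query, statuses).fetchall()

sales = extract_sales(conn)
print(sales)
```

Keeping the rule in one named constant makes the chosen meaning of "sale" explicit; passing a different tuple, for example ("SHIPPED", "INVOICED"), changes which orders are extracted.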

Change Monitoring
Change monitoring is directly connected with data warehouse refreshment. Its task is to detect changes to an information source. Monitoring techniques are either external or intrusive.
Monitoring techniques by type of source: snapshot vs. timestamped sources; queryable, logged, and replicated sources; callback and internal action sources.
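The difference between snapshot and timestamped sources can be illustrated with a short sketch. The record layout, the modified_at field, and both function names are hypothetical; a snapshot source only allows two full dumps to be compared, while a timestamped source allows rows changed since the last extraction to be selected directly.

```python
from datetime import datetime

def snapshot_diff(old, new):
    """Snapshot source: compare two full dumps keyed by primary key."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, updated, deleted

def changed_since(rows, last_extract):
    """Timestamped source: keep only rows modified after the last extraction."""
    return [r for r in rows if r["modified_at"] > last_extract]

yesterday = {1: ("Jones", "MA"), 2: ("Smith", "NY"), 3: ("Brown", "TX")}
today = {1: ("Jones", "MA"), 2: ("Smith", "NJ"), 4: ("Green", "CA")}
inserted, updated, deleted = snapshot_diff(yesterday, today)

rows = [
    {"id": 1, "modified_at": datetime(2013, 10, 1)},
    {"id": 2, "modified_at": datetime(2013, 10, 5)},
]
recent = changed_since(rows, datetime(2013, 10, 3))
print(inserted, updated, deleted, recent)
```

Note that a timestamp filter alone cannot see deletions, which is one reason the snapshot comparison is still useful.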

Transformation and Integration of Data
Transformation and integration of data is the most important part of data warehousing. This phase consists of removing all inconsistencies and redundancies from data coming into the data warehouse from operational data sources. The data are made to conform to the conceptual schema used by the warehouse. Integration concerns both data and data schemas, and takes place at different levels: schema, table, tuple, and attribute values.

Conflicts and Dirty Data
Different logical models of operational sources.
Different data types (account number stored as String or Numeric).
Different data domains (gender: M, F, male, female, 1, 0).
Different date formats (dd-mm-yyyy or mm-dd-yyyy).
Different field lengths (address stored by using 20 or 50 chars).
Different naming conventions: homonyms and synonyms.
Semantic conflicts, when the same objects are modeled on different logical levels.
Structural conflicts, when the same concepts are modeled using different structures.
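Two of these conflicts, data domains and date formats, can be resolved with simple per-source rules, sketched below. The mapping of 1 to M and 0 to F is an assumption made for the example (the slides do not say which code means which), and the function names are hypothetical.

```python
from datetime import date, datetime

# Assumed mapping of source codes to one target domain {M, F}.
GENDER_MAP = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

def normalize_gender(value):
    """Map any known source encoding to M or F; return None if unknown."""
    return GENDER_MAP.get(str(value).strip().lower())

def normalize_date(text, source_format):
    """Parse a date using the format declared for its source system.

    The string "03-04-2013" is ambiguous on its own, so the format has to
    come from knowledge about the source, not from the value itself.
    """
    return datetime.strptime(text, source_format).date()

print(normalize_gender("Female"), normalize_gender(1))
print(normalize_date("03-04-2013", "%d-%m-%Y"))  # source using dd-mm-yyyy
print(normalize_date("03-04-2013", "%m-%d-%Y"))  # source using mm-dd-yyyy
```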

Conflicts and Dirty Data (continued)
Different entries for the same attribute (state name or abbreviation).
Text fields can possess hidden information (contact person name can be given in the company name field).
Wrong attribute entries (attribute name contains company name or contact person name).
Inconsistent information concerning the same object.
Information concerning the same object, but indicated by different keys.
Missing values, ...

Data Cleansing
We would like to analyze high-quality data, since our goal is to support decision making.

Data Cleansing Techniques
Conversion and normalization methods (date formats such as dd/mm/yyyy; naming conventions such as Jan Kowalski).
Parsing text fields in order to identify and isolate data elements: transformation (splitting the text into a record {title = mgr, first name = Jan, last name = Kowalski}) and standardization (Jan Kowalski, magister → mgr Jan Kowalski).
Dictionary-based methods (database of names, geographical places, pharmaceutical data).
Domain-specific knowledge methods to complete data (postal codes).
Rationalization of data (PHX323RFD110A4 → Print paper, format A4).
Rule-based cleansing (replace gender by sex).
Cleansing by using data mining.
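Parsing and standardization of a name field might look like the sketch below. The alias table and both function names are hypothetical, and the parser handles only the simple "title, first name, last name" case from the example above; real name parsing needs far more rules or a dictionary.

```python
# Assumed dictionary of title spellings and their standard abbreviation.
TITLE_ALIASES = {"magister": "mgr", "mgr": "mgr"}

def parse_person(text):
    """Split a free-text name into title, first name, and last name."""
    title = None
    names = []
    for token in text.replace(",", " ").split():
        key = token.lower().rstrip(".")
        if key in TITLE_ALIASES and title is None:
            title = TITLE_ALIASES[key]
        else:
            names.append(token)
    # Pad so that a missing first or last name becomes None.
    first, last = (names + [None, None])[:2]
    return {"title": title, "first_name": first, "last_name": last}

def standardize(text):
    """Render the parsed record in one convention: title first last."""
    record = parse_person(text)
    parts = (record["title"], record["first_name"], record["last_name"])
    return " ".join(p for p in parts if p)

print(parse_person("Jan Kowalski, magister"))
print(standardize("Jan Kowalski, magister"))
```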

Data Cleansing Techniques (continued)
Deduplication ensures that one accurate record exists for each business entity represented in a database.
Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications and can also support cost-saving measures in direct advertising.
Example: consider the following rows in a database:
Tim Jones 123 Main Street Marlboro MA 12234
T. Jones 123 Main St. Marlborogh MA 12234
Timothy Jones 321 Maine Street Marlborog AM 12234
Jones, Timothy 123 Maine Ave Marlborough MA 13324
Sales of around $500 are recorded for each tuple. Is it the same person?
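One common way to approach the question is to group records by a cheap blocking key and then score the candidate pairs with a string similarity measure. The sketch below assumes the rows have already been split into name, street, city, state, and zip fields (the slide does not show column boundaries), uses last name plus first initial as the key, and relies on the standard-library difflib for scoring. It only ranks candidate pairs; deciding which pairs are really the same person still needs a threshold or a human reviewer.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# The four example rows, split into fields (the split is an assumption).
rows = [
    {"name": "Tim Jones", "street": "123 Main Street", "city": "Marlboro", "state": "MA", "zip": "12234"},
    {"name": "T. Jones", "street": "123 Main St.", "city": "Marlborogh", "state": "MA", "zip": "12234"},
    {"name": "Timothy Jones", "street": "321 Maine Street", "city": "Marlborog", "state": "AM", "zip": "12234"},
    {"name": "Jones, Timothy", "street": "123 Maine Ave", "city": "Marlborough", "state": "MA", "zip": "13324"},
]

def blocking_key(name):
    """(last name, first initial), for 'First Last' and 'Last, First' forms."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        tokens = name.split()
        first, last = tokens[0], tokens[-1]
    return last.lower(), first[0].lower()

def address_similarity(a, b):
    """Similarity in [0, 1] of the flattened, lower-cased address strings."""
    def flat(r):
        text = " ".join([r["street"], r["city"], r["state"], r["zip"]])
        return text.lower().replace(".", "")
    return SequenceMatcher(None, flat(a), flat(b)).ratio()

# Only records sharing a blocking key are compared with each other.
blocks = defaultdict(list)
for row in rows:
    blocks[blocking_key(row["name"])].append(row)

candidates = [
    (a["name"], b["name"], round(address_similarity(a, b), 2))
    for block in blocks.values()
    for i, a in enumerate(block)
    for b in block[i + 1:]
]
for pair in candidates:
    print(pair)
```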

Load of Data
After extraction, cleaning, and transformation, the data must be loaded into the warehouse. Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, creating indexes, etc. Batch (bulk) load utilities are used for loading. A load utility must allow the administrator to monitor status; to cancel, suspend, and resume a load; and to restart after failure with no loss of data integrity.

Load of Data (continued)
Loading concerns very large data volumes, so sequential loads can take a very long time. A full load can be treated as a single long batch transaction that builds up a new database. Using checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint.
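The checkpoint idea can be sketched with a small batch loader. Everything here is an assumption for illustration: the fact_sales and load_checkpoint tables, the batch size, and the fail_after parameter, which exists only to simulate a crash. The key point is that each batch and its checkpoint are committed in the same transaction, so a restart never loads a row twice.

```python
import sqlite3

def load_with_checkpoints(conn, rows, batch_size=2, fail_after=None):
    """Load rows in batches, recording progress so a restart can resume."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_checkpoint "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), next_row INTEGER)"
    )
    conn.execute("INSERT OR IGNORE INTO load_checkpoint VALUES (1, 0)")
    conn.commit()

    # Resume from the position stored by the last committed batch.
    start = conn.execute("SELECT next_row FROM load_checkpoint").fetchone()[0]
    for offset in range(start, len(rows), batch_size):
        if fail_after is not None and offset >= fail_after:
            raise RuntimeError("simulated failure during load")
        batch = rows[offset:offset + batch_size]
        with conn:  # the batch and its checkpoint commit together
            conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", batch)
            conn.execute(
                "UPDATE load_checkpoint SET next_row = ?", (offset + len(batch),)
            )

rows = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)]
conn = sqlite3.connect(":memory:")

try:
    load_with_checkpoints(conn, rows, fail_after=2)  # fails after the first batch
except RuntimeError:
    pass
loaded_after_failure = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]

load_with_checkpoints(conn, rows)  # restart: continues from the checkpoint
loaded_total = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(loaded_after_failure, loaded_total)
```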

Data Warehouse Refreshment
Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse. Refreshment follows the same structure as the ETL process. It is subject to several constraints: accessibility of data sources, size of data, size of the data warehouse, frequency of data refreshing, and degradation of performance of operational systems.

Data Warehouse Refreshment (continued)
Refreshment proceeds in three steps: detect changes in external data sources; extract the changes and integrate them into the warehouse; update indexes, subaggregates, and materialized views.
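The last two steps can be illustrated with a sketch that applies a set of detected changes and maintains an aggregate incrementally instead of recomputing it. The in-memory dictionaries, the region and amount fields, and the convention that None marks a deleted source row are all assumptions made for the example.

```python
from collections import defaultdict

def refresh(warehouse, totals_by_region, changes):
    """Apply a delta to the stored facts and keep the aggregate in step."""
    for sale_id, new_row in changes.items():
        old_row = warehouse.get(sale_id)
        if old_row is not None:
            # Remove the old contribution before applying the change.
            totals_by_region[old_row["region"]] -= old_row["amount"]
        if new_row is None:
            warehouse.pop(sale_id, None)  # row deleted in the source
        else:
            warehouse[sale_id] = new_row  # row inserted or updated
            totals_by_region[new_row["region"]] += new_row["amount"]

warehouse = {
    1: {"region": "north", "amount": 100.0},
    2: {"region": "south", "amount": 50.0},
}
totals = defaultdict(float, {"north": 100.0, "south": 50.0})

changes = {
    2: {"region": "south", "amount": 70.0},  # update
    3: {"region": "north", "amount": 30.0},  # insert
    1: None,                                  # delete
}
refresh(warehouse, totals, changes)
print(sorted(warehouse), dict(totals))
```

Only the rows named in the delta are touched, which is why refreshment concerns less data than a full load.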

Data Warehouse Refreshment (continued)
Refreshment can be periodic (daily or weekly) or immediate. The choice is determined by usage, types of data source, etc.

Refreshment and Load
Refreshment can be asynchronous, whereas a load of data requires long-term access to source data. Refreshment concerns less data and is faster.

Summary
The ETL process is a strategic element of data warehousing. Main concepts: extraction, transformation and integration, load, data warehouse refreshment, and metadata. New emerging technology...
