Data Integration and ETL Process




Krzysztof Dembczyński, Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland. Software Development Technologies, Master studies, second semester, academic year 2014/15 (winter course).

Review of the previous lecture: mining of massive datasets; evolution of database systems (operational vs. analytical systems); dimensional modeling. The three goals of the logical design of a data warehouse are simplicity, expressiveness, and performance. The most popular conceptual schema is the star schema. Designing data warehouses is not an easy task.

Outline:
1. Motivation
2. Data Extraction
3. Transformation and Integration of Data
4. Load of Data
5. Data Warehouse Refreshment
6. Summary


Motivation. OLAP queries are usually performed in a separate system, i.e., in a data warehouse. Data warehouses combine data from multiple sources, so the data must be translated into a consistent format. Data integration represents about 80% of the effort in a typical data warehouse project!

ETL process. ETL = Extraction, Transformation, and Load:
extraction of data from source systems,
transformation and integration of data into a format useful for analysis,
load of data into the warehouse and building of additional structures.
Refreshment of the data warehouse is closely related to the ETL process. The ETL process is described by metadata stored in the data warehouse. Architecture of data warehousing: data sources → data staging area → data warehouse.
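The three phases above can be sketched as a minimal pipeline. Everything below (the source rows, the staging logic, the tiny dictionary standing in for the warehouse) is an illustrative assumption, not part of any real ETL tool.

```python
# Minimal ETL sketch: extract rows from a source, transform them into a
# consistent format, then load them into the warehouse.
# All names (source_rows, warehouse) are illustrative assumptions.

source_rows = [
    {"id": 1, "amount": "100.50", "date": "31-12-2014"},   # dd-mm-yyyy
    {"id": 2, "amount": "200.00", "date": "2015-01-15"},   # already ISO
]

def extract():
    # In practice: read from operational databases, files, and data services.
    return list(source_rows)

def transform(rows):
    # Bring every row to one consistent format (ISO dates, numeric amounts).
    staged = []
    for r in rows:
        d = r["date"]
        if "-" in d and len(d.split("-")[0]) == 2:          # dd-mm-yyyy
            day, month, year = d.split("-")
            d = f"{year}-{month}-{day}"
        staged.append({"id": r["id"], "amount": float(r["amount"]), "date": d})
    return staged

def load(rows, warehouse):
    # A real load would also build indexes, aggregates, etc.
    for r in rows:
        warehouse[r["id"]] = r

warehouse = {}
load(transform(extract()), warehouse)
```

In a real system each phase would be a separate, monitored job reading from and writing to a staging area rather than in-memory lists.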

(Slide: diagram of the ETL process.)

ETL tools support: data extraction from heterogeneous data sources; data transformation, integration, and cleansing; data quality analysis and control; data loading; high-speed data transfer; data refreshment; managing and analyzing metadata. Examples of ETL tools: MS SQL Server Integration Services (SSIS), IBM InfoSphere DataStage, SAS ETL Studio, Oracle Warehouse Builder, Oracle Data Integrator, Business Objects Data Integrator, Pentaho Data Integration.

(Slides: screenshots of MS SQL Server Integration Services (SSIS).)


Data extraction. A data warehouse needs extraction of data from different external data sources: operational databases (relational, hierarchical, network, etc.), files of standard applications (Excel, COBOL applications), additional databases (direct marketing databases) and data services (stock data), and other documents (.doc, XML, WWW).

Data sources. Data sources are often operational systems, providing the lowest level of data. They are designed for operational use, not for decision support, and the data reflect this fact. Multiple data sources often come from different systems, run on a wide range of hardware, and much of their software is built in-house or highly customized. Data sources can be designed using different logical structures.

Data extraction. Identification of concepts and objects does not have to be easy. Example: extract information about sales from the source system. What is meant by the term "sale"? A sale may be considered to have occurred when the order has been received from a customer, when the order is sent to the customer, or when the invoice has been raised against the order. It is a common problem that there is no SALES table in the operational databases; other tables may exist instead, such as ORDER with an attribute ORDER_STATUS.
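When sales must be derived from an ORDER table, the chosen business definition of "sale" becomes a parameter of the extraction. The sketch below assumes a hypothetical table layout and status values; neither is prescribed by any real system.

```python
# Hypothetical ORDER rows; there is no SALES table, only an ORDER_STATUS.
orders = [
    {"order_id": 1, "amount": 250.0, "order_status": "RECEIVED"},
    {"order_id": 2, "amount": 120.0, "order_status": "SHIPPED"},
    {"order_id": 3, "amount": 300.0, "order_status": "INVOICED"},
]

def extract_sales(rows, sale_statuses=frozenset({"INVOICED"})):
    # The business definition of "sale" is a parameter: depending on what
    # the organization agrees a sale means, sale_statuses could also
    # include "SHIPPED" or "RECEIVED".
    return [r for r in rows if r["order_status"] in sale_statuses]

sales = extract_sales(orders)
```

Making the definition explicit, rather than burying it in extraction code, keeps the warehouse's notion of "sale" auditable when the business rule changes.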

Change monitoring. Change monitoring is directly connected with data warehouse refreshment: we must detect changes to an information source. There are different monitoring techniques, external and intrusive. Monitoring techniques: snapshot vs. timestamped sources; queryable, logged, and replicated sources; callback and internal action sources.
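For a snapshot source with no timestamps, log, or callbacks, change detection reduces to differencing two snapshots. A minimal sketch, with purely illustrative data:

```python
def diff_snapshots(old, new):
    # old and new map a key to a row; return what was inserted, deleted,
    # and updated between the two snapshots.
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates

old = {1: {"name": "Jan"}, 2: {"name": "Anna"}}
new = {2: {"name": "Anna Nowak"}, 3: {"name": "Piotr"}}
ins, dele, upd = diff_snapshots(old, new)
```

Timestamped or logged sources avoid this full comparison, which is why they are preferred when the source system offers them.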


Transformation and integration of data. This is the most important part of data warehousing. This phase consists in removing all inconsistencies and redundancies in the data coming to the warehouse from operational data sources. The data are made to conform to the conceptual schema used by the warehouse. Integration concerns both data and data schemas, and takes place at different levels: schema, table, tuple, and attribute values.

Conflicts and dirty data:
different logical models of operational sources,
different data types (an account number stored as a string or as a number),
different data domains (gender: M/F, male/female, 1/0),
different date formats (dd-mm-yyyy or mm-dd-yyyy),
different field lengths (an address stored using 20 or 50 characters),
different naming conventions: homonyms and synonyms,
semantic conflicts, when the same objects are modeled on different logical levels,
structural conflicts, when the same concepts are modeled using different structures.
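The domain and format conflicts above are typically resolved by normalization rules applied during transformation. A sketch, where the gender mapping and the list of accepted date formats are assumptions chosen for illustration:

```python
# Normalizing conflicting representations from different sources:
# gender domains (M/F, male/female, 1/0) and date formats.
# The mapping of 1 to "M" and 0 to "F" is an illustrative assumption.
from datetime import datetime

GENDER_MAP = {"m": "M", "male": "M", "1": "M",
              "f": "F", "female": "F", "0": "F"}

def normalize_gender(value):
    return GENDER_MAP[str(value).strip().lower()]

def normalize_date(value):
    # Try each known source format in turn. Note that dd-mm-yyyy vs.
    # mm-dd-yyyy is ambiguous for days <= 12, so in practice each source
    # must declare which format it uses.
    for fmt in ("%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unknown date format: {value}")
```

The ambiguity note matters: no amount of code can decide whether "05-04-2015" is April 5 or May 4; that is metadata about the source, not a property of the value.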

Conflicts and dirty data (continued):
different entries for the same attribute (state name or abbreviation),
text fields that hide information (a contact person's name given in the company name field),
wrong attribute entries (the name attribute contains a company name or a contact person's name),
inconsistent information concerning the same object,
information concerning the same object but indicated by different keys,
missing values,
...

Data cleansing. We would like to analyze high-quality data, since our goal is to support decision making.

Data cleansing techniques:
conversion and normalization methods (date formats such as dd/mm/yyyy, naming conventions such as "Jan Kowalski"),
parsing text fields in order to identify and isolate data elements:
transformation (splitting the text into records, e.g., {title = mgr, first name = Jan, last name = Kowalski}),
standardization ("Jan Kowalski, magister" → "mgr Jan Kowalski"),
dictionary-based methods (databases of names, geographical places, pharmaceutical data),
domain-specific knowledge methods to complete data (postal codes),
rationalization of data ("PHX323RFD110A4" → print paper, A4 format),
rule-based cleansing (e.g., replace "gender" by "sex"),
cleansing by using data mining.
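The parsing and standardization steps above can be sketched for the name example from the slide. The title dictionary and the standard form "mgr" for "magister" follow the slide's example; everything else is an illustrative assumption.

```python
# Parse a free-text name field into {title, first name, last name} and
# standardize the title, as in the "Jan Kowalski, magister" example.
TITLES = {"mgr", "magister", "dr", "prof."}
TITLE_STANDARD = {"magister": "mgr"}   # map long forms to the standard one

def parse_name(text):
    parts = text.replace(",", " ").split()
    title = ""
    names = []
    for p in parts:
        key = p.lower()
        if key in TITLES:
            title = TITLE_STANDARD.get(key, key)
        else:
            names.append(p)
    return {"title": title,
            "first_name": names[0] if names else "",
            "last_name": names[-1] if len(names) > 1 else ""}
```

Real cleansing tools use much larger dictionaries and handle middle names, suffixes, and company names, but the structure (tokenize, classify tokens, standardize) is the same.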

Data cleansing techniques. Deduplication ensures that one accurate record exists for each business entity represented in a database. Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications and can also support cost-saving measures in direct advertising.
Example: consider the following rows in a database:
Tim Jones | 123 Main Street | Marlboro | MA | 12234
T. Jones | 123 Main St. | Marlborogh | MA | 12234
Timothy Jones | 321 Maine Street | Marlborog | AM | 12234
Jones, Timothy | 123 Maine Ave | Marlborough | MA | 13324
Sales of around $500 are counted for each tuple. Is it the same person?
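A rough sketch of how such rows are grouped into duplicate candidates: normalize the address text, then block on a crude key (street number + ZIP). All data and rules below are illustrative; production matching uses far richer similarity measures.

```python
# Crude deduplication/householding sketch: normalize addresses, then group
# rows by a blocking key. The abbreviation table is a tiny sample.
import re

ABBREV = {"st": "street", "ave": "avenue"}

def normalize(text):
    # Lowercase, strip punctuation, expand common abbreviations.
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

def blocking_key(row):
    # Same street number + ZIP is a crude candidate key for one household.
    number = normalize(row["address"]).split()[0]
    return (number, row["zip"])

rows = [
    {"name": "Tim Jones", "address": "123 Main Street", "zip": "12234"},
    {"name": "T. Jones", "address": "123 Main St.", "zip": "12234"},
    {"name": "Timothy Jones", "address": "321 Maine Street", "zip": "12234"},
]

groups = {}
for r in rows:
    groups.setdefault(blocking_key(r), []).append(r)
```

Blocking only narrows the candidate pairs; a real tool would then score each pair (name similarity, address similarity) before merging records.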


Load of data. After extracting, cleansing, and transforming, data must be loaded into the warehouse. Loading the warehouse includes additional processing tasks: checking integrity constraints, sorting, summarizing, creating indexes, etc. Batch (bulk) load utilities are used for loading. A load utility must allow the administrator to monitor status; to cancel, suspend, and resume a load; and to restart after a failure with no loss of data integrity.

Load of data. Loads concern very large data volumes, and sequential loads can take a very long time. A full load can be treated as a single long batch transaction that builds up a new database. Using checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint.
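The checkpointing idea can be sketched as follows. The in-memory checkpoint dictionary is an illustrative stand-in; a real utility persists the checkpoint so a crashed process can resume.

```python
# Checkpointed bulk-load sketch: load in batches, record the last committed
# batch, and resume from the checkpoint after a failure.

def load_with_checkpoints(rows, warehouse, checkpoint, batch_size=2):
    start = checkpoint.get("next_batch", 0)        # resume point
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    for i in range(start, len(batches)):
        for row in batches[i]:
            warehouse.append(row)                  # stand-in for a bulk insert
        checkpoint["next_batch"] = i + 1           # commit: safe restart point

warehouse, checkpoint = [], {}
load_with_checkpoints([1, 2, 3, 4, 5], warehouse, checkpoint)
```

Re-running the function with the same checkpoint loads nothing twice, which is exactly the "restart with no loss of data integrity" property required of a load utility.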


Data warehouse refreshment. Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse. Refreshment follows the same structure as the ETL process, but is subject to several constraints: accessibility of data sources, size of data, size of the data warehouse, frequency of data refreshing, and degradation of the performance of operational systems.

Data warehouse refreshment consists of three steps: detect changes in external data sources; extract the changes and integrate them into the warehouse; update indexes, subaggregates, and materialized views.
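The last two steps can be sketched together: apply captured changes to the warehouse and keep a simple subaggregate in sync. All structures (the change list, the per-product totals) are illustrative assumptions.

```python
# Refreshment sketch: apply captured source changes to the warehouse and
# incrementally maintain a subaggregate (total amount per product).

def refresh(warehouse, totals, changes):
    # changes: list of ("insert"/"delete", row) pairs captured from sources.
    for op, row in changes:
        key = (row["product"],)
        if op == "insert":
            warehouse.append(row)
            totals[key] = totals.get(key, 0) + row["amount"]
        elif op == "delete":
            warehouse.remove(row)
            totals[key] -= row["amount"]

warehouse, totals = [], {}
refresh(warehouse, totals, [("insert", {"product": "A", "amount": 100})])
refresh(warehouse, totals, [("insert", {"product": "A", "amount": 50}),
                            ("delete", {"product": "A", "amount": 100})])
```

Maintaining the aggregate incrementally, instead of recomputing it from scratch, is what makes refreshment cheaper than a full reload.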

Data warehouse refreshment can be periodical (daily or weekly) or immediate. The choice is determined by usage, the types of data sources, etc.

Refreshment vs. load. Refreshment can be asynchronous, whereas a load requires long-term access to source data. Refreshment concerns less data and is therefore faster.


Summary. The ETL process is a strategic element of data warehousing. Main concepts: extraction, transformation and integration, load, data warehouse refreshment, and metadata. New technologies are still emerging in this area.

Bibliography
C. Todman. Designing a Data Warehouse: Supporting Customer Relationship Management. Prentice Hall PTR, 2001.
M. Jarke, Y. Vassiliou, P. Vassiliadis, and M. Lenzerini. Fundamentals of Data Warehouses. Springer-Verlag New York, second edition, 1999.
R. Kimball and M. Ross. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons, third edition, 2013.
www.kimballuniversity.com