Data-Warehouse-, Data-Mining- und OLAP-Technologien




Data-Warehouse-, Data-Mining- und OLAP-Technologien. Chapter 4: Extraction, Transformation, Load. Bernhard Mitschang, Universität Stuttgart, Winter Term 2014/2015

Overview
- Monitoring
- Extraction: Export, Import, Filter, Load; Direct Integration
- Load: Bulk Load; Replication; Materialized Views
- Transformation: Schema Integration; Data Integration; Data Cleansing
- Tools

Architecture. [Diagram] The reference architecture shows the data flow from the data sources through the monitor and extraction components into the data staging area, transformation within the staging area, and load into the data warehouse, up to end-user data access. The data warehouse manager controls the flow; a metadata manager maintains the metadata repository. (Source: [BG04])

Monitoring. Goal: discover changes in a data source incrementally. Approaches:
- Trigger: based on triggers defined in the source DBMS; a trigger writes a copy of the changed data to files.
- Replica: uses the replication support of the source DBMS; replication provides the changed rows in a separate table.
- Timestamp: a timestamp is assigned to each row and is used to identify changes (supported by temporal DBMSs).
- Log: the log of the source DBMS is read.
- Snapshot: periodic snapshots of the data source are taken; changes are identified by comparing snapshots.
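The timestamp approach can be sketched in a few lines of Python (the row layout and column names are invented for illustration): the extractor remembers the time of its last run and fetches only rows whose modification timestamp is newer.

```python
from datetime import datetime

def extract_changes(rows, last_extract):
    """Timestamp-based monitoring: return only the rows modified after the
    previous extraction run (hypothetical layout: dicts with a 'modified' field)."""
    return [r for r in rows if r["modified"] > last_extract]

rows = [
    {"key": 100, "data": "A", "modified": datetime(2014, 11, 1)},
    {"key": 201, "data": "X", "modified": datetime(2014, 12, 5)},
]
changed = extract_changes(rows, last_extract=datetime(2014, 12, 1))
```

In a real source system the filter would be pushed into the query (WHERE modified > :last_extract) rather than applied after reading all rows.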

Snapshot Differentials. Given two snapshot files F1 and F2 (F1 taken before F2) whose records contain key fields and data fields, the goal is to provide the UPDATEs/INSERTs/DELETEs in a snapshot differential file F_out. Example: F1 = (200 A, 100 A, 201 U, 300 Z) and F2 = (100 A, 201 X, 400 Y, 202 U) yield F_out = (DELETE 200 A, UPDATE 201 X, INSERT 202 U, DELETE 300 Z, INSERT 400 Y).
- Sort Merge Outerjoin: sort F1 and F2 on their keys, then read the sorted files F1' and F2' and compare records. Snapshot files may be compressed, but the snapshots are read multiple times.
- Window Algorithm: maintain a moving window of records in memory for each snapshot (aging buffer); assumes that matching records are "physically" nearby; reads each snapshot only once.
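The sort-merge-outerjoin variant can be sketched in Python (a minimal in-memory illustration, not the file-based, possibly compressed implementation of [LG96]); it reproduces the F_out of the example above:

```python
def snapshot_diff(f1, f2):
    """Snapshot differential via sort-merge outerjoin.
    f1, f2: lists of (key, data) records; f1 is the older snapshot."""
    a = sorted(f1)          # F1': old snapshot sorted on key
    b = sorted(f2)          # F2': new snapshot sorted on key
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        (k1, d1), (k2, d2) = a[i], b[j]
        if k1 == k2:                          # key in both snapshots
            if d1 != d2:
                out.append(("UPDATE", k2, d2))
            i += 1; j += 1
        elif k1 < k2:                         # key only in the old snapshot
            out.append(("DELETE", k1, d1)); i += 1
        else:                                 # key only in the new snapshot
            out.append(("INSERT", k2, d2)); j += 1
    out += [("DELETE", k, d) for k, d in a[i:]]
    out += [("INSERT", k, d) for k, d in b[j:]]
    return out

# The slide's example snapshots:
f1 = [(200, "A"), (100, "A"), (201, "U"), (300, "Z")]
f2 = [(100, "A"), (201, "X"), (400, "Y"), (202, "U")]
diff = snapshot_diff(f1, f2)
```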

Window Algorithm
INPUT: F1, F2, n; OUTPUT: F_out /* the snapshot differential */
(1)  Input Buffer 1 <- read n blocks from F1
(2)  Input Buffer 2 <- read n blocks from F2
(3)  while (Input Buffer 1 not empty) and (Input Buffer 2 not empty):
(4)    match Input Buffer 1 against Input Buffer 2
(5)    match Input Buffer 1 against Aging Buffer 2
(6)    match Input Buffer 2 against Aging Buffer 1
(7)    put contents of Input Buffer 1 into Aging Buffer 1
(8)    put contents of Input Buffer 2 into Aging Buffer 2
(9)    Input Buffer 1 <- read n blocks from F1
(10)   Input Buffer 2 <- read n blocks from F2
(11) report remaining records in Aging Buffer 1 as DELETEs
(12) report remaining records in Aging Buffer 2 as INSERTs
The aging buffers are organized as age queues (TAIL to HEAD); in case of a buffer overflow, the oldest record is replaced and reported (e.g., as deleted for Aging Buffer 1). (Source: [LG96])
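A simplified in-memory sketch of the window algorithm in Python, with dictionaries standing in for the block-oriented buffers of [LG96] (batch size n and aging-buffer capacity are parameters; this is an illustration of the idea, not the paper's implementation):

```python
from collections import OrderedDict

def window_diff(f1, f2, n=2, window=4):
    """Single pass over both snapshots; aging buffers hold unmatched records.
    Assumes matching records are physically nearby in the two files."""
    out = []
    aging1 = OrderedDict()   # unmatched F1 records: candidates for DELETE
    aging2 = OrderedDict()   # unmatched F2 records: candidates for INSERT

    def evict_oldest(buf, tag):
        key, data = buf.popitem(last=False)      # overflow: oldest falls out
        out.append((tag, key, data))

    pos1 = pos2 = 0
    while pos1 < len(f1) or pos2 < len(f2):
        in1 = dict(f1[pos1:pos1 + n]); pos1 += n     # read n records from F1
        in2 = dict(f2[pos2:pos2 + n]); pos2 += n     # read n records from F2
        # match input buffer 1 against input buffer 2 and aging buffer 2
        for key in list(in1):
            new = in2.pop(key, None) if key in in2 else aging2.pop(key, None)
            if new is not None:
                if new != in1[key]:
                    out.append(("UPDATE", key, new))
                del in1[key]
        # match remaining input buffer 2 against aging buffer 1
        for key in list(in2):
            old = aging1.pop(key, None)
            if old is not None:
                if old != in2[key]:
                    out.append(("UPDATE", key, in2[key]))
                del in2[key]
        # unmatched records enter the aging buffers (oldest evicted on overflow)
        for key, data in in1.items():
            aging1[key] = data
            if len(aging1) > window:
                evict_oldest(aging1, "DELETE")
        for key, data in in2.items():
            aging2[key] = data
            if len(aging2) > window:
                evict_oldest(aging2, "INSERT")
    # leftover aging-buffer contents are reported as deletes / inserts
    out += [("DELETE", k, d) for k, d in aging1.items()]
    out += [("INSERT", k, d) for k, d in aging2.items()]
    return out

f1 = [(200, "A"), (100, "A"), (201, "U"), (300, "Z")]
f2 = [(100, "A"), (201, "X"), (400, "Y"), (202, "U")]
changes = window_diff(f1, f2)
```

Unlike the sort-merge variant, the output order depends on the physical order of the files, and a record evicted too early may be reported as a spurious delete/insert pair; that is the price of reading each snapshot only once.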


ETL Processing
- Extraction: move the source data into the data staging area.
- Transformation: move data within the data staging area as part of the transformations.
- Load: move data from the data staging area to the data warehouse database, and move data onward to the data marts.
Requirements: files vs. tables; copy vs. transformation; same system vs. heterogeneous systems; automatic vs. user-driven; availability of the source and target systems.

Extraction
- Heterogeneous source systems: support of monitoring and extraction varies, ranging over replica, active DB/trigger, snapshot, export/DB dump, and logging, down to no support at all.
- Accessing the data sources: via the application (application API), via a database API, or via log files.
- Performance: extract the current data within a limited time frame; the service of the source system should not be restricted.

Export and Import. Data is exported from the source system, compressed, transferred over the network, decompressed, and imported into the target system. Export produces ASCII files or a proprietary format; import uses the import command or a bulk load. Example (DB2):

EXPORT TO c:\cust_berlin_i\cust.data OF DEL MODIFIED BY COLDEL
  MESSAGES c:\cust_berlin_i\msg1.txt
  SELECT * FROM customer_data WHERE new_customer = true

Import vs. Load (DB2):

IMPORT FROM c:\cust_berlin_i\cust.data OF DEL MODIFIED BY COLDEL
  COMMITCOUNT 1000 MESSAGES c:\cust_berlin_i\msg2.txt
  INSERT INTO cust_berlin_1

LOAD FROM c:\cust_berlin_i\cust.data OF DEL MODIFIED BY COLDEL
  SAVECOUNT 1000 MESSAGES c:\cust_berlin_i\msg3.txt
  REPLACE INTO cust_berlin_1 STATISTICS YES

Comparison:
- COMMIT: IMPORT explicit; LOAD automatic
- Logging: IMPORT complete and mandatory; LOAD optional
- Integrity: IMPORT checks all constraints; LOAD checks local constraints only
- Triggers: IMPORT fires all triggers; LOAD fires none
- Locking: IMPORT permits read access; LOAD takes a table lock

Filter and Load. SQL*Loader (Oracle) reads input datafiles under a loader control file and writes database tables and indexes, a log file, bad files, and discard files. Bulk load tools provide some filter and transformation functionality:
- load data from multiple datafiles during the same load session
- load data into multiple tables during the same load session
- specify the character set of the data
- selectively load data, i.e., load records based on the records' values
- manipulate the data before loading it, using SQL functions
- generate unique sequential key values in specified columns

Direct Integration
- External tables: register external data as tables in the database system; allows SQL to read the external data; no query optimization.
- Table functions: user-defined functions (in Java, C, C++, ...) that read data from an external source and provide a table as their result.
- Federated database: register external data as tables and define a wrapper for the access to each external data source; the capabilities of the external source can be exploited for query optimization.
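A table function can be mimicked in Python as a generator that reads an external source and yields rows a query processor could treat like a table (CSV is chosen here as an arbitrary example of an external format; the column names are invented):

```python
import csv
import io

def external_customers(source_text):
    """Sketch of a table function: read an external (CSV) source and yield
    one row per record, like a table in the FROM clause."""
    for row in csv.DictReader(io.StringIO(source_text)):
        yield {"key": int(row["key"]), "name": row["name"]}

data = "key,name\n100,Ortmann\n101,Martin\n"
rows = list(external_customers(data))
```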


Load. Transfer data from the data staging area into the data warehouse and the data marts; bulk load is used to move huge amounts of data. Data has to be added to existing tables: add rows, or replace rows or values. A flexible insert mechanism is needed to add and update rows based on a single data source, and to add rows from a single data source to multiple tables in the data warehouse. To consider complex criteria in load processing, write an application program or use the procedural extensions of SQL.

Update and Insert. Insert semantics of the DB2 IMPORT command:

IMPORT FROM c:\cust_berlin_i\cust.data OF DEL MODIFIED BY COLDEL
  COMMITCOUNT 1000 MESSAGES c:\cust_berlin_i\msg2.txt
  INSERT INTO cust_berlin_1

- INSERT: adds the imported data to the table without changing the existing table data.
- INSERT_UPDATE: adds rows of imported data to the target table, or updates existing rows (of the target table) with matching primary keys.
- REPLACE: deletes all existing data from the table by truncating the data object, and inserts the imported data; the table definition and the index definitions are not changed.
- REPLACE_CREATE: if the table exists, deletes all existing data from the table by truncating the data object and inserts the imported data without changing the table definition or the index definitions; if the table does not exist, creates the table and index definitions as well as the row contents.

MERGE. The "transaction table" (cust_berlin_i) contains updates to existing rows in the data warehouse and/or new rows that should be inserted. The MERGE statement of SQL:2003 allows to update rows that have a matching counterpart in the master table and to insert rows that do not have a matching counterpart.

customer (custkey, Name, Address): (100, Ortmann, Rauchstr.), (101, Martin, Pariser Platz), (105, Fagiolo, Hiroshimastr.), (106, Byrt, Lassenstr.)
cust_berlin_i (key, Name, Address): (100, Ortmann, Rauchstrasse 1), (101, Martin, Pariser Platz), (102, Torry, Wilhelmstr.)

MERGE INTO customer AS c1
  USING ( SELECT key, name, address, ... FROM cust_berlin_i WHERE ... ) AS c2
  ON ( c1.custkey = c2.key )
  WHEN MATCHED THEN
    UPDATE SET c1.address = c2.address
  WHEN NOT MATCHED THEN
    INSERT (custkey, name, address, ...) VALUES (key, name, address, ...)   (DB2)
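The effect of the MERGE statement can be sketched in Python as an upsert over dictionaries (a minimal illustration of the matched/not-matched semantics, using the slide's example rows):

```python
def merge(master, txn):
    """Emulate MERGE semantics: rows of the transaction table update matching
    master rows (by key) or are inserted if no match exists.
    master, txn: dicts mapping key -> (name, address)."""
    for key, (name, address) in txn.items():
        if key in master:                         # WHEN MATCHED: update address
            old_name, _ = master[key]
            master[key] = (old_name, address)
        else:                                     # WHEN NOT MATCHED: insert row
            master[key] = (name, address)
    return master

customer = {100: ("Ortmann", "Rauchstr."), 101: ("Martin", "Pariser Platz"),
            105: ("Fagiolo", "Hiroshimastr."), 106: ("Byrt", "Lassenstr.")}
cust_berlin_i = {100: ("Ortmann", "Rauchstrasse 1"),
                 101: ("Martin", "Pariser Platz"),
                 102: ("Torry", "Wilhelmstr.")}
merge(customer, cust_berlin_i)
```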

Multiple Inserts (Oracle). Insert rows (partly) into several target tables; the same row may be inserted several times; conditions can be defined that select the target table. INSERT FIRST defines that each row is inserted only once (into the first matching target); INSERT ALL inserts it into every matching target.

Example data: cust_berlin_i (key, Name, Address): (100, Ortmann, Rauchstrasse 1), (101, Martin, Pariser Platz), (102, Torry, Wilhelmstr.); targets customer (custkey, Name, Address) and location (Street, Town).

INSERT ALL
  INTO customer VALUES (key, name, address)
  INTO location VALUES (address, 'Berlin')
SELECT * FROM cust_berlin_i WHERE ...

INSERT ALL /* or: INSERT FIRST */
  WHEN key < 100 THEN INTO customer VALUES (key, name, address)
  WHEN key < 1000 THEN INTO location VALUES (address, 'Berlin')
SELECT * FROM cust_berlin_i WHERE ...
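The routing behavior of INSERT ALL vs. INSERT FIRST can be sketched in Python (conditions and projections are per-target functions, tables are plain lists; the data is the slide's example):

```python
def insert_conditional(rows, targets, first_only=False):
    """Sketch of a multi-table insert: route each source row into every target
    whose condition holds (INSERT ALL), or only into the first matching target
    (INSERT FIRST). targets: list of (condition, table, projection)."""
    for row in rows:
        for cond, table, project in targets:
            if cond(row):
                table.append(project(row))
                if first_only:
                    break

cust_berlin_i = [(100, "Ortmann", "Rauchstrasse 1"),
                 (101, "Martin", "Pariser Platz"),
                 (102, "Torry", "Wilhelmstr.")]
customer, location = [], []
targets = [
    (lambda r: r[0] < 100,  customer, lambda r: r),                 # WHEN key < 100
    (lambda r: r[0] < 1000, location, lambda r: (r[2], "Berlin")),  # WHEN key < 1000
]
insert_conditional(cust_berlin_i, targets)   # INSERT ALL semantics
```

With these conditions no key is below 100, so customer stays empty and every address lands in location; with first_only=True each row would still go to at most one target.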

Replication. Typical replication scenarios between source and target systems:
- data distribution: updates at a single source system are propagated to several target systems, which serve read (SELECT) access
- data consolidation: data from several source systems is collected into a single target system
- update anywhere: every system accepts both queries and updates, and updates are propagated to all other replicas

Materialized Views. Data marts provide extracts of the data warehouse for a specific application, and applications often need aggregated data. Materialized views (MVs) allow to define the content of each data mart as views on data warehouse tables (dependent data marts) and to automatically update the content of a data mart. Important issues: MV selection, MV refresh, MV usage.

Materialized Views (example). Data Mart 1 needs the number of open orders per customer (DM1_Orders: CNR, Open, Ok); Data Mart 2 needs the number of products on order (DM2_Orders: ANR, Count). Both are derived from the warehouse table DW_Orders (CNR, ANR, SNR, Count, Status) and refreshed from it:

DW_Orders: (7017, 100101, 1, 5, open), (7098, 100104, 1, 3, ok), (7564, 210704, 1, 12, ok), (7017, 100105, 2, 1, open), (7017, 210204, 2, 1, open), (7017, 100101, 2, 1, open), (7537, 120113, 2, 1, open), (7456, 120113, 2, 1, open)
DM2_Orders: (100101, 6), (100104, 3), (210704, 12), (100105, 1), (210204, 1), (120113, 2)

Materialized views are created like views, and a strategy for refreshing has to be specified: DEFERRED (use the REFRESH TABLE statement) or IMMEDIATE (as part of each update of the source table). Example (DB2):

CREATE TABLE DM2_Orders AS
  ( SELECT ANR, SUM(Count) FROM DW_Orders GROUP BY ANR )
  DATA INITIALLY DEFERRED REFRESH DEFERRED;

REFRESH TABLE DM2_Orders;
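A deferred refresh of DM2_Orders amounts to recomputing the aggregate from DW_Orders; a Python sketch using the slide's rows:

```python
from collections import defaultdict

DW_Orders = [  # (CNR, ANR, SNR, Count, Status)
    (7017, 100101, 1, 5, "open"), (7098, 100104, 1, 3, "ok"),
    (7564, 210704, 1, 12, "ok"),  (7017, 100105, 2, 1, "open"),
    (7017, 210204, 2, 1, "open"), (7017, 100101, 2, 1, "open"),
    (7537, 120113, 2, 1, "open"), (7456, 120113, 2, 1, "open"),
]

def refresh_dm2(dw_orders):
    """Recompute the materialized view:
    SELECT ANR, SUM(Count) FROM DW_Orders GROUP BY ANR."""
    dm2 = defaultdict(int)
    for cnr, anr, snr, count, status in dw_orders:
        dm2[anr] += count
    return dict(dm2)

DM2_Orders = refresh_dm2(DW_Orders)
```

A deferred strategy reruns this computation on demand; an immediate strategy would instead apply each single update of DW_Orders incrementally to the view.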


Transformation. Convert the data into something representable to the users and valuable to the business. Transformation concerns structure and content:
- Semantics: identify the proper semantics
- Structure: schema integration
- Data: data integration and data cleansing

Transformation: Semantics. Information on the same object is covered by several data sources; e.g., customer information is provided by several source systems. Identify synonyms (car / automobile, student / pupil, baby / infant) and identify homonyms (cash / cache, bare / bear, sight / site). Identifying the proper semantics depends on the context: users have to define the proper semantics for the data warehouse, and the semantics are described in the metadata repository.

Schema Integration. Schema integration is the activity of integrating the schemata of various sources to produce a homogeneous description of the data of interest. Properties of the integrated schema: completeness, correctness, minimality, understandability. Steps of schema integration: pre-integration, schema comparison, schema conforming, schema merging and restructuring.

Pre-Integration. Analysis of the schemata to decide on the general integration policy. Decide on: the schemata to be integrated, the order of integration (the integration process), and preferences.

Schema Integration Process. The source schemata can be integrated in a binary fashion (as a left-deep tree or a balanced tree, producing intermediate schemata) or in an n-ary fashion (one shot or iterative), until the target schema is obtained.

Schema Matching. Goal: take two schemas as input and produce a mapping between those elements of the two schemas that correspond semantically to each other. Schema matching is typically performed manually, supported by a graphical user interface, and is therefore tedious, time-consuming, error-prone, and expensive. General architecture of a generic match component: the schema-related tools (applications) import their schemas into an internal schema representation (ER, XML, ...), so import and export facilities are needed; the generic match implementation uses global libraries (dictionaries, schemas) to help find matches; it only determines match candidates, which the user may accept or reject. (Source: [RB01])
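A toy matcher illustrates why generic match only produces candidates: a simple name-similarity heuristic (Python difflib; the attribute names are invented) proposes plausible and implausible pairs alike, which the user must accept or reject.

```python
from difflib import SequenceMatcher

def match_candidates(schema1, schema2, threshold=0.6):
    """Name-based matching sketch: propose attribute pairs whose normalized
    string similarity reaches the threshold; the user decides on each pair."""
    candidates = []
    for a in schema1:
        for b in schema2:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                candidates.append((a, b, round(score, 2)))
    return sorted(candidates, key=lambda t: -t[2])

cands = match_candidates(["CustKey", "Name", "Address"],
                         ["key", "cust_name", "addr"])
```

Besides the intended pairs (CustKey/key, Name/cust_name, Address/addr), the heuristic also proposes CustKey/cust_name, a candidate a user would reject; real matchers combine name, type, structure, and instance evidence to prune such cases.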

Schema Comparison. Analysis to determine the correlations among the concepts of different schemata and to detect possible conflicts; schema matching is part of this step. Types of conflicts in relational systems, arising at the schema level, between tables, between attributes, or between a table and an attribute:
- naming conflicts: synonyms and homonyms among table and attribute names
- structural conflicts: missing attributes, implied attributes
- conflicting integrity constraints and default values
- conflicting data types and ranges of values

Schema Conforming. Conform and align the schemata to make them compatible for integration. Conflict resolution is based on the application context; it cannot be fully automated, and human intervention is supported by graphical interfaces. Sample steps: attributes vs. entity sets, composite primary keys, redundancies, simplification.

Schema Conforming (examples). [Diagram: ER examples for the sample steps: modeling an attribute as an entity set (attribute vs. entity set), resolving composite primary keys, removing redundant relationships (e.g., a relationship implied by a generalization), and simplifying the schema.]

Schema Merging and Restructuring. The conformed schemata are superimposed, thus obtaining a global schema. Main steps: superimpose the conformed schemata; test the result against the quality dimensions (completeness, correctness, minimality, understandability, ...); further transform the obtained schema. The integration transformations lead from the source schemata via intermediate/conformed schemata to the target schema.

Schema Integration in Data Warehousing. Integration in data warehousing comprises schema integration and data integration; schema integration is a prerequisite for data integration and is mainly used for the data staging area. The final data warehouse schema is defined from a global point of view, i.e., it is more than only integrating all source schemata; schema matching between a source schema and the data warehouse schema provides the basis for defining the transformations. Integration in federated systems, in contrast, focuses on schema integration, and there the integrated schema is used directly for querying.

Data Integration
- Normalization / denormalization: depends on the source schema and the data warehouse schema.
- Surrogate keys: keys in the warehouse should not depend on the source system. Example: (customer system 1, local key 107) -> global key 5400345; (1, 109) -> 5401340; (2, 107) -> 4900342; (2, 214) -> 5401340 (the same customer known in both systems receives the same global key).
- Data type conversion: needed if the data types of source attribute and target attribute differ, e.g., character -> date ('MM-DD-YYYY' -> 'DD.MM.YYYY') or character -> integer.
- Coding: text -> coding (e.g., gross sales -> 1, net sales -> 2), coding -> text (e.g., 3 -> price), coding A -> coding B (e.g., 2 -> GS).
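Surrogate key assignment can be sketched as a mapping table from (source system, local key) to a generated global key (Python; the generated key values here are arbitrary small integers, unlike the example numbers above):

```python
class SurrogateKeys:
    """Global keys are independent of the source system: each (system, local
    key) pair maps to one global key, and the same real-world object known in
    two systems can be assigned the same global key."""
    def __init__(self, start=1):
        self.mapping = {}
        self.next_key = start

    def global_key(self, system, local_key, same_as=None):
        if (system, local_key) in self.mapping:
            return self.mapping[(system, local_key)]
        if same_as is not None:          # known duplicate across systems
            key = self.mapping[same_as]
        else:                            # new object: generate a fresh key
            key = self.next_key
            self.next_key += 1
        self.mapping[(system, local_key)] = key
        return key

sk = SurrogateKeys()
k1 = sk.global_key(1, 107)
k2 = sk.global_key(1, 109)
k3 = sk.global_key(2, 107)                    # same local key, different customer
k4 = sk.global_key(2, 214, same_as=(1, 109))  # same customer in both systems
```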

Data Integration (continued)
- Convert strings (standardization): 'Video', ' video', 'VIDEO' -> 'video'; 'Miller, Max' -> 'Max Miller'.
- Convert dates to the date format of the target system: (2004, 05, 31) -> 31.05.2004; (04, 05, 31) -> 31.05.2004; '05/31/2004' -> 31.05.2004.
- Convert measures: inch -> cm; km -> m; mph -> km/h.
- Combine / separate attributes: (2004, 05, 31) -> 31.05.2004; ('video', 1598) -> 'video 1598'.
- Derived attributes: net sales + tax -> gross sales; on_stock - on_order -> remaining.
- Aggregation: sales_per_day -> sales_per_month.
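A few of these conversions can be sketched in Python (only the formats from the examples above are handled; anything else would need additional cases):

```python
from datetime import datetime

def standardize_string(s):
    """'Video', ' video', 'VIDEO' -> 'video'"""
    return s.strip().lower()

def to_target_date(value):
    """Convert the source date representations from the examples to the
    target format 'DD.MM.YYYY'."""
    if isinstance(value, tuple):               # (2004, 5, 31) or (4, 5, 31)
        y, m, d = value
        if y < 100:                            # expand a two-digit year
            y += 2000
        return "%02d.%02d.%04d" % (d, m, y)
    return datetime.strptime(value, "%m/%d/%Y").strftime("%d.%m.%Y")

def inch_to_cm(x):                             # convert measures: inch -> cm
    return x * 2.54
```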

Data Cleansing. The steps, illustrated on the input record "David and Clara Miller, Ste. 116, 13150 Hiway 9, Box 1234, Boulder Crk, Colo 95006":
- Elementizing: identify the fields: first name 1: David; last name 1: Miller; first name 2: Clara; last name 2: Miller; suite: 116; number: 13150; street: Hiway 9; post box: 1234; city: Boulder Crk; state: Colo; zip: 95006.
- Standardizing: standardize format and coding: street: Highway 9; city: Boulder Creek; state: Colorado.
- Verification: are there contradictions? (Here, zip 95006 implies the state California rather than Colorado.) Contradictions should lead to corrections in the source system(s).
- Matching: is 'David Miller' and/or 'Clara Miller' already present in the data warehouse? If so, are there changed fields?
- Householding: 'David Miller' and 'Clara Miller' constitute a household.
- Documenting: record the cleansing steps and decisions.
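The standardizing step can be sketched in Python with lookup tables for non-standard spellings and codes (the tables are illustrative, not exhaustive, and operate on an already elementized record):

```python
STREET_SYNONYMS = {"Hiway": "Highway", "Hwy": "Highway"}
CITY_SYNONYMS = {"Boulder Crk": "Boulder Creek"}
STATE_CODES = {"Colo": "Colorado", "CA": "California"}

def standardize(record):
    """Map non-standard spellings and codes to canonical values."""
    out = dict(record)
    for word, repl in STREET_SYNONYMS.items():
        out["street"] = out["street"].replace(word, repl)
    out["city"] = CITY_SYNONYMS.get(out["city"], out["city"])
    out["state"] = STATE_CODES.get(out["state"], out["state"])
    return out

raw = {"street": "Hiway 9", "city": "Boulder Crk",
       "state": "Colo", "zip": "95006"}
clean = standardize(raw)
```

Verification would then run cross-field checks on the standardized record (e.g., does the zip code belong to the stated state?).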

Dimensions of Data Cleansing
- single source, single record: spelling mistakes, missing values, illegal values, attribute dependencies (contradictions)
- single source, multiple records: primary key / foreign key violations, duplicates / matching, householding
- multiple sources, single record: standardization, coding, contradictions
- multiple sources, multiple records: duplicates / matching, householding

Data Quality dimensions:
- consistency: are there contradictions in the data and/or metadata?
- correctness: do data and metadata provide an exact picture of reality?
- completeness: are there missing attributes or values?
- exactness: are exact numeric values available? Are different objects identifiable (homonyms)?
- reliability: is there a Standard Operating Procedure (SOP) that describes the provision of the source data?
- understandability: does a description of the data and the coded values exist?
- relevance: does the data contribute to the purpose of the data warehouse?

Improving Data Quality. Assumption: various projects can be undertaken to improve the quality of warehouse data. Goal: identify the data quality enhancement projects that maximize the value to the users of the data. Tasks for the data warehouse manager:
- determine the organizational activities the data warehouse will support;
- identify all sets of data needed to support these organizational activities;
- estimate the quality of each data set on each relevant data quality dimension;
- identify a set of potential projects (and their costs) that could be undertaken to enhance or affect data quality;
- estimate for each project its likely effect on the quality of the various data sets, by data quality dimension;
- determine for each project, data set, and relevant data quality dimension the change in utility should that project be undertaken.
(Source: [BT99])

Improving Data Quality (model). Indices: I ranges over the organizational activities supported by the data warehouse, J over the data sets, K over the data quality attributes or dimensions, and L over the possible data quality projects P(1), ..., P(S). Given are the current quality CQ(J, K), the required quality RQ(I, J, K), the anticipated quality AQ(J, K, L) after project L, the priority of the organizational activities Weight(I), the cost of data quality enhancement Cost(L), and the added value Utility(I, J, K, L). The value of a project is

  Value(L) = Sum over I, J, K of Weight(I) * Utility(I, J, K, L)

With the decision variable X(L) = 1 if project L is selected and 0 otherwise, the total value from all projects is maximized:

  maximize  Sum over L of X(L) * Value(L)
  subject to the resource constraint     Sum over L of X(L) * Cost(L) <= Budget
  exclusiveness constraints such as      X(P(1)) + X(P(2)) + ... + X(P(S)) <= 1
  interaction constraints such as        X(P(1)) + X(P(2)) + X(P(3)) <= 1

(Source: [BT99])
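For small numbers of projects the selection model can be solved by brute force in Python (a sketch with invented values and costs; the exclusiveness constraint is modeled as groups of mutually exclusive projects):

```python
from itertools import combinations

def select_projects(value, cost, budget, exclusive=()):
    """Choose the subset of projects maximizing total value subject to a
    budget constraint and exclusiveness constraints (at most one project
    selected per exclusive group). Brute-force over all subsets."""
    projects = list(value)
    best, best_value = (), 0
    for r in range(len(projects) + 1):
        for subset in combinations(projects, r):
            if sum(cost[p] for p in subset) > budget:
                continue  # violates the resource constraint
            if any(len(set(subset) & set(group)) > 1 for group in exclusive):
                continue  # violates an exclusiveness constraint
            v = sum(value[p] for p in subset)
            if v > best_value:
                best, best_value = subset, v
    return set(best), best_value

# Hypothetical Value(L) and Cost(L) figures:
value = {"P1": 10, "P2": 7, "P3": 6, "P4": 4}
cost  = {"P1": 8,  "P2": 4, "P3": 3, "P4": 2}
chosen, total = select_projects(value, cost, budget=10,
                                exclusive=[("P1", "P2")])
```

For realistic problem sizes this 0/1 selection would be handed to an integer programming solver rather than enumerated.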


ETL Market Size 2001-2006. [Figure: market size forecast; Source: Forrester Research, 2004]

ETL Market. Vendors come from several different backgrounds and perspectives:
- (A) "Pure play" vendors: ETL represents a core competency and accounts for most of the license revenue; this class of vendors is driving the bulk of innovation and "mind share" in the ETL market.
- (B) Business intelligence vendors: business intelligence tools and platforms are their core competency; for most of these vendors, ETL technology plays a supporting role to their flagship business intelligence offerings, or is one component of a broad offering including business intelligence and ETL.
- (C) Database management system (DBMS) vendors: database vendors have an increasing impact on this market as they continue to bundle ETL functionality closer to the relational DBMS.
- (D) Other infrastructure providers: they provide various types of technical infrastructure components beyond the DBMS; ETL is typically positioned as yet another technical toolset in their portfolios.
(Source: Gartner, 2004)

ETL Vendors/Products Ranked by Market Share Percentages. [Figures: the leading ETL vendors/products ranked by market share, labeled with the vendor classes A-D from the previous slide; Source: Forrester Research, 2004]

ETL Tools. [Figure: overview of ETL tools; Source: META Group, 2004]

Summary. Moving data is part of most steps of the ETL process (extraction, transformation, loading) and of feeding the data warehouse and the data marts. Several approaches are available: export, import, and load; direct integration; replication; materialized views. The transformation steps include semi-automatic schema matching and integration, data integration, and data cleansing.

Papers
[BT99] D. Ballou, G. K. Tayi: Enhancing Data Quality in Data Warehouse Environments. Communications of the ACM, Vol. 42, No. 1, 1999.
[LG96] W. Labio, H. Garcia-Molina: Efficient Snapshot Differential Algorithms for Data Warehousing. Proc. of the 22nd International Conference on Very Large Data Bases, Mumbai (Bombay), India, 1996.
[RB01] E. Rahm, P. Bernstein: A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10:334-350, 2001.