Data Integration and Data Cleaning in DWH




Spring Semester 2010 Data Integration and Data Cleaning in DWH Dr. Diego Milano

Organization Motivation: Data Integration and DWH Data Integration Schema (intensional) Level Instance (extensional) Level: Data Cleaning Building a DWH: Data Integration & Cleaning in DWH Design (Introduction to Data Quality)

Background Knowledge & Tools If you don't master some of these tools, let me know immediately: Database basics: RDBMS concepts Relational model Entity-Relationship Model (and possibly UML) Database design: from a conceptual model to the logical model

What is a DW? A collection of data from different sources Integrated Persistent Dynamically Evolving Focused Used for Decision Support

DWH Operational data (from production/sales OLTP environments) External data (e.g. exchange rates, prices from other sales chains etc.) We focus on what happens here DWH OLAP Data Mining Reporting

Data Integration Given a set of data sources, data integration is the task of presenting them to the user as a single data source. Local Schemas Sources S 1 S 2 S 3... Integrated DB G Global Schema

Two approaches: virtual/materialized Virtual integration: Data stays at the sources; the extension of the global schema is not materialized. Queries on the global schema are answered using data at the sources. Pros/cons: + Updates on the local sources are immediately reflected in the (virtual) integrated DB + No redundancy, no conflicts due to lack of synchronization - Enforcing constraints on the global schema is not always possible - Depending on the relationships between the global and the local schemas, answering queries may be hard (and inefficient) - Propagating updates from the global schema to the local sources is hard - Solving inconsistencies at the extensional level is hard
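A minimal sketch of the virtual approach, with two toy in-memory "sources" standing in for real systems (all data and names here are invented for illustration): the global extension is never stored, so an update on a local source is visible in the very next global query.

```python
# Virtual integration sketch: data stays at the sources; queries on the
# global schema are answered by visiting each source at query time.

source_a = [{"id": 1, "name": "Anna"}]   # e.g. an OLTP system
source_b = [{"id": 2, "name": "Ben"}]    # e.g. an external feed

def query_global(predicate):
    """Answer a query on the (virtual) global schema by scanning each source."""
    return [row for src in (source_a, source_b) for row in src if predicate(row)]

# An update on a local source is reflected immediately, with no refresh step:
source_b.append({"id": 3, "name": "Carla"})
print(query_global(lambda r: r["id"] > 1))  # rows from source_b only
```

Note the trade-off the slide mentions: the query must visit every source at execution time, which is exactly why virtual answering can be inefficient.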

Two approaches: virtual/materialized Materialized integration: Data is copied to a single integrated database. Pros/cons: + Queries on the integrated repository are more efficient + Possible/easier to apply complex transformations to the original data: the integrated schema can be very different from the sources, and instance-level transformations are made easier - The integrated DB goes out of sync with the sources and needs periodic refreshing - Less storage-efficient; potential inconsistencies due to redundancy A Data Warehouse is first of all a data integration system adopting the materialized approach
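The materialized approach can be sketched the same way (toy data; the currency conversion rate is an invented assumption): rows are copied and transformed into an integrated store, which stays stale until the next refresh.

```python
# Materialized integration sketch: the warehouse holds a transformed copy
# of the source; source updates are invisible until refresh() runs again.

source = [{"prod": "tea", "price_chf": 4.0}]
warehouse = []

def refresh():
    """Rebuild the materialized copy, applying a transformation on the way in."""
    warehouse.clear()
    for row in source:
        warehouse.append({"product": row["prod"].upper(),
                          "price_eur": round(row["price_chf"] * 0.95, 2)})  # assumed rate

refresh()
source.append({"prod": "coffee", "price_chf": 5.0})
print(len(warehouse))  # still 1: out of sync until the next refresh()
refresh()
print(len(warehouse))  # 2
```

This is the essence of a DWH load: queries hit the local copy (fast), at the price of redundancy and periodic refreshing.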

Heterogeneity The main issue in data integration tasks is heterogeneity. Data residing at different sources presents differences in a number of aspects, and these differences make it more complex to reduce the data to a single, integrated view. It is not easy to classify heterogeneity in a crisp way. Some differences relate to syntactic aspects (the specific language/technology used to represent reality), others relate to semantic aspects (how a certain representation captures reality, its meaning), but these differences coexist and it is not always easy or possible to draw a line between what is syntax and what is semantics.

Heterogeneity (Systems/Technology/Syntax) Legacy systems (ad-hoc interfaces) Flat files Web-sources XML files/databases Different DBMSs (e.g. RDBMS, OODBMS...) DBMS with the same flavour (e.g. RDBMS) but with differences in proprietary syntax

Heterogeneity (Data Representation) Intensional Level (schema): Data Model (modeling language): Relational, object-oriented, reticular, semi-structured etc. Structure (representation choices): Different designers have different views of the world (and different application needs), and may use different constructs/data types to represent the same concepts/reality: e.g. Date represented as attribute/standalone concept e.g. Attribute 'sex' encoded as String / Acronym / Integer (0,1) Different views of the world include/exclude portions of information: e.g. Record marital status of employees. Linguistics/terminology: Different designers may use different terms to denote the same concept or use the same term to mean different concepts, at various levels: e.g. attribute 'price': $ Data Warehousing (CS242)
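The slide's example of the attribute 'sex' encoded as String / Acronym / Integer can be made concrete with a small sketch (the concrete codes below, e.g. 0 = male, are assumptions for illustration): integration needs a mapping from each source encoding to one canonical one.

```python
# Reconciling different encodings of the same attribute: each source gets
# a conversion function into a canonical acronym encoding ("M"/"F").

CANONICAL = {"male": "M", "female": "F"}

def from_string(value):
    """Source 1 stores full words, possibly with stray case/whitespace."""
    return CANONICAL[value.strip().lower()]

def from_int(value):
    """Source 2 stores integer codes (assumed here: 0 = male, 1 = female)."""
    return "M" if value == 0 else "F"

print(from_string("Female"), from_int(0))  # F M
```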

Heterogeneity (Data Representation) Extensional level (instances) Unmappable or partially mappable domains Non-overlapping domains: e.g. all students in Basel vs. only students enrolled after 2000. Domains with different granularity: e.g. Sales per day/per month Application-specific domains: e.g. custom identifiers (like employee_code, color_code) meaningful only within a certain application domain. Inconsistencies between semantically equivalent instances Due to errors or other Data Quality problems
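For the granularity example (sales per day vs. per month), a sketch with invented figures shows the only safe direction of reconciliation: daily values can be rolled up to months, while disaggregating monthly totals to days is impossible without extra information.

```python
# Rolling daily sales up to the coarser monthly granularity.
from collections import defaultdict

daily_sales = {"2010-03-01": 120.0, "2010-03-15": 80.0, "2010-04-02": 50.0}

monthly = defaultdict(float)
for day, amount in daily_sales.items():
    monthly[day[:7]] += amount   # truncate ISO date YYYY-MM-DD to YYYY-MM

print(dict(monthly))  # {'2010-03': 200.0, '2010-04': 50.0}
```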

Solving heterogeneity issues: Systems/Model level: Wrapper-based architectures Intensional Level: Schema Integration Extensional Level: Instance Identification Instance Reconciliation

Wrapper-based Architectures A wrapper is a piece of software that encapsulates another software system and acts as an interpreter for it. It allows to: Hide technological differences Hide (to a certain extent) model differences, presenting all sources in a single canonical language. [Diagram: three wrappers expose a legacy system, an RDBMS, and XML data in a single canonical model/language]
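A toy sketch of the wrapper idea (hypothetical sources and class names): each wrapper hides the technology of its source and exposes rows in one canonical form, here plain dictionaries, so upper layers never see CSV vs. XML.

```python
# Wrapper sketch: one canonical row format (dicts) over two technologies.
import csv, io
import xml.etree.ElementTree as ET

class CsvWrapper:
    """Wraps a CSV source; hides the file format behind rows()."""
    def __init__(self, text):
        self.text = text
    def rows(self):
        return list(csv.DictReader(io.StringIO(self.text)))

class XmlWrapper:
    """Wraps an XML source; hides the markup behind rows()."""
    def __init__(self, text):
        self.root = ET.fromstring(text)
    def rows(self):
        return [dict(e.attrib) for e in self.root]

sources = [CsvWrapper("id,name\n1,Anna\n"),
           XmlWrapper('<people><p id="2" name="Ben"/></people>')]
all_rows = [r for w in sources for r in w.rows()]  # upper layer sees one model
print(all_rows)
```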

Schema Integration Given n data source schemas L1,..,Ln, integrating them means: Identifying correspondences among them Designing a new, integrated schema G that abstracts over all of them and is possibly tailored to some specific application (e.g. for Data Warehousing) Formally specifying mappings between the integrated schema and the source schemas. There are tools to semi-automatically perform some of the activities in schema integration, but these are mostly research-level prototypes. Schema integration is still a (complex) design task for humans. It requires expertise in database modeling and a deep knowledge of the application domains of the schemas to integrate.

Wrapper-Mediator A mediator interacts with the wrappers and presents to the users a unified global view over the local schemas. [Diagram: mediator with mappings over three wrappers for a legacy system, an RDBMS, and XML data]

Schema Integration Steps 1. Analysis, normalization, abstraction to a common conceptual modeling language 2. Choice of integration strategy 3. Schema Matching: identify relationships among local schemas 4. Schema Alignment: solve conflicts 5. Schema Fusion: create the global schema The result of this process is a mapping between the source schemas and the integrated schema

1. Analysis For each data source in isolation, the designer must acquire a deep understanding of the application domain: In-depth analysis of the schema(s) Interaction with domain experts The result of this phase is a conceptual schema in the canonical language of choice, which: Reflects in the most accurate and complete way possible the domain of interest Is well understood Is well documented

Analysis: Know Your Enemy Gathering knowledge about complex application domains is difficult: Business rules may be secret or not well documented (Cooperative) domain experts are key elements Understanding the IS of an enterprise is difficult: Legacy systems require ad-hoc knowledge (e.g. no database schema, but data in flat files with a custom format) Even if the DB is relational: Software/system documentation is often poor. The domain conceptualization steps that led to a certain database design, and many design choices, may be lost. Reverse-engineering of the logical schemas and associated applications is sometimes required. This might involve:» Normalization: for efficiency reasons, or because of bad design, logical schemas are sometimes denormalized» Inferring constraints: not all constraints of the domain are enforced at the level of the logical schema (e.g. not enforced at all, or enforced only at the application level) Systems are not always well designed, and schemas become old. Sometimes corrections to the schema are required

Analysis, Normalization, Abstraction Before (unnormalized): CREATE TABLE product( cat_desc VARCHAR(255), cat_name VARCHAR(255), cat_code INTEGER, prod_desc VARCHAR(255), prod_name VARCHAR(255), prod_code INTEGER PRIMARY KEY ); After: CREATE TABLE category( cat_desc VARCHAR(255), cat_name VARCHAR(255), cat_code INTEGER PRIMARY KEY ); CREATE TABLE product( prod_desc VARCHAR(255), prod_name VARCHAR(255), prod_code INTEGER PRIMARY KEY, cat_code INTEGER REFERENCES category(cat_code) ); Normalization/correction: the original logical schema is unnormalized AND does not enforce all constraints holding in the application domain. [Conceptual schema: Product (1,1) belongs_to (0,n) Category, each with attributes Name/String, Description/String, Code/Integer]
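The normalization step on this slide can be exercised end to end in SQLite (the sample rows and category values are invented): category attributes are pulled out of the unnormalized product table into their own table, linked back by cat_code.

```python
# Normalizing the slide's product table: split out category, keep the link.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE product_raw(
    cat_desc TEXT, cat_name TEXT, cat_code INTEGER,
    prod_desc TEXT, prod_name TEXT, prod_code INTEGER PRIMARY KEY)""")
con.execute("INSERT INTO product_raw VALUES ('Hot drinks','Drinks',10,'Green tea','Tea',1)")
con.execute("INSERT INTO product_raw VALUES ('Hot drinks','Drinks',10,'Espresso','Coffee',2)")

# One row per category: the repeated cat_* columns collapse under DISTINCT.
con.execute("""CREATE TABLE category AS
               SELECT DISTINCT cat_desc, cat_name, cat_code FROM product_raw""")
# Product keeps only its own attributes plus the foreign-key column.
con.execute("""CREATE TABLE product AS
               SELECT prod_desc, prod_name, prod_code, cat_code FROM product_raw""")

print(con.execute("SELECT COUNT(*) FROM category").fetchone()[0])  # 1
print(con.execute("SELECT COUNT(*) FROM product").fetchone()[0])   # 2
```

(`CREATE TABLE ... AS SELECT` does not carry over key constraints; in a real migration the target tables would be declared with PRIMARY KEY and REFERENCES as in the slide's DDL.)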

2. Choice of Integration Strategy Comparing too many schemas at the same time is not always easy/feasible. [Diagram: integration process strategies — binary (ladder, balanced) vs. n-ary (single step, iterative)]

3. Schema Matching Schemas are comparatively analyzed to identify: common concepts and relationships among them differences and structural/semantic conflicts interschema properties
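As a toy illustration of matching, and emphatically not one of the semi-automatic research tools mentioned earlier, a name-based matcher can propose candidate correspondences between two attribute lists using normalized names plus a small synonym table (all names and synonyms below are invented):

```python
# Toy schema matcher: propose correspondences by comparing normalized names.

SYNONYMS = {"worker": "employee", "dept": "department"}  # assumed domain knowledge

def normalize(name):
    """Lowercase, strip underscores, and map known synonyms to one term."""
    n = name.lower().strip("_")
    return SYNONYMS.get(n, n)

schema1 = ["Employee", "Dept", "Salary"]
schema2 = ["worker", "department", "wage"]

matches = [(a, b) for a in schema1 for b in schema2
           if normalize(a) == normalize(b)]
print(matches)  # [('Employee', 'worker'), ('Dept', 'department')]
```

Note what the toy version misses: Salary/wage are synonyms too, but absent from the table they go unmatched, which is exactly why human designers and domain experts remain in the loop.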

Structural Conflicts on Concepts Book is a common concept Publisher and its relationship to book have a structural conflict: the designers used different language constructs to model the same reality an entity set+relationship in one schema, attributes in the other one Book title ISBN title ISBN Book published_by Publisher Publisher_address Publisher Address Name

Semantic Conflicts on Concepts The attributes Age and Birthdate clearly model two semantically different concepts. However, it is rather easy to solve this conflict because there is an obvious dependency among them. Solving the conflict means being able to restructure one of the schemas (and thus applying some transformation to the data) to make the two concepts identical. Birthdate SSN Citizen SSN Age Citizen
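The dependency between the two attributes makes the conflict solvable by a data transformation: derive Age from Birthdate relative to a reference date. A sketch (the reference date is an assumption; real loads would use the extraction date):

```python
# Deriving Age from Birthdate so both Citizen schemas expose the same attribute.
from datetime import date

REFERENCE = date(2010, 6, 1)  # assumed "as of" date for the derived attribute

def age_from_birthdate(birthdate):
    """Whole years between birthdate and REFERENCE."""
    had_birthday = (REFERENCE.month, REFERENCE.day) >= (birthdate.month, birthdate.day)
    return REFERENCE.year - birthdate.year - (0 if had_birthday else 1)

print(age_from_birthdate(date(1980, 5, 31)))  # 30
print(age_from_birthdate(date(1980, 6, 2)))   # 29
```

The opposite direction (Birthdate from Age) is lossy, which is why the schema with Birthdate is the one to keep and Age the one to derive.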

Pitfalls in language: stat rosa pristina nomine... Homonymy: two concepts have the same name but different semantics Synonymy: two concepts have the same semantics but different names Equivalent, with linguistic conflicts: synonyms Employee Worker Teacher (1,1) (1,1) (1,1) assigned_to assigned_to assigned_to (1,n) (1,n) (1,n) Department Department Department Identical Non-equivalent, homonyms!

Schema Comparison Identity: the concept is modeled in the same way, both from the point of view of structure and that of semantics Equivalence: the concepts have the same semantics (same view of the world) but there are structural conflicts Comparability: concepts are modeled with different structure/semantics, but the views of the world do not conflict Incomparability: the views of the world differ, producing a conflict that is not (easily) solvable

Different, but comparable views Employee Employee (1,1) (1,1) participates_in assigned_to (1,n) (1,n) Project Department (1,1) belongs_to (1,n) Department

Incomparable views The semantics of the two schemas look the same. However, there is a conflict in the integrity constraints which makes the schemas incompatible. Professor Name Professor Name (0,1) (2,n) teaches teaches (1,1) (1,1) Course Course_ID Course Course_ID

Inter-schema properties Schema 1 Schema 2 title title ISBN Book Book ISBN published_by written_by Address Address Name Publisher works_for Author Name

4. Schema Alignment The goal of this phase is to solve the differences/conflicts identified in the previous step. This is obtained by applying transformations to the local schemas: names and types of attributes functional dependencies integrity constraints Issues: Not all conflicts can be solved, e.g. when they derive from substantial differences in how different information systems are designed (how they model the application domain). In this case, users/domain experts must give hints on which interpretation of the world they prefer. In case of uncertainty, priority is given to those schemas which are more important in the system (e.g., for DWH, schemas with central concepts in the data mart)
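An alignment transformation on names and types of attributes can be sketched as a small rewrite rule applied to local rows (the rename and cast tables below are invented examples of such rules):

```python
# Alignment sketch: rewrite local rows so attribute names and types
# match the integrated schema.

RENAME = {"emp_name": "name", "sal": "salary"}  # assumed correspondences
CASTS = {"salary": float}                        # assumed target types

def align(row):
    """Rename attributes, then cast values to the integrated schema's types."""
    renamed = {RENAME.get(k, k): v for k, v in row.items()}
    return {k: CASTS.get(k, lambda x: x)(v) for k, v in renamed.items()}

print(align({"emp_name": "Anna", "sal": "4200"}))  # {'name': 'Anna', 'salary': 4200.0}
```

Conflicts on functional dependencies or integrity constraints have no such mechanical fix; as the slide says, those may need a design decision from domain experts.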

5. Schema Fusion Aligned schemas are merged to obtain a single integrated schema. Overlap common concepts Add all other concepts, connecting them to the common concepts

Alignment and Fusion Alignment and fusion are applied in an iterative way: Solve some conflicts, produce a temporary integrated schema To solve new conflicts, apply transformations either to the schemas or to the temporary integrated schema

Mappings A mapping is a set of assertions about correspondences that hold between two schemas. For very different schemas, mappings are hardly formalizable. As the integration process proceeds, it becomes possible to express relationships about the extensions of the schemas: At the conceptual level, as set relationships At the logical level, as queries (in the simplest case) or as transformations The goal is to link every concept in the integrated schema to some concept in the initial schemas through a chain of transformations
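A logical-level mapping expressed as a query can be sketched in SQLite (toy source schemas and data, invented for illustration): the integrated concept Book is defined as a view over two differently-named source tables.

```python
# Mapping-as-query sketch: global Book(isbn, title) is defined by a view
# that unions one source table with a renamed projection of another.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src1_book(isbn TEXT, title TEXT)")
con.execute("CREATE TABLE src2_volume(code TEXT, name TEXT)")
con.execute("INSERT INTO src1_book VALUES ('1-11','DWH Basics')")
con.execute("INSERT INTO src2_volume VALUES ('2-22','ETL in Practice')")

con.execute("""CREATE VIEW book AS
               SELECT isbn, title FROM src1_book
               UNION
               SELECT code AS isbn, name AS title FROM src2_volume""")
print(con.execute("SELECT COUNT(*) FROM book").fetchone()[0])  # 2
```

The column renamings in the second branch are exactly the "chain of transformations" linking the integrated concept back to the source concepts.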

Questions & Answers