On the Logical Modeling of ETL Processes


Panos Vassiliadis, Alkis Simitsis, and Spiros Skiadopoulos
National Technical University of Athens, Dept. of Electrical and Computer Eng.,
Computer Science Division, Iroon Polytechniou 9, 157 73, Athens, Greece
{pvassil,asimi,spiros}@dbnet.ece.ntua.gr

1 Introduction

Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, and for their cleansing, customization, and insertion into a data warehouse. Research has only recently dealt with this problem, and has provided few models, tools, and techniques to address the issues of the ETL environment [1,2,3,5]. In this paper, we present a logical model for ETL processes. The proposed model is characterized by several templates, representing frequently used ETL activities along with their semantics and their interconnection. In the full version of the paper [4] we present more details on these issues and complement them with results on the characterization of the content of the involved data stores after the execution of an ETL scenario, as well as impact-analysis results in the presence of changes.

2 Logical Model

Our logical model abstracts from the technicalities of monitoring, scheduling, and logging, and concentrates (a) on the flow of data from the sources towards the data warehouse and (b) on the composition of the activities and the derived semantics.

Fig. 1 The metamodel for the logical entities of the ETL environment (Metamodel layer: Elementary Activity, RecordSet; Template layer, related by ISA: Not Null, Selection, Aggregate; Schema & Scenario layer: myScenario with myNotNull and mySelection over the recordsets LineItem and PartSupp)

Activities are the backbone of the structure of any information system. In our framework, activities are logical abstractions representing parts, or full modules, of code.
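The layering of Fig. 1 can be read as specialization plus instantiation. The following minimal Python sketch is our own illustration, not part of the paper's formalism: it mirrors the three layers with a generic ElementaryActivity metaclass, a NotNull template specializing it, and scenario-level instances over concrete recordsets; the recordset names come from Fig. 1, while the schemas and attribute names are hypothetical.

```python
class RecordSet:
    """Metamodel layer: a named data store with a schema."""
    def __init__(self, name, schema):
        self.name, self.schema = name, schema

class ElementaryActivity:
    """Metamodel layer: the generic atomic execution entity."""
    def __init__(self, name, inputs, output):
        self.name, self.inputs, self.output = name, inputs, output

class NotNull(ElementaryActivity):
    """Template layer: ISA ElementaryActivity, specialized to a NULL check."""
    def __init__(self, name, inputs, output, attribute):
        super().__init__(name, inputs, output)
        self.attribute = attribute  # the attribute being checked for NULLs

# Schema & Scenario layer: instances over concrete recordsets.
# LineItem comes from Fig. 1; its schema here is hypothetical.
line_item = RecordSet("LineItem", ["PKEY", "SKEY", "QTY"])
clean_items = RecordSet("CleanLineItem", ["PKEY", "SKEY", "QTY"])
my_not_null = NotNull("myNotNull", [line_item], clean_items, attribute="PKEY")
```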

We distinguish between three layers of instantiation/specialization for the logical model of ETL activities, as depicted in Fig. 1: the Metamodel, Template, and Custom layers.

Metamodel Layer. The most generic type of activity forms the upper level of instantiation (Metamodel layer in Fig. 1), covering the most general case of an atomic execution entity within the ETL context. Fig. 2 shows the high-level abstraction of the operation of an elementary binary activity. Data coming from the two inputs are propagated to the output, while the rejections are dumped to the Rejected Rows file. The operation is tuned through a set of parameters. Activities are depicted as triangles, and the annotated edges characterize the data consumer/provider semantics: the + symbol denotes that data are appended to the Output; the ! symbol denotes that the rejected rows are propagated towards the Rejected Rows file, whose contents they overwrite; data are simply read from Input 1; and the - symbol denotes that the data from Input 2 that match the criteria of the activity are deleted from Input 2 once they have been read by the activity.

Fig. 2 The basic idea behind the mechanism of Elementary Activities (Input 1 and Input 2 feed the Elementary Activity, tuned by Parameters; the activity appends to Output (+) and overwrites Rejected Rows (!))

An Elementary Activity is formally described by the following elements:
- Name: a unique identifier for the activity.
- Input Schemata: a finite set of one or more input schemata that receive data from the data providers of the activity.
- Output Schema: the schema that describes the placeholder for the rows that pass the check performed by the activity.
- Rejections Schema: a schema that describes the placeholder for the rows that do not pass the check performed by the activity.
- Parameter List: a set of pairs (name and schema) which act as regulators for the functionality of the activity.
- Output Operational Semantics: an SQL statement describing the content passed to the output of the operation, with respect to its input.
- Rejection Operational Semantics: an SQL statement describing the rejected records.
- Data Provider/Consumer Semantics: a modality of each schema which defines (a) whether the data are simply read from the data provider, or read and subsequently deleted from it (symbol - in Fig. 2), and (b) whether the data are appended to the data consumer (symbol + in Fig. 2) or the contents of the data consumer are overwritten by the output of the operation (symbol ! in Fig. 2).
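Taken together, these elements suggest a record-like structure. The Python sketch below is our own rendering of the element list above, nothing more; the NOT NULL instance and its SQL strings are hypothetical examples of the operational semantics.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ElementaryActivitySpec:
    name: str                        # unique identifier
    input_schemata: List[List[str]]  # one or more input schemata
    output_schema: List[str]         # placeholder for rows passing the check
    rejections_schema: List[str]     # placeholder for rows failing the check
    parameters: Dict[str, str]       # (name, schema) pairs acting as regulators
    output_semantics: str            # SQL statement describing the output
    rejection_semantics: str         # SQL statement describing the rejections
    provider_modality: str = "read"  # "read", or "-" for read-and-delete
    consumer_modality: str = "+"     # "+" append, or "!" overwrite

# A hypothetical NOT NULL check over attribute A of a recordset R:
not_null = ElementaryActivitySpec(
    name="myNotNull",
    input_schemata=[["A", "B"]],
    output_schema=["A", "B"],
    rejections_schema=["A", "B"],
    parameters={"attribute": "R.A"},
    output_semantics="SELECT * FROM R WHERE A IS NOT NULL",
    rejection_semantics="SELECT * FROM R WHERE A IS NULL",
)
```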

Template Layer. Providing a single metaclass for all possible activities of an ETL environment is not really enough for the designer of the overall process. A richer language should be available in order to describe the structure of the process and facilitate its construction. To this end, we provide a set of Template Activities (Template layer in Fig. 1), which are specializations of the generic metamodel class. We have already pursued this goal in a previous effort [5]; in this paper, we extend this set of template objects and deal with the semantics of their combination (see [4] for more details).

Fig. 3 Template Activities grouped by category

Filters:
- SELECTION(φ(A_{n+1}, ..., A_{n+k}))
- UNIQUE VALUE(R.A)
- NOT NULL(R.A)
- DOMAIN MISMATCH(R.A, x_low, x_high)
- PRIMARY KEY VIOLATION(R.A_1, ..., R.A_k)
- FOREIGN KEY VIOLATION([R.A_1, ..., R.A_k], [S.A_1, ..., S.A_k])

Unary Transformations:
- PUSH
- AGGREGATION([A_1, ..., A_k], [γ_1(A_1), ..., γ_m(A_m)])
- PROJECTION([A_1, ..., A_k])
- FUNCTION APPLICATION(f_1(A_1), ..., f_k(A_k))
- SURROGATE KEY ASSIGNMENT(R.PRODKEY, S.SKEY, x)
- TUPLE NORMALIZATION(R.A_{n+1}, ..., R.A_{n+k}, A_CODE, A_VALUE, h)
- TUPLE DENORMALIZATION(A_CODE, A_VALUE, R.A_{n+1}, ..., R.A_{n+k}, h')

Binary Transformations:
- UNION(R, S)
- JOIN(R, S, [(A_1, B_1), ..., (A_k, B_k)])
- DIFF(R, S, [(A_1, B_1), ..., (A_k, B_k)])

As one can see in Fig. 3, we group the template activities into three major groups. The first group, named Filters, provides checks for the satisfaction of a certain condition by the rows of a certain recordset. The semantics of these filters are the obvious ones, starting from a generic selection condition and proceeding to checks for value uniqueness, primary or foreign key violations, and so on. The second group of template activities is called Unary Transformations and, except for the most generic Push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations along with three data warehouse specific transformations (surrogate key assignment, normalization, and denormalization). The third group consists of classical Binary Operations, such as union, join, and difference, with the last being a special case of the classic relational operator.
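As an illustration of how a filter template could be instantiated into a custom activity, the sketch below is our own: the rendering of the operational semantics as SQL strings, and all names and parameters, are assumptions for illustration rather than the paper's notation.

```python
def selection_template(recordset: str, condition: str):
    """Instantiate the SELECTION(phi) filter template: return the output
    and rejection operational semantics as a pair of SQL statements."""
    output = f"SELECT * FROM {recordset} WHERE {condition}"
    rejection = f"SELECT * FROM {recordset} WHERE NOT ({condition})"
    return output, rejection

def domain_mismatch_template(recordset: str, attr: str, x_low, x_high):
    """DOMAIN MISMATCH(R.A, x_low, x_high) as a special case of SELECTION:
    rows whose attr falls outside [x_low, x_high] are rejected."""
    return selection_template(recordset, f"{attr} BETWEEN {x_low} AND {x_high}")

# A custom activity over LineItem checking that QTY lies in [0, 100]:
out_sql, rej_sql = domain_mismatch_template("LineItem", "QTY", 0, 100)
print(out_sql)   # SELECT * FROM LineItem WHERE QTY BETWEEN 0 AND 100
print(rej_sql)   # SELECT * FROM LineItem WHERE NOT (QTY BETWEEN 0 AND 100)
```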

Custom Layer. The instantiation and the reuse of template activities are also allowed in our framework. The activities that stem from the composition of other activities are called composite activities, whereas the instantiations of template activities are grouped under the general name of custom activities (Custom layer in Fig. 1). Custom activities are applied over the specific recordsets of an ETL environment. Moreover, the custom layer also involves the construction of ad-hoc, user-tailored elementary activities by the designer.

3 Exploitation of the Model

In order to perform operations like zooming in/out, consistency checking, and what-if analysis, each scenario is modeled as a graph, which we call the Architecture Graph (see [4] for more details). Activities, data stores, and attributes are modeled as nodes, and their relationships as edges.

Zooming in and out of the architecture graph. We can zoom in and out of the architecture graph in order to eliminate the information overflow that can be caused by the vast number of attributes involved in a scenario. In a different kind of zooming, we can follow the major flow of data from the sources to the targets.

Consistency checks. We can perform several consistency checks on the architecture graph, such as the detection of activities having input attributes without data providers, the detection of non-source tables having attributes without data providers, or the detection of useless source attributes; a sketch of the first check is given below.

What-if analysis. We can identify possible impacts in the case of a change by modeling changes as additions, removals, or renamings of edges and nodes. Still, out of the several alternatives, only the drop-node operation seems to have interesting side effects. Moreover, even if atomic operations are of no significance by themselves, they become important when considered as parts of composite operation transitions.
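A minimal Python sketch of that first consistency check follows; it is our own illustration under the stated graph abstraction, and the node names and edge representation are assumptions rather than the paper's exact formalism.

```python
from collections import defaultdict

class ArchitectureGraph:
    """Nodes are activities, recordsets and attributes; edges record
    provider relationships (provider attribute -> consumer attribute)."""
    def __init__(self):
        self.providers = defaultdict(set)

    def add_provider_edge(self, provider_attr: str, consumer_attr: str):
        self.providers[consumer_attr].add(provider_attr)

    def unfed_attributes(self, input_attrs):
        """Detect input attributes of activities that have no data provider."""
        return [a for a in input_attrs if not self.providers[a]]

g = ArchitectureGraph()
g.add_provider_edge("LineItem.QTY", "myNotNull.in.QTY")
print(g.unfed_attributes(["myNotNull.in.QTY", "myNotNull.in.PKEY"]))
# -> ['myNotNull.in.PKEY']: an input attribute without a data provider
```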

References

1. H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An Extensible Data Cleaning Tool. In Proc. ACM SIGMOD Intl. Conf. on the Management of Data, p. 590, Dallas, Texas, 2000.
2. V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. Technical Report, University of California at Berkeley, Computer Science Division, 2000. Available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf
3. S. Sarawagi (ed.). Special Issue on Data Cleaning. IEEE Computer Society, Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
4. P. Vassiliadis, A. Simitsis, and S. Skiadopoulos. On the Conceptual and Logical Modeling of ETL Processes. Available at http://www.dbnet.ece.ntua.gr/~pvassil/publications/vass_tr.pdf
5. P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis. Arktos: Towards the Modeling, Design, Control and Execution of ETL Processes. Information Systems, 26(8), pp. 537-561, December 2001, Elsevier Science Ltd.