Data Cleaning and Transformation - Tools. Helena Galhardas DEI IST



Similar documents
Extract, Transform, Load (ETL)

Data Cleaning. Helena Galhardas DEI IST

Analysis of Data Cleansing Approaches regarding Dirty Data A Comparative Study

A survey of data quality tools

SEMI AUTOMATIC DATA CLEANING FROM MULTISOURCES BASED ON SEMANTIC HETEROGENOUS

Data Integration and ETL with Oracle Warehouse Builder: Part 1

Course 20463:Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server

COURSE 20463C: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

IBM WebSphere DataStage Online training from Yes-M Systems

Implement a Data Warehouse with Microsoft SQL Server 20463C; 5 days

Implementing a Data Warehouse with Microsoft SQL Server

Implementing a Data Warehouse with Microsoft SQL Server

AV-005: Administering and Implementing a Data Warehouse with SQL Server 2014

ASTERIX: An Open Source System for Big Data Management and Analysis (Demo) :: Presenter :: Yassmeen Abu Hasson

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Data Quality in Information Integration and Business Intelligence

Implementing a Data Warehouse with Microsoft SQL Server

Course 10777A: Implementing a Data Warehouse with Microsoft SQL Server 2012

Implementing a Data Warehouse with Microsoft SQL Server 2014

SQL Server 2012 Business Intelligence Boot Camp

Implementing a Data Warehouse with Microsoft SQL Server 2012

Microsoft. Course 20463C: Implementing a Data Warehouse with Microsoft SQL Server

Data Integration and ETL with Oracle Warehouse Builder NEW

SQL SERVER TRAINING CURRICULUM

Oracle Warehouse Builder 10g

Oracle Database 11g: SQL Tuning Workshop

MOC 20461C: Querying Microsoft SQL Server. Course Overview

Implementing a Data Warehouse with Microsoft SQL Server 2012

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Oracle Architecture, Concepts & Facilities

SAP Data Services 4.X. An Enterprise Information management Solution

A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems

The Data Quality Continuum*

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Oracle Data Integrator: Administration and Development

Implementing a Data Warehouse with Microsoft SQL Server MOC 20463

COURSE OUTLINE MOC 20463: IMPLEMENTING A DATA WAREHOUSE WITH MICROSOFT SQL SERVER

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

ER/Studio Enterprise Portal User Guide

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

A Design Technique: Data Integration Modeling

DATA INTEGRATION CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Implementing a Data Warehouse with Microsoft SQL Server 2012

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features

Course Outline. Module 1: Introduction to Data Warehousing

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

SAS BI Course Content; Introduction to DWH / BI Concepts

Unified Batch & Stream Processing Platform

Extraction Transformation Loading ETL Get data out of sources and load into the DW

Oracle Database 10g: Introduction to SQL

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Implementing a Data Warehouse with Microsoft SQL Server 2012 (70-463)

Business Benefits From Microsoft SQL Server Business Intelligence Solutions How Can Business Intelligence Help You? PTR Associates Limited

POLAR IT SERVICES. Business Intelligence Project Methodology

Implementing a Data Warehouse with Microsoft SQL Server 2012 MOC 10777

What's New in SAS Data Management

MDM for the Enterprise: Complementing and extending your Active Data Warehousing strategy. Satish Krishnaswamy VP MDM Solutions - Teradata

f...-. I enterprise Amazon SimpIeDB Developer Guide Scale your application's database on the cloud using Amazon SimpIeDB Prabhakar Chaganti Rich Helms

DATABASE MANAGEMENT SYSTEMS. Question Bank:

MicroStrategy Course Catalog

Chapter 1: Introduction

SQL Server 2005 Features Comparison

Oracle Database 11g: SQL Tuning Workshop Release 2

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Big Data With Hadoop

Conceptual Workflow for Complex Data Integration using AXML

SDMX technical standards Data validation and other major enhancements

Implementing a SQL Data Warehouse 2016

Data Integration and ETL Process

Microsoft End to End Business Intelligence Boot Camp

MDM and Data Warehousing Complement Each Other

SAP BODS - BUSINESS OBJECTS DATA SERVICES 4.0 amron

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

Implementing a Data Warehouse with Microsoft SQL Server 2012

East Asia Network Sdn Bhd

Data Warehouse Design

Efficient Data Access and Data Integration Using Information Objects Mica J. Block

Oracle Database 11g: Administer a Data Warehouse

Data Integration and ETL Process

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

Data Warehousing and OLAP Technology for Knowledge Discovery

Big Data and Scripting Systems build on top of Hadoop

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

Transcription:

Data Cleaning and Transformation - Tools Helena Galhardas DEI IST

Agenda ETL/Data Quality tools Generic functionalities Categories of tools The AJAX data cleaning and transformation framework

Typical architecture of a DQ system SOURCE DATA Human Knowledge TARGET DATA Data Data Data... Extraction Transformation Loading... Data Analysis Metadata Dictionaries Human Knowledge Schema Integration

Existing technology How to perform data transformation and cleaning to achieve high quality data Ad-hoc programs Programs difficult to optimize and maintain RDBMS mechanisms to guarantee the verification of integrity constraints Do not address important data instance problems Data transformation scripts using an ETL or data quality tool

(ETL and) Data Quality tools

Generic functionalities (1/2) Data sources extract data from different and heterogeneous data sources Extraction capabilities schedule extracts; select data; merge data from multiple sources Loading capabilities multiple types, heterogeneous targets; refresh and append; automatically create tables Incremental updates update instead of rebuild from scratch User interface graphical vs non-graphical Metadata repository stores data schemas and transformations

Generic functionalities (2/2) Performance techniques partitioning; parallel processing; threads; clustering Versioning version control of data quality programs Function library pre-built functions Language binding language support to define new functions Debugging and tracing document the execution; tracking and analyzing to detect problems and refine specifications Exception detection and handling set of input rules for which the execution of data quality program fails Data lineage identifies the input records that produced a given data item

Commercial Research Classification of tools (2005)

Categories of data quality tools

Categories of data quality tools Data analysis statistical evaluation, logical study and application of data mining algorithms to define data patterns and rules

Categories of data quality tools Data profiling analysing data sources to identify data quality problems

Categories of data quality tools Data transformation set of operations that source data must undergo to fit target schema

Categories of data quality tools Data cleaning detecting, removing and correcting dirty data

Categories of data quality tools Duplicate elimination identifies and merges approximate duplicate records

Categories of data quality tools Data enrichement use of additional information to improve data quality

Categories Ajax Y Y Y Y Arktos Y Y Clio Y Flamingo Project Y FraQL Y Y Y IntelliClean Y Kent State University Y Y Potter's Wheel Y Y Transcm Y

The AJAX data transformation and

Problems of ETL/data quality solutions (1) Data transformations... App. Domain 1 App. Domain 2 App. Domain 3

Problems of ETL/data quality solutions (1) Data transformations... App. Domain 1 App. Domain 2 App. Domain 3 The semantics of some data transformations is defined in terms of their implementation algorithms

Problems of ETL/data quality solutions (2)

Problems of ETL/data quality solutions (2) Clean data Rejected data Data Cleaning and Transformation Dirty Data

Problems of ETL/data quality solutions (2) Clean data Rejected data Data Cleaning and Transformation Dirty Data

Problems of ETL/data quality solutions (2) Clean data Rejected data Data Cleaning and Transformation Dirty Data

Problems of ETL/data quality solutions (2) Clean data Rejected data Data Cleaning and Transformation Dirty Data There is a lack of interactive facilities to tune a data cleaning application program

Motivating example (1) 14

Motivating example (1) DirtyData(paper:String) 14

Motivating example (1) Publications(pubKey, title, eventkey, url, volume, number, pages, city, month, year) Authors(authorKey, name) vents(eventkey, name) PubsAuthors(pubKey, authorkey) Data Cleaning & Transformation DirtyData(paper:String) 14

Motivating example (2) Publications QGMW96 Making Views Self-Maintainable for Data Warehousing PDIS null null null null Miami Beach Florida, USA 1996 Events PDIS Conference on Parallel and Distributed Information Systems DirtyData Authors DQua Dallan Quass AGup Ashish Gupta JWid Jennifer Widom.. Data Cleaning & Transformation PubsAuthors QGMW96 DQua QGMW96 AGup. [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996 [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views selfmaintianable for data warehousing, PDIS 95 15

Modeling an ETL/data quality process Authors Duplicate Elimination DirtyTitles... DirtyEvents DirtyAuthors Extraction Standardization Formattin g Cities Tags DirtyData A data quality process is modeled by a directed acyclic graph of data transformations

AJAX features An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms A declarative language for logical operators SQL extension A debugger facility for tuning a data cleaning program

AJAX features An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms A declarative language for logical operators SQL extension A debugger facility for tuning a data cleaning program

Logical level: parametric operators 19

Logical level: parametric operators View Cluster Map Match Merge Apply 19

Logical level: parametric operators View Map Merge Cluster Match Apply View: arbitrary SQL query Map: iterator-based one-to-many mapping with arbitrary user-defined functions Match: iterator-based approximate join Cluster: uses an arbitrary clustering function Merge: extends SQL group-by with userdefined aggregate functions Apply: executes an arbitrary user-defined algorithm 19

Logical level Authors Duplicate Elimination Extraction DirtyTitles... DirtyEvents DirtyAuthors Standardization Formattin g Cities Tags DirtyData

Logical Authors Duplicate Elimination Merge Cluster Extraction DirtyTitles... DirtyEvents Map Match DirtyAuthors Standardization Formattin g Map Map Cities Tags DirtyData

Logical Authors Duplicate Elimination Merge Cluster Extraction DirtyTitles... DirtyEvents Map Match DirtyAuthors Standardization Formattin g Map Map Cities Tags DirtyData

Logical Authors Physical Authors Duplicate Elimination Merge Cluster Java TC DirtyTitles... DirtyEvents Match DirtyAuthors DirtyTitles... DirtyEvents NL DirtyAuthors Extraction Map Java scan Standardization Formattin g Map Map Cities Tags DirtyData SQL Scan Java Scan Cities Tags DirtyData

Match Input: 2 relations Finds data records that correspond to the same real object Calls distance functions for comparing field values and computing the distance between input tuples Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing nonmatching tuples 22

Example Authors Merge Cluster Match DirtyAuthors Duplicate Elimination 23

Example Authors Merge Cluster Match MatchAuthors DirtyAuthors Duplicate Elimination 23

Example Authors CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 Merge LET distance = editdistance(da1.name, da2.name) Cluster Match MatchAuthors WHERE distance < maxdist INTO MatchAuthors DirtyAuthors Duplicate Elimination 24

Example Authors CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 Merge LET distance = editdistance(da1.name, da2.name) Cluster Match MatchAuthors WHERE distance < maxdist INTO MatchAuthors Input: DirtyAuthors Duplicate Elimination 25

Implementation of the match s 1 S 1, s 2 S 2 (s 1, s 2 ) is a match if editdistance (s 1, s 2 ) < maxdist

Nested loop S 1 S 2 editdistance... Very expensive evaluation when handling large amounts of data

Nested loop S 1 S 2 editdistance... Very expensive evaluation when handling large amounts of data Need alternative execution algorithms for the same logical specification

A database solution CREATE TABLE MatchAuthors AS! SELECT authorkey1, authorkey2, distance! FROM (SELECT a1.authorkey authorkey1,!!! a2.authorkey authorkey2,!! editdistance (a1.name, a2.name) distance!! FROM DirtyAuthors a1, DirtyAuthors a2)! WHERE distance < maxdist;

A database solution CREATE TABLE MatchAuthors AS! SELECT authorkey1, authorkey2, distance! FROM (SELECT a1.authorkey authorkey1,!!! a2.authorkey authorkey2,!! editdistance (a1.name, a2.name) distance!! FROM DirtyAuthors a1, DirtyAuthors a2)! WHERE distance < maxdist; No optimization supported for a Cartesian product with external function calls

Window scanning S n

Window scanning S n

Window scanning S n

Window scanning S n May loose some matches

String distance filtering S 1 S 2 John Smit length- 1 length John Smith Jogn Smith length John Smithe length + 1 maxdist = 1 32

String distance filtering S 1 S 2 John Smit length- 1 length John Smith Jogn Smith length editdistance John Smithe length + 1 maxdist = 1 32

AJAX features An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms A declarative language for logical operators SQL extension A debugger facility for tuning a data cleaning program

Declarative specification DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapkedida FROM DirtyData Dd LET keykdd = generateid(1) {SELECT keykdd AS paperkey, Dd.paper AS paper KEY paperkey CONSTRAINT NOT NULL mapkedida.paper }

DEFINE FUNCTIONS AS Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException Generate.generateId(INTEGER) RETURN STRING Normal.removeCitationTags(STRING) RETURN STRING (600) Graph of data transformations DEFINE ALGORITHMS AS TransitiveClosure SourceClustering(STRING) DEFINE INPUT DATA FLOWS AS TABLE DirtyData (paper STRING (400) ); TABLE City (city STRING (80), citysyn STRING (80) ) KEY city,citysyn; DEFINE TRANSFORMATIONS AS CREATE MAPPING mapkedida FROM DirtyData Dd LET keykdd = generateid(1) {SELECT keykdd AS paperkey, Dd.paper AS paper KEY paperkey CONSTRAINT NOT NULL mapkedida.paper } Declarative specification

AJAX features An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms A declarative language for logical operators SQL extension A debugger facility for tuning a data cleaning program

Management of exceptions 37

Management of exceptions Problem: to mark tuples not handled by the cleaning criteria of an operator 37

Management of exceptions Problem: to mark tuples not handled by the cleaning criteria of an operator Solution: to specify the generation of exception tuples within a logical operator exceptions are thrown by external functions output constraints are violated 37

Debugger facility Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions Supports the interactive data modification and, in the future, the incremental execution of logical operators

Architecture

References Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: Declarative Data Cleaning: Language, Model, and Algorithms. VLDB 2001: 371-380

References A Survey of Data Quality tools, J. Barateiro, H. Galhardas, Datenbank-Spektrum 14: 15-21, 2005. Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: Declarative Data Cleaning: Language, Model, and Algorithms. VLDB 2001: 371-380

AJAX features An extensible data cleaning and transformation (quality) framework Logical operators as extensions of relational algebra Physical execution algorithms A declarative language for logical operators SQL extension A debugger facility for tuning a data cleaning program

Annotation-based optimization The user specifies types of optimization The system suggests which algorithm to use Ex: CREATE MATCHING MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET dist = editdistance(da1.name, da2.name) WHERE dist < maxdist % distance-filtering: map= length; dist = abs % INTO MatchAuthors