Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange



Similar documents
Data exchange. L. Libkin 1 Data Integration and Exchange

ETL Tools. L. Libkin 1 Data Integration and Exchange

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML?

Data integration general setting

Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza

Relational Database Basics Review

Query Processing in Data Integration Systems

Data Quality in Information Integration and Business Intelligence

1 File Processing Systems

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Composing Schema Mappings: An Overview

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Database Systems. Lecture 1: Introduction

Objectives of Lecture 1. Labs and TAs. Class and Office Hours. CMPUT 391: Introduction. Introduction

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

A Tutorial on Data Integration

COMP5138 Relational Database Management Systems. Databases are Everywhere!

Number of hours in the semester L Ex. Lab. Projects SEMESTER I 1. Economy Philosophy Mathematical Analysis Exam

Data Integration and ETL Process

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS

Web-Based Genomic Information Integration with Gene Ontology

ECS 165A: Introduction to Database Systems

Overview of Data Management

Objectives of Lecture 1. Class and Office Hours. Labs and TAs. CMPUT 391: Introduction. Introduction

Databases. DSIC. Academic Year

Lecture 7: Concurrency control. Rasmus Pagh

Course: CSC 222 Database Design and Management I (3 credits Compulsory)

Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC

Scope of this Course. Database System Environment. CSC 440 Database Management Systems Section 1

Chapter 1: Introduction

Principles of Database. Management: Summary

SQL Query Evaluation. Winter Lecture 23

Database Optimizing Services

The Relational Model. Ramakrishnan&Gehrke, Chapter 3 CS4320 1

Schema Mappings and Data Exchange

Oracle BI Applications (BI Apps) is a prebuilt business intelligence solution.


Topics. Introduction to Database Management System. What Is a DBMS? DBMS Types

Week 1 Part 1: An Introduction to Database Systems. Databases and DBMSs. Why Use a DBMS? Why Study Databases??

3. Relational Model and Relational Algebra

CSE 132A. Database Systems Principles

Chapter 3: Data Mining Driven Learning Apprentice System for Medical Billing Compliance

XML Data Integration

Data Integration: A Theoretical Perspective

Database System Concepts

Writing a Requirements Document For Multimedia and Software Projects

Business Benefits From Microsoft SQL Server Business Intelligence Solutions How Can Business Intelligence Help You? PTR Associates Limited

Bridge from Entity Relationship modeling to creating SQL databases, tables, & relations


Introduction. Introduction: Database management system. Introduction: DBS concepts & architecture. Introduction: DBS versus File system

Graham Kemp (telephone , room 6475 EDIT) The examiner will visit the exam room at 15:00 and 17:00.

Chapter 11 Mining Databases on the Web

Foundations of Information Management

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

EHR Standards Landscape

ISM 318: Database Systems. Objectives. Database. Dr. Hamid R. Nemati

Information Management

Introduction: Database management system

David Dye. Extract, Transform, Load

Temporal Database System

Information Management course

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

Database Systems. Lecture Handout 1. Dr Paolo Guagliardo. University of Edinburgh. 21 September 2015

Introduction. Chapter 1. Introducing the Database. Data vs. Information

The Relational Model. Why Study the Relational Model? Relational Database: Definitions. Chapter 3

Integrating Pattern Mining in Relational Databases

The basic data mining algorithms introduced may be enhanced in a number of ways.

The ESB and Microsoft BI

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

How To Program A Computer

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model

Overview of Database Management Systems

DATA MINING AND WAREHOUSING CONCEPTS

MS 20467: Designing Business Intelligence Solutions with Microsoft SQL Server 2012

Is ETL Becoming Obsolete?

The Relational Model. Why Study the Relational Model? Relational Database: Definitions

Reverse Engineering in Data Integration Software

II. PREVIOUS RELATED WORK

Introduction to Databases

Answers to Review Questions

Database 2 Lecture I. Alessandro Artale

Distribution transparency. Degree of transparency. Openness of distributed systems

Combining SAWSDL, OWL DL and UDDI for Semantically Enhanced Web Service Discovery

DATABASE MANAGEMENT SYSTEM

Database Management Systems. Chapter 1

There are five fields or columns, with names and types as shown above.

Turning ClearPath MCP Data into Information with Business Information Server. White Paper

MDM and Data Warehousing Complement Each Other

Overview of Database Management

The Relational Model. Why Study the Relational Model?

Introduction to Database Systems CS4320. Instructor: Christoph Koch CS

THE SEMANTIC WEB AND IT`S APPLICATIONS

THE BCS PROFESSIONAL EXAMINATIONS Professional Graduate Diploma. Advanced Database Management Systems

myschool Web Based Education Management Solution

Using distributed database system and consolidation resources for Data server consolidation

Scheduling in a Virtual Enterprise in the Service Sector

The Import & Export of Data from a Database

CSPP 53017: Data Warehousing Winter 2013" Lecture 6" Svetlozar Nestorov" " Class News

Transcription:

Data Integration and Exchange L. Libkin 1 Data Integration and Exchange

Traditional approach to databases A single large repository of data. Database administrator in charge of access to data. Users interact with the database through application programs. Programmers write those (embedded SQL, other ways of combining general purpose programming languages and DBMSs) Queries dominate; updates less common. DMBS takes care of lots of things for you such as query processing and optimisation concurrency control enforcing database integrity L. Libkin 2 Data Integration and Exchange

Traditional approach to databases cont d This model works very within a single organisation that either does not interact much with the outside world, or the interaction is heavily controlled by the DB administrators What do we expect from such a system? 1. Data is relatively clean; little incompleteness 2. Data is consistent (enforced by the DMBS) 3. Data is there (resides on the disk) 4. Well-defined semantics of query answering (if you ask a query, you know what you want to get) 5. Access to data is controlled L. Libkin 3 Data Integration and Exchange

The world is changing The traditional model still dominates, but the world is changing. Many huge repositories are publicly available In fact many are well-organised databases, e.g., imdb.com, the CIA World Factbook, many genome databases, the DBLP server of CS publications, etc etc etc) Many queries cannot be answered using a single source. Often data from various sources needs to be combined, e.g. company mergers restructuring databases within a single organisation combining data from several private and public sources L. Libkin 4 Data Integration and Exchange

Course info No text. Because there is no text at this time... Slides will be posted on the course webpage: http://homepages.inf.ed.ac.uk/libkin/teach/dataintegr08 Tutorials by Lenzerini and Kolaitis (see links on the webpage) 3 assignments final exam Office hours: by appointment (usually works better for UG4) L. Libkin 5 Data Integration and Exchange

Why do you need this course Databases are everywhere these days (> $2 10 10 /year business whatever that means today) Every enterprise has a database; they merge, combine data hence data integration In addition, a lot of data is available on the web, but often one needs many sources to answer a query Hence (almost) everyone needs to integrate data Huge investment from leading companies, IBM, Oracle, Microsoft Very ad hoc solutions; but finally we understand what the real problems in data integration are, and have some solutions (but not all!) L. Libkin 6 Data Integration and Exchange

Background Requirement: Database Systems (3rd year) or fluency in relational databases: relational model relational algebra/calculus SQL An understanding of the basic mathematical tools that serve as the foundation of computer science: basic set theory, graph theory, theory of computation, first-order logic. L. Libkin 7 Data Integration and Exchange

Outline of the course Introduction to the problems of data integration and exchange. Key new components: incomplete information query rewriting certain answers Data integration scenarios: global-as-view, local-as-view, combined virtual vs materialized How to distinguish easy queries from hard queries? Query answering in data integration scenarios: view-based rewritings L. Libkin 8 Data Integration and Exchange

Outline of the course cont d Incomplete information in databases theory, tables, complexity practice (the ugly reality SQL) Open and closed worlds Data exchange: settings, source-to-target constraints, solutions Data exchange query answering: conjunctive (select-project-join) queries full relational algebra queries closed vs open worlds L. Libkin 9 Data Integration and Exchange

Outline of the course cont d Data exchange: XML data tree patterns consistency problems query answering Schema management: composition, other operations, schema evolution Inconsistent databases, repairs, query answering If time permits: ranking queries L. Libkin 10 Data Integration and Exchange

Query answering from multiple sources Data resides in several different databases They may have different structures, different access policies etc Our view of the world may be very different from the view of the databases we need to use. Only portions of the data from some database could be available. That is, the sources do not conform to the schema of the database into which the data will be loaded. L. Libkin 11 Data Integration and Exchange

What industry offers now: ETL tools ETL stands for Extract Transform Load Extract data from multiple sources Transform it so it is compatible with the schema Load it into a database Many self-built tools in the 80s and the 90s; through acquisition fewer products exist now The big players IBM, Microsoft, Oracle all have their ETL products; Microsoft and Oracle offer them with their database products. A few independent vendors, e.g. Informatica PowerCenter. Several open source products exist, e.g. Clover ETL. L. Libkin 12 Data Integration and Exchange

ETL tools Focus: Data profiling Data cleaning Simple transformations Bulk loading Latency requirements What they don t do yet: nontrivial transformations query answering But techniques now exist for interesting data integration and for query answering and we shall learn them. They soon will be reflected in products (IBM and Microsoft are particularly active in this area) L. Libkin 13 Data Integration and Exchange

Data profiling/cleaning Data profiling: gives the user a view of data: Samples over large tables statistics (how many different values etc) Graphical tools for exploring the database Cleaning: Same properties may have different names e.g. Last Name, L Name, LastName Same data may have different representations e.g. (0131)555-1111 vs 01315551111, George Str. vs George Street Some data may be just wrong L. Libkin 14 Data Integration and Exchange

Data transformation Most transformation rules tend to be simple: Copy attribute LName to Last Name Set age to be current year DOB Heavy emphasis on industry specific formats For example, Informatica B2B Data Exchange product offers versions for Healthcare and Financial services as well as specialised tools for formats including: MS Word, Excel, PDF, UN/EDIFACT (Data Interchange For Administration, Commerce, and Transport), RosettaNet for B2B, and many specialised healthcare and financial form. These are format/industry specific and have little to do with the general tasks of data integration. L. Libkin 15 Data Integration and Exchange

Data integration, scenario 1 DB1 DB2 DB3... DBn GLOBAL SCHEMA QUERY: Q? L. Libkin 16 Data Integration and Exchange

Data integration DB1 DB2 DB3... DBn Q 1 Q 2 Q 3 Q n V 1 A B C D................ A E B C F A C V 2.... V 3...... V n.................. GLOBAL SCHEMA QUERY: Q? L. Libkin 17 Data Integration and Exchange

Data integration DB1 DB2 DB3... DBn Q 1 Q 2 Q 3 Q n V 1 A B C D................ A E B C F A C V 2.... V 3...... V n.................. GLOBAL SCHEMA QUERY: Q? Answer to Q is obtained by querying the views V 1,..., V n L. Libkin 18 Data Integration and Exchange

Data integration, query answering We have our view of the world (the Global Schema) We can access (parts of) databases DB 1,...,DB n to get relevant data. It comes in the form of views, V 1,...,V n Our query against the global schema must be reformulated as a query against the views V 1,...,V n The approach is completely virtual: we never create a database the conforms to the global schema. L. Libkin 19 Data Integration and Exchange

Data integration, query answering, a toy example List courses taught by permanent teaching staff during Winter 2007 We have two databases: D 1 (name, age, salary) of permanent staff D 2 (teacher, course, semester, enrollment) of courses D 1 only publishes the value of the name attribute D 2 does not reveal enrollments The views: V 1 = π name (D 1 ) V 2 = π teacher,course,semester (D 2 ) Next step: establish correspondence between attributes name of V 1 and teacher of V 2 L. Libkin 20 Data Integration and Exchange

Data integration, query answering, a toy example cont d To answer query, we need to import the following data: V 1 W 2 = σ semester= Winter 2007 (V 2) Answering query: {course name,sem V 1 (name) W 2 (name,course,sem)} Or, in relational algebra π course (V 1 name=teacher W 2 ) L. Libkin 21 Data Integration and Exchange

Toy example, lessons learned We don t have access to all the data Some human intervention is essential (someone needs to tell us that teacher and name refer to the same entity) We don t run a query against a single database. Instead, we run queries against different databases based on restrictions they impose get results to use them locally run another query against those results L. Libkin 22 Data Integration and Exchange

Toy example, things getting more complicated Find informatics permanent staff who taught during the Winter 2007 semester, and their phone numbers We have additional personnel databases: an informatics database D 3 (employee, phone,office), and a university-wide database D 4 (employee, school, phone) for simplicity, assume all this information is public Now we have a choice: use D 3 to get information about phones use D 4 to get information about phones use both D 3 and D 4 to get information about phones L. Libkin 23 Data Integration and Exchange

Toy example cont d First, we need some human involvement to see that employee, name, and teacher refer to the same category of objects If one uses D 3, then the query is {name, phone sem,office V 1 (name) W 2 (name,course,sem) D 3 (name, phone,office)} If one uses D 4, then the query is {name, phone sem, school V 1 (name) W 2 (name,course,sem) D 4 (name, school,phone)} But what if one uses both D 3 and D 4? L. Libkin 24 Data Integration and Exchange

Toy example cont d We could insist on the phone number being: in either D 3 or D 4 in both D 3 and D 4, but not necessarily the same in both D 3 and D 4, and the same in both databases One can write queries for all the cases, but which one should we use? New lessons: databases that are being integrated are often inconsistent query answering is by no means unique there could be several ways to answer a query different possibilities for answering queries are a result of inconsistencies and incomplete information L. Libkin 25 Data Integration and Exchange

Toy example cont d Suppose phone numbers in D 3 and D 4 are different. What is a sensible query answer then? A common approach is to use certain answers these are guaranteed to be true. Another question: what if there is no record at all for the phone number in D 3 and D 4? Then we have an instance of incomplete information. L. Libkin 26 Data Integration and Exchange

A different scenario So far we looked at virtual integration: no database of the global schema was created. Sometimes we need such a database to be created, for example, if many queries are expected to be asked against it. In general, this is a common problem with data integration: materialize vs federate. Materialize = create a new database based on integrating data from different sources. Federate = the virtual approach: obtain data from various sources and use them to answer queries. L. Libkin 27 Data Integration and Exchange

Virtual vs Materialization A common situation for the materialization approach: merger of different organizations. A common situation for the federated approach: we don t have full access to the data, and the data changes often. L. Libkin 28 Data Integration and Exchange

Common tasks in data integration How do we represent information? Global schema, attributes, constraints data formats of attributes reconciling data from different sources abbreviations, terminology, ontologies How do we deal with imperfect information? resolve overlaps handling missing data handling inconsistencies L. Libkin 29 Data Integration and Exchange

Common tasks in data integration cont d How do we answer queries? what information is available? Can we get the answer? if not, what is the semantics of query answering? Is query answering feasible? Is it possible to compute query answers at all? If now, how do we approximate? Materialize or federate? L. Libkin 30 Data Integration and Exchange

Common tasks in data integration cont d Do it from scratch or use commercial tools? many are available (just google for data integration ) but do we fully understand them? lots of them are very ad hoc, with poorly defined semantics this is why it is so important to understand what really happens in data integration L. Libkin 31 Data Integration and Exchange

Data Exchange SOURCE DATABASE Source Schema S Target Schema T L. Libkin 32 Data Integration and Exchange

Data Exchange SOURCE DATABASE TARGET DATABASE????? Source Schema S Target Schema T L. Libkin 33 Data Integration and Exchange

Data Exchange SOURCE DATABASE TARGET DATABASE????? Source Schema S Target Schema T Query over the target schema: Q How to answer Q so that the answer is consistent with the data in the source database? L. Libkin 34 Data Integration and Exchange

Data exchange vs Data integration Data exchange appears to be an easier problem: there is only one source database; and one has complete access to the source data. But there could be many different target instances. Problem: which one to use for query answering? L. Libkin 35 Data Integration and Exchange

When do we have the need for data exchange A typical scenario: Two organizations have their legacy databases, schemas cannot be changed. Data from one organization 1 needs to be transfered to data from organization 2. Queries need to be answered against the transferred data. L. Libkin 36 Data Integration and Exchange

Data exchange towards multiple instances A simple example: we want to create a target database with the schema Flight(city1,city2,aircraft,departure,arrival) Served(city,country,population,agency) We don t start from scratch: there is a source database containing relations Route(source,destination,,departure) BG(country,city) We want to transfer data from the source to the target. L. Libkin 37 Data Integration and Exchange

Data exchange relationships between the source and the target How to specify the relationship? ROUTE Source Dest Departure city1 city2 aircraft departure arrival FLIGHT BG Country City city country population agency SERVED Semantics??? For example, arrows from city is the meaning and or or? L. Libkin 38 Data Integration and Exchange

Data exchange relationships between the source and the target Formal specification: we have a relational calculus query over both the source and the target schema. The query is of a restricted form, and can be thought of as a sequence of rules: Flight(c1, c2,, dept, ) : Route(c1, c2, dept) Served(city, country,, ) : Route(city,, ), BG(city, country) Served(city, country,, ) : Route(, city, ), BG(city, country) L. Libkin 39 Data Integration and Exchange

Data exchange targets Target instances should satisfy the rules. What does it mean to satisfy a rule? Formally, if we take: Flight(c1, c2,, dept, ) : Route(c1, c2, dept) then it is satisfied by a source S and a target T if the constraint ( c 1, c 2, d (Route(c 1, c 2, d) a 1,a 2 Flight(c1, c 2, a 1,d,a 2 ) )) This constraint is a relational calculus query that evaluates to true or false L. Libkin 40 Data Integration and Exchange

Data exchange targets What happens if there no values for some attributes, e.g. aircraft, arrival? We put in null values or some real values. But then we may have multiple solutions! L. Libkin 41 Data Integration and Exchange

Data exchange targets Source Database: ROUTE: Source Destination Departure Edinburgh Amsterdam 0600 Edinburgh London 0615 Edinburgh Frankfurt 0700 BG: Country UK UK NL GER City London Edinburgh Amsterdam Frankfurt Look at the rule Flight(c1, c2,, dept, ) : Route(c1, c2, dept) The right hand side is satisfied by But what can we put in the target? Route(Edinburgh, Amsterdam, 0600) L. Libkin 42 Data Integration and Exchange

Data exchange targets Rule: Flight(c1, c2,, dept, ) : Route(c1, c2, dept) Satisfied by: Route(Edinburgh, Amsterdam, 0600) Possible targets: Flight(Edinburgh, Amsterdam, 1, 0600, 2 ) Flight(Edinburgh, Amsterdam, B737, 0600, ) Flight(Edinburgh, Amsterdam,, 0600, 0845) Flight(Edinburgh, Amsterdam, B737, 0600, 0845) They all satisfy the constraints! L. Libkin 43 Data Integration and Exchange

Data exchange queries Now consider two queries: Q 1 : Is there a flight from Edinburgh to Amsterdam that departs before 7am? Q 2 : Is there a flight from Edinburgh to Amsterdam that arrives before 9am? What is the difference? Q 1 can be answered with certainty: in every solution we have a tuple Flight(Edinburgh, Amsterdam,, 0600, ) Q 2 cannot be answered with certainty: in some solutions we don t have a tuple Flight(Edinburgh, Amsterdam, a, t 1, t 2 ) with t 2 earlier than 9am. Our goal is to find certain answers. L. Libkin 44 Data Integration and Exchange

Data exchange queries But computing certain answers requires checking seemingly an infinite number of databases! How else can we do it? Create a good target instance T good so that: for a query Q we can define a query Q r (its rewriting) that satisfies the property: certain answers to Q = Q r (T good ) Questions: can we always find such a T good and a rewriting algorithm Q Q r? and if not, what restrictions do we impose on data exchange settings and/or queries? L. Libkin 45 Data Integration and Exchange

Inconsistencies in databases If we integrate data, we shall always have inconsistencies: One database says that we have John Smith with salary 20K in office 100 another says that we have John Smith with salary 30K in office 100 and the database must satisfy a key constraint: the name field is a key. Hence if we put Name Office Salary John Smith 100 20K John Smith 100 30K......... in our database, we have inconsistent data. L. Libkin 46 Data Integration and Exchange

Inconsistencies in databases: query answering Q 1 : Does John Smith sit in office 100? Q 2 : Does John Smith make 20K? Difference: Q 1 can be answered with certainty; Q 2 cannot be. What does it mean to answer a query with certainty? If we repair a database so that it satisfies the constraints, the answer is true no matter how we repair repair it. L. Libkin 47 Data Integration and Exchange

Inconsistencies in databases: query answering In our example, two ways to repair: R 1 : Name Office Salary John Smith 100 20K......... R 2 : Name Office Salary John Smith 100 30K......... Q 1 is always true, Q 2 is not. But the number of repairs could be very large (exponential why?). Hence prohibitively expensive query answering algorithm. Question: when can query answering be made efficient? Perhaps it involves a rewriting of the original query. The key idea: query rewriting to obtain certain answers. L. Libkin 48 Data Integration and Exchange

Schema mappings Last subject we deal with in this course. Still the least understood, but extremely important. Schema evolution: schema changes over time. Question how to transfer data? Single step data exchange. But what if we go through many steps? How do we transfer data, how do we answer queries? L. Libkin 49 Data Integration and Exchange

Schema mappings Two data exchange scenarios: Schema1 Schema2 Constraints12 Schema2 Schema3 Constraints23 Suppose we know how to move data from Schema1 to Schema2, and then from Schema2 to Schema3? Can we describe this by a single set of schema constraints: Schema1 Schema3 Constraints13 This turns out to be a very nontrivial task, but it occurs very often in database schema management. And there are other operations inverse, for example: (Schema1 Schema2 Constraints12) (Schema2 Schema1 Constraints21) L. Libkin 50 Data Integration and Exchange