Data integration general setting



Similar documents
Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza

Query Processing in Data Integration Systems

A Tutorial on Data Integration

Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange

Data exchange. L. Libkin 1 Data Integration and Exchange

Data Integration: A Theoretical Perspective

Composing Schema Mappings: An Overview

How To Understand Data Integration

Data Exchange: Semantics and Query Answering

Database Constraints and Design

Consistent Answers from Integrated Data Sources

Repair Checking in Inconsistent Databases: Algorithms and Complexity

3.1 Solving Systems Using Tables and Graphs

Schema Mappings and Data Exchange

Introduction to tuple calculus Tore Risch

XML Data Integration

Question Answering and the Nature of Intercomplete Databases

Relational Databases

Designing and Refining Schema Mappings via Data Examples

Shuffling Data Around

Virtual Data Integration

Peer Data Exchange. ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY, Pages 1 0??.

Scale Independent Query Queries and Network Marketing

Consistent Answers from Integrated Data Sources

NP-complete? NP-hard? Some Foundations of Complexity. Prof. Sven Hartmann Clausthal University of Technology Department of Informatics

Integrating and Exchanging XML Data using Ontologies

Translating SQL into the Relational Algebra

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b.

Logical Foundations of Relational Data Exchange

Linear Programming Notes V Problem Transformations

The composition of Mappings in a Nautural Interface

XML with Incomplete Information

5 Systems of Equations

Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)

SOLUTIONS FOR PROBLEM SET 2

Lecture 1: Systems of Linear Equations

Section 9.5: Equations of Lines and Planes

Econometrics Simple Linear Regression

Schema Mediation in Peer Data Management Systems

Chapter 13: Query Optimization

Data Management in Peer-to-Peer Data Integration Systems

Boolean Algebra Part 1

Accessing Data Integration Systems through Conceptual Schemas (extended abstract)

THREE DIMENSIONAL GEOMETRY

Query Processing. Q Query Plan. Example: Select B,D From R,S Where R.A = c S.E = 2 R.C=S.C. Himanshu Gupta CSE 532 Query Proc. 1

Chapter 14: Query Optimization

PYTHAGOREAN TRIPLES KEITH CONRAD

Computability Theory

Logic Programs for Consistently Querying Data Integration Systems

1. Prove that the empty set is a subset of every set.

Navigational Plans For Data Integration

Simplifying Algebraic Fractions

Solutions to Homework 10

On Scale Independence for Querying Big Data

Handout #1: Mathematical Reasoning

Chapter 17 Using OWL in Data Integration

x 2 + y 2 = 1 y 1 = x 2 + 2x y = x 2 + 2x + 1

Schema Mediation and Query Processing in Peer Data Management Systems

Factorization in Polynomial Rings

From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back

1 Solving LPs: The Simplex Algorithm of George Dantzig

10.2 Series and Convergence

Equations Involving Lines and Planes Standard equations for lines in space

An Introduction to Calculus. Jackie Nicholas

MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.

MATH 304 Linear Algebra Lecture 20: Inner product spaces. Orthogonal sets.

CSC 373: Algorithm Design and Analysis Lecture 16

Mathematical Induction

Relational Division and SQL

The University of British Columbia

Query Answering in Peer-to-Peer Data Exchange Systems

MATH Fundamental Mathematics IV

CoNP and Function Problems

In this section, we will consider techniques for solving problems of this type.

POLYNOMIAL RINGS AND UNIQUE FACTORIZATION DOMAINS

Class One: Degree Sequences

1.7 Graphs of Functions

Chapter 27: Taxation. 27.1: Introduction. 27.2: The Two Prices with a Tax. 27.2: The Pre-Tax Position

Unified Lecture # 4 Vectors

Review of Fundamental Mathematics

Limitations of E-R Designs. Relational Normalization Theory. Redundancy and Other Problems. Redundancy. Anomalies. Example

Algebra Cheat Sheets

The last three chapters introduced three major proof techniques: direct,

Chapter 9. Systems of Linear Equations

The Recovery of a Schema Mapping: Bringing Exchanged Data Back

On the Unique Games Conjecture

OWL based XML Data Integration

SPARQL: Un Lenguaje de Consulta para la Web

Plan-Space Search. Searching for a Solution Plan in a Graph of Partial Plans

(a) Write each of p and q as a polynomial in x with coefficients in Z[y, z]. deg(p) = 7 deg(q) = 9

Section 1.3 P 1 = 1 2. = P n = 1 P 3 = Continuing in this fashion, it should seem reasonable that, for any n = 1, 2, 3,..., =

Propagating Functional Dependencies with Conditions

Borne inférieure de circuit : une application des expanders

Student Performance Q&A:

Explainable Security for Relational Databases

1 Homework 1. [p 0 q i+j p i 1 q j+1 ] + [p i q j ] + [p i+1 q j p i+j q 0 ]

5.1 Radical Notation and Rational Exponents

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Data Integration. May 9, Petr Kremen, Bogdan Kostov

Execution Strategies for SQL Subqueries

Transcription:

Data integration general setting A source schema S: relational schema XML Schema (DTD), etc. A global schema G: could be of many different types too A mapping M between S and G: many ways to specify it, e.g. by queries that mention S and T A general condition: the source and our view of the global schema should satisfy the conditions imposed by the mapping M. L. Libkin 1 Data Integration and Exchange

Data integration general setting cont d Assume we have a source database D. We are interested in databases D over the global schema such that (D,D ) satisfies the conditions of the mapping M There are many possible ways to specify the mapping. The set of such databases D is denoted by [D] M If we have a query Q, we want certain answers that are true in all possible databases D : certain M (Q,D) = Q(D ). D [D] M L. Libkin 2 Data Integration and Exchange

Data integration general setting cont d Depending on a type of mapping M, the set [D] M could be very large or even infinite. That makes certain M (Q,D) prohibitively expensive or even impossible to compute. Hence we need a rewriting Q so that or even certain M (Q, D) = Q (D) certain M (Q,D) = Q (V ) if V is the set of views that the database D makes available. L. Libkin 3 Data Integration and Exchange

Types of mappings: Two major parameters Source-central vs global schema-central: Source is defined in terms of the global schema Known as local-as-view (LAV) The global schema is defined in terms of the source Known as global-as-view (GAV) Combinations are possible (GLAV, P2P, to be seen later) Exact vs sound definitions Exact definition specify precise relationships that must hold between the source and the global schema database Sound definitions leave that description potentially incomplete: we know some relationships but not all of them. potentially many more instances in [D] M L. Libkin 4 Data Integration and Exchange

Example Source schema: EM50(title,year,director) meaning: European movies made since 1950 RV10(movie,review) reviews for the past 10 years Global schema: Movie(title,director,year) ED(name,country,dob) (European directors) RV(movie,review) (reviews) L. Libkin 5 Data Integration and Exchange

Example LAV setting We define the source (local) in terms of the global schema hence local is a view. Two possibilities for D [D] M : Exact: D = Q(D ), where Q is a query over the global schema. Sound: D Q(D ). In other words, if a fact is present in D, it must be derivable from the global schema by means of Q. More generally, for each n-ary relation R in the source schema, there is a query Q R over the global schema such that R = Q R (D ) (exact) R Q R (D ) (sound) L. Libkin 6 Data Integration and Exchange

Sound LAV setting EM50(T,Y,D) { (t,y, d) n,dob Movie(t, y, d) ED(d, n,dob) y 1950 } RV10(t,r) { (t,r) y, d Movie(t, y, d) RV(t,r) y 1998 } Right-hand sides are simple SQL queries involving joins and simple selection predicates: SELECT M.title, RV.review FROM Movie M, RV WHERE M.title=RV.title AND M.year >= 1998 L. Libkin 7 Data Integration and Exchange

Exact LAV setting EM50(T,Y,D) = { (t,y, d) n,dob Movie(t, y, d) ED(d, n,dob) y 1950 } RV10(t,r) = { (t,r) y, d Movie(t, y, d) RV(t,r) y 1998 } All the data from the global database must be reflected in the source. L. Libkin 8 Data Integration and Exchange

LAV setting queries Consider a global schema query SELECT M.title, R.review FROM Movie M, RV R WHERE M.title=R.title AND M.year = 2000 (Movies from 2000 and their reviews) This is rewritten as a relational calculus query: {t, r d,y Movie(t,d,y) RV(t,r) y = 2000} L. Libkin 9 Data Integration and Exchange

LAV setting: {t,r d,y Movie(t,d,y) RV(t,r) y = 2000} Idea: re-express in terms of predicates of the source schema. The following seems to be the best possible way: and back to SQL: {t,r d,yem50(t,y, d) RV10(t, r) y = 2000} SELECT EM50.title, RV10.review FROM EM50, RV10 WHERE EM50.title=RV10.title AND EM50.year = 2000 Is this always possible? In what sense is this the best way? L. Libkin 10 Data Integration and Exchange

GAV settings Global schema is defined in terms of sources. Sound GAV: D Q(D) the global database contains the result of a query over the source Exact GAV: D = Q(D) the global database is obtained as the result of a query over the source Note: in exact GAV, [D] M contains a unique database! L. Libkin 11 Data Integration and Exchange

GAV example Change the schema slightly: ED (name) (i.e. we only keep names of European directors) A sound GAV setting: Movie EM50 ED {d t,y EM50(t,d,y)} RV RV10 Look at a SQL query: SELECT M.title, RV.review FROM Movie M, RV WHERE M.title=RV.title AND M.year = 2000 (Movies from 2000 and their reviews) L. Libkin 12 Data Integration and Exchange

GAV example Query: {t,r d,y M(t,d,y) RV(t,r) y = 2000} Substitute the definitions from the mapping and get: {t,r d,y EM50(t, d,y) RV10(t,r) y = 2000} This is called unfolding. Does this always work? Can queries become too large? L. Libkin 13 Data Integration and Exchange

Integration with views We have assumed that all source databases are available. But often we only get views that they publish. If only views are available, can queries be: answered? approximated? Assume that in EM50 directors are omitted. Then nothing is affected. But if titles are omitted in EM50, we cannot answer the query. L. Libkin 14 Data Integration and Exchange

Towards view-based query answering Suppose only a view of the source is available. Can queries be answered? It depends on the query language. Start with relational algebra/calculus. Suppose we have either a LAV or a GAV setting, and we want to answer queries over the global schema using the view over the source. Problem: given the setting, and a query, can it be answered? This is undecidable! Two undecidable relational algebra problems: If e is a relational algebra expression, does it always produce (i.e., on every database)? Closely related: if e 1 and e 2 are two relational algebra expressions, is it true that e 1 (D) = e 2 (D) for every database? L. Libkin 15 Data Integration and Exchange

Equivalence of relational algebra expressions A side note this is the basis of query optimisation. But it can only be sound, never complete. Equivalence is undecidable for the full relational algebra π, σ,,, The good news: it is decidable for π, σ,, And quite efficiently for π, σ, And the latter form a very important class of queries, to be seen soon. L. Libkin 16 Data Integration and Exchange

View-based query answering relational algebra A very simple setting: exact LAV (and GAV) the source schema and the target schema are identical (say, for each R(A, B,C,...) in the source there is R (A,B, C,...) in the target) The constraints in M state that they are the same. The source does not publish any views: i.e. V =. If we can answer queries in this setting, it means they have to be answered independently of the data in the source. The only way it happens: Q(D 1 ) = Q(D 2 ) for all databases D 1,D 2 ; we output this answer without even looking at the view. But this (Q(D 1 ) = Q(D 2 ) for all databases D 1, D 2 ) is undecidable. L. Libkin 17 Data Integration and Exchange

A better class of queries Conjunctive queries They are the building blocks for SQL queries: SELECT... FROM R1,..., Rn WHERE <conjunction of equalities> For example: SELECT M.title, RV.review FROM Movie M, RV WHERE M.title=RV.title AND M.year = 2000 In relational calculus: {t, r d,y Movie(t,d,y) RV(t,r) y = 2000} L. Libkin 18 Data Integration and Exchange

Conjunctive queries {t,r d,y Movie(t, d,y) RV(t, r) y = 2000} Written using only conjunction and existential quantification hence the name. In relational algebra: π t,r ( σ y=2000 ( Movie Movie.t=RV.t RV )) Also called SPJ-queries (Select-Project-Join) These are all equivalent (exercise why?) L. Libkin 19 Data Integration and Exchange

Conjunctive queries: good properties QUERY CONTAINMENT: Input: two queries Q 1 and Q 2 Output: true if Q 1 (D) Q 2 (D) for all databases D QUERY EQUIVALENCE: Input: two queries Q 1 and Q 2 Output: true if Q 1 (D) = Q 2 (D) for all databases D For relational algebra queries, both are undecidable. For conjunctive queries, both are decidable. Complexity: NP. This gives an O(2 n ) algorithm. Can often be reasonable in practice queries are small. L. Libkin 20 Data Integration and Exchange

Conjunctive queries: good properties For each conjunctive query, one can find an equivalent query with the minimum number of joins. SELECT R2.A FROM R R1, R R2 WHERE R1.A=R2.A AND R1.B=2 AND R1.C=1 In relational algebra: π... (σ... (R R)) {x y, z R(x, 2, 1) R(x, y, z)} Looking at it carefully, this is equivalent to {x R(x, 2, 1)}, or π A (σ B=2 C=1 (R)) The join is saved: SELECT R.A FROM R WHERE R.B=2 AND R.C=1 L. Libkin 21 Data Integration and Exchange

Conjunctive queries: complexity Can one find a polynomial algorithm? Unlikely. Reminder: NP-completeness. Take a graph G = (V, E): V = {a 1,...,a n } the set of vertices; E is the set of edges (a i,a j ) and define a conjunctive query Q G = x 1,...x n (a i,a j ) E E(x i,x j ) Then G satisfies Q G iff there is a homomorphism from GtoG. A homomorphism from G to {(r, b), (r, g), (g, b), (g,r), (b, r), (b,g)} the graph is 3-colourable. L. Libkin 22 Data Integration and Exchange

Conjunctive queries: summary A nicely-behaved class Basic building blocks of SQL queries Easy to reason about Another important property: monotonicity: if D 1 D 2 then Q(D 1 ) Q(D 2 ) Heavily used in data integration/exchange L. Libkin 23 Data Integration and Exchange

GAV-exact with conjunctive queries Source: R 1 (A, B), R 2 (B,C) Global schema: T 1 (A, B,C), T 2 (B, C) Exact GAV mapping: T 1 = {x, y, z R 1 (x, y) R 2 (y, z)} (or R 1 B R 2 ) T 2 = {x, y R 2 (x,y)} Query Q: SELECT T1.A, T1.B. T2.C FROM T1, T2 WHERE T1.B=T2.B AND T1.C=T2.C As conjunctive query: {x,y, z T 1 (x, y, z) T 2 (y, z)} L. Libkin 24 Data Integration and Exchange

GAV-exact with conjunctive queries cont d Take {x, y, z T 1 (x,y, z) T 2 (y, z)} and unfold: {x,y, z R 1 (x, y) R 2 (y,z) R 2 (y, z)} or R 1 R 2 R 2 This is of course R 1 R 2. Bottom line: optimise after unfolding save joins. L. Libkin 25 Data Integration and Exchange

GAV-sound with conjunctive queries Source and global schema as before: source R 1 (A, B),R 2 (B,C) Global schema: T 1 (A, B, C), T 2 (B,C) GAV mappings become sound: T 1 {x, y, z R 1 (x,y) R 2 (y,z)} T 2 R 2 Let D exact be the unique database that arises from the exact setting (with replaced by =) Then every database D sound that satisfies the sound setting also satisfies D exact D sound L. Libkin 26 Data Integration and Exchange

GAV-sound with conjunctive queries cont d Conjunctive queries are monotone: D 1 D 2 Q(D 1 ) Q(D 2 ) Exact solution is a sound solution too, and is contained in every sound solution. Hence certain answers for each conjunctive query certain(d,q) = Q(D sound ) = Q(D exact ) D sound The solution for GAV-exact gives us certain asnwers for GAV-sound, for conjunctive (and more generally, monotone) queries. L. Libkin 27 Data Integration and Exchange