Data integration is harder than you thought



Transcription:

Data integration is harder than you thought
Maurizio Lenzerini, Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza"
CoopIS 2001, September 5, 2001, Trento, Italy

Outline
Introduction to data integration
Approaches to modeling and querying
Case study in LAV: hard
Case study in GAV: harder than you thought
Beyond LAV and GAV: even harder
Conclusions
Maurizio Lenzerini Data Integration 1

Architecture for data integration
[Figure: an application poses a query to a mediator (global schema) or to a data warehouse; wrappers expose the local schema of each source.]

Main problems in data integration
1. Heterogeneity of sources (intensional and extensional level)
2. Limitations in the mechanisms for accessing the sources
3. Materialized vs virtual integration
4. Data extraction, cleaning and reconciliation
5. How to process updates expressed on the global schema, and updates expressed on the sources
6. The querying problem: how to answer queries expressed on the global schema
7. The modeling problem: how to model the global schema, the sources, and the relationships between the two

The querying problem
Each query is expressed in terms of the global schema, and the associated mediator must reformulate the query in terms of a set of queries at the sources.
The crucial step is deciding the query plan, i.e., how to decompose the query into a set of subqueries to the sources.
The computed subqueries are then shipped to the sources, and the results are assembled into the final answer.

Quality in query answering
The data integration system should be designed in such a way that suitable quality criteria are met. Here, we concentrate on:
Soundness: the answer to queries includes nothing but the truth.
Completeness: the answer to queries includes the whole truth.
We aim at the whole truth, and nothing but the truth. But what the truth is depends on the approach adopted for modeling.

The modeling problem
[Figure: a global schema with relation $R_1$ and attributes $C_1$, $D_1$, $T_1$ (sample tuples $(c_1, d_1, t_1)$ and $(c_2, d_2, t_2)$), connected by a mapping to the source structures of Source 1 and Source 2.]

Outline
Introduction to data integration
Approaches to modeling and querying
Case study in LAV
Case study in GAV
Beyond LAV and GAV
Conclusions

The modeling problem: fundamental questions
How do we model the global schema (structured vs semistructured)?
How do we model the sources (conceptual and structural level)?
How do we model the relationship between the global schema and the sources?
Are the sources defined in terms of the global schema (this approach is called source-centric, or local-as-view, or LAV)?
Is the global schema defined in terms of the sources (this approach is called global-schema-centric, or global-as-view, or GAV)?
A mixed approach?

The modeling problem: formal framework
A data integration system $D$ is a triple $\langle G, S, M \rangle$, where $G$ is the global schema (structure and constraints), $S$ is the source schema (structures and constraints), and $M$ is the mapping between $G$ and $S$.
Semantics of $D$: which data satisfy $G$? We have to start with a source database $C$ (source data coherent with $S$):
$sem^C(D) = \{\, B \mid B \text{ is a database that is legal for } D \text{ wrt } C, \text{ i.e., that satisfies both } G \text{ and } M \text{ wrt } C \,\}$
A query $q$ to $D$ is expressed over $G$. If $q$ has arity $n$, then the answer to $q$ wrt $D$ and $C$ is
$q^{D,C} = \{\, (c_1, \ldots, c_n) \mid (c_1, \ldots, c_n) \in q^B, \ \forall B \in sem^C(D) \,\}$
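
The definition of $q^{D,C}$ above is an intersection over all legal databases. The following is a minimal sketch of that idea in Python, assuming we are handed a finite sample of legal databases (in general $sem^C(D)$ is infinite, so this is only an illustration). The names `certain_answers`, `titles_1998`, `b1` and `b2`, and all the data, are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Database = Dict[str, Set[Tuple]]          # relation name -> set of tuples
Query = Callable[[Database], Set[Tuple]]  # a query maps a database to a set of tuples


def certain_answers(query: Query, legal_databases: Iterable[Database]) -> Set[Tuple]:
    """Keep only the tuples returned by `query` on every database of the sample."""
    answer = None
    for db in legal_databases:
        result = query(db)
        answer = result if answer is None else answer & result
    # Assumption: an empty sample yields no answers here. Under the formal
    # definition, an empty sem^C(D) would vacuously admit every tuple.
    return answer if answer is not None else set()


# Two hypothetical legal databases that agree on a single 1998 movie.
b1 = {"movie": {("m1", 1998, "d1"), ("m2", 1998, "d2")}}
b2 = {"movie": {("m1", 1998, "d1"), ("m3", 1998, "d3")}}


def titles_1998(db: Database) -> Set[Tuple]:
    return {(t,) for (t, y, _d) in db["movie"] if y == 1998}


print(certain_answers(titles_1998, [b1, b2]))  # {('m1',)}
```

Only `m1` survives, because it is the only title returned in both databases.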

Global-as-view vs local-as-view: Example
Global schema: movie(Title, Year, Director), european(Director), review(Title, Critique)
Source 1: $r_1$(Title, Year, Director), since 1960, european directors
Source 2: $r_2$(Title, Critique), since 1990
Query: title and critique of movies in 1998
$\{ (T, R) \mid \exists D.\ movie(T, 1998, D) \wedge review(T, R) \}$, written $\{ (T, R) \mid movie(T, 1998, D) \wedge review(T, R) \}$

Local-as-view
[Figure: in LAV, each source is described in terms of the global schema: "This source contains ..."]

Formalization of LAV
In LAV, the mapping $M$ is constituted by a set of assertions
$s \rightsquigarrow \varphi_G$
one for each source structure $s$ in $S$, where $\varphi_G$ is a query over $G$. Given source data $C$, a database $B$ satisfies $M$ wrt $C$ if for each source $s \in S$:
$s^C \subseteq \varphi_G^B$
The mapping $M$ does not provide direct information about which data satisfies the global schema. To answer a query $q$ over $G$, we have to infer how to use $M$ in order to access the source data $C$.
Answering queries is an inference process, which is similar to answering queries with incomplete information.

Local-as-view: Example
Global schema: movie(Title, Year, Director), european(Director), review(Title, Critique)
Local-as-view: associated to relations at the sources we have views over the global schema
$r_1(T, Y, D) \rightsquigarrow \{ (T, Y, D) \mid movie(T, Y, D) \wedge european(D) \wedge Y \geq 1960 \}$
$r_2(T, R) \rightsquigarrow \{ (T, R) \mid movie(T, Y, D) \wedge review(T, R) \wedge Y \geq 1990 \}$
The query $\{ (T, R) \mid movie(T, 1998, D) \wedge review(T, R) \}$ is processed by means of an inference mechanism that aims at re-expressing the atoms of the global schema in terms of atoms at the sources. In this case:
$\{ (T, R) \mid r_2(T, R) \wedge r_1(T, 1998, D) \}$
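
As an illustration of the last step, the rewritten query over the sources can be evaluated as a join on the title. This is a minimal sketch in Python; the source extensions `r1` and `r2` below are made-up placeholder data, and `rewritten_query` is a hypothetical helper name.

```python
# Hypothetical source extensions (placeholder values, not taken from the slides).
r1 = {("t1", 1998, "dA"), ("t2", 1966, "dB")}   # (Title, Year, Director)
r2 = {("t1", "c1"), ("t3", "c3")}               # (Title, Critique)


def rewritten_query(r1, r2):
    """Evaluate { (T, R) | r2(T, R) AND r1(T, 1998, D) } as a join on T."""
    titles_1998 = {t for (t, y, _d) in r1 if y == 1998}
    return {(t, r) for (t, r) in r2 if t in titles_1998}


print(rewritten_query(r1, r2))  # {('t1', 'c1')}
```

The title `t3` has a critique in `r2` but is not returned: `r1` carries no evidence that it is a 1998 movie, so the rewriting stays sound at the cost of possibly missing answers.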

Query processing in LAV
Answering queries in LAV is like solving a mystery case:
Sources represent reliable witnesses.
Witnesses know part of the story, and source data represent what they know.
We have an explicit representation of what the witnesses know.
We have to solve the case (answering queries) based on the information we are able to gather from the witnesses.
Inference is needed.

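Global-as-view
[Figure: LAV and GAV side by side. In LAV each source is described in terms of the global schema: "This source contains ...". In GAV each element A of the global schema is described in terms of the sources: "The data of A are taken from source 1 and ..."]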

Formalization of GAV
In GAV, the mapping $M$ is constituted by a set of assertions
$g \rightsquigarrow \varphi_S$
one for each structure $g$ in $G$, where $\varphi_S$ is a query over $S$. Given source data $C$, a database $B$ satisfies $M$ wrt $C$ if for each $g \in G$:
$\varphi_S^C \subseteq g^B$
The mapping $M$ provides direct information about which data satisfies the global schema. Thus, given a query $q$ over $G$, it seems that we can simply evaluate the query over these data (as if we had a single database at hand). More on this later...

Global-as-view: Example
Global schema: movie(Title, Year, Director), european(Director), review(Title, Critique)
Global-as-view: associated to relations in the global schema we have views over the sources
$movie(T, Y, D) \rightsquigarrow \{ (T, Y, D) \mid r_1(T, Y, D) \}$
$european(D) \rightsquigarrow \{ (D) \mid r_1(T, Y, D) \}$
$review(T, R) \rightsquigarrow \{ (T, R) \mid r_2(T, R) \}$

Global-as-view: Example of query processing
The query $\{ (T, R) \mid movie(T, 1998, D) \wedge review(T, R) \}$ is processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations. In this case:
$movie(T, 1998, D) \wedge review(T, R) \ \xrightarrow{\text{unfolding}} \ r_1(T, 1998, D) \wedge r_2(T, R)$
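
Unfolding can be sketched as a purely syntactic substitution. The Python below is an illustrative sketch, not the mechanism of any particular GAV system: the representation of atoms as `(predicate, arguments)` pairs, the dictionary `GAV`, and the function `unfold` are all assumptions made for this example, and each global relation is assumed to be defined by a conjunction of source atoms.

```python
# An atom is (predicate, arguments); variables are strings, constants any other value.
# GAV mapping of the movie example: global predicate -> (view head, view body).
GAV = {
    "movie":    (("T", "Y", "D"), [("r1", ("T", "Y", "D"))]),
    "european": (("D",),          [("r1", ("T", "Y", "D"))]),
    "review":   (("T", "R"),      [("r2", ("T", "R"))]),
}


def unfold(query, mapping):
    """Replace each global atom by the body of its view definition."""
    unfolded = []
    for i, (pred, args) in enumerate(query):
        head, body = mapping[pred]
        subst = dict(zip(head, args))  # bind head variables to the atom's arguments
        for src, src_args in body:
            # Variables of the view that do not occur in its head are
            # existential: give them a fresh name per query atom.
            unfolded.append((src, tuple(subst.get(v, f"{v}_{i}") for v in src_args)))
    return unfolded


q = [("movie", ("T", 1998, "D")), ("review", ("T", "R"))]
print(unfold(q, GAV))
# [('r1', ('T', 1998, 'D')), ('r2', ('T', 'R'))]
```

The result matches the unfolded conjunction shown on the slide.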

Query processing in GAV
We do not have any explicit representation of what the witnesses know.
All the information that the witnesses can provide has been compiled into an investigation report (the global schema, and the mapping).
Solving the case (answering queries) means basically looking at the investigation report.

Global-as-view and local-as-view: Comparison
Local-as-view (Information Manifold, DWQ, Picsel):
Quality depends on how well we have characterized the sources.
High modularity and reusability (if the global schema is well designed, when a source changes, only its definition is affected).
Query processing needs reasoning (query reformulation complex).
Global-as-view (Carnot, SIMS, Tsimmis, ...):
Quality depends on how well we have compiled the sources into the global schema through the mapping.
Whenever a source changes or a new one is added, the global schema needs to be reconsidered.
Query processing can be based on some sort of unfolding (query reformulation looks easier).
For more details, see [Ullman, TCS 2000], [Halevy, SIGMOD 2000].

Outline
Introduction to data integration
Approaches to modeling and querying
Case study in LAV
Case study in GAV
Beyond LAV and GAV
Conclusions

A case study in LAV
We deal with the problem of answering queries to data integration systems of the form $\langle G, S, M \rangle$, where
the global schema $G$ is semi-structured,
the sources in $S$ are relational,
the mapping $M$ is of type LAV,
queries are typical of semi-structured data.

The query answering problem
Given data integration system $D = \langle G, S, M \rangle$, source database $C$, query $q$, and tuple $t$, check whether $t \in q^{D,C}$ (i.e., whether $t \in q^B$ for all $B \in sem^C(D)$).
Recent results:
Complexity for several query and view languages [Abiteboul et al, PODS 98], [Grahne et al, ICDT 99]
Schemas expressed in Description Logics [Calvanese et al, AAAI 2000]
Regular path queries without inverse [Calvanese et al, ICDE 2000] and with inverse [Calvanese et al, PODS 2000]
Conjunctive RPQIs [Calvanese et al, KR 2000], [Calvanese et al, LICS 2000], [Calvanese et al, DBPL 2001]

Global databases and queries
[Figure: an example global database, drawn as a graph whose edges are labeled sub, calls and var, together with one example RPQ and one example RPQI over these labels.]

Regular path queries with inverse
Regular-path queries with inverse (RPQIs) are expressed by means of finite-state automata over $\Sigma^{\pm} = \Sigma \cup \{ p^- \mid p \in \Sigma \}$ ($p^-$ denotes the inverse of the binary relation $p$).
[Figure: an example RPQI.]

Finite state automata and RPQIs
[Figure: a graph over the nodes a, b, c, d.]
Consider the query $Q = r \cdot (p \cup q) \cdot p^- \cdot p \cdot q \cdot q^*$
Automaton for $Q$:
$s_1 \in \delta(s_0, r)$, $s_2 \in \delta(s_1, p)$, $s_2 \in \delta(s_1, q)$, $s_3 \in \delta(s_2, p^-)$, $s_4 \in \delta(s_3, p)$, $s_5 \in \delta(s_4, q)$, $s_5 \in \delta(s_5, q)$
The computation for RPQIs is not completely captured by finite state automata.

Two-way automata
A two-way automaton $A = (\Gamma, S, S_0, \rho, F)$ consists of an alphabet $\Gamma$, a finite set of states $S$, a set of initial states $S_0 \subseteq S$, a transition function $\rho : S \times \Gamma \rightarrow 2^{S \times \{-1, 0, 1\}}$ and a set of accepting states $F \subseteq S$.
Given a two-way automaton $A$ with $n$ states, one can construct a one-way automaton $B_1$ with $O(2^{n \log n})$ states such that $L(B_1) = L(A)$, and a one-way automaton $B_2$ with $O(2^n)$ states such that $L(B_2) = \Gamma^* \setminus L(A)$.

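Two-way automata and RPQIs
[Figure: a graph over the nodes a, b, c, d.]
Consider the query $Q = r \cdot (p \cup q) \cdot p^- \cdot p \cdot q \cdot q^*$
Automaton for $Q$:
$s_1 \in \delta(s_0, r)$, $s_2 \in \delta(s_1, p)$, $s_2 \in \delta(s_1, q)$, $s_3 \in \delta(s_2, p^-)$, $s_4 \in \delta(s_3, p)$, $s_5 \in \delta(s_4, q)$, $s_5 \in \delta(s_5, q)$
Two-way automaton:
$(s_1, 1) \in \delta_A(s_0, r)$, $(s_2, 1) \in \delta_A(s_1, p)$, $(s_2, 1) \in \delta_A(s_1, q)$, $(s_3, 0) \in \delta_A(s_2^{\leftarrow}, p)$, $(s_4, 1) \in \delta_A(s_3, p)$, $(s_5, 1) \in \delta_A(s_4, q)$, $(s_f, 1) \in \delta_A(s_5, \$)$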

Two-way automata and RPQIs
Given an RPQI $E = (\Sigma, S, I, \delta, F)$ over the alphabet $\Sigma$, the corresponding two-way automaton $A_E$ is
$(\Sigma_A = \Sigma \cup \{\$\},\ S_A = S \cup \{s_f\} \cup \{ s^{\leftarrow} \mid s \in S \},\ I,\ \delta_A,\ \{s_f\})$
where $\delta_A$ is defined as follows:
$(s_2, 1) \in \delta_A(s_1, r)$, for each transition $s_2 \in \delta(s_1, r)$ of $E$
enter backward mode: $(s^{\leftarrow}, -1) \in \delta_A(s, \ell)$, for each $s \in S$ and $\ell \in \Sigma_A$
exit backward mode: $(s_2, 0) \in \delta_A(s_1^{\leftarrow}, r)$, for each transition $s_2 \in \delta(s_1, r^-)$ of $E$
$(s_f, 1) \in \delta_A(s, \$)$, for each $s \in F$
$\Longrightarrow$ $w$ satisfies $E$ iff $w\$ \in L(A_E)$.
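
The construction above can be tried out on small inputs. The following Python sketch builds the transitions of $A_E$ from a list of RPQI transitions and simulates the two-way automaton by exploring (state, head position) configurations. Several things here are assumptions for illustration: inverse labels are written with a trailing `-` (so `"p-"` stands for $p^-$), backward-mode states are encoded as `("back", s)`, and a word is taken to be accepted when the accepting state is reached with the head just past the last symbol.

```python
from collections import deque


def build_two_way(states, initial, delta, final, alphabet):
    """delta: iterable of (s1, label, s2); label 'p' is forward, 'p-' is inverse."""
    end = "$"
    trans = {}  # (state, symbol) -> set of (next_state, head_move)

    def add(src, sym, dst, move):
        trans.setdefault((src, sym), set()).add((dst, move))

    for s1, label, s2 in delta:
        if label.endswith("-"):
            add(("back", s1), label[:-1], s2, 0)  # exit backward mode
        else:
            add(s1, label, s2, 1)                 # ordinary forward step
    for s in states:
        for sym in set(alphabet) | {end}:
            add(s, sym, ("back", s), -1)          # enter backward mode
    for s in final:
        add(s, end, "s_f", 1)                     # read the end marker
    return trans, set(initial), {"s_f"}


def accepts(trans, initial, accepting, word):
    """Breadth-first search over (state, position) configurations."""
    n = len(word)
    seen = {(s, 0) for s in initial}
    queue = deque(seen)
    while queue:
        state, pos = queue.popleft()
        if pos == n:
            if state in accepting:
                return True
            continue
        for nxt, move in trans.get((state, word[pos]), ()):
            cfg = (nxt, pos + move)
            if 0 <= cfg[1] <= n and cfg not in seen:
                seen.add(cfg)
                queue.append(cfg)
    return False


# Hypothetical RPQI E = p . p- . p with states 0..3.
E = build_two_way([0, 1, 2, 3], [0], [(0, "p", 1), (1, "p-", 2), (2, "p", 3)], [3], ["p", "q"])
print(accepts(*E, ["p", "$"]))  # True: forward over p, back over it, forward again
print(accepts(*E, ["q", "$"]))  # False
```

Because the number of configurations is finite (states times positions), the search always terminates even though the head may move back and forth.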

Query answering: basic idea
Given $D = \langle G, S, M \rangle$, source database $C$, query $q$, and tuple $(c, d)$, we search for a counterexample to $(c, d) \in q^{D,C}$, i.e., a database $B \in sem^C(D)$ such that $(c, d) \notin q^B$.
Each counterexample DB $B$ can be represented by a word $w_B$ over the alphabet $\Sigma_A = \Sigma \cup C \cup \{\$\}$, which has the form
$\$\, d_1\, w_1\, d_2\, \$\, d_3\, w_2\, d_4\, \$ \cdots \$\, d_{2m-1}\, w_m\, d_{2m}\, \$$
where $d_1, \ldots, d_{2m}$ range over data objects in $C$ (simply denoted by $C$), $w_i \in \Sigma^+$, and the \$ acts as a separator.
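
The word format above is easy to generate. A minimal sketch in Python, assuming a database is given as a list of labeled paths `(start_object, labels, end_object)`; the function name `encode_database` and the labels in the usage line are hypothetical.

```python
def encode_database(paths):
    """Encode paths (start, [labels...], end) as the word $ d1 w1 d2 $ d3 w2 d4 $ ..."""
    word = ["$"]
    for start, labels, end in paths:
        if not labels:
            # Each w_i must be a non-empty sequence of edge labels (w_i in Sigma+).
            raise ValueError("each block needs at least one edge label")
        word += [start, *labels, end, "$"]
    return word


print(encode_database([("d1", ["r"], "d2"), ("d2", ["p", "q"], "d3")]))
# ['$', 'd1', 'r', 'd2', '$', 'd2', 'p', 'q', 'd3', '$']
```

The same data object may occur in several blocks (here `d2`), which is why the automaton later needs a way to jump between occurrences of the same object.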

Two-way automata and canonical DBs
[Figure: a global schema $G$, source views whose extensions contain the pairs $(d_1, d_2)$, $(d_4, d_5)$, $(d_4, d_2)$, $(d_3, d_3)$, $(d_2, d_3)$, and a database for $G$ over the objects $d_1, \ldots, d_5$.]

Two-way automata and canonical DBs
[Figure: the database $B$ over the objects $d_1, \ldots, d_5$.]
As a word, $B$ is a sequence of \$-separated blocks, one per path: from $d_4$ to $d_5$, from $d_1$ to $d_2$, from $d_4$ to $d_2$, from $d_3$ to $d_3$, and from $d_2$ to $d_3$ (the last block contains a $q$ edge).
The above database $B$ is a counterexample to $(d_2, d_3) \in Q^{D,C}$. To verify that $(d_2, d_3) \notin Q^B$, we exploit not only the ability of two-way automata to move on the word both forward and backward, but also the ability to jump from one position in the word representing a node to any other position (either preceding or succeeding) representing the same node.

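Query answering: Basic idea
If $Q = (\Sigma, S, I, \delta, F)$, then $A_{(Q,a,b)} = (\Sigma_A, S_A, \{s_0\}, \delta_A, \{s_f\})$, where $S_A = S \cup \{s_0, s_f\} \cup \{ s^{\leftarrow} \mid s \in S \} \cup (S \times D)$, and
1. $(s^{\leftarrow}, -1) \in \delta_A(s, \ell)$, for each $s \in S$ and $\ell \in \Sigma \cup C$
2. $(s_2, 1) \in \delta_A(s_1, r)$, for each $s_2 \in \delta(s_1, r)$
3. $(s_2, 0) \in \delta_A(s_1^{\leftarrow}, r)$, for each $s_2 \in \delta(s_1, r^-)$
4. $((s, d), 0) \in \delta_A(s, d)$, $((s, d), 0) \in \delta_A(s^{\leftarrow}, d)$, $((s, d), 1) \in \delta_A((s, d), \ell)$, $((s, d), -1) \in \delta_A((s, d), \ell)$, $(s, 0) \in \delta_A((s, d), d)$, $(s, 1) \in \delta_A(s, d)$
5. $(s_0, 1) \in \delta_A(s_0, \ell)$, for each $\ell \in \Sigma_A$, and $(s, 0) \in \delta_A(s_0, a)$, for each $s \in I$
6. $(s_f, 0) \in \delta_A(s, b)$, for each $s \in F$, and $(s_f, 1) \in \delta_A(s_f, \ell)$, for each $\ell \in \Sigma_A$
$A_{(Q,a,b)}$ accepts $w_B$ iff $(a, b) \in Q^B$.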

A run of $A_{(Q,d_1,d_3)}$
[Figure: the database over $d_1, \ldots, d_5$ and its encoding as a word; the head position advances from slide to slide.]
Steps of the run (current state, transition taken):
1. State $s_0$: $(s_0, 1) \in \delta_A(s_0, \ell)$, for each $\ell \in \Sigma_A$, applied repeatedly while the head scans right towards $d_1$.
2. State $s_0$: $(s_1, 0) \in \delta_A(s_0, d_1)$; $s_1$ is the initial state for $Q$.
3. State $s_1$: $(s_1, 1) \in \delta_A(s_1, d_1)$.
4. State $s_1$: the head reads the edge label after $d_1$ and moves right into $s_2$ (transition coming from $Q$).
5. State $s_2$: $((s_2, d_2), 1) \in \delta_A(s_2, d_2)$; search for $d_2$.
6. State $(s_2, d_2)$: $((s_2, d_2), 1) \in \delta_A((s_2, d_2), \ell)$ for $\ell = \$$, then $\ell = d_4$, then an edge label; search for $d_2$.
7. State $(s_2, d_2)$: $(s_2, 0) \in \delta_A((s_2, d_2), d_2)$; exit search mode.
8. State $s_2$: $(s_2^{\leftarrow}, -1) \in \delta_A(s_2, d_2)$; backward mode.
9. State $s_2^{\leftarrow}$: the head reads the edge label, stays in place, and enters $s_3$ (transition coming from $Q$).
10. State $s_3$: the head reads the edge label and moves right into $s_4$ (transition coming from $Q$).
11. State $s_4$: $((s_4, d_2), 1) \in \delta_A(s_4, d_2)$; search for $d_2$.
12. State $(s_4, d_2)$: $((s_4, d_2), 1) \in \delta_A((s_4, d_2), \ell)$ for $\ell = \$$, $d_3$, the edge labels, $d_3$, $\$$; search for $d_2$.
13. State $(s_4, d_2)$: $(s_4, 0) \in \delta_A((s_4, d_2), d_2)$; exit search mode.
14. State $s_4$: $(s_4, 1) \in \delta_A(s_4, d_2)$.
15. State $s_4$: the head reads the edge label after $d_2$ and moves right into $s_5$ (transition coming from $Q$).
16. State $s_5$: $(s_6, 1) \in \delta_A(s_5, q)$; transition coming from $Q$.
17. State $s_6$: $(s_7, 0) \in \delta_A(s_6, d_3)$; $s_7$ is the final state.
18. State $s_7$: $(s_7, 1) \in \delta_A(s_7, d_3)$, then $(s_7, 1) \in \delta_A(s_7, \$)$.
19. State $s_7$ is final: the word is accepted by $A_{(Q,d_1,d_3)}$!

Query answering: Technique
To check whether $(c, d) \notin Q^B$ for some $B \in sem^C(D)$, we check for nonemptiness of $A$, that is the intersection of
the one-way automaton $A_0$ that accepts words that represent databases, i.e., words of the form $(\$ \cdot C \cdot \Sigma^+ \cdot C)^* \cdot \$$,
the one-way automata corresponding to the various $A_{(S_i,a,b)}$ (for each source $S_i$ and for each pair $(a, b) \in S_i^C$),
the one-way automaton corresponding to the complement of $A_{(Q,c,d)}$.
Indeed, any word accepted by such intersection automaton represents a counterexample to $(c, d) \in Q^{D,C}$, i.e., a database $B \in sem^C(D)$ such that $(c, d) \notin Q^B$.

Query answering: Complexity
All two-way automata constructed above are of linear size in the size of $Q$, $def(S_1), \ldots, def(S_k)$, and $S_1^C, \ldots, S_k^C$. Hence, the corresponding one-way automata would be exponential.
However, we do not need to construct $A$ explicitly. Instead, we can construct it on the fly while checking for nonemptiness.
Query answering for RPQIs is PSPACE-complete (coNP-complete if complexity is measured wrt the size of source data $C$ only).

Query answering: the complete picture
Different assumptions:
1. Database domain may be:
   completely known (closed domain assumption, CDA)
   partially known (open domain assumption, ODA)
2. Each source may be:
   exact: provides exactly the data specified in the associated view
   sound: provides a subset of the data specified in the associated view
   complete: provides a superset of the data specified in the associated view

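Polynomial intractability: RPQ
Given a graph $G = (N, E)$, we define $D = \langle G, S, M \rangle$ and source database $C$:
$V_s \rightsquigarrow R_s$
$V_e \rightsquigarrow R_e$
$V_G \rightsquigarrow R_{rg} \cup R_{gr} \cup R_{rb} \cup R_{br} \cup R_{gb} \cup R_{bg}$
$V_s^C = \{ (c, a) \mid a \in N,\ c \notin N \}$
$V_e^C = \{ (a, d) \mid a \in N,\ d \notin N \}$
$V_G^C = \{ (a, b), (b, a) \mid (a, b) \in E \}$
$Q = R_s \cdot M \cdot R_e$
where $M$ describes all mismatched edge pairs (e.g., $R_{rg} \cdot R_{rb}$).
If $G$ is 3-colorable, then there is a database where $M$ (and $Q$) is empty, i.e., $(c, d) \notin Q^{D,C}$.
If $G$ is not 3-colorable, then $M$ is nonempty in every database, i.e., $(c, d) \in Q^{D,C}$.
$\Longrightarrow$ coNP-hard wrt data complexity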

Complexity of query answering: the complete picture
Assumption on domain   Assumption on views   Data complexity   Expression complexity   Combined complexity
closed                 all sound             coNP              coNP                    coNP
closed                 all exact             coNP              coNP                    coNP
closed                 arbitrary             coNP              coNP                    coNP
open                   all sound             coNP              PSPACE                  PSPACE
open                   all exact             coNP              PSPACE                  PSPACE
open                   arbitrary             coNP              PSPACE                  PSPACE

Outline
Introduction to data integration
Approaches to modeling and querying
Case study in LAV
Case study in GAV
Beyond LAV and GAV
Conclusions

Coming back to GAV
In GAV, the mapping $M$ is constituted by a set of assertions
$g \rightsquigarrow \varphi_S$
one for each structure $g$ in $G$, where $\varphi_S$ is a query over $S$. Given source database $C$, a database $B$ satisfies $M$ wrt $C$ if for each $g \in G$:
$\varphi_S^C \subseteq g^B$
If $G$ does not have constraints, we can simply limit our attention to one model of the information integration system, and answering queries reduces to
using $M$ for computing from $C$ the virtual global database, i.e., the tuples satisfying the various $\varphi_S$ associated to each structure $g$ of $G$,
evaluating the query $q$ over the data obtained for the various $g$'s.

GAV with constraints in the global schema: example
Consider $D = \langle G, S, M \rangle$, with
Global schema $G$:
student(Scode, Sname, Scity), key{Scode}
university(Ucode, Uname), key{Ucode}
enrolled(Scode, Ucode), key{Scode, Ucode}
enrolled[Scode] $\subseteq$ student[Scode]
enrolled[Ucode] $\subseteq$ university[Ucode]
Sources $S$: $s_1$, $s_2$, $s_3$
Mapping $M$:
$student \rightsquigarrow \{ (X, Y, Z) \mid s_1(X, Y, Z, W) \}$
$university \rightsquigarrow \{ (X, Y) \mid s_2(X, Y) \}$
$enrolled \rightsquigarrow \{ (X, W) \mid s_3(X, W) \}$

Constraints in GAV: example
Global database:
university(code, name): (AF, bocconi), (BN, ucla)
student(code, name, city): (15, bill, oslo), (12, anne, florence), (16, ?, ?)
enrolled(Scode, Ucode): (12, AF), (16, BN)
Source database:
$s_1^C$: (12, anne, florence, 21), (15, bill, oslo, 24)
$s_2^C$: (AF, bocconi), (BN, ucla)
$s_3^C$: (12, AF), (16, BN)

Constraints in GAV: example
Source database $C$:
$s_1^C$: (12, anne, florence, 21), (15, bill, oslo, 24)
$s_2^C$: (AF, bocconi), (BN, ucla)
$s_3^C$: (12, AF), (16, BN)
$s_3^C(16, BN)$ implies $enrolled^B(16, BN)$, for all $B \in sem^C(D)$.
Due to the integrity constraints in the global schema, 16 is the code of some student in all $B \in sem^C(D)$.
Since $C$ says nothing about the name and the city of such student, we must accept as legal for $D$ all virtual global databases that differ in such attributes.

GAV revisited
If $G$ does have constraints, then several situations are possible, given the source data $C$:
no model exists for the data integration system,
the data integration system has one model,
several models exist for the information integration system.
In GAV too, answering queries is an inference process coping with incomplete information.
Coming back to the analogy with the mystery case, constraints in the global schema can make the investigation report incomplete/incoherent, so that answering queries may require reasoning on the investigation report.

A case study in GAV
We deal with the problem of answering queries to data integration systems of the form $\langle G, S, M \rangle$, where
the global schema $G$ is relational, with both key and foreign key constraints,
the sources in $S$ are relational,
the mapping $M$ is of type GAV,
queries are conjunctive queries.

Unfolding is not sufficient in our context
Mapping $M$:
$student \rightsquigarrow \{ (X, Y, Z) \mid s_1(X, Y, Z, W) \}$
$university \rightsquigarrow \{ (X, Y) \mid s_2(X, Y) \}$
$enrolled \rightsquigarrow \{ (X, W) \mid s_3(X, W) \}$
Source database:
$s_1^C$: (12, anne, florence, 21), (15, bill, oslo, 24)
$s_2^C$: (AF, bocconi), (BN, ucla)
$s_3^C$: (12, AF), (16, BN)
Query: $\{ (X) \mid student(X, Y, Z),\ enrolled(X, W) \}$
Unfolding wrt $M$: $\{ (X) \mid s_1(X, Y, Z, V),\ s_3(X, W) \}$
This retrieves only the answer {12} from $C$, although {12, 16} is the correct answer. The simple unfolding strategy is not sufficient in our context.
Most GAV systems use the simple unfolding strategy!

Processing queries in GAV: technique

Techniques for automated reasoning on incomplete information are needed.

In our context, we have developed the following technique for processing queries:
- Given query q, we compute another query exp_G(q), called the expansion of q wrt the constraints of G (partial evaluation)
- We unfold exp_G(q) wrt M, and obtain a query unf_M(exp_G(q)) over the sources
- We evaluate unf_M(exp_G(q)) over the source database C

exp_G(q) can be of exponential size wrt G, but the whole process has polynomial time complexity wrt the size of C (see [Calvanese et al, 2001] for details).
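The three steps can be traced on the running example. The sketch below is not the general expansion algorithm of the cited work: the expansion is written by hand for this one query, assuming a foreign key from enrolled.Scode to student.code, so that student(X, Y, Z) may be replaced by enrolled(X, W1) when Y and Z occur nowhere else. The helpers `unfold` and `evaluate` and the variable naming are our own.

```python
SOURCES = {
    "s1": {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)},
    "s2": {("AF", "bocconi"), ("BN", "ucla")},
    "s3": {(12, "AF"), (16, "BN")},
}

# GAV mapping: global relation -> (source relation, number of leading columns kept).
MAPPING = {"student": ("s1", 3), "university": ("s2", 2), "enrolled": ("s3", 2)}

# Query q: { (X) | student(X, Y, Z), enrolled(X, W) }
Q = [("student", ("X", "Y", "Z")), ("enrolled", ("X", "W"))]

# Step 1 (hand-coded, ASSUMED foreign key enrolled.Scode -> student.code):
# the expansion is a union of conjunctive queries.
EXPANSION = [
    Q,
    [("enrolled", ("X", "W1")), ("enrolled", ("X", "W"))],
]


def unfold(atoms):
    """Step 2: replace each global atom by its source atom, padding dropped columns."""
    out = []
    for i, (rel, args) in enumerate(atoms):
        src, kept = MAPPING[rel]
        arity = len(next(iter(SOURCES[src])))
        pad = tuple(f"_U{i}_{j}" for j in range(arity - kept))
        out.append((src, tuple(args) + pad))
    return out


def evaluate(head, atoms, db):
    """Step 3: nested-loop evaluation of a conjunctive query whose arguments are all variables."""
    bindings = [{}]
    for rel, args in atoms:
        extended = []
        for b in bindings:
            for tup in db[rel]:
                nb = dict(b)
                # Bind new variables, reject tuples that clash with existing bindings.
                if all(nb.setdefault(v, c) == c for v, c in zip(args, tup)):
                    extended.append(nb)
        bindings = extended
    return {tuple(b[v] for v in head) for b in bindings}


def answer(head, disjuncts):
    """Union of the answers of all unfolded disjuncts."""
    result = set()
    for atoms in disjuncts:
        result |= evaluate(head, unfold(atoms), SOURCES)
    return result


print(answer(("X",), [Q]))        # plain unfolding
print(answer(("X",), EXPANSION))  # expansion followed by unfolding
```

The second disjunct is what recovers student 16: it asks only for an enrollment, which the foreign key turns into evidence that the student exists.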

Processing queries in GAV: technique

The problems mentioned above also hold when:
- The global schema is expressed in terms of a conceptual data model, see [Calvanese et al, ER 2001]
- An ontology is used as global schema, see [Calvanese et al, SWWS 2001]
- The global schema is expressed in terms of a semistructured data model (e.g., XML)
- The mapping M has the following different semantics (exact sources): given source database C, a database B satisfies g ⇝ φ_S wrt C if g^B = φ_S^C

The case of exact sources in GAV with constraints

Global database:

University
code  name
AF    bocconi
BN    ucla

Student
code  name  city
15    bill  oslo
12    anne  florence

Enrolled
Scode  Ucode
12     AF
16     BN

Inconsistency: no Student tuple with code 16.

Source database C:

s1^C
12  anne  florence  21
15  bill  oslo      24

s2^C
AF  bocconi
BN  ucla

s3^C
12  AF
16  BN
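With exact sources the global database is fully determined by C, so the inconsistency can be detected by a direct constraint check. A minimal sketch follows, again assuming a foreign key from Enrolled.Scode to Student.code; the function name is hypothetical.

```python
# Global relations as determined by the exact sources in the example.
student = {(15, "bill", "oslo"), (12, "anne", "florence")}
enrolled = {(12, "AF"), (16, "BN")}


def dangling_student_codes():
    """Student codes used in Enrolled that have no tuple in Student."""
    referenced = {s for (s, _u) in enrolled}
    existing = {code for (code, _name, _city) in student}
    return referenced - existing


print(dangling_student_codes())
```

Under sound sources the missing tuple could be supplied by some legal database; under exact sources nothing may be added to Student, so no model exists.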

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

Beyond LAV and GAV

Global schema: Work(Researcher, Project), Area(Project, Field)
Source 1: Interest(Person, Field)
Source 2: Get(Researcher, Grant), For(Grant, Project)

Mapping:
- r being interested in field f maps to: there exists a project p such that r works for p and the area of p is f.
- r getting grant g for project p maps to: r working for p.

This situation cannot be represented in GAV or LAV.

The modeling problem: GLAV = GAV + LAV

A more general method for specifying the mapping between the global schema and the sources is based on assertions of the forms:
- φ_S ⇝s φ_G (sound source)
- φ_S ⇝c φ_G (complete source)
where φ_S is a query on S and φ_G is a query on G.

Given source database C, a database B for G satisfies M wrt C if:
- for each assertion φ_S ⇝s φ_G in M, we have that φ_S^C ⊆ φ_G^B,
- for each assertion φ_S ⇝c φ_G in M, we have that φ_G^B ⊆ φ_S^C.

Example of GLAV

Global schema: Work(Researcher, Project), Area(Project, Field)
Source 1: Interest(Person, Field)
Source 2: Get(Researcher, Grant), For(Grant, Project)

GLAV mapping:
{ (r, f) | Interest(r, f) } ⇝s { (r, f) | Work(r, p) ∧ Area(p, f) }
{ (r, p) | Get(r, g) ∧ For(g, p) } ⇝s { (r, p) | Work(r, p) }
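The sound-source condition (answers of the source query over C are contained in the answers of the global query over B) can be checked for this mapping as follows. This is a sketch under stated assumptions: the slides give no data for this example, so all tuples below are invented, and the function names are ours.

```python
def q_interest(src):
    """{ (r, f) | Interest(r, f) } over the sources."""
    return set(src["Interest"])


def q_work_area(glob):
    """{ (r, f) | Work(r, p) and Area(p, f) } over the global database."""
    return {(r, f) for (r, p) in glob["Work"] for (p2, f) in glob["Area"] if p == p2}


def q_get_for(src):
    """{ (r, p) | Get(r, g) and For(g, p) } over the sources."""
    return {(r, p) for (r, g) in src["Get"] for (g2, p) in src["For"] if g == g2}


def q_work(glob):
    """{ (r, p) | Work(r, p) } over the global database."""
    return set(glob["Work"])


# Each pair is (source query, global query) of one sound assertion.
ASSERTIONS = [(q_interest, q_work_area), (q_get_for, q_work)]


def satisfies_sound(glob, src):
    """True if every source query result is contained in its global query result."""
    return all(qs(src) <= qg(glob) for qs, qg in ASSERTIONS)


# Hypothetical source database C (not from the slides).
C = {
    "Interest": {("ann", "databases")},
    "Get": {("ann", "g1")},
    "For": {("g1", "p1")},
}

# Two hypothetical global databases: the second lacks the Area fact for p1.
B_ok = {"Work": {("ann", "p1")}, "Area": {("p1", "databases")}}
B_bad = {"Work": {("ann", "p1")}, "Area": set()}

print(satisfies_sound(B_ok, C))
print(satisfies_sound(B_bad, C))
```

The first assertion has a join with an existential variable on the global side, which is the part that a pure GAV mapping cannot express.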

Technique for GLAV

The mapping assertion φ_S ⇝s φ_G can be seen as φ_S ⊆ g ⊆ φ_G, where g is a new symbol added to G.

Therefore, we can translate φ_S ⇝s φ_G into:
- the GAV mapping rule g ⇝ φ_S
- the constraint g ⊆ φ_G
thus obtaining a GAV system with constraints, which can be dealt with by a variant of the technique described above [Calì et al, FMII 2001].

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

Conclusions
- Data integration applications have to cope with incomplete information, no matter which modeling approach is used
- Some techniques have already been developed, but several open problems still remain (in LAV, GAV, and GLAV)
- Many other problems not addressed here are relevant in data integration (e.g., how to construct the global schema, how to deal with inconsistencies, how to cope with updates, ...)
- In particular, given the complexity of sound and complete query answering, it is interesting to look at methods that accept lower-quality answers, trading accuracy for efficiency

Acknowledgements

Many thanks to
- Andrea Calì
- Diego Calvanese
- Giuseppe De Giacomo
- Domenico Lembo
- Moshe Vardi