Data integration: A theoretical perspective




Data integration: A theoretical perspective
Maurizio Lenzerini
Dipartimento di Informatica e Sistemistica "Antonio Ruberti", Università di Roma "La Sapienza"
Tutorial at PODS 2002, Madison, Wisconsin, USA, June 2002

Data integration
[Figure: a query is posed over the global schema; the mapping connects the global schema to the source schemas of Source 1 and Source 2.]

Outline
- Formal framework for data integration
- Approaches to data integration
- Query answering in different approaches
- Dealing with inconsistency
- Reasoning on queries in data integration
- Conclusions

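Formal framework
A data integration system \(\mathcal{I}\) is a triple \(\langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle\), where
- \(\mathcal{G}\) is the global schema (over an alphabet \(\mathcal{A}_{\mathcal{G}}\))
- \(\mathcal{S}\) is the source schema (over an alphabet \(\mathcal{A}_{\mathcal{S}}\))
- \(\mathcal{M}\) is the mapping between \(\mathcal{G}\) and \(\mathcal{S}\)
Semantics of \(\mathcal{I}\): which are the databases that satisfy \(\mathcal{I}\) (models of \(\mathcal{I}\))?
We refer only to databases over a fixed infinite domain \(\Gamma\), and we start with a source database \(\mathcal{C}\) (data available at the sources, also called source model) over \(\Gamma\). The set of databases that satisfy \(\mathcal{I}\) relative to \(\mathcal{C}\) is:
\( sem^{\mathcal{C}}(\mathcal{I}) = \{\, \mathcal{B} \mid \mathcal{B} \text{ is legal wrt } \mathcal{G} \text{ and satisfies } \mathcal{M} \text{ wrt } \mathcal{C} \,\} \)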

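Semantics of queries to I
A query q of arity n is a FOL formula with n free variables. If D is a database, then \(q^{D}\) denotes the extension of q in D (i.e., the set of valuations in \(\Gamma\) for the free variables of q that make q true in D).
If q is a query of arity n posed to a data integration system \(\mathcal{I}\) (i.e., a query over \(\mathcal{A}_{\mathcal{G}}\)), then the set of certain answers to q wrt \(\mathcal{I}\) and \(\mathcal{C}\) is
\( q^{\mathcal{I},\mathcal{C}} = \{\, (c_1,\dots,c_n) \mid (c_1,\dots,c_n) \in q^{\mathcal{B}} \text{ for every } \mathcal{B} \in sem^{\mathcal{C}}(\mathcal{I}) \,\} \)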
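The definition of certain answers can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the set of models is a small finite list (in general sem^C(I) is infinite, so this is not a general algorithm); the databases `b1`, `b2` and the query `titles_1998` are hypothetical.

```python
def certain_answers(query, models):
    """Return the tuples that `query` yields in every database of `models`.

    `models` stands in for sem^C(I); here it is assumed to be finite and non-empty.
    """
    answer_sets = [set(query(db)) for db in models]
    if not answer_sets:
        raise ValueError("at least one model is required")
    return set.intersection(*answer_sets)


# Two hypothetical models over a global relation movie(Title, Year, Director).
b1 = {"movie": {("m1", 1998, "d1"), ("m2", 1998, "d2")}}
b2 = {"movie": {("m1", 1998, "d1")}}


def titles_1998(db):
    """Hypothetical query: titles of movies of 1998."""
    return {(t,) for (t, y, _d) in db["movie"] if y == 1998}


result = certain_answers(titles_1998, [b1, b2])
```

Only the tuple for `m1` is returned, because `m2` does not appear in every model.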

Databases with incomplete information
- Traditional database: one model of a first-order theory. Query answering means evaluating a formula in the model.
- Database with incomplete information: set of models (specified, for example, as a restricted first-order theory). Query answering means computing the tuples that satisfy the query in all the models in the set.
There is a strong connection between query answering in data integration and query answering in databases with incomplete information under constraints.


The mapping
How is the mapping M between G and S specified?
- Are the sources defined in terms of the global schema? Approach called source-centric, or local-as-view, or LAV.
- Is the global schema defined in terms of the sources? Approach called global-schema-centric, or global-as-view, or GAV.
- A mixed approach? Approach called GLAV.
- Mapping between sources, without global schema? Approach called P2P.

GAV vs LAV example
Global schema:
  movie(Title, Year, Director)
  european(Director)
  review(Title, Critique)
Source 1: \(r_1(Title, Year, Director)\), since 1960, european directors
Source 2: \(r_2(Title, Critique)\), since 1990
Query: Title and critique of movies in 1998
\( \exists D.\ movie(T, 1998, D) \wedge review(T, R) \), written \( \{\, (T, R) \mid movie(T, 1998, D) \wedge review(T, R) \,\} \)

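Formalization of LAV
In LAV, the mapping M is constituted by a set of assertions:
  \( s \subseteq \varphi_{\mathcal{G}} \) (sound source): \( \forall \vec{x}\, (s(\vec{x}) \rightarrow \varphi_{\mathcal{G}}(\vec{x})) \)
  \( s \equiv \varphi_{\mathcal{G}} \) (exact source): \( \forall \vec{x}\, (s(\vec{x}) \leftrightarrow \varphi_{\mathcal{G}}(\vec{x})) \)
one for each source element s in \(\mathcal{A}_{\mathcal{S}}\), where \(\varphi_{\mathcal{G}}\) is a query over G.
Given source database C, a database B for G satisfies M wrt C if for each \(s \in \mathcal{S}\):
  \( s^{\mathcal{C}} \subseteq \varphi_{\mathcal{G}}^{\mathcal{B}} \) (sound source)
  \( s^{\mathcal{C}} = \varphi_{\mathcal{G}}^{\mathcal{B}} \) (exact source)
The mapping M and the source database C do not provide direct information about which data satisfy the global schema. Sources are views, and we have to answer queries on the basis of the available data in the views.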

LAV example
Global schema:
  movie(Title, Year, Director)
  european(Director)
  review(Title, Critique)
LAV: associated to source relations we have views over the global schema
  \( r_1(T, Y, D) \subseteq \{\, (T, Y, D) \mid movie(T, Y, D) \wedge european(D) \wedge Y \geq 1960 \,\} \)
  \( r_2(T, R) \subseteq \{\, (T, R) \mid movie(T, Y, D) \wedge review(T, R) \wedge Y \geq 1990 \,\} \)
The query \( \{\, (T, R) \mid movie(T, 1998, D) \wedge review(T, R) \,\} \) is processed by means of an inference mechanism that aims at re-expressing the atoms of the global schema in terms of atoms at the sources. In this case:
  \( \{\, (T, R) \mid r_2(T, R) \wedge r_1(T, 1998, D) \,\} \)

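Formalization of GAV
In GAV, the mapping M is constituted by a set of assertions:
  \( g \supseteq \varphi_{\mathcal{S}} \) (sound source): \( \forall \vec{x}\, (\varphi_{\mathcal{S}}(\vec{x}) \rightarrow g(\vec{x})) \)
  \( g \equiv \varphi_{\mathcal{S}} \) (exact source): \( \forall \vec{x}\, (\varphi_{\mathcal{S}}(\vec{x}) \leftrightarrow g(\vec{x})) \)
one for each element g in \(\mathcal{A}_{\mathcal{G}}\), where \(\varphi_{\mathcal{S}}\) is a query over S.
Given source database C, a database B for G satisfies M wrt C if for each \(g \in \mathcal{G}\):
  \( g^{\mathcal{B}} \supseteq \varphi_{\mathcal{S}}^{\mathcal{C}} \) (sound source)
  \( g^{\mathcal{B}} = \varphi_{\mathcal{S}}^{\mathcal{C}} \) (exact source)
Given a source database, M provides direct information about which data satisfy the elements of the global schema. Relations in G are views, and queries are expressed over the views. Thus, it seems that we can simply evaluate the query over the data satisfying the global relations (as if we had a single database at hand).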

GAV example
Global schema:
  movie(Title, Year, Director)
  european(Director)
  review(Title, Critique)
GAV: associated to relations in the global schema we have views over the sources
  \( movie(T, Y, D) \supseteq \{\, (T, Y, D) \mid r_1(T, Y, D) \,\} \)
  \( european(D) \supseteq \{\, (D) \mid r_1(T, Y, D) \,\} \)
  \( review(T, R) \supseteq \{\, (T, R) \mid r_2(T, R) \,\} \)

GAV example of query processing
The query \( \{\, (T, R) \mid movie(T, 1998, D) \wedge review(T, R) \,\} \) is processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations. In this case:
  \( movie(T, 1998, D) \wedge review(T, R) \) unfolds to \( r_1(T, 1998, D) \wedge r_2(T, R) \)
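The unfolding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: atoms are encoded as `(predicate, args)` pairs, the source relations are named `r1` and `r2`, and the helpers `fresh` and `unfold` are hypothetical names, not part of any system mentioned in the tutorial.

```python
from itertools import count

_counter = count()


def fresh():
    """Return a fresh variable name for positions the global atom does not bind."""
    return f"_V{next(_counter)}"


# GAV mapping: each global relation is defined by a conjunction of source atoms.
GAV_MAPPING = {
    "movie": lambda t, y, d: [("r1", (t, y, d))],
    "european": lambda d: [("r1", (fresh(), fresh(), d))],
    "review": lambda t, r: [("r2", (t, r))],
}


def unfold(query_atoms, mapping):
    """Replace every global atom by the source atoms that define it."""
    unfolded = []
    for predicate, args in query_atoms:
        unfolded.extend(mapping[predicate](*args))
    return unfolded


# { (T, R) | movie(T, 1998, D) AND review(T, R) }
query = [("movie", ("T", 1998, "D")), ("review", ("T", "R"))]
unfolded_query = unfold(query, GAV_MAPPING)
```

The result is the conjunction of `r1(T, 1998, D)` and `r2(T, R)`, matching the unfolding shown on the slide.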

GAV and LAV comparison
LAV: (Information Manifold, DWQ, Picsel)
- Quality depends on how well we have characterized the sources
- High modularity and extensibility (if the global schema is well designed, when a source changes, only its definition is affected)
- Query processing needs reasoning (query reformulation complex)
GAV: (Carnot, SIMS, Tsimmis, IBIS, Picsel, ...)
- Quality depends on how well we have compiled the sources into the global schema through the mapping
- Whenever a source changes or a new one is added, the global schema needs to be reconsidered
- Query processing can be based on some sort of unfolding (query reformulation looks easier)
For more details, see [Ullman, TCS'00], [Halevy, VLDBJ'01].

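Beyond GAV and LAV: GLAV
In GLAV, the mapping M is constituted by a set of assertions:
  \( \varphi_{\mathcal{S}} \subseteq \varphi_{\mathcal{G}} \) (sound source): \( \forall \vec{x}\, (\varphi_{\mathcal{S}}(\vec{x}) \rightarrow \varphi_{\mathcal{G}}(\vec{x})) \)
  \( \varphi_{\mathcal{S}} \equiv \varphi_{\mathcal{G}} \) (exact source): \( \forall \vec{x}\, (\varphi_{\mathcal{S}}(\vec{x}) \leftrightarrow \varphi_{\mathcal{G}}(\vec{x})) \)
where \(\varphi_{\mathcal{S}}\) is a query over S, and \(\varphi_{\mathcal{G}}\) is a query over G.
Given source database C, a database B that is legal wrt G satisfies M wrt C if for each assertion in M:
  \( \varphi_{\mathcal{S}}^{\mathcal{C}} \subseteq \varphi_{\mathcal{G}}^{\mathcal{B}} \) (sound source)
  \( \varphi_{\mathcal{S}}^{\mathcal{C}} = \varphi_{\mathcal{G}}^{\mathcal{B}} \) (exact source)
The mapping M does not provide direct information about which data satisfy the global schema: to answer a query q over G, we have to infer how to use M in order to access the source database C.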

Example of GLAV
Global schema: Work(Person, Project), Area(Project, Field)
Source 1: HasJob(Person, Field)
Source 2: Teach(Professor, Course), In(Course, Field)
Source 3: Get(Researcher, Grant), For(Grant, Project)
GLAV mapping:
  \( \{\, (r, f) \mid HasJob(r, f) \,\} \subseteq \{\, (r, f) \mid Work(r, p) \wedge Area(p, f) \,\} \)
  \( \{\, (r, f) \mid Teach(r, c) \wedge In(c, f) \,\} \subseteq \{\, (r, f) \mid Work(r, p) \wedge Area(p, f) \,\} \)
  \( \{\, (r, p) \mid Get(r, g) \wedge For(g, p) \,\} \subseteq \{\, (r, p) \mid Work(r, p) \,\} \)

Beyond GLAV: P2P data integration
In P2P, the global schema does not exist. Constraints (that we can still call G) are defined over \( \mathcal{A}_{\mathcal{G}} = \mathcal{A}_{S_1} \cup \dots \cup \mathcal{A}_{S_n} \), and the mapping M is constituted by a set of assertions (\(\varphi_1^{S_i}\), \(\varphi_2^{S_j}\) are queries over the alphabets \(\mathcal{A}_{S_i}\) and \(\mathcal{A}_{S_j}\), respectively):
  \( \varphi_1^{S_i} \subseteq \varphi_2^{S_j} \)
\(\mathcal{A}_{\mathcal{S}}\) is a distinguished subset of predicates in \(\mathcal{A}_{\mathcal{G}}\), called base predicates (where data are). A source database is a database for the base predicates.
Given source database C, a database W that satisfies I relative to C is a database for S such that, for each assertion \( \varphi_1 \subseteq \varphi_2 \) in M, \( \varphi_1^{W} \subseteq \varphi_2^{W} \).
Queries are now expressed over alphabet \(\mathcal{A}_{S_i}\), and the notion of certain answers is the usual one.

A unified view
- Alphabet: \( \mathcal{A} = \mathcal{A}_{\mathcal{G}} \cup \mathcal{A}_{\mathcal{S}} \)
- Integrity constraints: constraints G, and mapping M
- Partial database: source database
- Database: data for all symbols in \(\mathcal{A}\) that are both coherent with the partial database and satisfy the integrity constraints
- Query answering: computing the tuples that satisfy the query in every database
Under this view, the difference between LAV, GAV, GLAV, P2P is reflected in the kinds of integrity constraints that are expressible.

Query answering with incomplete information
- [Reiter 84]: relational setting, databases with incomplete information modeled as a first-order theory
- [Vardi 86]: relational setting, complexity of reasoning in closed world databases with unknown values
- Several approaches both from the DB and the KR community
- [van der Meyden 98]: survey on logical approaches to incomplete information

Connection to query containment
- Query containment (under constraints T) is the problem of checking whether \(q_1^{\mathcal{B}}\) is contained in \(q_2^{\mathcal{B}}\) for every database B (satisfying T), where \(q_1, q_2\) are queries with the same arity.
- A source database C can be represented as a conjunction \(q_{\mathcal{C}}\) of ground literals over \(\mathcal{A}_{\mathcal{S}}\) (e.g., if \(\vec{x}\) is in \(s^{\mathcal{C}}\), then the corresponding literal is \(s(\vec{x})\)).
- If q is a query, and \(\vec{t}\) is a tuple, then we denote by \(q_{\vec{t}}\) the query obtained by substituting the free variables of q with \(\vec{t}\).
- The problem of checking whether \( \vec{t} \in q^{\mathcal{I},\mathcal{C}} \) can be reduced to the problem of checking whether \(q_{\mathcal{C}}\) is contained in \(q_{\vec{t}}\) under the constraints \( \mathcal{G} \cup \mathcal{M} \).
The combined complexity of checking certain answers is identical to the complexity of query containment under constraints, and the data complexity is at most the complexity of query containment under constraints.


Dealing with incompleteness and inconsistency
We analyze the problem of query answering in different cases, depending on two parameters:
- Global schema: without constraints, or with constraints
- Mapping: GAV or LAV, sound or complete
Given a source database C, we call retrieved global database any database for G that satisfies the mapping wrt C.

Incompleteness and inconsistency

Constraints in G   Type of mapping   Incompleteness   Inconsistency
no                 GAV/exact         no               no
no                 GAV/sound         yes/no           no
no                 LAV/sound         yes              no
no                 LAV/exact         yes              yes
yes                GAV/exact         no               yes
yes                GAV/sound         yes              yes
yes                LAV/sound         yes              yes
yes                LAV/exact         yes              yes


INT[noconst, GAV/exact]: example
Consider \( \mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle \), with
Global schema G:
  student(Scode, Sname, Scity)
  university(Ucode, Uname)
  enrolled(Scode, Ucode)
Source schema S: database relations \(s_1, s_2, s_3\)
Mapping M:
  \( student(X, Y, Z) \equiv \{\, (X, Y, Z) \mid s_1(X, Y, Z, W) \,\} \)
  \( university(X, Y) \equiv \{\, (X, Y) \mid s_2(X, Y) \,\} \)
  \( enrolled(X, W) \equiv \{\, (X, W) \mid s_3(X, W) \,\} \)

INT[noconst, GAV/exact]: example
Example of source database and corresponding retrieved global database.

Retrieved global database:
  University(code, name): (AF, bocconi), (BN, ucla)
  Student(code, name, city): (15, bill, oslo), (12, anne, florence)
  Enrolled(Scode, Ucode): (12, AF), (16, BN)

Source database C:
  \(s_1^{\mathcal{C}}\): (12, anne, florence, 21), (15, bill, oslo, 24)
  \(s_2^{\mathcal{C}}\): (AF, bocconi), (BN, ucla)
  \(s_3^{\mathcal{C}}\): (12, AF), (16, BN)

INT[noconst, GAV/exact]
[Figure: the sources (source model) are connected through the mapping to the global schema; the retrieved global database is the model of I.]

INT[noconst, GAV/exact]: query answering
- Use M for computing from C the retrieved global database, where each element g of G satisfies exactly the tuples of C satisfying the \(\varphi_{\mathcal{S}}\) that M associates to g.
- Since G does not have constraints, the retrieved global database is legal wrt G.
- Actually, it is the only database that is legal wrt G, and that satisfies M wrt C.
- Thus, we can simply evaluate the query q over the retrieved global database, which is equivalent to unfolding the query according to M, in order to obtain a query on \(\mathcal{A}_{\mathcal{S}}\) to be evaluated over C.
Answering queries to I means answering queries to a single database.

INT[noconst, GAV/exact]: example of query answering
Mapping M:
  \( student(X, Y, Z) \equiv \{\, (X, Y, Z) \mid s_1(X, Y, Z, W) \,\} \)
  \( university(X, Y) \equiv \{\, (X, Y) \mid s_2(X, Y) \,\} \)
  \( enrolled(X, W) \equiv \{\, (X, W) \mid s_3(X, W) \,\} \)
Source database C:
  \(s_1^{\mathcal{C}}\): (12, anne, florence, 21), (15, bill, oslo, 24)
  \(s_2^{\mathcal{C}}\): (AF, bocconi), (BN, ucla)
  \(s_3^{\mathcal{C}}\): (12, AF), (16, BN)
Query: \( \{\, (X) \mid student(X, Y, Z),\ enrolled(X, W) \,\} \)
Unfolding wrt M: \( \{\, (X) \mid s_1(X, Y, Z, V),\ s_3(X, W) \,\} \) retrieves the answer {12} from C.
A simple unfolding strategy is sufficient in this context.
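The example can be replayed in Python as a small sanity check. The sketch below uses the source tuples of the slide, encodes relations as sets of tuples (an assumption of this illustration), and compares the two routes: evaluating the query on the retrieved global database, and evaluating the unfolded query directly on the sources.

```python
# Source database C, as given in the example.
s1 = {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)}
s2 = {("AF", "bocconi"), ("BN", "ucla")}
s3 = {(12, "AF"), (16, "BN")}

# Retrieved global database: each global relation holds exactly the tuples
# selected by the view that the exact GAV mapping associates with it.
student = {(x, y, z) for (x, y, z, _w) in s1}
university = set(s2)
enrolled = set(s3)

# Query { (X) | student(X, Y, Z), enrolled(X, W) } on the retrieved global database.
answers_global = {
    x for (x, _y, _z) in student for (x2, _w) in enrolled if x == x2
}

# Unfolded query { (X) | s1(X, Y, Z, V), s3(X, W) } evaluated on the sources.
answers_unfolded = {
    x for (x, _y, _z, _v) in s1 for (x2, _w) in s3 if x == x2
}
```

Both routes return the single answer 12, as stated on the slide.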


INT[noconst, GAV/sound]: example
Example of source database and corresponding retrieved global database.

Retrieved global database:
  University(code, name): (AF, bocconi), (UR, uniroma), (BN, ucla)
  Student(code, name, city): (15, bill, oslo), (12, anne, florence)
  Enrolled(Scode, Ucode): (12, AF), (16, BN)

Source database C:
  \(s_1^{\mathcal{C}}\): (12, anne, florence, 21), (15, bill, oslo, 24)
  \(s_2^{\mathcal{C}}\): (AF, bocconi), (BN, ucla)
  \(s_3^{\mathcal{C}}\): (12, AF), (16, BN)

INT[noconst, GAV/sound]
The GAV mapping assertions have the logical form:
  \( \forall \vec{x}\ \varphi_{\mathcal{S}}(\vec{x}) \rightarrow g(\vec{x}) \)
The intersection of all retrieved global databases (which can be computed by letting each element g of G satisfy exactly the tuples of C satisfying the \(\varphi_{\mathcal{S}}\) that M associates to g) still satisfies M wrt C, and therefore is the only minimal model of I.
Incompleteness is of special form. For queries without negation, unfolding is sufficient.

INT[noconst, GAV/sound]
[Figure: the sources (source model) are connected through the mapping to the global schema; the minimal model of I is the intersection of the retrieved global databases.]


INT[noconst, LAV/sound]: incompleteness
The LAV mapping assertions have the logical form:
  \( \forall \vec{x}\ s(\vec{x}) \rightarrow \varphi_{\mathcal{G}}(\vec{x}) \)
In general, given a source database C there are several solutions of the above assertions (i.e., different databases that are legal wrt G and satisfy M wrt C). Incompleteness comes from the mapping.
This holds even for the case of simple queries \(\varphi_{\mathcal{G}}\):
  \( s_1(x) \subseteq \{\, (x) \mid \exists y\ g(x, y) \,\} \)
  \( s_2(x) \subseteq \{\, (x) \mid g_1(x) \vee g_2(x) \,\} \)

INT[noconst, LAV/sound]
[Figure: the sources (source model) are connected through the mapping to the global schema; the retrieved global databases are the models of I.]

INT[noconst, LAV/sound]: dealing with incompleteness
View-based query processing: answer a query based on a set of materialized views, rather than on the raw data in the database.
Relevant problem in
- Data warehousing
- Query optimization
- Providing physical independence

INT[noconst, LAV/sound]: dealing with incompleteness
In LAV/sound data integration, the views are the sources. Two approaches to view-based query processing:
- View-based query rewriting: query processing is divided in two steps
  1. re-express the query in terms of a given query language over the alphabet of \(\mathcal{A}_{\mathcal{S}}\)
  2. evaluate the rewriting over the source database C
- View-based query answering: no limitation is posed on how queries are processed, and the only goal is to exploit all possible information, in particular the source database, to compute the certain answers to the query

INT[noconst, LAV/sound]: connection to query containment
- If queries in M are conjunctive queries, then we can substitute the query that M associates to s for every s-literal in \(q_{\mathcal{C}}\), and therefore checking certain answers can be reduced to checking pure containment (without constraints) of two queries in the alphabet \(\mathcal{A}_{\mathcal{G}}\).
- The data complexity is at most the complexity of query containment.
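For conjunctive queries, containment without constraints can be decided with the classical homomorphism test: q1 is contained in q2 exactly when the body of q2 can be mapped into the frozen body of q1 while sending the head of q2 to the head of q1. The Python sketch below is a brute-force illustration of that standard test, not the procedure used in the tutorial. The encoding (a query is a pair `(head, body)`, variables are capitalised strings) and the function names are assumptions of this example.

```python
from itertools import product


def is_var(term):
    """Variables are strings starting with an upper-case letter (a convention of this sketch)."""
    return isinstance(term, str) and term[:1].isupper()


def contained_in(q1, q2):
    """Return True if conjunctive query q1 is contained in conjunctive query q2."""
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    # Freeze q1: its variables act as constants of a canonical database.
    facts = set(body1)
    terms = sorted({t for _p, args in body1 for t in args} | set(head1), key=repr)
    vars2 = sorted(
        {t for _p, args in body2 for t in args if is_var(t)}
        | {t for t in head2 if is_var(t)}
    )
    # Try every assignment of q2's variables to terms of q1 (exponential, fine for tiny queries).
    for image in product(terms, repeat=len(vars2)):
        h = dict(zip(vars2, image))

        def apply(term):
            return h.get(term, term)

        if tuple(apply(t) for t in head2) != tuple(head1):
            continue
        if all((p, tuple(apply(t) for t in args)) in facts for p, args in body2):
            return True
    return False


# q1 = { (T, R) | movie(T, 1998, D), review(T, R) }   q2 = { (T, R) | review(T, R) }
q1 = (("T", "R"), [("movie", ("T", 1998, "D")), ("review", ("T", "R"))])
q2 = (("T", "R"), [("review", ("T", "R"))])
```

Here `contained_in(q1, q2)` holds, while `contained_in(q2, q1)` does not, since nothing in the body of q2 provides a `movie` fact.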

INT[noconst, LAV/sound]: some results for query answering
- Conjunctive queries using conjunctive views [Levy&al. PODS'95]
- Recursive queries (datalog programs) using conjunctive views [Duschka&Genesereth PODS'97], [Afrati&al. ICDT'99]
- Complexity analysis [Abiteboul&Duschka PODS'98], [Grahne&Mendelzon ICDT'99]
- Variants of Regular Path Queries [Calvanese&al. ICDE'00, PODS'00], [Deutsch&Tannen DBPL'01], [Calvanese&al. DBPL'01]

INT[noconst, LAV/sound]: data complexity
From [Abiteboul&Duschka PODS'98]:

Sound sources   CQ      CQ≠      PQ      datalog   FOL
CQ              PTIME   coNP     PTIME   PTIME     undec.
CQ≠             PTIME   coNP     PTIME   PTIME     undec.
PQ              coNP    coNP     coNP    coNP      undec.
datalog         coNP    undec.   coNP    undec.    undec.
FOL             undec.  undec.   undec.  undec.    undec.

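INT[noconst, LAV/sound]: basic technique
Consider conjunctive queries and conjunctive views.
  \( r_1(T) \subseteq \{\, (T) \mid movie(T, Y, D) \wedge european(D) \,\} \)
  \( r_2(T, V) \subseteq \{\, (T, V) \mid movie(T, Y, D) \wedge review(T, V) \,\} \)
In logical form:
  \( \forall T\ r_1(T) \rightarrow \exists Y\, \exists D\ movie(T, Y, D) \wedge european(D) \)
  \( \forall T\, \forall V\ r_2(T, V) \rightarrow \exists Y\, \exists D\ movie(T, Y, D) \wedge review(T, V) \)
Introducing Skolem functions yields the rules:
  \( movie(T, f_1(T), f_2(T)) \leftarrow r_1(T) \)
  \( european(f_2(T)) \leftarrow r_1(T) \)
  \( movie(T, f_4(T, V), f_5(T, V)) \leftarrow r_2(T, V) \)
  \( review(T, V) \leftarrow r_2(T, V) \)
Answering a query means evaluating a goal wrt this nonrecursive logic program (PTIME data complexity).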
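A minimal Python sketch of this technique follows. It applies the four rules to a small, hypothetical source database, evaluates a conjunctive query over the derived facts, and keeps only the answers free of Skolem terms. The movie titles, the `Skolem` marker class, and the helper names are assumptions made for illustration.

```python
class Skolem(tuple):
    """Marker type for Skolem terms such as f1(T); behaves like a hashable tuple."""


def sk(name, *args):
    return Skolem((name,) + args)


def apply_inverse_rules(r1, r2):
    """Derive global facts from the source relations using the four rules above."""
    movie, european, review = set(), set(), set()
    for (t,) in r1:
        movie.add((t, sk("f1", t), sk("f2", t)))
        european.add((sk("f2", t),))
    for (t, v) in r2:
        movie.add((t, sk("f4", t, v), sk("f5", t, v)))
        review.add((t, v))
    return movie, european, review


def has_skolem(answer):
    return any(isinstance(term, Skolem) for term in answer)


# Hypothetical source database.
r1 = {("amelie",)}
r2 = {("amelie", "good"), ("matrix", "great")}
movie, european, review = apply_inverse_rules(r1, r2)

# Query { (T, V) | movie(T, Y, D), european(D), review(T, V) }.
answers = {
    (t, v)
    for (t, _y, d) in movie
    for (d2,) in european
    if d == d2
    for (t2, v) in review
    if t == t2
}
certain = {a for a in answers if not has_skolem(a)}
```

In this run only `("amelie", "good")` is returned: the director of `matrix` is an unknown Skolem term that is never asserted to be european.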

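INT[noconst, LAV/sound]: polynomial intractability
Given a graph G = (N, E), we define \( \mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle \), and source database C:
  \( V_b \subseteq R_b \)
  \( V_f \subseteq R_f \)
  \( V_t \subseteq R_{rg} \cup R_{gr} \cup R_{rb} \cup R_{br} \cup R_{gb} \cup R_{bg} \)
  \( V_b^{\mathcal{C}} = \{\, (c, a) \mid a \in N,\ c \notin N \,\} \)
  \( V_f^{\mathcal{C}} = \{\, (a, d) \mid a \in N,\ d \notin N \,\} \)
  \( V_t^{\mathcal{C}} = \{\, (a, b), (b, a) \mid (a, b) \in E \,\} \)
  \( Q \leftarrow R_b \cdot M \cdot R_f \)
where M describes all mismatched edge pairs (e.g., \( R_{rg} \cdot R_{rb} \)).
- If G is 3-colorable, then \(\exists \mathcal{B}\) where M (and Q) is empty, i.e. \( (c, d) \notin Q^{\mathcal{I},\mathcal{C}} \).
- If G is not 3-colorable, then M is nonempty \(\forall \mathcal{B}\), i.e. \( (c, d) \in Q^{\mathcal{I},\mathcal{C}} \).
⟹ coNP-hard data complexity for positive queries and positive views.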

INT[noconst, LAV/sound]: in coNP
Consider the case of Datalog queries and positive views.
- \(\vec{t}\) is not a certain answer to Q wrt I and C if and only if there is a database B for I such that \( \vec{t} \notin Q^{\mathcal{B}} \), and B satisfies M wrt C.
- Because of the form of M,
  \( \forall \vec{x}\, ( s(\vec{x}) \rightarrow \exists \vec{y}_1\, \alpha_1(\vec{x}, \vec{y}_1) \vee \dots \vee \exists \vec{y}_h\, \alpha_h(\vec{x}, \vec{y}_h) ) \)
  each tuple in C forces the existence of k tuples in any database that satisfies M wrt C, where k is the maximal length of conjuncts in M.
- If C has n tuples, then there is a database \( \mathcal{B}' \subseteq \mathcal{B} \) for I that satisfies M wrt C with at most \( n \cdot k \) tuples.
- Since Q is monotone, \( \vec{t} \notin Q^{\mathcal{B}'} \).
- Checking whether \(\mathcal{B}'\) satisfies M wrt C can be done in PTIME wrt the size of \(\mathcal{B}'\).
⟹ coNP data complexity for Datalog queries and positive views.

INT[noconst, LAV/sound]: the case of RPQ
We deal with the problem of answering queries to data integration systems of the form \( \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle \), where
- G simply fixes the labels (alphabet \(\Sigma\)) of a semi-structured database
- the sources in S are relational
- the mapping M is of type LAV
- queries are typical of semi-structured data (variants of regular path queries)

Global semi-structured databases and queries
[Figure: a global semi-structured database drawn as a graph whose edges are labelled sub, calls and var, with two marked nodes a and b.]
Regular Path Query (RPQ): \( (sub)^{*} \cdot (sub \cdot (calls \cup sub))^{*} \cdot var \)
2RPQ: \( (sub^{-})^{*} \cdot (var \cup sub) \)

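INT[noconst, LAV/sound]: the case of RPQ
Given
- \( \mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle \), where
  - G simply fixes the labels (alphabet \(\Sigma\)) of a semi-structured database
  - the sources in S are binary relations
  - the mapping M is of type LAV, and associates to each source s a 2RPQ w over \(\Sigma\): \( \forall x, y\ \ s(x, y) \rightarrow x \xrightarrow{w} y \)
- a source database C
- a 2RPQ Q over \(\Sigma\)
- a pair of objects \(\vec{t}\)
we want to determine whether \( \vec{t} \in Q^{\mathcal{I},\mathcal{C}} \).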

Query answering: Technique
- We search for a counterexample to \( \vec{t} \in Q^{\mathcal{I},\mathcal{C}} \), i.e., a database B legal for I wrt C such that \( \vec{t} \notin Q^{\mathcal{B}} \).
- Crucial point: it is sufficient to restrict our attention to canonical databases, i.e., databases B that can be represented by a word \(w_{\mathcal{B}}\)
  \( \$\, d_1\, w_1\, d_2\, \$\, d_3\, w_2\, d_4\, \$ \cdots \$\, d_{2m-1}\, w_m\, d_{2m}\, \$ \)
  where \(d_1, \dots, d_{2m}\) are constants in C, \( w_i \in \Sigma^{+} \), and \$ acts as a separator.
- Use word-automata theoretic techniques!

We need techniques for...
checking whether a pair of objects satisfies a 2RPQ query in the case of
- a word representing a path
- a word representing a semipath
- a word representing a canonical database

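Finite-state automata and RPQs
Path: \( a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d \)
Query: \( Q = r \cdot (p \cup q) \cdot q \cdot q^{*} \)
Automaton for Q:
  \( s_1 \in \delta(s_0, r) \), \( s_2 \in \delta(s_1, p) \), \( s_2 \in \delta(s_1, q) \), \( s_3 \in \delta(s_2, q) \), \( s_3 \in \delta(s_3, q) \)
The computation for RPQs is captured by finite-state automata.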
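Evaluating an RPQ on the word spelled by a path amounts to simulating a nondeterministic finite automaton. The Python sketch below does this for the five transitions listed on the slide. It assumes that the edge labels are r, p and q, that s_0 is the initial state and that s_3 is the only accepting state; the function name `nfa_accepts` is hypothetical.

```python
def nfa_accepts(delta, start, finals, word):
    """Simulate a one-way NFA by tracking the set of states reachable so far."""
    current = {start}
    for symbol in word:
        current = {t for s in current for t in delta.get((s, symbol), ())}
    return bool(current & finals)


# Transitions of the automaton for Q (labels r, p, q are assumed).
delta = {
    ("s0", "r"): {"s1"},
    ("s1", "p"): {"s2"},
    ("s1", "q"): {"s2"},
    ("s2", "q"): {"s3"},
    ("s3", "q"): {"s3"},
}
finals = {"s3"}

path_word = "rpq"  # labels along the path from a to d
accepted = nfa_accepts(delta, "s0", finals, path_word)
```

The word `rpq` is accepted, so under these assumptions the pair (a, d) satisfies Q.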

2way Regular Path Queries
2way Regular Path Queries (2RPQ) are expressed by means of finite-state automata over \( \Sigma \cup \{\, p^{-} \mid p \in \Sigma \,\} \).
[Figure: an example 2RPQ and the finite-state automaton that expresses it.]

Finite-state automata and 2RPQs
Path: \( a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d \)
Word: r p q
Query: \( Q = r \cdot (p \cup q) \cdot p^{-} \cdot p \cdot q \cdot q^{*} \)
Automaton for Q:
  \( s_1 \in \delta(s_0, r) \), \( s_2 \in \delta(s_1, p) \), \( s_2 \in \delta(s_1, q) \), \( s_3 \in \delta(s_2, p^{-}) \), \( s_4 \in \delta(s_3, p) \), \( s_5 \in \delta(s_4, q) \), \( s_5 \in \delta(s_5, q) \)
Run on the word:
- State \(s_0\). Transition: \( s_1 \in \delta(s_0, r) \)
- State \(s_1\). Transition: \( s_2 \in \delta(s_1, p) \)
- State \(s_2\). Transition: none
(a, d) satisfies query Q, but the path from a to d is not accepted by the 1NFA corresponding to Q: the computation for 2RPQs is not captured by finite-state automata.

2way automata (2NFA)
A 2way automaton \( A = (\Gamma, S, S_0, \rho, F) \) consists of an alphabet \(\Gamma\), a finite set of states S, a set of initial states \( S_0 \subseteq S \), a transition function \( \rho : S \times \Gamma \rightarrow 2^{S \times \{-1, 0, 1\}} \) and a set of accepting states \( F \subseteq S \).
Given a 2way automaton A with n states, one can construct a one-way automaton \(B_1\) with \( O(2^{n \log n}) \) states such that \( L(B_1) = L(A) \), and a one-way automaton \(B_2\) with \( O(2^{n}) \) states such that \( L(B_2) = \Gamma^{*} - L(A) \).

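2way automata and 2RPQs
Given a 2RPQ \( E = (\Sigma, S, I, \delta, F) \) over the alphabet \(\Sigma\), the corresponding 2way automaton \(A_E\) is:
  \( (\Sigma_A = \Sigma \cup \{\$\},\ S_A = S \cup \{s_f\} \cup \{\, s^{\leftarrow} \mid s \in S \,\},\ I,\ \delta_A,\ \{s_f\}) \)
where \(\delta_A\) is defined as follows:
- \( (s_2, 1) \in \delta_A(s_1, r) \), for each transition \( s_2 \in \delta(s_1, r) \) of E
- enter backward mode: \( (s^{\leftarrow}, -1) \in \delta_A(s, \ell) \), for each \( s \in S \) and \( \ell \in \Sigma_A \)
- exit backward mode: \( (s_2, 0) \in \delta_A(s_1^{\leftarrow}, r) \), for each \( s_2 \in \delta(s_1, r^{-}) \) of E
- \( (s_f, 1) \in \delta_A(s, \$) \), for each \( s \in F \)
⟹ w satisfies E iff \( w\$ \in L(A_E) \).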

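2way automata and 2RPQs
Path: \( a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d \)
Query: \( Q = r \cdot (p \cup q) \cdot p^{-} \cdot p \cdot q \cdot q^{*} \)
Automaton for Q:
  \( s_1 \in \delta(s_0, r) \), \( s_2 \in \delta(s_1, p) \), \( s_2 \in \delta(s_1, q) \), \( s_3 \in \delta(s_2, p^{-}) \), \( s_4 \in \delta(s_3, p) \), \( s_5 \in \delta(s_4, q) \), \( s_5 \in \delta(s_5, q) \)
2way automaton:
  \( (s_1, 1) \in \delta_A(s_0, r) \), \( (s_2, 1) \in \delta_A(s_1, p) \), \( (s_2, 1) \in \delta_A(s_1, q) \), \( (s_2^{\leftarrow}, -1) \in \delta_A(s_2, q) \), \( (s_3, 0) \in \delta_A(s_2^{\leftarrow}, p) \), \( (s_4, 1) \in \delta_A(s_3, p) \), \( (s_5, 1) \in \delta_A(s_4, q) \), \( (s_f, 1) \in \delta_A(s_5, \$) \)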

2NFA and 2RPQs
Path: \( a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d \)
Word: r p q \$
Query: \( Q = r \cdot (p \cup q) \cdot p^{-} \cdot p \cdot q \cdot q^{*} \)
Automaton for Q:
  \( (s_1, 1) \in \delta_A(s_0, r) \), \( (s_2, 1) \in \delta_A(s_1, p) \), \( (s_2^{\leftarrow}, -1) \in \delta_A(s_2, q) \), \( (s_3, 0) \in \delta_A(s_2^{\leftarrow}, p) \), \( (s_4, 1) \in \delta_A(s_3, p) \), \( (s_5, 1) \in \delta_A(s_4, q) \), \( (s_f, 1) \in \delta_A(s_5, \$) \)
Run on the word:
- State \(s_0\). Transition: \( (s_1, 1) \in \delta_A(s_0, r) \)
- State \(s_1\). Transition: \( (s_2, 1) \in \delta_A(s_1, p) \)
- State \(s_2\). Transition: \( (s_2^{\leftarrow}, -1) \in \delta_A(s_2, q) \)
- State \(s_2^{\leftarrow}\). Transition: \( (s_3, 0) \in \delta_A(s_2^{\leftarrow}, p) \)
- State \(s_3\). Transition: \( (s_4, 1) \in \delta_A(s_3, p) \)
- State \(s_4\). Transition: \( (s_5, 1) \in \delta_A(s_4, q) \)
- State \(s_5\). Transition: \( (s_f, 1) \in \delta_A(s_5, \$) \)
- State \(s_f\).
(a, d) satisfies query Q, and the path from a to d is accepted by the 2NFA corresponding to Q: the computation for 2RPQs is captured by 2way automata.
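The construction of the 2way automaton and its run can be sketched in Python as follows. The sketch makes several assumptions: the edge labels are r, p and q, inverse labels are written with a trailing "-", backward-mode states are encoded as pairs `(s, "back")`, and a word is accepted when the final state `sf` is reached with the head just past the end marker. The function names are hypothetical.

```python
from collections import deque


def build_2nfa(delta, finals, alphabet):
    """Build the transition relation of the 2way automaton from a 2RPQ automaton.

    `delta` maps (state, label) to a set of states; an inverse label is written "p-".
    """
    states = {s for (s, _l) in delta} | {t for ts in delta.values() for t in ts}
    sigma_a = set(alphabet) | {"$"}
    trans = {}

    def add(key, value):
        trans.setdefault(key, set()).add(value)

    for (s1, label), targets in delta.items():
        for s2 in targets:
            if label.endswith("-"):
                # Exit backward mode after re-reading the edge in reverse.
                add(((s1, "back"), label[:-1]), (s2, 0))
            else:
                add((s1, label), (s2, 1))
    for s in states:
        for symbol in sigma_a:
            add((s, symbol), ((s, "back"), -1))  # enter backward mode
    for s in finals:
        add((s, "$"), ("sf", 1))
    return trans


def accepts_2way(trans, start, word):
    """Breadth-first search over configurations (state, head position)."""
    tape = list(word) + ["$"]
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        state, pos = queue.popleft()
        if state == "sf" and pos == len(tape):
            return True
        if not 0 <= pos < len(tape):
            continue  # the head fell off the tape
        for nxt, move in trans.get((state, tape[pos]), ()):
            config = (nxt, pos + move)
            if config not in seen:
                seen.add(config)
                queue.append(config)
    return False


# Automaton for Q = r (p | q) p- p q q*  (labels assumed).
delta = {
    ("s0", "r"): {"s1"},
    ("s1", "p"): {"s2"},
    ("s1", "q"): {"s2"},
    ("s2", "p-"): {"s3"},
    ("s3", "p"): {"s4"},
    ("s4", "q"): {"s5"},
    ("s5", "q"): {"s5"},
}
two_way = build_2nfa(delta, {"s5"}, "rpq")
accepted = accepts_2way(two_way, "s0", "rpq")
```

On the word `rpq` the search follows the same sequence of states as the run above and accepts, whereas a word such as `rq` is rejected because no p edge is available to traverse backward.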

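2NFA and view extensions
[Figure: a global schema G, the sources defined as 2RPQ views over it together with their extensions, which contain the pairs \((d_1, d_2)\), \((d_4, d_5)\), \((d_4, d_2)\), \((d_3, d_3)\), \((d_2, d_3)\), and a database for G drawn as a graph over the nodes \(d_1, \dots, d_5\).]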

2NFA and view extensions
[Figure: the database B for G, drawn as a graph over the nodes \(d_1, \dots, d_5\).]
Database B as a word: $ d_4 d_5 $ d_1 d_2 $ d_4 d_2 $ d_3 d_3 $ d_2 q d_3 $
Q = ( q) ( ) q q
To verify that \((d_1, d_3)\) satisfies Q in the above database B, we build \(A_{(Q, d_1, d_3)}\), by exploiting not only the ability of 2way automata to move on the word both forward and backward, but also the ability to jump from one position in the word representing a node to any other position (either preceding or succeeding) representing the same node.

A run of A_(Q,d1,d3)
Word: $ d4 d5 $ d1 d2 $ d4 d2 $ d3 d3 $ d2 q d3 $
Q = ( q) ( ) q q
Steps 1-6. State: s0. Transition: (s0, 1) ∈ δ_A(s0, l), for each l ∈ Σ_A
Step 7. State: s0. Transition: (s1, 0) ∈ δ_A(s0, d1), s1 initial state for Q
Step 8. State: s1. Transition: (s1, 1) ∈ δ_A(s1, d1)
Step 9. State: s1. Transition: (s2, 1) ∈ δ_A(s1, ), transition coming from Q
Step 10. State: s2. Transition: ((s2, d2), 1) ∈ δ_A(s2, d2), search for d2
Step 11. State: (s2, d2). Transition: ((s2, d2), 1) ∈ δ_A((s2, d2), $), search for d2
Step 12. State: (s2, d2). Transition: ((s2, d2), 1) ∈ δ_A((s2, d2), d4), search for d2
Step 13. State: (s2, d2). Transition: ((s2, d2), 1) ∈ δ_A((s2, d2), ), search for d2
Step 14. State: (s2, d2). Transition: (s2, 0) ∈ δ_A((s2, d2), d2), exit search mode
Step 15. State: s2. Transition: (s2, 1) ∈ δ_A(s2, d2), backward mode
Step 16. State: s2. Transition: (s3, 0) ∈ δ_A(s2, ), transition coming from Q
Step 17. State: s3. Transition: (s4, 1) ∈ δ_A(s3, ), transition coming from Q
Step 18. State: s4. Transition: ((s4, d2), 1) ∈ δ_A(s4, d2), search for d2
Step 19. State: (s4, d2). Transition: ((s4, d2), 1) ∈ δ_A((s4, d2), $), search for d2
Step 20. State: (s4, d2). Transition: ((s4, d2), 1) ∈ δ_A((s4, d2), d3), search for d2
Step 21. State: (s4, d2). Transition: ((s4, d2), 1) ∈ δ_A((s4, d2), ), search for d2
Step 22. State: (s4, d2). Transition: ((s4, d2), 1) ∈ δ_A((s4, d2), ), search for d2
Step 23. State: (s4, d2). Transition: ((s4, d2), 1) ∈ δ_A((s4, d2), d3), search for d2
Step 24. State: (s4, d2). Transition: ((s4, d2), 1) ∈ δ_A((s4, d2), $), search for d2
Step 25. State: (s4, d2). Transition: (s4, 0) ∈ δ_A((s4, d2), d2), exit search mode
Step 26. State: s4. Transition: (s4, 1) ∈ δ_A(s4, d2)
Step 27. State: s4. Transition: (s5, 1) ∈ δ_A(s4, ), transition coming from Q
Step 28. State: s5. Transition: (s6, 1) ∈ δ_A(s5, q), transition coming from Q
Step 29. State: s6. Transition: (s7, 0) ∈ δ_A(s6, d3), s7 final state
Step 30. State: s7. Transition: (s7, 1) ∈ δ_A(s7, d3), s7 final state
Step 31. State: s7. Transition: (s7, 1) ∈ δ_A(s7, $), s7 final state
End. State: s7, final state. Word accepted by A_(Q,d1,d3)!
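The run above follows the general pattern of a two-way automaton: a configuration is a pair (state, head position), and each transition names a next state and a head move of -1, 0, or 1. The following is a minimal sketch of an acceptance check by breadth-first search over configurations. The toy automaton (states `fwd`, `back`, `ok`, alphabet `a`, `b`, `$`) is hypothetical and is not the automaton of the slides; the acceptance condition (final state with the head past the end of the word) is an assumption made for the sketch.

```python
from collections import deque

def accepts(delta, start, finals, word):
    """Search all reachable configurations (state, head position) of a 2NFA.

    delta maps (state, symbol) to a set of (next_state, move) pairs,
    with move in {-1, 0, 1}. Assumption: the word is accepted when a
    final state is reached with the head just past the last symbol.
    """
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        state, pos = queue.popleft()
        if pos == len(word):
            if state in finals:
                return True
            continue
        for nxt, move in delta.get((state, word[pos]), ()):
            cfg = (nxt, pos + move)
            # Discard moves that fall off the left end of the word.
            if 0 <= cfg[1] <= len(word) and cfg not in seen:
                seen.add(cfg)
                queue.append(cfg)
    return False

# Hypothetical toy automaton: scan right to "$", step back one symbol,
# require that symbol to be "a", then scan right to the end.
DELTA = {
    ("fwd", "a"): {("fwd", 1)},
    ("fwd", "b"): {("fwd", 1)},
    ("fwd", "$"): {("back", -1)},
    ("back", "a"): {("ok", 0)},
    ("ok", "a"): {("ok", 1)},
    ("ok", "b"): {("ok", 1)},
    ("ok", "$"): {("ok", 1)},
}

print(accepts(DELTA, "fwd", {"ok"}, ["b", "a", "$"]))
print(accepts(DELTA, "fwd", {"ok"}, ["a", "b", "$"]))
```

Because the set of configurations is finite (states times positions), the search terminates even when the automaton can loop by moving back and forth.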

Query answering: technique
To check whether (c, d) ∉ Q^{I,C}, we check for nonemptiness of A, that is, the intersection of
- the one-way automaton A_0 that accepts words that represent databases, i.e., words of the form ($ C Σ⁺ C)* $
- the one-way automata corresponding to the various A_(Si,a,b) (for each source S_i and for each pair (a, b) ∈ S_i^C)
- the one-way automaton corresponding to the complement of A_(Q,c,d)
Indeed, any word accepted by such an intersection automaton represents a counterexample to (c, d) ∈ Q^{I,C}.

Query answering: complexity
All two-way automata constructed above are of linear size in the size of Q, the queries associated to S_1, ..., S_k, and S_1^C, ..., S_k^C. Hence, the corresponding one-way automata would be exponential.
However, we do not need to construct A explicitly. Instead, we can construct it on the fly while checking for nonemptiness.
Query answering for 2RPQs is PSPACE-complete in combined complexity (as for RPQs).

Complexity of query answering for 2RPQs: the complete picture
From [Calvanese&al. PODS 00]:
Assumption on domain   Assumption on views   Data   Expression   Combined
closed                 all sound             coNP   coNP         coNP
closed                 all exact             coNP   coNP         coNP
closed                 arbitrary             coNP   coNP         coNP
open                   all sound             coNP   PSPACE       PSPACE
open                   all exact             coNP   PSPACE       PSPACE
open                   arbitrary             coNP   PSPACE       PSPACE

INT[noconst, LAV/sound]: connection to rewriting
Query answering by rewriting:
- Given I = ⟨G, S, M⟩, and given a query Q over G, rewrite Q into a query, called rew(Q, I), in the alphabet A_S of the sources
- Evaluate the rewriting rew(Q, I) over the source database
We are interested in sound rewritings (computing only certain answers, for every source database C) that are expressed in a given query language, and that are maximal for the class of queries expressible in such language. Sometimes, we are interested in exact rewritings, i.e., rewritings that are logically equivalent to the query, modulo M.
But:
- When does the rewriting compute all certain answers?
- What do we gain or lose by focusing on a given class of queries?

Perfect rewriting
Let cert(Q, I, C) be the function that, given query Q, data integration system I, and source database C, computes the certain answers Q^{I,C} to Q wrt I and C.
Define cert_[Q,I](·) to be the function that, with Q and I fixed, given source database C, computes the certain answers Q^{I,C}.
- cert_[Q,I] can be seen as a query on the alphabet A_S that, given C, returns Q^{I,C}
- cert_[Q,I] is a (sound) rewriting of Q wrt I
- No sound rewriting exists that is better than cert_[Q,I]
- cert_[Q,I] is called the perfect rewriting of Q wrt I

Properties of the perfect rewriting
- Can we express the perfect rewriting in a certain query language?
- How does a maximal rewriting for a given class of queries compare with the perfect rewriting, from a semantical point of view and from a computational point of view?
- What is the computational complexity of (finding, evaluating) the perfect rewriting?

The case of conjunctive queries
Let I = ⟨G, S, M⟩ be a LAV/sound data integration system, let Q and the queries in M be CQs, and let Q′ be the union of all maximal rewritings of Q for the class of CQs. Then ([Levy&al. PODS 95], [Duschka&al. 97], [Abiteboul&al. PODS 98]):
- Q′ is the maximal rewriting for the class of unions of conjunctive queries (UCQs)
- Q′ is the perfect rewriting of Q wrt I
- Q′ is a PTIME query
- Q′ is an exact rewriting (equivalent to Q for each database B of I), if an exact rewriting exists
Does this ideal situation carry over to cases where Q and M allow for union?

Unions of path queries (UPQs)
A very simple query language (called UPQ), defined as follows:
Q → P | Q1 ∪ Q2
P → R | P1 ∘ P2
R denotes a binary database relation, P denotes a path query, which is a chaining of database relations, and Q denotes a union of path queries.
UPQs are a simple form of
- unions of conjunctive queries
- regular path queries
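To make the grammar concrete, here is a minimal sketch of UPQ evaluation over a database of binary relations. The encoding is an assumption for illustration: a path query is a list of relation names (their chaining), a UPQ is a list of path queries (their union), and the relation names `edge` and `link` are hypothetical.

```python
def eval_path(db, path):
    """Evaluate a path query: the chaining R1 ∘ R2 ∘ ... of binary relations."""
    result = db[path[0]]
    for rel in path[1:]:
        nxt = db[rel]
        # Relational composition: join on the shared middle node.
        result = {(x, z) for (x, y) in result for (y2, z) in nxt if y == y2}
    return result

def eval_upq(db, upq):
    """Evaluate a union of path queries."""
    answers = set()
    for path in upq:
        answers |= eval_path(db, path)
    return answers

# Hypothetical database with two binary relations.
DB = {
    "edge": {("a", "b"), ("b", "c")},
    "link": {("c", "d")},
}

# Q = (edge ∘ edge) ∪ (edge ∘ link)
print(sorted(eval_upq(DB, [["edge", "edge"], ["edge", "link"]])))
```

This evaluates a UPQ directly over a given database; it does not address view-based query answering, where only the view extensions are available.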

View-based query processing for UPQs
View-based query answering for UPQs is coNP-complete in data complexity [Calvanese&al. ICDE 00]. In other words, cert(Q, I, C), with Q and I fixed, is a coNP-complete function.
The perfect rewriting cert_[Q,I] is a coNP-complete query.
For query languages that include UPQs the perfect rewriting is coNP-hard: we do not have the ideal situation we had for conjunctive queries.
Problem: isolate those UPQs Q and I for which the perfect rewriting cert_[Q,I] is a PTIME function (assuming P ≠ NP) [Calvanese&al. LICS 00].

Incompleteness and inconsistency
Constraints in G   Type of mapping   Incompleteness   Inconsistency
no                 GAV/exact         no               no
no                 GAV/sound         yes/no           no
no                 LAV/sound         yes              no
no                 LAV/exact         yes              yes
yes                GAV/exact         no               yes
yes                GAV/sound         yes              yes
yes                LAV/sound         yes              yes
yes                LAV/exact         yes              yes

INT[noconst, LAV/exact]: inconsistency
The LAV mapping assertions have the logical form:
∀x (s(x) ↔ φ_G(x))
In general, given a source database C, there may be no solution of the above assertions (i.e., no database that is legal wrt G and that satisfies M wrt C).
Example:
s1(x) ⇝ { (x) | g(x) }
s2(x) ⇝ { (x) | g(x) }
with s1^C = {1}, and s2^C = {2}.

INT[noconst, LAV/exact]
[Figure: sources, source model, mapping, global schema, retrieved global databases, and models of I; the diagram illustrates incompleteness and inconsistency]

INT[noconst, LAV/exact]: some results for query answering
- Complexity analysis (sound, complete, exact) [Abiteboul&Duschka PODS 98] [Grahne&Mendelzon ICDT 99]
- Variants of regular path queries [Calvanese&al. ICDE 00, PODS 00]

INT[noconst, LAV/exact]: data complexity
From [Abiteboul&Duschka PODS 98] (rows: language of the views; columns: language of the query):
Sound sources   CQ       CQ≠      PQ       datalog   FOL
CQ              PTIME    coNP     PTIME    PTIME     undec.
CQ≠             PTIME    coNP     PTIME    PTIME     undec.
PQ              coNP     coNP     coNP     coNP      undec.
datalog         coNP     undec.   coNP     undec.    undec.
FOL             undec.   undec.   undec.   undec.    undec.
Exact sources   CQ       CQ≠      PQ       datalog   FOL
CQ              coNP     coNP     coNP     coNP      undec.
CQ≠             coNP     coNP     coNP     coNP      undec.
PQ              coNP     coNP     coNP     coNP      undec.
datalog         undec.   undec.   undec.   undec.    undec.
FOL             undec.   undec.   undec.   undec.    undec.

INT[noconst, LAV/exact]: polynomial intractability
Given a graph G = (N, E), we define I = ⟨G, S, M⟩ and source database C:
V1 ⇝ { (X) | color(X, Y) }        V1^C = N
V2 ⇝ { (Y) | color(X, Y) }        V2^C = { red, green, blue }
V3 ⇝ { (X, Y) | edge(X, Y) }      V3^C = E
Q ← { () | edge(X, Y) ∧ color(X, Z) ∧ color(Y, Z) }
Q^{I,C} is true if and only if G is not 3-colorable.
⟹ coNP-hard data complexity for conjunctive queries and views.
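The graph-theoretic side of this reduction can be checked by brute force on small graphs. The sketch below is a simplification, stated as an assumption: it only considers global databases in which every node receives exactly one color, and it reports the Boolean query as certainly true when every such assignment contains an edge whose endpoints share a color. The function names are hypothetical.

```python
from itertools import product

COLORS = ("red", "green", "blue")

def query_holds(edges, coloring):
    """Q: there is an edge(X, Y) with color(X, Z) and color(Y, Z)."""
    return any(coloring[x] == coloring[y] for x, y in edges)

def certain_answer(nodes, edges):
    """True iff Q holds under every assignment of one color per node,
    i.e. iff the graph has no proper 3-coloring."""
    return all(
        query_holds(edges, dict(zip(nodes, colors)))
        for colors in product(COLORS, repeat=len(nodes))
    )

triangle_nodes = ["n1", "n2", "n3"]
triangle_edges = [("n1", "n2"), ("n2", "n3"), ("n1", "n3")]

k4_nodes = ["n1", "n2", "n3", "n4"]
k4_edges = [(a, b) for i, a in enumerate(k4_nodes) for b in k4_nodes[i + 1:]]

print(certain_answer(triangle_nodes, triangle_edges))  # triangle is 3-colorable
print(certain_answer(k4_nodes, k4_edges))              # K4 is not 3-colorable
```

The enumeration is exponential in the number of nodes, which is consistent with the coNP-hardness stated on the slide: no polynomial shortcut is expected in general.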

INT[const, GAV/exact]: inconsistency
Given one source database C, there is only one database for G that satisfies the mapping wrt C. If this is not legal wrt G, then the system is inconsistent (I has no model); otherwise, the case is similar to INT[noconst, GAV/exact].
Retrieved global database:
University(code, name): (AF, bocconi), (BN, ucla)
Student(code, name, city): (15, bill, oslo), (15, anne, florence)
Enrolled(Scode, Ucode): (12, AF), (16, BN)
Source database:
s1^C: (15, anne, florence, 21), (15, bill, oslo, 24)
s2^C: (AF, bocconi), (BN, ucla)
s3^C: (12, AF), (16, BN)

INT[const, GAV/exact]
[Figure: sources, source model, mapping, global schema, retrieved global database, and models of I; the diagram illustrates inconsistency]

INT[const, GAV/sound]: incompleteness
Let us consider a system with a global schema with constraints, and with a GAV mapping M with sound sources, whose assertions have the form
g ⇝ φ_S
with the meaning
∀x (φ_S(x) → g(x))
Since G does have constraints, we cannot simply limit our attention to one database of the integration system (as we did for INT[noconst, GAV/exact] and INT[noconst, GAV/sound]).

INT[const, GAV/sound]
[Figure: sources, source model, mapping, global schema, retrieved global databases, and models of I; the diagram illustrates incompleteness and inconsistency]

INT[const, GAV/sound]: example
Global schema G:
student(Scode, Sname, Scity), key{Scode}
university(Ucode, Uname), key{Ucode}
enrolled(Scode, Ucode), key{Scode, Ucode}
enrolled[Scode] ⊆ student[Scode]
enrolled[Ucode] ⊆ university[Ucode]
Sources S: database relations s1, s2, s3
Mapping M:
student ⇝ { (X, Y, Z) | s1(X, Y, Z, W) }
university ⇝ { (X, Y) | s2(X, Y) }
enrolled ⇝ { (X, W) | s3(X, W) }

Constraints in GAV/sound: example
Example of source database and corresponding retrieved global database.
Source database:
s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
s2^C: (AF, bocconi), (BN, ucla)
s3^C: (12, AF), (16, BN)
Retrieved global database:
University(code, name): (AF, bocconi), (BN, ucla)
Student(code, name, city): (15, bill, oslo), (12, anne, florence), (16, ?, ?)
Enrolled(Scode, Ucode): (12, AF), (16, BN)

Constraints in GAV/sound: example
Source database C:
s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
s2^C: (AF, bocconi), (BN, ucla)
s3^C: (12, AF), (16, BN)
s3^C(16, BN) implies enrolled^B(16, BN), for all B ∈ sem^C(I).
Due to the integrity constraints in the global schema, 16 is the code of some student in all B ∈ sem^C(I).
Since C says nothing about the name and the city of the student with code 16, we must accept as legal for I wrt C all virtual global databases that differ in such attributes.

INT[const, GAV/sound]: unfolding is not sufficient
Mapping M:
student ⇝ { (X, Y, Z) | s1(X, Y, Z, W) }
university ⇝ { (X, Y) | s2(X, Y) }
enrolled ⇝ { (X, W) | s3(X, W) }
Source database C:
s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
s2^C: (AF, bocconi), (BN, ucla)
s3^C: (12, AF), (16, BN)
Query: { (X) | student(X, Y, Z), enrolled(X, W) }
Unfolding wrt M: { (X) | s1(X, Y, Z, V), s3(X, W) }
The unfolding retrieves only the answer {12} from C, although {12, 16} is the correct answer. The simple unfolding strategy is not sufficient in our context.
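The gap between the two answers can be reproduced on the source database of the example. The sketch below evaluates the unfolded query directly, and then a second function that, for this particular query only, uses the foreign key enrolled[Scode] ⊆ student[Scode]: every code retrieved into enrolled must be the code of some student in every legal database, so the join with student adds no restriction. Both function names are hypothetical.

```python
# Source database C from the example.
S1 = {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)}
S3 = {(12, "AF"), (16, "BN")}

def unfolded_answers(s1, s3):
    """{ (X) | s1(X, Y, Z, V), s3(X, W) }: plain unfolding wrt M."""
    return {x for (x, _, _, _) in s1 for (x2, _) in s3 if x == x2}

def answers_with_foreign_key(s3):
    """Answers to { (X) | student(X, Y, Z), enrolled(X, W) } for this example,
    taking enrolled[Scode] ⊆ student[Scode] into account: each enrolled
    code is a student code in all legal global databases."""
    return {x for (x, _) in s3}

print(sorted(unfolded_answers(S1, S3)))
print(sorted(answers_with_foreign_key(S3)))
```

The second function is specific to this query and this constraint; the general technique is the expansion of the query with respect to the constraints, described on the following slides.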

INT[const, GAV/sound]: special case
We assume that only key and foreign key constraints are in G, and M does not violate any key constraint of G (see later), and we associate to G a logic program P_G, as follows.
For each g in G we have a rule in P_G of the form:
g′(X1, ..., Xn) ← g(X1, ..., Xn)
For each foreign key constraint g1[A] ⊆ g2[B] in G, where A and B are sets of attributes, we have a rule in P_G of the form (the f_i's are fresh Skolem functions):
g2′(X1, ..., Xh, f_1(X1, ..., Xh), ..., f_{n-h}(X1, ..., Xh)) ← g1′(X1, ..., Xh, ..., Xm)

INT[const, GAV/sound]: special case
Technique for processing a conjunctive query q posed to I = ⟨G, S, M⟩:
- We construct P_G from G
- We partially evaluate P_G wrt q, and we obtain another query exp_G(q), called the expansion of q wrt the constraints of G
- We unfold exp_G(q) wrt M, and obtain a query unf_M(exp_G(q)) over the sources
- We evaluate unf_M(exp_G(q)) over the source database C
exp_G(q) can be of exponential size wrt G, but the whole process has polynomial time complexity wrt the size of C.

INT[const, GAV/sound]: example
Suppose we have I = ⟨G, S, M⟩, with G:
person(Pcode, Age, CityOfBirth)
student(Scode, University)
city(Name, Major)
key(person) = {Pcode}
key(student) = {Scode}
key(city) = {Name}
person[CityOfBirth] ⊆ city[Name]
city[Major] ⊆ person[Pcode]
student[Scode] ⊆ person[Pcode]

INT[const, GAV/sound]: example
The logic program P_G is
person′(X, Y, Z) ← person(X, Y, Z)
student′(X, Y) ← student(X, Y)
city′(X, Y) ← city(X, Y)
city′(X, f1(X)) ← person′(Y, Z, X)
person′(Y, f2(Y), f3(Y)) ← city′(X, Y)
person′(X, f4(X), f5(X)) ← student′(X, Y)
Consider the query { (X) | person(X, Y, Z) }, written as the rule
q(X) ← person′(X, Y, Z)

INT[const, GAV/sound]: examle eson (X,Y,Z) eson(x,y,z) student (X,W 1 ) city (W 2,X) student(x,w 1 ) city(w 2,X) ex G (q) is { (X) eson(x, Y, Z) student(x, W ) city(z, X) } Mauizio Lenzeini 132

INT[const, LAV/sound]
[Figure: sources, source model, mapping, global schema, retrieved global databases, and models of I; the diagram illustrates incompleteness and inconsistency]

INT[const, LAV/sound]
- With functional dependencies [Duschka 97]
- With full dependencies [Duschka 97]
- With inclusion dependencies [Gryz 97]
- With Description Logics integrity constraints [Calvanese&al. AAAI 00]

INT[const, LAV/exact]
[Figure: sources, source model, mapping, global schema, retrieved global databases, and models of I; the diagram illustrates incompleteness and two forms of inconsistency]

INT[const, LAV/exact]
- With Description Logics integrity constraints [Calvanese&al. AAAI 00]
- Largely unexplored problem

Outline
- Formal framework for data integration
- Approaches to data integration
- Query answering in different approaches
- Dealing with inconsistency
- Reasoning on queries in data integration
- Conclusions

INT[const, GAV/sound]: dealing with inconsistency
When, for data integration system I = ⟨G, S, M⟩ and source database C, we have sem^C(I) = ∅, the first-order setting described above is not adequate.
Related work: [Subrahmanian ACM-TODS 94], [Grant&al. IEEE-TKDE 95], [Dung CoopIS 96], [Lin&al. JICIS 98], [Yan&al. CoopIS 99], [Arenas&al. PODS 99], [Greco&al. LPAR 00], and many approaches to KB revision and KB/DB update.

Beyond first-order logic: example
key(player) = {Pcode}
key(team) = {Tcode}
player[Pteam] ⊆ team[Tcode]
team[Tleader] ⊆ player[Pcode]
player ⇝ { (X, Y, Z) | s1(X, Y, Z, W) }
team ⇝ { (X, Y, Z) | s2(X, Y, Z) ∨ s3(X, Y, Z) }
s1^C: (9, Batistuta, RM, 31), (10, Rivaldo, BC, 29)
s2^C: (RM, Roma, 8), (BC, Barcelona, 10)
s3^C: (RM, Roma, 9)

Beyond first-order logic: a proposal
Given I = ⟨G, S, M⟩, with a GAV/sound mapping M = { r1 ⇝ V1, ..., rn ⇝ Vn }, and source database C for S, we would like to focus on those databases for I that
1. satisfy G (constraints in G are rigid), and
2. approximate as much as possible the satisfaction of the mapping M wrt C (assertions in M are soft).

Beyond first-order logic: a proposal
We define an ordering between the global databases for I as follows. If B1 and B2 are two databases that satisfy G, we say that B1 is better than B2 wrt I and C, denoted as B1 ≫_I^C B2, if there exists an assertion r_i ⇝ V_i in M such that
- (r_i^{B1} ∩ V_i^C) ⊃ (r_i^{B2} ∩ V_i^C), and
- (r_j^{B1} ∩ V_j^C) ⊇ (r_j^{B2} ∩ V_j^C) for all r_j ⇝ V_j in M with j ≠ i.
Intuitively, B1 has fewer deletions than B2 wrt the retrieved global database (see [Fagin&al. PODS 83]), and since the mapping is sound, this means that B1 is closer than B2 to the retrieved global database. In other words, B1 approximates the sound mapping better than B2.

Example
Consider I = ⟨G, S, M⟩, with
- G containing relation r(x, y) with key x,
- S containing relations s1(x, y) and s2(x, y),
- M = { r ⇝ { (x, y) | s1(x, y) ∨ s2(x, y) } },
and consider the source database C = { s1(a, d), s1(b, d), s2(a, e) }, so that the retrieved global database is { r(a, d), r(b, d), r(a, e) }.
We have that
- { r(a, d), r(b, d) } ≫_I^C { r(a, d) }, and { r(a, e), r(b, d) } ≫_I^C { r(a, e) }
- { r(a, d), r(b, d) } and { r(a, e) } are incomparable
- { r(a, e), r(b, d), r(c, e) } and { r(a, e), r(b, d) } are incomparable
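The comparisons in this example can be checked mechanically. The sketch below is one possible encoding of the ordering, under stated assumptions: a global database and the retrieved global database are dictionaries from relation names to sets of tuples, and the function name `better` is hypothetical. It returns true when the first database keeps at least the retrieved tuples that the second keeps for every relation, and strictly more for at least one.

```python
def better(b1, b2, retrieved):
    """B1 is better than B2: for every relation, B1 keeps a superset of the
    retrieved tuples kept by B2, and a strict superset for some relation."""
    strictly = False
    for rel, extension in retrieved.items():
        kept1 = b1.get(rel, set()) & extension
        kept2 = b2.get(rel, set()) & extension
        if not kept1 >= kept2:
            return False
        if kept1 > kept2:
            strictly = True
    return strictly

RETRIEVED = {"r": {("a", "d"), ("b", "d"), ("a", "e")}}

B_AD_BD = {"r": {("a", "d"), ("b", "d")}}
B_AD = {"r": {("a", "d")}}
B_AE = {"r": {("a", "e")}}
B_AE_BD = {"r": {("a", "e"), ("b", "d")}}
B_AE_BD_CE = {"r": {("a", "e"), ("b", "d"), ("c", "e")}}

print(better(B_AD_BD, B_AD, RETRIEVED))
print(better(B_AE_BD, B_AE, RETRIEVED))
print(better(B_AD_BD, B_AE, RETRIEVED), better(B_AE, B_AD_BD, RETRIEVED))
print(better(B_AE_BD_CE, B_AE_BD, RETRIEVED), better(B_AE_BD, B_AE_BD_CE, RETRIEVED))
```

The tuple (c, e) does not occur in the retrieved global database, so adding it does not make a database better: only the retrieved tuples that are kept matter. The sketch does not check that the databases satisfy the key constraint of G; that is assumed of its inputs.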

Beyond first-order logic: a proposal
≫_I^C is a partial order.
A database B that satisfies G satisfies the mapping M with respect to C if B is maximal wrt ≫_I^C, i.e., for no other global database B′ that satisfies G, we have that B′ ≫_I^C B:
sem^C(I) = { B | B is a database that satisfies G, and such that ¬∃B′ such that B′ satisfies G and B′ ≫_I^C B }
The notion of legal database for I with respect to C, and the notion of certain answer, remain the same, given the new definition of satisfaction of the mapping.

Beyond first-order logic: special case of INT[const, GAV/sound]
We assume that only key and foreign key constraints are in G. Given I = ⟨G, S, M⟩ and source database C, we define the DATALOG¬ program P(I, C) obtained by adding to the set of facts C the following set of rules:
- for each g ⇝ { (x) | body_1(x, y_1) ∨ ... ∨ body_m(x, y_m) } in M, the rules:
  g_C(X) ← body_1(X, Y_1)
  ...
  g_C(X) ← body_m(X, Y_m)
- for each relation g ∈ G, the rules
  g(X, Y) ← g_C(X, Y), not ḡ(X, Y)
  ḡ(X, Y) ← g(X, Z), Y ≠ Z
In g(X, Y), X is the key of g. Y ≠ Z means that there exists i such that Y_i ≠ Z_i.

Beyond first-order logic: a proposal
The above rules force each stable model T of P(I, C) to be such that, for each g in G, g^T is a maximal subset of the tuples from the retrieved global database that are consistent with the key constraint for g.
t ∈ q^{I,C} under the new semantics if and only if t ∈ q^T for each stable model T of the DATALOG¬ program P(I, C) ∪ { exp_G(q) }.
A stable model of a DATALOG¬ program Π is any set σ of ground atoms that coincides with the unique minimal Herbrand model of the DATALOG program Π_σ, where Π_σ is obtained from Π by deleting every rule that has a negative literal ¬B with B ∈ σ, and all negative literals in the bodies of the remaining rules.
The problem of deciding whether t ∈ q^{I,C} is in coNP wrt data complexity (coNP-complete).
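The characterization of stable models as maximal key-consistent subsets can be illustrated without a logic programming engine. The sketch below does not compute stable models; as a swapped-in technique, it enumerates directly the maximal subsets of a single relation that respect a key on the first attribute (one tuple per key value) and intersects the query answers over all of them. Foreign keys and the expansion exp_G(q) are ignored, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def maximal_key_consistent(tuples):
    """All maximal subsets of `tuples` in which the first attribute is a key:
    choose exactly one tuple for each key value."""
    groups = defaultdict(list)
    for t in sorted(tuples):
        groups[t[0]].append(t)
    return [set(choice) for choice in product(*groups.values())]

def certain(tuples, query):
    """Answers returned by `query` on every maximal key-consistent subset."""
    answers = [query(subset) for subset in maximal_key_consistent(tuples)]
    return set.intersection(*answers)

# Retrieved global database for r(x, y) with key x, from the earlier example.
RETRIEVED_R = {("a", "d"), ("b", "d"), ("a", "e")}

print(maximal_key_consistent(RETRIEVED_R))
print(sorted(certain(RETRIEVED_R, lambda r: set(r))))
print(sorted(certain(RETRIEVED_R, lambda r: {x for (x, _) in r})))
```

On this input there are two maximal subsets, one keeping r(a, d) and one keeping r(a, e). The tuple (b, d) is a certain answer to the identity query, while the projection on the key returns both a and b, since every maximal subset contains some tuple for a.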

Reasoning on queries and views in data integration
Traditional query containment is not adequate.
Global schema:
movie(Title, Year, Director)
review(Title, Critique)
Mapping:
r1(T, Y, D) ⇝ { (T, Y, D) | movie(T, Y, D) ∧ Y ≥ 1960 }
r2(T, R) ⇝ { (T, R) | review(T, R) ∧ R ≥ 8 }
Queries:
Q1: { (T, R) | movie(T, 1998, D) ∧ review(T, R) }
Q2: { (T, R) | movie(T, 1998, D) ∧ review(T, R) ∧ R ≥ 8 }
Q1 is not contained in Q2 in the traditional sense, but is contained in Q2 relative to I.

Relative containment [Millstein&al. PODS 00]
Given data integration system I = ⟨G, S, M⟩, a query Q1 is said to be contained in query Q2 relative to I (written Q1 ⊆_I Q2) if, for every source database C, the set of certain answers to Q1 wrt I and C is contained in the set of certain answers to Q2 wrt I and C, i.e., if
∀C, cert(Q1, I, C) ⊆ cert(Q2, I, C)
For LAV/sound systems with conjunctive queries in the mapping, deciding relative containment of two conjunctive queries is Π_2^p-complete [Millstein&al. PODS 00].

Lossless views
Given LAV data integration system I = ⟨G, S, M⟩ and query Q, I is said to be lossless wrt Q if, for every global database B for I and for every source database C such that B is legal for I wrt C, we have that Q^B = Q^{I,C}.
If I = ⟨G, S, M⟩ is lossless wrt Q, then answering Q through the sources of I (views) is the same as answering Q by accessing the global database.
Note the difference with checking whether the maximally contained rewriting of Q wrt I is equivalent to Q.

Comparing the expressive power of sets of views
A set of views V is p-contained in another set of views W if all queries that are answerable by V are also answerable by W [Li&al. ICDT 01]. A query Q is answerable by a set of views V if there is an equivalent rewriting of Q using V.
Given LAV data integration systems I1 = ⟨G, S1, M1⟩ and I2 = ⟨G, S2, M2⟩, I1 is p-contained in I2 if, for each query Q, cert_[Q,I1] equivalent to Q implies cert_[Q,I2] equivalent to Q.

Conclusions
Many open problems, including
- P2P data integration
- Several interesting classes of integrity constraints
- Global schema expressed in terms of semi-structured data (with constraints)
- Dealing with inconsistencies, data cleaning
- How to go beyond the unique domain assumption
- Limitations in accessing the sources
- How to incorporate the notion of data quality (source reliability, accuracy, etc.)
- More on reasoning on queries and views
- Optimization

Acknowledgements
Special thanks to
- Andrea Calí
- Diego Calvanese
- Giuseppe De Giacomo
- Domenico Lembo
- Riccardo Rosati
- Moshe Y. Vardi