Data integration: A theoretical perspective
Maurizio Lenzerini
Dipartimento di Informatica e Sistemistica "Antonio Ruberti", Università di Roma "La Sapienza"
Tutorial at PODS 2002, Madison, Wisconsin, USA, June 2002

Data integration
[Figure: a query is posed over the global schema; the mapping connects the global schema to the source schemas of Source 1 and Source 2.]

Outline
- Formal framework for data integration
- Approaches to data integration
- Query answering in different approaches
- Dealing with inconsistency
- Reasoning on queries in data integration
- Conclusions

Formal framework
A data integration system $I$ is a triple $\langle G, S, M \rangle$, where
- $G$ is the global schema (over an alphabet $A_G$)
- $S$ is the source schema (over an alphabet $A_S$)
- $M$ is the mapping between $G$ and $S$
Semantics of $I$: which are the databases that satisfy $I$ (models of $I$)?
We refer only to databases over a fixed infinite domain $\Gamma$, and we start with a source database $C$ (data available at the sources, also called source model) over $\Gamma$. The set of databases that satisfy $I$ relative to $C$ is:
$\mathit{sem}^C(I) = \{\, B \mid B \text{ is legal wrt } G \text{ and satisfies } M \text{ wrt } C \,\}$

Semantics of queries to I
A query $q$ of arity $n$ is a FOL formula with $n$ free variables.
If $D$ is a database, then $q^D$ denotes the extension of $q$ in $D$ (i.e., the set of valuations in $\Gamma$ for the free variables of $q$ that make $q$ true in $D$).
If $q$ is a query of arity $n$ posed to a data integration system $I$ (i.e., a query over $A_G$), then the set of certain answers to $q$ wrt $I$ and $C$ is
$q^{I,C} = \{\, (c_1, \ldots, c_n) \mid (c_1, \ldots, c_n) \in q^B \text{ for every } B \in \mathit{sem}^C(I) \,\}$

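The definition of certain answers can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the set of models is given as a small finite list of databases (in general $\mathit{sem}^C(I)$ is infinite); the relation name `g`, the sample tuples, and the helper `certain_answers` are hypothetical and not part of the tutorial.

```python
from functools import reduce

def certain_answers(query, models):
    """Intersect the extensions of `query` over all the given models.

    `query` maps a database (dict: relation name -> set of tuples)
    to a set of answer tuples. Assumption: `models` is a finite list.
    """
    extensions = [query(db) for db in models]
    return reduce(set.intersection, extensions) if extensions else set()

# Two hypothetical models of the same system, over one global relation g.
models = [
    {"g": {("a", 1), ("b", 2)}},
    {"g": {("a", 7), ("b", 2), ("c", 3)}},
]

# q1(x) :- g(x, y)    and    q2(x, y) :- g(x, y)
q1 = lambda db: {(x,) for (x, _y) in db["g"]}
q2 = lambda db: set(db["g"])

certain_q1 = certain_answers(q1, models)
certain_q2 = certain_answers(q2, models)
```

Here `("a",)` is a certain answer to `q1` although no single tuple `("a", y)` is a certain answer to `q2`: the models agree that `a` has some second component, but not on which one.
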
Databases with incomplete information
- Traditional database: one model of a first-order theory. Query answering means evaluating a formula in the model.
- Database with incomplete information: set of models (specified, for example, as a restricted first-order theory). Query answering means computing the tuples that satisfy the query in all the models in the set.
There is a strong connection between query answering in data integration and query answering in databases with incomplete information under constraints.

The mapping
How is the mapping $M$ between $G$ and $S$ specified?
- Are the sources defined in terms of the global schema? Approach called source-centric, or local-as-view, or LAV.
- Is the global schema defined in terms of the sources? Approach called global-schema-centric, or global-as-view, or GAV.
- A mixed approach? Approach called GLAV.
- Mapping between sources, without global schema? Approach called P2P.

GAV vs LAV example
Global schema:
- movie(Title, Year, Director)
- european(Director)
- review(Title, Critique)
Source 1: $r_1$(Title, Year, Director), since 1960, european directors
Source 2: $r_2$(Title, Critique), since 1990
Query: title and critique of movies in 1998:
$\exists D.\ \mathit{movie}(T, 1998, D) \land \mathit{review}(T, R)$, written $\{\, (T, R) \mid \mathit{movie}(T, 1998, D) \land \mathit{review}(T, R) \,\}$

Formalization of LAV
In LAV, the mapping $M$ is constituted by a set of assertions:
- $s \subseteq \phi_G$ (sound source): $\forall \vec{x}\ (s(\vec{x}) \rightarrow \phi_G(\vec{x}))$
- $s \equiv \phi_G$ (exact source): $\forall \vec{x}\ (s(\vec{x}) \leftrightarrow \phi_G(\vec{x}))$
one for each source element $s$ in $A_S$, where $\phi_G$ is a query over $G$.
Given source database $C$, a database $B$ for $G$ satisfies $M$ wrt $C$ if for each $s \in S$:
- $s^C \subseteq \phi_G^B$ (sound source)
- $s^C = \phi_G^B$ (exact source)
The mapping $M$ and the source database $C$ do not provide direct information about which data satisfy the global schema. Sources are views, and we have to answer queries on the basis of the available data in the views.

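The satisfaction condition for LAV assertions can be sketched in Python as a plain set comparison. This is a minimal sketch: the candidate global database, the tuples, and the names `satisfies_lav` and `view_r1` are hypothetical, and the view used is the description of Source 1 from the movie example (european directors, since 1960).

```python
def satisfies_lav(source_ext, view_ext, kind="sound"):
    """Check one LAV assertion: s^C against phi_G^B (both are sets of tuples)."""
    if kind == "sound":
        return source_ext.issubset(view_ext)
    if kind == "exact":
        return source_ext == view_ext
    raise ValueError("kind must be 'sound' or 'exact'")

def view_r1(db):
    """phi_G for source r1: movies since 1960 with a european director."""
    european = {d for (d,) in db["european"]}
    return {(t, y, d) for (t, y, d) in db["movie"] if d in european and y >= 1960}

# Hypothetical candidate global database B and source extension r1^C.
B = {
    "movie": {("film_a", 1998, "dir_x"), ("film_b", 1970, "dir_y")},
    "european": {("dir_x",), ("dir_y",)},
}
r1_C = {("film_a", 1998, "dir_x")}

is_sound = satisfies_lav(r1_C, view_r1(B), "sound")
is_exact = satisfies_lav(r1_C, view_r1(B), "exact")
```

With these sample tuples, `B` satisfies the sound assertion but not the exact one, because the view over `B` also contains `film_b`, which the source does not store.
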
LAV example
Global schema:
- movie(Title, Year, Director)
- european(Director)
- review(Title, Critique)
LAV: associated to source relations we have views over the global schema
$r_1(T, Y, D) \subseteq \{\, (T, Y, D) \mid \mathit{movie}(T, Y, D) \land \mathit{european}(D) \land Y \geq 1960 \,\}$
$r_2(T, R) \subseteq \{\, (T, R) \mid \mathit{movie}(T, Y, D) \land \mathit{review}(T, R) \land Y \geq 1990 \,\}$
The query $\{\, (T, R) \mid \mathit{movie}(T, 1998, D) \land \mathit{review}(T, R) \,\}$ is processed by means of an inference mechanism that aims at re-expressing the atoms of the global schema in terms of atoms at the sources. In this case:
$\{\, (T, R) \mid r_2(T, R) \land r_1(T, 1998, D) \,\}$

Formalization of GAV
In GAV, the mapping $M$ is constituted by a set of assertions:
- $g \supseteq \phi_S$ (sound source): $\forall \vec{x}\ (\phi_S(\vec{x}) \rightarrow g(\vec{x}))$
- $g \equiv \phi_S$ (exact source): $\forall \vec{x}\ (\phi_S(\vec{x}) \leftrightarrow g(\vec{x}))$
one for each element $g$ in $A_G$, where $\phi_S$ is a query over $S$.
Given source database $C$, a database $B$ for $G$ satisfies $M$ wrt $C$ if for each $g \in G$:
- $g^B \supseteq \phi_S^C$ (sound source)
- $g^B = \phi_S^C$ (exact source)
Given a source database, $M$ provides direct information about which data satisfy the elements of the global schema. Relations in $G$ are views, and queries are expressed over the views. Thus, it seems that we can simply evaluate the query over the data satisfying the global relations (as if we had a single database at hand).

GAV example
Global schema:
- movie(Title, Year, Director)
- european(Director)
- review(Title, Critique)
GAV: associated to relations in the global schema we have views over the sources
$\mathit{movie}(T, Y, D) \supseteq \{\, (T, Y, D) \mid r_1(T, Y, D) \,\}$
$\mathit{european}(D) \supseteq \{\, (D) \mid r_1(T, Y, D) \,\}$
$\mathit{review}(T, R) \supseteq \{\, (T, R) \mid r_2(T, R) \,\}$

GAV example of query processing
The query $\{\, (T, R) \mid \mathit{movie}(T, 1998, D) \land \mathit{review}(T, R) \,\}$ is processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations. In this case:
$\mathit{movie}(T, 1998, D) \land \mathit{review}(T, R) \ \xrightarrow{\text{unfolding}}\ r_1(T, 1998, D) \land r_2(T, R)$

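Unfolding can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's algorithm: conjunctive queries are lists of atoms, variables are strings starting with an uppercase letter, the GAV mapping of the movie example is encoded as a dictionary, and the source tuples and function names (`unfold`, `evaluate`) are hypothetical.

```python
from itertools import count

def is_var(term):
    # Convention assumed here: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

# GAV mapping: global predicate -> (head variables, body atoms over the sources).
GAV = {
    "movie":    (("T", "Y", "D"), [("r1", ("T", "Y", "D"))]),
    "european": (("D",),          [("r1", ("T", "Y", "D"))]),
    "review":   (("T", "R"),      [("r2", ("T", "R"))]),
}
_fresh = count()

def unfold(atoms, mapping):
    """Replace each global atom by the body of its definition."""
    result = []
    for pred, args in atoms:
        head, body = mapping[pred]
        sub = dict(zip(head, args))
        suffix = next(_fresh)  # rename body-only variables apart
        for bpred, bargs in body:
            result.append((bpred, tuple(
                sub.get(a, f"{a}_{suffix}") if is_var(a) else a for a in bargs)))
    return result

def evaluate(atoms, db, answer_vars):
    """Naive evaluation of a conjunctive query over sets of tuples."""
    bindings = [{}]
    for pred, args in atoms:
        extended = []
        for b in bindings:
            for row in db[pred]:
                b2 = dict(b)
                for a, v in zip(args, row):
                    if is_var(a):
                        if b2.setdefault(a, v) != v:
                            break
                    elif a != v:
                        break
                else:
                    extended.append(b2)
        bindings = extended
    return {tuple(b[v] for v in answer_vars) for b in bindings}

query = [("movie", ("T", 1998, "D")), ("review", ("T", "R"))]
unfolded = unfold(query, GAV)

# Hypothetical source database.
sources = {
    "r1": {("film_a", 1998, "dir_x"), ("film_b", 1975, "dir_y")},
    "r2": {("film_a", "positive"), ("film_c", "negative")},
}
answers = evaluate(unfolded, sources, ("T", "R"))
```

The unfolded query mentions only `r1` and `r2`, so it can be evaluated directly over the source database without materializing the global relations.
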
GAV and LAV comparison
LAV: (Information Manifold, DWQ, Picsel)
- Quality depends on how well we have characterized the sources
- High modularity and extensibility (if the global schema is well designed, when a source changes, only its definition is affected)
- Query processing needs reasoning (query reformulation complex)
GAV: (Carnot, SIMS, Tsimmis, IBIS, Picsel, ...)
- Quality depends on how well we have compiled the sources into the global schema through the mapping
- Whenever a source changes or a new one is added, the global schema needs to be reconsidered
- Query processing can be based on some sort of unfolding (query reformulation looks easier)
For more details, see [Ullman, TCS 00], [Halevy, VLDBJ 01].

Beyond GAV and LAV: GLAV
In GLAV, the mapping $M$ is constituted by a set of assertions:
- $\phi_S \subseteq \phi_G$ (sound source): $\forall \vec{x}\ (\phi_S(\vec{x}) \rightarrow \phi_G(\vec{x}))$
- $\phi_S \equiv \phi_G$ (exact source): $\forall \vec{x}\ (\phi_S(\vec{x}) \leftrightarrow \phi_G(\vec{x}))$
where $\phi_S$ is a query over $S$, and $\phi_G$ is a query over $G$.
Given source database $C$, a database $B$ that is legal wrt $G$ satisfies $M$ wrt $C$ if for each assertion in $M$:
- $\phi_S^C \subseteq \phi_G^B$ (sound source)
- $\phi_S^C = \phi_G^B$ (exact source)
The mapping $M$ does not provide direct information about which data satisfy the global schema: to answer a query $q$ over $G$, we have to infer how to use $M$ in order to access the source database $C$.

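Example of GLAV
Global schema: Work(Person, Project), Area(Project, Field)
Source 1: HasJob(Person, Field)
Source 2: Teach(Professor, Course), In(Course, Field)
Source 3: Get(Researcher, Grant), For(Grant, Project)
GLAV mapping:
$\{\, (r, f) \mid \mathit{HasJob}(r, f) \,\} \subseteq \{\, (r, f) \mid \mathit{Work}(r, p) \land \mathit{Area}(p, f) \,\}$
$\{\, (r, f) \mid \mathit{Teach}(r, c) \land \mathit{In}(c, f) \,\} \subseteq \{\, (r, f) \mid \mathit{Work}(r, p) \land \mathit{Area}(p, f) \,\}$
$\{\, (r, p) \mid \mathit{Get}(r, g) \land \mathit{For}(g, p) \,\} \subseteq \{\, (r, p) \mid \mathit{Work}(r, p) \,\}$
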
Beyond GLAV: P2P data integration
In P2P, the global schema does not exist. Constraints (that we can still call $G$) are defined over $A_G = A_{S_1} \cup \cdots \cup A_{S_n}$, and the mapping $M$ is constituted by a set of assertions
$\phi_1^{S_i} \subseteq \phi_2^{S_j}$
where $\phi_1^{S_i}$ and $\phi_2^{S_j}$ are queries over the alphabets $A_{S_i}$ and $A_{S_j}$, respectively.
$A_S$ is a distinguished subset of predicates in $A_G$, called base predicates (where data are). A source database is a database for the base predicates.
Given source database $C$, a database $W$ that satisfies $I$ relative to $C$ is a database for $S$ such that, for each assertion $\phi_1 \subseteq \phi_2$ in $M$, $\phi_1^W \subseteq \phi_2^W$.
Queries are now expressed over alphabet $A_{S_i}$, and the notion of certain answers is the usual one.

A unified view
- Alphabet: $A = A_G \cup A_S$
- Integrity constraints: constraints $G$, and mapping $M$
- Partial database: source database
- Database: data for all symbols in $A$ that are both coherent with the partial database and satisfy the integrity constraints
- Query answering: computing the tuples that satisfy the query in every database
Under this view, the difference between LAV, GAV, GLAV, and P2P is reflected in the kinds of integrity constraints that are expressible.

Query answering with incomplete information
- [Reiter 84]: relational setting, databases with incomplete information modeled as a first-order theory
- [Vardi 86]: relational setting, complexity of reasoning in closed world databases with unknown values
- Several approaches both from the DB and the KR community
- [van der Meyden 98]: survey on logical approaches to incomplete information

Connection to query containment
- Query containment (under constraints $T$) is the problem of checking whether $q_1^B$ is contained in $q_2^B$ for every database $B$ (satisfying $T$), where $q_1$, $q_2$ are queries with the same arity.
- A source database $C$ can be represented as a conjunction $q_C$ of ground literals over $A_S$ (e.g., if $\vec{x}$ is in $s^C$, then the corresponding literal is $s(\vec{x})$).
- If $q$ is a query, and $t$ is a tuple, then we denote by $q_t$ the query obtained by substituting the free variables of $q$ with $t$.
- The problem of checking whether $t \in q^{I,C}$ can be reduced to the problem of checking whether $q_C$ is contained in $q_t$ under the constraints $G \cup M$.
The combined complexity of checking certain answers is identical to the complexity of query containment under constraints, and the data complexity is at most the complexity of query containment under constraints.

Dealing with incompleteness and inconsistency
We analyze the problem of query answering in different cases, depending on two parameters:
- Global schema: without constraints, or with constraints
- Mapping: GAV or LAV, sound or complete
Given a source database $C$, we call retrieved global database any database for $G$ that satisfies the mapping wrt $C$.

Incompleteness and inconsistency
Constraints in G | Type of mapping | Incompleteness | Inconsistency
no | GAV/exact | no | no
no | GAV/sound | yes/no | no
no | LAV/sound | yes | no
no | LAV/exact | yes | yes
yes | GAV/exact | no | yes
yes | GAV/sound | yes | yes
yes | LAV/sound | yes | yes
yes | LAV/exact | yes | yes

INT[noconst, GAV/exact]: example
Consider $I = \langle G, S, M \rangle$, with
Global schema $G$:
- student(Scode, Sname, Scity)
- university(Ucode, Uname)
- enrolled(Scode, Ucode)
Source schema $S$: database relations $s_1$, $s_2$, $s_3$
Mapping $M$:
$\mathit{student}(X, Y, Z) \equiv \{\, (X, Y, Z) \mid s_1(X, Y, Z, W) \,\}$
$\mathit{university}(X, Y) \equiv \{\, (X, Y) \mid s_2(X, Y) \,\}$
$\mathit{enrolled}(X, W) \equiv \{\, (X, W) \mid s_3(X, W) \,\}$

INT[noconst, GAV/exact]: example
Source database:
- $s_1^C$: (12, anne, florence, 21), (15, bill, oslo, 24)
- $s_2^C$: (AF, bocconi), (BN, ucla)
- $s_3^C$: (12, AF), (16, BN)
Retrieved global database:
- University(code, name): (AF, bocconi), (BN, ucla)
- Student(code, name, city): (15, bill, oslo), (12, anne, florence)
- Enrolled(Scode, Ucode): (12, AF), (16, BN)
Example of source database and corresponding retrieved global database.

INT[noconst, GAV/exact]
[Figure: the source model is related through the mapping to the global schema; the retrieved global database coincides with the model of I.]

INT[noconst, GAV/exact]: query answering
- Use $M$ for computing from $C$ the retrieved global database, where each element $g$ of $G$ satisfies exactly the tuples of $C$ satisfying the $\phi_S$ that $M$ associates to $g$
- Since $G$ does not have constraints, the retrieved global database is legal wrt $G$
- Actually, it is the only database that is legal wrt $G$, and that satisfies $M$ wrt $C$
- Thus, we can simply evaluate the query $q$ over the retrieved global database, which is equivalent to unfolding the query according to $M$, in order to obtain a query on $A_S$ to be evaluated over $C$
Answering queries to $I$ means answering queries to a single database.

INT[noconst, GAV/exact]: example of query answering
Mapping $M$:
$\mathit{student}(X, Y, Z) \equiv \{\, (X, Y, Z) \mid s_1(X, Y, Z, W) \,\}$
$\mathit{university}(X, Y) \equiv \{\, (X, Y) \mid s_2(X, Y) \,\}$
$\mathit{enrolled}(X, W) \equiv \{\, (X, W) \mid s_3(X, W) \,\}$
Source database:
- $s_1^C$: (12, anne, florence, 21), (15, bill, oslo, 24)
- $s_2^C$: (AF, bocconi), (BN, ucla)
- $s_3^C$: (12, AF), (16, BN)
Query: $\{\, (X) \mid \mathit{student}(X, Y, Z), \mathit{enrolled}(X, W) \,\}$
Unfolding wrt $M$: $\{\, (X) \mid s_1(X, Y, Z, V), s_3(X, W) \,\}$ retrieves the answer $\{12\}$ from $C$.
A simple unfolding strategy is sufficient in this context.

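The example can be replayed with a short Python sketch that uses the source tuples shown on the slide. It builds the retrieved global database from the exact GAV mapping, evaluates the query over it, and compares the result with the unfolded query evaluated directly over the sources. The variable names are illustrative only.

```python
# Source database C, as given in the example.
s1 = {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)}
s2 = {("AF", "bocconi"), ("BN", "ucla")}
s3 = {(12, "AF"), (16, "BN")}

# Retrieved global database: with exact GAV assertions, each global
# relation contains exactly the tuples returned by its source query.
student = {(x, y, z) for (x, y, z, _w) in s1}
university = set(s2)
enrolled = set(s3)

# Query { (X) | student(X, Y, Z), enrolled(X, W) } over the global relations.
answers = {x for (x, _y, _z) in student for (x2, _w) in enrolled if x == x2}

# Unfolded query { (X) | s1(X, Y, Z, V), s3(X, W) } over the sources.
unfolded_answers = {x for (x, _y, _z, _v) in s1 for (x2, _w) in s3 if x == x2}
```

Both evaluations return the same set, which matches the answer stated on the slide.
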
INT[noconst, GAV/sound]: example
Source database:
- $s_1^C$: (12, anne, florence, 21), (15, bill, oslo, 24)
- $s_2^C$: (AF, bocconi), (BN, ucla)
- $s_3^C$: (12, AF), (16, BN)
Retrieved global database:
- University(code, name): (AF, bocconi), (UR, uniroma), (BN, ucla)
- Student(code, name, city): (15, bill, oslo), (12, anne, florence)
- Enrolled(Scode, Ucode): (12, AF), (16, BN)
Example of source database and corresponding retrieved global database.

INT[noconst, GAV/sound]
The GAV mapping assertions have the logical form:
$\forall \vec{x}\ \phi_S(\vec{x}) \rightarrow g(\vec{x})$
The intersection of all retrieved global databases (which can be computed by letting each element $g$ of $G$ satisfy exactly the tuples of $C$ satisfying the $\phi_S$ that $M$ associates to $g$) still satisfies $M$ wrt $C$, and therefore is the only minimal model of $I$.
Incompleteness is of special form. For queries without negation, unfolding is sufficient.

INT[noconst, GAV/sound]
[Figure: the source model is related through the mapping to the global schema; the minimal model of I is the intersection of the retrieved global databases.]

INT[noconst, LAV/sound]: incompleteness
The LAV mapping assertions have the logical form:
$\forall \vec{x}\ s(\vec{x}) \rightarrow \phi_G(\vec{x})$
In general, given a source database $C$ there are several solutions of the above assertions (i.e., different databases that are legal wrt $G$ and satisfy $M$ wrt $C$).
Incompleteness comes from the mapping. This holds even for the case of simple queries $\phi_G$:
$s_1(x) \subseteq \{\, (x) \mid \exists y\ g(x, y) \,\}$
$s_2(x) \subseteq \{\, (x) \mid g_1(x) \lor g_2(x) \,\}$

INT[noconst, LAV/sound]
[Figure: the source model is related through the mapping to the global schema; the models of I are the retrieved global databases.]

INT[noconst, LAV/sound]: dealing with incompleteness
View-based query processing: answer a query based on a set of materialized views, rather than on the raw data in the database.
Relevant problem in
- Data warehousing
- Query optimization
- Providing physical independence

INT[noconst, LAV/sound]: dealing with incompleteness
In LAV/sound data integration, the views are the sources. Two approaches to view-based query processing:
- View-based query rewriting: query processing is divided in two steps
  1. re-express the query in terms of a given query language over the alphabet of $A_S$
  2. evaluate the rewriting over the source database $C$
- View-based query answering: no limitation is posed on how queries are processed, and the only goal is to exploit all possible information, in particular the source database, to compute the certain answers to the query

INT[noconst, LAV/sound]: connection to query containment
- If queries in $M$ are conjunctive queries, then we can substitute the query that $M$ associates to $s$ for every $s$-literal in $q_C$, and therefore checking certain answers can be reduced to checking pure containment (without constraints) of two queries in the alphabet $A_G$
- The data complexity is at most the complexity of query containment

INT[noconst, LAV/sound]: some results for query answering
- Conjunctive queries using conjunctive views [Levy&al. PODS 95]
- Recursive queries (datalog programs) using conjunctive views [Duschka&Genesereth PODS 97], [Afrati&al. ICDT 99]
- Complexity analysis [Abiteboul&Duschka PODS 98], [Grahne&Mendelzon ICDT 99]
- Variants of Regular Path Queries [Calvanese&al. ICDE 00, PODS 00], [Deutsch&Tannen DBPL 01], [Calvanese&al. DBPL 01]

INT[noconst, LAV/sound]: data complexity
From [Abiteboul&Duschka PODS 98] (rows: view language; columns: query language):
Sound sources | CQ | CQ$^{\neq}$ | PQ | datalog | FOL
CQ | PTIME | coNP | PTIME | PTIME | undec.
CQ$^{\neq}$ | PTIME | coNP | PTIME | PTIME | undec.
PQ | coNP | coNP | coNP | coNP | undec.
datalog | coNP | undec. | coNP | undec. | undec.
FOL | undec. | undec. | undec. | undec. | undec.

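INT[noconst, LAV/sound]: basic technique
Consider conjunctive queries and conjunctive views.
$r_1(T) \subseteq \{\, (T) \mid \mathit{movie}(T, Y, D) \land \mathit{european}(D) \,\}$
$r_2(T, V) \subseteq \{\, (T, V) \mid \mathit{movie}(T, Y, D) \land \mathit{review}(T, V) \,\}$
In logical form:
$\forall T\ r_1(T) \rightarrow \exists Y \exists D\ \mathit{movie}(T, Y, D) \land \mathit{european}(D)$
$\forall T \forall V\ r_2(T, V) \rightarrow \exists Y \exists D\ \mathit{movie}(T, Y, D) \land \mathit{review}(T, V)$
After Skolemization:
$\mathit{movie}(T, f_1(T), f_2(T)) \leftarrow r_1(T)$
$\mathit{european}(f_2(T)) \leftarrow r_1(T)$
$\mathit{movie}(T, f_4(T, V), f_5(T, V)) \leftarrow r_2(T, V)$
$\mathit{review}(T, V) \leftarrow r_2(T, V)$
Answering a query means evaluating a goal wrt this nonrecursive logic program (PTIME data complexity).
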
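The rules above can be run as a small Python sketch. It is an illustration under stated assumptions: Skolem terms are encoded as tagged tuples such as `("f2", t)`, the source tuples are hypothetical, and answers containing a Skolem term are discarded at the end, since only tuples of constants can be certain answers.

```python
def apply_inverse_rules(r1, r2):
    """Populate the global relations from the sources using the four rules."""
    movie, european, review = set(), set(), set()
    for (t,) in r1:
        movie.add((t, ("f1", t), ("f2", t)))        # movie(T, f1(T), f2(T)) <- r1(T)
        european.add((("f2", t),))                  # european(f2(T)) <- r1(T)
    for (t, v) in r2:
        movie.add((t, ("f4", t, v), ("f5", t, v)))  # movie(T, f4(T,V), f5(T,V)) <- r2(T,V)
        review.add((t, v))                          # review(T, V) <- r2(T, V)
    return movie, european, review

def is_skolem(value):
    # Assumption of this sketch: constants are never Python tuples.
    return isinstance(value, tuple)

# Hypothetical source database.
r1 = {("film_a",), ("film_b",)}
r2 = {("film_a", "good"), ("film_c", "bad")}
movie, european, review = apply_inverse_rules(r1, r2)

# Goal: q(T, V) :- movie(T, Y, D), european(D), review(T, V)
derived = {
    (t, v)
    for (t, _y, d) in movie if (d,) in european
    for (t2, v) in review if t2 == t
}
certain = {ans for ans in derived if not any(is_skolem(x) for x in ans)}
```

In this sample, `film_a` is known to be a movie by a european director (from `r1`) and to have a review (from `r2`), so it is returned; `film_c` is not, because nothing in the sources forces its director to be european.
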
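INT[noconst, LAV/sound]: polynomial intractability
Given a graph $G = (N, E)$, we define $I = \langle G, S, M \rangle$ and source database $C$:
$V_b \subseteq R_b$
$V_f \subseteq R_f$
$V_t \subseteq R_{rg} \cup R_{gr} \cup R_{rb} \cup R_{br} \cup R_{gb} \cup R_{bg}$
$V_b^C = \{\, (c, a) \mid a \in N, c \notin N \,\}$
$V_f^C = \{\, (a, d) \mid a \in N, d \notin N \,\}$
$V_t^C = \{\, (a, b), (b, a) \mid (a, b) \in E \,\}$
$Q \leftarrow R_b \cdot M \cdot R_f$
where $M$ describes all mismatched edge pairs (e.g., $R_{rg} \cdot R_{rb}$).
- If $G$ is 3-colorable, then $\exists B$ where $M$ (and $Q$) is empty, i.e., $(c, d) \notin Q^{I,C}$.
- If $G$ is not 3-colorable, then $M$ is nonempty $\forall B$, i.e., $(c, d) \in Q^{I,C}$.
$\Longrightarrow$ coNP-hard data complexity for positive queries and positive views.
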
INT[noconst, LAV/sound]: in coNP
Consider the case of Datalog queries and positive views.
- $t$ is not a certain answer to $Q$ wrt $I$ and $C$ if and only if there is a database $B$ for $I$ such that $t \notin Q^B$, and $B$ satisfies $M$ wrt $C$
- Because of the form of $M$,
  $\forall \vec{x}\ (s(\vec{x}) \rightarrow \exists \vec{y}_1\, \alpha_1(\vec{x}, \vec{y}_1) \lor \cdots \lor \exists \vec{y}_h\, \alpha_h(\vec{x}, \vec{y}_h))$
  each tuple in $C$ forces the existence of $k$ tuples in any database that satisfies $M$ wrt $C$, where $k$ is the maximal length of conjuncts in $M$
- If $C$ has $n$ tuples, then there is a database $B' \subseteq B$ for $I$ that satisfies $M$ wrt $C$ with at most $n \cdot k$ tuples. Since $Q$ is monotone, $t \notin Q^{B'}$.
- Checking whether $B'$ satisfies $M$ wrt $C$ can be done in PTIME wrt the size of $B'$.
$\Longrightarrow$ coNP data complexity for Datalog queries and positive views.

INT[noconst, LAV/sound]: the case of RPQ
We deal with the problem of answering queries to data integration systems of the form $\langle G, S, M \rangle$, where
- $G$ simply fixes the labels (alphabet $\Sigma$) of a semi-structured database
- the sources in $S$ are relational
- the mapping $M$ is of type LAV
- queries are typical of semi-structured data (variants of regular path queries)

Global semi-structured database
[Figure: an edge-labeled graph whose edges carry the labels sub, calls, and var.]

Global semi-structured databases and queries
[Figure: the same edge-labeled graph (labels sub, calls, var), with two nodes marked a and b.]
Regular Path Query (RPQ): $(\mathit{sub})^* \cdot (\mathit{sub} \cdot (\mathit{calls} \cup \mathit{sub}))^* \cdot \mathit{var}$

Global semi-structured databases and queries
[Figure: the same edge-labeled graph (labels sub, calls, var), with two nodes marked a and b.]
2RPQ: $(\mathit{sub}^-)^* \cdot (\mathit{var} \cup \mathit{sub})$

INT[noconst, LAV/sound]: the case of RPQ
Given
- $I = \langle G, S, M \rangle$, where
  - $G$ simply fixes the labels (alphabet $\Sigma$) of a semi-structured database
  - the sources in $S$ are binary relations
  - the mapping $M$ is of type LAV, and associates to each source $s$ a 2RPQ $w$ over $\Sigma$: $\forall x, y\ \ s(x, y) \rightarrow x \xrightarrow{w} y$
- a source database $C$
- a 2RPQ $Q$ over $\Sigma$
- a pair of objects $t$
we want to determine whether $t \in Q^{I,C}$.

Query answering: Technique
- We search for a counterexample to $t \in Q^{I,C}$, i.e., a database $B$ legal for $I$ wrt $C$ such that $t \notin Q^B$
- Crucial point: it is sufficient to restrict our attention to canonical databases, i.e., databases $B$ that can be represented by a word $w_B$
  $\$\ d_1\, w_1\, d_2\ \$\ d_3\, w_2\, d_4\ \$ \cdots \$\ d_{2m-1}\, w_m\, d_{2m}\ \$$
  where $d_1, \ldots, d_{2m}$ are constants in $C$, $w_i \in \Sigma^+$, and $\$$ acts as a separator
- Use word-automata theoretic techniques!

We need techniques for...
checking whether a pair of objects satisfies a 2RPQ query in the case of
- a word representing a path
- a word representing a semipath
- a word representing a canonical database

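Finite-state automata and RPQs
Path: $a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d$
$Q = r \cdot (p \cup q) \cdot q \cdot q^*$
Automaton for $Q$:
$s_1 \in \delta(s_0, r)$, $s_2 \in \delta(s_1, p)$, $s_2 \in \delta(s_1, q)$, $s_3 \in \delta(s_2, q)$, $s_3 \in \delta(s_3, q)$
The computation for RPQs is captured by finite-state automata.
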
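How a finite-state automaton captures an RPQ can be sketched in Python as a search over pairs (graph node, automaton state). The sketch assumes the path is $a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d$ and that the two transitions whose labels are not legible in the slide text read `r` and `p`; both the labels and the function name `rpq_pairs` are assumptions made for illustration.

```python
from collections import deque

def rpq_pairs(edges, delta, start, finals):
    """Return all (x, y) such that some path from x to y spells a word
    accepted by the automaton (delta: (state, label) -> set of states)."""
    nodes = {n for (u, _label, v) in edges for n in (u, v)}
    result = set()
    for src in nodes:
        seen = {(src, start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in finals:
                result.add((src, node))
            for (u, label, v) in edges:
                if u != node:
                    continue
                for nxt in delta.get((state, label), ()):
                    if (v, nxt) not in seen:
                        seen.add((v, nxt))
                        queue.append((v, nxt))
    return result

# Assumed database: the path a -r-> b -p-> c -q-> d.
edges = {("a", "r", "b"), ("b", "p", "c"), ("c", "q", "d")}

# Transitions as listed on the slide, with the assumed labels r and p.
delta = {
    ("s0", "r"): {"s1"},
    ("s1", "p"): {"s2"},
    ("s1", "q"): {"s2"},
    ("s2", "q"): {"s3"},
    ("s3", "q"): {"s3"},
}
pairs = rpq_pairs(edges, delta, "s0", {"s3"})
```

The search only ever follows edges forward, which is why this approach covers RPQs but not queries that traverse edges backward.
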
2way Regular Path Queries
2way Regular Path Queries (2RPQ) are expressed by means of finite-state automata over $\Sigma \cup \{\, p^- \mid p \in \Sigma \,\}$.
[Figure: an example 2RPQ with the corresponding finite-state automaton.]

Finite-state automata and 2RPQs
Path: $a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d$
Word: $r\ p\ q$
Query: $Q = r \cdot (p \cup q) \cdot p^- \cdot p \cdot q \cdot q^*$
Automaton for $Q$:
$s_1 \in \delta(s_0, r)$, $s_2 \in \delta(s_1, p)$, $s_2 \in \delta(s_1, q)$, $s_3 \in \delta(s_2, p^-)$, $s_4 \in \delta(s_3, p)$, $s_5 \in \delta(s_4, q)$, $s_5 \in \delta(s_5, q)$
Run of the automaton on the word:
1. State $s_0$, transition $s_1 \in \delta(s_0, r)$
2. State $s_1$, transition $s_2 \in \delta(s_1, p)$
3. State $s_2$, transition: none
$(a, d)$ satisfies query $Q$, but the path from $a$ to $d$ is not accepted by the 1NFA corresponding to $Q$: the computation for 2RPQs is not captured by finite-state automata.

2way automata (2NFA)
A 2way automaton $A = (\Gamma, S, S_0, \rho, F)$ consists of an alphabet $\Gamma$, a finite set of states $S$, a set of initial states $S_0 \subseteq S$, a transition function $\rho : S \times \Gamma \rightarrow 2^{S \times \{-1, 0, 1\}}$, and a set of accepting states $F \subseteq S$.
Given a 2way automaton $A$ with $n$ states, one can construct a one-way automaton $B_1$ with $O(2^{n \log n})$ states such that $L(B_1) = L(A)$, and a one-way automaton $B_2$ with $O(2^n)$ states such that $L(B_2) = \Gamma^* - L(A)$.

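2way automata and 2RPQs
Given a 2RPQ $E = (\Sigma, S, I, \delta, F)$ over the alphabet $\Sigma$, the corresponding 2way automaton $A_E$ is:
$(\Sigma_A = \Sigma \cup \{\$\},\ S_A = S \cup \{s_f\} \cup \{\, s^{\leftarrow} \mid s \in S \,\},\ I,\ \delta_A,\ \{s_f\})$
where $\delta_A$ is defined as follows:
- $(s_2, 1) \in \delta_A(s_1, r)$, for each transition $s_2 \in \delta(s_1, r)$ of $E$
- enter backward mode: $(s^{\leftarrow}, -1) \in \delta_A(s, l)$, for each $s \in S$ and $l \in \Sigma_A$
- exit backward mode: $(s_2, 0) \in \delta_A(s_1^{\leftarrow}, r)$, for each $s_2 \in \delta(s_1, r^-)$ of $E$
- $(s_f, 1) \in \delta_A(s, \$)$, for each $s \in F$
$\Longrightarrow$ $w$ satisfies $E$ iff $w\$ \in L(A_E)$.
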
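2way automata and 2RPQs
Path: $a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d$
$Q = r \cdot (p \cup q) \cdot p^- \cdot p \cdot q \cdot q^*$
Automaton for $Q$:
$s_1 \in \delta(s_0, r)$, $s_2 \in \delta(s_1, p)$, $s_2 \in \delta(s_1, q)$, $s_3 \in \delta(s_2, p^-)$, $s_4 \in \delta(s_3, p)$, $s_5 \in \delta(s_4, q)$, $s_5 \in \delta(s_5, q)$
2way automaton:
$(s_1, 1) \in \delta_A(s_0, r)$, $(s_2, 1) \in \delta_A(s_1, p)$, $(s_2, 1) \in \delta_A(s_1, q)$, $(s_2^{\leftarrow}, -1) \in \delta_A(s_2, q)$, $(s_3, 0) \in \delta_A(s_2^{\leftarrow}, p)$, $(s_4, 1) \in \delta_A(s_3, p)$, $(s_5, 1) \in \delta_A(s_4, q)$, $(s_f, 1) \in \delta_A(s_5, \$)$
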
2NFA and 2RPQs
Path: $a \xrightarrow{r} b \xrightarrow{p} c \xrightarrow{q} d$
Word: $r\ p\ q\ \$$
Query: $Q = r \cdot (p \cup q) \cdot p^- \cdot p \cdot q \cdot q^*$
2way automaton for $Q$:
$(s_1, 1) \in \delta_A(s_0, r)$, $(s_2, 1) \in \delta_A(s_1, p)$, $(s_2^{\leftarrow}, -1) \in \delta_A(s_2, q)$, $(s_3, 0) \in \delta_A(s_2^{\leftarrow}, p)$, $(s_4, 1) \in \delta_A(s_3, p)$, $(s_5, 1) \in \delta_A(s_4, q)$, $(s_f, 1) \in \delta_A(s_5, \$)$
Run of the automaton on the word:
1. State $s_0$, transition $(s_1, 1) \in \delta_A(s_0, r)$
2. State $s_1$, transition $(s_2, 1) \in \delta_A(s_1, p)$
3. State $s_2$, transition $(s_2^{\leftarrow}, -1) \in \delta_A(s_2, q)$
4. State $s_2^{\leftarrow}$, transition $(s_3, 0) \in \delta_A(s_2^{\leftarrow}, p)$
5. State $s_3$, transition $(s_4, 1) \in \delta_A(s_3, p)$
6. State $s_4$, transition $(s_5, 1) \in \delta_A(s_4, q)$
7. State $s_5$, transition $(s_f, 1) \in \delta_A(s_5, \$)$
8. State $s_f$
$(a, d)$ satisfies query $Q$, and the path from $a$ to $d$ is accepted by the 2NFA corresponding to $Q$: the computation for 2RPQs is captured by 2way automata.

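The difference between the one-way and the two-way run can be reproduced with a small Python simulator. This is a sketch under assumptions: the word is taken to be `r p q $`, the labels `r` and `p` and the backward state (named `s2_back` here) are reconstructions, and the function `accepts` is hypothetical. Transitions map a (state, symbol) pair to a list of (next state, head move) pairs, with moves in {-1, 0, 1}.

```python
from collections import deque

def accepts(delta, start, final, word):
    """Breadth-first search over configurations (state, head position)."""
    seen = set()
    queue = deque([(start, 0)])
    while queue:
        state, pos = queue.popleft()
        if state == final:
            return True
        if (state, pos) in seen or pos not in range(len(word)):
            continue
        seen.add((state, pos))
        for nxt, move in delta.get((state, word[pos]), ()):
            queue.append((nxt, pos + move))
    return False

word = ["r", "p", "q", "$"]

# 2way automaton: the inverse step is simulated by moving the head back
# (s2_back) and then re-reading p without moving.
two_way = {
    ("s0", "r"): [("s1", 1)],
    ("s1", "p"): [("s2", 1)],
    ("s1", "q"): [("s2", 1)],
    ("s2", "q"): [("s2_back", -1)],
    ("s2_back", "p"): [("s3", 0)],
    ("s3", "p"): [("s4", 1)],
    ("s4", "q"): [("s5", 1)],
    ("s5", "q"): [("s5", 1)],
    ("s5", "$"): [("sf", 1)],
}

# One-way automaton over labels and inverse labels: it would need to read
# the symbol "p-" after "r p", which never occurs in the word.
one_way = {
    ("s0", "r"): [("s1", 1)],
    ("s1", "p"): [("s2", 1)],
    ("s1", "q"): [("s2", 1)],
    ("s2", "p-"): [("s3", 1)],
    ("s3", "p"): [("s4", 1)],
    ("s4", "q"): [("s5", 1)],
    ("s5", "q"): [("s5", 1)],
    ("s5", "$"): [("sf", 1)],
}

accepted_two_way = accepts(two_way, "s0", "sf", word)
accepted_one_way = accepts(one_way, "s0", "sf", word)
```

The one-way run gets stuck in `s2` on the symbol `q`, while the two-way run steps back over `p`, re-reads it forward, and reaches the final state after the end marker.
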
2NFA and view extensions
[Figure: the labels of the global schema G, the 2RPQs associated to the sources with their extensions (d1,d2), (d4,d5), (d4,d2), (d3,d3), (d2,d3), and a database for G over the objects d1, d2, d3, d4, d5.]

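2NFA and view extensions
[Figure: a database B for G over the objects d1, d2, d3, d4, d5.]
Database $B$ as a word: one segment for each pair in the source extensions, of the form $\$\ d_4\, w_1\, d_5\ \$\ d_1\, w_2\, d_2\ \$\ d_4\, w_3\, d_2\ \$\ d_3\, w_4\, d_3\ \$\ d_2\, w_5\, d_3\ \$$, where each $w_i \in \Sigma^+$ is the sequence of edge labels on the corresponding path.
To verify that $(d_1, d_3)$ satisfies $Q$ in the above database $B$, we build $A_{(Q, d_1, d_3)}$, by exploiting not only the ability of 2way automata to move on the word both forward and backward, but also the ability to jump from one position in the word representing a node to any other position (either preceding or succeeding) representing the same node.
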
A run of $A_{(Q, d_1, d_3)}$
The automaton runs on the word representing the database $B$ (segments for $(d_4, d_5)$, $(d_1, d_2)$, $(d_4, d_2)$, $(d_3, d_3)$, $(d_2, d_3)$, separated by $\$$):
1. State $s_0$: $(s_0, 1) \in \delta_A(s_0, l)$, for each $l \in \Sigma_A$. The head scans forward over the first segment until it reaches $d_1$.
2. State $s_0$ on $d_1$: $(s_1, 0) \in \delta_A(s_0, d_1)$; $s_1$ is the initial state for $Q$.
3. State $s_1$ on $d_1$: $(s_1, 1) \in \delta_A(s_1, d_1)$.
4. State $s_1$ on the edge label following $d_1$: the automaton moves to $(s_2, 1)$ (transition coming from $Q$).
5. State $s_2$ on $d_2$: $((s_2, d_2), 1) \in \delta_A(s_2, d_2)$, search for $d_2$.
6. State $(s_2, d_2)$: the search continues with $((s_2, d_2), 1)$ on $\$$, on $d_4$, and on the edge label of the segment for $(d_4, d_2)$.
7. State $(s_2, d_2)$ on $d_2$: $(s_2, 0) \in \delta_A((s_2, d_2), d_2)$, exit search mode.
8. State $s_2$ on $d_2$: $(s_2^{\leftarrow}, -1) \in \delta_A(s_2, d_2)$, backward mode.
9. State $s_2^{\leftarrow}$ on the edge label preceding $d_2$: the automaton moves to $(s_3, 0)$ (transition coming from $Q$).
10. State $s_3$ on the same edge label: the automaton moves to $(s_4, 1)$ (transition coming from $Q$).
11. State $s_4$ on $d_2$: $((s_4, d_2), 1) \in \delta_A(s_4, d_2)$, search for $d_2$.
12. State $(s_4, d_2)$: the search continues with $((s_4, d_2), 1)$ on $\$$, on $d_3$, on the two edge labels of the segment for $(d_3, d_3)$, on $d_3$, and on $\$$.
13. State $(s_4, d_2)$ on $d_2$: $(s_4, 0) \in \delta_A((s_4, d_2), d_2)$, exit search mode.
14. State $s_4$ on $d_2$: $(s_4, 1) \in \delta_A(s_4, d_2)$.
15. State $s_4$ on the edge label following $d_2$: the automaton moves to $(s_5, 1)$ (transition coming from $Q$).
16. State $s_5$ on $q$: $(s_6, 1) \in \delta_A(s_5, q)$ (transition coming from $Q$).
17. State $s_6$ on $d_3$: $(s_7, 0) \in \delta_A(s_6, d_3)$; $s_7$ is the final state.
18. State $s_7$: $(s_7, 1) \in \delta_A(s_7, d_3)$, then $(s_7, 1) \in \delta_A(s_7, \$)$.
19. State $s_7$, final state: the word is accepted by $A_{(Q, d_1, d_3)}$.

Query answering: Technique
To check whether $(c, d) \notin Q^{I,C}$, we check for nonemptiness of $A$, that is the intersection of
- the one-way automaton $A_0$ that accepts words that represent databases, i.e., words of the form $(\$ \cdot C \cdot \Sigma^+ \cdot C)^* \cdot \$$
- the one-way automata corresponding to the various $A_{(S_i, a, b)}$ (for each source $S_i$ and for each pair $(a, b) \in S_i^C$)
- the one-way automaton corresponding to the complement of $A_{(Q, c, d)}$
Indeed, any word accepted by such intersection automaton represents a counterexample to $(c, d) \in Q^{I,C}$.

Query answering: Complexity
  All two-way automata constructed above are of linear size in the size of Q, the queries associated to S_1, ..., S_k, and S_1^C, ..., S_k^C. Hence, the corresponding one-way automata would be exponential.
  However, we do not need to construct A explicitly. Instead, we can construct it on the fly while checking for nonemptiness.
  ⇒ Query answering for 2RPQs is PSPACE-complete in combined complexity (as for RPQs).
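The on-the-fly idea can be illustrated with a minimal Python sketch. It assumes ordinary one-way nondeterministic automata given as dictionaries with hypothetical keys `initial`, `final` and `delta`; the translation from two-way automata is not covered. The sketch generates product states lazily instead of building the intersection automaton first. It still stores the visited states, so it shows the lazy construction only, not the space bound.

```python
from collections import deque
from itertools import product

def intersection_nonempty(automata, alphabet):
    """Return True iff some word is accepted by all automata.

    Each automaton is a dict with keys 'initial' (set of states),
    'final' (set of states) and 'delta' ((state, symbol) -> set of states).
    Product states are tuples with one component per automaton and are
    generated only when the search reaches them.
    """
    queue = deque(product(*(a["initial"] for a in automata)))
    seen = set(queue)
    while queue:
        state = queue.popleft()
        if all(q in a["final"] for q, a in zip(state, automata)):
            return True
        for symbol in alphabet:
            targets = [a["delta"].get((q, symbol), set())
                       for q, a in zip(state, automata)]
            for nxt in product(*targets):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# Hypothetical toy automata over the alphabet {a, b}.
even_a = {"initial": {0}, "final": {0},
          "delta": {(0, "a"): {1}, (1, "a"): {0}, (0, "b"): {0}, (1, "b"): {1}}}
ends_a = {"initial": {"s"}, "final": {"t"},
          "delta": {("s", "a"): {"s", "t"}, ("s", "b"): {"s"}}}
no_a = {"initial": {"n"}, "final": {"n"}, "delta": {("n", "b"): {"n"}}}

print(intersection_nonempty([even_a, ends_a], "ab"))  # the word "aa" is a witness
print(intersection_nonempty([ends_a, no_a], "ab"))
```

In the setting of the slides, a witness word would encode a database that is a counterexample to the tuple being a certain answer.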
Complexity of query answering for 2RPQs: the complete picture
  From [Calvanese&al. PODS 00]:

  Assumption   Assumption   Complexity
  on domain    on views     data    expression   combined
  closed       all sound    coNP    coNP         coNP
  closed       all exact    coNP    coNP         coNP
  closed       arbitrary    coNP    coNP         coNP
  open         all sound    coNP    PSPACE       PSPACE
  open         all exact    coNP    PSPACE       PSPACE
  open         arbitrary    coNP    PSPACE       PSPACE
INT[noconst, LAV/sound]: Connection to rewriting
  Query answering by rewriting:
  - Given I = ⟨G, S, M⟩, and given a query Q over G, rewrite Q into a query, called rew(Q, I), in the alphabet A_S of the sources
  - Evaluate the rewriting rew(Q, I) over the source database
  We are interested in sound rewritings (computing only certain answers, for every source database C) that are expressed in a given query language, and that are maximal for the class of queries expressible in such language. Sometimes, we are interested in exact rewritings, i.e., rewritings that are logically equivalent to the query, modulo M.
  But:
  - When does the rewriting compute all certain answers?
  - What do we gain or lose by focusing on a given class of queries?

Perfect rewriting
  Let cert(Q, I, C) be the function that, given query Q, data integration system I, and source database C, computes the certain answers Q^{I,C} to Q wrt I and C.
  Define cert_[Q,I](·) to be the function that, with Q and I fixed, given source database C, computes the certain answers Q^{I,C}.
  - cert_[Q,I] can be seen as a query on the alphabet A_S that, given C, returns Q^{I,C}
  - cert_[Q,I] is a (sound) rewriting of Q wrt I
  - No sound rewriting exists that is better than cert_[Q,I]
  - cert_[Q,I] is called the perfect rewriting of Q wrt I

Properties of the perfect rewriting
  - Can we express the perfect rewriting in a certain query language?
  - How does a maximal rewriting for a given class of queries compare with the perfect rewriting?
    - From a semantical point of view
    - From a computational point of view
  - Which is the computational complexity of (finding, evaluating) the perfect rewriting?
The case of conjunctive queries
  Let I = ⟨G, S, M⟩ be a LAV/sound data integration system, let Q and the queries in M be CQs, and let Q′ be the union of all maximal rewritings of Q for the class of CQs. Then ([Levy&al. PODS 95], [Duschka&al. 97], [Abiteboul&al. PODS 98])
  - Q′ is the maximal rewriting for the class of unions of conjunctive queries (UCQs)
  - Q′ is the perfect rewriting of Q wrt I
  - Q′ is a PTIME query
  - Q′ is an exact rewriting (equivalent to Q for each database B of I), if an exact rewriting exists
  Does this "ideal situation" carry on to cases where Q and M allow for union?
Unions of path queries (UPQs)
  Very simple query language (called UPQ) defined as follows:
    Q → P | Q1 ∪ Q2
    P → R | P1 ∘ P2
  R denotes a binary database relation, P denotes a path query, which is a chaining of database relations, and Q denotes a union of path queries.
  UPQs are a simple form of
  - Unions of conjunctive queries
  - Regular path queries
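To make the grammar concrete, here is a small Python sketch that evaluates a UPQ directly over a database of binary relations. The encoding (a UPQ as a list of paths, each path a list of relation names) and the relation names `R` and `S` are assumptions made for illustration only.

```python
def eval_path(path, db):
    """Evaluate a path query: the chaining (composition) of binary relations."""
    result = set(db[path[0]])
    for rel in path[1:]:
        result = {(x, z) for (x, y) in result for (y2, z) in db[rel] if y == y2}
    return result

def eval_upq(upq, db):
    """Evaluate a union of path queries as the union of the path answers."""
    answers = set()
    for path in upq:
        answers |= eval_path(path, db)
    return answers

# Hypothetical database with two binary relations.
db = {"R": {(1, 2), (2, 3)}, "S": {(3, 4)}}
upq = [["R", "R"], ["R", "S"]]   # (R ∘ R) ∪ (R ∘ S)
print(eval_upq(upq, db))
```

Each path is also a conjunctive query (a chain of joins) and a regular path query (a concatenation of labels), which is why UPQs sit inside both classes.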
View-based query processing for UPQs
  View-based query answering for UPQs is coNP-complete in data complexity [Calvanese&al. ICDE 00].
  In other words, cert(Q, I, C), with Q and I fixed, is a coNP-complete function.
  ⇒ The perfect rewriting cert_[Q,I] is a coNP-complete query.
  For query languages that include UPQs the perfect rewriting is coNP-hard: we do not have the ideal situation we had for conjunctive queries.
  Problem: Isolate those UPQs Q and I for which the perfect rewriting cert_[Q,I] is a PTIME function (assuming P ≠ NP) [Calvanese&al. LICS 00].
Incompleteness and inconsistency

  Constraints in G   Type of mapping   Incompleteness   Inconsistency
  no                 GAV/exact         no               no
  no                 GAV/sound         yes/no           no
  no                 LAV/sound         yes              no
  no                 LAV/exact         yes              yes
  yes                GAV/exact         no               yes
  yes                GAV/sound         yes              yes
  yes                LAV/sound         yes              yes
  yes                LAV/exact         yes              yes
INT[noconst, LAV/exact]: inconsistency
  The LAV mapping assertions have the logical form:
    ∀x (s(x) ≡ φ_G(x))
  In general, given a source database C, there may be no solution of the above assertions (i.e., no database that is legal wrt G and that satisfies M wrt C).
  Example:
    s1(x) ≡ { (x) | g(x) }
    s2(x) ≡ { (x) | g(x) }
  with s1^C = {1}, and s2^C = {2}.
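The example can be checked by brute force. The sketch below assumes, for illustration only, a small finite domain {1, 2, 3} (the slides work over a fixed infinite domain) and reads both views as defined by the same global relation g. It lists the extensions of g that satisfy the two assertions under the exact reading and, for contrast, under the sound reading.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

domain = {1, 2, 3}      # assumed finite domain, for illustration
s1, s2 = {1}, {2}       # source extensions s1^C and s2^C

# Exact views: the extension of g must coincide with both s1^C and s2^C.
exact_models = [set(g) for g in powerset(domain) if set(g) == s1 and set(g) == s2]
# Sound views: the extension of g must only contain s1^C and s2^C.
sound_models = [set(g) for g in powerset(domain) if s1 <= set(g) and s2 <= set(g)]

print(exact_models)   # no extension of g works: the system is inconsistent
print(sound_models)
```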
INT[noconst, LAV/exact]
  [Figure: sources and source model, mapping, retrieved GDBs, global schema and models of I, for the two cases labeled Incompleteness and Inconsistency]
INT[noconst, LAV/exact]: some results for query answering
  - Complexity analysis (sound, complete, exact) [Abiteboul&Duschka PODS 98] [Grahne&Mendelzon ICDT 99]
  - Variants of Regular Path Queries [Calvanese&al. ICDE 00, PODS 00]

INT[noconst, LAV/exact]: data complexity
  From [Abiteboul&Duschka PODS 98]:

  Sound sources
  views \ query   CQ       CQ≠      PQ       datalog   FOL
  CQ              PTIME    coNP     PTIME    PTIME     undec.
  CQ≠             PTIME    coNP     PTIME    PTIME     undec.
  PQ              coNP     coNP     coNP     coNP      undec.
  datalog         coNP     undec.   coNP     undec.    undec.
  FOL             undec.   undec.   undec.   undec.    undec.

  Exact sources
  views \ query   CQ       CQ≠      PQ       datalog   FOL
  CQ              coNP     coNP     coNP     coNP      undec.
  CQ≠             coNP     coNP     coNP     coNP      undec.
  PQ              coNP     coNP     coNP     coNP      undec.
  datalog         undec.   undec.   undec.   undec.    undec.
  FOL             undec.   undec.   undec.   undec.    undec.
INT[noconst, LAV/exact]: polynomial intractability
  Given a graph G = (N, E), we define I = ⟨G, S, M⟩, and source database C:
    V1 ≡ { (X) | color(X, Y) }        V1^C = N
    V2 ≡ { (Y) | color(X, Y) }        V2^C = { red, green, blue }
    V3 ≡ { (X, Y) | edge(X, Y) }      V3^C = E
    Q: { () | edge(X, Y) ∧ color(X, Z) ∧ color(Y, Z) }
  Q^{I,C} is true if and only if G is not 3-colorable.
  ⇒ coNP-hard data complexity for conjunctive queries and views.
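The reduction can be checked on small graphs by brute force. The sketch below assumes the exact reading of the three views: it enumerates every color relation whose projections are exactly the node set and the three colors, with the edge relation fixed to E, and tests whether the boolean query holds in all of them. The function names and the two test graphs are illustrative assumptions. The enumeration is exponential, in line with the hardness result.

```python
from itertools import combinations, product

COLORS = ("red", "green", "blue")

def color_relations(nodes):
    """Yield each relation color ⊆ nodes × COLORS whose first projection is
    exactly the node set (view V1) and whose second projection is exactly
    COLORS (view V2)."""
    pairs = list(product(nodes, COLORS))
    for size in range(len(pairs) + 1):
        for chosen in combinations(pairs, size):
            if ({n for n, _ in chosen} == set(nodes)
                    and {c for _, c in chosen} == set(COLORS)):
                yield set(chosen)

def q_holds(edges, color):
    """Boolean query Q: some edge has both endpoints sharing a color."""
    return any((x, z) in color and (y, z) in color
               for (x, y) in edges for z in COLORS)

def certain_answer(nodes, edges):
    """Q is certain iff it holds in every model (edge is fixed to E by V3)."""
    return all(q_holds(edges, color) for color in color_relations(nodes))

triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
k4_nodes = ["a", "b", "c", "d"]
k4 = (k4_nodes, list(combinations(k4_nodes, 2)))

print(certain_answer(*triangle))  # triangle is 3-colorable, so Q is not certain
print(certain_answer(*k4))        # K4 is not 3-colorable, so Q is certain
```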
INT[const, GAV/exact]: inconsistency
  Given one source database C, there is only one database for G that satisfies the mapping wrt C. If this is not legal wrt G, then the system is inconsistent (I has no model); otherwise, the case is similar to INT[noconst, GAV/exact].

  Source database C:
    s1^C = { (15, anne, florence, 21), (15, bill, oslo, 24) }
    s2^C = { (AF, bocconi), (BN, ucla) }
    s3^C = { (12, AF), (16, BN) }
  Retrieved global database:
    University (code, name): (AF, bocconi), (BN, ucla)
    Student (code, name, city): (15, bill, oslo), (15, anne, florence)
    Enrolled (Scode, Ucode): (12, AF), (16, BN)
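The legality test for the single retrieved database can be sketched in a few lines of Python. The sketch assumes the mapping and key constraints of the running student/university example (student is the projection of s1 on its first three columns, with the student code as key); the helper name `key_violations` is hypothetical.

```python
def key_violations(rows, key_positions):
    """Return the key values shared by two different tuples of a relation."""
    seen = {}
    violations = set()
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        if key in seen and seen[key] != row:
            violations.add(key)
        seen.setdefault(key, row)
    return violations

s1 = {(15, "anne", "florence", 21), (15, "bill", "oslo", 24)}
s2 = {("AF", "bocconi"), ("BN", "ucla")}

# GAV/exact: the retrieved global database is determined by the sources.
student = {(code, name, city) for (code, name, city, _) in s1}
university = set(s2)

print(key_violations(student, [0]))     # code 15 is used by two students
print(key_violations(university, [0]))
```

Since the only database that satisfies the mapping violates the key of student, the system has no model for this source database.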
INT[const, GAV/exact]
  [Figure: sources and source model, mapping, retrieved GDB, global schema and models of I; the case labeled Inconsistency]
INT[const, GAV/sound]: incompleteness
  Let us consider a system with a global schema with constraints, and with a GAV mapping M with sound sources, whose assertions have the form
    g ⇝ φ_S
  with the meaning
    ∀x (φ_S(x) → g(x))
  Since G does have constraints, we cannot simply limit our attention to one database of the integration system (as we did for INT[noconst, GAV/exact] and INT[noconst, GAV/sound]).
INT[const, GAV/sound]
  [Figure: sources and source model, mapping, retrieved GDBs, global schema and models of I, for the two cases labeled Incompleteness and Inconsistency]
INT[const, GAV/sound]: example
  Global schema G:
    student(Scode, Sname, Scity)       key{Scode}
    university(Ucode, Uname)           key{Ucode}
    enrolled(Scode, Ucode)             key{Scode, Ucode}
    enrolled[Scode] ⊆ student[Scode]
    enrolled[Ucode] ⊆ university[Ucode]
  Sources S: database relations s1, s2, s3
  Mapping M:
    student ⇝ { (X, Y, Z) | s1(X, Y, Z, W) }
    university ⇝ { (X, Y) | s2(X, Y) }
    enrolled ⇝ { (X, W) | s3(X, W) }
Constraints in GAV/sound: example
  Example of source database and corresponding retrieved global database:
  Source database C:
    s1^C = { (12, anne, florence, 21), (15, bill, oslo, 24) }
    s2^C = { (AF, bocconi), (BN, ucla) }
    s3^C = { (12, AF), (16, BN) }
  Retrieved global database:
    University (code, name): (AF, bocconi), (BN, ucla)
    Student (code, name, city): (15, bill, oslo), (12, anne, florence), (16, ?, ?)
    Enrolled (Scode, Ucode): (12, AF), (16, BN)

Constraints in GAV/sound: example
  Source database C:
    s1^C = { (12, anne, florence, 21), (15, bill, oslo, 24) }
    s2^C = { (AF, bocconi), (BN, ucla) }
    s3^C = { (12, AF), (16, BN) }
  s3^C(16, BN) implies enrolled^B(16, BN), for all B ∈ sem^C(I).
  Due to the integrity constraints in the global schema, 16 is the code of some student in all B ∈ sem^C(I).
  Since C says nothing about the name and the city of the student with code 16, we must accept as legal for I wrt C all virtual global databases that differ in such attributes.
INT[const, GAV/sound]: unfolding is not sufficient
  Mapping M:
    student ⇝ { (X, Y, Z) | s1(X, Y, Z, W) }
    university ⇝ { (X, Y) | s2(X, Y) }
    enrolled ⇝ { (X, W) | s3(X, W) }
  Source database C:
    s1^C = { (12, anne, florence, 21), (15, bill, oslo, 24) }
    s2^C = { (AF, bocconi), (BN, ucla) }
    s3^C = { (12, AF), (16, BN) }
  Query: { (X) | student(X, Y, Z), enrolled(X, W) }
  Unfolding wrt M: { (X) | s1(X, Y, Z, V), s3(X, W) }
  retrieves only the answer {12} from C, although {12, 16} is the correct answer.
  The simple unfolding strategy is not sufficient in our context.
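The gap between plain unfolding and the certain answers can be reproduced on this source database. The Python sketch below is hand-derived for this one query, not a general algorithm: it assumes the foreign key enrolled[Scode] ⊆ student[Scode] of the running example, so every code occurring in s3 is also the code of some student.

```python
s1 = {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)}
s3 = {(12, "AF"), (16, "BN")}

# Plain unfolding: join s1 and s3 on the student code.
unfolded = {x for (x, _, _, _) in s1 for (x2, _) in s3 if x == x2}

# Taking the foreign key into account: a student atom may also be
# witnessed by an enrolled tuple with the same code.
student_codes = {x for (x, _, _, _) in s1} | {x for (x, _) in s3}
expanded = {x for (x, _) in s3 if x in student_codes}

print(unfolded)   # {12}
print(expanded)   # {12, 16}
```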
INT[const, GAV/sound]: special case
  We assume that only key and foreign key constraints are in G, and M does not violate any key constraint of G (see later), and we associate to G a logic program P_G, as follows.
  - For each g in G we have a rule in P_G of the form:
      g′(X1, ..., Xn) ← g(X1, ..., Xn)
  - For each foreign key constraint g1[A] ⊆ g2[B] in G, where A and B are sets of attributes, we have a rule in P_G of the form (the f_i's are fresh Skolem functions):
      g2′(X1, ..., Xh, f_1(X1, ..., Xh), ..., f_{n−h}(X1, ..., Xh)) ← g1′(X1, ..., Xh, ..., Xm)

INT[const, GAV/sound]: special case
  Techniques for processing a conjunctive query q posed to I = ⟨G, S, M⟩:
  - We construct P_G from G
  - We partially evaluate P_G wrt q, and we obtain another query exp_G(q), called the expansion of q wrt the constraints of G
  - We unfold exp_G(q) wrt M, and obtain a query unf_M(exp_G(q)) over the sources
  - We evaluate unf_M(exp_G(q)) over the source database C
  exp_G(q) can be of exponential size wrt G, but the whole process has polynomial time complexity wrt the size of C.
INT[const, GAV/sound]: example
  Suppose we have I = ⟨G, S, M⟩, with G:
    person(Pcode, Age, CityOfBirth)    key(person) = {Pcode}
    student(Scode, University)         key(student) = {Scode}
    city(Name, Major)                  key(city) = {Name}
    person[CityOfBirth] ⊆ city[Name]
    city[Major] ⊆ person[Pcode]
    student[Scode] ⊆ person[Pcode]

INT[const, GAV/sound]: example
  The logic program P_G is
    person′(X, Y, Z) ← person(X, Y, Z)
    student′(X, Y) ← student(X, Y)
    city′(X, Y) ← city(X, Y)
    city′(X, f_1(X)) ← person′(Y, Z, X)
    person′(Y, f_2(Y), f_3(Y)) ← city′(X, Y)
    person′(X, f_4(X), f_5(X)) ← student′(X, Y)
  Consider the query { (X) | person(X, Y, Z) } written as the rule
    q(X) ← person′(X, Y, Z)

INT[const, GAV/sound]: example
  [Figure: partial evaluation tree rooted at person′(X, Y, Z), with nodes person(X, Y, Z), student′(X, W1), city′(W2, X), student(X, W1), city(W2, X)]
  exp_G(q) is { (X) | person(X, Y, Z) ∨ student(X, W) ∨ city(Z, X) }
INT[const, LAV/sound]
  [Figure: sources and source model, mapping, retrieved GDBs, global schema and models of I, for the two cases labeled Incompleteness and Inconsistency]

INT[const, LAV/sound]
  - With functional dependencies [Duschka 97]
  - With full dependencies [Duschka 97]
  - With inclusion dependencies [Gryz 97]
  - With Description Logics integrity constraints [Calvanese&al. AAAI 00]
INT[const, LAV/exact]
  [Figure: sources and source model, mapping, retrieved GDBs, global schema and models of I, for the cases labeled Incompleteness and Inconsistency]

INT[const, LAV/exact]
  - With Description Logics integrity constraints [Calvanese&al. AAAI 00]
  - Largely unexplored problem
INT[const, GAV/sound]: Dealing with inconsistency
  When, for data integration system I = ⟨G, S, M⟩ and source database C, we have sem^C(I) = ∅, the first-order setting described above is not adequate.
  - [Subrahmanian ACM-TODS 94]
  - [Grant&al. IEEE-TKDE 95]
  - [Dung CoopIS 96]
  - [Lin&al. JICIS 98]
  - [Yan&al. CoopIS 99]
  - [Arenas&al. PODS 99]
  - [Greco&al. LPAR 00]
  - many approaches to KB revision and KB/DB update
Beyond first-order logic: example
    key(player) = {Pcode}
    key(team) = {Tcode}
    player[Pteam] ⊆ team[Tcode]
    team[Tleader] ⊆ player[Pcode]

    player ⇝ { (X, Y, Z) | s1(X, Y, Z, W) }
    team ⇝ { (X, Y, Z) | s2(X, Y, Z) ∨ s3(X, Y, Z) }

    s1^C: (9, Batistuta, RM, 31), (10, Rivaldo, BC, 29)
    s2^C: (RM, Roma, 8), (BC, Barcelona, 10)
    s3^C: (RM, Roma, 9)
Beyond first-order logic: a proposal
  Given I = ⟨G, S, M⟩, with a GAV/sound mapping M = { r_1 ⇝ V_1, ..., r_n ⇝ V_n }, and source database C for S, we would like to focus on those databases for I that
  1. satisfy G (constraints in G are rigid), and
  2. approximate as much as possible the satisfaction of the mapping M wrt C (assertions in M are soft).

Beyond first-order logic: a proposal
  We define an ordering between the global databases for I as follows. If B1 and B2 are two databases that satisfy G, we say that B1 is better than B2 wrt I and C, denoted as B1 ≫_I^C B2, if there exists an assertion r_i ⇝ V_i in M such that
  - (r_i^{B1} ∩ V_i^C) ⊋ (r_i^{B2} ∩ V_i^C), and
  - (r_j^{B1} ∩ V_j^C) ⊇ (r_j^{B2} ∩ V_j^C) for all r_j ⇝ V_j in M with j ≠ i.
  Intuitively, B1 has fewer deletions than B2 wrt the retrieved global database (see [Fagin&al. PODS 83]), and since the mapping is sound, this means that B1 is closer than B2 to the retrieved global database. In other words, B1 approximates the sound mapping better than B2.
Example
  Consider I = ⟨G, S, M⟩, with G containing relation r(x, y) with key x, S containing relations s1(x, y) and s2(x, y),
    M = { r ⇝ { (x, y) | s1(x, y) ∨ s2(x, y) } }
  and consider the source database C = { s1(a, d), s1(b, d), s2(a, e) }, so that the retrieved global database is { r(a, d), r(b, d), r(a, e) }.
  We have that
  - { r(a, d), r(b, d) } ≫_I^C { r(a, d) }, and { r(a, e), r(b, d) } ≫_I^C { r(a, e) }
  - { r(a, d), r(b, d) } and { r(a, e) } are incomparable
  - { r(a, e), r(b, d), r(c, e) } and { r(a, e), r(b, d) } are incomparable
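The ordering in this example can be replayed with a short Python sketch. It assumes the single mapping assertion of the example, so "better" reduces to keeping strictly more retrieved tuples. As a simplifying assumption, candidate databases are restricted to key-consistent subsets of the retrieved global database; tuples outside it, such as r(c, e), do not affect the comparison.

```python
from itertools import combinations

retrieved = {("a", "d"), ("b", "d"), ("a", "e")}

def satisfies_key(db):
    """Key constraint: the first attribute of r is a key."""
    keys = [x for x, _ in db]
    return len(keys) == len(set(keys))

def better(b1, b2):
    """b1 is better than b2: it retains a strict superset of the retrieved tuples."""
    return (b1 & retrieved) > (b2 & retrieved)

candidates = [set(c)
              for k in range(len(retrieved) + 1)
              for c in combinations(sorted(retrieved), k)
              if satisfies_key(set(c))]
maximal = [b for b in candidates if not any(better(o, b) for o in candidates)]

print(maximal)   # the two maximal key-consistent databases
```

The two maximal databases keep r(b, d) and differ only in which of the conflicting tuples r(a, d), r(a, e) they retain.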
Beyond first-order logic: a proposal
  ≫_I^C is a partial order.
  A database B that satisfies G satisfies the mapping M with respect to C if B is maximal wrt ≫_I^C, i.e., for no other global database B′ that satisfies G, we have that B′ ≫_I^C B:
    sem^C(I) = { B | B is a database that satisfies G, and such that ¬∃B′ such that B′ satisfies G and B′ ≫_I^C B }
  The notion of legal database for I with respect to C, and the notion of certain answer remain the same, given the new definition of satisfaction of mapping.

Beyond first-order logic: special case of INT[const, GAV/sound]
  We assume that only key and foreign key constraints are in G.
  Given I = ⟨G, S, M⟩, and source database C, we define the DATALOG¬ program P(I, C) obtained by adding to the set of facts C the following set of rules:
  - for each g ⇝ { (x) | body_1(x, y_1) ∨ ... ∨ body_m(x, y_m) } in M, the rules:
      g_C(X) ← body_1(X, Y_1)
      ...
      g_C(X) ← body_m(X, Y_m)
  - for each relation g ∈ G, the rules
      g(X, Y) ← g_C(X, Y), not ḡ(X, Y)
      ḡ(X, Y) ← g(X, Z), Y ≠ Z
    In g(X, Y), X is the key of g; Y ≠ Z means that there exists i such that Y_i ≠ Z_i.
Beyond first-order logic: a proposal
  The above rules force each stable model T of P(I, C) to be such that, for each g in G, g^T is a maximal subset of the tuples from the retrieved global database that are consistent with the key constraint for g.
  t ∈ q^{I,C} under the new semantics if and only if t ∈ q^T for each stable model T of the DATALOG¬ program P(I, C) ∪ {exp_G(q)}.
  A stable model of a DATALOG¬ program Π is any set σ of ground atoms that coincides with the unique minimal Herbrand model of the DATALOG program Π_σ, where Π_σ is obtained from Π by deleting
  - every rule that has a negative literal ¬B with B ∈ σ, and
  - all negative literals in the bodies of the remaining rules.
  ⇒ The problem of deciding whether t ∈ q^{I,C} is in coNP wrt data complexity (coNP-complete).
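The definition of stable model can be turned into a direct check for ground programs. The Python sketch below assumes a program given as a list of hypothetical triples (head, positive body, negative body) over atom names. It builds the reduct with respect to a candidate set σ, computes the minimal model of the resulting negation-free program, and compares it with σ. The toy program is a hand-grounded instance of the key-conflict rules for two retrieved tuples r(a, d) and r(a, e); the atom names (rC_ad, nr_ad, and so on) are illustrative.

```python
def least_model(positive_rules):
    """Minimal Herbrand model of a ground program without negation."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in positive_rules:
            if head not in model and set(body) <= model:
                model.add(head)
                changed = True
    return model

def is_stable(program, sigma):
    """Check whether sigma is a stable model of a ground program with negation."""
    # Reduct: drop rules with a negative literal 'not B' where B is in sigma,
    # then drop the negative literals of the remaining rules.
    reduct = [(head, pos) for (head, pos, neg) in program if not (set(neg) & sigma)]
    return least_model(reduct) == sigma

program = [
    ("rC_ad", [], []),                 # retrieved tuple r(a, d)
    ("rC_ae", [], []),                 # retrieved tuple r(a, e)
    ("r_ad", ["rC_ad"], ["nr_ad"]),    # r(a,d) <- rC(a,d), not r-bar(a,d)
    ("r_ae", ["rC_ae"], ["nr_ae"]),    # r(a,e) <- rC(a,e), not r-bar(a,e)
    ("nr_ad", ["r_ae"], []),           # r-bar(a,d) <- r(a,e), d != e
    ("nr_ae", ["r_ad"], []),           # r-bar(a,e) <- r(a,d), e != d
]

keep_ad = {"rC_ad", "rC_ae", "r_ad", "nr_ae"}
keep_both = {"rC_ad", "rC_ae", "r_ad", "r_ae", "nr_ad", "nr_ae"}

print(is_stable(program, keep_ad))    # keeps exactly one of the conflicting tuples
print(is_stable(program, keep_both))  # violates the key, so it is not stable
```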
Reasoning on queries and views in data integration
  Traditional query containment is not adequate.
  Global schema:
    movie(Title, Year, Director)
    review(Title, Critique)
  Mapping:
    r1(T, Y, D) ⇝ { (T, Y, D) | movie(T, Y, D) ∧ Y ≥ 1960 }
    r2(T, R) ⇝ { (T, R) | review(T, R) ∧ R ≥ 8 }
  Queries:
    Q1: { (T, R) | movie(T, 1998, D) ∧ review(T, R) }
    Q2: { (T, R) | movie(T, 1998, D) ∧ review(T, R) ∧ R ≥ 8 }
  Q1 is not contained in Q2 in the traditional sense, but is contained in Q2 relative to I.

Relative containment [Millstein&al. PODS 00]
  Given data integration system I = ⟨G, S, M⟩, a query Q1 is said to be contained in query Q2 relative to I (written Q1 ⊆_I Q2) if, for every source database C, the set of certain answers to Q1 wrt I and C is contained in the set of certain answers to Q2 wrt I and C, i.e., if
    ∀C, cert(Q1, I, C) ⊆ cert(Q2, I, C)
  For LAV/sound systems with conjunctive queries in the mapping, deciding relative containment of two conjunctive queries is Π_2^p-complete [Millstein&al. PODS 00].
Lossless views
  Given LAV data integration system I = ⟨G, S, M⟩, and query Q, I is said to be lossless wrt Q if, for every global database B for I and for every source database C such that B is legal for I wrt C, we have that Q^B = Q^{I,C}.
  If I = ⟨G, S, M⟩ is lossless wrt Q, then answering Q through the sources of I (views) is the same as answering Q by accessing the global database.
  Note the difference with checking whether the maximally contained rewriting of Q wrt I is equivalent to Q.

Comparing the expressive power of sets of views
  A set of views V is p-contained in another set of views W if all queries that are answerable by V are also answerable by W [Li&al. ICDT 01].
  A query Q is answerable by a set of views V if there is an equivalent rewriting of Q using V.
  Given LAV data integration systems I1 = ⟨G, S1, M1⟩ and I2 = ⟨G, S2, M2⟩, I1 is p-contained in I2 if, for each query Q, cert_[Q,I1] equivalent to Q implies cert_[Q,I2] equivalent to Q.
Conclusions
  Many open problems, including
  - P2P data integration
  - Several interesting classes of integrity constraints
  - Global schema expressed in terms of semi-structured data (with constraints)
  - Dealing with inconsistencies, data cleaning
  - How to go beyond the unique domain assumption
  - Limitations in accessing the sources
  - How to incorporate the notion of data quality (source reliability, accuracy, etc.)
  - More on reasoning on queries and views
  - Optimization

Acknowledgements
  Special thanks to
  - Andrea Calí
  - Diego Calvanese
  - Giuseppe De Giacomo
  - Domenico Lembo
  - Riccardo Rosati
  - Moshe Y. Vardi