Data integration is harder than you thought
Maurizio Lenzerini
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza"
CoopIS 2001, September 5, 2001, Trento, Italy

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV: hard
- Case study in GAV: harder than you thought
- Beyond LAV and GAV: even harder
- Conclusions

Architecture for data integration
[Figure: Query; Application; Mediator with Global schema; Data Warehouse; Wrappers; three Sources, each with its Local schema]

Main problems in data integration
1. Heterogeneity of sources (intensional and extensional level)
2. Limitations in the mechanisms for accessing the sources
3. Materialized vs virtual integration
4. Data extraction, cleaning and reconciliation
5. How to process updates expressed on the global schema, and updates expressed on the sources
6. The querying problem: how to answer queries expressed on the global schema
7. The modeling problem: how to model the global schema, the sources, and the relationships between the two

The querying problem
- Each query is expressed in terms of the global schema, and the associated mediator must reformulate the query in terms of a set of queries at the sources
- The crucial step is deciding the query plan, i.e., how to decompose the query into a set of subqueries to the sources
- The computed subqueries are then shipped to the sources, and the results are assembled into the final answer

Quality in query answering
The data integration system should be designed in such a way that suitable quality criteria are met. Here, we concentrate on:
- Soundness: the answer to queries includes nothing but the truth
- Completeness: the answer to queries includes the whole truth
We aim at the whole truth, and nothing but the truth. But what the truth is depends on the approach adopted for modeling.

The modeling problem
[Figure: a Global schema with a relation R1 over C1, D1, T1 and tuples (c1, d1, t1), (c2, d2, t2), linked by the Mapping to the Source structures of Source 1 and Source 2]

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

The modeling problem: fundamental questions
- How do we model the global schema (structured vs semistructured)?
- How do we model the sources (conceptual and structural level)?
- How do we model the relationship between the global schema and the sources?
  - Are the sources defined in terms of the global schema (this approach is called source-centric, or local-as-view, or LAV)?
  - Is the global schema defined in terms of the sources (this approach is called global-schema-centric, or global-as-view, or GAV)?
  - A mixed approach?

The modeling problem: formal framework
A data integration system D is a triple ⟨G, S, M⟩, where
- G is the global schema (structure and constraints),
- S is the source schema (structures and constraints), and
- M is the mapping between G and S.
Semantics of D: which data satisfy G? We have to start with a source database C (source data coherent with S):
  sem_C(D) = { B | B is a database that is legal for D wrt C, i.e., that satisfies both G and M wrt C }
A query q to D is expressed over G. If q has arity n, then the answer to q wrt D and C is
  q^{D,C} = { (c1, ..., cn) | (c1, ..., cn) ∈ q^B for all B ∈ sem_C(D) }

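
The definition of q^{D,C} as an intersection over all legal databases can be illustrated with a small sketch. It rests on a strong simplifying assumption: sem_C(D) is in general infinite, so the function below intersects the answers over a finite, hand-picked sample of legal databases. The names `certain_answers`, `B1`, `B2`, `all_edges` and the binary relation `edge` are hypothetical and do not come from the slides.

```python
def certain_answers(query, legal_databases):
    """Return the tuples that the query yields on every supplied database.

    Assumption: legal_databases is a finite, non-empty sample standing in
    for sem_C(D), which in general is infinite.
    """
    databases = list(legal_databases)
    if not databases:
        raise ValueError("no legal database supplied")
    answers = set(query(databases[0]))  # copy, so the input is not mutated
    for database in databases[1:]:
        answers &= set(query(database))
    return answers


# Two hypothetical legal databases over a single binary relation `edge`.
B1 = {"edge": {("a", "b"), ("a", "c")}}
B2 = {"edge": {("a", "b"), ("d", "e")}}


def all_edges(database):
    """The query { (x, y) | edge(x, y) }."""
    return database["edge"]


print(certain_answers(all_edges, [B1, B2]))
```

Only the tuple present in both sample databases survives the intersection, which mirrors the "for all B" quantifier in the definition.
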
Global-as-view vs local-as-view: Example
Global schema:
  movie(Title, Year, Director)
  european(Director)
  review(Title, Critique)
Source 1: r1(Title, Year, Director), since 1960, european directors
Source 2: r2(Title, Critique), since 1990
Query: title and critique of movies in 1998
  { (T, R) | ∃D. movie(T, 1998, D) ∧ review(T, R) }, written { (T, R) | movie(T, 1998, D) ∧ review(T, R) }

Local-as-view
[Figure: the Global schema and a Source, linked by a LAV assertion: "This source contains ..."]

Formalization of LAV
In LAV, the mapping M is constituted by a set of assertions
  s ⊆ φ_G
one for each source structure s in S, where φ_G is a query over G. Given source data C, a database B satisfies M wrt C if for each source s ∈ S:
  s^C ⊆ φ_G^B
The mapping M does not provide direct information about which data satisfy the global schema. To answer a query q over G, we have to infer how to use M in order to access the source data C.
Answering queries is an inference process, which is similar to answering queries with incomplete information.

Local-as-view: Example
Global schema:
  movie(Title, Year, Director)
  european(Director)
  review(Title, Critique)
Local-as-view: associated to relations at the sources we have views over the global schema
  r1(T, Y, D) ⊆ { (T, Y, D) | movie(T, Y, D) ∧ european(D) ∧ Y ≥ 1960 }
  r2(T, R) ⊆ { (T, R) | movie(T, Y, D) ∧ review(T, R) ∧ Y ≥ 1990 }
The query { (T, R) | movie(T, 1998, D) ∧ review(T, R) } is processed by means of an inference mechanism that aims at re-expressing the atoms of the global schema in terms of atoms at the sources. In this case:
  { (T, R) | r2(T, R) ∧ r1(T, 1998, D) }

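
To make the LAV condition concrete, the following sketch checks whether a candidate global database B satisfies the two view assertions of this example with respect to some source data C, i.e., whether each source extension is contained in the corresponding view evaluated over B. It assumes the views carry the conditions "year at least 1960" and "year at least 1990" stated for the two sources; all data values and the names `view_r1`, `view_r2`, `satisfies_lav`, `B_ok`, `B_bad` are hypothetical.

```python
def view_r1(B):
    """{ (T, Y, D) | movie(T, Y, D) and european(D) and Y >= 1960 }"""
    return {(t, y, d) for (t, y, d) in B["movie"]
            if (d,) in B["european"] and y >= 1960}


def view_r2(B):
    """{ (T, R) | movie(T, Y, D) and review(T, R) and Y >= 1990 }"""
    return {(t, r)
            for (t, y, _d) in B["movie"] if y >= 1990
            for (t2, r) in B["review"] if t2 == t}


LAV_MAPPING = {"r1": view_r1, "r2": view_r2}


def satisfies_lav(B, C, mapping=LAV_MAPPING):
    """True iff, for every source s, s^C is a subset of its view over B."""
    return all(C[s].issubset(view(B)) for s, view in mapping.items())


# Hypothetical source data and two candidate global databases.
C = {"r1": {("film_a", 1998, "dir_x")}, "r2": {("film_a", "good")}}
B_ok = {"movie": {("film_a", 1998, "dir_x"), ("film_b", 1950, "dir_y")},
        "european": {("dir_x",)},
        "review": {("film_a", "good")}}
B_bad = {"movie": {("film_a", 1998, "dir_x")},
         "european": set(),  # dir_x is not recorded as european
         "review": {("film_a", "good")}}

print(satisfies_lav(B_ok, C), satisfies_lav(B_bad, C))
```

Note that `B_ok` contains a movie the sources know nothing about: with sound views, a legal database may hold more data than the sources provide, which is why query answering has to reason over many possible databases.
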
Query processing in LAV
Answering queries in LAV is like solving a mystery case:
- Sources represent reliable witnesses
- Witnesses know part of the story, and source data represent what they know
- We have an explicit representation of what the witnesses know
- We have to solve the case (answering queries) based on the information we are able to gather from the witnesses
- Inference is needed

Global-as-view
[Figure: two panels, each with a Global schema and a Source. LAV: "This source contains ...". GAV, for an element A of the global schema: "The data of A are taken from source 1 and ..."]

Formalization of GAV
In GAV, the mapping M is constituted by a set of assertions
  g ⊇ φ_S
one for each structure g in G, where φ_S is a query over S. Given source data C, a database B satisfies M wrt C if for each g ∈ G:
  φ_S^C ⊆ g^B
The mapping M provides direct information about which data satisfy the global schema. Thus, given a query q over G, it seems that we can simply evaluate the query over these data (as if we had a single database at hand). More on this later...

Global-as-view: Example
Global schema:
  movie(Title, Year, Director)
  european(Director)
  review(Title, Critique)
Global-as-view: associated to relations in the global schema we have views over the sources
  movie(T, Y, D) ⊇ { (T, Y, D) | r1(T, Y, D) }
  european(D) ⊇ { (D) | r1(T, Y, D) }
  review(T, R) ⊇ { (T, R) | r2(T, R) }

Global-as-view: Example of query processing
The query { (T, R) | movie(T, 1998, D) ∧ review(T, R) } is processed by means of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations. In this case:
  movie(T, 1998, D) ∧ review(T, R)
    ↓ unfolding
  r1(T, 1998, D) ∧ r2(T, R)

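
Unfolding can be sketched in a few lines: each global relation is a function over the source data, and the query is written against those functions, so evaluating it touches only r1 and r2. The source tuples are hypothetical, and the function names (`movie`, `european`, `review`, `movies_with_critique`) are chosen here only to mirror the global schema.

```python
# Hypothetical source data for r1(Title, Year, Director) and r2(Title, Critique).
C = {
    "r1": {("film_a", 1998, "dir_x"), ("film_b", 1975, "dir_y")},
    "r2": {("film_a", "good"), ("film_c", "poor")},
}


# GAV mapping: each global relation is defined as a view over the sources.
def movie(C):
    return set(C["r1"])


def european(C):
    return {(d,) for (_t, _y, d) in C["r1"]}


def review(C):
    return set(C["r2"])


def movies_with_critique(C, year):
    """{ (T, R) | movie(T, year, D) and review(T, R) }, evaluated by unfolding."""
    return {(t, r)
            for (t, y, _d) in movie(C) if y == year
            for (t2, r) in review(C) if t2 == t}


print(movies_with_critique(C, 1998))
```

This works because the global schema in this example has no constraints; the later slides on GAV show why the same strategy can miss answers once constraints are present.
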
Query processing in GAV
- We do not have any explicit representation of what the witnesses know
- All the information that the witnesses can provide has been compiled into an investigation report (the global schema, and the mapping)
- Solving the case (answering queries) means basically looking at the investigation report

Global-as-view and local-as-view: Comparison
Local-as-view (Information Manifold, DWQ, Picsel):
- Quality depends on how well we have characterized the sources
- High modularity and reusability (if the global schema is well designed, when a source changes, only its definition is affected)
- Query processing needs reasoning (query reformulation complex)
Global-as-view (Carnot, SIMS, Tsimmis, ...):
- Quality depends on how well we have compiled the sources into the global schema through the mapping
- Whenever a source changes or a new one is added, the global schema needs to be reconsidered
- Query processing can be based on some sort of unfolding (query reformulation looks easier)
For more details, see [Ullman, TCS 2000], [Halevy, SIGMOD 2000].

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

A case study in LAV
We deal with the problem of answering queries to data integration systems of the form ⟨G, S, M⟩, where
- the global schema G is semi-structured
- the sources in S are relational
- the mapping M is of type LAV
- queries are typical of semi-structured data

The query answering problem
Given data integration system D = ⟨G, S, M⟩, source database C, query q, and tuple t, check whether t ∈ q^{D,C} (i.e., whether t ∈ q^B for all B ∈ sem_C(D)).
Recent results:
- Complexity for several query and view languages [Abiteboul et al, PODS 98], [Grahne et al, ICDT 99]
- Schemas expressed in Description Logics [Calvanese et al, AAAI 2000]
- Regular path queries without inverse [Calvanese et al, ICDE 2000] and with inverse [Calvanese et al, PODS 2000]
- Conjunctive RPQIs [Calvanese et al, KR 2000], [Calvanese et al, LICS 2000], [Calvanese et al, DBPL 2001]

Global databases and queries
[Figure: a global database drawn as a graph whose edges are labeled sub, calls, and var]
RPQ: (sub)* · (sub · (calls ∪ sub))* · var
RPQI: (sub⁻)* · (var ∪ sub)

Regular path queries with inverse
Regular-path queries with inverse (RPQIs) are expressed by means of finite-state automata over Σ± = Σ ∪ { r⁻ | r ∈ Σ } (r⁻ denotes the inverse of the binary relation r).
[Figure: an example RPQI and the corresponding automaton]

Finite state automata and RPQIs
[Figure: a database with nodes a, b, c, d connected in a chain, with edges labeled r, p, q]
Consider the query Q = r · (p ∪ q) · p⁻ · p · q · q*
Automaton for Q:
  s1 ∈ δ(s0, r), s2 ∈ δ(s1, p), s2 ∈ δ(s1, q), s3 ∈ δ(s2, p⁻), s4 ∈ δ(s3, p), s5 ∈ δ(s4, q), s5 ∈ δ(s5, q)
The computation for RPQIs is not completely captured by finite state automata.

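
The following sketch evaluates an RPQI directly over a graph database by exploring pairs (node, automaton state), where an inverse symbol follows an edge backwards. It is an illustration under explicit assumptions, not the two-way automaton construction of the following slides: the edge labels lost in the source are read as r and p, the chain database is taken to be a -r-> b -p-> c -q-> d, s5 is assumed to be the only final state, and the inverse of a label is encoded by appending "-". The names `eval_rpqi`, `DELTA`, and `EDGES` are hypothetical.

```python
from collections import deque


def eval_rpqi(edges, delta, initial, finals, start):
    """Return the nodes reachable from `start` along a path matching the RPQI.

    edges:  set of (source, label, target) triples
    delta:  dict mapping (state, symbol) to a set of states; the symbol
            "p-" stands for the inverse of label "p"
    """
    steps = {}
    for u, label, v in edges:
        steps.setdefault(u, []).append((label, v))        # forward edge
        steps.setdefault(v, []).append((label + "-", u))  # inverse edge
    seen = {(start, s) for s in initial}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in finals:
            answers.add(node)
        for symbol, next_node in steps.get(node, []):
            for next_state in delta.get((state, symbol), ()):
                config = (next_node, next_state)
                if config not in seen:
                    seen.add(config)
                    queue.append(config)
    return answers


# Assumed reading of the automaton for Q = r.(p|q).p-.p.q.q*
DELTA = {
    ("s0", "r"): {"s1"},
    ("s1", "p"): {"s2"}, ("s1", "q"): {"s2"},
    ("s2", "p-"): {"s3"},
    ("s3", "p"): {"s4"},
    ("s4", "q"): {"s5"},
    ("s5", "q"): {"s5"},
}
# Assumed chain database: a -r-> b -p-> c -q-> d
EDGES = {("a", "r", "b"), ("b", "p", "c"), ("c", "q", "d")}

print(eval_rpqi(EDGES, DELTA, {"s0"}, {"s5"}, "a"))
```

On this database the pair (a, d) is an answer even though the label sequence r p q along the chain is not a word of the regular language of Q: the path goes b to c over p, back to b over p⁻, and to c again over p. This folding is what a plain one-way automaton reading the word r p q cannot capture.
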
Two-way automata
A two-way automaton A = (Γ, S, S0, ρ, F) consists of an alphabet Γ, a finite set of states S, a set of initial states S0 ⊆ S, a transition function ρ : S × Γ → 2^(S × {−1, 0, 1}), and a set of accepting states F ⊆ S.
Given a two-way automaton A with n states, one can construct
- a one-way automaton B1 with O(2^(n log n)) states such that L(B1) = L(A), and
- a one-way automaton B2 with O(2^n) states such that L(B2) = Γ* − L(A).

Two-way automata and RPQIs
[Figure: a database with nodes a, b, c, d connected in a chain, with edges labeled r, p, q]
Consider the query Q = r · (p ∪ q) · p⁻ · p · q · q*
Automaton for Q:
  s1 ∈ δ(s0, r), s2 ∈ δ(s1, p), s2 ∈ δ(s1, q), s3 ∈ δ(s2, p⁻), s4 ∈ δ(s3, p), s5 ∈ δ(s4, q), s5 ∈ δ(s5, q)
Two-way automaton (s← denotes the backward-mode copy of state s):
  (s1, 1) ∈ δ_A(s0, r), (s2, 1) ∈ δ_A(s1, p), (s2, 1) ∈ δ_A(s1, q), (s3, 0) ∈ δ_A(s2←, p), (s4, 1) ∈ δ_A(s3, p), (s5, 1) ∈ δ_A(s4, q), (s_f, 1) ∈ δ_A(s5, $)

Two-way automata and RPQIs
Given an RPQI E = (Σ, S, I, δ, F) over the alphabet Σ, the corresponding two-way automaton A_E is
  (Σ_A = Σ ∪ {$}, S_A = S ∪ {s_f} ∪ {s← | s ∈ S}, I, δ_A, {s_f})
where δ_A is defined as follows:
- (s2, 1) ∈ δ_A(s1, r), for each transition s2 ∈ δ(s1, r) of E
- enter backward mode: (s←, −1) ∈ δ_A(s, ℓ), for each s ∈ S and ℓ ∈ Σ_A
- exit backward mode: (s2, 0) ∈ δ_A(s1←, r), for each transition s2 ∈ δ(s1, r⁻) of E
- (s_f, 1) ∈ δ_A(s, $), for each s ∈ F.
⟹ w satisfies E iff w$ ∈ L(A_E).

Query answering: basic idea
Given D = ⟨G, S, M⟩, source database C, query q, and tuple (c, d), we search for a counterexample to (c, d) ∈ q^{D,C}, i.e., a database B ∈ sem_C(D) such that (c, d) ∉ q^B.
Each counterexample database B can be represented by a word w_B over the alphabet Σ_A = Σ ∪ C ∪ {$}, which has the form
  $ d1 w1 d2 $ d3 w2 d4 $ ··· $ d(2m−1) w_m d(2m) $
where d1, ..., d(2m) range over data objects in C (the set of such objects is simply denoted by C), w_i ∈ Σ⁺, and the $ acts as a separator.

Two-way automata and canonical DBs
[Figure: the global schema G, the source definitions with extensions containing the pairs (d1, d2), (d4, d5), (d4, d2), (d3, d3), (d2, d3), and a database for G over the objects d1, ..., d5]

Two-way automata and canonical DBs
[Figure: the database B over the objects d1, ..., d5, and the query Q]
As a word, B has the form $ d4 w1 d5 $ d1 w2 d2 $ d4 w3 d2 $ d3 w4 d3 $ d2 w5 d3 $, where each w_i ∈ Σ⁺ is the sequence of edge labels of the corresponding path (w5 ends with q).
The above database B is a counterexample to (d2, d3) ∈ Q^{D,C}.
To verify that (d2, d3) ∉ Q^B, we exploit not only the ability of two-way automata to move on the word both forward and backward, but also the ability to jump from one position in the word representing a node to any other position (either preceding or succeeding) representing the same node.

Query answering: Basic idea
If Q = (Σ, S, I, δ, F), then A_(Q,a,b) = (Σ_A, S_A, {s0}, δ_A, {s_f}), where S_A = S ∪ {s0, s_f} ∪ {s← | s ∈ S} ∪ the search states (s, d)→ and (s, d)← for s ∈ S and d ∈ C, and
1. (s←, −1) ∈ δ_A(s, ℓ), for each s ∈ S and ℓ ∈ Σ ∪ C
2. (s2, 1) ∈ δ_A(s1, r), for each s2 ∈ δ(s1, r)
3. (s2, 0) ∈ δ_A(s1←, r), for each s2 ∈ δ(s1, r⁻)
4. ((s, d)→, 0) ∈ δ_A(s, d), ((s, d)←, 0) ∈ δ_A(s, d),
   ((s, d)→, 1) ∈ δ_A((s, d)→, ℓ), ((s, d)←, −1) ∈ δ_A((s, d)←, ℓ),
   (s, 0) ∈ δ_A((s, d)→, d), (s, 0) ∈ δ_A((s, d)←, d), (s, 1) ∈ δ_A(s, d)
5. (s0, 1) ∈ δ_A(s0, ℓ), for each ℓ ∈ Σ_A, and (s, 0) ∈ δ_A(s0, a), for each s ∈ I
6. (s_f, 0) ∈ δ_A(s, b), for each s ∈ F, and (s_f, 1) ∈ δ_A(s_f, ℓ), for each ℓ ∈ Σ_A.
A_(Q,a,b) accepts w_B iff (a, b) ∈ Q^B.

A run of A_(Q,d1,d3)
[Animation over the word representing the database B of the previous slide; the successive frames show the following steps]
- State s0: the automaton scans the word forward with (s0, 1) ∈ δ_A(s0, ℓ), for each ℓ ∈ Σ_A, until it reads d1.
- (s1, 0) ∈ δ_A(s0, d1): s1 is the initial state for Q. Then (s1, 1) ∈ δ_A(s1, d1).
- A transition coming from Q leads from s1 to s2, with the head on d2.
- Search for d2: from s2 the automaton enters the search state (s2, d2), skips the symbols it reads, and exits search mode at another occurrence of d2 with (s2, 0) ∈ δ_A((s2, d2), d2).
- Backward mode: from s2 on d2 the automaton steps back onto the preceding edge label; a transition coming from Q leads to s3 without moving the head, and a further transition coming from Q leads to s4.
- Search for d2 again: search state (s4, d2), exit with (s4, 0) ∈ δ_A((s4, d2), d2), then (s4, 1) ∈ δ_A(s4, d2).
- Two transitions coming from Q lead to s5 and then, reading q, to s6: (s6, 1) ∈ δ_A(s5, q).
- (s7, 0) ∈ δ_A(s6, d3): s7 is the final state. The automaton then moves over the rest of the word with (s7, 1) ∈ δ_A(s7, d3) and (s7, 1) ∈ δ_A(s7, $).
- State s7, final state: the word is accepted by A_(Q,d1,d3)!

Query answering: Technique
To check whether (c, d) ∉ Q^B for some B ∈ sem_C(D), we check for nonemptiness of A, that is the intersection of
- the one-way automaton A0 that accepts words that represent databases, i.e., words of the form ($ · C · Σ⁺ · C)* · $
- the one-way automata corresponding to the various A_(Si,a,b) (for each source Si and for each pair (a, b) ∈ Si^C)
- the one-way automaton corresponding to the complement of A_(Q,c,d)
Indeed, any word accepted by such intersection automaton represents a counterexample to (c, d) ∈ Q^{D,C}, i.e., a database B ∈ sem_C(D) such that (c, d) ∉ Q^B.

Query answering: Complexity
- All two-way automata constructed above are of linear size in the size of Q, def(S1), ..., def(Sk), and S1^C, ..., Sk^C. Hence, the corresponding one-way automata would be exponential.
- However, we do not need to construct A explicitly. Instead, we can construct it on the fly while checking for nonemptiness.
⟹ Query answering for RPQIs is PSPACE-complete (coNP-complete if complexity is measured wrt the size of the source data C only).

Query answering: the complete picture
Different assumptions:
1. The database domain may be:
   - completely known (closed domain assumption, CDA)
   - partially known (open domain assumption, ODA)
2. Each source may be:
   - exact: provides exactly the data specified in the associated view
   - sound: provides a subset of the data specified in the associated view
   - complete: provides a superset of the data specified in the associated view

Polynomial intractability: RPQ
Given a graph G = (N, E), we define D = ⟨G, S, M⟩ and source database C:
  V_s ⊆ R_s
  V_e ⊆ R_e
  V_G ⊆ R_rg ∪ R_gr ∪ R_rb ∪ R_br ∪ R_gb ∪ R_bg
  V_s^C = { (c, a) | a ∈ N, c ∉ N }
  V_e^C = { (a, d) | a ∈ N, d ∉ N }
  V_G^C = { (a, b), (b, a) | (a, b) ∈ E }
  Q = R_s · M · R_e
where M describes all mismatched edge pairs (e.g., R_rg · R_rb).
- If G is 3-colorable, then there is a database where M (and Q) is empty, i.e., (c, d) ∉ Q^{D,C}.
- If G is not 3-colorable, then M is nonempty for every database, i.e., (c, d) ∈ Q^{D,C}.
⟹ coNP-hard wrt data complexity

Complexity of query answering: the complete picture
Assumption on domain   Assumption on views   Complexity: data   expression   combined
closed                 all sound             coNP               coNP         coNP
closed                 all exact             coNP               coNP         coNP
closed                 arbitrary             coNP               coNP         coNP
open                   all sound             coNP               PSPACE       PSPACE
open                   all exact             coNP               PSPACE       PSPACE
open                   arbitrary             coNP               PSPACE       PSPACE

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

Coming back to GAV
In GAV, the mapping M is constituted by a set of assertions
  g ⊇ φ_S
one for each structure g in G, where φ_S is a query over S. Given source database C, a database B satisfies M wrt C if for each g ∈ G:
  φ_S^C ⊆ g^B
If G does not have constraints, we can simply limit our attention to one model of the information integration system, and answering queries reduces to
- using M for computing from C the virtual global database, i.e., the tuples satisfying the various φ_S associated to each structure g of G,
- evaluating the query q over the data obtained for the various g's.

GAV with constraints in the global schema: example
Consider D = ⟨G, S, M⟩, with
Global schema G:
  student(Scode, Sname, Scity), key{Scode}
  university(Ucode, Uname), key{Ucode}
  enrolled(Scode, Ucode), key{Scode, Ucode}
  enrolled[Scode] ⊆ student[Scode]
  enrolled[Ucode] ⊆ university[Ucode]
Sources S: s1, s2, s3
Mapping M:
  student ⊇ { (X, Y, Z) | s1(X, Y, Z, W) }
  university ⊇ { (X, Y) | s2(X, Y) }
  enrolled ⊇ { (X, W) | s3(X, W) }

Constraints in GAV: example
Global database:
  University (code, name): (AF, bocconi), (BN, ucla)
  Student (code, name, city): (15, bill, oslo), (12, anne, florence), (16, ?, ?)
  Enrolled (Scode, Ucode): (12, AF), (16, BN)
Source database:
  s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
  s2^C: (AF, bocconi), (BN, ucla)
  s3^C: (12, AF), (16, BN)

Constraints in GAV: example
Source database C:
  s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
  s2^C: (AF, bocconi), (BN, ucla)
  s3^C: (12, AF), (16, BN)
s3^C(16, BN) implies enrolled^B(16, BN), for all B ∈ sem_C(D).
Due to the integrity constraints in the global schema, 16 is the code of some student in all B ∈ sem_C(D).
Since C says nothing about the name and the city of such a student, we must accept as legal for D all virtual global databases that differ in such attributes.

GAV revisited
If G does have constraints, then several situations are possible, given the source data C:
- no model exists for the data integration system,
- the data integration system has one model,
- several models exist for the information integration system.
In GAV too, answering queries is an inference process coping with incomplete information.
Coming back to the analogy with the mystery case, constraints in the global schema can make the investigation report incomplete or incoherent, so that answering queries may require reasoning on the investigation report.

A case study in GAV
We deal with the problem of answering queries to data integration systems of the form ⟨G, S, M⟩, where
- the global schema G is relational, with both key and foreign key constraints
- the sources in S are relational
- the mapping M is of type GAV
- queries are conjunctive queries

Unfolding is not sufficient in our context
Mapping M:
  student ⊇ { (X, Y, Z) | s1(X, Y, Z, W) }
  university ⊇ { (X, Y) | s2(X, Y) }
  enrolled ⊇ { (X, W) | s3(X, W) }
Source database C:
  s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
  s2^C: (AF, bocconi), (BN, ucla)
  s3^C: (12, AF), (16, BN)
Query: { (X) | student(X, Y, Z), enrolled(X, W) }
Unfolding wrt M: { (X) | s1(X, Y, Z, V), s3(X, W) }
This retrieves only the answer {12} from C, although {12, 16} is the correct answer. The simple unfolding strategy is not sufficient in our context.
Most GAV systems use the simple unfolding strategy!

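
The shortfall of plain unfolding can be reproduced directly on the source data given in this example. The sketch below evaluates the unfolded query over s1 and s3; the variable names `S1`, `S2`, `S3` and the function name `unfolded_query` are chosen here for illustration.

```python
# Source database C, as given in the example.
S1 = {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)}
S2 = {("AF", "bocconi"), ("BN", "ucla")}
S3 = {(12, "AF"), (16, "BN")}


def unfolded_query(s1, s3):
    """{ X | s1(X, Y, Z, V), s3(X, W) }: the query after unfolding wrt M."""
    return {x for (x, _y, _z, _v) in s1
            for (x2, _w) in s3 if x2 == x}


print(unfolded_query(S1, S3))
```

Student 16 is missed because s1 holds no tuple for that code, even though the foreign key from enrolled to student guarantees that such a student exists in every legal database.
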
Processing queries in GAV: technique
Techniques for automated reasoning on incomplete information are needed. In our context, we have developed the following technique for processing queries:
- Given query q, we compute another query exp_G(q), called the expansion of q wrt the constraints of G (partial evaluation)
- We unfold exp_G(q) wrt M, and obtain a query unf_M(exp_G(q)) over the sources
- We evaluate unf_M(exp_G(q)) over the source database C
exp_G(q) can be of exponential size wrt G, but the whole process has polynomial time complexity wrt the size of C (see [Calvanese et al, 2001] for details).

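
The effect of taking the foreign keys into account can be illustrated with a different, simpler device than the query expansion exp_G(q) described above: the sketch below first builds the retrieved global database and then adds a placeholder tuple for every foreign-key value that has no match (a chase-style completion with labeled nulls). This is not the slides' algorithm. It assumes the foreign keys are acyclic, so one pass suffices, and that the retrieved data already satisfy the key constraints. All function and class names are hypothetical.

```python
import itertools


class Null:
    """A labeled null: a placeholder for a value the sources do not provide."""
    _counter = itertools.count(1)

    def __init__(self):
        self.label = next(Null._counter)

    def __repr__(self):
        return f"_n{self.label}"


# Source database C, as given in the example.
S1 = {(12, "anne", "florence", 21), (15, "bill", "oslo", 24)}
S2 = {("AF", "bocconi"), ("BN", "ucla")}
S3 = {(12, "AF"), (16, "BN")}


def retrieved_global_database(s1, s2, s3):
    """Apply the GAV mapping M to the source data."""
    return {"student": {(x, y, z) for (x, y, z, _w) in s1},
            "university": set(s2),
            "enrolled": set(s3)}


# (referencing relation, position, referenced relation, position, arity)
FOREIGN_KEYS = [("enrolled", 0, "student", 0, 3),
                ("enrolled", 1, "university", 0, 2)]


def add_missing_tuples(db, foreign_keys=FOREIGN_KEYS):
    """Add one tuple with fresh nulls for each unmatched foreign-key value."""
    completed = {name: set(rows) for name, rows in db.items()}
    for src, src_pos, dst, dst_pos, dst_arity in foreign_keys:
        present = {row[dst_pos] for row in completed[dst]}
        missing = {row[src_pos] for row in completed[src]} - present
        for value in missing:
            row = [Null() for _ in range(dst_arity)]
            row[dst_pos] = value
            completed[dst].add(tuple(row))
    return completed


def enrolled_students(db):
    """{ X | student(X, Y, Z), enrolled(X, W) }, keeping only known codes."""
    return {x for (x, _y, _z) in db["student"]
            for (x2, _w) in db["enrolled"]
            if x2 == x and not isinstance(x, Null)}


canonical = add_missing_tuples(retrieved_global_database(S1, S2, S3))
print(sorted(enrolled_students(canonical)))
```

The completed database contains a student tuple with code 16 and unknown name and city, so the query returns both 12 and 16, in line with the answer the slides identify as correct.
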
Processing queries in GAV: technique
The problems mentioned above also hold when:
- The global schema is expressed in terms of a conceptual data model, see [Calvanese et al, ER 2001]
- An ontology is used as global schema, see [Calvanese et al, SWWS 2001]
- The global schema is expressed in terms of a semistructured data model (e.g., XML)
- The mapping M has the following different semantics (exact sources): given source database C, a database B satisfies the assertion relating g and φ_S wrt C if g^B = φ_S^C

The case of exact sources in GAV with constraints
Global database:
  University (code, name): (AF, bocconi), (BN, ucla)
  Student (code, name, city): (15, bill, oslo), (12, anne, florence)
  Enrolled (Scode, Ucode): (12, AF), (16, BN)
Inconsistency: no student tuple with code 16
Source database:
  s1^C: (12, anne, florence, 21), (15, bill, oslo, 24)
  s2^C: (AF, bocconi), (BN, ucla)
  s3^C: (12, AF), (16, BN)

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

Beyond LAV and GAV
Global schema: Work(Researcher, Project), Area(Project, Field)
Source 1: Interest(Person, Field)
Source 2: Get(Researcher, Grant), For(Grant, Project)
Mapping:
- "r being interested in field f" maps to "there exists a project p such that r works for p and the area of p is f".
- "r getting grant g for project p" maps to "r working for p".
This situation cannot be represented in GAV or LAV.

The modeling problem: GLAV = GAV + LAV
A more general method for specifying the mapping between the global schema and the sources is based on assertions of the forms:
  φ_S ⊆_s φ_G (sound source)
  φ_S ⊇_c φ_G (complete source)
where φ_S is a query on S and φ_G is a query on G.
Given source database C, a database B for G satisfies M wrt C if
- for each assertion φ_S ⊆_s φ_G in M, we have that φ_S^C ⊆ φ_G^B,
- for each assertion φ_S ⊇_c φ_G in M, we have that φ_G^B ⊆ φ_S^C.

Example of GLAV
Global schema: Work(Researcher, Project), Area(Project, Field)
Source 1: Interest(Person, Field)
Source 2: Get(Researcher, Grant), For(Grant, Project)
GLAV mapping:
  { (r, f) | Interest(r, f) } ⊆ { (r, f) | Work(r, p) ∧ Area(p, f) }
  { (r, p) | Get(r, g) ∧ For(g, p) } ⊆ { (r, p) | Work(r, p) }

Technique for GLAV
The mapping assertion φ_S ⊆_s φ_G can be seen as φ_S ⊆ g ⊆ φ_G, where g is a new symbol added to G.
Therefore, we can translate φ_S ⊆_s φ_G into:
- the GAV mapping rule g ⊇ φ_S
- the constraint g ⊆ φ_G
thus obtaining a GAV system with constraints, which can be dealt with by a variant of the technique described above [Calì et al, FMII 2001].

Outline
- Introduction to data integration
- Approaches to modeling and querying
- Case study in LAV
- Case study in GAV
- Beyond LAV and GAV
- Conclusions

Conclusions
- Data integration applications have to cope with incomplete information, no matter which modeling approach is adopted
- Some techniques have already been developed, but several open problems still remain (in LAV, GAV, and GLAV)
- Many other problems not addressed here are relevant in data integration (e.g., how to construct the global schema, how to deal with inconsistencies, how to cope with updates, ...)
- In particular, given the complexity of sound and complete query answering, it is interesting to look at methods that accept lower-quality answers, trading accuracy for efficiency

Acknowledgements
Many thanks to
- Andrea Calì
- Diego Calvanese
- Giuseppe De Giacomo
- Domenico Lembo
- Moshe Vardi