Structured and Semi-Structured Data Integration


UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
DOTTORATO DI RICERCA IN INGEGNERIA INFORMATICA, XIX CICLO, 2006

UNIVERSITÉ DE PARIS SUD
DOCTORAT DE RECHERCHE EN INFORMATIQUE

Structured and Semi-Structured Data Integration

Antonella Poggi


UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
DOTTORATO DI RICERCA IN INGEGNERIA INFORMATICA, XIX CICLO

UNIVERSITÉ DE PARIS SUD
DOCTORAT DE RECHERCHE EN INFORMATIQUE

Antonella Poggi

Structured and Semi-Structured Data Integration

Thesis Committee:
Prof. Maurizio Lenzerini (Advisor, Italy)
Prof. Serge Abiteboul (Advisor, France)

Reviewers:
Prof. Bernd Amann
Prof. Alex Borgida
Prof. Riccardo Rosati

AUTHOR'S ADDRESS IN ITALY:
Antonella Poggi
Dipartimento di Informatica e Sistemistica
Università degli Studi di Roma La Sapienza
Via Salaria 113, I Roma, Italy

AUTHOR'S ADDRESS IN FRANCE:
Antonella Poggi
Département d'Informatique
Université de Paris Sud
Orsay Cedex, France

E-MAIL: [email protected]
WWW: poggi/

To Mario


Acknowledgements

Everything started one day in June 2000, when I decided to go on Erasmus to the École Polytechnique in Paris and was advised to attend the database lectures given there by Prof. Abiteboul. When, in January 2001, I decided to follow Prof. Abiteboul's extra lectures in order to present a database project, he was kind enough to sit beside me and teach me how to write my first HTML page: my homepage. This was his way of introducing me to XML! A first internship at I.N.R.I.A. was my first great experience with research, and from that day on, I never gave up dreaming of research. When I came back home, I met Maurizio (whose lectures were the most exciting I ever had) and, thanks to him and Serge, I could participate in VLDB as a volunteer (Rome, Sept. 2001). How could anyone resist loving research after such a wonderful conference? Then I finished my exams, and Maurizio supported me in returning to I.N.R.I.A. for a second internship (my final project, which led to my graduation thesis). On my return, I started collaborating with Maurizio, and he made me love database theory and data integration issues so much that I chose to start my PhD route... Thanks to a European initiative, and thanks to both my advisors, who had to fight against Italian and French bureaucracy, I had the opportunity to do my research jointly between the Roman and the Parisian database groups. This was not always easy... But I was so lucky to find such great researchers, both able to see such an amazing big picture! They were both mentors, fathers and friends. No word can express how much I would like to thank you both, Maurizio and Serge. I can only say once again: Grazie - Merci (as I am now used to concluding all my research talks). I will miss so much being such a favoured PhD student!

Of course, these acknowledgements cannot end without thanking my sweet husband Mario, and my family. Both have been so patient, understanding... and, above all, they have always been with me. I love you, and always will.


Contents

Part I  Antechamber  1

1  Theoretical foundations of DIS
     Logical framework
     Consistency of a DIS
     Query answering over DIS
     Updates over DIS
     Relationship with databases with incomplete information

State of the art of DIS
     Commercial data integration tools
     Global picture of the state of the art
     Main related DIS
          LAV approach
          GAV approach
          GLAV approach

Part II  Ontology-based DIS  21

3  The language DL-Lite FRS
     DL-Lite FRS expressions
     DL-Lite FRS TBox
     DL-Lite FRS ABox
     DL-Lite FRS knowledge base
     Query language

DL-Lite A
     DL-Lite A reasoning
     Storage of a DL-Lite A ABox
          Preliminaries
          Minimal model for a DL-Lite A ABox
          Canonical interpretation
          Closure of negative inclusions
     Satisfiability of a DL-Lite A KB
          Foundations of the algorithm for satisfiability
          4.3.2  Satisfiability algorithm
     Query answering over a DL-Lite A KB
          Foundations of the query answering algorithm
          Query answering algorithm

Consistency and Query Answering over Ontology-based DIS
     DL-Lite A ontology-based DIS
          Linking data to DL-Lite A objects
          Logical framework for DL-Lite A DIS
     Overview of the consistency and query answering method
          The notion of virtual ABox
          A naive bottom-up approach
          A top-down approach
     Relevant notions from logic programming
     DL-Lite A DIS consistency and query answering
          Modularizability
          Consistency algorithm
          Query answering algorithm
     Computational complexity

Updates of Ontologies at the Instance Level
     The DL-Lite FS language
     Instance-level ontology update
     Computing updates in DL-Lite FS ontologies

Part III  XML-based DIS  97

7  The setting
     Data model
     Tree Type Constraints and schema language
     Prefix Queries

XML-based DIS
     XML DIS logical framework
     Identification
     XML DIS consistency
     XML DIS query answering
     Lower bound for query answering under exact mappings

Incomplete trees
     Query answering using incomplete trees
     Query answering algorithms
          Algorithm under VKR and no key constraints
          Algorithm under Id G, sound and complete mappings

Conclusion  149

Bibliography  158


Part I

Antechamber


Data integration is a huge area of research concerned with the problem of combining data residing at heterogeneous, autonomous and distributed data sources, and providing the user with a unified virtual view of all this data. Today's fast and continuous growth of large business organizations, often deriving from mergers of smaller enterprises, creates an increasing need for integrating and sharing large amounts of data coming from a number of heterogeneous and distributed data sources. Such needs are also shown by other applications, like information systems for administrative organizations, life sciences research, and many others. Moreover, it is not infrequent that different parts of the same organization adopt different systems to produce and maintain critical data. Clearly, data integration is a challenge in all these kinds of situations. Furthermore, it has become even more attractive thanks to the ubiquitous spread of the World Wide Web and the access to information it provides. Hence, during the last decade, research and business interest has migrated from DataBase Management Systems, DBMS (Codd, 70s [37]), to Data Integration Systems (DIS). Whereas the former make a unique local data source accessible through a schema, the latter offer the necessary framework to combine the data from a set of heterogeneous and autonomous sources through a so-called global schema (or mediated schema). Thus, the global schema does not contain data by itself, but provides a reconciled, integrated and virtual view of the underlying sources, which in contrast contain the actual data. We stress that, since the global schema acts as the interface through which the user accesses the data, the choice of the language for expressing and querying such a schema is crucial.

In particular, whereas research on the topic has already produced several DIS, rather few of them represent an appropriate trade-off between the expressive power of the languages for specifying the global schema and querying the system, and the efficiency of query answering. Nevertheless, both these aspects deserve to be considered simultaneously. Indeed, the issue of providing a rich set of semantic constraints over the global schema becomes more and more crucial, as one wants to use at least basic conceptual modeling constructs for one's application. On the other hand, offering an expressive query language and allowing for efficient query answering over typically large amounts of data are obvious requirements of such kinds of systems. In this thesis, we focus on the study of hierarchical DIS, where the global schema acts as a client of the data sources, as opposed to Peer-to-Peer DIS, where the global schema acts both as a client and a server for other DIS. In particular, motivated by the challenges discussed above, we investigate both structured and semi-structured data integration, in the two major contexts of ontology-based data integration and XML-based data integration. On the one hand, ontology-based DIS are characterized by a

global schema described at the intensional level of an ontology, i.e., a shared conceptualization of a domain of interest. The main issue here is that query answering in typical ontology languages is extremely costly with respect to the size of the data. Notably, we propose a setting where answering queries over the ontology-based DIS is LOGSPACE in data complexity. On the other hand, XML-based DIS are characterized by an expressive global schema. This is a novel setting, not much investigated yet. The main issue here concerns the presence of a significant set of integrity constraints expressed over the schema, and the concept of node identity, which requires particular attention when data come from autonomous data sources. In particular, in both contexts, our contribution consists in formally approaching the following issues.

The modeling issue, which requires providing the user with all that is needed for modeling the DIS. More precisely, the user is given (i) a language for specifying the global schema, (ii) a language for specifying the set of source schemas, and (iii) a formalism to specify the relationship existing between the data at the sources and the elements of the global schema.

The query answering issue, which is concerned with the basic service offered by a DIS, namely the ability to answer queries posed over the DIS global schema. We provide an appropriate query language and algorithms to answer queries posed to the DIS. Also, we study the complexity of the problem in both contexts, under a variety of assumptions on the DIS specification.

Since sources are in general autonomous, we also investigate the problem of detecting inconsistencies among data sources, a problem which is most of the time ignored in DIS research, thus resulting in a quite unrealistic setting. Finally, we begin the investigation of updates of DIS, in the context of ontology-based DIS. This concerns the problem of accepting updates expressed in terms of the global schema, aiming at reflecting them by changes at the source data level. This is the first investigation we are aware of that goes in this challenging direction.

Our research has been carried out under the joint supervision of the Department of Computer Science of the University of Rome La Sapienza and the GEMO INRIA-Futurs project, resulting from the merger of the INRIA-Rocquencourt Verso project and the IASI group of the University of Paris-Sud. The thesis is organized as follows. The first part serves as an introduction to the theoretical foundations of our approach to DIS, and a motivation for it. The second part is devoted to the examination of ontology-based DIS, while the third part is concerned with XML-based DIS.

Chapter 1

Theoretical foundations of DIS

In this chapter, we introduce the main theoretical foundations underlying our investigation of DIS [63]. Specifically, we start by setting up a logical framework for data integration. Then we present the main issues related to DIS that will be the focus of our attention, namely consistency checking and query answering. Afterwards, we introduce the problem of performing updates over DIS. Finally, we discuss the relationship existing between DIS and databases with incomplete information [58].

1.1 Logical framework

As already mentioned, in this work we are interested in studying DIS, whose aim is combining data residing at different sources and providing the user with a unified view of these data. Such a unified view is represented by the global schema. Thus, one of the most important aspects in the design of a DIS is the specification of the correspondence between the data at the sources and the elements of the global schema. Such a correspondence is modeled through the notion of mapping. It follows that the main components of a data integration system are the global schema, the sources, and the mapping. Thus, we formalize a data integration system Π in terms of a triple ⟨G, S, M⟩, where:

G is the global schema, expressed in a language L_G over an alphabet A_G. The alphabet comprises a symbol for each element of G (i.e., a relation if G is relational, a concept or a role if G is a Description Logic ontology, a label if G is an XML DTD, etc.).

S is the source schema, expressed in a language L_S over an alphabet A_S. The alphabet A_S includes a symbol for each element of the sources.

M is the mapping between G and S, consisting of a set of assertions, each having the form (q_S, q_G, as) or (q_G, q_S, as), where q_S and q_G are two queries of the same arity, respectively over the source schema S and over the global schema G, and as may assume the value sound,

complete, or exact. Queries q_S are expressed in a query language L_{M,S} over the alphabet A_S, and queries q_G are expressed in a query language L_{M,G} over the alphabet A_G. The value as models the accuracy of the mapping. Note that the definition above has been taken from [63], and it is general enough to capture all approaches in the literature, including in particular the DIS considered in this thesis.

We call database a set of collections of data. We say that a source database (also referred to as a set of data sources) D = {D_1, ..., D_m} conforms to a schema S = {S_1, ..., S_m} if D_i is an instance of S_i for i = 1, ..., m (where clearly the notion of D_i being an instance of S_i depends on the language L_S for expressing S). Moreover, we call global database an instance of the global schema G over a domain Γ.[1] Thus, given a set of sources D conforming to S, we call the set of legal databases for Π w.r.t. D, denoted sem(Π, D), the set of databases B such that: B is a global database, and B satisfies the mapping M w.r.t. D. Clearly, the notion of B satisfying M w.r.t. D depends on the semantics of the mapping assertions. Intuitively, the assertion (q_S, q_G, as) means that the concept represented by the query q_S over the sources D corresponds to the concept in the global schema represented by the query q_G, with the accuracy specified by as. Formally, let q be a query of arity n and DB a database. We denote by q^DB the set of n-tuples in DB that satisfy q. Then, given a set of data sources D conforming to S and a global database B, we say that B satisfies M w.r.t. D if for each M_i in M of the form (q_S, q_G, as) we have that:

if as = sound, then q_S^D ⊆ q_G^B;
if as = complete, then q_S^D ⊇ q_G^B;
if as = exact, then q_S^D = q_G^B.

Typically, sources in DIS are considered sound. This will also be the assumption we make in the investigation of ontology-based DIS.
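The three accuracy values amount to simple set comparisons between query answers. The following Python fragment is a minimal sketch of that check; the function name and the convention of passing the two answer sets precomputed are ours, not the thesis's, and the containment direction follows the standard reading of sound/complete mappings.

```python
def satisfies(ans_qS_D, ans_qG_B, accuracy):
    """Does global database B satisfy the assertion (q_S, q_G, accuracy)
    w.r.t. source data D?  ans_qS_D is q_S evaluated over D, ans_qG_B is
    q_G evaluated over B, both given as sets of tuples."""
    if accuracy == "sound":       # q_S^D is contained in q_G^B
        return ans_qS_D <= ans_qG_B
    if accuracy == "complete":    # q_S^D contains q_G^B
        return ans_qS_D >= ans_qG_B
    if accuracy == "exact":       # the two answer sets coincide
        return ans_qS_D == ans_qG_B
    raise ValueError("unknown accuracy: " + accuracy)

src = {("ada",)}                  # q_S^D
glob = {("ada",), ("alan",)}      # q_G^B: B may hold tuples beyond the sources
print(satisfies(src, glob, "sound"))    # True
print(satisfies(src, glob, "exact"))    # False
```

Under a sound mapping, the global database is free to contain more tuples than the sources provide, which is exactly why several legal global databases may exist.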
In contrast, in the XML-based context, we will also study the case of exact mappings, which appear to be useful when one considers a data source as an authority providing exactly all the information about a certain topic. On the other hand, we do not consider the case of complete mappings, since it appears less interesting in practice. Note that the different forms of mappings have led to the following characterization of the approaches to data integration in the literature [53]:

In the Local-As-View (LAV) approach, mappings in M have the form (s, q_G, as), where s is an element of the source schema.

[1] In particular, in this thesis, we consider the case of a global database being a first-order logic model (Δ^I, ·^I) of G, if G is the intensional level of a Description Logic (DL) [21] ontology, or an XML document satisfying G, if G is a DTD provided with a set of integrity constraints.

In the Global-As-View (GAV) approach, they have the form (q_S, g, as), where g is an element of the global schema.

In the Global-and-Local-As-View (GLAV) approach, no particular assumption is made on the form of mappings.

Clearly, the LAV approach favors the extensibility of the system, since adding a new source simply requires enriching the mapping with a new assertion, without other changes. On the other hand, the GAV approach has a more procedural flavor, since it tells the system how to use the sources to retrieve the data. Before concluding this presentation of the logical framework for data integration, we observe that, no matter what the interpretation of the mapping is, in general several global databases exist that are legal for Π with respect to D. This observation motivates the relationship between data integration and databases with incomplete information [86], which will be discussed in Section 1.5.

1.2 Consistency of a DIS

Given a data integration system Π = ⟨G, S, M⟩ and a set of sources D conforming to S, it may happen that no legal database exists satisfying both the global schema constraints and the mapping w.r.t. D, i.e., sem(Π, D) = ∅. We then say that the system is inconsistent w.r.t. D. It is worth noting that this kind of situation is particularly critical since, as we will see, it makes query answering meaningless. Despite its importance, this situation is often glossed over in data integration systems, or dealt with by means of a-priori and ad-hoc transformations and cleaning procedures applied to the data retrieved from the sources (e.g., [44]). Here we address the problem from a more theoretical perspective. In particular, we believe that the first step in dealing with inconsistencies is obviously to detect whether they occur. Thus, we study the problem of deciding whether a system is consistent w.r.t. a set of data sources. Such a problem can be formulated as follows:

PROBLEM: DIS CONSISTENCY
INPUT: A data integration system Π = ⟨G, S, M⟩, a set of data sources D conforming to S
QUESTION: Is there a database B legal for Π w.r.t. D?

In both ontology-based and XML-based DIS, we will study DIS consistency, show it is decidable, examine its complexity, and provide algorithms to solve it. However, we do not consider in this thesis the problem of reconciling the data at the sources, i.e., modifying the data retrieved from the sources so that the system becomes consistent. This is a challenging issue that we intend to address in the future.
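The decision problem above can be illustrated by a brute-force sketch over a toy, finite space of candidate global databases. This is purely didactic (real schema languages do not admit such enumeration); the fact encoding, the two predicate arguments, and all names are our own assumptions.

```python
from itertools import chain, combinations

def powerset(universe):
    s = list(universe)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def is_consistent(universe, mapping_ok, schema_ok):
    """sem(Pi, D) is nonempty iff some candidate global database B, built
    from a finite universe of facts, satisfies both the global schema
    constraints and the mapping w.r.t. D."""
    return any(schema_ok(set(b)) and mapping_ok(set(b))
               for b in powerset(universe))

# Toy instance over a unary global relation g: the schema forbids g(a) and
# g(b) holding together, while a sound mapping forces g(a).
universe = {"g(a)", "g(b)"}
schema_ok = lambda b: not ({"g(a)", "g(b)"} <= b)
mapping_ok = lambda b: "g(a)" in b
print(is_consistent(universe, mapping_ok, schema_ok))  # True: B = {g(a)} is legal
```

If the mapping instead forced both facts, no candidate would pass the schema check and the system would be inconsistent w.r.t. D.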

1.3 Query answering over DIS

The basic service offered by a DIS is query answering, i.e., the ability to answer queries that are posed in terms of the global schema G and are expressed in a language L_q over the alphabet A_G. Given a DIS Π = ⟨G, S, M⟩ and a set of data sources D conforming to S, the certain answers q(Π, D) to a query q posed over Π w.r.t. D is the set of tuples t of elements of Γ (i.e., the domain of the instances of G) such that t ∈ q^B for every legal database B w.r.t. Π, or equivalently:

q(Π, D) = { t | t ∈ q^B for each B ∈ sem(Π, D) }

Query answering can be tackled under two different forms. In particular, under the so-called recognition form, it is formulated as follows:

PROBLEM: QUERY ANSWERING (RECOGNITION)
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, query q, and tuple t of elements of Γ
QUESTION: Is t in q(Π, D)?

Other times, query answering assumes a more ambitious form and aims at finding the entire set of certain answers. Thus, it is formulated as follows:

PROBLEM: QUERY ANSWERING (FULL SET)
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, query q
QUESTION: Find all t such that t ∈ q(Π, D).

As for DIS consistency, in our investigation we will study DIS query answering under different assumptions, show it is decidable, examine its complexity, and provide algorithms to solve it. Note in particular that in both formulations of the query answering problem, we assume a consistent DIS. Indeed, in this thesis, we are not concerned with the problem of answering queries in the presence of mutually inconsistent data sources. One possibility to address such a problem is to follow an approach in the spirit of [62], where the authors advocate the use of an approximate semantics for mappings.
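When the set of legal databases happens to be finite, the definition of certain answers reduces to an intersection of answer sets, as the following illustrative sketch shows (real DIS may have infinitely many legal databases, which is why actual algorithms work on a finite representation instead; the function name is ours):

```python
def certain_answers(answers_per_legal_db):
    """Certain answers over a finite set of legal databases: a tuple is a
    certain answer iff it belongs to q^B for EVERY legal database B."""
    dbs = list(answers_per_legal_db)
    return set.intersection(*dbs) if dbs else set()

# q^B for two legal databases: only ('a',) is answered by both.
legal = [{("a",), ("b",)}, {("a",), ("c",)}]
print(sorted(certain_answers(legal)))   # [('a',)]
```

The recognition form of the problem then corresponds to a membership test in this intersection, while the full-set form corresponds to computing the intersection itself.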
1.4 Updates over DIS

In this section, we introduce write-also DIS, i.e., DIS that allow for performing updates expressed over the global schema. Several approaches to update have been proposed in the literature; see, e.g., [39] for a survey. In particular, different change

operators are appropriate depending on whether the change is a revision [20], i.e., a correction of the actual state of beliefs, or an update [88], reflecting a change in the world. In this section, even though we use the term update, we do not aim at advocating the use of one particular approach. On the contrary, we assume an arbitrary update operator, denoted ◦. Moreover, we assume an update F expressed as a formula in terms of G, which intuitively is sanctioned to be true in the new state, i.e., it is inserted in the updated DIS specification. Thus, given a DIS Π = ⟨G, S, M⟩, a set of data sources D conforming to S, and the update F, once the operator is applied with F to the set of legal databases for Π w.r.t. D, we obtain a new set of databases, however characterized, reflecting the change F. Note that we are interested in instance-level updates. This means that we assume that the specification of Π is invariant, whereas the update reflects a change that occurs at the sources D. Thus, in particular, we consider an update of Π with a set F of facts having the form g(t), where t is an n-tuple of elements of Γ and g is an element of G, meaning that the change consists in t being an instance of g. Thus, we formulate the problem of updating a DIS as follows:

PROBLEM: EXPRESSIBLE UPDATE
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, set of facts F
QUESTION: Is there D′ such that sem(Π, D′) = sem(Π, D) ◦ F?

The above formulation is general enough to capture all approaches to update that have been proposed in the literature. However, it raises at least the following considerations.

Typically, the user of a DIS is not the owner of the data sources, and thus does not have the right to modify their content. This is probably the reason why, as far as we know, DIS update has not yet been considered as an issue. However, we believe that a DIS should be able to provide the appropriate infrastructure to allow the user to perform an instance-level update without changing the data at the sources. This could be achieved, for instance, by using internal proprietary sources.

What if no set of data sources exists solving the update problem formulated above (not even proprietary sources)? As usual, one possibility would be to relax the semantics of the update. Indeed, we might be interested in reasoning, e.g., answering queries, over the DIS resulting from the update. To do so, we do not necessarily need to materialize a new set of data sources; we could instead reason on the original DIS by taking the update into account in a virtual way. In a sense, this is analogous to the distinction between projection via regression vs. progression in reasoning about actions [83].

Both the considerations above have motivated the beginning of our work on DIS update. So far, we have started tackling the problem for ontology-based DIS (cf. Chapter 6).
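The EXPRESSIBLE UPDATE question can be mimicked by a brute-force search over a toy, finite space of source databases. This sketch is purely illustrative: real source spaces are not enumerable, and the one-legal-database semantics below is an assumption of ours chosen to keep the example tiny.

```python
def find_expressing_sources(candidates, sem, target):
    """Search a finite space of candidate source databases for some D'
    whose set of legal global databases equals the target set."""
    for d in candidates:
        if sem(d) == target:
            return d
    return None

# Toy semantics: each source database D induces exactly one legal global
# database, namely D itself (facts are plain strings like "g(a)").
sem = lambda d: {frozenset(d)}
candidates = [set(), {"g(a)"}, {"g(a)", "g(b)"}]
# Updating D = {g(a)} with the fact g(b) should be expressed by D' = {g(a), g(b)}:
target = {frozenset({"g(a)", "g(b)"})}
print(sorted(find_expressing_sources(candidates, sem, target)))  # ['g(a)', 'g(b)']
```

When the search returns no candidate, the update is not expressible in that space, which is precisely the situation motivating the relaxed, virtual treatment of updates discussed above.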

1.5 Relationship with databases with incomplete information

Before concluding this introductory chapter on the theoretical foundations of our approach to data integration, we briefly discuss the strong connection existing between DIS and databases with incomplete information. Specifically, a database with incomplete information can be viewed as a set of possible states of the real world. Similarly, given a set of data sources, a DIS represents a set of possible databases. Thus, when a query is posed over a database with incomplete information or a DIS, the problem arises of posing the query over a possibly infinite set of database states. It follows that, in order to solve query answering over a DIS, one possibility is to find a finite representation of the set of possible databases and to provide algorithms to answer queries over such a representation. Indeed, this is the main idea underlying both the works presented in this thesis. Note, in particular, that this approach recalls the one proposed in a landmark paper by Imieliński and Lipski [58], which consists in answering queries over a database with incomplete information by exploiting the notion of representation system. Moreover, interestingly, in [4], the same approach is extended to deal with updates over databases with incomplete information.
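The "finite representation" idea can be made concrete with a miniature example in the style of tables with nulls: one incomplete table stands for all the complete databases obtained by filling in the unknown values, and a tuple is a certain answer only if it qualifies under every such completion. The encoding below (a sentinel object for nulls, a single selection query) is our own simplification, not the representation systems actually used in the thesis.

```python
NULL = object()   # a labeled null: the value exists but is unknown

def certain_select(rows, col, value):
    """Rows satisfying 'column col = value' in EVERY completion of the
    incomplete table, i.e. independently of how the nulls are filled in."""
    return [r for r in rows if r[col] is not NULL and r[col] == value]

people = [("ada", "cs"), ("alan", NULL)]    # alan's department is unknown
print(certain_select(people, 1, "cs"))       # [('ada', 'cs')]
```

Alan is excluded even though some completions would put him in "cs": membership there depends on how the null is filled, so it is possible but not certain.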

Chapter 2

State of the art of DIS

As already discussed, data integration has emerged as a pervasive challenge in the last decade. Such a success recalls the crucial impact of DBMS, proven by the large number of DBMS scattered all around the world. However, while the success of relational DBMS represents a great exception to the usual bottom-up process of emerging technologies, since it was preceded by a deep understanding and a wide acceptance of the relational model and the related theory, the interest in data integration systems grew contemporaneously in both the business and research communities. In particular, it led to the implementation of systems without having yet a deep understanding of all the intricate issues involved, concerning design-time as well as run-time aspects [54]. Clearly, it would be unrealistic to aim at being comprehensive while discussing the state of the art of such a huge field. Thus, in this chapter, we start by briefly discussing the commercial solutions to the need for integrating data. Afterwards, we contextualize our contribution in the global picture of the state of the art in the data integration research field. Finally, according to such a global picture, we discuss in more detail the works that are most closely related to our investigation.

2.1 Commercial data integration tools

Recently, some software solutions to the need for integrating data have emerged, suggesting the adoption of a DBMS as a kind of middleware infrastructure that uses a set of software modules, called wrappers, to access heterogeneous data sources [51]. Wrappers hide the native characteristics of each source, masking them under the appearance of a common relational table. Furthermore, their aim is to mediate between the federated database and the sources, mapping the data model of each source to the federated database data model, and transforming operations over the federated database into requests that the source can handle. Examples of commercial products following this kind of approach are Oracle Integration [75] and DB2 Information Integrator (DB2II) [74]. Obviously, the two are based on the Oracle and IBM DBMS, respectively. Even though remarkable from the point of view of the number of different types of data sources supported, as well as from the point of view of query optimizations,

these products are essentially data federation tools that are still far from data integration systems theory as it is by now well established in the scientific database community. Indeed, as we argued in [81], they actually allow the user to combine data coming from heterogeneous and autonomous sources, but do not provide the user with a unified view that is (logically) independent of the sources. It is worth noticing, however, that data federation tools can be used as the essential underlying environment on top of which one can build a DIS. In particular, we show in [81] how to implement a DIS based on a relational schema by means of a commercial tool for data federation. In a nutshell, this is obtained by: (i) producing an instance of a federated database through the compilation of a formal DIS specification as formalized in the previous chapter; (ii) translating the user queries posed over the global schema, so as to issue them to the federated database. Even though interesting in order to highlight the mismatch between commercial products and the research prototypes currently available, this approach is clearly far from solving the main challenge addressed in this thesis, since it allows for a limited expressive power of the global schema (without constraints) and requires following a GAV approach.

2.2 Global picture of the state of the art

In this section, we aim at giving a global picture of the state of the art in data integration and at contextualizing our contribution with respect to this global picture. From the previous chapter, it follows that a DIS specification depends on the following aspects:

the data model chosen for the global database;
the language used to express the global schema, i.e., the set of constraints characterizing it;
the approach followed to specify the mapping, i.e., GAV, LAV or GLAV;
the accuracy of the mappings (or equivalently of the data sources), i.e., sound or exact (as we already argued, complete mappings are less interesting in practice).

Another aspect deserving to be considered when classifying DIS is the architectural paradigm used. As already mentioned, in this thesis we focus on hierarchical DIS, where it is possible to clearly distinguish between two different roles played on one hand by the global schema, which is accessed by the user and does not contain data by itself, and on the other hand by the underlying sources, which contain the actual data. Another paradigm has recently been emerging for DIS, as well as for other distributed systems, namely the Peer-to-Peer (P2P) paradigm. Put in an abstract way, P2P DIS are characterized by an architecture consisting of various autonomous nodes (called peers) which hold information, and which are linked to other nodes by means of mappings. Each node therefore provides part of the overall information available from a distributed environment and acts both as a client and as a server in the system, without relying on a single global view. However, in some sense, P2P data integration

systems can be considered as the natural extension of hierarchical data integration systems, since each node of the system may itself be considered as an extended hierarchical DIS that includes, besides the mapping to local data sources, an external mapping to other nodes' schemas.[1] Note that, since research in P2P data integration is still quite young, no commercial product has really emerged yet.

Table 2.1 summarizes the state of the art in data integration. More precisely, it classifies the main integration systems according to the features discussed above. Thus, it stresses the systems that are closest to our investigation and can therefore be compared with our study. In the next two sections we describe some of these systems, focusing on those whose global schema is specified by means of (i) a Description Logic (and thus can be considered as DIS based on the relational model, characterized by a significant set of semantic constraints), and (ii) XML[2] (and thus a semi-structured data model). It is worth noting that, in Table 2.1, we consider neither Data Warehousing systems nor Data Exchange systems which, even though related to DIS, are based on a different form of data interoperability. Indeed, their aim is to export a materialized instance of the global schema, whereas DIS are characterized by a global schema that is virtual. In particular, data exchange is the problem of moving and restructuring data from a generally unique data source to the global schema (called the target schema), given the specification of the mapping (called source-to-target dependencies) between the source and the target. Data exchange has recently become an active research topic due to the increased need for exchanging data in various formats, typically in e-business applications [9]. Papers [41, 40] laid the theoretical foundation of the exchange of relational data, and several follow-up papers studied various issues in data exchange, such as schema mapping composition [11].

2.3 Main related DIS

We next discuss the DIS in the literature that are most comparable to our investigation, because, e.g., of the expressivity of their global schema (cf. Table 2.1). In particular, we classify such systems on the basis of the approach followed for mapping specification. Note that, despite the greatly increasing interest in XML from both business and research, little previous work has addressed XML-based data integration issues as defined and studied here. In contrast, considerable work has addressed XML publishing systems, and some initial work has focused on basic theoretical XML data exchange issues. Both these kinds of work are somehow orthogonal to our investigation since, besides assuming a materialized global schema, they consider a unique data source. Hence, they were not presented in Table 2.1. However, in the XML setting, where not much work has addressed even basic data integration issues, they appear relevant. Thus, we will present some of them.

[1] Clearly, this is only an abstraction, since the possible presence of cycles among peers complicates P2P DIS notably and introduces new challenging issues (see, e.g., [28]).
[2] The reader is assumed to be familiar with the notation and terminology of the relational model [5], XML [2] and DLs [14].

Table 2.1: DIS state of the art

Paradigm     | Data model      | Constraints                         | Mapping approach | Mapping accuracy | Example
Hierarchical | Relational      | Inclusions, ...                     | LAV              | sound            | Information Manifold [60]
Hierarchical | Relational      | Inclusions, ...                     | GAV              | sound            | PICSEL [48]
Hierarchical | Relational      | Functional, inclusions              | GAV              | sound            | IBIS [24], INFOMIX [64]
Hierarchical | Semi-structured |                                     | GAV              | sound            | TSIMMIS [45]
Hierarchical | Semi-structured |                                     | LAV              | exact, sound     | [34]
Hierarchical | Object-oriented | keys                                | LAV              | sound            | STYX [8]
Hierarchical | XML             | DTD                                 | LAV              | sound            | Agora [73]
Hierarchical | XML             | XML Schema types and functional ... | GLAV             | sound            | [90]
P2P          | Relational      | keys, foreign keys                  | GLAV             | sound            | [32]
P2P          | XML             |                                     | GLAV             | exact, sound     | Piazza [55]
P2P          | XML             | Keys                                | GLAV             | exact, sound     | ActiveXML [1]

CHAPTER 2. STATE OF THE ART OF DIS

LAV approach

Information Manifold

Information Manifold (IM) [67] is a DIS developed at AT&T, based on the CARIN Description Logic [66]. CARIN combines a Description Logic allowing for disjunction of concepts and role number restrictions with function-free Horn rules. Thus, IM handles the presence of inclusion dependencies over the global schema, and uses conjunctive queries as the language for querying the system and specifying sound LAV mappings. The main distinguishing feature of IM is the use of the bucket algorithm for query answering. In order to illustrate it, we first recall that in LAV the mappings between the sources and the global schema are described as a set of views over the global schema. Thus, query processing amounts to finding a way to answer a query posed over a database schema using a set of views over the same schema. This problem, called answering queries using views, is widely studied in the literature, since it has applications in many areas (see e.g. [53] for a survey). The most common approach proposed to deal with answering queries using views is query rewriting. In query rewriting, a query and a set of view definitions over a database schema are provided, and the goal is to reformulate the query into an expression, the rewriting, whose evaluation on the view extensions supplies the answer to the query. Thus, query answering via query rewriting is divided into two steps: the first one reformulates the query in terms of the given query language over the alphabet of the views (possibly augmented with auxiliary predicates), and the second one evaluates the rewriting over the view extensions. Clearly, the set of available sources may in general not store all the data needed to answer a user query, and therefore the goal is to find a rewriting that provides the maximal set of answers that can be obtained from the views.
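The bucket-style construction at the core of this approach can be sketched as follows — a minimal illustration with hypothetical predicate and view names; the actual bucket algorithm additionally checks containment of each candidate in the original query and keeps only the sound ones:

```python
from itertools import product

# Conjunctive queries and views as (head_vars, body), with body a list of
# (predicate, args) atoms.  All names below are hypothetical.

def buckets(query_body, views):
    """Step 1 of the bucket algorithm: for each query subgoal, collect the
    views whose body mentions the subgoal's predicate (a necessary
    condition for the view to be relevant to that subgoal)."""
    return [[name for name, (_, vbody) in views.items()
             if any(vp == pred for vp, _ in vbody)]
            for pred, _ in query_body]

def candidate_rewritings(query_body, views):
    """Step 2, simplified: pick one view per bucket.  The full algorithm
    then tests each candidate for containment in the query."""
    return list(product(*buckets(query_body, views)))

# Two hypothetical LAV views over a global schema with emp and dept:
views = {"V1": (("x", "y"), [("emp", ("x", "d")), ("dept", ("d", "y"))]),
         "V2": (("x",), [("emp", ("x", "d"))])}
query_body = [("emp", ("e", "d")), ("dept", ("d", "n"))]
```

Here `buckets(query_body, views)` yields `[["V1", "V2"], ["V1"]]`: both views can contribute to the `emp` subgoal, while only `V1` covers `dept`.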
The bucket algorithm, presented in [65], is actually a query rewriting algorithm that is proved to be sound and complete with respect to the problem of answering user queries (under a first-order logic formalization of the system) only in the absence of integrity constraints on the global schema; it is in general not complete when integrity constraints are issued on it.

StyX

According to Table 2.1, StyX [8] is based on the use of an object-oriented global schema describing the intensional level of an ontology as a labeled graph, whose nodes represent concepts and whose edge labels represent either roles (i.e. relationships) between concepts or inclusion assertions. As for constraints, StyX allows the specification of a set of keys over the global schema. On the other hand, StyX allows the integration of XML data sources. These are described in terms of path-to-path mapping rules that associate paths in the XML source with paths in the global schema. Thus, StyX follows the LAV approach. It addresses the problem of query rewriting in the presence of sound LAV mappings. StyX suggests an appealing way of merging the two parts of this thesis. However, this would require first an analysis of the properties of the StyX query answering algorithm (e.g. completeness), and second a deep understanding of the impact of introducing in

the StyX global schema a set of constraints comparable to ours. This is all the more an issue, given that StyX is not concerned with the detection of inconsistencies among data sources.

Agora

Agora [73] is an XML-based DIS whose global schema is specified by means of an XML DTD (without any additional integrity constraints). Moreover, Agora is characterized by a set of sound mappings that follow the LAV approach. More precisely, mappings are defined in terms of an intermediate virtual, generic, relational schema that closely models the generic structure of the XML global schema, rather than in terms of the XML global schema itself. Thus, the Agora query processing technique is based on query rewriting, which is performed via a translation first to the generic relational schema, and then by employing traditional relational techniques for answering queries using views. Note that, because of the translation, queries and mappings can be quite complex and hard for a human user to understand and define.

GAV approach

The TSIMMIS Project

TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) is a joint project of Stanford University and the Almaden IBM database research group [36]. It is based on an architecture that presents a hierarchy of wrappers and mediators, in which wrappers convert data from each source into a common data model called OEM (Object Exchange Model), and mediators combine and integrate data exported by wrappers or by other mediators. Hence, the global schema is essentially constituted by the set of OEM objects exported by wrappers and mediators. Mediators are defined in terms of a logical language called MSL (Mediator Specification Language), which is essentially Datalog extended to support OEM objects. OEM is a semistructured and self-describing data model, in which each object has an associated label, a type for the value of the object, and a value (or a set of values).
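A minimal sketch of an OEM object (hypothetical data; the point is that each object is self-describing, carrying its label and type alongside its value):

```python
# An OEM object is self-describing: it bundles a label, a type and a value,
# where the value may itself be a set of nested OEM objects.
def oem(label, type_, value):
    return {"label": label, "type": type_, "value": value}

# A hypothetical object exported by a wrapper, describing a person:
person = oem("person", "set", [
    oem("name", "string", "Alice"),
    oem("office", "int", 252),
])

# Being self-describing, the structure can be inspected without a schema:
labels = [child["label"] for child in person["value"]]
```

Here `labels` is `["name", "office"]`; a mediator can navigate exported objects by label without any prior schema agreement with the source.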
User queries are posed in terms of objects synthesized at a mediator or directly exported by a wrapper. They are expressed in MSL or in a specific query language called LOREL (Lightweight Object REpository Language), an object-oriented extension of SQL. Each query is processed by a module, the Mediator Specification Interpreter (MSI) [79, 89], consisting of three main components:

- The View Expander, which uses the mediator specification to reformulate the query into a logical plan, by expanding the objects exported by the mediator according to their definitions. The logical plan is a set of MSL rules which refer to information at the sources.

- The Plan Generator, also called Cost-Based Optimizer, which develops a physical plan specifying which queries will be sent to the sources, the order in which they will be processed, and how the results of the queries will be combined in order to derive the answer to the original query.

29 2.3. MAIN RELATED DIS 17 The Execution engine, which executes the physical plan and produces the answer. The problem of query processing in TSIMMIS in the presence of limitations in accessing the sources is addressed in [68] by devising a more complex Plan Generator comprising three modules: a matcher, which retrieves queries that can process part of the logical plan; a sequencer, which pieces together the selected source queries in order to construct feasible plans; an optimizer, which selects the most efficient feasible plan. It has to be stressed that in TSIMMIS no global integration is ever performed. Each mediator performs integration independently. As a result, for example, a certain concept may be seen in completely different and even inconsistent ways by different mediators. This form of integration can be called query-based, since each mediator supports a certain set of queries, i.e., those related to the view it provides. The IBIS system The Internet-Based Information System (IBIS) [25] is a tool for the semantic integration of heterogeneous data sources, developed in the context of a collaboration between the University of Rome La Sapienza and CM Sistemi. IBIS adopts innovative solutions to deal with all aspects of a complex data integration environment, including source wrapping, limitations on source access, and query answering under integrity constraints. IBIS uses a relational global schema to query the data at the sources, and is able to cope with a variety of heterogeneous data sources, including data sources on the Web, relational databases, and legacy sources. Each nonrelational source is wrapped to provide a relational view on it. Also, IBIS mappings follow the GAV approach and each source is considered sound. 
The system allows for the specification of integrity constraints on the global schema; in addition, IBIS considers the presence of some forms of constraints on the source schemas, in order to perform runtime optimization during data extraction. In particular, key and foreign key constraints can be specified on the global schema, while functional dependencies and full-width inclusion dependencies, i.e., inclusions between entire relations, can be specified on the source schemas. Query processing in IBIS is separated into three phases:

1. the query is expanded to take into account the integrity constraints on the global schema;

2. the atoms in the expanded query are unfolded according to their definitions in terms of the mapping, obtaining a query expressed over the sources;

3. the expanded and unfolded query is executed over the retrieved source databases, whose data are extracted by the Extractor module, which retrieves from the sources all the tuples that may be used to answer the original query.
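The unfolding phase can be sketched in minimal form — a hypothetical illustration in which each global relation is mapped to a query over the sources, and every atom of the user query is replaced by its definition:

```python
# GAV unfolding, in minimal form: each atom over the global schema is
# replaced by its mapping definition over the sources.  Atoms are
# (predicate, args) pairs; all predicate names are hypothetical.

def unfold(query_body, mapping):
    unfolded = []
    for i, (pred, args) in enumerate(query_body):
        params, source_atoms = mapping[pred]
        subst = dict(zip(params, args))  # bind mapping parameters to query terms
        for src_pred, src_args in source_atoms:
            # mapping variables not bound by the atom's arguments are
            # existential: give them fresh names local to this occurrence
            unfolded.append((src_pred,
                             tuple(subst.get(v, f"{v}_{i}") for v in src_args)))
    return unfolded

# employee(x, y) is defined as the join of two source relations:
mapping = {"employee": (("x", "y"),
                        [("s1_emp", ("x", "z")), ("s1_dept", ("z", "y"))])}
```

Unfolding the single-atom query body `[("employee", ("e", "d"))]` yields `[("s1_emp", ("e", "z_0")), ("s1_dept", ("z_0", "d"))]`, a query expressed purely over the sources.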

Query unfolding and execution are the standard steps of query processing in GAV data integration systems, while for the expansion phase IBIS makes use of the algorithm presented in [23].

INFOMIX and DIS@DIS

INFOMIX [64] is a semantic integration system that provides solutions for GAV data integration of heterogeneous data sources (e.g., relational, XML, HTML), accessed through relational global schemas over which powerful forms of integrity constraints can be specified (e.g., key, inclusion, and exclusion dependencies), and user queries are specified in a powerful query language (e.g., Datalog). The query answering technique proposed in such a system is based on query rewriting in Datalog enriched with negation and disjunction, under stable model semantics [26, 49]. A setting similar to the one considered in INFOMIX is the one at the basis of the DIS@DIS system [27]. Even if limited in its capability of integrating sources with different data formats (the system actually considers only relational data sources), DIS@DIS also provides mechanisms for the integration of inconsistent data in LAV. Furthermore, w.r.t. the query language considered, INFOMIX and DIS@DIS aim at supporting more general, highly expressive classes of queries (including queries that are intractable in the worst case).

PICSEL

Similarly to IM, PICSEL is based on CARIN and the use of conjunctive queries. However, PICSEL differs from IM in that its mappings follow a rather simplified GAV approach. More precisely, each data source consists of a set of relations, and for each data source there exists a one-to-one mapping from each of its relations to a distinct element of the global schema. In addition, PICSEL takes into account a set of constraints about the content of the sources, expressed as CARIN assertions. Query expansion in CARIN is then used as the core algorithmic tool for query answering in PICSEL.
Thus, query answering in PICSEL is quite efficient, since it is reduced to the evaluation of a union of conjunctive queries over the set of data sources, resulting from the query expansion, which is by itself exponential in the size of the global schema. The main differences with respect to our investigation are as follows. PICSEL does not consider at all the case where the DIS specification is inconsistent. Also, it does not attempt to distinguish between data and objects. Finally, PICSEL mappings are much more restricted than the ones we consider.

Grammar AIG

The Grammar AIG [18] is a formalism allowing one to specify how to integrate SQL data coming from autonomous sources and publish it as an XML document that conforms to a DTD and satisfies a set of integrity constraints very close to the ones we also consider. Thus, an AIG evaluation produces a materialized view conforming to a quite expressive global schema. More precisely, an AIG consists of two parts: a grammar and a set of XML constraints. The grammar extends a DTD by associating semantic attributes and semantic rules with element types. The semantic attributes

are used to pass data and control during AIG evaluation. The semantic rules compute the values of the attributes by extracting data from databases via multi-source SQL queries that constitute the mappings. As a result, the XML document is constructed via a controlled derivation from the grammar and constraints, and is thus guaranteed to both conform to the DTD and satisfy the constraints. The focus of [18] is on constraint checking, in the sense that whenever, during the generation of the document, an attribute does not satisfy a constraint, the compilation of the materialized instance is aborted.

XPeranto and SilkRoute

Both XPeranto [85] and SilkRoute [43] are XML publishing systems that support the definition of XML materialized views of SQL data. Moreover, they both support query answering over such XML views, by using an intermediate representation of the views. On the one hand, XPeranto uses an XML Query Graph Model (XQGM) to represent a view. The XQGM is analogous to a physical execution plan produced by a query optimizer. Nodes in the XQGM represent operations in an algebra (e.g., select, join, unnest, union) and edges represent the dataflow from one operation to the next. Individual operations may invoke XML-aware procedures for constructing and deconstructing XML values, which gives XPeranto a procedural flavor. This captures well the relationship between XQuery expressions and complex SQL expressions, but it may produce an XQGM that cannot be composed with another XQuery query, and thus cannot support arbitrary query answering. On the contrary, SilkRoute uses a view forest as an intermediate abstract representation of views expressed by means of XQuery, which is entirely declarative and can thus be composed with any XQuery query.
As a consequence, the two representations are somehow complementary: declarative view forests are appropriate for front-end query composition, whereas the procedural XQGM may be better for back-end SQL generation.

GLAV approach

XML data exchange: basic theoretical issues

In the same spirit as our work is the study presented in [12], where the authors start looking into the basic properties of XML data exchange, where the target schema is a DTD. Specifically, they define XML data exchange settings in which source-to-target dependencies refer to the hierarchical structure of the data. They investigate the consistency problem, which, in the case of data exchange, is the problem of deciding whether there exists an instance of the target schema which satisfies both the source-to-target dependencies and the DTD, and determine its exact complexity. Moreover, they identify data exchange settings over which query answering over the target schema is tractable, and those over which it is coNP-complete, depending on the classes of regular expressions used in DTDs. Finally, for all tractable cases they provide PTIME algorithms that compute target XML documents over which queries can be answered.

Constraint-based XML rewriting

The paper [90] proposes a query answering algorithm over an XML-based DIS whose global schema is characterized by a set of expressive, even though rather complicated, constraints, called nested equality-generating dependencies (NEGDs). These include functional dependencies, such as XML keys and foreign keys, as well as more general constraints stating that certain tuples/elements in the target must satisfy certain equalities. The mappings are sound and are expressed by means of the mapping language proposed in Clio [82], which means that they follow the GLAV approach. The main problem studied in [90] is query rewriting. Thus, according to the distinction discussed in [33], even though related, such a study addresses a different issue from the one we study, which does not aim at finding a query rewriting. Moreover, [90] does not deal with the detection (or resolution) of conflicts that may arise due to target constraints.
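For intuition, the simplest NEGDs are plain equality-generating dependencies; a hypothetical example (element names invented for illustration):

```latex
% An equality-generating dependency: any two book elements agreeing on
% isbn must also agree on title (a functional dependency in XML guise).
\forall x, y \; \bigl( \mathit{book}(x) \land \mathit{book}(y) \land
  x.\mathit{isbn} = y.\mathit{isbn} \;\rightarrow\;
  x.\mathit{title} = y.\mathit{title} \bigr)
```

Nested EGDs generalize this pattern by allowing the quantified variables to range over elements reached through the hierarchical structure of the document.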

Part II

Ontology-based DIS


In this part of the thesis, we investigate ontology-based DIS. These are data integration systems whose global schema is described as the intensional level of an ontology, i.e., the shared conceptualization of a domain of interest. We are interested, in particular, in ontologies expressed by means of logic-based languages, specifically Description Logics (DLs) [14]. Indeed, OWL³, the main current standard language for ontology descriptions, is based on such formalisms. In a nutshell, DLs have been developed and tailored over the years in Artificial Intelligence and Computational Logic to formally represent knowledge about a domain of interest in terms of concepts (or classes), which denote sets of objects, and roles (or relations), which denote binary relations between objects. DL knowledge bases are formed by two distinct parts: the so-called TBox, which contains the intensional description of the domain of interest, and the so-called ABox, which contains extensional information. When DLs are used to express ontologies [16], the TBox is used to express the intensional level of the ontology, while the ABox is used to represent the instance level of the ontology, i.e., the information on actual objects that are instances of the concepts and roles defined at the intensional level. From a formal point of view, a DL knowledge base is a pair K = ⟨T, A⟩, where:

- T, the TBox, is formed by a finite set of universal assertions. The precise form of such assertions depends on the specific DL. However, we insist that the TBox mainly places constraints on the extensions of the primitive concepts and roles used to describe the domain of interest⁴;

- A, the ABox, is formed by a finite set of membership assertions stating that a given object (or pair of objects) is an instance of a concept (or a role).
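The pair K = ⟨T, A⟩ can be pictured with a minimal sketch (a hypothetical encoding; the concept and role names are borrowed from later examples):

```python
# A DL knowledge base K = <T, A>: the TBox holds intensional (universal)
# assertions, the ABox extensional (membership) assertions.  The tuple
# encodings below are purely illustrative.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    tbox: set = field(default_factory=set)   # e.g. concept inclusions
    abox: set = field(default_factory=set)   # concept/role memberships

k = KnowledgeBase(
    tbox={("isa", "manager", "employee")},          # manager is-a employee
    abox={("inst", "manager", "bob"),               # manager(bob)
          ("role", "WORKS-FOR", "bob", "p1")},      # WORKS-FOR(bob, p1)
)

def instances_of(kb, concept):
    """Instance retrieval using one step of TBox inclusions: direct
    members plus members of directly included sub-concepts."""
    direct = {a[2] for a in kb.abox if a[0] == "inst" and a[1] == concept}
    subs = {t[1] for t in kb.tbox if t[0] == "isa" and t[2] == concept}
    return direct | {a[2] for a in kb.abox
                     if a[0] == "inst" and a[1] in subs}
```

Here `instances_of(k, "employee")` returns `{"bob"}` even though the ABox never states employee(bob) directly: the intensional assertion in the TBox supplies it, which is exactly the interplay between the two levels described above.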
When we talk about ontology-based DIS, the extensional level of the ontology is not represented as an ABox anymore; rather, it is provided by both a set of existing data sources and a set of mappings expressing the relationship between the concepts and roles of the intensional level of the ontology, i.e. the global schema, and the data managed by a relational DBMS. To understand which DL would be suited to act as the formalism for representing the global schema of ontology-based DIS, we clearly need to build on the results of recent research in DLs. In particular, the results of [30, 76, 57] showed that none of the variants of OWL is suitable, in that they are all coNP-hard w.r.t. data complexity.

³ OWL Web Ontology Language Overview,
⁴ This contrasts with TBoxes, sometimes called acyclic, which consist of a finite set of definition assertions used to introduce defined concepts, i.e. abbreviations for complex combinations of primitive concepts and roles, such that a defined concept cannot refer to itself.

Possible restrictions that guarantee polynomial reasoning (at least if we concentrate on instance checking only) have also been investigated, such as Horn-SHIQ [57], EL++ [13], and DLP [50]. Among such fragments, we choose here to focus on those belonging to the DL-Lite family [29, 30], since they allow for answering (unions of) conjunctive queries (i.e. SQL select-project-join queries) in LOGSPACE w.r.t. data complexity. More importantly, they allow for delegating query processing, after a preprocessing phase which is independent of the data, to the relational DBMS managing the data layer, i.e. the ABox. This last property is obviously crucial in ontology-based DIS, where relational data sources provide the extensional level of the ontology. In the investigation of ontology-based DIS, we are also interested in write-also DIS, i.e. data integration systems that allow the user to perform updates over the extensional level of an ontology, i.e. the data sources. DIS updates in this context are related to the need to change an ontology in order to reflect a change in the domain of interest the ontology is supposed to represent. Generally speaking, an update is represented by a formula that is intended to sanction a set of properties that are true in the state resulting from the change. One of the major challenges when dealing with an update is how to react to the case where the update is inconsistent with the current knowledge. Clearly, in order to study updates over an ontology-based DIS, we need to build on results on DL ontology updates. However, despite the importance of update, this issue is largely unexplored. Notable exceptions are [52, 69]. In particular, in [69] the authors propose a formal semantics for updates in DLs, and present interesting results on various aspects related to computing updates.
However, since the problem is addressed under the assumption that the knowledge base is specified only at the extensional (i.e., instance) level, the paper does not take into account the impact of the intensional level on ontology update. Thus, as a first step toward write-also ontology-based DIS, we present here the first results of a systematic investigation of the notion of update of ontologies expressed as DL knowledge bases, where the intensional level of the ontology is assumed to be invariant, i.e., it does not change while the KB is used⁵, while the instance level of the ontology describes the state of affairs regarding the instances of concepts, which can indeed change as the information in it is updated. The main contributions of this part of the thesis are as follows.

First, we define a new language, called DL-Lite A, that is particularly tailored to represent ontologies in a DIS setting. In particular, DL-Lite A allows for distinguishing between values and objects.

Second, we study the main reasoning services offered by a DL-Lite A KB. In particular, we provide algorithms to check DL-Lite A KB satisfiability and to solve query answering over a DL-Lite A KB. We prove that these algorithms are correct, and show that they run in LOGSPACE in data complexity.

Third, we propose a formal framework for DL-Lite A ontology-based DIS. We show that in DL-Lite A DIS, reasoning can be separated from the access to

⁵ In other words, in this paper we are not considering the so-called ontology evolution problem.

actual data sources. Then, we provide algorithms to solve DIS consistency and query answering by appropriately exploiting these nice features of DL-Lite A DIS. We prove that these algorithms are correct and, again, run in LOGSPACE in data complexity.

Fourth, we define the notion of update of the extensional level of an ontology. Building on classical approaches to knowledge base update, we provide a general semantics for instance-level update in DLs. In particular, we follow the approach of [69], and we adapt Winslett's semantics [87, 88] to the case where the ontology is described by both a TBox and an ABox. Finally, we study update over a KB expressed in a restricted variant of DL-Lite A, called DL-Lite FS. We prove that DL-Lite FS is closed with respect to instance-level update, in the sense that the result of an update is always expressible as a new DL-Lite FS ABox. Then, we provide an algorithm that computes the update over a DL-Lite FS KB. We prove that this algorithm is correct, and we show that it runs in polynomial time with respect to the size of the original knowledge base. To the best of our knowledge, this is the first algorithm for a well-founded approach to ontology update in DLs taking into account both the TBox and the ABox.

This part of the thesis is an expanded and updated version of an OWLED Workshop paper [35] and an AAAI conference paper [47]. It is organized as follows. Below, we briefly present the works that are most closely related to ours. In Chapter 3, we present the DL DL-Lite A that is used to express the DIS global schema. In Chapter 4, we investigate DL-Lite A KB satisfiability and query answering. In Chapter 5, we set up the logical framework for ontology-based data integration and provide algorithms to solve DIS consistency and query answering. Finally, in Chapter 6, we investigate instance-level updates of DL ontologies and provide an algorithm to compute an update over a DL-Lite A KB.
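The delegation of query processing to a relational DBMS mentioned above can be sketched as follows — a drastic simplification with hypothetical table names (the actual algorithms are developed in the following chapters): positive inclusions between atomic concepts are compiled into the user query, which then becomes plain SQL evaluable by the data layer alone.

```python
# Query answering in the DL-Lite family (drastically simplified): the
# TBox's positive inclusions are compiled into the query, yielding a
# union that a plain relational DBMS can evaluate.  Only inclusions
# between atomic concepts are treated here; the real rewriting also
# handles roles, inverses and existentials.

def rewrite(concept, inclusions):
    """Close the queried concept under the sub-concepts implied by the
    inclusion assertions, given as (sub, sup) pairs."""
    reachable, frontier = {concept}, [concept]
    while frontier:
        c = frontier.pop()
        for sub, sup in inclusions:
            if sup == c and sub not in reachable:
                reachable.add(sub)
                frontier.append(sub)
    return reachable

def to_sql(concept, inclusions):
    """Assume each atomic concept is stored in a one-column table of the
    same (hypothetical) name; the rewriting becomes a UNION query."""
    return " UNION ".join(f"SELECT id FROM {c}"
                          for c in sorted(rewrite(concept, inclusions)))

inclusions = [("manager", "employee"), ("tempemp", "employee"),
              ("employee", "person")]
```

`to_sql("employee", inclusions)` yields `SELECT id FROM employee UNION SELECT id FROM manager UNION SELECT id FROM tempemp`: once the rewriting is computed (independently of the data), all entailed instances are retrieved by the DBMS itself.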


Chapter 3

The language

In this chapter, we present a new logic of the DL-Lite family [30], called DL-Lite A. To this aim, we start by introducing DL-Lite FRS, a new DL particularly tailored to represent ontologies. Then, we present the query language, i.e. conjunctive queries. Finally, since DL-Lite FRS, while quite interesting in general, loses the most important feature of the DLs belonging to the DL-Lite family, i.e. the ability to delegate query processing to a relational DBMS, we define DL-Lite A by imposing some restrictions on DL-Lite FRS.

3.1 DL-Lite FRS

DL-Lite FRS is a new DL, whose novel aspects w.r.t. the other DLs of the DL-Lite family [30, 31] are as follows.

- DL-Lite FRS takes seriously the distinction between objects and values, by allowing the use of: value-domains, a.k.a. concrete domains [15], denoting sets of (data) values; concept attributes, denoting binary relations between objects and values; and role attributes, denoting binary relations between pairs of objects and values¹.

- DL-Lite FRS allows one to express the existence of objects (or values) that are instances of concepts (resp. value-domains), without naming the actual objects (resp. values), by means of the so-called soft constants.

Whereas these features are all provided by OWL², the distinction between objects and values is typically blurred in DLs. Nevertheless, as already discussed, none of the OWL variants [77], neither OWL, nor OWL-DL, nor OWL-Lite, would be

¹ Obviously, a role attribute can also be seen as a ternary relation relating two objects and a value.
² In fact, role attributes are currently not available in OWL, but are present in most conceptual modeling formalisms such as UML class diagrams and Entity-Relationship diagrams.

suited to act as the formalism for representing ontologies in the context of DIS, given that, if not restricted, they all provide reasoning services that are coNP-hard in data complexity.

DL-Lite FRS expressions

In providing the specification of our logics, we use the following notation: A denotes an atomic concept, B a basic concept, and C a general concept; D denotes an atomic value-domain, E a basic value-domain, and F a general value-domain; P denotes an atomic role, Q a basic role, and R a general role; U_C denotes an atomic concept attribute, and V_C a general concept attribute; U_R denotes an atomic role attribute, and V_R a general role attribute; ⊤_C denotes the universal concept, and ⊤_D the universal value-domain.

Given a concept attribute U_C (resp. a role attribute U_R), we call the domain of U_C (resp. U_R), denoted δ(U_C) (resp. δ(U_R)), the set of objects (resp. of pairs of objects) that U_C (resp. U_R) relates to values, and we call the range of U_C (resp. U_R), denoted ρ(U_C) (resp. ρ(U_R)), the set of values that U_C (resp. U_R) relates to objects (resp. pairs of objects). Notice that the domain δ(U_C) of a concept attribute U_C is a concept, whereas the domain δ(U_R) of a role attribute U_R is a role. Furthermore, we denote by δ_F(U_C) (resp. δ_F(U_R)) the set of objects (resp. of pairs of objects) that U_C (resp. U_R) relates to values in the value-domain F. In particular, DL-Lite FRS expressions are defined as follows.

Concept expressions:

  B ::= A | ∃Q | δ(U_C)
  C ::= ⊤_C | B | ¬B | ∃Q.C | δ_F(U_C) | ∃δ_F(U_R) | ∃δ_F(U_R)⁻

Value-domain expressions (rdfDataType denotes predefined value-domains such as integers, strings, etc.):

  E ::= D | ρ(U_C) | ρ(U_R)
  F ::= ⊤_D | E | ¬E | rdfDataType

Attribute expressions:

  V_C ::= U_C | ¬U_C
  V_R ::= U_R | ¬U_R

Role expressions:

  Q ::= P | P⁻ | δ(U_R) | δ(U_R)⁻
  R ::= Q | ¬Q | δ_F(U_R) | δ_F(U_R)⁻
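For concreteness, the concept grammar above can be mirrored by a small abstract-syntax encoding (a hypothetical illustration, not part of the formalism itself):

```python
# A hypothetical abstract-syntax encoding of (part of) the concept
# grammar: B ::= A | ∃Q | δ(U_C), with negation ¬B for general concepts.
from dataclasses import dataclass

@dataclass(frozen=True)
class Atomic:        # atomic concept A
    name: str

@dataclass(frozen=True)
class Exists:        # unqualified existential ∃Q over a basic role
    role: str

@dataclass(frozen=True)
class AttrDomain:    # attribute domain δ(U_C)
    attribute: str

@dataclass(frozen=True)
class Not:           # negation ¬B, allowed on right-hand sides only
    arg: object

# e.g. the two sides of a (made-up) inclusion manager ⊑ ¬∃WORKS-FOR:
lhs, rhs = Atomic("manager"), Not(Exists("WORKS-FOR"))
```

Restricting where each constructor may appear (basic concepts on the left of inclusions, general concepts on the right) is exactly what the grammar's B/C distinction enforces.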

In the value-domain expressions above, rdfDataType denotes predefined value-domains, such as integers, strings, etc., that correspond to the RDF data types³. Coherently with RDF, we assume that such data types are pairwise disjoint. In the following, we denote each such domain by T, possibly with a subscript, i.e., we assume rdfDataType ::= T_1 | ... | T_n.

As usual in DLs, the semantics of DL-Lite FRS is given in terms of first-order logic interpretations. More precisely, an interpretation I = (Δ^I, ·^I) consists of:

- a first-order structure over the interpretation domain Δ^I, which is the disjoint union of two domains: Δ^I_O, called the interpretation domain of objects, and Δ^I_V, called the interpretation domain of (data) values;

- an interpretation function ·^I such that (i) for each rdfDataType T_i, it holds that T_i^I ⊆ Δ^I_V, and for each pair of rdfDataTypes T_i, T_j, with i ≠ j, it holds that T_i^I ∩ T_j^I = ∅; and (ii) the following conditions are satisfied:

  ⊤_C^I = Δ^I_O
  ⊤_D^I = Δ^I_V
  A^I ⊆ Δ^I_O
  D^I ⊆ Δ^I_V
  P^I ⊆ Δ^I_O × Δ^I_O
  U_C^I ⊆ Δ^I_O × Δ^I_V
  U_R^I ⊆ Δ^I_O × Δ^I_O × Δ^I_V
  (¬B)^I = Δ^I_O \ B^I
  (¬E)^I = Δ^I_V \ E^I
  (¬Q)^I = (Δ^I_O × Δ^I_O) \ Q^I
  (¬U_C)^I = (Δ^I_O × Δ^I_V) \ U_C^I
  (¬U_R)^I = (Δ^I_O × Δ^I_O × Δ^I_V) \ U_R^I
  (ρ(U_C))^I = { v | ∃o. (o,v) ∈ U_C^I }
  (ρ(U_R))^I = { v | ∃o,o'. (o,o',v) ∈ U_R^I }
  (P⁻)^I = { (o,o') | (o',o) ∈ P^I }
  (δ_F(U_C))^I = { o | ∃v. (o,v) ∈ U_C^I ∧ v ∈ F^I }
  (δ(U_C))^I = (δ_⊤D(U_C))^I
  (δ_F(U_R))^I = { (o,o') | ∃v. (o,o',v) ∈ U_R^I ∧ v ∈ F^I }
  (δ(U_R))^I = (δ_⊤D(U_R))^I
  (δ_F(U_R)⁻)^I = { (o',o) | ∃v. (o,o',v) ∈ U_R^I ∧ v ∈ F^I }
  (δ(U_R)⁻)^I = (δ_⊤D(U_R)⁻)^I
  (∃δ_F(U_R))^I = { o | ∃o'. (o,o') ∈ (δ_F(U_R))^I }
  (∃δ_F(U_R)⁻)^I = { o | ∃o'. (o,o') ∈ (δ_F(U_R)⁻)^I }
  (∃Q)^I = { o | ∃o'. (o,o') ∈ Q^I }
  (∃Q.C)^I = { o | ∃o'. (o,o') ∈ Q^I ∧ o' ∈ C^I }

DL-Lite FRS TBox

DL-Lite FRS TBox assertions are of the form:

  B ⊑ C          concept inclusion assertion
  Q ⊑ R          role inclusion assertion
  E ⊑ F          value-domain inclusion assertion
  U_C ⊑ V_C      concept attribute inclusion assertion
  U_R ⊑ V_R      role attribute inclusion assertion
  (funct P)      role functionality assertion
  (funct P⁻)     inverse role functionality assertion
  (funct U_C)    concept attribute functionality assertion
  (funct U_R)    role attribute functionality assertion

³ Resource Description Framework (RDF),

A concept inclusion assertion expresses that a (basic) concept B is subsumed by a (general) concept C; analogously for the other types of inclusion assertions. A role functionality assertion expresses the (global) functionality of an atomic role; analogously for the other types of functionality assertions. Note that in the sequel we will sometimes consider a TBox T as the disjoint union of T_p, T_k and T_ni, where:

- T_p is the set of all inclusion assertions (of any type), called Positive Inclusion assertions (PIs), having a positive expression in the right-hand side;

- T_ni is the set of all inclusion assertions (of any type), called Negative Inclusion assertions (NIs), having a negated expression in the right-hand side;

- T_k is the set of all functionality assertions (of any type).

We now give the semantics of a TBox T, again in terms of interpretations I = (Δ^I, ·^I) over the domain Δ^I. An interpretation I = (Δ^I, ·^I) is a model of a DL-Lite FRS TBox T, written I ∈ Mod(T), or equivalently, I satisfies T, written I ⊨ T, if I satisfies each assertion α in T. More precisely:

- if α is an inclusion assertion γ ⊑ β, where γ and β may denote concepts, roles, value-domains, concept attributes, or role attributes, we must have γ^I ⊆ β^I;

- if α is a role functionality assertion (funct Q), where Q is either P or P⁻, we must have, for each o_1, o_2, o_3: (o_1, o_2) ∈ Q^I ∧ (o_1, o_3) ∈ Q^I → o_2 = o_3;

- if α is a concept attribute functionality assertion (funct U_C), we must have, for each o, v_1, v_2: (o, v_1) ∈ U_C^I ∧ (o, v_2) ∈ U_C^I → v_1 = v_2;

- if α is a role attribute functionality assertion (funct U_R), we must have, for each o_1, o_2, v_1, v_2: (o_1, o_2, v_1) ∈ U_R^I ∧ (o_1, o_2, v_2) ∈ U_R^I → v_1 = v_2;

where each o, possibly with a subscript, is an element of Δ^I_O, and each v, possibly with a subscript, is an element of Δ^I_V.

We next give an example of a DL-Lite FRS TBox, with the aim of highlighting the use of attributes (in particular, role attributes).
Note that in all the following examples, concept names are written in lowercase, role names are written in UPPERCASE, attribute names are in sans serif font, and domain names are in typewriter font.

3.1. DL-LITE_FRS

Example. Let T be the TBox containing the following assertions:

  tempemp ⊑ employee                (3.1)
  manager ⊑ employee                (3.2)
  employee ⊑ person                 (3.3)
  employee ⊑ ∃WORKS-FOR.project     (3.4)
  person ⊑ δ(persname)              (3.5)
  ρ(persname) ⊑ xsd:string          (3.6)
  (funct persname)                  (3.7)
  project ⊑ δ(projname)             (3.8)
  ρ(projname) ⊑ xsd:string          (3.9)
  (funct projname)                  (3.10)
  tempemp ⊑ ∃δ(until)               (3.11)
  δ(until) ⊑ WORKS-FOR              (3.12)
  (funct until)                     (3.13)
  ρ(until) ⊑ xsd:date               (3.14)
  (funct MANAGES)                   (3.15)
  MANAGES ⊑ WORKS-FOR               (3.16)
  manager ⊑ ¬∃δ(until)              (3.17)

The above TBox T models information about employees and projects. Specifically, the assertions in T state the following. Managers and fixed-term employees (tempemp) are two types of employees (3.2, 3.1), where an employee is a person (3.3) working for a project (3.4), and a person and a project are each characterized by a unique name (3.5, 3.7, 3.8, 3.10). In particular, a person name and a project name may be any string (3.6, 3.9). Moreover, someone who manages a project works for that project (3.16); note, however, that an employee can manage at most one project (3.15). Finally, the until role attribute associates a unique date (3.13, 3.14) with an employment (3.12). Thus, T allows one to express that a fixed-term employee works for at least one project until a fixed date (3.11), whereas a manager is someone who holds only permanent positions (3.17). Note that this implies that there exists no employee who is simultaneously a fixed-term employee and a manager.

DL-Lite_FRS ABox

We now focus on the DL-Lite_FRS ABox. To this aim, we introduce an alphabet of hard constants Γ (for short, constants), which is the disjoint union of two alphabets Γ_O and Γ_V. Symbols in Γ_O, called object identifiers (or also object constants), are used to denote objects, while symbols in Γ_V, called value constants, are used to denote data values. Moreover, we introduce an alphabet of soft constants V.
Consistently with Γ, V is the disjoint union of two sets V_O and V_V, whose symbols denote respectively objects and values. A DL-Lite_FRS ABox over Γ is a finite set of

assertions, called membership assertions, of the form:

  C(a),  C(s_o),  F(d),  F(s_v),  R(a, b),  V_C(a, d),  V_R(a, b, d)

where a and b are constants in Γ_O, s_o and s_v are soft constants in V_O and V_V respectively, and d is a constant in Γ_V. An assertion involving only constants is called ground.

Let us focus on soft constants. Soft constants are used to express the existence of objects (resp. values) that are instances of concepts (resp. value-domains), without actually naming their object identifiers (resp. value constants). In other words, soft constants are constants for which the unique name assumption does not hold. It is worth noting that, according to the syntax above, soft constants can occur inside concepts or value-domains, whereas they cannot occur inside roles⁴. In spite of this restriction, the following example shows that soft constants actually add expressive power (which will also become clearer when discussing updates in Chapter 6).

Example. Consider the following two ABoxes: A_1 = {A(a), B(b)} (b a constant), and A_2 = {A(a), B(x)} (x a soft constant). They do not have the same set of models. Indeed, A_1 is such that, for each interpretation I_1 = (Δ^{I_1}, ·^{I_1}) that is a model of A_1, A and B are interpreted as two sets of objects containing respectively at least one object o_A and one object o_B of the domain, such that a^{I_1} = o_A and b^{I_1} = o_B, where o_A ≠ o_B. Clearly, each such model I_1 is also a model of A_2. Now let I_2 = (Δ^{I_2}, ·^{I_2}) be an interpretation such that A and B contain uniquely the same object o of the domain. Then I_2 is a model of A_2 with an assignment μ such that μ(x) = o, where a^{I_2} = o. On the contrary, I_2 is not a model of A_1 with any assignment.

From the above example it follows that, if we were able to express for which constants the unique name assumption holds, then soft constants would not add expressive power.
However, from a technical point of view, by following such an approach we would have to change (and complicate) all the definitions we give in Chapter 6 that lead to the definition of update (e.g. the difference among interpretations). In order to give the semantics of a DL-Lite FRS ABox in terms of interpretations ( I, I), since the ABox may involve soft constants, whereas I is a function from the set of constants Γ to the domain I, we need to introduce the preliminary notion of assignment. Definition Let V be the disjoint union of the sets of soft constants V O and V V, and I the disjoint union of O I and V I. Given an ABox A, we call assignment for A a function µ from V to I such that: for each s o V O occurring in A, µ(s o ) = o O I ; for each s v V V occurring in A, µ(s v ) = v V I. 4 From a technical point of view, the reason for this restriction is that the presence of soft constants in roles would make reasoning much less efficient, since it would possibly require to recursively unify soft constants according to role functionality assertions.

Let Δ^I be the disjoint union of O^I and V^I, and I = (Δ^I, ·^I) an interpretation. Moreover, let μ be an assignment for A. We say that I is a model of A with μ, or equivalently, I satisfies A with μ, written I ⊨ A[μ], if the following conditions are satisfied.

First, I = (Δ^I, ·^I) assigns to each constant in Γ_O and Γ_V a distinct element of O^I and V^I, respectively, as follows:
- for all a ∈ Γ_O, we have that a^I ∈ O^I;
- for all a, b ∈ Γ_O, we have that a ≠ b implies a^I ≠ b^I;
- for all d ∈ Γ_V, we have that d^I ∈ V^I;
- for all d, e ∈ Γ_V, we have that d ≠ e implies d^I ≠ e^I.

Second, I satisfies each membership assertion α in A, written I ⊨ α[μ]. More precisely, for each membership assertion α ∈ A, we have that:
- if α = C(a), with a ∈ Γ_O, then a^I ∈ C^I;
- if α = C(s_o), with s_o ∈ V_O, then μ(s_o) ∈ C^I;
- if α = F(d), with d ∈ Γ_V, then d^I ∈ F^I;
- if α = F(s_v), with s_v ∈ V_V, then μ(s_v) ∈ F^I;
- if α = R(b_1, b_2), with b_1, b_2 ∈ Γ_O, then (b_1^I, b_2^I) ∈ R^I;
- if α = V_C(b, d), with b ∈ Γ_O and d ∈ Γ_V, then (b^I, d^I) ∈ V_C^I;
- if α = V_R(b_1, b_2, d), with b_1, b_2 ∈ Γ_O and d ∈ Γ_V, then (b_1^I, b_2^I, d^I) ∈ V_R^I.

Finally, we say that I is a model of A if there exists an assignment μ for A such that I is a model of A with μ. Thus, we define the set Mod(A) of models of A as Mod(A) = { I | ∃μ, I ⊨ A[μ] }.

We now give an example of an ABox. Note that in all examples that follow, object constants in Γ_O are written in bold face font, whereas value constants in Γ_V are written in slanted font.

Example. Consider the following ABox A, where z ∈ V_O:

  tempemp(z),                   (3.18)
  until(z, DIS-1212, ),         (3.19)
  projname(DIS-1212, QuOnto),   (3.20)
  manager(Lenz)                 (3.21)

Specifically, the ABox assertions in A state that there exists an object denoting a fixed-term employee (3.18). Moreover, the name of DIS-1212 is QuOnto (3.20), and the object identified by Lenz is a manager (3.21).
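To make the role of the assignment μ concrete, here is a minimal Python sketch (an illustration under the encoding assumptions stated in the comments, not part of the thesis): hard constants are interpreted through ·^I, soft constants through μ, and each membership assertion is checked against the predicate's extension:

```python
def term_value(t, iota, mu):
    """Hard constants are interpreted by iota (playing the role of ·^I);
    any other term is a soft constant, interpreted by the assignment mu."""
    return iota[t] if t in iota else mu[t]

def satisfies_abox(abox, ext, iota, mu):
    """abox: list of (predicate, args) pairs; ext: predicate -> set of tuples.
    Checks only the membership conditions; UNA on iota is assumed separately."""
    for pred, args in abox:
        tup = tuple(term_value(t, iota, mu) for t in args)
        if tup not in ext[pred]:
            return False
    return True

# The ABox A2 = {A(a), B(x)} with soft constant x: a single-object
# interpretation becomes a model once mu maps x to that object.
iota = {"a": "o"}
mu = {"x": "o"}
ext = {"A": {("o",)}, "B": {("o",)}}
abox = [("A", ("a",)), ("B", ("x",))]
```

With these inputs `satisfies_abox(abox, ext, iota, mu)` succeeds; had x been a hard constant b instead, the unique name assumption would force a^I ≠ b^I and rule this interpretation out, as in the example of the previous page.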

DL-Lite_FRS knowledge base

Now that we have introduced DL-Lite_FRS TBoxes and ABoxes, we are finally able to define when an interpretation is a model of a DL-Lite_FRS KB K. Let μ be an assignment for A. An interpretation I is a model of a KB K = ⟨T, A⟩ with μ, written I ⊨ K[μ], if I is a model of both T and A with μ. A KB is satisfiable if it has at least one model, i.e. if there exist at least an interpretation I and an assignment μ for A such that I is a model of K with μ. Thus, we have that:

  Mod(K) = { I | ∃μ, I ⊨ A[μ] and I ⊨ T }.

A KB K logically implies a ground DL-Lite_FRS assertion α, written K ⊨ α, if each model I of K is also a model of α.⁵

Example. Let K = ⟨T, A⟩ be the knowledge base whose TBox T is the one of Example 3.1.1 and whose ABox A is the one of the previous example. Clearly, K is satisfiable. Indeed, a possible model I for K is described as follows. First, μ is an assignment of the soft constant in A such that μ(z) = Palm, where Palm denotes a fixed-term employee. Lenz denotes a manager and, as such, Lenz manages exactly one project. In particular, in I, Lenz manages the project identified by DIS-1212 and named QuOnto, for which he works permanently. Moreover, Lenz works permanently for the project denoted by the object identifier FP6-7603. However, Lenz does not manage FP6-7603, since otherwise I would violate the functionality assertion (3.15) of T. On the other hand, another model I′ may be such that Lenz manages FP6-7603 and does not manage DIS-1212. Note, finally, that there exists no model of K such that Lenz is interpreted as a fixed-term employee (and thus there exists no assignment μ such that μ(z) = Lenz), since according to (3.21) of A, Lenz is a manager and, as observed in Example 3.1.1, the sets of managers and fixed-term employees are disjoint.
Before presenting, in the next section, the query language we use for the investigation of ontology-based DISs, we introduce a notion that will be useful in the sequel.

Definition. Let K = ⟨T, A⟩ be a DL-Lite_FRS KB and I an interpretation for K. Then we call most general assignment for A w.r.t. I an assignment μ_0 for A that satisfies the following conditions:
- for each C(s_o) ∈ A, with s_o ∈ V_O, μ_0(s_o) = o_n, where o_n is a fresh object in O^I, and
- for each F(s_v) ∈ A, with s_v ∈ V_V, μ_0(s_v) = v_n, where v_n is a fresh value in V^I,

where we say that μ_0(s) is a fresh object (resp. value) if μ_0(s) denotes an object (resp. a value) such that, for each constant c and each soft constant s′ ≠ s occurring in A, c^I ≠ μ_0(s) and, respectively, μ_0(s′) ≠ μ_0(s).

⁵ Note that we are not interested here in the logical implication of formulas that are not ground, even though, clearly, such a notion may easily be obtained by an obvious generalization of the notion of logical implication of ground formulas.

Intuitively, a most general assignment is an assignment ensuring that soft constant names are treated as individual constant names. It is straightforward to prove the following.

Proposition. Let K = ⟨T, A⟩ be a DL-Lite_FRS KB. Then, for each pair of most general assignments μ_0 and μ_0′ for A w.r.t. I, with μ_0 ≠ μ_0′, we have that I ⊨ K[μ_0] iff I ⊨ K[μ_0′].

Moreover, most general assignments have the following interesting property.

Proposition. Let K = ⟨T, A⟩ be a DL-Lite_FRS KB. Then, K is satisfiable iff there exist I and μ_0 such that I ⊨ K[μ_0], where μ_0 is a most general assignment for A w.r.t. I.

Proof. (⇐) Trivial (by definition). (⇒) Suppose that K is satisfiable. Then there exist an assignment μ for A and an interpretation J such that J ⊨ K[μ]. Suppose now, by contradiction, that there exists no interpretation I that is a model of K with some most general assignment for A w.r.t. I. In particular, μ is not a most general assignment for A w.r.t. J. Thus, let s̄ be a soft constant in A witnessing this, and let μ_0^J be an assignment such that μ_0^J(s̄) ≠ μ(s̄) is fresh, and μ_0^J(s) = μ(s) for each soft constant s in A with s ≠ s̄. Then there exists a membership assertion C(s̄) in A such that either (i) μ(s̄) = o = a^J for some constant a occurring in A, or (ii) μ(s̄) = o = μ(s) for some soft constant s ≠ s̄. Since, by the contradiction hypothesis, J ⊭ K[μ_0^J], there must exist at least one assertion α in K such that J ⊨ α[μ] and J ⊭ α[μ_0^J]. But then α must involve s̄, since:
- if α does not involve any soft constant, then clearly either α is satisfied by J with both μ and μ_0^J, or α is satisfied by J with neither of the two assignments;
- if α involves a soft constant y ≠ s̄, then, by construction, μ_0^J(y) = μ(y), and thus α is satisfied by J with μ_0^J iff it is satisfied by J with μ.

Therefore, α must be a membership assertion of the form C′(s̄), and since J ⊭ C′(s̄)[μ_0^J], we have that μ_0^J(s̄) ∉ C′^J. But since μ_0^J assigns to s̄ a fresh object, it is always possible to build an interpretation J′ that is identical to J except for the fact that μ_0^J(s̄) ∈ C′^{J′}. Clearly, μ_0^J is a most general assignment for A w.r.t. J′, and J′ is a model of K with μ_0^J. Thus, we obtain a contradiction.

Intuitively, the above proposition shows that, in order to study DL-Lite_A KB satisfiability, we can essentially abstract from the presence of soft constants, by considering them as distinct hard constants.
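A most general assignment is easy to construct mechanically; the following Python sketch (an illustration, with ad-hoc names for the fresh elements) assigns each soft constant an element distinct from every interpreted hard constant and from every other soft constant's image, which is exactly the "fresh object (or value)" condition of the definition:

```python
import itertools

def most_general_assignment(soft_constants, interpreted_constants):
    """Map each soft constant to a fresh element: pairwise distinct, and
    disjoint from the images of the hard constants under the interpretation."""
    taken = set(interpreted_constants)
    fresh_names = (f"fresh_{i}" for i in itertools.count())
    mu0 = {}
    for s in soft_constants:
        o = next(o for o in fresh_names if o not in taken)
        mu0[s] = o
        taken.add(o)
    return mu0
```

For instance, with soft constants {z, w} and interpreted constants {Lenz, DIS-1212} (names borrowed from the running example), the result maps z and w to two brand-new, distinct elements.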

48 36 CHAPTER 3. THE LANGUAGE 3.2 Query language A conjunctive query (CQ) q over a DL-Lite FRS ontology is an expression of the form q( x) conj( x, y), where x is a tuple of distinct variables, the so-called distinguished variables, y is a tuple of distinct existentially quantified variables (not occurring in x), called the non-distinguished variables, and conj( x, y) is a conjunction of atoms of the form A(x o ), P(x o, y o ), D(x v ), U C (x o, x v ), or U R (x o, y o, x v ), x o = y o, x v = y v, where: A, P, D, U C, and U R are resp. an atomic concept, an atomic role, an atomic value-domain, an atomic concept attribute and an atomic role attribute in T, x o, y o are either variables in x and y, called object variables, or constants in Γ O, x v is either a variable in x and y, called a value variable, or a constant in Γ V. We say that q( x) is the head of the query whereas conj( x, y) is the body. Moreover, the arity of q is the arity of x. Finally, a union of conjunctive queries (UCQ) is a query of the form: Q( x) i conj i ( x, y i ). Given an interpretation I = ( I, I), the query Q( x) ϕ( x, y) (either a conjunctive query or a union of conjunctive queries) is interpreted in I as the set of tuples o x I I such that there exists o y I I such that if we assign to the tuple of variables ( x, y) the tuple ( o x, o y ) the formula ϕ( o x, o y ) is true in I [5]. Then, given a tuple t of elements of Γ (we recall that Γ is the disjoint union of the objects and value constants Γ O and Γ V ), we say that t is a certain answer to q over K, written t ans(q, K), if for each interpretation I that is a model of K, we have that t I Q I. Thus, as for the DL-Lite FRS assertions, we say that K logically implies Q( t), written K = Q( t), where Q( t) is obtained from Q( x) by substituting x with t. Example Let K be the knowledge base introduced in Example Suppose first that we pose the following query, asking for all employees: q(x) employee(x). 
One can verify that the set of certain answers is {Lenz}. Indeed, Lenz is the only object identifier denoting an employee in all possible models, with any assignment. Suppose now that we ask for all pairs participating in the role WORKS-FOR:

  q(x, y) ← WORKS-FOR(x, y).

We then obtain no answer, since there exists no pair of object identifiers (a, b) in Γ such that (a^I, b^I) ∈ q^I for all models I of K.
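The interpretation of a conjunctive query in a single model I, as defined above, can be sketched in a few lines of Python (an illustration, with variables encoded as strings starting with '?'); certain answers would additionally require this to hold in every model of K, which is why only Lenz is certain in the first query of the example:

```python
from itertools import product

def evaluate_cq(head, body, ext, domain):
    """Naive CQ evaluation over one finite interpretation: try every binding
    of the variables to domain elements and keep those satisfying all atoms."""
    variables = sorted({t for _, args in body for t in args if t.startswith("?")})
    answers = set()
    for values in product(domain, repeat=len(variables)):
        binding = dict(zip(variables, values))
        ground = lambda t: binding.get(t, t)
        if all(tuple(map(ground, args)) in ext[pred] for pred, args in body):
            answers.add(tuple(binding[v] for v in head))
    return answers

# q(x) <- employee(x) over one model of the example KB (o1 is an unnamed
# domain element of this particular model).
ext = {"employee": {("Lenz",), ("o1",)}}
ans = evaluate_cq(["?x"], [("employee", ("?x",))], ext, {"Lenz", "o1", "DIS-1212"})
```

Here `ans` contains both Lenz and o1, but only Lenz survives intersection over all models, matching the certain answers computed in the example.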

49 3.3. DL-LITE A 37 Proposition Let K = T, A be a satisfiable DL-Lite FRS KB, and Q a union of conjunctive queries over K of arity n. Moreover, let m be the number of distinct soft constants s j occurring in A. Then, ans(q, K) = { t = (t 1,, t n ) I, µ 0 I, I = K[µ 0 I ] ( t I Q I i {1,, n}, j {1,, m}, t I i µ 0 I (s j ))} where µ 0 I denotes a most general assignment for A w.r.t. I. Proof. In order to prove the theorem, we denote as R 0 the set R 0 = { t = (t 1,, t n ) I, µ 0 I, I = K[µ 0 I ] ( t I Q I i {1,, n}, j {1,, m}, t I i µ 0 I (s j ))} and then we show that R 0 ans(q, K), and R 0 ans(q, K). : Trivial, by Proposition : Let I be a model of K and let t Γ n be a tuple of constants such that t ans(q, K). Then, we have in particular that I = Q( t). From Proposition 3.1.8, since K is satisfiable, there exists a most general assignment µ 0 I for A w.r.t. I such that I = K[µ 0 I ]. Let us now show that t I i µ 0 I (s j ) for each i {1,, n} and j {1,, m}. To this aim, suppose by contradiction that t I i = µ 0 I (s j ), for some i, j such that s j is a soft constant occurring in a membership assertion X(s j ), where X may denote either a concept or a value-domain. Then we can define a most general assignment µ 0 that is identical to µ 0 I except for the assignment of s j, i.e. µ 0 (s j ) µ 0 I (s j ). Since by definition µ 0 I (s j ) is an arbitrary fresh constant, we can construct a model I by modifying I so that (i) µ 0 (s j ) / X I, and (ii) µ 0 (s j ) X I. Then, clearly, I is a model of K with µ 0. Moreover, I = Q( t), thus contradicting the hypothesis of t ans(q, K). Note that the above proposition plays the same role for the query answering problem that Proposition plays for KB satisfiability. Indeed, it shows that given a query Q, in order to compute all certain answers to Q over a KB it is sufficient to consider only most general assignments for K. 
Thus, in particular, this allows us to compute the certain answers to Q over a KB by first considering each soft constant as a distinct hard constant, and finally eliminating those tuples that contain these newly introduced constants.

3.3 DL-Lite_A

Let us now compare the main features of DL-Lite_FRS with those of other DLs in the DL-Lite family. First, the DL-Lite_FRS ABox allows for membership assertions involving general concepts and roles (as well as general value-domains, concept attributes, and role attributes).
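The final filtering step just described can be sketched as follows (illustrative Python, not the thesis's algorithm): after evaluating the query with soft constants treated as fresh hard constants, any answer tuple mentioning one of them is discarded, so the surviving tuples range over Γ only:

```python
def certain_answer_candidates(raw_answers, fresh_constants):
    """Drop every tuple containing a constant introduced to stand in for a
    soft constant; only tuples of genuine hard constants remain."""
    fresh = set(fresh_constants)
    return {t for t in raw_answers if not any(c in fresh for c in t)}

# Suppose 'fresh_z' was introduced for the soft constant z of the example ABox.
raw = {("Lenz",), ("fresh_z",)}
```

Here `certain_answer_candidates(raw, {"fresh_z"})` keeps only ("Lenz",), as the proposition requires.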

50 38 CHAPTER 3. THE LANGUAGE Second, DL-Lite FRS allows for the representation of: the universal concept C (and universal domain value D ); qualified existential quantifications, i.e. expressions of the form: R.C, δ F (U C ), δ F (U R ), δ F (U R ), δ F (U R ), δ F (U R ). Third, DL-Lite FRS combines the main features of DL-Lite F and DL-Lite R, since it allows both functional restrictions on roles, mandatory participation on roles and disjointness between roles. Fourth, DL-Lite FRS distinguishes between objects and values and for this it introduces besides concepts and roles, also value-domains, concept attributes and role-attributes. None of the other DLs in the DL-Lite family (nor in other DLs we are aware of) allows for such a distinction. Fifth, DL-Lite FRS ABox allows for the occurrence of soft constants in membership assertions involving concepts (and value-domains). Next, we show that we can reduce general DL-Lite FRS KBs to DL-Lite FRS KBs that are equivalent, in terms of query answering, and have a much rawer form, called basic. Such a form recalls the form of other DLs in the DL-Lite family, since it does not exploit the two last features of DL-Lite FRS above mentioned. To show this, we start by defining basic DL-Lite FRS KBs. Definition Let K = T, A be a DL-Lite FRS KB. We say that K is a basic DL-Lite FRS KB if it is such that: the right-hand side of each concept inclusion assertion in T has the form: where B denotes a basic concept; B B the right-hand side of each role inclusion assertion in T has the form: where Q denotes as usual a basic role; Q Q all membership assertions in A involve only atomic concepts, atomic valuedomains, atomic concept attributes, atomic role attributes, and atomic roles. Example One can easily verify that the KB= T, A such that T is the TBox of Ex and A is the ABox of Ex is a basic DL-Lite FRS KB. 
We now show that it is possible to convert a general DL-Lite FRS KB into a basic DL-Lite FRS KB that is equivalent to the initial KB from the point of view of KB satisfiability, query answering, and therefore from the point of view of main reasoning services. Intuitively, this can be done by compiling away all qualified existential

51 3.3. DL-LITE A 39 quantifications in the right-hand side of both concept and role inclusion assertions by rewriting them through the use of auxiliary roles. Similarly, all membership assertions assertions involving complex expressions can be compiled away by rewriting them through the use of auxiliary expressions. Specifically, given a DL-Lite FRS KB K, we denote by Conv(K) a KB that is obtained from K by replacing each assertion α involving an expression Y with a set of assertions S(α), according to the rules shown in Fig. 3.1, where we have marked in bold newly introduced auxiliary expressions, that may be either concepts, valuedomains, concept attributes, role attributes, or roles. Then, we have the following. Lemma Let K be a DL-Lite FRS KB. Then, we have that: Proof. 1. K is satisfiable, if and only if Conv(K) is satisfiable; 2. for each conjunctive query q (not involving the newly introduced auxiliary expressions), and for each tuple t of elements of Γ V Γ O, t ans(q, K) if and only if t ans(q, Conv(K)). : : 1. Suppose that Conv(K) is satisfiable. Moreover, suppose that Conv(K) is obtained from K by replacing an assertion α with a set of assertions S(α) according to Fig One can easily verify that S(α) = α. Then, Conv(K) = α. Moreover, by construction, we have that: Conv(K) K \ {α}. Thus, since each model of Conv(K) is also a model of α, then each model of Conv(K) is also a model of K, proving that K is satisfiable. 2. Let Conv(K) be not satisfiable. Then, the claim trivially holds. Thus, let us suppose that Conv(K) is satisfiable. Moreover, let q be a conjunctive query and t be a tuple of constants such that t ans(q, Conv(K)), i.e. Conv(K) = q( t). We want to show that t ans(q, Conv(K)), i.e. K = q( t). Since we showed previously that Conv(K) = K, i.e. each model of Conv(K) is also a model of K, then from Conv(K) = q( t), it follows that K = q( t). 1. Suppose that K is satisfiable and, by contradiction, that Conv(K) is not satisfiable. 
Moreover, suppose that Conv(K) is obtained from K by replacing an assertion α with a set of assertions S(α) according to Fig Say, for instance that α = B R.C K. Then, α does not belong to Conv(K), which in contrast contains the set of the following assertions: B R aux R aux C R aux R where R aux is an new auxiliary role. Let I be a model of K with assignment µ. Note that such a model exists since K is satisfiable. Thus, we can construct an interpretation I, by setting I = I and then extending I as follows:

52 40 CHAPTER 3. THE LANGUAGE Y R.C δ F (U C ) δ F (U R ) δ F (U R ) δ F (U R ) δ F (U R ) A D Q δ(u C ) ρ(u C ) ρ(u R ) C D rdf DataT ype X Y replaced by X R aux R aux C R aux R X δ(u Caux ) ρ(u Caux ) F U Caux U C X δ(u Raux ) ρ(u Raux ) F U Raux U R X δ(u Raux ) ρ(u Raux ) F U Raux U R X δ(u Raux ) ρ(u Raux ) F U Raux U R X δ(u Raux ) ρ(u Raux ) F U Raux U R Y (c) replaced by Y (c, d) replaced by Y (c, d, e) replaced by Y aux(c) Y aux R aux R aux C R aux R Y aux(c) Y aux δ(u Caux ) ρ(u Caux ) F U Caux U C Y aux(c) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c) Y aux Y U C Q Y aux(c, d) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c, d) Y aux Y U R Y aux(c, d, e) Y aux Y Figure 3.1: Rules for computing Conv(K)
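As an illustration of the first rule of Fig. 3.1 (a Python sketch under an ad-hoc tuple encoding of assertions; it is not the thesis's implementation), a qualified existential B ⊑ ∃R.C is compiled into three basic assertions over a fresh auxiliary role, exactly as in the proof above:

```python
import itertools

_fresh = itertools.count()  # supplies a distinct suffix per auxiliary role

def compile_qualified_existential(b, r, c):
    """Rewrite B ⊑ ∃R.C as { B ⊑ ∃R_aux,  ∃R_aux⁻ ⊑ C,  R_aux ⊑ R },
    where R_aux is a newly introduced auxiliary role."""
    r_aux = f"{r}_aux{next(_fresh)}"
    return r_aux, [
        ("isa", b, ("exists", r_aux)),      # B ⊑ ∃R_aux
        ("isa", ("exists-inv", r_aux), c),  # ∃R_aux⁻ ⊑ C
        ("subrole", r_aux, r),              # R_aux ⊑ R
    ]
```

Applied to employee ⊑ ∃WORKS-FOR.project from the example TBox, this yields a basic fragment in which the auxiliary role never appears; the lemma's equivalence holds precisely because queries are assumed not to mention the auxiliary expressions.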

53 3.3. DL-LITE A 41 for each (o 1, o 2 ) R I such that o 1 B I, we set (o 1, o 2 ) Raux I where o 1, o 2 denote objects in I O. Since I and I differ only because of the fact that (o 1, o 2 ) Raux I, then I satisfy all the assertions in K. In particular, I is such that it satisfies the assertion B R.C. Then, it is easy to verify that I satisfies the assertions above. Thus, I is a model of Conv(K), which contradicts Conv(K) being not satisfiable. With a similar argument, we may prove that by applying another rule among those shown in Fig. 3.1 the result holds. 2. Let q be a conjunctive query and t a tuple of constants such that K = q( t). Again, we can suppose that K satisfiable, since otherwise the claim trivially holds. Moreover, we suppose again that Conv(K) is obtained from K by replacing α = B R.C K, with S(α) as shown above. Then, let I be a model of K with an assignment µ, and suppose to obtain from I a model I of Conv(K) as shown above. Clearly, I = q( t) since I = q( t), I and I differ only because of the fact that (o 1, o 2 ) Raux I and q does not involve R aux by hypothesis. Thus, to prove the claim, we need to prove that there exists no model of Conv(K) such that its restriction to the expressions used in K does not satisfy an assertion in K. By contradiction, let I be a model of Conv(K) not satisfying an assertion β in K. Two cases are possible: either β α; but then we obtain a contradiction, since, by construction, Conv(K) is obtained from K by replacing α with S(α), and thus, I satisfies all assertions in K different from α; or β = α; but since S(α) = α and since I satisfies S(α), we obtain a again a contradiction. Again, with a similar argument, we may prove that by applying another rule among those shown in Fig. 3.1 the result holds. Proposition Let K be a DL-Lite FRS KB. Then, there always exists a basic DL-Lite FRS KB K that is equivalent to K from the point of view of satisfiability and query answering over K. 
Moreover, such a basic KB can be computed in PTIME in the size of K.

Proof. The proof is based on the following observations:
- for each DL-Lite_FRS KB K, Conv(K) is a basic KB;
- by Lemma 3.3.3, Conv(K) is equivalent to K from the point of view of satisfiability and query answering over K;
- by construction of Conv(K), for each assertion in K at most one rule in Fig. 3.1 is applied, which proves the PTIME complexity.

Even though basic DL-Lite_FRS KBs have a form that recalls that of DL-Lite_F and DL-Lite_R, they allow for unrestrictedly combining the features of both these logics.

From the results of [30], it follows that query answering over basic DL-Lite_FRS KBs is no longer in LOGSPACE w.r.t. data complexity, and hence DL-Lite_FRS loses the most interesting computational feature for ontology-based DIS query answering. Thus, we next define a new DL called DL-Lite_A, starting from DL-Lite_FRS and requiring, on the one hand, that KBs be expressed in the basic form and, on the other hand, that the use of functionality be restricted.

Definition. A DL-Lite_A knowledge base K = ⟨T, A⟩ is a basic DL-Lite_FRS KB such that T satisfies the following conditions:
1. for every role inclusion assertion Q ⊑ R in T, where R is an atomic role or the inverse of an atomic role, the assertions (funct R) and (funct R⁻) are not in T;
2. for every concept attribute inclusion assertion U_C ⊑ V_C in T, where V_C is an atomic concept attribute, the assertion (funct V_C) is not in T;
3. for every role attribute inclusion assertion U_R ⊑ V_R in T, where V_R is an atomic role attribute, the assertion (funct V_R) is not in T.

Roughly speaking, a DL-Lite_A knowledge base imposes on the global schema the condition that no functional role can be specialized, i.e. used in the right-hand side of a role inclusion assertion. The same condition is also imposed on every functional (concept or role) attribute. As we will show later, this limitation is sufficient to guarantee that query answering can be reduced to first-order query evaluation over a database.

Example. Clearly, the KB K = ⟨T, A⟩ such that T is the TBox of Example 3.1.1 and A is the previously introduced ABox satisfies the conditions above. Thus, it is an example of a DL-Lite_A KB.
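Condition 1 of the definition is easy to check mechanically. The following Python sketch is purely illustrative (role inverses are encoded as ('inv', P), an assumption of this example, not the thesis's notation):

```python
def satisfies_dllite_a_roles(role_inclusions, functional):
    """Condition 1: no role occurring on the right-hand side of a role
    inclusion may be declared functional (in either direction).
    role_inclusions: iterable of (Q, R); functional: set of P or ('inv', P)."""
    def atomic(role):
        # Strip the inverse marker, if any, to get the underlying atomic role.
        return role[1] if isinstance(role, tuple) else role
    specialized = {atomic(r) for _, r in role_inclusions}
    return all(atomic(f) not in specialized for f in functional)

# From the example TBox: MANAGES ⊑ WORKS-FOR with (funct MANAGES) is allowed,
# since the specialized role WORKS-FOR is not functional; declaring
# (funct WORKS-FOR⁻) instead would violate condition 1.
ok = satisfies_dllite_a_roles([("MANAGES", "WORKS-FOR")], {"MANAGES"})
bad = satisfies_dllite_a_roles([("MANAGES", "WORKS-FOR")], {("inv", "WORKS-FOR")})
```

Here `ok` is True and `bad` is False, matching the intuition that functional roles (and attributes) must not be specialized.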

Chapter 4

DL-Lite_A reasoning

In this chapter we study the main DL-Lite_A reasoning services, i.e. KB satisfiability and query answering. After introducing the representation of a DL-Lite_A KB in a relational DBMS, we present preliminary results that lead us to algorithms for (i) checking DL-Lite_A KB satisfiability, and (ii) solving query answering, both relying on the use of an SQL engine. Along the way, we prove the correctness of these algorithms and study their complexity. All these results provide the foundations for the investigation of ontology-based DISs in the next chapter.

4.1 Storage of a DL-Lite_A ABox

Let K = ⟨T, A⟩ be a DL-Lite_A KB. As already discussed, we will show that DL-Lite_A keeps the nice property of the DLs in the DL-Lite family of allowing query processing to be delegated, after a preprocessing phase that is independent of the data, to an underlying DBMS managing the data layer, i.e. the ABox. Thus, throughout this chapter, we assume that a DL-Lite_A KB K = ⟨T, A⟩ is represented as a database DB, as presented below.

Definition. Given a TBox T and a database DB with domain Γ ∪ V, we say that DB represents a KB K = ⟨T, A⟩ in the context of T if DB is as follows:
- for each atomic concept A, DB contains a unary relation A, and for each tuple (c_o) in that relation there exists one membership assertion A(c_o) in the ABox;
- for each atomic value-domain D, DB contains a unary relation D, and for each tuple (c_v) in that relation there exists one membership assertion D(c_v) in the ABox;
- for each atomic role P, DB contains a binary relation P, and for each tuple (a_1, a_2) in that relation there exists a membership assertion P(a_1, a_2) or P⁻(a_2, a_1) in the ABox;
- for each atomic concept attribute U_C, DB contains a binary relation U_C, and for each tuple (b, d) in that relation there exists a membership assertion U_C(b, d) in the ABox;
- for each atomic role attribute U_R, DB contains a ternary relation U_R, and for each tuple (a_1, a_2, d) in that relation there exists a membership assertion U_R(a_1, a_2, d) in the ABox;
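The representation of the definition above maps directly onto relational tables: one table per atomic predicate, with the predicate's arity as the column count. A minimal sketch using Python's built-in sqlite3 follows (the table and column naming scheme is this example's choice, not the thesis's):

```python
import sqlite3

def store_abox(assertions):
    """assertions: iterable of (predicate, args) pairs. Creates one relation
    per predicate (unary for concepts/value-domains, binary for roles and
    concept attributes, ternary for role attributes) and loads the tuples."""
    db = sqlite3.connect(":memory:")
    for pred, args in assertions:
        cols = ", ".join(f"c{i}" for i in range(len(args)))
        db.execute(f'CREATE TABLE IF NOT EXISTS "{pred}" ({cols})')
        marks = ", ".join("?" for _ in args)
        db.execute(f'INSERT INTO "{pred}" VALUES ({marks})', tuple(args))
    return db

abox = [("manager", ("Lenz",)),
        ("projname", ("DIS-1212", "QuOnto")),
        ("Fresh", ("z",))]  # soft constants are tracked in a separate relation
db = store_abox(abox)
```

SQL queries over these tables then realize ans(Q, DB); for instance, `SELECT c0 FROM manager` returns Lenz.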

- for each tuple (s_v) in the relation Fresh, there exists one soft constant s_v ∈ V.

Intuitively, this will let us deal with soft constants as if they were (hard) constants, without forgetting that they are not. As usual, given any first-order logic query Q, we denote by ans(Q, DB) the set of answers returned by the evaluation of Q over DB.

4.2 Preliminaries

In this section, we present three main constructions that will be crucial for the investigation of DL-Lite_A reasoning, namely the minimal model of the ABox A, the canonical model of K, and the closure of the negative inclusions.

Minimal model for a DL-Lite_A ABox

Given a DL-Lite_A ABox A, we denote by db(A) the (Herbrand) interpretation of A. More precisely, db(A) is the interpretation (Δ^{db(A)}, ·^{db(A)}) such that Δ^{db(A)} is the disjoint union of the two domains O^{db(A)} = Γ_O and V^{db(A)} = Γ_V, and ·^{db(A)} is as follows:
- a^{db(A)} = a, for each constant a ∈ Γ, where Γ = Γ_O ∪ Γ_V;
- A^{db(A)} = { a | A(a) ∈ A }, for each atomic concept A;
- D^{db(A)} = { d | D(d) ∈ A }, for each atomic value-domain D;
- P^{db(A)} = { (a_1, a_2) | P(a_1, a_2) ∈ A }, for each atomic role P;
- U_C^{db(A)} = { (a_1, d) | U_C(a_1, d) ∈ A }, for each atomic concept attribute U_C; and
- U_R^{db(A)} = { (a_1, a_2, d) | U_R(a_1, a_2, d) ∈ A }, for each atomic role attribute U_R.

It is easy to see that db(A) is a minimal Herbrand model of A with a most general assignment μ_0 w.r.t. db(A).

Canonical interpretation

The canonical interpretation of a DL-Lite_A KB is an interpretation constructed according to the notion of chase [5]. In particular, we adapt here the notion of restricted chase adopted by Johnson and Klug in [59]. To this aim, we exploit the notion of most general assignment introduced above.

Definition. Let K = ⟨T, A⟩ be a DL-Lite_A KB.
We call canonical interpretation of K the minimal interpretation can(K) = (Δ^{can(K)}, ·^{can(K)}) of K that satisfies the following conditions, where Δ^{can(K)} is the disjoint union of the sets O^{can(K)} and V^{can(K)}, and we use a and v, possibly with subscripts or primes, to denote respectively an object in O^{can(K)} and a value in V^{can(K)}.

57 4.2. PRELIMINARIES 45 (cr0) can(k) O {a a Γ O occurs in A} {s o s o V O occurs in A}, can(k) V {d d Γ V occurs in A} {s v s v V V occurs in A}, a can(k) = a, for each object constant a, d can(k) = d, for each value constant d, A can(k) = {c o A(c a ) A}, for each atomic concept A, D can(k) = {c v D(c v ) A}, for each atomic value-domain D, U can(k) C = {(a, d) U C (a, d) A}, for each atomic concept attribute U C, U can(k) R = {(a 1, a 2, d) Y R (a 1, a 2, d) A}, for each atomic role attribute U R, P can(k) = {(a 1, a 2 ) P(a 1, a 2 ) A or P (a 2, a 1 ) A}, for each atomic role P (cr1) If a A can(k) 1, A 1 X 2 in T p, and a / X can(k) 2, then: 1. if X 2 = A 2, then add a to A can(k) 2 ; 2. if X 2 = Q 2, where Q 2 = P 2 P 2, then add (a, a n) to Q can(k) 2, where a n is a new element of O can(k) ; 3. if X 2 = Q 2, where Q = δ(u R2 ) δ(u R2 ), then add (a, a n, d n ) to U can(k) R 2, where a n is a new element of can(k) O and d n is a new element of can(k) V ; 4. if X = δ(u C ), then add (a, d n ) to U can(k) C, where d n is a new element of can(k) V. (cr2) If (a, a ) Q can(k) 1, Q 1 X 2 in T p, where Q = P 1 P1 then: 1. if X 2 = A 2, then add a to A can(k) 2 ;, and a / Xcan(K) 2, 2. if X 2 = Q 2, where Q 2 = P 2 P 2, then add (a, a n) to Q can(k) 2, where a n is a new element of O can(k) ; 3. if X 2 = Q 2 in T p, where Q 2 = δ(u R2 ) δ(u R2 ), then add (a, a n, d n ) to U can(k) R 2, where a n is a new element of can(k) O and d n is a new element of can(k) V ; 4. if X 2 = δ(u C ), then add (a, d n ) to U can(k) C, where d n is a new element of can(k) V. (cr3) If (a, d ) U can(k) C 1, δ(u C1 ) X 2 in T p, and a / X can(k) 2, then: 1. if X 2 = A 2, then add a to A can(k) 2 ; 2. if X 2 = Q 2, where Q 2 = P 2 P 2, then add (a, a n) to Q can(k) 2, where a n is a new element of O can(k) ; 3. if X 2 = Q 2, where Q 2 = δ(u R2 ) δ(u R2 ), then add (a, a n, d n ) to Q can(k) 2, where a n is a new element of can(k) O and d n is a new element of can(k) V ;

      4. if X₂ = δ(U_C₂), then add (a, d_n) to U_C₂^can(K), where d_n is a new element of Δ_V^can(K).

(cr4) If (a, a′, d′) ∈ U_R₁^can(K), ∃Q₁ ⊑ X₂ is in T_p, where Q₁ = δ(U_R₁) or Q₁ = δ(U_R₁)⁻, and a ∉ X₂^can(K), then:
      1. if X₂ = A₂, then add a to A₂^can(K);
      2. if X₂ = δ(U_C), then add (a, d_n) to U_C^can(K), where d_n is a new element of Δ_V^can(K);
      3. if X₂ = ∃Q₂, where Q₂ = P₂ or Q₂ = P₂⁻, then add (a, a_n) to Q₂^can(K), where a_n is a new element of Δ_O^can(K);
      4. if X₂ = ∃Q₂, where Q₂ = δ(U_R₂) or Q₂ = δ(U_R₂)⁻, then add (a, a_n, d_n) to U_R₂^can(K), where a_n is a new element of Δ_O^can(K) and d_n is a new element of Δ_V^can(K).

(cr5) If (a₁, a₂) ∈ Q₁^can(K), Q₁ ⊑ X₂ is in T_p, where Q₁ = P₁ or Q₁ = P₁⁻, and (a₁, a₂) ∉ X₂^can(K), then:
      1. if X₂ = P₂ or X₂ = P₂⁻, then add (a₁, a₂) to X₂^can(K);
      2. if X₂ = δ(U_R₂) or X₂ = δ(U_R₂)⁻, then add (a₁, a₂, d_n) to U_R₂^can(K), where d_n is a new element of Δ_V^can(K).

(cr6) If (a₁, a₂, d′) ∈ U_R₁^can(K), Q₁ ⊑ X₂ is in T_p, where Q₁ = δ(U_R₁) or Q₁ = δ(U_R₁)⁻, and (a₁, a₂) ∉ X₂^can(K), then:
      1. if X₂ = P₂ or X₂ = P₂⁻, then add (a₁, a₂) to X₂^can(K);
      2. if X₂ = δ(U_R₂) or X₂ = δ(U_R₂)⁻, then add (a₁, a₂, d_n) to U_R₂^can(K), where d_n is a new element of Δ_V^can(K).

(cr7) If d ∈ D₁^can(K), D₁ ⊑ X₂ is in T_p, and d ∉ X₂^can(K), then:
      1. if X₂ = D₂, then add d to D₂^can(K);
      2. if X₂ = ρ(U_C₂), then add (a_n, d) to U_C₂^can(K), where a_n is a new element of Δ_O^can(K);
      3. if X₂ = ρ(U_R₂), then add (a_n, a′_n, d) to U_R₂^can(K), where a_n, a′_n are new elements of Δ_O^can(K).

(cr8) If (a, d) ∈ U_C₁^can(K), ρ(U_C₁) ⊑ X₂ is in T_p, and d ∉ X₂^can(K), then:
      1. if X₂ = D₂, then add d to D₂^can(K);
      2. if X₂ = ρ(U_C₂), then add (a_n, d) to U_C₂^can(K), where a_n is a new element of Δ_O^can(K);
      3. if X₂ = ρ(U_R₂), then add (a_n, a′_n, d) to U_R₂^can(K), where a_n, a′_n are new elements of Δ_O^can(K).

(cr9) If (a, a′, d) ∈ U_R₁^can(K), ρ(U_R₁) ⊑ X₂ is in T_p, and d ∉ X₂^can(K), then:
      1. if X₂ = D₂, then add d to D₂^can(K);
      2. if X₂ = ρ(U_C₂), then add (a_n, d) to U_C₂^can(K), where a_n is a new element of Δ_O^can(K);
      3. if X₂ = ρ(U_R₂), then add (a_n, a′_n, d) to U_R₂^can(K), where a_n, a′_n are new elements of Δ_O^can(K).

(cr10) If (a, d) ∈ U_C₁^can(K), U_C₁ ⊑ U_C₂ is in T_p, and (a, d) ∉ U_C₂^can(K), then add (a, d) to U_C₂^can(K).

(cr11) If (a₁, a₂, d) ∈ U_R₁^can(K), U_R₁ ⊑ U_R₂ is in T_p, and (a₁, a₂, d) ∉ U_R₂^can(K), then add (a₁, a₂, d) to U_R₂^can(K).

The rules in the previous definition are called chase rules. Although they are numerous and may look complicated, intuitively they simply aim at constructing a Herbrand interpretation of K satisfying the ABox and the set T_p of PI assertions. In particular, we have the following notable property of can(K).

Proposition 4.2.2. Let K = ⟨T, A⟩ be a satisfiable DL-Lite_A KB, and let µ be an assignment for A. Then, for each model I = (Δ^I, ·^I) of K with µ, there exists a homomorphism Ψ from can(K) to I, i.e., a function Ψ such that, for each j-tuple t of elements of Δ^can(K), with j ∈ {1, 2, 3},

    t ∈ X^can(K) implies Ψ(t) ∈ X^I    (4.1)

where X denotes either a concept (in which case j = 1), a value-domain (j = 1), a role (j = 2), a concept attribute (j = 2), or a role attribute (j = 3) in K.

Proof. Let I = (Δ^I, ·^I) be a model of K with µ. We next show how to build a function Ψ from Δ^can(K) to Δ^I, proceeding by induction on the construction of can(K). Simultaneously, we show that Ψ is a homomorphism, i.e., that Ψ satisfies (4.1).

Base step: For each membership assertion α:
– If α = X(s_v), where X denotes either an atomic concept or an atomic value-domain and s_v ∈ V, then, by construction of can(K), we have that s_v ∈ Δ^can(K) and s_v ∈ X^can(K). We then set Ψ(s_v) = µ(s_v).
Thus, since I is a model of α, we have that µ(s_v) ∈ X^I, and (4.1) is satisfied.
– If α = X(t), where X denotes any atomic expression and t = (t₁, …, t_j) ∈ Γ^j, for j = 1, 2, 3, then, by construction of can(K), we have that t_i ∈ Δ^can(K) for each i = 1, …, j, and t ∈ X^can(K). We then set Ψ(t) = t^I. Thus, since I is a model of α, we have that t^I ∈ X^I, and (4.1) is satisfied.

60 48 CHAPTER 4. DL-LITE A REASONING Inductive step: Let can i (K) be the portion of can(k) after i applications of the chase rules. According to the inductive hypothesis, we have that can i (K) satisfies 4.1, i.e. for each tuple t can(k), t X can(k) Ψ( t) I X I. Suppose now that can i+1 (K) is the portion of can(k) that is obtained from can i (K) by application of one among the chase rules, say for instance the rule cr1. Thus suppose that a B can(k) and B X T p. By inductive hypothesis we have that Ψ(a) B I where Ψ(a) I. Now, depending on the form of X the application of cr1 may lead to one of the following cases: if X = A, then we have that a A can i+1(k) ; moreover, since I is a model of K, we have that I satisfies B A and thus, Ψ(a) A I ; if X 2 = Q 2, where Q = δ(u R2 ) δ(u R2 ), then we have that (a, a n, d n ) Q can i+1(k) 2, where a n, d n are new elements resp. of can(k) O, can(k) V ; therefore, Ψ(a n ) and Ψ(d n ) were not yet defined; moreover, since I is a model of T p, then there must exist two elements o I O, w I V such that (Ψ(a), o, w) Q I 2 ; then, by setting Ψ(a n) = o and Ψ(d n ) = w, we obtain (Ψ(a), Ψ(a n ), Ψ(d n )) Q I 2 ; if X = Q 2, where Q 2 = P 2 P2, then we have that (a, a n) Q can i+1(k) 2, where a n is a new element of can(k) O ; therefore, Ψ(a n ) was not yet defined; moreover, since I is a model of T p, then there must exist an element o I O such that (Ψ(a), o) Q I 2 ; then, by setting Ψ(a n ) = o, we obtain (Ψ(a), Ψ(a n )) Q I 2 ; if X = δ(u C ), then we have that (a, d n ) Q can i+1(k) 2, where d n is a new element of V can(k) ; therefore, Ψ(d n ) was not yet defined; moreover, since I is a model of T p,, then there must exist an element v V I such that (Ψ(a), v) U I C ; then, by setting Ψ(d n) = v, we obtain (Ψ(a), Ψ(d n )) U I C. Thus, we proved that if can i+1 (K) is obtained by application of rule cr1 then it still satisfies 4.1. Proceeding analogously with the other chase rules, we can easily prove the claim. 
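To illustrate the flavor of the chase rules above, here is a minimal sketch for a toy fragment with only atomic concepts, atomic roles, and PIs of the forms A₁ ⊑ A₂ and A₁ ⊑ ∃P. The tuple encoding, the names, and the step bound are illustrative assumptions of this sketch (the actual chase of a DL-Lite_A KB may be infinite, so this bounded approximation is not the construction used in the thesis):

```python
from itertools import count

def chase(abox, pis, max_steps=100):
    """Naive bounded chase for a toy DL-Lite fragment.
    Facts: ('A', a) for atomic concepts, ('P', a, b) for atomic roles.
    PIs:   ('isa', A1, A2) encodes A1 <= A2;
           ('ex',  A1, P)  encodes A1 <= exists P (needs a witness)."""
    facts = set(abox)
    fresh = (f"_o{i}" for i in count())       # fresh objects, as in rule cr1
    for _ in range(max_steps):                # the real chase may not terminate
        new = set()
        for tag, lhs, rhs in pis:
            members = [f[1] for f in facts if f[0] == lhs and len(f) == 2]
            for a in members:
                if tag == "isa" and (rhs, a) not in facts:
                    new.add((rhs, a))         # analogue of cr1, case 1
                elif tag == "ex" and not any(
                        f[0] == rhs and len(f) == 3 and f[1] == a for f in facts):
                    new.add((rhs, a, next(fresh)))  # analogue of cr1, case 2
        if not new:
            break                             # fixpoint reached
        facts |= new
    return facts

facts = chase({("Manager", "ann")},
              [("isa", "Manager", "Employee"), ("ex", "Employee", "worksFor")])
# ann becomes an Employee and gets a fresh worksFor-successor
```

Starting from the single fact Manager(ann), the sketch derives Employee(ann) and then a worksFor-fact whose second component is a fresh object, mirroring how the chase rules introduce new elements of Δ_O^can(K).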
The above proposition is important because it shows that, if K is satisfiable, then can(K) can be seen as a representative of all the models of K with µ. As we will see, we will use this property of can(K) several times throughout our proofs.

Closure of negative inclusions

Following the same approach as [31], we next introduce the notion of NI-closure, which results from adapting the corresponding notion of [31] to our logic.

Definition. Let T be a DL-Lite_A TBox. We call NI-closure of T, denoted by cln(T), the TBox obtained inductively as follows:

1. all negative inclusion assertions in T are also in cln(T);

2. if B₁ ⊑ B₂ is in T and B₂ ⊑ ¬B₃ or B₃ ⊑ ¬B₂ is in cln(T), then also B₁ ⊑ ¬B₃ is in cln(T);

3. if E₁ ⊑ E₂ is in T and E₂ ⊑ ¬E₃ or E₃ ⊑ ¬E₂ is in cln(T), then also E₁ ⊑ ¬E₃ is in cln(T);

4. if Q₁ ⊑ Q₂ is in T and ∃Q₂ ⊑ ¬B or B ⊑ ¬∃Q₂ is in cln(T), then also ∃Q₁ ⊑ ¬B is in cln(T);

5. if Q₁ ⊑ Q₂ is in T and ∃Q₂⁻ ⊑ ¬B or B ⊑ ¬∃Q₂⁻ is in cln(T), then also ∃Q₁⁻ ⊑ ¬B is in cln(T);

6. if Q₁ ⊑ Q₂ is in T and Q₂ ⊑ ¬Q₃ or Q₃ ⊑ ¬Q₂ is in cln(T), then also Q₁ ⊑ ¬Q₃ is in cln(T);

7. if one of the assertions ∃Q ⊑ ¬∃Q, ∃Q⁻ ⊑ ¬∃Q⁻, or Q ⊑ ¬Q is in cln(T), then all three such assertions are in cln(T);

8. if U_C₁ ⊑ U_C₂ is in T and δ(U_C₂) ⊑ ¬B or B ⊑ ¬δ(U_C₂) is in cln(T), then also δ(U_C₁) ⊑ ¬B is in cln(T);

9. if U_C₁ ⊑ U_C₂ is in T and ρ(U_C₂) ⊑ ¬E or E ⊑ ¬ρ(U_C₂) is in cln(T), then also ρ(U_C₁) ⊑ ¬E is in cln(T);

10. if U_C₁ ⊑ U_C₂ is in T and U_C₂ ⊑ ¬U_C₃ or U_C₃ ⊑ ¬U_C₂ is in cln(T), then also U_C₁ ⊑ ¬U_C₃ is in cln(T);

11. if one of the assertions ρ(U_C) ⊑ ¬ρ(U_C), δ(U_C) ⊑ ¬δ(U_C), or U_C ⊑ ¬U_C is in cln(T), then all three such assertions are in cln(T);

12. if U_R₁ ⊑ U_R₂ is in T and ρ(U_R₂) ⊑ ¬E or E ⊑ ¬ρ(U_R₂) is in cln(T), then also ρ(U_R₁) ⊑ ¬E is in cln(T);

13. if U_R₁ ⊑ U_R₂ is in T and δ(U_R₂) ⊑ ¬P or P ⊑ ¬δ(U_R₂) is in cln(T), then also δ(U_R₁) ⊑ ¬P is in cln(T);

14. if U_R₁ ⊑ U_R₂ is in T and δ(U_R₂)⁻ ⊑ ¬P or P ⊑ ¬δ(U_R₂)⁻ is in cln(T), then also δ(U_R₁)⁻ ⊑ ¬P is in cln(T);

15. if U_R₁ ⊑ U_R₂ is in T and U_R₂ ⊑ ¬U_R₃ or U_R₃ ⊑ ¬U_R₂ is in cln(T), then also U_R₁ ⊑ ¬U_R₃ is in cln(T);

16. if one of the assertions ρ(U_R) ⊑ ¬ρ(U_R), δ(U_R) ⊑ ¬δ(U_R), or U_R ⊑ ¬U_R is in cln(T), then all three such assertions are in cln(T).

Example. Consider the TBox T of the earlier example. Clearly, the NI-closure of T is the following set of NIs:

    manager ⊑ ¬δ(until)    (4.2)
    manager ⊑ ¬tempemp     (4.3)
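As an illustration, the inductive rules above amount to a plain fixpoint computation. The sketch below is a simplified miniature, not the thesis algorithm: it implements only rules 1 and 2, restricted to basic concepts encoded as strings, with a PI (B1, B2) meaning B1 ⊑ B2 and an NI (B1, B2) meaning B1 ⊑ ¬B2:

```python
def ni_closure(pos, neg):
    """Fixpoint sketch of cln(T) restricted to basic concepts.
    Rule 1: every NI of T is in cln(T).
    Rule 2: if B1 <= B2 in T and B2 <= not-B3 or B3 <= not-B2 in cln(T),
            then B1 <= not-B3 in cln(T)."""
    cln = set(neg)                         # rule 1
    changed = True
    while changed:                         # iterate rule 2 to a fixpoint
        changed = False
        for b1, b2 in pos:
            derived = {(b1, b3) for (x, b3) in cln if x == b2}
            derived |= {(b1, b3) for (b3, x) in cln if x == b2}
            if not derived <= cln:
                cln |= derived
                changed = True
    return cln

cln = ni_closure({("manager", "employee")}, {("employee", "tempemp")})
# manager <= not-tempemp is derived from manager <= employee
# and employee <= not-tempemp
```

The concept names here are illustrative; the full cln(T) would apply the remaining rules for value-domains, roles, and attributes in the same fixpoint loop.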

4.3 Satisfiability of a DL-Lite_A KB

In this section, we investigate the satisfiability of a DL-Lite_A KB. To this aim, following an approach similar to that of [31], we first show some notable properties of the notions introduced in the previous section, and then we show how to exploit such properties to provide an algorithm for checking DL-Lite_A KB satisfiability. Finally, we study its complexity.

Foundations of the algorithm for satisfiability

The algorithm for checking the satisfiability of a DL-Lite_A KB strongly relies on the notions introduced in the previous section. Thus, we start by giving results that relate all these notions to DL-Lite_A KB satisfiability. Specifically, the lemma below shows that the canonical model of a KB always satisfies the set of positive inclusions; moreover, it shows that can(K) is a model of the ABox with an assignment µ₀.

Lemma. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then, we have that:
1. can(K) ⊨ T_p, where T_p denotes the set of positive inclusions in T;
2. there exists a most general assignment µ₀ for A w.r.t. can(K) such that can(K) ⊨ A[µ₀].

Proof. It is easy to see that 1 follows directly from the definition of can(K): indeed, can(K) is built in such a way that every PI in T_p is satisfied (cf. rules cr_i, for i = 1, …, 10). Let us now consider 2. By rule cr0, we have that can(K) ⊨ α for each membership assertion α not involving soft constants. Now, let us construct an assignment µ₀ as follows: for each s ∈ V, µ₀(s) = s. Clearly, by construction, µ₀ is a most general assignment for A w.r.t. can(K). Moreover, can(K) ⊨ α[µ₀] for each membership assertion α.

In contrast with the previous lemma, the following one shows that the canonical model of a KB satisfies the set of functionality assertions if and only if the minimal model of A satisfies such a set of assertions.

Lemma. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then, can(K) ⊨ T_k if and only if db(A) ⊨ T_k, where T_k denotes the set of functionality assertions in T.

Proof. (⟹) By construction, db(A) exactly coincides with the interpretation obtained by applying only rule cr0. Thus, if can(K) satisfies T_k then, clearly, db(A) satisfies T_k.
(⟸) Suppose that db(A) satisfies T_k. Then, by induction on the construction of can(K), we show that can(K) satisfies T_k.
Base step: By hypothesis, db(A) satisfies T_k.

Inductive step: Let can_i(K) be the portion of can(K) obtained after i applications of the chase rules. Suppose that can_i(K) ⊨ T_k and, by contradiction, suppose that can_{i+1}(K) ⊭ T_k, where can_{i+1}(K) is obtained from can_i(K) by applying one of the rules cr_j, with j ∈ {1, …, 11}. It is worth noticing first that not all the rules can cause the violation of a functionality assertion. In particular, there are three types of safe rules:

– first type: rules triggered by an inclusion assertion between concepts or value-domains, whose right-hand side involves neither a role, nor a concept attribute, nor a role attribute;
– second type: rules triggered by an inclusion assertion between concepts or value-domains, whose right-hand side involves a role, a concept attribute, or a role attribute that is not involved in any functionality assertion in T_k;
– third type: rules triggered by an inclusion assertion among roles, concept attributes, or role attributes.

For all these types of rules, assuming that can_{i+1}(K) violates a functionality assertion α would make us conclude that α is already violated in can_i(K), which would lead to a contradiction. Indeed, the application of a rule of the first type does not modify the interpretation of any role, concept attribute, or role attribute. Concerning the rules of the second type, their application modifies only the interpretation of a role, a concept attribute, or a role attribute that is not involved in any functionality assertion. Finally, a similar argument holds for rules of the third type since, by definition of DL-Lite_A, the right-hand side of inclusions between roles, concept attributes, and role attributes does not involve any expression that is also involved in a functionality assertion in T_k.
Therefore, the only rules that may cause can i+1 (K) to violate T k, are the rules triggered by the presence of a concept inclusion assertion whose right-hand side involves a role P or P, a concept attribute U C or a role attribute U R such that T k contains resp. the assertions (funct P), (funct P ), or (funct U C ) or (funct U R ). For instance, let us assume that can i+1 (K) is obtained from can i (K) by application of rule cr1, where we assume that a A can i(k) 1, A 1 X 2 in T p, and a / P can i+1(k). Moreover, we assume that there exists a functionality assertion α involving P which is not satisfied by can i+1 (K). However, in the case in which α = (funct P), for α to be violated, there must exist two pairs of objects (x, y), (x, z) P can i+1(k) such that y z; since we have that (o, o n ) P can i+1(k) and o / P can i(k), there exists no pair (o, o ) P can i+1(k) such that o o n. Hence, we should conclude that the pairs (x, y), (x, z) we are looking for, are such that (x, y), (x, z) P can i(k), but this would lead to a contradiction; in the case in which α = (funct P ), for α to be violated, there must exist two pairs of objects (y, x), (z, x) P can i+1(k) such that y z; since o n is a fresh object, we can conclude that there exists no pair (o, o n )

64 52 CHAPTER 4. DL-LITE A REASONING P can i+1(k) such that o o. Hence, we should conclude that the pairs (y, x), (z, x) we are looking for, are such that (y, x), (z, x) P can i(k), but this would lead to a contradiction. Clearly, with a similar argument we may prove that the claim holds also when other apparently not safe chase rules are applied to can i (K). In the same spirit of [31], we continue characterizing when the canonical model of a KB satisfies the assertions forming the KB. Until now we have considered the set of PI s and the set of functionality assertions. Let us now consider the case NI s. To this aim, we need to use the notion of NI-closure introduced in the previous section. Lemma Let K = T, A be a DL-Lite A KB. Then, can(k) = T ni db(a) = cln(t ni ). where T ni denotes a set of negative inclusions in T. Proof. : Suppose that can(k) is a model of T ni and suppose by contradiction that db(a) does not satisfy an assertion in cln(t ni ). Since can(k) is a model of T ni and cln(t ni ) denotes the set of assertions that are logically implied by T ni, we have that can(k) = cln(t ni ). But then we obtain a contradiction since db(a) coincides with the portion of can(k) obtained by application of rule cr0. : Suppose that db(a) is a model of cln(t ni ). We prove that can(k) is a model of T ni by induction on the structure of can(k). Base step: By hypothesis, db(a) satisfies T ni. Inductive step: Let can i (K) be the portion of can(k) obtained after i applications of the chase rules. Suppose that can i (K) = T ni and, by contradiction, suppose that can i+1 (K) = T ni, where can i+1 (K) is obtained from can i (K) by applying one of the chase rules. For instance, suppose that can i+1 (K) is obtained by application of rule cr1 to can i (K), where we assume that there exists a can(k) such that a A can i(k) 1, A 1 X 2 be a PI in T p and a / X can i(k) 2. Then, we have that a X can i+1(k) 2. 
Now, if can i+1 (K) is not a model of T ni, then there must exist a NI α in T ni that is not satisfied by can i+1 (K). However, by hypothesis, can i (K) and can i+1 (K) differ only for the fact that a / A can i(k) 2 and a A can i+1(k) 2. Then, in order for can i+1 (K) to violate α, this must involve X 2. Thus, for instance, α may assume the form Y 1 A 2 where a Y can i+1(k) 1. But then, since Y 1 A 2 and A 1 A 2, then also A 1 Y 1 belongs to cln(t ni ). Thus, we obtain a contradiction since a Y can i(k) 1 and a A can i+1(k) 1, which contradicts that can i (K) satisfies cln(t ni ). Clearly, with a similar argument we can prove the inductive step even in those cases in which can i+1 (K) is obtained by can i (K) by applying one among the other chase rules.

65 4.3. SATISFIABILITY OF A DL-LITE A KB 53 Next, by giving the two following propositions, we put everything together and we set up the basis of the algorithm for satisfiability. Proposition Let K = T, A be a DL-Lite A KB. Then, µ 0,can(K) = K[µ 0 ] db(a) = T k cln(t ), where µ 0 is a most general assignment for A w.r.t. can(k). Proof. The proof follows directly from lemmas 4.3.2,4.3.3 and Proposition Let K = T, A be a DL-Lite A KB. K is satisfiable µ 0 can(k) = K[µ 0 ], where µ 0 is a most general assignment for A w.r.t. can(k). Proof. : Trivially, if there exists an assignment µ 0 such that can(k) is a model of K with µ 0, then K is satisfiable. : Suppose that K is satisfiable and, by contradiction, that there exists no most general assignment µ 0 for A w.r.t. can(k) such that can(k) is a model of K with µ 0. Since K is satisfiable, by Proposition 3.1.8, there exists an interpretation I and a most general assignment µ 0 for A w.r.t. I such that I is a model of K with µ 0. Then, since can(k) is not a model of K with µ 0, we have in particular that can(k) = K[µ 0 ]. Therefore, by Proposition 4.3.4, we have that either db(a) = T k or db(a) = cln(t ni ). Suppose that db(a) violates, for instance, a role functionality assertion (funct P). By construction of db(a) there exist a 1, a 2, a 3 Γ, a 1 a 2, such that P(a 1, a 2 ), P(a 1, a 3 ) A. But then no model of K satisfies A, thus contradicting the hypothesis that A is satisfiable. Clearly, we would obtain a contradiction also by supposing that db(a) violates another type of functionality assertion. Suppose now that db(a) violates a NI assertion in cln(t ni ). By Lemma 4.3.3, we have that can(k) = T ni. Suppose then for instance that can(k) does not satisfy the NI A B. Then, there must exist a, b can(k) such that a A can(k), and a B can(k). But then, by Proposition 4.2.2, since K is satisfiable and I is a model of K with µ 0, then there exists a homomorphism Ψ from can(k) to I. 
Thus we have Ψ(a) A I and Ψ(a) B I, which clearly contradicts the fact that I is a model of T since I does not satisfy the NI A B that is logically implied by T. Clearly, from the previous two propositions, we are finally able to state the following crucial theorem, that is at the heart of our algorithm for checking DL-Lite A KB satisfiability:

Input: a DL-Lite_A TBox T and a database DB representing K = ⟨T, A⟩ in the context of T
Output: true or false

(1) for each F = (funct X) ∈ T_k do
      Q ← ViolateFunct(F);
      Q ← RewDB(Q);
      if ans(Q, DB) = true then return false
(2) for each NI = X₁ ⊑ ¬X₂ ∈ cln(T_ni) do
      Q ← ViolateNI(NI);
      Q ← RewDB(Q);
      if ans(Q, DB) = true then return false
return true

Figure 4.1: Algorithm Sat(K)

Theorem. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then K is satisfiable if and only if db(A) ⊨ T_k ∪ cln(T).

Proof. Trivial, from Proposition 4.3.4 and Proposition 4.3.5.

Clearly, the above theorem is crucial for our purposes, since it allows us to reduce the satisfiability check of a DL-Lite_A KB to the problem of checking whether a finite model satisfies a set of assertions.

Satisfiability algorithm

Given all the previous results, we are now ready to define, in Figure 4.1, the algorithm Sat for checking the satisfiability of a DL-Lite_A KB. Informally, the algorithm takes as input a DL-Lite_A TBox T and a database DB representing a DL-Lite_A KB K = ⟨T, A⟩ in the context of T, as discussed in Section 4.1, and returns true or false by proceeding as follows. For each functionality assertion F in T, the algorithm starts by constructing a first-order logic query Q that checks whether the functionality assertion (resp., the NI assertion) is violated in the minimal model db(A) of the ABox. To this aim, it calls a function ViolateFunct(F), shown in Figure 4.2, that takes as input any functionality assertion of the form F = (funct X) and returns the boolean first-order logic query Q asking whether there exist two tuples of constants that are both interpreted in db(A) as belonging to X and that together violate F.
Similarly, for each NI X₁ ⊑ ¬X₂ in cln(T), the algorithm calls the function ViolateNI(NI), shown in Figure 4.3, that takes as input any negative inclusion assertion of the form NI = X₁ ⊑ ¬X₂ in the NI-closure of T and returns the boolean first-order logic query Q asking whether there exists a tuple of constants that is interpreted in db(A) as belonging simultaneously to both X₁ and X₂. Afterwards, by means of the function RewDB, Q is rewritten in terms of the database DB. More precisely, given a first-order logic query Q, RewDB builds a query over DB that is

obtained from Q by replacing each occurrence of an atomic expression X (either a concept, a value-domain, a role, a concept attribute, or a role attribute) with the corresponding relation X in DB (note that, by hypothesis, for each DL-Lite_A expression there exists a corresponding relation in DB). Finally, each rewritten query Q is evaluated over DB, and the algorithm returns false if at least one such evaluation returns true, and true otherwise.

Input: a DL-Lite_A functionality assertion F = (funct X)
Output: a boolean query

Case X of:
  X = P:    return q() ← P(w, x) ∧ P(w, y) ∧ x ≠ y;
  X = P⁻:   return q() ← P(x, w) ∧ P(y, w) ∧ x ≠ y;
  X = U_C:  return q() ← U_C(w, x) ∧ U_C(w, y) ∧ x ≠ y;
  X = U_R:  return q() ← U_R(w₁, w₂, x) ∧ U_R(w₁, w₂, y) ∧ x ≠ y;

Figure 4.2: Function ViolateFunct

We have the following lemma:

Lemma. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then K is unsatisfiable if and only if Q^db(A) = true for some query Q such that Q = ViolateFunct(X) for some functionality assertion X ∈ T_k, or Q = ViolateNI(X) for some NI assertion X ∈ cln(T).

Proof. The proof follows directly from the theorem above.

Given the previous lemma, by construction of the algorithm Sat, we can immediately claim the correctness of Sat(K):

Theorem. Let K be a DL-Lite_A KB. Then K is satisfiable if and only if Sat(K) = true.

From the results in the previous section, we can establish the computational complexity of the satisfiability problem for a DL-Lite_A KB. The proof is omitted, since it can be straightforwardly adapted from [31].

Theorem. Given a DL-Lite_A KB K, Sat(K) runs in LOGSPACE in the size of the database used to represent K (data complexity) and in PTIME in the size of the whole knowledge base (combined complexity).
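Since db(A) is stored as a database, the boolean queries produced by ViolateFunct can be evaluated directly in SQL. The following sketch covers the four cases of Figure 4.2; the relation and column names (o1, o2, o, v) are illustrative assumptions about how the atomic expressions are stored:

```python
import sqlite3

def violate_funct_sql(x):
    """Sketch of ViolateFunct emitting SQL over db(A).  Assumed storage:
    roles P(o1, o2), concept attributes U_C(o, v), role attributes
    U_R(o1, o2, v).  x = (kind, relation); the query returns a row iff
    the corresponding functionality assertion is violated."""
    kind, r = x
    if kind == "P":      # (funct P): one subject, two distinct fillers
        cond = "t1.o1 = t2.o1 AND t1.o2 <> t2.o2"
    elif kind == "P-":   # (funct P-): one filler side, two distinct subjects
        cond = "t1.o2 = t2.o2 AND t1.o1 <> t2.o1"
    elif kind == "UC":   # (funct U_C): one object, two distinct values
        cond = "t1.o = t2.o AND t1.v <> t2.v"
    elif kind == "UR":   # (funct U_R): one object pair, two distinct values
        cond = "t1.o1 = t2.o1 AND t1.o2 = t2.o2 AND t1.v <> t2.v"
    else:
        raise ValueError(kind)
    return f"SELECT 1 FROM {r} t1, {r} t2 WHERE {cond}"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE WORKS_FOR (o1 TEXT, o2 TEXT)")
con.executemany("INSERT INTO WORKS_FOR VALUES (?, ?)",
                [("ann", "acme"), ("ann", "initech")])
# ann has two WORKS_FOR fillers, so (funct WORKS_FOR) is violated:
violated = con.execute(violate_funct_sql(("P", "WORKS_FOR"))).fetchone() is not None
```

This mirrors step (1) of Sat: a non-empty answer to the violation query means the KB is unsatisfiable.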

Input: a DL-Lite_A NI assertion NI = X₁ ⊑ ¬X₂
Output: a boolean query

Case NI of:
  NI is a concept inclusion:
    body ← {};
    for i = 1, 2 do
      Case X_i of:
        X_i = A_i:        body ← body ∧ A_i(x);
        X_i = ∃P_i:       body ← body ∧ P_i(x, v);
        X_i = ∃P_i⁻:      body ← body ∧ P_i(v, x);
        X_i = δ(U_C_i):   body ← body ∧ U_C_i(x, v);
        X_i = ∃δ(U_R_i):  body ← body ∧ U_R_i(x, v, w);
        X_i = ∃δ(U_R_i)⁻: body ← body ∧ U_R_i(v, x, w);
  NI is a value-domain inclusion:
    body ← {};
    for i = 1, 2 do
      Case X_i of:
        X_i = D_i:       body ← body ∧ D_i(x);
        X_i = ρ(U_C_i):  body ← body ∧ U_C_i(v, x);
        X_i = ρ(U_R_i):  body ← body ∧ U_R_i(v, w, x);
  NI is a role inclusion:
    body ← {};
    for i = 1, 2 do
      Case X_i of:
        X_i = P_i:        body ← body ∧ P_i(x, y);
        X_i = P_i⁻:       body ← body ∧ P_i(y, x);
        X_i = δ(U_R_i):   body ← body ∧ U_R_i(x, y, v);
        X_i = δ(U_R_i)⁻:  body ← body ∧ U_R_i(y, x, v);
  NI is a concept attribute inclusion (i.e., X₁ = U_C₁ and X₂ = U_C₂):
    body ← U_C₁(x, y) ∧ U_C₂(x, y);
  NI is a role attribute inclusion (i.e., X₁ = U_R₁ and X₂ = U_R₂):
    body ← U_R₁(x, y, z) ∧ U_R₂(x, y, z);
return q() ← body;

Figure 4.3: Function ViolateNI
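Analogously, the check performed by the query of ViolateNI for a concept NI X₁ ⊑ ¬X₂ amounts to intersecting two extensions. A minimal sketch, with db(A) modeled as a dictionary from concept names to their extensions (an illustrative assumption, not the thesis representation):

```python
def violate_ni(db, x1, x2):
    """Check performed by the query of ViolateNI for a concept NI
    X1 <= not-X2: db maps each atomic concept to its set of constants
    in db(A); the NI is violated iff the two extensions intersect."""
    return bool(db.get(x1, set()) & db.get(x2, set()))

db = {"manager": {"ann"}, "tempemp": {"ann", "bob"}}
violate_ni(db, "manager", "tempemp")   # True: ann is asserted in both
```

A True result corresponds to step (2) of Sat returning false, i.e., the KB is unsatisfiable.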

69 4.4. QUERY ANSWERING OVER DL-LITE A KB Query answering over DL-Lite A KB In what follows, as we did for satisfiability, we start by presenting preliminary results that are at the heart of the algorithm for query answering over a DL-Lite A KB. Then, after a discussion about the relation between query answering in DL-Lite A and query answering in other DLs of the DL-Lite family, we present our query answering algorithm and discuss its correctness and complexity Foundations of query answering algorithm Similarly to satisfiability, the algorithm for solving query answering over a DL-Lite A KB relies on the existence of the canonical model and on its properties. Thus, we start by giving a crucial result that relates the canonical model to DL-Lite A KB query answering. Specifically, the lemma below shows that given a union of conjunctive queries Q over K, if we were able to query the canonical model of a KB, then we would obtain all the answers to Q. Lemma Let K be a satisfiable DL-Lite A KB, and let Q be a union of conjunctive queries over K, of arity n. Moreover, let m be the number of distinct soft constants s j occurring in A. Then, ans(q, K) = { t = (t 1,, t n ) t can(k) Q can(k) µ 0, i {1,, n}, j {1,, m}, t can(k) i µ 0 (s j )} where µ 0 denotes a most general assignment for A w.r.t. can(k). Proof. Let t be a tuple of constants in Γ. First, suppose t ans(q, K). Since KB is satisfiable, by Proposition we have that for each model I of K there exists a most general assignment µ 0 for A such that (i) t I i µ 0(s j ) for each i {1,, n} and each j {1,, m}, (ii) for all models I of K with µ 0, we have that t I Q I. Moreover, by Proposition 4.3.5, since KB is satisfiable we have that can(k) is a model of K with some most general assignment µ 0 of A w.r.t. can(k). Thus, by Proposition 3.1.7, we have that can(k) mod K[µ 0 ] and since t ans(q, K), t can(k) Q can(k). Conversely, suppose t can(k) Q can(k), for some most general assignment µ 0. 
Let Q be the union of conjunctive queries Q = {q 1,...,q k } with q i defined as q i ( x i ) conj i ( x i, y i ) for each i {1,...,k}. Then, there exists i {1,...,k} such that exists an assignment σ : V can(k) that maps the variables V occurring in conj i ( t, y i ) to objects of can(k), such that all atoms in conj i ( t, y i ) under the assignment σ evaluate to true in can(k). Now let I be a model for K with µ 0. By Proposition 4.2.2, there is a homomorphism Ψ from can(k) to I. Consequently, the function obtained by composing Ψ and σ is a function that maps the variables V occurring in conj i ( t, y i ) to objects of the domain of I, such that all atoms in conj i ( t, y i ) under the assignment σ evaluate to true in I. Therefore, t I Q I. Then, by applying Proposition we obtain that: t ans(q, K).

70 58 CHAPTER 4. DL-LITE A REASONING Next, as in [31], we have a property that relates answering unions of conjunctive queries to answering conjunctive queries (the proof is omitted since it can be straightforwardly adapted from the one given in [31]). Theorem Let K be a DL-Lite A KB, and Q a union of conjunctive queries over K. Then, ans(q, K) = ans(q i, K) Query answering algorithm q i Q As already mentioned, the query answering technique for DL-Lite A as well as for the logics of the DL-Lite family introduced in [29], crucially relies on the existence of the canonical interpretation and on the property of such an interpretation to be representative of all models, as proved by Proposition Moreover, from Lemma it follows that query answering, similarly to satisfiability, can in principle be solved by evaluating the query over the canonical model can(k). However, since can(k) is in general infinite, we obviously avoid the construction of can(k). Rather, exactly in the same spirit of [31], our query answering method consists in first compiling the TBox into a finite reformulation of the query, that is afterwards evaluated over the minimal model db(a) of the ABox. This is achieved by applying an Algorithm PerfectRef. As we will see, the only difference of the whole approach that goes beyond the simple adaptation to DL-Lite A of the query answering algorithm proposed in [31], is due to the presence of soft constants in the ABox, whose treatment requires slightly modifying the reformulated query, i.e. the query obtained by means of the PerfectRef Algorithm, before evaluating it over the source database. Note that this is consistent with the formulation of Lemma According to the discussion above, we next adapt to DL-Lite A the approach proposed in [31] to solve query answering. Thus we start by presenting the algorithm for query reformulation, which is responsible for reformulating a query by compiling into the query itself the intensional knowledge in the TBox. 
Then, we present the complete algorithm for query answering. In order to use the reformulation technique of [31], we next define the notion of applicable inclusion assertion. Intuitively, an inclusion I is applicable to an atom g if the predicate of g is equal to the predicate on the right-hand side of I. Definition Let I be a PI inclusion assertion. We say that I is applicable to the atom g and, in this case, we indicate with gr(g, I) the atom obtained from the atom g by applying I if and only if: I is a concept inclusion assertion of the form I = B 1 B and g and B are as follows: g = A(x) and B = A, or, g = P(x, ) and B = P, or, g = P(, x) and B = P,

71 4.4. QUERY ANSWERING OVER DL-LITE A KB 59 or, g = U R (x,, ) and B = δ(u R ), or, g = U R (, x, ) and B = δ(u R ), or, g = U C (x, ) and B = δ(u C ). Then, the form of gr(g, I) depends on B 1 as follows: if B 1 = A 1 then gr(g, I) = A 1 (x); if B 1 = P 1, then gr(g, I) = P 1 (x, ); if B 1 = P 1, then gr(g, I) = P 1(, x); if B 1 = δ(u R1 ), then gr(g, I) = U R1 (x,, ); if B 1 = δ(u R1 ), then gr(g, I) = U R1 (, x, ); if B 1 = δ(u C1 ), then gr(g, I) = U C1 (x, ). I is a domain-value inclusion assertion of the form I = E 1 E and g and E are as follows: g = D(x) and E = D, or, g = U C (, x) and E = ρ(u C ), or, g = U R (,, x) and E = ρ(u R ). Then, the form of gr(g, I) depends on E 1 as follows: if E 1 = D 1 then gr(g, I) = D 1 (x); if E 1 = ρ(u C1 ), then gr(g, I) = U C1 (, x); if E 1 = ρ(u R1 ), then gr(g, I) = R 1 (,, x). I is a role inclusion assertion of the form I = Q 1 Q and g and Q are as follows: g = P(x 1, x 2 ) and Q = P or Q = P, or, g = U R (x 1, x 2, ) and Q = δ(u R ) or Q = δ(u R ). Then, the form of gr(g, I) depends on Q 1 and Q as follows: if Q 1 = P 1 and Q = P, or Q 1 = P 1 and Q = P, then gr(g, I) = P 1 (x 1, x 2 ); if Q 1 = P 1 and Q = P, or Q 1 = P 1 and Q = P, then gr(g, I) = P 1 (x 2, x 1 ); if Q 1 = δ(u R1 ), and Q = δ(u R ), or Q 1 = δ(u R1 ), and Q = δ(u R ) then gr(g, I) = R 1 (x 1, x 2, ); if Q 1 = δ(u R1 ), and Q = δ(u R ), or Q 1 = δ(u R1 ), and Q = δ(u R ) then gr(g, I) = R 1 (x 2, x 1, ); I is a concept attribute inclusion assertion of the form I = U C1 U C and g = U C (x 1, x 2 ). Then, we have that gr(g, I) = U C1 (x 1, x 2 ).

I is a role attribute inclusion assertion of the form I = U_R₁ ⊑ U_R and g = U_R(x₁, x₂, x₃). Then, we have that gr(g, I) = U_R₁(x₁, x₂, x₃).

Input: a conjunctive query q, a DL-Lite_A TBox T
Output: a union PR of conjunctive queries over db(A)

PR ← {q};
repeat
  PR′ ← PR;
  for each q′ ∈ PR′ do
    (a) for each atom g in q′ do
          for each PI I in T do
            if I is applicable to g then PR ← PR ∪ {q′[g/gr(g, I)]}
    (b) for each pair of atoms g₁, g₂ in q′ do
          if g₁ and g₂ unify then PR ← PR ∪ {τ(reduce(q′, g₁, g₂))};
until PR′ = PR;
return PR;

Figure 4.4: Algorithm PerfectRef(q, T)

We are now ready to define, in Figure 4.4, the algorithm PerfectRef, which reformulates a conjunctive query taking into account the PIs of a DL-Lite_A TBox. In the algorithm, q[g/g′] denotes the conjunctive query obtained from q by replacing the atom g with a new atom g′. Informally, the algorithm first reformulates the atoms of each conjunctive query q′ ∈ PR, and produces a new query for each atom reformulation (step (a)). Roughly speaking, PIs are used as rewriting rules, applied from right to left, which allow us to compile into the reformulation the intensional knowledge (represented by T) that is relevant for answering q. At step (b), for each pair of atoms g₁, g₂ that unify and occur in the body of a query q′, the algorithm computes the conjunctive query q″ = reduce(q′, g₁, g₂), obtained by applying to q′ the most general unifier between g₁ and g₂. We point out that, in unifying g₁ and g₂, each occurrence of the symbol _ has to be considered a different unbound variable. The most general unifier substitutes each symbol _ in g₁ with the corresponding argument in g₂, and vice versa (obviously, if both arguments are _, the resulting argument is _). Thanks to the unification, variables that are bound in q′ may become unbound in q″.
Hence, PIs that were not applicable to atoms of q may become applicable to atoms of q' (in the next executions of step (a)). Finally, note that the function τ, applied to q', replaces each occurrence of an unbound variable in q' with the symbol _.

Example. Consider again the DL-Lite_A KB K introduced earlier, and the query q asking for all workers, i.e., those objects that participate in the WORKS-FOR role:

q(x) ← WORKS-FOR(x, y).

Input: DL-Lite_A TBox T and database DB representing K = ⟨T, A⟩, UCQ Q
Output: ans(Q, K)
T' := cln(T);
if K is unsatisfiable then return AllTup(Q, K)
else
    Q' := ⋃_{q_i ∈ Q} PerfectRef(q_i, T');
    Q'' := RewDB(Q');
    Q''' := Clean(Q'');
    return ans(Q''', DB);

Figure 4.5: Algorithm Answer(Q, K)

The result of PerfectRef(q, T) is the following union of queries Q_p:

q(x) ← WORKS-FOR(x, y)
q(x) ← until(x, y, z)
q(x) ← tempemp(x)
q(x) ← employee(x)
q(x) ← manager(x)

The evaluation of Q_p over DB returns the set of certain answers to q over K. Roughly speaking, in order to return all workers, Q_p looks in those concepts, roles, and role attributes whose extension in DB, according to the knowledge specified by T, could provide objects that are workers. Clearly, as in [31], the proposition below holds.

Lemma. Let T be a DL-Lite_A TBox, let q be a conjunctive query over T, and let PR be the union of conjunctive queries returned by PerfectRef(q, T). For every DL-Lite_A ABox A such that ⟨T, A⟩ is satisfiable, ans(q, ⟨T, A⟩) = ans(PR, db(A)).

Proof. The proof is an obvious adaptation of the one proposed in [31].

We are finally able to present the algorithm Answer, shown in Figure 4.5, for answering a union of conjunctive queries over a KB. More precisely, the algorithm takes as input a DL-Lite_A KB K = ⟨T, A⟩, represented by means of a TBox T and a database DB, and a union of conjunctive queries Q of arity n, and returns the set of answers ans(Q, K). As already discussed, Answer is very similar to the algorithm presented in [31] for the computation of the certain answers to a query posed over a DL-Lite_F KB (or a DL-Lite_R KB). Indeed, it differs from the latter only because of the use of the functions (i) RewDB, already introduced and discussed when presenting the algorithm for checking the satisfiability of a DL-Lite_A KB, and (ii) Clean, responsible for constraining the answer not to include any soft constant, coherently

with the lemma above. More precisely, given a union of conjunctive queries Q, Clean proceeds as follows. For each query q in Q, it adds the set of atoms:

{ ¬Fresh(s_i) | s_i is a distinguished variable of q }

Observe that, if K is unsatisfiable, then, as expected, ans(Q, K) is the set of all possible tuples of constants in K whose arity is that of the query. We denote such a set by AllTup(Q, K). We now show the correctness of the algorithm Answer(Q, K).

Theorem. Let K = ⟨T, A⟩ be a DL-Lite_A KB, let Q be a union of conjunctive queries, let U be the set of tuples returned by Answer(Q, K), and let t be a tuple of constants in K. Then, t ∈ ans(Q, K) iff t ∈ U.

Proof. The proof can be straightforwardly adapted from the corresponding one in [31], by observing that PerfectRef computes the union of conjunctive queries that, once reformulated by replacing the DL-Lite_A expressions with the corresponding relations in db(A), would return all the answers that would also be returned by can(K). Thus, in order to select, among all such tuples, those not involving the fresh constants arbitrarily introduced by µ_0, we perform an additional selection by means of the function Clean.

Clearly, as for computational complexity, we get the same bounds as those shown in [31], thus achieving our goal. In particular, we have the following:

Lemma. Let T be a DL-Lite_A TBox, and let q be a conjunctive query over T. The algorithm PerfectRef(q, T) terminates and runs in time polynomial in the size of T.

Theorem. Given a DL-Lite_A KB K, Answer(K, Q) is PTIME in the size of the TBox, and LOGSPACE in the size of the database used to represent K (data complexity).
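As an illustration of the rewriting loop of PerfectRef, the following is a minimal executable sketch. It is a deliberate simplification, not the thesis algorithm: it implements only step (a), restricted to atomic-concept atoms A(x) and two kinds of positive inclusions, A_1 ⊑ A_2 and ∃P ⊑ A; the TBox and query names are illustrative.

```python
# Minimal sketch of the PerfectRef fixpoint (step (a) only), restricted to
# atoms A(x) and to positive inclusions A1 ⊑ A2 ("isa") and ∃P ⊑ A ("dom").
# Atoms are tuples; queries are frozensets of atoms; "_" is the unbound slot.

def gr(atom, pi):
    """Rewrite atom using inclusion pi applied right-to-left, or None."""
    kind = pi[0]
    if kind == "isa" and atom == (pi[2], "x"):   # A1 ⊑ A2, g = A2(x)
        return (pi[1], "x")                      # -> A1(x)
    if kind == "dom" and atom == (pi[2], "x"):   # ∃P ⊑ A, g = A(x)
        return (pi[1], "x", "_")                 # -> P(x, _)
    return None

def perfect_ref(query, tbox):
    """query: set of atoms; tbox: list of inclusions.
    Returns the set of reformulated queries (a union of CQs)."""
    pr = {frozenset(query)}
    while True:
        new = set(pr)
        for q in pr:
            for g in q:
                for pi in tbox:
                    g2 = gr(g, pi)
                    if g2 is not None:
                        new.add(frozenset((q - {g}) | {g2}))
        if new == pr:            # fixpoint reached
            return pr
        pr = new

# Toy TBox: manager ⊑ employee, ∃WORKS-FOR ⊑ employee
tbox = [("isa", "manager", "employee"), ("dom", "WORKS-FOR", "employee")]
qs = perfect_ref({("employee", "x")}, tbox)
print(sorted(min(q) for q in qs))
```

On this toy input the loop reproduces the shape of the running example: asking for employees also rewrites into asking for managers and for objects participating in WORKS-FOR.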

Chapter 5

Consistency and Query Answering over Ontology-based DIS

In this chapter we investigate the main problems concerning DL-Lite_A ontology-based DIS, namely consistency (cf. Section 1.2) and query answering (cf. Section 1.3). To this aim, we start by introducing DL-Lite_A ontology-based DIS. Then, we present an overview of our reasoning method, and we present the core result of this chapter, namely the modularizability of DL-Lite_A reasoning services. Finally, we provide algorithms for DL-Lite_A DIS consistency checking and query answering, based on such a result.

5.1 DL-Lite_A ontology-based DIS

In this section, after discussing the notorious impedance mismatch problem between data and DL-Lite_A ontology objects, we present the syntax and the semantics of DL-Lite_A DIS.

5.1.1 Linking data to DL-Lite_A objects

Ontology-based DIS provide the user with an ontology that the user can access in order to query data actually stored in several, possibly autonomous and heterogeneous, data sources. Since we are interested in ontology-based DIS where the global schema represents the intensional level of a DL ontology, the instances of concepts and roles in the ontology are simply an abstract representation of some real data stored in existing data sources. Therefore, the problem arises of establishing sound mechanisms for linking existing data to objects that are instances of the concepts and the roles in the ontology. To present our solution to this problem, we come back to the concept of object identifiers. These are ad hoc identifiers (e.g., constants in logic) denoting objects that are instances of ontology concepts. Clearly, object identifiers are not to be confused with any data item. Moreover, even if sources may in general store both data and object identifiers, the storage of object identifiers implicitly requires some agreement among data sources on the form for representing them. Thus, to face the possible

absence of such an a-priori agreement, by tracing back to the work done in deductive object-oriented databases [56], we consider a domain of object identifiers that is built starting from data values, in particular as (logic) terms over data items. To realize this idea, we define more precisely the alphabets of constants coming into play. Specifically, while Γ_V contains data value constants as before, Γ_O is built starting from Γ_V and a set Λ of function symbols, each one with an associated arity, i.e., the number of arguments of the function. Formally, let Γ_V be an alphabet of data values. Then, we call object term an expression f(d_1, ..., d_n) such that f ∈ Λ, arity(f) = n, d_1, ..., d_n ∈ Γ_V, and n > 0. In other words, object terms are constructed by applying function symbols to data value constants. We then denote by Γ_O(Λ, Γ_V) the alphabet of object terms built on top of Λ and Γ_V. Thus, we can now define a DL-Lite_A KB over the set of data values Γ_V and the set of object terms Γ_O(Λ, Γ_V). Clearly, the syntax and the semantics of DL-Lite_A expressions and TBoxes do not need to be modified. Concerning the ABox, since it is now specified by using the alphabet Γ that is the disjoint union of Γ_V and Γ_O(Λ, Γ_V), it consists of a finite set of membership assertions of the form:

A(o), A(s_o), D(d), D(s_v), P(o, p), U_C(o, d), U_R(o, p, d)

where o and p are object terms in Γ_O(Λ, Γ_V), s_o and s_v are soft constants in V_O and V_V respectively (as before), and d is a constant in Γ_V. To define the semantics of a DL-Lite_A ABox as above, we simply define an assignment for A and an interpretation I = (Δ^I, ·^I) as before. It is worth noting, however, that ·^I now assigns a different element of Δ^I_O to every object term in Γ_O(Λ, Γ_V) (i.e., we enforce the unique name assumption also on object terms).
Formally, this means that ·^I is such that:

- for all o ∈ Γ_O(Λ, Γ_V), we have that o^I ∈ Δ^I_O;
- for all o, p ∈ Γ_O(Λ, Γ_V), we have that o ≠ p implies o^I ≠ p^I.

Finally, as for the query language, a conjunctive query over a DL-Lite_A KB using object terms is an expression q(x⃗) ← conj(x⃗, y⃗) such that the atoms in conj(x⃗, y⃗) can have the form:

A(x_o), P(x_o, y_o), D(x_v), U_C(x_o, x_v), or U_R(x_o, y_o, x_v)

where A, P, D, U_C and U_R are respectively an atomic concept, an atomic role, an atomic value-domain, an atomic concept attribute, and an atomic role attribute, x_v is a value variable in x⃗, and x_o, y_o may be, besides object variables as for DL-Lite_A, also object terms, called variable object terms, in which value variables appear in place of value constants. Obviously, unions of conjunctive queries can be defined accordingly. Note that, from the point of view of the semantics, conjunctive queries are interpreted exactly as for the case of DL-Lite_A KBs presented in the previous chapter.

5.1.2 Logical framework for DL-Lite_A DIS

Let us now turn our attention to the problem of linking objects in the ontology to the data in the sources. To this end, we assume that data sources are wrapped into

a set of relational sources D. Note that this assumption is indeed realistic, as many data federation tools providing exactly this kind of service are currently available (cf. Section 2.1). In this way, we can assume that all relevant data are virtually represented and managed by a relational data engine, and that we can query data by using SQL. In the following, we make the following assumptions:

- the set of sources D is independent of the ontology; in other words, our aim is to link the ontology to a collection of data that live autonomously, and have not been structured with the purpose of storing the ontology instances;
- the set of sources D is characterized in terms of a set of schemas and instances, where each schema is a specification of one relational schema (i.e., the relation name and the collection of its attributes) for one source in D, and each source is formed by a set of tuples;
- all value constants stored in the set of sources D belong to Γ_V;
- ans(ϕ, D) denotes the set of tuples (of the arity of ϕ) of value constants returned as the result of the evaluation of the SQL query ϕ over the set of data sources D.

We are now able to define DL-Lite_A ontology-based DIS, according to the logical framework presented in Section 1.1.
Given an alphabet of value constants Γ_V and an alphabet of function symbols Λ, a DL-Lite_A ontology-based data integration system (shortly referred to as DL-Lite_A DIS) is characterized by a triple Π = ⟨G, S, M⟩ such that:

- G is a DL-Lite_A TBox; note that G is in fact the intensional level of the ontology;
- S is a set of relations {S_1, ..., S_n} over Γ_V, for n ≥ 1;
- M is a set of sound mappings partitioned into two sets, M_t and M_a, where:
  - M_t is a set of assertions, called typing assertions, each one of the form

    Φ ⇝ T_i

    where Φ is a query over D denoting the projection of one relation over one of its attributes, and T_i is one of the DL-Lite_A data types;
  - M_a is a set of assertions, called mapping assertions, each one of the form

    Φ ⇝ Ψ

    where Φ is a first-order logic query over D of arity n, and Ψ is a DL-Lite_A conjunctive query over G of arity n, without non-distinguished variables, that possibly includes terms in Γ.
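To make the two ingredients concrete, here is a small sketch of object terms and of the right-hand side of a mapping assertion. All names (pers, employee, persname, psi_m2) are illustrative, echoing the running example; the source query Φ is left abstract.

```python
# Sketch of the impedance-mismatch machinery: object terms are function
# symbols applied to data values, and a mapping assertion pairs a source
# query Phi with a conjunction Psi of ontology atoms over such terms.

def obj(symbol, *values):
    """Object term f(d1, ..., dn), n > 0, as a plain tuple; the unique
    name assumption then holds by structural equality."""
    if not values:
        raise ValueError("object terms need at least one argument")
    return (symbol,) + values

# Syntactically different terms denote distinct objects, equal ones the same:
assert obj("pers", "20903") != obj("mgr", "20903")
assert obj("pers", "20903") == obj("pers", "20903")

def psi_m2(s, n):
    """Psi of an M2-style mapping, instantiated on one answer tuple (s, n)."""
    return [("employee", obj("pers", s)),
            ("persname", obj("pers", s), n)]

print(psi_m2("20903", "Palmieri"))
```

Representing terms as plain tuples is a deliberate design choice: it gives the unique name assumption on object terms for free, since two terms coincide iff the function symbol and all value arguments coincide.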

We briefly comment on the assertions in M as defined above. Typing assertions are used to assign appropriate types to constant values appearing in the set of data sources. Basically, these assertions are used for interpreting the values stored at the sources in terms of the types used in the ontology. Mapping assertions, on the other hand, are used to map data in the data sources to concepts, roles, and attributes in the ontology. It is worth noting that, now that we have object terms, the data layer underlying a DL-Lite_A DIS contains only data, whereas object identifiers are virtually built on top of these data. Thus, autonomous data sources can effectively provide their portion of data and contribute to the ontology instance level, without being required to agree on any particular object identification scheme. We next give an example of a DL-Lite_A DIS.

Example. Let Λ = {pers, proj, mgr}, where pers, proj and mgr are function symbols of arity 1. Consider the DL-Lite_A DIS Π = ⟨G, S, M⟩ such that:

- G is the TBox of Example 3.1.1;
- S = {S_1, S_2, S_3, S_4}, with the following signatures:

  S_1[SSN:STRING, PROJ:STRING, D:DATE]
  S_2[SSN:STRING, NAME:STRING]
  S_3[C:STRING, NAME:STRING]
  S_4[C:STRING, SSN:STRING]

- M = M_t ∪ M_a, where M_t is such that:

  ∃y, z. S_1(x, y, z) ⇝ xsd:string(x)
  ∃y. S_2(x, y) ⇝ xsd:string(x)
  ∃y. S_3(x, y) ⇝ xsd:string(x)
  ∃y. S_4(x, y) ⇝ xsd:string(x)
  ∃y, z. S_1(y, x, z) ⇝ xsd:string(x)
  ∃y. S_2(y, x) ⇝ xsd:string(x)
  ∃y. S_3(y, x) ⇝ xsd:string(x)
  ∃y. S_4(y, x) ⇝ xsd:string(x)
  ∃y, z. S_1(y, z, x) ⇝ xsd:date(x)

M_a is as follows:

M_1 : q_db1(s, p, d) ← S_1(s, p, d)
      ⇝ q_G1(s, p, d) ← tempemp(pers(s)), projname(proj(p), p), until(pers(s), proj(p), d)

M_2 : q_db2(s, n) ← S_2(s, n)
      ⇝ q_G2(s, n) ← employee(pers(s)), persname(pers(s), n)

M_3 : q_db3(s, n) ← ∃c. S_3(c, n) ∧ S_4(c, s)
      ⇝ q_G3(s, n) ← manager(pers(s)), persname(pers(s), n)

M_4 : q_db4(c, n) ← S_3(c, n) ∧ ¬∃s. S_4(c, s)
      ⇝ q_G4(c, n) ← manager(mgr(c)), persname(mgr(c), n)

Let D be a set of sources {D_1, D_2, D_3, D_4} conforming to S. D_1 stores tuples (s, p, d), where s and p are strings and d is a date, such that s is the social security number of a temporary employee, p is the name of the project s/he works for (different projects have different names), and d is the ending date of the employment. D_2 stores tuples (s, n) of strings consisting of the social security number s of an employee and her/his name n. D_3 stores tuples (c, n) of strings consisting of the code c of a manager and her/his name n. Finally, D_4 relates manager codes with their social security numbers. Thus, intuitively, typing assertions in M_t establish how to map SQL source datatypes to RDF datatypes of the values occurring in the ontology. Concerning the mapping assertions in M_a, M_1 captures that every tuple (s, p, d) in D_1 corresponds to a temporary employee pers(s), working until d for a project proj(p) whose name is p. M_2 extracts employees pers(s) and their name n. M_3 and M_4 tell us how to extract from D_3 information about managers and their names. When we extract such information, if we can make use of D_4, which provides the social security number of managers (identified by a code in D_3), then we use object terms of the form pers(s). If such information is not available in D_4, then we use object terms of the form mgr(c).

In order to define the semantics of a DL-Lite_A DIS, we need to define when an interpretation satisfies a mapping w.r.t. a set of data sources D.
Thus, let D = {D_1, ..., D_s} be a set of data sources such that D_j conforms to S_j, for each S_j ∈ S. According to the usual semantics of sound mappings (cf. Section 1.1), we say that I satisfies M : Φ ⇝ Ψ w.r.t. D if, for each tuple of values t⃗ in Γ_V, t⃗ ∈ ans(Φ, D) implies t⃗^I ∈ Ψ^I, where, as usual, ans(Φ, D) denotes the set of answers to the query Φ posed over the set of sources D. Thus, we can now give the semantics of a DL-Lite_A DIS. Let D be a set of data sources conforming to S. An interpretation I = (Δ^I, ·^I) is a model of Π w.r.t. D if and only if:

- I is a model of G;

- I satisfies all mapping assertions in M w.r.t. D.

As usual, we say that a DL-Lite_A DIS Π is consistent w.r.t. D if there exists a model of Π w.r.t. D.

Example. Refer to the DL-Lite_A DIS Π of the previous example. A possible set of data sources conforming to S is the following:

D_1 = {(20903, Tones, ...)}
D_2 = {(20903, Palmieri), (55577, Parker)}
D_3 = {(Lenz, Lenzerini), (Abit, Abiteboul)}
D_4 = {(Lenz, 29767)}

One can easily verify that Π is consistent w.r.t. D.

Let Q denote a union of conjunctive queries over Π of arity n. As usual in DIS, Q is expressed in terms of the global schema G. Moreover, we call certain answers to Q posed over Π w.r.t. D the set of n-tuples of constants in Γ_O(Λ, Γ_V) ∪ Γ_V, denoted Q(Π, D), defined as follows:

Q(Π, D) = { t⃗ | t⃗^I ∈ Q^I for every I ∈ sem(Π, D) }

5.2 Overview of consistency and query answering method

In this section, we present an overview of our solution for checking DL-Lite_A DIS consistency and solving query answering. The simplest way to tackle these problems over a DL-Lite_A DIS is to use the mappings to produce an actual ABox, and then to reason on the ontology constituted by the ABox and the original TBox, applying the techniques described in Chapter 4. We call such an approach bottom-up. The bottom-up approach involves a duplication of the data in the database so as to populate the new ABox, and this is clearly unacceptable in several circumstances. So we propose an alternative approach, called top-down, that avoids such a duplication, essentially by keeping the ABox virtual. We sketch out the main ideas of both approaches below, by first presenting the notion of virtual ABox. Then, we provide preliminary basic notions of logic programming upon which the technical development of the next section is built.

5.2.1 The notion of virtual ABox

Definition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of sources conforming to S, and let M be a mapping assertion in M of the form M = Φ ⇝ Ψ.
We call virtual ABox generated by M from D the set of assertions, denoted A(M, D), defined as follows:

A(M, D) = { Ψ[x⃗/t⃗] | t⃗ ∈ ans(Φ, D) }

where t⃗, Φ and Ψ are of arity n, and Ψ[x⃗/t⃗] denotes the formula obtained from Ψ(x⃗) by substituting the n-tuple of variables x⃗ with the n-tuple of constants t⃗ ∈ Γ_V^n. Moreover, we call virtual ABox for Π the set of assertions:

A(M, D) = ⋃_{M ∈ M} A(M, D)
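The definition above can be sketched directly: each mapping pairs a source query with a template of ontology atoms, and the virtual ABox collects every template atom instantiated on every answer tuple. In the sketch below, the mapping and relation names mirror the running example, but the data access is simulated by a lookup function, which is an assumption for illustration, not the thesis machinery.

```python
# Sketch of A(M, D): instantiate each mapping's ontology atoms on every
# answer tuple of its source query.

def virtual_abox(mappings, eval_source):
    """mappings: list of (source_query, template); a template is a list of
    functions tuple -> ground atom.  eval_source(q) plays ans(q, D)."""
    abox = set()
    for phi, template in mappings:
        for t in eval_source(phi):
            for make_atom in template:
                abox.add(make_atom(t))
    return abox

# M2: S2(s, n) ~> employee(pers(s)), persname(pers(s), n)
m2 = ("q2_db", [lambda t: ("employee", ("pers", t[0])),
                lambda t: ("persname", ("pers", t[0]), t[1])])

data = {"q2_db": [("20903", "Palmieri"), ("55577", "Parker")]}
abox = virtual_abox([m2], lambda q: data[q])
print(len(abox))  # 4 membership assertions, as in assertions (5.1)-(5.4)
```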

Notice that A(M, D) is an ABox over Γ_O(Λ, Γ_V) and Γ_V, as shown in the following example.

Example. Let Π = ⟨G, S, M⟩ be the DL-Lite_A DIS of the previous example. Consider in particular the mapping M_2:

M_2 : q_db2(s, n) ← S_2(s, n)
      ⇝ q_G2(s, n) ← employee(pers(s)), persname(pers(s), n)

Then, given the set of sources D of the previous example, we have:

ans(q_db2, D) = {(20903, Palmieri), (55577, Parker)},

and A(M_2, D) is as follows:

employee(pers(20903))            (5.1)
persname(pers(20903), Palmieri)  (5.2)
employee(pers(55577))            (5.3)
persname(pers(55577), Parker)    (5.4)

By proceeding in the same way for each mapping assertion in M, we can easily obtain the virtual ABox of Π. Virtual ABoxes allow for expressing the semantics of a DL-Lite_A DIS in terms of the semantics of DL-Lite_A ontologies, as follows:

Proposition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of data sources conforming to S, and let A(M, D) be the virtual ABox for Π from D. We have that

sem(Π, D) = { I | I ⊨ G and I ⊨ A(M, D) } = Mod(K), where K = ⟨G, A(M, D)⟩.

Proof. Trivial, from the definitions.

Now that we have introduced virtual ABoxes, we start by discussing the bottom-up approach.

5.2.2 A naive bottom-up approach

The proposition above suggests an obvious bottom-up algorithm to solve consistency and query answering over a DL-Lite_A DIS Π = ⟨G, S, M⟩, which we describe next. First, given a set D of data sources conforming to S, we materialize the virtual ABox for Π from D. Second, we apply to the DL-Lite_A KB K = ⟨G, A(M, D)⟩ the algorithms for checking DL-Lite_A KB satisfiability and query answering described in Chapter 4. This way of proceeding is sufficient to solve satisfiability, whereas for query answering over a DL-Lite_A DIS we need to further carefully take into account the possible presence of variable object terms in the query. Intuitively, this requires proceeding as follows.
Given a union of conjunctive queries Q over a DL-Lite_A DIS, we first substitute each distinct variable object term in Q with a new object variable, thus obtaining a query Q', which contains only object variables, object constants, value

variables, and value constants. Therefore, we can process Q' exactly as discussed in the previous chapter. As a result, we obtain a set of tuples whose elements are data values in Γ_V. Finally, by post-processing the answers so as to reconstruct object terms starting from values, we obtain the certain answers to Q.

Unfortunately, the approach described above for solving both DL-Lite_A DIS consistency and query answering has the following drawbacks. First, the algorithm proposed is PTIME in the size of the database, since the generation of the virtual ABox is by itself PTIME. Second, since the database is independent of the ontology, the data it contains may be modified. This would clearly require setting up a mechanism for keeping the virtual ABox up-to-date with respect to the database evolution. Thus, next, we propose a different approach (called top-down), which uses an algorithm that avoids materializing the virtual ABox, and instead takes the mapping specification into account on-the-fly, during reasoning. In this way, we can both keep the computational complexity of the algorithm low, which turns out to be the same as that of the query answering algorithm for DL-Lite_A KBs, i.e., LOGSPACE, and avoid any further procedure for data refreshment.

5.2.3 A top-down approach

We now sketch out the main steps of the top-down approach. First, we rely on the property of DL-Lite_A of allowing the reduction of KB satisfiability and query answering to the evaluation of a first-order logic query Q over the ABox seen as a database. Since DL-Lite_A DIS are not defined in terms of an ABox and a TBox, but rather are specified in terms of a TBox, a set of mappings and a set of data sources, the evaluation of Q cannot be performed over an ABox (unless we accept materializing the virtual ABox as described in the previous section).
Thus, the idea is to further reformulate Q, by taking into account the mapping assertions, so as to produce a query that can be asked directly over the set of data sources. Specifically, we start by reducing the mapping assertions into a split form, such that they can be seen as a part that extracts relevant data from the database and a part that specifies how the object terms of the ontology are built from such data. For the latter, we use logic programming technology, which tells us how to perform unifications and generate the right object terms required in the query Q. Then, making use of the first part of the split, we can formulate a new query over the database that tells us how to instantiate the variables of the original query with actual data. Observe that, in this way, data are accessed only at the very last step, namely at the moment of evaluating the new reformulated query over the set of data sources, and that the evaluation of such a query can be completely delegated to the DBMS that manages the database.

Example. Consider again the DL-Lite_A DIS Π of the previous example and the query q discussed earlier, together with its reformulation Q_p = PerfectRef(q, G).

By splitting the mappings in M we obtain the following portion of a logic program:

tempemp(pers(s)) ← Aux_1(s, p, d)
projname(proj(p), p) ← Aux_1(s, p, d)
until(pers(s), proj(p), d) ← Aux_1(s, p, d)
employee(pers(s)) ← Aux_2(s, n)
persname(pers(s), n) ← Aux_2(s, n)
manager(pers(s)) ← Aux_3(s, n)
persname(pers(s), n) ← Aux_3(s, n)
manager(mgr(c)) ← Aux_4(c, n)
persname(mgr(c), n) ← Aux_4(c, n)

where Aux_k is a predicate denoting the result of the evaluation, over a set of data sources D conforming to S, of the query Φ_k on the left-hand side of the mapping M_k. Finally, we unfold each atom of the query Q_p obtained above, by unifying it in all possible ways with the heads of the clauses above, and we obtain the following union of results:

q_Π = {pers(s) | (s, p, d) ∈ ans(q_db1, D)}
    ∪ {pers(s) | (s, n) ∈ ans(q_db2, D)}
    ∪ {pers(s) | (s, n) ∈ ans(q_db3, D)}
    ∪ {mgr(c) | (c, n) ∈ ans(q_db4, D)}

Note that the whole approach relies crucially on the use of standard notions of logic programming, which we briefly introduce in the next section.

5.2.4 Relevant notions from logic programming
Notice that A W has also a first-order logic interpretation, which is as follows: x 1,, x s (A W). where x 1,...,x s are all variables occurring in W and A.

From now on, when we talk about programs, program clauses and goals, we implicitly mean definite programs, definite program clauses and definite goals, respectively. From a well-known result of logic programming, we have the following crucial property of definite programs:

Proposition (Program minimal model). For each program P, the intersection M_P of all Herbrand models for P is a model of P, called the minimal model of P.

We say that an atom containing no variables is true in a logic program P if it is true in the minimal model of P.

Definition. Let G be the goal ← A_1, ..., A_m, ..., A_k and let C be a program clause A ← B_1, ..., B_q. Then, G' is derived from G and C using the most general unifier (mgu) θ if the following conditions hold:

- A_m is an atom, called the selected atom, in G;
- θ is an mgu of A_m and A;
- G' is the goal ← (A_1, ..., A_{m-1}, B_1, ..., B_q, A_{m+1}, ..., A_k)θ

where (A_1, ..., A_n)θ = A_1θ, ..., A_nθ and Aθ is the atom obtained from A by applying the substitution θ.

Definition. A resultant is an expression of the form Q_1 ← Q_2 where each Q_i (i = 1, 2) is either absent or a conjunction of literals. All variables in Q_1 and Q_2 are assumed to be universally quantified.

Definition. Let P be a program and let G be a goal. Then, a (partial) SLD-tree of P ∪ {G}¹ is a tree satisfying the following:

- each node of the tree is a resultant;
- the root node is Gθ_0 ← G_0, where Gθ_0 = G_0 = G (i.e., θ_0 is the empty substitution);
- let Gθ_0 ··· θ_i ← G_i be a node at depth i ≥ 0 such that G_i has the form A_1, ..., A_m, ..., A_k, and suppose that A_m is the selected atom. Then, for each input clause A ← B_1, ..., B_q such that A_m and A are unifiable with mgu θ_{i+1}, the node has a child Gθ_0 ··· θ_{i+1} ← G_{i+1}, where G_{i+1} is derived from G_i and the input clause by using θ_{i+1}, i.e., G_{i+1} has the form (A_1, ..., B_1, ..., B_q, ..., A_k)θ_{i+1};

¹ Note that this definition of SLD-tree comes from [71].

- nodes which are the empty clause have no children.

Given a branch of the tree, we say that it is a failing branch if it ends in a node such that the selected atom does not unify with the head of any program clause. Moreover, we say that an SLD-tree is complete if all its non-failing branches end in the empty clause. Finally, given a node Gθ_0 ··· θ_i ← G_i at depth i, we say that the derivation of G_i has length i with computed answer θ, where θ is the restriction of θ_0 ··· θ_i to the variables in G.

Now, we state the definition of partial evaluation (PE for short) from [71]. Note that the definition refers to two kinds of PE: the PE of an atom in a program, and the PE of a program w.r.t. an atom.

Definition. Let P be a program, A an atom, and T an SLD-tree for P ∪ {← A}. Let G_1, ..., G_r be a set of (non-root) goals in T such that each non-failed branch of T contains exactly one of them. Let R_i (i = 1, ..., r) be the resultant of the derivation from ← A down to G_i associated with the branch leading to G_i. The set of resultants π = {R_1, ..., R_r} is a PE of A in P. These resultants have the following form:

R_i = Aθ_i ← Q_i    (i = 1, ..., r)

where we have assumed G_i = ← Q_i.

Let P' be the program resulting from P by replacing the set of clauses in P whose head contains A with the clauses in π. Then P' is a PE of P w.r.t. A.

Intuitively, to obtain a PE of an atom A in P, we consider an SLD-tree T for P ∪ {← A}, and choose a cut in T. The PE is defined as the union of the resultants that occur in the cut and do not fail in T.

5.3 DL-Lite_A DIS consistency and query answering

In this section, we present the core result of this chapter, namely the modularizability of the DL-Lite_A DIS consistency and query answering services. Then, we provide algorithms for DL-Lite_A DIS consistency and query answering, based on such a result.
Finally, we discuss computational complexity issues.

5.3.1 Modularizability

In order to introduce the modularizability of DL-Lite_A reasoning services, according to the top-down approach discussed in the previous section, we need to present the notion of split version of a DL-Lite_A DIS. Such a notion characterizes DL-Lite_A DIS having a particularly friendly form. Specifically, given a DL-Lite_A DIS Π = ⟨G, S, M⟩, we compute the split version of Π, denoted Split(Π) = ⟨G', S', M'⟩, by setting G' = G and S' = S, and by constructing M' as follows: for each mapping assertion Φ ⇝ Ψ ∈ M, and for each atom p ∈ Ψ, we add an assertion Φ ⇝ p to M'. Luckily, we have the following.

Proposition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS and D a set of data sources conforming to S. Then, we have that:

sem(Split(Π), D) = sem(Π, D).

Proof. The result follows straightforwardly from the form of the mappings and from the proposition characterizing sem(Π, D) via the virtual ABox given above.

Thus, given any arbitrary DL-Lite_A DIS, we can always reduce it to its split version. Moreover, such a reduction is PTIME in the size of the mappings and does not depend on the size of the data. This allows us to assume, from now on, that we deal only with split versions of DL-Lite_A DIS.

In what follows, we use the definitions given in the previous section to present the modularizability of reasoning in DL-Lite_A. In particular, the goal is to define a function RewDB that, intuitively, takes as input a union of conjunctive queries (possibly with inequalities) Q over the virtual ABox for Π from a set of data sources D conforming to S, and returns a set of resultants describing (i) the queries to pose over D, and (ii) the substitutions to apply to the results in order to obtain the answer to Q. In particular, we start by defining the notion of program for a query.

Definition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(A(M, D)), possibly including inequalities.
Then, we call program for Q, denoted P(Q), the logic program having the following form:

P(Q) = { ans(x⃗) ← q_g | q_g = σ(q), q ∈ Q }
     ∪ { p_k(f⃗(x⃗)) ← Aux_k(x⃗) | Φ_k(x⃗) ⇝ p_k(f⃗(x⃗)) ∈ M }
     ∪ { Aux_k(t⃗) | t⃗ ∈ ans(Φ_k, D), k = 1, ..., m }
     ∪ { Distinct(v_1, v_2) | v_1, v_2 ∈ Γ_V, v_1 ≠ v_2 }
     ∪ { Distinct(f_1(v⃗_1), f_2(v⃗_2)) | v⃗_1, v⃗_2 tuples over Γ_V, f_1, f_2 ∈ Λ, f_1 ≠ f_2 or v⃗_1 ≠ v⃗_2 }

where m is the number of mappings in M, and for each k ∈ {1, ..., m}:

- Aux_k is an auxiliary predicate whose extension coincides with the set of tuples in ans(Φ_k, D);
- Distinct is an auxiliary predicate whose extension coincides with the set of pairs of distinct terms in Γ_O(Λ, Γ_V), and of distinct constants in D;
- q_g = σ(q) denotes the conjunction of atoms obtained by replacing each inequality x ≠ y in the body of a query q in the union Q with the atom Distinct(x, y);
- ans is an auxiliary predicate having the same arity as Q.

Below, we denote by θ_t⃗ the substitution of the variables in ans with the terms in t⃗. The following lemma states a notable property of the programs defined above.

Lemma. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of data sources conforming to S, Q a union of conjunctive queries over db(A(M, D)), possibly including inequalities, and P(Q) the program for Q. Then, db(A(M, D)) coincides with the projection over G of the minimal model of P(Q).

87 5.3. DL-LITE A DIS CONSISTENCY AND QUERY ANSWERING 75 Proof. To prove the theorem we first show that for each n-tuple of object terms t Γ(Γ V, Λ) n, if t belongs to X in db(a(m, D)), then we have that X( t) is true in the minimal model M P of P(Q). Consider a tuple t of object terms that belongs to X in db(a(m, D)), i.e., such that X( t) A(M, D). Thus, by construction of A(M, D) we have that there exists a mapping Φ k ( x) X( f(x)) in M, a tuple t of values in Γ V, and a substitution θ : { x/ f(t )} such that t ans(φ k, D) and t = xθ = f(t ). But then, since t ans(φ k, DB), we have that Aux k ( t ) P(Q). Moreover, since Φ k ( x) X( f(x)) is a mapping in M, we have that X( f(x)) Aux k ( x) belongs to P(Q). Thus, θ is a mgu of Aux k ( x) and Aux k ( t ). Therefore, it is possible to derive X( t) from Aux k ( x) and Aux k ( t ) by using θ, which proves that X( t) is true in M P. Conversely, let X( t) be true in the minimal model M P of P(Q), and let X be an expression of G. Clearly, by following a similar line of reasoning as above, we show that t belongs to X in db(a(m, D)). Corollary Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(a(m, D)), possibly with inequalities. Moreover, let P be the program for Q. Then, for each tuple t of constants in Γ(Γ V, Λ) Γ V, we have that: t ans(q,db(a(m, D))) if and only if P(Q) { ansθ t } is unsatisfiable. Proof. The result follows directly from the previous lemma and the construction of P(Q). Given a union of conjunctive queries (possibly with inequalities) Q over db(a(m, D)), let SLD-Derive(P(Q)) be a function that takes as input the program P(Q) and returns a set of resultants, as follows. First, it constructs a SLD-Tree T for P(Q) { ans} as follows: it starts by selecting the atom ans, and then, it continues by selecting the atoms that belong to the alphabet of G, until there are some. 
Second, it returns the set S of the leaves ansθ j q j of T, that do not belong to any failing branch of T. Note that SLD-Derive can use any procedure to compute the SLD-Tree for P(Q) { ans}, provided that the computation rule follows the requirements above. We have the following. Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Moreover, let SLD-Derive(P(Q)). Then, SLD-Derive(P(Q)) is a PE of { ans} w.r.t. P(Q).

88 76 CHAPTER 5. CONSISTENCY AND QUERY ANSWERING OVER ONTOLOGY-BASED DIS Proof. Trivial, by construction and by the definition of PE. Consider now the partial evaluation of P(Q) w.r.t. { ans}. In what follows, we denote it P(Q, S). We then have the following. Lemma Let Π = G, S, M be a DL-Lite A DIS,, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Moreover, let S = SLD-Derive(P(Q)). Then, for each atom A, A is true in P(Q, S) if and only if A is true in P(Q). Proof. The proof follows from a well-known result from [71], stating that if P is a PE of a program P w.r.t. {G}, then P and P are procedurally equivalent, i.e. for each atom A, A belongs to the minimal model of P if and only if A belongs to the minimal model of P. Let S be a (non-empty) PE of P(Q) w.r.t. { ans} and Q be a resultant ansθ q in S. We define unfold Π (S, Q) as the function that returns a (extended form of) resultant Q = ansθ q such that q is a first-order query over D, which is obtained from q by proceeding as follows. At the beginning, q has an empty body. Then, for each atom A in q, if A = Aux k ( x), we add to q the query Φ k ( x); note that, by hypothesis, Φ k ( x) is an arbitrary first-order query with distinguished variables x, that can be evaluated over D; if A = Distinct(x 1, x 2 ), where x 1, x 2 have resp. the form f 1 ( y 1 ) and f 2 ( y 2 ), then: if f 1 f 2, then we do not add any conjunct to q, otherwise, we add the following conjunct: y 1i y 2i, where w is the arity f 1. i {1,...,w} Here again, note that we obtain a disjunction of variables, which can be obviously evaluated over a set of data sources D. Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set f data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Moreover, let S = SLD-Derive(P(Q)). 
Then for each Q = ansθ q S and for each tuple of constants t in Γ V, we have that: ansθ t is true in P(Q, S) if and only if t ans(q, D), where t = t θ, and unfold Π (S, Q) = (ansθ q ). Proof. Let Q = ansθ q be a resultant in S such that q has the form A 1 ( x 1 ),, A n ( x n ). By construction, A i ( x i ) may either have the form:

89 5.3. DL-LITE A DIS CONSISTENCY AND QUERY ANSWERING 77 Aux ki ( x i ); or Distinct( x i ), where x i = (x i1, x i2 ). Suppose that A i has predicate Aux ki for each i j whereas it has predicate Distinct for j < i n. Consider now Q = unfold Π (S, Q). By construction, Q = ansθ q where q has the form: { x, y i11, y i21, y i1wi, y i2wi Φ k1 ( x 1 ),, Φ kj ( x j ), ( i n ( h {1,...,w i } y i 1h y i2h ))} where ( h {1,...,w i } y i 1h y i2h ) occurs together with the corresponding distinguished variables y i1h, y i2h if there is an atom Distinct(x i1, x i2 ) in q such that x i1 = f( y i1 ), x i2 = f( y i2 ) where f has arity w i. Now, let t be a tuple of constants in Γ V. We show next that if q( t) is true in P(Q, S), then t ans(q, D) where t = t θ. Suppose that q( t) is true in P(Q, S). Then there exists θ q such that q( t) = (A 1 ( x 1 ),, A n ( x n ))θ q is true in P(Q, S). This implies that there exist n facts F i in P(Q, S) such that F i = A i θ q is true in P(Q, S) for each i = 1,, n. But then, by construction: if i j, then F i has the form Aux ki ( t i ), which by construction means that t i ans(φ ki, D); otherwise, F i has the form Distinct( t i ), where t i = (t i1, t i2 ) and t i1, t i2 are terms in Γ(Γ V, Λ) such that t i1 t i2, i.e. t i1 = f 1 (v i1 ), t i2 = f 2 (v i2 ) where either f 1, f 2 Λ, f 1 f 2 or v i1 v i2. Then, one can easily verify that t ans(q, D). Indeed, for i j we have trivially that Φ ki ( t i ) is true, whereas for j < i k, we have that if f 1 = f 2, then v i1h v i2h, for h {1,...,w i } where w i is the arity of f 1. Thus, since ansθ q belongs to P(Q, S), then t = t θ, and we have proved the claim. Clearly, the converse of the lemma can be proved by following the same line of reasoning. Before presenting RewDB, we need to introduce one more notion, i.e. the notion of compilation for Q. 
Given a union of conjunctive queries Q, we call compilation for Q, denoted C(Q), the program obtained from P(Q) by eliminating all facts Aux k ( t) and Distinct( t). We then have the following. Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Then, we have that SLD-Derive(P(Q)) = SLD-Derive(C(Q)). Proof. The proof follows from the observation that SLD-Derive(P(Q)) constructs a SLD-Tree for P(Q) { ans} by selecting only the atoms in the alphabet of G, and that P(Q) and C(Q) coincide in the clauses containing atoms in the alphabet of G. Now we are finally able to come back to the definition of RewDB. Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a

90 78 CHAPTER 5. CONSISTENCY AND QUERY ANSWERING OVER ONTOLOGY-BASED DIS Algorithm RewDB(Q, Π, D) Input: DL-Lite A DIS Π = G, S, M, set of data sources D conforming to S union of conjunctive queries (possibly with inequalities) Q over db(a(m, D)) Output: set of resultants S build the program C(Q); compute the set of resultants S = SLD-Derive(C(Q)); for each ansθ q S do S unfold Π (S, Q); return S Figure 5.1: The Algorithm RewDB union of conjunctive queries (possibly with inequalities) over db(a(m, D)). We define RewDB(Q,Π, D) as the function that takes as input Q, Π, and D, and returns a set S of resultants by proceeding as shown in Fig We now show the correctness of RewDB. Theorem Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(a(m, D)). Then, RewDB(Q,Π, D) terminates. Moreover, let S = RewDB(Q,Π, D). Then, for each tuple of constants t in Γ(Γ V, Λ) Γ V and for each tuple of constants t Γ V : t ans(q,db(a(m, D))) if and only if (ansθ q ) S such that t = t θ and t ans(q, D) Proof. Concerning termination, it is clear that RewDB always terminates since, by construction, all its steps terminate. Let us now focus on soundness and completeness of RewDB(Q, Π, D). Specifically, suppose first that RewDB(Q, Π, D) =. Then, by construction, SLD-Derive(C(Q)) =. By Lemma 5.3.8, we also have that SLD-Derive(P(Q)) =. Thus, the SLD-Tree for P(Q) { ans} contains only failing branches, which implies that P(Q) { ans} is satisfiable. Therefore, by Corollary 5.3.4, we have that there exists no t such that t ans(q,db(a(m, D))), which proves that the theorem holds. Suppose now that RewDB(Q,Π, D) = S. Let Q = ansθ q be a resultant in S. Then, by construction, we have that there exists Q = ansθ q in S = SLD-Derive(C(Q)) such that q is a conjunctive query in Q and unfold(s, Q) = Q. 
Thus, since SLD-Derive(C(Q)) = SLD-Derive(P(Q)), by Lemma 5.3.7, we have that for each tuple of constants t in Γ V and for each tuple of constants t in Γ(Γ V, Λ) Γ V : t = t θ t ans(q, D) ansθ t is true in P(Q, S). Let t be a tuple of constants in Γ(Γ V, Λ) Γ V. By Lemma 5.3.5, P(Q, S) is the PE of P(Q) w.r.t. { ansθ t }. Therefore, by Lemma 5.3.6, we have that: ansθ t is true in P(Q, S) ansθ t is true in P(Q).

But then, by Corollary 5.3.4, we obtain: ansθ t is true in P(Q) if and only if t ∈ ans(Q, db(A(M, D))). By definition of the semantics of a union of conjunctive queries with inequalities, we have that t ∈ ans(Q, db(A(M, D))) if and only if there exists a query q in Q such that t ∈ ans(q, db(A(M, D))). Thus, since q is a conjunctive query in Q, we obtain the claim. Note that the correctness of RewDB is crucial, in that it allows completely forgetting the mappings, by compiling them directly into the queries to be posed over the underlying database. This proves the modularizability of the consistency and query answering services for DL-Lite A DIS. Specifically, we will see in the next section that Algorithm RewDB allows reasoning by exploiting, on the one hand, results on reasoning over DL-Lite A KBs, and, on the other hand, the ability of the underlying database to answer arbitrarily complex queries. Consistency algorithm Let Π = G, S, M be a DL-Lite A DIS and D a set of data sources conforming to S. In Fig. 5.2, we present an Algorithm Sat(Π, D) that, thanks to the use of the function RewDB, strongly resembles the Algorithm Sat(K) presented in the previous chapter to check the satisfiability of a DL-Lite A KB. More precisely, for each functionality assertion and each NI in the NI-Closure of G (denoted, as usual, cln(G)), Sat(Π, D) uses the functions ViolateFunct and ViolateNI, defined in the previous chapter, that return a first-order query Q checking whether the minimal model of the virtual ABox generated from the mappings w.r.t. D violates any assertion of the global schema. Then, the algorithm uses the function RewDB(Q), which allows forgetting the mapping assertions by returning the set of resultants S, as discussed in the previous section. After having further extracted from S the union of queries Q, Sat(Π, D) evaluates Q over D and returns false if ans(Q, D) returns true.
Otherwise, if no functionality nor NI assertion generates a query returning true, then Sat(Π, D) returns true. As expected, we have the following result. Theorem Let Π = G, S, M be a DL-Lite A DIS and D a set of sources conforming to S. Then, Sat(Π, D) terminates. Moreover, Π is consistent w.r.t. D if and only if Sat(Π, D) = true. Proof. The termination of the Algorithm follows from the termination of RewDB. Concerning the soundness and the completeness of the algorithm, by Proposition 5.2.3, we have that Π is consistent w.r.t. D if and only if K = G,db(A(M, D)) is unsatisfiable. Moreover, by Lemma 4.3.7, we have that K = G,db(A(M, D)) is unsatisfiable if and only if Q db(a(m,d)) = true for each Q such that Q = ViolateFunct(X) for some functionality assertion X G, or Q = ViolateNI(X) for some NI assertion X cln(g). Thus, in order to prove the theorem, it suffices to prove that: ( )Q db(a(m,d)) = true if and only if ans(q, D) = true,

Input: DL-Lite A DIS Π = G, S, M, a set of sources D conforming to S
Output: true or false
(1) for each F = (funct P) ∈ G do
      Q := ViolateFunct(F);
      S := RewDB(Q);
      Q' := false;
      for each ansθ ← q' ∈ S do
        Q' := Q' ∪ {q'};
      if ans(Q', D) = true then return false
(2) for each NI = X1 ⊑ ¬X2 ∈ cln(G) do
      Q := ViolateNI(NI);
      S := RewDB(Q);
      Q' := false;
      for each ansθ ← q' ∈ S do
        Q' := Q' ∪ {q'};
      if ans(Q', D) = true then return false
(3) return true

Figure 5.2: Algorithm Sat(Π, D)

for each Q described as above, where Q is such that Q = ⋃ { q' | ansθ ← q' ∈ S } and S = RewDB(Q). Clearly, this concludes the proof, since (∗) follows straightforwardly from the correctness of RewDB (cf. Theorem 5.3.9).

Query answering algorithm

Let Π = G, S, M be a DL-Lite A DIS and D a set of data sources conforming to S. In Fig. 5.3, we present an Algorithm Answer(Q, Π, D) that, once again, is very similar to the Algorithm Answer(Q, K) presented in the previous chapter to answer queries posed over a DL-Lite A KB. Informally, the algorithm takes as input a DL-Lite A DIS, a set of data sources conforming to S, and a union of conjunctive queries Q over Π. Then, it proceeds exactly as in the case of DL-Lite A KBs (note that, analogously to the case of KBs, if Π is not consistent w.r.t. D, then ans(Q, Π, D) is the set of all possible tuples of object terms in Γ(Γ V, Λ) and constants in Γ V, denoted AllTup(Q, Π), whose arity is that of the query Q). Thus, it first computes the NI-Closure of G, and then it computes the perfect reformulation Q p of Q. At this point, Answer reformulates Q p by calling RewDB(Q p) to compute the set of resultants S. Then, for each resultant Q' in S, it extracts the conjunctive query in its body, evaluates it over D, and further processes the answers according to the substitution occurring in the head of Q'. We next show the correctness of Algorithm Answer(Q, Π, D).

Input: UCQ Q, DL-Lite A DIS Π = G, S, M, set of data sources D conforming to S
Output: ans(Q, Π, D)

G := cln(G);
if Π is not consistent w.r.t. D
then return AllTup(Q, Π)
else Q p := ⋃ { PerfectRef(q i, G) | q i ∈ Q };
     S := RewDB(Q p);
     R s := ∅;
     for each ansθ ← q' ∈ S do
       R s := R s ∪ ans(q', D)θ;
     return R s;

Figure 5.3: Algorithm Answer(Q, Π, D)

Theorem Let Π = G, S, M be a DL-Lite A DIS, D a set of sources conforming to S, and Q a union of conjunctive queries over Π. Then, Answer(Q, Π, D) terminates. Moreover, let R s be the set of tuples returned by Answer(Q, Π, D), and let t be a tuple of constants in Γ(Γ V, Λ). Then, t ∈ Q(Π, D) if and only if t ∈ R s.

Proof. The termination of the algorithm follows from the termination of the Algorithm PerfectRef and of the function RewDB. Concerning the soundness and completeness of the Algorithm Answer, by Proposition 5.2.3, we have that sem(Π, D) = Mod(K), where K = G, db(A(M, D)). Moreover, given a union of conjunctive queries Q, by Lemma 4.4.5, we have that ans(Q, K) = (Q p) db(A(M,D)), where Q p = PerfectRef(Q). Then, since by definition we have that ans(Q, K) = { t | t I ∈ Q I for every I ∈ Mod(K) } and Q(Π, D) = { t | t I ∈ Q I for every I ∈ sem(Π, D) }, it is easy to see that Q(Π, D) = (Q p) db(A(M,D)). On the other hand, by construction, we have that R s = { t θ | t ∈ ans(q', D), ansθ ← q' ∈ S }. Then, clearly, from the correctness of RewDB (cf. Theorem 5.3.9), we obtain the claim.
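Operationally, the final phase of Answer reduces to evaluating each unfolded resultant at the sources and applying its head substitution to the returned tuples, taking the union over all resultants. The following fragment is only an illustrative sketch of that phase; the representation of resultants as (θ, query) pairs, and all names in it, are our own and not part of the formal development.

```python
def answer(resultants, eval_source):
    """Schematic final phase of Answer: evaluate each unfolded resultant
    over the sources and apply its head substitution theta to the answers.

    resultants : list of (theta, query) pairs, where theta maps a source
                 tuple to an answer tuple (here: a plain Python function)
    eval_source: function query -> set of tuples, the source-level evaluator
    """
    results = set()
    for theta, query in resultants:
        for t in eval_source(query):
            results.add(theta(t))  # the answer is the union over all resultants
    return results
```

For instance, a resultant whose head substitution builds the function term f(v) from a source value v turns the source answers {v1, v2} into {f(v1), f(v2)}.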

Computational complexity

We first prove termination and complexity of RewDB.

Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set of sources conforming to S, and Q a union of conjunctive queries over Π. The function RewDB(Q, Π, D) runs in exponential time w.r.t. the size of Q, and in polynomial time w.r.t. the size of M. Proof. Let Q be a union of conjunctive queries, and let n be the total number of atoms in the bodies of all the conjunctive queries q in Q. Moreover, let m be the number of mappings and let m' be the maximum size of the body of a mapping. The proof follows immediately from considering the cost of each of the three steps of RewDB(Q, Π, D): 1. The construction of C(Q) is clearly polynomial in n and m. 2. The computation of SLD-Derive(C(Q)) first builds a tree of depth at most n such that each of its nodes has at most m children, and, second, processes all the leaves of the tree to obtain the set S of resultants. By construction, this set has size O(m^n). Clearly, the overall computation has complexity O(m^n). 3. Finally, the application of the function unfold Π to each element in S has complexity O(m^n · m'). Based on the above property, we are able to establish the complexity of checking the consistency of a DL-Lite A DIS w.r.t. D, and the complexity of answering unions of conjunctive queries over a DL-Lite A DIS w.r.t. D. Theorem Given a DL-Lite A DIS Π and a set of data sources D, Sat(Π, D) is LOGSPACE in the size of D (data complexity). Moreover, it runs in polynomial time in the size of M, and in polynomial time in the size of G. Proof. The proof of the claim is a consequence of the correctness of the Algorithm Sat(Π, D), established in Theorem , and the following facts: 1. the Algorithm Sat(Π, D) generates a number of queries Q over the minimal model of the virtual ABox that is polynomial in the size of G; 2.
each query Q contains 2 atoms and thus, by Lemma , the application of RewDB to each Q is polynomial in the size of the mapping M and constant in the size of the data sources; 3. the evaluation of a union of conjunctive queries over a database can be computed in LOGSPACE with respect to the size of the database (since unions of conjunctive queries are a subclass of first-order logic queries).
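The control structure whose cost is analysed in the facts above can be sketched as follows. This is an illustrative rendering of the Sat loop under our own simplifying interfaces (the violation-query builders, RewDB, and source evaluation are passed in as black boxes), not the algorithm of Fig. 5.2 itself.

```python
def sat(funct_assertions, ni_assertions, violate_funct, violate_ni,
        rewdb, eval_union):
    """Schematic control loop of Sat(Pi, D): one violation query per
    functionality assertion and per NI in cln(G); consistency holds
    iff no rewritten violation query evaluates to true at the sources."""
    checks = [(f, violate_funct) for f in funct_assertions] + \
             [(ni, violate_ni) for ni in ni_assertions]
    for assertion, violate in checks:
        q = violate(assertion)                       # first-order violation query
        union = [body for _theta, body in rewdb(q)]  # compile the mappings away
        if eval_union(union):                        # a violation is witnessed in D
            return False
    return True
```

Note that the number of iterations depends only on G, and each iteration evaluates a fixed-size query over D, matching the data complexity bound.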

Theorem Given a DL-Lite A DIS Π and a set of data sources D, Answer(Q, Π, D) is LOGSPACE in the size of D (data complexity). Moreover, it runs in polynomial time in the size of M, in exponential time in the size of Q, and in polynomial time in the size of G. Proof. The proof of the claim is a consequence of the correctness of the Algorithm Answer, established in Theorem , and the following facts: 1. the maximum number of atoms in the body of a conjunctive query generated by the Algorithm PerfectRef is equal to the length of the initial query Q; 2. by Lemma , the algorithm PerfectRef(Q, G) runs in time polynomial in the size of G; 3. by Lemma , the cost of applying RewDB to each conjunctive query in the union generated by PerfectRef is exponential in the size of the conjunctive query and polynomial in the size of M; this implies that the query to be evaluated over the data sources can be computed in time exponential in the size of Q, polynomial in the size of M, and constant in the size of D (data complexity); 4. the evaluation of a union of conjunctive queries over a database can be computed in LOGSPACE with respect to the size of the database (since unions of conjunctive queries are a subclass of first-order logic queries).


97 Chapter 6 Updates of Ontologies at the Instance Level In this chapter, we study the notion of update of an ontology expressed as a DL knowledge base. We recall that DL knowledge bases consist of a TBox used to express the intensional level of the ontology, i.e. general knowledge about concepts and their relationships, and an ABox used to express the extensional level of the ontology, i.e. the state of affairs regarding the instances of concepts. In the first section, we introduce a (restricted) variant of DL-Lite A, called DL-Lite FS, that we use for expressing KBs in this chapter. Then, we provide the general framework for instance level update of DL ontologies, by specifying in particular, the formal semantics for update. Afterwards, we address the issue of update in the context of DL-Lite FS : we show that DL-Lite FS is closed with respect to instance level update, in the sense that the result of an update is always expressible by a new DL-Lite FS ABox. This has to be contrasted with the results in [69], which imply that, if we use more expressive logics, instance level updates are generally not expressible in the logic of the original knowledge base. Finally, we provide an algorithm for computing updates in DL-Lite FS and we discuss its formal and computational properties. 6.1 The DL-Lite FS language In this chapter, we consider a restricted variant of DL-Lite A, called DL-Lite FS, which differs from DL-Lite A as follows: DL-Lite FS does not allow for specifying attributes; consequently, it does not allow specifying value-domains, concept attributes, role attributes, ranges nor domains; it does not allow for specifying inclusions among roles, nor negation of roles, thus only basic roles occur in the KB; and it allows general concepts to occur in membership assertions. The first two differences have been introduced essentially for clarity purposes. Indeed, we strongly conjecture that results of this chapter hold for DL-Lite A as well. 85

Concerning the occurrence of general concepts in membership assertions, as we will see, it is due to the need for a more expressive language to reflect an update. However, we point out that in Chapter 3, we showed that given a DL-Lite FRS KB K with general expressions in the ABox, it is possible to build a new KB Conv(K), in PTIME in data complexity, that is equivalent to K from the point of view of query answering. Clearly, this holds also for DL-Lite FS, since it is clearly a restricted variant of DL-Lite FRS. Next, we specify more precisely the syntax of DL-Lite FS KBs. Concepts in DL-Lite FS are defined as follows:

B ::= A | ∃Q
C ::= B | ¬B
Q ::= P | P⁻

where, as usual, A denotes an atomic concept, P an atomic role, B a basic concept, Q a basic role, and C a general concept. The universal assertions allowed in the TBox are of the form

B1 ⊑ B2     inclusion assertion
B1 ⊑ ¬B2    disjointness assertion

and of the form

(funct Q)    functionality assertion

Finally, the membership assertions allowed in a DL-Lite FS ABox are of the form:

C(a),  Q(a, b),  C(z)    membership assertions

6.2 Instance-level ontology update

As already discussed in Section 1.4, several approaches to update have been considered in the literature. Here, we essentially follow Winslett's approach [87, 88]. The intuition behind such an approach is the following. There is an actual state-of-affairs of the world, of which, however, we have only an incomplete description. Such a description identifies a (typically infinite) set of models, each corresponding to a state-of-affairs that we consider possible. Among them, there is one model corresponding to the actual state-of-affairs, but we don't know which. Now, when we perform an update we are changing the actual state-of-affairs.
However, since we don t really know which of our models corresponds to the actual state-of-affairs we apply the change on every possible model, thus getting a new set of models representing the updated situation. Among them, we do have the model corresponding to the updated on the actual stateof-affairs, but again we don t know which. As for how we update each model, only the changes that are absolutely required to accommodate what is explicitly asserted in the update will be performed. Observe that this intuition is essentially the one behind most of the research on reasoning about actions. For example this vision is completely shared by Reiter s variant of Situation Calculus [83]. See in particular [84] where possible worlds are considered explicitly and actions on such worlds correspond to the above description. 1 1 Actually [84] studies also knowledge producing actions (i.e., sensing actions), which are more related with belief revision than update.

99 6.2. INSTANCE-LEVEL ONTOLOGY UPDATE 87 Winslett s approach to update is also completely compatible with the proposal in [69, 17] where updates of a DL ABox without TBox is studied for several expressive DLs. Observe that in those works, since the TBox is not present 2, the intensional level of the ontology is not specified, and updates are only relative to the instance level represented by the ABox. Here, instead, we do consider the intensional level of the ontology, represented by a TBox, although, as we said above, we insist that such level is unchanging. Updates impact only the instance-level of the ontology, according to what specified in the update, in a way consistent with keeping the universal assertions in the intensional level. Before addressing in the next section updates over a DL-Lite FS KB, we next define the general framework for instance-level update of a DL ontology, provide preliminary definitions and specify formally the crucial notion of model update and update. Definition (Containment between interpretations) Let I = ( I, I) and I = ( I, I ) be two interpretations (over the same alphabet). We say that I is contained in I, written I I, iff I, I are such that: if a A I then a A I, for every a, and atomic concept A; if (a, b) R I then (a, b) R I, for every (a, b) and atomic role R. We say that I is properly contained in I, written I I, iff I I but I I. Definition (Difference between interpretation) Let I = ( I, I) and I = ( I, I ) be two interpretations (over the same alphabet). We define the difference between I and I, written I I, as the interpretation ( I, I)I I such that: A I I = A I A I, for every atomic concept A; P I I = P I P I, for every atomic role P ; where S S denotes the symmetric difference between sets S and S, i.e. S S = (S S ) \ (S S ). Definition (Model update) Let T be a TBox in a DL L, I a model of T, and F a finite set of membership assertions expressed in L such that Mod(T ) Mod(F). 
We define the (result of) the update of I with F, denoted by U T (I, F), as follows: U T (I, F) = {I I Mod(T ) Mod(F) and there exists no I Mod(T ) Mod(F) s.t. I I I I } 2 Or, if present, it is assumed to be acyclic. Acyclic TBoxes cannot always be used to model the intensional level of an ontology, since the abbreviations that they introduce can be eliminated without semantic loss. Naturally, since acyclic TBoxes may provide compact representation of complex concepts, they may have an impact on the computational complexity of reasoning.
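On finite interpretations, the containment and difference defined above are directly computable. The sketch below is illustrative code with invented names: it represents an interpretation as a dictionary from predicate names to sets of tuples, and computes containment and the difference predicate-wise.

```python
# An interpretation over a fixed finite domain: predicate name -> set of tuples.
Interp = dict

def contained(i1: Interp, i2: Interp) -> bool:
    """I1 is contained in I2: every extension in I1 is included in I2's."""
    return all(ext <= i2.get(p, set()) for p, ext in i1.items())

def difference(i1: Interp, i2: Interp) -> Interp:
    """The difference of I1 and I2: predicate-wise symmetric difference."""
    preds = set(i1) | set(i2)
    return {p: i1.get(p, set()) ^ i2.get(p, set()) for p in preds}
```

The definition of U T (I, F) above then selects exactly those models of T and F whose difference from I is minimal with respect to this containment.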

Observe that U T (I, F) is the set of models of T and F whose difference w.r.t. I is minimal with respect to set inclusion, and that such a set is non-empty. Definition (Update) Let T be a TBox expressed in a DL L, M ⊆ Mod(T) a set of models of T, and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We define the (result of the) update of M with F, denoted M • T F, as the following set of models: M • T F = ⋃ { U T (I, F) | I ∈ M }. Let K = T, A be a knowledge base in L and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. With a little abuse of notation and terminology, we will write K • T F to denote Mod(K) • T F, and talk about the update of K instead of the update of the models Mod(K) of K. A basic question arises from such definitions: is the result of updating a knowledge base still expressible as a new knowledge base in the same DL? 3 Let us introduce the following definition. Definition (Expressible update) Let K = T, A be a knowledge base expressed in a DL L and F a set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We say that the update of K with F is expressible in L iff there exists an ABox A' expressed in L such that K • T F = Mod( T, A' ). The results in [69] show that, for several quite standard DLs, updates are not expressible in the original language of the knowledge base, even if TBoxes are not considered. Instead, in the case of DL-Lite FS we have the notable property that updates are always expressible in DL-Lite FS itself, as we show in the next section. 6.3 Computing updates in DL-Lite FS ontologies In this section, we address instance-level updates in DL-Lite FS as specified in the previous section.
In particular:

- we show that the result of an update is always expressible within DL-Lite FS, i.e., there always exists a new DL-Lite FS ABox that reflects the changes of the update to the original knowledge base (obviously, the TBox remains unchanged, as required);
- we show that the new ABox resulting from an update can be automatically computed;
- finally, we show that the size of such an ABox is polynomially bounded by the size of the original knowledge base, and moreover that it can be computed in polynomial time.

Before starting the technical development, we illustrate the update on an example to gain some intuition on the problem.

3 Note that this question corresponds to the expressible update problem presented in Section 1.4 for DIS.
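On tiny finite interpretations, the update semantics of Section 6.2 can be checked by brute force. The sketch below is illustrative only: it assumes a fixed finite set of ground atoms, represents an interpretation as the set of atoms it makes true, and all names in it are ours. It enumerates the interpretations, keeps the models of T and F, and retains those at minimal symmetric difference from I.

```python
from itertools import combinations

def interpretations(atoms):
    """All interpretations over a fixed finite set of ground atoms."""
    atoms = list(atoms)
    return [frozenset(c) for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)]

def model_update(i, atoms, sat_t, sat_f):
    """U_T(I, F): the models of T and F whose symmetric difference
    from I is minimal with respect to set inclusion."""
    candidates = [m for m in interpretations(atoms)
                  if sat_t(m) and sat_f(m)]
    return [m for m in candidates
            if not any((i ^ m2) < (i ^ m) for m2 in candidates)]
```

For a TBox stating manager ⊑ employee, updating the model {manager(Lenz), employee(Lenz)} with ¬manager(Lenz) yields the single minimal model {employee(Lenz)}: Lenz stops being a manager but remains an employee.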

Example Consider the ontology presented in Example . Now suppose that Lenz is no longer a manager: we update the ontology with the membership assertion ¬manager(Lenz). Based on the semantics presented above, the result of the update can be expressed by the following ABox: { ¬manager(Lenz), employee(Lenz) }. Note that the new instance level reflects that Lenz is an employee who is not a manager. Interestingly, the fact that Lenz is not a manager implies that he does not manage anything anymore. Nevertheless, he remains an employee, and he still works for the project he used to manage; this would not be captured by simply removing the ABox assertions that are inconsistent with the update. In Fig. 6.1, we provide an algorithm to perform an update over a DL-Lite FS knowledge base. To simplify the presentation, we make use of the following notation. First, we denote by Q⁻ the inverse of Q, i.e., if Q is an atomic role, then Q⁻ is its inverse, while if Q is the inverse of an atomic role, then Q⁻ is the atomic role itself. Second, we write ¬C to denote ¬B if C is B, and B if C is ¬B. Also, we use the notation C1 ⊑ C2 to denote assertions of the form B1 ⊑ B2, B1 ⊑ ¬B2, or ¬B1 ⊑ ¬B2. Finally, we denote by cl(T) the deductive closure of T, which can be defined as the obvious generalization of cln(T) presented in Section 4.2.3, i.e., cl(T) is built from both positive and negative inclusions. Clearly, by following the same line of reasoning as for cln(T), it can be shown that in DL-Lite FS, cl(T) can be computed in polynomial time w.r.t. T. The algorithm in Fig. 6.1 takes as input a satisfiable DL-Lite FS knowledge base K = T, A and a finite set of ground (i.e., not involving soft constants) membership assertions F, and returns either ERROR (if T, F is unsatisfiable) or an ABox A' (otherwise). Roughly speaking, the algorithm proceeds as follows.
After a satisfiability check, it inserts into A all the membership assertions in A and F (lines 3 4), and then uses the Algorithm PerfectRef, presented in Section 4.4, Fig to compute the set F of membership assertions that are logically implied by K and contradict F according to T (lines 5 18) 4. Finally, for each F F, the algorithm deletes F from A, but inserts into A those membership assertions that are logically implied by the membership assertions deleted and do not contradict F (lines 19 32). Lemma Let K = T, A be a satisfiable DL-Lite FS knowledge base, F a finite set of ground DL-Lite FS membership assertions such that Mod(T ) Mod(F), and K the DL-Lite FS knowledge base such that K = T, A, where A = ComputeUpdate(T, A, F). We have that K is always satisfiable. Proof. By construction of the Algorithm, K is obtained from K by: inserting into A : 1. a finite set F of ground membership assertions; these are by hypothesis such that Mod(T ) Mod(F) ; 4 Note that the Algorithm PerfectRef, as introduced in Fig , returns a union of conjunctive queries. Clearly, since here we use it by giving it a ground term as input, then it returns a set of ground atoms, i.e. a set of ground membership assertions

CHAPTER 6. UPDATES OF ONTOLOGIES AT THE INSTANCE LEVEL

INPUT: finite set of ground membership assertions F,
       satisfiable DL-Lite_FS KB ⟨T, A⟩
OUTPUT: an ABox A′, or ERROR
[1]  if ⟨T, F⟩ is not satisfiable then ERROR
[2]  else for each F ∈ F do
[3]      if F = Q(a, b) then F := F ∪ {∃Q(a), ∃Q⁻(b)}
[4]  A′ := A ∪ F; F′ := ∅
[5]  for each F₁ ∈ F do
[6]      if F₁ = C(a) then
[7]          for each G ∈ PerfectRef(¬C(a), T) do
[8]              if G = C′(a) and ⟨T, A⟩ ⊨ C′(a) then
[9]                  F′ := F′ ∪ {C′(a)}
[10]             else if G = ∃Q′(a) then
[11]                 F′ := F′ ∪ {Q′(a, b) | Q′(a, b) ∈ A′}
[12]     else if F₁ = Q(a, b) then
[13]         if (funct Q) ∈ T then
[14]             for each b′ ≠ b s.t. ⟨T, A⟩ ⊨ Q(a, b′) do
[15]                 F′ := F′ ∪ {Q(a, b′)}
[16]         if (funct Q⁻) ∈ T then
[17]             for each a′ ≠ a s.t. ⟨T, A⟩ ⊨ Q(a′, b) do
[18]                 F′ := F′ ∪ {Q(a′, b)}
[19] for each F ∈ F′ do
[20]     if F = C′(a) then
[21]         A′ := A′ \ {C′(a)}
[22]         for each C′ ⊑ C₁ in cl(T) do
[23]             if C₁(a) ∉ F′ then A′ := A′ ∪ {C₁(a)}
[24]         if F = ∃Q(a) then
[25]             for each ∃Q⁻ ⊑ C₂ in cl(T) do
[26]                 A′ := A′ ∪ {C₂(z)}, with z a new soft constant in V
[27]     else if F = Q(a, b) then
[28]         A′ := A′ \ {Q(a, b), ∃Q(a), ∃Q⁻(b)}
[29]         for each ∃Q ⊑ C₃ in cl(T) do
[30]             if C₃(a) ∉ F′ then A′ := A′ ∪ {C₃(a)}
[31]         for each ∃Q⁻ ⊑ C₄ in cl(T) do
[32]             if C₄(b) ∉ F′ then A′ := A′ ∪ {C₄(b)}

Figure 6.1: Algorithm ComputeUpdate(T, A, F)

2. a finite set F′′ of membership assertions that do not contradict F and are logically implied by K (such membership assertions are introduced into A′ at line 23, 26, 30, or 32); these are therefore such that Mod(T) ∩ Mod(F′′) ∩ Mod(F) ≠ ∅;

and by deleting from A the maximal finite set of membership assertions F′′′ = {F₁, ..., Fₘ} that contradict F; these are therefore such that Mod(T) ∩ Mod(F) ∩ Mod(Fᵢ) = ∅ for each Fᵢ ∈ F′′′, and there exists no F′ ∈ A \ F′′′ such that Mod(T) ∩ Mod(F′) ∩ Mod(F) = ∅.

Therefore we have that A′ = (A ∪ F ∪ F′′) \ F′′′. Then, since by hypothesis K is satisfiable, i.e., Mod(T) ∩ Mod(A) ≠ ∅, we have that Mod(K′) = Mod(T) ∩ Mod(A′) ≠ ∅, i.e., K′ is satisfiable.

Next, we deal with termination, soundness and completeness of the algorithm shown in Fig. 6.1.

Lemma (Termination) Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base and F a finite set of ground DL-Lite_FS membership assertions. Then the algorithm ComputeUpdate(T, A, F) terminates, returning ERROR if Mod(T) ∩ Mod(F) = ∅, and an ABox A′ such that ⟨T, A′⟩ is a DL-Lite_FS knowledge base otherwise.

Proof. The termination of ComputeUpdate(T, A, F) follows directly from the termination of the algorithm PerfectRef.

Next, we prove that the algorithm shown in Fig. 6.1 is sound and complete.

Lemma (Soundness) Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and K′ the DL-Lite_FS knowledge base K′ = ⟨T, A′⟩, where A′ = ComputeUpdate(T, A, F). Then, for every model I′ ∈ Mod(K′), we have that there exists I ∈ Mod(K) s.t. I′ ∈ U_T(I, F).

Proof. Let A′ = ComputeUpdate(T, A, F), and let I′ be a model of K′ = ⟨T, A′⟩. We show how to build an interpretation I that is a model of K. In particular, we start from I′ and modify it in order to obtain an interpretation I that satisfies K. Then we prove that I′ ∈ U_T(I, F), i.e., I′ is a model of T and F that is at minimal distance from I. Suppose first that I′ is a model of K.
Then the claim trivially holds by taking I = I′. Suppose now that I′ is not a model of K. Since I′ is by hypothesis a model of T, this means that I′ does not satisfy a set of membership assertions F′′′ = {Fᵢ | i = 1, ..., n} ⊆ A. Then, by construction, Fᵢ has been deleted from A′, for i = 1, ..., n. Let us now modify I′ in order to make it satisfy each Fᵢ in F′′′. Starting from i = 1, we repeatedly apply the function ModelSat, where I₀ = I′, Iₙ = I, and Iᵢ is the interpretation returned by calling ModelSat(Iᵢ₋₁, Fᵢ). Intuitively, ModelSat(Iᵢ₋₁, Fᵢ) modifies Iᵢ₋₁ by changing only the interpretations of constants in Γ that contradict the satisfaction of Fᵢ. More precisely, the computation of ModelSat(Iᵢ₋₁, Fᵢ) proceeds as follows.

1. First, we set Iᵢ = Iᵢ₋₁.

2. Second, we apply the following base rules.

(a) If Fᵢ = C(a), then we set a^{Iᵢ} ∈ C^{Iᵢ}.

(b) If Fᵢ has the form Fᵢ = Q(a, b), we set (a^{Iᵢ}, b^{Iᵢ}) ∈ Q^{Iᵢ}, a^{Iᵢ} ∈ (∃Q)^{Iᵢ} and b^{Iᵢ} ∈ (∃Q⁻)^{Iᵢ}. Moreover, if (funct Q) ∈ T, then for each (a^{Iᵢ₋₁}, b′^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁} such that b′ ≠ b we set (a^{Iᵢ}, b′^{Iᵢ}) ∉ Q^{Iᵢ}, and if there exists no a′ ≠ a such that (a′^{Iᵢ₋₁}, b′^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁}, then we set b′^{Iᵢ} ∉ (∃Q⁻)^{Iᵢ}. Respectively, if (funct Q⁻) ∈ T, then for each (a′^{Iᵢ₋₁}, b^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁} such that a′ ≠ a, we set (a′^{Iᵢ}, b^{Iᵢ}) ∉ Q^{Iᵢ}, and if there exists no b′ ≠ b such that (a′^{Iᵢ₋₁}, b′^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁}, then we set a′^{Iᵢ} ∉ (∃Q)^{Iᵢ}.

3. Third, we apply recursively the following rules.

(a) If a ∈ B^{Iᵢ}, B ⊑ C ∈ T and a ∉ C^{Iᵢ₋₁}, then set a ∈ C^{Iᵢ}. Note that this operation modifies the interpretation only if a ∈ B^{Iᵢ} has been set in a previous step (otherwise, since Iᵢ₋₁ is a model of T, a ∈ B^{Iᵢ₋₁} implies a ∈ C^{Iᵢ₋₁}).

(b) If a ∈ (∃Q)^{Iᵢ} and there exists no individual b such that (a, b) ∈ Q^{Iᵢ₋₁}, then add (a, b′) to Q^{Iᵢ} and b′ to (∃Q⁻)^{Iᵢ}, where b′ is an element of the domain of Iᵢ such that, if (funct Q⁻) ∈ T, there exists no a′ s.t. (a′, b′) ∈ Q^{Iᵢ₋₁}. Note that one such b′ always exists, since otherwise Fᵢ could not be satisfied together with T, which is not possible by hypothesis.

(c) If a ∉ (∃Q)^{Iᵢ}, then for each (a, b) ∈ Q^{Iᵢ₋₁} we set (a, b) ∉ Q^{Iᵢ}, and if there exists no a′ ≠ a such that (a′, b) ∈ Q^{Iᵢ₋₁}, then we set b ∉ (∃Q⁻)^{Iᵢ}.

Clearly, by construction, I defined as above is a model of T. Also, I satisfies F′′′, which is by hypothesis the set of membership assertions of A that are not in A′. Moreover, I still satisfies all other membership assertions in A. In fact, suppose by contradiction that there exists F′ ∈ A that is not satisfied by I. This means that, by construction, in order to satisfy F′′′, I′ has been modified so that F′ is not satisfied anymore. But then, this means that F′ contradicts some assertion of F′′′, which is not possible, since F′ and the assertions of F′′′ all belong to A and K is by hypothesis satisfiable.
Therefore I satisfies all membership assertions in A, which proves that I is a model of K. Now, in order to complete the proof, we need to show that I′ ∈ U_T(I, F). By hypothesis, I′ is a model of T. Moreover, since F ⊆ A′, I′ is a model of F. Let us now show that there exists no interpretation I′′ ≠ I′ of T and F that is closer to I than I′, i.e., such that I′′ ∈ Mod(T) ∩ Mod(F) and I′′ is at smaller distance from I than I′. Suppose by contradiction that such an interpretation I′′ exists. Then one of the following cases occurs:

1. either there exists a such that a ∈ A^I, a ∈ A^{I′′} and a ∉ A^{I′};
2. or there exists a such that a ∉ A^I, a ∉ A^{I′′} and a ∈ A^{I′};
3. or there exists (a, b) such that (a, b) ∈ Q^I, (a, b) ∈ Q^{I′′} and (a, b) ∉ Q^{I′};

4. or there exists (a, b) such that (a, b) ∉ Q^I, (a, b) ∉ Q^{I′′} and (a, b) ∈ Q^{I′};

where A and Q denote resp. an atomic concept and a role. Let us consider one by one all the above possible cases, starting from the first. Since I has been obtained from I′ by applying the function ModelSat as specified above, one of the following cases occurs.

Either there exists Fᵢ ∈ F′′′ such that Fᵢ = A(a), where Fᵢ ∈ A; in this case, since Fᵢ contradicts F, we have that A(a) ∈ PerfectRef(¬C(a), T) for some C(a) ∈ F. Therefore, a ∈ A^{I′′} would imply that a ∉ C^{I′′}, which would contradict that I′′ is a model of F.

Or a ∈ A^I has been set by the application of the function ModelSat to some a ∈ B^I with B ⊑ A ∈ T and a ∉ A^{I′}. This means that I′ was previously modified in order to satisfy an assertion in F′′′, and it was necessary to have a ∈ B^I in order to satisfy T. Therefore, again, we obtain a contradiction, since either a ∉ B^{I′′} and I′′ is not a model of F, or a ∈ B^{I′′} and I′′ is not a model of T.

Let us now consider the second case. If a ∉ A^I, a ∉ A^{I′′} and a ∈ A^{I′}, then we have that a ∈ (¬A)^I, a ∈ (¬A)^{I′′} and a ∉ (¬A)^{I′}. Therefore, we can reduce this case to the previous one, and prove that we would similarly obtain a contradiction.

Let us now suppose that I′′ is such that there exists (a, b) with (a, b) ∈ Q^I, (a, b) ∈ Q^{I′′} and (a, b) ∉ Q^{I′}. Then, since I has been obtained from I′ by applying the function ModelSat, I′ has been modified because one of the following cases occurs. Either I′ does not satisfy an assertion Fᵢ = Q(a, b) ∈ F′′′, where Fᵢ ∈ A. In this case, since Fᵢ contradicts F, we have that either (i) Fᵢ contradicts F because of a functionality assertion, or (ii) Fᵢ comes from the perfect reformulation of ¬C(a) for some C(a) ∈ F, which means that Q(a, b) logically implies ¬C(a). Suppose first that Fᵢ contradicts F because of a functionality assertion, e.g. (funct Q) ∈ T for some Q(a, b′) ∈ F, b′ ≠ b.
Then (a, b) ∈ Q^{I′′} would imply that (a, b′) ∉ Q^{I′′}, which would contradict that I′′ is a model of F. Similarly, we would obtain a contradiction by supposing that (funct Q⁻) ∈ T for some Q(a′, b) ∈ F, a′ ≠ a. Suppose now that Fᵢ contradicts F because it logically implies the negation of some assertion in F, e.g. C(a). Then, (a, b) ∈ Q^{I′′} would imply that a ∉ C^{I′′}, which would contradict that I′′ is a model of F.

Or, I is such that a ∈ (∃Q)^I and there existed no b′ such that (a, b′) ∈ Q^{I′}. But then, this means that I′ was previously modified to make I satisfy F′′′ and T. In particular, this means that ∃Q(a) is logically implied by an assertion contradicting F. Therefore, again, if (a, b) ∈ Q^{I′′}, then we obtain a contradiction, since we would have a ∈ (∃Q)^{I′′}, which would imply that I′′ is not a model of F.

Let us now consider the latter case, i.e. the case of (a, b) such that (a, b) ∉ Q^I, (a, b) ∉ Q^{I′′} and (a, b) ∈ Q^{I′}. By inspecting the function ModelSat, we easily note that the only cases in which the interpretation of (a, b) is modified so that (a, b) ∈ Q^{I′} and (a, b) ∉ Q^I are the following.

Either Fᵢ = Q(a, b′) ∈ F′′′ contradicts F′ ∈ F for some b′ ≠ b, where F′ = Q(a, b) and (funct Q) ∈ T. By setting (a, b′) ∈ Q^I we must consequently set (a, b) ∉ Q^I, whereas (a, b) ∈ Q^{I′}. But then, if (a, b) ∉ Q^{I′′}, we obtain a contradiction, since I′′ is not a model of F.

Or Fᵢ = Q(a′, b) ∈ F′′′ contradicts F′ ∈ F for some a′ ≠ a, because F′ = Q(a, b) and (funct Q⁻) ∈ T. This case is analogous to the previous one.

Or (a, b) ∉ Q^I because a ∉ (∃Q)^I. This means that I′ was previously modified to satisfy F′′′, and it was necessary to have a ∉ (∃Q)^I in order to satisfy T. Therefore, again, we obtain a contradiction, since either a ∈ (∃Q)^{I′′} and I′′ is not a model of F, or a ∉ (∃Q)^{I′′} and I′′ is not a model of T. Similarly, we would obtain a contradiction by supposing that (a, b) ∉ Q^I because b ∉ (∃Q⁻)^I; this case is analogous to the previous one.

Therefore, assuming that there exists an interpretation I′′ such that I′′ ∈ Mod(T) ∩ Mod(F) and I′′ is closer to I than I′ leads to a contradiction, which proves that I′ ∈ U_T(I, F).

Lemma (Completeness) Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and K′ the DL-Lite_FS knowledge base K′ = ⟨T, A′⟩, where A′ = ComputeUpdate(T, A, F). Then, for every model I ∈ Mod(K), we have that U_T(I, F) ⊆ Mod(K′).

Proof. To prove the lemma, we proceed by assuming by contradiction that there exists an interpretation I′ ∈ U_T(I, F) that is not a model of K′. Then I′ does not satisfy at least one membership assertion F′ in A′ \ F. We can suppose without loss of generality that there exists only one such assertion. By construction, A′ contains all the assertions of A that do not contradict F, plus other assertions (introduced into A′ at line 23, 26, 30, or 32 of the algorithm ComputeUpdate) that are logically implied by K and do not contradict F. Then, suppose that we modify I′ in order to make it satisfy F′.
We must consequently modify I′ to make it still satisfy T. This can be done by applying to I′ the function ModelReach(I, I′, F′). This function is similar to ModelSat (cf. the proof of soundness) in that it basically modifies I′ by forcing it to satisfy F′ and T. However, since here we aim at building a model of F′ that is closer to I than I′, ModelReach(I, I′, F′) proceeds by performing the same choices as in I. Note that this is always possible, since I is a model of F′. More precisely, the computation of ModelReach(I, I′, F′) returns an interpretation Ī by proceeding as follows.

1. First, we set Ī = I′.

2. Second, we modify Ī in order to make it satisfy F′ as follows.

(a) If F′ = C(a), then we set a ∈ C^Ī.

(b) If F′ = C(z), then we find one constant b ∈ C^I and we set b ∈ C^Ī. Note that such a constant must exist. In fact, F′ is inserted into A′ at line 26. Then, by hypothesis, we have that (i) a ∈ (∃Q)^I, which implies that there exists b s.t. (a, b) ∈ Q^I and b ∈ (∃Q⁻)^I, and (ii) ∃Q⁻ ⊑ C₂ ∈ T, which implies that b ∈ C₂^I. The case of an inverse role is analogous.

(c) If F′ = Q(a, b), then we set (a, b) ∈ Q^Ī, a ∈ (∃Q)^Ī and b ∈ (∃Q⁻)^Ī.

3. Third, we apply recursively the following rules in order to make Ī satisfy T.

(a) If a ∈ B^Ī, B ⊑ C ∈ T and a ∉ C^Ī, then set a ∈ C^Ī. Note that this operation modifies Ī only if a ∈ B^Ī has been set in a previous step (otherwise, since I′ is a model of T, a ∈ B^{I′} implies a ∈ C^{I′}).

(b) If a ∈ (∃Q)^Ī (resp. a ∈ (∃Q⁻)^Ī) and there exists no individual b ∈ Γ such that (a, b) ∈ Q^Ī (resp. (b, a) ∈ Q^Ī), then for each (a, b′) ∈ Q^I (resp. (b′, a) ∈ Q^I), set (a, b′) ∈ Q^Ī (resp. (b′, a) ∈ Q^Ī). Note again that this operation modifies Ī only if a ∈ (∃Q)^Ī has been set in a previous step. Moreover, in this case, there always exists at least one b′ such that (a, b′) ∈ Q^I (resp. (b′, a) ∈ Q^I), since I is a model of F′ and Ī is modified in order to satisfy F′ and everything that is logically implied by F′.

Clearly, by construction, the interpretation Ī obtained as above is a model of T. Moreover, Ī satisfies F′, which is by hypothesis the only membership assertion of A′ that is not satisfied by I′. Moreover, Ī still satisfies all other membership assertions in A′. In fact, suppose by contradiction that there exists F′′ ∈ A′ that is not satisfied by Ī. This means that, by construction, in order to satisfy F′, I′ needs to be modified so that F′′ is not satisfied anymore. But then, this means that F′ contradicts F′′, which is not possible, since F′ and F′′ both belong to A′ and, by the satisfiability lemma above, K′ is satisfiable. Therefore Ī satisfies all membership assertions in A′, which proves that Ī is a model of K′.
Finally, Ī is closer to I than I′, since by construction Ī is obtained by modifying I′ so that Ī interprets a set of objects as I does (whereas I′ does not), and nothing that is interpreted in I′ as in I is interpreted differently in Ī. Therefore, by assuming that I′ ∈ U_T(I, F) and that I′ is not a model of K′, we obtain that it is possible to build a model Ī that is closer to I than I′, which is a contradiction.

From the two lemmas above, we get the following theorem, which sanctions the correctness of our algorithm.

Theorem Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and A′ = ComputeUpdate(T, A, F). Then the set of models resulting from updating K with F coincides with Mod(⟨T, A′⟩).

Interestingly, if we do not allow DL-Lite_FS membership assertions involving soft constants in the ABox, then we lose expressive power, as shown by the following example.

Example The TBox {∃P⁻ ⊑ A₁, A₂ ⊑ ¬∃P} and the ABox {∃P(a)} imply that there exists an object that is both a P-successor of a and an instance of A₁. Now let us consider the update {A₂(a)}. As a result of the update, A₂(a) must be logically implied; hence we must remove ∃P(a) from the ABox, but the fact that there is an instance of A₁ must remain logically implied after the update. It can be easily seen that, to express this in the new ABox, we must use A₁(z), where z is a new soft constant.

Note that a similar observation holds for membership assertions involving general concepts. Next we turn to the computational complexity of computing the update. By analyzing the algorithm we get:

Theorem Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and A′ = ComputeUpdate(T, A, F). Then:

- the size of A′ is polynomially bounded by the size of T ∪ A ∪ F;
- computing A′ can be done in polynomial time in the size of T ∪ A ∪ F.

Proof. The proof of this theorem is an immediate consequence of the following observations:

- there is one call to PerfectRef for each assertion in F;
- PerfectRef(q, T) runs in polynomial time in the size of T, and in exponential time in the size of q; thus, in this case, since the input query is a single ground atom, each call to PerfectRef has cost polynomial in T; moreover, it produces a set of facts whose size is polynomial in the size of T;
- for each F returned by PerfectRef, the check K ⊨ F is in LOGSPACE w.r.t. A;
- for each F ∈ F′, the cost of eliminating F from A′ is clearly polynomial in the size of A.
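To make the polynomial-time computability of the deductive closure concrete, the following is a small fixpoint sketch. It is our own toy rendering, not the thesis' formalization: basic concepts are plain strings, a negated concept ¬B is encoded as the pair `('not', B)`, and roles and functionality assertions are left out.

```python
from itertools import product

def closure(inclusions):
    """Fixpoint closure of a set of DL-Lite-style inclusions.

    Each inclusion is a pair (lhs, rhs); lhs is a concept name,
    rhs is a name or ('not', name).  Two rules are applied until
    no new inclusion is produced:
      - chain:           B1 <= B2 and B2 <= C   give  B1 <= C
      - contrapositive:  B1 <= not B2           gives B2 <= not B1
    The universe of derivable pairs is finite, so this terminates,
    and each pass is polynomial in the current size of the set.
    """
    cl = set(inclusions)
    changed = True
    while changed:
        changed = False
        new = set()
        for (l1, r1), (l2, r2) in product(cl, cl):
            if r1 == l2:                      # chain rule
                new.add((l1, r2))
        for (l, r) in cl:
            if isinstance(r, tuple) and r[0] == 'not':
                new.add((r[1], ('not', l)))   # contrapositive
        if not new <= cl:
            cl |= new
            changed = True
    return cl

# A <= B together with B <= not C yields A <= not C, C <= not B, C <= not A.
t = {("A", "B"), ("B", ("not", "C"))}
cl_t = closure(t)
```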

Part III

XML-based DIS


As we already discussed in Chapter 2, several data integration systems and theoretical works have been proposed for relational data, whereas not much investigation has focused yet on XML-based data integration, with few exceptions (cf. Chapter 2, Fig. 2.1). Our goal in this part of the thesis is to address some of its issues. In particular, we highlight two major issues that emerge in the XML context: (i) the global schema may be characterized by a set of constraints, expressed by means of a DTD and XML integrity constraints; (ii) the concept of node identity requires the introduction of semantic criteria to identify nodes coming from different sources. The latter is similar to the problem of identifying objects in mediator systems [78]. Given the importance of this issue for information integration, much work has recently focused on identifying records representing the same real-world entity and reconciling them to obtain one record per entity (the so-called Entity Resolution [19], or Reference Reconciliation [38], problems). As we shall see, this problem requires a specific solution in the context of XML data integration.

Let us first illustrate XML-based data integration issues by an example. Suppose that a hospital offers access to information about patients and their treatments. Information is stored in XML documents managed by different services of the hospital. However, for privacy and security reasons, each user sees only the parts of the data allowed by her access rights. For instance, statisticians have access to the global schema S_G having the form of the following DTD:

S_G:
<!ELEMENT hospital (patient+, treatment+)>
<!ELEMENT patient (SSN, name, cure*, bill*)>
<!ELEMENT treatment (trid, procedure?)>
<!ELEMENT procedure (treatment+)>

To simplify, and following a common approach for XML data, we consider XML documents as unordered trees, with nodes labeled with element names.
The above DTD says that the document contains data about patients and hospital treatments, where a cure is nothing but a treatment id. Moreover, a set of key and foreign key constraints is specified over the global schema. In particular, we know that two patients cannot have the same social security number SSN, that two treatments cannot have the same number trid, and that all the prescribed cures have to appear among the treatments of the hospital. Such constraints correspond respectively to two key constraints and one foreign key constraint. Finally, assume that the sources consist of the following two documents, D1 and D2, with DTDs S1 and S2.

D1:
<hospital>
  <patient>
    <name>Parker</name>
    <SSN>55577</SSN>
  </patient>
  <patient>
    <name>Rossi</name>
    <SSN>20903</SSN>
  </patient>
</hospital>

S1:
<!ELEMENT hospital (patient*)>
<!ELEMENT patient (name, SSN)>

D2:
<hospital>
  <patient>
    <SSN>55577</SSN>
  </patient>
</hospital>

S2:
<!ELEMENT hospital (patient*)>
<!ELEMENT patient (SSN)>

By means of mappings, we specify that D1 contains patients with a name and a social security number below a given threshold, and that D2 contains patients that paid a bill and were prescribed at least one dangerous cure (we assume that these have numbers smaller than 35). Moreover, we specify that these mappings are sound, which means that D1 and D2 contain resp. a subset of all patients having a name and a social security number below the threshold, and a subset of all patients having paid a bill and been prescribed a dangerous cure. Note that if we had known that the sources contained exactly all such patients, then the mappings would have been exact, instead of sound. Suppose now that a user asks the following queries:

1. Find the name and the SSN of all patients having a name and a SSN, that paid a bill and that were prescribed at least one cure.
2. Does the hospital offer dangerous treatments?

As usual in DIS, our goal is to find the certain answers, i.e. the answers that are returned over all data trees that satisfy the global schema and conform to the data at the sources. By adapting the data integration terminology introduced in Chapter 1, we call them legal data trees. A crucial point here is that knowledge about legal data trees may be obtained by merging the source trees. An important issue is thus to identify nodes from different sources that correspond to the same real-world entity, a process sometimes called entity resolution [19], or reference reconciliation [38]. In practice, entity resolution is typically based on machine learning.
We abstract this part of the problem here by assuming that the identification of nodes from different sources, and thus the merging of the source trees, is based on constraints, more precisely key constraints. One can think of these keys as being added by a separate entity resolution module. Note, however, that the data retrieved may not satisfy these constraints. In particular, there are two kinds of constraint violation. Data may be incomplete, i.e. it may violate constraints by not providing all the data required by the schema. Or, the data retrieved may be inconsistent, i.e. it may violate constraints by providing two elements that are semantically the same but cannot be merged without violating key constraints. In the following, we address the problem

of answering queries in the presence of incomplete data, while we assume that data does not violate the constraints.

Coming back to the example, one can verify that the sources are consistent. The specification of the global schema constraints allows us to answer Query 1 by returning the patient with name Parker and social security number 55577, since thanks to the key constraint we know that there cannot be two patients with the same SSN. Note that Query 2 can also be answered with certainty. Mappings actually let us infer that the patient named Parker was prescribed a dangerous cure. In addition, thanks to the foreign key constraint, we know that every cure that is prescribed to some patient is provided by the hospital.

We conclude the example by highlighting the impact of the assumption of having sound/exact mappings. Suppose that no constraints were expressed over the global schema. Under the exact mapping assumption, by inspecting the data sources, it is possible to conclude that there is only one way to merge the data sources. Indeed, since every patient has a name and a SSN, we can deduce that all patients in D2 with a SSN below the threshold belong also to D1. Therefore the answer to Query 1 would be the same as in the presence of constraints, whereas no answer would be returned to Query 2, since no information is given on that portion of the global schema. On the other hand, under the assumption of sound mappings, since in the absence of constraints there could be two patients with the same SSN, both queries would return empty answers.

The main contributions of this part of the thesis are as follows.
First, following the logical approach presented in Section 1.1, we propose a formal framework for XML data integration systems based on (i) a global schema specified by means of a (simplified) DTD and a set of XML integrity constraints as defined in [42], (ii) a source schema specified by means of DTDs, and (iii) a set of LAV mappings specified by means of a prefix-selection query language inspired by the query language defined in [6]. Second, we define the notion of identification function, and provide one such function that aims at globally identifying nodes coming from different sources. As already mentioned, the need for introducing identification is motivated by the concept of node identity. Third, we study the decidability of XML DIS consistency, and study its complexity under different assumptions on the mappings. Finally, we address the query answering problem in the XML data integration setting. In particular, given the strong connection with query answering under incomplete information, we propose an approach that is reminiscent of that context. We provide two polynomial algorithms to answer queries under different assumptions, and study the complexity of general XML DIS query answering.

This part of the thesis is an expanded and updated version of a DBPL conference paper [80]. It is organized as follows. Below, we start by discussing related work. In Chapter 7, we introduce the setting. In particular, we present the data model, the schema language and the query language used in this part. Then,

the logical framework for XML data integration is introduced in Chapter 8, where we also define the notion of identification function and provide one particular such function. Finally, in Chapter 9, we investigate query answering, study its complexity, and propose different algorithms to answer queries under the assumptions of sound, exact and mixed mappings.

Chapter 7

The setting

In this chapter, we introduce preliminary definitions and propositions that we use throughout this part of the thesis. In particular, we start by presenting the data model for XML documents, and some properties of the model. Then, we define types for data, corresponding to simplified DTDs. We also introduce XML constraints that, together with types, form the schema language. Finally, we present the query language, which is an extension of the one introduced in [6].

7.1 Data model

In this work, XML documents are represented as labeled unordered trees, called data trees, formally defined as follows.

Definition Let N be a set of node identifiers, Σ a finite set of element names (labels), and Γ_⊥ = Γ ∪ {⊥} a domain of data values, where the symbol ⊥ is a special data value that represents the empty value. A (data) tree T over Σ and Γ_⊥ is a triple T = ⟨t, λ, ν⟩, where:

- t is a finite rooted tree (possibly empty) with nodes from N;
- λ, called the labeling function, associates a label in Σ to each node of t; and
- ν, the data mapping, assigns a value in Γ_⊥ to each node of t.

The number of nodes of a data tree T is denoted |T|, whereas the depth of t is denoted d(t). We call datanodes those nodes n of t such that ν(n) ≠ ⊥.

Example In Fig. 7.1, we show three different data trees containing information about wards and patients admitted in a hospital. Note that only data values different from ⊥ are represented, and they are circled. Therefore, datanodes can be easily distinguished.

We next introduce the notions of subsumption and equivalence. Intuitively, a data tree is subsumed by another tree if all the information it contains may also be found in the other tree. And two data trees are equivalent if they hold the same information content (up to replication). Indeed, two equivalent trees will be indistinguishable with the positive query language that we will consider.
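The data model above can be rendered concretely. Below is a minimal sketch (the class name, the representation, and the sample values are our own, not the thesis' notation): a data tree is a node with a label from Σ, a data value from Γ (with None standing for ⊥), and an unordered collection of children.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class DataTree:
    """A node of an unordered labeled data tree T = <t, lambda, nu>."""
    label: str                       # element name from Sigma
    value: Optional[str] = None      # data value from Gamma, None = empty (⊥)
    children: List["DataTree"] = field(default_factory=list)

    def size(self) -> int:
        """Number of nodes |T|."""
        return 1 + sum(c.size() for c in self.children)

    def depth(self) -> int:
        """Depth d(t): nodes on the longest root-to-leaf path."""
        return 1 + max((c.depth() for c in self.children), default=0)

    def datanodes(self) -> int:
        """Number of nodes whose data value differs from ⊥."""
        return (1 if self.value is not None else 0) + \
               sum(c.datanodes() for c in self.children)

# A fragment of the hospital example: one ward with two admitted patients
# (the admission ids "101" and "102" are made up for illustration).
ward = DataTree("ward", "Geriatric", [
    DataTree("admitted", None, [DataTree("adminid", "101")]),
    DataTree("admitted", None, [DataTree("adminid", "102")]),
])
```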

[Figure 7.1: Data Model — (a) data tree T1, a hospital with a Geriatric and a Psychiatric ward and their admitted patients (adminid nodes); (b) data tree T2 and (c) data tree T3, holding subsets and replications of the same information.]

Homomorphism, subsumption, equivalence

We next define two notions that are crucial for this study.

Definition Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be two data trees, and h a function from the nodes of t to the nodes of t′. We say that h is a homomorphism from t to t′ if and only if h is a total function from the nodes of t to (a subset of) the nodes of t′ such that, for each n, n′:

- if n is the root of t, then h(n) is the root of t′;
- if n is a child of n′, then h(n) is a child of h(n′) in t′; we therefore say that h preserves the parent-child relationship;
- λ′(h(n)) = λ(n); we say that h preserves the labeling;
- either ν(n) = ⊥ or ν(h(n)) = ν(n); we say that h preserves data.

Definition Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be two data trees. We say that T is subsumed by T′, written T ⊑ T′, if and only if there exists a homomorphism from t to t′. Moreover, we say that T is equivalent to T′, written T ≡ T′, if and only if T ⊑ T′ and T′ ⊑ T.

Note that, according to the above definition, the empty tree, i.e. the tree that does not contain any node, denoted T_∅, is subsumed by all data trees.
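The subsumption test implied by these definitions can be sketched directly: roots must agree on the label (and on the data value, unless the first root carries ⊥), and every subtree of the first root must be subsumed by some subtree of the second. This is a sketch under assumptions of ours — trees are plain `(label, value, children)` triples with None standing for ⊥, and None for the empty tree.

```python
def subsumed(t1, t2) -> bool:
    """Check T1 ⊑ T2 for unordered data trees given as
    (label, value, children) triples; value None means ⊥,
    tree None means the empty tree."""
    if t1 is None:                # the empty tree is subsumed by everything
        return True
    if t2 is None:
        return False
    (l1, v1, cs1), (l2, v2, cs2) = t1, t2
    # roots: same label, and data preserved unless v1 is the empty value
    if l1 != l2 or (v1 is not None and v1 != v2):
        return False
    # each subtree of the first root subsumed by some subtree of the second
    return all(any(subsumed(c1, c2) for c2 in cs2) for c1 in cs1)

def equivalent(t1, t2) -> bool:
    """T1 ≡ T2 iff each subsumes the other."""
    return subsumed(t1, t2) and subsumed(t2, t1)
```

Note that the nested all/any recursion is exactly the characterization proved in the lemma below, and yields the O(|T|·|T′|) bound discussed there.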

Example Let T1 = ⟨t1, λ1, ν1⟩, T2 = ⟨t2, λ2, ν2⟩ and T3 = ⟨t3, λ3, ν3⟩ be the data trees shown in Fig. 7.1(a), 7.1(b) and 7.1(c). It is easy to see that T2 and T3 are both subsumed by T1. Moreover, T1 is not subsumed by T2, since there exists no homomorphism from t1 to t2. Finally, T1 is subsumed by T3, which means that T1 and T3 are equivalent.

The following lemma provides an immediate algorithm for checking subsumption.

Lemma For all data trees T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩, T ⊑ T′ if and only if either T is empty (1), or:

- they have roots r, r′, respectively, with λ(r) = λ′(r′), and ν(r) = ⊥ or ν(r) = ν′(r′) (2);
- each subtree of r is subsumed by some subtree of r′ (3).

Proof. (⇒) Suppose first that T ⊑ T′, witnessed by a homomorphism h. If T is the empty tree, then (1) holds. Suppose now that T is not empty. Clearly, the definition of homomorphism implies (2). Now, by considering h on the subtrees of the root, one can easily prove (3).

(⇐) Let us now prove by induction on the depth k of T that:

(*) For all T, T′, if (1)–(3) hold for T, T′ and d(t) ≤ k, then there exists a homomorphism from T to T′, i.e., T ⊑ T′.

The basis of the induction is obvious by (1). Now suppose that (*) holds for some k, and let T, T′ satisfy (1)–(3) with d(t) = k + 1. Let T1, ..., Tn be the distinct subtrees of the root of T. For each i, Ti is subsumed by some subtree of the root of T′. By the induction hypothesis, there exists a homomorphism hi from Ti to that subtree. Let h be the function that maps the root of T to that of T′ and coincides with hi on each Ti. One can verify that h is a homomorphism from T to T′. By induction, this shows that (*) holds for each k.

We now study the complexity of subsumption.

Proposition Let T and T′ be two data trees. One can check whether T ⊑ T′ in time O(|T| · |T′|).

Proof.
(sketch) Let c be a constant such that, for each T of depth at most k and each T′, one can check T ⊑ T′ in time c·|T|·|T′|. (To simplify, ignore empty data trees.) Let T be a tree of depth k + 1. The main issue is the cost of comparing the subtrees T1, ..., Tk of the root of T to the subtrees T′1, ..., T′l of the root of T′. By induction, comparing Ti to T′j can be performed in time c·|Ti|·|T′j|. Then the cost of comparing the subtrees is:

Σ_{i ∈ [1..k]} Σ_{j ∈ [1..l]} (c·|Ti|·|T′j|) ≤ c · Σ_{i ∈ [1..k]} (|Ti| · Σ_{j ∈ [1..l]} |T′j|) ≤ c · Σ_{i ∈ [1..k]} (|Ti| · |T′|) ≤ c · |T| · |T′|.

This concludes the proof.

As a direct consequence of the previous proposition and the definition of equivalence of two data trees, we have the following:

Corollary Let T and T′ be two data trees. One can check whether T ≡ T′ in time O(|T| · |T′|).

To conclude with subsumption and equivalence, we observe the following properties:

Proposition (i) Subsumption is transitive; (ii) equivalence is reflexive, symmetric and transitive, i.e., it is an equivalence relation.

Proof. (Subsumption) Let T, T′ and T′′ be such that T ⊑ T′ and T′ ⊑ T′′. Then, by definition, there exist homomorphisms h12 from T to T′ and h23 from T′ to T′′. Let h13 be the function over the nodes of T defined by: for each node n, h13(n) = h23(h12(n)). It is easy to verify that h13 preserves the root, the labeling and the data. Therefore h13 is a homomorphism, so T ⊑ T′′. (Equivalence) Reflexivity and symmetry hold by definition. Transitivity comes from the transitivity of subsumption.

Tree prefixes, minimality

Consider two equivalent data trees. Clearly, it may be the case that one of them contains a lot of replication whereas the other does not. In practice, one would prefer to use a minimal data tree. The lack of redundancy is captured by the following two definitions.

Definition A data tree T′ = ⟨t′, λ′, ν′⟩ is a prefix of T = ⟨t, λ, ν⟩ if and only if: the root r of t′ is the root of t; every subtree of t′ rooted at a child of r is a prefix of a subtree of t rooted at a child of r; and λ′ and ν′ are resp. the restrictions of λ and ν to the nodes of t′.

Clearly, we have the following lemma.

Lemma For each T, and each prefix T′ of T, we have that T′ ⊑ T.

Definition Let T be a data tree. We say that T is minimal if there is no prefix of T, other than T itself, that is equivalent to T.

Example Let us consider again the data trees T1 = ⟨t1, λ1, ν1⟩ and T3 = ⟨t3, λ3, ν3⟩. One can see that T3 is not minimal, whereas T1 is.

7.1. DATA MODEL

Let T = ⟨t, λ, ν⟩ be a data tree. We will use the algorithm Minimal(T) that takes as input T and returns a tree by proceeding as follows:

1. minimize the subtrees of the root;
2. select arbitrarily one subtree of the root that is subsumed by another one and remove it, until there is no subsumed subtree.

We next see that this algorithm constructs a minimal tree that is equivalent to T in quadratic time:

Proposition. Given a tree T, one can construct the data tree Minimal(T), which is equivalent to T and minimal, in PTIME with respect to the size of T.

Proof. (sketch) By construction and by Lemma 7.1.6, Minimal(T) is equivalent to T. Suppose that it is not minimal. Then for some node n in the tree, some subtree would be redundant, a contradiction with the construction. For the complexity, the proof is by induction on the number of nodes in the tree. Suppose that, for some constant c, the complexity of minimizing a data tree T is c|T|², for all trees of size less than k. Consider a tree of size n. We have to minimize its subtrees T₁, ..., T_k, which costs:

Σ_{j∈[1..k]} c|T_j|² ≤ c|T|²

Note that we also have to test equivalence, but that is polynomial by the corollary above.

We also have:

Proposition. For each equivalence class of data trees, there exists a minimal element that is unique up to isomorphism (i.e., up to renaming node ids).

Proof. The existence of a minimal tree follows from the previous proposition. For uniqueness, suppose there are two such minimal trees T, T′. Since T ≡ T′, there exist homomorphisms h from T to T′ and h′ from T′ to T. First suppose that h′(T′) = T. Then T, T′ are isomorphic. Now suppose that h′(T′) ⊂ T (strict subset). Then h′(h(T)) ⊆ h′(T′) ⊂ T. Then one subtree of T is redundant, a contradiction with the minimality of T. Thus, h′(T′) ⊂ T is not possible. Hence, T, T′ are isomorphic.

Based on the previous results, we assume without loss of generality that all the trees we consider from now on are minimal, unless explicitly said otherwise.
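The Minimal(T) algorithm above can be sketched in Python as follows (a hypothetical model with an inline subsumption test; trees are `(label, value, children)` tuples, `None` standing for ⊥):

```python
# Hypothetical sketch of Minimal(T): minimize the subtrees first, then
# repeatedly drop a subtree of the root that is subsumed by a sibling.

def subsumed(t1, t2):
    (l1, v1, k1), (l2, v2, k2) = t1, t2
    return (l1 == l2 and (v1 is None or v1 == v2)
            and all(any(subsumed(a, b) for b in k2) for a in k1))

def minimal(t):
    label, val, kids = t
    kids = [minimal(k) for k in kids]      # step 1: minimize the subtrees
    changed = True
    while changed:                          # step 2: drop subsumed siblings
        changed = False
        for i in range(len(kids)):
            if any(j != i and subsumed(kids[i], kids[j])
                   for j in range(len(kids))):
                del kids[i]
                changed = True
                break
    return (label, val, kids)
```

Note that when two sibling subtrees are identical, each is subsumed by the other; removing one at a time (and restarting the scan) ensures exactly one copy survives.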
Intersection

To conclude the presentation of the data model, we consider a last notion, namely intersection.

Definition. Let T and T′ be two data trees. The intersection of T and T′, denoted T ∩ T′, is the largest tree that is smaller than both, i.e., it is a tree T″ such that: (i) T″ ⊑ T and T″ ⊑ T′, and (ii) for each T‴, if T‴ ⊑ T and T‴ ⊑ T′, then T‴ ⊑ T″.

We will see that for each pair of data trees, their intersection always exists and is unique up to equivalence.

Example. Let us consider the data trees T₁ and T₂, resp. in Fig. 7.2(a) and 7.2(b). They contain data about patients and treatments of a hospital. In Fig. 7.2(c) we show the intersection T₁ ∩ T₂.

[Figure 7.2: Data Model — (a) Data tree T₁, (b) Data tree T₂, (c) T₃ = T₁ ∩ T₂]

One can verify that there exists no tree that is subsumed by both T₁ and T₂ and is not subsumed by T₁ ∩ T₂.

Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be data trees with resp. roots r, r′. We next show that their intersection is constructed by the recursive function Intersection(T, T′) as follows. If λ(r) ≠ λ′(r′), then T″ is the empty tree. Otherwise, T″ = ⟨t″, λ″, ν″⟩, where:

- the root of t″ is a new node that inherits the label of the two roots; moreover, if both roots are datanodes having the same value, the root of t″ inherits their value, and otherwise the value ⊥;
- the subtrees of the root of T″ are the set of trees:

  {Intersection(T_s, T′_s) | T_s a subtree of the root of T, T′_s a subtree of the root of T′}

Note that the function above does not return a minimal tree. However, it is immediate to build from the returned data tree the minimal tree that is equivalent to it, by simply applying the algorithm Minimal(T″) defined previously. We have the following result:

Proposition. Given two data trees T and T′, Intersection(T, T′) is an intersection of T and T′, and can be computed in quadratic time.

Proof. To show that T″ = Intersection(T, T′) is an intersection of T and T′, we have to prove that T″ satisfies the two properties (i) and (ii) of the definition of intersection. By construction, T″ clearly satisfies (i). Let us now consider (ii). Let T‴ = ⟨t‴, λ‴, ν‴⟩ be such that T‴ ⊑ T and T‴ ⊑ T′. We show that T‴ ⊑ T″. Since T‴ ⊑ T and T‴ ⊑ T′, there exist two homomorphisms h₁, h₂ from T‴ to T and T′ respectively. Let h be the function from t‴ to t″ recursively defined as follows:

- h(r‴) = r″, where r‴ is the root of t‴ and r″ is the root of t″. Note that h preserves the parent-child relationship for r‴, since r‴ and r″ are both roots. Moreover, since T‴ ⊑ T and T‴ ⊑ T′, we have λ‴(r‴) = λ(r) = λ′(r′), where r, r′ are resp. the roots of t, t′. But then, from the construction of T″, we have λ‴(r‴) = λ″(r″), which means that h preserves the label of r‴. Similarly, if ν‴(r‴) ≠ ⊥, then we must have ν‴(r‴) = ν(r) = ν′(r′), and then ν‴(r‴) = ν″(r″). On the contrary, if at least one among r, r′ is not a datanode, then ν‴(r‴) = ⊥. Therefore, h preserves the data mapping of r‴.
- for every child n‴ of r‴, let T₀ be the subtree of T‴ rooted at n‴. Since T‴ ⊑ T, h₁ maps T₀ into a subtree T_s of T rooted at a child of r. Similarly, since T‴ ⊑ T′, h₂ maps T₀ into a subtree T′_s of T′ rooted at a child of r′. Then T₀ ⊑ Intersection(T_s, T′_s). We can therefore let h map the root of T₀ to the root of the subtree T″_s of r″ such that T″_s = Intersection(T_s, T′_s), and proceed recursively, which proves that h preserves the parent-child relationship.

From the previous construction, it is clear that h is a homomorphism from t‴ to t″. Therefore, we have that T‴ ⊑ T″.

In order to prove that Intersection(T, T′) runs in time O(N·N′), where N, N′ are resp. the numbers of nodes of T and T′, we would again proceed by induction.
We omit the details, since the proof is very similar to the one of the complexity of checking subsumption (cf. proof of Proposition 7.1.7).

Proposition. Given two data trees, their intersection always exists and is unique up to tree equivalence.

Proof. The existence of an intersection of two data trees follows directly from the previous proposition. To show uniqueness, let T₁″ and T₂″ be two intersections of T and T′. By property (i) of the definition of intersection applied to T₁″, we have T₁″ ⊑ T and T₁″ ⊑ T′. By property (ii) applied to T₂″, we then have T₁″ ⊑ T₂″. By symmetry, T₂″ ⊑ T₁″, so T₁″ and T₂″ are equivalent.
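The recursive Intersection function described above admits a compact sketch. This is a hypothetical Python model, not from the thesis: trees are `(label, value, children)` tuples, `None` standing for ⊥, and an empty intersection is reported as `None` so that it can be filtered out of the children.

```python
# Hypothetical sketch of Intersection(T, T'): pair up the roots, keep a
# data value only when both roots agree on it, intersect subtrees pairwise.

def intersection(t1, t2):
    (l1, v1, k1), (l2, v2, k2) = t1, t2
    if l1 != l2:
        return None  # differently-labelled roots: empty intersection
    val = v1 if (v1 is not None and v1 == v2) else None
    kids = [s for a in k1 for b in k2
            if (s := intersection(a, b)) is not None]
    return (l1, val, kids)  # may be non-minimal; apply Minimal afterwards
```

As in the text, the result is not necessarily minimal: pairing every subtree of one root with every subtree of the other can introduce redundant siblings, which the minimization step then removes.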

[Figure 7.3: Example of a tree type]

7.2 Tree Type

Let Σ be an alphabet. A tree type over Σ is a simplified version of DTDs that can be represented as a triple ⟨Σ, r, µ⟩, where Σ is a set of labels, r ∈ Σ is a special label denoting the root, and µ associates to each label a ∈ Σ a multiplicity atom µ(a) representing the type of a, i.e., the set of labels allowed for the children of nodes labeled a, together with some multiplicity constraints. More precisely, µ(a) is an expression a_1^{ω_1} · · · a_k^{ω_k}, where the a_i are distinct labels in Σ, and ω_i ∈ {∗, +, ?, 1}, for i = 1, ..., k. We say that a data tree T over Σ satisfies a tree type S = ⟨Σ, r, µ⟩, noted T ⊨ S, if and only if: (i) the root of T has label r, and (ii) for every node n of T such that λ(n) = a, if µ(a) = a_1^{ω_1} · · · a_k^{ω_k}, then all the children of n have labels in {a_1, ..., a_k}, and the number of children labeled a_i is restricted as follows¹:

- if ω_i = 1, then exactly one child of n is labeled with a_i;
- if ω_i = ?, then at most one child of n is labeled with a_i;
- if ω_i = +, then at least one child of n is labeled with a_i;
- if ω_i = ∗, then no restriction is imposed on the children of n labeled with a_i.

Given a tree type, we call collection of elements a_i a label a such that there is an occurrence of either a_i^∗ or a_i^+ in µ(a), for some a_i ∈ Σ. Moreover, a_i is called a member of the collection a.

Example. Consider the DTD S_G from Section III. S_G corresponds to the tree type ⟨Σ, r, µ⟩ such that r = hospital and µ can be specified as follows:

µ(hospital) = patient^+ treatment^+
µ(patient) = SSN^1 name^1 cure^* bill^*
µ(treatment) = trid^1 procedure^?

In Fig. 7.3 we show a graphical representation of S_G. Note that patient and treatment are both members of the collection hospital.

¹ One could also consider allowing a fixed number of children labeled a_i. To simplify, this will be ignored here.
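The satisfaction check T ⊨ S can be sketched as follows (a hypothetical Python model: µ is a dict from a label to its children's multiplicities, drawn from `"1"`, `"?"`, `"+"`, `"*"`; trees are `(label, value, children)` tuples):

```python
# Hypothetical sketch of checking T |= S for a tree type S = (Σ, r, µ).
from collections import Counter

def satisfies(tree, root_label, mu):
    if tree[0] != root_label:                  # condition (i): root labeled r
        return False
    return _check(tree, mu)

def _check(tree, mu):
    label, _val, kids = tree
    allowed = mu.get(label, {})
    counts = Counter(child[0] for child in kids)
    if any(l not in allowed for l in counts):  # only allowed child labels
        return False
    for child_label, w in allowed.items():     # condition (ii): multiplicities
        n = counts.get(child_label, 0)
        if (w == "1" and n != 1) or (w == "?" and n > 1) or (w == "+" and n < 1):
            return False                       # "*" imposes no restriction
    return all(_check(child, mu) for child in kids)
```

With the hospital tree type of the example, a tree lacking any patient child of the root violates the `patient^+` atom and is rejected.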

7.3 Constraints and Schema Language

We next recall and adapt to our setting the definition of XML constraints from [42, 22], and introduce our schema language. Let S be a tree type over an alphabet Σ.

Unary keys (UK) are assertions of the form a.k → a, where a ∈ Σ and k^1 ∈ µ(a). We then say that k is a key for a. The semantics of keys is the following. Given a tree T satisfying S, T ⊨ a.k → a if and only if:

- each node labeled a has a single child labeled k, and this child is a datanode;
- for two distinct nodes labeled a, their respective children labeled k have distinct data values.

Example. Consider the tree type S_G in Fig. 7.3. In order to constrain every data tree satisfying S_G to be such that there do not exist two distinct nodes labeled patient having the same SSN, we use the following UK:

patient.SSN → patient

Note that the above UK is satisfied by the data tree in Fig. 7.2(a), whereas it is not satisfied by the data tree in Fig. 7.2(b).

Foreign keys (FK) are assertions of the form a.h → b.k, where k is a key for b, a ∈ Σ and h^ω ∈ µ(a) for some ω ∈ {1, ?, +, ∗}. The semantics of foreign keys is the following. Let T be a tree satisfying S. Then T ⊨ a.h → b.k if and only if for every datanode m labeled h that is a child of a node labeled a, there exists a node labeled b having a single child m′ labeled k with the same data value as m.

A FK a.h → b.k may be seen as introducing in nodes labeled a a reference to some nodes labeled b. Now, by definition, nodes labeled b may occur anywhere in the document. Even if it is possible to design documents in that manner, it seems very natural to group all b's in a single place of the document (as often done in practice). This motivates the following definition of uniquely localizable foreign key. The general case of arbitrary foreign keys is more complicated and left for future research.
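The semantics of a unary key a.k → a can be sketched as follows (a hypothetical Python model over `(label, value, children)` tuples, `None` standing for ⊥):

```python
# Hypothetical sketch of checking a unary key a.k -> a over a data tree:
# every a-node needs exactly one k-child carrying a data value, and those
# values must be pairwise distinct across all a-nodes.

def satisfies_key(tree, a, k):
    seen = set()

    def walk(node):
        label, _val, kids = node
        if label == a:
            key_kids = [c for c in kids if c[0] == k]
            if len(key_kids) != 1 or key_kids[0][1] is None:
                return False        # missing, duplicated, or valueless key child
            v = key_kids[0][1]
            if v in seen:
                return False        # two distinct a-nodes share a key value
            seen.add(v)
        return all(walk(c) for c in kids)

    return walk(tree)
```

This mirrors the example: two patient nodes with the same SSN value make the check fail.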
For a tree type S, we call uniquely localizable foreign key (ULFK, for short) a foreign key a.h → b.k such that there exists a unique path r, l₁, ..., l_s, b verifying: (i) for each document of tree type S and for each node labeled b in this document, the path constituted by the labels from the root to that node is r, l₁, ..., l_s, b, and (ii) no l_i on this path is a member of a collection, for i ∈ {1, ..., s}. It is easy to see that, as a consequence, in each document satisfying S, the elements labeled b are the children of a unique node.

[Figure 7.4: Another example of a tree type]

Example. Consider the tree type shown in Fig. 7.4, and the following foreign keys:

patient.cure → treatment.trid
patient.SSN → admitted.admid

where trid and admid are resp. keys for elements labeled treatment and admitted. The first assertion specifies the constraint of Section III, i.e., it specifies that whenever a cure has been prescribed to a patient, its identifier must appear among the identifiers of the treatments offered by the hospital. The second assertion specifies that whenever a patient's SSN appears among the patients of the hospital, then the patient was admitted to some hospital ward. Note that the first foreign key is a ULFK, whereas the second is not, since ward is a member of the collection hospital, i.e., ward^∗ ∈ µ(hospital), and ward is on the label path from the root to the elements labeled admitted, which are referenced by the foreign key.

Finally, let S_G be a tree type, Φ_K a set of keys and Φ_FK a set of foreign keys. A schema is a triple G = ⟨S_G, Φ_K, Φ_FK⟩. Moreover, we say that a tree T strongly satisfies the schema G if and only if T ⊨ S_G, T ⊨ Φ_K and T ⊨ Φ_FK. On the other hand, we say that T weakly satisfies the schema G = ⟨S_G, Φ_K, Φ_FK⟩, written T ⊨_w G, if and only if there exists T′ such that T ⊑ T′ and T′ satisfies G. Intuitively, this means that T may be incomplete w.r.t. G, but not inconsistent. Clearly, if T satisfies G, then T weakly satisfies G, whereas the converse does not hold. From now on, when we talk about satisfaction, we mean strong satisfaction, unless specified differently.

Example. Let us come back to the example illustrated in Section III. The hospital data that we want to represent satisfies a schema G = ⟨S_G, Φ_K, Φ_FK⟩, where S_G is the tree type in Fig. 7.3, and Φ_K and Φ_FK are resp.
the following sets of key constraints and foreign key constraints:

Φ_K : {patient.SSN → patient; treatment.trid → treatment}
Φ_FK : {patient.cure → treatment.trid}

Clearly, the tree T₁ in Fig. 7.2(a) satisfies the schema G = ⟨S_G, Φ_K, Φ_FK⟩, whereas the tree T₂ in Fig. 7.2(b) does not, since (i) the first key constraint is violated (the data tree contains two patients with the same SSN), and (ii) the foreign key is violated

(the cure with id 25 does not appear among the treatment ids). Finally, the tree T₃ of Fig. 7.2(c) is an example of a tree that weakly satisfies G but does not satisfy G, since the patient node corresponding to the patient named Rossi does not have any child datanode SSN. Indeed, T₃ is subsumed by T₁, which satisfies G.

7.4 Prefix Queries

We now introduce the prefix query language that we use throughout this work. It is an extension of the prefix-selection queries presented in [6]. Intuitively, prefix queries (p-queries for short) browse the input tree starting from the root and going down to a certain depth, traversing nodes with specified labels and data values satisfying specified conditions. Whereas boolean p-queries check for the existence of a certain tree pattern in T, general p-queries return a tree that is equivalent to a prefix projection of the nodes selected by the query. We are now able to formally define p-queries.

A p-query q over an alphabet Σ is a quadruple ⟨t_q, λ_q, cond_q, ret_q⟩ where:

- t_q is a rooted tree;
- λ_q associates to each node a label in Σ, where sibling nodes have distinct labels;
- cond_q is a total function that associates to each node of t_q a boolean formula, called condition, having either the form ⊤, which evaluates to true for all possible values in Γ, or the form p₀ b₀ p₁ b₁ ... p_{m−1} b_{m−1} p_m, where the p_i are predicates and the b_j are boolean operators such that (i) each p_i can be applied to values in Γ (for instance, if Γ = Q, p_i can have the form "op v", where op ∈ {=, ≠, ≤, ≥, <, >} and v ∈ Q), and (ii) each p_i returns false when applied to ⊥;
- ret_q (for "returned by q") is a total function that assigns to each node n_q in t_q a boolean value such that: (i) ret_q(n_q) = true if n_q is the root of t_q, and (ii) if ret_q(n_q) = false, then ret_q(n′_q) = false for every child n′_q of n_q.

By analogy with data trees, we denote by d(q) the depth of t_q. Let q = ⟨t_q, λ_q, cond_q, ret_q⟩ be a p-query.
If there is at least one node n_q ∈ t_q such that ret_q(n_q) = false and whose parent p_q is such that ret_q(p_q) = true, then we say that q contains an existential subtree pattern rooted at n_q. Moreover, we say that q is a boolean p-query if ret_q(n_q) = true only for the root of t_q.

We next formalize the notion of answer to a p-query, using the auxiliary concept of valuation. Given a p-query q = ⟨t_q, λ_q, cond_q, ret_q⟩ and a data tree T = ⟨t, λ, ν⟩, a valuation γ from q to T is a total function from the nodes of t_q to the nodes of T, preserving the parent-child relationship and the labeling, and such that for each n_q ∈ t_q, ν(γ(n_q)) satisfies cond_q(n_q). Observe that γ(q) is a prefix of t. We call image of q posed over T, denoted Image(q, T), the tree ⟨t_i, λ_i, ν_i⟩ such that:

- t_i consists of all the nodes of T that are in γ(q) for some valuation γ from q to T;
- λ_i and ν_i are resp. the restrictions of λ and ν to the nodes of t_i.

Similarly, we call answer the tree q(T) = ⟨t_A, λ_A, ν_A⟩ such that:

- for each n ∈ t_A, there exists a valuation γ such that γ(n₀) = n for some n₀ ∈ t_q with ret_q(n₀) = true;
- λ_A and ν_A are resp. the restrictions of λ and ν to the nodes of t_A.

Clearly, by construction, Image(q, T) and q(T) are both prefixes of T, and q(T) is a prefix of Image(q, T). Intuitively, Image(q, T) represents the prefix of T whose nodes are selected by q, whereas q(T) is the prefix of Image(q, T) whose nodes are returned by q. Thus, by construction and by the prefix lemma above, we have the following.

Lemma. Given a p-query q and a data tree T over Σ, q(T) is unique (up to tree equivalence). Moreover, q(T) ⊑ Image(q, T) ⊑ T.

Observe the following. Let q be a boolean p-query. Then either there exists no valuation from q to T, and therefore Image(q, T) is the empty tree T_∅, or Image(q, T) = ⟨t_r, λ_r, ν_r⟩, where t_r is a tree containing only the root r, having the same label and data value as the root of T. Suppose that the first case occurs. Then q(T) = T_∅, and the answer to q over T is the empty tree; this means that T does not satisfy q. Suppose now that Image(q, T) = ⟨t_r, λ_r, ν_r⟩. Then q(T) = ⟨t_r, λ_r, ν_r⟩, which means that T satisfies q. The tree containing only the root is therefore equivalent to true, whereas the empty tree T_∅ is equivalent to false. Note that this is in the same spirit as the relational model, where a boolean query returns the empty set (∅) when it evaluates to false, and the set containing the empty tuple ({()}) when it evaluates to true.

Example. In Fig. 7.5 we show several p-queries. We graphically represent an existential subtree pattern in a query by underlining the label of its root. Moreover, only conditions different from ⊤ are represented. In particular, Fig. 7.5(a) shows a boolean query asking whether there are patients that were admitted to the ward Geriatric. Posed over the data tree in Fig. 7.1(a), this query returns true. Consider now the queries in Fig. 7.5(b) and 7.5(d).
They select, respectively, (i) the name and the SSN of patients having an SSN smaller than a given constant, and (ii) the SSN of patients that were prescribed at least one dangerous cure (i.e., a cure with id lower than 35), together with the bills they paid. The answers to these last two queries, when posed over the tree of Fig. 7.2(a), are given resp. in Fig. 7.5(c) and 7.5(e).

Clearly, by the definition of p-queries, we have the following:

Proposition. P-queries are monotone, i.e., for every two data trees T, T′, if T ⊑ T′ then q(T) ⊑ q(T′).
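The existence of a valuation, which is all that matters for a boolean p-query, can be sketched as follows (a hypothetical Python model: query nodes are `(label, condition, children)` with conditions as predicates that return false on `None`, i.e., on ⊥; data nodes are `(label, value, children)`):

```python
# Hypothetical sketch of testing whether some valuation from a p-query
# into a data tree exists (the boolean p-query case).

def has_valuation(qnode, tnode):
    qlabel, cond, qkids = qnode
    tlabel, tval, tkids = tnode
    if qlabel != tlabel or not cond(tval):
        return False
    # Every query child must be matched by some child of the data node;
    # sibling query nodes have distinct labels, so matches are independent.
    return all(any(has_valuation(qc, tc) for tc in tkids) for qc in qkids)
```

For example, the query "is there a patient with SSN below 100?" succeeds on a tree containing such a patient, even if other patients do not qualify.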

[Figure 7.5: Querying a data tree — (a) Boolean p-query, (b) P-query q₁, (c) Answer to q₁, (d) P-query q₂, (e) Answer to q₂]


Chapter 8

XML-based DIS

In this chapter, we first define XML DISs according to the logical framework presented in Section 1.1. Then, we introduce the notion of identification function, and provide one such function that we will use in Chapter 9. Finally, we conclude the chapter by studying the XML DIS consistency problem.

8.1 XML DIS logical framework

An XML DIS Π can be characterized by a triple ⟨G, S, M⟩, where:

- The XML global schema G = ⟨S_G, Φ_K, Φ_FK⟩ is expressed in terms of a non-recursive tree type S_G = ⟨Σ_G, r_G, µ_G⟩, a set Φ_K of key constraints and a set Φ_FK of uniquely localizable foreign keys. We assume that at most one key constraint is expressed for each element (i.e., the keys in Φ_K are primary keys [42]).
- S is a set of source schemas S = {S₁, S₂, ..., S_m}, where S_i is a tree type, for every i in {1, ..., m}¹.
- M is the set of LAV mappings between S and G, one for each data source S_i in S. Each mapping is an expression of the form M_i = (S_i, q_i, as_i), for i = 1, ..., m, where as_i ∈ {sound, exact} and q_i is a p-query.

Given a set of data sources D = {D₁, ..., D_m} conforming to S = {S₁, ..., S_m} (i.e., D_i ⊨ S_i, for i = 1, ..., m), the semantics of the data integration system consists of all the legal data trees that conform to the schema G and satisfy the mappings M w.r.t. D. More precisely, we have the following:

sem(Π, D) = {T | T ⊨ S_G, T ⊨ Φ_K, T ⊨ Φ_FK, and for i = 1, ..., m:
             D_i ⊑ q_i(T) if as_i = sound,
             D_i ≡ q_i(T) if as_i = exact, where M_i = (S_i, q_i, as_i)}

¹ Note that dealing with such kinds of sources is not restrictive, since we can assume that suitable wrappers are available that present the sources in these formats.
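The mapping side of the semantics above can be sketched as follows. This is a hypothetical Python model, not from the thesis: each answer q_i(T) is assumed to be precomputed and given as a tree, schema satisfaction is checked separately, and trees are `(label, value, children)` tuples with `None` for ⊥.

```python
# Hypothetical sketch of the mapping conditions in sem(Π, D): a candidate
# global tree T is compatible with mapping M_i = (S_i, q_i, as_i) iff
# D_i ⊑ q_i(T) when as_i = sound, and D_i ≡ q_i(T) when as_i = exact.

def subsumed(t1, t2):
    (l1, v1, k1), (l2, v2, k2) = t1, t2
    return (l1 == l2 and (v1 is None or v1 == v2)
            and all(any(subsumed(a, b) for b in k2) for a in k1))

def mappings_hold(sources, answers, assumptions):
    """sources[i] = D_i, answers[i] = q_i(T), assumptions[i] in {"sound","exact"}."""
    for d, qt, kind in zip(sources, answers, assumptions):
        if kind == "sound" and not subsumed(d, qt):
            return False
        if kind == "exact" and not (subsumed(d, qt) and subsumed(qt, d)):
            return False
    return True
```

Under a sound mapping, the source may be a strict "part" of the answer; under an exact mapping, the same situation makes T illegal, as in the Dong example below the figure.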

[Figure 8.1: Legal data tree for Π (Example 8.1.1)]

Example 8.1.1. Consider the following data integration system Π = ⟨G, S, M⟩. The global schema G = ⟨S_G, Φ_K, Φ_FK⟩ is that of Example 7.3.3, whereas the set of source schemas is S = {S₁, S₂}, where S₁, S₂ correspond to the DTDs of Section III. Finally, the set of mappings is M = {M₁, M₂}, where M₁ = (S₁, q₁, sound), M₂ = (S₂, q₂, exact), and q₁, q₂ are resp. the queries of Fig. 7.5(b) and Fig. 7.5(d). Given a source D₁ conforming to S₁, the first mapping says that D₁ contains some of the patients of the hospital having a social security number lower than a given constant. On the other hand, given a source D₂ conforming to S₂, the second mapping says that D₂ contains exactly the patients that paid a bill and were prescribed at least one dangerous cure, together with all the bills they paid. One can easily verify that the tree T shown in Fig. 8.1 is a legal data tree for Π. On the contrary, consider the data integration system Π′ that is obtained from Π by replacing M₁ with M₁′ = (S₁, q₁, exact). Then T is not a legal data tree for Π′, since the patient named Dong does not belong to the data source D₁, whereas it would belong to the answer q₁(T).

Note that, according to the definition of the semantics of an XML data integration system, it may happen that no legal data tree exists, i.e., sem(Π, D) = ∅. Coherently with the DIS terminology introduced in Section 1.2, we then say that the system is inconsistent w.r.t. D. Checking whether a data integration system is consistent w.r.t. a set of data sources will be the topic of Section 8.3.

The main task of a DIS is to answer queries. Following the classical approach, we say that a data tree T is a certain answer to a p-query q posed over a data integration system Π = ⟨G, S, M⟩ w.r.t.
a set of data sources D, written T ∈ q(Π, D), if and only if:

T ⊑ q(T′) for every T′ ∈ sem(Π, D)

Thus, the main problem we study here is the recognition problem introduced in Section 1.3, which in the XML setting can be formulated as follows:

PROBLEM:  QUERY ANSWERING (RECOGNITION)
INPUT:    consistent data integration system Π = ⟨G, S, M⟩, set of data sources D
          conforming to S, p-query q, and data tree T
QUESTION: is T in q(Π, D)?

From the definition of intersection of data trees, it follows that, given a consistent data integration system Π = ⟨G, S, M⟩ and a set of data sources D, T is a certain answer to q over Π w.r.t. D if and only if:

T ⊑ ∩_{T′ ∈ sem(Π, D)} q(T′)

Thus, if we were able to evaluate q over all legal data trees, then query answering would be solved by computing the intersection of the answers and then checking whether T is subsumed by the result. Nevertheless, in general, the set of legal data trees is infinite, which prevents using such an approach. We will work on a finite representation rep(Π, D) of the (possibly infinite) set of all legal data trees, following an approach that is typical of settings with incomplete information [58] (cf. Section 1.5). However, to this aim, as we will see in the next chapter, we need persistent node identifiers that are shared among data sources. This leads us to the next section, i.e., to the definition of an identification function.

8.2 Identification

An identification function aims at identifying nodes from autonomous data sources that represent the same real-world entity. In particular, it is responsible for associating to each node coming from the data sources an identifier that is based on the constraints expressed over the global schema, and more precisely on key constraints. As already mentioned, the process of identifying entities that come from different sources and correspond to the same real-world entity is sometimes called Entity Resolution [19]. In practice, Entity Resolution is typically based on machine learning techniques. Here, we abstract this part of the problem away by assuming an Entity Resolution module that has already introduced key information in the data. Thus, two entities represent the same real-world entity if and only if they are characterized by the same key.
Under the above assumption, and based on key constraints, an identification function assigns to each data source node a semantic identifier that allows one to identify nodes that come from different sources and correspond to the same node in every legal data tree. Before presenting our identification function, let us formally introduce how to define and characterize such a function.

Let Π = ⟨G, S, M⟩ be a data integration system. An identification function is any function F that assigns a global identifier in a domain N_F to every node of a set of data sources conforming to S. Let DC_Π be the set of possible sets of data sources conforming to S and such that Π is consistent:

DC_Π = {D | D = {D₁, ..., D_m}, where D_i conforms to S_i, for i = 1, ..., m, and sem(Π, D) ≠ ∅}

We say that F is sound w.r.t. Π if and only if for each D ∈ DC_Π and for each two nodes n₁, n₂ in D: if F(n₁) = F(n₂), then for each T ∈ sem(Π, D) and for each homomorphism h from D to T, we have h(n₁) = h(n₂).

132 120 CHAPTER 8. XML-BASED DIS On the other hand, we say that F is complete w.r.t. Π, if and only if for each D DC, we have that for each two nodes n 1, n 2 D, each T sem(π, D), and each homomorphism h from D to T : if h(n 1 ) = h(n 2 ), then F(n 1 ) = F(n 2 ). Intuitively, the above definitions mean that a sound identification F possibly gives a sufficient condition for identifying two nodes in every legal data tree, whereas a complete identification provides a necessary condition. Obviously, in order to reduce our setting to the setting where nodes ids are available that are shared among data sources, we need an identification function that is both sound and complete. But in practice, this may be asking for too much and we will often have to live with simply sound. We next propose an identification function, named Id G, and show that it is sound under the assumption of both exact and sound mappings. Moreover, we show that it is also complete under the assumption of sound mappings, whereas it is not under the assumption of exact mappings. Nevertheless, we show that it is possible to introduce a restriction on the schema, under which Id G is also complete under the assumption of sound mappings. Let Π = G, S, M be a the data integration system with global schema G = S G, Φ K, Φ FK, where S G = Σ G, r G, µ G. In the following we start by recursively defining the domain N G of global identifiers that are assigned by Id G : ǫ is a global identifier in N G ; for n N G, a i Σ G (remember that Σ G is the set of element labels), then n.a i is a global identifier in N G ; for n N G, a i Σ G, and γ i is a data value in Γ S = Γ V S (remember that Γ is the domain of data values different from ), then n.a i [.γ i ] is a global identifier in N G. Clearly, global ids as defined above recall a subset of XPath expressions. To simplify, we assume first that the data sourecs are consistent (so the identification will succeed). We will come back to inconsistency in Section 8.3. 
Let D = {D₁, ..., D_m} be a set of data sources conforming to S = {S₁, ..., S_m}, such that Π is consistent w.r.t. D. In Fig. 8.2, we show how to build an identification function Id_G that assigns to each node n in t_i a global id in N_G, where D_i = ⟨t_i, λ_i, ν_i⟩, for every i ∈ {1, ..., m}. Intuitively, we first define Id_G so that it assigns global ids to all nodes of each data source, independently, based on the schema G. Note, in particular, that we introduce a fresh Skolem constant to identify members of collections that are not characterized by any key constraint. The motivation is that two nodes belonging to the same collection are in general to be considered distinct, unless explicitly specified otherwise by means of a key constraint. Then, in a second phase, we possibly modify Id_G so that global ids that differ only in the presence of Skolems are unified, if they are assigned to nodes that, according to the key constraints, represent the same node in each legal data tree. To this aim, for each depth k, starting from the bottom, and for each key constraint a.k → a, we collect into a set N(a, v) all global

INPUT:  set D = {D₁, ..., D_m} of data trees D_i = ⟨t_i, λ_i, ν_i⟩, i = 1, ..., m,
        such that Π is consistent w.r.t. D
OUTPUT: identification function Id_G from the nodes of D to N_G

[1] for i := 1 to m do
      for every node n in t_i do
        if n is the root of t_i then
          Id_G(n) := ε
        else if n, labeled a_j, is a child of p, labeled a, in t_i then
          if a_j^{ω_j} ∈ µ(a), where ω_j ∈ {1, ?} then
            Id_G(n) := Id_G(p).a_j
          else if there exists a_j.k → a_j ∈ Φ_K then
            if n has a child m labeled k then
              Id_G(n) := Id_G(p).a_j.ν(m)
            else
              Id_G(n) := Id_G(p).a_j.v_s, where v_s is a fresh Skolem constant
          else
            Id_G(n) := Id_G(p).a_j.v_s, where v_s is a fresh Skolem constant

[2] for k := d(D) down to 1 do
      for each a.k → a ∈ Φ_K do
        for each node n of D at depth k such that Id_G(n) = X.a.v, with v ∈ Γ, do
          if N(a, v) is not yet defined then N(a, v) := {Id_G(n)}
          else N(a, v) := N(a, v) ∪ {Id_G(n)}
        Unify(Id_G, N(a, v))

Figure 8.2: Definition of Id_G(D)

identifiers of nodes at depth k that are labeled a and are characterized by a key value v. Then, we use an algorithm Unify, which computes the most general unifier [70] for the Skolems occurring in the global ids in N(a, v), denoted mgu(N(a, v)). Such a unifier is then applied to the global ids of all nodes at depth equal to or lower than k. Note that this unification process is particularly efficient since, by construction, all Skolems occurring in a pair of node ids are distinct.

Now that we have shown how to build Id_G, we can apply it to the set of sources D, thus obtaining a set of data sources Id_G(D) with global identifiers in N_G (observe that the construction always succeeds because we assume the data sources are consistent, i.e., that there exists a legal data tree). We then say that a node n is uniquely identified by Id_G if and only if its identifier Id_G(n) does not contain any Skolem constant.

We now illustrate the identification function Id_G by an example.

Example. Let Π = ⟨G, S, M⟩ be the data integration system discussed in Example 8.1.1. Moreover, let D = {D₁, D₂} be the set of sources shown in Fig.
7.5(c) and 7.5(e), respectively. Π is consistent w.r.t. D, since one can easily verify that one legal data tree for Π is the tree shown in Fig. 8.1. See Fig. 8.3(a) and 8.3(b) for a graphical representation of Id_G(D₁) and Id_G(D₂), where node ids are marked in bold. Let us discuss in particular how we build Id_G according to Fig. 8.2. First, we define Id_G for the nodes belonging to D₁. Therefore, we set Id_G(r₁) = ε, where r₁

[Figure 8.3: Identifying data sources — (a) Id_G(D₁), (b) Id_G(D₂)]

is the root of D₁. Second, since the key constraint patient.SSN → patient belongs to Φ_K, and since r₁ has a child node n₁ having a child labeled SSN with data value 55577, we set Id_G(n₁) = ε.patient.55577. Third, let us consider the child m₁ of n₁ labeled SSN. Since no key constraint is defined for elements SSN, and since, according to the tree type S_G, elements labeled patient have a unique child labeled SSN, we have Id_G(m₁) = Id_G(n₁).SSN. The definition of Id_G(n) for the other nodes n belonging to D₁ is very similar to the cases discussed above. The same holds for the nodes of D₂, except for the node n₂ of D₂ labeled bill, child of a node p₂. In this case, no key constraint is defined for elements bill, and according to the tree type S_G, elements labeled patient may have an unrestricted number of children labeled bill. Therefore, we set Id_G(n₂) = Id_G(p₂).bill.γ₁, where γ₁ is a Skolem constant in V_S. Now that all nodes in D₁, D₂ have been assigned a global id, we check, starting from depth k = 3, whether some nodes at depth k have ids containing Skolems that should be unified according to a key constraint in Φ_K. Since the only node that is not uniquely identified is n₂, and since no key constraint is defined for bill, no unification is performed and the construction of Id_G terminates.

Theorem. Let Π = ⟨G, S, M⟩ be an XML DIS and D a set of sources conforming to S, such that Π is consistent w.r.t. D. Then the construction of Id_G terminates in time O(|D|).

Proof. One can easily verify that the first phase of the construction of Id_G terminates after having considered the nodes of D one by one. Thus, this first phase costs c₁|D|.
Let us now consider the second phase of the construction of Id_G. For each k, all the nodes at depth k whose global ids have the form X.a.v, with v ∈ Γ and a such that a.k_a → a is in Φ_K, are partitioned into disjoint sets N(a, v). According to the semantics, and since Π is consistent, every node whose global id belongs to N(a, v) is mapped to the same node of each legal data tree, with label a and key value v. As a consequence, the parents of the nodes having global ids in N(a, v) are also mapped to the same node, and so on up to the root. Thus, each node in N(a, v) has the same sequence of labels for its ancestors, and the same sequence of key values possibly characterizing each of them. Thus, the set N(a, v) of global

ids can always be unified, i.e. the computation of the most general unifier always terminates successfully. Moreover, since trees have finite depth, the second phase of the construction of Id_G also terminates. In particular, since the cost of computing the most general unifier of a set of global ids is linear in the size of the set [72], we show by induction on the depth k that the cost of the last phase of the definition of Id_G is:

c_2·|S_k| + c_3·|N_1| + ... + c_3·|N_l| + c_4·|T_k| + c·|T_{k-1}|

where S_k is the set of nodes at depth k, c_2·|S_k| is the cost of constructing all the sets N_i of global ids to be unified at depth k, for i = 1, ..., l, c_3·|N_i| is the cost of computing the mgu of N_i, c_4·|T_k| is the cost of applying the mgu of each set N_i to the global ids of the nodes at depth k, and c·|T_{k-1}| is, by inductive hypothesis, the cost of applying this second phase to a tree of depth k-1.

Theorem 8.2.3. For each data integration system Π = ⟨G, S, M⟩, Id_G is a sound identification function.

Proof. Let D be a set of data sources, and Π a data integration system consistent w.r.t. D. Also, let T be a legal data tree, and n_1, n_2 two nodes of D. Suppose that Id_G(n_1) = Id_G(n_2). It is easy to see that, by construction, n_1 and n_2 must have the same depth. Thus, we only need to prove that if n_1, n_2 have the same depth and Id_G(n_1) = Id_G(n_2), then for each homomorphism h from D to T, h(n_1) = h(n_2). We show this by induction on the depth of n_1, n_2.

Base step: trivial, since n_1 and n_2 are both roots and, as such, given that the system is consistent, they are both mapped to the root of T by every homomorphism from D to T.

Inductive step: suppose now that n_1, n_2 have depth k, and Id_G(n_1) = Id_G(n_2) = X.Y, where X is the id of the parents p_1, p_2 of n_1, n_2. Since p_1, p_2 have depth k-1, by inductive hypothesis, for each homomorphism h from D to T, p_1 and p_2 are mapped to the same node p of T.
Thus p_1 and p_2 have the same label, say a. Then, by definition of homomorphism, n_1 and n_2 are both mapped to a child of p. Now, let us consider the possible forms of Y.

If Y = b, then by construction n_1 and n_2 both have label b, with b^1 ∈ µ_G(a). Then clearly n_1 and n_2 are both mapped to the same child of p in T, since there cannot be two distinct children labeled b of p.

If Y = b.v, with v ∈ Γ ∪ V_S, then, again, n_1 and n_2 both have label b. Moreover:

either b.k_b → b ∈ Φ_K, and n_1, n_2 both have a child labeled k_b with the same data value v ∈ Γ. In this case, n_1 and n_2 are both mapped to the same node (child of p) in T, since there cannot be two distinct nodes labeled b having the same key value in T;

or Id_G(n_1) and Id_G(n_2) have been unified during the second phase of the definition of Id_G. Note that, by construction, this can happen only if either: b is a member of a collection such that b.k_b → b ∉ Φ_K;

or b.k_b → b ∈ Φ_K but at least one among n_1, n_2 has no child labeled k_b (or it has one with an empty value). However, since they were unified during the second phase of the definition of Id_G, at least one among n_1, n_2, say n_1, is such that at the end of the first phase we had set Id_G(n_1) = X.b.v_s, where v_s is a fresh Skolem constant. Moreover, n_1, n_2 must have two descendants n′_1, n′_2 with the same label b′, such that (i) b′.k_{b′} → b′ ∈ Φ_K, and (ii) n′_1, n′_2 each have one child labeled k_{b′} with the same value. Thus, since by hypothesis T is a legal data tree, n′_1 and n′_2 are both mapped to the same node in T. Then, coherently, all the ancestors of n′_1, n′_2 at the same depth are mapped to the same node in T.

Theorem 8.2.4. For each data integration system Π = ⟨G, S, M⟩ such that the mappings in M are all sound, Id_G is a complete identification function.

Proof. Let Π be a data integration system such that the mappings in M are all sound, D a set of data sources such that Π is consistent w.r.t. D, and n_1, n_2 two nodes in D. In order to prove that Id_G is complete, we show that if Id_G(n_1) ≠ Id_G(n_2) then, for each T ∈ sem(Π, D), there exists at least one homomorphism h from D to T such that h(n_1) ≠ h(n_2). Clearly, if n_1, n_2 are both roots, then Id_G(n_1) = Id_G(n_2). Thus, we assume that Id_G(n_1) ≠ Id_G(n_2), where at least one among n_1, n_2 is not a root. Then, it is easy to see that, by construction, it is always possible to find X, L_1, L_2 such that Id_G(n_1) = Id_G(p_1).L_1 and Id_G(n_2) = Id_G(p_2).L_2, where p_1, p_2 are the deepest ancestors of n_1, n_2 such that Id_G(p_1) = Id_G(p_2) (note that such a pair of nodes always exists, since we can always take the pair of roots as ancestors), and L_1 ≠ L_2. Thus, since we proved that Id_G is sound, we have that for each T and for each h from D to T, p_1 and p_2 are mapped to the same node in T.
Now consider the following possible forms for L_1, L_2:

either exactly one among L_1, L_2 is empty;

or L_1 = b_1.R_1 and L_2 = b_2.R_2, where b_1, b_2 are distinct labels;

or L_1 = b.v_1.R_1 and L_2 = b.v_2.R_2, where v_1, v_2 are distinct values in Γ ∪ V_S.

It is easy to see that in the first case, for each legal data tree T and for each h from D to T, h(n_1) ≠ h(n_2). Indeed, if for instance L_1 is empty, we have that n_1 = p_1. Therefore, for each T and each h from D to T, n_1 and n_2 are mapped respectively to a node and to a proper descendant of it. To show that the claim holds also in the last two cases, we next show that in both cases n_1 and n_2 have ancestors p′_1, p′_2, children of p_1, p_2 respectively, such that for each legal data tree T and for each h from D to T, h(p′_1) and h(p′_2) are two distinct children of h(p_1) = h(p_2). This clearly implies that h(n_1) ≠ h(n_2), for each T and h. Therefore, suppose first that L_1 = b_1.R_1 and L_2 = b_2.R_2, where b_1 ≠ b_2. Then, p′_1 and p′_2 are labeled b_1 and b_2, respectively. Then, for each legal data tree T and for each h from D to T, h(p′_1) and h(p′_2) are two distinct children of the same node,

each having a distinct label. Finally, suppose that L_1 = b.v_1.R_1 and L_2 = b.v_2.R_2, where v_1 ≠ v_2. Then p′_1, p′_2 are labeled b and are such that:

Either v_1, v_2 ∈ Γ. By construction, this means that b.k_b → b ∈ Φ_K, and p′_1, p′_2 have one child labeled k_b with value v_1, v_2 respectively. Then, for each T and for each h from D to T, p′_1 and p′_2 are mapped to distinct nodes in T, having key values v_1 and v_2.

Or at least one among v_1, v_2 is a Skolem constant, say v_1. Moreover, by construction, p′_1 belongs to a collection, and: either b.k_b → b ∉ Φ_K, or b.k_b → b ∈ Φ_K and p′_1 has no child labeled k_b; moreover, Id_G(p′_1) and Id_G(p′_2) were not unified. Thus, there exists a tree T satisfying G and a homomorphism h from D to T such that h maps p′_1 and p′_2 to two distinct nodes of T. Clearly, the mappings are also satisfied, since by hypothesis they are sound, and T is a legal data tree, which concludes the proof.

Proposition. There exists an XML DIS Π = ⟨G, S, M⟩ such that Id_G is not a complete identification function for it.

Proof. To show this, we use the same example that was given at the end of Section III, when discussing the impact of having exact mappings. Consider the data integration system having (i) the global schema G = ⟨S_G, ∅, ∅⟩, such that S_G is the tree type shown in Fig. 7.3, (ii) the mappings M = {(q_1, S_1, exact), (q_2, S_2, sound)}, where q_1 and q_2 are those shown in Fig. 7.5(b) and 7.5(d) respectively, and (iii) the source schema of the example illustrated in Section III. Let D = {D_1, D_2} be the data sources shown in Fig. 7.5(c) and 7.5(e) respectively. One can verify that such a data integration system is consistent w.r.t. D. Let T be a legal data tree. By the mapping specification, we have that q_1(T) ⊆ D_1 and D_2 ⊆ q_2(T).
Therefore, we are sure that T does not contain any other node labeled patient having a child SSN with value lower than . But then, since a node patient and its child SSN with value  belong to D_2, such nodes in particular exist in T. Moreover, since T ⊨ S_G, each node patient in T has a unique child SSN. Therefore, we are sure that there are no other nodes patient in T having a child SSN with that value. This implies that, for every homomorphism from D to T, the two nodes patient with that SSN value are both mapped to the same node of T. On the other hand, by applying Id_G to D_1 and D_2, we obtain the trees shown in Fig. 8.4(a) and 8.4(b), where the γ_i are (pairwise different) Skolem constants, for i = 1, 2, 3. We therefore obtain that the two nodes mentioned above are assigned two different global ids, patient.γ_1 and patient.γ_2. Thus, from the above result, given a data integration system Π and a set of data sources D consistent w.r.t. Π, Id_G does not allow us to identify all nodes in D that

Figure 8.4: Counterexample proving that Id_G is not complete (the trees Id_G(D_1) and Id_G(D_2), where the patient nodes are assigned the Skolem ids γ_1, γ_2, γ_3).

actually represent the same node for each T ∈ sem(Π, D), as soon as there exists at least one mapping in M that is exact. It does suffice, however, for particular XML DISs, based on the notion of Visible Keys Restriction (VKR) introduced next.

Definition. A system has the VKR property if:

For every element a that is a member of a collection of S_G, there exists a key constraint a.k_a → a ∈ Φ_K.

For every view M_i such that (S_i, M_i, exact) ∈ M, M_i is such that, whenever it selects an element with a key, it also selects its key.

We next show that the above restriction guarantees that Id_G is complete even without the assumption that all mappings are sound.

Theorem. For each VKR data integration system Π = ⟨G, S, M⟩, Id_G is a complete identification function.

Proof. Let us refer to the proof of Theorem 8.2.4. One can verify that the assumption of having only sound mappings is used only at the end of that proof, to show that, whenever two global ids contain a Skolem, there always exist a legal data tree T and a homomorphism from D to T mapping the two nodes to two distinct nodes of T. Now assume VKR holds. It is easy to see that such an assumption guarantees that all nodes can be uniquely identified. Therefore, the case in which two global ids are distinct because of the presence of Skolems cannot occur. Thus, by following exactly the same arguments as in the proof of Theorem 8.2.4, we prove the claim.

8.3 XML DIS consistency

As already mentioned, it may happen that an XML DIS is inconsistent (cf. Section 1.2 for the general DIS consistency problem). In this section, we therefore study the XML DIS consistency problem, as introduced in Section 1.2.

Next, we introduce and discuss the possible causes of inconsistency of an XML DIS.

First, the global schema specification may be inconsistent, i.e. there may exist no tree satisfying G. We next give an example of an inconsistent schema specification.

Example. Let G = ⟨S_G, Φ_K, Φ_FK⟩ be a schema specification such that S_G is shown in Fig. 8.5(a), whereas Φ_K and Φ_FK are as follows: Φ_K = {B.K → B, E.FK → E}, Φ_FK = {E.FK ⊆ B.K}. According to the schema, each tree T that conforms to S_G is such that each node labeled B is characterized by a unique value for its children labeled K. Moreover, each node labeled B has two children p_1, p_2 labeled C and D, respectively. Let n_1 and n_2 be the children labeled E of p_1 and p_2, respectively. Since FK is a key for E, n_1 and n_2 have two children labeled FK with distinct values. But since FK is also a foreign key referring to the key value of its ancestor labeled B, and since n_1, n_2 have a common ancestor B, clearly no tree exists satisfying S_G, Φ_K and Φ_FK simultaneously. Thus, G is inconsistent.

Indeed, the above example shows that the inconsistency of G is due to a particular interaction between keys and foreign keys. It was shown in [42] that the problem of verifying whether a schema specification given in terms of a general DTD and a set of unary keys and foreign keys is consistent is NP-complete. It is an open problem whether this problem is solvable in PTIME for our simplified DTDs.

Second, even if the global schema is consistent, a particular mapping (S, q, as), with as ∈ {sound, exact}, may be inconsistent w.r.t. a global schema G, in the sense that for every tree T that satisfies G, q(T) = ∅.

Example. Let S_G and q be respectively the tree type of a global schema specification G = ⟨S_G, Φ_K, Φ_FK⟩ and the p-query of a mapping M = (S, q, sound), shown in Fig. 8.5(b) and 8.5(c) respectively. Clearly, M is inconsistent w.r.t.
S_G, since there exists no tree T satisfying S_G such that q(T) ≠ ∅.

Lemma. Given a consistent global schema, the problem of checking whether a mapping M is inconsistent w.r.t. this schema specification is decidable in PTIME w.r.t. the size of G and M.

Proof. Let G = ⟨S_G, Φ_K, Φ_FK⟩, with S_G = ⟨Σ_G, r_G, µ_G⟩, be the global schema, and M = (S, q, as), with as ∈ {sound, exact}, a mapping in M such that q = ⟨t_q, λ_q, cond_q, ret_q⟩. Let r_q be the root of t_q. It is possible to check whether this mapping is inconsistent w.r.t. S_G by calling the function consistent(r_q, r_G), where consistent(n_q, a) is recursively defined as follows:

if λ_q(n_q) ≠ a, then return false; otherwise, for every n_i child of n_q in t_q:

if λ_q(n_i)^{ω_i} ∉ µ_G(a) for every ω_i ∈ {1, ?, +, *}, then return false; otherwise, return the conjunction, over all children n_i of n_q, of consistent(n_i, λ_q(n_i)).

Note that this check is PTIME in the size of q and S_G. Thus, the claim is proved.

Figure 8.5: Cases of inconsistencies ((a) tree type, (b) tree type, (c) p-query, (d) source schema, (e) data source).

Third, a mapping (S, q, as), with as ∈ {sound, exact}, may be inconsistent w.r.t. a data source D, because D is inconsistent w.r.t. q, i.e. there exists no data tree T such that D ⊆ q(T).

Example. Let D and S be respectively the data source and the schema shown in Fig. 8.5(e) and 8.5(d). It is easy to see that D conforms to S. Now, let q be the p-query shown in Fig. 8.5(c). Clearly, the mapping M = (S, q, sound) is inconsistent w.r.t. D.

Lemma. The problem of checking whether a query is inconsistent w.r.t. a data tree (and thus whether a mapping is inconsistent w.r.t. a data source) is decidable in PTIME.
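For concreteness, the recursive compatibility check consistent(n_q, a) used in the proof of the mapping-consistency lemma above can be sketched in Python. This is only an illustrative sketch under assumed encodings, not the thesis's formal algorithm: the p-query tree is a plain dict with hypothetical keys `label` and `children`, and `mu` abstracts µ_G by mapping each label to the set of labels allowed as children with some multiplicity.

```python
def consistent(query, mu, n_q, a):
    """Compatibility of the p-query subtree rooted at n_q with the
    element type a of the global schema: labels must agree, and every
    child label must be allowed (with some multiplicity) by mu[a]."""
    if query["label"][n_q] != a:
        return False
    children = query["children"].get(n_q, [])
    # every child label must appear (with some multiplicity) in mu[a]
    if any(query["label"][c] not in mu.get(a, set()) for c in children):
        return False
    # recurse on every child subtree
    return all(consistent(query, mu, c, query["label"][c]) for c in children)
```

Each node is visited once, so the check is linear in the size of the query, matching the PTIME claim of the lemma.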

Figure 8.6: Example ((a) query q_1, (b) query q_2, (c) source D_1, (d) source D_2).

Proof. Let q be a p-query and D a data tree. A PTIME algorithm to decide whether q is consistent w.r.t. D consists in building from q the query q′ obtained by eliminating the existential subtree patterns of q, and then checking whether q′(D) ⊇ D.

Until now, we showed that it is decidable to check whether the global schema specification or a set of mappings are inconsistent (either w.r.t. the global schema, or w.r.t. a set of data sources D). However, consider the example below.

Example. Let Π = ⟨G, S, M⟩ be the XML DIS such that G, S are those of Example 8.1.1, and M = {(S_1, q_1, sound), (S_2, q_2, sound)}, where q_1 and q_2 are shown in Fig. 8.6(a) and 8.6(b) respectively. Let D = {D_1, D_2} be the sources shown in Fig. 8.6(c) and 8.6(d) respectively. G is clearly consistent, since in Fig. 8.1 we already provided a data tree for it. Similarly, one can easily verify that the mappings are consistent w.r.t. G and D. However, no legal data tree exists for Π w.r.t. D, since D_1 and D_2 contain two patients with the same SSN but different names.

The example above shows that, even in the presence of a consistent global schema specification and a consistent set of mappings, an XML DIS may still be inconsistent. Next, we show that this can be checked in PTIME under the assumption that the system includes only sound mappings, whereas the problem becomes NP-hard if it includes an exact mapping. From now on, we assume to have a consistent global schema specification, and a set of mappings that are consistent with respect to both the global schema and the data sources.

Theorem. The problem of checking whether an XML DIS including only sound mappings is consistent w.r.t. a set of data sources is decidable. More precisely, it is PTIME in data complexity.
Proof. Let Π = ⟨G, S, M⟩ be an XML DIS with all sound mappings, such that G = ⟨S_G, Φ_K, Φ_FK⟩ and S_G = ⟨Σ_G, r_G, µ_G⟩, and let D be a set of data sources conforming to S. In order to show the claim, we show how to use Id_G to verify that Π is consistent. To this aim, suppose we define Id_G as shown in Fig. 8.2, and afterwards apply it to D. We next show that Π is inconsistent if and only if there exists a pair of nodes n_1, n_2 ∈ D for which one of the following occurs:

1. either Id_G(n_1) = Id_G(n_2) = ǫ and n_1 and n_2 have different labels;

2. or Id_G(n_1) = Id_G(n_2) and ν(n_1) ≠ ν(n_2);

3. or a.k_a → a is a key in Φ_K, and Id_G(n_1), Id_G(n_2) have the form Id_G(n_1) = X_1.a.γ and Id_G(n_2) = X_2.a.γ, where X_1 ≠ X_2 and γ ∈ Γ.

Suppose first that Π is consistent. Clearly, 1 cannot occur. Moreover, if Π is consistent, by Theorem 8.2.3 Id_G is sound. Then, let n_1, n_2 be two nodes of D. By definition of sound identification, if Id_G(n_1) = Id_G(n_2), then for each T ∈ sem(Π, D) and for each h from D to T, h(n_1) = h(n_2). Thus, 2 cannot occur. Now consider 3. Since Π is consistent and includes only sound mappings, by Theorem 8.2.4 there exist T ∈ sem(Π, D) and a homomorphism h from D to T such that h(n_1) ≠ h(n_2). But this contradicts the fact that T is legal w.r.t. G, since it implies that there exist two distinct nodes of T having the same key value γ.

Suppose now that no pair of nodes satisfies any of the three conditions above. Then, we construct T as follows. For each n ∈ D, we insert into T a node n′ having the same label, the same data value and the same id as n, and such that: if n is a root, then n′ is the root of T; otherwise, if n is a child of a node p, then n′ is a child of the node having id Id_G(p). Note that the fact that none of conditions 1, 2 and 3 holds ensures that the above construction is well-defined. Moreover, it is easy to see that T satisfies the mappings and weakly satisfies G. Thus, there exists T′ such that T is subsumed by T′ and T′ satisfies G. Thus T′ ∈ sem(Π, D), which proves that Π is consistent.

Theorem. The problem of checking whether an XML DIS is consistent w.r.t. a set of data sources is NP-hard in data complexity.

Proof. The proof is by reduction from 3-colorability. Let G = ⟨V, E⟩ be an arbitrary graph, with vertices V = {V_1, ..., V_n} and edges E = {E_1, ..., E_m}. We now show how to build an XML DIS Π = ⟨G, S, M⟩ and a set of data sources D such that Π is consistent w.r.t. D if and only if G is 3-colorable. Let us first define the global schema G = ⟨S_G, Φ_K, Φ_FK⟩ as follows: S_G is shown in Fig.
8.7(a), Φ_K is composed of the single key constraint C.Name → C, and Φ_FK = ∅. The idea is to define M and D so that every tree T satisfying M w.r.t. D is such that:

1. T contains exactly three subtrees rooted at C, each corresponding to a different color among Blue, Yellow, Green; thus, T encodes a coloring that uses these three colors;

2. for each vertex V_i ∈ V and each edge E_j ∈ E, there exists at least one subtree rooted at R with children V, E having data values V_i and E_j respectively; thus, T encodes a coloring of a supergraph of G;

3. for each vertex V_i ∈ V, all occurrences of the same vertex belong to the same subtree rooted at C; thus, we say that T is a well-defined encoding of a coloring of G, meaning that a vertex is assigned at least one color and, if it is assigned a color, then it is assigned the same color in each edge;

4. for each edge (V_{j1}, V_{j2}) ∈ E, there exists at least one color that is assigned to V_{j1} and not to V_{j2}, and conversely; then, we say that T encodes a correct coloring of G.

In order to obtain the encoding described above, we define M as the set composed of:

the mapping (q_1, S_1, exact), such that q_1 asks for all the colors available, cf. Fig. 8.7(b);

n mappings of the form (q_2^i, S_2^i, sound), such that q_2^i asks for all the edges of the graph to which the vertex V_i belongs, cf. Fig. 8.7(d);

m mappings (q_3^j, S_3^j, sound), such that q_3^j asks for all the vertices that belong to the edge E_j, cf. Fig. 8.7(f).

Finally, we build the set of data sources D composed of:

the source D_1, containing the colors Blue, Yellow, Green, cf. Fig. 8.7(c);

n sources D_2^i, each containing a unique subtree rooted at C, and one subtree rooted at R for each edge to which V_i belongs, cf. Fig. 8.7(e);

m sources D_3^j, each containing the two vertices that belong to E_j, cf. Fig. 8.7(g).

Clearly, the specification above ensures that for each coloring of G there exists a tree T satisfying M w.r.t. D that is a well-defined encoding of a correct coloring of G. Moreover, for each T satisfying M w.r.t. D, there exists a coloring of a supergraph G′ of G such that T is a well-defined encoding of a correct coloring of G′. We say that T is a proper encoding of a coloring of G if each vertex is assigned a unique color. We next show that there exists a legal data tree for Π w.r.t. D if and only if G is 3-colorable.
(⇒) Suppose that T is a legal data tree. Then, T satisfies the mappings M, which implies that T is a well-defined encoding of a correct coloring of a supergraph G′ of G. Moreover, since T is legal, T satisfies the key constraint in Φ_K, which implies that T contains exactly three distinct nodes labeled C, i.e. one for each available color. Thus, T encodes a proper coloring of G′ and, in particular, a proper coloring of G. Hence, G is 3-colorable.

(⇐) Assume that G is 3-colorable. Then, it is possible to build a tree T that is a well-defined and proper encoding of a correct coloring of G. Thus, T contains exactly three distinct nodes labeled C, each corresponding to a different color. Clearly, T satisfies the mappings w.r.t. D and the key constraint in Φ_K. Hence, Π is consistent w.r.t. D.
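To make the construction concrete, the data sources of the reduction can be generated mechanically from the graph. The following sketch uses a hypothetical (label, value, children) tuple encoding of data trees; it is only meant to illustrate the shape of D_1, the D_2^i and the D_3^j, not to reproduce the figures exactly.

```python
def build_sources(vertices, edges):
    """Sources of the 3-colorability reduction for G = (V, E).
    Every source is encoded as a (label, value, children) triple."""
    colors = ["Blue", "Yellow", "Green"]
    # D1: one C subtree per available color (used by the exact mapping q1)
    d1 = ("G", None, [("C", None, [("Name", c, [])]) for c in colors])
    # one D2^i per vertex: a unique C subtree plus one R subtree
    # per edge containing the vertex (sound mappings q2^i)
    d2 = {v: ("G", None,
              [("C", None, [])]
              + [("R", None, [("V", v, []), ("E", j, [])])
                 for j, (x, y) in enumerate(edges) if v in (x, y)])
          for v in vertices}
    # one D3^j per edge: the two vertices belonging to the edge (sound q3^j)
    d3 = {j: ("G", None,
              [("R", None, [("V", x, []), ("E", j, [])]),
               ("R", None, [("V", y, []), ("E", j, [])])])
          for j, (x, y) in enumerate(edges)}
    return d1, d2, d3
```

For a triangle graph, for instance, D_1 has three C subtrees, and each D_2^i contains one C subtree plus one R subtree per incident edge, matching the description above.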

Figure 8.7: Reduction from 3-colorability ((a) global schema G, (b) query q_1, (c) data source D_1, (d) query q_2^i, (e) data source D_2^i, (f) query q_3^j, (g) data source D_3^j).

Chapter 9

XML DIS query answering

In this chapter, we investigate query answering in an XML DIS. We start by giving a lower bound for query answering under the assumption of having exact mappings. Then, we introduce incomplete trees and discuss their use to solve query answering. Finally, we present two PTIME algorithms for query answering, the first to be applied in a restricted setting, the second more general, and we show their correctness.

9.1 Lower bound for query answering under exact mappings

We next provide a lower bound for query answering under the assumption of having exact mappings.

Theorem. The query answering problem is coNP-hard in data complexity.

Proof. The proof is by reduction from the following simple variant of 3-colorability, named Bonifaci's 3-colorability.

PROBLEM: Bonifaci's 3-colorability
INPUT: a 4-colorable graph G = ⟨V, E⟩
QUESTION: is G 3-colorable?

It is easy to see that Bonifaci's 3-colorability is NP-hard. Indeed, from [46], deciding whether a planar graph is 3-colorable is NP-hard. On the other hand, it is well known that each planar graph is 4-colorable [10]. Thus, since a planar graph is a particular case of a 4-colorable graph, deciding whether a 4-colorable graph is 3-colorable is NP-hard. Now, suppose we have a graph G that is 4-colorable. We show how to build a consistent XML DIS Π, a set of data sources D′, a data tree T and a query q such that T is a certain answer to q w.r.t. D′ if and only if G is not 3-colorable. The construction is very similar to the one presented in the proof of the NP-hardness theorem of Section 8.3. In particular, Π = ⟨G, S, M⟩ is exactly the same, whereas D′ is obtained from D by replacing the source D_1 with the source D′_1 shown in Fig. 9.1(a). It is easy to see that, given a graph G that is 4-colorable, Π is consistent w.r.t. D′. Moreover, for

Figure 9.1: Reduction from Bonifaci's 3-colorability ((a) data source D′_1, (b) data tree T, (c) query q).

each correct coloring of G using the colors Blue, Yellow, Green, Red, there exists a legal data tree that encodes it. Conversely, for each legal data tree of Π w.r.t. D′, there exists a supergraph G′ of G such that the tree encodes a correct coloring of G′. Consider now the data tree T shown in Fig. 9.1(b) and the query q shown in Fig. 9.1(c). We next show that T is a certain answer to q if and only if G is not 3-colorable.

(⇒) Suppose that G is 3-colorable. Then, there exists a legal data tree T′ for Π w.r.t. D′ such that T′ encodes a correct coloring of G using only three colors. This means that there is a color that is not assigned to any vertex. But then, if we pose q over T′, we obtain only three nodes labeled C having a child R. Hence, T is not a certain answer.

(⇐) Assume that G is not 3-colorable. Then, every correct coloring of G needs 4 colors, i.e. in every correct coloring of G each of the four available colors is assigned to at least one vertex. Thus, each legal data tree is an encoding of a correct coloring of a supergraph of G, and contains at least one subtree rooted at R under each of the four nodes labeled C corresponding to the available colors. Hence, T is a certain answer to q.

The previous result shows that, if node ids are not available, then the problem of answering p-queries in the XML DIS setting is difficult already under the assumption of having only one exact mapping, no foreign keys, and mappings that are specified by means of p-queries not including existential subtree patterns. As we will see, this result no longer holds if we assume the VKR.

9.2 Incomplete trees

In order to get an intuition of the overall approach to XML DIS query answering, observe the following.
Given an XML DIS Π and a set of data sources D such that Π is consistent w.r.t. D, D contains information that belongs to each legal data tree represented by the DIS. In other terms, D represents the known portion of the full data accessible by means of queries over the DIS. In addition, on the one hand there is partial information about the content of each legal data tree that comes from the specification of the global schema; on the other hand, there is partial information on the portion of each legal data tree satisfying (exact) mappings. It turns out that

incomplete trees, presented in [6], are appropriate for representing the set of legal data trees w.r.t. D, in that they actually include a prefix tree that is known to belong to each represented data tree, as well as a description of the missing information. However, since incomplete trees require node identifiers shared among data sources, in order to be able to use them we need to apply to D beforehand a sound and complete identification function, as introduced in Section 8.2. This means that, in order to use Id_G, we assume that either all mappings are sound, or we are under VKR. To formally introduce identified incomplete trees, let us first recall some preliminary definitions from [6].

Definition. Let Σ be an alphabet. A simple conditional tree type over Σ is a tuple ⟨Σ, R, µ, cond⟩ where:

R ⊆ Σ is the set of root labels;

µ is a mapping associating to each a ∈ Σ a disjunction µ(a) of multiplicity atoms;

cond associates a condition to each a ∈ Σ (note that, as for p-queries, the condition applies to the data values of nodes with label a).

We say that a data tree T over Σ satisfies a simple conditional tree type ⟨Σ, R, µ, cond⟩ over Σ, noted T ⊨ ⟨Σ, R, µ, cond⟩, if and only if (i) the root of T has a label in R, (ii) for each node n of T such that λ(n) = a, if µ(a) = m_1 ∨ ... ∨ m_m, where each m_i is a multiplicity atom, then all the children of n have type m_i for some i ∈ {1, ..., m}, and (iii) for each node n of T such that λ(n) = a, ν(n) satisfies cond(a).

Intuitively, conditional tree types, similarly to tree types, are used to describe a set of valid trees. However, in order to be able to describe missing information, they are more powerful than tree types. More specifically, they allow disjunctions of multiplicity atoms, which are used to represent different possible alternatives for the types of missing elements. Moreover, they allow the specification of conditions on the data values of types.
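As an illustration, satisfaction of a simple conditional tree type can be sketched as follows. This is a simplified sketch under assumed encodings, not the formal semantics: each disjunct of µ(a) is represented, for brevity, as a predicate over the list of child labels (one predicate per multiplicity atom), and each condition cond(a) as a predicate over data values; all names are hypothetical.

```python
def satisfies(tree, roots, mu, cond):
    """Does the data tree (label, value, children) satisfy the simple
    conditional tree type?  mu[a] is a disjunction given as a list of
    predicates over the list of child labels; cond[a] is a predicate
    over the node's data value."""
    if tree[0] not in roots:                        # (i) root label in R
        return False

    def check(node):
        a, v, kids = node
        if not cond.get(a, lambda _: True)(v):      # (iii) value condition
            return False
        disjuncts = mu.get(a, [lambda labels: labels == []])
        # (ii) some disjunct must accept all the children of the node
        if not any(d([k[0] for k in kids]) for d in disjuncts):
            return False
        return all(check(k) for k in kids)

    return check(tree)
```

Note how the disjunction is resolved independently at each node, mirroring the "for some i" in the definition: different nodes with the same label may satisfy different multiplicity atoms.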
Note that alternative types and conditions on data values both express information on the content of legal data trees that comes from the schema and mapping specification. To illustrate this point, and in particular the need for conditional tree types, we give the following example.

Example. Let q_2 and D_2 be respectively the query and the data source shown in Fig. 7.5(d) and Fig. 7.5(e). Moreover, let Π = ⟨G, S, M⟩ be the DIS such that G = ⟨S_G, Φ_K, Φ_FK⟩ is the global schema of Example 7.3.3, S = {S_2} where S_2 is the DTD introduced in Chapter III, and M = {M_2}, where M_2 = (S_2, q_2, exact). Then, D_2 conforms to S_2, and it is easy to see that Π is consistent w.r.t. D = {D_2}. Now, imagine that you want to describe all the information you have about legal data trees. First, you would probably like to represent the fact that each legal data tree contains the information in D_2. We will see how we can achieve this by using incomplete trees. Second, since the mapping is exact, you would also like to describe the portion of each legal data tree that does not contribute to the answer to q_2, i.e.

the portion that does not come from D_2. Thus, you need to be able to represent that, besides the elements patient coming from D_2, each legal data tree may contain elements labeled patient such that at least one of the following is true: they have no child labeled bill; they have no child labeled cure; or they have a child labeled cure with empty data value, or with data value greater than 35.

The example above shows the need for introducing disjunctions of multiplicity atoms and conditions on data values, in order to describe possible alternative types for missing elements. However, this immediately requires introducing specialization, which leads us to the following definition.

Definition. A conditional tree type over Σ and a specialized alphabet Σ′ is a tuple ⟨Σ′, R, µ, cond, σ⟩ where:

⟨Σ′, R, µ, cond⟩ is a simple conditional tree type;

σ is a specialization mapping from Σ′ to Σ.

The semantics of conditional tree types is defined as follows. A data tree T over Σ satisfies the conditional tree type ⟨Σ′, R, µ, cond, σ⟩, noted T ⊨ ⟨Σ′, R, µ, cond, σ⟩, if and only if there exists T′ such that:

T′ ⊨ ⟨Σ′, R, µ, cond⟩;

σ(T′) = T.

Now that we have conditional tree types, we are finally able to introduce incomplete trees, which combine the representation of the missing information with the information coming from D.

Definition. An incomplete tree over an alphabet Σ is a tuple T = ⟨N, λ, ν, τ⟩ where:

N ⊆ 𝒩 is a finite set of nodes;

λ : N → Σ is a labeling of the nodes in N;

ν : N → Γ ∪ V_S associates to each node in N a data value in Γ or a Skolem constant in V_S;

τ = ⟨Σ′, R, µ, cond, σ⟩ is a conditional tree type over the alphabet N ∪ Σ such that, for each data tree T′ satisfying τ:

for each n ∈ N, there is at most one node of T′ labeled n;

if a node in T′ has its label in N, then its parent's label is also in N.
As for the semantics of incomplete trees, a data tree T = ⟨t, λ, ν⟩ over Σ belongs to the set of trees represented by an incomplete tree T, denoted rep(T), if and only if there exists a data tree T_0 = ⟨t_0, λ_0, ν_0⟩ over N ∪ Σ such that:

- T_0 satisfies τ;
- for each node n_0 of T_0, n_0 ∈ N if and only if λ_0(n_0) ∈ N, in which case n_0 = λ_0(n_0);
- if n_0 is a node of T_0 and n_0 ∈ N, then if ν(n_0) ∉ V_S then ν_0(n_0) = ν(n_0);
- T is obtained from T_0 by changing each label n ∈ N to λ(n) ∈ Σ.

In a nutshell, in order to represent the mix of known and missing information, incomplete trees allow one to specify a set N of instantiated nodes, together with their labels and data values. Instantiated nodes are then viewed as labels, and as such they can have multiple specializations, reflecting the fact that they are allowed to appear in different contexts. Note that the above definition differs from that of [6] only because of the Skolems that can appear as data values of instantiated nodes. This difference is due to the presence of existential subtree patterns in queries, which may require that an incomplete tree reflect the presence of nodes having an unknown data value.

9.3 Query answering using incomplete trees

In this section we aim at giving an intuition of how to use incomplete trees to solve XML DIS query answering. Thus, we continue making the assumptions of having a consistent XML DIS and a set of data sources provided with persistent node ids, assigned by Id_G under the assumptions that make it sound and complete. The main idea is to use incomplete trees as a representation system [58] for legal data trees. We thus follow an approach that is typical in the presence of incomplete information. This is not surprising since, as already discussed in Section 1.5, it is well-known [53] that LAV data integration query answering is strongly related to the problem of querying an incomplete database. Specifically, given an XML DIS Π, a set of data sources D, a query q and a data tree T, we construct an incomplete tree T such that T is subsumed by all trees in rep(T) if and only if it is subsumed by all legal data trees.
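The last step of the semantics above, in which a data tree T_0 over N ∪ Σ is turned into a data tree over Σ by replacing each instantiated-node label n ∈ N with its alphabet label λ(n), can be sketched in Python as follows (the data structures here are hypothetical simplifications, not part of the formal development):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: object          # either an alphabet symbol or an instantiated node id
    value: str = ""        # data value
    children: list = field(default_factory=list)

def relabel(t0, instantiated, lam):
    """Replace each label in the instantiated set by lam(n); keep alphabet
    labels untouched. Returns a fresh tree over the alphabet only."""
    new_label = lam[t0.label] if t0.label in instantiated else t0.label
    return Node(new_label, t0.value,
                [relabel(c, instantiated, lam) for c in t0.children])
```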
Let us now observe the following. Incomplete trees were introduced in [6] to represent and query incomplete information coming from a DTD and from a sequence of consecutive queries over an XML document that conforms to the DTD and has persistent node identifiers. Moreover, they have been proved to form a strong representation system with respect to ps-queries, i.e. p-queries not including existential subtree patterns. Thus, let us consider the case of an XML DIS Π = ⟨G, S, M⟩ and a set of data sources D = {D_1, …, D_n} such that Π is consistent w.r.t. D, G = ⟨S_G, ∅, ∅⟩, and M = {M_1, …, M_n} is such that each M_i has the form M_i = (q_i, S_i, exact), where q_i is a ps-query. Then, the results of [6] already provide correct algorithms for computing, in PTIME:

- an incomplete tree T_i = q_i⁻¹(D_i), for each mapping M_i = (q_i, S_i, exact) ∈ M; then, T_i represents the set of trees satisfying M_i w.r.t. D_i: rep(T_i) = {T | D_i = q_i(T)};

- an incomplete tree T = ⋂_{i ∈ {1,…,n}} T_i, where T_i = q_i⁻¹(D_i) for each mapping M_i = (q_i, S_i, exact) in M, and each data source D_i ∈ D conforms to S_i; then, T represents the set of trees satisfying M w.r.t. D: rep(T) = ⋂_{i ∈ {1,…,n}} rep(T_i);
- an incomplete tree T′ = SatType(T, S_G); then, T′ represents the set of trees that are represented by T and satisfy the global tree type S_G: rep(T′) = rep(T) ∩ {T | T ⊨ S_G};
- an incomplete tree q(T); then, q(T) represents the set of answers returned by each tree represented by T, i.e. rep(q(T)) = {q(T′) | T′ ∈ rep(T)}.

INPUT: a consistent XML DIS Π = ⟨G, S, M⟩ such that G = ⟨S_G, ∅, ∅⟩,
       M = {M_1, …, M_m}, M_i = (q_i, S_i, exact), q_i a ps-query,
       D = {D_1, …, D_m} with global ids, a ps-query q, a data tree T
OUTPUT: true or false

T := q_1⁻¹(D_1)
for i := 2 to m do
    T_i := q_i⁻¹(D_i)
    T := Intersection(T_i, T)
T := SatType(T, S_G)
if T is subsumed by q(T) then return true else return false

Figure 9.2: Algorithm Answer(Π, D, q, T)

Moreover, the results of [6] show how to check whether a data tree is a certain prefix of all trees represented by an incomplete tree, and we can adapt such a check to our setting in order to decide whether a data tree is subsumed by all trees represented by an incomplete tree. Thus, given a data tree T and a ps-query q, under the above-mentioned restrictions, we can apply the algorithm Answer shown in Fig. 9.2. Clearly, from all the above considerations, it follows that, under the above restrictions, algorithm Answer(Π, D, q, T) constructs an incomplete tree T such that rep(T) = sem(Π, D). Thus, since rep(q(T)) = {q(T′) | T′ ∈ rep(T)}, we have that T is subsumed by q(T) if and only if T is subsumed by q(T′) for each T′ ∈ sem(Π, D), which proves that Answer is correct. Moreover, by [6], Answer is PTIME in data complexity. In the next section, we provide two algorithms that solve the general XML-based DIS query answering problem by following an approach that is very similar to the one described above.
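The control flow of algorithm Answer (Fig. 9.2) can be sketched as follows. The operations on incomplete trees (inverse query evaluation, Intersection, SatType, the subsumption check) are passed in as functions, since their definitions come from [6]; all names below are hypothetical, and for illustration an "incomplete tree" may be modeled extensionally as the finite set of data trees it represents.

```python
def answer(mappings, sources, global_type, q, t,
           inverse_query, intersection, sat_type, subsumed_by):
    """Return true iff t is subsumed by q(T') for every represented tree T'."""
    inc = inverse_query(mappings[0], sources[0])           # T := q_1^{-1}(D_1)
    for m_i, d_i in zip(mappings[1:], sources[1:]):
        inc = intersection(inverse_query(m_i, d_i), inc)   # T := Intersection(T_i, T)
    inc = sat_type(inc, global_type)                       # keep trees satisfying S_G
    return subsumed_by(t, [q(x) for x in inc])             # is T subsumed by q(T)?
```

With the extensional model, intersection is plain set intersection and subsumption is set inclusion, so the loop structure can be exercised on toy data.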
Here we instead come back to the assumption of using Id G

under the conditions that make it sound and complete. As discussed in Section 8.2, whereas Id_G is sound and complete under the assumption of sound mappings, it is not complete as soon as there is at least one exact mapping. One may legitimately wonder whether it is possible to build a different identification function with the same complexity as Id_G, namely PTIME by Theorem 8.2.2, that is sound and complete also under the assumption of having exact mappings. The answer turns out to be negative, as shown below.

Theorem For any data integration system Π that has at least one exact mapping, under the assumption that P ≠ NP, there exists no identification function that is sound and complete w.r.t. Π and can be computed in PTIME.

Proof. Let Π be an XML DIS having all exact mappings. By contradiction, suppose that there exists an identification function F with PTIME complexity that is sound and complete w.r.t. Π. Then, we can apply F to D and obtain a set of data sources with node ids. But then, by using the results of [6], we can solve query answering in PTIME, which contradicts the hardness theorem for XML DIS query answering.

9.4 Query answering algorithms

In this section we provide two algorithms to solve query answering. The first algorithm solves query answering under the VKR restriction and the assumption of not having any key constraint. The second one is more general, and requires only that the conditions hold that ensure that Id_G is sound and complete.

Algorithm under VKR and no key constraints

In this section we present an algorithm that reduces XML DIS query answering, under VKR and under the assumption of not having any key constraint, to the setting proposed in [6], and thus to the algorithm Answer of Fig. 9.2. More precisely, let Π = ⟨G, S, M⟩ be an XML DIS with G = ⟨S_G, ∅, ∅⟩, and let D be a set of data sources such that Π is consistent w.r.t. D. We recall that under VKR, Id_G is sound and complete.
Thus, we can apply Id_G to D and obtain a set of data sources that share node ids as in [6]. However, this does not suffice to apply the results of [6]. Indeed, we have to deal with two major differences w.r.t. [6], namely the presence of existential subtree patterns in the mapping specification and the presence of sound mappings. Intuitively, the main idea is the following. For each mapping M_i ∈ M such that M_i = (q_i, S_i, as_i), we abstractly consider D_i ∈ D as a data source:

- that represents all possible data sources satisfying the mapping M_i′ = (q_i′, S_i′, as_i′) obtained by modifying q_i and S_i, so that q_i′ returns the nodes that are required to exist but are not returned by q_i;
- that is characterized by a color C_i; thus, if as_i = sound, D_i is seen as providing, for each collection of nodes a in S_G, exactly the nodes with label a and color C_i.
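The second point, tagging the data of a source with its color, can be sketched as follows. This is a simplified illustration with dictionary-based trees; in the text the color child is added only to members of collections of S_G, approximated here by an explicit set of labels:

```python
def colorize(node, collection_labels, color):
    """Append a child labeled C with data value `color` to every node whose
    label denotes a collection member. The tree is traversed before the
    color child is appended, so added children are never revisited."""
    for child in list(node["children"]):
        colorize(child, collection_labels, color)
    if node["label"] in collection_labels:
        node["children"].append({"label": "C", "value": color, "children": []})
```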

Following the above intuition, we preliminarily compute Π′ = ⟨G′, S′, M′⟩ and D′ = {D_1′, …, D_m′} as follows:

- G′ allows for incorporating in each legal data tree the information about the color of the data sources. More precisely, G′ = ⟨S_G′, ∅, ∅⟩ is obtained from G = ⟨S_G, ∅, ∅⟩ by replacing S_G = ⟨Σ_G, r_G, µ_G⟩ with S_G′ = ⟨Σ_G′, r_G, µ_G′⟩ such that Σ_G′ = Σ_G ∪ {C}, and µ_G′ is defined as follows: for each a ∈ Σ_G,

  µ_G′(a) = µ_G(a) C   if a is a member of some collection of S_G
  µ_G′(a) = µ_G(a)     otherwise

- M′ = {M_1′, …, M_m′} is obtained from M = {M_1, …, M_m} by replacing each M_i = (q_i, S_i, as_i) with M_i′ = (q_i′, S_i′, as_i′), where M_i′ differs from M_i in that (i) it requires existential subtree patterns to be returned by q_i′, and thus to belong to the data source, and (ii) if M_i is sound, the data source provides exact information about the collections of nodes colored with C_i. More precisely, this is achieved by first constructing the ps-query version of q_i′ = ⟨t_i′, λ_i′, cond_i′, ret_i′⟩, denoted Ps-query(q_i′), starting from q_i, and then modifying it by setting ret_{q_i′}(m) = true for each m ∈ t_{q_i′}. Then, if as_i = sound, we further modify q_i′ as follows: for each m labeled a that has some child n labeled b such that b^ω occurs in µ_G(a) with ω ∈ {∗, +}, we add a child n_c of m in t_i′, and we set λ_i′(n_c) = C, cond_i′(n_c) = (= C_i), and ret_i′(n_c) = true. Then, S_i′ is defined in the obvious manner.
- For each i ∈ {1, …, m}, D_i′ is obtained from D_i as follows. Intuitively, let M_i = (q_i, S_i, as_i) be such that q_i = ⟨t_{q_i}, λ_{q_i}, cond_{q_i}, ret_{q_i}⟩. Then, for each m in D_i, we possibly add a subtree to make D_i′ satisfy M_i′. More precisely, for each m ∈ D_i, let m_q be the node of t_{q_i} such that there exists a partial function γ from the nodes n_q of t_{q_i} to the nodes of D_i such that γ(m_q) = m and:
  - γ(n_q) is defined for each n_q such that ret_{q_i}(n_q) = true;
  - γ preserves the parent-child relationship and the labeling;
  - ν_i(γ(n_q)) satisfies cond_{q_i}(n_q).
Clearly, since M_i is consistent w.r.t. D_i, m_q and γ always exist. Now, for each child n_q of m_q such that ret_{q_i}(n_q) = false, we apply the recursive step AddNode(m_q, n_q, m), defined as follows: we add a child n of m in D_i′, and we set λ_i′(n) = λ_{q_i′}(n_q) and ν_i′(n) = v_s, where v_s is a fresh Skolem in V_S; then, for each child c_q of n_q, we call AddNode(n_q, c_q, n).
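The recursive step AddNode can be sketched as follows (hypothetical dictionary-based trees: the query node n_q is given as a pair of its label and the list of its children, and fresh Skolems are generated by a counter):

```python
import itertools

_fresh = itertools.count()

def fresh_skolem():
    """A fresh element of V_S, here just a counter-based name."""
    return f"v_{next(_fresh)}"

def add_node(nq, m):
    """Add below m a node for the query node nq = (label, children),
    carrying a fresh Skolem data value, then recurse on nq's children."""
    label, children = nq
    n = {"label": label, "value": fresh_skolem(), "children": []}
    m["children"].append(n)
    for child in children:
        add_node(child, n)
    return n
```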

Finally, intuitively, if M_i is sound then, for each collection of nodes in D_i, we add the information about the color that characterizes D_i. More precisely, for each m ∈ D_i that is labeled a, and for each child n of m labeled b such that b^ω occurs in µ_G(a) with ω ∈ {∗, +}, we add a child n_c of n in D_i′, and we set λ_i′(n_c) = C and ν_i′(n_c) = C_i.

Clearly, the above computation returns an XML DIS Π′ and a set of data sources D′ such that Π′ is consistent w.r.t. D′. Moreover, by construction, Π′ includes only exact mappings not involving existential subtree patterns. Also, under VKR, we can apply Id_G to D′ and obtain a set of data sources with global ids. Thus, given a data tree T and a p-query q, by considering Ps-query(q), we have finally reduced our setting to the particular setting described in the previous section, which allows us to apply the algorithm Answer shown in Fig. 9.2. Note, however, that for each data tree, Ps-query(q) returns a tree that is not consistent w.r.t. q. Thus, to ensure correctness, we first have to verify that T is consistent w.r.t. q, which can be done in PTIME (cf. Lemma 8.3.5).

Theorem Let Π = ⟨G, S, M⟩ be an XML DIS under VKR, with M mixed, and let D be a set of sources such that Π is consistent w.r.t. D. Moreover, let q be a p-query, and A a data tree such that A is consistent w.r.t. q. Then, A ⊑ q(Π, D) if and only if Answer(Π′, Id_G(D′), q′, A) = true, where Π′ and D′ are computed as described above and q′ = Ps-query(q).

Proof. In order to prove the theorem, we show that

A ⊑ q′(T) for each T ∈ rep(T)   if and only if   A ⊑ q(T′) for each T′ ∈ sem(Π, D).

Thus, we first show that for each T ∈ rep(T) there exists T′ ∈ sem(Π, D) such that if A ⊑ q′(T) then A ⊑ q(T′). Then we show that for each T′ ∈ sem(Π, D) there exists T ∈ rep(T) such that if A ⊑ q(T′) then A ⊑ q′(T). Let T be a data tree in rep(T), and suppose that A ⊑ q′(T). Clearly, since A is consistent w.r.t.
q, we have that A does not contain any node that is mapped to a node n of q′(T) for which there exists a valuation from a node n_q of q′ with ret(n_q) = false. Thus, by the semantics of p-queries, we have that A ⊑ q(q′(T)) = q(T). Let us now construct a tree T′ starting from T and eliminating all nodes labeled C. It is easy to verify that, by construction, T′ satisfies S_G and M w.r.t. D; thus, T′ ∈ sem(Π, D). Moreover, since q does not involve nodes labeled C, we have that q(T) = q(T′). Then, since we showed that A ⊑ q(T), we obtain A ⊑ q(T′). Let now T′ be a tree in sem(Π, D), and suppose that A ⊑ q(T′). Then, by the semantics of p-queries, we have that A ⊑ q′(T′). But then, let T be the tree obtained starting from T′ and modifying it minimally so that q_i′(T) = D_i′. It can be shown that we obtain a tree that satisfies S_G′ and M′, and is such that A ⊑ q′(T).

By the previous theorem, by the results of [6], and by the considerations made in the previous section, one can immediately show the following.

Theorem XML DIS query answering under VKR, and under the assumption of not having key constraints, is PTIME in data complexity.

Algorithm under Id_G sound and complete

In this section, we provide an algorithm to solve general XML DIS query answering under the assumption of Id_G being sound and complete. The idea is to generalize both algorithms proposed in the previous sections, in order to deal uniformly with sound and exact mappings. Thus, this section is strongly related to the results of [6]. We start by giving several preliminary results.

First, let us introduce the function SatMapping. This takes as input a global schema specification G = ⟨S_G, Φ_K, Φ_FK⟩, a mapping specification M = (S, q, as) and a data source D conforming to S, such that q = ⟨t_q, λ_q, cond_q, ret_q⟩ and D = ⟨t_D, λ_D, ν_D⟩, and returns an incomplete tree T = ⟨N, λ, ν, τ⟩ such that τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩. Intuitively, T is characterized by (i) a set of instantiated nodes that come from the data source and are known to form a tree that is a prefix of each represented tree, and (ii) a type that describes the information that does not come from the data source. Obviously, if the mapping is sound, we do not know anything about the portion of information that does not come from the source, whereas, if the mapping is exact, we do know the possible alternative types for it. A crucial issue is the presence of existential subtree patterns in the query q. In order to reflect the presence of such tree patterns in each tree represented by T, we add a set of instantiated nodes carrying Skolem data values. This is the reason why we need the global schema G: it is used to assign a node id to each newly introduced node. Let Σ = {a_1, …, a_n}; then we construct T as follows. Note that below we denote by all the multiplicity atom τ_{a_1}∗ ⋯ τ_{a_n}∗.

1. We build the initial set of instantiated nodes. For this, we set ⟨N, λ, ν⟩ = D.

2. Depending on as, we define τ differently.
- If as = exact, then the construction of τ is very similar to the one proposed in [6], with the only difference that we need to check whether a type is returned by the query. We define Σ′ as the set

  Σ′ = {τ_a | a ∈ Σ} ∪ {τ_n | n ∈ N} ∪ {τ̌_m | m ∈ t_q, ret_q(m) = false} ∪ {τ̄_m | m ∈ t_q} ∪ {τ̂_m | m ∈ t_q}.

  Intuitively, the meaning of these types is the following: τ_a is the type of all nodes labeled a, without any constraint on the node and its subtree; τ_n describes the type of the node n ∈ N; τ̄_m describes the nodes with label λ_q(m) that make q false at m by violating cond_q(m); τ̂_m describes the nodes with label λ_q(m) that satisfy cond_q(m) but for which the subtree of q rooted at m cannot be matched below the node. We set R = {τ_{r_D}}, where r_D is the root of N.

  For each a ∈ Σ, we set σ(τ_a) = a and cond(τ_a) = true. Assume that Σ = {a_1, …, a_n}, and let all be the multiplicity atom τ_{a_1}∗ ⋯ τ_{a_n}∗. We set µ(τ_a) = all. For each m in t_q, we set σ(τ̄_m) = λ_q(m), cond(τ̄_m) = ¬cond_q(m), and µ(τ̄_m) = all. If m is not a leaf, then let m_1, …, m_l be the children of m. We set σ(τ̂_m) = λ_q(m), cond(τ̂_m) = cond_q(m), and µ(τ̂_m) = ∨_{1 ≤ i ≤ l} α_i, where α_i is the multiplicity atom τ̄_{m_i} τ̂_{m_i}∗ else_i, and else_i contains τ_a∗ for every a ∈ Σ with a ≠ λ_q(m_i), i = 1, …, l. For each n ∈ N, we set σ(τ_n) = λ(n) and cond(τ_n) = (= ν(n)). If n is a leaf, we set µ(τ_n) = all. Otherwise, let m be the node of t_q such that there is a valuation from q to D mapping m to n (note that such a valuation exists since the mapping is consistent). Let n_1, …, n_k be the children of n, and let m_1, …, m_l be the children of m such that ret_q(m_i) = true. We set µ(τ_n) = τ_{n_1} ⋯ τ_{n_k} τ̄_{m_1}∗ τ̂_{m_1}∗ ⋯ τ̄_{m_l}∗ τ̂_{m_l}∗ else_n, where else_n contains τ_a∗ for each a ∈ Σ that is not a label of any of the children of n in D.

- If as = sound, then we only know that D is a prefix of all legal data trees. Thus, we proceed as follows. We set Σ′ = {τ_a | a ∈ Σ} ∪ {τ_n | n ∈ N}, where the meaning of each type is as above. We set R = {τ_{r_D}}, where r_D is the root of N. For each a ∈ Σ, we set σ(τ_a) = a, cond(τ_a) = true, and µ(τ_a) = all. For each n ∈ N, we set σ(τ_n) = λ(n) and cond(τ_n) = (= ν(n)). If n is a leaf, we set µ(τ_n) = all. Otherwise, let n_1, …, n_k be the children of n; we set µ(τ_n) = τ_{n_1} ⋯ τ_{n_k} else_n, where else_n contains τ_a∗ for each a ∈ Σ.

3. We add to N the nodes that are required to belong to each represented data tree because of an existential subtree pattern. To this aim, we need two distinct Skolems in V_S to denote the id and the data value of each newly introduced node.
Specifically, we do so by repeatedly applying the following rule. For each n ∈ N, let m be the node of t_q such that there is a valuation from m to n. If there exists a child m_i of m such that ret_q(m_i) = false, then call the function AddNode(G, T, n, λ_q(m_i), cond_q(m_i), v′) shown in Fig. 9.3, where: if Id_G(n) = X.λ(n).v and Φ_K contains a key constraint λ(n).b → λ(n), then v′ = v; otherwise, v′ is a fresh Skolem. Intuitively, this function adds an instantiated node with label λ_q(m_i) and a data value satisfying cond_q(m_i), as well as a corresponding type in each disjunct of µ(t) for each type t corresponding to the instantiated node n. Clearly, in doing so we need to continue guaranteeing that node ids

are assigned coherently with Id_G; that is why we need to use the global schema G. Note that the above construction certainly terminates.

INPUT: an incomplete tree T = ⟨N, λ, ν, τ⟩ with τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩,
       a node n in N, a label b ∈ Σ, a schema G = ⟨S_G, Φ_K, Φ_FK⟩,
       a condition c and a data value v

add n′ to N
set λ(n′) = b
set ν(n′) = v
for each a_i ∈ Σ′ such that σ(a_i) = n
    add b_i to Σ′
    set σ(b_i) = n′
    set cond(b_i) = c
    add b_i^1 to each disjunct α in µ(a_i)
if b^ω occurs in µ_G(λ(n)) with ω ∈ {1, ?}
    then set Id_G(n′) = Id_G(n).b
    else set Id_G(n′) = Id_G(n).b.v_s, where v_s is a fresh Skolem in V_S

Figure 9.3: Function AddNode(G, T, n, b, c, v)

Clearly, from the above construction it is possible to state the following.

Lemma Given a mapping M = (S, q, as) and a data source D conforming to S, such that M = (S, q, as) is consistent w.r.t. D, SatMapping returns an incomplete tree T representing all trees that satisfy M w.r.t. D, i.e.:

- rep(T) = {T | D = q(T)}, if as = exact;
- rep(T) = {T | D ⊑ q(T)}, if as = sound.

Note that, similarly to [6], it turns out that the incomplete trees obtained by computing the function SatMapping all have a particularly simple structure, called unambiguous and defined next.

Definition An incomplete tree T = ⟨N, λ, ν, τ⟩, where τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩, is unambiguous if for every a ∈ Σ′ and every multiplicity atom α in µ(a):

1. if a_i^ω occurs in α and σ(a_i) ∈ N, then ω = 1; otherwise, ω = ∗;
2. if a_i and a_j, i ≠ j, occur in α and σ(a_i) = σ(a_j) ∈ Σ, then cond(a_i) ∧ cond(a_j) is unsatisfiable;
3. if a_i and a_j, i ≠ j, occur in α and σ(a_i) = σ(a_j) ∈ Σ, then there exists a_k^1 occurring in α such that σ(a_k) = n ∈ N and λ(n) = σ(a_i) = σ(a_j).

It is easy to see that an incomplete tree returned by SatMapping is always unambiguous. Another useful notion is the notion of compatible trees. Note that this differs from the notion of compatible trees of [6] because of empty Skolem data values.

More precisely, two incomplete trees ⟨N_1, λ_1, ν_1, τ_1⟩ and ⟨N_2, λ_2, ν_2, τ_2⟩, with τ_1 = ⟨Σ_1, R_1, µ_1, cond_1, σ_1, N_1 ∪ Σ⟩ and τ_2 = ⟨Σ_2, R_2, µ_2, cond_2, σ_2, N_2 ∪ Σ⟩, are said to be compatible if, for each n ∈ N_1 ∩ N_2, we have that:

- λ_1(n) = λ_2(n); and
- either at least one among ν_1(n), ν_2(n) is empty, or:
  - if ν_1(n), ν_2(n) ∈ Γ, then ν_1(n) = ν_2(n);
  - if ν_1(n), ν_2(n) ∈ V_S, then for each couple of types (t_1, t_2) such that σ_1(t_1) = n = σ_2(t_2), we have that cond_1(t_1) ∧ cond_2(t_2) is satisfiable.

Note that if the system is consistent, then by construction, two incomplete trees obtained by applying the function SatMapping to two mappings and the corresponding data sources are compatible. We next show that, given a couple T_1, T_2 of unambiguous compatible incomplete trees, their intersection T = Intersection(T_1, T_2), with T = ⟨N, λ, ν, τ⟩ and τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩, is computed exactly as in [6], except for the fact that Σ′ may contain types coming from the merge of two instantiated nodes introduced by the functions SatMapping(M_i, D_i), i = 1, 2, to reflect the presence of a node of the same type in an existential subtree pattern specified in each M_i. Indeed, in [6], Σ′ consists of all pairs of compatible types, where this last notion has to be modified in order to take into account also the case described above. Thus, two types t_1 ∈ Σ_1 and t_2 ∈ Σ_2 are compatible if one of the conditions specified in [6] holds, or the following one does: σ_1(t_1), σ_2(t_2) ∈ (N_1 \ N_2) ∪ (N_2 \ N_1), σ_1(t_1) ≠ σ_2(t_2), and σ_1(t_1), σ_2(t_2) can be unified. Then, for each couple of compatible types satisfying the condition above, we set:

- σ((t_1, t_2)) = n_u, where n_u is the identifier that results from the unification of σ_1(t_1) and σ_2(t_2);
- cond((t_1, t_2)) = cond_1(t_1) ∧ cond_2(t_2);
- λ(n_u) = λ_1(σ_1(t_1)) = λ_2(σ_2(t_2));
- ν(n_u) = v, where v is a fresh Skolem that satisfies cond((t_1, t_2)).
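The node-level part of the compatibility test can be sketched as follows. This is a simplification: conditions on types are elided, and a Skolem is recognized, by a hypothetical convention, as a string value starting with "v_":

```python
def is_skolem(v):
    return isinstance(v, str) and v.startswith("v_")

def compatible(nodes1, nodes2):
    """nodes1, nodes2: dicts mapping node ids to (label, value) pairs.
    Check the shared instantiated nodes of two incomplete trees."""
    for n in nodes1.keys() & nodes2.keys():
        (l1, v1), (l2, v2) = nodes1[n], nodes2[n]
        if l1 != l2:
            return False                  # labels must agree
        if not v1 or not v2:
            continue                      # an empty value is compatible with anything
        if is_skolem(v1) or is_skolem(v2):
            continue                      # Skolems can be unified (type conditions elided)
        if v1 != v2:
            return False                  # two distinct constants clash
    return True
```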
Note that, as in [6], compatibility ensures that this construction is well-defined, since: σ_1(t_1) and σ_2(t_2) can be unified, and thus, by the construction of Id_G, t_1 and t_2 have the same label; and cond_1(t_1) ∧ cond_2(t_2) is satisfiable. Finally, the definition of µ given in [6] can be easily adapted to take into account Skolem data values. Given the similarity of the construction above with the one presented in [6], one can easily verify that the lemma below holds.

Lemma Let T_1, T_2 be two unambiguous incomplete trees. Then, Intersection(T_1, T_2) returns an incomplete tree T such that rep(T) = rep(T_1) ∩ rep(T_2). Moreover, T is unambiguous.

From the last two lemmas, it follows that if we compute an incomplete tree T_i = SatMapping(M_i, D_i) for each i ∈ {1, …, n}, and then we compute the intersection of all such incomplete trees, we obtain an incomplete tree that represents all trees satisfying M. Let us now focus on the global schema specification G = ⟨S_G, Φ_K, Φ_FK⟩. We can combine the information in an incomplete tree with the information coming from S_G exactly as done in [6], that is, by using the function SatType. Concerning the key constraints in Φ_K, by the use of Id_G we have that the constraints in Φ_K are implicitly satisfied by each data tree in rep(T). Now, consider the foreign keys in Φ_FK. We proceed by computing CloseFK(T), which essentially applies the well-known chase technique to the incomplete tree T. However, since the chase may lead to an infinite incomplete tree, we apply it until it adds instantiated nodes with node identifiers that cannot be unified with already present node ids. Indeed, given that we consider uniquely localizable foreign keys, and we assume that the tree type is not recursive, it can be easily seen that this condition is sufficient to ensure termination. Moreover, it ensures that CloseFK(T) is representative of all trees in rep(T) that simultaneously satisfy Φ_FK (details are omitted).

INPUT: a consistent XML DIS Π = ⟨G, S, M⟩ such that G = ⟨S_G, Φ_K, Φ_FK⟩
       and M = {M_1, …, M_m}, D = {D_1, …, D_m} with global ids
       (assigned by Id_G), a p-query q = ⟨t, λ, cond, ret⟩, and a data
       tree T such that T is consistent w.r.t. q
OUTPUT: true or false

T := SatMapping(M_1, D_1)
for i := 2 to m do
    T_i := SatMapping(M_i, D_i)
    T := Intersection(T_i, T)
T := SatType(T, S_G)
T := CloseFK(T)
if T is subsumed by q′(T), where q′ = Ps-query(q), then return true else return false

Figure 9.4: Algorithm Answer(2)
We are finally able to present, in Fig. 9.4, the algorithm Answer for general XML DIS query answering under the assumptions that keep Id_G sound and complete. As already mentioned, this algorithm generalizes the algorithm presented in Fig. 9.2.

Theorem Given a consistent data integration system Π = ⟨G, S, M⟩ and a set D of data sources conforming to S, Answer(Π, D, q, T) = true if and only if T ⊑ q(Π, D).

Proof. The proof is similar to the proof of Theorem 9.4.1, and follows from the previous lemmas and from the fact that, clearly, by construction, sem(Π, D) ⊆ rep(T).

Given the construction described above, we strongly conjecture that, under the assumptions that make Id_G sound and complete, XML DIS query answering is PTIME in data complexity.


Conclusion

In this thesis, we have studied the problem of modeling a data integration system (DIS), and of detecting whether it is consistent with respect to a set of data sources. Moreover, we have addressed the issues of answering queries and performing updates over a DIS. We have tackled the above problems considering both a structured and a semi-structured data model for the global schema.

More specifically, in the first part we have focused on DISs characterized by a global schema expressing the intensional level of a Description Logic ontology. We have first proposed and studied the new language DL-Lite_A, particularly tailored for expressing ontologies. Then, we have provided LOGSPACE algorithms for checking DL-Lite_A KB satisfiability and for answering conjunctive queries. We have shown that both algorithms allow us to reduce the main reasoning services over the KB to the evaluation of a first-order query over a database. Afterwards, we have motivated all these preliminary results on DL-Lite_A KB analysis by showing that DL-Lite_A DISs allow for separating reasoning from the access to the actual data sources. This led us to consistency and query answering algorithms for DL-Lite_A DISs that keep the notable property of being LOGSPACE. Finally, we have started studying the problem of updating a DIS, by first considering the case of the instance-level update of an ontology expressed by means of a DL KB. Our results show that a restricted variant of DL-Lite_A has several nice characteristics, including the fact that the result of an update is always expressible within the DL itself.

In the part concerning the investigation of XML-based data integration, we have focused on the problem of adapting the theoretical approach to data integration to the XML data model. This has raised the basic and difficult issue of the identification of data source nodes.
Indeed, in practice, one often makes the unrealistic assumption of dealing with data sources provided with persistent node identifiers. In this thesis we have defined a new notion of identification function, which exploits key information on the data (e.g. introduced by an Entity Resolution module). Based on the use of such a new function, we have provided different algorithms for solving query answering under different assumptions on the DIS specification. Moreover, we have shown that XML DIS consistency and query answering are in general NP-hard and coNP-hard in data complexity, thus confirming in our (simplified) XML setting results that were provided in [3] concerning the relational (LAV) setting.

There are several interesting directions for continuing our research, both in the ontology-based and the XML-based context. First, clearly, it would be interesting to study the relationship existing between the two contexts for data integration. In particular, we plan to investigate whether it is possible to apply a query rewriting

technique to XML in the spirit of the one adopted in the context of ontology-based DIS. Secondly, we aim at finding a complete characterization of consistency and query answering in the XML setting. Thirdly, we plan to study the problem of updating a DIS. In particular, for ontology-based DIS update, we aim at providing an algorithm to compute updates that exploits the one proposed in this thesis for updating a KB, and that is applicable to a general DL-Lite_A DIS. Also, we plan to integrate the results of our investigation on updates in QUONTO [7], which currently implements the algorithms for query answering and satisfiability of a KB expressed in a restricted variant of DL-Lite_A. Fourthly, in this thesis we adopted a classical model-based approach to update, stemming from the existing literature on updating knowledge bases. Other approaches to update have been studied, and their application to DIS might be of interest, as well as approaches based on belief revision. We believe that, in principle, several approaches to update and belief revision could coexist on the same DIS, in order to model different types of services involving some sort of instance evolution. Finally, it is worth noting that updates bring in the general issue of dealing with inconsistency in DIS. In this thesis, we have addressed the issue of detecting inconsistencies, whereas we have not addressed at all the problem of reconciling mutually inconsistent data from the data sources. This is a challenging topic that deserves research efforts both in the context of ontology-based DIS and of XML-based DIS. Also, relatively to updates over KBs, the semantics that we have considered addresses the issue of solving inconsistency between the current instance level of the ontology and what has been asserted by the update, while it does not deal with inconsistencies between the update and the intensional level.
As already mentioned in Section 1.4, it would be interesting to study possible semantics that are tolerant with respect to the latter form of inconsistency.

Bibliography

[1] Serge Abiteboul, Omar Benjelloun, and Tova Milo. Positive AXML. In Proc. of the 23rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004).
[2] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web. Morgan Kaufmann Publishers, San Francisco, California.
[3] Serge Abiteboul and Oliver Duschka. Complexity of answering queries using materialized views. In Proc. of the 17th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 98).
[4] Serge Abiteboul and Gösta Grahne. Update semantics for incomplete databases. In Proc. of the 11th Int. Conf. on Very Large Data Bases (VLDB 85), pages 1–12.
[5] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison Wesley Publ. Co., Reading, Massachusetts.
[6] Serge Abiteboul, Luc Segoufin, and Victor Vianu. Representing and Querying XML with Incomplete Information. ACM Trans. on Database Systems. To appear.
[7] Andrea Acciarri, Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Mattia Palmieri, and Riccardo Rosati. QUONTO: QUerying ONTOlogies. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005).
[8] Bernd Amann, Catriel Beeri, Irini Fundulaki, and Michel Scholl. Querying XML sources using an ontology-based mediator. In CoopIS/DOA/ODBASE.
[9] Sihem Amer-Yahia and Yannis Kotidis. A web-services architecture for efficient XML data exchange. In ICDE 04: Proceedings of the 20th International Conference on Data Engineering, page 523.
[10] K. Appel, Wolfgang Haken, and John Koch. Every Planar Map is Four Colorable. Illinois Journal of Mathematics, 21, 1977.
[11] Marcelo Arenas, Pablo Barcelo, Ronald Fagin, and Leonid Libkin. Locally consistent transformations and query answering in data exchange. In Proc. of

the 23rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004).
[12] Marcelo Arenas and Leonid Libkin. XML data exchange: Consistency and query answering. In Proc. of the 24th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2005).
[13] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope. In Proc. of the 20th Int. Joint Conf. on Artificial Intelligence (IJCAI 2005).
[14] Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press.
[15] Franz Baader and Philipp Hanschke. A schema for integrating concrete domains into concept languages. In Proc. of the 12th Int. Joint Conf. on Artificial Intelligence (IJCAI 91).
[16] Franz Baader, Ian Horrocks, and Ulrike Sattler. Description logics as ontology languages for the semantic web. In Mechanizing Mathematical Reasoning: Essays in Honor of Jörg Siekmann on the Occasion of His 60th Birthday, number 2605 in Lecture Notes in Artificial Intelligence. Springer.
[17] Franz Baader, Carsten Lutz, Maya Milicic, Ulrike Sattler, and Frank Wolter. Integrating description logics and action formalisms: First results. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005).
[18] Michael Benedikt, Chee-Yong Chan, Wenfei Fan, Juliana Freire, and Rajeev Rastogi. Capturing both types and constraints in data integration. In Proc. of the 22nd ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2003).
[19] O. Benjelloun, H. Garcia-Molina, J. Jonas, Q. Su, and J. Widom. Swoosh: A generic approach to entity resolution. Technical report, Stanford University. Available at /pub/
[20] Alexander Borgida. Language features for flexible handling of exceptions in information systems. ACM Trans. on Database Systems, 10(4).
[21] Alexander Borgida.
Description logics in data management. IEEE Trans. on Knowledge and Data Engineering, 7(5).
[22] Peter Buneman, Susan Davidson, Wenfei Fan, Carmen Hara, and Wang-Chiew Tan. Keys for XML. In Proc. of the 10th Int. World Wide Web Conf. (WWW 2001).
[23] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Data integration under integrity constraints. Information Systems, 29(2), 2004.

[24] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Data integration under integrity constraints. In Proc. of the 14th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2002), volume 2348 of Lecture Notes in Computer Science. Springer.
[25] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Paolo Naggar, and Fabio Vernacotola. IBIS: Semantic data integration at work. In Proc. of the 15th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2003), pages 79–94.
[26] Andrea Calì, Domenico Lembo, and Riccardo Rosati. Query rewriting and answering under constraints in data integration systems. In Proc. of the 18th Int. Joint Conf. on Artificial Intelligence (IJCAI 2003), pages 16–21.
[27] Andrea Calì, Domenico Lembo, Riccardo Rosati, and Marco Ruzzi. Experimenting data integration with In Proc. of the 16th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2004), volume 3084 of Lecture Notes in Computer Science. Springer.
[28] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. What to ask to a peer: Ontology-based query reformulation. In Proc. of the 9th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2004).
[29] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. DL-Lite: Tractable description logics for ontologies. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005).
[30] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Data complexity of query answering in description logics. In Proc. of the 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2006).
[31] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family.
Submitted to an international journal.
[32] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Logical foundations of peer-to-peer data integration. In Proc. of the 23rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004).
[33] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. View-based query processing and constraint satisfaction. In Proc. of the 15th IEEE Symp. on Logic in Computer Science (LICS 2000), 2000.

[34] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Reasoning on regular path queries. SIGMOD Record, 32(4):83–92.
[35] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, and Riccardo Rosati. Linking data to ontologies: The description logic DL-Lite_A. In Proc. of the 2nd Workshop OWLED. To appear.
[36] Sudarshan S. Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, and Jennifer Widom. The TSIMMIS project: Integration of heterogeneous information sources. In Proc. of the 10th Meeting of the Information Processing Society of Japan (IPSJ 94), pages 7–18.
[37] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6).
[38] Luna Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In Proc. of the 24th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2005), pages 85–96.
[39] Thomas Eiter and Georg Gottlob. On the complexity of propositional knowledge base revision, updates and counterfactuals. Artificial Intelligence, 57.
[40] Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data exchange: Semantics and query answering. In Proc. of the 9th Int. Conf. on Database Theory (ICDT 2003).
[41] Ronald Fagin, Phokion G. Kolaitis, and Lucian Popa. Data exchange: Getting to the core. ACM Trans. on Database Systems, 30(1).
[42] Wenfei Fan and Leonid Libkin. On XML integrity constraints in the presence of DTDs. J. of the ACM, 49(3).
[43] Mary Fernandez, Yana Kadiyska, Dan Suciu, Atsuyuki Morishima, and Wang-Chiew Tan. SilkRoute: A framework for publishing relational data in XML. ACM Trans. on Database Systems, 27(4).
[44] Helena Galhardas, Daniela Florescu, Dennis Shasha, and Eric Simon. An extensible framework for data cleaning.
Technical Report 3742, INRIA, Rocquencourt.
[45] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vassalos, and Jennifer Widom. The TSIMMIS approach to mediation: Data models and languages. J. of Intelligent Information Systems, 8(2).
[46] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to NP-Completeness. W. H. Freeman and Company, San Francisco (CA, USA), 1979.

[47] Giuseppe De Giacomo, Maurizio Lenzerini, Antonella Poggi, and Riccardo Rosati. On the update of description logic ontologies at the instance level. In Proc. of the 21st Nat. Conf. on Artificial Intelligence (AAAI 2006).
[48] François Goasdoué, Véronique Lattès, and Marie-Christine Rousset. The use of CARIN language and algorithms for information integration: The Picsel system. Int. J. of Cooperative Information Systems, 9(4).
[49] Luca Grieco, Domenico Lembo, Marco Ruzzi, and Riccardo Rosati. Consistent query answering under key and exclusion dependencies: Algorithms and experiments. In Proc. of the 14th Int. Conf. on Information and Knowledge Management (CIKM 2005).
[50] Benjamin N. Grosof, Ian Horrocks, Raphael Volz, and Stefan Decker. Description logic programs: Combining logic programs with description logic. In Proc. of the 12th Int. World Wide Web Conf. (WWW 2003), pages 48–57.
[51] Laura M. Haas, Eileen T. Lin, and Mary T. Roth. Data integration through database federation. IBM Systems Journal, 41(4).
[52] Peter Haase and Ljiljana Stojanovic. Consistent evolution of OWL ontologies. In Proc. of the 2nd European Semantic Web Conference.
[53] Alon Y. Halevy. Answering queries using views: A survey. Very Large Database J., 10(4).
[54] Alon Y. Halevy. Structures, semantics and statistics. In Proc. of the 30th Int. Conf. on Very Large Data Bases (VLDB 2004).
[55] Alon Y. Halevy, Zachary G. Ives, Peter Mork, and Igor Tatarinov. Piazza: Data management infrastructure for semantic web applications. In Proc. of the 12th Int. World Wide Web Conf. (WWW 2003).
[56] Richard Hull. A survey of theoretical research on typed complex database objects. In J. Paredaens, editor, Databases. Academic Press.
[57] Ullrich Hustadt, Boris Motik, and Ulrike Sattler. Data complexity of reasoning in very expressive description logics. In Proc. of the 20th Int. Joint Conf.
on Artificial Intelligence (IJCAI 2005).
[58] Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. J. of the ACM, 31(4).
[59] David S. Johnson and Anthony C. Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. J. of Computer and System Sciences, 28(1).
[60] Thomas Kirk, Alon Y. Levy, Yehoshua Sagiv, and Divesh Srivastava. The Information Manifold. In Proceedings of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments, pages 85–91, 1995.

[61] H. J. Komorowski. A specification of an abstract Prolog machine and its application to partial evaluation. Technical Report LSST 69, Linköping University.
[62] Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Source inconsistency and incompleteness in data integration. In Proc. of the 9th Int. Workshop on Knowledge Representation meets Databases (KRDB 2002). CEUR Electronic Workshop Proceedings.
[63] Maurizio Lenzerini. Data integration: A theoretical perspective. In Proc. of the 21st ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2002).
[64] Nicola Leone, Thomas Eiter, Wolfgang Faber, Michael Fink, Georg Gottlob, Gianluigi Greco, Edyta Kalka, Giovambattista Ianni, Domenico Lembo, Maurizio Lenzerini, Vincenzino Lio, Bartosz Nowicki, Riccardo Rosati, Marco Ruzzi, Witold Staniszkis, and Giorgio Terracina. The INFOMIX system for advanced integration of incomplete and inconsistent data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data.
[65] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB 96).
[66] Alon Y. Levy and Marie-Christine Rousset. Combining Horn rules and description logics in CARIN. Artificial Intelligence, 104(1-2).
[67] Alon Y. Levy, Divesh Srivastava, and Thomas Kirk. Data model and query evaluation in global information systems. J. of Intelligent Information Systems, 5.
[68] Chen Li, Ramana Yerneni, Vasilis Vassalos, Hector Garcia-Molina, Yannis Papakonstantinou, Jeffrey D. Ullman, and Murty Valiveti. Capability based mediation in TSIMMIS. In Proc. of the 17th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 1999).
[69] Hongkai Liu, Carsten Lutz, Maja Milicic, and Frank Wolter. Updating description logic ABoxes. In Proc. of the 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2006).
[70] John W. Lloyd.
Foundations of Logic Programming (Second, Extended Edition). Springer, Berlin, Heidelberg.
[71] John W. Lloyd and John C. Shepherdson. Partial evaluation in logic programming. J. of Logic Programming, 11.
[72] M. N. Wegman and M. S. Paterson. Linear unification. J. of Computer and System Sciences, 16(2), 1978.

[73] Ioana Manolescu, Daniela Florescu, and Donald Kossmann. Answering XML queries over heterogeneous data sources. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB 2001).
[74] Nelson Mendonça. Integrating information for on demand computing. In Proc. of the 29th Int. Conf. on Very Large Data Bases (VLDB 2003).
[75] Oracle Integration, integration_home.html.
[76] Maria Magdalena Ortiz, Diego Calvanese, and Thomas Eiter. Characterizing data complexity for conjunctive query answering in expressive description logics. In Proc. of the 21st Nat. Conf. on Artificial Intelligence (AAAI 2006).
[77] OWL Web Ontology Language Overview, owl-features/.
[78] Yannis Papakonstantinou, Serge Abiteboul, and Hector Garcia-Molina. Object fusion in mediator systems. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB 96).
[79] Yannis Papakonstantinou, Hector Garcia-Molina, and Jeffrey D. Ullman. MedMaker: A mediation system based on declarative specifications. In Stanley Y. W. Su, editor, Proc. of the 12th IEEE Int. Conf. on Data Engineering (ICDE 96).
[80] Antonella Poggi and Serge Abiteboul. XML data integration with identification. In Proc. of the 10th Int. Workshop on Database Programming Languages (DBPL 2005).
[81] Antonella Poggi and Marco Ruzzi. Filling the gap between data integration and data federation. In Proc. of the 12th Ital. Conf. on Database Systems (SEBD 2004).
[82] Lucian Popa, Yannis Velegrakis, Renée J. Miller, Mauricio A. Hernández, and Ronald Fagin. Translating web data. In Proc. of the 28th Int. Conf. on Very Large Data Bases (VLDB 2002).
[83] Raymond Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. The MIT Press.
[84] Richard B. Scherl and Hector J. Levesque. Knowledge, action, and the frame problem.
Artificial Intelligence, 144(1-2):1–39.
[85] Jayavel Shanmugasundaram, Jerry Kiernan, Eugene J. Shekita, Catalina Fan, and John Funderburk. Querying XML views of relational data. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB 2001), 2001.

[86] Ron van der Meyden. Logical approaches to incomplete information. In Jan Chomicki and Günter Saake, editors, Logics for Databases and Information Systems. Kluwer Academic Publisher.
[87] Marianne Winslett. Reasoning about action using a possible models approach. In Proc. of the 15th Nat. Conf. on Artificial Intelligence (AAAI 98).
[88] Marianne Winslett. Updating Logical Databases. Cambridge University Press.
[89] Ramana Yerneni, Chen Li, Hector Garcia-Molina, and Jeffrey D. Ullman. Computing capabilities of mediators. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data.
[90] Cong Yu and Lucian Popa. Constraint-based XML query rewriting for data integration. In Proc. of the 23rd ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2004).


ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel

More information

Technologies for a CERIF XML based CRIS

Technologies for a CERIF XML based CRIS Technologies for a CERIF XML based CRIS Stefan Bärisch GESIS-IZ, Bonn, Germany Abstract The use of XML as a primary storage format as opposed to data exchange raises a number of questions regarding the

More information

Query Management in Data Integration Systems: the MOMIS approach

Query Management in Data Integration Systems: the MOMIS approach Dottorato di Ricerca in Computer Engineering and Science Scuola di Dottorato in Information and Communication Technologies XXI Ciclo Università degli Studi di Modena e Reggio Emilia Dipartimento di Ingegneria

More information

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents M. Anber and O. Badawy Department of Computer Engineering, Arab Academy for Science and Technology

More information

CSE 233. Database System Overview

CSE 233. Database System Overview CSE 233 Database System Overview 1 Data Management An evolving, expanding field: Classical stand-alone databases (Oracle, DB2, SQL Server) Computer science is becoming data-centric: web knowledge harvesting,

More information

Demonstrating WSMX: Least Cost Supply Management

Demonstrating WSMX: Least Cost Supply Management Demonstrating WSMX: Least Cost Supply Management Eyal Oren 2, Alexander Wahler 1, Bernhard Schreder 1, Aleksandar Balaban 1, Michal Zaremba 2, and Maciej Zaremba 2 1 NIWA Web Solutions, Vienna, Austria

More information

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT Journal of Information Technology Management ISSN #1042-1319 A Publication of the Association of Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION MAJED ABUSAFIYA NEW MEXICO TECH

More information

Distributed Database for Environmental Data Integration

Distributed Database for Environmental Data Integration Distributed Database for Environmental Data Integration A. Amato', V. Di Lecce2, and V. Piuri 3 II Engineering Faculty of Politecnico di Bari - Italy 2 DIASS, Politecnico di Bari, Italy 3Dept Information

More information

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems Proceedings of the Postgraduate Annual Research Seminar 2005 68 A Model-based Software Architecture for XML and Metadata Integration in Warehouse Systems Abstract Wan Mohd Haffiz Mohd Nasir, Shamsul Sahibuddin

More information

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Chapter 1: Introduction. Database Management System (DBMS) University Database Example This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information

More information