Structured and Semi-Structured Data Integration


UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
DOTTORATO DI RICERCA IN INGEGNERIA INFORMATICA, XIX CICLO, 2006

UNIVERSITÉ DE PARIS SUD
DOCTORAT DE RECHERCHE EN INFORMATIQUE

Structured and Semi-Structured Data Integration

Antonella Poggi


UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
DOTTORATO DI RICERCA IN INGEGNERIA INFORMATICA, XIX CICLO

UNIVERSITÉ DE PARIS SUD
DOCTORAT DE RECHERCHE EN INFORMATIQUE

Antonella Poggi

Structured and Semi-Structured Data Integration

Thesis Committee:
Prof. Maurizio Lenzerini (Advisor, Italy)
Prof. Serge Abiteboul (Advisor, France)

Reviewers:
Prof. Bernd Amann
Prof. Alex Borgida
Prof. Riccardo Rosati

AUTHOR'S ADDRESS IN ITALY:
Antonella Poggi
Dipartimento di Informatica e Sistemistica
Università degli Studi di Roma La Sapienza
Via Salaria 113, I Roma, Italy

AUTHOR'S ADDRESS IN FRANCE:
Antonella Poggi
Département d'Informatique
Université de Paris Sud
Orsay Cedex, France

E-MAIL: [email protected]
WWW: poggi/

To Mario


Acknowledgements

Everything started one day in June 2000, when I decided to go on Erasmus to the École Polytechnique in Paris and was advised to attend the database lectures given there by Prof. Abiteboul. When, in January 2001, I decided to follow Prof. Abiteboul's extra lectures in order to present a database project, he was kind enough to sit beside me and teach me how to write my first HTML page: my homepage. This was his way of introducing me to XML! A first internship at I.N.R.I.A. was my first great experience with research, and from that day on, I never gave up dreaming of research. When I came back home, I met Maurizio (whose lectures were the most exciting I ever had) and, thanks to him and Serge, I could participate in VLDB as a volunteer (Rome, Sept. 2001). How could anyone resist loving research after such a wonderful conference? Then I finished my exams, and Maurizio supported me in returning to I.N.R.I.A. for a second internship (my final project, which led to my graduation thesis). On my return, I started collaborating with Maurizio, and he made me love database theory and data integration issues so much that I chose to start my PhD route... Thanks to a European initiative, and thanks to both my advisors, who had to fight against Italian and French bureaucracy, I had the opportunity to do my research jointly between the Roman and the Parisian database groups. This was not always easy... But I was so lucky to find such great researchers, both able to see such an amazing big picture! They were both mentors, fathers and friends. No word can express how much I would like to thank you both, Maurizio and Serge. I can only say once again: Grazie - Merci (as I am now used to concluding all my research talks). I will miss so much being such a favoured PhD student!

Of course, these acknowledgements cannot end without thanking my sweet husband Mario, and my family. Both have been so patient, understanding... and, above all, they have always been with me. I love you, and always will.


Contents

Part I  Antechamber  1

1  Theoretical foundations of DIS
     Logical framework
     Consistency of a DIS
     Query answering over DIS
     Updates over DIS
     Relationship with databases with incomplete information

State of the art of DIS
     Commercial data integration tools
     Global picture of the state of the art
     Main related DIS
          LAV approach
          GAV approach
          GLAV approach

Part II  Ontology-based DIS  21

3  The language DL-Lite FRS
     DL-Lite FRS expressions
     DL-Lite FRS TBox
     DL-Lite FRS ABox
     DL-Lite FRS knowledge base
     Query language

DL-Lite A
     DL-Lite A reasoning
     Storage of a DL-Lite A ABox
          Preliminaries
          Minimal model for a DL-Lite A ABox
          Canonical interpretation
          Closure of negative inclusions
     Satisfiability of a DL-Lite A KB
          Foundations of the algorithm for satisfiability
          4.3.2  Satisfiability algorithm
     Query answering over a DL-Lite A KB
          Foundations of the query answering algorithm
          Query answering algorithm

Consistency and Query Answering over Ontology-based DIS
     DL-Lite A ontology-based DIS
          Linking data to DL-Lite A objects
          Logical framework for DL-Lite A DIS
     Overview of the consistency and query answering method
          The notion of virtual ABox
          A naive bottom-up approach
          A top-down approach
     Relevant notions from logic programming
     DL-Lite A DIS consistency and query answering
          Modularizability
          Consistency algorithm
          Query answering algorithm
     Computational complexity

Updates of Ontologies at the Instance Level
     The DL-Lite FS language
     Instance-level ontology update
     Computing updates in DL-Lite FS ontologies

Part III  XML-based DIS  97

7  The setting
     Data model
     Tree Type Constraints and schema language
     Prefix Queries

XML-based DIS
     XML DIS logical framework
     Identification
     XML DIS consistency
     XML DIS query answering
     Lower bound for query answering under exact mappings

Incomplete trees
     Query answering using incomplete trees
     Query answering algorithms
          Algorithm under VKR and no key constraints
          Algorithm under Id G, sound and complete mappings

Conclusion  149

Bibliography  158


Part I

Antechamber


Data integration is a huge area of research concerned with the problem of combining data residing at heterogeneous, autonomous and distributed data sources, and providing the user with a unified virtual view of all this data. Today's fast and continuous growth of large business organizations, often deriving from mergers of smaller enterprises, creates an increasing need for integrating and sharing large amounts of data coming from a number of heterogeneous and distributed data sources. Such needs are also shown by other applications, like information systems for administrative organizations, life sciences research, and many others. Moreover, it is not infrequent that different parts of the same organization adopt different systems to produce and maintain critical data. Clearly, data integration is a challenge in all these kinds of situations. Furthermore, it has become even more attractive thanks to the ubiquitous spread of the World Wide Web and the access to information it provides. Hence, during the last decade, research and business interest has migrated from DataBase Management Systems, DBMS (Codd, 70s [37]), to Data Integration Systems (DIS). Whereas the former make a unique local data source accessible through a schema, the latter offer the necessary framework to combine the data from a set of heterogeneous and autonomous sources through a so-called global schema (or mediated schema). Thus, the global schema does not contain data by itself, but provides a reconciled, integrated and virtual view of the underlying sources, which in contrast contain the actual data. We stress that, since the global schema acts as the interface through which the user accesses the data, the choice of the language for expressing and querying such a schema is crucial.

In particular, whereas research on the topic has already produced several DIS, rather few of them represent an appropriate trade-off between the expressive power of the languages for specifying the global schema and querying the system, and the efficiency of query answering. Nevertheless, both these aspects deserve to be considered simultaneously. Indeed, the issue of providing a rich set of semantic constraints over the global schema becomes more and more crucial, as one wants to use at least basic conceptual modeling constructs for one's application. On the other hand, offering an expressive query language and allowing for efficient query answering over typically large amounts of data are obvious requirements of such kinds of systems. In this thesis, we focus on the study of hierarchical DIS, where the global schema acts as a client of the data sources, as opposed to Peer-to-Peer DIS, where the global schema acts both as a client and a server for other DIS. In particular, motivated by the challenges discussed above, we investigate both structured and semi-structured data integration, in the two major contexts of ontology-based data integration and XML-based data integration. On the one hand, ontology-based DIS are characterized by a

global schema described at the intensional level of an ontology, i.e., a shared conceptualization of a domain of interest. The main issue here is that query answering in typical ontology languages is extremely costly with respect to the size of the data. Notably, we propose a setting where answering queries over the ontology-based DIS is LOGSPACE in data complexity. On the other hand, XML-based DIS are characterized by an expressive global schema. This is a novel setting, not much investigated yet. The main issue here concerns the presence of a significant set of integrity constraints expressed over the schema, and the concept of node identity, which requires particular attention when data come from autonomous data sources. In particular, in both contexts, our contribution consists in formally approaching the following issues.

The modeling issue, which requires providing the user with all that is needed for modeling the DIS. More precisely, the user is given (i) a language for specifying the global schema, (ii) a language for specifying the set of source schemas, and (iii) a formalism to specify the relationship existing between the data at the sources and the elements of the global schema.

The query answering issue, which is concerned with the basic service offered by a DIS, namely the ability to answer queries posed over the DIS global schema. We provide an appropriate query language and algorithms to answer queries posed to the DIS. Also, we study the complexity of the problem in both contexts, under a variety of assumptions on the DIS specification.

Since sources are in general autonomous, we also investigate the problem of detecting inconsistencies among data sources, a problem which is most of the time ignored in DIS research, thus resulting in a quite unrealistic setting. Finally, we begin the investigation of updates of DIS, in the context of ontology-based DIS. This concerns the problem of accepting updates expressed in terms of the global schema, aiming at reflecting them by changes at the source data level. This is the first investigation we are aware of that goes in this challenging direction.

Our research has been carried out under the joint supervision of the Department of Computer Science of the University of Rome La Sapienza and the GEMO INRIA-Futurs project, resulting from the merger of the INRIA-Rocquencourt Verso project and the IASI group of the University of Paris-Sud. The thesis is organized as follows. The first part serves as an introduction to the theoretical foundations of our approach to DIS, and a motivation for it. The second part is devoted to the examination of ontology-based DIS, while the third part is concerned with XML-based DIS.

Chapter 1

Theoretical foundations of DIS

In this chapter, we introduce the main theoretical foundations underlying our investigation of DIS [63]. Specifically, we start by setting up a logical framework for data integration. Then we present the main issues related to DIS that will be the focus of our attention, namely consistency checking and query answering. Afterwards, we introduce the problem of performing updates over DIS. Finally, we discuss the relationship existing between DIS and databases with incomplete information [58].

1.1 Logical framework

As already mentioned, in this work we are interested in studying DIS, whose aim is combining data residing at different sources and providing the user with a unified view of these data. Such a unified view is represented by the global schema. Thus, one of the most important aspects in the design of a DIS is the specification of the correspondence between the data at the sources and the elements of the global schema. Such a correspondence is modeled through the notion of mapping. It follows that the main components of a data integration system are the global schema, the sources, and the mapping. Thus, we formalize a data integration system Π in terms of a triple ⟨G, S, M⟩, where:

G is the global schema, expressed in a language L_G over an alphabet A_G. The alphabet comprises a symbol for each element of G (i.e., a relation if G is relational, a concept or a role if G is a Description Logic ontology, a label if G is an XML DTD, etc.).

S is the source schema, expressed in a language L_S over an alphabet A_S. The alphabet A_S includes a symbol for each element of the sources.

M is the mapping between G and S, consisting of a set of assertions, each having the form (q_S, q_G, as) or (q_G, q_S, as), where q_S and q_G are two queries of the same arity, respectively over the source schema S and over the global schema G, and as may assume the value sound,

complete, or exact. Queries q_S are expressed in a query language L_{M,S} over the alphabet A_S, and queries q_G are expressed in a query language L_{M,G} over the alphabet A_G. The value as models the accuracy of the mapping. Note that the definition above has been taken from [63], and it is general enough to capture all approaches in the literature, including in particular the DIS considered in this thesis.

We call database a set of collections of data. We say that a source database (also referred to as a set of data sources) D = {D_1, ..., D_m} conforms to a schema S = {S_1, ..., S_m} if D_i is an instance of S_i for i = 1, ..., m (where clearly the notion of D_i being an instance of S_i depends on the language L_S for expressing S). Moreover, we call global database an instance of the global schema G over a domain Γ.[1] Thus, given a set of sources D conforming to S, we call the set of legal databases for Π w.r.t. D, denoted sem(Π, D), the set of databases B such that: B is a global database, and B satisfies the mapping M w.r.t. D. Clearly, the notion of B satisfying M w.r.t. D depends on the semantics of the mapping assertions. Intuitively, the assertion (q_S, q_G, as) means that the concept represented by the query q_S over the sources D corresponds to the concept in the global schema represented by the query q_G, with the accuracy specified by as. Formally, let q be a query of arity n and DB a database. We denote by q^DB the set of n-tuples in DB that satisfy q. Then, given a set of data sources D conforming to S and a global database B, we say that B satisfies M w.r.t. D if for each M_i in M of the form (q_S, q_G, as) we have that:

if as = sound, then q_S^D ⊆ q_G^B;
if as = complete, then q_S^D ⊇ q_G^B;
if as = exact, then q_S^D = q_G^B.

Typically, sources in DIS are considered sound. This will also be the assumption we make in the investigation of ontology-based DIS.
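The three accuracy values amount to simple set comparisons between query answers. The following Python fragment is a minimal sketch of that check; the function name and the convention of passing the two answer sets precomputed are ours, not the thesis's, and the containment direction follows the standard reading of sound/complete mappings.

```python
def satisfies(ans_qS_D, ans_qG_B, accuracy):
    """Does global database B satisfy the assertion (q_S, q_G, accuracy)
    w.r.t. source data D?  ans_qS_D is q_S evaluated over D, ans_qG_B is
    q_G evaluated over B, both given as sets of tuples."""
    if accuracy == "sound":       # q_S^D is contained in q_G^B
        return ans_qS_D <= ans_qG_B
    if accuracy == "complete":    # q_S^D contains q_G^B
        return ans_qS_D >= ans_qG_B
    if accuracy == "exact":       # the two answer sets coincide
        return ans_qS_D == ans_qG_B
    raise ValueError("unknown accuracy: " + accuracy)

src = {("ada",)}                  # q_S^D
glob = {("ada",), ("alan",)}      # q_G^B: B may hold tuples beyond the sources
print(satisfies(src, glob, "sound"))    # True
print(satisfies(src, glob, "exact"))    # False
```

Under a sound mapping, the global database is free to contain more tuples than the sources provide, which is exactly why several legal global databases may exist.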
In contrast, in the XML-based context, we will also study the case of exact mappings, which appear to be useful when one considers a data source as an authority providing exactly all the information about a certain topic. On the other hand, we do not consider the case of complete mappings, since it appears less interesting in practice. Note that the different forms of mappings have led to the following characterization of the approaches to data integration in the literature [53]:

In the Local-As-View (LAV) approach, mappings in M have the form (s, q_G, as), where s is an element of the source schema.

[1] In particular, in this thesis, we consider the case of a global database being a first-order logic model (Δ^I, ·^I) of G, if G is the intensional level of a Description Logic (DL) [21] ontology, or an XML document satisfying G, if G is a DTD provided with a set of integrity constraints.

In the Global-As-View (GAV) approach, they have the form (q_S, g, as), where g is an element of the global schema.

In the Global-and-Local-As-View (GLAV) approach, no particular assumption is made on the form of mappings.

Clearly, the LAV approach favors the extensibility of the system, since adding a new source simply requires enriching the mapping with a new assertion, without other changes. On the other hand, the GAV approach has a more procedural flavor, since it tells the system how to use the sources to retrieve the data. Before concluding this presentation of the logical framework for data integration, we observe that, no matter what the interpretation of the mapping is, in general several global databases exist that are legal for Π with respect to D. This observation motivates the relationship between data integration and databases with incomplete information [86], which will be discussed in Section 1.5.

1.2 Consistency of a DIS

Given a data integration system Π = ⟨G, S, M⟩ and a set of sources D conforming to S, it may happen that no legal database exists satisfying both the global schema constraints and the mapping w.r.t. D, i.e., sem(Π, D) = ∅. We then say that the system is inconsistent w.r.t. D. It is worth noting that this kind of situation is particularly critical since, as we will see, it makes query answering meaningless. Despite its importance, this situation is often glossed over in data integration systems, or dealt with by means of a-priori and ad-hoc transformations and cleaning procedures applied to the data retrieved from the sources (e.g., [44]). Here we address the problem from a more theoretical perspective. In particular, we believe that the first step in dealing with inconsistencies is obviously to detect whether they occur. Thus, we study the problem of deciding whether a system is consistent w.r.t. a set of data sources. Such a problem can be formulated as follows:

PROBLEM: DIS CONSISTENCY
INPUT: A data integration system Π = ⟨G, S, M⟩, a set of data sources D conforming to S
QUESTION: Is there a database B legal for Π w.r.t. D?

In both ontology-based and XML-based DIS, we will study DIS consistency, show it is decidable, examine its complexity, and provide algorithms to solve it. However, we do not consider in this thesis the problem of reconciling the data at the sources, i.e., modifying the data retrieved from the sources so that the system becomes consistent. This is a challenging issue that we intend to address in the future.
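The decision problem above can be illustrated by a brute-force sketch over a toy, finite space of candidate global databases. This is purely didactic (real schema languages do not admit such enumeration); the fact encoding, the two predicate arguments, and all names are our own assumptions.

```python
from itertools import chain, combinations

def powerset(universe):
    s = list(universe)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def is_consistent(universe, mapping_ok, schema_ok):
    """sem(Pi, D) is nonempty iff some candidate global database B, built
    from a finite universe of facts, satisfies both the global schema
    constraints and the mapping w.r.t. D."""
    return any(schema_ok(set(b)) and mapping_ok(set(b))
               for b in powerset(universe))

# Toy instance over a unary global relation g: the schema forbids g(a) and
# g(b) holding together, while a sound mapping forces g(a).
universe = {"g(a)", "g(b)"}
schema_ok = lambda b: not ({"g(a)", "g(b)"} <= b)
mapping_ok = lambda b: "g(a)" in b
print(is_consistent(universe, mapping_ok, schema_ok))  # True: B = {g(a)} is legal
```

If the mapping instead forced both facts, no candidate would pass the schema check and the system would be inconsistent w.r.t. D.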

1.3 Query answering over DIS

The basic service offered by a DIS is query answering, i.e., the ability to answer queries that are posed in terms of the global schema G and are expressed in a language L_q over the alphabet A_G. Given a DIS Π = ⟨G, S, M⟩ and a set of data sources D conforming to S, the certain answers q(Π, D) to a query q posed over Π w.r.t. D is the set of tuples t of elements of Γ (i.e., the domain of the instances of G) such that t ∈ q^B for every legal database B w.r.t. Π, or equivalently:

q(Π, D) = { t | t ∈ q^B for each B ∈ sem(Π, D) }

Query answering can be tackled under two different forms. In particular, under the so-called recognition form, it is formulated as follows:

PROBLEM: QUERY ANSWERING (RECOGNITION)
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, query q, and tuple t of elements of Γ
QUESTION: Is t in q(Π, D)?

Other times, query answering assumes a more ambitious form and aims at finding the entire set of certain answers. Thus, it is formulated as follows:

PROBLEM: QUERY ANSWERING (FULL SET)
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, query q
QUESTION: Find all t such that t ∈ q(Π, D).

As for DIS consistency, in our investigation we will study DIS query answering under different assumptions, show it is decidable, examine its complexity, and provide algorithms to solve it. Note in particular that in both formulations of the query answering problem, we assume a consistent DIS. Indeed, in this thesis, we are not concerned with the problem of answering queries in the presence of mutually inconsistent data sources. One possibility to address such a problem is to follow an approach in the spirit of [62], where the authors advocate the use of an approximate semantics for mappings.
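When the set of legal databases happens to be finite, the definition of certain answers reduces to an intersection of answer sets, as the following illustrative sketch shows (real DIS may have infinitely many legal databases, which is why actual algorithms work on a finite representation instead; the function name is ours):

```python
def certain_answers(answers_per_legal_db):
    """Certain answers over a finite set of legal databases: a tuple is a
    certain answer iff it belongs to q^B for EVERY legal database B."""
    dbs = list(answers_per_legal_db)
    return set.intersection(*dbs) if dbs else set()

# q^B for two legal databases: only ('a',) is answered by both.
legal = [{("a",), ("b",)}, {("a",), ("c",)}]
print(sorted(certain_answers(legal)))   # [('a',)]
```

The recognition form of the problem then corresponds to a membership test in this intersection, while the full-set form corresponds to computing the intersection itself.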
1.4 Updates over DIS

In this section, we introduce write-also DIS, i.e., DIS that allow for performing updates expressed over the global schema. Several approaches to update have been proposed in the literature; see, e.g., [39] for a survey. In particular, different change

operators are appropriate depending on whether the change is a revision [20], i.e., a correction of the actual state of beliefs, or an update [88], reflecting a change in the world. In this section, even though we use the term update, we do not aim at advocating the use of one particular approach. On the contrary, we assume an arbitrary update operator, denoted ◦. Moreover, we assume an update F expressed as a formula in terms of G, which intuitively is sanctioned to be true in the new state, i.e., it is inserted in the updated DIS specification. Thus, given a DIS Π = ⟨G, S, M⟩, a set of data sources D conforming to S, and the update F, once the operator is applied with F to the set of legal databases for Π w.r.t. D, we obtain a new set of databases, however characterized, reflecting the change F. Note that we are interested in instance-level updates. This means that we assume that the specification of Π is invariant, whereas the update reflects a change that occurs at the sources D. Thus, in particular, we consider an update of Π with a set F of facts having the form g(t), where t is an n-tuple of elements of Γ and g is an element of G, meaning that the change consists in t being an instance of g. Thus, we formulate the problem of updating a DIS as follows:

PROBLEM: EXPRESSIBLE UPDATE
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, set of facts F
QUESTION: Is there D′ such that sem(Π, D′) = sem(Π, D) ◦ F?

The above formulation is general enough to capture all approaches to update that have been proposed in the literature. However, it raises at least the following considerations.

Typically, the user of a DIS is not the owner of the data sources, and thus does not have the right to modify their content. This is probably the reason why, as far as we know, DIS update has not yet been considered as an issue. However, we believe that a DIS should be able to provide the appropriate infrastructure to allow the user to perform an instance-level update without changing the data at the sources. This could be achieved, for instance, by using internal proprietary sources.

What if no set of data sources exists solving the update problem formulated above (not even proprietary sources)? As usual, one possibility would be to relax the semantics of the update. Indeed, we might be interested in reasoning, e.g., answering queries, over the DIS resulting from the update. To do so, we do not necessarily need to materialize a new set of data sources; we could instead reason on the original DIS by taking the update into account in a virtual way. In a sense, this is analogous to the distinction between projection via regression vs. progression in reasoning about actions [83].

Both the considerations above have motivated the beginning of our work on DIS update. So far, we have started tackling the problem for ontology-based DIS (cf. Chapter 6).
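The EXPRESSIBLE UPDATE question can be mimicked by a brute-force search over a toy, finite space of source databases. This sketch is purely illustrative: real source spaces are not enumerable, and the one-legal-database semantics below is an assumption of ours chosen to keep the example tiny.

```python
def find_expressing_sources(candidates, sem, target):
    """Search a finite space of candidate source databases for some D'
    whose set of legal global databases equals the target set."""
    for d in candidates:
        if sem(d) == target:
            return d
    return None

# Toy semantics: each source database D induces exactly one legal global
# database, namely D itself (facts are plain strings like "g(a)").
sem = lambda d: {frozenset(d)}
candidates = [set(), {"g(a)"}, {"g(a)", "g(b)"}]
# Updating D = {g(a)} with the fact g(b) should be expressed by D' = {g(a), g(b)}:
target = {frozenset({"g(a)", "g(b)"})}
print(sorted(find_expressing_sources(candidates, sem, target)))  # ['g(a)', 'g(b)']
```

When the search returns no candidate, the update is not expressible in that space, which is precisely the situation motivating the relaxed, virtual treatment of updates discussed above.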

1.5 Relationship with databases with incomplete information

Before concluding this introductory chapter on the theoretical foundations of our approach to data integration, we briefly discuss the strong connection existing between DIS and databases with incomplete information. Specifically, a database with incomplete information can be viewed as a set of possible states of the real world. Similarly, given a set of data sources, a DIS represents a set of possible databases. Thus, when a query is posed over a database with incomplete information or a DIS, the problem arises of posing the query over a possibly infinite set of database states. It follows that, in order to solve query answering over a DIS, one possibility is to find a finite representation of the set of possible databases and to provide algorithms to answer queries over such a representation. Indeed, this is the main idea underlying both the works presented in this thesis. Note, in particular, that this approach recalls the one proposed in a landmark paper by Imieliński and Lipski [58], which consists in answering queries over a database with incomplete information by exploiting the notion of representation system. Moreover, interestingly, in [4], the same approach is extended to deal with updates over databases with incomplete information.
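The "finite representation" idea can be made concrete with a miniature example in the style of tables with nulls: one incomplete table stands for all the complete databases obtained by filling in the unknown values, and a tuple is a certain answer only if it qualifies under every such completion. The encoding below (a sentinel object for nulls, a single selection query) is our own simplification, not the representation systems actually used in the thesis.

```python
NULL = object()   # a labeled null: the value exists but is unknown

def certain_select(rows, col, value):
    """Rows satisfying 'column col = value' in EVERY completion of the
    incomplete table, i.e. independently of how the nulls are filled in."""
    return [r for r in rows if r[col] is not NULL and r[col] == value]

people = [("ada", "cs"), ("alan", NULL)]    # alan's department is unknown
print(certain_select(people, 1, "cs"))       # [('ada', 'cs')]
```

Alan is excluded even though some completions would put him in "cs": membership there depends on how the null is filled, so it is possible but not certain.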

Chapter 2

State of the art of DIS

As already discussed, data integration has emerged as a pervasive challenge in the last decade. Such a success recalls the crucial impact of DBMS, proven by the large number of DBMS scattered all around the world. However, while the success of relational DBMS represents a great exception to the usual bottom-up process of emerging technologies, since it was preceded by a deep understanding and a wide acceptance of the relational model and the related theory, the interest in data integration systems grew contemporaneously in both the business and research communities. In particular, it led to the implementation of systems without having yet a deep understanding of all the intricate issues involved, concerning design-time as well as run-time aspects [54]. Clearly, it would be unrealistic to aim at being comprehensive while discussing the state of the art of such a huge field. Thus, in this chapter, we start by briefly discussing the commercial solutions to the need for integrating data. Afterwards, we contextualize our contribution in the global picture of the state of the art in the data integration research field. Finally, according to such a global picture, we discuss in more detail the works that are most closely related to our investigation.

2.1 Commercial data integration tools

Recently, some software solutions to the need for integrating data have emerged, suggesting the adoption of a DBMS as a kind of middleware infrastructure that uses a set of software modules, called wrappers, to access heterogeneous data sources [51]. Wrappers hide the native characteristics of each source, masking them under the appearance of a common relational table. Furthermore, their aim is to mediate between the federated database and the sources, mapping the data model of each source to the federated database data model, and transforming operations over the federated database into requests that the source can handle. Examples of commercial products following this kind of approach are Oracle Integration [75] and DB2 Information Integrator (DB2II) [74]. Obviously, the two are based on the Oracle and IBM DBMS, respectively. Even though remarkable from the point of view of the number of different types of data sources supported, as well as from the point of view of query optimizations,

these products are essentially data federation tools that are still far from data integration systems theory as it is by now well established in the scientific database community. Indeed, as we argued in [81], they actually allow the user to combine data coming from heterogeneous and autonomous sources, but do not provide the user with a unified view that is (logically) independent of the sources. It is worth noticing, however, that data federation tools can be used as the essential underlying environment on top of which one can build a DIS. In particular, we show in [81] how to implement a DIS based on a relational schema by means of a commercial tool for data federation. In a nutshell, this is obtained by: (i) producing an instance of a federated database through the compilation of a formal DIS specification as formalized in the previous chapter; (ii) translating the user queries posed over the global schema, so as to issue them to the federated database. Even though interesting in order to highlight the mismatch between commercial products and the research prototypes currently available, this approach is clearly far from solving the main challenge addressed in this thesis, since it allows for a limited expressive power of the global schema (without constraints) and requires following a GAV approach.

2.2 Global picture of the state of the art

In this section, we aim at giving a global picture of the state of the art in data integration and at contextualizing our contribution with respect to this global picture. From the previous chapter, it follows that a DIS specification depends on the following aspects:

the data model chosen for the global database;
the language used to express the global schema, i.e., the set of constraints characterizing it;
the approach followed to specify the mapping, i.e., GAV, LAV or GLAV;
the accuracy of the mappings (or equivalently of the data sources), i.e., sound or exact (as we already argued, complete mappings are less interesting in practice).

Another aspect deserving to be considered when classifying DIS is the architectural paradigm used. As already mentioned, in this thesis we focus on hierarchical DIS, where it is possible to clearly distinguish between two different roles played on one hand by the global schema, which is accessed by the user and does not contain data by itself, and on the other hand by the underlying sources, which contain the actual data. Another paradigm has recently been emerging for DIS, as well as for other distributed systems, namely the Peer-to-Peer (P2P) paradigm. Put in an abstract way, P2P DIS are characterized by an architecture consisting of various autonomous nodes (called peers) which hold information, and which are linked to other nodes by means of mappings. Each node therefore provides part of the overall information available from a distributed environment and acts both as a client and as a server in the system, without relying on a single global view. However, in some sense, P2P data integration

systems can be considered as the natural extension of hierarchical data integration systems, since each node of the system may itself be considered as an extended hierarchical DIS that includes, besides the mapping to local data sources, an external mapping to other nodes' schemas.[1] Note that, since research in P2P data integration is still quite young, no commercial product has really emerged yet.

Table 2.1 summarizes the state of the art in data integration. More precisely, it classifies the main integration systems according to the features discussed above. Thus, it stresses the systems that are closest to our investigation and can therefore be compared with our study. In the next two sections we describe some of these systems, focusing on those whose global schema is specified by means of (i) a Description Logic (and thus can be considered as DIS based on the relational model, characterized by a significant set of semantic constraints), and (ii) XML[2] (and thus a semi-structured data model). It is worth noting that, in Table 2.1, we consider neither Data Warehousing systems nor Data Exchange systems which, even though related to DIS, are based on a different form of data interoperability. Indeed, their aim is to export a materialized instance of the global schema, whereas DIS are characterized by a global schema that is virtual. In particular, data exchange is the problem of moving and restructuring data from a generally unique data source to the global schema (called the target schema), given the specification of the mapping (called source-to-target dependencies) between the source and the target. Data exchange has recently become an active research topic due to the increased need for exchanging data in various formats, typically in e-business applications [9]. Papers [41, 40] laid the theoretical foundation of the exchange of relational data, and several follow-up papers studied various issues in data exchange, such as schema mapping composition [11].

2.3 Main related DIS

We next discuss the DIS in the literature that are most comparable to our investigation, because, e.g., of the expressivity of their global schema (cf. Table 2.1). In particular, we classify such systems on the basis of the approach followed for mapping specification. Note that, despite the greatly increasing interest in XML from both business and research, little previous work has addressed XML-based data integration issues as defined and studied here. In contrast, considerable work has addressed XML publishing systems, and some initial work has focused on basic theoretical XML data exchange issues. Both these kinds of work are somehow orthogonal to our investigation since, besides assuming a materialized global schema, they consider a unique data source. Hence, they were not presented in Table 2.1. However, in the XML setting, where not much work has addressed even basic data integration issues, they appear relevant. Thus, we will present some of them.

[1] Clearly, this is only an abstraction, since the possible presence of cycles among peers complicates P2P DIS notably and introduces new challenging issues (see, e.g., [28]).
[2] The reader is assumed to be familiar with the notation and terminology of the relational model [5], XML [2] and DLs [14].

Table 2.1: DIS state of the art

Paradigm     | Data model      | Constraints                         | Mapping approach | Mapping accuracy | Example
Hierarchical | Relational      | Inclusions, ...                     | LAV              | sound            | Information Manifold [60]
Hierarchical | Relational      | Inclusions, ...                     | GAV              | sound            | PICSEL [48]
Hierarchical | Relational      | Functional, inclusions              | GAV              | sound            | IBIS [24], INFOMIX [64]
Hierarchical | Semi-structured |                                     | GAV              | sound            | TSIMMIS [45]
Hierarchical | Semi-structured |                                     | LAV              | exact, sound     | [34]
Hierarchical | Object-oriented | keys                                | LAV              | sound            | STYX [8]
Hierarchical | XML             | DTD                                 | LAV              | sound            | Agora [73]
Hierarchical | XML             | XML Schema types and functional ... | GLAV             | sound            | [90]
P2P          | Relational      | keys, foreign keys                  | GLAV             | sound            | [32]
P2P          | XML             |                                     | GLAV             | exact, sound     | Piazza [55]
P2P          | XML             | Keys                                | GLAV             | exact, sound     | ActiveXML [1]

CHAPTER 2. STATE OF THE ART OF DIS

LAV approach

Information Manifold

Information Manifold (IM) [67] is a DIS developed at AT&T, based on the CARIN Description Logic [66]. CARIN combines a Description Logic allowing for disjunction of concepts and role number restrictions with function-free Horn rules. Thus, IM handles the presence of inclusion dependencies over the global schema, and uses conjunctive queries as the language for querying the system and specifying sound LAV mappings. The main distinguishing feature of IM is the use of the bucket algorithm for query answering. In order to illustrate it, we first recall that in LAV the mappings between the sources and the global schema are described as a set of views over the global schema. Thus, query processing amounts to finding a way to answer a query posed over a database schema using a set of views over the same schema. This problem, called answering queries using views, is widely studied in the literature, since it has applications in many areas (see e.g. [53] for a survey). The most common approach proposed to deal with answering queries using views is query rewriting. In query rewriting, a query and a set of view definitions over a database schema are provided, and the goal is to reformulate the query into an expression, the rewriting, whose evaluation on the view extensions supplies the answer to the query. Thus, query answering via query rewriting is divided into two steps: the first one reformulates the query in terms of the given query language over the alphabet of the views (possibly augmented with auxiliary predicates), and the second one evaluates the rewriting over the view extensions. Clearly, the set of available sources may in general not store all the data needed to answer a user query, and therefore the goal is to find a rewriting that provides the maximal set of answers that can be obtained from the views.
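The bucket-style construction at the core of this approach can be sketched as follows — a minimal illustration with hypothetical predicate and view names; the actual bucket algorithm additionally checks containment of each candidate in the original query and keeps only the sound ones:

```python
from itertools import product

# Conjunctive queries and views as (head_vars, body), with body a list of
# (predicate, args) atoms.  All names below are hypothetical.

def buckets(query_body, views):
    """Step 1 of the bucket algorithm: for each query subgoal, collect the
    views whose body mentions the subgoal's predicate (a necessary
    condition for the view to be relevant to that subgoal)."""
    return [[name for name, (_, vbody) in views.items()
             if any(vp == pred for vp, _ in vbody)]
            for pred, _ in query_body]

def candidate_rewritings(query_body, views):
    """Step 2, simplified: pick one view per bucket.  The full algorithm
    then tests each candidate for containment in the query."""
    return list(product(*buckets(query_body, views)))

# Two hypothetical LAV views over a global schema with emp and dept:
views = {"V1": (("x", "y"), [("emp", ("x", "d")), ("dept", ("d", "y"))]),
         "V2": (("x",), [("emp", ("x", "d"))])}
query_body = [("emp", ("e", "d")), ("dept", ("d", "n"))]
```

Here `buckets(query_body, views)` yields `[["V1", "V2"], ["V1"]]`: both views can contribute to the `emp` subgoal, while only `V1` covers `dept`.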
The bucket algorithm, presented in [65], is actually a query rewriting algorithm that is proved to be sound and complete with respect to the problem of answering user queries (under a first-order logic formalization of the system) only in the absence of integrity constraints on the global schema; it is in general not complete when integrity constraints are issued on it.

StyX

According to Table 2.1, StyX [8] is based on the use of an object-oriented global schema describing the intensional level of an ontology as a labeled graph, whose nodes represent concepts and whose edge labels represent either roles (i.e. relationships) between concepts or inclusion assertions. As for constraints, StyX allows the specification of a set of keys over the global schema. On the other hand, StyX allows the integration of XML data sources. These are described in terms of path-to-path mapping rules that associate paths in the XML source with paths in the global schema. Thus, StyX follows the LAV approach. It addresses the problem of query rewriting in the presence of sound LAV mappings. StyX suggests an appealing way of merging the two parts of this thesis. However, this would require first an analysis of the properties of the StyX query answering algorithm (e.g. completeness), and second a deep understanding of the impact of introducing in

the StyX global schema a set of constraints comparable to ours. This is all the more an issue, given that StyX is not concerned with the detection of inconsistencies among data sources.

Agora

Agora [73] is an XML-based DIS whose global schema is specified by means of an XML DTD (without any additional integrity constraints). Moreover, Agora is characterized by a set of sound mappings that follow the LAV approach. More precisely, mappings are defined in terms of an intermediate virtual, generic, relational schema that closely models the generic structure of the XML global schema, rather than in terms of the XML global schema itself. Thus, the Agora query processing technique is based on query rewriting, which is performed via a translation first to the generic relational schema, and then by employing traditional relational techniques for answering queries using views. Note that, because of the translation, queries and mappings can be quite complex and hard for a human user to understand and define.

GAV approach

The TSIMMIS Project

TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) is a joint project of Stanford University and the Almaden IBM database research group [36]. It is based on an architecture that presents a hierarchy of wrappers and mediators, in which wrappers convert data from each source into a common data model called OEM (Object Exchange Model), and mediators combine and integrate data exported by wrappers or by other mediators. Hence, the global schema is essentially constituted by the set of OEM objects exported by wrappers and mediators. Mediators are defined in terms of a logical language called MSL (Mediator Specification Language), which is essentially Datalog extended to support OEM objects. OEM is a semistructured and self-describing data model, in which each object has an associated label, a type for the value of the object, and a value (or a set of values).
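A minimal sketch of an OEM object (hypothetical data; the point is that each object is self-describing, carrying its label and type alongside its value):

```python
# An OEM object is self-describing: it bundles a label, a type and a value,
# where the value may itself be a set of nested OEM objects.
def oem(label, type_, value):
    return {"label": label, "type": type_, "value": value}

# A hypothetical object exported by a wrapper, describing a person:
person = oem("person", "set", [
    oem("name", "string", "Alice"),
    oem("office", "int", 252),
])

# Being self-describing, the structure can be inspected without a schema:
labels = [child["label"] for child in person["value"]]
```

Here `labels` is `["name", "office"]`; a mediator can navigate exported objects by label without any prior schema agreement with the source.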
User queries are posed in terms of objects synthesized at a mediator or directly exported by a wrapper. They are expressed in MSL or in a specific query language called LOREL (Lightweight Object REpository Language), an object-oriented extension of SQL. Each query is processed by a module, the Mediator Specification Interpreter (MSI) [79, 89], consisting of three main components:

- The View Expander, which uses the mediator specification to reformulate the query into a logical plan, by expanding the objects exported by the mediator according to their definitions. The logical plan is a set of MSL rules which refer to information at the sources.

- The Plan Generator, also called Cost-Based Optimizer, which develops a physical plan specifying which queries will be sent to the sources, the order in which they will be processed, and how the results of the queries will be combined in order to derive the answer to the original query.

29 2.3. MAIN RELATED DIS 17 The Execution engine, which executes the physical plan and produces the answer. The problem of query processing in TSIMMIS in the presence of limitations in accessing the sources is addressed in [68] by devising a more complex Plan Generator comprising three modules: a matcher, which retrieves queries that can process part of the logical plan; a sequencer, which pieces together the selected source queries in order to construct feasible plans; an optimizer, which selects the most efficient feasible plan. It has to be stressed that in TSIMMIS no global integration is ever performed. Each mediator performs integration independently. As a result, for example, a certain concept may be seen in completely different and even inconsistent ways by different mediators. This form of integration can be called query-based, since each mediator supports a certain set of queries, i.e., those related to the view it provides. The IBIS system The Internet-Based Information System (IBIS) [25] is a tool for the semantic integration of heterogeneous data sources, developed in the context of a collaboration between the University of Rome La Sapienza and CM Sistemi. IBIS adopts innovative solutions to deal with all aspects of a complex data integration environment, including source wrapping, limitations on source access, and query answering under integrity constraints. IBIS uses a relational global schema to query the data at the sources, and is able to cope with a variety of heterogeneous data sources, including data sources on the Web, relational databases, and legacy sources. Each nonrelational source is wrapped to provide a relational view on it. Also, IBIS mappings follow the GAV approach and each source is considered sound. 
The system allows for the specification of integrity constraints on the global schema; in addition, IBIS considers the presence of some forms of constraints on the source schemas, in order to perform runtime optimization during data extraction. In particular, key and foreign key constraints can be specified on the global schema, while functional dependencies and full-width inclusion dependencies, i.e., inclusions between entire relations, can be specified on the source schemas. Query processing in IBIS is separated into three phases:

1. the query is expanded to take into account the integrity constraints on the global schema;

2. the atoms in the expanded query are unfolded according to their definitions in terms of the mapping, obtaining a query expressed over the sources;

3. the expanded and unfolded query is executed over the retrieved source databases, whose data are extracted by the Extractor module, which retrieves from the sources all the tuples that may be used to answer the original query.
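The unfolding phase can be sketched in minimal form — a hypothetical illustration in which each global relation is mapped to a query over the sources, and every atom of the user query is replaced by its definition:

```python
# GAV unfolding, in minimal form: each atom over the global schema is
# replaced by its mapping definition over the sources.  Atoms are
# (predicate, args) pairs; all predicate names are hypothetical.

def unfold(query_body, mapping):
    unfolded = []
    for i, (pred, args) in enumerate(query_body):
        params, source_atoms = mapping[pred]
        subst = dict(zip(params, args))  # bind mapping parameters to query terms
        for src_pred, src_args in source_atoms:
            # mapping variables not bound by the atom's arguments are
            # existential: give them fresh names local to this occurrence
            unfolded.append((src_pred,
                             tuple(subst.get(v, f"{v}_{i}") for v in src_args)))
    return unfolded

# employee(x, y) is defined as the join of two source relations:
mapping = {"employee": (("x", "y"),
                        [("s1_emp", ("x", "z")), ("s1_dept", ("z", "y"))])}
```

Unfolding the single-atom query body `[("employee", ("e", "d"))]` yields `[("s1_emp", ("e", "z_0")), ("s1_dept", ("z_0", "d"))]`, a query expressed purely over the sources.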

Query unfolding and execution are the standard steps of query processing in GAV data integration systems, while for the expansion phase IBIS makes use of the algorithm presented in [23].

INFOMIX and DIS@DIS

INFOMIX [64] is a semantic integration system that provides solutions for GAV data integration of heterogeneous data sources (e.g., relational, XML, HTML), accessed through relational global schemas over which powerful forms of integrity constraints can be specified (e.g., key, inclusion, and exclusion dependencies), and user queries are specified in a powerful query language (e.g., Datalog). The query answering technique proposed in such a system is based on query rewriting in Datalog enriched with negation and disjunction, under stable model semantics [26, 49]. A setting similar to the one considered in INFOMIX is the one at the basis of the DIS@DIS system [27]. Even if limited in its capability of integrating sources with different data formats (the system actually considers only relational data sources), DIS@DIS also provides mechanisms for the integration of inconsistent data in LAV. Furthermore, w.r.t. the query language considered, INFOMIX and DIS@DIS aim at supporting more general, highly expressive classes of queries (including queries that are intractable in the worst case).

PICSEL

Similarly to IM, PICSEL is based on CARIN and the use of conjunctive queries. However, PICSEL differs from IM in that its mappings follow a rather simplified GAV approach. More precisely, each data source consists of a set of relations, and for each data source there exists a one-to-one mapping from each of its relations to a distinct element of the global schema. In addition, PICSEL takes into account a set of constraints about the content of the sources, expressed as CARIN assertions. Query expansion in CARIN is then used as the core algorithmic tool for query answering in PICSEL.
Thus, query answering in PICSEL is quite efficient, since it is reduced to the evaluation of a union of conjunctive queries over the set of data sources, resulting from the query expansion, which is by itself exponential in the size of the global schema. The main differences with respect to our investigation are as follows. PICSEL does not consider at all the case where the DIS specification is inconsistent. Also, it does not attempt to distinguish between data and objects. Finally, PICSEL mappings are much more restricted than the ones we consider.

Grammar AIG

The Grammar AIG [18] is a formalism allowing one to specify how to integrate SQL data coming from autonomous sources and publish it as an XML document that conforms to a DTD and satisfies a set of integrity constraints very close to the ones we also consider. Thus, an AIG evaluation produces a materialized view conforming to a quite expressive global schema. More precisely, an AIG consists of two parts: a grammar and a set of XML constraints. The grammar extends a DTD by associating semantic attributes and semantic rules with element types. The semantic attributes

are used to pass data and control during AIG evaluation. The semantic rules compute the values of the attributes by extracting data from databases via multi-source SQL queries that constitute the mappings. As a result, the XML document is constructed via a controlled derivation from the grammar and constraints, and is thus guaranteed to both conform to the DTD and satisfy the constraints. The focus of [18] is on constraint checking, in the sense that whenever, during the generation of the document, an attribute does not satisfy a constraint, the compilation of the materialized instance is aborted.

XPeranto and SilkRoute

Both XPeranto [85] and SilkRoute [43] are XML publishing systems that support the definition of XML materialized views of SQL data. Moreover, they both support query answering over such XML views, by using an intermediate representation of the views. On the one hand, XPeranto uses an XML Query Graph Model (XQGM) to represent a view. The XQGM is analogous to a physical execution plan produced by a query optimizer. Nodes in the XQGM represent operations in an algebra (e.g., select, join, unnest, union) and edges represent the dataflow from one operation to the next. Individual operations may invoke XML-aware procedures for constructing and deconstructing XML values, which gives XPeranto a procedural flavor. This captures well the relationship between XQuery expressions and complex SQL expressions, but it may produce an XQGM that cannot be composed with another XQuery query, and thus cannot support arbitrary query answering. On the contrary, SilkRoute uses a view forest as an intermediate abstract representation of views expressed by means of XQuery, which is entirely declarative and can thus be composed with any XQuery query.
As a consequence, the two representations are somehow complementary: declarative view forests are appropriate for front-end query composition, whereas the procedural XQGM may be better for back-end SQL generation.

GLAV approach

XML data exchange: basic theoretical issues

In the same spirit as our work is the study presented in [12], where the authors start looking into the basic properties of XML data exchange, where the target schema is a DTD. Specifically, they define XML data exchange settings in which source-to-target dependencies refer to the hierarchical structure of the data. They investigate the consistency problem, which, in the case of data exchange, is the problem of deciding whether there exists an instance of the target schema which satisfies both the source-to-target dependencies and the DTD, and determine its exact complexity. Moreover, they identify data exchange settings over which query answering over the target schema is tractable, and those over which it is coNP-complete, depending on the classes of regular expressions used in DTDs. Finally, for all tractable cases they provide PTIME algorithms that compute target XML documents over which queries can be answered.

Constraint-based XML rewriting

The paper [90] proposes a query answering algorithm over an XML-based DIS whose global schema is characterized by a set of expressive, even though rather complicated, constraints, called nested equality-generating dependencies (NEGDs). These include functional dependencies, such as XML keys and foreign keys, as well as more general constraints stating that certain tuples/elements in the target must satisfy certain equalities. The mappings are sound and are expressed by means of the mapping language proposed in Clio [82], which means that they follow the GLAV approach. The main problem studied in [90] is query rewriting. Thus, according to the distinction discussed in [33], even though related, such a study addresses a different issue from the one we study, which does not aim at finding a query rewriting. Moreover, [90] does not deal with the detection (or resolution) of conflicts that may arise due to target constraints.
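For intuition, the simplest NEGDs are plain equality-generating dependencies; a hypothetical example (element names invented for illustration):

```latex
% An equality-generating dependency: any two book elements agreeing on
% isbn must also agree on title (a functional dependency in XML guise).
\forall x, y \; \bigl( \mathit{book}(x) \land \mathit{book}(y) \land
  x.\mathit{isbn} = y.\mathit{isbn} \;\rightarrow\;
  x.\mathit{title} = y.\mathit{title} \bigr)
```

Nested EGDs generalize this pattern by allowing the quantified variables to range over elements reached through the hierarchical structure of the document.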

Part II

Ontology-based DIS


In this part of the thesis, we investigate ontology-based DIS. These are data integration systems whose global schema is described as the intensional level of an ontology, i.e., the shared conceptualization of a domain of interest. We are interested, in particular, in ontologies expressed by means of logic-based languages, specifically Description Logics (DLs) [14]. Indeed, OWL³, the main current standard language for ontology descriptions, is based on such formalisms. In a nutshell, DLs have been developed and tailored over the years in Artificial Intelligence and Computational Logic to formally represent knowledge about a domain of interest in terms of concepts (or classes), which denote sets of objects, and roles (or relations), which denote binary relations between objects. DL knowledge bases are formed by two distinct parts: the so-called TBox, which contains the intensional description of the domain of interest, and the so-called ABox, which contains extensional information. When DLs are used to express ontologies [16], the TBox is used to express the intensional level of the ontology, while the ABox is used to represent the instance level of the ontology, i.e., the information on actual objects that are instances of the concepts and roles defined at the intensional level. From a formal point of view, a DL knowledge base is a pair K = ⟨T, A⟩, where:

- T, the TBox, is formed by a finite set of universal assertions. The precise form of such assertions depends on the specific DL. However, we insist that the TBox mainly places constraints on the extensions of the primitive concepts and roles used to describe the domain of interest⁴;

- A, the ABox, is formed by a finite set of membership assertions stating that a given object (or pair of objects) is an instance of a concept (or a role).
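The pair K = ⟨T, A⟩ can be pictured with a minimal sketch (a hypothetical encoding; the concept and role names are borrowed from later examples):

```python
# A DL knowledge base K = <T, A>: the TBox holds intensional (universal)
# assertions, the ABox extensional (membership) assertions.  The tuple
# encodings below are purely illustrative.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    tbox: set = field(default_factory=set)   # e.g. concept inclusions
    abox: set = field(default_factory=set)   # concept/role memberships

k = KnowledgeBase(
    tbox={("isa", "manager", "employee")},          # manager is-a employee
    abox={("inst", "manager", "bob"),               # manager(bob)
          ("role", "WORKS-FOR", "bob", "p1")},      # WORKS-FOR(bob, p1)
)

def instances_of(kb, concept):
    """Instance retrieval using one step of TBox inclusions: direct
    members plus members of directly included sub-concepts."""
    direct = {a[2] for a in kb.abox if a[0] == "inst" and a[1] == concept}
    subs = {t[1] for t in kb.tbox if t[0] == "isa" and t[2] == concept}
    return direct | {a[2] for a in kb.abox
                     if a[0] == "inst" and a[1] in subs}
```

Here `instances_of(k, "employee")` returns `{"bob"}` even though the ABox never states employee(bob) directly: the intensional assertion in the TBox supplies it, which is exactly the interplay between the two levels described above.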
When we talk about ontology-based DIS, the extensional level of the ontology is not represented as an ABox anymore; rather, it is provided by both a set of existing data sources and a set of mappings expressing the relationship between the concepts and roles of the intensional level of the ontology, i.e. the global schema, and the data managed by a relational DBMS. To understand which DL would be suited to act as the formalism for representing the global schema of ontology-based DIS, we clearly need to build on the results of recent research in DLs. In particular, the results of [30, 76, 57] showed that none of the variants of OWL is suitable, in that they are all coNP-hard w.r.t. data complexity.

³ OWL Web Ontology Language Overview,
⁴ This contrasts with TBoxes, sometimes called acyclic, which consist of a finite set of definition assertions used to introduce defined concepts, i.e. abbreviations for complex combinations of primitive concepts and roles, such that a defined concept cannot refer to itself.

Possible restrictions that guarantee polynomial reasoning (at least if we concentrate on instance checking only) have also been investigated, such as Horn-SHIQ [57], EL++ [13], and DLP [50]. Among such fragments, we choose here to focus on those belonging to the DL-Lite family [29, 30], since they allow for answering (unions of) conjunctive queries (i.e. SQL select-project-join queries) in LOGSPACE w.r.t. data complexity. More importantly, they allow for delegating query processing, after a preprocessing phase which is independent of the data, to the relational DBMS managing the data layer, i.e. the ABox. This last property is obviously crucial in ontology-based DIS, where relational data sources provide the extensional level of the ontology. In the investigation of ontology-based DIS, we are also interested in write-also DIS, i.e. data integration systems that allow the user to perform updates over the extensional level of an ontology, i.e. the data sources. DIS updates in this context are related to the need to change an ontology in order to reflect a change in the domain of interest the ontology is supposed to represent. Generally speaking, an update is represented by a formula that is intended to sanction a set of properties that are true in the state resulting from the change. One of the major challenges when dealing with an update is how to react to the case where the update is inconsistent with the current knowledge. Clearly, in order to study updates over an ontology-based DIS, we need to build on results on DL ontology updates. However, despite the importance of update, this issue is largely unexplored. Notable exceptions are [52, 69]. In particular, in [69] the authors propose a formal semantics for updates in DLs, and present interesting results on various aspects related to computing updates.
However, since the problem is addressed under the assumption that the knowledge base is specified only at the extensional (i.e., instance) level, the paper does not take into account the impact of the intensional level on ontology update. Thus, as a first step toward write-also ontology-based DIS, we present here the first results of a systematic investigation of the notion of update of ontologies expressed as DL knowledge bases, where the intensional level of the ontology is assumed to be invariant, i.e., it does not change while the KB is used⁵, while the instance level of the ontology describes the state of affairs regarding the instances of concepts, which can indeed change as the information in it is updated. The main contributions of this part of the thesis are as follows.

First, we define a new language, called DL-Lite A, that is particularly tailored to represent ontologies in a DIS setting. In particular, DL-Lite A allows for distinguishing between values and objects.

Second, we study the main reasoning services offered by a DL-Lite A KB. In particular, we provide algorithms to check DL-Lite A KB satisfiability and to solve query answering over a DL-Lite A KB. We prove that these algorithms are correct, and show that they run in LOGSPACE in data complexity.

Third, we propose a formal framework for DL-Lite A ontology-based DIS. We show that in DL-Lite A DIS, reasoning can be separated from the access to

⁵ In other words, in this paper we are not considering the so-called ontology evolution problem.

actual data sources. Then, we provide algorithms to solve DIS consistency and query answering by appropriately exploiting these nice features of DL-Lite A DIS. We prove that these algorithms are correct and, again, run in LOGSPACE in data complexity.

Fourth, we define the notion of update of the extensional level of an ontology. Building on classical approaches to knowledge base update, we provide a general semantics for instance-level update in DLs. In particular, we follow the approach of [69], and we adapt Winslett's semantics [87, 88] to the case where the ontology is described by both a TBox and an ABox. Finally, we study update over a KB expressed in a restricted variant of DL-Lite A, called DL-Lite FS. We prove that DL-Lite FS is closed with respect to instance-level update, in the sense that the result of an update is always expressible as a new DL-Lite FS ABox. Then, we provide an algorithm that computes the update over a DL-Lite FS KB. We prove that this algorithm is correct, and we show that it runs in polynomial time with respect to the size of the original knowledge base. To the best of our knowledge, this is the first algorithm for a well-founded approach to ontology update in DLs taking into account both the TBox and the ABox.

This part of the thesis is an expanded and updated version of an OWLED Workshop paper [35] and an AAAI conference paper [47]. It is organized as follows. Below, we briefly present the works that are most closely related to ours. In Chapter 3, we present the DL DL-Lite A that is used to express the DIS global schema. In Chapter 4, we investigate DL-Lite A KB satisfiability and query answering. In Chapter 5, we set up the logical framework for ontology-based data integration and provide algorithms to solve DIS consistency and query answering. Finally, in Chapter 6, we investigate instance-level updates of DL ontologies and provide an algorithm to compute an update over a DL-Lite A KB.
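The delegation of query processing to a relational DBMS mentioned above can be sketched as follows — a drastic simplification with hypothetical table names (the actual algorithms are developed in the following chapters): positive inclusions between atomic concepts are compiled into the user query, which then becomes plain SQL evaluable by the data layer alone.

```python
# Query answering in the DL-Lite family (drastically simplified): the
# TBox's positive inclusions are compiled into the query, yielding a
# union that a plain relational DBMS can evaluate.  Only inclusions
# between atomic concepts are treated here; the real rewriting also
# handles roles, inverses and existentials.

def rewrite(concept, inclusions):
    """Close the queried concept under the sub-concepts implied by the
    inclusion assertions, given as (sub, sup) pairs."""
    reachable, frontier = {concept}, [concept]
    while frontier:
        c = frontier.pop()
        for sub, sup in inclusions:
            if sup == c and sub not in reachable:
                reachable.add(sub)
                frontier.append(sub)
    return reachable

def to_sql(concept, inclusions):
    """Assume each atomic concept is stored in a one-column table of the
    same (hypothetical) name; the rewriting becomes a UNION query."""
    return " UNION ".join(f"SELECT id FROM {c}"
                          for c in sorted(rewrite(concept, inclusions)))

inclusions = [("manager", "employee"), ("tempemp", "employee"),
              ("employee", "person")]
```

`to_sql("employee", inclusions)` yields `SELECT id FROM employee UNION SELECT id FROM manager UNION SELECT id FROM tempemp`: once the rewriting is computed (independently of the data), all entailed instances are retrieved by the DBMS itself.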


Chapter 3

The language

In this chapter, we present a new logic of the DL-Lite family [30], called DL-Lite A. To this aim, we start by introducing DL-Lite FRS, a new DL particularly tailored to represent ontologies. Then, we present the query language, i.e. conjunctive queries. Finally, since DL-Lite FRS, while quite interesting in general, loses the most important feature of the DLs belonging to the DL-Lite family, i.e. the ability to delegate query processing to a relational DBMS, we define DL-Lite A by imposing some restrictions on DL-Lite FRS.

3.1 DL-Lite FRS

DL-Lite FRS is a new DL, whose novel aspects w.r.t. the other DLs of the DL-Lite family [30, 31] are as follows.

- DL-Lite FRS takes seriously the distinction between objects and values, by allowing the use of: value-domains, a.k.a. concrete domains [15], denoting sets of (data) values; concept attributes, denoting binary relations between objects and values; and role attributes, denoting binary relations between pairs of objects and values¹.

- DL-Lite FRS allows one to express the existence of objects (or values) that are instances of concepts (resp. value-domains), without naming the actual objects (resp. values), by means of the so-called soft constants.

Whereas these features are all provided by OWL², the distinction between objects and values is typically blurred in DLs. Nevertheless, as already discussed, none of the OWL variants [77], neither OWL, nor OWL-DL, nor OWL-Lite, would be

¹ Obviously, a role attribute can also be seen as a ternary relation relating two objects and a value.
² In fact, role attributes are currently not available in OWL, but are present in most conceptual modeling formalisms such as UML class diagrams and Entity-Relationship diagrams.

suited to act as the formalism for representing ontologies in the context of DIS, given that, if not restricted, they all provide reasoning services that are coNP-hard in data complexity.

DL-Lite FRS expressions

In providing the specification of our logics, we use the following notation: A denotes an atomic concept, B a basic concept, and C a general concept; D denotes an atomic value-domain, E a basic value-domain, and F a general value-domain; P denotes an atomic role, Q a basic role, and R a general role; U_C denotes an atomic concept attribute, and V_C a general concept attribute; U_R denotes an atomic role attribute, and V_R a general role attribute; ⊤_C denotes the universal concept, and ⊤_D the universal value-domain.

Given a concept attribute U_C (resp. a role attribute U_R), we call the domain of U_C (resp. U_R), denoted δ(U_C) (resp. δ(U_R)), the set of objects (resp. of pairs of objects) that U_C (resp. U_R) relates to values, and we call the range of U_C (resp. U_R), denoted ρ(U_C) (resp. ρ(U_R)), the set of values that U_C (resp. U_R) relates to objects (resp. pairs of objects). Notice that the domain δ(U_C) of a concept attribute U_C is a concept, whereas the domain δ(U_R) of a role attribute U_R is a role. Furthermore, we denote by δ_F(U_C) (resp. δ_F(U_R)) the set of objects (resp. of pairs of objects) that U_C (resp. U_R) relates to values in the value-domain F. In particular, DL-Lite FRS expressions are defined as follows.

Concept expressions:

  B ::= A | ∃Q | δ(U_C)
  C ::= ⊤_C | B | ¬B | ∃Q.C | δ_F(U_C) | ∃δ_F(U_R) | ∃δ_F(U_R)⁻

Value-domain expressions (rdfDataType denotes predefined value-domains such as integers, strings, etc.):

  E ::= D | ρ(U_C) | ρ(U_R)
  F ::= ⊤_D | E | ¬E | rdfDataType

Attribute expressions:

  V_C ::= U_C | ¬U_C
  V_R ::= U_R | ¬U_R

Role expressions:

  Q ::= P | P⁻ | δ(U_R) | δ(U_R)⁻
  R ::= Q | ¬Q | δ_F(U_R) | δ_F(U_R)⁻
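For concreteness, the concept grammar above can be mirrored by a small abstract-syntax encoding (a hypothetical illustration, not part of the formalism itself):

```python
# A hypothetical abstract-syntax encoding of (part of) the concept
# grammar: B ::= A | ∃Q | δ(U_C), with negation ¬B for general concepts.
from dataclasses import dataclass

@dataclass(frozen=True)
class Atomic:        # atomic concept A
    name: str

@dataclass(frozen=True)
class Exists:        # unqualified existential ∃Q over a basic role
    role: str

@dataclass(frozen=True)
class AttrDomain:    # attribute domain δ(U_C)
    attribute: str

@dataclass(frozen=True)
class Not:           # negation ¬B, allowed on right-hand sides only
    arg: object

# e.g. the two sides of a (made-up) inclusion manager ⊑ ¬∃WORKS-FOR:
lhs, rhs = Atomic("manager"), Not(Exists("WORKS-FOR"))
```

Restricting where each constructor may appear (basic concepts on the left of inclusions, general concepts on the right) is exactly what the grammar's B/C distinction enforces.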

In the value-domain expressions above, rdfDataType denotes predefined value-domains, such as integers, strings, etc., that correspond to the RDF data types³. Coherently with RDF, we assume that such data types are pairwise disjoint. In the following, we denote each such domain by T, possibly with a subscript, i.e., we assume rdfDataType ::= T_1 | ... | T_n.

As usual in DLs, the semantics of DL-Lite FRS is given in terms of first-order logic interpretations. More precisely, an interpretation I = (Δ^I, ·^I) consists of:

- a first-order structure over the interpretation domain Δ^I, which is the disjoint union of two domains: Δ^I_O, called the interpretation domain of objects, and Δ^I_V, called the interpretation domain of (data) values;

- an interpretation function ·^I such that (i) for each rdfDataType T_i, it holds that T_i^I ⊆ Δ^I_V, and for each pair of rdfDataTypes T_i, T_j, with i ≠ j, it holds that T_i^I ∩ T_j^I = ∅; and (ii) the following conditions are satisfied:

  ⊤_C^I = Δ^I_O
  ⊤_D^I = Δ^I_V
  A^I ⊆ Δ^I_O
  D^I ⊆ Δ^I_V
  P^I ⊆ Δ^I_O × Δ^I_O
  U_C^I ⊆ Δ^I_O × Δ^I_V
  U_R^I ⊆ Δ^I_O × Δ^I_O × Δ^I_V
  (¬B)^I = Δ^I_O \ B^I
  (¬E)^I = Δ^I_V \ E^I
  (¬Q)^I = (Δ^I_O × Δ^I_O) \ Q^I
  (¬U_C)^I = (Δ^I_O × Δ^I_V) \ U_C^I
  (¬U_R)^I = (Δ^I_O × Δ^I_O × Δ^I_V) \ U_R^I
  (ρ(U_C))^I = { v | ∃o. (o,v) ∈ U_C^I }
  (ρ(U_R))^I = { v | ∃o,o'. (o,o',v) ∈ U_R^I }
  (P⁻)^I = { (o,o') | (o',o) ∈ P^I }
  (δ_F(U_C))^I = { o | ∃v. (o,v) ∈ U_C^I ∧ v ∈ F^I }
  (δ(U_C))^I = (δ_⊤D(U_C))^I
  (δ_F(U_R))^I = { (o,o') | ∃v. (o,o',v) ∈ U_R^I ∧ v ∈ F^I }
  (δ(U_R))^I = (δ_⊤D(U_R))^I
  (δ_F(U_R)⁻)^I = { (o',o) | ∃v. (o,o',v) ∈ U_R^I ∧ v ∈ F^I }
  (δ(U_R)⁻)^I = (δ_⊤D(U_R)⁻)^I
  (∃δ_F(U_R))^I = { o | ∃o'. (o,o') ∈ (δ_F(U_R))^I }
  (∃δ_F(U_R)⁻)^I = { o | ∃o'. (o,o') ∈ (δ_F(U_R)⁻)^I }
  (∃Q)^I = { o | ∃o'. (o,o') ∈ Q^I }
  (∃Q.C)^I = { o | ∃o'. (o,o') ∈ Q^I ∧ o' ∈ C^I }

DL-Lite FRS TBox

DL-Lite FRS TBox assertions are of the form:

  B ⊑ C          concept inclusion assertion
  Q ⊑ R          role inclusion assertion
  E ⊑ F          value-domain inclusion assertion
  U_C ⊑ V_C      concept attribute inclusion assertion
  U_R ⊑ V_R      role attribute inclusion assertion
  (funct P)      role functionality assertion
  (funct P⁻)     inverse role functionality assertion
  (funct U_C)    concept attribute functionality assertion
  (funct U_R)    role attribute functionality assertion

³ Resource Description Framework (RDF),

A concept inclusion assertion expresses that a (basic) concept B is subsumed by a (general) concept C; analogously for the other types of inclusion assertions. A role functionality assertion expresses the (global) functionality of an atomic role; analogously for the other types of functionality assertions. Note that in the sequel we will sometimes consider a TBox T as the disjoint union of T_p, T_k and T_ni, where:

- T_p is the set of all inclusion assertions (of any type), called Positive Inclusion assertions (PIs), having a positive expression in the right-hand side;

- T_ni is the set of all inclusion assertions (of any type), called Negative Inclusion assertions (NIs), having a negated expression in the right-hand side;

- T_k is the set of all functionality assertions (of any type).

We now give the semantics of a TBox T, again in terms of interpretations I = (Δ^I, ·^I) over the domain Δ^I. An interpretation I = (Δ^I, ·^I) is a model of a DL-Lite FRS TBox T, written I ∈ Mod(T), or equivalently, I satisfies T, written I ⊨ T, if I satisfies each assertion α in T. More precisely:

- if α is an inclusion assertion γ ⊑ β, where γ and β may denote concepts, roles, value-domains, concept attributes, or role attributes, we must have γ^I ⊆ β^I;

- if α is a role functionality assertion (funct Q), where Q is either P or P⁻, we must have, for each o_1, o_2, o_3: (o_1, o_2) ∈ Q^I ∧ (o_1, o_3) ∈ Q^I → o_2 = o_3;

- if α is a concept attribute functionality assertion (funct U_C), we must have, for each o, v_1, v_2: (o, v_1) ∈ U_C^I ∧ (o, v_2) ∈ U_C^I → v_1 = v_2;

- if α is a role attribute functionality assertion (funct U_R), we must have, for each o_1, o_2, v_1, v_2: (o_1, o_2, v_1) ∈ U_R^I ∧ (o_1, o_2, v_2) ∈ U_R^I → v_1 = v_2;

where each o, possibly with a subscript, is an element of Δ^I_O, and each v, possibly with a subscript, is an element of Δ^I_V.

We next give an example of a DL-Lite FRS TBox, with the aim of highlighting the use of attributes (in particular, role attributes).
Note that in all the following examples, concept names are written in lowercase, role names are written in UPPERCASE, attribute names are in sans serif font, and domain names are in typewriter font.

3.1. DL-LITE_FRS

Example. Let T be the TBox containing the following assertions:

  tempemp ⊑ employee                (3.1)
  manager ⊑ employee                (3.2)
  employee ⊑ person                 (3.3)
  employee ⊑ ∃WORKS-FOR.project     (3.4)
  person ⊑ δ(persname)              (3.5)
  ρ(persname) ⊑ xsd:string          (3.6)
  (funct persname)                  (3.7)
  project ⊑ δ(projname)             (3.8)
  ρ(projname) ⊑ xsd:string          (3.9)
  (funct projname)                  (3.10)
  tempemp ⊑ ∃δ(until)               (3.11)
  δ(until) ⊑ WORKS-FOR              (3.12)
  (funct until)                     (3.13)
  ρ(until) ⊑ xsd:date               (3.14)
  (funct MANAGES)                   (3.15)
  MANAGES ⊑ WORKS-FOR               (3.16)
  manager ⊑ ¬∃δ(until)              (3.17)

The above TBox T models information about employees and projects. Specifically, the assertions in T state the following. Managers and fixed-term employees (tempemp) are two types of employees (3.2, 3.1), where an employee is a person (3.3) working for a project (3.4), and a person and a project are each characterized by a unique name (3.5, 3.7, 3.8, 3.10). In particular, a person name and a project name may be any string (3.6, 3.9). Moreover, someone who manages a project works for that project (3.16); note, however, that an employee can manage at most one project (3.15). Finally, the until role attribute associates a unique date (3.13, 3.14) with an employment (3.12). Thus, T allows one to express that a fixed-term employee works for at least one project until a fixed date (3.11), whereas a manager is someone who holds only permanent positions (3.17). Note that this implies that there exists no employee who is simultaneously a fixed-term employee and a manager.

DL-Lite_FRS ABox

We now focus on the DL-Lite_FRS ABox. To this aim, we introduce an alphabet of hard constants Γ (for short, constants), which is the disjoint union of two alphabets Γ_O and Γ_V. Symbols in Γ_O, called object identifiers (or also object constants), are used to denote objects, while symbols in Γ_V, called value constants, are used to denote data values. Moreover, we introduce an alphabet of soft constants V.
Consistently with Γ, V is the disjoint union of two sets V_O and V_V, whose symbols denote respectively objects and values. A DL-Lite_FRS ABox over Γ is a finite set of

assertions, called membership assertions, of the form:

  C(a),  C(s_o),  F(d),  F(s_v),  R(a, b),  V_C(a, d),  V_R(a, b, d)

where a and b are constants in Γ_O, s_o and s_v are soft constants in V_O and V_V respectively, and d is a constant in Γ_V. An assertion involving only constants is called ground.

Let us focus on soft constants. Soft constants are used to express the existence of objects (resp. values) that are instances of concepts (resp. value-domains), without actually naming their object identifiers (resp. value constants). In other words, soft constants are constants for which the unique name assumption does not hold. It is worth noting that, according to the syntax above, soft constants can occur inside concepts or value-domains, whereas they cannot occur inside roles⁴. In spite of this restriction, the following example shows that soft constants actually add expressive power (which will also become clearer when discussing updates in Chapter 6).

Example. Consider the following two ABoxes: A_1 = {A(a), B(b)} (b a constant), and A_2 = {A(a), B(x)} (x a soft constant). They do not have the same set of models. Indeed, A_1 is such that, for each interpretation I_1 = (Δ^{I_1}, ·^{I_1}) that is a model of A_1, A and B are interpreted as two sets of objects containing respectively at least one object o_A and one object o_B of the domain, such that a^{I_1} = o_A and b^{I_1} = o_B, where o_A ≠ o_B. Clearly, each such model I_1 is also a model of A_2. Now let I_2 = (Δ^{I_2}, ·^{I_2}) be an interpretation such that A and B contain uniquely the same object o of the domain. Then I_2 is a model of A_2 with an assignment μ such that μ(x) = o, where a^{I_2} = o. On the contrary, I_2 is not a model of A_1 with any assignment.

From the above example it follows that, if we were able to express for which constants the unique name assumption holds, then soft constants would not add expressive power.
However, from a technical point of view, by following such an approach we would have to change (and complicate) all the definitions we give in Chapter 6 that lead to the definition of update (e.g. the difference among interpretations). In order to give the semantics of a DL-Lite FRS ABox in terms of interpretations ( I, I), since the ABox may involve soft constants, whereas I is a function from the set of constants Γ to the domain I, we need to introduce the preliminary notion of assignment. Definition Let V be the disjoint union of the sets of soft constants V O and V V, and I the disjoint union of O I and V I. Given an ABox A, we call assignment for A a function µ from V to I such that: for each s o V O occurring in A, µ(s o ) = o O I ; for each s v V V occurring in A, µ(s v ) = v V I. 4 From a technical point of view, the reason for this restriction is that the presence of soft constants in roles would make reasoning much less efficient, since it would possibly require to recursively unify soft constants according to role functionality assertions.

Let Δ^I be the disjoint union of O^I and V^I, and I = (Δ^I, ·^I) an interpretation. Moreover, let μ be an assignment for A. We say that I is a model of A with μ, or equivalently, I satisfies A with μ, written I ⊨ A[μ], if the following conditions are satisfied.

First, I = (Δ^I, ·^I) assigns to each constant in Γ_O and Γ_V a distinct element of O^I and V^I, respectively, as follows:
- for all a ∈ Γ_O, we have that a^I ∈ O^I;
- for all a, b ∈ Γ_O, we have that a ≠ b implies a^I ≠ b^I;
- for all d ∈ Γ_V, we have that d^I ∈ V^I;
- for all d, e ∈ Γ_V, we have that d ≠ e implies d^I ≠ e^I.

Second, I satisfies each membership assertion α in A, written I ⊨ α[μ]. More precisely, for each membership assertion α ∈ A, we have that:
- if α = C(a), with a ∈ Γ_O, then a^I ∈ C^I;
- if α = C(s_o), with s_o ∈ V_O, then μ(s_o) ∈ C^I;
- if α = F(d), with d ∈ Γ_V, then d^I ∈ F^I;
- if α = F(s_v), with s_v ∈ V_V, then μ(s_v) ∈ F^I;
- if α = R(b_1, b_2), with b_1, b_2 ∈ Γ_O, then (b_1^I, b_2^I) ∈ R^I;
- if α = V_C(b, d), with b ∈ Γ_O and d ∈ Γ_V, then (b^I, d^I) ∈ V_C^I;
- if α = V_R(b_1, b_2, d), with b_1, b_2 ∈ Γ_O and d ∈ Γ_V, then (b_1^I, b_2^I, d^I) ∈ V_R^I.

Finally, we say that I is a model of A if there exists an assignment μ for A such that I is a model of A with μ. Thus, we define the set Mod(A) of models of A as Mod(A) = { I | ∃μ, I ⊨ A[μ] }.

We now give an example of an ABox. Note that in all examples that follow, object constants in Γ_O are written in bold face font, whereas value constants in Γ_V are written in slanted font.

Example. Consider the following ABox A, where z ∈ V_O:

  tempemp(z),                   (3.18)
  until(z, DIS-1212, ),         (3.19)
  projname(DIS-1212, QuOnto),   (3.20)
  manager(Lenz)                 (3.21)

Specifically, the ABox assertions in A state that there exists an object denoting a fixed-term employee (3.18). Moreover, the name of DIS-1212 is QuOnto (3.20), and the object identified by Lenz is a manager (3.21).
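To make the role of the assignment μ concrete, here is a minimal Python sketch (an illustration under the encoding assumptions stated in the comments, not part of the thesis): hard constants are interpreted through ·^I, soft constants through μ, and each membership assertion is checked against the predicate's extension:

```python
def term_value(t, iota, mu):
    """Hard constants are interpreted by iota (playing the role of ·^I);
    any other term is a soft constant, interpreted by the assignment mu."""
    return iota[t] if t in iota else mu[t]

def satisfies_abox(abox, ext, iota, mu):
    """abox: list of (predicate, args) pairs; ext: predicate -> set of tuples.
    Checks only the membership conditions; UNA on iota is assumed separately."""
    for pred, args in abox:
        tup = tuple(term_value(t, iota, mu) for t in args)
        if tup not in ext[pred]:
            return False
    return True

# The ABox A2 = {A(a), B(x)} with soft constant x: a single-object
# interpretation becomes a model once mu maps x to that object.
iota = {"a": "o"}
mu = {"x": "o"}
ext = {"A": {("o",)}, "B": {("o",)}}
abox = [("A", ("a",)), ("B", ("x",))]
```

With these inputs `satisfies_abox(abox, ext, iota, mu)` succeeds; had x been a hard constant b instead, the unique name assumption would force a^I ≠ b^I and rule this interpretation out, as in the example of the previous page.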

DL-Lite_FRS knowledge base

Now that we have introduced DL-Lite_FRS TBoxes and ABoxes, we are finally able to define when an interpretation is a model of a DL-Lite_FRS KB K. Let μ be an assignment for A. An interpretation I is a model of a KB K = ⟨T, A⟩ with μ, written I ⊨ K[μ], if I is a model of both T and A with μ. A KB is satisfiable if it has at least one model, i.e. if there exist at least an interpretation I and an assignment μ for A such that I is a model of K with μ. Thus, we have that:

  Mod(K) = { I | ∃μ, I ⊨ A[μ] and I ⊨ T }.

A KB K logically implies a ground DL-Lite_FRS assertion α, written K ⊨ α, if each model I of K is also a model of α.⁵

Example. Let K = ⟨T, A⟩ be the knowledge base whose TBox T is the one of Example 3.1.1 and whose ABox A is the one of the previous example. Clearly, K is satisfiable. Indeed, a possible model I for K is described as follows. First, μ is an assignment of the soft constant in A such that μ(z) = Palm, where Palm denotes a fixed-term employee. Lenz denotes a manager and, as such, Lenz manages exactly one project. In particular, in I, Lenz manages the project identified by DIS-1212 and named QuOnto, for which he works permanently. Moreover, Lenz works permanently for the project denoted by the object identifier FP6-7603. However, Lenz does not manage FP6-7603, since otherwise I would violate the functionality assertion (3.15) of T. On the other hand, another model I′ may be such that Lenz manages FP6-7603 and does not manage DIS-1212. Note, finally, that there exists no model of K such that Lenz is interpreted as a fixed-term employee (and thus there exists no assignment μ such that μ(z) = Lenz), since according to (3.21) of A, Lenz is a manager and, as observed in Example 3.1.1, the sets of managers and fixed-term employees are disjoint.
Before presenting, in the next section, the query language we use for the investigation of ontology-based DISs, we introduce a notion that will be useful in the sequel.

Definition. Let K = ⟨T, A⟩ be a DL-Lite_FRS KB and I an interpretation for K. Then we call most general assignment for A w.r.t. I an assignment μ_0 for A that satisfies the following conditions:
- for each C(s_o) ∈ A, with s_o ∈ V_O, μ_0(s_o) = o_n, where o_n is a fresh object in O^I, and
- for each F(s_v) ∈ A, with s_v ∈ V_V, μ_0(s_v) = v_n, where v_n is a fresh value in V^I,

where we say that μ_0(s) is a fresh object (resp. value) if μ_0(s) denotes an object (resp. a value) such that, for each constant c and each soft constant s′ ≠ s occurring in A, c^I ≠ μ_0(s) and, respectively, μ_0(s′) ≠ μ_0(s).

⁵ Note that we are not interested here in the logical implication of formulas that are not ground, even though, clearly, such a notion may easily be obtained by an obvious generalization of the notion of logical implication of ground formulas.

Intuitively, a most general assignment is an assignment ensuring that soft constant names are treated as individual constant names. It is straightforward to prove the following.

Proposition. Let K = ⟨T, A⟩ be a DL-Lite_FRS KB. Then, for each pair of most general assignments μ_0 and μ_0′ for A w.r.t. I, with μ_0 ≠ μ_0′, we have that I ⊨ K[μ_0] iff I ⊨ K[μ_0′].

Moreover, most general assignments have the following interesting property.

Proposition. Let K = ⟨T, A⟩ be a DL-Lite_FRS KB. Then, K is satisfiable iff there exist I and μ_0 such that I ⊨ K[μ_0], where μ_0 is a most general assignment for A w.r.t. I.

Proof. (⇐) Trivial (by definition). (⇒) Suppose that K is satisfiable. Then there exist an assignment μ for A and an interpretation J such that J ⊨ K[μ]. Suppose now, by contradiction, that there exists no interpretation I that is a model of K with some most general assignment for A w.r.t. I. In particular, μ is not a most general assignment for A w.r.t. J. Thus, let s̄ be a soft constant in A witnessing this, and let μ_0^J be an assignment such that μ_0^J(s̄) ≠ μ(s̄) is fresh, and μ_0^J(s) = μ(s) for each soft constant s in A with s ≠ s̄. Then there exists a membership assertion C(s̄) in A such that either (i) μ(s̄) = o = a^J for some constant a occurring in A, or (ii) μ(s̄) = o = μ(s) for some soft constant s ≠ s̄. Since, by the contradiction hypothesis, J ⊭ K[μ_0^J], there must exist at least one assertion α in K such that J ⊨ α[μ] and J ⊭ α[μ_0^J]. But then α must involve s̄, since:
- if α does not involve any soft constant, then clearly either α is satisfied by J with both μ and μ_0^J, or α is satisfied by J with neither of the two assignments;
- if α involves a soft constant y ≠ s̄, then, by construction, μ_0^J(y) = μ(y), and thus α is satisfied by J with μ_0^J iff it is satisfied by J with μ.

Therefore, α must be a membership assertion of the form C′(s̄), and since J ⊭ C′(s̄)[μ_0^J], we have that μ_0^J(s̄) ∉ C′^J. But since μ_0^J assigns to s̄ a fresh object, it is always possible to build an interpretation J′ that is identical to J except for the fact that μ_0^J(s̄) ∈ C′^{J′}. Clearly, μ_0^J is a most general assignment for A w.r.t. J′, and J′ is a model of K with μ_0^J. Thus, we obtain a contradiction.

Intuitively, the above proposition shows that, in order to study DL-Lite_A KB satisfiability, we can essentially abstract from the presence of soft constants, by considering them as distinct hard constants.
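A most general assignment is easy to construct mechanically; the following Python sketch (an illustration, with ad-hoc names for the fresh elements) assigns each soft constant an element distinct from every interpreted hard constant and from every other soft constant's image, which is exactly the "fresh object (or value)" condition of the definition:

```python
import itertools

def most_general_assignment(soft_constants, interpreted_constants):
    """Map each soft constant to a fresh element: pairwise distinct, and
    disjoint from the images of the hard constants under the interpretation."""
    taken = set(interpreted_constants)
    fresh_names = (f"fresh_{i}" for i in itertools.count())
    mu0 = {}
    for s in soft_constants:
        o = next(o for o in fresh_names if o not in taken)
        mu0[s] = o
        taken.add(o)
    return mu0
```

For instance, with soft constants {z, w} and interpreted constants {Lenz, DIS-1212} (names borrowed from the running example), the result maps z and w to two brand-new, distinct elements.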

48 36 CHAPTER 3. THE LANGUAGE 3.2 Query language A conjunctive query (CQ) q over a DL-Lite FRS ontology is an expression of the form q( x) conj( x, y), where x is a tuple of distinct variables, the so-called distinguished variables, y is a tuple of distinct existentially quantified variables (not occurring in x), called the non-distinguished variables, and conj( x, y) is a conjunction of atoms of the form A(x o ), P(x o, y o ), D(x v ), U C (x o, x v ), or U R (x o, y o, x v ), x o = y o, x v = y v, where: A, P, D, U C, and U R are resp. an atomic concept, an atomic role, an atomic value-domain, an atomic concept attribute and an atomic role attribute in T, x o, y o are either variables in x and y, called object variables, or constants in Γ O, x v is either a variable in x and y, called a value variable, or a constant in Γ V. We say that q( x) is the head of the query whereas conj( x, y) is the body. Moreover, the arity of q is the arity of x. Finally, a union of conjunctive queries (UCQ) is a query of the form: Q( x) i conj i ( x, y i ). Given an interpretation I = ( I, I), the query Q( x) ϕ( x, y) (either a conjunctive query or a union of conjunctive queries) is interpreted in I as the set of tuples o x I I such that there exists o y I I such that if we assign to the tuple of variables ( x, y) the tuple ( o x, o y ) the formula ϕ( o x, o y ) is true in I [5]. Then, given a tuple t of elements of Γ (we recall that Γ is the disjoint union of the objects and value constants Γ O and Γ V ), we say that t is a certain answer to q over K, written t ans(q, K), if for each interpretation I that is a model of K, we have that t I Q I. Thus, as for the DL-Lite FRS assertions, we say that K logically implies Q( t), written K = Q( t), where Q( t) is obtained from Q( x) by substituting x with t. Example Let K be the knowledge base introduced in Example Suppose first that we pose the following query, asking for all employees: q(x) employee(x). 
One can verify that the set of certain answers is {Lenz}. Indeed, Lenz is the only object identifier denoting an employee in all possible models, with any assignment. Suppose now that we ask for all pairs participating in the role WORKS-FOR:

  q(x, y) ← WORKS-FOR(x, y).

We then obtain no answer, since there exists no pair of object identifiers (a, b) in Γ such that (a^I, b^I) ∈ q^I for all models I of K.
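The interpretation of a conjunctive query in a single model I, as defined above, can be sketched in a few lines of Python (an illustration, with variables encoded as strings starting with '?'); certain answers would additionally require this to hold in every model of K, which is why only Lenz is certain in the first query of the example:

```python
from itertools import product

def evaluate_cq(head, body, ext, domain):
    """Naive CQ evaluation over one finite interpretation: try every binding
    of the variables to domain elements and keep those satisfying all atoms."""
    variables = sorted({t for _, args in body for t in args if t.startswith("?")})
    answers = set()
    for values in product(domain, repeat=len(variables)):
        binding = dict(zip(variables, values))
        ground = lambda t: binding.get(t, t)
        if all(tuple(map(ground, args)) in ext[pred] for pred, args in body):
            answers.add(tuple(binding[v] for v in head))
    return answers

# q(x) <- employee(x) over one model of the example KB (o1 is an unnamed
# domain element of this particular model).
ext = {"employee": {("Lenz",), ("o1",)}}
ans = evaluate_cq(["?x"], [("employee", ("?x",))], ext, {"Lenz", "o1", "DIS-1212"})
```

Here `ans` contains both Lenz and o1, but only Lenz survives intersection over all models, matching the certain answers computed in the example.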

49 3.3. DL-LITE A 37 Proposition Let K = T, A be a satisfiable DL-Lite FRS KB, and Q a union of conjunctive queries over K of arity n. Moreover, let m be the number of distinct soft constants s j occurring in A. Then, ans(q, K) = { t = (t 1,, t n ) I, µ 0 I, I = K[µ 0 I ] ( t I Q I i {1,, n}, j {1,, m}, t I i µ 0 I (s j ))} where µ 0 I denotes a most general assignment for A w.r.t. I. Proof. In order to prove the theorem, we denote as R 0 the set R 0 = { t = (t 1,, t n ) I, µ 0 I, I = K[µ 0 I ] ( t I Q I i {1,, n}, j {1,, m}, t I i µ 0 I (s j ))} and then we show that R 0 ans(q, K), and R 0 ans(q, K). : Trivial, by Proposition : Let I be a model of K and let t Γ n be a tuple of constants such that t ans(q, K). Then, we have in particular that I = Q( t). From Proposition 3.1.8, since K is satisfiable, there exists a most general assignment µ 0 I for A w.r.t. I such that I = K[µ 0 I ]. Let us now show that t I i µ 0 I (s j ) for each i {1,, n} and j {1,, m}. To this aim, suppose by contradiction that t I i = µ 0 I (s j ), for some i, j such that s j is a soft constant occurring in a membership assertion X(s j ), where X may denote either a concept or a value-domain. Then we can define a most general assignment µ 0 that is identical to µ 0 I except for the assignment of s j, i.e. µ 0 (s j ) µ 0 I (s j ). Since by definition µ 0 I (s j ) is an arbitrary fresh constant, we can construct a model I by modifying I so that (i) µ 0 (s j ) / X I, and (ii) µ 0 (s j ) X I. Then, clearly, I is a model of K with µ 0. Moreover, I = Q( t), thus contradicting the hypothesis of t ans(q, K). Note that the above proposition plays the same role for the query answering problem that Proposition plays for KB satisfiability. Indeed, it shows that given a query Q, in order to compute all certain answers to Q over a KB it is sufficient to consider only most general assignments for K. 
Thus, in particular, this allows us to compute the certain answers to Q over a KB by first considering each soft constant as a distinct hard constant, and finally eliminating those tuples that contain these newly introduced constants.

3.3 DL-Lite_A

Let us now compare the main features of DL-Lite_FRS with those of other DLs in the DL-Lite family. First, the DL-Lite_FRS ABox allows for membership assertions involving general concepts and roles (as well as general value-domains, concept attributes, and role attributes).
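The final filtering step just described can be sketched as follows (illustrative Python, not the thesis's algorithm): after evaluating the query with soft constants treated as fresh hard constants, any answer tuple mentioning one of them is discarded, so the surviving tuples range over Γ only:

```python
def certain_answer_candidates(raw_answers, fresh_constants):
    """Drop every tuple containing a constant introduced to stand in for a
    soft constant; only tuples of genuine hard constants remain."""
    fresh = set(fresh_constants)
    return {t for t in raw_answers if not any(c in fresh for c in t)}

# Suppose 'fresh_z' was introduced for the soft constant z of the example ABox.
raw = {("Lenz",), ("fresh_z",)}
```

Here `certain_answer_candidates(raw, {"fresh_z"})` keeps only ("Lenz",), as the proposition requires.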

50 38 CHAPTER 3. THE LANGUAGE Second, DL-Lite FRS allows for the representation of: the universal concept C (and universal domain value D ); qualified existential quantifications, i.e. expressions of the form: R.C, δ F (U C ), δ F (U R ), δ F (U R ), δ F (U R ), δ F (U R ). Third, DL-Lite FRS combines the main features of DL-Lite F and DL-Lite R, since it allows both functional restrictions on roles, mandatory participation on roles and disjointness between roles. Fourth, DL-Lite FRS distinguishes between objects and values and for this it introduces besides concepts and roles, also value-domains, concept attributes and role-attributes. None of the other DLs in the DL-Lite family (nor in other DLs we are aware of) allows for such a distinction. Fifth, DL-Lite FRS ABox allows for the occurrence of soft constants in membership assertions involving concepts (and value-domains). Next, we show that we can reduce general DL-Lite FRS KBs to DL-Lite FRS KBs that are equivalent, in terms of query answering, and have a much rawer form, called basic. Such a form recalls the form of other DLs in the DL-Lite family, since it does not exploit the two last features of DL-Lite FRS above mentioned. To show this, we start by defining basic DL-Lite FRS KBs. Definition Let K = T, A be a DL-Lite FRS KB. We say that K is a basic DL-Lite FRS KB if it is such that: the right-hand side of each concept inclusion assertion in T has the form: where B denotes a basic concept; B B the right-hand side of each role inclusion assertion in T has the form: where Q denotes as usual a basic role; Q Q all membership assertions in A involve only atomic concepts, atomic valuedomains, atomic concept attributes, atomic role attributes, and atomic roles. Example One can easily verify that the KB= T, A such that T is the TBox of Ex and A is the ABox of Ex is a basic DL-Lite FRS KB. 
We now show that it is possible to convert a general DL-Lite FRS KB into a basic DL-Lite FRS KB that is equivalent to the initial KB from the point of view of KB satisfiability, query answering, and therefore from the point of view of main reasoning services. Intuitively, this can be done by compiling away all qualified existential

51 3.3. DL-LITE A 39 quantifications in the right-hand side of both concept and role inclusion assertions by rewriting them through the use of auxiliary roles. Similarly, all membership assertions assertions involving complex expressions can be compiled away by rewriting them through the use of auxiliary expressions. Specifically, given a DL-Lite FRS KB K, we denote by Conv(K) a KB that is obtained from K by replacing each assertion α involving an expression Y with a set of assertions S(α), according to the rules shown in Fig. 3.1, where we have marked in bold newly introduced auxiliary expressions, that may be either concepts, valuedomains, concept attributes, role attributes, or roles. Then, we have the following. Lemma Let K be a DL-Lite FRS KB. Then, we have that: Proof. 1. K is satisfiable, if and only if Conv(K) is satisfiable; 2. for each conjunctive query q (not involving the newly introduced auxiliary expressions), and for each tuple t of elements of Γ V Γ O, t ans(q, K) if and only if t ans(q, Conv(K)). : : 1. Suppose that Conv(K) is satisfiable. Moreover, suppose that Conv(K) is obtained from K by replacing an assertion α with a set of assertions S(α) according to Fig One can easily verify that S(α) = α. Then, Conv(K) = α. Moreover, by construction, we have that: Conv(K) K \ {α}. Thus, since each model of Conv(K) is also a model of α, then each model of Conv(K) is also a model of K, proving that K is satisfiable. 2. Let Conv(K) be not satisfiable. Then, the claim trivially holds. Thus, let us suppose that Conv(K) is satisfiable. Moreover, let q be a conjunctive query and t be a tuple of constants such that t ans(q, Conv(K)), i.e. Conv(K) = q( t). We want to show that t ans(q, Conv(K)), i.e. K = q( t). Since we showed previously that Conv(K) = K, i.e. each model of Conv(K) is also a model of K, then from Conv(K) = q( t), it follows that K = q( t). 1. Suppose that K is satisfiable and, by contradiction, that Conv(K) is not satisfiable. 
Moreover, suppose that Conv(K) is obtained from K by replacing an assertion α with a set of assertions S(α) according to Fig Say, for instance that α = B R.C K. Then, α does not belong to Conv(K), which in contrast contains the set of the following assertions: B R aux R aux C R aux R where R aux is an new auxiliary role. Let I be a model of K with assignment µ. Note that such a model exists since K is satisfiable. Thus, we can construct an interpretation I, by setting I = I and then extending I as follows:

52 40 CHAPTER 3. THE LANGUAGE Y R.C δ F (U C ) δ F (U R ) δ F (U R ) δ F (U R ) δ F (U R ) A D Q δ(u C ) ρ(u C ) ρ(u R ) C D rdf DataT ype X Y replaced by X R aux R aux C R aux R X δ(u Caux ) ρ(u Caux ) F U Caux U C X δ(u Raux ) ρ(u Raux ) F U Raux U R X δ(u Raux ) ρ(u Raux ) F U Raux U R X δ(u Raux ) ρ(u Raux ) F U Raux U R X δ(u Raux ) ρ(u Raux ) F U Raux U R Y (c) replaced by Y (c, d) replaced by Y (c, d, e) replaced by Y aux(c) Y aux R aux R aux C R aux R Y aux(c) Y aux δ(u Caux ) ρ(u Caux ) F U Caux U C Y aux(c) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c) Y aux Y U C Q Y aux(c, d) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c) Y aux δ(u Raux ) ρ(u Raux ) F U Raux U R Y aux(c, d) Y aux Y U R Y aux(c, d, e) Y aux Y Figure 3.1: Rules for computing Conv(K)
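As an illustration of the first rule of Fig. 3.1 (a Python sketch under an ad-hoc tuple encoding of assertions; it is not the thesis's implementation), a qualified existential B ⊑ ∃R.C is compiled into three basic assertions over a fresh auxiliary role, exactly as in the proof above:

```python
import itertools

_fresh = itertools.count()  # supplies a distinct suffix per auxiliary role

def compile_qualified_existential(b, r, c):
    """Rewrite B ⊑ ∃R.C as { B ⊑ ∃R_aux,  ∃R_aux⁻ ⊑ C,  R_aux ⊑ R },
    where R_aux is a newly introduced auxiliary role."""
    r_aux = f"{r}_aux{next(_fresh)}"
    return r_aux, [
        ("isa", b, ("exists", r_aux)),      # B ⊑ ∃R_aux
        ("isa", ("exists-inv", r_aux), c),  # ∃R_aux⁻ ⊑ C
        ("subrole", r_aux, r),              # R_aux ⊑ R
    ]
```

Applied to employee ⊑ ∃WORKS-FOR.project from the example TBox, this yields a basic fragment in which the auxiliary role never appears; the lemma's equivalence holds precisely because queries are assumed not to mention the auxiliary expressions.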

53 3.3. DL-LITE A 41 for each (o 1, o 2 ) R I such that o 1 B I, we set (o 1, o 2 ) Raux I where o 1, o 2 denote objects in I O. Since I and I differ only because of the fact that (o 1, o 2 ) Raux I, then I satisfy all the assertions in K. In particular, I is such that it satisfies the assertion B R.C. Then, it is easy to verify that I satisfies the assertions above. Thus, I is a model of Conv(K), which contradicts Conv(K) being not satisfiable. With a similar argument, we may prove that by applying another rule among those shown in Fig. 3.1 the result holds. 2. Let q be a conjunctive query and t a tuple of constants such that K = q( t). Again, we can suppose that K satisfiable, since otherwise the claim trivially holds. Moreover, we suppose again that Conv(K) is obtained from K by replacing α = B R.C K, with S(α) as shown above. Then, let I be a model of K with an assignment µ, and suppose to obtain from I a model I of Conv(K) as shown above. Clearly, I = q( t) since I = q( t), I and I differ only because of the fact that (o 1, o 2 ) Raux I and q does not involve R aux by hypothesis. Thus, to prove the claim, we need to prove that there exists no model of Conv(K) such that its restriction to the expressions used in K does not satisfy an assertion in K. By contradiction, let I be a model of Conv(K) not satisfying an assertion β in K. Two cases are possible: either β α; but then we obtain a contradiction, since, by construction, Conv(K) is obtained from K by replacing α with S(α), and thus, I satisfies all assertions in K different from α; or β = α; but since S(α) = α and since I satisfies S(α), we obtain a again a contradiction. Again, with a similar argument, we may prove that by applying another rule among those shown in Fig. 3.1 the result holds. Proposition Let K be a DL-Lite FRS KB. Then, there always exists a basic DL-Lite FRS KB K that is equivalent to K from the point of view of satisfiability and query answering over K. 
Moreover, such a basic KB can be computed in PTIME in the size of K.

Proof. The proof is based on the following observations:
- for each DL-Lite_FRS KB K, Conv(K) is a basic KB;
- by Lemma 3.3.3, Conv(K) is equivalent to K from the point of view of satisfiability and query answering over K;
- by construction of Conv(K), for each assertion in K at most one rule in Fig. 3.1 is applied, which proves the PTIME complexity.

Even though basic DL-Lite_FRS KBs have a form that recalls that of DL-Lite_F and DL-Lite_R, they allow for unrestrictedly combining the features of both these logics.

From the results of [30], it follows that query answering over basic DL-Lite_FRS KBs is no longer in LOGSPACE w.r.t. data complexity, and hence DL-Lite_FRS loses the most interesting computational feature for ontology-based DIS query answering. Thus, we next define a new DL called DL-Lite_A, starting from DL-Lite_FRS and requiring, on the one hand, that KBs be expressed in the basic form and, on the other hand, that the use of functionality be restricted.

Definition. A DL-Lite_A knowledge base K = ⟨T, A⟩ is a basic DL-Lite_FRS KB such that T satisfies the following conditions:
1. for every role inclusion assertion Q ⊑ R in T, where R is an atomic role or the inverse of an atomic role, the assertions (funct R) and (funct R⁻) are not in T;
2. for every concept attribute inclusion assertion U_C ⊑ V_C in T, where V_C is an atomic concept attribute, the assertion (funct V_C) is not in T;
3. for every role attribute inclusion assertion U_R ⊑ V_R in T, where V_R is an atomic role attribute, the assertion (funct V_R) is not in T.

Roughly speaking, a DL-Lite_A knowledge base imposes on the global schema the condition that no functional role can be specialized, i.e. used in the right-hand side of a role inclusion assertion. The same condition is also imposed on every functional (concept or role) attribute. As we will show later, this limitation is sufficient to guarantee that query answering can be reduced to first-order query evaluation over a database.

Example. Clearly, the KB K = ⟨T, A⟩ such that T is the TBox of Example 3.1.1 and A is the previously introduced ABox satisfies the conditions above. Thus, it is an example of a DL-Lite_A KB.
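Condition 1 of the definition is easy to check mechanically. The following Python sketch is purely illustrative (role inverses are encoded as ('inv', P), an assumption of this example, not the thesis's notation):

```python
def satisfies_dllite_a_roles(role_inclusions, functional):
    """Condition 1: no role occurring on the right-hand side of a role
    inclusion may be declared functional (in either direction).
    role_inclusions: iterable of (Q, R); functional: set of P or ('inv', P)."""
    def atomic(role):
        # Strip the inverse marker, if any, to get the underlying atomic role.
        return role[1] if isinstance(role, tuple) else role
    specialized = {atomic(r) for _, r in role_inclusions}
    return all(atomic(f) not in specialized for f in functional)

# From the example TBox: MANAGES ⊑ WORKS-FOR with (funct MANAGES) is allowed,
# since the specialized role WORKS-FOR is not functional; declaring
# (funct WORKS-FOR⁻) instead would violate condition 1.
ok = satisfies_dllite_a_roles([("MANAGES", "WORKS-FOR")], {"MANAGES"})
bad = satisfies_dllite_a_roles([("MANAGES", "WORKS-FOR")], {("inv", "WORKS-FOR")})
```

Here `ok` is True and `bad` is False, matching the intuition that functional roles (and attributes) must not be specialized.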

Chapter 4

DL-Lite_A reasoning

In this chapter we study the main DL-Lite_A reasoning services, i.e. KB satisfiability and query answering. After introducing the representation of a DL-Lite_A KB in a relational DBMS, we present preliminary results that lead us to algorithms for (i) checking DL-Lite_A KB satisfiability, and (ii) solving query answering, both relying on the use of an SQL engine. Along the way, we prove the correctness of these algorithms and study their complexity. All these results provide the foundations for the investigation of ontology-based DISs in the next chapter.

4.1 Storage of a DL-Lite_A ABox

Let K = ⟨T, A⟩ be a DL-Lite_A KB. As already discussed, we will show that DL-Lite_A keeps the nice property of the DLs in the DL-Lite family of allowing query processing to be delegated, after a preprocessing phase that is independent of the data, to an underlying DBMS managing the data layer, i.e. the ABox. Thus, throughout this chapter, we assume that a DL-Lite_A KB K = ⟨T, A⟩ is represented as a database DB, as presented below.

Definition. Given a TBox T and a database DB with domain Γ ∪ V, we say that DB represents a KB K = ⟨T, A⟩ in the context of T if DB is as follows:
- for each atomic concept A, DB contains a unary relation A, and for each tuple (c_o) in that relation there exists one membership assertion A(c_o) in the ABox;
- for each atomic value-domain D, DB contains a unary relation D, and for each tuple (c_v) in that relation there exists one membership assertion D(c_v) in the ABox;
- for each atomic role P, DB contains a binary relation P, and for each tuple (a_1, a_2) in that relation there exists a membership assertion P(a_1, a_2) or P⁻(a_2, a_1) in the ABox;
- for each atomic concept attribute U_C, DB contains a binary relation U_C, and for each tuple (b, d) in that relation there exists a membership assertion U_C(b, d) in the ABox;
- for each atomic role attribute U_R, DB contains a ternary relation U_R, and for each tuple (a_1, a_2, d) in that relation there exists a membership assertion U_R(a_1, a_2, d) in the ABox;
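The representation of the definition above maps directly onto relational tables: one table per atomic predicate, with the predicate's arity as the column count. A minimal sketch using Python's built-in sqlite3 follows (the table and column naming scheme is this example's choice, not the thesis's):

```python
import sqlite3

def store_abox(assertions):
    """assertions: iterable of (predicate, args) pairs. Creates one relation
    per predicate (unary for concepts/value-domains, binary for roles and
    concept attributes, ternary for role attributes) and loads the tuples."""
    db = sqlite3.connect(":memory:")
    for pred, args in assertions:
        cols = ", ".join(f"c{i}" for i in range(len(args)))
        db.execute(f'CREATE TABLE IF NOT EXISTS "{pred}" ({cols})')
        marks = ", ".join("?" for _ in args)
        db.execute(f'INSERT INTO "{pred}" VALUES ({marks})', tuple(args))
    return db

abox = [("manager", ("Lenz",)),
        ("projname", ("DIS-1212", "QuOnto")),
        ("Fresh", ("z",))]  # soft constants are tracked in a separate relation
db = store_abox(abox)
```

SQL queries over these tables then realize ans(Q, DB); for instance, `SELECT c0 FROM manager` returns Lenz.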

- for each tuple (s_v) in the relation Fresh, there exists one soft constant s_v ∈ V.

Intuitively, this will let us deal with soft constants as if they were (hard) constants, without forgetting that they are not. As usual, given any first-order logic query Q, we denote by ans(Q, DB) the set of answers returned by the evaluation of Q over DB.

4.2 Preliminaries

In this section, we present three main constructions that will be crucial for the investigation of DL-Lite_A reasoning, namely the minimal model of the ABox A, the canonical model of K, and the closure of the negative inclusions.

Minimal model for a DL-Lite_A ABox

Given a DL-Lite_A ABox A, we denote by db(A) the (Herbrand) interpretation of A. More precisely, db(A) is the interpretation (Δ^{db(A)}, ·^{db(A)}) such that Δ^{db(A)} is the disjoint union of the two domains O^{db(A)} = Γ_O and V^{db(A)} = Γ_V, and ·^{db(A)} is as follows:
- a^{db(A)} = a, for each constant a ∈ Γ, where Γ = Γ_O ∪ Γ_V;
- A^{db(A)} = { a | A(a) ∈ A }, for each atomic concept A;
- D^{db(A)} = { d | D(d) ∈ A }, for each atomic value-domain D;
- P^{db(A)} = { (a_1, a_2) | P(a_1, a_2) ∈ A }, for each atomic role P;
- U_C^{db(A)} = { (a_1, d) | U_C(a_1, d) ∈ A }, for each atomic concept attribute U_C; and
- U_R^{db(A)} = { (a_1, a_2, d) | U_R(a_1, a_2, d) ∈ A }, for each atomic role attribute U_R.

It is easy to see that db(A) is a minimal Herbrand model of A with a most general assignment μ_0 w.r.t. db(A).

Canonical interpretation

The canonical interpretation of a DL-Lite_A KB is an interpretation constructed according to the notion of chase [5]. In particular, we adapt here the notion of restricted chase adopted by Johnson and Klug in [59]. To this aim, we exploit the notion of most general assignment introduced above.

Definition. Let K = ⟨T, A⟩ be a DL-Lite_A KB.
We call canonical interpretation of K the minimal interpretation can(K) = (Δ^{can(K)}, ·^{can(K)}) of K that satisfies the following conditions, where Δ^{can(K)} is the disjoint union of the sets O^{can(K)} and V^{can(K)}, and we use a and v, possibly with subscripts or primes, to denote respectively an object in O^{can(K)} and a value in V^{can(K)}.

57 4.2. PRELIMINARIES 45 (cr0) can(k) O {a a Γ O occurs in A} {s o s o V O occurs in A}, can(k) V {d d Γ V occurs in A} {s v s v V V occurs in A}, a can(k) = a, for each object constant a, d can(k) = d, for each value constant d, A can(k) = {c o A(c a ) A}, for each atomic concept A, D can(k) = {c v D(c v ) A}, for each atomic value-domain D, U can(k) C = {(a, d) U C (a, d) A}, for each atomic concept attribute U C, U can(k) R = {(a 1, a 2, d) Y R (a 1, a 2, d) A}, for each atomic role attribute U R, P can(k) = {(a 1, a 2 ) P(a 1, a 2 ) A or P (a 2, a 1 ) A}, for each atomic role P (cr1) If a A can(k) 1, A 1 X 2 in T p, and a / X can(k) 2, then: 1. if X 2 = A 2, then add a to A can(k) 2 ; 2. if X 2 = Q 2, where Q 2 = P 2 P 2, then add (a, a n) to Q can(k) 2, where a n is a new element of O can(k) ; 3. if X 2 = Q 2, where Q = δ(u R2 ) δ(u R2 ), then add (a, a n, d n ) to U can(k) R 2, where a n is a new element of can(k) O and d n is a new element of can(k) V ; 4. if X = δ(u C ), then add (a, d n ) to U can(k) C, where d n is a new element of can(k) V. (cr2) If (a, a ) Q can(k) 1, Q 1 X 2 in T p, where Q = P 1 P1 then: 1. if X 2 = A 2, then add a to A can(k) 2 ;, and a / Xcan(K) 2, 2. if X 2 = Q 2, where Q 2 = P 2 P 2, then add (a, a n) to Q can(k) 2, where a n is a new element of O can(k) ; 3. if X 2 = Q 2 in T p, where Q 2 = δ(u R2 ) δ(u R2 ), then add (a, a n, d n ) to U can(k) R 2, where a n is a new element of can(k) O and d n is a new element of can(k) V ; 4. if X 2 = δ(u C ), then add (a, d n ) to U can(k) C, where d n is a new element of can(k) V. (cr3) If (a, d ) U can(k) C 1, δ(u C1 ) X 2 in T p, and a / X can(k) 2, then: 1. if X 2 = A 2, then add a to A can(k) 2 ; 2. if X 2 = Q 2, where Q 2 = P 2 P 2, then add (a, a n) to Q can(k) 2, where a n is a new element of O can(k) ; 3. if X 2 = Q 2, where Q 2 = δ(u R2 ) δ(u R2 ), then add (a, a n, d n ) to Q can(k) 2, where a n is a new element of can(k) O and d n is a new element of can(k) V ;

      4. if X₂ = δ(U_C₂), then add (a, d_n) to U_C₂^can(K), where d_n is a new element of Δ_V^can(K).

(cr4) If (a, a′, d′) ∈ U_R₁^can(K), ∃Q₁ ⊑ X₂ is in T_p, where Q₁ = δ(U_R₁) or Q₁ = δ(U_R₁)⁻, and a ∉ X₂^can(K), then:
      1. if X₂ = A₂, then add a to A₂^can(K);
      2. if X₂ = δ(U_C), then add (a, d_n) to U_C^can(K), where d_n is a new element of Δ_V^can(K);
      3. if X₂ = ∃Q₂, where Q₂ = P₂ or Q₂ = P₂⁻, then add (a, a_n) to Q₂^can(K), where a_n is a new element of Δ_O^can(K);
      4. if X₂ = ∃Q₂, where Q₂ = δ(U_R₂) or Q₂ = δ(U_R₂)⁻, then add (a, a_n, d_n) to U_R₂^can(K), where a_n is a new element of Δ_O^can(K) and d_n is a new element of Δ_V^can(K).

(cr5) If (a₁, a₂) ∈ Q₁^can(K), Q₁ ⊑ X₂ is in T_p, where Q₁ = P₁ or Q₁ = P₁⁻, and (a₁, a₂) ∉ X₂^can(K), then:
      1. if X₂ = P₂ or X₂ = P₂⁻, then add (a₁, a₂) to X₂^can(K);
      2. if X₂ = δ(U_R₂) or X₂ = δ(U_R₂)⁻, then add (a₁, a₂, d_n) to U_R₂^can(K), where d_n is a new element of Δ_V^can(K).

(cr6) If (a₁, a₂, d′) ∈ U_R₁^can(K), Q₁ ⊑ X₂ is in T_p, where Q₁ = δ(U_R₁) or Q₁ = δ(U_R₁)⁻, and (a₁, a₂) ∉ X₂^can(K), then:
      1. if X₂ = P₂ or X₂ = P₂⁻, then add (a₁, a₂) to X₂^can(K);
      2. if X₂ = δ(U_R₂) or X₂ = δ(U_R₂)⁻, then add (a₁, a₂, d_n) to U_R₂^can(K), where d_n is a new element of Δ_V^can(K).

(cr7) If d ∈ D₁^can(K), D₁ ⊑ X₂ is in T_p, and d ∉ X₂^can(K), then:
      1. if X₂ = D₂, then add d to D₂^can(K);
      2. if X₂ = ρ(U_C₂), then add (a_n, d) to U_C₂^can(K), where a_n is a new element of Δ_O^can(K);
      3. if X₂ = ρ(U_R₂), then add (a_n, a′_n, d) to U_R₂^can(K), where a_n, a′_n are new elements of Δ_O^can(K).

(cr8) If (a, d) ∈ U_C₁^can(K), ρ(U_C₁) ⊑ X₂ is in T_p, and d ∉ X₂^can(K), then:
      1. if X₂ = D₂, then add d to D₂^can(K);
      2. if X₂ = ρ(U_C₂), then add (a_n, d) to U_C₂^can(K), where a_n is a new element of Δ_O^can(K);
      3. if X₂ = ρ(U_R₂), then add (a_n, a′_n, d) to U_R₂^can(K), where a_n, a′_n are new elements of Δ_O^can(K).

(cr9) If (a, a′, d) ∈ U_R₁^can(K), ρ(U_R₁) ⊑ X₂ is in T_p, and d ∉ X₂^can(K), then:
      1. if X₂ = D₂, then add d to D₂^can(K);
      2. if X₂ = ρ(U_C₂), then add (a_n, d) to U_C₂^can(K), where a_n is a new element of Δ_O^can(K);
      3. if X₂ = ρ(U_R₂), then add (a_n, a′_n, d) to U_R₂^can(K), where a_n, a′_n are new elements of Δ_O^can(K).

(cr10) If (a, d) ∈ U_C₁^can(K), U_C₁ ⊑ U_C₂ is in T_p, and (a, d) ∉ U_C₂^can(K), then add (a, d) to U_C₂^can(K).

(cr11) If (a₁, a₂, d) ∈ U_R₁^can(K), U_R₁ ⊑ U_R₂ is in T_p, and (a₁, a₂, d) ∉ U_R₂^can(K), then add (a₁, a₂, d) to U_R₂^can(K).

The rules in the previous definition are called chase rules. Although they are numerous and may look complicated, intuitively they simply aim at constructing a Herbrand interpretation of K satisfying the ABox and the set T_p of PI assertions. In particular, we have the following notable property of can(K).

Proposition 4.2.2. Let K = ⟨T, A⟩ be a satisfiable DL-Lite_A KB, and let µ be an assignment for A. Then, for each model I = (Δ^I, ·^I) of K with µ, there exists a homomorphism Ψ from can(K) to I, i.e., a function Ψ such that, for each j-tuple t of elements of Δ^can(K), with j ∈ {1, 2, 3},

    t ∈ X^can(K) implies Ψ(t) ∈ X^I    (4.1)

where X denotes either a concept (in which case j = 1), a value-domain (j = 1), a role (j = 2), a concept attribute (j = 2), or a role attribute (j = 3) in K.

Proof. Let I = (Δ^I, ·^I) be a model of K with µ. We next show how to build a function Ψ from Δ^can(K) to Δ^I, proceeding by induction on the construction of can(K). Simultaneously, we show that Ψ is a homomorphism, i.e., that Ψ satisfies (4.1).

Base step: For each membership assertion α:
– If α = X(s_v), where X denotes either an atomic concept or an atomic value-domain and s_v ∈ V, then, by construction of can(K), we have that s_v ∈ Δ^can(K) and s_v ∈ X^can(K). We then set Ψ(s_v) = µ(s_v).
Thus, since I is a model of α, we have that µ(s_v) ∈ X^I, and (4.1) is satisfied.
– If α = X(t), where X denotes any atomic expression and t = (t₁, …, t_j) ∈ Γ^j, for j = 1, 2, 3, then, by construction of can(K), we have that t_i ∈ Δ^can(K) for each i = 1, …, j, and t ∈ X^can(K). We then set Ψ(t) = t^I. Thus, since I is a model of α, we have that t^I ∈ X^I, and (4.1) is satisfied.

60 48 CHAPTER 4. DL-LITE A REASONING Inductive step: Let can i (K) be the portion of can(k) after i applications of the chase rules. According to the inductive hypothesis, we have that can i (K) satisfies 4.1, i.e. for each tuple t can(k), t X can(k) Ψ( t) I X I. Suppose now that can i+1 (K) is the portion of can(k) that is obtained from can i (K) by application of one among the chase rules, say for instance the rule cr1. Thus suppose that a B can(k) and B X T p. By inductive hypothesis we have that Ψ(a) B I where Ψ(a) I. Now, depending on the form of X the application of cr1 may lead to one of the following cases: if X = A, then we have that a A can i+1(k) ; moreover, since I is a model of K, we have that I satisfies B A and thus, Ψ(a) A I ; if X 2 = Q 2, where Q = δ(u R2 ) δ(u R2 ), then we have that (a, a n, d n ) Q can i+1(k) 2, where a n, d n are new elements resp. of can(k) O, can(k) V ; therefore, Ψ(a n ) and Ψ(d n ) were not yet defined; moreover, since I is a model of T p, then there must exist two elements o I O, w I V such that (Ψ(a), o, w) Q I 2 ; then, by setting Ψ(a n) = o and Ψ(d n ) = w, we obtain (Ψ(a), Ψ(a n ), Ψ(d n )) Q I 2 ; if X = Q 2, where Q 2 = P 2 P2, then we have that (a, a n) Q can i+1(k) 2, where a n is a new element of can(k) O ; therefore, Ψ(a n ) was not yet defined; moreover, since I is a model of T p, then there must exist an element o I O such that (Ψ(a), o) Q I 2 ; then, by setting Ψ(a n ) = o, we obtain (Ψ(a), Ψ(a n )) Q I 2 ; if X = δ(u C ), then we have that (a, d n ) Q can i+1(k) 2, where d n is a new element of V can(k) ; therefore, Ψ(d n ) was not yet defined; moreover, since I is a model of T p,, then there must exist an element v V I such that (Ψ(a), v) U I C ; then, by setting Ψ(d n) = v, we obtain (Ψ(a), Ψ(d n )) U I C. Thus, we proved that if can i+1 (K) is obtained by application of rule cr1 then it still satisfies 4.1. Proceeding analogously with the other chase rules, we can easily prove the claim. 
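To illustrate the flavor of the chase rules above, here is a minimal sketch for a toy fragment with only atomic concepts, atomic roles, and PIs of the forms A₁ ⊑ A₂ and A₁ ⊑ ∃P. The tuple encoding, the names, and the step bound are illustrative assumptions of this sketch (the actual chase of a DL-Lite_A KB may be infinite, so this bounded approximation is not the construction used in the thesis):

```python
from itertools import count

def chase(abox, pis, max_steps=100):
    """Naive bounded chase for a toy DL-Lite fragment.
    Facts: ('A', a) for atomic concepts, ('P', a, b) for atomic roles.
    PIs:   ('isa', A1, A2) encodes A1 <= A2;
           ('ex',  A1, P)  encodes A1 <= exists P (needs a witness)."""
    facts = set(abox)
    fresh = (f"_o{i}" for i in count())       # fresh objects, as in rule cr1
    for _ in range(max_steps):                # the real chase may not terminate
        new = set()
        for tag, lhs, rhs in pis:
            members = [f[1] for f in facts if f[0] == lhs and len(f) == 2]
            for a in members:
                if tag == "isa" and (rhs, a) not in facts:
                    new.add((rhs, a))         # analogue of cr1, case 1
                elif tag == "ex" and not any(
                        f[0] == rhs and len(f) == 3 and f[1] == a for f in facts):
                    new.add((rhs, a, next(fresh)))  # analogue of cr1, case 2
        if not new:
            break                             # fixpoint reached
        facts |= new
    return facts

facts = chase({("Manager", "ann")},
              [("isa", "Manager", "Employee"), ("ex", "Employee", "worksFor")])
# ann becomes an Employee and gets a fresh worksFor-successor
```

Starting from the single fact Manager(ann), the sketch derives Employee(ann) and then a worksFor-fact whose second component is a fresh object, mirroring how the chase rules introduce new elements of Δ_O^can(K).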
The above proposition is important because it shows that, if K is satisfiable, then can(K) can be seen as a representative of all the models of K with µ. As we will see, we will use this property of can(K) several times throughout our proofs.

Closure of negative inclusions

Following the same approach as [31], we next introduce the notion of NI-closure, which results from adapting the corresponding notion of [31] to our logic.

Definition. Let T be a DL-Lite_A TBox. We call NI-closure of T, denoted by cln(T), the TBox obtained inductively as follows:

1. all negative inclusion assertions in T are also in cln(T);

2. if B₁ ⊑ B₂ is in T and B₂ ⊑ ¬B₃ or B₃ ⊑ ¬B₂ is in cln(T), then also B₁ ⊑ ¬B₃ is in cln(T);

3. if E₁ ⊑ E₂ is in T and E₂ ⊑ ¬E₃ or E₃ ⊑ ¬E₂ is in cln(T), then also E₁ ⊑ ¬E₃ is in cln(T);

4. if Q₁ ⊑ Q₂ is in T and ∃Q₂ ⊑ ¬B or B ⊑ ¬∃Q₂ is in cln(T), then also ∃Q₁ ⊑ ¬B is in cln(T);

5. if Q₁ ⊑ Q₂ is in T and ∃Q₂⁻ ⊑ ¬B or B ⊑ ¬∃Q₂⁻ is in cln(T), then also ∃Q₁⁻ ⊑ ¬B is in cln(T);

6. if Q₁ ⊑ Q₂ is in T and Q₂ ⊑ ¬Q₃ or Q₃ ⊑ ¬Q₂ is in cln(T), then also Q₁ ⊑ ¬Q₃ is in cln(T);

7. if one of the assertions ∃Q ⊑ ¬∃Q, ∃Q⁻ ⊑ ¬∃Q⁻, or Q ⊑ ¬Q is in cln(T), then all three such assertions are in cln(T);

8. if U_C₁ ⊑ U_C₂ is in T and δ(U_C₂) ⊑ ¬B or B ⊑ ¬δ(U_C₂) is in cln(T), then also δ(U_C₁) ⊑ ¬B is in cln(T);

9. if U_C₁ ⊑ U_C₂ is in T and ρ(U_C₂) ⊑ ¬E or E ⊑ ¬ρ(U_C₂) is in cln(T), then also ρ(U_C₁) ⊑ ¬E is in cln(T);

10. if U_C₁ ⊑ U_C₂ is in T and U_C₂ ⊑ ¬U_C₃ or U_C₃ ⊑ ¬U_C₂ is in cln(T), then also U_C₁ ⊑ ¬U_C₃ is in cln(T);

11. if one of the assertions ρ(U_C) ⊑ ¬ρ(U_C), δ(U_C) ⊑ ¬δ(U_C), or U_C ⊑ ¬U_C is in cln(T), then all three such assertions are in cln(T);

12. if U_R₁ ⊑ U_R₂ is in T and ρ(U_R₂) ⊑ ¬E or E ⊑ ¬ρ(U_R₂) is in cln(T), then also ρ(U_R₁) ⊑ ¬E is in cln(T);

13. if U_R₁ ⊑ U_R₂ is in T and δ(U_R₂) ⊑ ¬P or P ⊑ ¬δ(U_R₂) is in cln(T), then also δ(U_R₁) ⊑ ¬P is in cln(T);

14. if U_R₁ ⊑ U_R₂ is in T and δ(U_R₂)⁻ ⊑ ¬P or P ⊑ ¬δ(U_R₂)⁻ is in cln(T), then also δ(U_R₁)⁻ ⊑ ¬P is in cln(T);

15. if U_R₁ ⊑ U_R₂ is in T and U_R₂ ⊑ ¬U_R₃ or U_R₃ ⊑ ¬U_R₂ is in cln(T), then also U_R₁ ⊑ ¬U_R₃ is in cln(T);

16. if one of the assertions ρ(U_R) ⊑ ¬ρ(U_R), δ(U_R) ⊑ ¬δ(U_R), or U_R ⊑ ¬U_R is in cln(T), then all three such assertions are in cln(T).

Example. Consider the TBox T of the earlier example. Clearly, the NI-closure of T is the following set of NIs:

    manager ⊑ ¬δ(until)    (4.2)
    manager ⊑ ¬tempemp     (4.3)
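As an illustration, the inductive rules above amount to a plain fixpoint computation. The sketch below is a simplified miniature, not the thesis algorithm: it implements only rules 1 and 2, restricted to basic concepts encoded as strings, with a PI (B1, B2) meaning B1 ⊑ B2 and an NI (B1, B2) meaning B1 ⊑ ¬B2:

```python
def ni_closure(pos, neg):
    """Fixpoint sketch of cln(T) restricted to basic concepts.
    Rule 1: every NI of T is in cln(T).
    Rule 2: if B1 <= B2 in T and B2 <= not-B3 or B3 <= not-B2 in cln(T),
            then B1 <= not-B3 in cln(T)."""
    cln = set(neg)                         # rule 1
    changed = True
    while changed:                         # iterate rule 2 to a fixpoint
        changed = False
        for b1, b2 in pos:
            derived = {(b1, b3) for (x, b3) in cln if x == b2}
            derived |= {(b1, b3) for (b3, x) in cln if x == b2}
            if not derived <= cln:
                cln |= derived
                changed = True
    return cln

cln = ni_closure({("manager", "employee")}, {("employee", "tempemp")})
# manager <= not-tempemp is derived from manager <= employee
# and employee <= not-tempemp
```

The concept names here are illustrative; the full cln(T) would apply the remaining rules for value-domains, roles, and attributes in the same fixpoint loop.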

4.3 Satisfiability of a DL-Lite_A KB

In this section, we investigate the satisfiability of a DL-Lite_A KB. To this aim, following an approach similar to that of [31], we first show some notable properties of the notions introduced in the previous section, and then we show how to exploit such properties to provide an algorithm for checking DL-Lite_A KB satisfiability. Finally, we study its complexity.

Foundations of the algorithm for satisfiability

The algorithm for checking the satisfiability of a DL-Lite_A KB strongly relies on the notions introduced in the previous section. Thus, we start by giving results that relate all these notions to DL-Lite_A KB satisfiability. Specifically, the lemma below shows that the canonical model of a KB always satisfies the set of positive inclusions; moreover, it shows that can(K) is a model of the ABox with an assignment µ₀.

Lemma. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then, we have that:
1. can(K) ⊨ T_p, where T_p denotes the set of positive inclusions in T;
2. there exists a most general assignment µ₀ for A w.r.t. can(K) such that can(K) ⊨ A[µ₀].

Proof. It is easy to see that 1 follows directly from the definition of can(K): indeed, can(K) is built in such a way that every PI in T_p is satisfied (cf. rules cr_i, for i = 1, …, 10). Let us now consider 2. By rule cr0, we have that can(K) ⊨ α for each membership assertion α not involving soft constants. Now, let us construct an assignment µ₀ as follows: for each s ∈ V, µ₀(s) = s. Clearly, by construction, µ₀ is a most general assignment for A w.r.t. can(K). Moreover, can(K) ⊨ α[µ₀] for each membership assertion α.

In contrast with the previous lemma, the following one shows that the canonical model of a KB satisfies the set of functionality assertions if and only if the minimal model of A satisfies such a set of assertions.

Lemma. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then, can(K) ⊨ T_k if and only if db(A) ⊨ T_k, where T_k denotes the set of functionality assertions in T.

Proof. (⟹) By construction, db(A) exactly coincides with the interpretation obtained by applying only rule cr0. Thus, if can(K) satisfies T_k then, clearly, db(A) satisfies T_k.
(⟸) Suppose that db(A) satisfies T_k. Then, by induction on the construction of can(K), we show that can(K) satisfies T_k.
Base step: By hypothesis, db(A) satisfies T_k.

Inductive step: Let can_i(K) be the portion of can(K) obtained after i applications of the chase rules. Suppose that can_i(K) ⊨ T_k and, by contradiction, suppose that can_{i+1}(K) ⊭ T_k, where can_{i+1}(K) is obtained from can_i(K) by applying one of the rules cr_j, with j ∈ {1, …, 11}. It is worth noticing first that not all the rules can cause the violation of a functionality assertion. In particular, there are three types of safe rules:

– first type: rules triggered by an inclusion assertion between concepts or value-domains, whose right-hand side involves neither a role, nor a concept attribute, nor a role attribute;
– second type: rules triggered by an inclusion assertion between concepts or value-domains, whose right-hand side involves a role, a concept attribute, or a role attribute that is not involved in any functionality assertion in T_k;
– third type: rules triggered by an inclusion assertion among roles, concept attributes, or role attributes.

For all these types of rules, assuming that can_{i+1}(K) violates a functionality assertion α would make us conclude that α is already violated in can_i(K), which would lead to a contradiction. Indeed, the application of a rule of the first type does not modify the interpretation of any role, concept attribute, or role attribute. Concerning the rules of the second type, their application modifies only the interpretation of a role, a concept attribute, or a role attribute that is not involved in any functionality assertion. Finally, a similar argument holds for rules of the third type since, by definition of DL-Lite_A, the right-hand side of inclusions between roles, concept attributes, and role attributes does not involve any expression that is also involved in a functionality assertion in T_k.
Therefore, the only rules that may cause can i+1 (K) to violate T k, are the rules triggered by the presence of a concept inclusion assertion whose right-hand side involves a role P or P, a concept attribute U C or a role attribute U R such that T k contains resp. the assertions (funct P), (funct P ), or (funct U C ) or (funct U R ). For instance, let us assume that can i+1 (K) is obtained from can i (K) by application of rule cr1, where we assume that a A can i(k) 1, A 1 X 2 in T p, and a / P can i+1(k). Moreover, we assume that there exists a functionality assertion α involving P which is not satisfied by can i+1 (K). However, in the case in which α = (funct P), for α to be violated, there must exist two pairs of objects (x, y), (x, z) P can i+1(k) such that y z; since we have that (o, o n ) P can i+1(k) and o / P can i(k), there exists no pair (o, o ) P can i+1(k) such that o o n. Hence, we should conclude that the pairs (x, y), (x, z) we are looking for, are such that (x, y), (x, z) P can i(k), but this would lead to a contradiction; in the case in which α = (funct P ), for α to be violated, there must exist two pairs of objects (y, x), (z, x) P can i+1(k) such that y z; since o n is a fresh object, we can conclude that there exists no pair (o, o n )

64 52 CHAPTER 4. DL-LITE A REASONING P can i+1(k) such that o o. Hence, we should conclude that the pairs (y, x), (z, x) we are looking for, are such that (y, x), (z, x) P can i(k), but this would lead to a contradiction. Clearly, with a similar argument we may prove that the claim holds also when other apparently not safe chase rules are applied to can i (K). In the same spirit of [31], we continue characterizing when the canonical model of a KB satisfies the assertions forming the KB. Until now we have considered the set of PI s and the set of functionality assertions. Let us now consider the case NI s. To this aim, we need to use the notion of NI-closure introduced in the previous section. Lemma Let K = T, A be a DL-Lite A KB. Then, can(k) = T ni db(a) = cln(t ni ). where T ni denotes a set of negative inclusions in T. Proof. : Suppose that can(k) is a model of T ni and suppose by contradiction that db(a) does not satisfy an assertion in cln(t ni ). Since can(k) is a model of T ni and cln(t ni ) denotes the set of assertions that are logically implied by T ni, we have that can(k) = cln(t ni ). But then we obtain a contradiction since db(a) coincides with the portion of can(k) obtained by application of rule cr0. : Suppose that db(a) is a model of cln(t ni ). We prove that can(k) is a model of T ni by induction on the structure of can(k). Base step: By hypothesis, db(a) satisfies T ni. Inductive step: Let can i (K) be the portion of can(k) obtained after i applications of the chase rules. Suppose that can i (K) = T ni and, by contradiction, suppose that can i+1 (K) = T ni, where can i+1 (K) is obtained from can i (K) by applying one of the chase rules. For instance, suppose that can i+1 (K) is obtained by application of rule cr1 to can i (K), where we assume that there exists a can(k) such that a A can i(k) 1, A 1 X 2 be a PI in T p and a / X can i(k) 2. Then, we have that a X can i+1(k) 2. 
Now, if can i+1 (K) is not a model of T ni, then there must exist a NI α in T ni that is not satisfied by can i+1 (K). However, by hypothesis, can i (K) and can i+1 (K) differ only for the fact that a / A can i(k) 2 and a A can i+1(k) 2. Then, in order for can i+1 (K) to violate α, this must involve X 2. Thus, for instance, α may assume the form Y 1 A 2 where a Y can i+1(k) 1. But then, since Y 1 A 2 and A 1 A 2, then also A 1 Y 1 belongs to cln(t ni ). Thus, we obtain a contradiction since a Y can i(k) 1 and a A can i+1(k) 1, which contradicts that can i (K) satisfies cln(t ni ). Clearly, with a similar argument we can prove the inductive step even in those cases in which can i+1 (K) is obtained by can i (K) by applying one among the other chase rules.

65 4.3. SATISFIABILITY OF A DL-LITE A KB 53 Next, by giving the two following propositions, we put everything together and we set up the basis of the algorithm for satisfiability. Proposition Let K = T, A be a DL-Lite A KB. Then, µ 0,can(K) = K[µ 0 ] db(a) = T k cln(t ), where µ 0 is a most general assignment for A w.r.t. can(k). Proof. The proof follows directly from lemmas 4.3.2,4.3.3 and Proposition Let K = T, A be a DL-Lite A KB. K is satisfiable µ 0 can(k) = K[µ 0 ], where µ 0 is a most general assignment for A w.r.t. can(k). Proof. : Trivially, if there exists an assignment µ 0 such that can(k) is a model of K with µ 0, then K is satisfiable. : Suppose that K is satisfiable and, by contradiction, that there exists no most general assignment µ 0 for A w.r.t. can(k) such that can(k) is a model of K with µ 0. Since K is satisfiable, by Proposition 3.1.8, there exists an interpretation I and a most general assignment µ 0 for A w.r.t. I such that I is a model of K with µ 0. Then, since can(k) is not a model of K with µ 0, we have in particular that can(k) = K[µ 0 ]. Therefore, by Proposition 4.3.4, we have that either db(a) = T k or db(a) = cln(t ni ). Suppose that db(a) violates, for instance, a role functionality assertion (funct P). By construction of db(a) there exist a 1, a 2, a 3 Γ, a 1 a 2, such that P(a 1, a 2 ), P(a 1, a 3 ) A. But then no model of K satisfies A, thus contradicting the hypothesis that A is satisfiable. Clearly, we would obtain a contradiction also by supposing that db(a) violates another type of functionality assertion. Suppose now that db(a) violates a NI assertion in cln(t ni ). By Lemma 4.3.3, we have that can(k) = T ni. Suppose then for instance that can(k) does not satisfy the NI A B. Then, there must exist a, b can(k) such that a A can(k), and a B can(k). But then, by Proposition 4.2.2, since K is satisfiable and I is a model of K with µ 0, then there exists a homomorphism Ψ from can(k) to I. 
Thus we have Ψ(a) A I and Ψ(a) B I, which clearly contradicts the fact that I is a model of T since I does not satisfy the NI A B that is logically implied by T. Clearly, from the previous two propositions, we are finally able to state the following crucial theorem, that is at the heart of our algorithm for checking DL-Lite A KB satisfiability:

Input: a DL-Lite_A TBox T and a database DB representing K = ⟨T, A⟩ in the context of T
Output: true or false

(1) for each F = (funct X) ∈ T_k do
      Q ← ViolateFunct(F);
      Q ← RewDB(Q);
      if ans(Q, DB) = true then return false
(2) for each NI = X₁ ⊑ ¬X₂ ∈ cln(T_ni) do
      Q ← ViolateNI(NI);
      Q ← RewDB(Q);
      if ans(Q, DB) = true then return false
return true

Figure 4.1: Algorithm Sat(K)

Theorem. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then K is satisfiable if and only if db(A) ⊨ T_k ∪ cln(T).

Proof. Trivial, from Proposition 4.3.4 and Proposition 4.3.5.

Clearly, the above theorem is crucial for our purposes, since it allows us to reduce the satisfiability check of a DL-Lite_A KB to the problem of checking whether a finite model satisfies a set of assertions.

Satisfiability algorithm

Given all the previous results, we are now ready to define, in Figure 4.1, the algorithm Sat for checking the satisfiability of a DL-Lite_A KB. Informally, the algorithm takes as input a DL-Lite_A TBox T and a database DB representing a DL-Lite_A KB K = ⟨T, A⟩ in the context of T, as discussed in Section 4.1, and returns true or false by proceeding as follows. For each functionality assertion F in T, the algorithm starts by constructing a first-order logic query Q that checks whether the functionality assertion (resp., the NI assertion) is violated in the minimal model db(A) of the ABox. To this aim, it calls a function ViolateFunct(F), shown in Figure 4.2, that takes as input any functionality assertion of the form F = (funct X) and returns the boolean first-order logic query Q asking whether there exist two tuples of constants that are both interpreted in db(A) as belonging to X and that together violate F.
Similarly, for each NI X₁ ⊑ ¬X₂ in cln(T), the algorithm calls the function ViolateNI(NI), shown in Figure 4.3, that takes as input any negative inclusion assertion of the form NI = X₁ ⊑ ¬X₂ in the NI-closure of T and returns the boolean first-order logic query Q asking whether there exists a tuple of constants that is interpreted in db(A) as belonging simultaneously to both X₁ and X₂. Afterwards, by means of the function RewDB, Q is rewritten in terms of the database DB. More precisely, given a first-order logic query Q, RewDB builds a query over DB that is

obtained from Q by replacing each occurrence of an atomic expression X (either a concept, a value-domain, a role, a concept attribute, or a role attribute) with the corresponding relation X in DB (note that, by hypothesis, for each DL-Lite_A expression there exists a corresponding relation in DB). Finally, each rewritten query Q is evaluated over DB, and the algorithm returns false if at least one such evaluation returns true, and true otherwise.

Input: a DL-Lite_A functionality assertion F = (funct X)
Output: a boolean query

Case X of:
  X = P:    return q() ← P(w, x) ∧ P(w, y) ∧ x ≠ y;
  X = P⁻:   return q() ← P(x, w) ∧ P(y, w) ∧ x ≠ y;
  X = U_C:  return q() ← U_C(w, x) ∧ U_C(w, y) ∧ x ≠ y;
  X = U_R:  return q() ← U_R(w₁, w₂, x) ∧ U_R(w₁, w₂, y) ∧ x ≠ y;

Figure 4.2: Function ViolateFunct

We have the following lemma:

Lemma. Let K = ⟨T, A⟩ be a DL-Lite_A KB. Then K is unsatisfiable if and only if Q^db(A) = true for some query Q such that Q = ViolateFunct(X) for some functionality assertion X ∈ T_k, or Q = ViolateNI(X) for some NI assertion X ∈ cln(T).

Proof. The proof follows directly from the theorem above.

Given the previous lemma, by construction of the algorithm Sat, we can immediately claim the correctness of Sat(K):

Theorem. Let K be a DL-Lite_A KB. Then K is satisfiable if and only if Sat(K) = true.

From the results in the previous section, we can establish the computational complexity of the satisfiability problem for a DL-Lite_A KB. The proof is omitted, since it can be straightforwardly adapted from [31].

Theorem. Given a DL-Lite_A KB K, Sat(K) runs in LOGSPACE in the size of the database used to represent K (data complexity) and in PTIME in the size of the whole knowledge base (combined complexity).
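Since db(A) is stored as a database, the boolean queries produced by ViolateFunct can be evaluated directly in SQL. The following sketch covers the four cases of Figure 4.2; the relation and column names (o1, o2, o, v) are illustrative assumptions about how the atomic expressions are stored:

```python
import sqlite3

def violate_funct_sql(x):
    """Sketch of ViolateFunct emitting SQL over db(A).  Assumed storage:
    roles P(o1, o2), concept attributes U_C(o, v), role attributes
    U_R(o1, o2, v).  x = (kind, relation); the query returns a row iff
    the corresponding functionality assertion is violated."""
    kind, r = x
    if kind == "P":      # (funct P): one subject, two distinct fillers
        cond = "t1.o1 = t2.o1 AND t1.o2 <> t2.o2"
    elif kind == "P-":   # (funct P-): one filler side, two distinct subjects
        cond = "t1.o2 = t2.o2 AND t1.o1 <> t2.o1"
    elif kind == "UC":   # (funct U_C): one object, two distinct values
        cond = "t1.o = t2.o AND t1.v <> t2.v"
    elif kind == "UR":   # (funct U_R): one object pair, two distinct values
        cond = "t1.o1 = t2.o1 AND t1.o2 = t2.o2 AND t1.v <> t2.v"
    else:
        raise ValueError(kind)
    return f"SELECT 1 FROM {r} t1, {r} t2 WHERE {cond}"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE WORKS_FOR (o1 TEXT, o2 TEXT)")
con.executemany("INSERT INTO WORKS_FOR VALUES (?, ?)",
                [("ann", "acme"), ("ann", "initech")])
# ann has two WORKS_FOR fillers, so (funct WORKS_FOR) is violated:
violated = con.execute(violate_funct_sql(("P", "WORKS_FOR"))).fetchone() is not None
```

This mirrors step (1) of Sat: a non-empty answer to the violation query means the KB is unsatisfiable.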

Input: a DL-Lite_A NI assertion NI = X₁ ⊑ ¬X₂
Output: a boolean query

Case NI of:
  NI is a concept inclusion:
    body ← {};
    for i = 1, 2 do
      Case X_i of:
        X_i = A_i:        body ← body ∧ A_i(x);
        X_i = ∃P_i:       body ← body ∧ P_i(x, v);
        X_i = ∃P_i⁻:      body ← body ∧ P_i(v, x);
        X_i = δ(U_C_i):   body ← body ∧ U_C_i(x, v);
        X_i = ∃δ(U_R_i):  body ← body ∧ U_R_i(x, v, w);
        X_i = ∃δ(U_R_i)⁻: body ← body ∧ U_R_i(v, x, w);
  NI is a value-domain inclusion:
    body ← {};
    for i = 1, 2 do
      Case X_i of:
        X_i = D_i:       body ← body ∧ D_i(x);
        X_i = ρ(U_C_i):  body ← body ∧ U_C_i(v, x);
        X_i = ρ(U_R_i):  body ← body ∧ U_R_i(v, w, x);
  NI is a role inclusion:
    body ← {};
    for i = 1, 2 do
      Case X_i of:
        X_i = P_i:        body ← body ∧ P_i(x, y);
        X_i = P_i⁻:       body ← body ∧ P_i(y, x);
        X_i = δ(U_R_i):   body ← body ∧ U_R_i(x, y, v);
        X_i = δ(U_R_i)⁻:  body ← body ∧ U_R_i(y, x, v);
  NI is a concept attribute inclusion (i.e., X₁ = U_C₁ and X₂ = U_C₂):
    body ← U_C₁(x, y) ∧ U_C₂(x, y);
  NI is a role attribute inclusion (i.e., X₁ = U_R₁ and X₂ = U_R₂):
    body ← U_R₁(x, y, z) ∧ U_R₂(x, y, z);
return q() ← body;

Figure 4.3: Function ViolateNI
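Analogously, the check performed by the query of ViolateNI for a concept NI X₁ ⊑ ¬X₂ amounts to intersecting two extensions. A minimal sketch, with db(A) modeled as a dictionary from concept names to their extensions (an illustrative assumption, not the thesis representation):

```python
def violate_ni(db, x1, x2):
    """Check performed by the query of ViolateNI for a concept NI
    X1 <= not-X2: db maps each atomic concept to its set of constants
    in db(A); the NI is violated iff the two extensions intersect."""
    return bool(db.get(x1, set()) & db.get(x2, set()))

db = {"manager": {"ann"}, "tempemp": {"ann", "bob"}}
violate_ni(db, "manager", "tempemp")   # True: ann is asserted in both
```

A True result corresponds to step (2) of Sat returning false, i.e., the KB is unsatisfiable.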

69 4.4. QUERY ANSWERING OVER DL-LITE A KB Query answering over DL-Lite A KB In what follows, as we did for satisfiability, we start by presenting preliminary results that are at the heart of the algorithm for query answering over a DL-Lite A KB. Then, after a discussion about the relation between query answering in DL-Lite A and query answering in other DLs of the DL-Lite family, we present our query answering algorithm and discuss its correctness and complexity Foundations of query answering algorithm Similarly to satisfiability, the algorithm for solving query answering over a DL-Lite A KB relies on the existence of the canonical model and on its properties. Thus, we start by giving a crucial result that relates the canonical model to DL-Lite A KB query answering. Specifically, the lemma below shows that given a union of conjunctive queries Q over K, if we were able to query the canonical model of a KB, then we would obtain all the answers to Q. Lemma Let K be a satisfiable DL-Lite A KB, and let Q be a union of conjunctive queries over K, of arity n. Moreover, let m be the number of distinct soft constants s j occurring in A. Then, ans(q, K) = { t = (t 1,, t n ) t can(k) Q can(k) µ 0, i {1,, n}, j {1,, m}, t can(k) i µ 0 (s j )} where µ 0 denotes a most general assignment for A w.r.t. can(k). Proof. Let t be a tuple of constants in Γ. First, suppose t ans(q, K). Since KB is satisfiable, by Proposition we have that for each model I of K there exists a most general assignment µ 0 for A such that (i) t I i µ 0(s j ) for each i {1,, n} and each j {1,, m}, (ii) for all models I of K with µ 0, we have that t I Q I. Moreover, by Proposition 4.3.5, since KB is satisfiable we have that can(k) is a model of K with some most general assignment µ 0 of A w.r.t. can(k). Thus, by Proposition 3.1.7, we have that can(k) mod K[µ 0 ] and since t ans(q, K), t can(k) Q can(k). Conversely, suppose t can(k) Q can(k), for some most general assignment µ 0. 
Let Q be the union of conjunctive queries Q = {q 1,...,q k } with q i defined as q i ( x i ) conj i ( x i, y i ) for each i {1,...,k}. Then, there exists i {1,...,k} such that exists an assignment σ : V can(k) that maps the variables V occurring in conj i ( t, y i ) to objects of can(k), such that all atoms in conj i ( t, y i ) under the assignment σ evaluate to true in can(k). Now let I be a model for K with µ 0. By Proposition 4.2.2, there is a homomorphism Ψ from can(k) to I. Consequently, the function obtained by composing Ψ and σ is a function that maps the variables V occurring in conj i ( t, y i ) to objects of the domain of I, such that all atoms in conj i ( t, y i ) under the assignment σ evaluate to true in I. Therefore, t I Q I. Then, by applying Proposition we obtain that: t ans(q, K).

70 58 CHAPTER 4. DL-LITE A REASONING Next, as in [31], we have a property that relates answering unions of conjunctive queries to answering conjunctive queries (the proof is omitted since it can be straightforwardly adapted from the one given in [31]). Theorem Let K be a DL-Lite A KB, and Q a union of conjunctive queries over K. Then, ans(q, K) = ans(q i, K) Query answering algorithm q i Q As already mentioned, the query answering technique for DL-Lite A as well as for the logics of the DL-Lite family introduced in [29], crucially relies on the existence of the canonical interpretation and on the property of such an interpretation to be representative of all models, as proved by Proposition Moreover, from Lemma it follows that query answering, similarly to satisfiability, can in principle be solved by evaluating the query over the canonical model can(k). However, since can(k) is in general infinite, we obviously avoid the construction of can(k). Rather, exactly in the same spirit of [31], our query answering method consists in first compiling the TBox into a finite reformulation of the query, that is afterwards evaluated over the minimal model db(a) of the ABox. This is achieved by applying an Algorithm PerfectRef. As we will see, the only difference of the whole approach that goes beyond the simple adaptation to DL-Lite A of the query answering algorithm proposed in [31], is due to the presence of soft constants in the ABox, whose treatment requires slightly modifying the reformulated query, i.e. the query obtained by means of the PerfectRef Algorithm, before evaluating it over the source database. Note that this is consistent with the formulation of Lemma According to the discussion above, we next adapt to DL-Lite A the approach proposed in [31] to solve query answering. Thus we start by presenting the algorithm for query reformulation, which is responsible for reformulating a query by compiling into the query itself the intensional knowledge in the TBox. 
Then, we present the complete algorithm for query answering. In order to use the reformulation technique of [31], we next define the notion of applicable inclusion assertion. Intuitively, an inclusion I is applicable to an atom g if the predicate of g is equal to the predicate on the right-hand side of I. Definition Let I be a PI inclusion assertion. We say that I is applicable to the atom g and, in this case, we indicate with gr(g, I) the atom obtained from the atom g by applying I if and only if: I is a concept inclusion assertion of the form I = B 1 B and g and B are as follows: g = A(x) and B = A, or, g = P(x, ) and B = P, or, g = P(, x) and B = P,

71 4.4. QUERY ANSWERING OVER DL-LITE A KB 59 or, g = U R (x,, ) and B = δ(u R ), or, g = U R (, x, ) and B = δ(u R ), or, g = U C (x, ) and B = δ(u C ). Then, the form of gr(g, I) depends on B 1 as follows: if B 1 = A 1 then gr(g, I) = A 1 (x); if B 1 = P 1, then gr(g, I) = P 1 (x, ); if B 1 = P 1, then gr(g, I) = P 1(, x); if B 1 = δ(u R1 ), then gr(g, I) = U R1 (x,, ); if B 1 = δ(u R1 ), then gr(g, I) = U R1 (, x, ); if B 1 = δ(u C1 ), then gr(g, I) = U C1 (x, ). I is a domain-value inclusion assertion of the form I = E 1 E and g and E are as follows: g = D(x) and E = D, or, g = U C (, x) and E = ρ(u C ), or, g = U R (,, x) and E = ρ(u R ). Then, the form of gr(g, I) depends on E 1 as follows: if E 1 = D 1 then gr(g, I) = D 1 (x); if E 1 = ρ(u C1 ), then gr(g, I) = U C1 (, x); if E 1 = ρ(u R1 ), then gr(g, I) = R 1 (,, x). I is a role inclusion assertion of the form I = Q 1 Q and g and Q are as follows: g = P(x 1, x 2 ) and Q = P or Q = P, or, g = U R (x 1, x 2, ) and Q = δ(u R ) or Q = δ(u R ). Then, the form of gr(g, I) depends on Q 1 and Q as follows: if Q 1 = P 1 and Q = P, or Q 1 = P 1 and Q = P, then gr(g, I) = P 1 (x 1, x 2 ); if Q 1 = P 1 and Q = P, or Q 1 = P 1 and Q = P, then gr(g, I) = P 1 (x 2, x 1 ); if Q 1 = δ(u R1 ), and Q = δ(u R ), or Q 1 = δ(u R1 ), and Q = δ(u R ) then gr(g, I) = R 1 (x 1, x 2, ); if Q 1 = δ(u R1 ), and Q = δ(u R ), or Q 1 = δ(u R1 ), and Q = δ(u R ) then gr(g, I) = R 1 (x 2, x 1, ); I is a concept attribute inclusion assertion of the form I = U C1 U C and g = U C (x 1, x 2 ). Then, we have that gr(g, I) = U C1 (x 1, x 2 ).

I is a role attribute inclusion assertion of the form I = U_R₁ ⊑ U_R and g = U_R(x₁, x₂, x₃). Then, we have that gr(g, I) = U_R₁(x₁, x₂, x₃).

Input: a conjunctive query q, a DL-Lite_A TBox T
Output: a union PR of conjunctive queries over db(A)

PR ← {q};
repeat
  PR′ ← PR;
  for each q′ ∈ PR′ do
    (a) for each atom g in q′ do
          for each PI I in T do
            if I is applicable to g then PR ← PR ∪ {q′[g/gr(g, I)]}
    (b) for each pair of atoms g₁, g₂ in q′ do
          if g₁ and g₂ unify then PR ← PR ∪ {τ(reduce(q′, g₁, g₂))};
until PR′ = PR;
return PR;

Figure 4.4: Algorithm PerfectRef(q, T)

We are now ready to define, in Figure 4.4, the algorithm PerfectRef, which reformulates a conjunctive query taking into account the PIs of a DL-Lite_A TBox. In the algorithm, q[g/g′] denotes the conjunctive query obtained from q by replacing the atom g with a new atom g′. Informally, the algorithm first reformulates the atoms of each conjunctive query q′ ∈ PR, and produces a new query for each atom reformulation (step (a)). Roughly speaking, PIs are used as rewriting rules, applied from right to left, which allow us to compile into the reformulation the intensional knowledge (represented by T) that is relevant for answering q. At step (b), for each pair of atoms g₁, g₂ that unify and occur in the body of a query q′, the algorithm computes the conjunctive query q″ = reduce(q′, g₁, g₂), obtained by applying to q′ the most general unifier between g₁ and g₂. We point out that, in unifying g₁ and g₂, each occurrence of the symbol _ has to be considered a different unbound variable. The most general unifier substitutes each symbol _ in g₁ with the corresponding argument in g₂, and vice versa (obviously, if both arguments are _, the resulting argument is _). Thanks to the unification, variables that are bound in q′ may become unbound in q″.
Hence, PIs that were not applicable to atoms of q may become applicable to atoms of q' (in the next executions of step (a)). Finally, note that the function τ, applied to q', replaces each occurrence of an unbound variable in q' with the symbol _.

Example. Consider again the DL-Lite_A KB K introduced earlier, and the query q asking for all workers, i.e., those objects that participate in the WORKS-FOR role:

q(x) ← WORKS-FOR(x, y).

Input: DL-Lite_A TBox T and database DB representing K = ⟨T, A⟩, UCQ Q
Output: ans(Q, K)
T' := cln(T);
if K is unsatisfiable then return AllTup(Q, K)
else
    Q' := ⋃_{q_i ∈ Q} PerfectRef(q_i, T');
    Q'' := RewDB(Q');
    Q''' := Clean(Q'');
    return ans(Q''', DB);

Figure 4.5: Algorithm Answer(Q, K)

The result of PerfectRef(q, T) is the following union of queries Q_p:

q(x) ← WORKS-FOR(x, y)
q(x) ← until(x, y, z)
q(x) ← tempemp(x)
q(x) ← employee(x)
q(x) ← manager(x)

The evaluation of Q_p over DB returns the set of certain answers to q over K. Roughly speaking, in order to return all workers, Q_p looks in those concepts, roles, and role attributes whose extension in DB, according to the knowledge specified by T, could provide objects that are workers. Clearly, as in [31], the proposition below holds.

Lemma. Let T be a DL-Lite_A TBox, let q be a conjunctive query over T, and let PR be the union of conjunctive queries returned by PerfectRef(q, T). For every DL-Lite_A ABox A such that ⟨T, A⟩ is satisfiable, ans(q, ⟨T, A⟩) = ans(PR, db(A)).

Proof. The proof is an obvious adaptation of the one proposed in [31].

We are finally able to present the algorithm Answer, shown in Figure 4.5, for answering a union of conjunctive queries over a KB. More precisely, the algorithm takes as input a DL-Lite_A KB K = ⟨T, A⟩, represented by means of a TBox T and a database DB, and a union of conjunctive queries Q of arity n, and returns the set of answers ans(Q, K). As already discussed, Answer is very similar to the algorithm presented in [31] for the computation of the certain answers to a query posed over a DL-Lite_F KB (or a DL-Lite_R KB). Indeed, it differs from the latter only because of the use of the functions (i) RewDB, already introduced and discussed when presenting the algorithm for checking the satisfiability of a DL-Lite_A KB, and (ii) Clean, responsible for constraining the answer not to include any soft constant, coherently

with the lemma above. More precisely, given a union of conjunctive queries Q, Clean proceeds as follows. For each query q in Q, it adds the set of atoms:

{ ¬Fresh(s_i) | s_i is a distinguished variable of q }

Observe that, if K is unsatisfiable, then, as expected, ans(Q, K) is the set of all possible tuples of constants in K whose arity is that of the query. We denote such a set by AllTup(Q, K). We now show the correctness of the algorithm Answer(Q, K).

Theorem. Let K = ⟨T, A⟩ be a DL-Lite_A KB, let Q be a union of conjunctive queries, let U be the set of tuples returned by Answer(Q, K), and let t be a tuple of constants in K. Then, t ∈ ans(Q, K) iff t ∈ U.

Proof. The proof can be straightforwardly adapted from the corresponding one in [31], by observing that PerfectRef computes the union of conjunctive queries that, once reformulated by replacing the DL-Lite_A expressions with the corresponding relations in db(A), would return all the answers that would also be returned by can(K). Thus, in order to select, among all such tuples, those not involving the fresh constants arbitrarily introduced by µ_0, we perform an additional selection by means of the function Clean.

Clearly, as for computational complexity, we get the same bounds as those shown in [31], thus achieving our goal. In particular, we have the following:

Lemma. Let T be a DL-Lite_A TBox, and let q be a conjunctive query over T. The algorithm PerfectRef(q, T) terminates and runs in time polynomial in the size of T.

Theorem. Given a DL-Lite_A KB K, Answer(K, Q) is PTIME in the size of the TBox, and LOGSPACE in the size of the database used to represent K (data complexity).
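As an illustration of the rewriting loop of PerfectRef, the following is a minimal executable sketch. It is a deliberate simplification, not the thesis algorithm: it implements only step (a), restricted to atomic-concept atoms A(x) and two kinds of positive inclusions, A_1 ⊑ A_2 and ∃P ⊑ A; the TBox and query names are illustrative.

```python
# Minimal sketch of the PerfectRef fixpoint (step (a) only), restricted to
# atoms A(x) and to positive inclusions A1 ⊑ A2 ("isa") and ∃P ⊑ A ("dom").
# Atoms are tuples; queries are frozensets of atoms; "_" is the unbound slot.

def gr(atom, pi):
    """Rewrite atom using inclusion pi applied right-to-left, or None."""
    kind = pi[0]
    if kind == "isa" and atom == (pi[2], "x"):   # A1 ⊑ A2, g = A2(x)
        return (pi[1], "x")                      # -> A1(x)
    if kind == "dom" and atom == (pi[2], "x"):   # ∃P ⊑ A, g = A(x)
        return (pi[1], "x", "_")                 # -> P(x, _)
    return None

def perfect_ref(query, tbox):
    """query: set of atoms; tbox: list of inclusions.
    Returns the set of reformulated queries (a union of CQs)."""
    pr = {frozenset(query)}
    while True:
        new = set(pr)
        for q in pr:
            for g in q:
                for pi in tbox:
                    g2 = gr(g, pi)
                    if g2 is not None:
                        new.add(frozenset((q - {g}) | {g2}))
        if new == pr:            # fixpoint reached
            return pr
        pr = new

# Toy TBox: manager ⊑ employee, ∃WORKS-FOR ⊑ employee
tbox = [("isa", "manager", "employee"), ("dom", "WORKS-FOR", "employee")]
qs = perfect_ref({("employee", "x")}, tbox)
print(sorted(min(q) for q in qs))
```

On this toy input the loop reproduces the shape of the running example: asking for employees also rewrites into asking for managers and for objects participating in WORKS-FOR.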

Chapter 5

Consistency and Query Answering over Ontology-based DIS

In this chapter we investigate the main problems concerning DL-Lite_A ontology-based DIS, namely consistency (cf. Section 1.2) and query answering (cf. Section 1.3). To this aim, we start by introducing DL-Lite_A ontology-based DIS. Then, we present an overview of our reasoning method, and we present the core result of this chapter, namely the modularizability of DL-Lite_A reasoning services. Finally, we provide algorithms for DL-Lite_A DIS consistency checking and query answering, based on such a result.

5.1 DL-Lite_A ontology-based DIS

In this section, after discussing the notorious impedance mismatch problem between data and DL-Lite_A ontology objects, we present the syntax and the semantics of DL-Lite_A DIS.

5.1.1 Linking data to DL-Lite_A objects

Ontology-based DIS provide the user with an ontology that the user can access in order to query data actually stored in several, possibly autonomous and heterogeneous, data sources. Since we are interested in ontology-based DIS where the global schema represents the intensional level of a DL ontology, the instances of concepts and roles in the ontology are simply an abstract representation of some real data stored in existing data sources. Therefore, the problem arises of establishing sound mechanisms for linking existing data to objects that are instances of the concepts and the roles in the ontology. To present our solution to this problem, we come back to the concept of object identifiers. These are ad hoc identifiers (e.g., constants in logic) denoting objects that are instances of ontology concepts. Clearly, object identifiers are not to be confused with any data item. Moreover, even if sources may in general store both data and object identifiers, the storage of object identifiers implicitly requires some agreement among data sources on the form for representing them. Thus, to face the possible

absence of such an a-priori agreement, by tracing back to the work done in deductive object-oriented databases [56], we consider a domain of object identifiers that is built starting from data values, in particular as (logic) terms over data items. To realize this idea, we define more precisely the alphabets of constants coming into play. Specifically, while Γ_V contains data value constants as before, Γ_O is built starting from Γ_V and a set Λ of function symbols, each one with an associated arity, i.e., the number of arguments of the function. Formally, let Γ_V be an alphabet of data values. Then, we call object term an expression f(d_1, ..., d_n) such that f ∈ Λ, arity(f) = n, d_1, ..., d_n ∈ Γ_V, and n > 0. In other words, object terms are constructed by applying function symbols to data value constants. We then denote by Γ_O(Λ, Γ_V) the alphabet of object terms built on top of Λ and Γ_V. Thus, we can now define a DL-Lite_A KB over the set of data values Γ_V and the set of object terms Γ_O(Λ, Γ_V). Clearly, the syntax and the semantics of DL-Lite_A expressions and TBoxes do not need to be modified. Concerning the ABox, since it is now specified by using the alphabet Γ that is the disjoint union of Γ_V and Γ_O(Λ, Γ_V), it consists of a finite set of membership assertions of the form:

A(o), A(s_o), D(d), D(s_v), P(o, p), U_C(o, d), U_R(o, p, d)

where o and p are object terms in Γ_O(Λ, Γ_V), s_o and s_v are soft constants in V_O and V_V respectively (as before), and d is a constant in Γ_V. To define the semantics of a DL-Lite_A ABox as above, we simply define an assignment for A and an interpretation I = (Δ^I, ·^I) as before. It is worth noting, however, that ·^I now assigns a different element of Δ^I_O to every object term in Γ_O(Λ, Γ_V) (i.e., we enforce the unique name assumption also on object terms).
Formally, this means that ·^I is such that:

- for all o ∈ Γ_O(Λ, Γ_V), we have that o^I ∈ Δ^I_O;
- for all o, p ∈ Γ_O(Λ, Γ_V), we have that o ≠ p implies o^I ≠ p^I.

Finally, as for the query language, a conjunctive query over a DL-Lite_A KB using object terms is an expression q(x⃗) ← conj(x⃗, y⃗) such that the atoms in conj(x⃗, y⃗) can have the form:

A(x_o), P(x_o, y_o), D(x_v), U_C(x_o, x_v), or U_R(x_o, y_o, x_v)

where A, P, D, U_C and U_R are respectively an atomic concept, an atomic role, an atomic value-domain, an atomic concept attribute, and an atomic role attribute, x_v is a value variable in x⃗, and x_o, y_o may be, besides object variables as for DL-Lite_A, also object terms, called variable object terms, in which value variables appear in place of value constants. Obviously, unions of conjunctive queries can be defined accordingly. Note that, from the point of view of the semantics, conjunctive queries are interpreted exactly as for the case of DL-Lite_A KBs presented in the previous chapter.

5.1.2 Logical framework for DL-Lite_A DIS

Let us now turn our attention to the problem of linking objects in the ontology to the data in the sources. To this end, we assume that data sources are wrapped into

a set of relational sources D. Note that this assumption is indeed realistic, as many data federation tools providing exactly this kind of service are currently available (cf. Section 2.1). In this way, we can assume that all relevant data are virtually represented and managed by a relational data engine, and that we can query data by using SQL. In the following, we make the following assumptions:

- the set of sources D is independent of the ontology; in other words, our aim is to link the ontology to a collection of data that live autonomously, and have not been structured with the purpose of storing the ontology instances;
- the set of sources D is characterized in terms of a set of schemas and instances, where each schema is a specification of one relational schema (i.e., the relation name and the collection of its attributes) for one source in D, and each source is formed by a set of tuples;
- all value constants stored in the set of sources D belong to Γ_V;
- ans(ϕ, D) denotes the set of tuples (of the arity of ϕ) of value constants returned as the result of the evaluation of the SQL query ϕ over the set of data sources D.

We are now able to define DL-Lite_A ontology-based DIS, according to the logical framework presented in Section 1.1.
Given an alphabet of value constants Γ_V and an alphabet of function symbols Λ, a DL-Lite_A ontology-based data integration system (shortly referred to as DL-Lite_A DIS) is characterized by a triple Π = ⟨G, S, M⟩ such that:

- G is a DL-Lite_A TBox; note that G is in fact the intensional level of the ontology;
- S is a set of relations {S_1, ..., S_n} over Γ_V, for n ≥ 1;
- M is a set of sound mappings partitioned into two sets, M_t and M_a, where:
  - M_t is a set of assertions, called typing assertions, each one of the form

    Φ ⇝ T_i

    where Φ is a query over D denoting the projection of one relation over one of its attributes, and T_i is one of the DL-Lite_A data types;
  - M_a is a set of assertions, called mapping assertions, each one of the form

    Φ ⇝ Ψ

    where Φ is a first-order logic query over D of arity n, and Ψ is a DL-Lite_A conjunctive query over G of arity n, without non-distinguished variables, that possibly includes terms in Γ.
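To make the two ingredients concrete, here is a small sketch of object terms and of the right-hand side of a mapping assertion. All names (pers, employee, persname, psi_m2) are illustrative, echoing the running example; the source query Φ is left abstract.

```python
# Sketch of the impedance-mismatch machinery: object terms are function
# symbols applied to data values, and a mapping assertion pairs a source
# query Phi with a conjunction Psi of ontology atoms over such terms.

def obj(symbol, *values):
    """Object term f(d1, ..., dn), n > 0, as a plain tuple; the unique
    name assumption then holds by structural equality."""
    if not values:
        raise ValueError("object terms need at least one argument")
    return (symbol,) + values

# Syntactically different terms denote distinct objects, equal ones the same:
assert obj("pers", "20903") != obj("mgr", "20903")
assert obj("pers", "20903") == obj("pers", "20903")

def psi_m2(s, n):
    """Psi of an M2-style mapping, instantiated on one answer tuple (s, n)."""
    return [("employee", obj("pers", s)),
            ("persname", obj("pers", s), n)]

print(psi_m2("20903", "Palmieri"))
```

Representing terms as plain tuples is a deliberate design choice: it gives the unique name assumption on object terms for free, since two terms coincide iff the function symbol and all value arguments coincide.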

We briefly comment on the assertions in M as defined above. Typing assertions are used to assign appropriate types to constant values appearing in the set of data sources. Basically, these assertions are used for interpreting the values stored at the sources in terms of the types used in the ontology. Mapping assertions, on the other hand, are used to map data in the data sources to concepts, roles, and attributes in the ontology. It is worth noting that, now that we have object terms, the data layer underlying a DL-Lite_A DIS contains only data, whereas object identifiers are virtually built on top of these data. Thus, autonomous data sources can effectively provide their portion of data and contribute to the ontology instance level, without being required to agree on any particular object identification scheme. We next give an example of a DL-Lite_A DIS.

Example. Let Λ = {pers, proj, mgr}, where pers, proj and mgr are function symbols of arity 1. Consider the DL-Lite_A DIS Π = ⟨G, S, M⟩ such that:

- G is the TBox of Example 3.1.1;
- S = {S_1, S_2, S_3, S_4}, with the following signatures:

  S_1[SSN:STRING, PROJ:STRING, D:DATE]
  S_2[SSN:STRING, NAME:STRING]
  S_3[C:STRING, NAME:STRING]
  S_4[C:STRING, SSN:STRING]

- M = M_t ∪ M_a, where M_t is such that:

  ∃y, z. S_1(x, y, z) ⇝ xsd:string(x)
  ∃y. S_2(x, y) ⇝ xsd:string(x)
  ∃y. S_3(x, y) ⇝ xsd:string(x)
  ∃y. S_4(x, y) ⇝ xsd:string(x)
  ∃y, z. S_1(y, x, z) ⇝ xsd:string(x)
  ∃y. S_2(y, x) ⇝ xsd:string(x)
  ∃y. S_3(y, x) ⇝ xsd:string(x)
  ∃y. S_4(y, x) ⇝ xsd:string(x)
  ∃y, z. S_1(y, z, x) ⇝ xsd:date(x)

M_a is as follows:

M_1 : q_db1(s, p, d) ← S_1(s, p, d)
      ⇝ q_G1(s, p, d) ← tempemp(pers(s)), projname(proj(p), p), until(pers(s), proj(p), d)

M_2 : q_db2(s, n) ← S_2(s, n)
      ⇝ q_G2(s, n) ← employee(pers(s)), persname(pers(s), n)

M_3 : q_db3(s, n) ← ∃c. S_3(c, n) ∧ S_4(c, s)
      ⇝ q_G3(s, n) ← manager(pers(s)), persname(pers(s), n)

M_4 : q_db4(c, n) ← S_3(c, n) ∧ ¬∃s. S_4(c, s)
      ⇝ q_G4(c, n) ← manager(mgr(c)), persname(mgr(c), n)

Let D be a set of sources {D_1, D_2, D_3, D_4} conforming to S. D_1 stores tuples (s, p, d), where s and p are strings and d is a date, such that s is the social security number of a temporary employee, p is the name of the project s/he works for (different projects have different names), and d is the ending date of the employment. D_2 stores tuples (s, n) of strings consisting of the social security number s of an employee and her/his name n. D_3 stores tuples (c, n) of strings consisting of the code c of a manager and her/his name n. Finally, D_4 relates manager codes with their social security numbers. Thus, intuitively, typing assertions in M_t establish how to map SQL source datatypes to RDF datatypes of the values occurring in the ontology. Concerning the mapping assertions in M_a, M_1 captures that every tuple (s, p, d) in D_1 corresponds to a temporary employee pers(s), working until d for a project proj(p) whose name is p. M_2 extracts employees pers(s) and their name n. M_3 and M_4 tell us how to extract from D_3 information about managers and their names. When we extract such information, if we can make use of D_4, which provides the social security number of managers (identified by a code in D_3), then we use object terms of the form pers(s). If such information is not available in D_4, then we use object terms of the form mgr(c).

In order to define the semantics of a DL-Lite_A DIS, we need to define when an interpretation satisfies a mapping w.r.t. a set of data sources D.
Thus, let D = {D_1, ..., D_s} be a set of data sources such that D_j conforms to S_j, for each S_j ∈ S. According to the usual semantics of sound mappings (cf. Section 1.1), we say that I satisfies M : Φ ⇝ Ψ w.r.t. D if, for each tuple of values t⃗ in Γ_V, t⃗ ∈ ans(Φ, D) implies t⃗^I ∈ Ψ^I, where, as usual, ans(Φ, D) denotes the set of answers to the query Φ posed over the set of sources D. Thus, we can now give the semantics of a DL-Lite_A DIS. Let D be a set of data sources conforming to S. An interpretation I = (Δ^I, ·^I) is a model of Π w.r.t. D if and only if:

- I is a model of G;

- I satisfies all mapping assertions in M w.r.t. D.

As usual, we say that a DL-Lite_A DIS Π is consistent w.r.t. D if there exists a model of Π w.r.t. D.

Example. Refer to the DL-Lite_A DIS Π of the previous example. A possible set of data sources conforming to S is the following:

D_1 = {(20903, Tones, ...)}
D_2 = {(20903, Palmieri), (55577, Parker)}
D_3 = {(Lenz, Lenzerini), (Abit, Abiteboul)}
D_4 = {(Lenz, 29767)}

One can easily verify that Π is consistent w.r.t. D.

Let Q denote a union of conjunctive queries over Π of arity n. As usual in DIS, Q is expressed in terms of the global schema G. Moreover, we call certain answers to Q posed over Π w.r.t. D the set of n-tuples of constants in Γ_O(Λ, Γ_V) ∪ Γ_V, denoted Q(Π, D), defined as follows:

Q(Π, D) = { t⃗ | t⃗^I ∈ Q^I for every I ∈ sem(Π, D) }

5.2 Overview of consistency and query answering method

In this section, we present an overview of our solution for checking DL-Lite_A DIS consistency and solving query answering. The simplest way to tackle these problems over a DL-Lite_A DIS is to use the mappings to produce an actual ABox, and then to reason on the ontology constituted by the ABox and the original TBox, applying the techniques described in Chapter 4. We call such an approach bottom-up. The bottom-up approach involves a duplication of the data in the database so as to populate the new ABox, and this is clearly unacceptable in several circumstances. So we propose an alternative approach, called top-down, that avoids such a duplication, essentially by keeping the ABox virtual. We sketch out the main ideas of both approaches below, by first presenting the notion of virtual ABox. Then, we provide preliminary basic notions of logic programming upon which the technical development of the next section is built.

5.2.1 The notion of virtual ABox

Definition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of sources conforming to S, and let M be a mapping assertion in M of the form M = Φ ⇝ Ψ.
We call virtual ABox generated by M from D the set of assertions, denoted A(M, D), defined as follows:

A(M, D) = { Ψ[x⃗/t⃗] | t⃗ ∈ ans(Φ, D) }

where t⃗, Φ and Ψ are of arity n, and Ψ[x⃗/t⃗] denotes the formula obtained from Ψ(x⃗) by substituting the n-tuple of variables x⃗ with the n-tuple of constants t⃗ ∈ Γ_V^n. Moreover, we call virtual ABox for Π the set of assertions:

A(M, D) = ⋃_{M ∈ M} A(M, D)
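The definition above can be sketched directly: each mapping pairs a source query with a template of ontology atoms, and the virtual ABox collects every template atom instantiated on every answer tuple. In the sketch below, the mapping and relation names mirror the running example, but the data access is simulated by a lookup function, which is an assumption for illustration, not the thesis machinery.

```python
# Sketch of A(M, D): instantiate each mapping's ontology atoms on every
# answer tuple of its source query.

def virtual_abox(mappings, eval_source):
    """mappings: list of (source_query, template); a template is a list of
    functions tuple -> ground atom.  eval_source(q) plays ans(q, D)."""
    abox = set()
    for phi, template in mappings:
        for t in eval_source(phi):
            for make_atom in template:
                abox.add(make_atom(t))
    return abox

# M2: S2(s, n) ~> employee(pers(s)), persname(pers(s), n)
m2 = ("q2_db", [lambda t: ("employee", ("pers", t[0])),
                lambda t: ("persname", ("pers", t[0]), t[1])])

data = {"q2_db": [("20903", "Palmieri"), ("55577", "Parker")]}
abox = virtual_abox([m2], lambda q: data[q])
print(len(abox))  # 4 membership assertions, as in assertions (5.1)-(5.4)
```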

Notice that A(M, D) is an ABox over Γ_O(Λ, Γ_V) and Γ_V, as shown in the following example.

Example. Let Π = ⟨G, S, M⟩ be the DL-Lite_A DIS of the previous example. Consider in particular the mapping M_2:

M_2 : q_db2(s, n) ← S_2(s, n)
      ⇝ q_G2(s, n) ← employee(pers(s)), persname(pers(s), n)

Then, given the set of sources D of the previous example, we have:

ans(q_db2, D) = {(20903, Palmieri), (55577, Parker)},

and A(M_2, D) is as follows:

employee(pers(20903))            (5.1)
persname(pers(20903), Palmieri)  (5.2)
employee(pers(55577))            (5.3)
persname(pers(55577), Parker)    (5.4)

By proceeding in the same way for each mapping assertion in M, we can easily obtain the virtual ABox of Π. Virtual ABoxes allow for expressing the semantics of a DL-Lite_A DIS in terms of the semantics of DL-Lite_A ontologies, as follows:

Proposition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of data sources conforming to S, and let A(M, D) be the virtual ABox for Π from D. We have that

sem(Π, D) = { I | I ⊨ G and I ⊨ A(M, D) } = Mod(K), where K = ⟨G, A(M, D)⟩.

Proof. Trivial, from the definitions.

Now that we have introduced virtual ABoxes, we start by discussing the bottom-up approach.

5.2.2 A naive bottom-up approach

The proposition above suggests an obvious bottom-up algorithm to solve consistency and query answering over a DL-Lite_A DIS Π = ⟨G, S, M⟩, which we describe next. First, given a set D of data sources conforming to S, we materialize the virtual ABox for Π from D. Second, we apply to the DL-Lite_A KB K = ⟨G, A(M, D)⟩ the algorithms for checking DL-Lite_A KB satisfiability and query answering described in Chapter 4. This way of proceeding is sufficient to solve satisfiability, whereas for query answering over a DL-Lite_A DIS we need to further carefully take into account the possible presence of variable object terms in the query. Intuitively, this requires proceeding as follows.
Given a union of conjunctive queries Q over a DL-Lite_A DIS, we first substitute each distinct variable object term in Q with a new object variable, thus obtaining a query Q', which contains only object variables, object constants, value

variables, and value constants. Therefore, we can process Q' exactly as discussed in the previous chapter. As a result, we obtain a set of tuples whose elements are data values in Γ_V. Finally, by post-processing the answers so as to reconstruct object terms starting from values, we obtain the certain answers to Q.

Unfortunately, the approach described above for solving both DL-Lite_A DIS consistency and query answering has the following drawbacks. First, the algorithm proposed is PTIME in the size of the database, since the generation of the virtual ABox is by itself PTIME. Second, since the database is independent of the ontology, the data it contains may be modified. This would clearly require setting up a mechanism for keeping the virtual ABox up-to-date with respect to the database evolution. Thus, next, we propose a different approach (called top-down), which uses an algorithm that avoids materializing the virtual ABox, and instead takes the mapping specification into account on-the-fly, during reasoning. In this way, we can both keep the computational complexity of the algorithm low, which turns out to be the same as that of the query answering algorithm for DL-Lite_A KBs, i.e., LOGSPACE, and avoid any further procedure for data refreshment.

5.2.3 A top-down approach

We now sketch out the main steps of the top-down approach. First, we rely on the property of DL-Lite_A of allowing the reduction of KB satisfiability and query answering to the evaluation of a first-order logic query Q over the ABox seen as a database. Since DL-Lite_A DIS are not defined in terms of an ABox and a TBox, but rather are specified in terms of a TBox, a set of mappings and a set of data sources, the evaluation of Q cannot be performed over an ABox (unless we accept materializing the virtual ABox as described in the previous section).
Thus, the idea is to further reformulate Q, by taking into account the mapping assertions, so as to produce a query that can be asked directly over the set of data sources. Specifically, we start by reducing the mapping assertions into a split form, such that they can be seen as a part that extracts relevant data from the database and a part that specifies how the object terms of the ontology are built from such data. For the latter, we use logic programming technology, which tells us how to perform unifications and generate the right object terms required in the query Q. Then, making use of the first part of the split, we can formulate a new query over the database that tells us how to instantiate the variables of the original query with actual data. Observe that, in this way, data are accessed only at the very last step, namely at the moment of evaluating the new reformulated query over the set of data sources, and that the evaluation of such a query can be completely delegated to the DBMS that manages the database.

Example. Consider again the DL-Lite_A DIS Π of the previous example and the query q discussed earlier, together with its reformulation Q_p = PerfectRef(q, G).

By splitting the mappings in M we obtain the following portion of a logic program:

tempemp(pers(s)) ← Aux_1(s, p, d)
projname(proj(p), p) ← Aux_1(s, p, d)
until(pers(s), proj(p), d) ← Aux_1(s, p, d)
employee(pers(s)) ← Aux_2(s, n)
persname(pers(s), n) ← Aux_2(s, n)
manager(pers(s)) ← Aux_3(s, n)
persname(pers(s), n) ← Aux_3(s, n)
manager(mgr(c)) ← Aux_4(c, n)
persname(mgr(c), n) ← Aux_4(c, n)

where Aux_k is a predicate denoting the result of the evaluation, over a set of data sources D conforming to S, of the query Φ_k on the left-hand side of the mapping M_k. Finally, we unfold each atom of the query Q_p obtained above, by unifying it in all possible ways with the heads of the clauses above, and we obtain the following union of results:

q_Π = {pers(s) | (s, p, d) ∈ ans(q_db1, D)}
    ∪ {pers(s) | (s, n) ∈ ans(q_db2, D)}
    ∪ {pers(s) | (s, n) ∈ ans(q_db3, D)}
    ∪ {mgr(c) | (c, n) ∈ ans(q_db4, D)}

Note that the whole approach relies crucially on the use of standard notions of logic programming, which we briefly introduce in the next section.

5.2.4 Relevant notions from logic programming
Notice that A W has also a first-order logic interpretation, which is as follows: x 1,, x s (A W). where x 1,...,x s are all variables occurring in W and A.

From now on, when we talk about programs, program clauses and goals, we implicitly mean definite programs, definite program clauses and definite goals, respectively. From a well-known result of logic programming, we have the following crucial property of definite programs:

Proposition (Program minimal model). For each program P, the intersection M_P of all Herbrand models for P is a model of P, called the minimal model of P.

We say that an atom containing no variables is true in a logic program P if it is true in the minimal model of P.

Definition. Let G be the goal ← A_1, ..., A_m, ..., A_k and let C be a program clause A ← B_1, ..., B_q. Then, G' is derived from G and C using the most general unifier (mgu) θ if the following conditions hold:

- A_m is an atom, called the selected atom, in G;
- θ is an mgu of A_m and A;
- G' is the goal ← (A_1, ..., A_{m-1}, B_1, ..., B_q, A_{m+1}, ..., A_k)θ

where (A_1, ..., A_n)θ = A_1θ, ..., A_nθ and Aθ is the atom obtained from A by applying the substitution θ.

Definition. A resultant is an expression of the form Q_1 ← Q_2 where each Q_i (i = 1, 2) is either absent or a conjunction of literals. All variables in Q_1 and Q_2 are assumed to be universally quantified.

Definition. Let P be a program and let G be a goal. Then, a (partial) SLD-tree of P ∪ {G}¹ is a tree satisfying the following:

- each node of the tree is a resultant;
- the root node is Gθ_0 ← G_0, where Gθ_0 = G_0 = G (i.e., θ_0 is the empty substitution);
- let Gθ_0 ··· θ_i ← G_i be a node at depth i ≥ 0 such that G_i has the form A_1, ..., A_m, ..., A_k, and suppose that A_m is the selected atom. Then, for each input clause A ← B_1, ..., B_q such that A_m and A are unifiable with mgu θ_{i+1}, the node has a child Gθ_0 ··· θ_{i+1} ← G_{i+1}, where G_{i+1} is derived from G_i and the input clause by using θ_{i+1}, i.e., G_{i+1} has the form (A_1, ..., B_1, ..., B_q, ..., A_k)θ_{i+1};

¹ Note that this definition of SLD-tree comes from [71].

- nodes which are the empty clause have no children.

Given a branch of the tree, we say that it is a failing branch if it ends in a node such that the selected atom does not unify with the head of any program clause. Moreover, we say that an SLD-tree is complete if all its non-failing branches end in the empty clause. Finally, given a node Gθ_0 ··· θ_i ← G_i at depth i, we say that the derivation of G_i has length i with computed answer θ, where θ is the restriction of θ_0 ··· θ_i to the variables in G.

Now, we state the definition of partial evaluation (PE for short) from [71]. Note that the definition refers to two kinds of PE: the PE of an atom in a program, and the PE of a program w.r.t. an atom.

Definition. Let P be a program, A an atom, and T an SLD-tree for P ∪ {← A}. Let G_1, ..., G_r be a set of (non-root) goals in T such that each non-failed branch of T contains exactly one of them. Let R_i (i = 1, ..., r) be the resultant of the derivation from ← A down to G_i associated with the branch leading to G_i. The set of resultants π = {R_1, ..., R_r} is a PE of A in P. These resultants have the following form:

R_i = Aθ_i ← Q_i    (i = 1, ..., r)

where we have assumed G_i = ← Q_i.

Let P' be the program resulting from P by replacing the set of clauses in P whose head contains A with the clauses in π. Then P' is a PE of P w.r.t. A.

Intuitively, to obtain a PE of an atom A in P, we consider an SLD-tree T for P ∪ {← A}, and choose a cut in T. The PE is defined as the union of the resultants that occur in the cut and do not fail in T.

5.3 DL-Lite_A DIS consistency and query answering

In this section, we present the core result of this chapter, namely the modularizability of the DL-Lite_A DIS consistency and query answering services. Then, we provide algorithms for DL-Lite_A DIS consistency and query answering, based on such a result.
Finally, we discuss computational complexity issues.

5.3.1 Modularizability

In order to introduce the modularizability of DL-Lite_A reasoning services, according to the top-down approach discussed in the previous section, we need to present the notion of split version of a DL-Lite_A DIS. Such a notion characterizes DL-Lite_A DIS having a particularly friendly form. Specifically, given a DL-Lite_A DIS Π = ⟨G, S, M⟩, we compute the split version of Π, denoted Split(Π) = ⟨G', S', M'⟩, by setting G' = G and S' = S, and by constructing M' as follows: for each mapping assertion Φ ⇝ Ψ ∈ M, and for each atom p ∈ Ψ, we add an assertion Φ ⇝ p to M'. Luckily, we have the following.

Proposition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS and D a set of data sources conforming to S. Then, we have that:

sem(Split(Π), D) = sem(Π, D).

Proof. The result follows straightforwardly from the form of the mappings and from the proposition characterizing sem(Π, D) via the virtual ABox given above.

Thus, given any arbitrary DL-Lite_A DIS, we can always reduce it to its split version. Moreover, such a reduction is PTIME in the size of the mappings and does not depend on the size of the data. This allows us to assume, from now on, that we deal only with split versions of DL-Lite_A DIS.

In what follows, we use the definitions given in the previous section to present the modularizability of reasoning in DL-Lite_A. In particular, the goal is to define a function RewDB that, intuitively, takes as input a union of conjunctive queries (possibly with inequalities) Q over the virtual ABox for Π from a set of data sources D conforming to S, and returns a set of resultants describing (i) the queries to pose over D, and (ii) the substitutions to apply to the results in order to obtain the answer to Q. In particular, we start by defining the notion of program for a query.

Definition. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(A(M, D)), possibly including inequalities.
Then, we call program for Q, denoted P(Q), the logic program having the following form:

P(Q) = { ans(x⃗) ← q_g | q_g = σ(q), q ∈ Q }
     ∪ { p_k(f⃗(x⃗)) ← Aux_k(x⃗) | Φ_k(x⃗) ⇝ p_k(f⃗(x⃗)) ∈ M }
     ∪ { Aux_k(t⃗) | t⃗ ∈ ans(Φ_k, D), k = 1, ..., m }
     ∪ { Distinct(v_1, v_2) | v_1, v_2 ∈ Γ_V, v_1 ≠ v_2 }
     ∪ { Distinct(f_1(v⃗_1), f_2(v⃗_2)) | v⃗_1, v⃗_2 tuples over Γ_V, f_1, f_2 ∈ Λ, f_1 ≠ f_2 or v⃗_1 ≠ v⃗_2 }

where m is the number of mappings in M, and for each k ∈ {1, ..., m}:

- Aux_k is an auxiliary predicate whose extension coincides with the set of tuples in ans(Φ_k, D);
- Distinct is an auxiliary predicate whose extension coincides with the set of pairs of distinct terms in Γ_O(Λ, Γ_V), and of distinct constants in D;
- q_g = σ(q) denotes the conjunction of atoms obtained by replacing each inequality x ≠ y in the body of a query q in the union Q with the atom Distinct(x, y);
- ans is an auxiliary predicate having the same arity as Q.

Below, we denote by θ_t⃗ the substitution of the variables in ans with the terms in t⃗. The following lemma states a notable property of the programs defined above.

Lemma. Let Π = ⟨G, S, M⟩ be a DL-Lite_A DIS, D a set of data sources conforming to S, Q a union of conjunctive queries over db(A(M, D)), possibly including inequalities, and P(Q) the program for Q. Then, db(A(M, D)) coincides with the projection over G of the minimal model of P(Q).

87 5.3. DL-LITE A DIS CONSISTENCY AND QUERY ANSWERING 75 Proof. To prove the theorem we first show that for each n-tuple of object terms t Γ(Γ V, Λ) n, if t belongs to X in db(a(m, D)), then we have that X( t) is true in the minimal model M P of P(Q). Consider a tuple t of object terms that belongs to X in db(a(m, D)), i.e., such that X( t) A(M, D). Thus, by construction of A(M, D) we have that there exists a mapping Φ k ( x) X( f(x)) in M, a tuple t of values in Γ V, and a substitution θ : { x/ f(t )} such that t ans(φ k, D) and t = xθ = f(t ). But then, since t ans(φ k, DB), we have that Aux k ( t ) P(Q). Moreover, since Φ k ( x) X( f(x)) is a mapping in M, we have that X( f(x)) Aux k ( x) belongs to P(Q). Thus, θ is a mgu of Aux k ( x) and Aux k ( t ). Therefore, it is possible to derive X( t) from Aux k ( x) and Aux k ( t ) by using θ, which proves that X( t) is true in M P. Conversely, let X( t) be true in the minimal model M P of P(Q), and let X be an expression of G. Clearly, by following a similar line of reasoning as above, we show that t belongs to X in db(a(m, D)). Corollary Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(a(m, D)), possibly with inequalities. Moreover, let P be the program for Q. Then, for each tuple t of constants in Γ(Γ V, Λ) Γ V, we have that: t ans(q,db(a(m, D))) if and only if P(Q) { ansθ t } is unsatisfiable. Proof. The result follows directly from the previous lemma and the construction of P(Q). Given a union of conjunctive queries (possibly with inequalities) Q over db(a(m, D)), let SLD-Derive(P(Q)) be a function that takes as input the program P(Q) and returns a set of resultants, as follows. First, it constructs a SLD-Tree T for P(Q) { ans} as follows: it starts by selecting the atom ans, and then, it continues by selecting the atoms that belong to the alphabet of G, until there are some. 
Second, it returns the set S of the leaves ansθ j q j of T, that do not belong to any failing branch of T. Note that SLD-Derive can use any procedure to compute the SLD-Tree for P(Q) { ans}, provided that the computation rule follows the requirements above. We have the following. Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Moreover, let SLD-Derive(P(Q)). Then, SLD-Derive(P(Q)) is a PE of { ans} w.r.t. P(Q).

88 76 CHAPTER 5. CONSISTENCY AND QUERY ANSWERING OVER ONTOLOGY-BASED DIS Proof. Trivial, by construction and by the definition of PE. Consider now the partial evaluation of P(Q) w.r.t. { ans}. In what follows, we denote it P(Q, S). We then have the following. Lemma Let Π = G, S, M be a DL-Lite A DIS,, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Moreover, let S = SLD-Derive(P(Q)). Then, for each atom A, A is true in P(Q, S) if and only if A is true in P(Q). Proof. The proof follows from a well-known result from [71], stating that if P is a PE of a program P w.r.t. {G}, then P and P are procedurally equivalent, i.e. for each atom A, A belongs to the minimal model of P if and only if A belongs to the minimal model of P. Let S be a (non-empty) PE of P(Q) w.r.t. { ans} and Q be a resultant ansθ q in S. We define unfold Π (S, Q) as the function that returns a (extended form of) resultant Q = ansθ q such that q is a first-order query over D, which is obtained from q by proceeding as follows. At the beginning, q has an empty body. Then, for each atom A in q, if A = Aux k ( x), we add to q the query Φ k ( x); note that, by hypothesis, Φ k ( x) is an arbitrary first-order query with distinguished variables x, that can be evaluated over D; if A = Distinct(x 1, x 2 ), where x 1, x 2 have resp. the form f 1 ( y 1 ) and f 2 ( y 2 ), then: if f 1 f 2, then we do not add any conjunct to q, otherwise, we add the following conjunct: y 1i y 2i, where w is the arity f 1. i {1,...,w} Here again, note that we obtain a disjunction of variables, which can be obviously evaluated over a set of data sources D. Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set f data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Moreover, let S = SLD-Derive(P(Q)). 
Then for each Q = ansθ q S and for each tuple of constants t in Γ V, we have that: ansθ t is true in P(Q, S) if and only if t ans(q, D), where t = t θ, and unfold Π (S, Q) = (ansθ q ). Proof. Let Q = ansθ q be a resultant in S such that q has the form A 1 ( x 1 ),, A n ( x n ). By construction, A i ( x i ) may either have the form:

89 5.3. DL-LITE A DIS CONSISTENCY AND QUERY ANSWERING 77 Aux ki ( x i ); or Distinct( x i ), where x i = (x i1, x i2 ). Suppose that A i has predicate Aux ki for each i j whereas it has predicate Distinct for j < i n. Consider now Q = unfold Π (S, Q). By construction, Q = ansθ q where q has the form: { x, y i11, y i21, y i1wi, y i2wi Φ k1 ( x 1 ),, Φ kj ( x j ), ( i n ( h {1,...,w i } y i 1h y i2h ))} where ( h {1,...,w i } y i 1h y i2h ) occurs together with the corresponding distinguished variables y i1h, y i2h if there is an atom Distinct(x i1, x i2 ) in q such that x i1 = f( y i1 ), x i2 = f( y i2 ) where f has arity w i. Now, let t be a tuple of constants in Γ V. We show next that if q( t) is true in P(Q, S), then t ans(q, D) where t = t θ. Suppose that q( t) is true in P(Q, S). Then there exists θ q such that q( t) = (A 1 ( x 1 ),, A n ( x n ))θ q is true in P(Q, S). This implies that there exist n facts F i in P(Q, S) such that F i = A i θ q is true in P(Q, S) for each i = 1,, n. But then, by construction: if i j, then F i has the form Aux ki ( t i ), which by construction means that t i ans(φ ki, D); otherwise, F i has the form Distinct( t i ), where t i = (t i1, t i2 ) and t i1, t i2 are terms in Γ(Γ V, Λ) such that t i1 t i2, i.e. t i1 = f 1 (v i1 ), t i2 = f 2 (v i2 ) where either f 1, f 2 Λ, f 1 f 2 or v i1 v i2. Then, one can easily verify that t ans(q, D). Indeed, for i j we have trivially that Φ ki ( t i ) is true, whereas for j < i k, we have that if f 1 = f 2, then v i1h v i2h, for h {1,...,w i } where w i is the arity of f 1. Thus, since ansθ q belongs to P(Q, S), then t = t θ, and we have proved the claim. Clearly, the converse of the lemma can be proved by following the same line of reasoning. Before presenting RewDB, we need to introduce one more notion, i.e. the notion of compilation for Q. 
Given a union of conjunctive queries Q, we call compilation for Q, denoted C(Q), the program obtained from P(Q) by eliminating all facts Aux k ( t) and Distinct( t). We then have the following. Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(a(m, D)). Then, we have that SLD-Derive(P(Q)) = SLD-Derive(C(Q)). Proof. The proof follows from the observation that SLD-Derive(P(Q)) constructs a SLD-Tree for P(Q) { ans} by selecting only the atoms in the alphabet of G, and that P(Q) and C(Q) coincide in the clauses containing atoms in the alphabet of G. Now we are finally able to come back to the definition of RewDB. Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a

90 78 CHAPTER 5. CONSISTENCY AND QUERY ANSWERING OVER ONTOLOGY-BASED DIS Algorithm RewDB(Q, Π, D) Input: DL-Lite A DIS Π = G, S, M, set of data sources D conforming to S union of conjunctive queries (possibly with inequalities) Q over db(a(m, D)) Output: set of resultants S build the program C(Q); compute the set of resultants S = SLD-Derive(C(Q)); for each ansθ q S do S unfold Π (S, Q); return S Figure 5.1: The Algorithm RewDB union of conjunctive queries (possibly with inequalities) over db(a(m, D)). We define RewDB(Q,Π, D) as the function that takes as input Q, Π, and D, and returns a set S of resultants by proceeding as shown in Fig We now show the correctness of RewDB. Theorem Let Π = G, S, M be a DL-Lite A DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(a(m, D)). Then, RewDB(Q,Π, D) terminates. Moreover, let S = RewDB(Q,Π, D). Then, for each tuple of constants t in Γ(Γ V, Λ) Γ V and for each tuple of constants t Γ V : t ans(q,db(a(m, D))) if and only if (ansθ q ) S such that t = t θ and t ans(q, D) Proof. Concerning termination, it is clear that RewDB always terminates since, by construction, all its steps terminate. Let us now focus on soundness and completeness of RewDB(Q, Π, D). Specifically, suppose first that RewDB(Q, Π, D) =. Then, by construction, SLD-Derive(C(Q)) =. By Lemma 5.3.8, we also have that SLD-Derive(P(Q)) =. Thus, the SLD-Tree for P(Q) { ans} contains only failing branches, which implies that P(Q) { ans} is satisfiable. Therefore, by Corollary 5.3.4, we have that there exists no t such that t ans(q,db(a(m, D))), which proves that the theorem holds. Suppose now that RewDB(Q,Π, D) = S. Let Q = ansθ q be a resultant in S. Then, by construction, we have that there exists Q = ansθ q in S = SLD-Derive(C(Q)) such that q is a conjunctive query in Q and unfold(s, Q) = Q. 
Thus, since SLD-Derive(C(Q)) = SLD-Derive(P(Q)), by Lemma 5.3.7, we have that for each tuple of constants t in Γ V and for each tuple of constants t in Γ(Γ V, Λ) Γ V : t = t θ t ans(q, D) ansθ t is true in P(Q, S). Let t be a tuple of constants in Γ(Γ V, Λ) Γ V. By Lemma 5.3.5, P(Q, S) is the PE of P(Q) w.r.t. { ansθ t }. Therefore, by Lemma 5.3.6, we have that: ansθ t is true in P(Q, S) ansθ t is true in P(Q).

But then, by Corollary 5.3.4, we obtain: ansθ t is true in P(Q) if and only if t ∈ ans(Q, db(A(M, D))). By definition of the semantics of a union of conjunctive queries with inequalities, we have that t ∈ ans(Q, db(A(M, D))) if and only if there exists a query q in Q such that t ∈ ans(q, db(A(M, D))). Thus, since q is a conjunctive query in Q, we obtain the claim. Note that the correctness of RewDB is crucial, in that it allows completely forgetting the mappings, by compiling them directly into the queries to be posed over the underlying database. This proves the modularizability of the consistency and query answering services for DL-Lite A DIS. Specifically, we will see in the next section that Algorithm RewDB allows reasoning by exploiting, on the one hand, results on reasoning over DL-Lite A KBs, and, on the other hand, the ability of the underlying database to answer arbitrarily complex queries. Consistency algorithm Let Π = G, S, M be a DL-Lite A DIS and D a set of data sources conforming to S. In Fig. 5.2, we present an Algorithm Sat(Π, D) that, thanks to the use of the function RewDB, strongly resembles the Algorithm Sat(K) presented in the previous chapter to check the satisfiability of a DL-Lite A KB. More precisely, for each functionality assertion and each NI in the NI-Closure of G (denoted, as usual, cln(G)), Sat(Π, D) uses the functions ViolateFunct and ViolateNI, defined in the previous chapter, that return a first-order query Q checking whether the minimal model of the virtual ABox generated from the mappings w.r.t. D violates any assertion of the global schema. Then, the algorithm uses the function RewDB(Q), which allows forgetting the mapping assertions by returning the set of resultants S, as discussed in the previous section. After having further extracted from S the union of queries Q, Sat(Π, D) evaluates Q over D and returns false if ans(Q, D) returns true.
Otherwise, if no functionality nor NI assertion generates a query returning true, then Sat(Π, D) returns true. As expected, we have the following result. Theorem Let Π = G, S, M be a DL-Lite A DIS and D a set of sources conforming to S. Then, Sat(Π, D) terminates. Moreover, Π is consistent w.r.t. D if and only if Sat(Π, D) = true. Proof. The termination of the Algorithm follows from the termination of RewDB. Concerning the soundness and the completeness of the algorithm, by Proposition 5.2.3, we have that Π is consistent w.r.t. D if and only if K = G,db(A(M, D)) is unsatisfiable. Moreover, by Lemma 4.3.7, we have that K = G,db(A(M, D)) is unsatisfiable if and only if Q db(a(m,d)) = true for each Q such that Q = ViolateFunct(X) for some functionality assertion X G, or Q = ViolateNI(X) for some NI assertion X cln(g). Thus, in order to prove the theorem, it suffices to prove that: ( )Q db(a(m,d)) = true if and only if ans(q, D) = true,

Input: DL-Lite A DIS Π = G, S, M, a set of sources D conforming to S
Output: true or false
(1) for each F = (funct P) ∈ G do
      Q := ViolateFunct(F);
      S := RewDB(Q);
      Q' := false;
      for each ansθ ← q' ∈ S do
        Q' := Q' ∪ {q'};
      if ans(Q', D) = true then return false
(2) for each NI = X1 ⊑ ¬X2 ∈ cln(G) do
      Q := ViolateNI(NI);
      S := RewDB(Q);
      Q' := false;
      for each ansθ ← q' ∈ S do
        Q' := Q' ∪ {q'};
      if ans(Q', D) = true then return false
(3) return true

Figure 5.2: Algorithm Sat(Π, D)

for each Q described as above, where Q is such that Q = ⋃ { q' | ansθ ← q' ∈ S } and S = RewDB(Q). Clearly, this concludes the proof, since (∗) follows straightforwardly from the correctness of RewDB (cf. Theorem 5.3.9).

Query answering algorithm

Let Π = G, S, M be a DL-Lite A DIS and D a set of data sources conforming to S. In Fig. 5.3, we present an Algorithm Answer(Q, Π, D) that, once again, is very similar to the Algorithm Answer(Q, K) presented in the previous chapter to answer queries posed over a DL-Lite A KB. Informally, the algorithm takes as input a DL-Lite A DIS, a set of data sources conforming to S, and a union of conjunctive queries Q over Π. Then, it proceeds exactly as in the case of DL-Lite A KBs (note that, analogously to the case of KBs, if Π is not consistent w.r.t. D, then ans(Q, Π, D) is the set of all possible tuples of object terms in Γ(Γ V, Λ) and constants in Γ V, denoted AllTup(Q, Π), whose arity is that of the query Q). Thus, it first computes the NI-Closure of G, and then it computes the perfect reformulation Q p of Q. At this point, Answer reformulates Q p by calling RewDB(Q p) to compute the set of resultants S. Then, for each resultant Q' in S, it extracts the conjunctive query in its body, evaluates it over D, and further processes the answers according to the substitution occurring in the head of Q'. We next show the correctness of Algorithm Answer(Q, Π, D).

Input: UCQ Q, DL-Lite A DIS Π = G, S, M, set of data sources D conforming to S
Output: ans(Q, Π, D)

G := cln(G);
if Π is not consistent w.r.t. D
then return AllTup(Q, Π)
else Q p := ⋃ { PerfectRef(q i, G) | q i ∈ Q };
     S := RewDB(Q p);
     R s := ∅;
     for each ansθ ← q' ∈ S do
       R s := R s ∪ ans(q', D)θ;
     return R s;

Figure 5.3: Algorithm Answer(Q, Π, D)

Theorem Let Π = G, S, M be a DL-Lite A DIS, D a set of sources conforming to S, and Q a union of conjunctive queries over Π. Then, Answer(Q, Π, D) terminates. Moreover, let R s be the set of tuples returned by Answer(Q, Π, D), and let t be a tuple of constants in Γ(Γ V, Λ). Then, t ∈ Q(Π, D) if and only if t ∈ R s.

Proof. The termination of the algorithm follows from the termination of the Algorithm PerfectRef and of the function RewDB. Concerning the soundness and completeness of the Algorithm Answer, by Proposition 5.2.3, we have that sem(Π, D) = Mod(K), where K = G, db(A(M, D)). Moreover, given a union of conjunctive queries Q, by Lemma 4.4.5, we have that ans(Q, K) = (Q p) db(A(M,D)), where Q p = PerfectRef(Q). Then, since by definition we have that ans(Q, K) = { t | t I ∈ Q I for every I ∈ Mod(K) } and Q(Π, D) = { t | t I ∈ Q I for every I ∈ sem(Π, D) }, it is easy to see that Q(Π, D) = (Q p) db(A(M,D)). On the other hand, by construction, we have that R s = { t θ | t ∈ ans(q', D), ansθ ← q' ∈ S }. Then, clearly, from the correctness of RewDB (cf. Theorem 5.3.9), we obtain the claim.
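Operationally, the final phase of Answer reduces to evaluating each unfolded resultant at the sources and applying its head substitution to the returned tuples, taking the union over all resultants. The following fragment is only an illustrative sketch of that phase; the representation of resultants as (θ, query) pairs, and all names in it, are our own and not part of the formal development.

```python
def answer(resultants, eval_source):
    """Schematic final phase of Answer: evaluate each unfolded resultant
    over the sources and apply its head substitution theta to the answers.

    resultants : list of (theta, query) pairs, where theta maps a source
                 tuple to an answer tuple (here: a plain Python function)
    eval_source: function query -> set of tuples, the source-level evaluator
    """
    results = set()
    for theta, query in resultants:
        for t in eval_source(query):
            results.add(theta(t))  # the answer is the union over all resultants
    return results
```

For instance, a resultant whose head substitution builds the function term f(v) from a source value v turns the source answers {v1, v2} into {f(v1), f(v2)}.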

Computational complexity

We first prove termination and complexity of RewDB.

Lemma Let Π = G, S, M be a DL-Lite A DIS, D a set of sources conforming to S, and Q a union of conjunctive queries over Π. The function RewDB(Q, Π, D) runs in exponential time w.r.t. the size of Q, and in polynomial time w.r.t. the size of M. Proof. Let Q be a union of conjunctive queries, and let n be the total number of atoms in the bodies of all the conjunctive queries q in Q. Moreover, let m be the number of mappings and let m' be the maximum size of the body of a mapping. The proof follows immediately from considering the cost of each of the three steps of RewDB(Q, Π, D): 1. The construction of C(Q) is clearly polynomial in n and m. 2. The computation of SLD-Derive(C(Q)) first builds a tree of depth at most n such that each of its nodes has at most m children, and, second, processes all the leaves of the tree to obtain the set S of resultants. By construction, this set has size O(m^n). Clearly, the overall computation has complexity O(m^n). 3. Finally, the application of the function unfold Π to each element in S has complexity O(m^n · m'). Based on the above property, we are able to establish the complexity of checking the consistency of a DL-Lite A DIS w.r.t. D, and the complexity of answering unions of conjunctive queries over a DL-Lite A DIS w.r.t. D. Theorem Given a DL-Lite A DIS Π and a set of data sources D, Sat(Π, D) is LOGSPACE in the size of D (data complexity). Moreover, it runs in polynomial time in the size of M, and in polynomial time in the size of G. Proof. The proof of the claim is a consequence of the correctness of the Algorithm Sat(Π, D), established in Theorem , and the following facts: 1. the Algorithm Sat(Π, D) generates a number of queries Q over the minimal model of the virtual ABox that is polynomial in the size of G; 2.
each query Q contains 2 atoms and thus, by Lemma , the application of RewDB to each Q is polynomial in the size of the mapping M and constant in the size of the data sources; 3. the evaluation of a union of conjunctive queries over a database can be computed in LOGSPACE with respect to the size of the database (since unions of conjunctive queries are a subclass of first-order logic queries).
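The control structure whose cost is analysed in the facts above can be sketched as follows. This is an illustrative rendering of the Sat loop under our own simplifying interfaces (the violation-query builders, RewDB, and source evaluation are passed in as black boxes), not the algorithm of Fig. 5.2 itself.

```python
def sat(funct_assertions, ni_assertions, violate_funct, violate_ni,
        rewdb, eval_union):
    """Schematic control loop of Sat(Pi, D): one violation query per
    functionality assertion and per NI in cln(G); consistency holds
    iff no rewritten violation query evaluates to true at the sources."""
    checks = [(f, violate_funct) for f in funct_assertions] + \
             [(ni, violate_ni) for ni in ni_assertions]
    for assertion, violate in checks:
        q = violate(assertion)                       # first-order violation query
        union = [body for _theta, body in rewdb(q)]  # compile the mappings away
        if eval_union(union):                        # a violation is witnessed in D
            return False
    return True
```

Note that the number of iterations depends only on G, and each iteration evaluates a fixed-size query over D, matching the data complexity bound.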

Theorem Given a DL-Lite A DIS Π and a set of data sources D, Answer(Q, Π, D) is LOGSPACE in the size of D (data complexity). Moreover, it runs in polynomial time in the size of M, in exponential time in the size of Q, and in polynomial time in the size of G. Proof. The proof of the claim is a consequence of the correctness of the Algorithm Answer, established in Theorem , and the following facts: 1. the maximum number of atoms in the body of a conjunctive query generated by the Algorithm PerfectRef is equal to the length of the initial query Q; 2. by Lemma , the algorithm PerfectRef(Q, G) runs in time polynomial in the size of G; 3. by Lemma , the cost of applying RewDB to each conjunctive query in the union generated by PerfectRef is exponential in the size of the conjunctive query and polynomial in the size of M; this implies that the query to be evaluated over the data sources can be computed in time exponential in the size of Q, polynomial in the size of M, and constant in the size of D (data complexity); 4. the evaluation of a union of conjunctive queries over a database can be computed in LOGSPACE with respect to the size of the database (since unions of conjunctive queries are a subclass of first-order logic queries).


97 Chapter 6 Updates of Ontologies at the Instance Level In this chapter, we study the notion of update of an ontology expressed as a DL knowledge base. We recall that DL knowledge bases consist of a TBox used to express the intensional level of the ontology, i.e. general knowledge about concepts and their relationships, and an ABox used to express the extensional level of the ontology, i.e. the state of affairs regarding the instances of concepts. In the first section, we introduce a (restricted) variant of DL-Lite A, called DL-Lite FS, that we use for expressing KBs in this chapter. Then, we provide the general framework for instance level update of DL ontologies, by specifying in particular, the formal semantics for update. Afterwards, we address the issue of update in the context of DL-Lite FS : we show that DL-Lite FS is closed with respect to instance level update, in the sense that the result of an update is always expressible by a new DL-Lite FS ABox. This has to be contrasted with the results in [69], which imply that, if we use more expressive logics, instance level updates are generally not expressible in the logic of the original knowledge base. Finally, we provide an algorithm for computing updates in DL-Lite FS and we discuss its formal and computational properties. 6.1 The DL-Lite FS language In this chapter, we consider a restricted variant of DL-Lite A, called DL-Lite FS, which differs from DL-Lite A as follows: DL-Lite FS does not allow for specifying attributes; consequently, it does not allow specifying value-domains, concept attributes, role attributes, ranges nor domains; it does not allow for specifying inclusions among roles, nor negation of roles, thus only basic roles occur in the KB; and it allows general concepts to occur in membership assertions. The first two differences have been introduced essentially for clarity purposes. Indeed, we strongly conjecture that results of this chapter hold for DL-Lite A as well. 85

Concerning the occurrence of general concepts in membership assertions, as we will see, it is due to the need for a more expressive language to reflect an update. However, we point out that in Chapter 3, we showed that given a DL-Lite FRS KB K with general expressions in the ABox, it is possible to build a new KB Conv(K), in PTIME in data complexity, that is equivalent to K from the point of view of query answering. Clearly, this holds also for DL-Lite FS, since it is clearly a restricted variant of DL-Lite FRS. Next, we specify more precisely the syntax of DL-Lite FS KBs. Concepts in DL-Lite FS are defined as follows:

B ::= A | ∃Q
C ::= B | ¬B
Q ::= P | P⁻

where, as usual, A denotes an atomic concept, P an atomic role, B a basic concept, Q a basic role, and C a general concept. The universal assertions allowed in the TBox are of the form

B1 ⊑ B2     inclusion assertion
B1 ⊑ ¬B2    disjointness assertion

and of the form

(funct Q)    functionality assertion

Finally, the membership assertions allowed in a DL-Lite FS ABox are of the form:

C(a),  Q(a, b),  C(z)    membership assertions

6.2 Instance-level ontology update

As already discussed in Section 1.4, several approaches to update have been considered in the literature. Here, we essentially follow Winslett's approach [87, 88]. The intuition behind such an approach is the following. There is an actual state-of-affairs of the world, of which, however, we have only an incomplete description. Such a description identifies a (typically infinite) set of models, each corresponding to a state-of-affairs that we consider possible. Among them, there is one model corresponding to the actual state-of-affairs, but we don't know which. Now, when we perform an update we are changing the actual state-of-affairs.
However, since we don t really know which of our models corresponds to the actual state-of-affairs we apply the change on every possible model, thus getting a new set of models representing the updated situation. Among them, we do have the model corresponding to the updated on the actual stateof-affairs, but again we don t know which. As for how we update each model, only the changes that are absolutely required to accommodate what is explicitly asserted in the update will be performed. Observe that this intuition is essentially the one behind most of the research on reasoning about actions. For example this vision is completely shared by Reiter s variant of Situation Calculus [83]. See in particular [84] where possible worlds are considered explicitly and actions on such worlds correspond to the above description. 1 1 Actually [84] studies also knowledge producing actions (i.e., sensing actions), which are more related with belief revision than update.

99 6.2. INSTANCE-LEVEL ONTOLOGY UPDATE 87 Winslett s approach to update is also completely compatible with the proposal in [69, 17] where updates of a DL ABox without TBox is studied for several expressive DLs. Observe that in those works, since the TBox is not present 2, the intensional level of the ontology is not specified, and updates are only relative to the instance level represented by the ABox. Here, instead, we do consider the intensional level of the ontology, represented by a TBox, although, as we said above, we insist that such level is unchanging. Updates impact only the instance-level of the ontology, according to what specified in the update, in a way consistent with keeping the universal assertions in the intensional level. Before addressing in the next section updates over a DL-Lite FS KB, we next define the general framework for instance-level update of a DL ontology, provide preliminary definitions and specify formally the crucial notion of model update and update. Definition (Containment between interpretations) Let I = ( I, I) and I = ( I, I ) be two interpretations (over the same alphabet). We say that I is contained in I, written I I, iff I, I are such that: if a A I then a A I, for every a, and atomic concept A; if (a, b) R I then (a, b) R I, for every (a, b) and atomic role R. We say that I is properly contained in I, written I I, iff I I but I I. Definition (Difference between interpretation) Let I = ( I, I) and I = ( I, I ) be two interpretations (over the same alphabet). We define the difference between I and I, written I I, as the interpretation ( I, I)I I such that: A I I = A I A I, for every atomic concept A; P I I = P I P I, for every atomic role P ; where S S denotes the symmetric difference between sets S and S, i.e. S S = (S S ) \ (S S ). Definition (Model update) Let T be a TBox in a DL L, I a model of T, and F a finite set of membership assertions expressed in L such that Mod(T ) Mod(F). 
We define the (result of) the update of I with F, denoted by U T (I, F), as follows: U T (I, F) = {I I Mod(T ) Mod(F) and there exists no I Mod(T ) Mod(F) s.t. I I I I } 2 Or, if present, it is assumed to be acyclic. Acyclic TBoxes cannot always be used to model the intensional level of an ontology, since the abbreviations that they introduce can be eliminated without semantic loss. Naturally, since acyclic TBoxes may provide compact representation of complex concepts, they may have an impact on the computational complexity of reasoning.
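On finite interpretations, the containment and difference defined above are directly computable. The sketch below is illustrative code with invented names: it represents an interpretation as a dictionary from predicate names to sets of tuples, and computes containment and the difference predicate-wise.

```python
# An interpretation over a fixed finite domain: predicate name -> set of tuples.
Interp = dict

def contained(i1: Interp, i2: Interp) -> bool:
    """I1 is contained in I2: every extension in I1 is included in I2's."""
    return all(ext <= i2.get(p, set()) for p, ext in i1.items())

def difference(i1: Interp, i2: Interp) -> Interp:
    """The difference of I1 and I2: predicate-wise symmetric difference."""
    preds = set(i1) | set(i2)
    return {p: i1.get(p, set()) ^ i2.get(p, set()) for p in preds}
```

The definition of U T (I, F) above then selects exactly those models of T and F whose difference from I is minimal with respect to this containment.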

Observe that U T (I, F) is the set of models of T and F whose difference w.r.t. I is minimal with respect to set inclusion, and that such a set is non-empty. Definition (Update) Let T be a TBox expressed in a DL L, M ⊆ Mod(T) a set of models of T, and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We define the (result of the) update of M with F, denoted M • T F, as the following set of models: M • T F = ⋃ { U T (I, F) | I ∈ M }. Let K = T, A be a knowledge base in L and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. With a little abuse of notation and terminology, we will write K • T F to denote Mod(K) • T F, and talk about the update of K instead of the update of the models Mod(K) of K. A basic question arises from such definitions: is the result of updating a knowledge base still expressible as a new knowledge base in the same DL? 3 Let us introduce the following definition. Definition (Expressible update) Let K = T, A be a knowledge base expressed in a DL L and F a set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We say that the update of K with F is expressible in L iff there exists an ABox A' expressed in L such that K • T F = Mod( T, A' ). The results in [69] show that, for several quite standard DLs, updates are not expressible in the original language of the knowledge base, even if TBoxes are not considered. Instead, in the case of DL-Lite FS we have the notable property that updates are always expressible in DL-Lite FS itself, as we show in the next section. 6.3 Computing updates in DL-Lite FS ontologies In this section, we address instance-level updates in DL-Lite FS as specified in the previous section.
In particular:

- we show that the result of an update is always expressible within DL-Lite FS, i.e., there always exists a new DL-Lite FS ABox that reflects the changes of the update to the original knowledge base (obviously, the TBox remains unchanged, as required);
- we show that the new ABox resulting from an update can be automatically computed;
- finally, we show that the size of such an ABox is polynomially bounded by the size of the original knowledge base, and moreover that it can be computed in polynomial time.

Before starting the technical development, we illustrate the update on an example to gain some intuition on the problem.

3 Note that this question corresponds to the expressible update problem presented in Section 1.4 for DIS.
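On tiny finite interpretations, the update semantics of Section 6.2 can be checked by brute force. The sketch below is illustrative only: it assumes a fixed finite set of ground atoms, represents an interpretation as the set of atoms it makes true, and all names in it are ours. It enumerates the interpretations, keeps the models of T and F, and retains those at minimal symmetric difference from I.

```python
from itertools import combinations

def interpretations(atoms):
    """All interpretations over a fixed finite set of ground atoms."""
    atoms = list(atoms)
    return [frozenset(c) for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)]

def model_update(i, atoms, sat_t, sat_f):
    """U_T(I, F): the models of T and F whose symmetric difference
    from I is minimal with respect to set inclusion."""
    candidates = [m for m in interpretations(atoms)
                  if sat_t(m) and sat_f(m)]
    return [m for m in candidates
            if not any((i ^ m2) < (i ^ m) for m2 in candidates)]
```

For a TBox stating manager ⊑ employee, updating the model {manager(Lenz), employee(Lenz)} with ¬manager(Lenz) yields the single minimal model {employee(Lenz)}: Lenz stops being a manager but remains an employee.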

Example Consider the ontology presented in Example . Now suppose that Lenz is no longer a manager: we update the ontology with the membership assertion ¬manager(Lenz). Based on the semantics presented above, the result of the update can be expressed by the following ABox: { ¬manager(Lenz), employee(Lenz) }. Note that the new instance level reflects that Lenz is an employee who is not a manager. Interestingly, the fact that Lenz is not a manager implies that he does not manage anything anymore. Nevertheless, he remains an employee, and he still works for the project he used to manage; this would not be captured by simply removing the ABox assertions that are inconsistent with the update. In Fig. 6.1, we provide an algorithm to perform an update over a DL-Lite FS knowledge base. To simplify the presentation, we make use of the following notation. First, we denote by Q⁻ the inverse of Q, i.e., if Q is an atomic role, then Q⁻ is its inverse, while if Q is the inverse of an atomic role, then Q⁻ is the atomic role itself. Second, we write ¬C to denote ¬B if C is B, and B if C is ¬B. Also, we use the notation C1 ⊑ C2 to denote assertions of the form B1 ⊑ B2, B1 ⊑ ¬B2, or ¬B1 ⊑ ¬B2. Finally, we denote by cl(T) the deductive closure of T, which can be defined as the obvious generalization of cln(T) presented in Section 4.2.3, i.e., cl(T) is built from both positive and negative inclusions. Clearly, by following the same line of reasoning as for cln(T), it can be shown that in DL-Lite FS, cl(T) can be computed in polynomial time w.r.t. T. The algorithm in Fig. 6.1 takes as input a satisfiable DL-Lite FS knowledge base K = T, A and a finite set of ground (i.e., not involving soft constants) membership assertions F, and returns either ERROR (if T, F is unsatisfiable) or an ABox A' (otherwise). Roughly speaking, the algorithm proceeds as follows.
After a satisfiability check, it inserts into A all the membership assertions in A and F (lines 3 4), and then uses the Algorithm PerfectRef, presented in Section 4.4, Fig to compute the set F of membership assertions that are logically implied by K and contradict F according to T (lines 5 18) 4. Finally, for each F F, the algorithm deletes F from A, but inserts into A those membership assertions that are logically implied by the membership assertions deleted and do not contradict F (lines 19 32). Lemma Let K = T, A be a satisfiable DL-Lite FS knowledge base, F a finite set of ground DL-Lite FS membership assertions such that Mod(T ) Mod(F), and K the DL-Lite FS knowledge base such that K = T, A, where A = ComputeUpdate(T, A, F). We have that K is always satisfiable. Proof. By construction of the Algorithm, K is obtained from K by: inserting into A : 1. a finite set F of ground membership assertions; these are by hypothesis such that Mod(T ) Mod(F) ; 4 Note that the Algorithm PerfectRef, as introduced in Fig , returns a union of conjunctive queries. Clearly, since here we use it by giving it a ground term as input, then it returns a set of ground atoms, i.e. a set of ground membership assertions

CHAPTER 6. UPDATES OF ONTOLOGIES AT THE INSTANCE LEVEL

INPUT: finite set of ground membership assertions F,
       satisfiable DL-Lite_FS KB ⟨T, A⟩
OUTPUT: an ABox A′, or ERROR
[1]  if ⟨T, F⟩ is not satisfiable then ERROR
[2]  else for each F ∈ F do
[3]      if F = Q(a, b) then F := F ∪ {∃Q(a), ∃Q⁻(b)}
[4]  A′ := A ∪ F; F′ := ∅
[5]  for each F₁ ∈ F do
[6]      if F₁ = C(a) then
[7]          for each G ∈ PerfectRef(¬C(a), T) do
[8]              if G = C′(a) and ⟨T, A⟩ ⊨ C′(a) then
[9]                  F′ := F′ ∪ {C′(a)}
[10]             else if G = ∃Q′(a) then
[11]                 F′ := F′ ∪ {Q′(a, b) | Q′(a, b) ∈ A′}
[12]     else if F₁ = Q(a, b) then
[13]         if (funct Q) ∈ T then
[14]             for each b′ ≠ b s.t. ⟨T, A⟩ ⊨ Q(a, b′) do
[15]                 F′ := F′ ∪ {Q(a, b′)}
[16]         if (funct Q⁻) ∈ T then
[17]             for each a′ ≠ a s.t. ⟨T, A⟩ ⊨ Q(a′, b) do
[18]                 F′ := F′ ∪ {Q(a′, b)}
[19] for each F ∈ F′ do
[20]     if F = C′(a) then
[21]         A′ := A′ \ {C′(a)}
[22]         for each C′ ⊑ C₁ in cl(T) do
[23]             if C₁(a) ∉ F′ then A′ := A′ ∪ {C₁(a)}
[24]         if F = ∃Q(a) then
[25]             for each ∃Q⁻ ⊑ C₂ in cl(T) do
[26]                 A′ := A′ ∪ {C₂(z)}, with z a new soft constant in V
[27]     else if F = Q(a, b) then
[28]         A′ := A′ \ {Q(a, b), ∃Q(a), ∃Q⁻(b)}
[29]         for each ∃Q ⊑ C₃ in cl(T) do
[30]             if C₃(a) ∉ F′ then A′ := A′ ∪ {C₃(a)}
[31]         for each ∃Q⁻ ⊑ C₄ in cl(T) do
[32]             if C₄(b) ∉ F′ then A′ := A′ ∪ {C₄(b)}

Figure 6.1: Algorithm ComputeUpdate(T, A, F)

2. a finite set F′′ of membership assertions that do not contradict F and are logically implied by K (such membership assertions are introduced into A′ at line 23, 26, 30, or 32); these are therefore such that Mod(T) ∩ Mod(F′′) ∩ Mod(F) ≠ ∅;

and by deleting from A the maximal finite set of membership assertions F′′′ = {F₁, ..., Fₘ} that contradict F; these are therefore such that Mod(T) ∩ Mod(F) ∩ Mod(Fᵢ) = ∅ for each Fᵢ ∈ F′′′, and there exists no F′ ∈ A \ F′′′ such that Mod(T) ∩ Mod(F′) ∩ Mod(F) = ∅.

Therefore we have that A′ = (A ∪ F ∪ F′′) \ F′′′. Then, since by hypothesis K is satisfiable, i.e., Mod(T) ∩ Mod(A) ≠ ∅, we have that Mod(K′) = Mod(T) ∩ Mod(A′) ≠ ∅, i.e., K′ is satisfiable.

Next, we deal with termination, soundness and completeness of the algorithm shown in Fig. 6.1.

Lemma (Termination) Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base and F a finite set of ground DL-Lite_FS membership assertions. Then the algorithm ComputeUpdate(T, A, F) terminates, returning ERROR if Mod(T) ∩ Mod(F) = ∅, and an ABox A′ such that ⟨T, A′⟩ is a DL-Lite_FS knowledge base otherwise.

Proof. The termination of ComputeUpdate(T, A, F) follows directly from the termination of the algorithm PerfectRef.

Next, we prove that the algorithm shown in Fig. 6.1 is sound and complete.

Lemma (Soundness) Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and K′ the DL-Lite_FS knowledge base K′ = ⟨T, A′⟩, where A′ = ComputeUpdate(T, A, F). Then, for every model I′ ∈ Mod(K′), we have that there exists I ∈ Mod(K) s.t. I′ ∈ U_T(I, F).

Proof. Let A′ = ComputeUpdate(T, A, F), and let I′ be a model of K′ = ⟨T, A′⟩. We show how to build an interpretation I that is a model of K. In particular, we start from I′ and modify it in order to obtain an interpretation I that satisfies K. Then we prove that I′ ∈ U_T(I, F), i.e., I′ is a model of T and F that is at minimal distance from I. Suppose first that I′ is a model of K.
Then the claim trivially holds by taking I = I′. Suppose now that I′ is not a model of K. Since I′ is by hypothesis a model of T, this means that I′ does not satisfy a set of membership assertions F′′′ = {Fᵢ | i = 1, ..., n} ⊆ A. Then, by construction, Fᵢ has been deleted from A′, for i = 1, ..., n. Let us now modify I′ in order to make it satisfy each Fᵢ in F′′′. Starting from i = 1, we repeatedly apply the function ModelSat, where I₀ = I′, Iₙ = I, and Iᵢ is the interpretation returned by calling ModelSat(Iᵢ₋₁, Fᵢ). Intuitively, ModelSat(Iᵢ₋₁, Fᵢ) modifies Iᵢ₋₁ by changing only the interpretations of constants in Γ that contradict the satisfaction of Fᵢ. More precisely, the computation of ModelSat(Iᵢ₋₁, Fᵢ) proceeds as follows.

1. First, we set Iᵢ = Iᵢ₋₁.

2. Second, we apply the following base rules.

(a) If Fᵢ = C(a), then we set a^{Iᵢ} ∈ C^{Iᵢ}.

(b) If Fᵢ has the form Fᵢ = Q(a, b), we set (a^{Iᵢ}, b^{Iᵢ}) ∈ Q^{Iᵢ}, a^{Iᵢ} ∈ (∃Q)^{Iᵢ} and b^{Iᵢ} ∈ (∃Q⁻)^{Iᵢ}. Moreover, if (funct Q) ∈ T, then for each (a^{Iᵢ₋₁}, b′^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁} such that b′ ≠ b we set (a^{Iᵢ}, b′^{Iᵢ}) ∉ Q^{Iᵢ}, and if there exists no a′ ≠ a such that (a′^{Iᵢ₋₁}, b′^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁}, then we set b′^{Iᵢ} ∉ (∃Q⁻)^{Iᵢ}. Respectively, if (funct Q⁻) ∈ T, then for each (a′^{Iᵢ₋₁}, b^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁} such that a′ ≠ a, we set (a′^{Iᵢ}, b^{Iᵢ}) ∉ Q^{Iᵢ}, and if there exists no b′ ≠ b such that (a′^{Iᵢ₋₁}, b′^{Iᵢ₋₁}) ∈ Q^{Iᵢ₋₁}, then we set a′^{Iᵢ} ∉ (∃Q)^{Iᵢ}.

3. Third, we apply recursively the following rules.

(a) If a ∈ B^{Iᵢ}, B ⊑ C ∈ T and a ∉ C^{Iᵢ₋₁}, then set a ∈ C^{Iᵢ}. Note that this operation modifies the interpretation only if a ∈ B^{Iᵢ} has been set in a previous step (otherwise, since Iᵢ₋₁ is a model of T, a ∈ B^{Iᵢ₋₁} implies a ∈ C^{Iᵢ₋₁}).

(b) If a ∈ (∃Q)^{Iᵢ} and there exists no individual b such that (a, b) ∈ Q^{Iᵢ₋₁}, then add (a, b′) to Q^{Iᵢ} and b′ to (∃Q⁻)^{Iᵢ}, where b′ is an element of the domain of Iᵢ such that, if (funct Q⁻) ∈ T, there exists no a′ s.t. (a′, b′) ∈ Q^{Iᵢ₋₁}. Note that one such b′ always exists, since otherwise Fᵢ could not be satisfied together with T, which is not possible by hypothesis.

(c) If a ∉ (∃Q)^{Iᵢ}, then for each (a, b) ∈ Q^{Iᵢ₋₁} we set (a, b) ∉ Q^{Iᵢ}, and if there exists no a′ ≠ a such that (a′, b) ∈ Q^{Iᵢ₋₁}, then we set b ∉ (∃Q⁻)^{Iᵢ}.

Clearly, by construction, I defined as above is a model of T. Also, I satisfies F′′′, which is by hypothesis the set of membership assertions of A that are not in A′. Moreover, I still satisfies all other membership assertions in A. In fact, suppose by contradiction that there exists F′ ∈ A that is not satisfied by I. This means that, by construction, in order to satisfy F′′′, I′ has been modified so that F′ is not satisfied anymore. But then, this means that F′ contradicts some assertion of F′′′, which is not possible, since F′ and the assertions of F′′′ all belong to A and K is by hypothesis satisfiable.
Therefore I satisfies all membership assertions in A, which proves that I is a model of K. Now, in order to complete the proof, we need to show that I′ ∈ U_T(I, F). By hypothesis, I′ is a model of T. Moreover, since F ⊆ A′, I′ is a model of F. Let us now show that there exists no interpretation I′′ ≠ I′ of T and F that is closer to I than I′, i.e., such that I′′ ∈ Mod(T) ∩ Mod(F) and I′′ is at smaller distance from I than I′. Suppose by contradiction that such an interpretation I′′ exists. Then one of the following cases occurs:

1. either there exists a such that a ∈ A^I, a ∈ A^{I′′} and a ∉ A^{I′};
2. or there exists a such that a ∉ A^I, a ∉ A^{I′′} and a ∈ A^{I′};
3. or there exists (a, b) such that (a, b) ∈ Q^I, (a, b) ∈ Q^{I′′} and (a, b) ∉ Q^{I′};

4. or there exists (a, b) such that (a, b) ∉ Q^I, (a, b) ∉ Q^{I′′} and (a, b) ∈ Q^{I′};

where A and Q denote resp. an atomic concept and a role. Let us consider one by one all the above possible cases, starting from the first. Since I has been obtained from I′ by applying the function ModelSat as specified above, one of the following cases occurs.

Either there exists Fᵢ ∈ F′′′ such that Fᵢ = A(a), where Fᵢ ∈ A; in this case, since Fᵢ contradicts F, we have that A(a) ∈ PerfectRef(¬C(a), T) for some C(a) ∈ F. Therefore, a ∈ A^{I′′} would imply that a ∉ C^{I′′}, which would contradict that I′′ is a model of F.

Or a ∈ A^I has been set by the application of the function ModelSat to some a ∈ B^I with B ⊑ A ∈ T and a ∉ A^{I′}. This means that I′ was previously modified in order to satisfy an assertion in F′′′, and it was necessary to have a ∈ B^I in order to satisfy T. Therefore, again, we obtain a contradiction, since either a ∉ B^{I′′} and I′′ is not a model of F, or a ∈ B^{I′′} and I′′ is not a model of T.

Let us now consider the second case. If a ∉ A^I, a ∉ A^{I′′} and a ∈ A^{I′}, then we have that a ∈ (¬A)^I, a ∈ (¬A)^{I′′} and a ∉ (¬A)^{I′}. Therefore, we can reduce this case to the previous one, and prove that we would similarly obtain a contradiction.

Let us now suppose that I′′ is such that there exists (a, b) with (a, b) ∈ Q^I, (a, b) ∈ Q^{I′′} and (a, b) ∉ Q^{I′}. Then, since I has been obtained from I′ by applying the function ModelSat, I′ has been modified because one of the following cases occurs. Either I′ does not satisfy an assertion Fᵢ = Q(a, b) ∈ F′′′, where Fᵢ ∈ A. In this case, since Fᵢ contradicts F, we have that either (i) Fᵢ contradicts F because of a functionality assertion, or (ii) Fᵢ comes from the perfect reformulation of ¬C(a) for some C(a) ∈ F, which means that Q(a, b) logically implies ¬C(a). Suppose first that Fᵢ contradicts F because of a functionality assertion, e.g. (funct Q) ∈ T for some Q(a, b′) ∈ F, b′ ≠ b.
Then (a, b) ∈ Q^{I′′} would imply that (a, b′) ∉ Q^{I′′}, which would contradict that I′′ is a model of F. Similarly, we would obtain a contradiction by supposing that (funct Q⁻) ∈ T for some Q(a′, b) ∈ F, a′ ≠ a. Suppose now that Fᵢ contradicts F because it logically implies the negation of some assertion in F, e.g. C(a). Then, (a, b) ∈ Q^{I′′} would imply that a ∉ C^{I′′}, which would contradict that I′′ is a model of F.

Or, I is such that a ∈ (∃Q)^I and there existed no b′ such that (a, b′) ∈ Q^{I′}. But then, this means that I′ was previously modified to make I satisfy F′′′ and T. In particular, this means that ∃Q(a) is logically implied by an assertion contradicting F. Therefore, again, if (a, b) ∈ Q^{I′′}, then we obtain a contradiction, since we would have a ∈ (∃Q)^{I′′}, which would imply that I′′ is not a model of F.

Let us now consider the latter case, i.e. the case of (a, b) such that (a, b) ∉ Q^I, (a, b) ∉ Q^{I′′} and (a, b) ∈ Q^{I′}. By inspecting the function ModelSat, we easily note that the only cases in which the interpretation of (a, b) is modified so that (a, b) ∈ Q^{I′} and (a, b) ∉ Q^I are the following.

Either Fᵢ = Q(a, b′) ∈ F′′′ contradicts F′ ∈ F for some b′ ≠ b, where F′ = Q(a, b) and (funct Q) ∈ T. By setting (a, b′) ∈ Q^I we must consequently set (a, b) ∉ Q^I, whereas (a, b) ∈ Q^{I′}. But then, if (a, b) ∉ Q^{I′′}, we obtain a contradiction, since I′′ is not a model of F.

Or Fᵢ = Q(a′, b) ∈ F′′′ contradicts F′ ∈ F for some a′ ≠ a, because F′ = Q(a, b) and (funct Q⁻) ∈ T. This case is analogous to the previous one.

Or (a, b) ∉ Q^I because a ∉ (∃Q)^I. This means that I′ was previously modified to satisfy F′′′, and it was necessary to have a ∉ (∃Q)^I in order to satisfy T. Therefore, again, we obtain a contradiction, since either a ∈ (∃Q)^{I′′} and I′′ is not a model of F, or a ∉ (∃Q)^{I′′} and I′′ is not a model of T. Similarly, we would obtain a contradiction by supposing that (a, b) ∉ Q^I because b ∉ (∃Q⁻)^I; this case is analogous to the previous one.

Therefore, assuming that there exists an interpretation I′′ such that I′′ ∈ Mod(T) ∩ Mod(F) and I′′ is closer to I than I′ leads to a contradiction, which proves that I′ ∈ U_T(I, F).

Lemma (Completeness) Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and K′ the DL-Lite_FS knowledge base K′ = ⟨T, A′⟩, where A′ = ComputeUpdate(T, A, F). Then, for every model I ∈ Mod(K), we have that U_T(I, F) ⊆ Mod(K′).

Proof. To prove the lemma, we proceed by assuming by contradiction that there exists an interpretation I′ ∈ U_T(I, F) that is not a model of K′. Then I′ does not satisfy at least one membership assertion F′ in A′ \ F. We can suppose without loss of generality that there exists only one such assertion. By construction, A′ contains all the assertions of A that do not contradict F, plus other assertions (introduced into A′ at line 23, 26, 30, or 32 of the algorithm ComputeUpdate) that are logically implied by K and do not contradict F. Then, suppose that we modify I′ in order to make it satisfy F′.
We must consequently modify I′ to make it still satisfy T. This can be done by applying to I′ the function ModelReach(I, I′, F′). This function is similar to ModelSat (cf. the proof of soundness) in that it basically modifies I′ by forcing it to satisfy F′ and T. However, since here we aim at building a model of F′ that is closer to I than I′, ModelReach(I, I′, F′) proceeds by performing the same choices as in I. Note that this is always possible, since I is a model of F′. More precisely, the computation of ModelReach(I, I′, F′) returns an interpretation Ī by proceeding as follows.

1. First, we set Ī = I′.

2. Second, we modify Ī in order to make it satisfy F′ as follows.

(a) If F′ = C(a), then we set a ∈ C^Ī.

(b) If F′ = C(z), then we find one constant b ∈ C^I and we set b ∈ C^Ī. Note that such a constant must exist. In fact, F′ is inserted into A′ at line 26. Then, by hypothesis, we have that (i) a ∈ (∃Q)^I, which implies that there exists b s.t. (a, b) ∈ Q^I and b ∈ (∃Q⁻)^I, and (ii) ∃Q⁻ ⊑ C₂ ∈ T, which implies that b ∈ C₂^I. The case of an inverse role is analogous.

(c) If F′ = Q(a, b), then we set (a, b) ∈ Q^Ī, a ∈ (∃Q)^Ī and b ∈ (∃Q⁻)^Ī.

3. Third, we apply recursively the following rules in order to make Ī satisfy T.

(a) If a ∈ B^Ī, B ⊑ C ∈ T and a ∉ C^Ī, then set a ∈ C^Ī. Note that this operation modifies Ī only if a ∈ B^Ī has been set in a previous step (otherwise, since I′ is a model of T, a ∈ B^{I′} implies a ∈ C^{I′}).

(b) If a ∈ (∃Q)^Ī (resp. a ∈ (∃Q⁻)^Ī) and there exists no individual b ∈ Γ such that (a, b) ∈ Q^Ī (resp. (b, a) ∈ Q^Ī), then for each (a, b′) ∈ Q^I (resp. (b′, a) ∈ Q^I), set (a, b′) ∈ Q^Ī (resp. (b′, a) ∈ Q^Ī). Note again that this operation modifies Ī only if a ∈ (∃Q)^Ī has been set in a previous step. Moreover, in this case, there always exists at least one b′ such that (a, b′) ∈ Q^I (resp. (b′, a) ∈ Q^I), since I is a model of F′ and Ī is modified in order to satisfy F′ and everything that is logically implied by F′.

Clearly, by construction, the interpretation Ī obtained as above is a model of T. Moreover, Ī satisfies F′, which is by hypothesis the only membership assertion of A′ that is not satisfied by I′. Moreover, Ī still satisfies all other membership assertions in A′. In fact, suppose by contradiction that there exists F′′ ∈ A′ that is not satisfied by Ī. This means that, by construction, in order to satisfy F′, I′ needs to be modified so that F′′ is not satisfied anymore. But then, this means that F′ contradicts F′′, which is not possible, since F′ and F′′ both belong to A′ and, by the satisfiability lemma above, K′ is satisfiable. Therefore Ī satisfies all membership assertions in A′, which proves that Ī is a model of K′.
Finally, Ī is closer to I than I′, since by construction Ī is obtained by modifying I′ so that Ī interprets a set of objects as I does (whereas I′ does not), and nothing that is interpreted in I′ as in I is interpreted differently in Ī. Therefore, by assuming that I′ ∈ U_T(I, F) and that I′ is not a model of K′, we obtain that it is possible to build a model Ī that is closer to I than I′, which is a contradiction.

From the two lemmas above, we get the following theorem, which sanctions the correctness of our algorithm.

Theorem Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and A′ = ComputeUpdate(T, A, F). Then the set of models resulting from updating K with F coincides with Mod(⟨T, A′⟩).

Interestingly, if we do not allow DL-Lite_FS membership assertions involving soft constants in the ABox, then we lose expressive power, as shown by the following example.

Example The TBox {∃P⁻ ⊑ A₁, A₂ ⊑ ¬∃P} and the ABox {∃P(a)} imply that there exists an object that is both a P-successor of a and an instance of A₁. Now let us consider the update {A₂(a)}. As a result of the update, A₂(a) must be logically implied; hence we must remove ∃P(a) from the ABox, but the fact that there is an instance of A₁ must remain logically implied after the update. It can be easily seen that, to express this in the new ABox, we must use A₁(z), where z is a new soft constant.

Note that a similar observation holds for membership assertions involving general concepts. Next we turn to the computational complexity of computing the update. By analyzing the algorithm we get:

Theorem Let K = ⟨T, A⟩ be a DL-Lite_FS knowledge base, F a finite set of ground DL-Lite_FS membership assertions such that Mod(T) ∩ Mod(F) ≠ ∅, and A′ = ComputeUpdate(T, A, F). Then:

- the size of A′ is polynomially bounded by the size of T ∪ A ∪ F;
- computing A′ can be done in polynomial time in the size of T ∪ A ∪ F.

Proof. The proof of this theorem is an immediate consequence of the following observations:

- there is one call to PerfectRef for each assertion in F;
- PerfectRef(q, T) runs in polynomial time in the size of T, and in exponential time in the size of q; thus, in this case, since the input query is a single ground atom, each call to PerfectRef has cost polynomial in T; moreover, it produces a set of facts whose size is polynomial in the size of T;
- for each F returned by PerfectRef, the check K ⊨ F is in LOGSPACE w.r.t. A;
- for each F ∈ F′, the cost of eliminating F from A′ is clearly polynomial in the size of A.
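To make the polynomial-time computability of the deductive closure concrete, the following is a small fixpoint sketch. It is our own toy rendering, not the thesis' formalization: basic concepts are plain strings, a negated concept ¬B is encoded as the pair `('not', B)`, and roles and functionality assertions are left out.

```python
from itertools import product

def closure(inclusions):
    """Fixpoint closure of a set of DL-Lite-style inclusions.

    Each inclusion is a pair (lhs, rhs); lhs is a concept name,
    rhs is a name or ('not', name).  Two rules are applied until
    no new inclusion is produced:
      - chain:           B1 <= B2 and B2 <= C   give  B1 <= C
      - contrapositive:  B1 <= not B2           gives B2 <= not B1
    The universe of derivable pairs is finite, so this terminates,
    and each pass is polynomial in the current size of the set.
    """
    cl = set(inclusions)
    changed = True
    while changed:
        changed = False
        new = set()
        for (l1, r1), (l2, r2) in product(cl, cl):
            if r1 == l2:                      # chain rule
                new.add((l1, r2))
        for (l, r) in cl:
            if isinstance(r, tuple) and r[0] == 'not':
                new.add((r[1], ('not', l)))   # contrapositive
        if not new <= cl:
            cl |= new
            changed = True
    return cl

# A <= B together with B <= not C yields A <= not C, C <= not B, C <= not A.
t = {("A", "B"), ("B", ("not", "C"))}
cl_t = closure(t)
```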

Part III

XML-based DIS


As we already discussed in Chapter 2, several data integration systems and theoretical works have been proposed for relational data, whereas not much investigation has focused yet on XML-based data integration, with few exceptions (cf. Chapter 2, Fig. 2.1). Our goal in this part of the thesis is to address some of its issues. In particular, we highlight two major issues that emerge in the XML context: (i) the global schema may be characterized by a set of constraints, expressed by means of a DTD and XML integrity constraints; (ii) the concept of node identity requires the introduction of semantic criteria to identify nodes coming from different sources. The latter is similar to the problem of identifying objects in mediator systems [78]. Given the importance of this issue for information integration, much work has recently focused on identifying records representing the same real-world entity and reconciling them to obtain one record per entity (the so-called Entity Resolution [19], or Reference Reconciliation [38], problems). As we shall see, this problem requires a specific solution in the context of XML data integration.

Let us first illustrate XML-based data integration issues by an example. Suppose that a hospital offers access to information about patients and their treatments. Information is stored in XML documents managed by different services of the hospital. However, for privacy and security reasons, each user sees only the parts of the data allowed by her access rights. For instance, statisticians have access to the global schema S_G having the form of the following DTD:

S_G:
<!ELEMENT hospital (patient+, treatment+)>
<!ELEMENT patient (SSN, name, cure*, bill*)>
<!ELEMENT treatment (trid, procedure?)>
<!ELEMENT procedure (treatment+)>

To simplify, and following a common approach for XML data, we consider XML documents as unordered trees, with nodes labeled with element names.
The above DTD says that the document contains data about patients and hospital treatments, where a cure is nothing but a treatment id. Moreover, a set of key and foreign key constraints is specified over the global schema. In particular, we know that two patients cannot have the same social security number SSN, that two treatments cannot have the same number trid, and that all the prescribed cures have to appear among the treatments of the hospital. Such constraints correspond respectively to two key constraints and one foreign key constraint. Finally, assume that the sources consist of the following two documents, D1 and D2, with DTDs S1 and S2.

D1:
<hospital>
  <patient>
    <name>Parker</name>
    <SSN>55577</SSN>
  </patient>
  <patient>
    <name>Rossi</name>
    <SSN>20903</SSN>
  </patient>
</hospital>

S1:
<!ELEMENT hospital (patient*)>
<!ELEMENT patient (name, SSN)>

D2:
<hospital>
  <patient>
    <SSN>55577</SSN>
  </patient>
</hospital>

S2:
<!ELEMENT hospital (patient*)>
<!ELEMENT patient (SSN)>

By means of mappings, we specify that D1 contains patients with a name and a social security number below a given threshold, and that D2 contains patients that paid a bill and were prescribed at least one dangerous cure (we assume that these have numbers smaller than 35). Moreover, we specify that these mappings are sound, which means that D1 and D2 contain resp. a subset of all patients having a name and a social security number below the threshold, and a subset of all patients having paid a bill and been prescribed a dangerous cure. Note that if we had known that the sources contained exactly all such patients, then the mappings would have been exact, instead of sound. Suppose now that a user asks the following queries:

1. Find the name and the SSN of all patients having a name and a SSN, that paid a bill and that were prescribed at least one cure.
2. Does the hospital offer dangerous treatments?

As usual in DIS, our goal is to find the certain answers, i.e. the answers that are returned over all data trees that satisfy the global schema and conform to the data at the sources. By adapting the data integration terminology introduced in Chapter 1, we call them legal data trees. A crucial point here is that knowledge about legal data trees may be obtained by merging the source trees. An important issue is thus to identify nodes from different sources that correspond to the same real-world entity, a process sometimes called entity resolution [19], or reference reconciliation [38]. In practice, entity resolution is typically based on machine learning.
We abstract this part of the problem here by assuming that the identification of nodes from different sources, and thus the merging of the source trees, is based on constraints, more precisely key constraints. One can think of these keys as being added by a separate entity resolution module. Note, however, that the data retrieved may not satisfy these constraints. In particular, there are two kinds of constraint violation. Data may be incomplete, i.e. it may violate constraints by not providing all the data required by the schema. Or, the data retrieved may be inconsistent, i.e. it may violate constraints by providing two elements that are semantically the same but cannot be merged without violating key constraints. In the following, we address the problem

of answering queries in the presence of incomplete data, while we assume that data does not violate the constraints.

Coming back to the example, one can verify that the sources are consistent. The specification of the global schema constraints allows us to answer Query 1 by returning the patient with name Parker and social security number 55577, since thanks to the key constraint we know that there cannot be two patients with the same SSN. Note that Query 2 can also be answered with certainty. Mappings actually let us infer that the patient named Parker was prescribed a dangerous cure. In addition, thanks to the foreign key constraint, we know that every cure that is prescribed to some patient is provided by the hospital.

We conclude the example by highlighting the impact of the assumption of having sound/exact mappings. Suppose that no constraints were expressed over the global schema. Under the exact mapping assumption, by inspecting the data sources, it is possible to conclude that there is only one way to merge the data sources. Indeed, since every patient has a name and a SSN, we can deduce that all patients in D2 with a SSN below the threshold belong also to D1. Therefore the answer to Query 1 would be the same as in the presence of constraints, whereas no answer would be returned to Query 2, since no information is given on that portion of the global schema. On the other hand, under the assumption of sound mappings, since in the absence of constraints there could be two patients with the same SSN, both queries would return empty answers.

The main contributions of this part of the thesis are as follows.
First, following the logical approach presented in Section 1.1, we propose a formal framework for XML data integration systems based on (i) a global schema specified by means of a (simplified) DTD and a set of XML integrity constraints as defined in [42], (ii) a source schema specified by means of DTDs, and (iii) a set of LAV mappings specified by means of a prefix-selection query language inspired by the query language defined in [6]. Second, we define the notion of identification function, and provide one such function that aims at globally identifying nodes coming from different sources. As already mentioned, the need for introducing identification is motivated by the concept of node identity. Third, we study the decidability of XML DIS consistency, and study its complexity under different assumptions on the mappings. Finally, we address the query answering problem in the XML data integration setting. In particular, given the strong connection with query answering under incomplete information, we propose an approach that is reminiscent of that context. We provide two polynomial algorithms to answer queries under different assumptions, and study the complexity of general XML DIS query answering.

This part of the thesis is an expanded and updated version of a DBPL conference paper [80]. It is organized as follows. Below, we start by discussing related work. In Chapter 7, we introduce the setting. In particular, we present the data model, the schema language and the query language used in this part. Then,

the logical framework for XML data integration is introduced in Chapter 8, where we also define the notion of identification function and provide one particular such function. Finally, in Chapter 9, we investigate query answering, study its complexity, and propose different algorithms to answer queries under the assumptions of sound, exact and mixed mappings.

Chapter 7

The setting

In this chapter, we introduce preliminary definitions and propositions that we use throughout this part of the thesis. In particular, we start by presenting the data model for XML documents, and some properties of the model. Then, we define types for data, corresponding to simplified DTDs. We also introduce XML constraints that, together with types, form the schema language. Finally, we present the query language, which is an extension of the one introduced in [6].

7.1 Data model

In this work, XML documents are represented as labeled unordered trees, called data trees, formally defined as follows.

Definition Let N be a set of node identifiers, Σ a finite set of element names (labels), and Γ_⊥ = Γ ∪ {⊥} a domain of data values, where the symbol ⊥ is a special data value that represents the empty value. A (data) tree T over Σ and Γ_⊥ is a triple T = ⟨t, λ, ν⟩, where:

- t is a finite rooted tree (possibly empty) with nodes from N;
- λ, called the labeling function, associates a label in Σ to each node of t; and
- ν, the data mapping, assigns a value in Γ_⊥ to each node of t.

The number of nodes of a data tree T is denoted |T|, whereas the depth of t is denoted d(t). We call datanodes those nodes n of t such that ν(n) ≠ ⊥.

Example In Fig. 7.1, we show three different data trees containing information about wards and patients admitted in a hospital. Note that only data values different from ⊥ are represented, and they are circled. Therefore, datanodes can be easily distinguished.

We next introduce the notions of subsumption and equivalence. Intuitively, a data tree is subsumed by another tree if all the information it contains may also be found in the other tree. And two data trees are equivalent if they hold the same information content (up to replication). Indeed, two equivalent trees will be indistinguishable with the positive query language that we will consider.
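The data model above can be rendered concretely. Below is a minimal sketch (the class name, the representation, and the sample values are our own, not the thesis' notation): a data tree is a node with a label from Σ, a data value from Γ (with None standing for ⊥), and an unordered collection of children.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class DataTree:
    """A node of an unordered labeled data tree T = <t, lambda, nu>."""
    label: str                       # element name from Sigma
    value: Optional[str] = None      # data value from Gamma, None = empty (⊥)
    children: List["DataTree"] = field(default_factory=list)

    def size(self) -> int:
        """Number of nodes |T|."""
        return 1 + sum(c.size() for c in self.children)

    def depth(self) -> int:
        """Depth d(t): nodes on the longest root-to-leaf path."""
        return 1 + max((c.depth() for c in self.children), default=0)

    def datanodes(self) -> int:
        """Number of nodes whose data value differs from ⊥."""
        return (1 if self.value is not None else 0) + \
               sum(c.datanodes() for c in self.children)

# A fragment of the hospital example: one ward with two admitted patients
# (the admission ids "101" and "102" are made up for illustration).
ward = DataTree("ward", "Geriatric", [
    DataTree("admitted", None, [DataTree("adminid", "101")]),
    DataTree("admitted", None, [DataTree("adminid", "102")]),
])
```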

[Figure 7.1: Data Model — (a) data tree T1, a hospital with a Geriatric and a Psychiatric ward and their admitted patients (adminid nodes); (b) data tree T2 and (c) data tree T3, holding subsets and replications of the same information.]

Homomorphism, subsumption, equivalence

We next define two notions that are crucial for this study.

Definition Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be two data trees, and h a function from the nodes of t to the nodes of t′. We say that h is a homomorphism from t to t′ if and only if h is a total function from the nodes of t to (a subset of) the nodes of t′ such that, for each n, n′:

- if n is the root of t, then h(n) is the root of t′;
- if n is a child of n′, then h(n) is a child of h(n′) in t′; we therefore say that h preserves the parent-child relationship;
- λ′(h(n)) = λ(n); we say that h preserves the labeling;
- either ν(n) = ⊥ or ν(h(n)) = ν(n); we say that h preserves data.

Definition Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be two data trees. We say that T is subsumed by T′, written T ⊑ T′, if and only if there exists a homomorphism from t to t′. Moreover, we say that T is equivalent to T′, written T ≡ T′, if and only if T ⊑ T′ and T′ ⊑ T.

Note that, according to the above definition, the empty tree, i.e. the tree that does not contain any node, denoted T_∅, is subsumed by all data trees.
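The subsumption test implied by these definitions can be sketched directly: roots must agree on the label (and on the data value, unless the first root carries ⊥), and every subtree of the first root must be subsumed by some subtree of the second. This is a sketch under assumptions of ours — trees are plain `(label, value, children)` triples with None standing for ⊥, and None for the empty tree.

```python
def subsumed(t1, t2) -> bool:
    """Check T1 ⊑ T2 for unordered data trees given as
    (label, value, children) triples; value None means ⊥,
    tree None means the empty tree."""
    if t1 is None:                # the empty tree is subsumed by everything
        return True
    if t2 is None:
        return False
    (l1, v1, cs1), (l2, v2, cs2) = t1, t2
    # roots: same label, and data preserved unless v1 is the empty value
    if l1 != l2 or (v1 is not None and v1 != v2):
        return False
    # each subtree of the first root subsumed by some subtree of the second
    return all(any(subsumed(c1, c2) for c2 in cs2) for c1 in cs1)

def equivalent(t1, t2) -> bool:
    """T1 ≡ T2 iff each subsumes the other."""
    return subsumed(t1, t2) and subsumed(t2, t1)
```

Note that the nested all/any recursion is exactly the characterization proved in the lemma below, and yields the O(|T|·|T′|) bound discussed there.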

Example Let T1 = ⟨t1, λ1, ν1⟩, T2 = ⟨t2, λ2, ν2⟩ and T3 = ⟨t3, λ3, ν3⟩ be the data trees shown in Fig. 7.1(a), 7.1(b) and 7.1(c). It is easy to see that T2 and T3 are both subsumed by T1. Moreover, T1 is not subsumed by T2, since there exists no homomorphism from t1 to t2. Finally, T1 is subsumed by T3, which means that T1 and T3 are equivalent.

The following lemma provides an immediate algorithm for checking subsumption.

Lemma For all data trees T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩, T ⊑ T′ if and only if either T is empty (1), or:

- they have roots r, r′, respectively, with λ(r) = λ′(r′), and ν(r) = ⊥ or ν(r) = ν′(r′) (2);
- each subtree of r is subsumed by some subtree of r′ (3).

Proof. (⇒) Suppose first that T ⊑ T′, witnessed by a homomorphism h. If T is the empty tree, then (1) holds. Suppose now that T is not empty. Clearly, the definition of homomorphism implies (2). Now, by considering h on the subtrees of the root, one can easily prove (3).

(⇐) Let us now prove by induction on the depth k of T that:

(*) For all T, T′, if (1)–(3) hold for T, T′ and d(t) ≤ k, then there exists a homomorphism from T to T′, i.e., T ⊑ T′.

The basis of the induction is obvious by (1). Now suppose that (*) holds for some k, and let T, T′ satisfy (1)–(3) with d(t) = k + 1. Let T1, ..., Tn be the distinct subtrees of the root of T. For each i, Ti is subsumed by some subtree of the root of T′. By the induction hypothesis, there exists a homomorphism hi from Ti to that subtree. Let h be the function that maps the root of T to that of T′ and coincides with hi on each Ti. One can verify that h is a homomorphism from T to T′. By induction, this shows that (*) holds for each k.

We now study the complexity of subsumption.

Proposition Let T and T′ be two data trees. One can check whether T ⊑ T′ in time O(|T| · |T′|).

Proof.
(sketch) Let c be a constant such that, for each T of depth at most k and each T′, one can check T ⊑ T′ in time c·|T|·|T′|. (To simplify, ignore empty data trees.) Let T be a tree of depth k + 1. The main issue is the cost of comparing the subtrees T1, ..., Tk of the root of T to the subtrees T′1, ..., T′l of the root of T′. By induction, comparing Ti to T′j can be performed in time c·|Ti|·|T′j|. Then the cost of comparing the subtrees is:

Σ_{i ∈ [1..k]} Σ_{j ∈ [1..l]} (c·|Ti|·|T′j|) ≤ c · Σ_{i ∈ [1..k]} (|Ti| · Σ_{j ∈ [1..l]} |T′j|) ≤ c · Σ_{i ∈ [1..k]} (|Ti| · |T′|) ≤ c · |T| · |T′|.

This concludes the proof.

As a direct consequence of the previous proposition and the definition of equivalence of two data trees, we have the following:

Corollary Let T and T′ be two data trees. One can check whether T ≡ T′ in time O(|T| · |T′|).

To conclude with subsumption and equivalence, we observe the following properties:

Proposition (i) Subsumption is transitive; (ii) equivalence is reflexive, symmetric and transitive, i.e., it is an equivalence relation.

Proof. (Subsumption) Let T, T′ and T′′ be such that T ⊑ T′ and T′ ⊑ T′′. Then, by definition, there exist homomorphisms h12 from T to T′ and h23 from T′ to T′′. Let h13 be the function over the nodes of T defined by: for each node n, h13(n) = h23(h12(n)). It is easy to verify that h13 preserves the root, the labeling and the data. Therefore h13 is a homomorphism, so T ⊑ T′′. (Equivalence) Reflexivity and symmetry hold by definition. Transitivity comes from the transitivity of subsumption.

Tree prefixes, minimality

Consider two equivalent data trees. Clearly, it may be the case that one of them contains a lot of replication whereas the other does not. In practice, one would prefer to use a minimal data tree. The lack of redundancy is captured by the following two definitions.

Definition A data tree T′ = ⟨t′, λ′, ν′⟩ is a prefix of T = ⟨t, λ, ν⟩ if and only if: the root r of t′ is the root of t; every subtree of t′ rooted at a child of r is a prefix of a subtree of t rooted at a child of r; and λ′ and ν′ are resp. the restrictions of λ and ν to the nodes of t′.

Clearly, we have the following lemma.

Lemma For each T, and each prefix T′ of T, we have that T′ ⊑ T.

Definition Let T be a data tree. We say that T is minimal if there is no prefix of T, other than T itself, that is equivalent to T.

Example Let us consider again the data trees T1 = ⟨t1, λ1, ν1⟩ and T3 = ⟨t3, λ3, ν3⟩. One can see that T3 is not minimal, whereas T1 is.

7.1. DATA MODEL

Let T = ⟨t, λ, ν⟩ be a data tree. We will use the algorithm Minimal(T) that takes as input T and returns a tree by proceeding as follows:

1. minimize the subtrees of the root;
2. select arbitrarily one subtree of the root that is subsumed by another one and remove it, until there is no subsumed subtree.

We next see that this algorithm constructs a minimal tree that is equivalent to T in quadratic time:

Proposition. Given a tree T, one can construct the data tree Minimal(T), which is equivalent to T and minimal, in PTIME with respect to the size of T.

Proof. (sketch) By construction and by Lemma 7.1.6, Minimal(T) is equivalent to T. Suppose that it is not minimal. Then for some node n in the tree, some subtree would be redundant, a contradiction with the construction. For the complexity, the proof is by induction on the number of nodes in the tree. Suppose that, for some constant c, the complexity of minimizing a data tree T is c|T|², for all trees of size less than k. Consider a tree of size n. We have to minimize its subtrees T₁, ..., T_k, which costs:

Σ_{j∈[1..k]} c|T_j|² ≤ c|T|²

Note that we also have to test equivalence, but that is polynomial by the corollary above.

We also have:

Proposition. For each equivalence class of data trees, there exists a minimal element that is unique up to isomorphism (i.e., up to renaming node ids).

Proof. The existence of a minimal tree follows from the previous proposition. For uniqueness, suppose there are two such minimal trees T, T′. Since T ≡ T′, there exist homomorphisms h from T to T′ and h′ from T′ to T. First suppose that h′(T′) = T. Then T, T′ are isomorphic. Now suppose that h′(T′) ⊂ T (strict subset). Then h′(h(T)) ⊆ h′(T′) ⊂ T. Then one subtree of T is redundant, a contradiction with the minimality of T. Thus, h′(T′) ⊂ T is not possible. Hence, T, T′ are isomorphic.

Based on the previous results, we assume without loss of generality that all the trees we consider from now on are minimal, unless explicitly said otherwise.
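The Minimal(T) algorithm above can be sketched in Python as follows (a hypothetical model with an inline subsumption test; trees are `(label, value, children)` tuples, `None` standing for ⊥):

```python
# Hypothetical sketch of Minimal(T): minimize the subtrees first, then
# repeatedly drop a subtree of the root that is subsumed by a sibling.

def subsumed(t1, t2):
    (l1, v1, k1), (l2, v2, k2) = t1, t2
    return (l1 == l2 and (v1 is None or v1 == v2)
            and all(any(subsumed(a, b) for b in k2) for a in k1))

def minimal(t):
    label, val, kids = t
    kids = [minimal(k) for k in kids]      # step 1: minimize the subtrees
    changed = True
    while changed:                          # step 2: drop subsumed siblings
        changed = False
        for i in range(len(kids)):
            if any(j != i and subsumed(kids[i], kids[j])
                   for j in range(len(kids))):
                del kids[i]
                changed = True
                break
    return (label, val, kids)
```

Note that when two sibling subtrees are identical, each is subsumed by the other; removing one at a time (and restarting the scan) ensures exactly one copy survives.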
Intersection

To conclude the presentation of the data model, we consider a last notion, namely intersection.

Definition. Let T and T′ be two data trees. The intersection of T and T′, denoted T ∩ T′, is the largest tree that is smaller than both, i.e., it is a tree T″ such that: (i) T″ ⊑ T and T″ ⊑ T′, and (ii) for each T‴, if T‴ ⊑ T and T‴ ⊑ T′, then T‴ ⊑ T″.

We will see that for each pair of data trees, their intersection always exists and is unique up to equivalence.

Example. Let us consider the data trees T₁ and T₂, resp. in Fig. 7.2(a) and 7.2(b). They contain data about patients and treatments of a hospital. In Fig. 7.2(c) we show the intersection T₁ ∩ T₂.

[Figure 7.2: Data Model — (a) Data tree T₁, (b) Data tree T₂, (c) T₃ = T₁ ∩ T₂]

One can verify that there exists no tree that is subsumed by both T₁ and T₂ and is not subsumed by T₁ ∩ T₂.

Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be data trees with resp. roots r, r′. We next show that their intersection is constructed by the recursive function Intersection(T, T′) as follows. If λ(r) ≠ λ′(r′), then T″ is the empty tree. Otherwise, T″ = ⟨t″, λ″, ν″⟩, where:

- the root of t″ is a new node that inherits the label of the two roots; moreover, if both roots are datanodes having the same value, the root of t″ inherits their value, and otherwise the value ⊥;
- the subtrees of the root of T″ are the set of trees:

  {Intersection(T_s, T′_s) | T_s a subtree of the root of T, T′_s a subtree of the root of T′}

Note that the function above does not return a minimal tree. However, it is immediate to build from the returned data tree the minimal tree that is equivalent to it, by simply applying the algorithm Minimal(T″) defined previously. We have the following result:

Proposition. Given two data trees T and T′, Intersection(T, T′) is an intersection of T and T′, and can be computed in quadratic time.

Proof. To show that T″ = Intersection(T, T′) is an intersection of T and T′, we have to prove that T″ satisfies the two properties (i) and (ii) of the definition of intersection. By construction, T″ clearly satisfies (i). Let us now consider (ii). Let T‴ = ⟨t‴, λ‴, ν‴⟩ be such that T‴ ⊑ T and T‴ ⊑ T′. We show that T‴ ⊑ T″. Since T‴ ⊑ T and T‴ ⊑ T′, there exist two homomorphisms h₁, h₂ from T‴ to T and T′ respectively. Let h be the function from t‴ to t″ recursively defined as follows:

- h(r‴) = r″, where r‴ is the root of t‴ and r″ is the root of t″. Note that h preserves the parent-child relationship for r‴, since r‴ and r″ are both roots. Moreover, since T‴ ⊑ T and T‴ ⊑ T′, we have λ‴(r‴) = λ(r) = λ′(r′), where r, r′ are resp. the roots of t, t′. But then, from the construction of T″, we have λ‴(r‴) = λ″(r″), which means that h preserves the label of r‴. Similarly, if ν‴(r‴) ≠ ⊥, then we must have ν‴(r‴) = ν(r) = ν′(r′), and then ν‴(r‴) = ν″(r″). On the contrary, if at least one among r, r′ is not a datanode, then ν‴(r‴) = ⊥. Therefore, h preserves the data mapping of r‴.
- for every child n‴ of r‴, let T₀ be the subtree of T‴ rooted at n‴. Since T‴ ⊑ T, h₁ maps T₀ into a subtree T_s of T rooted at a child of r. Similarly, since T‴ ⊑ T′, h₂ maps T₀ into a subtree T′_s of T′ rooted at a child of r′. Then T₀ ⊑ Intersection(T_s, T′_s). We can therefore let h map the root of T₀ to the root of the subtree T″_s of r″ such that T″_s = Intersection(T_s, T′_s), and proceed recursively, which proves that h preserves the parent-child relationship.

From the previous construction, it is clear that h is a homomorphism from t‴ to t″. Therefore, we have that T‴ ⊑ T″.

In order to prove that Intersection(T, T′) runs in time O(N·N′), where N, N′ are resp. the numbers of nodes of T and T′, we would again proceed by induction.
We omit the details, since the proof is very similar to the one of the complexity of checking subsumption (cf. proof of Proposition 7.1.7).

Proposition. Given two data trees, their intersection always exists and is unique up to tree equivalence.

Proof. The existence of an intersection of two data trees follows directly from the previous proposition. To show uniqueness, let T₁″ and T₂″ be two intersections of T and T′. By property (i) of the definition of intersection applied to T₁″, we have T₁″ ⊑ T and T₁″ ⊑ T′. By property (ii) applied to T₂″, we then have T₁″ ⊑ T₂″. By symmetry, T₂″ ⊑ T₁″, so T₁″ and T₂″ are equivalent.
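The recursive Intersection function described above admits a compact sketch. This is a hypothetical Python model, not from the thesis: trees are `(label, value, children)` tuples, `None` standing for ⊥, and an empty intersection is reported as `None` so that it can be filtered out of the children.

```python
# Hypothetical sketch of Intersection(T, T'): pair up the roots, keep a
# data value only when both roots agree on it, intersect subtrees pairwise.

def intersection(t1, t2):
    (l1, v1, k1), (l2, v2, k2) = t1, t2
    if l1 != l2:
        return None  # differently-labelled roots: empty intersection
    val = v1 if (v1 is not None and v1 == v2) else None
    kids = [s for a in k1 for b in k2
            if (s := intersection(a, b)) is not None]
    return (l1, val, kids)  # may be non-minimal; apply Minimal afterwards
```

As in the text, the result is not necessarily minimal: pairing every subtree of one root with every subtree of the other can introduce redundant siblings, which the minimization step then removes.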

[Figure 7.3: Example of a tree type]

7.2 Tree Type

Let Σ be an alphabet. A tree type over Σ is a simplified version of DTDs that can be represented as a triple ⟨Σ, r, µ⟩, where Σ is a set of labels, r ∈ Σ is a special label denoting the root, and µ associates to each label a ∈ Σ a multiplicity atom µ(a) representing the type of a, i.e., the set of labels allowed for the children of nodes labeled a, together with some multiplicity constraints. More precisely, µ(a) is an expression a_1^{ω_1} · · · a_k^{ω_k}, where the a_i are distinct labels in Σ, and ω_i ∈ {∗, +, ?, 1}, for i = 1, ..., k. We say that a data tree T over Σ satisfies a tree type S = ⟨Σ, r, µ⟩, noted T ⊨ S, if and only if: (i) the root of T has label r, and (ii) for every node n of T such that λ(n) = a, if µ(a) = a_1^{ω_1} · · · a_k^{ω_k}, then all the children of n have labels in {a_1, ..., a_k}, and the number of children labeled a_i is restricted as follows¹:

- if ω_i = 1, then exactly one child of n is labeled with a_i;
- if ω_i = ?, then at most one child of n is labeled with a_i;
- if ω_i = +, then at least one child of n is labeled with a_i;
- if ω_i = ∗, then no restriction is imposed on the children of n labeled with a_i.

Given a tree type, we call collection of elements a_i a label a such that there is an occurrence of either a_i^∗ or a_i^+ in µ(a), for some a_i ∈ Σ. Moreover, a_i is called a member of the collection a.

Example. Consider the DTD S_G from Section III. S_G corresponds to the tree type ⟨Σ, r, µ⟩ such that r = hospital and µ can be specified as follows:

µ(hospital) = patient^+ treatment^+
µ(patient) = SSN^1 name^1 cure^* bill^*
µ(treatment) = trid^1 procedure^?

In Fig. 7.3 we show a graphical representation of S_G. Note that patient and treatment are both members of the collection hospital.

¹ One could also consider allowing a fixed number of children labeled a_i. To simplify, this will be ignored here.
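The satisfaction check T ⊨ S can be sketched as follows (a hypothetical Python model: µ is a dict from a label to its children's multiplicities, drawn from `"1"`, `"?"`, `"+"`, `"*"`; trees are `(label, value, children)` tuples):

```python
# Hypothetical sketch of checking T |= S for a tree type S = (Σ, r, µ).
from collections import Counter

def satisfies(tree, root_label, mu):
    if tree[0] != root_label:                  # condition (i): root labeled r
        return False
    return _check(tree, mu)

def _check(tree, mu):
    label, _val, kids = tree
    allowed = mu.get(label, {})
    counts = Counter(child[0] for child in kids)
    if any(l not in allowed for l in counts):  # only allowed child labels
        return False
    for child_label, w in allowed.items():     # condition (ii): multiplicities
        n = counts.get(child_label, 0)
        if (w == "1" and n != 1) or (w == "?" and n > 1) or (w == "+" and n < 1):
            return False                       # "*" imposes no restriction
    return all(_check(child, mu) for child in kids)
```

With the hospital tree type of the example, a tree lacking any patient child of the root violates the `patient^+` atom and is rejected.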

7.3 Constraints and Schema Language

We next recall and adapt to our setting the definition of XML constraints from [42, 22], and introduce our schema language. Let S be a tree type over an alphabet Σ.

Unary keys (UK) are assertions of the form a.k → a, where a ∈ Σ and k^1 ∈ µ(a). We then say that k is a key for a. The semantics of keys is the following. Given a tree T satisfying S, T ⊨ a.k → a if and only if:

- each node labeled a has a single child labeled k, and this child is a datanode;
- for two distinct nodes labeled a, their respective children labeled k have distinct data values.

Example. Consider the tree type S_G in Fig. 7.3. In order to constrain every data tree satisfying S_G to be such that there do not exist two distinct nodes labeled patient having the same SSN, we use the following UK:

patient.SSN → patient

Note that the above UK is satisfied by the data tree in Fig. 7.2(a), whereas it is not satisfied by the data tree in Fig. 7.2(b).

Foreign keys (FK) are assertions of the form a.h → b.k, where k is a key for b, a ∈ Σ and h^ω ∈ µ(a) for some ω ∈ {1, ?, +, ∗}. The semantics of foreign keys is the following. Let T be a tree satisfying S. Then T ⊨ a.h → b.k if and only if for every datanode m labeled h that is a child of a node labeled a, there exists a node labeled b having a single child m′ labeled k with the same data value as m.

A FK a.h → b.k may be seen as introducing in nodes labeled a a reference to some nodes labeled b. Now, by definition, nodes labeled b may occur anywhere in the document. Even if it is possible to design documents in that manner, it seems very natural to group all b's in a single place of the document (as often done in practice). This motivates the following definition of uniquely localizable foreign key. The general case of arbitrary foreign keys is more complicated and left for future research.
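The semantics of a unary key a.k → a can be sketched as follows (a hypothetical Python model over `(label, value, children)` tuples, `None` standing for ⊥):

```python
# Hypothetical sketch of checking a unary key a.k -> a over a data tree:
# every a-node needs exactly one k-child carrying a data value, and those
# values must be pairwise distinct across all a-nodes.

def satisfies_key(tree, a, k):
    seen = set()

    def walk(node):
        label, _val, kids = node
        if label == a:
            key_kids = [c for c in kids if c[0] == k]
            if len(key_kids) != 1 or key_kids[0][1] is None:
                return False        # missing, duplicated, or valueless key child
            v = key_kids[0][1]
            if v in seen:
                return False        # two distinct a-nodes share a key value
            seen.add(v)
        return all(walk(c) for c in kids)

    return walk(tree)
```

This mirrors the example: two patient nodes with the same SSN value make the check fail.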
For a tree type S, we call uniquely localizable foreign key (ULFK, for short) a foreign key a.h → b.k such that there exists a unique path r, l₁, ..., l_s, b verifying: (i) for each document of tree type S and for each node labeled b in this document, the path constituted by the labels from the root to that node is r, l₁, ..., l_s, b, and (ii) no l_i on this path is a member of a collection, for i ∈ {1, ..., s}. It is easy to see that, as a consequence, in each document satisfying S, the elements labeled b are the children of a unique node.

[Figure 7.4: Another example of a tree type]

Example. Consider the tree type shown in Fig. 7.4, and the following foreign keys:

patient.cure → treatment.trid
patient.SSN → admitted.admid

where trid and admid are resp. keys for elements labeled treatment and admitted. The first assertion specifies the constraint of Section III, i.e., it specifies that whenever a cure has been prescribed to a patient, its identifier must appear among the identifiers of the treatments offered by the hospital. The second assertion specifies that whenever a patient's SSN appears among the patients of the hospital, then the patient was admitted to some hospital ward. Note that the first foreign key is a ULFK, whereas the second is not, since ward is a member of the collection hospital, i.e., ward^∗ ∈ µ(hospital), and ward is on the label path from the root to the elements labeled admitted, which are referenced by the foreign key.

Finally, let S_G be a tree type, Φ_K a set of keys and Φ_FK a set of foreign keys. A schema is a triple G = ⟨S_G, Φ_K, Φ_FK⟩. Moreover, we say that a tree T strongly satisfies the schema G if and only if T ⊨ S_G, T ⊨ Φ_K and T ⊨ Φ_FK. On the other hand, we say that T weakly satisfies the schema G = ⟨S_G, Φ_K, Φ_FK⟩, written T ⊨_w G, if and only if there exists T′ such that T ⊑ T′ and T′ satisfies G. Intuitively, this means that T may be incomplete w.r.t. G, but not inconsistent. Clearly, if T satisfies G, then T weakly satisfies G, whereas the converse does not hold. From now on, when we talk about satisfaction, we mean strong satisfaction, unless specified differently.

Example. Let us come back to the example illustrated in Section III. The hospital data that we want to represent satisfies a schema G = ⟨S_G, Φ_K, Φ_FK⟩, where S_G is the tree type in Fig. 7.3, and Φ_K and Φ_FK are resp.
the following sets of key constraints and foreign key constraints:

Φ_K : {patient.SSN → patient; treatment.trid → treatment}
Φ_FK : {patient.cure → treatment.trid}

Clearly, the tree T₁ in Fig. 7.2(a) satisfies the schema G = ⟨S_G, Φ_K, Φ_FK⟩, whereas the tree T₂ in Fig. 7.2(b) does not, since (i) the first key constraint is violated (the data tree contains two patients with the same SSN), and (ii) the foreign key is violated

(the cure with id 25 does not appear among the treatment ids). Finally, the tree T₃ of Fig. 7.2(c) is an example of a tree that weakly satisfies G but does not satisfy G, since the patient node corresponding to the patient named Rossi does not have any child datanode SSN. Indeed, T₃ is subsumed by T₁, which satisfies G.

7.4 Prefix Queries

We now introduce the prefix query language that we use throughout this work. It is an extension of the prefix-selection queries presented in [6]. Intuitively, prefix queries (p-queries for short) browse the input tree starting from the root and going down to a certain depth, traversing nodes with specified labels and data values satisfying specified conditions. Whereas boolean p-queries check for the existence of a certain tree pattern in T, general p-queries return a tree that is equivalent to a prefix projection of the nodes selected by the query. We are now able to formally define p-queries.

A p-query q over an alphabet Σ is a quadruple ⟨t_q, λ_q, cond_q, ret_q⟩ where:

- t_q is a rooted tree;
- λ_q associates to each node a label in Σ, where sibling nodes have distinct labels;
- cond_q is a total function that associates to each node of t_q a boolean formula, called condition, having either the form ⊤, which evaluates to true for all possible values in Γ, or the form p₀ b₀ p₁ b₁ ... p_{m−1} b_{m−1} p_m, where the p_i are predicates and the b_j are boolean operators such that (i) each p_i can be applied to values in Γ (for instance, if Γ = Q, p_i can have the form "op v", where op ∈ {=, ≠, ≤, ≥, <, >} and v ∈ Q), and (ii) each p_i returns false when applied to ⊥;
- ret_q (for "returned by q") is a total function that assigns to each node n_q in t_q a boolean value such that: (i) ret_q(n_q) = true if n_q is the root of t_q, and (ii) if ret_q(n_q) = false, then ret_q(n′_q) = false for every child n′_q of n_q.

By analogy with data trees, we denote by d(q) the depth of t_q. Let q = ⟨t_q, λ_q, cond_q, ret_q⟩ be a p-query.
If there is at least one node n_q ∈ t_q such that ret_q(n_q) = false and whose parent p_q is such that ret_q(p_q) = true, then we say that q contains an existential subtree pattern rooted at n_q. Moreover, we say that q is a boolean p-query if ret_q(n_q) = true only for the root of t_q.

We next formalize the notion of answer to a p-query, using the auxiliary concept of valuation. Given a p-query q = ⟨t_q, λ_q, cond_q, ret_q⟩ and a data tree T = ⟨t, λ, ν⟩, a valuation γ from q to T is a total function from the nodes of t_q to the nodes of T, preserving the parent-child relationship and the labeling, and such that for each n_q ∈ t_q, ν(γ(n_q)) satisfies cond_q(n_q). Observe that γ(q) is a prefix of t. We call image of q posed over T, denoted Image(q, T), the tree ⟨t_i, λ_i, ν_i⟩ such that:

- t_i consists of all the nodes of T that are in γ(q) for some valuation γ from q to T;
- λ_i and ν_i are resp. the restrictions of λ and ν to the nodes of t_i.

Similarly, we call answer the tree q(T) = ⟨t_A, λ_A, ν_A⟩ such that:

- for each n ∈ t_A, there exists a valuation γ such that γ(n₀) = n for some n₀ ∈ t_q with ret_q(n₀) = true;
- λ_A and ν_A are resp. the restrictions of λ and ν to the nodes of t_A.

Clearly, by construction, Image(q, T) and q(T) are both prefixes of T, and q(T) is a prefix of Image(q, T). Intuitively, Image(q, T) represents the prefix of T whose nodes are selected by q, whereas q(T) is the prefix of Image(q, T) whose nodes are returned by q. Thus, by construction and by the prefix lemma above, we have the following.

Lemma. Given a p-query q and a data tree T over Σ, q(T) is unique (up to tree equivalence). Moreover, q(T) ⊑ Image(q, T) ⊑ T.

Observe the following. Let q be a boolean p-query. Then either there exists no valuation from q to T, and therefore Image(q, T) is the empty tree T_∅, or Image(q, T) = ⟨t_r, λ_r, ν_r⟩, where t_r is a tree containing only the root r, having the same label and data value as the root of T. Suppose that the first case occurs. Then q(T) = T_∅, and the answer to q over T is the empty tree; this means that T does not satisfy q. Suppose now that Image(q, T) = ⟨t_r, λ_r, ν_r⟩. Then q(T) = ⟨t_r, λ_r, ν_r⟩, which means that T satisfies q. The tree containing only the root is therefore equivalent to true, whereas the empty tree T_∅ is equivalent to false. Note that this is in the same spirit as the relational model, where a boolean query returns the empty set (∅) when it evaluates to false, and the set containing the empty tuple ({()}) when it evaluates to true.

Example. In Fig. 7.5 we show several p-queries. We graphically represent an existential subtree pattern in a query by underlining the label of its root. Moreover, only conditions different from ⊤ are represented. In particular, Fig. 7.5(a) shows a boolean query asking whether there are patients that were admitted to the ward Geriatric. Posed over the data tree in Fig. 7.1(a), this query returns true. Consider now the queries in Fig. 7.5(b) and 7.5(d).
They select, respectively, (i) the name and the SSN of patients having an SSN smaller than a given constant, and (ii) the SSN of patients that were prescribed at least one dangerous cure (i.e., a cure with id lower than 35), together with the bills they paid. The answers to these last two queries, when posed over the tree of Fig. 7.2(a), are given resp. in Fig. 7.5(c) and 7.5(e).

Clearly, by the definition of p-queries, we have the following:

Proposition. P-queries are monotone, i.e., for every two data trees T, T′, if T ⊑ T′ then q(T) ⊑ q(T′).
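The existence of a valuation, which is all that matters for a boolean p-query, can be sketched as follows (a hypothetical Python model: query nodes are `(label, condition, children)` with conditions as predicates that return false on `None`, i.e., on ⊥; data nodes are `(label, value, children)`):

```python
# Hypothetical sketch of testing whether some valuation from a p-query
# into a data tree exists (the boolean p-query case).

def has_valuation(qnode, tnode):
    qlabel, cond, qkids = qnode
    tlabel, tval, tkids = tnode
    if qlabel != tlabel or not cond(tval):
        return False
    # Every query child must be matched by some child of the data node;
    # sibling query nodes have distinct labels, so matches are independent.
    return all(any(has_valuation(qc, tc) for tc in tkids) for qc in qkids)
```

For example, the query "is there a patient with SSN below 100?" succeeds on a tree containing such a patient, even if other patients do not qualify.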

[Figure 7.5: Querying a data tree — (a) Boolean p-query, (b) P-query q₁, (c) Answer to q₁, (d) P-query q₂, (e) Answer to q₂]


Chapter 8

XML-based DIS

In this chapter, we first define XML DISs according to the logical framework presented in Section 1.1. Then, we introduce the notion of identification function, and provide one such function that we will use in Chapter 9. Finally, we conclude the chapter by studying the XML DIS consistency problem.

8.1 XML DIS logical framework

An XML DIS Π can be characterized by a triple ⟨G, S, M⟩, where:

- The XML global schema G = ⟨S_G, Φ_K, Φ_FK⟩ is expressed in terms of a non-recursive tree type S_G = ⟨Σ_G, r_G, µ_G⟩, a set Φ_K of key constraints and a set Φ_FK of uniquely localizable foreign keys. We assume that at most one key constraint is expressed for each element (i.e., the keys in Φ_K are primary keys [42]).
- S is a set of source schemas S = {S₁, S₂, ..., S_m}, where S_i is a tree type, for every i in {1, ..., m}¹.
- M is the set of LAV mappings between S and G, one for each data source S_i in S. Each mapping is an expression of the form M_i = (S_i, q_i, as_i), for i = 1, ..., m, where as_i ∈ {sound, exact} and q_i is a p-query.

Given a set of data sources D = {D₁, ..., D_m} conforming to S = {S₁, ..., S_m} (i.e., D_i ⊨ S_i, for i = 1, ..., m), the semantics of the data integration system consists of all the legal data trees that conform to the schema G and satisfy the mappings M w.r.t. D. More precisely, we have the following:

sem(Π, D) = {T | T ⊨ S_G, T ⊨ Φ_K, T ⊨ Φ_FK, and for i = 1, ..., m:
             D_i ⊑ q_i(T) if as_i = sound,
             D_i ≡ q_i(T) if as_i = exact, where M_i = (S_i, q_i, as_i)}

¹ Note that dealing with such kinds of sources is not restrictive, since we can assume that suitable wrappers are available that present the sources in these formats.
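The mapping side of the semantics above can be sketched as follows. This is a hypothetical Python model, not from the thesis: each answer q_i(T) is assumed to be precomputed and given as a tree, schema satisfaction is checked separately, and trees are `(label, value, children)` tuples with `None` for ⊥.

```python
# Hypothetical sketch of the mapping conditions in sem(Π, D): a candidate
# global tree T is compatible with mapping M_i = (S_i, q_i, as_i) iff
# D_i ⊑ q_i(T) when as_i = sound, and D_i ≡ q_i(T) when as_i = exact.

def subsumed(t1, t2):
    (l1, v1, k1), (l2, v2, k2) = t1, t2
    return (l1 == l2 and (v1 is None or v1 == v2)
            and all(any(subsumed(a, b) for b in k2) for a in k1))

def mappings_hold(sources, answers, assumptions):
    """sources[i] = D_i, answers[i] = q_i(T), assumptions[i] in {"sound","exact"}."""
    for d, qt, kind in zip(sources, answers, assumptions):
        if kind == "sound" and not subsumed(d, qt):
            return False
        if kind == "exact" and not (subsumed(d, qt) and subsumed(qt, d)):
            return False
    return True
```

Under a sound mapping, the source may be a strict "part" of the answer; under an exact mapping, the same situation makes T illegal, as in the Dong example below the figure.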

[Figure 8.1: Legal data tree for Π (Example 8.1.1)]

Example 8.1.1. Consider the following data integration system Π = ⟨G, S, M⟩. The global schema G = ⟨S_G, Φ_K, Φ_FK⟩ is that of Example 7.3.3, whereas the set of source schemas is S = {S₁, S₂}, where S₁, S₂ correspond to the DTDs of Section III. Finally, the set of mappings is M = {M₁, M₂}, where M₁ = (S₁, q₁, sound), M₂ = (S₂, q₂, exact), and q₁, q₂ are resp. the queries of Fig. 7.5(b) and Fig. 7.5(d). Given a source D₁ conforming to S₁, the first mapping says that D₁ contains some of the patients of the hospital having a social security number lower than a given constant. On the other hand, given a source D₂ conforming to S₂, the second mapping says that D₂ contains exactly the patients that paid a bill and were prescribed at least one dangerous cure, together with all the bills they paid. One can easily verify that the tree T shown in Fig. 8.1 is a legal data tree for Π. On the contrary, consider the data integration system Π′ that is obtained from Π by replacing M₁ with M₁′ = (S₁, q₁, exact). Then T is not a legal data tree for Π′, since the patient named Dong does not belong to the data source D₁, whereas it would belong to the answer q₁(T).

Note that, according to the definition of the semantics of an XML data integration system, it may happen that no legal data tree exists, i.e., sem(Π, D) = ∅. Coherently with the DIS terminology introduced in Section 1.2, we then say that the system is inconsistent w.r.t. D. Checking whether a data integration system is consistent w.r.t. a set of data sources will be the topic of Section 8.3.

The main task of a DIS is to answer queries. Following the classical approach, we say that a data tree T is a certain answer to a p-query q posed over a data integration system Π = ⟨G, S, M⟩ w.r.t.
a set of data sources D, written T ∈ q(Π, D), if and only if:

T ⊑ q(T′) for every T′ ∈ sem(Π, D)

Thus, the main problem we study here is the recognition problem introduced in Section 1.3, which in the XML setting can be formulated as follows:

PROBLEM:  QUERY ANSWERING (RECOGNITION)
INPUT:    consistent data integration system Π = ⟨G, S, M⟩, set of data sources D
          conforming to S, p-query q, and data tree T
QUESTION: is T in q(Π, D)?

From the definition of intersection of data trees, it follows that, given a consistent data integration system Π = ⟨G, S, M⟩ and a set of data sources D, T is a certain answer to q over Π w.r.t. D if and only if:

T ⊑ ∩_{T′ ∈ sem(Π, D)} q(T′)

Thus, if we were able to evaluate q over all legal data trees, then query answering would be solved by computing the intersection of the answers and then checking whether T is subsumed by the result. Nevertheless, in general, the set of legal data trees is infinite, which prevents using such an approach. We will work on a finite representation rep(Π, D) of the (possibly infinite) set of all legal data trees, following an approach that is typical of settings with incomplete information [58] (cf. Section 1.5). However, to this aim, as we will see in the next chapter, we need persistent node identifiers that are shared among data sources. This leads us to the next section, i.e., to the definition of an identification function.

8.2 Identification

An identification function aims at identifying nodes from autonomous data sources that represent the same real-world entity. In particular, it is responsible for associating to each node coming from the data sources an identifier that is based on the constraints expressed over the global schema, and more precisely on key constraints. As already mentioned, the process of identifying entities that come from different sources and correspond to the same real-world entity is sometimes called Entity Resolution [19]. In practice, Entity Resolution is typically based on machine learning techniques. Here, we abstract this part of the problem away by assuming an Entity Resolution module that has already introduced key information in the data. Thus, two entities represent the same real-world entity if and only if they are characterized by the same key.
Under the above assumption, and based on key constraints, an identification function assigns to each data source node a semantic identifier that allows one to identify nodes that come from different sources and correspond to the same node in every legal data tree. Before presenting our identification function, let us formally introduce how to define and characterize such a function.

Let Π = ⟨G, S, M⟩ be a data integration system. An identification function is any function F that assigns a global identifier in a domain N_F to every node of a set of data sources conforming to S. Let DC_Π be the set of possible sets of data sources conforming to S and such that Π is consistent:

DC_Π = {D | D = {D₁, ..., D_m}, where D_i conforms to S_i, for i = 1, ..., m, and sem(Π, D) ≠ ∅}

We say that F is sound w.r.t. Π if and only if for each D ∈ DC_Π and for each two nodes n₁, n₂ in D: if F(n₁) = F(n₂), then for each T ∈ sem(Π, D) and for each homomorphism h from D to T, we have h(n₁) = h(n₂).

132 120 CHAPTER 8. XML-BASED DIS On the other hand, we say that F is complete w.r.t. Π, if and only if for each D DC, we have that for each two nodes n 1, n 2 D, each T sem(π, D), and each homomorphism h from D to T : if h(n 1 ) = h(n 2 ), then F(n 1 ) = F(n 2 ). Intuitively, the above definitions mean that a sound identification F possibly gives a sufficient condition for identifying two nodes in every legal data tree, whereas a complete identification provides a necessary condition. Obviously, in order to reduce our setting to the setting where nodes ids are available that are shared among data sources, we need an identification function that is both sound and complete. But in practice, this may be asking for too much and we will often have to live with simply sound. We next propose an identification function, named Id G, and show that it is sound under the assumption of both exact and sound mappings. Moreover, we show that it is also complete under the assumption of sound mappings, whereas it is not under the assumption of exact mappings. Nevertheless, we show that it is possible to introduce a restriction on the schema, under which Id G is also complete under the assumption of sound mappings. Let Π = G, S, M be a the data integration system with global schema G = S G, Φ K, Φ FK, where S G = Σ G, r G, µ G. In the following we start by recursively defining the domain N G of global identifiers that are assigned by Id G : ǫ is a global identifier in N G ; for n N G, a i Σ G (remember that Σ G is the set of element labels), then n.a i is a global identifier in N G ; for n N G, a i Σ G, and γ i is a data value in Γ S = Γ V S (remember that Γ is the domain of data values different from ), then n.a i [.γ i ] is a global identifier in N G. Clearly, global ids as defined above recall a subset of XPath expressions. To simplify, we assume first that the data sourecs are consistent (so the identification will succeed). We will come back to inconsistency in Section 8.3. 
Let D = {D₁, ..., D_m} be a set of data sources conforming to S = {S₁, ..., S_m}, such that Π is consistent w.r.t. D. In Fig. 8.2, we show how to build an identification function Id_G that assigns to each node n in t_i a global id in N_G, where D_i = ⟨t_i, λ_i, ν_i⟩, for every i ∈ {1, ..., m}. Intuitively, we first define Id_G so that it assigns global ids to all nodes of each data source, independently, based on the schema G. Note, in particular, that we introduce a fresh Skolem constant to identify members of collections that are not characterized by any key constraint. The motivation is that two nodes belonging to the same collection are in general to be considered distinct, unless explicitly specified otherwise by means of a key constraint. Then, in a second phase, we possibly modify Id_G so that global ids that differ only in the presence of Skolems are unified, if they are assigned to nodes that, according to the key constraints, represent the same node in each legal data tree. To this aim, for each depth k, starting from the bottom, and for each key constraint a.k → a, we collect into a set N(a, v) all global

INPUT:  set D = {D₁, ..., D_m} of data trees D_i = ⟨t_i, λ_i, ν_i⟩, i = 1, ..., m,
        such that Π is consistent w.r.t. D
OUTPUT: identification function Id_G from the nodes of D to N_G

[1] for i := 1 to m do
      for every node n in t_i do
        if n is the root of t_i then
          Id_G(n) := ε
        else if n, labeled a_j, is a child of p, labeled a, in t_i then
          if a_j^{ω_j} ∈ µ(a), where ω_j ∈ {1, ?} then
            Id_G(n) := Id_G(p).a_j
          else if there exists a_j.k → a_j ∈ Φ_K then
            if n has a child m labeled k then
              Id_G(n) := Id_G(p).a_j.ν(m)
            else
              Id_G(n) := Id_G(p).a_j.v_s, where v_s is a fresh Skolem constant
          else
            Id_G(n) := Id_G(p).a_j.v_s, where v_s is a fresh Skolem constant

[2] for k := d(D) down to 1 do
      for each a.k → a ∈ Φ_K do
        for each node n of D at depth k such that Id_G(n) = X.a.v, with v ∈ Γ, do
          if N(a, v) is not yet defined then N(a, v) := {Id_G(n)}
          else N(a, v) := N(a, v) ∪ {Id_G(n)}
        Unify(Id_G, N(a, v))

Figure 8.2: Definition of Id_G(D)

identifiers of nodes at depth k that are labeled a and are characterized by a key value v. Then, we use an algorithm Unify, which computes the most general unifier [70] for the Skolems occurring in the global ids in N(a, v), denoted mgu(N(a, v)). Such a unifier is then applied to the global ids of all nodes at depth equal to or lower than k. Note that this unification process is particularly efficient since, by construction, all Skolems occurring in a pair of node ids are distinct.

Now that we have shown how to build Id_G, we can apply it to the set of sources D, thus obtaining a set of data sources Id_G(D) with global identifiers in N_G (observe that the construction always succeeds because we assume the data sources are consistent, i.e., that there exists a legal data tree). We then say that a node n is uniquely identified by Id_G if and only if its identifier Id_G(n) does not contain any Skolem constant.

We now illustrate the identification function Id_G by an example.

Example. Let Π = ⟨G, S, M⟩ be the data integration system discussed in Example 8.1.1. Moreover, let D = {D₁, D₂} be the set of sources shown in Fig.
7.5(c) and 7.5(e), respectively. Π is consistent w.r.t. D, since one can easily verify that one legal data tree for Π is the tree shown in Fig. 8.1. See Fig. 8.3(a) and 8.3(b) for a graphical representation of Id_G(D₁) and Id_G(D₂), where node ids are marked in bold. Let us discuss in particular how we build Id_G according to Fig. 8.2. First, we define Id_G for the nodes belonging to D₁. Therefore, we set Id_G(r₁) = ε, where r₁

[Figure 8.3: Identifying data sources — (a) Id_G(D₁), (b) Id_G(D₂)]

is the root of D₁. Second, since the key constraint patient.SSN → patient belongs to Φ_K, and since r₁ has a child node n₁ having a child labeled SSN with data value 55577, we set Id_G(n₁) = ε.patient.55577. Third, let us consider the child m₁ of n₁ labeled SSN. Since no key constraint is defined for elements SSN, and since, according to the tree type S_G, elements labeled patient have a unique child labeled SSN, we have Id_G(m₁) = Id_G(n₁).SSN. The definition of Id_G(n) for the other nodes n belonging to D₁ is very similar to the cases discussed above. The same holds for the nodes of D₂, except for the node n₂ of D₂ labeled bill, child of a node p₂. In this case, no key constraint is defined for elements bill, and according to the tree type S_G, elements labeled patient may have an unrestricted number of children labeled bill. Therefore, we set Id_G(n₂) = Id_G(p₂).bill.γ₁, where γ₁ is a Skolem constant in V_S. Now that all nodes in D₁, D₂ have been assigned a global id, we check, starting from depth k = 3, whether some nodes at depth k have ids containing Skolems that should be unified according to a key constraint in Φ_K. Since the only node that is not uniquely identified is n₂, and since no key constraint is defined for bill, no unification is performed and the construction of Id_G terminates.

Theorem. Let Π = ⟨G, S, M⟩ be an XML DIS and D a set of sources conforming to S, such that Π is consistent w.r.t. D. Then the construction of Id_G terminates in time O(|D|).

Proof. One can easily verify that the first phase of the construction of Id_G terminates after having considered the nodes of D one by one. Thus, this first phase costs c₁|D|.
Let us now consider the second phase of the construction of Id_G. For each k, all the nodes at depth k whose global ids have the form X.a.v, with v ∈ Γ and a such that a.k_a → a is in Φ_K, are partitioned into disjoint sets N(a, v). According to the semantics, and since Π is consistent, every node whose global id belongs to N(a, v) is mapped to the same node of each legal data tree, with label a and key value v. As a consequence, the parents of the nodes having global ids in N(a, v) are also mapped to the same node, and so on up to the root. Thus, each node in N(a, v) has the same sequence of labels for its ancestors, and the same sequence of key values possibly characterizing each of them. Thus, the set N(a, v) of global

ids can always be unified, i.e. the computation of the most general unifier always terminates successfully. Moreover, since trees have finite depth, the second phase of the construction of Id_G also terminates. In particular, since the cost of computing the most general unifier of a set of global ids is linear in the size of the set [72], we show by induction on the depth k that the cost of the last phase of the definition of Id_G is:

c_2·|S_k| + c_3·|N_1| + ... + c_3·|N_l| + c_4·|T_k| + c·|T_{k-1}|

where S_k is the set of nodes at depth k, c_2·|S_k| is the cost of constructing all the sets N_i of global ids to be unified at depth k, for i = 1, ..., l, c_3·|N_i| is the cost of computing the mgu of N_i, c_4·|T_k| is the cost of applying the mgu of each set N_i to the global ids of the nodes at depth k, and c·|T_{k-1}| is, by inductive hypothesis, the cost of applying this second phase to a tree of depth k-1.

Theorem 8.2.3. For each data integration system Π = ⟨G, S, M⟩, Id_G is a sound identification function.

Proof. Let D be a set of data sources, and Π a data integration system consistent w.r.t. D. Also, let T be a legal data tree, and n_1, n_2 two nodes of D. Suppose that Id_G(n_1) = Id_G(n_2). It is easy to see that, by construction, n_1 and n_2 must have the same depth. Thus, we only need to prove that if n_1, n_2 have the same depth and Id_G(n_1) = Id_G(n_2), then for each homomorphism h from D to T, h(n_1) = h(n_2). We show this by induction on the depth of n_1, n_2.

Base step: trivial, since n_1 and n_2 are both roots and, as such, given that the system is consistent, they are both mapped to the root of T by every homomorphism from D to T.

Inductive step: suppose now that n_1, n_2 have depth k, and Id_G(n_1) = Id_G(n_2) = X.Y, where X is the id of the parents p_1, p_2 of n_1, n_2. Since p_1, p_2 have depth k-1, by inductive hypothesis, for each homomorphism h from D to T, p_1 and p_2 are mapped to the same node p of T.
Thus p_1 and p_2 have the same label, say a. Then, by definition of homomorphism, n_1 and n_2 are both mapped to a child of p. Now, let us consider the possible forms of Y.

If Y = b, then by construction n_1 and n_2 both have label b, with b^1 ∈ µ_G(a). Then clearly n_1 and n_2 are both mapped to the same child of p in T, since there cannot be two distinct children labeled b of p.

If Y = b.v, with v ∈ Γ ∪ V_S, then, again, n_1 and n_2 both have label b. Moreover:

either b.k_b → b ∈ Φ_K, and n_1, n_2 both have a child labeled k_b with the same data value v ∈ Γ. In this case, n_1 and n_2 are both mapped to the same node (child of p) in T, since there cannot be two distinct nodes labeled b having the same key value in T;

or Id_G(n_1) and Id_G(n_2) have been unified during the second phase of the definition of Id_G. Note that, by construction, this can happen only if either: b is a member of a collection such that b.k_b → b ∉ Φ_K;

or b.k_b → b ∈ Φ_K but at least one among n_1, n_2 has no child labeled k_b (or it has one with an empty value). However, since they were unified during the second phase of the definition of Id_G, at least one among n_1, n_2, say n_1, is such that at the end of the first phase we had set Id_G(n_1) = X.b.v_s, where v_s is a fresh Skolem constant. Moreover, n_1, n_2 must have two descendants n′_1, n′_2 with the same label b′, such that (i) b′.k_{b′} → b′ ∈ Φ_K, and (ii) n′_1, n′_2 each have one child labeled k_{b′} with the same value. Thus, since by hypothesis T is a legal data tree, n′_1 and n′_2 are both mapped to the same node in T. Then, coherently, all the ancestors of n′_1, n′_2 at the same depth are mapped to the same node in T.

Theorem 8.2.4. For each data integration system Π = ⟨G, S, M⟩ such that the mappings in M are all sound, Id_G is a complete identification function.

Proof. Let Π be a data integration system such that the mappings in M are all sound, D a set of data sources such that Π is consistent w.r.t. D, and n_1, n_2 two nodes in D. In order to prove that Id_G is complete, we show that if Id_G(n_1) ≠ Id_G(n_2) then, for each T ∈ sem(Π, D), there exists at least one homomorphism h from D to T such that h(n_1) ≠ h(n_2). Clearly, if n_1, n_2 are both roots, then Id_G(n_1) = Id_G(n_2). Thus, we assume that Id_G(n_1) ≠ Id_G(n_2), where at least one among n_1, n_2 is not a root. Then, it is easy to see that, by construction, it is always possible to find X, L_1, L_2 such that Id_G(n_1) = Id_G(p_1).L_1 and Id_G(n_2) = Id_G(p_2).L_2, where p_1, p_2 are the deepest ancestors of n_1, n_2 such that Id_G(p_1) = Id_G(p_2) (note that such a pair of nodes always exists, since we can always take the pair of roots as ancestors), and L_1 ≠ L_2. Thus, since we proved that Id_G is sound, we have that for each T and for each h from D to T, p_1 and p_2 are mapped to the same node in T.
Now consider the following possible forms for L_1, L_2:

either exactly one among L_1, L_2 is empty;

or L_1 = b_1.R_1 and L_2 = b_2.R_2, where b_1, b_2 are distinct labels;

or L_1 = b.v_1.R_1 and L_2 = b.v_2.R_2, where v_1, v_2 are distinct values in Γ ∪ V_S.

It is easy to see that in the first case, for each legal data tree T and for each h from D to T, h(n_1) ≠ h(n_2). Indeed, if for instance L_1 is empty, we have that n_1 = p_1. Therefore, for each T and each h from D to T, n_1 and n_2 are mapped respectively to a node and to a proper descendant of it. To show that the claim holds also in the last two cases, we next show that in both cases n_1 and n_2 have ancestors p′_1, p′_2, children of p_1, p_2 respectively, such that for each legal data tree T and for each h from D to T, h(p′_1) and h(p′_2) are two distinct children of h(p_1) = h(p_2). This clearly implies that h(n_1) ≠ h(n_2), for each T and h. Therefore, suppose first that L_1 = b_1.R_1 and L_2 = b_2.R_2, where b_1 ≠ b_2. Then, p′_1 and p′_2 are labeled b_1 and b_2, respectively. Then, for each legal data tree T and for each h from D to T, h(p′_1) and h(p′_2) are two distinct children of the same node,

each having a distinct label. Finally, suppose that L_1 = b.v_1.R_1 and L_2 = b.v_2.R_2, where v_1 ≠ v_2. Then p′_1, p′_2 are labeled b and are such that:

Either v_1, v_2 ∈ Γ. By construction, this means that b.k_b → b ∈ Φ_K, and p′_1, p′_2 have one child labeled k_b with value v_1, v_2 respectively. Then, for each T and for each h from D to T, p′_1 and p′_2 are mapped to distinct nodes in T, having key values v_1 and v_2.

Or at least one among v_1, v_2 is a Skolem constant, say v_1. Moreover, by construction, p′_1 belongs to a collection, and: either b.k_b → b ∉ Φ_K, or b.k_b → b ∈ Φ_K and p′_1 has no child labeled k_b; moreover, Id_G(p′_1) and Id_G(p′_2) were not unified. Thus, there exists a tree T satisfying G and a homomorphism h from D to T such that h maps p′_1 and p′_2 to two distinct nodes of T. Clearly, the mappings are also satisfied, since by hypothesis they are sound, and T is a legal data tree, which concludes the proof.

Proposition. There exists an XML DIS Π = ⟨G, S, M⟩ such that Id_G is not a complete identification function for it.

Proof. To show this, we use the same example that was given at the end of Section III, when discussing the impact of having exact mappings. Consider the data integration system having (i) the global schema G = ⟨S_G, ∅, ∅⟩, such that S_G is the tree type shown in Fig. 7.3, (ii) the mappings M = {(q_1, S_1, exact), (q_2, S_2, sound)}, where q_1 and q_2 are those shown in Fig. 7.5(b) and 7.5(d) respectively, and (iii) the source schema of the example illustrated in Section III. Let D = {D_1, D_2} be the data sources shown in Fig. 7.5(c) and 7.5(e) respectively. One can verify that such a data integration system is consistent w.r.t. D. Let T be a legal data tree. By the mapping specification, we have that q_1(T) ⊆ D_1 and D_2 ⊆ q_2(T).
Therefore, we are sure that T does not contain any other node labeled patient having a child SSN with value lower than . But then, since a node patient and its child SSN with value  belong to D_2, such nodes in particular exist in T. Moreover, since T ⊨ S_G, each node patient in T has a unique child SSN. Therefore, we are sure that there are no other nodes patient in T having a child SSN with that value. This implies that, for every homomorphism from D to T, the two nodes patient with that SSN value are both mapped to the same node of T. On the other hand, by applying Id_G to D_1 and D_2, we obtain the trees shown in Fig. 8.4(a) and 8.4(b), where the γ_i are (pairwise different) Skolem constants, for i = 1, 2, 3. We therefore obtain that the two nodes mentioned above are assigned two different global ids, patient.γ_1 and patient.γ_2. Thus, from the above result, given a data integration system Π and a set of data sources D consistent w.r.t. Π, Id_G does not allow us to identify all nodes in D that

Figure 8.4: Counterexample proving that Id_G is not complete (the trees Id_G(D_1) and Id_G(D_2), where the patient nodes are assigned the Skolem ids γ_1, γ_2, γ_3).

actually represent the same node for each T ∈ sem(Π, D), as soon as there exists at least one mapping in M that is exact. It does suffice, however, for particular XML DISs, based on the notion of Visible Keys Restriction (VKR) introduced next.

Definition. A system has the VKR property if:

For every element a that is a member of a collection of S_G, there exists a key constraint a.k_a → a ∈ Φ_K.

For every view M_i such that (S_i, M_i, exact) ∈ M, M_i is such that, whenever it selects an element with a key, it also selects its key.

We next show that the above restriction guarantees that Id_G is complete even without the assumption that all mappings are sound.

Theorem. For each VKR data integration system Π = ⟨G, S, M⟩, Id_G is a complete identification function.

Proof. Let us refer to the proof of Theorem 8.2.4. One can verify that the assumption of having only sound mappings is used only at the end of that proof, to show that, whenever two global ids contain a Skolem, there always exist a legal data tree T and a homomorphism from D to T mapping the two nodes to two distinct nodes of T. Now assume VKR holds. It is easy to see that such an assumption guarantees that all nodes can be uniquely identified. Therefore, the case in which two global ids are distinct because of the presence of Skolems cannot occur. Thus, by following exactly the same arguments as in the proof of Theorem 8.2.4, we prove the claim.

8.3 XML DIS consistency

As already mentioned, it may happen that an XML DIS is inconsistent (cf. Section 1.2 for the general DIS consistency problem). In this section, we therefore study the XML DIS consistency problem, as introduced in Section 1.2.

Next, we introduce and discuss the possible causes of inconsistency of an XML DIS.

First, the global schema specification may be inconsistent, i.e. there may exist no tree satisfying G. We next give an example of an inconsistent schema specification.

Example. Let G = ⟨S_G, Φ_K, Φ_FK⟩ be a schema specification such that S_G is shown in Fig. 8.5(a), whereas Φ_K and Φ_FK are as follows: Φ_K = {B.K → B, E.FK → E}, Φ_FK = {E.FK ⊆ B.K}. According to the schema, each tree T that conforms to S_G is such that each node labeled B is characterized by a unique value for its children labeled K. Moreover, each node labeled B has two children p_1, p_2 labeled C and D, respectively. Let n_1 and n_2 be the children labeled E of p_1 and p_2, respectively. Since FK is a key for E, n_1 and n_2 have two children labeled FK with distinct values. But since FK is also a foreign key referring to the key value of its ancestor labeled B, and since n_1, n_2 have a common ancestor B, clearly no tree exists satisfying S_G, Φ_K and Φ_FK simultaneously. Thus, G is inconsistent.

Indeed, the above example shows that the inconsistency of G is due to a particular interaction between keys and foreign keys. It was shown in [42] that the problem of verifying whether a schema specification given in terms of a general DTD and a set of unary keys and foreign keys is consistent is NP-complete. It is an open problem whether this problem is solvable in PTIME for our simplified DTDs.

Second, even if the global schema is consistent, a particular mapping (S, q, as), with as ∈ {sound, exact}, may be inconsistent w.r.t. a global schema G, in the sense that for every tree T that satisfies G, q(T) = ∅.

Example. Let S_G and q be respectively the tree type of a global schema specification G = ⟨S_G, Φ_K, Φ_FK⟩ and the p-query of a mapping M = (S, q, sound), shown in Fig. 8.5(b) and 8.5(c) respectively. Clearly, M is inconsistent w.r.t.
S_G, since there exists no tree T satisfying S_G such that q(T) ≠ ∅.

Lemma. Given a consistent global schema, the problem of checking whether a mapping M is inconsistent w.r.t. this schema specification is decidable in PTIME w.r.t. the size of G and M.

Proof. Let G = ⟨S_G, Φ_K, Φ_FK⟩, with S_G = ⟨Σ_G, r_G, µ_G⟩, be the global schema, and M = (S, q, as), with as ∈ {sound, exact}, a mapping in M such that q = ⟨t_q, λ_q, cond_q, ret_q⟩. Let r_q be the root of t_q. It is possible to check whether this mapping is inconsistent w.r.t. S_G by calling the function consistent(r_q, r_G), where consistent(n_q, a) is recursively defined as follows:

if λ_q(n_q) ≠ a, then return false; otherwise, for every n_i child of n_q in t_q:

if λ_q(n_i)^{ω_i} ∉ µ_G(a) for every ω_i ∈ {1, ?, +, *}, then return false; otherwise, return the conjunction, over all children n_i of n_q, of consistent(n_i, λ_q(n_i)).

Note that this check is PTIME in the size of q and S_G. Thus, the claim is proved.

Figure 8.5: Cases of inconsistencies ((a) tree type, (b) tree type, (c) p-query, (d) source schema, (e) data source).

Third, a mapping (S, q, as), with as ∈ {sound, exact}, may be inconsistent w.r.t. a data source D, because D is inconsistent w.r.t. q, i.e. there exists no data tree T such that D ⊆ q(T).

Example. Let D and S be respectively the data source and the schema shown in Fig. 8.5(e) and 8.5(d). It is easy to see that D conforms to S. Now, let q be the p-query shown in Fig. 8.5(c). Clearly, the mapping M = (S, q, sound) is inconsistent w.r.t. D.

Lemma. The problem of checking whether a query is inconsistent w.r.t. a data tree (and thus whether a mapping is inconsistent w.r.t. a data source) is decidable in PTIME.
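For concreteness, the recursive compatibility check consistent(n_q, a) used in the proof of the mapping-consistency lemma above can be sketched in Python. This is only an illustrative sketch under assumed encodings, not the thesis's formal algorithm: the p-query tree is a plain dict with hypothetical keys `label` and `children`, and `mu` abstracts µ_G by mapping each label to the set of labels allowed as children with some multiplicity.

```python
def consistent(query, mu, n_q, a):
    """Compatibility of the p-query subtree rooted at n_q with the
    element type a of the global schema: labels must agree, and every
    child label must be allowed (with some multiplicity) by mu[a]."""
    if query["label"][n_q] != a:
        return False
    children = query["children"].get(n_q, [])
    # every child label must appear (with some multiplicity) in mu[a]
    if any(query["label"][c] not in mu.get(a, set()) for c in children):
        return False
    # recurse on every child subtree
    return all(consistent(query, mu, c, query["label"][c]) for c in children)
```

Each node is visited once, so the check is linear in the size of the query, matching the PTIME claim of the lemma.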

Figure 8.6: Example ((a) query q_1, (b) query q_2, (c) source D_1, (d) source D_2).

Proof. Let q be a p-query and D a data tree. A PTIME algorithm to decide whether q is consistent w.r.t. D consists in building from q the query q′ obtained by eliminating the existential subtree patterns of q, and then checking whether q′(D) ⊇ D.

Until now, we showed that it is decidable to check whether the global schema specification or a set of mappings are inconsistent (either w.r.t. the global schema, or w.r.t. a set of data sources D). However, consider the example below.

Example. Let Π = ⟨G, S, M⟩ be the XML DIS such that G, S are those of Example 8.1.1, and M = {(S_1, q_1, sound), (S_2, q_2, sound)}, where q_1 and q_2 are shown in Fig. 8.6(a) and 8.6(b) respectively. Let D = {D_1, D_2} be the sources shown in Fig. 8.6(c) and 8.6(d) respectively. G is clearly consistent, since in Fig. 8.1 we already provided a data tree for it. Similarly, one can easily verify that the mappings are consistent w.r.t. G and D. However, no legal data tree exists for Π w.r.t. D, since D_1 and D_2 contain two patients with the same SSN but different names.

The example above shows that, even in the presence of a consistent global schema specification and a consistent set of mappings, an XML DIS may still be inconsistent. Next, we show that this can be checked in PTIME under the assumption that the system includes only sound mappings, whereas the problem becomes NP-hard if it includes an exact mapping. From now on, we assume to have a consistent global schema specification, and a set of mappings that are consistent with respect to both the global schema and the data sources.

Theorem. The problem of checking whether an XML DIS including only sound mappings is consistent w.r.t. a set of data sources is decidable. More precisely, it is PTIME in data complexity.
Proof. Let Π = ⟨G, S, M⟩ be an XML DIS with all sound mappings, such that G = ⟨S_G, Φ_K, Φ_FK⟩ and S_G = ⟨Σ_G, r_G, µ_G⟩, and let D be a set of data sources conforming to S. In order to show the claim, we show how to use Id_G to verify that Π is consistent. To this aim, suppose we define Id_G as shown in Fig. 8.2, and afterwards apply it to D. We next show that Π is inconsistent if and only if there exists a pair of nodes n_1, n_2 ∈ D for which one of the following occurs:

1. either Id_G(n_1) = Id_G(n_2) = ǫ and n_1 and n_2 have different labels;

2. or Id_G(n_1) = Id_G(n_2) and ν(n_1) ≠ ν(n_2);

3. or a.k_a → a is a key in Φ_K, and Id_G(n_1), Id_G(n_2) have the form Id_G(n_1) = X_1.a.γ and Id_G(n_2) = X_2.a.γ, where X_1 ≠ X_2 and γ ∈ Γ.

Suppose first that Π is consistent. Clearly, 1 cannot occur. Moreover, if Π is consistent, by Theorem 8.2.3 Id_G is sound. Then, let n_1, n_2 be two nodes of D. By definition of sound identification, if Id_G(n_1) = Id_G(n_2), then for each T ∈ sem(Π, D) and for each h from D to T, h(n_1) = h(n_2). Thus, 2 cannot occur. Now consider 3. Since Π is consistent and includes only sound mappings, by Theorem 8.2.4 there exist T ∈ sem(Π, D) and a homomorphism h from D to T such that h(n_1) ≠ h(n_2). But this contradicts the fact that T is legal w.r.t. G, since it implies that there exist two distinct nodes of T having the same key value γ.

Suppose now that no pair of nodes satisfies any of the three conditions above. Then, we construct T as follows. For each n ∈ D, we insert into T a node n′ having the same label, the same data value and the same id as n, and such that: if n is a root, then n′ is the root of T; otherwise, if n is a child of a node p, then n′ is a child of the node having id Id_G(p). Note that the fact that none of conditions 1, 2 and 3 holds ensures that the above construction is well-defined. Moreover, it is easy to see that T satisfies the mappings and weakly satisfies G. Thus, there exists T′ such that T is subsumed by T′ and T′ satisfies G. Thus T′ ∈ sem(Π, D), which proves that Π is consistent.

Theorem. The problem of checking whether an XML DIS is consistent w.r.t. a set of data sources is NP-hard in data complexity.

Proof. The proof is by reduction from 3-colorability. Let G = ⟨V, E⟩ be an arbitrary graph, with vertices V = {V_1, ..., V_n} and edges E = {E_1, ..., E_m}. We now show how to build an XML DIS Π = ⟨G, S, M⟩ and a set of data sources D such that Π is consistent w.r.t. D if and only if G is 3-colorable. Let us first define the global schema G = ⟨S_G, Φ_K, Φ_FK⟩ as follows: S_G is shown in Fig.
8.7(a), Φ_K is composed of the single key constraint C.Name → C, and Φ_FK = ∅. The idea is to define M and D so that every tree T satisfying M w.r.t. D is such that:

1. T contains exactly three subtrees rooted at C, each corresponding to a different color among Blue, Yellow, Green; thus, T encodes a coloring that uses these three colors;

2. for each vertex V_i ∈ V and each edge E_j ∈ E, there exists at least one subtree rooted at R with children V, E having data values V_i and E_j respectively; thus, T encodes a coloring of a supergraph of G;

3. for each vertex V_i ∈ V, all occurrences of the same vertex belong to the same subtree rooted at C; thus, we say that T is a well-defined encoding of a coloring of G, meaning that a vertex is assigned at least one color and, if it is assigned a color, then it is assigned the same color in each edge;

4. for each edge (V_{j1}, V_{j2}) ∈ E, there exists at least one color that is assigned to V_{j1} and not to V_{j2}, and conversely; then, we say that T encodes a correct coloring of G.

In order to obtain the encoding described above, we define M as the set composed of:

the mapping (q_1, S_1, exact), such that q_1 asks for all the colors available, cf. Fig. 8.7(b);

n mappings of the form (q_2^i, S_2^i, sound), such that q_2^i asks for all the edges of the graph to which the vertex V_i belongs, cf. Fig. 8.7(d);

m mappings (q_3^j, S_3^j, sound), such that q_3^j asks for all the vertices that belong to the edge E_j, cf. Fig. 8.7(f).

Finally, we build the set of data sources D composed of:

the source D_1, containing the colors Blue, Yellow, Green, cf. Fig. 8.7(c);

n sources D_2^i, each containing a unique subtree rooted at C, and one subtree rooted at R for each edge to which V_i belongs, cf. Fig. 8.7(e);

m sources D_3^j, each containing the two vertices that belong to E_j, cf. Fig. 8.7(g).

Clearly, the specification above ensures that for each coloring of G there exists a tree T satisfying M w.r.t. D that is a well-defined encoding of a correct coloring of G. Moreover, for each T satisfying M w.r.t. D, there exists a coloring of a supergraph G′ of G such that T is a well-defined encoding of a correct coloring of G′. We say that T is a proper encoding of a coloring of G if each vertex is assigned a unique color. We next show that there exists a legal data tree for Π w.r.t. D if and only if G is 3-colorable.
(⇒) Suppose that T is a legal data tree. Then, T satisfies the mappings M, which implies that T is a well-defined encoding of a correct coloring of a supergraph G′ of G. Moreover, since T is legal, T satisfies the key constraint in Φ_K, which implies that T contains exactly three distinct nodes labeled C, i.e. one for each available color. Thus, T encodes a proper coloring of G′ and, in particular, a proper coloring of G. Hence, G is 3-colorable.

(⇐) Assume that G is 3-colorable. Then, it is possible to build a tree T that is a well-defined and proper encoding of a correct coloring of G. Thus, T contains exactly three distinct nodes labeled C, each corresponding to a different color. Clearly, T satisfies the mappings w.r.t. D and the key constraint in Φ_K. Hence, Π is consistent w.r.t. D.
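To make the construction concrete, the data sources of the reduction can be generated mechanically from the graph. The following sketch uses a hypothetical (label, value, children) tuple encoding of data trees; it is only meant to illustrate the shape of D_1, the D_2^i and the D_3^j, not to reproduce the figures exactly.

```python
def build_sources(vertices, edges):
    """Sources of the 3-colorability reduction for G = (V, E).
    Every source is encoded as a (label, value, children) triple."""
    colors = ["Blue", "Yellow", "Green"]
    # D1: one C subtree per available color (used by the exact mapping q1)
    d1 = ("G", None, [("C", None, [("Name", c, [])]) for c in colors])
    # one D2^i per vertex: a unique C subtree plus one R subtree
    # per edge containing the vertex (sound mappings q2^i)
    d2 = {v: ("G", None,
              [("C", None, [])]
              + [("R", None, [("V", v, []), ("E", j, [])])
                 for j, (x, y) in enumerate(edges) if v in (x, y)])
          for v in vertices}
    # one D3^j per edge: the two vertices belonging to the edge (sound q3^j)
    d3 = {j: ("G", None,
              [("R", None, [("V", x, []), ("E", j, [])]),
               ("R", None, [("V", y, []), ("E", j, [])])])
          for j, (x, y) in enumerate(edges)}
    return d1, d2, d3
```

For a triangle graph, for instance, D_1 has three C subtrees, and each D_2^i contains one C subtree plus one R subtree per incident edge, matching the description above.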

Figure 8.7: Reduction from 3-colorability ((a) global schema G, (b) query q_1, (c) data source D_1, (d) query q_2^i, (e) data source D_2^i, (f) query q_3^j, (g) data source D_3^j).

Chapter 9

XML DIS query answering

In this chapter, we investigate query answering in an XML DIS. We start by giving a lower bound for query answering under the assumption of having exact mappings. Then, we introduce incomplete trees and discuss their use to solve query answering. Finally, we present two PTIME algorithms for query answering, the first to be applied in a restricted setting, the second more general, and we show their correctness.

9.1 Lower bound for query answering under exact mappings

We next provide a lower bound for query answering under the assumption of having exact mappings.

Theorem. The query answering problem is coNP-hard in data complexity.

Proof. The proof is by reduction from the following simple variant of 3-colorability, named Bonifaci's 3-colorability.

PROBLEM: Bonifaci's 3-colorability
INPUT: a 4-colorable graph G = ⟨V, E⟩
QUESTION: is G 3-colorable?

It is easy to see that Bonifaci's 3-colorability is NP-hard. Indeed, from [46], deciding whether a planar graph is 3-colorable is NP-hard. On the other hand, it is well known that each planar graph is 4-colorable [10]. Thus, since a planar graph is a particular case of a 4-colorable graph, deciding whether a 4-colorable graph is 3-colorable is NP-hard. Now, suppose we have a graph G that is 4-colorable. We show how to build a consistent XML DIS Π, a set of data sources D′, a data tree T and a query q such that T is a certain answer to q w.r.t. D′ if and only if G is not 3-colorable. The construction is very similar to the one presented in the proof of the NP-hardness theorem of Section 8.3. In particular, Π = ⟨G, S, M⟩ is exactly the same, whereas D′ is obtained from D by replacing the source D_1 with the source D′_1 shown in Fig. 9.1(a). It is easy to see that, given a graph G that is 4-colorable, Π is consistent w.r.t. D′. Moreover, for

Figure 9.1: Reduction from Bonifaci's 3-colorability ((a) data source D′_1, (b) data tree T, (c) query q).

each correct coloring of G using the colors Blue, Yellow, Green, Red, there exists a legal data tree that encodes it. Conversely, for each legal data tree of Π w.r.t. D′, there exists a supergraph G′ of G such that the tree encodes a correct coloring of G′. Consider now the data tree T shown in Fig. 9.1(b) and the query q shown in Fig. 9.1(c). We next show that T is a certain answer to q if and only if G is not 3-colorable.

(⇒) Suppose that G is 3-colorable. Then, there exists a legal data tree T′ for Π w.r.t. D′ such that T′ encodes a correct coloring of G using only three colors. This means that there is a color that is not assigned to any vertex. But then, if we pose q over T′, we obtain only three nodes labeled C having a child R. Hence, T is not a certain answer.

(⇐) Assume that G is not 3-colorable. Then, every correct coloring of G needs 4 colors, i.e. in every correct coloring of G each of the four available colors is assigned to at least one vertex. Thus, each legal data tree is an encoding of a correct coloring of a supergraph of G, and contains at least one subtree rooted at R under each of the four nodes labeled C corresponding to the available colors. Hence, T is a certain answer to q.

The previous result shows that, if node ids are not available, then the problem of answering p-queries in the XML DIS setting is difficult already under the assumption of having only one exact mapping, no foreign keys, and mappings that are specified by means of p-queries not including existential subtree patterns. As we will see, this result no longer holds if we assume the VKR.

9.2 Incomplete trees

In order to get an intuition of the overall approach to XML DIS query answering, observe the following.
Given an XML DIS Π and a set of data sources D such that Π is consistent w.r.t. D, D contains information that belongs to each legal data tree represented by the DIS. In other terms, D represents the known portion of the full data accessible by means of queries over the DIS. In addition, on the one hand there is partial information about the content of each legal data tree that comes from the specification of the global schema; on the other hand, there is partial information on the portion of each legal data tree satisfying (exact) mappings. It turns out that

incomplete trees, presented in [6], are appropriate for representing the set of legal data trees w.r.t. D, in that they actually include a prefix tree that is known to belong to each represented data tree, as well as a description of the missing information. However, since incomplete trees require node identifiers shared among data sources, in order to be able to use them we need to apply to D beforehand a sound and complete identification function, as introduced in Section 8.2. This means that, in order to use Id_G, we assume that either all mappings are sound, or we are under VKR. To formally introduce identified incomplete trees, let us first recall some preliminary definitions from [6].

Definition. Let Σ be an alphabet. A simple conditional tree type over Σ is a tuple ⟨Σ, R, µ, cond⟩ where:

R ⊆ Σ is the set of root labels;

µ is a mapping associating to each a ∈ Σ a disjunction µ(a) of multiplicity atoms;

cond associates a condition to each a ∈ Σ (note that, as for p-queries, the condition applies to the data values of nodes with label a).

We say that a data tree T over Σ satisfies a simple conditional tree type ⟨Σ, R, µ, cond⟩ over Σ, noted T ⊨ ⟨Σ, R, µ, cond⟩, if and only if (i) the root of T has a label in R, (ii) for each node n of T such that λ(n) = a, if µ(a) = m_1 ∨ ... ∨ m_m, where each m_i is a multiplicity atom, then all the children of n have type m_i for some i ∈ {1, ..., m}, and (iii) for each node n of T such that λ(n) = a, ν(n) satisfies cond(a).

Intuitively, conditional tree types, similarly to tree types, are used to describe a set of valid trees. However, in order to be able to describe missing information, they are more powerful than tree types. More specifically, they allow disjunctions of multiplicity atoms, which are used to represent different possible alternatives for the types of missing elements. Moreover, they allow the specification of conditions on the data values of types.
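As an illustration, satisfaction of a simple conditional tree type can be sketched as follows. This is a simplified sketch under assumed encodings, not the formal semantics: each disjunct of µ(a) is represented, for brevity, as a predicate over the list of child labels (one predicate per multiplicity atom), and each condition cond(a) as a predicate over data values; all names are hypothetical.

```python
def satisfies(tree, roots, mu, cond):
    """Does the data tree (label, value, children) satisfy the simple
    conditional tree type?  mu[a] is a disjunction given as a list of
    predicates over the list of child labels; cond[a] is a predicate
    over the node's data value."""
    if tree[0] not in roots:                        # (i) root label in R
        return False

    def check(node):
        a, v, kids = node
        if not cond.get(a, lambda _: True)(v):      # (iii) value condition
            return False
        disjuncts = mu.get(a, [lambda labels: labels == []])
        # (ii) some disjunct must accept all the children of the node
        if not any(d([k[0] for k in kids]) for d in disjuncts):
            return False
        return all(check(k) for k in kids)

    return check(tree)
```

Note how the disjunction is resolved independently at each node, mirroring the "for some i" in the definition: different nodes with the same label may satisfy different multiplicity atoms.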
Note that alternative types and conditions on data values both express information on the content of legal data trees that comes from the schema and mapping specification. To illustrate this point, and in particular the need for conditional tree types, we give the following example.

Example. Let q_2 and D_2 be respectively the query and the data source shown in Fig. 7.5(d) and Fig. 7.5(e). Moreover, let Π = ⟨G, S, M⟩ be the DIS such that G = ⟨S_G, Φ_K, Φ_FK⟩ is the global schema of Example 7.3.3, S = {S_2} where S_2 is the DTD introduced in Chapter III, and M = {M_2}, where M_2 = (S_2, q_2, exact). Then, D_2 conforms to S_2, and it is easy to see that Π is consistent w.r.t. D = {D_2}. Now, imagine that you want to describe all the information you have about legal data trees. First, you would probably like to represent the fact that each legal data tree contains the information in D_2. We will see how we can achieve this by using incomplete trees. Second, since the mapping is exact, you would also like to describe the portion of each legal data tree that does not contribute to the answer to q_2, i.e.

the portion that does not come from D_2. Thus, you need to be able to represent that, besides the elements patient coming from D_2, each legal data tree may contain elements labeled patient such that at least one of the following is true: they have no child labeled bill; they have no child labeled cure; or they have a child labeled cure with empty data value, or with data value greater than 35.

The example above shows the need for introducing disjunctions of multiplicity atoms and conditions on data values, in order to describe possible alternative types for missing elements. However, this immediately requires introducing specialization, which leads us to the following definition.

Definition. A conditional tree type over Σ and a specialized alphabet Σ′ is a tuple ⟨Σ′, R, µ, cond, σ⟩ where:

⟨Σ′, R, µ, cond⟩ is a simple conditional tree type;

σ is a specialization mapping from Σ′ to Σ.

The semantics of conditional tree types is defined as follows. A data tree T over Σ satisfies the conditional tree type ⟨Σ′, R, µ, cond, σ⟩, noted T ⊨ ⟨Σ′, R, µ, cond, σ⟩, if and only if there exists T′ such that:

T′ ⊨ ⟨Σ′, R, µ, cond⟩;

σ(T′) = T.

Now that we have conditional tree types, we are finally able to introduce incomplete trees, which combine the representation of the missing information with the information coming from D.

Definition. An incomplete tree over an alphabet Σ is a tuple T = ⟨N, λ, ν, τ⟩ where:

N ⊆ 𝒩 is a finite set of nodes;

λ : N → Σ is a labeling of the nodes in N;

ν : N → Γ ∪ V_S associates to each node in N a data value in Γ or a Skolem constant in V_S;

τ = ⟨Σ′, R, µ, cond, σ⟩ is a conditional tree type over the alphabet N ∪ Σ such that, for each data tree T′ satisfying τ:

for each n ∈ N, there is at most one node of T′ labeled n;

if a node in T′ has its label in N, then its parent's label is also in N.
As for the semantics of incomplete trees, a data tree T = ⟨t, λ, ν⟩ over Σ belongs to the set of trees represented by an incomplete tree T, denoted rep(T), if and only if there exists a data tree T_0 = ⟨t_0, λ_0, ν_0⟩ over N ∪ Σ such that:

- T_0 satisfies τ;
- for each node n_0 of T_0, n_0 ∈ N if and only if λ_0(n_0) ∈ N, in which case n_0 = λ_0(n_0);
- if n_0 is a node of T_0 and n_0 ∈ N, then if ν(n_0) ∉ V_S then ν_0(n_0) = ν(n_0);
- T is obtained from T_0 by changing each label n ∈ N to λ(n) ∈ Σ.

In a nutshell, in order to represent the mix of known and missing information, incomplete trees allow one to specify a set N of instantiated nodes, together with their labels and data values. Instantiated nodes are then viewed as labels, and as such they can have multiple specializations, reflecting the fact that they are allowed to appear in different contexts. Note that the above definition differs from that of [6] only because of the Skolems that can appear as data values of instantiated nodes. This difference is due to the presence of existential subtree patterns in queries, which may require that an incomplete tree reflect the presence of nodes having an unknown data value.

9.3 Query answering using incomplete trees

In this section we aim at giving an intuition of how to use incomplete trees to solve XML DIS query answering. Thus, we continue making the assumptions of having a consistent XML DIS and a set of data sources provided with persistent node ids, assigned by Id_G under the assumptions that make it sound and complete. The main idea is to use incomplete trees as a representation system [58] for legal data trees. We thus follow an approach that is typical in the presence of incomplete information. This is not surprising since, as already discussed in Section 1.5, it is well-known [53] that LAV data integration query answering is strongly related to the problem of querying an incomplete database. Specifically, given an XML DIS Π, a set of data sources D, a query q and a data tree T, we construct an incomplete tree T such that T is subsumed by all trees in rep(T) if and only if it is subsumed by all legal data trees.
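The last step of the semantics above, in which a data tree T_0 over N ∪ Σ is turned into a data tree over Σ by replacing each instantiated-node label n ∈ N with its alphabet label λ(n), can be sketched in Python as follows (the data structures here are hypothetical simplifications, not part of the formal development):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: object          # either an alphabet symbol or an instantiated node id
    value: str = ""        # data value
    children: list = field(default_factory=list)

def relabel(t0, instantiated, lam):
    """Replace each label in the instantiated set by lam(n); keep alphabet
    labels untouched. Returns a fresh tree over the alphabet only."""
    new_label = lam[t0.label] if t0.label in instantiated else t0.label
    return Node(new_label, t0.value,
                [relabel(c, instantiated, lam) for c in t0.children])
```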
Let us now observe the following. Incomplete trees were introduced in [6] to represent and query incomplete information coming from a DTD and from a sequence of consecutive queries over an XML document that conforms to the DTD and has persistent node identifiers. Moreover, they have been proved to form a strong representation system with respect to ps-queries, i.e. p-queries not including existential subtree patterns. Thus, let us consider the case of an XML DIS Π = ⟨G, S, M⟩ and a set of data sources D = {D_1, …, D_n} such that Π is consistent w.r.t. D, G = ⟨S_G, ∅, ∅⟩, and M = {M_1, …, M_n} is such that each M_i has the form M_i = (q_i, S_i, exact), where q_i is a ps-query. Then, the results of [6] already provide correct algorithms for computing, in PTIME:

- an incomplete tree T_i = q_i⁻¹(D_i), for each mapping M_i = (q_i, S_i, exact) ∈ M; then, T_i represents the set of trees satisfying M_i w.r.t. D_i: rep(T_i) = {T | D_i = q_i(T)};

- an incomplete tree T = ⋂_{i ∈ {1,…,n}} T_i, where T_i = q_i⁻¹(D_i) for each mapping M_i = (q_i, S_i, exact) in M, and each data source D_i ∈ D conforms to S_i; then, T represents the set of trees satisfying M w.r.t. D: rep(T) = ⋂_{i ∈ {1,…,n}} rep(T_i);
- an incomplete tree T′ = SatType(T, S_G); then, T′ represents the set of trees that are represented by T and satisfy the global tree type S_G: rep(T′) = rep(T) ∩ {T | T ⊨ S_G};
- an incomplete tree q(T); then, q(T) represents the set of answers returned by each tree represented by T, i.e. rep(q(T)) = {q(T′) | T′ ∈ rep(T)}.

INPUT: a consistent XML DIS Π = ⟨G, S, M⟩ such that G = ⟨S_G, ∅, ∅⟩,
       M = {M_1, …, M_m}, M_i = (q_i, S_i, exact), q_i a ps-query,
       D = {D_1, …, D_m} with global ids, a ps-query q, a data tree T
OUTPUT: true or false

T := q_1⁻¹(D_1)
for i := 2 to m do
    T_i := q_i⁻¹(D_i)
    T := Intersection(T_i, T)
T := SatType(T, S_G)
if T is subsumed by q(T) then return true else return false

Figure 9.2: Algorithm Answer(Π, D, q, T)

Moreover, the results of [6] show how to check whether a data tree is a certain prefix of all trees represented by an incomplete tree, and we can adapt such a check to our setting in order to decide whether a data tree is subsumed by all trees represented by an incomplete tree. Thus, given a data tree T and a ps-query q, under the above-mentioned restrictions, we can apply the algorithm Answer shown in Fig. 9.2. Clearly, from all the above considerations, it follows that, under the above restrictions, algorithm Answer(Π, D, q, T) constructs an incomplete tree T such that rep(T) = sem(Π, D). Thus, since rep(q(T)) = {q(T′) | T′ ∈ rep(T)}, we have that T is subsumed by q(T) if and only if T is subsumed by q(T′) for each T′ ∈ sem(Π, D), which proves that Answer is correct. Moreover, by [6], Answer is PTIME in data complexity. In the next section, we provide two algorithms that solve the general XML-based DIS query answering problem by following an approach that is very similar to the one described above.
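The control flow of algorithm Answer (Fig. 9.2) can be sketched as follows. The operations on incomplete trees (inverse query evaluation, Intersection, SatType, the subsumption check) are passed in as functions, since their definitions come from [6]; all names below are hypothetical, and for illustration an "incomplete tree" may be modeled extensionally as the finite set of data trees it represents.

```python
def answer(mappings, sources, global_type, q, t,
           inverse_query, intersection, sat_type, subsumed_by):
    """Return true iff t is subsumed by q(T') for every represented tree T'."""
    inc = inverse_query(mappings[0], sources[0])           # T := q_1^{-1}(D_1)
    for m_i, d_i in zip(mappings[1:], sources[1:]):
        inc = intersection(inverse_query(m_i, d_i), inc)   # T := Intersection(T_i, T)
    inc = sat_type(inc, global_type)                       # keep trees satisfying S_G
    return subsumed_by(t, [q(x) for x in inc])             # is T subsumed by q(T)?
```

With the extensional model, intersection is plain set intersection and subsumption is set inclusion, so the loop structure can be exercised on toy data.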
Here we instead come back to the assumption of using Id G

under the conditions that make it sound and complete. As discussed in Section 8.2, whereas Id_G is sound and complete under the assumption of sound mappings, it is not complete as soon as there is at least one exact mapping. One may legitimately wonder whether it is possible to build a different identification function with the same complexity as Id_G, namely PTIME by Theorem 8.2.2, that is sound and complete also under the assumption of having exact mappings. The answer turns out to be negative, as shown below.

Theorem For any data integration system Π that has at least one exact mapping, under the assumption that P ≠ NP, there exists no identification function that is sound and complete w.r.t. Π and can be computed in PTIME.

Proof. Let Π be an XML DIS having all exact mappings. By contradiction, suppose that there exists an identification function F with PTIME complexity that is sound and complete w.r.t. Π. Then, we can apply F to D and obtain a set of data sources with node ids. But then, by using the results of [6], we can solve query answering in PTIME, which contradicts the hardness theorem for XML DIS query answering.

9.4 Query answering algorithms

In this section we provide two algorithms to solve query answering. The first algorithm solves query answering under the VKR restriction and the assumption of not having any key constraint. The second one is more general, and requires only that the conditions hold that ensure that Id_G is sound and complete.

Algorithm under VKR and no key constraints

In this section we present an algorithm that reduces XML DIS query answering, under VKR and under the assumption of not having any key constraint, to the setting proposed in [6], and thus to the algorithm Answer of Fig. 9.2. More precisely, let Π = ⟨G, S, M⟩ be an XML DIS with G = ⟨S_G, ∅, ∅⟩, and let D be a set of data sources such that Π is consistent w.r.t. D. We recall that under VKR, Id_G is sound and complete.
Thus, we can apply Id_G to D and obtain a set of data sources that share node ids as in [6]. However, this does not suffice to apply the results of [6]. Indeed, we have to deal with two major differences w.r.t. [6], namely the presence of existential subtree patterns in the mapping specification and the presence of sound mappings. Intuitively, the main idea is the following. For each mapping M_i ∈ M such that M_i = (q_i, S_i, as_i), we abstractly consider D_i ∈ D as a data source:

- that represents all possible data sources satisfying the mapping M_i′ = (q_i′, S_i′, as_i′) obtained by modifying q_i and S_i, so that q_i′ returns the nodes that are required to exist but are not returned by q_i;
- that is characterized by a color C_i; thus, if as_i = sound, D_i is seen as providing, for each collection of nodes a in S_G, exactly the nodes with label a and color C_i.
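The second point, tagging the data of a source with its color, can be sketched as follows. This is a simplified illustration with dictionary-based trees; in the text the color child is added only to members of collections of S_G, approximated here by an explicit set of labels:

```python
def colorize(node, collection_labels, color):
    """Append a child labeled C with data value `color` to every node whose
    label denotes a collection member. The tree is traversed before the
    color child is appended, so added children are never revisited."""
    for child in list(node["children"]):
        colorize(child, collection_labels, color)
    if node["label"] in collection_labels:
        node["children"].append({"label": "C", "value": color, "children": []})
```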

Following the above intuition, we preliminarily compute Π′ = ⟨G′, S′, M′⟩ and D′ = {D_1′, …, D_m′} as follows:

- G′ allows for incorporating in each legal data tree the information about the color of the data sources. More precisely, G′ = ⟨S_G′, ∅, ∅⟩ is obtained from G = ⟨S_G, ∅, ∅⟩ by replacing S_G = ⟨Σ_G, r_G, µ_G⟩ with S_G′ = ⟨Σ_G′, r_G, µ_G′⟩ such that Σ_G′ = Σ_G ∪ {C}, and µ_G′ is defined as follows: for each a ∈ Σ_G,

  µ_G′(a) = µ_G(a) C   if a is a member of some collection of S_G
  µ_G′(a) = µ_G(a)     otherwise

- M′ = {M_1′, …, M_m′} is obtained from M = {M_1, …, M_m} by replacing each M_i = (q_i, S_i, as_i) with M_i′ = (q_i′, S_i′, as_i′), where M_i′ differs from M_i in that (i) it requires existential subtree patterns to be returned by q_i′, and thus to belong to the data source, and (ii) if M_i is sound, the data source provides exact information about the collections of nodes colored with C_i. More precisely, this is achieved by first constructing the ps-query version of q_i′ = ⟨t_i′, λ_i′, cond_i′, ret_i′⟩, denoted Ps-query(q_i′), starting from q_i, and then modifying it by setting ret_{q_i′}(m) = true for each m ∈ t_{q_i′}. Then, if as_i = sound, we further modify q_i′ as follows: for each m labeled a that has some child n labeled b such that b^ω occurs in µ_G(a) with ω ∈ {∗, +}, we add a child n_c of m in t_i′, and we set λ_i′(n_c) = C, cond_i′(n_c) = (= C_i), and ret_i′(n_c) = true. Then, S_i′ is defined in the obvious manner.
- For each i ∈ {1, …, m}, D_i′ is obtained from D_i as follows. Intuitively, let M_i = (q_i, S_i, as_i) be such that q_i = ⟨t_{q_i}, λ_{q_i}, cond_{q_i}, ret_{q_i}⟩. Then, for each m in D_i, we possibly add a subtree to make D_i′ satisfy M_i′. More precisely, for each m ∈ D_i, let m_q be the node of t_{q_i} such that there exists a partial function γ from the nodes n_q of t_{q_i} to the nodes of D_i such that γ(m_q) = m and:
  - γ(n_q) is defined for each n_q such that ret_{q_i}(n_q) = true;
  - γ preserves the parent-child relationship and the labeling;
  - ν_i(γ(n_q)) satisfies cond_{q_i}(n_q).
Clearly, since M_i is consistent w.r.t. D_i, m_q and γ always exist. Now, for each child n_q of m_q such that ret_{q_i}(n_q) = false, we apply the recursive step AddNode(m_q, n_q, m), defined as follows: we add a child n of m in D_i′, and we set λ_i′(n) = λ_{q_i′}(n_q) and ν_i′(n) = v_s, where v_s is a fresh Skolem in V_S; then, for each child c_q of n_q, we call AddNode(n_q, c_q, n).
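The recursive step AddNode can be sketched as follows (hypothetical dictionary-based trees: the query node n_q is given as a pair of its label and the list of its children, and fresh Skolems are generated by a counter):

```python
import itertools

_fresh = itertools.count()

def fresh_skolem():
    """A fresh element of V_S, here just a counter-based name."""
    return f"v_{next(_fresh)}"

def add_node(nq, m):
    """Add below m a node for the query node nq = (label, children),
    carrying a fresh Skolem data value, then recurse on nq's children."""
    label, children = nq
    n = {"label": label, "value": fresh_skolem(), "children": []}
    m["children"].append(n)
    for child in children:
        add_node(child, n)
    return n
```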

Finally, intuitively, if M_i is sound then, for each collection of nodes in D_i, we add the information about the color that characterizes D_i. More precisely, for each m ∈ D_i that is labeled a, and for each child n of m labeled b such that b^ω occurs in µ_G(a) with ω ∈ {∗, +}, we add a child n_c of n in D_i′, and we set λ_i′(n_c) = C and ν_i′(n_c) = C_i.

Clearly, the above computation returns an XML DIS Π′ and a set of data sources D′ such that Π′ is consistent w.r.t. D′. Moreover, by construction, Π′ includes only exact mappings not involving existential subtree patterns. Also, under VKR, we can apply Id_G to D′ and obtain a set of data sources with global ids. Thus, given a data tree T and a p-query q, by considering Ps-query(q), we have finally reduced our setting to the particular setting described in the previous section, which allows us to apply the algorithm Answer shown in Fig. 9.2. Note, however, that for each data tree, Ps-query(q) returns a tree that is not consistent w.r.t. q. Thus, to ensure correctness, we first have to verify that T is consistent w.r.t. q, which can be done in PTIME (cf. Lemma 8.3.5).

Theorem Let Π = ⟨G, S, M⟩ be an XML DIS under VKR, with M mixed, and let D be a set of sources such that Π is consistent w.r.t. D. Moreover, let q be a p-query, and A a data tree such that A is consistent w.r.t. q. Then, A ⊑ q(Π, D) if and only if Answer(Π′, Id_G(D′), q′, A) = true, where Π′ and D′ are computed as described above and q′ = Ps-query(q).

Proof. In order to prove the theorem, we show that

A ⊑ q′(T) for each T ∈ rep(T)   if and only if   A ⊑ q(T′) for each T′ ∈ sem(Π, D).

Thus, we first show that for each T ∈ rep(T) there exists T′ ∈ sem(Π, D) such that if A ⊑ q′(T) then A ⊑ q(T′). Then we show that for each T′ ∈ sem(Π, D) there exists T ∈ rep(T) such that if A ⊑ q(T′) then A ⊑ q′(T). Let T be a data tree in rep(T), and suppose that A ⊑ q′(T). Clearly, since A is consistent w.r.t.
q, we have that A does not contain any node that is mapped to a node n of q′(T) for which there exists a valuation from a node n_q of q′ with ret(n_q) = false. Thus, by the semantics of p-queries, we have that A ⊑ q(q′(T)) = q(T). Let us now construct a tree T′ starting from T and eliminating all nodes labeled C. It is easy to verify that, by construction, T′ satisfies S_G and M w.r.t. D; thus, T′ ∈ sem(Π, D). Moreover, since q does not involve nodes labeled C, we have that q(T) = q(T′). Then, since we showed that A ⊑ q(T), we obtain A ⊑ q(T′). Let now T′ be a tree in sem(Π, D), and suppose that A ⊑ q(T′). Then, by the semantics of p-queries, we have that A ⊑ q′(T′). But then, let T be the tree obtained starting from T′ and modifying it minimally so that q_i′(T) = D_i′. It can be shown that we obtain a tree that satisfies S_G′ and M′, and is such that A ⊑ q′(T).

By the previous theorem, by the results of [6], and by the considerations made in the previous section, one can immediately show the following.

Theorem XML DIS query answering under VKR, and under the assumption of not having key constraints, is PTIME in data complexity.

Algorithm under Id_G sound and complete

In this section, we provide an algorithm to solve general XML DIS query answering under the assumption of Id_G being sound and complete. The idea is to generalize both algorithms proposed in the previous sections, in order to deal uniformly with sound and exact mappings. Thus, this section is strongly related to the results of [6]. We start by giving several preliminary results.

First, let us introduce the function SatMapping. This takes as input a global schema specification G = ⟨S_G, Φ_K, Φ_FK⟩, a mapping specification M = (S, q, as) and a data source D conforming to S, such that q = ⟨t_q, λ_q, cond_q, ret_q⟩ and D = ⟨t_D, λ_D, ν_D⟩, and returns an incomplete tree T = ⟨N, λ, ν, τ⟩ such that τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩. Intuitively, T is characterized by (i) a set of instantiated nodes that come from the data source and are known to form a tree that is a prefix of each represented tree, and (ii) a type that describes the information that does not come from the data source. Obviously, if the mapping is sound, we do not know anything about the portion of information that does not come from the source, whereas, if the mapping is exact, we do know the possible alternative types for it. A crucial issue is the presence of existential subtree patterns in the query q. In order to reflect the presence of such tree patterns in each tree represented by T, we add a set of instantiated nodes carrying Skolem data values. This is the reason why we need the global schema G: it is used to assign a node id to each newly introduced node. Let Σ = {a_1, …, a_n}; then we construct T as follows. Note that below we denote by all the multiplicity atom τ_{a_1}∗ ⋯ τ_{a_n}∗.

1. We build the initial set of instantiated nodes. For this, we set ⟨N, λ, ν⟩ = D.

2. Depending on as, we define τ differently.
- If as = exact, then the construction of τ is very similar to the one proposed in [6], with the only difference that we need to check whether a type is returned by the query. We define Σ′ as the set

  Σ′ = {τ_a | a ∈ Σ} ∪ {τ_n | n ∈ N} ∪ {τ̌_m | m ∈ t_q, ret_q(m) = false} ∪ {τ̄_m | m ∈ t_q} ∪ {τ̂_m | m ∈ t_q}.

  Intuitively, the meaning of these types is the following: τ_a is the type of all nodes labeled a, without any constraint on the node and its subtree; τ_n describes the type of the node n ∈ N; τ̄_m describes the nodes with label λ_q(m) that make q false at m by violating cond_q(m); τ̂_m describes the nodes with label λ_q(m) that satisfy cond_q(m) but for which the subtree of q rooted at m cannot be matched below the node. We set R = {τ_{r_D}}, where r_D is the root of N.

  For each a ∈ Σ, we set σ(τ_a) = a and cond(τ_a) = true. Assume that Σ = {a_1, …, a_n}, and let all be the multiplicity atom τ_{a_1}∗ ⋯ τ_{a_n}∗. We set µ(τ_a) = all. For each m in t_q, we set σ(τ̄_m) = λ_q(m), cond(τ̄_m) = ¬cond_q(m), and µ(τ̄_m) = all. If m is not a leaf, then let m_1, …, m_l be the children of m. We set σ(τ̂_m) = λ_q(m), cond(τ̂_m) = cond_q(m), and µ(τ̂_m) = ∨_{1 ≤ i ≤ l} α_i, where α_i is the multiplicity atom τ̄_{m_i} τ̂_{m_i}∗ else_i, and else_i contains τ_a∗ for every a ∈ Σ with a ≠ λ_q(m_i), i = 1, …, l. For each n ∈ N, we set σ(τ_n) = λ(n) and cond(τ_n) = (= ν(n)). If n is a leaf, we set µ(τ_n) = all. Otherwise, let m be the node of t_q such that there is a valuation from q to D mapping m to n (note that such a valuation exists since the mapping is consistent). Let n_1, …, n_k be the children of n, and let m_1, …, m_l be the children of m such that ret_q(m_i) = true. We set µ(τ_n) = τ_{n_1} ⋯ τ_{n_k} τ̄_{m_1}∗ τ̂_{m_1}∗ ⋯ τ̄_{m_l}∗ τ̂_{m_l}∗ else_n, where else_n contains τ_a∗ for each a ∈ Σ that is not a label of any of the children of n in D.

- If as = sound, then we only know that D is a prefix of all legal data trees. Thus, we proceed as follows. We set Σ′ = {τ_a | a ∈ Σ} ∪ {τ_n | n ∈ N}, where the meaning of each type is as above. We set R = {τ_{r_D}}, where r_D is the root of N. For each a ∈ Σ, we set σ(τ_a) = a, cond(τ_a) = true, and µ(τ_a) = all. For each n ∈ N, we set σ(τ_n) = λ(n) and cond(τ_n) = (= ν(n)). If n is a leaf, we set µ(τ_n) = all. Otherwise, let n_1, …, n_k be the children of n; we set µ(τ_n) = τ_{n_1} ⋯ τ_{n_k} else_n, where else_n contains τ_a∗ for each a ∈ Σ.

3. We add to N the nodes that are required to belong to each represented data tree because of an existential subtree pattern. To this aim, we need two distinct Skolems in V_S to denote the id and the data value of each newly introduced node.
Specifically, we do so by repeatedly applying the following rule. For each n ∈ N, let m be the node of t_q such that there is a valuation from m to n. If there exists a child m_i of m such that ret_q(m_i) = false, then call the function AddNode(G, T, n, λ_q(m_i), cond_q(m_i), v′) shown in Fig. 9.3, where: if Id_G(n) = X.λ(n).v and Φ_K contains a key constraint λ(n).b → λ(n), then v′ = v; otherwise, v′ is a fresh Skolem. Intuitively, this function adds an instantiated node with label λ_q(m_i) and a data value satisfying cond_q(m_i), as well as a corresponding type in each disjunct of µ(t) for each type t corresponding to the instantiated node n. Clearly, in doing so we need to continue guaranteeing that node ids

are assigned coherently with Id_G; that is why we need to use the global schema G. Note that the above construction certainly terminates.

INPUT: an incomplete tree T = ⟨N, λ, ν, τ⟩ with τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩,
       a node n in N, a label b ∈ Σ, a schema G = ⟨S_G, Φ_K, Φ_FK⟩,
       a condition c and a data value v

add n′ to N
set λ(n′) = b
set ν(n′) = v
for each a_i ∈ Σ′ such that σ(a_i) = n
    add b_i to Σ′
    set σ(b_i) = n′
    set cond(b_i) = c
    add b_i^1 to each disjunct α in µ(a_i)
if b^ω occurs in µ_G(λ(n)) with ω ∈ {1, ?}
    then set Id_G(n′) = Id_G(n).b
    else set Id_G(n′) = Id_G(n).b.v_s, where v_s is a fresh Skolem in V_S

Figure 9.3: Function AddNode(G, T, n, b, c, v)

Clearly, from the above construction it is possible to state the following.

Lemma Given a mapping M = (S, q, as) and a data source D conforming to S, such that M = (S, q, as) is consistent w.r.t. D, SatMapping returns an incomplete tree T representing all trees that satisfy M w.r.t. D, i.e.:

- rep(T) = {T | D = q(T)}, if as = exact;
- rep(T) = {T | D ⊑ q(T)}, if as = sound.

Note that, similarly to [6], it turns out that the incomplete trees obtained by computing the function SatMapping all have a particularly simple structure, called unambiguous and defined next.

Definition An incomplete tree T = ⟨N, λ, ν, τ⟩, where τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩, is unambiguous if for every a ∈ Σ′ and every multiplicity atom α in µ(a):

1. if a_i^ω occurs in α and σ(a_i) ∈ N, then ω = 1; otherwise, ω = ∗;
2. if a_i and a_j, i ≠ j, occur in α and σ(a_i) = σ(a_j) ∈ Σ, then cond(a_i) ∧ cond(a_j) is unsatisfiable;
3. if a_i and a_j, i ≠ j, occur in α and σ(a_i) = σ(a_j) ∈ Σ, then there exists a_k^1 occurring in α such that σ(a_k) = n ∈ N and λ(n) = σ(a_i) = σ(a_j).

It is easy to see that an incomplete tree returned by SatMapping is always unambiguous. Another useful notion is the notion of compatible trees. Note that this differs from the notion of compatible trees of [6] because of empty Skolem data values.

More precisely, two incomplete trees ⟨N_1, λ_1, ν_1, τ_1⟩ and ⟨N_2, λ_2, ν_2, τ_2⟩, with τ_1 = ⟨Σ_1, R_1, µ_1, cond_1, σ_1, N_1 ∪ Σ⟩ and τ_2 = ⟨Σ_2, R_2, µ_2, cond_2, σ_2, N_2 ∪ Σ⟩, are said to be compatible if, for each n ∈ N_1 ∩ N_2, we have that:

- λ_1(n) = λ_2(n); and
- either at least one among ν_1(n), ν_2(n) is empty, or:
  - if ν_1(n), ν_2(n) ∈ Γ, then ν_1(n) = ν_2(n);
  - if ν_1(n), ν_2(n) ∈ V_S, then for each couple of types (t_1, t_2) such that σ_1(t_1) = n = σ_2(t_2), we have that cond_1(t_1) ∧ cond_2(t_2) is satisfiable.

Note that if the system is consistent, then by construction, two incomplete trees obtained by applying the function SatMapping to two mappings and the corresponding data sources are compatible. We next show that, given a couple T_1, T_2 of unambiguous compatible incomplete trees, their intersection T = Intersection(T_1, T_2), with T = ⟨N, λ, ν, τ⟩ and τ = ⟨Σ′, R, µ, cond, σ, N ∪ Σ⟩, is computed exactly as in [6], except for the fact that Σ′ may contain types coming from the merge of two instantiated nodes introduced by the functions SatMapping(M_i, D_i), i = 1, 2, to reflect the presence of a node of the same type in an existential subtree pattern specified in each M_i. Indeed, in [6], Σ′ consists of all pairs of compatible types, where this last notion has to be modified in order to take into account also the case described above. Thus, two types t_1 ∈ Σ_1 and t_2 ∈ Σ_2 are compatible if one of the conditions specified in [6] holds, or the following one does: σ_1(t_1), σ_2(t_2) ∈ (N_1 \ N_2) ∪ (N_2 \ N_1), σ_1(t_1) ≠ σ_2(t_2), and σ_1(t_1), σ_2(t_2) can be unified. Then, for each couple of compatible types satisfying the condition above, we set:

- σ((t_1, t_2)) = n_u, where n_u is the identifier that results from the unification of σ_1(t_1) and σ_2(t_2);
- cond((t_1, t_2)) = cond_1(t_1) ∧ cond_2(t_2);
- λ(n_u) = λ_1(σ_1(t_1)) = λ_2(σ_2(t_2));
- ν(n_u) = v, where v is a fresh Skolem that satisfies cond((t_1, t_2)).
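The node-level part of the compatibility test can be sketched as follows. This is a simplification: conditions on types are elided, and a Skolem is recognized, by a hypothetical convention, as a string value starting with "v_":

```python
def is_skolem(v):
    return isinstance(v, str) and v.startswith("v_")

def compatible(nodes1, nodes2):
    """nodes1, nodes2: dicts mapping node ids to (label, value) pairs.
    Check the shared instantiated nodes of two incomplete trees."""
    for n in nodes1.keys() & nodes2.keys():
        (l1, v1), (l2, v2) = nodes1[n], nodes2[n]
        if l1 != l2:
            return False                  # labels must agree
        if not v1 or not v2:
            continue                      # an empty value is compatible with anything
        if is_skolem(v1) or is_skolem(v2):
            continue                      # Skolems can be unified (type conditions elided)
        if v1 != v2:
            return False                  # two distinct constants clash
    return True
```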
Note that, as in [6], compatibility ensures that this construction is well-defined, since: σ_1(t_1) and σ_2(t_2) can be unified, and thus, by the construction of Id_G, t_1 and t_2 have the same label; and cond_1(t_1) ∧ cond_2(t_2) is satisfiable. Finally, the definition of µ given in [6] can be easily adapted to take into account Skolem data values. Given the similarity of the construction above with the one presented in [6], one can easily verify that the lemma below holds.

Lemma Let T_1, T_2 be two unambiguous incomplete trees. Then, Intersection(T_1, T_2) returns an incomplete tree T such that rep(T) = rep(T_1) ∩ rep(T_2). Moreover, T is unambiguous.

From the last two lemmas, it follows that if we compute an incomplete tree T_i = SatMapping(M_i, D_i) for each i ∈ {1, …, n}, and then we compute the intersection of all such incomplete trees, we obtain an incomplete tree that represents all trees satisfying M. Let us now focus on the global schema specification G = ⟨S_G, Φ_K, Φ_FK⟩. We can combine the information in an incomplete tree with the information coming from S_G exactly as done in [6], that is, by using the function SatType. Concerning the key constraints in Φ_K, by the use of Id_G we have that the constraints in Φ_K are implicitly satisfied by each data tree in rep(T). Now, consider the foreign keys in Φ_FK. We proceed by computing CloseFK(T), which essentially applies the well-known chase technique to the incomplete tree T. However, since the chase may lead to an infinite incomplete tree, we apply it until it adds instantiated nodes with node identifiers that cannot be unified with already present node ids. Indeed, given that we consider uniquely localizable foreign keys, and we assume that the tree type is not recursive, it can be easily seen that this condition is sufficient to ensure termination. Moreover, it ensures that CloseFK(T) is representative of all trees in rep(T) that simultaneously satisfy Φ_FK (details are omitted).

INPUT: a consistent XML DIS Π = ⟨G, S, M⟩ such that G = ⟨S_G, Φ_K, Φ_FK⟩
       and M = {M_1, …, M_m}, D = {D_1, …, D_m} with global ids
       (assigned by Id_G), a p-query q = ⟨t, λ, cond, ret⟩, and a data
       tree T such that T is consistent w.r.t. q
OUTPUT: true or false

T := SatMapping(M_1, D_1)
for i := 2 to m do
    T_i := SatMapping(M_i, D_i)
    T := Intersection(T_i, T)
T := SatType(T, S_G)
T := CloseFK(T)
if T is subsumed by q′(T), where q′ = Ps-query(q), then return true else return false

Figure 9.4: Algorithm Answer(2)
We are finally able to present, in Fig. 9.4, the algorithm Answer for general XML DIS query answering under the assumptions that keep Id_G sound and complete. As already mentioned, this algorithm generalizes the algorithm presented in Fig. 9.2.

Theorem Given a consistent data integration system Π = ⟨G, S, M⟩ and a set D of data sources conforming to S, Answer(Π, D, q, T) = true if and only if T ⊑ q(Π, D).

Proof. The proof is similar to the proof of Theorem 9.4.1, and follows from the previous lemmas and from the fact that, clearly, by construction, sem(Π, D) ⊆ rep(T).

Given the construction described above, we strongly conjecture that, under the assumptions that make Id_G sound and complete, XML DIS query answering is PTIME in data complexity.


Conclusion

In this thesis, we have studied the problem of modeling a data integration system (DIS), and of detecting whether it is consistent with respect to a set of data sources. Moreover, we have addressed the issues of answering queries and performing updates over a DIS. We have tackled the above problems considering both a structured and a semi-structured data model for the global schema.

More specifically, in the first part we have focused on DISs characterized by a global schema expressing the intensional level of a Description Logic ontology. We have first proposed and studied the new language DL-Lite_A, particularly tailored for expressing ontologies. Then, we have provided LOGSPACE algorithms for checking DL-Lite_A KB satisfiability and for answering conjunctive queries. We have shown that both algorithms allow us to reduce the main reasoning services over the KB to the evaluation of a first-order query over a database. Afterwards, we have motivated all these preliminary results on DL-Lite_A KB analysis by showing that DL-Lite_A DISs allow for separating reasoning from the access to the actual data sources. This led us to consistency and query answering algorithms for DL-Lite_A DISs that keep the notable property of being LOGSPACE. Finally, we have started studying the problem of updating a DIS, by first considering the case of the instance-level update of an ontology expressed by means of a DL KB. Our results show that a restricted variant of DL-Lite_A has several nice characteristics, including the fact that the result of an update is always expressible within the DL itself.

In the part concerning the investigation of XML-based data integration, we have focused on the problem of adapting the theoretical approach to data integration to the XML data model. This has raised the basic and difficult issue of the identification of data source nodes.
Indeed, in practice, one often makes the unrealistic assumption of dealing with data sources provided with persistent node identifiers. In this thesis we have defined a new notion of identification function, which exploits key information on the data (e.g. introduced by an Entity Resolution module). Based on the use of such a new function, we have provided different algorithms for solving query answering under different assumptions on the DIS specification. Moreover, we have shown that XML DIS consistency and query answering are in general NP-hard and coNP-hard in data complexity, thus confirming in our (simplified) XML setting results that were provided in [3] concerning the relational (LAV) setting.

There are several interesting directions for continuing our research, both in the ontology-based and the XML-based context. First, clearly, it would be interesting to study the relationship existing between the two contexts for data integration. In particular, we plan to investigate whether it is possible to apply a query rewriting

technique to XML in the spirit of the one adopted in the context of ontology-based DIS. Secondly, we aim at finding a complete characterization of consistency and query answering in the XML setting. Thirdly, we plan to study the problem of updating a DIS. In particular, for ontology-based DIS update, we aim at providing an algorithm to compute updates that exploits the one proposed in this thesis for updating a KB, and that is applicable to a general DL-Lite_A DIS. Also, we plan to integrate the results of our investigation on updates in QUONTO [7], which currently implements the algorithms for query answering and satisfiability of a KB expressed in a restricted variant of DL-Lite_A. Fourthly, in this thesis we adopted a classical model-based approach to update, stemming from the existing literature on updating knowledge bases. Other approaches to update have been studied, and their application to DIS might be of interest, as well as approaches based on belief revision. We believe that, in principle, several approaches to update and belief revision could coexist on the same DIS, in order to model different types of services involving some sort of instance evolution. Finally, it is worth noting that updates bring in the general issue of dealing with inconsistency in DIS. In this thesis, we have addressed the issue of detecting inconsistencies, whereas we have not addressed at all the problem of reconciling mutually inconsistent data from the data sources. This is a challenging topic that deserves research efforts both in the context of ontology-based DIS and of XML-based DIS. Also, relatively to updates over KBs, the semantics that we have considered addresses the issue of solving inconsistency between the current instance level of the ontology and what has been asserted by the update, while it does not deal with inconsistencies between the update and the intensional level.
As already mentioned in Section 1.4, it would be interesting to study possible semantics that are tolerant with respect to the latter form of inconsistency.

Bibliography

[1] Serge Abiteboul, Omar Benjelloun, and Tova Milo. Positive AXML. In Proc. of the 23rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004).
[2] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web. Morgan Kaufmann Publishers, San Francisco, California.
[3] Serge Abiteboul and Oliver Duschka. Complexity of answering queries using materialized views. In Proc. of the 17th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 98).
[4] Serge Abiteboul and Gösta Grahne. Update semantics for incomplete databases. In Proc. of the 11th Int. Conf. on Very Large Data Bases (VLDB 85), pages 1–12.
[5] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison Wesley Publ. Co., Reading, Massachusetts.
[6] Serge Abiteboul, Luc Segoufin, and Victor Vianu. Representing and Querying XML with Incomplete Information. ACM Trans. on Database Systems. To appear.
[7] Andrea Acciarri, Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Mattia Palmieri, and Riccardo Rosati. QUONTO: QUerying ONTOlogies. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005).
[8] Bernd Amann, Catriel Beeri, Irini Fundulaki, and Michel Scholl. Querying XML sources using an ontology-based mediator. In CoopIS/DOA/ODBASE.
[9] Sihem Amer-Yahia and Yannis Kotidis. A web-services architecture for efficient XML data exchange. In ICDE 04: Proceedings of the 20th International Conference on Data Engineering, page 523.
[10] K. Appel, Wolfgang Haken, and John Koch. Every Planar Map is Four Colorable. Illinois Journal of Mathematics, 21, 1977.
[11] Marcelo Arenas, Pablo Barcelo, Ronald Fagin, and Leonid Libkin. Locally consistent transformations and query answering in data exchange. In Proc. of

the 23rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004).
[12] Marcelo Arenas and Leonid Libkin. XML data exchange: Consistency and query answering. In Proc. of the 24th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2005).
[13] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope. In Proc. of the 20th Int. Joint Conf. on Artificial Intelligence (IJCAI 2005).
[14] Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press.
[15] Franz Baader and Philipp Hanschke. A schema for integrating concrete domains into concept languages. In Proc. of the 12th Int. Joint Conf. on Artificial Intelligence (IJCAI 91).
[16] Franz Baader, Ian Horrocks, and Ulrike Sattler. Description logics as ontology languages for the semantic web. In Mechanizing Mathematical Reasoning: Essays in Honor of Jörg Siekmann on the Occasion of His 60th Birthday, number 2605 in Lecture Notes in Artificial Intelligence. Springer.
[17] Franz Baader, Carsten Lutz, Maya Milicic, Ulrike Sattler, and Frank Wolter. Integrating description logics and action formalisms: First results. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005).
[18] Michael Benedikt, Chee-Yong Chan, Wenfei Fan, Juliana Freire, and Rajeev Rastogi. Capturing both types and constraints in data integration. In Proc. of the 22nd ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2003).
[19] O. Benjelloun, H. Garcia-Molina, J. Jonas, Q. Su, and J. Widom. Swoosh: A generic approach to entity resolution. Technical report, Stanford University. Available at /pub/
[20] Alexander Borgida. Language features for flexible handling of exceptions in information systems. ACM Trans. on Database Systems, 10(4).
[21] Alexander Borgida.
Description logics in data management. IEEE Trans. on Knowledge and Data Engineering, 7(5).
[22] Peter Buneman, Susan Davidson, Wenfei Fan, Carmen Hara, and Wang-Chiew Tan. Keys for XML. In Proc. of the 10th Int. World Wide Web Conf. (WWW 2001).
[23] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Data integration under integrity constraints. Information Systems, 29(2), 2004.

[24] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Data integration under integrity constraints. In Proc. of the 14th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2002), volume 2348 of Lecture Notes in Computer Science. Springer.
[25] Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Paolo Naggar, and Fabio Vernacotola. IBIS: Semantic data integration at work. In Proc. of the 15th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2003), pages 79–94.
[26] Andrea Calì, Domenico Lembo, and Riccardo Rosati. Query rewriting and answering under constraints in data integration systems. In Proc. of the 18th Int. Joint Conf. on Artificial Intelligence (IJCAI 2003), pages 16–21.
[27] Andrea Calì, Domenico Lembo, Riccardo Rosati, and Marco Ruzzi. Experimenting data integration with In Proc. of the 16th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2004), volume 3084 of Lecture Notes in Computer Science. Springer.
[28] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. What to ask to a peer: Ontology-based query reformulation. In Proc. of the 9th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2004).
[29] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. DL-Lite: Tractable description logics for ontologies. In Proc. of the 20th Nat. Conf. on Artificial Intelligence (AAAI 2005).
[30] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Data complexity of query answering in description logics. In Proc. of the 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2006).
[31] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family.
Submitted to an international journal.
[32] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Logical foundations of peer-to-peer data integration. In Proc. of the 23rd ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2004).
[33] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. View-based query processing and constraint satisfaction. In Proc. of the 15th IEEE Symp. on Logic in Computer Science (LICS 2000), 2000.

[34] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Reasoning on regular path queries. SIGMOD Record, 32(4):83–92.
[35] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, and Riccardo Rosati. Linking data to ontologies: The description logic DL-Lite_A. In Proc. of the 2nd Workshop OWLED. To appear.
[36] Sudarshan S. Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, and Jennifer Widom. The TSIMMIS project: Integration of heterogeneous information sources. In Proc. of the 10th Meeting of the Information Processing Society of Japan (IPSJ 94), pages 7–18.
[37] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6).
[38] Luna Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In Proc. of the 24th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2005), pages 85–96.
[39] Thomas Eiter and Georg Gottlob. On the complexity of propositional knowledge base revision, updates and counterfactuals. Artificial Intelligence, 57.
[40] Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data exchange: Semantics and query answering. In Proc. of the 9th Int. Conf. on Database Theory (ICDT 2003).
[41] Ronald Fagin, Phokion G. Kolaitis, and Lucian Popa. Data exchange: Getting to the core. ACM Trans. on Database Systems, 30(1).
[42] Wenfei Fan and Leonid Libkin. On XML integrity constraints in the presence of DTDs. J. of the ACM, 49(3).
[43] Mary Fernandez, Yana Kadiyska, Dan Suciu, Atsuyuki Morishima, and Wang-Chiew Tan. SilkRoute: A framework for publishing relational data in XML. ACM Trans. on Database Systems, 27(4).
[44] Helena Galhardas, Daniela Florescu, Dennis Shasha, and Eric Simon. An extensible framework for data cleaning.
Technical Report 3742, INRIA, Rocquencourt.
[45] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vassalos, and Jennifer Widom. The TSIMMIS approach to mediation: Data models and languages. J. of Intelligent Information Systems, 8(2).
[46] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to NP-Completeness. W. H. Freeman and Company, San Francisco (CA, USA), 1979.

[47] Giuseppe De Giacomo, Maurizio Lenzerini, Antonella Poggi, and Riccardo Rosati. On the update of description logic ontologies at the instance level. In Proc. of the 21st Nat. Conf. on Artificial Intelligence (AAAI 2006).
[48] François Goasdoué, Véronique Lattès, and Marie-Christine Rousset. The use of CARIN language and algorithms for information integration: The Picsel system. Int. J. of Cooperative Information Systems, 9(4).
[49] Luca Grieco, Domenico Lembo, Marco Ruzzi, and Riccardo Rosati. Consistent query answering under key and exclusion dependencies: Algorithms and experiments. In Proc. of the 14th Int. Conf. on Information and Knowledge Management (CIKM 2005).
[50] Benjamin N. Grosof, Ian Horrocks, Raphael Volz, and Stefan Decker. Description logic programs: Combining logic programs with description logic. In Proc. of the 12th Int. World Wide Web Conf. (WWW 2003), pages 48–57.
[51] Laura M. Haas, Eileen T. Lin, and Mary T. Roth. Data integration through database federation. IBM Systems Journal, 41(4).
[52] Peter Haase and Ljiljana Stojanovic. Consistent evolution of OWL ontologies. In Proc. of the 2nd European Semantic Web Conference.
[53] Alon Y. Halevy. Answering queries using views: A survey. Very Large Database J., 10(4).
[54] Alon Y. Halevy. Structures, semantics and statistics. In Proc. of the 30th Int. Conf. on Very Large Data Bases (VLDB 2004).
[55] Alon Y. Halevy, Zachary G. Ives, Peter Mork, and Igor Tatarinov. Piazza: Data management infrastructure for semantic web applications. In Proc. of the 12th Int. World Wide Web Conf. (WWW 2003).
[56] Richard Hull. A survey of theoretical research on typed complex database objects. In J. Paredaens, editor, Databases. Academic Press.
[57] Ullrich Hustadt, Boris Motik, and Ulrike Sattler. Data complexity of reasoning in very expressive description logics. In Proc. of the 20th Int. Joint Conf.
on Artificial Intelligence (IJCAI 2005).
[58] Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. J. of the ACM, 31(4).
[59] David S. Johnson and Anthony C. Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. J. of Computer and System Sciences, 28(1).
[60] Thomas Kirk, Alon Y. Levy, Yehoshua Sagiv, and Divesh Srivastava. The Information Manifold. In Proceedings of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments, pages 85–91, 1995.

[61] H. J. Komorowski. A specification of an abstract Prolog machine and its application to partial evaluation. Technical Report LSST 69, Linköping University.
[62] Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Source inconsistency and incompleteness in data integration. In Proc. of the 9th Int. Workshop on Knowledge Representation meets Databases (KRDB 2002). CEUR Electronic Workshop Proceedings.
[63] Maurizio Lenzerini. Data integration: A theoretical perspective. In Proc. of the 21st ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2002).
[64] Nicola Leone, Thomas Eiter, Wolfgang Faber, Michael Fink, Georg Gottlob, Gianluigi Greco, Edyta Kalka, Giovambattista Ianni, Domenico Lembo, Maurizio Lenzerini, Vincenzino Lio, Bartosz Nowicki, Riccardo Rosati, Marco Ruzzi, Witold Staniszkis, and Giorgio Terracina. The INFOMIX system for advanced integration of incomplete and inconsistent data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data.
[65] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB 96).
[66] Alon Y. Levy and Marie-Christine Rousset. Combining Horn rules and description logics in CARIN. Artificial Intelligence, 104(1-2).
[67] Alon Y. Levy, Divesh Srivastava, and Thomas Kirk. Data model and query evaluation in global information systems. J. of Intelligent Information Systems, 5.
[68] Chen Li, Ramana Yerneni, Vasilis Vassalos, Hector Garcia-Molina, Yannis Papakonstantinou, Jeffrey D. Ullman, and Murty Valiveti. Capability based mediation in TSIMMIS. In Proc. of the 17th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 1999).
[69] Hongkai Liu, Carsten Lutz, Maja Milicic, and Frank Wolter. Updating description logic ABoxes. In Proc. of the 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2006).
[70] John W. Lloyd.
Foundations of Logic Programming (Second, Extended Edition). Springer, Berlin, Heidelberg.
[71] John W. Lloyd and John C. Shepherdson. Partial evaluation in logic programming. J. of Logic Programming, 11.
[72] M. N. Wegman and M. S. Paterson. Linear unification. J. of Computer and System Sciences, 16(2), 1978.

[73] Ioana Manolescu, Daniela Florescu, and Donald Kossmann. Answering XML queries over heterogeneous data sources. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB 2001).
[74] Nelson Mendonça. Integrating information for on demand computing. In Proc. of the 29th Int. Conf. on Very Large Data Bases (VLDB 2003).
[75] Oracle Integration, integration_home.html.
[76] Maria Magdalena Ortiz, Diego Calvanese, and Thomas Eiter. Characterizing data complexity for conjunctive query answering in expressive description logics. In Proc. of the 21st Nat. Conf. on Artificial Intelligence (AAAI 2006).
[77] OWL Web Ontology Language Overview, owl-features/.
[78] Yannis Papakonstantinou, Serge Abiteboul, and Hector Garcia-Molina. Object fusion in mediator systems. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB 96).
[79] Yannis Papakonstantinou, Hector Garcia-Molina, and Jeffrey D. Ullman. MedMaker: A mediation system based on declarative specifications. In Stanley Y. W. Su, editor, Proc. of the 12th IEEE Int. Conf. on Data Engineering (ICDE 96).
[80] Antonella Poggi and Serge Abiteboul. XML data integration with identification. In Proc. of the 10th Int. Workshop on Database Programming Languages (DBPL 2005).
[81] Antonella Poggi and Marco Ruzzi. Filling the gap between data integration and data federation. In Proc. of the 12th Ital. Conf. on Database Systems (SEBD 2004).
[82] Lucian Popa, Yannis Velegrakis, Renée J. Miller, Mauricio A. Hernández, and Ronald Fagin. Translating web data. In Proc. of the 28th Int. Conf. on Very Large Data Bases (VLDB 2002).
[83] Raymond Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. The MIT Press.
[84] Richard B. Scherl and Hector J. Levesque. Knowledge, action, and the frame problem.
Artificial Intelligence, 144(1-2):1–39.
[85] Jayavel Shanmugasundaram, Jerry Kiernan, Eugene J. Shekita, Catalina Fan, and John Funderburk. Querying XML views of relational data. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB 2001), 2001.

[86] Ron van der Meyden. Logical approaches to incomplete information. In Jan Chomicki and Günter Saake, editors, Logics for Databases and Information Systems. Kluwer Academic Publisher.
[87] Marianne Winslett. Reasoning about action using a possible models approach. In Proc. of the 15th Nat. Conf. on Artificial Intelligence (AAAI 98).
[88] Marianne Winslett. Updating Logical Databases. Cambridge University Press.
[89] Ramana Yerneni, Chen Li, Hector Garcia-Molina, and Jeffrey D. Ullman. Computing capabilities of mediators. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data.
[90] Cong Yu and Lucian Popa. Constraint-based XML query rewriting for data integration. In Proc. of the 23rd ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2004).


ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel

More information

Technologies for a CERIF XML based CRIS

Technologies for a CERIF XML based CRIS Technologies for a CERIF XML based CRIS Stefan Bärisch GESIS-IZ, Bonn, Germany Abstract The use of XML as a primary storage format as opposed to data exchange raises a number of questions regarding the

More information

Query Management in Data Integration Systems: the MOMIS approach

Query Management in Data Integration Systems: the MOMIS approach Dottorato di Ricerca in Computer Engineering and Science Scuola di Dottorato in Information and Communication Technologies XXI Ciclo Università degli Studi di Modena e Reggio Emilia Dipartimento di Ingegneria

More information

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents M. Anber and O. Badawy Department of Computer Engineering, Arab Academy for Science and Technology

More information

CSE 233. Database System Overview

CSE 233. Database System Overview CSE 233 Database System Overview 1 Data Management An evolving, expanding field: Classical stand-alone databases (Oracle, DB2, SQL Server) Computer science is becoming data-centric: web knowledge harvesting,

More information

Demonstrating WSMX: Least Cost Supply Management

Demonstrating WSMX: Least Cost Supply Management Demonstrating WSMX: Least Cost Supply Management Eyal Oren 2, Alexander Wahler 1, Bernhard Schreder 1, Aleksandar Balaban 1, Michal Zaremba 2, and Maciej Zaremba 2 1 NIWA Web Solutions, Vienna, Austria

More information

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT Journal of Information Technology Management ISSN #1042-1319 A Publication of the Association of Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION MAJED ABUSAFIYA NEW MEXICO TECH

More information

Distributed Database for Environmental Data Integration

Distributed Database for Environmental Data Integration Distributed Database for Environmental Data Integration A. Amato', V. Di Lecce2, and V. Piuri 3 II Engineering Faculty of Politecnico di Bari - Italy 2 DIASS, Politecnico di Bari, Italy 3Dept Information

More information

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems Proceedings of the Postgraduate Annual Research Seminar 2005 68 A Model-based Software Architecture for XML and Metadata Integration in Warehouse Systems Abstract Wan Mohd Haffiz Mohd Nasir, Shamsul Sahibuddin

More information

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Chapter 1: Introduction. Database Management System (DBMS) University Database Example This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information

More information