Structured and Semi-Structured Data Integration


UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
DOTTORATO DI RICERCA IN INGEGNERIA INFORMATICA
XIX CICLO 2006

UNIVERSITÉ DE PARIS SUD
DOCTORAT DE RECHERCHE EN INFORMATIQUE

Structured and Semi-Structured Data Integration

Antonella Poggi


UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
DOTTORATO DI RICERCA IN INGEGNERIA INFORMATICA
XIX CICLO

UNIVERSITÉ DE PARIS SUD
DOCTORAT DE RECHERCHE EN INFORMATIQUE

Antonella Poggi

Structured and Semi-Structured Data Integration

Thesis Committee:
Prof. Maurizio Lenzerini (Advisor, Italy)
Prof. Serge Abiteboul (Advisor, France)

Reviewers:
Prof. Bernd Amann
Prof. Alex Borgida
Prof. Riccardo Rosati

AUTHOR'S ADDRESS IN ITALY:
Antonella Poggi
Dipartimento di Informatica e Sistemistica
Università degli Studi di Roma La Sapienza
Via Salaria 113, I Roma, Italy

AUTHOR'S ADDRESS IN FRANCE:
Antonella Poggi
Département d'Informatique
Université de Paris Sud
Orsay Cedex, France

poggi@dis.uniroma1.it
WWW: poggi/

To Mario


Acknowledgements

Everything started one day in June 2000, when I decided to go on an Erasmus exchange at the École Polytechnique in Paris, and it was suggested that I attend the database lectures given by Prof. Abiteboul. When, in January 2001, I decided to follow Prof. Abiteboul's extra lectures in order to present a database project, he was kind enough to sit beside me and teach me how to write my first HTML page: my homepage. This was his way of introducing me to XML! A first internship at I.N.R.I.A. was my first great experience with research, and from that day on, I never gave up dreaming of research. When I came back home, I met Maurizio (whose lectures were the most exciting I ever had), and thanks to him and Serge I could participate in VLDB as a volunteer (Rome, Sept. 2001). How could someone resist loving research after such a wonderful conference? Then, I finished my exams and Maurizio supported me in returning to I.N.R.I.A. for a second internship (my final project, which led to my graduation thesis). On my return, I started collaborating with Maurizio, and he made me love database theory and data integration issues so much that I chose to start my PhD route... Thanks to a European initiative, and thanks to both my advisors, who had to fight against Italian and French bureaucracy, I had the opportunity to do my research jointly between the Roman and Parisian database groups. This was not always so easy... But I was so lucky to find such great researchers, both able to see such an amazing big picture! They were mentors, fathers and friends. No word can express how much I would like to thank you both, Maurizio and Serge. I can only say once again: Grazie - Merci (as I am now used to concluding all my research talks). I will miss so much being such a favoured PhD student! Of course, these acknowledgements cannot end without thanking my sweet husband Mario, and my family. Both have been so patient, understanding... and, above all, they have always been with me. I love you, and always will.


Contents

Part I: Antechamber
1 Theoretical foundations of DIS
   Logical framework
   Consistency of a DIS
   Query answering over DIS
   Updates over DIS
   Relationship with databases with incomplete information
2 State of the art of DIS
   Commercial data integration tools
   Global picture of the state of the art
   Main related DIS
      LAV approach
      GAV approach
      GLAV approach

Part II: Ontology-based DIS
3 The language DL-Lite_FRS
   DL-Lite_FRS expressions
   DL-Lite_FRS TBox
   DL-Lite_FRS ABox
   DL-Lite_FRS knowledge base
   Query language
   DL-Lite_A
4 DL-Lite_A reasoning
   Storage of a DL-Lite_A ABox
   Preliminaries
      Minimal model for a DL-Lite_A ABox
      Canonical interpretation
      Closure of negative inclusions
   Satisfiability of a DL-Lite_A KB
      Foundations of the algorithm for satisfiability
      Satisfiability algorithm
   Query answering over DL-Lite_A KB
      Foundations of query answering algorithm
      Query answering algorithm
5 Consistency and Query Answering over Ontology-based DIS
   DL-Lite_A ontology-based DIS
      Linking data to DL-Lite_A objects
      Logical framework for DL-Lite_A DIS
   Overview of consistency and query answering method
      The notion of virtual ABox
      A naive bottom-up approach
      A top-down approach
   Relevant notions from logic programming
   DL-Lite_A DIS consistency and query answering
      Modularizability
      Consistency algorithm
      Query answering algorithm
      Computational complexity
6 Updates of Ontologies at the Instance Level
   The DL-Lite_FS language
   Instance-level ontology update
   Computing updates in DL-Lite_FS ontologies

Part III: XML-based DIS
7 The setting
   Data model
   Tree Type Constraints and schema language
   Prefix Queries
8 XML-based DIS
   XML DIS logical framework
      Identification
   XML DIS consistency
   XML DIS query answering
      Lower-bound for query answering under exact mappings
   Incomplete trees
      Query answering using incomplete trees
   Query answering algorithms
      Algorithm under VKR and no key constraints
      Algorithm under Id_G sound and complete

Conclusion

Bibliography


Part I

Antechamber


Data integration is a huge area of research concerned with the problem of combining data residing at heterogeneous, autonomous and distributed data sources, and providing the user with a unified virtual view of all this data. Today's fast and continuous growth of large business organizations, often deriving from mergers of smaller enterprises, creates an increasing need to integrate and share large amounts of data coming from a number of heterogeneous and distributed data sources. Similar needs arise in other applications, such as information systems for administrative organizations, life sciences research, and many others. Moreover, it is not infrequent that different parts of the same organization adopt different systems to produce and maintain critical data. Clearly, data integration is a challenge in all these kinds of situations. Furthermore, it has become even more attractive thanks to the ubiquitous spread of the World Wide Web and the access to information it provides. Hence, during the last decade, research and business interest has migrated from DataBase Management Systems, DBMS (Codd, 1970s [37]), to Data Integration Systems (DIS). Whereas the former makes a single local data source accessible through a schema, the latter offer the necessary framework to combine the data from a set of heterogeneous and autonomous sources through a so-called global schema (or mediated schema). Thus, the global schema does not contain data by itself, but provides a reconciled, integrated and virtual view of the underlying sources, which in contrast contain the actual data. We stress that, since the global schema acts as the interface through which the user accesses the data, the choice of the language for expressing and querying such a schema is crucial.
In particular, whereas research on the topic has already produced several DIS, rather few of them strike an appropriate trade-off between the expressive power of the languages for specifying the global schema and querying the system, and the efficiency of query answering. Nevertheless, both these aspects deserve to be considered simultaneously. Indeed, the ability to express a rich set of semantic constraints over the global schema becomes more and more crucial as soon as one wants to use even basic conceptual modeling constructs in an application. On the other hand, offering an expressive query language and allowing for efficient query answering over typically large amounts of data are obvious requirements for such systems. In this thesis, we focus on the study of hierarchical DIS, where the global schema acts as a client of the data sources, as opposed to Peer-to-Peer DIS, where each schema acts both as a client and a server for other DIS. In particular, motivated by the challenges discussed above, we investigate both structured and semi-structured data integration, in the two major contexts of ontology-based data integration and XML-based data integration. On the one hand, ontology-based DIS are characterized by a

global schema described at the intensional level of an ontology, i.e., a shared conceptualization of a domain of interest. The main issue here is that reasoning over typical ontology languages is extremely costly with respect to the size of the data. Notably, we propose a setting where answering queries over the ontology-based DIS is LOGSPACE in data complexity. On the other hand, XML-based DIS are characterized by an expressive global schema. This is a novel setting, not much investigated yet. The main issue here concerns the presence of a significant set of integrity constraints expressed over the schema, and the concept of node identity, which requires particular attention when data come from autonomous data sources. In both contexts, our contribution consists in formally addressing the following issues.

The modeling issue, which requires providing the user with all that is needed for modeling the DIS. More precisely, the user is given (i) a language for specifying the global schema, (ii) a language for specifying the set of source schemas, and (iii) a formalism for specifying the relationship between the data at the sources and the elements of the global schema.

The query answering issue, which is concerned with the basic service offered by a DIS, namely the ability to answer queries posed over the DIS global schema. We provide an appropriate query language and algorithms to answer queries posed to the DIS. Also, we study the complexity of the problem in both contexts, under a variety of assumptions on the DIS specification.

Since sources are in general autonomous, we also investigate the problem of detecting inconsistencies among data sources, a problem which is most of the time ignored in DIS research, resulting in a quite unrealistic setting. Finally, we begin the investigation of updates over DIS, in the context of ontology-based DIS.
This concerns the problem of accepting updates expressed in terms of the global schema, aiming at reflecting them by changes at the source data level. This is the first investigation we are aware of that goes in this challenging direction. Our research has been carried out under the joint supervision of the Department of Computer Science of the University of Rome La Sapienza and the GEMO INRIA-Futurs Project, resulting from the merger of the INRIA-Rocquencourt Verso Project and the IASI group of the University of Paris-Sud. The thesis is organized as follows. The first part serves as an introduction to the theoretical foundations of our approach to DIS, and as a motivation for it. The second part is devoted to ontology-based DIS, while the third part concerns XML-based DIS.

Chapter 1

Theoretical foundations of DIS

In this chapter, we introduce the main theoretical foundations underlying our investigation of DIS [63]. Specifically, we start by setting up a logical framework for data integration. Then we present the main issues related to DIS that will be the focus of our attention, namely consistency checking and query answering. Afterwards, we introduce the problem of performing updates over DIS. Finally, we discuss the relationship between DIS and databases with incomplete information [58].

1.1 Logical framework

As already mentioned, in this work we are interested in studying DIS, whose aim is to combine data residing at different sources, and to provide the user with a unified view of these data. Such a unified view is represented by the global schema. Thus, one of the most important aspects in the design of a DIS is the specification of the correspondence between the data at the sources and the elements of the global schema. Such a correspondence is modeled through the notion of mapping. It follows that the main components of a data integration system are the global schema, the sources, and the mapping. Thus, we formalize a data integration system Π in terms of a triple ⟨G, S, M⟩, where:

- G is the global schema, expressed in a language L_G over an alphabet A_G. The alphabet comprises a symbol for each element of G (i.e., a relation if G is relational, a concept or a role if G is a Description Logic, a label if G is an XML DTD, etc.).
- S is the source schema, expressed in a language L_S over an alphabet A_S. The alphabet A_S includes a symbol for each element of the sources.
- M is the mapping between G and S, consisting of a set of assertions M_i, each having the form (q_S, q_G, as) or (q_G, q_S, as), where q_S and q_G are two queries of the same arity, respectively over the source schema S and over the global schema G, and as may assume the value sound,

complete or exact. Queries q_S are expressed in a query language L_{M,S} over the alphabet A_S, and queries q_G are expressed in a query language L_{M,G} over the alphabet A_G. The value as models the accuracy of the mapping. Note that the definition above has been taken from [63], and it is general enough to capture all approaches in the literature, including in particular the DIS considered in this thesis.

We call database a set of collections of data. We say that a source database (also referred to as a set of data sources) D = {D_1, ..., D_m} conforms to a schema S = {S_1, ..., S_m} if D_i is an instance of S_i for i = 1, ..., m (where clearly the notion of D_i being an instance of S_i depends on the language L_S for expressing S). Moreover, we call global database an instance of the global schema G¹ over a domain Γ. Thus, given a set of sources D conforming to S, we call the set of legal databases for Π w.r.t. D, denoted sem(Π, D), the set of databases B such that:

- B is a global database, and
- B satisfies the mapping M w.r.t. D.

Clearly, the notion of B satisfying M w.r.t. D depends on the semantics of the mapping assertions. Intuitively, the assertion (q_S, q_G, as) means that the concept represented by the query q_S over the sources D corresponds to the concept in the global schema represented by the query q_G, with the accuracy specified by as. Formally, let q be a query of arity n and DB a database. We denote with q^DB the set of n-tuples in DB that satisfy q. Then, given a set of data sources D conforming to S and a global database B, we say that B satisfies M w.r.t. D if for each M_i in M of the form (q_S, q_G, as) we have that:

- if as = sound, then q_G^B ⊇ q_S^D;
- if as = complete, then q_G^B ⊆ q_S^D;
- if as = exact, then q_G^B = q_S^D.

Typically, sources in DIS are considered sound. This will also be the assumption we make in the investigation of ontology-based DIS.
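The satisfaction conditions above can be sketched directly in code. The following illustrative Python fragment (all names are ours, not from the thesis) models a database as a dictionary from relation names to sets of tuples, and a query as a plain function from a database to a set of answer tuples:

```python
# Illustrative sketch of the mapping-satisfaction check (names are ours).
# A database is a dict: relation name -> set of tuples.
# A query is a function: database -> set of answer tuples.

def satisfies_assertion(q_s, q_g, accuracy, source_db, global_db):
    """Does global_db satisfy one assertion (q_S, q_G, as) w.r.t. source_db?"""
    ans_s = q_s(source_db)   # q_S^D: q_S evaluated over the sources D
    ans_g = q_g(global_db)   # q_G^B: q_G evaluated over the global database B
    if accuracy == "sound":
        return ans_s <= ans_g        # q_G^B contains q_S^D
    if accuracy == "complete":
        return ans_s >= ans_g        # q_G^B is contained in q_S^D
    if accuracy == "exact":
        return ans_s == ans_g        # the two extensions coincide
    raise ValueError(f"unknown accuracy: {accuracy}")

def satisfies_mapping(mapping, source_db, global_db):
    """B satisfies M w.r.t. D iff every assertion in M is satisfied."""
    return all(satisfies_assertion(q_s, q_g, acc, source_db, global_db)
               for (q_s, q_g, acc) in mapping)

# Tiny example: a sound assertion holds even if B has extra tuples.
D = {"emp_src": {("ann",), ("bob",)}}
B = {"Employee": {("ann",), ("bob",), ("carl",)}}
M = [(lambda db: db["emp_src"], lambda db: db["Employee"], "sound")]
ok = satisfies_mapping(M, D, B)   # True: every source tuple appears in B
```

The same B fails the assertion under exact accuracy, since ("carl",) has no counterpart at the sources.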
In contrast, in the XML-based context, we will also study the case of exact mappings, which appear useful when one considers a data source as an authority providing exactly all the information about a certain topic. On the other hand, we do not consider the case of complete mappings, since it appears less interesting in practice. Note that the different forms of mappings have led to the following characterization of the approaches to data integration in the literature [53]: in the Local-As-View (LAV) approach, mappings in M have the form (s, q_G, as), where s is an element of the source schema.

¹ In particular, in this thesis, we consider the case of a global database being a first-order logic model (Δ^I, ·^I) of G, if G is the intensional level of a Description Logic (DL) [21] ontology, or an XML document satisfying G, if G is a DTD provided with a set of integrity constraints.

In the Global-As-View (GAV) approach, they have the form (q_S, g, as), where g is an element of the global schema. In the Global-and-Local-As-View (GLAV) approach, no particular assumption is made on the form of mappings. Clearly, the LAV approach favors the extensibility of the system, since adding a new source simply requires enriching the mapping with a new assertion, without other changes. On the other hand, the GAV approach has a more procedural flavor, since it tells the system how to use the sources to retrieve the data. Before concluding this presentation of the logical framework for data integration, we observe that, no matter what the interpretation of the mapping is, in general several global databases exist that are legal for Π with respect to D. This observation motivates the relationship between data integration and databases with incomplete information [86], which will be discussed in Section 1.5.

1.2 Consistency of a DIS

Given a data integration system Π = ⟨G, S, M⟩ and a set of sources D conforming to S, it may happen that no legal database exists satisfying both the global schema constraints and the mappings w.r.t. D, i.e., sem(Π, D) = ∅. Then, we say that the system is inconsistent w.r.t. D. It is worth noting that this kind of situation is particularly critical, since, as we will see, it makes query answering meaningless. Despite its importance, this situation is often glossed over in data integration systems, or dealt with by means of a-priori and ad-hoc transformations and cleaning procedures applied to data retrieved from the sources (e.g., [44]). Here we address the problem from a more theoretical perspective. In particular, we believe that the first step in dealing with inconsistencies is to detect whether they occur. Thus, we study the problem of deciding whether a system is consistent w.r.t. a set of data sources.
Such a problem can be formulated as follows:

PROBLEM: DIS CONSISTENCY
INPUT: A data integration system Π = ⟨G, S, M⟩, and a set of data sources D conforming to S.
QUESTION: Is there a database B legal for Π w.r.t. D?

In both ontology-based and XML-based DIS, we will study DIS consistency, show it is decidable, examine its complexity, and provide algorithms to solve it. However, we do not consider in this thesis the problem of reconciling the data at the sources, i.e., modifying the data retrieved from the sources so that the system becomes consistent. This is a challenging issue that we intend to address in the future.
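To make the decision problem concrete, here is a deliberately naive Python sketch (all names are ours): consistency is checked by brute force over a finite set of candidate global databases standing in for sem(Π, D). The algorithms developed later in the thesis are symbolic and never enumerate candidates; this only illustrates the definition.

```python
def is_consistent(candidate_global_dbs, satisfies_schema, satisfies_mapping):
    """The system is consistent w.r.t. D iff some candidate B is legal,
    i.e. B satisfies both the global-schema constraints and the mapping."""
    return any(satisfies_schema(B) and satisfies_mapping(B)
               for B in candidate_global_dbs)

# Global constraint: the first component of Person is a key.
key_ok = lambda B: len({p for (p, _) in B["Person"]}) == len(B["Person"])
# Sound mapping: the source fact (1, "ann") must appear in Person.
map_ok = lambda B: (1, "ann") in B["Person"]

candidates = [{"Person": {(1, "ann"), (1, "bob")}},   # violates the key
              {"Person": {(1, "ann")}}]               # legal
consistent = is_consistent(candidates, key_ok, map_ok)   # True
```

If the candidate set contained only the key-violating database, the check would fail, modeling sem(Π, D) = ∅.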

1.3 Query answering over DIS

The basic service offered by a DIS is query answering, i.e., the ability to answer queries that are posed in terms of the global schema G and are expressed in a language L_q over the alphabet A_G. Given a DIS Π = ⟨G, S, M⟩ and a set of data sources D conforming to S, the certain answers q(Π, D) to a query q posed over Π w.r.t. D are the set of tuples t of elements in Γ (i.e., the domain of the instances of G) such that t ∈ q^B for every database B legal for Π w.r.t. D, or equivalently:

q(Π, D) = {t | t ∈ q^B for every B ∈ sem(Π, D)}

Query answering can be tackled under two different forms. Under the so-called recognition form, it is formulated as follows:

PROBLEM: QUERY ANSWERING (RECOGNITION)
INPUT: A consistent data integration system Π = ⟨G, S, M⟩, a set of data sources D conforming to S, a query q, and a tuple t of elements of Γ.
QUESTION: Is t in q(Π, D)?

Other times, query answering assumes a more ambitious form and aims at finding the entire set of certain answers. It is then formulated as follows:

PROBLEM: QUERY ANSWERING (FULL SET)
INPUT: A consistent data integration system Π = ⟨G, S, M⟩, a set of data sources D conforming to S, and a query q.
QUESTION: Find all t such that t ∈ q(Π, D).

As for DIS consistency, we will study DIS query answering under different assumptions, show it is decidable, examine its complexity, and provide algorithms to solve it. Note in particular that both formulations of the query answering problem assume a consistent DIS. Indeed, in this thesis we are not concerned with the problem of answering queries in the presence of mutually inconsistent data sources. One possibility to address such a problem is to follow an approach in the spirit of [62], where the authors advocate the use of an approximate semantics for mappings.
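The definition of certain answers can be illustrated with the same toy encoding, where a finite list of legal databases stands in for the possibly infinite sem(Π, D) (all names are ours):

```python
def certain_answers(q, legal_dbs):
    """Tuples in q^B for EVERY legal database B: the intersection of the
    answer sets over all members of (a finite stand-in for) sem(Π, D)."""
    dbs = list(legal_dbs)
    answers = set(q(dbs[0]))      # assumes at least one legal database
    for B in dbs[1:]:
        answers &= q(B)           # keep only tuples answered by every B
    return answers

# Two legal databases disagree on who is an employee:
B1 = {"Employee": {("ann",), ("bob",)}}
B2 = {"Employee": {("ann",), ("carl",)}}
q = lambda db: db["Employee"]
cert = certain_answers(q, [B1, B2])   # {("ann",)}: the only certain tuple
```

Note how ("bob",) and ("carl",) are possible but not certain answers, which is exactly why an inconsistent DIS (empty sem(Π, D)) makes query answering meaningless: with no legal database, every tuple would be vacuously certain.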
1.4 Updates over DIS

In this section, we introduce write-also DIS, i.e., DIS that allow for performing updates expressed over the global schema. Several approaches to update have been proposed in the literature; see, e.g., [39] for a survey. In particular, different change

operators are appropriate depending on whether the change is a revision [20], i.e., a correction to the actual state of beliefs, or an update [88], reflecting a change in the world. In this section, even though we use the term update, we do not aim at advocating the use of one particular approach. On the contrary, we assume an arbitrary operator. Moreover, we assume to have an update F expressed as a formula in terms of G, which intuitively is sanctioned to be true in the new state, i.e., it is inserted in the updated DIS specification. Thus, given a DIS Π = ⟨G, S, M⟩, a set of data sources D conforming to S, and the update F, once the update operator is applied with F to the set of legal databases for Π w.r.t. D, we obtain a new set of databases, however characterized, reflecting the change F. Note that we are interested in instance-level updates. This means that we assume that the specification of Π is invariant, whereas the update reflects a change that occurs at the sources D. Thus, in particular, we consider an update of Π with a set F of facts having the form g(t), where t is an n-tuple of elements of Γ and g is an element of G, meaning that the change consists in t being an instance of g. We formulate the problem of updating a DIS as follows:

PROBLEM: EXPRESSIBLE UPDATE
INPUT: A consistent data integration system Π = ⟨G, S, M⟩, a set of data sources D conforming to S, and a set of facts F.
QUESTION: Is there D′ such that sem(Π, D′) = sem(Π, D) ◦ F, where ◦ denotes the chosen update operator?

The above formulation is general enough to capture all approaches to update that have been proposed in the literature. However, it raises at least the following considerations. Typically, the user of a DIS is not the owner of the data sources and thus does not have the right to modify their content. This is probably the reason why, as far as we know, DIS update has not been considered yet as an issue.
However, we believe that a DIS should possibly provide the appropriate infrastructure to allow the user to perform an instance-level update without changing the data at the sources. This could be achieved, for instance, by using internal proprietary sources. What if no set of data sources exists solving the update problem formulated above (not even proprietary sources)? As usual, one possibility would be to relax the semantics of the update. Indeed, we might be interested in reasoning, e.g., answering queries, over the DIS resulting from the update. To do so, we do not necessarily need to materialize a new set of data sources; we could instead reason on the original DIS by taking the update into account in a virtual way. In a sense, this is analogous to the distinction between projection via regression vs. progression in reasoning about actions [83]. Both the considerations above have motivated the beginning of our work on DIS update. So far, we have started tackling the problem for ontology-based DIS (cf. Chapter 6).
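The EXPRESSIBLE UPDATE question can likewise be phrased as a purely illustrative brute-force search. In the sketch below (all names are ours), databases are frozensets of ground facts, sem is a function standing in for the DIS semantics, and the insertion-style operator is just one possible choice of update semantics, not the operator studied in Chapter 6:

```python
def apply_update(legal_dbs, facts):
    """One possible update operator: insert the new facts into every
    legal global database (a naive, insertion-style semantics)."""
    return frozenset(B | facts for B in legal_dbs)

def find_expressible_update(sem, candidate_sources, D, facts):
    """Brute-force search for a source database D2 whose legal databases
    coincide with the updated set; return None if no candidate works."""
    target = apply_update(sem(D), facts)
    for D2 in candidate_sources:
        if sem(D2) == target:
            return D2
    return None

# Toy DIS whose only legal database copies the source verbatim:
sem = lambda d: frozenset({d})
D = frozenset({("Employee", "ann")})
F = frozenset({("Employee", "bob")})
found = find_expressible_update(sem, [D, D | F], D, F)   # returns D | F
```

Here the update is expressible because inserting the fact directly at the source realizes exactly the updated semantics; with a richer mapping or schema constraints, no such D′ may exist, which is the case motivating the relaxed, virtual treatment discussed above.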

1.5 Relationship with databases with incomplete information

Before concluding this introductory chapter on the theoretical foundations of our approach to data integration, we briefly discuss the strong connection between DIS and databases with incomplete information. Specifically, a database with incomplete information can be viewed as a set of possible states of the real world. Similarly, given a set of data sources, a DIS represents a set of possible databases. Thus, when a query is posed over a database with incomplete information or a DIS, the problem arises of posing the query over a possibly infinite set of database states. It follows that, in order to solve query answering over a DIS, one possibility is to find a finite representation of the set of possible databases and to provide algorithms to answer queries over such a representation. Indeed, this is the main idea underlying both the works presented in this thesis. Note, in particular, that this approach recalls the one proposed in a landmark paper by Imielinski and Lipski [58], which consists in answering queries over a database with incomplete information by exploiting the notion of representation system. Moreover, interestingly, in [4], the same approach is extended to deal with updates over databases with incomplete information.

Chapter 2

State of the art of DIS

As already discussed, data integration has appeared as a pervasive challenge in the last decade. Such a success recalls the crucial impact of DBMS, proven by the large number of DBMS scattered all around the world. However, while the success of relational DBMS represents a great exception in the usual bottom-up process of emerging technologies, since it had been preceded by a deep understanding and a wide acceptance of the relational model and the related theory, the interest in data integration systems grew contemporaneously in both the business and the research community. In particular, it led to the implementation of systems without a deep understanding of all the intricate issues involved, concerning design time as well as run time aspects [54]. Clearly, it would be unrealistic to aim at being comprehensive while discussing the state of the art of such a huge field. Thus, in this chapter, we start by briefly discussing the commercial solutions to the need for integrating data. Afterwards, we contextualize our contribution within the global picture of the state of the art in the data integration research field. Finally, according to such a global picture, we discuss in more detail the works that are most closely related to our investigation.

2.1 Commercial data integration tools

Recently, some software solutions to the need for integrating data have emerged, suggesting the adoption of a DBMS as a kind of middleware infrastructure that uses a set of software modules, called wrappers, to access heterogeneous data sources [51]. Wrappers hide the native characteristics of each source, masking them under the appearance of a common relational table. Furthermore, their aim is to mediate between the federated database and the sources, mapping the data model of each source to the federated database data model, and also transforming operations over the federated database into requests that the source can handle.
Examples of commercial products following this kind of approach are Oracle Integration [75] and DB2 Information Integrator (DB2II) [74]. Obviously, both are based on the use of the Oracle and IBM DBMS, respectively. Even though remarkable from the point of view of the number of different types of data sources supported, as well as from the point of view of query optimization,

these products are essentially data federation tools that are still far from data integration systems theory as it is by now well-established in the scientific databases community. Indeed, as we argued in [81], they actually allow the user to combine data coming from heterogeneous and autonomous sources, but do not provide the user with a unified view that is (logically) independent of the sources. It is worth noticing, however, that data federation tools can be used as the essential underlying environment on top of which one can build a DIS. In particular, we show in [81] how to implement a DIS based on a relational schema by means of a commercial tool for data federation. In a nutshell, this is obtained by: (i) producing an instance of a federated database through the compilation of a formal DIS specification as formalized in the previous chapter; (ii) translating the user queries posed over the global schema, so as to issue them to the federated database. Even though interesting in order to highlight the mismatch between commercial products and research prototypes currently available, this approach is clearly far from solving the main challenge addressed in this thesis, since it allows only a limited expressive power for the global schema (without constraints) and requires following a GAV approach.

2.2 Global picture of the state of the art

In this section, we aim at giving a global picture of the state of the art in data integration and at contextualizing our contribution with respect to this global picture. From the previous chapter, it follows that a DIS specification depends on the following aspects:

- the data model chosen for the global database;
- the language used to express the global schema, i.e., the set of constraints characterizing it;
- the approach followed to specify the mapping, i.e., GAV, LAV or GLAV;
- the accuracy of the mappings (or, equivalently, of the data sources), i.e.,
sound or exact (as we already argued, complete mappings are less interesting in practice).

Another aspect deserving consideration when classifying DIS is the architectural paradigm used. As already mentioned, in this thesis we focus on hierarchical DIS, where it is possible to clearly distinguish between two different roles played, on one hand, by the global schema, which is accessed by the user and which does not itself contain data, and, on the other hand, by the underlying sources, which contain the actual data. Another paradigm is recently emerging for DIS, as well as for other distributed systems, namely the Peer-To-Peer (P2P) paradigm. Put in an abstract way, P2P DIS are characterized by an architecture consisting of various autonomous nodes (called peers) which hold information, and which are linked to other nodes by means of mappings. Each node therefore provides part of the overall information available from a distributed environment and acts both as a client and as a server in the system, without relying on a single global view. However, in some sense, P2P data integration

systems can be considered as the natural extension of hierarchical data integration systems, since each node of the system may itself be considered as an extended hierarchical DIS that includes, besides the mapping to local data sources, an external mapping to other nodes' schemas.¹ Note that since research in P2P data integration is still quite young, no commercial product has really emerged yet. Fig. 2.1 summarizes the state of the art in data integration. More precisely, it classifies the main integration systems according to the features discussed above, and thus highlights the systems that are closest to our investigation and can therefore be compared with our study. In the next two sections, we describe some of these systems, focusing on those whose global schema is specified by means of (i) a Description Logic (and that can thus be considered as DIS based on the relational model, characterized by a significant set of semantic constraints), and (ii) XML² (and thus a semi-structured data model). It is worth noting that, in Fig. 2.1, we consider neither Data Warehousing Systems nor Data Exchange Systems, which, even though related to DIS, are based on a different form of data interoperability. Indeed, their aim is to export a materialized instance of the global schema, whereas DIS are characterized by a global schema that is virtual. In particular, data exchange is the problem of moving and restructuring data from a (generally unique) data source to a global schema (called target schema), given the specification of the mapping (called source-to-target dependencies) between the source and the target. Data exchange has become an active research topic recently, due to the increased need for exchange of data in various formats, typically in e-business applications [9].
Papers [41, 40] laid the theoretical foundations of exchange of relational data, and several follow-up papers studied various issues in data exchange, such as schema mapping composition [11].

2.3 Main related DIS

We next discuss the DIS that are most comparable to our investigation, e.g., because of the expressivity of their global schema (cf. Fig. 2.1). In particular, we classify such systems on the basis of the approach followed for mapping specification. Note that, despite the great and increasing interest in XML from both business and research, little previous work has addressed XML-based data integration issues as defined and studied here. In contrast, considerable work has addressed XML publishing systems, and some initial work has focused on basic theoretical XML data exchange issues. Both these kinds of work are somehow orthogonal to our investigation since, besides assuming to materialize the global schema, they consider a unique data source. Hence, they were not presented in Fig. 2.1. However, in the XML setting, where not much work has addressed even basic data integration issues, they appear relevant. Thus, we will present some of them.

¹ Clearly, this is only an abstraction, since the possible presence of cycles among peers complicates P2P DIS notably and introduces new challenging issues (see e.g. [28]).
² The reader is assumed to be familiar with the notation and terminology of the relational model [5], XML [2] and DLs [14].

Table 2.1: DIS state of the art

Paradigm      Data model       Constraints              Mapping    Mapping        Example
                                                        approach   accuracy
------------  ---------------  -----------------------  ---------  -------------  --------------------------
Hierarchical  Relational       Inclusions, ...          LAV        sound          Information Manifold [60]
Hierarchical  Relational       Inclusions, ...          GAV        sound          PICSEL [48]
Hierarchical  Relational       Functional, inclusions   GAV        sound          IBIS [24], INFOMIX [64]
Hierarchical  Semi-structured  -                        GAV        sound          TSIMMIS [45]
Hierarchical  Semi-structured  -                        LAV        exact, sound   [34]
Hierarchical  Object-oriented  Keys                     LAV        sound          STYX [8]
Hierarchical  XML              DTD                      LAV        sound          Agora [73]
Hierarchical  XML              XML Schema types         GLAV       sound          [90]
                               and functional ...
P2P           Relational       Keys, foreign keys       GLAV       sound          [32]
P2P           XML              -                        GLAV       exact, sound   Piazza [55]
P2P           XML              Keys                     GLAV       exact, sound   ActiveXML [1]

LAV approach

Information Manifold

Information Manifold (IM) [67] is a DIS developed at AT&T, based on the CARIN Description Logic [66]. CARIN combines a Description Logic allowing for the expression of disjunction of concepts and role number restrictions, with function-free Horn rules. Thus, IM handles the presence of inclusion dependencies over the global schema, and uses conjunctive queries as the language for querying the system and for specifying sound LAV mappings. The main distinguishing feature of IM is the use of the bucket algorithm for query answering. In order to illustrate it, we first recall that in LAV the mappings between the sources and the global schema are described as a set of views over the global schema. Thus, query processing amounts to finding a way to answer a query posed over a database schema using a set of views over the same schema. This problem, called answering queries using views, is widely studied in the literature, since it has applications in many areas (see e.g. [53] for a survey). The most common approach proposed to deal with answering queries using views is by means of query rewriting. In query rewriting, a query and a set of view definitions over a database schema are provided, and the goal is to reformulate the query into an expression, the rewriting, whose evaluation over the view extensions supplies the answer to the query. Thus, query answering via query rewriting is divided into two steps: the first consists of reformulating the query in terms of the given query language over the alphabet of the views (possibly augmented with auxiliary predicates), while the second evaluates the rewriting over the view extensions. Clearly, the set of available sources may in general not store all the data needed to answer a user query, and therefore the goal is to find a rewriting that provides the maximal set of answers that can be obtained from the views.
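The first, bucket-style step of such a reformulation can be illustrated with a small hypothetical sketch (the encoding of queries and views as lists of predicate names is invented for this example, and the containment check that validates each candidate rewriting is omitted): for every query subgoal, collect the views mentioning its predicate, then combine one view per subgoal.

```python
# Bucket construction for LAV rewriting, over a toy encoding:
# a query is a list of subgoal predicate names, a view is the set of
# predicates occurring in its body.

from itertools import product

def buckets(query_subgoals, views):
    """views: dict view_name -> set of predicates in the view body.
    One bucket per query subgoal, holding the views that mention it."""
    return [
        [v for v, preds in views.items() if g in preds]
        for g in query_subgoals
    ]

def candidate_rewritings(query_subgoals, views):
    """Cartesian product of the buckets: each combination picks one view
    per query subgoal and is a candidate rewriting to be checked."""
    return list(product(*buckets(query_subgoals, views)))

# Query q(x) :- teaches(x, y), course(y); hypothetical views:
views = {
    "v1": {"teaches"},            # exports who teaches something
    "v2": {"course"},             # exports the courses
    "v3": {"teaches", "course"},  # joins the two
}
cands = candidate_rewritings(["teaches", "course"], views)
# Buckets: [v1, v3] for teaches and [v2, v3] for course -> 4 candidates.
```

In a full implementation, each candidate would then be tested for containment in the original query before being added to the maximal rewriting.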
The bucket algorithm, presented in [65], is a query rewriting algorithm that is proved to be sound and complete with respect to the problem of answering user queries (under a first-order logic formalization of the system) only in the absence of integrity constraints on the global schema; it is in general not complete when integrity constraints are specified over it.

StyX

According to Table 2.1, StyX [8] is based on the use of an object-oriented global schema describing the intensional level of an ontology as a labeled graph, whose nodes represent concepts and whose edge labels represent either roles (i.e. relationships) between concepts, or inclusion assertions. As for constraints, StyX allows the specification of a set of keys over the global schema. On the other hand, StyX allows the integration of XML data sources. These are described in terms of path-to-path mapping rules that associate paths in the XML source with paths in the global schema. Thus, StyX follows the LAV approach, and addresses the problem of query rewriting in the presence of sound LAV mappings. StyX suggests an appealing way of merging the two parts of this thesis. However, this would require first an analysis of the properties of the StyX query answering algorithm (e.g. completeness), and second a deep understanding of the impact of introducing in

the StyX global schema a set of constraints comparable to ours. This is all the more an issue given that StyX is not concerned with the detection of inconsistencies among data sources.

Agora

Agora [73] is an XML-based DIS whose global schema is specified by means of an XML DTD (without any additional integrity constraints). Moreover, Agora is characterized by a set of sound mappings that follow the LAV approach. More precisely, mappings are defined in terms of an intermediate virtual, generic, relational schema that closely models the generic structure of the XML global schema, rather than in terms of the XML global schema itself. Thus, Agora's query processing technique is based on query rewriting, which is performed via a translation first to the generic relational schema and then by employing traditional relational techniques for answering queries using views. Note that, because of the translation, queries and mappings can be quite complex and hard for a human user to understand and define.

GAV approach

The TSIMMIS Project

TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) is a joint project of Stanford University and the IBM Almaden database research group [36]. It is based on an architecture that presents a hierarchy of wrappers and mediators, in which wrappers convert data from each source into a common data model called OEM (Object Exchange Model), and mediators combine and integrate data exported by wrappers or by other mediators. Hence, the global schema is essentially constituted by the set of OEM objects exported by wrappers and mediators. Mediators are defined in terms of a logical language called MSL (Mediator Specification Language), which is essentially Datalog extended to support OEM objects. OEM is a semistructured and self-describing data model, in which each object has an associated label, a type for the value of the object, and a value (or a set of values).
User queries are posed in terms of objects synthesized at a mediator or directly exported by a wrapper. They are expressed in MSL or in a specific query language called LOREL (Lightweight Object REpository Language), an object-oriented extension of SQL. Each query is processed by a module, the Mediator Specification Interpreter (MSI) [79, 89], consisting of three main components:

- The View Expander, which uses the mediator specification to reformulate the query into a logical plan, by expanding the objects exported by the mediator according to their definitions. The logical plan is a set of MSL rules which refer to information at the sources.

- The Plan Generator, also called Cost-Based Optimizer, which develops a physical plan specifying which queries will be sent to the sources, the order in which they will be processed, and how the results of the queries will be combined in order to derive the answer to the original query.

- The Execution Engine, which executes the physical plan and produces the answer.

The problem of query processing in TSIMMIS in the presence of limitations in accessing the sources is addressed in [68] by devising a more complex Plan Generator comprising three modules: a matcher, which retrieves queries that can process part of the logical plan; a sequencer, which pieces together the selected source queries in order to construct feasible plans; and an optimizer, which selects the most efficient feasible plan. It has to be stressed that in TSIMMIS no global integration is ever performed: each mediator performs integration independently. As a result, for example, a certain concept may be seen in completely different, and even inconsistent, ways by different mediators. This form of integration can be called query-based, since each mediator supports a certain set of queries, i.e., those related to the view it provides.

The IBIS system

The Internet-Based Information System (IBIS) [25] is a tool for the semantic integration of heterogeneous data sources, developed in the context of a collaboration between the University of Rome La Sapienza and CM Sistemi. IBIS adopts innovative solutions to deal with all aspects of a complex data integration environment, including source wrapping, limitations on source access, and query answering under integrity constraints. IBIS uses a relational global schema to query the data at the sources, and is able to cope with a variety of heterogeneous data sources, including data sources on the Web, relational databases, and legacy sources. Each non-relational source is wrapped to provide a relational view on it. Also, IBIS mappings follow the GAV approach, and each source is considered sound.
The system allows for the specification of integrity constraints on the global schema; in addition, IBIS considers the presence of some forms of constraints on the source schemas, in order to perform runtime optimization during data extraction. In particular, key and foreign key constraints can be specified on the global schema, while functional dependencies and full-width inclusion dependencies, i.e., inclusions between entire relations, can be specified on the source schemas. Query processing in IBIS is separated into three phases:

1. the query is expanded to take into account the integrity constraints in the global schema;

2. the atoms in the expanded query are unfolded according to their definition in terms of the mapping, obtaining a query expressed over the sources;

3. the expanded and unfolded query is executed over the retrieved source databases, whose data are extracted by the Extractor module, which retrieves from the sources all the tuples that may be used to answer the original query.

Query unfolding and execution are the standard steps of query processing in GAV data integration systems, while for the expansion phase IBIS makes use of the algorithm presented in [23].

INFOMIX and DIS@DIS

INFOMIX [64] is a semantic integration system that provides solutions for GAV data integration of heterogeneous data sources (e.g., relational, XML, HTML), accessed through relational global schemas over which powerful forms of integrity constraints can be specified (e.g., key, inclusion, and exclusion dependencies), and user queries are specified in a powerful query language (e.g., Datalog). The query answering technique proposed in such a system is based on query rewriting in Datalog enriched with negation and disjunction, under stable model semantics [26, 49]. A setting similar to the one considered in INFOMIX is the one at the basis of the DIS@DIS system [27]. Even if limited in its capability of integrating sources with different data formats (the system actually considers only relational data sources), DIS@DIS also provides mechanisms for the integration of inconsistent data in LAV. Furthermore, w.r.t. the query language considered, INFOMIX and DIS@DIS aim at supporting more general, highly expressive classes of queries (including queries intractable under worst-case complexity).

PICSEL

Similarly to IM, PICSEL is based on CARIN and the use of conjunctive queries. However, PICSEL differs from IM in that mappings follow a rather simplified GAV approach. More precisely, each data source consists of a set of relations, and for each data source there exists a one-to-one mapping from each of its relations to a distinct element of the global schema. In addition, PICSEL takes into account a set of constraints about the content of the sources, expressed as CARIN assertions. Query expansion in CARIN is then used as the core algorithmic tool for query answering in PICSEL.
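As a rough illustration of what an expansion step does (a simplified sketch under invented assumptions — a unary predicate encoding and plain inclusions "every r is an s" — not the algorithm of [23] nor CARIN expansion): an inclusion dependency lets an atom be answered through its sub-relations as well, so expansion turns one conjunctive query into a union of conjunctive queries, whose size can grow exponentially.

```python
# Query expansion under inclusion dependencies, toy encoding:
# a query is a tuple of predicate names; an inclusion (sub, sup) states
# that every tuple of `sub` is also a tuple of `sup`.

def expand(query_atoms, inclusions):
    """Return the union of conjunctive queries obtained by repeatedly
    replacing an atom p with any q such that (q, p) is an inclusion."""
    seen = set()
    frontier = [tuple(query_atoms)]
    while frontier:
        q = frontier.pop()
        if q in seen:
            continue
        seen.add(q)
        for i, p in enumerate(q):
            for sub, sup in inclusions:
                if sup == p:
                    # Answers to `sub` are also answers to `p`.
                    frontier.append(q[:i] + (sub,) + q[i + 1:])
    return seen

# manager <= employee, employee <= person:
ids = [("manager", "employee"), ("employee", "person")]
union = expand(["person"], ids)
# The single-atom query over person expands into three queries.
```

Evaluating the resulting union over the sources (after unfolding each member) is what makes this approach complete in the presence of inclusion dependencies, at the price of the blow-up noted below for PICSEL.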
Thus, query answering in PICSEL is quite efficient, since it is reduced to the evaluation of a union of conjunctive queries over the set of data sources, resulting from the query expansion, which is by itself exponential in the size of the global schema. The main differences with respect to our investigation are as follows. PICSEL does not consider at all the case where the DIS specification is inconsistent. Also, it does not attempt to distinguish between data and objects. Finally, PICSEL mappings are much more restricted than the ones we consider.

Grammar AIG

The Grammar AIG [18] is a formalism allowing the specification of how to integrate SQL data coming from autonomous sources and publish them as an XML document that conforms to a DTD and satisfies a set of integrity constraints very close to the one we also consider. Thus, an AIG evaluation produces a materialized view conforming to a quite expressive global schema. More precisely, an AIG consists of two parts: a grammar and a set of XML constraints. The grammar extends a DTD by associating semantic attributes and semantic rules with element types. The semantic attributes

are used to pass data and control during AIG evaluation. The semantic rules compute the values of the attributes by extracting data from the databases via multi-source SQL queries that constitute the mappings. As a result, the XML document is constructed via a controlled derivation from the grammar and constraints, and is thus guaranteed both to conform to the DTD and to satisfy the constraints. The focus of [18] is on constraint checking, in the sense that, whenever during the generation of the document an attribute does not satisfy a constraint, the compilation of the materialized instance is aborted.

XPeranto and SilkRoute

Both XPeranto [85] and SilkRoute [43] are XML publishing systems that support the definition of XML materialized views of SQL data. Moreover, they both support query answering over such XML views, by using an intermediate representation of views. On the one hand, XPeranto uses an XML Query Graph Model (XQGM) as a view. The XQGM is analogous to a physical execution plan produced by a query optimizer. Nodes in the XQGM represent operations in an algebra (e.g., select, join, unnest, union) and edges represent the dataflow from one operation to the next. Individual operations may invoke XML-aware procedures for constructing and deconstructing XML values, which gives XPeranto a procedural flavor. This captures well the relationship between XQuery expressions and complex SQL expressions, but it may produce an XQGM that cannot be composed with another XQuery query, and thus does not support arbitrary query answering. On the contrary, SilkRoute uses a view forest as the intermediate abstract representation of views expressed by means of XQuery, which is entirely declarative and thus can be composed with any XQuery query.
As a consequence, the two representations are somehow complementary: declarative view forests are appropriate for front-end query composition, whereas the procedural XQGM may be better for back-end SQL generation.

GLAV approach

XML data exchange: basic theoretical issues

In the same spirit as our work is the study presented in [12], where the authors start looking into the basic properties of XML data exchange, where the target schema is a DTD. Specifically, they define XML data exchange settings in which source-to-target dependencies refer to the hierarchical structure of the data. They investigate the consistency problem, which, in the case of data exchange, is the problem of deciding whether there exists an instance of the target schema satisfying both the source-to-target dependencies and the DTD, and determine its exact complexity. Moreover, they identify data exchange settings over which query answering over the target schema is tractable, and those over which it is coNP-complete, depending on the classes of regular expressions used in DTDs. Finally, for all tractable cases they provide PTIME algorithms that compute target XML documents over which queries can be answered.
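The DTD side of this consistency problem can be pictured with a toy sketch (the mini-DTD and its encoding are invented for illustration; real DTD validation is more involved): a DTD content model is a regular expression over child element names, so checking a node's children reduces to regular-expression matching — which is exactly why the complexity results above depend on the class of regular expressions allowed in the DTD.

```python
# Checking one element's children against a DTD content model,
# using Python's re module on a comma-joined child sequence.

import re

# Hypothetical mini-DTD: book -> title, author+, year?
dtd = {
    "book": r"title(,author)+(,year)?",
}

def conforms(element, children, dtd):
    """True iff the child sequence matches the element's content model."""
    model = dtd[element]
    return re.fullmatch(model, ",".join(children)) is not None

ok = conforms("book", ["title", "author", "author", "year"], dtd)
bad = conforms("book", ["title", "year"], dtd)  # no author: violates the +
```

For consistency checking in data exchange, the question is harder: one must decide whether *some* target document can simultaneously satisfy all such content models and the tuples forced by the source-to-target dependencies.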


More information

A View Integration Approach to Dynamic Composition of Web Services

A View Integration Approach to Dynamic Composition of Web Services A View Integration Approach to Dynamic Composition of Web Services Snehal Thakkar, Craig A. Knoblock, and José Luis Ambite University of Southern California/ Information Sciences Institute 4676 Admiralty

More information

Piazza: Data Management Infrastructure for Semantic Web Applications

Piazza: Data Management Infrastructure for Semantic Web Applications Piazza: Data Management Infrastructure for Semantic Web Applications Alon Y. Halevy Zachary G. Ives Peter Mork Igor Tatarinov University of Washington Box 352350 Seattle, WA 98195-2350 {alon,zives,pmork,igor}@cs.washington.edu

More information

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks Ramaswamy Chandramouli National Institute of Standards and Technology Gaithersburg, MD 20899,USA 001-301-975-5013 chandramouli@nist.gov

More information

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration DW Source Integration, Tools, and Architecture Overview DW Front End Tools Source Integration DW architecture Original slides were written by Torben Bach Pedersen Aalborg University 2007 - DWML course

More information

Consistent Answers from Integrated Data Sources

Consistent Answers from Integrated Data Sources Consistent Answers from Integrated Data Sources Leopoldo Bertossi 1, Jan Chomicki 2 Alvaro Cortés 3, and Claudio Gutiérrez 4 1 Carleton University, School of Computer Science, Ottawa, Canada. bertossi@scs.carleton.ca

More information

A first step towards modeling semistructured data in hybrid multimodal logic

A first step towards modeling semistructured data in hybrid multimodal logic A first step towards modeling semistructured data in hybrid multimodal logic Nicole Bidoit * Serenella Cerrito ** Virginie Thion * * LRI UMR CNRS 8623, Université Paris 11, Centre d Orsay. ** LaMI UMR

More information

CHAPTER 7 GENERAL PROOF SYSTEMS

CHAPTER 7 GENERAL PROOF SYSTEMS CHAPTER 7 GENERAL PROOF SYSTEMS 1 Introduction Proof systems are built to prove statements. They can be thought as an inference machine with special statements, called provable statements, or sometimes

More information

Logical and categorical methods in data transformation (TransLoCaTe)

Logical and categorical methods in data transformation (TransLoCaTe) Logical and categorical methods in data transformation (TransLoCaTe) 1 Introduction to the abbreviated project description This is an abbreviated project description of the TransLoCaTe project, with an

More information

Data Integration Over a Grid Infrastructure

Data Integration Over a Grid Infrastructure Hyper: A Framework for Peer-to-Peer Data Integration on Grids Diego Calvanese 1, Giuseppe De Giacomo 2, Maurizio Lenzerini 2, Riccardo Rosati 2, and Guido Vetere 3 1 Faculty of Computer Science, Free University

More information

Data exchange. L. Libkin 1 Data Integration and Exchange

Data exchange. L. Libkin 1 Data Integration and Exchange Data exchange Source schema, target schema; need to transfer data between them. A typical scenario: Two organizations have their legacy databases, schemas cannot be changed. Data from one organization

More information

Schema Mediation in Peer Data Management Systems

Schema Mediation in Peer Data Management Systems Schema Mediation in Peer Data Management Systems Alon Y. Halevy Zachary G. Ives Dan Suciu Igor Tatarinov University of Washington Seattle, WA, USA 98195-2350 {alon,zives,suciu,igor}@cs.washington.edu Abstract

More information

Principles of Distributed Database Systems

Principles of Distributed Database Systems M. Tamer Özsu Patrick Valduriez Principles of Distributed Database Systems Third Edition

More information

Requirements for Context-dependent Mobile Access to Information Services

Requirements for Context-dependent Mobile Access to Information Services Requirements for Context-dependent Mobile Access to Information Services Augusto Celentano Università Ca Foscari di Venezia Fabio Schreiber, Letizia Tanca Politecnico di Milano MIS 2004, College Park,

More information

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML?

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML? CS2Bh: Current Technologies Introduction to XML and Relational Databases Spring 2005 Introduction to Databases CS2 Spring 2005 (LN5) 1 Why databases? Why not use XML? What is missing from XML: Consistency

More information

XML Interoperability

XML Interoperability XML Interoperability Laks V. S. Lakshmanan Department of Computer Science University of British Columbia Vancouver, BC, Canada laks@cs.ubc.ca Fereidoon Sadri Department of Mathematical Sciences University

More information

Data Integration. May 9, 2014. Petr Kremen, Bogdan Kostov (petr.kremen@fel.cvut.cz, bogdan.kostov@fel.cvut.cz)

Data Integration. May 9, 2014. Petr Kremen, Bogdan Kostov (petr.kremen@fel.cvut.cz, bogdan.kostov@fel.cvut.cz) Data Integration Petr Kremen, Bogdan Kostov petr.kremen@fel.cvut.cz, bogdan.kostov@fel.cvut.cz May 9, 2014 Data Integration May 9, 2014 1 / 33 Outline 1 Introduction Solution approaches Technologies 2

More information

DLDB: Extending Relational Databases to Support Semantic Web Queries

DLDB: Extending Relational Databases to Support Semantic Web Queries DLDB: Extending Relational Databases to Support Semantic Web Queries Zhengxiang Pan (Lehigh University, USA zhp2@cse.lehigh.edu) Jeff Heflin (Lehigh University, USA heflin@cse.lehigh.edu) Abstract: We

More information

Semantic Search in Portals using Ontologies

Semantic Search in Portals using Ontologies Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

More information

Integration of Distributed Healthcare Records: Publishing Legacy Data as XML Documents Compliant with CEN/TC251 ENV13606

Integration of Distributed Healthcare Records: Publishing Legacy Data as XML Documents Compliant with CEN/TC251 ENV13606 Integration of Distributed Healthcare Records: Publishing Legacy Data as XML Documents Compliant with CEN/TC251 ENV13606 J.A. Maldonado, M. Robles, P. Crespo Bioengineering, Electronics and Telemedicine

More information

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)? Database Indexes How costly is this operation (naive solution)? course per weekday hour room TDA356 2 VR Monday 13:15 TDA356 2 VR Thursday 08:00 TDA356 4 HB1 Tuesday 08:00 TDA356 4 HB1 Friday 13:15 TIN090

More information

Logical Foundations of Relational Data Exchange

Logical Foundations of Relational Data Exchange Logical Foundations of Relational Data Exchange Pablo Barceló Department of Computer Science, University of Chile pbarcelo@dcc.uchile.cl 1 Introduction Data exchange has been defined as the problem of

More information

[Refer Slide Time: 05:10]

[Refer Slide Time: 05:10] Principles of Programming Languages Prof: S. Arun Kumar Department of Computer Science and Engineering Indian Institute of Technology Delhi Lecture no 7 Lecture Title: Syntactic Classes Welcome to lecture

More information

Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Join in Database Federation Efficient Query Optimization for Distributed Join in Database Federation by Di Wang A Thesis Submitted to the Faculty of the Worcester Polytechnic Institute In partial fulfillment of the requirements for

More information

How To Develop Software

How To Develop Software Software Engineering Prof. N.L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture-4 Overview of Phases (Part - II) We studied the problem definition phase, with which

More information

A Document Management System Based on an OODB

A Document Management System Based on an OODB Tamkang Journal of Science and Engineering, Vol. 3, No. 4, pp. 257-262 (2000) 257 A Document Management System Based on an OODB Ching-Ming Chao Department of Computer and Information Science Soochow University

More information

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model CS2Bh: Current Technologies Introduction to XML and Relational Databases Spring 2005 The Relational Model CS2 Spring 2005 (LN6) 1 The relational model Proposed by Codd in 1970. It is the dominant data

More information

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya

Chapter 6 Basics of Data Integration. Fundamentals of Business Analytics RN Prasad and Seema Acharya Chapter 6 Basics of Data Integration Fundamentals of Business Analytics Learning Objectives and Learning Outcomes Learning Objectives 1. Concepts of data integration 2. Needs and advantages of using data

More information

Constraint-based Query Distribution Framework for an Integrated Global Schema

Constraint-based Query Distribution Framework for an Integrated Global Schema Constraint-based Query Distribution Framework for an Integrated Global Schema Ahmad Kamran Malik 1, Muhammad Abdul Qadir 1, Nadeem Iftikhar 2, and Muhammad Usman 3 1 Muhammad Ali Jinnah University, Islamabad,

More information

Data Integration Hub for a Hybrid Paper Search

Data Integration Hub for a Hybrid Paper Search Data Integration Hub for a Hybrid Paper Search Jungkee Kim 1,2, Geoffrey Fox 2, and Seong-Joon Yoo 3 1 Department of Computer Science, Florida State University, Tallahassee FL 32306, U.S.A., jungkkim@cs.fsu.edu,

More information

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel

More information

Technologies for a CERIF XML based CRIS

Technologies for a CERIF XML based CRIS Technologies for a CERIF XML based CRIS Stefan Bärisch GESIS-IZ, Bonn, Germany Abstract The use of XML as a primary storage format as opposed to data exchange raises a number of questions regarding the

More information

Extending XACML for Open Web-based Scenarios

Extending XACML for Open Web-based Scenarios Extending XACML for Open Web-based Scenarios Claudio A. Ardagna 1, Sabrina De Capitani di Vimercati 1, Stefano Paraboschi 2, Eros Pedrini 1, Pierangela Samarati 1, Mario Verdicchio 2 1 DTI - Università

More information

Query Management in Data Integration Systems: the MOMIS approach

Query Management in Data Integration Systems: the MOMIS approach Dottorato di Ricerca in Computer Engineering and Science Scuola di Dottorato in Information and Communication Technologies XXI Ciclo Università degli Studi di Modena e Reggio Emilia Dipartimento di Ingegneria

More information

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents M. Anber and O. Badawy Department of Computer Engineering, Arab Academy for Science and Technology

More information

CSE 233. Database System Overview

CSE 233. Database System Overview CSE 233 Database System Overview 1 Data Management An evolving, expanding field: Classical stand-alone databases (Oracle, DB2, SQL Server) Computer science is becoming data-centric: web knowledge harvesting,

More information

Demonstrating WSMX: Least Cost Supply Management

Demonstrating WSMX: Least Cost Supply Management Demonstrating WSMX: Least Cost Supply Management Eyal Oren 2, Alexander Wahler 1, Bernhard Schreder 1, Aleksandar Balaban 1, Michal Zaremba 2, and Maciej Zaremba 2 1 NIWA Web Solutions, Vienna, Austria

More information

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT

Journal of Information Technology Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION ABSTRACT Journal of Information Technology Management ISSN #1042-1319 A Publication of the Association of Management SIGNS OF IT SOLUTIONS FAILURE: REASONS AND A PROPOSED SOLUTION MAJED ABUSAFIYA NEW MEXICO TECH

More information

Navigational Plans For Data Integration

Navigational Plans For Data Integration Navigational Plans For Data Integration Marc Friedman University of Washington friedman@cs.washington.edu Alon Levy University of Washington alon@cs.washington.edu Todd Millstein University of Washington

More information

Wee Keong Ng. Web Data Management. A Warehouse Approach. With 106 Illustrations. Springer

Wee Keong Ng. Web Data Management. A Warehouse Approach. With 106 Illustrations. Springer Sourav S. Bhowmick Wee Keong Ng Sanjay K. Madria Web Data Management A Warehouse Approach With 106 Illustrations Springer Preface vii 1 Introduction 1 1.1 Motivation 2 1.1.1 Problems with Web Data 2 1.1.2

More information

Distributed Database for Environmental Data Integration

Distributed Database for Environmental Data Integration Distributed Database for Environmental Data Integration A. Amato', V. Di Lecce2, and V. Piuri 3 II Engineering Faculty of Politecnico di Bari - Italy 2 DIASS, Politecnico di Bari, Italy 3Dept Information

More information

Question Answering and the Nature of Intercomplete Databases

Question Answering and the Nature of Intercomplete Databases Certain Answers as Objects and Knowledge Leonid Libkin School of Informatics, University of Edinburgh Abstract The standard way of answering queries over incomplete databases is to compute certain answers,

More information

Data Integration for XML based on Semantic Knowledge

Data Integration for XML based on Semantic Knowledge Data Integration for XML based on Semantic Knowledge Kamsuriah Ahmad a, Ali Mamat b, Hamidah Ibrahim c and Shahrul Azman Mohd Noah d a,d Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia,

More information

THE Web is nowadays the world s largest source of

THE Web is nowadays the world s largest source of 940 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 7, JULY 2008 Integrating Data Warehouses with Web Data: ASurvey Juan Manuel Pérez, Rafael Berlanga, María JoséAramburu, and Torben

More information

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems Proceedings of the Postgraduate Annual Research Seminar 2005 68 A Model-based Software Architecture for XML and Metadata Integration in Warehouse Systems Abstract Wan Mohd Haffiz Mohd Nasir, Shamsul Sahibuddin

More information

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Chapter 1: Introduction. Database Management System (DBMS) University Database Example This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information

More information