Query reformulation for an XML-based Data Integration System



Similar documents
Integrating Heterogeneous Data Sources Using XML

Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)

{mcmb, ABSTRACT 1. INTRODUCTION 2. APPROACHES OF IQ IN DATA INTEGRATION SYSTEMS

Translating WFS Query to SQL/XML Query

XML DATA INTEGRATION SYSTEM

Using Ontologies to Enhance Data Management in Distributed Environments

A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS

INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS

Integrating Relational Database Schemas using a Standardized Dictionary

XQuery and the E-xml Component suite

OWL based XML Data Integration

Integration of Heterogeneous Databases based on XML

Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation

BInXS: A Process for Integration of XML Schemata

A Multidatabase System as 4-Tiered Client-Server Distributed Heterogeneous Database System

Constraint-based Query Distribution Framework for an Integrated Global Schema

Technologies for a CERIF XML based CRIS

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Advantages of XML as a data model for a CRIS

RETRATOS: Requirement Traceability Tool Support

THE OPEN UNIVERSITY OF TANZANIA FACULTY OF SCIENCE TECHNOLOGY AND ENVIRONMENTAL STUDIES BACHELOR OF SIENCE IN DATA MANAGEMENT

Caching XML Data on Mobile Web Clients

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609.

Unit 2.1. Data Analysis 1 - V Data Analysis 1. Dr Gordon Russell, Napier University

REPORTS IN INFORMATICS

II. PREVIOUS RELATED WORK

A Hybrid Approach for Ontology Integration

View-based Data Integration

A View Integration Approach to Dynamic Composition of Web Services

Efficient Virtual Data Integration Based on XML

UNIVERSITY OF TRENTO A PEER-TO-PEER DATABASE MANAGEMENT SYSTEM. Albena Roshelova. June Technical Report # DIT

Novel Data Extraction Language for Structured Log Analysis

Component Approach to Software Development for Distributed Multi-Database System

Data Analysis 1. SET08104 Database Systems. Napier University

Chapter 2 Database System Concepts and Architecture

Query Processing in Data Integration Systems

A Data Browsing from Various Sources Driven by the User s Data Models

Improving Query Performance Using Materialized XML Views: A Learning-Based Approach

Chapter 1: Introduction

Data Integration for XML based on Semantic Knowledge

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks

Managing the Evolution of XML-based Mediation Queries

Chapter 7 Data Modeling Using the Entity- Relationship (ER) Model

A Uniform Approach to Workflow and Data Integration

Accessing Data Integration Systems through Conceptual Schemas (extended abstract)

Data Integration. Maurizio Lenzerini. Universitá di Roma La Sapienza

An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials

Integration and Coordination in in both Mediator-Based and Peer-to-Peer Systems

CSE 132A. Database Systems Principles

An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams

Using Database Metadata and its Semantics to Generate Automatic and Dynamic Web Entry Forms

A View Integration Approach to Dynamic Composition of Web Services

AN AI PLANNING APPROACH FOR GENERATING BIG DATA WORKFLOWS

Grid Data Integration based on Schema-mapping

Mapping discovery for XML data integration

Web-Based Genomic Information Integration with Gene Ontology

A Workbench for Prototyping XML Data Exchange (extended abstract)

SODDA A SERVICE-ORIENTED DISTRIBUTED DATABASE ARCHITECTURE

XML Programming with PHP and Ajax

Building Geospatial Ontologies from Geographic Database Schemas in Peer Data Management Systems

UPROM Tool: A Unified Business Process Modeling Tool for Generating Software Life Cycle Artifacts

DBMS / Business Intelligence, SQL Server

Lightweight Data Integration using the WebComposition Data Grid Service

Architecture and Implementation of an XQuery-based Information Integration Platform

THE OPEN UNIVERSITY OF TANZANIA FACULTY OF SCIENCE TECHNOLOGY AND ENVIRONMENTAL STUDIES BACHELOR OF SIENCE IN INFORMATION AND COMMUNICATION TECHNOLOGY

A Tool for Generating Relational Database Schema from EER Diagram

Linking BPMN, ArchiMate, and BWW: Perfect Match for Complete and Lawful Business Process Models?

IT2304: Database Systems 1 (DBS 1)

Business Process Configuration with NFRs and Context-Awareness

Consistent Answers from Integrated Data Sources

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

SOLVING SEMANTIC CONFLICTS IN AUDIENCE DRIVEN WEB DESIGN

IT2305 Database Systems I (Compulsory)

Graphical Web based Tool for Generating Query from Star Schema

Visual Interfaces for the Development of Event-based Web Agents in the IRobot System

Data Integration by Bi-Directional Schema Transformation Rules

Flexible Semantic B2B Integration Using XML Specifications

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Optimization of Sequences of XML Schema Modifications - The ROfEL Approach

SQLMutation: A tool to generate mutants of SQL database queries

Integrating Bioinformatic Data Sources over the SFSU ER Design Tools XML Databus

Piazza: Data Management Infrastructure for Semantic Web Applications

Data integration and reconciliation in Data Warehousing: Conceptual modeling and reasoning support

SQL Server for developers. murach's TRAINING & REFERENCE. Bryan Syverson. Mike Murach & Associates, Inc. Joel Murach

On Development of Fuzzy Relational Database Applications

Database Design Overview. Conceptual Design ER Model. Entities and Entity Sets. Entity Set Representation. Keys

Introduction to XML Applications

Xml Mediator and Data Management

Declarative Rule-based Integration and Mediation for XML Data in Web Service- based Software Architectures

IJSER Figure1 Wrapper Architecture

not necessarily strictly sequential feedback loops exist, i.e. may need to revisit earlier stages during a later stage

Grid Data Integration Based on Schema Mapping

FiskP, DLLP and XML

Modeling the User Interface of Web Applications with UML

Using Object And Object-Oriented Technologies for XML-native Database Systems

Applying Model Management to Classical Meta Data Problems

SDMX technical standards Data validation and other major enhancements

A Framework and Architecture for Quality Assessment in Data Integration

Analyzing Triggers in XML Data Integration Systems

Transcription:

Query reformulation for an XML-based Data Integration System Bernadette Farias Lóscio Ceará State University Av. Paranjana, 1700, Itaperi, 60740-000 - Fortaleza - CE, Brazil +55 85 3101.8600 berna@larces.uece.br Thiago Costa Center for Informatics Federal University of Pernambuco P.O. Box 7851 Cidade Universitária, Recife - PE, Brazil, 50732-970 +55 81 2126.8430 tac@cin.ufpe.br Ana Carolina Salgado Center for Informatics Federal University of Pernambuco P.O. Box 7851 Cidade Universitária, Recife - PE, Brazil, 50732-970 +55 81 2126.8430 acs@cin.ufpe.br ABSTRACT The main objective of a mediator-based data integration system is to provide a unified view of several distributed and heterogeneous data sources. This view, called mediation schema, corresponds to a set of elements computed from data available on the local data sources. One of the biggest challenges facing a data integration system consists in answering a user query submitted in terms of the mediation schema given that the data is at the sources. This problem consists mainly in reformulating the query in terms of the source schemas and in integrating the corresponding answers. In this paper we address the problem of query reformulation in the context of an XML-based data integration system. Our solution is based on algorithms that generate a set of source queries considering a given user query and a set of mappings between the mediation schema and the data source schemas. Categories and Subject Descriptors H.2.5 [Heterogeneous Databases]. General Terms Languages. Keywords Data integration, XML, query reformulation, mediation. 1. INTRODUCTION Integrated access to distributed data is an important problem faced in many scientific and commercial domains. One of the main challenging problem in the context of data integration is query processing. In order to answer a query submitted to a data integration system various steps are necessary, including query decomposition, query translation, query optimization and data integration. The complexity of the query execution process depends mainly on the approach adopted to define the mappings between the mediation schema and the source schemas. Two main Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 06, April, 23-27, 2006, Dijon, France. Copyright 2006 ACM 1-59593-108-2/06/0004 $5.00. approaches have been used to specify such mappings: Global as View (GAV) and Local as View (LAV) [12, 6]. In the GAV approach each object of the global schema is expressed as a view (i.e. a query) on the data sources. On the other hand, in the LAV approach, mediation mappings are defined in an opposite way; each object in a given source is defined as a view on the global schema. In this paper we address the problem of how to process a query submitted to a data integration system based on the GAV approach. The query execution process was developed as part of an XML-based data integration system, called Integra, proposed in [8]. We focus our discussion on the problem of how to decompose a user query into a set of remote data source queries using a set of mediation mappings. The remainder of this paper is organized as follows. Section 2 provides some definitions. Section 3 describes the query execution process. Section 4 explains our approach to query reformulation. Section 5 presents some implementation issues. Section 6 discusses some related works and Section 7 presents our conclusions. 2. SOME DEFINITIONS To facilitate the query execution process and other relevant tasks of the Integra system a conceptual data model, called X-Entity [9], and a declarative internal language called XEQ (X-Entity Query language) [3] were proposed. In this section we give an overview of the main concepts proposed in the X-Entity model and in the XEQ language. We also present the formalism adopted to represent the mappings between the mediation schema and the source schemas. 2.1 X-Entity Model Integra uses XML as the common data model for data exchange and integration. To represent the mediation schema and the source schemas Integra adopts the XML Schema language. To provide a high-level abstraction for information described in an XML schema a conceptual data model, called X-Entity model [9], was proposed. The X-Entity model is not a new formalism for conceptual modeling, rather it is an extension of the Entity Relationship model [2], i.e., it uses some basic features of the ER model and extends it with some additional ones to better represent XML schemas. The main concept of the X-Entity model is the entity type, which represents the structure of XML elements composed by other elements and attributes. In the X-Entity model, relationship types represent element-subelement relationships and 498

references between elements. In the following, we present the main X-Entity model constructs used to create X-Entity schemas. An X-Entity schema S is denoted by S = (E, R), where E is a set of entity types and R is a set of relationship types. Entity type: an entity type E, denoted by E({A 1,,A n },{R 1,,R m }), is made up of an entity name E, a set of attributes A 1,,A n and a set of relationships R 1,,R m. An entity type represents a set of elements with a complex structure, composed by attributes and other elements (called subelements). An instance of an entity type is a particular element in the source XML document. Each entity type has attributes {A 1,,A n } that describe it. An attribute A i represents either an attribute or a subelement, which is not composed by other elements or attributes. In X-Entity diagrams, entity types are displayed as rectangles. Containment relationship: a containment relationship between two entity types E and E 1, specifies that each instance of E contains instances of E 1. It is denoted by R(E, E 1,(min,max)), where R is the relationship name and (min, max) defines the minimum and the maximum number of instances of E 1 that can be associated with an instance of E. In X-Entity diagrams, containment relationships are displayed as diamond-shaped boxes labeled with contains. The straight lines connecting the relationship with the participating entities are directed from the entity E to the entity E 1. Reference relationship: a reference relationship, denoted by R(E 1, E 2,{A 11,,A 1n }, {A 21,, A 2n }), where R is the name of the relationship, specifies that the entity type E 1 references the entity type E 2. {A 11,, A 1n } and {A 21,, A 2n } represent the referencing attributes between entities of E 1 and E 2 such that the value of A 1i, 1 i n, in any entity of E 1 must match a value of A 2i, 1 i n, in some entity of E 2. In X-Entity diagrams reference relationships are represented as a diamond-shaped box labeled with refers. The straight lines connecting the relationship with the participating entities are directed to the referenced entity. Figure 2 presents three examples of X-Entity schemas. The entity type book m, for example, has two attributes: title m and publisher m, and one containment relationship, which specifies that an instance of book m has at least one subelement and may have unlimited occurrences of the subelement chapter m. The entity type chapter 1 has a reference relationship, which specifies that an instance of chapter 1 may be associated with an instance of book 1. The attributes book 1 _title 1 and title 1 are the referencing attributes between chapter 1 and book 1, respectively. Lóscio [8] have described and implemented the process of converting an XML Schema to an X-Entity schema. 2.2 XEQ Language Queries submitted to a data integration system are, in most cases, defined using declarative query languages or templates. The great majority of such languages are intended to human comprehension, which leads to difficulties in the query decomposition [4] and translation [7]. In order to minimize the complexity of such problems we proposed XEQ (X-Entity Query language), an XMLbased language, used to provide an internal query representation for the Integra system [3]. Considering that the queries submitted to the Integra system are specified according to a mediation schema defined in the X-Entity model, a XEQ expression uses X-Entity concepts to represent the queried data. The simplest conceptual representation of a XEQ expression involves naming one entity (i.e. the parent entity), providing a list of attributes to be returned, as well as using an expression to constrain this entity. We may also define expressions that return values for subentities of the parent entity. A subentity of an entity type E is an entity type E which is associated with E through a containment relationship. Queries may associate parent entities and their referenced entities. A referenced entity type of an entity type E is an entity type E which is associated with E through a reference relationship. The basic element of a XEQ expression is the SELECT element. The SELECT element has one attribute, called ENTITY, whose value is the name of the entity type E being queried. The SELECT element may be composed by other elements, as: i) a REFERENCE element that defines the join condition used to combine related elements from the entity type E and a referenced entity type E, ii) ATTRIBUTE elements that have in their content the name of an attribute of the entity type E whose value is to be retrieved by the query, iii) a WHERE element that filters the data to be retrieved by the query. To illustrate the use of the XEQ internal language we present, in Figure 1, an XQuery query and its corresponding XEQ expression. User query: Retrieve the title, year, the title of the chapters and the publisher name of all books. XQuery <books> { for $b in //book 2 return <book 2> <title 2>$b/title 2 </title 2> <year 2> $b/year 2 </year 2> <chapter 2> $b/chapter 2/chapter_title 2 </chapter 2> for $p in //publisher 2 where $b/ref_publisher 2 = $p/id_publisher 2 return <publisher 2> $p/publisher_name 2 </publisher 2> </book 2> }</books 2> XEQ <SELECT ENTITY="book 2"> <ATTRIBUTE>title 2</ATTRIBUTE> <ATTRIBUTE>year 2</ATTRIBUTE> <SELECT ENTITY="chapter 2"> <ATTRIBUTE> chapter_title 2 </ATTRIBUTE> <SELECT ENTITY="publisher 2"> <ATTRIBUTE> publisher_name 2 </ATTRIBUTE> <REFERENCE NAME="ref_book_publisher"> <MAP ATTRIBUTE="ref_publisher 2"> id_publisher 2 </MAP> </REFERENCE> </EXP> Figure 1. User query example In the example of Figure 1, consider the X-Entity schema S 2 of Figure 2. During the query translation to XEQ the query is parsed and each one of the entities being queried is identified and represented using a SELECT element. In the example above, book 2 is the entity whose values should be retrieved. Since chapter 2 is a subentity of book 2, to retrieve such information it should be used a nested SELECT element. It is important to observe that in the SELECT element associated with the referenced entity publisher 2, there is a MAP element. This element indicates how the join operation between the entity book 2 and the 499

entity publisher 2 should be done in order to retrieve the value of the publisher_name 2 attribute. 2.3 Correspondence assertions Our approach to query reformulation is widely based on correspondence assertions [11] specifying relationships between the mediation schema and the source schemas. In this section, we describe just the correspondence assertions relevant to this paper. Other correspondence assertions may be found in [8]. Definition 1 (Entity correspondence assertion): Let E 1 and E 2 be entity types. The correspondence assertion E 1 E 2 specifies that E 1 and E 2 are semantically equivalent, i.e, they describe the same real world concept and they have the same semantics. Definition 2(Attribute correspondence assertion): Let A 1 be an attribute of the entity type E 1 and A 2 an attribute of the entity type E 2, such that E 1 E 2. The correspondence assertion E 1.A 1 E 2.A 2 specifies that the attributes A 1 and A 2 are semantically equivalent (correspond to the same concept in the real world). Definition 3 (Link): Let X 1 and X 2 be elements of an X-Entity schema (an element can be an entity type, a containment relationship type or an attribute), X 1.X 2 is a link if: i) X 2 is an attribute of the entity type X 1, or ii) X 1 is an entity type participating in the relationship type X 2 (or vice-versa). Definition 4 (Path): If X 1...X n are elements of a schema, such that X i, 1 i n, X i.x i+1 is a link, then X 1.X 2.....X n is a path from X 1 (the path is directed from X 1 to X n ). Definition 5 (Inverse path): If there is a path from X 1 denoted by X 1.X 2.....X n, such that X n is an entity type, then there is a path denoted by (X 1.X 2.....X n ) -1, which is called the inverse path of X 1.X 2.....X n. Definition 6 (Path correspondence assertion): Let P 1 and P 2 be two paths: Case 1: P 1 = X 1.X 2....X n and P 2 = Y 1.Y 2...Y m, where X 1 Y 1. The correspondence assertion P 1 P 2 specifies that the entity types X n and Y m are semantically equivalent. Case 2: P 1 = X 1.X 2....X n and P 2 = (Y 1.Y 2...Y n ) -1, where X 1 Y n. The correspondence assertion P 1 P 2 specifies that the entity types X n and Y 1 are semantically equivalent. To better illustrate our ideas, consider the mediation schema and the source schemas presented in Figure 2. Table 1. Correspondence assertions between the mediation schema S med and the source schemas S 1 and S 2 CA 1: book m book 1 CA 2: book m.title m book 1.title 1 CA 3: book m.publisher m book 1.publisher 1 CA 4: chapter m chapter 1 CA 5: chapter m.chapter_title m chapter 1.chapter_title 1 CA 6: book m.book m_chapter m.chapter m (chapter 1.chapter_ref_book 1.book 1) -1 CA 7: book m book 2 CA 8: book m.title m book 2.title 2 CA 9: chapter m chapter 2 CA 10: book m.book m_chapter m.chapter m book 2.book 2_chapter 2.chapter 2 CA 11: chapter m.chapter_title m chapter 2.chapter_title 2 CA 12: book m.publisher m book 2.book 2_ref_publisher 2.publisher 2.publisher_name 2 3. QUERY EXECUTION PROCESS The query execution consists in receiving a user query, expressed in a declarative query language, and in returning a query answer expressed in XML as output. The query execution process [3], described in Figure 3, may be summarized as follows: i) The mediator receives a user query Q and performs the necessary translation and reformulation to produce the set of local XEQ expressions (q 1, q 2,, q n ), ii) Next, the mediator sends these expressions to the corresponding wrappers which translate them to the source query language producing the local queries (q 1, q 2,, q n ), iii) The wrappers receive answers for such queries (r 1, r 2,, r n ) from the data sources, these answers are translated to XML and sent to the mediator and iv) Finally, the mediator integrates the XML answers (r 1, r 2,, r n ) and returns the integrated result R as the answer to the original user query Q. The components responsible for the query execution process are presented in Figure 3 and described next. Mediator Knowledge Base Data Sources Knowledge Base User Query Q Query translator Query reformulator Query interface Data integrator ( q 1, q 2 n ),..., ( r 1, r 2,..., r n ) q Source manager R Mediation schema converter Mediator Query manager Log Cache manager Cache S med title m publisherm S 1 title 1 book m chapter_title m (1,N) contains chapter m chapter_title 1 book 1_title 1 q 1 Wrapper1 Translation Translation XEQ - SQL Relations - XML q 1 r 1 r 1 q 2 r 1 q r n n Wrapper2 Wrappern Translation Translation Translation Translation XEQ - OQL Objects - XML XEQ - XQuery XML - XML q 2 r 2 q n r n publisher1 book 1 refers chapter 1 Relational Database Object Oriented Database XML Documents S 2 title 2 year 2 ref_publisher 2 book 2 (1,N) contains chapter chapter_title 2 2 refers publisher 2 id_publisher 2 publisher_name 2 Figure 2. Mediation schema S med and source schemas S 1 and S 2 Table 1 presents the correspondence assertions (CA) between S med and the source schemas S 1 and S 2. Figure 3. Query execution process The Query Interface takes as input the user query expressed using a declarative query language and it returns an XML document as the answer to this query. The Query Translator receives a query written in a high level query language (XQuery for example) from the query interface and translates this query to the XEQ internal representation. The Query Reformulator takes a XEQ expression as input and produces as output a set of XEQ source expressions. The Source Manager receives as input a set of XEQ source expressions and sends them to the wrapper 500

associated with the corresponding data source. It receives from the wrappers the answers for the local queries already converted to XML. The Mediation Schema Converter takes as input the XML data received from the wrappers and converts them to the mediation schema. This translation is necessary to facilitate the further data integration process. Transformation functions are used to convert data format to be conform to the mediation schema. The Data Integrator takes as input a set of XML documents and transforms them in a single integrated document. 4. QUERY REFORMULATION In this section, we describe the basic algorithm that reformulates a XEQ expression over the mediation schema into a set of XEQ expressions over the source schemas. Let Q be the original user query and X Q be the expression obtained after the translation of Q to the XEQ representation. There are three phases in the algorithm: i) Identification of the mediation entity (E m ) being queried: this entity is specified in the most external SELECT element of the X ii) Identification of data sources relevant to answer the user query: a data source DS i is relevant to answer the user query Q if it contains one or more source entities necessary to compute the mediation entity E m, i.e., source entities semantically equivalent to E m, iii) Reformulation phase: this phase consists in generating a XEQ expression for each one of the data sources relevant to answer Q. This is done using the algorithms presented in Figure 4. Next consider: - T/@ENTITY: value of the ENTITY attribute of the SELECT element T - T/ATTRIBUTE: value of an ATTRIBUTE element of the SELECT element T - T/WHERE: value of a WHERE element of the SELECT element T - A/text( ): retrieves the content of the ATTRIBUTE element A - ECA(S i, E m ): retrieves the entity correspondence assertion C E m E j, where E j S i.e (S i.e is the set of entity types of the source achema S i ) - ACA(S i, A m ): retrieves the attribute correspondence assertion C A m A j, where A j E.A and E S i.e (E.A is the set of attributes of the entity type E ) - X 1 Φ X 2 : concatenation operator that joins two XEQ expressions X 1 and X 2 During the third phase of the query reformulation process the ReformulateQuery algorithm is executed for each one of the relevant data sources {DS 1,, DS n }. The first task of this algorithm consists in reformulating the element (X/Exp/SELECT) that defines the mediation entity E m being queried. To do this, the algorithm calls the procedure ReformulateSelect. This procedure recursively analyses the nested SELECT elements of T (T = X/Exp/SELECT) in order to identify the subentities {E 1,, E n } and attributes {A 1,, A k } of the mediation entity E m such that T/@entity = E m. For each mediation entity E j the algorithm ReformulateSelect obtains the correspondence assertion which specifies how to compute E i over the data source DS i. In the same way for each attribute A j, the algorithm obtains the correspondence assertion which specifies how to compute A j over the data source DS i. The algorithm also analyses WHERE elements to transform the constraints applied on the mediation elements into constraints on the source elements. This transformation is done using the algorithm ReformulateWhere. XEQ elements of the resulting source expression are created by the algorithm AddPathToQuery. Due to space limitations, in this paper we present just the ReformulateQuery and ReformulateSelect algorithms. More information about the other algorithms may be found in [3]. Algorithm ReformulateQuery(X Q, S i) {S i is the X-Entity schema of the data source DS i } T := X Q/Exp/SELECT {Exp is root element of a XEQ expression} L Si := new XEQ expression P := empty path ReformulateSelect(L si, T, S i, P) Return L Si End Algorithm ReformulateSelect(X Si, T, S i, P) {X Si is the XEQ expression corresponding to the data source S i } P:= P + T/@ENTITY NS:= AddPathtoQuery(X Si, ECA(S i, P)) For each A = T/ATTRIBUTE do AddPathtoQuery(EXP LS, ACA(S i, P + A/text())) For each T = T/SELECT do ReformulateSelect(X Si, T, S i, P) If (W = T/WHERE) then X Si:= X Si Φ ReformulateWhere(NS, W, S i, P) End Figure 4. Query reformulation Algorithms To illustrate the query reformulation process consider the user query Q 1 over the mediation schema S med and its corresponding XEQ expression X 1 presented in Figure 5. As specified by the most external SELECT element of X 1 we have that the mediation entity being queried is book m (X 1 /Exp/Select/@ENTITY = book m ). Next, based on the set of correspondence assertions presented in Table 1 the algorithm identifies the relevant data sources to compute book m. According to the correspondence assertions CA 1 (CA 1 :book m book 1 ) and CA 7 (CA 7 :book m book 2 ) the data sources relevant to compute book m are DS 1 and DS 2, since book 1 and book 2 are source entities from the schemas S 1 and S 2 respectively. Therefore, the algorithm ReformulateQuery must be executed for these two data sources. User query Q 1: Retrieve the title, publisher and the title of the chapters all books. X1(XEQ) <SELECT ENTITY="book m"> <ATTRIBUTE> title m </ATTRIBUTE> <ATTRIBUTE> publisher m </ATTRIBUTE> <SELECT ENTITY="chapter m"> <ATTRIBUTE> chapter_title m </ATTRIBUTE> XS 1(XEQ) 1. 2.<SELECT ENTITY="book 1"> 3. <ATTRIBUTE>title 1</ATTRIBUTE> 4. <ATTRIBUTE>publisher 1</ATTRIBUTE> 5. <SELECT ENTITY="chapter!"> 6. <ATTRIBUTE>chapter_title 1</ATTRIBUTE> 7. <REFERENCE NAME="ref_chapter 1_book 1"> 8. <MAP ATTRIBUTE="book 1_title 1"> title 1 </MAP> 9. </REFERENCE> 10. 11. 12. Figure 5. User query In the first step, the algorithm generates the source expression XS 1, presented in Figure 5, for the data source DS 1. The SELECT element of line 2 was generated from the correspondence assertion CA 1. Lines 3 and 4 were generated based on the correspondence assertions CA 2 and CA 3. To compute the subentity chapter m of book m the algorithm created a nested select element (line 5) according to the correspondence assertion CA 6. As chapter 1 is 501

associated with book 1 through a refers relationship then it was used a reference element to identify this restriction. For the data source DS 2 the algorithm Reformulate generates the XEQ expression XS 2 similar to the XEQ expression XS 1. 5. IMPLEMENTATION ISSUES An Integra prototype was implemented allowing the execution of the whole query process: from the submission of a user query to the answer of the integrated results. Wrappers were implemented to translate queries from XEQ to the data sources native language. In the initial version of the prototype, only wrappers for relational databases were implemented. During the tests, we used two real databases that stores medical information: Healthnet 1 and TELEMED 2, both implemented in MySQL. In the current version of the prototype, the mediation schema and the correspondences between the mediation schema and the source schemas are manually defined and stored in the Mediator Knowledge Base. In the Integra prototype, the query is translated to SQL using a XSLT stylesheet, and it is submitted to the data source through a JDBC driver. 6. RELATED WORK The work presented in [4] states and solves the query decomposition problem for XML publishing in a general setting. It allows mixed (XML and relational) storage of proprietary data and exploits redundancies (materialized views, indexes and caches) to enhance performance. Other problem faced during query processing consists in translating data source queries to the native query language. To improve the query translation process, data integration systems use languages based on scripts or templates as the input format for wrappers [7]. There are also systems that use XML query languages as input format for wrappers. The e-xml [5], MARS [4] and XPERANTO [10] systems provide integration of multiple data sources based on XML schemas.automed [1] adopts a functional language, called IQL, to provide a common query language where queries written in high level query languages can be translated into and out of. In our approach, the user query may be specified in a high level query language. This query is translated to XEQ and it can be easily rewritten into a set of local subqueries. As XEQ is an XML-based language we can take advantage of XML benefits as the capability to easily transform XML data into different formats and to easily navigate through its hierarchical structure. 7. CONCLUSION In this paper, we focused on the problem of how to process a user query submitted to an XML-based data integration system. To solve this problem we proposed X-Entity, an ER-based conceptual data model, and XEQ, an XML-based language, whose specification facilitates both query decomposition and translation. The process of decomposing or translating a XEQ expression to other query languages consists in performing a recursive navigation that visits the expression elements and constructs the query result. In this paper we did not consider the problem of generating a query execution plan nor the problem of 1 http://www.nutes.ufpe.br/servicos/health.html 2 Database of a brazilian hospital (Hospital Português Recife) query optimization. As a future work, we intend to study such problems in order to improve our proposed approach. 8. ADDITIONAL AUTHORS Juliano F. de Souza, jsf@cin.ufpe.br, Center for Informatics Federal University of Pernambuco, P.O. Box 7851 Cidade Universitária, Recife - PE, Brazil, 50732-970, +55 81 2126.8430 9. REFERENCES [1] Boyd, M., Kittivoravitkul, S., Lazanitis, C., McBrien, P., Rizopoulos, N. AutoMed: A BAV Data Integration System for Heterogeneous Data Sources. In Proceedings of the 16 th Advanced Information Systems Engineering Conference (CAiSE) (Riga, Latvia, June 7-11, 2004). Lecture Notes in Computer Science, Springer, 2004, 82-97. [2] Chen, P.P. The Entity-Relationship Model: Toward a unified view of data. ACM Transactions on Database Systems, 1 (Mar. 1976), 1-36. [3] Costa, T., A. O Gerenciador de Consultas de um Sistema de Integração de Dados. Master Thesis. Universidade Federal de Pernambuco, Recife, Brazil, 2005. [4] Deutsch, A. and Tannen, V. Reformulation of XML Queries and Constraints. Proceedings of the 9 th International Conference on Database Theory (ICDT) (Siena, Italy, January 8-10, 2003). Lecture Notes in Computer Science, Springer, 2002, 255 241. [5] Gardarin, G., Mensch, A., Tomasic, A. An Introduction to the e-xml Data Integration Suite. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT) (Konstanz, Germany, March 27-31, 2000). Lecture Notes in Computer Science, Springer, 2000, 297-306. [6] Halevy, Y. Theory of answering queries using views. SIGMOD Record, 29 (Dec. 2000), 40-47. [7] Hammer J., Brennig, M., Garcia-Molina, H., Nesterov, S., Vassalos, V. and Yerneni, R. Template based wrappers in the tsimmis system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Tucson, Arizona, May 13-15, 1997). ACM Press, 1997, 532-535. [8] Lóscio, B. F. Managing the Evolution of XML-based Mediation Queries. PhD Thesis. Universidade Federal de Pernambuco, Recife, Brazil, 2003. [9] Lóscio, B. F., Salgado, A. C., Galvão, L. R. Conceptual Modeling of XML Schemas. In Proceedings of the Fifth International Workshop on Web Information and Data Management (WIDM) (New Orleans, Louisiana, November 7-8, 2003). ACM, 2003, 102-105. [10] Shanmugasundaram, J, Kiernan, J., Shekita, E.J., Fan C., Funderburk J. Querying XML Views of Relational Data. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) (Roma, Italy, September 11-14, 2001). Morgan Kaufmann, 2001, 261-270. [11] Spaccapietra, S. and Parent, C. View integration: a step forward in solving structural conflicts. IEEE Transactions on Knowledge and Data Engineering, 6 (Apr. 1994), 258-274. [12] Ullman, J. D. Information integration using logical views. In Proceedings of the 6 th International Conference on Database Theory (ICDT) (Delphi, Greece, January 8-10, 1997). Lecture Notes in Computer Science, Springer, 1997, 19-40. 502