XML-to-SQL Query Mapping in the Presence of Multi-valued Schema Mappings and Recursive XML Schemas

Mustafa Atay, Artem Chebotko, Shiyong Lu, and Farshad Fotouhi

Department of Computer Science, Wayne State University, Detroit, Michigan 48202, USA
{matay, artem, shiyong, fotouhi}@wayne.edu

R. Wagner, N. Revell, and G. Pernul (Eds.): DEXA 2007, LNCS 4653, pp. 603-616, 2007. © Springer-Verlag Berlin Heidelberg 2007

Abstract. Several query mapping algorithms have been proposed to translate XML queries into SQL queries for a schema-based relational XML storage. However, existing query mapping algorithms only support single-valued mapping schemes, in which each XML element type is mapped to exactly one relation, and do not support multi-valued mapping schemes, in which each XML element type can be mapped to multiple relations. In this paper, we propose a generic query mapping algorithm, ID-XMLToSQL, for a schema-based relational XML storage. To the best of our knowledge, our algorithm provides the first generic solution to the XML-to-Relational query mapping problem that is applicable to both single-valued and multi-valued mapping schemes. Moreover, our algorithm also provides an elegant solution to the query mapping problem in the presence of recursive XML schemas and recursive queries. While existing algorithms need special recursion operators, our algorithm only requires the traditional relational operators and thus can work with all relational databases.

1 Introduction

Numerous researchers have proposed using relational databases for storing and querying XML documents in order to benefit from this mature technology. This approach requires algorithms that map XML schemas, documents, and queries into their relational equivalents. An XML-to-SQL query mapping algorithm for a schema-based relational XML storage should respect the underlying XML-to-Relational schema mapping scheme. The XML-to-Relational schema mapping schemes in the literature can be classified into the following two categories:

Single-valued Schema Mappings. In a single-valued schema mapping, an XML element or attribute type is mapped into exactly one relation in the target relational schema. Such a mapping therefore has the characteristics of a function. The Shared schema mapping approach introduced in [1] and the ODTDMap approach introduced in [2] fall into this category.

Multi-valued Schema Mappings. In a multi-valued schema mapping, an XML element or attribute type can be mapped into more than one relation in the target relational schema. Multi-valued schema mappings do not have the characteristics of a function and are thus harder to deal with. The Basic and Hybrid schema mapping approaches proposed in [1] fall into this category.

Although there are several query mapping algorithms for single-valued schema mapping schemes, to the best of our knowledge there is no published query mapping algorithm that supports multi-valued schema mapping schemes. Therefore, in this paper we propose a generic query mapping algorithm that supports both multi-valued and single-valued schema mapping schemes.

Our generic algorithm also provides an elegant solution to the XML-to-Relational query mapping problem in the presence of recursive XML schemas and recursive queries. This problem is identified as an important practical problem in the literature [3,4,5]. Recursive XML schemas are common in practice, as pointed out by [6], in which 35 out of 60 real-world DTDs were found to be recursive. Recursive XML queries, which include the descendant axis //, are also common in practice. The challenge of XML-to-SQL query mapping is that, when there is recursion both in an XML query and in its underlying XML schema, there might be infinitely many paths corresponding to the given recursive XML query. Two elegant algorithms in the literature [4,5] address this issue. These algorithms resolve the recursion within the relational engine by using special SQL operators that are not supported by some RDBMSs. In contrast, we resolve the recursion at the XML schema level without using special SQL operators.

The main contributions of this paper include the following:

1. We propose a generic query mapping algorithm, ID-XMLToSQL, for a schema-based relational XML storage scheme. To the best of our knowledge, our algorithm provides the first generic solution to the XML-to-Relational query mapping problem that is applicable to all relational XML storage mapping schemes proposed in the literature, including both single-valued and multi-valued schema mapping schemes.

2. We propose to convert a cyclic XML schema graph to a directed acyclic graph by unfolding the cycles in the XML schema graph to facilitate the recursive query mapping process. Thus, we can find a finite number of matching paths on the generated acyclic graph for an arbitrary XML query, including recursive ones. Therefore, our proposed query mapping technique can be implemented on any RDBMS, as it does not require special SQL operators to capture the recursion, whereas the existing algorithms need special recursion operators.

Organization: The rest of the paper is organized as follows. Section 2 gives a summary of related work. We motivate generic query mapping in Section 3. Section 4 includes all necessary preliminaries for our generic query mapping algorithm. The outline of our proposed query mapping algorithm ID-XMLToSQL is given in Section 5. We present a performance study of the algorithm ID-XMLToSQL in Section 6. Finally, Section 7 concludes the paper and points out some potential future work.

2 Related Work

In order to query XML data stored in a relational database, one should map the XML queries into relational queries based on the underlying XML-to-Relational schema mapping scheme. Hence, we can split the XML-to-Relational query mapping algorithms into the following two categories based on the underlying schema mapping schemes:

Schema-less Query Mapping. There has been a lot of work on schema-less query mapping [7,8,9,10,11,12,13,14]. In this approach, the XML schema is considered to be missing or is not used, and a generic relational schema is generated for all XML documents. Then, a given XML query is mapped to its relational equivalent using the generic relational schema.

Schema-based Query Mapping. There have been several works on schema-based query mapping [1,15,4,5,16,17,18], where an XML schema is provided and used to generate a good relational schema. The generated relational schemas vary according to the input XML schemas. Therefore, an XML-to-Relational query mapping algorithm should know and respect the underlying XML-to-Relational schema mapping to generate correct and efficient relational queries.

The problem of mapping recursive XML queries in the presence of recursive schemas has been studied in the schema-less query mapping space [8,10]. However, those query mapping algorithms are not applicable to the schema-based query mapping space. Recently, two elegant approaches were proposed in [4,5] to map recursive XML queries to their relational equivalents in the presence of recursive XML schemas. The query mapping algorithm of [4] first derives a query graph for an input path query from the XML schema graph. Then, it partitions the query graph into strongly connected components and generates an SQL query for each component. If a component is recursive, the recursion in this component is captured in the corresponding SQL query by using the WITH construct of SQL 99. The query mapping algorithm of [5] first rewrites a given XPath query into a regular XPath expression, which is capable of capturing recursion both in a DTD and in an XPath query. Furthermore, the authors provide an algorithm for translating regular XPath expressions to relational queries using the least fixpoint (LFP) operator. The LFP operator is used to capture the recursion in the queries.

However, these recursive query mapping algorithms are not generic enough to be used with multi-valued mappings such as Basic and Hybrid introduced in [1]. Moreover, they require special SQL operators, such as the WITH construct of SQL 99 or the LFP operator, which are not supported by some RDBMSs. Our proposed ID-XMLToSQL algorithm overcomes these limitations.

3 Motivation

A generic query mapping algorithm for a schema-based relational XML storage is supposed to work with a general class of XML-to-Relational schema mappings, which can be classified into two main categories: Single-valued Schema Mappings and Multi-valued Schema Mappings. Surprisingly, there is no published XML-to-Relational query mapping algorithm in the schema-based XML storage space that is generic enough to work with multi-valued XML-to-Relational schema mappings.

The recursive query translation algorithm of [4] handles a general class of single-valued XML-to-Relational mappings. The main query translation procedure SQL() in [4] uses the function Annot() to find the relation/column corresponding to an XML element. Neither Annot() nor SQL() supports multi-valued XML-to-Relational schema mappings. Thus, [4] is not generic enough to handle all types of mappings proposed in the literature. While the RegToSQL algorithm proposed in [5] supports a broad class of XPath queries, it still lacks support for multi-valued schema mappings.

A single-valued mapping is a function that returns only one relation for an input XML element/attribute type. The target relation from which to retrieve an XML element or attribute can easily be determined from a single-valued mapping. Thus, single-valued mappings are relatively easy to handle during the query mapping phase. A multi-valued mapping is not a function, since it can return multiple relations for an input XML element/attribute type. This may cause ambiguity when a query mapping algorithm tries to locate the target relation from which to retrieve the data of an XML element type. Hence, a query mapping algorithm based on a multi-valued mapping should be intelligent enough to resolve this possible ambiguity and find the correct relation(s) to access. Thus, it is more challenging to map XML queries to relational queries under multi-valued mapping schemes than under single-valued mapping schemes.

[Fig. 1. A Sample XML Schema Graph. Node labels: A, B1, B2, B3, C, D1, D2, D3, E]

Table 1. Single-valued and Multi-valued Schema Mapping Examples

(A) Single-valued $\sigma$-mapping (Shared)

  Element   Relation
  A         A
  B1        B1
  B2        B2
  B3        B3
  C         C
  D1        D1
  D2        C
  D3        D3
  E         E

(B) Multi-valued $\sigma$-mapping (Hybrid)

  Element   Relation
  A         A
  B1        B1
  B2        B2
  B3        B3
  C         B1, B2, B3
  D1        D1
  D2        B1, B2, B3
  D3        A, B1, B2, B3
  E         E

We use a data structure to store the XML-to-Relational schema mapping information. We call this data structure a $\sigma$-mapping and formally define it in Section 4.1. The $\sigma$-mappings based on the Shared and Hybrid approaches for the XML schema shown in Figure 1 are given in Table 1.A and Table 1.B, respectively. We assume that XML attribute types are mapped to the same relation as their parent element types.

Example 1. If the XPath expression /A/B1/C/D3 is given against the XML schema graph shown in Figure 1, the following is its SQL equivalent under a typical query mapping algorithm that generates an SQL query by joining all the relations along a path:

  Select T4.ID
  From $\sigma(A)$ T1, $\sigma(B1)$ T2, $\sigma(C)$ T3, $\sigma(D3)$ T4
  Where T1.ID=T2.parentID And T2.ID=T3.parentID And T3.ID=T4.parentID

While it is trivial to find the matching relations in this SQL query based on the single-valued $\sigma$-mapping given in Table 1.A, it is not straightforward to find them under the multi-valued $\sigma$-mapping shown in Table 1.B. For instance, it is not clear which relation should be returned for $\sigma(C)$ out of the set {B1, B2, B3} and for $\sigma(D3)$ out of the set {A, B1, B2, B3}. We propose the notion of path-based $\sigma$-mapping ($\sigma_p$-mapping) in Section 4.2 to resolve the ambiguity caused by multi-valued schema mapping schemes with the help of the input path structure and the existing mapping information.
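The ambiguity described in Example 1 can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the dictionaries transcribe Table 1, while the function names `ambiguous_steps` and `naive_sql` are hypothetical and not part of the paper. The naive generator joins one relation per step and refuses to proceed when a step maps to more than one relation.

```python
# Sigma-mappings transcribed from Table 1 (element type -> list of relations).
SHARED = {
    "A": ["A"], "B1": ["B1"], "B2": ["B2"], "B3": ["B3"], "C": ["C"],
    "D1": ["D1"], "D2": ["C"], "D3": ["D3"], "E": ["E"],
}
HYBRID = {
    "A": ["A"], "B1": ["B1"], "B2": ["B2"], "B3": ["B3"],
    "C": ["B1", "B2", "B3"], "D1": ["D1"], "D2": ["B1", "B2", "B3"],
    "D3": ["A", "B1", "B2", "B3"], "E": ["E"],
}


def ambiguous_steps(path, sigma):
    """Return the steps of a simple path that map to more than one relation."""
    steps = [s for s in path.split("/") if s]
    return {s: sigma[s] for s in steps if len(sigma[s]) > 1}


def naive_sql(path, sigma):
    """Hypothetical naive mapper: one relation per step, joined on ID/parentID."""
    steps = [s for s in path.split("/") if s]
    bad = ambiguous_steps(path, sigma)
    if bad:
        # A multi-valued mapping gives no single target relation for these steps.
        raise ValueError(f"ambiguous target relations: {bad}")
    aliases = [f"T{i}" for i in range(1, len(steps) + 1)]
    from_clause = ", ".join(f"{sigma[s][0]} {a}" for s, a in zip(steps, aliases))
    joins = " And ".join(
        f"{aliases[i - 1]}.ID={aliases[i]}.parentID" for i in range(1, len(steps))
    )
    sql = f"Select {aliases[-1]}.ID From {from_clause}"
    return sql + (f" Where {joins}" if joins else "")
```

With the Shared mapping, `naive_sql("/A/B1/C/D3", SHARED)` yields the query of Example 1 with each $\sigma(\cdot)$ replaced by a concrete relation. With the Hybrid mapping the same call raises an error, because C and D3 each have several candidate relations.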
4 Preliminaries

4.1 Schema-Based Query Mapping

In schema-based relational XML storage, query mapping typically takes as input an XML query, an XML schema, the XML-to-Relational schema mapping information (called the $\sigma$-mapping), and a database. It produces a relational query, runs it against the database where the XML document is stored, and returns the query results as output. In the following, we formalize the notions of $\sigma$-mapping and query mapping:

Definition 1 ($\sigma$-mapping). Given an XML schema $S$ with element-type set $E$ and attribute-type set $A$, and a database schema $R$, a $\sigma$-mapping is a mapping $\sigma : (E \cup A) \rightarrow R$ such that, given an attribute/element type $e$, $\sigma(e)$ is the set of relations in which the instances of $e$ will be stored.

Definition 2 (Query Mapping). A query mapping $QM$ is a function that assigns to each tuple $(Q, S, X, R, B, \sigma)$ a relational query $Q'$, where $Q$ is an XML query, $S$ is an XML schema, $X$ is an XML document conforming to $S$, $R$ is a database schema, $B$ is a database of $R$, $\sigma$ is a mapping from $S$ to $R$, and $Q'$ is a set of relational queries equivalent to $Q$ such that $Q'(B) \equiv Q(X)$.

4.2 $\sigma_p$-Mapping

We propose a path-based $\sigma$-mapping ($\sigma_p$-mapping) to resolve the mapping ambiguity that arises in the presence of multi-valued schema mappings. The $\sigma_p$-mapping uses the information obtained from the path structure and the $\sigma$-mapping to find a single relation for each element in the input path. Once the $\sigma_p$-mapping of a particular path expression is computed, the equivalent relational query can be constructed without any ambiguity.

Lemma 1. Any edge in an XML schema graph $G$ is either a normal-edge or a *-edge.

Proof. If an element can occur at most once under its parent, then it is connected to its parent by an edge labeled by "," or "?" in the XML schema graph $G$. All the edges in $G$ labeled by the "," and "?" operators constitute normal-edges. If an element can occur more than once under its parent, then it is connected to its parent by an edge labeled by "*" or "+" in $G$. All the edges in $G$ labeled by the "*" and "+" operators constitute *-edges. Since there is no occurrence operator other than {",", "?", "*", "+"} in $G$, any edge in an XML schema graph is either a normal-edge or a *-edge.

In the following, we formalize the notions of simple path expression and $\sigma_p$-mapping:

Definition 3 (Simple Path Expression). A simple path expression $p$ can be denoted as $/n_1/n_2/\ldots/n_k$, where each $n_i$ is the node type of step $i$ and the axis of each step is the child axis /, which denotes the parent-child relationship. The node type $n_1$ represents the root element of the XML document and $k$ represents the number of steps in $p$.

Definition 4 ($\sigma_p$-Mapping). Given an input simple path $p = /e_1/e_2/\ldots/e_n$, a $\sigma$-mapping $\sigma$, and an XML schema graph $G$, $\sigma_p(e_i)$ is defined as follows, where $i = 1, 2, \ldots, n$:

$$
\sigma_p(e_i) =
\begin{cases}
\sigma(e_i), & \text{if } |\sigma(e_i)| = 1 \\
e_i, & \text{if } |\sigma(e_i)| > 1 \text{ and } (e_{i-1}, e_i) \text{ is a *-edge in } G \\
\sigma_p(e_{i-1}), & \text{if } |\sigma(e_i)| > 1 \text{ and } (e_{i-1}, e_i) \text{ is a normal-edge in } G
\end{cases}
$$
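Definition 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sigma_p` and its parameters are hypothetical, the recursion on $\sigma_p(e_{i-1})$ is evaluated iteratively from left to right, and the set of *-edges in `STAR_EDGES` is an assumption, since the text does not list which edges of Figure 1 carry the * label.

```python
# Hybrid sigma-mapping transcribed from Table 1.B.
HYBRID = {
    "A": ["A"], "B1": ["B1"], "B2": ["B2"], "B3": ["B3"],
    "C": ["B1", "B2", "B3"], "D1": ["D1"], "D2": ["B1", "B2", "B3"],
    "D3": ["A", "B1", "B2", "B3"], "E": ["E"],
}

# ASSUMPTION: illustrative *-edges only; every edge not listed here is
# treated as a normal-edge.
STAR_EDGES = {("A", "B1"), ("A", "B2"), ("A", "B3")}


def sigma_p(path, sigma, star_edges):
    """Return the single target relation for each step of a simple path.

    path:       list of element types [e_1, ..., e_n], with e_1 the root
    sigma:      dict mapping an element type to its list of relations
    star_edges: set of (parent, child) pairs labeled * in the schema graph
    """
    result = []
    for i, e in enumerate(path):
        relations = sigma[e]
        if len(relations) == 1:
            # Case 1: the element is mapped to exactly one relation.
            result.append(relations[0])
        elif i == 0:
            # The definition relies on the root having a single relation.
            raise ValueError(f"root element {e!r} maps to several relations")
        elif (path[i - 1], e) in star_edges:
            # Case 2: *-edge, the element is stored in its own relation.
            result.append(e)
        else:
            # Case 3: normal-edge, the element is inlined with its parent.
            result.append(result[i - 1])
    return result
```

For the path /A/B1/C/D3, `sigma_p(["A", "B1", "C", "D3"], HYBRID, STAR_EDGES)` returns `["A", "B1", "B1", "B1"]`: C and D3 are ambiguous under the Hybrid mapping, both are reached over normal-edges, and so both inherit the relation of B1.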

Example 2. If the XPath expression $p$ = /A/B1/C/D3 is given against the XML schema graph shown in Figure 1, the following $\sigma_p$-mapping is produced by computing $\sigma_p$ based on the multi-valued schema mapping shown in Table 1.B:

  Element   Relation
  A         A
  B1        B1
  C         B1
  D3        B1

Theorem 1 (Correctness). Given an input simple path expression $p = /e_1/e_2/\ldots/e_n$, $\sigma_p(e_i)$ returns the correct and single target relation for every element $e_i$ in $p$, where $i = 1, 2, \ldots, n$.

Proof (Sketch). First, $\sigma_p(e_i)$ returns the same relation as $\sigma(e_i)$ if the input element $e_i$ is mapped to a single relation. Second, if the input element $e_i$ is mapped to multiple relations, then the type of the edge between $e_i$ and its parent $e_{i-1}$ is checked in the XML schema graph. If the edge is a *-edge, then $\sigma_p(e_i)$ returns the relation $e_i$, since $e_i$ is mapped to a separate relation as it occurs multiple times under its parent. Third, if the input element $e_i$ is mapped to multiple relations and the edge between $e_i$ and its parent $e_{i-1}$ is a normal-edge, then $\sigma_p(e_{i-1})$ is called to determine the target relation for $e_i$, since $e_i$ is mapped to the same relation as its parent $e_{i-1}$. The recursive call to $\sigma_p(e_{i-1})$ stops whenever a single relation is returned. If all the edges from $e_1$ to $e_{i-1}$ are normal-edges, then the recursion stops at $\sigma_p(e_1)$, since $e_1$ is the root element and it is always mapped to the single relation $e_1$. By Lemma 1, every edge in an XML schema graph is either a normal-edge or a *-edge. As a result, $\sigma_p(e_i)$ returns the correct and single relation corresponding to element $e_i$.

Besides multi-valued mappings, the $\sigma_p$-mapping can deal with single-valued schema mappings, for which it returns the same values as the $\sigma$-mapping. Therefore, the $\sigma_p$-mapping is sufficient to develop a generic XML-to-Relational query mapping algorithm in the presence of multi-valued schema mappings as well as single-valued schema mappings.

4.3 Unfolded XML Schema Graph

The challenge in translating recursive XML queries over recursive XML schemas is that there may be infinitely many matching paths in the XML schema graph. However, if we unfold the recursive XML schema based on the maximum depth of each cycle in the schema graph, we can find a finite number of matching paths for an arbitrary XML query, including recursive ones. This observation leads us to an elegant and efficient solution for the problem of translating recursive XML queries in the presence of recursive XML schemas.

We propose to convert a cyclic XML schema graph to a directed acyclic graph by unfolding the cycles in the original schema. This new schema is called the unfolded XML schema graph (UXG). A UXG of a sample XML document, which conforms to the XML schema graph given in Figure 1, is shown in Figure 2. The formal definition of UXG is given in Definition 5.

  <A>
    <B1>
      <C>
        <D1><E/></D1>
        <D2>
          <E><D1/></E>
        </D2>
        <D3><E/></D3>
      </C>
    </B1>
    <B1>
      <C>
        <D1><E/></D1>
        <D2><E/></D2>
        <D3>
          <E>
            <D1><E/></D1>
          </E>
        </D3>
      </C>
    </B1>
    <D3/>
  </A>

[Unfolded XML schema graph. Node labels: A, B1, B2, B3, C, D1, D2, D3, E, D1, E]

Fig. 2. A Sample XML Document and its Unfolded XML Schema Graph (UXG)

Definition 5 (Unfolded XML Schema Graph (UXG)). Given an XML schema $S$, the unfolded schema of $S$ is a directed acyclic graph $UXG = (V, E, d_1, \ldots, d_k)$, where $V$ is the set of vertices, $E$ is the set of edges, each $d_i$ is the maximum depth of cycle $c_i$ in $S$, and $k$ denotes the number of cycles in $S$. Each cycle $c_i$ in $S$ is unfolded to depth $d_i$ in the UXG in top-down topological order. The vertices represent element types in $S$, and the edges represent their parent-child relationships. Each vertex is labeled with the name of the corresponding element type. An edge is labeled by * if it is incident to a vertex that can appear more than once under its parent in the corresponding XML documents; otherwise no label is used.

A recursive XML schema $S$ can be converted into a non-recursive one in the form of a UXG $G$ by unfolding the recursion in $S$ a finite number of times. This number is determined from the XML documents $X$ stored in the database, such that $X$ conforms to $S$ and $G$ at the same time. In other words, $S$ and $G$ are equivalent to each other with respect to $X$. We can create a UXG by using one of the following two approaches:

Static approach. The maximum depth of each cycle in the XML schema graph is determined with the help of a domain expert, and a fixed UXG is generated during the schema mapping phase. This fixed UXG is used during query mapping regardless of the structure of the underlying XML documents.

Dynamic approach. The maximum depth of each cycle in the XML schema graph is initialized to 1 and a default UXG is generated during the schema mapping phase. When a new XML document is loaded into the database during the data mapping phase, the maximum depth of each cycle in the current document is found, and the UXG is modified if any current depth value is greater than the existing one.

The static UXG approach does not have any computational overhead during the data mapping phase. However, it may return unnecessary matching paths for a given recursive XML query. The dynamic UXG approach, on the other hand, incurs some computational cost during the data mapping phase to maintain the UXG, in return for minimizing the total number of matching paths for the input recursive XML queries.

The UXG is constructed either during the schema mapping phase or during the data mapping phase. We assume that bulk data is first loaded into the database system and then queried in a batch-processing fashion. Therefore, the construction of the UXG does not introduce additional overhead to XML-to-Relational query mapping performance, since it is precomputed before the query mapping phase.

5 ID-Based Generic Query Mapping

All the schema-based approaches proposed for XML-to-Relational query mapping in the literature have used ID-based techniques, as in [4,5]. In ID-based techniques, each element is associated with a unique ID, and the tree structure of the XML document is preserved by maintaining a foreign key to the parent, which we call parentID. Each child axis / is translated into an equijoin between child and parent elements over their parentID and ID columns.

We propose a generic ID-based XML-to-Relational query mapping algorithm, ID-XMLToSQL, in this section. An outline of ID-XMLToSQL is given in Figure 3. Given a path expression $P$ and a UXG $G_u$, the ID-XMLToSQL algorithm first identifies all the matching simple paths $p_i$ and the $\sigma_p$-mappings $\sigma_{p_i}$ corresponding to those paths.
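The path-matching step can be sketched as a depth-first walk over the acyclic UXG. The following Python fragment is an illustration under stated assumptions rather than the authors' implementation: the names `UXG`, `label`, `parse`, and `matching_paths` are hypothetical, and the edge set is an assumption chosen to be consistent with the sample document of Figure 2, with the unfolded copies of D1 and E given a `#2` suffix. Because the graph has no cycles, the walk terminates and the set of matching simple paths is finite.

```python
import re

# ASSUMPTION: adjacency list of an unfolded schema graph; "D1#2" and "E#2"
# are the unfolded copies of D1 and E (the label is the part before '#').
UXG = {
    "A": ["B1", "B2", "B3", "D3"],
    "B1": ["C"], "B2": ["C"], "B3": ["C"],
    "C": ["D1", "D2", "D3"],
    "D1": ["E"], "D2": ["E"], "D3": ["E"],
    "E": ["D1#2"],
    "D1#2": ["E#2"],
    "E#2": [],
}


def label(node):
    """Element-type name of a UXG node."""
    return node.split("#")[0]


def parse(expr):
    """Split a path expression into (axis, node type) navigation steps."""
    return re.findall(r"(//|/)([^/]+)", expr)


def matching_paths(expr, uxg, root):
    """Enumerate the simple paths of the UXG that match a path expression."""
    steps = parse(expr)
    results = []

    def walk(node, k, prefix):
        # 'node' is a candidate for step k; 'prefix' holds the labels above it.
        axis, name = steps[k]
        path = prefix + [label(node)]
        if label(node) == name:
            if k == len(steps) - 1:
                results.append("/" + "/".join(path))
            else:
                for child in uxg[node]:
                    walk(child, k + 1, path)
        if axis == "//":
            # Descendant axis: the step may also be matched deeper in the graph.
            for child in uxg[node]:
                walk(child, k, path)

    walk(root, 0, [])
    # Drop duplicates that arise when one path matches in several ways.
    return list(dict.fromkeys(results))
```

Under these assumptions, `matching_paths("/A/D3//E", UXG, "A")` returns `["/A/D3/E", "/A/D3/E/D1/E"]`. Each returned simple path would then be paired with its $\sigma_p$-mapping before SQL generation.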
Then it calls the SQL generation procedure SPathToSQL() for each simple path $p_i$ along with its mapping $\sigma_{p_i}$, and takes the union of the output SQL queries. We formalize the notion of a path expression as follows:

Definition 6 (Path Expression). A path expression $P$ can be denoted as $a_1 n_1 a_2 n_2 \ldots a_k n_k$, where each $n_i$ is a node type and each $a_i$ is either the child axis / or the descendant axis //. Each $a_i n_i$ constitutes a navigation step of $P$, and $k$ is the number of steps in $P$.

  00 Algorithm ID-XMLToSQL
  01 Input: Path Expression P, UXG G_u
  02 Output: SQL query sql
  03 Begin
  04   Let p_i, i = 1, 2, ..., n, be the set of all matching simple paths of P in G_u
  05   Let σ_{p_i} be the σ_p-mapping for the simple path p_i, i = 1, 2, ..., n
  06   sql = ""
  07   sql = ∪_{i=1..n} SPathToSQL(p_i, σ_{p_i})
  08   Return sql
  09 End

  00 Procedure SPathToSQL(Simple Path Expression p, σ_p-mapping σ_p)
  01 Begin
  02   Use σ_p to cluster p = /e_1/e_2/.../e_m according to Definition 7
  03   FromClause = "From "
  04   WhereClause = "Where "
  05   For i = 1 to m do /* Construct From Clause */
  06     If e_i is the first element of a cluster then
  07       FromClause += $σ_p(e_i)
  08     End If
  09   End For
  10   For i = 2 to m do /* Construct Where Clause */
  11     If e_i is the first element of a cluster then
  12       WhereClause += $σ_p(e_{i-1}).(e_{i-1}.ID) = σ_p(e_i).(e_i.parentID)
  13     End If
  14     If e_i is neither first nor last element of a cluster then
  15       WhereClause += $σ_p(e_i).(e_i.ID) is not null
  16     End If
  17   End For
  18   sql = "Select " + $σ_p(e_m).(e_m.ID) + FromClause + WhereClause
  19   Return sql
  20 End

Fig. 3. ID-based Query Mapping Algorithm ID-XMLToSQL

A naive XML-to-SQL query mapping procedure follows a blind approach. It takes an input simple path expression and generates a relational query by joining the relations corresponding to each step in the simple path expression. A sample SQL query generated using the naive query mapping approach is given in Example 1. When consecutive elements in a simple path expression are mapped to the same relation, the naive approach unnecessarily joins the same relation with itself multiple times. For the simple path expression and its $\sigma_p$-mapping given in Example 2, the corresponding SQL query will include two unnecessary self-joins, since the elements of the last three steps in the path are mapped to the same relation.

An intelligent XML-to-SQL query mapping algorithm should be able to recognize the elements mapped to the same relation and avoid the unnecessary self-join operations. We deal with this issue in the SPathToSQL() procedure, whose outline is shown in Figure 3. The SPathToSQL() procedure identifies the clusters in a path expression, which are the groups of elements in consecutive navigation steps mapped into the same relation.
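The clustering idea of SPathToSQL() can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function names and the list-based inputs are hypothetical, and two conventions are assumptions inferred from Example 3. An element stored in its own relation uses plain `ID`/`parentID` columns while an inlined element prefixes them with its name, and a relation receives an alias `T1`, `T2`, ... only when more than one cluster uses it.

```python
from collections import Counter


def spath_to_sql(path, mapping):
    """Build one SQL query for a simple path.

    path:    list of element types [e_1, ..., e_m]
    mapping: list with the sigma_p relation of each step, same length as path
    """
    # Group consecutive steps that are mapped to the same relation.
    clusters = []
    for i, rel in enumerate(mapping):
        if clusters and mapping[clusters[-1][-1]] == rel:
            clusters[-1].append(i)
        else:
            clusters.append([i])

    # One From item per cluster; alias only relations used by several clusters.
    uses = Counter(mapping[c[0]] for c in clusters)
    alias_of_step, from_items, counter = {}, [], 0
    for c in clusters:
        rel = mapping[c[0]]
        if uses[rel] > 1:
            counter += 1
            alias = f"T{counter}"
            from_items.append(f"{rel} {alias}")
        else:
            alias = rel
            from_items.append(rel)
        for i in c:
            alias_of_step[i] = alias

    def col(i, name):
        # ASSUMED column naming, see the lead-in above.
        if mapping[i] == path[i]:
            return f"{alias_of_step[i]}.{name}"
        return f"{alias_of_step[i]}.{path[i]}.{name}"

    predicates = []
    for c in clusters:
        if c[0] > 0:
            # Join the last element of the previous cluster to this cluster.
            predicates.append(f"{col(c[0] - 1, 'ID')}={col(c[0], 'parentID')}")
        for i in c[1:-1]:
            # Skipped intermediate elements must still exist.
            predicates.append(f"{col(i, 'ID')} is not null")

    sql = f"Select {col(len(path) - 1, 'ID')} From {', '.join(from_items)}"
    if predicates:
        sql += " Where " + " And ".join(predicates)
    return sql


def id_xml_to_sql(paths_with_mappings):
    """Union the queries of all matching simple paths."""
    return " UNION ALL ".join(spath_to_sql(p, m) for p, m in paths_with_mappings)
```

For the path /A/B1/C/D3 with the mapping A, B1, B1, B1, the sketch emits a single join between A and B1 plus an existence check on the skipped element C, rather than three joins. Aliasing only repeated relations keeps the output close to the queries shown in the paper; aliasing every cluster would be equally valid SQL.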
The SPathToSQL() procedure recognizes each cluster in a simple path expression and only joins the relation corresponding to the last element of a cluster to the relation corresponding to the first element of its successor cluster. Thus, it avoids the self-join problem of a blind query mapping approach. The notion of a cluster is formalized as follows:

Definition 7 (Cluster). Given a simple path expression $p$ and a mapping $\sigma_p$ over $p$, the elements of consecutive steps in $p$ that are mapped to the same relation constitute a cluster. Hence, $p$ can be denoted as a sequence of clusters such that $p = c_1 c_2 \ldots c_k$, where each $c_i$ is a cluster and $k$ is the number of clusters in $p$.

The SPathToSQL() procedure given in Figure 3 first constructs the From clause at lines 05-09. It introduces one relation per cluster to the From clause, since all the elements in a cluster are mapped to the same relation. The Where clause is constructed at lines 10-17. A transition from one cluster to another in the input path is handled at lines 11-13. A predicate of the form $\sigma_p(e_{i-1}).(e_{i-1}.ID) = \sigma_p(e_i).(e_i.parentID)$, joining the last element of the previous cluster to the first element of the current cluster, is added to the Where clause. As a result, the relations representing all neighboring clusters are joined. Because the procedure skips the intermediate elements in a cluster, it adds an existential predicate of the form $\sigma_p(e_i).(e_i.ID)$ is not null for each of them to the Where clause (lines 14-16). This ensures that the middle elements of a cluster co-exist with the elements at each end of the cluster in the underlying XML document. The output SQL query is constructed and returned at lines 18-19.

The existential predicate is not null is not introduced for the elements at each end of a cluster, since they are already included in the join conditions of the output SQL query. Although the last element in a path expression may not be used in a join condition, we do not need to check its existence, as it is used in the Select clause. We also do not need to check the existence of the first element of a simple path expression, which is the root element, as all simple paths start from the root element.

Example 3. If the path expression /A/D3//E is given against the UXG shown in Figure 2 and input to the ID-XMLToSQL algorithm, ID-XMLToSQL calls the SPathToSQL() procedure with the following simple paths identified from the UXG: (i) /A/D3/E and (ii) /A/D3/E/D1/E, and their $\sigma_p$-mappings: (i) {(A,A), (D3,A), (E,E)} and (ii) {(A,A), (D3,A), (E,E), (D1,D1), (E,E)}, respectively. Below is the output SQL query generated by our ID-XMLToSQL algorithm:

  Select E.ID
  From A, E
  Where A.D3.ID=E.parentID
  UNION ALL
  Select T2.ID
  From A, E T1, D1, E T2
  Where A.D3.ID=T1.parentID And T1.ID=D1.parentID And D1.ID=T2.parentID

Theorem 2 (Time Complexity). The time complexity of the procedure SPathToSQL is $O(n)$, where $n$ is the number of steps in an input simple path expression $p$.

Proof (Sketch). The statement at line 02 navigates $p$ once to cluster it and can be evaluated in $O(n)$. The loop at lines 05-09 navigates $p$ once to construct the From clause and is evaluated in $O(n)$. The loop at lines 10-17 navigates $p$ once to construct the Where clause and is executed in $O(n)$. Thus, the time complexity of SPathToSQL() is $O(n)$.

6 Experimental Study

In this section, we compare the performance of our ID-XMLToSQL algorithm with the recursive query translation algorithm SQLGen of [4]. We used a Pentium IV computer with a 2.4 GHz processor and 1 GB of main memory for the experiments. The experiments were run using the Java software development kit. We minimized the usage of system resources during the experiments to get more realistic results. We ran each program six times and report the average, excluding the first run, to obtain more accurate results.

We used the auction.xml document of the XMark benchmark [19] as our data set to compare the performance of our proposed ID-XMLToSQL algorithm and the SQLGen algorithm of [4]. The DTD of XMark includes several cycles, and thus it is an appropriate XML schema for our experiments. The number of elements in the test XML document is 73,740. We selected nine queries with particular features for the test suite, which is shown in Table 2. All the queries in our test suite are recursive queries, as they contain the descendant axis //. All the queries return elements that are included in a cycle in the XML schema. While the queries Q1, Q8 and Q9 include clusters of two or more elements, the queries Q2, Q3, Q5, Q7, Q8 and Q9 include shared elements, which have more than one parent in the XML schema.

We implemented only a single-valued schema mapping scheme to run the two query mapping algorithms ID-XMLToSQL and SQLGen, as SQLGen does not support multi-valued schema mapping schemes. We used a commercial relational DBMS that supports the advanced SQL 99 WITH clause, as it is central to the SQLGen algorithm. We measured the response time for each test query by running the queries generated by the two algorithms separately. The experimental results are shown in Figure 4. We used a logarithmic scale to increase the readability of the chart.
Table 2. Query Suite for Testing

  Query   Query Definition
  Q1      /site/categories/category/description//parlist
  Q2      //text
  Q3      //parlist
  Q4      //asia//listitem
  Q5      //item//listitem
  Q6      //asia//parlist
  Q7      //item/parlist
  Q8      /site/regions/asia/item//parlist
  Q9      /site/regions/asia/item//listitem

[Fig. 4. Experimental Results for Query Mapping. Chart of Time (logarithmic scale) for queries Q1-Q9; legend: Interval-XMLToSQL, ID-XMLToSQL, SQLGen]

As can be seen from the chart, our ID-XMLToSQL algorithm outperformed the SQLGen algorithm on all the test queries. The main reasons for the performance difference between ID-XMLToSQL and SQLGen include the following:

- ID-XMLToSQL resolves the recursion at the XML schema level using a precomputed unfolded XML schema graph, unlike SQLGen, which resolves it inside the relational engine using a recursive SQL query.
- The queries generated by SQLGen are typically more complex and larger than the ones generated by our ID-XMLToSQL.
- ID-XMLToSQL uses the notion of clustering and avoids unnecessary self-joins.

7 Conclusions and Future Work

In this paper, we proposed the generic XML-to-SQL query mapping algorithm ID-XMLToSQL, which can be used with multi-valued schema mappings as well as with single-valued schema mappings. ID-XMLToSQL uses our proposed path-based $\sigma_p$-mapping technique to find the target relation for a given element of a path query in the presence of multi-valued schema mappings. We proposed to convert a cyclic XML schema graph to an acyclic one by unfolding the cycles in the graph to a maximum depth. Thus, we are able to map recursive XML queries over the unfolded XML schema graph to SQL queries without using special operators to capture the recursion. Therefore, our proposed query mapping algorithm can be used on any RDBMS, as it uses standard SQL features, unlike other recursive query mapping algorithms in the literature. We compared the performance of our ID-XMLToSQL algorithm to the SQLGen algorithm of [4] and observed that ID-XMLToSQL outperformed SQLGen for all the queries in our test suite. As potential future work, we consider augmenting our proposed ID-based generic query mapping algorithm with interval-based and path-based mapping schemes.

Acknowledgment

The authors would like to thank Rajasekar Krishnamurthy for providing the source code of the SQLGen algorithm and for his cooperation, and Dapeng Liu for his involvement in the implementation of our ID-XMLToSQL algorithm.

References

1. Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., DeWitt, D.J., Naughton, J.F.: Relational databases for querying XML documents: Limitations and opportunities. In: VLDB, pp. 302-314 (1999)
2. Atay, M., Chebotko, A., Liu, D., Lu, S., Fotouhi, F.: Efficient schema-based XML-to-Relational data mapping. Information Systems Journal 32(3), 458-476 (2007)
3. Krishnamurthy, R., Kaushik, R., Naughton, J.F.: XML-to-SQL query translation literature: The state of the art and open problems. In: XML Database Symposium (2003)
4. Krishnamurthy, R., Chakaravarthy, V.T., Kaushik, R., Naughton, J.F.: Recursive XML schemas, recursive XML queries, and relational storage: XML-to-SQL query translation. In: Proc. of the 20th International Conference on Data Engineering, Boston, pp. 42-53 (March 2004)
5. Fan, W., Yu, J.X., Lu, H., Lu, J., Rastogi, R.: Query translation from XPath to SQL in the presence of recursive DTDs. In: Proc. of the 31st VLDB Conference, Trondheim, Norway (2005)
6. Choi, B.: What are real DTDs like? In: WebDB Workshop (2002)
7. Deutsch, A., Fernandez, M.F., Suciu, D.: Storing semistructured data with STORED. In: SIGMOD Conference, pp. 431-442 (1999)
8. Florescu, D., Kossmann, D.: Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin 22(3), 27-34 (1999)
9. Schmidt, A., Kersten, M., Windhouwer, M., Waas, F.: Efficient relational storage and retrieval of XML documents. In: WebDB (2000)
10. Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: A path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology (TOIT) 1(1), 110-141 (2001)
11. Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Storing and querying ordered XML using a relational database system. In: SIGMOD Conference, pp. 204-215 (2002)
12. Dehaan, D., Toman, D., Consens, M.P., Ozsu, T.: A comprehensive XQuery to SQL translation using dynamic interval encoding. In: SIGMOD Conference (2003)
13. Teubner, J., Grust, T., Keulen, M.V.: Staircase join: Teach a relational DBMS to watch its (axis) steps. In: VLDB Conference (2003)
14. Krishnamurthy, R., Kaushik, R., Naughton, J.F.: Efficient XML-to-Relational query translation: Where to add intelligence? In: Proc. of the 30th VLDB Conference, Toronto, Canada (2004)
15. Runapongsa, K., Patel, J.M.: Storing and querying XML data in object-relational DBMSs. In: EDBT Workshops (2002)
16. Cheng, J., Xu, J.: DB2 extender for XML. IBM (2000), http://www-4.ibm.com/software/data/db2/extenders/xmlext/
17. Oracle: XML Database Developer's Guide - Oracle XML DB Release 2 (2002), http://otn.oracle.com/tech/xml/xmldb/content.html
18. Microsoft: SQLXML and XML Mapping Technologies (2004), http://msdn.microsoft.com/sqlxml/default.asp
19. Schmidt, A., Waas, F., Kersten, M.L., Carey, M.J., Manolescu, I., Busse, R.: XMark: a benchmark for XML data management. In: VLDB, pp. 974-985 (2002)