Caching XML Data on Mobile Web Clients

Stefan Böttcher, Adelhard Türling
University of Paderborn, Faculty 5 (Computer Science, Electrical Engineering & Mathematics)
Fürstenallee 11, Paderborn, Germany

Abstract

Whenever XML data is delivered from a web server to small mobile clients, it can be considerably advantageous for the client to cache and reuse the XML data of previous queries rather than having the server deliver the same XML data repeatedly. We present a technique to identify, construct and minimize the fragment of an XML document that is required to answer a query but is not contained in the client's cache. We further show how this so-called difference XML fragment can be integrated with the XML data already stored on the client, such that an XPath query can thereafter be answered completely from the client's cache available on mobile devices.

Keywords: Web data management, XML, XPath, mobile clients, caching.

1. INTRODUCTION

1.1. Problem origin and motivation

Whenever XML data stored on a web server is searched by small mobile clients, one important issue is to reduce the data exchanged from the server to the client in order to minimize communication costs. Compared with delivering a complete query result from the server to the client, it may be of considerable advantage to reuse the client's old query results wherever possible, and to transport only the difference XML fragment not yet stored on the client from the server to the client. Our work has been motivated by the development of a web-based information system for e-learning, which uses XPath queries [13] within a client-server environment involving mobile clients. However, we regard the application field to be much broader, i.e. we regard our technique to be useful wherever a web server provides XML data which is used by applications running on mobile clients.
The overall architecture of the whole system is shown in Figure 1.

1.2. System overview

As shown in Figure 1, an application on a mobile client communicates via a client cache with a remote web server. The mobile client uses its local cache to store XML data which has been retrieved from the web server. Whenever the client application uses an XPath expression XQ to query for a certain XML fragment of the web server, the client first checks whether or not the query XQ can be answered with the data stored in the client's cache alone. If this is the case, no query is sent to the server. Otherwise, the client not only submits the query XQ to the server, but also tells the server for which previous queries XPi it still holds the query result in its cache. This information is used on the server side in order to compute the difference XML fragment, i.e. the fragment which is required in order to answer the query but which is missing on the client side.

Let XP1, ..., XPn be the previous XPath queries for which the answers are still stored in the client's cache. The client sends the query XQ and the list of previous queries (XP1, ..., XPn) to the server together with a timestamp for each query. The server uses these queries and their timestamps in order to compute the difference XML fragment. Only this difference XML fragment has to be transmitted to the mobile client. Thereafter, the client integrates the difference XML fragment with its previous query results. Finally, the client uses its refreshed cache in order to answer the query.

Figure 1: System overview
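
The client side of this exchange can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names CachedQuery, build_request and query, the dictionary layout of the request, and the two callables answerable_locally and send are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedQuery:
    xpath: str        # a previous query XPi whose result is still cached
    timestamp: float  # time at which that result was retrieved


def build_request(xq, tq, previous):
    """Bundle the new query XQ with (XP1, ..., XPn) and all timestamps."""
    return {
        "query": xq,
        "timestamp": tq,
        "previous": [(p.xpath, p.timestamp) for p in previous],
    }


def query(xq, tq, previous, answerable_locally, send):
    """Return None if the cache suffices, else whatever the server sends back.

    answerable_locally and send are hypothetical callables standing in for
    the cache check and the network transport.
    """
    if answerable_locally(xq):
        return None  # no query is sent to the server
    return send(build_request(xq, tq, previous))
```

In this sketch the server's reply would be the difference XML fragment, which the client then integrates into its cache before answering XQ locally.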

1.3. Focus of this contribution and problem definition

Our system as given above has to solve a variety of data management problems, including query processing, difference fragment computation and data integration. Within this paper, we present a query optimization technique which reuses previous query results from a client's cache, and which is based on XML fragment integration within the client's cache. This paper focuses on the computation of the difference XML fragment which is missing on a client, and on the integration of the difference XML fragment with a previous query result on the client. Additionally, we present a replacement strategy to handle cache overflows.

One major problem discussed in this paper is how to integrate a difference XML fragment with a previously cached query result. The solution to this problem may furthermore influence which information must be included in an XML difference fragment on the server side and which information is optional.

Example 1: In order to illustrate the problem, let us assume that we have the following server-side XML document, where E1 is the only child of the document root.

    <E1>
      <E2 a="1"> ... </E2>
      <E2 a="2"> ... </E2>
      <E2 a="3"> ... </E2>
    </E1>

Let us further assume that the copy in the client's cache contains the same document except for the XML subtree which begins with the second element node E2, i.e. the one whose attribute a has the value 2, together with the descendant nodes of this node E2. When this missing fragment is transferred from the server to the client, it must be made clear to the client where this fragment has to be inserted (e.g. as a child of E1 before its first child, i.e. before the element E2 with attribute a set to 1, or as a following sibling of one of the current E2 children of E1, or as a child of a different node E1, or somewhere else inside the fragment).
On the other hand, it may be that the current content of the cache has to be integrated into the transferred difference XML fragment. Finally, the server may send more than one fragment in order to update the client's cache according to a given query.

Another major problem is to identify the exact information required in order to answer a query. For example, assume that a previous query has asked for /E1/E2, and a new query asks for /E1[./E3]/E2, which obviously returns a subset of the nodes returned by the previous query. Nevertheless, the previous query result alone is not sufficient to answer the new query. Instead, we need additional information, in this case about which E1 elements contain a child node E3 (and a child node E2).

The remainder of this paper is organized as follows. Section 2 describes how XPath queries are treated on the client side, whereas the server-side counterpart is described in Section 3. Section 4 outlines how we compute the node set which is relevant to answering a query. Finally, Section 5 gives a comparison to other related work, and Section 6 contains the summary and conclusions.

2. WEB SEARCH OPTIMIZATION FOR MOBILE CLIENTS

2.1. The client's query processing algorithm

The client's query processing algorithm (Algorithm 1) is outlined below. It shows the procedure Client.query( XQ ), which is called on the client cache when the client application submits a query XQ.

    Client.query( XQ ) {
        sortpreviousqueries( XP1, ..., XPn ) ;
        XMLnewFragment = getdifferencefromserver( XQ, (XP1, ..., XPn), tq, (t1, ..., tn) ) ;
            // loading the difference from server
            // includes the replacement of fragments if necessary.
        integrate( XMLnewFragment ) ;
    }

Algorithm 1: The client's main query processing algorithm

The call of the procedure sortpreviousqueries( XP1, ..., XPn ) sorts the previous queries (for which the query result is contained in the client's cache) according to the client application's priorities, such that XPn is the most important and XP1 is the least important query. If the client has no preferences, the procedure uses LRU (least recently used) ordering. The call of the procedure getdifferencefromserver( ... ) submits the current query XQ and the previous queries (XP1, ..., XPn) together with their timestamps tq, (t1, ..., tn) to the server. The server computes the resulting difference fragment as described in the remainder of the paper. As soon as the client knows the size of the difference fragment, it starts to replace data according to the server's advice and integrates the new data with the data already stored in the client's cache.

2.2. Server location information needed for difference XML fragments

As mentioned previously, the client receives difference XML fragments from the server, which have to be integrated with the previous query results already stored on the client. For this purpose, the client has to know where in the server-side XML document a fragment is located in order to store the XML fragment at the corresponding location in the client's cache. Within our approach, the server uses a node numbering scheme (defined in the following) in order to tell the client where a transmitted fragment is located within the XML document.

The node numbering scheme is a reversible function which maps paths in an XML document to paths of natural numbers. For each given XML document D, the document root of D is mapped to the path 0. Whenever a node N of D is mapped to a path P, and a node C is the i-th child node of N in document D, and C has the node name Cname, then C is mapped to the path P/i(Cname).

In Example 1 above, we assumed that the node E1 is the only child node of the document root. The numbering scheme maps the document root to 0, and maps the path from the root node to the node E1 to the path 0/1(E1). The path to the third child element E2 of E1 is mapped to 0/1(E1)/3(E2). By applying this numbering scheme to the server-side XML document, each path from the root to a node in the XML document has a unique number path associated with it. When a fragment of the XML document is transferred to the client, these number paths are used in order to uniquely determine the position of the transferred XML fragment within both the original XML document on the server and the client's virtual cache copy of the original XML document.

In Example 1, the client's cache copy contains the whole document of the server, except the second E2 child of E1 and all the descendant nodes of this E2 child. When the missing fragment (i.e. this E2 element with all its descendants) is later transferred from the server to the client, it is important for the client to know where to insert that fragment into the client's data structure, which contains a copy of the other part of the server's XML document. More specifically, the client has to use the number path 0/1(E1)/2(E2) in order to determine the correct position at which this fragment must be inserted into the client's copy.

2.3. The location tree data structure used for location information

Within our implementation we extend each element with an additional attribute (CSN, short for child sequence number), which simply assigns the CSN i to a node when the node is the i-th child node of its parent. Thereby the value of the attribute CSN keeps information about the location of a node relative to its parent node, e.g. a CSN value of 3 means that the current element is the third child of its parent. The CSN value of a given element corresponds to the last number in the number path from the root node to this element.

Whenever the client cache also contains, for a given element, all its ancestor nodes up to the root, the CSN attribute values along this ancestor path can be used to uniquely determine the number path of the given element. Thus, the position of this given element in the original document is uniquely determined. Therefore the client cache also stores for each XML fragment its number path to the root. In order to keep the data structure small and to support searches, we combine identical prefixes contained in different number paths into one common number path. In other words, we merge all number paths into a tree which contains references to the stored XML fragments. This is sufficient because it is not absolutely necessary that the server sends all the ancestors of a given element node up to the root in order to let the client know about the original position of this element node.
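
The numbering scheme and the merged location tree can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: only element children are counted when assigning child sequence numbers, the location tree is modelled as nested dictionaries, and the names number_paths, insert_path and the "ref" key are hypothetical.

```python
import xml.etree.ElementTree as ET


def number_paths(xml_text):
    """Return the number path of every element, e.g. '0/1(E1)/3(E2)'."""
    root = ET.fromstring(xml_text)
    paths = []

    def visit(elem, prefix, csn):
        # csn is the child sequence number of elem below its parent
        path = f"{prefix}/{csn}({elem.tag})"
        paths.append(path)
        for i, child in enumerate(elem, start=1):
            visit(child, path, i)

    # the document root is mapped to "0"; its only child gets CSN 1
    visit(root, "0", 1)
    return paths


def insert_path(tree, path, fragment_ref):
    """Merge one number path into the location tree, sharing common prefixes."""
    node = tree
    for step in path.split("/"):
        node = node.setdefault(step, {})
    node["ref"] = fragment_ref  # reference to the stored XML fragment
    return tree
```

Applied to the document of Example 1, number_paths yields 0/1(E1) followed by the three paths 0/1(E1)/1(E2), 0/1(E1)/2(E2) and 0/1(E1)/3(E2). Inserting two of these paths with insert_path creates the shared prefix 0/1(E1) only once.
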
To reconstruct the original position of an XML fragment stored in the client's cache, it is sufficient to keep the number path from the root to that fragment in the cache.

2.4. Merging of new fragments into the location tree of the client's cache

Whenever a new XML fragment has to be integrated into the client's cache, its number path is used in order to uniquely determine the position where the fragment has to be inserted. At the end of this number path the client inserts a reference to the fragment, which is thereby inserted into the tree representing the client's copy of the XML document. In general, however, a newly transmitted fragment from the server is permitted to overlap with a fragment which is already stored in the client's cache. This is allowed because it may be easier for a server to send one fragment which is partially redundant with the fragments already stored in the client's cache than to send multiple smaller fragments which are not redundant. When the fragment sent by the server contains nodes which are already stored in the client's cache, the old nodes contained in the client's copy of the XML document are replaced with the new nodes.

2.5. Client-server communication and query priorities

Insertion of further fragments into the client's cache is only possible when there is enough available memory. Therefore, at the beginning of each insertion, the server informs the client about the size of the new fragment to be inserted (and about the previous query results to be reused because they are relevant to the new query). Based on this information, the client can select nodes or fragments which are replaced in order to have enough available memory for the new query. To be more specific, the client selects the nodes which it replaces based on priorities, and the client and the server use the following negotiation protocol in order to assign the priorities to query results.

At first, i.e. before the client submits a new query XQ to the server, the client orders its previous queries by priority. The priority criterion may be selected according to the current client application (e.g. because the application knows and tells the client cache that certain previous results are more likely to be reused than others), or the priority criterion may simply be LRU. Later, the priority order which the client assigns to the list of previous queries can be rearranged by the server, i.e. the server can assign a higher priority to a previous query when the server recognizes that the result of this previous query should be kept in the client's cache to answer the new query XQ.

2.6. Replacement of nodes or XML fragments from the client's cache

The main point when replacing nodes from the client's cache is to keep the list of previously submitted XPath queries up to date. In order to check which previously submitted XPath query would be invalidated when a certain node or fragment of the client's cache is removed, i.e. for which XPath query expression the stored query result would change, we proceed as outlined in Algorithm 2 below.

    // XP1, ..., XPn are sorted by priority, i.e. the name XPn is assigned
    // to the XPath expression of the most important query result.
    replacenodes( XP1, ..., XPn, neededsize ) {
        availablesize = cachesizelimit ;
        markallnodes( 0 ) ;    // all nodes in the client's cache are
                               // initialized to the lowest priority
        i = n ;
        while ( i > 0 and neededsize <= availablesize ) do {
            marknodesusedby( XPi, i, availablesize ) ;
            i = i - 1 ;
        }
        if ( neededsize > availablesize )
            deleteallnodeswithmarklessorequalto( i+1 ) ;
    }

Algorithm 2: Replacement of nodes with lower priority

The parameters of each call of the procedure replacenodes( XP1, ..., XPn, neededsize ) tell the client which cache size is needed (in order to store the difference XML fragment which is missing in order to answer XQ) and the priority order of the previous queries, as this order may have been rearranged by the server. At first, the available size is set to the buffer size of the client's XML cache, and all nodes in the client's cache (except the root node of the number path tree, i.e. the number 0) are marked with the lowest priority 0. In other words, we consider each node in the client's cache and each number path to be a candidate for replacement. Then, starting with the query with the highest priority (XPn), we mark all nodes in the client's cache which contribute towards answering this query. How we identify and mark these nodes (using the procedure marknodesusedby( XPi, i, availablesize )) is discussed in the next section. Marking nodes reduces the available size in the cache. When we have marked too much, i.e. the available size has dropped below the size needed for the new query result, we know that we can keep all previous query results in the cache except the query result which was marked last. In other words, the last marked query expression (XPi+1) and all those with a lower priority (XP1, ..., XPi) can be deleted from the query cache, and all nodes with a mark of (i+1) or less can be removed from the client's cache.

2.7. Marking nodes which are required for an XPath expression

We delay the topic of checking whether or not a node is required for an XPath query XPi until Section 4.
Here we assume that we find all the nodes which are required in order to answer a query XPi correctly. Note, however, that the node set required to answer a query XPi, which we call Required(XPi) for short, is often not identical to the set of nodes selected by XPi.

When we mark nodes of the client's cache as required, we have to consider the case that a node is required to answer more than one query. If only one of these query results has to be replaced, such a node still has to be kept in the client's cache. Furthermore, we have assigned priorities to the client's queries, and we want to use these priorities in order to identify and replace the nodes of the client cache which are not required for high-priority queries. Therefore the marks are assigned in the following way. If the mark value of a required node is less than the current mark value (i.e. the second parameter passed by the procedure call), then the mark value is set to the current value and the available size is reduced by the size of this node. Otherwise marking leaves this node and the available cache size unchanged.
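
Algorithm 2 and the marking rule above can be combined into one runnable Python sketch. It assumes a simplified cache model that is not taken from the paper: every cached node has an identifier and a size, and Required(XPi) is given as a set of node identifiers. The function name replace_nodes and its parameters are hypothetical.

```python
def replace_nodes(required, sizes, cache_size_limit, needed_size):
    """Return the set of node ids that Algorithm 2 would delete.

    required[k] holds Required(XP_{k+1}); the last entry belongs to the
    most important query XPn. sizes maps each cached node id to its size.
    """
    available = cache_size_limit
    marks = {node: 0 for node in sizes}       # markallnodes( 0 )
    i = len(required)
    while i > 0 and needed_size <= available:
        for node in required[i - 1]:          # marknodesusedby( XPi, i, ... )
            if marks[node] < i:               # not yet claimed by a higher priority
                marks[node] = i
                available -= sizes[node]
        i -= 1
    if needed_size > available:
        # the query marked last (XP_{i+1}) and everything below it is dropped
        return {node for node, mark in marks.items() if mark <= i + 1}
    return set()
```

With three queries XP1 = {c, d}, XP2 = {b, c}, XP3 = {a, b}, node sizes a = b = c = 4 and d = 2, a cache limit of 14 and a needed size of 5, marking XP3 and XP2 leaves only 2 units available, so the nodes c and d are removed and the result of XP3 stays intact.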

3. THE SERVER'S COMPONENT OF THE CACHING MECHANISMS

Basically, the server has to guarantee that the client can answer an XPath query XQ after the server has sent a (possibly empty) XML fragment to the client. In order to reduce transportation costs, our server computes the difference between the new query result and previously submitted query results, and only returns the difference XML fragment (plus a possible rearrangement of the priority order of previous queries, which the server uses in order to inform the client that certain query results shall not be replaced).

For this purpose, the server first computes a fragment FQ which contains the complete answer to Required(XQ). Thereafter, the server applies the query expressions XPn down to XP1 which belong to the previous query results, where again XPn denotes the query expression associated with the query result that the client application considers to be the most important. Again, on the server side we compute the set of nodes required to answer XPi, i.e. Required(XPi). Whenever Required(XPi) contains a reasonably large part of FQ and the priority i of XPi is large enough, the part of FQ which is also required for the evaluation of XPi is dropped from FQ. Finally, the server returns the fragment FQ, i.e. Required(XQ) minus those parts which are already covered by some of the XPi.

Additionally, the server splits the XPi into two groups. The first group contains the previous queries which were not used in order to reduce the answer of Required(XQ). These queries keep their original priority. The second group contains those previous query expressions which belong to results which will be reused for Required(XQ). These queries get an increment on their priorities in order to avoid having their nodes replaced.

In order to keep the client's data up to date, the server has to monitor changes and to compare them with the timestamps of the client's queries.
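
The server-side difference computation can be sketched in Python as follows. The sketch rests on several assumptions that are not part of the paper: fragments are modelled as sets of number paths, a single last_change timestamp stands for the most recent server-side update, a previous result is skipped when its timestamp is older than that update, and the thresholds min_overlap and min_priority stand in for the unspecified notions of "reasonably large" and "large enough". All names are hypothetical.

```python
def difference_fragment(required_xq, previous, last_change,
                        min_overlap=1, min_priority=1):
    """Return (FQ, reused) for a new query XQ.

    required_xq is Required(XQ) as a set of number paths. previous is
    ordered XP1..XPn (least to most important); each entry is a pair
    (Required(XPi) as a set, timestamp of the cached result).
    """
    fq = set(required_xq)
    reused = []                                # priorities i to be incremented
    for i in range(len(previous), 0, -1):      # XPn down to XP1
        paths, timestamp = previous[i - 1]
        if timestamp < last_change:            # outdated result: not reused
            continue
        overlap = fq & paths
        if len(overlap) >= min_overlap and i >= min_priority:
            fq -= overlap                      # already available on the client
            reused.append(i)
    return fq, reused
```

The returned set is what remains to be transmitted, and the list of reused priorities identifies the second group of previous queries, whose priorities the server would raise.
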
Whenever a previous query of the client has an older timestamp than the current document or document fragment which is stored at the server, the server considers the old query result of the client to be outdated. Therefore, the server does not include this outdated query in the difference computation. Instead, the updated fragment is transmitted to the client as part of the query result for XQ, i.e. the client merges this updated data with the other data in its cache and uses the current timestamp of XQ for further references to this data.

4. REQUIRED NODES FOR AN XPATH EXPRESSION

The goal of this section is to explain how we compute FQ = Required(XQ), i.e. the fragment of the XML document on the server (and the fragment of the tree stored in the client's cache) which is required in order to answer the query XQ.

Our approach to identifying the nodes required for a given query is based on root paths, i.e. on paths from the root to certain nodes of an XML document. Within other approaches (e.g. [1]), each axis in an XPath expression is considered as a function call which maps a current set of context nodes to a new set of context nodes. An XPath expression is a sequence of location steps. Whenever a location step is applied to a set of context nodes, the axis of the location step defines the function to be applied. In comparison to these approaches, we follow an idea presented in [6]. The idea is to consider root paths to context nodes, i.e. paths from the root node to a context node, instead of the context nodes themselves. Then a location step can be considered as a function which maps a set of root paths to context nodes to a different set of root paths to context nodes.

Let us consider Example 1. The usual approach ([1]) regards a child-axis location step E1/E2 as a mapping of a set containing the node <E1> onto the set containing three nodes with the element name <E2>.
In comparison, we regard the child-axis location step E1/E2 as a mapping of the set of paths { 0.1(E1) } onto the following set containing three paths { 0.1(E1).1(E2), 0.1(E1).2(E2), 0.1(E1).3(E2) }, i.e. we argue that the location step extends an existing set of root paths to context nodes to a new set of root paths to context nodes.

The computation of Required(XP) consists of multiple steps. The first step is to find all selected root paths (SRP) in the XML document or within the considered fragment of the XML document, i.e. all paths from the root node to elements which are selected by XP. The set of relevant root paths (RRP) is initialized with this set of selected root paths. The second step is to expand XP along each selected root path once again, location step by location step, in order to find the nodes which are relevant to the filters used in XP. Whenever a filter has been applied to a context node on a selected root path (SRP), all the root paths to nodes which are accessed by the filter are added to the set of relevant root paths. This is done recursively until all (possibly nested) filters have been investigated.

We omit the details of filter evaluation here, because they are described in [6]. Instead, we continue with an example XPath query /E1/E2[E3]/E4 applied to the document of Example 1. Let us assume that the document of Example 1 contains exactly the following additional nodes. Each node E2 contains exactly one child node E4, and only the third node E2 contains an additional child node E3. Then the selected root path is 0.1(E1).3(E2).1(E4), where the filter [E3] is applied to the final node of the path 0.1(E1).3(E2). Because the root path 0.1(E1).3(E2).1(E3) is required to evaluate the filter, this root path is also included in the set of relevant root paths of the XPath query /E1/E2[E3]/E4.

The resulting set of relevant root paths contains all the paths which are required in order to answer a query XP correctly. Since these paths have common prefixes, we merge all the relevant root paths into a single tree, which we call the required XML fragment for a query XP, or Required(XP) for short.

5. A COMPARISON TO SOME ADDITIONAL RELATED WORK

Recent work on projecting XML documents [11] is similar to our approach to the computation of the relevant fragment of an XML document, as it also reduces XML fragments to those parts necessary to answer a query. However, we allow for a wider class of queries, because we also include filters with an "=" comparison. Furthermore, we extend our approach to the computation of difference fragments, which are the basis for a reduced data transfer between server and client.

Our idea to use a numbering scheme for nodes in the XML document in order to support the client with the merging of XML documents has been inspired by previous work on bisimulation (e.g. [12]) and XML indexing (e.g. [2]). In comparison to these approaches, our focus is on the merging of different fragments, and the application context of our merging technique is very different from the research carried out in the fields of bisimulation and XML indexing.

Finally, there have been contributions to the area of XML databases with an increasing interest in query optimization and caching for XML data (e.g. [9]) and efficient storage of XML data (e.g. [7]).
Additionally, many contributions have proposed ideas for cache management in distributed DBMS, e.g. [3,5,8]. While these contributions integrate the cache into the query processing environment by caching the results of queries, other contributions improve cache management by the storage of deltas, e.g. for OODBMS [4] and for XML [10]. In common with the last approach, we compute deltas; however, we follow [1] in computing delta XML fragments on the server side, and our focus is to restrict the communication between server and client to the exchange of delta XML fragments. In contrast to the approach taken in [1], however, we present fragment integration, consider required node sets, include the management of timestamps, and avoid server-side copies of the client's cache. Different from all other contributions, we combine a server-side difference fragment computation based on timestamps with a client-side integration technique based on a unique numbering scheme.

6. SUMMARY AND CONCLUSIONS

We have presented a client-side caching and data integration mechanism for XML fragments, which allows a server to answer client queries XQ as follows. The server uses the new client query XQ and the previous client queries XPi in order to compute the difference XML fragment, i.e. the fragment which is missing on the client side in order to compute the query result for XQ. This computation includes query timestamps and data update timestamps in such a way that outdated data in the client's cache is not reused. In order to recompute a query result for a query XQ on the server side or on the client side, we need (the document nodes of) the XML fragment Required(XQ) to be available on that side. Required(XQ) is used on the client side in order to recompute a query result. An identical fragment Required(XQ) is used on the server side in order to precompute a query result, which is a major preparatory step towards computing the difference XML fragment for XQ.
Furthermore, the client's replacement strategy for nodes is negotiated between client and server in the sense that the server respects the client's replacement priorities wherever possible, i.e. as long as no previous query result is replaced which is required for the current query XQ. Altogether, we consider our approach to be promising with respect to reducing communication costs between a web server and mobile clients which exchange XML document fragments.

7. REFERENCES

[1] Böttcher, S., Türling, A.: XML Fragment Caching for Small Mobile Internet Devices. 2nd International Workshop on Web-Databases, Erfurt, October. Springer, LNCS 2593, Heidelberg.
[2] Buneman, P., Grohe, M., Koch, C.: Path Queries on Compressed XML. VLDB.
[3] Dar, S., Franklin, M., Jonsson, B., Srivastava, D., Tan, M.: Semantic data caching and replacement. In Proc. 22nd VLDB, Bombay.
[4] Doherty, M., Hull, R., Rupawalla, M.: Structures for manipulating proposed updates in object-oriented databases. In SIGMOD.
[5] Franklin, M., Jonsson, B., Kossmann, D.: Performance tradeoffs for client-server query processing. In Proceedings of the ACM-SIGMOD Conference on Management of Data (Montreal, Que., June). ACM, New York, NY.
[6] Groppe, S., Böttcher, S.: XPath Query Transformation based on XSLT Stylesheets. Fifth International Workshop on Web Information and Data Management (WIDM'03), New Orleans, Louisiana, USA, November.
[7] Kanne, C.-C., Moerkotte, G.: Efficient Storage of XML Data. Proc. of the 16th Int. Conf. on Data Engineering (ICDE), San Diego, March 2000.
[8] Kossmann, D., Franklin, M.J., Drasch, G.: Cache Investment: Integrating Query Optimization and Distributed Data Placement. ACM ToDS, Vol. 25, No. 4, Dec.
[9] Li, Q., Moon, B.: Indexing and Querying XML Data for Regular Path Expressions. Proc. of the 27th VLDB, Roma.
[10] Marian, A., Abiteboul, S., Mignet, L.: Change-Centric Management of Versions in an XML Warehouse. Proc. of the 27th VLDB, Roma.
[11] Marian, A., Siméon, J.: Projecting XML Documents. VLDB.
[12] Ramanan, P.: Covering Indexes for XML Queries: Bisimulation - Simulation = Negation. VLDB.
[13] XML Path Language (XPath) Version 1.0. W3C Recommendation, November.