Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Huayu Wu
Institute for Infocomm Research, A*STAR, Singapore

Abstract. Processing XML queries over big XML data using MapReduce has been studied in recent years. However, the existing works focus on partitioning XML documents and distributing XML fragments over different compute nodes. This approach may introduce high overhead in transferring XML fragments from one node to another during MapReduce execution. Motivated by the structural join based XML query processing approach, which reads only the relevant inverted lists to process queries in order to reduce I/O cost, we propose a novel technique that uses MapReduce to distribute the labels in inverted lists over a computing cluster, so that structural joins can be performed in parallel to process queries. We also propose an optimization technique that reduces the computing space in our framework, to improve the performance of query processing. Finally, we conduct experiments to validate our algorithms.

1 Introduction

1.1 Context

The increasing amount of data generated by different applications, sensors and devices, and the increasing attention to the value of such data, mark the beginning of the era of big data. From big IT companies to SMEs, from computer science researchers to social scientists, everyone nowadays is talking about big data and its impact on business, technology, and scientific research. There is no doubt that effectively managing big data is the first step for any further analysis and utilization of it. How to manage such data, considering its large size and diverse formats, is a new challenge for the database community.
On one hand, researchers are keen to find a more elastic and more reliable solution than the traditional distributed database system to store big relational (structured) data and offer SQL-like query capability; on the other hand, for emerging datasets in heterogeneous models, new platforms and databases for semi-structured data, text data, graph data, etc. were designed, aiming to provide more efficient access to such data at large scale. Gradually, these research attempts converged on a distributed data processing framework, MapReduce. The MapReduce programming model simplifies parallel data processing by offering two interfaces: map and reduce. With system-level support for computational resource management, a user only needs
to implement the two functions to process the underlying data, without caring about the extensibility and reliability of the system. There are extensive works implementing database operators and database systems on top of the MapReduce framework.

1.2 Motivation

Recently, researchers started looking into the possibility of managing big XML data in a more elastic distributed environment, such as Hadoop, using MapReduce. Inspired by XML-enabled relational database systems, big XML data can be stored and processed using relational storage and operators. However, shredding XML data of big size into relational tables is extremely expensive. Furthermore, with relational storage, each XML query must be processed by several θ-joins among tables. The cost of joins is still the bottleneck for Hadoop-based database systems. Most recent research attempts leverage the idea of XML partitioning and query decomposition adopted from distributed XML databases. Similar to the join operation in relational databases, an XML query may require linking two or more arbitrary elements across the whole XML document. Thus, to process XML queries in a distributed system, transferring fragmented data from one node to another is unavoidable. In a static environment like a distributed XML database system, proper indexing techniques can help to optimally distribute data fragments and the workload. However, in an elastic distributed environment such as Hadoop, each copy of an XML fragment will probably be transferred to undeterminable different nodes for processing. In other words, it is difficult to optimize data distribution in a MapReduce framework, so the existing approaches may suffer from high I/O and network transmission cost. Meanwhile, approaches for centralized XML query processing have been studied for over a decade. One highlight is the popularity of the structural join based approach (e.g., ).
Compared to other native approaches, such as the navigational approach and the subsequence matching approach, one main advantage of the structural join approach is its savings in I/O cost. In particular, in the structural join approach, only the few inverted lists corresponding to the query nodes are read from disk, rather than going through all the nodes in the document. It would be beneficial to adapt such an approach to the MapReduce framework, so that disk I/O and network cost can be reduced.

1.3 Contribution

In this paper, we study the parallelization of structural join based XML query processing algorithms using MapReduce. We do not distribute a whole big XML document over a computer cluster; instead, we distribute the inverted lists for each type of document node to be queried. As mentioned, since the size of the inverted lists used to process a query is much smaller than the size of the whole XML document, our approach potentially reduces the cost of cross-node data transfer. Our contribution can be summarized as follows:
- The problem of parallelizing structural joins for XML query processing using MapReduce is studied. To the best of our knowledge, this is the first work that discusses distributing inverted-list labels rather than the raw XML document for parallel structural joins.
- A polynomial-based workload distribution algorithm is designed for the Map phase, which balances the workload of the Reduce tasks.
- An optimization technique is proposed to avoid emitting nodes to reducers where they do not contribute to structural join results.
- We conduct experiments to validate our algorithms.

1.4 Organization

The rest of the paper is organized as follows. In Section 2, we introduce the background knowledge and revisit related work. In Section 3, we present the map and reduce functions in our framework for parallel XML query processing. In Section 4, an optimization algorithm is proposed so that unnecessary emitting can be pruned in mappers. Section 5 presents the experimental study validating the proposed algorithms. Finally, we conclude this paper in Section 6.

2 Background and Related Work

2.1 MapReduce

MapReduce is a computational model that originated in functional programming languages and was introduced to computer clusters for parallel data processing. It simplifies the programmer's implementation of parallel data processing by offering two user-defined functions, map and reduce. The map function takes a set of key/value pairs as input. After a MapReduce job is submitted to the system, the map tasks (normally referred to as mappers) are started on certain compute nodes. Each map task executes the user-implemented map function over every key/value input pair. The output of the map function is another set of key/value pairs, which are temporarily stored in the local file systems and sorted by key. When all map tasks finish executing, the system notifies the reduce tasks (referred to as reducers) to start.
The reducers pull the key/value pairs output by the mappers in parallel and combine them into different lists for different keys. This step is similar to the group-by operator in SQL. Then, for each key, the list of values is processed by a reducer according to the user-specified reduce function. This step is similar to an aggregate function in SQL. Finally, the results are written back to disk.

2.2 XML Query Processing

There are different approaches to processing XML queries. Initially, XML documents were shredded into relational tables and queries were translated into SQL statements to query the database (e.g., ). This is the so-called relational approach.
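The map, shuffle (group-by), and reduce steps described above can be sketched in a few lines. The in-memory `run_mapreduce` driver and the word-count job below are illustrative stand-ins for the real distributed execution, not part of the paper's system:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-memory sketch of the MapReduce data flow: map each input
    pair, group the emitted pairs by key (the shuffle, analogous to SQL
    GROUP BY), then reduce each key's sorted value list (analogous to an
    SQL aggregate function)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):   # map phase
            groups[out_key].append(out_value)           # shuffle: group by key
    # reduce phase: one call per key over its sorted value list
    return {k: reduce_fn(k, sorted(vs)) for k, vs in groups.items()}

# toy job: count occurrences of each word
docs = [(0, "map reduce map"), (1, "reduce")]
counts = run_mapreduce(
    docs,
    map_fn=lambda _, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda _, ones: sum(ones),
)
# counts == {"map": 2, "reduce": 2}
```

The same driver shape is reused conceptually later in the paper: the map function assigns e-id keys to labels, and the reduce function performs structural joins per key.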
However, in the relational approach, an XML query requires several table joins, most of which are expensive θ-joins. Later, researchers proposed to process XML queries in their native form. Among the different native approaches, the structural join approach (e.g., ) is considered more efficient because it reduces I/O cost compared to other approaches. In a structural join approach, an XML document is encoded with positional labels. The labels for each type of document node are organized in an inverted list. To process a query, only the inverted lists related to the queried node types are loaded and scanned; all other document nodes are ignored. Finally, structural joins are performed over the inverted lists to find the query answers.

2.3 XML Query Processing using MapReduce

To process XML queries using MapReduce, we need to decompose a big XML document, distribute portions of it to different sites, and execute the query processing algorithm in parallel at the different sites. Obviously, the relational approach is not suitable, because transforming a big XML document into relational tables can be extremely time consuming, and θ-joins among relational tables are expensive using MapReduce. Recently, several works have been proposed to implement native XML query processing algorithms using MapReduce. In , the authors proposed a distributed algorithm for Boolean XPath query evaluation using MapReduce. By collecting the Boolean evaluation results from a computer cluster, they proposed a centralized algorithm to finally process a general XPath query. In other words, they did not use the distributed computing environment to generate the final result, and it remains unclear whether the centralized step would become the bottleneck when the data is huge. In , a Hadoop-based system was designed to process XML queries using the structural join approach. There are two phases of MapReduce jobs in the system.
In the first phase, an XML document is shredded into blocks and scanned against the input queries. The path solutions are then sent to the second MapReduce job to be merged into final answers. The first problem of this approach is the loss of the holistic way of generating path solutions; the beauty of the structural join approach is precisely to minimize intermediate path solutions by considering the whole query pattern during structural joins. The second problem is that the simple path filtering in the first MapReduce phase is not suitable for processing //-axis queries over complex-structured XML data with recursive nodes, as pointed out by many prior research works. A recent demo  built an XML query/update system on top of the MapReduce framework. The technical details were not thoroughly presented; from the system architecture, it shreds an XML document and asks each mapper to process queries against each XML fragment. In fact, most existing works are based on XML document shredding, which is inspired by document partitioning in distributed XML databases . However, in the MapReduce framework, the distributed computing environment
is assumed to be dynamic and elastic, which makes XML fragment distribution difficult to optimize. Hence, in such a setting, fragmented data (from either the original document or intermediate results) may need to be transferred to undeterminable different sites, which leads to high I/O and network transmission cost. Motivated by the inverted-list based structural join algorithms, in this paper we propose an approach that distributes inverted lists rather than the raw XML document, so that the size of the fragmented data for I/O and network transmission can be greatly reduced.

3 Framework

In this section, we introduce our proposed MapReduce framework for parallel XML query processing. As mentioned, in our framework we do not distribute the raw XML data, because the raw data is large and causes high I/O and network transmission overhead during parallel processing in MapReduce. Instead, we label the document first and load the inverted lists, rather than the raw document, into a distributed file system. Obviously, the size of the inverted lists useful for a given query is much smaller than the size of the raw document, so the I/O and network transmission cost can be minimized.

3.1 Document Labeling

Labeling an XML document is an essential step for most XML query processing algorithms. In this paper, we do not repeat the research on XML labeling, but only emphasize two points related to big XML data. First, document labeling is a one-time effort for a given XML document, and independent of query processing. In most labeling schemes, a stack is maintained to store input document nodes so that the relationships among them can be identified. As document labeling proceeds, push and pop operations are executed on the stack. The maximum size of the stack is the maximum depth of the document. As a result, even if the document is large, memory is not an issue for document labeling.
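The stack-based labeling just described can be sketched as follows. The event-stream input and the (start, end, level) containment label format are illustrative assumptions; the paper does not fix a particular scheme at this point, and the stack never grows beyond the document depth:

```python
def containment_labels(events):
    """Assign (start, end, level) containment labels to the elements of an
    XML event stream, using a stack whose maximum size is the document
    depth. `events` is a list of ('start', tag) / ('end', tag) pairs, a
    stand-in for a streaming (SAX-style) parse. Labels are grouped per tag,
    which is exactly the inverted-list organization used for joins."""
    labels, stack, counter = {}, [], 0
    for kind, tag in events:
        counter += 1
        if kind == 'start':
            stack.append((tag, counter))      # push: remember start position
        else:
            open_tag, start = stack.pop()     # pop: the element is closed
            labels.setdefault(open_tag, []).append(
                (start, counter, len(stack) + 1))
    return labels

# <a><b/><b/></a>
events = [('start', 'a'), ('start', 'b'), ('end', 'b'),
          ('start', 'b'), ('end', 'b'), ('end', 'a')]
labs = containment_labels(events)
# labs == {'b': [(2, 3, 2), (4, 5, 2)], 'a': [(1, 6, 1)]}
```

With containment labels, ancestorship reduces to interval containment: a is an ancestor of d iff a.start < d.start and d.end < a.end.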
Second, using existing techniques (e.g., ), re-labeling can be totally avoided when an XML database is updated. For a centralized XML query processing framework, the only overhead in dealing with document updates is the re-sorting of the relevant inverted lists. In our framework, node labels are distributed to different reducers and re-sorted anyway; thus, document updates bring little trouble to our framework.

3.2 Framework Overview

In our approach, we implement the two functions, map and reduce, in a MapReduce framework, and rely on the underlying system, e.g., Hadoop, for program execution. The basic idea is to equally (or nearly equally) divide the whole computing space for a set of structural joins into a number of sub-spaces, each of which will be handled by one reducer performing structural joins.
For example, suppose an XML query involves three query nodes, namely A, B and C; then to process this query, a set of structural joins among the inverted lists I_A, I_B and I_C (for A, B and C respectively) will be performed. If we split each inverted list into three partitions, the total computing space (with size |I_A| x |I_B| x |I_C|) will be divided into 27 sub-spaces, as shown in Fig. 1.

Fig. 1. Example of computing space division

Each mapper will take a set of labels in an inverted list as input, and emit each label with the ID of the associated sub-space (called e-id, standing for emit id). The reducers will take the grouped labels for each inverted list, re-sort them, and apply holistic structural join algorithms to find answers. The whole process is shown in Fig. 2, and the details will be explained in the following sections.

3.3 Design of Mapper

The main task of a mapper is to assign a key to each incoming label, so that the labels from each inverted list are nearly equally distributed over a given number of sub-spaces for the reducers to process. To achieve this goal, we adopt a polynomial-based emit id assignment. For a query with n different query nodes, i.e., using n inverted lists, we divide each inverted list into m sub-lists. Then the total number of sub-spaces for computing is m^n. Fig. 1 shows an example where m = 3 and n = 3. We construct a polynomial function f of m with the highest degree n-1 to help with emit id assignment. Each of the inverted lists corresponds to a coefficient, with degrees from n-1 down to 0.
f(m) = a_{n-1} m^{n-1} + a_{n-2} m^{n-2} + ... + a_1 m + a_0, where a_i ∈ [0, m-1] for i ∈ [0, n-1]   (1)

The procedure that a mapper uses to emit an input label is shown in Algorithm 1.

Example 1. We use the example in Fig. 2 to explain Algorithm 1. Suppose we need to process a query with three query nodes (i.e., n=3), A, B and C, and we
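Evaluating Eq. (1) for a fixed m is simply a base-m positional encoding of the coefficient vector. A minimal sketch (the function name `e_id` is ours):

```python
def e_id(coeffs, m):
    """Evaluate Eq. (1): f(m) = a_{n-1} m^{n-1} + ... + a_1 m + a_0,
    where coeffs = [a_{n-1}, ..., a_0] and each a_i is a partition id in
    [0, m-1]. The result is the sub-space (reducer) id in [0, m^n - 1]."""
    value = 0
    for a in coeffs:
        value = value * m + a      # Horner's rule
    return value

# m = 3, n = 3 as in Fig. 1: the partition triple (a_A, a_B, a_C) = (2, 1, 0)
# maps to sub-space 2*9 + 1*3 + 0 = 21
```

Because each a_i ranges over [0, m-1], distinct coefficient vectors map to distinct ids, which is the basis of Theorem 1 below.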
divide each inverted list into 2 partitions (i.e., m=2). In Mapper 1 in Fig. 2, the mapper first assigns a random number b-id ∈ [0, 1] to each value, i.e., to each incoming label from A's inverted list. Here b-id stands for the ID of the partition to which the incoming label belongs. By randomly choosing b-id, the labels in an inverted list are nearly equally divided into m partitions. The list of (b-id, label) pairs in Fig. 2 shows the result of inverted list partitioning.

Fig. 2. Data flow of our proposed framework

Algorithm 1 Map Function
Input: an empty key, and a label l from an inverted list I as the value; variables m for the number of partitions in each inverted list, n for the number of inverted lists, and r, a random integer between 0 and m-1
1: identify the coefficient corresponding to I, i.e., a_i where i is the index
2: initiate an empty list L
3: toemit(L, i)
/* m, n, r and l are globally visible to Functions 1 and 2 */
Function 1 toemit(List L, index i)
1: if L.length == number of inverted lists then
2:   Emit(L)
3: else
4:   if L.length == i then
5:     toemit(L.append(r), i)
6:   else
7:     for all j ∈ [0, m-1] do
8:       toemit(L.append(j), i)
9:     end for
10:  end if
11: end if

Function 2 Emit(List L)
1: initiate the polynomial function f(m) as defined in (1)
2: set the coefficients of f to the integers in L, in order
3: calculate the value of f for the given value of m, as the emit key e-id
4: emit(e-id, l)

In the second step, for each label the corresponding sub-spaces for structural joins are found by polynomial calculation. In this example, the polynomial function is f(m) = a_A m^2 + a_B m + a_C, where a_A, a_B and a_C are the coefficients corresponding to the query nodes A, B and C. For the first mapper, which handles the inverted list for A, a_A will be the b-id randomly assigned in the previous step, while a_B and a_C will vary between 0 and 1 according to Function 1. For the label 1.1, its b-id is assigned as 1, and there will be four values for its e-id. Thus, the mapper will emit this label four times. The process to calculate e-id will be discussed in detail in the next example. Similarly, all labels will be emitted for shuffling.

Correctness of Polynomial-based Emitting

Theorem 1. The polynomial function in (1) produces exactly m^n values spanning [0, m^n - 1] when each coefficient a_i takes the m different values in [0, m-1]. Thus the partitioning is complete.

This is a property of polynomial functions, and the proof is omitted. We use an example to illustrate.

Example 2. Consider the cube in Fig. 1, which stands for a division of the computing space with three inverted lists (n=3), where each inverted list is partitioned into three regions (m=3). The numeric id on each small cube stands for the calculated polynomial value of that sub-space.
Suppose a mapper assigns integer 0 (randomly chosen from 0 to 2) to a label in the inverted list C; the sub-spaces to which the label is emitted are then calculated by the polynomial function f(m) = a_A m^2 + a_B m + 0, where a_A, a_B ∈ [0, 2]. By taking the different values of a_A and a_B and setting m = 3,
the label will eventually be emitted to 9 sub-spaces, i.e., 0, 3, 6, 9, 12, 15, 18, 21 and 24, as shown in Fig. 1.

Complexity of Polynomial-based Emitting

According to the proposed polynomial-based emitting, each label will be emitted to m^{n-1} sub-spaces for parallel processing. Although this number theoretically grows exponentially with n, we claim that this complexity is unavoidable and, in fact, manageable in practice. First, theoretically, performing a general join (a Cartesian product with conditional filters) across n tables (inverted lists), each of which contains m records (labels), requires m^n computations. With certain indexing and algorithmic techniques, this complexity can be reduced for centralized data processing. However, when the data is too big to be managed by a centralized machine, most indexing techniques are not adoptable. In the case of XML structural joins, pre-scanning all inverted lists to record some statistical information may help in designing more efficient workload distribution algorithms. However, this process cannot be done on the fly with query processing; on the other hand, making it static introduces other issues, such as dealing with data updates. How to design a more efficient map function for XML query processing remains an open research problem. We propose an on-the-fly optimization technique in the next section, though it does not change the order of the complexity. Second, the value m^n is actually the number of sub-computing-spaces over which the workload is distributed. In most queries, the number of query nodes, i.e., n, is quite small. Also, the actual value of m^n can be controlled through m, i.e., the number of partitions in each inverted list. After all, the total number of sub-spaces should be set based on hardware capacity and the application.
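Both Example 2 and Theorem 1 can be checked numerically. The sketch below is our own simplification of Algorithm 1 with Functions 1 and 2: it enumerates coefficient tuples instead of recursing, fixing the label's own coefficient and ranging over the others:

```python
from itertools import product

def emit_ids(i, r, m, n):
    """Sub-space ids a label is emitted to: coefficient a_i (index i counts
    from the highest degree n-1 down to 0) is fixed to the label's partition
    r; every other coefficient ranges over all partitions [0, m-1]."""
    ids = []
    for coeffs in product(range(m), repeat=n):
        if coeffs[i] == r:
            eid = 0
            for a in coeffs:          # evaluate Eq. (1) by Horner's rule
                eid = eid * m + a
            ids.append(eid)
    return sorted(ids)

m, n = 3, 3
# Example 2: a C label (lowest-degree coefficient, index 2) in partition 0
# lands in sub-spaces 0, 3, 6, ..., 24
example2 = emit_ids(2, 0, m, n)       # [0, 3, 6, 9, 12, 15, 18, 21, 24]

# Theorem 1: over all partitions r, the ids cover [0, m^n - 1] exactly once
all_ids = sorted(e for r in range(m) for e in emit_ids(2, r, m, n))
# all_ids == [0, 1, ..., 26]
```

Each call also confirms the complexity claim above: a single label reaches exactly m^{n-1} sub-spaces (here 9).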
3.4 Design of Reducer

After a reducer collects all labels, from all inverted lists, that share the same e-id (key), it can start processing the XML query over this sub-space. Since the map function only splits the computing space into small sub-spaces without performing other operations on the data, any structural join based algorithm can be implemented in the reduce function to process queries. In our implementation, we follow the holistic structural join algorithms (e.g., ), because this class of algorithms is proven optimal for many query cases. In the example in Fig. 2, after getting a subset of labels from each inverted list, each reducer sorts the lists and then performs a holistic structural join over them to find answers. Fig. 2 shows the process of executing the XPath query //A[//B]//C.

4 Optimization

Building on the reduce function we implement, we design an optimization technique to prune certain nodes that will not contribute to structural join results.
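The reduce phase described in Section 3.4 can be sketched concretely. Dewey-style prefix labels (as in Fig. 2) and a plain nested-loop join stand in here for a real holistic algorithm such as a TwigStack-style join; the query is //A[//B]//C:

```python
def is_ancestor(a, d):
    """Dewey-style ancestor test: 1.2 is an ancestor of 1.2.5."""
    return d.startswith(a + '.')

def reduce_join(groups):
    """Sketch of the reduce phase for //A[//B]//C: sort each inverted-list
    subset received for one e-id, then join. A real reducer would run a
    holistic structural join; the nested loops here only show the
    semantics."""
    A = sorted(groups.get('A', []))
    B = sorted(groups.get('B', []))
    C = sorted(groups.get('C', []))
    return [(a, b, c)
            for a in A
            for b in B if is_ancestor(a, b)
            for c in C if is_ancestor(a, c)]

# labels one reducer might receive for a single sub-space, as in Fig. 2
matches = reduce_join({'A': ['1.2'], 'B': ['1.2.1', '1.2.5'], 'C': ['1.2.9']})
# matches == [('1.2', '1.2.1', '1.2.9'), ('1.2', '1.2.5', '1.2.9')]
```

The result matches the answers shown for Reducer 1 in Fig. 2; a reducer whose sub-space holds no compatible labels simply returns nothing.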
Let us start with a motivating example. Suppose a structural join algorithm tries to process the query //A[//B]//C. In the sorted inverted list for A, the first label is 1.3, while the first label in the sorted inverted list for B precedes it in document order. Obviously, performing the structural join A//B between 1.3 and that B label will not return an answer. In other words, the first label (and maybe the first few labels) in the inverted list for B can be skipped. This example motivates our optimization, which prunes certain labels during label distribution in the map function.

4.1 Statistics Collected during Document Labeling

During document labeling, we collect some statistics to aid label distribution. Basically, for each sorted inverted list, we take a sample every t labels. The samples stand for the cut-off labels when the inverted list is divided into segments of size t. The value of t can be varied based on document size; in our heuristics, we set t=10,000. The size of such statistical data is 1/t of the inverted list size. The collected statistics are used to construct an index, called the cut-off index, which guides the assignment of a partition to a label in an inverted list. Normally the number of partitions (the value m) for an inverted list in our framework is small (it must be smaller than t), which means the cut-off labels of the different partitions can be derived from the statistics mentioned above. Then, given a label, we can compare it with the cut-off index to decide to which partition it belongs.

4.2 Selective Emitting

Recall that in Algorithm 1, when a map function emits a label, it randomizes a local partition (represented by the coefficient of the corresponding term in the polynomial function) and considers all possible partitions of the other inverted lists (represented by all possible values of the other terms' coefficients) for the label to emit.
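The cut-off index and the partition lookup of Section 4.1 can be sketched as follows. For brevity, the cut-offs are derived directly from a small sorted list rather than from samples collected every t labels; the function names are ours:

```python
from bisect import bisect_right

def build_cutoff_index(sorted_labels, m):
    """Derive the m-1 cut-off labels that split a sorted inverted list into
    m near-equal partitions. In the paper the cut-offs are derived from
    samples taken every t labels during document labeling; here we read
    them straight off the sorted list."""
    step = len(sorted_labels) // m
    return [sorted_labels[k * step] for k in range(1, m)]

def partition_of(label, cutoffs):
    """Partition id (the b-id) of a label: the number of cut-offs it passes."""
    return bisect_right(cutoffs, label)

labels = ['1.1', '1.2', '1.3', '1.4', '1.5', '1.6']
cutoffs = build_cutoff_index(labels, 3)   # ['1.3', '1.5']
# '1.2' -> partition 0, '1.4' -> partition 1, '1.6' -> partition 2
```

With this index, the mapper replaces the random b-id of Algorithm 1 by a deterministic lookup, which is what Function 3 relies on.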
In our optimization, to emit a label from an inverted list I, we (1) set the local partition in I for the label according to the cut-off index, and (2) selectively choose the coefficients (i.e., the partitions) for all the ancestor inverted lists of I, such that the current label can join with the labels in those partitions and return answers. The toemit function (previously shown in Function 1) for an optimized mapper is presented in Function 3. The intuition of the optimization is to prune the emitting of a label to reducers in which it cannot produce structural join results. As shown in Fig. 3, rather than emitting a label l_1 from an inverted list I_N to all reducers, the optimization algorithm computes the reducers in which l_1 will not produce answers, and then avoids such emitting, as shown by the dotted arrows.

Example 3. We use an example to illustrate the optimized map function. Consider the XML twig pattern query in Fig. 4. There are 4 nodes in the query, so 4 inverted lists will be scanned for structural joins. If we divide each inverted list into 3 partitions, the two cut-off indices for the inverted lists for
Function 3 toemit_O(List L, index i)
Input: the partition cut-off index cutoff[x][y] for inverted list I_x and partition y; the current inverted list I_u with numerical id u; other variables are inherited from Algorithm 1
1: if L.length == number of inverted lists then
2:   Emit(L)
3: else
4:   if L.length == i then
5:     initiate v = 0
6:     while cutoff[u][v].precede(l) && v < m do
7:       v++
8:     end while
9:     toemit_O(L.append(v), i)
10:  else
11:    if the query node for I_{L.length} is an ancestor of the query node for I_u then
12:      initiate v = 0
13:      while cutoff[L.length][v].precede(l) && v < m do
14:        v++
15:      end while
16:      for all k ∈ [0, v-1] do
17:        toemit_O(L.append(k), i)
18:      end for
19:    else
20:      for all j ∈ [0, m-1] do
21:        toemit_O(L.append(j), i)
22:      end for
23:    end if
24:  end if
25: end if

A and C are shown in Fig. 4. Assuming the polynomial function for the mappers is f(m) = a_A m^3 + a_B m^2 + a_C m + a_D, the whole computing space will be divided into 81 sub-spaces. When an A label is processed by a mapper, the mapper will check the cut-off index for A and decide to put the label into the second partition, i.e., a_A = 1, because the label passes the cut-off value between the first and the second partitions. When a D label is distributed, the mapper will determine a_D, i.e., the local partition for the label. With the original map function, the label would then be emitted to all sub-spaces formed by its local partition and the combinations of all partitions of the other inverted lists; that is, the label would be emitted to 3^3 = 27 sub-spaces, even though in many of these sub-spaces the label cannot contribute to a structural join answer. In the optimized map function, the label is checked against the cut-off indices of all of D's ancestor nodes, i.e., A and C. Based on this index checking, the label will be emitted to the first two partitions of A's inverted list and only the first partition of C's inverted list. Thus for this label, a_A is 0 or 1, a_C is 0, and only a_B can take all 3 possible values from 0 to 2.
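The effect of Function 3 can be sketched by enumerating coefficient tuples instead of recursing (our own simplification). The cut-off values below are hypothetical, chosen so that, as in Example 3, a D label reaches only 6 of the 27 sub-spaces an unoptimized mapper would use; ancestor partitions are taken inclusively up to the label's own sorted position, a slight variant of Function 3's loop bound:

```python
from bisect import bisect_right
from itertools import product

def selective_emit_ids(label, own_idx, cutoffs, ancestors, m, n):
    """Sketch of the optimized map function (Function 3). `cutoffs` maps a
    list index to that inverted list's cut-off labels; `ancestors` holds
    the indices of lists that are query ancestors of the label's own list.
    The label's own coefficient is fixed via its cut-off index; ancestor
    coefficients range only over partitions that can still hold an
    ancestor of `label`; all other coefficients range over every
    partition."""
    choices = []
    for i in range(n):
        if i == own_idx:
            choices.append([bisect_right(cutoffs[i], label)])
        elif i in ancestors:
            v = bisect_right(cutoffs[i], label)   # last useful partition
            choices.append(list(range(v + 1)))
        else:
            choices.append(list(range(m)))
    ids = set()
    for coeffs in product(*choices):
        eid = 0
        for a in coeffs:                          # Eq. (1), Horner's rule
            eid = eid * m + a
        ids.add(eid)
    return ids

# hypothetical cut-off indices for Example 3 (lists A=0, B=1, C=2, D=3)
cutoffs = {0: ['1.2', '1.8'],   # A: label falls past the first cut-off
           2: ['1.6', '1.9'],   # C: label precedes both cut-offs
           3: ['1.4', '1.7']}   # D: the label's own list
ids = selective_emit_ids('1.5.3', own_idx=3, cutoffs=cutoffs,
                         ancestors={0, 2}, m=3, n=4)
# 6 sub-spaces instead of the 27 an unoptimized mapper would use
```

Here a_A ∈ {0, 1}, a_C = 0, a_B is free and a_D is fixed, so 2 x 3 x 1 x 1 = 6 emissions, matching the count at the end of Example 3.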
Fig. 3. Intuition of the optimization

Fig. 4. Example XML query and cut-off indices

Finally, the polynomial function f(m) will take 2*3 = 6 different values, so this label will be emitted to only the 6 sub-spaces where it can possibly contribute to structural join answers.

5 Experiment

5.1 Settings

All the programs were implemented in Java and run on a small Hadoop cluster with 5 slave nodes. Each slave node has a dual-core 2.93GHz CPU and 12GB of shared memory. The maximum memory allocated to each JVM is 2GB. Since our work does not aim at Hadoop tuning, we keep all default Hadoop parameters. We generated a synthetic XML dataset of 10GB based on the XMark schema. The document is labeled with the containment labeling scheme so that the size of each label is fixed. We randomly composed 10 twig pattern queries, with the number of query nodes varying from 2 to 5, for evaluation. The results presented in this section are based on the average running statistics. Note that this experimental study only shows the feasibility of the proposed MapReduce framework for XML structural joins and the effectiveness of the proposed optimization technique. We do not compare with other algorithms because we did not identify one with a similar philosophy of parallelizing structural joins, and it makes little sense to compare with methods that shred and distribute the raw XML document across compute nodes. Furthermore, efficiency on a single node is less important than the scalability of the algorithm in big data processing, as the overall performance can be improved simply by adding more compute nodes.