Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce
Huayu Wu
Institute for Infocomm Research, A*STAR, Singapore

Abstract. Processing XML queries over big XML data using MapReduce has been studied in recent years. However, existing works focus on partitioning XML documents and distributing XML fragments to different compute nodes. This approach may introduce high overhead for transferring XML fragments from one node to another during MapReduce execution. Motivated by the structural join based XML query processing approach, which reads only the relevant inverted lists to process a query in order to reduce I/O cost, we propose a novel technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster, so that structural joins can be performed in parallel to process queries. We also propose an optimization technique that reduces the computing space in our framework, improving query processing performance. Finally, we conduct experiments to validate our algorithms.

1 Introduction

1.1 Context

The increasing amount of data generated by different applications, sensors and devices, and the increasing attention to the value of that data, mark the beginning of the era of big data. From big IT companies to SMEs, from computer science researchers to social scientists, everyone is now talking about big data and its impact on business, technology, and scientific research. Without doubt, effectively managing big data is the first step toward any further analysis and utilization of it. How to manage such data, considering its large size and diverse formats, is a new challenge to the database community. On one hand, researchers are keen to find a more elastic and more reliable solution than the traditional distributed database system [15] to store big relational (structured) data and offer SQL-like query capabilities; on the other hand, for emerging datasets in heterogeneous models, new platforms and databases for semi-structured data, text data, graph data, etc. have been designed, aiming to provide more efficient access to such data at large scale. Gradually, these research attempts have converged on a distributed data processing framework, MapReduce [10]. The MapReduce programming model simplifies parallel data processing by offering two interfaces: map and reduce. With system-level support for computational resource management, a user only needs
to implement the two functions to process the underlying data, without worrying about the extensibility and reliability of the system. There has been extensive work implementing database operators [5][14][11] and database systems [3][13][7] on top of the MapReduce framework.

1.2 Motivation

Recently, researchers have started looking into the possibility of managing big XML data in a more elastic distributed environment, such as Hadoop [1], using MapReduce. Inspired by XML-enabled relational database systems, big XML data can be stored and processed with relational storage and operators. However, shredding big XML data into relational tables is extremely expensive. Furthermore, with relational storage, each XML query must be processed by several θ-joins among tables, and the cost of joins is still the bottleneck for Hadoop-based database systems. Most recent research attempts [9][8][4] leverage the idea of XML partitioning and query decomposition adopted from distributed XML databases [16][12]. Similar to the join operation in relational databases, an XML query may require linking two or more arbitrary elements across the whole XML document. Thus, to process XML queries in a distributed system, transferring fragmented data from one node to another is unavoidable. In a static environment like a distributed XML database system, proper indexing techniques can help to optimally distribute data fragments and the workload. However, in an elastic distributed environment such as Hadoop, each copy of an XML fragment will probably be transferred to different, undeterminable nodes for processing. In other words, it is difficult to optimize data distribution in a MapReduce framework, so the existing approaches may suffer from high I/O and network transmission cost. Different approaches for centralized XML query processing have in fact been studied for over a decade. One highlight is the popularity of the structural join based approach (e.g., [6]). Compared to other native approaches, such as the navigational approach and the subsequence matching approach, one main advantage of the structural join approach is its saving on I/O cost. In particular, the structural join approach reads from disk only the few inverted lists corresponding to the query nodes, rather than going through all the nodes in the document. It would be beneficial to adapt such an approach to the MapReduce framework, so that disk I/O and network cost can be reduced.

1.3 Contribution

In this paper, we study the parallelization of structural join based XML query processing algorithms using MapReduce. We do not distribute a whole big XML document to a computer cluster; instead, we distribute the inverted lists for each type of document node to be queried. As mentioned, since the size of the inverted lists used to process a query is much smaller than the size of the whole XML document, our approach potentially reduces the cost of cross-node data transfer. Our contributions can be summarized as follows:
- The problem of parallelizing structural joins for XML query processing using MapReduce is studied. To the best of our knowledge, this is the first work that discusses distributing inverted list labels rather than the raw XML document for parallel structural joins.
- A polynomial-based workload distribution algorithm is designed for the Map phase, which balances the workload of the Reduce tasks.
- An optimization technique is proposed to avoid emitting nodes to reducers where they do not contribute to structural join results.
- We conduct experiments to validate our algorithms.

1.4 Organization

The rest of the paper is organized as follows. In Section 2, we introduce the background knowledge and revisit related work. In Section 3, we present the map and reduce functions our framework uses to process XML queries in parallel. In Section 4, an optimization algorithm is proposed so that unnecessary emissions can be pruned in the mappers. Section 5 presents the experimental study validating the proposed algorithms. Finally, we conclude the paper in Section 6.

2 Background and Related Work

2.1 MapReduce

MapReduce is a computational model that originated in functional programming languages and was introduced to computer clusters for parallel data processing [10]. It simplifies the implementation of parallel data processing by offering two user-defined functions, map and reduce. The map function takes a set of key/value pairs as input. After a MapReduce job is submitted to the system, the map tasks (normally referred to as mappers) are started on certain compute nodes. Each map task executes the user-implemented map function over every input key/value pair. The output of the map function is another set of key/value pairs, which are temporarily stored in the local file systems and sorted by key. When all map tasks complete, the system notifies the reduce tasks (referred to as reducers) to start. The reducers pull the key/value pairs output by the mappers in parallel and combine them into different lists for different keys; this step is similar to the group-by operator in SQL. Then, for each key, the list of values is processed by a reducer according to the user-specified reduce function; this step is similar to an aggregate function in SQL. Finally, the results are written back to disk.
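To make the two interfaces concrete, the following is a minimal Hadoop skeleton in the standard org.apache.hadoop.mapreduce API. It is a generic word-count sketch of ours, not code from the paper; it only illustrates how the user supplies map and reduce while the framework handles grouping.

```java
// A minimal Hadoop job skeleton illustrating the two user-defined functions.
// Generic word-count example; not the paper's code.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token; the framework groups pairs by key.
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) ctx.write(new Text(token), ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get(); // aggregate, like SQL SUM
      ctx.write(word, new IntWritable(sum));
    }
  }
}
```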
2.2 XML Query Processing

There are different approaches to processing XML queries. Initially, XML documents were shredded into relational tables and queries were translated into SQL statements against the database (e.g., [17]); this is the so-called relational approach. However, in the relational approach an XML query requires several table joins, most of which are expensive θ-joins. Later, researchers proposed to process XML queries in their native form. Among the different native approaches, the structural join approach (e.g., [6]) is considered more efficient because it reduces I/O cost compared to other approaches. In the structural join approach, an XML document is encoded with positional labels, and the labels for each type of document node are organized in an inverted list. To process a query, only the inverted lists related to the queried node types are loaded and scanned; all other document nodes are ignored. Finally, structural joins are performed over the inverted lists to find the query answers.

2.3 XML Query Processing using MapReduce

To process XML queries using MapReduce, we need to decompose a big XML document, distribute the portions to different sites, and execute the query processing algorithm in parallel at the different sites. Obviously, the relational approach is not suitable, because transforming a big XML document into relational tables can be extremely time consuming, and θ-joins among relational tables are expensive using MapReduce. Recently, several works have proposed implementing native XML query processing algorithms using MapReduce. In [9], the authors proposed a distributed algorithm for Boolean XPath query evaluation using MapReduce. By collecting the Boolean evaluation results from a computer cluster, they proposed a centralized algorithm to finally process a general XPath query. In effect, they did not use the distributed computing environment to generate the final result, and it remains unclear whether the centralized step becomes the bottleneck of the algorithm when the data is huge. In [8], a Hadoop-based system was designed to process XML queries using the structural join approach. There are two phases of MapReduce jobs in the system. In the first phase, an XML document is shredded into blocks and scanned against the input queries; the path solutions are then sent to the second MapReduce job to be merged into final answers. The first problem with this approach is the loss of the holistic way of generating path solutions: the beauty of the structural join approach is precisely to minimize intermediate path solutions by considering the whole query pattern during structural joins. The second problem is that the simple path filtering in the first MapReduce phase is not suitable for processing //-axis queries over complex structured XML data with recursive nodes, as pointed out by many prior research works. A recent demo [4] built an XML query/update system on top of the MapReduce framework. The technical details were not thoroughly presented; judging from the system architecture, it shreds an XML document and asks each mapper to process queries against each XML fragment. In fact, most existing works are based on XML document shredding, which is inspired by document partitioning in distributed XML databases [16][12]. However, in the MapReduce framework, the distributed computing environment is assumed to be dynamic and elastic, which makes XML fragment distribution difficult to optimize.
Hence, in such a setting, fragmented data (from either the original document or intermediate results) may need to be transferred to different, undeterminable sites, which leads to high I/O and network transmission cost. Motivated by the inverted list based structural join algorithms, in this paper we propose an approach that distributes inverted lists rather than a raw XML document, so that the size of the fragmented data subject to I/O and network transmission can be greatly reduced.

3 Framework

In this section, we introduce our proposed MapReduce framework for parallel XML query processing. As mentioned, in our framework we do not distribute the raw XML data, because the raw data is large and causes high I/O and network transmission overhead during parallel processing in MapReduce. Instead, we label the document first and load the inverted lists, rather than the raw document, into a distributed file system. Obviously, the size of the inverted lists useful for a given query is much smaller than the size of the raw document, so the I/O and network transmission cost can be minimized.

3.1 Document Labeling

Labeling an XML document is an essential step for most XML query processing algorithms. In this paper, we do not repeat the research on XML labeling, but only emphasize two points related to big XML data. First, document labeling is a one-time effort for a given XML document, and it is independent of query processing. In most labeling schemes, a stack is maintained to store input document nodes so that the relationships among them can be identified. As document labeling proceeds, push and pop operations are executed on the stack. The maximum size of the stack is the maximum depth of the document. As a result, even if the document is large, memory is not an issue for document labeling. Second, using existing techniques (e.g., [18]), re-labeling can be totally avoided when an XML database is updated. In a centralized XML query processing framework, the only overhead for dealing with document updates is the re-sorting of the relevant inverted lists. In our framework, node labels are distributed to different reducers and re-sorted anyway, so document updates cause little trouble for our framework.
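For illustration, the sketch below shows such one-pass, stack-based labeling. The figures in this paper show dot-separated prefix labels (e.g., 1.2.5), so the sketch produces that style, although the experiments use the containment scheme [19]; the class and method names are ours. Note that the two stacks never grow beyond the document depth, which is the memory point made above.

```java
import java.util.ArrayList;
import java.util.List;

// One-pass, stack-based labeling with dot-separated prefix labels ("1.2.5").
// Illustrative sketch; the paper's experiments use containment labels [19].
public class DeweyLabeler {
  private final List<Integer> path = new ArrayList<>();   // label of the current element
  private final List<Integer> counts = new ArrayList<>(); // children seen so far at each depth

  public DeweyLabeler() { counts.add(0); }                 // counter for the root level

  // Called when the parser enters an element; returns the element's label.
  public String startElement() {
    int depth = path.size();
    int ordinal = counts.get(depth) + 1;                   // next sibling ordinal at this depth
    counts.set(depth, ordinal);
    path.add(ordinal);
    counts.add(0);                                         // fresh counter for the new, deeper level
    return join(path);
  }

  // Called when the parser leaves an element; both stacks shrink by one.
  public void endElement() {
    counts.remove(counts.size() - 1);
    path.remove(path.size() - 1);
  }

  private static String join(List<Integer> p) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < p.size(); i++) { if (i > 0) sb.append('.'); sb.append(p.get(i)); }
    return sb.toString();
  }

  // With prefix labels, the ancestor-descendant test needed by structural
  // joins is a simple prefix check.
  public static boolean isAncestor(String anc, String desc) {
    return desc.startsWith(anc + ".");
  }
}
```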
3.2 Framework Overview

In our approach, we implement the two functions, map and reduce, in a MapReduce framework and leverage an underlying system, e.g., Hadoop, for program execution. The basic idea is to divide the whole computing space for a set of structural joins equally (or nearly equally) into a number of sub-spaces, each of which is handled by one reducer that performs the structural joins. For example, suppose an XML query involves three query nodes, namely A, B and C; then to process this query, a set of structural joins among the inverted lists $I_A$, $I_B$ and $I_C$ (for A, B and C respectively) will be performed. If we split each inverted list into three partitions, the total computing space (of size $|I_A| \times |I_B| \times |I_C|$) is divided into 27 sub-spaces, as shown in Fig. 1.

[Fig. 1. Example of computing space division: a 3x3x3 cube whose axes are the partitions of the inverted lists for A, B and C, with each small cube carrying the numeric id of its sub-space.]

Each mapper takes a set of labels from an inverted list as input and emits each label with the ID of the associated sub-space (called e-id, standing for emit id). The reducers take the grouped labels for each inverted list, re-sort them, and apply a holistic structural join algorithm to find answers. The whole process is shown in Fig. 2, and the details are explained in the following sections.

3.3 Design of Mapper

The main task of a mapper is to assign a key to each incoming label, so that overall the labels from each inverted list are nearly equally distributed over a given number of sub-spaces for the reducers to process. To achieve this goal, we adopt a polynomial-based emit id assignment. For a query with n different query nodes, i.e., using n inverted lists, we divide each inverted list into m sub-lists. Then the total number of sub-spaces for computing is $m^n$. Fig. 1 shows an example where m = 3 and n = 3. We construct a polynomial function f of m with highest degree n-1 to guide the emit id assignment. Each of the inverted lists corresponds to one coefficient, with degrees from n-1 down to 0:

$f(m) = a_{n-1} m^{n-1} + a_{n-2} m^{n-2} + \cdots + a_1 m + a_0$, where $a_i \in [0, m-1]$ for $i \in [0, n-1]$   (1)
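For concreteness, eq. (1) maps a coefficient vector to a single integer key, which can be computed with Horner's rule. The following is a minimal sketch of ours (names are illustrative, not from the paper):

```java
// Sketch: compute the emit id (e-id) of eq. (1) from a coefficient vector
// (a_{n-1}, ..., a_0) using Horner's rule.
public final class EmitId {
  // coeffs[0] is a_{n-1} (highest degree), coeffs[coeffs.length-1] is a_0.
  static int emitId(int[] coeffs, int m) {
    int f = 0;
    for (int a : coeffs) {
      assert 0 <= a && a < m;   // each coefficient is a partition index in [0, m-1]
      f = f * m + a;            // Horner's rule
    }
    return f;                   // a unique value in [0, m^n - 1]
  }

  public static void main(String[] args) {
    // Fig. 1 setting: n = 3 lists (A, B, C), m = 3 partitions each. A label in
    // partition 1 of A, 0 of B, 2 of C lands in sub-space 1*9 + 0*3 + 2 = 11.
    System.out.println(emitId(new int[]{1, 0, 2}, 3)); // prints 11
  }
}
```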
The procedure by which a mapper emits an input label is shown in Algorithm 1.

Algorithm 1 Map Function
Input: an empty key; a label l from an inverted list I as the value; variable m, the number of partitions in each inverted list; variable n, the number of inverted lists; r, a random integer between 0 and m-1
1: identify the coefficient corresponding to I, i.e., a_i where i is its index
2: initiate an empty list L
3: toemit(L, i)
/* m, n, r and l are globally visible to Function 1 and Function 2 */

[Fig. 2. Data flow of our proposed framework: three mappers read the inverted lists for A, B and C, assign each incoming label a random partition id (b-id), and emit (e-id, label) pairs; the pairs are partitioned by e-id to reducers, each of which sorts its sub-lists and performs the structural join for //A[//B]//C, producing answers such as (1.2)(1.2.1)(1.2.9), (1.2)(1.2.5)(1.2.9) and (1.3)(1.3.2)(1.3.5).]

Example 1. We use the example in Fig. 2 to explain Algorithm 1. Suppose we need to process a query with three query nodes (i.e., n=3), A, B and C, and we divide each inverted list into 2 partitions (i.e., m=2). In Mapper 1 in Fig. 2, the mapper first assigns a random number b-id ∈ [0, 1] to each value, i.e., to each incoming label from A's inverted list. Here b-id stands for the ID of the partition to which the incoming label belongs; by choosing b-id at random, the labels in an inverted list are divided nearly equally into m partitions. The list of (b-id, label) pairs in Fig. 2 shows the result of inverted list partitioning.
Function 1 toemit(List L, index i)
1: if L.length == number of inverted lists then
2:   Emit(L)
3: else
4:   if L.length == i then
5:     toemit(L.append(r), i)
6:   else
7:     for all j ∈ [0, m-1] do
8:       toemit(L.append(j), i)
9:     end for
10:  end if
11: end if

(L.append(x) denotes appending x to a copy of L, so the branches of the recursion do not interfere.)

Function 2 Emit(List L)
1: initiate the polynomial function f(m) as defined in (1)
2: set the coefficients of f to the integers in L, in order
3: calculate the value of f for the given m; this is the emit key e-id
4: emit(e-id, l)

In the second step, the sub-spaces in which each label participates in structural joins are found by polynomial calculation. In this example, the polynomial function is $f(m) = a_A m^2 + a_B m + a_C$, where $a_A$, $a_B$ and $a_C$ are the coefficients corresponding to the query nodes A, B and C. For the first mapper, which handles the inverted list for A, $a_A$ is the b-id randomly assigned in the previous step, while $a_B$ and $a_C$ vary between 0 and 1 according to Function 1. For the label 1.1, its b-id is assigned as 1, so there are four possible values for its e-id, and the mapper emits this label four times. The calculation of e-ids is discussed in detail in the next example. Similarly, all other labels are emitted for shuffling.
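A compact, runnable rendering of Algorithm 1 together with Functions 1 and 2 is sketched below. The class and variable names are ours, and the emit to the MapReduce framework is simulated with a print.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of Algorithm 1 with Functions 1 and 2 (names are ours).
public class PolynomialEmitter {
  final int m, n, listIndex;   // partitions per list, number of lists, index of this list
  final Random rnd = new Random();

  PolynomialEmitter(int m, int n, int listIndex) {
    this.m = m; this.n = n; this.listIndex = listIndex;
  }

  // Algorithm 1: pick a random local partition r, then enumerate coefficients.
  void map(String label) {
    int r = rnd.nextInt(m);               // b-id: this label's partition in its own list
    toEmit(new ArrayList<>(), listIndex, r, label);
  }

  // Function 1: fill the coefficient list position by position.
  private void toEmit(List<Integer> coeffs, int i, int r, String label) {
    if (coeffs.size() == n) { emit(coeffs, label); return; }
    if (coeffs.size() == i) {
      coeffs.add(r);                      // the local list's coefficient is fixed to r
      toEmit(coeffs, i, r, label);
      coeffs.remove(coeffs.size() - 1);
    } else {
      for (int j = 0; j < m; j++) {       // all partitions of the other lists
        coeffs.add(j);
        toEmit(coeffs, i, r, label);
        coeffs.remove(coeffs.size() - 1);
      }
    }
  }

  // Function 2: evaluate f(m) over the coefficients to obtain the e-id.
  private void emit(List<Integer> coeffs, String label) {
    int eId = 0;
    for (int a : coeffs) eId = eId * m + a;  // Horner's rule, as in eq. (1)
    System.out.println("(" + eId + ", " + label + ")");
  }

  public static void main(String[] args) {
    // Fig. 2 setting: n = 3 lists, m = 2 partitions; mapper for A (coefficient index 0).
    // Emits the label to m^(n-1) = 4 sub-spaces, e.g., (4,1.1)...(7,1.1) when r = 1.
    new PolynomialEmitter(2, 3, 0).map("1.1");
  }
}
```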
Correctness of Polynomial-based Emitting

Theorem 1. The polynomial function in (1) produces exactly $m^n$ values spanning $[0, m^n - 1]$ when each coefficient $a_i$ takes the m different values in $[0, m-1]$. Thus the partitioning is complete.

This is a property of polynomial functions: the coefficient vector $(a_{n-1}, \ldots, a_0)$ is simply the base-m representation of f(m), which maps the $m^n$ coefficient vectors one-to-one onto $[0, m^n - 1]$. The full proof is omitted; we use an example to illustrate.

Example 2. Consider the cube in Fig. 1, which represents the division of a computing space with three inverted lists (n=3), where each inverted list is partitioned into three regions (m=3). The numeric id on each small cube is the calculated polynomial value of that sub-space. Suppose a mapper assigns the integer 0 (randomized from 0 to 2) to a label in the inverted list for C. The sub-spaces to which the label is emitted are calculated by the polynomial function $f(m) = a_A m^2 + a_B m + 0$, where $a_A, a_B \in [0, 2]$. Taking all values of $a_A$ and $a_B$ with m=3, the label is eventually emitted to 9 sub-spaces, i.e., 0, 3, 6, 9, 12, 15, 18, 21 and 24, as shown in Fig. 1.

Complexity of Polynomial-based Emitting

With the proposed polynomial-based emitting, each label is emitted to $m^{n-1}$ sub-spaces for parallel processing. Although this number theoretically increases exponentially with n, we claim that this complexity is unavoidable, and in fact manageable in practice. First, theoretically, performing a general join (a Cartesian product with conditional filters) across n tables (inverted lists), each containing m records (labels), requires $m^n$ computations. With certain indexing and algorithmic techniques, this complexity can be reduced for centralized data processing; however, when the data is too big to be managed by a centralized machine, most indexing techniques are not applicable. In the case of XML structural joins, pre-scanning all inverted lists to record some statistical information may help in designing more efficient workload distribution algorithms, but such a pre-scan cannot be performed on the fly with query processing, and making it static raises other issues, such as dealing with data updates. Designing a more efficient map function for XML query processing remains an open research problem. We propose an on-the-fly optimization technique in the next section, though it does not change the order of the complexity. Second, the value of $m^n$ is simply the number of sub-spaces across which the workload is distributed. In most queries, the number of query nodes n is quite small, and the actual value of $m^n$ can be controlled through m, the number of partitions in each inverted list. Ultimately, the total number of sub-spaces should be set based on hardware capacity and the application.

3.4 Design of Reducer

After a reducer collects all the labels, from all inverted lists, that share the same e-id (key), it can start processing the XML query over this sub-space. Since the map function only splits the computing space into small sub-spaces, without performing any other operation on the data, any structural join based algorithm can be implemented in the reduce function to process queries. In our implementation, we follow the holistic structural join algorithms (e.g., [6]), because this class of algorithms is proven optimal for many query cases. In the example in Fig. 2, after getting a subset of the labels from each inverted list, each reducer sorts each list and then performs a holistic structural join on them to find answers. Fig. 2 shows the process of executing the XPath query //A[//B]//C.
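The reduce side can be sketched as follows, under our assumptions: each emitted value is encoded as "queryNode:label" (e.g., "A:1.2"), and the holistic join routine is a stub, since any structural join algorithm can be plugged in. This is not the paper's code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the reduce side; the value encoding and join routine are assumptions.
public class StructuralJoinReducer extends Reducer<IntWritable, Text, Text, Text> {
  @Override
  protected void reduce(IntWritable eId, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    // Group the incoming labels back into one sub-list per inverted list.
    Map<String, List<String>> lists = new HashMap<>();
    for (Text v : values) {
      String[] parts = v.toString().split(":", 2);
      lists.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
    }
    // Re-sort each sub-list (lexicographic sort as a simplification; a real
    // implementation compares label components numerically, in document order).
    for (List<String> l : lists.values()) Collections.sort(l);
    for (String match : holisticTwigJoin(lists)) {
      ctx.write(new Text("sub-space " + eId), new Text(match));
    }
  }

  // Stub: plug in any holistic twig join here (e.g., the algorithm of [6]).
  private List<String> holisticTwigJoin(Map<String, List<String>> lists) {
    return new ArrayList<>();
  }
}
```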
4 Optimization

Based on the reduce function we implement, we design an optimization technique that prunes certain nodes that will not contribute to structural join results. Let us start with a motivating example. Suppose a structural join algorithm processes the query //A[//B]//C. In the sorted inverted list for A, the first label is 1.3, while the first label in the sorted inverted list for B precedes 1.3 in document order and is not a descendant of it. Obviously, performing the structural join A//B between 1.3 and such a label will not return an answer; in other words, the first label (perhaps the first few labels) in the inverted list for B can be skipped. This example motivates our optimization, which prunes certain labels during label distribution in the map function.

4.1 Statistics Collected during Document Labeling

During document labeling, we collect some statistics to aid label distribution. Basically, for each sorted inverted list, we take a sample every t labels. The samples serve as cut-off labels when the inverted list is divided into segments of size t. The value of t can be varied for different document sizes; in our heuristics, we set t=10,000. The size of this statistical data is 1/t of the inverted list size. The collected statistics are used to construct an index, called the cut-off index, that guides the assignment of a partition to a label from an inverted list. Normally the number of partitions (the value m) for an inverted list in our framework is small (it must be no larger than the number of segments), which means the cut-off labels between partitions can be derived from the statistics mentioned above. Then, given a label, we can compare it with the cut-off index to decide to which partition it belongs.
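A minimal sketch of such a cut-off index is shown below; the names and the string comparison are illustrative assumptions (a real implementation compares label components numerically, in document order).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the cut-off index: one sample every t labels of a sorted inverted
// list, from which the m-1 partition boundaries are derived.
public class CutoffIndex {
  private final List<String> boundaries = new ArrayList<>(); // m-1 cut-off labels

  // Build from the samples (every t-th label of the sorted list) for m partitions.
  CutoffIndex(List<String> samples, int m) {
    int segments = samples.size();               // number of t-sized segments
    for (int p = 1; p < m; p++) {
      // Boundary between partition p-1 and p, aligned to a sampled label.
      boundaries.add(samples.get(p * segments / m));
    }
  }

  // Partition of a label: the number of boundaries that precede it.
  int partitionOf(String label) {
    int p = 0;
    while (p < boundaries.size() && boundaries.get(p).compareTo(label) < 0) p++;
    return p;                                    // in [0, m-1]
  }
}
```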
4.2 Selective Emitting

Recall that in Algorithm 1, when the map function emits a label, it randomizes a local partition (represented by the coefficient of the corresponding term in the polynomial function) and considers all possible partitions of the other inverted lists (represented by all possible values of the other terms' coefficients). In our optimization, to emit a label from an inverted list I, we (1) set the label's local partition in I according to the cut-off index, and (2) selectively choose the coefficients (i.e., the partitions) for all ancestor inverted lists of I, such that the current label can join with the labels in those partitions and return answers. The toemit function (previously shown in Function 1) for an optimized mapper is presented in Function 3.

Function 3 toemit_O(List L, index i)
Input: the partition cut-off index cutoff[x][y] for inverted list I_x and partition y; the current inverted list I_u with numerical id u; the other variables are inherited from Algorithm 1
1: if L.length == number of inverted lists then
2:   Emit(L)
3: else
4:   if L.length == i then
5:     initiate v = 0
6:     while v < m and cutoff[u][v].precede(l) do
7:       v++
8:     end while
9:     toemit_O(L.append(v), i)
10:  else
11:    if the query node for I_{L.length} is an ancestor of the query node for I_u then
12:      initiate v = 0
13:      while v < m and cutoff[L.length][v].precede(l) do
14:        v++
15:      end while
16:      for all k ∈ [0, v-1] do
17:        toemit_O(L.append(k), i)
18:      end for
19:    else
20:      for all j ∈ [0, m-1] do
21:        toemit_O(L.append(j), i)
22:      end for
23:    end if
24:  end if
25: end if

The intuition of the optimization is to avoid emitting a label to reducers in which it cannot produce a structural join result. As shown in Fig. 3, rather than emitting a label $l_1$ from an inverted list $I_N$ to all reducers, the optimization algorithm computes the reducers in which $l_1$ will not produce answers and avoids those emissions.

[Fig. 3. Intuition of the optimization: a label l_1 from inverted list I_N is emitted only to the reducers where it can contribute to a join result; the pruned emissions are shown as dotted arrows.]

Example 3. We use an example to illustrate the optimized map function. Consider the XML twig pattern query in Fig. 4. There are 4 nodes in the query, so 4 inverted lists will be scanned for structural joins. If we divide each inverted list into 3 partitions, the two cut-off indices for the inverted lists for A and C are as shown in Fig. 4.

[Fig. 4. Example XML query (a twig over A, B, C and D) and the cut-off indices for A and C.]

Assuming the polynomial function for the mappers is $f(m) = a_A m^3 + a_B m^2 + a_C m + a_D$, the whole computing space is divided into 81 sub-spaces. When an A label is processed by a mapper, the mapper checks the cut-off index for A and decides to put the label into the second partition, i.e., $a_A = 1$, because the label passes the cut-off value between the first and the second partitions. When a D label is distributed, the mapper determines $a_D$, i.e., the local partition of the label. With the original map function, the label would then be emitted to all sub-spaces formed by its local partition combined with all partitions of the other inverted lists, i.e., to $3^3 = 27$ sub-spaces, even though in many of them the label cannot contribute to a structural join answer. With the optimized map function, the label is checked against the cut-off indices of all of D's ancestor nodes, i.e., A and C. Based on this index checking, the label is emitted to only the first two partitions of A's inverted list and only the first partition of C's inverted list. Thus, for this label, $a_A$ is 0 or 1, $a_C$ is 0, and only $a_B$ can take all 3 possible values from 0 to 2. Finally, the polynomial function f(m) takes 2 × 3 = 6 different values, so this label is emitted to 6 sub-spaces, in which it can possibly contribute to structural join answers.
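The selective emitting can be sketched in code as follows, reusing the CutoffIndex sketch above. Names are ours, and we take a conservative reading of Function 3: for each ancestor list we keep every partition up to and including the one whose range could contain an ancestor of the label.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of selective emitting (Function 3). For the local list the partition
// is fixed by the cut-off index; for each ancestor list only the partitions
// that can hold ancestors of the label are kept; all other lists still get
// every partition.
public class SelectiveEmitter {
  // Returns the coefficient vectors (a_{n-1}, ..., a_0) to emit the label to.
  static List<int[]> emitTargets(String label, int local, boolean[] isAncestorOfLocal,
                                 CutoffIndex[] cutoffs, int m) {
    int n = isAncestorOfLocal.length;
    List<List<Integer>> allowed = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      List<Integer> vals = new ArrayList<>();
      if (i == local) {
        vals.add(cutoffs[i].partitionOf(label));      // fixed local partition
      } else if (isAncestorOfLocal[i]) {
        int upTo = cutoffs[i].partitionOf(label);     // ancestors cannot start after the label
        for (int k = 0; k <= upTo; k++) vals.add(k);
      } else {
        for (int j = 0; j < m; j++) vals.add(j);      // no pruning possible
      }
      allowed.add(vals);
    }
    // Enumerate the Cartesian product of the allowed coefficient values.
    List<int[]> out = new ArrayList<>();
    enumerate(allowed, 0, new int[n], out);
    return out;
  }

  private static void enumerate(List<List<Integer>> allowed, int pos, int[] cur, List<int[]> out) {
    if (pos == cur.length) { out.add(cur.clone()); return; }
    for (int v : allowed.get(pos)) { cur[pos] = v; enumerate(allowed, pos + 1, cur, out); }
  }
}
```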
5 Experiment

5.1 Settings

All programs were implemented in Java and run on a small Hadoop cluster with 5 slave nodes. Each slave node has a dual-core 2.93GHz CPU and 12GB of shared memory; the maximum memory allocated to each JVM is 2GB. Since our work does not aim at Hadoop tuning, we keep all of Hadoop's default parameters during execution. We generated a synthetic XML dataset of 10GB based on the XMark [2] schema. The document is labeled with the containment labeling scheme [19], so that the size of each label is fixed. We randomly composed 10 twig pattern queries, with the number of query nodes varying from 2 to 5, for evaluation. The results presented in this section are based on the average running statistics. Note that this experimental study only shows the feasibility of the proposed MapReduce framework for XML structural joins and the effectiveness of the proposed optimization technique. We do not compare with other algorithms because we did not identify one with a similar philosophy of parallelizing structural joins, and it makes little sense to compare with methods that shred and distribute the raw XML document across compute nodes. Furthermore, in big data processing, efficiency on a single node is less important than the scalability of the algorithm, as overall performance can simply be improved by adding more compute nodes.
5.2 Results

Since the main query processing algorithm is executed in the reduce function, we first vary the number of reducers to check the impact on performance for both the original MapReduce algorithm and the optimized MapReduce algorithm. We set the number of mappers to 10 and try different numbers of reducers. The result is shown in Fig. 5.

[Fig. 5. Execution time (sec) under different numbers of reducers, for the original MR and the optimized MR algorithms.]

From the result we can see that the performance of both the original algorithm and the optimized algorithm improves as the number of reducers increases, until the number reaches 10 or 11; beyond that point, the performance is not stable as the reducer number keeps increasing. This accords with the hardware setting: there are 5 dual-core processors, which can support 10 tasks running in parallel. If the number of reducers is set below 10, the processors are not fully utilized; once they are fully utilized, performance may be affected by other factors, such as network transmission overhead and the randomized workload assignment, and there is no obvious monotonic relationship between performance and the number of reducers. The optimized algorithm is always better than the original algorithm without optimization; we examine this point further in the second experiment.

In the second experiment, we keep both the number of mappers and the number of reducers at 10, which fully utilizes the processors in the cluster, and vary the number of partitions in each inverted list. Recall that the number of partitions and the number of inverted lists determine the number of reduce jobs. Fig. 6(a) shows the number of labels emitted by the mappers for the two algorithms under different partition numbers. It clearly shows that the optimized algorithm prunes more labels as the sub-spaces become more fine-grained; theoretically, all unnecessary labels are pruned by our optimization when each sub-space contains only one label from each inverted list. As a consequence, the performance of the optimized algorithm is better than that of the original algorithm, and the difference is more significant when the number of partitions is larger, as shown in Fig. 6(b).

In our experiment, the memory of the JVM in each node can only support the sub-computing spaces formed when each inverted list is divided into at least 3 partitions; in other words, if each inverted list contains fewer than 3 partitions, the size of each sub-space will exceed the allocated memory.
[Fig. 6. Comparison between the original and the optimized algorithm for different numbers of partitions in each inverted list: (a) number of labels emitted by the mappers (millions); (b) execution time (sec), also showing centralized processing.]

From Fig. 6(b), we can also see that as the number of partitions in each inverted list increases, the overall performance drops. We also ran the structural join algorithm on a single computer with a 3.2GHz CPU and 8GB of memory, and show the average execution time of the queries in Fig. 6(b) as well. The purpose is not a horizontal comparison between the Hadoop cluster and a single machine, because when the dataset gets much larger a single machine cannot handle it at all. Rather, the comparison shows that the overhead of network data transmission in a Hadoop cluster with the MapReduce framework is quite large and can make performance worse than on a single machine (assuming the single machine is capable of running the program). Thus, we should limit the number of reduce jobs so that each reducer takes over as large a computing space as possible, to fully utilize its resources. In other words, for our problem, more CPU time per reducer is preferable to more data in transmission. Estimating the workload in order to choose a good number of reducers is therefore crucial to overall performance.

6 Conclusion and Future Work

In this paper, we proposed a novel algorithm based on the MapReduce framework to process XML queries over big XML data. Different from existing approaches that shred and distribute an XML document to different nodes of a computer cluster, our approach performs data distribution and processing at the inverted list level. In particular, during query processing we read and distribute only the inverted lists required by the input queries, whose size is much smaller than the size of the whole document. We partition the total computing space for structural joins so that each sub-space can be handled by one reducer performing structural joins. We further propose a pruning-based optimization algorithm to improve the performance of our approach. We conducted experiments showing that our algorithm and optimization are effective. For future research, we will focus on (1) designing better optimization algorithms by considering more complicated constraints that can prune labels
during structural joins; and (2) tuning the underlying system, e.g., Hadoop, for an optimal configuration for our algorithms.

References

1. Apache Hadoop. http://hadoop.apache.org/
2. XMark: an XML benchmark project. http://www.xml-benchmark.org/
3. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: VLDB (2009)
4. Bidoit, N., Colazzo, D., Malla, N., Ulliana, F., Nole, M., Sartiani, C.: Processing XML queries and updates on map/reduce clusters. In: EDBT (2013)
5. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD (2010)
6. Bruno, N., Koudas, N., Srivastava, D.: Holistic twig joins: optimal XML pattern matching. In: SIGMOD (2002)
7. Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. In: VLDB (2010)
8. Choi, H., Lee, K., Kim, S., Lee, Y., Moon, B.: HadoopXML: a suite for parallel processing of massive XML data with multiple twig pattern queries. In: CIKM (2012)
9. Cong, G., Fan, W., Kementsietsidis, A., Li, J., Liu, X.: Partial evaluation for distributed XPath query processing and beyond. ACM Trans. Database Syst. 37(4), 32 (2012)
10. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: USENIX Symp. on Operating System Design and Implementation (2004)
11. Jiang, D., Tung, A.K.H., Chen, G.: MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Eng. 23(9) (2011)
12. Kling, P., Ozsu, M.T., Daudjee, K.: Generating efficient execution plans for vertically partitioned XML databases. PVLDB 4(1), 1-11 (2010)
13. Lin, Y., Agrawal, D., Chen, C., Ooi, B.C., Wu, S.: Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. In: SIGMOD (2011)
14. Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In: SIGMOD (2011)
15. Ozsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer (2011)
16. Suciu, D.: Distributed query evaluation on semistructured data. ACM Trans. Database Syst. 27(1), 1-62 (2002)
17. Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Storing and querying ordered XML using a relational database system. In: SIGMOD (2002)
18. Xu, L., Ling, T.W., Wu, H.: Labeling dynamic XML documents: an order-centric approach. IEEE Trans. Knowl. Data Eng. 24(1) (2012)
19. Zhang, C., Naughton, J.F., DeWitt, D.J., Luo, Q., Lohman, G.M.: On supporting containment queries in relational database management systems. In: SIGMOD (2001)