Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce


Huayu Wu
Institute for Infocomm Research, A*STAR, Singapore

Abstract. Processing XML queries over big XML data using MapReduce has been studied in recent years. However, the existing works focus on partitioning XML documents and distributing XML fragments to different compute nodes. This approach may introduce high overhead in transferring XML fragments from one node to another during MapReduce execution. Motivated by the structural join based XML query processing approach, which reads only the relevant inverted lists in order to reduce I/O cost, we propose a novel technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster, so that structural joins can be performed in parallel to process queries. We also propose an optimization technique that reduces the computing space in our framework and improves query processing performance. Finally, we conduct experiments to validate our algorithms.

1 Introduction

1.1 Context

The increasing amount of data generated by different applications, sensors and devices, and the increasing attention to the value of such data, mark the beginning of the era of big data. From big IT companies to SMEs, from computer science researchers to social scientists, everyone is now talking about big data and its impact on business, technology, and scientific research. There is no doubt that effectively managing big data is the first step toward any further analysis and utilization of it. How to manage such data, considering its large size and diverse formats, is a new challenge to the database community.
On the one hand, researchers are keen to find a more elastic and more reliable solution than the traditional distributed database system [15] to store big relational (structured) data and offer SQL-like query capability; on the other hand, for emerging datasets in heterogeneous models, new platforms and databases for semi-structured data, text data, graph data, etc. have been designed, aiming to provide more efficient access to such data at large scale. Gradually, these research attempts have converged on a distributed data processing framework, MapReduce [10]. The MapReduce programming model simplifies parallel data processing by offering two interfaces: map and reduce. With system-level support for computational resource management, a user only needs

to implement the two functions to process the underlying data, without worrying about the extensibility and reliability of the system. There is extensive work on implementing database operators [5][14][11] and database systems [3][13][7] on top of the MapReduce framework.

1.2 Motivation

Recently, researchers have started looking into the possibility of managing big XML data in a more elastic distributed environment, such as Hadoop [1], using MapReduce. Inspired by XML-enabled relational database systems, big XML data can be stored and processed using relational storage and operators. However, shredding big XML data into relational tables is extremely expensive. Furthermore, with relational storage, each XML query must be processed by several θ-joins among tables, and the cost of joins is still the bottleneck for Hadoop-based database systems. Most recent research attempts [9][8][4] build on the idea of XML partitioning and query decomposition adopted from distributed XML databases [16][12]. Similar to the join operation in relational databases, an XML query may require linking two or more arbitrary elements across the whole XML document. Thus, to process XML queries in a distributed system, transferring fragmented data from one node to another is unavoidable. In a static environment like a distributed XML database system, proper indexing techniques can help to optimally distribute data fragments and the workload. However, in an elastic distributed environment such as Hadoop, each copy of an XML fragment will probably be transferred to different, unpredictable nodes for processing. In other words, it is difficult to optimize data distribution in a MapReduce framework, so the existing approaches may suffer from high I/O and network transmission cost. Meanwhile, approaches for centralized XML query processing have been studied for over a decade. One highlight is the popularity of the structural join based approach (e.g., [6]).
Compared to other native approaches, such as the navigational approach and the subsequence matching approach, one main advantage of the structural join approach is the saving in I/O cost. In particular, in the structural join approach, only the few inverted lists corresponding to the query nodes are read from disk, rather than going through all the nodes in the document. It would be beneficial to adapt such an approach to the MapReduce framework, so that disk I/O and network cost can be reduced.

1.3 Contribution

In this paper, we study the parallelization of structural join based XML query processing algorithms using MapReduce. We do not distribute a whole big XML document to a computer cluster; instead, we distribute the inverted lists for each type of document node to be queried. As mentioned, since the size of the inverted lists used to process a query is much smaller than the size of the whole XML document, our approach potentially reduces the cost of cross-node data transfer. Our contributions can be summarized as follows:

- The problem of parallelizing structural joins for XML query processing using MapReduce is studied. To the best of our knowledge, this is the first work that discusses distributing inverted list labels rather than the raw XML document for parallel structural joins.
- A polynomial-based workload distribution algorithm is designed for the Map phase, which balances the workload across the Reduce tasks.
- An optimization technique is proposed to avoid emitting nodes to reducers where they do not contribute to structural join results.
- We conduct experiments to validate our algorithms.

1.4 Organization

The rest of the paper is organized as follows. In Section 2, we introduce the background knowledge and revisit related work. In Section 3, we present the map and reduce functions in our framework to process XML queries in parallel. In Section 4, an optimization algorithm is proposed so that unnecessary emitting can be pruned in mappers. Section 5 presents the experimental study validating the proposed algorithms. Finally, we conclude the paper in Section 6.

2 Background and Related Work

2.1 MapReduce

MapReduce is a computational model that originated from functional programming languages and was introduced to computer clusters for parallel data processing [10]. It simplifies the implementation of parallel data processing by offering two user-defined functions, map and reduce. The map function takes a set of key/value pairs as input. After a MapReduce job is submitted to the system, the map tasks (normally referred to as mappers) are started on certain compute nodes. Each map task executes the user-implemented map function over every key/value input pair. The output of the map function is another set of key/value pairs, which are temporarily stored in the local file systems and sorted by key. When all map tasks complete, the system notifies the reduce tasks (referred to as reducers) to start executing.
The reducers pull the key/value pairs output by the mappers in parallel and combine them into different lists for different keys. This step is similar to the group-by operator in SQL. Then, for each key, the list of values is processed by a reducer according to the user-specified reduce function. This step is similar to an aggregate function in SQL. Finally, the results are written back to disk.

2.2 XML Query Processing

There are different approaches to processing XML queries. Initially, XML documents were shredded into relational tables and queries were translated into SQL statements to query the database (e.g., [17]). This is the so-called relational approach.
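The map/group/reduce pipeline described above can be sketched as a toy, single-process Python model (the function names are ours; real Hadoop distributes the phases across nodes):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-process model of a MapReduce job:
    apply map_fn to every input pair, group the emitted
    pairs by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):   # map phase
            groups[out_key].append(out_value)           # shuffle: group by key
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Word count, the canonical example: map emits (word, 1), and reduce
# sums the list of counts per word (like SQL GROUP BY plus SUM).
docs = [(1, "map reduce map"), (2, "reduce")]
counts = run_mapreduce(
    docs,
    map_fn=lambda _, line: [(w, 1) for w in line.split()],
    reduce_fn=lambda _, ones: sum(ones),
)
print(counts)  # {'map': 2, 'reduce': 2}
```

The same skeleton underlies the framework of Section 3: the map function there assigns sub-space ids (keys) to labels, and each reducer joins the grouped labels of one sub-space.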

However, in the relational approach, an XML query requires several table joins, most of which are expensive θ-joins. Later, researchers proposed to process XML queries in their native form. Among the different native approaches, the structural join approach (e.g., [6]) is considered more efficient because it reduces I/O cost compared to the other approaches. In a structural join approach, an XML document is encoded with positional labels. The labels for each type of document node are organized in an inverted list. To process a query, only those inverted lists related to the queried node types are loaded and scanned; all other document nodes are ignored. Finally, structural joins are performed over the inverted lists to find the query answers.

2.3 XML Query Processing using MapReduce

To process XML queries using MapReduce, we need to decompose a big XML document, distribute the portions to different sites, and execute the query processing algorithm in parallel at the different sites. Obviously, the relational approach is not suitable, because transforming a big XML document into relational tables can be extremely time consuming, and θ-joins among relational tables are expensive using MapReduce. Recently, several works have proposed implementing native XML query processing algorithms using MapReduce. In [9], the authors proposed a distributed algorithm for Boolean XPath query evaluation using MapReduce. By collecting the Boolean evaluation results from a computer cluster, they proposed a centralized algorithm to finally process a general XPath query. In other words, they did not use the distributed computing environment to generate the final result, and it is still unclear whether the centralized step would become the bottleneck of the algorithm when the data is huge. In [8], a Hadoop-based system was designed to process XML queries using the structural join approach. There are two phases of MapReduce jobs in the system.
In the first phase, an XML document is shredded into blocks and scanned against the input queries. The path solutions are then sent to the second MapReduce job to be merged into final answers. The first problem with this approach is the loss of the holistic manner of generating path solutions; the beauty of the structural join approach is precisely to minimize intermediate path solutions by considering the whole query pattern during structural joins. The second problem is that the simple path filtering in the first MapReduce phase is not suitable for processing //-axis queries over complex structured XML data with recursive nodes, as pointed out by many prior research works. A recent demo [4] built an XML query/update system on top of the MapReduce framework. The technical details were not thoroughly presented; from the system architecture, it shreds an XML document and asks each mapper to process queries against each XML fragment. In fact, most existing works are based on XML document shredding, which is inspired by the document partitioning in distributed XML databases [16][12]. However, in the MapReduce framework, the distributed computing environment

is assumed to be dynamic and elastic, which makes XML fragment distribution difficult to optimize. Hence, in such a setting, fragmented data (from either the original document or intermediate results) may need to be transferred to different, unpredictable sites, which leads to high I/O and network transmission cost. Motivated by the inverted list based structural join algorithms, in this paper we propose an approach that distributes inverted lists rather than a raw XML document, so that the size of the fragmented data for I/O and network transmission can be greatly reduced.

3 Framework

In this section, we introduce our proposed MapReduce framework for parallel XML query processing. As mentioned, in our framework we do not distribute the raw XML data, because the raw data is large and causes high I/O and network transmission overhead during parallel processing in MapReduce. Instead, we label the document first and load the inverted lists, rather than the raw document, into a distributed file system. Obviously, the size of the inverted lists that are useful for a given query is much smaller than the size of the raw document, so the I/O and network transmission cost can be minimized.

3.1 Document Labeling

Labeling an XML document is an essential step for most XML query processing algorithms. In this paper, we do not repeat the research work on XML labeling, but only emphasize two points related to big XML data. First, document labeling is a one-time effort for a given XML document, independent of query processing. In most labeling schemes, a stack is maintained to store input document nodes so that the relationships among them can be identified. As document labeling proceeds, push and pop operations are executed over the stack. The maximum size of the stack is the maximum depth of the document. As a result, even if the document is large, memory is not an issue for document labeling.
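As an illustration of such a stack-based labeling pass (a sketch under our own assumptions, not the paper's implementation; we use (start, end, level) containment labels in the style of [19]):

```python
def label_document(events):
    """Assign (start, end, level) containment labels in one pass over a
    stream of ('start', tag) / ('end', tag) events, and collect the
    labels of each tag into an inverted list.  The stack never grows
    beyond the document depth, so memory stays bounded even for very
    large documents."""
    counter = 0
    stack = []                 # (tag, start, level) of currently open elements
    inverted_lists = {}        # tag -> list of (start, end, level) labels
    for kind, tag in events:
        counter += 1
        if kind == 'start':
            stack.append((tag, counter, len(stack) + 1))
        else:                  # 'end': close the innermost open element
            open_tag, start, level = stack.pop()
            assert open_tag == tag, "malformed document"
            inverted_lists.setdefault(open_tag, []).append((start, counter, level))
    return inverted_lists

# <A><B/><C/></A>  -- B and C are descendants of A
events = [('start', 'A'), ('start', 'B'), ('end', 'B'),
          ('start', 'C'), ('end', 'C'), ('end', 'A')]
lists = label_document(events)
print(lists['A'])  # [(1, 6, 1)]
a, b = lists['A'][0], lists['B'][0]
# ancestor-descendant test: A contains B iff A.start < B.start and B.end < A.end
print(a[0] < b[0] and b[1] < a[1])  # True
```

The containment test in the last line is what the structural joins of the following sections evaluate over the distributed inverted lists.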
Second, using existing techniques (e.g., [18]), re-labeling can be totally avoided when an XML database is updated. For a centralized XML query processing framework, the only overhead in dealing with document updates is the re-sorting of the relevant inverted lists. In our framework, node labels are distributed to different reducers and re-sorted there, so document updates pose little difficulty for our framework.

3.2 Framework Overview

In our approach, we implement the two functions, map and reduce, in a MapReduce framework, and leverage the underlying system, e.g., Hadoop, for program execution. The basic idea is to equally (or nearly equally) divide the whole computing space for a set of structural joins into a number of sub-spaces, each of which will be handled by one reducer to perform structural joins.

For example, suppose an XML query involves three query nodes, namely A, B and C. To process this query, a set of structural joins among the inverted lists I_A, I_B and I_C (for A, B and C respectively) will be performed. If we split each inverted list into three partitions, the total computing space (with size |I_A| × |I_B| × |I_C|) will be divided into 27 sub-spaces, as shown in Fig. 1.

Fig. 1. Example of computing space division

Each mapper takes a set of labels in an inverted list as input, and emits each label with the ID of the associated sub-space (called e-id, standing for emit id). The reducers take the grouped labels for each inverted list, re-sort them, and apply a holistic structural join algorithm to find answers. The whole process is shown in Fig. 2, and the details are explained in the following sections.

3.3 Design of Mapper

The main task of a mapper is to assign a key to each incoming label, so that the labels from each inverted list are nearly equally distributed over a given number of sub-spaces for the reducers to process. To achieve this goal, we adopt a polynomial-based emit id assignment. For a query with n different query nodes, i.e., using n inverted lists, we divide each inverted list into m sub-lists. The total number of sub-spaces for computing is then m^n. Fig. 1 shows an example where m = 3 and n = 3. We construct a polynomial function f of m with highest degree n-1 to guide emit id assignment. Each of the inverted lists corresponds to one coefficient, with degrees from n-1 down to 0.
f(m) = a_{n-1}·m^{n-1} + a_{n-2}·m^{n-2} + ... + a_1·m + a_0, where a_i ∈ [0, m-1] for i ∈ [0, n-1]   (1)

The procedure by which a mapper emits an input label is shown in Algorithm 1.

Example 1. We use the example in Fig. 2 to explain Algorithm 1. Suppose we need to process a query with three query nodes (i.e., n = 3), A, B and C, and we

divide each inverted list into two partitions (i.e., m = 2).

Fig. 2. Data flow of our proposed framework

Algorithm 1 Map Function
Input: an empty key, and a label l from an inverted list I as the value; variables m for the number of partitions in each inverted list, n for the number of inverted lists, and r, a random integer between 0 and m-1
1: identify the coefficient corresponding to I, i.e., a_i where i is its index
2: initiate an empty list L
3: toemit(L, i)
/* m, n, r and l are globally visible to Functions 1 and 2 */

In Mapper 1 in Fig. 2, the mapper first assigns a random number b-id ∈ [0, 1] to each value, i.e., to each incoming label from A's inverted list. Here b-id stands for the ID of the partition to which the incoming label belongs. By choosing b-id randomly, the labels in an inverted list are nearly equally divided into m partitions. The list of (b-id, label) pairs in Fig. 2 shows the result of inverted list partitioning.

Function 1 toemit(List L, index i)
1: if L.length == number of inverted lists then
2:   Emit(L)
3: else
4:   if L.length == i then
5:     toemit(L.append(r), i)
6:   else
7:     for all j ∈ [0, m-1] do
8:       toemit(L.append(j), i)
9:     end for
10:  end if
11: end if

Function 2 Emit(List L)
1: initiate the polynomial function f(m) as defined in (1)
2: set the coefficients of f to the integers in L, in order
3: calculate the value of f for the given m, as the emit key e-id
4: emit(e-id, l)

In the second step, for each label the corresponding sub-spaces for structural joins are found by polynomial calculation. In this example, the polynomial function is f(m) = a_A·m^2 + a_B·m + a_C, where a_A, a_B and a_C are the coefficients corresponding to the query nodes A, B and C. For the first mapper, which handles the inverted list for A, a_A is the b-id randomly assigned in the previous step, while a_B and a_C vary between 0 and 1 according to Function 1. For the label 1.1, its b-id is assigned as 1, and there are four possible values for its e-id; thus, the mapper emits this label four times. The calculation of e-id is discussed in detail in the next example. Similarly, all labels are emitted for shuffling.

Correctness of Polynomial-based Emitting

Theorem 1. The polynomial function in (1) produces exactly m^n values spanning [0, m^n - 1] when each coefficient a_i takes the m different values in [0, m-1]. Thus the partitioning is complete.

This is a property of polynomial functions, and the proof is omitted. We use an example to illustrate it.

Example 2. Consider the cube in Fig. 1, which stands for a division of the computing space with three inverted lists (n = 3), where each inverted list is partitioned into three regions (m = 3). The numeric id on each small cube is the calculated polynomial value of that sub-space.
Suppose a mapper assigns the integer 0 (randomly chosen from 0 to 2) to a label in the inverted list for C. The sub-spaces to which the label is emitted are calculated by the polynomial function f(m) = a_A·m^2 + a_B·m + 0, where a_A, a_B ∈ [0, 2]. By taking the different values of a_A and a_B, with m = 3,

the label will eventually be emitted to 9 sub-spaces, i.e., 0, 3, 6, 9, 12, 15, 18, 21 and 24, as shown in Fig. 1.

Complexity of Polynomial-based Emitting

With the proposed polynomial-based emitting, each label is emitted to m^{n-1} sub-spaces for parallel processing. Although this number increases exponentially with n in theory, we claim that this complexity is unavoidable and, in fact, manageable in practice. First, theoretically, performing a general join (a Cartesian product with conditional filters) across n tables (inverted lists), each containing m records (labels), requires m^n computations. With certain indexing and algorithmic techniques, this complexity can be reduced for centralized data processing; however, when the data is too big to be managed by a centralized machine, most indexing techniques are not applicable. In the case of XML structural joins, pre-scanning all inverted lists to record statistical information may help in designing more efficient workload distribution algorithms, but such a pre-scan cannot be done on the fly with query processing, while making it a static step raises other issues, such as dealing with data updates. How to design a more efficient map function for XML query processing remains an open research problem. We propose an on-the-fly optimization technique in the next section, though it does not change the order of the complexity. Second, the value m^n is simply the number of sub-computing-spaces used to distribute the workload. In most queries, the number of query nodes n is quite small, and the actual value of m^n can be controlled through m, the number of partitions in each inverted list. After all, the total number of sub-spaces should be set according to hardware capacity and the application.
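The emitting logic of Algorithm 1 together with Functions 1 and 2 can be sketched as follows (a Python illustration with our own naming; for n = 3, m = 3 and a label falling in partition 0 of C's list, it reproduces the nine e-ids of Example 2):

```python
from itertools import product

def emit_ids(list_index, b_id, n, m):
    """e-ids for a label from inverted list `list_index` (position 0
    holds the highest-degree coefficient) whose local partition is
    `b_id`: fix the label's own coefficient, enumerate all m values
    for every other coefficient, and evaluate the polynomial of
    Equation (1) at m, highest degree first."""
    ids = []
    for coeffs in product(range(m), repeat=n):
        if coeffs[list_index] != b_id:
            continue                 # own coefficient is fixed to b_id
        e_id = 0
        for a in coeffs:             # Horner evaluation of f at m
            e_id = e_id * m + a
        ids.append(e_id)
    return sorted(ids)

# Example 2: n = 3 lists (A, B, C), m = 3 partitions; a C-label with
# b-id 0 is emitted to the 9 sub-spaces 0, 3, ..., 24.
print(emit_ids(list_index=2, b_id=0, n=3, m=3))
# [0, 3, 6, 9, 12, 15, 18, 21, 24]
```

Note that each label is emitted m^{n-1} times, and that the union over all b-id values covers every e-id in [0, m^n - 1], which is the content of Theorem 1.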
3.4 Design of Reducer

After a reducer has collected, from all inverted lists, all labels that share its e-id (key), it can start processing the XML query over this sub-space. Since the map function only splits the computing space into small sub-spaces, without performing any other operation on the data, any structural join based algorithm can be implemented in the reduce function to process queries. In our implementation, we follow the holistic structural join algorithms (e.g., [6]), because this class of algorithms is proven optimal for many query cases. In the example in Fig. 2, after getting a subset of the labels of each inverted list, each reducer sorts each list and then performs a holistic structural join over them to find answers. Fig. 2 shows the process of executing the XPath query //A[//B]//C.

4 Optimization

To complement the reduce function we implement, we design an optimization technique that prunes certain nodes that will not contribute to structural join results.
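For reference, the reducer-side computation can be illustrated with a minimal binary ancestor-descendant join over (start, end) containment labels sorted by start position (a simplification of the holistic algorithms of [6]; the label format and function name are our assumptions):

```python
def structural_join(ancestors, descendants):
    """Merge-style A//D join over (start, end) containment labels,
    both lists sorted by start.  (s1, e1) is an ancestor of (s2, e2)
    iff s1 < s2 and e2 < e1.  One pass, keeping a stack of currently
    'open' ancestor candidates."""
    results, stack = [], []
    i = 0
    for d in descendants:
        # open every ancestor candidate that starts before d
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            stack.append(ancestors[i]); i += 1
        # drop candidates that end before d ends (cannot contain d)
        stack = [a for a in stack if a[1] > d[1]]
        results.extend((a, d) for a in stack)
    return results

A = [(1, 20), (3, 8)]     # the element at (1, 20) contains the one at (3, 8)
B = [(4, 5), (10, 11)]
print(structural_join(A, B))
# [((1, 20), (4, 5)), ((3, 8), (4, 5)), ((1, 20), (10, 11))]
```

This also makes the motivation for the next section concrete: any label shipped to a reducer where it joins with nothing is pure I/O and shuffle overhead.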

Let us start with a motivating example. Suppose a structural join algorithm processes the query //A[//B]//C. In the sorted inverted list for A, the first label is 1.3, while the first label in the sorted inverted list for B precedes it in document order. Obviously, performing the structural join A//B between 1.3 and such a label will not return an answer. In other words, the first label (and possibly the first few labels) in the inverted list for B can be skipped. This example motivates our optimization, which prunes certain labels during label distribution in the map function.

4.1 Statistics Collected during Document Labeling

During document labeling, we collect some statistics to aid label distribution. Basically, for each sorted inverted list, we take a sample every t labels. The samples stand for the cut-off labels if the inverted list is divided into segments of size t. The value of t can be varied based on the size of the document; in our heuristics, we set t = 10,000. The size of this statistical data is 1/t of the inverted list size. The collected statistics are used to construct an index, called the cut-off index, which decides the partition to which a label in an inverted list is assigned. Normally the number of partitions (the value m) for an inverted list in our framework is small (it must be smaller than t), which means the cut-off labels of the different partitions can be derived from the statistics mentioned above. Then, given a label, we can compare it against the cut-off index to decide to which partition it belongs.

4.2 Selective Emitting

Recall that in Algorithm 1, when a map function emits a label, it randomizes a local partition (represented by the coefficient of the corresponding term in the polynomial function) and considers all possible partitions of the other inverted lists (represented by all possible values of the other terms' coefficients) for the label to emit.
In our optimization, to emit a label from an inverted list I, we (1) set the local partition in I for the label according to the cut-off index, and (2) selectively choose the coefficients (i.e., the partitions) for all the ancestor inverted lists of I, such that the current label can join with the labels in those partitions and return answers. The toemit function (previously shown in Function 1) for an optimized mapper is presented in Function 3.

The intuition of the optimization is to prune the emitting of a label to those reducers in which the label cannot produce a structural join result. As shown in Fig. 3, rather than emitting a label l1 from an inverted list I_N to all reducers, the optimization algorithm computes in which reducers l1 will not produce answers, and avoids such emitting, as shown by the dotted arrows.

Example 3. We use an example to illustrate the optimized map function. Consider the XML twig pattern query in Fig. 4. There are 4 nodes in the query, so 4 inverted lists will be scanned for structural joins. If we divide each inverted list into 3 partitions, the two cut-off indices for the inverted lists for

Function 3 toemit_O(List L, index i)
Input: the partition cut-off index cutoff[x][y] for inverted list I_x and partition y; the current inverted list I_u with numerical id u; the other variables are inherited from Algorithm 1
1: if L.length == number of inverted lists then
2:   Emit(L)
3: else
4:   if L.length == i then
5:     initiate v = 0
6:     while cutoff[u][v].precede(l) && v < m-1 do
7:       v++
8:     end while
9:     toemit(L.append(v), i)
10:  else
11:    if the query node for I_{L.length} is an ancestor of the query node for I_u then
12:      initiate v = 0
13:      while cutoff[L.length][v].precede(l) && v < m-1 do
14:        v++
15:      end while
16:      for all k ∈ [0, v] do
17:        toemit(L.append(k), i)
18:      end for
19:    else
20:      for all j ∈ [0, m-1] do
21:        toemit(L.append(j), i)
22:      end for
23:    end if
24:  end if
25: end if

A and C are shown in Fig. 4. Assuming the polynomial function for the mappers is f(m) = a_A·m^3 + a_B·m^2 + a_C·m + a_D, the whole computing space is divided into 81 sub-spaces. When an A label is processed by a mapper, the mapper checks the cut-off index for A and puts the label into the second partition, i.e., a_A = 1, because the label passes the cut-off value between the first and the second partitions. When a D label is distributed, the mapper first determines a_D, i.e., the local partition for the label. Using the original map function, the label would then be emitted to all sub-spaces formed by its local partition and every combination of partitions from the other inverted lists, i.e., to 3^3 = 27 sub-spaces, though in many of these sub-spaces the label will not contribute to a structural join answer. In the optimized map function, the label is checked against the cut-off indices of all of D's ancestor nodes, i.e., A and C. Based on this index checking, the label is emitted only to the first two partitions of A's inverted list and to the first partition of C's inverted list. Thus for this label, a_A is 0 or 1, a_C is 0, and only a_B can take all 3 possible values from 0 to 2.
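Under our reading of Function 3, the selective-emitting check can be sketched as follows (Python, with hypothetical names and cut-off values; for a D-label falling in A's partition 1 and C's partition 0, it reproduces the 2 × 3 × 1 = 6 sub-spaces of Example 3):

```python
from bisect import bisect_right
from itertools import product

def partition_of(label, cutoffs):
    """Partition index of `label` = number of cut-off labels preceding it."""
    return bisect_right(cutoffs, label)

def selective_emit_ids(label, own, ancestors, cutoffs, m):
    """e-ids a label is emitted to under selective emitting.  `own` is
    the label's inverted-list index (0 = highest-degree coefficient),
    `ancestors` the indices of its ancestor query nodes, `cutoffs` maps
    list index -> sorted partition boundaries (m-1 values per list).
    Own coefficient: the label's local partition.  Ancestor
    coefficients: only partitions 0..p, where p is the partition the
    label would fall into in that ancestor's list (later partitions
    hold labels that start after `label` and so cannot contain it).
    All other coefficients: unrestricted."""
    choices = []
    for i in range(len(cutoffs)):
        if i == own:
            choices.append([partition_of(label, cutoffs[i])])
        elif i in ancestors:
            choices.append(range(partition_of(label, cutoffs[i]) + 1))
        else:
            choices.append(range(m))
    ids = []
    for coeffs in product(*choices):
        e_id = 0
        for a in coeffs:
            e_id = e_id * m + a      # evaluate f at m, highest degree first
        ids.append(e_id)
    return sorted(ids)

# Example 3 shape: query nodes A, B, C, D, m = 3; the cut-off positions
# below are invented so that label 110 falls in A's partition 1 and C's
# partition 0.  D's ancestors are A (index 0) and C (index 2).
cutoffs = {0: [100, 200], 1: [90, 190], 2: [120, 260], 3: [80, 170]}
ids = selective_emit_ids(label=110, own=3, ancestors={0, 2}, cutoffs=cutoffs, m=3)
print(len(ids))  # 6, instead of 27 without the optimization
```

Labels are abstracted to single start positions here; the real index compares full node labels with the precede relation.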

Fig. 3. Intuition of the optimization

Fig. 4. Example XML query and cut-off indices

Finally, the polynomial function f(m) takes 2 × 3 = 6 different values, so this label is emitted to the 6 sub-spaces where it can possibly contribute to structural join answers.

5 Experiment

5.1 Settings

All the programs were implemented in Java and run on a small Hadoop cluster with 5 slave nodes. Each slave node has a dual-core 2.93GHz CPU and 12GB of shared memory. The maximum memory allocated to each JVM is 2GB. Since our work does not aim at Hadoop tuning, we keep all of Hadoop's default parameters. We generated a synthetic XML dataset of 10GB based on the XMark [2] schema. The document is labeled with the containment labeling scheme [19], so that the size of each label is fixed. We randomly composed 10 twig pattern queries, with the number of query nodes varying from 2 to 5, for evaluation. The results presented in this section are based on the average running statistics. Note that this experimental study only shows the feasibility of the proposed MapReduce framework for XML structural joins, and the effectiveness of the proposed optimization technique. We do not compare with other algorithms because we did not identify one with a similar philosophy of parallelizing structural joins, and it makes little sense to compare with methods that shred and distribute a raw XML document across compute nodes. Furthermore, efficiency on a single node is less important than the scalability of the algorithm in big data processing, as the overall performance can be improved simply by adding more compute nodes.


Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.

More information

Research Paper BLOOM JOIN FINE-TUNES DISTRIBUTED QUERY IN HADOOP ENVIRONMENT Dr. Sunita M. Mahajan 1 and Ms. Vaishali P. Jadhav 2

Research Paper BLOOM JOIN FINE-TUNES DISTRIBUTED QUERY IN HADOOP ENVIRONMENT Dr. Sunita M. Mahajan 1 and Ms. Vaishali P. Jadhav 2 Research Paper BLOOM JOIN FINE-TUNES DISTRIBUTED QUERY IN HADOOP ENVIRONMENT Dr. Sunita M. Mahajan 1 and Ms. Vaishali P. Jadhav 2 Address for Correspondence 1 Principal, Mumbai Education Trust, Bandra,

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Data Management Using MapReduce

Data Management Using MapReduce Data Management Using MapReduce M. Tamer Özsu University of Waterloo CS742-Distributed & Parallel DBMS M. Tamer Özsu 1 / 24 Basics For data analysis of very large data sets Highly dynamic, irregular, schemaless,

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application 2012 International Conference on Information and Computer Applications (ICICA 2012) IPCSIT vol. 24 (2012) (2012) IACSIT Press, Singapore A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment Sanjay Kulhari, Jian Wen UC Riverside Team Sanjay Kulhari M.S. student, CS U C Riverside Jian Wen Ph.D. student, CS U

More information

Load-Balancing the Distance Computations in Record Linkage

Load-Balancing the Distance Computations in Record Linkage Load-Balancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr

More information

Hadoop and Map-reduce computing

Hadoop and Map-reduce computing Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn

More information

Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment

Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Fig. 3. PostgreSQL subsystems

Fig. 3. PostgreSQL subsystems Development of a Parallel DBMS on the Basis of PostgreSQL C. S. Pan kvapen@gmail.com South Ural State University Abstract. The paper describes the architecture and the design of PargreSQL parallel database

More information

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems

A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems Ismail Hababeh School of Computer Engineering and Information Technology, German-Jordanian University Amman, Jordan Abstract-

More information

Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Improving Job Scheduling in Hadoop MapReduce

Improving Job Scheduling in Hadoop MapReduce Improving Job Scheduling in Hadoop MapReduce Himangi G. Patel, Richard Sonaliya Computer Engineering, Silver Oak College of Engineering and Technology, Ahmedabad, Gujarat, India. Abstract Hadoop is a framework

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing

Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Hsin-Wen Wei 1,2, Che-Wei Hsu 2, Tin-Yu Wu 3, Wei-Tsong Lee 1 1 Department of Electrical Engineering, Tamkang University

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG 1 RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG Background 2 Hive is a data warehouse system for Hadoop that facilitates

More information

An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi

An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi International Conference on Applied Science and Engineering Innovation (ASEI 2015) An efficient Join-Engine to the SQL query based on Hive with Hbase Zhao zhi-cheng & Jiang Yi Institute of Computer Forensics,

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

SCHEDULING IN CLOUD COMPUTING

SCHEDULING IN CLOUD COMPUTING SCHEDULING IN CLOUD COMPUTING Lipsa Tripathy, Rasmi Ranjan Patra CSA,CPGS,OUAT,Bhubaneswar,Odisha Abstract Cloud computing is an emerging technology. It process huge amount of data so scheduling mechanism

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Semantic Web Standard in Cloud Computing

Semantic Web Standard in Cloud Computing ETIC DEC 15-16, 2011 Chennai India International Journal of Soft Computing and Engineering (IJSCE) Semantic Web Standard in Cloud Computing Malini Siva, A. Poobalan Abstract - CLOUD computing is an emerging

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Mining Interesting Medical Knowledge from Big Data

Mining Interesting Medical Knowledge from Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

NoSQL for SQL Professionals William McKnight

NoSQL for SQL Professionals William McKnight NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

A Hadoop MapReduce Performance Prediction Method

A Hadoop MapReduce Performance Prediction Method A Hadoop MapReduce Performance Prediction Method Ge Song, Zide Meng, Fabrice Huet, Frederic Magoules, Lei Yu and Xuelian Lin University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France Ecole Centrale

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil

More information

CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING

CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING Basangouda V.K 1,Aruna M.G 2 1 PG Student, Dept of CSE, M.S Engineering College, Bangalore,basangoudavk@gmail.com 2 Associate Professor.,

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

An Approach to Implement Map Reduce with NoSQL Databases

An Approach to Implement Map Reduce with NoSQL Databases www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

KEYWORD SEARCH IN RELATIONAL DATABASES

KEYWORD SEARCH IN RELATIONAL DATABASES KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Redundant Data Removal Technique for Efficient Big Data Search Processing

Redundant Data Removal Technique for Efficient Big Data Search Processing Redundant Data Removal Technique for Efficient Big Data Search Processing Seungwoo Jeon 1, Bonghee Hong 1, Joonho Kwon 2, Yoon-sik Kwak 3 and Seok-il Song 3 1 Dept. of Computer Engineering, Pusan National

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information