Bisimulation Reduction of Big Graphs on MapReduce

Size: px
Start display at page:

Download "Bisimulation Reduction of Big Graphs on MapReduce"

Transcription

1 Bisimulation Reduction of Big Graphs on MapReduce Yongming Luo 1, Yannick de Lange 1, George H. L. Fletcher 1, Paul De Bra 1, Jan Hidders 2, and Yuqing Wu 3 1 Eindhoven University of Technology, The Netherlands {y.luo@,y.m.d.lange@student.,g.h.l.fletcher@,p.m.e.d.bra@}tue.nl 2 Delft University of Technology, The Netherlands a.j.h.hidders@tudelft.nl 3 Indiana University, Bloomington, USA yuqwu@cs.indiana.edu Abstract. Computing the bisimulation partition of a graph is a fundamental problem which plays a key role in a wide range of basic applications. Intuitively, two nodes in a graph are bisimilar if they share basic structural properties such as labeling and neighborhood topology. In data management, reducing a graph under bisimulation equivalence is a crucial step, e.g., for indexing the graph for efficient query processing. Often, graphs of interest in the real world are massive; examples include social networks and linked open data. For analytics on such graphs, it is becoming increasingly infeasible to rely on in-memory or even I/Oefficient solutions. Hence, a trend in Big Data analytics is the use of distributed computing frameworks such as MapReduce. While there are both internal and external memory solutions for efficiently computing bisimulation, there is, to our knowledge, no effective MapReduce-based solution for bisimulation. Motivated by these observations we propose in this paper the first efficient MapReduce-based algorithm for computing the bisimulation partition of massive graphs. We also detail several optimizations for handling the data skew which often arises in real-world graphs. The results of an extensive empirical study are presented which demonstrate the effectiveness and scalability of our solution. 1 Introduction Recently, graph analytics has drawn increasing attention from the data management, semantic web, and many other research communities. Graphs of interest, such as social networks, the web graph, and linked open data, are typically on the order of billions of nodes and edges. In such cases, single-machine in-memory solutions for reasoning over graphs are often infeasible. Naturally, research has turned to external memory and distributed solutions for graph reasoning. External memory algorithms often suffice, but their performance typically scales (almost) linearly with graph size (usually the number of graph edges), which is then limited by the throughput of the I/O devices attached to the system. In this respect, distributed and parallel algorithms become attractive. Ideally, a

2 well-designed distributed algorithm would scale (roughly) linearly with the size of the computing resources it has, making use of the parallelism of the infrastructure. Though there are many alternatives, recently the MapReduce platform [8] has become a de-facto parallel processing platform for reasoning over Big Data such as real-world graphs, with wide adoption in both industry and research. Among fundamental graph problems, the bisimulation partition problem plays a key role in a surprising range of basic applications [24]. Informally, the bisimulation partition of a graph is an assignment of each node n of the graph to a unique block consisting of all nodes having the same structural properties as n (e.g., node label and neighborhood topology). In data management, variations of bisimulation play a fundamental role in constructing structural indexes for XML and RDF databases [21, 23], and many other applications for general graph data such as compression [4, 11], query processing [16], and data analytics [10]. Being well studied for decades, many main-memory efficient algorithms have been developed for bisimulation partitioning (e.g., [9, 22]). The state-of-the-art I/O efficient algorithm takes just under one day to compute a standard localized variant of bisimulation on a graph with 1.4 billion edges on commodity hardware [20]. This cost can be a potential bottleneck since bisimulation partitioning is typically one step in a larger workflow (e.g., preparing the graph for indexing and query processing). Contributions. Motivated by these observations, we have studied the effective use of the MapReduce framework for accelerating the computation of bisimulation partitions of massive graphs. In this paper we present, to our knowledge, the first efficient MapReduce-based algorithm for localized bisimulation partitioning. We further present strategies for dealing with various types of skew which occur in real-world graphs. We discuss the results of extensive experiments which show that our approach is effective and scalable, being up to an order of magnitude faster than the state of the art. As a prototypical graph problem, we hope that sharing our experiences with graph bisimulation will stimulate further progress in the community on MapReduce-based solutions for reasoning and analytics over massive graphs. Related work. While there has been work on distributed computation of bisimulation partitions, existing approaches (e.g., [3]) are not developed for the MapReduce platform, and hence are not directly applicable to our problem. Research has been conducted to investigate using the MapReduce framework to solve graph problems [6, 18] right after the framework was proposed. A major issue here is dealing with data skew in graphs. Indeed, skew is ubiquitous in real-world graphs [5]. During our investigation, we also experienced various types of skew from graph bisimulation, as we discuss below. There has been much progress done from the system side to tackle this problem. The main approach in this literature is to devise mechanisms to estimate costs of MapReduce systems (e.g., [13]) and modify the system to mitigate the skew effects, both statically [12, 15, 17] and dynamically [17, 25], so that the modification is transparent to the users. However, it is still possible to gain much efficiency by dealing with

3 skew from the algorithm design perspective [19], as we do in the novel work we present in this paper. Paper organization. In the next tion we give a brief description of localized bisimulation and MapReduce. We then describe in Section 3 our base algorithm for computing localized bisimulation partitions using MapReduce. Next, Section 4 presents optimizations of the base algorithm, to deal with the common problem of skewed data. Section 5 presents the results of our empirical study. We then conclude in Section 6 with a discussion of future directions for research. 2 Preliminaries 2.1 Localized bisimulation partitions Our data model is that of finite directed node- and edge-labeled graphs N, E, λ N, λ E, where N is a finite set of nodes, E N N is a set of edges, λ N is a function from N to a set of node labels L N, and λ E is a function from E to a set of edge labels L E. The localized bisimulation partition of graph G = N, E, λ N, λ E is based on the k-bisimilar equivalence relation. Definition 1. Let G = N, E, λ N, λ E be a graph and k 0. Nodes u, v N are called k-bisimilar (denoted as u k v), iff the following holds: 1. λ N (u) = λ N (v), 2. if k > 0, then for any edge (u, u ) E, there exists an edge (v, v ) E, such that u k 1 v and λ E (u, u ) = λ E (v, v ), and 3. if k > 0, then for any edge (v, v ) E, there exists an edge (u, u ) E, such that v k 1 u and λ E (v, v ) = λ E (u, u ). Given the k-bisimulation relation on a graph G, we can assign a unique partition identifier (e.g., an integer) to each set of k-bisimilar nodes in G. For node u N and relation k, we write pid k (u) to denote u s k-partition identifier, and we call pid k a k-partition identifier function. Definition 2. Let G = N, E, λ N, λ E be a graph, k 0, and {pid 0,..., pid k } be a set of i-partition identifier functions for G, for 0 i k. The k bisimulation signature of node u N is the pair sig k (u) = (pid 0 (u), L) where: { if k = 0, L = {(λ E (u, u ), pid k 1 (u )) (u, u ) E} if k > 0. We then have the following fact. Proposition 1 ([20]). pid k (u) = pid k (v) iff sig k (u) = sig k (v), k 0. Since there is a one-to-one mapping between a node s signature and its partition identifier, we can construct sig k (u) ( u N, k 0), assign pid k (u) according to sig k (u), and then use pid k (u) to construct sig k+1 (u), and so on. We call

4 each such construct-assign computing cycle an iteration. This signature-based approach is robust, with effective application in non-mapreduce environments (e.g., [3, 20]). We refer for a more detailed discussion of localized bisimulation to Luo et al. [20]. Example. Consider the graph in Figure 1. In iteration 0, nodes are partitioned into blocks P1 and P2 (indicated by different colors), based on the node label A and B (Def. 1). Then in iteration 1, from Def. 2, we have sig 1 (1 ) = (1, {(w, P 1), (l, P 2)}) and sig 1 (2 ) = (1, {(w, P 1), (l, P 2)}), which indicates that pid 1 (1 ) = pid 1 (2 ) (Prop. 1). P1 P2 l 3 B l A A 1 2 w l l 4 B Fig. 1. Example graph w 5 6 B B If we further compute 2-bisimulation, we see that sig 2 (1 ) sig 2 (2 ), and we conclude that nodes 1 and 2 are not 2-bisimilar, and block P1 will split. The partition blocks and their relations (i.e., a structural index ) can be seen as an abstraction of the real graph, to be used, for example, to filter unnecessary graph matching during query processing [11, 23]. A larger k leads to a more refined partition, which results in a larger structural index. So there is a tradeoff between k and the space we have for holding the structural index. In practice, though, we see that a small k value (e.g., k 5) is already sufficient for query processing. In our analysis below, we compute the k-bisimulation result up to k = 10, which is enough to show all the behaviors of interest for structural indexes. 2.2 MapReduce framework The MapReduce programming model [8] is designed to process large datasets in parallel. A MapReduce job takes a set of key/value pairs as input and outputs another set of key/value pairs. A MapReduce program consists of a series of MapReduce jobs, where each MapReduce job implements a map and a reduce function ( [ ] means a list of elements): map (k 1, v 1 ) [(k 2, v 2 )] reduce (k 2, [v 2 ]) [(k 3, v 3 )]. The map function takes key/value pair (k 1, v 1 ) as the input, emits a list of key/value pairs (k 2, v 2 ). In the reduce function, all values with the same key are grouped together as a list of values v 2 and are processed to emit another list of key/value pairs (k 3, v 3 ). Users define the map and reduce functions, letting the framework take care of all other aspects of the computation (synchronization, I/O, fault tolerance, etc.). The open source Hadoop implementation of the MapReduce framework is considered to have production quality and is widely used in industry and research [14]. Hadoop is often used together with the Hadoop Distributed File System (HDFS), which is designed to provide high-throughput access to application data. Besides map and reduce functions, in Hadoop a user can also write a l

5 custom partition function, which is applied after the map function to specify to which reducer each key/value pair should go. In our work we use Hadoop for validating our solutions, making use of the partition function as well. 3 Bisimulation partitioning with MapReduce For a graph G = N, E, λ N, λ E, we arbitrarily assign unique (integer) identifiers to each node of N. Our algorithm for computing the k-bisimulation partition of G has two input tables: a node table (denoted as N t ) and an edge table (denoted as E t ). Both tables are plain files of sequential records of nodes and edges of G, resp., stored on HDFS. The schema of table N t is as follows: nid pid 0 nid pid new nid node identifier 0 bisimulation partition identifier for the given nid bisimulation partition identifier for the given nid from the current computation iteration The schema of table E t is as follows: sid tid elabel pid old tid source node identifier target node identifier edge label bisimulation partition identifier for the given tid from the last computation iteration By combining the idea of Proposition 1 and the MapReduce programming model, we can sketch an algorithm for k-bisimulation using MapReduce, with the following workflow: for each iteration i (0 i k), we first construct the signatures for all nodes, then assign partition identifiers for all unique signatures, and pass the information to the next iteration. In our algorithm, each iteration then consists of three MapReduce tasks: 1. task Signature performs a merge join of N t and E t, and create signatures; 2. task Identifier distributes signatures to reducers and assigns partition identifiers; and, 3. task RePartition sorts N t to prepare it for the next iteration. Note that a preprocessing step is executed to guarantee the correctness of the map-side join in task Signature. We next explain each task in details. 3.1 Task Signature Task Signature (Algorithm 1) first performs a sort merge join of N t and E t, filling in the pid old tid column of E t with pid new nid of N t. This is achieved in the map function using a map-side join [18]. Then records are emitted grouping by nid in N t and sid in E t. In the reduce function, all information to construct a signature resides in [value]. So the major part of the function is to iterate through [value] and construct the signature according to Definition 2. After doing so, the node identifier along with its pid 0 nid value and signature are emitted to the next task. Note that the key/value pair input of SignatureMapper indicates

6 the fragments of N t and E t that need to be joined. The fragments need to be preprocessed before the algorithm runs. Algorithm 1 task Signature 1: procedure SignatureMapper(key, value) 2: perform map-side equi-join of N t and E t on nid and tid, fill in pid old tid with pid new nid, put all rows of N t and E t into records 3: for row in records do 4: if row is from E t then 5: emit (sid, the rest of the row) 6: else if row is from N t then 7: emit (nid, the rest of the row) 1: procedure SignatureReducer(key, [value]) key is nid or sid 2: pairset {} 3: for value in [value] do 4: if (key,value) is from N t then 5: pid 0 nid value.pid 0 nid record pid 0 nid 6: else if (key,value) is from E t then 7: pairset pairset {(elabel, pid old tid )} 8: sort elements in pairset lexicographically, first on elabel then on pid old tid 9: signature (pid 0 nid, pairset) 10: emit (key, (pid 0 nid, signature)) 3.2 Task Identifier On a single machine, in order to assign distinct values for signatures, we only need a dictionary-like data structure. In a distributed environment, on the other hand, some extra work has to be done. The Identifier task (Algorithm 2) is designed for this purpose. The map function distributes nodes of the same signature to the same reducer, so that in the reduce function, each reducer only needs to check locally whether the given signature is assigned an identifier or not; if not, then a new identifier is generated and assigned to the signature. To assign identifiers without collisions across reducers, we specify a non-overlapping identifier range each reducer can use beforehand. For instance, reducer i can generate identifiers in the range of [i N, (i + 1) N ). Algorithm 2 task Identifier 1: procedure IdentifierMapper(nId, (pid 0 nid, signature)) 2: emit (signature, (nid,pid 0 nid )) 1: procedure IdentifierReducer(signature, [(nid, pid 0 nid )]) 2: pid new nid get unique identifier for signature 3: for (nid, pid 0 nid ) in [(nid, pid 0 nid )] do 4: emit (nid, (pid 0 nid, pid new nid )) 3.3 Task RePartition The output of task Identifier is N t filled with partition identifiers from the current iteration, but consists of file fragments partitioned by signature. In order to perform a map-side join with E t in task Signature in the next iteration,

7 N t needs to be sorted and partitioned on nid, which is the main job of task RePartition (Algorithm 3). This task makes use of the MapReduce framework to do the sorting and grouping. Algorithm 3 task RePartition 1: procedure RePartitionMapper(nId, (pid 0 nid, pid new nid )) 2: emit (nid, (pid 0 nid, pid new nid )) do nothing 1: procedure RePartitionReducer(nId, [(pid 0 nid, pid new nid )]) 2: for (pid 0 nid, pid new nid ) in [(pid 0 nid, pid new nid )] do 3: emit (nid, (pid 0 nid, pid new nid )) 3.4 Running example We illustrate our algorithm on the example graph of Figure 1. In Figures 2(a), 2(b) and 2(c), we show the input and output of the map and reduce phases of tasks Signature, Identifier and RePartition, respectively, for the first iteration (i = 0), with two reducers (bounded with gray boxes) in use. 4 Strategies for dealing with skew in graphs 4.1 Data and skew In our investigations, we used a variety of graph datasets to validate the results of our algorithm, namely: Twitter (41.65M, M), Jamendo (0.48M, 1.05M), LinkedMDB (2.33M, 6.15M), DBLP (23M, 50.2M), WikiLinks (5.71M, M), DBPedia (38.62M, 115.3M), Power (8.39M, 200M), and Random (10M, 200M); the numbers in the parenthesis indicates the node count and edge count of the graph, resp.. Among these, Jamendo, LinkedMDB, DBLP, WikiLinks, DBPedia, and Twitter are real-world graphs described further in Luo et al. [20]. Random and Power are synthetic graphs generated using GTgraph [2], where Random is generated by adding edges between nodes randomly, and Power is generated following the power-law degree distribution and small-world property. During investigation of our base algorithm (further details below in Section 5), we witnessed a highly skewed workload among mappers and reducers. Figure 4(a) illustrates this on the various datasets, showing the running time for different reducers for the three tasks in the algorithm. Each spike is a time measurement for one reducer. The running time in each task is sorted in a descending order. We see that indeed some reducers carry a significantly disproportionate workload. This skew slows down the whole process since the task is only complete when the slowest worker finishes. From this empirical observation, we trace back the behavior to the data. Indeed, the partition result is skewed in many ways. For example, in Figure 3, we show the cumulative distribution of partition block size, i.e., number of nodes assigned to the block, to the number of partition blocks having the given size, for the real-world graphs described above. We see that for all of the datasets, block size shows a power-law distribution property. This indicates the need to rethink the design of our base algorithm. Recall that at the end of the map function of Algorithm 2, nodes are distributed to

8 Nodes: Edges: 3 l 1 1 w 2 2 w 2 5 l 2 4 l 3 1 l 4 2 l l w w l l l l w l w l l l l ,{(w,1),(l,2)} 2-1 1,{(w,1),(l,2)} 3-2 2,{(l,1)} 4-2 2,{(l,2)} 5-2 2,{(l,1)} 6-2 2,{} 6 nodes Join + Map Reduce Output (a) Task Signature 1-1 1,{(w,1),(l,2)} 2-1 1,{(w,1),(l,2)} 3-2 2,{(l,1)} r2 r2 r ,{(l,1)} 5-2 2,{(l,1)} 6-2 2,{} ,{(l,2)} 5-2 2,{(l,1)} 6-2 2,{} r2 r1 r ,{(w,1),(l,2)} 2-1 1,{(w,1),(l,2)} ,{(l,2)} 8 Map Reduce Output (b) Task Identifier Map Reduce Output (c) Task RePartition Fig. 2. Input and output of the three tasks of our algorithm for the example graph of Figure 1 (iteration i = 0). reducers based on their signatures. Since some of the signatures are associated with many nodes (as we see in Figure 3), the workload is inevitably unbalanced. This explains the reducers behavior in Figure 4(a) as well. In the following, we propose several strategies to handle such circumstances. 4.2 Strategy 1: Introduce another task Merge Recall from Section 3.2 that nodes with the same signature must be distributed to the same reducer, for otherwise the assignment of partition block identifiers

9 fraction of PB s having size x x (PB Size) Jamendo LinkedMDB DBLP WikiLinks DBPedia Twitter Fig. 3. Cumulative distribution of partition block size (PB Size) to partition blocks having the given size for real-world graphs cannot be guaranteed to be correct. This could be relaxed, however, if we introduce another task, which we call Merge. Suppose that when we output the partition result (without the guarantee that nodes with the same signature go to the same reducer), for each reducer, we also output a mapping of signatures to partition identifiers. Then in task Merge, we could merge the partition IDs based on the signature value, and output a local pid to global pid mapping. Then in the task RePartition, another map-side join is introduced to replace the local pid with the global pid. Because the signature itself is not correlated with the partition block size, the skew on partition block size should be eliminated. We discuss the implementation details in the following. In N t assume we have an additional field holding the partition size of the node from the previous iteration, named psize old, and let MR Load = We define rand(x) to return a random integer from 0 to x, and % as the modulo operator. N number of reducers. Algorithm 4 modified partition function for task Identifier 1: procedure Identifier getpartition for each key/value pair 2: if psize old > threshold then 3: n = psize old /MR Load numbers of reducers we need 4: return (signature.hashcode()+rand(n)) % number of reducers 5: else 6: return signature.hashcode() % number of reducers We first change the partition function for Identifier (Algorithm 4). In this case, for nodes whose associated psize old are above the threshold, we do not guarantee that they end up in the same reducer, but make sure that they are distributed to at most n reducers. Then we come to the reduce phase for Identifier (Algorithm 5). Here we also output the local partition size (named psize new ) for each node. Then the task Merge (Algorithm 6) will create the mapping between the locally assigned pid new nid and global pid.

10 Algorithm 5 modified reduce function for task Identifier 1: procedure IdentifierReducer(signature, [(nid,pid 0 nid,psize old )]) 2: pid new nid get unique identifier for signature 3: psize new size of [(nid,pid 0 nid, psize old )] 4: for (nid, pid 0 nid, psize old ) in [(nid, pid 0 nid, psize old )] do 5: emit (nid, (pid 0 nid, pid new nid )) 6: emit (signature, (pid new nid, psize new )) Algorithm 6 task Merge 1: procedure MergeMapper(signature, (pid new nid, psize new )) 2: emit (signature, (pid new nid, psize new )) do nothing 1: procedure MergeReducer(signature, [(pid new nid, psize new )]) 2: global pid count 0 3: global pid get unique identifier for signature 4: for (pid new nid, psize new ) in [(pid new nid, psize new )] do 5: global pid count global pid count + psize new 6: emit (global pid, (pid new nid )) the local - global mapping 7: emit (global pid, global pid count) Finally at the beginning of task RePartition, the partition identifiers are unified by a merge join of the local pid - global pid mapping table and N t. We achieve this by distributing the local pid - global pid table to all mappers before the task begins, with the help of Hadoop s distributed cache. Also, global pid count is updated in N t. While this is a general solution for dealing with skew, the extra Merge task introduces additional overhead. In the case of heavy skew, some signatures will produce large map files and performing merging might become a bottleneck. This indicates the need for a more specialized solution to deal with heavy skew. 4.3 Strategy 2: Top-K signature-partition identifier mapping One observation of Figure 3 is that only a few partition blocks are heavily skewed. To handle these outliers, at the end of task Signature, besides emitting N t, we can also emit, for each reducer, an aggregation count of signature appearances. Then a merge is performed among all the counts, to identify the most frequent K signatures and fix signature-partition identifier mappings for these popular partition blocks. This mapping is small enough to be distributed to all cluster nodes as a global variable by Hadoop, so that when dealing with these signatures, processing time becomes constant. As a result, in task Identifier, nodes with such signatures can be distributed randomly across reducers. There are certain drawbacks to this method. First, the output top-k frequent signatures are aggregated from local top-k frequent values (with respect to each reducer), but globally we only use these values as an estimation of the real counts. Second, the step where signature counts have to be output and merged becomes a bottleneck of the whole workflow. Last but not least, users have to specify K before processing, which may be either too high or too low.

11 However, in the case of extreme skew on the partition block sizes, for most of the real world datasets, there are only a few partition blocks which delay the whole process, even for very large datasets. So when we adopt this strategy, we can choose a very small K value and still achieve good results, without introducing another MapReduce task. This is validated in Section Empirical analysis Our experiments are executed on the Hadoop cluster at SURFsara in Amsterdam, The Netherlands. 1 This cluster consists of 72 Nodes (66 DataNodes & TaskTrackers and 6 HeadNodes), with each node equipped with 4 x 2TB HDD, 8 core CPU 2.0 GHz and 64GB RAM. In total, there are 528 map and 528 reduce processes, and 460 TB HDFS space. The cluster is running Cloudera CDH3 distribution with Apache Hadoop All algorithms are written in Java 1.6. The datasets we use are described in Section 4.1. A more detailed description of the empirical study can be found in de Lange [7]. 5.1 Impact on workload skew of the Top-K strategy Figure 4(b) shows the cluster workload after we create a identifier mapping for the top-2 signature-partitions from Section 4.3. We see that, when compared with Figure 4(a), the skew in running time per reducer is eliminated by the strategy. This means that workload is better distributed, thus lowering the running time per iteration and, in turn, the whole algorithm. Signature Signature Identifier Identifier Repartition Repartition 0 60 Wikilinks DBPedia (a) 0 65 Power Twitter 0 60 Wikilinks DBPedia (b) 0 65 Power Twitter Fig. 4. Workload skewness for base algorithm (a) and for top-2 signature strategy (b) 5.2 Overall performance comparison In Figure 5, we present an overall performance comparison for computing 10- bisimulation on each graph with: our base algorithm (Base), the merge (Merge) and top-k (Top-K Signature) skew strategies, and the state of the art singlemachine external memory algorithm (I/O Efficient) [20]. For the Top-K Signature strategy, we set K = 2 which, from our observation in Section 5.1, gives excellent skew reduction with low additional cost. For the Merge optimization we used a threshold of 1 MR load such that each partition block larger than 1

12 the optimal reducer block size is distributed among additional reducers. Furthermore, for each dataset, we choose 3 edge table size 64 MB number of reducers, which has been tested to lead to the minimum elapsed time. We empirically observed that increasing the number of reducers beyond this does not improve performance. Indeed, adding further maps and reducers at this point negatively impacts the running time due to framework overhead. Each experiment was repeated five times, and the data point reported is the average of the middle three measurements. Running time () more than I/O Efficient Algorithm Base Algorithm Top-K signatures algorithm Merge algorithm 0 Jamendo LinkedMDB DBLP WikiLinks DBPedia Random * * Tests did not complete Power Twitter Fig. 5. Overall performance comparison We immediately observe from Figure 5 that for all datasets, among the Base algorithm and its two optimizations, there are always at least two solutions which perform better than the I/O efficient solution. For small datasets such as Jamendo and LinkedMDB, this is obvious, since in these cases only 1 or 2 reducers are used, so that the algorithm turns into a single-machine in-memory algorithm. When the size of the datasets increases, the value of MapReduce becomes more visible, with up to an order of magnitude improvement in the running time for the MapReduce-based solutions. We also observe that the skew amelioration strategies give excellent overall performance on these big graphs, with 2 or 3 times of improvement over our Base algorithm in the case of the highly skewed graphs such as DBLP and DBPedia. Finally, we observe that, relative to the top-k strategy, the merge skew-strategy is mostly placed at a disadvantage due to its inherent overhead costs. Running time () DBPedia (Base) DBPedia (Top-K Signature) DBPedia (Merge) Twitter (Base) Twitter (Top-K Signature) Twitter (Merge) Iteration Fig. 6. Performance comparison per iteration for Twitter and DBPedia

13 To further study the different behaviors among the MapReduce-based solutions, we plot the running time per iteration of the three solutions for DBPedia and Twitter in Figure 6. We see that for the Twitter dataset, in the first four iterations the skew is severe for the Base algorithm, while the two optimization strategies handle it well. After iteration 5, the overhead of the Merge strategy becomes non-negligible, which is due to the bigger mapping files the Identifier produces. For the DBPedia dataset, on the other hand, the two strategies perform consistently better than the Base algorithm. Over all, based on our experiments, we note that our algorithm s performance is stable, i.e., essentially constant in running time as the number of maps and reducers is scaled with the input graph size. 6 Concluding remarks In this paper, we have presented, to our knowledge, the first general-purpose algorithm for effectively computing localized bisimulation partitions of big graphs using the MapReduce programming model. We witnessed a skewed workload during algorithm execution, and proposed two strategies to eliminate such skew from an algorithm design perspective. An extensive empirical study confirmed that our algorithm not only efficiently produces the bisimulation partition result, but also scales well with the MapReduce infrastructure, with an order of magnitude performance improvement over the state of the art on real-world graphs. We close by indicating several interesting avenues for further investigation. First, there are additional basic sources of skew which may impact performance of our algorithm, such as skew on signature sizes and skew on the structure of the bisimulation partition itself. Therefore, further optimizations should be investigated to handle these additional forms of skew. Second, in Section 5.2 we see that all three proposed solutions have their best performance for some dataset, therefore it would be interesting to study the cost model of the MapReduce framework and combine the information (e.g., statistics for data, cluster status) within our algorithm to facilitate more intelligent selection of strategies to use at runtime [1]. Last but not least, our algorithmic solutions for ameliorating skew effects may find successful applications in related research problems (e.g., distributed graph query processing, graph generation, and graph indexing). Acknowledgments. The research of YW is supported by Research Foundation Flanders (FWO) during her sabbatical visit to Hasselt University, Belgium. The research of YL, GF, JH and PD is supported by the Netherlands Organisation for Scientific Research (NWO). We thank SURFsara for their support of this investigation and Boudewijn van Dongen for his insightful comments. We thank the reviewers for their helpful suggestions. References 1. F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a Map-Reduce computation. CoRR, abs/ , 2012.

14 2. D. A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators S. Blom and S. Orzan. A distributed algorithm for strong bisimulation reduction of state spaces. Int J Softw Tools Technol Transfer, 7:74 86, P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. In Proc. VLDB, pages , Berlin, Germany, A. Clauset, C. Shalizi, and M. Newman. Power-law distributions in empirical data. SIAM review, 51(4): , J. Cohen. Graph twiddling in a MapReduce world. Computing in Science Engineering, 11(4):29 41, Y. de Lange. MapReduce based algorithms for localized bisimulation. Master s thesis, Eindhoven University of Technology, J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1): , Jan A. Dovier, C. Piazza, and A. Policriti. An efficient algorithm for computing bisimulation equivalence. Theor. Comp. Sci., 311(1-3): , W. Fan. Graph pattern matching revised for social network analysis. In Proc. ICDT, pages 8 21, Berlin, Germany, W. Fan, J. Li, X. Wang, and Y. Wu. Query preserving graph compression. In Proc. SIGMOD, pages , Scottsdale, AZ, USA, B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Handling data skew in MapReduce. In Proc. CLOSER, pages , B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in MapReduce based on scalable cardinality estimates. In Proc. ICDE, pages , Hadoop S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi. LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud. In CloudCom, pages 17 24, R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graph-structured data. In ICDE, pages , San Jose, Y. Kwon, K. Ren, M. Balazinska, and B. Howe. Managing Skew in Hadoop. IEEE Data Eng. Bull., 36(1):24 33, J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, J. Lin and M. Schatz. Design patterns for efficient graph algorithms in MapReduce. In Proc. MLG, pages 78 85, Washington, D.C., Y. Luo, G. H. L. Fletcher, J. Hidders, Y. Wu, and P. De Bra. I/O-efficient algorithms for localized bisimulation partition construction and maintenance on massive graphs. CoRR, abs/ , T. Milo and D. Suciu. Index structures for path expressions. In Proc. ICDT, pages , Jerusalem, Israel, R. Paige and R. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 16:973, F. Picalausa, Y. Luo, G. H. L. Fletcher, J. Hidders, and S. Vansummeren. A structural approach to indexing triples. In ESWC, pages , Heraklion, Greece, D. Sangiorgi and J. Rutten. Advanced Topics in Bisimulation and Coinduction. Cambridge University Press, New York, NY, USA, R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive MapReduce using situation-aware mappers. In Proc. EDBT, pages , 2012.

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering

HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

MapReduce: Algorithm Design Patterns

MapReduce: Algorithm Design Patterns Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

A Comparison of Join Algorithms for Log Processing in MapReduce

A Comparison of Join Algorithms for Log Processing in MapReduce A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel Computer Sciences Department University of Wisconsin-Madison {sblanas,jignesh}@cs.wisc.edu Vuk Ercegovac,

More information

Big Data Begets Big Database Theory

Big Data Begets Big Database Theory Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Open Access Research of Fast Search Algorithm Based on Hadoop Cloud Platform

Open Access Research of Fast Search Algorithm Based on Hadoop Cloud Platform Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 1153-1159 1153 Open Access Research of Fast Search Algorithm Based on Hadoop Cloud Platform

More information

Load-Balancing the Distance Computations in Record Linkage

Load-Balancing the Distance Computations in Record Linkage Load-Balancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

MapReduce With Columnar Storage

MapReduce With Columnar Storage SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Processing Joins over Big Data in MapReduce

Processing Joins over Big Data in MapReduce Processing Joins over Big Data in MapReduce Christos Doulkeridis Department of Digital Systems School of Information and Communication Technologies University of Piraeus http://www.ds.unipi.gr/cdoulk/

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at

More information

A computational model for MapReduce job flow

A computational model for MapReduce job flow A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125

More information

Learning-based Entity Resolution with MapReduce

Learning-based Entity Resolution with MapReduce Learning-based Entity Resolution with MapReduce Lars Kolb 1 Hanna Köpcke 2 Andreas Thor 1 Erhard Rahm 1,2 1 Database Group, 2 WDI Lab University of Leipzig {kolb,koepcke,thor,rahm}@informatik.uni-leipzig.de

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

Infrastructures for big data

Infrastructures for big data Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems 215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

Advances in Natural and Applied Sciences

Advances in Natural and Applied Sciences AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Query and Analysis of Data on Electric Consumption Based on Hadoop

Query and Analysis of Data on Electric Consumption Based on Hadoop , pp.153-160 http://dx.doi.org/10.14257/ijdta.2016.9.2.17 Query and Analysis of Data on Electric Consumption Based on Hadoop Jianjun 1 Zhou and Yi Wu 2 1 Information Science and Technology in Heilongjiang

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

A Case for Flash Memory SSD in Hadoop Applications

A Case for Flash Memory SSD in Hadoop Applications A Case for Flash Memory SSD in Hadoop Applications Seok-Hoon Kang, Dong-Hyun Koo, Woon-Hak Kang and Sang-Won Lee Dept of Computer Engineering, Sungkyunkwan University, Korea x860221@gmail.com, smwindy@naver.com,

More information

Actian Vector in Hadoop

Actian Vector in Hadoop Actian Vector in Hadoop Industrialized, High-Performance SQL in Hadoop A Technical Overview Contents Introduction...3 Actian Vector in Hadoop - Uniquely Fast...5 Exploiting the CPU...5 Exploiting Single

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Efficient Analysis of Big Data Using Map Reduce Framework

Efficient Analysis of Big Data Using Map Reduce Framework Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Big RDF Data Partitioning and Processing using hadoop in Cloud

Big RDF Data Partitioning and Processing using hadoop in Cloud Big RDF Data Partitioning and Processing using hadoop in Cloud Tejas Bharat Thorat Dept. of Computer Engineering MIT Academy of Engineering, Alandi, Pune, India Prof.Ranjana R.Badre Dept. of Computer Engineering

More information

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6 International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

More information