Bisimulation Reduction of Big Graphs on MapReduce


 Louise Hudson
 1 years ago
 Views:
Transcription
1 Bisimulation Reduction of Big Graphs on MapReduce Yongming Luo 1, Yannick de Lange 1, George H. L. Fletcher 1, Paul De Bra 1, Jan Hidders 2, and Yuqing Wu 3 1 Eindhoven University of Technology, The Netherlands 2 Delft University of Technology, The Netherlands 3 Indiana University, Bloomington, USA Abstract. Computing the bisimulation partition of a graph is a fundamental problem which plays a key role in a wide range of basic applications. Intuitively, two nodes in a graph are bisimilar if they share basic structural properties such as labeling and neighborhood topology. In data management, reducing a graph under bisimulation equivalence is a crucial step, e.g., for indexing the graph for efficient query processing. Often, graphs of interest in the real world are massive; examples include social networks and linked open data. For analytics on such graphs, it is becoming increasingly infeasible to rely on inmemory or even I/Oefficient solutions. Hence, a trend in Big Data analytics is the use of distributed computing frameworks such as MapReduce. While there are both internal and external memory solutions for efficiently computing bisimulation, there is, to our knowledge, no effective MapReducebased solution for bisimulation. Motivated by these observations we propose in this paper the first efficient MapReducebased algorithm for computing the bisimulation partition of massive graphs. We also detail several optimizations for handling the data skew which often arises in realworld graphs. The results of an extensive empirical study are presented which demonstrate the effectiveness and scalability of our solution. 1 Introduction Recently, graph analytics has drawn increasing attention from the data management, semantic web, and many other research communities. Graphs of interest, such as social networks, the web graph, and linked open data, are typically on the order of billions of nodes and edges. In such cases, singlemachine inmemory solutions for reasoning over graphs are often infeasible. Naturally, research has turned to external memory and distributed solutions for graph reasoning. External memory algorithms often suffice, but their performance typically scales (almost) linearly with graph size (usually the number of graph edges), which is then limited by the throughput of the I/O devices attached to the system. In this respect, distributed and parallel algorithms become attractive. Ideally, a
2 welldesigned distributed algorithm would scale (roughly) linearly with the size of the computing resources it has, making use of the parallelism of the infrastructure. Though there are many alternatives, recently the MapReduce platform [8] has become a defacto parallel processing platform for reasoning over Big Data such as realworld graphs, with wide adoption in both industry and research. Among fundamental graph problems, the bisimulation partition problem plays a key role in a surprising range of basic applications [24]. Informally, the bisimulation partition of a graph is an assignment of each node n of the graph to a unique block consisting of all nodes having the same structural properties as n (e.g., node label and neighborhood topology). In data management, variations of bisimulation play a fundamental role in constructing structural indexes for XML and RDF databases [21, 23], and many other applications for general graph data such as compression [4, 11], query processing [16], and data analytics [10]. Being well studied for decades, many mainmemory efficient algorithms have been developed for bisimulation partitioning (e.g., [9, 22]). The stateoftheart I/O efficient algorithm takes just under one day to compute a standard localized variant of bisimulation on a graph with 1.4 billion edges on commodity hardware [20]. This cost can be a potential bottleneck since bisimulation partitioning is typically one step in a larger workflow (e.g., preparing the graph for indexing and query processing). Contributions. Motivated by these observations, we have studied the effective use of the MapReduce framework for accelerating the computation of bisimulation partitions of massive graphs. In this paper we present, to our knowledge, the first efficient MapReducebased algorithm for localized bisimulation partitioning. We further present strategies for dealing with various types of skew which occur in realworld graphs. We discuss the results of extensive experiments which show that our approach is effective and scalable, being up to an order of magnitude faster than the state of the art. As a prototypical graph problem, we hope that sharing our experiences with graph bisimulation will stimulate further progress in the community on MapReducebased solutions for reasoning and analytics over massive graphs. Related work. While there has been work on distributed computation of bisimulation partitions, existing approaches (e.g., [3]) are not developed for the MapReduce platform, and hence are not directly applicable to our problem. Research has been conducted to investigate using the MapReduce framework to solve graph problems [6, 18] right after the framework was proposed. A major issue here is dealing with data skew in graphs. Indeed, skew is ubiquitous in realworld graphs [5]. During our investigation, we also experienced various types of skew from graph bisimulation, as we discuss below. There has been much progress done from the system side to tackle this problem. The main approach in this literature is to devise mechanisms to estimate costs of MapReduce systems (e.g., [13]) and modify the system to mitigate the skew effects, both statically [12, 15, 17] and dynamically [17, 25], so that the modification is transparent to the users. However, it is still possible to gain much efficiency by dealing with
3 skew from the algorithm design perspective [19], as we do in the novel work we present in this paper. Paper organization. In the next tion we give a brief description of localized bisimulation and MapReduce. We then describe in Section 3 our base algorithm for computing localized bisimulation partitions using MapReduce. Next, Section 4 presents optimizations of the base algorithm, to deal with the common problem of skewed data. Section 5 presents the results of our empirical study. We then conclude in Section 6 with a discussion of future directions for research. 2 Preliminaries 2.1 Localized bisimulation partitions Our data model is that of finite directed node and edgelabeled graphs N, E, λ N, λ E, where N is a finite set of nodes, E N N is a set of edges, λ N is a function from N to a set of node labels L N, and λ E is a function from E to a set of edge labels L E. The localized bisimulation partition of graph G = N, E, λ N, λ E is based on the kbisimilar equivalence relation. Definition 1. Let G = N, E, λ N, λ E be a graph and k 0. Nodes u, v N are called kbisimilar (denoted as u k v), iff the following holds: 1. λ N (u) = λ N (v), 2. if k > 0, then for any edge (u, u ) E, there exists an edge (v, v ) E, such that u k 1 v and λ E (u, u ) = λ E (v, v ), and 3. if k > 0, then for any edge (v, v ) E, there exists an edge (u, u ) E, such that v k 1 u and λ E (v, v ) = λ E (u, u ). Given the kbisimulation relation on a graph G, we can assign a unique partition identifier (e.g., an integer) to each set of kbisimilar nodes in G. For node u N and relation k, we write pid k (u) to denote u s kpartition identifier, and we call pid k a kpartition identifier function. Definition 2. Let G = N, E, λ N, λ E be a graph, k 0, and {pid 0,..., pid k } be a set of ipartition identifier functions for G, for 0 i k. The k bisimulation signature of node u N is the pair sig k (u) = (pid 0 (u), L) where: { if k = 0, L = {(λ E (u, u ), pid k 1 (u )) (u, u ) E} if k > 0. We then have the following fact. Proposition 1 ([20]). pid k (u) = pid k (v) iff sig k (u) = sig k (v), k 0. Since there is a onetoone mapping between a node s signature and its partition identifier, we can construct sig k (u) ( u N, k 0), assign pid k (u) according to sig k (u), and then use pid k (u) to construct sig k+1 (u), and so on. We call
4 each such constructassign computing cycle an iteration. This signaturebased approach is robust, with effective application in nonmapreduce environments (e.g., [3, 20]). We refer for a more detailed discussion of localized bisimulation to Luo et al. [20]. Example. Consider the graph in Figure 1. In iteration 0, nodes are partitioned into blocks P1 and P2 (indicated by different colors), based on the node label A and B (Def. 1). Then in iteration 1, from Def. 2, we have sig 1 (1 ) = (1, {(w, P 1), (l, P 2)}) and sig 1 (2 ) = (1, {(w, P 1), (l, P 2)}), which indicates that pid 1 (1 ) = pid 1 (2 ) (Prop. 1). P1 P2 l 3 B l A A 1 2 w l l 4 B Fig. 1. Example graph w 5 6 B B If we further compute 2bisimulation, we see that sig 2 (1 ) sig 2 (2 ), and we conclude that nodes 1 and 2 are not 2bisimilar, and block P1 will split. The partition blocks and their relations (i.e., a structural index ) can be seen as an abstraction of the real graph, to be used, for example, to filter unnecessary graph matching during query processing [11, 23]. A larger k leads to a more refined partition, which results in a larger structural index. So there is a tradeoff between k and the space we have for holding the structural index. In practice, though, we see that a small k value (e.g., k 5) is already sufficient for query processing. In our analysis below, we compute the kbisimulation result up to k = 10, which is enough to show all the behaviors of interest for structural indexes. 2.2 MapReduce framework The MapReduce programming model [8] is designed to process large datasets in parallel. A MapReduce job takes a set of key/value pairs as input and outputs another set of key/value pairs. A MapReduce program consists of a series of MapReduce jobs, where each MapReduce job implements a map and a reduce function ( [ ] means a list of elements): map (k 1, v 1 ) [(k 2, v 2 )] reduce (k 2, [v 2 ]) [(k 3, v 3 )]. The map function takes key/value pair (k 1, v 1 ) as the input, emits a list of key/value pairs (k 2, v 2 ). In the reduce function, all values with the same key are grouped together as a list of values v 2 and are processed to emit another list of key/value pairs (k 3, v 3 ). Users define the map and reduce functions, letting the framework take care of all other aspects of the computation (synchronization, I/O, fault tolerance, etc.). The open source Hadoop implementation of the MapReduce framework is considered to have production quality and is widely used in industry and research [14]. Hadoop is often used together with the Hadoop Distributed File System (HDFS), which is designed to provide highthroughput access to application data. Besides map and reduce functions, in Hadoop a user can also write a l
5 custom partition function, which is applied after the map function to specify to which reducer each key/value pair should go. In our work we use Hadoop for validating our solutions, making use of the partition function as well. 3 Bisimulation partitioning with MapReduce For a graph G = N, E, λ N, λ E, we arbitrarily assign unique (integer) identifiers to each node of N. Our algorithm for computing the kbisimulation partition of G has two input tables: a node table (denoted as N t ) and an edge table (denoted as E t ). Both tables are plain files of sequential records of nodes and edges of G, resp., stored on HDFS. The schema of table N t is as follows: nid pid 0 nid pid new nid node identifier 0 bisimulation partition identifier for the given nid bisimulation partition identifier for the given nid from the current computation iteration The schema of table E t is as follows: sid tid elabel pid old tid source node identifier target node identifier edge label bisimulation partition identifier for the given tid from the last computation iteration By combining the idea of Proposition 1 and the MapReduce programming model, we can sketch an algorithm for kbisimulation using MapReduce, with the following workflow: for each iteration i (0 i k), we first construct the signatures for all nodes, then assign partition identifiers for all unique signatures, and pass the information to the next iteration. In our algorithm, each iteration then consists of three MapReduce tasks: 1. task Signature performs a merge join of N t and E t, and create signatures; 2. task Identifier distributes signatures to reducers and assigns partition identifiers; and, 3. task RePartition sorts N t to prepare it for the next iteration. Note that a preprocessing step is executed to guarantee the correctness of the mapside join in task Signature. We next explain each task in details. 3.1 Task Signature Task Signature (Algorithm 1) first performs a sort merge join of N t and E t, filling in the pid old tid column of E t with pid new nid of N t. This is achieved in the map function using a mapside join [18]. Then records are emitted grouping by nid in N t and sid in E t. In the reduce function, all information to construct a signature resides in [value]. So the major part of the function is to iterate through [value] and construct the signature according to Definition 2. After doing so, the node identifier along with its pid 0 nid value and signature are emitted to the next task. Note that the key/value pair input of SignatureMapper indicates
6 the fragments of N t and E t that need to be joined. The fragments need to be preprocessed before the algorithm runs. Algorithm 1 task Signature 1: procedure SignatureMapper(key, value) 2: perform mapside equijoin of N t and E t on nid and tid, fill in pid old tid with pid new nid, put all rows of N t and E t into records 3: for row in records do 4: if row is from E t then 5: emit (sid, the rest of the row) 6: else if row is from N t then 7: emit (nid, the rest of the row) 1: procedure SignatureReducer(key, [value]) key is nid or sid 2: pairset {} 3: for value in [value] do 4: if (key,value) is from N t then 5: pid 0 nid value.pid 0 nid record pid 0 nid 6: else if (key,value) is from E t then 7: pairset pairset {(elabel, pid old tid )} 8: sort elements in pairset lexicographically, first on elabel then on pid old tid 9: signature (pid 0 nid, pairset) 10: emit (key, (pid 0 nid, signature)) 3.2 Task Identifier On a single machine, in order to assign distinct values for signatures, we only need a dictionarylike data structure. In a distributed environment, on the other hand, some extra work has to be done. The Identifier task (Algorithm 2) is designed for this purpose. The map function distributes nodes of the same signature to the same reducer, so that in the reduce function, each reducer only needs to check locally whether the given signature is assigned an identifier or not; if not, then a new identifier is generated and assigned to the signature. To assign identifiers without collisions across reducers, we specify a nonoverlapping identifier range each reducer can use beforehand. For instance, reducer i can generate identifiers in the range of [i N, (i + 1) N ). Algorithm 2 task Identifier 1: procedure IdentifierMapper(nId, (pid 0 nid, signature)) 2: emit (signature, (nid,pid 0 nid )) 1: procedure IdentifierReducer(signature, [(nid, pid 0 nid )]) 2: pid new nid get unique identifier for signature 3: for (nid, pid 0 nid ) in [(nid, pid 0 nid )] do 4: emit (nid, (pid 0 nid, pid new nid )) 3.3 Task RePartition The output of task Identifier is N t filled with partition identifiers from the current iteration, but consists of file fragments partitioned by signature. In order to perform a mapside join with E t in task Signature in the next iteration,
7 N t needs to be sorted and partitioned on nid, which is the main job of task RePartition (Algorithm 3). This task makes use of the MapReduce framework to do the sorting and grouping. Algorithm 3 task RePartition 1: procedure RePartitionMapper(nId, (pid 0 nid, pid new nid )) 2: emit (nid, (pid 0 nid, pid new nid )) do nothing 1: procedure RePartitionReducer(nId, [(pid 0 nid, pid new nid )]) 2: for (pid 0 nid, pid new nid ) in [(pid 0 nid, pid new nid )] do 3: emit (nid, (pid 0 nid, pid new nid )) 3.4 Running example We illustrate our algorithm on the example graph of Figure 1. In Figures 2(a), 2(b) and 2(c), we show the input and output of the map and reduce phases of tasks Signature, Identifier and RePartition, respectively, for the first iteration (i = 0), with two reducers (bounded with gray boxes) in use. 4 Strategies for dealing with skew in graphs 4.1 Data and skew In our investigations, we used a variety of graph datasets to validate the results of our algorithm, namely: Twitter (41.65M, M), Jamendo (0.48M, 1.05M), LinkedMDB (2.33M, 6.15M), DBLP (23M, 50.2M), WikiLinks (5.71M, M), DBPedia (38.62M, 115.3M), Power (8.39M, 200M), and Random (10M, 200M); the numbers in the parenthesis indicates the node count and edge count of the graph, resp.. Among these, Jamendo, LinkedMDB, DBLP, WikiLinks, DBPedia, and Twitter are realworld graphs described further in Luo et al. [20]. Random and Power are synthetic graphs generated using GTgraph [2], where Random is generated by adding edges between nodes randomly, and Power is generated following the powerlaw degree distribution and smallworld property. During investigation of our base algorithm (further details below in Section 5), we witnessed a highly skewed workload among mappers and reducers. Figure 4(a) illustrates this on the various datasets, showing the running time for different reducers for the three tasks in the algorithm. Each spike is a time measurement for one reducer. The running time in each task is sorted in a descending order. We see that indeed some reducers carry a significantly disproportionate workload. This skew slows down the whole process since the task is only complete when the slowest worker finishes. From this empirical observation, we trace back the behavior to the data. Indeed, the partition result is skewed in many ways. For example, in Figure 3, we show the cumulative distribution of partition block size, i.e., number of nodes assigned to the block, to the number of partition blocks having the given size, for the realworld graphs described above. We see that for all of the datasets, block size shows a powerlaw distribution property. This indicates the need to rethink the design of our base algorithm. Recall that at the end of the map function of Algorithm 2, nodes are distributed to
8 Nodes: Edges: 3 l 1 1 w 2 2 w 2 5 l 2 4 l 3 1 l 4 2 l l w w l l l l w l w l l l l ,{(w,1),(l,2)} 21 1,{(w,1),(l,2)} 32 2,{(l,1)} 42 2,{(l,2)} 52 2,{(l,1)} 62 2,{} 6 nodes Join + Map Reduce Output (a) Task Signature 11 1,{(w,1),(l,2)} 21 1,{(w,1),(l,2)} 32 2,{(l,1)} r2 r2 r ,{(l,1)} 52 2,{(l,1)} 62 2,{} ,{(l,2)} 52 2,{(l,1)} 62 2,{} r2 r1 r ,{(w,1),(l,2)} 21 1,{(w,1),(l,2)} ,{(l,2)} 8 Map Reduce Output (b) Task Identifier Map Reduce Output (c) Task RePartition Fig. 2. Input and output of the three tasks of our algorithm for the example graph of Figure 1 (iteration i = 0). reducers based on their signatures. Since some of the signatures are associated with many nodes (as we see in Figure 3), the workload is inevitably unbalanced. This explains the reducers behavior in Figure 4(a) as well. In the following, we propose several strategies to handle such circumstances. 4.2 Strategy 1: Introduce another task Merge Recall from Section 3.2 that nodes with the same signature must be distributed to the same reducer, for otherwise the assignment of partition block identifiers
9 fraction of PB s having size x x (PB Size) Jamendo LinkedMDB DBLP WikiLinks DBPedia Twitter Fig. 3. Cumulative distribution of partition block size (PB Size) to partition blocks having the given size for realworld graphs cannot be guaranteed to be correct. This could be relaxed, however, if we introduce another task, which we call Merge. Suppose that when we output the partition result (without the guarantee that nodes with the same signature go to the same reducer), for each reducer, we also output a mapping of signatures to partition identifiers. Then in task Merge, we could merge the partition IDs based on the signature value, and output a local pid to global pid mapping. Then in the task RePartition, another mapside join is introduced to replace the local pid with the global pid. Because the signature itself is not correlated with the partition block size, the skew on partition block size should be eliminated. We discuss the implementation details in the following. In N t assume we have an additional field holding the partition size of the node from the previous iteration, named psize old, and let MR Load = We define rand(x) to return a random integer from 0 to x, and % as the modulo operator. N number of reducers. Algorithm 4 modified partition function for task Identifier 1: procedure Identifier getpartition for each key/value pair 2: if psize old > threshold then 3: n = psize old /MR Load numbers of reducers we need 4: return (signature.hashcode()+rand(n)) % number of reducers 5: else 6: return signature.hashcode() % number of reducers We first change the partition function for Identifier (Algorithm 4). In this case, for nodes whose associated psize old are above the threshold, we do not guarantee that they end up in the same reducer, but make sure that they are distributed to at most n reducers. Then we come to the reduce phase for Identifier (Algorithm 5). Here we also output the local partition size (named psize new ) for each node. Then the task Merge (Algorithm 6) will create the mapping between the locally assigned pid new nid and global pid.
10 Algorithm 5 modified reduce function for task Identifier 1: procedure IdentifierReducer(signature, [(nid,pid 0 nid,psize old )]) 2: pid new nid get unique identifier for signature 3: psize new size of [(nid,pid 0 nid, psize old )] 4: for (nid, pid 0 nid, psize old ) in [(nid, pid 0 nid, psize old )] do 5: emit (nid, (pid 0 nid, pid new nid )) 6: emit (signature, (pid new nid, psize new )) Algorithm 6 task Merge 1: procedure MergeMapper(signature, (pid new nid, psize new )) 2: emit (signature, (pid new nid, psize new )) do nothing 1: procedure MergeReducer(signature, [(pid new nid, psize new )]) 2: global pid count 0 3: global pid get unique identifier for signature 4: for (pid new nid, psize new ) in [(pid new nid, psize new )] do 5: global pid count global pid count + psize new 6: emit (global pid, (pid new nid )) the local  global mapping 7: emit (global pid, global pid count) Finally at the beginning of task RePartition, the partition identifiers are unified by a merge join of the local pid  global pid mapping table and N t. We achieve this by distributing the local pid  global pid table to all mappers before the task begins, with the help of Hadoop s distributed cache. Also, global pid count is updated in N t. While this is a general solution for dealing with skew, the extra Merge task introduces additional overhead. In the case of heavy skew, some signatures will produce large map files and performing merging might become a bottleneck. This indicates the need for a more specialized solution to deal with heavy skew. 4.3 Strategy 2: TopK signaturepartition identifier mapping One observation of Figure 3 is that only a few partition blocks are heavily skewed. To handle these outliers, at the end of task Signature, besides emitting N t, we can also emit, for each reducer, an aggregation count of signature appearances. Then a merge is performed among all the counts, to identify the most frequent K signatures and fix signaturepartition identifier mappings for these popular partition blocks. This mapping is small enough to be distributed to all cluster nodes as a global variable by Hadoop, so that when dealing with these signatures, processing time becomes constant. As a result, in task Identifier, nodes with such signatures can be distributed randomly across reducers. There are certain drawbacks to this method. First, the output topk frequent signatures are aggregated from local topk frequent values (with respect to each reducer), but globally we only use these values as an estimation of the real counts. Second, the step where signature counts have to be output and merged becomes a bottleneck of the whole workflow. Last but not least, users have to specify K before processing, which may be either too high or too low.
11 However, in the case of extreme skew on the partition block sizes, for most of the real world datasets, there are only a few partition blocks which delay the whole process, even for very large datasets. So when we adopt this strategy, we can choose a very small K value and still achieve good results, without introducing another MapReduce task. This is validated in Section Empirical analysis Our experiments are executed on the Hadoop cluster at SURFsara in Amsterdam, The Netherlands. 1 This cluster consists of 72 Nodes (66 DataNodes & TaskTrackers and 6 HeadNodes), with each node equipped with 4 x 2TB HDD, 8 core CPU 2.0 GHz and 64GB RAM. In total, there are 528 map and 528 reduce processes, and 460 TB HDFS space. The cluster is running Cloudera CDH3 distribution with Apache Hadoop All algorithms are written in Java 1.6. The datasets we use are described in Section 4.1. A more detailed description of the empirical study can be found in de Lange [7]. 5.1 Impact on workload skew of the TopK strategy Figure 4(b) shows the cluster workload after we create a identifier mapping for the top2 signaturepartitions from Section 4.3. We see that, when compared with Figure 4(a), the skew in running time per reducer is eliminated by the strategy. This means that workload is better distributed, thus lowering the running time per iteration and, in turn, the whole algorithm. Signature Signature Identifier Identifier Repartition Repartition 0 60 Wikilinks DBPedia (a) 0 65 Power Twitter 0 60 Wikilinks DBPedia (b) 0 65 Power Twitter Fig. 4. Workload skewness for base algorithm (a) and for top2 signature strategy (b) 5.2 Overall performance comparison In Figure 5, we present an overall performance comparison for computing 10 bisimulation on each graph with: our base algorithm (Base), the merge (Merge) and topk (TopK Signature) skew strategies, and the state of the art singlemachine external memory algorithm (I/O Efficient) [20]. For the TopK Signature strategy, we set K = 2 which, from our observation in Section 5.1, gives excellent skew reduction with low additional cost. For the Merge optimization we used a threshold of 1 MR load such that each partition block larger than 1 https://www.surfsara.nl
12 the optimal reducer block size is distributed among additional reducers. Furthermore, for each dataset, we choose 3 edge table size 64 MB number of reducers, which has been tested to lead to the minimum elapsed time. We empirically observed that increasing the number of reducers beyond this does not improve performance. Indeed, adding further maps and reducers at this point negatively impacts the running time due to framework overhead. Each experiment was repeated five times, and the data point reported is the average of the middle three measurements. Running time () more than I/O Efficient Algorithm Base Algorithm TopK signatures algorithm Merge algorithm 0 Jamendo LinkedMDB DBLP WikiLinks DBPedia Random * * Tests did not complete Power Twitter Fig. 5. Overall performance comparison We immediately observe from Figure 5 that for all datasets, among the Base algorithm and its two optimizations, there are always at least two solutions which perform better than the I/O efficient solution. For small datasets such as Jamendo and LinkedMDB, this is obvious, since in these cases only 1 or 2 reducers are used, so that the algorithm turns into a singlemachine inmemory algorithm. When the size of the datasets increases, the value of MapReduce becomes more visible, with up to an order of magnitude improvement in the running time for the MapReducebased solutions. We also observe that the skew amelioration strategies give excellent overall performance on these big graphs, with 2 or 3 times of improvement over our Base algorithm in the case of the highly skewed graphs such as DBLP and DBPedia. Finally, we observe that, relative to the topk strategy, the merge skewstrategy is mostly placed at a disadvantage due to its inherent overhead costs. Running time () DBPedia (Base) DBPedia (TopK Signature) DBPedia (Merge) Twitter (Base) Twitter (TopK Signature) Twitter (Merge) Iteration Fig. 6. Performance comparison per iteration for Twitter and DBPedia
13 To further study the different behaviors among the MapReducebased solutions, we plot the running time per iteration of the three solutions for DBPedia and Twitter in Figure 6. We see that for the Twitter dataset, in the first four iterations the skew is severe for the Base algorithm, while the two optimization strategies handle it well. After iteration 5, the overhead of the Merge strategy becomes nonnegligible, which is due to the bigger mapping files the Identifier produces. For the DBPedia dataset, on the other hand, the two strategies perform consistently better than the Base algorithm. Over all, based on our experiments, we note that our algorithm s performance is stable, i.e., essentially constant in running time as the number of maps and reducers is scaled with the input graph size. 6 Concluding remarks In this paper, we have presented, to our knowledge, the first generalpurpose algorithm for effectively computing localized bisimulation partitions of big graphs using the MapReduce programming model. We witnessed a skewed workload during algorithm execution, and proposed two strategies to eliminate such skew from an algorithm design perspective. An extensive empirical study confirmed that our algorithm not only efficiently produces the bisimulation partition result, but also scales well with the MapReduce infrastructure, with an order of magnitude performance improvement over the state of the art on realworld graphs. We close by indicating several interesting avenues for further investigation. First, there are additional basic sources of skew which may impact performance of our algorithm, such as skew on signature sizes and skew on the structure of the bisimulation partition itself. Therefore, further optimizations should be investigated to handle these additional forms of skew. Second, in Section 5.2 we see that all three proposed solutions have their best performance for some dataset, therefore it would be interesting to study the cost model of the MapReduce framework and combine the information (e.g., statistics for data, cluster status) within our algorithm to facilitate more intelligent selection of strategies to use at runtime [1]. Last but not least, our algorithmic solutions for ameliorating skew effects may find successful applications in related research problems (e.g., distributed graph query processing, graph generation, and graph indexing). Acknowledgments. The research of YW is supported by Research Foundation Flanders (FWO) during her sabbatical visit to Hasselt University, Belgium. The research of YL, GF, JH and PD is supported by the Netherlands Organisation for Scientific Research (NWO). We thank SURFsara for their support of this investigation and Boudewijn van Dongen for his insightful comments. We thank the reviewers for their helpful suggestions. References 1. F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a MapReduce computation. CoRR, abs/ , 2012.
14 2. D. A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. 3. S. Blom and S. Orzan. A distributed algorithm for strong bisimulation reduction of state spaces. Int J Softw Tools Technol Transfer, 7:74 86, P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. In Proc. VLDB, pages , Berlin, Germany, A. Clauset, C. Shalizi, and M. Newman. Powerlaw distributions in empirical data. SIAM review, 51(4): , J. Cohen. Graph twiddling in a MapReduce world. Computing in Science Engineering, 11(4):29 41, Y. de Lange. MapReduce based algorithms for localized bisimulation. Master s thesis, Eindhoven University of Technology, J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1): , Jan A. Dovier, C. Piazza, and A. Policriti. An efficient algorithm for computing bisimulation equivalence. Theor. Comp. Sci., 311(13): , W. Fan. Graph pattern matching revised for social network analysis. In Proc. ICDT, pages 8 21, Berlin, Germany, W. Fan, J. Li, X. Wang, and Y. Wu. Query preserving graph compression. In Proc. SIGMOD, pages , Scottsdale, AZ, USA, B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Handling data skew in MapReduce. In Proc. CLOSER, pages , B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in MapReduce based on scalable cardinality estimates. In Proc. ICDE, pages , Hadoop S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi. LEEN: Locality/fairnessaware key partitioning for MapReduce in the cloud. In CloudCom, pages 17 24, R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graphstructured data. In ICDE, pages , San Jose, Y. Kwon, K. Ren, M. Balazinska, and B. Howe. Managing Skew in Hadoop. IEEE Data Eng. Bull., 36(1):24 33, J. Lin and C. Dyer. DataIntensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, J. Lin and M. Schatz. Design patterns for efficient graph algorithms in MapReduce. In Proc. MLG, pages 78 85, Washington, D.C., Y. Luo, G. H. L. Fletcher, J. Hidders, Y. Wu, and P. De Bra. I/Oefficient algorithms for localized bisimulation partition construction and maintenance on massive graphs. CoRR, abs/ , T. Milo and D. Suciu. Index structures for path expressions. In Proc. ICDT, pages , Jerusalem, Israel, R. Paige and R. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 16:973, F. Picalausa, Y. Luo, G. H. L. Fletcher, J. Hidders, and S. Vansummeren. A structural approach to indexing triples. In ESWC, pages , Heraklion, Greece, D. Sangiorgi and J. Rutten. Advanced Topics in Bisimulation and Coinduction. Cambridge University Press, New York, NY, USA, R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive MapReduce using situationaware mappers. In Proc. EDBT, pages , 2012.
FPHadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FPHadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel LirozGistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel LirozGistau, Reza Akbarinia, Patrick Valduriez. FPHadoop:
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationBig Data Technology MapReduce Motivation: Indexing in Search Engines
Big Data Technology MapReduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationCSEE5430 Scalable Cloud Computing Lecture 2
CSEE5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.92015 1/36 Google MapReduce A scalable batch processing
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduceHadoop
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationLoad Balancing in MapReduce Based on Scalable Cardinality Estimates
Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationA Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 461 Komaba, Meguroku, Tokyo 1538505, Japan Abstract.
More informationHadoopSPARQL : A Hadoopbased Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoopbased Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationBig Data Begets Big Database Theory
Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process
More informationMapReduce: Algorithm Design Patterns
Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources
More informationParallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce
Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.astar.edu.sg Abstract. Processing XML queries over
More informationEnhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim AlAbsi, DaeKi Kang and MyongJong Kim Abstract In Hadoop MapReduce distributed file system, as the input
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
LargeScale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationA Comparison of Join Algorithms for Log Processing in MapReduce
A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel Computer Sciences Department University of WisconsinMadison {sblanas,jignesh}@cs.wisc.edu Vuk Ercegovac,
More informationStorage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationThe Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences GießenFriedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fhgiessen.de Abstract. The Hadoop Framework offers an approach to largescale
More informationAnalysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationMobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 13311336, 2013 ISSN: 20424868; eissn: 20424876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationComparision of kmeans and kmedoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of kmeans and kmedoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationReducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21stcentury with increasingly huge amounts of data to store and be
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of WisconsinMadison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationMapReduce With Columnar Storage
SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationIntroduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
More informationOpen Access Research of Fast Search Algorithm Based on Hadoop Cloud Platform
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 11531159 1153 Open Access Research of Fast Search Algorithm Based on Hadoop Cloud Platform
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationBig Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI580, Bo Wu Graphs
More informationProcessing Joins over Big Data in MapReduce
Processing Joins over Big Data in MapReduce Christos Doulkeridis Department of Digital Systems School of Information and Communication Technologies University of Piraeus http://www.ds.unipi.gr/cdoulk/
More informationA B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION
Speed Up Extension To Hadoop System A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com
More informationLoadBalancing the Distance Computations in Record Linkage
LoadBalancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationRealtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
More informationFault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationExploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an opensource software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationA computational model for MapReduce job flow
A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125
More informationConjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past  Traditional
More informationDynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce
2012 Third International Conference on Networking and Computing Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationRecognization of Satellite Images of Large Scale Data Based On Map Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationPerformance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationInfrastructures for big data
Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationHadoop and MapReduce. Swati Gore
Hadoop and MapReduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why MapReduce not MPI Distributed sort Why Hadoop? Existing Data
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationCLOUDDMSS: CLOUDBASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES
CLOUDDMSS: CLOUDBASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,
More informationComparison of Different Implementation of Inverted Indexes in Hadoop
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
More informationMapBased Graph Analysis on MapReduce
2013 IEEE International Conference on Big Data MapBased Graph Analysis on MapReduce Upa Gupta, Leonidas Fegaras University of Texas at Arlington, CSE Arlington, TX 76019 {upa.gupta,fegaras}@uta.edu Abstract
More informationDetection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736bewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736bewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationQuery and Analysis of Data on Electric Consumption Based on Hadoop
, pp.153160 http://dx.doi.org/10.14257/ijdta.2016.9.2.17 Query and Analysis of Data on Electric Consumption Based on Hadoop Jianjun 1 Zhou and Yi Wu 2 1 Information Science and Technology in Heilongjiang
More informationLearningbased Entity Resolution with MapReduce
Learningbased Entity Resolution with MapReduce Lars Kolb 1 Hanna Köpcke 2 Andreas Thor 1 Erhard Rahm 1,2 1 Database Group, 2 WDI Lab University of Leipzig {kolb,koepcke,thor,rahm}@informatik.unileipzig.de
More informationSimilarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE  Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationComputing Load Aware and LongView Load Balancing for Cluster Storage Systems
215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and LongView Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationIMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAPREDUCE
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAPREDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and PreRequisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The prerequisites are significant
More informationLecture 10  Functional programming: Hadoop and MapReduce
Lecture 10  Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10  Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, QuaidEMillath Govt College for Women, Chennai, (India)
More informationDistributed Dynamic Load Balancing for IterativeStencil Applications
Distributed Dynamic Load Balancing for IterativeStencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden
More informationBig RDF Data Partitioning and Processing using hadoop in Cloud
Big RDF Data Partitioning and Processing using hadoop in Cloud Tejas Bharat Thorat Dept. of Computer Engineering MIT Academy of Engineering, Alandi, Pune, India Prof.Ranjana R.Badre Dept. of Computer Engineering
More informationAccelerating Hadoop MapReduce Using an InMemory Data Grid
Accelerating Hadoop MapReduce Using an InMemory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
More informationRECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients
More informationBENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
More informationPolicybased PreProcessing in Hadoop
Policybased PreProcessing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides
More informationA Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.
A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud. Tejas Bharat Thorat Prof.RanjanaR.Badre Computer Engineering Department Computer
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationEfficient Analysis of Big Data Using Map Reduce Framework
Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,
More information