Bisimulation Reduction of Big Graphs on MapReduce


 Louise Hudson
 1 years ago
 Views:
Transcription
1 Bisimulation Reduction of Big Graphs on MapReduce Yongming Luo 1, Yannick de Lange 1, George H. L. Fletcher 1, Paul De Bra 1, Jan Hidders 2, and Yuqing Wu 3 1 Eindhoven University of Technology, The Netherlands 2 Delft University of Technology, The Netherlands 3 Indiana University, Bloomington, USA Abstract. Computing the bisimulation partition of a graph is a fundamental problem which plays a key role in a wide range of basic applications. Intuitively, two nodes in a graph are bisimilar if they share basic structural properties such as labeling and neighborhood topology. In data management, reducing a graph under bisimulation equivalence is a crucial step, e.g., for indexing the graph for efficient query processing. Often, graphs of interest in the real world are massive; examples include social networks and linked open data. For analytics on such graphs, it is becoming increasingly infeasible to rely on inmemory or even I/Oefficient solutions. Hence, a trend in Big Data analytics is the use of distributed computing frameworks such as MapReduce. While there are both internal and external memory solutions for efficiently computing bisimulation, there is, to our knowledge, no effective MapReducebased solution for bisimulation. Motivated by these observations we propose in this paper the first efficient MapReducebased algorithm for computing the bisimulation partition of massive graphs. We also detail several optimizations for handling the data skew which often arises in realworld graphs. The results of an extensive empirical study are presented which demonstrate the effectiveness and scalability of our solution. 1 Introduction Recently, graph analytics has drawn increasing attention from the data management, semantic web, and many other research communities. Graphs of interest, such as social networks, the web graph, and linked open data, are typically on the order of billions of nodes and edges. In such cases, singlemachine inmemory solutions for reasoning over graphs are often infeasible. Naturally, research has turned to external memory and distributed solutions for graph reasoning. External memory algorithms often suffice, but their performance typically scales (almost) linearly with graph size (usually the number of graph edges), which is then limited by the throughput of the I/O devices attached to the system. In this respect, distributed and parallel algorithms become attractive. Ideally, a
2 welldesigned distributed algorithm would scale (roughly) linearly with the size of the computing resources it has, making use of the parallelism of the infrastructure. Though there are many alternatives, recently the MapReduce platform [8] has become a defacto parallel processing platform for reasoning over Big Data such as realworld graphs, with wide adoption in both industry and research. Among fundamental graph problems, the bisimulation partition problem plays a key role in a surprising range of basic applications [24]. Informally, the bisimulation partition of a graph is an assignment of each node n of the graph to a unique block consisting of all nodes having the same structural properties as n (e.g., node label and neighborhood topology). In data management, variations of bisimulation play a fundamental role in constructing structural indexes for XML and RDF databases [21, 23], and many other applications for general graph data such as compression [4, 11], query processing [16], and data analytics [10]. Being well studied for decades, many mainmemory efficient algorithms have been developed for bisimulation partitioning (e.g., [9, 22]). The stateoftheart I/O efficient algorithm takes just under one day to compute a standard localized variant of bisimulation on a graph with 1.4 billion edges on commodity hardware [20]. This cost can be a potential bottleneck since bisimulation partitioning is typically one step in a larger workflow (e.g., preparing the graph for indexing and query processing). Contributions. Motivated by these observations, we have studied the effective use of the MapReduce framework for accelerating the computation of bisimulation partitions of massive graphs. In this paper we present, to our knowledge, the first efficient MapReducebased algorithm for localized bisimulation partitioning. We further present strategies for dealing with various types of skew which occur in realworld graphs. We discuss the results of extensive experiments which show that our approach is effective and scalable, being up to an order of magnitude faster than the state of the art. As a prototypical graph problem, we hope that sharing our experiences with graph bisimulation will stimulate further progress in the community on MapReducebased solutions for reasoning and analytics over massive graphs. Related work. While there has been work on distributed computation of bisimulation partitions, existing approaches (e.g., [3]) are not developed for the MapReduce platform, and hence are not directly applicable to our problem. Research has been conducted to investigate using the MapReduce framework to solve graph problems [6, 18] right after the framework was proposed. A major issue here is dealing with data skew in graphs. Indeed, skew is ubiquitous in realworld graphs [5]. During our investigation, we also experienced various types of skew from graph bisimulation, as we discuss below. There has been much progress done from the system side to tackle this problem. The main approach in this literature is to devise mechanisms to estimate costs of MapReduce systems (e.g., [13]) and modify the system to mitigate the skew effects, both statically [12, 15, 17] and dynamically [17, 25], so that the modification is transparent to the users. However, it is still possible to gain much efficiency by dealing with
3 skew from the algorithm design perspective [19], as we do in the novel work we present in this paper. Paper organization. In the next tion we give a brief description of localized bisimulation and MapReduce. We then describe in Section 3 our base algorithm for computing localized bisimulation partitions using MapReduce. Next, Section 4 presents optimizations of the base algorithm, to deal with the common problem of skewed data. Section 5 presents the results of our empirical study. We then conclude in Section 6 with a discussion of future directions for research. 2 Preliminaries 2.1 Localized bisimulation partitions Our data model is that of finite directed node and edgelabeled graphs N, E, λ N, λ E, where N is a finite set of nodes, E N N is a set of edges, λ N is a function from N to a set of node labels L N, and λ E is a function from E to a set of edge labels L E. The localized bisimulation partition of graph G = N, E, λ N, λ E is based on the kbisimilar equivalence relation. Definition 1. Let G = N, E, λ N, λ E be a graph and k 0. Nodes u, v N are called kbisimilar (denoted as u k v), iff the following holds: 1. λ N (u) = λ N (v), 2. if k > 0, then for any edge (u, u ) E, there exists an edge (v, v ) E, such that u k 1 v and λ E (u, u ) = λ E (v, v ), and 3. if k > 0, then for any edge (v, v ) E, there exists an edge (u, u ) E, such that v k 1 u and λ E (v, v ) = λ E (u, u ). Given the kbisimulation relation on a graph G, we can assign a unique partition identifier (e.g., an integer) to each set of kbisimilar nodes in G. For node u N and relation k, we write pid k (u) to denote u s kpartition identifier, and we call pid k a kpartition identifier function. Definition 2. Let G = N, E, λ N, λ E be a graph, k 0, and {pid 0,..., pid k } be a set of ipartition identifier functions for G, for 0 i k. The k bisimulation signature of node u N is the pair sig k (u) = (pid 0 (u), L) where: { if k = 0, L = {(λ E (u, u ), pid k 1 (u )) (u, u ) E} if k > 0. We then have the following fact. Proposition 1 ([20]). pid k (u) = pid k (v) iff sig k (u) = sig k (v), k 0. Since there is a onetoone mapping between a node s signature and its partition identifier, we can construct sig k (u) ( u N, k 0), assign pid k (u) according to sig k (u), and then use pid k (u) to construct sig k+1 (u), and so on. We call
4 each such constructassign computing cycle an iteration. This signaturebased approach is robust, with effective application in nonmapreduce environments (e.g., [3, 20]). We refer for a more detailed discussion of localized bisimulation to Luo et al. [20]. Example. Consider the graph in Figure 1. In iteration 0, nodes are partitioned into blocks P1 and P2 (indicated by different colors), based on the node label A and B (Def. 1). Then in iteration 1, from Def. 2, we have sig 1 (1 ) = (1, {(w, P 1), (l, P 2)}) and sig 1 (2 ) = (1, {(w, P 1), (l, P 2)}), which indicates that pid 1 (1 ) = pid 1 (2 ) (Prop. 1). P1 P2 l 3 B l A A 1 2 w l l 4 B Fig. 1. Example graph w 5 6 B B If we further compute 2bisimulation, we see that sig 2 (1 ) sig 2 (2 ), and we conclude that nodes 1 and 2 are not 2bisimilar, and block P1 will split. The partition blocks and their relations (i.e., a structural index ) can be seen as an abstraction of the real graph, to be used, for example, to filter unnecessary graph matching during query processing [11, 23]. A larger k leads to a more refined partition, which results in a larger structural index. So there is a tradeoff between k and the space we have for holding the structural index. In practice, though, we see that a small k value (e.g., k 5) is already sufficient for query processing. In our analysis below, we compute the kbisimulation result up to k = 10, which is enough to show all the behaviors of interest for structural indexes. 2.2 MapReduce framework The MapReduce programming model [8] is designed to process large datasets in parallel. A MapReduce job takes a set of key/value pairs as input and outputs another set of key/value pairs. A MapReduce program consists of a series of MapReduce jobs, where each MapReduce job implements a map and a reduce function ( [ ] means a list of elements): map (k 1, v 1 ) [(k 2, v 2 )] reduce (k 2, [v 2 ]) [(k 3, v 3 )]. The map function takes key/value pair (k 1, v 1 ) as the input, emits a list of key/value pairs (k 2, v 2 ). In the reduce function, all values with the same key are grouped together as a list of values v 2 and are processed to emit another list of key/value pairs (k 3, v 3 ). Users define the map and reduce functions, letting the framework take care of all other aspects of the computation (synchronization, I/O, fault tolerance, etc.). The open source Hadoop implementation of the MapReduce framework is considered to have production quality and is widely used in industry and research [14]. Hadoop is often used together with the Hadoop Distributed File System (HDFS), which is designed to provide highthroughput access to application data. Besides map and reduce functions, in Hadoop a user can also write a l
5 custom partition function, which is applied after the map function to specify to which reducer each key/value pair should go. In our work we use Hadoop for validating our solutions, making use of the partition function as well. 3 Bisimulation partitioning with MapReduce For a graph G = N, E, λ N, λ E, we arbitrarily assign unique (integer) identifiers to each node of N. Our algorithm for computing the kbisimulation partition of G has two input tables: a node table (denoted as N t ) and an edge table (denoted as E t ). Both tables are plain files of sequential records of nodes and edges of G, resp., stored on HDFS. The schema of table N t is as follows: nid pid 0 nid pid new nid node identifier 0 bisimulation partition identifier for the given nid bisimulation partition identifier for the given nid from the current computation iteration The schema of table E t is as follows: sid tid elabel pid old tid source node identifier target node identifier edge label bisimulation partition identifier for the given tid from the last computation iteration By combining the idea of Proposition 1 and the MapReduce programming model, we can sketch an algorithm for kbisimulation using MapReduce, with the following workflow: for each iteration i (0 i k), we first construct the signatures for all nodes, then assign partition identifiers for all unique signatures, and pass the information to the next iteration. In our algorithm, each iteration then consists of three MapReduce tasks: 1. task Signature performs a merge join of N t and E t, and create signatures; 2. task Identifier distributes signatures to reducers and assigns partition identifiers; and, 3. task RePartition sorts N t to prepare it for the next iteration. Note that a preprocessing step is executed to guarantee the correctness of the mapside join in task Signature. We next explain each task in details. 3.1 Task Signature Task Signature (Algorithm 1) first performs a sort merge join of N t and E t, filling in the pid old tid column of E t with pid new nid of N t. This is achieved in the map function using a mapside join [18]. Then records are emitted grouping by nid in N t and sid in E t. In the reduce function, all information to construct a signature resides in [value]. So the major part of the function is to iterate through [value] and construct the signature according to Definition 2. After doing so, the node identifier along with its pid 0 nid value and signature are emitted to the next task. Note that the key/value pair input of SignatureMapper indicates
6 the fragments of N t and E t that need to be joined. The fragments need to be preprocessed before the algorithm runs. Algorithm 1 task Signature 1: procedure SignatureMapper(key, value) 2: perform mapside equijoin of N t and E t on nid and tid, fill in pid old tid with pid new nid, put all rows of N t and E t into records 3: for row in records do 4: if row is from E t then 5: emit (sid, the rest of the row) 6: else if row is from N t then 7: emit (nid, the rest of the row) 1: procedure SignatureReducer(key, [value]) key is nid or sid 2: pairset {} 3: for value in [value] do 4: if (key,value) is from N t then 5: pid 0 nid value.pid 0 nid record pid 0 nid 6: else if (key,value) is from E t then 7: pairset pairset {(elabel, pid old tid )} 8: sort elements in pairset lexicographically, first on elabel then on pid old tid 9: signature (pid 0 nid, pairset) 10: emit (key, (pid 0 nid, signature)) 3.2 Task Identifier On a single machine, in order to assign distinct values for signatures, we only need a dictionarylike data structure. In a distributed environment, on the other hand, some extra work has to be done. The Identifier task (Algorithm 2) is designed for this purpose. The map function distributes nodes of the same signature to the same reducer, so that in the reduce function, each reducer only needs to check locally whether the given signature is assigned an identifier or not; if not, then a new identifier is generated and assigned to the signature. To assign identifiers without collisions across reducers, we specify a nonoverlapping identifier range each reducer can use beforehand. For instance, reducer i can generate identifiers in the range of [i N, (i + 1) N ). Algorithm 2 task Identifier 1: procedure IdentifierMapper(nId, (pid 0 nid, signature)) 2: emit (signature, (nid,pid 0 nid )) 1: procedure IdentifierReducer(signature, [(nid, pid 0 nid )]) 2: pid new nid get unique identifier for signature 3: for (nid, pid 0 nid ) in [(nid, pid 0 nid )] do 4: emit (nid, (pid 0 nid, pid new nid )) 3.3 Task RePartition The output of task Identifier is N t filled with partition identifiers from the current iteration, but consists of file fragments partitioned by signature. In order to perform a mapside join with E t in task Signature in the next iteration,
7 N t needs to be sorted and partitioned on nid, which is the main job of task RePartition (Algorithm 3). This task makes use of the MapReduce framework to do the sorting and grouping. Algorithm 3 task RePartition 1: procedure RePartitionMapper(nId, (pid 0 nid, pid new nid )) 2: emit (nid, (pid 0 nid, pid new nid )) do nothing 1: procedure RePartitionReducer(nId, [(pid 0 nid, pid new nid )]) 2: for (pid 0 nid, pid new nid ) in [(pid 0 nid, pid new nid )] do 3: emit (nid, (pid 0 nid, pid new nid )) 3.4 Running example We illustrate our algorithm on the example graph of Figure 1. In Figures 2(a), 2(b) and 2(c), we show the input and output of the map and reduce phases of tasks Signature, Identifier and RePartition, respectively, for the first iteration (i = 0), with two reducers (bounded with gray boxes) in use. 4 Strategies for dealing with skew in graphs 4.1 Data and skew In our investigations, we used a variety of graph datasets to validate the results of our algorithm, namely: Twitter (41.65M, M), Jamendo (0.48M, 1.05M), LinkedMDB (2.33M, 6.15M), DBLP (23M, 50.2M), WikiLinks (5.71M, M), DBPedia (38.62M, 115.3M), Power (8.39M, 200M), and Random (10M, 200M); the numbers in the parenthesis indicates the node count and edge count of the graph, resp.. Among these, Jamendo, LinkedMDB, DBLP, WikiLinks, DBPedia, and Twitter are realworld graphs described further in Luo et al. [20]. Random and Power are synthetic graphs generated using GTgraph [2], where Random is generated by adding edges between nodes randomly, and Power is generated following the powerlaw degree distribution and smallworld property. During investigation of our base algorithm (further details below in Section 5), we witnessed a highly skewed workload among mappers and reducers. Figure 4(a) illustrates this on the various datasets, showing the running time for different reducers for the three tasks in the algorithm. Each spike is a time measurement for one reducer. The running time in each task is sorted in a descending order. We see that indeed some reducers carry a significantly disproportionate workload. This skew slows down the whole process since the task is only complete when the slowest worker finishes. From this empirical observation, we trace back the behavior to the data. Indeed, the partition result is skewed in many ways. For example, in Figure 3, we show the cumulative distribution of partition block size, i.e., number of nodes assigned to the block, to the number of partition blocks having the given size, for the realworld graphs described above. We see that for all of the datasets, block size shows a powerlaw distribution property. This indicates the need to rethink the design of our base algorithm. Recall that at the end of the map function of Algorithm 2, nodes are distributed to
8 Nodes: Edges: 3 l 1 1 w 2 2 w 2 5 l 2 4 l 3 1 l 4 2 l l w w l l l l w l w l l l l ,{(w,1),(l,2)} 21 1,{(w,1),(l,2)} 32 2,{(l,1)} 42 2,{(l,2)} 52 2,{(l,1)} 62 2,{} 6 nodes Join + Map Reduce Output (a) Task Signature 11 1,{(w,1),(l,2)} 21 1,{(w,1),(l,2)} 32 2,{(l,1)} r2 r2 r ,{(l,1)} 52 2,{(l,1)} 62 2,{} ,{(l,2)} 52 2,{(l,1)} 62 2,{} r2 r1 r ,{(w,1),(l,2)} 21 1,{(w,1),(l,2)} ,{(l,2)} 8 Map Reduce Output (b) Task Identifier Map Reduce Output (c) Task RePartition Fig. 2. Input and output of the three tasks of our algorithm for the example graph of Figure 1 (iteration i = 0). reducers based on their signatures. Since some of the signatures are associated with many nodes (as we see in Figure 3), the workload is inevitably unbalanced. This explains the reducers behavior in Figure 4(a) as well. In the following, we propose several strategies to handle such circumstances. 4.2 Strategy 1: Introduce another task Merge Recall from Section 3.2 that nodes with the same signature must be distributed to the same reducer, for otherwise the assignment of partition block identifiers
9 fraction of PB s having size x x (PB Size) Jamendo LinkedMDB DBLP WikiLinks DBPedia Twitter Fig. 3. Cumulative distribution of partition block size (PB Size) to partition blocks having the given size for realworld graphs cannot be guaranteed to be correct. This could be relaxed, however, if we introduce another task, which we call Merge. Suppose that when we output the partition result (without the guarantee that nodes with the same signature go to the same reducer), for each reducer, we also output a mapping of signatures to partition identifiers. Then in task Merge, we could merge the partition IDs based on the signature value, and output a local pid to global pid mapping. Then in the task RePartition, another mapside join is introduced to replace the local pid with the global pid. Because the signature itself is not correlated with the partition block size, the skew on partition block size should be eliminated. We discuss the implementation details in the following. In N t assume we have an additional field holding the partition size of the node from the previous iteration, named psize old, and let MR Load = We define rand(x) to return a random integer from 0 to x, and % as the modulo operator. N number of reducers. Algorithm 4 modified partition function for task Identifier 1: procedure Identifier getpartition for each key/value pair 2: if psize old > threshold then 3: n = psize old /MR Load numbers of reducers we need 4: return (signature.hashcode()+rand(n)) % number of reducers 5: else 6: return signature.hashcode() % number of reducers We first change the partition function for Identifier (Algorithm 4). In this case, for nodes whose associated psize old are above the threshold, we do not guarantee that they end up in the same reducer, but make sure that they are distributed to at most n reducers. Then we come to the reduce phase for Identifier (Algorithm 5). Here we also output the local partition size (named psize new ) for each node. Then the task Merge (Algorithm 6) will create the mapping between the locally assigned pid new nid and global pid.
10 Algorithm 5 modified reduce function for task Identifier 1: procedure IdentifierReducer(signature, [(nid,pid 0 nid,psize old )]) 2: pid new nid get unique identifier for signature 3: psize new size of [(nid,pid 0 nid, psize old )] 4: for (nid, pid 0 nid, psize old ) in [(nid, pid 0 nid, psize old )] do 5: emit (nid, (pid 0 nid, pid new nid )) 6: emit (signature, (pid new nid, psize new )) Algorithm 6 task Merge 1: procedure MergeMapper(signature, (pid new nid, psize new )) 2: emit (signature, (pid new nid, psize new )) do nothing 1: procedure MergeReducer(signature, [(pid new nid, psize new )]) 2: global pid count 0 3: global pid get unique identifier for signature 4: for (pid new nid, psize new ) in [(pid new nid, psize new )] do 5: global pid count global pid count + psize new 6: emit (global pid, (pid new nid )) the local  global mapping 7: emit (global pid, global pid count) Finally at the beginning of task RePartition, the partition identifiers are unified by a merge join of the local pid  global pid mapping table and N t. We achieve this by distributing the local pid  global pid table to all mappers before the task begins, with the help of Hadoop s distributed cache. Also, global pid count is updated in N t. While this is a general solution for dealing with skew, the extra Merge task introduces additional overhead. In the case of heavy skew, some signatures will produce large map files and performing merging might become a bottleneck. This indicates the need for a more specialized solution to deal with heavy skew. 4.3 Strategy 2: TopK signaturepartition identifier mapping One observation of Figure 3 is that only a few partition blocks are heavily skewed. To handle these outliers, at the end of task Signature, besides emitting N t, we can also emit, for each reducer, an aggregation count of signature appearances. Then a merge is performed among all the counts, to identify the most frequent K signatures and fix signaturepartition identifier mappings for these popular partition blocks. This mapping is small enough to be distributed to all cluster nodes as a global variable by Hadoop, so that when dealing with these signatures, processing time becomes constant. As a result, in task Identifier, nodes with such signatures can be distributed randomly across reducers. There are certain drawbacks to this method. First, the output topk frequent signatures are aggregated from local topk frequent values (with respect to each reducer), but globally we only use these values as an estimation of the real counts. Second, the step where signature counts have to be output and merged becomes a bottleneck of the whole workflow. Last but not least, users have to specify K before processing, which may be either too high or too low.
11 However, in the case of extreme skew on the partition block sizes, for most of the real world datasets, there are only a few partition blocks which delay the whole process, even for very large datasets. So when we adopt this strategy, we can choose a very small K value and still achieve good results, without introducing another MapReduce task. This is validated in Section Empirical analysis Our experiments are executed on the Hadoop cluster at SURFsara in Amsterdam, The Netherlands. 1 This cluster consists of 72 Nodes (66 DataNodes & TaskTrackers and 6 HeadNodes), with each node equipped with 4 x 2TB HDD, 8 core CPU 2.0 GHz and 64GB RAM. In total, there are 528 map and 528 reduce processes, and 460 TB HDFS space. The cluster is running Cloudera CDH3 distribution with Apache Hadoop All algorithms are written in Java 1.6. The datasets we use are described in Section 4.1. A more detailed description of the empirical study can be found in de Lange [7]. 5.1 Impact on workload skew of the TopK strategy Figure 4(b) shows the cluster workload after we create a identifier mapping for the top2 signaturepartitions from Section 4.3. We see that, when compared with Figure 4(a), the skew in running time per reducer is eliminated by the strategy. This means that workload is better distributed, thus lowering the running time per iteration and, in turn, the whole algorithm. Signature Signature Identifier Identifier Repartition Repartition 0 60 Wikilinks DBPedia (a) 0 65 Power Twitter 0 60 Wikilinks DBPedia (b) 0 65 Power Twitter Fig. 4. Workload skewness for base algorithm (a) and for top2 signature strategy (b) 5.2 Overall performance comparison In Figure 5, we present an overall performance comparison for computing 10 bisimulation on each graph with: our base algorithm (Base), the merge (Merge) and topk (TopK Signature) skew strategies, and the state of the art singlemachine external memory algorithm (I/O Efficient) [20]. For the TopK Signature strategy, we set K = 2 which, from our observation in Section 5.1, gives excellent skew reduction with low additional cost. For the Merge optimization we used a threshold of 1 MR load such that each partition block larger than 1 https://www.surfsara.nl
12 the optimal reducer block size is distributed among additional reducers. Furthermore, for each dataset, we choose 3 edge table size 64 MB number of reducers, which has been tested to lead to the minimum elapsed time. We empirically observed that increasing the number of reducers beyond this does not improve performance. Indeed, adding further maps and reducers at this point negatively impacts the running time due to framework overhead. Each experiment was repeated five times, and the data point reported is the average of the middle three measurements. Running time () more than I/O Efficient Algorithm Base Algorithm TopK signatures algorithm Merge algorithm 0 Jamendo LinkedMDB DBLP WikiLinks DBPedia Random * * Tests did not complete Power Twitter Fig. 5. Overall performance comparison We immediately observe from Figure 5 that for all datasets, among the Base algorithm and its two optimizations, there are always at least two solutions which perform better than the I/O efficient solution. For small datasets such as Jamendo and LinkedMDB, this is obvious, since in these cases only 1 or 2 reducers are used, so that the algorithm turns into a singlemachine inmemory algorithm. When the size of the datasets increases, the value of MapReduce becomes more visible, with up to an order of magnitude improvement in the running time for the MapReducebased solutions. We also observe that the skew amelioration strategies give excellent overall performance on these big graphs, with 2 or 3 times of improvement over our Base algorithm in the case of the highly skewed graphs such as DBLP and DBPedia. Finally, we observe that, relative to the topk strategy, the merge skewstrategy is mostly placed at a disadvantage due to its inherent overhead costs. Running time () DBPedia (Base) DBPedia (TopK Signature) DBPedia (Merge) Twitter (Base) Twitter (TopK Signature) Twitter (Merge) Iteration Fig. 6. Performance comparison per iteration for Twitter and DBPedia
13 To further study the different behaviors among the MapReducebased solutions, we plot the running time per iteration of the three solutions for DBPedia and Twitter in Figure 6. We see that for the Twitter dataset, in the first four iterations the skew is severe for the Base algorithm, while the two optimization strategies handle it well. After iteration 5, the overhead of the Merge strategy becomes nonnegligible, which is due to the bigger mapping files the Identifier produces. For the DBPedia dataset, on the other hand, the two strategies perform consistently better than the Base algorithm. Over all, based on our experiments, we note that our algorithm s performance is stable, i.e., essentially constant in running time as the number of maps and reducers is scaled with the input graph size. 6 Concluding remarks In this paper, we have presented, to our knowledge, the first generalpurpose algorithm for effectively computing localized bisimulation partitions of big graphs using the MapReduce programming model. We witnessed a skewed workload during algorithm execution, and proposed two strategies to eliminate such skew from an algorithm design perspective. An extensive empirical study confirmed that our algorithm not only efficiently produces the bisimulation partition result, but also scales well with the MapReduce infrastructure, with an order of magnitude performance improvement over the state of the art on realworld graphs. We close by indicating several interesting avenues for further investigation. First, there are additional basic sources of skew which may impact performance of our algorithm, such as skew on signature sizes and skew on the structure of the bisimulation partition itself. Therefore, further optimizations should be investigated to handle these additional forms of skew. Second, in Section 5.2 we see that all three proposed solutions have their best performance for some dataset, therefore it would be interesting to study the cost model of the MapReduce framework and combine the information (e.g., statistics for data, cluster status) within our algorithm to facilitate more intelligent selection of strategies to use at runtime [1]. Last but not least, our algorithmic solutions for ameliorating skew effects may find successful applications in related research problems (e.g., distributed graph query processing, graph generation, and graph indexing). Acknowledgments. The research of YW is supported by Research Foundation Flanders (FWO) during her sabbatical visit to Hasselt University, Belgium. The research of YL, GF, JH and PD is supported by the Netherlands Organisation for Scientific Research (NWO). We thank SURFsara for their support of this investigation and Boudewijn van Dongen for his insightful comments. We thank the reviewers for their helpful suggestions. References 1. F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a MapReduce computation. CoRR, abs/ , 2012.
14 2. D. A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. 3. S. Blom and S. Orzan. A distributed algorithm for strong bisimulation reduction of state spaces. Int J Softw Tools Technol Transfer, 7:74 86, P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. In Proc. VLDB, pages , Berlin, Germany, A. Clauset, C. Shalizi, and M. Newman. Powerlaw distributions in empirical data. SIAM review, 51(4): , J. Cohen. Graph twiddling in a MapReduce world. Computing in Science Engineering, 11(4):29 41, Y. de Lange. MapReduce based algorithms for localized bisimulation. Master s thesis, Eindhoven University of Technology, J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1): , Jan A. Dovier, C. Piazza, and A. Policriti. An efficient algorithm for computing bisimulation equivalence. Theor. Comp. Sci., 311(13): , W. Fan. Graph pattern matching revised for social network analysis. In Proc. ICDT, pages 8 21, Berlin, Germany, W. Fan, J. Li, X. Wang, and Y. Wu. Query preserving graph compression. In Proc. SIGMOD, pages , Scottsdale, AZ, USA, B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Handling data skew in MapReduce. In Proc. CLOSER, pages , B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in MapReduce based on scalable cardinality estimates. In Proc. ICDE, pages , Hadoop S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi. LEEN: Locality/fairnessaware key partitioning for MapReduce in the cloud. In CloudCom, pages 17 24, R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graphstructured data. In ICDE, pages , San Jose, Y. Kwon, K. Ren, M. Balazinska, and B. Howe. Managing Skew in Hadoop. IEEE Data Eng. Bull., 36(1):24 33, J. Lin and C. Dyer. DataIntensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, J. Lin and M. Schatz. Design patterns for efficient graph algorithms in MapReduce. In Proc. MLG, pages 78 85, Washington, D.C., Y. Luo, G. H. L. Fletcher, J. Hidders, Y. Wu, and P. De Bra. I/Oefficient algorithms for localized bisimulation partition construction and maintenance on massive graphs. CoRR, abs/ , T. Milo and D. Suciu. Index structures for path expressions. In Proc. ICDT, pages , Jerusalem, Israel, R. Paige and R. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 16:973, F. Picalausa, Y. Luo, G. H. L. Fletcher, J. Hidders, and S. Vansummeren. A structural approach to indexing triples. In ESWC, pages , Heraklion, Greece, D. Sangiorgi and J. Rutten. Advanced Topics in Bisimulation and Coinduction. Cambridge University Press, New York, NY, USA, R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive MapReduce using situationaware mappers. In Proc. EDBT, pages , 2012.
Efficient ExternalMemory Bisimulation on DAGs
Efficient ExternalMemory Bisimulation on DAGs Jelle Hellings Hasselt University and transnational University of Limburg Belgium jelle.hellings@uhasselt.be George H.L. Fletcher Eindhoven University of
More informationES 2 : A Cloud Data Storage System for Supporting Both OLTP and OLAP
ES 2 : A Cloud Data Storage System for Supporting Both OLTP and OLAP Yu Cao, Chun Chen,FeiGuo, Dawei Jiang,YutingLin, Beng Chin Ooi, Hoang Tam Vo,SaiWu, Quanqing Xu School of Computing, National University
More informationBigDansing: A System for Big Data Cleansing
BigDansing: A System for Big Data Cleansing Zuhair Khayyat Ihab F. Ilyas Alekh Jindal Samuel Madden Mourad Ouzzani Paolo Papotti JorgeArnulfo QuianéRuiz Nan Tang Si Yin Qatar Computing Research Institute
More informationA survey on platforms for big data analytics
Singh and Reddy Journal of Big Data 2014, 1:8 SURVEY PAPER Open Access A survey on platforms for big data analytics Dilpreet Singh and Chandan K Reddy * * Correspondence: reddy@cs.wayne.edu Department
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke
More informationParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey KyongHa Lee YoonJoon Lee Department of Computer Science KAIST bart7449@gmail.com yoonjoon.lee@kaist.ac.kr Hyunsik Choi Yon Dohn Chung Department of Computer
More informationCorrelation discovery from network monitoring data in a big data cluster
Correlation discovery from network monitoring data in a big data cluster Kim Ervasti University of Helsinki kim.ervasti@gmail.com ABSTRACT Monitoring of telecommunications network is a tedious task. Diverse
More informationScaling Datalog for Machine Learning on Big Data
Scaling Datalog for Machine Learning on Big Data Yingyi Bu, Vinayak Borkar, Michael J. Carey University of California, Irvine Joshua Rosen, Neoklis Polyzotis University of California, Santa Cruz Tyson
More informationWTF: The Who to Follow Service at Twitter
WTF: The Who to Follow Service at Twitter Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh Twitter, Inc. @pankaj @ashishgoel @lintool @aneeshs @dongwang218 @reza_zadeh ABSTRACT
More informationE 3 : an Elastic Execution Engine for Scalable Data Processing
Invited Paper E 3 : an Elastic Execution Engine for Scalable Data Processing Gang Chen 1 Ke Chen 1 Dawei Jiang 2 Beng Chin Ooi 2 Lei Shi 2 Hoang Tam Vo 2,a) Sai Wu 2 Received: July 8, 2011, Accepted: October
More informationExploring Netflow Data using Hadoop
Exploring Netflow Data using Hadoop X. Zhou 1, M. Petrovic 2,T. Eskridge 3,M. Carvalho 4,X. Tao 5 Xiaofeng Zhou, CISE Department, University of Florida Milenko Petrovic, Florida Institute for Human and
More informationEfficient Processing of Joins on Setvalued Attributes
Efficient Processing of Joins on Setvalued Attributes Nikos Mamoulis Department of Computer Science and Information Systems University of Hong Kong Pokfulam Road Hong Kong nikos@csis.hku.hk Abstract Objectoriented
More informationAn Evaluation Study of BigData Frameworks for Graph Processing
1 An Evaluation Study of BigData Frameworks for Graph Processing Benedikt Elser, Alberto Montresor Università degli Studi di Trento, Italy {elser montresor}@disi.unitn.it Abstract When Google first introduced
More informationBenchmarking Cloud Serving Systems with YCSB
Benchmarking Cloud Serving Systems with YCSB Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears Yahoo! Research Santa Clara, CA, USA {cooperb,silberst,etam,ramakris,sears}@yahooinc.com
More informationMizan: A System for Dynamic Load Balancing in Largescale Graph Processing
: A System for Dynamic Load Balancing in Largescale Graph Processing Zuhair Khayyat Karim Awara Amani Alonazi Hani Jamjoom Dan Williams Panos Kalnis King Abdullah University of Science and Technology,
More informationCLoud Computing is the long dreamed vision of
1 Enabling Secure and Efficient Ranked Keyword Search over Outsourced Cloud Data Cong Wang, Student Member, IEEE, Ning Cao, Student Member, IEEE, Kui Ren, Senior Member, IEEE, Wenjing Lou, Senior Member,
More informationStreaming Similarity Search over one Billion Tweets using Parallel LocalitySensitive Hashing
Streaming Similarity Search over one Billion Tweets using Parallel LocalitySensitive Hashing Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden and Pradeep
More informationScalable Progressive Analytics on Big Data in the Cloud
Scalable Progressive Analytics on Big Data in the Cloud Badrish Chandramouli 1 Jonathan Goldstein 1 Abdul Quamar 2 1 Microsoft Research, Redmond 2 University of Maryland, College Park {badrishc, jongold}@microsoft.com,
More informationThe Uncracked Pieces in Database Cracking
The Uncracked Pieces in Database Cracking Felix Martin Schuhknecht Alekh Jindal Jens Dittrich Information Systems Group, Saarland University CSAIL, MIT http://infosys.cs.unisaarland.de people.csail.mit.edu/alekh
More informationScalable Progressive Analytics on Big Data in the Cloud
Scalable Progressive Analytics on Big Data in the Cloud Badrish Chandramouli 1 Jonathan Goldstein 1 Abdul Quamar 2 1 Microsoft Research, Redmond 2 University of Maryland, College Park {badrishc, jongold}@microsoft.com,
More informationChapter 17: DIPAR: A Framework for Implementing Big Data Science in Organizations
Chapter 17: DIPAR: A Framework for Implementing Big Data Science in Organizations Luis Eduardo Bautista Villalpando 1,2, Alain April 2, Alain Abran 2 1 Department of Electronic Systems, Autonomous University
More informationFast InMemory Triangle Listing for Large RealWorld Graphs
Fast InMemory Triangle Listing for Large RealWorld Graphs ABSTRACT Martin Sevenich Oracle Labs martin.sevenich@oracle.com Adam Welc Oracle Labs adam.welc@oracle.com Triangle listing, or identifying all
More informationThe Performance Cost of Virtual Machines on Big Data Problems in Compute Clusters
The Performance Cost of Virtual Machines on Big Data Problems in Compute Clusters Neal Barcelo Denison University Granville, OH 4023 barcel_n@denison.edu Nick Legg Denison University Granville, OH 4023
More informationSolving Big Data Challenges for Enterprise Application Performance Management
Solving Big Data Challenges for Enterprise Application Performance Management Tilmann Rabl Middleware Systems Research Group University of Toronto, Canada tilmann@msrg.utoronto.ca Sergio Gómez Villamor
More informationLoad Balancing and Rebalancing on Web Based Environment. Yu Zhang
Load Balancing and Rebalancing on Web Based Environment Yu Zhang This report is submitted as partial fulfilment of the requirements for the Honours Programme of the School of Computer Science and Software
More informationIEEE/ACM TRANSACTIONS ON NETWORKING 1
IEEE/ACM TRANSACTIONS ON NETWORKING 1 SelfChord: A BioInspired P2P Framework for SelfOrganizing Distributed Systems Agostino Forestiero, Associate Member, IEEE, Emilio Leonardi, Senior Member, IEEE,
More informationLoad Shedding for Aggregation Queries over Data Streams
Load Shedding for Aggregation Queries over Data Streams Brian Babcock Mayur Datar Rajeev Motwani Department of Computer Science Stanford University, Stanford, CA 94305 {babcock, datar, rajeev}@cs.stanford.edu
More informationTrinity: A Distributed Graph Engine on a Memory Cloud
Trinity: A Distributed Graph Engine on a Memory Cloud Bin Shao Microsoft Research Asia Beijing, China binshao@microsoft.com Haixun Wang Microsoft Research Asia Beijing, China haixunw@microsoft.com Yatao
More informationCloud Computing and Big Data Analytics: What Is New from Databases Perspective?
Cloud Computing and Big Data Analytics: What Is New from Databases Perspective? Rajeev Gupta, Himanshu Gupta, and Mukesh Mohania IBM Research, India {grajeev,higupta8,mkmukesh}@in.ibm.com Abstract. Many
More informationClean Answers over Dirty Databases: A Probabilistic Approach
Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University
More information