Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation

Da Yan, James Cheng, Yi Lu
Dept. of Computer Science and Engineering, The Chinese University of Hong Kong
{yanda, jcheng, ylu}@cse.cuhk.edu.hk

Wilfred Ng
Dept. of Computer Science and Engineering, The Hong Kong University of Science and Technology
wilfred@cse.ust.hk

ABSTRACT

Massive graphs, such as online social networks and communication networks, have become common today. To efficiently analyze such large graphs, many distributed graph computing systems have been developed. These systems employ the "think like a vertex" programming paradigm, where a program proceeds in iterations and at each iteration, vertices exchange messages with each other. However, using Pregel's simple message passing mechanism, some vertices may send/receive significantly more messages than others due to either the high degree of these vertices or the logic of the algorithm used. This forms the communication bottleneck and leads to imbalanced workload among machines in the cluster. In this paper, we propose two effective message reduction techniques: (1) vertex mirroring with message combining, and (2) an additional request-respond API. These techniques not only reduce the total number of messages exchanged through the network, but also bound the number of messages sent/received by any single vertex. We theoretically analyze the effectiveness of our techniques, and implement them on top of our open-source Pregel implementation called Pregel+. Our experiments on various large real graphs demonstrate that our message reduction techniques significantly improve the performance of distributed graph computation.

Categories and Subject Descriptors
D.4.7 [Organization and Design]: Distributed systems

General Terms
Performance

Keywords
Pregel; distributed graph computing; graph analytics

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18-22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05. http://dx.doi.org/10.1145/2736277.2741096.

1. INTRODUCTION

With the growing interest in analyzing large real-world graphs such as online social networks, web graphs and semantic web graphs, many distributed graph computing systems [1, 5, 10, 11, 13, 18, 21, 23] have emerged. These systems are deployed in a shared-nothing distributed computing infrastructure usually built on top of a cluster of low-cost commodity PCs. Pioneered by Google's Pregel [13], these systems adopt a vertex-centric computing paradigm, where programmers think naturally like a vertex when designing distributed graph algorithms. A Pregel-like system also takes care of fault recovery and scales to arbitrary cluster size without the need of changing the program code, both of which are indispensable properties for programs running in a cloud environment.

MapReduce [3], and its open-source implementation Hadoop, are also popularly used for large scale graph processing. However, many graph algorithms are intrinsically iterative, such as the computation of PageRank, connected components, and shortest paths. For iterative graph computation, a Pregel program is much more efficient than its MapReduce counterpart [13].

Weaknesses of Pregel. Although Pregel's vertex-centric computing model has been widely adopted in most of the recent distributed graph computing systems [1, 11, 10, 18] (and also inspired the edge-centric model [5]), Pregel's vertex-to-vertex message passing mechanism often causes bottlenecks in communication when processing real-world graphs. To clarify this point, we first briefly review how Pregel performs message passing. In Pregel, a vertex v can send messages to another vertex u if v knows u's vertex ID. In most cases, v only sends messages to its neighbors, whose IDs are available from v's adjacency list. But there also exist Pregel algorithms in which a vertex v may send messages to another vertex that is not a neighbor of v [24, 19].
These algorithms usually adopt pointer jumping (or doubling), a technique that is widely used in designing PRAM algorithms [22], to bound the number of iterations by $O(\log |V|)$, where $|V|$ refers to the number of vertices in the graph.

The problem with Pregel's message passing mechanism is that a small number of vertices, which we call bottleneck vertices, may send/receive many more messages than other vertices. A bottleneck vertex not only generates heavy communication, but also significantly increases the workload of the machine in which the vertex resides, causing highly imbalanced workload among different machines. Bottleneck vertices are common when using Pregel to process real-world graphs, mainly due to either (1) high vertex degree or (2) algorithm logic, which we elaborate on as follows.

We first consider the problem caused by high vertex degree. When a high-degree vertex sends messages to all its neighbors, it becomes a bottleneck vertex. Unfortunately, real-world graphs usually have a highly skewed degree distribution, with some vertices having very high degrees. For example, in the Twitter who-follows-who graph (http://law.di.unimi.it/webdata/twitter-2010/), the maximum degree is over 2.99M while the average degree is only 35. Similarly, in the BTC dataset used in our experiments, the maximum degree is over 1.6M while the average degree is only 4.69. We ran Hash-Min [17, 24], a distributed algorithm for computing connected components (CCs), on the degree-skewed BTC dataset in a cluster with 1 master (Worker 0) and 120 slaves (Workers 1-120), and observed highly imbalanced workload among different workers, which we describe next.

Pregel assigns each vertex to a worker by hashing the vertex ID regardless of the degree of the vertex. As a result, each worker holds approximately the same number of vertices, but the total number of neighbors in the adjacency lists (i.e., number of edges) varies greatly among different workers. In the computation of Hash-Min on BTC, we observed an uneven distribution of edge number among workers, as some workers contain more high-degree vertices than other workers. Since messages are sent along the edges, the uneven distribution of edge number also leads to an uneven distribution of the amount of communication among different workers. In Figure 1, the taller blue bars indicate the total number of messages sent by each worker during the entire computation of Hash-Min, where we observe highly uneven communication workload among different workers.

Bottleneck vertices may also be generated by program logic. An example is the S-V algorithm proposed in [24, 22] for computing CCs, which we will describe in detail in Section 3.4. In S-V, each vertex v maintains a field D[v] which records the vertex that v is to communicate with. The field D[v] may be updated at each iteration as the algorithm proceeds; and when the algorithm terminates, vertices $v_i$ and $v_j$ are in the same CC iff $D[v_i] = D[v_j]$. Thus, during the computation, some vertex u may communicate with many vertices $\{v_1, v_2, \ldots, v_k\}$ in its CC if $u = D[v_i]$, for $1 \le i \le k$. In this case, u becomes a bottleneck vertex. We ran S-V on the USA road network in a cluster with 1 master (Worker 0) and 60 slaves (Workers 1-60), and observed highly imbalanced communication workload among different workers.
In Figure 2, the taller blue bars indicate the total number of messages sent by each worker during the entire computation of S-V, where we can see that the communication workload is very biased (especially at Worker 0). We remark that the imbalanced communication workload is not caused by skewed vertex degree distribution, since the largest vertex degree of the USA road network is merely 9. Rather, it is because of the algorithm logic of S-V. Specifically, since the USA road network is connected, in the last round of S-V, all vertices v have D[v] equal to Vertex 0, indicating that they all belong to the same CC. Since Vertex 0 is hashed to Worker 0, Worker 0 sends many more messages than the other workers, as can be observed from Figure 2.

In addition to the two problems mentioned above, Pregel's message passing mechanism is also not efficient for processing graphs with (relatively) high average degree due to the high overall communication cost. However, many real-world graphs such as social networks and mobile phone networks have relatively high average degree, as a person is often connected to at least dozens of people.

Our Solution. In this paper, we solve the problems caused by Pregel's message passing mechanism with two effective message reduction techniques. The goals are to (1) mitigate the problem of imbalanced workload by eliminating bottleneck vertices, and to (2) reduce the overall number of messages exchanged through the network.

The first technique is called mirroring, which is designed to eliminate bottleneck vertices caused by high vertex degree. The main idea is to construct mirrors of each high-degree vertex in different machines, so that messages from a high-degree vertex are forwarded to its neighbors by its mirrors in local machines. Let d(v) be the degree of a vertex v and M be the number of machines in the cluster; mirroring bounds the number of messages sent by v each time to $\min\{M, d(v)\}$. If v is a high-degree vertex, d(v) can be up to millions, but M is normally only from tens to a few hundred. We remark that ideas similar to mirroring have been adopted by existing systems [11, 18], but we find that mirroring a vertex does not always reduce the number of messages due to Pregel's use of message combiner [13]. Hence, we provide a theoretical analysis on which vertices should be selected for mirroring in Section 5.

[Figure 1: Hash-Min on BTC (with/without mirroring). x-axis: Worker ID; y-axis: number of messages.]

[Figure 2: S-V on USA (with/without request-respond). x-axis: Worker ID; y-axis: number of messages.]

In Figure 1, the short red bars indicate the total number of messages sent by each worker when mirroring is applied to all vertices with degree at least 100. We can clearly see the big difference between the uneven blue bars (without mirroring) and the even-height short red bars (with mirroring). Furthermore, the number of messages is also significantly reduced by mirroring. We remark that the algorithm is still the same and mirroring is completely transparent to users. Mirroring reduces the running time of Hash-Min on BTC from 26.97 seconds to 9.55 seconds.

The second technique is a new request-respond paradigm. We extend the basic Pregel framework with an additional request-respond functionality. A vertex u may request another vertex v for its attribute a(v), and the requested value will be available in the next iteration. The request-respond programming paradigm simplifies the coding of many Pregel algorithms, as otherwise at least three iterations are required to explicitly code each request and response process. More importantly, the request-respond paradigm effectively eliminates the bottleneck vertices resulting from algorithm logic, by bounding the number of response messages sent by any vertex to M.
Consider the S-V algorithm mentioned earlier, where a set of k vertices $\{v_1, v_2, \ldots, v_k\}$ with $D[v_i] = u$ require the value of D[u] from u (thus there are k requests and responses). Under the request-respond paradigm, all the requests from a machine to the same target vertex are merged into one request. Therefore, at most $\min\{M, k\}$ requests are needed for the k vertices and at most $\min\{M, k\}$ responses are sent from u. For large real-world graphs, k is often orders of magnitude greater than M.

In Figure 2, the short red bars indicate the total number of messages sent by each worker when the request-respond paradigm is applied. Again, the skewed message passing represented by the blue bars is now replaced by the even-height short red bars. In particular, Vertex 0 now only responds to the requesting workers instead of all the requesting vertices in the last round, and hence the highly imbalanced workload caused by Vertex 0 in Worker 0 is now evened out. The request-respond paradigm reduces the running time of S-V on the USA road network from 261.9 seconds to 137.7 seconds.
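The $\min\{M, k\}$ bound on merged requests can be checked with a small counting sketch. This is not the Pregel+ implementation: the modulo placement rule `worker_of`, the function name `count_requests`, and the values chosen for M and k are assumptions made only for illustration.

```python
def worker_of(vertex_id, num_workers):
    """Hypothetical placement rule: assign a vertex to a worker by ID modulo M."""
    return vertex_id % num_workers


def count_requests(requesters, num_workers):
    """Return (basic, merged) request counts for vertices that all query one target.

    basic:  one request per requesting vertex (plain vertex-to-vertex messages).
    merged: one request per worker that holds at least one requester.
    The target sends one response per request in both cases.
    """
    basic = len(requesters)
    merged = len({worker_of(v, num_workers) for v in requesters})
    return basic, merged


M = 60                              # number of workers (assumed)
k = 100_000                         # number of vertices requesting the same target (assumed)
requesters = list(range(1, k + 1))
basic, merged = count_requests(requesters, M)
print(basic, merged)                # 100000 60
```

When k is much larger than M, the merged count saturates at M, which is the behaviour the short red bars in Figure 2 reflect.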

[Figure 3: Illustration of combiner]

Finally, we remark that our experiments were run in a cluster without any resource contention, and our optimization techniques are expected to improve the overall performance of Pregel algorithms more significantly if they were run in a public data center, where the network bandwidth is lower and reducing communication overhead becomes more important.

The rest of the paper is organized as follows. We review existing parallel graph computing systems, and highlight the differences of our work from theirs, in Section 2. In Section 3, we describe some Pregel algorithms for problems that are common in social network analysis and web analysis. In Section 4, we introduce the basic communication framework. We present the mirroring technique and the request-respond functionality in Sections 5 and 6. Finally, we report the experimental results in Section 7 and conclude the paper in Section 8.

2. BACKGROUND AND RELATED WORK

We first review Pregel's framework, and then discuss other related distributed graph computing systems.

2.1 Pregel

Pregel [13] is designed based on the bulk synchronous parallel (BSP) model. It distributes vertices to different machines in a cluster, where each vertex v is associated with its adjacency list (i.e., the set of v's neighbors). A program in Pregel implements a user-defined compute() function and proceeds in iterations (called supersteps). In each superstep, the program calls compute() for each active vertex. The compute() function performs the user-specified task for a vertex v, such as processing v's incoming messages (sent in the previous superstep), sending messages to other vertices (to be received in the next superstep), and making v vote to halt. A halted vertex is reactivated if it receives a message in a subsequent superstep. The program terminates when all vertices vote to halt and there is no pending message for the next superstep. Pregel numbers the supersteps so that a user may use the current superstep number when implementing the algorithm logic in the compute() function.
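The superstep semantics described above can be mimicked by a single-process sketch. It is a simulation for illustration only and not Pregel's or Pregel+'s API: the names `run_supersteps` and `min_label`, the signature of the compute callback, and the toy graph are all assumptions. The compute function used here propagates the smallest vertex ID seen so far.

```python
def run_supersteps(adjacency, compute, max_supersteps=100):
    """Simulate supersteps until every vertex has halted and no message is pending."""
    values = {v: None for v in adjacency}
    inbox = {v: [] for v in adjacency}
    active = set(adjacency)                      # every vertex starts active
    superstep = 0
    while superstep < max_supersteps:
        # A halted vertex is reactivated if it received a message.
        to_run = active | {v for v, msgs in inbox.items() if msgs}
        if not to_run:
            break
        superstep += 1
        outbox = {v: [] for v in adjacency}

        def send(target, msg):
            outbox[target].append(msg)

        active = set()
        for v in to_run:
            voted_to_halt = compute(v, values, adjacency[v], inbox[v], superstep, send)
            if not voted_to_halt:
                active.add(v)
        inbox = outbox                           # delivered in the next superstep
    return values, superstep


def min_label(v, values, neighbors, messages, superstep, send):
    """Illustrative compute(): keep and forward the smallest vertex ID seen so far."""
    if superstep == 1:
        values[v] = v
        changed = True
    else:
        smallest = min(messages)                 # v only runs here if it has messages
        changed = smallest < values[v]
        if changed:
            values[v] = smallest
    if changed:
        for u in neighbors:
            send(u, values[v])
    return True                                  # always vote to halt


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
values, steps = run_supersteps(graph, min_label)
print(values)                                    # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Messages written in one superstep are only visible in the next one, which is why `inbox` is swapped with `outbox` at the end of each iteration of the loop.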
Because supersteps are numbered, a Pregel algorithm can perform different operations in different supersteps by branching on the current superstep number.

Message Combiner. Pregel allows users to implement a combine() function, which specifies how to combine messages that are sent from a machine $M_i$ to the same vertex v in a machine $M_j$. These messages are combined into a single message, which is then sent from $M_i$ to v in $M_j$. However, a combiner is applied only when commutative and associative operations are to be applied to the messages. For example, in the PageRank computation, the messages sent to a vertex v are to be summed up to compute v's PageRank value; in this case, we can combine all messages sent from a machine $M_i$ to the same target vertex in a machine $M_j$ into a single message that equals their sum. Figure 3 illustrates the idea of combiner, where the messages sent by vertices in machine $M_1$ to the same target vertex $v_j$ in machine $M_2$ are combined into their sum before sending.

Aggregator. Pregel also supports aggregator, which is useful for global communication. Each vertex can provide a value to an aggregator in compute() in a superstep. The system aggregates those values and makes the aggregated result available to all vertices in the next superstep.

2.2 Pregel-Like Systems in JAVA

Since Google's Pregel is proprietary, many open-source Pregel counterparts have been developed. Most of these systems are implemented in JAVA, e.g., Giraph [1] and GPS [18]. They read the graph data from Hadoop's DFS (HDFS) and write the results to HDFS. However, since object deletion is handled by JAVA's Garbage Collector (GC), if a machine maintains a huge amount of vertex/edge objects in main memory, GC needs to track a lot of objects and the overhead can severely degrade the system performance. To decrease the number of objects being maintained, JAVA-based systems maintain vertices in main memory in their binary representation. For example, Giraph organizes vertices as main memory pages, where each page is simply a byte array object that holds the binary representation of many vertices.
As a result, a vertex needs to be deserialized from the page holding it before calling compute(); and after compute() completes, the updated vertex needs to be serialized back to its page. The serialization cost can be high, especially if the adjacency list is long. To avoid unnecessary serialization cost, a Pregel-like system should be implemented in a language such as C/C++, where programmers (who are system developers, not end users) manage main memory objects themselves. We implemented our Pregel+ system in C/C++.

GPS [18] supports an optimization called large adjacency list partitioning (LALP) to handle high-degree vertices, whose idea is similar to vertex mirroring. However, GPS does not explore the performance tradeoff between vertex mirroring and message combining. Instead, it is claimed in [18] that very small performance difference can be observed whether combiner is used or not, and thus, GPS simply does not perform sender-side message combining. Our experiments in Section 7 show that sender-side message combining significantly reduces the overall running time of Pregel algorithms, and therefore, both vertex mirroring and message combining should be used to achieve better performance. As we shall see in Section 5, vertex mirroring and message combining are two conflicting message reduction techniques, and a theoretical analysis on their performance tradeoff is needed in order to devise a cost model for automatically choosing vertices for mirroring.

2.3 GraphLab and PowerGraph

GraphLab [11] is another parallel graph computing system that follows a design different from Pregel. GraphLab supports asynchronous execution, and adopts a data pulling programming paradigm. Specifically, each vertex actively pulls data from its neighbors, rather than passively receiving messages sent/pushed by its neighbors. This feature is somewhat similar to our request-respond paradigm, but in GraphLab, the requests can only be sent to the neighbors. As a result, GraphLab cannot support parallel graph algorithms where a vertex needs to communicate with a non-neighbor.
Such algorithms are, however, quite popular in Pregel as they make use of the pointer jumping (or doubling) technique of PRAM algorithms to bound the number of iterations by $O(\log |V|)$. Examples include the S-V algorithm for computing CCs [24] and the Pregel algorithm for computing minimum spanning forest [19]. These algorithms can benefit significantly from our request-respond technique. Recently, several studies [8, 12] reported that GraphLab's asynchronous execution is generally slower than its synchronous mode (that simulates Pregel's model) due to the high locking/unlocking overhead. Thus, we mainly focus on Pregel's computing model in this paper.

GraphLab also builds mirrors for vertices, which are called ghosts. However, GraphLab creates mirrors for every vertex regardless of its degree, which leads to excessive space consumption. A more recent version of GraphLab, called PowerGraph [5], partitions the graph by edges rather than by vertices. Edge partitioning mitigates the problem of imbalanced workload as the edges of a high-degree vertex are handled by multiple workers. Accordingly, a new edge-centric Gather-Apply-Scatter (GAS) computing model is used instead of the traditional vertex-centric computing model.

3. PREGEL ALGORITHMS

In this section, we describe some Pregel algorithms for problems that are common in social network analysis and web analysis, which will be used for illustrating important concepts and for performance evaluation. We consider fundamental problems such as (1) computing connected components (or bi-connected components), which is a common preprocessing step for social network analysis [14, 15]; (2) computing minimum spanning tree (or forest), which is useful in mining social relationships [15]; and (3) computing PageRank, which is widely used in ranking web pages [16, 9] and spam detection [7].

For ease of presentation, we first define the graph notations used in the paper. Given an undirected graph $G = (V, E)$, we denote the neighbors of a vertex $v \in V$ by $\Gamma(v)$, and the degree of v by $d(v) = |\Gamma(v)|$; if G is directed, we denote the in-neighbors (out-neighbors) of a vertex v by $\Gamma_{in}(v)$ ($\Gamma_{out}(v)$), and the in-degree (out-degree) of v by $d_{in}(v) = |\Gamma_{in}(v)|$ ($d_{out}(v) = |\Gamma_{out}(v)|$). Each vertex $v \in V$ has a unique integer ID, denoted by id(v). The diameter of G is denoted by $\delta$.

3.1 Attribute Broadcast

We first introduce a Pregel algorithm for attribute broadcast.
Given a directed graph G, where each vertex v is associated with an attribute a(v) and an adjacency list that contains the set of v's out-neighbors $\Gamma_{out}(v)$, attribute broadcast constructs a new adjacency list for each vertex v in G, which is defined as $\{\langle u, a(u)\rangle \mid u \in \Gamma_{out}(v)\}$. Put simply, attribute broadcast associates each neighbor u in the adjacency list of a vertex v with u's attribute a(u).

Attribute broadcast is very useful in distributed graph computation, and it is a frequently performed key operation in many Pregel algorithms. For example, the Pregel algorithm for computing bi-connected components [24] requires relabeling the ID of each vertex u by its preorder number in the spanning tree, denoted by pre(u). Attribute broadcast is used in this case, where a(u) refers to pre(u).

The Pregel algorithm for attribute broadcast consists of 3 supersteps: in superstep 1, each vertex v sends a message $\langle v \rangle$ to each neighbor $u \in \Gamma_{out}(v)$ to request a(u); then in superstep 2, each vertex u obtains the requesters v from the incoming messages, and sends the response message $\langle u, a(u)\rangle$ to each requester v; finally in superstep 3, each vertex v collects the incoming messages to construct its new adjacency list.

3.2 PageRank

Next we present a Pregel algorithm for PageRank computation. Given a directed web graph $G = (V, E)$, where each vertex (page) v links to a list of pages $\Gamma_{out}(v)$, the problem is to compute the PageRank, pr(v), of each vertex $v \in V$.

Pregel's PageRank algorithm [13] works as follows. In superstep 1, each vertex v initializes $pr(v) = 1/|V|$ and distributes the value $pr(v)/d_{out}(v)$ to each out-neighbor of v. In superstep i ($i > 1$), each vertex v sums up the received values from its in-neighbors, denoted by sum, and computes $pr(v) = 0.15/|V| + 0.85 \times sum$. It then distributes $pr(v)/d_{out}(v)$ to each of its out-neighbors.

[Figure 4: Forest structure of the S-V algorithm]

[Figure 5: Key operations of the S-V algorithm]

3.3 Hash-Min

We next present a Pregel algorithm for computing connected components (CCs) in an undirected graph. We adopt the Hash-Min algorithm [17, 24].
Given a CC C, let us denote the set of vertices of C by V(C), and define the ID of C to be $id(C) = \min\{id(v) : v \in V(C)\}$. We further define the color of a vertex v as $cc(v) = id(C)$, where $v \in V(C)$. Hash-Min computes cc(v) for each vertex $v \in V$, and the idea is to broadcast the smallest vertex ID seen so far by each vertex v, denoted by min(v). When the algorithm terminates, $min(v) = cc(v)$ for each vertex $v \in V$.

We now describe the Hash-Min algorithm in the Pregel framework. In superstep 1, each vertex v sets min(v) to be id(v), broadcasts min(v) to all its neighbors, and votes to halt. In superstep i ($i > 1$), each vertex v receives messages from its neighbors; let $min^*$ be the smallest ID received; if $min^* < min(v)$, v sets $min(v) = min^*$ and broadcasts $min^*$ to its neighbors. All vertices vote to halt at the end of a superstep. When the process converges, all vertices have voted to halt and for each vertex v, we have $min(v) = cc(v)$.

3.4 The S-V Algorithm

The Hash-Min algorithm described in Section 3.3 requires $O(\delta)$ supersteps [24], which can be slow for computing CCs in large-diameter graphs. Another Pregel algorithm proposed in [24] computes CCs in $O(\log |V|)$ supersteps, by adapting Shiloach-Vishkin's (S-V) algorithm for the PRAM model [22]. We use this algorithm to demonstrate how algorithm logic generates a bottleneck vertex v even if d(v) is small.

In the S-V algorithm, each vertex u maintains a pointer D[u], which is initialized as u, forming a self loop as shown in Figure 4(a). During the computation, vertices are organized into a forest such that all vertices in a tree belong to the same CC. The tree definition is relaxed a bit here to allow the tree root w to have a self-loop, i.e., $D[w] = w$ (see Figures 4(b) and 4(c)); while D[v] of any other vertex v in the tree points to v's parent.

The S-V algorithm proceeds in rounds, and in each round, the pointers are updated in three steps (illustrated in Figure 5): (1) tree hooking: for each edge (u, v), if u's parent $w = D[u]$ is a tree root, hook w as a child of v's parent D[v], i.e., set $D[D[u]] = D[v]$; (2) star hooking: for each edge (u, v), if u is in a star (see Figure 4(c) for an example of a star), hook the star to v's tree as in Step (1), i.e., set $D[D[u]] = D[v]$; (3) shortcutting: for each vertex v, move vertex v and its descendants closer to the tree root, by hooking v to the parent of v's parent, i.e., setting $D[v] = D[D[v]]$. The above three steps execute in rounds, and the algorithm ends when every vertex is in a star.

Due to the shortcutting operation, the S-V algorithm creates flattened trees (e.g., stars) with large fan-out towards the end of the execution. As a result, a vertex w may have many children u (i.e., $D[u] = w$), and each of these children u requests w for the value of D[w]. This renders w a bottleneck vertex. In particular, in the last round of the S-V algorithm, all vertices v in a CC C have $D[v] = id(C)$, and they all send requests to the vertex $w = id(C)$ for D[w]. In the basic Pregel framework, w receives $|V(C)|$ requests and sends $|V(C)|$ responses, which leads to skewed workload when $|V(C)|$ is large.

3.5 Minimum Spanning Forest

The Pregel algorithm proposed by [19] for minimum spanning forest (MSF) computation is another example that shows how algorithm logic can generate bottleneck vertices. This algorithm proceeds in iterations, where each iteration consists of three steps, which we describe below.

In Step (1), each vertex v picks an edge with the minimum weight. The vertices and their picked edges form disjoint subgraphs, each of which is a conjoined-tree: two trees with their roots joined by a cycle. Figure 6 illustrates the concept of a conjoined-tree, where the edges are those picked in Step (1). The vertex with the smaller ID in the cycle of a conjoined-tree is called the supervertex of the tree (e.g., vertex 5 is the supervertex in Figure 6), and the other vertices are called the subvertices.

[Figure 6: Conjoined Tree]
In Step (2), each vertex finds the supervertex of the conjoined-tree it belongs to, which is accomplished by pointer jumping. Specifically, each vertex v maintains a pointer D[v]; suppose that v picks edge (v, u) in Step (1), then the value of D[v] is initialized as u. Each vertex v then sends a request to $w = D[v]$ for D[w]. Initially, the actual supervertex s (e.g., vertex 5 in Figure 6) and its neighbor $s'$ in the cycle (e.g., vertex 6 in Figure 6) see that they have sent each other messages and detect that they are in the cycle. Vertex s then sets itself as the supervertex (i.e., sets $D[s] = s$) due to $s < s'$, before responding $D[s] = s$ to the requesters (while $D[s'] = s$ remains for $s'$ since $s' > s$). For any other vertex v, it receives response D[w] from $w = D[v]$ and updates D[v] to be D[w]. This process is repeated until convergence, at which point D[v] records the supervertex s for all vertices v.

In Step (3), each vertex v sends a request to each neighbor $u \in \Gamma(v)$ for its supervertex D[u], and removes edge (v, u) if $D[v] = D[u]$ (i.e., v and u are in the same conjoined-tree); v then sends the remaining edges (to vertices in other conjoined-trees) to the supervertex D[v]. After this step, all subvertices are condensed into their supervertex, which constructs an adjacency list of edges to the other supervertices from those edges sent by its subvertices.

We consider an improved version of the above algorithm that applies the Storing-Edges-At-Subvertices (SEAS) optimization of [19]. Specifically, instead of having the supervertex merge and store all cross-tree edges, the SEAS optimization stores the edges of a supervertex in a distributed fashion among all of its subvertices. As a result, if a supervertex s is merged into another supervertex, it has to notify its subvertices of the new supervertex they belong to. This is accomplished by having each vertex v send a request to its supervertex $D[v] = s$ for D[s].
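Steps (1) and (2) of the MSF algorithm described above can be sketched sequentially as follows. This is an illustrative single-machine simulation rather than the distributed algorithm of [19]: the function names, the dictionary-based graph representation, and the tie-breaking rule on equal weights are assumptions of the sketch, and every vertex is assumed to have at least one incident edge.

```python
def pick_min_edges(weighted_adj):
    """Step (1): every vertex picks its minimum-weight incident edge.

    weighted_adj maps a vertex to {neighbor: weight}. Ties are broken by the
    endpoint IDs of the edge so that all vertices agree on one total order of
    edges (an assumption of this sketch).
    """
    picked = {}
    for v, nbrs in weighted_adj.items():
        picked[v] = min(nbrs, key=lambda u: (nbrs[u], min(u, v), max(u, v)))
    return picked


def find_supervertices(picked):
    """Step (2): resolve the supervertex of every vertex by pointer jumping."""
    D = dict(picked)
    # The two vertices on the cycle picked each other; the smaller ID becomes the root.
    for v, u in picked.items():
        if picked[u] == v and v < u:
            D[v] = v
    rounds = 0
    while True:
        jumped = {v: D[D[v]] for v in D}     # all vertices jump simultaneously
        if jumped == D:
            return D, rounds
        D = jumped
        rounds += 1


# Path 1 - 2 - 3 - 4 with weights 3.0, 1.0, 2.0 (toy input).
graph = {
    1: {2: 3.0},
    2: {1: 3.0, 3: 1.0},
    3: {2: 1.0, 4: 2.0},
    4: {3: 2.0},
}
picked = pick_min_edges(graph)
D, rounds = find_supervertices(picked)
print(picked)   # {1: 2, 2: 3, 3: 2, 4: 3}
print(D)        # {1: 2, 2: 2, 3: 2, 4: 2}
```

In the distributed setting each lookup `D[D[v]]` is a request sent to vertex D[v], so a supervertex with many subvertices answers one request per subvertex in every round.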
Since smaller conjoined-trees are merged into larger ones, a supervertex s may have many subvertices v towards the end of the execution, and they all request D[s] from s, rendering s a bottleneck vertex.

4. BASIC COMMUNICATION FRAMEWORK

When considering on which system we should implement our message reduction techniques, we decided to implement a new open-source Pregel system in C/C++, called Pregel+, to avoid the pitfalls of a JAVA-based system described in Section 2.2. Other reasons for a new Pregel implementation include: (1) Giraph has been shown to have inferior performance in recent performance evaluation of graph-parallel systems [2, 4, 6, 8, 20]; (2) GPS does not perform sender-side message combining, while our work studies effective message reduction techniques in a system that adheres to Pregel's framework, where message combining is supported; (3) other systems such as GraphLab and PowerGraph are also not suitable as discussed in Section 2.3.

We first introduce the basic communication framework of Pregel+. Our two new message reduction techniques to be introduced in Sections 5 and 6 further extend the basic communication framework.

We use the term worker to represent a computing unit, which can be a machine or a thread/process in a machine. For ease of discussion, we assume that each machine runs only one worker but the concepts can be straightforwardly generalized.

In Pregel+, each worker is simply an MPI (Message Passing Interface) process and communications among different processes are implemented using MPI's communication primitives. Each worker maintains a message channel, $Ch_{msg}$, for exchanging the vertex-to-vertex messages. In the compute() function, if a vertex sends a message msg to a target vertex $v_{tgt}$, the message is simply added to $Ch_{msg}$. Like in Google's Pregel, messages in $Ch_{msg}$ are sent to the target workers in batches before the next superstep begins.
Note that if a message msg is sent from worker $M_i$ to vertex $v_{tgt}$ in worker $M_j$, the ID of the target $v_{tgt}$ should be sent along with msg, so that when $M_j$ receives msg, it knows which vertex msg should be directed to.

The operation of the message channel $Ch_{msg}$ is directly related to the communication cost and hence affects the overall performance of the system. We tested different ways of implementing $Ch_{msg}$, and the most efficient one is presented in Figure 7.

We assume that a worker maintains N vertices, $\{v_1, v_2, \ldots, v_N\}$. The message channel $Ch_{msg}$ associates each vertex $v_i$ with an incoming message buffer $I_i$. When an incoming message $msg_1$ directed to vertex $v_i$ arrives, $Ch_{msg}$ looks up a hash table $T_{in}$ for the incoming message buffer $I_i$ using $v_i$'s ID. It then appends $msg_1$ to the end of $I_i$. The lookup table $T_{in}$ is static unless graph mutation occurs, in which case updates to $T_{in}$ may be required. Once all incoming messages are processed, compute() is called for each active vertex $v_i$ with the messages in $I_i$ as the input.

A worker also maintains M outgoing message buffers (where M is the number of workers), one for each worker $M_j$ in the cluster, denoted by $O_j$. In compute(), a vertex $v_i$ may send a message $msg_2$ to another vertex with ID tgt. Let hash(.) be the hash function that computes the worker ID of a vertex from its vertex ID; then the target vertex is in worker $M_{hash(tgt)}$. Thus, $msg_2$ (along with tgt) is appended to the end of the buffer $O_{hash(tgt)}$. Messages in each buffer $O_j$ are sent to worker $M_j$ in batch. If a combiner is used, the messages in a buffer $O_j$ are first grouped (sorted) by target vertex IDs, and messages in each group are combined into one message using the combiner logic before sending.

[Figure 7: Illustration of Message Channel, $Ch_{msg}$]

5. THE MIRRORING TECHNIQUE

The mirroring technique is designed to eliminate bottleneck vertices caused by high vertex degree. Given a high-degree vertex v, we construct a mirror for v in any worker in which some of v's neighbors reside. When v needs to send a message, e.g., the value of its attribute, a(v), to its neighbors, v sends a(v) to its mirrors. Then, each mirror forwards a(v) to the neighbors of v that reside in the same local worker as the mirror, without any message passing.

[Figure 8: Illustration of Mirroring]

[Figure 9: Mirroring vs. Message Combining]

Figure 8 illustrates the idea of mirroring. Assume that $u_i$ is a high-degree vertex residing in worker machine $M_1$, and $u_i$ has neighbors $\{v_1, v_2, \ldots, v_j\}$ residing in machine $M_2$ and neighbors $\{w_1, w_2, \ldots, w_k\}$ residing in machine $M_3$. Suppose that $u_i$ needs to send a message $a(u_i)$ to the j neighbors in $M_2$ and k neighbors in $M_3$. Figure 8(a) shows how $u_i$ sends $a(u_i)$ to its neighbors in $M_2$ and $M_3$ using Pregel's vertex-to-vertex message passing. In total, $(j + k)$ messages are sent, one for each neighbor. To apply mirroring, we construct a mirror for $u_i$ in $M_2$ and $M_3$, as shown by the two squares (with label $u_i$) in Figure 8(b). In this way, as illustrated in Figure 8(b), $u_i$ only needs to send $a(u_i)$ to the two mirrors in $M_2$ and $M_3$. Then, each mirror forwards $a(u_i)$ to $u_i$'s neighbors locally in $M_2$ and $M_3$ without any network communication.
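The message counts in the Figure 8 example can be reproduced with a short counting sketch. The function names and the concrete values of j and k are assumptions made for illustration; the sketch only counts network messages and does not model the local forwarding done by each mirror.

```python
def messages_without_mirroring(neighbor_workers, home):
    """Vertex-to-vertex passing: one network message per neighbor on a remote worker."""
    return sum(1 for w in neighbor_workers if w != home)


def messages_with_mirroring(neighbor_workers, home):
    """Mirroring: one network message per remote worker holding at least one neighbor."""
    return len({w for w in neighbor_workers if w != home})


j, k = 1000, 2000                       # neighbors of u_i in M_2 and M_3 (assumed sizes)
neighbor_workers = [2] * j + [3] * k    # worker ID of each neighbor of u_i
print(messages_without_mirroring(neighbor_workers, home=1))   # 3000
print(messages_with_mirroring(neighbor_workers, home=1))      # 2
```

The second count can never exceed the number of workers or the number of neighbors, whichever is smaller.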
In total, only two messages are sent through the network, which not only tremendously reduces the communication cost, but also eliminates the imbalanced communication load caused by u_i. We formalize the effectiveness of mirroring for message reduction by the following theorem.

THEOREM 1. Let d(v) be the degree of a vertex v and M be the number of machines. Suppose that v is to deliver a message a(v) to all its neighbors in one superstep. If mirroring is applied on v, then the total number of messages sent by v in order to deliver a(v) to all its neighbors is bounded by min{M, d(v)}.

PROOF. The proof follows directly from the fact that v only needs to send one message a(v) to each of its mirrors in other machines and there are at most min{M, d(v)} mirrors of v.

Mirroring Threshold. The mirroring technique is transparent to programmers. But we can allow users to specify a mirroring threshold τ such that mirroring is applied to a vertex v only if d(v) ≥ τ (we will see shortly that τ can be automatically set by a cost model following the result of Theorem 2). If a vertex has degree less than τ, it sends messages through the normal message channel Ch_msg as usual. Otherwise, the vertex only sends messages to its mirrors, and we call this message channel the mirroring message channel, or Ch_mir in short. In a nutshell, a message is sent either through Ch_msg or Ch_mir, depending on the degree of the sending vertex.

Figure 9 illustrates the concepts of Ch_msg and Ch_mir, where we only consider the message passing between two machines M_1 and M_2. The adjacency lists of vertices u_1, u_2, u_3 and u_4 in M_1 are shown in Figure 9(a), and we consider how they send messages to their common neighbor v_2 residing in machine M_2. Assume that τ = 3; then, as Figure 9(b) shows, u_1, u_2 and u_3 send their messages, a(u_1), a(u_2) and a(u_3), through Ch_msg, while u_4 sends its message a(u_4) through Ch_mir.

Mirroring vs. Message Combining.
Now let us assume that the messages are to be applied with commutative and associative operations at the receivers' side, e.g., the message values are to be summed up as in PageRank computation. In this case, a combiner can be applied on the message channel Ch_msg. However, the receiver-centric message combining is not applicable to the sender-centric channel Ch_mir. For example, in Figure 9(b), when u_4 in M_1 sends a(u_4) to its mirror in M_2, u_4 does not need to know the receivers (i.e., v_1, v_2, v_3 and v_4); thus, its message to v_2 cannot be combined with those messages from u_1, u_2 and u_3 that are also to be sent to v_2. In fact, u_4 only holds a list of the machines that contain u_4's neighbors, i.e., {M_2} in this example, and u_4's neighbors v_1, v_2, v_3 and v_4 that are local to M_2 are connected by u_4's mirror in M_2.

It may appear that u_4's message to its mirror is wasted, because if we combine u_4's message with those messages from u_1, u_2 and u_3, then we do not need to send it through Ch_mir. However, we note that a high-degree vertex like u_4 often has many neighbors in another worker machine, e.g., v_1, v_3 and v_4 in addition to v_2 in this example, and the message is not wasted since the message is also forwarded to v_3 and v_4, which are not the neighbors of any other vertex in M_1.

Choice of Mirroring Threshold. The above discussion shows that there are cases where mirroring is useful, but it does not give any formal guideline as to when exactly mirroring should be applied.

To this end, we conduct a theoretical analysis below on the interplay between mirroring and message combining. Our result shows that mirroring is effective even when a message combiner is used.

THEOREM 2. Given a graph G = (V, E) with n = |V| vertices and m = |E| edges, we assume that the vertex set is evenly partitioned among M machines (e.g., by hashing as in Pregel) and each machine holds n/M vertices. We further assume that the neighbors of a vertex in G are randomly chosen among V, and the average degree deg_avg = m/n is a constant. Then, mirroring should be applied to a vertex v if v's degree is at least M · exp{deg_avg/M}.

PROOF. Consider a machine M_i that contains a set of n/M vertices, V_i = {v_1, v_2, ..., v_{n/M}}, where each vertex v_j has l_j neighbors for 1 ≤ j ≤ n/M. Consider a specific vertex v_j in M_i, and infer how large l_j should be so that applying mirroring on v_j can reduce the overall communication even when a combiner is used.

Consider an application where all vertices send messages to all their neighbors in each superstep, such as in PageRank computation. Further consider a vertex u ∈ Γ_out(v_j). If another vertex v_k ∈ V_i \ {v_j} sends messages through Ch_msg and v_k also has u as its neighbor, then v_j's message to u is wasted since it can be combined with v_k's message to u. We assume the worst case where all vertices in V_i \ {v_j} send messages through Ch_msg. Since the neighbors of a vertex in G are randomly chosen among V, we have

$$\Pr\{u \in \Gamma_{out}(v_k)\} = \frac{l_k}{n},$$

and therefore,

$$\Pr\{v_j\text{'s message to } u \text{ is not wasted}\} = \prod_{v_k \in V_i \setminus \{v_j\}} \Pr\{u \notin \Gamma_{out}(v_k)\} = \prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{l_k}{n}\right).$$

We regard each l_k as a random variable whose value is chosen independently from a degree distribution (e.g., a power-law degree distribution) with expectation E[l_k] = m/n = deg_avg. Since V_i \ {v_j} contains n/M - 1 vertices, the expectation of the above equation is given by

$$E\left[\prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{l_k}{n}\right)\right] = \prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{E[l_k]}{n}\right) = \prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{deg_{avg}}{n}\right) = \left(1 - \frac{deg_{avg}}{n}\right)^{n/M - 1}.$$

For large graphs, we have

$$\Pr\{v_j\text{'s message to } u \text{ is not wasted}\} \approx \lim_{n \to \infty} E\left[\prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{l_k}{n}\right)\right] = \lim_{n \to \infty} \left(1 - \frac{deg_{avg}}{n}\right)^{n/M - 1} = \exp\left\{-\frac{deg_{avg}}{M}\right\},$$

where the last step is derived from $\lim_{n \to \infty} (1 - 1/n)^n = e^{-1}$.
According to the above discussion, the expected number of v_j's neighbors that are not the neighbors of any other vertex in M_i is equal to l_j · exp{-deg_avg/M}. In other words, if mirroring is not used, v_j needs to send at least l_j · exp{-deg_avg/M} messages that are not wasted. On the other hand, if mirroring is used, v_j sends at most M messages, one to each mirror. Therefore, mirroring reduces the number of messages if l_j · exp{-deg_avg/M} ≥ M, or equivalently, l_j ≥ M · exp{deg_avg/M}. To conclude, choosing τ = M · exp{deg_avg/M} as the degree threshold reduces the communication cost.

Theorem 2 states that the choice of τ depends on the number of workers, M, and the average vertex degree, deg_avg. A cluster usually involves tens to hundreds of workers, while the average degree deg_avg of a large real-world graph is mostly below 50. Consider the scenario where M = 100 and deg_avg ≤ 50; then τ ≤ 100 · e^0.5 ≈ 165. This shows that mirroring is effective even for vertices whose degree is not very high.

We remark that Theorem 2 makes some simplifying assumptions (e.g., G being a random graph) for ease of analysis, which may not be accurate for a real graph. However, our experiments in Section 7.1 show that Theorem 2 is effective on real graphs.

Mirror Construction. Pregel+ constructs mirrors for all vertices v with |Γ_out(v)| ≥ τ after the input graph is loaded and before the iterative computation, although mirror construction can also be pre-computed offline like GraphLab's ghost construction. Specifically, the neighbors in v's adjacency list Γ_out(v) are grouped by the workers in which they reside. Each group is defined as N_i = {u ∈ Γ_out(v) | hash(u) = M_i}. Then, for each group N_i, v sends <v; N_i> to worker M_i, and M_i constructs a mirror of v with the adjacency list N_i locally in M_i. For each vertex v_j ∈ N_i, the mirror also stores the address of v_j's incoming message buffer I_j so that messages can be directly forwarded to v_j by v's mirror in M_i. During graph computation, a vertex v sends message <v, a(v)> to its mirror in worker M_i.
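The threshold formula and the grouping step of mirror construction can be sketched together as follows. This is a minimal illustration under stated assumptions rather than the Pregel+ implementation: `mirroring_threshold` and `build_mirrors` are hypothetical names, and a modulo rule stands in for hash(.).

```python
import math
from collections import defaultdict


def mirroring_threshold(num_workers: int, avg_degree: float) -> float:
    """tau = M * exp(deg_avg / M), the degree bound stated in Theorem 2."""
    return num_workers * math.exp(avg_degree / num_workers)


def build_mirrors(adjacency, num_workers, tau):
    """Group each high-degree vertex's out-neighbors by hosting worker.

    Returns {worker: {v: N_i}}, where N_i is the part of v's adjacency
    list that resides on that worker. The modulo hash is an assumption.
    """
    mirrors = defaultdict(dict)
    for v, out_neighbors in adjacency.items():
        if len(out_neighbors) < tau:
            continue  # low-degree vertices keep using the normal channel
        groups = defaultdict(list)
        for u in out_neighbors:
            groups[u % num_workers].append(u)
        for worker, group in groups.items():
            mirrors[worker][v] = group
    return mirrors


tau = mirroring_threshold(100, 50)        # the M = 100, deg_avg = 50 scenario
toy = {0: [1, 2, 3, 4, 5, 6], 7: [8]}     # toy graph on 3 workers
toy_mirrors = build_mirrors(toy, num_workers=3, tau=4)
```

For M = 100 and deg_avg = 50 the threshold rounds to 165. In the toy graph, only vertex 0 (degree 6) receives mirrors, one per worker that hosts some of its neighbors.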
On receiving the message, M_i looks up v's mirror from a hash table using v's ID (similar to T_in described in Section 4). The message value a(v) is then forwarded to the incoming message buffers of v's neighbors locally in M_i.

Handling Edge Fields. There are some minor changes to Pregel's programming interface for applying mirroring. In Pregel's interface, a vertex calls send_msg(tgt, msg) to send an arbitrary message msg to a target vertex tgt. With mirroring, a vertex v sends a message containing the value of its attribute a(v) to all its neighbors by calling broadcast(a(v)) instead of calling send_msg(u, a(v)) for each neighbor u ∈ Γ_out(v). Consider the algorithms described in Section 3. For PageRank, a vertex v simply calls broadcast(pr(v)/|Γ_out(v)|); while for Hash-Min, v calls broadcast(min(v)).

However, there are applications where the message value is not only decided by the sender vertex v's state, but also by the edge that the message is sent along. For example, in Pregel's algorithm for single-source shortest path (SSSP) computation [13], a vertex sends (d(v) + l(v, u)) to each neighbor u ∈ Γ_out(v), where d(v) is an attribute of v estimating the distance from the source, and l(v, u) is an attribute of its out-edge (v, u) indicating the edge length.

To support applications like SSSP, Pregel+ requires that each edge object supports a function relay(msg), which specifies how to update the value of msg before msg is added to the incoming message buffer I_i of the target vertex v_i. If msg is sent through Ch_msg, relay(msg) is called on the sender side before sending. If msg is sent through Ch_mir, relay(msg) is called on the receiver side when the mirror forwards msg to each local neighbor (as the edge field is maintained by the mirror). For example, in Figure 9, relay(msg) is called when msg is passed along a dashed arrow. By default, relay(msg) does not change the value of msg.
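The receiver-side use of relay(msg) can be sketched as follows. The classes and the `forward_from_mirror` helper are hypothetical stand-ins written for this illustration; only the idea of a per-edge relay hook that is the identity by default comes from the text.

```python
class Edge:
    """Out-edge to `target`; relay() leaves the message unchanged by default."""

    def __init__(self, target):
        self.target = target

    def relay(self, msg):
        return msg


class WeightedEdge(Edge):
    """Edge with a length; relay() adds it, as SSSP needs d(v) + l(v, u)."""

    def __init__(self, target, length):
        super().__init__(target)
        self.length = length

    def relay(self, msg):
        return msg + self.length


def forward_from_mirror(mirror_edges, msg, inboxes):
    """Receiver side of Ch_mir: apply relay() per edge, then deliver locally."""
    for edge in mirror_edges:
        inboxes.setdefault(edge.target, []).append(edge.relay(msg))


# A mirror of v holds two local out-edges; v broadcasts d(v) = 10 once.
inboxes = {}
forward_from_mirror([WeightedEdge("u1", 2), WeightedEdge("u2", 5)], 10, inboxes)

# With the default relay(), the broadcast value arrives unchanged.
plain_inboxes = {}
forward_from_mirror([Edge("u1")], 10, plain_inboxes)
```

A single broadcast value thus reaches each local neighbor with its own edge length already added, so the sender never needs to enumerate its edges.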
To support SSSP, a vertex v calls broadcast(d(v)) in compute(), and meanwhile, the function relay(msg) is overloaded to add the edge length l(v, u) to msg, which updates the value of msg to the required value (d(v) + l(v, u)).

Summary of Contributions. GPS does not use message combining, and therefore, its LALP technique is not as effective as our mirroring technique that is reinforced with a message combiner. GraphLab's ghost vertex technique creates mirrors for all vertices regardless of the vertex degree, and thus it is also not as effective as our mirroring technique. As far as we know, this is the first work that considers the integration of vertex mirroring and message combining in Pregel's computing model. In addition, we also identified the tradeoff between vertex mirroring and message combining in message reduction, and provided a cost model to automatically select vertices for mirroring so as to minimize the number of messages. As we shall see in our experiments in Section 7.1, the mirroring threshold computed by our cost model in Theorem 2 achieves near-optimal performance. In addition, we also cope with the case where the message value depends on the edge field, which is not supported by GPS's LALP technique.

6. THE REQUEST-RESPOND PARADIGM

In Sections 1, 3.4 and 3.5, we have shown that bottleneck vertices can be generated by algorithm logic even if the input graph has no high-degree vertices. For handling such bottleneck vertices, the mirroring technique of Section 5 is not effective. To this end, we design our second message reduction technique, which extends the basic Pregel framework with a new request-respond functionality.

We illustrate the concept using the algorithms described in Section 3. Using the request-respond API, attribute broadcast in Section 3.1 is straightforward to implement: in superstep 1, each vertex v sends requests to each neighbor u ∈ Γ_out(v) for a(u); in superstep 2, the vertex v simply obtains a(u) responded by each neighbor u, and constructs Γ_out(v). Similarly, for the S-V algorithm in Section 3.4, when a vertex v needs to obtain D[w] from vertex w = D[v], it simply sends a request to w so that D[w] can be used in the next superstep; for the MSF algorithm in Section 3.5, a vertex v simply sends a request to its supervertex D[v] = s so that D[s] can be used to update D[v] in the next superstep.

Request-Respond Message Channel. We now explain in detail how Pregel+ supports the request-respond API. The request-respond paradigm supports all the functionality of Pregel.
In addition, it supplements the vertex-to-vertex message channel Ch_msg with a request-respond message channel, denoted by Ch_req.

Figure 10 illustrates how requests and responses are exchanged between two machines M_i and M_j through Ch_req. Specifically, each machine maintains M request sets, where M is the number of machines, and each request set S^to_k stores the requests to vertices in machine M_k. In a superstep, a vertex v in machine M_j may call request(u) in its compute() function to send a request to vertex u for its attribute value a(u) (which will be used in the next superstep). Let hash(u) = i; then the requested vertex u is in machine M_i, and hence u is added to the request set S^to_i of M_j. Although many vertices in M_j may send a request to u, only one request to u will be sent from M_j to M_i since S^to_i is a (hash) set that eliminates redundant elements.

After compute() is called for all active vertices, the vertex-to-vertex messages are first exchanged through Ch_msg. Then, each machine sends each request set S^to_k to machine M_k. After the requests are exchanged, each machine receives M request sets, where set S^from_k stores the requests sent from machine M_k. In the example shown in Figure 10, u is contained in the set S^from_j in machine M_i, since vertex v in machine M_j sent a request to u.

Then, a response set R^to_k is constructed for each request set S^from_k received, which is to be sent back to machine M_k. In our example, the requested vertex, u ∈ S^from_j, calls a user-specified function respond() to return its specified attribute a(u), and adds the entry <u, a(u)> to the response set R^to_j.

Once the response sets are exchanged, each machine constructs a hash table from the received entries. In the example shown in Figure 10, the entry <u, a(u)> is received by machine M_j since it is in the response set R^to_j in machine M_i. The hash table is available for the next superstep, where vertices can access their requested value in their compute() function.

Figure 10: Illustration of request-respond paradigm
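The exchange described above can be simulated in a few lines. This is a simplified single-process sketch, not the Pregel+ channel: the function name, the data layout and the modulo hash are assumptions, and a "message" is counted as one request or response entry that crosses machines.

```python
from collections import defaultdict


def exchange_requests(requests, attributes, num_machines):
    """Simulate one round of the request-respond channel.

    requests:   list of (requester, requested) vertex-ID pairs
    attributes: {vertex_id: attribute value a(u)}
    Returns (response_tables, network_messages); response_tables[j] maps
    each requested vertex to its attribute, as seen by machine j.
    """
    def machine_of(vertex_id):
        return vertex_id % num_machines  # assumed hash function

    # Step 1: machine j keeps one deduplicated request set per target machine i.
    request_sets = defaultdict(set)  # (j, i) -> {u, ...}
    for requester, requested in requests:
        key = (machine_of(requester), machine_of(requested))
        request_sets[key].add(requested)

    # Step 2: machine i answers each received set with <u, a(u)> entries.
    response_tables = defaultdict(dict)  # j -> {u: a(u)}
    network_messages = 0
    for (j, i), wanted in request_sets.items():
        for u in wanted:
            response_tables[j][u] = attributes[u]
            if i != j:
                network_messages += 2  # one request plus one response
    return response_tables, network_messages


# Vertices 1, 4, 7 (machine 1) and vertex 2 (machine 2) all request a(3);
# vertex 3 lives on machine 0 of 3 machines.
tables, sent = exchange_requests([(1, 3), (4, 3), (7, 3), (2, 3)], {3: "a3"}, 3)
```

Four requesters on two machines produce two requests and two responses, instead of the four requests and four responses that vertex-to-vertex messaging would need.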
In our example, vertex v in machine M_j may call get_resp(u) in the next superstep, which looks up u's attribute a(u) from the hash table.

The following theorem shows the effectiveness of the request-respond paradigm for message reduction.

THEOREM 3. Let {v_1, v_2, ..., v_l} be the set of requesters that request the attribute a(u) from a vertex u. Then, the request-respond paradigm reduces the total number of messages from 2l in Pregel's vertex-to-vertex message passing framework to 2 min(M, l), where M is the number of machines.

PROOF. The proof follows directly from the fact that each machine sends at most 1 request to u even though there may be more than 1 requester in that machine, and that at most 1 response from u is sent to each machine that makes a request to u, and that there are at most min(M, l) machines that contain a requester.

In the worst case, the request-respond paradigm uses the same number of messages as Pregel's vertex-to-vertex message passing. But in practice, many Pregel algorithms (e.g., those described in Sections 3.4 and 3.5) have bottleneck vertices with a large number of requesters, leading to imbalanced workload and long elapsed running time. In such cases, our request-respond paradigm effectively bounds the number of messages to the number of machines containing the requesters and eliminates the imbalanced workload.

Explicit Responding. In the above discussion, a vertex v simply calls request(u) in one superstep, and it can then call get_resp(u) in the next superstep to get a(u). All the operations including request exchange, response set construction, response exchange, and response table construction are performed by Pregel+ automatically and are thus transparent to users. We name the above process implicit responding, where a responder does not know the requester until a request is received. When a responder w knows its requesters v, w can explicitly call respond(v) in compute(), which adds <w, w.respond()> to the response set R^to_j where j = hash(v). This process is also illustrated in Figure 10.
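Explicit responding can be sketched with a toy channel object. The class below is hypothetical and only borrows the names respond and get_resp from the text. Its method signatures differ from the Pregel+ interface (here the pushed value is passed in directly), and the modulo hash is an assumption.

```python
from collections import defaultdict


class ResponseChannel:
    """Toy model of explicit responding between machines."""

    def __init__(self, num_machines):
        self.num_machines = num_machines
        self.response_sets = defaultdict(dict)  # j -> {w: a(w)}, one set per machine
        self.tables = {}                        # response tables for the next superstep

    def respond(self, responder, value, requester):
        """Responder w pushes its value to the machine hosting requester v."""
        j = requester % self.num_machines
        self.response_sets[j][responder] = value  # repeated pushes to j collapse

    def end_superstep(self):
        """Exchange response sets; return how many entries were shipped."""
        shipped = sum(len(entries) for entries in self.response_sets.values())
        self.tables = dict(self.response_sets)
        self.response_sets = defaultdict(dict)
        return shipped

    def get_resp(self, reader, responder):
        """Look up a(w) in the response table of the reader's machine."""
        return self.tables[reader % self.num_machines][responder]


channel = ResponseChannel(num_machines=3)
for v in (1, 4, 7, 2, 5):          # w = 0 pushes 0.25 to five neighbors
    channel.respond(0, 0.25, v)
shipped = channel.end_superstep()  # entries sent, one per receiving machine
```

Five neighbors spread over two machines cost two shipped entries, and every neighbor can still read the value through the table on its own machine.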
Explicit responding is more cost-efficient since there is no need for request exchange and response set construction. Explicit responding is useful in many applications. For example, to compute PageRank on an undirected graph, a vertex v can simply call respond(u) for each u ∈ Γ(v) to push a(v) = pr(v)/|Γ(v)| to v's neighbors; this is because in the next superstep, vertex u knows its neighbors Γ(u), and can thus collect their responses. Similarly, in attribute broadcast, if the input graph is undirected, each vertex v can simply push its attribute a(v) to its neighbors. Note that data pushing by explicit responding requires fewer messages than by Pregel's vertex-to-vertex message passing, since responses are sent to machines (more precisely, their response tables) rather than individual vertices.

Programming Interface. Pregel+ extends the vertex class in Pregel's interface [13] by requiring users to specify an additional template argument <R>, which indicates the type of the attribute value that a vertex responds with. In compute(), a vertex can either pull data from another vertex v by calling request(v), or push data to v by calling respond(v). The attribute value that a vertex returns is defined by a user-specified abstract function respond(), which returns a value of type <R>. Like compute(), one may program respond() to return different attributes of a vertex in different supersteps according to the algorithm logic of the specific application. Finally, a vertex may call get_resp(v) in compute() to get the attribute of v, if it is pushed into the response table in the previous superstep.

7. EXPERIMENTAL RESULTS

We now evaluate the effectiveness of our message reduction techniques. We ran our experiments on a cluster of 16 machines, each with 24 processors (two Intel Xeon E5-2620 CPU) and 48GB RAM. One machine is used as the master, while the other 15 machines act as slaves. The connectivity between any pair of nodes in the cluster is 1Gbps.

Figure 11: Datasets (M = million)

Data         Type        |V|           |E|             Avg Deg   Max Deg
WebUK        directed    133,633,040   5,507,679,822   41.21     22,429
LiveJournal  directed    10,690,276    224,614,770     21.01     1,053,676
Twitter      directed    52,579,682    1,963,263,821   37.34     779,958
BTC          undirected  164,732,473   772,822,094     4.69      1,637,619
USA Road     undirected  23,947,347    58,333,344      2.44      9
We used five real-world datasets, as shown in Figure 11: (1) WebUK²: a web graph generated by combining twelve monthly snapshots of the .uk domain collected for the DELIS project; (2) LiveJournal (LJ)³: a bipartite network of LiveJournal users and their group memberships; (3) Twitter⁴: Twitter who-follows-who network based on a snapshot taken in 2009; (4) BTC⁵: a semantic graph converted from the Billion Triple Challenge 2009 RDF dataset; (5) USA⁶: the USA road network. LJ, Twitter and BTC have skewed degree distribution; WebUK, LJ and Twitter have relatively high average degree; USA and WebUK have a large diameter.

Pregel+ Implementation. Pregel+ is implemented in C/C++ as a group of header files, and users only need to include the necessary base classes and implement the application logic in their subclasses. Pregel+ communicates with HDFS through libhdfs, a JNI-based C API for HDFS. Each worker is simply an MPI process and communications are implemented using MPI communication primitives. While one may deploy Pregel+ with any Hadoop and MPI version, we use Hadoop 1.2.1 and MPICH 3.0.4 in our experiments. All programs are compiled using GCC 4.4.7 with the -O2 option enabled.

All the system source codes, as well as the source codes of the algorithms discussed in this paper, can be found at http://www.cse.cuhk.edu.hk/pregelplus.

² http://law.di.unimi.it/webdata/uk-union-2006-06-2007-05
³ http://konect.uni-koblenz.de/networks/livejournal-groupmemberships
⁴ http://konect.uni-koblenz.de/networks/twitter_mpi
⁵ http://km.aifb.kit.edu/projects/btc-2009/
⁶ http://www.dis.uniroma1.it/challenge9/download.shtml

7.1 Effectiveness of Mirroring

Figure 12 reports the performance gain by mirroring. We measure the gain by comparing with (1) Pregel+ without both mirroring and combiner, denoted by Pregel-noMC; (2) Pregel+ with combiner but without mirroring, denoted by Pregel-noM; and (3) GPS [18] with and without LALP. The request-respond technique is not applied in Pregel+ for this set of experiments.
As a reference, we also report the performance of Giraph 1.0.0 [1] (with combiner) and GraphLab 2.2 (which includes PowerGraph [5]).

We test the mirroring thresholds 1, 10, 100, 1000, and the one automatically set by the cost model given by Theorem 2 (which is 199, 165, 62, 126, for WebUK, Twitter, LJ, BTC, respectively). But for the USA road network, its maximum vertex degree is only 9 and thus we do not apply mirroring with large thresholds. For GPS, we follow [8] and fix the threshold of LALP as 100. This is a reasonable choice, since [8] reports that this threshold achieves good performance in general, and we find that the best performance after tuning the threshold is very close to the performance when the threshold is 100.

We also report the preprocessing time of constructing mirrors for Pregel+ and that of LALP for GPS in rows marked by Preproc Time. We also report the number of messages sent by Pregel+ and GPS (note that Giraph does not report the number of messages, but the number should be the same as that of Pregel-noMC and Pregel-noM; while GraphLab does not employ message passing).

We ran PageRank on the three directed graphs, and Hash-Min on the two undirected graphs in Figure 11. For PageRank computation, we use an aggregator to check whether every vertex changes its PageRank value by less than 0.01 after each superstep, and terminate if so. The computation takes 89, 89 and 96 supersteps on WebUK, Twitter and LJ, respectively, before convergence. We do not run GraphLab in asynchronous mode for PageRank, since its convergence condition is different from the synchronous version and hence leads to different PageRank results.

Mirroring in Pregel+. As Figure 12 shows, mirroring significantly improves the performance of Pregel-noM, in terms of the reduction in both running time and message number. The improvement is particularly obvious for the graphs Twitter, LJ, and BTC, which have highly skewed degree distribution. Thus, the result also demonstrates the effectiveness of mirroring in workload balancing.
Mirroring is not so effective for PageRank on WebUK, for which Pregel-noM has the best performance. The number of messages is only slightly decreased when the mirroring threshold is τ = 1000, and yet it is still slower than Pregel-noM. This is because messages sent through Ch_mir are intercepted by mirrors, which incurs additional cost. Since the degree of the majority of the vertices in WebUK is not very high, mirroring does not significantly reduce the number of messages, and thus, the additional cost of Ch_mir is not paid off.

The results also show that the mirroring threshold given by our cost model achieves either the best performance, or close to the performance of the best threshold tested. The one-off preprocessing time required to construct the mirrors is also short compared with the computation time.

Comparison with Other Systems. Figure 12 shows that Pregel+ without mirroring (i.e., Pregel-noM) is already faster than both Giraph and GraphLab, which verifies that our Ch_msg implementation is efficient, and thus the performance gain by mirroring is not an over-claimed improvement gained over a slow implementation.

Figure 12: Effects of mirroring (*: best result; Comput/Preproc Time: Computation/Preprocessing time in sec; # of Msgs: # of messages in millions). Columns: noM and noMC are Pregel+ without mirroring; τ = 1, 10, 100, 1000 and Cost (cost model) are Pregel+ with mirroring at that threshold; GPS is GPS Basic; GPS-LALP is GPS with LALP; GL-Sync and GL-Async are GraphLab in synchronous and asynchronous mode. A dash marks a cell with no reported value.

                    noM     noMC      τ=1     τ=10    τ=100   τ=1000     Cost      GPS GPS-LALP   Giraph  GL-Sync GL-Async
PageRank on WebUK
Comput Time       2669*     7732     5603     5561     3475     2784     2935     3909     4020     4834     4262        -
Preproc Time          -        -   162.29   143.00    46.68    32.79    26.66        -   663.34        -        -        -
# of Msgs        120107   490184   319614   314212   168317  119889*   134734   487285   377687        -        -        -
PageRank on Twitter
Comput Time        1575     3131     1621     1648     1177     1381     1048     1343  750.11*     1567     1762        -
Preproc Time          -        -    40.74    41.34    24.13     8.63    14.31        -    74.95        -        -        -
# of Msgs         62276   174730    68430    65770    40980    48616   38873*   174730    78904        -        -        -
PageRank on LJ
Comput Time      316.26   541.53   251.98   255.26   212.35   243.32   216.05   316.45  197.28*      312      662        -
Preproc Time          -        -     9.95     7.72     3.75     1.07     3.94        -     8.17        -        -        -
# of Msgs          6429    21563     5949    3949*     4209     5162     4359    21563     9665        -        -        -
Hash-Min on BTC
Comput Time       26.97    44.28    29.95    15.53    9.55*    10.69     9.85    37.99    33.00       93       83      155
Preproc Time          -        -    20.74     6.63     5.92     5.56     5.41        -     3.52        -        -        -
# of Msgs          1189     2419     1294    259.4    126.1    152.4   122.5*     1525    716.4        -        -        -

Hash-Min on USA (a single mirroring configuration is reported)
                     noM      noMC Mirroring       GPS    Giraph   GL-Sync  GL-Async
Comput Time       546.86    546.66   542.69*      1205      5714      2982       627
Preproc Time           -         -      4.52         -         -         -         -
# of Msgs           8353      8485      8305      8485         -         -         -

Compared with GPS, the reduction in both message number and running time achieved by the integration of mirroring and combiner in Pregel+ is significantly more than that achieved by LALP alone in GPS, which can be observed from (1) Pregel+ with mirroring vs. Pregel-noMC, and (2) GPS with LALP vs. GPS without LALP. In contrast to the claim in [18] that message combining is not effective, our result clearly demonstrates the benefits of integrating mirroring and combiner, and hence highlights the importance of our theoretical analysis on the tradeoff between mirroring and message combining (i.e., Theorem 2).

However, we notice that GPS is sometimes faster than Pregel+ even though many more messages are exchanged. We found it hard to explain and so we studied the codes of GPS to explore the reason, which we explain below.
GPS requires that vertex IDs should be integers that are contiguous starting from 0, i.e., 0, 1, ..., |V|; while other systems allow vertex IDs to be of any user-specified type (as long as a hash function is provided for calculating the ID of the worker that a vertex resides in). As a result of the dense ID representation, each worker in GPS simply maintains the incoming message buffers of the vertices by an array, and when a worker receives a message targeted at vertex tgt, it is put into tgt's incoming message buffer (i.e., I_tgt) whose position in the array can be directly computed from tgt. On the contrary, systems like Pregel+ and Giraph need to look up I_tgt from a hash table using key tgt, which has extra cost for each message exchanged.

We remark that there are good reasons to allow vertex IDs to take arbitrary types, rather than to hard-code them as contiguous integers. For example, the Pregel algorithm in [24] for computing bi-connected components constructs an auxiliary graph from the input graph, and each vertex of the auxiliary graph corresponds to an edge (u, v) of the input graph. While we can simply use an integer pair as vertex ID in Pregel+, using GPS requires extra effort from programmers to relabel the vertices of the auxiliary graph with contiguous integer IDs, which can be costly for a large graph.

We note that, if desired, one can easily implement GPS's dense vertex ID representation in Pregel+ to further improve the performance for certain algorithms, but this is not the focus of our work, which studies message reduction techniques.

7.2 Effectiveness of Request-Respond Technique

Figure 13 reports the performance gained by the request-respond technique. We test the three algorithms in Section 3 to which the request-respond technique is applicable: attribute broadcast, S-V and minimum spanning forest. We also include Giraph and GPS as a reference. We do not include GraphLab since the algorithms cannot be easily implemented in GraphLab (e.g., it is not clear how a vertex v can communicate with a non-neighbor D[v] as in S-V and minimum spanning forest).

Figure 13: Effects of the request-respond technique (*: best result; Time in seconds; Msgs in millions; a dash marks a cell with no reported value)

                                    Pregel+   ReqResp    Giraph       GPS
Attribute Broadcast on WebUK
  Time (s)                            178.4     84.53    169.28    83.71*
  Msgs (M)                            11015     2699*         -     10950
Attribute Broadcast on BTC
  Time (s)                            16.33     13.31     54.76     8.69*
  Msgs (M)                            772.8    393.2*         -     772.8
Attribute Broadcast on LJ
  Time (s)                            11.66      9.09     11.56     6.43*
  Msgs (M)                            449.2    131.9*         -     449.2
Attribute Broadcast on Twitter
  Time (s)                            59.84    29.65*     71.35     29.93
  Msgs (M)                             3927     1396*         -      3927
S-V on USA
  Time (s)                           261.93   137.69*       690    189.77
  Msgs (M)                             6598     3789*         -      6598
S-V on BTC
  Time (s)                           408.78   190.55*      1531    286.22
  Msgs (M)                            22393    11232*         -     22393
Minimum Spanning Forest on USA
  Time (s)                           19.95*     25.20    259.63     85.15
  Msgs (M)                            387.1    162.2*         -     387.1
Minimum Spanning Forest on BTC
  Time (s)                            83.36    36.56*    350.15    209.92
  Msgs (M)                             2424     1110*         -      2424

The results show that Pregel+ with request-respond, denoted by ReqResp, uses significantly fewer messages. For example, for attribute broadcast on WebUK, ReqResp reduces the message number from 11,015 million to only 2,699 million. ReqResp also records the shortest running time except in a few cases where GPS is faster due to the same reason given in Section 7.1. Another exception is when computing minimum spanning forest on USA, where Pregel+ is faster without request-respond. This is because vertices in USA have very low degree, rendering the request-respond technique ineffective, and the additional computational overhead is not paid off by the reduction in message number.

8. CONCLUSIONS

We presented two techniques to reduce the amount of communication and to eliminate skewed communication workload. The first technique, mirroring, eliminates communication bottlenecks caused by high vertex degree, and is transparent to programming.
The second technique is a new request-respond paradigm, which eliminates bottlenecks caused by program logic, and simplifies the programming of many Pregel algorithms. Our experiments on large real-world graphs verified that our techniques are effective in reducing the communication cost and overall computation time.

Acknowledgments. We thank the anonymous reviewers for their constructive comments. This work is supported by SHIAE Grant No. 8115048.

9. REFERENCES

[1] Apache Giraph. http://giraph.apache.org/.
[2] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez, Z. Vagena, and C. M. Jermaine. A comparison of platforms for implementing and running very large scale machine learning algorithms. In SIGMOD, pages 1371-1382, 2014.
[3] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[4] B. Elser and A. Montresor. An evaluation study of bigdata frameworks for graph processing. In BigData Conference, pages 60-67, 2013.
[5] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17-30, 2012.
[6] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How well do graph-processing platforms perform? An empirical performance evaluation and analysis. IPDPS, 2013.
[7] Z. Gyöngyi, H. Garcia-Molina, and J. O. Pedersen. Combating web spam with trustrank. In VLDB, pages 576-587, 2004.
[8] M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin. An experimental comparison of Pregel-like graph processing systems. PVLDB, 7(12):1047-1058, 2014.
[9] G. Jeh and J. Widom. Scaling personalized web search. In WWW, pages 271-279, 2003.
[10] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169-182, 2013.
[11] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed graphlab: A framework for machine learning in the cloud. PVLDB, 5(8):716-727, 2012.
[12] Y. Lu, J. Cheng, D. Yan, and H. Wu. Large-scale distributed graph computing systems: An experimental evaluation. PVLDB, 8(3):281-292, 2014.
[13] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135-146, 2010.
[14] A. Mislove, M. Marcon, P. K. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In SIGCOMM Conference on Internet Measurement, pages 29-42, 2007.
[15] J. Niu, J. Peng, C. Tong, and W. Liao. Evolution of disconnected components in social networks: Patterns and a generative model. In Performance Computing and Communications Conference (IPCCC), 2012 IEEE 31st International, pages 305-313. IEEE, 2012.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. 1999.
[17] V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. In ICDE, pages 50-61, 2013.
[18] S. Salihoglu and J. Widom. GPS: a graph processing system. In SSDBM, page 22, 2013.
[19] S. Salihoglu and J. Widom. Optimizing graph algorithms on pregel-like systems. PVLDB, 7(7):577-588, 2014.
[20] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD Conference, pages 979-990, 2014.
[21] Z. Shang and J. X. Yu. Catch the wind: Graph workload balancing on cloud. In ICDE, pages 553-564, 2013.
[22] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 3(1):57-67, 1982.
[23] D. Yan, J. Cheng, Y. Lu, and W. Ng. Blogel: A block-centric framework for distributed computation on real-world graphs. PVLDB, 7(14):1981-1992, 2014.
[24] D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB, 7(14):1821-1832, 2014.