All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing


Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, Volker Markl
Technische Universität Berlin, Germany

ABSTRACT

Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point.

We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance. We illustrate the applicability of this approach for three wide classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs.

In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures, our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In the presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - Parallel Databases

Keywords
iterative algorithms; fault-tolerance; optimistic recovery

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. CIKM'13, Oct. 27 - Nov. 1, 2013, San Francisco, CA, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM /13/1...$

1. INTRODUCTION

In recent years, the cost of acquiring and storing data of unprecedented volume has dropped significantly. As the technologies to process and analyze these datasets are being developed, we face previously unimaginable new possibilities. Businesses can apply advanced data analysis for data-driven decision making and predictive analytics. Scientists can test hypotheses on data several orders of magnitude larger than before, and formulate hypotheses by exploring large datasets.

The analysis of the data is typically conducted using parallel processing platforms in large, shared-nothing commodity clusters. Statistical applications have become very popular, especially in the form of graph mining and machine learning tasks. Many of the algorithms used in these contexts are of iterative or recursive nature, repeating some computation until a termination condition is met. Their execution imposes serious overhead when carried out with paradigms such as MapReduce [12], where every iteration is scheduled as a separate job and re-scans iteration-invariant data. These shortcomings have led to the proposition of specialized systems [24, 25], as well as to the integration of iterations into data flow systems [8, 14, 26, 27, 31].
The unique properties of iterative tasks open up a set of research questions related to building distributed data processing systems. We focus on improving the handling of machine and network failures during the distributed execution of iterative algorithms. We concentrate on problems where the size of the evolved solution is proportional to the input size and must therefore be partitioned among the participating machines in order to scale to large datasets. The traditional approach to fault tolerance under such circumstances is to periodically persist the application state as checkpoints and, upon failure, restore the state from previously written checkpoints and restart the execution. This pessimistic method is commonly referred to as rollback recovery [13].

We propose to exploit the robust nature of many fixpoint algorithms used in data mining to enable an optimistic recovery mechanism. These algorithms will converge from many possible intermediate states. Instead of restoring such a state from a previously written checkpoint and restarting the execution, we propose to apply a user-defined, algorithm-specific compensation function. In case of a failure, this function restores a consistent algorithm state and allows the system to continue the execution. Our proposed mechanism eliminates the need to checkpoint intermediate state for such tasks.

We show how to extend the programming model of a parallel data flow system [3] to allow users to specify compensation functions. The compensation function becomes part of the execution plan, and is only executed in case of failures.

In order to show the applicability of our approach to a wide variety of problems, we explore three classes of problems that require distributed, iterative data processing. We start by looking at approaches to carry out link analysis and to compute centralities in large networks. Techniques from this field are extensively used in web mining for search engines and social network analysis. Next, we describe path problems in graphs, which include standard problems such as reachability and shortest paths. Last, we look at distributed methods for factorizing large, sparse matrices, a field of high importance for personalization and recommendation mining. For each class, we discuss solving algorithms and provide blueprints for compensation functions. A programmer or library implementor can directly use these functions for algorithms that fall into these classes.

Finally, we evaluate our proposed recovery mechanism by applying several of the discussed algorithms to large datasets. We compare the effort necessary to reach the solution after simulated failures with traditional pessimistic approaches and with our proposed optimistic approach. Our results show that our proposed recovery mechanism provides optimal failure-free performance. In case of failures, our recovery mechanism is superior to pessimistic approaches for recovering algorithms that incrementally compute their solution. For non-incremental algorithms that recompute the whole solution in each iteration, we find our optimistic scheme to be superior to pessimistic approaches for recovering from failures in early iterations. This motivates the need for a hybrid approach, which we plan to investigate as future work.

1.1 Contributions and Organization

The contributions of this paper are the following:

(1) We propose a novel optimistic recovery mechanism that does not checkpoint any state. Therefore, it provides optimal failure-free performance and simultaneously uses fewer resources in the cluster than traditional approaches.
(2) We show how to integrate our recovery mechanism into the programming model of a parallel data flow system [3].
(3) We investigate the applicability of our approach to three important classes of problems: link analysis and centrality in networks, path problems in graphs, and matrix factorization. For each class we provide blueprints for generic compensation functions.
(4) We provide empirical evidence that shows that our proposed approach has optimal failure-free performance and fast recovery times in the majority of scenarios.

The rest of the paper is organized as follows. Section 2 positions the proposed optimistic recovery mechanism in relation to existing approaches. Section 3 introduces a parallel programming model for fixpoint algorithms. Section 4 discusses how to integrate the optimistic recovery for iterations into data flow systems. Section 5 introduces three wide classes of problems for which our proposed recovery mechanism can be applied. Section 6 presents our experimental evaluation. Finally, Sections 7 and 8 discuss related work, conclude, and offer future research directions.

2. A CASE FOR OPTIMISTIC RECOVERY

Realizing fault tolerance in distributed data analysis systems is a complex task. The optimal approach to fault tolerance depends, among other parameters, on the size of the cluster, the characteristics of the hardware, the duration of the data analysis program, the duration of individual stages of the program (i.e., the duration of an iteration in our context), and the progress already made by the program at the time of the failure. Most fault tolerance mechanisms introduce overhead during normal (failure-free) operation, and recovery overhead in the case of failures [13]. We classify recovery mechanisms for large-scale iterative computations in three broad categories, ranging from the most pessimistic to the most optimistic: operator-level pessimistic recovery, iteration-level pessimistic recovery, and the optimistic recovery mechanism proposed in this paper. Pessimistic approaches assume a high probability of failure, whereas optimistic approaches assume a low failure probability.
Operator-level recovery, implemented in MapReduce [12], checkpoints the result of every individual program stage (the result of the Map stage in MapReduce). Its sweet spot is very large clusters with high failure rates. This recovery mechanism trades very high failure-free overhead for rapid recovery. In the iterative algorithms setting, such an approach would be desirable if failures occurred at every iteration, in which case it would be the only viable way to allow the computation to finish.

For iterative algorithms, the amount of work per iteration is often much lower than that of a typical MapReduce job, rendering operator-level recovery overkill. However, the total execution time across all iterations may still be significant. Hence, specialized systems for iterative algorithms typically follow a more optimistic approach. In iteration-level recovery, as implemented for example in graph processing systems [24, 25], the result of an iteration as a whole is checkpointed. In the case of failure, all participating machines need to revert to the last checkpointed state. Iteration-level recovery may skip checkpointing some iterations, trading better failure-free performance for higher recovery overhead in case of a failure. For pessimistic iteration-level recovery to tolerate machine failures, it must replicate the data to checkpoint to several machines, which requires extra network bandwidth and disk space during execution. The overhead added to the execution time is immense: in our experiments, iterations in which a checkpoint is taken take up to 5 times longer than iterations without a checkpoint. Even worse, a pessimistic approach always incurs this overhead, regardless of whether a failure happens or not. An extreme point of iteration-level recovery is no fault tolerance, where checkpoints are never taken, and the whole program is re-executed from scratch in the case of a failure.

In contrast, our proposed optimistic recovery approach never checkpoints any state.
Thus, it provides optimal failure-free performance, as its failure-free overhead is virtually zero (Section 6 experimentally validates this), and at the same time it requires far fewer resources in the cluster than a pessimistic approach. Furthermore, our optimistic approach only incurs overhead in case of failures, in the form of additional iterations required to compute the solution. Figure 1 illustrates the optimistic recovery scheme for iterative algorithms: failure-free execution proceeds as if no fault tolerance were desired. In case of a failure, we finish the current iteration ignoring the failed machines and simultaneously acquire new machines, which initialize the relevant iteration-invariant data partitions. Then, the system applies a user-supplied piece of code that implements a compensate function on every machine. The compensation function sets the algorithm state to a consistent state from which the algorithm will converge (e.g., if the algorithm computes a probability distribution, the compensation function may have to make sure that all partitions sum to unity). After that, the system proceeds with the execution. The compensation function can be thought of as bringing the computation back on track, where the errors introduced by the data loss are corrected by the algorithm itself in the subsequent iterations.

Figure 1: Optimistic recovery with compensations illustrated. (Panels: failure-free execution on machines 1 to 3; execution with optimistic recovery on machines 1 to 4, with a compensation step between iterations.)

In all of our experiments, the optimistic approach proves superior, as running additional iterations with a compensation function results in a shorter overall execution time than writing checkpoints and repeating iterations with a pessimistic approach. Therefore, we conjecture that our proposed approach is the best fit for running iterative algorithms on clusters with moderate failure rates, given that a compensation for the algorithm to execute is known. Note that for many algorithms, it is easy to write a compensation function whose application will provably lead the algorithm to convergence (cf. Section 5, where we provide such functions for a variety of algorithms). Our experiments in Section 6 show that our proposed optimistic recovery combines optimal failure-free performance with fast recovery and at the same time outperforms a pessimistic approach in the vast majority of cases.

3. PRELIMINARIES

3.1 Fixpoint Algorithms

We restrict our discussion to algorithms that can be expressed by a general fixpoint paradigm taken from Bertsekas and Tsitsiklis [4]. Given an n-dimensional state vector $x \in \mathbb{R}^n$ and an update function $f(x) : \mathbb{R}^n \to \mathbb{R}^n$, each iteration consists of an application of $f$ to $x$: $x^{(t+1)} = f(x^{(t)})$. The algorithm terminates when we find the fixpoint $x^*$ of the series $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$, such that $x^* = f(x^*)$.

The update function $f$ is decomposed into component-wise update functions $f_i$. The function $f_i(x)$ updates component $x_i^{(t)}$ such that $x_i^{(t+1)} = f_i(x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)})$ for $i = 1, \ldots, n$. An update function $f_i$ might only depend on a few components of the state $x^{(t)}$. The structure of these computational dependencies of the data is described by the dependency graph $G_{dep} = (N, E)$, where the vertices $N = \{x_1, \ldots, x_n\}$ represent the components of $x$, and the edges $E$ represent the dependencies among the components: $(i, j) \in E \Leftrightarrow f_i$ depends on $x_j$.
The dependencies between the components might be subject to additional parameters, e.g., distance, conditional probability, transition probability, etc., depending on the semantics of the application. We denote the parameter of the dependency between components $x_i$ and $x_j$ with $a_{ij}$. This dependency parameter can be thought of as the weight of the edge $e_{ij}$ in $G_{dep}$. We denote the adjacent neighbors of $x_i$ in $G_{dep}$, i.e., the computational dependencies of $x_i$, as $\Gamma_i$ (cf. Figure 2).

Figure 2: Adjacent neighbors $\Gamma_i$ of $x_i$ in the dependency graph, a representation of $x_i$'s computational dependencies.

3.2 Programming Model

We define a simple programming model for implementing iterative algorithms based on the introduced notion of fixpoints. Each algorithm is expressed by two functions. The first function is called initialize and creates the components of the initial state $x^{(0)}$:

initialize : $i \to x_i^{(0)}$

The second function, termed update, implements the component update function $f_i$. This function needs as input the states of the components which $x_i$ depends on, and possibly parameters for these dependencies. The states and dependency parameters necessary for recomputing component $x_i$ at iteration $t$ are captured in the dependency set $D_i^{(t)} = \{(x_j^{(t)}, a_{ij}) \mid x_j^{(t)} \in \Gamma_i\}$. The function computes $x_i^{(t+1)}$ from the dependency set $D_i^{(t)}$:

update : $D_i^{(t)} \to x_i^{(t+1)}$

We refer to the union $D^{(t)}$ of the dependency sets $D_i^{(t)}$ for all components $i$ as the workset.

In order to detect the convergence of the fixpoint algorithm in distributed settings, we need two more functions. The function aggregate : $(x_i^{(t+1)}, x_i^{(t)}, agg) \to agg$ incrementally computes a global aggregate $agg$ from the current value $x_i^{(t+1)}$ and the previous value $x_i^{(t)}$ of each component $x_i$. It is a commutative and associative function. The function converged : $agg \to \{true, false\}$ decides whether the algorithm has converged by inspecting the current global aggregate.

Example: PageRank [29] is an iterative method for ranking web pages based on the underlying link structure. Initially, every page has the same rank.
At each iteration, every page uniformly distributes its rank to all pages that it links to, and recomputes its rank by adding up the partial ranks it receives. The algorithm converges when the ranks of the individual pages do not change anymore. For PageRank, we start with a uniform rank distribution by initializing each $x_i$ to $\frac{1}{n}$, where $n$ is the total number of vertices in the graph. The components of $x$ are the vertices in the graph, each vertex depends on its incident neighbors, and the dependency parameters correspond to the transition probabilities between vertices. At each iteration, every vertex recomputes its rank from its incident neighbors proportionally to the transition probabilities:

update : $D_i^{(t)} \to 0.85 \sum_{D_i^{(t)}} a_{ij} x_j^{(t)} + \frac{0.15}{n}$

The aggregation function computes the L1-norm of the difference between the previous and the current PageRank solution, by summing up the differences between the previous and current ranks, and the algorithm converges when this difference becomes less than a given threshold.

3.3 Parallel Execution

This fixpoint mathematical model is amenable to a simple parallelization scheme. Computation of $x_i^{(t+1)}$ involves two steps:

1. Collect the states $x_j^{(t)}$ and parameters $a_{ij}$ for all dependencies $j \in \Gamma_i$.
2. Form the dependency set $D_i^{(t)}$, and invoke update to obtain $x_i^{(t+1)}$.
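The programming model of Section 3.2 and the two steps above can be sketched for the PageRank example as follows. This is a minimal single-process illustration, not the Stratosphere implementation: the function names mirror initialize, update, aggregate and converged from the text, while `iterate`, `run`, the edge-list representation and the threshold value are assumptions made only for this sketch.

```python
from collections import defaultdict

DAMPING = 0.85  # constant from the PageRank update formula in the text


def initialize(n):
    """Create the initial state x^(0): a uniform rank of 1/n per vertex."""
    return {i: 1.0 / n for i in range(n)}


def update(dependency_set, n):
    """Compute x_i^(t+1) from the dependency set D_i^(t) = {(x_j, a_ij)}."""
    partial = sum(x_j * a_ij for x_j, a_ij in dependency_set)
    return DAMPING * partial + (1.0 - DAMPING) / n


def aggregate(x_new, x_old, agg):
    """Accumulate the L1-norm of the difference between two solutions."""
    return agg + abs(x_new - x_old)


def converged(agg, threshold=1e-9):
    """Decide convergence from the global aggregate (threshold is assumed)."""
    return agg < threshold


def iterate(state, edges, n):
    """One iteration: collect dependencies (step 1), then update (step 2)."""
    dependency_sets = defaultdict(list)
    for i, j, a_ij in edges:  # edge (i, j, a_ij): x_i depends on x_j
        dependency_sets[i].append((state[j], a_ij))
    return {i: update(dependency_sets[i], n) for i in state}


def run(edges, n, max_iterations=1000):
    """Iterate until the aggregated rank change falls below the threshold."""
    state = initialize(n)
    for _ in range(max_iterations):
        new_state = iterate(state, edges, n)
        agg = 0.0
        for i in state:
            agg = aggregate(new_state[i], state[i], agg)
        state = new_state
        if converged(agg):
            break
    return state
```

In a distributed setting, the loop in `iterate` corresponds to the join of states with dependencies and the dictionary comprehension to the grouped aggregation; here both run locally for readability.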

Assume that the vertices of the dependency graph are represented as tuples $(n, x_n^{(t)})$ of component index $n$ and state $x_n^{(t)}$. The edges of the dependency graph are represented as tuples $(i, j, a_{ij})$, indicating that component $x_i$ depends on component $x_j$ with parameter $a_{ij}$. If the datasets containing the states and dependency graph are co-partitioned, then the first step can be executed by a local join between vertices and their corresponding edges on the component index $n = j$. For executing step 2, the result of the join is projected to the tuple $(i, x_j^{(t)}, a_{ij})$ and grouped on the component index $i$ to form the dependency set $D_i^{(t)}$. The update function then aggregates $D_i^{(t)}$ to compute the new state $x_i^{(t+1)}$.

This parallelization scheme, which can be summarized by treating a single iteration as a join followed by an aggregation, is a special case of the Bulk Synchronous Parallel (BSP) [32] paradigm. BSP models parallel computation as local computation (the join part) followed by message passing between independent processors (the aggregation part). Analogously to the execution of a superstep in BSP, we assume that the execution of a single iteration is synchronized among all participating computational units.

4. INTEGRATING COMPENSATIONS

4.1 Fixpoint Algorithms as Data Flows

We illustrate and prototype our proposed recovery scheme using a general data flow system with a programming model that extends MapReduce. For implementation, we use Stratosphere [3], a massively parallel data analysis system which offers dedicated support for iterations, a functionality necessary for efficiently running fixpoint algorithms [14]. In the following, we will use the operator notation from Stratosphere. However, the ideas presented in this paper are applicable to other data flow systems with support for iterative or recursive queries.

To ensure efficient execution of fixpoint algorithms, Stratosphere offers two distinct programming abstractions for iterations [14].
In the first form of iterations, called bulk iterations, each iteration completely recomputes the state vector $x^{(t+1)}$ from the previous iteration's result $x^{(t)}$. Figure 3 shows a generic logical plan for modeling fixpoint algorithms as presented in Section 3.2 using the bulk iteration abstraction (ignore the dotted box for now). The input consists of records $(n, x_n^{(t)})$ representing the components of the state $x^{(t)}$ on the one hand, and of records $(i, j, a_{ij})$ representing the dependencies with parameters on the other hand (cf. Section 3.3). The data flow program starts with a dependency join, performed by a Match operator (see footnote 1), which joins the components $x_n^{(t)}$ with their corresponding dependencies and parameters to form the elements $(i, x_j^{(t)}, a_{ij})$ of the dependency set $D_i^{(t)}$. The update aggregation operator groups the result of the join on the component index $i$ to form the dependency set $D_i^{(t)}$, and applies the update function to compute $x_i^{(t+1)}$ from the dependency set. The convergence check operator joins and compares the newly computed state $x_i^{(t+1)}$ with the previous state $x_n^{(t)}$ on $n = i$. From the difference between the components, the operator computes a global aggregate using the aggregate function. A mechanism for efficient computation of distributive aggregates, similar to the aggregators in Pregel [25], invokes the converged function to decide whether to trigger a successive iteration (for simplicity reasons we omit this from the figure). If the algorithm has not converged, the convergence check operator feeds the records $(i, x_i^{(t+1)})$ that form $x^{(t+1)}$ into the next iteration.

Footnote 1: A Match is a second-order function performing an equi-join followed by a user-defined first-order function applied to the join result [3].

Figure 3: Bulk fixpoint algorithm in Stratosphere. (Plan: Dependency-Join Match on n = j over the state $(n, x_n^{(t)})$ and the dependencies $(i, j, a_{ij})$; Update-Aggregation Reduce on i; Convergence-Check Match on n = i; optional Compensation Map on the state.)

Figure 4: Incremental fixpoint algorithm in Stratosphere. (Plan: Candidate-Creation Reduce on i over the workset $(i, x_j^{(t)}, a_{ij})$; State-Update Match on i = n with the state $(n, x_n^{(t)})$; Recreate-Dependencies Match on j = i with the dependencies $(i, j, a_{ij})$; optional Compensation Map on the state.)

The second form of iterations, called incremental iterations [14], is optimized for algorithms that only partially recompute the state $x^{(t)}$ in each iteration. A generic logical plan for fixpoint algorithms using this strategy is shown in Figure 4. Like the bulk iteration variant, the plan models the two steps from the fixpoint model described in Section 3.3. It differs from the bulk iteration plan in that it does not feed back the entire state at the end of an iteration, but only the dependency sets $D_i^{(t)}$ for the fraction of components that will be updated in the next iteration (cf. the feedback edge from the recreate dependencies operator to the workset in Figure 4). The system updates the state of the algorithm using the changed components rather than fully recomputing it. Hence, this plan exploits the fact that certain components converge earlier than others, as dependency sets are fed back selectively only for those components whose state needs to be updated. In addition to the two inputs from the bulk iteration variant, this plan has a third input with the initial version of the workset $D^{(0)}$. The creation of $D^{(0)}$ depends on the semantics of the application, but in most cases, $D^{(0)}$ is simply the union of all initial dependency sets.

In Figure 4, the candidate creation operator groups the elements of the workset on the component index $i$ to form the dependency sets $D_i^{(t)}$ and applies the update function to compute a candidate update for each $x_i^{(t+1)}$ from the corresponding dependency set. The state update operator joins the candidate with the corresponding component from $x^{(t)}$ on the component index and decides whether to set $x_i^{(t+1)}$ to the candidate value.
If an update occurs, the system emits a record $(i, x_i^{(t+1)})$ containing the updated component to the recreate dependencies operator, which joins the updated components with the dependencies and parameters. In addition, the records are efficiently merged with the current state (e.g., via index merging; see the rightmost feedback edge in the figure). As in the bulk iteration variant, we represent the dependencies as $(i, j, a_{ij})$, and join them on $j = i$. The operator emits elements of the dependency sets to form the workset $D^{(t+1)}$ for the next iteration. The algorithm converges when an iteration creates no new dependency sets, i.e., when the workset $D^{(t+1)}$ is empty.

We restrict ourselves to algorithms that follow the plans of Figures 3 and 4, where the dataset holding the dependencies can be either a materialized or a non-materialized view, i.e., it may be the result of an arbitrary plan.

4.2 Recovering Bulk Iterations

To integrate optimistic recovery into the bulk iterations model, we introduce a compensate operator that takes as input the current state of components $x^{(t)}$ (dotted box in Figure 3). The output of the compensation operator is input to the dependency join operator. The system only executes the compensate operator after a failure: e.g., in case of a failure at iteration $t$, the system finishes the current iteration and activates the additional operator in the plan for iteration $t + 1$.

We note that the compensate operator can be an arbitrary plan in itself. However, we found that compensation functions embedded in a simple Map operator are adequate for a wide class of algorithms (see Section 5). In addition, using a compensation function embedded in a Map operator ensures fast recovery times without the need for data shuffling. The compensation function often leverages lightweight meta-data from previous iterations that are stored in a distributed fault-tolerant manner using a distributed locking service [9].

PageRank: Figure 5 shows the data flow plan, derived from the general plan in Section 4.1, for the PageRank algorithm. The initial state $x^{(0)}$ consists of all pages $p$ and initial ranks $r$. The dependencies $(s, t, prob)$ consist of the edges $s \to t$ of the graph, weighted by the transition probability $prob$ from page $s$ to page $t$. The find neighbors operator joins the pages with their outgoing links on $p = s$ and creates tuples $(t, c)$ holding the partial rank $c = r \cdot prob$ for the neighbors. The recompute ranks operator groups the result of the find neighbors join on the target page $t$ to recompute its rank using PageRank's update formula: $0.85 \sum c + 0.15/n$. It emits a tuple $(t, r_{new})$ that contains the new rank to the compare to old rank operator, which joins these tuples with the previous ranks on $p = t$ and initiates the distributed aggregation necessary for the convergence check. Finally, the compare to old rank operator emits tuples $(p, r)$ containing the pages and recomputed ranks. If PageRank has not converged, these tuples form the input to the next iteration $t + 1$.

Figure 5: PageRank as iterative data flow. (Plan: Find-Neighbors Match on p = s over ranks (p, r) and links (s, t, prob); Recompute-Ranks Reduce on t; Compare-To-Old-Rank Match on p = t; optional FixRanks Map.)

In case of a failure, the system activates the additional Map operator called fix ranks in the plan, as shown in Figure 5. This operator executes the compensation function. A simple compensation approach is to re-initialize the ranks of vertices in failed partitions uniformly and re-scale the ranks of the non-failed vertices, so that all ranks still sum to unity (cf. Algorithm 1). A requirement for this mechanism is that the system knows the total number of vertices $n$, and keeps aggregate statistics about the current number of non-failed vertices $n_{nonfailed}$ and the current total rank $r_{nonfailed}$. Note that these statistics can be maintained at virtually zero cost when computed together with the convergence check by the distributed aggregation mechanism.

Algorithm 1 Compensation function for PageRank.
function FIX-RANKS-UNIFORM((pid, r), (n, n_nonfailed, r_nonfailed))
    if pid is in failed partition then return (pid, 1/n)
    else return (pid, (n_nonfailed · r) / (n · r_nonfailed))

Figure 6: Connected Components as iterative data flow. (Plan: Candidate-Label Reduce on v over the workset (v, l); Label-Update Match on v with labels (v, l); Label-To-Neighbors Match on s = v with graph (s, t); optional FixComponents Map.)

4.3 Recovering Incremental Iterations

Analogously to bulk iterations, we compensate a large class of incremental iterations by a simple Map operation, applied to the state at the beginning of the iteration subsequent to a failure. Additionally, for incremental iterations, the system needs to recreate all dependency sets required to recompute the lost components. This is necessary because the recomputation of a failed component might depend on another already converged component whose state is not part of the workset anymore.
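As a concrete instance of such a Map-style compensation, the PageRank compensation of Algorithm 1 can be sketched in Python. The function `fix_ranks_uniform` follows the pseudocode; the helper `compensate`, the `failed` set of page ids, and the use of `None` as a placeholder for a lost rank are assumptions introduced only for this illustration, and the sketch presumes that at least one partition survives with a non-zero total rank.

```python
def fix_ranks_uniform(record, stats, failed):
    """Record-at-a-time compensation following Algorithm 1.

    record: (pid, r), where r may be None if the rank was lost
    stats:  (n, n_nonfailed, r_nonfailed) as kept by the aggregation mechanism
    failed: set of page ids that lived in failed partitions (assumed input)
    """
    pid, r = record
    n, n_nonfailed, r_nonfailed = stats
    if pid in failed:
        # Lost ranks are re-initialized uniformly.
        return (pid, 1.0 / n)
    # Surviving ranks are re-scaled so that all ranks sum to one again.
    return (pid, (n_nonfailed * r) / (n * r_nonfailed))


def compensate(surviving_ranks, lost_pages, n):
    """Apply the compensation to every record, as a Map operator would."""
    stats = (n, len(surviving_ranks), sum(surviving_ranks.values()))
    failed = set(lost_pages)
    records = list(surviving_ranks.items()) + [(p, None) for p in failed]
    return dict(fix_ranks_uniform(rec, stats, failed) for rec in records)
```

The re-scaling keeps the relative order of the surviving ranks: the non-failed pages share a total rank of n_nonfailed / n, and the lost pages receive the remaining (n - n_nonfailed) / n in equal parts.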
In the plan from Figure 4, the recreate dependencies operator produces dependency sets from all components of the state that were updated in the failing iteration, including components that were recreated or adjusted by the compensation function. The system hence only needs to recreate the necessary dependency sets originating from components that were not updated in the failing iteration. To identify those non-updated components, the state update operator internally maintains a timestamp (e.g., the iteration number) for each component, indicating its last update. In the iteration subsequent to a failure, a record for each such component is emitted by the state update operator to the recreate dependencies operator. The recreate dependencies operator joins these records with the dependencies, creating the dependency sets $D^{(t+1)}$ for all components depending on them. From the output of the recreate dependencies operator, we prune all elements that do not belong to a lost component from these extra dependency sets (see footnote 2). By means of this optimization, the system does not unnecessarily recompute components whose dependencies did not change and that are not required for recomputing a failed component.

Connected Components [21]: This algorithm identifies the connected components of an undirected graph, the maximum cardinality sets of vertices that can reach each other. We initially assign to each vertex $v$ a unique numeric label which serves as the vertex's state $x_i$. At every iteration of the algorithm, each vertex replaces its label with the minimum label of its neighbors. In our fixpoint programming model, we express it by the function update : $D_i^{(t)} \to \min_{D_i^{(t)}} (x_j^{(t)})$. At convergence, the states of all vertices in a connected component are the same label, the minimum of the labels initially assigned to the vertices of this component. Figure 6 shows a data flow plan for Connected Components, derived from the general plan for incremental iterations discussed in Section 4.1. There are three inputs.
The initial state $x^{(0)}$ consists of the initial labels, a set of tuples $(v, l)$ where $v$ is a vertex of the graph and $l$ is its label. Initially, each vertex has a unique label. The dependencies and parameters for this problem map directly to the graph structure, as each vertex depends on all its neighbors. We represent each edge of the graph by a tuple $(s, t)$, referring to the source vertex $s$ and target vertex $t$ of the edge. The initial workset $D^{(0)}$ consists of candidate labels for all vertices, represented as tuples $(v, l)$, where $l$ is a label of $v$'s neighbor (there is one tuple per neighbor of $v$).

Footnote 2: This is easily achieved by evaluating the partitioning function on the element's join key and checking whether it was assigned to a failed machine.

Algorithm 2 Compensation function for Connected Components.
    function FIX-COMPONENTS((v, c))
        if v is in failed partition then return (v, v)
        else return (v, c)

First, the candidate label operator groups the labels from the workset on the vertex $v$ and computes the minimum label $l_{new}$ for each vertex $v$. The label update operator joins the record containing the vertex and its candidate label $l_{new}$ with the existing entry and its label on $v$. If the candidate label $l_{new}$ is smaller than the current label $l$, the system updates the state $x_v^{(t)}$ and the label update operator emits a tuple $(v, l_{new})$ representing the vertex $v$ with its new label $l_{new}$. The emitted tuples are joined with the graph structure on $s = v$ by the label to neighbors operator. This operator emits tuples $(t, l)$, where $l$ is the new label and $t$ is a neighbor vertex of the updated vertex $v$. These tuples form the dependency sets for the next iteration. As $D^{(t+1)}$ only contains vertices for which a neighbor updated its label, we do not unnecessarily recompute components of vertices without a change in their neighborhood in the next iteration. The algorithm converges once the system observes no more label changes during an iteration by observing the number of records emitted from the label update operator.

As discussed, the system automatically recomputes the necessary dependency sets in case of a failure in iteration $t$. Therefore, analogously to bulk iterations, the programmer's only task is to provide a record-at-a-time operation which applies the compensation function to the state $x^{(t+1)}$. Here, it is sufficient to set the labels of vertices in failed partitions back to their initial values (Algorithm 2).

5. COMPENSABLE PROBLEMS

In order to demonstrate the applicability of our proposed approach to a wide range of problems, we discuss three classes of large-scale data mining problems which are solved by fixpoint algorithms. For each class, we list several problem instances together with a brief mathematical description and provide a generic compensation function as blueprint.

5.1 Link Analysis and Centrality in Networks

We first discuss problems that compute and interpret the dominant eigenvector of a large, sparse matrix representing a network. A well known example of such a problem is PageRank [29], which computes the relevance of web pages based on the underlying link structure of the web. PageRank models the web as a Markov chain and computes the chain's steady-state probabilities. This reduces to finding the dominant eigenvector of the transition matrix representing the Markov chain. Related methods of ranking vertices are eigenvector centrality [6] and Katz centrality [22], which interpret the dominant eigenvector (or a slight modification of it) of the adjacency matrix of a network. Another example of an eigenvector problem is Random Walk With Restart [21], which uses a random walk biased towards a source vertex to compute the proximity of the remaining vertices of the network to this source vertex. LineRank [20] was recently proposed as a scalable substitute for betweenness centrality, a flow-based measure of centrality in networks. Similarly to the previously mentioned techniques, it relies on computing the dominant eigenvector of a matrix, in this case the transition matrix induced by the line graph of the network. The dominant eigenvector of the modularity matrix [28] can be used to split the network into two communities of vertices with a higher than expected number of edges between them.

Solutions for the problems of this class are usually computed by some variant of the Power Method [18], an iterative algorithm for computing the dominant eigenvector of a matrix $M$. An iteration of the algorithm consists of a matrix-vector multiplication followed by normalization to unit length.
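As a minimal sketch of the Power Method iteration just described, the following Python code multiplies, normalizes to unit length, and optionally simulates a failure by uniformly re-initializing a set of lost components, in the spirit of the compensation this section proposes. All names (`power_iteration_step`, `compensate`, `dominant_eigenvector`, `fail_at`, `lost`) are hypothetical, the matrix is dense and held in memory, and nothing here reflects the actual Stratosphere implementation.

```python
import math


def power_iteration_step(matrix, x):
    """One iteration: multiply by the matrix, then normalize to unit L2 length."""
    y = [sum(a * xj for a, xj in zip(row, x)) for row in matrix]
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]


def compensate(x, lost):
    """Uniformly re-initialize the lost components; keep all others untouched."""
    n = len(x)
    return [1.0 / n if i in lost else v for i, v in enumerate(x)]


def dominant_eigenvector(matrix, fail_at=None, lost=frozenset(),
                         tol=1e-12, max_iter=10_000):
    """Iterate to a fixpoint; optionally simulate a failure in iteration fail_at."""
    n = len(matrix)
    x = [1.0 / n] * n  # uniform initial estimate
    for t in range(max_iter):
        if t == fail_at:
            # Simulated failure: the components in `lost` are gone and get compensated.
            x = compensate(x, lost)
        nxt = power_iteration_step(matrix, x)
        # Converged when the L2-norm of the change drops below the threshold.
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(nxt, x)))
        x = nxt
        if diff < tol:
            break
    return x
```

On a small symmetric matrix such as `[[2.0, 1.0], [1.0, 3.0]]`, a run with `fail_at=5, lost={0}` should end at the same vector as the failure-free run, only after a few more iterations.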
To model the Power Method as a fixpoint algorithm, we operate on a sparse $n \times n$ matrix, whose entries correspond to the dependency parameters. We uniformly initialize each component of the estimate of the dominant eigenvector to $\frac{1}{n}$. The update function computes the dot product of the $i$-th row of the matrix $M$ and the previous state $x^{(t)}$. We express it by the function

$$update : D_i^{(t)} \rightarrow \frac{1}{\|x^{(t)}\|_2} \sum_{D_i^{(t)}} a_{ij} x_j^{(t)}$$

in our fixpoint programming model. The result is normalized by the L2-norm of the previous state, which can be efficiently computed using a distributed aggregation. The algorithm converges when the L2-norm of the difference between the previous and the current state becomes less than a given threshold. For failure compensation, it is enough to uniformly re-initialize the lost components to $\frac{1}{n}$, as the power method will still provably converge to the dominant eigenvector afterwards [18]:

$$compensate : x_i^{(t)} \rightarrow \begin{cases} \frac{1}{n} & \text{if } i \text{ belongs to a failed partition} \\ x_i^{(t)} & \text{otherwise} \end{cases}$$

This function suffices as base for compensating all of the above problems when they are solved by the power method.

5.2 Path Enumeration Problems in Graphs

The second class of problems we discuss can be seen as variants of enumerating paths in large graphs and aggregating their weights. Instances of this problem class include single-source reachability, single-source shortest paths, single-source maximum reliability paths, minimum spanning tree [19], as well as finding the connected components [21] of an undirected graph. Instances of this problem class can be solved by the Generalized Jacobi algorithm [19], which is defined on an idempotent semiring $(S, \oplus, \otimes)$. The state vector $x^{(0)}$ is initialized using the identity element $e$ of $\otimes$ for the source vertex and the identity element $\epsilon$ of $\oplus$ for all other vertices. The algorithm operates on a given, application-specific graph. The dependency parameters correspond to the edge weights of this graph, taken from the set $S$, on which the semiring is defined.
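For the shortest-path instance of this class, the initialization above and a synchronous relaxation step can be sketched as follows. The sketch assumes the semiring with minimum as aggregation (identity: infinity) and addition as path extension (identity: 0), non-negative edge weights, and that each vertex also keeps its own previous value during aggregation. On a simulated failure it resets lost components to their initial values, that is, infinity for ordinary vertices and 0 for the source. The names `shortest_distances`, `fail_at` and `lost` are hypothetical.

```python
import math

INF = math.inf  # identity element of min, used for "no path known yet"


def shortest_distances(n, edges, source, fail_at=None, lost=frozenset()):
    """Jacobi-style relaxation on (min, +) with an optional simulated failure.

    edges is a list of (u, v, w) triples meaning an edge u -> v with weight w >= 0.
    """
    in_edges = [[] for _ in range(n)]
    for u, v, w in edges:
        in_edges[v].append((u, w))

    init = [0.0 if v == source else INF for v in range(n)]
    x = list(init)
    t = 0
    while True:
        if t == fail_at:
            # Compensation: lost components fall back to their initial values.
            x = [init[v] if v in lost else x[v] for v in range(n)]
        # Extend every known path by one edge (+) and aggregate with min.
        nxt = [min([x[v]] + [x[u] + w for u, w in in_edges[v]]) for v in range(n)]
        failure_pending = fail_at is not None and t < fail_at
        if nxt == x and not failure_pending:
            return nxt  # fixpoint: no component changed
        x, t = nxt, t + 1
```

Because every surviving component is still an upper bound on the true distance, the relaxation after the reset only ever moves values down towards the optimum.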
At iteration $t$, the algorithm enumerates paths of length $t$ with relaxation operations, computes the weight of the paths with the $\otimes$ operation and aggregates these weights with the second, idempotent operation $\oplus$. In our fixpoint programming model, we express it by the function

$$update : D_i^{(t)} \rightarrow \bigoplus_{D_i^{(t)}} (x_j^{(t)} \otimes a_{ij}).$$

The algorithm converges to the optimal solution when no component of $x^{(t)}$ changes. For single-source shortest distances, the algorithm works on the semiring $(\mathbb{R} \cup \{\infty\}, \min, +)$. For all vertices but the source, the initial distance is set to $\infty$. At each iteration, the algorithm tries to find a shorter distance by extending the current paths by one edge, using relaxation operations. The obtained algorithm is the well known Bellman-Ford algorithm. For single-source reachability, the semiring used is $(\{0, 1\}, \vee, \wedge)$. A dependency parameter $a_{ij}$ is 1 if $i$ is reachable from $j$, and 0 otherwise. The set of reachable vertices can then be found using boolean relaxation operations. Finally, in single-source maximum reliability paths, the goal is to find the most reliable paths from a given source vertex to all other vertices in the graph. A dependency parameter $a_{ij}$ represents the probability of reaching vertex $j$ from vertex $i$. The semiring in use is $([0, 1], \max, \cdot)$.

To compensate failures in the distributed execution of the Generalized Jacobi algorithm, we can simply re-initialize the components lost due to failure to $\epsilon$, the identity element of $\oplus$. Bellman-Ford, for example, converges to the optimal solution from any start vector $x^{(0)}$ which is element-wise greater than $x^*$ [4]:

$$compensate : x_i^{(t)} \rightarrow \begin{cases} e & \text{if } i \text{ is the source vertex} \\ \epsilon & \text{if } i \text{ belongs to a failed partition} \\ x_i^{(t)} & \text{otherwise} \end{cases}$$

5.3 Low-Rank Matrix Factorization

A popular technique to analyze interactions between two types of entities is low-rank matrix factorization. Problems in this field typically incorporate dyadic data, for example user-story interactions in news personalization [11] or the ratings of users towards products in collaborative filtering [23]. The idea is to approximately factor a sparse $m \times n$ matrix $M$ into the product of two matrices $R$ and $C$, such that $M \approx RC$. The $m \times k$ matrix $R$ models the latent features of the entities represented by the rows of $M$ (e.g., users), while the $k \times n$ matrix $C$ models the latent features of the entities represented by the columns of $M$ (e.g., news stories or products). The strength of the relation between two different entities (e.g., the preference of a user towards a product) can be computed by the dot product $r_i^T c_j$ between their corresponding feature vectors $r_i$ from $R$ and $c_j$ from $C$ in the low dimensional feature space.

Recently developed parallel algorithms for low-rank matrix factorization, including Distributed Stochastic Gradient Descent [16] and variations of Alternating Least Squares (ALS) [34], can be leveraged to distribute the factorization. Due to their simplicity and popularity, we focus on algorithms using the ALS technique. The goal is to train a factorization that minimizes the empirical squared error $(m_{ij} - r_i^T c_j)^2$ for each observed interaction between a row entity $i$ and a column entity $j$. This error is computed by comparing the strength of each known interaction $m_{ij}$ between two entities $i$ and $j$ with the strength $r_i^T c_j$ predicted by the factorization.
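Written out over all observed interactions, the objective above takes the following form. The index set $\Omega$ of observed entries and $\Omega_i$ (the observed columns in row $i$) are notation assumed here for illustration. The second line is the standard normal-equations solution for one row when $C$ is held fixed, stated without a regularization term and assuming the $k \times k$ matrix is invertible; practical variants such as the one in [34] add regularization.

```latex
\min_{R, C} \sum_{(i,j) \in \Omega} \left( m_{ij} - r_i^T c_j \right)^2

r_i = \Bigl( \sum_{j \in \Omega_i} c_j c_j^T \Bigr)^{-1} \sum_{j \in \Omega_i} m_{ij} \, c_j
```

Only the columns $c_j$ with $j \in \Omega_i$ enter the solution for $r_i$, which is why recomputing one row depends on a small part of $C$.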
In order to find a factorization, ALS repeatedly keeps one of the unknown matrices fixed, so that the other one can be optimally recomputed. That means, for example, that $r_i$, the $i$-th row of $R$, can be recomputed by solving a least squares problem including the $i$-th row of $M$ and all the columns $c_j$ of $C$ that correspond to non-zero entries in the $i$-th row of $M$ (cf. Figure 7). ALS then rotates between recomputing the rows of $R$ in one step and the columns of $C$ in the subsequent step until the training error converges.

Figure 7: Dependencies for recomputing a row of $R$ (with $M$ of size $m \times n$, $R$ of size $m \times k$ and $C$ of size $k \times n$).

In the case of a failure, we lose rows of $R$, columns of $C$ and parts of $M$. We compensate as follows: as we can always reread the lost parts of $M$ from stable storage, we approximately recompute lost rows of $R$ and columns of $C$ by solving the least squares problems using the remaining feature vectors (ignoring the lost ones). If all necessary feature vectors for a row of $R$ or a column of $C$ are lost, we randomly re-initialize it:

$$compensate : x_i^{(t)} \rightarrow \begin{cases} \text{random vector} \in [0, 1]^k & \text{if } i \text{ is lost} \\ x_i^{(t)} & \text{otherwise} \end{cases}$$

Matrix factorization with ALS is a non-convex problem [16], and our compensation function represents a jump in the parameter space. While we do not formally prove that after compensation the algorithm arrives at an equally good local minimum, we empirically validate our approach in Section 6.3 to show its applicability to real-world data.

6. EVALUATION

To evaluate the benefit of our proposed recovery mechanism, we run experiments on large datasets and simulate failures. The experimental setup for our evaluation is the following: the cluster consists of 26 machines running Java 7, Hadoop's distributed filesystem (HDFS) 1.0.4 and a customized version of Stratosphere. Each machine has two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives. We also run experiments with Apache Giraph [1], an open-source implementation of Pregel [25]. The findings from these experiments mirror those from Stratosphere. We omit these experiments for lack of space.
We use three publicly available datasets for our experiments: a webgraph called Webbase [5], which consists of 1,019,903,190 links between 115,657,290 webpages, a snapshot of Twitter's social network [10], which contains 1,963,263,821 follower links between 51,217,936 users, and a dataset of 717,872,016 ratings that 1,823,179 users gave to 136,736 songs in the Yahoo! Music community.

6.1 Failure-free Performance

We compare our proposed optimistic approach to a pessimistic strategy that writes distributed checkpoints. When checkpointing is enabled, the checkpoints are written to HDFS in a binary format with the default replication factor of three. Analogously to the approaches taken in Pregel and GraphLab, we checkpoint state as well as communicated dependencies [24, 25]. We run each algorithm with a degree of parallelism of 208 (one worker per core).

We look at the failure-free performance of optimistic and pessimistic recovery, in order to measure the overhead introduced by distributed checkpointing. Figure 8 shows the application of PageRank on the Webbase dataset. The x axis shows the iteration number and the y axis shows the time (in seconds) it took the system to complete the iteration. The algorithm converges after 39 iterations. The runtime with checkpointing every iteration is 79 minutes, while optimistic recovery reduces the runtime to 14 minutes. The performance of the optimistic approach is equal to the performance of running without a fault tolerance mechanism, with an iteration lasting roughly 20 seconds. We see that the pessimistic approach introduces an overhead of more than factor 5 regarding the execution time of every iteration that includes a checkpoint. Writing checkpoints only at a certain interval improves this situation, but trades off the time it takes to write a checkpoint with the time required to repeat the iterations conducted since the last checkpoint in case of a failure. If we, for example, checkpoint every fifth iteration, the incurred overhead to the runtime compared to the optimal failure-free performance is still 82%, approximately 11.5 minutes.

Figure 8: Failure-free performance of PageRank on the Webbase dataset (duration in seconds per iteration, optimistic vs. pessimistic).

Figure 9: Failure-free performance of Connected Components on the Twitter dataset (duration in seconds per iteration, optimistic vs. pessimistic).

We see that the optimistic approach is able to outperform a pessimistic one by a factor of two to five in this case. We note that in these experiments, this was the only job running in the cluster. In busy clusters, where a lot of concurrent jobs compete for I/O and network bandwidth, the negative effect of the overhead of checkpointing might be even more dramatic.

Figure 9 shows the failure-free performance of Connected Components applied to the Twitter dataset. The checkpointing overhead varies between the iterations, due to the incremental character of the execution. With our optimistic approach, the execution takes less than 5 minutes. Activating checkpointing in every iteration increases the runtime to 19 minutes. We see that checkpointing results in a 3x to 4x increased runtime during the first iterations until the majority of vertices converge. Taking a checkpoint in the first iteration takes even longer than running the whole job with our optimistic scheme. After iteration 4, the majority of the state has converged, which results in a smaller workset and substantially reduces the checkpointing overhead. Again, the optimistic approach outperforms a pessimistic one that takes an early checkpoint by at least a factor of two.

Our experiments suggest that the overhead of our optimistic approach in the failure-free case (collecting few global statistics using the distributed aggregation mechanism) is virtually zero. We observe that, in the absence of failures, our prototype has the same performance as Stratosphere with checkpointing turned off, therefore we conclude that it has optimal failure-free performance.

6.2 Recovery Performance

In the following, we simulate the failure of one machine, thereby losing 8 parallel partitions of the solution.
We simulate failures by dropping the parts of the state $x$ that were assigned to the failing partitions in the failing iteration. Additionally, we discard 50% of the messages sent from the failing partitions, simulating the fact that the machine failed during the iteration and did not complete all necessary inter-machine communication. After the failing iteration, our prototype applies the compensation function and finishes the execution. We do not include the time to detect a failure and acquire a new machine in our simulations, as the incurred overhead would be the same for both a pessimistic and an optimistic approach.

Figure 10: Recovery behavior of Connected Components on the Twitter dataset.

Figure 11: Runtime comparison of Connected Components on the Twitter dataset (runtime in seconds, optimistic vs. pessimistic, for failing iterations: none, 4, 4 & 6).

To evaluate the recovery of incremental iterations, we run the Connected Components algorithm on the Twitter dataset. We simulate failures in different iterations and apply the compensation function. Figure 10 shows the overall number of iterations as well as their duration. The figure illustrates that Connected Components is very robust against failures when paired with our compensation function. Single failures in iterations 4 and 6 do not produce additional iterations, but only cause an increase in the runtime of the iteration after next by 45 seconds (which amounts to an increase of 15% in the overall runtime). This happens because the system reactivates the neighbors of failed vertices in the iteration after the failure and thereby triggers the recomputation of the failed vertices in the iteration after next. When we simulate multiple failures of different machines during one run, we observe that this process simply happens twice during the execution: the compensation of failures in iterations 4 and 6 or 5 and 8 during one run produces an overhead of approximately 35% compared to the failure-free execution. To investigate the effects of a drastic failure, we simulate a simultaneous failure of 5 machines in iteration 4 and repeat this for iteration 6.
We again observe fast recovery: in both cases the overhead is less than 30% compared to the failure-free performance. In all experiments, the additional work caused by the optimistic recovery amounts to a small fraction of the cost of writing a single checkpoint in an early iteration. The execution with a compensated single machine failure takes at most 65 seconds longer than the failure-free run, while checkpointing a single early iteration alone induces an overhead of two to five minutes. In addition to writing the checkpoints, a pessimistic approach with a checkpointing interval would have to repeat all the iterations since the last checkpoint. Figure 11 compares the runtimes of a pessimistic approach (with a checkpoint interval of 2 iterations) to those of our optimistic approach. The times for the pessimistic approach are composed of the average execution time, the average time for writing checkpoints and the execution time for iterations that need to be repeated. The figure lists the runtime of both approaches for executions with no failures, a single failure in iteration 4 and multiple failures during one run in iterations 4 and 6. Figure 11 shows that our optimistic approach is more than twice as fast in the failure-free case and at the same time provides faster recovery than a pessimistic approach in all cases.

The robustness in Connected Components is due to the sparse computational dependencies of the problem. Every minimum label propagates through its component of the graph. As social networks typically have a short average distance between vertices, the majority of the vertices find their minimum label relatively early and converge. After failure compensation in a later iteration, most vertices have a non-failed neighbor that has already found the minimum label, which they immediately receive and converge.

In order to evaluate the recovery of bulk iterations, we run PageRank on the Webbase dataset until convergence. We simulate failures in different iterations of PageRank and measure the number of iterations it takes the optimistic recovery mechanism to converge afterwards.
Figure 12 shows the convergence behavior for the simulated failures in different early iterations. The x axis shows the iteration number. The y axis shows the L1-norm of the difference of the current estimate $x^{(t)}$ of the PageRank vector in iteration $t$ to the PageRank vector $x^{(t-1)}$ in the previous iteration $t-1$ in logarithmic scale. We notice that the optimistic recovery is able to handle failures in early iterations such as the third, tenth, or fifteenth iteration extremely well, with the compensation resulting in at most 3 additional iterations (as can be seen in the bottom right corner of Figure 12).

Figure 12: Recovery convergence of PageRank on the Webbase dataset (error in log10 scale per iteration; series: failure in 15, failure in 10, failure in 3, no failures).

Figure 13: Runtime comparison of PageRank on the Webbase dataset (runtime in minutes, optimistic vs. pessimistic, by failing iteration).

In additional experiments, we observe the same behavior for multiple failures during one run in early iterations: failures in iterations 2 and 10 or 5 and 15 also only result in at most 3 additional iterations. Next, we simulate a simultaneous failure of five machines to investigate the effect of drastic failures: such a failure costs no overhead when it happens in iteration 3 and results in only 6 more iterations when it happens in iteration 10.

When we simulate failures in later iterations, we note that they cause more overhead: a failure in iteration 25 needs 12 more iterations to reach convergence, while a failure in iteration 35 triggers 22 additional iterations compared to the failure-free case. This behavior can be explained as follows: a compensation can be thought of as a random jump in the space of possible intermediate solutions. In later iterations, the algorithm is closer to the fixpoint, hence the random jump increases the distance to the fixpoint with a higher probability than in early iterations where the algorithm is far from the fixpoint. For a failure in iteration 35, executing these additional 22 iterations would incur an overhead of approximately eight minutes. Compared to a pessimistic approach with a checkpoint interval of five, the overhead is still less, as the pessimistic approach would have to restart from the checkpointed state of iteration 30 and again write a total of 7 checkpoints, amounting to a total overhead of more than 12 minutes.
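The comparison for a failure in iteration 35 can be retraced as back-of-the-envelope arithmetic. The sketch below derives the per-iteration cost from the reported failure-free PageRank run (14 minutes over 39 iterations) and the average cost of one checkpoint from the run that checkpoints every iteration (79 minutes over 39 iterations). Treating every checkpoint as equally expensive is an assumption made here for illustration, and the function names are hypothetical.

```python
ITERATIONS = 39
FAILURE_FREE_MINUTES = 14.0    # reported runtime without checkpoints
CHECKPOINT_ALL_MINUTES = 79.0  # reported runtime when checkpointing every iteration


def seconds_per_iteration():
    """Average duration of one iteration without a checkpoint."""
    return FAILURE_FREE_MINUTES * 60 / ITERATIONS


def seconds_per_checkpoint():
    """Average extra time per checkpoint (assumed identical for all iterations)."""
    return (CHECKPOINT_ALL_MINUTES - FAILURE_FREE_MINUTES) * 60 / ITERATIONS


def optimistic_overhead_minutes(extra_iterations):
    """Cost of the additional iterations needed after compensation."""
    return extra_iterations * seconds_per_iteration() / 60


def pessimistic_overhead_minutes(failed_iteration, interval, checkpoints_written):
    """Cost of repeating work since the last checkpoint plus writing checkpoints."""
    # A failure during iteration k happens before a checkpoint for k exists.
    last_checkpoint = ((failed_iteration - 1) // interval) * interval
    repeated = failed_iteration - last_checkpoint
    seconds = (repeated * seconds_per_iteration()
               + checkpoints_written * seconds_per_checkpoint())
    return seconds / 60
```

Under these assumptions, `optimistic_overhead_minutes(22)` comes out at just under eight minutes and `pessimistic_overhead_minutes(35, 5, 7)` at a little over thirteen, in line with the figures quoted above.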
Figure 13 summarizes our findings about PageRank and compares the runtimes of our optimistic scheme to such a pessimistic approach with a checkpoint interval of five. The times for the pessimistic approach comprise the average execution time, the average time for writing checkpoints and the execution time for iterations that need to be repeated. We see that the optimistic approach outperforms the pessimistic one by nearly a factor of two in the failure-free case, and its overall runtime is also shorter in all cases with failures. For failures in later iterations, our findings suggest that hybrid approaches, which use the optimistic approach for early iterations and switch to a pessimistic strategy later, are needed. The decision when to switch could be made by observing the slope of the convergence rate. Once it starts flattening, the algorithm comes closer to the fixpoint and it would be beneficial to switch to a checkpoint-based recovery strategy. We plan to investigate this as part of our future work.

6.3 Empirical Validation of Compensability

We did not provide a formal proof for the compensability of Alternating Least Squares for low-rank matrix factorization discussed in Section 5.3. However, we can empirically validate that our optimistic approach is applicable to this algorithm. We use the Yahoo Songs dataset for this. To show the compensability of Alternating Least Squares, we implement a popular variant of the algorithm aimed at handling ratings [34] as a data flow program (cf. Figure 14).

Figure 14: ALS as fixpoint algorithm in Stratosphere.

Figure 15: Optimistic recovery of failures in ALS on the Yahoo Songs dataset.

Next, we simulate failures while computing a factorization of rank 10. We simulate a single failure in iteration 5, a single failure in iteration 10 and finally have a run with multiple failures where both failures happen one after another. We measure the training error (by the root mean squared error, RMSE, shown in Figure 15) of the factorization over 15 iterations.
Our results show that failures result in a short increase of the error, yet by iteration 15, the error again equals that of the failure-free case. These results indicate that the ALS algorithm paired with the compensation strategy described in Section 5.3 is very robust to all failures, which only cause a short-lived distortion in the convergence behavior. We also repeat this experiment for the Movielens dataset, where the results mirror our findings.

7. RELATED WORK

Executing iterative algorithms efficiently in parallel has received a lot of attention recently, resulting in graph-based systems [24, 25], and in the integration of iterations into data flow systems [7, 14, 26, 27, 31]. The work proposed in this paper is applicable to all systems that follow a data flow or a vertex-centric programming paradigm.

While the robust characteristics of fixpoint algorithms have been known for decades, we are not aware of an approach that leverages these characteristics for the recovery of distributed, data-parallel execution of such algorithms. Rather, most distributed data processing systems [12, 25], distributed storage systems [17], and recently real-time analytical systems [33] use pessimistic approaches based on periodic checkpointing and replay to recover lost state.

Systems such as Spark [31] offer recovery by recomputing lost partitions based on their lineage. The complex data dependencies of fixpoint algorithms however may require a full recomputation of the algorithm state. While the authors propose to use efficient in-memory checkpoints in that case, such an approach still increases the resource usage of the cluster, as several copies of the data to checkpoint have to be held in memory. During execution such a recovery strategy competes with the actual algorithm in terms of memory and network bandwidth.

Confined Recovery [25] in Pregel limits recovery to the partitions lost in a failure. The state of lost partitions is recalculated from the logs of outgoing messages of all non-failed machines in case of a failure.
Confined recovery is still a pessimistic approach, which requires an increase in the amount of checkpointed data, as all outgoing messages of the system have to be logged.

Similar to our compensation function, user-defined functions to enable optimistic recovery have been proposed for long-lived transactions. The ConTract Model is a mechanism for handling long-lived computations in a database context [30]. Sagas describe the concept of breaking long-lived transactions into a collection of subtransactions [15]. In such systems, user-defined compensation actions are triggered in response to violations of invariants or failures of nested sub-transactions during execution.

8. CONCLUSIONS AND OUTLOOK

We present a novel optimistic recovery mechanism for distributed iterative data processing using a general fixpoint programming model. Our approach eliminates the need to checkpoint to stable storage. In case of a failure, we leverage a user-defined compensation function to algorithmically bring the intermediary state of the iterative algorithm back into a form from which the algorithm still converges to the correct solution. Furthermore, we discuss how to integrate the proposed recovery mechanism into a data flow system. We model three wide classes of problems (link analysis and centrality in networks, path enumeration in graphs, and low-rank matrix factorization) as fixpoint algorithms and describe generic compensation functions that in many cases provably converge.

Finally, we present empirical evidence that shows that our proposed optimistic approach provides optimal failure-free performance (virtually zero overhead in absence of failures). At the same time, it provides faster recovery in the majority of cases. For incrementally iterative algorithms, the recovery overhead of our approach is less than a fifth of the time it takes a pessimistic approach to only checkpoint an early intermediate state of the computation. For recovery of early iterations of bulk iterative algorithms, the induced overhead to the runtime is less than 10% and again our approach outperforms a pessimistic one in all evaluated cases.

In future work, we plan to investigate whether the compensation function can be automatically derived from invariants present in the algorithm definition. We would also like to find compensation functions for an even broader class of problems.
Our experiments suggest that optimistic recovery is less effective for bulk iterative algorithms when the algorithm is already close to the fixpoint. Therefore, we plan to explore the benefits of a hybrid approach that uses optimistic recovery in early stages of the execution and switches to pessimistic recovery later. Finally, we would like to leverage memory-efficient probabilistic data structures for storing statistics to create better compensation functions.

Acknowledgments

We thank Jeffrey Ullman, Jörg Fliege, Jérôme Kunegis and Max Heimel for fruitful discussions. This research is funded by the German Research Foundation (DFG) under grant FOR 1306 Stratosphere and the European Union (EU) under the ROBUST grant. We used data provided by Yahoo! Academic Relations.

9. REFERENCES

[1] Apache Giraph.
[2] Apache Hadoop.
[3] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, 2010.
[4] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation. Athena Scientific.
[5] P. Boldi and S. Vigna. The Webgraph Framework I: Compression techniques. In WWW, 2004.
[6] P. Bonacich. Power and Centrality: A family of measures. American Journal of Sociology.
[7] V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, 2011.
[8] Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling datalog for machine learning on big data. CoRR, abs/1203.0160, 2012.
[9] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, 2006.
[10] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In ICWSM, 2010.
[11] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In WWW, 2007.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51:107-113, 2008.
[13] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375-408, 2002.
[14] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning fast iterative data flows. PVLDB, 5(11), 2012.
[15] H. Garcia-Molina and K. Salem. Sagas. In SIGMOD.
[16] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, 2011.
[17] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003.
[18] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press.
[19] M. Gondran and M. Minoux. Graphs, Dioids and Semirings - New Models and Algorithms. Springer, 2008.
[20] U. Kang, S. Papadimitriou, J. Sun, and H. Tong. Centralities in large networks: Algorithms and observations. In SDM, 2011.
[21] U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In ICDM, 2009.
[22] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39-43.
[23] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 2009.
[24] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8), 2012.
[25] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010.
[26] F. McSherry, D. G. Murray, R. Isaacs, and M. Isard. Differential dataflow. In CIDR, 2013.
[27] S. R. Mihaylov, Z. G. Ives, and S. Guha. Rex: Recursive, delta-based data-centric computation. PVLDB, 5(11), 2012.
[28] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.
[29] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab.
[30] A. Reuter and F. Schwenkreis. ConTracts - A low-level mechanism for building general-purpose workflow management-systems. IEEE Data Eng. Bull., 18:4-10.
[31] M. Zaharia, M. Chowdhury, T. Das, D. Ankur, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
[32] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103-111, 1990.
[33] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In HotCloud, 2012.
[34] Y. Zhou, D. M. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In AAIM, 2008.