Making State Explicit for Imperative Big Data Processing




Making State Explicit for Imperative Big Data Processing

Raul Castro Fernandez, Imperial College London; Matteo Migliavacca, University of Kent; Evangelia Kalyvianaki, City University London; Peter Pietzuch, Imperial College London

https://www.usenix.org/conference/atc14/technical-sessions/presentation/castro-fernandez

This paper is included in the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference, June 19-20, 2014, Philadelphia, PA. ISBN 978-1-931971-10-2.

Open access to the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference is sponsored by USENIX.

Making State Explicit for Imperative Big Data Processing

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, Peter Pietzuch
Imperial College London, University of Kent, City University London

Abstract

Data scientists often implement machine learning algorithms in imperative languages such as Java, Matlab and R. Yet such implementations fail to achieve the performance and scalability of specialised data-parallel processing frameworks. Our goal is to execute imperative Java programs in a data-parallel fashion with high throughput and low latency. This raises two challenges: how to support the arbitrary mutable state of Java programs without compromising scalability, and how to recover that state after failure with low overhead.

Our idea is to infer the dataflow and the types of state accesses from a Java program and use this information to generate a stateful dataflow graph (SDG). By explicitly separating data from mutable state, SDGs have specific features to enable this translation: to ensure scalability, distributed state can be partitioned across nodes if computation can occur entirely in parallel; if this is not possible, partial state gives nodes local instances for independent computation, which are reconciled according to application semantics. For fault tolerance, large in-memory state is checkpointed asynchronously without global coordination. We show that the performance of SDGs for several imperative online applications matches that of existing data-parallel processing frameworks.

1 Introduction

Data scientists want to use ever more sophisticated implementations of machine learning algorithms, such as collaborative filtering [32], k-means clustering and logistic regression [2], and execute them over large datasets while providing fresh, low latency results. With the dominance of imperative programming, such algorithms are often implemented in languages such as Java, Matlab or R. Such implementations, though, make it challenging to achieve high performance.

On the other hand, data-parallel processing frameworks, such as MapReduce [8], Spark [38] and Naiad [26], can scale computation to a large number of nodes. Such frameworks, however, require developers to adopt particular functional [37], declarative [3] or dataflow [5] programming models. While early frameworks such as MapReduce [8] followed a restricted functional model, resulting in wide-spread adoption, recent more expressive frameworks such as Spark [38] and Naiad [26] require developers to learn more complex programming models, e.g. based on a richer set of higher-order functions.

Our goal is therefore to translate imperative Java implementations of machine learning algorithms to a representation that can be executed in a data-parallel fashion. The execution should scale to a large number of nodes, achieving high throughput and low processing latency. This is challenging because Java programs support arbitrary mutable state. For example, an implementation of collaborative filtering [32] uses a mutable matrix to represent a model that is refined iteratively: as new data arrives, the matrix is updated at a fine granularity and accessed to provide up-to-date predictions.

Having stateful computation raises two issues: first, the state may grow large, e.g. on the order of hundreds of GBs for a collaborative filtering model with tens of thousands of users. Therefore the state and its associated computation must be distributed across nodes; second, large state must be restored efficiently after node failure. The failure recovery mechanism should have a low impact on performance. Current data-parallel frameworks do not handle large state effectively.
In stateless frameworks [8, 37, 38], computation is defined through side-effect-free functional tasks. Any modification to state, such as updating a single element in a matrix, must be implemented as the creation of new immutable data, which is inefficient. While recent frameworks [26, 10] have recognised the need for per-task mutable state, they lack abstractions for distributed state and exhibit high overhead under fault-tolerant operation with large state (see §6.1).

Imperative programming model. We describe how, with the help of a few annotations by developers, Java programs can be executed automatically in a distributed data-parallel fashion. Our idea is to infer the dataflow and the types of state accesses from a Java program and use this information to translate the program to an executable distributed dataflow representation. Using program analysis, our approach extracts the processing tasks and state fields from the program and infers the variable-level dataflow.

Stateful dataflow graphs. This translation relies on the features of a new fault-tolerant data-parallel processing model called stateful dataflow graphs (SDGs). An SDG explicitly distinguishes between data and state: it is a cyclic graph of pipelined data-parallel tasks, which execute on different nodes and access local in-memory state. SDGs include abstractions for maintaining large state efficiently in a distributed fashion: if tasks can process state entirely in parallel, the state is partitioned across nodes; if this is not possible, tasks are given local instances of partial state for independent computation. Computation can include synchronisation points to access all partial state instances, and instances can be reconciled according to application semantics.

Data flows between tasks in an SDG, and cycles specify iterative computation. All tasks are pipelined: this leads to low latency, less intermediate data during failure recovery and simplified scheduling by not having to compute data dependencies. Tasks are replicated at runtime to overcome processing bottlenecks and stragglers.

Failure recovery. When recovering from failures, nodes must restore potentially gigabytes of in-memory state. We describe an asynchronous checkpointing mechanism with log-based recovery that uses data structures for dirty state to minimise the interruption to tasks while taking local checkpoints. Checkpoints are persisted to multiple disks in parallel, from which they can be restored to multiple nodes, thus reducing recovery time.

With a prototype system of SDGs, we execute Java implementations of collaborative filtering, logistic regression and a key/value store on a private cluster and Amazon EC2. We show that SDGs execute with high throughput (comparable to batch processing systems) and low latency (comparable to streaming systems). Even with large state, their failure recovery mechanism has a low performance impact, recovering in seconds.

The paper contributions and its structure are as follows: based on a sample Java program (§2.1) and the features of existing dataflow models (§2.2), we motivate the need for stateful dataflow graphs and describe their properties (§3); §4 explains the translation from Java to SDGs; §5 describes failure recovery; and §6 presents evaluation results, followed by related work (§7).
Algorithm 1: Online collaborative filtering

 1  @Partitioned Matrix userItem = new Matrix();
 2  @Partial Matrix coOcc = new Matrix();
 3
 4  void addRating(int user, int item, int rating) {
 5    userItem.setElement(user, item, rating);
 6    Vector userRow = userItem.getRow(user);
 7    for (int i = 0; i < userRow.size(); i++)
 8      if (userRow.get(i) > 0) {
 9        int count = coOcc.getElement(item, i);
10        coOcc.setElement(item, i, count + 1);
11        coOcc.setElement(i, item, count + 1);
12      }
13  }
14  Vector getRec(int user) {
15    Vector userRow = userItem.getRow(user);
16    @Partial Vector userRec = @Global coOcc.multiply(userRow);
17    Vector rec = merge(@Global userRec);
18    return rec;
19  }
20  Vector merge(@Collection Vector[] allUserRec) {
21    Vector rec = new Vector(allUserRec[0].size());
22    for (Vector cur : allUserRec)
23      for (int i = 0; i < allUserRec.length; i++)
24        rec.set(i, cur.get(i) + rec.get(i));
25    return rec;
26  }

2 State in Data-Parallel Processing

We describe an imperative implementation of a machine learning algorithm and investigate how it can execute in a data-parallel fashion on a set of nodes, paying attention to its use of mutable state (§2.1). Based on this analysis, we discuss the features of existing data-parallel processing models for supporting such an execution (§2.2).

2.1 Application example

Alg. 1 shows a Java implementation of an online machine learning algorithm, collaborative filtering (CF) [32]. It outputs up-to-date recommendations of items to users (function getRec) based on previous item ratings (function addRating). (The annotations, starting with "@", will be explained in §4 and should be ignored for now.)

The algorithm maintains state in two data structures: the matrix userItem stores the ratings of items made by users (line 1); the co-occurrence matrix coOcc records correlations between items that were rated together by multiple users (line 2). For many users and items, userItem and coOcc become large and must be distributed: userItem can be partitioned across nodes based on the user identifier as an access key; since the access to coOcc is random, it cannot be partitioned but only replicated on multiple nodes in order to parallelise updates. In this case, results from a single instance of coOcc are partial, and must be merged with other partial results to obtain a complete result, as described below.

The function addRating first adds a new rating to userItem (line 5). It then incrementally updates coOcc by increasing the co-occurrence counts for the newly-rated item and existing items with non-zero ratings (lines 7-12). This requires userItem and coOcc to be mutable, with efficient fine-grained access. Since userItem is partitioned based on the key user, and coOcc is replicated, addRating only accesses a single instance of each.

The function getRec takes the rating vector of a user, userRow (line 15), and multiplies it by the co-occurrence matrix to obtain a recommendation vector userRec (line 16). Since coOcc is replicated, this must be performed on all instances of coOcc, leading to multiple partial recommendation vectors. These partial vectors must be merged to obtain the final recommendation vector rec for the user (line 17). The function merge simply computes the sum of all partial recommendation vectors (lines 21-24).

Note that addRating and getRec have different performance goals when handling state: addRating must achieve high throughput when updating coOcc with new ratings; getRec must serve requests with low latency, e.g. when recommendations are included in dynamically generated web pages.

2.2 Design space

The above example highlights a number of required features of a dataflow model to enable the translation of imperative online machine learning algorithms to executable dataflows: (i) the model should support large state sizes (on the order of GBs), which should be represented explicitly and handled with acceptable performance; in particular, (ii) the state should permit efficient fine-grained updates. In addition, due to the need for up-to-date results, (iii) the model should process data with low latency, independently of the amount of input data; (iv) algorithms such as logistic regression and k-means clustering also require iteration; and (v) even with large state, the model should support fast failure recovery.

In Table 1, we classify existing data-parallel processing models according to the above features.

Table 1: Design space of data-parallel processing frameworks

Category             | System          | Programming model | State representation | Execution | Failure recovery
Stateless dataflow   | MapReduce [8]   | map/reduce        | as data              | scheduled | recompute
Stateless dataflow   | DryadLINQ [37]  | functional        | as data              | scheduled | recompute
Stateless dataflow   | Spark [38]      | functional        | as data              | hybrid    | recompute
Stateless dataflow   | CIEL [25]       | imperative        | as data              | scheduled | recompute
Incremental dataflow | HaLoop [5]      | map/reduce        | cache                | scheduled | recompute
Incremental dataflow | Incoop [4]      | map/reduce        | cache                | scheduled | recompute
Incremental dataflow | Nectar [11]     | functional        | cache                | scheduled | recompute
Incremental dataflow | CBP [9]         | dataflow          | loopback             | scheduled | recompute
Batched dataflow     | Comet [12]      | functional        | as data              | scheduled | recompute
Batched dataflow     | D-Streams [39]  | functional        | as data              | hybrid    | recompute
Batched dataflow     | Naiad [26]      | dataflow          | explicit             | hybrid    | sync. global checkpoints
Continuous dataflow  | Storm, S4       | dataflow          | as data              | pipelined | recompute
Continuous dataflow  | SEEP [10]       | dataflow          | explicit             | pipelined | sync. local checkpoints
Parallel in-memory   | Piccolo [3]     | imperative        | explicit             | n/a       | async. global checkpoints
Stateful dataflow    | SDG             | imperative        | explicit             | pipelined | async. local checkpoints

State handling. Stateless dataflows, first made popular by MapReduce [8], define a functional dataflow graph in which vertices are stateless data-parallel tasks. They do not distinguish between state and data: e.g. in a wordcount job in MapReduce, the partial word counts, which are the state, are output by map tasks as part of the dataflow [8]. Dataflows in Spark, represented as RDDs, are immutable, which simplifies failure recovery but requires a new RDD for each state update [38]. This is inefficient for online algorithms such as CF in which only part of a matrix is updated each time. Stateless models also cannot treat data differently from state. They cannot use custom index data structures for state access, or cache only state in memory: e.g. Shark [36] needs explicit hints which dataflows to cache.
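To make the cost of immutable updates concrete, the sketch below contrasts the two styles for a single fine-grained update. It is an illustration only: the Matrix class and its methods are assumptions modelled on Alg. 1, not an API defined by the paper.

// Illustrative only: an assumed dense Matrix, showing why copy-on-write
// updates are expensive compared to in-place mutation of an SE.
final class Matrix {
    private final int[][] cells;

    Matrix(int rows, int cols) { cells = new int[rows][cols]; }

    private Matrix(int[][] copy) { cells = copy; }

    // Mutable style (as in addRating, Alg. 1 line 5): touches one cell, O(1).
    void setElement(int row, int col, int value) {
        cells[row][col] = value;
    }

    // Immutable style: every update materialises a full copy, O(rows * cols).
    Matrix withElement(int row, int col, int value) {
        int[][] copy = new int[cells.length][];
        for (int r = 0; r < cells.length; r++) {
            copy[r] = cells[r].clone();
        }
        copy[row][col] = value;
        return new Matrix(copy);
    }
}

For a co-occurrence matrix with tens of thousands of rows, the copying cost would be paid on every new rating, which is the inefficiency that the paragraph above attributes to stateless dataflow models.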
Incremental dataflow avoids rerunning entire jobs after updates to the input data. Such models are fundamentally stateful because they maintain results from earlier computation. Incoop [4] and Nectar [11] treat state as a cache of past results. Since they cannot infer which data will be reused, they cache all. CBP transforms batch jobs automatically for incremental computation [9]. Our goals are complementary: SDGs do not infer incremental computation but support stateful computation efficiently, which can realise incremental algorithms.

Existing models that represent state explicitly, such as SEEP [10] and Naiad [26], permit tasks to have access to in-memory data structures but face challenges related to state sizes: they assume that state is small compared to the data. When large state requires distributed processing through partitioning or replication, they do not provide abstractions to support this. In contrast, Piccolo [3] supports scalable distributed state with a key/value abstraction. However, it does not offer a dataflow model, which means that it cannot execute an inferred dataflow from a Java program but requires computation to be specified as multiple kernels.

Latency and iteration. Tasks in a dataflow graph can be scheduled for execution or materialised in a pipeline, each with different performance implications. Some frameworks follow a hybrid approach in which tasks on the same node are pipelined but not between nodes. Since tasks in stateless dataflows are scheduled to process coarse-grained batches of data, such systems can exploit the full parallelism of a cluster but they cannot achieve low processing latency.

For lower latency, batched dataflows divide data into small batches for processing and use efficient, yet complex, task schedulers to resolve data dependencies. They have a fundamental trade-off between the lower latency of smaller batches and the higher throughput of larger ones; typically they burden developers with making this trade-off [39].

Continuous dataflow adopts a streaming model with a pipeline of tasks. It does not materialise intermediate data between nodes and thus has lower latency without a scheduling overhead: as we show in §6, batched dataflows cannot achieve the same low latencies. Due to our focus on online processing with low latency, SDGs are fully pipelined (see §3.1).

To improve the performance of iterative computation in dataflows, early frameworks such as HaLoop [5] cache the results of one iteration as input to the next. Recent frameworks [5, 38, 25, 9] generalise this concept by permitting iteration over arbitrary parts of the dataflow graph, executing tasks repeatedly as part of loops. Similarly, SDGs support iteration explicitly by permitting cycles in the dataflow graph.

Failure recovery. To recover from failure, frameworks either recompute state based on previous data or checkpoint state to restore it. For recomputation, Spark represents dataflows as RDDs [38], which can be recomputed deterministically based on their lineage. Continuous dataflow frameworks use techniques such as upstream backup [14] to reprocess buffered data after failure. Without checkpointing, recomputation can lead to long recovery times.

Checkpointing periodically saves state to disk or the memory of other nodes. With large state, this becomes resource-intensive. SEEP recovers state from memory, thus doubling the memory requirement of a cluster [10]. A challenge is how to take consistent checkpoints while processing data. Synchronous global checkpointing stops processing on all nodes to obtain consistent snapshots, thus reducing performance. For example, Naiad's stop-the-world approach exhibits low throughput with large state sizes [26]. Asynchronous global checkpointing, as used by Piccolo [3], permits nodes to take consistent checkpoints at different times. Both techniques include all global state in a checkpoint and thus require all nodes to restore state after failure.

Instead, SDGs use an asynchronous checkpointing mechanism with log-based recovery. As described in §5, it does not require global coordination between nodes during recovery, and it uses dirty state to minimise the disruption to processing during local checkpointing.

[Figure 1: Stateful dataflow graph for the CF algorithm (TEs: updateUserItem, getUserVec, updateCoOcc, getRecVec, merge; SEs: userItem, coOcc)]

3 Stateful Dataflow Graphs

The goal of stateful dataflow graphs (SDGs) is to make it easy to translate imperative programs with mutable state to a dataflow representation that performs parallel, iterative computation with low latency. Next we describe their model (§3.1), how they support distributed state (§3.2) and how they are executed (§3.3).

3.1 Model

We explain the main features of SDGs using the CF algorithm from §2.1 as an example.
As shown in Fig. 1, an SDG has two types of vertices: task elements, t ∈ T, transform input to output dataflows; and state elements, s ∈ S, represent the state in the SDG. Access edges, a = (t, s) ∈ A, connect task elements to the state elements that they read or update. To facilitate the allocation of task and state elements to nodes, each task element can only access a single state element, i.e. A is a partial function: (t_i, s_j) ∈ A, (t_i, s_k) ∈ A ⇒ s_j = s_k. Dataflows are edges between task elements, d = (t_i, t_j) ∈ D, and contain data items.

Task elements (TEs) are not scheduled for execution but the entire SDG is materialised, i.e. each TE is assigned to one or more physical nodes. Since TEs are pipelined, it is unnecessary to generate the complete output dataflow of a TE before it is processed by the next TE. Data items are therefore processed with low latency, even across a sequence of TEs, without scheduling overhead, and fewer data items are handled during failure recovery (see §5).

The SDG in Fig. 1 has five TEs assigned to three nodes: the updateUserItem and updateCoOcc TEs realise the addRating function from Alg. 1; and the getUserVec, getRecVec and merge TEs implement the getRec function. We explain the translation process in §4.2.

State elements (SEs) encapsulate the state of the computation. They are implemented using efficient data structures, such as hash tables or indexed sparse matrices. In the next section, we describe the abstractions for distributed SEs, which span multiple nodes. Fig. 1 shows the two SEs of the CF algorithm: the userItem and the coOcc matrices. The access edges specify that userItem is updated by the updateUserItem TE and read by the getUserVec TE; coOcc is updated by updateCoOcc and read by getRecVec.

Parallelism. For data-parallel processing, a TE t_i can be instantiated multiple times to handle parts of a dataflow, resulting in multiple TE instances t̂_{i,j}, 1 ≤ j ≤ n_i. As we explain in §3.3, the number of instances n_i for each TE is chosen at runtime and adjusted based on workload demands and the occurrence of stragglers. An appropriate dispatching strategy sends items in dataflows to TE instances: items can be (i) partitioned using hash- or range-partitioning on a key; or (ii) dispatched to an arbitrary instance, e.g. in a round-robin fashion for load-balancing.

Iteration. In iterative algorithms, SEs are accessed multiple times by TEs. There are two cases to be distinguished: (i) if the repeated access is from a single TE, the iteration is entirely local and can be supported efficiently by a single node; and (ii) if the iteration involves multiple pipelined TEs, a cycle in the dataflow of the SDG can propagate updates between TEs.

With cycles in the dataflow, SDGs do not provide coordination during iteration by default. This is sufficient for many iterative machine learning and data mining algorithms because they can converge from different intermediate states [3], even without explicit coordination. A strong consistency model for SDGs could be realised with per-loop timestamps, as used by Naiad [26].

3.2 Distributed state

The SDG model provides abstractions for distributed state. An SE s_i may be distributed across nodes, leading to multiple SE instances ŝ_{i,j}, because (i) it is too large to fit into the memory of a single node; or (ii) it is accessed by a TE that has multiple instances to process the dataflow in parallel. This also requires multiple SE instances so that the TE instances access state locally. Fig. 1 illustrates these two cases: (i) the userItem SE may grow larger than the main memory of a single node; and (ii) the data-parallel execution of the CPU-intensive updateCoOcc TE leads to multiple instances, each requiring local access to the coOcc SE.

An SE can be distributed in different ways, which are depicted in Fig. 2: a partitioned SE splits its internal data structure into disjoint partitions; if this is not possible, a partial SE duplicates its data structure, creating multiple copies that are updated independently. As we describe in §4, developers select the required type of distributed state using source-level annotations according to the semantics of their algorithm.

[Figure 2: Types of distributed state in SDGs: (a) SE, (b) partitioned SE, (c) partial SE with merge]

Partitioned state. For algorithms for which state can be partitioned, SEs can be split and SE instances placed on separate nodes (see Fig. 2b). Access to the SE instances occurs in parallel.

Developers can use predefined data structures for SEs (e.g. Vector, HashMap, Matrix and DenseMatrix) or define their own by implementing dynamic partitioning and dirty state support (see §5). Different data structures support different partitioning strategies: e.g. a map can be hash- or range-partitioned; a matrix can be partitioned by row or column. To obtain a unique partitioning, TEs cannot access partitioned SEs using conflicting strategies, such as accessing a matrix by row and by column. In addition, the dataflow partitioning strategy must be compatible with the data access pattern by the TEs, as specified in the program (see §4.2). For example, multiple TE instances with an access edge to a partitioned SE must use the same partitioning key on the dataflow so that they access SE instances locally: in the CF algorithm, the userItem SE and the new rating and rec request dataflows must all be partitioned by row, i.e. the users for which ratings are maintained.
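As an illustration of this compatibility requirement, the following sketch shows a single hash-based partitioning strategy shared by the dispatcher of the new-rating dataflow and the row-partitioned userItem SE, so that an item keyed by user always reaches the node that holds that user's row. The PartitioningStrategy and Dispatcher types are hypothetical; they are not interfaces described in the paper.

// Hypothetical sketch: dataflow dispatching and state partitioning must
// agree on the same key (here, the user id used as the row key).
interface PartitioningStrategy {
    int instanceFor(int key, int numInstances);
}

final class HashByRow implements PartitioningStrategy {
    @Override
    public int instanceFor(int key, int numInstances) {
        return Math.floorMod(key, numInstances);
    }
}

final class Dispatcher {
    private final PartitioningStrategy strategy;
    private final int numInstances;

    Dispatcher(PartitioningStrategy strategy, int numInstances) {
        this.strategy = strategy;
        this.numInstances = numInstances;
    }

    // A new rating keyed by 'user' and the userItem partition holding that
    // user's row resolve to the same index, so the TE instance that receives
    // the item accesses its SE instance locally.
    int route(int user) {
        return strategy.instanceFor(user, numInstances);
    }
}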
Partial state. In some cases, the data structure of an SE cannot be partitioned because the access pattern of TEs is arbitrary. For example, in the CF algorithm, the coOcc matrix has an access pattern in which the updateCoOcc TE may update any row or column. In this case, an SE is distributed by creating multiple partial SE instances, each containing the whole data structure (see Fig. 2c). Partial SE instances can be updated independently by different TE instances.

When a TE accesses a partial SE, there are two possible types of accesses based on the semantics of the algorithm: a TE instance may access (i) the local SE instance on the same node; or (ii) the global state by accessing all of the partial SE instances, which introduces a synchronisation point. As we describe in §4.2, the type of access to partial SEs is determined by annotations.

When accessing all partial SE instances, it is possible to execute computation that merges their values, thus reconciling the differences between them. This is done by a merge TE that computes a single global value from partial SE instances. Merge computation is application-specific and must be defined by the developer. In the CF algorithm, the merge function takes all partial userRec vectors and computes a single recommendation vector.

3.3 Execution

To execute an SDG, the runtime system allocates TE and SE instances to nodes, creating instances on-demand.

Allocation to nodes. Since we want to avoid remote state access, the general strategy is to colocate TEs and SEs that are connected by access edges on the same node. The runtime system uses four steps for mapping TEs and SEs to nodes: if there is a cycle in the SDG, all SEs accessed in the cycle are colocated if possible to reduce communication in iterative algorithms (step 1); the remaining SEs are allocated on separate nodes to increase available memory (step 2); TEs are colocated with the SEs that they access (step 3); and finally, any unallocated TEs are assigned to separate nodes (step 4).

Fig. 1 illustrates the above steps for allocating the SDG to nodes n1 to n3: since there are no cycles (step 1), the userItem SE is assigned to node n1, and the coOcc SE is assigned to n2 (step 2); the updateUserItem and getUserVec TEs are assigned to n1, and the updateCoOcc and getRecVec TEs are assigned to n2 (step 3); finally, the merge TE is allocated to a new node n3 (step 4).

Runtime parallelism and stragglers. Processing bottlenecks in the deployed SDG, e.g. caused by the computational cost of TEs, cannot be predicted statically, and TE instances may become stragglers [4]. Previous work [26] tries to reduce stragglers proactively for low latency processing, which is hard due to the many non-deterministic causes of stragglers. Instead, similar to speculative execution in MapReduce [4], SDGs adopt a reactive approach. Using a dynamic dataflow graph approach [10], the runtime system changes the number of TE instances in response to stragglers. Each TE is monitored to determine if it constitutes a processing bottleneck that limits throughput. If so, a new TE instance is created, which may result in new partitioned or partial SE instances.
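The four-step mapping above can be summarised as a simple placement loop. The sketch below is illustrative only; the element and node representations and the allocate method are assumptions, not the runtime's actual data structures.

import java.util.*;

// Hypothetical sketch of the four-step TE/SE placement heuristic (Section 3.3).
final class Placement {
    static Map<Object, Integer> allocate(List<Object> taskElements,
                                         Map<Object, Object> accessedSE,   // TE -> SE (at most one)
                                         List<Object> stateElements,
                                         Set<Object> sesInCycles,
                                         int numNodes) {
        Map<Object, Integer> nodeOf = new HashMap<>();
        int nextNode = 0;

        // Step 1: colocate all SEs accessed in a cycle on one node, if possible.
        if (!sesInCycles.isEmpty()) {
            int n = nextNode++ % numNodes;
            for (Object se : sesInCycles) nodeOf.put(se, n);
        }
        // Step 2: place the remaining SEs on separate nodes to increase memory.
        for (Object se : stateElements)
            if (!nodeOf.containsKey(se)) nodeOf.put(se, nextNode++ % numNodes);
        // Step 3: colocate each TE with the SE that it accesses.
        for (Object te : taskElements) {
            Object se = accessedSE.get(te);
            if (se != null && nodeOf.containsKey(se)) nodeOf.put(te, nodeOf.get(se));
        }
        // Step 4: assign any unallocated TEs (e.g. merge) to separate nodes.
        for (Object te : taskElements)
            if (!nodeOf.containsKey(te)) nodeOf.put(te, nextNode++ % numNodes);

        return nodeOf;
    }
}

Applied to the SDG of Fig. 1, such a loop reproduces the allocation described above: userItem and its TEs on n1, coOcc and its TEs on n2, and merge on n3.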
3.4 Discussion

With an explicit representation of state, a single SDG can express multiple workflows over that state. In the case of the CF algorithm from Alg. 1, the SDG processes new ratings by updating the SEs for the user/item and co-occurrence matrices, and also serves recommendation requests using the same SEs with low latency. Without SDGs, these two workflows would require separate offline and online systems [23, 32]: a batch processing framework would incorporate new ratings periodically, and online recommendation requests would be served by a dedicated system from memory. Since it is inefficient to rerun the batch job after each new rating, the recommendations would be computed on stale data.

A drawback of the materialised representation of SDGs is the start-up cost. For short jobs, the deployment cost may dominate the running time. Our prototype implementation deploys an SDG with 5 TE and SE instances on 5 nodes within 7 s, and we assume that jobs are sufficiently long-running to amortise this delay.

4 Programming SDGs

We describe how to translate stateful Java programs statically to SDGs for parallel execution. We do not attempt to be completely transparent for developers or to address the general problem of automatic code parallelisation. Instead, we exploit data and pipeline parallelism by relying on source code annotations. We require developers to provide a single Java class with annotations that indicate how state is distributed and accessed.

4.1 Annotations

When defining a field in a Java class, a developer can indicate if its content can be partitioned or is partial by annotating the field declaration with @Partitioned or @Partial, respectively.

@Partitioned. This annotation specifies that a field can be split into disjoint partitions (see §3.2). A reference to a @Partitioned field always refers to a single partition. This requires that access to the field uses an access key to infer the partition. In the CF algorithm in Alg. 1, rows of the userItem matrix are updated with information about a single user only, and thus userItem can be declared as a partitioned field.

@Partial. Fields are annotated with @Partial if distributed instances of the field should be accessed independently (see §3.2). Partial fields enable developers to define distributed state when it cannot be partitioned. In CF, matrix coOcc is annotated with @Partial, which means that multiple instances of the matrix may be created, and each of them is updated independently for users in a partition (lines 10-11).

@Global. By default, a reference to a @Partial field refers to only one of its instances. While most of the time computation should apply to one instance to make independent progress, it may also be necessary to support operations on all instances. A field reference annotated with @Global forces a Java expression to apply to all instances, denoting global access to a partial field, which introduces a synchronisation barrier in the SDG (see §4.2).

Java expressions deriving from @Global access become logically multi-valued because they include results from all instances of a partial field. As a result, any local variable that is assigned the result of a global field access becomes partial and must be annotated as such. In CF, the access to the coOcc field carries the @Global annotation to compute all partial recommendations: each instance of coOcc is multiplied with the user rating vector userRow, and the results are stored in the partial local variable userRec (line 16).

@Collection. Global access to a partial field applies to all instances, but it hides the individual instances from the developer. At some point in the program, however, it may be necessary to reconcile all instances. The @Collection annotation therefore exposes all instances of a partial field or variable as a Java array after @Global access. This enables the program to iterate over all values and, for example, merge them into a single value. In CF, the partial recommendations are combined by accessing them using the @Global annotation and then invoking the merge method (line 17). The parameter of merge is annotated with @Collection, which specifies that the method can access all instances of the partial userRec variable to compute the final recommendation result.

Limitations. Java programs need to obey certain restrictions to be translated to SDGs due to their dataflow nature and fault tolerance properties:

Explicit state classes. All state in the program must be implemented using the set of SE classes (see §3.2). This gives the runtime system the ability to partition objects of these classes into multiple instances (for partitioned state) or distribute them (for partial state), and recover them after failure (see §5).

Location independence. Each object accessed in the program must support transparent serialisation/deserialisation: as SDGs are distributed, objects are propagated between nodes. The program also cannot make assumptions about its execution environment, e.g. by relying on local network sockets or files.

Side-effect-free parallelism. To support the parallel evaluation of multi-valued expressions under @Global state access, such expressions must not affect single-valued expressions. For example, the statement @Global coOcc.multiply(userRow) in line 16 in Alg. 1 cannot update userRow, which is single-valued.

Deterministic execution. The program must be deterministic, i.e. it should not depend on system time or random input. This enables the runtime system to re-execute computation when recovering after failure (see §5).

[Figure 3: Translation of an annotated Java program to an SDG]

4.2 Translating programs to SDGs

Annotated Java programs are translated to SDGs by the java2sdg tool. Fig. 3 shows the steps performed by java2sdg: it first statically analyses the Java class to identify SEs, TEs and their access edges (steps 1-5); it then transforms the Java bytecode of the class to generate TE code, ready for deployment (steps 6-8).

1. SE generation. The class is compiled to Jimple code, a typed intermediate representation for static analysis used by the Soot framework [33] (step 1). The Jimple code is analysed to identify SEs with partitioned or partial fields and partial local variables (step 2). Based on the annotations in the code, access to SEs is classified as local, partitioned or global (step 3).

2. TE and dataflow generation. Next TEs are created so that each TE only accesses a single SE, i.e. a new TE is created from a block of code when access to a different SE or a different instance of the current SE is detected (step 4). The dispatching semantics of the dataflows between created TEs (i.e. partitioned, all-to-one, one-to-all or one-to-any) is chosen based on the type of state access. More specifically, a new TE is created:

1. for each entry point of the class;

2. when a TE uses partitioned access to a new SE (or to a previously-accessed SE with a new access key). The access key is extracted using reaching expression analysis, and the dataflow edge between the two TEs is annotated with the access key;
3. when a TE uses global access to a new partial SE. In this case, the dataflow edge between the two TEs is annotated with one-to-all dispatching semantics;

4. when a TE uses local access to a new partial SE; the dataflow edge is annotated with one-to-any dispatching semantics. In case of local (or partitioned) access after global access, all TE instances must be synchronised using a distributed barrier before control is transferred to the new TE, and the dataflow edge has all-to-one dispatching semantics; and

5. for @Collection expressions. A synchronisation barrier collects values from multiple TE instances, and its dataflow edge has all-to-one semantics.

After generating the TEs, java2sdg identifies the variables that must propagate across TE boundaries (step 5). For each dataflow, live variable analysis identifies the set of variables that are associated with that dataflow edge.

3. Bytecode generation. Next java2sdg synthesises the bytecode for each TE that will be executed by the runtime system. It compiles the code assigned with each TE in step 4 to bytecode and injects it into a TE template (step 6) using Javassist. State accesses to fields and partial variables are translated to invocations of the runtime system, which manages the SE instances (step 7). Finally, data dispatching across TEs is added (step 8): java2sdg injects code, (i) at the exit point of TEs, to serialise live variables and send them to the correct successor TE instance; and, (ii) at the entry point of a TE, to add barriers for all-to-one dispatching and to gather partial results for merge TEs.

5 Failure Recovery

To recover from failures, it is necessary to replace failed nodes and re-instantiate their TEs and SEs. TEs are stateless and thus are restored trivially, but the state of SEs must be recovered. We face the challenge of designing a recovery mechanism that: (i) can scale to save and recover the state of a large number of nodes with low overhead, even with frequent failures; (ii) has low impact on the processing latency; and (iii) achieves fast recovery time when recovering large SEs.

We achieve these goals with a mechanism that (a) combines local checkpoints with message replay, thus avoiding both global checkpoint coordination and global rollbacks; (b) divides the state of SEs into consistent state, which is checkpointed, and dirty state, which permits continued processing while checkpointing; and (c) partitions checkpoints and saves them to multiple nodes, which enables parallel recovery.

Approach. Our failure recovery mechanism combines local checkpointing and message logging and is inspired by failure recovery in distributed stream processing systems [14]. Nodes periodically take checkpoints of their local SEs and output communication buffers. Dataflows include increasing TE-generated scalar timestamps, and a vector timestamp of the last data item from each input dataflow that modified the SEs is included in the checkpoint. Once the checkpoint is saved to stable storage, upstream nodes can trim their output buffers of data items that are older than all downstream checkpoints.

After failure, a node recovers its SEs from the last checkpoint, replays its output buffers and reprocesses data items received from the upstream output buffers. Downstream nodes detect duplicate data items based on the timestamps and discard them. This approach allows nodes to recover SEs locally beyond the last checkpoint, without requiring nodes to coordinate global rollback, and it avoids the output commit problem.

State checkpointing. We use an asynchronous parallel checkpointing mechanism that minimises the processing interruption when checkpointing large SEs with GBs of memory. The idea is to record updates in a separate data structure while taking a checkpoint. For each type of data structure held by an SE, there must be an implementation that supports the separation of dirty state and its subsequent consolidation. Checkpointing of a node works as follows: (1) to initiate a checkpoint, each SE is flagged as dirty and the output buffers are added to the checkpoint; (2) updates from TEs to an SE are now handled using a dirty state data structure: e.g. updates to keys in a dictionary are written to the dirty state, and reads are first served by the dirty state and, only on a miss, by the dictionary; (3) asynchronously to the processing, the now consistent state is added to the checkpoint; (4) the checkpoint is backed up to multiple nodes (see below); and (5) the SE is locked and its state is consolidated with the dirty state.
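A minimal sketch of the dirty-state idea for a dictionary SE is shown below. The class and method names are assumptions for illustration; the paper does not list its SE implementations. While the consistent map is being serialised asynchronously, writes go to a dirty overlay and reads consult the overlay first; consolidation happens under a short lock, as in step (5).

import java.util.HashMap;
import java.util.Map;

// Hypothetical dictionary SE with dirty-state support for asynchronous checkpoints.
final class CheckpointableDictionary<K, V> {
    private final Map<K, V> consistent = new HashMap<>(); // state covered by the checkpoint
    private final Map<K, V> dirty = new HashMap<>();       // updates arriving while checkpointing
    private boolean checkpointing = false;

    synchronized void put(K key, V value) {
        if (checkpointing) dirty.put(key, value);   // step (2): do not mutate checkpointed state
        else consistent.put(key, value);
    }

    synchronized V get(K key) {
        if (checkpointing && dirty.containsKey(key)) return dirty.get(key); // dirty first
        return consistent.get(key);                                          // miss: consistent map
    }

    // Steps (1) and (3): flag the SE as dirty and hand the consistent state to the
    // checkpointing thread, which serialises and backs it up off the processing path.
    synchronized Map<K, V> beginCheckpoint() {
        checkpointing = true;
        return consistent;   // not mutated until endCheckpoint() is called
    }

    // Step (5): lock briefly and consolidate the dirty updates into the consistent state.
    synchronized void endCheckpoint() {
        consistent.putAll(dirty);
        dirty.clear();
        checkpointing = false;
    }
}

Only endCheckpoint() blocks processing, and only for the time needed to merge the dirty entries.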
State backup and restore. To be memory-efficient, checkpoints must be stored on disk. We overcome the problem of low I/O performance by splitting checkpoints across m nodes. To reduce recovery time, a failed SE instance can be restored to n new partitioned SE instances in parallel. This m-to-n pattern prevents a single node from becoming a disk, network or processing bottleneck.

[Figure 4: Parallel, m-to-n state backup and restore]

Fig. 4 shows the distributed protocol for backing up checkpoints. In step B1, checkpoint chunks, e.g. obtained by hash-partitioning checkpoint data, are created, and a thread pool serialises them in parallel (step B2). Checkpoint chunks are streamed to m nodes, selected in a round-robin fashion (step B3). Nodes write received checkpoint chunks directly to disk.

After failure, new nodes are instantiated with the lost TEs and SEs. Each node with a checkpoint chunk splits it into n partitions, each of which is streamed to one of the n recovering instances (step R1). The new SE instances reconcile the chunks, reverting the partitioning (step R2). Finally, data items from output buffers are reprocessed to bring the recovered SE state up-to-date (step R3).

6 Evaluation

The goal of our experimental evaluation is to explore if SDGs can (i) execute stateful online processing applications with low latency and high throughput while supporting large state sizes with fine-grained updates (§6.1); (ii) scale in terms of nodes comparable to stateless batch processing frameworks (§6.2); (iii) handle stragglers at runtime with low impact on throughput (§6.3); and (iv) recover from failures with low overhead (§6.4).

We extend the SEEP streaming platform to implement SDGs and deploy our prototype on Amazon EC2 and a private cluster with 7 quad-core 3.4 GHz Intel Xeon servers with 8 GB of RAM. To support fast recovery, the checkpointing frequency for all experiments is 10 s unless stated otherwise. Candlesticks in plots show the 5th, 25th, 50th, 75th and 95th percentiles, respectively.

6.1 Stateful online processing

Throughput and latency. First we investigate the performance of SDGs using the online collaborative filtering (CF) application (see §2.1). We deploy it on 36 EC2 VM instances ("c1.xlarge"; 8 vCPUs with 7 GB) using the Netflix dataset, which contains 100 million movie ratings for evaluating recommender systems. We add new ratings continuously (addRating), while requesting fresh recommendations (getRec). The state size maintained by the system grows to 2 GB.

[Figure 5: Throughput and latency with different read/write ratios (online collaborative filtering)]

Fig. 5 shows the throughput of getRec and addRating requests and the latencies of getRec requests when the ratio between the two is varied. The achieved throughput is sufficient to serve 10,000-14,000 requests/s, with the 95th percentile of responses being at most 0.5 s stale. As the workload ratio includes more state reads (getRec), the throughput decreases slightly due to the cost of the synchronisation barrier that aggregates the partial state in the SDG. The result shows that SDGs can combine the functionality of a batch and an online processing system, while serving fresh results with low latency and high throughput over large mutable state.

State size. Next we evaluate the performance of SDGs as the state size increases. As a synthetic benchmark, we implement a distributed partitioned key/value store (KV) using SDGs because it exemplifies an algorithm with pure mutable state. We compare to an equivalent implementation in Naiad (version 0.2) with global checkpointing, which is the only fault-tolerance mechanism available in the open-source version. We deploy it in one VM ("m1.xlarge") and measure the performance of serving update requests for keys.

[Figure 6: Throughput and latency with increasing state size on a single node (key/value store)]

Fig. 6 shows that, for a small state size of 100 MB, both SDGs and Naiad exhibit a similar throughput of 65,000 requests/s with low latency. As the state size increases to 2.5 GB, the SDG throughput is largely unaffected but Naiad's throughput decreases due to the overhead of its disk-based checkpoints (Naiad-Disk). Even with checkpoints stored on a RAM disk (Naiad-NoDisk), its throughput with 2.5 GB of state is 63% lower than that of SDGs. Similarly, the 95th percentile latency in Naiad increases when it stops processing during checkpointing; SDGs do not suffer from this problem.

To investigate how SDGs can support large distributed state across multiple nodes, we scale the KV store by increasing the number of VMs from 10 to 40, keeping the number of dictionary keys per node constant at 5 GB.

[Figure 7: Throughput and latency with increasing state size on multiple nodes (key/value store)]

Fig. 7 shows the throughput and the latency for read requests with a given total state size. The aggregate throughput scales near linearly from 470,000 requests/s for 50 GB to 1.5 million requests/s for 200 GB. The median latency increases from 8 to 29 ms, while the 95th percentile latency varies between 8 ms and 100 ms. This result demonstrates that SDGs can support stateful applications with large state sizes without compromising throughput or processing latency, while executing in a fault-tolerant fashion.

Update granularity. We show the performance of SDGs when performing frequent, fine-grained updates to state. For this, we deploy a streaming wordcount (WC) application on 4 nodes in our private cluster. WC reports the word frequencies over a wall clock time window while processing the Wikipedia dataset. We compare to WC implementations in Streaming Spark [39] and Naiad.
We vary the size of the window, which controls the granularity at which input data updates the state: the smaller the window size, the less batching can be done when updating the state. Since Naiad permits the configuration of the batch size independently of the window size, we use a small batch size for low-latency processing (Naiad-LowLatency) and a large one for high-throughput processing (Naiad-HighThroughput).

Fig. 8 shows that only SDG and Naiad-LowLatency can sustain processing for all window sizes, but SDG has a higher throughput due to Naiad's scheduling overhead. The other deployments suffer from the overhead of micro-batching: Streaming Spark has a throughput similar to SDG, but its smallest sustainable window size is 25 ms, after which its throughput collapses; Naiad-HighThroughput achieves the highest throughput of all, but it also cannot support windows smaller than 100 ms. This shows that SDGs can perform fine-grained state updates without trading off throughput for latency.

6.2 Scalability

We explore if SDGs can scale to higher throughput with more nodes in a batch processing scenario. We deploy an implementation of logistic regression (LR) [2] on EC2 ("m1.xlarge"; 4 vCPUs with 15 GB). We compare to LR from Spark [38], which is designed for iterative processing, using the 100 GB dataset provided in its release.

[Figure 8: Latency with different window sizes (streaming wordcount)]
[Figure 9: Scalability in terms of throughput (batch logistic regression)]

Fig. 9 shows the throughput of our SDG implementation and Spark for 25 nodes. Both systems exhibit linear scalability. The throughput of SDGs is higher than Spark, which is likely due to the pipelining in SDGs, which avoids the re-instantiation of tasks after each iteration. With higher throughput, iterations are shorter, which leads to a faster convergence time. We conclude that the management of partial state in the LR application does not limit scalability compared to existing stateless dataflow systems.

6.3 Stragglers

We explore how SDGs handle straggling nodes by creating new TE and SE instances at runtime (see §3.3). For this, we deploy the CF application on our cluster and include a less powerful machine (2.4 GHz with 4 GB).

[Figure 10: Runtime parallelism for handling stragglers (collaborative filtering)]

Fig. 10 shows how the throughput and the number of nodes change over time as bottleneck TEs are identified by the system. At the start, a single instance of the getRecVec TE is deployed. It is identified as a bottleneck, and a second instance is added at t = 10 s, which also causes a new instance of the partial state in the coOcc matrix to be created. This increases the throughput from 36,000 to 62,000 requests/s. The throughput spikes occur when the input queues of new TE instances fill up.

Since the new node is allocated on the less powerful machine, it becomes a straggler, limiting overall throughput. At t = 30 s, adding a new TE instance without relieving the straggler does not increase the throughput. At t = 50 s, the straggling node is detected by the system, and a new instance is created to share its work. This increases the throughput from 62,000 requests/s. This shows how straggling nodes are mitigated by allocating new TE instances on-demand, distributing new partial or partitioned SE instances as required. In more extreme cases, a straggling node could even be removed and the job resumed from a checkpoint with new nodes.

6.4 Failure recovery

We evaluate the performance and overhead of our failure recovery mechanism for SDGs. We (i) explore the recovery time under different recovery strategies; (ii) assess the advantages of our asynchronous checkpointing mechanism; and (iii) investigate the overhead with different checkpointing frequencies and state sizes. We deploy the KV store on one node of our cluster, together with spare nodes to store backups and replace failed nodes.

Recovery time. We fail the node under different recovery strategies: an m-to-n recovery strategy uses m backup nodes to restore to n recovered nodes (see §5). For each, we measure the time to restore the lost SE, re-process unprocessed data and resume processing.

[Figure 11: Recovery times with different m-to-n recovery strategies]

Fig. 11 shows the recovery times for different SE sizes under different strategies: (i) the simplest strategy, 1-to-1, has the longest recovery time, especially with large state sizes, because the state is restored from a single node; (ii) the 2-to-1 strategy streams checkpoint chunks from two nodes in parallel, which improves disk I/O throughput but also increases the load on the recovering node when it reconstitutes the state; (iii) in the 1-to-2 strategy, checkpoint chunks are streamed to two recovering nodes, thus halving the load of state reconstruction; and (iv) the 2-to-2 strategy recovers fastest because it combines the above two strategies: it parallelises both the disk reads and the state reconstruction.

As the state becomes large, state reconstruction dominates over disk I/O overhead: with 40 GB, streaming from two disks does not improve recovery time. Adopting a strategy that recovers a failed node with multiple nodes, however, has significant benefit, compared to cases with smaller state sizes.

[Figure 12: Comparison of sync. and async. checkpointing]

Synchronous vs. asynchronous checkpointing. We investigate the benefit of our asynchronous checkpointing mechanism in comparison with synchronous checkpointing that stops processing, as used by Naiad [26] and SEEP [10]. Fig. 12 compares the throughput and 99th percentile latency with increasing state sizes. As the checkpoint size grows from 10 to 40 GB, the average throughput under synchronous checkpointing reduces by 33%, and the latency increases from 2 s to 8 s because the system stops processing while checkpointing. With asynchronous checkpointing, there is only a small (~5%) impact on throughput. Latency is an order of magnitude lower and only moderately affected (from 200 to 500 ms). This result shows that a synchronous checkpointing approach cannot achieve low-latency processing with large state sizes.

[Figure 13: Impact of checkpoint frequency and size on latency]

Overhead of asynchronous checkpointing. Next we evaluate the overhead of our checkpointing mechanism as a function of checkpointing frequency and state size. Fig. 13 (top) shows the processing latency when varying the checkpointing frequency. The rightmost data point (No FT) represents the case where the checkpointing mechanism is disabled. The bottom figure reports the impact of the size of the checkpoint on latency.

Checkpointing has a limited impact on latency: without fault tolerance, the 95th percentile latency is 68 ms, and it increases to 105 ms when checkpointing 10 GB every 10 s. This is due to the overhead of merging dirty state and saving checkpoints to disk. Increasing the checkpointing frequency or size gradually also increases latency: the 95th percentile latency with 40 GB is 185 ms, while checkpointing 20 GB every 4 s results in 1 s. Beyond that, the checkpointing overhead starts to impact higher percentiles more significantly. Checkpointing frequency and size behave almost proportionally: as the state size increases, the frequency can be reduced to maintain a low processing latency. Overall this experiment demonstrates the strength of our checkpointing mechanism, which only locks state while merging dirty state. The locking overhead thus reduces proportionally to the state update rate.

7 Related Work

Programming model. Data-parallel frameworks typically support a functional/declarative model: MapReduce [8] only has two higher-order functions; more recent frameworks [5, 38, 3] permit user-defined functional operators; and Naiad [26] supports different functional and declarative programming models on top of its timely dataflow model. CBP [9], Storm and SEEP [10] expose a low-level dataflow programming model: algorithms are defined as a dataflow pipeline, which is harder to program and debug. While functional and dataflow models ease distribution and fault tolerance, SDGs target an imperative programming model, which remains widely used by data scientists [7].

Efforts exist to bring imperative programming to data-parallel processing. CIEL [25] uses imperative constructs such as task spawning and futures, but this exposes the low-level execution of the dynamic dataflow graph to developers. Piccolo [3] and Oolong [24] offer imperative compute kernels with distributed state, which requires algorithms to be structured accordingly. In contrast, SDGs simplify the translation of imperative programs to dataflows using basic program analysis techniques, which infer state accesses and the dataflow. By separating different types of state access, it becomes possible to choose automatically an effective implementation for distributed state.

GraphLab [2] and Pregel [22] are frameworks for graph computations based on a shared-memory abstraction. They expose a vertex-centric programming model whereas SDGs target generic stateful computation.

Program parallelisation. Matlab has language constructs for parallel processing of large datasets on clusters. However, it only supports the parallelisation of sequential blocks or iterations and not of general dataflows. Declarative models such as Pig [28], DryadLINQ [37], SCOPE [6] and Stratosphere [9] are naturally amenable to automatic parallelisation: functions are stateless, which allows data-parallel versions to execute on multiple nodes.
Instead, we focus on an imperative model.

Other approaches offer new programming abstractions for parallel computation over distributed state. FlumeJava [7] provides distributed immutable collections. While immutability simplifies parallel execution, it limits the expression of imperative algorithms. In Piccolo [30], global mutable state is accessed remotely by parallel distributed functions. In contrast, tasks in SDGs only access local state with low latency, and state is always colocated with computation. Presto [35] has distributed partitioned arrays for the R language. Partitions can be collected but not updated by multiple tasks, whereas SDGs permit arbitrary dataflows.

Extracting parallel dataflows from imperative programs is a hard problem [16]. We follow an approach similar to that of Beck et al. [3], in which a dataflow graph is generated compositionally from the execution graph. While early work focused on hardware-based dataflow models [27], more recent efforts target thread-based execution [18]. Our problem is simpler because we do not extract task parallelism but only focus on data and pipeline parallelism in relation to distributed state access. Similar to pragma-based techniques [34], we use annotations to transform access to distributed state into access to local instances. Blazes [2] uses annotations to generate automatically coordination code for distributed programs. Our goal is different: SDGs execute imperative code in a distributed fashion, and coordination is determined by the extracted dataflow.
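As a rough illustration of how annotations can mark distributed state in an otherwise plain imperative program, consider the Java sketch below. The annotation names and the task are invented for this example; they only convey the idea of declaring partitioned and partial state so that a translator can choose an implementation for each access.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of an annotated imperative task: the annotations mark
    // which fields are distributed state and how they may be accessed, so that a
    // translator can infer state accesses and pick a partitioned or partial
    // implementation. The annotation names are invented for this illustration.
    public class WordCountTask {

        @interface Partitioned {}   // state that may be split by key across nodes
        @interface Partial {}       // state of which each node keeps a local instance

        @Partitioned
        private final Map<String, Long> counts = new HashMap<>(); // partitioned by word

        @Partial
        private long processed = 0; // per-node counter, merged only when read

        // Imperative update logic; each call touches only local state instances.
        public void process(String word) {
            counts.merge(word, 1L, Long::sum);
            processed++;
        }

        public long processedSoFar() {
            return processed;
        }
    }

A translator in the spirit described above could route accesses to counts to the node holding the relevant key partition, while processed would stay node-local and be merged across instances only when read.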

Failure recovery. In-memory systems are prone to failures [1], and fast recovery is important for low-latency and high-throughput processing. With large state sizes, checkpoints cannot be stored in memory, but storing them on disk can increase recovery time. RAMCloud [29] replicates data across cluster memory and eventually backs it up to persistent storage. Similar to our approach, data is recovered from multiple disks in parallel. However, rather than replicating each write request, we checkpoint large state atomically, while permitting new requests to operate on dirty state. Streaming Spark [39] and Spark [38] use RDDs for recovery. After a failure, RDDs are recomputed in parallel on multiple nodes. Such a recovery mechanism is effective if recomputation is inexpensive; for state that depends on the entire history of the data, it would be prohibitive. In contrast, the parallel recovery in SDGs retrieves partitioned checkpoints from multiple nodes, and only reprocesses data from output buffers to bring restored SE instances up-to-date.

8 Conclusions

Data-parallel processing frameworks must offer a familiar programming model with good performance. Supporting imperative online machine learning algorithms poses challenges to frameworks due to their use of large distributed state with fine-grained access. We describe stateful dataflow graphs (SDGs), a data-parallel model that is designed to offer a dataflow abstraction over large mutable state. With the help of annotations, imperative algorithms can be translated to SDGs, which manage partitioned or partial distributed state. As we demonstrated in our evaluation, SDGs can support diverse stateful applications, thus generalising a number of existing data-parallel computation models.

Acknowledgements. This work was supported by a PhD CASE Award funded by EPSRC/BAE Systems. We thank our PC contact, Jinyang Li, and the anonymous ATC reviewers for their feedback and guidance.

References

[1] AKIDAU, T., BALIKOV, A., ET AL. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. In VLDB (2013).
[2] ALVARO, P., CONWAY, N., ET AL. Blazes: Coordination Analysis for Distributed Programs. In ICDE (2014).
[3] BECK, M., AND PINGALI, K. From Control Flow to Dataflow. In ICPP (1990).
[4] BHATOTIA, P., WIEDER, A., ET AL. Incoop: MapReduce for Incremental Computations. In SOCC (2011).
[5] BU, Y., HOWE, B., ET AL. HaLoop: Efficient Iterative Data Processing on Large Clusters. In VLDB (2010).
[6] CHAIKEN, R., JENKINS, B., ET AL. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In VLDB (2008).
[7] CHAMBERS, C., RANIWALA, A., ET AL. FlumeJava: Easy, Efficient Data-Parallel Pipelines. In PLDI (2010).
[8] DEAN, J., AND GHEMAWAT, S. MapReduce: Simplified Data Processing on Large Clusters. In CACM (2008).
[9] EWEN, S., TZOUMAS, K., ET AL. Spinning Fast Iterative Data Flows. In VLDB (2012).
[10] FERNANDEZ, R. C., MIGLIAVACCA, M., ET AL. Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management. In SIGMOD (2013).
[11] GUNDA, P. K., RAVINDRANATH, L., ET AL. Nectar: Automatic Management of Data and Comp. in Datacenters. In OSDI (2010).
[12] HE, B., YANG, M., ET AL. Comet: Batched Stream Processing for Data Intensive Distributed Computing. In SOCC (2010).
[13] HUESKE, F., PETERS, M., ET AL. Opening the Black Boxes in Data Flow Optimization. In VLDB (2012).
[14] HWANG, J.-H., BALAZINSKA, M., ET AL. High-Availability Algorithms for Distributed Stream Processing. In ICDE (2005).
[15] ISARD, M., BUDIU, M., ET AL. Dryad: Dist. Data-Parallel Programs from Sequential Building Blocks. In EuroSys (2007).
[16] JOHNSTON, W. M., HANNA, J., ET AL. Advances in Dataflow Programming Languages. In CSUR (2004).
[17] KDNUGGETS ANNUAL SOFTWARE POLL. RapidMiner and R vie for the First Place. http://goo.gl/olikb, 2013.
[18] LI, F., POP, A., ET AL. Automatic Extraction of Coarse-Grained Data-Flow Threads from Imperative Programs. In Micro (2012).
[19] LOGOTHETIS, D., OLSTON, C., ET AL. Stateful Bulk Processing for Incremental Analytics. In SOCC (2010).
[20] LOW, Y., BICKSON, D., ET AL. Dist. GraphLab: A Framework for ML and Data Mining in the Cloud. In VLDB (2012).
[21] MA, J., SAUL, L. K., ET AL. Identifying Suspicious URLs: an Application of Large-Scale Online Learning. In ICML (2009).
[22] MALEWICZ, G., AUSTERN, M. H., ET AL. Pregel: A System for Large-scale Graph Processing. In SIGMOD (2010).
[23] MISHNE, G., DALTON, J., ET AL. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. In SIGMOD (2013).
[24] MITCHELL, C., POWER, R., ET AL. Oolong: Asynchronous Distributed Applications Made Easy. In APSYS (2012).
[25] MURRAY, D., SCHWARZKOPF, M., ET AL. CIEL: A Universal Exec. Engine for Distributed Data-Flow Comp. In NSDI (2011).
[26] MURRAY, D. G., MCSHERRY, F., ET AL. Naiad: A Timely Dataflow System. In SOSP (2013).
[27] NIKHIL, R. S., ET AL. Executing a Program on the MIT Tagged-Token Dataflow Architecture. In TC (1990).
[28] OLSTON, C., REED, B., ET AL. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD (2008).
[29] ONGARO, D., RUMBLE, S. M., ET AL. Fast Crash Recovery in RAMCloud. In SOSP (2011).
[30] POWER, R., AND LI, J. Piccolo: Building Fast, Distributed Programs with Partitioned Tables. In OSDI (2010).
[31] SCHELTER, S., EWEN, S., ET AL. All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing. In CIKM (2013).
[32] SUMBALY, R., KREPS, J., ET AL. The Big Data Ecosystem at LinkedIn. In SIGMOD (2013).
[33] VALLÉE-RAI, R., HENDREN, L., ET AL. Soot: A Java Optimization Framework. In CASCON (1999).
[34] VANDIERENDONCK, H., RUL, S., ET AL. The Paralax Infrastructure: Automatic Parallelization with a Helping Hand. In PACT (2010).
[35] VENKATARAMAN, S., BODZSAR, E., ET AL. Presto: Dist. ML and Graph Processing with Sparse Matrices. In EuroSys (2013).
[36] XIN, R. S., ROSEN, J., ET AL. Shark: SQL and Rich Analytics at Scale. In SIGMOD (2013).
[37] YU, Y., ISARD, M., ET AL. DryadLINQ: a System for General-Purpose Distributed Data-Parallel Computing using a High-Level Language. In OSDI (2008).
[38] ZAHARIA, M., CHOWDHURY, M., ET AL. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI (2012).
[39] ZAHARIA, M., DAS, T., ET AL. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In SOSP (2013).
[40] ZAHARIA, M., KONWINSKI, A., ET AL. Improving MapReduce Performance in Heterogeneous Environments. In OSDI (2008).