Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Transcription

1 Dryad: Distributed Data-Parallel Programs from Sequetial uildig locks Michael Isard Microsoft esearch, Silico Valley drew irrell Microsoft esearch, Silico Valley Mihai udiu Microsoft esearch, Silico Valley Deis Fetterly Microsoft esearch, Silico Valley Yua Yu Microsoft esearch, Silico Valley STT Dryad is a geeral-purpose distributed executio egie for coarse-grai data-parallel applicatios. Dryad applicatio combies computatioal vertices with commuicatio chaels to form a dataflow graph. Dryad rus the applicatio by executig the vertices of this graph o a set of available computers, commuicatig as appropriate through files, TP pipes, ad shared-memory FIFOs. The vertices provided by the applicatio developer are quite simple ad are usually writte as sequetial programs with o thread creatio or lockig. ocurrecy arises from Dryad schedulig vertices to ru simultaeously o multiple computers, or o multiple PU cores withi a computer. The applicatio ca discover the size ad placemet of data at ru time, ad modify the graph as the computatio progresses to make efficiet use of the available resources. Dryad is desiged to scale from powerful multi-core sigle computers, through small clusters of computers, to data ceters with thousads of computers. The Dryad executio egie hadles all the difficult problems of creatig a large distributed, cocurret applicatio: schedulig the use of computers ad their PUs, recoverig from commuicatio or computer failures, ad trasportig data betwee vertices. ategories ad Subject Descriptors D.1.3 [POGMMING TEHNIQUES]: ocurret Programmig Distributed programmig Geeral Terms Performace, Desig, eliability Keywords ocurrecy, Distributed Programmig, Dataflow, luster omputig Permissio to make digital or hard copies of all or part of this work for persoal or classroom use is grated without fee provided that copies are ot made or distributed for profit or commercial advatage ad that copies bear this otice ad the full citatio o the first page. To copy otherwise, to republish, to post o servers or to redistribute to lists, requires prior specific permissio ad/or a fee. EuroSys 07, March 21 23, 2007, Lisboa, Portugal. opyright 2007 M /07/ $ INTODUTION The Dryad project addresses a log-stadig problem: how ca we make it easier for developers to write efficiet parallel ad distributed applicatios? We are motivated both by the emergece of large-scale iteret services that deped o clusters of hudreds or thousads of geeralpurpose servers, ad also by the predictio that future advaces i local computig power will come from icreasig the umber of cores o a chip rather tha improvig the speed or istructio-level parallelism of a sigle core [3]. oth of these scearios ivolve resources that are i a sigle admiistrative domai, coected usig a kow, high-performace commuicatio topology, uder cetralized maagemet ad cotrol. I such cases may of the hard problems that arise i wide-area distributed systems may be sidestepped: these iclude high-latecy ad ureliable etworks, cotrol of resources by separate federated or competig etities, ad issues of idetity for autheticatio ad access cotrol. Our primary focus is istead o the simplicity of the programmig model ad the reliability, efficiecy ad scalability of the applicatios. For may resource-itesive applicatios, the simplest way to achieve scalable performace is to exploit data parallelism. There has historically bee a great deal of work i the parallel computig commuity both o systems that automatically discover ad exploit parallelism i sequetial programs, ad o those that require the developer to explicitly expose the data depedecies of a computatio. There are still limitatios to the power of fully-automatic parallelizatio, ad so we build maily o ideas from the latter research traditio. odor [37] was a early example of such a system i a distributed settig, ad we take more direct ispiratio from three other models: shader laguages developed for graphic processig uits (GPUs) [30, 36], Google s Mapeduce system [16], ad parallel databases [18]. I all these programmig paradigms, the system dictates a commuicatio graph, but makes it simple for the developer to supply subrouties to be executed at specified graph vertices. ll three have demostrated great success, i that large umbers of developers have bee able to write cocurret software that is reliably executed i a distributed fashio. We believe that a major reaso for the success of GPU shader laguages, Mapeduce ad parallel databases is that the developer is explicitly forced to cosider the data parallelism of the computatio. Oce a applicatio is cast ito this framework, the system is automatically able to provide the ecessary schedulig ad distributio. The developer

2 eed have o uderstadig of stadard cocurrecy mechaisms such as threads ad fie-grai cocurrecy cotrol, which are kow to be difficult to program correctly. Istead the system rutime abstracts these issues from the developer, ad also deals with may of the hardest distributed computig problems, most otably resource allocatio, schedulig, ad the trasiet or permaet failure of a subset of compoets i the system. y fixig the boudary betwee the commuicatio graph ad the subrouties that ihabit its vertices, the model guides the developer towards a appropriate level of graularity. The system eed ot try too hard to extract parallelism withi a developer-provided subroutie, while it ca exploit the fact that depedecies are all explicitly ecoded i the flow graph to efficietly distribute the executio across those subrouties. Fially, developers ow work at a suitable level of abstractio for writig scalable applicatios sice the resources available at executio time are ot geerally kow at the time the code is writte. The aforemetioed systems restrict a applicatio s commuicatio flow for differet reasos. GPU shader laguages are strogly tied to a efficiet uderlyig hardware implemetatio that has bee tued to give good performace for commo graphics memory-access patters. Mapeduce was desiged to be accessible to the widest possible class of developers, ad therefore aims for simplicity at the expese of geerality ad performace. Parallel databases were desiged for relatioal algebra maipulatios (e.g. SQL) where the commuicatio graph is implicit. y cotrast, the Dryad system allows the developer fie cotrol over the commuicatio graph as well as the subrouties that live at its vertices. Dryad applicatio developer ca specify a arbitrary directed acyclic graph to describe the applicatio s commuicatio patters, ad express the data trasport mechaisms (files, TP pipes, ad sharedmemory FIFOs) betwee the computatio vertices. This direct specificatio of the graph also gives the developer greater flexibility to easily compose basic commo operatios, leadig to a distributed aalogue of pipig together traditioal Uix utilities such as grep, sort ad head. Dryad is otable for allowig graph vertices (ad computatios i geeral) to use a arbitrary umber of iputs ad outputs. Mapeduce restricts all computatios to take a sigle iput set ad geerate a sigle output set. SQL ad shader laguages allow multiple iputs but geerate a sigle output from the user s perspective, though SQL query plas iterally use multiple-output vertices. I this paper, we demostrate that careful choices i graph costructio ad refiemet ca substatially improve applicatio performace, while compromisig little o the programmability of the system. Nevertheless, Dryad is certaily a lower-level programmig model tha SQL or DirectX. I order to get the best performace from a ative Dryad applicatio, the developer must uderstad the structure of the computatio ad the orgaizatio ad properties of the system resources. Dryad was however desiged to be a suitable ifrastructure o which to layer simpler, higherlevel programmig models. It has already bee used, by ourselves ad others, as a platform for several domai-specific systems that are briefly sketched i Sectio 7. These rely o Dryad to maage the complexities of distributio, schedulig, ad fault-tolerace, but hide may of the details of the uderlyig system from the applicatio developer. They use heuristics to automatically select ad tue appropriate Dryad features, ad thereby get good performace for most simple applicatios. We summarize Dryad s cotributios as follows: We built a geeral-purpose, high performace distributed executio egie. The Dryad executio egie hadles may of the difficult problems of creatig a large distributed, cocurret applicatio: schedulig across resources, optimizig the level of cocurrecy withi a computer, recoverig from commuicatio or computer failures, ad deliverig data to where it is eeded. Dryad supports multiple differet data trasport mechaisms betwee computatio vertices ad explicit dataflow graph costructio ad refiemet. We demostrated the excellet performace of Dryad from a sigle multi-core computer up to clusters cosistig of thousads of computers o several otrivial, real examples. We further demostrated that Dryad s fie cotrol over a applicatio s dataflow graph gives the programmer the ecessary tools to optimize tradeoffs betwee parallelism ad data distributio overhead. This validated Dryad s desig choices. We explored the programmability of Dryad o two frots. First, we have desiged a simple graph descriptio laguage that empowers the developer with explicit graph costructio ad refiemet to fully take advatage of the rich features of the Dryad executio egie. Our user experieces lead us to believe that, while it requires some effort to lear, a programmer ca master the PIs required for most of the applicatios i a couple of weeks. Secod, we (ad others withi Microsoft) have built simpler, higher-level programmig abstractios for specific applicatio domais o top of Dryad. This has sigificatly lowered the barrier to etry ad icreased the acceptace of Dryad amog domai experts who are iterested i usig Dryad for rapid applicatio prototypig. This further validated Dryad s desig choices. The ext three sectios describe the abstract form of a Dryad applicatio ad outlie the steps ivolved i writig oe. The Dryad scheduler is described i Sectio 5; it hadles all of the work of decidig which physical resources to schedule work o, routig data betwee computatios, ad automatically reactig to computer ad etwork failures. Sectio 6 reports o our experimetal evaluatio of the system, showig its flexibility ad scalig characteristics i a small cluster of 10 computers, as well as details of larger-scale experimets performed o clusters with thousads of computers. We coclude i Sectios 8 ad 9 with a discussio of the related literature ad of future research directios. 2. SYSTEM OVEVIEW The overall structure of a Dryad job is determied by its commuicatio flow. job is a directed acyclic graph where each vertex is a program ad edges represet data chaels. It is a logical computatio graph that is automatically mapped oto physical resources by the rutime. I particular, there may be may more vertices i the graph tha executio cores i the computig cluster.

3 t ru time each chael is used to trasport a fiite sequece of structured items. This chael abstractio has several cocrete implemetatios that use shared memory, TP pipes, or files temporarily persisted i a file system. s far as the program i each vertex is cocered, chaels produce ad cosume heap objects that iherit from a base type. This meas that a vertex program reads ad writes its data i the same way regardless of whether a chael serializes its data to buffers o a disk or TP stream, or passes object poiters directly via shared memory. The Dryad system does ot iclude ay ative data model for serializatio ad the cocrete type of a item is left etirely up to applicatios, which ca supply their ow serializatio ad deserializatio routies. This decisio allows us to support applicatios that operate directly o existig data icludig exported SQL tables ad textual log files. I practice most applicatios use oe of a small set of library item types that we supply such as ewlie-termiated text strigs ad tuples of base types. schematic of the Dryad system orgaizatio is show i Figure 1. Dryad job is coordiated by a process called the job maager (deoted JM i the figure) that rus either withi the cluster or o a user s workstatio with etwork access to the cluster. The job maager cotais the applicatio-specific code to costruct the job s commuicatio graph alog with library code to schedule the work across the available resources. ll data is set directly betwee vertices ad thus the job maager is oly resposible for cotrol decisios ad is ot a bottleeck for ay data trasfers. Job schedule JM Data plae Files, FIFO, Network NS V V V D D D otrol plae Figure 1: The Dryad system orgaizatio. The job maager (JM) cosults the ame server (NS) to discover the list of available computers. It maitais the job graph ad schedules ruig vertices (V) as computers become available usig the daemo (D) as a proxy. Vertices exchage data through files, TP pipes, or shared-memory chaels. The shaded bar idicates the vertices i the job that are curretly ruig. The cluster has a ame server (NS) that ca be used to eumerate all the available computers. The ame server also exposes the positio of each computer withi the etwork topology so that schedulig decisios ca take accout of locality. There is a simple daemo (D) ruig o each computer i the cluster that is resposible for creatig processes o behalf of the job maager. The first time a vertex (V) is executed o a computer its biary is set from the job maager to the daemo ad subsequetly it is executed from a cache. The daemo acts as a proxy so that the job maager ca commuicate with the remote vertices ad moitor the state of the computatio ad how much data has bee read ad writte o its chaels. It is straightforward to ru a ame server ad a set of daemos o a user workstatio to simulate a cluster ad thus ru a etire job locally while debuggig. simple task scheduler is used to queue batch jobs. We use a distributed storage system, ot described here, that shares with the Google File System [21] the property that large files ca be broke ito small pieces that are replicated ad distributed across the local disks of the cluster computers. Dryad also supports the use of NTFS for accessig files directly o local computers, which ca be coveiet for small clusters with low maagemet overhead. 2.1 example SQL query I this sectio, we describe a cocrete example of a Dryad applicatio that will be further developed throughout the remaider of the paper. The task we have chose is represetative of a ew class of esciece applicatios, where scietific ivestigatio is performed by processig large amouts of data available i digital form [24]. The database that we use is derived from the Sloa Digital Sky Survey (SDSS), available olie at We chose the most time cosumig query (Q18) from a published study based o this database [23]. The task is to idetify a gravitatioal les effect: it fids all the objects i the database that have eighborig objects withi 30 arc secods such that at least oe of the eighbors has a color similar to the primary object s color. expressed i SQL as: select distict p.objid from photoobjll p joi eighbors call this joi X o p.objid =.objid ad.objid <.eighborobjid ad p.mode = 1 joi photoobjll l call this joi Y o l.objid =.eighborobjid ad l.mode = 1 ad abs((p.u-p.g)-(l.u-l.g))<0.05 ad abs((p.g-p.r)-(l.g-l.r))<0.05 ad abs((p.r-p.i)-(l.r-l.i))<0.05 ad abs((p.i-p.z)-(l.i-l.z))<0.05 The query ca be There are two tables ivolved. The first, photoobjll has 354,254,163 records, oe for each idetified astroomical object, keyed by a uique idetifier objid. These records also iclude the object s color, as a magitude (logarithmic brightess) i five bads: u, g, r, i ad z. The secod table, eighbors has 2,803,165,372 records, oe for each object located withi 30 arc secods of aother object. The mode predicates i the query select oly primary objects. The < predicate elimiates duplicatio caused by the eighbors relatioship beig symmetric. The output of jois X ad Y are 932,820,679 ad 83,798 records respectively, ad the fial hash emits 83,050 records. The query uses oly a few colums from the tables (the complete photoobjll table cotais2kytesperrecord). Whe executed by SQLServer the query uses a idex o photoobjll keyed by objid with additioal colums for mode, u, g, r, i ad z, ad a idex o eighbors keyed by objid with a additioal eighborobjid colum. SQL- Server reads just these idexes, leavig the remaider of the tables data restig quietly o disk. (I our experimetal setup we i fact omitted uused colums from the table, to avoid trasportig the etire multi-terabyte database across

4 the coutry.) For the equivalet Dryad computatio we extracted these idexes ito two biary files, ugriz.bi ad eighbors.bi, each sorted i the same order as the idexes. The ugriz.bi file has 36-byte records, totalig 11.8 Gytes; eighbors.bi has 16-byte records, totalig 41.8 Gytes. The output of joi X totals 31.3 Gytes, the output of joi Y is 655 Kytes ad the fial output is 649 Kytes. We mapped the query to the Dryad computatio show i Figure 2. oth data files are partitioed ito approximately H equal parts (that we call U 1 through U ad N 1 through N )byobjid rages, ad we use custom ++ item Y objects for each data record Y i the graph. The vertices X i (for 1 i ) implemet U U joi X by takig their S 4 partitioed U i ad N i iputs S ad mergig them (keyed o objid ad filtered by the < expressio ad p.mode=1) to produce records cotaiig objid, eighborobjid, ad the color colums correspodig to objid. TheD vertices M D 4 M D distribute their output records to the M vertices, partitioig by eighborobjid usig X X a rage partitioig fuctio four times fier tha that used for the iput files. The umber U N U N four was chose so that four Figure 2: The commuicatio pipelies will execute i parallel o each computer, because graph for a SQL query. Details are i Sectio 2.1. our computers have four processors each. The M vertices perform a o-determiistic merge of their iputs ad the S vertices sort o eighborobjid usig a i-memory Quicksort. The output records from S 4i 3...S 4i (for i =1through) arefedito Y i where they are merged with aother read of U i to implemet joi Y. This joi is keyed o objid (from U) = eighborobjid (from S), ad is filtered by the remaider of the predicate, thus matchig the colors. The outputs of the Y vertices are merged ito a hash table at the H vertex to implemet the distict keyword i the query. Fially, a eumeratio of this hash table delivers the result. Later i the paper we iclude more details about the implemetatio of this Dryad program. 3. DESIING DYD GPH We have desiged a simple laguage that makes it easy to specify commoly-occurrig commuicatio idioms. It is curretly embedded for coveiece i ++ as a library usig a mixture of method calls ad operator overloadig. Graphs are costructed by combiig simpler subgraphs usig a small set of operatios show i Figure 3. ll of the operatios preserve the property that the resultig graph is acyclic. The basic object i the laguage is a graph: G = V G,E G,I G,O G. G cotais a sequece of vertices V G, a set of directed edges E G,adtwosetsI G V G ad O G V G that tag some of the vertices as beig iputs ad outputs respectively. No graph ca cotai a directed edge eterig a iput vertex i I G, or oe leavig a output vertex i O G, ad these tags are used below i compositio operatios. The iput ad output edges of a vertex are ordered so a edge coects specific ports o a pair of vertices, ad a give pair of vertices may be coected by multiple edges. 3.1 reatig ew vertices The Dryad libraries defie a ++ base class from which all vertex programs iherit. Each such program has a textual ame (which is uique withi a applicatio) ad a static factory that kows how to costruct it. graph vertex is created by callig the appropriate static program factory. y required vertex-specific parameters ca be set at this poit by callig methods o the program object. These parameters are the marshaled alog with the uique vertex ame to form a simple closure that ca be set to a remote process for executio. sigleto graph is geerated from a vertex v as G = (v),, {v}, {v}. graph ca be cloed ito a ew graph cotaiig k copies of its structure usig the ^ operator where = G^k is defied as: = VG 1 VG,E k G 1 EG, k IG 1 IG,O k G 1 OG. k Here G = VG,EG,I G,O G is a cloe of G cotaiig copies of all of G s vertices ad edges, deotes sequece cocateatio, ad each cloed vertex iherits the type ad parameters of its correspodig vertex i G. 3.2 ddig graph edges New edges are created by applyig a compositio operatio to two existig graphs. There is a family of compositios all sharig the same basic structure: = creates a ew graph: = V V,E E E ew,i,o where cotais the uio of all the vertices ad edges i ad, with s iputs ad s outputs. I additio, directed edges E ew are itroduced betwee vertices i O ad I. V ad V are eforced to be disjoit at ru time, ad sice ad are both acyclic, is also. ompositios differ i the set of edges E ew that they add ito the graph. We defie two stadard compositios: >= forms a poitwise compositio as show i Figure 3(c). If O I the a sigle outgoig edge is created from each of s outputs. The edges are assiged i roud-robi to s iputs. Some of the vertices i I may ed up with more tha oe icomig edge. If I > O, a sigle icomig edge is createdtoeachof s iputs, assiged i roud-robi from s outputs. >> forms the complete bipartite graph betwee O ad I ad is show i Figure 3(d). We allow the user to exted the laguage by implemetig ew compositio operatios.

5 a c d e D b S = ^ S = ^ f g h E = (S >= >= S) S >= S E (S >= S) S >> S D ( >= ) ( >= D) F (>=>=D>=) (>=F>=) Figure 3: The operators of the graph descriptio laguage. ircles are vertices ad arrows are graph edges. triagle at the bottom of a vertex idicates a iput ad oe at the top idicates a output. oxes (a) ad (b) demostrate cloig idividual vertices usig the ^ operator. The two stadard coectio operatios are poitwise compositio usig >= show i (c) ad complete bipartite compositio usig >> show i (d). (e) illustrates a merge usig. The secod lie of the figure shows more complex patters. The merge i (g) makes use of a subroutie from (f) ad demostrates a bypass operatio. For example, each vertex might output a summary of its iput to which aggregates them ad forwards the global statistics to every. Together the vertices ca the distribute the origial dataset (received from ) ito balaced partitios. asymmetric fork/joi is show i (h). 3.3 Mergig two graphs The fial operatio i the laguage is, whichmerges two graphs. = creates a ew graph: = V V,E E,I I,O O where, i cotrast to the compositio operatios, it is ot required that ad be disjoit. V V is the cocateatio of V ad V with duplicates removed from the secod sequece. I I meas the uio of ad s iputs, mius ay vertex that has a icomig edge followig the merge (ad similarly for the output case). If a vertex is cotaied i V V its iput ad output edges are cocateated so that the edges i E occur first (with lower port umbers). This simplificatio forbids certai graphs with crossover edges, however we have ot foud this restrictio to be a problem i practice. The ivariat that the merged graph be acyclic is eforced by a ru-time check. The merge operatio is extremely powerful ad makes it easy to costruct typical patters of commuicatio such as fork/joi ad bypass as show i Figures 3(f) (h). It also provides the mechaism for assemblig a graph by had from a collectio of vertices ad edges. So for example, a tree with four vertices a, b, c, add might be costructed as G = (a>=b) (b>=c) (b>=d). The graph builder program to costruct the query graph i Figure 2 is show i Figure hael types y default each chael is implemeted usig a temporary file: the producer writes to disk (typically o its local computer) ad the cosumer reads from that file. I may cases multiple vertices will fit withi the resources of a sigle computer so it makes sese to execute them all withi the same process. The graph laguage has a ecapsulatio commad that takes a graph G ad returs a ew vertex v G. Whe v G is ru as a vertex program, the job maager passes it a serializatio of G as a ivocatio parameter, ad it rus all the vertices of G simultaeously withi the same process, coected by edges implemeted usig shared-memory FIFOs. While it would always be possible to write a custom vertex program with the same sematics as G, allowig ecapsulatio makes it efficiet to combie simple library vertices at the graph layer rather tha re-implemetig their fuctioality as a ew vertex program. Sometimes it is desirable to place two vertices i the same process eve though they caot be collapsed ito a sigle graph vertex from the perspective of the scheduler. For example, i Figure 2 the performace ca be improved by placig the first D vertex i the same process as the first four M ad S vertices ad thus avoidig some disk I/O, however the S vertices caot be started util all of the D vertices complete. Whe creatig a set of graph edges, the user ca optioally specify the trasport protocol to be used. The available protocols are listed i Table 1. Vertices that are coected usig shared-memory chaels are executed withi a sigle process, though they are idividually started as their iputs become available ad idividually report completio. ecause the dataflow graph is acyclic, schedulig deadlock is impossible whe all chaels are either writte to temporary files or use shared-memory FIFOs hidde withi

6 Graphuilder XSet = modulex^n; Graphuilder DSet = moduled^n; Graphuilder MSet = modulem^(n*4); Graphuilder SSet = modules^(n*4); Graphuilder YSet = moduley^n; Graphuilder HSet = moduleh^1; Graphuilder XIputs = (ugriz1 >= XSet) (eighbor >= XSet); Graphuilder YIputs = ugriz2 >= YSet; Graphuilder XToY = XSet >= DSet >> MSet >= SSet; for (i = 0; i < N*4; ++i) { XToY = XToY (SSet.GetVertex(i) >= YSet.GetVertex(i/4)); } Graphuilder YToH = YSet >= HSet; Graphuilder HOutputs = HSet >= output; Graphuilder fial = XIputs YIputs XToY YToH HOutputs; Figure 4: example graph builder program. The commuicatio graph geerated by this program is show i Figure 2. hael protocol File (the default) TP pipe Shared-memory FIFO Discussio Preserved after vertex executio util the job completes. equires o disk accesses, but both ed-poit vertices must be scheduled to ru at the same time. Extremely low commuicatio cost, but ed-poit vertices must ru withi the same process. Table 1: hael types. ecapsulated acyclic subgraphs. However, allowig the developer to use pipes ad visible FIFOs ca cause deadlocks. y coected compoet of vertices commuicatig usig pipes or FIFOs must all be scheduled i processes that are cocurretly executig, but this becomes impossible if the system rus out of available computers i the cluster. This breaks the abstractio that the user eed ot kow the physical resources of the system whe writig the applicatio. We believe that it is a worthwhile trade-off, sice, as reported i our experimets i Sectio 6, the resultig performace gais ca be substatial. Note also that the system could always avoid deadlock by dowgradig a pipe chael to a temporary file, at the expese of itroducig a uexpected performace cliff. 3.5 Job iputs ad outputs Large iput files are typically partitioed ad distributed across the computers of the cluster. It is therefore atural to group a logical iput ito a graph G = V P,,,V P where V P is a sequece of virtual vertices correspodig to the partitios of the iput. Similarly o job completio a set of output partitios ca be logically cocateated to form a sigle amed distributed file. applicatio will geerally iterrogate its iput graphs to read the umber of partitios at ru time ad automatically geerate the appropriately replicated graph. 3.6 Job Stages Whe the graph is costructed every vertex is placed i a stage to simplify job maagemet. The stage topology ca be see as a skeleto or summary of the overall job, ad the stage topology of our example Skyserver query applicatio is show i Figure 5. Each distict type of vertex is grouped ito a separate stage. Most stages are coected usig the >= operator, while D is coected to M usig the >> operator. The skeleto is used as a guide for geeratig summaries whe moitorig a job, ad ca also be exploited by the automatic optimizatios described i Sectio WITING VETEX POGM The primary PIs for writig a Dryad vertex program are exposed through ++ base classes ad objects. It was a desig requiremet for Dryad vertices to be able to icorporate legacy source ad libraries, so we deliberately avoided adoptig ay Dryad-specific laguage or sadboxig restrictios. Most of the existig code that we aticipate itegratig ito vertices is writte i ++, but it is straightforward to implemet PI wrappers so that developers ca write vertices i other laguages, for example #. There is also sigificat value for some domais i beig able to ru umodified legacy executables i vertices, ad so we support this as explaied i Sectio 4.2 below. 4.1 Vertex executio Dryad icludes a rutime library that is resposible for settig up ad executig vertices as part of a distributed computatio. s outlied i Sectio 3.1 the rutime receives a closure from the job maager describig the vertex to be ru, ad UIs de- U U H Y S M D X N Figure 5: The stages of the Dryad computatio from Figure 2. Sectio 3.6 has details. scribig the iput ad output chaels to coect to it. There is curretly o type-checkig for chaels ad the vertex must be able to determie, either statically or from the ivocatio parameters, the types of the items that it is expected to read ad write

7 o each chael i order to supply the correct serializatio routies. The body of a vertex is ivoked via a stadard Mai method that icludes chael readers ad writers i its argumet list. The readers ad writers have a blockig iterface to read or write the ext item, which suffices for most simple applicatios. The vertex ca report status ad errors to the job maager, ad the progress of chaels is automatically moitored. May developers fid it coveiet to iherit from predefied vertex classes that hide the details of the uderlyig chaels ad vertices. We supply map ad reduce classes with similar iterfaces to those described i [16]. We have also writte a variety of others icludig a geeral-purpose distribute that takes a sigle iput stream ad writes o multiple outputs, ad jois that call a virtual method with every matchig record tuple. These classes are simply vertices like ay other, so it is straightforward to write ew oes to support developers workig i a particular domai. 4.2 Legacy executables We provide a library process wrapper vertex that forks a executable supplied as a ivocatio parameter. The wrapper vertex must work with arbitrary data types, so its items are simply fixed-size buffers that are passed umodified to the forked process usig amed pipes i the filesystem. This allows umodified pre-existig biaries to be ru as Dryad vertex programs. It is easy, for example, to ivoke perl scripts or grep at some vertices of a Dryad job. 4.3 Efficiet pipelied executio Most Dryad vertices cotai purely sequetial code. We also support a evet-based programmig style, usig a shared thread pool. The program ad chael iterfaces have asychroous forms, though usurprisigly it is harder to use the asychroous iterfaces tha it is to write sequetial code usig the sychroous iterfaces. I some cases it may be worth ivestig this effort, ad may of the stadard Dryad vertex classes, icludig o-determiistic merge, sort, ad geeric maps ad jois, are built usig the evet-based programmig style. The rutime automatically distiguishes betwee vertices which ca use a thread pool ad those that require a dedicated thread, ad therefore ecapsulated graphs which cotai hudreds of asychroous vertices are executed efficietly o a shared thread pool. The chael implemetatio schedules read, write, serializatio ad deserializatio tasks o a thread pool shared betwee all chaels i a process, ad a vertex ca cocurretly read or write o hudreds of chaels. The rutime tries to esure efficiet pipelied executio while still presetig the developer with the simple abstractio of readig ad writig a sigle record at a time. Extesive use is made of batchig [28] to try to esure that threads process hudreds or thousads of records at a time without touchig a referece cout or accessig a shared queue. The experimets i Sectio 6.2 substatiate our claims for the efficiecy of these abstractios: eve sigle-ode Dryad applicatios have throughput comparable to that of a commercial database system. 5. JO EXEUTION The scheduler iside the job maager keeps track of the state ad history of each vertex i the graph. t preset if the job maager s computer fails the job is termiated, though the vertex scheduler could employ checkpoitig or replicatio to avoid this. vertex may be executed multiple times over the legth of the job due to failures, ad more tha oe istace of a give vertex may be executig at ay give time. Each executio of the vertex has a versio umber ad a correspodig executio record that cotais the state of that executio ad the versios of the predecessor vertices from which its iputs are derived. Each executio ames its file-based output chaels uiquely usig its versio umber to avoid coflicts amog versios. If the etire job completes successfully the each vertex selects a successful executio ad reames its output files to their correct fial forms. Whe all of a vertex s iput chaels become ready a ew executio record is created for the vertex ad placed i a schedulig queue. disk-based chael is cosidered to be ready whe the etire file is preset. chael that is a TP pipe or shared-memory FIFO is ready whe the predecessor vertex has at least oe ruig executio record. vertex ad ay of its chaels may each specify a hardcostrait or a preferece listig the set of computers o which it would like to ru. The costraits are combied ad attached to the executio record whe it is added to the schedulig queue ad they allow the applicatio writer to require that a vertex be co-located with a large iput file, ad i geeral let the scheduler preferetially ru computatios close to their data. t preset the job maager performs greedy schedulig based o the assumptio that it is the oly job ruig o the cluster. Whe a executio record is paired with a available computer the remote daemo is istructed to ru the specified vertex, ad durig executio the job maager receives periodic status updates from the vertex. If every vertex evetually completes the the job is deemed to have completed successfully. If ay vertex is re-ru more tha a set umber of times the the etire job is failed. Files represetig temporary chaels are stored i directories maaged by the daemo ad cleaed up after the job completes, ad vertices are killed by the daemo if their paret job maager crashes. We have a simple graph visualizer suitable for small jobs that shows the state of each vertex ad the amout of data trasmitted alog each chael as the computatio progresses. web-based iterface shows regularly-updated summary statistics of a ruig job ad ca be used to moitor large computatios. The statistics iclude the umber of vertices that have completed or bee re-executed, the amout of data trasferred across chaels, ad the error codes reported by failed vertices. Liks are provided from the summary page that allow a developer to dowload logs or crash dumps for further debuggig, alog with a script that allows the vertex to be re-executed i isolatio o a local machie. 5.1 Fault tolerace policy Failures are to be expected durig the executio of ay distributed applicatio. Our default failure policy is suitable for the commo case that all vertex programs are determiistic. 1 ecause our commuicatio graph is acyclic, it is relatively straightforward to esure that every termiatig executio of a job with immutable iputs will compute the 1 The defiitio of job completio ad the treatmet of job outputs above also implicitly assume determiistic executio.

8 same result, regardless of the sequece of computer or disk failures over the course of the executio. Whe a vertex executio fails for ay reaso the job maager is iformed. If the vertex reported a error clealy the process forwards it via the daemo before exitig; if the process crashes the daemo otifies the job maager; ad if the daemo fails for ay reaso the job maager receives a heartbeat timeout. If the failure was due to a read error o a iput chael (which is be reported clealy) the default policy also marks the executio record that geerated that versio of the chael as failed ad termiates its process if it is ruig. This will cause the vertex that created the failed iput chael to be re-executed, ad will lead i the ed to the offedig chael beig re-created. Though a ewly-failed executio record may have o-failed successor records, errors eed ot be propagated forwards: sice vertices are determiistic two successors may safely compute usig the outputs of differet executio versios. Note however that uder this policy a etire coected compoet of vertices coected by pipes or shared-memory FIFOs will fail as a uit sice killig a ruig vertex will cause it to close its pipes, propagatig errors i both directios alog those edges. y vertex whose executio record is set to failed is immediately cosidered for re-executio. s Sectio 3.6 explais, each vertex belogs to a stage, ad each stage has a maager object that receives a callback o every state trasitio of a vertex executio i that stage, ad o a regular timer iterrupt. Withi this callback the stage maager holds a global lock o the job maager datastructures ad ca therefore implemet quite sophisticated behaviors. For example, the default stage maager icludes heuristics to detect vertices that are ruig slower tha their peers ad schedule duplicate executios. This prevets a sigle slow computer from delayig a etire job ad is similar to the backup task mechaism reported i [16]. I future we may allow o-determiistic vertices, which would make fault-tolerace more iterestig, ad so we have implemeted our policy via a extesible mechaism that allows o-stadard applicatios to customize their behavior. 5.2 u-time graph refiemet We have used the stage-maager callback mechaism to implemet ru-time optimizatio policies that allow us to scale to very large iput sets while coservig scarce etwork badwidth. Some of the large clusters we have access to have their etwork provisioed i a two-level hierarchy, with a dedicated mii-switch servig the computers i each rack, ad the per-rack switches coected via a sigle large core switch. Therefore where possible it is valuable to schedule vertices as much as possible to execute o the same computer or withi the same rack as their iput data. If a computatio is associative ad commutative, ad performs a data reductio, the it ca beefit from a aggregatio tree. s show i Figure 6, a logical graph coectig a set of iputs to a sigle dowstream vertex ca be refied by isertig a ew layer of iteral vertices, where each iteral vertex reads data from a subset of the iputs that are close i etwork topology, for example o the same computer or withi the same rack. If the iteral vertices perform a data reductio, the overall etwork traffic betwee racks will be reduced by this refiemet. typical applicatio would be a histogrammig operatio that takes as iput a set of partial histograms ad outputs their uio. The implemetatio i efore fter Figure 6: dyamic refiemet for aggregatio. The logical graph o the left coects every iput to the sigle output. The locatios ad sizes of the iputs are ot kow util ru time whe it is determied which computer each vertex is scheduled o. t this poit the iputs are grouped ito subsets that are close i etwork topology, ad a iteral vertex is iserted for each subset to do a local aggregatio, thus savig etwork badwidth. The iteral vertices are all of the same user-supplied type, i this case show as Z. I the diagram o the right, vertices with the same label ( + or * ) are executed close to each other i etwork topology. Dryad simply attaches a custom stage maager to the iput layer. s this aggregatio maager receives callback otificatios that upstream vertices have completed, it rewrites the graph with the appropriate refiemets. The operatio i Figure 6 ca be performed recursively to geerate as may layers of iteral vertices as required. We have also foud a partial aggregatio operatio to be very useful. This refiemet is show i Figure 7; havig grouped the iputs ito k sets, the optimizer replicates the dowstream vertex k times to allow all of the sets to be processed i parallel. Optioally, the partial refiemet ca be made to propagate through the graph so that a etire pipelie of vertices will be replicated k times (this behavior is ot show i the figure). example of the applicatio of this techique is described i the experimets i Sectio 6.3. Sice the aggregatio maager is otified o the completio of upstream vertices, it has access to the size of the data writte by those vertices as well as its locatio. typical groupig heuristic esures that a dowstream vertex has o more tha a set umber of iput chaels, or a set volume of iput data. special case of partial refiemet ca be performed at startup to size the iitial layer of a graph so that, for example, each vertex processes multiple iputs up to some threshold with the restrictio that all the iputs must lie o the same computer. ecause iput data ca be replicated o multiple computers i a cluster, the computer o which a graph vertex is scheduled is i geeral o-determiistic. Moreover the efore fter Figure 7: partial aggregatio refiemet. Followig a iput groupig as i Figure 6 ito k sets, the successor vertex is replicated k times to process all the sets i parallel. k k Z

9 amout of data writte i itermediate computatio stages is typically ot kow before a computatio begis. Therefore dyamic refiemet is ofte more efficiet tha attemptig a static groupig i advace. Dyamic refiemets of this sort emphasize the power of overlayig a physical graph with its skeleto. For may applicatios, there is a equivalece class of graphs with the same skeleto that compute the same result. Varyig the umber of vertices i each stage, or their coectivity, while preservig the graph topology at the stage level, is merely a (dyamic) performace optimizatio. 6. EXPEIMENTL EVLUTION Dryad has bee used for a wide variety of applicatios, icludig relatioal queries, large-scale matrix computatios, ad may text-processig tasks. For this paper we examied the effectiveess of the Dryad system i detail by ruig two sets of experimets. The first experimet takes the SQL query described i Sectio 2.1 ad implemets it as a Dryad applicatio. We compare the Dryad performace with that of a traditioal commercial SQL server, ad we aalyze the Dryad performace as the job is distributed across differet umbers of computers. The secod is a simple map-reduce style data-miig operatio, expressed as a Dryad program ad applied to 10.2 Tytes of data usig a cluster of aroud 1800 computers. The strategies we adopt to build our commuicatio flow graphs are familiar from the parallel database literature [18] ad iclude horizotally partitioig the datasets, exploitig pipelied parallelism withi processes ad applyig exchage operatios to commuicate partial results betwee the partitios. Noe of the applicatio-level code i ay of our experimets makes explicit use of cocurrecy primitives. 6.1 Hardware The SQL query experimets were ru o a cluster of 10 computers i our ow laboratory, ad the data-miig tests were ru o a cluster of aroud 1800 computers embedded i a data ceter. Our laboratory computers each had 2 dual-core Optero processors ruig at 2 GHz (i.e., 4 PUs total), 8 Gytes of DM (half attached to each processor chip), ad 4 disks. The disks were 400 Gyte Wester Digital WD40 00Y-01PL0 ST drives, coected through a Silico Image 3114 PI ST cotroller (66MHz, 32-bit). Network coectivity was by 1 Gbit/sec Etheret liks coectig ito a sigle o-blockig switch. Oe of our laboratory computers was dedicated to ruig SQLServer ad its data was stored i 4 separate 350 Gyte NTFS volumes, oe o each drive, with SQLServer cofigured to do its ow data stripig for the raw data ad for its temporary tables. ll the other laboratory computers were cofigured with a sigle 1.4 Tyte NTFS volume o each computer, created by software stripig across the 4 drives. The computers i the data ceter had a variety of cofiguratios, but were typically roughly comparable to our laboratory equipmet. ll the computers were ruig Widows Server 2003 Eterprise x64 editio SP SQL Query The query for this experimet is described i Sectio 2.1 ad uses the Dryad commuicatio graph show i Figure 2. SQLServer 2005 s executio pla for this query was very close to the Dryad computatio, except that it used a exteral hash joi for Y i place of the sort-merge we chose for Dryad. SQLServer takes slightly loger if it is forced by a query hit to use a sort-merge joi. For our experimets, we used two variats of the Dryad graph: i-memory ad two-pass. I both variats commuicatio from each M i through its correspodig S i to Y is by a shared-memory FIFO. This pulls four sorters ito the same process to execute i parallel o the four PUs i each computer. I the i-memory variat oly, commuicatio from each D i to its four correspodig M j vertices is also by a shared-memory FIFO ad the rest of the D i M k edges use TP pipes. ll other commuicatio is through NTFS temporary files i both variats. There is good spatial locality i the query, which improves as the umber of partitios () decreases: for = 40a average of 80% of the output of D i goes to its correspodig M i, icreasig to 88% for = 6. I either variat must be large eough that every sort executed by a vertex S i will fit ito the computer s 8 Gytes of DM (or else it will page). With the curret data, this threshold is at =6. Note that the o-determiistic merge i M radomly permutes its output depedig o the order of arrival of items o its iput chaels ad this techically violates the requiremet that all vertices be determiistic. This does ot cause problems for our fault-tolerace model because the sort S i udoes this permutatio, ad sice the edge from M i to S i is a shared-memory FIFO withi a sigle process the two vertices fail (if at all) i tadem ad the o-determiism ever escapes. The i-memory variat requires at least computers sice otherwise the S vertices will deadlock waitig for data from a X vertex. The two-pass variat will ru o ay umber of computers. Oe way to view this trade-off is that by addig the file bufferig i the two-pass variat we i effect coverted to usig a two-pass exteral sort. Note that the coversio from the i-memory to the two-pass program simply ivolves chagig two lies i the graph costructio code, with o modificatios to the vertex programs. We ra the two-pass variat usig = 40, varyig the umber of computers from 1 to 9. We ra the i-memory variat usig =6through =9,eachtimeo computers. s a baselie measuremet we ra the query o a reasoably well optimized SQLServer o oe computer. Table 2 shows the elapsed times i secods for each experimet. O repeated rus the times were cosistet to withi 3.4% of their averages except for the sigle-computer two-pass case, which was withi 9.4%. Figure 8 graphs the iverse of these times, ormalized to show the speed-up factor relative to the two-pass sigle-computer case. The results are pleasatly straightforward. The two-pass Dryad job works o all cluster sizes, with close to liear speed-up. The i-memory variat works as expected for omputers SQLServer 3780 Two-pass I-memory Table 2: Time i secods to process a SQL query usig differet umbers of computers. The SQLServer implemetatio caot be distributed across multiple computers ad the i-memory experimet ca oly be ru for 6 or more computers.

10 Speed-up Dryad I-Memory Dryad Two-Pass SQLServer 2005 k k Each S k Q S is: Each is: Number of omputers Figure 8: The speedup of the SQL query computatio is earliear i the umber of computers used. The baselie is relative to Dryad ruig o a sigle computer ad times are give i Table 2. = 6 ad up, agai with close to liear speed-up, ad approximately twice as fast as the two-pass variat. The SQLServer result matches our expectatios: our specialized Dryad program rus sigificatly, but ot outrageously, faster tha SQLServer s geeral-purpose query egie. We should ote of course that Dryad simply provides a executio egie while the database provides much more fuctioality, icludig loggig, trasactios, ad mutable relatios. 6.3 Data miig The data-miig experimet fits the patter of map the reduce. The purpose of ruig this experimet was to verify that Dryad works sufficietly well i these straightforward cases, ad that it works at large scales. The computatio i this experimet reads query logs gathered by the MSN Search service, extracts the query strigs, ad builds a histogram of query frequecy. The basic commuicatio graph is show i Figure 9. The log files are partitioed ad replicated across the computers disks. The P vertices each read their part of the log files usig library ewlie-delimited text items, ad parse them to extract the query strigs. Subsequet items are all library tuples cotaiig a query strig, a cout, ad a hash of the strig. Each D vertex distributes to k outputs based o the query strig hash; S performs a i-memory sort. accumulates total couts for each query ad MS performs a streamig merge-sort. S ad MS come from a vertex library ad take a compariso fuctio as a parameter; i this example they sort based o the query hash. We have ecapsulated the simple vertices ito subgraphs deoted by diamods i order to reduce the total umber of vertices i the job (ad hece the overhead associated with process start-up) ad the volume of temporary data writte to disk. The graph show i Figure 9 does ot scale well to very large datasets. It is wasteful to execute a separate Q vertex for every iput partitio. Each partitio is oly aroud 100 Mytes, ad the P vertex performs a substatial data reductio, so the amout of data which eeds to be sorted by the S vertices is very much less tha the total M o a computer. lso, each subgraph has iputs, ad whe grows to hudreds of thousads of partitios, it becomes uwieldy to read i parallel from so may chaels. Q Q D P MS Figure 9: The commuicatio graph to compute a query histogram. Details are i Sectio 6.3. This figure shows the first cut aive ecapsulated versio that does t scale well. fter tryig a umber of differet ecapsulatio ad dyamic refiemet schemes we arrived at the commuicatio graphs show i Figure 10 for our experimet. Each subgraph i the first phase ow has multiple iputs, grouped automatically usig the refiemet i Figure 7 to esure they all lie o the same computer. The iputs are set to the parser P through a o-determiistic merge vertex M. The distributio (vertex D) has bee take out of the first phase to allow aother layer of groupig ad aggregatio (agai usig the refiemet i Figure 7) before the explosio i the umber of output chaels. We ra this experimet o 10,160,519,065,748 ytes of iput data i a cluster of aroud 1800 computers embedded i a data ceter. The iput was divided ito 99,713 partitios replicated across the computers, ad we specified that the applicatio should use 450 subgraphs. The first phase grouped the iputs ito at most 1 Gytes at a time, all lyig o the same computer, resultig i 10,405 Q subgraphs that wrote a total of 153,703,445,725 ytes. The outputs from the Q subgraphs were the grouped ito sets of at most 600 Mytes o the same local switch resultig i 217 T subgraphs. Each T was coected to every subgraph, ad they wrote a total of 118,364,131,628 ytes. The total output from the subgraphs was 33,375,616,713 ytes, ad the ed-to-ed computatio took 11 miutes ad 30 secods. Though this experimet oly uses 11,072 vertices, itermediate experimets with other graph topologies cofirmed that Dryad ca successfully execute jobs cotaiig hudreds of thousads of vertices. We would like to emphasize several poits about the optimizatio process we used to arrive at the graphs i Figure 10: 1. t o poit durig the optimizatio did we have to modify ay of the code ruig iside the vertices: we were simply maipulatig the graph of the job s commuicatio flow, chagig tes of lies of code. 2. This commuicatio graph is well suited to ay mapreduce computatio with similar characteristics: i.e. that the map phase (our P vertex) performs substa-

11 a Each Q' is: Each Each T is: b G T S is: D T 217 T 118 G Q' P Q' 10,405 Q' 154 G 99,713 M MS MS 99, T Figure 10: earragig the vertices gives better scalig performace compared with Figure 9. The user supplies graph (a) specifyig that 450 buckets should be used whe distributig the output, ad that each Q vertex may receive up to 1G of iput while each T may receive up to 600M. The umber of Q ad T vertices is determied at ru time based o the umber of partitios i the iput ad the etwork locatios ad output sizes of precedig vertices i the graph, ad the refied graph (b) is executed by the system. Details are i Sectio 6.3. tial data reductio ad the reduce phase (our vertex) performs some additioal relatively mior data reductio. differet topology might give better performace for a map-reduce task with differet behavior; for example if the reduce phase performed substatial data reductio a dyamic merge tree as described i Figure 6 might be more suitable. 3. Whe scalig up aother order of magitude or two, we might chage the topology agai, e.g. by addig more layers of aggregatio betwee the T ad stages. Such re-factorig is easy to do. 4. Gettig good performace for large-scale data-miig computatios is ot trivial. May ovel features of the Dryad system, icludig subgraph ecapsulatio ad dyamic refiemet, were used. These made it simple to experimet with differet optimizatio schemes that would have bee difficult or impossible to implemet usig a simpler but less powerful system. 7. UILDING ON DYD s explaied i the itroductio, we have targeted Dryad at developers who are experieced at usig high-level compiled programmig laguages. I some domais there may be great value i makig commo large-scale data processig tasks easier to perform, sice this allows o-developers to directly query the data store [33]. We desiged Dryad to be usable as a platform o which to develop such more restricted but simpler programmig iterfaces, ad two other groups withi Microsoft have already prototyped systems to address particular applicatio domais. 7.1 The Nebula scriptig laguage Oe team has layered a scriptig iterface o top of Dryad. It allows a user to specify a computatio as a series of stages (correspodig to the Dryad stages described i Sectio 3.6), each takig iputs from oe or more previous stages or the file system. Nebula trasforms Dryad ito a geeralizatio of the Uix pipig mechaism ad it allows programmers to write giat acyclic graphs spaig may computers. Ofte a Nebula script oly refers to existig executables such as perl or grep, allowig a user to write a etire complex distributed applicatio without compilig ay code. The Nebula layer o top of Dryad, together with some perl wrapper fuctios, has proved to be very successful for largescale text processig, with a low barrier to etry for users. Scripts typically ru o thousads of computers ad cotai 5 15 stages icludig multiple projectios, aggregatios ad jois, ofte combiig the iformatio from multiple iput sets i sophisticated ways. Nebula hides most of the details of the Dryad program from the developer. Stages are coected to precedig stages usig operators that implicitly determie the umber of vertices required. For example, a Filter operatio creates oe ew vertex for every vertex i its iput list, ad coects them poitwise to form a pipelie. ggregate operatio ca be used to perform exchages ad merges. The implemetatio of the Nebula operators makes use of dyamic optimizatios like those described i Sectio 5.2 however the operator abstractio allows users to remai uaware of the details of these optimizatios. ll Nebula vertices execute the process wrapper described i Sectio 4.2, ad the vertices i a give stage all ru the same executable ad commad-lie, specified usig the script. The Nebula system defies covetios for passig the ames of the iput ad output pipes to the vertex executable commad-lie. There is a very popular frot-ed to Nebula that lets the user describe a job usig a combiatio of: fragmets of perl that parse lies of text from differet sources ito structured records; ad a relatioal query over those structured records expressed i a subset of SQL that icludes select, project ad joi. This job descriptio is coverted ito a Nebula script ad executed usig Dryad. The perl pars-

12 ig fragmets for commo iput sources are all i libraries, so may jobs usig this frot-ed are completely described usig a few lies of SQL. 7.2 Itegratio with SSIS SQL Server Itegratio Services (SSIS) [6] supports workflow-based applicatio programmig o a sigle istace of SQLServer. The deter team i MSN has developed a system that embeds local SSIS computatios i a larger, distributed graph with commuicatio, schedulig ad fault tolerace provided by Dryad. The SSIS iput graph ca be built ad tested o a sigle computer usig the full rage of SQL developer tools. These iclude a graphical editor for costructig the job topology, ad a itegrated debugger. Whe the graph is ready to ru o a larger cluster the system automatically partitios it usig heuristics ad builds a Dryad graph that is the executed i a distributed fashio. Each Dryad vertex is a istace of SQLServer ruig a SSIS subgraph of the complete job. This system is curretly deployed i a live productio system as part of oe of deter s log processig pipelies. 7.3 Distributed SQL queries Oe obvious additioal directio would be to adapt a query optimizer for SQL or LINQ [4] queries to compile plas directly ito a Dryad flow graph usig appropriate parameterized vertices for the relatioal operatios. Sice our fault-tolerace model oly requires that iputs be immutable over the duratio of the query, ay uderlyig storage system that offers lightweight sapshots would suffice to allow us to deliver cosistet query results. We ited to pursue this as future work. 8. ELTED WOK Dryad is related to a broad class of prior literature, ragig from custom hardware to parallel databases, but we believe that the esemble of trade-offs we have chose for its desig, ad some of the techologies we have deployed, make it a uique system. Hardware Several hardware systems use stream programmig models similar to Dryad, icludig Itel IXP [2], Imagie [26], ad SOE [15]. Programmers or compilers represet the distributed computatio as a collectio of idepedet subrouties residig withi a high-level graph. lick similar approach is adopted by the lick modular router [27]. The techique used to ecapsulate multiple Dryad vertices i a sigle large vertex, described i sectio 3.4, is similar to the method used by lick to group the elemets (equivalet of Dryad vertices) i a sigle process. However, lick is always siglethreaded, while Dryad ecapsulated vertices are desiged to take advatage of multiple PU cores that may be available. Dataflow The overall structure of a Dryad applicatio is closely related to large-grai dataflow techiques used i e.g. LGDF2 [19], ODE2 [31] ad P-IO [29]. These systems were ot desiged to scale to large clusters of commodity computers, however, ad do ot tolerate machie failures or easily support programmig very large graphs. Paralex [9] has may similarities to Dryad, but i order to provide automatic faulttolerace sacrifices the vertex programmig model, allowig oly pure-fuctioal programs. Parallel databases Dryad is heavily idebted to the traditioal parallel database field [18]: e.g., Vulca [22], Gamma [17], Db [11], D2 parallel editio [12], ad may others. May techiques for exploitig parallelism, icludig data partitioig; pipelied ad partitioed parallelism; ad hash-based distributio are directly derived from this work. We ca map the whole relatioal algebra o top of Dryad, however Dryad is ot a database egie: it does ot iclude a query plaer or optimizer; the system has o cocept of data schemas or idices; ad Dryad does ot support trasactios or logs. Dryad gives the programmer more cotrol tha SQL via ++ programs i vertices ad allows programmers to specify ecapsulatio, trasport mechaisms for edges, ad callbacks for vertex stages. Moreover, the graph builder laguage allows Dryad to express irregular computatios. otiuous Query systems There are some superficial similarities betwee Q systems (e.g. [25, 10, 34]) ad Dryad, such as some operators ad the topologies of the computatio etworks. However, Dryad is a batch computatio system, ot desiged to support real-time operatio which is crucial for Q systems sice may Q widow operators deped o real-time behavior. Moreover, may datamiig Dryad computatios require extremely high throughput (tes of millios of records per secod per ode), which is much greater tha that typically see i the Q literature. Explicitly parallel laguages like Parallel Haskell [38], ilk [14] or NESL [13] have the same emphasis as Dryad o usig the user s kowledge of the problem to drive the parallelizatio. y relyig o ++, Dryad should have a faster learig curve tha that for fuctioal laguages, while also beig able to leverage commercial optimizig compilers. There is some appeal i these alterative approaches, which preset the user with a uiform programmig abstractio rather tha our two-level hierarchy. However, we believe that for data-parallel applicatios that are aturally writte usig coarse-grai commuicatio patters, we gai substatial beefit by lettig the programmer cooperate with the system to decide o the graularity of distributio. Grid computig [1] ad projects such as odor [37] are clearly related to Dryad, i that they leverage the resources of may workstatios usig batch processig. However, Dryad does ot attempt to provide support for wide-area operatio, trasparet remote I/O, or multiple admiistrative domais. Dryad is optimized for the case of a very high-throughput LN, whereas i odor badwidth maagemet is essetially hadled by the user job. Google Mapeduce The Dryad system was primarily desiged to support large-scale data-miig over clusters of thousads of computers. s a result, of the recet related systems it shares the most similarities with Google s Mapeduce [16, 33] which addresses a similar

13 problem domai. The fudametal differece betwee the two systems is that a Dryad applicatio may specify a arbitrary commuicatio DG rather tha requirig a sequece of map/distribute/sort/reduce operatios. I particular, graph vertices may cosume multiple iputs, ad geerate multiple outputs, of differet types. For may applicatios this simplifies the mappig from algorithm to implemetatio, lets us build o a greater library of basic subrouties, ad, together with the ability to exploit TP pipes ad shared-memory for data edges, ca brig substatial performace gais. t the same time, our implemetatio is geeral eough to support all the features described i the Mapeduce paper. Scietific computig Dryad is also related to high-performace computig platforms like MPI [5], PVM [35], or computig o GPUs [36]. However, Dryad focuses o a model with o shared-memory betwee vertices, ad uses o sychroizatio primitives. NOW The origial impetus for employig clusters of workstatios with a shared-othig memory model came from projects like erkeley NOW [7, 8], or T [20]. Dryad borrows some ideas from these systems, such as fault-tolerace through re-executio ad cetralized resource schedulig, but our system additioally provides a uified, simple high-level programmig laguage layer. Log datamiig ddamark, ow reamed SeSage [32] has successfully commercialized software for log datamiig o clusters of workstatios. Dryad is desiged to scale to much larger implemetatios, up to thousads of computers. 9. DISUSSION WiththebasicDryadifrastructureiplace,weseea umber of iterestig future research directios. Oe fudametal questio is the applicability of the programmig model to geeral large-scale computatios beyod text processig ad relatioal queries. Of course, ot all programs are easily expressed usig a coarse-grai data-parallel commuicatio graph, but we are ow well positioed to idetify ad evaluate Dryad s suitability for those that are. Sectio 3 assumes a applicatio developer will first costruct a static job graph, the pass it to the rutime to be executed. Sectio 6.3 shows the beefits of allowig applicatios to perform automatic dyamic refiemet of the graph. We pla to exted this idea ad also itroduce iterfaces to simplify dyamic modificatios of the graph accordig to applicatio-level cotrol flow decisios. We are particularly iterested i data-depedet optimizatios that might pick etirely differet strategies (for example choosig betwee i-memory ad exteral sorts) as the job progresses ad the volume of data at itermediate stages becomes kow. May of these strategies are already described i the parallel database literature, but Dryad gives us a flexible testbed for explorig them at very large scale. t the same time, we must esure that ay ew optimizatios ca be targeted by higher-level laguages o top of Dryad, ad we pla to implemet comprehesive support for relatioal queries as suggested i Sectio 7.3. The job maager described i this paper assumes it has exclusive cotrol of all of the computers i the cluster, ad this makes it difficult to efficietly ru more tha oe job at a time. We have completed prelimiary experimets with a ew implemetatio that allows multiple jobs to cooperate whe executig cocurretly. We have foud that this makes much more efficiet use of the resources of a large cluster, but are still explorig variats of the basic desig. full aalysis of our experimets will be preseted i a future publicatio. There are may opportuities for improved performace moitorig ad debuggig. Each ru of a large Dryad job geerates statistics o the resource usage of thousads of executios of the same program o differet iput data. These statistics are already used to detect ad re-execute slowruig outlier vertices. We pla to keep ad aalyze the statistics from a large umber of jobs to look for patters that ca be used to predict the resource eeds of vertices before they are executed. y feedig these predictios to our scheduler, we may be able to cotiue to make more efficiet use of a shared cluster. Much of the simplicity of the Dryad scheduler ad faulttolerace model come from the assumptio that vertices are determiistic. If a applicatio cotais o-determiistic vertices the we might i future aim for the guaratee that every termiatig executio produces a output that some failure-free executio could have geerated. I the geeral case where vertices ca produce side-effects this might be very hard to esure automatically. The Dryad system implemets a geeral-purpose dataparallel executio egie. We have demostrated excellet scalig behavior o small clusters, with absolute performace superior to a commercial database system for a hadcoded read-oly query. O a larger cluster we have executed jobs cotaiig hudreds of thousads of vertices, processig may terabytes of iput data i miutes, ad we ca automatically adapt the computatio to exploit etwork locality. We let developers easily create large-scale distributed applicatios without requirig them to master ay cocurrecy techiques beyod beig able to draw a graph of the data depedecies of their algorithms. We sacrifice some architectural simplicity compared with the Mapeduce system desig, but i exchage we release developers from the burde of expressig their code as a strict sequece of map, sort ad reduce steps. We also allow the programmer the freedom to specify the commuicatio trasport which, for suitable tasks, delivers substatial performace gais. ckowledgemets We would like to thak all the members of the osmos team i Widows Live Search for their support ad collaboratio, ad particularly Sam McKelvie for may helpful desig discussios. Thaks to Jim Gray ad our aoymous reviewers for suggestios o improvig the presetatio of the paper. 10. EFEENES [1] Global grid forum. [2] Itel IXP2XXX product lie of etwork processors. pfamily/ixp2xxx.htm. [3] Itel platform architecture/platform2015/.

14 [4] The LINQ project. etframework/future/liq/. [5] Ope MPI. [6] SQL Server Itegratio Services. com/sql/techologies/itegratio/default.mspx. [7] Thomas E. derso, David E. uller, David. Patterso, ad NOW Team. case for etworks of workstatios: NOW. IEEE Micro, pages 54 64, February [8] emzi H. rpaci-dusseau. u-time adaptatio i iver. Trasactios o omputer Systems (TOS), 21(1):36 86, [9] Özalp abaoğlu, Lorezo lvisi, lessadro moroso, ezo Davoli, ad Luigi lberto Giachii. Paralex: a eviromet for parallel programmig i distributed systems. pages , New York, NY, US, M Press. [10] Magdalea alaziska, Hari alakrisha, Samuel Madde, ad Mike Stoebraker. Fault-Tolerace i the orealis Distributed Stream Processig System. I M SIGMOD, altimore, MD, Jue [11] Tom arclay, obert ares, Jim Gray, ad Prakash Sudaresa. Loadig databases usig dataflow parallelism. SIGMOD ec., 23(4):72 83, [12] haitaya aru ad Gilles Fecteau. overview of D2 parallel editio. I SIGMOD 95: Proceedigs of the 1995 M SIGMOD iteratioal coferece o Maagemet of data, pages , New York, NY, US, M Press. [13] Guy E. lelloch. Programmig parallel algorithms. ommuicatios of the M (M), 39(3):85 97, [14] obert D. lumofe, hristopher F. Joerg, radley Kuszmaul, harles E. Leiserso, Keith H. adall, ad Yuli Zhou. ilk: efficiet multithreaded rutime system. I M SIGPLN Symposium o Priciples ad Practice of Parallel Programmig (PPoPP), pages , Sata arbara, aliforia, July [15] Eylo aspi, Michael hu, ady Huag, Joseph Yeh, Yury Markovskiy, dré DeHo, ad Joh Wawrzyek. Stream computatios orgaized for recofigurable executio (SOE): Itroductio ad tutorial. I FPL, Lecture Notes i omputer Sciece. Spriger Verlag, [16] Jeff Dea ad Sajay Ghemawat. Mapeduce: Simplified data processig o large clusters. I Proceedigs of the 6th Symposium o Operatig Systems Desig ad Implemetatio (OSDI), pages , December [17] D. DeWitt, S. Ghadeharizadeh, D. Scheider, H. Hsiao,. ricker, ad. asmusse. The GMM database machie project. IEEE Trasactios o Kowledge ad Data Egieerig, 2(1), [18] David DeWitt ad Jim Gray. Parallel database systems: The future of high performace database processig. ommuicatios of the M, 36(6), [19] D.. DiNucci ad. G. abb II. Desig ad implemetatio of parallel programs with LGDF2. I Digest of Papers from ompco 89, pages , [20] rmado Fox, Steve D. Gribble, Yati hawathe, Eric. rewer, ad Paul Gauthier. luster-based scalable etwork services. I M Symposium o Operatig Systems Priciples (SOSP), pages 78 91, New York, NY, US, M Press. [21] Sajay Ghemawat, Howard Gobioff, ad Shu-Tak Leug. The Google file system. I SOSP 03: Proceedigs of the ieteeth M symposium o Operatig systems priciples, pages 29 43, New York, NY, US, M Press. [22] Goetz Graefe. Ecapsulatio of parallelism i the Volcao query processig system. I SIGMOD 90: Proceedigs of the 1990 M SIGMOD iteratioal coferece o Maagemet of data, pages , New York, NY, US, M Press. [23] J. Gray,.S. Szalay,. Thakar, P. Kuszt,. Stoughto, D. Slutz, ad J. Vadeberg. Data miig the SDSS SkyServer database. I Distributed Data ad Structures 4: ecords of the 4th Iteratioal Meetig, pages , Paris, Frace, March arleto Scietific. also as MS-T [24] Jim Gray ad lex Szalay. Sciece i a expoetial world. Nature, 440(23), March [25] J.-H. Hwag, M. alaziska,. asi, U. Çetitemel, M. Stoebraker, ad S. Zdoik. compariso of stream-orieted high-availability algorithms. Techical eport T-03-17, omputer Sciece Departmet, row Uiversity, September [26] Ujval Kapasi, William J. Dally, Scott ixer, Joh D. Owes, ad rucek Khailay. The Imagie stream processor. I Proceedigs 2002 IEEE Iteratioal oferece o omputer Desig, pages , September [27] Eddie Kohler, obert Morris, ejie he, Joh Jaotti, ad M. Fras Kaashoek. The lick modular router. M Trasactios o omputer Systems, 18(3): , [28] James Larus ad Michael Parkes. Usig cohort schedulig to ehace server performace. I Useix ual Techical oferece, Jue [29] Orlado Loques, Julius Leite, ad Erique Viicio arrera E. P-IO: modular parallel-programmig eviromet. IEEE ocurrecy, 6(1):47 57, [30] William Mark, Steve Glaville, Kurt keley, ad Mark J. Kilgard. g: system for programmig graphics hardware i a -like laguage. M Trasactios o Graphics, 22(3): , [31] P. Newto ad J.. rowe. The ODE 2.0 graphical parallel programmig laguage. pages , Washigto, D.., Uited States, July [32] Ke Phillips. SeSage ES. S Magazie, March [33] ob Pike, Sea Dorward, obert Griesemer, ad Sea Quila. Iterpretig the data: Parallel aalysis with Sawzall. Scietific Programmig, 13(4): , [34] Mehul. Shah, Joseph M. Hellerstei, ad Eric rewer. Highly available, fault-tolerat, parallel dataflows. I SIGMOD 04: Proceedigs of the 2004 M SIGMOD iteratioal coferece o Maagemet of data, pages , New York, NY, US, M Press. [35] V. S. Suderam. PVM: a framework for parallel distributed computig. ocurrecy: Pract. Exper., 2(4): , [36] David Tarditi, Sidd Puri, ad Jose Oglesby. ccelerator: usig data-parallelism to program GPUs for geeral-purpose uses. I Iteratioal oferece o rchitectural Support for Programmig Laguages ad Operatig Systems (SPLOS), osto,m, October also as MS-T [37] Douglas Thai, Todd Taebaum, ad Miro Livy. Distributed computig i practice: The odor experiece. ocurrecy ad omputatio: Practice ad Experiece, 17(2-4): , [38] P.W. Trider, H-W. Loidl, ad.f. Poito. Parallel ad distributed Haskells. Joural of Fuctioal Programmig, 12(4&5): , 2002.