Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems

Fangfei Chen, Murali Kodialam, T. V. Lakshman
Department of Computer Science and Engineering, The Penn State University; Bell Laboratories, Alcatel-Lucent

Abstract—MapReduce has emerged as an important paradigm for processing data in large data centers. MapReduce is a three-phase algorithm comprising Map, Shuffle and Reduce phases. Due to its widespread deployment, there have been several recent papers outlining practical schemes to improve the performance of MapReduce systems. All these efforts focus on one of the three phases to obtain performance improvement. In this paper, we consider the problem of jointly scheduling all three phases of the MapReduce process with a view of understanding the theoretical complexity of the joint scheduling and working towards practical heuristics for scheduling the tasks. We give guaranteed approximation algorithms and outline several heuristics to solve the joint scheduling problem.

I. INTRODUCTION

MapReduce [1] has emerged as a significant data processing paradigm in data centers. MapReduce is used in several applications including searching the web, URL frequency estimation and indexing. In a typical application, the data on which MapReduce operates is partitioned into chunks and assigned to different processors. MapReduce is essentially a three-step process: 1) In the Map phase, several parallel tasks are created to operate on the relevant data chunks and generate intermediate results. These results are stored in the form of key-value pairs. 2) In the Shuffle phase, the partial computation results of the map phase are transferred to the processors performing the reduce operation. 3) During the Reduce phase, each processor that executes a reduce task aggregates the tuples generated in the map phase.

A frequent requirement for routine data center requests is fast response time. In a large data center where several MapReduce jobs run concurrently, a centralized master coordinates the assignment and scheduling of MapReduce tasks across the data center. The assignment problem is to decide which processor will execute a map or reduce task, and the scheduling problem is to decide in what order the tasks will be executed on each processor. In this paper, we assume that tasks are already assigned to the processors, and the focus of the paper is to determine the schedule that minimizes the mean response time over all jobs.

MapReduce throws up several significant scheduling challenges due to the dependencies between the different phases of a job's MapReduce operations, as well as dependencies between different processors that are a consequence of splitting a job across multiple processors. Though dependencies between tasks have been studied in the job shop scheduling literature, MapReduce systems present several new challenges. For example, in traditional job shops each job comprises multiple tasks but at most one task can be executed at any given time, unlike MapReduce scheduling where tasks may be executed in parallel. The shuffle operation in MapReduce requires simultaneous possession of multiple resources (the egress of the transmitting processor and the ingress of the receiving processor), and this phase resembles data migration problems more than job-shop problems. If the shuffle operation is not a bottleneck, there is no advantage if one map task belonging to a particular job is completed early while another task belonging to the same job is delayed. This is the key idea in the MapReduce formulation in Chang et al. [2].
Recent studies by Chowdhury et al. [3] have shown that the shuffle operation accounts for as much as a third of the completion time of a MapReduce job. Therefore the shuffle operation has to be taken into consideration explicitly in the scheduling problem. Including the shuffle operation introduces new tradeoffs. If all the map tasks belonging to a job complete almost simultaneously, then all the shuffle operations for this job can start at the same time, which creates a bottleneck at the shuffle operation. Therefore, once shuffle is included in the system, it may be better to spread out the completion of the map tasks in order to smoothly schedule the shuffle operation.

The widespread use of MapReduce has led to several recent papers [4], [5], [6] outlining practical approaches to improving the performance of MapReduce systems. Though these approaches are shown to improve the performance of specific parts of MapReduce, they do not address the end-to-end performance of MapReduce. Unlike the formulation in Chang et al. [2], we explicitly model the precedence between the map and reduce operations in this paper and outline a constant factor approximation algorithm. Moreover, [2] ignores the shuffle phase. The approach in this paper is to develop a theoretical framework to model the end-to-end performance of MapReduce systems and, in particular, to model the interactions between the three phases of MapReduce. To our knowledge, this is the first paper that addresses map-shuffle-reduce from a theoretical viewpoint. This paper makes the following contributions: 1) We propose a formulation that explicitly takes into consideration the dependencies between map and reduce operations and outline a constant factor guaranteed approximation algorithm (MARES) for the problem.

2) The approximation algorithm involves solving a linear programming problem with an exponential number of constraints, and we outline a column-generation based approach to solve the linear programming relaxation. 3) We develop a constant factor approximation algorithm for the map-shuffle-reduce problem (MASHERS). 4) We develop heuristics for both the map-reduce problem (H-MARES) and the map-shuffle-reduce problem (D-MASHERS, H-MASHERS) based on the constant factor approximation algorithms and study their performance. We show via experiments that these heuristics perform extremely well in practice.

The rest of the paper is organized as follows. Section II formally defines the scheduling problem with precedence constraints, provides an LP-based lower bound and shows how to solve the LP. Section III presents the approximation algorithm for the two-phase map-reduce problem. Section IV introduces the shuffle phase into the problem and presents the approximation algorithm for the map-shuffle-reduce problem. Section V evaluates the heuristics by simulation. Section VI discusses related work and finally Section VII concludes the paper.

II. SCHEDULING DEPENDENT TASKS: A COLUMN GENERATION APPROACH

In this section we consider the problem of scheduling a set of dependent tasks on multiple processors in order to minimize the weighted sum of the completion times of the tasks. We develop a linear programming based approach to get a lower bound on the optimal solution. The solution to this linear programming problem exploits the structure of the single-processor scheduling polyhedron. The linear programming solution is infeasible for the scheduling problem, but we exploit the structure of the dependencies in MapReduce to derive approximation algorithms for both the two-phase map-reduce problem and the three-phase map-shuffle-reduce problem.

A. Problem Definition

We are given a set J of jobs and a set P of processors. Job j in J has a set of tasks T_j. Let T = U_j T_j represent the set of all tasks. Each task u is assigned to a processor p(u) in P and has processing time t_u >= 0. The set of tasks assigned to processor p is denoted by J_p, and n_p denotes the number of tasks assigned to processor p. We assume that a processor can execute at most one task at any given time and all tasks are completed in a non-preemptive manner. Let G = (V, L) denote the precedence graph among tasks, where the set of nodes is the set of tasks and (u, v) in L indicates that task v can only start after the completion of task u. Let C_u denote the completion time of task u. Associated with task u is a weight w_u that indicates its relative importance, and the overall scheduling objective is to minimize the weighted sum of the task completion times, sum_u w_u C_u. The problem of scheduling tasks with precedence constraints in order to minimize the total weighted completion time is NP-hard [7] even on a single processor. Our problem is a significant generalization of this problem and is therefore NP-hard.

B. LP-Based Lower Bound

Consider a processor p and the set of tasks J_p assigned to this processor. The scheduling problem on processor p is to decide the order in which the tasks in J_p are processed. It can be shown (see [13] and the references therein) that the completion times C_u of the tasks u in J_p lie in the following polyhedron, denoted by P_p:

$$\sum_{u \in S} t_u C_u \;\ge\; f(S, p) \qquad \forall S \subseteq J_p,\ \forall p \qquad (1)$$

where

$$f(S, p) \;=\; \frac{1}{2}\left[\sum_{v \in S} t_v^2 + \Big(\sum_{v \in S} t_v\Big)^2\right].$$

The intuition behind these inequalities is simple. Consider a processor on which two tasks with processing times t_1 and t_2 have to be processed. If the tasks are processed in the order 1, 2, then the completion times of the two tasks will be C_1 = t_1 and C_2 = t_1 + t_2. If the order is reversed, then C_1 = t_1 + t_2 and C_2 = t_2. Note that in either case

$$t_1 C_1 + t_2 C_2 \;=\; t_1^2 + t_2^2 + t_1 t_2 \;=\; \frac{1}{2}\left[t_1^2 + t_2^2 + (t_1 + t_2)^2\right].$$

This argument can be extended to any subset of tasks.
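To make the two-task argument concrete, the short Python check below (an illustrative addition, not part of the original paper; the processing times are arbitrary example values) enumerates every processing order of a small task set on one processor and verifies that the resulting completion-time vector satisfies inequality (1) for every subset S.

from itertools import permutations, combinations

# Processing times of the tasks assigned to one processor (example values).
t = [3.0, 1.0, 4.0, 2.0]

def satisfies_inequality_1(C, t):
    """Check sum_{u in S} t_u*C_u >= 1/2*[ (sum_S t_u)^2 + sum_S t_u^2 ] for all S."""
    n = len(t)
    for r in range(1, n + 1):
        for S in combinations(range(n), r):
            lhs = sum(t[u] * C[u] for u in S)
            total = sum(t[u] for u in S)
            rhs = 0.5 * (total ** 2 + sum(t[u] ** 2 for u in S))
            if lhs < rhs - 1e-9:
                return False
    return True

for order in permutations(range(len(t))):
    C = [0.0] * len(t)
    elapsed = 0.0
    for u in order:            # non-preemptive, one task at a time
        elapsed += t[u]
        C[u] = elapsed
    assert satisfies_inequality_1(C, t), order

print("every processing order satisfies inequality (1)")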
Note that P_p represents only a set of necessary conditions that the completion times C_u have to satisfy; the polyhedron contains many vectors that are not achievable by any scheduling policy. It can be shown that P_p has n_p! extreme points, each corresponding to a permutation of the tasks in J_p. A given permutation of the tasks in J_p results in a vector of completion times, and this vector is an extreme point of the polyhedron P_p. Let E_p^k represent extreme point k of the polyhedron P_p. Note that E_p^k is an n_p-dimensional vector; we use E_p^k(u) to denote the completion time of task u in J_p corresponding to extreme point k at processor p. An alternate way to represent P_p is to write it as a convex combination of its extreme points. We use lambda_p^k to represent the non-negative weight associated with extreme point k of the polyhedron for processor p:

$$P_p \;=\; \left\{\sum_{k=1}^{n_p!} \lambda_p^k E_p^k \;:\; \sum_{k=1}^{n_p!} \lambda_p^k = 1,\ \lambda_p^k \ge 0\right\}.$$

Since the polyhedron P_p represents the set of necessary conditions for the completion times of the tasks on processor p, the linear programming problem

$$\min \sum_{u \in T} w_u C_u \quad \text{s.t.} \quad C_u \in P_{p(u)} \ \ \forall u, \qquad C_v \ge C_u + t_v \ \ \forall (u, v) \in L$$

gives a lower bound on the weighted sum of completion times subject to the precedence constraints. This linear programming problem has an exponential number of constraints, and it is not practical to solve it except for small problem instances. We develop a column generation technique for solving the linear programming problem.

C. Column Generation

In a column generation procedure, the linear program is first written in terms of a convex combination of an exponential number of columns. This is called the master problem (MP). The master problem is then solved by successive approximation. The approximation of the master problem is done by restricting the set of columns in the linear programming problem; this is called the restricted master problem (RMP). The dual solution to the restricted master problem is used to verify whether the current solution is optimal. If not, a new column is generated and included in the restricted master problem. This process is repeated until optimality is reached. The practicality of the column generation procedure depends on whether optimality verification and new column generation can be done efficiently. In our case, the column generation step is very easy to solve since it reduces to a simple sorting procedure.

We now give a more detailed view of the column generation procedure for the scheduling problem at hand. The master problem (MP), in terms of the extreme points of the polyhedra P_p, is the following:

$$Z_{MP} = \min \sum_{u \in T} w_u C_u \qquad (2)$$

subject to

$$C_u \ge \sum_{k=1}^{n_{p(u)}!} \lambda_{p(u)}^k E_{p(u)}^k(u) \quad \forall u \qquad (3a)$$
$$\sum_{k=1}^{n_p!} \lambda_p^k = 1 \quad \forall p \qquad (3b)$$
$$C_v \ge C_u + t_v \quad \forall (u, v) \in L \qquad (3c)$$

Instead of using all the extreme points of the polyhedron, the restricted master problem is formulated over a subset of extreme points. Let Z(p) denote a subset of extreme points for processor p. Initially we generate one extreme point for each processor. This is done as follows: for processor p, we randomly order the tasks in J_p and compute the completion times of all the tasks. This vector of completion times is an extreme point of P_p and is included in the restricted master problem. Therefore, initially Z(p) contains only one element for each processor p. The restricted master problem is similar to the master problem except that it is formulated over the extreme points in Z(p):

$$Z_{RMP} = \min \sum_{u \in T} w_u C_u \qquad (4)$$

subject to

$$C_u \ge \sum_{k \in Z(p(u))} \lambda_{p(u)}^k E_{p(u)}^k(u) \quad \forall u \qquad (5a)$$
$$\sum_{k \in Z(p)} \lambda_p^k = 1 \quad \forall p \qquad (5b)$$
$$C_v \ge C_u + t_v \quad \forall (u, v) \in L \qquad (5c)$$

Note that Z_RMP >= Z_MP. We now write the dual of the master problem, which we call the column generator (CG):

$$Z_{CG} = \max \sum_p \delta_p + \sum_{(u,v) \in L} t_v \pi_{uv} \qquad (6)$$

subject to

$$\theta_u - \sum_{v:(u,v) \in L} \pi_{uv} + \sum_{v:(v,u) \in L} \pi_{vu} \;\le\; w_u \quad \forall u \qquad (7a)$$
$$\delta_p - \sum_{u \in J_p} E_p^k(u)\,\theta_u \;\le\; 0 \quad \forall k,\ \forall p \qquad (7b)$$

By linear programming duality we know that Z_CG = Z_MP. Moreover, any feasible solution to the dual is a lower bound on Z_MP. If we solve the restricted master problem and obtain the optimal dual variables, we can check whether that solution is feasible for CG. If it is, then the solution is optimal for the master problem. The dual optimal solution to the restricted master problem is always feasible for Equations (7a); however, it may not be feasible for Equations (7b), since the RMP uses only a subset of the extreme points. Given a current dual optimal solution (delta*_p, theta*_u, pi*_uv) to the RMP, we have to check whether

$$\delta_p^* \;\le\; \sum_{u \in J_p} E_p^k(u)\,\theta_u^* \quad \forall k,\ \forall p.$$

This is equivalent to the following linear programming problem:

$$Z_p(\theta^*) \;=\; \min_{C \in P_p} \sum_{u \in J_p} \theta_u^* C_u \qquad (8)$$

This is just the classical problem of minimizing the weighted completion time of the tasks on processor p, where the weight of task u is theta*_u. The solution procedure is known as Smith's Rule [9]: sort the tasks such that

$$\frac{\theta_1^*}{t_1} \;\ge\; \frac{\theta_2^*}{t_2} \;\ge\; \dots \;\ge\; \frac{\theta_{n_p}^*}{t_{n_p}}$$

and schedule the tasks in order of increasing index. In this case, the completion time of task u is C~_u = sum_{v=1}^{u} t_v and

$$Z_p(\theta^*) \;=\; \sum_{u=1}^{n_p} \theta_u^* \tilde{C}_u. \qquad (9)$$

Note that C~ defines an extreme point of the polyhedron P_p. If delta*_p <= Z_p(theta*) for all p, then the current value of Z_RMP equals Z_MP.
If there is a p such that delta*_p > Z_p(theta*), then the extreme point corresponding to this processor is added to the restricted master problem and the restricted master is solved again. Given the current dual variable values theta* and pi*, we can derive a feasible dual solution as follows: Equation (7a) is satisfied by any dual solution to the RMP, and setting delta_p = Z_p(theta*) makes the solution feasible for CG. Therefore,

$$\sum_p Z_p(\theta^*) + \sum_{(u,v) \in L} t_v \pi_{uv}^* \qquad (10)$$

is a lower bound on Z_MP. This lower bound may not be monotonically increasing, so we keep track of the current best (highest) lower bound as LB. Instead of solving the MP to optimality, given some threshold epsilon we can terminate the computation once the ratio of the solution of the restricted master problem to the best lower bound is less than (1 + epsilon). Algorithm 1 summarizes the column generation procedure.
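The pricing step above reduces to a sort. The following sketch (illustrative only; the dual values below are made-up placeholders standing in for the RMP duals) applies Smith's rule on one processor, recovers the corresponding extreme point of P_p, and checks whether a new column should be added.

# Pricing step for one processor p via Smith's rule (illustrative sketch).
tasks   = {"u1": 3.0, "u2": 1.0, "u3": 4.0}   # task -> processing time t_u
theta   = {"u1": 2.0, "u2": 5.0, "u3": 1.0}   # task -> dual weight theta*_u (placeholder values)
delta_p = 20.0                                # dual variable delta*_p (placeholder value)

# Smith's rule: process tasks in non-increasing order of theta_u / t_u.
order = sorted(tasks, key=lambda u: theta[u] / tasks[u], reverse=True)

completion, elapsed = {}, 0.0
for u in order:
    elapsed += tasks[u]
    completion[u] = elapsed                   # this vector is an extreme point of P_p

Z_p = sum(theta[u] * completion[u] for u in tasks)   # Equation (9)
if Z_p < delta_p:
    print("add extreme point", completion, "to the RMP")
else:
    print("no improving column for this processor")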

Algorithm 1 Column Generation
1: For each processor p generate E_p^1.
2: repeat
3:   Solve RMP and obtain the dual variables.
4:   for all processors p do
5:     Solve CG for p and compute Z_p(theta*) as in Equation (9)
6:     if Z_p(theta*) < delta*_p then
7:       Add the extreme point to RMP
8:   Update the lower bound LB as in Equation (10)
9: until no new extreme points or Z_RMP < (1 + epsilon) LB

[Fig. 1: Task precedence graph of a job, ordered by completion time: the start-time dummy task S_j (release time) precedes map tasks 1, 2 and 3, which precede reduce tasks 4 and 5, which precede the finish-time dummy task F_j.]

III. MARES: A MAPREDUCE SCHEDULER

Solving the linear program MP defined in the last section gives a lower bound on the minimum weighted sum schedule. We outline how to convert this (infeasible) linear programming solution into a feasible solution to the scheduling problem that is guaranteed to be within a constant factor of the lower bound (and hence within a constant factor of the optimal solution). In general, it is not possible to convert the linear programming solution into a constant factor approximation algorithm for the scheduling problem. However, the specific structure of the precedence graph for the MapReduce problem leads to a constant factor approximation algorithm. In this section, we focus on the MapReduce problem where the shuffle phase is not the bottleneck, which reduces the problem to a two-phase structure. The constant factor approximation algorithm is called the MapReduce Scheduler (MARES). Leaving out the shuffle phase is done for two reasons. First, the ideas in MARES are used to derive the more complex algorithm in the next section, when the shuffle operation is introduced into the picture. Second, there are instances when shuffle is not a significant bottleneck, and in this case the algorithm developed here can be used to schedule the tasks.

In order to keep the formulation of the problem consistent with the model in the last section, we introduce dummy tasks into the model. Each dummy task is assigned to its own dummy processor. This dummy processor is not part of the linear programming formulation. Dummy tasks introduce new precedence constraints into the problem, and these constraints are taken into account in the linear programming formulation. We introduce two dummy tasks for each job j: a start-time dummy task S_j that takes r_j units of time, with a link in the precedence graph from S_j to every map task of job j, and a finish-time dummy task F_j whose processing time is zero, with a link from every reduce task of job j to F_j. With these two modifications, an example of the precedence graph of a job j is shown in Figure 1. In this example job j has three map tasks 1, 2 and 3, and reduce tasks 4 and 5. The objective of the linear programming problem is to minimize the weighted sum of the completion times of the finish-time tasks, i.e., min sum_j w_j C_{F_j}. (The value of w_u = 0 for all tasks u other than the F_j.) The linear programming relaxation of the scheduling problem can be solved to get a lower bound on the optimal solution value. We use the linear programming solution to get a feasible solution that is within a factor of 8 of the lower bound, hence giving an 8-approximation algorithm.

The basic idea in MARES is the following: task u becomes available only at its LP completion time C_u. Tasks are scheduled in increasing order of LP completion time as long as all the predecessors of the task have been completed. If the predecessors of a task are not completed, it waits until its predecessors complete and is then scheduled in the order of its LP completion time. Note that the LP completion time is used both for determining the available time of a task and for the scheduling order. The algorithm MARES is outlined in Algorithm 2. In the description, it is assumed that time is slotted and the decisions are made in each time slot.
It is easy to make this process run in polynomial time by advancing time by the task processing times.

Algorithm 2 MARES
1: Solve the LP relaxation and obtain C_u for all tasks.
2: A_u <- C_u is the available time of u.
3: for each time slot t do
4:   for all processors p do
5:     if p is not busy then
6:       Schedule an available task u in J_p with A_u <= t and the lowest C_u
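For concreteness, the following sketch re-implements the MARES scheduling loop under the simplifying assumptions of integer time slots and a precomputed LP solution; the task data are illustrative placeholders, and this is not the authors' implementation.

# MARES-style list scheduling (illustrative sketch).
# Assumed inputs: processing times t, assigned processors proc, LP completion
# times Cbar, availability times A (= Cbar in MARES), and predecessor lists pred.
t    = {"m1": 2, "m2": 3, "r1": 2}
proc = {"m1": "p1", "m2": "p2", "r1": "p1"}
Cbar = {"m1": 2.0, "m2": 3.0, "r1": 6.0}
A    = dict(Cbar)                       # MARES: a task becomes available at its LP completion time
pred = {"m1": [], "m2": [], "r1": ["m1", "m2"]}

finish = {}                             # heuristic completion times C^H_u
busy_until = {p: 0 for p in set(proc.values())}
time = 0
while len(finish) < len(t):
    for p in busy_until:
        if busy_until[p] > time:
            continue                    # processor still busy in this slot
        ready = [u for u in t if u not in finish and proc[u] == p
                 and A[u] <= time
                 and all(v in finish and finish[v] <= time for v in pred[u])]
        if ready:
            u = min(ready, key=lambda x: Cbar[x])   # lowest LP completion time first
            finish[u] = time + t[u]
            busy_until[p] = time + t[u]
    time += 1

print(finish)

The H-MARES variant described later in this section corresponds to setting A[u] = r_j for map tasks and A[u] = 0 otherwise, while keeping the same ordering rule.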

We now outline the proof of the performance guarantee of MARES. The proof exploits the special structure of the precedence constraints of MapReduce in order to derive the approximation ratio; the approximation result does not hold if the precedence relations are arbitrary. We use M_j to denote the set of map tasks of job j and R_j to denote the set of reduce tasks of job j.

Theorem 1: Let Z_MP and Z_H denote the objective function values of the MP linear program and of MARES respectively. Then Z_H <= 8 Z_MP.

Proof: Let C^H_u denote the completion time of task u under MARES on processor p(u). Since tasks are scheduled in the order of LP completion times, once a map task u becomes available on a particular processor, the only tasks that can be processed on that processor before u are those whose LP completion time is not greater than C_u. We define D(u) = {v : v in J_{p(u)}, C_v <= C_u}. Therefore, for u in M_j, we have

$$C^H_u \;\le\; C_u + \sum_{v \in D(u)} t_v \qquad (11)$$

and from Equation (1) we have

$$\sum_{v \in D(u)} t_v C_v \;\ge\; f(D(u), p(u)) \;\ge\; \frac{1}{2}\Big(\sum_{v \in D(u)} t_v\Big)^2. \qquad (12)$$

Using the fact that C_v <= C_u for all v in D(u), we can write

$$\sum_{v \in D(u)} t_v \;\le\; 2\,C_u. \qquad (13)$$

Applying this to Equation (11) gives

$$C^H_u \;\le\; 3\,C_u, \qquad u \in M_j. \qquad (14)$$

We use C^H_{M_j} = max_{u in M_j} C^H_u to denote the completion time of all the map tasks belonging to job j. There is a key difference between scheduling map and reduce tasks. A map task u can be scheduled as soon as it becomes available at its LP completion time. A reduce task u, on the other hand, even if it becomes available at C_u, can only be scheduled after all its predecessor map tasks are completed. In other words, a reduce task u in R_j can be scheduled only after C^H_{M_j}. Once all the predecessors of a reduce task are completed, all tasks that will be processed before the reduce task have LP completion times lower than that of the reduce task. This statement is almost true, except for the case where a task w with C_w > C_u is already in process on processor p(u) when the predecessors of u in R_j are completed. In this case, even if the reduce task has a lower LP completion time, the task w will be completed before the reduce task is started due to non-preemption. Therefore, we can write for all u in R_j

$$C^H_u \;\le\; C^H_{M_j} + \sum_{v \in D(u)} t_v + t_w \qquad (15)$$

where C_w > C_u and t_w <= C_w <= C^H_{M_j}. The reason we have C_w <= C^H_{M_j} is that task w must have become available before the map completion time, since it was already in process when the map tasks of job j completed. Using a similar analysis as for the map tasks, we have sum_{v in D(u)} t_v <= 2 C_u and C^H_{M_j} <= 3 C_{M_j}, where C_{M_j} = max_{u in M_j} C_u. With t_w <= C_w <= C^H_{M_j}, Equation (15) becomes

$$C^H_u \;\le\; C^H_{M_j} + 2\,C_u + C^H_{M_j} \;\le\; 6\,C_{M_j} + 2\,C_u \;\le\; 8\,C_u$$

where the last inequality follows from the fact that if v in M_j and u in R_j, then C_v <= C_u, since this is a constraint in the linear program. This implies Z_H <= 8 Z_MP.

A. H-MARES: A Heuristic Implementation of MARES

The idea of making a task available only at its LP completion time is mainly to bound the worst-case performance. A simple heuristic implementation of MARES, which we call H-MARES, schedules tasks in the order of LP completion time without waiting for them to become available; the only reason for a task to wait is that some of its predecessors have not been completed. The description of the algorithm is exactly the same as MARES except that A_u = r_j if u is in M_j and A_u = 0 otherwise. Recall that r_j is the time at which job j enters the system. If all jobs are available at time zero, then A_u = 0 for all tasks u. The solution of the LP and the scheduling are done exactly as in MARES. H-MARES outperforms MARES in all the experiments performed; however, it does not seem straightforward to prove any theoretical performance guarantee for H-MARES. In the performance evaluation section, we show that H-MARES does very well in practice and on average is within a factor of 1.5 of the lower bound.

IV. MASHERS: A MAP-SHUFFLE-REDUCE SCHEDULER

We now address the problem of jointly scheduling the map, shuffle and reduce operations. Before we give a description of the scheduling algorithm, we first outline the constraints imposed by the shuffle process. When a map task of a job is completed, the result of this task has to be sent to all the reduce tasks of the job. Some reduce tasks may be on the same processor as the map task while others may be on different processors. Each processor is assumed to have an egress port that is used to send out data and an ingress port to receive data. We assume that once a data transfer between two processors begins it cannot be interrupted. We also assume that at most one file can be transmitted or received at any given time instant by any processor.
It is possible to relax this assumption and allow multiple tasks to share the shuffle link, but the analysis becomes more complex. Unlike map and reduce tasks, shuffle tasks have to simultaneously hold resources at the sending and receiving sides in order to complete the transmission. Therefore shuffle processing can be viewed as an edge scheduling problem. Further, the shuffle process is done after the map process, and this precedence constraint before the shuffle phase makes the analysis more complex; the competitive ratio is looser than for scheduling problems in which only the simultaneous possession of multiple resources is needed to complete a task.

In order to make the formulation compatible with the model in Section II, we model the shuffle process as follows. Corresponding to each processor p in the system, we add two more processors I(p) and O(p), where I(p) represents the ingress port at processor p and O(p) represents the egress port. Each of these additional processors is treated like a regular processor in the formulation. In addition to the start and finish nodes for each job as in Section III, we introduce 2|M_j||R_j| additional nodes, where each pair of nodes represents the two ends of a transfer. Each map node has |R_j| transfer nodes associated with it and each reduce node has |M_j| transfer nodes associated with it. Each map task precedes its corresponding transfer tasks and each reduce task succeeds its corresponding transfer tasks. The map part of the transfer precedes the reduce part of the transfer.
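The construction just described can be sketched as follows; the task names, processor names and transfer times below are hypothetical, and the sketch only illustrates how the transfer node pairs and the ingress/egress pseudo-processors are added to a job's precedence graph.

# Augment a job's precedence graph with shuffle transfer nodes (illustrative sketch).
maps    = {"m1": "p1", "m2": "p2"}      # map task -> processor
reduces = {"r1": "p3"}                  # reduce task -> processor

def xfer_time(m, r):
    return 1.0                          # placeholder transfer time between m and r

proc_of, t, edges = {}, {}, []          # task -> processor, processing time, precedence edges
for m, pm in maps.items():
    for r, pr in reduces.items():
        out_node, in_node = f"out[{m}->{r}]", f"in[{m}->{r}]"
        proc_of[out_node] = ("O", pm)   # egress port O(p) of the map task's processor
        proc_of[in_node]  = ("I", pr)   # ingress port I(p) of the reduce task's processor
        t[out_node] = xfer_time(m, r)   # map-side transfer carries the actual transfer time
        t[in_node]  = 0.0               # reduce-side transfer has zero processing time
        edges += [(m, out_node), (out_node, in_node), (in_node, r)]

print(edges)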

Algorithm 3 MASHERS
1: Solve the LP and obtain C_u for all tasks.
2: Execute Algorithm 4 and obtain the edge groups E_1, E_2, ..., E_k
3: A_u <- C_u; A_e <- C_e for every shuffle edge e = (u, v) in SH.
4: for each time slot t do
5:   Find the largest i such that E_i has unscheduled edges.
6:   ES <- {e : e in E_i, A_e <= t, e is free and available}
7:   while ES is not empty do
8:     Schedule the e = (u, v) in ES with the smallest C_e
9:   for all processors p do
10:    if p is not busy then
11:      Schedule the task u in J_p with A_u <= t and the lowest C_u whose predecessors are completed

[Fig. 2: Task precedence graph of a job with data shuffle, ordered by completion time: the start-time dummy task S_j (release time) precedes map tasks 1-3; each map task precedes its output (map-side transfer) tasks 4-9, which precede the corresponding input (reduce-side transfer) tasks 10-15, which precede reduce tasks 16 and 17; the reduce tasks precede the finish-time dummy task F_j.]

The processing time of the map-side transfer equals the actual transfer time between the corresponding map and reduce processors; the processing time of the reduce-side transfer is set to zero. The fact that the two resources have to be held simultaneously is not considered in the linear program and is only introduced in the scheduling algorithm. The completion times of the two tasks at the two ends of a transfer are set equal. An example of this new precedence graph is shown in Figure 2. In the example, tasks 1, 2 and 3 are map tasks and tasks 16 and 17 are reduce tasks. Tasks 4 to 9 are the outgoing transfer tasks and tasks 10 to 15 are the incoming transfer tasks. In Figure 2, the completion times of task 9 and task 15 are set equal since they represent the two ends of one transfer, and the same holds for all the transfer task pairs. With this new precedence graph, the formulation stays the same as in Section II. The LP is solved using the column generation procedure and all the completion times are obtained.

A. An Outline of MASHERS

We first give a high-level view of MASHERS (Algorithm 3). Once the LP lower bound is solved, the map and reduce tasks are scheduled as in Algorithm 2; a reduce task has to wait until all its predecessors are completed. The shuffle tasks are viewed as an edge scheduling problem which schedules the shuffle edges of the precedence graph. This is done by partitioning the edges into groups using Algorithm 4 and scheduling the groups in order. Let G_0 denote the original graph of shuffle edges; let G_i and E_i denote the subgraph and the edge group produced in the i-th iteration of Algorithm 4. Within a group, edges are scheduled in the order of their LP completion times. Let SH denote the set of shuffle edges in the precedence graph. In Figure 2, edge (9, 15) is a shuffle edge that transfers the output of map task 3 to the reduce task 17. We extend the definition of the completion time as follows: the completion time of a shuffle edge e = (u, v) in the linear programming solution is the common completion time of its endpoints, that is, C_e = C_u = C_v. We assume time is slotted and the scheduling algorithm is executed in each time slot. In the description of the algorithm MASHERS we assume that algorithm PARTITION has already partitioned the shuffle edges. We now describe the partitioning process.

B. Partitioning the Shuffle Edges

The partition algorithm groups the shuffle edges. The key idea is to construct the groups such that edges with similar processing times and completion times belong to the same group. In the description of Algorithm 4 we use the following additional notation. Given a graph H, let m(H) = max_{e in H} C_e denote the maximum LP completion time of an edge in H. Given a set of edges S, let p(S) denote the sum of the processing times of the edges in S. Let phi(G_i) = 3 m(G_i) + 2 p(G_i).
Algorithm 4 PARTITION
1: Sort the edges E of G_0 in non-decreasing order of C_e.
2: i <- 1
3: repeat
4:   E_i <- empty, G_i <- empty
5:   for all e in G_{i-1} in the sorted order do
6:     if phi(G_i) <= phi(G_{i-1})/2 still holds after adding e to G_i then
7:       add e to G_i
8:     else
9:       add e to E_i
10:  i <- i + 1
11: until E_i is empty

The partitioning and scheduling work as follows. The algorithm first partitions G_0 into G_1 and E_1 with respect to the condition phi(G_1) <= phi(G_0)/2. It then iteratively partitions G_i into G_{i+1} and E_{i+1} until the remaining subgraph is empty. All edges in E_{i+1} are scheduled before the edges in E_i are considered.

C. MASHERS: Performance Guarantee

In this section, we show that MASHERS gives a constant factor competitive ratio. This is done by first analyzing the performance of the shuffle operation and then incorporating the map and reduce phases into the analysis. The partition process is similar to the partition operation in [8]. The potential function used for partitioning in [8] involves only the processing times (the completion times are only used for ordering). For our problem, the potential function used for partitioning is a weighted sum of the maximum completion time of the edges and the total processing time of the edges.

This is done in order to include the performance of the map operation in the overall analysis. We now give a preliminary result that is used to bound the performance guarantee of MASHERS.

Lemma 2: For e in E_i,

$$C^H_e \;\le\; 2\,\phi(G_{i-1}). \qquad (16)$$

Proof: We prove this by induction. Let N_G(u) denote the set of edges incident to u in G, so that p(N_G(u)) is the sum of the transfer times in N_G(u). For an edge e = (u, v) in the last group E_k, we have

$$C^H_e \;\le\; C^H_u + p(N_{G_{k-1}}(u)) + p(N_{G_{k-1}}(v)) \;\le\; C^H_u + 2\,p(G_{k-1}) \;\le\; 3\,C_e + 2\,p(G_{k-1}) \;\le\; 3\,m(G_{k-1}) + 2\,p(G_{k-1}) \;=\; \phi(G_{k-1}).$$

Now we prove that if all edges in E_k, E_{k-1}, ..., E_{i+1} can be scheduled within 2 phi(G_i), then all edges in E_k, E_{k-1}, ..., E_i can be scheduled within 2 phi(G_{i-1}). An edge e = (u, v) in E_i needs to wait at most 2 phi(G_i) time before the set E_i is considered. In addition, it can take C^H_u + p(N_{G_{i-1}}(u)) + p(N_{G_{i-1}}(v)) time to complete. That is,

$$C^H_e \;\le\; 2\,\phi(G_i) + C^H_u + p(N_{G_{i-1}}(u)) + p(N_{G_{i-1}}(v)). \qquad (17)$$

Since phi(G_i) <= phi(G_{i-1})/2 and C^H_u + p(N_{G_{i-1}}(u)) + p(N_{G_{i-1}}(v)) <= 3 C_e + 2 p(G_{i-1}) <= phi(G_{i-1}), we get

$$C^H_e \;\le\; 2\,\frac{\phi(G_{i-1})}{2} + \phi(G_{i-1}) \;=\; 2\,\phi(G_{i-1}). \qquad (18)$$

This proves the lemma.

Theorem 3: Let Z_LP and Z_H denote the objective function value of the LP and the weighted sum of completion times obtained by MASHERS, respectively. Then Z_H <= 58 Z_LP.

Proof: As in the analysis of MARES, for u in M_j,

$$C^H_u \;\le\; 3\,C_u. \qquad (19)$$

An edge e = (u, v) in E_i was added to E_i because it could not be added to G_i. Let p~(u) denote the sum of the transfer times of the edges incident to u in G_i whose completion times are less than C_e. Since edges are added in the order of completion times, the fact that e could not be added to G_i implies that

$$3\,C_e + 2\,\tilde{p}(u) \;>\; \frac{\phi(G_{i-1})}{2}.$$

(In the above expression we assume, without loss of generality, that node u was the blocking node.) Let D(u) denote the set of edges incident to u in G_0 with completion time smaller than that of e; then p(D(u)) >= p~(u). Together with C_e >= p(D(u))/2, which follows from the LP, we have

$$\phi(G_{i-1}) \;<\; 2\,\big(3\,C_e + 2\,p(D(u))\big) \;\le\; 14\,C_e.$$

Combining this with Equation (16), we have

$$C^H_e \;\le\; 2\,\phi(G_{i-1}) \;\le\; 28\,C_e. \qquad (20)$$

For reduce tasks, similar to the proof in Section III, for any v in R_j

$$C^H_v \;\le\; \hat{C}^H_v + \sum_{u \in D(v)} t_u + t_w \qquad (21)$$

where C^_v denotes the completion time of the last shuffle predecessor of v and t_w accounts for a possible in-process task, as in Section III. Therefore

$$C^H_v \;\le\; 28\,C_v + 2\,C_v + 28\,C_v \qquad (22)$$
$$\;=\; 58\,C_v. \qquad (23)$$

This implies Z_H <= 58 Z_LP.

There are two comments in order here: 1) It is possible to tighten the analysis to get a better approximation ratio, but we do not do so here in order to keep the analysis simple. 2) We use MASHERS to develop two heuristic algorithms, D-MASHERS and H-MASHERS, that we evaluate in the next section.

D. Heuristics D-MASHERS and H-MASHERS

The main reason for partitioning the shuffle edges is to guard against the worst case, where an edge with a long processing time is scheduled early and delays a large number of tasks. Though MASHERS provides a worst-case competitive guarantee, its performance suffers in practice since it guards against corner cases. If the transfer times are not too different, we can skip the partitioning process and just use the LP solution to guide which task to schedule. We can use two different variants:

D-MASHERS: A task is delayed until its LP completion time and is scheduled in the order of LP completion times as long as all its predecessors are done.

H-MASHERS: Tasks are not delayed until their LP completion times and are scheduled in the order of LP completion times as long as their predecessors are done.

In the experimental section, we evaluate both D-MASHERS and H-MASHERS.

V. PERFORMANCE EVALUATION

In this section, we present a simulation based evaluation of the algorithms developed in this paper.

A. Simulation Setting

The workloads of real MapReduce systems are not publicly available. Therefore, we use a synthetic workload generation model similar to [2]. Given the number of jobs and processors, we generate a set of map and reduce tasks for each job. The processing time of a task is uniformly distributed in [1, 10] units.
Between a map task of size m and a reduce task of size r, the data shuffle delay is set to (m + r)/3. For n jobs, the weights are uniformly distributed in [1, n] and the release times are uniformly distributed in [1, 10]. Once we generate a problem instance, we pass its workload information to our scheduler. We assume that the processing times and shuffle delays are known to the scheduler. In the online case, where jobs arrive over time and their information is known only when they arrive, we update the job information by adding the new jobs and removing finished tasks at the moment a job arrives, and then run our scheduler again. Furthermore, we may also adjust the existing sets of extreme points for the new set of jobs, so that our scheduler does not have to run from the beginning. However, this is not the main focus of this paper and is left for future work.
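The workload model can be reproduced in a few lines; the generator below is an illustrative sketch that follows the stated distributions (the exact ranges [1, 10] for processing and release times are assumptions where the original text is ambiguous), not the authors' simulator.

import random

def make_workload(n_jobs, n_procs, maps_per_job=10, reduces_per_job=3, seed=0):
    """Generate a synthetic MapReduce workload (illustrative sketch).

    Task processing times are uniform in [1, 10], the shuffle delay between a
    map task of size m and a reduce task of size r is (m + r) / 3, job weights
    are uniform in [1, n_jobs] and release times are uniform in [1, 10].
    """
    rng = random.Random(seed)
    jobs = []
    for _ in range(n_jobs):
        maps = [{"proc": rng.randrange(n_procs), "time": rng.uniform(1, 10)}
                for _ in range(maps_per_job)]
        reduces = [{"proc": rng.randrange(n_procs), "time": rng.uniform(1, 10)}
                   for _ in range(reduces_per_job)]
        shuffle = [[(m["time"] + r["time"]) / 3 for r in reduces] for m in maps]
        jobs.append({"weight": rng.uniform(1, n_jobs),
                     "release": rng.uniform(1, 10),
                     "maps": maps, "reduces": reduces, "shuffle": shuffle})
    return jobs

workload = make_workload(n_jobs=5, n_procs=10)
print(len(workload), "jobs generated")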

B. Stopping Threshold Test

As mentioned in Section II, for large problem instances we may stop the column generation once the ratio of the master problem to the best lower bound becomes less than a threshold 1 + epsilon. In order to simulate large problem instances, we need to test the effect of different values of epsilon on the lower bound and on our heuristics and pick a suitable value. In each test, we first solve the LP to optimality as a reference point. Then, for the same problem instance, we vary epsilon and solve the problem with our heuristics using the degraded LP solution. For the two-phase problem, we use 50 processors and 30 jobs; each job has 10 map tasks and 10 reduce tasks on average. For the three-phase problem, we use 30 processors and 10 jobs; each job has 8 map tasks and 3 reduce tasks on average. For each epsilon value, we run 10 trials. The results in Figure 3 are competitive ratios (results over the optimal) averaged over these 10 trials. The threshold does not affect the performance of either the lower bound or MARES, as shown in Figure 3a; we only see a slight decrease/increase in the lower bound/MARES competitive ratio, even when epsilon = 0.5. This is not true for the three-phase map-shuffle-reduce problem. In Figure 3b, although the lower bound does not decrease much with a large threshold, the competitive ratio of D-MASHERS increases dramatically. Therefore, in the following tests, we always set epsilon = 0.5 for the two-phase problem but epsilon = 0 for the three-phase problem.

[Fig. 3: Threshold epsilon test. (a) MARES: competitive ratios of the lower bound and MARES versus threshold (0 to 0.5). (b) MASHERS: competitive ratios of the lower bound and D-MASHERS versus threshold (0 to 0.5).]

In the rest of the tests, we vary the number of jobs to generate problems of different sizes. For a particular number of jobs, we run 10 trials and plot them against either the optimal or the lower bound.

C. Evaluation of Two-Phase Map-Reduce Heuristics

For the two-phase problem, we implement an Integer Program (IP); with small problem instances, we can obtain the optimal solution in reasonable time. Figure 4a shows the competitive ratios for only 30 processors and from 5 to 16 jobs; each job has 10 map tasks and 3 reduce tasks on average. As we can see, H-MARES is the closest to the optimal, and the lower bound is closer to the optimal than MARES, whose competitive ratio is about 1.5 to 2. With large problem instances of 100 processors and from 5 to 50 jobs (30 map tasks and 10 reduce tasks per job), we do not calculate the optimal solution. However, we can still see in Figure 4b that the competitive ratios of MARES and H-MARES are far below MARES's approximation guarantee of 8.

D. Evaluation of Map-Shuffle-Reduce Heuristics

We performed similar tests for the three-phase problem, except that we do not have an IP to obtain the optimal solution, so we compare the heuristic results to the lower bound. With 30 processors and from 5 to 40 jobs (each with 8 map tasks and 3 reduce tasks on average), we can see in Figure 5a that the two heuristics generally achieve a competitive ratio of less than 3. In Figure 5b we plot the average competitive ratio as we increase the number of jobs. It is interesting to see that when there are fewer jobs, H-MASHERS beats D-MASHERS, since it does not hold back tasks. As the number of jobs grows, the performance of D-MASHERS improves compared to H-MASHERS. This seems to indicate that holding a task until its LP completion time actually improves the performance of the heuristic. One possible explanation for this phenomenon is that in larger problems there is a wider range of processing time values, and not holding long tasks back hurts the completion times of all the tasks that come after them.
VI. RELATED WORK

Job scheduling on parallel machines is a well-studied problem; refer to [9] for more details. In particular, our work is related to the problem of machine scheduling with precedence constraints. The scheduling problem with a general precedence graph is among the most difficult problems in the area of machine scheduling [10], [11], [12], [13]. For the objective of minimizing the total weighted completion time, the best approximation algorithm is due to Queyranne and Schulz [13]. In the problem that we consider, the precedence between map and reduce induces a depth-two precedence graph. However, unlike the standard machine scheduling problems, scheduling the shuffle phase resembles the data migration problem. In data migration problems, the migration process is modeled by a transfer graph, in which nodes represent the data transferring entities and edges represent the transfer links. Two edges incident on the same node cannot transfer simultaneously, and a node completes when all edges incident on it complete. Kim [8] gave an LP-based 10-approximation for general processing times, which was then improved by Gandhi and Mestre [14] with a combinatorial algorithm. Hajiaghayi et al. [15] generalize the technique in [14] and provide a local transfer protocol where multiple transfers can take place concurrently. Other related work is on sum multi-coloring of graphs, including [16], [17]. Our work differs from all these papers since the map phase that precedes the shuffle phase makes the analysis different from traditional data transfer problems. The closest work to our problem is by Chang et al. [2], but that paper considers neither the explicit precedence constraints between map and reduce nor the shuffle phase of the MapReduce problem.

VII. CONCLUSION

We modeled the end-to-end performance of the MapReduce process and developed constant factor approximation algorithms for minimizing the weighted response time.

[Fig. 4: Performance test for two-phase map-reduce. (a) Competitive ratios of the lower bound, H-MARES and MARES against the optimal, per problem instance. (b) Competitive ratios of H-MARES and MARES for large problem instances.]

[Fig. 5: Performance test for map-shuffle-reduce. (a) Competitive ratios of H-MASHERS and D-MASHERS per problem instance. (b) Average competitive ratios of H-MASHERS and D-MASHERS versus the number of jobs.]

Based on these guaranteed-performance algorithms we developed heuristics that were tested experimentally and performed significantly better than the worst-case guaranteed performance. We are currently working on improving the worst-case performance guarantees as well as testing the algorithms on larger data sets.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in USENIX OSDI, 2004.
[2] H. Chang, M. S. Kodialam, R. R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, "Scheduling in MapReduce-like systems for fast completion time," in IEEE INFOCOM, 2011.
[3] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, "Managing data transfers in computer clusters with Orchestra," in ACM SIGCOMM, 2011.
[4] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair Scheduling for Distributed Computing Clusters," in SOSP, 2009.
[5] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in EuroSys '10: Proceedings of the 5th European Conference on Computer Systems, 2010, pp. 265-278.
[6] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in map-reduce clusters using Mantri," in USENIX OSDI, 2010.
[7] J. Lenstra, A. Rinnooy Kan, and P. Brucker, "Complexity of machine scheduling problems," Annals of Discrete Mathematics, vol. 1, pp. 343-362, 1977.
[8] Y. Kim, "Data migration to minimize the total completion time," Journal of Algorithms, vol. 55, no. 1, pp. 42-57, 2005.
[9] P. Brucker, Scheduling Algorithms. Springer, 2004.
[10] R. Graham, "Bounds on multiprocessing timing anomalies," SIAM Journal on Applied Mathematics, vol. 17, no. 2, pp. 416-429, 1969.
[11] J. Lenstra and A. Rinnooy Kan, "Complexity of scheduling under precedence constraints," Operations Research, vol. 26, no. 1, pp. 22-35, 1978.
[12] S. Chakrabarti, C. Phillips, A. Schulz, D. Shmoys, C. Stein, and J. Wein, "Improved scheduling algorithms for minsum criteria," Automata, Languages and Programming, pp. 646-657, 1996.
[13] M. Queyranne and A. Schulz, "Approximation bounds for a general class of precedence constrained parallel machine scheduling problems," SIAM Journal on Computing, vol. 35, no. 5, pp. 1241-1253, 2006.
[14] R. Gandhi and J. Mestre, "Combinatorial algorithms for data migration to minimize average completion time," Algorithmica, vol. 54, no. 1, pp. 54-71, 2009.
[15] M. Hajiaghayi, R. Khandekar, G. Kortsarz, and V. Liaghat, "On a local protocol for concurrent file transfers," in Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 2011, pp. 269-278.
[16] A. Bar-Noy, M. Halldórsson, G. Kortsarz, R. Salman, and H. Shachnai, "Sum multicoloring of graphs," Journal of Algorithms, vol. 37, no. 2, pp. 422-450, 2000.
[17] R. Gandhi, M. Halldórsson, G. Kortsarz, and H. Shachnai, "Improved bounds for scheduling conflicting jobs with minsum criteria," ACM Transactions on Algorithms (TALG), vol. 4, no. 1, 2008.