Optimal Map Reduce Job Capacity Allocation in Cloud Systems

Transcription

1 Optmal Map Reduce Job Capacty Allocaton n Cloud Systems Marzeh Malemajd Sharf Unversty of Technology, Iran malemajd@ce.sharf.edu Danlo Ardagna Poltecnco d Mlano, Italy danlo.ardagna@polm.t Mchele Cavotta Poltecnco d Mlano, Italy mchele.cavotta@polm.t Alessandro Mara Rzz Mauro Passacantando Poltecnco d Mlano, Italy Unverstà d Psa, Italy alessandromara.rzz@polm.t mauro.passacantando@unp.t ABSTRACT We are enterng a Bg Data world. Many sectors of our economy are now guded by data-drven decson processes. Bg Data and busness ntellgence applcatons are facltated by the MapReduce programmng model whle, at nfrastructural layer, cloud computng provdes flexble and cost effectve solutons for allocatng on demand large clusters. Capacty allocaton n such systems s a ey challenge to provde performance for MapReduce jobs and mnmze cloud resource costs. The contrbuton of ths paper s twofold: ( we provde new upper and lower bounds for MapReduce job executon tme n shared Hadoop clusters, ( we formulate a lnear programmng model able to mnmze cloud resources costs and job rejecton penaltes for the executon of jobs of multple classes wth (soft deadlne guarantees. Smulaton results show how the executon tme of MapReduce jobs falls wthn 14% of our upper bound on average. Moreover, numercal analyses demonstrate that our method s able to determne the global optmal soluton of the lnear problem for systems ncludng up to 1,000 user classes n less than 0.5 seconds. 1. INTRODUCTION Nowadays, many sectors of our economy are guded by datadrven decson processes [14]. In complex systems that do not lend themselves to ntutve models (e.g., natural scences, socal and engneered systems [11], data-drven modelng and hypothess generaton have a ey role to understandng system behavor and nteractons. The adopton of data ntensve applcatons s well recognzed as able to enhance effcency of enterprses and the qualty of our lves. A recent McKnsey analyss [19] has shown, for nstance, that Bg Data could produce $300 bllon potental annual value to US health care. The analyss has also shown how Europe publc sector could potentally reduce expendture of admnstratve actvtes by 15 20%, wth an ncrease of value rangng between $223 and $446 bllon [11, 19]. From the technologcal perspectve, the MapReduce programmng model s recognzed to be the most promnent soluton for Bg Data applcatons [16]. Its open source mplementaton, Hadoop, s able to manage large datasets over Copyrght s held by author/owner(s. ether commodty clusters and hgh performance dstrbuted topologes [29]. MapReduce has attracted nterest of both ndustry and academa, snce analyzng large amounts of unstructured data s a hgh prorty tas for many companes and overtaes the scalablty level that can be acheved by tradtonal data warehouse and busness ntellgence technologes [16]. Lewse, cloud computng s becomng a manstream soluton to provde very large clusters on a pay-per-use bass. Cloud storage provdes an effectve and cheap soluton for storng Bg Data as modern NoSQL databases demonstrated good extensblty and scalablty n storng and accessng data [15]. Moreover, the pay-per-use approach and the almost nfnte capacty of cloud nfrastructures can be used effcently n supportng data ntensve computaton. Many cloud provders already nclude n ther offerng Map Reduce based platforms such as Google MapReduce framewor, Mcrosoft HDnsght, and Amazon Elastc MapReduce [2, 4, 5]. IDC estmates that by 2020, nearly 40% of Bg Data analyses wll be supported by publc cloud [6], whle Hadoop s expected to touch half of the world data by 2015 [15]. A MapReduce job conssts of two man phases, Map and Reduce; each phase performs a user-defned functon on nput data. MapReduce jobs were meant to run on dedcated clusters to support batch analyses. Nevertheless, MapReduce applcatons have evolved and t s not uncommon that large queres, submtted by dfferent user classes, need to be performed on shared clusters, possbly wth some guarantees on ther executon tme. In ths context the man drawbac [17, 26] s that the executon tme of a MapReduce job s generally unnown n advance. In such systems, capacty allocaton becomes one of the most mportant aspects. Determnng the optmal number of nodes n a cluster, shared among multple users performng heterogeneous tass, s an mportant and challengng problem [26]. Moreover, capacty allocaton polces need to decde jobs executon and rejecton rates n a way that users worloads meet ther deadlnes and the overall cost s mnmzed. Capacty and Far schedulers have been ntroduced n the new versons of Hadoop to address capacty allocaton challenges and effectve resource management [1, 3]. The man goal of Hadoop 2.x [25] s maxmzng cluster utlzaton, whle avodng short (.e., nteractve job starvaton. Our focus n ths paper s on dynamc capacty allocaton. Frst, we determne new upper and lower bounds for MapReduce job executon tmes n shared Hadoop clusters adoptng

2 capacty and far schedulers. Next, we formulate the capacty allocaton problem as an optmzaton problem, wth the am of mnmzng the cost of cloud resources and penaltes for jobs rejectons. We then reduce our mnmzaton problem to a Lnear Programmng (LP problem, whch can be solved very effcently by state of the art solvers. We valdate the accuracy of our bounds through the YARN Scheduler Load Smulator (SLS [7]. The scalablty of our optmzaton approach s demonstrated by consderng a very large set of experments. The largest nstance we consder, ncludng 1,000 user classes, can be solved to optmalty n less than 0.5 seconds. Moreover, smulaton results show that average job executon tme s around 14% lower than our upper bound. To the best of our nowledge, the only wor provdng upper and lower bounds for MapReduce jobs executon tme s [26], where only dedcated clusters and FIFO schedulng are consdered (that are not able to fulfll job concurrency and resource sharng requrements for current MapReduce applcatons. Ths paper s organzed as follows. MapReduce job executon tme lower and upper bounds are presented n Secton 2. In Secton 3 the Capacty Allocaton (CA problem s ntroduced and ts lnear formulaton s presented n Secton 4. The accuracy of the bounds and the scalablty of the soluton are evaluated n Secton 5. Secton 6 descrbes the related wor. Conclusons are fnally drawn n Secton ESTIMATING JOB EXECUTION TIMES IN SHARED CLUSTERS In large clusters, multple classes of MapReduce jobs can be executed concurrently 1. In such systems we need to estmate job executon tmes for determnng the confguraton of mnmum cost, whle provdng servce level agreement (SLA guarantees. Prevous wors, e.g., [26], provded theoretcal bounds to desgn performance models for Hadoop 1.0, consderng n partcular the FIFO scheduler. Those bounds can be used to predct job completon tmes only for dedcated clusters. Nowadays, large shared clusters are ruled by newer schedulers,.e., Capacty and Far [1, 3]. In the followng, we derve new bounds for such systems. In partcular, Secton 2.1 ntroduces prelmnares and provdes a tghter bound wth respect to [26] for a sngle-phase (ether Map or Reduce job. Secton 2.2 extends the analyss to the case of two sngle-phase jobs, whle Secton 2.3 provdes bounds for the case of multple (sngle-phase jobs nvolved. Ultmately, we complete our analyss usng the bounds n Secton 2.4 to derve executon tme bounds for multple classes of complete MapReduce jobs. Such results are used n the remanng sectons to defne the constrants of the CA problem that guarantee job deadlnes are met. For space lmtaton, some proofs are omtted and reported n [18]. 2.1 Sngle job bounds Let us consder the executon of a sngle-phase MapReduce job J and let us denote wth, n, µ, and λ the number of avalable slots, the number of tass n a Map or Reduce phase of J, the mean and maxmum tas duraton, respectvely. In the followng, we suppose that the assgnment of tass to 1 A job class s a set of jobs characterzed by the same profle n terms of map, reduce and shuffle duraton. slots { nµ (nµ / +(nµ / tme Fgure 1: Worst case of one job executon tme slots s done usng an on-lne greedy algorthm that assgns each tas to the slot wth the earlest fnshng tme. Proposton 2.1. The executon tme of a Map or Reduce phase of J under a greedy tas assgnment s at most U = n µ λ + λ. Proof. By contradcton, we assume the executon tme s U + ɛ wth ɛ > 0. Note that n µ s the phase total worload, that s the duraton of consdered phase n the case of only one slot avalable. Let the last processed tas has duraton t. All slots are busy before the startng of the last tas (otherwse t would have started earler. The tme that has elapsed before startng the last tas s (U + ɛ t. Snce all slots are busy for (U + ɛ t tme, the total worload untl that pont s (U + ɛ t. At the end of the executon, the whole phase worload must be unchanged, hence (U + ɛ t + t = n µ ( nµ λ + λ + ɛ t + t = nµ ( 1 λ + ɛ + t(1 = 0 ɛ = (t λ( 1. Snce t λ, we get ɛ 0, that s a contradcton because we assumed ɛ > 0 and 1. The worst case scenaro s llustrated n Fgure 1, where job J starts wth slots such that for nµ λ tme unts all slots are busy. After that tme only one tas wth duraton λ s left to be executed. One slot performs the last tas whle all other slots are free. Fnally, after nµ λ + λ tme unts, all tass are executed and the phase s completed. Note that a smlar upper bound has been proposed n [26]. Our contrbuton mproves the prevous result by λ µ. 2.2 Two job bounds In order to provde fast response tmes to small jobs and maxmze the throughput and utlzaton of Hadoop clusters, Far and Capacty schedulers have been devsed. Far scheduler organzes jobs n pools such that every job gets, on average, an equal amount of resources over tme. A sngle runnng job uses the entre cluster however, f other jobs are submtted, the slots that are progressvely released are assgned to the new jobs. In addton, the Far scheduler can guarantee mnmum shares, enables preempton and lmts the number of concurrent runnng jobs/tass. Capacty schedulers have smlar functonaltes. The feature set of the Capacty scheduler ncludes mnmum shares guarantee, securty, elastcty, mult-tenancy, preempton and job prortes.

3 {slots { j n j µ j n µ {slots { j n µ <n j µ j... nj µj/ j (n1 µ1 + n2 µ2/ tme Fgure 2: Lower bound of two jobs n worconservng mode. A scheduler s defned to be wor-conservng f t never lets a processor dle whle there are runnable tass n the system. Both Far and Capacty schedulers can be confgured n wor-conservng or non-wor-conservng (whch vce versa, let avalable resources dle mode. Let us consder the executon of two jobs J and J j. If the system s confgured n non-wor-conservng mode, avalable slots are dvded statstcally and J dle slots are not allocated to J j. Note that the upper bound defned n Proposton 2.1 and the lower bound provded n [26] are stll vald, snce resources are parttoned. Vce versa, f the system s confgured accordng to wor-conservng mode, when J fnshes, ts slots are allocated to J j f t stll has tass watng to start. In ths stuaton, the bounds proposed n Propostons 2.2 and 2.3 hold. We assume both jobs start at the same tme and J has α percent of all the avalable slots whereas α j percent of slots are reserved to J j,.e., α, α j (0, 1 and α + α j = 1. Proposton 2.2. The executon tmes of a greedy tas assgnment of { two jobs (J,J } j n wor-conservng mode are n µ nj µj n µ + nj µj at least mn, and, respectvely. α α j Proof. The analyss of the executon of the frst fnshed job s equvalent to the case wth a sngle job n the system (the best lower and upper bound nown n the lterature are gven by [26] and Proposton 2.1. As regards the second job, the number of slots changes at some pont of ts executon, n other words when the frst job fnshes, the second job gets all the slots of the system. Let us suppose that J termnates frst, hence J j receves all slots after at least n µ α tme unts (.e., after J lower bound [26]. Let us denote wth t f the lower bound for J j executon tme. Frst J j has α j slots untl tme nstant n µ α (see the dotted area n Fgure 2, ( then J j receves all slots for a perod of tme equal to t f n µ α. The maxmum worload that can be executed accordng to the number of slots s greater than or equal to the worload of job J j: ( n µ α αj + t f nµ n j µ j. α Thus, by replacng α j wth 1 α we get n µ α (1 α + ( t f n µ α that s equvalent to t f (n µ + n j µ j/. n j µ j, Proposton 2.3. In a system wth two jobs J and J j n wor-conservng mode, the upper bound of the executon tme of job J s (n µ / + tme Fgure 3: Upper bound of two jobs n worconservng mode for the job that ends the earlest {slots { j n µ n j µ j (nj µj + n µ / + tme Fgure 4: Upper bound of two jobs n worconservng mode for the job that end the latest n µ λ + λ, f n j µ j α α j n µ λ α, T = n j µ j + n µ λ + λ, otherwse. Proof. Here we want to now the upper bound for a job when conservng-mode polcy allows usng dle slots. Hence, the upper bound s acheved when the mnmum dle slots become avalable and t happens when the other job maes ts slots busy. If n j µ j α j n µ λ α holds (see Fgure 3, then the slots of other job can be busy such that upper bound of ths job does not change. If the nequalty does not hold, then slots of the other job become avalable before ths job fnshes (see Fgure 4. Lewse the prevous proof, n the worst case the last tas (wth maxmum duraton can only start after a perod of tme n whch all slots have been busy that s: (n jµ j + n µ λ /. 2.3 Multple class bounds In a shared system, let be the number of slots and U be the set of job classes. In each class U, h concurrent jobs are executed by usng α percent of system slots. Each job J n class has n tass wth mean tas duraton µ and maxmum tas duraton λ. Proposton 2.4. The lower bound for the executon tme of job J n presence of multple classes of jobs s n µ h α. Proof. Each class has α slots and h concurrent jobs, so each job has overall α /h slots and, usng the bound provded n [26], we get as lower bound n µ α = n µ h α. h Proposton 2.5. The upper bound for the executon tme of job J n presence of multple classes of jobs s (n µ 2λ h α + 2λ.

4 Job Class c. slots h. 2 1 One Class h slots { / h slots 2 1 Tme Whole System (n µ 2 h +2 One Job T1 { { { +(n µ 2 h +2 T1 T2 T2 Fgure 5: Slots sharng n a system wth several classes of jobs Proof. Fgure 5 shows a system where slots are shared among several classes of jobs. The max number of slots dedcated to a sngle job of class s α /h. Let us llustrate the worst case scenaro for job J. We assume that job J s executed before J and that each slot freed up from J s dedcated to J. We also assume α /h 1 slots n the last wave of job J start performng a tas wth maxmum duraton, and the frst slot freed up from job J s dedcated to J. In the worst case, ths slot also performs a tas wth maxmum duraton. After duraton λ, remanng slots n J are freed up and are dedcated to job J. α /h slots perform tass of J for (n µ 2λ h α tme and after that there s just one tas wth max duraton λ. So tme (n µ 2λ h α + 2λ s spent for performng job J To prove there s no larger upper bound we use contradcton. Let us assume that job J n Fgure 6 s executed n tme ɛ + n µ 2λ α + 2λ, ɛ > 0. Let after tme t 1 λ of the h consdered job starts, all possble slots α /h (far share are allocated to J and the duraton of the last tas s t 2 λ. The duraton ɛ + n µ 2λ α + 2λ (t 1 + t 2 s the mnmum h amount of tme that the assumed job has α /h slots. We calculate a bound by computng the mnmum amount of worloads that can be done W 1 and the amount of worload that has to be done W 2 = n µ. The mnmum amount s W 1 = (ɛ + nµ 2λ α h + 2λ t 1 t 2 α h + t 1 + t 2 as shown by the dotted area n Fgure 6. Note that, the frst term s the worload performed when α /h slots are avalable, whle t 1 and t 2 are the worloads performed when there s at least one sngle slot. The followng relaton between W 1 and W 2 holds: ( ɛ + Snce 1 α h ɛ α h n µ 2λ α + 2λ α t 1 t 2 + t 1 + t 2 n µ. h h 0 and t 1 + t 2 2λ 0, we get + n µ 2λ + (2λ t 1 t 2 α h + t 1 + t 2 n µ,.e., ɛ t 1 +t 2 2 λ 0, whch s mpossble snce ɛ > Bounds for MapReduce Jobs Executon In ths secton, we extend the results presented n [26] for a MapReduce system wth S M Map slots and S R Reduce slots usng Far/Capacty scheduler. Smlar jobs are grouped Fgure 6: Executon of a sngle job n consderng the proof by contradcton together n a job class U and αm and αr are the percentage of all Map and Reduce slots dedcated to class, whle there are h jobs runnng concurrently. Let us denote wth Mavg, Mmax, Ravg, Rmax, Sh 1, avg, Sh 1, max, Sh avg and Sh max the average and maxmum duratons of Map, Reduce, frst Shuffle and typcal Shuffle tass, respectvely. These values defne an emprcal performance profle for each job class, whle NM and NR are the number of Map and Reduce tass of job J profle. By usng the bounds defned n the prevous sectons, a lower and an upper bound on the duraton of the entre Map phase can be estmated as follows: T low M = N M Mavg h, S M αm T up = (N M Mavg 2Mmax h + 2 M M max. S M αm Smlar results can be obtaned for the Reduce stage, that conssts of the Reduce and part of the Shuffle phase. In fact, accordng also to the results dscussed n [26], we dstngush the non-overlappng porton of the frst shuffle wave from the duraton of the remanng tass n the typcal shuffle. The tme of the typcal shuffle phase can be estmated as: ( T low N Sh = R h 1 Sh avg, S R αr T up = (N R Sh avg 2 Sh max h + 2 Sh Sh max. S M αr Fnally, by puttng all parts together, we get: T low = A low h S M α M + B low h S Rα R + C low, (1 where A low = NM Mavg, B low = NR(Sh avg + Ravg and C low = Sh 1(J avg Sh avg. In the same way, the executon tme of job J s at most: T up = A up h + B up S M αm where: A up = NM Mavg 2Mmax, B up C up h S Rα R + C up, (2 = N RSh avg 2Sh max + N RR avg 2R max, = 2Sh max + Sh 1( max + 2M max + 2R max. Accordng to the guarantees to be provded to the end users, we can use T up upper bound (beng conservatve or the approxmated formula T avg = (T low + T up /2 (3 to bound the executon tme of class jobs n the Capacty Allocaton problem descrbed n the next secton.

5 3. CAPACITY ALLOCATION PROBLEM In ths secton, we consder the jont Capacty Allocaton and Admsson Control problem for a cloud based shared Hadoop 2.x system. We assume that the system runs the far or capacty scheduler, servng a set of user classes, requestng the concurrent executon of jobs wth smlar executon profle. Each class s executed wth s M = αm S M Map slots and s R = αrs R Reduce slots wth a concurrency degree of h (.e., h jobs wth the same profle are executed concurrently. We also assume that the system mplements an admsson control mechansm boundng the number of concurrent jobs h executed by the system,.e., some jobs can be rejected. denotes a predcton for the number of jobs of class to be executed and we have h H up. Furthermore, n order to avod job starvaton, we also mpose h to be greater than a gven lower bound H low. Fnally, a (soft deadlne D s assocated wth each class. Note that, gven s M, s R and h, the executon tme of a class job can be approxmated by: H up T = A h s M + B h s R + C, (4 where A, B and C are postve constants computed as dscussed n the prevous secton. We can use Equatons (2 (3 to derve (4, consderng conservatve upper bounds. In ths latter case D can be consdered as hard deadlne. In alternatve, as n [26], (4 can be obtaned from (1, (2, and (3. In that case, (4 s not a bound but an approxmated formula and D becomes soft deadlne. In ths wor, we follow ths latter more flexble approach. We assume that our MapReduce mplementaton s hosted n a Cloud envronment that provdes on-demand and reserved (see, e.g., Amazon EC2 prcng model [2] homogeneous vrtual machnes (VMs. Moreover, we denote wth c M and c R the number of Map and Reduce slots hosted n each VM,.e., each nstance supports c M Map and c R Reduce concurrent tass for each job J n class. As a consequence, let x m and x r be the number of Map and Reduce slots requred by a certan job J, the number of VMs to be provsoned has to be equal to x m/c M + x r/c R. Let us denote wth δ and wth ρ < δ the cost of on-demand and reserved VMs, respectvely and wth r the number of reserved VMs avalable (.e., the number of VMs subscrbed wth a long term contract. Let d and r be the number of on-demand and reserved VMs, respectvely, used to serve end users requests. The am of the Capacty Allocaton (CA problem we consder here s to mnmze the overall executon cost meetng, at the same tme, all deadlnes. The executon cost ncludes both the VM allocaton cost and the penalty cost for job rejecton. Gven p, the penalty cost for rejecton of a class job, the overall executon cost can be calculated as follows: δ d + ρ r + p (H up h, (5 U where decson varables are d, r, h, s M and s R, for any U,.e., we have to decde the number of on-demand and reserved VMs, concurrency degree, and the number of Map and Reduce slots for each job class. The notaton adopted n ths paper s summarzed n Table OPTIMIZATION PROBLEM In ths secton, we formulate the CA optmzaton problem and propose a sutable and fast soluton technque for the System Parameters c M Number of Map slots hosted n a VM of class c R Number of Reduce slots hosted n a VM of class U Set of job classes p Penalty for rejectng jobs from class D Maespan deadlne of jobs from class A CPU requrement for the Map phase whch can be derved by nput data and job class B CPU requrement for the Reduce phase whch can be derved by nput data and job class C Tme constant factor depends on Map, Copy, Shuffle and Reduce phases that derved by nput data and job class r Number of avalable reserved VMs δ Cost of on-demand VMs ρ Cost of reserved VMs H up Upper bound on the number of class jobs to be executed concurrently H low Lower bound on the number of class jobs to be executed concurrently Decson Varables s M Number of slots to be allocated to class for executng Map tas s R Number of slots to be allocated to class for executng Reduce tas h Number of jobs of class to be executed concurrently r Number of reserved VMs to be allocated for job executon d Number of on-demand VMs to be allocated for for job executon Table 1: Optmzaton model: parameters and decson varable. executon of MapReduce jobs n Cloud envronments. The objectve s to mnmze the executon cost, whle meetng job (soft deadlnes. The total cost ncludes VM provsonng costs and a penalty due to job rejecton. In equaton (5 the term c =1 p Hup s a constant ndependent from decson varables and can be dropped. The optmzaton problem can then be defned as follows: (P0 subject to: A h s M + mn δ d + ρ r U B h s R p h + E 0, U, (6 r r, (7 ( s M + s R r + d, (8 c M c R U H low h H up, U, (9 r 0, (10 d 0, (11 s M 0, U, (12 s R 0, U, (13 where constrants (6 are derved from equaton (4 by mposng the executon of each job to end before ts deadlne (.e., E = C D < 0. Constrant (7 ensures that no more than the avalable reserved VMs can be allocated. Constrant (8 guarantees that enough VMs are allocated to execute submtted jobs wthn ther deadlnes. Constrants (9 bound the job concurrency level for each user. We remar that, n the above problem formulaton, varables r, d, s M, s R, h are not nteger as n realty they should be. In fact, requrng varables to be nteger maes

6 the problem much more dffcult to solve. However, ths approxmaton s wdely used n the lterature (see, e.g., [9, 30] snce relaxed varables can be rounded to the closest nteger at the expense of a generally very small ncrement of the overall cost (ths s ntutve for large-scale MapReduce systems that requre tens or hundreds of relatvely cheap VMs, justfyng the use of a relaxed model. Therefore, we decded to deal wth contnuous varables, consderng a relaxaton of the real problem. However, ths restrcton wll be removed n the numercal analyses reported n Secton 5. Problem (P0 has a lnear objectve functon but constrants (6 are non-lnear and non-convex (the proof s reported n [18]. To overcome the non-convexty of the constrants, we ntroduce new decson varables Ψ = 1/h, for any U, to replace h. Then, problem (P0 s equvalent to problem (P1 defned as follows: (P1 subject to: A s M Ψ + mn δ d + ρ r U p Ψ B s R Ψ + E 0, U, (14 r r, (15 ( s M + s R r + d, (16 c M c R U where Ψ low Ψ low = 1/H up Ψ Ψ up, U, (17 r 0, (18 d 0, (19 and Ψ up s M 0, U, (20 s R 0, U, (21 = 1/H low. We remar that now constrants (14 are convex (the proof s reported n [18]. The convexty of all the constrants of problem (P1 allows to prove the followng result. Theorem 4.1. In any optmal soluton of problem (P1, constrants (14 hold as equaltes and the number of slots to be allocated to job class, s M and s R, can be evaluated as follows: ( s M = 1 A B c M + A, (22 E Ψ s R = 1 E Ψ c R ( A B c R + B c. (23 M The proof of Theorem 4.1 s reported n [18]. The results of Theorem 4.1 allow to transform (P1 nto an equvalent lnear programmng problem, whch can be solved very qucly by state of the art solvers. Theorem 4.2. (P1 s equvalent to the followng problem: (P2 mn δ d + ρ r p h U subject to: r r, (24 γ h r + d, (25 U H low h H up, U, (26 r 0, (27 d 0, (28 where γ = γ 1 + γ 2 wth: γ 1 = 1 E c R γ 2 = 1 E c M ( A B c R c M + B, (29 ( A B c M + A c, (30 R and the decson varables are r, d and h = 1/Ψ, for any U. The proof of Theorem 4.2 s reported n [18]. Snce (P2 s a lnear problem, commercal and open source solvers currently avalable are able to solve effcently very large nstances. A scalablty analyss s reported n the followng secton. The Karush-Kuhn-Tucer (KKT condtons correspondng to problem (P2 guarantee that any optmal soluton of (P2 has the followng mportant propertes. Theorem 4.3. If (r, d, h s an optmal soluton of problem (P2, then the followng statements hold: a r > 0,.e., reserved nstances are always used. b U γ h = r + d,.e., γ can be consdered a computng capacty converson rato that allows to translate class concurrency level nto VM capacty resource requrements. c If p /γ > δ, then h = H up,.e., class job are never rejected. d If p /γ < ρ, then h = H low,.e., class concurrency level s set to the lower bound. e If r > U γ Hup, then d = 0,.e., for property b, f the total capacty requrement can be satsfed through reserved nstances, on demand VMs are never used. f If r < U γ Hlow, then r = r and d > 0,.e., for property b, f the mnmum job requrements exceed reserved nstance capacty, then on demand VMs are needed. Proof. The KKT condtons assocated to (P2 are: ρ ν + µ r λ r = 0, (31 δ ν λ d = 0, (32 p + γ ν + µ λ = 0, U, (33 ( ν γ h r d = 0, (34 U λ r r = 0, (35 µ r (r r = 0, (36 λ d d = 0, (37 λ (h H low = 0, U, (38 µ (h H up = 0, U, (39 ν, λ r, µ r, λ d 0, (40 λ, µ 0, U. (41 a Assume, by contradcton, that r = 0. Then d U γ h U γ H low > 0, thus λ d = 0 and ν = δ. On the other hand, (36 mples that µ r = 0 and λ r = ρ ν = ρ δ < 0 whch s mpossble.

7 b Snce r > 0, we have λ r = 0, hence (31 mples ν = ρ + µ r ρ > 0, thus constrant (25 s actve at (r, d, h. c It follows from (32 that ν = δ λ d δ, hence we have µ = λ + p γ ν p γ ν p γ δ > 0. Therefore h = H up. d Snce ν ρ, we get λ = µ + γ ν p γ ν p γ ρ p > 0, hence h = H low. e We have r = U γ h d U γ H up < r, thus µ r = 0 and ν = ρ. Therefore, λ d = δ ρ > 0 mples d = 0. f We have d = U γ h r U γ H low r > 0, hence λ d = 0 and ν = δ. Therefore, µ r = δ ρ > 0 mples r = r. Property a s obvous, snce reserved nstances are the cheapest ones. Property b and Theorem 4.2 lead to an mportant theoretcal result. Indeed, γ parameters can be nterpreted as a computng capacty converson rato that allows to estmate VM capacty requrements n terms of class concurrency level. Accordngly, also propertes c and d become ntutve. The product γ δ s the unt cost for class job executon wth on-demand nstances. If γ δ s lower than the penalty cost, then class jobs wll always be executed. Vce versa, f γ ρ,.e., the class per unt reserved cost, s larger than the penalty, class jobs wll always be rejected. Fnally, propertes e and f relate the overall mnmum U γ Hlow and maxmum U γ Hlow capacty requrements to reserved nstance capacty and allow to establsh a prory f on demand VMs wll or wll not be used. 5. EXPERIMENTAL RESULTS In ths secton we: ( valdate job executon tme bounds, ( evaluate the scalablty of the CA problem soluton, and ( nvestgate how dfferent (P2 problem settngs mpact on the cloud cluster cost. Our analyses are based on a very large set of randomly generated nstances. Bound accuracy s evaluated through the YARN Scheduler Load Smulator (SLS [7]. In the followng secton, the desgn of experments s presented. Bound accuracy and scalablty analyses are reported n Sectons 5.2 and 5.3. Fnally, the analyss of how (P2 problem parameters mpact on cost s reported n Secton Desgn of experments Analyses n ths secton ntend to be representatve of real Hadoop systems. Instances have been randomly generated by pcng parameters accordng to values observed n real systems and logs of MapReduce applcatons. Afterwards, we use unform dstrbutons wthn the ranges reported n Table 2. In our model, the cloud cluster conssts of on-demand and reserved VMs. We consdered Amazon EC2 prces for VM hourly costs [2]. On demand and reserved nstance prces vared n the range ($0.05,$0.40, to consder the adopton of dfferent VM confguratons. Regardng MapReduce applcatons parameters, we used the values reported n [27], whch consder real log traces obtaned from four MapReduce applcatons: Twtter, Sort, WTrends, and WordCount. Moreover, as n [27] we assume that deadlnes are unformly dstrbuted n the range (10, 20 mnutes. We use the job profle from [27] to calculate a reasonable value for penaltes. Frst, the mnmum cost for runnng a sngle job (let t be cj s evaluated by settng H up = H low and solvng problem (P 2, dsablng the admsson control mechansm. Then, we set the penalty value for job rejectons p = 10 cj as n [8]. We vared H up n the range (10, 30, and we set = 0.9 H up H low. Job Profle Cluster Scale NM (70, 700 H up ( (10, 30 NR (32, 64 Mmax (s (16, 120 Job Rejecton Penalty Sh typ max (s (30, 150 p ( (250, 2500 Rmax (s (15, 75 Sh 1( max (s (10, 30 Cloud Instance Prce c M, c R (1, 4 ρ ( (5, 20 D (s (600, 1200 δ ( (5, 40 Table 2: Cluster characterstcs and Job Profles 5.2 Accuracy of Executon Tme Bounds The am of ths secton s to compare our tme bounds (1 and (2 aganst the executon tmes obtaned through YARN SLS [7], the offcal smulator provded wthn Hadoop 2.3 framewor. YARN SLS requres an Hadoop deployment and t nteracts wth t by means of moced NodeManagers and Applcaton- Masters wth the purpose of smulatng both a set of cluster nodes and the relatve worload. Those enttes nteract drectly wth Hadoop YARN, smulatng a whole runnng envronment wth a one to one mappng between smulated and real tmes (.e., the smulaton of 1 second of the Hadoop cluster requres 1 second smulaton. SLS requres as nput a cluster confguraton fle and an executon trace. Ths trace can be provded ether n Apache Rumen 2 format or n the SLS propretary format (the one we adopted, whch s a smplfed verson contanng only the data strctly needed for smulaton. In partcular, among other nformaton, t provdes for each job and each tas the start and end tmes. In our evaluaton we consder the MapReduce job profles extracted from log traces avalable from Twtter, Sort, WTrends, and WordCount reported n [27]. In order to use the SLS tool, we generated synthetc job traces representng these worloads. Frst of all, snce SLS does not provde shuffle phase executon tme, we have to use a smplfed verson of equatons (1 and (2. Therefore, we partally removed the shuffle phase, by gnorng the frst shuffle wave (to a certan extent overlapped wth the last Map wave, though and by ncludng the remanng part (e.g., Sh avg n the Reduce phase. We also consder the total number of avalable slots as shared between the Map and Reduce tass, beng 2 A tool for extractng traces from Hadoop logs

8 Twtter Sort No. of users T up 1 gap m 1 gap No. of users T up 2 gap m 2 gap % 1.24% % 0.61% % -0.09% % 3.26% % 7.68% % 10.42% % 6.43% % 4.79% % % % -7.48% % 7.12% % 5.64% % -9.80% % % % -5.07% % -1.58% % -4.94% % -2.44% % 1.56% % -0.45% Table 3: Two job classes analyss (Twtter and Sort WordCount WTrends No. of users T up 1 gap m 1 gap No. of users T up 2 gap m 2 gap % 5.27% % 23.93% % -4.20% % 16.78% % 7.08% % 39.47% % -0.65% % 12.65% % 3.80% % 35.86% % 5.06% % 25.61% % 4.34% % 22.15% % 14.59% % 43.47% % 2.52% % 26.62% % 2.41% % 15.08% Table 4: Two job classes analyss (WordCount and WTrends unable to assgn them to a specfc phase. In partcular, we used a number of slots equals to the number of vrtual cores allocated n the smulator. These slots have been used n both phases so we set S M and S R equal to the avalable cores. Then, we set the ratos α R h = α M h equal to 1/ U h. Ths because the avalable resources are equally shared among the dfferent users, so each class wll have a rato of resources proportonal to ts users h : αr = αm = h / U h. In order to valdate our bounds, we must compute job duratons,.e., for each job, the dfference between ts submsson and completon tme. Snce SLS s a trace based smulator, we must generate a trace that nterleaves for each user the submsson of jobs by ther average duraton. However, we do not now ths duraton (that s the goal of ths smulaton, but we can obtan t by relyng on a fxed-pont teraton method. We consder a closed model n whch for each class, h users can concurrently submt multple jobs. Let approxmate the average job duraton T wth an ntal guess A,0 for each class U and run the smulaton of the generated trace. Then, we can refne our guess of T teratvely wth the value A,n, computed as follows: A,n = β T,n 1 + (1 β A,n 1, (42 for each class U (we expermentally set β = 0.07, where T,n 1 s the average job duraton obtaned by SLS for class at the prevous run n 1. We terate ths procedure untl A,n and T,n are close enough for each class U. At that pont A,n T,n T for each job class. We stop the fxed-pont teraton method when the rato max U A,n T,n / T,n s below a gven threshold τ (set expermentally equal to 0.1. We then evaluate how far our bounds are from ths value, by comparng T,n wth the upper bound T up and the average of the two bounds m = (T low + T up /2. Each smulaton trace has been bult by consderng dfferent user classes (drawn from WorCount, Sort, Twtter Twtter Sort WordCount N. T up 1 gap m 1 gap N. T up 2 gap m 2 gap N. T up 3 gap m 3 gap % 7.24% % 9.46% % 5.07% % 1.06% % 6.85% % 0.26% % -1.26% % -3.30% % -3.30% % -7.89% % -4.06% % % % -3.39% % -3.46% % -4.62% % 5.43% % 6.44% % 4.34% % 2.21% % 6.38% % -0.53% % 11.33% % 8.97% % 7.26% % 2.06% % 0.21% % -0.74% % 1.68% % 3.46% % -1.32% Table 5: Three job classes analyss (Twtter, Sort and WordCount Sort WordCount WTrends N. T up 1 gap m 1 gap N. T up 2 gap m 2 gap N. T up 3 gap m 3 gap % -1.51% % -2.33% % 9.73% % -0.12% % -0.97% % 15.46% % -4.51% % -6.51% % 7.90% % -0.22% % -2.38% % 8.19% % -4.25% % -3.18% % 4.39% % 5.07% % 1.18% % 11.28% % 6.60% % 3.80% % 13.55% % 1.60% % -2.51% % 7.97% % -5.64% % -6.75% % 3.37% % 4.40% % -0.93% % 8.42% Table 6: Three job clasess analyss (Sort, Word- Count and WTrends and WTrends traces settng A,0 = T up for any U. In order to avod that jobs start smultaneously (unrealstc n real systems, we delay each job submsson by a random exponentally-dstrbuted tme value (.e., the user thn tme set equal to a tenth of the estmated job executon tme. Ultmately, we scaled down by a factor of 10 the orgnal executon tmes n order to acheve a smulaton speedup. We consdered dfferent test confguratons wth two and three job classes and wth a random number of users n the range [2, 10]. Those scenaros represents lght load condtons that correspond to the worst case for the evaluaton of our bounds. Indeed, under lght load condtons the probablty that any user class s temporarly dle can be sgnfcant and, the Far and Capacty scheduler, would assgn the dle user class slots to other classes to boost ther performance. Vce versa, under heavy loads our upper bounds become tghter. Tables 3-6 report the results we acheved. For each run the number of users and the gap between T and both T up and m are reported (a negatve m gap means that T > m. All the smulatons have been performed consderng a cluster wth 128 cores and usng the YARN far scheduler. Overall, for the two job classes, the gap between the upper bound and the jobs mean executon tme s around 19% on average, whle the gap wth respect to m s only 10% on average. For three classes the average between the upper bound and the jobs mean executon tme gap s 11%, whle the gap wth respect to m s 5%. Over all the set of experments the average between the upper bound and the jobs mean executon tme s 14%. Smulatons run on Mcrosoft Azure Lnux small nstances (.e., sngle core, 1.75GB VMs. The fxed-pont teraton procedure converges n 4.4 teratons on average. The smulaton tme of each fxed-pont procedure teraton was around 31 mnutes. 5.3 Scalablty analyss In ths secton, we evaluate the scalablty of our optmzaton soluton. We performed our experment on a VrtualBox vrtual machne based on Ubuntu server runnng on an

9 ntel Xeon Nehalem dual socet quad-core system wth 32 GB of RAM. Optmal soluton to problem (P2 was obtaned by runnng CPLEX 12.0 where we also restrcted decson varables r, d and h to be nteger,.e., we consdered the Mxed Integer Lnear Programmng (MILP verson of (P2. We performed experments consderng dfferent numbers of user classes. We vared the cardnalty of the set U between 20 and 1,000 wth step 20, and run each experment ten tmes. The results show that the tme requred to determne global optmal soluton for the MILP problem s, on average, less than 0.08 seconds. The nstances of maxmum sze ncludng 1,000 user classes can be solved n less than 0.5 second n the worst case. 5.4 Case Studes In ths secton, we nvestgate how dfferent (P2 problem settngs mpact on the cloud cluster cost. In partcular, we analyse three case studes to address the followng research questons: (1 Is t better to consder a shared cluster or to devote a dedcated cluster to ndvdual user classes? (2 What s the effect of job concurrency on cluster cost? (3 Whch s the cost mpact of more strct deadlnes? (s there a lnear relaton between the cost and job deadlnes?. Instances have been generated accordng to Sectons 5.1 and 5.3. Furthermore, to ease the results nterpretaton we excluded reserved nstances and assumed there s a sngle type of VM avalable from the cloud provder Effect of sharng cluster In ths case study, we want to examne the effect of cluster resource sharng. In partcular, we consder two scenaros. The frst one s our baselne, whch corresponds to (P2 problem settng. The second one consders the same resource demand (n terms of job profles, deadlnes, etc. but U (P2 problems are solved ndependently,.e., assumng a dedcated cluster s devoted to each user class. To perform the comparsons, we consder dfferent numbers of user classes. We vary the cardnalty of the set U between 20 and 1,000 wth step 20 and randomly generate ten nstances for each cardnalty value. For each nstance we calculate two values: the frst one s the objectve functon of the baselne scenaro, that we refer to as dependent objectve functon; the second value, that we call ndependent objectve functon, s evaluated by summng up the U objectve functons of the ndvdual problems. The comparson s performed by consderng the rato between the dependent and ndependent objectve functon. Fgure 7 reports the average of ths ratos for dfferent numbers of user classes. Overall, the cluster cost margnally decreases by assumng all user classes together and on average we have 0.48% varaton on the overall cluster cost. We can conclude that, thans to cloud elastcty, the adopton of shared or dedcated clusters leads to the same cost. Note that, shared cluster can lead to benefts thans to HDFS (e.g., better ds performance and node load balancng but ths can not be captured by our cost model Effect of job concurrency degree In ths case study we want to analyze the effect of the job concurrency degree on the cost of one sngle job. To perform the experment, we assume there s just one user class n the cluster. We vary the job concurrency degree h from 10 to 30 and, for each value, we randomly generate 10 Cluster cost vara.ons Effect of assumng all user classes to- Fgure 7: gether. Cost of a sngle job Number of user classes Job concurrency Fgure 8: Effect of job concurrency degree on sngle job cost. nstances of problem (P2. For each nstance we dsable the admsson control by settng up H low = H up and we solve the optmzaton problem. We calculate the cost of one sngle job for each nstance by dvdng the objectve functon by the job concurrency degree. Fgure 8 shows how the per-job cost vares wth dfferent job concurrency degrees for a representatve example. Overall, the analyss demonstrates that the cost varance for dfferent job concurrency s neglgble,.e., the dfferent job concurrency degree leads to less than 0.002% varaton of the cost of one job. Hence, n a cloud settng, elastcty allows to obtan a constant per-job executon cost ndependently of the number of users n a class. Ths result s n lne wth Theorem 4.3 b Effect of tghtenng the deadlnes Here we want to examne the relaton between cost and deadlnes. In partcular, we chec the effect of reducng the deadlnes on the cluster cost. We vary the cardnalty of the set U between 20 and 1,000 and for each cardnalty we generate several random nstances as descrbed n Secton 5.3. For each nstance, we teratvely tghten the deadlnes of every user class to observe how the changes are reflected on the cost. In each step, we decrease the deadlnes by 5% of the ntal value. The reducton process contnues untl the nstance wth new deadlnes does not have a feasble soluton. After each reducton, we calculate the ncreased cost rato,.e., the rato between the objectve functon for the problem wth the new deadlnes and the objectve functon of the problem wth the ntal deadlnes. Fgure 9 llustrates the trend of the ncrease cost rato for a representatve nstance wth 20 user classes: the reducton s not lnear and the cost to pay for reducng the deadlnes by a 60% s more than three tmes wth respect to the base case.

10 Cost of cluster Fgure 9: cost % reduc.on on deadlnes Effect of reducng deadlnes on cluster 6. RELATED WORK Capacty management and optmal schedulng of Hadoop clusters receved a lot of nterest by the research communty. Authors n [13] propose Starfsh, a self-tunng System for analytcs on Hadoop. Indeed, rarely Hadoop exhbts the best performance as t s, wthout a specfc tunng phase. Starfsh, collects at runtme some ey nformatons about the job executon generatng a profle that s eventually exploted to automatcally confgure Hadoop wthout human nterventon. The same tool has been successful employed to solve cluster szng problems [12]. Tan and Chen [24] face the problem of resource provsonng optmzaton mnmzng the cost assocated wth the executon of a job. Ths wor presents a cost model that depends on the amount of nput data and on the consdered job characterstcs. A proflng regresson-based analyss s carred out to estmate the model parameters. A dfferent approach, based on closed queung networs, s proposed n [20] that consders also contenton and parallelsm on compute nodes to evaluate the completon tme of a MapReduce job. Unfortunately, ths approach concerns the executon tme of the map phase only. Vanna et al. [28] propose a smlar soluton, whch however, has been valdated for cluster exclusvely dedcated to the executon of a sngle job. The wor n [17] models the executon of Map tas through a tandem queue wth overlappng phases and provdes very effcent run tme schedulng solutons for the jont optmzaton of the Map and copy/shuffle phases. Authors show how ther runtme schedulng algorthms match closely the performance of the offlne optmal verson. The wor n [10] ntroduces a novel modelng approach based on mean feld analyss and provde very fast approxmate methods to predct the performance of Bg Data systems. Deadlnes for MapReduce jobs are consdered also n [23]. The authors recognze the nablty of Hadoop schedulers to handle properly jobs wth deadlnes and propose to adapt to the problem some well-nown multprocessor schedulng polces. They present two versons of the Earlest Deadlne Frst heurstc and demonstrate they outperform the classcal Hadoop schedulers. The problem of progress estmaton of parallel queres s addressed n [21]. The authors present Parallax, a progress estmator able to predct the completon tme of queres representng MapReduce jobs. The estmator s mplemented on Pg and evaluated wth PgMx benchmar. ParaTmer [22], an extenson of Parallax, s a progress estmator that can predct the completon of parallel queres expressed as Drected Acyclc Graph (DAG of MapReduce jobs. The man mprovement wth respect to the prevous wor, s the support for queres where multple jobs wor n parallel,.e., have dfferent path n the DAG. Authors n [31] nvestgate the performance of MapReduce applcatons on homogeneous and heterogeneous Hadoop cloud based clusters. They consder a problem smlar to the one we faced n our wor and provde a smulaton-based framewor for mnmzng nfrastructural costs. However, admsson control s not consdered and a sngle type of worload (.e., user class s optmzed. In [26] the ARIA framewor s presented. Ths wor s the closest to our contrbuton and focuses on clusters dedcated to sngle user classes runnng on top of a frst n frst out scheduler. The framewor addresses the problem of calculatng the most sutable amount of resource (slots to allocate to Map and Reduce tass n order to meet a user-defned soft deadlne for a certan job and reduce costs assocated wth resource over-provsonng. A MapReduce performance model relyng on a compact job profle defnton to calculate a lower bound, an upper bound and an estmaton of job executon tme s presented. Fnally, such model, mproved n [32], s valdated through a smulaton study and an expermental campagn on a 66-nodes Hadoop cluster. 7. CONCLUSIONS AND FUTURE WORK In ths paper, we provded an optmzaton model able to mnmze the executon costs of heterogeneous tass n cloud based shared Hadoop clusters. Our model s based on novel upper and lower bounds for MapReduce job executon tme. Our soluton has been valdated by a large set of experments. Results have shown that our method s able to determne the global mnmum solutons for systems ncludng up to 1,000 user classes n less than 0.5 seconds. Moreover, the average executon tme of MapReduce jobs obtaned through smulatons s wthn 14% of our bounds on average. Future wor wll valdate the consdered tme bounds n real cloud clusters. Moreover, a dstrbuted mplementaton of the optmzaton solver able to explot the YARN herarchcal archtecture wll be developed. Acnowledgement The wor of Marzeh Malemajd has been supported by the European Commsson grant no. FP7-ICT (MODAClouds. Danlo Ardagna and Mchele Cavotta s wor has been partally supported by the the European Commsson grant no. H (DICE. The smulatons and numercal analyses have been performed under the Wndows Azure Research Pass 2013 grant.

11 8. REFERENCES [1] Capacty Scheduler. [2] Elastc Compute Cloud (EC2. [3] Far Scheduler. [4] MapReduce: Smplfed Data Processng on Large Clusters. http: //research.google.com/archve/mapreduce.html. [5] Mcrosoft Azure. [6] The dgtal unverse n [7] YARN Scheduler Load Smulator (SLS. [8] J. Anselm, D. Ardagna, and M. Passacantando. Generalzed Nash Equlbra for SaaS/PaaS Clouds. European Journal of Operatonal Research, 236(1: , [9] D. Ardagna, B. Pancucc, and M. Passacantando. Generalzed Nash Equlbra for the Servce Provsonng Problem n Cloud Systems. IEEE Transactons on Servces Computng, 6(4: , [10] A. Castglone, M. Grbaudo, M. Iacono, and F. Palmer. Explotng mean feld analyss to model performances of bg data archtectures. Future Generaton Computer Systems, 37(0: , [11] C. P. Chen and C.-Y. Zhang. Data-ntensve applcatons, challenges, technques and technologes: A survey on bg data. Informaton Scences, 275(0: , [12] H. Herodotou, F. Dong, and S. Babu. No one (cluster sze fts all: Automatc cluster szng for data-ntensve analytcs. In SOCC 11, pages 18:1 18:14, [13] H. Herodotou, H. Lm, G. Luo, N. Borsov, L. Dong, F. B. Cetn, and S. Babu. Starfsh: A Self-tunng System for Bg Data Analytcs. In CIDR 11, pages , [14] H. V. Jagadsh, J. Gehre, A. Labrnds, Y. Papaonstantnou, J. M. Patel, R. Ramarshnan, and C. Shahab. Bg data and ts techncal challenges. Commun. ACM, 57(7:86 94, [15] K. Kambatla, G. Kollas, V. Kumar, and A. Grama. Trends n bg data analytcs. Journal of Parallel and Dstrbuted Computng, 74(7: , [16] K.-H. Lee, Y.-J. Lee, H. Cho, Y. D. Chung, and B. Moon. Parallel data processng wth mapreduce: A survey. SIGMOD Rec., 40(4:11 20, [17] M. Ln, L. Zhang, A. Werman, and J. Tan. Jont optmzaton of overlappng phases n MapReduce. SIGMETRICS Performance Evaluaton Revew, 41(3:16 18, [18] M. Malemajd, A. M. Rzz, D. Ardagna, M. Cavotta, M. Passacantando, and A. Movaghar. Optmal Capacty Allocaton for executng Map Reduce Jobs n Cloud Systems. Techncal Report n , Poltecnco d Mlano, ardagna/mapreducetechreport pdf. [19] J. Manya, M. Chu, B. Brown, J. Bughn, R. Dobbs, C. Roxburgh, and A. H. Byers. Bg data: The next fronter for nnovaton, competton, and productvty. McKnsey Global Insttute, [20] D. A. Menascé and S. Bardhan. Queung Networ Models to Predct the Completon Tme of the Map Phase of MapReduce Jobs. In 38th Internatonal Computer Measurement Group Conference, [21] K. Morton, M. Balaznsa, and D. Grossman. ParaTmer: A Progress Indcator for MapReduce DAGs. In SIGMOD 10, pages , [22] K. Morton, A. Fresen, M. Balaznsa, and D. Grossman. Estmatng the progress of MapReduce ppelnes. In ICDE 10, pages , [23] L. T. X. Phan, Z. Zhang, Q. Zheng, B. T. Loo, and I. Lee. An emprcal analyss of schedulng technques for real-tme cloud-based data processng. In SOCA 11, pages 1 8, [24] F. Tan and K. Chen. Towards Optmal Resource Provsonng for Runnng MapReduce Programs n Publc Clouds. In CLOUD 11, pages , [25] V. K. Vavlapall, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curno, O. O Malley, S. Rada, B. Reed, and E. Baldeschweler. Apache Hadoop YARN: Yet Another Resource Negotator. In SOCC 13, pages 5:1 5:16, [26] A. Verma, L. Cherasova, and R. H. Campbell. ARIA: Automatc Resource Inference and Allocaton for Mapreduce Envronments. In ICAC 11, pages , [27] A. Verma, L. Cherasova, and R. H. Campbell. Resource Provsonng Framewor for Mapreduce Jobs wth Performance Goals. In Mddleware 11, pages , [28] E. Vanna, G. Comarela, T. Pontes, J. M. Almeda, V. A. F. Almeda, K. Wlnson, H. A. Kuno, and U. Dayal. Analytcal Performance Models for MapReduce Worloads. Internatonal Journal of Parallel Programmng, 41(4: , [29] F. Yan, L. Cherasova, Z. Zhang, and E. Smrn. Heterogeneous cores for MapReduce processng: Opportunty or challenge? In NOMS 14, pages 1 4, [30] Q. Zhang, Q. Zhu, M. Zhan, and R. Boutaba. Dynamc Servce Placement n Geographcally Dstrbuted Clouds. In ICDCS 12, pages , [31] Z. Zhang, L. Cherasova, and B. T. Loo. Explotng Cloud Heterogenety for Optmzed Cost/Performance MapReduce Processng. In CloudDP 14, pages 1:1 1:6, [32] Z. Zhang, L. Cherasova, A. Verma, and B. T. Loo. Automated Proflng and Resource Management of Pg Programs for Meetng Servce Level Objectves. In ICAC 12, pages 53 62, 2012.