Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems

Wei Wang, Student Member, IEEE, Ben Liang, Senior Member, IEEE, Baochun Li, Senior Member, IEEE

Abstract: We study the multi-resource allocation problem in cloud computing systems where the resource pool is constructed from a large number of heterogeneous servers, representing different points in the configuration space of resources such as processing, memory, and storage. We design a multi-resource allocation mechanism, called DRFH, that generalizes the notion of Dominant Resource Fairness (DRF) from a single server to multiple heterogeneous servers. DRFH provides a number of highly desirable properties. With DRFH, no user prefers the allocation of another user; no one can improve its allocation without decreasing that of the others; and more importantly, no coalition behavior of misreporting resource demands can benefit all its members. DRFH also ensures some level of service isolation among the users. As a direct application, we design a simple heuristic that implements DRFH in real-world systems. Large-scale simulations driven by Google cluster traces show that DRFH significantly outperforms the traditional slot-based scheduler, leading to much higher resource utilization with substantially shorter job completion times.

Index Terms: Cloud computing, heterogeneous servers, job scheduling, multi-resource allocation, fairness.

1 INTRODUCTION

Resource allocation under the notion of fairness and efficiency is a fundamental problem in the design of cloud computing systems. Unlike traditional application-specific clusters and grids, cloud computing systems distinguish themselves with unprecedented server and workload heterogeneity. Modern datacenters are likely to be constructed from a variety of server classes, with different configurations in terms of processing capabilities, memory sizes, and storage spaces [2]. Asynchronous hardware upgrades, such as adding new servers and phasing out existing ones, further aggravate such diversity, leading to a wide range of server specifications in a cloud computing system [3]-[7].
Table 1 illustrates the heterogeneity of servers in one of Google's clusters [3], [8]. Similar server heterogeneity has also been observed in public clouds such as Amazon EC2 and Rackspace [4], [5]. In addition to server heterogeneity, cloud computing systems also exhibit much higher diversity in resource demand profiles. Depending on the underlying applications, the workload spanning multiple cloud users may require vastly different amounts of resources (e.g., CPU, memory, and storage). For example, numerical computing tasks are usually CPU intensive, while database operations typically require high-memory support. The heterogeneity of both servers and workload demands poses significant technical challenges for the resource allocation mechanism, giving rise to many delicate issues, notably fairness and efficiency, that must be carefully addressed.

W. Wang, B. Liang and B. Li are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada. E-mail: {weiwang, liang}@ece.utoronto.ca, bli@ece.toronto.edu. Part of this paper has appeared in [1]. This new version contains substantial revision with new illustrative examples, property analyses, proofs, and discussions.

TABLE 1
Configurations of servers in one of Google's clusters [3], [8]. CPU and memory units are normalized to the maximum server (the 795-server class below).

Number of servers   CPUs   Memory
             6732   0.50   0.50
             3863   0.50   0.25
             1001   0.50   0.75
              795   1.00   1.00
              126   0.25   0.25
               52   0.50   0.12
                5   0.50   0.03
                5   0.50   0.97
                3   1.00   0.50
                1   0.50   0.06

Despite the unprecedented heterogeneity in cloud computing systems, state-of-the-art computing frameworks employ rather simple abstractions that fall short. For example, Hadoop [9] and Dryad [10], the two most widely deployed cloud computing frameworks, partition a server's resources into bundles, known as slots, that contain fixed amounts of different resources. The system then allocates resources to users at the granularity of these slots. Such a single-resource abstraction ignores the heterogeneity of both server specifications and demand profiles, inevitably leading to a fairly inefficient allocation [11].
Towards addressing the inefficiency of the current allocation module, many recent works focus on multi-resource allocation mechanisms. Notably, Ghodsi et al. [11] suggest a compelling alternative known as the Dominant Resource Fairness (DRF) allocation, in which each user's dominant share, the maximum ratio of any resource that the user has been allocated, is equalized. The DRF allocation possesses a set of highly desirable fairness properties, and has quickly received significant attention in the literature [12]-[15]. While DRF and its subsequent works address the demand heterogeneity of multiple resources, they all limit the discussion to a simplified

model where resources are pooled in one place and the entire resource pool is abstracted as one big server.¹ Such an all-in-one resource model not only contrasts with the prevalent datacenter infrastructure, where resources are distributed to a large number of servers, but also ignores server heterogeneity: the allocations depend only on the total amount of resources pooled in the system, irrespective of the underlying resource distribution over servers. In fact, when servers are heterogeneous, even the definition of dominant resource is not so clear. Depending on the underlying server configurations, a computing task may bottleneck on different resources in different servers. We shall see that naive extensions, such as applying the DRF allocation to each server separately, may lead to a highly inefficient allocation (details in Sec. 3.4).

This paper presents a rigorous study proposing a solution with provable operational benefits that bridges the gap between the existing multi-resource allocation models and the state-of-the-art datacenter infrastructure. We propose DRFH, a DRF generalization for heterogeneous environments where resources are pooled by a large number of heterogeneous servers, representing different points in the configuration space of resources such as processing, memory, and storage. DRFH generalizes the intuition of DRF by seeking an allocation that equalizes every user's global dominant share, which is the maximum ratio of any resource the user has been allocated in the entire resource pool. We systematically analyze DRFH and show that it retains most of the desirable properties that the all-in-one DRF model provides for a single server [11]. Specifically, DRFH is Pareto optimal: no user is able to increase its allocation without decreasing other users' allocations. Meanwhile, DRFH is envy-free in that no user prefers the allocation of another. More importantly, DRFH is group strategyproof in that whenever a coalition of users colludes to misreport their resource demands, there is a member of the coalition that cannot strictly gain.
As a result, the coalition is better off not forming. In addition, DRFH offers some level of service isolation by ensuring the sharing incentive property in a weak sense: it allows users to execute more tasks than they could under some equal partition where the entire resource pool is evenly allocated among all users. DRFH also satisfies a set of other important properties, namely single-server DRF, single-resource fairness, bottleneck fairness, and population monotonicity (details in Sec. 3.3).

As a direct application, we design a heuristic scheduling algorithm that implements DRFH in real-world systems. We conduct large-scale simulations driven by Google cluster traces [8]. Our simulation results show that, compared with the traditional slot schedulers adopted in prevalent cloud computing frameworks, the DRFH algorithm suitably matches demand heterogeneity to server heterogeneity, significantly improving the system's resource utilization, with a substantial reduction of job completion times as well.

The remainder of this paper is organized as follows. We briefly revisit the DRF allocation and point out its limitations in heterogeneous environments in Sec. 2. We then formulate the allocation problem with heterogeneous servers in Sec. 3, where a set of desirable allocation properties is also defined. In Sec. 4, we propose DRFH and analyze its properties. Sec. 5 is dedicated to some practical issues in implementing DRFH. We evaluate the performance of DRFH via trace-driven simulations in Sec. 6. We survey the related work in Sec. 7 and conclude the paper in Sec. 8.

1. While [11] briefly touches on the case where resources are distributed to small servers (known as the discrete scenario), its coverage is rather informal.

Fig. 1. An example of a system consisting of two heterogeneous servers, server 1 with 1 CPU and 14 GB memory and server 2 with 8 CPUs and 4 GB memory, in which user 1 can schedule at most two tasks, each demanding 1 CPU and 4 GB memory. The resources required to execute the two tasks are also highlighted in the figure.
2 LIMITATIONS OF DRF ALLOCATION IN HETEROGENEOUS SYSTEMS

In this section, we briefly review the DRF allocation [11] and show that it may lead to an infeasible allocation when a cloud system is composed of multiple heterogeneous servers. In DRF, the dominant resource is defined for each user as the one that requires the largest fraction of the total availability. The mechanism seeks a maximum allocation that equalizes each user's dominant share, defined as the fraction of the dominant resource the user has been allocated.

Consider an example given in [11]. Suppose that a computing system has 9 CPUs and 18 GB memory, and is shared by two users. User 1 wishes to schedule a set of (divisible) tasks each requiring <1 CPU, 4 GB>, and user 2 has a set of (divisible) tasks each requiring <3 CPUs, 1 GB>. In this example, the dominant resource of user 1 is memory, as each of its tasks demands 1/9 of the total CPU and 2/9 of the total memory. On the other hand, the dominant resource of user 2 is CPU, as each of its tasks requires 1/3 of the total CPU and 1/18 of the total memory. The DRF mechanism then allocates <3 CPUs, 12 GB> to user 1 and <6 CPUs, 2 GB> to user 2, where user 1 schedules three tasks and user 2 schedules two. It is easy to verify that both users receive the same dominant share (i.e., 2/3), and no one can schedule more tasks with additional resources (there is 4 GB memory left unallocated).

The DRF allocation above is based on a simplified all-in-one resource model, where the entire system is modeled as one big server. The allocation hence depends only on the total amount of resources pooled in the system. In the example above, no matter how many servers the system has, and what each server's specification is, as long as the system has 9 CPUs and 18 GB memory in total, the DRF allocation will always schedule three tasks for user 1 and two for user 2. However, this allocation may not be possible to implement, especially when the system

consists of heterogeneous servers. For example, suppose that the resource pool is provided by two servers. Server 1 has 1 CPU and 14 GB memory, and server 2 has 8 CPUs and 4 GB memory. As shown in Fig. 1, even allocating both servers exclusively to user 1, at most two tasks can be scheduled, one in each server. Moreover, even for those server specifications where the DRF allocation is feasible, the mechanism only gives the total amount of resources each user should receive. It remains unclear how many resources a user should be allocated in each server. These problems significantly limit the application of the DRF mechanism. In general, the allocation is valid only when the system contains a single server or multiple homogeneous servers, which is rarely the case under the prevalent datacenter infrastructure.

Despite the limitation of the all-in-one resource model, DRF is shown to possess a set of highly desirable allocation properties for cloud computing systems [11], [15]. A natural question is: how should the DRF intuition be generalized to a heterogeneous environment to achieve similar properties? Note that this is not an easy question to answer. In fact, with heterogeneous servers, even the definition of dominant resource is not so clear. Depending on the server specifications, a resource most demanded in one server (in terms of the fraction of the server's availability) might be the least demanded in another. For instance, in the example of Fig. 1, user 1 demands CPU the most in server 1. But in server 2, it demands memory the most. Should the dominant resource be defined separately for each server, or should it be defined for the entire resource pool? How should the allocation be conducted? And what properties does the resulting allocation preserve? We shall answer these questions in the following sections.

3 SYSTEM MODEL AND ALLOCATION PROPERTIES

In this section, we model multi-resource allocation in a cloud computing system with heterogeneous servers.
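Before formalizing the model, the arithmetic behind the example above is worth making concrete. The following sketch (our own illustration, not from the paper; it uses only the numbers of the text and Fig. 1) verifies that all-in-one DRF equalizes both dominant shares at 2/3, yet the promised three tasks for user 1 cannot be packed onto the two servers:

```python
# All-in-one DRF example from the text: 9 CPUs and 18 GB shared by
# two users; DRF gives user 1 the bundle <3 CPUs, 12 GB> (3 tasks)
# and user 2 the bundle <6 CPUs, 2 GB> (2 tasks).
total = (9.0, 18.0)                        # (CPUs, memory GB)
demand = {1: (1.0, 4.0), 2: (3.0, 1.0)}    # per-task <CPUs, GB>
bundle = {1: (3.0, 12.0), 2: (6.0, 2.0)}   # DRF's all-in-one result

def dominant_share(b):
    # share of the most-consumed resource, relative to the whole pool
    return max(b[r] / total[r] for r in range(2))

# both dominant shares equal 2/3, as stated in the text
assert abs(dominant_share(bundle[1]) - 2/3) < 1e-12
assert abs(dominant_share(bundle[2]) - 2/3) < 1e-12

# Fig. 1 splits the same pool into two servers; since a task must fit
# entirely within one server, user 1 can run at most 2 tasks even if
# it gets both servers exclusively, so DRF's 3-task promise is infeasible.
servers = [(1.0, 14.0), (8.0, 4.0)]        # (CPUs, GB) per server

def max_tasks(server, d):
    # divisible tasks: limited by the scarcest resource in the server
    return min(server[r] / d[r] for r in range(2))

user1_total = sum(max_tasks(s, demand[1]) for s in servers)
assert user1_total == 2.0
```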
We formalize a number of desirable properties that are deemed the most important for allocation mechanisms in cloud computing environments.

3.1 Basic Settings

Let S = {1, ..., k} be the set of heterogeneous servers in the resource pool of the cloud computing system. Let R = {1, ..., m} be the set of m hardware resources provided by each server, e.g., CPU, memory, storage, etc. Let c_l = (c_{l1}, ..., c_{lm})^T be the resource capacity vector of server l ∈ S, where each component c_{lr} denotes the total amount of resource r available in this server. Without loss of generality, we normalize the total availability of every resource to 1, i.e.,

    Σ_{l∈S} c_{lr} = 1,   ∀r ∈ R.

Let U = {1, ..., n} be the set of cloud users sharing the entire system. For every user i, let D_i = (D_{i1}, ..., D_{im})^T be its resource demand vector, where D_{ir} is the amount of resource r required by each instance of the task of user i.

Fig. 2. An example of a system containing two heterogeneous servers, server 1 with 2 CPUs and 12 GB memory and server 2 with 12 CPUs and 2 GB memory, shared by two users. Each computing task of user 1 requires 0.2 CPU time and 1 GB memory, while each computing task of user 2 requires 1 CPU time and 0.2 GB memory.

For simplicity, we assume positive demands, i.e., D_{ir} > 0 for all
User 1 has memory-ntensve tasks each requrng.2 CPU tme and 1 GB memory, whle user 2 has CPU-heavy tasks each requrng 1 CPU tme and.2 GB memory. The demand vector of user 1 s D 1 =(1/7, 1/14) T and the normalzed vector s d 1 =(1/5, 1) T, where memory s the global domnant resource. Smlarly, user 2 has D 2 =(1/14, 1/7) T and d 2 =(1, 1/5) T, and CPU s ts global domnant resource. For now, we assume users have an nfnte number of tasks to be scheduled, and all tasks are dvsble [11], [13] [16]. We shall dscuss how these assumptons can be relaxed n Sec. 5. 3.2 Resource Allocaton For every user and server l, let A l =(A l1,...,a lm ) T be the resource allocaton vector, where A lr s the amount of resource r allocated to user n server l. Let A = (A 1,...,A k ) be the allocaton matrx of user, and A = (A 1,...,A n ) the overall allocaton for all users. We say an allocaton A feasble f no server s requred to use more than any of ts total resources,.e., A lr apple c lr, 8l 2 S, r 2 R. 2U

For each user i, given allocation A_{il} in server l, let N_{il}(A_{il}) be the maximum number of tasks (possibly fractional) it can schedule. We have

    N_{il}(A_{il}) D_{ir} ≤ A_{ilr},   ∀r ∈ R.

As a result,

    N_{il}(A_{il}) = min_{r∈R} {A_{ilr} / D_{ir}}.

The total number of tasks user i can schedule under allocation A_i is hence

    N_i(A_i) = Σ_{l∈S} N_{il}(A_{il}).        (1)

Intuitively, a user prefers an allocation that allows it to schedule more tasks. A well-justified allocation should never give a user more resources than it can actually use in a server. Following the terminology used in the economics literature [17], we call such an allocation non-wasteful:

Definition 1: For user i and server l, an allocation A_{il} is non-wasteful if reducing any resource decreases the number of tasks scheduled, i.e., for all A'_{il} ⪇ A_{il}, we have

    N_{il}(A'_{il}) < N_{il}(A_{il}).

Further, user i's allocation A_i = (A_{il}) is non-wasteful if A_{il} is non-wasteful for all servers l, and allocation A = (A_i) is non-wasteful if A_i is non-wasteful for all users i.

Note that one can always convert an allocation to a non-wasteful one by revoking those resources that are allocated but never actually used, without changing the number of tasks scheduled for any user. Unless otherwise specified, we limit the discussion to non-wasteful allocations.

3.3 Allocation Mechanism and Desirable Properties

A resource allocation mechanism takes user demands as input and outputs the allocation result. In general, an allocation mechanism should provide the following essential properties, which are widely recognized as the most important fairness and efficiency measures in both cloud computing systems [11], [12], [18] and the economics literature [17], [19].

Envy-freeness: An allocation mechanism is envy-free if no user prefers another's allocation to its own, i.e.,

    N_i(A_i) ≥ N_i(A_j),   ∀i, j ∈ U.

This property essentially embodies the notion of fairness.

Pareto optimality: An allocation mechanism is Pareto optimal if it returns an allocation A such that for every feasible allocation A', if N_i(A'_i) > N_i(A_i) for some user i, then there exists a user j ≠ i such that N_j(A'_j) < N_j(A_j).
In other words, allocation A cannot be further improved such that all users are at least as well off and at least one user is strictly better off. This property ensures allocation efficiency and is critical to achieving high resource utilization.

Group strategyproofness: An allocation mechanism is group strategyproof if whenever a coalition of users misreport their resource demands (assuming a user's demand is its private information), there is a member of the coalition who would schedule fewer tasks and hence has no incentive to join the coalition. Specifically, let M ⊆ U be the coalition of manipulators, in which each user i ∈ M misreports its demand as D'_i ≠ D_i. Let A' be the allocation returned. Also, let A be the allocation returned when all users truthfully report their demands. The allocation mechanism is group strategyproof if there exists a manipulator i ∈ M who cannot schedule more tasks than by being truthful, i.e., N_i(A'_i) ≤ N_i(A_i). In other words, user i is better off quitting the coalition. Group strategyproofness is of special importance for a cloud computing system, as it is common to observe in real-world systems that users try to manipulate the scheduler for more allocation by lying about their resource demands [11], [18].

Sharing incentive is another critical property that has been frequently mentioned in the literature [11]-[13], [15]. It ensures that every user's allocation is not worse off than that obtained by evenly dividing the entire resource pool. While this property is well defined for a single server, it is not for a system containing multiple heterogeneous servers, as there are an infinite number of ways to evenly divide the resource pool among users, and it is unclear which one should be selected as the benchmark to compare with. We give a specific discussion in Sec. 4.5, where we compare two reasonable alternatives.

2. For any two vectors x and y, we say x ⪇ y if x_r ≤ y_r for all r and x_j < y_j for some j.
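To make these definitions concrete, the following sketch (our own illustration; the allocation is hypothetical) counts divisible tasks as N_i(·) does and checks envy-freeness for the Fig. 2 users:

```python
# Sketch: checking envy-freeness of a concrete allocation. N_i(A) sums
# over servers the tasks user i can run from its per-server bundle;
# envy-freeness requires N_i(A_i) >= N_i(A_j) for all users i, j.
demands = {1: (0.2, 1.0), 2: (1.0, 0.2)}      # per-task (CPUs, GB)
alloc = {1: [(2.0, 12.0), (0.0, 0.0)],        # user 1: all of server 1
         2: [(0.0, 0.0), (12.0, 2.0)]}        # user 2: all of server 2

def tasks(user, bundle):
    # divisible tasks schedulable from a per-server bundle list
    D = demands[user]
    return sum(min(a[r] / D[r] for r in range(2)) for a in bundle)

envy_free = all(tasks(i, alloc[i]) >= tasks(i, alloc[j])
                for i in alloc for j in alloc)
assert envy_free
# each user runs 10 tasks from its own bundle, only 2 from the other's
assert abs(tasks(1, alloc[1]) - 10.0) < 1e-9
assert abs(tasks(1, alloc[2]) - 2.0) < 1e-9
```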
In addition to the four essential allocation properties above, we also consider four other important properties as follows:

Single-server DRF: If the system contains only one server, then the resulting allocation should reduce to the DRF allocation.

Single-resource fairness: If there is a single resource in the system, then the resulting allocation should reduce to a max-min fair allocation.

Bottleneck fairness: If all users bottleneck on the same resource (i.e., have the same global dominant resource), then the resulting allocation should reduce to a max-min fair allocation for that resource.

Population monotonicity: If a user leaves the system and relinquishes all its allocations, then the remaining users will not see any reduction in the number of tasks scheduled.

To summarize, our objective is to design an allocation mechanism that guarantees all the properties defined above.

3.4 Naive DRF Extension and Its Inefficiency

It has been shown in [11], [15] that the DRF allocation satisfies all the desirable properties mentioned above when the entire resource pool is modeled as one server. When resources are distributed to multiple heterogeneous servers, a naive generalization is to separately apply the DRF allocation per server. For instance, consider the example of Fig. 2. We first apply DRF in server 1. Because CPU is the dominant resource of both users in this server, it is equally divided between them, each receiving 1 CPU. As a result, user 1 schedules 5 tasks on server 1, while user 2 schedules one. Similarly, in server 2,

Fig. 3. The per-server DRF allocation for the example shown in Fig. 2, where user 1 is allocated 5 tasks in server 1 and 1 in server 2, while user 2 is allocated 1 task in server 1 and 5 in server 2.

memory is the dominant resource of both users and is evenly allocated, leading to one task scheduled for user 1 and five for user 2. The resulting allocations in the two servers are illustrated in Fig. 3, where both users schedule 6 tasks. Unfortunately, this allocation violates Pareto optimality and is highly inefficient. If we instead allocate server 1 exclusively to user 1 and server 2 exclusively to user 2, then both users schedule 10 tasks, almost twice the number scheduled under the per-server DRF allocation. In fact, a similar example can be constructed to show that per-server DRF may lead to arbitrarily low resource utilization. The failure of this naive DRF extension to the heterogeneous environment necessitates an alternative allocation mechanism, which is the main theme of the next section.

4 DRFH ALLOCATION AND ITS PROPERTIES

In this section, we present DRFH, a generalization of DRF for a heterogeneous cloud computing system where resources are distributed over a number of heterogeneous servers. We analyze DRFH and show that it provides all the desirable properties defined in Sec. 3.

4.1 DRFH Allocation

Instead of allocating separately in each server, DRFH jointly considers resource allocation across all heterogeneous servers. The key intuition is to achieve the max-min fair allocation of the global dominant resources. Specifically, given allocation A_{il}, let

    G_{il}(A_{il}) = N_{il}(A_{il}) D_{ir*_i} = min_{r∈R} {A_{ilr} / d_{ir}}        (2)

be the amount of the global dominant resource user i is allocated in server l. Since the total availability of every resource is normalized to 1, we also refer to G_{il}(A_{il}) as the global dominant share user i receives in server l. Simply adding up G_{il}(A_{il}) over all servers gives the global dominant share user i receives under allocation A_i, i.e.,

    G_i(A_i) = Σ_{l∈S} G_{il}(A_{il}) = Σ_{l∈S} min_{r∈R} {A_{ilr} / d_{ir}}.        (3)
The DRFH allocation aims to maximize the minimum global dominant share among all users, subject to the resource constraints per server, i.e.,

    max_A    min_{i∈U} G_i(A_i)
    s.t.     Σ_{i∈U} A_{ilr} ≤ c_{lr},   ∀l ∈ S, r ∈ R.        (4)

Recall that, without loss of generality, we assume a non-wasteful allocation A (see Sec. 3.2). We have the following structural result; its proof is deferred to the appendix.³

Lemma 1: For user i and server l, an allocation A_{il} is non-wasteful if and only if there exists some g_{il} such that A_{il} = g_{il} d_i. In particular, g_{il} is the global dominant share user i receives in server l under allocation A_{il}, i.e., g_{il} = G_{il}(A_{il}).

Intuitively, Lemma 1 indicates that under a non-wasteful allocation, resources are allocated in proportion to the user's demand. Lemma 1 immediately suggests the following relationship for every user i and its non-wasteful allocation A_i:

    G_i(A_i) = Σ_{l∈S} G_{il}(A_{il}) = Σ_{l∈S} g_{il}.

Problem (4) can hence be equivalently written as

    max_{{g_{il}}}    min_{i∈U} Σ_{l∈S} g_{il}
    s.t.              Σ_{i∈U} g_{il} d_{ir} ≤ c_{lr},   ∀l ∈ S, r ∈ R,        (5)

where the constraints are derived from Lemma 1. Now let g = min_{i∈U} Σ_{l∈S} g_{il}. Via straightforward algebraic operations, we see that (5) is equivalent to the following problem:

    max_{{g_{il}}, g}    g
    s.t.                 Σ_{i∈U} g_{il} d_{ir} ≤ c_{lr},   ∀l ∈ S, r ∈ R,
                         Σ_{l∈S} g_{il} = g,   ∀i ∈ U.        (6)

Note that the second constraint embodies fairness in terms of the equalized global dominant share g. By solving (6), DRFH allocates each user the maximum global dominant share g, under the constraints of both server capacity and fairness. The allocation received by each user i in server l is simply A_{il} = g_{il} d_i. For example, Fig. 4 illustrates the resulting DRFH allocation for the example of Fig. 2. By solving (6), DRFH allocates server 1 exclusively to user 1 and server 2 exclusively to user 2, allowing each user to schedule 10 tasks with the maximum global dominant share g = 5/7.

We next analyze the properties of the DRFH allocation obtained by solving (6). Our analysis starts with the four essential resource allocation properties, namely, envy-freeness, Pareto optimality, group strategyproofness, and sharing incentive.
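Problem (6) is a linear program in the variables {g_{il}} and g. Rather than invoking an LP solver, the sketch below (our own check, not from the paper; the candidate allocation is the one DRFH produces) verifies that exclusive assignment is feasible for (6) and equalizes both users' global dominant shares at 5/7:

```python
# Sketch: checking a candidate solution of problem (6) for Fig. 2.
# By Lemma 1, a non-wasteful allocation is A_il = g_il * d_i, so it
# suffices to work with the dominant shares g_il directly.
c = [(1/7, 6/7), (6/7, 1/7)]        # normalized server capacities c_l
d = {1: (1/5, 1.0), 2: (1.0, 1/5)}  # normalized demand vectors d_i

g = {(1, 1): 5/7, (1, 2): 0.0,      # g[i, l]: user i's global dominant
     (2, 1): 0.0, (2, 2): 5/7}      # share obtained in server l

# capacity constraints: sum_i g_il * d_ir <= c_lr for every (l, r)
feasible = all(sum(g[i, l + 1] * d[i][r] for i in d) <= c[l][r] + 1e-12
               for l in range(2) for r in range(2))
assert feasible

# fairness constraint: every user's total share equals g = 5/7,
# i.e., 10 GB (resp. 10 CPUs) out of 14, which is 10 tasks each
shares = {i: g[i, 1] + g[i, 2] for i in d}
assert abs(shares[1] - 5/7) < 1e-12 and abs(shares[2] - 5/7) < 1e-12
```

Any allocation giving some user a larger total share would violate a capacity constraint in one of the two servers, which is why 5/7 is the optimum here.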
3. The appendix is given in a supplementary document, as per the TPDS submission guidelines.

Fig. 4. An alternative allocation with higher system utilization for the example of Fig. 2. Servers 1 and 2 are exclusively assigned to users 1 and 2, respectively. Both users schedule 10 tasks.

4.2 Envy-Freeness

We first show by the following proposition that under the DRFH allocation, no user prefers another's allocation to its own.

Proposition 1 (Envy-freeness): The DRFH allocation obtained by solving (6) is envy-free.

Proof: Let {g_{il}}, g be the solution to problem (6). For each user i, its DRFH allocation in server l is A_{il} = g_{il} d_i. To show N_i(A_j) ≤ N_i(A_i) for any two users i and j, it is equivalent to prove G_i(A_j) ≤ G_i(A_i). We have

    G_i(A_j) = Σ_l G_{il}(A_{jl})
             = Σ_l min_r {g_{jl} d_{jr} / d_{ir}}
             ≤ Σ_l g_{jl} = g = G_i(A_i),

where the inequality holds because

    min_r {d_{jr} / d_{ir}} ≤ d_{jr*_i} / d_{ir*_i} = d_{jr*_i} ≤ 1,

where r*_i is user i's global dominant resource.

4.3 Pareto Optimality

We next show that DRFH leads to an efficient allocation under which no user can improve its allocation without decreasing that of the others.

Proposition 2 (Pareto optimality): The DRFH allocation obtained by solving (6) is Pareto optimal.

Proof: Let {g_{il}}, g be the solution to problem (6). For each user i, its DRFH allocation in server l is A_{il} = g_{il} d_i. Since (5) and (6) are equivalent, {g_{il}} is also a solution to (5), and g is the maximum value of the objective of (5). Assume, by way of contradiction, that allocation A is not Pareto optimal, i.e., there exists some allocation A' such that N_i(A'_i) ≥ N_i(A_i) for all users i, and for some user j we have the strict inequality N_j(A'_j) > N_j(A_j). Equivalently, this implies G_i(A'_i) ≥ G_i(A_i) for all users i, and G_j(A'_j) > G_j(A_j) for user j. Without loss of generality, let A' be non-wasteful. By Lemma 1, for every user i and server l, there exists some g'_{il} such that A'_{il} = g'_{il} d_i. We show that based on {g'_{il}}, one can construct some {ĝ_{il}} that is a feasible solution to (5), yet leads to a higher objective value than g, contradicting the fact that {g_{il}}, g optimally solve (5).
To see this, consider user $j$. We have $G_j(A_j) = \sum_l g_{jl} = g < G_j(A'_j) = \sum_l g'_{jl}$. Hence, for user $j$, there exist a server $l_0$ and some $\delta > 0$ such that after reducing $g'_{jl_0}$ to $g'_{jl_0} - \delta$, the resulting global dominant share remains at least $g$, i.e., $\sum_l g'_{jl} - \delta \ge g$. This reduction leaves at least $\delta d_j$ idle resources in server $l_0$. We construct $\{\hat{g}_{il}\}$ by redistributing these idle resources to all users to increase their global dominant shares, therefore strictly improving the objective of (5). Denote by $\{\tilde{g}_{il}\}$ the dominant shares after reducing $g'_{jl_0}$ to $g'_{jl_0} - \delta$, i.e.,
$$\tilde{g}_{il} = \begin{cases} g'_{jl_0} - \delta, & i = j,\ l = l_0; \\ g'_{il}, & \text{otherwise.} \end{cases}$$
The corresponding non-wasteful allocation is $\tilde{A}_{il} = \tilde{g}_{il} d_i$ for every user $i$ and server $l$. Note that allocation $\tilde{A}$ is weakly preferred to the original allocation A by all users, i.e., for every user $i$,
$$G_i(\tilde{A}_i) = \sum_l \tilde{g}_{il} = \begin{cases} \sum_l g'_{jl} - \delta \ge g = G_j(A_j), & i = j; \\ \sum_l g'_{il} = G_i(A'_i) \ge G_i(A_i), & \text{otherwise.} \end{cases}$$
We now construct $\{\hat{g}_{il}\}$ by redistributing the $\delta d_j$ idle resources in server $l_0$ to all users, each increasing its global dominant share $\tilde{g}_{il_0}$ by $\Delta = \min_r \{ \delta d_{jr} / \sum_i d_{ir} \}$, i.e.,
$$\hat{g}_{il} = \begin{cases} \tilde{g}_{il} + \Delta, & l = l_0; \\ \tilde{g}_{il}, & \text{otherwise.} \end{cases}$$
It is easy to check that $\{\hat{g}_{il}\}$ remains a feasible allocation. To see this, it suffices to check server $l_0$. For each of its resources $r$, we have
$$\sum_i \hat{g}_{il_0} d_{ir} = \sum_i (\tilde{g}_{il_0} + \Delta) d_{ir} = \sum_i g'_{il_0} d_{ir} - \delta d_{jr} + \Delta \sum_i d_{ir} \le c_{l_0 r} - \Big( \delta d_{jr} - \Delta \sum_i d_{ir} \Big) \le c_{l_0 r},$$
where the first inequality holds because A′ is a feasible allocation, and the second follows from the definition of $\Delta$. On the other hand, for every user $i \in U$, we have
$$\sum_l \hat{g}_{il} = \sum_l \tilde{g}_{il} + \Delta = G_i(\tilde{A}_i) + \Delta \ge G_i(A_i) + \Delta > g.$$
This contradicts the premise that $g$ is optimal for (5).

4.4 Group Strategyproofness

So far, all our discussions have been based on the critical assumption that all users truthfully report their resource demands. In a real-world system, however, it is common for users to attempt to manipulate the scheduler by misreporting their resource demands so as to receive a larger allocation [11], [18]. More often than not, such strategic behaviour significantly hurts honest users and reduces the number of their scheduled tasks, inevitably leading to a fairly inefficient allocation outcome. Fortunately, we show by the following proposition that DRFH is immune to these strategic

behaviours, as reporting the true demand is always the dominant strategy for every user, even if users form a coalition to misreport together.

Proposition 3 (Group strategyproofness): The DRFH allocation obtained by solving (6) is group strategyproof, in the sense that no coalition of users misreporting their demands can strictly benefit every member of the coalition.

Proof: Let $M \subseteq U$ be the set of strategic users forming a coalition to misreport their normalized demand vectors $d'_M = (d'_i)_{i \in M}$, where $d'_i \ne d_i$ for all $i \in M$. Let $d'$ be the collection of normalized demand vectors submitted by all users, where $d'_i = d_i$ for all $i \in U \setminus M$. Let A′ be the resulting allocation obtained by solving (6). In particular, $A'_{il} = g'_{il} d'_i$ for each user $i$ and server $l$, and $g' = \sum_l g'_{il}$, where $\{g'_{il}\}, g'$ solve (6). On the other hand, let A be the allocation returned when all users truthfully report their demands, and $\{g_{il}\}, g$ the solution to (6) with the truthful $d$. Similarly, for each user $i$ and server $l$, we have $A_{il} = g_{il} d_i$ and $g = \sum_l g_{il}$. We check the following two cases and show that there exists a user $i \in M$ such that $G_i(A'_i) \le G_i(A_i)$, which is equivalent to $N_i(A'_i) \le N_i(A_i)$.

Case 1: $g' \le g$. In this case, let $\theta_i = \min_r \{ d'_{ir} / d_{ir} \}$ be defined for every user $i \in M$. Clearly,
$$\theta_i = \min_r \{ d'_{ir} / d_{ir} \} \le d'_{ir_i^*} / d_{ir_i^*} = d'_{ir_i^*} \le 1,$$
where $r_i^*$ is the dominant resource of user $i$. We then have, for any $i \in M$,
$$G_i(A'_i) = \sum_l G_{il}(A'_{il}) = \sum_l G_{il}(g'_{il} d'_i) = \sum_l \min_r \{ g'_{il} d'_{ir} / d_{ir} \} = \theta_i g' \le g' \le g = G_i(A_i).$$

Case 2: $g' > g$. We first consider the users that are not manipulators. Since they truthfully report their demands, we have
$$G_j(A'_j) = g' > g = G_j(A_j), \quad \forall j \in U \setminus M. \quad (7)$$
Now among the manipulators, there must be a user $i \in M$ such that $G_i(A'_i) < G_i(A_i)$. Otherwise, allocation A′ would be weakly preferred to allocation A by all users and strictly preferred by those in $U \setminus M$, which contradicts the facts that A is a Pareto optimal allocation and A′ is a feasible allocation.

4.5 Sharing Incentive

In addition to the three properties established above, sharing incentive is another critical allocation property that has been frequently mentioned in the literature, e.g., [11]–[13], [15], [18].
The property ensures that every user can schedule at least as many tasks as it could if the entire resource pool were evenly partitioned among all users; it thus provides service isolation among the users. While the sharing incentive property is well defined in the all-in-one resource model, it is not for a system with multiple heterogeneous servers. In the former case, since the entire resource pool is abstracted as a single server, evenly dividing every resource of this big server leads to a unique allocation. However, when the system consists of multiple heterogeneous servers, there are many different ways to evenly divide these servers, and it is unclear which one should be used as a benchmark for comparison. For instance, in the example of Fig. 2, two users share a system with 14 CPUs and 14 GB of memory in total. The following two allocations both allocate each user 7 CPUs and 7 GB of memory: (a) user 1 is allocated 1/2 the resources of server 1 and 1/2 the resources of server 2, while user 2 is allocated the rest; (b) user 1 is allocated (1.5 CPUs, 5.5 GB) in server 1 and (5.5 CPUs, 1.5 GB) in server 2, while user 2 is allocated the rest. It is easy to verify that the two allocations lead to different numbers of tasks scheduled for the same user, and either could be used as an allocation benchmark. In fact, one can construct many other allocations that evenly divide all resources among the users. Despite this general ambiguity, in the next two subsections we consider two definitions of the sharing incentive property, strong and weak, depending on the choice of the benchmark for equal partitioning of resources.

4.5.1 Strong Sharing Incentive

Among the various allocations that evenly divide all servers, perhaps the most straightforward is to evenly partition each server's availability $c_l$ among all $n$ users. The strong sharing incentive property is defined by using this per-server partitioning as a benchmark.
Definition 2 (Strong sharing incentive): Allocation A satisfies the strong sharing incentive property if each user schedules at least as many tasks as it would by evenly partitioning each server, i.e.,
$$N_i(A_i) = \sum_l N_i(A_{il}) \ge \sum_l N_i(c_l / n), \quad \forall i \in U.$$
Before we proceed, it is worth mentioning that the per-server partitioning above cannot be directly implemented in practice. With a large number of users, everyone would be allocated only a tiny fraction of each server's availability. In practice, such a small slice of resources usually cannot be used to run any computing task. However, per-server partitioning may be interpreted as follows. Since a cloud system is constructed by pooling hundreds of thousands of servers [2], [3], the number of users is typically far smaller than the number of servers [11], [18], i.e., $n \ll k$. An equal partition could randomly allocate to each user $k/n$ servers, which is equivalent to randomly allocating each server to each user with probability $1/n$. It is easy to see that the mean number of tasks scheduled for each user under this random allocation is $\sum_l N_i(c_l / n)$, the same as that obtained under per-server partitioning. Unfortunately, the following proposition shows that DRFH may violate the sharing incentive property in the strong sense. The proof gives a counterexample.

Proposition 4: DRFH does not satisfy the property of strong sharing incentive.

Proof: Consider a system consisting of two servers. Server 1 has 1 CPU and 2 GB of memory; server 2 has 4 CPUs and 3 GB of memory. There are two users. Each instance of user 1's task demands 1 CPU and 1 GB of memory; each of user 2's

tasks demands 3 CPUs and 2 GB of memory. In this case, we have $c_1 = (1/5, 2/5)^T$, $c_2 = (4/5, 3/5)^T$, $D_1 = (1/5, 1/5)^T$, $D_2 = (3/5, 2/5)^T$, $d_1 = (1, 1)^T$, and $d_2 = (1, 2/3)^T$. It is easy to verify that under DRFH, the global dominant share both users receive is 12/25. On the other hand, under per-server partitioning, the global dominant share user 2 receives is 1/2, higher than that received under DRFH.

While DRFH may violate the strong sharing incentive property, we shall show via trace-driven simulations in Sec. 6 that this happens only in rare cases.

4.5.2 Weak Sharing Incentive

The strong sharing incentive property is defined by choosing per-server partitioning as a benchmark, which is only one of many ways to evenly divide the total availability. In general, any partition that allocates every user an equal share of every resource can be used as a benchmark. This allows us to relax the sharing incentive definition. We first define an equal partition as follows.

Definition 3 (Equal partition): Allocation A is an equal partition if it divides every resource evenly among all users, i.e.,
$$\sum_l A_{ilr} = 1/n, \quad \forall r \in R,\ i \in U.$$
It is easy to verify that the aforementioned per-server partition is an equal partition. We are now ready to define the weak sharing incentive property.

Definition 4 (Weak sharing incentive): Allocation A satisfies the weak sharing incentive property if there exists an equal partition A′ under which each user schedules no more tasks than it does under A, i.e.,
$$N_i(A'_i) \le N_i(A_i), \quad \forall i \in U.$$
In other words, the weak sharing incentive property only requires the allocation to be better than some equal partition, without specifying its specific form. It is hence a more relaxed requirement than the strong sharing incentive property. The following proposition shows that DRFH satisfies the sharing incentive property in the weak sense. The proof is constructive.

Proposition 5 (Weak sharing incentive): DRFH satisfies the property of weak sharing incentive.
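The arithmetic of this counterexample can be checked directly. The sketch below recomputes, with exact rational arithmetic, user 2's global dominant share under the per-server equal partition and compares it against the 12/25 stated for DRFH:

```python
from fractions import Fraction as F

# The two-server counterexample of Proposition 4, in absolute units.
servers = [(1, 2), (4, 3)]   # (CPUs, GB of memory) of servers 1 and 2
demand2 = (3, 2)             # user 2's per-task demand
n = 2                        # number of users
total = [sum(s[r] for s in servers) for r in range(2)]   # (5, 5)

# Divisible tasks user 2 can run when every server is split evenly in two:
tasks2 = sum(min(F(s[r], n) / demand2[r] for r in range(2)) for s in servers)
dom2 = max(F(demand2[r], total[r]) for r in range(2))    # dominant demand: 3/5

share2 = tasks2 * dom2
assert tasks2 == F(5, 6)       # 1/6 of a task on server 1, 2/3 on server 2
assert share2 == F(1, 2)       # per-server partitioning gives user 2 share 1/2
assert share2 > F(12, 25)      # which exceeds the 12/25 received under DRFH
```

Since 1/2 > 12/25, user 2 strictly prefers the per-server partition, confirming the violation of strong sharing incentive.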
Proof: Let $g$ be the global dominant share each user receives under a DRFH allocation A, and $g_{il}$ the global dominant share user $i$ receives in server $l$. We construct an equal partition A′ under which users schedule no more tasks than they do under A.

Case 1: $g \ge 1/n$. In this case, let A′ be any equal partition. We show that each user schedules no more tasks under A′ than under A. To see this, consider the DRFH allocation A. Since it is non-wasteful, the number of tasks user $i$ schedules is
$$N_i(A_i) = g / D_{ir_i^*} \ge 1/(n D_{ir_i^*}),$$
where $r_i^*$ is user $i$'s global dominant resource. On the other hand, the number of tasks user $i$ schedules under A′ is at most
$$N_i(A'_i) = \sum_l \min_r \{ A'_{ilr} / D_{ir} \} \le \sum_l A'_{ilr_i^*} / D_{ir_i^*} = 1/(n D_{ir_i^*}) \le N_i(A_i).$$

Case 2: $g < 1/n$. In this case, no resource has been fully allocated under A, since
$$\sum_{i \in U} \sum_l A_{ilr} = \sum_{i \in U} \sum_l g_{il} d_{ir} \le \sum_{i \in U} \sum_l g_{il} = ng < 1$$
for every resource $r \in R$. Let $L_{lr} = c_{lr} - \sum_{i \in U} A_{ilr}$ be the amount of resource $r$ left unallocated in server $l$, and let $L_r = \sum_l L_{lr} = 1 - \sum_{i \in U} \sum_l A_{ilr}$ be the total amount of resource $r$ left unallocated. We are now ready to construct an equal partition A′ based on A. Since A′ should allocate each user $1/n$ of the total availability of every resource $r$, the additional amount of resource $r$ user $i$ needs to obtain is
$$u_{ir} = 1/n - \sum_l A_{ilr}.$$
It is easy to see that $u_{ir} > 0$ for all $i \in U$, $r \in R$. The fraction of the unallocated resource $r$ demanded by user $i$ is $f_{ir} = u_{ir} / L_r$. As a result, we can construct A′ by reallocating the leftover resources in each server to the users in proportion to their demands, i.e.,
$$A'_{ilr} = A_{ilr} + L_{lr} f_{ir}, \quad \forall i \in U,\ l \in S,\ r \in R.$$
It is easy to verify that A′ is an equal partition:
$$\sum_l A'_{ilr} = \sum_l L_{lr} f_{ir} + \sum_l A_{ilr} = (u_{ir} / L_r) \sum_l L_{lr} + \sum_l A_{ilr} = u_{ir} + \sum_l A_{ilr} = 1/n, \quad \forall i \in U,\ r \in R.$$
We now compare the number of tasks scheduled for each user under allocations A and A′. Because A′ allocates to each user at least as many resources as A does, we have $N_i(A'_i) \ge N_i(A_i)$ for all $i$. On the other hand, by the Pareto optimality of allocation A, no user can schedule more tasks without decreasing the number of tasks scheduled for others. Therefore, we must have $N_i(A'_i) = N_i(A_i)$ for all $i$.
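The constructive step of Case 2, handing out each server's leftover $L_{lr}$ in proportion $f_{ir} = u_{ir}/L_r$, can be illustrated on a small hypothetical instance (2 users, 2 servers, 2 resources, with the allocation chosen so that $g < 1/n$; all numbers are illustrative):

```python
from fractions import Fraction as F

n, k, m = 2, 2, 2   # users, servers, resources (each resource sums to 1 system-wide)
c = [[F(1, 2), F(1, 2)], [F(1, 2), F(1, 2)]]          # c[l][r]: server capacities
# A hypothetical non-wasteful allocation A[i][l][r] that leaves every resource
# under-allocated (the g < 1/n case of the proof):
A = [[[F(1, 10), F(1, 10)], [F(1, 10), F(1, 10)]],
     [[F(1, 5),  F(1, 10)], [F(1, 10), F(1, 10)]]]

# Leftover of each resource, per server (L_lr) and in total (L_r).
L = [[c[l][r] - sum(A[i][l][r] for i in range(n)) for r in range(m)] for l in range(k)]
Lr = [sum(L[l][r] for l in range(k)) for r in range(m)]

# Each user i still needs u[i][r] of resource r to reach the 1/n equal share;
# give it out of every server's leftover in proportion f_ir = u_ir / L_r.
u = [[F(1, n) - sum(A[i][l][r] for l in range(k)) for r in range(m)] for i in range(n)]
Ap = [[[A[i][l][r] + L[l][r] * u[i][r] / Lr[r] for r in range(m)]
       for l in range(k)] for i in range(n)]

for i in range(n):
    for r in range(m):
        assert sum(Ap[i][l][r] for l in range(k)) == F(1, n)   # equal partition
for l in range(k):
    for r in range(m):
        assert sum(Ap[i][l][r] for i in range(n)) <= c[l][r]   # still feasible
```

Since $\sum_i u_{ir} = L_r$, the redistribution exactly exhausts the leftovers, so A′ is feasible by construction.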

4.5.3 Discussion

Strong sharing incentive provides more predictable service isolation than weak sharing incentive does. It assures a user a priori that it can schedule at least the number of tasks it would when every server is evenly allocated. This gives users a concrete idea of the worst Quality of Service (QoS) they may receive, allowing them to predict their computing performance. While weak sharing incentive also provides some degree of service isolation, a user cannot infer a priori the guaranteed number of tasks it can schedule from this weaker property, and therefore cannot predict its computing performance. We note that the root cause of this degradation of service isolation is the heterogeneity among servers. When all servers have the same hardware specification, DRFH reduces to DRF, and strong sharing incentive is guaranteed. This is also the case for schedulers adopting the single-resource abstraction. For example, in Hadoop, each server is divided into several slots (e.g., reducers and mappers). The Hadoop Fair Scheduler [20] allocates these slots evenly to all users. We see that predictable service isolation is achieved: each user receives at least $k_s / n$ slots, where $k_s$ is the number of slots and $n$ is the number of users. In general, one can view weak sharing incentive as the price DRFH pays to achieve high resource utilization. In fact, naively applying the DRF allocation separately to each server retains strong sharing incentive: in each server, the DRF allocation ensures that a user can schedule at least the number of tasks it would when resources are evenly allocated [11], [15]. However, as we have seen in Sec. 3.4, such a naive DRF extension may lead to extremely low resource utilization that is unacceptable. A similar problem exists for traditional schedulers adopting single-resource abstractions: by artificially dividing servers into slots, these schedulers cannot match computing demands to available resources at a fine granularity, resulting in poor resource utilization in practice [11].
For these reasons, we believe that slightly trading off the degree of service isolation for much higher resource utilization is well justified. We shall use trace-driven simulation in Sec. 6.3 to show that DRFH violates strong sharing incentive only in rare cases in the Google cluster.

4.6 Other Important Properties

In addition to the four essential properties shown in the previous subsections, DRFH provides a number of other important properties. First, since DRFH generalizes DRF to heterogeneous environments, it naturally reduces to the DRF allocation when the system contains only one server, in which case the global dominant resource defined in DRFH coincides with the dominant resource defined in DRF.

Proposition 6 (Single-server DRF): DRFH leads to the same allocation as DRF when all resources are concentrated in one server.

Next, by definition, both single-resource fairness and bottleneck fairness trivially hold for the DRFH allocation. We hence omit the proofs of the following two propositions.

Proposition 7 (Single-resource fairness): The DRFH allocation satisfies single-resource fairness.

Proposition 8 (Bottleneck fairness): The DRFH allocation satisfies bottleneck fairness.

Finally, when a user leaves the system and relinquishes all its allocations, the remaining users will not see any reduction in the number of tasks scheduled. Formally,

Proposition 9 (Population monotonicity): The DRFH allocation satisfies population monotonicity.

Proof: Let A be the resulting DRFH allocation; then for every user $i$ and server $l$, we have $A_{il} = g_{il} d_i$ and $G_i(A_i) = g$, where $\{g_{il}\}$ and $g$ solve (6). Suppose that user $j$ leaves the system, changing the resulting DRFH allocation to A′. By DRFH, for every user $i \ne j$ and server $l$, we have $A'_{il} = g'_{il} d_i$ and $G_i(A'_i) = g'$, where $\{g'_{il}\}_{i \ne j}$ and $g'$ solve the following optimization problem:
$$\begin{aligned} \max_{g,\ \{g_{il}\}_{i \ne j}} \quad & g \\ \text{s.t.} \quad & \sum_{i \ne j} g_{il} d_{ir} \le c_{lr}, \quad \forall l \in S,\ r \in R, \\ & \sum_l g_{il} = g, \quad \forall i \ne j. \end{aligned} \quad (8)$$
To show $N_i(A'_i) \ge N_i(A_i)$ for every user $i \ne j$, it is equivalent to prove $G_i(A'_i) \ge G_i(A_i)$.
It is easy to verify that $g$ and $\{g_{il}\}_{i \ne j}$ satisfy all the constraints of (8) and are hence feasible for (8). As a result, $g' \ge g$, which is exactly $G_i(A'_i) \ge G_i(A_i)$.

5 PRACTICAL CONSIDERATIONS

So far, our discussions have been based on several assumptions that may not hold in a real-world system. In this section, we relax these assumptions and discuss how DRFH can be implemented in practice.

5.1 Weighted Users with a Finite Number of Tasks

In the previous sections, users are assumed to have equal weights and infinite computing demands. Both assumptions can be easily removed with minor modifications of DRFH. When users are assigned uneven weights, let $w_i$ be the weight associated with user $i$. DRFH seeks an allocation that achieves weighted max-min fairness across users. Specifically, we maximize the minimum normalized global dominant share (w.r.t. the weights) of all users under the same resource constraints as in (4), i.e.,
$$\begin{aligned} \max_A \quad & \min_{i \in U}\ G_i(A_i) / w_i \\ \text{s.t.} \quad & \sum_{i \in U} A_{ilr} \le c_{lr}, \quad \forall l \in S,\ r \in R. \end{aligned}$$
When users have a finite number of tasks, the DRFH allocation is computed iteratively. In each round, DRFH increases the global dominant share allocated to all active users until one of them has all its tasks scheduled, after which that user becomes inactive and is no longer considered in subsequent allocation rounds. DRFH then starts a new iteration

and repeats the allocation process above, until no user is active or no more resources can be allocated to users. Because each iteration saturates at least one user's resource demand, the allocation completes in at most $n$ rounds, where $n$ is the number of users. (For medium- and large-sized cloud clusters, $n$ is in the order of thousands [3], [8].) Our analysis presented in Sec. 4 also extends to weighted users with a finite number of tasks.

5.2 Scheduling Tasks as Entities

Until now, we have assumed that all tasks are divisible. In a real-world system, however, fractional tasks may not be accepted. To schedule tasks as entities, one can apply progressive filling as a simple implementation of DRFH. (Progressive filling has also been used to implement the DRF allocation [11]; when the system consists of multiple heterogeneous servers, progressive filling leads to a DRFH allocation.) That is, whenever there is a scheduling opportunity, the scheduler always accommodates the user with the lowest global dominant share. To do so, it picks the first server whose remaining resources are sufficient to accommodate the user's task. While this First-Fit algorithm offers a fairly good approximation of DRFH, we propose another simple heuristic that can lead to a better allocation with higher resource utilization. Like First-Fit, the heuristic chooses the user with the lowest global dominant share to serve. However, instead of picking the first server that fits, the heuristic chooses the server whose remaining resources most suitably match the demand of the user's tasks, and is hence referred to as Best-Fit DRFH. Specifically, for user $i$ with resource demand vector $D_i = (D_{i1}, \dots, D_{im})^T$ and a server $l$ with available resource vector $\bar{c}_l = (\bar{c}_{l1}, \dots, \bar{c}_{lm})^T$, where $\bar{c}_{lr}$ is the share of resource $r$ remaining available in server $l$, we define the following heuristic function to quantitatively measure the fitness of the task for server $l$:
$$H(i, l) = \| D_i / D_{i1} - \bar{c}_l / \bar{c}_{l1} \|_1, \quad (9)$$
where $\| \cdot \|_1$ is the $L_1$-norm. Intuitively, the smaller $H(i, l)$, the more similar the resource demand vector $D_i$ is to the server's available resource vector $\bar{c}_l$, and the better fit user $i$'s task is for server $l$. Best-Fit DRFH schedules user $i$'s tasks onto the server $l$ with the least $H(i, l)$.
As an illustrative example, suppose that only two types of resources are concerned, CPU and memory. A CPU-heavy task of user $i$ with resource demand vector $D_i = (1/10, 1/30)^T$ is to be scheduled, meaning that the task requires 1/10 of the total CPU availability and 1/30 of the total memory availability of the system. Only two servers have sufficient remaining resources to accommodate this task. Server 1 has the available resource vector $\bar{c}_1 = (1/5, 1/15)^T$; server 2 has the available resource vector $\bar{c}_2 = (1/8, 1/4)^T$. Intuitively, because the task is CPU-bound, it is a better fit for server 1, which is CPU-abundant. This is indeed the case, as $H(i, 1) = 0 < H(i, 2) = 5/3$, and Best-Fit DRFH places the task on server 1. Both First-Fit and Best-Fit DRFH can be easily implemented by searching all $k$ servers in $O(k)$ time, which is fast enough for small- and medium-sized clusters. For a large cluster containing tens of thousands of servers, this computation can be quickly approximated by adapting the power of two choices load balancing technique [21]: instead of scanning through all servers, the scheduler randomly probes two servers and places the task on the server that fits it better. It is worth mentioning that the definition of the heuristic function (9) is not unique. In fact, one can use a more complex heuristic function than (9) to measure the fitness of a task for a server, e.g., cosine similarity [22]. However, as we shall show in the next section, Best-Fit DRFH with (9) as its heuristic function already improves utilization to a level where the system capacity is almost saturated. The benefit of using a more complex fitness measure is therefore very limited, at least for the Google cluster traces [8].
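A minimal sketch of the fitness function (9), the Best-Fit selection rule, and the two-choices approximation, reproducing the numbers of the example above (the `best_fit` and `two_choice` helpers are illustrative, not the paper's implementation):

```python
import random
from fractions import Fraction as F

def fitness(D, c_avail):
    """H(i,l) of Eq. (9): L1 distance between the demand vector and the
    server's available-resource vector, each normalized by its CPU entry."""
    d = [x / D[0] for x in D]
    s = [x / c_avail[0] for x in c_avail]
    return sum(abs(a - b) for a, b in zip(d, s))

def best_fit(D, servers):
    """Among servers whose remaining resources can fit the task,
    pick the one minimizing H(i,l)."""
    feasible = [c for c in servers if all(c[r] >= D[r] for r in range(len(D)))]
    return min(feasible, key=lambda c: fitness(D, c))

def two_choice(D, servers):
    """Power-of-two-choices approximation: probe 2 random servers and
    keep the better fit, instead of scanning all k servers."""
    a, b = random.sample(servers, 2)
    return min((a, b), key=lambda c: fitness(D, c))

# The example from the text: a CPU-heavy task and two candidate servers.
D  = (F(1, 10), F(1, 30))
c1 = (F(1, 5),  F(1, 15))
c2 = (F(1, 8),  F(1, 4))
assert fitness(D, c1) == 0          # perfect match: normalized ratios coincide
assert fitness(D, c2) == F(5, 3)
assert best_fit(D, [c1, c2]) == c1  # Best-Fit DRFH picks server 1
```

Normalizing by the CPU entry makes $H$ compare resource *profiles* rather than absolute amounts, which is why the CPU-abundant server 1 scores a perfect 0 here.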
6 TRACE-DRIVEN SIMULATION

In this section, we evaluate the performance of DRFH via extensive simulations driven by Google cluster-usage traces [8]. The traces contain resource demand/usage information of over 900 users (i.e., Google services and engineers) on a cluster of 12K servers. The server configurations are summarized in Table 1, where the CPUs and memory of each server are normalized so that the maximum server is 1. Each user submits computing jobs, divided into a number of tasks, each requiring a set of resources (i.e., CPU and memory). From the traces, we extract the computing demand information (the required amount of resources and the task running time) and use it as the demand input of the allocation algorithms for evaluation.

6.1 Dynamic Allocation

Our first evaluation focuses on the allocation fairness of the proposed Best-Fit DRFH when users dynamically join and depart the system. We simulate 3 users submitting tasks with different resource requirements to a small cluster of 100 servers. The server configurations are randomly drawn from the distribution of Google cluster servers in Table 1, leading to a resource pool containing 52.75 CPU units and 51.32 memory units in total. User 1 joins the system at the beginning, requiring 0.2 CPU and 0.3 memory for each of its tasks. As shown in Fig. 5, since only user 1 is active at the beginning, it is allocated a 40% CPU share and a 62% memory share. This allocation continues until 200 s, at which time user 2 joins and submits CPU-heavy tasks, each requiring 0.5 CPU and 0.1 memory. Both users now compete for computing resources, leading to a DRFH allocation in which both users receive a 44% global dominant share. At 500 s, user 3 starts to submit memory-intensive tasks, each requiring 0.1 CPU and 0.3 memory. The algorithm now allocates the same global dominant share of 26% to all three users, until user 1 finishes its tasks and departs at 1080 s. After that, only users 2 and 3 share the system, each receiving the same share of their global dominant resources. A similar process repeats until all users finish their tasks.
Throughout the simulation, we see that the Best-Fit DRFH algorithm precisely achieves the DRFH allocation at all times.

Fig. 5. CPU, memory, and global dominant shares of the three users on a 100-server system with 52.75 CPU units and 51.32 memory units in total.

TABLE 2
Resource utilization of the Slots scheduler with different slot sizes.

Number of Slots          CPU Utilization   Memory Utilization
10 per maximum server    35.1%             23.4%
12 per maximum server    42.2%             27.4%
14 per maximum server    43.9%             28.0%
16 per maximum server    45.4%             24.2%
20 per maximum server    40.6%             20.0%

6.2 Resource Utilization

We next evaluate the resource utilization of the proposed Best-Fit DRFH algorithm. We take the 24-hour computing demand data from the Google traces and simulate it on a smaller cloud computing system of 2,000 servers so that fairness becomes relevant. The server configurations are randomly drawn from the distribution of Google cluster servers in Table 1. We compare Best-Fit DRFH with two benchmarks: the traditional Slots scheduler, which schedules tasks onto slots of servers (e.g., the Hadoop Fair Scheduler [20]), and First-Fit DRFH, which chooses the first server that fits the task. For the former, we try different slot sizes and choose the one with the highest CPU and memory utilization. Table 2 summarizes our observations, where dividing the maximum server (1 CPU and 1 memory in Table 1) into 14 slots leads to the highest overall utilization. Fig. 6 depicts the time series of CPU and memory utilization of the three algorithms. We see that the two DRFH implementations significantly outperform the traditional Slots scheduler, with much higher resource utilization, mainly because the latter ignores the heterogeneity of both servers and workload.

Fig. 6. Time series of CPU and memory utilization.
Fig. 7. DRFH improvements in job completion times over the Slots scheduler: (a) CDF of job completion times; (b) mean job completion time reduction, broken down by job size (number of tasks).

This observation is consistent with findings in the homogeneous environment, where all servers have the same hardware configuration [11]. As for the two DRFH implementations, Best-Fit DRFH leads to uniformly higher resource utilization than the First-Fit alternative at all times. The high resource utilization of Best-Fit DRFH naturally translates into the shorter job completion times shown in Fig. 7a, which depicts the CDFs of job completion times for both Best-Fit DRFH and the Slots scheduler. Fig. 7b offers a more detailed breakdown, where jobs are classified into five categories based on the number of their computing tasks, and the mean completion time reduction is computed for each category.

Fig. 8. Task completion ratios of users under the Best-Fit DRFH and Slots schedulers, respectively. Each circle's radius is logarithmic in the number of tasks the user submitted.

Fig. 9. Comparison of task completion ratios under DRFH with those obtained in dedicated clouds (DCs). Each circle's radius is logarithmic in the number of tasks submitted.

While DRFH shows no improvement over the Slots scheduler for small jobs, a significant completion time reduction is observed for jobs containing more tasks. Generally, the larger the job, the more improvement one may expect. Similar observations have been made in homogeneous environments [11]. Fig. 7 does not account for partially completed jobs and focuses only on those having all tasks finished under both Best-Fit and Slots. As a complementary study, Fig. 8 computes the task completion ratio, i.e., the number of tasks completed over the number of tasks submitted, for every user under the Best-Fit DRFH and Slots schedulers, respectively. We see that Best-Fit DRFH leads to a higher task completion ratio for almost all users. Around 2% of users have all their tasks completed under Best-Fit DRFH but not under Slots.

6.3 Sharing Incentive

Our final evaluation concerns the sharing incentive property of DRFH. While we have shown in Sec. 4.5 that DRFH may not satisfy the property in the strong sense, it remains unclear how often the property would be violated in practice. Does this happen frequently, or only in rare cases? We answer this question in this subsection. We first compute the task completion ratio under the benchmark per-server partition. To do this, we randomly choose $\lceil k/n \rceil$ servers, where $k$ is the number of servers and $n$ is the number of users in the traces. We then allocate these $\lceil k/n \rceil$ servers to a user and schedule its tasks onto them.
These $\lceil k/n \rceil$ servers form a dedicated cloud exclusive to this user. We compute the task completion ratio obtained in this dedicated cloud and compare it with that obtained under the DRFH allocation. Fig. 9 illustrates the comparison results for all users. We see that most users would prefer the DRFH allocation to running their tasks in a dedicated cloud. In particular, only 2% of users see fewer tasks finished under the DRFH allocation, and even for these users, the task completion ratio decreases only slightly, as shown in Fig. 9. We therefore conclude that DRFH violates the property of strong sharing incentive only in rare cases in the Google traces.

7 RELATED WORK

Despite the extensive computing system literature on fair resource allocation, many existing works limit their discussion to the allocation of a single resource type, e.g., CPU time [23], [24] and link bandwidth [25]–[29]. Various fairness notions have also been proposed over the years, ranging from application-specific allocations [30], [31] to general fairness measures [25], [32], [33]. As for multi-resource allocation, state-of-the-art cloud computing systems employ naive single-resource abstractions. For example, the two fair-sharing schedulers currently supported in Hadoop [20], [34] partition a node into slots with fixed fractions of resources, and allocate resources jointly at the slot granularity. Quincy [35], a fair scheduler developed for Dryad [10], models the fair scheduling problem as a min-cost flow problem to schedule jobs into slots. The recent work [18] takes job placement constraints into consideration, yet it still uses a slot-based single-resource abstraction. Ghodsi et al. [11] were the first in the literature to present a systematic investigation of the multi-resource allocation problem in cloud computing systems. They proposed DRF to equalize the dominant shares of all users, and showed that a number of desirable fairness properties are guaranteed by the resulting allocation. DRF has quickly attracted a substantial amount of attention and has been generalized in many dimensions.
Notably, Joe-Wong et al. [12] generalized the DRF measure and incorporated it into a unifying framework that captures the trade-offs between allocation fairness and efficiency. Dolev et al. [13] suggested another notion of fairness for multi-resource allocation, known as Bottleneck-Based Fairness (BBF), under which two fairness properties that DRF possesses are also guaranteed. Gutman and Nisan [14] considered another setting of DRF with a more general domain of user utilities, and showed its connections to the BBF mechanism. Parkes et al. [15], on the other hand, extended DRF in several ways, including the presence of zero demands for certain resources, weighted user endowments, and in particular the case of indivisible tasks. They also studied the loss of social welfare under the DRF rules. Kash et al. [16] extended the DRF model to allow users to join the system over time but never leave. Bhattacharya et al. [36] generalized DRF to a hierarchical scheduler that offers service isolation

in a computing system with a hierarchical structure. All these works assume, explicitly or implicitly, that resources are either concentrated in one super computer or distributed over a set of homogeneous servers with exactly the same resource configuration. However, server heterogeneity has been widely observed in today's cloud computing systems. Specifically, Ahmad et al. [6] noted that datacenter clusters usually consist of both high-performance servers and low-power nodes with different hardware architectures. Reiss et al. [3], [8] illustrated a wide range of server specifications in Google clusters. As for public clouds, Farley et al. [4] and Ou et al. [5] observed significant hardware diversity among Amazon EC2 servers that may lead to substantially different performance across supposedly equivalent VM instances. Ou et al. [5] also pointed out that such server heterogeneity is not limited to EC2, but generally exists in long-lasting public clouds such as Rackspace. To our knowledge, the very recent paper [37] is the only work that has studied allocation properties in the presence of server heterogeneity; there, a randomized allocation algorithm, called Constrained-DRF (CDRF), is proposed to schedule discrete jobs. While CDRF possesses all the desirable properties discussed in this paper, it is too complex for a job scheduler, and an efficient algorithm remains an open problem [37]. More recently, Grandl et al. [38] proposed an efficient heuristic algorithm for multi-resource scheduling in heterogeneous computer clusters. Their work mainly focuses on designing a good heuristic algorithm rather than studying allocation properties, and is therefore orthogonal to ours. Other related works include fair-division problems in the economics literature, in particular egalitarian division under Leontief preferences [17] and the cake-cutting problem [19]. These works also assume the all-in-one resource model, and hence cannot be directly applied to cloud computing systems with heterogeneous servers.
8 CONCLUDING REMARKS

In this paper, we study a multi-resource allocation problem in a heterogeneous cloud computing system where the resource pool is composed of a large number of servers with different configurations in terms of resources such as processing, memory, and storage. The proposed multi-resource allocation mechanism, known as DRFH, equalizes the global dominant share allocated to each user, and hence generalizes the DRF allocation from a single server to multiple heterogeneous servers. We analyze DRFH and show that it retains almost all the desirable properties that DRF provides in the single-server scenario. Notably, DRFH is envy-free, Pareto optimal, and group strategyproof; it also offers the sharing incentive in a weak sense. We design a Best-Fit heuristic that implements DRFH in a real-world system. Our large-scale simulations driven by Google cluster traces show that, compared with traditional single-resource abstractions such as a slot scheduler, DRFH achieves significant improvements in resource utilization, leading to much shorter job completion times.

REFERENCES

[1] W. Wang, B. Li, and B. Liang, "Dominant resource fairness in cloud computing systems with heterogeneous servers," in Proc. IEEE INFOCOM, 2014.
[2] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50-58, 2010.
[3] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch, "Heterogeneity and dynamicity of clouds at scale: Google trace analysis," in Proc. ACM SoCC, 2012.
[4] B. Farley, A. Juels, V. Varadarajan, T. Ristenpart, K. D. Bowers, and M. M. Swift, "More for your money: Exploiting performance heterogeneity in public clouds," in Proc. ACM SoCC, 2012.
[5] Z. Ou, H. Zhuang, A. Lukyanenko, J. Nurminen, P. Hui, V. Mazalov, and A. Ylä-Jääski, "Is the same instance type created equal? Exploiting heterogeneity of public clouds," IEEE Trans. Cloud Computing, vol. 1, no. 2, pp. 201-214, 2013.
[6] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T.
Vjaykumar, Tarazu: Optmzng mapreduce on heterogeneous clusters, n Proc. ACM ASPLOS, 212. [7] R. Nathuj, C. Isc, and E. Gorbatov, Explotng platform heterogenety for power effcent data centers, n Proc. USENI ICAC, 27. [8] C. Ress, J. Wlkes, and J. L. Hellersten, Google Cluster-Usage Traces, http://code.google.com/p/googleclusterdata/. [9] Apache Hadoop, http://hadoop.apache.org. [1] M. Isard, M. Budu, Y. Yu, A. Brrell, and D. Fetterly, Dryad: dstrbuted data-parallel programs from sequental buldng blocks, n Proc. EuroSys, 27. [11] A. Ghods, M. Zahara, B. Hndman, A. Konwnsk, S. Shenker, and I. Stoca, Domnant resource farness: Far allocaton of multple resource types, n Proc. USENI NSDI, 211. [12] C. Joe-Wong, S. Sen, T. Lan, and M. Chang, Mult-resource allocaton: Farness-effcency tradeoffs n a unfyng framework, n Proc. IEEE INFOCOM, 212. [13] D. Dolev, D. Fetelson, J. Halpern, R. Kupferman, and N. Lnal, No justfed complants: On far sharng of multple resources, n Proc. ACM ITCS, 212. [14] A. Gutman and N. Nsan, Far allocaton wthout trade, n Proc. AAMAS, 212. [15] D. Parkes, A. Procacca, and N. Shah, Beyond domnant resource farness: Extensons, lmtatons, and ndvsbltes, n Proc. ACM EC, 212. [16] I. Kash, A. Procacca, and N. Shah, No agent left behnd: Dynamc far dvson of multple resources, n Proc. AAMAS, 213. [17] J. L and J. ue, Egaltaran dvson under Leontef preferences, Econ. Theory, vol. 54, no. 3, pp. 597 622, 213. [18] A. Ghods, M. Zahara, S. Shenker, and I. Stoca, Choosy: Max-mn far sharng for datacenter jobs wth constrants, n Proc. ACM EuroSys, 213. [19] A. D. Procacca, Cake cuttng: Not just chld s play, Commun. ACM, 213. [2] Hadoop Far Scheduler, http://hadoop.apache.org/docs/r.2.2/far scheduler.html. [21] M. Mtzenmacher, The power of two choces n randomzed load balancng, IEEE Trans. Parallel Dstrb. Syst., vol. 12, no. 1, pp. 194 114, 21. [22] A. Snghal, Modern nformaton retreval: A bref overvew, IEEE Data Eng. Bull., vol. 24, no. 4, pp. 35 43, 21. [23] S. Baruah, J. 
Gehrke, and C. Plaxton, Fast schedulng of perodc tasks on multple resources, n Proc. IEEE IPPS, 1995. [24] S. Baruah, N. Cohen, C. Plaxton, and D. Varvel, Proportonate progress: A noton of farness n resource allocaton, Algorthmca, vol. 15, no. 6, pp. 6 625, 1996. [25] F. Kelly, A. Maulloo, and D. Tan, Rate control for communcaton networks: Shadow prces, proportonal farness and stablty, J. Oper. Res. Soc., vol. 49, no. 3, pp. 237 252, 1998. [26] J. Mo and J. Walrand, Far end-to-end wndow-based congeston control, IEEE/ACM Trans. Networkng, vol. 8, no. 5, pp. 556 567, 2. [27] J. Klenberg, Y. Raban, and É. Tardos, Farness n routng and load balancng, n Proc. IEEE FOCS, 1999. [28] J. Blanquer and B. Özden, Far queung for aggregated multple lnks, n Proc. ACM SIGCOMM, 21. [29] Y. Lu and E. Knghtly, Opportunstc far schedulng over multple wreless channels, n Proc. IEEE INFOCOM, 23.

[30] C. Koksal, H. Kassab, and H. Balakrishnan, "An analysis of short-term fairness in wireless media access protocols," in Proc. ACM SIGMETRICS (poster session), 2000.
[31] M. Bredel and M. Fidler, "Understanding fairness and its impact on quality of service in IEEE 802.11," in Proc. IEEE INFOCOM, 2009.
[32] R. Jain, D. Chiu, and W. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer system," Eastern Research Laboratory, Digital Equipment Corporation, 1984.
[33] T. Lan, D. Kao, M. Chiang, and A. Sabharwal, "An axiomatic theory of fairness in network resource allocation," in Proc. IEEE INFOCOM, 2010.
[34] Hadoop Capacity Scheduler, http://hadoop.apache.org/docs/r0.20.2/capacity_scheduler.html.
[35] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in Proc. ACM SOSP, 2009.
[36] A. A. Bhattacharya, D. Culler, E. Friedman, A. Ghodsi, S. Shenker, and I. Stoica, "Hierarchical scheduling for diverse datacenter workloads," in Proc. ACM SoCC, 2013.
[37] E. Friedman, A. Ghodsi, and C.-A. Psomas, "Strategyproof allocation of discrete jobs on multiple machines," in Proc. ACM EC, 2014.
[38] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, "Multi-resource packing for cluster schedulers," in Proc. ACM SIGCOMM, 2014.

Baochun Li received the B.Engr. degree from the Department of Computer Science and Technology, Tsinghua University, China, in 1995, and the M.S. and Ph.D. degrees from the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, in 1997 and 2000. Since 2000, he has been with the Department of Electrical and Computer Engineering at the University of Toronto, where he is currently a Professor. He held the Nortel Networks Junior Chair in Network Architecture and Services from October 2003 to June 2005, and has held the Bell Canada Endowed Chair in Computer Engineering since August 2005. His research interests include large-scale multimedia systems, cloud computing, peer-to-peer networks, applications of network coding, and wireless networks. Dr. Li was the recipient of the IEEE Communications Society Leonard G. Abraham Award in the Field of Communications Systems in 2000. In 2009, he was a recipient of the Multimedia Communications Best Paper Award from the IEEE Communications Society, and a recipient of the University of Toronto McLean Award. He is a member of ACM and a senior member of IEEE.

Wei Wang received the B.Engr. and M.A.Sc. degrees from the Department of Electrical Engineering, Shanghai Jiao Tong University, in 2007 and 2010. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Toronto. His general research interests cover the broad area of computer networking, with special emphasis on resource management and scheduling in cloud computing systems. He is also interested in problems at the intersection of computer networking and economics.

Ben Liang received honors-simultaneous B.Sc. (valedictorian) and M.Sc. degrees in Electrical Engineering from Polytechnic University in Brooklyn, New York, in 1997, and the Ph.D. degree in Electrical Engineering with a Computer Science minor from Cornell University in Ithaca, New York, in 2001. In the 2001-2002 academic year, he was a visiting lecturer and post-doctoral research associate at Cornell University. He joined the Department of Electrical and Computer Engineering at the University of Toronto in 2002, where he is now a Professor. His current research interests are in mobile communications and networked systems. He is an editor for the IEEE Transactions on Wireless Communications and an associate editor for the Wiley Security and Communication Networks journal, in addition to regularly serving on the organizational or technical committees of a number of conferences. He is a senior member of IEEE and a member of ACM and Tau Beta Pi.

APPENDIX

Proof of Lemma 1: ($\Leftarrow$) We start with the necessity proof. Since $A_l = g_l d$, for every resource $r \in R$ we have
$$A_{lr}/D_r = g_l d_r / D_r = g_l / D_{r^*},$$
where $r^*$ denotes the dominant resource, so that $d_r = D_r / D_{r^*}$. As a result,
$$N_l(A_l) = \min_{r \in R} \{A_{lr}/D_r\} = g_l / D_{r^*}.$$
Now for any $A'_l \preceq A_l$ with $A'_l \neq A_l$, suppose that $A'_{lr} < A_{lr}$ for some resource $r$. We have
$$N_l(A'_l) = \min_{r \in R} \{A'_{lr}/D_r\} \leq A'_{lr}/D_r < A_{lr}/D_r = N_l(A_l).$$
Hence, by definition, allocation $A_l$ is non-wasteful.

($\Rightarrow$) We next present the sufficiency proof. Since $A_l$ is non-wasteful, for any two resources $r_1, r_2 \in R$ we must have $A_{lr_1}/D_{r_1} = A_{lr_2}/D_{r_2}$. Otherwise, without loss of generality, suppose that $A_{lr_1}/D_{r_1} > A_{lr_2}/D_{r_2}$. There must exist some $\epsilon > 0$ such that $(A_{lr_1} - \epsilon)/D_{r_1} > A_{lr_2}/D_{r_2}$. Now construct an allocation $A'_l$ such that
$$A'_{lr} = \begin{cases} A_{lr_1} - \epsilon, & r = r_1; \\ A_{lr}, & \text{otherwise.} \end{cases}$$
Clearly, $A'_l \preceq A_l$ and $A'_l \neq A_l$. However, since resource $r_1$ achieves the minimum in neither allocation, it is easy to see that
$$N_l(A'_l) = \min_{r \in R} \{A'_{lr}/D_r\} = \min_{r \neq r_1} \{A'_{lr}/D_r\} = \min_{r \neq r_1} \{A_{lr}/D_r\} = \min_{r \in R} \{A_{lr}/D_r\} = N_l(A_l), \quad (1)$$
which contradicts the fact that $A_l$ is non-wasteful. As a result, there exists some $n_l$ such that for every resource $r \in R$, we have $A_{lr} = n_l D_r = n_l D_{r^*} d_r$. Now letting $g_l = n_l D_{r^*}$, we see $A_l = g_l d$.
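The characterization in Lemma 1 lends itself to a quick numerical check. The sketch below is illustrative only and is not the authors' code: the function names (`num_tasks`, `is_nonwasteful`) and the sample two-resource demand vector are our own. It computes $N_l(A_l) = \min_r A_{lr}/D_r$ and confirms that an allocation proportional to the normalized demand $d$ is non-wasteful, while padding any single resource wastes it without supporting extra tasks.

```python
def num_tasks(A, D):
    """Number of tasks supported: N_l(A_l) = min over resources of A_lr / D_r."""
    return min(a / d for a, d in zip(A, D))

def is_nonwasteful(A, D, eps=1e-9):
    """Per Lemma 1, A_l is non-wasteful iff all ratios A_lr / D_r are equal,
    i.e., A_l is proportional to the demand vector D (equivalently, to d)."""
    ratios = [a / d for a, d in zip(A, D)]
    return max(ratios) - min(ratios) < eps

# Hypothetical user: each task needs 20% of cluster CPU, 50% of cluster memory
# (memory is the dominant resource r*).
D = [0.2, 0.5]
d = [x / max(D) for x in D]        # normalized demand d = (0.4, 1.0)

g = 0.6                            # a global dominant share g_l
A = [g * x for x in d]             # allocation of the form A_l = g_l * d
assert is_nonwasteful(A, D)
assert abs(num_tasks(A, D) - g / max(D)) < 1e-9   # N_l(A_l) = g_l / D_{r*}

# Padding one resource beyond the proportional form is wasteful:
A_pad = [A[0] + 0.1, A[1]]
assert not is_nonwasteful(A_pad, D)
assert num_tasks(A_pad, D) == num_tasks(A, D)     # no additional tasks
```

The two final assertions mirror the suffering-no-loss argument in the proof: removing the padding (the $\epsilon$ in the construction of $A'_l$) leaves $N_l$ unchanged, so only allocations of the form $g_l d$ are non-wasteful.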