Distributed Load Balancing for Machines Fully Heterogeneous

Internship Report
2nd of June - 22nd of August 2014

Nathanaël Cheriere
nathanael.cheriere@ens-rennes.fr
ENS Rennes, Academic Year 2013-2014
Département Informatique et Télécommunications
http://perso.eleves.ens-rennes.fr/~ncherier/

Supervisor: Erik Saule
esaule@uncc.edu
Department of Computer Science, University of North Carolina Charlotte
http://webpages.uncc.edu/~esaule/

Abstract. As parallel systems grow, a centralized algorithm that distributes the jobs to be executed can impose a large cost on the system; this is why much research aims at decentralized scheduling algorithms. The objective of this work is to propose a decentralized solution to balance jobs on unrelated machines. We propose two distributed algorithms able to balance the load on unrelated machines, with proven guarantees in some situations. The first one exploits a low heterogeneity of the jobs, while the second one focuses on the case of two clusters of identical machines.

Keywords: Decentralized Algorithms, Approximation Algorithms, Machine Heterogeneity, Unrelated Scheduling Problem, Work Stealing, Load-Balancing

1 Introduction

Scheduling the execution of a parallel program on multiple machines is one of the basic problems in parallel computing. Its solution is far from obvious, as many hypotheses can be made to specify the conditions, especially on the heterogeneity of the machines used to execute the program. These problems have been studied since the 1960s, but often from a centralized point of view. However, with the current size of parallel systems, the cost of solving the problem on a single machine can no longer be overlooked [9]. One way to decrease the overhead induced by the scheduling is to have every machine participate in the distribution of the work. The work stealing strategy used in Cilk [3] follows this idea: every machine is responsible for its own load, and when it has no job left to execute, it tries to steal some work from another machine.
However, this strategy cannot be directly applied to systems whose machine set is highly heterogeneous, because the initial distribution (which specifies the jobs that must be finished before stealing starts) matters. Running a first algorithm to balance the load between the machines removes the problematic cases that can disrupt a work stealing strategy. This is why this work focuses on the problem of decentralized load balancing for heterogeneous machine sets. It could provide a suitable initial distribution which can then be used by a work stealing algorithm, or it could be run at regular intervals.

We propose centralized and decentralized algorithms with proven properties operating under practically relevant conditions. The first one is specialized for systems with high machine heterogeneity and low job heterogeneity, while the second one solves the problem for two different clusters, each containing only identical machines, typically a GPU-accelerated cluster.

The problem is formally defined in Section 2. In Section 3 we present the results we extend in this paper, and the model is detailed in Section 4. We develop an algorithm for the case of high machine heterogeneity and low job heterogeneity and prove an approximation ratio in Section 5. Then, in Section 6, we limit the heterogeneity of the machines to two clusters, provide an algorithm to load balance the machines, prove an approximation ratio in case of convergence, and experimentally study the convergence.

2 Problem

The problem studied in this work is a load balancing problem: the goal is to construct a partition $S$ of a set of jobs $J$ onto the machines of the set $M$ in order to minimize an objective function. Many functions can be minimized, but we focus on the makespan

$$C_{\max}(S) = \max_{i \in M} C(S, i), \qquad C(S, i) = \sum_{j \in S(i)} p_{i,j},$$

where $C(S, i)$ is the makespan of machine $i$, $p_{i,j}$ is the time needed to execute job $j$ on machine $i$, and $S(i)$ is the set of jobs assigned to machine $i$. In the following parts, we write $C(i)$ for $C(S, i)$ when there is no ambiguity on the solution used to compute the makespan. The partition obtained is usually compared to the minimum possible makespan $OPT = \min_S C_{\max}(S)$, which characterizes the optimal solutions.

The homogeneity of the set of machines is the main hypothesis usually made on the machines used for a parallel execution.
The machines can be uniform, related, or unrelated. When the machines are homogeneous, they are said to be uniform: each job $j$ can be executed on every machine with the exact same cost (usually the same duration), $\forall i, i' \in M,\ p_{i,j} = p_{i',j}$. In the related case, the machines are different but only differ by a fixed factor: $\forall i, i' \in M,\ \exists \alpha,\ \forall j \in J,\ p_{i,j} = \alpha\, p_{i',j}$. The unrelated case is the most general, with a lot of heterogeneity in the machine set: every machine has a fixed cost $p_{i,j}$ for each job (which can be infinite), and there is no relation restricting these costs.

The problem developed in the following parts is the load balancing of unrelated machines with the objective of minimizing $C_{\max}$, which is denoted $R||C_{\max}$ by Graham et al. [8]. Because the problem is NP-complete [6], we develop two approximation algorithms for it.
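The definitions of Section 2 can be made concrete with a short sketch. It assumes, purely for illustration, that costs are stored as a nested dictionary `p[i][j]` of positive integers and that a partition `S` maps each machine to a list of jobs; the function names are hypothetical and not part of the original work.

```python
from fractions import Fraction

def load(p, S, i):
    """C(S, i): total cost of the jobs assigned to machine i."""
    return sum(p[i][j] for j in S[i])

def makespan(p, S):
    """C_max(S): the largest machine load in the partition S."""
    return max(load(p, S, i) for i in S)

def classify(p):
    """Classify a cost table with the vocabulary of Section 2.

    Assumption: every machine has a positive integer cost for every job.
    """
    machines = list(p)
    ref = p[machines[0]]          # costs on an arbitrary reference machine
    jobs = list(ref)
    if all(p[i][j] == ref[j] for i in machines for j in jobs):
        return "uniform"
    for i in machines:
        # related machines differ from the reference by one fixed factor
        if len({Fraction(p[i][j], ref[j]) for j in jobs}) > 1:
            return "unrelated"
    return "related"
```

For instance, with `p = {"A": {"x": 2, "y": 4}, "B": {"x": 1, "y": 2}}`, machine B is twice as fast as A on every job, so the table is classified as related.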

3 Related Work

Scheduling problems have interested many researchers since the 1960s: in 1969, Graham [7] showed that the scheduling problem on uniform machines can be approximated within a factor 2 (there exists an algorithm that provides a solution $S$ such that $C_{\max}(S) \le 2\,OPT$).

The scheduling problem on unrelated machines, which is the subject of this work, has been studied by Lawler and Labetoulle [10]. They showed that the problem can be solved in polynomial time using a linear program under the hypothesis of pre-emption (the possibility to pause a job on a machine and restart it on another machine). The same problem without pre-emption has been approximated within a factor 2 by Lenstra et al. [11], also using a linear program. Moreover, they showed that the problem cannot be approximated with a ratio better than $3/2$ unless $P = NP$.

In [5], Chen et al. propose a centralized online algorithm to distribute newly arriving jobs between two different clusters of identical machines, and prove that this algorithm has an approximation ratio of 4.

In order to avoid an excessive work load on the scheduling machine, a decentralized strategy was introduced in 1981 by Burton and Sleep [4]: the work stealing algorithm (Algorithm 1). The goal of the algorithm is to keep every machine busy until there are no more jobs to run. To meet this objective, when a machine finishes its last locally available job, it contacts its neighbours and tries to steal some of their non-running jobs.

Algorithm 1 Work Stealing for a particular machine
Data: $m$ machine
Data: $S(m)$ jobs assigned to $m$
while true do
    if $S(m) = \emptyset$ then
        Select randomly a target machine $i$
        if $S(i) \ne \emptyset$ then
            Steal half of the jobs of $i$
    else
        Start running a job $j$ of $S(m)$
        Remove $j$ from $S(m)$
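One iteration of the loop of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: job queues are in-memory `deque` objects indexed by machine name, "half" is rounded down, and `run` is a hypothetical callback that executes a job; none of these names come from the original work or from Cilk.

```python
import random
from collections import deque

def steal_half(victim):
    """Remove and return half (rounded down) of the victim's pending jobs."""
    count = len(victim) // 2
    return [victim.pop() for _ in range(count)]

def work_stealing_step(m, queues, rng, run):
    """One iteration of the loop of Algorithm 1 for machine m."""
    if not queues[m]:
        # No local job left: pick a random target and try to steal from it.
        target = rng.choice([i for i in queues if i != m])
        if queues[target]:
            queues[m].extend(steal_half(queues[target]))
    else:
        # Start running a local job and remove it from S(m).
        run(m, queues[m].popleft())
```

With `queues = {"A": deque(), "B": deque([1, 2, 3, 4])}`, a first call for machine A steals two jobs from B, and a second call runs one of them on A.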
In 1994, Blumofe [2,3] applied the idea of work stealing in the Cilk middleware and proved guarantees on the execution time of a batch of jobs using a work stealing algorithm on identical parallel machines. In particular, he proved that the expected makespan is bounded by the average work per machine plus a term in $O$ of the critical path $p_\infty$ of the problem:

$$E(C_{\max}(S)) \le \sum_{j \in J} p_{i,j} / |M| + O(p_\infty).$$

In 2002, Bender and Rabin [1] extended this result to related machines with the possibility to use pre-emption, and proved that the expected completion time is bounded by

$$E(C_{\max}(S)) \le \sum_{j \in J} p_{1,j} / (|M|\,\pi) + O(p_\infty / \pi),$$

where machine 1 is the fastest machine and $\pi$ is the average speed of the machines, normalized by the speed of the fastest machine.

4 A Priori Load Balancing

The work stealing algorithm developed by Blumofe [2,3] needs little knowledge about the jobs and the machines, which is possible because all machines are the same. The work stealing algorithm adapted by Bender and Rabin [1] knows and uses the relative speeds of the machines, which is enough to characterize the differences between them. However, in the case of unrelated machines, the machines are characterized by the cost of each job execution, and this also defines each job, so we consider the costs $p_{i,j}$ of each job as known.

Applying a work stealing strategy on unrelated machines has a flaw. Indeed, this strategy starts stealing jobs from another machine only when the work previously scheduled on the machine has been executed. But if the initial distribution is poorly done, the first steal can happen long after the optimal makespan.

Theorem 1. Applying a work stealing strategy on unrelated machines can induce an unbounded makespan.

Proof. The circled schedule presented in Table 1 describes a situation where the first steal can only happen after a time $n$, so the execution finishes after $n + 1$ units of time. However, with a good schedule the work can be finished in 2 units of time.

Table 1. Bad initial distribution

Cost     Machine A   Machine B   Machine C
Job 1    1           n           n
Job 2    1           1           n
Job 3    n           1           1
Job 4    n           1           1
Job 5    n           1           1

The circled distribution is a bad initial distribution for a work stealing strategy (Algorithm 1). This is why balancing the load before executing the first jobs seems necessary in a work stealing strategy for unrelated machines. However, a centralized load balancing would not fit the work stealing algorithm, as the latter is completely distributed.
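The claim that the jobs of Table 1 can be finished in 2 units of time can be checked by exhaustive search. The sketch below is illustrative only: it enumerates all $3^5$ assignments for a chosen numeric value of $n$, and the function names are hypothetical.

```python
from itertools import product

def table1_costs(n):
    """Cost table of Table 1: p[machine][job]."""
    return {
        "A": {1: 1, 2: 1, 3: n, 4: n, 5: n},
        "B": {1: n, 2: 1, 3: 1, 4: 1, 5: 1},
        "C": {1: n, 2: n, 3: 1, 4: 1, 5: 1},
    }

def makespan(p, assignment):
    """Makespan of an assignment given as a dict job -> machine."""
    load = dict.fromkeys(p, 0)
    for job, machine in assignment.items():
        load[machine] += p[machine][job]
    return max(load.values())

def optimal_makespan(p):
    """Brute-force OPT; only practical for very small instances."""
    machines = list(p)
    jobs = sorted(p[machines[0]])
    return min(
        makespan(p, dict(zip(jobs, choice)))
        for choice in product(machines, repeat=len(jobs))
    )

# One schedule reaching the optimum: jobs 1, 2 on A, jobs 3, 4 on B, job 5 on C.
good = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
```

Since the optimum stays at 2 while a bad initial distribution costs $n + 1$, the ratio between the two grows without bound as $n$ increases, which is the content of Theorem 1.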

Theorem 2. A generic algorithm that optimally balances each pair of machines can induce an unbounded makespan.

Proof. The example developed in Fig. 1 has a makespan of $n$ and each pair of machines is optimally balanced. However, the optimal solution has a makespan of 1.

[Figure: machines A, B, and C, each holding one of the jobs {1}, {2}, {3}; every pair of machines is labelled "Optimal".]

Cost     Machine A   Machine B   Machine C
Job 1    1           n           n^2
Job 2    n^2         1           n
Job 3    n           n^2         1

Makespan of the circled distribution: $n$. Optimal distribution for all pairs. Optimal makespan: 1.

Fig. 1. Example of a situation with 3 machines and 3 jobs with an unbounded makespan

There is little hope of finding generic solutions, so we look at particular cases. The following parts focus on providing algorithms with bounded approximation ratios.

5 Load Balancing per Type of Job

In this section, the jobs are grouped by type. Within a type, all the jobs have the same costs: for all $j, j' \in J$, $j$ has the same type as $j'$ if and only if $\forall i \in M,\ p_{i,j} = p_{i,j'}$. This distinction can easily be made in real systems where simple queries represent most of the jobs: even if the jobs are not exactly the same, their costs are similar and only vary depending on which machine executes them.

5.1 Balancing only one type of job

This section presents an algorithm for the case where there is only one type of job, and the proof of optimality of this algorithm.

The One Job Type Balancing algorithm (Algorithm 3) is quite simple. It randomly chooses a target and balances the load of the two machines in an optimal way. The optimal load balancing of two machines in this problem is obtained with Basic Greedy (Algorithm 2), thanks to the fact that the cost of a job is defined by the machine only (the proof is omitted since it is trivial).

Theorem 3. One Job Type Balancing (Algorithm 3) provides an optimal distribution $S$ of the jobs.

$$C_{\max}(S) = OPT \tag{1}$$

Algorithm 2 Basic Greedy
Data: machines $m$ and $i$
Data: $S$ distribution of jobs
$A := S(i) \cup S(m)$
$S(m) := \emptyset$
$S(i) := \emptyset$
while $A \ne \emptyset$ do
    let $j$ be the first job in $A$
    if $C(m) + p_{m,j} \le C(i) + p_{i,j}$ then
        $S(m) := S(m) \cup \{j\}$
    else
        $S(i) := S(i) \cup \{j\}$
    $A := A \setminus \{j\}$

Algorithm 3 One Job Type Balancing
Data: $m$ host machine
Data: $S$ initial distribution of the jobs
Result: $C_{\max}(S) = OPT$
while true do
    Select $i \in M$
    Distribute optimally the load of $i$ and $m$ with the Basic Greedy

Proof. Let $S^{(n)}$ be the solution created after the $n$-th execution of the algorithm. Note that $C_{\max}(S^{(n+1)}) \le C_{\max}(S^{(n)})$; otherwise, the balancing done between the two machines at step $n + 1$ would not be optimal. Hence the sequence $C_{\max}(S^{(n)})$ is non-increasing.

Let $S$ be the distribution created by Algorithm 3 after an infinite number of executions. As all the jobs have the same costs, we denote by $p_i$ the cost of any job on machine $i$. We can now represent $S(i)$ only by its cardinality, which we denote $N(i) = |S(i)|$. We also have $C(i) = N(i)\, p_i$. Let $S^*$ be an optimal distribution and $N^*(i) = |S^*(i)|$.

By contradiction, let us assume that $C_{\max}(S) > C_{\max}(S^*)$. There exists $i_{\max} \in M$ such that $C(i_{\max}) = C_{\max}(S)$. In particular, $N(i_{\max})\, p_{i_{\max}} > N^*(i_{\max})\, p_{i_{\max}}$, so $N(i_{\max}) > N^*(i_{\max})$, and this implies that there exists $i$ such that $N(i) < N^*(i)$ (because $\sum_{i \in M} N(i) = |J| = \sum_{i \in M} N^*(i)$). So at least one job can be moved from $i_{\max}$ to $i$ to obtain a better local distribution: after the move, the load of $i$ is $(N(i) + 1)\, p_i \le N^*(i)\, p_i \le C_{\max}(S^*) < C(i_{\max})$. Hence the distribution $S$ is not optimal for the pair of machines $i_{\max}$ and $i$; but as Algorithm 3 has been applied an infinite number of times and $C_{\max}$ is non-increasing, $S$ should be optimal for $i_{\max}$ and $i$. In conclusion, we have $C_{\max}(S) \le C_{\max}(S^*)$, hence $C_{\max}(S) = C_{\max}(S^*)$.

5.2 Extension to multiple types of jobs

The One Job Type Balancing algorithm can be extended into the Multiple Job Type Balancing algorithm to balance multiple types of jobs; however, the performance guarantee of the algorithm becomes linear in the number of types of jobs.
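The pairwise balancing of Section 5 can be sketched in a few lines. This is an illustrative reading, not the original implementation: costs are assumed to be stored as `p[machine][job]`, the infinite loop of the algorithm is replaced by a fixed number of sequential rounds, and, for several job types, Basic Greedy is applied to each type independently, starting from empty loads. The names `balance_pair`, `run`, and `job_type` are hypothetical.

```python
import random

def basic_greedy(cost_m, cost_i, jobs):
    """Algorithm 2: place each job where it would finish earliest."""
    on_m, on_i = [], []
    load_m = load_i = 0
    for j in jobs:
        if load_m + cost_m[j] <= load_i + cost_i[j]:
            on_m.append(j)
            load_m += cost_m[j]
        else:
            on_i.append(j)
            load_i += cost_i[j]
    return on_m, on_i

def balance_pair(p, S, m, i, job_type=lambda j: 0):
    """One balancing of machines m and i, one job type at a time."""
    pool = S[m] + S[i]
    S[m], S[i] = [], []
    for t in sorted({job_type(j) for j in pool}):
        same_type = [j for j in pool if job_type(j) == t]
        on_m, on_i = basic_greedy(p[m], p[i], same_type)
        S[m] += on_m
        S[i] += on_i

def makespan(p, S):
    return max(sum(p[i][j] for j in S[i]) for i in S)

def run(p, S, rounds, rng, job_type=lambda j: 0):
    """Sequential stand-in for every machine looping forever (assumption)."""
    machines = list(S)
    for _ in range(rounds):
        m, i = rng.sample(machines, 2)
        balance_pair(p, S, m, i, job_type)
```

As a usage example with a single job type, take 11 identical jobs costing 1, 2, and 3 on machines A, B, and C, all initially on A. Balancing (A, B) and then (A, C) yields 6, 3, and 2 jobs respectively, for a makespan of 6, which is optimal for this instance.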

Algorithm 4 Multiple Job Type Balancing
Data: $m$ host machine
Data: $S$ initial distribution of the jobs
Data: $k$ number of types of jobs
Result: $C_{\max}(S) \le k\,OPT$
while true do
    Select $i \in M$
    foreach $l$: type of job do
        Distribute the jobs of type $l$ between $i$ and $m$ using Basic Greedy

Theorem 4. Multiple Job Type Balancing (Algorithm 4), applied to $k$ types of jobs, provides a distribution $S$ of the jobs which is a $k$-approximation.

$$C_{\max}(S) \le k\,OPT \tag{2}$$

Proof. Theorem 3 is applied for each type of job. For each type $l$, the jobs of type $l$ alone are distributed optimally, so they contribute at most $OPT$ to the load of any machine; summing over the $k$ types gives the bound.

The approximation ratio is at most linear in the number of types of jobs $k$, but we can also show that it is at least $\ln(k)$ in the worst case (Fig. 2).

Cost       Machine 1   2           ...   m-1         m
j_m        1/m         1/m         ...   1/m         1/m
j_{m-1}    ∞           1/(m-1)     ...   1/(m-1)     1/(m-1)
...
j_2        ∞           ∞           ...   1/2         1/2
j_1        ∞           ∞           ...   ∞           1

$J$ is constructed with the rule: $\forall i \in [1, m]$, add $i$ jobs $j_i$.

Number of jobs per machine:

Job        Machine 1   2           ...   m-1         m
j_m        1           1           ...   1           1
j_{m-1}    0           1           ...   1           1
...
j_2        0           0           ...   1           1
j_1        0           0           ...   0           1

If the order used by Algorithm 4 is $[j_m, j_{m-1}, \dots, j_2, j_1]$, the distribution is stable for all pairs of machines, and, with $k = m$ types of jobs, $C_{\max} = \sum_{l=1}^{m} 1/l \ge \ln(m)$.

Fig. 2. Example of a $\ln(k)$ schedule produced by Algorithm 4

6 Load balancing with two types of machines

In this section, we limit the problem to two different clusters of identical machines $M_1$ and $M_2$ ($\forall i, i' \in M_1$ (resp. $M_2$), $\forall j \in J,\ p_{i,j} = p_{i',j}$). This setting is meaningful in practice since the advent of GPU-accelerated clusters. We develop a new algorithm with a proven approximation ratio of 2 under the assumption that the cost of any task is smaller than the optimal makespan of the problem, which is formally expressed as $\forall i \in M,\ \forall j \in J,\ p_{i,j} \le OPT$. This hypothesis is realistic in the sense that any machine is able to do almost everything another machine does, but at a different speed; we suppose here that no job can only run on one cluster (and if this were the case, such jobs could be assigned beforehand). Moreover, this hypothesis also suggests that there is a large number of jobs, which is the starting point for creating a decentralized algorithm.

To simplify the notation, in this section we denote by $p_{i,j}$ the cost of job $j$ on cluster $i$.

6.1 A centralized algorithm

We first focus on the centralized version of the problem and then use this centralized algorithm as a stepping stone to create a decentralized one. This problem is a sub-problem of the scheduling problem on $m$ unrelated machines, which has already been studied by Lenstra et al. in 1990 [11]. They provide an algorithm with an approximation ratio of 2. However, their solution is obtained by first solving a linear program, and this method seems difficult to decentralize. This is why we developed a new greedy algorithm to balance the load between two clusters.

The idea behind this new algorithm, called Centralized Load Balancing for two clusters of machines (Algorithm 5), is quite simple: we first sort the jobs according to the ratio $p_{1,j}/p_{2,j}$, so that the jobs which are executed faster on the first cluster are at the beginning of the list and the ones executed faster on the second cluster are at the end of the list. Then we evaluate two decisions: placing the first job of the list on the machine of the first cluster with minimal makespan, and placing the last job of the list on the machine of the second cluster with minimal makespan. The job placed is the one that minimizes the makespan of those two machines.
That way, if a job is not placed on the cluster where it can be executed at its minimal cost, we know this choice does not have a significant impact on the overall makespan.

Theorem 5. The job partition $S$ given by the Centralized Load Balancing for two clusters of machines algorithm (Algorithm 5) has a makespan at most two times larger than the optimal solution.

$$C_{\max}(S) \le 2\,OPT \tag{3}$$

Proof. Let $i_{\max}$ be a machine of $M = M_1 \cup M_2$ such that $C_{\max}(S) = C(i_{\max})$. We can suppose, without loss of generality, that $i_{\max} \in M_1$. Let $j_{\max}$ be the last job placed on $i_{\max}$. Let $S'$ be the incomplete partition of $J$ just before Algorithm 5 chooses to place $j_{\max}$ on $i_{\max}$. Let $C_1 = C(S', i_{\max})$ and $C_2 = \min_{i \in M_2} C(S', i)$. Note that $C_1$ also has the property $C_1 = \min_{i \in M_1} C(S', i)$.

Algorithm 5 Centralized Load Balancing for two clusters of machines
Data: $M_1$ cluster of identical machines of type 1
Data: $M_2$ cluster of identical machines of type 2
Data: $J$ set of $n$ jobs such that $\forall j \in \{1, \dots, n\}$, $p_{1,J(j)} \le OPT$ and $p_{2,J(j)} \le OPT$
Result: Partition $S$ of jobs for each machine, such that $C_{\max}(S) \le 2\,OPT$
Sort jobs in $J$ in increasing order of $p_{1,j}/p_{2,j}$
Initialize $S$ with one empty set per machine
$j_1 := 1$
$j_2 := n$
while $j_1 \le j_2$ do
    Select $i_1 \in M_1$ such that $C(i_1) = \min_{i \in M_1} C(i)$
    Select $i_2 \in M_2$ such that $C(i_2) = \min_{i \in M_2} C(i)$
    if $C(i_1) + p_{1,J(j_1)} \le C(i_2) + p_{2,J(j_2)}$ then
        $S(i_1) := S(i_1) \cup \{J(j_1)\}$
        $j_1 := j_1 + 1$
    else
        $S(i_2) := S(i_2) \cup \{J(j_2)\}$
        $j_2 := j_2 - 1$
return $S$

Let us compare $\min(C_1, C_2)$ to $OPT$. Let $S'_1$ be the jobs placed on cluster 1 in $S'$, and $S'_2$ the ones placed on cluster 2, $S'_k = \bigcup_{i \in M_k} S'(i)$. To compare $C_1$ and $C_2$ to $OPT$, we use the notion of work: the work is defined as the total cost of the jobs assigned to the machines. The work done on cluster 1, $W_1$, has the property $W_1 \ge |M_1|\, C_1$; similarly, the work $W_2$ done on cluster 2 is such that $W_2 \ge |M_2|\, C_2$.

Let $S^*$ be an optimal solution, with $S^*_1$ the set of jobs placed on cluster 1 and $S^*_2$ the set of jobs placed on cluster 2. The work done on cluster 1 in an optimal solution is denoted $W^*_1$, and the work done on cluster 2 is denoted $W^*_2$.

To create an optimal solution from $S'$, the jobs $J_1 = S'_1 \cap S^*_2$ should be moved from cluster 1 to cluster 2, and the jobs $J_2 = S'_2 \cap S^*_1$ from cluster 2 to cluster 1. We have the equations

$$W^*_1 = W_1 - \sum_{j \in J_1} p_{1,j} + \sum_{j \in J_2} p_{1,j} \quad\text{and}\quad W^*_2 = W_2 - \sum_{j \in J_2} p_{2,j} + \sum_{j \in J_1} p_{2,j} \tag{4}$$

By contradiction, let us assume that $W^*_1 < W_1$ and $W^*_2 < W_2$. From (4) we deduce

$$\sum_{j \in J_2} p_{1,j} < \sum_{j \in J_1} p_{1,j} \quad\text{and}\quad \sum_{j \in J_1} p_{2,j} < \sum_{j \in J_2} p_{2,j} \tag{5}$$

Let $\alpha = p_{1,j_{\max}} / p_{2,j_{\max}}$. Because the algorithm sorts the jobs, we have $\forall j \in J_1,\ p_{1,j}/p_{2,j} \le \alpha$ and $\forall j \in J_2,\ p_{1,j}/p_{2,j} \ge \alpha$, and with (5) we have

$$\sum_{j \in J_2} p_{1,j} < \sum_{j \in J_1} p_{1,j} \le \alpha \sum_{j \in J_1} p_{2,j} < \alpha \sum_{j \in J_2} p_{2,j} \tag{6}$$

But

$$\sum_{j \in J_2} p_{1,j} \ge \alpha \sum_{j \in J_2} p_{2,j} \tag{7}$$

With the contradiction given by (6) and (7), we deduce that $W^*_1 \ge W_1$ or $W^*_2 \ge W_2$; in particular, $OPT \ge \max(W^*_1/|M_1|,\ W^*_2/|M_2|) \ge \min(W_1/|M_1|,\ W_2/|M_2|) \ge \min(C_1, C_2)$. In conclusion,

$$\min(C_1, C_2) \le OPT \tag{8}$$

Let $j'$ be the job compared to $j_{\max}$ when $j_{\max}$ is placed. Because $j_{\max}$ has been placed on $i_{\max}$ from the first cluster, we have

$$C_1 + p_{1,j_{\max}} \le C_2 + p_{2,j'} \tag{9}$$

If $C_1 \le C_2$, then $C_1 \le OPT$ and $p_{1,j_{\max}} \le OPT$, so $C_{\max}(S) \le 2\,OPT$. If $C_2 < C_1$, we have $C_2 \le OPT$, $p_{2,j'} \le OPT$, and $C_1 + p_{1,j_{\max}} \le C_2 + p_{2,j'}$, so $C_{\max}(S) \le 2\,OPT$. In both cases,

$$C_{\max}(S) \le 2\,OPT \tag{10}$$

6.2 Distributed algorithm

The Centralized Load Balancing for two clusters of machines algorithm is the base for a decentralized algorithm that balances the load of machines on two clusters of uniform machines. Indeed, like the peer-to-peer approach of the Work Stealing algorithm, the Decentralized Load Balancing for two clusters of machines algorithm (Algorithm 7) is executed on each machine, and each machine randomly selects a target machine. If both machines are from the same cluster, Greedy Load Balancing (Algorithm 6) is applied; if they are from different clusters, the Centralized Load Balancing for two clusters of machines algorithm is used to balance them (considering two sub-clusters of one machine each).

Greedy Load Balancing, used to balance two machines from the same cluster, is in particular a $3/2$-approximation [7].

Decentralized Load Balancing for two clusters of machines also ensures an interesting property: when the situation is stable, the makespan of the job distribution is bounded by twice the optimal makespan.
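The centralized procedure of Section 6.1 and its use on a pair of machines, as described in Section 6.2, can be sketched as follows. This is a hedged illustration rather than the simulator used for the experiments: costs are assumed to be strictly positive and stored per cluster as `p[1][job]` and `p[2][job]`, `cluster[machine]` is assumed to be 1 or 2, and all function names are hypothetical.

```python
def centralized_lb(p1, p2, jobs, M1, M2):
    """Centralized balancing of two clusters: fill cluster 1 from the front
    of the ratio-sorted list and cluster 2 from the back."""
    order = sorted(jobs, key=lambda j: p1[j] / p2[j])
    S = {i: [] for i in list(M1) + list(M2)}
    load = dict.fromkeys(S, 0)
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        i1 = min(M1, key=load.__getitem__)   # least loaded machine of M1
        i2 = min(M2, key=load.__getitem__)   # least loaded machine of M2
        if load[i1] + p1[order[lo]] <= load[i2] + p2[order[hi]]:
            job, machine, cost = order[lo], i1, p1[order[lo]]
            lo += 1
        else:
            job, machine, cost = order[hi], i2, p2[order[hi]]
            hi -= 1
        S[machine].append(job)
        load[machine] += cost
    return S

def greedy_lb(p_own, p_other, jobs):
    """Greedy balancing of two machines of the same cluster."""
    first, second = [], []
    load_first = load_second = 0
    for j in sorted(jobs, key=lambda j: p_own[j] / p_other[j]):
        if load_first <= load_second:
            first.append(j)
            load_first += p_own[j]
        else:
            second.append(j)
            load_second += p_own[j]
    return first, second

def balance_pair(m, i, S, cluster, p):
    """One iteration of the decentralized loop for host m and target i."""
    jobs = S[m] + S[i]
    if cluster[m] == cluster[i]:
        own, other = cluster[m], 3 - cluster[m]
        S[m], S[i] = greedy_lb(p[own], p[other], jobs)
    else:
        if cluster[m] == 2:
            m, i = i, m                      # make m the cluster-1 machine
        result = centralized_lb(p[1], p[2], jobs, [m], [i])
        S[m], S[i] = result[m], result[i]
```

For example, with costs `p1 = {"a": 1, "b": 2, "c": 4, "d": 4}` on cluster 1 and `p2 = {"a": 4, "b": 4, "c": 2, "d": 1}` on cluster 2, balancing one machine of each cluster places jobs a and b on the first machine and jobs d and c on the second, for a load of 3 on each.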

Algorithm 6 Greedy Load Balancing
Data: $m_1$, $m_2$ machines of the same cluster to balance
Data: $S$ distribution of jobs
$A := S(m_1) \cup S(m_2)$
if $m_1$ is in cluster 1 then
    Sort jobs in $A$ in increasing order of $p_{1,j}/p_{2,j}$
else
    Sort jobs in $A$ in increasing order of $p_{2,j}/p_{1,j}$
Initialize $S$ with one empty set for $m_1$ and $m_2$
while $A \ne \emptyset$ do
    $j$ := first job in $A$
    if $C(m_1) \le C(m_2)$ then
        $S(m_1) := S(m_1) \cup \{j\}$
    else
        $S(m_2) := S(m_2) \cup \{j\}$
    $A := A \setminus \{j\}$
return $S$

Algorithm 7 Decentralized Load Balancing for two clusters of machines
Data: $m$ host machine of type $k$
Data: $J(m)$ set of jobs initially on $m$
Result: $C(J(m)) \le 2\,OPT$
while true do
    Select $i \in M$
    if $i$ is in the same cluster as $m$ then
        Apply Greedy Load Balancing to $i$ and $m$
    else
        $M_1 := \{m\}$
        $M_2 := \{i\}$
        Apply Centralized Load Balancing for two clusters of machines to $M_1$ and $M_2$

Theorem 6. If the distribution $S$ provided by the execution of Algorithm 7 becomes stable (for every pair of machines, the algorithm does not move any job), $S$ is such that $C_{\max}(S) \le 2\,OPT$.

Proof. Let $S_1$ be the set of jobs assigned to cluster $M_1$ ($S_1 = \bigcup_{m \in M_1} S(m)$) and $S_2$ the set of jobs assigned to cluster $M_2$.

We first show that the jobs in $S_2$ have a ratio $p_{1,j}/p_{2,j}$ at least as large as that of the jobs in $S_1$. Let $j_1$ be such that $p_{1,j_1}/p_{2,j_1} = \max_{j \in S_1} p_{1,j}/p_{2,j}$ and $j_2$ be such that $p_{1,j_2}/p_{2,j_2} = \min_{j \in S_2} p_{1,j}/p_{2,j}$. By contradiction, let us assume that $p_{1,j_2}/p_{2,j_2} < p_{1,j_1}/p_{2,j_1}$. Let us denote by $m_1$ the machine on which $j_1$ is scheduled, and by $m_2$ the machine on which $j_2$ is scheduled. Because Decentralized Load Balancing for two clusters of machines has been executed for all pairs of machines and does not change the solution, it has been executed to balance $m_1$ and $m_2$ in particular. As we saw in the proof of Theorem 5, there exists $\alpha$ such that $\forall j \in S(m_1),\ \forall j' \in S(m_2)$, $p_{1,j}/p_{2,j} \le \alpha$ and $p_{1,j'}/p_{2,j'} \ge \alpha$, which contradicts the assumption on $j_1$ and $j_2$.

From these we conclude that

$$\forall j_1 \in S_1,\ \forall j_2 \in S_2,\quad p_{1,j_1}/p_{2,j_1} \le p_{1,j_2}/p_{2,j_2} \tag{11}$$

Let $i_{\max}$ be a machine such that $C_{\max}(S) = C(i_{\max})$. Without any loss of generality, we assume that $i_{\max} \in M_1$. Let $j_{\max}$ be a job assigned to $i_{\max}$ such that $p_{1,j_{\max}}/p_{2,j_{\max}} = \max_{j \in S(i_{\max})} p_{1,j}/p_{2,j}$. Let $C_1 = C(i_{\max}) - p_{1,j_{\max}}$. Since Greedy Load Balancing has been applied between the machines of the same cluster, we have the property

$$\forall i \in M_1,\quad C_1 \le C(i) \tag{12}$$

Let $C_2$ and $i_2$ be such that $C_2 = C(i_2) = \min_{i \in M_2} C(i)$. Using the same reasoning by contradiction as in the proof of Theorem 5 with $\alpha = p_{1,j_1}/p_{2,j_1}$, we get

$$\min(C_1, C_2) \le OPT \tag{13}$$

Let us consider an execution of Centralized Load Balancing for two clusters of machines between $i_{\max}$ and $i_2$.

If $j_{\max}$ is not the last job placed by the algorithm, then $j_{\max}$ has been compared to a job $j'$ which has been placed afterward, so $C_1 + p_{1,j_{\max}} \le C_2$; but $C_1 + p_{1,j_{\max}} \ge C_2$ (because $C_1 + p_{1,j_{\max}} = C_{\max}(S)$). Hence $C_1 \le C_2$, so we deduce $C_1 \le OPT$, and with $p_{1,j_{\max}} \le OPT$ we have $C_{\max}(S) \le 2\,OPT$.

If $j_{\max}$ is the last job placed by the algorithm, we have $C_1 + p_{1,j_{\max}} \le C_2 + p_{2,j_{\max}}$. Hence $C_{\max}(S) \le 2\,OPT$.

In both cases we have

$$C_{\max}(S) \le 2\,OPT \tag{14}$$

This result is valid when the situation is stable, and the same upper bound can be obtained for any situation where $i_{\max}$ cannot exchange any job with any other machine. However, convergence is not always possible: there exist cycles of exchanges (Fig. 3).

[Figure 3: panels (a), (b), and (c) show three successive job distributions over machines A, B, and C, labelled with the balancing operations balancing(B,C), balancing(A,B), and balancing(A,C); panel (d) gives the cost of each job.]

(d) Cost of each job

Job    Cluster 1 = {A,B}   Cluster 2 = {C}
1      4ε                  1
2      4ε                  1
3      ε                   2ε
4      1 - 8ε              1
5      1                   2ε

Fig. 3. The cycle [(a),(b),(c)] is an example of a cycle in the execution of Decentralized Load Balancing for two clusters of machines (Algorithm 7). For each step of this cycle, there is only one non-trivial balancing possible and it leads to the next step.

6.3 Experiments

Theorem 6 ensures an upper bound when the partition is stable but, as Figure 3 demonstrates, convergence may not happen. In this part, we experimentally study the convergence of the solution generated by Algorithm 7. We focus on two clusters of similar size (often there are one or two CPUs and one GPU per machine) and use randomly generated jobs which are initially distributed randomly among the machines.

We developed a simulator that implements Centralized Load Balancing for two clusters of machines and Decentralized Load Balancing for two clusters of machines (Algorithms 5 and 7). For each iteration of the simulation, a pair of machines is randomly chosen and then balanced using the decentralized algorithm. As the partition can cycle, the process is limited to at most 1,000,000 iterations. We computed the makespan of the distribution obtained as well as the stability of the partition for clusters composed of 64 and 32 machines.

The results show that stability is not often obtained: only 18% of the simulations are stable when the average number of jobs per machine is smaller than 4, and this fraction quickly drops as the average number of jobs per machine increases (Fig. 4), eventually reaching 0%. However, the same simulation shows that the makespan of the solution created by the decentralized algorithm quickly gets under 150% of the makespan obtained with the centralized algorithm (we denote this value as 1.5cent). This threshold is at most 3 times the optimal makespan, and hopefully much smaller, as the centralized algorithm is a proven 2-approximation. The experiments show that 90% of the partitions created have reached this threshold in less than 350 balancings, which is less than 4 balancings per machine in this case.
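The two numerical remarks of the experiments follow from short computations, written out here as a worked step. The symbols $S_{\mathrm{dec}}$ and $S_{\mathrm{cent}}$ are introduced only for this illustration and denote the partitions produced by the decentralized and the centralized algorithm, respectively.

```latex
% Threshold versus optimum: the 1.5cent threshold combined with Theorem 5.
C_{\max}(S_{\mathrm{dec}}) \le 1.5\, C_{\max}(S_{\mathrm{cent}})
                           \le 1.5 \times 2\, OPT = 3\, OPT

% Balancings per machine: 350 balancings over 64 + 32 machines.
\frac{350}{64 + 32} = \frac{350}{96} \approx 3.65 < 4
```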

[Figure 4: (Left) fraction of stability attained as a function of the number of jobs (96 to 3072). (Right) fraction of experiments where Cmax is below 1.5cent as a function of the number of iterations (0 to 500), for 96, 192, 384, 768, 1536, and 3072 jobs.]

Fig. 4. Experiments on Decentralized Load Balancing for two clusters of machines with a cluster of 64 machines and a cluster of 32 machines. (Left) Percentage of experiments that produced a stable partition. (Right) Percentage of partitions with a makespan smaller than 1.5cent after a number of iterations.

7 Conclusion

In this work we propose two decentralized algorithms designed to balance the work on heterogeneous machines in two different cases. The Multiple Job Type Balancing algorithm makes use of the limited number of classes of jobs and has an approximation ratio bounded above by the number of types of jobs considered. The Decentralized Load Balancing for two clusters of machines algorithm has a proven approximation ratio of 2 if convergence is reached. However, experiments showed that the distribution is often unstable, but also that the decentralized algorithm quickly provides a distribution close to that of its centralized counterpart.

Providing an upper bound for all the distributions created by the Decentralized Load Balancing for two clusters of machines algorithm and extending it to more than two clusters of machines are possible future works. Finally, it could be interesting to study implementations of this algorithm, in particular the convergence time and the quality of the approximation.

Acknowledgements

Many thanks to Erik Saule for the opportunity and help provided during the training course.

References

1. Michael A. Bender and Michael O. Rabin. Online scheduling of parallel programs on heterogeneous systems with applications to Cilk. Theory of Computing Systems, Special Issue on SPAA, 35(3):289-304, 2002.
2. Robert D. Blumofe. Scheduling multithreaded computations by work stealing. In FOCS, pages 356-368, 1994.
3. Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, 1995.
4. F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, FPCA '81, pages 187-194, 1981.
5. Lin Chen, Deshi Ye, and Guochuan Zhang. Online scheduling on a CPU-GPU cluster. In T-H. Hubert Chan, Lap Chi Lau, and Luca Trevisan, editors, Theory and Applications of Models of Computation, volume 7876 of Lecture Notes in Computer Science, pages 1-9. Springer Berlin Heidelberg, 2013.
6. Michael R. Garey and David S. Johnson. Strong NP-completeness results: Motivation, examples, and implications. J. ACM, 25(3):499-508, 1978.
7. Ronald L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal of Applied Mathematics, 17(2):416-429, 1969.
8. Ronald L. Graham, Eugene L. Lawler, Jan Karel Lenstra, and Alexander H. G. Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of Discrete Mathematics, 5(2):287-326, 1979.
9. Ralf Hoffmann, Matthias Korch, and Thomas Rauber. Performance evaluation of task pools based on hardware synchronization. In SC, page 44, 2004.
10. Eugene L. Lawler and Jacques Labetoulle. On preemptive scheduling of unrelated parallel processors by linear programming. J. ACM, 25(4):612-619, 1978.
11. Jan Karel Lenstra, David B. Shmoys, and Éva Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46:259-271, 1990.