Assigning Tasks for Efficiency in Hadoop

Transcription

Assigning Tasks for Efficiency in Hadoop [Extended Abstract]

Michael J. Fischer, Computer Science, Yale University, P.O. Box, New Haven, CT, USA
Xueyuan Su, Computer Science, Yale University, P.O. Box, New Haven, CT, USA
Yitong Yin, State Key Laboratory for Novel Software Technology, Nanjing University, China

Supported by the Kempner Fellowship from the Department of Computer Science at Yale University.
Supported by the National Science Foundation of China under Grant No. This work was done when Yitong Yin was at Yale University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA'10, June 13-15, 2010, Thira, Santorini, Greece. Copyright 2010 ACM.

ABSTRACT

In recent years Google's MapReduce has emerged as a leading large-scale data processing architecture. Adopted by companies such as Amazon, Facebook, Google, IBM and Yahoo! in daily use, and more recently put in use by several universities, it allows parallel processing of huge volumes of data over clusters of machines. Hadoop is a free Java implementation of MapReduce. In Hadoop, files are split into blocks that are replicated and spread over all servers in a network. Each job is also split into many small pieces called tasks. Several tasks are processed on a single server, and a job is not completed until all the assigned tasks are finished. A crucial factor that affects the completion time of a job is the particular assignment of tasks to servers. Given a placement of the input data over servers, one wishes to find the assignment that minimizes the completion time. In this paper, an idealized Hadoop model is proposed to investigate the Hadoop task assignment problem. It is shown that there is no feasible algorithm to find the optimal Hadoop task assignment unless $\mathcal{P} = \mathcal{NP}$. Assignments that are computed by the round robin algorithm inspired by the current Hadoop scheduler are shown to deviate from optimum by a multiplicative factor in the worst case. A flow-based algorithm is presented that computes assignments that are optimal to within an additive constant.

Categories and Subject Descriptors

D.3.2 [Programming Languages]: Language Classifications (concurrent, distributed, and parallel languages); F.1.2 [Computation by Abstract Devices]: Modes of Computation (parallelism and concurrency); F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes (reducibility and completeness); F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems (sequencing and scheduling)

General Terms

Algorithms, Performance, Theory

Keywords

task assignment, load balancing, NP-completeness, approximation algorithm, MapReduce, Hadoop

1. INTRODUCTION

1.1 Background

The cloud computing paradigm has recently received significant attention in the media. The cloud is a metaphor for the Internet, which is an abstraction for the complex infrastructure it conceals. Cloud computing refers to both the applications delivered as services over the Internet and the hardware and software that provide such services. It envisions shifting data storage and computing power away from local servers, across the network cloud, and into large clusters of machines hosted by companies such as Amazon, Google, IBM, Microsoft, Yahoo! and so on.

Google's MapReduce [8, 9, 6] parallel computing architecture, for example, splits workload over large clusters of commodity PCs and enables automatic parallelization. By exploiting parallel processing, it provides a software platform that lets one easily write and run applications that process vast amounts of data.

Apache Hadoop [4] is a free Java implementation of MapReduce in the open source software community. It was originally designed to efficiently process large volumes of data by parallel processing over commodity computers in local networks.
In academia, researchers have adapted Hadoop to several different architectures. For example, Ranger et al. [8] evaluate MapReduce in multi-core and multi-processor systems, Kruijf et al. [7] implement MapReduce on the Cell B.E. processor architecture, and He et al. [4] propose a MapReduce framework on graphics processors. Many related applications using Hadoop have also been developed to solve various practical problems.

1.2 The MapReduce Framework

A Hadoop system runs on top of a distributed file system, called the Hadoop Distributed File System (HDFS). HDFS usually runs on networked commodity PCs, where data are replicated and locally stored on hard disks of each machine. To store and process huge volumes of data sets, HDFS typically uses a block size of 64MB. Therefore, moving computation close to the data is a design goal in the MapReduce framework.

In the MapReduce framework, any application is specified by jobs. A MapReduce job splits the input data into independent blocks, which are processed by the map tasks in parallel. Each map task processes a single block consisting of some number of records. Each record in turn consists of a key/value pair. A map task applies the user-defined map function to each input key/value pair and produces intermediate key/value pairs. The framework then sorts the intermediate data and forwards them to the reduce tasks via the interconnecting network. After receiving all intermediate key/value pairs with the same key, a reduce task executes the user-defined reduce function and produces the output data. Finally, these output data are written back to the HDFS.

In such a framework, there is a single server, called the master, that keeps track of all jobs in the whole distributed system. The master runs a special process, called the jobtracker, that is responsible for task assignment and scheduling for the whole system. Each of the remaining servers, called slaves, runs a process called the tasktracker. The tasktracker schedules the several tasks assigned to its server in a way similar to a normal operating system.

The map task assignment is a vital part that affects the completion time of the whole job. First, each reduce task cannot begin until it receives the required intermediate data from all finished map tasks. Second, the assignment determines the location of intermediate data and the pattern of the communication traffic.
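The map, sort and reduce steps described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the word-count job and the names `map_fn`, `reduce_fn` and `run_job` are assumptions chosen for illustration, and each inner list stands in for one input block handled by one map task.

```python
from collections import defaultdict

def map_fn(key, value):
    """User-defined map (illustrative): emit (word, 1) for every word in a record."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """User-defined reduce (illustrative): sum all counts for one intermediate key."""
    return key, sum(values)

def run_job(blocks):
    # Map phase: one map task per block; a block is a list of key/value records.
    intermediate = []
    for block in blocks:
        for key, value in block:
            intermediate.extend(map_fn(key, value))
    # Sort/shuffle phase: group the intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

blocks = [[(0, "a b a")], [(1, "b c")]]   # two blocks, one record each
result = run_job(blocks)
```

In a real cluster the map calls for different blocks run on different servers, which is why the choice of server for each map task matters for completion time.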
Therefore, some algorithms should be in place to optimize the task assignment.

Footnote (to the statement in Section 1.2 that each map task processes a single block): Strictly speaking, a map task in Hadoop sometimes processes data that comes from two successive file blocks. This occurs because file blocks do not respect logical record boundaries, so the last logical record processed by a map task might lie partly in the current data block and partly in the succeeding block, requiring the map task to access the succeeding block in order to fetch the tail end of its last logical record.

1.3 Related Work

Since Kuhn [5] proposed the first method for the classic assignment problem in 1955, variations of the assignment problem have been under extensive study in many areas [5]. In the classic assignment problem, there are identical numbers of jobs and persons. An assignment is a one-to-one mapping from jobs to persons. Each job introduces a cost when it is assigned to a person. Therefore, an optimal assignment minimizes the total cost over all persons.

In the area of parallel and distributed computing, when jobs are processed in parallel over several machines, one is interested in minimizing the maximum processing time of any machine. This problem is sometimes called the minimum makespan scheduling problem. This problem in general is known to be $\mathcal{NP}$-complete []. Under the identical-machine model, there are some well-known approximation algorithms. For example, Graham [2] proposed a $(2 - 1/n)$-approximation algorithm in 1966, where $n$ is the total number of machines. Graham [3] proposed another $4/3$-approximation algorithm in 1969. However, under the unrelated-machine model, this problem is known to be APX-hard, both in terms of its offline [7] and online [, 2] approximability.

As some researchers [3, 4] pointed out, the scheduling mechanisms and policies that assign tasks to servers within the MapReduce framework can have a profound effect on efficiency. An early version of Hadoop uses a simple heuristic algorithm that greedily exploits data locality.
Zaharia, Konwinski and Joseph [9] proposed some heuristic refinements based on experimental results.

1.4 Our Contributions

We investigate task assignment in Hadoop. In Section 2, we propose an idealized Hadoop model to evaluate the cost of task assignments. Based on this model, we show in Section 3 that there is no feasible algorithm to find the optimal assignment unless $\mathcal{P} = \mathcal{NP}$. In Section 4, we show that task assignments computed by a simple greedy round-robin algorithm might deviate from the optimum by a multiplicative factor. In Section 5, we present an algorithm that employs maximum flow and increasing threshold techniques to compute task assignments that are optimal to within an additive constant.

2. PROBLEM FORMALIZATION

Definition 1. A Map-Reduce schema (MR-schema) is a pair $(T, S)$, where $T$ is a set of tasks and $S$ is a set of servers. Let $m = |T|$ and $n = |S|$. A task assignment is a function $A : T \to S$ that assigns each task $t$ to a server $A(t)$ (see footnote 2). Let $\mathcal{A} = \{T \to S\}$ be the set of all possible task assignments. An MR-system is a triple $(T, S, w)$, where $(T, S)$ is an MR-schema and $w : T \times \mathcal{A} \to \mathbb{Q}^+$ is a cost function.

Intuitively, $w(t, A)$ is the time to perform task $t$ on server $A(t)$ in the context of the complete assignment $A$. The motivation for this level of generality is that the time to execute a task $t$ in Hadoop depends not only on the task and the server speed, but also on possible network congestion, which in turn is influenced by the other tasks running on the cluster.

Definition 2. The load of server $s$ under assignment $A$ is defined as $L_s^A = \sum_{t : A(t) = s} w(t, A)$. The maximum load under assignment $A$ is defined as $L^A = \max_s L_s^A$. The total load under assignment $A$ is defined as $H^A = \sum_s L_s^A$.

An MR-system models a cloud computer where all servers work in parallel. Tasks assigned to the same server are processed sequentially, whereas tasks assigned to different servers run in parallel. Thus, the total completion time of the cloud under task assignment $A$ is given by the maximum load $L^A$.

Footnote 2: In an MR-schema, it is common that $|T| \ge |S|$. Therefore in this paper, unlike the classic assignment problem where an assignment refers to a one-to-one mapping or a permutation [5, 5], we instead use the notion of many-to-one mapping.
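To make Definition 2 concrete, the following is a minimal sketch of how the per-server load, the maximum load $L^A$ and the total load $H^A$ could be computed for an arbitrary cost function. The function `loads` and the congestion-style `toy_cost` are illustrative assumptions, not part of the paper; `toy_cost` only demonstrates that $w(t, A)$ may depend on the whole assignment.

```python
from collections import defaultdict
from fractions import Fraction

def loads(assignment, w):
    """Return (per-server loads L_s^A, maximum load L^A, total load H^A)."""
    per_server = defaultdict(Fraction)
    for task, server in assignment.items():
        per_server[server] += w(task, assignment)
    max_load = max(per_server.values())
    total_load = sum(per_server.values())
    return dict(per_server), max_load, total_load

def toy_cost(task, assignment):
    """Hypothetical cost: 1, plus 1/2 for every other task on the same server."""
    sharing = sum(1 for t, s in assignment.items()
                  if s == assignment[task] and t != task)
    return Fraction(1) + Fraction(sharing, 2)

# Many-to-one assignment: two tasks on s1, one task on s2.
A = {"t1": "s1", "t2": "s1", "t3": "s2"}
per_server, max_load, total_load = loads(A, toy_cost)
```

Servers that receive no task have load 0 and are simply absent from the returned dictionary; the maximum load is the modelled completion time of the job.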

Our notion of an MR-system is very general and admits arbitrary cost functions. To usefully model Hadoop as an MR-system, we need a realistic but simplified cost model. In Hadoop, the cost of a map task depends largely on the location of its data. If the data is on the server's local disk, then the cost (execution time) is considerably lower than if the data is located remotely and must be fetched across the network before being processed.

We make several simplifying assumptions. We assume that all tasks and all servers are identical, so that for any particular assignment of tasks to servers, all tasks whose data is locally available take the same amount of time $w_{loc}$, and all tasks whose data is remote take the same amount of time $w_{rem}$. However, we do not assume that $w_{rem}$ is constant over all assignments. Rather, we let it grow with the total number of tasks whose data is remote. This reflects the increased data fetch time due to overall network congestion. Thus, $w_{rem}(r)$ is the cost of each remote task in every assignment with exactly $r$ remote tasks. We assume that $w_{rem}(r) \ge w_{loc}$ for all $r$ and that $w_{rem}(r)$ is (weakly) monotone increasing in $r$.

We formalize these concepts below. In each of the following, $(T, S)$ is an MR-schema.

Definition 3. A data placement is a relation $\rho \subseteq T \times S$ such that for every task $t \in T$, there exists at least one server $s \in S$ such that $\rho(t, s)$ holds.

The placement relation describes where the input data blocks are placed. If $\rho(t, s)$ holds, then server $s$ locally stores a replica of the data block that task $t$ needs.

Definition 4. We represent the placement relation $\rho$ by an unweighted bipartite graph, called the placement graph. In the placement graph $G_\rho = ((T, S), E)$, $T$ consists of $m$ task nodes and $S$ consists of $n$ server nodes. There is an edge $(t, s) \in E$ iff $\rho(t, s)$ holds.

Definition 5. A partial assignment $\alpha$ is a partial function from $T$ to $S$. We regard a partial assignment as a set of ordered pairs with pairwise distinct first elements, so for partial assignments $\beta$ and $\alpha$, $\beta \supseteq \alpha$ means $\beta$ extends $\alpha$. If $s \in S$, the restriction of $\alpha$ to $s$ is the partial assignment $\alpha|_s = \alpha \cap (T \times \{s\})$.
Thus, $\alpha|_s$ agrees with $\alpha$ for those tasks that $\alpha$ assigns to $s$, but all other tasks are unassigned in $\alpha|_s$.

Definition 6. Let $\rho$ be a data placement and $\beta$ be a partial assignment. A task $t \in T$ is local in $\beta$ if $\beta(t)$ is defined and $\rho(t, \beta(t))$. A task $t \in T$ is remote in $\beta$ if $\beta(t)$ is defined and $\neg\rho(t, \beta(t))$. Otherwise $t$ is unassigned in $\beta$. Let $\ell^\beta$, $r^\beta$ and $u^\beta$ be the number of local tasks, remote tasks, and unassigned tasks in $\beta$, respectively. For any $s \in S$, let $\ell_s^\beta$ be the number of local tasks assigned to $s$ by $\beta$. Let $k^\beta = \max_{s \in S} \ell_s^\beta$.

Definition 7. Let $\rho$ be a data placement, $\beta$ be a partial assignment, $w_{loc} \in \mathbb{Q}^+$, and $w_{rem} : \mathbb{N} \to \mathbb{Q}^+$ such that $w_{loc} \le w_{rem}(0) \le w_{rem}(1) \le w_{rem}(2) \le \cdots$. Let $w_{rem}^\beta = w_{rem}(r^\beta + u^\beta)$. The Hadoop cost function with parameters $\rho$, $w_{loc}$, and $w_{rem}(\cdot)$ is the function $w$ defined by

$$w(t, \beta) = \begin{cases} w_{loc} & \text{if } t \text{ is local in } \beta, \\ w_{rem}^\beta & \text{otherwise.} \end{cases}$$

We call $\rho$ the placement of $w$, and $w_{loc}$ and $w_{rem}(\cdot)$ the local and remote costs of $w$, respectively. Let $K^\beta = k^\beta \cdot w_{loc}$.

The definition of remote cost under a partial assignment $\beta$ is pessimistic. It assumes that tasks not assigned by $\beta$ will eventually become remote, and each remote task will eventually have cost $w_{rem}(r^\beta + u^\beta)$. This definition agrees with the definition of remote cost under a complete assignment $A$, because $u^A = 0$ and thus $w_{rem}^A = w_{rem}(r^A + u^A) = w_{rem}(r^A)$.

Since $\rho$ is encoded by $mn$ bits, $w_{loc}$ is encoded by one rational number, and $w_{rem}(\cdot)$ is encoded by $m + 1$ rational numbers, the Hadoop cost function $w(\rho, w_{loc}, w_{rem}(\cdot))$ is encoded by $mn$ bits plus $m + 2$ rational numbers.

Definition 8. A Hadoop MR-system (HMR-system) is the MR-system $(T, S, w)$, where $w$ is the Hadoop cost function with parameters $\rho$, $w_{loc}$, and $w_{rem}(\cdot)$. An HMR-system is defined by $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.

Problem 1. Hadoop Task Assignment Problem (HTA).
1. Instance: An HMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.
2. Objective: Find an assignment $A$ that minimizes $L^A$.

Sometimes the cost of running a task on a server only depends on the placement relation and its data locality, but not on the assignment of other tasks.

Definition 9.
A Hadoop cost function $w$ is called uniform if $w_{rem}(r) = c$ for some constant $c$ and all $r \in \mathbb{N}$. A uniform HMR-system (UHMR-system) is an HMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$, where $w$ is uniform.

Problem 2. Uniform Hadoop Task Assignment Problem (UHTA).
1. Instance: A UHMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.
2. Objective: Find an assignment $A$ that minimizes $L^A$.

The number of replicas of each data block may be bounded, often by a small number such as 2 or 3.

Definition 10. Call a placement graph $G = ((T, S), E)$ $j$-replica-bounded if the degree of $t$ is at most $j$ for all $t \in T$. A $j$-replica-bounded UHMR-system ($j$-UHMR-system) is a UHMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$, where $G_\rho$ is $j$-replica-bounded.

Problem 3. $j$-Uniform Hadoop Task Assignment Problem ($j$-UHTA).
1. Instance: A $j$-UHMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.
2. Objective: Find an assignment $A$ that minimizes $L^A$.

3. HARDNESS OF TASK ASSIGNMENT

In this section, we analyze the hardness of the various HTA optimization problems by showing the corresponding decision problems to be $\mathcal{NP}$-complete.
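The cost model of Section 2 and the optimization objective whose hardness is studied here can be made concrete with a small sketch. It assumes a placement given as a dictionary from task to the set of servers holding its data and a remote cost given as a Python callable; the names `max_load`, `optimal_assignment` and `is_k_admissible` are illustrative. The exhaustive search enumerates all $n^m$ assignments, so it is only usable on toy instances, which is consistent with the hardness result rather than a way around it.

```python
from itertools import product

def max_load(assignment, placement, w_loc, w_rem):
    """Maximum load L^A of a complete assignment under the Hadoop cost function."""
    # r^A: number of tasks placed on a server that does not hold their data.
    remote = sum(1 for t, s in assignment.items() if s not in placement[t])
    load = {}
    for t, s in assignment.items():
        cost = w_loc if s in placement[t] else w_rem(remote)
        load[s] = load.get(s, 0) + cost
    return max(load.values())

def optimal_assignment(tasks, servers, placement, w_loc, w_rem):
    """Brute-force HTA: try every assignment and keep one of least maximum load."""
    best, best_load = None, None
    for choice in product(servers, repeat=len(tasks)):
        candidate = dict(zip(tasks, choice))
        cand_load = max_load(candidate, placement, w_loc, w_rem)
        if best_load is None or cand_load < best_load:
            best, best_load = candidate, cand_load
    return best, best_load

def is_k_admissible(tasks, servers, placement, w_loc, w_rem, k):
    """Decision version: does some assignment have maximum load at most k?"""
    return optimal_assignment(tasks, servers, placement, w_loc, w_rem)[1] <= k

tasks = ["t1", "t2", "t3", "t4"]
servers = ["s1", "s2"]
placement = {"t1": {"s1"}, "t2": {"s1"}, "t3": {"s1"}, "t4": {"s1", "s2"}}
uniform_rem = lambda r: 3          # uniform cost function: w_rem(r) = 3 for all r
best, best_load = optimal_assignment(tasks, servers, placement, 1, uniform_rem)
```

In this toy instance server s2 can host at most one local task, so no assignment achieves maximum load 2, while placing t4 on s2 and the rest on s1 achieves load 3.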

3.1 Task Assignment Decision Problems

Definition 11. Given a server capacity $k$, a task assignment $A$ is $k$-feasible if $L^A \le k$. An HMR-system is $k$-admissible if there exists a $k$-feasible task assignment.

The decision problem corresponding to a class of HMR-systems and capacity $k$ asks whether a given HMR-system in the class is $k$-admissible. Thus, the $k$-HTA problem asks about arbitrary HMR-systems, the $k$-UHTA problem asks about arbitrary UHMR-systems, and the $k$-$j$-UHTA problem (which we write $(j, k)$-UHTA) asks about arbitrary $j$-UHMR-systems.

3.2 $\mathcal{NP}$-completeness of (2,3)-UHTA

The (2,3)-UHTA problem is a very restricted subclass of the general $k$-admissibility problem for HMR-systems. In this section, we restrict even further by taking $w_{loc} = 1$ and $w_{rem} = 3$. This problem represents a simple scenario where the cost function assumes only the two possible values 1 and 3, each data block has at most 2 replicas, and each server has capacity 3. Despite its obvious simplicity, we show that (2,3)-UHTA is $\mathcal{NP}$-complete. It follows that all of the less restrictive decision problems are also $\mathcal{NP}$-complete, and the corresponding optimization problems do not have feasible solutions unless $\mathcal{P} = \mathcal{NP}$.

Theorem 3.1. (2,3)-UHTA with costs $w_{loc} = 1$ and $w_{rem} = 3$ is $\mathcal{NP}$-complete.

The proof method is to construct a polynomial-time reduction from 3SAT to (2,3)-UHTA. Let $\mathcal{G}$ be the set of all 2-replica-bounded placement graphs. Given $G_\rho \in \mathcal{G}$, we define the HMR-system $M_G = (T, S, \rho, w_{loc}, w_{rem}(\cdot))$, where $w_{loc} = 1$ and $w_{rem}(r) = 3$ for all $r$. We say that $G$ is 3-admissible if $M_G$ is 3-admissible. We construct a polynomial-time computable mapping $f : \text{3CNF} \to \mathcal{G}$, and show that a 3CNF formula $\phi$ is satisfiable iff $f(\phi)$ is 3-admissible. We shorten "3-admissible" to "admissible" in the following discussion.

We first describe the construction of $f$. Let $\phi = C_1 \wedge C_2 \wedge \cdots \wedge C_\alpha$ be a 3CNF formula, where each $C_u = (l_{u1} \vee l_{u2} \vee l_{u3})$ is a clause and each $l_{uv}$ is a literal. Let $x_1, \ldots, x_\beta$ be the variables that appear in $\phi$. Therefore, $\phi$ contains exactly $3\alpha$ instances of literals, each of which is either $x_i$ or $\neg x_i$, where $i \in [1, \beta]$ (see footnote 3). Let $\omega$ be the maximum number of occurrences of any literal in $\phi$.
Table 1 summarizes the parameters of $\phi$.

Table 1: Parameters of the 3CNF $\phi$
  clauses ($C_u$)                     $\alpha$
  variables ($x_i$)                   $\beta$
  literals ($l_{uv}$)                 $3\alpha$
  max-occurrence of any literal       $\omega$

For example, in $\phi = (x_1 \vee x_2 \vee x_3) \wedge (x_1 \vee x_4 \vee x_5) \wedge (\neg x_1 \vee x_4 \vee x_6)$, we have $\alpha = 3$, $\beta = 6$, and $\omega = 2$ since $x_1$ occurs twice.

Given $\phi$, we construct the corresponding placement graph $G$, which comprises several disjoint copies of the three types of gadget described below, connected together with additional edges.

The first type of gadget is called a clause gadget. Each clause gadget $u$ contains a clause server $C_u$, three literal tasks $l_{u1}, l_{u2}, l_{u3}$ and an auxiliary task $a_u$. There is an edge between each of these tasks and the clause server. Since $\phi$ contains $\alpha$ clauses, $G$ contains $\alpha$ clause gadgets. Thus, $G$ contains $\alpha$ clause servers, $3\alpha$ literal tasks and $\alpha$ auxiliary tasks. Figure 1 describes the structure of the $u$-th clause gadget. We use circles and boxes to represent tasks and servers, respectively.

[Figure 1: The structure of the $u$-th clause gadget.]

Footnote 3: The notation $[a, b]$ in our discussion represents the set of integers $\{a, a + 1, \ldots, b - 1, b\}$.

The second type of gadget is called a variable gadget. Each variable gadget contains $2\omega$ ring servers placed around a circle. Let $R_j^{(i)}$ denote the server at position $j \in [1, 2\omega]$ in ring $i$. Define the set $T_i$ to be the servers in odd-numbered positions. Similarly, define the set $F_i$ to be the servers in even-numbered positions. Between each pair of ring servers $R_j^{(i)}$ and $R_{j+1}^{(i)}$, we place a ring task $r_j^{(i)}$ connected to its two neighboring servers. To complete the circle, $r_{2\omega}^{(i)}$ is connected to $R_{2\omega}^{(i)}$ and $R_1^{(i)}$. There are also $\omega$ variable tasks $v_j^{(i)} : j \in [1, \omega]$ in ring $i$, but they do not connect to any ring server. Since $\phi$ contains $\beta$ variables, $G$ contains $\beta$ variable gadgets. Thus, $G$ contains $2\beta\omega$ ring servers, $2\beta\omega$ ring tasks and $\beta\omega$ variable tasks. Figure 2 describes the structure of the $i$-th variable gadget.

[Figure 2: The structure of the $i$-th variable gadget.]

The third type of gadget is called a sink gadget.
The sink gadget contains a sink server $P$ and three sink tasks $p_1, p_2, p_3$. Each sink task is connected to the sink server. $G$ contains only one sink gadget. Figure 3 describes the structure of the sink gadget.

[Figure 3: The structure of the sink gadget.]

There are also some inter-gadget edges in $G$. We connect each variable task $v_j^{(i)}$ to the sink server $P$. We also connect each literal task $l_{uv}$ to a unique ring server $R_j^{(i)}$. To be more precise, if literal $l_{uv}$ is the $j$-th occurrence of $x_i$ in $\phi$, connect the literal task $l_{uv}$ to ring server $R_{2j-1}^{(i)} \in T_i$; if literal $l_{uv}$ is the $j$-th occurrence of $\neg x_i$ in $\phi$, connect the literal task $l_{uv}$ to ring server $R_{2j}^{(i)} \in F_i$. These inter-gadget edges complete the graph $G$. Table 2 summarizes the parameters of $G$.

Table 2: Parameters of the HMR-graph $G$
  clause servers $C_u$            $\alpha$
  literal tasks $l_{uv}$          $3\alpha$
  auxiliary tasks $a_u$           $\alpha$
  ring servers $R_j^{(i)}$        $2\beta\omega$
  ring tasks $r_j^{(i)}$          $2\beta\omega$
  variable tasks $v_j^{(i)}$      $\beta\omega$
  sink server $P$                 1
  sink tasks $p_j$                3

Lemma 3.2. For any $\phi \in \text{3CNF}$, the graph $f(\phi)$ is 2-replica-bounded.

Proof. We count the number of edges from each task node in $f(\phi)$. Each literal task has 2 edges (one to its clause server and one to its ring server), each auxiliary task has 1 edge, each ring task has 2 edges, each variable task has 1 edge, and each sink task has 1 edge. Therefore, $f(\phi)$ is 2-replica-bounded.

The following lemma is immediate.

Lemma 3.3. The mapping $f : \text{3CNF} \to \mathcal{G}$ is polynomial-time computable.

Lemma 3.4. If $\phi$ is satisfiable, then $G = f(\phi)$ is admissible.

Proof. Let $\sigma$ be a satisfying truth assignment for $\phi$; we construct a feasible assignment $A$ in $G = f(\phi)$. First of all, assign each sink task to the sink server, i.e., let $A(p_i) = P$ for all $i \in [1, 3]$. Then assign each auxiliary task $a_u$ to the clause server $C_u$, i.e., let $A(a_u) = C_u$ for all $u \in [1, \alpha]$. If $\sigma(x_i) = \text{true}$, then assign ring tasks $r_j^{(i)} : j \in [1, 2\omega]$ to ring servers in $T_i$, and variable tasks $v_j^{(i)} : j \in [1, \omega]$ to ring servers in $F_i$. If $\sigma(x_i) = \text{false}$, then assign ring tasks $r_j^{(i)} : j \in [1, 2\omega]$ to ring servers in $F_i$, and variable tasks $v_j^{(i)} : j \in [1, \omega]$ to ring servers in $T_i$. If literal $l_{uv} = x_i$ and $\sigma(x_i) = \text{true}$, then assign task $l_{uv}$ to its local ring server in $T_i$. If literal $l_{uv} = \neg x_i$ and $\sigma(x_i) = \text{false}$, then assign task $l_{uv}$ to its local ring server in $F_i$. Otherwise, assign task $l_{uv}$ to its local clause server $C_u$.

We then check that this task assignment is feasible. Each ring server is assigned either at most three local tasks (two ring tasks and one literal task), or one remote variable task. In either case, the load does not exceed the capacity 3. The number of tasks assigned to each clause server $C_u$ is exactly the number of false literals in $C_u$ under $\sigma$ plus one (the auxiliary task), and each task is local to $C_u$. Since $\sigma$ satisfies $C_u$, at most two of its literals are false, and thus the load is at most 3. The sink server is assigned three local sink tasks and the load is exactly 3. Therefore, all constraints are satisfied and $A$ is feasible. This completes the proof of Lemma 3.4.

The proof of the converse of Lemma 3.4 is more involved. The method is as follows: given a feasible assignment $A$ in $G = f(\phi)$, we first construct a feasible assignment $B$ in $G$ such that $B(t) \ne P$ for all $t \in T - \{p_1, p_2, p_3\}$. Then we remove the sink tasks and the sink server from further consideration and consider the resulting graph $G'$. After that, we partition $G'$ into two subgraphs, and construct a feasible assignment $B'$ such that no tasks from one partition are remotely assigned to servers in the other partition. This step involves a case analysis. Finally, a natural way of constructing the satisfying truth assignment for $\phi$ follows.

Lemma 3.5. Let $A$ be a feasible task assignment. Then there exists a feasible task assignment $B$ such that $B(t) \ne P$ for all $t \in T - \{p_1, p_2, p_3\}$.

Proof. When $A$ satisfies $A(t) \ne P$ for all $t \in T - \{p_1, p_2, p_3\}$, let $B = A$. Otherwise, assume there exists a task $t'$ such that $A(t') = P$ and $t' \in T - \{p_1, p_2, p_3\}$. Since the capacity of $P$ is 3, there is at least one sink task, say $p_1$, that is not assigned to $P$. Let $A(p_1) = Q$. Since $\rho(p_1, Q)$ does not hold, $Q$ has only been assigned $p_1$ and $L_Q^A = 3$. Let $B(p_1) = P$ and $B(t') = Q$. Repeat the same process for all tasks other than $p_1, p_2, p_3$ that are assigned to $P$ in $A$. Then let $B(t) = A(t)$ for the remaining tasks $t \in T$. To see that $B$ is feasible, note that $L_s^B \le L_s^A \le 3$ for all servers $s \in S$.

Let $G'$ be the subgraph induced by $(T - \{p_1, p_2, p_3\}, S - \{P\}) = (T', S')$. We have the following lemma.

Lemma 3.6. Let $A$ be a feasible task assignment in $G$. Then there exists a feasible task assignment $A'$ in $G'$.

Proof.
Given $A$, Lemma 3.5 tells us that there exists another feasible assignment $B$ in $G$ such that $B(t) \ne P$ for all $t \in T'$. Let $A'(t) = B(t)$ for all $t \in T'$. Then $A'$ is an assignment in $G'$ since $A'(t) \in S - \{P\}$ for all $t \in T'$. To see that $A'$ is feasible, note that $L_s^{A'} \le L_s^B \le 3$ for all servers $s \in S'$.

We further partition $G'$ into two subgraphs $G_C$ and $G_R$. $G_C$ is induced by nodes $\{C_u : u \in [1, \alpha]\} \cup \{a_u : u \in [1, \alpha]\} \cup \{l_{uv} : u \in [1, \alpha], v \in [1, 3]\}$ and $G_R$ is induced by nodes $\{R_j^{(i)} : i \in [1, \beta], j \in [1, 2\omega]\} \cup \{r_j^{(i)} : i \in [1, \beta], j \in [1, 2\omega]\} \cup \{v_j^{(i)} : i \in [1, \beta], j \in [1, \omega]\}$. In other words, $G_C$ consists of all clause gadgets while $G_R$ consists of all variable gadgets.

If a task in one partition is remotely assigned to a server in the other partition, we call this task a cross-boundary task. Let $n_c^A$ be the number of cross-boundary tasks that are in $G_C$ and assigned to servers in $G_R$ by $A$, and let $n_r^A$ be the number of cross-boundary tasks that are in $G_R$ and assigned to servers in $G_C$ by $A$. We have the following lemmas.

Lemma 3.7. Let $A$ be a feasible assignment in $G'$ such that $n_c^A > 0$ and $n_r^A > 0$. Then there exists a feasible assignment $B$ in $G'$ such that one of $n_c^B$ and $n_r^B$ equals $|n_c^A - n_r^A|$ and the other one equals 0.

Proof. Assume $t_i \in G_C$, $s_i \in G_R$ and $A(t_i) = s_i$; $t_i' \in G_R$, $s_i' \in G_C$ and $A(t_i') = s_i'$. Then each of $s_i$ and $s_i'$ is assigned one remote task. Let $B(t_i) = s_i'$ and $B(t_i') = s_i$; then $L_{s_i}^B \le L_{s_i}^A = 3$ and $L_{s_i'}^B \le L_{s_i'}^A = 3$. This process decreases $n_c$ and $n_r$ each by one, and the resulting assignment is also feasible. Repeat the same process until the smaller one of $n_c$ and $n_r$ becomes 0. Then let $B(t) = A(t)$ for all the remaining tasks $t \in T'$. It is obvious that $B$ is feasible, and one of $n_c^B$ and $n_r^B$ equals $|n_c^A - n_r^A|$ and the other one equals 0.

Lemma 3.8. Let $A$ be a feasible assignment in $G'$ such that $n_c^A = 0$. Then $n_r^A = 0$.

Proof. For the sake of contradiction, assume $t_i \in G_R$, $s_i \in G_C$ and $A(t_i) = s_i$. For each server $s_j \in G_C$, there is one auxiliary task $a_u : u \in [1, \alpha]$ such that $\rho(a_u, s_j)$ holds. Since $w_{loc} = 1$ and $w_{rem} = 3$, if $A$ is feasible then $A(a_u) \ne A(a_v)$ for $u \ne v$. Since there are $\alpha$ auxiliary tasks and $\alpha$ servers in $G_C$, each server in $G_C$ is assigned exactly one auxiliary task. Since $A(t_i) = s_i$, $L_{s_i}^A \ge 1 + 3 > 3$, contradicting the fact that $A$ is feasible. Therefore, there is no $t_i \in G_R$ and $s_i \in G_C$ such that $A(t_i) = s_i$. Thus, $n_r^A = 0$.

Lemma 3.9. Let $A$ be a feasible assignment in $G'$ such that $n_r^A = 0$. Then $n_c^A = 0$.

Proof. For the sake of contradiction, assume $t_i \in G_C$, $s_i \in G_R$ and $A(t_i) = s_i$. Let $k_0, k_1, k_2, k_3$ denote the number of ring servers filled to load 0, 1, 2, 3, respectively, by tasks in $G_R$. From the total number of servers in $G_R$, we have

$$k_0 + k_1 + k_2 + k_3 = 2\beta\omega \qquad (1)$$

Similarly, from the total number of tasks in $G_R$, we have

$$0 \cdot k_0 + 1 \cdot k_1 + 2 \cdot k_2 + 1 \cdot k_3 = 3\beta\omega \qquad (2)$$

Subtracting (1) from (2) gives $k_2 = \beta\omega + k_0$. Assigning both neighboring ring tasks to the same ring server fills it to load 2. Since there are only $2\beta\omega$ ring tasks, we have $k_2 \le \beta\omega$. Hence, $k_0 = 0$ and $k_2 = \beta\omega$. This implies that all ring tasks are assigned to ring servers in alternating positions in each ring. There are $\beta\omega$ remaining ring servers and $\beta\omega$ variable tasks. Therefore, each variable task is remotely assigned to one of the remaining ring servers by $A$. Now consider the server $s_i$ that has been remotely assigned $t_i \in G_C$.
If it is assigned two ring tasks, its load is $L_{s_i}^A = 2 + 3 > 3$. If it is assigned one variable task, its load is $L_{s_i}^A = 3 + 3 > 3$. $A$ is not feasible in either case. Therefore, there is no $t_i \in G_C$ and $s_i \in G_R$ such that $A(t_i) = s_i$. Thus, $n_c^A = 0$.

Now we prove the following lemma.

Lemma 3.10. If $G = f(\phi)$ is admissible, then $\phi$ is satisfiable.

Proof. Given a feasible task assignment $A$ in $G = f(\phi)$, we construct the satisfying truth assignment $\sigma$ for $\phi$. From Lemmas 3.6, 3.7, 3.8 and 3.9, we construct a feasible assignment $B$ in $G'$, such that $n_c^B = n_r^B = 0$, and in each variable gadget $i$, either servers in $T_i$ or servers in $F_i$ are saturated by variable tasks. If ring servers in $F_i$ are saturated by variable tasks, let $\sigma(x_i) = \text{true}$. If ring servers in $T_i$ are saturated by variable tasks, let $\sigma(x_i) = \text{false}$. To check that this truth assignment is a satisfying assignment, note that for the three literal tasks $l_{u1}, l_{u2}, l_{u3}$, at most two of them are assigned to the clause server $C_u$. There must be one literal task, say $l_{uv}$, that is locally assigned to a ring server. In this case, $\sigma(l_{uv}) = \text{true}$ and thus the clause $\sigma(C_u) = \text{true}$. This fact holds for all clauses and thus indicates that $\sigma(\phi) = \sigma(\bigwedge C_u) = \text{true}$. This completes the proof of Lemma 3.10.

Finally we prove the main theorem.

Proof of Theorem 3.1. Lemmas 3.3, 3.4 and 3.10 establish that 3SAT $\le_p$ (2,3)-UHTA via $f$. Therefore, (2,3)-UHTA is $\mathcal{NP}$-hard. It is easy to see that (2,3)-UHTA $\in \mathcal{NP}$ because in time $O(mn)$ a nondeterministic Turing machine could guess the assignment and accept iff the maximum load under the assignment does not exceed 3. Therefore, (2,3)-UHTA is $\mathcal{NP}$-complete.

4. A ROUND ROBIN ALGORITHM

In this section, we analyze a simple round robin algorithm for the UHTA problem. Algorithm 1 is inspired by the Hadoop scheduler algorithm. It scans over each server in a round robin fashion. When assigning a new task to a server, Algorithm 1 tries heuristically to exploit data locality. Since we have not specified the order of assigned tasks, Algorithm 1 may produce many possible outputs (assignments).

Algorithm 1 The round robin algorithm exploring locality.
 1: input: a set of unassigned tasks $T$, a list of servers $\{s_1, s_2, \ldots, s_n\}$, a placement relation $\rho$
 2: define $i \leftarrow 1$ as an index variable
 3: define $A$ as an assignment
 4: $A(t) = \bot$ (task $t$ is unassigned) for all $t$
 5: while exists unassigned task do
 6:     if exists unassigned task $t$ such that $\rho(t, s_i)$ holds then
 7:         update $A$ by assigning $A(t) = s_i$
 8:     else
 9:         pick any unassigned task $t'$, update $A$ by assigning $A(t') = s_i$
10:     end if
11:     $i \leftarrow (i \bmod n) + 1$
12: end while
13: output: assignment $A$

Algorithm 1 is analogous to the Hadoop scheduler algorithm up to core version 0.19. There are three differences, though. First, the Hadoop algorithm assumes three kinds of placement: data-local, rack-local and rack-remote, whereas Algorithm 1 assumes only two: local and remote. Second, the Hadoop scheduler works incrementally rather than assigning all tasks initially. Last, the Hadoop algorithm is deterministic, whereas Algorithm 1 is nondeterministic.

Theorem 4.1. If $w_{rem} > w_{loc}$, increasing the number of data block replicas may increase the maximum load of the assignment computed by Algorithm 1.

Proof. The number of edges in the placement graph is equal to the number of data block replicas, and thus adding a new edge in the placement graph is equivalent to adding a new replica in the system. Consider the simple placement graph $G$ where $m = n$, and there is an edge between task $t_i$ and server $s_i$ for all $1 \le i \le n$. Running Algorithm 1 gives an assignment $A$ in which task $t_i$ is assigned to $s_i$ for all

$1 \le i \le n$, and thus $L^A = w_{loc}$. Now we add one edge between task $t_n$ and server $s_1$. We run Algorithm 1 on this new placement graph $G'$ to get assignment $A'$. It might assign task $t_n$ to server $s_1$ in the first step. Following that, it assigns $t_i$ to $s_i$ for $2 \le i \le n - 1$, and it finally assigns $t_1$ to $s_n$. Since $t_1$ is remote to $s_n$, this gives $L^{A'} = w_{rem}$. Therefore $L^{A'} > L^A$.

Theorem 4.1 indicates that increasing the number of data block replicas is not always beneficial for Algorithm 1. In the remaining part of this section, we show that the assignments computed by Algorithm 1 might deviate from the optimum by a multiplicative factor. In the following, let $O$ be an assignment that minimizes $L^O$.

Theorem 4.2. Let $A$ be an assignment computed by Algorithm 1. Then $L^A \le (w_{rem}/w_{loc}) \cdot L^O$.

Proof. On the one hand, the pigeonhole principle says that under any assignment there is a server assigned at least $\lceil m/n \rceil$ tasks. Since the cost of each task is at least $w_{loc}$, the load of this server is at least $\lceil m/n \rceil \cdot w_{loc}$. Thus, $L^O \ge \lceil m/n \rceil \cdot w_{loc}$. On the other hand, Algorithm 1 runs in a round robin fashion where one task is assigned at a time. Therefore, the number of tasks assigned to each server is at most $\lceil m/n \rceil$. Since the cost of each task is at most $w_{rem}$, the load of a server is at most $\lceil m/n \rceil \cdot w_{rem}$. Thus, $L^A \le \lceil m/n \rceil \cdot w_{rem}$. Combining the two, we have $L^A \le (w_{rem}/w_{loc}) \cdot L^O$.

Theorem 4.3. Let $T$ and $S$ be such that $m \le n(n - 2)$. There exist a placement $\rho$ and an assignment $A$ such that $A$ is a possible output of Algorithm 1, $L^A \ge \lfloor m/n \rfloor \cdot w_{rem}$, and $L^O = \lceil m/n \rceil \cdot w_{loc}$.

Proof. We prove the theorem by constructing a placement graph $G_\rho$. Partition the set $T$ of tasks into $n$ disjoint subsets $T_i : 1 \le i \le n$, such that $\lceil m/n \rceil \ge |T_i| \ge |T_j| \ge \lfloor m/n \rfloor$ for all $1 \le i \le j \le n$. Now in the placement graph $G_\rho$, connect tasks in $T_i$ to server $s_i$, for all $1 \le i \le n$. These sets of edges guarantee that $L^O = \lceil m/n \rceil \cdot w_{loc}$. We then connect each task in $T_n$ to a different server in the subset $S' = \{s_1, s_2, \ldots, s_{n-1}\}$. Since $m \le n(n - 2)$, we have $\lceil m/n \rceil \le \lfloor m/n \rfloor + 1 \le n - 1$, which guarantees $|S'| \ge |T_n|$. This completes the placement graph $G_\rho$. Now run Algorithm 1 on $G_\rho$. There is a possible output $A$ where tasks in $T_n$ are assigned to servers in $S'$.
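As an aside, the nondeterministic choices that Theorems 4.1 and 4.3 exploit are easy to reproduce in a small sketch of Algorithm 1 for the uniform case. The `prefer` parameter is an assumption introduced here to model the unspecified order in which Algorithm 1 picks among candidate tasks; it is not part of the paper's algorithm, and the names `round_robin` and `max_load_uniform` are illustrative. The example replays the instance from the proof of Theorem 4.1 with $n = 3$, $w_{loc} = 1$ and $w_{rem} = 3$.

```python
def round_robin(tasks, servers, placement, prefer=None):
    """Algorithm 1; `prefer` fixes the order in which candidate tasks are tried."""
    unassigned = list(prefer or tasks)
    assignment = {}
    i = 0
    while unassigned:
        s = servers[i]
        # Prefer an unassigned task whose data is local to s, else take any task.
        local = [t for t in unassigned if s in placement[t]]
        t = local[0] if local else unassigned[0]
        assignment[t] = s
        unassigned.remove(t)
        i = (i + 1) % len(servers)       # move on to the next server
    return assignment

def max_load_uniform(assignment, placement, w_loc, w_rem):
    """Maximum load with a constant remote cost (uniform cost function)."""
    load = {}
    for t, s in assignment.items():
        load[s] = load.get(s, 0) + (w_loc if s in placement[t] else w_rem)
    return max(load.values())

tasks, servers = ["t1", "t2", "t3"], ["s1", "s2", "s3"]
one_replica = {"t1": {"s1"}, "t2": {"s2"}, "t3": {"s3"}}
extra_replica = {"t1": {"s1"}, "t2": {"s2"}, "t3": {"s3", "s1"}}

before = max_load_uniform(round_robin(tasks, servers, one_replica), one_replica, 1, 3)
# An unlucky tie-break: server s1 takes t3 first, which later strands t1 on s3.
unlucky = round_robin(tasks, servers, extra_replica, prefer=["t3", "t1", "t2"])
after_unlucky = max_load_uniform(unlucky, extra_replica, 1, 3)
lucky = round_robin(tasks, servers, extra_replica, prefer=["t1", "t2", "t3"])
after_lucky = max_load_uniform(lucky, extra_replica, 1, 3)
```

With the extra replica the outcome depends entirely on the tie-break: one order keeps every task local, the other forces a remote task and triples the maximum load.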
In that case, all tasks that are local to server $s_n$ are assigned elsewhere, and thus $s_n$ is assigned remote tasks. Since $s_n$ is assigned at least $\lfloor m/n \rfloor$ tasks, this gives $L^A \ge \lfloor m/n \rfloor \cdot w_{rem}$.

When $n \mid m$, the lower bound in Theorem 4.3 matches the upper bound in Theorem 4.2.

5. A FLOW-BASED ALGORITHM

Theorem 3.1 shows that the problem of computing an optimal task assignment for the HTA problem is $\mathcal{NP}$-complete. Nevertheless, it is feasible to find task assignments whose load is at most an additive constant greater than the optimal load. We present such an algorithm in this section.

For two partial assignments $\alpha$ and $\beta$ such that $\beta \supseteq \alpha$, we define a new notion called virtual load from $\alpha$ below.

Definition 12. For any task $t$ and partial assignment $\beta$ that extends $\alpha$, let

$$v^\alpha(t, \beta) = \begin{cases} w_{loc} & \text{if } t \text{ is local in } \beta, \\ w_{rem}^\alpha & \text{otherwise.} \end{cases}$$

The virtual load of server $s$ under $\beta$ from $\alpha$ is $V_s^{\beta,\alpha} = \sum_{t : \beta(t) = s} v^\alpha(t, \beta)$. The maximum virtual load under $\beta$ from $\alpha$ is $V^{\beta,\alpha} = \max_{s \in S} V_s^{\beta,\alpha}$.

Thus, $v$ assumes pessimistically that tasks not assigned by $\beta$ will eventually become remote, and each remote task will eventually have cost $w_{rem}^\alpha$. When $\alpha$ is clear from context, we omit $\alpha$ and write $v(t, \beta)$, $V_s^\beta$ and $V^\beta$, respectively. Note that $v^\alpha(t, \alpha) = w(t, \alpha)$ as in Definition 7.

Algorithm 2 works iteratively to produce a sequence of assignments and then outputs the best one, i.e., the one of least maximum server load. The iteration is controlled by an integer variable $\tau$ which is initialized to 1 and incremented on each iteration. Each iteration consists of two phases, max-cover and bal-assign:

Max-cover: Given as input a placement graph $G_\rho$, an integer value $\tau$, and a partial assignment $\alpha$, max-cover returns a partial assignment $\alpha'$ of a subset $T'$ of tasks, such that $\alpha'$ assigns no server more than $\tau$ tasks, every task in $T'$ is local in $\alpha'$, and $|T'|$ is maximized over all such assignments. Thus, $\alpha'$ makes as many tasks local as is possible without assigning more than $\tau$ tasks to any one server. The name "max-cover" follows the intuition that we are actually trying to "cover" as many tasks as possible by their local servers, subject to the constraint that no server is assigned more than $\tau$ tasks.
Bal-assign: Given as input a set of tasks $T$, a set of servers $S$, a partial assignment $\alpha$ computed by max-cover, and a cost function $w$, bal-assign uses a simple greedy algorithm to extend $\alpha$ to a complete assignment $B$ by repeatedly choosing a server with minimal virtual load and assigning some unassigned task to it. This continues until all tasks are assigned. It thus generates a sequence of partial assignments $\alpha = \alpha_0 \subset \alpha_1 \subset \cdots \subset \alpha_u = B$, where $u = u^{\alpha}$. Every task $t$ assigned in bal-assign contributes $v^{\alpha}(t, B) \le w^{\alpha}_{rem}$ to the virtual load of the server that it is assigned to. At the end, $w^{B}_{rem} \le w^{\alpha}_{rem}$, and equality holds only when $r^B = r^{\alpha} + u^{\alpha}$.

The astute reader might feel that it is intellectually attractive to use real server load as the criterion to choose servers in bal-assign because it embeds more accurate information. We do not know if this change ever results in a better assignment. We do know that it may require more computation. Whenever a local task is assigned, $r + u$ decreases by 1, so the remote cost $w_{rem}(r + u)$ may also decrease. If it does, the loads of all servers that have been assigned remote tasks must be recomputed. In the current version of the algorithm, we do not need to update virtual loads when a local task is assigned because the virtual cost of remote tasks never changes in the course of bal-assign.

5.1 Algorithm Description

We describe Algorithm 2 in greater detail here.

5.1.1 Max-cover

Max-cover (line 6 of Algorithm 2) augments the partial assignment $\alpha^{\tau-1}$ computed by the previous iteration to produce $\alpha^{\tau}$. (We define $\alpha^{0}$ to be the empty partial assignment.) Thus, $\alpha^{\tau} \supseteq \alpha^{\tau-1}$, and $\alpha^{\tau}$ maximizes the total number of local tasks assigned subject to the constraint that no server is assigned more than $\tau$ tasks in all.
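The max-cover specification above can be met by an augmenting-path search on the placement graph. The following Python sketch is one minimal way to do this and is not taken from the paper: the names `max_cover`, `local`, and `alpha` are hypothetical, tasks and servers are plain integers, `local[t]` lists the servers that hold a replica of task t's data block, and `alpha` is assumed to contain only local assignments that already respect the cap.

```python
def max_cover(local, n_servers, tau, alpha):
    """Return a partial assignment (task -> server) that makes as many tasks
    local as possible with at most tau tasks per server, starting from alpha."""
    assign = dict(alpha)
    on_server = {s: set() for s in range(n_servers)}
    for t, s in assign.items():
        on_server[s].add(t)

    def place(t, s):
        # Put task t on server s, removing it from its previous server if any.
        old = assign.get(t)
        if old is not None:
            on_server[old].discard(t)
        assign[t] = s
        on_server[s].add(t)

    def augment(t, seen):
        # Search for an augmenting path ending at a server with spare capacity.
        for s in local[t]:
            if s in seen:
                continue
            seen.add(s)
            if len(on_server[s]) < tau:
                place(t, s)
                return True
            for other in list(on_server[s]):
                if augment(other, seen):  # 'other' moved to another local server
                    place(t, s)
                    return True
        return False

    for t in local:
        if t not in assign:
            augment(t, set())
    return assign
```

Calling `max_cover` for $\tau = 1, 2, \ldots$ and passing each result as the starting assignment of the next call reuses the work of earlier iterations. In this sketch an augmenting path may move an already assigned task to another of its local servers, so the set of covered tasks only grows even though individual placements can change.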

Algorithm 2 A flow-based algorithm for HTA.
 1: input: an HMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$
 2: define $A$, $B$ as assignments
 3: define $\alpha$ as a partial assignment
 4: $\alpha(t) = \bot$ (task $t$ is unassigned) for all $t$
 5: for $\tau = 1$ to $m$ do
 6:     $\alpha \leftarrow$ max-cover($G_\rho$, $\tau$, $\alpha$)
 7:     $B \leftarrow$ bal-assign($T$, $S$, $\alpha$, $w_{loc}$, $w_{rem}(\cdot)$)
 8: end for
 9: set $A$ equal to a $B$ with least maximum load
10: output: assignment $A$

The core of the max-cover phase is an augmenting path algorithm by Ford and Fulkerson [10]. The Ford-Fulkerson algorithm takes as input a network with edge capacities and an existing network flow, and outputs a maximum flow that respects the capacity constraints. The following fact about this algorithm is well known [6, 10].

Fact 5.1. Given a flow network with integral capacities and an initial integral $s$-$t$ flow $f$, the Ford-Fulkerson algorithm computes an integral maximum $s$-$t$ flow $f'$ in time $O(|E| \cdot (|f'| - |f|))$, where $|E|$ is the number of edges in the network and $|f|$ is the value of the flow $f$, i.e., the amount of flow passing from the source to the sink.

During the max-cover phase at iteration $\tau$, the input placement graph $G_\rho$ is first converted to a corresponding flow network $G'_\rho$. $G'_\rho$ includes all nodes in $G_\rho$ plus an extra source $u$ and an extra sink $v$. In $G'_\rho$, there is an edge $(u, t)$ for all $t \in T$ and an edge $(s, v)$ for all $s \in S$. All of the original edges $(t, s)$ in $G_\rho$ remain in $G'_\rho$. The edge capacities are defined as follows: edge $(s, v)$ has capacity $\tau$ for all $s \in S$, while all the other edges have capacity 1. Therefore, for any pair $(t, s)$, if there is a flow through the path $u \to t \to s \to v$, the value of this flow is no greater than 1.

Then the input partial assignment $\alpha$ is converted into a network flow $f_\alpha$ as follows: if task $t$ is assigned to server $s$ in the partial assignment $\alpha$, assign one unit of flow through the path $u \to t \to s \to v$. The Ford-Fulkerson algorithm is then run on graph $G'_\rho$ with flow $f_\alpha$ to find a maximum flow $f'_\alpha$. From Fact 5.1, we know that the Ford-Fulkerson algorithm takes time $O(|E| \cdot (|f'_\alpha| - |f_\alpha|))$ in this iteration. This output flow $f'_\alpha$ at iteration $\tau$ will act as the input flow to the Ford-Fulkerson algorithm at iteration $\tau + 1$.
The flow network at iteration $\tau + 1$ is the same as the one at iteration $\tau$ except that each edge $(s, v)$ has capacity $\tau + 1$ for all $s \in S$. This incremental use of the Ford-Fulkerson algorithm in successive iterations helps reduce the time complexity of the whole algorithm.

At the end of the max-cover phase, the augmented flow $f'_\alpha$ is converted back into a partial assignment $\alpha'$. If there is one unit of flow through the path $u \to t \to s \to v$ in $f'_\alpha$, we assign task $t$ to server $s$ in $\alpha'$. This conversion from a network flow to a partial assignment can always be done, because the flow is integral and all edges between tasks and servers have capacity 1. Therefore, there is a one-to-one correspondence between a unit flow through the path $u \to t \to s \to v$ and the assignment of task $t$ to its local server $s$. It follows that $|f'_\alpha| = l^{\alpha'}$. By Fact 5.1, the Ford-Fulkerson algorithm computes a maximum flow that respects the capacity constraint $\tau$. Thus, the following lemma is immediate.

Lemma 5.2. Let $\alpha^{\tau}$ be the partial assignment computed by max-cover at iteration $\tau$, and let $\beta$ be any partial assignment such that $k^{\beta} \le \tau$. Then $l^{\alpha^{\tau}} \ge l^{\beta}$.

5.1.2 Bal-assign

Definition 13. Let $\beta$ and $\beta'$ be partial assignments, $t$ a task and $s$ a server. We say that $\beta \xrightarrow{t:s} \beta'$ is a step that assigns $t$ to $s$ if $t$ is unassigned in $\beta$ and $\beta' = \beta \cup \{(t, s)\}$. We say $\beta \to \beta'$ is a step if $\beta \xrightarrow{t:s} \beta'$ for some $t$ and $s$. A sequence of steps $\alpha = \alpha_0 \to \alpha_1 \to \cdots \to \alpha_u$ is a trace if for each $i \in [1, u]$, if $\alpha_{i-1} \xrightarrow{t:s} \alpha_i$ is a step, then $V_s^{\alpha_{i-1},\alpha} \le V_{s'}^{\alpha_{i-1},\alpha}$ for all $s' \ne s$.

Given two partial assignments $\alpha_{i-1}$ and $\alpha_i$ in a trace such that $\alpha_{i-1} \xrightarrow{t:s} \alpha_i$, it follows that $V_s^{\alpha_i,\alpha} \le V_s^{\alpha_{i-1},\alpha} + w^{\alpha}_{rem}$, and $V_{s'}^{\alpha_i,\alpha} = V_{s'}^{\alpha_{i-1},\alpha}$ for all $s' \ne s$. The following lemma is immediate.

Lemma 5.3. Let $u = u^{\alpha^{\tau}}$ and let $\alpha^{\tau} = \alpha_0^{\tau} \to \alpha_1^{\tau} \to \cdots \to \alpha_u^{\tau}$ be a sequence of partial assignments generated by bal-assign at iteration $\tau$. This sequence is a trace that ends in a complete assignment $B^{\tau} = \alpha_u^{\tau}$.

5.2 Main Result

It is obvious that Algorithm 2 is optimal for $n = 1$ since only one assignment is possible. Now we show that for $n \ge 2$, Algorithm 2 computes, in polynomial time, assignments that are optimal to within an additive constant. The result is formally stated as Theorem 5.4.

Theorem 5.4. Let $n \ge 2$.
Given an HMR-system with $m$ tasks and $n$ servers, Algorithm 2 computes an assignment $A$ in time $O(m^2 n)$ such that $L^A \le L^O + \left(1 - \frac{1}{n-1}\right) \cdot w^{O}_{rem}$.

Lemma 5.5. Algorithm 2 runs in time $O(m^2 n)$.

Proof. By Fact 5.1, we know that the Ford-Fulkerson algorithm takes time $O(|E| \cdot |\Delta f|)$ to augment the network flow by $|\Delta f|$. At iteration $\tau = 1$, max-cover takes time $O(|E| \cdot |f_1|)$, where $|f_1| \le n$. Then at iteration $\tau = 2$, max-cover takes time $O(|E| \cdot (|f_2| - |f_1|))$, where $|f_2| \le 2n$. The same process is repeated until $|f_m| = m$. The total running time of max-cover for all iterations thus adds up to

$$O(|E| \cdot (|f_1| + |f_2| - |f_1| + |f_3| - |f_2| + \cdots + |f_m| - |f_{m-1}|)) = O(|E| \cdot |f_m|) = O(|E| \cdot m) = O(m^2 n).$$

We implement the greedy algorithm in the bal-assign phase with a priority queue. Since there are $n$ servers, each operation of the priority queue takes $O(\log n)$ time. During the bal-assign phase at each iteration, at most $m$ tasks need to be assigned. This takes time $O(m \log n)$. The total running time of bal-assign for all iterations is thus $O(m^2 \log n)$. Combining the running times of the two phases for all iterations gives time complexity $O(m^2 n)$.

Lemma 5.5 suggests that the max-cover phase is the main contributor to the time complexity of Algorithm 2. However, in a typical Hadoop system, the number of replicas for each data block is a small constant, say 2 or 3. Then the degree of each $t \in G$ is bounded by this constant. In this case, the placement graph $G$ is sparse and $|E| = O(m + n)$. As a result, max-cover runs in time $O(m(m + n))$. Therefore the bal-assign phase might become the main contributor to the time complexity.
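The priority-queue implementation of bal-assign mentioned in the proof of Lemma 5.5 might look like the following Python sketch. It is an illustration under stated assumptions rather than the authors' code: the names `bal_assign`, `local`, and `w_rem` are hypothetical, `w_rem` is a callable applied to $r + u$, unassigned tasks are taken in list order, and ties between servers are broken by server index.

```python
import heapq

def bal_assign(tasks, n_servers, alpha, local, w_loc, w_rem):
    """Greedily extend the partial assignment alpha (task -> server) to a
    complete assignment, always picking a server of minimal virtual load."""
    unassigned = [t for t in tasks if t not in alpha]
    # Remote cost seen from alpha: w_rem(r + u). It stays fixed for the whole
    # phase, so virtual loads never need to be recomputed.
    remote_in_alpha = sum(1 for t, s in alpha.items() if s not in local[t])
    w_rem_alpha = w_rem(remote_in_alpha + len(unassigned))

    # Virtual load of every server under alpha.
    vload = [0] * n_servers
    for t, s in alpha.items():
        vload[s] += w_loc if s in local[t] else w_rem_alpha

    heap = [(vload[s], s) for s in range(n_servers)]
    heapq.heapify(heap)

    result = dict(alpha)
    for t in unassigned:
        load, s = heapq.heappop(heap)          # server with minimal virtual load
        result[t] = s
        cost = w_loc if s in local[t] else w_rem_alpha
        heapq.heappush(heap, (load + cost, s))  # O(log n) per assigned task
    return result
```

Each task costs one pop and one push on a heap of $n$ entries, which matches the $O(m \log n)$ bound per iteration stated above.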

5.2.1 Properties of optimal assignments

In order to prove the approximation bound, we first establish some properties of optimal assignments.

Definition 14. Given an HMR-system, let $\mathcal{O}$ be the set of all optimal assignments, i.e., those that minimize the maximum load. Let $r_{min} = \min\{r^A \mid A \in \mathcal{O}\}$ and let $\mathcal{O}_1 = \{O \in \mathcal{O} \mid r^O = r_{min}\}$.

Lemma 5.6. Let $O \in \mathcal{O}_1$. If $l_s^O = k^O$, then $r_s^O = 0$ and $L_s^O = K^O$.

Proof. Let $l_s^O = k^O$ for some server $s$. Assume to the contrary that $r_s^O \ge 1$. Then $L_s^O \ge K^O + w^{O}_{rem}$. Let $t$ be a remote task assigned to $s$ by $O$. By Definition 3, $\rho(t, s')$ holds for at least one server $s' \ne s$.

Case 1: $s'$ has at least one remote task $t'$. Then move $t'$ to $s$ and $t$ to $s'$. This results in another assignment $B$. $B$ is still optimal because $L_s^B \le L_s^O$, $L_{s'}^B \le L_{s'}^O$, and $L_{s''}^B \le L_{s''}^O$ for any other server $s'' \in S - \{s, s'\}$.

Case 2: $s'$ has only local tasks. By the definition of $k^O$, $s'$ has at most $k^O$ local tasks assigned by $O$. Then move $t$ to $s'$. This results in another assignment $B$. $B$ is still optimal because $L_s^B < L_s^O$, $L_{s'}^B \le K^O + w_{loc} \le K^O + w^{O}_{rem} \le L_s^O$, and $L_{s''}^B \le L_{s''}^O$ for any other server $s'' \in S - \{s, s'\}$.

In either case, we have shown that the new assignment is in $\mathcal{O}$. However, since $t$ becomes local in the new assignment, fewer remote tasks are assigned than in $O$. This contradicts $O \in \mathcal{O}_1$. Thus, $s$ is assigned no remote tasks, so $L_s^O = k^O \cdot w_{loc} = K^O$.

Definition 15. Let $O \in \mathcal{O}_1$. Define $M^O = \dfrac{H^O - K^O}{n - 1}$.

Lemma 5.7. $L^O \ge M^O$.

Proof. Let $s_1$ be a server of maximal local load in $O$, so $l_{s_1}^O = k^O$. Let $S_2 = S - \{s_1\}$. By Lemma 5.6, $L_{s_1}^O = K^O$. The total load on $S_2$ is $\sum_{s \in S_2} L_s^O = H^O - K^O$, so the average load on $S_2$ is $M^O$. Hence, $L^O \ge \max_{s \in S_2} L_s^O \ge \mathrm{avg}_{s \in S_2} L_s^O = M^O$.

5.2.2 Analyzing the algorithm

Assume throughout this section that $O \in \mathcal{O}_1$ and that $\alpha = \alpha_0 \to \alpha_1 \to \cdots \to \alpha_u = B$ is a trace generated by iteration $\tau = k^O$ of the algorithm. Virtual loads are all based on $\alpha$, so we generally omit explicit mention of $\alpha$ in the superscripts of $v$ and $V$.

Lemma 5.8. $w_{loc} \le w^{B}_{rem} \le w^{\alpha}_{rem} \le w^{O}_{rem}$.

Proof. $w_{loc} \le w^{B}_{rem}$ follows from the definition of a Hadoop cost function. Because $B \supseteq \alpha$, $r^B \le r^{\alpha} + u^{\alpha}$. By Lemma 5.2, $l^{\alpha} \ge l^O$, so $r^{\alpha} + u^{\alpha} = m - l^{\alpha} \le m - l^O = r^O$.
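Hence, $w_{rem}(r^B) \le w_{rem}(r^{\alpha} + u^{\alpha}) \le w_{rem}(r^O)$ by monotonicity of $w_{rem}(\cdot)$. It follows by definition of the $w^{\beta}_{rem}$ notation that $w^{B}_{rem} \le w^{\alpha}_{rem} \le w^{O}_{rem}$.

Lemma 5.9. $k^{\alpha} = k^O$.

Proof. $k^{\alpha} \le k^O$ because no server is assigned more than $\tau$ local tasks by max-cover at iteration $\tau = k^O$. For the sake of contradiction, assume $k^{\alpha} < k^O$. Then $u^{\alpha} > 0$, because otherwise $\alpha = B$ and $L^B = k^{\alpha} \cdot w_{loc} < K^O \le L^O$, violating the optimality of $O$. Let $t$ be an unassigned task in $\alpha$. By definition, $\rho(t, s)$ holds for some server $s$. Assign $t$ to $s$ in $\alpha$ to obtain a new partial assignment $\beta$. We have $k^{\beta} \le k^{\alpha} + 1 \le k^O = \tau$. By Lemma 5.2, $l^{\alpha} \ge l^{\beta}$, contradicting the fact that $l^{\beta} = l^{\alpha} + 1$. We conclude that $k^{\alpha} = k^O$.

Lemma 5.10. $L^B \le V^B$.

Proof. By definition, $L_s^B = \sum_{t:B(t)=s} w(t, B)$ and $V_s^B = \sum_{t:B(t)=s} v(t, B)$. By Lemma 5.8, $w^{B}_{rem} \le w^{\alpha}_{rem}$, and thus $w(t, B) \le v(t, B)$. It follows that $L_s^B \le V_s^B$ for all $s \in S$. Therefore $L^B \le V^B$, because $L^B = \max_s L_s^B$ and $V^B = V^{B,\alpha} = \max_s V_s^{B,\alpha}$.

For the remainder of this section, let $s_1$ be a server such that $l_{s_1}^{\alpha} = k^O$. Such a server exists by Lemma 5.9. Let $S_2 = S - \{s_1\}$ be the set of remaining servers. For a partial assignment $\beta \supseteq \alpha$, define $N^{\beta}$ to be the average virtual load under $\beta$ of the servers in $S_2$. Formally,

$$N^{\beta} = \frac{\sum_{s \in S_2} V_s^{\beta}}{|S_2|} = \frac{l^{\beta} \cdot w_{loc} + r^{\beta} \cdot w^{\alpha}_{rem} - V_{s_1}^{\beta}}{n - 1}.$$

To obtain the approximation bound, we compare $N^{\beta}$ with the similar quantity $M^O$ for the optimal assignment. For convenience, we let $\delta = w^{O}_{rem}/(n - 1)$.

Lemma 5.11. Let $\beta = \alpha_i \xrightarrow{t:s} \alpha_{i+1} = \beta'$. Then $V_s^{\beta} \le N^{\beta} \le M^O - \delta$.

Proof. The proof is by a counting argument. By Lemma 5.9, we have $k^{\alpha} = k^O$, so $l_{s_1}^{\beta} \ge l_{s_1}^{\alpha} = k^O$. Hence, $V_{s_1}^{\beta} \ge K^O$. By Lemma 5.2, we have $l^{\alpha} \ge l^O$. Let $d = l^{\beta} - l^O$. Then $d \ge 0$ because $l^{\beta} \ge l^{\alpha} \ge l^O$. Because $l^{\beta} + r^{\beta} + u^{\beta} = m = l^O + r^O$, we have $r^{\beta} + u^{\beta} + d = r^O$. Also, $u^{\beta} \ge 1$ since $t$ is unassigned in $\beta$. Then by Lemma 5.8,

$$\begin{aligned}
(n - 1) N^{\beta} &= l^{\beta} \cdot w_{loc} + r^{\beta} \cdot w^{\alpha}_{rem} - V_{s_1}^{\beta} \\
&= (l^O + d) \cdot w_{loc} + (r^O - u^{\beta} - d) \cdot w^{\alpha}_{rem} - V_{s_1}^{\beta} \\
&\le l^O \cdot w_{loc} + (r^O - u^{\beta}) \cdot w^{O}_{rem} - K^O \\
&\le (n - 1) M^O - w^{O}_{rem}.
\end{aligned}$$

Hence, $N^{\beta} \le M^O - \delta$. Now, since $\beta$ is part of a trace, we have $V_s^{\beta} \le V_{s'}^{\beta}$ for all $s' \in S$. In particular, $V_s^{\beta} \le N^{\beta}$, since $N^{\beta}$ is the average virtual load of all servers in $S_2$.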
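We conclude that $V_s^{\beta} \le N^{\beta} \le M^O - \delta$.

Proof of Theorem 5.4. Lemma 5.5 shows that the time complexity of Algorithm 2 is $O(m^2 n)$. Now we finish the proof of the approximation bound. Let $s$ be a server of maximum virtual load in $B$, so $V_s^B = V^B$. Let $i$ be the smallest integer such that $\alpha_i$ and $B$ assign the same tasks to $s$, that is, no more tasks are assigned to $s$ in the subtrace beginning with $\alpha_i$.

Case 1: $i = 0$. Then $l_s^{\alpha_0} \le k^{\alpha_0} = k^O$ by Lemma 5.9, and $r_s^{\alpha_0} = 0$, so $V_s^B = V_s^{\alpha_0} \le K^O$. Hence, $V^B \le K^O \le L^O$.

Case 2: $i > 0$. Then $\beta = \alpha_{i-1} \xrightarrow{t:s} \alpha_i = \beta'$ for some task $t$. By Lemma 5.11, $V_s^{\beta} \le M^O - \delta$, so using Lemma 5.8,

$$V_s^{\beta'} \le V_s^{\beta} + w^{\alpha}_{rem} \le M^O - \delta + w^{O}_{rem}.$$

Then by Lemma 5.7,

$$V^B = V_s^B = V_s^{\beta'} \le M^O + w^{O}_{rem} - \delta \le L^O + w^{O}_{rem} - \delta.$$

Both cases imply that $V^B \le L^O + w^{O}_{rem} - \delta$. By Lemma 5.10, we have $L^B \le V^B$. Because the algorithm chooses an assignment with least maximum load as the output $A$, we have $L^A \le L^B$. Hence,

$$L^A \le L^O + w^{O}_{rem} - \delta = L^O + \left(1 - \frac{1}{n-1}\right) \cdot w^{O}_{rem}.$$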
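As a quick reading aid (a worked instantiation of the bound, not an additional result), substituting $\delta = w^{O}_{rem}/(n-1)$ for a few values of $n$ shows how the additive term behaves:

```latex
\[
L^A \;\le\; L^O + \Bigl(1 - \tfrac{1}{n-1}\Bigr)\, w^{O}_{rem}
\quad\Longrightarrow\quad
\begin{cases}
n = 2: & L^A \le L^O, \\[2pt]
n = 3: & L^A \le L^O + \tfrac{1}{2}\, w^{O}_{rem}, \\[2pt]
n = 11: & L^A \le L^O + \tfrac{9}{10}\, w^{O}_{rem}.
\end{cases}
\]
```

Under the theorem as stated, the bound collapses to $L^A \le L^O$ for two servers, and for every $n$ the gap to the optimum stays strictly below the cost $w^{O}_{rem}$ of a single remote task.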

6. CONCLUSION

In this paper, we present an algorithmic study of the task assignment problem in the Hadoop MapReduce framework and propose a mathematical model to evaluate the cost of task assignments. Based on this model, we show that it is infeasible to find the optimal assignment unless P = NP. Theorem 3.1 shows that the task assignment problem in Hadoop remains hard even if all servers have equal capacity of 3, the cost function only has 2 values in its range, and each data block has at most 2 replicas.

Second, we analyze the simple round robin algorithm for the UHTA problem. Theorem 4.1 reveals that the intuition that increasing the number of replicas always helps load balancing is wrong. Using round robin task assignment, adding more replicas into the system can sometimes result in a worse maximum load. Theorems 4.2 and 4.3 show there could be a multiplicative gap in maximum load between the optimal assignment and the assignment computed by Algorithm 1.

Third, we present Algorithm 2 for the general HTA problem. This algorithm employs maximum flow and increasing threshold techniques. Theorem 5.4 shows that the assignments computed by Algorithm 2 are optimal to within an additive constant that depends only on the number of servers and the remote cost function.

There are many interesting directions for future work. We have sketched a proof of a matching lower bound to Theorem 5.4 for a class of Hadoop cost functions. We plan to present this result in followup work. Sharing a MapReduce cluster between multiple users is becoming popular and has led to the recent development of multi-user multi-job schedulers such as the fair scheduler and the capacity scheduler. We plan to analyze the performance of such schedulers and see if the optimization techniques from this paper can be applied to improve them.

7. ACKNOWLEDGMENTS

We would like to thank Avi Silberschatz, Daniel Abadi, Kamil Bajda-Pawlikowski, and Azza Abouzeid for their inspiring discussions. We are also grateful to the anonymous referees for providing many useful suggestions that significantly improved the quality of our presentation.

8. REFERENCES

[1] J. Aspnes, Y. Azar, A. Fiat, S. Plotkin, and O. Waarts. On-line routing of virtual circuits with applications to load balancing and machine scheduling. Journal of the ACM, 44(3), 1997.
[2] Y. Azar, J. S. Naor, and R. Rom. The competitiveness of on-line assignments. In Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM Philadelphia, PA, USA, 1992.
[3] K. Birman, G. Chockler, and R. van Renesse. Toward a cloud computing research agenda. SIGACT News, 40(2):68-80.
[4] E. Bortnikov. Open-source grid technologies for web-scale computing. SIGACT News, 40(2):87-93.
[5] R. E. Burkard. Assignment problems: Recent solution methods and applications. In System Modelling and Optimization: Proceedings of the 12th IFIP Conference, Budapest, Hungary, September 2-6, 1985. Springer, 1986.
[6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, 2nd ed. MIT Press Cambridge, MA, 2001.
[7] M. de Kruijf and K. Sankaralingam. MapReduce for the Cell B. E. architecture. University of Wisconsin Computer Sciences Technical Report CS-TR-2007, 1625.
[8] J. Dean. Experiences with MapReduce, an abstraction for large-scale computation. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM New York, NY, USA.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, pages 137-150.
[10] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8(3), 1956.
[11] M. R. Garey, D. S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-completeness. Freeman San Francisco, 1979.
[12] R. L. Graham. Bounds for certain multiprocessing anomalies. Bell System Technical Journal, 45(9):1563-1581, 1966.
[13] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 1969.
[14] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM New York, NY, USA.
[15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1). Originally appeared in Naval Research Logistics Quarterly, 2, 1955.
[16] R. Lämmel. Google's MapReduce programming model - Revisited. Science of Computer Programming, 68(3).
[17] J. K. Lenstra, D. B. Shmoys, and E. Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46(1):259-271, 1990.
[18] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture. IEEE Computer Society Washington, DC, USA.
[19] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation, San Diego, CA, 2008.