DREAMS: Dynamic Resource Allocation for MapReduce with Data Skew

Zhihong Liu, Qi Zhang, Mohamed Faten Zhani, Raouf Boutaba, Yaping Liu and Zhenghu Gong

College of Computer, National University of Defense Technology, Changsha, Hunan, China. Email: {zhlu,gzh}@nudt.edu.cn
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, Hunan, China. Email: yplu@nudt.edu.cn
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. Email: {q8zhang,mfzhan,rboutaba}@uwaterloo.ca

Abstract—MapReduce has become a popular model for large-scale data processing in recent years. However, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. In this paper, we present DREAMS, a framework that provides run-time partitioning skew mitigation. Unlike previous approaches that try to balance the workload of reducers by repartitioning the intermediate data assigned to each reduce task, in DREAMS we cope with partitioning skew by adjusting task run-time resource allocation. We show that our approach allows DREAMS to eliminate the overhead of data repartitioning. Through experiments using both real and synthetic workloads running on an 11-node virtualized Hadoop cluster, we show that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving job performance by up to 20.3%.

I. INTRODUCTION

In recent years, the exponential growth of data in many application domains, such as e-commerce, social networking and scientific computing, has generated tremendous needs for large-scale data processing. In this context, MapReduce [1] as a parallel computing framework has recently gained significant popularity. In MapReduce, a job consists of two types of tasks, namely Map and Reduce. Each map task takes a block of input data and runs a user-specified map function to generate intermediate key-value pairs. Subsequently, each reduce task collects the intermediate key-value pairs and applies a user-specified reduce function to produce the final output. Due to its remarkable advantages in simplicity, robustness, and scalability, MapReduce has been widely used by companies such as Amazon, Facebook, and Yahoo! to process large volumes of data on a daily basis. Consequently, it has attracted considerable attention from both industry and academia.

Despite its success, current implementations of MapReduce still suffer from several important limitations. In particular, the most popular implementation of MapReduce, Apache Hadoop MapReduce [2], uses a hash function, Hash(HashCode(intermediate key) mod ReduceNumber), to partition the intermediate data among the reduce tasks. While the goal of using the hash function is to evenly distribute the workload to each reduce task, in reality this goal is rarely achieved [3]-[5]. For example, Zacheilas et al. [3] have demonstrated the existence of skewness in the YouTube social graph using real workloads. Their experiments showed that the biggest partition is larger than the smallest by more than a factor of five. The skewed distribution of reduce workload can have severe consequences. First, data skewness may lead to a large difference in the runtime between the fastest and slowest tasks. As the completion time of a MapReduce job is determined by the finishing time of the slowest reduce task, data skewness can cause certain tasks to run much slower than others, thereby severely delaying job completion. Second, Hadoop MapReduce allocates fixed-size containers to reduce tasks.
However, due to data skewness, different reduce tasks may have different run-time resource requirements. As a result, machines that are running tasks with heavy workload may experience resource contention, while machines with less data to process may experience resource idleness. Several approaches have recently been proposed to handle partitioning skew in MapReduce [4], [6]-[9]. Ibrahim et al. proposed LEEN [6], a framework that balances reduce workload by assigning intermediate keys to reducers based on their record sizes. While this approach can mitigate the negative impact of data skew, its benefit is limited since the sizes of records corresponding to each key can still be unevenly distributed. Furthermore, it does not perform well when the distribution of record sizes is severely skewed. Subsequently, Gufler et al. [7] and Ramakrishnan et al. [8] proposed techniques to split each key with a large record size into sub-keys to allow for a more even distribution of workload to reducers. However, most of these solutions have to wait until all the map tasks have completed to gather the partition size information before reduce tasks can be started. The authors of [5], [9] demonstrate that starting the shuffle phase only after all map tasks have completed prolongs the overall job completion time. While progressive sampling [8] and adaptive partitioning [4] can eliminate this waiting time, the former approach requires an additional sampling phase to generate a partitioning plan before the job can be executed, whereas the latter approach incurs an additional run-time overhead (e.g., 30 seconds for certain jobs). In either case, the overhead due to repartitioning can be quite large for small jobs that take from 10 to 100 seconds to complete. Such small jobs are quite common in today's production clusters [10].
Motivated by the limitations of the existing solutions, in this paper we take a completely different approach to addressing data skewness. Instead of subdividing keys into smaller sub-keys to balance the reduce workload, our approach adjusts the run-time resource allocation of each reducer to match its corresponding data size. Since no repartitioning is involved, our approach completely eliminates the overhead due to repartitioning. To this end, we present DREAMS, a Dynamic REsource Allocation technique for MapReduce with data Skew. DREAMS leverages historical records to construct profiles for each job type. This is reasonable because many production jobs are executed repeatedly in today's production clusters [11]. At run-time, DREAMS can dynamically detect data skewness and assign more resources to reducers with large partitions to make them finish faster. In DREAMS, we first develop an online prediction model which can estimate the partition sizes of reduce tasks at runtime. We then establish a performance model that correlates run-time resource allocation with task completion time. Using this performance model, the scheduler can make scheduling decisions that allocate the right amount of resources to reduce tasks so as to equalize their running time. Through experiments using both real and synthetic workloads running on an 11-node virtualized Hadoop cluster, we show that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving job performance by up to 20.3%.

The rest of this paper is organized as follows. Section II provides the motivation for our work. We describe the system architecture of DREAMS in Section III. Section IV illustrates the design of DREAMS in detail. Section V provides the results of our experimental evaluation. Finally, we summarize existing work related to DREAMS in Section VI, and draw our conclusion in Section VII.

II. MOTIVATION

In this section we provide an overview of the partitioning skew problem and discuss the resource allocation issues in current MapReduce implementations, thereby motivating our study.

In state-of-the-art MapReduce systems, each map task processes one split of input data and generates a sequence of key-value pairs, called intermediate data, on which the hash partitioning function is performed. Since all map tasks use the same hash partitioning function, the key-value pairs with the same hash results are assigned to the same reduce task. In the reduce stage, each reduce task takes one partition (i.e., the intermediate key-value pairs received from all map tasks) as input and performs the reduce function on the partition to generate the final output. This is illustrated in Figure 1.

Fig. 1: MapReduce Programming Model

Typically, the default hash function can provide load balancing if the key frequencies and the sizes of key-value pairs are uniformly distributed. This may fail with skewed data. For example, in the InvertedIndex application, the hash function partitions the intermediate data based on the words appearing in the files. Therefore, reduce tasks processing more popular words will be assigned a larger amount of data. As shown in Figure 1, partitions are unevenly distributed by the hash function: P1 is larger than P2, which causes workload imbalance between R1 and R2. A minimal sketch of this effect is given after the list below. [6] presents the causes of partitioning skew:

skewed key frequencies: Some keys occur more frequently in the intermediate data, causing the reduce tasks that process these popular keys to become overloaded.

skewed tuple sizes: In applications where the sizes of values in the key-value pairs vary significantly, uneven workload distribution may arise.

skewed execution times: Typical in scenarios where processing a single, large key-value pair may require more time than processing multiple small pairs. Even if the overall number of tuples per reduce task is the same, the execution times of reduce tasks may be different.
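To make the imbalance concrete, the following Python sketch mimics hash partitioning over a skewed (Zipf-like) key distribution. It is purely illustrative and not the authors' implementation: Hadoop's default HashPartitioner computes the equivalent of (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks in Java, which we approximate here with CRC32, and the word names, record sizes and distribution parameters are invented for the example.

```python
import zlib
import numpy as np

def partition(key: str, num_reducers: int) -> int:
    # Stand-in for Hadoop's HashPartitioner: hash(key) mod ReduceNumber.
    return zlib.crc32(key.encode()) % num_reducers

rng = np.random.default_rng(0)
num_records = 200_000
vocab_size = 10_000
# Zipf-distributed key frequencies: a handful of words dominate the stream.
word_ids = np.minimum(rng.zipf(a=1.2, size=num_records), vocab_size)

num_reducers = 8
record_bytes = 20                       # assumed size of one key-value pair
partition_bytes = [0] * num_reducers
for wid in word_ids:
    partition_bytes[partition(f"word{wid}", num_reducers)] += record_bytes

sizes_mb = [b / 2**20 for b in partition_bytes]
print("partition sizes (MB):", [round(s, 2) for s in sizes_mb])
print("largest/smallest:", round(max(sizes_mb) / min(sizes_mb), 1))
```

The partition that happens to receive the most popular keys ends up noticeably larger than the others, even though the hash function itself spreads distinct keys evenly.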
Due to the many weaknesses and inadequacies experienced in the first version of Hadoop MapReduce (MRv1), the next generation of the Hadoop compute platform, YARN [2], has been proposed. Nevertheless, in both Hadoop MRv1 and MRv2 (a.k.a. YARN), the schedulers assume each reduce task has uniform workload and resource consumption, and therefore allocate identical resources to each reduce task. Specifically, MRv1 adopts a slot-based allocation scheme, where each machine is divided into identical slots that can be used to execute tasks. However, MRv1 does not provide resource isolation among co-located tasks, which may cause performance degradation at run-time. On the other hand, YARN uses a container-based allocation scheme, where each task is scheduled in an isolated container with guaranteed CPU and memory resources that can be specified in the request. But YARN still allocates containers of identical size to all reduce tasks that belong to the same job. In the presence of partitioning skew, this scheduling scheme can cause both variation in task running time and degradation in resource utilization. For instance, Kwon et al. [4] demonstrated that in the CloudBurst application there is a factor of five difference in runtime between the fastest and the slowest reduce tasks. Since the job completion time depends on the slowest task, the runtime variation of reduce tasks will prolong the job execution. At the same time, the reducers with large partitions run slowly because the resources allocated to them are limited by the container size, whereas reducers with light workload tend to under-utilize the resources allocated to their containers. In both cases, the resulting resource allocation is inefficient.
Most of the existing approaches [4], [6]-[9] tackle the partitioning skew problem by making the workload assignment uniform among reduce tasks, thereby mitigating the inefficiencies in both performance and utilization. However, achieving this goal requires (sometimes heavy) modification to the current Hadoop implementation, and often incurs additional overhead in terms of sampling and adaptive partitioning. Therefore, in this work we seek an alternative solution, consisting in adjusting the container size based on partitioning skew. This approach not only requires minimal modification to the existing Hadoop implementation, but can also effectively mitigate the negative impact of data skew.

III. SYSTEM ARCHITECTURE

This section describes the design of our proposed resource allocation framework, DREAMS. The architecture of DREAMS is shown in Figure 2.

Fig. 2: Architecture of DREAMS

Specifically, each Partition Size Monitor records the statistics of the intermediate data that each map task generates at run-time and sends them to the ApplicationMaster through heartbeat messages. The Partition Size Predictor collects the partition size reports from the NodeManagers and predicts the partition sizes for the job at runtime. The Task Duration Estimator constructs a statistical model of reduce task performance as a function of its partition size and resource allocation. The Resource Allocator determines the amount of resources to be allocated to each reduce task based on the performance estimation. Lastly, the Fine-grained Container Scheduler is responsible for scheduling task requests from ApplicationMasters according to scheduling policies such as Fair scheduling [12] and Dominant Resource Fairness (DRF) [13].

The workflow of the resource allocation mechanism used by DREAMS consists of the following steps. (1) After the ApplicationMaster is launched, it schedules all the map tasks first and then ramps up the reduce task requests slowly according to the slowstart setting. During map execution, each Partition Size Monitor records the size of the intermediate data produced for each reduce task. It then sends the statistics to the ApplicationMaster through the RPC protocol used to monitor task status in Hadoop. (2) Upon receiving the partition size reports from the Partition Size Monitors, the Partition Size Predictor performs size prediction using our proposed prediction model (see Section IV-A). The Task Duration Estimator, which uses the job profiles (Section IV-B), predicts the duration of each reduce task for a specified amount of resources. Based on that, the Resource Allocator determines the amount of resources for each reduce task according to our proposed resource allocation algorithm (Section IV-C) so as to equalize the execution time of all reduce tasks. (3) After that, the ResourceManager receives the ApplicationMaster's resource requests through heartbeat messages, and schedules free containers in the cluster to the ApplicationMaster. (4) Once the ApplicationMaster obtains new containers from the ResourceManager, it assigns the corresponding container to its pending task, and finally launches the task.
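The following Python sketch walks through this control flow with toy stand-ins for the DREAMS components. The class and method names are ours, not Hadoop or DREAMS APIs, and the prediction and allocation steps are deliberately simplistic placeholders for the models of Section IV.

```python
from dataclasses import dataclass, field
import random

NUM_MAPS, NUM_REDUCERS, DELTA = 100, 4, 0.5      # toy job; DELTA plays the role of the threshold

@dataclass
class PartitionReport:
    """Step (1): sent by a Partition Size Monitor when a map task finishes."""
    map_id: int
    bytes_per_reducer: list        # intermediate bytes this map produced per reducer

@dataclass
class ApplicationMaster:
    finished_maps: int = 0
    observed: list = field(default_factory=lambda: [0] * NUM_REDUCERS)

    def on_map_finished(self, report):
        """Steps (1)-(2): aggregate monitor reports; predict once the threshold is reached."""
        self.finished_maps += 1
        for r, b in enumerate(report.bytes_per_reducer):
            self.observed[r] += b
        if self.finished_maps / NUM_MAPS < DELTA:
            return None
        return self.build_container_requests(self.predict_partition_sizes())

    def predict_partition_sizes(self):
        # Stand-in for the linear model of Section IV-A: scale what has been
        # observed so far up to 100% map completion.
        scale = NUM_MAPS / self.finished_maps
        return [s * scale for s in self.observed]

    def build_container_requests(self, predicted):
        # Stand-in for Sections IV-B/IV-C: larger partitions get more vcores.
        median = sorted(predicted)[len(predicted) // 2]
        return [{"reducer": r, "vcores": max(1, round(p / median)), "mem_gb": 1}
                for r, p in enumerate(predicted)]

# Steps (3)-(4): the requests would go to the ResourceManager, which grants
# containers that the ApplicationMaster then launches.
am = ApplicationMaster()
for m in range(NUM_MAPS):
    sizes = [random.randint(1, 10) * (r + 1) for r in range(NUM_REDUCERS)]
    requests = am.on_map_finished(PartitionReport(m, sizes))
    if requests:
        print(requests)
        break
```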
IV. DREAMS DESIGN

There are two main challenges that need to be addressed in DREAMS. First, to identify partition skew, it is necessary to develop a run-time forecasting algorithm that predicts the partition size of each reducer. Second, in order to determine the right container size for each reduce task, it is necessary to develop a task performance model that correlates task running time with resource allocation. In the following sections, we describe our technical solutions to each of these challenges.

A. Predicting Partition Size

As mentioned previously, the scheduler needs to know the partition size of each reduce task in order to compute the correct container size for that task. Since current Hadoop schedulers allow reduce tasks to be launched soon after a fraction (e.g., 5%) of the map tasks have finished¹, it is necessary to predict the partition size before the completion of all map tasks. To predict the partition size of each reduce task to be scheduled, at run-time the ApplicationMaster collects the tuples (F_j, S_ij), where F_j is the percentage of map tasks that have been processed, S_ij is the size of the partition generated so far by the completed map tasks for reduce task i, and j ∈ [1, m], with m the number of collected tuples (F_j, S_ij). In our implementation, we have modified the reporting mechanism so that each map task reports this information to the ApplicationMaster upon completion. With these metrics, we use linear regression to determine the following equation for each reduce task i ∈ [1, N]:

a_i + b_i · F_j = S_ij,   j = 1, 2, ..., m    (1)

We introduce an outer factor δ, a threshold that controls when our prediction model stops learning and finalizes the prediction. In practice, δ can be the map completion percentage at which reduce tasks may start to be scheduled (e.g., 50%). Every time a new map task finishes, a new training tuple is created. When the fraction of completed map tasks reaches δ, we calculate the scaling factors (a_i, b_i) and predict the size of each reduce task's partition over the whole data set, even though not all of the map tasks have completed.

¹ To improve job running time, existing Hadoop schedulers overlap the execution of map tasks and reduce tasks by allowing reduce tasks to be launched before the completion of all map tasks.
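A minimal sketch of this online prediction for a single reduce task i, assuming the ApplicationMaster has accumulated the (F_j, S_ij) tuples described above; the data are synthetic and numpy's ordinary least squares stands in for whatever regression routine an implementation would use.

```python
import numpy as np

def fit_partition_model(F, S):
    """Fit S_ij = a_i + b_i * F_j by least squares for one reduce task."""
    b_i, a_i = np.polyfit(F, S, deg=1)       # slope, intercept
    return a_i, b_i

def predict_final_size(a_i, b_i):
    """Extrapolate the partition size at 100% map completion (F = 1.0)."""
    return a_i + b_i * 1.0

# Synthetic reports: partition i grows roughly linearly with map progress.
rng = np.random.default_rng(1)
F = np.linspace(0.05, 0.5, 10)               # map completion fractions up to delta = 0.5
S = 800.0 * F + rng.normal(0, 5, F.size)     # observed partition size (MB)

a_i, b_i = fit_partition_model(F, S)
print(f"predicted final partition size: {predict_final_size(a_i, b_i):.1f} MB")
```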
We note that prediction schemes such as progressive sampling [8] could also be used by DREAMS for partition size prediction. However, the repartitioning mechanism used in [8] is based on a partitioning plan and, as a result, requires progressive sampling to be executed each time before the job starts. In our case, since we do not need to modify the implementation of partitioning, our partition size prediction can be done entirely online. We therefore found our current prediction scheme simple yet sufficient to produce high-quality prediction results.

B. Reduce Phase Performance Model

In this section, we design a task performance model that correlates the completion time of individual reduce tasks with their partition size and resource allocation. As Hadoop YARN only allows the CPU and memory size of a container to be specified, our implementation focuses on capturing the impact of CPU and memory allocation on task performance. In order to identify the relationship between task running time, partition size and resource allocation, we run a set of benchmarks in our testbed cluster while varying the resource allocation. More specifically, the benchmarks are generated by varying the CPU allocation over {1 vcore, 2 vcores, ..., 8 vcores}, the memory allocation Alloc_mem over {1 GB, 2 GB, ..., 8 GB}, and the input dataset D_set over {10 GB, 20 GB, 30 GB, 50 GB} for the different jobs. We run each benchmark 10 times and report the average over the runs.

Fig. 3: Relationship between task duration and partition size: (a) InvertedIndex, 10 GB; (b) InvertedIndex, 10 and 20 GB
Fig. 4: Relationship between task duration and CPU allocation: (a) a reduce task in Sort; (b) a reduce task in InvertedIndex
Fig. 5: Relationship between task duration and memory allocation: (a) a reduce task in Sort; (b) a reduce task in InvertedIndex

In the first set of experiments, we fix the CPU and memory allocation of each reduce task and focus on identifying the relationship between partition size and task running time. To illustrate, Figure 3a shows the result of running the InvertedIndex job using a 10 GB input. It is evident that there is a linear relationship between partition size and running time. Furthermore, Figure 3b shows the result when the input size of the job is changed from 10 GB to 20 GB. Again, the running time is linearly correlated with partition size. However, we also found that the size of the total intermediate data, denoted D (the sum of all partitions), has an impact on task duration when varying the input dataset. A similar observation is made in [14], where Zhang et al. show that the duration of the shuffle phase can be approximated with a piece-wise linear function when the intermediate data per reduce task is larger than 3.2 GB in their Hadoop cluster. This is consistent with the phenomenon we observed.
In the next set of experiments, we fix the input size and vary either the CPU or the memory allocation of each reduce task. Figure 4 shows typical results for the Sort and InvertedIndex jobs when varying the CPU allocation (the memory allocation is fixed to 1 GB). We found that task running time is inversely proportional to the CPU allocation. In particular, the task running time is approximately halved when the CPU allocation is increased from 1 vcore to 2 vcores. While this relationship is accurate when the number of vcores is small, the model is no longer accurate when a large amount of CPU is allocated to a task. In that case, the resource bottleneck may shift from CPU to other resource dimensions such as disk I/O, so the benefit of increasing the CPU allocation diminishes. Thus, we can expect the duration of reduce tasks to be approximated by a different inversely proportional function once the CPU allocation exceeds a threshold φ. This threshold is related to job characteristics and cluster configuration. However, for a different job and Hadoop cluster, φ can easily be determined by comparing the change in task duration while increasing the CPU allocation.² We then repeat the same experiment for memory: we vary the memory allocation from 1 to 7 GB while the CPU allocation is fixed to 1 vcore. We found that the same relationship does not apply to memory.

² We use the following policy in this paper: we increase the CPU allocation from 1 vcore to 8 vcores and calculate the speedup of the task running time between the current and previous CPU allocations, denoted Speedup_j (j ∈ [1, 7]). The first CPU allocation where Speedup_j < 0.5 · Speedup_{j-1} is considered as the threshold φ.
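A small sketch of the threshold policy from footnote 2. Since the footnote is terse, we read Speedup_j as the ratio of the running time at the previous CPU allocation to that at the current one; with typical diminishing-returns data this rule only fires after an abrupt flattening, so the sketch falls back to the largest benchmarked allocation otherwise. The measurements below are invented.

```python
def find_cpu_threshold(durations, factor=0.5):
    """durations[k] = measured running time with (k+1) vcores, k = 0..7.

    Speedup_j is taken as durations[j-1] / durations[j]; the threshold phi is
    the first CPU allocation where Speedup_j < factor * Speedup_{j-1}.  If the
    rule never fires, fall back to the largest allocation that was benchmarked.
    """
    speedups = [durations[j - 1] / durations[j] for j in range(1, len(durations))]
    for j in range(1, len(speedups)):
        if speedups[j] < factor * speedups[j - 1]:
            return j + 2          # speedups[j] belongs to the (j+2)-vcore step
    return len(durations)

# Invented measurements (seconds) at 1..8 vcores: the second vcore roughly
# halves the running time, after which extra vcores barely help.
measured = [620, 300, 292, 288, 285, 283, 282, 281]
print("phi =", find_cpu_threshold(measured), "vcores")   # -> phi = 3 vcores
```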
Figure 5 shows the task running time as a function of the memory allocation. We found that even when the memory allocation is increased, no improvement is observed. We believe the reason is that memory is not the bottleneck resource for this task. In this case, the memory allocation does not affect task duration as long as it is sufficient for the task.

Based on the above observations, we now derive our task performance model. Consider a job with N reduce tasks. For each reduce task i, let T_i denote its execution time, P_i the size of its partition, D the size of the intermediate data of the whole job, and A_i its CPU allocation. The performance model can be stated as:

T_i = α + β·P_i + γ·D + ζ/A_i + η·P_i/A_i + ξ·D/A_i,        when A_i ≤ φ
T_i = α′ + β′·P_i + γ′·D + ζ′/A_i + η′·P_i/A_i + ξ′·D/A_i,   when A_i > φ        (2)

where α, β, γ, ζ, η, ξ and α′, β′, γ′, ζ′, η′, ξ′ are the coefficients to be estimated using nonlinear regression [15]. In practice, we may leverage historical records of job executions to provide input to the regression algorithm. This is reasonable in production environments, as many jobs are executed routinely in today's production data centers. Specifically, we capture a triple (T_i, P_i, A_i) for each reduce task of the job. Using the triples of all reduce tasks as training data, we can easily learn the coefficients of the performance model for each job. In the end, we produce one performance model M_j for each job j that can be used as input for scheduling. Finally, we would like to mention that while our performance model focuses on the CPU allocation, we believe it can be extended to handle cases where other resources become the performance bottleneck by adding further terms (e.g., similar to the second and third terms in Equation 2) to the model.

C. Scheduling Algorithm

Once the performance model has been trained and the partition sizes have been predicted, the scheduler can decide how much resource to allocate to each task. In order to mitigate the impact of data skew, we adopt a simple strategy: make all reduce tasks have similar running times. Algorithm 1 describes our resource allocation policy. After reaching the threshold δ, the partition size of each reduce task is predicted with the prediction model. As for the memory allocation, it does not affect task duration as long as it is sufficient for the task, as discussed in Section IV-B. We therefore set the memory allocation to ⌈P_i / Unit_mem⌉ · Unit_mem, where Unit_mem is the minimum memory allocation. With respect to the CPU allocation, we obtain the amount of resources according to the performance model M_j, as described in lines 5 to 12. First, we calculate the execution time T_md, which represents the time it takes to complete the task with the median partition size P_md, using the performance model M_j. After that, we set T_md as the target for every reduce task, and calculate the amount of resources that each reduce task needs.

Algorithm 1 Resource allocation algorithm
Input: δ - threshold for stopping the training of the partition size prediction model; M_j - reduce phase performance model of job j; φ - maximum CPU allocation
Output: C - set of resource allocations ⟨A_i, Alloc_i^mem⟩ for the reduce tasks
1: Collect S_ij and F_j whenever a successful map completion event is received by the ApplicationMaster
2: When threshold δ is reached:
3:   Stop training and finalize the partition size prediction model
4:   Predict Set⟨P_i⟩ for F = 100%
5: Calculate the median value P_md in Set⟨P_i⟩
6: Calculate T_md using M_j with P_i = P_md and A_i = 1 vcore
7: for each reduce task i ∈ [1, N] do
8:   Alloc_i^mem = ⌈P_i / Unit_mem⌉ · Unit_mem
9:   Solve Equation 2 with T_i = T_md for A_i
10:  if A_i > φ then
11:    A_i = φ
12:  end if
13:  C = C ∪ {⟨A_i, Alloc_i^mem⟩}
14: end for
15: return C
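A compact sketch of Algorithm 1's allocation step, assuming the partition sizes have already been predicted and the coefficients of the A_i ≤ φ branch of Equation 2 are available from profiling. Solving Equation 2 for A_i with T_i = T_md has the closed form A_i = (ζ + η·P_i + ξ·D) / (T_md − α − β·P_i − γ·D). All numbers below are illustrative.

```python
import math
from statistics import median

def duration(P, A, D, c):
    """Equation 2 (branch A <= phi): predicted reduce task duration."""
    return (c["alpha"] + c["beta"] * P + c["gamma"] * D
            + (c["zeta"] + c["eta"] * P + c["xi"] * D) / A)

def allocate(partitions_mb, D_mb, c, unit_mem_mb=1024, phi=8):
    P_md = median(partitions_mb)
    T_md = duration(P_md, 1, D_mb, c)           # target: median partition, 1 vcore
    allocations = []
    for P in partitions_mb:
        mem = math.ceil(P / unit_mem_mb) * unit_mem_mb
        denom = T_md - c["alpha"] - c["beta"] * P - c["gamma"] * D_mb
        if denom <= 0:                           # target unreachable: cap at phi
            cpu = phi
        else:
            cpu = (c["zeta"] + c["eta"] * P + c["xi"] * D_mb) / denom
        cpu = min(phi, max(1, math.ceil(cpu)))   # whole vcores, capped at phi
        allocations.append({"cpu_vcores": cpu, "mem_mb": mem})
    return allocations

# Illustrative profile coefficients and predicted partition sizes (MB).
coeff = {"alpha": 5.0, "beta": 0.02, "gamma": 0.001,
         "zeta": 10.0, "eta": 0.15, "xi": 0.002}
partitions = [300, 450, 520, 700, 1500]          # one heavily skewed partition
print(allocate(partitions, D_mb=sum(partitions), c=coeff))
```

In this toy run the reducer with the 1500 MB partition receives several vcores and a larger memory grant, while the small partitions keep the minimum container, which is exactly the equalizing behaviour the algorithm aims for.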
Because nodes have finite resource capacities, A_i must not exceed those capacities. Moreover, in our experience, once the CPU allocation of a task reaches a certain threshold, further increasing the allocation does not improve its execution time but instead wastes CPU resources, as shown in Section IV-B. We therefore require A_i to be at most the threshold φ, which is also an input to our algorithm.

V. EVALUATION

We perform our experiments on 11 virtual machines (VMs) in the SAVI Testbed [16], which contains a large cluster with many server machines. Each VM has four 2 GHz cores, 8 GB of RAM and an 80 GB hard disk. We deploy Hadoop YARN 2.4.0 with one VM acting as ResourceManager and NameNode, and the remaining 10 VMs as workers. Each worker is configured with 8 virtual cores and 7 GB of RAM (leaving 1 GB for other processes). The minimum CPU and memory allocations for a container are 1 vcore and 1 GB, respectively. The HDFS block size is set to 128 MB, and the replication level is set to 3. We chose two jobs to evaluate DREAMS: (1) Sort, which is included in the MapReduce benchmarks of the Hadoop distribution; it takes sequence files generated by RandomWriter as input and outputs the sorted data; and (2) InvertedIndex, which comes from the PUMA benchmarks [17]; it takes a list of documents as input and generates an inverted index for these documents. We use Wikipedia data [17] for this application.

A. Accuracy of partition size prediction

In this set of experiments, we validate the accuracy of the partition size prediction model. To this end, we execute MapReduce jobs on different datasets with different thresholds δ, and compute the average relative error (ARE) over all partitions in each scenario. The ARE is defined as follows:

ARE = (1/N) · Σ_{i=1}^{N} |P_i^pred − P_i^measured| / P_i^measured    (3)

where N is the number of reduce tasks in the job, and P_i^pred and P_i^measured are the predicted and measured partition sizes of reduce task i, respectively.
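For concreteness, the metric of Equation 3 in a few lines of Python (the sample values are made up):

```python
def average_relative_error(predicted, measured):
    """Equation 3: mean of |P_pred - P_measured| / P_measured over all reduce tasks."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

pred = [310.0, 455.0, 540.0, 690.0, 1440.0]   # predicted partition sizes (MB)
meas = [300.0, 450.0, 520.0, 700.0, 1500.0]   # measured partition sizes (MB)
print(f"ARE = {average_relative_error(pred, meas):.2%}")
```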
TABLE I: Average relative error (ARE) of the partition size prediction

Application   | Type      | Size (GB) | δ=0.5 | δ=0.6 | δ=0.7 | δ=0.8 | δ=0.9 | δ=1.0
Sort          | Synthetic | 10        | 2.28% | 2.09% | 1.94% | 1.81% | 1.71% | 1.71%
Sort          | Synthetic | 20        | 1.6%  | 1.43% | 1.32% | 1.26% | 1.17% | 1.13%
Sort          | Synthetic | 50        | 1.1%  | 1.01% | 0.94% | 0.9%  | 0.84% | 0.78%
InvertedIndex | Wikipedia | 9.1       | 8.2%  | 7.63% | 7.5%  | 7.05% | 6.43% | 5.87%
InvertedIndex | Wikipedia | 21.2      | 5.62% | 5.25% | 5.08% | 4.79% | 4.53% | 4.38%
InvertedIndex | Wikipedia | 49.4      | 4.73% | 4.43% | 4.21% | 4.07% | 3.9%  | 3.7%

Table I summarizes the average relative errors in each scenario. We run 10 experiments for each scenario and report the average. It can be seen that the ARE is less than 8.2% in all cases. Furthermore, as the threshold δ increases, the prediction accuracy improves.

B. Accuracy of the reduce phase performance model

In order to evaluate the accuracy and workload independence of the generated performance model, we compute the prediction error for Sort and InvertedIndex with different input workloads. We perform two validations, as follows:

Test-on-training - evaluate the accuracy of the performance model on the training dataset. That is, we compute the predicted reduce task duration for each tuple (P_i, A_i)³ using the performance model learned from this same training dataset, then compute the prediction error.

Test-on-unknown - evaluate the accuracy of the performance model on an unknown dataset. That is, we compute the predicted reduce task duration for each tuple (P_i, A_i) using the performance model learned from the 10 GB workload (this derived model is treated as the job profile), then compute the prediction error.

For both validations, we use the ARE to evaluate the accuracy:

ARE = (1/k) · Σ_{l=1}^{k} |T_l^pred − T_l^measured| / T_l^measured    (4)

where k is the number of tuples (P_i, A_i) for an input dataset.

³ For example, if a job has N reduce tasks, then for each reduce task there is one value of P_i and 8 values of A_i ∈ {1, 2, ..., 8}; therefore, there are 8N tuples for this workload.

TABLE II: Average relative error (ARE) of the reduce task performance model

Application   | Type      | Size (GB) | Test-on-training | Test-on-unknown
Sort          | Synthetic | 10        | 5.44%            | 9.36%
Sort          | Synthetic | 20        | 7.91%            | 10.62%
Sort          | Synthetic | 30        | 12.28%           | 16.38%
Sort          | Synthetic | 50        | 11.9%            | 19.57%
InvertedIndex | Wikipedia | 9.1       | 11.67%           | 13.97%
InvertedIndex | Wikipedia | 21.2      | 12.89%           | 13.31%
InvertedIndex | Wikipedia | 31.3      | 14.67%           | 16.44%
InvertedIndex | Wikipedia | 49.4      | 14.56%           | 17.6%

Table II summarizes the average relative error of the reduce task performance model for Sort and InvertedIndex. With regard to the Test-on-training validation, the prediction error for Sort and InvertedIndex is less than 15% for all workloads. For the Test-on-unknown group, the prediction error is slightly higher than the corresponding Test-on-training value, but still less than 20%. These results confirm the accuracy of our performance model.
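A sketch of how such a profile could be fitted and validated on synthetic triples (T_i, P_i, A_i). For a fixed φ each branch of Equation 2 is linear in its coefficients, so ordinary least squares suffices in this sketch, whereas the paper cites nonlinear regression [15]; also, for identifiability of the D terms the sketch trains on two workload sizes, whereas the paper profiles a single 10 GB workload. The coefficients, φ = 4 and noise level are invented.

```python
import numpy as np

def design_matrix(P, A, D):
    """Columns match Equation 2: 1, P, D, 1/A, P/A, D/A (per-sample values)."""
    return np.column_stack([np.ones_like(P), P, D, 1 / A, P / A, D / A])

def are(pred, meas):
    return np.mean(np.abs(pred - meas) / meas)            # Equation 4

rng = np.random.default_rng(2)
true = np.array([5.0, 0.02, 0.001, 10.0, 0.15, 0.002])    # alpha..xi (invented)

def synth(n, D_mb):
    P = rng.uniform(200, 1500, n)                          # partition sizes (MB)
    A = rng.integers(1, 5, n).astype(float)                # vcores, all <= phi = 4
    D = np.full(n, float(D_mb))                            # total intermediate data
    T = design_matrix(P, A, D) @ true + rng.normal(0, 3, n)
    return T, P, A, D

# Profile trained on two workloads, then tested on an unseen, larger one.
train = [synth(200, 10_000), synth(200, 20_000)]
T_tr, P_tr, A_tr, D_tr = (np.concatenate(x) for x in zip(*train))
T_un, P_un, A_un, D_un = synth(200, 30_000)

coeff, *_ = np.linalg.lstsq(design_matrix(P_tr, A_tr, D_tr), T_tr, rcond=None)
print("test-on-training ARE:", round(are(design_matrix(P_tr, A_tr, D_tr) @ coeff, T_tr), 3))
print("test-on-unknown  ARE:", round(are(design_matrix(P_un, A_un, D_un) @ coeff, T_un), 3))
```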
Fig. 6: Job completion time of individual jobs: (a) Sort; (b) InvertedIndex
Fig. 7: Sorting 10 GB with Native Hadoop: (a) Task Execution Timeline; (b) CPU and Memory Utilization
Fig. 8: Sorting 10 GB with DREAMS: (a) Task Execution Timeline; (b) CPU and Memory Utilization
TABLE III: Workload characteristics

Application   | Type      | Size (GB) | CV of Partition Sizes | #Map, #Reduce tasks
Sort          | Synthetic | 10        | 46.24%                | 80, 64
Sort          | Synthetic | 20        | 46.24%                | 160, 64
Sort          | Synthetic | 30        | 46.24%                | 240, 64
Sort          | Synthetic | 50        | 46.24%                | 400, 64
InvertedIndex | Wikipedia | 9.1       | 17.44%                | 73, 80
InvertedIndex | Wikipedia | 21.2      | 19.12%                | 169, 80
InvertedIndex | Wikipedia | 31.3      | 22.13%                | 252, 80
InvertedIndex | Wikipedia | 49.4      | 24.95%                | 396, 80

C. Performance Evaluation

We have implemented DREAMS on Hadoop YARN 2.4.0 as an additional feature. Implementing this approach requires minimal changes to the existing Hadoop architecture. In this section, we compare the performance of DREAMS against native Hadoop YARN 2.4.0 (called Native in this paper). The slowstart threshold is set to 1%, and the CgroupsLCEResourcesHandler is enabled.

We first evaluate DREAMS using individual jobs (either Sort or InvertedIndex) with several input data sizes from 10 GB to 50 GB. Table III gives an overview of these workloads. Note that tuning the number of reduce tasks for each workload can improve job completion time [18]. To isolate this effect, we fix the number of reduce tasks for each job. The CV (coefficient of variation) of the partition sizes represents the skewness of the reduce input distribution. We can see from the table that the CV values of all the workloads are less than 50%.⁴

⁴ The CV of each Sort workload is the same because these workloads are generated by the same RandomWriter.

The experiment results are shown in Figure 6. We can see from the figure that DREAMS outperforms Native in all cases. In particular, DREAMS improves job completion time by 20.3% when sorting 50 GB of data. To understand the reason behind this performance gain, we plot the task execution timeline and the cluster CPU and memory usage when executing the 10 GB Sort for Native and DREAMS in Figure 7 and Figure 8. We find that DREAMS equalizes the durations of the reduce tasks, and achieves higher CPU and memory utilization than Native in the reduce stage. More specifically, the utilization of DREAMS and Native during the map stage is similar; after the map stage completes (around the 150-second mark), both the CPU and memory utilization of DREAMS become higher than those of Native. Furthermore, we found that DREAMS generally achieves a higher reduction in job completion time for Sort than for InvertedIndex. That is because DREAMS only improves the resource allocation in the reduce stage, leaving the map stage unchanged, and Sort is reduce-intensive, i.e., its reduce stage takes longer than its map stage. As a result, DREAMS provides higher gains in job running time for reduce-intensive jobs.

We now present our evaluation results using multiple jobs in parallel. According to the cumulative distribution function of job running times from a production workload trace at Facebook [19], job completion times follow a long-tail distribution. More specifically, most of the jobs (more than 50%) are less than 100 seconds long, and the distribution of inter-arrival times for this workload trace is roughly exponential with a mean of 14 seconds. Therefore, in this evaluation, we vary the number of jobs of 5 GB Sort and 5 GB InvertedIndex from 1 to 16 to create batch workloads, and submit the jobs with inter-arrival times following an exponential distribution with a mean of 14 seconds, as sketched below. We run each of the batch workloads 5 times using Native and DREAMS.
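A sketch of how such a batch submission schedule can be generated, assuming exponentially distributed inter-arrival times with a 14-second mean; the job names and the final submit step are placeholders rather than any Hadoop API.

```python
import numpy as np

def submission_times(num_jobs, mean_interarrival_s=14, seed=0):
    """Cumulative submission times with exponential inter-arrival gaps."""
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.exponential(mean_interarrival_s, size=num_jobs))

# One batch: an equal mix of 5 GB Sort and 5 GB InvertedIndex jobs.
jobs = ["sort-5g", "invertedindex-5g"] * 8           # 16 jobs in total
for job, t in zip(jobs, submission_times(len(jobs))):
    print(f"t={t:6.1f}s  submit {job}")              # placeholder for the actual submission
```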
The results for average job completion time are shown in Figure 9a. It can be seen that DREAMS outperforms Native Hadoop in all scenarios. Admittedly, the gain of DREAMS in the multi-job experiments is smaller than in the single-job experiments. This is because the reduce tasks of small jobs only last dozens of seconds, which means the difference between the longest and the shortest task is also only dozens of seconds. As the number of jobs increases, many short tasks are scheduled one after the other. As a result, these short tasks often fit into the resource vacancies that skewed tasks leave behind. Therefore, in some cases DREAMS obtains only a gain of dozens of seconds for these small jobs (note that dozens of seconds constitute a big gain in the single-job scenario). In the future, we intend to evaluate DREAMS using multiple large jobs.

Figures 9b and 9c show the resource utilization of the cluster during the execution of each batch for DREAMS and Native, respectively. It can be seen from these diagrams that DREAMS achieves slightly higher CPU utilization than Native Hadoop, while the memory utilization of the two is similar. This is because the biggest partition of all the reduce tasks in this workload is smaller than the minimum memory allocation, so DREAMS does not adjust the memory. With respect to the CPU allocation, however, DREAMS makes per-task adjustments, thereby achieving higher CPU utilization.

Fig. 9: Multiple Jobs Benchmark: (a) Job Completion Time; (b) Resource Utilization of DREAMS; (c) Resource Utilization of Native

VI. RELATED WORK

The data skew problem in MapReduce has been investigated extensively in recent years. Kwon et al. [20] present five types of skew in MapReduce applications that are caused by the characteristics of the algorithm and dataset, and propose best practices to mitigate them. Several approaches have been proposed to mitigate the impact of skewed data. The authors of [7] and [6] define a cost model for assigning reduce keys to reduce tasks so as to balance the load among reduce tasks. However, both approaches have to wait until all the map tasks have completed; as shown in [5], this increases the job completion time. In order to distribute the load equally to worker machines while overlapping the map and reduce phases, the proposal in [9] applies a Greedy-Balance approach that assigns unassigned keys to the machine with the least load. This solution is based on the assumption that the size of each key-value pair is identical, which is not true in real workloads. Even though the results in that paper show a reduction of the maximum load compared to the default solution, the shuffle finishing time is worse than the default, and the paper provides no evaluation of whether the job completion time can be shortened. Unlike these late-shuffling approaches, Ramakrishnan et al. [8] propose a progressive sampler to estimate the intermediate data distribution and then partition the data to balance the load across all reduce tasks. However, this solution needs an additional sampling phase before jobs start, which can be time-consuming. Instead of chopping large partitions to balance the load, SkewTune [4] repartitions heavily skewed partitions to achieve this goal. However, it imposes an overhead for repartitioning data and concatenating the original output. Compared to SkewTune, our solution dynamically allocates the right amount of resources to tasks to equalize their completion times, which is simpler and incurs no such overhead. Finally, Zacheilas et al. propose DynamicShare [3], which aims at scheduling MapReduce jobs in heterogeneous systems to meet their real-time response requirements, achieving an
even distribution of the partitions by assigning the partitions in such a way that more work is placed on powerful nodes. Similar to SkewTune, it imposes an overhead for the partition assignment procedure. Besides, DynamicShare cannot start partition assignment until all map tasks have completed.

Resource-aware scheduling has received considerable attention in recent years. The original Hadoop MapReduce implements a slot-based resource allocation scheme, which does not take run-time task resource consumption into consideration. To address this limitation, Hadoop YARN [2] represents a major endeavor towards resource-aware scheduling in MapReduce clusters. It offers the ability to specify the size of a container in terms of requirements for each type of resource. However, YARN assumes the resource consumption of every Map (or Reduce) task in a job is identical, which is not true for data-skewed MapReduce jobs. Sharma et al. propose MROrchestrator [21], a MapReduce resource management framework that can identify resource deficits based on resource profiling and dynamically adjust resource allocations. Compared with our solution, MROrchestrator cannot identify the stragglers caused by workload imbalance before tasks launch, and it cannot judiciously place tasks that need more resources on the machines with more free resources. In other words, if all CPU-intensive tasks are launched on one machine, no matter how MROrchestrator adjusts the allocation, the resource deficit cannot be mitigated. Several other proposals fall into another category of resource scheduling policies, such as [11], [13], [22], [23]. The main focus of these approaches is on adjusting the resource allocation in terms of the number of Map and Reduce slots for jobs in order to achieve fairness, maximize resource utilization or meet job deadlines. They do not, however, address the data skew problem.

VII. CONCLUSION

MapReduce has become a predominant model for large-scale data processing in recent years. However, existing MapReduce schedulers still use a simple hash function to assign map outputs to reduce tasks. This simple data assignment scheme may result in a phenomenon known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. While many approaches have been proposed to address this issue, existing solutions often incur additional overhead for run-time partition size prediction and data repartitioning. Motivated by this limitation, in this paper we presented DREAMS, a framework for run-time partitioning skew mitigation. Unlike previous approaches that try to balance the reduce workload by repartitioning the workload assigned to each reduce task, in DREAMS we cope with partitioning skew by adjusting task run-time resource allocation. To do so, we first develop an online partition size prediction model which can estimate the partition sizes of reduce tasks at run-time. Our experimental results show that its average relative error is less than 8.2% in all cases. Second, we design a reduce task performance model that correlates task duration with run-time resource allocation and the input size of reduce tasks. The validation results show that the worst prediction error is 19.57%. Third, we demonstrate the benefit of leveraging resource-awareness for run-time skew mitigation.
Through experiments using real and synthetic workloads, we show that DREAMS can effectively mitigate the negative impact of partitioning skew while incurring negligible overhead, thereby improving job running time by up to 20.3%.

ACKNOWLEDGEMENT

This work is supported in part by the National Natural Science Foundation of China (No. 61472438), and in part by the Smart Applications on Virtual Infrastructure (SAVI) project funded under the Natural Sciences and Engineering Research Council of Canada (NSERC) Strategic Networks grant number NETGP394424-10.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth et al., "Apache Hadoop YARN: Yet another resource negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013, p. 5.
[3] N. Zacheilas and V. Kalogeraki, "Real-time scheduling of skewed MapReduce jobs in heterogeneous environments," in Proceedings of the International Conference on Autonomic Computing.
[4] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: mitigating skew in MapReduce applications," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 25-36.
[5] M. Hammoud, M. S. Rehman, and M. F. Sakr, "Center-of-gravity reduce task scheduling to lower MapReduce network traffic," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012, pp. 49-58.
[6] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu, "Handling partitioning skew in MapReduce using LEEN," Peer-to-Peer Networking and Applications, vol. 6, no. 4, pp. 409-424, 2013.
[7] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Handling data skew in MapReduce," in Proceedings of the 1st International Conference on Cloud Computing and Services Science, vol. 146, 2011, pp. 574-583.
[8] S. R. Ramakrishnan, G. Swart, and A. Urmanov, "Balancing reducer skew in MapReduce workloads using progressive sampling," in Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012, p. 16.
[9] Y. Le, J. Liu, F. Ergun, and D. Wang, "Online load balancing for MapReduce with skewed data input."
[10] L. Cheng, Q. Zhang, and R. Boutaba, "Mitigating the negative impact of preemption on heterogeneous MapReduce workloads," in Proceedings of the 7th International Conference on Network and Services Management. International Federation for Information Processing, 2011, pp. 189-197.
[11] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing. ACM, 2011, pp. 235-244.
[12] Hadoop, "Fair scheduler," http://hadoop.apache.org/docs/r2.4.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
[13] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in NSDI, vol. 11, 2011, pp. 24-24.
[14] Z. Zhang, L. Cherkasova, and B. T. Loo, "Benchmarking approach for designing a MapReduce performance model," in Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering. ACM, 2013, pp. 253-258.
[15] D. M. Bates and D. G. Watts, Nonlinear regression: iterative estimation and linear approximations. Wiley Online Library, 1988.
[16] J.-M. Kang, H. Bannazadeh, and A. Leon-Garcia, "SAVI testbed: Control and management of converged virtual ICT resources," in Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on. IEEE, 2013, pp. 664-667.
[17] F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, "PUMA: Purdue MapReduce benchmarks suite," 2012.
[18] Z. Zhang, L. Cherkasova, and B. T. Loo, "AutoTune: Optimizing execution concurrency and resource usage in MapReduce workflows," in ICAC, 2013, pp. 175-181.
[19] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems. ACM, 2010, pp. 265-278.
[20] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "A study of skew in MapReduce applications," Open Cirrus Summit, 2011.
[21] B. Sharma, R. Prabhakar, S. Lim, M. T. Kandemir, and C. R. Das, "MROrchestrator: A fine-grained resource orchestration framework for MapReduce clusters," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012, pp. 1-8.
[22] J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguadé, M. Steinder, and I. Whalley, "Performance-driven task co-scheduling for MapReduce environments," in Network Operations and Management Symposium (NOMS), 2010 IEEE. IEEE, 2010, pp. 373-380.
[23] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin, "FLEX: A slot allocation scheduling optimizer for MapReduce workloads," in Middleware 2010. Springer, 2010, pp. 1-20.