Dynamic Resource Allocation for MapReduce with Partitioning Skew


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC, IEEE Transactions on Computers

IEEE TRANSACTIONS ON COMPUTERS, VOL. 13, NO. 9, SEPTEMBER

Dynamic Resource Allocation for MapReduce with Partitioning Skew

Zhihong Liu, Student Member, IEEE, Qi Zhang, Student Member, IEEE, Reaz Ahmed, Member, IEEE, Raouf Boutaba, Fellow, IEEE, Yaping Liu, and Zhenghu Gong

Abstract—MapReduce has become a prevalent programming model for building data processing applications in the cloud. While being widely used, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. Existing solutions follow a similar principle that repartitions workload among reduce tasks. However, those approaches often incur high performance overhead due to the partition size prediction and repartitioning. In this paper, we present DREAMS, a framework that provides run-time partitioning skew mitigation. Instead of repartitioning workload among reduce tasks, we cope with the partitioning skew problem by controlling the amount of resources allocated to each reduce task. Our approach completely eliminates the repartitioning overhead, yet is simple to implement. Experiments using both real and synthetic workloads running on a 21-node Hadoop cluster demonstrate that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving the job completion time by up to a factor of 2.29 over the native Hadoop YARN. Compared to the state-of-the-art solution, DREAMS can improve the job completion time by a factor of

Index Terms—MapReduce, Hadoop YARN, resource allocation, partitioning skew

1 INTRODUCTION

In recent years, the exponential growth of raw data has generated tremendous needs for large-scale data processing. In this context, MapReduce [1], a parallel computing framework, gained significant popularity. A MapReduce job consists of two types of tasks, namely Map and Reduce.
Each map task takes a chunk of input data and runs a user-specified map function to generate intermediate key-value pairs. Subsequently, each reduce task collects the intermediate key-value pairs and applies a user-specified reduce function to produce the final output. Due to its remarkable advantages in terms of simplicity, robustness and scalability, MapReduce has been widely used by companies such as Amazon, Facebook, and Yahoo! to process large volumes of data on a daily basis. Consequently, it has attracted considerable attention from both industry and academia. Despite its success, the current implementations of MapReduce suffer from a few limitations. In particular, the widely-used MapReduce system, Apache Hadoop MapReduce [2], uses a hash function to partition the intermediate key-value pairs across reduce tasks. The goal of using a hash function is to evenly distribute the workload to each reduce task. In reality this goal is rarely achieved [3], [4]. For example, Zacheilas et al. [3] have demonstrated the existence of skewness in a YouTube social graph application using real-world data. The experiments in [3] showed that the biggest workload among reduce tasks is larger than the smallest by more than a factor of five. The skewed workload distribution among reduce tasks can have a severe impact on job completion time. Note that the completion time of a MapReduce job is determined by the completion time of the slowest reduce task. Data skewness causes certain tasks with heavy workload to run slower than others. This in turn prolongs the job completion time.

Zhihong Liu is with the College of Computer, National University of Defense Technology, Changsha, China, and the David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. E-mail: zhlu@nudt.edu.cn
Q. Zhang, R. Ahmed and R. Boutaba are with the David R. Cheriton School of Computer Science, University of Waterloo.
Yaping Liu and Zhenghu Gong are with the National University of Defense Technology.
Manuscript received April 19, 2005; revised September 17, 2014.
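The hash partitioning described above is easy to make concrete. The sketch below is illustrative rather than Hadoop code: Python's built-in hash stands in for Java's String.hashCode, and the masking mirrors Hadoop's default HashPartitioner.

```python
from collections import Counter

def partition(key: str, num_reduces: int) -> int:
    # Mirrors Hadoop's default HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    # Python's built-in hash() stands in for Java's String.hashCode().
    return (hash(key) & 0x7FFFFFFF) % num_reduces

# A skewed word stream: one popular word dominates, as with the popular
# words in the InvertedIndex example discussed in the text.
words = ["the"] * 1000 + ["map"] * 10 + ["reduce"] * 10 + ["skew"] * 5
sizes = Counter(partition(w, 4) for w in words)
# Every copy of "the" hashes to the same reduce task, so one of the four
# partitions holds at least 1000 of the 1025 pairs.
```

Whichever reduce task receives the hot key ends up with the bulk of the work, and the job finishes only when that task does.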
Several recent approaches have been proposed to handle the partitioning skew problem [4], [5], [6], [7], [8], [9], [10]. They follow a similar principle: predict the workload of individual reduce tasks based on certain statistics of the key-value pairs (e.g. key frequencies [6], [8]), and then repartition the workload to achieve a better balance among the reduce tasks. However, in order to collect the statistics of key-value pairs, most of those solutions either have to prevent the reduce phase from overlapping with the map phase, or add a sampling phase before executing the actual job. SkewTune [4] can reduce this waiting time by redistributing the unprocessed workload of a slow reduce task at runtime. However, SkewTune incurs an additional run-time overhead of approximately 30 seconds (as reported in [4]). This overhead can be quite expensive for small jobs with an average life span of around 100 seconds, which are very common in today's production clusters [11]. Motivated by the limitations of the existing solutions, in this paper we take a radically different approach to addressing data skewness. Instead of repartitioning the workload among reduce tasks, our approach dynamically allocates resources to reduce tasks according to their workload. Since no repartitioning is involved, our approach completely eliminates the repartitioning overhead. To this end, we present DREAMS, a Dynamic REsource Allocation technique for MapReduce with partitioning Skew. DREAMS leverages historical records to construct profiles for each job type. This is reasonable because many production jobs are executed

(c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See for more information.

repeatedly in today's production clusters [12]. At run-time, DREAMS can dynamically detect data skewness and assign more resources to reduce tasks with large partitions to make them finish faster. Compared to the previous works, our contributions can be summarized as follows:

- We first develop a partition size prediction model that can forecast the partition sizes of reduce tasks at run-time. Specifically, we can accurately predict the size of each partition when only 5% of map tasks have completed.
- We establish a task performance model that correlates the completion time of individual reduce tasks with their partition sizes and resource allocations.
- We propose a scheduling algorithm that dynamically adjusts the resource allocation to each reduce task using our task performance model and the estimation of the partition size. This can reduce the running time difference among reduce tasks that have different sizes of partitions to process, thereby accelerating the job completion.
- Experiments using both real and synthetic workloads running on a 21-node Hadoop cluster demonstrate that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving the job completion time by up to a factor of 2.29 over the native Hadoop YARN. Compared to a state-of-the-art solution like SkewTune, DREAMS can improve the job completion time by a factor of

This paper extends our preliminary work [13] in a number of ways. First, the time complexity of the on-line partition size prediction model has been presented. Second, we have added memory allocation into the reduce task performance model. Third, the scheduling algorithm in the original manuscript has been reformulated as an optimization problem and its optimal solution is presented.
Finally, we have conducted additional experiments to evaluate the effectiveness of DREAMS. The rest of this paper is organized as follows. Section 2 provides the motivation of our work. We describe the system architecture of DREAMS in Section 3. Section 4 illustrates the design of DREAMS in detail. Section 5 provides the results from experimental evaluation. Finally, we summarize the existing works related to DREAMS in Section 7, and draw our conclusion in Section 8.

2 MOTIVATION

In the state-of-the-art MapReduce systems, each map task processes one chunk of the input data and generates a sequence of intermediate key-value pairs. A hash function is then used to partition these key-value pairs and distribute them to reduce tasks. Since all map tasks use the same hash function, the key-value pairs with the same hash value are assigned to the same reduce task. During the reduce stage, each reduce task takes one partition (i.e. the intermediate key-value pairs corresponding to the same hash value) as input, and performs a user-specified reduce function on its partition to generate the final output. This process is illustrated in Figure 1.

Fig. 1: MapReduce Programming Model

Ideally, the hash function is expected to generate equal-size partitions if the key frequencies and the sizes of the key-value pairs are uniformly distributed. However, in reality, the hash function often fails to achieve uniform partitioning, resulting in skewed partition sizes. For example, in the InvertedIndex job [14], the hash function partitions the intermediate key-value pairs based on the occurrence of words in the files. Therefore, reduce tasks processing more popular words will be assigned a larger number of key-value pairs. As shown in Figure 1, partitions are unevenly distributed by the hash function: P1 is larger than P2, which causes workload imbalance between R1 and R2. Zacheilas et al. [3] presented the following reasons for partitioning skew:

Skewed key frequencies: Some keys occur more frequently in the intermediate data.
As a result, partitions that contain these keys become extremely large, thereby overloading the reduce tasks that they are assigned to.

Skewed tuple sizes: In MapReduce jobs where the sizes of the values in the key-value pairs vary significantly, even though key frequencies are uniform, uneven workload distribution among reduce tasks may arise.

In order to address the weaknesses and inadequacies experienced in the first version of Hadoop MapReduce (MRv1), the next generation of the Hadoop compute platform, YARN [15], has been developed. Compared to MRv1, YARN manages the scheduling process using two components: a) the ResourceManager is responsible for allocating resources to the running MapReduce jobs subject to capacity constraints, fairness and so on; b) an ApplicationMaster, on the other hand, works for each running job, and has the responsibility of negotiating appropriate resources from the ResourceManager and assigning the obtained resources to its tasks. This removes the single-point bottleneck of the JobTracker in MRv1 and improves the ability to scale Hadoop clusters. In addition, YARN deprecates the slot-based resource management approach of MRv1, and adopts a more flexible resource unit called the container. The container provides resource-specific, fine-grained accounting (e.g. <2 GB RAM, 1 CPU>). A task running within a container is enforced to abide by the prescribed limits. Nevertheless, in both Hadoop MRv1 and YARN, the schedulers assume each reduce task has uniform workload and resource consumption, and therefore allocate identical resources to each reduce task. Specifically, MRv1 adopts a

slot-based allocation scheme, where each machine is divided into identical slots that can be used to execute tasks. However, MRv1 does not provide resource isolation among co-located tasks, which may cause performance degradation at run-time [16]. On the other hand, YARN uses a container-based allocation scheme, where each task is scheduled in an isolated container. But it still allocates containers of identical size to all reduce tasks that belong to the same job. This scheduling scheme can cause variation in task running time due to partitioning skew, since the execution time of a reduce task with a large partition can be prolonged because of the fixed container size. As the job completion time is dominated by the slowest task, the run-time variation of reduce tasks will prolong the job execution time. Most of the existing approaches tackle the partitioning skew problem by making the workload assignment uniformly distributed among reduce tasks, thereby mitigating the inefficiencies in both performance and utilization. However, achieving this goal requires (sometimes heavy) modification to the current Hadoop implementation, and often incurs additional overhead in terms of sampling and adaptive partitioning. Therefore, in this work we seek an alternative solution, where we adjust the size of the container based on the partitioning skew. This approach not only requires minimal modification to the existing Hadoop implementation, but at the same time effectively mitigates the negative impact of partitioning skew.

3 SYSTEM ARCHITECTURE

This section describes the design of our proposed resource allocation framework called DREAMS. The architecture of DREAMS is shown in Figure 2.

Fig. 2: Architecture of DREAMS
There are five main components: the Partition Size Monitor, running in the NodeManager; the Partition Size Predictor, Task Duration Estimator and Resource Allocator, running in the ApplicationMaster; and the Fine-grained Container Scheduler, running in the ResourceManager. Each Partition Size Monitor records the statistics of the intermediate data that a map task generates at run-time and sends them to the ApplicationMaster through heartbeat messages. The Partition Size Predictor collects the partition size reports from the NodeManagers and predicts the partition sizes of every reduce task for this job. The Task Duration Estimator constructs a statistical estimation model of reduce task performance as a function of its partition size and resource allocation. That is, the duration of a reduce task can be estimated if the partition size and resource allocation of this task are given. The Resource Allocator determines the amount of resources to be allocated to each reduce task based on the performance estimation. Lastly, the Fine-grained Container Scheduler is responsible for scheduling resources among all the ApplicationMasters in the cluster, based on scheduling policies such as Fair scheduling [17] and Dominant Resource Fairness (DRF) [18]. Note that the schedulers in the original Hadoop assume that all reduce tasks (and similarly, all map tasks) have homogeneous resource requirements in terms of CPU and memory. However, this is not appropriate for MapReduce jobs with partitioning skew. We have modified the original schedulers to support fine-grained container scheduling that allows each task to request resources of customizable size. The workflow of the resource allocation mechanism used by DREAMS consists of 4 steps, as shown in Figure 2. (1) After the ApplicationMaster is launched, it schedules all the map tasks first and then ramps up the reduce task requests gradually according to the slowstart setting, which is used to control when to start reduce tasks based on the percentage of map tasks that have finished.
During their execution, each Partition Size Monitor records the size of the intermediate key-value pairs produced by map tasks. Each Partition Size Monitor sends locally gathered statistics to the ApplicationMaster through the TaskUmbilicalProtocol, which is an RPC protocol used to monitor task status in Hadoop. (2) Upon receiving the partition size reports from the Partition Size Monitors, the Partition Size Predictor performs size prediction using our proposed prediction model (see Section 4.1). After all the estimated sizes of reduce tasks are known, the Task Duration Estimator uses the reduce task performance model (Section 4.2) to predict the duration of each reduce task with a specified amount of resources. Based on that, the Resource Allocator determines the amount of resources for each reduce task according to our proposed resource allocation algorithm (Section 4.3) to equalize the execution time of all reduce tasks, and then sends resource requests to the ResourceManager. Note that the ResourceManager reports to the ApplicationMaster the current total amount of available resources through heartbeat messages every second. Thus, the Resource Allocator can check the availability of resources when requesting containers. (3) Next, the ResourceManager receives the ApplicationMasters' resource requests through the heartbeat messages, and schedules free containers in the cluster to the corresponding ApplicationMasters. (4) Once the ApplicationMaster obtains new containers from the ResourceManager, it assigns the corresponding containers to the pending tasks, and finally launches the tasks.
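The four steps can be condensed into a toy driver. Everything below is an illustrative stand-in for the components above, not the actual Hadoop YARN API; the sizing rule (2 vcores for above-average partitions) is deliberately simplistic where DREAMS would consult the performance model of Section 4.2.

```python
from dataclasses import dataclass, field

@dataclass
class ToyApplicationMaster:
    """Illustrative model of the DREAMS workflow, not Hadoop's real API."""
    num_reduces: int
    slowstart: float = 0.05          # issue reduce requests after 5% of maps finish
    partition_sizes: dict = field(default_factory=dict)

    def on_partition_report(self, reduce_id: int, size_mb: float):
        # Step (1): Partition Size Monitors report per-reduce intermediate data
        # via heartbeats; the ApplicationMaster accumulates them.
        self.partition_sizes[reduce_id] = (
            self.partition_sizes.get(reduce_id, 0.0) + size_mb)

    def request_containers(self, map_progress: float):
        # Step (2): once enough maps have finished, size each reduce container
        # by its observed partition; steps (3)-(4) would be the ResourceManager
        # granting containers and the AM launching tasks in them.
        if map_progress < self.slowstart:
            return []
        avg = sum(self.partition_sizes.values()) / max(len(self.partition_sizes), 1)
        return [{"reduce": r,
                 "vcores": 2 if self.partition_sizes.get(r, 0.0) > avg else 1}
                for r in range(self.num_reduces)]
```

For example, after reports of 100 MB for reduce 0 and 10 MB for reduce 1, request_containers(0.10) asks for 2 vcores for the skewed reduce task and 1 vcore for the other.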

4 DREAMS DESIGN

There are three main challenges to be addressed in DREAMS. First, in order to identify partitioning skew, it is necessary to develop a run-time forecasting algorithm that predicts the partition size of each reduce task. Second, in order to determine the right container size for each reduce task, it is necessary to develop a task performance model that correlates task running time with resource allocation. Lastly, there are multiple resource dimensions such as CPU, memory, disk I/O and network bandwidth. Allocations with different combinations of these resource dimensions may yield the same completion time. Determining the appropriate combination of these resource dimensions in order to minimize the cost is a challenging problem. In the rest of this section, we describe our solutions for each of these challenges.

4.1 Predicting Partition Size

In order to cull the partitioning skew, the workload distribution among the reduce tasks should be known in advance. Unfortunately, the size of the partition belonging to each reduce task depends on the input dataset, the map function and the number of reduce tasks in a MapReduce job. Even though most MapReduce jobs are routinely executed, the same job processing a different input dataset would produce a different workload distribution among its reduce tasks. Several recently proposed approaches calculate the workload distribution among reduce tasks [3], [5], [6], [7], [19]. Existing solutions, however, either have to wait for all the map tasks to finish [3], [5], [6], or need an additional sampling procedure before executing a job [7], [19]. However, in order to improve the job completion time, existing Hadoop schedulers allow reduce tasks to be launched before the completion of all map tasks (e.g. the default slowstart setting is 5%).
It has also been demonstrated by the existing works [8], [20] that starting the shuffle phase only after the completion of all the map tasks will severely prolong the job completion time. Therefore, it is necessary to predict the partition size at run-time without introducing a barrier between the map and reduce phases. The input datasets of MapReduce jobs in a production cluster tend to be very large. Hence, the HDFS storage system [21] splits a large dataset into smaller data chunks, which naturally creates a sampling space. This suggests that a small set of random samples in this sample space may reveal the characteristics of the whole dataset in terms of workload distribution among reduce tasks. Therefore, we can analyze the pattern of the intermediate data after a fraction of map tasks have completed, and then predict the workload distribution among reduce tasks for the entire dataset. In DREAMS, we perform k measurements (j = 1, 2, ..., k) over time during the map phase, and collect the following two metrics (F^j, S_i^j): F^j is the percentage of map tasks that have been processed, where j ∈ [1, k] and k refers to the number of collected tuples (F^j, S_i^j). Note that each map task processes one inputsplit, and each inputsplit has identical size (64MB, 128MB etc.). As a result, F^j is approximately equal to the fraction of the whole dataset that has been processed. S_i^j is the size of the intermediate data generated by the completed map tasks for reduce task i. In our implementation, we have modified the reporting mechanism so that each map task reports this information to the ApplicationMaster upon map task completion. Our experimental evidence reveals that S_i^j is linearly proportional to F^j. Figure 3 shows the typical results in InvertedIndex and WordCount jobs.

Fig. 3: Partition size prediction ((a) a reduce task in InvertedIndex; (b) a reduce task in WordCount)
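Since S_i^j grows linearly with F^j, the final partition size can be extrapolated by ordinary least squares once a handful of (F^j, S_i^j) tuples have arrived. A minimal sketch for one reduce task (the sample numbers are hypothetical; for a two-parameter fit the normal equations reduce to the textbook slope/intercept formulas):

```python
def fit_partition_model(F, S):
    """Least-squares fit of S = alpha + beta * F for one reduce task.

    F: fractions of map tasks completed at each measurement,
    S: observed partition sizes (e.g. in MB) at those points."""
    k = len(F)
    sf, ss = sum(F), sum(S)
    sff = sum(f * f for f in F)
    sfs = sum(f * s for f, s in zip(F, S))
    det = k * sff - sf * sf              # determinant of X^T X
    alpha = (sff * ss - sf * sfs) / det  # intercept
    beta = (k * sfs - sf * ss) / det     # slope
    return alpha, beta

def predict_final_size(alpha, beta):
    # Extrapolate to F = 1.0, i.e. all map tasks completed.
    return alpha + beta * 1.0

# Hypothetical monitor samples: the partition grows almost linearly with
# map progress, so five early points pin down the trend.
alpha, beta = fit_partition_model(
    [0.05, 0.10, 0.15, 0.20, 0.25],
    [51.0, 99.0, 152.0, 198.0, 251.0])
final_mb = predict_final_size(alpha, beta)   # close to 1000 MB here
```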
Note that when 100% of the map tasks are completed, S_i^j will represent the actual partition size for reduce task i. Hence, we use linear regression to determine the following equation for each reduce task i ∈ [1, N]:

S_i^j = α_1^i + β_1^i F^j,  j = 1, 2, ..., k    (1)

where α_1^i and β_1^i are the regression coefficients. We introduce an outer factor, δ, which works as a threshold to control when our prediction model stops the training process and finalizes the prediction. In practice, δ can be the map completion percentage (e.g. 5%) at which scheduling of the reduce tasks may be started. Every time a new map task has finished, a new training sample is generated. When the fraction of completed map tasks reaches δ, we calculate the regression coefficients (α_1^i, β_1^i) and predict the partition size of each reduce task. Note that k is determined by δ. For instance, consider a job with 100 map tasks: if δ = 5%, then k = 5. The computational complexity of our on-line partition size prediction model is O(k·N). In particular, for each reduce task i ∈ [1, N], the scaling factors can be determined by the following equation:

(α_1^i, β_1^i)^T = (X^T X)^{-1} X^T Y,    (2)

where

X = [ 1 F^1
      1 F^2
      ...
      1 F^k ],   Y = (S_i^1, S_i^2, ..., S_i^k)^T

It takes O(2^2 k) to multiply X^T by X, O(2^3) to compute the inverse of X^T X, O(2^2 k) to multiply (X^T X)^{-1} by X^T,

Fig. 4: Relationship between task duration and partition size ((a) Sort 5GB; (b) InvertedIndex 5GB; (c) Sort 10GB; (d) InvertedIndex 10GB; (e) Sort 5GB and 10GB; (f) InvertedIndex 5GB and 10GB)

and finally O(2k) to multiply (X^T X)^{-1} X^T by Y. Therefore, the total computational complexity of the prediction model for a MapReduce job with N reduce tasks is O(k·N).

4.2 Reduce Task Performance Model

In this section, we design a reduce task performance model to estimate the execution time of reduce tasks. Currently, there are many techniques for predicting MapReduce job durations [12], [22], [23], [24]. These approaches, however, cannot estimate the durations of individual tasks. In our performance model we consider the execution time of a reduce task to be correlated with two parameters: the size of the partition to process and the resource allocation (e.g. CPU, disk I/O and bandwidth). As Hadoop YARN only allows users to specify the CPU and memory sizes of a container, in our implementation we focus on capturing the impact of CPU and memory allocations on task performance. In order to identify the relationship between task running time, partition size and resource allocation, we run a set of experiments in our testbed cluster by varying the resource allocation and input datasets. In the first set of experiments, we fix the CPU and memory allocations of each reduce task and focus on identifying the relationship between partition size and task running time. Figures 4a and 4b show the results of running the 5GB Sort and InvertedIndex jobs, respectively. It is evident that there is a linear relationship between partition size and task running time. Hence, we use linear regression to determine this relationship with Equation 3:

T_i = α_2 + β_2 P_i,  i ∈ [1, N]    (3)

where T_i and P_i are the running time and partition size of reduce task i, respectively. The regression results are also shown in Figures 4a and 4b as solid lines. Note that if the time complexities of the reduce functions in other MapReduce jobs grow nonlinearly with the sizes of the processed data, the relationship can also be easily learned by updating the regression model. Furthermore, we change the input size of the jobs from 5GB to 10GB and check whether the characteristic of this relationship is workload independent. Again, the running time is linearly correlated with partition size, as shown in Figures 4c and 4d. However, we also find that the size of the total intermediate data, denoted as D (the sum of all partitions), has an impact on task duration. A similar observation is made in [22], where Zhang et al. show that the duration of the shuffle phase can be approximated with a piece-wise linear function when the intermediate data per reduce task is larger than 3.2 GB in their Hadoop cluster. This is consistent with the phenomenon we observed. Therefore, we update the regression function to Equation 4 and train the model with the samples from both the 5GB and 10GB datasets together:

T_i = α_2 + β_2 P_i + ζ_2 D,  i ∈ [1, N]    (4)

The regression results are shown in Figures 4e and 4f. It can be seen that this updated function serves as a good fit for the relationship between partition size and task running time, although two different datasets are involved. In the next set of experiments, we fix the input size and vary either the CPU or memory allocation of each reduce task.
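Equation 4 is linear in its three coefficients, so a job profile can be fitted from historical (P_i, D, T_i) tuples by ordinary least squares. A self-contained sketch with hypothetical training data (the 3x3 normal equations are solved by Gauss-Jordan elimination to avoid any library dependency):

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * c for a, c in zip(m[r], m[col])]
    return [m[r][3] / m[r][r] for r in range(3)]

def fit_duration_model(P, D, T):
    """Fit T = a + b*P + c*D (Equation 4) via the normal equations."""
    feats = [[1.0, p, d] for p, d in zip(P, D)]
    A = [[sum(f[r] * f[c] for f in feats) for c in range(3)] for r in range(3)]
    rhs = [sum(f[r] * t for f, t in zip(feats, T)) for r in range(3)]
    return solve3(A, rhs)

# Hypothetical tuples: (partition size MB, total intermediate MB, seconds).
P = [100, 200, 300, 120, 240, 360]
D = [1000, 1000, 1000, 2000, 2000, 2000]
T = [20.0, 30.0, 40.0, 24.0, 36.0, 48.0]
a, b, c = fit_duration_model(P, D, T)
# The fitted model then predicts any reduce task's duration from its
# partition size and the job's total intermediate data, e.g. a + b*180 + c*1000.
```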
Figure 5 shows the typical results for the 30GB Sort and InvertedIndex jobs when varying the CPU allocation from 1 to 8 vcores (the memory allocation is fixed to 1 GB). We use a nonlinear regression method to model this relationship with Equation 5, and find that task running time is inversely proportional to CPU allocation. While this relationship fits well when the number of vcores is small, we also found that this model is no longer accurate when a large amount of CPU resource is allocated to a task. In these cases, the resource bottleneck may switch from CPU to other resource dimensions like disk I/O, thus the benefit of increasing CPU allocation diminishes. A similar observation is made in [24], where Jalaparti et al. show that increasing network bandwidth beyond a threshold does not help since the job completion time is dominated by disk performance. This is consistent with the phenomenon we observed. Thus, we can expect that the duration of reduce tasks might be approximated with a different inversely proportional function when the CPU allocation exceeds a threshold µ. This threshold could be related to job characteristics and cluster configuration. However, for a different job and Hadoop cluster, µ can be easily determined by comparing the change in task duration

while increasing CPU allocation.¹

T_i = α_3 + β_3 / A_i^cpu,  i ∈ [1, N]    (5)

Fig. 5: Relationship between task duration and CPU allocation ((a) a reduce task in Sort; (b) a reduce task in InvertedIndex)

1. We use the following policy in this paper: we increase the CPU allocation from 1 to 8 vcores, and calculate the speedup of task running time between the current and previous CPU allocations, denoted as Speedup_j (j ∈ [1, 7]). The first CPU allocation where Speedup_j < 0.5 Speedup_{j-1} is considered as the threshold µ.

We then repeat the same set of experiments for memory. Different from the CPU allocation in YARN, which is determined by the number of virtual cores used by the task container, there are two configurations that control the memory allocation in YARN: the physical RAM limit and the JVM heap size limit for a task. The former setting is a logical allocation used by the NodeManager to monitor the task memory usage. If the usage exceeds this limit, the NodeManager will kill the task. The latter setting is the maximum heap size of the JVM process that executes the task. It determines the maximum memory that can be used by this JVM. Hence, the JVM heap size limit should be less than the physical RAM limit. More importantly, the JVM heap size indicates the amount of memory allocation that a task can actually use. Consequently, we vary the JVM heap size limit from 200 MB (the default value) to 5600 MB while keeping the CPU allocation at 1 vcore, and use a non-linear regression method to learn this relationship with Equation 6. We find that an inversely proportional function is also applicable in this case:

T_i = α_4 + β_4 / A_i^mem,  i ∈ [1, N]    (6)

Figure 6 shows the task running time as a function of memory allocation while running the 30GB Sort and InvertedIndex jobs. From this figure we can see an obvious improvement when the memory allocation increases at the beginning. That is because a memory deficit will postpone the completion time of the task. For example, during the shuffle sub-phase in the reduce stage, a memory deficit will cause an additional process of spilling data to disk, because of inadequate space to store all the data of a task in RAM, thereby prolonging the task and adding a burden to disk I/O as well. However, with the allocation continually rising, the improvement becomes smaller. The reason is that, as memory allocation increases beyond a threshold, the resource bottleneck of a task shifts to other resources. After that point, the completion time of a task will not be reduced despite the increase in memory allocation. This observation is consistent with the CPU resource.

Fig. 6: Relationship between task duration and memory allocation ((a) a reduce task in Sort; (b) a reduce task in InvertedIndex)

Based on the above observations, we now derive our reduce task performance model. For each reduce task i among N reduce tasks, let T_i denote the execution time of reduce task i, P_i denote the size of the partition for reduce task i, A_i^cpu denote the CPU allocation for reduce task i, and A_i^mem denote the memory allocation for reduce task i. The performance model can be stated as follows:

T_i = (α_5 + β_5 P_i + ζ_5 D)(ξ_5 + γ_5/A_i^cpu + η_5/A_i^mem)
    = α_5ξ_5 + α_5γ_5/A_i^cpu + α_5η_5/A_i^mem + β_5ξ_5 P_i + β_5γ_5 P_i/A_i^cpu + β_5η_5 P_i/A_i^mem + ζ_5ξ_5 D + ζ_5γ_5 D/A_i^cpu + ζ_5η_5 D/A_i^mem
    = λ_1 + λ_2/A_i^cpu + λ_3/A_i^mem + λ_4 P_i + λ_5 P_i/A_i^cpu + λ_6 P_i/A_i^mem + λ_7 D + λ_8 D/A_i^cpu + λ_9 D/A_i^mem    (7)

where λ_1, λ_2, λ_3, λ_4, λ_5, λ_6, λ_7, λ_8 and λ_9 are the coefficients to be solved using nonlinear regression. In practice, we may leverage historical records of job executions to provide input to the regression algorithm. This is reasonable in production environments as many jobs are executed routinely in today's production data centers. Specifically, the historical profiles are generated by varying the CPU allocation A_i^cpu ∈ {1 vcore, 2 vcores, ..., 8 vcores}, the memory allocation A_i^mem ∈ {1 GB, 2 GB, ..., 4 GB}, and the input dataset D_set = {5 GB, 30 GB} for different jobs. We then capture a tuple (T_i, P_i, A_i^cpu, A_i^mem, D) for each reduce task of the job. Using the tuples of all reduce tasks as training data, we can easily learn the coefficient factors in the performance model for each job. In the end, we produce one performance model M_j (i.e. job profile) for each job j that can be used as an input for scheduling. Note that, if no job profile is available, DREAMS resorts to the default container allocation scheme (i.e. uniform container size for all the reduce tasks). Finally, we would like to mention that while our performance model focuses on CPU and memory allocations, we believe our model can be extended to handle the case where other resources become the performance bottleneck by having additional terms in our performance model.

4.3 Resource Allocation Algorithm

Once the performance model has been trained and the partition sizes have been predicted, the scheduler is ready to find the optimal resource allocation for each reduce task so as to mitigate the run-time variation caused by partitioning skew. Here, our strategy is to equalize the running time of all reduce tasks. As mentioned in Section 4.2, task duration is a monotonically increasing function of partition size. Thus, we consider the duration of the task with the average partition size (P_avg) as a baseline, denoted as T_base, which can be obtained according to Equation 7 with P_avg and the default CPU and memory allocations configured in YARN.² Then we increase the resources allocated to the reduce tasks with larger partition sizes to make them run no slower than T_base. We observed that there is also no need to allocate too many resources to large reduce tasks to make them run faster than T_base. Thus we wish to find the minimum CPU and memory allocations enabling the slower reduce tasks to meet the baseline T_base. They can be calculated using a variation of Equation 7 introduced in Section 4.2, where P_i, D and T_base are known:

T_base = λ_1 + λ_2/A_i^cpu + λ_3/A_i^mem + λ_4 P_i + λ_5 P_i/A_i^cpu + λ_6 P_i/A_i^mem + λ_7 D + λ_8 D/A_i^cpu + λ_9 D/A_i^mem    (8)

We can present Equation 8 in the following form:

C_1 + C_2/A_i^cpu + C_3/A_i^mem = 0    (9)

where C_1 = λ_1 + λ_4 P_i + λ_7 D - T_base, C_2 = λ_2 + λ_5 P_i + λ_8 D and C_3 = λ_3 + λ_6 P_i + λ_9 D. Evidently, C_1, C_2 and C_3 are constants derived from known values. Since there are two variables (A_i^cpu, A_i^mem) to be solved using only one equation, more than one root can be obtained. In other words, there can be many possible CPU and memory combinations that will yield the same completion time, T_base.
Hence, we formulate this resource allocation problem as a constrained optimization problem:

min_{x_i, y_i}  f(x_i, y_i) = x_i + ω y_i
s.t.  C_1 + C_2/x_i + C_3/y_i = 0
      Cap_cpu > x_i ≥ 1,  Cap_mem > y_i ≥ 1,  i ∈ [1, N]   (10)

where x_i = A_i^cpu, y_i = A_i^mem, and Cap_cpu and Cap_mem are the capacities of the workers in terms of CPU and memory, respectively. We define the objective function as the weighted sum of CPU and memory resources, x_i + ω y_i, where the factor ω represents the weight of memory relative to CPU. We can assign a higher weight to a bottleneck resource that has lower availability. For instance, if CPU is scarce in the cluster but memory is not, CPU becomes more expensive compared to memory; in this case, increasing the weight of CPU can improve the schedulability of tasks, thereby improving resource utilization. ω depends on the capacity and the run-time resource availability of the cluster. How to tune ω is out of the scope of this work; we use ω = 1 in this paper. Since the objective function is linear but the constraint is not, we use Lagrange multipliers to solve this problem.

2. Here, since P_i can be predicted by the partition size prediction model, P_avg can easily be obtained. The default CPU and memory allocations to a container in YARN are 1 vCore and 1 GB, respectively.

Algorithm 1 Resource allocation algorithm
Input: δ - threshold for stopping training of the partition size prediction model; M_j - reduce phase performance model of job j; μ_cpu, μ_mem - maximum allowable allocations of CPU and memory.
Output: C - set of resource allocations (A_i^cpu, A_i^mem) for the reduce tasks.
1: (S, F) ← handlePartitionReport()
2: if CompletedMap percentage ≥ δ then
3:     Set<P_i> ← predictPartition()
4:     D ← Σ_{i=1}^{N} P_i
5:     P_avg ← Avg(Set<P_i>)
6:     T_base ← predictDuration(P_avg, D, A^cpu_default, A^mem_default, M_j)
7:     for each reduce task i ∈ [1, N] do
8:         (A_i^cpu, A_i^mem) ← findOptimalAlloc(P_i, D, T_base, M_j)
9:         A_i^cpu ← min(A_i^cpu, μ_cpu)
10:        A_i^mem ← min(A_i^mem, μ_mem)
11:        C ← C ∪ {(A_i^cpu, A_i^mem)}
12:    end for
13: end if
14: return C
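A minimal sketch of the findOptimalAlloc step in Line 8 of Algorithm 1, under the assumption that C_1 < 0 and C_2, C_3 > 0 (which holds for a task slower than the baseline): it evaluates the positive root that satisfies the constraint of Equation 9 while minimizing x_i + ω y_i. This is our own illustration, not DREAMS's implementation; a real scheduler would additionally floor the result and clamp it to μ_cpu and μ_mem as in Lines 9-10.

```python
import math

# Sketch (our own, not DREAMS's code): closed-form allocation meeting the
# baseline T_base with minimum x + w*y, for constants C1 < 0 and C2, C3 > 0
# of Equation 9 (C1 + C2/x + C3/y = 0).

def find_optimal_alloc(C1, C2, C3, w=1.0):
    s = math.sqrt(w * C2 * C3)
    x = -(C2 + s) / C1            # CPU allocation (vCores)
    y = -(w * C3 + s) / (w * C1)  # memory allocation (GB)
    return x, y
```

For example, C_1 = -5, C_2 = 4, C_3 = 9 and ω = 1 give (x, y) = (2, 3), and indeed C_1 + C_2/2 + C_3/3 = 0.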
Accordingly, we obtain the Lagrangian L(x_i, y_i, ϕ) as follows:

L(x_i, y_i, ϕ) = x_i + ω y_i + ϕ (C_1 + C_2/x_i + C_3/y_i)   (11)

Then we differentiate L(x_i, y_i, ϕ) partially with respect to x_i, y_i and ϕ, and obtain:

∂L/∂x_i = 1 − ϕ C_2/x_i^2 = 0
∂L/∂y_i = ω − ϕ C_3/y_i^2 = 0   (12)
∂L/∂ϕ = C_1 + C_2/x_i + C_3/y_i = 0

Solving these equations simultaneously, we get:

x_i = −(C_2 ± √(ω C_2 C_3))/C_1,   y_i = −(ω C_3 ± √(ω C_2 C_3))/(ω C_1)   (13)

The details of our resource allocation mechanism are shown in Algorithm 1. NodeManagers periodically send partition size reports to the ApplicationMaster along with heartbeat messages. As shown in Line 1, the ApplicationMaster handles each partition size report and collects the partition size statistics (S, F). Once the percentage of completed map tasks reaches the threshold δ, we start to predict the partition sizes and adjust the allocation for each reduce task, as shown in Lines 2-13. For the partition size prediction, we predict the partition size of each reduce task using the model presented in Section 4.1. With respect to the resource allocation, we compute the optimal combination of CPU and memory (A_i^cpu, A_i^mem) using Lagrange multipliers. More specifically, we first calculate the execution time T_base, which represents the time it takes to complete the task with the average partition size P_avg under the default resource allocation (A^cpu_default, A^mem_default)^3, according to Equation 8. After that, we set T_base as the target for each reduce task, and calculate the resource tuple (A_i^cpu, A_i^mem) by solving Equation 13 and taking the floor of the positive root. Because nodes have finite resource capacities in terms of CPU and memory (e.g., the default settings for the maximum CPU and memory allocation to a container in YARN are 8 vCores and 8 GB, respectively), both A_i^cpu and A_i^mem should be less than the physical capacities Cap_cpu and Cap_mem, respectively. Besides, from our experience, once the resource allocation to a task reaches a threshold, increasing the allocation does not improve the execution time further; rather, it results in resource wastage, as shown in Section 4.2. We therefore require A_i^cpu and A_i^mem to be less than the thresholds μ_cpu and μ_mem, respectively, which are inputs to our algorithm.

TABLE 1: Benchmark characteristics

Application | Domain | Dataset Type | Input Size Small (GB) | Skewness (%) | #Map, #Reduce | Input Size Large (GB) | Skewness (%) | #Map, #Reduce
WordCount | text retrieval | Wikipedia | - | - | -, 64 | - | - | -, 64
BigramCount | text retrieval | Wikipedia | - | - | -, 64 | - | - | -, 64
Pairs | text retrieval | Wikipedia | - | - | -, 64 | - | - | -, 64
RelativeFreq | text retrieval | Wikipedia | - | - | -, 64 | - | - | -, 64
InvertedIndex | web search | Wikipedia | - | - | -, 64 | - | - | -, 64
AdjList | web search | GraphGenerator | - | - | -, 64 | - | - | -, 64
KMeans | machine learning | Netflix | - | - | -, 6 | - | - | -, 6
Classification | machine learning | Netflix | - | - | -, 6 | - | - | -, 6
DataJoin | database | RandomTextWriter | - | - | -, 64 | - | - | -, 64
SelfJoin | database | Synthetic | - | - | -, 64 | - | - | -, 64
Sort | others | RandomWriter | - | - | -, 64 | - | - | -, 64
Histo-movies | others | Netflix | - | - | -, 8 | - | - | -, 8

5 EVALUATION

We have implemented DREAMS on Hadoop YARN 2.4.0 as an additional feature.
We deployed DREAMS on a real Hadoop cluster with 21 virtual machines (VMs) in the SAVI Testbed [25]. The SAVI Testbed is a virtual infrastructure managed by OpenStack [26] using the Xen [27] virtualization technique. Each VM has four 2 GHz cores, 8 GB RAM and 8 GB of hard disk. We use one VM as the ResourceManager and NameNode, and the remaining 20 VMs as workers. Each worker is configured with 8 virtual cores and 7 GB RAM (leaving 1 GB for background processes). The HDFS block size is set to 64 MB, and the replication level is set to 3. The CgroupsLCEResourcesHandler configuration is enabled, and we also activate map output compression^4. We use the CapacityScheduler to schedule containers in YARN. In the guest OS, we configure CGroups (Control Groups) and CFQ (Completely Fair Queueing) for scheduling CPU and disk I/O among processes, respectively. We evaluate our approach using a wide range of applications covering text retrieval, web search, machine learning and database domains, among others.

3. The default CPU and memory allocations to a container are 1 vCore and 1 GB, respectively.
4. Using compression in Hadoop to optimize MapReduce performance is prevalent in industry and academia [28], [29].

These applications are listed below:

1) Text Retrieval

WordCount (WC): WordCount computes the occurrence frequency of each word in a corpus. We use Wikipedia data as the input dataset.

BigramCount (BC): Bigrams are sequences of two consecutive words. BigramCount computes the occurrence frequency of bigrams in a corpus. We use the implementation in Cloud9 [30] and Wikipedia data as the input dataset.

Pairs (PS): Pairs is a design pattern introduced in [31]. Using this design pattern, PS computes the word co-occurrence matrix of a corpus. We use the implementation in Cloud9 and Wikipedia data as the input dataset.

RelativeFrequency (RF): Relative Frequencies is introduced in [31]. It measures the proportion of time word w_j appears in the context of word w_i, also denoted F(w_j | w_i). We use the implementation in Cloud9 and Wikipedia data as the input dataset.
2) Web Search

InvertedIndex (II): It takes a list of documents as input and generates a word-to-document index for these documents. We use Wikipedia data as the input dataset.

AdjacencyList (AL): It generates the adjacency list of a graph. The graph is represented by a set of edges, which is generated by a graph generator. We use the implementation and the input dataset provided by the PUMA benchmarks [14].

3) Machine Learning

KMeans (KM): This application classifies movies based on their ratings, using the Netflix movie rating data. We use the starting values of the cluster centroids provided by PUMA and run one iteration.

Classification (CF): It classifies the movies into one of k pre-determined clusters. Similar to KMeans, we use the starting values of the cluster centroids provided by PUMA, and use the Netflix movie rating data.

4) Database

DataJoin (DJ): It combines text files based on a designated key. The text dataset is generated by RandomTextWriter, and the first word of each line in the files serves as the join key. We have modified the original RandomTextWriter and used a Zipf 0.5 distribution to skew the input data.

SelfJoin (SJ): This application is introduced in PUMA. It generates the (k+1)-sized associations given a set of k-sized associations. We use the implementation of this application as well as the synthetic dataset in PUMA.

5) Others

Sort (SRT): This application sorts sequence files generated by Hadoop RandomWriter. Similar to [2], we have modified RandomWriter to produce non-uniformly distributed data.

Histogram-movies (HM): This application bins movies into 8 bins based on their average ratings. We use the implementation of this application in PUMA.

Table 1 gives an overview of these benchmarks with the configurations used in our experiments. The skewness of the workload among reduce tasks is measured by the coefficient of variation (CV), stdev/mean, which is used as a fairness metric in the literature [32]. The larger this ratio, the more skew is present in the distribution of the workload among reduce tasks. In order to better demonstrate the skew mitigation, we do not use a combiner function in our benchmarks. We present the results of running these jobs in the following sections.

Fig. 7: Prediction accuracy with different threshold δ: (a) Small dataset, (b) Large dataset

Fig. 8: Job completion time with different threshold δ (Sort, KMeans and Histogram-movies)

5.1 Accuracy of Prediction of Partition Size

In this set of experiments, we validate the accuracy of the partition size prediction model.
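The coefficient of variation used above as the skewness metric can be sketched as follows; whether the population or the sample standard deviation is intended is not stated in the text, so the population form is an assumption here.

```python
import statistics

# Sketch: the skewness metric used in Table 1, CV = stdev / mean of the
# per-reducer workload sizes. Population stdev is assumed here (the paper
# does not specify sample vs. population).

def coefficient_of_variation(partition_sizes):
    mean = statistics.fmean(partition_sizes)
    return statistics.pstdev(partition_sizes) / mean
```

A perfectly balanced workload yields CV = 0; the more uneven the partition sizes, the larger the ratio.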
To this end, we execute MapReduce jobs on different datasets, and compute the mean absolute percentage error (MAPE) over all partitions in each scenario. The MAPE is defined as follows:

MAPE = (1/N) Σ_{i=1}^{N} |P_i^pred − P_i^measured| / P_i^measured   (14)

where N is the number of reduce tasks in a job, and P_i^pred and P_i^measured are the predicted and measured partition sizes of reduce task i, respectively. Table 2 summarizes the MAPE for the benchmarks with threshold δ = 0.05 on two different datasets.

TABLE 2: Mean absolute percentage error of the partition size prediction model on the Small and Large datasets

Application | MAPE on Small dataset | MAPE on Large dataset
WordCount | 5.34% | 3.94%
BigramCount | 8.67% | 7.25%
Pairs | 6.16% | 4.31%
RelativeFrequency | 6.73% | 5.75%
InvertedIndex | 3.69% | 3.4%
AdjList | 11.36% | 1.1%
KMeans | 8.56% | 4.13%
Classification | 5.29% | 3.17%
DataJoin | 5.6% | 2.8%
SelfJoin | 1.23% | 0.63%
Sort | 6.32% | 5.34%
Histogram-movies | 0.47% | 0.35%

It can be seen that the error rates for most of the MapReduce applications are less than 5%. AdjList reaches the highest error rate, at 11.36%. Furthermore, Figure 7 illustrates the impact of different values of δ on prediction accuracy. It is clear that as δ increases, the prediction accuracy improves, because the number of training samples grows with δ. When δ = 0.15, the prediction error is below 6% for all testing applications. Generally speaking, increasing the sample size can improve accuracy at the cost of increased overhead: in DREAMS, the larger the sample size used, the longer DREAMS has to wait for the completion of map tasks before predicting the partition sizes^5. However, we observed that as δ increases, the overhead in terms of job completion time does not necessarily become larger. Figure 8 shows the job completion times for different values of δ. As shown in Figure 8, for reduce-intensive jobs such as Sort and KMeans, there is a sweet spot where the job completion time is lowest; for map-intensive jobs such as Histogram-movies, little difference can be observed.
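Equation 14 translates directly into code; the function name below is our own.

```python
# Sketch: the MAPE of Equation 14 over predicted vs. measured partition
# sizes, expressed as a percentage.

def mape_percent(predicted, measured):
    return 100.0 / len(measured) * sum(
        abs(p - m) / m for p, m in zip(predicted, measured))
```

For instance, predictions of 110 and 90 against measured sizes of 100 and 100 give a MAPE of 10%.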
The reason is that overlapping the map and reduce phases lets reduce tasks start shuffling data earlier, but it also wastes resources while the map tasks' output rate is smaller than the bandwidth. Tang et al. [33] proposed a solution to find the best timing for starting the reduce phase; this is out of the scope of this paper. We use δ = 0.05 in the following experiments.

5. The computational overhead is negligible, because the maximum number of samples is in the hundreds in our experiments.

Fig. 9: Fitting results for the reduce phase performance model: (a) WordCount, (b) Pairs, (c) InvertedIndex, (d) Sort

5.2 Accuracy of Reduce Task Performance Model

In order to formally evaluate the accuracy and workload independence of the performance model, we compute the prediction error using different datasets. That is, we train and test our model on samples from both the Small and Large datasets. Figure 9 shows typical results in terms of the goodness of fit of the performance model; similar results can be observed for the other applications. To make the demonstration clearer, we sort the experiment results by value in ascending order. The "+" marks represent the measured task durations, and the solid line represents the values fitted by the performance model. We also perform two validations [34] to study the prediction accuracy of the model:

Resubstitution Method - All the available data is used for training as well as testing. That is, we compute the predicted reduce task duration for each tuple (P_i, Alloc_cpu, Alloc_mem, D) using the performance model learned from the training dataset, and then compute a prediction error;

K-fold Cross-validation - The available data is divided into K disjoint subsets, 1 ≤ K ≤ m, where m is the total number of available samples. The prediction accuracy is evaluated by the average of the separate errors, (1/K) Σ_{i=1}^{K} Error_i.
For each of the K sub-validations, (K − 1) subsets are used for training and the remaining one for testing. Here, we choose K = 10.

TABLE 3: Mean absolute percentage error of the reduce phase performance model

Application | Resubstitution Method | K-fold Cross-validation
WordCount | 13.13% | 13.45%
BigramCount | 1.26% | 1.98%
Pairs | - | -
RelativeFrequency | 13.3% | 14.91%
InvertedIndex | 12.97% | 13.7%
AdjList | 15.45% | 18.2%
KMeans | 12.52% | 15.13%
Classification | 4.61% | 7.58%
DataJoin | 7.84% | 14.9%
SelfJoin | 9.8% | -
Sort | 1.95% | 11.46%
Histogram-movies | 11.14% | 14.46%

For both validations, we use the MAPE to evaluate the accuracy:

MAPE = (1/m) Σ_{l=1}^{m} |T_l^pred − T_l^measured| / T_l^measured   (15)

where m is the number of testing samples. Table 3 summarizes the MAPE of the reduce task performance model for our testing workloads. With regard to the Resubstitution Method, the prediction error for all of the workloads is less than 15.45%. For the K-fold Cross-validation, the prediction error is slightly higher than in the Resubstitution validation, but is still less than 18.2%. For some applications, such as AdjList, the prediction error is relatively high; overall, however, the prediction error is less than 15% for most of the applications. Lastly, tuning the parameters of the performance model by continuously training on newly arriving data may further improve the accuracy; we consider this as future work.

5.3 Job Completion Time

In this section, we validate how well DREAMS can mitigate skew. We compare DREAMS against 1) Hadoop YARN 2.4.0; 2) the speculation-based straggler mitigation approach (LATE), which launches speculative tasks for slower tasks; 3) the repartition-based skew mitigation approach (SkewTune), which repartitions the unprocessed workload of slower tasks at run-time; and 4) Hadoop 0.21.0 with slot isolation (MRv1 ISO). To the best of our knowledge, in addition to SkewTune, many other state-of-the-art solutions, such as LEEN [6] and TopCluster [9], are implemented on top of MRv1, which is slot-based and has no isolation between slots.
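The K-fold procedure of Section 5.2 can be sketched as follows. The one-parameter model fitted here is a deliberately trivial stand-in for the reduce-phase performance model, and all names are our own.

```python
# Sketch of the K-fold cross-validation described above. The "model" is a
# trivial stand-in: a single scale factor fitted by least squares on
# (input, duration) pairs; DREAMS would fit the full reduce-phase
# performance model on each training split instead.

def fit_scale(train):
    # least-squares a minimizing sum((a*x - t)^2) over (x, t) pairs
    return sum(x * t for x, t in train) / sum(x * x for x, _ in train)

def kfold_mape_percent(samples, K):
    folds = [samples[i::K] for i in range(K)]
    per_fold = []
    for i in range(K):
        held_out = folds[i]
        train = [s for j in range(K) if j != i for s in folds[j]]
        a = fit_scale(train)
        per_fold.append(100.0 / len(held_out) *
                        sum(abs(a * x - t) / t for x, t in held_out))
    return sum(per_fold) / K  # average of the K per-fold errors
```

On data the model can fit exactly, the cross-validated error is zero; in practice each held-out fold reports the error of a model trained without it.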
In order to fairly compare DREAMS against SkewTune, we implemented isolation between slots in Hadoop 0.21.0 and installed SkewTune on top of MRv1 ISO. We configure each worker with 6 map slots and 2 reduce slots when running SkewTune and MRv1 ISO. Note that tuning the number of reduce tasks of a MapReduce job can improve the job completion time [35]. To isolate this effect, we use the same number of reduce tasks in the corresponding experiments when comparing job completion times.

Figure 10 shows the comparison among YARN, LATE, SkewTune, MRv1 ISO and DREAMS with regard to job completion time. We can see from the figure that DREAMS outperforms the other skew mitigation strategies. In particular, DREAMS achieves speedups of 2.29, 1.93, 1.42, 1.34, 1.31, 1.29 and 1.26 over YARN for Pairs, RelativeFreq, Sort, DataJoin, WordCount, InvertedIndex and KMeans, respectively. Compared to the other mitigation strategies, DREAMS achieves improvements of up to 1.85 and 1.65 over LATE and SkewTune, respectively. We also observed that DREAMS cannot improve the job completion time for SelfJoin and AdjList. This is because the skewness in these jobs is low, leaving little room for DREAMS and the other mitigation strategies to improve. Since DREAMS only adjusts the resource allocation of reduce tasks, for jobs such as Classification and Histogram-movies, in which the reduce phase lasts only a few seconds, no improvement in job completion time can be observed.

Fig. 10: Comparison of job completion time: (a) Small dataset, (b) Large dataset

Fig. 11: Execution timeline for 5G Pairs: (a) YARN, (b) LATE, (c) SkewTune, (d) DREAMS

Fig. 12: Comparison of the makespan variance of reduce tasks: (a) Small dataset, (b) Large dataset
In order to understand the reason behind the improvement of DREAMS, Figure 11 shows the execution timelines of running 5G Pairs with YARN, LATE, SkewTune and DREAMS, respectively. As shown in Figure 11a, several large reduce tasks take much longer than the other reducers and dominate the completion time of the job. In comparison, LATE executes replica tasks for these large reduce tasks using free resources, which can accelerate them. However, since a replica task processes the same amount of work as the original task, the improvement is not significant. SkewTune splits the unprocessed work of stragglers at runtime, and launches new jobs (called migration jobs) to process it. As we can see in Figure 11c, three additional jobs are launched to process the long-lasting reduce tasks. Hence, overloaded tasks are processed using more cluster resources, which reduces their execution times. Note that the execution times of the stragglers are dominated by the completion times of the corresponding migration jobs. However, the overhead of repartitioning running tasks is not small. As reported in [4], approximately 30 s of overhead is incurred for reduce skew mitigation, and SkewTune does not perform skew mitigation for tasks with remaining time less than 2w, where w is on the order of 30 s. As a result, for small jobs that complete within tens of seconds, or jobs with small skewness, SkewTune cannot improve the job completion time. In contrast, DREAMS predicts the partition size of each reduce task at runtime and proactively allocates more resources to overloaded reducers. This reduces the durations of overloaded tasks, thereby accelerating job completion with negligible overhead. As shown in Figure 11d, the running times of the large reducers are significantly improved.

We also compare the makespan variance of reduce tasks in DREAMS against the other solutions. As we stated earlier, DREAMS is designed to reduce the run-time difference among reduce tasks with different loads, thereby shortening the job completion time. Figure 12 shows the comparison results with respect to the coefficient of variation (CV) of reduce task durations for our benchmarks. The graphs show that DREAMS effectively reduces the makespan variance of reduce tasks. More specifically, the highest reduction ratios reach 2.47, 1.84 and 2.23 over YARN, LATE and SkewTune, respectively.
Since the shuffle phase of the reduce stage overlaps with the entire map stage, there is no need to count the time the shuffle phase spends waiting for the output of map tasks towards the makespan. Here, we compare the durations of reduce tasks starting from the completion of the last map task. ARIA [12] likewise takes only the non-overlapping portion of the shuffle into account, and Chowdhury et al. [36] also define the beginning of the shuffle phase as the point when either the last map task finishes or the last reduce task starts.

6 DISCUSSION

The concept of dynamic container size adjustment used in DREAMS is not restricted to MapReduce; it can be applied to other large-scale programming models such as Spark [37] and Storm [38] as well. Take Spark as an example: a Spark job consists of a number of tasks organized as a DAG (Directed Acyclic Graph). These tasks are scheduled onto a number of Spark executors and executed in a distributed manner. Each Spark executor runs in a container on top of a resource management platform (e.g., YARN or Mesos [39]). If some executors have more workload to process, dynamically adjusting the container size based on the resource requirements and workload characteristics may be beneficial there as well.

Nevertheless, there are limitations in DREAMS's current design. First, DREAMS can only adjust CPU and memory for a container in our current implementation; it relies on TCP fairness and Completely Fair Queueing to fairly share the network bandwidth and disk I/O, respectively. Hence, DREAMS may not be able to give a precise estimation of the task execution time in a highly dynamic environment. Nonetheless, our performance model works well in DREAMS: it roughly estimates the execution time of a reduce task based on historical data, and in turn helps DREAMS determine how much resource should be allocated to the task. By allocating more resources to the reduce tasks with more workload, the execution of these tasks can be accelerated, and therefore the job completion time can be improved.
We would like to extend DREAMS to take network bandwidth and disk I/O into account in future work. One interesting idea is to integrate the management of containers' network and disk I/O resources into YARN using CGroups; note that CGroups currently supports isolating network and disk I/O between processes. This deserves further research. Second, there may be applications for which DREAMS is not applicable, for example applications that contain computational skew [7] in their reduce functions. Computational skew refers to the case where the task running time depends on the content of the input rather than its size. For this kind of application, DREAMS resorts to YARN in the current design. One straightforward extension is to monitor the resource usage and progress of tasks at run-time, and then adjust their allocations dynamically. In this way, skewed tasks could be accelerated in a more generic manner.

7 RELATED WORK

The partitioning skew problem in MapReduce has been extensively investigated recently. The authors of [5] and [6] define a cost model for assigning reduce keys to reduce tasks so as to balance the load among reduce tasks. However, both approaches have to wait for the completion of all the map tasks. Ramakrishnan et al. [7] and Yan et al. [19] propose to sample the partition sizes before executing the actual job to estimate the intermediate data distribution, and then partition the data to balance the load across all reducers. However, the additional sampling phase can be time-consuming. Similarly, Kolb et al. [40] propose two approaches, BlockSplit and PairRange, to handle data skew for entity resolution based on MapReduce. However, both approaches have to run an additional MapReduce job to generate the block distribution matrix (BDM). Gufler et al. [9] and Chen et al. [10] propose to aggregate selected statistics of the key-value pairs (e.g., the top k keys). Their solutions reduce the overhead of estimating the reducers' workload, but still have to wait for the completion of all the map tasks.
SkewTune [4] repartitions heavily skewed partitions at runtime to mitigate skew. However, it imposes an overhead for repartitioning data and concatenating the final outputs. Compared to SkewTune, our solution dynamically allocates resources to reduce tasks and equalizes the reduce tasks' completion times, which is simpler and incurs no repartitioning overhead. There is also related work on culling stragglers in MapReduce. LATE [41] speculatively executes a replica of a task that is progressing slowly. However, executing a redundant copy of a data-skewed task may waste resources, since the duplicate task still processes the same amount of data. Mantri [42] culls stragglers based on their causes. With respect to data skew, Mantri schedules tasks in descending order of their input sizes to mitigate skew, which is complementary to DREAMS. Wrangler [43] predicts the status of worker nodes based on their runtime resource usage statistics, and then selectively delays the execution of tasks if a node is predicted to create a straggler. However, Wrangler neglects that straggling can also be caused by the task itself; partitioning skew is one such example.

Resource-aware scheduling has received considerable attention in recent years. To address the limitation of the slot-based resource allocation scheme in the first version of Hadoop, YARN [15] represents a major endeavor towards resource-aware scheduling in MapReduce. It offers the ability to specify the size of a container. However, YARN assumes the resource consumption of each map (or reduce) task in a job is identical, which is not true for data-skewed MapReduce jobs. Sharma et al. propose MROrchestrator [16], a MapReduce resource framework that can identify resource bottlenecks and resolve them through run-time resource allocation. However, MROrchestrator neglects the workload imbalance among tasks and cannot mitigate partitioning skew. Several other proposals fall into other categories of resource scheduling policies, such as [12], [18], [44], [45]. The main focus of those approaches is on adjusting the resource allocation, in terms of the number of map and reduce slots, for jobs in order to achieve fairness, maximize resource utilization or meet job deadlines. These, however, do not address the data skew problem.

8 CONCLUSION

In this paper, we presented DREAMS, a framework for run-time partitioning skew mitigation.
Unlike previous approaches that try to balance the reducers' workload by repartitioning the workload assigned to each reduce task, in DREAMS we cope with partitioning skew by adjusting the run-time resource allocation to reduce tasks. Specifically, we first developed an online partition size prediction model that can estimate the partition size of each reduce task at runtime. We then presented a reduce task performance model that correlates run-time resource allocation and the size of the reduce task with task duration. In our experiments using a 21-node cluster running both real and synthetic workloads, we showed that both our partition size prediction model and our task performance model achieve high accuracy in most cases (with the highest prediction errors at 11.36% and 18.2%, respectively). We also demonstrated that DREAMS can effectively mitigate the negative impact of partitioning skew while incurring negligible overhead, thereby improving the job running time by up to a factor of 2.29 and 1.65 in comparison to native Hadoop YARN and the state-of-the-art solution, respectively.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China (No. ), and in part by the Smart Applications on Virtual Infrastructure (SAVI) project funded under the National Sciences and Engineering Research Council of Canada (NSERC) Strategic Networks grant number NETGP.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] Apache Hadoop YARN, current/hadoop-yarn/hadoop-yarn-site/YARN.html.
[3] N. Zacheilas and V. Kalogeraki, "Real-time scheduling of skewed MapReduce jobs in heterogeneous environments," in Proceedings of the 11th International Conference on Autonomic Computing. USENIX, 2014.
[4] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: mitigating skew in MapReduce applications," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.
[5] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Handling data skew in MapReduce," in Proceedings of the 1st International Conference on Cloud Computing and Services Science, vol. 146, 2011.
[6] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu, "Handling partitioning skew in MapReduce using LEEN," Peer-to-Peer Networking and Applications, vol. 6, no. 4, 2013.
[7] S. R. Ramakrishnan, G. Swart, and A. Urmanov, "Balancing reducer skew in MapReduce workloads using progressive sampling," in Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012, p. 16.
[8] Y. Le, J. Liu, F. Ergun, and D. Wang, "Online load balancing for MapReduce with skewed data input," in INFOCOM, 2014 Proceedings IEEE. IEEE, 2014.
[9] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Load balancing in MapReduce based on scalable cardinality estimates," in ICDE 2012, April 2012.
[10] Q. Chen, J. Yao, and Z. Xiao, "Libra: Lightweight data skew mitigation in MapReduce," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 9, 2015.
[11] L. Cheng, Q. Zhang, and R. Boutaba, "Mitigating the negative impact of preemption on heterogeneous MapReduce workloads," in Proceedings of the 7th International Conference on Network and Services Management. International Federation for Information Processing, 2011.
[12] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in Proceedings of the 8th ACM International Conference on Autonomic Computing. ACM, 2011.
[13] Z. Liu, Q. Zhang, M. F. Zhani, R. Boutaba, Y. Liu, and Z. Gong, "DREAMS: Dynamic resource allocation for MapReduce with data skew," in IM TechSessions, Ottawa, Canada, May 2015.
[14] F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, "PUMA: Purdue MapReduce benchmarks suite," 2012.
[15] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth et al., "Apache Hadoop YARN: Yet another resource negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing. ACM, 2013, p. 5.
[16] B. Sharma, R. Prabhakar, S. Lim, M. T. Kandemir, and C. R. Das, "MROrchestrator: A fine-grained resource orchestration framework for MapReduce clusters," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012.
[17] Fair scheduler, hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
[18] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in NSDI, vol. 11, 2011.
[19] W. Yan, Y. Xue, and B. Malin, "Scalable and robust key group size estimation for reducer load balancing in MapReduce," in Big Data, 2013 IEEE International Conference on. IEEE, 2013.
[20] M. Hammoud, M. S. Rehman, and M. F. Sakr, "Center-of-gravity reduce task scheduling to lower MapReduce network traffic," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012.
[21] D. Borthakur, "The Hadoop distributed file system: Architecture and design," Hadoop Project Website, vol. 11, p. 21, 2007.
[22] Z. Zhang, L. Cherkasova, and B. T. Loo, "Benchmarking approach for designing a MapReduce performance model," in Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering. ACM, 2013.
[23] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data analytics," in CIDR, vol. 11, 2011.

[24] V. Jalaparti, H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron, "Bridging the tenant-provider gap in cloud services," in Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012, p. 10.
[25] "Smart Applications on Virtual Infrastructure (SAVI)," savinetwork.ca.
[26] "OpenStack cloud operating system," openstack.org.
[27] "Xen project," xenproject.org.
[28] "Microsoft white papers: Compression in Hadoop," technet.microsoft.com/en-us/library/dn….aspx.
[29] Y. Chen, A. Ganapathi, and R. H. Katz, "To compress or not to compress: Compute vs. I/O tradeoffs for MapReduce energy efficiency," in Proceedings of the First ACM SIGCOMM Workshop on Green Networking. ACM, 2010.
[30] J. Lin, "Cloud 9: A MapReduce library for Hadoop," 2010.
[31] J. Lin and C. Dyer, "Data-intensive text processing with MapReduce," Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, 2010.
[32] R. Jain, D.-M. Chiu, and W. R. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer system."
[33] Z. Tang, L. Jiang, J. Zhou, K. Li, and K. Li, "A self-adaptive scheduling algorithm for reduce start time," Future Generation Computer Systems, vol. 43, pp. 51–60, 2015.
[34] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 2000.
[35] Z. Zhang, L. Cherkasova, and B. T. Loo, "AutoTune: Optimizing execution concurrency and resource usage in MapReduce workflows," in ICAC, 2013.
[36] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, "Managing data transfers in computer clusters with Orchestra," in ACM SIGCOMM Computer Communication Review, vol. 41, no. 4. ACM, 2011.
[37] "Apache Spark," spark.apache.org.
[38] "Apache Storm," storm.apache.org.
[39] "Apache Mesos," mesos.apache.org.
[40] L. Kolb, A. Thor, and E.
Rahm, "Load balancing for MapReduce-based entity resolution," in International Conference on Data Engineering, 2012.
[41] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in OSDI, vol. 8, no. 4, 2008, p. 7.
[42] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in MapReduce clusters using Mantri," in OSDI, vol. 10, no. 1, 2010, p. 24.
[43] N. J. Yadwadkar, G. Ananthanarayanan, and R. Katz, "Wrangler: Predictable and faster jobs using fewer resources," in Proceedings of the ACM Symposium on Cloud Computing. ACM, 2014.
[44] J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguadé, M. Steinder, and I. Whalley, "Performance-driven task co-scheduling for MapReduce environments," in Network Operations and Management Symposium (NOMS), 2010 IEEE. IEEE, 2010.
[45] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin, "FLEX: A slot allocation scheduling optimizer for MapReduce workloads," in Middleware 2010. Springer, 2010.

Qi Zhang received his B.A.Sc., M.Sc. and Ph.D. from the University of Ottawa (Canada), Queen's University (Canada) and the University of Waterloo (Canada), respectively. His current research focuses on resource management for cloud computing systems. He is currently pursuing a postdoctoral fellowship at the University of Toronto (Canada). He is also interested in related areas including big-data analytics, software-defined networking, network virtualization and management.

Reaz Ahmed received his Ph.D. in Computer Science from the University of Waterloo, Canada in 2007. His B.Sc. and M.Sc. degrees in Computer Science are from the Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh in 2000 and 2002, respectively. He is currently an Assistant Research Professor at the School of Computer Science in the University of Waterloo. His research interests include Network Virtualization, Network Function Virtualization, Software Defined Networking, Internet of Things and Future Internet Architectures.
Raouf Boutaba received the M.Sc. and Ph.D. degrees in computer science from the University Pierre and Marie Curie, Paris, France, in 1990 and 1994, respectively. He is currently a Professor of computer science with the University of Waterloo, Waterloo, ON, Canada. His research interests include control and management of networks and distributed systems. He is a fellow of the IEEE and the Engineering Institute of Canada.

Yaping Liu received the Ph.D. degree in computer science from the National University of Defense Technology, China, in 2006. She is currently a Professor in the School of Computer with the National University of Defense Technology. Her current research interests include network architecture, inter-domain routing, network virtualization and network security.

Zhihong Liu received his B.A.Sc. and M.Sc. degrees in computer science from the South China University of Technology and the National University of Defense Technology, respectively. He is a Ph.D. candidate at the National University of Defense Technology with research interests in big-data analytics and resource management in cloud computing. Currently, he is a visiting student at the University of Waterloo, Canada.

Zhenghui Gong received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1970. He is currently a Professor in the School of Computer with the National University of Defense Technology, Changsha, China. His research interests include computer networks and communication, network security and datacenter networking.


More information

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall SP 2005-02 August 2005 Staff Paper Department of Appled Economcs and Management Cornell Unversty, Ithaca, New York 14853-7801 USA Farm Savngs Accounts: Examnng Income Varablty, Elgblty, and Benefts Brent

More information

IMPACT ANALYSIS OF A CELLULAR PHONE

IMPACT ANALYSIS OF A CELLULAR PHONE 4 th ASA & μeta Internatonal Conference IMPACT AALYSIS OF A CELLULAR PHOE We Lu, 2 Hongy L Bejng FEAonlne Engneerng Co.,Ltd. Bejng, Chna ABSTRACT Drop test smulaton plays an mportant role n nvestgatng

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

To manage leave, meeting institutional requirements and treating individual staff members fairly and consistently.

To manage leave, meeting institutional requirements and treating individual staff members fairly and consistently. Corporate Polces & Procedures Human Resources - Document CPP216 Leave Management Frst Produced: Current Verson: Past Revsons: Revew Cycle: Apples From: 09/09/09 26/10/12 09/09/09 3 years Immedately Authorsaton:

More information

Period and Deadline Selection for Schedulability in Real-Time Systems

Period and Deadline Selection for Schedulability in Real-Time Systems Perod and Deadlne Selecton for Schedulablty n Real-Tme Systems Thdapat Chantem, Xaofeng Wang, M.D. Lemmon, and X. Sharon Hu Department of Computer Scence and Engneerng, Department of Electrcal Engneerng

More information

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems 1 Mult-Resource Far Allocaton n Heterogeneous Cloud Computng Systems We Wang, Student Member, IEEE, Ben Lang, Senor Member, IEEE, Baochun L, Senor Member, IEEE Abstract We study the mult-resource allocaton

More information

The Load Balancing of Database Allocation in the Cloud

The Load Balancing of Database Allocation in the Cloud , March 3-5, 23, Hong Kong The Load Balancng of Database Allocaton n the Cloud Yu-lung Lo and Mn-Shan La Abstract Each database host n the cloud platform often has to servce more than one database applcaton

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

Optimal Map Reduce Job Capacity Allocation in Cloud Systems

Optimal Map Reduce Job Capacity Allocation in Cloud Systems Optmal Map Reduce Job Capacty Allocaton n Cloud Systems Marzeh Malemajd Sharf Unversty of Technology, Iran malemajd@ce.sharf.edu Danlo Ardagna Poltecnco d Mlano, Italy danlo.ardagna@polm.t Mchele Cavotta

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

Self-Adaptive SLA-Driven Capacity Management for Internet Services

Self-Adaptive SLA-Driven Capacity Management for Internet Services Self-Adaptve SLA-Drven Capacty Management for Internet Servces Bruno Abrahao, Vrglo Almeda and Jussara Almeda Computer Scence Department Federal Unversty of Mnas Geras, Brazl Alex Zhang, Drk Beyer and

More information

An Integrated Dynamic Resource Scheduling Framework in On-Demand Clouds *

An Integrated Dynamic Resource Scheduling Framework in On-Demand Clouds * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 30, 1537-1552 (2014) An Integrated Dynamc Resource Schedulng Framework n On-Demand Clouds * College of Computer Scence and Technology Zhejang Unversty Hangzhou,

More information

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE Yu-L Huang Industral Engneerng Department New Mexco State Unversty Las Cruces, New Mexco 88003, U.S.A. Abstract Patent

More information

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy Fnancal Tme Seres Analyss Patrck McSharry patrck@mcsharry.net www.mcsharry.net Trnty Term 2014 Mathematcal Insttute Unversty of Oxford Course outlne 1. Data analyss, probablty, correlatons, vsualsaton

More information

An ILP Formulation for Task Mapping and Scheduling on Multi-core Architectures

An ILP Formulation for Task Mapping and Scheduling on Multi-core Architectures An ILP Formulaton for Task Mappng and Schedulng on Mult-core Archtectures Yng Y, We Han, Xn Zhao, Ahmet T. Erdogan and Tughrul Arslan Unversty of Ednburgh, The Kng's Buldngs, Mayfeld Road, Ednburgh, EH9

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and Ths artcle appeared n a journal publshed by Elsever. The attached copy s furnshed to the author for nternal non-commercal research and educaton use, ncludng for nstructon at the authors nsttuton and sharng

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

More information

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Research Note APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES * Iranan Journal of Scence & Technology, Transacton B, Engneerng, ol. 30, No. B6, 789-794 rnted n The Islamc Republc of Iran, 006 Shraz Unversty "Research Note" ALICATION OF CHARGE SIMULATION METHOD TO ELECTRIC

More information

A New Quality of Service Metric for Hard/Soft Real-Time Applications

A New Quality of Service Metric for Hard/Soft Real-Time Applications A New Qualty of Servce Metrc for Hard/Soft Real-Tme Applcatons Shaoxong Hua and Gang Qu Electrcal and Computer Engneerng Department and Insttute of Advanced Computer Study Unversty of Maryland, College

More information

DBA-VM: Dynamic Bandwidth Allocator for Virtual Machines

DBA-VM: Dynamic Bandwidth Allocator for Virtual Machines DBA-VM: Dynamc Bandwdth Allocator for Vrtual Machnes Ahmed Amamou, Manel Bourguba, Kamel Haddadou and Guy Pujolle LIP6, Perre & Mare Cure Unversty, 4 Place Jusseu 755 Pars, France Gand SAS, 65 Boulevard

More information

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model

More information

Performance Analysis of Energy Consumption of Smartphone Running Mobile Hotspot Application

Performance Analysis of Energy Consumption of Smartphone Running Mobile Hotspot Application Internatonal Journal of mart Grd and lean Energy Performance Analyss of Energy onsumpton of martphone Runnng Moble Hotspot Applcaton Yun on hung a chool of Electronc Engneerng, oongsl Unversty, 511 angdo-dong,

More information

A Performance Analysis of View Maintenance Techniques for Data Warehouses

A Performance Analysis of View Maintenance Techniques for Data Warehouses A Performance Analyss of Vew Mantenance Technques for Data Warehouses Xng Wang Dell Computer Corporaton Round Roc, Texas Le Gruenwald The nversty of Olahoma School of Computer Scence orman, OK 739 Guangtao

More information