International Journal of Future Generation Communication and Networking, pp. 391-398
http://dx.doi.org/10.14257/ijfgcn.2015.8.2.32

An Efficient Job Scheduling for MapReduce Clusters

Jun Liu¹, Tianshu Wu¹, MingWei Lin¹, and Shuyu Chen²
¹College of Computer Science, Chongqing University, Chongqing, China
²College of Software Engineering, Chongqing University, Chongqing, China
liujuncqcs@163.com, netmobilab@cqu.edu.cn, wutianshu@cqu.edu.cn, linmwcs@163.com

Abstract

Job scheduling for MapReduce clusters has received significant attention in recent years, because it plays an important role in cluster performance. Traditional job scheduling performs poorly in assigning tasks to appropriate nodes and cannot predict the resource utilization of unexecuted tasks. To address these problems, an efficient job scheduling for MapReduce clusters is proposed in this paper. The job scheduling introduces dynamic priority scheduling and a real-time prediction model. Dynamic priority scheduling introduces the minimum cost data locality algorithm with a weight to deal with jobs of different sizes, and the real-time prediction model can predict the resource utilization of unexecuted tasks by calculating that of the running tasks. The resource utilization covers CPU, memory, and network. Experimental results show that the proposed job scheduling performs well in MapReduce clusters.

Keywords: job scheduling, minimum cost data locality algorithm, dynamic priority scheduling

1. Introduction

Because it is extremely easy to program, fast, scalable, and fault-tolerant for a variety of applications, MapReduce [1-3] has been widely regarded as a promising alternative for large-scale data analysis such as graph processing, machine learning, and data mining. Applications submitted to MapReduce clusters are executed in the form of jobs, and each job contains a number of tasks. Every task is assigned to a node, generally called a slave node, by the task scheduler in the cluster. The task scheduler is one of the core technologies of MapReduce; it mainly controls the order of task execution and resource allocation.
In addition, it directly influences the performance of MapReduce clusters and the execution time of tasks with different priorities. Therefore, an appropriate task scheduling is very important for MapReduce clusters. MapReduce itself provides three main task scheduling algorithms: first-in-first-out (FIFO), capacity scheduling [4], and fair scheduling [5]. The FIFO algorithm is the built-in scheduler in MapReduce clusters. Its advantages are simplicity and ease of implementation, because it deals with jobs in first-in-first-out order: the older job is handled first, and the younger job is handled later. However, it does not take fully into account that clusters contain jobs of different sizes, both small and large, and it does not support multiple users. To address these problems of FIFO, fair scheduling was developed to deal with small and large jobs as fairly as possible in clusters. To achieve this goal, job priorities, pool weights, and delay scheduling are introduced. The scheduling of jobs is controlled by job priorities with a suitable weight, and weights are divided into levels, such as a weight of 1.0, a weight of 2.0, and higher. But fair scheduling requires a lot of manual configuration, which can greatly influence the performance of the jobs. Moreover,

ISSN: 2233-7857 IJFGCN
Copyright © 2015 SERSC
the fair scheduling does not take into account the actual load of the master node, which plays the central role in job scheduling and distributes jobs to a number of slave nodes. Capacity scheduling supports multiple job queues; however, it limits the resources that a job can use. For the above-mentioned reasons, it is quite clear that a valid job scheduling must comprehensively address performance, support for multiple users, and effective utilization of resources. To reach this design principle, an efficient job scheduling, which aims to improve cluster performance and to accommodate jobs of different sizes, is proposed. The major contributions of this paper can be summarized as follows:
(1) To improve the performance of MapReduce clusters, a real-time prediction model is used to predict the unexecuted tasks. This model estimates the resource consumption of the unexecuted tasks by calculating that of the currently running jobs.
(2) To satisfy jobs of different sizes, the minimum cost data locality algorithm is used to calculate the degree of data locality. The algorithm efficiently handles the presence of different-size jobs in clusters, and it also improves the efficiency of the job scheduling.
To evaluate the effectiveness of the proposed job scheduling, a prototype has been implemented and various benchmarks have been conducted. The simulation results show that the proposed job scheduling significantly improves the performance of MapReduce clusters.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the efficient job scheduling. Section 4 presents the experimental results. Section 5 concludes the paper.

2. Related Work

To design a suitable job scheduling, a number of job scheduling algorithms have been proposed in recent years; this section briefly summarizes research related to job scheduling in MapReduce clusters. Zaharia et al. [6] propose a job scheduling which introduces the delay scheduling algorithm to improve data locality.
However, the algorithm does not take into account the different sizes of jobs in a cluster. Moreover, it performs well for small jobs but poorly for large jobs, because delay scheduling can incur performance degradation for large jobs: in MapReduce, large jobs are divided into a fixed number of tasks, and many tasks wait in queues due to the delay characteristic of the delay scheduling algorithm, so the execution time of large jobs increases. Jinshuang Yan et al. [10] also propose a job scheduling which has advantages in the time cost during the initial phase of a job and in task assignment, because a push model replaces the pull model as the task assignment mechanism. But it cannot perform well for large jobs, because it is designed for small jobs. Seo et al. [7] propose a new job scheduling which introduces prefetching techniques to improve data locality and cluster performance. But considerable memory consumption and network throughput are added, because a quantity of data unrelated to the task is read into memory. In addition, a mass of data is transmitted through the network due to the storage characteristics of HDFS [8-9]. Aprigio Bezerra et al. [11] propose a job scheduling which selects tasks from the pending job queues by analyzing the available resources of the cluster to improve its performance. Although analyzing the available resources of the cluster allows a task to be submitted to the cluster appropriately, it cannot predict the resource consumption of the pending jobs. Jisha S Manjaly [12] also proposes a job scheduling, called the Task Tracker aware scheduling algorithm, which aims to avoid task failures caused by overloading in
clusters, and in which users must configure the maximum load for every task by setting a threshold. However, the drawback of this job scheduling is the amount of configuration required of users, so it is very difficult to use. Obviously, the above-mentioned job scheduling algorithms share common drawbacks: they neither predict the resource consumption of unexecuted or pending tasks nor calculate the degree of data locality.

3. An Efficient Job Scheduling

Based on the analysis of the above job scheduling in MapReduce clusters, an optimized job scheduling is presented in this section. The goal of the job scheduling is to assign resources to jobs fairly. Small and large jobs are reasonably assigned to each node by analyzing the practical situation of resource utilization through dynamic priority scheduling in MapReduce clusters. Moreover, the job scheduling can predict the resource utilization of the jobs which have not been performed by analyzing the performed jobs.

3.1. Dynamic Priority Scheduling

The core of the job scheduling is dynamic priority scheduling, which introduces the minimum cost data locality algorithm with a weight to deal with jobs of different sizes. The weight of a job can be defined as

W = Locality × Priority    (1)

where Priority is the priority of a job, and Locality is the degree of data locality. From the formula, we can see that the weight of a job combines data locality with job priority. The priority of a job is the job execution order defined by users. Data locality means that the corresponding data of a job is stored on the nodes where the job is executed. In this paper, the data locality algorithm extends the host selection algorithm in Hadoop [13]; if the corresponding data of a job is divided among different nodes, the proposed job scheduling can calculate the minimum cost and assign the job to appropriate nodes by the minimum cost data locality algorithm. Assume that a job contains M blocks. The M blocks are stored on different nodes, and the block numbers of each node are denoted as N₁, N₂, ..., and Nₙ respectively.
The distances of each node to the task scheduling node (JobTracker) are denoted as D₁, D₂, ..., and Dₙ respectively. T represents the time cost of transferring a block to the node with the maximum number of blocks. The reason for this selection is that the first executed task is assigned to the node with the maximum number of blocks. T depends on the size of blocks, the number of blocks, and the actual network transmission speed. The size of a block is denoted as Block_size (64 MB by default), the number of blocks is denoted as N, and the actual network transmission speed is denoted as speed. So T can be indicated as

T = (Block_size × N × D) / speed    (2)

In this paper, assume that the data blocks of the executed job are distributed over n nodes, whose localities are denoted as Locality₁, Locality₂, ..., and Localityₙ respectively. So the locality is denoted as

Locality = (B_f / N) × T    (3)

where B_f is the number of data blocks that are transmitted through the network. According to formulas (2) and (3), the locality is equal to:
Locality = (B_f / N) × (Block_size × N × D) / speed    (4)

The minimum cost data locality algorithm selects the smallest P nodes in {Locality₁, Locality₂, ..., Localityₙ}. Users can predefine the value of P. The P nodes are sorted from smallest to largest according to locality. In MapReduce clusters, a job is divided into a fixed number of tasks, and each task is assigned to a node. Figure 1 illustrates an example of the minimum cost data locality algorithm. In this example, a job is divided into 2 tasks. There are 5 data blocks in node 1, two data blocks in node 2, and 3 blocks in node 3. The locality list is accessed in the order of Locality₁, Locality₃, and Locality₂. Obviously, task 1 is assigned to node 1, and task 2 is assigned to node 3. The locality of node 3 is smaller than that of node 2 because the number of blocks in node 3 is larger than in node 2. From Figure 1, we can see that there is no task on node 2. Assume that the distance between node 1 and node 2 is smaller than the distance between node 2 and node 3. Therefore the data blocks on node 2 are processed by task 1 on node 1.

Figure 1. The Example of the Minimum Cost Data Locality Algorithm (the locality list, ordered from smallest to largest, maps Task 1 ... Task n onto Node 1 ... Node n)

3.2. The Real-Time Prediction Model of Jobs

In the real-time prediction model, the resource utilization of the unexecuted tasks can be concluded from the executed tasks. These resources include CPU, memory, and network resources. Assume that a job is divided into ten tasks, denoted as Task₁, Task₂, ..., and Task₁₀, whose lengths are denoted as L₁, L₂, ..., and L₁₀ respectively; six of them are running, and the other four are waiting.
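As a concrete illustration of the minimum cost data locality algorithm of Section 3.1, the locality computation of formulas (2)-(4) and the selection of the P smallest-locality nodes can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function and parameter names are assumptions, Block_size is taken in MB and speed in MB/s, and the per-node figures loosely follow the Figure 1 example (a 10-block job with 5, 2, and 3 blocks local to nodes 1, 2, and 3, so B_f is 5, 8, and 7 respectively).

```python
def locality(b_f, n_blocks, block_size, distance, speed):
    """Locality of a candidate node; smaller is better.
    b_f: blocks that must be fetched over the network (B_f),
    n_blocks: total blocks of the job (N),
    block_size: block size in MB (64 by default),
    distance: network distance D from the JobTracker,
    speed: actual network transmission speed in MB/s."""
    t = (block_size * n_blocks * distance) / speed  # formula (2)
    return (b_f / n_blocks) * t                     # formulas (3)-(4)

def select_nodes(nodes, p):
    """Sort candidate nodes by locality, ascending, and keep the P smallest."""
    ranked = sorted(nodes, key=lambda node: locality(**node["stats"]))
    return ranked[:p]

# Three nodes holding 5, 2, and 3 of a job's 10 blocks (Figure 1):
nodes = [
    {"name": "node1", "stats": dict(b_f=5, n_blocks=10, block_size=64, distance=1, speed=100)},
    {"name": "node2", "stats": dict(b_f=8, n_blocks=10, block_size=64, distance=1, speed=100)},
    {"name": "node3", "stats": dict(b_f=7, n_blocks=10, block_size=64, distance=1, speed=100)},
]
best = select_nodes(nodes, p=2)
print([n["name"] for n in best])  # ['node1', 'node3']
```

As in the Figure 1 example, the two tasks land on node 1 and node 3, because node 3 holds more blocks locally than node 2 and therefore has the smaller locality cost.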
The resource consumption of a task Taskᵢ with length Lᵢ on a node is derived as

c = cpu + memory + network    (5)

where cpu stands for the CPU consumption, memory stands for the memory consumption, and network stands for the consumption of network transmission. In addition, memory comprises the number of memory bytes consumed by the task itself, denoted as t, the number of memory bytes consumed by storing local data, denoted as l, and the number of memory bytes consumed by storing network data, denoted as n. So memory is derived as
memory = t + l + n + a·log(L / T_TotalTime)    (6)

where L is the size of the temporary data written to the local disk, because the task writes its memory data to the local disk when memory usage reaches a certain threshold, and a is a regulator. When no data is transmitted over the network, the value of n is zero. According to formulas (5) and (6), c is equal to

c = cpu + t + l + n + a·log(L / T_TotalTime) + network    (7)

where T_TotalTime is the total running time of the job. When a running task T_r with length Lᵢ is completed, the resource consumption c of the task T_r is calculated. The real-time prediction model then selects an unexecuted task whose size equals Lᵢ from the waiting list and assigns it to the node where T_r was located. If no waiting task exactly matches that size, the closest match is used.

Figure 2. The Example of the Real-Time Prediction Model (currently running tasks Task 1-6, with lengths 9, 8, 23, 45, 34, 23, on Node 1 ... Node 6; waiting tasks Task 7-10, with lengths 9, 34, 45, 23)

Figure 2 illustrates an example of the real-time prediction model. Suppose Task 1 through Task 6 are running, and Task 7 through Task 10 are waiting. As shown in Figure 2, task 2 is completed. The size of task 2 is 9, and in the waiting list the size of task 7 is 9. So assigning task 7 to Node 3 is the most efficient assignment, because the resources, including CPU, memory, and network, on Node 3 can meet the requirements of task 7.

4. Experimental Evaluations

4.1. Experimental Environment

To evaluate the effectiveness of the proposed job scheduling, it is compared with existing job scheduling algorithms: the first-in-first-out (FIFO) algorithm and the fair scheduler. To evaluate its performance, an experimental environment of a MapReduce cluster with Hadoop 1.0.0 is established.
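The matching step of the real-time prediction model of Section 3.2 can be sketched as follows. This is a minimal sketch under the assumptions above, not the authors' implementation: names are illustrative, and the task lengths loosely follow the Figure 2 example.

```python
def pick_waiting_task(finished_length, waiting):
    """When a running task finishes, select the waiting task whose length
    is closest to the finished task's length (an exact match wins outright);
    the chosen task is then assigned to the node the finished task freed."""
    if not waiting:
        return None
    return min(waiting, key=lambda task: abs(task["length"] - finished_length))

# Waiting tasks 7-10 with lengths as in Figure 2:
waiting = [
    {"id": 7, "length": 9},
    {"id": 8, "length": 34},
    {"id": 9, "length": 45},
    {"id": 10, "length": 23},
]
# Task 2 (length 9, on Node 3) completes -> task 7 inherits Node 3.
chosen = pick_waiting_task(finished_length=9, waiting=waiting)
print(chosen["id"])  # 7
```

The closest-match rule is what lets the scheduler reuse the resource-consumption estimate c of the finished task for the newly placed one.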
The experimental cluster contains one master node and 9 slave nodes. These
nodes are connected with a 100 Mb/s network. The master node is configured with a 4-core 3.20 GHz Intel i7-960 processor, 16 GB of memory, and one 1 TB 5400 RPM SATA disk. Each slave node is equipped with a 4-core 3.10 GHz Intel i5-2400 processor, 16 GB of memory, and three 1 TB 5400 RPM SATA disks. They all run CentOS 6.4 with the 2.6.32-358.el6.x86_64 kernel, and each disk is formatted with the ext4 file system. The master node acts as JobTracker, SecondaryNameNode, and NameNode. Each slave node acts as DataNode and TaskTracker. The WordCount [14] and TestSort [15][16] benchmarks are performed in the experimental environment; these two benchmarks were selected because WordCount and TestSort are often used as baseline benchmarks for MapReduce. The size of the test data is classified into four types: 500 MB, 1 GB, 2 GB, and 5 GB. To spread the test data evenly across the slave nodes, the block size of the HDFS file system is set to the default value of 64 MB, and the replication number of a block is set to 3.

4.2. Experimental Results

Figure 3 and Figure 4 show the results of the three job scheduling algorithms in terms of execution time. As shown in Figure 3 and Figure 4, the proposed job scheduling performs better than the FIFO and fair schedulers in terms of execution time, because each task is assigned to a reasonable slave node through dynamic priority scheduling and the real-time prediction model. When the size of the test data is 5 GB, the advantage of the proposed scheduling is more apparent, because there is a large number of waiting tasks in the cluster. In this case, the proposed job scheduling can predict the resource usage of the waiting tasks and assign a waiting task to the most suitable slave node, which uses the cluster resources effectively. Moreover, the weight of a job can obviously reduce the network data transmission, because a novel data locality is introduced.
FIFO and the fair scheduler spend a lot of time on network data transmission, which increases the total running time of the job.

Figure 3. WordCount Job Execution Time (execution time in seconds vs. data sizes of 500 MB, 1 GB, 2 GB, and 5 GB, for FIFO, the fair scheduler, and the proposed scheduling)
Figure 4. TestSort Job Execution Time (execution time in seconds vs. data sizes of 500 MB, 1 GB, 2 GB, and 5 GB, for FIFO, the fair scheduler, and the proposed scheduling)

From Figure 3 and Figure 4, we can see that the execution time of TestSort is larger than that of WordCount. The reason is that TestSort shuffles data from one slave node to another, a process that generates heavy disk I/O and network throughput. In addition, there is a large amount of shuffle data in the shuffle stage, so a major bottleneck is network I/O. For WordCount, there is only a small amount of shuffle data in the shuffle stage.

5. Conclusion

This paper presents a novel job scheduling for MapReduce clusters. The objectives of the proposed scheduling are to reduce the execution time of jobs and to take full advantage of the resources of each node. The proposed job scheduling is advantageous in execution time and resource utilization because it can calculate the minimum cost and assign jobs to appropriate nodes by the minimum cost data locality algorithm. Moreover, the resource utilization of unexecuted tasks can be concluded from the executed tasks. A series of experiments is conducted and encouraging results are obtained.

Acknowledgements

We are grateful to the editors and anonymous reviewers for their valuable comments on this paper. The work of this paper is supported by the National Natural Science Foundation of China (Grant No. 61272399) and the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20110191110038).

References
[1] J. Dean and S. Ghemawat, "Simplifying MapReduce data processing", Proceedings of the 4th IEEE International Conference on Utility and Cloud Computing, December 5-8, 2011, pp. 366-370, Melbourne, Australia.
[2] X. Kaiqi and H. Yuxiong, "Power-efficient resource allocation in MapReduce clusters", Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management, May 27-31, 2013, pp. 603-608, Ghent, Belgium.
[3] Apache Software Foundation, Official Apache Hadoop Website.
URL http://hadoop.apache.org/, accessed July 1, 2012.
[4] Capacity Scheduler, Tech. rep., retrieved February 2012, http://hadoop.apache.org/common/docs/r0.20.2/capacity_scheduler.html.
[5] Fair Scheduler, Tech. rep., retrieved February 2012, http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html.
[6] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker and I. Stoica, "Delay scheduling: a simple technique for achieving fairness", Proceedings of the 16th EuroSys Conference, March 1-5, 2010, pp. 265-278, Paris, France.
[7] S. Seo, I. Jang, K. Woo, I. Kim, J. S. Kim and S. Maeng, "HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment", Proceedings of the IEEE International Conference on Cluster Computing and Workshops, August 31-September 4, 2009, New Orleans, United States.
[8] HDFS homepage. http://hadoop.apache.org/hdfs/
[9] S. Ghemawat, H. Gobioff, and S. Leung, "The Google File System", Proceedings of the 19th ACM Symposium on Operating Systems Principles, vol. 37, no. 5, October 19-22, 2003, pp. 29-43, Lake George, United States.
[10] J. Yan, X. Yang, R. Gu, C. Yuan, and Y. Huang, "Performance Optimization for Short MapReduce Job Execution in Hadoop", Proceedings of the 2nd International Conference on Cloud and Green Computing and 2nd International Conference on Social Computing and Its Applications, Xiangtan, China, November 1-3, 2012, pp. 688-694.
[11] A. Bezerra, P. Hernández, and A. Espinosa, "Job Scheduling for Optimizing Data Locality in Hadoop Clusters", Proceedings of the 20th European MPI Users' Group Meeting, Madrid, Spain, September 15-18, 2013, pp. 271-276.
[12] J. S. Manjaly and V. S. Chooralil, "Task Tracker Aware Scheduling for Hadoop MapReduce", Proceedings of the 3rd International Conference on Advances in Computing and Communications, August 29-31, 2013, pp. 278-281, Kochi, India.
[13] Hadoop homepage. http://hadoop.apache.org/.
[14] WordCount Program. Available in the Hadoop source distribution: src/examples/org/apache/hadoop/examples/WordCount.
[15] Hadoop TeraSort program. Available in the Hadoop source distribution since version 0.19: src/examples/org/apache/hadoop/examples/terasort.
[16] TeraSort. http://sortbenchmark.org/.

Authors

Jun Liu received his B.S. degree from Southwest University, P. R. China, in 2001, and his M.S. degree from Chongqing University, P. R. China, in 2009. Currently he is a Ph.D. candidate in the College of Computer Science at Chongqing University. His current interests include big data analytics, flash memory, information security, and the Linux kernel.

Shuyu Chen received his Ph.D. degree from Chongqing University, P. R. China, in 2001. Currently, he is a professor in the College of Software Engineering at Chongqing University. His research interests include embedded Linux systems, distributed systems, cloud computing, etc.
He has published over 120 journal and conference papers in related research areas in recent years.

Tianshu Wu received his B.S. degree from Chongqing University of Posts and Telecommunications, P. R. China, in 2011. He is currently a Ph.D. candidate in the College of Computer Science at Chongqing University. His current interests include cloud computing, large-scale data mining, and fault detection.

MingWei Lin received his B.S. degree from Chongqing University, P. R. China, in 2009. He is currently a Ph.D. candidate at Chongqing University. He has been invited as a reviewer by the Journal of Systems and Software, as well as Computers and Electrical Engineering. His current interests include large-scale data mining, flash memory, the Linux kernel, information security, and wireless sensor networks.