Hadoop Performance Modeling for Job Estimation and Resource Provisioning

Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang and Changjun Jiang

Abstract- MapReduce has become a major computing model for data intensive applications. Hadoop, an open source implementation of MapReduce, has been adopted by an increasingly growing user community. Cloud computing service providers such as the Amazon EC2 Cloud offer the opportunities for Hadoop users to lease a certain amount of resources and pay for their use. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for running a job in the cloud. This paper presents a Hadoop job performance model that accurately estimates job completion time and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model builds on historical job execution records and employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a job. Furthermore, it employs the Lagrange Multipliers technique for resource provisioning to satisfy jobs with deadline requirements. The proposed model is initially evaluated on an in-house Hadoop cluster and subsequently evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed model in job execution estimation is in the range of 94.97% and 95.51%, and jobs are completed within the required deadlines following the resource provisioning scheme of the proposed model.

Index Terms- Cloud computing, Hadoop MapReduce, performance modeling, job estimation, resource provisioning

Mukhtaj Khan is with the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK. Email: Mukhtaj.Khan@brunel.ac.uk.
Yong Jin is with the National Key Lab for Electronic Measurement Technology, North University of China, Taiyuan 030051, China. He is a Visiting Professor in the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK. Email: Yong.Jin@brunel.ac.uk.
Maozhen Li is with the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK. He is also with the Key Laboratory of Embedded Systems and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China. Email: Maozhen.Li@brunel.ac.uk.
Changjun Jiang and Yang Xiang are with the Department of Computer Science & Technology, Tongji University, 1239 Siping Road, Shanghai 200092, China. Email: {cjjiang, xiangyang}@tongji.edu.cn.

I. INTRODUCTION

Many organizations are continuously collecting massive amounts of datasets from various sources such as the World Wide Web, sensor networks and social networks. The ability to perform scalable and timely analytics on these unstructured datasets is a high priority task for many enterprises. It has become difficult for traditional network storage and database systems to process these continuously growing datasets. MapReduce [1], originally developed by Google, has become a major computing model in support of data intensive applications. It is a highly scalable, fault-tolerant and data parallel model that automatically distributes the data and parallelizes the computation across a cluster of computers [2]. Among its implementations such as Mars [3], Phoenix [4], Dryad [5] and Hadoop [6], Hadoop has received a wide uptake by the community due to its open source nature [7][8][9][10].
One feature of Hadoop MapReduce is its support of public cloud computing that enables organizations to utilize cloud services in a pay-as-you-go manner. This facility is beneficial to small and medium sized organizations where the setup of a large scale and complex private cloud is not feasible due to financial constraints. Hence, executing Hadoop MapReduce applications in a cloud environment for big data analytics has become a realistic option for both industrial practitioners and academic researchers. For example, Amazon has designed Elastic MapReduce (EMR) that enables users to run Hadoop applications across its Elastic Cloud Computing (EC2) nodes. The EC2 Cloud makes it easier for users to set up and run Hadoop applications on a large-scale virtual cluster. To use the EC2 Cloud, users have to configure the required amount of resources (virtual nodes) for their applications. However, the EC2 Cloud in its current form does not support Hadoop jobs with deadline requirements. It is purely the user's responsibility to estimate the amount of resources needed to complete their jobs, which is a highly challenging task. Hence, Hadoop performance modeling has become a necessity in estimating the right amount of resources for user jobs with deadline requirements. It should be pointed out that modeling Hadoop performance is challenging because Hadoop jobs normally involve multiple processing phases including three core phases (i.e. map phase, shuffle phase and reduce phase). Moreover, the first wave of the shuffle phase is normally processed in parallel with the map phase (i.e. the overlapping stage) and the other waves of the shuffle phase are processed after the map phase is completed (i.e. the non-overlapping stage). To effectively manage cloud resources, several Hadoop performance models have been proposed [11][12][13][14]. However, these models do not consider the overlapping and non-overlapping stages of the shuffle phase, which leads to an inaccurate estimation of job execution.
Recently, a number of sophisticated Hadoop performance models have been proposed [15][16][17][18]. Starfish [15] collects a running Hadoop job profile at a fine granularity with detailed information for job estimation and optimization. On the top of Starfish, Elasticiser [16] is proposed for resource provisioning in terms of virtual machines. However, collecting the detailed execution profile of a Hadoop job incurs a high overhead which leads to an overestimated job execution time. The HP model [17] considers both the overlapping and non-overlapping stages and uses simple linear regression for job estimation. This model also estimates the amount of resources for jobs with deadline requirements. CRESP [18] estimates job execution and supports resource provisioning in terms of map and reduce slots. However, both the HP model and CRESP ignore the impact of the number of reduce tasks on job performance. The HP model is restricted to a constant number of reduce tasks, whereas CRESP only considers a single wave of the reduce phase. In CRESP, the number of reduce tasks has to be equal to the number of reduce slots. It is unrealistic to configure either the same number of reduce tasks or a single wave of the reduce phase for all the jobs. It can be argued that in practice, the number of reduce tasks varies depending on the size of the input dataset, the type of a Hadoop application (e.g. CPU intensive, or disk I/O intensive) and user requirements. Furthermore, for the reduce phase, using multiple waves generates better performance than using a single wave, especially when Hadoop processes a large dataset on a small amount of resources. While a single wave reduces the task setup overhead, multiple waves improve the utilization of the disk I/O.

Building on the HP model, this paper presents an improved HP model for Hadoop job execution estimation and resource provisioning. The major contributions of this paper are as follows:

- The improved HP work mathematically models all the three core phases of a Hadoop job. In contrast, the HP work does not mathematically model the non-overlapping shuffle phase in the first wave.
- The improved HP model employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a Hadoop job with a varied number of reduce tasks. In contrast, the HP model employs a simple linear regression technique for job execution estimation which restricts it to a constant number of reduce tasks.
- Based on job execution estimation, the improved HP model employs the Lagrange Multiplier technique to provision the amount of resources for a Hadoop job to complete within a given deadline.

The performance of the improved HP model is initially evaluated on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. The evaluation results show that the improved HP model outperforms both the HP model and Starfish in job execution estimation with an accuracy level in the range of 94.97% and 95.51%. For resource provisioning, 4 job scenarios are considered with a varied number of map slots and reduce slots. The experimental results show that the improved HP model is more economical in resource provisioning than the HP model.

The remainder of the paper is organized as follows. Section II models job phases in Hadoop. Section III presents the improved HP model in job execution estimation and Section IV further enhances the improved HP model for resource provisioning. Section V first evaluates the performance of the improved HP model on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. Section VI discusses a number of related works.
Finally, Section VII concludes the paper and points out some future work.

II. MODELING JOB PHASES IN HADOOP

Normally a Hadoop job execution is divided into a map phase and a reduce phase. The reduce phase involves data shuffling, data sorting and user-defined reduce functions. Data shuffling and sorting are performed simultaneously. Therefore, the reduce phase can be further divided into a shuffle (or sort) phase and a reduce phase performing user-defined functions. As a result, an overall Hadoop job execution workflow consists of a map phase, a shuffle phase and a reduce phase as shown in Fig.1. Map tasks are executed in map slots at the map phase and reduce tasks run in reduce slots at the reduce phase. Every task runs in one slot at a time. A slot is allocated with a certain amount of resources in terms of CPU and RAM. A Hadoop job phase can be completed in a single wave or multiple waves. Tasks in a wave run in parallel on the assigned slots.

Fig.1. Hadoop job execution flow (the input dataset is processed by map tasks in the map phase, the intermediate map outputs are shuffled in the shuffle phase, and reduce tasks write the final output to HDFS).

Herodotou presented a detailed set of mathematical models on Hadoop performance at a fine granularity [19]. For the purpose of simplicity, we only consider the three core phases (i.e. map phase, shuffle phase and reduce phase) in modeling the performance of Hadoop jobs. Table 1 defines the variables used in Hadoop job performance modeling.

Table 1. Defined variables in modeling job phases.
D_{out}^{m}: the average output data size of a map task.
T_{total}^{m}: the total execution time of a map phase.
D_{in}^{m}: the average input data size of a map task.
M_{selectivity}: the map selectivity, which is the ratio of a map output to a map input.
N_{m}: the total number of map tasks.
T_{avg}^{m}: the average execution time of a map task.
N_{m}^{slot}: the total number of configured map slots.
D_{avg}^{sh}: the average size of the shuffled data.
T_{total}^{sh}: the total execution time of a shuffle phase.
N_{r}: the total number of reduce tasks.
T_{avg}^{sh}: the average execution duration of a shuffle task.
N_{r}^{slot}: the total number of configured reduce slots.
N_{w1}^{sh}: the total number of shuffle tasks that complete in the first wave.
N_{w2}^{sh}: the total number of shuffle tasks that complete in other waves.
T_{w1,avg}^{sh}: the average execution time of a shuffle task that completes in the first wave.
T_{w2,avg}^{sh}: the average execution time of a shuffle task that completes in other waves.
D_{out}^{r}: the average output data size of a reduce task.
T_{total}^{r}: the total execution time of a reduce phase.
D_{in}^{r}: the average input size of a reduce task.
R_{selectivity}: the reduce selectivity, which is the ratio of a reduce output to a reduce input.
T_{avg}^{r}: the average execution time of a reduce task.

A. Modeling Map Phase

In this phase, a Hadoop job reads an input dataset from the Hadoop Distributed File System (HDFS), splits the input dataset into data chunks based on a specified size and then passes the data chunks to a user-defined map function. The map function processes the data chunks and produces a map output. The map output is called intermediate data.
The average map output and the total map phase execution time can be computed using Eq.(1) and Eq.(2) respectively:

D_{out}^{m} = D_{in}^{m} \times M_{selectivity}    (1)

T_{total}^{m} = \frac{T_{avg}^{m} \times N_{m}}{N_{m}^{slot}}    (2)

B. Modeling Shuffle Phase

In this phase, a Hadoop job fetches the intermediate data, sorts it and copies it to one or more reducers. The shuffle tasks and sort tasks are performed simultaneously, therefore we generally consider them as a shuffle phase. The average size of the shuffled data can be computed using Eq.(3):

D_{avg}^{sh} = \frac{D_{out}^{m} \times N_{m}}{N_{r}}    (3)

If N_{r} \le N_{r}^{slot}, then the shuffle phase will be completed in a single wave. The total execution time of the shuffle phase can be computed using Eq.(4):

T_{total}^{sh} = \frac{T_{avg}^{sh} \times N_{r}}{N_{r}^{slot}}    (4)

Otherwise, the shuffle phase will be completed in multiple waves and its execution time can be computed using Eq.(5):

T_{total}^{sh} = \frac{T_{w1,avg}^{sh} \times N_{w1}^{sh} + T_{w2,avg}^{sh} \times N_{w2}^{sh}}{N_{r}^{slot}}    (5)

C. Modeling Reduce Phase

In this phase, a job reads the sorted intermediate data as input and passes it to a user-defined reduce function. The reduce function processes the intermediate data and produces a final output. In general, the reduce output is written back into HDFS. The average output of the reduce tasks and the total execution time of the reduce phase can be computed using Eq.(6) and Eq.(7) respectively:

D_{out}^{r} = D_{in}^{r} \times R_{selectivity}    (6)

T_{total}^{r} = \frac{T_{avg}^{r} \times N_{r}}{N_{r}^{slot}}    (7)
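As an illustration of how the phase-level expressions above fit together, the following Python sketch computes the map, shuffle and reduce phase durations from a measured job profile. This is not the authors' code: the function names and the sample profile values are hypothetical, and only Eqs.(1)-(7) are assumed.

```python
# Minimal sketch of the Section II phase model (Eqs.(1)-(7)); variable names follow Table 1,
# and the example profile numbers below are invented for illustration only.

def map_phase_time(t_m_avg, n_m, m_slots):
    # Eq.(2): total map phase time = avg map task time * number of map tasks / map slots
    return t_m_avg * n_m / m_slots

def shuffled_data_size(d_m_in, m_selectivity, n_m, n_r):
    d_m_out = d_m_in * m_selectivity          # Eq.(1): average map output per task
    return d_m_out * n_m / n_r                # Eq.(3): average shuffled data per reduce task

def shuffle_phase_time(n_r, r_slots, t_sh_avg, t_sh_w1=None, n_sh_w1=0, t_sh_w2=None, n_sh_w2=0):
    if n_r <= r_slots:
        return t_sh_avg * n_r / r_slots       # Eq.(4): single shuffle wave
    # Eq.(5): first-wave and other-wave shuffle tasks combined
    return (t_sh_w1 * n_sh_w1 + t_sh_w2 * n_sh_w2) / r_slots

def reduce_phase_time(t_r_avg, n_r, r_slots):
    return t_r_avg * n_r / r_slots            # Eq.(7): total reduce phase time

if __name__ == "__main__":
    # Hypothetical profile: 160 map tasks of 30s on 16 map slots, 64 reduce tasks on 16 reduce slots.
    print(map_phase_time(30, 160, 16))        # 300.0 seconds
    print(shuffle_phase_time(64, 16, 25, t_sh_w1=90, n_sh_w1=16, t_sh_w2=40, n_sh_w2=48))
    print(reduce_phase_time(35, 64, 16))      # 140.0 seconds
```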
III. AN IMPROVED HP PERFORMANCE MODEL

As mentioned before, Hadoop jobs have three core execution phases: the map phase, the shuffle phase and the reduce phase. The map phase and the shuffle phase can have overlapping and non-overlapping stages. In this section, we present an improved HP model which takes into account both the overlapping stage and the non-overlapping stage of the shuffle phase during the execution of a Hadoop job. We consider single Hadoop jobs without logical dependencies.

A. Design Rationale

A Hadoop job normally runs with multiple phases in a single wave or in multiple waves. If a job runs in a single wave then all the phases will be completed without overlapping stages as shown in Fig.2.

Fig.2. A Hadoop job running in a single wave (16 map tasks and 16 reduce tasks).

However, if a job runs in multiple waves, then the job will progress through both overlapping (parallel) and non-overlapping (sequential) stages among the phases as shown in Fig.3. In the case of multiple waves, the first wave of the shuffle phase starts immediately after the first map task completes. Furthermore, the first wave of the shuffle phase continues until all the map tasks complete and all the intermediate data is shuffled and sorted. Thus, the first wave of the shuffle phase progresses in parallel with the other waves of the map phase as shown in Fig.3. After completion of the first wave of the shuffle phase, the reduce tasks start running and produce output. Afterwards, these reduce slots become available to the shuffle tasks running in other waves. It can be observed from Fig.3 that the shuffle phase takes longer to complete in the first wave than in other waves. In order to estimate the execution time of a job in multiple waves, we need to estimate two sets of parameters for the shuffle phase - the average and the maximum durations of the first wave, together with the average and the maximum durations of the other waves. Moreover, there is no significant difference between the durations of the map tasks running in non-overlapping and overlapping stages due to the equal size of data chunks. Therefore, we only estimate one set of parameters for the map phase, which are the average and the maximum durations of the map tasks. The reduce tasks run in a non-overlapping stage, therefore we only estimate one set of parameters for the reduce phase, which are the average and the maximum durations of the reduce tasks. Finally, we aggregate the durations of all the three phases to estimate the overall job execution time.

Fig.3. A Hadoop job running in multiple waves (80 map tasks, 32 reduce tasks). The figure contrasts how the improved HP model and the HP model segment the job: the non-overlapping map phase in the first wave, the shuffle phase in the first wave (overlapping and non-overlapping), the non-overlapping shuffle phase in other waves, and the reduce phase.

This can be reflected in the mathematical equations of the improved HP model which are different from the HP model.

B. Mathematical Expressions

In this section, we present the mathematical expressions of the improved HP work in modeling a Hadoop job which completes in multiple waves. Table 2 defines the variables used in the improved model.

Table 2. Defined variables in the improved HP model.
T_{w1}^{m,low}: the lower bound duration of the map phase in the first wave (non-overlapping).
T_{w1}^{m,up}: the upper bound duration of the map phase in the first wave (non-overlapping).
N_{w1}^{m}: the number of map tasks that complete in the first wave of the map phase.
N_{w2}^{m}: the number of map tasks that complete in other waves of the map phase.
T_{max}^{m}: the maximum execution time of a map task.
T_{w1}^{sh,low}: the lower bound duration of the shuffle phase in the first wave (overlapping with the map phase).
T_{w1}^{sh,up}: the upper bound duration of the shuffle phase in the first wave (overlapping with the map phase).
T_{w1,avg}^{sh}: the average execution time of a shuffle task that completes in the first wave of the shuffle phase.
T_{w1,max}^{sh}: the maximum execution time of a shuffle task that completes in the first wave of the shuffle phase.
T_{w2}^{sh,low}: the lower bound duration of the shuffle phase in other waves (non-overlapping).
T_{w2}^{sh,up}: the upper bound duration of the shuffle phase in other waves (non-overlapping).
T_{w2,avg}^{sh}: the average execution time of a shuffle task that completes in other waves of the shuffle phase.
T_{w2,max}^{sh}: the maximum execution time of a shuffle task that completes in other waves of the shuffle phase.
T_{low}^{r}: the lower bound duration of the reduce phase.
T_{up}^{r}: the upper bound duration of the reduce phase.
T_{max}^{r}: the maximum execution time of a reduce task.
T_{low}^{job}: the lower bound execution time of a Hadoop job.
T_{up}^{job}: the upper bound execution time of a Hadoop job.
T_{avg}^{job}: the average execution time of a Hadoop job.

It should be pointed out that Fig.3 also shows the differences between the HP model and the improved model in Hadoop job modeling. The HP work mathematically models the whole map phase, which includes the non-overlapping stage of the map phase and the stage overlapping with the shuffle phase, but it does not provide any mathematical equations to model the non-overlapping stage of the shuffle phase in the first wave. In contrast, the improved HP work mathematically models the non-overlapping map phase in the first wave, and the shuffle phase in the first wave which includes both the stage overlapping with the map phase and the non-overlapping stage. In practice, job tasks in different waves may not complete exactly at the same time due to varied overhead in disk I/O operations and network communication. Therefore, the improved HP model estimates the lower bound and the upper bound of the execution time for each phase to cover the best-case and the worst-case scenarios respectively. We consider a job that runs in both non-overlapping and overlapping stages. The lower bound and the upper bound of the map phase in the first wave, which is a non-overlapping stage, can be computed using Eq.(8) and Eq.(9) respectively.
T_{w1}^{m,low} = \frac{T_{avg}^{m} \times N_{w1}^{m}}{N_{m}^{slot}}    (8)

T_{w1}^{m,up} = \frac{T_{max}^{m} \times N_{w1}^{m}}{N_{m}^{slot}}    (9)

In the overlapping stage of a running job, the map phase overlaps with the shuffle phase. Specifically, the tasks running in other waves of the map phase run in parallel with the tasks running in the first wave of the shuffle phase. As the shuffle phase always completes after the map phase, which means that the shuffle phase takes longer than the map phase, we use the duration of the shuffle phase in the first wave to compute the lower bound and the upper bound of the overlapping stage of the job using Eq.(10) and Eq.(11) respectively.

T_{w1}^{sh,low} = \frac{T_{w1,avg}^{sh} \times N_{w1}^{sh}}{N_{r}^{slot}}    (10)

T_{w1}^{sh,up} = \frac{T_{w1,max}^{sh} \times N_{w1}^{sh}}{N_{r}^{slot}}    (11)

In other waves of the shuffle phase, the tasks run in a non-overlapping stage. Hence, the lower bound and the upper bound of the non-overlapping stage of the shuffle phase can be computed using Eq.(12) and Eq.(13) respectively.

T_{w2}^{sh,low} = \frac{T_{w2,avg}^{sh} \times N_{w2}^{sh}}{N_{r}^{slot}}    (12)

T_{w2}^{sh,up} = \frac{T_{w2,max}^{sh} \times N_{w2}^{sh}}{N_{r}^{slot}}    (13)

The reduce tasks start after completion of the shuffle tasks. Therefore, the reduce tasks complete in a non-overlapping stage. The lower bound and the upper bound of the reduce phase can be computed using Eq.(14) and Eq.(15) respectively.

T_{low}^{r} = \frac{T_{avg}^{r} \times N_{r}}{N_{r}^{slot}}    (14)

T_{up}^{r} = \frac{T_{max}^{r} \times N_{r}}{N_{r}^{slot}}    (15)

As a result, the lower bound and upper bound of the execution time of a Hadoop job can be computed by combining the execution durations of all the three phases using Eq.(16) and Eq.(17) respectively.

T_{low}^{job} = T_{w1}^{m,low} + T_{w1}^{sh,low} + T_{w2}^{sh,low} + T_{low}^{r}    (16)

T_{up}^{job} = T_{w1}^{m,up} + T_{w1}^{sh,up} + T_{w2}^{sh,up} + T_{up}^{r}    (17)

By substituting the values in Eq.(16) and Eq.(17), we have

T_{low}^{job} = \frac{T_{avg}^{m} \times N_{w1}^{m}}{N_{m}^{slot}} + \frac{T_{w1,avg}^{sh} \times N_{w1}^{sh} + T_{w2,avg}^{sh} \times N_{w2}^{sh} + T_{avg}^{r} \times N_{r}}{N_{r}^{slot}}    (18)

T_{up}^{job} = \frac{T_{max}^{m} \times N_{w1}^{m}}{N_{m}^{slot}} + \frac{T_{w1,max}^{sh} \times N_{w1}^{sh} + T_{w2,max}^{sh} \times N_{w2}^{sh} + T_{max}^{r} \times N_{r}}{N_{r}^{slot}}    (19)

Finally, we take an average of Eq.(18) and Eq.(19) to estimate the execution time of a Hadoop job using Eq.(20).

T_{avg}^{job} = \frac{T_{low}^{job} + T_{up}^{job}}{2}    (20)
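The lower bound, upper bound and averaged estimate of Eqs.(16)-(20) can be computed directly from a measured profile. The snippet below is a minimal illustration, not the authors' implementation; the dictionary field names and the sample durations are our own assumptions.

```python
# Minimal sketch of the improved HP lower/upper bound job estimate (Eqs.(18)-(20)).
# The "profile" dict fields and the example numbers are hypothetical placeholders.

def job_bounds(profile, n_m_w1, n_sh_w1, n_sh_w2, n_r, m_slots, r_slots):
    # Eq.(18): lower bound built from average task durations.
    t_low = (profile["map_avg"] * n_m_w1 / m_slots
             + (profile["sh_w1_avg"] * n_sh_w1
                + profile["sh_w2_avg"] * n_sh_w2
                + profile["red_avg"] * n_r) / r_slots)
    # Eq.(19): upper bound built from maximum task durations.
    t_up = (profile["map_max"] * n_m_w1 / m_slots
            + (profile["sh_w1_max"] * n_sh_w1
               + profile["sh_w2_max"] * n_sh_w2
               + profile["red_max"] * n_r) / r_slots)
    # Eq.(20): the estimated job execution time is the average of the two bounds.
    return t_low, t_up, (t_low + t_up) / 2.0

if __name__ == "__main__":
    prof = {"map_avg": 30, "map_max": 34, "sh_w1_avg": 90, "sh_w1_max": 99,
            "sh_w2_avg": 40, "sh_w2_max": 46, "red_avg": 35, "red_max": 39}
    low, up, avg = job_bounds(prof, n_m_w1=16, n_sh_w1=16, n_sh_w2=48, n_r=64,
                              m_slots=16, r_slots=16)
    print(low, up, avg)
```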
C. Job Execution Estimation

In the previous section, we have presented the mathematical expressions of the improved HP model. The lower bound and the upper bound of the map phase can be computed using Eq.(8) and Eq.(9) respectively. However, the durations of the shuffle phase and the reduce phase have to be estimated based on the running records of a Hadoop job. When a job processes an increasing size of an input dataset, the number of map tasks is proportionally increased while the number of reduce tasks is specified by a user in the configuration file. The number of reduce tasks can vary depending on the user's configurations. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the increasing size of the input dataset, as considered in the HP model. This is because the volume of an intermediate data block equals the total volume of the generated intermediate data divided by the number of reduce tasks. As a result, the volume of an intermediate data block also increases linearly with the increasing size of the input dataset. However, when the number of reduce tasks varies, the execution durations of both the shuffle tasks and the reduce tasks are not linear to the increasing size of an input dataset. In either the shuffle phase or the reduce phase, we consider the tasks running in both overlapping and non-overlapping stages. Unlike the HP model, the improved model considers a varied number of reduce tasks. As a result, the durations of both the shuffle tasks and the reduce tasks are nonlinear to the size of an input dataset. Therefore, instead of using a simple linear regression as adopted by the HP model, we apply Locally Weighted Linear Regression (LWLR) [20][21] in the improved model to estimate the execution durations of both the shuffle tasks and the reduce tasks.

LWLR is an instance-based nonparametric function, which assigns a weight to each training instance x_k according to its Euclidean distance from the query instance x_q. LWLR assigns a high weight to an instance x_k which is close to the query instance x_q and a low weight to the instances that are far away from the query instance x_q. The weight of an instance can be computed using a Gaussian function as illustrated in Eq.(21):

w_k = \exp\left(-\frac{(distance(x_k, x_q))^2}{2h^2}\right), \quad k = 1, 2, 3, \ldots, m    (21)

where w_k is the weight of the training instance at location k, x_k is the training instance at location k, m is the total number of the training instances, and h is a smoothing parameter which determines the width of the local neighborhood of the query instance. The value of h is crucial to LWLR. Users have the option of using a new value of h for each estimation or a single global value of h. However, finding an optimal value for h is a challenging issue itself [22]. In the improved HP model, a single global value of h is used to minimize the estimated mean square errors.

In the improved HP model, LWLR is used to estimate the durations of both the shuffle tasks and the reduce tasks. First, we estimate T_{w1,avg}^{sh}, which is the average duration of the shuffle tasks running in the first wave of the shuffle phase. To estimate T_{w1,avg}^{sh}, we define a matrix X^{m \times n} whose rows contain the training dataset x_1, x_2, x_3, \ldots, x_m, where n is the number of feature variables which is set to 2 (i.e. the size of an intermediate dataset and the number of reduce tasks). We define a vector Y = (y_1, y_2, \ldots, y_m) of dependent variables that are used for the average durations of the shuffle tasks. For example, y_1 represents the average execution time of the shuffle task that corresponds to the training instance x_1. We define another matrix X_q whose rows are query instances. Each query instance x_q contains both the size of the intermediate dataset d_{new} and the number of reduce tasks of a new job. We calculate d_{new} based on the average input data size of a map task, the total number of map tasks and the map selectivity metric, which is d_{new} = D_{in}^{m} \times N_{m} \times M_{selectivity}. For the estimation of y_i, we calculate the weight for each training instance using Eq.(21) and then compute the parameter \beta, which is the coefficient of LWLR, using Eq.(22):

\beta = (X^{T} W X)^{-1} (X^{T} W Y)    (22)

Here W = diag(w_k) is the diagonal matrix where all the non-diagonal cells are 0 values. The value of a diagonal cell increases as the distance between a training instance and the query instance decreases. Finally, the duration of a new shuffle task running in the first wave of the shuffle phase can be estimated using Eq.(23):

T_{w1,new}^{sh} = x_q \beta    (23)

Similarly, the durations T_{w1,max}^{sh}, T_{w2,avg}^{sh}, T_{w2,max}^{sh}, T_{avg}^{r} and T_{max}^{r} can be estimated. The estimated values of both the shuffle phase and the reduce phase are used in the improved HP model to estimate the overall execution time of a Hadoop job when processing a new input dataset. Fig.4 shows the overall architecture of the improved HP model, which summarizes the work of the improved HP model in job execution estimation. The boxes in gray represent the same work presented in the HP model. It is worth noting that the improved HP model works in an offline mode and estimates the execution time of a job based on the job profile.
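For completeness, a minimal sketch of the LWLR estimator of Eqs.(21)-(23) is shown below. It assumes only numpy; the feature layout (intermediate data size and number of reduce tasks) follows the text above, while the training values, the bandwidth h and the helper name lwlr_predict are purely illustrative.

```python
# Minimal sketch (not the authors' code) of Locally Weighted Linear Regression, Eqs.(21)-(23):
# Gaussian weights per training instance, a weighted least-squares fit, and one prediction.

import numpy as np

def lwlr_predict(X, y, x_query, h=1.0):
    """X: (m, n) training instances, y: (m,) observed task durations,
    x_query: (n,) query instance, h: smoothing (bandwidth) parameter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Add an intercept column so the local model is affine.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    xq = np.hstack([[1.0], x_query])

    # Eq.(21): Gaussian weights from the Euclidean distance to the query instance.
    dists = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-(dists ** 2) / (2.0 * h ** 2))
    W = np.diag(w)

    # Eq.(22): beta = (X^T W X)^-1 (X^T W Y); pinv is used for numerical safety.
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ y)

    # Eq.(23): predicted duration for the new job's query instance.
    return float(xq @ beta)

# Illustrative use: features are (intermediate data size in GB, number of reduce tasks);
# the durations are made-up average shuffle times in seconds.
X_train = [[2.0, 32], [4.0, 32], [6.0, 64], [8.0, 64], [10.0, 96]]
y_train = [22.0, 41.0, 35.0, 47.0, 44.0]
print(lwlr_predict(X_train, y_train, x_query=[7.0, 64], h=20.0))
```

In practice the bandwidth h would be chosen as a single global value that minimizes the estimated mean square error, as the text above describes.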
Fig.4. The architecture of the improved HP model (the job profile feeds the map phase estimates and the LWLR estimates of the first-wave shuffle, other-wave shuffle and reduce task durations, which are combined into the overall job estimation).

IV. RESOURCE PROVISIONING

The improved HP model presented in Section III can estimate the execution time of a Hadoop job based on the job execution profile, allocated resources (i.e. map slots and reduce slots), and the size of an input dataset. The improved HP model is further enhanced to estimate the amount of resources for Hadoop jobs with deadline requirements.

Consider a deadline t for a job that is targeted at the lower bound of the execution time. To estimate the number of map slots and reduce slots, we consider the non-overlapping map phase in the first wave, the map phase in other waves together with the overlapped shuffle phase in the first wave, the shuffle phase in other waves and the reduce phase. Therefore, we simplify Eq.(18) into Eq.(24), with a modification of Eq.(20), for resource estimation:
\frac{a}{N_{m}^{slot}} + \frac{b}{N_{r}^{slot}} + \frac{c}{N_{r}^{slot}} + \frac{d}{N_{r}^{slot}} = t    (24)

where t is the given job deadline, a = T_{avg}^{m} \times N_{w1}^{m}, b = T_{w1,avg}^{sh} \times N_{w1}^{sh}, c = T_{w2,avg}^{sh} \times N_{w2}^{sh} and d = T_{avg}^{r} \times N_{r}.

The method of Lagrange Multipliers [23] is used to estimate the amounts of resources (i.e. map slots and reduce slots) for a job to complete within a deadline. Lagrange Multipliers is an optimization technique in multivariable calculus that minimizes or maximizes an objective function subject to a constraint function. The objective function is f(N_{m}^{slot}, N_{r}^{slot}) and the constraint function is g(N_{m}^{slot}, N_{r}^{slot}) = 0, where g(N_{m}^{slot}, N_{r}^{slot}) = \frac{a}{N_{m}^{slot}} + \frac{b + c + d}{N_{r}^{slot}} - t is derived from Eq.(24). To minimize the objective function, the Lagrangian function is expressed as Eq.(25):

L(N_{m}^{slot}, N_{r}^{slot}, \lambda) = f(N_{m}^{slot}, N_{r}^{slot}) + \lambda g(N_{m}^{slot}, N_{r}^{slot})    (25)

where \lambda is the Lagrange Multiplier. Taking the partial derivatives of Eq.(25) with respect to N_{m}^{slot}, N_{r}^{slot} and \lambda and setting each of them to zero gives three simultaneous equations, Eq.(26), Eq.(27) and Eq.(28): the first relates \lambda to a and N_{m}^{slot}, the second relates \lambda to (b + c + d) and N_{r}^{slot}, and the third recovers the constraint of Eq.(24). Solving Eq.(26), Eq.(27) and Eq.(28) simultaneously for N_{m}^{slot} and N_{r}^{slot} gives the numbers of map slots and reduce slots required for the job to complete within the deadline t.

As we have targeted at the lower bound of the execution time of a job, the estimated amount of resources might not be sufficient for the job to complete within the deadline. This is because the lower bound corresponds to the best-case scenario which is hardly achievable in a real Hadoop environment. Therefore, we also target at the upper bound of the execution time of a job. For this purpose we use Eq.(29), the upper-bound counterpart of Eq.(24) built from the maximum task durations of Eq.(19), as the constraint function in Lagrange Multipliers, and apply the same method as applied to Eq.(18) to compute the values of both N_{m}^{slot} and N_{r}^{slot}. In this case, the amounts of resources might be overestimated for a job to complete within the deadline. This is because the upper bound corresponds to the worst-case execution of a job. As a result, an average amount of resources between the lower and the upper bounds might be more sensible for resource provisioning for a job to complete within a deadline.
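The model derives the slot numbers in closed form with Lagrange Multipliers. As a hedged illustration of the same provisioning idea, the sketch below simply searches the (map slots, reduce slots) grid for the cheapest pair that satisfies the Eq.(24) constraint; the cost function (total number of slots), the search bound and the sample constants a, b, c, d are assumptions made for the example and are not taken from the paper.

```python
# Grid-search sketch (not the closed-form Lagrange solution) for deadline-driven provisioning.
# Given the profile-derived constants a, b, c, d of Eq.(24) and a deadline t, find a cheap
# (map slots, reduce slots) combination whose estimated completion time meets the deadline.

def estimated_time(a, b, c, d, m_slots, r_slots):
    # Constraint function of Eq.(24): a over the map slots plus (b + c + d) over the reduce slots.
    return a / m_slots + (b + c + d) / r_slots

def provision(a, b, c, d, deadline, max_slots=64):
    best = None
    for m in range(1, max_slots + 1):
        for r in range(1, max_slots + 1):
            if estimated_time(a, b, c, d, m, r) <= deadline:
                cost = m + r                      # assumed objective: minimize total slots
                if best is None or cost < best[0]:
                    best = (cost, m, r)
    return None if best is None else (best[1], best[2])

if __name__ == "__main__":
    # Hypothetical constants: a = T_avg_m * N_w1_m, and b, c, d as defined after Eq.(24).
    a, b, c, d = 4800.0, 1400.0, 1900.0, 2200.0
    print(provision(a, b, c, d, deadline=750))    # prints a cheapest feasible pair, (12, 16) here
```

Using the lower-bound constants gives the smallest feasible allocation, while repeating the search with the upper-bound constants of Eq.(29) gives a conservative one; averaging the two mirrors the provisioning strategy described above.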
V. PERFORMANCE EVALUATION

The performance of the improved HP model was initially evaluated on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. In this section, we present the evaluation results. First, we give a brief description of the experimental environments that were used in the evaluation process.

A. Experimental Setup

We set up an in-house Hadoop cluster using an Intel Xeon server machine. The specifications and configurations of the server are shown in Table 3. We installed Oracle Virtual Box and configured 8 Virtual Machines (VMs) on the server. Each VM was assigned 4 CPU cores, 8GB RAM and 50GB of hard disk storage. We used Hadoop-1.2.1 and configured one VM as the Name Node and the remaining 7 VMs as Data Nodes. The Name Node was also used as a Data Node. The data block size of the HDFS was set to 64MB and the replication level of data blocks was set to 2. Two map slots and two reduce slots were configured on each VM. We employed two typical MapReduce applications, i.e. the WordCount application and the Sort application, which are CPU intensive and IO intensive applications respectively. The TeraGen application was used to generate input datasets of different sizes.

The second experimental Hadoop cluster was set up on the Amazon EC2 Cloud using 20 m1.large instances. The specifications of the m1.large instance are shown in Table 3. In this cluster, we used Hadoop-1.2.1 and configured one instance as the Name Node and the other 19 instances as Data Nodes. The Name Node was also used as a Data Node.
The data block size of the HDFS was set to 64MB and the replication level of data blocks was set to 3. Each instance was configured with one map slot and one reduce slot.

Table 3: Experimental Hadoop cluster.
Intel Xeon server: CPU 40 cores, Processor 2.7GHz, Memory 128GB, Hard disk 2TB, Connectivity 100Mbps Ethernet LAN.
Amazon m1.large instance: vCPU 2, Memory 7.5GB, Hard disk 420GB.
Software: Operating System Ubuntu 12.04 LTS, JDK 1.6, Hadoop 1.2.1, Oracle Virtual Box 4.2.8, Starfish 0.3.0.

B. Job Profile Information

We ran both the WordCount and the Sort applications on the two Hadoop clusters respectively and employed Starfish to collect the job profiles. For each application running on each cluster, we conducted 10 tests. For each test, we ran 5 times and took the average durations of the phases. Table 4 and Table 5 present the job profiles of the two applications that ran on the EC2 Cloud.

Table 4: The job profile of the WordCount application in the EC2 environment. For input datasets from 5GB to 50GB (80 to 800 map tasks), the table reports the average and maximum durations (in seconds) of the map tasks, the shuffle tasks in the first wave (overlapping), the shuffle tasks in other waves (non-overlapping) and the reduce tasks.

Table 5: The job profile of the Sort application in the EC2 environment, with the same structure as Table 4.

C. Evaluating the Impact of the Number of Reduce Tasks on Job Performance

In this section we evaluate the impact of the number of reduce tasks on job performance. We ran both the WordCount and the Sort applications on the in-house Hadoop cluster with a varied number of reduce tasks. The experimental results are shown in Fig.5 and Fig.6 respectively. For both applications, it can be observed that when the size of the input dataset is small (e.g. 10GB), using a small number of reduce tasks (e.g. 16) generates less execution time than the case of using a large number of reduce tasks (e.g. 64). However, when the size of the input dataset is large (e.g. 25GB), using a large number of reduce tasks (e.g. 64) generates less execution time than the case of using a small number of reduce tasks (e.g. 16). It can also be observed that when the size of the input dataset is small (e.g. 10GB or 15GB), using a single wave of reduce tasks (i.e. the number of reduce tasks is equal to the number of reduce slots, which is 16) performs better than the case of using multiple waves of reduce tasks (i.e. the number of reduce tasks is larger than the number of reduce slots). However, when the size of the input dataset is large (e.g. 25GB), both the WordCount and the Sort applications perform better in the case of using multiple waves of reduce tasks than in the case of using a single wave of reduce tasks. While a single wave reduces the task setup overhead on a small dataset, multiple waves improve the utilization of the disk I/O on a large dataset. As a result, the number of reduce tasks affects the performance of a Hadoop application.

Fig.5. The performance of the WordCount application with a varied number of reduce tasks.
Fig.6. The performance of the Sort application with a varied number of reduce tasks.
D. Estimating the Execution Times of Shuffle Tasks and Reduce Tasks

Both the WordCount and the Sort applications processed a dataset on the in-house Hadoop cluster with the number of reduce tasks varied from 32 to 64. The size of the dataset was varied from 1GB to 20GB. Both applications also processed another dataset from 5GB to 50GB on the EC2 Cloud, with the number of reduce tasks varying from 40 to 80. The LWLR regression model presented in Section III.C was employed to estimate the execution times of both the shuffle tasks and the reduce tasks of a new job. The estimated values were used in Eq.(18) and Eq.(19) to estimate the overall job execution time. Fig.7 and Fig.8 show respectively the estimated execution times of both the shuffle tasks and the reduce tasks for both applications running on the Hadoop cluster in EC2. Similar evaluation results were obtained for both applications running on the in-house Hadoop cluster. We can observe that the execution times of both the shuffle tasks (non-overlapping stage) and the reduce tasks are not linear to the size of an input dataset. It should be noted that the execution times of the shuffle tasks that run in an overlapping stage are linear to the size of an input dataset, because the durations of these tasks depend on the number of map waves, as shown in Table 4 and Table 5.

Fig.7. The estimated durations of both the shuffle phase (non-overlapping stage) and the reduce phase in the WordCount application. The points represent the actual execution times and the dashed lines represent the estimated durations.

Fig.8. The estimated durations of both the shuffle phase (non-overlapping stage) and the reduce phase in the Sort application. The points represent the actual execution times and the dashed lines represent the estimated durations.

E. Job Execution Estimation

A number of experiments were carried out on both the in-house Hadoop cluster and the EC2 Cloud to evaluate the performance of the improved HP model. First, we evaluated the performance of the improved HP model on the in-house cluster and subsequently evaluated the performance of the model on the EC2 Cloud. For the in-house cluster, the experimental results obtained for both the WordCount and the Sort applications are shown in Fig.9 and Fig.10 respectively. From these two figures we can observe that the improved HP model outperforms the HP model in both applications. The overall accuracy of the improved HP model in job estimation is within 95% compared with the actual job execution times, whereas the overall accuracy of the HP model, which uses a simple linear regression, is less than 89%. It is worth noting that the HP model does not generate a straight line in performance as shown in [17]. This is because a varied number of reduce tasks was used in the tests, whereas the work presented in [17] used a constant number of reduce tasks.

Fig.9. The performance of the improved HP model in job estimation of running the WordCount application on the in-house cluster.

Fig.10. The performance of the improved HP model in job estimation of running the Sort application on the in-house cluster.

Next, we evaluated the performance of the improved HP model on the EC2 Cloud. The experimental results in running both applications are shown in Fig.11 and Fig.12 respectively. It can be observed that the improved HP model also performs better than the HP model. The overall accuracy of the improved HP model in job estimation is over 94% compared with the actual job execution times, whereas the overall accuracy of the HP model is less than 88%.
The HP model performs better on small datasets, but its accuracy level decreases to 76.5% when the dataset is large (e.g. 40GB). The reason is that the HP model employs a simple linear regression which cannot accurately estimate the execution times of the shuffle tasks and the reduce tasks, which are not linear to the size of an input dataset.
Fig.11. The performance of the improved HP model in job estimation of running the WordCount application on the EC2 Cloud.

Fig.12. The performance of the improved HP model in job estimation of running the Sort application on the EC2 Cloud.

Finally, we compared the performance of the improved HP model in job estimation with that of both Starfish and the HP model collectively. Fig.13 and Fig.14 show the comparison results of the three models running the two applications on the EC2 Cloud respectively. It can be observed that the improved HP model produces the best results in job estimation for both applications. Starfish performs better than the HP model on the Sort application in some cases as shown in Fig.14. However, Starfish overestimates the job execution times of the WordCount application as shown in Fig.13. This is mainly due to the high overhead of Starfish in collecting a large set of profile information of a running job. The Starfish profiler generates a high overhead for CPU intensive applications like WordCount because Starfish uses BTrace to collect job profiles, which requires additional CPU cycles [16]. Starfish performs better on the Sort application because Sort is less CPU-intensive than the WordCount application.

Fig.13. A performance comparison among the improved HP model, the HP model and Starfish in running the WordCount application on the EC2 Cloud.

Fig.14. A performance comparison among the improved HP model, the HP model and Starfish in running the Sort application on the EC2 Cloud.

We have validated the LWLR regression model in job execution estimation using the 10-fold cross validation technique. We considered the execution of an entire job with three phases (i.e. map phase, shuffle phase and reduce phase). The mean absolute percentage errors of the WordCount application and the Sort application are both under 3%, which shows the high generalizability of LWLR in job execution estimation. Furthermore, the R-squared values of the two applications are 0.9986 and 0.9979 respectively, which reflects the goodness of fit of LWLR.

F. Resource Provisioning

In this section, we present the evaluation results of the improved HP model in resource provisioning using the in-house Hadoop cluster. We considered 4 scenarios as shown in Table 6. The intention of varying the number of both map slots and reduce slots from 1 to 4 was twofold. One was to evaluate the impact of the available resources on the performance of the improved HP model in resource estimation. The other was to evaluate the performance of the Hadoop cluster in resource utilization with a varied number of map and reduce slots.

Table 6: Scenario configurations.
Scenario 1: 1 map slot and 1 reduce slot on each VM.
Scenario 2: 2 map slots and 2 reduce slots on each VM.
Scenario 3: 3 map slots and 3 reduce slots on each VM.
Scenario 4: 4 map slots and 4 reduce slots on each VM.

To compare the performance of the improved HP model with the HP model in resource estimation in the 4 scenarios, we employed the WordCount application as a Hadoop job processing a 9.4GB input dataset. In each scenario, we set 7 completion deadlines for the job, the smallest being 350 seconds and the others being 390, 450, 500, 590, 750 seconds and one larger deadline. We first built a job profile in each scenario.
We set a deadline for the job, and employed both the HP model and the improved HP model to estimate the amount of resources (i.e. the number of map slots and the number of reduce slots). We then assigned the estimated resources to the job using the in-house Hadoop cluster and measured the actual upper bound and lower bound execution durations. We took an average of the upper bound and the lower bound and compared it with the given deadline. It should be noted that for the resource provisioning experiments we configured 16 VMs to satisfy the requirements of a job. Therefore, we employed another Xeon server machine with the same specification as the first server, as shown in Table 3. We installed Oracle Virtual Box and configured 8 VMs on the second server. Fig.15 to Fig.18 show the results in resource provisioning of the 4 scenarios respectively.

Fig.15. Resource provisioning in Scenario 1.

Fig.16. Resource provisioning in Scenario 2.

Fig.17. Resource provisioning in Scenario 3.

Fig.18. Resource provisioning in Scenario 4.

It is worth noting that all the job deadlines are met in the 4 scenarios except the last job deadline in Scenario 4 where t=350. This could be caused by the communication overhead incurred among the VMs running across the two server machines. Although both the improved HP model and the HP model include communication overhead in resource provisioning when the training dataset was built, they only consider static communication overhead. It can be expected that the communication overhead varies from time to time due to the dynamic nature of a communication network.

For the 4 scenarios we can see that overall the improved HP model performs slightly better than the HP model in resource provisioning due to its higher accuracy in job execution estimation. Both models perform well in the first two scenarios, where they generate a near optimal performance. However, the two models over-provision resources in both Scenario 3 and Scenario 4, especially in the cases where the job deadlines are large. The reason is that when we built the training dataset for resource estimation, we ran all the VMs in the tests. One rationale was that we consider the worst cases in resource provisioning to make sure all the user job deadlines would be met. However, the overhead incurred in running all the VMs was high and was included in resource provisioning for all the jobs. As a result, for jobs with large deadlines, both models overestimate the overhead of the VMs involved. Therefore, both models over-provision the amounts of resources for jobs with large deadlines, which could be completed using a small number of VMs instead of all the VMs.

Table 7 summarizes the resources estimated by both the HP model and the improved HP model in the 4 scenarios. It can be observed that the HP model recommends more resources in terms of map slots, especially in Scenario 3. This is because the HP model largely considers the map slots in resource provisioning. As a result, the jobs following the HP model are completed quicker than the jobs following the improved HP model but with larger gaps from the given deadlines. Therefore, the improved HP model is more economical than the HP model in resource provisioning due to its recommendations of fewer map slots.

VI. RELATED WORK

Hadoop performance modeling is an emerging topic that deals with job optimization, scheduling, estimation and resource provisioning. Recently this topic has received a great attention from the research community and a number of models have been proposed.
Table 7: The amounts of resources estimated by the HP model and the improved HP model. For each of the 7 deadlines in each of the 4 scenarios, the table lists the number of map slots and the number of reduce slots recommended by the HP model and by the improved HP model.

Morton et al. proposed the parallax model [24] and later the ParaTimer model [25] that estimate the performance of Pig parallel queries, which can be translated into series of MapReduce jobs. They use debug runs of the same query on input data samples to predict the relative progress of the map and reduce phases. This work is based on the simplified supposition that the durations of the map tasks and the reduce tasks are the same for a MapReduce application. However, in reality, the durations of the map tasks and the reduce tasks cannot be the same because the durations of these tasks depend on a number of factors. More importantly, the durations of the reduce tasks in overlapping and non-overlapping stages are very different.

Ganapathi et al. [26] employed a multivariate Kernel Canonical Correlation Analysis (KCCA) regression technique to predict the performance of Hive queries. However, their intention was to show the applicability of the KCCA technique in the context of MapReduce. Kadirvel et al. [27] proposed Machine Learning (ML) techniques to predict the performance of Hadoop jobs. However, this work does not have a comprehensive mathematical model for job estimation.

Lin et al. [11] proposed a cost vector which contains the cost of disk I/O, network traffic, computational complexity, CPU and internal sort. The cost vector is used to estimate the execution durations of the map and reduce tasks. It is challenging to accurately estimate the cost of these factors in a situation where multiple tasks compete for resources. Furthermore, this work is only evaluated in estimating the execution times of the map tasks and no estimations on reduce tasks are presented. The later work [12] considers resource contention and task failure situations. A simulator is employed to evaluate the effectiveness of the model. However, simulator based approaches are potentially error-prone because it is challenging to design an accurate simulator that can comprehensively simulate the internal dynamics of complex MapReduce applications.

Jalaparti et al. [13] proposed a system called Bazaar that predicts Hadoop job performance and provisions resources in terms of VMs to satisfy user requirements. The work presented in [14] uses the Principal Component Analysis technique to optimize Hadoop jobs based on various configuration parameters. However, these models leave out both the overlapping and non-overlapping stages of the shuffle phase.

There is a body of work that focuses on optimal resource provisioning for Hadoop jobs. Tian et al. [28] proposed a cost model that estimates the performance of a job and provisions the resources for the job using a simple regression technique. Chen et al. [18] further improved the cost model and proposed CRESP, which employs a brute-force search technique for provisioning the optimal cluster resources in terms of map slots and reduce slots for Hadoop jobs.
The proposed cost model is able to predict the performance of a job and provision the resources needed. However, in the two models, the number of reduce tasks has to be equal to the number of reduce slots, which means that these two models only consider a single wave of the reduce phase. It is arguable that a Hadoop job performs better when multiple waves of the reduce phase are used in comparison with the use of a single wave, especially in situations where only a small amount of resources is available for processing a large dataset.

Lama et al. [29] proposed AROMA, a system that automatically provisions the optimal resources and optimizes the configuration parameters of Hadoop for a job to achieve the service level objectives. AROMA uses clustering techniques to group the jobs with similar behaviors. AROMA uses Support Vector Machine to predict the performance of a Hadoop job and uses a pattern search technique to find the optimal set of resources for a job to achieve the required deadline with a minimum cost. However, AROMA cannot predict the performance of a Hadoop job whose resource utilization pattern is different from any previous ones. More importantly, AROMA does not provide a comprehensive mathematical model to estimate a job execution time as well as the optimal configuration parameter values of Hadoop.
There are a few other sophisticated models such as [15][16][17][30] that are similar to the improved HP model in the sense that they use previously executed job profiles for performance prediction. Herodotou et al. proposed Starfish [15], which collects the profile information of past executed jobs at a fine granularity for job estimation and automatic optimization. On the top of Starfish, Herodotou et al. proposed Elasticiser [16], which provisions Hadoop cluster resources in terms of VMs. However, collecting detailed job profile information with a large set of metrics generates an extra overhead, especially for CPU-intensive applications. As a result, Starfish overestimates the execution time of a Hadoop job. Verma et al. [30] presented the ARIA model for job execution estimation and resource provisioning. The HP model [17] extends the ARIA model by adding scaling factors to estimate the job execution time on large datasets using a simple linear regression. The work presented in [31] divides the map phase and the reduce phase into six generic sub-phases (i.e. read, collect, spill, merge, shuffle and write), and uses a regression technique to estimate the durations of these sub-phases. The estimated values are then used in the analytical model presented in [30] to estimate the overall job execution time. In [32], Zhang et al. employed the bound-based approach [30] in heterogeneous Hadoop cluster environments. It should be pointed out that the aforementioned models are limited in the sense that they only consider a constant number of reduce tasks. As a result, the impact of the number of reduce tasks on the performance of a Hadoop job is ignored. The improved HP model considers a varied number of reduce tasks and employs a sophisticated LWLR technique to estimate the overall execution time of a Hadoop job.

VII. CONCLUSION

Running a MapReduce Hadoop job on a public cloud such as Amazon EC2 necessitates a performance model to estimate the job execution time and further to provision a certain amount of resources for the job to complete within a given deadline. This paper has presented an improved HP model to achieve this goal, taking into account multiple waves of the shuffle phase of a Hadoop job. The improved HP model was initially evaluated on an in-house Hadoop cluster and subsequently evaluated on the EC2 Cloud. The experimental results showed that the improved HP model outperforms both Starfish and the HP model in job execution estimation. Similar to the HP model, the improved HP model provisions resources for Hadoop jobs with deadline requirements. However, the improved HP model is more economical in resource provisioning than the HP model. Both models over-provision resources for user jobs with large deadlines in the cases where VMs are configured with a large number of both map slots and reduce slots. One future work would be to consider the dynamic overhead of the VMs involved in running the user jobs to minimize resource over-provisioning. Currently the improved HP model only considers individual Hadoop jobs without logical dependencies. Another future work will be to model multiple Hadoop jobs with execution conditions.

ACKNOWLEDGEMENT

This research is partially supported by the 973 project on Network Big Data Analytics funded by the Ministry of Science and Technology, China, No. 2014CB340404.
REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] R. Lämmel, "Google's MapReduce programming model - Revisited," Sci. Comput. Program., vol. 70, no. 1, pp. 1-30, 2008.
[3] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a MapReduce framework on graphics processors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques - PACT '08, 2008, p. 260.
[4] K. Taura, T. Endo, K. Kaneda, and A. Yonezawa, "Phoenix: a parallel programming model for accommodating dynamically joining/leaving resources," SIGPLAN Not., 2003, vol. 38, no. 10.
[5] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Oper. Syst. Rev., vol. 41, no. 3, pp. 59-72, Mar. 2007.
[6] Apache Hadoop. [Online]. Available: http://hadoop.apache.org/. [Accessed: Oct-2013].
[7] D. Jiang, B. C. Ooi, L. Shi, and S. Wu, "The Performance of MapReduce: An In-depth Study," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 472-483, Sep. 2010.
[8] U. Kang, C. E. Tsourakakis, and C. Faloutsos, "PEGASUS: Mining Peta-scale Graphs," Knowl. Inf. Syst., vol. 27, no. 2, pp. 303-325, May 2011.
[9] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo, "PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce," Proc. VLDB Endow., vol. 2, no. 2, pp. 1426-1437, Aug. 2009.
[10] A. Pavlo, E. Paulson, and A. Rasin, "A comparison of approaches to large-scale data analysis," in SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, 2009, pp. 165-178.
[11] X. Lin, Z. Meng, C. Xu, and M. Wang, "A Practical Performance Model for Hadoop MapReduce," in Cluster Computing Workshops (CLUSTER WORKSHOPS), 2012 IEEE International Conference on, 2012.
[12] X. Cui, X. Lin, C. Hu, R. Zhang, and C. Wang, "Modeling the Performance of MapReduce under Resource Contentions and Task Failures," in Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, 2013, vol. 1, pp. 158-163.
[13] V. Jalaparti, H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron, "Bazaar: Enabling Predictable Performance in Datacenters," Microsoft Research, MSR-TR-2012-38. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=69.
[14] H. Yang, Z. Luan, W. Li, D. Qian, and G. Guan, "Statistics-based Workload Modeling for MapReduce," in Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, 2012, pp. 2043-2051.
[15] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A Self-tuning System for Big Data Analytics," in CIDR, 2011, pp. 261-272.
[16] H. Herodotou, F. Dong, and S. Babu, "No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics," in Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11), 2011, pp. 1-14.
[17] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, 2011, pp. 165-186.
[18] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds," IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 6, pp. 1403-1412, 2014.
[19] H. Herodotou, "Hadoop Performance Models," 2011. [Online]. Available: http://www.cs.duke.edu/starfish/files/hadoop-models.pdf. [Accessed: Oct-2013].
[20] W. S. Cleveland and S. J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting," J. Am. Stat. Assoc., vol. 83, no. 403, pp. 596-610, 1988.
[21] M. Rallis and M. Vazirgiannis, "Rank Prediction in Graphs with Locally Weighted Polynomial Regression and EM of Polynomial Mixture Models," in Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, 2011.
[22] J. Fan and I. Gijbels, Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66. CRC Press, 1996.
[23] G. Arfken, H. Weber, and F. Harris, Mathematical Methods for Physicists, 6th ed. Orlando, FL: Academic Press, 2005, p. 1060.
[24] K. Morton, A. Friesen, M. Balazinska, and D. Grossman, "Estimating the progress of MapReduce pipelines," in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 2010, pp. 681-684.
[25] K. Morton, M. Balazinska, and D. Grossman, "ParaTimer: A Progress Indicator for MapReduce DAGs," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, pp. 507-518.
[26] A. Ganapathi, Y. Chen, A. Fox, R. Katz, and D. Patterson, "Statistics-driven workload modeling for the Cloud," in Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on, 2010, pp. 87-92.
[27] S. Kadirvel and J. A. B. Fortes, "Grey-Box Approach for Performance Prediction in Map-Reduce Based Platforms," in Computer Communications and Networks (ICCCN), 2012 21st International Conference on, 2012, pp. 1-9.
[28] F. Tian and K. Chen, "Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds," in 2011 IEEE 4th International Conference on Cloud Computing, 2011, pp. 155-162.
[29] P. Lama and X. Zhou, "AROMA: Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud," in Proceedings of the 9th International Conference on Autonomic Computing, 2012, pp. 63-72.
[30] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: automatic resource inference and allocation for MapReduce environments," in 8th ACM International Conference on Autonomic Computing, 2011, pp. 235-244.
[31] Z. Zhang, L. Cherkasova, and B. T. Loo, "Benchmarking Approach for Designing a MapReduce Performance Model," in Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, 2013, pp. 253-258.
[32] Z. Zhang, L. Cherkasova, and B. T. Loo, "Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments," in Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing, 2013, pp. 839-846.
Mukhtaj Khan received his MSc in Mobile Computer Systems from Staffordshire University, UK in 2006. He is currently a PhD student in the School of Engineering and Design at Brunel University, UK. The PhD study is sponsored by Abdul Wali Khan University Mardan, Pakistan. His research interests are focused on high performance computing for big data analysis.

Yong Jin received the PhD from North University of China in 2013. He is an Associate Professor in the School of Information and Communication Engineering at North University of China. He is also a Visiting Professor in the School of Engineering and Design at Brunel University, UK. His research interests are in the areas of image processing, online inspections and big data analytics.

Maozhen Li is currently a Professor in the Department of Electronic and Computer Engineering at Brunel University London, UK. He received the PhD from the Institute of Software, Chinese Academy of Sciences in 1997. He was a Post-Doctoral Research Fellow in the School of Computer Science and Informatics, Cardiff University, UK in 1999-2002. His research interests are in the areas of high performance computing (grid and cloud computing), big data analytics and intelligent systems. He is on the Editorial Boards of the Computing and Informatics journal and the Journal of Cloud Computing: Advances, Systems and Applications. He has over 100 research publications in these areas. He is a Fellow of the British Computer Society.

Yang Xiang received the PhD degree from Harbin Institute of Technology, China in 1999. He completed his Post-Doctoral research at Dalian University of Technology, China in 2003. He is now a Professor in the Department of Computer Science and Technology, Tongji University, Shanghai, China. His research interests are in the areas of machine learning, semantic web, and big data analytics.

Changjun Jiang received the PhD degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1995 and conducted postdoctoral research at the Institute of Computing Technology, Chinese Academy of Sciences, in 1997. Currently, he is a professor with the Department of Computer Science and Engineering, Tongji University, Shanghai. He is also a council member of the China Automation Federation and the Artificial Intelligence Federation, the director of the Professional Committee of Petri Nets of the China Computer Federation, and the vice director of the Professional Committee of Management Systems of the China Automation Federation. He was a visiting professor at the Institute of Computing Technology, Chinese Academy of Sciences; a research fellow at the City University of Hong Kong, Kowloon, Hong Kong; and an information area specialist of the Shanghai Municipal Government. His current areas of research are concurrency theory, Petri nets, formal verification of software, concurrency processing and intelligent transportation systems. He is a senior member of the IEEE.